用ChatGPT解读非结构化数据【ChatGPT + SQL】

您所在的位置：网站首页 › sqlserver数据库架构 › 用ChatGPT解读非结构化数据【ChatGPT + SQL】

用ChatGPT解读非结构化数据【ChatGPT + SQL】

2023-03-23 05:00| 来源: 网络整理| 查看: 265

许多现代数据系统都依赖于结构化数据，例如 Postgres DB 或 Snowflake 数据仓库。 LlamaIndex 提供了许多由 LLM 提供支持的高级功能，既可以从非结构化数据创建结构化数据，也可以通过增强的文本到 SQL 功能分析这些结构化数据。

本指南有助于逐步了解这些功能中的每一项。具体来说，我们涵盖以下主题：

推断结构化数据点：将非结构化数据转换为结构化数据。Text-to-SQL（基础）：如何使用自然语言查询一组表。注入上下文：如何将每个表的上下文注入到 text-to-SQL 提示中。上下文可以手动添加，也可以从非结构化文档中派生。 = 在索引中存储表上下文：默认情况下，我们直接将上下文插入到提示中。如果上下文很大，有时这是不可行的。在这里，我们展示了如何实际使用 LlamaIndex 数据结构来包含表上下文！

我们将浏览一个包含城市/人口/国家信息的示例数据库。

在这里插入图片描述

推荐：用 NSDT场景设计器快速搭建3D场景。

1、设置

首先，我们使用 SQLAlchemy 来设置一个简单的 sqlite 数据库：

from sqlalchemy import create_engine, MetaData, Table, Column, String, Integer, select, column engine = create_engine("sqlite:///:memory:") metadata_obj = MetaData(bind=engine)

然后我们创建一个 city_stats 表：

# create city SQL table table_name = "city_stats" city_stats_table = Table( table_name, metadata_obj, Column("city_name", String(16), primary_key=True), Column("population", Integer), Column("country", String(16), nullable=False), ) metadata_obj.create_all()

现在是时候插入一些数据点了！

如果你希望通过从非结构化数据推断结构化数据点来研究填充此表，请查看以下部分。否则，可以选择直接填充此表：

from sqlalchemy import insert rows = [ {"city_name": "Toronto", "population": 2731571, "country": "Canada"}, {"city_name": "Tokyo", "population": 13929286, "country": "Japan"}, {"city_name": "Berlin", "population": 600000, "country": "United States"}, ] for row in rows: stmt = insert(city_stats_table).values(**row) with engine.connect() as connection: cursor = connection.execute(stmt)

最后，我们可以用我们的 SQLDatabase 包装器包装 SQLAlchemy 引擎；这允许在 LlamaIndex 中使用数据库：

from llama_index import SQLDatabase sql_database = SQLDatabase(engine, include_tables=["city_stats"])

如果数据库中已经填充了数据，我们可以使用空白文档列表实例化 SQL 索引。否则请参阅以下部分。

index = GPTSQLStructStoreIndex( [], sql_database=sql_database, table_name="city_stats", ) 2、推断结构化数据点

LlamaIndex 提供将非结构化数据点转换为结构化数据的功能。在本节中，我们将展示如何通过提取有关每个城市的维基百科文章来填充 city_stats 表。

首先，我们使用 LlamaHub 的维基百科阅读器加载一些有关相关数据的页面。

from llama_index import download_loader WikipediaReader = download_loader("WikipediaReader") wiki_docs = WikipediaReader().load_data(pages=['Toronto', 'Berlin', 'Tokyo'])

当我们建立SQL索引时，我们可以指定这些文档作为第一个输入；这些文档将被转换为结构化数据点并插入到数据库中：

from llama_index import GPTSQLStructStoreIndex, SQLDatabase sql_database = SQLDatabase(engine, include_tables=["city_stats"]) # NOTE: the table_name specified here is the table that you # want to extract into from unstructured documents. index = GPTSQLStructStoreIndex( wiki_docs, sql_database=sql_database, table_name="city_stats", )

你可以查看当前表以验证是否已插入数据点！

# view current table stmt = select( [column("city_name"), column("population"), column("country")] ).select_from(city_stats_table) with engine.connect() as connection: results = connection.execute(stmt).fetchall() print(results) 3、文本到 SQL（基本）

LlamaIndex 提供“文本到 SQL”功能，既有最基本的水平，也有更高级的水平。在本节中，我们将展示如何在基本级别上使用这些文本到 SQL 的功能。

此处显示了一个简单示例：

# set Logging to DEBUG for more detailed outputs response = index.query("Which city has the highest population?", mode="default") print(response)

你可以通过 response.extra_info[‘sql_query’] 访问底层派生的 SQL 查询。它应该看起来像这样：

SELECT city_name, population FROM city_stats ORDER BY population DESC LIMIT 1 4、注入上下文

默认情况下，text-to-SQL 提示只是将表架构信息注入到提示中。但是，通常你可能还想添加自己的上下文。本节向你展示如何添加上下文，可以手动添加，也可以通过文档提取。

我们为你提供上下文构建器类以更好地管理 SQL 表中的上下文：SQLContextContainerBuilder。这个类接受 SQLDatabase 对象和一些其他可选参数，并构建一个 SQLContextContainer 对象，然后你可以在构造 + 查询时将其传递给索引。

可以手动将上下文添加到上下文构建器。下面的代码片段展示了如何实现：

# manually set text city_stats_text = ( "This table gives information regarding the population and country of a given city.\n" "The user will query with codewords, where 'foo' corresponds to population and 'bar'" "corresponds to city." ) table_context_dict={"city_stats": city_stats_text} context_builder = SQLContextContainerBuilder(sql_database, context_dict=table_context_dict) context_container = context_builder.build_context_container() # building the index index = GPTSQLStructStoreIndex( wiki_docs, sql_database=sql_database, table_name="city_stats", sql_context_container=context_container )

你还可以选择从一组非结构化文档中提取上下文。为此，可以调用 SQLContextContainerBuilder.from_documents。我们使用 TableContextPrompt 和 RefineTableContextPrompt（请参阅参考文档）。

# this is a dummy document that we will extract context from # in GPTSQLContextContainerBuilder city_stats_text = ( "This table gives information regarding the population and country of a given city.\n" ) context_documents_dict = {"city_stats": [Document(city_stats_text)]} context_builder = SQLContextContainerBuilder.from_documents( context_documents_dict, sql_database ) context_container = context_builder.build_context_container() # building the index index = GPTSQLStructStoreIndex( wiki_docs, sql_database=sql_database, table_name="city_stats", sql_context_container=context_container, ) 5、在索引中存储表上下文

一个数据库集合可以有很多表，如果每个表有很多列+与之相关的描述，那么整个上下文可能会非常大。

幸运的是，可以选择使用 LlamaIndex 数据结构来存储此表上下文！然后，当查询 SQL 索引时，我们可以使用这个“边”索引来检索可以输入到文本到 SQL 提示中的正确上下文。

这里我们使用 SQLContextContainerBuilder 中的 derive_index_from_context 函数来创建一个新索引。你可以灵活地选择要指定的索引类+要传入的参数。然后我们使用一个名为 query_index_for_context 的辅助方法，它是 index.query 调用的简单包装器，它包装了一个查询模板+将上下文存储在生成的上下文容器中 .

然后你可以构建上下文容器，并在查询期间将其传递给索引！

from gpt_index import GPTSQLStructStoreIndex, SQLDatabase, GPTSimpleVectorIndex from gpt_index.indices.struct_store import SQLContextContainerBuilder sql_database = SQLDatabase(engine) # build a vector index from the table schema information context_builder = SQLContextContainerBuilder(sql_database) table_schema_index = context_builder.derive_index_from_context( GPTSimpleVectorIndex, store_index=True ) query_str = "Which city has the highest population?" # query the table schema index using the helper method # to retrieve table context SQLContextContainerBuilder.query_index_for_context( table_schema_index, query_str, store_context_str=True ) context_container = context_builder.build_context_container() # query the SQL index with the table context response = index.query(query_str, sql_context_container=context_container) print(response)

原文链接：ChatGPT+SQL — BimANt

【本文地址】

用ChatGPT解读非结构化数据【ChatGPT + SQL】

用ChatGPT解读非结构化数据【ChatGPT + SQL】

今日新闻

推荐新闻