Azure OpenAI Service embeddings tutorial


Tutorial: Explore Azure OpenAI Service embeddings and document search

Article · 12/06/2023

This tutorial walks you through using the Azure OpenAI embeddings API to perform document search, where you query a knowledge base to find the most relevant document.

In this tutorial, you learn how to:

- Install Azure OpenAI.
- Download a sample dataset and prepare it for analysis.
- Create environment variables for your resource's endpoint and API key.
- Use the text-embedding-ada-002 (Version 2) model.
- Use cosine similarity to rank search results.

Important

We strongly recommend using text-embedding-ada-002 (Version 2). This model/version provides parity with OpenAI's text-embedding-ada-002. To learn more about the improvements offered by this model, refer to OpenAI's blog post. Even if you are currently using Version 1, you should migrate to Version 2 to take advantage of the latest weights and the updated token limit. Version 1 and Version 2 are not interchangeable, so document embedding and document search must be done using the same version of the model.

Prerequisites

- An Azure subscription. Create one for free.
- Access granted to Azure OpenAI in the desired Azure subscription. Currently, access to this service is granted only by application. You can apply for access to Azure OpenAI by completing the form at https://aka.ms/oai/access. If you run into a problem, open an issue on this repo to contact us.
- An Azure OpenAI resource with the text-embedding-ada-002 (Version 2) model deployed. This model is currently only available in certain regions. If you don't have a resource, the process of creating one is documented in our resource deployment guide.
- Python 3.7.1 or a later version.
- The following Python libraries: openai, num2words, matplotlib, plotly, scipy, scikit-learn, pandas, tiktoken.
- Jupyter Notebooks.

Set up Python libraries

If you haven't already, you need to install the following libraries:

OpenAI Python 0.28.1:

```console
pip install "openai==0.28.1" num2words matplotlib plotly scipy scikit-learn pandas tiktoken
```

OpenAI Python 1.x:

```console
pip install openai num2words matplotlib plotly scipy scikit-learn pandas tiktoken
```

Download the BillSum dataset

BillSum is a dataset of United States Congressional and California state bills. For illustration purposes, we'll look only at the US bills. The corpus consists of bills from the 103rd-115th (1993-2018) sessions of Congress. The data was split into 18,949 train bills and 3,269 test bills. The BillSum corpus focuses on mid-length legislation from 5,000 to 20,000 characters in length. More information on the project and the original academic paper from which this dataset is derived can be found in the BillSum project's GitHub repository.

This tutorial uses the bill_sum_data.csv file that can be downloaded from our GitHub sample data.

You can also download the sample data by running the following command on your local machine:

curl "https://raw.githubusercontent.com/Azure-Samples/Azure-OpenAI-Docs-Samples/main/Samples/Tutorials/Embeddings/data/bill_sum_data.csv" --output bill_sum_data.csv Retrieve key and endpoint

To successfully make a call against Azure OpenAI, you need an endpoint and a key.

| Variable name | Value |
|---|---|
| ENDPOINT | This value can be found in the Keys & Endpoint section when examining your resource from the Azure portal. Alternatively, you can find the value in Azure OpenAI Studio > Playground > Code View. An example endpoint is: https://docs-test-001.openai.azure.com/. |
| API-KEY | This value can be found in the Keys & Endpoint section when examining your resource from the Azure portal. You can use either KEY1 or KEY2. |

Go to your resource in the Azure portal. The Endpoint and Keys can be found in the Resource Management section. Copy your endpoint and access key as you'll need both for authenticating your API calls. You can use either KEY1 or KEY2. Always having two keys allows you to securely rotate and regenerate keys without causing a service disruption.

Environment variables

Command Line:

```cmd
setx AZURE_OPENAI_API_KEY "REPLACE_WITH_YOUR_KEY_VALUE_HERE"
setx AZURE_OPENAI_ENDPOINT "REPLACE_WITH_YOUR_ENDPOINT_HERE"
```

PowerShell:

```powershell
[System.Environment]::SetEnvironmentVariable('AZURE_OPENAI_API_KEY', 'REPLACE_WITH_YOUR_KEY_VALUE_HERE', 'User')
[System.Environment]::SetEnvironmentVariable('AZURE_OPENAI_ENDPOINT', 'REPLACE_WITH_YOUR_ENDPOINT_HERE', 'User')
```

Bash:

```bash
echo export AZURE_OPENAI_API_KEY="REPLACE_WITH_YOUR_KEY_VALUE_HERE" >> /etc/environment
echo export AZURE_OPENAI_ENDPOINT="REPLACE_WITH_YOUR_ENDPOINT_HERE" >> /etc/environment
source /etc/environment
```

After setting the environment variables, you might need to close and reopen Jupyter Notebooks, or whichever IDE you're using, for the environment variables to become accessible. While we strongly recommend using Jupyter Notebooks, if for some reason you can't, you'll need to modify any code that returns a pandas DataFrame by using print(dataframe_name) rather than calling dataframe_name directly, as is often done at the end of a code block.

Run the following code in your preferred Python IDE:

Import libraries

OpenAI Python 0.28.1:

```python
import openai
import os
import re
import requests
import sys
from num2words import num2words
import pandas as pd
import numpy as np
from openai.embeddings_utils import get_embedding, cosine_similarity
import tiktoken

API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
RESOURCE_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")

openai.api_type = "azure"
openai.api_key = API_KEY
openai.api_base = RESOURCE_ENDPOINT
openai.api_version = "2022-12-01"

url = openai.api_base + "/openai/deployments?api-version=2022-12-01"

r = requests.get(url, headers={"api-key": API_KEY})

print(r.text)
```

```json
{
  "data": [
    {
      "scale_settings": { "scale_type": "standard" },
      "model": "text-embedding-ada-002",
      "owner": "organization-owner",
      "id": "text-embedding-ada-002",
      "status": "succeeded",
      "created_at": 1657572678,
      "updated_at": 1657572678,
      "object": "deployment"
    },
    {
      "scale_settings": { "scale_type": "standard" },
      "model": "code-cushman-001",
      "owner": "organization-owner",
      "id": "code-cushman-001",
      "status": "succeeded",
      "created_at": 1657572712,
      "updated_at": 1657572712,
      "object": "deployment"
    },
    {
      "scale_settings": { "scale_type": "standard" },
      "model": "text-search-curie-doc-001",
      "owner": "organization-owner",
      "id": "text-search-curie-doc-001",
      "status": "succeeded",
      "created_at": 1668620345,
      "updated_at": 1668620345,
      "object": "deployment"
    },
    {
      "scale_settings": { "scale_type": "standard" },
      "model": "text-search-curie-query-001",
      "owner": "organization-owner",
      "id": "text-search-curie-query-001",
      "status": "succeeded",
      "created_at": 1669048765,
      "updated_at": 1669048765,
      "object": "deployment"
    }
  ],
  "object": "list"
}
```

The output of this command will vary based on the number and type of models you've deployed. In this case, we need to confirm that we have an entry for text-embedding-ada-002. If you find that you're missing this model, you'll need to deploy the model to your resource before proceeding.
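If you'd rather check programmatically than read the raw JSON, the following is a minimal sketch (not part of the original article) that lists the deployed model IDs and warns if the embeddings model is missing. It assumes the AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT environment variables created earlier in this tutorial.

```python
import os
import requests

# Assumes the environment variables created earlier in this tutorial.
api_key = os.getenv("AZURE_OPENAI_API_KEY")
endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")

# List the deployments on the resource (same REST call used above).
resp = requests.get(
    endpoint + "/openai/deployments?api-version=2022-12-01",
    headers={"api-key": api_key},
)
resp.raise_for_status()

deployed_models = {d["model"] for d in resp.json().get("data", [])}
print("Deployed models:", ", ".join(sorted(deployed_models)))

if "text-embedding-ada-002" not in deployed_models:
    print("text-embedding-ada-002 is not deployed. Deploy it to your resource before proceeding.")
```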

OpenAI Python 1.x:

```python
import os
import re
import requests
import sys
from num2words import num2words
import pandas as pd
import numpy as np
import tiktoken
from openai import AzureOpenAI
```

Now we need to read our csv file and create a pandas DataFrame. After the initial DataFrame is created, we can view the contents of the table by running df.

```python
df = pd.read_csv(os.path.join(os.getcwd(), 'bill_sum_data.csv'))  # This assumes that you have placed the bill_sum_data.csv in the same directory you are running Jupyter Notebooks
df
```

Output:

The initial table has more columns than we need, so we'll create a new, smaller DataFrame called df_bills that contains only the columns for text, summary, and title.

```python
df_bills = df[['text', 'summary', 'title']]
df_bills
```

Output:

Next we'll perform some light data cleaning by removing redundant whitespace and cleaning up the punctuation to prepare the data for tokenization.

```python
pd.options.mode.chained_assignment = None  # https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#evaluation-order-matters

# s is input text
def normalize_text(s, sep_token=" \n "):
    s = re.sub(r'\s+', ' ', s).strip()
    s = re.sub(r". ,", "", s)
    # remove all instances of multiple spaces
    s = s.replace("..", ".")
    s = s.replace(". .", ".")
    s = s.replace("\n", "")
    s = s.strip()

    return s

df_bills['text'] = df_bills["text"].apply(lambda x: normalize_text(x))
```

Now we need to remove any bills that are too long for the token limit (8192 tokens).

```python
tokenizer = tiktoken.get_encoding("cl100k_base")
df_bills['n_tokens'] = df_bills["text"].apply(lambda x: len(tokenizer.encode(x)))
df_bills = df_bills[df_bills.n_tokens < 8192]
```
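To complete the document-search flow outlined at the start of this tutorial, each bill needs an embedding, and the query is then ranked against those embeddings with cosine similarity. The following is a minimal sketch, not the article's verbatim continuation: it assumes the get_embedding and cosine_similarity helpers imported earlier from openai.embeddings_utils (OpenAI Python 0.28.1) and an embeddings deployment named text-embedding-ada-002.

```python
# Minimal sketch (OpenAI Python 0.28.1). Assumes get_embedding and cosine_similarity
# were imported from openai.embeddings_utils, and that your embeddings deployment
# is named "text-embedding-ada-002" (replace with your own deployment name).

# Generate an embedding vector for every bill.
df_bills['ada_v2'] = df_bills["text"].apply(
    lambda x: get_embedding(x, engine="text-embedding-ada-002")
)

def search_docs(df, user_query, top_n=4):
    """Rank rows of df by cosine similarity between their embeddings and the query."""
    query_embedding = get_embedding(user_query, engine="text-embedding-ada-002")
    ranked = df.copy()
    ranked["similarities"] = ranked["ada_v2"].apply(
        lambda v: cosine_similarity(v, query_embedding)
    )
    return ranked.sort_values("similarities", ascending=False).head(top_n)

# Example query against the knowledge base.
res = search_docs(df_bills, "Can I get information on cable company tax revenue?", top_n=4)
print(res[["title", "similarities"]])
```

The top-ranked rows are the documents most semantically related to the query, which is the same ranking approach the PowerShell walkthrough below implements with Get-CosineSimilarity.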


The remainder of this tutorial demonstrates the same embeddings and document-search workflow using PowerShell. Create and assign persistent environment variables for your key, endpoint, and embeddings deployment name.

Environment variables

Command Line:

```cmd
setx AZURE_OPENAI_API_KEY "REPLACE_WITH_YOUR_KEY_VALUE_HERE"
setx AZURE_OPENAI_ENDPOINT "REPLACE_WITH_YOUR_ENDPOINT_HERE"
```

PowerShell:

```powershell
$Env:AZURE_OPENAI_KEY = 'REPLACE_WITH_YOUR_KEY_VALUE_HERE'
$Env:AZURE_OPENAI_ENDPOINT = 'REPLACE_WITH_YOUR_ENDPOINT_HERE'
$Env:AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT = 'REPLACE_WITH_YOUR_DEPLOYMENT_NAME_HERE'
```

Bash:

```bash
echo export AZURE_OPENAI_KEY="REPLACE_WITH_YOUR_KEY_VALUE_HERE" >> /etc/environment
echo export AZURE_OPENAI_ENDPOINT="REPLACE_WITH_YOUR_ENDPOINT_HERE" >> /etc/environment
echo export AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT="REPLACE_WITH_YOUR_DEPLOYMENT_NAME_HERE" >> /etc/environment
source /etc/environment
```

For this tutorial, we use the PowerShell 7.4 reference documentation as a well-known and safe sample dataset. As an alternative, you might choose to explore the Microsoft Research tools sample datasets.

Create a folder where you would like to store your project. Set your location to the project folder. Download the dataset to your local machine using the Invoke-WebRequest command and then expand the archive. Last, set your location to the subfolder containing reference information for PowerShell version 7.4.

```powershell
# replace <your-project-folder> with the path where you want to store the project
New-Item '<your-project-folder>' -Type Directory
Set-Location '<your-project-folder>'

$DocsUri = 'https://github.com/MicrosoftDocs/PowerShell-Docs/archive/refs/heads/main.zip'
Invoke-WebRequest $DocsUri -OutFile './PSDocs.zip'

Expand-Archive './PSDocs.zip'
Set-Location './PSDocs/PowerShell-Docs-main/reference/7.4/'
```

We're working with a large amount of data in this tutorial, so we use a .NET data table object for efficient performance. The datatable has columns title, content, prep, uri, file, and vectors. The title column is the primary key.

In the next step, we load the content of each markdown file into the data table. We also use the PowerShell -match operator to capture known lines of text, title: and online version:, and store them in distinct columns. Some of the files don't contain the metadata lines of text, but since they're overview pages and not detailed reference docs, we exclude them from the datatable.

```powershell
# make sure your location is the project subfolder

$DataTable = New-Object System.Data.DataTable

'title', 'content', 'prep', 'uri', 'file', 'vectors' | ForEach-Object {
    $DataTable.Columns.Add($_)
} | Out-Null

$DataTable.PrimaryKey = $DataTable.Columns['title']

$md = Get-ChildItem -Path . -Include *.md -Recurse

$md | ForEach-Object {
    $file    = $_.FullName
    $content = Get-Content $file
    $title   = $content | Where-Object { $_ -match 'title: ' }
    $uri     = $content | Where-Object { $_ -match 'online version: ' }

    if ($title -and $uri) {
        $row         = $DataTable.NewRow()
        $row.title   = $title.ToString().Replace('title: ', '')
        $row.content = $content | Out-String
        $row.prep    = ''    # use later in the tutorial
        $row.uri     = $uri.ToString().Replace('online version: ', '')
        $row.file    = $file
        $row.vectors = ''    # use later in the tutorial
        $Datatable.rows.add($row)
    }
}
```

View the data using the Out-GridView command (not available in Cloud Shell).

```powershell
$Datatable | Out-GridView
```

Output:

Next, perform some light data cleaning by removing extra characters, empty space, and other document notations to prepare the data for tokenization. The sample function Invoke-DocPrep demonstrates how to use the PowerShell -replace operator to iterate through a list of characters you would like to remove from the content.

```powershell
# sample demonstrates how to use `-replace` to remove characters from text content
# NOTE: several literal tag strings in this list were lost when this page was extracted;
# the HTML entries below are representative reconstructions.
function Invoke-DocPrep {
param(
    [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
    [string]$content
)
    # tab, line breaks, empty space
    $replace = @('\t','\r\n','\n','\r')
    # non-UTF8 characters
    $replace += @('[^\x00-\x7F]')
    # html
    $replace += @('<table>','</table>','<tr>','</tr>','<td>','</td>')
    $replace += @('<ul>','</ul>','<li>','</li>')
    $replace += @('<p>','</p>','<br>')
    # docs
    $replace += @('\*\*IMPORTANT:\*\*','\*\*NOTE:\*\*')
    $replace += @('<!--','-->','---','--',':::')
    # markdown
    $replace += @('###','##','#','```')

    $replace | ForEach-Object {
        $content = $content -replace $_, ' ' -replace '  ', ' '
    }
    return $content
}
```

After you create the Invoke-DocPrep function, use the ForEach-Object command to store prepared content in the prep column, for all rows in the datatable. We're using a new column so the original formatting is available if we would like to retrieve it later.

```powershell
$Datatable.rows | ForEach-Object { $_.prep = Invoke-DocPrep $_.content }
```

View the datatable again to see the change.

```powershell
$Datatable | Out-GridView
```

When we pass the documents to the embeddings model, it encodes the documents into tokens and then returns a series of floating point numbers to use in a cosine similarity search. These embeddings can be stored locally or in a service such as Vector Search in Azure AI Search. Each document has its own corresponding embedding vector in the new vectors column.

The next example loops through each row in the datatable, retrieves the vectors for the preprocessed content, and stores them to the vectors column. The OpenAI service throttles frequent requests, so the example includes an exponential back-off as suggested by the documentation.

After the script completes, each row should have a comma-delimited list of 1,536 floating point values (the embedding vector) for each document. If an error occurs and the status code is 400, the file path, title, and error code are added to a variable named $docErrors for troubleshooting. The most common error occurs when the token count is more than the prompt limit for the model.

```powershell
# Azure OpenAI metadata variables
$openai = @{
    api_key     = $Env:AZURE_OPENAI_KEY
    api_base    = $Env:AZURE_OPENAI_ENDPOINT # should look like 'https://<YOUR_RESOURCE_NAME>.openai.azure.com/'
    api_version = '2023-05-15' # may change in the future
    name        = $Env:AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT # custom name you chose for your deployment
}

$headers = [ordered]@{
    'api-key' = $openai.api_key
}

$url = "$($openai.api_base)/openai/deployments/$($openai.name)/embeddings?api-version=$($openai.api_version)"

# collect per-document errors across the whole run
$docErrors = @()

$Datatable | ForEach-Object {
    $doc = $_

    $body = [ordered]@{
        input = $doc.prep
    } | ConvertTo-Json

    $retryCount = 0
    $maxRetries = 10
    $delay      = 1

    do {
        try {
            $params = @{
                Uri         = $url
                Headers     = $headers
                Body        = $body
                Method      = 'Post'
                ContentType = 'application/json'
            }
            $response = Invoke-RestMethod @params
            $Datatable.rows.find($doc.title).vectors = $response.data.embedding -join ','
            break
        } catch {
            if ($_.Exception.Response.StatusCode -eq 429) {
                $retryCount++
                [int]$retryAfter = $_.Exception.Response.Headers |
                    Where-Object key -eq 'Retry-After' |
                    Select-Object -ExpandProperty Value

                # Use delay from error header
                if ($delay -lt $retryAfter) { $delay = $retryAfter++ }
                Start-Sleep -Seconds $delay

                # Exponential back-off
                $delay = [math]::min($delay * 1.5, 300)
            } elseif ($_.Exception.Response.StatusCode -eq 400) {
                if ($docErrors.file -notcontains $doc.file) {
                    $docErrors += [ordered]@{
                        error = $_.exception.ErrorDetails.Message | ForEach-Object error | ForEach-Object message
                        file  = $doc.file
                        title = $doc.title
                    }
                }
            } else {
                throw
            }
        }
    } while ($retryCount -lt $maxRetries)
}

if (0 -lt $docErrors.count) {
    Write-Host "$($docErrors.count) documents encountered known errors such as too many tokens.`nReview the `$docErrors variable for details."
}
```

You now have a local in-memory database table of PowerShell 7.4 reference docs.

Based on a search string, we need to calculate another set of vectors so PowerShell can rank each document by similarity.

In the next example, vectors are retrieved for the search string get a list of running processes.

$searchText = "get a list of running processes" $body = [ordered]@{ input = $searchText } | ConvertTo-Json $url = "$($openai.api_base)/openai/deployments/$($openai.name)/embeddings?api-version=$($openai.api_version)" $params = @{ Uri = $url Headers = $headers Body = $body Method = 'Post' ContentType = 'application/json' } $response = Invoke-RestMethod @params $searchVectors = $response.data.embedding -join ','

Finally, the next sample function, adapted from the Measure-VectorSimilarity script written by Lee Holmes, performs a cosine similarity calculation and then ranks each row in the datatable.
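For reference, the cosine similarity of two vectors v1 and v2 is (v1 · v2) / (‖v1‖ ‖v2‖), that is, the dot product divided by the product of the two magnitudes; the function below computes exactly this and rounds the result to three decimal places.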

```powershell
# Sample function to calculate cosine similarity
function Get-CosineSimilarity ([float[]]$vector1, [float[]]$vector2) {
    $dot  = 0
    $mag1 = 0
    $mag2 = 0

    $allkeys = 0..($vector1.Length - 1)

    foreach ($key in $allkeys) {
        $dot  += $vector1[$key] * $vector2[$key]
        $mag1 += ($vector1[$key] * $vector1[$key])
        $mag2 += ($vector2[$key] * $vector2[$key])
    }

    $mag1 = [Math]::Sqrt($mag1)
    $mag2 = [Math]::Sqrt($mag2)

    return [Math]::Round($dot / ($mag1 * $mag2), 3)
}
```

The commands in the next example loop through all rows in $Datatable and calculate the cosine similarity to the search string. The results are sorted and the top three results are stored in a variable named $topThree. The example does not return output.

```powershell
# Calculate cosine similarity for each row and select the top 3
$topThree = $Datatable | ForEach-Object {
    [PSCustomObject]@{
        title      = $_.title
        similarity = Get-CosineSimilarity $_.vectors.split(',') $searchVectors.split(',')
    }
} | Sort-Object -Property similarity -Descending |
    Select-Object -First 3 |
    ForEach-Object {
        $title = $_.title
        $Datatable | Where-Object { $_.title -eq $title }
    }
```

Review the output of the $topThree variable, with only the title and uri properties, in grid view.

```powershell
$topThree | Select "title", "uri" | Out-GridView
```

Output:

The $topThree variable contains all the information from the rows in the datatable. For example, the content property contains the original document format. Use [0] to index into the first item in the array.

```powershell
$topThree[0].content
```

View the full document (truncated in the output snippet for this page).

```output
---
external help file: Microsoft.PowerShell.Commands.Management.dll-Help.xml
Locale: en-US
Module Name: Microsoft.PowerShell.Management
ms.date: 07/03/2023
online version: https://learn.microsoft.com/powershell/module/microsoft.powershell.management/get-process?view=powershell-7.4&WT.mc_id=ps-gethelp
schema: 2.0.0
title: Get-Process
---

# Get-Process

## SYNOPSIS
Gets the processes that are running on the local computer.

## SYNTAX

### Name (Default)

Get-Process [[-Name] ] [-Module] [-FileVersionInfo] []

# truncated example
```

Finally, rather than regenerate the embeddings every time you need to query the dataset, you can store the data to disk and recall it in the future. The WriteXml() and ReadXml() methods of the DataTable object type in the next example simplify the process. The schema of the XML file requires the datatable to have a TableName.

Replace <YOUR-FULL-FILE-PATH> with the full path where you would like to write and read the XML file. The path should end with .xml.

```powershell
# Set DataTable name
$Datatable.TableName = "MyDataTable"

# Writing DataTable to XML
$Datatable.WriteXml("<YOUR-FULL-FILE-PATH>", [System.Data.XmlWriteMode]::WriteSchema)

# Reading XML back to DataTable
$newDatatable = New-Object System.Data.DataTable
$newDatatable.ReadXml("<YOUR-FULL-FILE-PATH>")
```

As you reuse the data, you need to get the vectors of each new search string (but not the entire datatable). As a learning exercise, try creating a PowerShell script to automate the Invoke-RestMethod command with the search string as a parameter.
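As one possible starting point for that exercise, here's a minimal sketch. The function name Get-SearchVectors is illustrative (not part of the original tutorial), and it assumes the $openai hashtable and $headers created earlier are still in scope.

```powershell
# Illustrative helper: re-vectorize any new search string.
# Assumes $openai and $headers are defined as shown earlier in this tutorial.
function Get-SearchVectors {
    param(
        [Parameter(Mandatory = $true)]
        [string]$SearchText
    )

    $body = [ordered]@{ input = $SearchText } | ConvertTo-Json
    $url  = "$($openai.api_base)/openai/deployments/$($openai.name)/embeddings?api-version=$($openai.api_version)"

    $params = @{
        Uri         = $url
        Headers     = $headers
        Body        = $body
        Method      = 'Post'
        ContentType = 'application/json'
    }

    $response = Invoke-RestMethod @params
    return $response.data.embedding -join ','
}

# Example usage with a new search string.
$searchVectors = Get-SearchVectors -SearchText 'stop a service on a remote computer'
```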

Using this approach, you can use embeddings as a search mechanism across documents in a knowledge base. The user can then take the top search result and use it for the downstream task that prompted their initial query.

Clean up resources

If you created an Azure OpenAI resource solely for completing this tutorial and want to clean up, delete your deployed models, and then delete the resource or the associated resource group if it's dedicated to your test resource. Deleting the resource group also deletes any other resources associated with it.

Next steps

Learn more about Azure OpenAI's models:

Azure OpenAI Service models

Store your embeddings and perform vector (similarity) search using your choice of Azure service:

- Azure AI Search
- Azure Cosmos DB for MongoDB vCore
- Azure Cosmos DB for NoSQL
- Azure Cosmos DB for PostgreSQL
- Azure Cache for Redis

