How to Use OpenAI GPT with Python (#11: Advanced Fine-Tuning Example, Drug Classification)


2023-04-02 00:59 | Source: compiled from the web

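The dataset used in this example

In this example we use a public dataset that lists drug names together with the problems, diseases, or symptoms they treat.

We will train a model, "teaching" it to predict an output from the user's input.

The user's input is a drug name; the output is the name of a malady.

The dataset is hosted on Kaggle (http://Kaggle.com) and can be downloaded from there.

You can also visit this URL and download the Medicine_description.xlsx file.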

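The file contains three sheets. We use "Sheet1", which has three columns:

Drug_Name
Reason
Description

We will use the first and second columns, that is, the drug name and the reason it is recommended. For example:

A CN Gel(Topical) 20gmA CN Soap 75gm ==> Acne
PPG Trio 1mg Tablet 10'SPPG Trio 2mg Tablet 10'S ==> Diabetes
Iveomezoled 200mg Injection 500ml ==> Fungal

Preparing the data and starting fine-tuning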

The data file is in XLSX format, so we need to convert it to JSONL. As described in the previous article, fine-tuning mainly works with files in this format.

We will also use the following record layout:

{"prompt":"Drug: <DRUG NAME>\nMalady:","completion":" <MALADY NAME>"}

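As you can see, we use \nMalady: as the separator.

The completion also starts with a space. Remember the best practices from the previous article?

Those best practices also say that the completion should end with a fixed sequence that tells the model where to stop, such as \n, ###, END, or any other marker that does not appear earlier in the text.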

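In our example this is not needed, because we use a single token to mark the class. That is, each malady gets a unique identifier, for example:

Acne: 1
Allergies: 2
Alzheimer: 3
...etc

This way, the model only has to return a single token for any classification, which is why a stop sequence is no longer needed.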
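The record layout and the numeric class ids described above can be sketched in plain Python. This is only an illustration: the helper names `build_label_ids` and `make_record` and the short `reasons` list are assumptions made for this sketch, not part of the original script.

```python
import json


def build_label_ids(reasons):
    """Assign an integer id to each distinct reason, in first-seen order."""
    unique_reasons = list(dict.fromkeys(reasons))  # de-duplicate, keep order
    return {reason: i for i, reason in enumerate(unique_reasons)}


def make_record(drug_name, class_id):
    """Build one training example: separator at the end of the prompt,
    leading space at the start of the completion."""
    return {
        "prompt": "Drug: " + drug_name + "\nMalady:",
        "completion": " " + str(class_id),
    }


# Hypothetical miniature of the Reason column
reasons = ["Acne", "Acne", "Diabetes", "Fungal", "Diabetes"]
label_ids = build_label_ids(reasons)

record = make_record("Iveomezoled 200mg Injection 500ml", label_ids["Fungal"])
line = json.dumps(record)  # one JSONL line
print(line)
```

Because every completion is just a space followed by a class id, no stop sequence has to be appended to it.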

First, use Pandas to shape the data into the format we need.

import pandas as pd
# read the first n rows
n = 2000
df = pd.read_excel('Medicine_description.xlsx', sheet_name='Sheet1', header=0, nrows=n)
# get the unique values in the Reason column
reasons = df["Reason"].unique()
# assign a number to each reason
reasons_dict = {reason: i for i, reason in enumerate(reasons)}
# wrap each drug name as "Drug: <name>\nMalady:"
df["Drug_Name"] = "Drug: " + df["Drug_Name"] + "\n" + "Malady:"
# replace each reason with its numeric class id, prefixed with a space
df["Reason"] = " " + df["Reason"].apply(lambda x: str(reasons_dict[x]))
# drop the Description column
df.drop(["Description"], axis=1, inplace=True)
# rename the columns
df.rename(columns={"Drug_Name": "prompt", "Reason": "completion"}, inplace=True)
# convert the dataframe to jsonl format
jsonl = df.to_json(orient="records", indent=0, lines=True)
# write the jsonl to a file
with open("drug_malady_data.jsonl", "w") as f:
    f.write(jsonl)

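The code above reads 2,000 rows from the Excel file to train the model. You can of course use more data.

The script starts by reading the first n rows of Medicine_description.xlsx into a DataFrame named df.

It then takes the distinct values of the Reason column, stores them in the array reasons, and assigns each one a numeric index in the dictionary reasons_dict.

Next, it prefixes each value in the Drug_Name column with "Drug: " and appends a newline followed by the "Malady:" separator. Each value in the Reason column is then replaced with its numeric class from reasons_dict, preceded by a space.

We do not need the Description column in this example, so it is dropped from the DataFrame.

Drug_Name is then renamed to prompt, and Reason to completion.

Finally, the DataFrame is converted to JSONL, stored in the variable jsonl, and written to the file drug_malady_data.jsonl.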

The contents of drug_malady_data.jsonl look like this:

[..]
{"prompt":"Drug: Acleen 1% Lotion 25ml\nMalady:","completion":" 0"}
[..]
{"prompt":"Drug: Capnea Injection 1ml\nMalady:","completion":" 1"}
[..]
{"prompt":"Drug: Mondeslor Tablet 10'S\nMalady:","completion":" 2"}
[..]
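Before uploading the file, it can be worth checking that every line follows the conventions described earlier. The following is a minimal sketch of such a check; `check_example` is a hypothetical helper written for this illustration and is not part of the OpenAI tooling.

```python
import json


def check_example(line, separator="\nMalady:"):
    """Return a list of problems found in one JSONL training line."""
    record = json.loads(line)
    problems = []
    if not record["prompt"].endswith(separator):
        problems.append("prompt does not end with the separator")
    completion = record["completion"]
    if not completion.startswith(" "):
        problems.append("completion does not start with a space")
    if not completion.strip().isdigit():
        problems.append("completion is not a numeric class id")
    return problems


# One well-formed line and one deliberately broken line
good = '{"prompt":"Drug: Acleen 1% Lotion 25ml\\nMalady:","completion":" 0"}'
bad = '{"prompt":"Drug: Acleen 1% Lotion 25ml","completion":"Acne"}'

print(check_example(good))  # no problems expected
print(check_example(bad))   # all three checks should fail
```

To check a real file, the same function could be applied to each non-empty line of drug_malady_data.jsonl.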

With the dataset ready, we move on to the next step and run the data preparation tool:

openai tools fine_tunes.prepare_data -f drug_malady_data.jsonl

This step analyzes the data and helps us understand how the model will consume it:

Analyzing...
- Your file contains 2000 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- All prompts end with suffix `\nMalady:`
- All prompts start with prefix `Drug: `

You can split the data into two parts, one for training and one for validation.

- [Recommended] Would you like to split into training and validation set? [Y/n]:
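To make the idea of a held-out validation set concrete, here is a minimal sketch of a shuffled split. It is not the algorithm the CLI uses; the function name `split_examples`, the 20% fraction, the fixed seed, and the placeholder drug names are all assumptions made for this illustration.

```python
import random


def split_examples(examples, valid_fraction=0.2, seed=42):
    """Shuffle the examples and hold out a fraction for validation."""
    shuffled = list(examples)              # copy, leave the input untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


# Ten made-up examples in the prompt/completion layout
examples = [
    {"prompt": f"Drug: drug-{i}\nMalady:", "completion": f" {i % 3}"}
    for i in range(10)
]

train, valid = split_examples(examples)
print(len(train), len(valid))
```

The validation examples are never shown to the model during training, so accuracy measured on them estimates how well the classifier generalizes.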

The terminal also gives you the command to train the model:

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "drug_malady_data_prepared_train.jsonl" -v "drug_malady_data_prepared_valid.jsonl" --compute_classification_metrics --classification_n_classes 3

The tool also detects the number of classes used in the dataset.
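The class count passed to --classification_n_classes can be cross-checked by counting the distinct completions yourself. The sketch below does this for a few sample lines; `class_distribution` is a hypothetical helper, and the sample simply repeats lines from the file excerpt shown earlier.

```python
import json
from collections import Counter


def class_distribution(jsonl_lines):
    """Count how many examples carry each completion label."""
    counts = Counter()
    for line in jsonl_lines:
        if line.strip():  # skip blank lines
            counts[json.loads(line)["completion"].strip()] += 1
    return counts


sample_lines = [
    '{"prompt":"Drug: Acleen 1% Lotion 25ml\\nMalady:","completion":" 0"}',
    '{"prompt":"Drug: Capnea Injection 1ml\\nMalady:","completion":" 1"}',
    '{"prompt":"Drug: Mondeslor Tablet 10\'S\\nMalady:","completion":" 2"}',
    '{"prompt":"Drug: Acleen 1% Lotion 25ml\\nMalady:","completion":" 0"}',
]

distribution = class_distribution(sample_lines)
print(len(distribution), dict(distribution))
```

The per-class counts are also a quick way to spot classes with very few examples.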

Finally, run the command it provides to start training. You can specify the model; here we use ada, which is cheap and performs well for our use case.

You can also add a suffix to the name of the fine-tuned model; we use drug_malady_data.

# export your OpenAI key
export OPENAI_API_KEY="xxxxxxxxxxxx"
openai api fine_tunes.create \
  -t "drug_malady_data_prepared_train.jsonl" \
  -v "drug_malady_data_prepared_valid.jsonl" \
  --compute_classification_metrics \
  --classification_n_classes 3 \
  -m ada \
  --suffix "drug_malady_data"

If you lose the connection to the server before the job finishes, you can reconnect and check on it with:

openai api fine_tunes.follow -i <JOB ID>

Here is a sample of the output once the job has finished:

Created fine-tune: <JOB ID>
Fine-tune costs $0.03
Fine-tune enqueued
Fine-tune is in the queue. Queue number: 31
Fine-tune is in the queue. Queue number: 30
Fine-tune is in the queue. Queue number: 29
Fine-tune is in the queue. Queue number: 28
[...]
[...]
[...]
Fine-tune is in the queue. Queue number: 2
Fine-tune is in the queue. Queue number: 1
Fine-tune is in the queue. Queue number: 0
Fine-tune started
Completed epoch 1/4
Completed epoch 2/4
Completed epoch 3/4
Completed epoch 4/4
Uploaded model: <MODEL ID>
Uploaded result file: <FILE ID>
Fine-tune succeeded
Job complete! Status: succeeded
Try out your fine-tuned model:
openai api completions.create -m <MODEL ID> -p <YOUR_PROMPT>

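This output shows the progress and status of the fine-tuning job. It prints an ID, which confirms that the new fine-tuning job was created. It also shows the following information:

The cost of the fine-tune
The number of epochs completed
The ID of the file that contains the fine-tuning results

Testing the fine-tuned model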

Once fine-tuning is complete, you can test the model with the following code:

# openai.Completion is the legacy (pre-1.0) interface of the openai package;
# the package reads the API key from the OPENAI_API_KEY environment variable
import openai
# Configure the model ID. Change this to your model ID.
model = "ada:ft-learninggpt:drug-malady-data-2023-02-21-20-36-07"
# Let's use a drug from each class
drugs = [
    "A CN Gel(Topical) 20gmA CN Soap 75gm",  # Class 0
    "Addnok Tablet 20'S",  # Class 1
    "ABICET M Tablet 10's",  # Class 2
]
# Returns a drug class for each drug
for drug_name in drugs:
    prompt = "Drug: {}\nMalady:".format(drug_name)
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        temperature=1,
        max_tokens=1,  # the class id is a single token
    )
    # Print the generated text
    drug_class = response.choices[0].text
    # The result should be 0, 1, and 2
    print(drug_class)

We tested three drug names, each from a different class:

"A CN Gel(Topical) 20gmA CN Soap 75gm", Class 0 (Acne)
"Addnok Tablet 20'S", Class 1 (Adhd)
"ABICET M Tablet 10's", Class 2 (Allergies)

The output of the code should be:

0
1
2

If we test with the following prompts instead:

drugs = [
    "What is 'A CN Gel(Topical) 20gmA CN Soap 75gm' used for?",  # Class 0
    "What is 'Addnok Tablet 20'S' used for?",  # Class 1
    "What is 'ABICET M Tablet 10's' used for?",  # Class 2
]

the result should still be the same.

We can add a mapping to the code so that it returns the name of the malady instead of the numeric class:

# Reuses `openai` and `model` from the previous snippet
# Let's use a drug from each class
drugs = [
    "What is 'A CN Gel(Topical) 20gmA CN Soap 75gm' used for?",  # Class 0
    "What is 'Addnok Tablet 20'S' used for?",  # Class 1
    "What is 'ABICET M Tablet 10's' used for?",  # Class 2
]
class_map = {
    0: "Acne",
    1: "Adhd",
    2: "Allergies",
    # ...
}
# Returns a drug class for each drug
for drug_name in drugs:
    prompt = "Drug: {}\nMalady:".format(drug_name)
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        temperature=1,
        max_tokens=1,
    )
    response = response.choices[0].text
    try:
        print(drug_name + " is used for " + class_map[int(response)])
    except (ValueError, KeyError):
        # the completion was not a number, or not a known class id
        print("I don't know what " + drug_name + " is used for.")
    print()
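Rather than typing the mapping by hand, it could be derived from the reasons_dict built during data preparation. The sketch below shows one way to do that and to decode a completion without calling the API; `invert_labels` and `decode_class` are hypothetical helpers, and the three-entry reasons_dict stands in for the full dictionary.

```python
def invert_labels(reasons_dict):
    """Turn {reason: class_id} into {class_id: reason}."""
    return {class_id: reason for reason, class_id in reasons_dict.items()}


def decode_class(completion_text, class_map):
    """Map the model's completion to a malady name, or None if unknown."""
    try:
        return class_map[int(completion_text.strip())]
    except (ValueError, KeyError):
        return None


# Stand-in for the dictionary produced by the preparation script
reasons_dict = {"Acne": 0, "Adhd": 1, "Allergies": 2}
class_map = invert_labels(reasons_dict)

print(decode_class(" 1", class_map))  # a known class id
print(decode_class(" 9", class_map))  # an id outside the mapping
```

Saving reasons_dict alongside the training file keeps the numeric ids and the malady names in sync between training and inference.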
