[AI Medicine] llm


2023-06-12 00:34 | Source: compiled from the web

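Keywords: medical datasets, LLM fine-tuning

Open-source project: llm-medical-data, a collection of medical datasets for fine-tuning large language models

Project URL: https://github.com/donote/llm-medical-data

The project draws on several papers and open-source projects on fine-tuning large models for the medical domain, and collects the fine-tuning sample data those projects use. The datasets are described below.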

1. chinese_medical_dialogue_data

Source: https://github.com/Toyhom/Chinese-medical-dialogue-data
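The files collected from this source are UTF-8 CSV, so a quick sanity check of a file's header and row count could look like the sketch below. The helper name `summarize_csv` is hypothetical, and no column names are assumed because the layout is only documented through the sample file.

```python
import csv


def summarize_csv(path: str) -> tuple:
    """Return (header, row_count) for a UTF-8 encoded CSV file."""
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])           # first row, or [] for an empty file
        row_count = sum(1 for _ in reader)  # remaining rows are data rows
    return header, row_count
```

Comparing the returned row count with the published sample count is a cheap way to detect a truncated download.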

Files and sample counts:

| File | Samples |
| --- | --- |
| IM_内科.csv | 307,596 |
| andriatria_男科.csv | 113,877 |
| obgyn_妇产科.csv | 229,706 |
| oncology_肿瘤科.csv | 96,627 |
| pediatric_儿科.csv | 117,099 |
| surgical_外科.csv | 149,576 |

Data format: UTF-8 encoded CSV; see the sample file chinese_medical_dialogue_data/sample_h100.csv.

2. huatuo_llama_med_chinese_data

Source: https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese

Files and sample counts:

| File | Samples |
| --- | --- |
| liver_cancer.json | 1,000 |
| llama_data.json | 8,568 |

Data format: UTF-8 encoded JSON, following the Stanford Alpaca instruction fine-tuning sample format:

    {
      "instruction": " : 我听说免疫疗法是治疗晚期肝癌的一种很有效的方法,真的吗? : 是的,免疫疗法是一种新兴的治疗方法,它可以提高人体免疫系统对癌细胞的反应能力。对于晚期肝癌患者,免疫疗法已证明具有很高的治疗效果。 : 但是,这是否适用于所有的肝癌患者?",
      "input": "",
      "output": "不是所有的肝癌患者都适合进行免疫疗法。根据患者的临床表现和病情,医生会对患者进行评估,从而选择最适合的治疗方案。"
    }

3. medicalgpt_data

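Source: https://github.com/MediaBrain-SJTU/MedicalGPT-zh

The project does not publish its fine-tuning data, but it does provide the data generation scripts and seed data files. After adding an OpenAI API key, you can generate the fine-tuning samples yourself.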

pip install -r medicalgpt_data/requirements.txt

Add your OpenAI API keys in book_based_question_generation.py and dialogue_generation.py:

    ### add your openai key list, more key can generate data faster
    api_keys=[]

Then run the corresponding generation script from the medicalgpt_data directory. For the output format, see the sample files book_based_qa.json and dialogue_task.json.
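Hard-coding keys in a script makes them easy to commit by accident. As a minimal sketch, the list could instead be filled from an environment variable. The variable name `OPENAI_API_KEYS` and the helper `load_api_keys` are assumptions for illustration and are not part of the generation scripts, which only define `api_keys=[]`.

```python
import os


def load_api_keys(env_var: str = "OPENAI_API_KEYS") -> list:
    """Split a comma-separated environment variable into a list of keys."""
    raw = os.environ.get(env_var, "")
    # Drop empty entries so a trailing comma or an unset variable yields no bogus key.
    return [key.strip() for key in raw.split(",") if key.strip()]


# The generation scripts expect a list named api_keys.
api_keys = load_api_keys()
```

With this in place, the keys would be supplied at run time, for example by exporting `OPENAI_API_KEYS="key1,key2"` before starting the script.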

4. doctorchat_data

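Source: https://github.com/Kent0n-Li/ChatDoctor

Files and sample counts:

| File | Samples |
| --- | --- |
| GenMedGPT-5k.json | 2,354 |
| iCliniq.json | 7,342 |
| HealthCareMagic-100k.json | 24,665 |

GenMedGPT-5k contains doctor-patient dialogues generated by ChatGPT; iCliniq and HealthCareMagic-100k contain real conversations from online consultation sites.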
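Records in these files carry `instruction`, `input` and `output` fields, so a training prompt could be assembled roughly as follows. This is a sketch under two assumptions: each file is a single JSON array of records, and the prompt template shown here is purely illustrative, not the template ChatDoctor itself uses. The names `build_prompt` and `load_records` are hypothetical.

```python
import json


def build_prompt(record: dict) -> str:
    """Join instruction and optional input into one prompt string."""
    instruction = record["instruction"].strip()
    extra = record.get("input", "").strip()
    if extra:
        return f"Instruction: {instruction}\nInput: {extra}\nResponse:"
    # Records with an empty input field get the shorter template.
    return f"Instruction: {instruction}\nResponse:"


def load_records(path: str) -> list:
    """Load a JSON file assumed to hold a list of Alpaca-style records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

During fine-tuning, the `output` field would serve as the target text that follows the prompt.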

Data format: JSON, following the Stanford Alpaca sample format:

    {
      "instruction": "If you are a doctor, please answer the medical questions based on the patient's description.",
      "input": "Doctor, I have been experiencing sudden and frequent panic attacks. I don't know what to do.",
      "output": "Well, based on what you're telling me, it sounds like you may be suffering from panic disorder. The best course of action is to start with psychotherapy and mental health counseling. Additionally, we should conduct an electrocardiogram to make sure that there are no physical issues causing your panic attacks. We will also need to perform a depression screen and a toxicology screen to rule out any other underlying causes. Finally, I would recommend a comprehensive psychological and psychiatric evaluation and therapy to help manage your symptoms."
    }

5. opengpt_data

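Project URL: https://github.com/CogStack/opengpt

Files and sample counts:

| File | Type | Samples |
| --- | --- | --- |
| prepared_generated_data_for_nhs_uk_qa.csv | QA | 24,665 |
| prepared_generated_data_for_nhs_uk_conversations.csv | Conversation | 2,354 |
| prepared_generated_data_for_medical_tasks.csv | Task | 4,688 |

The samples were generated with ChatGPT from NHS website data. The prompts used for generation are listed in the prompts dataset.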

Data format: CSV. Within the text column, the user turn and the assistant turn correspond to input and output in the Stanford Alpaca sample format.

    text,raw_data_id
    " What is high blood pressure? High blood pressure is a condition where the force at which your heart pumps blood around your body is high. It is recorded with 2 numbers, the systolic pressure and the diastolic pressure, both measured in millimetres of mercury (mmHg). References: - https://www.nhs.uk/conditions/Blood-pressure-(high)/Pages/Introduction.aspx ",0
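Because the text field is quoted and may contain commas, it is safer to read these files with a CSV parser than to split lines by hand. The sketch below assumes the header is exactly `text,raw_data_id`, as in the sample above; the helper name `read_samples` and the shortened demo row are made up for illustration.

```python
import csv
import io


def read_samples(csv_file) -> list:
    """Return (text, raw_data_id) pairs from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return [(row["text"].strip(), int(row["raw_data_id"])) for row in reader]


# Tiny in-memory stand-in for one of the prepared_generated_data_*.csv files.
demo = io.StringIO(
    'text,raw_data_id\n'
    '" What is high blood pressure? It is a condition, recorded with 2 numbers. ",0\n'
)
samples = read_samples(demo)
```

For a real file, pass the handle from `open(path, newline="", encoding="utf-8")` so that line breaks inside quoted text are preserved.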

----------END----------

Also published on: AI加油站


