Biopython from Beginner to Expert: Entrez esearch, efetch, and elink

2023-04-06 06:04 | Source: compiled from the web

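The Biopython project is an international association of developers who build Python tools for computational molecular biology. Biopython (http://www.biopython.org) offers an online resource for people who use and study bioinformatics, including modules, scripts, and links to Python-based software. In general, Biopython aims to make Python easier to use in bioinformatics by providing high-quality, reusable modules and classes. Its features include parsers for many bioinformatics file formats (BLAST, Clustalw, FASTA, Genbank...), access to online services (NCBI, Expasy...), interfaces to common and not-so-common programs (Clustalw, DSSP, MSMS...), a standard sequence class, various clustering modules, a KD tree data structure, and more.

Today we look at esearch, efetch, and elink in Biopython's Entrez module. With these functions we can automatically download the papers of a given field, GEO datasets, gene and protein sequences, and so on. We can also match records across databases, mapping from one database to another. Let's look at the code: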

    from Bio import Entrez
    from math import ceil
    Entrez.email = "[email protected]"  # replace with your own email address

Entrez databases

Table 1.1: Entrez databases

    Entrez Database      UID common name      E-utility Database Name
    BioProject           BioProject ID        bioproject
    BioSample            BioSample ID         biosample
    Books                Book ID              books
    Conserved Domains    PSSM-ID              cdd
    dbGaP                dbGaP ID             gap
    dbVar                dbVar ID             dbvar
    Gene                 Gene ID              gene
    Genome               Genome ID            genome
    GEO Datasets         GDS ID               gds
    GEO Profiles         GEO ID               geoprofiles
    HomoloGene           HomoloGene ID        homologene
    MeSH                 MeSH ID              mesh
    NCBI C++ Toolkit     Toolkit ID           toolkit
    NLM Catalog          NLM Catalog ID       nlmcatalog
    Nucleotide           GI number            nuccore
    PopSet               PopSet ID            popset
    Probe                Probe ID             probe
    Protein              GI number            protein
    Protein Clusters     Protein Cluster ID   proteinclusters
    PubChem BioAssay     AID                  pcassay
    PubChem Compound     CID                  pccompound
    PubChem Substance    SID                  pcsubstance
    PubMed               PMID                 pubmed
    PubMed Central       PMCID                pmc
    SNP                  rs number            snp
    SRA                  SRA ID               sra
    Structure            MMDB-ID              structure
    Taxonomy             TaxID                taxonomy

esearch

    # Search PubMed for all papers published between 2023-01-01 and 2023-03-01
    # that are indexed with the MeSH term SARS-CoV-2.
    term = '(SARS-CoV-2[MeSH Terms]) AND (("2023/01/01"[Date - Publication] : "2023/03/01"[Date - Publication]))'
    handle = Entrez.esearch(db="pubmed", retmax=100, retstart=0, term=term)
    records = Entrez.read(handle)
    handle.close()
    number = int(records["Count"])  # total number of hits for the query
    pmids = records["IdList"]       # PMIDs on this page (at most retmax=100)
    print("total number of papers:")
    print(number)
    print("Top 10 PMIDs: ", pmids[:10])

Output:

    total number of papers:
    4939
    Top 10 PMIDs:  ['36946409', '36946400', '36946390', '36945958', '36945160', '36945083', '36944427', '36944089', '36943829', '36942682']

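After searching for keywords with esearch and obtaining the PMIDs, you can download the abstracts with efetch (see the efetch section for a detailed example) or download the full text through the pmc database. In the same way, esearch can search the other databases listed in Table 1.1. An esearch query always returns the IDs of the matching records in that database, for example PMIDs for pubmed, SRA IDs for sra, or GDS IDs for GEO Datasets. With an ID in hand, use efetch (below) to retrieve the detailed record.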

    term = 'SRP056585[Accession]'
    handle = Entrez.esearch(db="sra", retmax=100, retstart=0, term=term)
    record = Entrez.read(handle)
    handle.close()
    sra_ids = record['IdList']
    print(record)

Output:

    {'Count': '405', 'RetMax': '100', 'RetStart': '0',
     'IdList': ['2201884', '2201883', '2201882', '2201881', '2201880', '2201879', '2201878', '2201877', '2201876', '2201875',
                '2201874', '2201873', '2201872', '2201871', '2201870', '2201869', '2201868', '2201867', '2201866', '2201865',
                '2201864', '2201863', '2201862', '2201861', '2201860', '2201859', '2201858', '2201857', '2201856', '2201855',
                '2201854', '2201853', '2201852', '2201851', '2201850', '2201849', '2201848', '2201847', '2201846', '2201845',
                '2201844', '2201843', '2201842', '2201841', '2201840', '2201839', '2201838', '2201837', '2201836', '2201835',
                '2201832', '2201830', '2201829', '2201828', '2201827', '2201824', '2201823', '2201822', '2201821', '2201820',
                '2201819', '2201818', '2201817', '2201811', '2201809', '2201808', '2201807', '2201806', '2201805', '2201804',
                '2201803', '2201802', '2201800', '2201797', '2201796', '2201795', '2201794', '2201793', '2201792', '2201791',
                '2201788', '2201787', '2201786', '2201785', '2201784', '2201783', '2201782', '2201781', '2201779', '2201778',
                '2201777', '2201776', '2201775', '2201774', '2201773', '2201772', '2201771', '2201770', '2201769', '2201768'],
     'TranslationSet': [],
     'TranslationStack': [{'Term': 'SRP056585[Accession]', 'Field': 'Accession', 'Count': '405', 'Explode': 'N'}, 'GROUP'],
     'QueryTranslation': 'SRP056585[Accession]'}

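In this example we searched the SRA database for the project with accession SRP056585 and obtained the IDs of its samples. Note that only the first 100 of the 405 matching IDs are returned, because retmax=100. These IDs can then be used to link from the sra database to the GEO Datasets database.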
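Because Count is 405 while IdList holds only 100 entries, collecting every ID takes several esearch calls with increasing retstart values. The following is a minimal sketch of that paging, not part of the original article. The helper names page_starts and fetch_all_ids are hypothetical, and the search and read functions are passed in as arguments (they are meant to be Entrez.esearch and Entrez.read) so that the paging logic can be tried without network access.

```python
from math import ceil


def page_starts(total, page_size=100):
    """Return the retstart offset of every page needed to cover `total` hits."""
    n_pages = ceil(total / page_size)
    return [page * page_size for page in range(n_pages)]


def fetch_all_ids(search, read, db, term, page_size=100):
    """Collect every UID for `term` by calling `search` once per page.

    `search` and `read` are assumed to behave like Bio.Entrez.esearch and
    Bio.Entrez.read; they are injected so the function has no hard
    dependency on Biopython or on network access.
    """
    # The first page also tells us the total number of hits.
    handle = search(db=db, term=term, retstart=0, retmax=page_size)
    first = read(handle)
    handle.close()

    ids = list(first["IdList"])
    total = int(first["Count"])

    # Request the remaining pages, skipping offset 0 which we already have.
    for start in page_starts(total, page_size)[1:]:
        handle = search(db=db, term=term, retstart=start, retmax=page_size)
        ids.extend(read(handle)["IdList"])
        handle.close()
    return ids


print(page_starts(405, 100))  # [0, 100, 200, 300, 400]

# Usage against NCBI (requires Biopython and network access):
# from Bio import Entrez
# Entrez.email = "..."  # your own address
# sra_ids = fetch_all_ids(Entrez.esearch, Entrez.read, "sra", "SRP056585[Accession]")
```

With 405 hits and a page size of 100, five requests are needed; the last page returns only five IDs.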

efetch

    pmid = pmids[0]  # "36946409"
    # Fetch the paper with this PMID as MEDLINE-format text.
    handle = Entrez.efetch(db="pubmed", id=pmid, rettype='medline', retmode="text")
    records = handle.read()  # read the whole response into a string
    handle.close()
    print(records)

Output:

    PMID- 36946409
    OWN - NLM
    STAT- MEDLINE
    DCOM- 20230323
    LR  - 20230323
    IS  - 1997-7298 (Print)
    IS  - 1997-7298 (Linking)
    VI  - 123
    IP  - 3
    DP  - 2023
    TI  - [Possibilities of optimizing therapy in COVID-19 survivors with focal epilepsy].
    PG  - 130-136
    LID - 10.17116/jnevro2023123031130 [doi]
    AB  - OBJECTIVE: To study the effect of phenosanoic acid therapy on the frequency of seizures, asthenia and quality of life of adult patients with focal epilepsy who had a new coronavirus infection caused by SARS-CoV-2. MATERIAL AND METHODS: The data of 20 patients with focal epilepsy who suffered COVID-19 and received therapy with phenosanic acid (Dibufelon) were studied. The frequency of epileptic seizures, the severity of asthenia and the quality of life were evaluated according to clinical scales. RESULTS: Significant decrease in the frequency of bilateral tonic-clonic seizures and focal seizures with loss of consciousness was recorded. There was a significant improvement in the quality of life. There was no significant dynamics of asthenia against the background of taking the drug phenosanic acid in patients. CONCLUSION: The preparation of phenosanic acid can be an effective means of add-on therapy in patients with epilepsy who have undergone COVID-19.
    FAU - Ponomareva, I V
    AU  - Ponomareva IV
    AUID- ORCID: 0000-0001-6499-3054
    AD  - Chelyabinsk Regional Clinical Hospital No. 3, Chelyabinsk, Russia.
    FAU - Stepanova, S B
    AU  - Stepanova SB
    AUID- ORCID: 0000-0002-3484-6165
    AD  - South Ural State Medical University, Chelyabinsk, Russia.
    FAU - Reneva, S A
    AU  - Reneva SA
    AUID- ORCID: 0000-0002-5744-3502
    AD  - Regional Treatment and Rehabilitation Center, Tyumen, Russia.
    FAU - Sorokova, E V
    AU  - Sorokova EV
    AUID- ORCID: 0000-0002-4110-1719
    AD  - Medical Center , Yekaterinburg, Russia.
    FAU - Vagina, M A
    AU  - Vagina MA
    AUID- ORCID: 0000-0002-6681-5232
    AD  - Sverdlovsk Regional Clinical Hospital, Yekaterinburg, Russia.
    FAU - Makodzeba, O A
    AU  - Makodzeba OA
    AUID- ORCID: 0000-0001-8976-0222
    AD  - Medical Center , Chelyabinsk, Russia.
    FAU - Galiullin, T R
    AU  - Galiullin TR
    AUID- ORCID: 0000-0002-4558-6119
    AD  - Kuvatov Republican Clinical Hospital, Ufa, Russia.
    LA  - rus
    PT  - English Abstract
    PT  - Journal Article
    TT  - Vozmozhnosti optimizatsii terapii u patsientov s fokal'noi epilepsiei, perenesshikh COVID-19.
    PL  - Russia (Federation)
    TA  - Zh Nevrol Psikhiatr Im S S Korsakova
    JT  - Zhurnal nevrologii i psikhiatrii imeni S.S. Korsakova
    JID - 9712194
    RN  - 0 (Anticonvulsants)
    SB  - IM
    MH  - Adult
    MH  - Humans
    MH  - *Epilepsy, Tonic-Clonic
    MH  - Anticonvulsants/therapeutic use
    MH  - Asthenia/drug therapy
    MH  - Quality of Life
    MH  - *COVID-19
    MH  - SARS-CoV-2
    MH  - Seizures/drug therapy
    MH  - *Epilepsies, Partial/drug therapy
    MH  - *Epilepsy/drug therapy
    OTO - NOTNLM
    OT  - COVID-19
    OT  - antiepileptic drugs
    OT  - efficacy
    OT  - epilepsy
    OT  - fatigue
    OT  - frequency of bilateral tonic-clonic seizures
    OT  - phenosanic acid
    OT  - quality of life
    OT  - tolerability
    EDAT- 2023/03/23 06:00
    MHDA- 2023/03/23 06:00
    CRDT- 2023/03/22 07:08
    PHST- 2023/03/22 07:08 [entrez]
    PHST- 2023/03/23 06:00 [pubmed]
    PHST- 2023/03/23 06:00 [medline]
    AID - 10.17116/jnevro2023123031130 [doi]
    PST - ppublish
    SO  - Zh Nevrol Psikhiatr Im S S Korsakova. 2023;123(3):130-136. doi: 10.17116/jnevro2023123031130.

Table 2: retmode and rettype parameters of efetch

    Record Type                                       &rettype           &retmode

    All Databases
    Document summary                                  docsum             xml, default
    List of UIDs in XML                               uilist             xml
    List of UIDs in plain text                        uilist             text

    db = bioproject
    Full record XML                                   xml, default       xml, default

    db = biosample
    Full record XML                                   full, default      xml, default
    Full record text                                  full, default      text

    db = gds
    Summary                                           summary, default   text, default

    db = gene
    text ASN.1                                        null               asn.1, default
    XML                                               null               xml
    Gene table                                        gene_table         text

    db = homologene
    text ASN.1                                        null               asn.1, default
    XML                                               null               xml
    Alignment scores                                  alignmentscores    text
    FASTA                                             fasta              text
    HomoloGene                                        homologene         text

    db = mesh
    Full record                                       full, default      text, default

    db = nlmcatalog
    Full record                                       null               text, default
    XML                                               null               xml

    db = nuccore, protein or popset
    text ASN.1                                        null               text, default
    binary ASN.1                                      null               asn.1
    Full record in XML                                native             xml
    Accession number(s)                               acc                text
    FASTA                                             fasta              text
    TinySeq XML                                       fasta              xml
    SeqID string                                      seqid              text

    Additional options for db = nuccore or popset
    GenBank flat file                                 gb                 text
    GBSeq XML                                         gb                 xml
    INSDSeq XML                                       gbc                xml

    Additional option for db = nuccore and protein
    Feature table                                     ft                 text

    Additional option for db = nuccore
    GenBank flat file with full sequence (contigs)    gbwithparts        text
    CDS nucleotide FASTA                              fasta_cds_na       text
    CDS protein FASTA                                 fasta_cds_aa       text

    Additional options for db = protein
    GenPept flat file                                 gp                 text
    GBSeq XML                                         gp                 xml
    INSDSeq XML                                       gpc                xml
    Identical Protein XML                             ipg                xml

    db = pmc
    XML                                               null               xml, default
    MEDLINE                                           medline            text

    db = pubmed
    XML                                               null               xml, default
    MEDLINE                                           medline            text
    PMID list                                         uilist             text
    Abstract                                          abstract           text

    db = sequences
    text ASN.1                                        null               text, default
    Accession number(s)                               acc                text
    FASTA                                             fasta              text
    SeqID string                                      seqid              text

    db = snp
    text ASN.1                                        null               asn.1, default
    XML                                               null               xml
    Flat file                                         flt                text
    FASTA                                             fasta              text
    RS Cluster report                                 rsr                text
    SS Exemplar list                                  ssexemplar         text
    Chromosome report                                 chr                text
    Summary                                           docset             text
    UID list                                          uilist             text or xml

    db = sra
    XML                                               full, default      xml, default

    db = taxonomy
    XML                                               null               xml, default
    TaxID list                                        uilist             text or xml

    db = clinvar
    ClinVar Set                                       clinvarset         xml, default
    UID list                                          uilist             text or xml

    db = gtr
    GTR Test Report                                   gtracc             xml, default

elink

elink converts an ID from one database into the IDs of linked records in other databases. For the full table of available links, see: https://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html

    pmid = "35873672"
    # For the list of linkname values, see:
    # http://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html
    handle = Entrez.elink(dbfrom="pubmed", id=pmid, linkname="pubmed_gds")
    record = Entrez.read(handle)
    handle.close()
    gds_id = record[0]['LinkSetDb'][0]['Link'][0]['Id']  # first linked GDS ID
    print(record)

Output:

    [{'LinkSetDbHistory': [], 'LinkSetDb': [{'Link': [{'Id': '200184410'}], 'DbTo': 'gds', 'LinkName': 'pubmed_gds'}], 'ERROR': [], 'DbFrom': 'pubmed', 'IdList': ['35873672']}]

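In this example we first look up the gds_id of the public data that paper 35873672 submitted to GEO. We then pass this gds_id to efetch (below) to retrieve the details of the dataset, including its description, download address, and GSE accession.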
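The expression record[0]['LinkSetDb'][0]['Link'][0]['Id'] picks only the first link and raises an IndexError when a record has no links at all. As a hedged sketch (the helper name linked_ids is hypothetical and not part of Biopython), the nested structure can be walked like this to collect every linked ID:

```python
def linked_ids(linksets, linkname=None):
    """Return every linked UID found in a parsed elink result.

    `linksets` is the list returned by Entrez.read() for an elink call.
    If `linkname` is given, only links of that type (e.g. "pubmed_gds")
    are kept.
    """
    ids = []
    for linkset in linksets:
        for linksetdb in linkset.get("LinkSetDb", []):
            if linkname is not None and linksetdb.get("LinkName") != linkname:
                continue
            ids.extend(link["Id"] for link in linksetdb["Link"])
    return ids


# The elink record printed above, copied as a plain Python structure.
record = [{'LinkSetDbHistory': [],
           'LinkSetDb': [{'Link': [{'Id': '200184410'}],
                          'DbTo': 'gds', 'LinkName': 'pubmed_gds'}],
           'ERROR': [], 'DbFrom': 'pubmed', 'IdList': ['35873672']}]
print(linked_ids(record, "pubmed_gds"))  # ['200184410']

# A record without links yields an empty list instead of an IndexError.
no_links = [{'LinkSetDb': [], 'DbFrom': 'pubmed', 'IdList': ['1']}]
print(linked_ids(no_links))  # []
```

Returning a list keeps the caller in control: it can take the first ID, loop over all of them, or skip papers that have no GEO submission.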

    ID = gds_id  # 200184410
    handle = Entrez.efetch(db="gds", id=ID, retmode="text")
    records = handle.read()
    # title = records.strip().split('\n')[0].replace('1. ', '')
    handle.close()
    print(records)

Output:

    1. Mouse DNA methylation atlas using Infinium Mouse Methylation Beadchips
    (Submitter supplied) Design, annotation, and technical and biological validation of the MouseMethylation Beadchips (MM285) platform. Application of the mouse array for tissue and tumor epigenetics, comparative epigenomics, genomic imprinting, epigenetic inhibitors, PDX assessment, backcross tracing, and epigenetic clocks.
    Organism: Rattus norvegicus; Cricetulus griseus; Mus musculus; Homo sapiens
    Type: Methylation profiling by genome tiling array
    Platform: GPL30650 1239 Samples
    FTP download: GEO (CSV, IDAT) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE184nnn/GSE184410/
    Series Accession: GSE184410 ID: 200184410

The same approach works for an SRA ID:

    sra_ids[0]

Output:

    '2201884'

    # 2201884 is the SRA ID obtained earlier by searching the SRA database
    # for the accession SRP056585.
    ID = sra_ids[0]
    # dbfrom names the database the input ID comes from.
    handle = Entrez.elink(dbfrom="sra", id=ID, linkname="sra_gds")
    record = Entrez.read(handle)
    gds_id = record[0]['LinkSetDb'][0]['Link'][0]["Id"]
    handle.close()
    print(record, gds_id)

Output:

    [{'LinkSetDbHistory': [], 'LinkSetDb': [{'Link': [{'Id': '302052187'}, {'Id': '200067310'}], 'DbTo': 'gds', 'LinkName': 'sra_gds'}], 'ERROR': [], 'DbFrom': 'pubmed', 'IdList': ['2201884']}] 302052187

Note: this output was recorded with the original call, which passed db="sra" instead of dbfrom="sra". elink then fell back to its default source database, which is presumably why the record reports 'DbFrom': 'pubmed'. With dbfrom="sra" that field should read 'sra'.

Once the SRA ID has been converted to a GDS ID, efetch can retrieve the detailed record for that ID from the GEO Datasets database.

    ID = gds_id  # 302052187
    handle = Entrez.efetch(db="gds", id=ID, retmode="text")
    records = handle.read()
    # title = records.strip().split('\n')[0].replace('1. ', '')
    handle.close()
    print(records)

Output:

    1. 2dAscl1clonal-iN_iN5_47
    Organism: Mus musculus
    Source name: induced neuronal (iN) cells
    Platform: GPL19057 Series: GSE67310
    FTP download: SRA Run Selector: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRX1556399
    Sample Accession: GSM2052187 ID: 302052187
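The gds record comes back as free text, so the accessions still have to be extracted from it. Below is a rough sketch using a regular expression. The helper name gds_fields and the choice of labels are assumptions based only on the two records shown in this article; other GEO record types may use different labels, and only the first value after each label is captured.

```python
import re

# Labels seen in the two gds records in this article. "Series Accession" must
# come before "Series" so that the longer label wins when both could match.
FIELD_RE = re.compile(
    r"\b(Series\s+Accession|Sample\s+Accession|Series|Platform|ID):\s*(\S+)"
)


def gds_fields(text):
    """Extract accession-like fields from a gds text summary."""
    fields = {}
    for name, value in FIELD_RE.findall(text):
        # Collapse any run of whitespace inside the label to a single space.
        fields[" ".join(name.split())] = value
    return fields


# The sample record shown above, typed in by hand.
sample = """1. 2dAscl1clonal-iN_iN5_47
Organism: Mus musculus
Source name: induced neuronal (iN) cells
Platform: GPL19057 Series: GSE67310
FTP download: SRA Run Selector: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRX1556399
Sample Accession: GSM2052187 ID: 302052187
"""

print(gds_fields(sample))
# {'Platform': 'GPL19057', 'Series': 'GSE67310', 'Sample Accession': 'GSM2052187', 'ID': '302052187'}
```

The GSE accession (here GSE67310) is usually the value needed to locate the series on the GEO website or FTP server.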

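To summarize: every Entrez database (PubMed, PMC, Gene, SRA, gds, and so on) has its own IDs, and elink maps the IDs of one database to those of another. For example, when a paper is published and its authors deposit their data in GEO Datasets (the gds database), the paper's PMID can be mapped to an ID in gds. elink only maps IDs between databases; once you have the ID, use efetch to retrieve the detailed record.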
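A record retrieved by efetch usually still has to be parsed before it is useful. As an illustration, the MEDLINE text shown in the efetch section can be split into tagged fields with a few lines of plain Python. This is a simplified sketch with a hypothetical helper name, parse_medline; it assumes the usual MEDLINE layout of a four-column tag followed by "- ", with continuation lines indented. In real code, Biopython's own Bio.Medline module is the more robust choice.

```python
def parse_medline(text):
    """Parse one MEDLINE-format record into a dict of {tag: [values]}.

    A line such as "AU  - Ponomareva IV" starts a new value for tag "AU".
    Indented continuation lines are appended to the most recent value.
    """
    record = {}
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[:4].strip() and line[4:6] == "- ":
            tag = line[:4].strip()
            record.setdefault(tag, []).append(line[6:].strip())
        elif tag is not None:
            record[tag][-1] += " " + line.strip()
    return record


# A shortened excerpt of the record from the efetch section, wrapped by hand
# to show a continuation line.
sample = """PMID- 36946409
OWN - NLM
TI  - [Possibilities of optimizing therapy in COVID-19 survivors with focal
      epilepsy].
FAU - Ponomareva, I V
AU  - Ponomareva IV
FAU - Stepanova, S B
AU  - Stepanova SB
MH  - Adult
MH  - Humans
"""

parsed = parse_medline(sample)
print(parsed["PMID"])  # ['36946409']
print(parsed["AU"])    # ['Ponomareva IV', 'Stepanova SB']
print(parsed["TI"][0])
```

Every tag maps to a list because tags such as AU and MH repeat within a record.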
