Research Progress in Identifying Prokaryotic Promoters Using Computational Biology Methods

2024-07-16 01:08 | Source: compiled from the web

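A promoter is a DNA sequence, usually located upstream of a gene, that is specifically bound by RNA polymerase and initiates transcription. As a key element of transcription initiation, it enables RNA polymerase to bind the template DNA, which is the first step of gene expression and transcriptional regulation [1].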

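The σ factor of prokaryotic RNA polymerase specifically recognizes and binds promoters. Escherichia coli has several σ factors, which are classified by molecular weight into seven types: σ70, σ54, σ38, σ32, σ28, σ24, and σ19. Of these seven known types, the first six are highly conserved, whereas σ19 is absent from most genomes [2]. Each type of σ factor has a specific biological function [3-6]: σ70 is mainly responsible for transcribing housekeeping genes; σ54 is regarded as a regulator of nitrogen metabolism and also controls some auxiliary processes; σ38 regulates stationary-phase genes; σ32 is the heat shock σ factor; σ28 is involved in flagellar synthesis; σ24 is associated with the extreme heat stress response; and σ19 regulates the iron transport system. Based on homology, σ factors fall into two broad groups: the σ70 family, comprising σ70, σ38, σ32, σ28, σ24, and σ19, and the σ54 family.

Promoters in the E. coli genome can be classified in the same way, according to the σ factor that binds them, and the consensus sequences of the different promoter types differ. Promoters are therefore also divided into a σ70 family and a σ54 family according to the segments that are recognized. For example, a σ70 promoter has two important motif regions, the −10 region and the −35 region, located about 10 bp and 35 bp upstream of the transcription start site, respectively. The −10 region contains the conserved sequence "TATAAT", also called the Pribnow box or TATA box; it is rich in adenine (A) and thymine (T), which helps the DNA duplex unwind and separate. The −35 region consists of six conserved nucleotides, "TTGACA" [7]. Besides σ70 itself, the other factors of the σ70 family also recognize the −10 and −35 regions as key segments. By contrast, the consensus sequences of σ54 promoters and their positions differ markedly from those of σ70 promoters: σ54 promoters have conserved regions at −24 and −12, with the conserved sequences "TGGCA[CT][GA]" and "TGC[AT][TA]", respectively [8].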
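The consensus motifs quoted above can be turned into a naive sequence scanner. The following Python sketch is only an illustration and is not one of the tools reviewed here. The motif strings come from the text; the function names, the mismatch tolerance `max_mismatch`, the −35/−10 spacer range `spacer_range` (15 to 19 bp), and the −24/−12 gap range `gap_range` are assumptions introduced for this example.

```python
import re

# Consensus motifs quoted in the text.
SIGMA70_MINUS35 = "TTGACA"
SIGMA70_MINUS10 = "TATAAT"
SIGMA54_MINUS24 = re.compile(r"TGGCA[CT][GA]")
SIGMA54_MINUS12 = re.compile(r"TGC[AT][TA]")


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_sigma70_candidates(seq, max_mismatch=1, spacer_range=(15, 19)):
    """Return (start_35, start_10) index pairs of candidate -35/-10 hexamers.

    max_mismatch and spacer_range are illustrative assumptions, not values
    taken from the article.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 5):
        if hamming(seq[i:i + 6], SIGMA70_MINUS35) > max_mismatch:
            continue
        for spacer in range(spacer_range[0], spacer_range[1] + 1):
            j = i + 6 + spacer  # start of the candidate -10 hexamer
            if j + 6 <= len(seq) and hamming(seq[j:j + 6], SIGMA70_MINUS10) <= max_mismatch:
                hits.append((i, j))
    return hits


def find_sigma54_candidates(seq, gap_range=(3, 7)):
    """Return (start_24, start_12) index pairs of candidate -24/-12 elements.

    gap_range (bases between the two elements) is an illustrative assumption.
    """
    seq = seq.upper()
    hits = []
    for m24 in SIGMA54_MINUS24.finditer(seq):
        for gap in range(gap_range[0], gap_range[1] + 1):
            j = m24.end() + gap
            if SIGMA54_MINUS12.match(seq, j):  # match anchored at position j
                hits.append((m24.start(), j))
    return hits


if __name__ == "__main__":
    demo70 = "GGG" + "TTGACA" + "A" * 17 + "TATAAT" + "CCC"
    demo54 = "CC" + "TGGCACG" + "AAAAT" + "TGCAT" + "GG"
    print(find_sigma70_candidates(demo70, max_mismatch=0))
    print(find_sigma54_candidates(demo54))
```

A scanner of this kind produces many false positives on real genomes, because short motifs occur frequently by chance, which is one reason the prediction tools compared in this review rely on richer features and trained classifiers.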

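Identifying promoter sequences is essential for studying gene expression, analyzing gene regulatory mechanisms, studying gene structure, and annotating genes. Accurate promoter identification has generally relied on experimental assays that are expensive, time-consuming, and laborious, which makes genome-wide detection a daunting task. With advances in sequencing and computing technology, the complete genomes of more and more organisms, prokaryotes in particular, have been sequenced. This has given rise to promoter prediction methods based on computational biology, which continue to improve and help identify promoter sequences.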

Table 1  Comparison of 39 prokaryotic promoter prediction tools

| No. | Tool | Benchmark dataset size (promoters) | Sequence similarity | Feature extraction / selection | Classification algorithm | Evaluation strategy | AUC |
|---|---|---|---|---|---|---|---|
| 1 | TLS-NNPP [9] | 771 (E. coli) | / | The empirical probability distribution of TSS-TLS distance | ANN | Independent test | / |
| 2 | SIDD [10] | 500 (E. coli) | / | SIDD | FLD | Independent test | / |
| 3 | FS_LSSVM [11] | 53 (E. coli) | / | A domain theory for promoters / C4.5 decision tree | LSSVM | 10-fold cross-validation | / |
| 4 | Free energy [12] | 1044 (E. coli); 879 (B. subtilis) | / | Free energy | Modified scoring function | Independent test | / |
| 5 | PromPredict [13] | 1145 (E. coli); 615 (B. subtilis); 82 (M. tuberculosis) | / | GC content; Average free energy | Difference between the average free energy | Training and validation | / |
| 6 | SIDD-ANN [14] | 1648 (E. coli) | / | SIDD profile data | ANN | Independent test | / |
| 7 | PePPER [15] | L. lactis | / | PWM | HMM | / | / |
| 8 | G4PromFinder [16] | 3570 (S. coelicolor); 2117 (P. aeruginosa) | / | AT-rich element and G-quadruplex motif-based algorithm | / | Independent test | / |
| 9 | LN-QSAR [17] | 135 (M. bovis) | / | Pseudo-folding 2D lattice graph | LDA | Independent test | / |
| 10 | Ensemble-SVM [18] | 450 (E. coli σ70) | / | k-mer with location with respect to the TSS / Symmetric uncertainty | Ensemble-SVM | 10-fold cross-validation | / |
| 11 | TSS-PREDICT [19] | 450 (E. coli σ70); 205 (B. subtilis); 26 (C. trachomatis) | / | Information content; PWM | Ensemble-SVM | Independent test | / |
| 12 | TSS-SLP [20] | 669 (E. coli σ70) | / | Dinucleotide frequency features | SLP | 5-fold cross-validation; Independent test | / |
| 13 | PCSF [21] | 683 (E. coli σ70) | / | Conservation of sequence segments; PCSF | Score function | 10-fold cross-validation | / |
| 14 | IPMD [22] | 270 (B. subtilis σ43); 741 (E. coli σ70) | / | PCSF; ID | Modified MD | 10-fold cross-validation | 0.847 (B. subtilis); 0.920 (E. coli) |
| 15 | 70ProPred [23] | 741 (E. coli σ70) | / | PSTNPss; PseEIIP | SVM | 5-fold cross-validation; Jackknife test | 0.990 |
| 16 | iProEP [24] | 270 (B. subtilis); 741 (E. coli) | ≤80% | PseKNC; PCSF / mRMR; IFS | SVM | 10-fold cross-validation | 0.988 (B. subtilis); 0.976 (E. coli) |
| 17 | IPWM [25] | 683 (E. coli σ70) | / | Entropy-based conservative characteristics; Improved PWM | Score function | 10-fold cross-validation | / |
| 18 | BacPP [26] | 1034 (E. coli) | / | Binary digits | ANN | (2,3,10)-fold cross-validation; Independent test | / |
| 19 | vw Z-curve [27] | 1401 (E. coli); 660 (B. subtilis) | / | Variable-window Z-curve / IFS | PLS | 10-fold cross-validation | / |
| 20 | Stability [28] | 1035 (E. coli) | / | DNA duplex stability | ANN | (2,3,10)-fold cross-validation | / |
| 21 | iPro54-PseKNC [29] | 161 (prokaryotic σ54) | ≤75% | PseKNC / F-score; IFS | SVM | Jackknife test | / |
| 22 | PromotePredictor [30] | 161 (prokaryotic σ54) | ≤75% | Motif profile-based ANF / MRMD | Bagging; RF; SVM | 10-fold cross-validation; Independent test | / |
| 23 | meta-predictor [31] | 579 (E. coli σ70) | ≤45% | Sequence-based features; structure-based features | Meta-predictor | Independent test | 0.850 |
| 24 | bTSSfinder [32] | 3597 (E. coli); 12797 (Nostoc); 351 (Synechocystis); 1471 (S. elongatus) | / | PWM; Physicochemical properties / Mahalanobis distance | ANN | Independent test | / |
| 25 | iPro70-PseZNC [33] | 741 (E. coli σ70) | / | PseZNC / F-score; IFS | SVM | 5-fold cross-validation | 0.909 |
| 26 | iPromoter-FSEn [34] | 741 (E. coli σ70) | / | Nucleotide statistics; k-mer; g-gapped k-mer; Approximate signal pattern count; Position-specific occurrences; Distribution of nucleotides / Feature subspace | Ensemble learning | 10-fold cross-validation | 0.932 |
| 27 | iPro70-FMWin [35] | 741 (E. coli σ70) | / | k-mer; g-gapped k-mer; Pattern finding; Positioning distance count / AdaBoost | LR | 10-fold cross-validation | 0.959 |
| 28 | CNNProm [36] | 839 (E. coli σ70); 746 (B. subtilis) | / | one-hot | CNN | 5-fold cross-validation | / |
| 29 | IBBP [37] | 1888 (E. coli σ70) | / | Image-based and evolutionary approach | SVM | Independent test | / |
| 30 | SAPPHIRE [38] | 170 (P. aeruginosa and P. putida σ70) | / | one-hot | ANN | 5-fold cross-validation; Independent test | / |
| 31 | iPromoter-2L [39] | 2860 (E. coli) | ≤80% | Multi-window-based PseKNC | RF | 5-fold cross-validation; Jackknife test | / |
| 32 | iPromoter-2L2.0 [40] | 2860 (E. coli) | ≤80% | Smoothing Cutting Window algorithm; k-mer; PseKNC | SVM; Ensemble learning | 5-fold cross-validation | / |
| 33 | MULTiPly [41] | 2860 (E. coli) | ≤80% | Bi-profile Bayes; KNN; k-mer; DAC / F-score | SVM | 5-fold cross-validation; Jackknife test; Independent test | / |
| 34 | pcPromoter-CNN [42] | 2860 (E. coli) | ≤80% | one-hot | CNN | 5-fold cross-validation; Independent test | 0.957 |
| 35 | iPromoter-BnCNN [43] | 2860 (E. coli) | ≤80% | one-hot; k-mer; Structural properties | CNN | 5-fold cross-validation; Independent test | / |
| 36 | SELECTOR [44] | 2860 (E. coli) | ≤80% | CKSNAP; PCPseDNC; PSTNPss; DNA strand | Ensemble learning | 5-fold cross-validation; Independent test | 0.984 |
| 37 | iPSW(2L)-PseKNC [45] | 3382 (E. coli) | ≤85% | NCP; ANF | SVM | 5-fold cross-validation | 0.905 |
| 38 | deepPromoter [46] | 3382 (E. coli) | ≤85% | Combination of Continuous FastText N-Grams / MRMD | CNN | 5-fold cross-validation | 0.885 |
| 39 | iPSW(PseDNC-DL) [47] | 3382 (E. coli) | ≤85% | one-hot; PseDNC | CNN | 5-fold cross-validation | 0.925 |

PWM: position weight matrix; SIDD: stress-induced DNA duplex destabilization; PCSF: position-correlation scoring function; ID: increment of diversity; PSTNPss: position-specific trinucleotide propensity based on single-strand; PseEIIP: electron-ion interaction pseudo-potentials of trinucleotide; PseKNC: pseudo k-tuple nucleotide composition; ANF: accumulated nucleotide frequency; PseZNC: pseudo multi-window Z-curve nucleotide composition; KNN: k-nearest neighbors; DAC: dinucleotide-based auto-covariance; PCPseDNC: parallel correlation pseudo dinucleotide composition; NCP: nucleotide chemical property; PseDNC: pseudo dinucleotide composition; mRMR: minimum redundancy maximum relevance; IFS: incremental feature selection; MRMD: maximum-relevance-maximum-distance; ANN: artificial neural network; SVM: support vector machine; FLD: Fisher linear discriminant; SLP: single-layer perceptron; LSSVM: least squares support vector machine; MD: Mahalanobis discriminant; PLS: partial least squares; HMM: hidden Markov model; RF: random forest; LR: logistic regression; CNN: convolutional neural network; LDA: linear discriminant analysis.
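Two of the simplest sequence encodings named in Table 1, k-mer composition and one-hot encoding, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by any of the listed tools: the function names, the fixed A/C/G/T column order, and the handling of ambiguous bases (skipped for k-mers, all-zero rows for one-hot) are choices made for this example.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed column order for both encodings


def kmer_frequencies(seq, k=2):
    """Return the normalized frequency of every k-mer over ACGT.

    The result has 4**k entries ordered lexicographically (AA, AC, AG, ...).
    Windows containing characters outside ACGT are skipped (an assumption).
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


def one_hot(seq):
    """Encode a sequence as a list of 4-element rows, one row per base.

    Bases outside ACGT become all-zero rows (an assumption for this sketch).
    """
    seq = seq.upper()
    return [[1 if base == b else 0 for b in ALPHABET] for base in seq]


if __name__ == "__main__":
    print(kmer_frequencies("TATAAT", k=1))
    print(one_hot("TATAAT"))
```

The k-mer vector has a fixed length regardless of sequence length, which suits classifiers such as SVM or random forest, whereas the one-hot matrix preserves position and is the usual input form for the CNN-based tools in the table.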

