Sensitivity, Specificity, False Positives, and False Negatives (sensitivity and specificity)




In medicine, machine learning, and many other fields, these two metrics frequently come up when summarizing results to describe the characteristics of the data.

Definitions

Sensitivity: among people who are disease-positive by the gold standard, the probability that the test comes back positive. True positives. (The ability to detect those who really have the disease.)
Specificity: among people who are disease-negative by the gold standard, the probability that the test comes back negative. True negatives. (The ability to confirm those who really do not have the disease.)
False positive rate: the test returns a positive result, but that positive is wrong. That is, among gold-standard negatives, the probability of testing positive. (No disease, yet the test says there is.) Also called the misdiagnosis rate.
False negative rate: the test returns a negative result, but that negative is wrong. That is, among gold-standard positives, the probability of testing negative. (Disease present, yet the test says there is not.) Also called the missed-diagnosis rate.

Calculation

Sensitivity and specificity: complete definitions

 

True Positive (TP): a positive sample that the model predicts as positive.
True Negative (TN): a negative sample that the model predicts as negative.
False Positive (FP): a negative sample that the model predicts as positive (a false alarm).
False Negative (FN): a positive sample that the model predicts as negative (a miss).
True Positive Rate (TPR), i.e. sensitivity: TPR = TP / (TP + FN) = correctly predicted positives / actual positives.
True Negative Rate (TNR), i.e. specificity: TNR = TN / (TN + FP) = correctly predicted negatives / actual negatives.
False Positive Rate (FPR): FPR = FP / (FP + TN) = negatives predicted as positive / actual negatives.
False Negative Rate (FNR): FNR = FN / (TP + FN) = positives predicted as negative / actual positives.
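As a quick sanity check, here is a minimal Python sketch with made-up counts (the numbers are purely illustrative) that computes the four rates:

```python
# Toy confusion-matrix counts, illustrative only
TP, FN = 90, 10   # actual positives: 100
TN, FP = 80, 20   # actual negatives: 100

TPR = TP / (TP + FN)   # sensitivity / true positive rate
TNR = TN / (TN + FP)   # specificity / true negative rate
FPR = FP / (FP + TN)   # false positive rate = 1 - TNR
FNR = FN / (TP + FN)   # false negative rate = 1 - TPR

print(f"TPR={TPR:.2f} TNR={TNR:.2f} FPR={FPR:.2f} FNR={FNR:.2f}")
# -> TPR=0.90 TNR=0.80 FPR=0.20 FNR=0.10
```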

  

False positive rate = number of false positives ÷ number of gold-standard negatives

That is: false positive rate = b ÷ (b + d)

|                      | Gold standard positive (+) | Gold standard negative (−) | Total |
|----------------------|----------------------------|----------------------------|-------|
| Screening test (+)   | a                          | b                          | a + b |
| Screening test (−)   | c                          | d                          | c + d |
| Total                | a + c                      | b + d                      | N     |

The formula is: false positive rate = b / (b + d) × 100%

(b: the number of cases that screened positive but are negative by the gold standard; d: the number of concordant negatives)

False negative rate = number of false negatives ÷ number of gold-standard positives

That is: β = c ÷ (a + c)
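Putting the 2×2 table and the two formulas together, a minimal sketch with illustrative counts:

```python
# 2x2 screening table, illustrative counts only:
# a = test+ / gold+, b = test+ / gold-, c = test- / gold+, d = test- / gold-
a, b, c, d = 45, 5, 3, 47

false_positive_rate = b / (b + d)   # misdiagnosis rate
false_negative_rate = c / (a + c)   # missed-diagnosis rate (beta)
sensitivity = a / (a + c)           # = 1 - false_negative_rate
specificity = d / (b + d)           # = 1 - false_positive_rate

print(f"FPR={false_positive_rate:.2%}, FNR={false_negative_rate:.2%}")
# -> FPR=9.62%, FNR=6.25%
```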

I finally get to use this material, which is exciting; the main job is to compute the false negative and false positive rates.

My tasks:

1. Evaluate the concordance of the PacBio MHC variation calling results (CCS/non-CCS) with the results from the HiSeq data.
2. Repeat the evaluation in regions binned by sequencing depth, and from that infer the minimum depth needed for variation calling on PB MHC data.

Here each site is classified as SNP, REF, or LowQual, and then only the SNP and REF sites are used in the statistics.

Here is the statistics code I wrote in one afternoon:

```python
#!/usr/bin/env python
# Author: LI ZHIXIN

import pysam
from collections import OrderedDict


def classify_DP(depth):
    # Bin depth in steps of 5 (1-5 -> 1, 6-10 -> 2, ...), capped at 21.
    if depth > 101:
        return 21
    return (depth - 1) // 5 + 1


def parse_rec(rec):
    # Classify one gVCF record as "snp", "ref" or "LowQual" and return
    # (sample DP, class, position).
    sample = list(rec.samples)[0]
    fmt = rec.samples[sample]
    keys = fmt.keys()
    # Filter invalid lines: records missing GQ/GT/DP or without an ALT allele.
    if not all(k in keys for k in ('GQ', 'GT', 'DP')) or len(rec.alleles) < 2:
        dp = fmt['DP'] if 'DP' in keys else 0
        return dp, "LowQual", rec.pos
    if fmt['GT'] != (0, 0) and rec.qual is not None and rec.qual > 30:
        return fmt['DP'], "snp", rec.pos
    elif fmt['GT'] == (0, 0):
        return fmt['DP'], "ref", rec.pos
    return fmt['DP'], "LowQual", rec.pos


def read_gvcf(gvcf_file_path):
    # PacBio gVCF: group sites by depth bin (1..21) and by class.
    variation_dict = OrderedDict()
    for i in range(1, 22):
        variation_dict[i] = {j: [] for j in ('LowQual', 'snp', 'ref')}
    gvcf_file = pysam.VariantFile(gvcf_file_path)
    for rec in gvcf_file.fetch('chr6', 28477796, 33448354):
        DP, pos_type, pos = parse_rec(rec)
        if DP < 1 or DP > 20:
            continue
        variation_dict[DP][pos_type].append(pos)
    gvcf_file.close()
    return variation_dict


def read_hiseq_gvcf(gvcf_file_path):
    # Hiseq gVCF: group sites by class only (no depth binning).
    variation_dict = OrderedDict()
    for j in ('LowQual', 'snp', 'ref'):
        variation_dict[j] = []
    gvcf_file = pysam.VariantFile(gvcf_file_path)
    for rec in gvcf_file.fetch('chr6', 28477796, 33448354):
        DP, pos_type, pos = parse_rec(rec)
        variation_dict[pos_type].append(pos)
    gvcf_file.close()
    return variation_dict


def show_dict_diff_DP(Hiseq_unified_variation_dict, PB_non_CCS_variation_dict, outf, outf2):
    for DP in range(1, 21):
        Hiseq_snp = set(Hiseq_unified_variation_dict['snp'])
        Hiseq_ref = set(Hiseq_unified_variation_dict['ref'])
        Hiseq_lowqual = set(Hiseq_unified_variation_dict['LowQual'])
        PB_snp = PB_non_CCS_variation_dict[DP]['snp']
        PB_ref = PB_non_CCS_variation_dict[DP]['ref']
        PB_lowqual = PB_non_CCS_variation_dict[DP]['LowQual']
        # Restrict the Hiseq sets to positions that PacBio covered at this depth.
        total = set(PB_snp + PB_ref + PB_lowqual)
        Hiseq_snp &= total
        Hiseq_ref &= total
        Hiseq_lowqual &= total
        PB_snp, PB_ref, PB_lowqual = set(PB_snp), set(PB_ref), set(PB_lowqual)
        # 3x3 table: rows = PacBio class, columns = Hiseq class.
        a = len(Hiseq_snp & PB_snp)
        b = len(Hiseq_ref & PB_snp)
        c = len(Hiseq_lowqual & PB_snp)
        d = len(Hiseq_snp & PB_ref)
        e = len(Hiseq_ref & PB_ref)
        f = len(Hiseq_lowqual & PB_ref)
        g = len(Hiseq_snp & PB_lowqual)
        h = len(Hiseq_ref & PB_lowqual)
        i = len(Hiseq_lowqual & PB_lowqual)
        n = a + b + c + d + e + f + g + h + i
        Low_total = (g + h + i) / n if n else 0.0
        PPV = "NA" if (a + b) == 0 else "%.4f" % (a / (a + b))
        TPR = "NA" if (a + d) == 0 else "%.4f" % (a / (a + d))
        print(str(DP) + " :\n", a, b, c, "\n", d, e, f, "\n", g, h, i, "\n",
              file=outf2, sep='\t', end='\n')
        print(DP, TPR, PPV, "%.4f" % Low_total, file=outf, sep='\t', end='\n')


with open("./depth_stat.txt", "w") as outf, open("raw.txt", "w") as outf2:
    print("Depth", "TPR", "PPV", "Low_total", file=outf, sep='\t', end='\n')
    Hiseq_unified_variation_dict = read_hiseq_gvcf("./hiseq_call_gvcf/MHC_Hiseq.unified.gvcf.gz")
    PB_non_CCS_variation_dict = read_gvcf("./non_CCS_PB_call_gvcf/MHC_non_CCS.unified.gvcf.gz")
    show_dict_diff_DP(Hiseq_unified_variation_dict, PB_non_CCS_variation_dict, outf, outf2)
```

  

I also ran into a more advanced bit of Python: how do you break out of the outer loop from inside a nested loop? I used a manual flag; is there a better way?
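For reference, two common alternatives to a manual flag (a generic sketch, not tied to the script above): Python's for ... else clause, or pushing the nested loops into a function and simply returning.

```python
def is_target(item):
    # Placeholder predicate, just for illustration.
    return item == 9

rows = [[1, 2, 3], [4, 9, 6], [7, 8]]

# Alternative 1: for ... else. The else clause runs only when the inner loop
# finishes without a break, so a second break escapes the outer loop as well.
found = None
for row in rows:
    for item in row:
        if is_target(item):
            found = item
            break
    else:
        continue    # inner loop ended normally -> keep scanning
    break           # inner loop broke -> leave the outer loop too
print(found)        # 9

# Alternative 2: put the nested loops in a function and just return.
def find_target(rows):
    for row in rows:
        for item in row:
            if is_target(item):
                return item
    return None

print(find_target(rows))  # 9
```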

How do you compute the coverage and depth of the raw sequencing data? Of course this can only be done after alignment, and the alignment itself needs some processing first.
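As a rough pysam sketch over the same MHC interval (the BAM filename is a placeholder; in practice this would run on the sorted, indexed, post-processed alignment):

```python
import pysam

# Placeholder path; assumes a coordinate-sorted, indexed BAM.
bam = pysam.AlignmentFile("sample.sorted.bam", "rb")

# count_coverage returns one array per base (A, C, G, T);
# summing the four gives total depth at each position.
acgt = bam.count_coverage("chr6", 28477796, 33448354)
depths = [sum(col) for col in zip(*acgt)]

covered = sum(1 for d in depths if d > 0)
print("mean depth: %.2f" % (sum(depths) / len(depths)))
print("coverage (>=1x): %.2f%%" % (100.0 * covered / len(depths)))
bam.close()
```

The same numbers can also be pulled straight from `samtools depth` on the command line.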

When deciding whether a site is a SNP, an indel, or Ref, you have to look not only at ref, alts, qual, and GQ, but also at GT and DP, so it is more involved than it first appears.
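For example, a rough sketch of such a check on a pysam VariantRecord might look like this (the thresholds and the '<NON_REF>' handling are illustrative, not the exact rules used in the script above):

```python
def classify_site(rec, sample, min_qual=30, min_gq=20, min_dp=4):
    # Illustrative thresholds only.
    fmt = rec.samples[sample]
    keys = fmt.keys()
    if not all(k in keys for k in ('GT', 'GQ', 'DP')):
        return "LowQual"
    if rec.qual is None or rec.qual < min_qual or fmt['GQ'] < min_gq or fmt['DP'] < min_dp:
        return "LowQual"
    if fmt['GT'] == (0, 0):
        return "ref"
    # Look at the ALT alleles actually called in the genotype.
    called = [rec.alleles[i] for i in fmt['GT']
              if i not in (None, 0) and rec.alleles[i] != '<NON_REF>']
    if not called:
        return "LowQual"
    # A length change relative to the reference allele means an indel.
    if any(len(alt) != len(rec.ref) for alt in called):
        return "indel"
    return "snp"
```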

 

Finally, how do we tackle the second question: the minimum depth for calling variation?

Compute the false negative and false positive rates in each depth bin and see at which depth they saturate.
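One simple way to read off that saturation point from the depth_stat.txt written above is to scan (or plot) TPR against depth; a minimal sketch, with an arbitrary 0.01 "no longer improving" margin chosen just for illustration:

```python
# Scan depth_stat.txt (written by the script above) and report the first depth
# at which TPR stops improving by more than a small margin.
rows = []
with open("depth_stat.txt") as fh:
    next(fh)  # skip header: Depth  TPR  PPV  Low_total
    for line in fh:
        depth, tpr, ppv, low = line.rstrip("\n").split("\t")
        if tpr != "NA":
            rows.append((int(depth), float(tpr)))

margin = 0.01  # arbitrary saturation threshold, illustrative only
for (d1, t1), (d2, t2) in zip(rows, rows[1:]):
    if t2 - t1 < margin:
        print("TPR roughly saturates around depth", d2)
        break
```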

 


