使用gatk检测WES数据中的cnv |
您所在的位置:网站首页 › cnv算法 › 使用gatk检测WES数据中的cnv |
gatk的cnv流程对环境依赖较高,需要调用许多python包,推荐在dockerhub里找官方镜像,或者用conda来配置环境。 1、dockerhub 在本地的docker环境中直接拉取镜像,如果没有root权限就用conda安装。 docker pull broadinstitute/gatk:4.1.6.02、conda 先下载一个miniconda或者anaconda,然后下载好gatk的安装包解压 wget https://github.com/broadinstitute/gatk/archive/4.1.6.0.tar.gz tar -xzvf 4.1.6.0.tar.gz 解压出来会有一个yml文件 `gatkcondaenv.yml`,使用这个文件创建环境 conda env create -n gatk -f gatkcondaenv.yml 环境好了以后直接source就好了 source activate gatk接下来开始构建CNV的baseline,构建baseline需要无大片段CNV的一些样本,拿到重比对后的bam文件就可以开始操作了。 Step0、把文件地址配置好,参考基因组及interval_list准备妥当 (gatk对格式要求较为严格,需要严格按照文档说明来准备interval_list文件) (https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists) #!/bin/bash gatk=/data/tool/gatk-4.1.6.0/gatk ref=/data/database/ref/human/ucsc.hg19.fasta ref_dict=/data/database/ref/human/ucsc.hg19.dict interval=/data/database/ref/Exome_Target_hg19_ucsc.interval_listStep1、窗口划分 # --bin-length 为你的窗口大小,-imr OVERLAPPING_ONLY 意思为捕获区间有重叠就会合并窗口 $gatk PreprocessIntervals -R $ref -L $interval --bin-length 5000 -imr OVERLAPPING_ONLY -O targets.preprocessed.5000.interval_listStep2、计算样本每个窗口的reads for num in 1 2 3 4 5 6 7 8 9 do input_bam=/data/WES/TestPath/sample${num}/realign/sample${num}.bam $gatk CollectReadCounts -L targets.preprocessed.5000.interval_list -R $ref -imr OVERLAPPING_ONLY -I $input_bam --format TSV -O sample${num}.tsv doneStep3、窗口文件添加GC信息 (在窗口文件最后一列添加GC含量,下一步会对GC异常的窗口进行剔除) $gatk AnnotateIntervals -L targets.preprocessed.5000.interval_list -R $ref -imr OVERLAPPING_ONLY -O targets.preprocessed.5000.annotated.tsvStep4、剔除GC异常以及reads数偏多/偏少的窗口 sample_rc='' for num in 1 2 3 4 5 6 7 8 9 do sample_rc=${sample_rc}' -I 'sample${num}.tsv done $gatk FilterIntervals -L targets.preprocessed.5000.interval_list --annotated-intervals targets.preprocessed.5000.annotated.tsv $sample_rc -imr OVERLAPPING_ONLY -O gc.filtered.bin5000.interval_listStep5、根据先验值确定倍性 $gatk DetermineGermlineContigPloidy -L gc.filtered.bin5000.interval_list --interval-merging-rule OVERLAPPING_ONLY $sample_rc --contig-ploidy-priors contig_ploidy_priors.tsv --output ploidy-calls --output-prefix ploidy --verbosity DEBUG # contig_ploidy_priors.tsv 文件格式参考https://gatk.broadinstitute.org/hc/en-us/articles/360042746591-DetermineGermlineContigPloidyStep6、构建参考集样本的baseline $gatk GermlineCNVCaller --run-mode COHORT -L gc.filtered.bin5000.interval_list $sample_rc --contig-ploidy-calls ploidy-calls/ploidy-calls --annotated-intervals targets.preprocessed.5000.annotated.tsv --interval-merging-rule OVERLAPPING_ONLY --output baseline --output-prefix baseline --verbosity DEBUGStep7、对单样本进行CNV检测 # 拿到baseline过后,CNV检测需要分三步走 $gatk DetermineGermlineContigPloidy --model ploidy-calls/ploidy-model -I sample.tsv -O sample --output-prefix sample --verbosity DEBUG $gatk GermlineCNVCaller --run-mode CASE -I sample.tsv --contig-ploidy-calls sample/sample-calls --model baseline/baseline-model --output sample_call --output-prefix sample --verbosity DEBUG # call完cnv后还需要转成vcf格式方便查看处理 $gatk PostprocessGermlineCNVCalls --model-shard-path baseline/baseline-model --calls-shard-path sample_call/sample-calls --allosomal-contig chrX --allosomal-contig chrY --contig-ploidy-calls sample/sample-calls --sample-index 0 --output-genotyped-intervals sample-intervals.vcf --output-genotyped-segments sample-segments.vcf --output-denoised-copy-ratios sample-ratio.vcf --sequence-dictionary $ref_dict下面就是用到的整套流程: #!/bin/bash # make cnv baseline (gatk) source activate gatk gatk=/data/tool/gatk-4.1.6.0/gatk ref=/data/WES/database/ref/human/ucsc.hg19.fasta ref_dict=/data/WES/database/ref/human/ucsc.hg19.dict interval=Exome_Target_hg19_ucsc.interval_list # make cnv interval $gatk PreprocessIntervals -R $ref -L $interval --bin-length 5000 -imr OVERLAPPING_ONLY -O targets.preprocessed.5000.interval_list # get sample reads counts for num in 1 2 3 4 5 6 7 8 9 do input_bam=/data/WES/TestPath/sample${num}/realign/sample${num}.bam $gatk CollectReadCounts -L targets.preprocessed.interval_list -R $ref -imr OVERLAPPING_ONLY -I $input_bam --format TSV -O sample${num}.tsv done # anno gc content in interval $gatk AnnotateIntervals -L targets.preprocessed.5000.interval_list -R $ref -imr OVERLAPPING_ONLY -O targets.preprocessed.5000.annotated.tsv # interval filter sample_rc='' for num in 1 2 3 4 5 6 7 8 9 do sample_rc=${sample_rc}' -I 'sample${num}.tsv done $gatk FilterIntervals -L targets.preprocessed.5000.interval_list --annotated-intervals targets.preprocessed.5000.annotated.tsv $sample_rc -imr OVERLAPPING_ONLY -O gc.filtered.bin5000.interval_list # make ploidy priors $gatk DetermineGermlineContigPloidy -L gc.filtered.bin5000.interval_list --interval-merging-rule OVERLAPPING_ONLY $sample_rc --contig-ploidy-priors contig_ploidy_priors.tsv --output ploidy-calls --output-prefix ploidy --verbosity DEBUG # make baseline $gatk GermlineCNVCaller --run-mode COHORT -L gc.filtered.bin5000.interval_list $sample_rc --contig-ploidy-calls ploidy-calls/ploidy-calls --annotated-intervals targets.preprocessed.5000.annotated.tsv --interval-merging-rule OVERLAPPING_ONLY --output baseline --output-prefix baseline --verbosity DEBUG # call cnv $gatk DetermineGermlineContigPloidy --model ploidy-calls/ploidy-model -I sample.tsv -O sample --output-prefix sample --verbosity DEBUG $gatk GermlineCNVCaller --run-mode CASE -I sample.tsv --contig-ploidy-calls sample/sample-calls --model baseline/baseline-model --output sample_call --output-prefix sample --verbosity DEBUG $gatk PostprocessGermlineCNVCalls --model-shard-path baseline/baseline-model --calls-shard-path sample_call/sample-calls --allosomal-contig chrX --allosomal-contig chrY --contig-ploidy-calls sample/sample-calls --sample-index 0 --output-genotyped-intervals sample-intervals.vcf --output-genotyped-segments sample-segments.vcf --output-denoised-copy-ratios sample-ratio.vcf --sequence-dictionary $ref_dict |
CopyRight 2018-2019 办公设备维修网 版权所有 豫ICP备15022753号-3 |