使用gatk检测WES数据中的cnv

您所在的位置:网站首页 cnv算法 使用gatk检测WES数据中的cnv

使用gatk检测WES数据中的cnv

#使用gatk检测WES数据中的cnv| 来源: 网络整理| 查看: 265

gatk的cnv流程对环境依赖较高,需要调用许多python包,推荐在dockerhub里找官方镜像,或者用conda来配置环境。 1、dockerhub 在本地的docker环境中直接拉取镜像,如果没有root权限就用conda安装。

docker pull broadinstitute/gatk:4.1.6.0

2、conda 先下载一个miniconda或者anaconda,然后下载好gatk的安装包解压

wget https://github.com/broadinstitute/gatk/archive/4.1.6.0.tar.gz tar -xzvf 4.1.6.0.tar.gz 解压出来会有一个yml文件 `gatkcondaenv.yml`,使用这个文件创建环境 conda env create -n gatk -f gatkcondaenv.yml 环境好了以后直接source就好了 source activate gatk

接下来开始构建CNV的baseline,构建baseline需要无大片段CNV的一些样本,拿到重比对后的bam文件就可以开始操作了。

Step0、把文件地址配置好,参考基因组及interval_list准备妥当 (gatk对格式要求较为严格,需要严格按照文档说明来准备interval_list文件) (https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists)

#!/bin/bash gatk=/data/tool/gatk-4.1.6.0/gatk ref=/data/database/ref/human/ucsc.hg19.fasta ref_dict=/data/database/ref/human/ucsc.hg19.dict interval=/data/database/ref/Exome_Target_hg19_ucsc.interval_list

Step1、窗口划分

# --bin-length 为你的窗口大小,-imr OVERLAPPING_ONLY 意思为捕获区间有重叠就会合并窗口 $gatk PreprocessIntervals -R $ref -L $interval --bin-length 5000 -imr OVERLAPPING_ONLY -O targets.preprocessed.5000.interval_list

Step2、计算样本每个窗口的reads

for num in 1 2 3 4 5 6 7 8 9 do input_bam=/data/WES/TestPath/sample${num}/realign/sample${num}.bam $gatk CollectReadCounts -L targets.preprocessed.5000.interval_list -R $ref -imr OVERLAPPING_ONLY -I $input_bam --format TSV -O sample${num}.tsv done

Step3、窗口文件添加GC信息 (在窗口文件最后一列添加GC含量,下一步会对GC异常的窗口进行剔除)

$gatk AnnotateIntervals -L targets.preprocessed.5000.interval_list -R $ref -imr OVERLAPPING_ONLY -O targets.preprocessed.5000.annotated.tsv

Step4、剔除GC异常以及reads数偏多/偏少的窗口

sample_rc='' for num in 1 2 3 4 5 6 7 8 9 do sample_rc=${sample_rc}' -I 'sample${num}.tsv done $gatk FilterIntervals -L targets.preprocessed.5000.interval_list --annotated-intervals targets.preprocessed.5000.annotated.tsv $sample_rc -imr OVERLAPPING_ONLY -O gc.filtered.bin5000.interval_list

Step5、根据先验值确定倍性

$gatk DetermineGermlineContigPloidy -L gc.filtered.bin5000.interval_list --interval-merging-rule OVERLAPPING_ONLY $sample_rc --contig-ploidy-priors contig_ploidy_priors.tsv --output ploidy-calls --output-prefix ploidy --verbosity DEBUG # contig_ploidy_priors.tsv 文件格式参考https://gatk.broadinstitute.org/hc/en-us/articles/360042746591-DetermineGermlineContigPloidy

Step6、构建参考集样本的baseline

$gatk GermlineCNVCaller --run-mode COHORT -L gc.filtered.bin5000.interval_list $sample_rc --contig-ploidy-calls ploidy-calls/ploidy-calls --annotated-intervals targets.preprocessed.5000.annotated.tsv --interval-merging-rule OVERLAPPING_ONLY --output baseline --output-prefix baseline --verbosity DEBUG

Step7、对单样本进行CNV检测

# 拿到baseline过后,CNV检测需要分三步走 $gatk DetermineGermlineContigPloidy --model ploidy-calls/ploidy-model -I sample.tsv -O sample --output-prefix sample --verbosity DEBUG $gatk GermlineCNVCaller --run-mode CASE -I sample.tsv --contig-ploidy-calls sample/sample-calls --model baseline/baseline-model --output sample_call --output-prefix sample --verbosity DEBUG # call完cnv后还需要转成vcf格式方便查看处理 $gatk PostprocessGermlineCNVCalls --model-shard-path baseline/baseline-model --calls-shard-path sample_call/sample-calls --allosomal-contig chrX --allosomal-contig chrY --contig-ploidy-calls sample/sample-calls --sample-index 0 --output-genotyped-intervals sample-intervals.vcf --output-genotyped-segments sample-segments.vcf --output-denoised-copy-ratios sample-ratio.vcf --sequence-dictionary $ref_dict

下面就是用到的整套流程:

#!/bin/bash # make cnv baseline (gatk) source activate gatk gatk=/data/tool/gatk-4.1.6.0/gatk ref=/data/WES/database/ref/human/ucsc.hg19.fasta ref_dict=/data/WES/database/ref/human/ucsc.hg19.dict interval=Exome_Target_hg19_ucsc.interval_list # make cnv interval $gatk PreprocessIntervals -R $ref -L $interval --bin-length 5000 -imr OVERLAPPING_ONLY -O targets.preprocessed.5000.interval_list # get sample reads counts for num in 1 2 3 4 5 6 7 8 9 do input_bam=/data/WES/TestPath/sample${num}/realign/sample${num}.bam $gatk CollectReadCounts -L targets.preprocessed.interval_list -R $ref -imr OVERLAPPING_ONLY -I $input_bam --format TSV -O sample${num}.tsv done # anno gc content in interval $gatk AnnotateIntervals -L targets.preprocessed.5000.interval_list -R $ref -imr OVERLAPPING_ONLY -O targets.preprocessed.5000.annotated.tsv # interval filter sample_rc='' for num in 1 2 3 4 5 6 7 8 9 do sample_rc=${sample_rc}' -I 'sample${num}.tsv done $gatk FilterIntervals -L targets.preprocessed.5000.interval_list --annotated-intervals targets.preprocessed.5000.annotated.tsv $sample_rc -imr OVERLAPPING_ONLY -O gc.filtered.bin5000.interval_list # make ploidy priors $gatk DetermineGermlineContigPloidy -L gc.filtered.bin5000.interval_list --interval-merging-rule OVERLAPPING_ONLY $sample_rc --contig-ploidy-priors contig_ploidy_priors.tsv --output ploidy-calls --output-prefix ploidy --verbosity DEBUG # make baseline $gatk GermlineCNVCaller --run-mode COHORT -L gc.filtered.bin5000.interval_list $sample_rc --contig-ploidy-calls ploidy-calls/ploidy-calls --annotated-intervals targets.preprocessed.5000.annotated.tsv --interval-merging-rule OVERLAPPING_ONLY --output baseline --output-prefix baseline --verbosity DEBUG # call cnv $gatk DetermineGermlineContigPloidy --model ploidy-calls/ploidy-model -I sample.tsv -O sample --output-prefix sample --verbosity DEBUG $gatk GermlineCNVCaller --run-mode CASE -I sample.tsv --contig-ploidy-calls sample/sample-calls --model baseline/baseline-model --output sample_call --output-prefix sample --verbosity DEBUG $gatk PostprocessGermlineCNVCalls --model-shard-path baseline/baseline-model --calls-shard-path sample_call/sample-calls --allosomal-contig chrX --allosomal-contig chrY --contig-ploidy-calls sample/sample-calls --sample-index 0 --output-genotyped-intervals sample-intervals.vcf --output-genotyped-segments sample-segments.vcf --output-denoised-copy-ratios sample-ratio.vcf --sequence-dictionary $ref_dict


【本文地址】


今日新闻


推荐新闻


CopyRight 2018-2019 办公设备维修网 版权所有 豫ICP备15022753号-3