PLINK: Whole genome data analysis toolset

您所在的位置：网站首页 › mergemode › PLINK: Whole genome data analysis toolset

PLINK: Whole genome data analysis toolset

2024-07-17 07:38| 来源: 网络整理| 查看: 265

Data management tools PLINK provides a simple interface for recoding, reordering, merging, flipping DNA-strand and extracting subsets of data. Recode and reorder a sample A basic, but often useful feature, is to output a dataset: with the PED file markers reordered for physical position, with excluded SNPs (negative values in the MAP file) excluded from the new PED file possibly excluding other SNPs based on filters such as genotyping rate possibly recoding the SNPs to a 1/2 coding possibly recoding the SNPs between letters and numbers (A,C,G,T / 1,2,3,4) possibly transposing the genotype file (SNPs as rows) possibly recoding the SNP to an additive and dominant pair of components possibly listing the data with each specific genotype as a distinct row possibly listing the data one genotype per row possibly listing only minor alleles The basic option to generate a new dataset is the --recode option: plink --file data --recode which will output the allele labels as they appear in the original; also, the missing genotype code is preserved if this is different from 0. Also, if --output-missing-genotype is specified (which can be as well as --missing-genotype) then this value will be used instead (i.e. so that input and output files can have different missing codes; this also applies to the phenotype with --output-missing-phenotype and --missing-phenotype). The --make-bed option does the same as --recode but creates binary files; these can also be filtered, etc, as described below.

In contrast,

plink --file data --recode12 will recode the alleles as 1 and 2 (and the missing genotype will always be 0). Both these commands will create two new files plink.ped plink.map (where, as usual, "plink" would be replaced by any specified --out {filename} ). Unless manually specified, for all these options, the usual filters for missingness and allele frequency will be set so as not to exclude any SNPs or individuals. By explicitly including an option, e.g. --maf 0.05 on the command line, this behaviour is overriden (see this page).

By default, any --recode option, and also --make-bed will preserve all genotypes exactly as they are. To set to missing Mendel errors or heterozygous haploid calls, use the options --set-me-missing and --set-hh-missing respectively. For the former, you will also need to specify --me 1 1 (i.e. to invole an evalation of Mendel errors, which does not occur by default, by not excluding any individuals or SNPs based on the results, i.e. if you only want to zero-out certain genotypes).

To recode SNP alleles from A,C,G,T to 1,2,3,4 or vice versa, use --allele1234 (to go from letters to numbers) and --alleleACGT (to go from numbers to letters). These flags should be used in conjunction with a data generation command (e.g. --make-bed), or any other analysis or summary statistic option. Alleles other than A,C,G,T or 1,2,3,4 will be left unchanged.

It is sometimes useful to have a PED file that is tab-delimited, except that between alleles of the same genotype a space instead of a tab is used. A file formatted in this way can load into Excel, for example, as a tab-delimited file, but with one genotype per column instead of one allele per column. Use the option --tab as well as --recode or --recode12 to achieve this effect.

To make a new file in which non-founders without both parents also in the same fileset are recoded as founders (i.e. pat and mat codes set both to 0), add the --make-founders flag. Transposed genotype files When using either --recode or --recode12, you can obtain a transposed text genotype file by adding the --transpose option. This generates two files: plink.tped plink.fam The first contains the genotype data, with SNPs as rows and individuals as columns, for example: if the original file was 1 1 0 0 1 1 1 1 G G 1 2 0 0 2 1 0 0 A G 1 3 0 0 1 1 1 1 A G 1 4 0 0 2 1 2 1 A A then this would generate 1 snp1 0 10001 1 1 0 0 1 1 2 1 1 snp2 0 20001 G G G A G A A A The first four columns are from the MAP file (chromosome, SNP ID, genetic position, physical position), followed by the genotype data. The plink.fam gives the ID, sex and phenotype information for each individual. The order of individuals in this file is the same as the order across the columns of the TPED file. The FAM file is just the first six columns of the PED file (or literally the same FAM file if the input where a binary fileset). Additive and dominance components The following format is often useful if one wants to use a standard, non-genetic statistical package to analyse the data, as here genotypes are coded as a single allele dosage number. To create a file with SNP genotypes recoded in terms of additive and dominant components, use the option: plink --file data --recodeAD which, assuming C is the minor allele, will recode genotypes as follows: SNP SNP_A , SNP_HET --- ----- ----- A A -> 0 , 0 A C -> 1 , 1 C C -> 2 , 0 0 0 -> NA , NA In otherwords, the default for the additive recoding is to count the number of minor alleles per person. The --recodeAD option produces both an additive and dominance coding: use --recodeA instead to skip the SNP_HET coding. The --recodeAD option saves the data to a single file plink.raw which has a header row indicating the SNP names (with _A and _HET appended to the SNP names to represent additive and dominant components, respectively). For example, consider the following PED file, which has two SNPs: 1 1 0 0 1 1 1 1 G G 1 2 0 0 2 1 0 0 A G 1 3 0 0 1 1 1 1 A G 1 4 0 0 2 1 2 1 A A Using the --recodeAD option generates the file plink-recode.raw: FID IID PAT MAT SEX PHENOTYPE snp1_2 snp1_HET snp2_G snp2_HET 1 1 0 0 1 1 0 0 2 0 1 2 0 0 2 1 NA NA 1 1 1 3 0 0 1 1 0 0 1 1 1 4 0 0 2 1 1 1 0 0 The column labels reflect the snp name (e.g. snp1) with the name of the minor allele appended (i.e. snp1_2 in the first instance, as 2 is the minor allele) for the additive component. The dominant component ( a dummy variable reflecting heterozygote state) is coded with the _HET suffix. This file can be easily loaded into R: for example: d For example, for the first SNP, the individuals are coded 1/1, 0/0, 1/1 and 2/1. The additive count of the number of common (1) alleles is therefore: 2, NA, 2 and 1, which is reflected in the field snp1_2. The field snp1_HET is coded 1 for the fourth individual who is heterozygous -- this field can be used to model dominance effect of the allele. The behavior of the --recodeA and --recodeAD commands can be changed with the --recode-allele command. This allows for the 0, 1, 2 count to reflect the number of a pre-specified allele type per SNP, rather than the number of the minor allele. This command takes as a single argument the name of a file that lists SNP name and allele to report, e.g. if the file recode.txt contained snp1 1 snp2 A then plink --file data --recodeAD --recode-allele recode.txt would now report in the LOG file Reading allele coding list from [ recode.txt ] Read allele codes for 2 SNPs and the plink.raw file would read FID IID PAT MAT SEX PHENOTYPE snp1_1 snp1_HET snp2_A snp2_HET 1 1 0 0 1 1 2 0 0 0 1 2 0 0 2 1 NA NA 1 1 1 3 0 0 1 1 2 0 1 1 1 4 0 0 2 1 1 1 2 0 If the SNP is monomorphic, by default the allele code out will be 0 and all individuals will have a count of 0 (or NA). If an allele is specified in --recode-allele that is not seen in the data, similarly all individuals will receive a 0 count (i.e. rather than an error being given). NOTE For alleles that have exactly 0.50 minor allele frequency, as for the second SNP in the example above, then which allele is labelled as minor will depend on which was first encountered in the PED file. Listing by minor allele count The command --recode-rlist will generate a files plink.rlist plink.fam plink.map where the plink.rlist file format is SNP GENOTYPE (BOTH ALLELES) FID/IID PAIRS ... For example, consider a particular SNP, rs2379981 has a minor allele (G) seen twice (in two heterozygotes) and two individuals with a missing genotpe; all other individuals are homozygous for the major allele. In this case, we would see two rows in the pink.rlist file: rs2379981 HET G A CH18612 NA18612 JA18998 NA18998 rs2379981 NIL 0 0 JA18999 NA18999 JA19003 NA19003 indicating, for example, that individual FID/IID CH18612/NA18612 has a rare heterozygote. This command could be used in conjunction with the --reference command and --freq to list all instances of rare non-reference alleles, e.g. from resequencing study data. Listing by long-format (LGEN) To output a file in the LGEN format, use the command --recode-lgen which generates files plink.lgen plink.fam plink.map that can be read with the --lfile command. The --with-reference with generate a fourth file plink.ref that can be read back in with the --reference command when using --lfile. Listing by genotype Another format that might sometimes be useful is the --list option which genetes a file plink.list that is ordered one genotype per row, listing all family and individual IDs of people with that genotype. For example, if we have a file with two SNPs rs1001 and rs2002 (both on chromosome 1): A 1 0 0 1 2 A A 1 1 B 2 0 0 1 2 A C 0 0 C 3 0 0 1 1 A C 1 2 D 4 0 0 1 1 C C 1 2 then then option plink --file mydata --list will generate the file plink.list 1 rs1001 AA A 1 1 rs1001 AC B 2 C 3 1 rs1001 CC D 4 1 rs1001 00 1 rs2002 22 1 rs2002 21 C 3 D 4 1 rs2002 11 A 1 1 rs2002 00 B 2 which has columns Chromosome SNP identifier Genotype Family ID, Individual ID for 1st person Family ID, Individual ID for 2nd person ... Family ID, Individual ID for final person Obviously, different rows will have a different number of columns. Here, we see that individual A 1 has the A/A genotype for rs1001, etc. This option is often useful in conjunction with --snp, if you want an easy breakdown of which individuals have which genotypes. Write SNP list files To output just the list of SNPs that remain after all filtering, etc, use the --write-snplist command, e.g. to get a list of all high frequency, high genotyping-rate SNPs: plink --bfile mydata --maf 0.05 --geno 0.05 --write-snplist which generates a file plink.snplist This file is simply a list of included SNP names, i.e. the same SNPs that a --recode or --make-bed statement would have produced in the corresponding MAP or BIM files. Update SNP information To automatically update either the genetic or physical positions for some or all SNPs in a dataset, use the --update-map command, which takes a single parameter of a filename, e.g. plink --bfile mydata --update-map build36.txt --make-bed --out mydata2 where, for example, the file build36.txt contains new physical positions for SNPs, based on dbSNP126/build 36, in the simple format of SNP/position per line, e.g. rs100001 1000202 rs100002 6252678 rs100003 7635353 ... To change genetic position (3rd column in map file) add the flag --update-cm as well as --update-map. There is no way to change chromosome codes using this command. Normally, one would want to save the new file with the changed positions, as in the example above, although one could combine other commands instead (e.g. association testing, etc) although the updated positions would then be lost (i.e. the changes are not automatically saved). The file with new SNP information does not need to feature all of the SNPs in the current dataset: SNPs not in this file will be left unchanged. If a SNP is listed more than once in the file, an error will be reported. NOTE When updating the map positions, it is possible that the implied ordering of SNPs in the dataset might change. If this is the case, a message will be written to the LOG file. Although the positions are updated, the order is not changed internally: as SNPs might be out of order, it is important to correct this by saving and reloading the file. For example, the if the original contains ... rs10001 500000 rs10002 520000 rs10003 540000 rs10004 560000 ... but we update rs10002 to position 580000, the data will be ... rs10001 500000 rs10002 580000 rs10003 540000 rs10004 560000 ... Only after saving and reloading (e.g. --make-bed / --bfile ) will the file be in the correct order ... rs10001 500000 rs10003 540000 rs10004 560000 rs10002 580000 ... This will only be an issue for commands which rely on relative SNP positions (e.g. --hap-window, --homozyg, etc). If the LOG file does not show a message that the order of SNPs has changed after using --update-map, one need not worry. The name and chromosome code of a SNP can also be changed, by adding the modifiers --update-name or --update-chr, e.g. ./plink --bfile mydata --update-map rsID.lst --update-name --make-bed --out mydata2 or ./plink --bfile mydata --update-map chr-codes.txt --update-chr --make-bed --out mydata2 In both case, the format of the input file should be two columns per line, e.g. SNP_A-1919191 rs123456 SNP_A-64646464 rs222222 ... or, for chromosome codes (use numeric values and codes X, Y, etc) rs123456 1 rs987654 18 rs678678 X .. You cannot update more than one attribute at a time for SNPs. Update allele information To recode alleles, for example from A,B allele coding to A,C,G,T coding, use the command --update-alleles, for example ./plink --bfile mydata --update-alleles mylist.txt --make-bed --out newfile where the file mylist.txt contains five columns per row listing, SNP identifier Old allele code for one allele Old allele code for other allele New allele code for first allele New allele code for other allele For example, rs10001 A B G T rs10002 A B A C ... will change allele A to G and allele B to T for rs10001, etc. Force a specific reference allele It is possible to manually specify which allele is the A1 allele and which is A2. By default, the minor allele is assigned to be A1. All odds ratios, etc, are calculated with respect to the A1 allele (i.e. an odds ratio greater than 1 implies that the A1 allele increases risk). To set a particular allele as A1, which might not be the minor allele, use the command --reference-allele, which can be used with any other analysis or data generation command, e.g. ./plink --bfile mydata --reference-allele mylist.txt --assoc where the file mylist.txt contains a list of SNP IDs and the allele to be set as A1, e.g. rs10001 A rs10002 T rs10003 T ... This command can make comparing results across studies easier, so that odds ratios reported can be made to be in the same direction as the other study, for example. Update individual information Rather than try to manually edit PED or FAM files (which is not advised), use these functions to change ID codes, sex and parental information for individuals in a fileset. The command plink --bfile mydata --update-ids recoded.txt --make-bed --out mydata2 changes ID codes for individuals specified in recoded.txt, which should be in the format of four columnds per row: old FID, old IID, new FID, new IID, e.g. FA 1001 F0001 I0001 FA 1002.dup F0002 I0002 ... will, for example find the person FA/1001 and change their FID/IID values to F0001/I0001. Not all people need be listed in the file (they will not be changed; the order of the file need not match the original dataset. Two simular commands (but that cannot be run at the same time as --update-ids) are --update-sex myfile1.txt that expects 3 columns per row: FID IID SEX Coded 1/2/0 for M/F/missing and --update-parents myfile2.txt that expects 4 columns per row: FID IID PAT New paternal IID code MAT New maternal IID code PLINK does not check see whether the new parents actually exist in the current file. With all of these commands, you need to issue a data output command (--make-bed, --recode, etc) for the changes to be preserved. Write covariate files If a covariate file is specified along with any of the above --recode options or with --make-bed, then that covariate file will also be written, as plink.cov by default. This option is useful if the covariate file has a different number of individuals, or is ordered differently, to produce a set of covariate values that line up more easily with the newly-created genotype and phenotype files. plink --file data --covar myfile.txt --recode creates also plink.cov. If you want just to create a revised version of the covariate file, but without creating a new set of genotype files, then use the --write-covar option. This can be used in conjunction with filters, etc, to output, for example, only covariates for high-genotyping (99%) cases, as in this example: plink --file data --write-covar myfile.txt --filter-cases --mind 0.01 will output just the relevant lines of myfile.txt to plink.cov, sorted to match the order of data.ped. To also include phenotype information in the plink.cov file add the flag --with-phenotype. This can be useful, for example, when used in conjunction with --recodeA to generate the files needed to replicate an analysis in R (e.g. extracting the appropriate genotype data, and applying filters, etc). To recode a categorical variable to a set of binary dummy variables, add the command --dummy-coding for example ./plink --bfile mydate --covar cdata.raw --write-covar --dummy-coding If the original covariate had two fields, a categorical variable with 8 levels (coded 0 to 7, although it could have any numeric coding, e.g. 100, 150, 200, 250, etc), and a second variable that was continuous, e.g. A8504 1 5 0.606218 A8008 1 1 0.442154 A8542 1 7 0.388042 A8022 1 2 0.286125 A8024 1 3 0.903004 A8026 1 4 0.790778 A8524 1 -9 0.713952 A8556 1 0 0.814292 A8562 1 1 0.803336 ... then the command above will create mynewfile.cov, with added header row, with the fields: FID Family ID IID Individual ID COV1_2 Dummy variable for first covariate, coded 1/0 for 2/other COV1_3 Dummy variable for first covariate, coded 1/0 for 3/other COV1_4 etc COV1_5 COV1_6 COV1_7 COV1_0 COV2 Unchanged continuous covariate Thus mynewfile.cov is as follows (spaces added for clarity): FID IID COV1_2 COV1_3 COV1_4 COV1_5 COV1_6 COV1_7 COV1_0 COV2 A8504 1 0 0 0 1 0 0 0 0.606218 A8008 1 0 0 0 0 0 0 0 0.442154 A8542 1 0 0 0 0 0 1 0 0.388042 A8022 1 1 0 0 0 0 0 0 0.286125 A8024 1 0 1 0 0 0 0 0 0.903004 A8026 1 0 0 1 0 0 0 0 0.790778 A8524 1 -9 -9 -9 -9 -9 -9 -9 0.713952 A8556 1 0 0 0 0 0 0 1 0.814292 A8562 1 0 0 0 0 0 0 0 0.803336 That is, for a variable with K categories, K-1 new dummy variables are created. This new file can be used with --linear and --logistic, and a coefficient for each level would now be estimated for the first covariate (otherwise PLINK would have incorrectly treated the first covariate as an ordinal/ratio measure). For covariate Y, each new dummy variable for level X is named Y_X, e.g. COV1_2, etc. Note that one level is automatically excluded (1 in this case, i.e. there is no COV1_1), which implicitly makes 1 the reference category in subsequent analysis. If PLINK detects more than 50 levels, it assumes the variable is not categorical (i.e. like COV2) and so leaves it unchanged. The command can operate on multiple covariates in a single file at the same time. Note that missing values are correctly handled (i.e. left as missing). NOTE Note that, unlike cluster files (see below) PLINK cannot handle any string information in covariate files. Write cluster files Similar to --write-covar, the --write-cluster will output the single selected cluster from the file specified by --within. Unlike covariate files, this allows string labels to be used. plink --bfile mydata --within clst.dat --write-cluster --out mynewfile which writes a file mynewfile.clst Use --mwithin to select which of multiple clusters is selected. The --dummy-coding can not currently be used with --write-cluster however. Flip DNA strand for SNPs This command will read the list of SNPs in the file list.txt and flip the strand for these SNPs, then save a new PED or BED fileset (i.e. by using either the --recode or --make-bed commands): plink --file data --flip list.txt --recode The list.txt should just be a simple list of SNP IDs, one SNP per line. Flipping strand means changing alleles A -> T C -> G G -> C T -> A so, for example, a A/C SNP will become a T/G; alternatively, a A/T SNP will become a T/A SNP (i.e. in this case, the labels remain the same, but whether the minor allele is A or T will still depend on strand). To flip strand for just a subset of the sample (e.g. if two samples have already been merged, and subsequently a strand issue has been identified for one of those samples) use the option --flip-subset, for example plink --file data --flip list.txt --flip-subset mylist.txt --recode where mylist.txt is a text file containing the individuals (family ID, individual ID) to be flipped. HINT When merging two datasets, it is clearly very important that the two sets of SNPs are concordant in terms of positive or negative strand. Whereas some mismatches will be easy to spot as more than two alleles will be observed in the merged dataset, other instances will not be so easy to spot, i.e. for A/T and C/G SNPs. Using LD to identify incorrect strand assignment in a subset of the sample If cases and controls have been genotyped separately and then the data merged, it is always possible that strand has been incorrectly or incompletely assigned to each SNP, meaning that the merged data may contain a number of SNPs for which the allele coding differs between cases and controls (or between any other grouping, such as collection site, etc). If the two mis-matched groups correspond to cases and controls exactly, then rare SNPs will show a very strong association with disease (e.g. 5% MAF in cases, 95% in controls) and be easy to spot as potential problems. More common SNPs could show intermediate levels of association that might be easier to confuse with a real signal. A simple approach to detect some proportion of such SNPs uses differential patterns of LD in cases versus controls: the command --flip-scan will query each SNP, and calculate the signed correlation between it and a set of nearby SNPs in cases and controls separately (of course, with the --pheno command, case and control status can be set to represent any binary split of the sample). For each index SNP, PLINK identifies other SNPs in which the absolute value of the genotypic correlation is above some threshold. For these SNP pairs, it counts the number of times the signed correlation is different in sign between cases and controls (a negative LD pair) versus the same (a positive LD pair). For example, the command plink --bfile mydata --flip-scan produces the output file plink.flipscan with the fields CHR Chromosome SNP SNP identifier for index SNP BP Base-pair position A1 Minor allele code A2 Major allele code F Allele frequency (A1 allele) POS Number of positive LD matches R_POS Average correlation of these NEG Number of negative LD matches R_NEG Average correlation of these NEGSNPS The SNPs showing negative correlation For example, the majority of this file should show SNPs have a NEG value of 0; the value of POS will be zero or greater, depending on the extent of LD. For example: CHR SNP BP A1 A2 F POS R_POS NEG R_NEG NEGSNPS 1 rs9439462 1452629 T C 0 0 NA 0 NA (NONE) 1 rs1987191 1457348 C T 0 0 NA 0 NA (NONE) 1 rs3766180 1468016 C T 0.285 2 0.893 0 NA (NONE) However, occasionally one might observe different patterns of results. Of particular interest is when one SNP shows a large number of NEG SNPs. For example, here we show rs2240344 and nearby SNPs, all of which have at least one NEG SNP (lines truncated) CHR SNP BP A1 A2 F POS R_POS NEG R_NEG NEGSNPS 14 rs12434442 72158039 T C 0.249 5 0.515 1 0.46 rs2240344 14 rs4899437 72190986 G C 0.394 5 0.802 1 0.987 rs2240344 14 rs2803980 72196284 G A 0.41 5 0.808 1 0.95 rs2240344 14 rs2240344 72197893 C G 0.489 0 NA 7 0.807 rs12434442|rs4899437|... 14 rs2286068 72198107 C T 0.407 7 0.741 1 0.962 rs2240344 14 rs7160830 72209491 T C 0.414 6 0.801 1 0.922 rs2240344 14 rs10129954 72220454 T C 0.413 6 0.729 1 0.73 rs2240344 14 rs7140455 72240734 T C 0.469 4 0.72 1 0.64 rs2240344 This pattern of results quite clearly points to rs2240344 as being the odd man out: for 7 other SNPs, there is strong LD (r above 0.5) in either cases or controls, but with a SNP-SNP correlation in the other phenotype class that has the opposite direction. In contrast, there is not a single SNP for which both cases and controls have a consistent pattern of LD. For the nearby SNPs, all of which have only 1 NEG SNP, it is with rs2240344. So, in this particular case, it would suggest that stand is flipped in either cases or controls. To display the specific sets of correlations in cases and controls for each SNP, add the option --flip-scan-verbose which generates a file plink.flipscan.verbose which lists for any SNP with at least one NEG pair of LD values, the correlations between the index SNP and the other flanking SNPs, showing the correlation in cases (R_A) and controls (R_U): CHR_INDX SNP_INDX BP_INDX A1_INDX SNP_PAIR BP_PAIR A1_PAIR R_A R_U 14 rs2240344 72197893 C rs12434442 72158039 T -0.504 0.416 14 rs2240344 72197893 C rs4899437 72190986 G -0.99 0.983 14 rs2240344 72197893 C rs2803980 72196284 G -0.969 0.931 14 rs2240344 72197893 C rs2286068 72198107 C -0.971 0.952 14 rs2240344 72197893 C rs7160830 72209491 T -0.935 0.91 14 rs2240344 72197893 C rs10129954 72220454 T -0.782 0.679 14 rs2240344 72197893 C rs7140455 72240734 T -0.671 0.609 Here we see a clear pattern in which the correlation is similar between cases and controls in magnitude but has the opposite direction, strongly suggestive of a strand flip problem for this C/G SNP. In this case, the allele frequency turns out to be quite different between cases and controls (60% versus 40%) but the LD approach would have clearly detected this particular SNP being flipped in either cases or controls even if the true allele frequency were exactly 50%. This latter class of SNP would not cause problems of spurious association in single SNP analysis, but it could cause severe problems in haplotype and imputation analysis. Naturally, if a SNP does not show strong LD with nearby SNPs, then this approach will not be able to resolve strand issues. Also, if more than one SNP in a region shows strand flips, or if there is a higher level of mis-coding alleles in general, then this approach may indicate that there are problems (many NEG scores above 0) but it might be less clear how to remedy them. To know which to resolve (cases or controls) one would need to look at the frequency in other panels, or even the correlations, e.g. in HapMap. Ideally, one would only need to do this for a small number of SNPs if any. The --flip and --flip-subset commands described above can then be used to flip the appropriate genotypes. Finally, the default threshold for counting can be changed by the following command: --flip-scan-threshold 0.8 The default is set at 0.5 (i.e. the pair needs to have a correlation of 0.5 or greater in either cases or controls). The number of flanking SNPs with are considered for each index SNP can be modified with the commands --ld-window 10 to set the number of SNPs considered upstream and downstream; the maximum physical distance away from the index SNP (1Mb by default) is specified in kb with the command: --ld-window-kb 500 Merge two filesets To merge two PED/MAP files: plink --file data1 --merge data2.ped data2.map --recode --out merge The --merge option must be followed by 2 arguments: the name of the second PED file and the name of the second MAP file. A --recode (or --make-bed, etc) option is necessary to output the newly merged file; in this case, --out option will create the files merge-recode.ped and merge-recode.map. The --merge option can also be used with binary PED files, either as input or output, but not as the second file: i.e. plink --bfile data1 --merge data2.ped data2.map --make-bed --out merge will create merge.bed, merge.fam and merge.bim, as the --make-bed option was used instead of the --recode option. Likewise, the data1.* files point to a binary PED file set. If the second fileset (data2.*) were in binary format, then you must use --bmerge instead of --merge plink --bfile data1 --bmerge data2.bed data2.bim data2.fam --make-bed --out merge which takes 3 parameters (the names of the BED, BIM and FAM files, in that order). The two filesets can either overlap completely, partially, or not at all both in terms of markers and individuals. Imputed genotypes will be set to missing (i.e. if SNP_B is not measured in the first file, but it is in the second, then any individuals in the first file who are not also present in the second file will be set to missing for SNP_B. By default, any existing genotype data (i.e. in data1.ped) will not be over-written by data in the second file (data2.ped). By specifying a --merge-mode this default behavior can be changed. The modes are: 1 Consensus call (default) 2 Only overwrite calls which are missing in original PED file 3 Only overwrite calls which are not missing in new PED file 4 Never overwrite 5 Always overwrite mode 6 Report all mismatching calls (diff mode -- do not merge) 7 Report mismatching non-missing calls (diff mode -- do not merge) The default (mode 1) behaviour is to call the merged genotype as missing if the original and new files contain different, non-missing calls; otherwise: i.e. Merge mode data1.ped , data2.ped -> 1 2 3 4 5 --------- --------- ----------------------- 0/0 , 0/0 -> 0/0 0/0 0/0 0/0 0/0 0/0 , A/A -> A/A A/A A/A 0/0 A/A A/A , 0/0 -> A/A A/A A/A A/A 0/0 A/A , A/T -> 0/0 A/A A/T A/A A/T Modes 6 and 7 effectively provide a means for comparing two PED files -- no merging is performed in these cases; rather, a list of mismatching SNPs is written to the file plink.diff They should also report the concordance rate in the LOG file, based on all SNPs that feature in both sets. A warning will be given if the chromosome and/or physical position differ between the two MAP files. NOTE Alleles must be exactly coded to match: that is, PLINK will not assume that a {1,2,3,4} SNP coding maps onto a {A,C,G,T} coding. You can use the --allele1234 and --alleleACGT commands prior to merging to convert datasets and then merge these consistently coded files (you cannot convert and merge on the fly, i.e. simply do putting --allele1234 on the command line along with --merge will not work: you need to use --allele1234 and --make-bed first). Merge multiple filesets To merge more than two standard and/or binary filesets, it is often more convenient to specify a single file that contains a list of PED/MAP and/or BED/BIM/FAM files and use the --merge-list option. Consider, for an extreme example, the case where each fileset contains only a single SNP, and that there are thousands of these files -- this option would help build a single fileset, in this case. For example, consider we had 4 PED/MAP filesets (labelled fA.* through fD.*) and 4 binary filesets, labelled fE.* through fH.*). Then using the command plink --file fA --merge-list allfiles.txt --make-bed --out mynewdata would create the binary fileset mynewdata.bed mynewdata.bim mynewdata.fam (alternatively, the --recode option could have been used instead of --make-bed to generate a standard ASCII PED/MAP fileset). In this case, the file allfiles.txt was a list of the to-be-merged files, one set per row: fB.ped fB.map fC.ped fC.map fD.ped fD.map fE.bed fE.bim fE.fam fF.bed fF.bim fF.fam fG.bed fG.bim fG.fam fH.bed fH.bim fH.fam Important Each fileset must be on a line by itself: lines with two files are interpreted as PED/MAP filesets; lines with three files are interpreted as binary BED/BIM/FAM filesets. The files on a line must always be in this order (PED then MAP; BED then BIM then FAM) Note In this case the first of the 8 files must be the starting file, i.e. associated with --file on the command line; this file only contains the 8-1 remaining files therefore. The final mynewdata.* files will contain information from all 8 files. The --merge-mode option can also be used with the --merge-list option, as described above: however, it is not possible to specify the "diff" features (i.e. modes 6 and 7). Extract a subset of SNPs: command line options There are multiple ways to extract just specific SNPs for analysis; this section describes options that use the command-line directly; the next section describes other methods that read a file containing the information. Based on a single chromosome (--chr) To analyse only a specific chromosome use plink --file data --chr 6 Based on a range of SNPs (--from and --to) To select a specific range of markers (that must all fall on the same chromosome) use, for example: plink --bfile mydata --from rs273744 --to rs89883 Based on single SNP (and window) (--snp and --window) Alternatively, you can specify a single SNP and, optionally, also ask for all SNPs in the surrounding region, with the --window option: plink --bfile mydata --snp rs652423 --window 20 which extracts only SNPs within +/- 20kb of rs652423. Based on multiple SNPs and ranges (--snps) Alternatively, the newer --snps command is more flexible but slower than the previously described --snp and --from/--to commands. The --snps command will accept a comma-delimited list of SNPs, including ranges based on physical position. For example, plink --bfile mydata --snps rs273744-rs89883,rs12345-rs67890,rs999,rs222 selects the same range as above (rs273744 to rs89883) but also the separate range rs273744 to rs89883 as well as the two individual SNPs rs999 and rs222. Note that SNPs need not be on the same chromosome; also, a range can span multiple chromosomes (the range is defined based on chromosome code order in that case, as well as physical position, i.e. a range from a SNP on chromosome 4 to one on chromosome 6 includes all SNPs on chromosome 5). No spaces are allowed between SNP names or ranges, i.e. it is --snps rs1111-rs2222,rs3333,rs4444 and not --snps rs1111 - rs2222, rs3333 ,rs4444 Hint As mentioned above, unlike other methods mentioned above, --snps will load in all the data before extracting what it needs, whereas --snp only loads in what it needs, as so is a much faster way to extract a region from a very large dataset: as a result, if you really do want only a single SNP or a single range, use --snp (with --window) or some variant of the from/--to commands. Based on physical position (--from-kb, etc) One can also select regions based on a window defined in terms of physical distance rather than SNP ID, using the command: e.g. plink --bfile mydata --chr 2 --from-kb 5000 --to-kb 10000 to select all SNPs within this 5000kb region on chromosome 2 (when using --from-kb and --to-kb you always need to specify the chromosome with the --chr option). HINT Two alternate forms of the --from-kb command are --from-bp and --from-mb that take a parameter in terms of base-pair position or megabase position, instead of kilobase (to be used with the corresponding --to-bp and --to-mb options). Based on a random sampling (--thin) To keep only a random 20% of SNPs, for example, add the flag --thin 0.2 All the above options can be used either with standard pedigree files (i.e. using --ped or --file) or with binary format pedigree (BED) files (i.e. using --bfile). One must combine this option with the desired analytic (e.g. --assoc), summary statistic (e.g. --freq) or data-generation (e.g. --make-bed) option. Extract a subset of SNPs: file-list options To extract only a subset of SNPs, it is possible to specify a list of required SNPs and make a new file, or perform an analysis on this subset, by using the command plink --file data --extract mysnps.txt where the file is just a list of SNPs, one per line, e.g. snp005 snp008 snp101 Alternatively, you can use the command --range to modify the behavior of --extract and --exclude. If the --range flag is added, then instead of a list of SNPs, PLINK will expect a list of chromosomal ranges to be given instead, one per line. plink --file data --extract myrange.txt --range All SNPs within that range will then be excluded or extracted. The format of myrange.txt should be, one range per line, whitespace-separated: CHR Chromosome code (1-22, X, Y, XY, MT, 0) BP1 Start of range, physical position in base units BP2 End of range, as above LABEL Name of range/gene For example, 2 30000000 35000000 R1 2 60000000 62000000 R2 X 10000000 20000000 R3 would extract/exclude all SNPs in these three regions (5Mb and 2Mb on chromosome 2 and 10Mb on chromosome X). Based on an attribute file (--attrib) See below Based on a set file (--gene) Finally, if a SET file is also specified, you can use the --gene option to extract all SNPs in that gene/region. For example, if the SET file genes.set contains two genes: GENE1 rs123456 rs10912 rs66222 END GENE2 rs929292 rs288222 rs110191 END then plink --file mydata --set genes.set --gene GENE2 --recode would, for example, create a new dataset with only the 3 SNPs in GENE2. One must combine these options with the desired analytic (e.g. --assoc), summary statistic (e.g. --freq) or data-generation (e.g. --make-bed) option. Remove a subset of SNPs To re-write the PED/MAP files, but with certain SNPs excluded, use the option plink --file data --exclude mysnps.txt where the file mysnps.txt is, as for the --extract command, just a list of SNPs, one per line. As described above, the --range command can modify the behaviour of --exclude in the same manner as for --extract. One must combine this option with the desired analytic (e.g. --assoc), summary statistic (e.g. --freq) or data-generation (e.g. --make-bed) option. NOTE Another way of removing SNPs is to make the physical position negative in the MAP file (this can not be done for binary filesets (e.g. the *.bim file). Make missing a specific set of genotypes To blank out a specific set of genotypes, use the following commands, e.g. --zero-cluster test.zero --within test.clst in conjunction with other data analysis, file generation or summary statistic commands, where the file test.zero is a list of SNPs and clusters, and test.clust is a standard cluster file. If the original PED file is 1 1 0 0 1 1 A A C C A A 2 1 0 0 1 1 C C A A C C 3 1 0 0 1 1 A C A A A C 4 1 0 0 1 1 A A C C A A 5 1 0 0 1 1 C C A A C C 6 1 0 0 1 1 A C A A A C 1b 1 0 0 1 1 A A C C A A 2b 1 0 0 1 1 C C A A C C 3b 1 0 0 1 1 A C A A A C 4b 1 0 0 1 1 A A C C A A 5b 1 0 0 1 1 C C A A C C 6b 1 0 0 1 1 A C A A A C and the MAP file is 1 snp1 0 1000 1 snp2 0 2000 1 snp3 0 3000 and the list of SNPs/clusters to zero out in test.zero is snp2 C1 snp3 C1 snp1 C2 and the cluster file test.clst is 1b 1 C1 2b 1 C1 3b 1 C1 4b 1 C1 5b 1 C1 6b 1 C1 2 1 C2 3 1 C2 then the command plink --file test --zero-cluster test.zero --within test.clst --recode results in a new PED file, plink.ped, 1 1 0 0 1 1 A A C C A A 2 1 0 0 1 1 0 0 A A C C 3 1 0 0 1 1 0 0 A A A C 4 1 0 0 1 1 A A C C A A 5 1 0 0 1 1 C C A A C C 6 1 0 0 1 1 A C A A A C 1b 1 0 0 1 1 A A 0 0 0 0 2b 1 0 0 1 1 C C 0 0 0 0 3b 1 0 0 1 1 A C 0 0 0 0 4b 1 0 0 1 1 A A 0 0 0 0 5b 1 0 0 1 1 C C 0 0 0 0 6b 1 0 0 1 1 A C 0 0 0 0 i.e. with the appropriate genotypes zeroed out. HINT See the section on handling obligatory missing genotype data, which can often be useful in this context. Extract a subset of individuals To keep only certain individuals in a file, use the option: plink --file data --keep mylist.txt where the file mylist.txt is, as for the --remove command, just a list of Family ID / Individual ID pairs, one set per line, i.e. one person per line. (fields can occur after the 2nd column but they will be ignored -- i.e. you could use a FAM file as the parameter of the --keep command, or have comments in the file. For example F101 1 F1001 2_B F3033 1_A Drop this individual because of consent issues F4442 22 would be fine. One must combine this option with the desired analytic (e.g. --assoc), summary statistic (e.g. --freq) or data-generation (e.g. --make-bed) option. Remove a subset of individuals To remove certain individuals from a file plink --file data --remove mylist.txt where the file mylist.txt is, as for the --keep command, just a list of Family ID / Individual ID pairs, one set per line, i.e. one person per line (although, as for --keep, fields after the 2nd column are allowed but they will be ignored). One must combine this option with the desired analytic (e.g. --assoc), summary statistic (e.g. --freq) or data-generation (e.g. --make-bed) option. Filter out a subset of individuals Whereas the options to keep or remove individuals are based on files containing lists, it is also possible to specify a filter to include only certain individuals based on phenotype, sex or some other variable. The basic form of the command is --filter which takes two arguments, a filename and a value to filter on, for example: plink --file data --filter myfile.raw 1 --freq implies a file myfile.raw exists which has a similar format to phenotype and cluster files: that is, the first two columns are family and individual IDs; the third column is expected to be a numeric value (although the file can have more than 3 columns), and only individuals who have a value of 1 for this would be included in any subsequent analysis or file generation procedure. e.g. if myfile.raw were F1 I1 2 F2 I1 7 F3 I1 1 F3 I2 1 F3 I3 3 then only two individuals (F3 I1 and F3 I2) would be included based on this filter for the calculation of allele frequencies. The filter can be any integer numeric value. As with --pheno and --within, you can specify an offset to read the filter from a column other than the first after the obligatory ID columns. Use the --mfilter option for this. For example, if you have a binary fileset, and so the FAM file contains phenotype as the sixth column, then you could specify plink --bfile data --filter data.fam 2 --mfilter 4 to select cases only; i.e. cases have the value 2, and this is the 4th variable in the file (i.e. the first two columns are ignored, as these are the ID columns). Because filtering on cases or controls, or on sex, or on position within the family, will be common operations, there are some shortcut options that can be used instead of --filter. These are --filter-cases --filter-controls --filter-males --filter-females --filter-founders --filter-nonfounders These flags can be used in any circumstances, e.g. to make a file of control founders, plink --bfile data --filter-controls --filter-founders --make-bed --out newfile or to analyse only males plink --bfile data --assoc --filter-males IMPORTANT Take care when using these with options to merge filesets: the merging occurs before these filters. Attribute filters for markers and individuals One can define an attribute file for SNPs (or for individuals, see below) that is simply a list of user-defined attributes for SNPs. For example, this might be a file snps.txt which contains rs0001 exonic rs0007 candidate rs0010 failed exonic rs0012 nssnp These codes can be whatever you like, as is appropriate for your study; a SNP can have multiple, white-space delimited attributes. Not all SNPs need appear in this file; SNPs not in the dataset are allowed to appear (they are just ignored); the order does not need to be the same. Each SNP should only be listed once however. A SNP can be listed by itself without any attributes (for example, to ensure it is not excluded when filtering to exclude SNPs with a certain attribute, see below). To filter SNPs on these, use the command (combined with some other data generation or analysis option) --attrib snps.txt exonic for example, to extract only exonic SNPs (rs0001 and rs0010 in this example, assuming they've been coded this way). To exclude SNPs that match the attribute, preface the attribute with a minus sign on the command line, e.g. --attrib snps.txt -failed to extract only non-failed SNPs. Finally, multiple filters can be combined in a comma-delimited list --attrib snps.txt exonic,-failed would select exonic SNPs that did not fail. If a SNP does not feature in the attribute file, it will always be excluded. NOTE Within match type, multiple matches are treated as logical ORs; between positive and negative matches as AND. For example, matching on A,B,-C,-D implies individuals with ( A or B ) and not ( C or D ) This approach works similarly for individuals, except the command is now --attrib-indiv, e.g. --attrib-indiv inddat.txt sample1,fullinfo and the attribute file starts with family ID and individual ID before listing any attributes, e.g. F1 1 sample2 F2 1 sample1 F3 1 sample2 fullinfo ... Create a SET file based on a list of ranges Given a list of ranges in the following format (4 columns per row; no header file) Chromosome Start base-pair position End base-pair position Set/range/gene name then the command plink --file mydata --make-set gene.list --write-set will generate the file plink.set in the standard set file format. The command --make-set-border takes a single integer argument, allowing for a certain kb window before and after the gene to be included, e.g. for 20kb upstream and downstream: plink --file mydata --make-set gene.list --make-set-border 20 --write-set HINT The --make-set command doesn't necessarily have to be used with --write-set. Rather, it can be used anywhere that --set can be used, to make sets on the fly. Similar, --set and --write-set can be combined, e.g. to create a new, filtered set file. Options for --make-set To collapse all ranges into a single set (i.e. to generate one set that corresponds to all SNPs in a gene, from a list of gene co-ordinates, for example), use --make-set-collapse-all SETNAME along with --make-set, where SETNAME is any single word that you use to name to newly created set. To make a set file of all SNPs not in the specified ranges, add --make-set-complement-all SETNAME Optionally, the range file can contain a fifth column, to specify groups of ranges. Sets can be constructed which collapse over these groups. That is, the input for --make-set is now Chromosome Start base-pair position End base-pair position Set/range/gene name Group label e.g. 1 10001 20003 GENE1 PWAY-A 8 80001 99995 GENE2 PWAY-A 12 1001 10001 GENE3 PWAY-B 5 110001 127362 GENE4 PWAY-B ... Normally, the fifth column will just be ignored, unless the command --make-set-collapse-group is added, which creates sets of SNPs that correspond to each group (i.e. PWAY-A, PWAY-B, etc, in this example) rather than each gene/region (i.e. GENE1, etc). The command --make-set-complement-group works in a similar manner, except now forming sets of all SNPs not in the given group of ranges. HINT See the resources page for pre-compiled RefSeq gene-lists that can be used here. Tabulate set membership for all SNPs It is possible to create a table that maps SNPs to sets, given a --set file has been specified, with the --set-table command, e.g. ./plink --bfile mydata --set mydata.set --set-table which generates a file plink.set.table which contains the fields SNP SNP identifier CHR Chromosome code BP Base-pair physical position First set name Membership of first set Second set name Membership of second set ... For each row, a series of 0s and 1s indicate whether or not each SNP in the dataset is in a given SET. This format can be useful for subsequent analyses (i.e. it can be easily lined up with other result files, e.g. from --assoc). SNP-based quality scores PLINK supports quality scores for SNPs and, described in the next section, genotypes. These can be used to filter on user-defined thresholds. The command --qual-scores indicates the file containing the scores. Scores are assumed to be numbers between 0 and 1, a higher number representing better quality. The threshold at which SNPs are selected can be set with the command --qual-threshold. For example, ./plink --bfile mydata --qual-scores myscores.txt --qual-threshold 0.8 --make-bed --out qc-data where myscores.txt is a text file of SNPs and scores, e.g. rs10001 0.87 rs10002 0.46 rs10003 1.00 ... will remove SNPs with scores less than 0.8. The additional flag --qual-max-threshold can be used to specify a maximum threshold also (i.e. to select low-quality SNPs only). Not all SNPs need be in the file (the SNP is left in, in this case; the order can be different, it can contain SNPs not in the data). Genotype-based quality scores Quality scores for each genotype, rather than each SNP, can also be applied to PLINK datasets, using the --qual-geno-scores command, e.g. ./plink --bfile mydata --qual-geno-scores gqual.txt --qual-geno-threshold 0.99 --assoc (with a similar --qual-geno-max-threshold command as well). The file containing the genotype quality scores should have the following format: Q FID IID SNPID score e.g. Q fam1 ind1 rs10001 0.873 Q fam1 ind1 rs10002 0.998 ... Not all genotypes need be in this file. Rather than have a very large file, one could only list genotype scores that are below some threshold, for example, assuming most genotypes are of very good quality. Genotypes not in the this file will be untouched. This format is designed to accept wildcards, as follows. Every item should start with a Q character, to allow PLINK to check the correctness of the file format. Consider this example file, Q A 1 rs1234 0.986 Q B 1 rs1234 0.923 Q A 1 rs5678 0.323 Q B 1 rs5678 0.97 that lists two genotypes for people with FID/IID A/1 and B/1 for SNPs rs1234 and rs5678. If a score if below threshold, it is set to missing in the data. The order of this file is arbitrary; not all individuals/SNPs need appear. PLINK accepts wildcards in this file, to allow for different data formats to be specified. With a person wild-card, PLINK expects all quality scores for that SNP, in order as in the FAM or PED file, e.g. Q * rs1234 0.986 0.923 Q * rs5678 0.323 0.97 With a SNP wildcard, PLINK exects all SNPs for a given person: Q A 1 * 0.986 0.323 Q B 1 * 0.923 0.97 All these formats can be mixed together in a single file. These can be combined (in which case, PLINK expects all individuals for the first SNP, all for the second SNP, etc) Q * * 0.986 0.923 0.323 0.97 WARNING This option is recently added in beta-stage of development. Currently, a wild card looks to the current data to get the list of individuals and SNPs to loop over. This could cause a problem if the file has been filtered, etc. The next release will include commands to specify the order of individuals and SNPs, e.g. --qual-people-list mysamples.lst where mysamples.lst is a file with 2 columns (FID/IID), and --qual-geno-snp-list mysnp.lst where mysnp.lst is list of SNPs. This way if somebody is in the quality score file but they have been removed from the actual genotype dataset (or added), then this can be handled properly without needing to change the whole quality score file.

【本文地址】

今日新闻

mergemode

PLINK: Whole genome data analysis toolset

PLINK: Whole genome data analysis toolset

今日新闻

推荐新闻