A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies

doi:10.1371/journal.pgen.1000529

Figure 1.

Schematic drawing of imputation Scenario A.

In this drawing, haplotypes are represented as horizontal boxes containing 0's and 1's (for alternate SNP alleles), and unphased genotypes are represented as rows of 0's, 1's, 2's, and ?'s (where ‘1’ is the heterozygous state and ‘?’ denotes a missing genotype). The SNPs (columns) in the dataset can be partitioned into two disjoint sets: a set T (blue) that is genotyped in all individuals and a set U (green) that is genotyped only in the haploid reference panel. The goal of imputation in this scenario is to estimate the genotypes of SNPs in set U in the study sample.
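
The layout described in this caption can be sketched in a few lines of Python. This is only a toy illustration: the arrays, the `MISSING` sentinel, and the helper `partition_snps` are hypothetical names with made-up values, not data or code from the paper. The sketch also assumes that a SNP belongs to set U when it is missing in every study individual.

```python
# Toy illustration of the Scenario A layout; all values are invented.
MISSING = None  # stands in for '?' in the figure

# Haploid reference panel: one row per haplotype, one column per SNP (alleles 0/1).
reference_haplotypes = [
    [0, 1, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 0, 1, 0],
]

# Study sample: unphased genotypes (0, 1 = heterozygous, 2); untyped SNPs are missing.
study_genotypes = [
    [1, MISSING, 2, MISSING, 0],
    [0, MISSING, 1, MISSING, 2],
]


def partition_snps(genotypes):
    """Split column indices into T (typed in the study sample) and U (untyped).

    Assumption: a SNP is in U when it is missing in every study individual.
    """
    n_snps = len(genotypes[0])
    typed, untyped = [], []
    for j in range(n_snps):
        column = [row[j] for row in genotypes]
        if all(g is MISSING for g in column):
            untyped.append(j)
        else:
            typed.append(j)
    return typed, untyped


set_T, set_U = partition_snps(study_genotypes)
print("T:", set_T, "U:", set_U)
```

In this toy example the imputation targets are the study-sample entries in the columns of `set_U`, which the reference haplotypes cover completely.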

Figure 2.

Schematic drawing of imputation Scenario B.

In this drawing, haplotypes are represented as horizontal boxes containing 0's and 1's (for alternate SNP alleles), and unphased genotypes are represented as rows of 0's, 1's, 2's, and ?'s (where ‘1’ is the heterozygous state and ‘?’ denotes a missing genotype). The SNPs (columns) in the dataset can be partitioned into three disjoint sets: a set T (blue) that is genotyped in all individuals, a set U₂ (yellow) that is genotyped in both the haploid and diploid reference panels but not the study sample, and a set U₁ (green) that is genotyped only in the haploid reference panel. The goal of imputation in this scenario is to estimate the genotypes of SNPs in set U₂ in the study sample and SNPs in the set U₁ in both the study sample and, if desired, the diploid reference panel.
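
The three-way partition in this caption follows from which panels genotype each SNP. A minimal sketch, assuming each SNP carries three boolean flags; the function `classify_snp` and the SNP identifiers are hypothetical and serve only as illustration.

```python
def classify_snp(in_haploid_ref, in_diploid_ref, in_study):
    """Assign a SNP to T, U2, or U1 from the panels that genotype it."""
    if in_haploid_ref and in_diploid_ref and in_study:
        return "T"   # genotyped in all individuals
    if in_haploid_ref and in_diploid_ref and not in_study:
        return "U2"  # both reference panels, not the study sample
    if in_haploid_ref and not in_diploid_ref and not in_study:
        return "U1"  # haploid reference panel only
    return "other"   # combinations the caption does not cover


# Hypothetical SNPs: (haploid reference, diploid reference, study sample)
snps = {
    "snp_a": (True, True, True),
    "snp_b": (True, True, False),
    "snp_c": (True, False, False),
}
labels = {name: classify_snp(*flags) for name, flags in snps.items()}
print(labels)
```

Under this labeling, SNPs in U2 are imputed in the study sample only, while SNPs in U1 are imputed in the study sample and, if desired, in the diploid reference panel.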

Figure 3.

Percentage discordance versus percentage missing genotypes for Scenario A dataset.

(A) Full range of results, corresponding to calling thresholds from 0.33 to 0.99. (B) Magnified results for calling thresholds near 0.99. (C) Magnified results for calling thresholds near 0.33.
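
One way to read these curves: each calling threshold yields one point, trading missing calls against discordant ones. The sketch below assumes that a masked genotype is called when its largest posterior probability meets the threshold, and that discordance is measured only among called genotypes. Both are assumptions about the procedure, and the function name and the posterior values are invented for illustration. With three possible genotypes the largest posterior is never below 1/3, which is consistent with the lowest threshold of 0.33 producing no missing calls.

```python
def discordance_vs_missing(posteriors, truth, threshold):
    """Return (percent discordance, percent missing) at one calling threshold.

    posteriors: list of (p0, p1, p2) tuples, one per masked genotype.
    truth: list of true genotypes (0, 1, or 2) in the same order.
    """
    called = 0
    discordant = 0
    for probs, true_g in zip(posteriors, truth):
        best = max(range(3), key=lambda g: probs[g])  # best-guess genotype
        if probs[best] >= threshold:
            called += 1
            if best != true_g:
                discordant += 1
    total = len(truth)
    pct_missing = 100.0 * (total - called) / total
    # Discordance is computed among called genotypes only (assumption).
    pct_discordance = 100.0 * discordant / called if called else 0.0
    return pct_discordance, pct_missing


# Invented posteriors for four masked genotypes.
posteriors = [
    (0.98, 0.02, 0.00),
    (0.10, 0.85, 0.05),
    (0.40, 0.35, 0.25),
    (0.05, 0.15, 0.80),
]
truth = [0, 1, 1, 1]

for t in (0.33, 0.90):
    print(t, discordance_vs_missing(posteriors, truth, t))
```

Sweeping the threshold from 0.33 to 0.99 and plotting the resulting pairs would trace a curve of the kind shown in the panels.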

Table 1.

Running times and memory requirements for various algorithms in Scenario A.

Figure 4.

Percentage discordance versus percentage missing genotypes for restricted Scenario B dataset.

(A) Results for masked Illumina genotypes imputed from Affymetrix genotypes in the study sample. (B) Results for masked Affymetrix genotypes imputed from Illumina genotypes in the study sample. (C) Results for masked Illumina genotypes (SNPs with MAF<5% only) imputed from Affymetrix genotypes in the study sample. (D) Results for masked Affymetrix genotypes (SNPs with MAF<5% only) imputed from Illumina genotypes in the study sample.

Figure 5.

Percentage discordance versus percentage missing genotypes for full Scenario B dataset.

(A) Results for masked Illumina genotypes imputed from Affymetrix genotypes in the study sample. (B) Results for masked Affymetrix genotypes imputed from Illumina genotypes in the study sample. Solid lines were obtained from the restricted Scenario B dataset (Figure 4) and are shown for reference; dashed lines were obtained from the full Scenario B dataset.

Table 2.

False negative (FN) and false positive (FP) minor allele call rates at rare SNPs (MAF<5%) in Scenario B.

Table 3.

Running times and memory requirements for various algorithms in Scenario B.
