Efficiency and Power as a Function of Sequence Coverage, SNP Array Density, and Imputation

doi:10.1371/journal.pcbi.1002604

Figure 1.

A statistical framework for joint genotype calls.

We developed a statistical framework to jointly estimate sample genotypes from array intensities, sequence reads, and haplotype phasing. The framework first estimates genotype likelihoods independently for each SNP from array and sequence data, given initial parameters for genotype cluster locations and sequence read error rates. It then multiplies the array and sequence likelihoods for each SNP to obtain joint likelihoods, passes these to haplotype phasing and imputation, and uses the output likelihoods to re-cluster intensities for all SNPs. The process is iterated, and upon termination genotype likelihoods are converted to posterior genotype probabilities. The framework can estimate genotypes given only sequence data or only array data, with or without imputation; many of these special cases are similar in principle to previously described genotyping algorithms (Text S1).
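The likelihood-multiplication and posterior-conversion steps for a single SNP in a single sample can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the genotype ordering (hom-ref, het, hom-var), and the uniform default prior are assumptions, and the phasing, imputation, and re-clustering iterations are omitted.

```python
# Order of genotypes assumed throughout this sketch.
GENOTYPES = ("hom_ref", "het", "hom_var")


def joint_likelihoods(array_lik, seq_lik):
    """Multiply per-genotype likelihoods from two independent data sources."""
    if len(array_lik) != len(GENOTYPES) or len(seq_lik) != len(GENOTYPES):
        raise ValueError("expected one likelihood per genotype")
    return tuple(a * s for a, s in zip(array_lik, seq_lik))


def posteriors(likelihoods, prior=(1 / 3, 1 / 3, 1 / 3)):
    """Convert genotype likelihoods to posterior probabilities.

    The uniform prior is an assumption made for illustration only.
    """
    weighted = [lik * p for lik, p in zip(likelihoods, prior)]
    total = sum(weighted)
    if total == 0:
        raise ValueError("all genotypes have zero likelihood")
    return tuple(w / total for w in weighted)


# Hypothetical likelihoods for one sample at one SNP.
array_lik = (0.2, 0.7, 0.1)  # from array intensity clusters
seq_lik = (0.5, 0.5, 0.0)    # from sequence reads
joint = joint_likelihoods(array_lik, seq_lik)
post = posteriors(joint)
```

Setting either input to a flat likelihood such as (1, 1, 1) reduces the sketch to the array-only or sequence-only special case.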

Figure 2.

Sensitivity and specificity of data collection strategies.

For different combinations of array and sequence data, we produced joint genotype calls on chromosome 20 for 382 European samples from the 1000G project. For a single test sample, we obtained “gold-standard” genotypes from high coverage multi-technology sequencing published by the 1000G project. We then measured non-reference site sensitivity and specificity with imputation (, ) and without (, ). (a) (left) and (right) of calls from five array densities and four sequence coverages. The first row of each table contains results for strategies with only sequence data, and the first column contains results for strategies with only array data. A common color scheme is used across all tables, with white corresponding to 100%, red corresponding to , and yellow corresponding to 80%. (b) of calls; is given in Figure S9. (c) for three variant frequency ranges, with frequency estimated from the non-test samples. Private variants have frequency 0% in the non-test samples. (d) for four sequence coverages, with separate lines that correspond to joint calls made with each SNP array. (e) for four array densities, with separate lines that correspond to joint calls made with each sequence coverage. No Array: from sequence data alone; 0×: from array data alone; 0.5×-4×: mean number of sequence reads per genomic position; array abbreviations are defined in Materials and Methods.

Figure 3.

Data collection strategies for studies with prior array data.

For the (a) Affy 6 and (b) Ilmn 1 M arrays, we produced joint calls after addition of each sequence coverage or array; joint calls with multiple arrays include combined data from both arrays. The y-axis shows , while the x-axis shows of the additional data collected. is a measure of the genotyping investment intrinsic to a technology that serves as a proxy for cost. The blue point (None) shows if no additional data are collected; the other points are labeled with the additional data collected. Labels are defined in Figure 2.

Figure 4.

Reduction in errors from joint genotype calls.

(a) To assess the improvement in imputation quality afforded by joint genotype calls with a SNP array (relative to calls based on sequence data alone), we measured sensitivity and specificity at sites absent from the array; errors at these sites can be reduced only through improved imputation. The Metabochip is absent from this plot, as it is not a genome-wide array. Plotted are and , the sum of which equals the number of sites where (1) the gold-standard or called genotype is non-reference and (2) the gold-standard and called genotypes disagree. Normalized values (defined in Materials and Methods) are plotted to show visual trends; actual values are given in Figure S16. (b) To assess the reduction in erroneous genotype cluster locations afforded by joint genotype calls with sequence data (relative to calls based on array data alone), we measured sensitivity and specificity at sites on the array. Red bars correspond to and , measured from calls without haplotype phasing; blue bars correspond to and , measured from joint calls. As described in Materials and Methods, these experiments used 82 additional unrelated samples, absent from our other experiments, to inform cluster locations.
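The discrepancy count described in panel (a) can be sketched as below: a site is counted when the gold-standard or called genotype is non-reference and the two genotypes disagree. The encoding of genotypes as the number of non-reference alleles (0, 1, 2), the function name, and the split of the total into "missed" and "spurious" sites are assumptions made for illustration; the quantities actually plotted are defined in Materials and Methods.

```python
def nonref_discrepancies(gold, called):
    """Count discordant sites where at least one genotype is non-reference.

    Genotypes are encoded as the number of non-reference alleles (0, 1, 2).
    Returns (missed, spurious):
      missed   - gold-standard is non-reference and the call differs
      spurious - gold-standard is reference but the call is non-reference
    Their sum is the number of sites meeting both conditions in the caption.
    """
    if len(gold) != len(called):
        raise ValueError("gold and called must cover the same sites")
    missed = 0
    spurious = 0
    for g, c in zip(gold, called):
        if g == c:
            continue  # concordant sites are never counted
        if g > 0:
            missed += 1
        elif c > 0:
            spurious += 1
    return missed, spurious


# Hypothetical genotypes at five sites.
gold = [0, 1, 2, 0, 1]
called = [0, 1, 1, 1, 0]
missed, spurious = nonref_discrepancies(gold, called)
```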

Figure 5.

A novel next-generation sequencing error mode.

(a) We identified a novel error mode based on visual examination of disputed SNPs. As shown in the cluster plot, one of the samples is called homozygous reference (Hom-ref) based on analysis of array data but homozygous non-reference (Hom-var) based on analysis of sequence data (shown by the sample outlined in green within the red cluster). This unusual error mode contrasts with the more common error mode, due to low sequence coverage, of samples called heterozygous (Het) based on array data but homozygous reference or non-reference based on sequence data (shown by samples outlined in pink or green within the blue cluster). (b) Inspection of the sequence reads in the Integrative Genomics Viewer (IGV) [54] shows that the sample in question has only two reads that cover this SNP, and these reads are pairs sequenced from the same underlying DNA fragment. (c) This error mode is introduced in the shearing and library preparation stage of next-generation sequencing and as a result is reflected in both reads from the same DNA fragment. Depending on protocol details, the error rate is around 1/10,000. During genotype calling, independent treatment of reads (read-based) results in much more confident (here 100×) non-reference genotype calls than analysis at the fragment level (fragment-based). (d) To account for these effects, which can be large for low coverage sequencing projects like the 1000G Project, we implemented a fragment-based genotyping algorithm in the Unified Genotyper of the Genome Analysis Toolkit (GATK). Use of this new caller has a significant impact on SNP call quality, shown by a smaller number of novel SNP calls and a higher Transition∶Transversion ratio (proxies for accuracy [27]). The effect is pronounced for populations such as MXL and ASW, which have a higher fraction of newer Illumina sequencing data with longer reads (e.g., ASW data is reads, while YRI has less than ), which results in a greatly increased rate of overlapping reads and associated errors. Abbreviations are as defined in the 1000G Project.
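The difference between read-based and fragment-based treatment in panel (c) can be illustrated with a toy diploid likelihood model. This is a simplified sketch, not the GATK Unified Genotyper code: the function names, the single per-observation error rate of 1% (roughly Q20), and the rule of discarding fragments whose mates disagree are all assumptions chosen for illustration.

```python
def p_alt(alt_copies, error_rate):
    """Probability of observing the alt allele given 0, 1, or 2 alt copies."""
    frac = alt_copies / 2
    return frac * (1 - error_rate) + (1 - frac) * error_rate


def genotype_likelihoods(alleles, error_rate):
    """Likelihoods (hom_ref, het, hom_var), treating observations as independent."""
    liks = []
    for alt_copies in (0, 1, 2):
        pa = p_alt(alt_copies, error_rate)
        lik = 1.0
        for allele in alleles:
            lik *= pa if allele == "alt" else 1 - pa
        liks.append(lik)
    return tuple(liks)


def collapse_fragments(reads):
    """Reduce (fragment_id, allele) reads to one observation per fragment.

    Fragments whose mates disagree are discarded (an assumption of this sketch).
    """
    by_fragment = {}
    for fragment_id, allele in reads:
        by_fragment.setdefault(fragment_id, set()).add(allele)
    return [next(iter(a)) for a in by_fragment.values() if len(a) == 1]


# Two overlapping mates from one fragment, both showing the alt allele.
reads = [("frag1", "alt"), ("frag1", "alt")]
error_rate = 0.01  # assumed per-observation error rate

read_based = genotype_likelihoods([a for _, a in reads], error_rate)
fragment_based = genotype_likelihoods(collapse_fragments(reads), error_rate)

# Support for hom-var over hom-ref under each treatment.
read_ratio = read_based[2] / read_based[0]
fragment_ratio = fragment_based[2] / fragment_based[0]
inflation = read_ratio / fragment_ratio
```

Under these assumptions the read-based likelihood ratio exceeds the fragment-based one by a factor of (1 - error_rate) / error_rate, about 99, which is the same order as the 100× difference noted in the caption.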