Highly Sensitive and Specific Detection of Rare Variants in Mixed Viral Populations from Massively Parallel Sequence Data

doi:10.1371/journal.pcbi.1002417

Figure 1.

Phase increased sensitivity to detect variants.

Phase increased sensitivity to detect variants, as seen over a range of error rates at coverages of 100-fold, 250-fold, and 500-fold. The phased variant detection threshold frequency (VDTF) is the lowest frequency of reads with variants at two specific loci that V-Phaser can distinguish from error among reads that span both loci. The unphased VDTF is the lowest frequency of one variant that V-Phaser can distinguish from error among reads that cover that locus. 100-fold phased sequence coverage achieves detection thresholds comparable to those of 500-fold unphased coverage. We use Equation 7 to calculate the phased and unphased VDTFs. (See the Materials and Methods section for Equation 7 and its derivation.)

Figure 2.

Phase distance approached length of average read as coverage increased.

The phase distance was longer than half the average read length for loci covered more than 65-fold, and as coverage increased, it approached the length of the average read. The phase distance is a measure of how far apart phased variants can be and still be detected at lower frequencies than variants not in phase. We show the phase distance as a percentage of average read length.

Figure 3.

Error rates were not uniformly distributed.

Error rates varied by (A) read position, (B) base transition, and (C) base quality score. We counted as errors any mismatches to the consensus assembly for each of the two runs in the control read set under the assumption that the NL4-3 infectious clone had no diversity. We defined the read position relative to the beginning or end of the read, whichever was closer. We defined a base transition as a dinucleotide representing the transition from the preceding base to the current base, and we scored a transition as an error if the current base was a mismatch. Base quality scores came from the sequencing process.
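
The two definitions above (end-relative read position and dinucleotide base transition) can be illustrated with a minimal Python sketch. It is not V-Phaser's code: the function names are hypothetical, the read is assumed to be already aligned to the consensus without gaps, and the preceding base is assumed to be taken from the read rather than from the consensus, which the caption does not specify.

```python
from collections import Counter


def position_from_nearest_end(index, read_length):
    """Return the 0-based distance of a base from the closer end of the read."""
    return min(index, read_length - 1 - index)


def tally_mismatches(read, consensus):
    """Tally bases and mismatches by end-relative position and by transition.

    Assumption: `read` and `consensus` are equal-length, gap-free strings
    covering the same aligned span.
    """
    if len(read) != len(consensus):
        raise ValueError("read and consensus must have the same length")

    totals_by_pos, errors_by_pos = Counter(), Counter()
    totals_by_transition, errors_by_transition = Counter(), Counter()

    for i, (read_base, cons_base) in enumerate(zip(read, consensus)):
        is_error = int(read_base != cons_base)

        pos = position_from_nearest_end(i, len(read))
        totals_by_pos[pos] += 1
        errors_by_pos[pos] += is_error

        if i > 0:
            # Dinucleotide: preceding base followed by the current base.
            # Assumption: both bases come from the read itself.
            transition = read[i - 1] + read_base
            totals_by_transition[transition] += 1
            errors_by_transition[transition] += is_error

    return totals_by_pos, errors_by_pos, totals_by_transition, errors_by_transition


# Toy example: one mismatch (G instead of C) at the middle of a 5-base read.
totals_pos, errors_pos, totals_tr, errors_tr = tally_mismatches("ACGTA", "ACCTA")
```

Dividing each error count by the matching total gives a per-category error rate of the kind plotted in panels (A) and (B).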

Figure 4.

Phase information increased sensitivity, and base quality scores increased specificity.

We compared V-Phaser to alternate versions of V-Phaser with specific components disabled. In the No Phase version, V-Phaser called variants without phase information. In the Uniform Errors version, V-Phaser estimated uniform error rates within homopolymer and nonhomopolymer regions without regard to assigned base qualities. In the No Filtering version, V-Phaser did not filter out low quality bases. (A) Phase information increased sensitivity. The version without phase information attained a sensitivity of 90%, but all other versions of V-Phaser used phase information and attained a sensitivity of 97% or more. We calculated sensitivity as the percentage of known variants correctly identified. Data are from the WNV mixed population control dataset. (B) Individual base quality scores increased specificity. Among loci with mismatches, the Uniform Errors version had only 91% specificity, but all other versions incorporated base quality scores in their probability model and attained 97% specificity or more. We calculated specificity as the percentage of loci in the control sample correctly identified as having no variants among loci that had at least one candidate variant. Data are from the infectious clone (HIV NL4-3) control dataset.
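
The sensitivity and specificity definitions in this caption reduce to two set calculations. The sketch below is illustrative only: the function names and the (locus, base) representation of a variant are assumptions, and the toy inputs are not data from the study.

```python
def sensitivity_percent(known_variants, called_variants):
    """Percentage of known variants that were correctly identified."""
    known = set(known_variants)
    if not known:
        raise ValueError("at least one known variant is required")
    return 100.0 * len(known & set(called_variants)) / len(known)


def specificity_percent(candidate_loci, called_loci):
    """Percentage of candidate loci correctly identified as having no variant.

    Assumption: the sample is a clonal control, so every call is a false
    positive, and only loci with at least one candidate variant (a mismatch)
    are counted in the denominator.
    """
    candidates = set(candidate_loci)
    if not candidates:
        raise ValueError("at least one candidate locus is required")
    false_positives = candidates & set(called_loci)
    return 100.0 * (len(candidates) - len(false_positives)) / len(candidates)


# Toy inputs: variants as hypothetical (locus, base) pairs.
known = {(101, "A"), (250, "C"), (377, "G"), (512, "T")}
called = {(101, "A"), (250, "C"), (377, "G"), (900, "T")}
toy_sensitivity = sensitivity_percent(known, called)       # 3 of 4 known variants found
toy_specificity = specificity_percent(range(10), [3])      # 1 of 10 candidate loci called
```

Restricting the specificity denominator to loci with at least one candidate variant, as the caption does, keeps the many mismatch-free loci from inflating the figure.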

Figure 5.

Phase information increased sensitivity to detect minor variants.

Phase information increased sensitivity to detect low frequency variants, as shown by these histograms of variants under 2.5%. All versions of V-Phaser detected 100% of the variants above 2.5% frequency, so these variants are not shown here. All versions of V-Phaser with phase information (A), (C), and (D) detected most variants below 1% in frequency, but the No Phase version (B) missed many variants below 1% and some variants as high as 2.5%. Data are from the control WNV mixed population.

Figure 6.

NQS filtering improves fit of probability model to data.

(A) Quantile-quantile (q-q) plots under NQS filtering show good fit of the probability model to the observed distribution of errors. Since the probability model is discrete, p values are projected onto a uniform distribution, and the distribution of projected p values is compared with the expected null distribution. See the Materials and Methods section for details. (B) In contrast, q-q plots without filtering show that omitting the filter skews the calibration of the probability model used by V-Phaser. Q-q plots of models based on subsets of the reads demonstrate that this effect becomes more pronounced with increasing coverage (see Figure S1). Q-q plots are scaled to fit the curve, so the y = x line is not at a 45-degree angle.
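
One common way to project p values from a discrete model onto a uniform distribution is the randomized p value, drawn uniformly between P(X > x) and P(X >= x). The sketch below uses that technique with a plain binomial error model as a stand-in; it is an assumption for illustration, not the procedure from the Materials and Methods section, and all function names are hypothetical.

```python
import math
import random


def binomial_pmf(k, n, p):
    """Probability of exactly k errors among n bases at per-base error rate p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binomial_pmf(j, n, p) for j in range(k, n + 1))


def projected_p_value(k, n, p, rng):
    """Randomized p value: uniform on [P(X > k), P(X >= k)].

    Under the null model this value is exactly Uniform(0, 1), which removes
    the stair-step pattern that discrete p values produce in a q-q plot.
    """
    high = min(1.0, binomial_upper_tail(k, n, p))
    low = max(0.0, high - binomial_pmf(k, n, p))
    return rng.uniform(low, high)


def qq_points(p_values):
    """Pair sorted observed p values with expected uniform quantiles."""
    observed = sorted(p_values)
    m = len(observed)
    expected = [(i + 0.5) / m for i in range(m)]
    return list(zip(expected, observed))


# Toy usage: simulate null mismatch counts at 200 loci (coverage 100,
# error rate 0.01), then project each count to a p value.
rng = random.Random(0)
counts = [sum(rng.random() < 0.01 for _ in range(100)) for _ in range(200)]
points = qq_points([projected_p_value(k, 100, 0.01, rng) for k in counts])
```

If the model is well calibrated, the points fall near the y = x line; systematic departures indicate the kind of skew described for panel (B).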

Table 1.

Comparison of V-Phaser to other viral variant callers.

Figure 7.

Low frequency variants overwhelmingly called with phase.

Histogram shows that low frequency variants were overwhelmingly called with phase thresholds. Variant frequencies are estimated by the frequencies of variants among the reads at that position. We called variants on a clinical sample using versions of V-Phaser with and without phase thresholds and binned the calls by their frequency at their locus. Most variants <5% were detected only by V-Phaser with phase thresholds, and the version without phase thresholds detected no variants <1%.
