Shining a Light on Dark Sequencing: Characterising Errors in Ion Torrent PGM Data

doi:10.1371/journal.pcbi.1003031

Table 1.

Sequencing runs generated for this study.

Table 2.

Number of genomic locations where a significant proportion of reads disagreed with the reference.

Figure 1.

Relationship between G+C% and the observed mean coverage for 100 bp bins in the reference genome.

Panel (a) is a boxplot of the distribution of the square-root normalised mean read depth across the 100 bp windows for each reference genome, broken down further by sequencing kit and G+C% bin. The coverage for each run was normalised by the mean coverage, so the boxplots show the square-root fold-change from the mean genomic coverage for each combination of species, kit and G+C% bin. Thus a value of 2 means the coverage was four times the mean for that sequencing run. The boxes display the central 50% of the values in each treatment, with the median represented by the solid black horizontal bar. The whiskers each extend for 1.5× the inter-quartile range, and the black dots represent extreme individual observations which fall outside this range. The variability observed in the high G+C bins is likely due to the small sample size for these G+C regions, shown in panel (b). The outliers are potentially due to repetitive content in the genome that our perfect-match repeat approach failed to mask.
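The normalisation described in this caption can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `sqrt_fold_change` and `gc_percent` are hypothetical, and the input is assumed to be a list of mean read depths, one per 100 bp bin, for a single sequencing run.

```python
import math

def sqrt_fold_change(bin_depths):
    """Square root of each bin's mean depth divided by the run-wide mean depth.

    bin_depths is assumed to hold one mean read depth per 100 bp bin.
    """
    run_mean = sum(bin_depths) / len(bin_depths)
    if run_mean == 0:
        raise ValueError("run has zero mean coverage")
    return [math.sqrt(depth / run_mean) for depth in bin_depths]

def gc_percent(seq):
    """G+C percentage of a reference window (hypothetical helper for binning)."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)

# Illustrative depths only: one bin carries four times the run mean (20).
values = sqrt_fold_change([80, 0, 0, 0])
print(values[0])  # a value of 2 corresponds to four times the mean coverage
```

Grouping the returned values by species, kit and G+C% bin would then give the data behind each box.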

Figure 2.

Mean rates of insertion, deletion and substitution errors across the three sequencing kits.

Each boxplot shows the distribution of error rates of the specified type across the runs for the specified kit (species are aggregated).

Figure 3.

Relationship between base position and error rate for homopolymer (over-call/under-call) versus substitution errors.

Panel (a) shows the homopolymer error rate (insertion+deletion) by read base position, and panel (b) shows the substitution error rate by base position. Each line is the raw mean error rate for a single data-set with the kit and species as specified by the colour key.

Figure 4.

Examples of over-call/under-call errors in homopolymers of length less than 2.

By aligning the read (derived from the rounded flow-values) with its corresponding reference sequence (considered the ‘true’ sequence) at the flow level, we can identify examples of over-calling a zero-length homopolymer (Flow Cycle #2) and under-calling a one-length homopolymer (Flow Cycle #6). Flow Cycle #5 demonstrates a zero-length homopolymer being correctly called as zero.
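The flow-level comparison described here can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `classify_flow_calls` is a hypothetical name, flow-values are assumed to be rounded half up and clamped at zero, and the example values are invented rather than taken from the figure.

```python
import math

def classify_flow_calls(flow_values, reference_lengths):
    """Compare rounded flow-values with reference homopolymer lengths, flow by flow.

    Returns one (called_length, reference_length, label) tuple per flow.
    """
    if len(flow_values) != len(reference_lengths):
        raise ValueError("flow-values and reference lengths must align one to one")
    calls = []
    for value, true_len in zip(flow_values, reference_lengths):
        # Assumption: round half up, and treat negative flow-values as zero.
        called = max(0, math.floor(value + 0.5))
        if called > true_len:
            label = "over-call"
        elif called < true_len:
            label = "under-call"
        else:
            label = "correct"
        calls.append((called, true_len, label))
    return calls

# Invented flow-values and reference homopolymer lengths for six flows.
flows = [1.02, 0.62, 2.10, 0.08, 0.45, 1.96]
reference = [1, 0, 2, 0, 1, 2]
for call in classify_flow_calls(flows, reference):
    print(call)
```

In this toy input the second flow is an over-call of a zero-length homopolymer and the fifth is an under-call of a one-length homopolymer, the two cases that base-space alignment alone cannot attribute to a specific flow.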

Figure 5.

Calling accuracy decreases with homopolymer length.

Lines show mean accuracy for each kit by reference homopolymer length, across bases 10–100 and bases 10–200; the latter range is relevant only to the two 200 bp kits.

Table 3.

Estimated main and deviance effects for each explanatory variable in the double-generalised linear model.

Figure 6.

The influence of position in cycle (PIC, labelled 0–31) on flow-value distributions and consequently error rate and type.

Panel (a) shows the coefficient (main effect) of each flow cycle position as a predictor of the mean of the flow-value distribution. Panel (b) shows the error rate broken down by insertions and deletions for each PIC. These do not include flow-values for homopolymers where the reference homopolymer length is zero.
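A breakdown like the one in panel (b) could be tabulated as in the sketch below. It rests on assumptions not stated in the caption: flows are indexed from 0, PIC is taken as the flow index modulo 32 (consistent with the 0–31 labelling), and `indel_rates_by_pic` is a hypothetical function name.

```python
from collections import defaultdict

CYCLE_LENGTH = 32  # assumption: PIC labelled 0-31 implies a 32-flow cycle

def indel_rates_by_pic(observations, cycle_length=CYCLE_LENGTH):
    """Insertion and deletion rates per position in cycle (PIC).

    observations: iterable of (flow_index, called_length, reference_length).
    Returns {pic: (insertion_rate, deletion_rate)}.
    """
    totals = defaultdict(int)
    insertions = defaultdict(int)
    deletions = defaultdict(int)
    for flow_index, called, ref in observations:
        if ref == 0:
            continue  # mirror the caption: zero-length reference homopolymers excluded
        pic = flow_index % cycle_length
        totals[pic] += 1
        if called > ref:
            insertions[pic] += 1  # over-call
        elif called < ref:
            deletions[pic] += 1   # under-call
    return {
        pic: (insertions[pic] / totals[pic], deletions[pic] / totals[pic])
        for pic in sorted(totals)
    }

# Invented observations: flows 0, 32 and 64 all fall on PIC 0.
obs = [(0, 1, 1), (32, 2, 1), (64, 0, 1), (1, 1, 0), (33, 1, 1)]
print(indel_rates_by_pic(obs))
```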

Figure 7.

Comparison of predicted versus empirical distributions of flow-values for homopolymers of length 1–5.

Predicted (solid line) and empirical (dotted line) distributions of flow-values for homopolymers of length 1–5 (colours: black, red, green, blue and teal), for flow cycles 2, 5 and 9 (rows) and PIC 1, 10 and 12 (columns), for the species B. amyloliquefaciens. The low number of observations of homopolymers of length 5 is the likely cause of the abnormal distributions for this homopolymer length. The ‘shoulders’ observed in the data are often due to an unexpectedly high frequency of boundary flow-values (e.g. 0.51, 1.49, 1.51…).

Figure 8.

Breakdown of substitution type as a proportion of all substitutions for each sequencing kit.

Figure 9.

Ion Torrent quality scores versus empirically estimated quality scores for each base.

The grey cloud surrounding the LOESS smoother function indicates the 95% confidence interval for the conditional mean. Individual observations for each quality are plotted as black points.
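An empirical quality score of the kind plotted here can be derived from observed error counts using the standard Phred relation Q = −10 log10(p), where p is the observed error rate. The sketch below assumes that errors and bases have already been counted for each reported quality value; the function names and the example counts are hypothetical.

```python
import math

def empirical_quality(n_errors, n_bases):
    """Phred-scaled quality from an observed error rate: Q = -10 * log10(p)."""
    if n_bases <= 0:
        raise ValueError("need at least one observed base")
    if n_errors == 0:
        return None  # no errors observed, so the estimate is unbounded
    return -10.0 * math.log10(n_errors / n_bases)

def empirical_by_reported(counts):
    """counts maps a reported quality score to (n_errors, n_bases)."""
    return {q: empirical_quality(errors, bases)
            for q, (errors, bases) in sorted(counts.items())}

# Invented counts: bases reported at Q30 that err once in 100 behave like Q20.
print(empirical_by_reported({30: (1, 100), 20: (5, 100)}))
```

Plotting reported against empirical values from such a table, with a smoother, would give a comparison of the kind shown in the figure.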

Table 4.

Effect of quality and flow trimming on dataset metrics, aggregated by kit used.
