Illumina TruSeq Synthetic Long-Reads Empower De Novo Assembly and Resolve Complex, Highly-Repetitive Transposable Elements

doi:10.1371/journal.pone.0106689

Figure 1.

Characteristics of TruSeq synthetic long-reads.

A: Read length distribution. B, C, & D: Position-dependent profiles of B: mismatches, C: insertions, and D: deletions compared to the reference genome. Error rates presented in these figures represent all differences with the reference genome, and can be due to errors in the reads, mapping errors, errors in the reference genome, or accurate sequencing of residual polymorphism.

Figure 2.

Depth of synthetic long-read coverage per chromosome arm.

The suffix “Het” indicates the heterochromatic portion of the corresponding chromosome. M refers to the mitochondrial genome of the y; cn, bw, sp strain. U and Uextra are additional scaffolds in the reference assembly that could not be mapped to chromosomes.

Table 1.

Size and correctness metrics for de novo assembly.

Table 2.

Alignment statistics for Celera Assembler contigs aligned to the reference genome.

Figure 3.

Results of generalized linear mixed model describing probability of accurate TE assembly.

Predictor variables include TE length, GC content, divergence, and number of high-identity (0.01 substitutions per base compared to the canonical sequence) copies within family. Black lines represent predicted values from the GLMM fit to the binary data (colored points). The upper set of points represents TEs that were perfectly assembled, while the lower set of points represents TEs that are absent from the assembly or were mis-assembled with respect to the reference. The exact positions of the colored points along the Y-axis should therefore be disregarded. Colors indicate different TE families (122 total). To visualize the interaction between divergence and the number of high-identity copies, we plotted predicted values both for families with low numbers of high-identity copies (dashed line) and for families with high numbers of high-identity copies (solid line).

Figure 4.

Assembly metrics as a function of depth of coverage of TruSeq synthetic long-reads.

A: NG(X) contig length for full and down-sampled coverage data sets. This metric represents the size of the contig for which X% of the genome length (180 Mbp) lies in contigs of that size or longer. B: The proportion of genes and transposable elements accurately assembled (100% length and sequence identity) for full and down-sampled coverage data sets.
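The NG(X) definition in panel A can be illustrated with a short calculation. The following is a minimal Python sketch, not the code used in the study: the function name `ngx` and the toy contig lengths are assumptions made for illustration. It sorts contigs from longest to shortest and returns the length of the contig at which the cumulative length first reaches X% of the assumed genome length.

```python
def ngx(contig_lengths, genome_length, x):
    """Return NG(X): the contig length L such that contigs of length >= L
    together cover at least x% of the genome length.

    Hypothetical helper for illustration. Returns None when the assembly
    is too small to reach x% of the genome length.
    """
    if not 0 < x <= 100:
        raise ValueError("x must be in the interval (0, 100]")
    threshold = genome_length * x / 100.0
    running_total = 0
    # Walk contigs from longest to shortest, accumulating their lengths.
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= threshold:
            return length
    return None


# Toy example with made-up contig lengths and a 100 bp "genome".
toy_contigs = [50, 30, 20, 10]
ng50 = ngx(toy_contigs, genome_length=100, x=50)
ng60 = ngx(toy_contigs, genome_length=100, x=60)
```

Unlike N(X), which normalizes by the total assembly length, NG(X) normalizes by the genome length (180 Mbp in the caption), so values from assemblies of different total size remain comparable.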
