Selecting Superior De Novo Transcriptome Assemblies: Lessons Learned by Leveraging the Best Plant Genome

doi:10.1371/journal.pone.0146062

Table 1.

Summary of assembly software.


Fig 1.

Illumina sequence coverage of Arabidopsis cDNAs.

Coverage of all detected genes by sequencing reads. The darkest bar represents the number of detected genes not tagged, and each progressively lighter bar represents genes in a bin with a 5% increase in coverage, with the two lightest bars showing the number of genes covered at >90% and >99%, respectively. Normalized BR12 = normalized Illumina library from pooled biological replicates 1 and 2; BR12 = combined coverage of Illumina biological replicates 1 and 2; BR1 = Illumina biological replicate 1; BR2 = Illumina biological replicate 2.
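
The binning described in this caption can be sketched as follows. This is only an illustration: the function name `coverage_bin`, the half-open bin edges, and the input values are assumptions, not the procedure used to draw the figure.

```python
import math
from collections import Counter

def coverage_bin(covered_bp: int, gene_length: int, bin_width: float = 5.0) -> str:
    """Label a gene by the percentage of its length covered by reads.

    Assumed layout: a "not tagged" bin for zero coverage, 5%-wide bins up
    to 90%, then the two top bins (>90% and >99%) named in the caption.
    """
    if gene_length <= 0:
        raise ValueError("gene_length must be positive")
    pct = 100.0 * covered_bp / gene_length
    if pct == 0:
        return "not tagged"
    if pct > 99:
        return ">99%"
    if pct > 90:
        return ">90%"
    # Bins are (lower, upper], so exactly 90% falls in "85-90%".
    upper = math.ceil(pct / bin_width) * bin_width
    lower = upper - bin_width
    return f"{lower:g}-{upper:g}%"

# Hypothetical (covered bp, gene length) pairs, for illustration only.
genes = [(0, 1200), (30, 1000), (900, 1000), (950, 1000), (1000, 1000)]
histogram = Counter(coverage_bin(c, n) for c, n in genes)
```

Counting the labels per library would give one bar per bin, from darkest ("not tagged") to lightest (">99%").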


Table 2.

Summary of assembly quality metrics used in this study.


Table 3.

Assembled sequence and RNA-Seq statistics for assemblies of Illumina biological replicate 1 (BR1).


Fig 2.

SCERNA Flowchart.

SCERNA stands for Scaffolding and Error correction for de novo assemblies of RNA-Seq data. This collection of post-processing tools allows flexible implementation at various steps post assembly and with multiple assemblers and data types.


Fig 3.

Post-processed assembly delta plot, showing the effect of post-processing in several assembly quality categories.

The histogram shows the magnitude (% change) and direction of change in each category for the Mosaik-S, CLC-S and Trinity-ICB assemblies of Illumina biological replicate 1. The post-processed values in each category are printed above the x axis for each assembly. A vertical line separates categories in which an assembly improvement would decrease the respective measure (“Expect Δ<0” categories, left of line) from those in which it would increase the measure (“Expect Δ>0” categories, right of line).
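
As a minimal sketch of the quantity plotted here, the percent change and its expected direction can be computed as below. The function names and the example values are hypothetical and are not taken from the study.

```python
def percent_change(before: float, after: float) -> float:
    """Percent change of a metric from the raw to the post-processed assembly."""
    if before == 0:
        raise ValueError("percent change is undefined for a zero baseline")
    return 100.0 * (after - before) / before

def is_improvement(delta_pct: float, expect_decrease: bool) -> bool:
    """True when the change points in the direction expected for a better assembly.

    expect_decrease=True corresponds to the "Expect Δ<0" categories,
    expect_decrease=False to the "Expect Δ>0" categories.
    """
    return delta_pct < 0 if expect_decrease else delta_pct > 0

# Hypothetical example: a count that should fall after post-processing.
delta = percent_change(before=200, after=150)
improved = is_improvement(delta, expect_decrease=True)
```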


Fig 4.

The quality of unigenes as a function of sequencing depth for Illumina biological replicate 1.

The units for “Assembly Quality” are Normalized Bit Score (BS, maximum of 2) and the units for “Sequence Depth” are Sequenced Fragments/bp (SFB). The number printed in the plot area is the number of unigenes with a normalized bit score above 1.5. Although arbitrary, a BS threshold of 1.5 represents long and accurate assemblies (75% length, high accuracy) and is used to illustrate the difference in the high-density region seen in most plots near BS 1.75–2.
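
The two axes and the printed count can be illustrated with the sketch below. It assumes that the normalized bit score is the alignment bit score divided by the reference cDNA length, which is consistent with the stated maximum of 2 but is not defined in this caption. The class name, field names, and example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class UnigeneHit:
    bit_score: float          # bit score of the best alignment to the reference
    reference_length_bp: int  # length of the matched reference cDNA
    fragments: int            # sequenced fragments assigned to the gene

    @property
    def normalized_bit_score(self) -> float:
        # Assumption: normalization is by reference length (bits per bp).
        return self.bit_score / self.reference_length_bp

    @property
    def depth_sfb(self) -> float:
        # Sequenced fragments per bp (SFB), as named in the caption.
        return self.fragments / self.reference_length_bp

def count_well_assembled(hits: Iterable[UnigeneHit], threshold: float = 1.5) -> int:
    """Count unigenes whose normalized bit score exceeds the threshold."""
    return sum(1 for h in hits if h.normalized_bit_score > threshold)

# Hypothetical values, for illustration only.
hits = [UnigeneHit(1900.0, 1000, 500), UnigeneHit(1200.0, 1000, 50)]
well_assembled = count_well_assembled(hits)
```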


Fig 5.

Normalization improves the recovery rate of highly expressed genes (compare to Fig 4).

The units for “Assembly Quality” are Normalized Bit Score (BS) and the units for “Sequence Depth” are Sequenced Fragments/bp (SFB). The number printed in the plot area is the number of unigenes with normalized bit score above 1.5. A BS of 1.5 is an arbitrary threshold and is used to illustrate the differences in the high-density region seen in most plots near BS 1.75–2.


Fig 6.

Read titration analysis.

Reads were mapped to post-processed assemblies of biological replicate 1. The number of reads that map to an assembly (X axis) is an indicator of assembly completeness and quality, while the incidence of new tags (Y axis) indicates how completely the assembly reflects the diversity present in the read data.
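
A read titration curve of this kind could be tabulated as in the sketch below. It assumes that a "tag" is a contig hit by at least one read and that mapping results arrive as one contig identifier per read (`None` for unmapped reads). The function name and input format are illustrative, not those of the study.

```python
from typing import Iterable, List, Optional, Tuple

def titration_curve(hits: Iterable[Optional[str]], step: int) -> List[Tuple[int, int]]:
    """Return (mapped reads so far, distinct contigs tagged so far) pairs.

    A point is recorded every `step` mapped reads. Unmapped reads (None)
    are skipped, so the X values count mapped reads only.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    tagged = set()
    mapped = 0
    curve = []
    for contig in hits:
        if contig is None:
            continue
        mapped += 1
        tagged.add(contig)
        if mapped % step == 0:
            curve.append((mapped, len(tagged)))
    return curve

# Hypothetical mapping results for seven reads.
curve = titration_curve(["a", "a", None, "b", "a", "c", "c"], step=2)
```

A curve that flattens as more reads are added suggests that few new contigs remain to be tagged.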


Fig 7.

Ultra-conserved orthologs (UCO) coverage in post-processed assemblies of BR1.

The use of UCOs as a proxy for the transcriptome assembly helps to identify leading assemblies when considered with the read titration curve analysis. When the two analyses are considered together, Trinity-ICB, which was the clear leader in our reference-based analyses, is also selected as the leader by the reference-independent metrics.


Fig 8.

4 Gbp is a practical target volume for de novo transcriptome assembly.

Illumina biological replicate 1 was subsampled to produce datasets of 1, 2, 3, and 4 Gbp. Replicate subsamples at 1 Gbp (top row) show reproducible results. Increasing the data volume in 1 Gbp steps to 2, 3, and 4 Gbp shows diminishing increases in well-assembled (>1.5 BS) genes. The 4 Gbp subsampled assembly showed highly similar results to the biological replicate datasets. Doubling the data volume (BR12) produced a small increase in well-assembled genes (>1.5 BS) accompanied by a small increase in Type I mis-assembly. This analysis indicates that 4 Gbp is a practical target data volume for de novo plant transcriptomes.
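
The caption does not state how the subsamples were drawn. One plausible approach, shown here purely as an assumption, is to pick reads at random until a target number of bases is reached, with a different seed for each replicate subsample. The function name and inputs are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

def subsample_to_target(read_lengths: Sequence[int], target_bp: int,
                        seed: int = 0) -> Tuple[List[int], int]:
    """Pick read indices at random until at least target_bp bases are collected.

    Returns the chosen indices and the total bases they contain. If the
    library holds fewer than target_bp bases, every read is returned.
    """
    rng = random.Random(seed)  # fixed seed keeps each subsample reproducible
    order = list(range(len(read_lengths)))
    rng.shuffle(order)
    chosen, total = [], 0
    for i in order:
        if total >= target_bp:
            break
        chosen.append(i)
        total += read_lengths[i]
    return chosen, total

# Toy library of 100 reads of 100 bp; two replicate subsamples of 1,000 bp.
lengths = [100] * 100
rep_a, bases_a = subsample_to_target(lengths, 1000, seed=1)
rep_b, bases_b = subsample_to_target(lengths, 1000, seed=2)
```

For paired-end data, the same idea would be applied to fragments rather than single reads so that mates stay together.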


Fig 9.

Integration of our reference-independent metrics shows the top 3 are closest to Mosaik-S.

By considering the N₅₀ length, the proportion of mappable reads, and UCO recovery, we recapitulate the reference-dependent ranking of de novo assemblers.
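
Of the three metrics, the N₅₀ length has a standard definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembled bases. A minimal sketch follows. The function name and example lengths are illustrative, and the caption does not say how the three metrics were combined.

```python
from typing import Sequence

def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a set of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    half = sum(lengths) / 2
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise AssertionError("unreachable for non-empty input")

# Hypothetical contig lengths: total 20, so the half-way point is 10.
example_n50 = n50([2, 3, 4, 5, 6])
```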
