Improved transcriptome assembly using a hybrid of long and short reads with StringTie

doi:10.1371/journal.pcbi.1009730

Fig 1.

A) Artifacts present in the long-read alignments: i) retained introns; ii) disagreement around the splice sites; iii) spurious extra exons; iv) falsely skipped exons; v) false alternative splice sites. B) Example of a human transcript that can only be correctly assembled using both the long and short reads. This is human transcript ENST000000361722.7 from the TBKBP1 gene. Blue lines in the middle of the reads (gray boxes) indicate a spliced alignment. Purple lines within the reads indicate mismatches in the alignment. The long-read alignments do not cover exons 1–3 and contain a retained intron. The short-read alignments lack adequate splice-site support across the 4th and 7th introns and do not completely cover exons 5 and 8.

Fig 2.

A) Sensitivity and precision for StringTie assemblies of simulated data with varying sensitivity parameters. The two StringTie parameters varied were the minimum read coverage allowed for a transcript (-c) and the minimum isoform abundance as a fraction of the most abundant transcript at a given locus (-f). Each shape represents a different combination of -c and -f values, as indicated in the legend. The default values of -c and -f are 1.0 and 0.01, respectively, and are represented by the circle marker. B) Calculated coverage vs. expected coverage for long-read, short-read, and hybrid-read assemblies of simulated data. Coverage values are transformed as log₂(1 + coverage). C) Precision of long-read, short-read, and hybrid-read assemblies of simulated data with and without guide annotation. D) Sensitivity of long-read, short-read, and hybrid-read assemblies of simulated data with and without guide annotation.

Table 1.

Availability of real RNA-seq datasets and descriptions of sequencing technology used including chemistry and basecaller version for ONT datasets.

Fig 3.

Precision and the number of annotated transcripts assembled for 9 real datasets from Arabidopsis thaliana, Mus musculus, and human.

Only loci with long-read expression are considered for these calculations. The circle markers represent assemblies created from uncorrected reads, and the stars represent assemblies created from long reads corrected with TALC. The long- and short-read combinations analyzed from Arabidopsis thaliana were A) ERR3486096 and ERR3764345, B) ERR3486098 and ERR3764349, and C) ERR3486099 and ERR3764351. The long- and short-read combinations analyzed from Mus musculus were D) ERR2680378 and ERR2680375, E) ERR2680378 and ERR2680377, and F) ERR2680380 and ERR2680379. The long- and short-read combinations analyzed from human were G) SRR4235527 and NA12878-cDNA, H) SRR4235527 and SRR4235527, and I) SRR1153470 and SRR1163655.

Fig 4.

A) Precision of assemblies of all real datasets with and without guide annotation. B) Number of annotated transcripts assembled for all real datasets with and without guide annotation.

Fig 5.

Noisy alignments make the splice graph vastly more complicated.

The clean splice graph on the upper right is based on the two error-free transcripts, while the noisy splice graph is based on all four of the transcripts shown on the left. Regions shown in orange are errors due to misalignments.
