Strawberry: Fast and accurate genome-guided transcript reconstruction and quantification from RNA-Seq

doi:10.1371/journal.pcbi.1005851

Fig 1.

Overview of the algorithm of Strawberry, compared to StringTie and Cufflinks.

All methods begin with a set of RNA-Seq alignments and output transcript structures and abundances in GFF/GTF format. Strawberry uses a min-flow algorithm to solve the Constrained Minimum Path Cover (CMPC) problem on the splicing graph, then assigns subexon paths to compatible assembled transcripts. In the quantification step, the EM algorithm operates on all of the RNA-Seq read alignments on each subexon path as a whole.
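
To make the quantification step concrete, the following is a minimal sketch of an EM iteration over read counts grouped by subexon path. It is not Strawberry's implementation: the function name `em_abundances`, the toy counts, and the conditional probabilities P(path | transcript) are all hypothetical, and the estimate is the fraction of read-pairs per transcript rather than FPKM.

```python
def em_abundances(path_counts, cond_prob, n_iter=200):
    """Estimate the fraction of read-pairs originating from each transcript.

    path_counts: {path: number of read-pairs observed on that subexon path}
    cond_prob:   {path: {transcript: P(path | transcript)}}, zero entries omitted
    Both inputs are hypothetical illustrations, not Strawberry data structures.
    """
    transcripts = sorted({t for probs in cond_prob.values() for t in probs})
    theta = {t: 1.0 / len(transcripts) for t in transcripts}  # uniform start
    total = sum(path_counts.values())
    for _ in range(n_iter):
        expected = dict.fromkeys(transcripts, 0.0)
        for path, count in path_counts.items():
            # E-step: share the path's read-pairs among compatible transcripts.
            weights = {t: theta[t] * p for t, p in cond_prob[path].items()}
            norm = sum(weights.values())
            if norm == 0:
                continue
            for t, w in weights.items():
                expected[t] += count * w / norm
        # M-step: new fractions are the normalized expected counts.
        theta = {t: expected[t] / total for t in transcripts}
    return theta


# Toy example: path A is unique to t1, path B is unique to t2, path S is shared.
path_counts = {"A": 20, "B": 10, "S": 70}
cond_prob = {
    "A": {"t1": 0.5},
    "B": {"t2": 0.5},
    "S": {"t1": 0.5, "t2": 0.5},
}
theta = em_abundances(path_counts, cond_prob)
print(theta)  # t1 converges to about 2/3, t2 to about 1/3
```

Grouping reads by subexon path means the inner loop runs once per distinct path rather than once per read, which is the point of treating the alignments on a path "as a whole".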

Fig 2.

Recall and precision at the nucleotide, exon, intron and transcript level.

StringTie, Cufflinks and Strawberry were run on RD100, a simulated Arabidopsis RNA-Seq data set.

Fig 3.

Box plots of F1 scores at the transcript and locus level.

StringTie, Cufflinks and Strawberry were evaluated on GEU, a simulated human RNA-Seq data set.
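
As a reminder of how precision, recall and F1 relate, here is a minimal sketch that scores a predicted feature set against a reference set. The exact-match criterion and the feature tuples are assumptions for illustration; the evaluation in the paper may match features differently.

```python
def precision_recall_f1(predicted, reference):
    """Score predicted features against reference features by exact match."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return precision, recall, f1


# Hypothetical introns as (chromosome, start, end) tuples.
reference = {("chr1", 100, 200), ("chr1", 300, 400), ("chr1", 500, 600),
             ("chr2", 50, 90), ("chr2", 120, 180)}
predicted = {("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 90),
             ("chr2", 125, 180)}
precision, recall, f1 = precision_recall_f1(predicted, reference)
print(precision, recall, round(f1, 3))  # 0.75 0.6 0.667
```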

Fig 4.

Frequency plots of proportional correlation, Spearman correlation, and Mean Absolute Relative Difference (MARD) for the 10 replicates in RD100, a simulated Arabidopsis data set.

Fig 5.

Frequency plots of proportional correlation, Spearman correlation, and Mean Absolute Relative Difference (MARD) for the 6 samples in GEU, a simulated human data set.

These statistics are calculated based on the predicted FPKM values of all reconstructed transcripts and the true FPKM values used in the simulation.
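
Two of these statistics can be sketched in a few lines of plain Python. The captions do not define MARD, so the per-transcript relative difference |x - y| / (x + y), taken as 0 when both values are 0, is an assumption here; other conventions scale the denominator by 0.5. Proportional correlation is left out because its definition is not given in this listing. The FPKM vectors are made up.

```python
def mard(predicted, truth):
    """Mean absolute relative difference, assuming ARD = |x - y| / (x + y)."""
    total = 0.0
    for x, y in zip(predicted, truth):
        if x + y > 0:  # ARD is taken as 0 when both values are 0
            total += abs(x - y) / (x + y)
    return total / len(predicted)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


true_fpkm = [10.0, 0.0, 5.0, 20.0]  # hypothetical simulated values
pred_fpkm = [12.0, 0.0, 3.0, 18.0]  # hypothetical predictions
print(mard(pred_fpkm, true_fpkm), spearman(pred_fpkm, true_fpkm))
```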

Table 1.

Averaged Spearman correlation, proportional correlation, and Mean Absolute Relative Difference (MARD) for the 6 samples in GEU, a simulated human data set.

These statistics are calculated from the true FPKM values used in the simulation and the predicted FPKM values of (1) all reconstructed transcripts or (2) only the reconstructed transcripts that match known transcripts.

Table 2.

Correlation of FPKM values and probe counts on the real HepG2 RNA-Seq data.

NanoString counts were compared to the FPKM values reported by the three programs. The number of probes with matching transcripts is reported on the last line.

Fig 6.

Read alignments and reconstructed transcripts at gene NAT14 using HepG2 data.

A new isoform, transcript.14285.3 (shown in the middle), was identified by Strawberry. The junction reads that support the new AS event (an alternative 3′ splice site) are highlighted. The two ends of a read-pair are shown in the same color. A total of 7 uniquely mapped read-pairs support the novel junction. This figure was made with IGV (http://software.broadinstitute.org/software/igv/).

Fig 7.

Running time in minutes of Cufflinks, Strawberry, Linux word count and StringTie (ordered from slowest to fastest) on RD25 (2.5 million reads), RD100 (10 million reads), and HepG2 data (100 million reads).

Fig 8.

Translation of read alignments into a splicing graph.

(a) Eleven imaginary aligned paired-end reads (or read-pairs) are represented by light blue boxes intersected by solid lines, which indicate splicing junctions, and broken lines, which indicate gap sequences. The coverage plot is shown above the read-pairs; white regions have zero coverage. Below the read-pairs, three primitive exons are shown as purple boxes and five subexons, numbered 1–5, in dark blue. (b) The splicing graph constructed from part (a). The numbered nodes in the splicing graph are the subexons from part (a). Dashed arrows represent non-intron edges and solid arrows represent intron edges. The number next to each edge is its weight (the number of supporting read-pairs). A read-pair that contributes to an edge weight is marked with an asterisk near its upper-left corner. All arrows also indicate the direction of transcription. The source and target nodes of the splicing graph are not shown.
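
The construction described in this caption can be sketched as follows. This is an illustration under stated assumptions, not Strawberry's code: the coordinates, breakpoints and read-pairs are hypothetical (they are not the eleven read-pairs of the figure), intervals are half-open, and each read is given directly as the ordered list of subexons it covers.

```python
from collections import Counter


def split_into_subexons(primitive_exons, breakpoints):
    """Cut each primitive exon (start, end) at the splice sites inside it."""
    subexons = []
    for start, end in sorted(primitive_exons):
        cuts = sorted(b for b in set(breakpoints) if start < b < end)
        bounds = [start] + cuts + [end]
        subexons.extend(zip(bounds[:-1], bounds[1:]))
    return subexons


def build_splicing_graph(subexons, read_pairs):
    """Return {(u, v): (edge type, weight)} with 1-based subexon numbers.

    The weight of an edge is the number of read-pairs supporting it; a
    read-pair is counted once per edge even if both mates cover that edge.
    """
    weights = Counter()
    for pair in read_pairs:
        supported = set()
        for read in pair:
            supported.update(zip(read, read[1:]))
        weights.update(supported)
    graph = {}
    for (u, v), weight in weights.items():
        # Adjacent subexons touch; otherwise the edge spans an intron.
        touching = subexons[u - 1][1] == subexons[v - 1][0]
        graph[(u, v)] = ("non-intron" if touching else "intron", weight)
    return graph


# Three hypothetical primitive exons with two internal splice sites
# yield five subexons, as in the figure.
subexons = split_into_subexons([(100, 200), (300, 400), (500, 600)], [150, 350])
read_pairs = [
    ([1, 2], [3]),
    ([1, 2], [2, 3]),
    ([1], [1, 3]),     # junction that skips subexon 2
    ([3, 4], [5]),
    ([4, 5], [5]),
    ([3, 5], [5]),     # junction that skips subexon 4
]
graph = build_splicing_graph(subexons, read_pairs)
for edge in sorted(graph):
    print(edge, graph[edge])
```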

Fig 9.

An input flow network with a subpath constraint {2-4-7}.

(a) The number next to each edge is the edge cost. For every edge e, the edge constraint implies 1 ≤ f(e) ≤ ∞. (b) The transformed min-flow circulation network. The 2-tuple (a, b) next to each edge gives the optimal flow on the edge and the edge cost, respectively. After Step 3, the path constraint set is P^sub = {(1, 2), (1, 3), (2, 4, 7), (4, 5), (4, 6), (5, 8), (6, 8), (7, 8)}. The two edges that are no longer in the constraint set are shown in green. For these two edges, the minimum flow requirement is 0; for the remaining edges, it is 1. Two dummy nodes, s and t, are added to complete the circulation. The number of paths obtained by flow decomposition equals the minimum flow, which is 3.
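
The minimum flow value in this caption can be checked with a small sketch of minimum flow with lower bounds on a DAG. Several things here are assumptions rather than the paper's method: edge costs are ignored (only the flow value is minimized), nodes 1 and 8 serve directly as source and sink instead of the dummy nodes s and t, an edge (3, 4) is assumed because the caption does not show how node 3 connects onward, and the subpath constraint {2-4-7} is encoded as a hypothetical shortcut edge (2, 7) with lower bound 1 while (2, 4) and (4, 7) get lower bound 0.

```python
from collections import defaultdict, deque


def find_path(adj, src, dst):
    """Breadth-first search; returns a list of nodes from src to dst, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def min_flow(lower, s, t):
    """Minimum s-t flow with f(e) >= lower[e] and unbounded capacities.

    Assumes a DAG in which every node lies on some s-t path.
    """
    adj = defaultdict(list)
    for u, v in lower:
        adj[u].append(v)
    # Step 1: a feasible flow, routing each required unit along s ~> u -> v ~> t.
    flow = dict.fromkeys(lower, 0)
    for (u, v), need in lower.items():
        if need > 0:
            path = find_path(adj, s, u) + find_path(adj, v, t)
            for edge in zip(path, path[1:]):
                flow[edge] += need
    # Step 2: cancel flow along t ~> s paths in the residual network.
    while True:
        residual = defaultdict(list)
        for (u, v), f in flow.items():
            residual[u].append(v)          # flow on an edge may always grow
            if f > lower[(u, v)]:
                residual[v].append(u)      # flow may shrink down to the bound
        path = find_path(residual, t, s)
        if path is None:
            break
        for a, b in zip(path, path[1:]):
            if (b, a) in flow:             # backward arc: reduce flow on (b, a)
                flow[(b, a)] -= 1
            else:                          # forward arc: increase flow on (a, b)
                flow[(a, b)] += 1
    value = sum(f for (u, _), f in flow.items() if u == s)
    return value, flow


lower = {
    (1, 2): 1, (1, 3): 1,
    (3, 4): 1,             # assumed edge, not listed in the caption
    (2, 4): 0, (4, 7): 0,  # the two relaxed edges
    (2, 7): 1,             # hypothetical shortcut standing in for subpath 2-4-7
    (4, 5): 1, (4, 6): 1, (5, 8): 1, (6, 8): 1, (7, 8): 1,
}
value, flow = min_flow(lower, s=1, t=8)
print(value)  # 3
```

Under these assumptions the result agrees with the caption: three units must leave node 4 and the shortcut, so three paths are needed.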

Fig 10.

(a) A gene with three subexons and two isoforms. The length of i1 is 260 bp and that of i2 is 200 bp. A paired-end read (or read-pair) is represented by light blue boxes intersected by broken lines, which indicate gap sequences. The read length is 2 × 50 bp. (b) The subexon path {s₁, s₃} applies to both isoforms. On i1, this subexon path implies three subexons, with the middle one shown in gray. Consider a fixed-size fragment with a gap of 75 bp (shown in gray) and a total fragment length of 175 bp. This particular fragment can arise from 16 different positions on subexon path {s₁, s₃} of i1 and from 26 different positions on subexon path {s₁, s₃} of i2.
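
The two position counts can be reproduced by brute force. The isoform lengths imply that the middle subexon is 260 − 200 = 60 bp; the individual lengths of s₁ and s₃ are not stated, so 100 bp each is an assumption in this sketch, as is the function name `count_positions`. A fragment is counted when its two reads together cover exactly the subexons on the path (the gap is ignored).

```python
def count_positions(isoform, path, read_len, gap):
    """Count fragment start positions whose reads cover exactly `path`.

    isoform: ordered list of (subexon name, length in bp)
    """
    # Label every base of the isoform with the subexon it belongs to.
    labels = [name for name, length in isoform for _ in range(length)]
    frag_len = 2 * read_len + gap
    count = 0
    for start in range(len(labels) - frag_len + 1):
        read1 = labels[start:start + read_len]
        read2 = labels[start + read_len + gap:start + frag_len]
        if set(read1) | set(read2) == set(path):
            count += 1
    return count


# s1 = s3 = 100 bp is assumed; s2 = 60 bp follows from the isoform lengths.
i1 = [("s1", 100), ("s2", 60), ("s3", 100)]  # 260 bp
i2 = [("s1", 100), ("s3", 100)]              # 200 bp
on_i1 = count_positions(i1, ("s1", "s3"), read_len=50, gap=75)
on_i2 = count_positions(i2, ("s1", "s3"), read_len=50, gap=75)
print(on_i1, on_i2)  # 16 26
```

On i1 the 75 bp gap must contain all 60 bp of the middle subexon, leaving 75 − 60 + 1 = 16 placements; on i2 every placement of the 175 bp fragment qualifies, giving 200 − 175 + 1 = 26.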
