Informatics for RNA Sequencing: A Web Resource for Analysis on the Cloud

doi:10.1371/journal.pcbi.1004393

Fig 6

Comparison of stranded and unstranded RNA-seq library methods and their influence on interpretation and analysis.

(A) Many RNA-seq library construction protocols do not maintain the strand identity of RNA transcripts in the sequence data (S1 Table). In these “unstranded” strategies, double-stranded cDNA is sequenced, and knowledge of the transcription strand of the RNA molecule is lost. The result is a roughly even mix of reads from both strands. In panel A, a gene transcribed on the positive strand is shown in green, a second gene transcribed on the negative strand is shown in brown, and a third gene transcribed on the positive strand (partially overlapping the second gene) is shown in yellow. The first two genes are protein coding, with the open reading frame (ORF) portion depicted as thick rectangles and the UTRs depicted as thin rectangles. The third gene is a noncoding RNA gene. Aligned paired-end read sequences (read 1 and read 2) are depicted as short colored bars connected by thin lines. The thin connecting line in each read pair depicts the portion of the cDNA fragment that remains unsequenced when the fragment is longer than twice the read length. Each read is colored according to the strand sequenced: blue for the positive (forward/sense) strand and red for the negative (reverse/antisense) strand. Using known annotations, the mapped position of each read, and knowledge of exon splicing patterns, the likely transcription strand of some reads can be inferred. However, for many aligned reads the transcription strand cannot be inferred, and sense-antisense expression analysis is not possible. Note that for each gene, approximately equal proportions of reads from each strand are observed. Also note that read pairing information can sometimes be used to infer which gene a read was likely derived from. This is the case for “encompassing” read pairs, in which one read of a pair aligns within one exon and the second read of the pair aligns within another exon. However, reads that align within a region where genes overlap cannot be unambiguously assigned to either gene (e.g., the overlapping portion of the brown and yellow genes). Note that this figure does not depict any reads in which a single read of a read pair spans an intron. These exon–exon “spanning” reads can usually be matched unambiguously to a transcript, even in an unstranded library, because the exon–exon junction alignments line up with known splice sites and exon boundaries.

(B) More recent “stranded” RNA-seq library strategies allow the strand information to be retained. In the resulting alignments, depicted in panel B, the strand of the alignment corresponds in a predictable way to the transcription strand of the sequenced RNA molecule. Reads aligning within a gene are now shown as derived from the expected transcription strand for that gene. Furthermore, in regions where two genes overlap on opposite strands, reads can now be unambiguously assigned to each gene.

(C) When strand information is maintained by the RNA-seq protocol, it can be visualized in genome browsers such as IGV [62]. For example, to make IGV color read alignments according to strand, use the “Color alignments” by “First-of-pair strand” setting (refer to S5 Table for more strand-related software settings).
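The strand logic described in panels A and B can be illustrated with a short Python sketch. This is a minimal example, not the authors' method. The protocol labels (`READ1_SENSE`, `READ1_ANTISENSE`, `UNSTRANDED`), the functions `transcript_strand` and `candidate_genes`, and the gene names are all hypothetical. The sketch assumes properly paired reads whose two mates align to opposite strands, and a protocol in which read 1 is consistently either sense or antisense to the RNA; which class a given protocol belongs to should be checked against its documentation.

```python
from typing import Dict, List, Optional

# Hypothetical protocol labels used only in this sketch.
READ1_SENSE = "read1-sense"          # read 1 aligns to the transcript's strand
READ1_ANTISENSE = "read1-antisense"  # read 1 aligns to the opposite strand
UNSTRANDED = "unstranded"            # strand identity lost during library prep


def _flip(strand: str) -> str:
    """Return the opposite genomic strand."""
    return "-" if strand == "+" else "+"


def transcript_strand(is_read1: bool, is_reverse: bool, library: str) -> Optional[str]:
    """Infer the transcription strand ('+' or '-') from one aligned read.

    Returns None for unstranded libraries, where the alignment strand
    carries no information about the transcription strand.
    """
    if library == UNSTRANDED:
        return None
    if library not in (READ1_SENSE, READ1_ANTISENSE):
        raise ValueError(f"unknown library type: {library}")
    aligned = "-" if is_reverse else "+"
    # Assumption: mates of a proper pair align to opposite strands, so the
    # first-of-pair strand is the flipped alignment strand of read 2.
    first_of_pair = aligned if is_read1 else _flip(aligned)
    return first_of_pair if library == READ1_SENSE else _flip(first_of_pair)


def candidate_genes(overlapping: Dict[str, str], strand: Optional[str]) -> List[str]:
    """List the genes a read could derive from, given gene name -> strand.

    With no strand information every overlapping gene remains a candidate.
    """
    if strand is None:
        return sorted(overlapping)
    return sorted(gene for gene, s in overlapping.items() if s == strand)


# Genes overlapping on opposite strands, as in the brown/yellow example.
genes = {"brown": "-", "yellow": "+"}
stranded = transcript_strand(is_read1=True, is_reverse=True, library=READ1_ANTISENSE)
print(candidate_genes(genes, stranded))  # ['yellow']
print(candidate_genes(genes, None))      # ['brown', 'yellow']
```

With an unstranded library the read in the overlap region remains a candidate for both genes, whereas the stranded call narrows it to one. Both mates of a pair yield the same inferred strand, which is why coloring by first-of-pair strand gives each gene a uniform color.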

doi: https://doi.org/10.1371/journal.pcbi.1004393.g006