High-Resolution Transcriptome Analysis with Long-Read RNA Sequencing

doi:10.1371/journal.pone.0108095

Figure 1.

The effect of read length on read-mapping performance.

We compared the percentage of reads uniquely mapped (a) and the average runtime per million reads (b) of Bowtie2, TopHat2, STAR, and GSNAP on L262 and L75. Only the splice mappers (TopHat2, STAR, and GSNAP) achieved higher alignment rates for L262 than for L75. The increase in runtime from L75 to L262 varied greatly across the mappers, with TopHat2 being the most sensitive to read length of the four mappers tested.


Figure 2.

The effect of read length on RNA genotype concordance.

Comparison of discordance p-values ( transformed) between L262 and base-subsampled L75, stratified by mapper and read depth. Lower p-values (plotted at the top) represent stronger disagreement between the observed and expected genotypes due to inaccurate mapping. After aligning reads to the reference without genotype annotations, sites with known genotypes were stratified into four groups based on read depth in each dataset: 1) 10–99, 2) 100–499, 3) 500–999, and 4) ≥1000 reads. For each mapper, at each of the four read depths, a one-sided Wilcoxon rank-sum test was conducted to determine whether the p-values for L262 were significantly higher than those for L75. The behavior of GSNAP agreed with our expectation: L262 displayed improved concordance over L75 across all read-depth bins. The red horizontal line represents a significance level of . Each symbol represents an independently subsampled dataset from L75.
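
The stratified test described in this caption can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names are hypothetical, the bin boundaries are taken from the caption, and the rank-sum test uses a normal approximation without a tie correction to the variance, which is an assumption made here for brevity.

```python
import math

# Read-depth bins from the caption: 10-99, 100-499, 500-999, >=1000.
DEPTH_BINS = [(10, 99), (100, 499), (500, 999), (1000, None)]


def depth_bin(depth):
    """Return the index of the read-depth bin, or None if depth is below 10."""
    for index, (low, high) in enumerate(DEPTH_BINS):
        if depth >= low and (high is None or depth <= high):
            return index
    return None


def rank_sum_greater(x, y):
    """One-sided Wilcoxon rank-sum test (normal approximation).

    Returns the p-value for the alternative that values in x tend to be
    larger than values in y. Both samples must be non-empty.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        # Find the run of tied values and give each the midrank.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))
```

In use, the discordance p-values of the sites in each depth bin would be passed as `x` (L262) and `y` (L75), one call per mapper and bin.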


Figure 3.

Most of the differences in gene quantification arise from poorly mappable genes.

(a, b) Log scatter plots of the fraction of reads mapped to each gene in L262 versus L75. Plotting only genes that are not pseudogenes and have perfect mappability scores yields a scatter plot with near-perfect correlation. Scatter plots that show only protein-coding genes can be found in Figure S9. (c) Genes that were observed in both libraries and are at least 500 bp long were divided into five groups according to their mean mappability score, obtained by summarizing the 75 bp mappability score track on the UCSC Genome Browser. For each group, we computed the fraction of genes that had a higher proportion of reads in L262 than in L75. The groups with lower mappability scores contained a higher proportion of genes that were better represented in L262. The same trend was observed when we restricted the analysis to protein-coding genes.
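
The grouping in panel (c) can be illustrated with a short sketch. The caption does not state how the five groups were delimited, so the equal-width bins on a score range of [0, 1] used here are an assumption, as are the function name and the input layout.

```python
def fraction_higher_by_bin(genes, n_bins=5):
    """Fraction of genes per mappability bin with more reads in the long library.

    genes: iterable of (mean_mappability, frac_long, frac_short) tuples, where
    mean_mappability lies in [0, 1] and the two fractions are the share of
    mapped reads assigned to the gene in each library.
    Equal-width bins are an assumption; empty bins yield None.
    """
    higher = [0] * n_bins
    total = [0] * n_bins
    for score, frac_long, frac_short in genes:
        # A score of exactly 1.0 falls into the last bin.
        bin_index = min(int(score * n_bins), n_bins - 1)
        total[bin_index] += 1
        if frac_long > frac_short:
            higher[bin_index] += 1
    return [h / t if t else None for h, t in zip(higher, total)]
```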


Figure 4.

Longer reads are consistent with fewer mRNA isoforms.

If all the bases of a read that maps to a gene are part of the same isoform, this common isoform (which does not have to be unique) is said to be consistent with the read. We compared the cumulative distribution of the number of consistent isoforms of reads in L75 versus reads in L262 (a) and across read groups within L262 (b) divided according to the number of reference bases covered by the alignment.
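
The definition of a consistent isoform given above can be expressed in a few lines. This is a sketch under stated assumptions: coordinates are half-open (start, end) intervals on the reference, isoforms are given as lists of exon intervals, and the function names are hypothetical. Expanding intervals into position sets is simple but not efficient for genome-scale data.

```python
def covered_bases(blocks):
    """Expand half-open (start, end) intervals into a set of positions."""
    return {pos for start, end in blocks for pos in range(start, end)}


def consistent_isoforms(read_blocks, isoforms):
    """Return names of isoforms that contain every aligned base of the read.

    read_blocks: aligned blocks of one read (or read pair).
    isoforms: dict mapping isoform name to its list of exon intervals.
    """
    read_bases = covered_bases(read_blocks)
    return [
        name
        for name, exons in isoforms.items()
        if read_bases <= covered_bases(exons)
    ]
```

A read that covers more reference bases has more chances to include a base that some isoform lacks, which is why the number of consistent isoforms tends to fall as coverage grows.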


Figure 5.

Comparison of Cufflinks and FluxCapacitor on transcript quantification.

We used transcript counts from L262 to assess the performance of (a) Cufflinks and (b) FluxCapacitor on L75. Cufflinks and FluxCapacitor were each applied to base-subsampled L75 to generate FPKM/RPKM values. On L262, we counted only reads that mapped unambiguously to each transcript, and divided each count by the number of bases unique to that transcript. We restricted our analysis to 41,539 transcripts of 12,964 protein-coding genes with multiple isoforms that have at least ten nucleotides uniquely assigned to any single transcript. Using this approach, we found that Cufflinks correlates significantly better with the long-read data than FluxCapacitor does (p-value ).
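
The long-read reference value described here, unambiguous read count divided by transcript-unique bases, can be sketched as below. The function name and dictionary inputs are hypothetical, and the ten-base threshold is applied per transcript as one reading of the caption.

```python
def unique_base_density(unique_read_counts, unique_bases, min_unique_bases=10):
    """Reads per unique base for each transcript with enough unique sequence.

    unique_read_counts: dict of transcript id -> reads mapped unambiguously.
    unique_bases: dict of transcript id -> bases found in no other transcript.
    Transcripts with fewer than min_unique_bases unique bases are skipped;
    transcripts with no unambiguous reads get a density of 0.
    """
    return {
        transcript: unique_read_counts.get(transcript, 0) / n_bases
        for transcript, n_bases in unique_bases.items()
        if n_bases >= min_unique_bases
    }
```

The resulting densities would then be compared against the FPKM/RPKM values reported by each quantification tool.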


Figure 6.

Longer reads have a higher chance of containing a heterozygous site that allows us to differentiate the haplotype of origin.

We compared the proportion of reads unambiguously assigned to maternal or paternal haplotypes between L75 and L262 (a) and within L262 (b), where reads in L262 were grouped by the number of reference bases covered by the alignment of (possibly overlapping) paired-end reads. The observed data in (b) were extrapolated to even longer read lengths using a Poisson model. To provide additional support for the extrapolation, we directly approximated the proportion of reads containing at least one heterozygous site based on the GENCODE annotation and the genotype of this individual, assuming that the true gene expression levels are the same as those observed in L262, that all mRNA isoforms of a gene are uniformly expressed, and that each starting location of a transcript is equally likely to be included in the library. We considered only reads that mapped to autosomal chromosomes. Our approximation agreed well with both the observed data and the extrapolation.
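
The caption does not spell out the Poisson model, so the following is one plausible sketch rather than the authors' formulation. It assumes the number of heterozygous sites in a read covering L bases is Poisson with mean rate × L, so that the probability of at least one site is 1 − exp(−rate × L). The rate is fitted by least squares through the origin on the transformed proportions; both the fitting method and the function names are assumptions.

```python
import math


def fit_rate(lengths, proportions):
    """Fit a per-base heterozygous-site rate from observed proportions.

    Under the assumed model, -log(1 - p) = rate * L, so the rate is the
    slope of a least-squares line through the origin.
    Proportions must be strictly below 1.
    """
    transformed = [-math.log(1.0 - p) for p in proportions]
    numerator = sum(l * y for l, y in zip(lengths, transformed))
    denominator = sum(l * l for l in lengths)
    return numerator / denominator


def predict_proportion(rate, length):
    """Expected proportion of reads with at least one heterozygous site."""
    return 1.0 - math.exp(-rate * length)
```

After fitting on the coverage groups observed within the long-read library, `predict_proportion` can be evaluated at lengths beyond those observed to produce an extrapolated curve.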


Figure 7.

Longer reads enable more effective detection of allele-specific patterns.

To make a fair comparison, L75 was randomly subsampled () down to the same number of bases (L75B) and the same number of reads (L75R) as L262. We ran the subsampled libraries and L262 through the pipeline for detecting allele-specific gene expression and allele-specific exon inclusion-exclusion patterns. The figures above show the number of significant cases we called at FDR = 10% using the Benjamini and Hochberg procedure [26].
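
For reference, the Benjamini and Hochberg step-up procedure used to call significant cases can be sketched as follows. The function name is hypothetical and the default threshold of 0.10 mirrors the FDR stated in the caption; this is a generic implementation of the procedure, not code from the study.

```python
def benjamini_hochberg(pvalues, fdr=0.10):
    """Return a list of booleans marking which hypotheses are rejected.

    Step-up rule: find the largest rank k such that the k-th smallest
    p-value is at most (k / m) * fdr, then reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if pvalues[index] <= rank / m * fdr:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[index] = True
    return rejected
```

Because the rule is step-up, a p-value that misses its own threshold is still rejected when a larger p-value meets a later threshold.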


Figure 8.

Examples of allele-specific isoform distribution detected with L262.

Shown are the IGV [31] coverage plots of (a) genes corresponding to the top two significant cases (raw p-value ) and (b) two significant cases (FDR = 10%) with strong prior support from a heterozygous site at a known sQTL from a large population-based study. The implicated sQTL and the exon known to be affected are marked by a red line and a black box, respectively. Red asterisks above each plot denote blocks with a significant differential inclusion-exclusion pattern (FDR = 10%).
