HMMSplicer: A Tool for Efficient and Sensitive Discovery of Known and Novel Splice Junctions in RNA-Seq Data

doi:10.1371/journal.pone.0013875

Figure 1.

HMMSplicer pipeline.

After reads with full-length alignments to the genome are removed, the remaining reads are divided in half and each half is aligned to the genome (step 1 as defined in the Materials and Methods). The HMM is trained using a subset of the read-half alignments (step 2a). The HMM bins quality scores into five levels; only three are shown in this overview for simplicity, and the values for all five levels can be found in Table 1. The trained HMM is then used to determine the splice position within each read-half alignment (step 2b). The remaining second piece of the read is then matched downstream to find the other intron edge (step 3). The initial set of splice junctions then proceeds to the rescue step (step 4) and the filter-and-collapse step (step 5) to generate the final set of splice junctions.
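
Steps 1 and 2a can be illustrated with a minimal sketch that splits a read into halves and maps each quality score to one of five levels. The function names and the bin edges below are hypothetical placeholders for illustration only; the actual values used by HMMSplicer are those given in Table 1, and this is not the tool's own code.

```python
def split_read(seq, qual):
    """Split a read and its per-base quality values into two halves.

    For odd-length reads the second half receives the extra base
    (an arbitrary choice made for this sketch).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality must have the same length")
    mid = len(seq) // 2
    return (seq[:mid], qual[:mid]), (seq[mid:], qual[mid:])


def bin_quality(q, edges=(10, 20, 30, 40)):
    """Map a numeric quality score to one of five levels, 0 (lowest) to 4.

    The edges are assumed placeholder values, not the published ones.
    """
    level = 0
    for edge in edges:
        if q >= edge:
            level += 1
    return level


if __name__ == "__main__":
    first, second = split_read("ACGTACGTAC", [35, 35, 30, 28, 22, 18, 15, 12, 8, 5])
    print(first[0], [bin_quality(q) for q in first[1]])
    print(second[0], [bin_quality(q) for q in second[1]])
```

Each half would then be passed to a short-read aligner, and the binned quality levels would serve as part of the HMM's observations.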

Figure 2.

Algorithm parameters.

a) Percent of oligos able to map within a genome as a function of oligo size. The solid lines show the percentages if oligos are allowed to map up to 50 times within the genome (the value used in HMMSplicer seeding). The dashed lines show the percentages if a unique match is required. b) HMM training. The values for the two most variable parameters of the HMM are shown here, with the x-axis representing different training set sizes and initial HMM parameters. The error bars show the standard deviation of ten repetitions of training. HMMSplicer uses a training subset size of 10,000. c) Effect of the size, in bases, of the second piece of the read. The percent of second pieces uniquely mapping within 80 kbp of the first piece increases as the size of the second piece increases, while the percent of second pieces mapping to multiple locations decreases.

Table 1.

HMM Parameter Values.

Table 2.

Simulation Results.

Table 3.

Datasets.

Figure 3.

Simulation results.

(a) Results for HMMSplicer and TopHat for 50 and 75 bp reads. Although values are similar at higher coverage levels, HMMSplicer exhibits substantial increases in sensitivity at lower coverage levels. (b) ROC curve for the 50 bp simulation results at 1×, 10×, and 50× coverage demonstrates that HMMSplicer's scoring algorithm accurately discriminates between true and false junctions. The number in parentheses is the area under the curve for each coverage level.
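
The area under a ROC curve equals the probability that a randomly chosen true junction scores higher than a randomly chosen false one, with ties counted as one half. The sketch below computes the area that way for a small set of scored junctions. The function name and the example scores are illustrative assumptions, not values or code from the simulation.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from junction scores and truth labels.

    labels[i] is True when junction i is a true junction. The area is
    computed as the fraction of (true, false) pairs in which the true
    junction has the higher score; ties count as half.
    """
    pos = [s for s, is_true in zip(scores, labels) if is_true]
    neg = [s for s, is_true in zip(scores, labels) if not is_true]
    if not pos or not neg:
        raise ValueError("need at least one true and one false junction")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical scores: two true junctions, two false junctions.
    print(roc_auc([900, 300, 800, 200], [True, True, False, False]))
```

The pairwise loop is quadratic in the number of junctions, so a rank-based computation would be preferable for large result sets.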

Figure 4.

Overview of HMMSplicer and TopHat results in (a) A. thaliana, (b) P. falciparum, and (c) H. sapiens.

For each dataset, HMMSplicer results are shown at five different score thresholds. The numbers on the bottom axis (200 to 600) are the thresholds for junctions with multiple reads; the threshold was set 200 points higher for junctions with a single read. The * indicates HMMSplicer's default score threshold. SpliceMap results are shown for the A. thaliana dataset only, as SpliceMap cannot be run on datasets with reads shorter than 50 nt. For P. falciparum, TopHat was run with two different parameter sets. TopHat A was run with a segment length of 23, resulting in more junctions but lower specificity, whereas TopHat B used the default segment length of 25, resulting in fewer junctions with higher specificity.
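
The two-tier threshold described in this caption can be sketched as follows. The 200-point offset for single-read junctions comes from the caption; the function name, the inclusive comparison, and the example values are assumptions made for illustration.

```python
SINGLE_READ_OFFSET = 200  # single-read junctions need 200 more points (from the caption)


def passes_threshold(score, n_reads, multi_read_threshold):
    """Return True when a junction's score meets the applicable threshold.

    Junctions supported by one read are held to a threshold 200 points
    higher than junctions supported by multiple reads. Treating the
    threshold as inclusive (>=) is an assumption of this sketch.
    """
    if n_reads < 1:
        raise ValueError("a junction needs at least one supporting read")
    threshold = multi_read_threshold
    if n_reads == 1:
        threshold += SINGLE_READ_OFFSET
    return score >= threshold


if __name__ == "__main__":
    # Hypothetical junctions evaluated at a multi-read threshold of 400.
    print(passes_threshold(450, 3, 400))
    print(passes_threshold(450, 1, 400))
```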

Figure 5.

Human results compared by transcript abundance.

Transcript abundance was measured as Reads Per Kilobase per Million reads mapped (RPKM), and genes were binned by RPKM to show the number of RefSeq junctions found at different levels of transcript abundance. For genes with an RPKM less than 10, HMMSplicer found 76.2% more junctions than TopHat, whereas for genes with an RPKM above 50, HMMSplicer found only 6.7% more junctions. Although a small number of highly expressed genes dominate the mRNA population, 74.8% of genes have RPKM values below 10.
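
For reference, RPKM follows directly from its name: the read count for a gene divided by the transcript length in kilobases and by the total mapped reads in millions. The sketch below uses that standard definition; the function name and the example numbers are illustrative and are not taken from the dataset.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million reads mapped.

    Equivalent to read_count / (length_in_kb * mapped_reads_in_millions).
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Hypothetical gene: 100 reads on a 2,000 bp transcript, 10 million mapped reads.
    print(rpkm(100, 2000, 10_000_000))
```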

Figure 6.

Alternative 5′ and 3′ splice sites.

HMMSplicer results within 15 bp of RefSeq introns were analyzed to measure the number of bases added to or removed from the spliced transcript. There were 997 instances where the intron had an alternate 5′ splice site (5′SS, shown in grey) and 2,577 instances of an alternate 3′ splice site (3′SS, shown in black). The most common alternative splice was 3 bases removed from or added to the exon at the 3′SS. TopHat results showed a similar pattern, though only 875 alternates (262 5′SS alternates and 613 3′SS alternates) were found, fewer than a quarter of the HMMSplicer results. WebLogos were constructed from the sequences at the 1,099 alternate 3′SS with three bases removed from the transcript and the 460 alternate 3′SS with three bases added to the transcript. For these, the green dashed line shows the alternate splice site while the red dashed line shows the canonical splice site. In both cases, a repetition of the YAG splice motif is evident.

Figure 7.

XBP1 non-canonical intron.

HMMSplicer discovers the non-canonical XBP1 intron. HMMSplicer identifies three reads containing the non-canonical CA-AG splice site in XBP1. Because the reads are fairly evenly split, both read-halves aligned to the genome. The edges identified by HMMSplicer are 2 and 4 bp off from the actual splice site because the sequence at the beginning of the intron repeats the sequence at the beginning of the subsequent exon. When identical junctions are collapsed, there are two junctions, one with a score of 1024 and one with a score of 1030, which puts them in the top 0.5% of the collapsed non-canonical junctions.

Figure 8.

Experimental confirmation of predicted Plasmodium falciparum splice junctions.

Schematics of the predicted splice junctions and sequenced RT-PCR products for a) PFC0285c, b) PF07_0101, and c) PFD0185c. For PFC0285c, the verified junction likely splices an additional exon in the 5′UTR to the coding region of the gene. The confirmed junction in PF07_0101 splices out 291 nt (97 aa) from the first exon, which could represent an alternative protein-coding isoform or an error in the gene model. The demonstrated junctions in PFD0185c excise 85 bp near the 3′ end of the gene, causing a frameshift, and appear to splice together two exons within the 3′UTR of the gene. Again, the junction within the gene model may represent an alternative splicing event or an error in the gene model. ESTs near all three areas are included to provide the direction of the genes.

Table 4.

Primer Sequences.
