Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning

doi:10.1371/journal.pcbi.0030020

Figure 1.

Simplified Support Vector Machine

Learn a function f such that the difference of predictions (the margin) of positively and negatively labeled examples is maximal. Previously unseen examples will often be close to the training examples. The large margin then ensures that these examples are correctly classified as well, i.e., the decision rule generalizes.
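As a minimal illustration of the margin described in this caption, the sketch below compares two linear functions on toy data. The data points, the weight vectors, and the helper names `linear` and `margin` are all hypothetical; both weight vectors have unit norm so that the two margins are comparable. A real SVM would optimize this quantity rather than compare hand-picked candidates.

```python
def linear(w):
    """Return f(x) = <w, x> for a fixed weight vector w (no bias term)."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))

def margin(f, positives, negatives):
    """Gap between the lowest-scoring positive and the highest-scoring negative."""
    return min(f(x) for x in positives) - max(f(x) for x in negatives)

# Hypothetical 2-D toy examples.
positives = [(2.0, 2.0), (3.0, 1.0)]
negatives = [(0.0, 0.0), (-1.0, 1.0)]

# Two candidate functions with unit-norm weights (assumed for comparability).
f_a = linear((1.0, 0.0))
f_b = linear((0.6, 0.8))

margin_a = margin(f_a, positives, negatives)
margin_b = margin(f_b, positives, negatives)

# The function with the larger margin separates the two classes more robustly.
best = "f_b" if margin_b > margin_a else "f_a"
```

Under these assumptions, `f_b` leaves more room between the two classes, so an unseen example lying close to a training example is less likely to end up on the wrong side.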


Figure 2.

Given Two Sequences, s₁ and s₂ of Equal Length, Our Kernel Consists of a Weighted Sum to Which Each Match in the Sequences Makes a Contribution w_l Depending on Its Length l, Where Longer Matches Contribute More Significantly

For predictions, we use a window of 140 nt around the potential splice site (cf. Materials and Methods for details, including the procedure of how the length of the window is determined).
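The kernel described in this caption can be sketched as follows. This is a minimal version under stated assumptions: the function name `wd_kernel`, the maximum substring length `max_degree`, and the per-length weights `beta` are illustrative choices, since the caption does not give the weights. Every matching substring of length d at the same position in both sequences adds `beta` for that d, so a long matching block, which contains many shorter matches, contributes more than several short blocks covering the same number of positions.

```python
def wd_kernel(s1, s2, max_degree=3):
    """Weighted sum over position-wise matching substrings of length 1..max_degree."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    total = 0.0
    for d in range(1, max_degree + 1):
        # Assumed weighting scheme (not taken from the caption):
        # shorter substrings get larger individual weights, weights sum to 1.
        beta = 2.0 * (max_degree - d + 1) / (max_degree * (max_degree + 1))
        # Count positions where the length-d substrings of both sequences agree.
        matches = sum(
            1 for i in range(len(s1) - d + 1) if s1[i:i + d] == s2[i:i + d]
        )
        total += beta * matches
    return total

# Both comparisons share four matching positions with the reference,
# once as a single block of length 4 and once as two blocks of length 2.
reference = "ACGTAC"
one_long_block = wd_kernel(reference, "ACGTTT")
two_short_blocks = wd_kernel(reference, "ACTAAC")
```

With these toy sequences the single block of length 4 scores higher than the two blocks of length 2, which matches the statement that longer matches contribute more.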


Figure 3.

Given the Start of the First and the End of the Last Exon, Our System (mSplicer) First Scans the Sequence Using SVM Detectors Trained To Recognize Donor (SVM_GY) and Acceptor (SVM_AG) Splice Sites

The detectors assign a score to each candidate site, shown below the sequence. In combination with additional information, including the outputs of SVMs recognizing exon/intron content and scores for exon/intron lengths (unpublished data), these splice site scores contribute to the cumulative score for a putative splicing isoform. The bottom graph (step 2) illustrates the computation of the cumulative scores for two splicing isoforms, where the score at the end of the sequence is the final score of the isoform. The contributions of the individual detector outputs, the segment lengths, and the segment properties to the score are adjusted during training. They are optimized such that the margin between the true splicing isoform (shown in blue) and all other (wrong) isoforms (one of which is shown in red) is maximized. New sequences are predicted by selecting the splicing isoform with the maximum cumulative score. This can be implemented using dynamic programming related to the decoding of generalized HMMs [12], which also allows one to enforce certain constraints on the isoform (e.g., an open reading frame).
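The selection of the maximum-scoring isoform can be sketched with a small dynamic program. This is a deliberately reduced version and not the mSplicer implementation: it uses only splice site scores, ignores the content and length terms and the learned weighting, and assumes all candidate positions are distinct. The function name `best_isoform` and the example scores are hypothetical.

```python
def best_isoform(donors, acceptors):
    """Pick the alternating donor/acceptor chain with the highest summed score.

    donors, acceptors: dict mapping candidate position -> detector score.
    Returns (score, introns), where introns is a list of (donor, acceptor) pairs.
    """
    sites = sorted(
        [(pos, "don", s) for pos, s in donors.items()]
        + [(pos, "acc", s) for pos, s in acceptors.items()]
    )
    best_exon = (0.0, [])   # best path that is currently inside an exon
    best_intron = None      # best path currently inside an intron: (score, introns, open_donor)
    for pos, kind, score in sites:
        if kind == "don":
            # Leave the exon at this donor and open an intron.
            cand = (best_exon[0] + score, best_exon[1], pos)
            if best_intron is None or cand[0] > best_intron[0]:
                best_intron = cand
        elif best_intron is not None:
            # Close the open intron at this acceptor and re-enter an exon.
            cand = (best_intron[0] + score, best_intron[1] + [(best_intron[2], pos)])
            if cand[0] > best_exon[0]:
                best_exon = cand
    return best_exon        # a valid isoform must end inside the last exon

# Hypothetical detector outputs for three donor and three acceptor candidates.
score, introns = best_isoform(
    donors={10: 1.5, 40: -0.5, 70: 2.0},
    acceptors={30: 1.0, 60: -2.0, 90: 1.2},
)
```

Because this reduced score has no length-dependent terms, keeping one best path per state is sufficient. A segment-based score, as in generalized HMM decoding, would instead require considering all predecessor sites for each transition.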


Figure 4.

An Elementary State Model for Unspliced mRNA

The 5′ end of the transcript is followed either directly by the 3′ end (single exon gene) or by an arbitrary number of donor–acceptor splice site pairs exhibiting the GT/GC and AG dimers. A transition in this state model corresponds to accepting a whole segment (as in generalized HMMs [12]), i.e., an exon or intron, with the corresponding dimer at the 3′ boundary of the segment (except in state 4).
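The constraints of this state model can be expressed as a simple validity check on a proposed segmentation. The sketch below is an illustration only: the function name `is_valid_splice_form`, the half-open coordinate convention, and the toy sequence are assumptions.

```python
def is_valid_splice_form(seq, introns):
    """Check a segmentation against the exon/intron alternation and dimer rules.

    seq: unspliced transcript as a string over A, C, G, T.
    introns: list of (start, end) half-open intervals, in order along seq.
    An empty list corresponds to a single-exon gene.
    """
    prev_end = 0
    for start, end in introns:
        # Each intron must follow a non-empty exon and leave room for a final exon.
        if start <= prev_end or end >= len(seq):
            return False
        # Donor and acceptor dimers must not overlap.
        if end - start < 4:
            return False
        # The donor dimer (GT or GC) opens the intron; the AG dimer closes it.
        if seq[start:start + 2] not in ("GT", "GC"):
            return False
        if seq[end - 2:end] != "AG":
            return False
        prev_end = end
    return True

toy = "ATGGTAAAAGCC"   # exon ATG, intron GTAAAAG, exon CC
ok = is_valid_splice_form(toy, [(3, 10)])
single_exon = is_valid_splice_form(toy, [])
bad_donor = is_valid_splice_form(toy, [(2, 10)])
```

Here `ok` and `single_exon` are accepted, while `bad_donor` is rejected because the proposed intron starts with GG rather than GT or GC.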


Figure 5.

The State Model That Uses Open Reading Frame Information

The sequences next to the states indicate which consensus has to appear at the transitions between intron (capital) and exon (bold). Here, we use the IUPAC code for ambiguous nucleotides (e.g., B = C/G/T, R = A/G, Y = C/T). The digit on the transition arrows is related to the reading frame and indicates the frame shift required to follow the transition (e.g., between states 1 and 2, one can only accept exons leading to a frame shift of 0). It also defines in which frame stop codons are allowed to occur: no stop codon should appear in-frame. Finally, the model is constructed such that in-frame stop codons cannot be assembled on the exon boundaries (this required the three additional state pairs 6/7, 10/11, and 12/13).
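The reason exon boundaries need special handling can be seen in a short sketch. It assumes that the first exon starts in frame 0 and defines the frame after an exon as the cumulative coding length modulo 3; both conventions, the function names, and the toy exons are illustrative and may differ from the frame-shift labels used in the figure.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def exon_frames(exons):
    """Frame (cumulative coding length mod 3) reached at the end of each exon."""
    frames, total = [], 0
    for exon in exons:
        total += len(exon)
        frames.append(total % 3)
    return frames

def first_in_frame_stop(exons):
    """Index of the first in-frame stop codon in the spliced sequence, or None."""
    spliced = "".join(exons)
    for i in range(0, len(spliced) - 2, 3):
        if spliced[i:i + 3] in STOP_CODONS:
            return i
    return None

# Neither exon contains an in-frame stop on its own, but splicing joins
# the trailing "TA" of the first exon with the leading "A" of the second.
exons = ["ATGGCTTA", "AGGC"]
frames = exon_frames(exons)
stop_at = first_in_frame_stop(exons)
```

In this toy case the stop codon TAA only appears after splicing, at spliced position 6, which is the situation the additional state pairs in the model are designed to exclude.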


Table 1.

Splice Form Error Rates (1 − Accuracy), Exon Sensitivities, Exon Specificities, Exon Nucleotide Sensitivities, and Exon Nucleotide Specificities of mSplicer with (OM) and without (SM) ORF Information, as well as of ExonHunter and SNAP, on Two Different Problems: mRNA Including (UCI) and Excluding (CI) UTR
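The quantities named in this table title can be computed as sketched below. The definitions are assumptions, since the title does not spell them out: accuracy is taken as the fraction of genes whose splice form is predicted exactly, sensitivity as the fraction of annotated exons that are predicted, and specificity, following the usual gene-finding convention, as the fraction of predicted exons that are annotated. The function name and toy data are hypothetical, and the inputs are assumed to be non-empty.

```python
def exon_level_metrics(true_forms, predicted_forms):
    """Splice form error rate and exon-level sensitivity/specificity.

    true_forms, predicted_forms: one list of (start, end) exons per gene,
    given in the same gene order.
    """
    exact = sum(1 for t, p in zip(true_forms, predicted_forms) if t == p)
    # Tag every exon with its gene index so identical coordinates
    # in different genes are not confused.
    true_exons = {(g, e) for g, form in enumerate(true_forms) for e in form}
    pred_exons = {(g, e) for g, form in enumerate(predicted_forms) for e in form}
    correct = len(true_exons & pred_exons)
    return {
        "splice_form_error": 1 - exact / len(true_forms),
        "exon_sensitivity": correct / len(true_exons),
        "exon_specificity": correct / len(pred_exons),
    }

# Toy example: the first gene is predicted exactly, the second is split wrongly.
metrics = exon_level_metrics(
    true_forms=[[(0, 10), (20, 30)], [(0, 50)]],
    predicted_forms=[[(0, 10), (20, 30)], [(0, 20), (30, 50)]],
)
```

The nucleotide-level variants would follow the same pattern, counting exonic nucleotides instead of whole exons.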


Table 2.

Splice Form Error Rates, Sensitivities, and Specificities of mSplicer Trained on WS150 (Including Signal and Content Sensors)


Table 3.

Measure of the Agreement of the WS120 Annotation on 5,166 Completely Unconfirmed Genes with mSplicer's Predictions (SM and OM) (Reusing WS120's Gene Starts and Ends)


Table 4.

On Newly Confirmed Segments, Measure of the Accuracy of the WS120 Annotation and mSplicer Based on WS120 (OM and SM)


Table 5.

Comparison between Wormbase Annotation WS160 and the One Generated by mSplicer Applied to Annotated Transcripts


Figure 6.

POIMs for Donor (Left) and Acceptor (Right) SVM Classifiers

Shown are the color-coded importance scores of substring lengths for positions around the splice sites. Near the splice site, many important oligomers are identified. Particularly long substrings are important upstream of the donor and downstream of the acceptor site. See the main text for discussion.


Table 6.

Error Rates, Sensitivities, and Specificities of mSplicer (SM) for Three Other Nematodes Trained on C. elegans Sequences (“mSplicer WS120”)


Table 7.

Error Rate, Sensitivities, and Specificities of the cb25 Genome Annotation of the C. briggsae Genome and mSplicer Trained on WS120
