Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning

For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation [1] of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%–13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.

We collected all known C. elegans ESTs from Wormbase (1) (release WS120; 236,893 sequences) and dbEST (2) (as of February 22, 2004; 231,096 sequences). Using blat (3), we aligned them against the genomic DNA (release WS120). The alignment was used to confirm exons and introns. We refined the alignment by correcting typical sequencing errors, for instance by removing minor insertions and deletions. If an intron did not exhibit the consensus GT/AG or GC/AG at the 5' and 3' ends, we tried to achieve this by shifting the boundaries by up to 2 base pairs (bp). If this still did not lead to the consensus, we split the sequence into two parts and considered each subsequence separately. In the next step, we merged alignments if they did not disagree and shared at least one complete exon or intron. This led to a set of 124,442 unique EST-based sequences.
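The boundary-shifting step can be illustrated with a minimal sketch. It is not the original implementation: the function names, the 0-based half-open coordinates, and the choice to shift both intron boundaries by the same offset (trying the smallest shift first) are assumptions made for illustration only.

```python
# Hypothetical sketch of the consensus check and boundary shift.
# Assumption: an intron occupies genome[start:end] (0-based, end exclusive)
# and both boundaries are moved together by the same offset.

CONSENSUS = {("GT", "AG"), ("GC", "AG")}


def is_consensus(genome, start, end):
    """Return True if the intron starts with GT or GC and ends with AG."""
    if start < 0 or end > len(genome) or end - start < 4:
        return False
    return (genome[start:start + 2], genome[end - 2:end]) in CONSENSUS


def shift_to_consensus(genome, start, end, max_shift=2):
    """Try offsets 0, -1, +1, -2, +2 and return the first consensus intron.

    Returns the adjusted (start, end), or None if no shift of at most
    max_shift bp yields a consensus intron. In the text, that case leads
    to splitting the sequence into two parts.
    """
    for offset in sorted(range(-max_shift, max_shift + 1), key=abs):
        if is_consensus(genome, start + offset, end + offset):
            return start + offset, end + offset
    return None
```

For example, `shift_to_consensus("AAAGTCCCCAGAAA", 4, 12)` moves an intron that was aligned 1 bp too far downstream back onto its GT/AG boundaries.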

cDNA Sequences
We repeated the above procedure with all known cDNAs from Wormbase (release WS120; 4,855 sequences). These sequences contain only the coding part of the mRNA. We used their ends as annotation for start and stop codons.

Clustering
We clustered the sequences in order to obtain independent training, validation and test sets. Initially, each of the above EST and cDNA sequences was placed in a separate cluster. We then iteratively joined clusters whenever any two sequences from distinct clusters matched the same genomic location (this includes many forms of alternative splicing). We obtained 21,086 clusters, of which 4,072 contained at least one cDNA.
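The iterative joining of clusters amounts to finding connected components, which can be sketched with a union-find structure. This is an illustrative sketch, not the original code: the tuple layout `(seq_id, chrom, start, end)` and the interpretation of "same genomic location" as overlapping intervals on the same chromosome are assumptions.

```python
# Hypothetical sketch: single-linkage clustering of aligned sequences.
# Assumption: two sequences "match the same genomic location" when their
# aligned intervals overlap on the same chromosome (end is exclusive).


def cluster_by_overlap(alignments):
    """alignments: list of (seq_id, chrom, start, end). Returns a list of sets."""
    parent = {seq_id: seq_id for seq_id, _, _, _ in alignments}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Group alignments by chromosome, then sweep each group in start order.
    by_chrom = {}
    for seq_id, chrom, start, end in alignments:
        by_chrom.setdefault(chrom, []).append((start, end, seq_id))

    for hits in by_chrom.values():
        hits.sort()
        current_end, current_id = None, None
        for start, end, seq_id in hits:
            if current_end is not None and start < current_end:
                # Overlaps the running block: join and extend the block.
                union(current_id, seq_id)
                current_end = max(current_end, end)
            else:
                # Gap found: start a new block.
                current_end, current_id = end, seq_id

    clusters = {}
    for seq_id in parent:
        clusters.setdefault(find(seq_id), set()).add(seq_id)
    return list(clusters.values())
```

Because overlap is applied transitively, two sequences that do not overlap directly still end up in one cluster if a third sequence bridges them, which is what keeps the resulting sets independent of each other.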

Splitting into Training, Validation and Test Sets
For the training set we chose 40% of the clusters containing at least one cDNA (1,536) and all clusters not containing a cDNA (17,215). For the validation set we used 20% of the clusters with cDNA (775). The remaining 40% of clusters with at least one cDNA (1,560) were filtered to remove confirmed alternative splice forms.
This left 1,177 cDNA sequences for testing, with an average of 4.8 exons per gene and 2,313 bp from the 5' to the 3' end.
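The 40/20/40 split at the cluster level can be sketched as follows. The function name, the fixed random seed, and the use of a simple shuffle are assumptions for illustration; the text does not state how clusters were assigned to the three sets.

```python
# Hypothetical sketch of the cluster-level split described above.
import random


def split_clusters(cdna_clusters, est_only_clusters, seed=0):
    """Return (train, validation, test) lists of clusters.

    Training gets 40% of the cDNA clusters plus every EST-only cluster,
    validation gets 20% of the cDNA clusters, and the rest go to test.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(cdna_clusters)
    rng.shuffle(shuffled)

    n = len(shuffled)
    n_train = n * 40 // 100  # integer arithmetic avoids float rounding
    n_val = n * 20 // 100

    train = shuffled[:n_train] + list(est_only_clusters)
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test
```

Splitting whole clusters rather than individual sequences ensures that no genomic location contributes to more than one of the three sets.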

Processing the Annotation
We used the Wormbase (WS120) genome annotation. We first extracted all curated genes without annotated alternative splicing. We removed all genes that overlapped with any of the EST clusters identified above. We removed all genes with non-canonical splice sites, leaving 5,166 completely unconfirmed genes with an average of 4.8 exons per gene and 1,961 bp from the start to the end of the coding region. This set was used for the comparison with our prediction method.
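The three filtering steps can be sketched as a single pass over the annotated genes. The record layout (dictionaries with `id`, `chrom`, `start`, `end`, `introns`, `alt_spliced`) is hypothetical, strand handling is omitted, and "canonical" is assumed to mean GT/AG or GC/AG, as in the EST processing above.

```python
# Hypothetical sketch of the annotation filter. Coordinates are 0-based
# with exclusive ends; all genes are assumed to lie on the forward strand.

CANONICAL = {("GT", "AG"), ("GC", "AG")}


def has_canonical_introns(seq, introns):
    """True if every intron seq[s:e] starts with GT or GC and ends with AG."""
    return all((seq[s:s + 2], seq[e - 2:e]) in CANONICAL for s, e in introns)


def overlaps_any(start, end, intervals):
    """True if [start, end) overlaps at least one interval in the list."""
    return any(s < end and start < e for s, e in intervals)


def select_unconfirmed(genes, est_intervals, genome):
    """Return the ids of genes that pass all three filters.

    genes: list of dicts as described in the lead-in (assumed layout).
    est_intervals: dict mapping chromosome -> list of (start, end).
    genome: dict mapping chromosome -> sequence string.
    """
    kept = []
    for gene in genes:
        if gene["alt_spliced"]:
            continue  # annotated alternative splicing
        if overlaps_any(gene["start"], gene["end"],
                        est_intervals.get(gene["chrom"], [])):
            continue  # overlaps an EST cluster, so not unconfirmed
        if not has_canonical_introns(genome[gene["chrom"]], gene["introns"]):
            continue  # non-canonical splice site
        kept.append(gene["id"])
    return kept
```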
For the retrospective analysis we repeated steps 1.1.1-1.1.3 for ESTs and cDNAs from dbEST (as of 11/10/2005) and Wormbase (WS150). For all WS120 unconfirmed genes (see above) we identified segments of the gene that overlap with an EST or cDNA sequence match on the genome. We only considered cases where the WS150 sequences did not reveal any evidence for alternative splicing. In this way we identified 474 newly partially confirmed genes in 529 segments. We used 426 segments (in 379 genes) for our evaluation and the remaining sequences for model selection.

1.2 C. remanei, C. briggsae and P. pacificus
We repeated steps 1.1.1-1.1.3 for the other three genomes, starting with 15,155, 2,424 and 12,428 EST sequences for C. remanei, C. briggsae and P. pacificus, respectively. After clustering we obtained 4,395, 787 and 2,744 EST clusters. For the final out-of-sample evaluation, we used a random subset of 500 clusters for P. pacificus and, for C. remanei and C. briggsae, all clusters without evidence for alternative splicing or non-canonical splice sites. For retraining the second step of mSplicer for P. pacificus we used another 500 EST clusters. The splice site detectors and exon/intron content sensors were left unchanged.

List of Important Oligomers
Below is the list of the most important oligomers for discriminating donor and acceptor splice sites (three for every length). Shown are the position relative to the splice site, the oligomer sequence and the contribution of the oligomer.

Sequencing results
Out of the set of completely unconfirmed genes in WS120, we randomly selected a set of 24 genes where our predictions significantly differed from the annotation (accession ids: B0432.7, C05A9. 3
Table 1: The illustrations show the annotated (green), predicted (blue) and newly confirmed (red) splice forms of 25 C. elegans genes that were unconfirmed in WS120.
In only 31 out of 44 experiments did we obtain sequenceable PCR products. Of these, 7 cases were excluded because no splicing was observed (to rule out contamination with DNA). Hence, we could only consider 24 sequences (marked with + or *) for further analysis. This suggests that, in the unconfirmed part of the annotation, many genes include intergenic regions, so that the primer pairs match two separate mRNA sequences and therefore yield no PCR product. We found four genes (marked with +) that show evidence for alternative splicing; these were excluded from our analysis.