Identify Alternative Splicing Events Based on Position-Specific Evolutionary Conservation

The evolution of eukaryotes is accompanied by the increased complexity of alternative splicing, which greatly expands genome information. One of the greatest challenges in the post-genome era is a complete revelation of the human transcriptome with consideration of alternative splicing. Here, we introduce a comparative genomics approach to systematically identify alternative splicing events based on the differential evolutionary conservation between exons and introns and the high-quality annotation of the ENCODE regions. Specifically, we focus on exons that are included in some transcripts but are completely spliced out of others, and we call them conditional exons. First, we characterize distinguishing features among conditional exons, constitutive exons and introns. One of the most important features is the position-specific conservation score. There are dramatic differences in conservation scores between conditional exons and constitutive exons. More importantly, the differences are position-specific. For flanking intronic regions, the differences between conditional exons and constitutive exons are also position-specific. Using the Random Forests algorithm, we can classify conditional exons with high specificities (97% for the identification of conditional exons from intron regions and 95% for the classification of known exons) and fair sensitivities (64% and 32% respectively). We applied the method to the human genome and identified 39,640 introns that actually contain conditional exons and classified 8,813 conditional exons from the current RefSeq exon list. Among those, 31,673 introns containing conditional exons and 5,294 conditional exons classified from known exons cannot be inferred from RefSeq, UCSC or Ensembl annotations. Some of these de novo predictions were experimentally verified.


Introduction
Alternative splicing is one of the most important mechanisms for higher organisms to expand the information content from genome to transcriptome. Bioinformatics analyses based on EST sequences and exon-exon junction microarray studies show that 59%–74% of human genes are alternatively spliced [1,2]. Previous studies estimated that cassette exons make up 53%–61% of alternative splicing events in most species [3,4]. Although EST and microarray based studies have made much progress in the prediction of alternative splicing events, they are not sufficient to detect all splice variants due to the biased sampling and the bias and noise inherent to EST preparation and microarray technology. Leparc et al. used splice-site sequence Markov models and a Bayesian classifier to identify novel cassette exons from intron sequences [5]. They successfully predicted and experimentally confirmed 26 novel human cassette exons which are involved in intracellular signaling. Sorek et al. assembled 243 alternative and 1,753 constitutive exons that are conserved between human and mouse [6]. They identified several features differentiating between alternatively spliced and constitutively spliced exons. The most important features are the ones based on the sequence similarity between human and mouse. Yeo et al. used sequence features to distinguish alternative splicing events conserved in human and mouse [7]. Therefore, sequence content and sequence conservation provide alternative ways to study alternative splicing [8].
It has been shown that the evolution rate is lower for exon regions near the intron-exon boundaries than the middle part of exons, by estimating the non-synonymous substitution rate and the synonymous substitution rate from the alignment of human-mouse sequences [9]. The SNP density is the lowest near the splice sites, which also indicates that exon regions near the splice sites are under higher selection pressure [10]. Here, we consider the conservation score of every site of conditional exons, constitutive exons and conditional-exon-free introns. The conservation score is from the PhastCons phylo-hidden Markov model [11] and it is the posterior probability that the site is conserved across 17 vertebrate species. We uncovered the position-specific patterns for the conservation scores and compared conditional exons, constitutive exons and conditional-exon-free introns. The position-specific conservation pattern is more efficient in identifying conditional exons than the overall conservation score of individual exons.
Recently, the pilot project of the Encyclopedia of DNA Elements (ENCODE) [12] has rigorously identified functional elements in the 1% region of the human genome. The GENCODE [13] consortium of the ENCODE project has manually prepared a high-quality annotation for transcripts in the ENCODE regions. In this paper, we utilize the detailed annotation of the ENCODE regions and assemble the lists of conditional exons, constitutive exons and conditional-exon-free introns as training sets. We have two goals: (1) identify novel conditional exons from intron regions; (2) classify known exons into conditional exons and constitutive exons. We used the Random Forests machine learning method [14] to identify novel conditional exons from intron regions and achieved 97% specificity and 64% sensitivity. For classifying exons into conditional exons and constitutive exons, although the sensitivity is only 32%, the specificity can be as high as 95%.

Position-specific Conservation Pattern for Exons and Introns
In this paper, we are interested in conditional exons that are included in some transcripts but are completely spliced out of other transcripts. These include traditionally defined cassette exons, mutually exclusive exons, retained introns and other complicated alternative splicing events, but not alternative 5' or 3' exons. Non-conditional exons are called constitutive exons. Our data flowchart contains the training with the ENCODE data and the prediction using the Random Forests classifiers (Fig. 1).
Conditional exons, constitutive exons and conditional-exon-free introns were assembled from the ENCODE regions. Figure 2A plots conservation scores along relative positions of exons or introns. The bias due to the different lengths of exons was corrected in the following way. For each relative position x, the average conservation score was calculated only for exons containing that position. A similar correction was performed for introns. Compared with introns, both conditional exons (red) and constitutive exons (black) have much higher conservation. The conservation scores decrease gradually along the relative positions from the exon edges. In contrast, the conservation scores of introns (green) drop quickly and stay around 0.07 after about 30 base pairs (bp). The 3' positions of introns (-47 to -7) have relatively higher conservation scores than the 5' positions of introns (7 to 47) (p-values based on one-tailed t tests ≤ 0.001), probably reflecting the branch point and the poly-pyrimidine tract upstream of the 3' splice site. Compared with constitutive exons, conditional exons have lower conservation scores and the difference is remarkably more significant for regions near the edges. One-tailed t tests were performed to compare the conservation scores of conditional exons and those of constitutive exons. Figure 2B shows the p-value for each position. It indicates that the selection pressure on the boundaries of conditional exons is significantly lower than that for constitutive exons. The difference tends to decrease towards the middle part of exons. In contrast, if we consider the conservation scores of flanking intronic regions, intronic regions flanking conditional exons (red) have higher conservation scores than those flanking constitutive exons (black) (see Fig. 3A). Sorek et al. also reported that the intronic regions flanking cassette exons are conserved between human and mouse [6].
In addition, as observed in Figure 3A, the differences for upstream intronic regions of exons are larger than those for downstream intronic regions. A one-tailed t test was performed to compare the differences for upstream regions (-100, -10) and the differences for downstream regions (10, 100). The p-value is 1.1 × 10^-6. Figure 3B plots the position-specific p-values for those differences. The flanking (-46, -18) regions are the most conserved regions for conditional exons (p-values for the differences ≤ 10^-15). Therefore, the upstream intronic regions and the downstream intronic regions are not symmetric and the upstream intronic regions of conditional exons are much more conserved. All of these results show that there are differences between conditional exons, constitutive exons, and their flanking intronic regions in terms of conservation levels. Moreover, these differences are position-dependent and are functions of the relative positions to the exon-intron boundaries.
We further consider whether there are subpopulations among those conditional exons. We suspect that conditional exons can be divided into two groups: functional or non-functional exons. The assumption is that for those conditional exons with function, their conservation scores are relatively high and they tend to be highly regulated. The conservation scores of their regulatory upstream regions are also high. However, for those conditional exons produced by alternative splicing as random events, their conservation scores are low and the upstream intronic regions tend to be less conserved. Figure 4A shows the violin plots of the conservation scores of conditional exons and constitutive exons. A violin plot is similar to a boxplot except that it adds the kernel density plot of the data. It clearly shows that the distribution of conservation scores of conditional exons is bimodal. Some of the conditional exons have high conservation scores and some of them have very low conservation scores. We next consider whether the upstream intronic regions of those highly conserved conditional exons are more conserved. Figure 4B shows the relationship between the conservation level of the exon region (X axis) and the conservation level of the upstream intronic region (-46, -18) (Y axis) for conditional exons (upper panel) and constitutive exons (lower panel). For both constitutive exons and conditional exons, if the conservation score of the exon region is high, the upstream intronic region is more conserved. Compared with constitutive exons, conditional exons with the same conservation scores tend to have more conserved upstream intronic regions. This indicates that they may be highly regulated and are most likely functional.

Discover Novel Conditional Exons from Intron Sequences
We used the Random Forests to learn the classifier for conditional exons and intron sequences. The Random Forests consist of many decision trees and each tree is constructed from a bootstrap sample of the original data. A decision tree can be treated as a set of Boolean functions of features and these conjunctions of features partition training samples into groups with homogeneous class labels. The output of the Random Forests for each test sample is the class with the majority of votes from these trees. The Random Forests generate an internal unbiased estimate of classification error based on the out-of-bag data during the forest building process. There is no need for cross-validation or a separate test set. In this study, the high-quality training data were from the GENCODE project whose ultimate goal is to identify all protein-coding genes in the human ENCODE regions. We assembled 1,185 conditional exons and 4,490 intron sequence fragments with length from 30 to 330 bp. The 330 features we used were conservation scores of positions: 0, …, 164, -164, …, -0. The classification error rate is 0.10, the sensitivity is 0.64, the specificity is 0.97, and the false discovery rate is 0.15. We also considered the area under the curve (AUC) score, which is the value of the area under the receiver operating characteristic (ROC) curve. The AUC score is a global performance measure that combines both the sensitivity and the specificity. A perfect classifier will have AUC = 1 and a random classifier will have AUC = 0.5. Using the position-specific conservation score, we achieved an AUC score of 0.86.
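As a minimal sketch of how the four reported measures relate to one another, the function below derives sensitivity, specificity, false discovery rate and classification error rate from a confusion matrix. The function name and the counts in the usage example are hypothetical toy values chosen for illustration; they are not the counts from this study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Summary measures from confusion-matrix counts.

    tp/fn: positives (e.g., conditional exons) classified right/wrong.
    tn/fp: negatives (e.g., intron fragments) classified right/wrong.
    """
    return {
        # fraction of true positives that are recovered
        "sensitivity": tp / (tp + fn),
        # fraction of true negatives that are rejected
        "specificity": tn / (tn + fp),
        # fraction of positive calls that are wrong
        "fdr": fp / (tp + fp),
        # fraction of all samples that are misclassified
        "error_rate": (fp + fn) / (tp + fn + tn + fp),
    }


# Toy counts (hypothetical, for illustration only).
metrics = classification_metrics(tp=8, fn=2, tn=18, fp=2)
```

Because negatives outnumber positives in the training set described above, a low overall error rate can coexist with a modest sensitivity, which is why the measures are reported separately.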
We assembled 28,324,910 overlapping potential conditional exons (see Materials and Methods) from the intron regions of the RefSeq gene collection. These potential exons are 30 to 330 bp long and flanked by AG and GT dinucleotides (the splice sites of introns). They have a poly-pyrimidine tract in their upstream regions, and they do not introduce in-frame stop codons. Those potential exons were tested using the trained classifier from the Random Forests. This resulted in 1,273,698 predicted conditional exons. Because these predicted conditional exons may overlap, we focused instead on introns with at least one predicted conditional exon. About 21% (39,640/185,233) of tested introns contain at least one predicted conditional exon. However, for the ENCODE regions, about 16% of introns contain at least one conditional exon (excluding terminal exons). This indicates either a high false discovery rate among the discovered conditional exons or incomplete annotation of the ENCODE regions. Indeed, it has been reported that 59%–74% of human genes are alternatively spliced and that cassette exons make up 53%–61% of alternative splicing events [1,2,3,4]. Our result that 21% of known introns contain at least one conditional exon narrows the gap between current annotation and experimental prediction. By comparing the annotations of RefSeq transcripts themselves, 4,774 introns are concluded to contain at least one conditional exon. Our Random Forests classifier predicted 76% of them (3,643/4,774). Based on a larger transcript annotation set (RefSeq [15], UCSC [16] and Ensembl genes [17]), among the 185,233 tested RefSeq introns, 13,759 introns contain at least one conditional exon, and our Random Forests classifier predicted 58% (7,967/13,759) of them. The Random Forests classifier predicted another 31,673 introns containing conditional exons which cannot be inferred from RefSeq, UCSC and Ensembl annotations.
We used RT-PCR to test our predictions. We designed primers in the exonic regions flanking the introns which were predicted to contain conditional exons. These primers were screened against the provided library of human sequence repeats and have a high melting temperature (>60 °C) to minimize non-specific amplification. Since many alternative splicing events occur in a tissue-specific manner, we harvested total RNA samples from five different human cell lines: LA-N-5, WERI, HeLa, HEK 293 and SHSY5Y cells. We randomly picked 15 introns from the top predictions (all of the classification trees vote for conditional exons instead of introns) and designed specific primer pairs targeting exons which flank these introns. By RT-PCR, eleven of the fifteen primer pairs yielded only one or no amplicons in these five cell lines, but the other four showed additional amplicons of higher molecular weight in one or more cell lines, indicating alternative transcripts with exon inclusion (Fig. 5A). Sequencing of these amplicons proved that they partially overlap with the predicted conditional exons.
(1,5,9,58,60,66,69,82,84,86,94,96) are significantly different between conditional exons and constitutive exons (the median p-value is 2.4 × 10^-4). Table 1 lists other significant features with p-values ≤ 0.01 by comparing conditional exons and constitutive exons. Those 3-mer words and the 5' splice site positions may be related to splicing cis elements. The exon length and exon divisibility by 3 are not significant, with a t test p-value of 0.26 and a Fisher's exact test p-value of 0.15. Figure 6 shows the boxplots of importance measures of features. The importance measure is estimated by the Random Forests. It is the raw importance score divided by its standard error (z-score). The raw importance score is determined by comparing the training data and the simulated data in which the considered feature is randomly permuted and other features are kept intact. The results indicate that the position-specific scores are the most important features.

Features Distinguishing Conditional Exons and Constitutive Exons
Based on the training data from the ENCODE regions, the classifier has high specificity (0.95) but low sensitivity (0.32); the AUC score is 0.73, the FDR is 0.27, and the classification error rate is 0.25. We applied this classifier to exons assembled from RefSeq genes (excluding 5' terminal and 3' terminal exons). We continued to test our predictions using RT-PCR and sequencing. We used exon array data across different tissues (available on the Affymetrix website http://www.affymetrix.com/) to roughly determine whether a gene was expressed in a specific tissue before we selected predicted transcripts for validation. Primers in the exon regions flanking predicted conditional exons were designed for RT-PCR experiments. These primers were also screened against the provided library of human sequence repeats and have a high melting temperature (>60 °C) to minimize non-specific amplification. Of the five exons we tested (all from the top of the prediction list), four have apparent PCR products excluding the predicted conditional exons (Fig. 5B), while the fifth does not yield any PCR product, possibly due to a low expression level. Sequencing of these alternative PCR transcripts proved that the skipped regions are exactly the predicted conditional exons. In addition to the novel predicted conditional exons, examples of known conditional exons which were predicted by our methods are shown in Supplementary Figure S1.

Ontology Study for Genes with Many Conditional Exons
After we predicted conditional exons, we were interested to know whether there is any functional characteristic of genes enriched with conditional exons. For each RefSeq gene, different transcripts were combined to assemble non-redundant introns and exons. For those introns, we counted how many contain conditional exons based on our predictions. For those exons, we counted how many are conditional exons based on our predictions. A total of 837 genes have ≥15 introns+exons and ≥20% of those introns and exons are related to conditional exons. The DAVID Functional Annotation tool (2008) [19] was applied to analyze gene annotations. Table 2 lists the significant gene annotation terms with at least 10 gene counts and a p-value after Bonferroni's correction ≤ 0.001. Bonferroni's correction is a very stringent multiple comparison correction. Here it controls the probability of having one or more falsely declared significant annotation terms at ≤ 0.001. The term "alternative splicing" is a UniProt knowledgebase keyword meaning "protein for which at least two isoforms exist due to distinct pre-mRNA splicing events". It is the top-ranked significant gene annotation, with a Bonferroni corrected p-value of 3.0 × 10^-62. The UniProt sequence feature "splice variants" is also enriched, with a Bonferroni corrected p-value of 1.6 × 10^-44. Other significant annotations include GO terms related to nervous system development, synapse, protein binding, transcription factor activity, etc. This is consistent with the idea that development and signaling pathways are thought to involve a large number of alternative splicing events [20,21].
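For readers unfamiliar with the correction, a minimal sketch of Bonferroni's adjustment follows: each raw p-value is multiplied by the number of tests and capped at 1. The function name and the example p-values are hypothetical; the exact procedure applied inside the DAVID tool is not reproduced here.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values for three annotation terms.
adjusted = bonferroni([0.0001, 0.01, 0.5])
# Terms would then be kept only if the adjusted value is <= 0.001.
significant = [p <= 0.001 for p in adjusted]
```

Rejecting only terms whose adjusted p-value is at most 0.001 bounds the probability of one or more false positives among all tested terms by 0.001, which is the sense in which the correction is stringent.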

Discussion
In this study, we characterize important features of position-specific conservation scores across conditional exons, constitutive exons and introns through the thoroughly annotated ENCODE genomic regions. Based on these distinct features, we predicted many novel conditional exons which were previously thought to be constitutive exons and predicted many introns which contain conditional exons. Some of these predictions were validated by RT-PCR followed by sequencing. Our comparative genomics approach is an important complement to current experimental technologies in identifying alternative splicing events at the genomic scale. In addition, our novel predictions are of immediate interest for adding corresponding probes to exon arrays and exon-exon junction arrays.
In this paper, we found that constitutive exons have a higher conservation level in exon regions and a lower conservation level in flanking intron regions compared to conditional exons. This is based on the high-quality annotations of ENCODE regions and the conservation scores estimated from the alignment of 17 vertebrates. Some groups reported opposite results [7,22]. Looking at the methods of each study carefully, we found that this discrepancy was mainly due to different sampling of training data. When sampling alternative exons, Sorek's and Yeo's papers [7,22] selected human-mouse orthologous exons both of which are flanked by splice sites. They further required that alternative splicing events occur in both human and mouse. Such sampling undoubtedly yields a very high conservation level for "alternative exons". These "alternative exons" maintain conserved sequences for human-mouse orthology and conserved regulatory sequence elements for conserved alternative splicing. In contrast, our sampling of conditional exons and constitutive exons took into account neither human-mouse orthology nor conserved splicing events between human and mouse. Our sampling relied completely on the high-quality ENCODE annotation data instead. This allows for the hypothesis that the human genome and the mouse genome evolve independently to create different alternative splicing events, and it avoids the bias of assembling "conserved sequences" to study conservation level. We also found that the distribution of conservation scores of conditional exons is bimodal. Some of the conditional exons have high conservation scores and some of them have very low conservation scores (Fig. 4A).
Most importantly, we found that the differences of conservation scores are position-dependent. The position-specific conservation scores of exons and their flanking intronic regions may reflect functional splicing cis elements. The differences in position-specific conservation between conditional exons and constitutive exons and their flanking intronic regions may provide us information about the subtly different, if not significantly different, splicing mechanisms for conditional exons and constitutive exons. For the exon region, the differences between conditional exons and constitutive exons are remarkably more significant in regions near the boundaries (Fig. 2). This could indicate that constitutive exons have stronger splicing signals at the boundary which, for instance, facilitate exon definition during splicing. For their flanking intronic regions, the upstream intronic regions and the downstream intronic regions are not symmetric. The conservation differences for upstream intronic regions are larger than those for downstream intronic regions. This may indicate that upstream intronic regions are more important than downstream intronic regions in regulating functional alternative splicing. In particular, the upstream -46 to -18 bp intronic regions of conditional exons are significantly more conserved than those of constitutive exons (Fig. 3). The enriched sequence motifs in these regions may participate in the modulation of alternative splicing. In addition, we classified sub-populations of conditional exons. Some conditional exons are conserved and have highly conserved upstream intronic regions (Fig. 4), which indicates that they may be highly regulated and functional. Some conditional exons are less conserved and lack the highly conserved upstream regions (Fig. 4). They may just be the products of random splicing events or newly evolved splicing events.
It is also noteworthy that in all introns, the (-47, -7) region close to the 3' splice site is more conserved than its corresponding (7, 47) region close to the 5' splice site. Such asymmetry seems consistent with the importance of the polypyrimidine tract right upstream of the 3' splice site. Based on the high-quality training data set, the Random Forests classifier achieved a specificity as high as 0.97 and a sensitivity of 0.64 for conditional exon prediction from intron regions. For the classification of conditional exons from the current exon list, if we only use the position-specific conservation score, the classification error rate is about 25.2%. If we ignore the position-specific effect and use the average conservation score of exon regions, upstream and downstream intronic regions, the error rate increases to 30.3%. Adding other features such as triplet counts improves the classifier only slightly (the error rate decreases from 25.2% to 25.0%).
In contrast to Sorek et al.'s studies [22], in our training data the exon length and exon divisibility by 3 are not significant. Although conditional exons are shorter than constitutive exons (average 170.4 bp vs. average 182.2 bp), the difference is not statistically significant, with a t test p-value of 0.26 (one-tailed p-value 0.13). Although the proportion of exons whose length is a multiple of 3 is slightly larger among conditional exons than among constitutive exons (42% vs. 40%), the Fisher's exact test p-value is 0.15 (one-tailed p-value 0.08). There are several possible reasons. First, the scope of alternative exons and constitutive exons that we studied is different. We focus on conditional exons and constitutive exons. The conditional exons include cassette exons, mutually exclusive exons, retained introns and other complicated alternative splicing events. Sorek et al.'s cassette exons were those exons included and skipped in one or more transcripts, where the boundaries of both the 5' and 3' flanking exons are shared in the transcripts that include and skip that exon, and where the skipping events happen in both human and mouse. Their definition of constitutive exons was those that are supported by at least four expressed sequences, with no skipping event, in both human and mouse. These requirements dramatically narrowed down the scope of both alternative exons and constitutive exons. It has been reported that there are slightly more exons whose length is an exact multiple of three among alternatively spliced exons. However, orthologous exons that are alternatively spliced in multiple organisms showed a substantially increased bias towards a length that is an exact multiple of three [23,24]. Our definition of conditional exons does not require them to be orthologous exons, so the selection pressure for protein reading frame preservation is relatively low. Secondly, the selection of training data is different. We used thoroughly annotated ENCODE regions. Sorek et al. used 243 alternative and 1,753 constitutive exons that are conserved between human and mouse. After all, the 1,753 constitutive exons may still contain a handful of alternative exons which have not been discovered by meticulous experiments. In the process of validating our predicted conditional exons, we found that in most cases the novel predicted transcripts are either not expressed or expressed at a much lower level than known transcripts. This might be one of the reasons why they have not been discovered by previous EST sequencing. It may also be the reason for the low validation rate (4 out of 15) in the case of the prediction from intron regions. In a world of alternative splicing, it is difficult to disprove an alternative splicing possibility. This problem may still exist in the ENCODE annotation, but presumably to a lesser degree.
Our method has a much broader scope and application than previous alternative exon prediction algorithms. For example, Yeo et al. focused on alternative splicing events conserved in human and mouse [7]. First, their training sets were limited to orthologous human-mouse exon pairs with conserved splicing patterns. Secondly, only orthologous human-mouse exons (~100k) are eligible for their prediction program. Thirdly, their approach cannot predict novel exon inclusion events, or splicing events from regions currently annotated as introns. Our prediction of exon skipping events does not rely on the occurrence of skipping of the orthologous exon in another organism. In addition, we can predict novel exons from intron regions.
Finally, although we achieved a high specificity (97% and 95% for the identification of novel exons from introns and the identification of conditional exons from the current exon list respectively), the sensitivity is still not satisfactory (64% for identifying novel conditional exons from intron sequences, 32% for identifying conditional exons from the current exon list). Future work will need to explore more features which can differentiate conditional exons, constitutive exons and introns.

Position-specific Conservation
The high-quality manual annotations in the ENCODE regions were generated by the GENCODE project and were downloaded from the UCSC Genome Browser (http://genome.ucsc.edu/). Non-redundant middle exons (excluding 5'-terminal and 3'-terminal exons) and introns were used for further analysis. A total of 4,187 exons and 5,749 introns were assembled. If an exon is located in the intron region of other transcripts, it is called a conditional exon. Otherwise, it is called a constitutive exon. A total of 1,314 conditional exons and 2,873 constitutive exons were identified. If an intron does not contain a conditional exon, it is called a conditional-exon-free intron. A total of 4,800 such introns were identified. We note that constitutive exons and conditional-exon-free introns may still have other types of alternative splicing such as alternative 3' or 5' splice sites.
The conservation score based on a phylogenetic hidden Markov model for 17 vertebrates [11] was downloaded from the UCSC Genome Browser. The score of each site is the posterior probability that the site is in the conserved state of the phylogenetic hidden Markov model. For every site of an exon, define x as the position relative to the nearest splice site. It is positive for distances from the 5' exon edge and negative for distances from the 3' exon edge. For example, an exon with length 50 contains positions 0, …, 24, -24, …, -0. Similarly, x can be defined for introns. The conservation score can be found for each x as s(x).
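The relative-position scheme above, together with the length correction described in the Results (averaging at each x only over exons that contain x), can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names are hypothetical, positions are stored as (side, distance) pairs so that 0 and -0 stay distinct, and assigning the middle base of an odd-length exon to the 5' side is an assumption.

```python
from collections import defaultdict


def relative_positions(length):
    """Label each base with ("5p" or "3p", distance to the nearest edge)."""
    labels = []
    for i in range(length):
        d5 = i                # distance from the 5' edge
        d3 = length - 1 - i   # distance from the 3' edge
        # Assumption: a middle base equidistant from both edges goes to the 5' side.
        labels.append(("5p", d5) if d5 <= d3 else ("3p", d3))
    return labels


def position_specific_mean(score_lists):
    """Mean score s(x) per relative position, using only sequences containing x."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for scores in score_lists:
        for label, score in zip(relative_positions(len(scores)), scores):
            sums[label] += score
            counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


# Hypothetical per-base conservation scores for two short sequences.
means = position_specific_mean([[1.0, 0.0], [0.5, 0.5, 0.5, 0.5]])
```

For a sequence of length 50 this yields ("5p", 0) through ("5p", 24) followed by ("3p", 24) through ("3p", 0), matching the positions 0, …, 24, -24, …, -0 in the text.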

Training Data
The training data for identifying novel conditional exons from introns were prepared as follows. A total of 1,185 conditional exons with length from 30 to 330 bp in the ENCODE regions were used as training data. The lengths were recorded as (L_1, L_2, …, L_1185). For each conditional-exon-free intron with length ≥ 100 bp in the ENCODE regions, we randomly picked a fragment with length sampled from (L_1, L_2, …, L_1185). Therefore, those sampled intron fragments also have lengths from 30 to 330 bp. In total, 4,490 intron fragments were created and used as training data. In addition, a total of 1,314 conditional exons and 2,873 constitutive exons in the ENCODE regions were used as the training data for classifying exons into conditional exons and constitutive exons. The training exons and introns can be found in Supplementary Tables S1, S2 and S3.
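The length-matched sampling of negative examples can be sketched as below. This is an illustrative reading of the procedure, with several assumptions that the text does not specify: the function name is hypothetical, coordinates are half-open, only exon lengths that fit inside the intron are eligible, and the fragment start is drawn uniformly.

```python
import random


def sample_fragment(intron_start, intron_end, exon_lengths, rng, min_intron=100):
    """Pick one intron fragment whose length is drawn from observed exon lengths.

    Returns a half-open (start, end) interval, or None if the intron is
    shorter than min_intron or no recorded exon length fits inside it.
    """
    intron_len = intron_end - intron_start
    if intron_len < min_intron:
        return None
    # Assumption: only lengths that fit within the intron are eligible.
    fitting = [length for length in exon_lengths if length <= intron_len]
    if not fitting:
        return None
    frag_len = rng.choice(fitting)
    # Assumption: the fragment start is uniform over all valid offsets.
    start = intron_start + rng.randint(0, intron_len - frag_len)
    return (start, start + frag_len)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
fragment = sample_fragment(1000, 1500, [30, 120, 330], rng)
```

Drawing fragment lengths from the exon length list keeps the length distribution of the negative set comparable to that of the conditional exons, so the classifier cannot separate the classes by length alone.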

Random Forests Learning
Random Forests machine learning [14] was used to learn the classifier. The code was downloaded from (http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm). We built 1,000 trees for each Random Forests. At each node, the number of variables we considered is the square root of the total number of features. The AUC score was calculated as
$$\mathrm{AUC} = \frac{1}{n_+ n_-} \sum_{i=1}^{n_+} \sum_{j=1}^{n_-} \mathbf{1}\left( f(x_i^+) > f(x_j^-) \right),$$
where $n_+$ is the number of positive samples (e.g., conditional exons), $n_-$ is the number of negative samples (e.g., constitutive exons), $x^+$ are the features for positive samples, $x^-$ are the features for negative samples, $f(\cdot)$ is the scoring function (e.g. the number of votes for conditional exons), and $\mathbf{1}(\cdot)$ is the indicator function. The features for classifying exons into conditional exons and constitutive exons include: position-specific conservation scores of exon positions (0, …, 149, -149, …, -0), upstream 100 bp regions (-100, …, -1) and downstream 100 bp regions (1, …, 100); exon length; exon divisibility by 3 (1: yes, 0: no); 3-mer word frequencies for the exon, the upstream 100 bp region and the downstream 100 bp region; position-dependent single base counts at the 5' splice site for positions -3 to +6 (excluding the invariant +1 and +2 positions); and the intensity of the poly-pyrimidine tract (PPT), which is the number of pyrimidines in a 15 bp window of the last 19 nucleotides of the upstream intron (not including the last 4 nucleotides of the intron). Those non-conservation score features were also used in Dror et al.'s paper [18]. In their paper, instead of using position-specific conservation scores, they used the percent identity when aligned to the mouse counterpart.
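An AUC defined through an indicator function over positive and negative samples is commonly computed as the fraction of positive-negative pairs in which the positive sample receives the higher score. The sketch below illustrates that pairwise form; the function name and the scores are hypothetical, and counting ties as zero is an assumption (some conventions count a tie as one half).

```python
def auc_from_scores(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Scores could be, for example, the number of trees voting for the
    positive class. Ties contribute zero here (an assumption).
    """
    wins = sum(1 for p in pos_scores for n in neg_scores if p > n)
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical vote fractions for two positive and two negative samples.
auc = auc_from_scores([0.9, 0.3], [0.5, 0.1])
```

The double loop costs time proportional to the product of the two class sizes, which is acceptable for training sets of a few thousand samples; a rank-based computation gives the same value more efficiently for larger sets.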

Test Data
Known protein-coding genes from the NCBI mRNA reference sequences collection (RefSeq) [15] were downloaded from the UCSC Genome Browser (Build hg18). Introns were assembled from these genes. Introns that share the same positions but have different phases were treated here as two separate introns; elsewhere in the paper, introns that share the same positions were treated as redundant. We scanned each intron from 5′ to 3′ to identify possible exons that (1) have a length from 30 to 330 bp; (2) are flanked by AG and GT dinucleotides; (3) have a PPT intensity ≥9; and (4) do not introduce in-frame stop codons. This procedure resulted in 28,324,910 fragments. Conservation scores were assigned to every position of those fragments. The fragments were classified as conditional exons or introns according to the classifier learned from the training data (with ≥50% of trees voting for conditional exons or for introns). For the classification of conditional exons from the current exon list, we assembled 162,941 unique exons from RefSeq (excluding terminal exons). They were classified into conditional exons and constitutive exons according to the classifier learned from the training data. The test exons and introns can be downloaded from http://www-rcf.usc.edu/~liangche/research/rfexon/.
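The four scanning criteria can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the intron phase is taken to be the number of codon bases already contributed by the upstream exon, and only complete codons inside the candidate are checked for stops (a codon split across the exon boundary would also require the flanking exon sequence).

```python
STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})
PYRIMIDINES = frozenset("CT")


def has_inframe_stop(exon, phase):
    """Check complete codons in the exon, given the upstream intron phase (0, 1, 2)."""
    offset = (3 - phase) % 3  # bases needed to finish the codon begun upstream
    return any(exon[i:i + 3] in STOP_CODONS
               for i in range(offset, len(exon) - 2, 3))


def ppt_before(intron, start):
    """Pyrimidine count at positions -19..-5 relative to a candidate exon start."""
    window = intron[max(0, start - 19):max(0, start - 4)]
    return sum(base in PYRIMIDINES for base in window)


def candidate_exons(intron, phase, min_len=30, max_len=330, min_ppt=9):
    """Return (start, end) half-open intervals of candidate exons in an intron."""
    intron = intron.upper()
    hits = []
    for start in range(2, len(intron)):
        if intron[start - 2:start] != "AG":        # acceptor dinucleotide upstream
            continue
        if ppt_before(intron, start) < min_ppt:    # PPT intensity threshold
            continue
        last_end = min(start + max_len, len(intron) - 2)
        for end in range(start + min_len, last_end + 1):
            if intron[end:end + 2] != "GT":        # donor dinucleotide downstream
                continue
            if has_inframe_stop(intron[start:end], phase):
                continue
            hits.append((start, end))
    return hits
```

A scan of this kind is quadratic in the worst case but bounded by the 330 bp maximum length, which keeps the per-intron cost linear in intron length; the very large number of reported fragments reflects how permissive the four filters are before classification.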

Tissue Cell Culture and RNA Preparation
LA-N-5, HeLa, SHSY5Y, WERI and HEK 293 cell lines were cultured following standard guidelines provided by the American Type Culture Collection. Total RNA samples from these cell lines were prepared using Trizol according to the manufacturer's protocol (Invitrogen, CA).

Primer Design and RT-PCR
Primers were designed with the Primer3 online software (http://frodo.wi.mit.edu). The sequence assembly of a tested transcript includes the predicted conditional exon and its flanking exons, and the PCR product should include the predicted conditional exon. Primers were filtered against human mispriming libraries and have a high melting temperature (>60°C) to minimize non-specific amplification. Primer sequences are as follows.

PCR Product Extraction and Sequencing
PCR products were separated by electrophoresis on a 2% agarose gel supplemented with ethidium bromide and were visualized under UV light. PCR products were extracted using the Qiagen Gel Extraction kit (Qiagen, CA), ligated into the pCR-TOPO vector and then transformed into chemically competent cells using the TOPO TA Cloning Kit (Invitrogen, CA) according to the manufacturers' instructions. Bacteria were plated on LB/x-gal/Amp agar plates and grown overnight at 37°C. A maximum of 3 colonies were picked from each plate, amplified and used for the sequencing reaction (www.laragen.com) with forward M13 primers.

Figure S1 Examples of well-known alternatively spliced genes. For gene APP, there are three RefSeq transcript isoforms: NM_000484, NM_201413 and NM_201414. Two exons (exon 7 and exon 8) are known to be included in some transcripts and spliced out in others. Our methods predicted both of them correctly. There are three RefSeq transcript isoforms for gene GRIA2: NM_001083619, NM_000826 and NM_001083620. Two exons (exon 14 and exon 15) are known conditional exons. Our methods predicted both of them correctly. In addition, our methods predicted that exon 16 is a conditional exon. Found at: doi:10.1371/journal.pone.0002806.s001 (0.86 MB TIF)

Table S1 Training conditional exons assembled from the ENCODE annotations. They are middle exons (excluding 5′-terminal and 3′-terminal exons) and they are located in the intron regions of other transcripts. All of them (1,314) were used to train the classifier to distinguish conditional exons from constitutive exons. A total of 1,185 conditional exons with lengths between 30 and 330 bp were used to train the classifier to distinguish conditional exons from intron sequences.