Alu Exonization Events Reveal Features Required for Precise Recognition of Exons by the Splicing Machinery

Despite decades of research, the question of how the mRNA splicing machinery precisely identifies short exonic islands within the vast intronic oceans remains to a large extent obscure. In this study, we analyzed Alu exonization events, aiming to understand the requirements for correct selection of exons. Comparison of exonizing Alus to their non-exonizing counterparts is informative because Alus in these two groups have retained high sequence similarity but are perceived differently by the splicing machinery. We identified and characterized numerous features used by the splicing machinery to discriminate between Alu exons and their non-exonizing counterparts. Of these, the most novel is secondary structure: Alu exons in general and their 5′ splice sites (5′ss) in particular are characterized by decreased stability of local secondary structures with respect to their non-exonizing counterparts. We detected numerous further differences between Alu exons and their non-exonizing counterparts, among others in terms of exon–intron architecture and strength of splicing signals, enhancers, and silencers. Support vector machine analysis revealed that these features allow a high level of discrimination (AUC = 0.91) between exonizing and non-exonizing Alus. Moreover, the computationally derived probabilities of exonization significantly correlated with the biological inclusion level of the Alu exons, and the model could also be extended to general datasets of constitutive and alternative exons. This indicates that the features detected and explored in this study provide the basis not only for precise exon selection but also for the fine-tuned regulation thereof, manifested in cases of alternative splicing.

, panels B-D) may result in a decreased structure throughout the entire Alu exon. Given such a scenario, the observed single-strandedness of the 5'ss may not be specific to the 5'ss region; rather, any randomly chosen site throughout the exonizing arm might be relatively less structured. To address this, we arbitrarily selected nine equally spaced sites within the right and the left Alu arms, beginning at position 40 and ending at position 280, sampling one site every 30 nt. We calculated the PU value of the 9-mer (the length of the 5'ss) at each site within each of the three core datasets. For five of the nine sites, no significant differences in PU values (at a level of p<0.05) were found between the groups. At the four remaining sites, although differences between datasets were significant, the PU values were not consistently higher in the exonizing dataset. The fact that consistently higher, statistically significant PU values were found specifically for the recognized 5'ss, but not for the arbitrarily selected positions, therefore implies potential biological importance.
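The site selection described above can be sketched as follows. This is only an illustration of the sampling scheme (positions 40 to 280, one site every 30 nt, 9-mer windows); the function names are hypothetical, positions are assumed to be 1-based, and the window is assumed to start at the sampled position. The PU computation itself is not reproduced here.

```python
# Sampling scheme from the text: nine sites, positions 40..280, every 30 nt.
SITE_START = 40
SITE_END = 280
SITE_STEP = 30
WINDOW = 9  # length of the 5'ss 9-mer


def control_sites(start=SITE_START, end=SITE_END, step=SITE_STEP):
    """Return the sampled positions (assumed 1-based, inclusive of `end`)."""
    return list(range(start, end + 1, step))


def control_windows(sequence, sites=None, window=WINDOW):
    """Return (position, k-mer) pairs for each sampled site.

    Assumption: the window starts at the sampled position. Sites whose
    window would run past the end of the sequence are skipped.
    """
    if sites is None:
        sites = control_sites()
    windows = []
    for pos in sites:
        kmer = sequence[pos - 1 : pos - 1 + window]
        if len(kmer) == window:
            windows.append((pos, kmer))
    return windows
```

Each returned 9-mer would then be passed to whichever tool computes the PU value, once per Alu in each of the three core datasets.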

Ranking of features
Mutual information is a quantity that measures the mutual dependence of two variables, and is calculated as:

$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p_1(x)\,p_2(y)}$$

where p(x,y) is the joint probability distribution function of X and Y, and p_1(x) and p_2(y) are the marginal probability distribution functions of X and Y, respectively. To calculate this value we used the bioDist package [1] in R, which first discretizes each variable by binning it into 10 bins. The mutual information of each variable was next normalized (divided) by the entropy of the binary 'group' variable, which indicates whether an Alu exonizes from a given arm or not. The entropy of this variable, H(x), was calculated as:

$$H(x) = -\sum_{i} p(x_i)\,\log p(x_i)$$

where p(x_i) is, in turn, the probability that an Alu does and the probability that it does not undergo exonization. The final value was multiplied by 100, to yield the percentage of information. This measure has previously been termed the coefficient of constraint [2] and the uncertainty coefficient [3].
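A minimal sketch of this normalized measure is given below. It is not the bioDist implementation: equal-width binning is an assumption (the text only states that 10 bins are used), and all function names are hypothetical. The logarithm base cancels in the ratio, so the natural logarithm is used.

```python
import math
from collections import Counter


def discretize(values, n_bins=10):
    """Equal-width binning into n_bins bins (assumed binning scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value falls into the last bin rather than a new one.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy of a discrete variable (natural log)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(xs, ys):
    """Mutual information of two discrete variables (natural log)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def percent_information(feature, group, n_bins=10):
    """100 * I(feature; group) / H(group).

    Undefined (division by zero) if every item has the same group label.
    """
    bins = discretize(feature, n_bins)
    return 100.0 * mutual_information(bins, group) / entropy(group)
```

A feature whose bins separate the two groups perfectly yields 100, and a feature that is independent of the group label yields 0.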

Analysis of Alus by inclusion levels
It has recently been shown that older Alu exons are characterized by stronger signals and higher inclusion levels than younger ones [4]. We were thus interested in determining whether the different features identified in this study are stronger in Alu exons characterized by higher inclusion levels.
To assess the impact of inclusion level, we divided all 313 Alus exonizing from the right arm into two groups of low and high inclusion levels, using an inclusion level threshold of 20% to divide the groups. This left 263 and 50 Alu exons in the LOW and HIGH inclusion groups, respectively. For each of the features described in the manuscript, we next used t-tests to determine whether it differed significantly between the two groups. Two features were found to differ significantly between the two groups: the 5'ss score (mean in LOW = 76.21, mean in HIGH = 80.54, P-value = 0.003), and right arm secondary structure (mean Z-score in LOW = -0.51, mean in HIGH = -0.27, P-value = 0.001). This analysis again underscores the importance of secondary structure, which in this case was even more significant than the 5'ss score. Full results for this analysis can be found in Supplementary Table 2. Repeating this analysis in the left arm did not yield significant results, which is at least partially attributable to the much lower number of Alus in this dataset.
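The grouping and comparison described above can be sketched as follows. The text does not state which t-test variant was used, so Welch's unequal-variance form is an assumption here, as is the handling of exons at exactly 20% (placed in HIGH). Function names and the input layout are hypothetical.

```python
import math
from statistics import mean, variance

INCLUSION_THRESHOLD = 20.0  # percent, as stated in the text


def split_by_inclusion(exons, threshold=INCLUSION_THRESHOLD):
    """Split (inclusion_percent, feature_value) pairs into LOW and HIGH.

    Assumption: exons at exactly the threshold are assigned to HIGH.
    """
    low = [f for inc, f in exons if inc < threshold]
    high = [f for inc, f in exons if inc >= threshold]
    return low, high


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Requires at least two values per group with non-zero variance overall.
    """
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

The sketch stops at the test statistic; a P-value would be obtained from the t distribution with `df` degrees of freedom, for example with `scipy.stats.ttest_ind(low, high, equal_var=False)` if SciPy is available.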

Analysis of Alus by location of exonization
We have previously reported a tendency for exonization events of transposable elements to occur within the UTRs [5]. Analyzing this in our datasets, we find that of the 313 Alus exonizing from the right arm, 109

Table 1. Statistical significance of features across the three core datasets. For each feature, four tests were performed: a first general Kruskal-Wallis one-way analysis of variance test (or ANOVA if explicitly stated), to assess whether the feature was distributed differently across the three core datasets, followed by three Mann-Whitney tests (or t-tests) between each pair of datasets, to identify which datasets differed from the others. P-values below 0.05 are highlighted in yellow.

In the first, exons were divided into two groups based on their inclusion levels (above and below 20%), and in the second based on location (5' UTR vs. introns). T-tests were performed to compare each feature between the two groups.
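The testing scheme described in the Table 1 caption (one omnibus test across the three core datasets, followed by three pairwise tests) can be sketched as below. This is an illustrative stdlib-only version that computes the test statistics without tie correction and without P-values; the function names are hypothetical and the code is not the authors' analysis pipeline.

```python
from itertools import combinations


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def mann_whitney_u(a, b):
    """Mann-Whitney U statistic for sample `a` against sample `b`."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    return rank_sum_a - len(a) * (len(a) + 1) / 2


def omnibus_then_pairwise(groups):
    """Return H for all groups plus U for each pair of group indices."""
    pairwise = {
        (i, j): mann_whitney_u(groups[i], groups[j])
        for i, j in combinations(range(len(groups)), 2)
    }
    return kruskal_wallis_h(groups), pairwise
```

With three datasets this yields one H statistic and three U statistics per feature, matching the four tests per feature in the caption. For P-values, `scipy.stats.kruskal` and `scipy.stats.mannwhitneyu` provide the corresponding tests.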