A Biophysical Model for Identifying Splicing Regulatory Elements and Their Interactions

Alternative splicing (AS) of precursor mRNA (pre-mRNA) is a crucial step in the expression of most eukaryotic genes. Splicing factors (SFs) play an important role in AS regulation by binding to the cis-regulatory elements on the pre-mRNA. Although many splicing factors (SFs) and their binding sites have been identified, their combinatorial regulatory effects remain to be elucidated. In this paper, we derive a biophysical model for AS regulation that integrates combinatorial signals of cis-acting splicing regulatory elements (SREs) and their interactions. We also develop a systematic framework for model inference. Applying the biophysical model to a human RNA-Seq data set, we demonstrate that our model can explain 49.1%–66.5% variance of the data, which is comparable to the best result achieved by biophysical models for transcription. In total, we identified 119 SRE pairs between different regions of cassette exons that may regulate exon or intron definition in splicing, and 77 SRE pairs from the same region that may arise from a long motif or two different SREs bound by different SFs. Particularly, putative binding sites of polypyrimidine tract-binding protein (PTB), heterogeneous nuclear ribonucleoprotein (hnRNP) F/H and E/K are identified as interacting SRE pairs, and have been shown to be consistent with the interaction models proposed in previous experimental results. These results show that our biophysical model and inference method provide a means of quantitative modeling of splicing regulation and is a useful tool for identifying SREs and their interactions. The software package for model inference is available under an open source license.


Introduction
A key step in eukaryotic gene expression is to remove introns from precursor messenger RNA (pre-mRNA) so that exons can be joined together to form the mature mRNA. By including different exons in the mRNA, alternative splicing (AS) can generate different isoforms from a single gene to increase proteomic diversity. Recent studies have found that ,95% of human genes undergo AS [1,2]. The importance of AS is highlighted recently by the findings that AS related mutations can cause many human diseases including cancer [3,4].
AS can be regulated via several mechanisms, often in a tissuespecific manner [5,6]. One mechanism is that recognition of splice sites by the spliceosome is influenced by a class of RNA binding proteins named splicing factors (SFs) that can bind to the cis-acting splicing regulatory elements (SREs) on the pre-mRNA. These SREs are categorized as exonic splicing enhancers (ESEs) and silencers (ESSs), and intronic splicing enhancers (ISEs) and silencers (ISSs) based on their locations and effects on splicing [7]. Recently, several experimental [8][9][10].and computational methods have been employed to identify SREs. The computational methods can be largely categorized into three approaches. The first enrichment-based approach is to identify SREs as short nucleotide sequences (typically hexamers or octamers) that are statistically enriched in a carefully selected set of introns and exons against a background or negative dataset [11][12][13]. The second conservation-based approach utilizes comparative genomic methods to identify evolutionarily conserved motifs in introns and exons, which can also be combined with the enrichment-based approach to identify SREs [14][15][16][17]. The third regression-based approach exploits both sequence information and expression levels of different isoforms in a unified framework [18,19]. Comparing with the other two approaches, the regression-based approach offers flexibility of identifying combinatorial regulatory effects of multiple SREs. However, the current regression methods for AS were not developed systematically from a theoretical base, which may limit their performance.
Multiple SREs could act cooperatively to promote or repress splicing by regulating exon or intron definition [20,21]. However, given the huge number of possible pairs of SREs in different regions, experimental approaches for identifying SRE pairs is expensive and time-consuming if feasible. For this reason, computational methods become an important means of selecting candidate SRE pairs in a systematic and high-throughput manner. Several recent computational works have studied cooperative SRE pairs in AS regulation. Ke and Chasin [22].searched for frequently co-occurred SRE pairs from two ends of exons that mediate exon definition. Friedman et al. [23] identified cooperative SRE pairs from two ends of human and mouse introns possibly mediating intron definition. Suyama et al. [24] analyzed conserved pentamers that often co-occur in the same region of upstream or downstream introns, which may arise from cooperative binding of different SFs or actually from a single long motif. These different types of SRE pairs from different regions reveal that cooperative interaction between SREs may be a common mechanism in AS regulation. However, these SRE pair detection methods did not incorporate expression data into the analysis, and like the enrichment-based approach to the detection of single SRE, they could not exploit sequence and expression data in a systematic way, which may limit their detection power.
In this paper, we first derive a novel biophysical regression model for the regulation of AS, which can capture both main effects of individual SREs and combinatorial effects of multiple SREs. We then develop a systematic framework to infer the regression model, which in turn identifies both single SREs and different types of cooperative SRE pairs. The key feature of our model inference framework is that we employ the shrinkage technique [25] to identify a small number of SREs and SRE pairs from a huge number of all possible SREs and their pairs. Our numerical results show that our regression model can explain a significant portion of the variance in the data comparable to the best result for transcription achieved by a nonlinear biophysical model [26]. Using an RNA-Seq data set [1], we identify 619 SREs and 196 SRE pairs, some of which are verified with previous experimental results. For example, binding sites of PTB in upstream and downstream of introns of alternatively spliced exons (ASEs), binding sites of hnRNP F/H in two ends of downstream intron, and binding sites of hnRNP E/K in the same region are identified as interacting SRE pairs, which are consistent with existing experimental evidences.

Biophysical Model for AS Regulation
The basal machinery of splicing is known as the spliceosome, a large multicomponent ribonucleoprotein complex having U1, U2, U4, U5 and U6 snRNPs as its main building blocks [27]. Splicing begins with a multi-step process of spliceosome assembly around the splice sites and the branch point. SFs bound to nearby SREs can influence spliceosome assembly by facilitating or inhibiting the subunits of spliceosome to recognize the splice sites [7,28,29]. They can also regulate splicing through other mechanisms such as regulation of the transition from exon definition to intron definition [5,30]. Moreover, multiple SFs and the spliceosome can interact cooperatively or antagonistically to affect the splicing process.
We model the spliceosome assembling process with a chemical reaction: where S denotes the spliceosome. This simplification is similar to the one used in the derivation of a biophysical model [26,31,32] for transcription where assembly of the RNA polymerase (RNAP) complex is simplified to one reaction. If an SF binding to an SRE interacts with the spliceosome, it increases or decreases the equilibrium constant of the above reaction, which in turn enhances or inhibits splicing. Let us consider a gene with one ASE and let I 1 (I 2 ) be the isoform that includes (excludes) the ASE. Let E I 1 and E I 2 be the expression levels of I 1 and I 2 , respectively. We assume that the pre-mRNA of the gene with assembled spliceosome produces I 1 , whereas the pre-mRNA without assembled spliceosome produces I 2 . Therefore, the probability of producing I 1 is equal to the probability that the pre-mRNA is bound by the spliceosome. The first probability can also be expressed as E I 1 =(E I 1 zE I 2 ), while the second probability can be derived from the biophysical chemical reactions modeling spliceosome assembly and binding of SREs to the pre-mRNA as detailed in Materials and Methods. Based on this observation, we derive the following regression model in Materials and Methods to capture the regulatory effects of SREs on the splicing of an ASE: where y~log for tissue t 1 and t being the index of N tissues; M and I are the set of potential SREs and the set of potential SRE pairs, respectively; x i is a binary variable to indicate presence (x i~1 ) or absence (x i~0 ) of the ith SRE; b i reflects the contribution of the ith SRE to the splicing of the ASE, and b ij indicates the cooperative contribution of the ith SRE and the jth SRE; e is the measurement error, which is modeled as a Gaussian random variable with zero mean. Note that the ith and jth SREs in the third term at right hand side of (2) can be from the same region or two different regions as described in Figure 1A. The first term log (f t 1 ) in y is the logarithm of the ratio between the expression levels of two isoforms including or excluding the ASEs; it is determined by the basal splicing level and the regulatory effects of SFs binding to the SREs. The second term in y is added to remove the basal splicing level determined by the spliceosome alone as described in Materials and Methods. Thus, given the splicing profile y of a set of ASEs and a set of candidate SREs, we can identify SREs or SRE pairs that have regulatory effect by finding b i or b ij that are not equal to zero with certain statistical significance. The method for model inference to determine b i or b ij is presented in next section. The condition to determine if an SRE is a enhancer or silencer based on the sign of b i and the expression level of the SF binding to the SRE is given in Proposition 1 in Materials and Methods. Whether an interacting pair of two SREs is an enhancer or silencer, however, is difficult to be inferred from the sign of b ij .
Note that model (2) is linear with respect to unknown parameters b i and b ij . The linearity facilitates model inference even when the number of parameters is very large. As shown in Materials and Methods, linear model (2) is directly derived from biophysical principles. In contrast, the biophysical model for gene transcription [26,[31][32][33]] is a nonlinear model with respect to unknown parameters to be inferred. While it is possible to infer parameters of the nonlinear model when the number of parameters is small [26], model inference becomes difficult when the number of parameters is relatively large. The linear regression models for transcription [34][35][36][37] or for splicing [18,19] are an approximation of the nonlinear biophysical models, as shown in [36]. Although these linear models enable efficient model inference, their performance is limited by the approximation inherent to the models. A nonlinear model based on regression spline [38] was also developed to approximate the nonlinear biophysical model. Comparing with linear regression, regression spline reduces approximation error, but increases the complexity of model inference.

Framework for Model Inference
We applied the biophysical model to identify SREs and SRE pairs involved in alternative splicing. As described in Materials and Methods, We first determined a set of ASEs from the UCSC KnownGene table [39], then extracted all the hexamers in five regions around ASEs as candidate SREs ( Figure 1A), and obtained y for this set of ASEs from the RNA-Seq data. With this data set, model inference was carried out using four components described in Figure 1B.
Since we considered all 6-mers and their interactions, the regression model (2) contained 5|4 6~2 0480 variables for main effects of possible SREs and w10 8 (20480 : 20479=2) variables for interactions of possible SRE pairs. The huge number of variables not only requires huge computation for model inference, but also may yield a large number of false SREs. To overcome these problems, we developed the four-component framework in Figure 1B to select reliable SREs without overfitting the model. In the first variable screening component, we used the sure independence screening method [40] described in Materials and Methods to remove 6-mers that have very small correlation with the response variable y.
In the second component, we adopted the Lasso [41] and the adaptive Lasso [42] to perform penalized multiple regression to select variables. Descriptions of the Lasso, the adaptive Lasso and the cross-validation procedure for determining the parameter l of the Lasso and the adaptive Lasso are given in Materials and Methods. The Lasso is known to shrink many variables, with no or small correlation with the response variable, to zero [41], and thus yields a sparse model that only contains a small number of variables. Using both the Lasso and the adaptive Lasso was to ensure more reliable variable selection as suggested in [43]. This component produced a sequence of models for different values of l. The minimum variance of the residual error at l~l min determined by cross-validation was denoted as s 2 min . Although the Lasso and the adaptive Lasso only retained a small number of variables in the model, we wanted to ensure that the overfitting problem did not occur. To this end, we added the third component named refitted cross-validation (RCV) [44] to the inference procedure. RCV reliably estimates the variance s 2 RCV of the residual error in a linear model of ultrahigh dimension. We then compared s 2 min , with s 2 RCV . If s 2 min ws 2 RCV , we selected variables that gave s 2 min ; otherwise, we identified the value of l that yielded a residual variance equal to s 2 RCV and selected variables with this l.
The adaptive Lasso together with RCV selected a set of SREs and SRE pairs, but it did not give p-value for each variable in the model. In the last component, we used ordinary least squares (OLS) method to refit the model with the variables selected by the adaptive Lasso. We then used the p-values provided by OLS to select variables at a false discovery rate (FDR) ƒ0:01 [45]. This final set of variables were identified as SREs and SRE pairs. A software package implementing the inference framework is freely available under an open source license.

Performance of the Model and the Regression Framework
As described in Materials and Methods, we selected a set of ASEs for each tissue from the KnownGene table and calculated the inclusion ratio E I1 =(E I1 zE I2 ) from the RNA-Seq data [1] for each ASE, which was then used to calculate the response y in model (2). We applied our biophysical model and inference framework to this data set to identify SREs and SRE pairs. The number of ASEs used in model inference, the number of SREs and SRE pairs in the final model and the percentage variance explained by the final model are given in Table 1. In each tissue, our biophysical model explained 49.1%-66.5% of the variance in the data (see R 2 in Table 1), which was comparable to that achieved by the best model for transcription reported in [26]. More specifically, the linear model for gene transcription in [34] explained 9.6% of the variance on average, while the regression spline model for gene expression that incorporated interaction terms explained 13.9% to 32.9% of the variance [38]. Most recent work by Gertz et al. [26] fitted a nonlinear biophysical model to the expression data of synthetic genes. Their models explained 44-59% of the variance in gene expression. Thus, considering the non-synthetic genes we used, our model has captured a large fraction of the variance in the splicing response.
We compared our 854 SREs with the results of Castle et al. [12], who performed a systematic screening of all 4 to 7-mers for cis-regulatory motifs enriched near ASEs using microarray. We only kept 5 to 7-mers with a p-value smaller than 10 {3 (Bonferroni-corrected p-values as described in [12]) for the comparison. In total, 137 non-redundant 5 to 7-mers were left (91 5-mers, 28 6-mers and 18 7-mers). We considered it as a match if a 7-mer contains one of our SREs, a 5-mer is part of our SREs, or a 6-mer exactly matches one of our SREs. This yielded 103 kmers, k~5,6,7, that could find at least one match in our SREs. In order to evaluate the significance of the overlap, we generated a list of 6-mers from the 137 k-mers of Castle et al. with the following procedure. For each 5-mer, 8 different 6-mers containing the 5mer were obtained by padding a nucleotide to the beginning or the end of the 5-mer. For each 7-mer, 2 different 6-mers were obtained by extracting the first or the last 6 nts. In total, 639 different 6-mers were extracted from Castle's k-mers, k~5,6,7. A significant number of 6-mer (180) were found in both our 854 SREs and the 639 6-mers obtained from 137 k-mers of Castle et al. (p-value = 9.37e27 from Fisher's exact test).

Experimental Evidence of SREs
Several well-defined SREs involved in tissue-specific AS have been detected in our work. We will take SREs bound by Fox-1 protein, polypyrimidine tract binding protein (PTB), quaking protein (QKI), muscleblind-like protein (MBNL) as examples (listed in Table 2) to illustrate how to understand the results and compare them with the available experimental evidences.
Fox protein recognizes [U]GCAUG as its SRE and it has been shown to be one of the most conserved regulators of tissue-specific AS in metazoans. Fox-1 is exclusively expressed in brain, heart, and skeletal muscle as reported in [67], which is consistent with the RNA-Seq data we used as shown in Figure 2. Its paralog Fox-2 has relatively low expression level in all tissues. In our results, we detected two SREs containing UGCAUG as summarized in Table 2. These 2 SREs were detected in heart or muscle, which is consistent with the tissues where Fox-1 is expressed. Note that the second SRE UGCAUG(UU)-GCAUGU(UU) detected in upstream region in heart is an interaction term in the regression model. Since our model includes interaction between SREs in the same region or from different regions, it is possible that an SRE longer than 6 nts is detected as an interaction term. This interaction term actually arises from a 7-mer SRE UGCAU-GU(UU) [68] (Among the 28 UGCAUG(UU)-GCAUGU(UU) pairs used for inference in heart, 27 are derived from UGCAUGU(UU)). Using the inference method for regulatory effects described in Materials and Methods, we found that the SRE that we identified from muscle are enhancers in the downstream region (DU) ( Table 2), which is consistent with the computational analysis [1] and the experimental evidence [67,76].
Another well-defined SF is PTB which binds to UC-rich SREs and has high binding affinity to UCUU and/or UCUCU [69,77]. Two SREs we identified are possible binding sites of PTB (Table 2). They are found in the upstream region in adipose and lymph node. Both SREs are detected from the tissues where PTB are over-expressed as shown in Figure 2. Using Proposition 1 in Materials and Methods, These two SREs were determined to be a silencer in the upstream of ASEs. This result coincides with position-dependent alternative splicing activity of the PTB as identified using microarrays [78].
One of QKI's binding site ACUAAY [59,70] was detected in the upstream intron of the ASE in muscle ( Table 2). The SRE ACUAAC was previously reported as an over-represented motif in the downstream region of ASEs in muscle [18] and was also predicted as an upstream intronic SRE specific to the central nervous system [79]. Our result indicates that this upstream SRE detected in muscle is a silencer, which is consistent with the experimental result [80].
The MBNL family protein can bind to the motif YGCU(U/G)Y [71]. Two sequences of this motif can be found in our SREs (Table 2). However, the function of this motif is complicated. First, both MBNL and CELF family proteins can bind to similar motifs [81,82], although different protein isoforms may have different binding specificities [83]. Second, MBNL can either promote or repress splicing of specific ASEs on different pre-mRNAs by antagonizing the activity of CELF proteins [71]. Thus, although our result indicates that this SRE is related to AS regulation in muscle and lymph node, it is difficult to interpret its regulatory effect based on current result.

Experimental Evidence of SRE Pairs
We also identified 196 different cooperative SRE pairs (Table  S1). Thirty-nine percent (77 SRE pairs) of them are interactions of SREs in the same region. As discussed in the previous section, a part of this kind of interaction may come from a single SRE longer than 6 nts. These SRE pairs are actually not related to interactions, although they also have biological meanings, since a longer conserved SRE may have stronger regulatory effect than a shorter SRE. We have marked all 23 such SRE pairs in Table  S1. Among the remaining interactions between different regions (119 SRE pairs), the most frequent interactions are SREs in region pairs UD-DU, UU-UD and UD-DD. Region pair UD-DU may reflect the effect in the exon definition stage of spliceosome assembly, and region pair UU-UD may reflect the effect in the intron definition stage. A summary of interaction between SRE pairs in different regions is given in Figure 3. Several detected SRE pairs are listed in Table 3.
Several previously identified SREs were detected in our interaction results. They either cooperate with their own or other different SREs in a different region or in the same region. Two SRE pairs bound by PTB [69,77] were identified in our result. All the four SREs located at the upstream region of the 39 splice site which can also be a part of an extended polypyrimidine tract, although the 39 splice site consensus of 15 nts containing the classical polypyrimidine tract [7,11] has been removed in our analysis. One SRE pair UUCUCU-UCCUCU was identified in the upstream intron in brain. The other interaction detected involves PTB's binding site UCUUCC in the UD region and CCUUCU in the DD region ( Table 3). The downstream CCUUCU resembles the PTB binding sites and might also be bound by PTB. In a cooperative model for the regulatory mechanism of PTB, it was proposed that a single PTB or a PTB dimer could loop out the branch point upstream of 39 splice site or the ASE by binding to several pyrimidine tracts upstream and downstream to repress splicing [84]. A single PTB-binding element had weak silencing activity, while multiple PTB-binding sites at both upstream and downstream of the ASE or of a branch point could have strong inhibitory effect [85]. This model and the experimental results in [84,85] are consistent with the interaction pairs we identified.
hnRNP F/H can bind to GGGA and G-rich SREs [72,73]. The proposed mechanism underlying the splicing regulation of hnRNP  F/H involves an interaction between proteins bound at both ends of an intron that loops out the intron and brings distantly separated exons into closer proximity [86]. Consistent with this model, one pair of SREs GGGGCA(DU)/UGGGGA(DD) ( Table 3) putatively bound by the hnRNP F/H was detected in two ends of the downstream intron.
Both hnRNP E and hnRNP K proteins contain three copies of the KH domain arranged in a very similar manner, and they are the major poly-C binding proteins in mammalian cells [55,74]. We extracted all the SRE pairs in our result that contain at least 4 Cs. Two such SRE pairs (Table 3) were found to resemble the binding motif of the hnRNP E/K [55]. One SRE pair is due to a longer motif ACCCCUC at downstream intron. The other SRE pair was detected as interaction in the same region. This result may reflect the fact that the three KH domains bind RNA synergistically while a single KH domain appears to have very low level of RNA binding activity [87].
An interesting outcome of this work was the identification of many AU-rich elements in SREs and SRE pairs. The AU-rich elements have been identified previously by comparative analysis as a large class of conserved mammalian ISEs [17]. In plant splicing system, the AU-rich sequences in upstream and downstream introns appear to be involved in early intron recognition and stabilization of spliceosomal complex [88]; and multiple short AU-rich SREs could cooperatively modulate splice site usage [89].
However, the interaction between AU-rich SREs has not been reported in mammals. In our result, The SRE pairs involved in two AU-rich SREs from different regions are listed in Table 3. Among these SRE pairs, 7 pairs were detected in upstream and downstream introns, and 4 were detected at two ends of the same intron. These AU-rich SREs resemble the binding sites of TIA-1/ TIAR, Hu protein and Sam68 [49,90]. Hu proteins can bind to both upstream and downstream SREs. It has been speculated that binding of Hu protein to multiple sites both upstream and downstream of an ASE could loop out the ASE and as a result block exon definition to repress splicing [90]. Moreover, Hu proteins can promote exon inclusion by binding to AU-rich sequences that are conserved at both upstream and downstream of ASE [91]. The intricate experimental results stress the importance of further mutation analysis to study the cooperative interactions of AU-rich binding proteins.
We also identified many interactions between AU-rich elements and binding sites of other known SFs. For example, We found one interactions between upstream AU-rich SRE and a downstream SRE resembling QKI's binding site. Another example is the interactions between upstream AU-rich SRE and a CU-rich SRE resembling polypyrimidine tract (Table 3). These results indicate that AU-rich SREs and cooperative pairs may play an important regulatory role in mammalian AS and worth further experimental investigations. In summary, various cooperative mechanisms could be detected in our result as an interaction, which implies the important role of interaction in splicing regulation.

Discussion
Our regression model for splicing regulation is derived from the same biophysical principle regarding protein and nucleic acids interaction as the one used to derive the model for transcription [26,[31][32][33]. However, our model uses the ratio of the expression levels of two isoforms or equivalently the ratio of binding and unbinding probabilities of the spliceosome to the mRNA as the response variable, whereas the model for transcription uses the gene expression level or equivalently the binding probability of the RNAP to the promoter as the response variable. For this reason, our model for splicing is linear with respect to unknown parameters to be inferred, whereas the model for transcription is nonlinear. While the nonlinear model for transcription with relatively small number of unknown parameters can be directly inferred [26], a linear approximation [34][35][36][37] and an nonlinear approximation using splines [18,38] have been proposed to facilitate model inference. It was shown that the spline approximation [37,38] can offer significantly better performance than the linear approximation in terms of the variance explained by the model. The linear regression model [19] and the spline regression model [18] for splicing regulation use the expression level of an isoform as the response variable. Therefore these two models are also approximate models. In contrast, our regression model is directly derived from the biophysical principle and the linearity of our model with respect to unknown parameters enables efficient model inference even when the number of unknown parameters is very large. This explains why the variance explained by our model is comparable to the best result achieved by the nonlinear biophysical model for transcription [26]. Our model may be improved to explain more variance, for example, by including interactions involved more than two SREs, by accounting for the number of occurrences of each SRE, and by including SREs of length not necessarily equal to six nucleotides, although this can increase the complexity of model inference. On the other hand, if we are interested in the identification of SREs interacting with the binding sites of a specific SF, we can design well-controlled experiments which measure the splicing profile of all the genes before and after knockdown of a specific SF, and then apply our model to identify SRE pairs related to the SF.
To the best of our knowledge, this is the first time that the regression-based approach is employed to systematically identify cooperative SRE pairs. Moreover, our regression framework can identify SRE pairs without being affected by GC content. Current methods for identifying cooperative SRE pairs in splicing regulation use hypergeometric test to find the co-occurrence of SRE pairs over-presented in different regions [22][23][24]. Since they only use sequence information, some sequence features that are not related to splicing regulation may increase the FDR. For example, without correction of GC content, the majority of the motif pairs detected in [22,23] with high p-values share similar GC contents, being either GC-rich or AU-rich. For this reason, they corrected the GC content by grouping the ASEs with similar GC content [22,23]. However, If AU-rich or GC-rich SRE pairs indeed have regulatory effects, correction for GC content might introduce bias and under-or over-estimate the statistical significance of the SRE pairs. In fact, it has been shown that the AU-rich motif is conserved in mammalian introns [17] and could be bound by TIA-1/TIAR, Hu protein or Sam68 [49,90]. In our work, since our model uses the isoform expression information in addition to the sequence information, it can automatically handle the GC content problem. If there is no correlation between the splicing response and the GC-or AU-rich SRE pairs, these pairs will not be identified since they are just sequence features irrelevant to splicing regulation. On the other hand, if they do have regulatory effect, our method will identify them as SRE pairs based on their correlation with the response.
Since we want to detect all possible SREs and SRE pairs, our linear regression model contains a very large number of candidate SREs and their pairs as variables. The challenging problem in model inference is to select correct variables without overfitting the data. We employed four techniques to tackle this problem. First, variable screening is used to exclude SREs and SRE pairs that are present in less than 1% of the samples or have no or very small correlation with the response variable. Second, regularized inference methods, the Lasso and the adaptive Lasso, were employed in conjunction with cross-validation to select a small number of SREs and SRE pairs. Third, RCV is used to estimate the residual variance which was further used to prevent the possible overfitting problem. Finally, the FDR was calculated to retain only the most statistically significant SREs and SRE pairs in the final model. Overall, these steps combine the state-of-the-art techniques and form an effective framework to reduce the FDR and prevent model overfitting, without compromising the power of detection.
The SREs and SRE pairs we identified have a significant overlap with a set of SREs identified with experiments. The regulatory effects of several well-defined SREs were correctly inferred from our model. For several different interaction patterns proposed based on the experimental results, our model successfully identified them as SRE pairs in the same regions as the proposed ones and provided more insight into their interactions. Note that we can identify SRE pairs at two ends of intron [23], at two ends of exon [22], and in the same region [24] in one framework, and  capture the combinatorial regulatory effects of multiple SREs more faithfully. We also report AU-rich SRE pairs as a putative interaction pattern that is important and prevalent in human splicing regulation. In summary, our biophysical regression model provides a useful platform for discovering splicing regulators and unraveling splicing regulatory mechanisms.

The Biophysical Model
Suppose a gene contains an ASE, and it can generate an isoform I 1 with the ASE or another isoform I 2 without the ASE. Since splicing is coupled with transcription and the product emerging from this coupled process is either I 1 or I 2 , we can consider the splicing of each pre-mRNA independently. We model the multistep assembly of spliceosome S to the splice sites of the ASE on each individual pre-mRNA as a single chemical reaction: where RNA denotes the state in which S is not fully assembled, and RNA represents the state in which S is fully assembled around the ASE. Then I 1 is produced from the RNA state and I 2 is produced from the RNA state. The probability of having a spliceosome assembled around the ASE is equal to the probability of the RNA state in reaction (3), which can be expressed as can write P s as follows: where q s~ks ½S. Similar to the gene expression model [36,38], the dynamic changes of the concentration of I 1 denoted as E I 1 can be written as: where k g and k d are synthesis and degradation rates, respectively.
In the steady state where dE I 1 =dt~0, we have E I 1~k g k d P s~a P s , where a~k g =k d . Likewise, concentration of the second isoform E I 2~c (1{P s ), where c is another constant. Thus, the ratio of E I 1 and E I 2 can be written as: where constant c~a=c. Now we consider an SF that can bind to an SRE around the ASE and influence the assembly of the spliceosome. The pre-mRNA can have four possible states: 1) bound by both S and SF (RNA), 2) bound by S only (RNA), 3) bound by SF only (RNA), and 4) bound by neither of them (RNA). Then the probability P s of having a spliceosome assembled around the ASE is equal to the probability of states 1) and 2). Following [26,31,32], we can write P s as P s~( z 1 zz 2 )=(z 1 zz 2 zz 3 zz 4 ), where z i is the Boltzmann weight for state i. Let w be the cooperative factor reflecting the interaction between SF and S, q sf~ksf ½SF where k sf~½ RNA : SF ½RNA½SF , then it is not difficult to find that z 1~w q s q sf , z 2~qs , z 3~qsf and z 4~1 , which yields: If w~1, SF and S bind to the transcript independently and P s in (7) is simplified to that in (4). If ww1, the binding of SF to SRE increases the probability of spliceosome assembly, which implies that the SF is an enhancer. If wv1, binding of the SF has a negative effect on spliceosome assembly and the SF is a repressor. The ratio of the expression levels of I 1 and I 2 can be written as: If two SFs can cooperatively bind to their SREs around the ASE and interact with the spliceosome, it is not difficult to derive the following ratio: where q sf 1~k sf 1 ½SF 1 with k sf 1~½ RNA : SF 1 ½RNA½SF 1 , q sf 2 and k sf 2 are defined similarly for SF 2 , w i ,i~1,2, is the cooperativity factor between SF i and S, and w 12 is the cooperativity factor between two SFs. If w 12~1 , there is no cooperative interaction between two SFs, and they enhance or repress spliceosome assembly independently. In this case, (9) can be simplified as If w 12 =1, we can express (9) as: where w~( 1zw 1 q sf 1 zw 2 q sf 2 zw 12 w 1 w 2 q sf 1 q sf 2 1zq sf 1 zq sf 2 zw 12 q sf 1 q sf 2 )= ( 1zw 1 q sf 1 1zq sf 1 1zw 2 q sf 2 1zq sf 2 ). If we define b 1~l og ( 1zw 1 q sf 1 1zq sf 1 ), b 2~l og ( 1zw 2 q sf 2 1zq sf 2 ) and b 12~l og (w), we can write (11) as: log ( E I 1 E I 2 )~log (c : q s )zb 1 zb 2 zb 12 : The first term reflects the basal level of splicing determined by the spliceosome alone, the second and third terms are the effects of interactions between the spliceosome and each individual SF, while the last term is the effect of the interaction between two SFs.
This can be seen from the fact that b i~0 if w i~1 and b 12~0 if w 12~1 . In other words, if the ith SRE affects splicing, then b i =0; otherwise b i~0 ; Similarly, if two SREs interact with each other and affect splicing jointly, then b 12 =0; otherwise b 12~0 . Four different conditions are depicted in Figure 4. Note that b i ,i~1,2, is determined by k sfi , ½SF i and w i . Since an SF can bind to the same set of SREs around different ASEs, we assume that k sfi and w i are the same for different ASEs. Therefore, if we consider a set of ASEs in the same tissue or under the same condition, where ½SF i is fixed, b i is identical for these ASEs. Similarly, we assume that b 12 is a constant for different ASEs in the same tissue. On the other hand, since different exons may have strong or weak splice sites and different genes may have different degradation rates, the first term log (c : q s ) in (12) may be different for different exons even in the same tissue or under the same condition. Since our goal is to infer b 1 , b 2 and b 12 from data of multiple ASEs in the same tissue, we need to remove the exonspecific effects from the model.
If expression levels of isoforms in two tissues t 1 and t 2 are available, the first term in (12) is identical in these two tissues, and can be removed by forming the following model: The data of tissue t 2 can be regarded as a reference. Subtraction of the reference data from the data of tissue t 1 removes the exonspecific effects. When data of multiple tissues are available, we can arbitrarily choose a tissue as the reference. However, since the expression level of each isoform is estimated from the noisy measurements, a better reference can be obtained by averaging the data of multiple tissues, which is similar to the strategy used in [18]. Specifically, suppose we have a set of data E t I1 ,E t I2 for tissue t~1,:::,T, we can remove the first term in (12) by forming the following model: If we define y~log ( (14) can be simplified as: So far, we have assumed that y can be measured without any error. If the measurement error is taken into account, equation (15) becomes: where e is the measurement error modeled as a Gaussian random variable with zero mean. Model (16) is derived under the assumption that two SREs are present to regulate splicing. Since we do not know which SRE or SRE pair contributes to splicing, we can include all potential SREs and their pairwise interaction in the model by adding a parameter to the model for each potential SRE and SRE pairs. Moreover, when we use this model to identify SREs and their interactions, we need to apply it to a set of ASEs. However, different ASEs may have different SREs. To overcome this problem, we include all possible SREs (typically hexamers) and their pairwise interactions in the model, but multiply b i by a binary variable x i [f0,1g that indicates if the corresponding SRE is present in the ith ASE, and similarly multiply b ij by x i x j . This gives rise to the model in (2). We can also include interactions involving more than two SREs, but this will dramatically increase the number of unknowns that is already very large, which will make model inference extremely difficult if not impossible. For this reason, we only include pairwise interactions in our model.

Inference of Regulatory Effects
Our model inference framework will infer b i and b ij in regression model (2). If b i or b ij is not equal to zero with certain statistical significance, then we determine that the ith element or the pair involving the ith and jth elements has regulatory effect. However, it is unclear whether it is an enhancer or silencer, since it is the sign of w i {1 or w ij {1, not the sign of b i or b ij that determines an enhancing or inhibitory effect. We will next show that we can infer the regulatory effect of the ith SRE from b i and ½SF i that is the concentration of the SF that can bind to the SRE. The enhancing or inhibitory effect of an SRE pair however is difficult to infer.
Let us first consider the situation where only one reference sample is used as in (13).
). Then it can be easily shown that the sign of b i is the same as the sign of , then w i w1 and SRE i is an enhancer; otherwise, w i v1 and SRE i is a silencer.
For model (2) or (15) where multiple tissues are used as the reference, the following proposition can be used to infer the regulatory effect of an SRE: Proposition 1 Given the condition we have we have Before we proceed to prove Proposition 1, we make the following comments. The sign of b i alone can not determine if the SRE is an enhancer (w i w1) or silencer (w i v1). It should be used together with condition (17) and (18) to infer the regulatory effect of the SRE. It is difficult to determine the regulatory effect of an SRE interacting pair bound by two SFs.
Proof: For simplicity, we will omit subscript i throughout the . If ww1, f (q t ) is a monotonically increasing function; otherwise, f (q t ) is a monotonically decreasing function ( Figure 5).
Define q w such that f (q w )~1 T{1 X t=t1 f (q t ), which gives: where 0vwv1 or ww1. From (19), we obtain the following result: {1 P Define q a~qw for ww1 and q b~qw for 0vwv1. In Lemma 1 given later at the end of this section, we prove that q w is a monotonically decreasing function of w. Therefore, we have the following inequalities: From (14) and (15), . When ww1, due to the fact that f (q t ) is an increasing function, we have bw0 if q t1 wq a or bv0 if q t1 vq a . Similarly, when wv1, we have bv0 if q t1 wq b or bw0 if q t1 vq b . This is illustrated in Figure 5. Now suppose that ½SF t1 w 1 Using (20), we have q t1 wq b wq a . Therefore, we can infer that ww1 if bw0, or wv1 if bv0 as illustrated in Figure 5. Similarly, if ½SF t1 v½ 1 , and thus, q t1 vq a vq b . We can infer that ww1 if bv0, or wv1 if bw0, again as illustrated in Figure 5. Note that q a and q b are determined by unknown parameter w and they can be anywhere in between q ? and q 0 . Therefore, if is a monotonically decreasing function for x[(0,1) S (1,?), when q i , i~1,:::,N are positive. Proof: , where g'(x)~d g(x) dx : Let us define the numerator of dh(x) dx as J(x), since the denominator is positive, we next prove that J(x)ƒ0, which implies that dh(x) dx ƒ0 and therefore h(x) is a decreasing function.
where the inequality is due to the facts that g(x)w0 and that ), since xw0 and q i w0,i~1,:::,N.

ASE Selection
The KnownGene table of human February 2009 assembly (hg19) was downloaded from the University of California Santa Cruz (UCSC) genome database [39]. We chose UCSC Known Genes as the reference gene annotation, since they contain a comprehensive gene set that is constructed mostly from experimental data in Genbank and Uniprot [92]. For each gene, the KnownGene table gives all known isoforms of its mRNA transcripts. An exon was selected as an ASE in our dataset if the following five criteria were satisfied: 1) at least one isoform includes the exon, 2) at least one isoform does not include the exon, 3) the upstream 59 splice site is the same in all isoforms, and similarly the downstream 39 splice site is the same in all isoforms, as illustrated in Figure 6A, 4) the upstream 39 splice site is the same in all isoforms with the ASE, and similarly the downstream 59 splice site is the same in all isoforms with the ASE, as illustrated in Figure 6A, and 5) both the upstream and downstream introns are of §400 nts. With these strict criteria, the five regions of the ASE shown in Figure 1A are defined without ambiguity. Note that isoforms in Figure 6B do not satisfy criterion 3), and thus ASEs with isoforms in Figure 6A and 6B are not included in our data set. Similarly, isoforms in Figure 6C do not satisfy criterion 4), and ASEs with isoform in Figure 6A and 6C are not included in our data set. To ensure a reliable estimate of the expression ratio, we only kept ASEs with gene expression level greater than 3 RPKM (reads per kilobases per million mapped reads). This gave a set of ASEs for each tissue. The number of ASEs for each tissue is given in Table 1 and more detailed description of the ASEs including their genomic coordinates are given in Table S2. Most ASEs were used for model inference in almost all tissues. Some ASEs were not used in a specific tissue because they did not pass the minimum expression requirement in that tissue. Note that although almost the same set of ASEs were used, the splicing response variable y of the same ASE was different in different tissues, and thus the data used for model inference were in fact different for different tissues.

RNA-Seq Data
The data set in [1] includes RNA-Seq reads from 9 tissues: adipose, whole brain, breast, colon, heart, liver, lymph node, skeletal muscle and testes, as well as several cerebellar cortex samples and cell lines. We only used RNA-Seq data of 9 tissues, which contains over 200 million reads of 32 nts, to detect SREs and cooperative SRE pairs.

Estimation of Expression Level and Inclusion Ratio
We started by mapping the RNA-Seq reads against an expanded human genome (hg19) downloaded from the UCSC genome database, allowing up to two mismatches, using Bowtie (version 0.12.7) [93]. The expanded human genome consists of the UCSC hg19 whole genome reference sequence and the 56 nt long splice-crossing sequences for each exon junction documented in the UCSC KnownGene table. Reads that could be mapped to multiple loci of the genome were excluded, and 140 million uniquely mapped reads were kept for the following analysis. We next calculated the expression level of each isoform including or excluding a selected ASE in 9 tissues using the algorithm of Jiang et al. [94]. Since we only kept uniquely mapped reads, for an exon of length l, we used an effective exon length l{r{m, where r is the read length and m is the number of multimappable positions of the exon. To find out m, we re-mapped all possible 32-nt subsequences of candidate ASEs and splice junctions against the same expanded genome described above using Bowtie [93]. Moreover, to minimize the effect of nonuniformity of read distribution [95], we only used three exons, including the ASE itself, the adjacent upstream and downstream exons to estimate the expression level of each isoform.
After the expression level of each isoform of the selected gene was calculated, the inclusion ratio (IR) of an ASE in a specific tissue was calculated as the ratio of the expression level of the isoforms with the ASE to the total expression level of all isoforms of the gene, i.e., IR~E , where E I1 is the total expression level of isoforms including the ASE and E I2 is the total expression level of isoforms excluding the ASE.

RNA Sequence Elements
For each tissue and each ASE, we extracted all hexamers in five regions around the ASE, including the 200 nts intronic region adjacent to the upstream 59 splice site (UU in Figure 1A), the 200 nts intronic region adjacent to the upstream 39 splice site (UD in Figure 1A), the ASE region (EXON in Figure 1A), the 200 nts intronic region adjacent to the downstream 59 splice site (DU in Figure 1A) and the 200 nts intronic region adjacent to the downstream 39 splice site (DD in Figure 1A). The EXON region is the ASE itself if the ASE is less than 200 nts; otherwise, it is the combination of the first and last 100 nts of the ASE. Since the 59 and 39 splice sites have the consensus sequences MAG/GURAGU and Y 10 NCAG/G [11,96], respectively, we excluded the sequences in the window from 23 to 6 around the 59 splice site and in the window from 214 to 1 around the 39 splice site in our analysis.

Variable Screening
Before applying the Lasso, we used a strategy similar to the sure independence screening [40] to reduce the dimensionality of the feature space, thereby improving variable selection in terms of both speed and accuracy. Specifically, for the ith hexamer, i[(1,:::,4 6 ), we used the following simple linear regression to test the correlation between its presence in one of the five regions of ASEs with the response variable: where x ei ,e~1,:::,n is a binary variable to indicate if the ith hexamer is present (x ei~1 ) or absent (x ei~0 ) in one of the five regions of the eth ASE, and y e is the splicing response of the eth ASE as defined earlier, and E e ,e~1,:::,n are independent and identically distributed normal random variables. In some samples, we have inclusion ratio IR e~1 or IR e~0 , usually due to the low read abundance of the minor isoform. For these samples, we set which is equivalent to IR e &0:9999 or IR e &1e {5 . Hexamers having a significant correlation with a p-value v 0.05 were kept in set M for further analysis with the Lasso and the adaptive Lasso.
In the next step, for each pair of the retained hexamers, their interaction was tested using the model: Interaction terms with a p-value v 0.05 were also kept in set I for further analysis. To reduce the possible false positive effects, we also required that the co-occurrence frequency of the two hexamers in an interaction pair was significant (p-valuev0.05 from a hypergeometric test based on the null hypothesis that the presence of the first hexamer is independent of the presence of the second hexamer) in the five regions of the selected ASEs defined earlier, and that any hexamer or hexamer-pair must be present in at least 1% of the ASEs.
We also define x i : Ã x j as element-wise multiplication of two vectors. The Lasso procedure was performed by solving the following problem [41]: The optimal value of parameter l was obtained using 100-fold cross-validation based on the mean squared prediction error. Then we choseŵ w i~1 =Db b i D andŵ w ij~1 =Db b ij D and solved the following adaptive Lasso problem [42]: The optimal value of l was also obtained using 100-fold crossvalidation. We solved these problems using the coordinate descent algorithm of Friedman et al. [97] implemented in the 'glmnet' package.

Refitted Cross-validation
RCV is a technique to estimate residual variance in linear regression models of ultrahigh dimension [44]. In our case, the n samples were randomly split into two even datasets. We applied the Lasso to the first dataset to select a set V of variables from the variables in M and I resulted from the variable screening procedure. We then again used the Lasso to refit the model with the variable set V to the second dataset. The refitting process selected a set V' of variables from V. Finally, the variance of the residual errorŝ s 2 1 is estimated from the second dataset with variables in V' using the OLS method. We reversed the role of the two datasets, and obtained another estimate of the variance of the residual error,ŝ s 2 2 . The final estimate is then defined aŝ s s 2~(ŝ s 2 1 zŝ s 2 2 )=2. We repeat this process 100 times by randomly splitting the dataset, and the average variance s 2 RCV of the 100 estimates was the final estimate of the residual variance.

Final Model Selection and Correction for Multiple Testing
The adaptive Lasso procedure produced a sequence of models for different values of l. For each l, we extracted the variables of the model and estimated the residual variance s 2 with these selected variables using the OLS method. For the sequence of models, we selected the one having the smallest variance that was larger than s 2 RCV as the optimal model. Although the adaptive Lasso selected a set of variable in the final model, it did not give p-value for each variable. We then used the OLS method to refit the model and calculated the p-value for each variable. Based on these p-values, we chose the variables at an FDR ƒ0:01 [45]. The final model contained these variables, and the percentage of the variance explained (R 2 ) by this model was calculated as: (y e {ŷ y e ) 2 = X n e~1 (y e { y y) 2 , ð26Þ whereŷ y e is the predicted value of y e from the final model, and y y is the sample mean of y e . Note that R 2 was also used in [26,34,38] as the figure of merit for performance evaluation. The software package for model inference framework is available under an open source license.

Supporting Information
Table S1 The SREs and SRE pairs identified in 9 tissues.