Incorporating Evolutionary Information and Functional Domains for Identifying RNA Splicing Factors in Humans

Regulation of pre-mRNA splicing is achieved through the interaction of RNA sequence elements and a variety of RNA-splicing related proteins (splicing factors). The splicing machinery in humans is not yet fully elucidated, partly because splicing factors in humans have not been exhaustively identified. Furthermore, experimental methods for splicing factor identification are time-consuming and labor-intensive. Although many computational methods have been proposed for the identification of RNA-binding proteins, no method so far focuses on the identification of RNA-splicing related proteins. Therefore, we are motivated to design a method that focuses on the identification of human splicing factors using experimentally verified splicing factors. The investigation of amino acid composition reveals that there are remarkable differences between splicing factors and non-splicing proteins. A support vector machine (SVM) is utilized to construct a predictive model, and five-fold cross-validation indicates that the SVM model trained with amino acid composition provides a promising accuracy (80.22%). Another basic feature, amino acid dipeptide composition, is also examined and yields a predictive performance similar to that of amino acid composition. In addition, this work shows that the incorporation of evolutionary information and domain information can improve the predictive performance. The constructed models have been demonstrated to effectively classify (73.65% accuracy) an independent data set of human splicing factors. The result of independent testing indicates that in silico identification could be a feasible means of conducting preliminary analyses of splicing factors and significantly reducing the number of potential targets that require further in vivo or in vitro confirmation.


Introduction
Alternative splicing (AS), in eukaryotes, is one of the mechanisms of post-transcriptional regulation that generate multiple transcripts from the same gene. These transcripts are then translated into multiple proteins having diverse biological functions. According to the comparative alignment of EST sequences and high-throughput biotechnology techniques such as exon/exon-junction array and RNA-Seq, it has been revealed that most genes (larger than 90%) undergo alternative splicing in humans [1,2,3,4]. In general, alternative splicing is regulated by splicing factors (SF) that recognize and associate with specific RNA sequence elements in order to enhance or repress the ability of the spliceosome to recognize nearby splice sites [5,6]. More precisely, the mechanism is finished through many of the positive or negative trans-acting splicing factors which are recruited to the enhancer or silencer cis-acting sequence elements of the pre-mRNA, such as exonic splicing enhancer (ESE), exonic splicing silencer (ESS), intronic splicing enhancer (ISE) and intronic splicing silencer (ISS) [7,8,9]. Meanwhile, the process exploits the dynamic composition of splicing factors under various cell lines or developmental stages to have flexible intermolecular interactions such as protein-RNA, RNA-RNA, and protein-protein interactions [10,11,12]. Cancer cells often take advantage of this flexibility to produce proteins that promote growth and survival [13].
Eukaryotic messenger RNAs (mRNAs) are produced by accurately removing introns from precursors (pre-mRNAs) in a process called RNA splicing. RNA splicing is required for typical eukaryotes that produce mature mRNA before it can be used to code a correct protein through translation. The eukaryotic RNA splicing is done in a series of reactions that are catalyzed by the spliceosome, which is a collection of small nuclear RNAs (snRNAs) and proteins recruited to pre-mRNAs for carrying out intron excision [14,15]. With the comprehensively biochemical and genetic studies in a variety of biological systems, spliceosomes have been revealed to contain five essential snRNAs, each of which functions as an RNA-protein complex called a small nuclear ribonucleoprotein (snRNP) [16,17]. RNAs and proteins cooperate extensively in ribonucleoproteins (RNPs) to bring about the biological functions of splicing machinery [11]. Two types of spliceosomes have been identified for eukaryotes: one is U2-type spliceosome, which consists of U1, U2, U4, U5, and U6 snRNPs; the other is U12-type spliceosome, which is composed of U11, U12, U4atac, U5, and U6atac snRNPs [16]. The U2-type spliceosome catalyzes the removal of most introns and U12-type spliceosome recognizes less than 1% of human introns [18].
Regulation of pre-mRNA splicing is achieved through the interaction of RNA sequence elements and a variety of RNA-splicing related proteins (splicing factors) [19,20]. Within the assembled spliceosome, intron excision comprises two major chemical steps: the first step refers to the 5′ splice site cleavage and lariat formation; the second step refers to the 3′ splice site cleavage and exon ligation [14]. The initial event of RNA splicing is the recognition of specific sequences located at the 5′ and 3′ splice sites by splicing factors [21], which determines the intron boundaries. One of the well-known protein families of splicing factors, characterized by serine- and arginine-rich carboxy-terminal domains, is the SR proteins. This protein family consists of at least five different proteins with molecular masses of 20, 30, 40, 55, and 75 kD [15]. However, although the introns are excised with a high degree of precision, the splice site sequences are weakly conserved [16,22]. The alternative selection of splice sites (alternative splicing) present within a pre-mRNA leads to the production of multiple mRNAs from a single gene [13].
Due to the multiplicity of protein-protein and protein-RNA interactions that modulate the associations between splicing factors and pre-mRNAs, the first mass spectrometry-based analysis of in vitro-derived spliceosomes was limited to species visible in stained 2D-gels. This analysis was able to identify 17 previously known splicing factors (including hnRNP proteins) and 23 novel splicing-related proteins [23]. Although previous works have identified more than 200 human splicing factors based on comprehensive proteomic analysis [24,25], many of the newly identified proteins have not yet been experimentally verified to function in pre-mRNA splicing [16]. Without functional validation, it would be premature to label all of these proteins as bona fide splicing factors. A previous work by Jurica and Moore [14] manually curated about 180 human splicing factors by literature survey.
Due to the importance of splicing factors in pre-mRNA splicing, more attention is being paid to mass spectrometry-based proteomic studies [19,24,25,26,27], which have identified an increasing number of experimentally verified splicing factors. However, experimental identification has proven to be time-consuming and labor-intensive. Thus, in silico investigation has the potential for characterizing splicing factors prior to experimental verification. Over the last few years, several studies have been proposed to computationally predict RNA-binding proteins [28,29]. Additionally, many computational methods have been developed to identify RNA-binding residues on protein sequences [30,31,32,33,34,35,36,37,38,39]. In particular, SFmap [21], a web server for predicting putative splicing factor binding sites in genomic data, utilizes a modified Hamming distance formula to define a match between a splicing factor sequence query and a target sequence. The distance scores are then standardized, and a Z-score is obtained for calculating the significance of each query relative to a background model; this score is then compared to a threshold value in order to give a probable prediction. Another work by Barbosa-Morais et al. [16] presents a semi-automated computational pipeline to aid in identifying and annotating spliceosomal proteins. The proposed method utilizes annotated human splicing factors grouped into families based on full-length homology, functional domain, and Ensembl protein family classification, which are then transformed into phylogenetic trees. Their work has revealed more than 200 proteins of multiple organisms for which there is experimental evidence regarding their involvement in splicing. Furthermore, a related work by Zheng et al. [40] proposed a method which utilizes a support vector machine, a binary-class classification algorithm, to construct a model for discriminating transcription factors (TFs) from non-TFs using protein domain and functional site information. The authors also employed error-correcting output coding, a multi-class classification algorithm, in order to classify the identified TFs as basic-TFs, zinc-TFs, helix-TFs, or beta-TFs. These published works have demonstrated their accuracy and stability; however, no fully computational method has so far been developed to identify splicing factors based on protein sequences. Therefore, we are motivated to develop a novel method focusing on the identification of human splicing factors using the experimentally verified spliceosomal proteins and RNA-splicing related proteins.
In this study, the experimentally validated human splicing factors have been collected from two previously published studies [14,16]. This work not only investigates the composition of amino acids in splicing factors, but also considers evolutionary information through a position-specific scoring matrix (PSSM). The explored features are used to construct a support vector machine (SVM) model for differentiating splicing factors from non-splicing proteins. Moreover, the information of functional domains extracted from InterPro [41] is also adopted to improve the prediction scheme. Finally, an independent test set, which is not included in the training set, is also constructed to evaluate whether the predictive model is over-fitted to the training set. Figure 1 presents the system flow of the proposed method. It consists of the following steps: data collection and pre-processing, feature extraction, model learning and cross-validation, and independent testing. The details of each process are described as follows.

Data collection and pre-processing
The experimentally verified splicing factors in humans were collected from the published literature [14,16]. Jurica and Moore [14] reported about 180 manually curated splicing factors in humans by literature survey. In addition, Barbosa-Morais et al. [16] reported more than 200 splicing factors from multiple organisms by an integrative method incorporating a systematic pipeline and experimental evidence. The removal of redundant protein entries resulted in a total of 283 human splicing factors, which are regarded as positive data for feature investigation and model training. Furthermore, human proteins which are not among the positive data obtained from the literature were extracted from the UniProt protein knowledge base [42] by running a search on UniProt IDs using the keyword "HUMAN". To construct the positive data for independent testing, only experimentally verified splicing factors are obtained from the resulting dataset by collecting protein entries annotated as "RNA splicing", "spliceosome", or "splicing factors". UniProt uses such annotations to define a protein entry that has been experimentally identified to be essential for RNA splicing. This yielded 99 protein sequences, which are then regarded as positive data for independent testing. In order to filter out potential noise from the non-splicing proteins, the remaining proteins annotated with the keyword "RNA-binding" are removed. As a result, a total of 19512 proteins are regarded as negative data.
In classifying splicing factors and non-splicing proteins, there is a possibility that the prediction performance of the constructed models is overestimated due to an over-fit to the training set. Therefore, an independent test set is used to estimate the actual prediction performance. However, the prediction performance may still be overestimated if homologous sequences are found in both the training data and the independent test data. With reference to the work by Panwar et al. [43], homologous sequences are removed from the collected data by using CD-HIT. CD-HIT first forms a cluster whose representative is the longest sequence, which is then compared to the remaining sequences. If the similarity between a target sequence and the representative sequence is above the user-selected sequence identity threshold, which refers to the pairwise sequence identity between two proteins, then the target sequence is considered homologous to the representative sequence [44]. Different values were tested for the sequence identity parameter as shown in Table 1. The resulting dataset given a sequence identity parameter of 30% contains 173 positive sequences in the training set, 65 positive sequences in the independent test set, and 11113 negative sequences. The negative data is then randomly divided into two sets: 5557 protein sequences are regarded as negative data for model training, and 5556 protein sequences are regarded as negative data for independent testing.
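The greedy clustering idea behind the homology filtering can be illustrated with a minimal sketch. This is not CD-HIT itself: the function names are hypothetical, and the similarity measure is a crude stand-in from Python's standard library (`difflib.SequenceMatcher`) rather than the alignment-based sequence identity that CD-HIT computes. It only shows the "longest sequence becomes the representative, similar sequences are absorbed" logic.

```python
from difflib import SequenceMatcher


def crude_identity(a, b):
    """Stand-in similarity in [0, 1]; NOT CD-HIT's alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold=0.3, identity=crude_identity):
    """Return one representative per cluster (hypothetical helper).

    Sequences are visited from longest to shortest. A sequence whose
    similarity to an existing representative is at or above the threshold
    is treated as homologous and dropped; otherwise it starts a new cluster.
    """
    representatives = []
    for seq in sorted(sequences, key=len, reverse=True):
        if not any(identity(seq, rep) >= threshold for rep in representatives):
            representatives.append(seq)
    return representatives


if __name__ == "__main__":
    toy = ["CCCCCCCC", "AAAAAAAAA", "AAAAAAAAAA"]
    print(greedy_cluster(toy, threshold=0.3))
```

In practice the clustering would be run with CD-HIT on the real protein sequences; the sketch is only meant to make the role of the identity threshold concrete.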

Feature extraction
Compositions of amino acids and amino acid dipeptide. Each protein sequence in the data set is represented using a vector $x_i$ ($i = 1, \ldots, n$) labeled according to its corresponding protein group (e.g. splicing factor or non-splicing protein). The vector $x_i$ has 20 elements for the amino acid composition and 400 elements for the amino acid dipeptide composition. For amino acid composition, the 20 elements specify the numbers of occurrences of the 20 amino acids normalized by the total number of residues in the protein. For amino acid dipeptide composition, the 400 elements specify the numbers of occurrences of the 400 amino acid dipeptides normalized by the total number of dipeptides in the protein.
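The two composition encodings can be sketched as follows. The function and constant names are illustrative and not taken from the original work; residues outside the 20 standard amino acids are simply not counted, which is an assumption of this sketch.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 400 ordered residue pairs, e.g. "AA", "AC", ..., "YY".
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """20 values: occurrences of each residue / total number of residues."""
    n = len(seq)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(aa) / n for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """400 values: occurrences of each dipeptide / total number of dipeptides."""
    n_pairs = len(seq) - 1
    if n_pairs <= 0:
        return [0.0] * len(DIPEPTIDES)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    # Slide a window of width 2 so overlapping pairs are all counted.
    for i in range(n_pairs):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return [counts[dp] / n_pairs for dp in DIPEPTIDES]


if __name__ == "__main__":
    example = "ACCA"  # toy sequence, not a real protein
    print(amino_acid_composition(example)[:2])
    print(dipeptide_composition(example)[DIPEPTIDES.index("CC")])
```

A sliding window is used for dipeptides because a plain substring count would miss overlapping occurrences such as the two "AA" pairs in "AAA".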
Statistically significant amino acid dipeptides. In further exploring potential features for protein classification, various methods aimed at selecting relevant sequence features from a large set of features have been used [45]. In this work, the importance of amino acid dipeptides in identifying splicing factors is further investigated by measuring the statistical significance of each dipeptide in the data set. For each amino acid dipeptide, the numbers of splicing factors and non-splicing proteins containing the target dipeptide are calculated separately. The statistical significance of each dipeptide is then obtained by examining a sample against a background set based on the hypergeometric equation (P-value) [46]:
$$P = \sum_{i=t}^{\min(k,T)} \frac{\binom{T}{i}\binom{K-T}{k-i}}{\binom{K}{k}}$$
where K is the background set represented by the number of all proteins and T is the sample set represented by the number of splicing factors; k is the number of all proteins having the target amino acid dipeptide and t is the number of splicing factors having the target amino acid dipeptide. A P-value is calculated for each dipeptide based on the hypergeometric equation; a smaller P-value corresponds to a greater statistical significance. Furthermore, the positive and negative probabilities of each amino acid dipeptide are computed by dividing the number of splicing factors or non-splicing proteins having the target amino acid dipeptide by the total number of splicing factors or non-splicing proteins, respectively. The difference between the positive and the negative probability is then obtained. In this work, amino acid dipeptides having a P-value less than 0.05 and a probability difference greater than 0 are considered statistically informative for the identification of splicing factors.
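A minimal sketch of this selection criterion is given below. It assumes that the P-value is the upper tail of the hypergeometric distribution, i.e. the probability of observing at least t splicing factors among the k proteins containing the dipeptide; the function names and the toy counts are hypothetical.

```python
from math import comb


def hypergeom_pvalue(K, T, k, t):
    """Upper-tail hypergeometric probability P(X >= t).

    K: number of all proteins (background set)
    T: number of splicing factors (sample set)
    k: number of all proteins containing the dipeptide
    t: number of splicing factors containing the dipeptide
    """
    upper = min(k, T)
    tail = sum(comb(T, i) * comb(K - T, k - i) for i in range(t, upper + 1))
    return tail / comb(K, k)


def is_informative(n_pos_with, n_pos, n_neg_with, n_neg, alpha=0.05):
    """Apply both criteria: P-value < alpha and probability difference > 0."""
    p_value = hypergeom_pvalue(
        K=n_pos + n_neg, T=n_pos, k=n_pos_with + n_neg_with, t=n_pos_with
    )
    # Positive probability minus negative probability for this dipeptide.
    difference = n_pos_with / n_pos - n_neg_with / n_neg
    return p_value < alpha and difference > 0


if __name__ == "__main__":
    # Toy counts only: a dipeptide found in all 5 positives and no negatives.
    print(hypergeom_pvalue(K=10, T=5, k=5, t=5))
    print(is_informative(n_pos_with=5, n_pos=5, n_neg_with=0, n_neg=5))
```

Exact integer binomial coefficients are used so that the tail sum stays accurate even for background sets of several thousand proteins.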
Evolutionary information. Several amino acid residues of a protein can go through mutation without changing its structure, and two proteins may share similar structures with different amino acid compositions. In this work, evolutionary information is obtained using a position-specific scoring matrix (PSSM). PSSM profiles have been extensively utilized in protein secondary structure prediction, subcellular localization and other approaches in bioinformatics [47,48,49]. The PSSM profile of each protein was obtained by using a PSI-BLAST search against the non-redundant database of protein sequences compiled by NCBI [50]. Because the data consists of protein sequences of variable length, a weighted score of features is obtained by summing up the position-specific scores of the same amino acids occurring in a protein sequence to get a uniform number of features. Figure 2 displays in detail how to generate a 400-dimensional (20×20 residue pairs) PSSM feature vector for each splicing factor and non-splicing protein. A PSSM profile is a matrix of m×20 elements, where m represents the protein sequence length and 20 represents the position-specific scores for each type of amino acid. The PSSM profile is then transformed to a 20×20 matrix by summing up the rows of the same amino acid in the PSSM profile, and each resulting element is denoted as x. Finally, every element of the 400-dimensional PSSM vector is divided by the length of the sequence and then scaled by $1/(1+e^{-x})$ to normalize the values between 0 and 1.
Information of functional domains. Previous works on protein prediction have exhibited the ability of distinguishable domain regions in the classification of proteins [45]. In this work, domain information is investigated as a feature for classifying splicing factors from non-splicing proteins. To investigate the preference of functional domains in splicing factors, this study referred to the annotations in InterPro [41]. InterPro is an integrated resource, which was developed initially as a means of rationalizing the complementary efforts of the PROSITE [51], PRINTS [52], Pfam [53], and ProDom [54] databases, for providing protein "signatures" such as protein families, domains and functional sites. The domain information of each splicing factor in the training data is collected by referring to its corresponding InterPro ID in the UniProt database. The collected domains are then analyzed in order to identify the most distinguishable domains in splicing factors. For this work, functional domains present in more than five splicing factors are considered as significant domains.
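The PSSM transformation described under "Evolutionary information" above can be sketched as follows. The sketch assumes the PSSM has already been parsed into a list of m rows of 20 scores, that the column order matches `AMINO_ACIDS`, and that non-standard residues are skipped; these choices and the function name are assumptions, not details given in the text.

```python
from math import exp

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed row/column order


def pssm_to_feature_vector(sequence, pssm):
    """Collapse an m x 20 PSSM into a 400-dimensional vector.

    Rows belonging to the same residue type are summed (20 x 20 matrix),
    each element is divided by the sequence length m, and the result is
    passed through 1 / (1 + e^(-x)) so that it lies between 0 and 1.
    """
    if not sequence or len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must be non-empty and equally long")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    summed = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        r = index.get(residue)
        if r is None:  # assumption: ignore non-standard residues
            continue
        for c, score in enumerate(row):
            summed[r][c] += score
    m = len(sequence)
    # Flatten row by row: element (r, c) ends up at position 20 * r + c.
    return [1.0 / (1.0 + exp(-(value / m))) for row in summed for value in row]


if __name__ == "__main__":
    # Toy two-residue "protein" with made-up scores.
    toy_pssm = [[1] * 20, [-2] * 20]
    vector = pssm_to_feature_vector("AC", toy_pssm)
    print(len(vector), vector[0], vector[20], vector[40])
```

Residue types that never occur in a sequence keep a summed score of 0 and therefore map to 0.5 after scaling.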
Feature Combination. A hybrid approach is investigated in this work by combining different sets of feature vectors with the goal of improving splicing factor prediction performance. Three types of hybrid combinations are explored. In the first combination, the effect of combining PSSM with the composition-based features is explored. In the second combination, the effect of combining domain information with the composition-based features is explored. In the third combination, the effect of combining both PSSM and domain information with the composition-based features is explored.

Model learning and cross-validation evaluation
A support vector machine (SVM) is applied to generate computational models that incorporate the encoded set of features. For binary classification, the concept of SVM is to map the input samples into a higher dimensional space using a kernel function, and then to find a hyper-plane that discriminates between the two classes with maximal margin and minimal error. A public SVM library, LibSVM [55], is used to train the predictive model with positive and negative training sets, which are encoded with reference to various training features. The radial basis function (RBF) $K(S_i, S_j) = \exp(-\gamma \lVert S_i - S_j \rVert^2)$ is selected as the kernel function of the SVM. Cross-validation is important to the application of the predictor [56]. The predictive performance of the constructed models is evaluated by performing k-fold cross-validation. The training data is divided into k groups by splitting each dataset into k approximately equal-sized subgroups. In this work, k is set to five. During cross-validation, each subgroup is regarded as the validation set in turn, and the remainder is regarded as the training set. Next, the following measures of predictive performance of the trained models are defined:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP and FN represent the numbers of true positives, true negatives, false positives and false negatives, respectively. Additionally, the parameters of the predictive model, namely the cost and gamma values of the SVM models, are optimized to maximize predictive accuracy. To optimize the SVM parameter C and the RBF kernel parameter gamma, a grid search is applied to obtain the parameters that achieve the best accuracy during k-fold cross-validation. Then, the hybrid combinations of features that yield the highest accuracy are employed to construct predictive models for independent testing. Finally, the predictive performance of the SVM model trained with the combined features and the selected parameters (C and gamma) is evaluated using the independent testing data.
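The kernel, the fold construction and the performance measures can be illustrated with a short standard-library sketch. It does not call LibSVM; the function names are hypothetical, and the fold assignment is a simple unshuffled, strided split, whereas the study splits the positive and negative sets separately.

```python
from math import exp


def rbf_kernel(s_i, s_j, gamma):
    """RBF kernel exp(-gamma * ||s_i - s_j||^2) for two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(s_i, s_j))
    return exp(-gamma * squared_distance)


def k_fold_indices(n_samples, k=5):
    """Yield (training_indices, validation_indices); each fold validates once."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for i in range(k):
        training = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield training, folds[i]


def classification_measures(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


if __name__ == "__main__":
    print(rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=1.0))
    for training, validation in k_fold_indices(10, k=5):
        print(validation)
    # Made-up counts, used only to show the arithmetic.
    print(classification_measures(tp=8, tn=90, fp=10, fn=2))
```

With strongly imbalanced classes, as in this data set, accuracy is dominated by the negative class, so sensitivity and specificity are worth reading alongside it.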

Independent testing
In order to further evaluate the trained models, an independent test set from humans is obtained as discussed previously, resulting in 65 positive data and 5556 negative data shown in Table 1. In addition, this work also investigates the ability of the predictive model to identify splicing factors from other mammalian species (File S1).

Investigation of amino acid composition in splicing factors
The difference between splicing factors and non-splicing proteins is analyzed in terms of amino acid composition as shown in Figure 3. It can be observed that splicing factors are clearly distinguishable from non-splicing proteins at the amino acid composition level. For instance, Arginine (R), Aspartic Acid (D), Glutamic Acid (E), Glycine (G), Leucine (L), and Lysine (K) residues all exhibit a remarkable difference between splicing factors and non-splicing proteins. The dominance of these amino acid residues indicates their contribution to RNA-protein and protein-protein interactions. Among these residues, the abundance of R and K in splicing factors is reasonable because these positively charged residues can easily interact with negatively charged RNA. Another abundant amino acid group observed in splicing factors is D and E, which are negatively charged residues and are readily located on the surface of a protein for interacting with other splicing factors. Interestingly, the small size and flexibility of the G residue is probably responsible for making it suitable for the structural adjustments required during protein-protein interactions [37]. Furthermore, Leucine (L) is observed to be the most prominent among all under-represented residues. In order to examine the effectiveness of amino acid composition in identifying splicing factors, an SVM model is trained using a 20-dimensional vector consisting of the composition scores for the twenty amino acids. The amino acid composition-based model is evaluated by means of five-fold cross-validation. As shown in Table 2, the model

Investigation of amino acid dipeptide composition in splicing factors
Previous studies have shown that dipeptide composition-based methods can yield a better performance as compared to amino acid composition-based methods [43,57]. In order to investigate this claim in terms of identifying splicing factors, an SVM model is trained using amino acid dipeptide composition as features. Firstly, the composition of all possible amino acid pairs is calculated in splicing factors and non-splicing proteins, respectively. Thus, each protein sequence can be encoded as a 400-dimensional vector consisting of the composition scores for 20×20 amino acid pairs. Using the resulting 400-dimensional dipeptide vectors, an SVM model is trained and is evaluated by means of five-fold cross-validation. The dipeptide composition-based model achieved 78.62% sensitivity, 78.53% specificity, and 78.53% accuracy as shown in Table 2. It can be observed that the amino acid composition-based method yields higher accuracy in identifying splicing factors. However, using dipeptide composition yields a more balanced sensitivity and specificity.
The amino acid dipeptide composition of splicing factors and non-splicing proteins is further analyzed by selecting statistically significant dipeptides among the 400 amino acid pairs. Figure 4 shows the probability difference of the 400 amino acid pairs between splicing factors and non-splicing proteins. In the 20×20 matrix, amino acid pairs marked in red indicate over-representation in splicing factors while amino acid pairs marked in blue indicate under-representation. It can be observed in Figure 4 that DD pairs are over-represented in splicing factors, as are D residues paired with E, R, and K. Also, KK pairs are observed to be over-represented in splicing factors. Furthermore, it can also be observed that Cysteine (C) residues paired with other residues are under-represented in splicing factors. The P-value and the probability difference of each amino acid dipeptide are calculated as discussed previously. After ranking the dipeptides according to P-value, each amino acid pair having a P-value < 0.05 and a probability difference > 0 is considered a statistically significant pair. A total of 64 pairs are selected among the 400 amino acid pairs. Interestingly, it is found that these observations in Figure 4 coincide with the selected 64 significant pairs based on P-value (File S2).
An SVM model is trained using a 64-dimensional vector consisting of the composition scores for the selected 64 statistically significant amino acid dipeptides. The model is evaluated by means of five-fold cross-validation. As shown in Table 2, the statistically significant dipeptide-based model achieved 76.31% sensitivity, 79.07% specificity, and 78.98% accuracy. It can be concluded that the method used for selecting statistically significant dipeptides was able to select the features that mostly distinguish splicing factors from non-splicing proteins. Also, the method was able to maintain a performance similar to that yielded by using all 400 amino acid composition features. In line with this, it can be assumed that the dipeptides not selected by the method do not significantly distinguish splicing factors from non-splicing proteins.

Investigation of evolutionary information
It has been shown in previous works that using evolutionary information encapsulated in a PSSM profile provides a more comprehensive information as compared to single sequence features [37]. In this work, the application of evolutionary information is investigated in terms of identifying splicing factors by training an SVM model using a 400-dimensional vector derived from the PSSM profile of each protein sequence. A PSSM profile is the probability of the occurrence of each type of amino acid residues at each position along with insertion/deletion. Hence, PSSM is regarded as a measure of residue conservation in a given protein sequence. As shown in Table 2, the PSSM-based model achieved 79.81% sensitivity, 79.48% specificity, and 79.49% accuracy.

Investigation of functional domain information in splicing factors
In order to analyze functional domain information in splicing factors, the experimentally verified domains of each splicing factor in the training data are collected by referring to the "InterPro" field in UniProt. This resulted in a total of 252 functional domains existing in splicing factors. In order to capture the representative functional domains in splicing factors, functional domains which are present in more than 5 splicing factors are selected as distinguishable domains. This resulted in 15 functional domains as shown in Table 3. It is observed that the most distinguishable functional domain is the "Nucleotide-bd a/b plait" with InterPro ID: IPR012677, which exists in 46 splicing factors. Another distinguishable functional domain is the "RRM" domain with InterPro ID: IPR000504, which exists in 45 splicing factors. In order to evaluate the performance of using the selected distinguishable domains, an SVM model is trained using a 15-dimensional vector consisting of the 15 distinguishable domains represented by a binary score: 1 if present and 0 otherwise. As shown in Table 2, the domain-based model achieved 38.75% sensitivity, 93.82% specificity, and 92.16% accuracy. It can be observed that using domain information alone is not sufficient to correctly identify all splicing factors, as seen in the low sensitivity of the prediction model. As discussed previously, only those functional domains present in more than 5 splicing factors are considered by the model. This affected the prediction of true positives because many splicing factors are not annotated with the selected functional domains. This may later be improved given a more comprehensive InterPro annotation of the dataset. On the other hand, the high specificity yielded by the model signifies that the selected functional domains are meaningful since they do not exist in most of the non-splicing proteins.
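The domain selection and binary encoding can be sketched as below. The function names are hypothetical, the annotations are toy data, and `IPR999999` is a made-up placeholder ID used only to show a domain that falls below the cutoff.

```python
from collections import Counter


def select_domains(positive_annotations, min_count=5):
    """Keep InterPro IDs present in more than `min_count` splicing factors.

    `positive_annotations` holds one collection of InterPro IDs per
    splicing factor in the training data.
    """
    counts = Counter(
        domain for domains in positive_annotations for domain in set(domains)
    )
    return sorted(domain for domain, n in counts.items() if n > min_count)


def encode_domains(protein_domains, selected):
    """Binary vector: 1 if the protein carries the selected domain, else 0."""
    present = set(protein_domains)
    return [1 if domain in present else 0 for domain in selected]


if __name__ == "__main__":
    # Toy annotations: six proteins with two real IDs, five with a placeholder.
    toy = [["IPR000504", "IPR012677"]] * 6 + [["IPR999999"]] * 5
    selected = select_domains(toy, min_count=5)
    print(selected)
    print(encode_domains(["IPR012677"], selected))
```

A protein that carries none of the selected domains is encoded as an all-zero vector, which is consistent with the low sensitivity reported for the domain-only model.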

Cross-validation performance using hybrid features
The composition-based features are combined with PSSM and domain information in order to investigate the effects of incorporating evolutionary information and domain information. Three types of hybrid combinations are explored in this study: the first type refers to the combination of basic sequence information with evolutionary information; the second type refers to the combination of basic sequence information with domain information; and the third type refers to the combination of basic sequence information with both evolutionary information and domain information. An SVM model is trained using each set of hybrid feature combination. As shown in Table 4, the amino acid composition-based model improved

Independent testing
The method is further evaluated by using an independent data set composed of collected human splicing factors and non-splicing proteins as discussed previously. The independent data is first tested on each model trained on single features as shown in Table 5. It can be observed that the amino acid composition-based model yields a lower performance with 68.07% sensitivity, 68.17% specificity, and 68.17% accuracy as compared to the models based on dipeptide composition and statistically significant dipeptides. The dipeptide composition-based model performs with 69.61% sensitivity, 69.64% specificity, and 69.63% accuracy while the statistically significant dipeptides-based model performs slightly higher with 69.61% sensitivity, 70.46% specificity, and 70.45% accuracy. With regard to the use of evolutionary information, the PSSM-based model achieved the highest performance among all single feature-based models with 72.69% sensitivity, 72.20% specificity, and 72.21% accuracy. On the other hand, similar to its cross-validation performance, the domain-based model performed with a low sensitivity of 21.53%, 93.63% specificity, and 92.79% accuracy.
The independent data is then tested on the models based on hybrid feature combinations. As presented in Table 6, the amino acid composition-based model improved in classifying the independent data with 72.69% sensitivity, 72.16% specificity, and 72.17% accuracy when combined with the evolutionary information from PSSM profiles. Both the dipeptide composition-based and the statistically significant dipeptides-based models also improved with 72.69% sensitivity, 72.20% specificity, and 72.21% accuracy when combined with evolutionary information. With regard to incorporating basic features with domain information, the amino acid composition-based model yields a slightly lower performance on the independent data with 68.07% sensitivity, 68.10% specificity, and 68.10% accuracy. The statistically significant dipeptides-based model also yields a lower performance with 66.53% sensitivity, 66.57% specificity, and 66.57% accuracy. On the other hand, the dipeptide composition-based model slightly improved with 68.07% sensitivity, 70.53% specificity, and 70.50% accuracy. Furthermore, incorporating both domain and evolutionary information into the basic feature-based models gives the highest performance with 74.23% sensitivity, 73.64% specificity, and 73.65% accuracy. Similar to the cross-validation performance, incorporating both domain information and evolutionary information into the three basic models allowed them to converge at the same prediction performance.

Conclusion
Although the importance of splicing factors has been indicated in pre-mRNA splicing and alternative splicing, in vivo or in vitro identification of splicing factors is subject to technical limitations. Here we propose a computational method to identify splicing factors on the basis of the amino acid sequence of a protein. With reference to two previously published works, a total of 283 experimentally verified human splicing factors have been obtained in this study. After the removal of homologous sequences, the investigation of amino acid composition reveals that there are remarkable differences between splicing factors and non-splicing proteins. The most prominent feature is the abundance of positively and negatively charged residues in splicing factors. Another important characteristic is the slight enrichment of G residues in splicing factors. A five-fold cross-validation evaluation has demonstrated that using amino acid composition could provide a promising prediction accuracy. Another basic feature, amino acid dipeptide composition, is also examined and has a predictive performance similar to that of amino acid composition. Moreover, this work has shown that the evolutionary information could provide a balanced predictive performance, whereas the domain information resulted in low sensitivity and high specificity. However, the incorporation of evolutionary information and domain information improves the predictive performance compared to the models trained with basic features. Additionally, the independent testing has demonstrated that the constructed model can identify new splicing factors in the human proteome, as well as in mouse and rat (File S1). Although several approaches have been proposed to computationally predict RNA-binding proteins [28,29], these methods, such as the web server RNApred [28], provide a high sensitivity but a very low specificity on the collected human independent testing data.
The biological process of RNA splicing machinery has not yet been fully elucidated, partly because splicing factors are not yet exhaustively identified. The recent genome-wide sequencing techniques [16,19,26] provide an opportunity to exhaustively observe splicing factors in an organism. This work shows that the in silico identification could be a feasible means of conducting preliminary analyses as well as significantly reducing the number of potential targets that require further in vivo or in vitro confirmation.

Supporting Information
File S1 Cross-species Testing.