Predicting Functions of Proteins in Mouse Based on Weighted Protein-Protein Interaction Network and Protein Hybrid Properties

Background With the huge number of uncharacterized protein sequences generated in the post-genomic age, it is highly desirable to develop effective computational methods for predicting their functions quickly and accurately. The information thus obtained would be very useful for both basic research and drug development. Methodology/Principal Findings Although many efforts have been made in this regard, most of them were based on either sequence similarity or protein-protein interaction (PPI) information. However, the former often fails if a query protein has little or no sequence similarity to any protein of known function, while the latter has a similar problem if the relevant PPI information is not available. In view of this, a new approach is proposed that hybridizes the PPI information with the biochemical/physicochemical features of protein sequences. The overall first-order success rates of the new predictor for the functions of mouse proteins on the training set and test set were 69.1% and 70.2%, respectively, and the success rate covered by the top-4 orders out of a total of 24 orders was 65.2%. Conclusions/Significance The results indicate that the new approach is quite promising and may open a new avenue for addressing this difficult and complicated problem.


Introduction
With the rapid growth of genome sequences and gene expression profiles, there is increasing interest in using computational methods to determine the linkages between protein sequences and their biological functions [1,2,3,4]. This is because experimental determination of protein functions is both expensive and time-consuming, and hence can no longer keep pace with the fast growth of newly found proteins. In this paper, we propose a computational method to tackle the problem by studying the functions of proteins in mouse, one of the most extensively studied organisms. On the one hand, the functions of many mouse proteins are already known, which helps us establish a statistical predictor with a solid training dataset. On the other hand, thousands of mouse proteins still lack functional annotation [5], and it would be beneficial to develop a well-performing predictor that provides their possible functions in a timely manner, particularly for drug target selection and screening [4].
The most established approaches [6] for protein function prediction are based on sequence similarity using BLAST [7], sequence motifs such as PROSITE [8], profile methods such as PFAM [9] and PSI-BLAST [7], and structure-based methods such as FATCAT [10] and ProCAT [10]. However, estimates based on 2 million known proteins suggested that only about 33% of proteins with unknown function were closely related to well-characterized homologues and could be effectively targeted by these methods [11]. Because protein-protein interaction (PPI) data sets are becoming available for more and more organisms, using PPI data to assign protein function has also been extensively studied. Algorithms based on PPI data assume that proteins with short distances to each other in the network are likely to share common biological functions [12,13,14,15], and that interacting neighbors are more likely to have the same biological functions than non-interacting ones [16,17]. This is because the query protein and its interacting proteins may form a protein complex to perform a particular function. In a recent review [12], Sharan et al. described two main classes of network-based methods for predicting protein functions: direct methods, such as neighborhood counting methods [18] and graph theoretic methods [19,20]; and module-assisted methods, such as hierarchical clustering-based methods [21,22] and graph clustering methods [23,24]. However, a number of these studies considered only binary interactions [25,26,27], i.e., whether two proteins interact with each other or not, without considering the likelihood that the interaction occurs. Such likelihood is often associated with the interaction strength. Another problem is that some proteins still lack interaction data, which prevents them from being predicted and analyzed. In view of this, in the current study we adopt a weighted interaction network instead of a binary one.
For those proteins that do not have PPI data, the hybrid properties of proteins, including their biochemical and physicochemical properties, are used to code the protein samples for prediction, because such coding treatments have been successfully used to improve the quality of predicting many other protein attributes, such as membrane protein type [28], protein subcellular location [29], and protein complexes [30].
A total of 24 functional categories are adopted in this paper. Each protein is predicted as belonging to some of these 24 functional categories. Here, we are concerned not only with the first-order predicted function, i.e., the most likely function given by the prediction model, but also with the lower-order functions sorted by the prediction criteria. As a result, the accuracies of the first-order function prediction for the overall training set and test set were 69.1% and 70.2%, respectively.

Data set
The dataset for studying the functions of proteins in mouse was downloaded from MfunGD [31] (MOUSE Functional Genome Database, ftp://ftpmips.gsf.de/MfunGD/). There are a total of 24 function categories from FunCat [32], with annotations collected from manual annotation of the literature and from GO annotation [33,34]. Among the 42,682 proteins obtained, there were 14,732 proteins with both sequence and function information, constituting the benchmark dataset. These proteins belonged to 24 functional categories. The number of proteins in each of the 24 categories is shown in Table 1, from which we found that most proteins perform more than one function.
The interaction network takes proteins as its nodes, with an edge between two proteins if they interact with each other. The initial weighted PPI network was retrieved from STRING [35] (http://string.embl.de/), which is a large database of known and predicted protein interactions. These interactions include direct (physical) and indirect (functional) interactions, derived from numerous sources such as experimental repositories and computational prediction methods. In the network, each edge is marked with a score as the edge weight to quantify the interaction confidence, i.e., the likelihood that an interaction occurs.
The 14,732 proteins were then separated into two subsets: (A) 10,194 proteins in the above PPI network, for training and testing the network-based method (see the following section); (B) 4,538 proteins not in the PPI network, for training and testing the hybrid-property based method (see the following section). For subset A, 1,076 proteins were randomly selected as the independent test set for the network-based method, denoted by $S^{n}_{Te}$, and the remaining 9,118 proteins comprised the training set $S^{n}_{Tr}$. Because the initial network was divided into two parts, some edges were removed, causing a few proteins to lose the connections to all their neighbors in the initial network. Such proteins, having lost their PPI information, were removed from the training set and test set for the network-based method. As a result, $S^{n}_{Te}$ consisted of 1,074 proteins and 88,960 interactions, and $S^{n}_{Tr}$ consisted of 9,093 proteins and 742,200 interactions. For subset B, homologous proteins were removed using CLUSTAL-W [36] so that any two proteins had lower than 50% sequence identity; then 248 proteins were randomly selected to constitute the test set $S^{h}_{Te}$ for the hybrid-property based method, and the remaining 2,905 proteins were assigned to the training set $S^{h}_{Tr}$. These four datasets can be found in Table S1, Table S2, Table S3, and Table S4.

Network-based method
First, the proteins in the MfunGD [31] database and those in the STRING [35] database need to be mapped to each other to construct the weighted PPI network. Then the functions of a query protein can be predicted from the interaction network according to some criteria.
Network mapping. The protein IDs in MfunGD differ from the Ensembl protein IDs used in STRING. To convert between them, the MGI ID [37] of each MfunGD protein was mapped to the corresponding Ensembl protein ID using BioMart [38].
Prediction with PPI information. For a query protein in the PPI network, we care not only about its neighbor proteins but also about the weights of the interactions. Generally, let us consider a PPI network in which proteins belong to 24 functions $F = [F_1, F_2, \ldots, F_{24}]$, where $F_1$ denotes "METABOLISM", $F_2$ "ENERGY", $F_3$ "CELL CYCLE AND DNA PROCESSING", and so forth (cf. Table 1). Suppose the network consists of n proteins $\{P_1, P_2, \ldots, P_n\}$, in which the functions of the k-th protein are denoted by

$$\Phi(P_k) = [f_{k,1}, f_{k,2}, \ldots, f_{k,j}, \ldots, f_{k,24}]^{T} \quad (k = 1, 2, \ldots, n;\ j = 1, 2, \ldots, 24) \tag{1}$$

where

$$f_{k,j} = \begin{cases} 1, & \text{if } P_k \text{ performs the } j\text{-th function} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

For a query protein $P_x$, we define its interaction with the proteins in the PPI network as

$$W(P_x) = [w_{x,1}, w_{x,2}, \ldots, w_{x,k}, \ldots, w_{x,n}]^{T} \tag{3}$$

where $w_{x,k}$ represents the interaction confidence score [35] between $P_x$ and the k-th protein in the network; when there is no interaction between them, $w_{x,k} = 0$. By default, we also have $w_{x,k} = 0$ if $x = k$, since there is no self-interaction in the network. Here, let us introduce a new concept, the so-called "inclined potential" of protein $P_x$ to the j-th function, which can be formulated as

$$Y(P_x \in j) = \sum_{k=1}^{n} w_{x,k}\, f_{k,j} \quad (j = 1, 2, \ldots, 24) \tag{4}$$

where $Y(P_x \in j)$ is the "inclined potential" of protein $P_x$ to the j-th function in the PPI network. Therefore, the larger the value of $Y(P_x \in j)$, the more likely it is that protein $P_x$ performs the j-th function. In other words, the most likely function of protein $P_x$ can be predicted as the m-th function if

$$m = \arg\max_{j} \{\, Y(P_x \in j) \mid j = 1, 2, \ldots, 24 \,\} \tag{5}$$

where m represents the argument j that maximizes the value of $Y(P_x \in j)$. However, because most proteins in vivo perform more than one function, a prediction with only one candidate function is not sufficient.
In view of this, to make the predictor capable of dealing with proteins with multiple functions and to provide experimental biologists with more flexible information for prioritizing candidate targets, let us introduce a 24-D (dimensional) vector to reflect the likelihood with which the query protein may perform each of the 24 functions, formulated as follows:

$$D_{\downarrow}\, [\, Y(P_x \in 1), Y(P_x \in 2), \ldots, Y(P_x \in 24) \,]^{T} \tag{6}$$

where $D_{\downarrow}$ is a descending operator that arranges the 24 scores $Y(P_x \in j)$ in descending order; e.g., if $Y(P_x \in 15) > Y(P_x \in 1) > Y(P_x \in 5) > \ldots$, then the 15th function (CELL FATE) has the highest likelihood of being performed by the query protein $P_x$, the 1st function (METABOLISM) has the second highest likelihood, the 5th function (PROTEIN SYNTHESIS) has the third highest likelihood, and so forth (cf. Table 1). In the rare case when more than one element in Eq. 6 has the same score, their order is assigned randomly. According to the descending order of Eq. 6, the predicted results are respectively called the 1st-order result, the 2nd-order result, the 3rd-order result, and so forth.
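The scoring and ranking steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: the inclined potential is taken to be the confidence-weighted sum of the neighbours' binary function indicators, the function names `inclined_potentials` and `rank_functions` and the toy data are hypothetical, and ties are broken by index rather than randomly as in the text.

```python
import numpy as np

def inclined_potentials(weights, functions):
    """Weighted vote of the network proteins for each functional category.

    weights   : shape (n,), interaction confidence between the query protein
                and each of the n network proteins (0 where no interaction).
    functions : shape (n, n_func), binary matrix; entry (k, j) is 1 if
                protein k is annotated with function j.
    """
    weights = np.asarray(weights, dtype=float)
    functions = np.asarray(functions, dtype=float)
    return weights @ functions

def rank_functions(potentials):
    """Return 1-based function indices sorted by decreasing potential.

    Assumption: ties keep index order (stable sort); the text assigns
    tied functions a random order instead.
    """
    order = np.argsort(-np.asarray(potentials, dtype=float), kind="stable")
    return [int(j) + 1 for j in order]

# Hypothetical toy network: three annotated proteins, four categories.
toy_weights = np.array([0.9, 0.4, 0.0])
toy_functions = np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
])
toy_potentials = inclined_potentials(toy_weights, toy_functions)
toy_ranking = rank_functions(toy_potentials)  # 1st-order result comes first
```

In this toy case the third category collects votes from both interacting neighbours and is therefore ranked first, while the category carried only by the non-interacting protein is ranked last.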

Hybrid-property approach
First, each protein is coded into a feature vector using the hybrid properties. Second, the features are sorted by the mRMR method. Third, the Incremental Feature Selection method is applied to select the best-performing predictor constructed with the Nearest Neighbor Algorithm.
Biochemical and physicochemical description of proteins. Many studies have indicated that the success rates for predicting protein attributes can be remarkably improved by incorporating various biochemical or physicochemical properties into the descriptors of protein samples ([39,40] and a long list of relevant references cited in a recent review [41]). Therefore, the biochemical and physicochemical properties (hybrid properties) of proteins are used to code proteins. As the name suggests, they consist of two kinds of properties: (1) biochemical properties, including two features: amino acid composition and secondary structural propensity; (2) physicochemical properties, including five features: hydrophobicity, polarizability, solvent accessibility, normalized van der Waals volume, and polarity [40].
Of the above seven features, except for the amino acid composition that is an extended quantity to reflect the entire protein, the rest are local quantities to reflect each of the constituent amino acids. Each of such local feature quantities can be classified into two or three groups. For instance, for each amino acid, the secondary structural propensity is characterized as three groups: helix, sheet and coil as predicted by Predator [42]; the hydrophobicity is classified into polar, neutral, or hydrophobic [43]; the solvent accessibility is marked as buried or exposed to solvent by PredAcc [44].
The biochemical or physicochemical character of a protein sequence can be reflected through each of these local feature properties. For instance, using P, N and H to represent the three groups of hydrophobicity (polar, neutral, and hydrophobic), a given protein sequence "MSDKPDMAEIEKFSKETIEQEKQAGESTQEKNPLPMLLPATDKSKLKKTE" can be transformed into "HNPPNPHNPHPPHNPPNHPPPPPNNPNNPPPPNHNHHHNNNPPNPHPPNP".
For each such letter sequence, three properties can be obtained: composition (C), transition (T), and distribution (D). C describes the global percent composition of each of the groups in the letter sequence; T, the percent frequencies with which the letter changes to another along the entire length of the letter sequence; and D, the distribution pattern of the letters along the sequence, measuring the percentage of the sequence length within which the first, 25%, 50%, 75%, and 100% of the amino acids of each letter is located.
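As a rough illustration of the composition/transition/distribution coding, the sketch below encodes the example sequence by hydrophobicity group and derives 21 values from the resulting letter string. Several details are assumptions rather than statements from the text: the three-group amino acid assignment is the commonly used one and is not listed here, transitions are counted between unordered pairs of groups, the quantile positions in D are located with a ceiling rule, and the names `encode` and `ctd` are hypothetical.

```python
import math

# Assumed three-group hydrophobicity classes (not given in the text).
HYDRO_GROUPS = {"P": "RKEDQN", "N": "GASTPHY", "H": "CLVIMFW"}
AA_TO_GROUP = {aa: g for g, aas in HYDRO_GROUPS.items() for aa in aas}

def encode(seq):
    """Replace each amino acid with the letter of its group."""
    return "".join(AA_TO_GROUP[aa] for aa in seq)

def ctd(letters, alphabet="PNH"):
    """Return composition (3), transition (3) and distribution (15) values."""
    n = len(letters)
    # C: percentage of each group in the letter sequence.
    comp = [100.0 * letters.count(g) / n for g in alphabet]
    # T: percentage of adjacent positions where one group changes to another.
    pairs = [(a, b) for i, a in enumerate(alphabet) for b in alphabet[i + 1:]]
    trans = []
    for a, b in pairs:
        changes = sum(1 for x, y in zip(letters, letters[1:]) if {x, y} == {a, b})
        trans.append(100.0 * changes / (n - 1))
    # D: position (as % of length) of the first, 25%, 50%, 75% and 100%
    # occurrence of each group.
    dist = []
    for g in alphabet:
        pos = [i + 1 for i, x in enumerate(letters) if x == g]
        if not pos:
            dist.extend([0.0] * 5)
            continue
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            k = max(1, math.ceil(frac * len(pos)))  # assumed rounding rule
            dist.append(100.0 * pos[k - 1] / n)
    return comp + trans + dist

example_sequence = "MSDKPDMAEIEKFSKETIEQEKQAGESTQEKNPLPMLLPATDKSKLKKTE"
example_letters = encode(example_sequence)
example_features = ctd(example_letters)  # 3 + 3 + 15 = 21 values
```

Under these assumptions the encoded string matches the P/N/H example given in the text, and the 21 values correspond to the per-property feature count used in the total of 132 features.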
For the solvent accessibility, there are only two local feature groups, resulting in seven features rather than 21 as illustrated above. The amino acid composition has 20 features, each of which represents the percentage or occurrence frequency of one of the constituent amino acids in a protein sample [45]. For each of the other five local feature properties, 21 global features can be obtained as in the case of hydrophobicity described above. Using all these results, a total of 132 (132 = 5 × 21 + 20 + 7) features can be obtained to represent a protein sequence. Listed in Table 2 are the 132 features used in our study.
After each protein was coded, two criteria were applied to the set of vectors: (1) proteins with the same coding vector but different functional categories were excluded; (2) of the proteins that share both coding vector and functional categories, only one was kept.
Feature sorting. The Maximum Relevance, Minimum Redundancy (mRMR) method was originally developed by Peng et al. to process microarray data [46]. The idea is to rank each feature based on its relevance to the target and its redundancy with other features. A "good" feature is defined as one that has the best trade-off between maximum relevance to the target and minimum redundancy within the features. To quantify both relevance and redundancy, mutual information (MI), which estimates how much one vector is related to another, is defined as follows:

$$I(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx \, dy \tag{7}$$

where x and y are two vectors, $p(x, y)$ is the joint probability density, and $p(x)$ and $p(y)$ are the marginal probability densities. Let $\Omega$ denote the whole feature set, while $\Omega_s$ denotes the already-selected feature set, which contains m features. The to-be-selected feature set with n features is denoted by $\Omega_t$. The relevance D of a feature f in $\Omega_t$ with the target c can be calculated by:

$$D = I(f, c) \tag{8}$$

The redundancy R of the feature f in $\Omega_t$ with all the features in $\Omega_s$ can be calculated by:

$$R = \frac{1}{m} \sum_{f_i \in \Omega_s} I(f, f_i) \tag{9}$$

To obtain the feature $f_j$ in $\Omega_t$ with maximum relevance and minimum redundancy, Eq. (8) and Eq. (9) are combined to obtain the mRMR function:

$$\max_{f_j \in \Omega_t} \left[ I(f_j, c) - \frac{1}{m} \sum_{f_i \in \Omega_s} I(f_j, f_i) \right] \quad (j = 1, 2, \ldots, n) \tag{10}$$

For a feature set with N features, the feature evaluation is executed for N rounds. In the first round, the redundancy is 0 because $\Omega_s$ is empty; therefore the feature with the maximum relevance to the target c is selected. After the N evaluations, the following feature set S, in selection order, is obtained by the mRMR method:

$$S = \{ f_0, f_1, \ldots, f_h, \ldots, f_{N-1} \} \tag{11}$$

where the subscript index indicates the round in which the feature is selected. The better the feature, the earlier it satisfies Eq. (10), the earlier it is selected, and the smaller its index.

Prediction with hybrid properties. The Nearest Neighbor Algorithm (NNA) is a simple machine learning method that assigns an object to the class of its nearest neighbor sample. It has been widely used for predicting protein subcellular localization (see, e.g., a comprehensive review [47] and the references cited therein). In this study, the similarity between vectors $P_i$ and $P_j$ is defined as

$$Q(P_i, P_j) = \cos(P_i, P_j) = \frac{P_i \cdot P_j}{\|P_i\| \cdot \|P_j\|} \tag{12}$$

where $P_i \cdot P_j$ is the inner product of $P_i$ and $P_j$, and $\|P_i\|$ and $\|P_j\|$ represent their moduli, respectively. Given a query protein $P_x$ and a dataset consisting of n proteins $\{P_1, P_2, \ldots, P_n\}$, the larger the value of $Q(P_x, P_k)$, the more likely it is that $P_x$ has the same function as $P_k$.
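The greedy ranking described for mRMR can be sketched as follows. This is a simplified illustration, not the original implementation: it assumes the features have already been discretized so that mutual information can be estimated from empirical frequencies instead of probability densities, and the function names and toy data are hypothetical.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete vectors."""
    x = np.asarray(x)
    y = np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        px = np.mean(x == xv)
        for yv in np.unique(y):
            py = np.mean(y == yv)
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def mrmr_rank(features, target):
    """Return column indices in mRMR selection order.

    Each round picks the remaining feature with the largest value of
    (relevance to target) - (mean MI with already-selected features).
    """
    features = np.asarray(features)
    n_features = features.shape[1]
    relevance = [mutual_information(features[:, j], target) for j in range(n_features)]
    selected, remaining = [], list(range(n_features))
    while remaining:
        def score(j):
            if not selected:  # first round: redundancy is zero
                return relevance[j]
            redundancy = np.mean(
                [mutual_information(features[:, j], features[:, s]) for s in selected]
            )
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy data: the class is determined by two bits a and b.
a = np.array([0, 0, 0, 0, 1, 1, 1, 1])
b_noisy = np.array([0, 0, 1, 1, 0, 0, 1, 0])  # b with one flipped entry
toy_target = np.array([0, 0, 1, 1, 2, 2, 3, 3])
toy_features = np.column_stack([a, a, b_noisy])  # column 1 duplicates column 0
toy_order = mrmr_rank(toy_features, toy_target)
```

In the toy data the duplicated column is as relevant as the first one but fully redundant with it, so the noisier yet complementary third column is selected before it.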
Modeling. The Incremental Feature Selection (IFS) method [48,49] is employed to select the optimal feature subset from the feature space and the predictor with the best performance. First, 132 feature subsets were generated according to the sorted features as follows:

$$S_i = \{ f_0, f_1, \ldots, f_i \} \quad (0 \le i \le N - 1) \tag{13}$$

where $f_i$ is the i-th feature in the sorted 132 features. With each feature subset, the proteins were recoded into (i+1)-dimensional vectors. Then the functions of each protein were predicted with the hybrid properties as described above (see Eq. 12 and Eq. 6) according to the NNA. A curve, named the IFS curve, was plotted with the index i on the x-axis and the first-order accuracy of feature subset $S_i$ on the y-axis. The optimal feature set $S_{optimal} = \{ f_0, f_1, \ldots, f_h \}$ was selected where the curve reached its peak, at index h. The predictor based on $S_{optimal}$ was then used to predict the functions of proteins.
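The incremental procedure can be illustrated with a small sketch that grows the feature subset along a given ranking and scores each subset by leave-one-out nearest-neighbour prediction with cosine similarity. It is a deliberate simplification: each sample carries a single label (the study handles multiple functions per protein), the ranking is assumed to be supplied by a prior sorting step, and all names and data are hypothetical.

```python
import numpy as np

def loo_nn_accuracy(X, labels):
    """Leave-one-out accuracy of a cosine-similarity nearest-neighbour rule."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0           # avoid division by zero for null vectors
    unit = X / norms
    sim = unit @ unit.T               # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)    # a sample may not be its own neighbour
    nearest = np.argmax(sim, axis=1)
    return float(np.mean(labels[nearest] == labels))

def incremental_feature_selection(X, labels, ranking):
    """Evaluate subsets S_0, S_1, ... and return the best one with the curve."""
    X = np.asarray(X, dtype=float)
    accuracies = []
    for i in range(len(ranking)):
        subset = ranking[: i + 1]
        accuracies.append(loo_nn_accuracy(X[:, subset], labels))
    best = int(np.argmax(accuracies))  # first peak of the IFS curve
    return ranking[: best + 1], accuracies

# Hypothetical toy data: columns 0 and 1 are informative, column 2 is noise.
toy_X = np.array([
    [1.0, 0.0,  5.0],
    [1.0, 0.0, -5.0],
    [0.0, 1.0,  5.0],
    [0.0, 1.0, -5.0],
])
toy_labels = np.array(["A", "A", "B", "B"])
toy_ranking = [0, 1, 2]
toy_selected, toy_curve = incremental_feature_selection(toy_X, toy_labels, toy_ranking)
```

Here the accuracy rises when the second informative column is added and collapses once the noisy column dominates the cosine similarity, so the selected subset stops at the peak of the curve.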
For more discussions about the hybrid-property approach, refer to [49,50].

Overall prediction
The prediction was carried out according to such a procedure that if a test protein has PPI information, the network-based method was applied for identifying its functions; otherwise, the hybrid-property based method was applied.
Three cross-validation methods are often used in statistical prediction [51]: the independent dataset test, the subsampling (K-fold) test, and the jackknife test. Of these three, the jackknife test is deemed the most objective because it always yields a unique outcome for a given benchmark dataset, as elucidated in [29] and demonstrated by Eq. 50 of [47]. Accordingly, the jackknife test has been increasingly used by investigators to evaluate various predictors (see, e.g., [52,53,54,55]). During the jackknifing for the network-based method, each node (protein) was in turn taken away from the PPI network and then predicted. During the jackknifing for the hybrid-property based method, each protein was in turn singled out and predicted according to the NNA. In this study, the j-th order overall accuracy $AC_j$ for the dataset is calculated as

$$AC_j = \frac{M_j}{N} \quad (j = 1, 2, \ldots, 24) \tag{14}$$

where $M_j$ is the number of proteins whose j-th order predicted function is one of the true functions of the protein, and N is the total number of proteins in the dataset. The 24 order-wise overall accuracies were therefore used to evaluate both methods. A high $AC_j$ for small j together with a low $AC_j$ for large j means that the method performs well in the prediction. Besides, the average number of functions that each protein in the dataset performs is calculated as

$$C = \frac{\text{Number of functions of proteins}}{\text{Number of proteins}} \tag{15}$$

Hence, another evaluation for both methods is the likelihood $P_k$ that the first k-order prediction results include all the functions of proteins, calculated by Eq. (16), where k is the smallest integer greater than or equal to C. A large $P_k$ also means a good performance of the method for protein function prediction.
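The order-wise accuracy and the average number of functions per protein can be computed as in the sketch below. It follows the verbal definitions given above (a hit at order j means the j-th ranked function is among the protein's true functions); the function names and the toy data with four categories are hypothetical, and the coverage likelihood for the first k orders is not computed here.

```python
import math

def order_accuracies(ranked_predictions, true_functions):
    """Fraction of proteins whose j-th ranked function is a true function.

    ranked_predictions : list of ranked function indices, one list per protein.
    true_functions     : list of sets of true function indices.
    """
    n = len(ranked_predictions)
    n_orders = len(ranked_predictions[0])
    accuracies = []
    for j in range(n_orders):
        hits = sum(
            1
            for ranks, truth in zip(ranked_predictions, true_functions)
            if ranks[j] in truth
        )
        accuracies.append(hits / n)
    return accuracies

def average_function_count(true_functions):
    """Average number of annotated functions per protein."""
    return sum(len(t) for t in true_functions) / len(true_functions)

# Hypothetical toy example: three proteins, four functional categories.
toy_ranked = [[3, 1, 2, 4], [2, 4, 1, 3], [1, 2, 3, 4]]
toy_truth = [{1, 3}, {4}, {1, 2, 3}]
toy_acc = order_accuracies(toy_ranked, toy_truth)
toy_c = average_function_count(toy_truth)
toy_k = math.ceil(toy_c)  # number of leading orders worth inspecting
```

Because each true function appears at exactly one order of a full ranking, the order-wise accuracies sum to the average number of functions per protein, which is a convenient sanity check on any implementation.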

Performance of network-based method
In this study, 9,093 proteins and 1,074 proteins were used to train and test the network-based method, respectively. The overall jackknife success rates on $S^{n}_{Tr}$ and $S^{n}_{Te}$ thus obtained for the first-order function were 75.9% and 78.2%, respectively. Shown in Table 3 are the accuracies of all the 24-order function predictions using the PPI network-based approach. For the training set $S^{n}_{Tr}$, in all cases a prediction of earlier order (e.g., the first order) is more accurate than one of later order, implying that the protein functions are nicely sorted by the prediction criteria. The average number of functions that a protein possesses is 3.25 according to Eq. (15). Therefore we only consider the first 4 (4 = [3.25] + 1) of the 24-order predictions. The probability that all true functions of a protein are included in the first 4-order predicted functions is 68.4% according to Eq. (16), indicating that the predictor performs quite well in predicting these functions.

Performance of hybrid-property based method
After the filtering procedure (see the section "Biochemical and physicochemical description of proteins"), the resulting $S^{h}_{Tr}$, comprising 2,842 proteins, and $S^{h}_{Te}$, comprising 246 proteins, were used to train and test the hybrid-property based method. Listed in Table 3 are the accuracies obtained by the jackknife test with the hybrid-property based method. The prediction accuracies of the first-order predicted function for $S^{h}_{Tr}$ and $S^{h}_{Te}$ were 47.2% and 35.4%, respectively, using 90 optimized hybrid features selected by the IFS procedure from a total of 132 features, as can be seen from the IFS curve in Figure 1. Details of these 90 features can be found in Table S5, and the distribution of the subtypes of protein hybrid properties among the 90 features is shown in Figure 2. For the training set $S^{h}_{Tr}$, the average number of functions that a protein possesses is 2.81. Thus the first 3 (3 = [2.81] + 1) of the 24-order predictions are considered. According to Eq. (16), the probability that all true functions of a protein are included in the first 3-order predicted functions is 44.1%, indicating that the predictor using hybrid properties performs fairly well.

The overall prediction
The overall prediction combines the results of predictions from both the network-based and the hybrid-property based methods. The accuracies of the first-order function prediction for the overall training set ($S^{n}_{Tr} + S^{h}_{Tr}$) and test set ($S^{n}_{Te} + S^{h}_{Te}$) were 69.1% and 70.2%, respectively. Results of the other-order function predictions are shown in Table 3. For the overall training set, the average number of functions that a protein possesses is 3.14. Here, we again consider only the first 4 (4 = [3.14] + 1) of the 24-order predictions. According to Eq. (16), the probability that all true functions of a protein are included in the first 4-order predicted functions is 65.2%, indicating that our method performs quite well for the entire prediction. Of the 24-order predicted functions generated by the method, the first 4-order predicted functions therefore deserve more attention than the others.

Comparison of network-based methods with STRING data and IntAct data
Besides the known PPIs, STRING also includes PPIs predicted by computational methods. Here we investigate whether using both the known and the predicted PPIs improves on the performance of a predictor that uses only the known PPIs. The known PPIs were downloaded from IntAct [56], in which high-quality molecular interactions are collected by curation of the published literature or from data depositions by expert curators. After mapping the IntAct data to MfunGD, we filtered out the proteins without function annotation. As a result, a PPI network consisting of 1,939 proteins and 6,543 known interactions was obtained. With this network, the jackknife test prediction accuracies of the network-based method for the 1,939 proteins are listed in Table 4. The prediction accuracies for the same 1,939 proteins with the STRING network ($S^{n}_{Tr}$) are also listed in Table 4. The first-order prediction accuracy with STRING data is 83.5%, 26 percentage points higher than the first-order prediction accuracy with IntAct data (57.5%). For the collected 1,939 proteins, the average number of functions that a protein possesses is 3.94. The probabilities that all true functions of a protein are included in the first 4-order predicted functions are 66.3% and 47.4% for STRING data and IntAct data, respectively. The comparison indicates that the network-based method performed better with the STRING PPIs than with the IntAct PPIs.
Owing to the limits of existing biotechnology, many PPIs are difficult to detect in the lab. Therefore, the existing PPI networks derived from experiments cover only a very small part of the total proteome. For example, IntAct [56] stores ~3,000 mouse proteins with PPI information, about 7% of the mouse proteome, and DIP [57] contains ~1,000 mouse proteins with PPI information, about 2% of the mouse proteome. STRING tries to integrate as much PPI data as possible, mainly from four sources: genomic context, high-throughput experiments, co-expression, and previous knowledge. STRING (version 8.0) gathers ~2.5 million proteins from 630 organisms. For mouse, STRING covers ~25% of the proteome. Therefore, the functions of more proteins can be predicted using the network-based method with STRING PPIs.
It should be pointed out that STRING data contain many predicted PPIs, which may lead to wrong classifications. To mitigate this problem, we used the interaction confidence score (i.e., the edge weight) in the network-based method as described above. According to STRING, a more reliable PPI is assigned a higher edge weight. By construction (Eq. (1) to Eq. (6)), the network-based method is therefore robust with respect to false PPIs. Nevertheless, predicted PPIs should be used very cautiously.

Comparison between the network-based method and hybrid-property based method
In this study, a network-based method and a hybrid-property based method were developed to predict the functions of proteins in mouse. To compare their performance, we also trained and tested the hybrid-property based method using $S^{n}_{Tr}$ and $S^{n}_{Te}$. The prediction results are listed in Table 3. The first-order prediction accuracies on $S^{n}_{Tr}$ and $S^{n}_{Te}$ are 57.1% and 42.7%, respectively, which are much lower than the prediction accuracies of the network-based method on the same training set and test set. For the training set $S^{n}_{Tr}$, the probability that all true functions of a protein are included in the first 4-order predicted functions is 58.4% according to Eq. (16), which is also lower than the 68.4% obtained with the network-based method. Therefore, the network-based method outperforms the hybrid-property based method.

Comparison between the hybrid-property based method and the motif-based method
As a sequence-based method, the hybrid-property based method should be compared with other sequence-based methods. We selected a method that predicts protein functions from motif information, which has proved effective for predicting the functions of proteins in yeast [58]. The motif-based method can be described as follows. First, 739 short domain sequences were downloaded from SBASE [59], a collection of domain sequences designed to facilitate the detection of domain homologies. Then BLASTP [7] was used to compare each protein with the 739 domain sequences to find the alignments with an e-value lower than 0.8. A protein sequence can thus be represented by the vector $V = [v_1, v_2, \ldots, v_{739}]$, where $v_i = 1$ ($i = 1, 2, \ldots, 739$) when the e-value of the alignment with the i-th domain sequence is lower than 0.8, and $v_i = 0$ otherwise. Using the NNA, the method was trained and tested on the same training set ($S^{h}_{Tr}$) and test set ($S^{h}_{Te}$). The prediction results are listed in Table 3. The first-order prediction accuracies on $S^{h}_{Tr}$ and $S^{h}_{Te}$ are 31.8% and 32.7%, respectively, which are lower than the prediction accuracies of the hybrid-property based method. For $S^{h}_{Tr}$, the probability that all true functions of a protein are included in the first 3-order predicted functions is 30.6%, which is also lower than the 44.1% obtained with the hybrid-property based method. Overall, the hybrid-property based method performs a little better than the motif-based method.

Biological relevance of the optimized hybrid features
It is shown in Figure 2 that amino acid composition and secondary structure contribute the most to protein function prediction. These protein properties have also been used for predicting many other protein attributes, such as classification of nuclear receptors, protein fold recognition [60], protein quaternary structure [61], membrane protein types [28], and protein folding rate [62,63], among many others. Amino acid compositions are reported to correlate with proteins' structural and biological characters [64,65]. Alteration of secondary structure is a common causative factor in human diseases [66,67,68], probably through altering protein functions. It has also been reported that alteration of the secondary structure of amyloid beta peptide relates to its neurotoxic activity in vitro [69,70]. These are just a few examples showing the importance of these protein properties in shaping protein functions. There are surely a number of other findings that validate their importance, as well as the importance of the other properties investigated in this study, such as polarity, normalized van der Waals volume, polarizability, hydrophobicity, and solvent accessibility.
In this study, we propose a novel multi-target model, in which a sample may belong to several classes, for predicting protein functions. Two kinds of multi-target predictors are implemented: one for proteins with PPI information and the other for those without PPI information. The average number of functions that a protein possesses is 3.14. There are 24 protein functional categories, meaning that on average a random guess of a protein function has a success chance of 13.1% (13.1% = 3.14/24), much lower than the first-order prediction accuracy of 69.1%. Therefore, our method can serve as a useful high-throughput tool for annotating the functions of many uncharacterized protein sequences. It is very interesting to see that the PPI network-based method is significantly better than the hybrid-property based method in both the first-order prediction accuracy and the probability calculated by Eq. (16). It is anticipated that the method based on PPI network information is quite promising and may become a powerful tool for annotating the functions of proteins.

Supporting Information
Table S1 Training set for network-based method. The Mfun ID and Functional number (see Table 1) of proteins are shown.