Multi-scale encoding of amino acid sequences for predicting protein interactions using gradient boosting decision tree

A number of computational approaches have been developed to predict protein interactions effectively and accurately. However, most of these methods perform worse when other biological data sources (e.g., protein structure information, protein domains, or gene neighborhood information) are not available. In the present work, we propose a method for predicting protein interactions that makes full use of the physicochemical characteristics of amino acids. A protein sequence is encoded at multiple scales by seven properties of amino acids, including their qualitative and quantitative descriptions. Five kinds of protein descriptors (frequency, composition, transition, distribution and auto covariance) are extracted from these encodings to represent each protein sequence. The resulting feature representation, consisting of 347 dimensions, captures not only the compositional and positional information but also the statistical significance of amino acids in the sequence. Based on this feature representation, the gradient boosting decision tree algorithm is used to predict the protein interaction class. When the proposed method is tested on the PPI data of S.cerevisiae, it achieves a prediction accuracy of 95.28% with a Matthew's correlation coefficient of 90.68%. Compared with state-of-the-art works on H.pylori and Human, the accuracies are raised to 89.27% and 98.00% respectively. Extensive experiments are performed on a crossover protein-protein interaction network and the prediction accuracies are also very promising. Because of the learning capabilities of the gradient boosting decision tree and the multi-scale feature representation scheme, the proposed method might be a useful tool for future proteomics studies.


Introduction
Protein-protein interactions (PPIs) play a key role in various biological functions such as DNA transcription, metabolic cycles and signaling cascades in cells. Therefore, identification of PPIs can provide great insight into protein functions and further biological processes [1]. With the development of proteomics, many experimental techniques have been developed, such as protein chip [2], tandem affinity purification (TAP) [3] and other high-throughput biological techniques [4]. However, PPI pairs identified by experimental approaches cover only a small fraction of the whole PPI networks [5]. In addition, these approaches have inherent disadvantages, such as being time-consuming and expensive and having a high false positive rate. Hence, there is a strong motivation to develop computational methods as an alternative for inferring PPIs efficiently and accurately [6,7]. A number of computational methods have been developed for the prediction of PPIs. However, the application of most existing methods is limited because they need information about protein homology or the interaction marks of the protein partners. Recently, much effort has been devoted to proposing machine learning approaches for detecting PPIs using protein sequences alone [8][9][10].
For predicting PPIs from sequences, one of the main computational challenges is to find a suitable way to fully describe the important information of PPIs. Shen et al. [8] used the conjoint triad method to extract features of protein sequences based on properties of amino acids. They classified the 20 amino acids into seven groups according to the dipoles and volumes of the side chains to reduce the dimensions of the vector space. The triad types of three continuous amino acids and their numerical values are fed into the feature vector space. Zhou [10] and Yang [11] divided the whole sequences into different local regions of varying length, then calculated three local descriptors (composition, transition and distribution) in each local region to describe multiple overlapping continuous and discontinuous interaction patterns in protein sequences. Guo et al. [9] used the auto covariance (AC) method to construct the feature vectors of protein sequences. It took neighboring effects into account and discovered patterns in entire sequences. Furthermore, there are several other kinds of feature representation methods, including Auto Cross Covariance (ACC) [9], Multi-scale Continuous and Discontinuous (MCD) [12], and Multi-scale Local Feature Representation (MLD) [13]. Fortunately, numerous recently developed web servers for extracting features from biological sequences, such as RepDNA [14], RepRNA [15] and Pse-in-One [16] for DNA, RNA and protein sequences respectively, make the procedure quick and effective.
Sample classification is another important issue for predicting PPIs computationally. Most current computational methods are based on traditional classifiers such as support vector machines [9,10,12] and random forests [13,17]. Although these classifiers have strong classification ability, they need much labor and time to tune their parameters for the best performance. Recently, the Gradient Boosting Decision Tree (GBDT) [18] classifier has been earning a reputation for its powerful classification performance. As an effective off-the-shelf method for generating models for classification and regression tasks, GBDT produces a prediction model in the form of an ensemble of weak prediction models, builds the model in a stage-wise fashion, and generalizes them by allowing optimization of an arbitrary differentiable loss function. Because the loss function can be chosen freely, GBDT is highly customizable to any particular data-driven task. Meanwhile, GB algorithms are relatively simple to implement, which allows one to experiment with different model designs. Thus GBDT algorithms have shown considerable success not only in practical applications [19], but also in various machine learning and data mining challenges [20].
In this paper, we present a computational approach for predicting PPIs by combining a multi-scale encoding representation of proteins and a gradient boosting decision tree classifier. First, physicochemical characteristics of amino acids, including their qualitative and quantitative attributes, are used to encode a protein sequence at multiple scales. Second, five kinds of protein descriptors (frequency, composition, transition, distribution and auto covariance) are extracted from these encodings to represent each protein sequence. A 347-dimensional vector is obtained for each protein sample after this transformation. Third, we concatenate the feature vectors of the two proteins in each pair into a 694-dimensional vector as the input for the classifier.
Finally, the gradient boosting decision tree algorithm is introduced to predict protein interaction class based on the multi-scale feature representation scheme.
In order to evaluate the performance of the proposed method, it is tested with the PPI data of S.cerevisiae. The prediction accuracy of 95.28% and Matthew's correlation coefficient of 90.68% are achieved. Compared with the state-of-the-art works on H.pylori and Human, the accuracies can be raised to 89.27% and 98.00% respectively. Extensive experiments are performed for a crossover protein-protein interaction network and the prediction accuracies are also very promising.

Results
In this section, we first evaluate the performance of the proposed method for predicting PPIs on three different PPI datasets (S.cerevisiae, H.pylori and Human) using different evaluation measures, including Matthew's correlation coefficient (MCC). Second, the prediction performances of three different feature representations (the qualitative characteristic feature, the quantitative characteristic feature, and the full feature set) are discussed. Third, we compare the classification performances of GBDT, Random Forest (RF) and Support Vector Machine (SVM) using the same feature vector representation. Furthermore, we compare the performance of the proposed method with existing methods. In addition, we present the results of experiments on a crossover protein-protein interaction network.

Data set
The PPI datasets from S.cerevisiae, H.pylori and Human are used to evaluate the performances.
All the datasets are downloaded from the existing works by You et al. [12], Martin et al. [21] and Huang et al. [22] respectively. The distributions of the golden positive and negative samples (GPS and GNS) are shown in Table 1.
It should be noted that sequence homology is an important problem for sequence-based predictors [23]. In the first dataset, all protein pairs that contain a protein with fewer than 50 residues or have ≥ 40% sequence identity have been removed. In the third dataset, protein pairs with ≥ 25% sequence identity have been removed. In the second dataset, the positive samples come from a proteome-wide experiment using two-hybrid measurements, and the negative samples were selected randomly. To test the generalizability of the models, sequence redundancy in this dataset was not considered.
In the first two datasets, the numbers of positive and negative samples are equal. In the third dataset, the number of positive samples is smaller than that of negative samples. We choose these balanced and unbalanced datasets to test the generalizability of our model.

Evaluation measures
To evaluate the performance of the proposed method, five-fold cross validation and several assessment measures are used in this study. These criteria include overall prediction accuracy (ACC), sensitivity (SN), positive predictive value (PPV), F-score (a weighted average of PPV and SN), and Matthew's correlation coefficient (MCC). They are defined in Eqs 1 to 5 in terms of the following counts: true positive (TP) is the number of true PPIs that are predicted correctly; false negative (FN) is the number of true PPIs that are predicted to be non-interacting pairs; false positive (FP) is the number of true non-interacting pairs that are predicted to be PPIs; and true negative (TN) is the number of true non-interacting pairs that are predicted correctly.
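The five measures can be illustrated with a short sketch. Eqs 1 to 5 are not reproduced in this text, so the code below assumes the standard textbook definitions of ACC, SN, PPV, F-score (taken here as the harmonic mean of PPV and SN) and MCC; the function name `ppi_metrics` and the zero-denominator handling are illustrative choices, not part of the original method.

```python
import math

def _safe_div(num, den):
    # Return 0.0 instead of raising when a denominator is zero (illustrative choice).
    return num / den if den else 0.0

def ppi_metrics(tp, fn, fp, tn):
    """Compute ACC, SN, PPV, F-score and MCC from confusion-matrix counts."""
    acc = _safe_div(tp + tn, tp + fn + fp + tn)
    sn = _safe_div(tp, tp + fn)            # sensitivity (recall)
    ppv = _safe_div(tp, tp + fp)           # positive predictive value (precision)
    f_score = _safe_div(2 * sn * ppv, sn + ppv)
    mcc = _safe_div(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {"ACC": acc, "SN": sn, "PPV": ppv, "F": f_score, "MCC": mcc}

if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    for name, value in ppi_metrics(tp=90, fn=10, fp=5, tn=95).items():
        print(f"{name}: {value:.4f}")
```

Because MCC uses all four counts, it remains informative on the unbalanced Human dataset, where accuracy alone can be dominated by the larger negative class.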

Prediction performances of the proposed method
The performance of the proposed approach is investigated using the PPI datasets of three species: S.cerevisiae, H.pylori and Human. To make the experimental results generalizable to new data, each dataset is randomly partitioned into training and testing sets via five-fold cross validation. Each of the five subsets acts as an independent holdout testing dataset for the model trained on the remaining four subsets. Thus five models are generated for each dataset, one for each of its five data splits. The prediction performances of the GBDT classifier with the full feature representation of protein sequences across the five runs are shown in Tables 2-4. The highest prediction accuracies on the three PPI datasets are 95.84%, 91.94% and 98.41% respectively. The average accuracies reach 95.28%, 89.27% and 98.00%. These results show that the performance of the proposed method is quite promising. To better investigate the prediction ability of our model, we also calculated the values of SN, PPV, and MCC. These values are close to or above 90% on the three datasets, which indicates the robustness of the method's prediction capability.
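The partitioning scheme described above can be sketched as follows. This is a minimal illustration of plain random five-fold splitting; whether the authors stratified the folds by class or used a particular random seed is not stated, so `seed` and the function name `five_fold_splits` are assumptions.

```python
import random

def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for each fold of a random partition."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal subsets.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

if __name__ == "__main__":
    for fold, (train_idx, test_idx) in enumerate(five_fold_splits(23), start=1):
        print(f"fold {fold}: {len(train_idx)} training pairs, {len(test_idx)} test pairs")
```

Each sample appears in exactly one test subset, so the five reported accuracies are computed on disjoint holdout sets.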
To further investigate the performance on the different numbers of positive and negative samples, we analyze the standard variances of the prediction accuracies on three datasets. The

Prediction performances with different features
In order to understand the contribution of the qualitative characteristic (QLC) and quantitative characteristic (QNC) features, experiments are performed on the three datasets using three kinds of feature components: QLC, QNC and QLC+QNC. Five-fold cross validation is again used to evaluate the performance.
The results are shown in Tables 5-7. On the S.cerevisiae dataset, QLC and QNC demonstrate similar performance, and their combination outperforms both in all the performance measures, especially in F-score, which improves by 3.26%. Moreover, its MCC and SN are raised by at least 1.22% and 1.52% respectively. Similar results hold on the H.pylori and Human datasets. These comparative experiments show that QLC and QNC play similar roles in the prediction of PPIs and are to some extent complementary, and that their combination significantly improves the performance.

Comparison of the prediction performance with different classifiers
Here we investigate whether or not the GBDT classifiers can significantly improve the performance of PPI prediction compared against other classifiers. SVM and Random Forest are two commonly used classifiers for predicting protein interactions. We compare the classification performance between SVM, Random Forest and GBDT using the same features. Figs 1-4 plot accuracy, sensitivity, F-score and MCC value for the three classifiers.
As shown in Table 8, GBDT outperforms the other two classifiers on the three datasets in terms of all the assessment measures. Compared with SVM on the H.pylori dataset, the prediction

Comparison of the prediction performance with existing methods
In order to highlight the advantage of our method, we compare its prediction ability with state-of-the-art methods on the PPI data of H.pylori and S.cerevisiae. These methods include Ding et al. [17,24], You et al. [12], Wong et al. [25], Guo et al. [9], and Zhou et al. [10]. The features, feature extraction methods and classifiers used in these methods are shown in Tables 9 and 10. On the S.cerevisiae dataset, the prediction accuracy of our model is nearly 1% higher than that of the best existing method, with the highest MCC and slightly lower values of SN and PPV. On the H.pylori dataset, our model also obtains the best prediction accuracy, with values of SN, PPV, and MCC similar to those of the best methods. These experimental results demonstrate that our model outperforms the previous methods in prediction accuracy on a couple of PPI datasets.

Prediction performance on a real Wnt-related network
The most useful application of PPI prediction is to build a biologically meaningful PPI network. To test the generalizability of our method, the model trained on the S.cerevisiae dataset is applied to a real Wnt-related network produced by Ulrich et al. [26]. It predicts 87 interactions among all the 96 PPI pairs, see Fig 5 (the red line indicates a false prediction). Compared with the 73 interactions predicted by Shen's method [8], the accuracy of our method is higher by 14.58 percentage points.
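The reported improvement follows directly from the counts given above; the worked arithmetic below uses only those counts (87 and 73 correct predictions out of 96 pairs).

```latex
\[
\frac{87}{96} = 90.625\%, \qquad
\frac{73}{96} \approx 76.04\%, \qquad
\frac{87 - 73}{96} = \frac{14}{96} \approx 14.58\%.
\]
```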
Our result is also compared against Ding's work [24], which predicts 91 interactions in this network. We find that the false predictions of the two methods are completely different. Our 9 false predictions connect 10 proteins, whereas their 5 false predictions connect 9 proteins. These slight differences might suggest that the two methods are suited to different situations.
To further explore the false predictions, we find that three proteins (FZD10, WNT9A and WNT4) are often predicted incorrectly across different runs. FZD10 is a receptor for Wnt proteins, which may be involved in transduction and intercellular transmission of polarity information. WNT9A and WNT4 are ligands for members of the frizzled family of seven transmembrane receptors, which are likely to signal over only a few cell diameters. We hypothesize that the weak interaction signal between such short-range signaling proteins and other proteins leads to the poor prediction performance.

Discussions
It should be noted that high-dimensional data might cause over-fitting, information redundancy and the curse of dimensionality, which can lead to overestimated performance and reduce the generalization ability of a predictor [27]. To exclude noise or redundant information, Yang et al. and Chen et al. [28,29] employed ANOVA, and Zhao et al. [30] used the mRMR program to further optimize the feature set. A series of feature sets of various sizes were obtained based on the IFS strategy. GBDT, however, is an additive model that minimizes the loss function by combining weak classifiers. With this model, the individual classifiers do not need to be particularly complex.
On the contrary, simple classifiers tend to work best for avoiding overfitting. Furthermore, the optimal number of iterations and the total number of leaves are often selected by monitoring the prediction error. Moreover, GBDT selects features in the form of an ensemble of decision trees.
As shown in a series of recent publications, in addition to a predictor's high accuracy, it is also very important to make its web-server available so that users can easily get the results without the need to go through the mathematical details [31][32][33][34][35]. Only then can it be widely used by most experimental scientists [36]. All the source code is available on GitHub (https://github.com/lovekeyczw/zhouchang/). We shall make efforts in our future work to provide a web-server for the method reported in this paper.

Methods
This section describes the proposed approach for predicting protein interactions from primary sequences alone. It consists mainly of three steps (see Fig 6): (1) Encode a protein sequence by qualitative and quantitative characteristics of the amino acids in the sequence. (2) Extract features by five protein sequence descriptors. (3) Feed the feature vectors into the gradient boosting decision tree classifier for predicting PPIs.

Encoding of protein sequences
The proposed encoding model of protein sequences is based mainly on the assumption that whether two proteins interact can be greatly influenced by their physicochemical characteristics, such as the residues' hydrophobicity and polarizability. The descriptions of these properties can be qualitative or quantitative. For the first case, seven properties (hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, secondary structure, and solvent accessibility) are used, and for each property the amino acids are divided into three groups, see Table 11. For the second case, six properties (hydrophobicity (H), volumes of side chains of amino acids (VSC), polarity (P1), polarizability (P2), solvent-accessible surface area (SASA) and net charge index of side chains (NCISC)) are used, see Table 12.
Definition 1 A protein sequence $S = s_1 s_2 \cdots s_n$ is encoded by a property $P = \{p_1, p_2, \ldots, p_k\}$ if each $s_i \in S$ is replaced by the value $p_j \in P$ of its corresponding property.
Finally, for a given protein sequence, there are 13 kinds of encodings in total (seven qualitative and six quantitative).
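Definition 1 amounts to a per-residue lookup, which the following sketch illustrates for one qualitative property. The three-group hydrophobicity split used here is the grouping commonly used with composition/transition/distribution descriptors and is an assumption; Table 11 of the paper is the authoritative source, and the names `HYDROPHOBICITY_GROUPS`, `build_lookup` and `encode` are hypothetical.

```python
# Assumed three-group hydrophobicity split (common CTD grouping);
# the paper's Table 11 defines the groups actually used.
HYDROPHOBICITY_GROUPS = {
    1: "RKEDQN",   # polar
    2: "GASTPHY",  # neutral
    3: "CLVIMFW",  # hydrophobic
}

def build_lookup(groups):
    """Map each amino acid letter to the label of the group containing it."""
    return {aa: label for label, members in groups.items() for aa in members}

def encode(sequence, lookup):
    """Replace every residue s_i by its property value p_j (Definition 1)."""
    return [lookup[aa] for aa in sequence]

if __name__ == "__main__":
    lookup = build_lookup(HYDROPHOBICITY_GROUPS)
    print(encode("MKVLAAGIW", lookup))
```

The same `encode` function would serve a quantitative property if `lookup` mapped each amino acid to a numerical value from Table 12 instead of a group label.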

Extraction of feature vectors
After protein sequences are encoded, feature extraction, which aims at mining useful information from these encodings and representing it as fixed-length feature vectors, is a crucial step for predicting protein interactions. In this study, five kinds of protein descriptors (amino acid frequency, composition, transition, distribution and auto covariance) are extracted to form the feature vector of a protein sequence.
The frequency of a particular amino acid in a protein sequence can be calculated directly from the sequence. There are 20 dimensions for this descriptor.
The composition (C), transition (T) and distribution (D) descriptors are employed to describe the global composition of each of the qualitative properties.
C is the number of amino acids of a particular property divided by the total number of amino acids in a protein sequence. T characterizes the percent frequency with which amino acids of a particular property are followed by amino acids of a different property. D measures the chain length within which the first, 25%, 50%, 75%, and 100% of the amino acids of a particular property are located, respectively. For each qualitative property of a protein sequence, C, T and D produce 3, 3 and 15 dimensions of features respectively. There are 7 × (3 + 3 + 15) = 147 dimensions of features for the seven qualitative properties.
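The C, T and D descriptors for a single qualitative property can be sketched as below, taking as input a sequence already encoded as group labels 1 to 3. The exact rounding rule for locating the 25%, 50% and 75% residues is not given in the text, so the ceiling-based rule here is an assumption, as is returning zeros for a group that does not occur; the function name `ctd` is hypothetical.

```python
import math
from itertools import combinations

def ctd(encoded, groups=(1, 2, 3)):
    """Return the 3 + 3 + 15 = 21 CTD features of one group-encoded sequence."""
    n = len(encoded)
    # Composition: fraction of residues in each group.
    comp = [encoded.count(g) / n for g in groups]
    # Transition: fraction of adjacent positions that switch between two given groups.
    trans = []
    for a, b in combinations(groups, 2):
        switches = sum(1 for x, y in zip(encoded, encoded[1:]) if {x, y} == {a, b})
        trans.append(switches / (n - 1))
    # Distribution: relative positions of the first, 25%, 50%, 75% and 100% residue of each group.
    dist = []
    for g in groups:
        positions = [i + 1 for i, x in enumerate(encoded) if x == g]
        if not positions:
            dist.extend([0.0] * 5)  # assumed convention for an absent group
            continue
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            k = max(1, math.ceil(frac * len(positions)))  # assumed rounding rule
            dist.append(positions[k - 1] / n)
    return comp + trans + dist

if __name__ == "__main__":
    features = ctd([1, 1, 2, 3, 3, 2, 1, 2])
    print(len(features), features)
```

Applying this to each of the seven qualitative encodings and concatenating the results gives the 7 × 21 = 147 dimensions stated above.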
The features obtained from frequency, composition, transition and distribution (20 + 147 = 167 dimensions) are called the qualitative characteristic features (QLC features) of the protein sequence in this study.
Auto covariance (AC) [9] describes the statistical correlation between amino acids within a specific distance of each other. It accounts for the interactions between amino acids that are a certain number of residues apart in the sequence.
For each of the six quantitative properties of amino acids in the sequence, the values of the corresponding encodings are normalized to zero mean and unit standard deviation according to Eq 6: $P'_{i,j} = (P_{i,j} - \bar{P}_j) / S_j$, where $P_{i,j}$ is the value of the j-th property for the i-th amino acid, $\bar{P}_j$ is the mean of the j-th property over the 20 amino acids, $S_j$ is the corresponding standard deviation and j is the index of the six quantitative properties.
Table 11. Seven physicochemical properties for 20 amino acid types (columns: Amino acid, Group1, Group2, Group3).
The AC value of each normalized property is then computed for every lag, where n is the length of the sequence and lag is the distance between the i-th and (i + lag)-th residues of the sequence. The lag ranges from 1 to max, with $max \in [1..n-1]$. Ding et al. [24] showed that a max below 30 loses some of the useful features, while a larger value may introduce noise. Thus, max is set to 30 in this study. The number of AC values for each quantitative property is 30. Finally, the AC descriptor produces 6 × 30 = 180 dimensions of features. We call these features the quantitative characteristic features (QNC features).
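The normalization and AC steps can be sketched as follows. The AC formula itself is not reproduced in this text, so the code assumes the commonly used form from the auto covariance literature, AC(lag) = 1/(n − lag) · Σ (X_i − mean)(X_{i+lag} − mean), with the mean taken over the whole encoded sequence; using the population standard deviation in the normalization is likewise an assumption, and the function names are hypothetical.

```python
import statistics

def normalize_scale(scale):
    """Standardize a property scale over the amino acids to zero mean and unit std."""
    values = list(scale.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    return {aa: (v - mean) / std for aa, v in scale.items()}

def auto_covariance(values, max_lag=30):
    """Return AC values for lag = 1..max_lag of one numerically encoded sequence.

    Requires len(values) > max_lag.
    """
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]

if __name__ == "__main__":
    # Toy numeric encoding, used only to show the call pattern.
    print(auto_covariance([1.0, 2.0, 3.0, 4.0], max_lag=3))
```

With `max_lag=30` and six quantitative encodings per protein, concatenating the six outputs yields the 180 QNC dimensions. Sequences no longer than 30 residues would need separate handling; the 50-residue minimum applied to the first dataset avoids this case there.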
The QLC and QNC features are directly combined to represent a protein sequence. For a pair of proteins, the feature space consists of 694 dimensions.
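The dimension bookkeeping can be checked with a small sketch. The concatenation order within a protein vector and the order of the two proteins in a pair are not specified in the text and are assumptions here; `protein_vector` and `pair_vector` are hypothetical helper names.

```python
def protein_vector(frequency, ctd_features, ac_features):
    """Concatenate the descriptors of one protein: 20 + 147 + 180 = 347 dimensions."""
    vec = list(frequency) + list(ctd_features) + list(ac_features)
    if len(vec) != 20 + 147 + 180:
        raise ValueError(f"expected 347 features, got {len(vec)}")
    return vec

def pair_vector(vec_a, vec_b):
    """Concatenate the vectors of two proteins into one 694-dimensional sample."""
    return list(vec_a) + list(vec_b)

if __name__ == "__main__":
    # Placeholder zero features, used only to verify the dimensions.
    v = protein_vector([0.0] * 20, [0.0] * 147, [0.0] * 180)
    print(len(v), len(pair_vector(v, v)))
```

Since simple concatenation is order-dependent, the pairs (A, B) and (B, A) produce different input vectors; how the original implementation treats this is not described.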

Gradient boosting decision tree
As a machine learning technique for regression and classification problems, Gradient Boosting (GB) produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. Unlike common ensemble techniques such as AdaBoost and random forests, the learning procedure in GB consecutively fits new models to provide a more accurate estimate of the response variable. The principal idea behind this algorithm is to build the new base learners to be maximally correlated with the negative gradient of the loss function associated with the whole ensemble.
Suppose that there are N training examples $\{(x_1, y_1), \ldots, (x_N, y_N)\}$. The gradient boosting decision tree (GBDT) model estimates the function of the input variable x by a linear combination of the individual decision trees, see Eq 8.
In Eq 8, $T(x; \Theta_m)$ is the m-th decision tree, $\Theta_m$ is its parameter, and M is the number of decision trees. The GBDT algorithm calculates the final estimate in a forward stage-wise fashion. Supposing that the initial model of x is $f_0(x)$, the model at step m can be obtained by Eq 9.
In Eq 9, $f_{m-1}(x)$ is the model at step m − 1. The parameter $\Theta_m$ is learned by the principle of empirical risk minimization in Eq 10.
In Eq 10, L is the loss function.
Because of the assumption of linear additivity of the base functions, our goal becomes estimating the $\Theta_m$ that best fits the residual $L(y - f_{m-1}(x))$. To this end, the negative gradient of the loss function at $f_{m-1}$ is used to approximate the residual (Eq 11).
In Eq 11, i is the index of the i-th example. Finally, we train a decision tree model on all the $R_{mi}$, $i \in [1..N]$, to estimate the parameter $\Theta_m$. The parameter of a decision tree model is used to partition the space of input variables into homogeneous rectangular areas by a tree-based rule system. Each tree split corresponds to an if-then rule over some input variables. This structure of a decision tree naturally models the interactions between predictor variables. If the parameter maps the input space X into J disjoint regions $R_1, \cdots, R_J$, and the output is $c_j$ for each region $R_j$, then the tree T can be written as Eq 12.
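Eqs 8 to 12 are not reproduced in this text. For reference, the standard forms consistent with the symbols defined above are given below; they are a reconstruction under that assumption and may differ in notation from the published equations.

```latex
\begin{align}
f_M(x) &= \sum_{m=1}^{M} T(x; \Theta_m) \tag{8} \\
f_m(x) &= f_{m-1}(x) + T(x; \Theta_m) \tag{9} \\
\hat{\Theta}_m &= \arg\min_{\Theta_m} \sum_{i=1}^{N} L\bigl(y_i,\; f_{m-1}(x_i) + T(x_i; \Theta_m)\bigr) \tag{10} \\
R_{mi} &= -\left[ \frac{\partial L\bigl(y_i, f(x_i)\bigr)}{\partial f(x_i)} \right]_{f = f_{m-1}} \tag{11} \\
T(x; \Theta) &= \sum_{j=1}^{J} c_j \, I(x \in R_j) \tag{12}
\end{align}
```

Here $I(\cdot)$ is the indicator function. For the squared loss $L(y, f) = \tfrac{1}{2}(y - f)^2$, Eq 11 reduces to the ordinary residual $R_{mi} = y_i - f_{m-1}(x_i)$.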
To summarize, the complete form of the GBDT algorithm is given in Algorithm 1.
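Algorithm 1 is not reproduced in this text. The following is a minimal sketch of the forward stage-wise procedure described above, under several assumptions that are not taken from the paper: squared loss (so the negative gradient is the plain residual), depth-one trees (stumps) as base learners, and a shrinkage factor `learning_rate`. All function names are hypothetical, and the loss, tree depth and hyperparameters used by the authors are not stated in this section.

```python
def fit_stump(X, residuals):
    """Fit a one-split regression tree (J = 2 regions) to the residuals."""
    best = None
    for f in range(len(X[0])):
        for threshold in sorted({row[f] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[f] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[f] > threshold]
            if not left or not right:
                continue
            c_left, c_right = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - c_left) ** 2 for r in left) + sum((r - c_right) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, f, threshold, c_left, c_right)
    if best is None:  # no valid split: fall back to a constant tree
        c = sum(residuals) / len(residuals)
        return (0, float("inf"), c, c)
    return best[1:]

def predict_stump(stump, row):
    f, threshold, c_left, c_right = stump
    return c_left if row[f] <= threshold else c_right

def fit_gbdt(X, y, n_trees=20, learning_rate=0.5):
    """Forward stage-wise boosting with squared loss."""
    f0 = sum(y) / len(y)                 # initial model f_0(x): the mean label
    preds = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        # Negative gradient of 1/2 (y - f)^2 at f_{m-1}: the residual y - f_{m-1}(x).
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        # f_m(x) = f_{m-1}(x) + learning_rate * T(x; Theta_m)
        preds = [p + learning_rate * predict_stump(stump, row) for p, row in zip(preds, X)]
    return f0, stumps, learning_rate

def predict_gbdt(model, row):
    f0, stumps, learning_rate = model
    return f0 + learning_rate * sum(predict_stump(s, row) for s in stumps)

if __name__ == "__main__":
    # Toy one-feature data: label 1 stands for an interacting pair, 0 for a non-interacting pair.
    X = [[0.0], [1.0], [2.0], [3.0]]
    y = [0, 0, 1, 1]
    model = fit_gbdt(X, y)
    print([round(predict_gbdt(model, row), 3) for row in X])
```

In this sketch a pair would be called interacting when the predicted score exceeds 0.5. A classifier for real PPI data would more typically use a logistic loss and deeper trees, which changes only the residual computation and the base learner.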

Conclusion
In this paper, we develop an efficient model for predicting PPIs by combining the GBDT classifier with a multi-scale encoding of protein sequences based on the qualitative and quantitative characteristics of amino acids. The multi-scale encoding scheme is able to capture not only the compositional and positional information but also the statistical significance of amino acids in the sequence. The highly customizable GBDT classifier makes the prediction more flexible and robust. Experimental results show that the proposed method performs well on both balanced and unbalanced PPI datasets, and that the GBDT classifier outperforms the other classifiers. Comparative experiments demonstrate that the proposed approach outperforms the previous methods in prediction accuracy on a couple of PPI datasets.