PROTS-RF: A Robust Model for Predicting Mutation-Induced Protein Stability Changes

The ability to improve protein thermostability via protein engineering is of great scientific interest and also has significant practical value. In this report we present PROTS-RF, a robust model based on the Random Forest algorithm capable of predicting thermostability changes induced by not only single-, but also double- or multiple-point mutations. The model is built using 41 features including evolutionary information, secondary structure, solvent accessibility and a set of fragment-based features. It achieves accuracies of 0.799,0.782, 0.787, and areas under receiver operating characteristic (ROC) curves of 0.873, 0.868 and 0.862 for single-, double- and multiple- point mutation datasets, respectively. Contrary to previous suggestions, our results clearly demonstrate that a robust predictive model trained for predicting single point mutation induced thermostability changes can be capable of predicting double and multiple point mutations. It also shows high levels of robustness in the tests using hypothetical reverse mutations. We demonstrate that testing datasets created based on physical principles can be highly useful for testing the robustness of predictive models.


Introduction
The ability to improve protein thermostability via protein engineering is of great scientific interest and has significant practical value because many native proteins are only marginally stable under normal physiological and storage conditions [1][2][3][4][5][6][7][8]. For example, protein-based pharmaceuticals are often vulnerable to degradation that may affect their potency and even safety [9]. In addition, stable proteins are highly desirable in many biotechnological applications including biopharmaceuticals, biomaterials, and biofuel, etc. [7,8]. Enzymes with enhanced stability allow catalyzed reactions to be performed at higher temperatures, which often lead to more efficient industrial processes.
Computational methods for designing proteins with enhanced thermostability can be advantageous over conventional approaches because of their potential low cost and time-saving properties [10]. Existing computational approaches use either force-fields [2,[11][12][13][14][15][16] or data mining technologies [17][18][19][20][21][22][23][24][25]. The former require high-resolution 3D structures and are often highly computer-intensive. Consequently, in recent years, data mining technologies employing various machine learning algorithms have increasingly attracted attention. The general procedure of machine learning approaches is to train predictive models based on available experimental data using features (properties) such as substitution types, secondary structures, solvent accessibility, and the amino acid composition of neighboring residues. Many algorithms including support vector machines [17][18][19][20], neuronal networks [21], and multiple regression and classification techniques [22,23], have been used for predicting protein stability changes induced by mutations. The machine learning approaches hold great promises because they may be used to discover subtle patterns governing mutation induced stability changes and protein stability in general. However, recently we discovered that some of these types of methods may suffer from the over-fitting problem when hypothetical reverse mutations were used to test the robustness of these methods [26].
Usually protein stability changes upon mutations are experimentally measured through changes in the melting temperature (DTm) or alterations of folding free energies (DDG) between wild type proteins and their mutants. Existing protein stability predictors use one or the other as the metric for stability changes. Because both metrics are thermodynamic parameters and thus also state functions [27], the DDG (or DTm) of a mutation from a wild type protein to its mutant (WT2.MT) equals the negated DDG(or DTm)of a hypothetical reverse mutation (MT2.WT), i.e., Our tests revealed that the tested methods lost predictive ability considerably when hypothetical reverse mutations were used to evaluate the robustness of these methods [26]. Our findings are consistent to the comprehensive analysis conducted by Khan and Vihinen recently. They evaluated and compared 11 online stability predictors and found that ''at best, the predictions were only moderately accurate (,60%)'' [28]. Thus, effective and   robust computational algorithms for predicting mutation induced protein stability change are still in critical demand. In addition, most existing algorithms were developed for predicting thermostability changes of single-point mutations, despite the fact that the ability to predict protein stability changes upon multiple point mutations is also important because stabilization induced by single mutation may not be sufficient for practical applications of a protein. Only in recent years a few studies have been focused on multiple-mutation induced thermostability changes. For example, Huang and Gromiha proposed a predictive model named WET, a weighted decision table method for predicting protein thermostability change upon double mutation from amino acid sequences [23]. The model was built and tested on a set of 180 double point mutations. The correlation coefficient of the predicted and experimental DDG reached 0.75 and the overall accuracy was 82.2% in the 10-fold cross validation test [23]. However, the accuracy drops to 0.57 when it is tested on the hypothetical reverse mutations (see details in the results).
In this work, we attempt to develop a robust algorithm that can treat free energy as a thermodynamic parameter for predicting not only single-, but also multiple-point mutation induced thermostability change. A prerequisite for such a model is a set of suitable features relevant to the protein stability. We use several types of features for this study. The first type of features is the evolutionary information extracted from the target proteins since the ''survival of the fittest'' principle may be also applicable to protein thermostability. In fact, a concept of evolutionary pseudo free energy upon mutations was introduced and was found to have statistically significant correlations with protein thermostability changes [29]. Other features include secondary structures and solvent accessibility, either assigned based on structures or predicted by PSIPRED [30], depending on the availability of structures. In addition, we include features that we previously developed in ThermoRank [31], and a set of fragment-based thermostability terms [26].
In the following sections, we firstly describe the mutation datasets and the features used in the study, and the Random Forest algorithm for constructing the predictive model, PROTS-RF (PROtein Thermostability Random Forest model). We then present the results from cross validation on a single-point mutation dataset and benchmark tests on a set of double-point mutations and a set of multiple point mutations. We test the robustness of the predictive model using hypothetical reverse mutations. We also present a comparison of PROTS-RF to several other relevant potentials or algorithms. In all cases, PROTS-RF delivers better performance than other algorithms. Conclusions and prospects will be presented in the end of the report.

Mutation datasets
Three mutation datasets are used in this work. The first dataset was originally collected by Potapov et al. [32]. It contains 2,156 single point mutations (D2156) with experimentally determined changes of folding free energies (DDG). These mutants are derivatives from 84 wild-type proteins. We cluster these proteins using Blastclust [33] with 30% sequence identity and then group these clusters into 5 portions with each having a similar number of mutations. Therefore, proteins from different portions share 30% or less sequence identity. These five groups are then used in a standard five-fold cross validation (CV). The second dataset includes 180 double point mutations (D180) from 27 wild-type proteins with DDG values, was collected by Huang and Gromiha [23]. The final dataset contains 141 multiple point mutations Table 1. Cont.

216
The entropy difference from the occurrence of Delaunay four-residue fragments with at least 3 sequentially continuous residues, only 2 continuous residues and four non-neighboring residues, respectively. (D141) from 19 different wild type proteins which were collected from ProTherm database [34]. For each mutation in the all three datasets, a corresponding hypothetical reverse mutation (i.e. WT2.MT) is created by swapping the wild-type protein and its mutant involved in the mutation. The free energy change during a hypothetical reverse mutation has the same value but opposite sign to that of the experimental forward mutation (Eq. 1). The hypothetical reverse mutations are grouped in the same fold as their corresponding mutations in the cross validation test. Therefore another benefit of using hypothetical reverse mutations is that the dataset is now perfectly balanced.

Features
We assemble a set of 41 sequential and structural features. These features are carefully selected so that the free energy can be treated as thermodynamic parameters. The name and description of each feature is available in Table 1. These features can be classified into the following four groups: 1. Evolutionary information (10 features). PSIBLAST is used to search the wild type proteins against the NCBI nonredundant (NR) protein database pre-filtered by sequence identity of 90% [33]. We consider the log-odds and weighted scores of the wild type residues and mutant residues, as well as the conservation of wild-type residues and neighboring residues in a window centered in the mutation site. We use three different window sizes: 5, 9 or 15. The log-odds and the weighted scores are directly extracted from the position specific scoring matrices (PSSMs) for single point mutations. For multiple point mutations, the averages of these values are used instead. Overall, ten parameters are generated to record the evolutionary information for each singleor multiple-point mutation.
2. Secondary structure and solvent accessibility (5 features). We assign secondary structure and solvent exposure status of each residue based on the wild-type proteins. If the structure of a wild-type protein is available, we use DSSP [35] to assign the secondary structures of all residues to three states: helix (H), extend (E) and coil (C); and solvent accessibility to exposed (e) or buried (b) using 25% relative accessible surface area as the threshold. We assume that the mutations do not significantly change the conformation of the protein and therefore the secondary structure and the solvent accessibility of wild-type and mutant remain the same.
3. Relative difference (6 features). We also utilize six relative differences of compositions and properties between the wild-type and the mutant sequences including the change of positive charged residues, charged residues, small residues, tiny   residues, maximum area of solvent accessibility (ASA) and the isoelectric point (pIa). These features were identified and used to build a model for discriminating thermophilic proteins from their mesophilic homologs [31].
4. PROTS terms (20 for structure-based model or 13 for sequence-based model). PROTS is a protein stability potential derived from a comparative study between a large set of thermophilic and mesophilic proteins and a set of point mutations with measurements of mutation induced the change of melting temperature [26]. There are 20 features in this category, including 13 sequential features and 7 Delaunay Tetrahedron (DT) based spatial features if the protein structure is available. The sequential features are used for all models but the Delaunay Tetrahedron based features are only used for structure-based models.

Random Forest algorithm (RF)
Predictive models are built using the Random Forest algorithm (RF) [36], an ensemble technique utilizing hundreds or thousands of independent decision trees to perform classification or regression. Each of the member trees is built on a bootstrap sample from the training data using a random subset of available variables. The algorithm is a state-of-the-art machine learning method and has been successfully used to build many predictive models [37][38][39][40][41]. Unlike many other competitive machine learning algorithms such as support vector machine, RF does not require fine-tuning parameters because using the default values of the parameters often results in near-optimal performance. Moreover, the predicting time for a RF model is often a small fraction of that for a corresponding support vector machine (SVM) model [39]. Another advantage of RF is that it provides several variable importance measures [40,41]. It is particularly suitable for mining high-dimensional and noisy data. In this study, we use an R implementation of the Random Forest algorithm to construct the predictive model in regression manner [42]. The predicted free energy changes are then used to calculate the accuracy of the predictions using zero change as the threshold for classification.

Algorithms used for comparison
We compare PROTS-RF to a variety of methods including several top-ranked ones in a recent comprehensive evaluation of protein stability predictors [28]. LSE is a local structure entropy derived from representative protein structures and has shown a strong correlation with protein thermostability [12]. MUpro is a support vector machine (SVM) based predictor at sequence level for the variation of folding free energy (DDG) upon point mutations [18]. I-Mutant2.0 is a SVM based predictor using structure and sequence information for DDG prediction [17]. Both EGAD [13] and FoldX [11] are force fields parameterized on a large set of point mutations with experimentally determined stability changes.

Evaluation parameters
We use several metrics to measure the performance of the predictive models. The first is accuracy, which is defined as the ratio of the number of correctly predicted mutations in stabilizing or destabilizing of wild type proteins against the total number of predicted mutations. The second is the area under receiver operating characteristic curve (ROC), known as AUC. It should be pointed out that AUC can be a misleading parameter in some situations and therefore the AUC results should be interpreted with caution [43,44]. We provide AUC for comparison purposes

Statistical analysis of the single mutation dataset
We analyze the statistical distributions of features used in the study. We use the Kolmogorov-Smirnov test for normality find that none but one of the features are normally distributed. We calculate the medium, the mean, and the p-value of the Kolmogorov-Smirnov test for each feature's distributions in stabilizing vs. destabilizing mutations (Table 1). We also generate boxplots to illustrate the distributions of features of stabilizing and destabilization mutations ( Figure S1). The results presented in Table 1 clearly show that the distributions of a number of features are significantly different in stabilizing and destabilizing mutations. For example, mutations occurring in sheets are more likely to be destabilizing (p-value: 2.5610 24 ). Mutations on buried residues are more likely destabilization than stabilization (p-value: 6.0610 215 ), which can be explained by the fact that the protein cores are tightly packed and thus it is difficult to further optimize the interactions within the cores [45].

Cross validation and model training
We use an R implementation of the Random Forest algorithm to build models. Each model in the five-fold cross validation comprises 2,000 decision trees. The importance of a feature is estimated using the sum of the impurity increase over all trees induced by the feature in the model [36]. The average and standard error of the importance of the 41 features in structurebased prediction and the 34 features in sequence-based prediction are shown in Figure 1. The results clearly show that the PROTS features and the evolutionary information are strongly correlated with protein stability.
The results from all five test datasets in the cross validation are combined. The data from actual experimental and hypothetical mutation are separated and fitted to the experimental data, discretely (Table 2). For the experimental mutations, the Pearson correlation coefficients (R) are 0.628 for the structure-based predictions and 0.620 for the sequence-based predictions ( Table 2). We then use various DDG values as cutoff thresholds to classify mutations as stabilizing and destabilizing and calculate the areas under receiver operating characteristic (ROC) curves. We find the areas under ROC (AUC) reach 0.873 and 0.869 for structure and sequence-based predictions, respectively. Very similar R and AUC are obtained for the hypothetical reverse mutations (Table 2). This result demonstrates that the predictive model is quite robust.
The model constructed in this work yields comparatively more reliable predictions than other tested models ( Table 2). Machine learning based algorithms MUPro and I-mutant2.0 perform poorly for the hypothetical reverse mutations because the AUCs are only slightly higher than 0.5, the level of random selection. The models based on force-fields or potentials such as LSE, FoldX and EGAD can treat temperature and free energy as thermodynamic parameters. The performance of these tested algorithms in the study, nevertheless, are not as good as the PROTS-RF. Besides, PROTS-RF performs better than PROTS, a fragmentbased protein thermostability potential we recently developed [26].
We then build the final structure-and sequence-based models using all the 2,156 point mutations and test these models using double-and multiple-point mutations.

Blind test on double point mutation dataset D180
In the blind test on the 180 double-point mutations, the regression of prediction against experimentally measured DDG values results in correlation coefficients of 0.775 and 0.755 for structure and sequence-based predictions respectively, and the classification achieves AUCs of 0.868 and 0.869 (Table 3 and Figure 2). The predictions on the experimental data are similar to a previous reported model WET [23], in which the authors achieved correlation coefficients up to 0.75 and the AUC up to 0.87 in 10-fold cross validation tests using a weighted decision table method. However, PROTS-RF achieves very similar results for the hypothetical reverse mutations (0.863 and 0.868 respectively), while the WET model provided by Huang et al. [23] delivers an AUC of 0.518 and R of 0.110, a strong indication for the existence of an over-fitting problem with the model.
Huang et al suggested that the methods developed for predicting protein stability change upon single point mutations may not be suitable for predicting the stability change upon double point mutations because the thermostability changes are not always additive [23]. Our results, nevertheless, have clearly indicated that a predictive model trained from single point mutations may still be capable of predicting double point mutations induced by protein stability changes. Some features used in our models, especially PROTS terms, reflect the surrounding environment of the mutation sites. The changes of these features are additive for remote mutations but not additive for mutations close to each other. This approach is consistent with the observations that in general non-additive mutations involve mutations close to each other while additive mutations involve mutations far apart (There are exceptions, however, to this rule because of long range interactions).

Blind test on multiple point mutations D141
The thermostability changes upon multiple point mutations are more complicated than single-and double-point mutations and therefore it is expected to be more difficult to be correctly predicted. Nevertheless, the correlation coefficients of predictions of the 141 multiple point mutations and experimentally measured DDG values reach 0.663 and 0.637 for structure and sequencebased predictions, and the classification results in AUCs of 0.862 and 0.855, respectively ( Figure 3 and Table 4). This result suggests that our predictive model is also capable of predicting stability changes upon multiple point mutations with high accuracy.

Prediction thermostability of Staphylococcal Nuclease mutants
Staphylococcal Nuclease (SNase) has been used as a model protein for studying protein stability and therefore there is a significant amount of experimental data for free energy changes upon mutations of this enzyme [46]. We use PROTS-RF predict free energy changes upon mutations and then plot them against the experimental values in Fig. 4. The predicted and experimental DDG values narrowly distribute along a line passing through the Origin. Both structure-based and sequence-based predictions are highly correlated with the experimental data (R Pearson = 0.855 and 0.843, respectively), and the predictions for mutations and the corresponding hypothetical reverse mutations are strongly symmetric with respect to the Origin. A Trp residue at position 140 is critical to SNase structure, stability and function [47]. PROTS-RF correctly predicts W140 related mutations and their hypothetical reverse mutations qualitatively but not quantitatively (Fig. 4), suggesting further improvement remains desirable.

Discussion
The model developed in the study is robust as demonstrated in the cross validation and blind tests. We believe that the high robustness of this model can be attributed to the Random Forest algorithm and the features used in the models. The Random Forest algorithm is well known for its high robustness and is particularly suitable for mining high-dimensional and noisy data. We utilize diverse features ranging from evolutionary information, protein structure profile, and protein properties to the thermostability terms learned from a large amount of native proteins [26,31]. These features are less dependent on the proteins in training datasets and the over-fitting problem is less pronounced in the model. Consequently, they are robust and capable of predicting not only single-point mutations, but also double-or multiple-point mutations. The tests using the hypothetical reverse mutations in this study have shown that the tested machine learning models for predicting mutation induced protein stability change may suffer from the over-fitting problem. The results are surprising because all these models have undergone cross validation, a common practice widely considered as a rigorous validation approach. We suggest that testing datasets created based on physical principles can be highly useful for testing the robustness of predictive models.
In the present study, it is observed that the structure-based and sequence-based predictors result in very similar performance, suggesting the structural features used in the study do not make significant contribution to the performance of the models. This is consistent to their relatively low importance as shown in Figure 1. The most important structural feature (FBDTD43) is the seventh overall most important feature. Its ability to deliver good predictions without structural information is advantageous over other methods requiring structural information because vast majority of proteins do not have solved structures. It is possible that the information encoded in these structural features is also captured in the sequential features used in the study. In addition, the number of structural features is relatively small (7 structural vs. 34 sequential features) and they may not interact well with sequential features. Nevertheless, we think it is possible to further improve model performance if the structural class of proteins and more structure-based features are considered. Recently, we were made aware that alpha/beta class proteins normally have higher residue contact density (i.e., number of contacts per residue) than other proteins [48]. Proteins with higher contact density tend to bear more mutations without significantly change its thermostability [49] and thermophiles tend to have higher contact density than mesophiles [50]. Moreover, a recently report concluded that the accessible surface area of beta proteins increases more rapidly with the size of proteins in comparison with that of the alpha proteins [51]. It was also reported that the aggregation propensity of a protein is highly correlated with its structural classification [52]. Currently we are investigating different classes of proteins and will report the results in future.

Conclusion
We have presented PROTS-RF, a predictive model based on the Random Forest algorithm for predicting mutation induced protein stability change. This model is constructed based on a large set of features in proteins and trained by the Random Forest algorithm. In the cross validation test and the blind tests using double-and multiple-mutation datasets, this model is comparatively more reliable in the prediction of protein thermostability changes over other existing methods. It also shows high levels of robustness in the tests using hypothetical reverse mutations. We demonstrate that the hypothetical reverse mutations based on physical principles are highly useful for testing the robustness of algorithms for predicting mutation induced protein stability change. Figure S1 The distributions of features of stabilizing and destabilization mutations. (TIF)