The ability to improve protein thermostability via protein engineering is of great scientific interest and also has significant practical value. In this report we present PROTS-RF, a robust model based on the Random Forest algorithm capable of predicting thermostability changes induced not only by single-point, but also by double- or multiple-point mutations. The model is built using 41 features including evolutionary information, secondary structure, solvent accessibility and a set of fragment-based features. It achieves accuracies of 0.799, 0.782 and 0.787, and areas under receiver operating characteristic (ROC) curves of 0.873, 0.868 and 0.862 for the single-, double- and multiple-point mutation datasets, respectively. Contrary to previous suggestions, our results clearly demonstrate that a robust predictive model trained to predict thermostability changes induced by single-point mutations can also be capable of predicting those induced by double- and multiple-point mutations. It also shows high levels of robustness in the tests using hypothetical reverse mutations. We demonstrate that testing datasets created based on physical principles can be highly useful for testing the robustness of predictive models.
Citation: Li Y, Fang J (2012) PROTS-RF: A Robust Model for Predicting Mutation-Induced Protein Stability Changes. PLoS ONE 7(10): e47247. https://doi.org/10.1371/journal.pone.0047247
Editor: Narayanaswamy Srinivasan, Indian Institute of Science, India
Received: April 30, 2012; Accepted: September 11, 2012; Published: October 15, 2012
Copyright: © Li and Fang. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The authors have no support or funding to report.
Competing interests: The authors have declared that no competing interests exist.
The ability to improve protein thermostability via protein engineering is of great scientific interest and has significant practical value because many native proteins are only marginally stable under normal physiological and storage conditions. For example, protein-based pharmaceuticals are often vulnerable to degradation that may affect their potency and even safety. In addition, stable proteins are highly desirable in many biotechnological applications, including biopharmaceuticals, biomaterials and biofuels. Enzymes with enhanced stability allow catalyzed reactions to be performed at higher temperatures, which often leads to more efficient industrial processes.
Computational methods for designing proteins with enhanced thermostability can be advantageous over conventional approaches because of their potentially low cost and short turnaround. Existing computational approaches use either force fields or data mining technologies. The former require high-resolution 3D structures and are often computationally intensive. Consequently, in recent years, data mining technologies employing various machine learning algorithms have increasingly attracted attention. The general procedure of machine learning approaches is to train predictive models on available experimental data using features (properties) such as substitution types, secondary structures, solvent accessibility, and the amino acid composition of neighboring residues. Many algorithms, including support vector machines, neural networks, and multiple regression and classification techniques, have been used for predicting protein stability changes induced by mutations. Machine learning approaches hold great promise because they may be used to discover subtle patterns governing mutation-induced stability changes and protein stability in general. However, we recently discovered that some of these methods may suffer from over-fitting when hypothetical reverse mutations are used to test their robustness.
Protein stability changes upon mutation are usually measured experimentally as changes in the melting temperature (ΔTm) or in the folding free energy (ΔΔG) between wild-type proteins and their mutants. Existing protein stability predictors use one or the other as the metric for stability changes. Because both metrics are thermodynamic parameters and thus also state functions, the ΔΔG (or ΔTm) of a mutation from a wild-type protein to its mutant (WT−>MT) equals the negated ΔΔG (or ΔTm) of the hypothetical reverse mutation (MT−>WT), i.e.,

$$\Delta\Delta G_{\mathrm{WT \to MT}} = -\,\Delta\Delta G_{\mathrm{MT \to WT}} \tag{1}$$

$$\Delta T_{m,\,\mathrm{WT \to MT}} = -\,\Delta T_{m,\,\mathrm{MT \to WT}} \tag{2}$$
Our tests revealed that the tested methods lost considerable predictive ability when hypothetical reverse mutations were used to evaluate their robustness. Our findings are consistent with the comprehensive analysis conducted recently by Khan and Vihinen, who evaluated and compared 11 online stability predictors and found that “at best, the predictions were only moderately accurate (∼60%)”. Thus, effective and robust computational algorithms for predicting mutation-induced protein stability changes are still in critical demand.
In addition, most existing algorithms were developed for predicting thermostability changes of single-point mutations. The ability to predict protein stability changes upon multiple-point mutations is also important, because the stabilization induced by a single mutation may not be sufficient for practical applications of a protein. Only in recent years have a few studies focused on thermostability changes induced by multiple mutations. For example, Huang and Gromiha proposed a predictive model named WET, a weighted decision table method for predicting protein thermostability changes upon double mutation from amino acid sequences. The model was built and tested on a set of 180 double-point mutations. The correlation coefficient between the predicted and experimental ΔΔG reached 0.75 and the overall accuracy was 82.2% in the 10-fold cross validation test. However, the accuracy drops to 0.57 when it is tested on the hypothetical reverse mutations (see details in the results).
In this work, we attempt to develop a robust algorithm that treats free energy as a thermodynamic parameter and predicts thermostability changes induced not only by single-point but also by multiple-point mutations. A prerequisite for such a model is a set of suitable features relevant to protein stability. We use several types of features in this study. The first type is the evolutionary information extracted from the target proteins, since the “survival of the fittest” principle may also be applicable to protein thermostability. In fact, a concept of evolutionary pseudo free energy upon mutations was introduced and was found to have statistically significant correlations with protein thermostability changes. Other features include secondary structures and solvent accessibility, either assigned based on structures or predicted by PSIPRED, depending on the availability of structures. In addition, we include features that we previously developed in ThermoRank, and a set of fragment-based thermostability terms.
In the following sections, we first describe the mutation datasets, the features used in the study, and the Random Forest algorithm for constructing the predictive model, PROTS-RF (PROtein Thermostability Random Forest model). We then present the results from cross validation on a single-point mutation dataset and benchmark tests on a set of double-point mutations and a set of multiple-point mutations. We test the robustness of the predictive model using hypothetical reverse mutations. We also present a comparison of PROTS-RF with several other relevant potentials and algorithms. In all cases, PROTS-RF delivers better performance than the other algorithms. Conclusions and prospects are presented at the end of the report.
Materials and Methodology
Three mutation datasets are used in this work. The first dataset was originally collected by Potapov et al. It contains 2,156 single-point mutations (D2156) with experimentally determined changes of folding free energies (ΔΔG). These mutants are derived from 84 wild-type proteins. We cluster these proteins using Blastclust at 30% sequence identity and then group the clusters into five portions, each with a similar number of mutations. Therefore, proteins from different portions share 30% or less sequence identity. These five groups are then used in a standard five-fold cross validation (CV). The second dataset, collected by Huang and Gromiha, includes 180 double-point mutations (D180) with ΔΔG values from 27 wild-type proteins. The final dataset contains 141 multiple-point mutations (D141) from 19 different wild-type proteins, collected from the ProTherm database.
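The grouping of sequence clusters into five portions of similar size can be sketched as follows. The text does not say how the clusters were distributed, so the greedy largest-first strategy, the function name `assign_folds` and the cluster labels are all assumptions made for illustration only.

```python
def assign_folds(cluster_sizes, n_folds=5):
    """Greedily place whole clusters into folds, keeping mutation counts similar.

    cluster_sizes maps a (hypothetical) cluster label to its number of mutations.
    Whole clusters are never split, so proteins in different folds stay dissimilar.
    """
    folds = [[] for _ in range(n_folds)]
    totals = [0] * n_folds
    # Largest clusters first; ties broken by label so the result is reproducible.
    ordered = sorted(cluster_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for cluster, size in ordered:
        k = totals.index(min(totals))  # currently smallest fold
        folds[k].append(cluster)
        totals[k] += size
    return folds, totals


# Illustrative cluster sizes only; these are not the counts from D2156.
example = {"c1": 50, "c2": 40, "c3": 30, "c4": 30, "c5": 20, "c6": 20, "c7": 10}
folds, totals = assign_folds(example, n_folds=3)
print(folds, totals)
```

Keeping clusters intact is what preserves the 30% identity ceiling between folds; only the balance of mutation counts is approximate.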
For each mutation in all three datasets, a corresponding hypothetical reverse mutation (i.e., MT−>WT) is created by swapping the wild-type protein and its mutant. The free energy change of a hypothetical reverse mutation has the same magnitude as, but the opposite sign to, that of the experimental forward mutation (Eq. 1). The hypothetical reverse mutations are grouped in the same fold as their corresponding forward mutations in the cross validation test. An additional benefit of using hypothetical reverse mutations is that the dataset becomes perfectly balanced between stabilizing and destabilizing mutations.
We assemble a set of 41 sequential and structural features. These features are carefully selected so that the free energy can be treated as a thermodynamic parameter. The name and description of each feature are available in Table 1. These features can be classified into the following four groups:
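A minimal sketch of this augmentation step is given below. The record layout (keys `wt`, `mt`, `ddg`, `fold`) and the function name `add_reverse_mutations` are hypothetical; the only parts taken from the text are the swap of wild type and mutant, the sign flip of ΔΔG, and the shared fold.

```python
def add_reverse_mutations(records):
    """Return the forward records plus one hypothetical reverse record for each."""
    augmented = []
    for rec in records:
        forward = dict(rec, reverse=False)
        # Swap wild-type and mutant residues, negate ddG, keep the same CV fold.
        reverse = dict(rec, wt=rec["mt"], mt=rec["wt"], ddg=-rec["ddg"], reverse=True)
        augmented.extend([forward, reverse])
    return augmented


# Illustrative records only; not taken from the datasets used in the study.
forward_set = [
    {"protein": "P1", "pos": 10, "wt": "A", "mt": "G", "ddg": -1.2, "fold": 3},
    {"protein": "P2", "pos": 42, "wt": "L", "mt": "V", "ddg": 0.5, "fold": 1},
]
augmented = add_reverse_mutations(forward_set)
n_positive = sum(r["ddg"] > 0 for r in augmented)
n_negative = sum(r["ddg"] < 0 for r in augmented)
print(len(augmented), n_positive, n_negative)
```

Because every nonzero ΔΔG appears once with each sign, the augmented set contains equal numbers of positive and negative examples.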
1. Evolutionary information (10 features).
PSI-BLAST is used to search the wild-type proteins against the NCBI non-redundant (NR) protein database pre-filtered at 90% sequence identity. We consider the log-odds and weighted scores of the wild-type residues and mutant residues, as well as the conservation of wild-type residues and neighboring residues in a window centered on the mutation site. We use three different window sizes: 5, 9 and 15. The log-odds and the weighted scores are directly extracted from the position-specific scoring matrices (PSSMs) for single-point mutations. For multiple-point mutations, the averages of these values are used instead. Overall, ten parameters are generated to record the evolutionary information for each single- or multiple-point mutation.
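The sketch below illustrates how such values might be assembled for a single- or multiple-point mutation. It covers only a subset of the ten parameters, and it rests on assumptions not stated in the text: the PSSM is represented as one residue-to-score dictionary per position, a per-position conservation score is supplied by the caller, windows are truncated at the sequence ends, and the names `window_mean` and `evolutionary_features` are illustrative.

```python
def window_mean(scores, center, width):
    """Mean of per-position scores in a window centered on `center` (0-based)."""
    half = width // 2
    lo = max(0, center - half)
    hi = min(len(scores), center + half + 1)  # window truncated at the termini
    return sum(scores[lo:hi]) / (hi - lo)


def evolutionary_features(pssm, conservation, mutations, widths=(5, 9, 15)):
    """Average PSSM-derived values over all sites of one mutation.

    mutations is a list of (position, wt_residue, mt_residue) tuples; a
    single-point mutation is simply a list of length one.
    """
    n = len(mutations)
    feats = {
        "wt_log_odds": sum(pssm[pos][wt] for pos, wt, _ in mutations) / n,
        "mt_log_odds": sum(pssm[pos][mt] for pos, _, mt in mutations) / n,
    }
    for w in widths:
        total = sum(window_mean(conservation, pos, w) for pos, _, _ in mutations)
        feats["conservation_w%d" % w] = total / n
    return feats


# Toy three-residue example with made-up scores.
pssm = [{"A": 4, "G": -1}, {"A": 2, "G": 0}, {"A": 0, "G": 6}]
conservation = [1.0, 2.0, 3.0]
print(evolutionary_features(pssm, conservation, [(0, "A", "G"), (2, "G", "A")]))
```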
2. Secondary structure and solvent accessibility (5 features).
We assign the secondary structure and solvent exposure status of each residue based on the wild-type proteins. If the structure of a wild-type protein is available, we use DSSP to assign the secondary structure of each residue to one of three states, helix (H), extended (E) or coil (C), and the solvent accessibility to exposed (e) or buried (b), using 25% relative accessible surface area as the threshold. We assume that the mutations do not significantly change the conformation of the protein and therefore the secondary structure and the solvent accessibility of the wild type and the mutant remain the same.
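The two assignments can be sketched as simple lookups. The reduction of DSSP's eight codes to three states shown here (H, G, I to helix; E, B to extended; everything else to coil) is one common convention and is an assumption, since the text does not give the mapping; treating a residue at exactly 25% as exposed is likewise an assumption. The relative accessibility is taken as an input so that no reference surface areas need to be invented.

```python
# Assumed eight-to-three state reduction; other conventions exist.
DSSP_TO_THREE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def three_state(dssp_code):
    """Map a DSSP secondary structure code to helix (H), extended (E) or coil (C)."""
    return DSSP_TO_THREE.get(dssp_code, "C")


def burial_state(relative_asa, threshold=0.25):
    """Return 'e' (exposed) or 'b' (buried) from relative accessible surface area."""
    return "e" if relative_asa >= threshold else "b"


print(three_state("G"), three_state("T"), burial_state(0.10), burial_state(0.40))
```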
3. Relative difference (6 features).
We also utilize six relative differences of compositions and properties between the wild-type and the mutant sequences, including the changes in positively charged residues, charged residues, small residues, tiny residues, maximum area of solvent accessibility (ASA) and the iso-electric point (pIa). These features were identified and used to build a model for discriminating thermophilic proteins from their mesophilic homologs.
4. PROTS terms (20 for structure-based model or 13 for sequence-based model).
PROTS is a protein stability potential derived from a comparative study of a large set of thermophilic and mesophilic proteins and a set of point mutations with measured mutation-induced changes in melting temperature. There are 20 features in this category, including 13 sequential features and, if the protein structure is available, 7 Delaunay Tetrahedron (DT) based spatial features. The sequential features are used for all models, but the Delaunay Tetrahedron based features are only used for structure-based models.
Random Forest algorithm (RF)
Predictive models are built using the Random Forest algorithm (RF), an ensemble technique utilizing hundreds or thousands of independent decision trees to perform classification or regression. Each of the member trees is built on a bootstrap sample from the training data using a random subset of available variables. The algorithm is a state-of-the-art machine learning method and has been successfully used to build many predictive models. Unlike many competing machine learning algorithms such as the support vector machine (SVM), RF does not require fine-tuning of parameters because the default parameter values often result in near-optimal performance. Moreover, the prediction time for an RF model is often a small fraction of that for a corresponding SVM model. Another advantage of RF is that it provides several variable importance measures. It is particularly suitable for mining high-dimensional and noisy data. In this study, we use an R implementation of the Random Forest algorithm to construct the predictive model in regression mode. The predicted free energy changes are then used to calculate the accuracy of the predictions, using zero change as the threshold for classification.
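The bootstrap-plus-random-subset idea can be illustrated with a toy ensemble in plain Python. This is a minimal sketch under strong simplifications, not the R implementation used in the study: each tree is a depth-one regression stump, and the names `fit_stump`, `fit_forest`, `predict`, `n_trees` and `mtry` are illustrative choices.

```python
import random
from statistics import mean


def fit_stump(X, y, features):
    """Best single split among `features` by squared error; returns
    (feature, threshold, left_mean, right_mean)."""
    best = None
    for f in features:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [t for row, t in zip(X, y) if row[f] <= thr]
            right = [t for row, t in zip(X, y) if row[f] > thr]
            lm, rm = mean(left), mean(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, lm, rm)
    if best is None:  # no feature varies in this sample: predict its mean
        m = mean(y)
        return (None, 0.0, m, m)
    return best[1:]


def fit_forest(X, y, n_trees=200, mtry=1, seed=0):
    """Grow each stump on a bootstrap sample using a random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feats = rng.sample(range(p), mtry)          # random subset of variables
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Regression output: the average prediction over all trees."""
    return mean(lm if f is None or row[f] <= thr else rm for f, thr, lm, rm in forest)


# Toy data: the target follows feature 0; feature 1 carries no information.
X = [[float(v), 0.0] for v in range(-5, 6)]
y = [float(v) for v in range(-5, 6)]
forest = fit_forest(X, y)
ddg_hat = predict(forest, [4.0, 0.0])
print(ddg_hat, ddg_hat > 0)  # regression value, then its class at the zero threshold
```

The final line mirrors the procedure described above: the model is trained as a regressor, and the two-class call is obtained afterwards by comparing the predicted value with zero.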
Algorithms used for comparison
We compare PROTS-RF to a variety of methods, including several top-ranked ones in a recent comprehensive evaluation of protein stability predictors. LSE is a local structure entropy derived from representative protein structures and has shown a strong correlation with protein thermostability. MUpro is a sequence-level support vector machine (SVM) based predictor of the change in folding free energy (ΔΔG) upon point mutations. I-Mutant2.0 is an SVM-based predictor using structure and sequence information for ΔΔG prediction. Both EGAD and FoldX are force fields parameterized on a large set of point mutations with experimentally determined stability changes.
We use several metrics to measure the performance of the predictive models. The first is accuracy, defined as the ratio of the number of mutations correctly predicted as stabilizing or destabilizing to the total number of predicted mutations. The second is the area under the receiver operating characteristic (ROC) curve, known as AUC. It should be pointed out that AUC can be a misleading parameter in some situations and therefore the AUC results should be interpreted with caution. We provide AUC for comparison purposes because it is widely used in similar studies. The third is the Pearson correlation coefficient of predicted and experimental ΔΔG values.
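The three metrics can be written compactly as below. This is a sketch rather than the evaluation code used in the study: classes are defined here by the sign of the experimental ΔΔG, the AUC is computed through its rank (Mann-Whitney) interpretation with ties counted as one half, and the function names are illustrative.

```python
from math import sqrt


def sign_accuracy(pred, obs):
    """Fraction of mutations predicted on the same side of zero as observed."""
    hits = sum((p > 0) == (o > 0) for p, o in zip(pred, obs))
    return hits / len(obs)


def auc(pred, obs):
    """Probability that a case with obs > 0 receives a higher prediction than
    a case with obs <= 0 (equivalent to the area under the ROC curve)."""
    pos = [p for p, o in zip(pred, obs) if o > 0]
    neg = [p for p, o in zip(pred, obs) if o <= 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up predicted and experimental ddG values for illustration.
pred = [1.0, -0.5, 0.3, -2.0]
obs = [0.8, -1.0, -0.2, -1.5]
print(sign_accuracy(pred, obs), auc(pred, obs), pearson(pred, obs))
```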
Statistical analysis of the single mutation dataset
We analyze the statistical distributions of the features used in the study. Using the Kolmogorov-Smirnov test for normality, we find that only one of the features is normally distributed. We calculate the median, the mean, and the p-value of the Kolmogorov-Smirnov test comparing each feature's distribution in stabilizing vs. destabilizing mutations (Table 1). We also generate boxplots to illustrate the distributions of features in stabilizing and destabilizing mutations (Figure S1). The results presented in Table 1 clearly show that the distributions of a number of features differ significantly between stabilizing and destabilizing mutations. For example, mutations occurring in sheets are more likely to be destabilizing (p-value: 2.5×10⁻⁴). Mutations of buried residues are more likely to be destabilizing than stabilizing (p-value: 6.0×10⁻¹⁵), which can be explained by the fact that protein cores are tightly packed and thus it is difficult to further optimize the interactions within the cores.
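For readers unfamiliar with the test, the two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the empirical cumulative distributions of the two groups. The sketch below computes only that statistic; the p-value calculation is omitted, and the function name and toy values are illustrative.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest absolute difference between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        fa = bisect_right(a, v) / len(a)  # share of a that is <= v
        fb = bisect_right(b, v) / len(b)  # share of b that is <= v
        d = max(d, abs(fa - fb))
    return d


# Made-up feature values for stabilizing vs. destabilizing mutations.
stabilizing = [0.1, 0.4, 0.5, 0.9]
destabilizing = [0.3, 0.8, 1.2, 1.5]
print(ks_statistic(stabilizing, destabilizing))
```

A statistic near 0 means the two groups have nearly identical distributions for that feature, while a value near 1 means they barely overlap.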
Cross validation and model training
We use an R implementation of the Random Forest algorithm to build models. Each model in the five-fold cross validation comprises 2,000 decision trees. The importance of a feature is estimated using the total decrease in node impurity, summed over all trees, that results from splits on that feature. The average and standard error of the importance of the 41 features in structure-based prediction and the 34 features in sequence-based prediction are shown in Figure 1. The results clearly show that the PROTS features and the evolutionary information are strongly correlated with protein stability.
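Summarizing importance across the five cross validation models amounts to a mean and a standard error per feature, as sketched below. The input layout (one dictionary of feature importances per fold), the function name `summarize_importance` and the feature labels are assumptions for illustration; the importance values themselves would come from the fitted models.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_importance(per_fold):
    """Return {feature: (mean, standard error)} over cross validation folds."""
    summary = {}
    for name in per_fold[0]:
        values = [fold[name] for fold in per_fold]
        summary[name] = (mean(values), stdev(values) / sqrt(len(values)))
    return summary


# Made-up importances for two hypothetical features over five folds.
per_fold = [
    {"feature_a": 1.0, "feature_b": 10.0},
    {"feature_a": 2.0, "feature_b": 10.0},
    {"feature_a": 3.0, "feature_b": 10.0},
    {"feature_a": 4.0, "feature_b": 10.0},
    {"feature_a": 5.0, "feature_b": 10.0},
]
print(summarize_importance(per_fold))
```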
The error bars denote the variation in five-fold cross validation.
The results from all five test datasets in the cross validation are combined. The predictions for the actual experimental mutations and for the hypothetical reverse mutations are then separated and each compared with the corresponding experimental data (Table 2). For the experimental mutations, the Pearson correlation coefficients (R) are 0.628 for the structure-based predictions and 0.620 for the sequence-based predictions (Table 2). We then use various ΔΔG values as cutoff thresholds to classify mutations as stabilizing or destabilizing and calculate the areas under the receiver operating characteristic (ROC) curves. The areas under the ROC curves (AUC) reach 0.873 and 0.869 for structure- and sequence-based predictions, respectively. Very similar R and AUC values are obtained for the hypothetical reverse mutations (Table 2). This result demonstrates that the predictive model is quite robust.
The model constructed in this work yields comparatively more reliable predictions than the other tested models (Table 2). The machine learning based algorithms MUpro and I-Mutant2.0 perform poorly on the hypothetical reverse mutations: their AUCs are only slightly higher than 0.5, the level of random selection. The models based on force fields or potentials, such as LSE, FoldX and EGAD, can treat temperature and free energy as thermodynamic parameters. Nevertheless, the performance of these algorithms in this study is not as good as that of PROTS-RF. In addition, PROTS-RF performs better than PROTS, a fragment-based protein thermostability potential we recently developed.
We then build the final structure- and sequence-based models using all 2,156 point mutations and test these models on the double- and multiple-point mutations.
Blind test on double point mutation dataset D180
In the blind test on the 180 double-point mutations, the regression of predictions against experimentally measured ΔΔG values results in correlation coefficients of 0.775 and 0.755 for structure- and sequence-based predictions, respectively, and the classification achieves AUCs of 0.868 and 0.869 (Table 3 and Figure 2). The predictions on the experimental data are similar to those of a previously reported model, WET, for which the authors achieved correlation coefficients up to 0.75 and AUCs up to 0.87 in 10-fold cross validation tests using a weighted decision table method. However, PROTS-RF achieves very similar results for the hypothetical reverse mutations (0.863 and 0.868 respectively), while the WET model provided by Huang et al. delivers an AUC of 0.518 and an R of 0.110, a strong indication of an over-fitting problem with that model.
Huang et al. suggested that methods developed for predicting protein stability changes upon single-point mutations may not be suitable for predicting the stability changes upon double-point mutations because thermostability changes are not always additive. Our results, nevertheless, clearly indicate that a predictive model trained on single-point mutations may still be capable of predicting protein stability changes induced by double-point mutations. Some features used in our models, especially the PROTS terms, reflect the surrounding environment of the mutation sites. The changes of these features are additive for remote mutations but not additive for mutations close to each other. This behavior is consistent with the observation that, in general, non-additive mutations involve sites close to each other while additive mutations involve sites far apart (there are exceptions to this rule, however, because of long-range interactions).
Blind test on multiple point mutations D141
The thermostability changes upon multiple-point mutations are more complicated than those upon single- and double-point mutations and are therefore expected to be more difficult to predict correctly. Nevertheless, the correlation coefficients between the predictions for the 141 multiple-point mutations and the experimentally measured ΔΔG values reach 0.663 and 0.637 for structure- and sequence-based predictions, and the classification results in AUCs of 0.862 and 0.855, respectively (Figure 3 and Table 4). This result suggests that our predictive model is also capable of predicting stability changes upon multiple-point mutations with high accuracy.
Predicting thermostability of Staphylococcal Nuclease mutants
Staphylococcal Nuclease (SNase) has been used as a model protein for studying protein stability, and therefore a significant amount of experimental data is available for free energy changes upon mutations of this enzyme. We use PROTS-RF to predict free energy changes upon mutations and then plot them against the experimental values in Fig. 4. The predicted and experimental ΔΔG values are narrowly distributed along a line passing through the origin. Both structure-based and sequence-based predictions are highly correlated with the experimental data (RPearson = 0.855 and 0.843, respectively), and the predictions for mutations and the corresponding hypothetical reverse mutations are strongly symmetric with respect to the origin. A Trp residue at position 140 is critical to SNase structure, stability and function. PROTS-RF predicts W140-related mutations and their hypothetical reverse mutations correctly in qualitative but not in quantitative terms (Fig. 4), suggesting that further improvement remains desirable.
The model developed in the study is robust, as demonstrated in the cross validation and blind tests. We believe that the high robustness of this model can be attributed to the Random Forest algorithm and the features used in the models. The Random Forest algorithm is well known for its high robustness and is particularly suitable for mining high-dimensional and noisy data. We utilize diverse features ranging from evolutionary information, protein structure profile, and protein properties to the thermostability terms learned from a large number of native proteins. These features are less dependent on the proteins in the training datasets, and the over-fitting problem is therefore less pronounced in the model. Consequently, the models are robust and capable of predicting not only single-point mutations, but also double- or multiple-point mutations. The tests using the hypothetical reverse mutations in this study have shown that the tested machine learning models for predicting mutation-induced protein stability changes may suffer from over-fitting. The results are surprising because all these models have undergone cross validation, a common practice widely considered to be a rigorous validation approach. We suggest that testing datasets created based on physical principles can be highly useful for testing the robustness of predictive models.
In the present study, the structure-based and sequence-based predictors show very similar performance, suggesting that the structural features used in the study do not contribute significantly to the performance of the models. This is consistent with their relatively low importance as shown in Figure 1. The most important structural feature (FBDTD43) ranks only seventh overall. The ability of PROTS-RF to deliver good predictions without structural information is an advantage over methods requiring structural information, because the vast majority of proteins do not have solved structures. It is possible that the information encoded in these structural features is also captured in the sequential features used in the study. In addition, the number of structural features is relatively small (7 structural vs. 34 sequential features) and they may not interact well with the sequential features. Nevertheless, we think it is possible to further improve model performance if the structural class of proteins and more structure-based features are considered. Recently, we were made aware that alpha/beta class proteins normally have higher residue contact density (i.e., number of contacts per residue) than other proteins. Proteins with higher contact density tend to tolerate more mutations without significant changes in their thermostability, and thermophiles tend to have higher contact density than mesophiles. Moreover, a recent report concluded that the accessible surface area of beta proteins increases more rapidly with protein size than that of alpha proteins. It was also reported that the aggregation propensity of a protein is highly correlated with its structural classification. We are currently investigating different classes of proteins and will report the results in the future.
We have presented PROTS-RF, a predictive model based on the Random Forest algorithm for predicting mutation-induced protein stability changes. This model is constructed from a large set of protein features and trained with the Random Forest algorithm. In the cross validation test and the blind tests using the double- and multiple-point mutation datasets, this model is comparatively more reliable in predicting protein thermostability changes than other existing methods. It also shows high levels of robustness in the tests using hypothetical reverse mutations. We demonstrate that hypothetical reverse mutations based on physical principles are highly useful for testing the robustness of algorithms for predicting mutation-induced protein stability changes.
We wish to thank the three anonymous reviewers and the editor for their constructive comments and suggestions. We are indebted to Dr. Vladimir Potapov for kindly sharing his data and Dr. Michael Gromiha for providing the WET program.
Conceived and designed the experiments: JF. Performed the experiments: YL. Analyzed the data: YL JF. Contributed reagents/materials/analysis tools: YL JF. Wrote the paper: YL JF.
- 1. Dahiyat BI (1999) In silico design for protein stabilization. Current Opinion in Biotechnology 10: 387–390.
- 2. Korkegian A, Black ME, Baker D, Stoddard BL (2005) Computational thermostabilization of an enzyme. Science 308: 857–860.
- 3. Lazar GA, Marshall SA, Plecs JJ, Mayo SL, Desjarlais JR (2003) Designing proteins for therapeutic applications. Curr Opin Struct Biol 13: 513–518.
- 4. Schweiker KL, Makhatadze GI (2009) Protein Stabilization by the Rational Design of Surface Charge-Charge Interactions. In: Shriver JW, editor. Protein Structure, Stability, and Interactions: Humana Press. pp. 261–283.
- 5. Sterner R, Liebl W (2001) Thermophilic adaptation of proteins. Critical Reviews in Biochemistry and Molecular Biology 36: 39–106.
- 6. Chennamsetty N, Voynov V, Kayser V, Helk B, Trout BL (2009) Design of therapeutic proteins with enhanced stability. Proc Natl Acad Sci U S A 106: 11937–11942.
- 7. Unsworth LD, van der Oost J, Koutsopoulos S (2007) Hyperthermophilic enzymes–stability, activity and implementation strategies for high temperature applications. FEBS J 274: 4044–4056.
- 8. Schoemaker HE, Mink D, Wubbolts MG (2003) Dispelling the myths - Biocatalysis in industrial synthesis. Science 299: 1694–1697.
- 9. Frokjaer S, Otzen DE (2005) Protein drug stability: a formulation challenge. Nat Rev Drug Discov 4: 298–306.
- 10. Lippow SM, Tidor B (2007) Progress in computational protein design. Current Opinion in Biotechnology 18: 305–311.
- 11. Guerois R, Nielsen JE, Serrano L (2002) Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol 320: 369–387.
- 12. Chan CH, Liang HK, Hsiao NW, Ko MT, Lyu PC, et al. (2004) Relationship between local structural entropy and protein thermostability. Proteins 57: 684–691.
- 13. Pokala N, Handel TM (2005) Energy functions for protein design: adjustment with protein-protein complex affinities, models for the unfolded state, and negative design of solubility and specificity. J Mol Biol 347: 203–227.
- 14. Zhou H, Zhou Y (2002) Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci 11: 2714–2726.
- 15. Yin S, Ding F, Dokholyan NV (2007) Modeling backbone flexibility improves protein stability estimation. Structure 15: 1567–1576.
- 16. Kellogg EH, Leaver-Fay A, Baker D (2011) Role of conformational sampling in computing mutation-induced changes in protein structure and stability. Proteins: Structure, Function, and Bioinformatics 79: 830–838.
- 17. Capriotti E, Fariselli P, Casadio R (2005) I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res 33: W306–310.
- 18. Cheng J, Randall A, Baldi P (2006) Prediction of protein stability changes for single-site mutations using support vector machines. Proteins 62: 1125–1132.
- 19. Masso M, Vaisman II (2008) Accurate prediction of stability changes in protein mutants by combining machine learning with structure based computational mutagenesis. Bioinformatics 24: 2002–2009.
- 20. Montanucci L, Fariselli P, Martelli PL, Casadio R (2008) Predicting protein thermostability changes from sequence upon multiple mutations. Bioinformatics 24: I190–I195.
- 21. Wu LC, Lee JX, Huang HD, Liu BJ, Horng JT (2009) An expert system to predict protein thermostability using decision tree. Expert Systems with Applications 36: 9007–9014.
- 22. Gromiha MM, Oobatake M, Sarai A (1999) Important amino acid properties for enhanced thermostability from mesophilic to thermophilic proteins. Biophysical Chemistry 82: 51–67.
- 23. Huang LT, Gromiha MM (2009) Reliable prediction of protein thermostability change upon double mutation from amino acid sequence. Bioinformatics 25: 2181–2187.
- 24. Glyakina AV, Garbuzynskiy SO, Lobanov MY, Galzitskaya OV (2007) Different packing of external residues can explain differences in the thermostability of proteins from thermophilic and mesophilic organisms. Bioinformatics 23: 2231–2238.
- 25. Capriotti E, Fariselli P, Rossi I, Casadio R (2008) A three-state prediction of single point mutations on protein stability changes. BMC Bioinformatics 9 (Suppl 2): S6.
- 26. Li Y, Zhang J, Tai D, Russell Middaugh C, Zhang Y, et al. (2012) Prots: A fragment based protein thermo-stability potential. Proteins: Structure, Function, and Bioinformatics 80: 81–92.
- 27. Becktel WJ, Schellman JA (1987) Protein stability curves. Biopolymers 26: 1859–1877.
- 28. Khan S, Vihinen M (2010) Performance of protein stability predictors. Human Mutation 31: 675–684.
- 29. Sanchez IE, Tejero J, Gomez-Moreno C, Medina M, Serrano L (2006) Point mutations in protein globular domains: Contributions from function, stability and misfolding. J Mol Biol 363: 422–432.
- 30. McGuffin LJ, Bryson K, Jones DT (2000) The PSIPRED protein structure prediction server. Bioinformatics 16: 404–405.
- 31. Li Y, Middaugh CR, Fang J (2010) A novel scoring function for discriminating hyperthermophilic and mesophilic proteins with application to predicting relative thermostability of protein mutants. BMC Bioinformatics 11: 62.
- 32. Potapov V, Cohen M, Schreiber G (2009) Assessing computational methods for predicting protein stability upon mutation: good on average but not in the details. Protein Eng Des Sel 22: 553–560.
- 33. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.
- 34. Kumar MD, Bava KA, Gromiha MM, Prabakaran P, Kitajima K, et al. (2006) ProTherm and ProNIT: thermodynamic databases for proteins and protein-nucleic acid interactions. Nucleic Acids Res 34: D204–206.
- 35. Kabsch W, Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22: 2577–2637.
- 36. Breiman L (2001) Random Forests. Machine Learning 45: 5–32.
- 37. Wang L, Yang MQ, Yang JY (2009) Prediction of DNA-binding residues from protein sequence information using random forests. BMC Genomics 10 (Suppl 1): S1.
- 38. Sikic M, Tomic S, Vlahovicek K (2009) Prediction of protein-protein interaction sites in sequences and 3D structures by random forests. PLoS Comput Biol 5: e1000278.
- 39. Li Y, Fang Y, Fang J (2011) Predicting Residue-Residue Contacts Using Random Forest Models. Bioinformatics 27: 3379–3384.
- 40. Fang J, Koen YM, Hanzlik RP (2009) Bioinformatic analysis of xenobiotic reactive metabolite target proteins and their interacting partners. BMC Chem Biol 9: 5.
- 41. Fang JW, Dong YH, Williams TD, Lushington GH (2008) Feature selection in validating mass spectrometry database search results. J Bioinform Comput Biol 6: 223–240.
- 42. Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, et al. (2003) Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci 43: 1947–1958.
- 43. Lobo JM, Jimenez-Valverde A, Real R (2008) AUC: a misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography 17: 145–151.
- 44. Hand DJ (2009) Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning 77: 103–123.
- 45. Schweiker KL, Makhatadze GI (2009) A Computational Approach for the Rational Design of Stable Proteins and Enzymes: Optimization of Surface Charge-Charge Interactions. Methods in Enzymology: Computer Methods, Vol 454, Pt A 454: 175–211.
- 46. Frenz CM (2005) Neural network-based prediction of mutation-induced protein stability changes in staphylococcal nuclease at 20 residue positions. Proteins-Structure Function and Bioinformatics 59: 147–151.
- 47. Hirano S, Kamikubo H, Yamazaki Y, Kataoka M (2005) Elucidation of information encoded in tryptophan 140 of staphylococcal nuclease. Proteins 58: 271–277.
- 48. Galzitskaya OV, Reifsnyder DC, Bogatyreva NS, Ivankov DN, Garbuzynskiy SO (2008) More compact protein globules exhibit slower folding rates. Proteins: Structure, Function, and Bioinformatics 70: 329–332.
- 49. Shakhnovich BE, Deeds E, Delisi C, Shakhnovich E (2005) Protein structure and evolutionary history determine sequence space topology. Genome Research 15: 385–392.
- 50. England JL, Shakhnovich BE, Shakhnovich EI (2003) Natural selection of more designable folds: A mechanism for thermophilic adaptation. Proceedings of the National Academy of Sciences 100: 8727–8731.
- 51. Glyakina AV, Bogatyreva NS, Galzitskaya OV (2011) Accessible Surfaces of Beta Proteins Increase with Increasing Protein Molecular Mass More Rapidly than Those of Other Proteins. PLoS One 6: e28464.
- 52. Niwa T, Ying BW, Saito K, Jin W, Takada S, et al. (2009) Bimodal protein solubility distribution revealed by an aggregation analysis of the entire ensemble of Escherichia coli proteins. Proc Natl Acad Sci U S A 106: 4201–4206.