
DDGWizard: Integration of feature calculation resources for analysis and prediction of changes in protein thermostability upon point mutations

  • Mingkai Wang,

    Roles Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft

    Affiliations Institute of Health and Social Care, University of Bradford, Bradford, United Kingdom, Key Laboratory of Industrial Fermentation Microbiology, Ministry of Education, Tianjin Key Laboratory of Industrial Microbiology, The College of Biotechnology, Tianjin University of Science and Technology, Tianjin, China

  • Khaled Jumah,

    Roles Formal analysis, Methodology, Validation, Writing – review & editing

    Affiliation Institute of Health and Social Care, University of Bradford, Bradford, United Kingdom

  • Qun Shao,

    Roles Conceptualization, Formal analysis, Methodology, Project administration, Supervision, Writing – review & editing

    Affiliation Institute of Health and Social Care, University of Bradford, Bradford, United Kingdom

  • Katarzyna Kamieniecka,

    Roles Formal analysis, Methodology, Writing – review & editing

    Affiliation Institute of Health and Social Care, University of Bradford, Bradford, United Kingdom

  • Yihan Liu ,

    Roles Conceptualization, Project administration, Supervision, Writing – review & editing

    lyh@tust.edu.cn (YL); K.Poterlowicz1@bradford.ac.uk (KP)

    Affiliation Key Laboratory of Industrial Fermentation Microbiology, Ministry of Education, Tianjin Key Laboratory of Industrial Microbiology, The College of Biotechnology, Tianjin University of Science and Technology, Tianjin, China

  • Krzysztof Poterlowicz

    Roles Conceptualization, Formal analysis, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – review & editing


    Affiliation Institute of Health and Social Care, University of Bradford, Bradford, United Kingdom

Abstract

Thermostability is an important property of proteins and a critical factor for their wide application. Accurate prediction of ΔΔG (the change in folding free energy upon mutation) enables the impact of mutations on thermostability to be estimated in advance. A range of machine-learning-based prediction methods has now emerged; however, their prediction performance remains limited because the training features are insufficiently informative, and little effort has been made to integrate feature calculation resources. We therefore integrated 12 computational resources to develop a pipeline capable of automatically calculating 1,547 features. In addition, we created the feature-enriched DDGWizard dataset, comprising 15,752 data points. Furthermore, we performed feature selection and developed an accurate prediction model that achieved an R2 of 0.61 in cross-validation and outperformed several other representative prediction methods in comparisons on independent datasets. Together, the feature calculation pipeline, the DDGWizard dataset, and the prediction model constitute the DDGWizard system, freely available for ΔΔG analysis and prediction.

Author summary

A protein’s ability to maintain its structure at high temperatures, known as thermostability, is critical for many industrial and therapeutic applications and can be affected by genetic mutations. To address this challenge, we built a robust machine learning model to predict the impact of mutations on thermostability. DDGWizard integrates multiple computational tools to calculate over 1,500 features for each mutation, offering detailed insights into protein structure and stability. DDGWizard simplifies the complex analysis process and enables scientists to design more stable proteins for various applications, bridging the gap between data-rich resources and practical tools. Our model demonstrated superior performance compared to existing methods and provides a freely accessible platform for researchers and industry professionals, available at https://github.com/bioinfbrad/DDGWizard.

Introduction

Thermostability is an important property of proteins, representing their ability to resist irreversible changes in structure and chemical attributes upon elevation in temperature [1]. It strongly influences the application scope of proteins. For therapeutic proteins, such as monoclonal antibodies, insufficient thermostability can result in denaturation or reduced potency when temperature excursions occur during manufacturing, storage, and transportation [2], undermining their effectiveness. In addition, thermostability determines whether certain food proteins, such as whey proteins, can withstand thermal treatments [3], which is important in food processing to extend shelf life or create desired flavours [4]. For enzymes, specialized proteins widely used as biological catalysts, thermostability is a crucial parameter for broad applicability [5]. Because accelerating reactions, improving substrate solubility, and reducing the risk of microbial contamination require high temperatures in industrial environments, only enzymes with sufficient thermostability can operate continuously and be reused effectively [6]. However, most naturally evolved enzymes have poor thermostability [7], significantly limiting their applications.

Continuous efforts have been made to increase the thermostability of proteins [6], employing a variety of strategies. Directed evolution (DE) has been widely applied in protein engineering to increase protein thermostability [8–11]. It simulates natural selection and involves key steps such as constructing mutation libraries, introducing random mutations, and screening the target protein against specific criteria. However, a major drawback of DE is its high demand for labor, materials, and financial resources to identify the desired protein [12]. To identify effective thermostability-increasing mutations more precisely, rational and semi-rational design strategies have been applied, which often require prior knowledge or computational methods [6]. ΔΔG, the difference in the folding free energy change between the wild-type and mutant protein, is an indicator of protein thermostability changes resulting from mutations [13]. Since accurate ΔΔG prediction enables the impacts of mutations on thermostability to be estimated in advance, it can assist in the rational design and selective introduction of mutations [14–16].

Early ΔΔG prediction methods are mainly based on empirical force fields [17], utilizing experimental parameters, classical equations, and energy evaluations to calculate ΔΔG; a classic example is FoldX [18]. With the continuous advancement of computational techniques and data science, ΔΔG prediction methods based on machine learning (ML) have emerged and are now widely adopted: among the 23 prediction methods previously reviewed, 15 are based on ML [17]. However, despite their growing number, current ML-based prediction methods still suffer from inadequate prediction performance [19–22]. One of the main reasons is that the features used to train the models are insufficiently informative [19]. ACDC-NN [23] employs a neural network and optimizes for antisymmetric properties; however, its input features consist only of encodings of the mutation type and the amino acid distribution around the mutation site, lacking the integration of direct prior knowledge [24]. mCSM [25] and DynaMut [26] introduce pharmacophore features and protein dynamics features based on normal mode analysis (NMA), but they do not consider richer protein information, such as evolutionary conservation, residue interactions, and a broader range of amino acid physicochemical properties. DUET [27] relies solely on the prediction outputs of two other methods, SDM [28] and mCSM [25], as input features. In addition, some methods, such as DDGun3D [29] and FoldX [18], rely on linear fitting, which oversimplifies the problem and may fail to represent complex protein conformational changes. Finally, the size and protein diversity of some training datasets are limited [20], which may hinder model generalization (S1 Table lists the algorithms, datasets, and feature sets of ACDC-NN, DDGun3D, mCSM, DynaMut, FoldX, SDM, and DUET).

So far, although many computational resources have been used to calculate features [25,26,30,31] or output potentially relevant features [32–35], little effort has been made to integrate these resources for the comprehensive calculation of features for ΔΔG data. Such integration could provide more diverse information, facilitating further analysis, feature selection, and ΔΔG prediction.

Here, we describe DDGWizard, a ΔΔG analysis system. It includes a feature calculation pipeline that integrates 12 computational resources [18,32–42] and can automatically calculate 1,547 features for ΔΔG data. The calculated features inform ΔΔG prediction from various perspectives, including the structure and environment of the wild-type protein, structural and environmental changes before and after mutation, mutation types, and evolutionary information. In addition, DDGWizard provides a feature-enriched dataset created with the pipeline, comprising 15,752 data points. Furthermore, it incorporates an accurate ΔΔG prediction model developed with the selected optimal features. The model achieved an R2 of 0.61 in cross-validation and outperformed several other representative prediction methods: ACDC-NN [23], DDGun3D [29], FoldX [18], DynaMut [26], DUET [27], mCSM [25], and SDM [28]. The application program, datasets, and source code for DDGWizard training and validation have been published to ensure accessibility and reproducibility.

Results

An overview of DDGWizard

DDGWizard is a comprehensive ΔΔG analysis system. It incorporates a feature calculation pipeline, provides a feature-enriched dataset, and includes an accurate ΔΔG prediction model. Its development and validation comprise five steps (as shown in Fig 1).

Fig 1. An overview of DDGWizard.

A: Integrate 12 computational resources [18,32–42] to develop a feature calculation pipeline. B: Collect data from the VariBench [43] database, enrich the collected data using the feature calculation pipeline to obtain the DDGWizard dataset, and then split it into training and test sets for subsequent ML tasks. C: Perform feature selection based on the RFE (recursive feature elimination) algorithm, followed by further analysis of feature importance. D: Develop a ΔΔG prediction model using the XGBoost [44] algorithm based on the optimal features. E: Evaluate the developed model and compare it with other representative prediction methods using the identical cross-validation sets, the test set, the S669 dataset [45], and the p53 dataset [25].

https://doi.org/10.1371/journal.pcbi.1013783.g001

DDGWizard feature calculation pipeline

The feature calculation pipeline was developed by integrating 12 computational resources [18,32–42] (see Table 1) to obtain structural, environmental, and evolutionary information for proteins and associated mutation types. It requires raw data as input, including basic information on the PDB ID [46] (e.g., 2OCJ for the p53 protein [25]), amino acid substitution (e.g., K6Q for lysine-to-glutamine at position 6), chain identifier (e.g., “A”), pH, temperature (in °C), and ΔΔG value. The computational resources are called to calculate the features, and users can then access the feature-enriched data, which include 1,547 features in total (Fig 2).

Fig 2. The feature calculation pipeline of DDGWizard.

The pipeline requires the input of raw data (PDB ID, amino acid substitution, chain ID, pH, temperature, and ΔΔG value). It uses the PDB ID to download the wild-type protein structure file from the RCSB PDB database [46], employs Modeller [72] to construct the mutant protein structure file, and calls a series of computational resources [18,32–42] to calculate features, ultimately outputting the dataset containing the 1,547 calculated features.

https://doi.org/10.1371/journal.pcbi.1013783.g002

Table 1. The computational resources used for feature calculation.

https://doi.org/10.1371/journal.pcbi.1013783.t001

The description of the calculated features and the corresponding computational resources is provided below.

Structural and environmental information of the wild-type protein. The first feature group incorporates structural information on the wild-type protein, covering the proportions of different amino acids and amino acid categories (uncharged polar, positively charged polar, negatively charged polar, nonpolar, aromatic, aliphatic, heterocyclic, and sulfur-containing) calculated with Biopython [37]; buried/exposed amino acids and different secondary structures (3₁₀-helix, α-helix, π-helix, turn, extended β-sheet, β-bridge, bend, and other/loop) obtained from DSSP [39]; disordered regions predicted by DisEMBL [32]; different residue interactions (hydrogen bonds, disulfide bridges, ionic interactions, van der Waals forces, cation–π, and π–π stacking) output by Ring [33]; different atomic pharmacophores [25] (hydrophobic, positive, negative, hydrogen acceptor, hydrogen donor, aromatic, sulphur, and neutral) calculated with RDKit [40]; and hydrophobic clusters analyzed by Protlego [35]. To account for the varying effects of residues and protein conformations at different distances from the mutation site, structural information is divided into four spatial regions: within 7 Å of the mutation site, within 10 Å, within 13 Å, and across the entire protein structure.
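The 7/10/13 Å neighbourhood grouping described above can be illustrated with a minimal distance filter (a simplified sketch with hypothetical residue coordinates; the actual pipeline operates on full PDB structures):

```python
import math

def residues_within(mut_coord, residue_coords, radius):
    """Return residue IDs whose representative atom lies within `radius`
    angstroms of the mutation site. A stand-in for the spatial grouping
    used when computing region-specific structural features."""
    return [res_id for res_id, coord in residue_coords.items()
            if math.dist(mut_coord, coord) <= radius]

# Hypothetical coordinates of three residues' C-alpha atoms.
coords = {"ALA10": (0.0, 0.0, 0.0),
          "GLY20": (5.0, 0.0, 0.0),
          "LYS30": (12.0, 0.0, 0.0)}
near_7 = residues_within((0.0, 0.0, 0.0), coords, 7.0)   # residues within 7 A
near_13 = residues_within((0.0, 0.0, 0.0), coords, 13.0)  # residues within 13 A
```

Per-region feature proportions (amino acid categories, secondary structures, interactions) would then be computed over each returned residue subset.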

Next, various properties of the wild-type amino acids are included: RSA (relative solvent accessibility) calculated by DSSP [39], atomic fluctuation information based on NMA (normal mode analysis) [50] computed by Bio3D [38], B-factors (temperature factors) predicted by Profbval [34], and the physicochemical properties recorded in the AAindex database [36].

Finally, the energy information of the wild-type protein is incorporated from FoldX [18]. In total, the first group includes 724 features.

Structural and environmental changes between mutant and wild-type proteins. The second group contains 647 features describing the changes in structure and environment between mutant and wild-type proteins. First, the features are calculated for the mutant protein in a similar manner as for the wild-type protein, using the computational resources described above. The differences in feature values between the mutant and wild-type proteins then constitute this feature group. Because some structural-proportion features, such as the proportions of disordered regions and of buried/exposed amino acids, do not change significantly after single-point mutations, these features were not included.

Types of mutations. The third group includes 146 features describing the mutation types. Various encodings are incorporated to represent information on amino acid substitutions, such as substitution encodings for changes in amino acid type, secondary structure, and residue interactions at the mutated position. Values from the amino acid substitution matrices in the AAindex database [36] are also included to describe mutation types. Finally, the prediction results of SIFT [42], which reflect the impact of amino acid substitutions on proteins, are encoded to represent the mutation types as well.

Evolutionary information. The fourth group includes 26 features describing evolutionary information. These features are statistics from the PSSM (position-specific scoring matrix) generated by the protein sequence alignment and homology search tool PSI-BLAST [41]. The PSSM scores at the mutation site and surrounding sites of both the wild-type and mutant proteins are included, together with the difference in PSSM scores at the mutation site between the mutant and wild-type proteins and the difference in the average PSSM scores surrounding the mutation site between the mutant and wild-type proteins.
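The site-level PSSM features can be sketched as follows (the PSSM row is a hypothetical stand-in for PSI-BLAST output; the feature names match those reported later in the Results):

```python
def pssm_site_features(pssm_row, wt_aa, mut_aa):
    """Compute evolutionary features at the mutation site.

    `pssm_row` maps each amino acid letter to its PSSM score at the site.
    """
    wt_score = pssm_row[wt_aa]
    return {
        "wt_PSSM_score": wt_score,                       # conservation of the wild-type residue
        "diff_pssm_score": pssm_row[mut_aa] - wt_score,  # mutant minus wild-type score
    }

# Hypothetical PSSM scores at one site: lysine is favoured, glutamine is not.
row = {"K": 5, "Q": -1}
feats = pssm_site_features(row, "K", "Q")
```

A strongly negative `diff_pssm_score` indicates the mutation deviates from the evolutionarily preferred residue at that position.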

Feature-enriched DDGWizard dataset

Fig 3 demonstrates the workflow of dataset construction and feature enrichment. We chose the VariBench database [43] as the data source. VariBench curates previously validated mutation datasets, including ΔΔG datasets. A total of 20 raw datasets were collected (see S2 Table) that met the requirement of including five pieces of basic mutation information (PDB ID, amino acid substitution, chain identifier, pH, and temperature) and experimental ΔΔG values. To maximize data utility, we merged these 20 datasets using the following rules:

  • For data with the same basic information and the same ΔΔG value, we retained only one instance.
  • For data with the same basic information but different ΔΔG values, we selected the instance with the ΔΔG value closest to 0 (according to a previous report [73], current ΔΔG data trend toward 0, so values closer to 0 may be more reliable).
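The two merging rules can be sketched with pandas (column names here are hypothetical, not the datasets' actual headers):

```python
import pandas as pd

# Hypothetical records: the same basic mutation information may appear
# with identical or conflicting ddG values across the 20 source datasets.
df = pd.DataFrame({
    "pdb_id":       ["2OCJ", "2OCJ", "2OCJ", "1BNI"],
    "substitution": ["K6Q",  "K6Q",  "K6Q",  "A32G"],
    "chain":        ["A",    "A",    "A",    "A"],
    "ph":           [7.0,    7.0,    7.0,    6.3],
    "temp_c":       [25.0,   25.0,   25.0,   25.0],
    "ddg":          [1.2,    1.2,    -0.4,   0.8],
})

key = ["pdb_id", "substitution", "chain", "ph", "temp_c"]

# Rule 1: identical basic information and identical ddG -> keep one instance.
df = df.drop_duplicates(subset=key + ["ddg"])

# Rule 2: identical basic information but different ddG -> keep the value
# closest to 0.
df = (df.assign(abs_ddg=df["ddg"].abs())
        .sort_values("abs_ddg")
        .drop_duplicates(subset=key, keep="first")
        .drop(columns="abs_ddg")
        .reset_index(drop=True))
```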
Fig 3. The workflow of dataset construction and feature enrichment.

https://doi.org/10.1371/journal.pcbi.1013783.g003

After merging, we obtained 7,876 unique mutation data points from 222 different proteins. Considering that the hypothetical reverse mutation theory has been adopted by many studies [19,23,29,74], both in the testing [45,75,76] and development [77–79] of prediction methods, we performed data augmentation by adding the hypothetical reverse mutations, eventually obtaining 15,752 data points.
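The augmentation step assumes the reverse mutation (applied to the mutant protein) has a ΔΔG of opposite sign to the direct mutation; a minimal sketch with hypothetical record fields:

```python
def add_reverse_mutations(records):
    """Augment direct mutation records with hypothetical reverse mutations.

    For a direct mutation wt->mut with measured ddG, the reverse mutation
    mut->wt is assumed to have a ddG of opposite sign. The record fields
    here are illustrative, not DDGWizard's actual schema.
    """
    augmented = []
    for rec in records:
        augmented.append(rec)
        # Parse a substitution string such as "K6Q" into (wt, position, mut).
        wt, pos, mut = rec["sub"][0], rec["sub"][1:-1], rec["sub"][-1]
        augmented.append({**rec, "sub": f"{mut}{pos}{wt}", "ddg": -rec["ddg"]})
    return augmented

direct = [{"pdb_id": "2OCJ", "sub": "K6Q", "ddg": 1.3}]
both = add_reverse_mutations(direct)  # one direct + one reverse record
```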

We applied the developed feature calculation pipeline to the obtained data, enriching the number of features per data point from 5 to 1,547. Fig 4 shows the distribution of the feature-enriched data and highlights the similarity of the direct and reverse mutation data, with an MMD² [80] of 0.0006. This suggests that the reverse mutation data can approximately serve as an equivalent augmentation of the dataset [81].
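MMD² (squared maximum mean discrepancy) measures the distance between two feature distributions; a value near 0 means the two samples are hard to distinguish. A minimal RBF-kernel estimate (a simple biased sketch, not necessarily the paper's exact estimator or kernel choice):

```python
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    """Squared maximum mean discrepancy between samples X and Y
    under an RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # sample from one distribution
Y = rng.normal(size=(100, 3))           # same distribution -> MMD^2 near 0
Z = rng.normal(loc=3.0, size=(100, 3))  # shifted distribution -> larger MMD^2
```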

Fig 4. t-SNE plot for both direct and reverse mutation data.

The t-SNE plot shows the distribution of direct and reverse mutation data, projected from the high-dimensional feature space into two dimensions. The blue points represent direct mutation data, while the red points represent reverse mutation data. MMD² quantifies the difference in feature distributions between the two types of data.

https://doi.org/10.1371/journal.pcbi.1013783.g004

The new dataset was named the “DDGWizard” dataset. It is a non-redundant collection of 15,752 unique mutation ΔΔG data points from 222 proteins, with integrated comprehensive feature information covering measurement conditions, the structure and environment of the wild-type protein, structural and environmental changes between mutant and wild-type proteins, mutation types, and evolutionary information. This makes it a valuable resource for feature selection, development of ML models, and further analytical studies.

Next, we split the dataset for ML tasks. Each pair of direct and hypothetical reverse mutation data in the DDGWizard dataset was treated as a single unit, and all pairs were randomly shuffled using a seed of 42. The first 90% of the data was selected as the training set, comprising 14,178 mutations (7,089 pairs of direct and reverse mutations) from 219 different proteins. The remaining 10% was selected as the test set, comprising 1,574 mutations (787 pairs of direct and reverse mutations) from 134 different proteins.
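The pair-level split can be sketched as follows, shuffling pairs with the reported seed of 42 so that each direct/reverse pair stays on one side of the split (the pair representation is illustrative):

```python
import random

def pair_level_split(pairs, train_frac=0.9, seed=42):
    """Split (direct, reverse) pairs into train/test sets, keeping the two
    members of each pair together. `pairs` is a list of 2-tuples."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)          # deterministic shuffle
    cut = int(len(pairs) * train_frac)
    train = [m for pair in pairs[:cut] for m in pair]
    test = [m for pair in pairs[cut:] for m in pair]
    return train, test

# 100 hypothetical pairs: "d{i}" direct mutation, "r{i}" its reverse.
pairs = [(f"d{i}", f"r{i}") for i in range(100)]
train, test = pair_level_split(pairs)
```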

Optimal feature set

To identify the most effective features, feature selection was carried out. We first trained a model with the XGBoost algorithm [44] using all 1,547 features as a baseline. Twenty-fold pair-level cross-validation (which ensures that the direct and reverse mutation data of each pair remain together in either the training set or the validation set) was used to evaluate training performance. Fig 5 shows the performance of the model before feature selection, with an average R2 of 0.55 and a standard deviation of 0.06.
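Pair-level cross-validation can be implemented with scikit-learn's GroupKFold by giving both members of a direct/reverse pair the same group ID (a sketch on synthetic data; the real matrix holds the 1,547 calculated features):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_pairs = 200
X = rng.normal(size=(2 * n_pairs, 5))      # synthetic feature matrix
y = rng.normal(size=2 * n_pairs)           # synthetic ddG values
groups = np.repeat(np.arange(n_pairs), 2)  # one group ID per direct/reverse pair

cv = GroupKFold(n_splits=20)
for train_idx, val_idx in cv.split(X, y, groups):
    # GroupKFold guarantees no pair straddles the train/validation boundary.
    assert not set(groups[train_idx]) & set(groups[val_idx])
```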

Fig 5. R2 of each fold from cross-validation before and after feature selection.

https://doi.org/10.1371/journal.pcbi.1013783.g005

Next, the RFE algorithm was employed to select features; it iteratively removes the least important feature and outputs the evaluation metric in each round (the flowchart of RFE is shown in Fig 6A). Fig 6B shows the changes in average R2 across the RFE rounds. The RFE curve remained relatively stable, with only minor fluctuations, during the elimination of the first 1,397 features. Once the features were reduced to fewer than 150, prediction performance began to improve. At round 1,478, with the features reduced to 69, prediction performance peaked at an average R2 of 0.58. Fig 5 compares the model’s performance before and after feature selection. The average R2 increased by 0.03 when the model was trained with the selected 69 features. In addition, the standard deviation of R2 decreased from 0.06 to 0.05.
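The RFE loop can be sketched with scikit-learn; here a gradient-boosting regressor stands in for XGBoost, and the synthetic matrix is far smaller than the real 1,547-feature dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFE

# Synthetic stand-in for the feature-enriched training matrix.
X, y = make_regression(n_samples=300, n_features=40, n_informative=8,
                       random_state=0)

# RFE drops the least important feature each round (step=1), here down to
# 10 survivors; DDGWizard runs the same procedure down to 69 features.
selector = RFE(GradientBoostingRegressor(n_estimators=50, random_state=0),
               n_features_to_select=10, step=1)
selector.fit(X, y)
selected = np.flatnonzero(selector.support_)  # indices of surviving features
```

In the paper's setup, each elimination round is scored by the average R2 of the 20-fold pair-level cross-validation, and the round with the peak score determines the final feature set.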

Fig 6. Feature selection and feature importance ranking.

A: The flowchart of feature selection based on the RFE algorithm. B: The RFE results reflect the changes in the average R2 of the 20-fold pair-level cross-validation as the number of RFE rounds increases and the number of features decreases. C: The top 10 most important features among the 69 features.

https://doi.org/10.1371/journal.pcbi.1013783.g006

The optimal 69 features are listed in S3 Table, including evolutionary features, energy terms, changes in amino acid physicochemical properties, RSA (relative solvent accessibility) at the mutation site, temperature, and distributions of amino acid categories, secondary structures, residue interactions, atomic pharmacophores, disordered regions, and hydrophobic clusters. These features were used for further analysis and model development.

Moreover, we used the XGBoost algorithm to output the feature importance (the top 10 most important features are shown in Table 2 and Fig 6C, respectively). The most important feature is “diff_pssm_score”, which represents the difference in PSSM scores at the mutation site between the mutant and wild-type proteins. In addition, two other evolutionary features, “diff_pssm_score_aver” (the change in the average PSSM value of the sequence surrounding the mutation site) and “wt_PSSM_score” (the PSSM value at the mutation site in the wild-type protein), are also among the top 10 important features. Since the PSSM score provides a quantitative measure of the degree of conservation of amino acids at a specific site [82], the difference in the PSSM scores between mutant and wild-type amino acids reflects how well the mutation aligns with the preferred amino acid at the site. Larger differences indicate a greater deviation from the most favorable amino acid at that position. Such deviations may affect the function or structure of the protein, as conservation at these positions often suggests that they are essential to maintaining its integrity [83]. The second most important feature is “diff_foldx_total_energy”, which represents the difference in overall energy, calculated by FoldX [18], between the mutant and wild-type proteins. This shows that empirical force field methods like FoldX can effectively assist ML methods in ΔΔG prediction. It is worth mentioning that four features reflecting changes in physicochemical properties derived from the AAindex are ranked among the top 10. Among them, the feature “diff_aaindex_p_values_of_mesophilic_proteins_based_b_values” reflects the statistical significance of changes in protein thermostability for mesophilic proteins based on the distributions of b values [84]; the other three features reflect changes in parameters associated with different secondary structures at the mutation site [85–87].

Model development and evaluation

The XGBoost algorithm was chosen to train the prediction model of DDGWizard. Table 3 presents the results of model selection, comparing the performance of 11 machine learning (ML) algorithms: AdaBoost [88], decision tree [89], KNN [90], Lasso regression [91], LightGBM [92], linear regression [93], MLP [94], random forest [95], Gaussian process [96], support vector regression [97], and XGBoost [44]. Traditional ML algorithms were evaluated with their default hyperparameters, while the tuning of the MLP hyperparameters is summarized in S4 Table. Among these algorithms, XGBoost achieved the highest average R2 of 0.55 under the same 20-fold pair-level cross-validation.

Table 3. Average R2 for the model selection under the same 20-fold pair-level cross-validation.

https://doi.org/10.1371/journal.pcbi.1013783.t003

We then trained the model using the optimal 69 features with the XGBoost algorithm. Bayesian optimization [98] was employed to tune the model’s hyperparameters, with the average R2 during the 20-fold pair-level cross-validation as the optimization target (specific parameter ranges and tuning results can be found in S5 Table). After Bayesian optimization, the average R2 of the model training improved from 0.58 to 0.61.

Fig 7A shows the prediction results during cross-validation, while Fig 7B shows the comparison between average predicted and experimental values within 10 bins containing equal amounts of data [99–101]. The distribution of the 10 comparison points around y = x indicates the model’s good calibration and strong reliability.
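The equal-count binning behind this calibration check can be sketched as follows (synthetic values stand in for the cross-validation predictions):

```python
import numpy as np

def binned_calibration(y_true, y_pred, n_bins=10):
    """Average predicted vs. experimental values within equal-count bins,
    with bin membership determined by sorting the predictions."""
    order = np.argsort(y_pred)
    bins = np.array_split(order, n_bins)  # ~equal-count bins
    return [(float(y_pred[b].mean()), float(y_true[b].mean())) for b in bins]

rng = np.random.default_rng(1)
y_true = rng.normal(size=200)                          # synthetic "experimental" ddG
y_pred = y_true + rng.normal(scale=0.3, size=200)      # synthetic predictions
points = binned_calibration(y_true, y_pred)            # 10 (pred_mean, true_mean) pairs
```

A well-calibrated model places these points close to the y = x line.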

Fig 7. Prediction results of DDGWizard’s model from the cross-validation.

A: The scatter plot visualizes the comparison between all predicted and true values. The red line indicates the overall regression fit. The plot also provides the regression equation, R2, γ, and σ values for the overall prediction. B: The binned scatter plot compares average predicted and experimental values within 10 bins that have equivalent data amounts. The error bars represent the standard error of the residuals between the average predicted and true values within each bin. C: The scatter plot visualizes the prediction results from the 20-fold cross-validation on the 3,970 mutation ΔΔG data points where the PSSM score of the mutant amino acid is less than 0. D: The scatter plot visualizes the prediction results from the 20-fold protein-level cross-validation on the 30 proteins with mutual sequence similarity less than 30%.

https://doi.org/10.1371/journal.pcbi.1013783.g007

To assess the robustness of our model on low-conservation residue data, we conducted 20-fold cross-validation using data where the PSSM score of the mutant amino acid was less than 0 (a total of 3,970 data points), representing relatively low conservation of the mutant amino acid [82]. Fig 7C shows the test results, and our model achieved an average R2 of 0.51 under the same optimal features and hyperparameters as used before.

To test our model’s performance on proteins with low mutual sequence similarity (<30%), we selected 30 proteins (PDB IDs: 1BNI, 1W3D, 1VQB, 1STN, 3SSI, 1RX4, 2LZM, 1RTB, 1LZ1, 2CI2, 1FKJ, 1DIV, 2ABD, 1UZC, 3MBP, 1FTG, 1RN1, 1ARR, 1TEN, 1AMQ, 2RN2, 1YYJ, 1APS, 5PTI, 1HZ6, 1SAK, 1OTR, 1PIN, 5AZU, 1TTG) with at least 50 mutations in our dataset. We then performed 20-fold protein-level cross-validation [31]. As shown in Fig 7D, our model achieved an average R2 of 0.42.

To evaluate the impact of including reverse mutation data on model performance, we conducted a comparison study (Table 4). We first performed 20-fold cross-validation with direct mutation data for both the training and validation sets, which yielded an average R2 of 0.58 (Table 4, row 1). Next, we added the corresponding reverse mutation data to the training sets while keeping the validation sets unchanged; the average R2 remained 0.58 (Table 4, row 2). This indicates that adding reverse mutation data to the training set does not significantly affect prediction performance on direct mutations under different data splits. In the third experiment, we used direct and reverse mutation data for both training and validation sets and conducted 20-fold pair-level cross-validation, obtaining an average R2 of 0.61 (Table 4, row 3). The final experiment used direct mutation data for the training sets and both direct and reverse mutation data for the validation sets; the average R2 dropped to 0.26 (Table 4, row 4). This suggests that including reverse mutation data in the training set effectively improves prediction performance on reverse mutations and therefore enhances the model’s generalization ability.

Table 4. Comparison study on the inclusion of reverse mutation data.

https://doi.org/10.1371/journal.pcbi.1013783.t004

The model with the highest R2 (0.73) on its validation set from the 20-fold pair-level cross-validation was selected as DDGWizard’s prediction model. For new predictions, users provide basic mutation information (PDB ID, amino acid substitution, chain identifier, pH, and temperature), and the feature calculation pipeline automatically calculates the optimal 69 feature values to input into the prediction model. The model then outputs the predicted ΔΔG values.

Comparisons

To compare the performance of DDGWizard’s prediction model against others, seven representative methods were chosen: ACDC-NN [23], DDGun3D [29], FoldX [18], DynaMut [26], DUET [27], mCSM [25], and SDM [28]. S1 Table provides information on the algorithms, datasets, and feature sets used by these methods. We conducted four comparisons using different datasets: the identical cross-validation sets, the test set, the S669 dataset [45], and the p53 dataset [25]. All test datasets underwent data augmentation, enabling evaluation of each method’s performance on all data, direct mutation data, and reverse mutation data.

Comparison with the cross-validation sets.

To initially compare DDGWizard’s prediction model with other prediction methods, we first selected two representative methods: ACDC-NN [23] and DDGun3D [29], ranked as the top two methods in a previous study [45]. We used ACDC-NN and DDGun3D to predict the identical pair-level cross-validation sets that DDGWizard used and compared their prediction performance with DDGWizard’s model. Table 5 and Fig 8 present the comparison results, showing that DDGWizard’s model significantly outperforms ACDC-NN and DDGun3D, achieving r_all, r_dir, and r_rev values of 0.79, 0.76, and 0.72 (r_all, r_dir, and r_rev represent the Pearson correlation coefficients between the predicted and true values for all data, direct mutation data, and reverse mutation data, respectively). Statistical significance was confirmed by z_all and p_all (significance metrics for correlation coefficient comparison derived from Steiger’s Z-test [102,103]), with z_all exceeding 50 and p_all less than 0.001. All three prediction methods were constructed with consideration of the hypothetical reverse mutation theory, and the effectiveness of this consideration is reflected in the models’ antisymmetric property [23]. The values of r_dir-rev (the Pearson correlation coefficient between the predicted values of direct mutation data and reverse mutation data) for the three methods are close to the ideal value of –1, and the values of ⟨δ⟩ (the average of the sums of the predicted values for each pair of direct and reverse mutation data) are similarly close to the ideal value of 0.
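The antisymmetry diagnostics discussed above, namely the correlation between predictions on direct and reverse mutations (ideal −1) and the average sum of each prediction pair (ideal 0), can be computed as follows:

```python
import numpy as np

def antisymmetry_metrics(pred_direct, pred_reverse):
    """Return (r, delta): the Pearson correlation between direct and
    reverse predictions, and the mean of the pairwise prediction sums.
    An ideal antisymmetric predictor gives r = -1 and delta = 0."""
    pred_direct = np.asarray(pred_direct, dtype=float)
    pred_reverse = np.asarray(pred_reverse, dtype=float)
    r = float(np.corrcoef(pred_direct, pred_reverse)[0, 1])
    delta = float((pred_direct + pred_reverse).mean())
    return r, delta

# A perfectly antisymmetric predictor: reverse prediction = -direct prediction.
d = np.array([1.0, -0.5, 2.0, 0.3])
r, delta = antisymmetry_metrics(d, -d)
```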

Fig 8. Pearson correlation coefficients of three prediction methods evaluated with the identical cross-validation sets.

https://doi.org/10.1371/journal.pcbi.1013783.g008

Table 5. Comparison results of three prediction methods evaluated with the identical cross-validation sets.

https://doi.org/10.1371/journal.pcbi.1013783.t005

We also compared the three prediction methods using identical cross-validation sets on low-conservation residue data and low-similarity proteins. DDGWizard’s model again outperformed ACDC-NN and DDGun3D, achieving γ_all values of 0.64 and 0.72 on the two evaluations, respectively (see S6 Table and S7 Table).

Comparison with the test set.

To further compare performance between DDGWizard’s prediction model and other prediction methods, we selected five additional representative methods, FoldX [18], DynaMut [26], DUET [27], mCSM [25], and SDM [28], to predict the test set. Table 6 and Fig 9 present the test results of the eight prediction methods. As shown, DDGWizard’s model achieved the best prediction performance on all data (γ_all of 0.68), direct mutation data (γ_dir of 0.66), and reverse mutation data (γ_rev of 0.63). Its performance advantage is also statistically significant, as all p_all values from comparisons with the other methods were less than 0.001. In terms of the antisymmetric property [23], DDGWizard’s model, ACDC-NN, and DDGun3D significantly outperformed the other methods.

Fig 9. Pearson correlation coefficients of eight prediction methods evaluated with the test set.

https://doi.org/10.1371/journal.pcbi.1013783.g009

Table 6. Comparison results of eight prediction methods evaluated with the test set.

https://doi.org/10.1371/journal.pcbi.1013783.t006

Comparison with the S669 dataset.

Table 7 and Fig 10 present the test results on the widely used [48,104,105] S669 dataset [45] for the eight prediction methods, including DDGWizard’s model. Since 43 mutation data points from S669 were included in our training set, we excluded these data and retrained [77–79] DDGWizard’s model using the same features and hyperparameters as before for comparison. In the evaluation on S669, DDGWizard’s model, ACDC-NN, and DDGun3D remained the top-performing prediction methods. Our model achieved the highest γ_all of 0.63, and ACDC-NN exhibited the best antisymmetric performance with γ_dir-rev of –0.98.

Fig 10. Pearson correlation coefficients of eight prediction methods evaluated with the dataset S669.

https://doi.org/10.1371/journal.pcbi.1013783.g010

Table 7. Comparison results of eight prediction methods evaluated with the dataset S669.

https://doi.org/10.1371/journal.pcbi.1013783.t007

Comparison with the p53 dataset.

Table 8 and Fig 11 present the test results on the p53 dataset [25] for the eight prediction methods, including DDGWizard’s model. As four data points from the p53 dataset were included in DDGWizard’s training data, we excluded these data and retrained [23,31,56] DDGWizard’s model using the same features and hyperparameters as before for comparison. Based on the ranking of γ_all, DDGWizard’s model outperformed the other methods (γ_all = 0.79).

Fig 11. Pearson correlation coefficients of eight prediction methods evaluated with the p53 dataset.

https://doi.org/10.1371/journal.pcbi.1013783.g011

Table 8. Comparison results of eight prediction methods evaluated with the p53 dataset.

https://doi.org/10.1371/journal.pcbi.1013783.t008

Accessibility and reproducibility

We developed DDGWizard as a freely available system for ΔΔG analysis and prediction. Users can access the DDGWizard application at https://github.com/bioinfbrad/DDGWizard. The feature calculation pipeline takes raw data as input and outputs new data with 1,574 features. DDGWizard’s prediction model takes basic mutation information as input and returns predicted ΔΔG values. Both the feature calculation pipeline and the prediction model support parallel processing to handle large-scale data. To further assist users in predicting ΔΔG, the program also provides tools for saturation mutagenesis and full-site mutagenesis prediction. Detailed usage instructions can be found at https://ddgwizard.readthedocs.io/en/latest/. The DDGWizard dataset, the source code for model training and validation, and the evaluation and comparison data are released at https://zenodo.org/records/14512134.

Discussion

Thermostability has a significant impact on the broad applications of proteins. Continuous efforts have been made to increase protein thermostability through various strategies, such as rational or semi-rational design. Since ΔΔG prediction can estimate the impact of mutations on thermostability in advance, it has become a powerful tool for rational and semi-rational design. Although a range of ΔΔG prediction methods have been developed, especially those based on ML, they still suffer from inadequate prediction performance, mainly because the features used for training models are insufficiently informative. In fact, many computational resources are available to calculate features for ΔΔG prediction; however, little work has integrated these resources for comprehensive calculation. Such integration could provide more diverse feature information, facilitating further analysis, feature selection, and prediction.

In this study, we integrated 12 computational resources [18,32–42] to develop a pipeline that aids users in feature enrichment for their own datasets. It automatically outputs 1,574 calculated features covering diverse information, such as the structures and environments of wild-type proteins, structural and environmental changes between mutant and wild-type proteins, mutation types, and evolutionary information. Furthermore, we collected data and applied our pipeline to create the feature-enriched DDGWizard dataset, comprising 15,752 data points, which serves as a valuable resource for ΔΔG research.

In addition, to identify more effective features for ΔΔG prediction, we carried out feature selection based on RFE (recursive feature elimination). During this process, the RFE curve first remained stable over a long range and then began to rise. At the peak, 69 features were selected as the optimal subset, resulting in a more accurate and robust model with improved R² and a decreased standard deviation. This can be attributed to the elimination of redundant features, allowing the model to focus on more informative ones [106]. Similar RFE patterns can be observed in previous studies [107–110]. According to the importance ranking of the optimal features, the difference in PSSM scores at the mutation site between mutant and wild-type proteins was the most important feature. This may be because changes in the PSSM score at the mutation site reflect how well the mutation matches the preferred amino acid at that position. Larger differences indicate greater deviation, which may affect the protein’s function or structure, since conserved positions are often critical for maintaining structural integrity. In addition, we found that the energy terms derived from FoldX and changes in physicochemical properties related to certain secondary structures are also important for prediction.

Finally, using the optimal features, we developed an accurate new ΔΔG prediction model. It outperformed ACDC-NN [23], DDGun3D [29], FoldX [18], DynaMut [26], DUET [27], mCSM [25], and SDM [28]. ACDC-NN employs a convolutional neural network and optimizes for antisymmetric properties. However, its input features include only encodings of the mutation type and the amino acid distribution around the mutation site, lacking prior knowledge–based features [24]. This limits the model’s interpretability and may increase the risk of overfitting [111]. In contrast, the features used in our model contribute more directly to ΔΔG owing to their knowledge-based design. Moreover, ACDC-NN’s training set, the S2648 dataset [112] (also employed by DynaMut [26], mCSM [25], and DUET [27]), contains 132 source proteins that are entirely covered by the 219 proteins in our training set. As a result, it has been trained on a narrower range of proteins than our model, which could limit its generalization performance. DDGun3D uses four features to represent differences in conservation, hydrophobicity, sequence interaction energy, and structural interaction energy between mutant and wild-type amino acids, and it fits ΔΔG values through a linear combination. While this approach is intuitive, a linear combination may be insufficient to capture the complex nonlinear relationships between features and ΔΔG. A similar limitation is observed in FoldX [18], which computes rich and complex conformational energy terms but only performs simple linear weighting of these terms. mCSM and DynaMut introduce pharmacophore features and protein dynamics features based on NMA (normal mode analysis) [50], using Gaussian processes and random forest algorithms, respectively, to train their models. However, neither method trains with amino acid conservation features [66,82], which our results identified as important. In addition, they did not adopt the XGBoost algorithm, which demonstrated better performance in our model selection than the algorithms used in their models. Furthermore, they do not consider hypothetical reverse mutations [19], which prevents their models from learning the patterns of reverse mutations and results in relatively low γ_dir-rev (the Pearson correlation between predictions for direct and reverse mutations). SDM [28] is a statistical potential function based on an environment-specific amino acid substitution table. While statistical approaches are valuable for understanding data distributions, their reliance on prior assumptions about those distributions might lead to prediction biases on new data [113]. DUET [27] is a consensus predictor that uses the outputs of SDM and mCSM as features and applies an SVM for training. Across the comparisons, DUET performed slightly better than SDM and mCSM individually, indicating the effectiveness of consensus prediction. However, its accuracy is still significantly lower than that of the top-performing methods.

Our current work focuses on integrating features from 12 computational resources [18,32–42] based on expert knowledge, identifying the optimal subset from the 1,574 integrated features using RFE, and developing an accurate prediction model with XGBoost. While XGBoost is a powerful tool that is effective for structured data and provides strong interpretability [114], its limitation is that it cannot perform complex transformations of input features to automatically learn new feature representations and contextual patterns in the data [115–117]. We aim to address this problem in future work by exploring the incorporation of deep learning (DL) to further improve model accuracy. DL allows the automated extraction of abstract representations [118] from data and often achieves better performance on large-scale datasets, despite its limited interpretability. DL-based representations of sequence conservation, such as the output embeddings of pre-trained protein language models (PLMs) [119], could be introduced. DL algorithms such as GNNs [120] or CNNs [121] could be utilized to extract deep-learned representations from the distribution of amino acids, secondary structures, and amino acid interactions. We aim to integrate the current RFE-selected features with deep-learned representations to develop hybrid models for further improving performance.

Overall, the analysis and prediction system, DDGWizard, consists of an integrated feature calculation pipeline, a feature-enriched dataset, and an accurate prediction model. The system is freely available, and the source code for its training and validation procedures has been published to ensure accessibility and reproducibility.

Materials and methods

Development of feature calculation pipeline

The feature calculation pipeline was developed in the Python programming language (v3.10.12). It reads raw data (PDB ID, amino acid substitution, chain identifier, pH, and temperature) as input. It then downloads the structural files of the wild-type proteins from the RCSB PDB database [46] according to the provided PDB ID using the requests library (v2.31.0) and utilizes the homology modeling software Modeller (v10.4) [72] to generate the mutant protein structures with the wild-type protein structure as the template. Next, a series of computational resources [18,32–42] is called to calculate the feature values, and the results are finally saved in CSV format. Detailed descriptions of the usage of each computational resource in the pipeline are provided in S8 Table.
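As a minimal illustration of the download step, the sketch below fetches a wild-type structure file from RCSB PDB by its ID. The URL pattern is RCSB’s public file-download service; the function names are illustrative, and the standard library’s urllib stands in for the requests library used by the pipeline itself.

```python
from pathlib import Path
from urllib.request import urlopen

def pdb_url(pdb_id: str) -> str:
    """Build the RCSB download URL for a PDB ID, e.g. '2OCJ'."""
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"

def fetch_pdb(pdb_id: str, out_dir: str = ".") -> Path:
    """Download the wild-type structure file and save it locally."""
    with urlopen(pdb_url(pdb_id), timeout=30) as resp:  # network call
        data = resp.read()
    path = Path(out_dir) / f"{pdb_id.upper()}.pdb"
    path.write_bytes(data)
    return path
```

The downloaded file can then be passed to Modeller as the template for building the mutant structure.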

Data sources

In this study, three data sources were used:

VariBench. VariBench [43] is a benchmark database that includes mutation datasets, such as ΔΔG datasets, and follows seven principles (relevance; representativeness; non-redundancy; experimentally verified cases; positive and negative cases; scalability; reusability) to improve the quality of the collected datasets. Twenty datasets from the VariBench database were selected, which were further merged, filtered, and split to obtain the training set and test set used for the ML tasks.

S669. The dataset S669 [45] contains 669 mutation data points from 87 different proteins. It is a high-quality benchmark dataset and has been used in several studies [77–79] for independent tests.

p53. The dataset p53 [25] contains 42 mutation data points of the tumor suppressor protein p53 (PDB ID: 2OCJ). Since the p53 dataset is widely used for comparing and testing ΔΔG prediction methods [27,28,77,79], it was also adopted in this study for testing and comparison purposes.

Data augmentation based on hypothetical reverse mutation theory

The changes in thermostability (ΔΔG) caused by protein mutations are represented by the difference in protein folding free energy (ΔG) between mutant and wild-type proteins. As ΔG is a thermodynamic state function [122], its difference should be reversible. Namely, at the same position in the protein, the ΔΔG for a mutation from amino acid A to amino acid B should be equal to the negative of the ΔΔG for the hypothetical reverse mutation from amino acid B to amino acid A [19] (as shown in Eq 1). This is known as the hypothetical reverse mutation theory.

$\Delta\Delta G_{A \to B} = -\Delta\Delta G_{B \to A}$ (1)

This theory has been widely applied in many studies [19,23,29,74], both in the testing [45,75,76] and development [77–79] of ΔΔG prediction methods. According to this theory, a robust prediction method should perform well not only in predicting direct mutations but also in predicting hypothetical reverse mutations [19]. In the test set, a hypothetical reverse mutation data point can be generated from each direct mutation data point. This type of data augmentation for the test set allows comprehensive evaluation of prediction methods by additionally predicting reverse mutation data. In addition to being used in testing, this theory should also be applied in the construction of prediction methods. Previous studies [45] have shown that incorporating this theory can effectively improve prediction performance on hypothetical reverse mutation data and allow methods to learn the antisymmetric property [23] of ΔΔG. In contrast, prediction methods that did not consider this theory achieved much poorer performance [45,75,76]. For prediction methods based on ML, the hypothetical reverse mutation theory can be used to generate reverse mutation data in the training set for data augmentation [74,77–79].
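A minimal sketch of this augmentation, assuming a simple record layout (the field names are illustrative, not DDGWizard’s actual schema): each direct mutation A→B yields a hypothetical reverse mutation B→A with a negated ΔΔG label, following Eq 1.

```python
# Toy direct-mutation records; 'mutation' uses the usual wild-type /
# position / mutant notation (e.g. F27A: Phe at position 27 -> Ala).
direct = [
    {"pdb_id": "1CSP", "mutation": "F27A", "ddg": -3.0},
    {"pdb_id": "1CSP", "mutation": "E3R",  "ddg":  1.2},
]

def reverse_mutation(rec):
    """Build the hypothetical reverse mutation: swap the wild-type and
    mutant residues and negate the ddG label (Eq 1)."""
    m = rec["mutation"]                   # e.g. "F27A"
    flipped = m[-1] + m[1:-1] + m[0]      # -> "A27F"
    return {**rec, "mutation": flipped, "ddg": -rec["ddg"]}

augmented = direct + [reverse_mutation(r) for r in direct]
```

By construction, the ΔΔG labels of each direct/reverse pair sum to zero, which is the antisymmetric property the augmented training set teaches the model.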

Pair-level cross-validation

The prediction methods [74,77–79] that utilized the hypothetical reverse mutation theory to augment their training sets all employed mutation-level cross-validation (randomly shuffling all mutation data during cross-validation [31]). However, a pair of real data and its hypothetical reverse mutation data are correlated; if they are randomly shuffled during cross-validation, a real data instance and its augmented counterpart might end up in the training and validation sets, respectively. The validation set would then not be entirely unseen, so the training and validation sets would not be independently separated. A previous study [123] suggested that, when conducting cross-validation after data augmentation, if training and validation data are not independently separated, data leakage might occur and overly optimistic performance could result. To address this issue, we employed pair-level cross-validation, which splits datasets treating a pair of real data and its augmented data as one unit. This ensures that each data pair appears entirely in the training set or entirely in the validation set, preventing the potential issue of unfair validation.
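One way to implement such a split — a sketch, not DDGWizard’s exact code — is scikit-learn’s GroupKFold with one group id per direct/reverse pair, which guarantees both members of a pair land in the same fold (the toy values below are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(16, dtype=float).reshape(8, 2)   # 8 rows = 4 mutation pairs
ddg = np.array([1.2, -1.2, 0.5, -0.5, -2.0, 2.0, 0.8, -0.8])
pair_id = np.array([0, 0, 1, 1, 2, 2, 3, 3])   # one id per direct/reverse pair

# GroupKFold never splits a group across train and validation,
# so a pair can never leak between the two sets.
folds = list(GroupKFold(n_splits=4).split(X, ddg, groups=pair_id))
leak_free = all(set(pair_id[tr]).isdisjoint(pair_id[va]) for tr, va in folds)
```

The same grouped splitter can be reused for the 20-fold pair-level cross-validation described above by setting n_splits accordingly.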

Feature selection

Feature selection is implemented using the RFE (recursive feature elimination) algorithm. RFE can effectively eliminate redundant features and identify the optimal feature subset to improve model prediction performance, making it a widely used technique in various ML tasks [124–126]. RFE relies on feature importance: it iteratively trains the model, evaluates its prediction performance, calculates feature importance, and removes the least important feature in each round, ultimately selecting the subset of features that contributes most to the model’s prediction performance. In this study, RFE was implemented based on the RFECV function [127] from the sklearn.feature_selection library [128]. The ML algorithm XGBoost was used to train the models during RFE rounds and output feature importance. The average R² of 20-fold pair-level cross-validation was employed as the metric to evaluate model performance for each RFE round. To be more specific, RFE performed the following three iterative steps (denoting the feature set at each round as X, which is initially set to include all candidate features):

  1. Train the XGBoost model using the feature set X, perform cross-validation, calculate the average R², and record the result.
  2. Use the feature importance output by the XGBoost model to rank the features in descending order. Remove the lowest-ranked feature from X and record the remaining features.
  3. Repeat steps 1 and 2 until all features have been removed from X.

After completing RFE, the remaining features corresponding to the round with the highest average R² are selected as the optimal features, finalizing the feature selection process.
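The loop above can be sketched with RFECV on a toy stand-in for the 1,574-feature table. GradientBoostingRegressor substitutes here for XGBoost to keep the sketch dependency-free; xgboost.XGBRegressor exposes the same feature_importances_ attribute and can be dropped in directly. The paper scores each round with the average R² of 20-fold pair-level cross-validation; plain 5-fold CV stands in below.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFECV

# Synthetic data standing in for the feature-enriched dataset.
X, y = make_regression(n_samples=150, n_features=12, n_informative=4,
                       noise=0.1, random_state=0)

# RFECV removes one feature per round (step=1) and keeps the subset
# with the best cross-validated R^2, mirroring steps 1-3 above.
rfe = RFECV(estimator=GradientBoostingRegressor(n_estimators=50, random_state=0),
            step=1, cv=5, scoring="r2")
rfe.fit(X, y)
selected = np.flatnonzero(rfe.support_)   # indices of the retained features
```

rfe.n_features_ gives the size of the optimal subset (69 in the study), and rfe.support_ marks which columns survive.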

Model development

The prediction model of DDGWizard was developed using the XGBoost [44] algorithm. XGBoost is a powerful ML method [44] based on gradient-boosted trees. It incorporates both L1 and L2 regularization penalty terms to control model complexity and reduce overfitting [129], while its post-split pruning strategy [130] further prevents unnecessary tree growth. The inclusion of L1 regularization also enables a more reliable estimation of feature importance [131], making it well suited for integration with RFE-based feature selection. The implementation of XGBoost was achieved using the ML library scikit-learn (v1.3.1). The model’s hyperparameters were determined through Bayesian optimization [98], a sequential design strategy for global optimization of black-box functions that is suitable for hyperparameter tuning in ML models. In this study, Bayesian optimization used the average R² from the 20-fold pair-level cross-validation on dataset S7089 as the optimization target and was implemented using the bayesian-optimization library (v1.4.3).
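A hedged sketch of the tuning objective follows: hyperparameters in, average R² from grouped (pair-level) cross-validation out. GradientBoostingRegressor stands in for XGBRegressor to avoid extra dependencies, the data and names are illustrative, and the bayesian-optimization library’s BayesianOptimization(f=objective, pbounds=...).maximize() call is what would repeatedly evaluate this function in the actual tuning loop.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

# Toy data with one group id per hypothetical direct/reverse pair.
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)
pair_id = np.repeat(np.arange(100), 2)

def objective(learning_rate, max_depth, n_estimators):
    """The quantity Bayesian optimization maximizes: mean R^2 over
    grouped (pair-level) cross-validation folds."""
    model = GradientBoostingRegressor(learning_rate=learning_rate,
                                      max_depth=int(max_depth),
                                      n_estimators=int(n_estimators),
                                      random_state=0)
    cv = GroupKFold(n_splits=5)   # the paper uses 20 folds
    return cross_val_score(model, X, y, groups=pair_id,
                           cv=cv, scoring="r2").mean()

score = objective(learning_rate=0.1, max_depth=3, n_estimators=100)
```

Casting max_depth and n_estimators to int inside the objective is the usual trick, since Bayesian optimization proposes continuous values.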

Evaluation metrics

The MMD (maximum mean discrepancy) test [80] was conducted to evaluate the feature distribution difference between the direct and reverse mutation data. It is a widely used [132–134] method that quantifies the difference between two probability distributions in high-dimensional space. The metric MMD² [80,135] was employed, and its formula is given by Eq 2 (where P and Q represent two distributions; samples x and y are drawn from distributions P and Q, with sizes m and n; k represents the RBF kernel function [80], implemented via the pairwise_kernels function of the sklearn.metrics library [128]):

$\mathrm{MMD}^2(P,Q) = \dfrac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m} k(x_i,x_j) + \dfrac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(y_i,y_j) - \dfrac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} k(x_i,y_j)$ (2)
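The estimator in Eq 2 can be written directly with the pairwise_kernels function mentioned above; the sketch below computes the biased estimate, keeping all three double sums complete:

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels

def mmd2(X, Y, gamma=1.0):
    """Squared maximum mean discrepancy (Eq 2) between samples
    X (m x d) and Y (n x d) under an RBF kernel with bandwidth gamma."""
    k_xx = pairwise_kernels(X, X, metric="rbf", gamma=gamma)
    k_yy = pairwise_kernels(Y, Y, metric="rbf", gamma=gamma)
    k_xy = pairwise_kernels(X, Y, metric="rbf", gamma=gamma)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()
```

Identical samples give an MMD² of zero, while samples drawn from shifted distributions give a positive value.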

During cross-validation for feature selection and model development, the coefficient of determination (R²) between real and predicted ΔΔG values is used as the evaluation metric. Its formula is given by Eq 3 (where n is the total number of data points; $\hat{y}_i$ and $y_i$ represent the predicted and real ΔΔG values for the i-th data point; $\bar{y}$ represents the mean of the real values):

$R^2 = 1 - \dfrac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$ (3)

In comparisons of prediction methods, a total of eight evaluation metrics, which were used in previous studies [23,45,75,79], were employed:

  • Pearson correlation coefficient between the predicted and true ΔΔG values for all data, direct mutation data, and reverse mutation data (denoted γ_all, γ_dir, and γ_rev, respectively).
  • Root mean square error between the predicted and true ΔΔG values for all data, direct mutation data, and reverse mutation data (denoted σ_all, σ_dir, and σ_rev, respectively).
  • Pearson correlation coefficient between the predicted values of the direct mutation data and reverse mutation data (denoted γ_dir-rev).
  • The average of the sums of the predicted values for each pair of direct and reverse mutation data (denoted ⟨δ⟩).

The formula for calculating the Pearson correlation coefficient (γ) is given by Eq 4 (where n is the total number of data points; $\hat{y}_i$ and $y_i$ represent the predicted and real ΔΔG values for the i-th data point; $\bar{\hat{y}}$ and $\bar{y}$ represent the means of the predicted and real values):

$\gamma = \dfrac{\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$ (4)

The formula for calculating the root mean square error (σ) is given by Eq 5 (where n is the total number of data points; $\hat{y}_i$ and $y_i$ represent the predicted and real ΔΔG values for the i-th data point):

$\sigma = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$ (5)

The formula for calculating the Pearson correlation coefficient between the predicted values of the direct mutation data and the reverse mutation data (γ_dir-rev) is given by Eq 6 (where n is the total number of pairs of direct and reverse mutation data; $\hat{y}_i^{d}$ and $\hat{y}_i^{r}$ represent the predicted values for the i-th pair of direct and reverse mutation data, respectively; $\bar{\hat{y}}^{d}$ and $\bar{\hat{y}}^{r}$ represent the means of all predicted values for the direct mutation data and reverse mutation data):

$\gamma_{dir\text{-}rev} = \dfrac{\sum_{i=1}^{n}(\hat{y}_i^{d} - \bar{\hat{y}}^{d})(\hat{y}_i^{r} - \bar{\hat{y}}^{r})}{\sqrt{\sum_{i=1}^{n}(\hat{y}_i^{d} - \bar{\hat{y}}^{d})^2}\,\sqrt{\sum_{i=1}^{n}(\hat{y}_i^{r} - \bar{\hat{y}}^{r})^2}}$ (6)

The formula for calculating the average of the sums of the predicted values for each pair of direct and reverse mutation data (⟨δ⟩) is given by Eq 7 (where n is the total number of pairs of direct and reverse mutations; $\hat{y}_i^{d}$ and $\hat{y}_i^{r}$ respectively represent the predicted values for the i-th pair of direct and reverse mutation data):

$\langle\delta\rangle = \dfrac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i^{d} + \hat{y}_i^{r}\right)$ (7)
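On toy paired predictions (illustrative numbers only), Eqs 4–7 reduce to a few NumPy lines:

```python
import numpy as np

pred_dir = np.array([1.0, -0.5, 2.0, 0.3])    # predictions, direct mutations
pred_rev = np.array([-0.9, 0.6, -2.1, -0.2])  # predictions, reverse mutations
true_dir = np.array([1.1, -0.4, 1.8, 0.5])    # experimental ddG, direct

gamma_dir = np.corrcoef(pred_dir, true_dir)[0, 1]         # Eq 4
sigma_dir = np.sqrt(np.mean((pred_dir - true_dir) ** 2))  # Eq 5
gamma_dir_rev = np.corrcoef(pred_dir, pred_rev)[0, 1]     # Eq 6
delta = np.mean(pred_dir + pred_rev)                      # Eq 7
```

An antisymmetric predictor drives gamma_dir_rev toward −1 and delta toward 0, the ideal values discussed in the comparisons above.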

The γ_all is the metric used to rank the compared methods. Steiger’s Z-test [102] was employed to evaluate the statistical significance of the differences in γ_all between DDGWizard’s model and the other methods. It is a method for determining whether two correlation coefficients associated with the same target variable are statistically significantly different [136]. The test was implemented using the online server of Cocor [103]. The inputs included the γ_all of DDGWizard’s model, the γ_all of the compared method, the Pearson correlation coefficient between the predicted values of DDGWizard’s model and the compared method, and the number of data points in the test set. The output included a Z-score (z_all) and a p-value (p_all). The z_all quantifies the statistical significance of the difference in γ_all between DDGWizard’s model and the compared method, where a larger absolute value indicates a more significant difference. The p_all (ranging from 0 to 1) represents the probability of obtaining the current statistical result or a more extreme one under the null hypothesis [137] that there is no difference in γ_all between DDGWizard’s model and the compared method.
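For readers without access to the Cocor server, Steiger’s statistic for two dependent correlations sharing the true ΔΔG values can be sketched as below (following Steiger, 1980). Treat this as an approximation of what the server computes; Cocor remains the reference implementation.

```python
import math

def steiger_z(r_a, r_b, r_ab, n):
    """Steiger's Z for dependent correlations r_a = corr(y, pred_A) and
    r_b = corr(y, pred_B) sharing y, where r_ab = corr(pred_A, pred_B)
    and n is the number of data points."""
    z_a, z_b = math.atanh(r_a), math.atanh(r_b)   # Fisher z-transforms
    rbar2 = ((r_a + r_b) / 2.0) ** 2
    psi = r_ab * (1 - 2 * rbar2) - 0.5 * rbar2 * (1 - 2 * rbar2 - r_ab ** 2)
    c = psi / (1 - rbar2) ** 2                    # covariance of z_a, z_b
    return (z_a - z_b) * math.sqrt((n - 3) / (2 - 2 * c))

def p_two_sided(z):
    """Two-sided p-value of z under the standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))
```

Equal correlations give Z = 0 (p = 1), and the statistic grows with both the correlation gap and the sample size, matching the large z_all values reported above.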

Supporting information

S1 Table. List of algorithms, training datasets and feature sets used in representative prediction methods.

https://doi.org/10.1371/journal.pcbi.1013783.s001

(PDF)

S2 Table. List of collected datasets.

The list of 20 datasets that were collected from the VariBench database and merged.

https://doi.org/10.1371/journal.pcbi.1013783.s002

(PDF)

S3 Table. List of the remaining 69 features from feature selection based on the RFE algorithm.

https://doi.org/10.1371/journal.pcbi.1013783.s003

(PDF)

S4 Table. Performance comparison of different MLP hyperparameters with identical 20-fold pair-level cross-validation.

https://doi.org/10.1371/journal.pcbi.1013783.s004

(PDF)

S5 Table. Used hyperparameters for Bayesian optimization.

Hyperparameter ranges used for 100 rounds of Bayesian optimization.

https://doi.org/10.1371/journal.pcbi.1013783.s005

(PDF)

S6 Table. Comparison results of three prediction methods evaluated with the identical cross-validation sets on the low-conservation residue data.

https://doi.org/10.1371/journal.pcbi.1013783.s006

(PDF)

S7 Table. Comparison results of three prediction methods evaluated with the identical protein-level cross-validation sets.

https://doi.org/10.1371/journal.pcbi.1013783.s007

(PDF)

S8 Table. Detailed usage of computational resources.

https://doi.org/10.1371/journal.pcbi.1013783.s008

(PDF)

Acknowledgments

The authors acknowledge the use of the University of Bradford High Performance Computing Service in the completion of this work.

References

  1. 1. Kumwenda B, Litthauer D, Bishop OT, Reva O. Analysis of protein thermostability enhancing factors in industrially important thermus bacteria species. Evol Bioinform Online. 2013;9:327–42. pmid:24023508
  2. 2. Jiang B, Jain A, Lu Y, Hoag SW. Probing thermal stability of proteins with temperature scanning viscometer. Mol Pharm. 2019;16(8):3687–93. pmid:31306023
  3. 3. De Wit JN. Thermal stability and functionality of whey proteins. Journal of Dairy Science. 1990;73(12):3602–12.
  4. 4. Yousefi N, Abbasi S. Food proteins: solubility & thermal stability improvement techniques. Food Chemistry Advances. 2022;1:100090.
  5. 5. Xu Z, Cen Y-K, Zou S-P, Xue Y-P, Zheng Y-G. Recent advances in the improvement of enzyme thermostability by structure modification. Crit Rev Biotechnol. 2020;40(1):83–98. pmid:31690132
  6. 6. Wu H, Chen Q, Zhang W, Mu W. Overview of strategies for developing high thermostability industrial enzymes: discovery, mechanism, modification and challenges. Crit Rev Food Sci Nutr. 2023;63(14):2057–73. pmid:34445912
  7. 7. Nezhad NG, Rahman RNZRA, Normi YM, Oslan SN, Shariff FM, Leow TC. Thermostability engineering of industrial enzymes through structure modification. Appl Microbiol Biotechnol. 2022;106(13–16):4845–66. pmid:35804158
  8. 8. Minagawa H, Yoshida Y, Kenmochi N, Furuichi M, Shimada J, Kaneko H. Improving the thermal stability of lactate oxidase by directed evolution. Cell Mol Life Sci. 2007;64(1):77–81. pmid:17131051
  9. 9. Li G, Zhang H, Sun Z, Liu X, Reetz MT. Multiparameter optimization in directed evolution: engineering thermostability, enantioselectivity, and activity of an epoxide hydrolase. ACS Catal. 2016;6(6):3679–87.
  10. 10. Zhang Z-G, Yi Z-L, Pei X-Q, Wu Z-L. Improving the thermostability of Geobacillus stearothermophilus xylanase XT6 by directed evolution and site-directed mutagenesis. Bioresour Technol. 2010;101(23):9272–8. pmid:20691586
  11. 11. Chen C, Su L, Xu F, Xia Y, Wu J. Improved thermostability of maltooligosyltrehalose synthase from arthrobacter ramosus by directed evolution and site-directed mutagenesis. J Agric Food Chem. 2019;67(19):5587–95. pmid:31016980
  12. 12. Xiong W, Liu B, Shen Y, Jing K, Savage TR. Protein engineering design from directed evolution to de novo synthesis. Biochemical Engineering Journal. 2021;174:108096.
  13. 13. Chen C-W, Lin M-H, Chang H-P, Chu Y-W. Improvement of protein stability prediction by integrated computational approach. In: Proceedings of the 2020 10th International Conference on Bioscience, Biochemistry and Bioinformatics. 2020. p. 8–13. https://doi.org/10.1145/3386052.3386065
  14. 14. Zhao Y, Li D, Bai X, Luo M, Feng Y, Zhao Y, et al. Improved thermostability of proteinase K and recognizing the synergistic effect of Rosetta and FoldX approaches. Protein Eng Des Sel. 2021;34:gzab024. pmid:34671809
  15. 15. Go S-R, Lee S-J, Ahn W-C, Park K-H, Woo E-J. Enhancing the thermostability and activity of glycosyltransferase UGT76G1 via computational design. Commun Chem. 2023;6(1):265. pmid:38057441
  16. 16. Bi J, Chen S, Zhao X, Nie Y, Xu Y. Computation-aided engineering of starch-debranching pullulanase from Bacillus thermoleovorans for enhanced thermostability. Appl Microbiol Biotechnol. 2020;104(17):7551–62. pmid:32632476
  17. 17. Marabotti A, Scafuri B, Facchiano A. Predicting the stability of mutant proteins by computational approaches: an overview. Brief Bioinform. 2021;22(3):bbaa074. pmid:32496523
  18. 18. Guerois R, Nielsen JE, Serrano L. Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol. 2002;320(2):369–87. pmid:12079393
  19. 19. Fang J. A critical review of five machine learning-based algorithms for predicting protein stability changes upon mutation. Brief Bioinform. 2020;21(4):1285–92. pmid:31273374
  20. 20. Geng C, Xue LC, Roel-Touris J, Bonvin AMJJ. Finding the ΔΔG spot: Are predictors of binding affinity changes upon mutations in protein–protein interactions ready for it?. WIREs Comput Mol Sci. 2019;9(5).
  21. 21. Khan S, Vihinen M. Performance of protein stability predictors. Hum Mutat. 2010;31(6):675–84. pmid:20232415
  22. 22. Marabotti A, Del Prete E, Scafuri B, Facchiano A. Performance of web tools for predicting changes in protein stability caused by mutations. BMC Bioinformatics. 2021;22(Suppl 7):345. pmid:34225665
  23. 23. Benevenuta S, Pancotti C, Fariselli P, Birolo G, Sanavia T. An antisymmetric neural network to predict free energy changes in protein variants. J Phys D: Appl Phys. 2021;54(24):245403.
  24. 24. Xu H, Chen Y, Zhang D. Worth of prior knowledge for enhancing deep learning. Nexus. 2024;1(1):100003.
  25. 25. Pires DEV, Ascher DB, Blundell TL. mCSM: predicting the effects of mutations in proteins using graph-based signatures. Bioinformatics. 2014;30(3):335–42. pmid:24281696
  26. Rodrigues CH, Pires DE, Ascher DB. DynaMut: predicting the impact of mutations on protein conformation, flexibility and stability. Nucleic Acids Res. 2018;46(W1):W350–5. pmid:29718330
  27. Pires DEV, Ascher DB, Blundell TL. DUET: a server for predicting effects of mutations on protein stability using an integrated computational approach. Nucleic Acids Res. 2014;42(Web Server issue):W314–9. pmid:24829462
  28. Pandurangan AP, Ochoa-Montaño B, Ascher DB, Blundell TL. SDM: a server for predicting effects of mutations on protein stability. Nucleic Acids Res. 2017;45(W1):W229–35. pmid:28525590
  29. Montanucci L, Capriotti E, Birolo G, Benevenuta S, Pancotti C, Lal D, et al. DDGun: an untrained predictor of protein stability changes upon amino acid variants. Nucleic Acids Res. 2022;50(W1):W222–7. pmid:35524565
  30. Berliner N, Teyra J, Colak R, Garcia Lopez S, Kim PM. Combining structural modeling with ensemble machine learning to accurately predict protein fold stability and binding affinity effects upon mutation. PLoS One. 2014;9(9):e107353. pmid:25243403
  31. Quan L, Lv Q, Zhang Y. STRUM: structure-based prediction of protein stability changes upon single-point mutation. Bioinformatics. 2016;32(19):2936–46. pmid:27318206
  32. Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB. Protein disorder prediction: implications for structural proteomics. Structure. 2003;11(11):1453–9. pmid:14604535
  33. Clementel D, Del Conte A, Monzon AM, Camagni GF, Minervini G, Piovesan D, et al. RING 3.0: fast generation of probabilistic residue interaction networks from structural ensembles. Nucleic Acids Res. 2022;50(W1):W651–6. pmid:35554554
  34. Schlessinger A, Yachdav G, Rost B. PROFbval: predict flexible and rigid residues in proteins. Bioinformatics. 2006;22(7):891–3. pmid:16455751
  35. Ferruz N, Noske J, Höcker B. Protlego: a Python package for the analysis and design of chimeric proteins. Bioinformatics. 2021;37(19):3182–9. pmid:33901273
  36. Kawashima S, Kanehisa M. AAindex: amino acid index database. Nucleic Acids Res. 2000;28(1):374. pmid:10592278
  37. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–3. pmid:19304878
  38. Grant BJ, Skjaerven L, Yao X-Q. The Bio3D packages for structural bioinformatics. Protein Sci. 2021;30(1):20–30. pmid:32734663
  39. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22(12):2577–637. pmid:6667333
  40. Landrum G. RDKit: a software suite for cheminformatics, computational chemistry, and predictive modeling; 2013.
  41. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402. pmid:9254694
  42. Ng PC, Henikoff S. Predicting deleterious amino acid substitutions. Genome Res. 2001;11(5):863–74. pmid:11337480
  43. Shirvanizadeh N, Vihinen M. VariBench, new variation benchmark categories and data sets. Front Bioinform. 2023;3:1248732. pmid:37795169
  44. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. p. 785–94.
  45. Pancotti C, Benevenuta S, Birolo G, Alberini V, Repetto V, Sanavia T, et al. Predicting protein stability changes upon single-point mutation: a thorough comparison of the available tools on a new dataset. Brief Bioinform. 2022;23(2):bbab555. pmid:35021190
  46. Ogino S, Gulley ML, den Dunnen JT, Wilson RB, Association for Molecular Pathology Training and Education Committee. Standard mutation nomenclature in molecular diagnostics: practical and educational challenges. J Mol Diagn. 2007;9(1):1–6. pmid:17251329
  47. Yang Y, Urolagin S, Niroula A, Ding X, Shen B, Vihinen M. PON-tstab: protein variant stability predictor. Importance of training data quality. Int J Mol Sci. 2018;19(4):1009. pmid:29597263
  48. Zhou Y, Pan Q, Pires DEV, Rodrigues CHM, Ascher DB. DDMut: predicting effects of mutations on protein stability using deep learning. Nucleic Acids Res. 2023;51(W1):W122–8. pmid:37283042
  49. Panja AS, Bandopadhyay B, Maiti S. Protein thermostability is owing to their preferences to non-polar smaller volume amino acids, variations in residual physico-chemical properties and more salt-bridges. PLoS One. 2015;10(7):e0131495. pmid:26177372
  50. Wako H, Endo S. Normal mode analysis as a method to derive protein dynamics information from the Protein Data Bank. Biophys Rev. 2017;9(6):877–93. pmid:29103094
  51. Mamonova TB, Glyakina AV, Galzitskaya OV, Kurnikova MG. Stability and rigidity/flexibility: two sides of the same coin? Biochim Biophys Acta. 2013;1834(5):854–66. pmid:23416444
  52. Chu H-L, Chen T-H, Wu C-Y, Yang Y-C, Tseng S-H, Cheng T-M, et al. Thermal stability and folding kinetics analysis of disordered protein, securin. J Therm Anal Calorim. 2014;115(3):2171–8.
  53. Ji Y-Y, Li Y-Q. The role of secondary structure in protein structure selection. Eur Phys J E Soft Matter. 2010;32(1):103–7. pmid:20524028
  54. Marsh JA. Buried and accessible surface area control intrinsic protein flexibility. J Mol Biol. 2013;425(17):3250–63. pmid:23811058
  55. Chen Y, Lu H, Zhang N, Zhu Z, Wang S, Li M. PremPS: predicting the impact of missense mutations on protein stability. PLoS Comput Biol. 2020;16(12):e1008543. pmid:33378330
  56. Giollo M, Martin AJ, Walsh I, Ferrari C, Tosatto SC. NeEMO: a method using residue interaction networks to improve prediction of protein stability upon mutation. BMC Genomics. 2014;15:1–11.
  57. Huang A, Chen Z, Wu X, Yan W, Lu F, Liu F. Improving the thermal stability and catalytic activity of ulvan lyase by the combination of FoldX and KnowVolution campaign. Int J Biol Macromol. 2024;257(Pt 1):128577. pmid:38070809
  58. Mahase V, Sobitan A, Rhoades R, Zhang F, Baranova A, Johnson M, et al. Genetic variations affecting ACE2 protein stability in minority populations. Front Med (Lausanne). 2022;9:1002187. pmid:36388927
  59. Sobitan A, Edwards W, Jalal MS, Kolawole A, Ullah H, Duttaroy A, et al. Prediction of the effects of missense mutations on human myeloperoxidase protein stability using in silico saturation mutagenesis. Genes (Basel). 2022;13(8):1412. pmid:36011324
  60. Tian J, Wu N, Chu X, Fan Y. Predicting changes in protein thermostability brought about by single- or multi-site mutations. BMC Bioinformatics. 2010;11:370. pmid:20598148
  61. Aggarwal R, Koes DR. PharmRL: pharmacophore elucidation with deep geometric reinforcement learning. BMC Biol. 2024;22(1):301. pmid:39736736
  62. Wilkinson HC, Dalby PA. Fine-tuning the activity and stability of an evolved enzyme active-site through noncanonical amino-acids. FEBS J. 2021;288(6):1935–55. pmid:32897608
  63. Tang H, Shi K, Shi C, Aihara H, Zhang J, Du G. Enhancing subtilisin thermostability through a modified normalized B-factor analysis and loop-grafting strategy. J Biol Chem. 2019;294(48):18398–407. pmid:31615894
  64. Camilloni C, Bonetti D, Morrone A, Giri R, Dobson CM, Brunori M, et al. Towards a structural biology of the hydrophobic effect in protein folding. Sci Rep. 2016;6:28285. pmid:27461719
  65. Pace CN, Fu H, Fryar KL, Landua J, Trevino SR, Shirley BA, et al. Contribution of hydrophobic interactions to protein stability. J Mol Biol. 2011;408(3):514–28. pmid:21377472
  66. Ahmad S, Sarai A. PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics. 2005;6:33. pmid:15720719
  67. Studer RA, Dessailly BH, Orengo CA. Residue mutations and their impact on protein structure and function: detecting beneficial and pathogenic changes. Biochem J. 2013;449(3):581–94. pmid:23301657
  68. Cao H, Wang J, He L, Qi Y, Zhang JZ. DeepDDG: predicting the stability change of protein point mutations using neural networks. J Chem Inf Model. 2019;59(4):1508–14. pmid:30759982
  69. Li Y, Fang J. PROTS-RF: a robust model for predicting mutation-induced protein stability changes. PLoS One. 2012;7(10):e47247. pmid:23077576
  70. Scandurra R, Consalvi V, Chiaraluce R, Politi L, Engel PC. Protein thermostability in extremophiles. Biochimie. 1998;80(11):933–41. pmid:9893953
  71. DePristo MA, Weinreich DM, Hartl DL. Missense meanderings in sequence space: a biophysical view of protein evolution. Nat Rev Genet. 2005;6(9):678–87. pmid:16074985
  72. Webb B, Sali A. Comparative protein structure modeling using MODELLER. Curr Protoc Bioinformatics. 2016;54:5.6.1–5.6.37. pmid:27322406
  73. Kebabci N, Timucin AC, Timucin E. Toward compilation of balanced protein stability data sets: flattening the ΔΔG curve through systematic enrichment. J Chem Inf Model. 2022;62(5):1345–55. pmid:35201762
  74. Capriotti E, Fariselli P, Rossi I, Casadio R. A three-state prediction of single point mutations on protein stability changes. BMC Bioinformatics. 2008;9(Suppl 2):S6. pmid:18387208
  75. Pucci F, Bernaerts KV, Kwasigroch JM, Rooman M. Quantification of biases in predictions of protein stability changes upon mutations. Bioinformatics. 2018;34(21):3659–65. pmid:29718106
  76. Thiltgen G, Goldstein RA. Assessing predictors of changes in protein stability upon mutation using self-consistency. PLoS One. 2012;7(10):e46084. pmid:23144695
  77. Fariselli P, Martelli PL, Savojardo C, Casadio R. INPS: predicting the impact of non-synonymous variations on protein stability from sequence. Bioinformatics. 2015;31(17):2816–21. pmid:25957347
  78. Rodrigues CHM, Pires DEV, Ascher DB. DynaMut2: assessing changes in stability and flexibility upon single and multiple point missense mutations. Protein Sci. 2021;30(1):60–9. pmid:32881105
  79. Li B, Yang YT, Capra JA, Gerstein MB. Predicting changes in protein thermodynamic stability upon point mutation with deep 3D convolutional neural networks. PLoS Comput Biol. 2020;16(11):e1008291. pmid:33253214
  80. Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, Smola A. A kernel two-sample test. The Journal of Machine Learning Research. 2012;13(1):723–73.
  81. Volkova S. An overview on data augmentation for machine learning. In: International Scientific and Practical Conference Digital and Information Technologies in Economics and Management. 2023. p. 143–54.
  82. Mohammadi A, Zahiri J, Mohammadi S, Khodarahmi M, Arab SS. PSSMCOOL: a comprehensive R package for generating evolutionary-based descriptors of protein sequences from PSSM profiles. Biol Methods Protoc. 2022;7(1):bpac008. pmid:35388370
  83. Capra JA, Singh M. Predicting functionally important residues from sequence conservation. Bioinformatics. 2007;23(15):1875–82. pmid:17519246
  84. Parthasarathy S, Murthy MR. Protein thermal stability: insights from atomic displacement parameters (B values). Protein Eng. 2000;13(1):9–13. pmid:10679524
  85. Robson B, Suzuki E. Conformational properties of amino acid residues in globular proteins. J Mol Biol. 1976;107(3):327–56. pmid:1003471
  86. Muñoz V, Serrano L. Intrinsic secondary structure propensities of the amino acids, using statistical phi-psi matrices: comparison with experimental scales. Proteins. 1994;20(4):301–11. pmid:7731949
  87. Qian N, Sejnowski TJ. Predicting the secondary structure of globular proteins using neural network models. J Mol Biol. 1988;202(4):865–84. pmid:3172241
  88. Cao Y, Miao Q-G, Liu J-C, Gao L. Advance and prospects of AdaBoost algorithm. Acta Automatica Sinica. 2013;39(6):745–58.
  89. de Ville B. Decision trees. WIREs Computational Stats. 2013;5(6):448–55.
  90. Larose DT, Larose CD. K-nearest neighbor algorithm. Wiley Data and Cybersecurity; 2014.
  91. Ranstam J, Cook JA. LASSO regression. British Journal of Surgery. 2018;105(10):1348.
  92. Yan J, Xu Y, Cheng Q, Jiang S, Wang Q, Xiao Y, et al. LightGBM: accelerated genomically designed crop breeding through ensemble learning. Genome Biol. 2021;22(1):271. pmid:34544450
  93. Su X, Yan X, Tsai C. Linear regression. WIREs Computational Stats. 2012;4(3):275–94.
  94. Popescu MC, Balas VE, Perescu-Popescu L, Mastorakis N. Multilayer perceptron and neural networks. WSEAS Transactions on Circuits and Systems. 2009;8(7):579–88.
  95. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32.
  96. Schulz E, Speekenbrink M, Krause A. A tutorial on Gaussian process regression: modelling, exploring, and exploiting functions. Journal of Mathematical Psychology. 2018;85:1–16.
  97. Sabzekar M, Hasheminejad SMH. Robust regression using support vector regressions. Chaos, Solitons & Fractals. 2021;144:110738.
  98. Snoek J, Larochelle H, Adams RP. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems. 2012;25.
  99. Witz G, van Nimwegen E, Julou T. Initiation of chromosome replication controls both division and replication cycles in E. coli through a double-adder mechanism. Elife. 2019;8:e48063. pmid:31710292
  100. Esters L, Rutgersson A, Nilsson E, Sahlée E. Non-local impacts on eddy-covariance air–lake CO2 fluxes. Boundary-Layer Meteorol. 2020;178(2):283–300.
  101. Starr E, Goldfarb B. Binned scatterplots: a simple tool to make research easier and better. Strategic Management Journal. 2020;41(12):2261–74.
  102. Steiger JH. Tests for comparing elements of a correlation matrix. Psychological Bulletin. 1980;87(2):245–51.
  103. Diedenhofen B, Musch J. cocor: a comprehensive solution for the statistical comparison of correlations. PLoS One. 2015;10(3):e0121945. pmid:25835001
  104. Umerenkov D, Nikolaev F, Shashkova TI, Strashnov PV, Sindeeva M, Shevtsov A, et al. PROSTATA: a framework for protein stability assessment using transformers. Bioinformatics. 2023;39(11):btad671. pmid:37935419
  105. Mishra SK. PSP-GNM: predicting protein stability changes upon point mutations with a Gaussian network model. Int J Mol Sci. 2022;23(18):10711. pmid:36142614
  106. Kumar V, Minz S. Feature selection. SmartCR. 2014;4(3):211–29.
  107. Moshrefi A, Tawfik HH, Elsayed MY, Nabki F. Industrial fault detection employing meta ensemble model based on contact sensor ultrasonic signal. Sensors (Basel). 2024;24(7):2297. pmid:38610508
  108. Wang J, Zhao J, Hua C, Zhang J. Constructing real-time meteorological forecast method of short-term cyanobacteria bloom area index changes in the Lake Taihu. Sustainability. 2025;17(18):8376.
  109. Khan Rifat MdA, Kabir A, Huq A. An explainable machine learning approach to traffic accident fatality prediction. Procedia Computer Science. 2024;246:1905–14.
  110. Khaleghi Ardabili A, Rice S, Bonavia AS. Diagnosing sepsis through proteomic insights: findings from a prospective ICU cohort. medRxiv [Preprint]. 2025.
  111. Xu H, Chen Y, Zhang D. Worth of prior knowledge for enhancing deep learning. Nexus. 2024;1(1):100003.
  112. Dehouck Y, Kwasigroch JM, Gilis D, Rooman M. PoPMuSiC 2.1: a web server for the estimation of protein stability changes upon mutation and sequence optimality. BMC Bioinformatics. 2011;12:151. pmid:21569468
  113. Lac L, Leung CK, Hu P. Computational frameworks integrating deep learning and statistical models in mining multimodal omics data. J Biomed Inform. 2024;152:104629. pmid:38552994
  114. Zheng J-X, Li X, Zhu J, Guan S-Y, Zhang S-X, Wang W-M. Interpretable machine learning for predicting chronic kidney disease progression risk. Digit Health. 2024;10:20552076231224225. pmid:38235416
  115. Chauhan NK, Singh K. A review on conventional machine learning vs deep learning. In: 2018 International Conference on Computing, Power and Communication Technologies (GUCON). 2018. p. 347–52.
  116. Attari V, Arroyave R. Decoding non-linearity and complexity: deep tabular learning approaches for materials science. Digital Discovery. 2025;4(10):2765–80.
  117. McCarroll N, McShane P, O’Connell E, Curran K, Singh M, McNamee E, et al. Evaluating shallow and deep learning strategies for legal text classification of clauses in non-disclosure agreements. SN Comput Sci. 2025;6(7):784.
  118. Johnston WJ, Fusi S. Abstract representations emerge naturally in neural networks trained to perform multiple tasks. Nat Commun. 2023;14(1):1040. pmid:36823136
  119. Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell. 2022;44(10):7112–27. pmid:34232869
  120. Zhang S, Tong H, Xu J, Maciejewski R. Graph convolutional networks: a comprehensive review. Comput Soc Netw. 2019;6(1):11. pmid:37915858
  121. Kulikova AV, Diaz DJ, Loy JM, Ellington AD, Wilke CO. Learning the local landscape of protein structures with convolutional neural networks. J Biol Phys. 2021;47(4):435–54. pmid:34751854
  122. Becktel WJ, Schellman JA. Protein stability curves. Biopolymers. 1987;26(11):1859–77. pmid:3689874
  123. Lee H-T, Cheon H-R, Lee S-H, Shim M, Hwang H-J. Risk of data leakage in estimating the diagnostic performance of a deep-learning-based computer-aided system for psychiatric disorders. Sci Rep. 2023;13(1):16633. pmid:37789047
  124. Duan K-B, Rajapakse JC, Wang H, Azuaje F. Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Trans Nanobioscience. 2005;4(3):228–34. pmid:16220686
  125. Li L, Cui X, Yu S, Zhang Y, Luo Z, Yang H, et al. PSSP-RFE: accurate prediction of protein structural class by recursive feature extraction from PSI-BLAST profile, physical-chemical property and functional annotations. PLoS One. 2014;9(3):e92863. pmid:24675610
  126. Liu W, Zhai J, Ding H, He X. The research of algorithm for protein subcellular localization prediction based on SVM-RFE. In: 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). 2017. p. 1–6.
  127. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Machine Learning. 2002;46(1–3):389–422.
  128. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. The Journal of Machine Learning Research. 2011;12:2825–30.
  129. Moradi R, Berangi R, Minaei B. A survey of regularization strategies for deep models. Artif Intell Rev. 2019;53(6):3947–86.
  130. Osei-Bryson KM. Post-pruning in regression tree induction: an integrated approach. Expert Systems with Applications. 2008;34(2):1481–90.
  131. Jitkrittum W, Hachiya H, Sugiyama M. Feature selection via l1-penalized squared-loss mutual information. IEICE Trans Inf Syst. 2013;96(7):1513–24.
  132. Gao H, Shao X. Two sample testing in high dimension via maximum mean discrepancy. Journal of Machine Learning Research. 2023;24(304):1–33.
  133. Shekhar S, Kim I, Ramdas A. A permutation-free kernel two-sample test. Advances in Neural Information Processing Systems. 2022;35:18168–80.
  134. Ding T, Li Z, Zhang Y. Testing the equality of distributions using integrated maximum mean discrepancy. Journal of Statistical Planning and Inference. 2025;236:106246.
  135. Borgwardt KM, Gretton A, Rasch MJ, Kriegel H-P, Schölkopf B, Smola AJ. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics. 2006;22(14):e49–57. pmid:16873512
  136. Wilson GA, Martin SA. An empirical comparison of two methods for testing the significance of a correlation matrix. Educational and Psychological Measurement. 1983;43(1):11–4.
  137. Sedgwick PM, Hammer A, Kesmodel US, Pedersen LH. Current controversies: null hypothesis significance testing. Acta Obstet Gynecol Scand. 2022;101(6):624–7. pmid:35451497