Abstract
Thermostability is an important property of proteins and a critical factor for their wide application. Accurate prediction of ΔΔG enables the estimation of the impact of mutations on thermostability in advance. A range of ΔΔG prediction methods based on machine learning has now emerged. However, their prediction performance remains limited because training features are insufficiently informative, and little effort has been made to integrate feature calculation resources. To address this, we integrated 12 computational resources to develop a pipeline capable of automatically calculating 1,547 features. In addition, a feature-enriched DDGWizard dataset was created, including 15,752 ΔΔG data points. Furthermore, we performed feature selection and developed an accurate ΔΔG prediction model that achieved an R2 of 0.61 in cross-validation. It also outperformed several other representative prediction methods in comparisons on independent datasets. Together, the feature calculation pipeline, DDGWizard dataset, and prediction model constitute the DDGWizard system, freely available for ΔΔG analysis and prediction.
Author summary
A protein’s ability to maintain its structure at high temperatures, known as thermostability, is critical for many industrial and therapeutic applications and can be affected by genetic mutations. To address this challenge, we built a robust machine learning model to predict the impact of mutations on thermostability. DDGWizard integrates data from multiple computational tools to calculate over 1,500 features for each mutation, offering detailed insights into protein structure and stability. DDGWizard simplifies the complex process of ΔΔG analysis and enables scientists to design more stable proteins for various applications. It bridges the gap between data-rich resources and practical tools. Our model demonstrated superior performance compared to existing methods and provides a freely accessible platform for researchers and industry professionals, available at https://github.com/bioinfbrad/DDGWizard.
Citation: Wang M, Jumah K, Shao Q, Kamieniecka K, Liu Y, Poterlowicz K (2025) DDGWizard: Integration of feature calculation resources for analysis and prediction of changes in protein thermostability upon point mutations. PLoS Comput Biol 21(12): e1013783. https://doi.org/10.1371/journal.pcbi.1013783
Editor: Samuel V. Scarpino, Northeastern University, UNITED STATES OF AMERICA
Received: January 6, 2025; Accepted: November 24, 2025; Published: December 1, 2025
Copyright: © 2025 Wang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The code of the DDGWizard application is available on a GitHub repository at https://github.com/bioinfbrad/DDGWizard. The DDGWizard dataset, the source code for model training and validation, and the data for evaluation and comparisons are stored on https://zenodo.org/records/14512134.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Thermostability is an important property of proteins, representing their ability to resist irreversible changes in structure and chemical attributes caused by elevated temperature [1]. It strongly influences the application scope of proteins. For therapeutic proteins, such as monoclonal antibodies, insufficient thermostability can result in denaturation or reduced potency when temperature excursions occur during manufacturing, storage, and transportation [2], undermining their effectiveness. In addition, thermostability determines whether certain food proteins, such as whey proteins, can withstand thermal treatments [3], which is important in food processing to extend shelf life or create desired flavours [4]. For enzymes, specialized proteins widely used as biological catalysts, thermostability is a crucial parameter for broad application [5]. As accelerating reactions, improving substrate solubility, and reducing the risk of microbial contamination require high temperatures in industrial environments, only enzymes with sufficient thermostability can operate continuously and be reused effectively [6]. However, most naturally evolved enzymes have poor thermostability [7], significantly limiting their applications.
Continuous efforts have been made to increase the thermostability of proteins [6], employing a variety of strategies. Directed evolution (DE) has been widely applied in protein engineering to increase protein thermostability [8–11]. It simulates natural selection and involves key steps such as constructing mutation libraries, introducing random mutations, and screening the target protein based on specific criteria. However, a major drawback of DE is its high demand for labor, material, and financial resources to identify the desired protein [12]. To identify effective mutations that increase protein thermostability more precisely, rational and semi-rational design strategies have been applied, which often require prior knowledge or computational methods [6]. ΔΔG is an indicator of protein thermostability changes resulting from mutations, as it represents the difference in the folding free energy change between the wild-type and mutant proteins [13]. Since accurate ΔΔG prediction enables the estimation of the impacts of mutations on thermostability in advance, it can assist the rational design of selectively introduced mutations [14–16].
Early ΔΔG prediction methods are mainly based on empirical force fields [17], utilizing experimental parameters, classical equations, and energy evaluations to calculate ΔΔG, such as the classic FoldX method [18]. With the continuous advancement of computational techniques and data science, ΔΔG prediction methods based on machine learning (ML) have emerged and are now widely adopted. Among the 23 ΔΔG prediction methods previously reviewed, 15 are based on ML [17]. However, despite their increase in number, current ML-based ΔΔG prediction methods still suffer from inadequate prediction performance [19–22]. One of the main reasons for this is that the features used to train models are insufficiently informative [19]. ACDC-NN [23] employs a neural network and optimizes for antisymmetric properties; however, its input features consist only of encodings of the mutation type and amino acid distributions around the mutation site, lacking the integration of direct prior knowledge [24]. mCSM [25] and DynaMut [26] introduce pharmacophore features and protein dynamics features based on normal mode analysis (NMA), but they do not consider richer protein information, such as evolutionary conservation, residue interactions, and a broader range of amino acid physicochemical properties. DUET [27] relies solely on the prediction outputs of two other methods, SDM [28] and mCSM [25], as input features. In addition, some methods, such as DDGun3D [29] and FoldX [18], rely on linear fitting, which oversimplifies the problem and may fail to represent complex protein conformational changes. Finally, the size and protein diversity of some training datasets are limited [20], which may hinder model generalization (S1 Table lists the algorithms, datasets, and feature sets of ACDC-NN, DDGun3D, mCSM, DynaMut, FoldX, SDM, and DUET).
So far, although many computational resources have been used to calculate features [25,26,30,31] or output potentially relevant features [32–35], little effort has been made to integrate these resources for the comprehensive calculation of features for ΔΔG data. Such integration could provide more diverse information, facilitating further analysis, feature selection, and ΔΔG prediction.
Here, we describe DDGWizard, a ΔΔG analysis system. It includes a feature calculation pipeline that integrates 12 computational resources [18,32–42] and is capable of automatically calculating 1,547 features for ΔΔG data. The calculated features inform ΔΔG prediction from various perspectives, including the structure and environment of wild-type proteins, structural and environmental changes before and after mutation, mutation types, and evolutionary information. In addition, DDGWizard provides a feature-enriched dataset created using the pipeline, including 15,752 ΔΔG data points. Furthermore, it incorporates an accurate ΔΔG prediction model developed with the selected optimal features. The model achieved an R2 of 0.61 in cross-validation. It also outperformed several other prediction methods: ACDC-NN [23], DDGun3D [29], FoldX [18], DynaMut [26], DUET [27], mCSM [25], and SDM [28]. The application program, datasets, and source code for DDGWizard training and validation have been published to ensure accessibility and reproducibility.
Results
An overview of DDGWizard
DDGWizard is a comprehensive ΔΔG analysis system. It incorporates a feature calculation pipeline, provides a feature-enriched dataset, and includes an accurate ΔΔG prediction model. Its development and validation comprise five steps (as shown in Fig 1).
A: Integrate 12 computational resources [18,32–42] to develop a feature calculation pipeline. B: Collect ΔΔG data from the VariBench [43] database, apply the feature calculation pipeline to enrich the collected ΔΔG data into the DDGWizard dataset, and then split it into training and test sets for subsequent ML tasks. C: Perform feature selection based on the RFE (recursive feature elimination) algorithm, followed by a further analysis of feature importance. D: Develop a ΔΔG prediction model using the XGBoost [44] algorithm based on the optimal features. E: Evaluate the developed model and compare it with other representative ΔΔG prediction methods using identical cross-validation sets, the test set, the S669 dataset [45], and the p53 dataset [25].
DDGWizard feature calculation pipeline
The feature calculation pipeline was developed by integrating 12 computational resources [18,32–42] (see Table 1) to obtain structural, environmental, and evolutionary information for proteins and associated mutation types. It requires raw ΔΔG data as input, including basic information on the PDB ID [46] (e.g., 2OCJ for the p53 protein [25]), amino acid substitution (e.g., K6Q for lysine-to-glutamine at position 6), chain identifier (e.g., “A”), pH, temperature (in °C), and ΔΔG value. The computational resources are called to calculate the features, and users can then access the feature-enriched ΔΔG data, which include 1,547 features in total (Fig 2).
The pipeline requires raw data as input (PDB ID, amino acid substitution, chain ID, pH, temperature, and ΔΔG value). It uses the PDB ID to download the wild-type protein structure file from the RCSB PDB database [46], employs Modeller [72] to construct the mutant protein structure file, and calls a series of computational resources [18,32–42] to calculate features, ultimately outputting the dataset containing 1,547 calculated features.
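To make the expected input concrete, the sketch below builds one raw ΔΔG record and parses its substitution string. The field names are illustrative, not the exact headers the DDGWizard pipeline requires.

```python
# A raw ΔΔG record as the pipeline expects it; the field names here are
# illustrative placeholders, not DDGWizard's actual column headers.
record = {
    "pdb_id": "2OCJ",        # wild-type structure in the RCSB PDB
    "substitution": "K6Q",   # lysine -> glutamine at position 6
    "chain": "A",            # chain identifier
    "ph": 7.0,
    "temperature_c": 25.0,
    "ddg": -0.5,             # experimental ΔΔG value
}

def parse_substitution(sub: str):
    """Split e.g. 'K6Q' into (wild-type AA, position, mutant AA)."""
    return sub[0], int(sub[1:-1]), sub[-1]

wt_aa, pos, mut_aa = parse_substitution(record["substitution"])
print(wt_aa, pos, mut_aa)  # K 6 Q
```

Each such record is what the pipeline expands into the 1,547 calculated features.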
The description of the calculated features and the corresponding computational resources is provided below.
Structural and environmental information of the wild-type protein. The first feature group incorporates structural information within the wild-type protein, covering the proportions of different amino acids and amino acid categories (uncharged polar, positively charged polar, negatively charged polar, nonpolar, aromatic, aliphatic, heterocyclic, and sulfur-containing) calculated with Biopython [37], buried/exposed amino acids and different secondary structures (3₁₀-helix, α-helix, π-helix, helix-turn, extended β-sheet, β-bridge, bend, and other/loop) obtained from DSSP [39], disordered regions predicted by DisEMBL [32], different residue interactions (hydrogen bonds, disulfide bridges, ionic interactions, van der Waals forces, π–cation, and π–π stacking) output by Ring [33], different atomic pharmacophores [25] (hydrophobic, positive, negative, hydrogen acceptor, hydrogen donor, aromatic, sulphur, and neutral) calculated with RDKit [40], and hydrophobic clusters analyzed by Protlego [35]. To account for the varying effects of residues and protein conformations at different distances from the mutation site, structural information is divided into four spatial regions: within 7 Å of the mutation site, within 10 Å, within 13 Å, and across the entire protein structure.
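The spatial-region bucketing can be sketched with toy coordinates. This is a minimal illustration of selecting the residues inside each radius shell around the mutation site; in the real pipeline the coordinates come from the PDB structure file.

```python
import numpy as np

# Toy Cα coordinates in Å; in the real pipeline these are read from the
# downloaded PDB structure.
coords = np.array([
    [0.0, 0.0, 0.0],   # residue 1: the mutation site
    [5.0, 0.0, 0.0],   # residue 2
    [9.0, 0.0, 0.0],   # residue 3
    [12.0, 0.0, 0.0],  # residue 4
    [20.0, 0.0, 0.0],  # residue 5
])
mutation_site = coords[0]

def residues_within(radius: float) -> list:
    """Residue numbers whose Cα lies within `radius` Å of the mutation site."""
    d = np.linalg.norm(coords - mutation_site, axis=1)
    return [i + 1 for i in np.where(d <= radius)[0]]

# The three nested shells used by the pipeline; the fourth "region" is
# simply the entire structure.
for r in (7.0, 10.0, 13.0):
    print(r, residues_within(r))
```

Structural features (interactions, pharmacophores, secondary structure proportions, etc.) are then computed separately over each shell and over the whole protein.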
Subsequently, different properties of the wild-type amino acids are incorporated, including RSA (relative solvent accessibility) calculated by DSSP [39], atomic fluctuation information based on NMA (normal mode analysis) [50] computed by Bio3D [38], the B-factor (temperature factor) predicted by Profbval [34], and the physicochemical properties recorded in the AAindex database [36].
Finally, the energy information of the wild-type protein is incorporated from FoldX [18]. In total, the first group includes 724 features.
Structural and environmental changes between mutant and wild-type proteins. The second group contains 647 features describing the changes in structure and environment between mutant and wild-type proteins. First, features are calculated for the mutant protein in the same manner as for the wild-type protein, using the computational resources described above. The differences in feature values between the mutant and wild-type proteins then constitute this feature group. Because some structural-proportion features, such as the proportions of disordered regions and buried/exposed amino acids, do not change significantly upon single-point mutations, they were not included.
Types of mutations. The third group includes 146 features describing the mutation types. Various encodings represent information on amino acid substitutions, such as substitution encodings for changes of amino acid types, secondary structures, and residue interactions at the mutated residue. Values from amino acid substitution matrices in the AAindex database [36] are also included to describe mutation types. Finally, the prediction results of the tool SIFT [42], which reflect the impact of amino acid substitutions on proteins, are also encoded to represent the mutation types.
Evolutionary information. The fourth group includes 26 features describing evolutionary information. These features are statistics from the PSSM (position-specific scoring matrix) generated by the protein sequence alignment and homology search tool PSI-BLAST [41]. The PSSM scores at the mutation site and surrounding sites of both the wild-type and mutant proteins are included. Additionally, the difference in PSSM scores at the mutation site between the mutant and wild-type proteins, and the difference in the average PSSM scores surrounding the mutation site between the mutant and wild-type proteins, are also included.
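A minimal sketch of the site-level PSSM features: look up the log-odds scores of the wild-type and mutant amino acids in the PSSM row for the mutation site and take their difference. The row values below are toy numbers, not real PSI-BLAST output; the feature name in the comment mirrors the "diff_pssm_score" feature discussed later.

```python
import numpy as np

AA_ORDER = "ARNDCQEGHILKMFPSTWYV"  # PSI-BLAST PSSM column order

# Toy PSSM row (20 log-odds scores) for the mutation site; real values
# come from PSI-BLAST's PSSM output.
site_row = np.array([-1, -2, 0, -3, -4, 1, 0, -2, -1, -3,
                     -2, 5, -1, -3, -1, 0, -1, -4, -2, -2])

def pssm_score(row, aa):
    """Log-odds score of amino acid `aa` at this position."""
    return row[AA_ORDER.index(aa)]

wt_score = pssm_score(site_row, "K")    # wild-type lysine: strongly preferred
mut_score = pssm_score(site_row, "Q")   # mutant glutamine: weakly preferred
diff_pssm_score = mut_score - wt_score  # mutant minus wild-type
print(diff_pssm_score)  # -4
```

A strongly negative difference, as here, indicates the mutation deviates from the amino acid preferred by evolution at that position.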
Feature-enriched DDGWizard dataset
Fig 3 demonstrates the workflow of dataset construction and feature enrichment. We chose the VariBench database [43] as the data source. VariBench curates previously validated mutation datasets, including ΔΔG datasets. A total of 20 raw ΔΔG datasets were collected (see S2 Table) that met the requirements of including the five pieces of basic mutation information (PDB ID, amino acid substitution, chain identifier, pH, and temperature) and experimental ΔΔG values. To maximize data utility, we merged these 20 datasets based on the following rules:
- For data with the same basic information and the same ΔΔG value, we retained only one instance.
- For data with the same basic information but different ΔΔG values, we selected the instance with the ΔΔG value closest to 0 (according to a previous report [73], current ΔΔG data trend toward 0, so ΔΔG data closer to 0 may be more reliable).
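The two merging rules above can be sketched as a single pass keyed on the basic information. The records and field names are toy placeholders.

```python
# Merge-rule sketch: key each record on its basic information; among
# conflicting ΔΔG values keep the one closest to 0. Field names and
# values are illustrative only.
records = [
    {"pdb_id": "2OCJ", "sub": "K6Q", "chain": "A", "ph": 7.0, "temp": 25.0, "ddg": -0.5},
    {"pdb_id": "2OCJ", "sub": "K6Q", "chain": "A", "ph": 7.0, "temp": 25.0, "ddg": -0.5},  # exact duplicate
    {"pdb_id": "2OCJ", "sub": "K6Q", "chain": "A", "ph": 7.0, "temp": 25.0, "ddg": -1.2},  # conflicting ΔΔG
    {"pdb_id": "1BNI", "sub": "I96V", "chain": "A", "ph": 6.3, "temp": 25.0, "ddg": 0.9},
]

merged = {}
for rec in records:
    key = (rec["pdb_id"], rec["sub"], rec["chain"], rec["ph"], rec["temp"])
    # Rule 1: exact duplicates collapse to one instance automatically.
    # Rule 2: among different ΔΔG values, keep the one closest to 0.
    if key not in merged or abs(rec["ddg"]) < abs(merged[key]["ddg"]):
        merged[key] = rec

print(len(merged), merged[("2OCJ", "K6Q", "A", 7.0, 25.0)]["ddg"])  # 2 -0.5
```

Applied to the 20 collected datasets, this kind of merge yields one record per unique mutation/condition combination.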
After merging, we obtained 7,876 unique mutation data points from 222 different proteins. Considering that the hypothetical reverse mutation theory has been adopted by many ΔΔG studies [19,23,29,74], both in the testing [45,75,76] and development [77–79] of ΔΔG prediction methods, we performed data augmentation by adding the hypothetical reverse mutations, eventually obtaining 15,752 ΔΔG data points.
We applied the developed feature calculation pipeline to the obtained data, enriching the number of features per data point from 5 to 1,547. Fig 4 shows the distribution of the feature-enriched data and highlights the similarity of the direct and reverse mutation data, with an MMD2 [80] of 0.0006. This indicates that the reverse mutation data can approximately serve as an equivalent augmentation of the dataset [81].
The t-SNE plot shows the distribution of direct and reverse mutation data, projected from high-dimensional feature spaces into two dimensions. The blue points represent direct mutation data, while the red points represent reverse mutation data. MMD2 quantifies the difference in feature distributions between the two types of data.
The newly created dataset was named the “DDGWizard” dataset. It is a non-redundant collection of 15,752 unique mutation data points from 222 proteins, with comprehensive feature information covering measuring conditions, the structures and environments of wild-type proteins, structural and environmental changes between mutant and wild-type proteins, mutation types, and evolutionary information, making it a valuable resource for feature selection, development of ML models, and further ΔΔG analytical studies.
Next, we split the dataset for ML tasks. Each pair of direct and hypothetical reverse mutation data in the DDGWizard dataset was treated as a single unit, and all pairs were randomly shuffled using a seed of 42. The first 90% of the data was selected as the training set, comprising 14,178 mutations (7,089 pairs of direct and reverse mutations) from 219 different proteins. The remaining 10% was selected as the test set, comprising 1,574 mutations (787 pairs of direct and reverse mutations) from 134 different proteins.
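The pair-level split can be sketched directly: shuffle the pairs (not the individual mutations) with seed 42 and hold out the last tenth. The pair labels are placeholders; only the splitting logic matters here.

```python
import random

# Pair-level split sketch: each (direct, reverse) pair is one unit, so a
# mutation and its hypothetical reverse always land on the same side.
pairs = [(f"direct_{i}", f"reverse_{i}") for i in range(7876)]
random.Random(42).shuffle(pairs)

n_test_pairs = len(pairs) // 10  # 787 pairs held out for testing
train = [m for p in pairs[:-n_test_pairs] for m in p]
test = [m for p in pairs[-n_test_pairs:] for m in p]
print(len(train), len(test))  # 14178 1574
```

Keeping each pair intact prevents the trivial leakage where a model sees a mutation in training and its sign-flipped twin in testing.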
Optimal ΔΔG feature set
To identify the most effective features, feature selection was carried out. We first trained a model with the XGBoost algorithm [44] using all 1,547 features as a baseline. A 20-fold pair-level cross-validation (which ensures that the direct and reverse mutation data remain together in either the training set or the validation set) was used to evaluate the model training performance. Fig 5 shows the performance of the model before feature selection, with an average R2 of 0.55 and a standard deviation of 0.06.
Next, the RFE algorithm was employed to select features; it iteratively removes the least important features and outputs the evaluation metric in each round (the flowchart of RFE is shown in Fig 6A). Fig 6B shows the changes in average R2 across the RFE rounds. The RFE curve remained relatively stable, with only minor fluctuations, during the elimination of the first 1,397 features. When the feature count dropped below 150, prediction performance began to improve. At round 1,478, with the features reduced to 69, prediction performance peaked with an average R2 of 0.58. Fig 5 compares the model’s performance before and after feature selection. The average R2 increased by 0.03 when the model was trained with the selected 69 features. In addition, the standard deviation of R2 decreased from 0.06 to 0.05.
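A scaled-down sketch of the RFE procedure is below. It substitutes scikit-learn's GradientBoostingRegressor for XGBoost and 30 toy features for the real 1,547, purely to keep the example dependency-light; scikit-learn's RFECV variant additionally records the cross-validated score each round, as the paper's procedure does.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFE

# Toy stand-in: 30 features instead of 1,547, and GradientBoostingRegressor
# instead of XGBoost, so the sketch runs with scikit-learn alone.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=0.1, random_state=42)

# RFE drops the least important feature each round (step=1); the paper's
# run went 1,478 rounds down to 69 survivors. Here we stop at 10.
rfe = RFE(estimator=GradientBoostingRegressor(random_state=42),
          n_features_to_select=10, step=1)
rfe.fit(X, y)
print(int(rfe.support_.sum()))  # 10 features retained
```

`rfe.support_` marks the surviving feature columns, the analogue of the 69-feature optimal set.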
A: The flowchart of feature selection based on the RFE algorithm. B: The RFE results reflect the changes in the average R2 of the 20-fold pair-level cross-validation as the number of RFE rounds increases and the number of features decreases. C: The top 10 most important features among the 69 features.
The optimal 69 features are listed in S3 Table, including evolutionary features, energy terms, changes in amino acid physicochemical properties, RSA (relative solvent accessibility) at the mutation site, temperature, and distributions of amino acid categories, secondary structures, residue interactions, atomic pharmacophores, disordered regions, and hydrophobic clusters. These features were used for further analysis and model development.
Moreover, we used the XGBoost algorithm to output the feature importance (the top 10 most important features are shown in Table 2 and Fig 6C). The most important feature is “diff_pssm_score”, which represents the difference in PSSM scores at the mutation site between the mutant and wild-type proteins. In addition, two other evolutionary features, “diff_pssm_score_aver” (the change in the average PSSM value of the sequence surrounding the mutation site) and “wt_PSSM_score” (the PSSM value at the mutation site in the wild-type protein), are also among the top 10 important features. Since the PSSM score provides a quantitative measure of the degree of conservation of amino acids at a specific site [82], the difference in PSSM scores between mutant and wild-type amino acids reflects how well the mutation aligns with the preferred amino acid at the site. Larger differences indicate a greater deviation from the most favorable amino acid at that position. Such deviations may affect the function or structure of the protein, as conservation at these positions often suggests that they are essential to maintaining its integrity [83]. The second most important feature is “diff_foldx_total_energy”, which represents the difference in overall energy, calculated by FoldX [18], between the mutant and wild-type proteins. This shows that empirical force field methods like FoldX can effectively assist ML methods in their predictions. It is worth mentioning that four features reflecting changes in physicochemical properties derived from the AAindex rank among the top 10. Among them, “diff_aaindex_p_values_of_mesophilic_proteins_based_b_values” reflects changes in the statistical significance of protein thermostability for mesophilic proteins based on the distributions of b values [84]; the other three reflect changes in parameters associated with different secondary structures at the mutation site [85–87].
Model development and evaluation
The XGBoost algorithm was chosen to train the prediction model of DDGWizard. Table 3 presents the results of model selection, comparing the performance of 11 ML algorithms: AdaBoost [88], decision tree [89], KNN [90], Lasso regression [91], LightGBM [92], linear regression [93], MLP [94], random forest [95], Gaussian process [96], support vector regression [97], and XGBoost [44]. Traditional ML algorithms were evaluated with their default hyperparameters, while the tuning of MLP hyperparameters is summarized in S4 Table. Among these algorithms, XGBoost achieved the highest average R2 of 0.55 under the same 20-fold pair-level cross-validation.
We then trained the model using the optimal 69 features with the XGBoost algorithm. Bayesian optimization [98] was employed to tune the model’s hyperparameters, with the average R2 during the 20-fold pair-level cross-validation as the optimization target (specific parameter ranges and tuning results can be found in S5 Table). After Bayesian optimization, the average R2 of the model training improved from 0.58 to 0.61.
Fig 7A shows the prediction results during cross-validation, while Fig 7B compares the average predicted values with the experimental values within 10 bins containing equal amounts of data [99–101]. The distribution of the 10 comparison points around y = x indicates the model’s good calibration and strong reliability.
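The equal-count binning behind Fig 7B can be sketched as follows, with toy predicted and experimental values standing in for the cross-validation results.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(0.0, 1.5, 1000)           # toy experimental ΔΔG values
y_pred = y_true + rng.normal(0.0, 0.8, 1000)  # toy model predictions

# Sort by prediction and split into 10 bins of equal size, then compare
# the per-bin mean prediction with the per-bin mean experimental value.
order = np.argsort(y_pred)
bins = np.array_split(order, 10)
bin_pred = np.array([y_pred[b].mean() for b in bins])
bin_true = np.array([y_true[b].mean() for b in bins])

# For a well-calibrated model the 10 (bin_pred, bin_true) points lie
# close to the y = x line.
max_gap = float(np.max(np.abs(bin_pred - bin_true)))
print(round(max_gap, 3))
```

Plotting `bin_pred` against `bin_true` with per-bin error bars reproduces the style of the binned calibration plot.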
A: The scatter plot comparing all predicted and true values. The red line indicates the overall regression fit. The plot also provides the regression equation, R2, γ, and σ values for the overall prediction. B: The binned scatter plot comparing average predicted values with experimental values within 10 bins containing equal amounts of data. The error bars represent the standard error of the residuals between the average predicted and true values within each bin. C: The scatter plot of the prediction results from the 20-fold cross-validation on the 3,970 mutation ΔΔG data points where the PSSM score of the mutant amino acid is less than 0. D: The scatter plot of the prediction results from the 20-fold protein-level cross-validation on the 30 proteins with mutual sequence similarity less than 30%.
To assess the robustness of our model on low-conservation residue data, we conducted 20-fold cross-validation using data where the PSSM score of the mutant amino acid was less than 0 (a total of 3,970 data points), representing relatively low conservation of the mutant amino acid [82]. Fig 7C shows the test results, and our model achieved an average R2 of 0.51 under the same optimal features and hyperparameters as used before.
To test our model’s performance on proteins with low mutual sequence similarity (<30%), we selected 30 proteins (PDB IDs: 1BNI, 1W3D, 1VQB, 1STN, 3SSI, 1RX4, 2LZM, 1RTB, 1LZ1, 2CI2, 1FKJ, 1DIV, 2ABD, 1UZC, 3MBP, 1FTG, 1RN1, 1ARR, 1TEN, 1AMQ, 2RN2, 1YYJ, 1APS, 5PTI, 1HZ6, 1SAK, 1OTR, 1PIN, 5AZU, 1TTG), each with at least 50 mutations in our dataset. We then performed 20-fold protein-level cross-validation [31]. As shown in Fig 7D, our model achieved an average R2 of 0.42.
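Protein-level cross-validation can be sketched with scikit-learn's GroupKFold, which guarantees that all mutations of one protein stay in the same fold. The data below are random placeholders; only the grouping logic is the point.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy set-up: 300 mutations over 30 proteins, 10 mutations each; the
# features and labels are random placeholders.
rng = np.random.default_rng(42)
groups = np.repeat(np.arange(30), 10)  # protein index of each mutation
X = rng.normal(size=(300, 5))
y = rng.normal(size=300)

# Protein-level CV: no protein may appear on both sides of any split.
leak = False
for train_idx, val_idx in GroupKFold(n_splits=20).split(X, y, groups):
    if set(groups[train_idx]) & set(groups[val_idx]):
        leak = True
print(leak)  # False
```

This stricter split measures generalization to unseen proteins rather than unseen mutations, which is why its R2 is lower than the pair-level figure.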
To evaluate the impact of including reverse mutation data on model performance, we conducted a comparison study (Table 4). We first performed 20-fold cross-validation with direct mutation data for both the training and validation sets, which yielded an average R2 of 0.58 (Table 4, row 1). Next, we added the corresponding reverse mutation data to the training sets while keeping the validation sets unchanged, and the average R2 remained 0.58 (Table 4, row 2). This indicates that adding reverse mutation data to the training set does not significantly affect the prediction performance on direct mutations under different data splits. In the third experiment, we used direct and reverse mutation data as both training and validation sets and conducted a 20-fold pair-level cross-validation, which obtained an average R2 of 0.61 (Table 4, row 3). The final experiment used direct mutation data for the training sets, and direct and reverse mutation data for the validation sets, and the average R2 dropped to 0.26 (Table 4, row 4). This suggests that including reverse mutation data in the training set can effectively improve the prediction performance on reverse mutations and thereby enhance the model’s generalization ability.
The model with the highest R2 (0.73) on the validation set from the 20-fold pair-level cross-validation was selected as DDGWizard’s prediction model. For new ΔΔG predictions, users provide basic mutation information (PDB ID, amino acid substitution, chain identifier, pH, and temperature), and the feature calculation pipeline automatically calculates the optimal 69 feature values to feed into the prediction model. The model then outputs the predicted ΔΔG values.
Comparisons
To compare the performance of DDGWizard’s prediction model with others, seven representative methods were chosen for the comparison: ACDC-NN [23], DDGun3D [29], FoldX [18], DynaMut [26], DUET [27], mCSM [25], and SDM [28]. S1 Table provides information on the algorithms, datasets, and feature sets used by these methods. We conducted four comparisons using different datasets: identical cross-validation sets, the test set, the S669 dataset [45], and the p53 dataset [25]. All test datasets underwent data augmentation, enabling evaluation of the prediction methods on all data, direct mutation data, and reverse mutation data.
Comparison with the cross-validation sets.
To initially compare DDGWizard’s prediction model with other prediction methods, we first selected two representative methods: ACDC-NN [23] and DDGun3D [29], which were ranked as the top two methods in a previous study [45]. We used ACDC-NN and DDGun3D to predict the identical pair-level cross-validation sets that DDGWizard used and compared their prediction performance with DDGWizard’s model. Table 5 and Fig 8 present the comparison results, showing that DDGWizard’s model significantly outperforms ACDC-NN and DDGun3D, achieving r_all, r_dir, and r_rev values of 0.79, 0.76, and 0.72 (r_all, r_dir, and r_rev denote the Pearson correlation coefficients between the predicted and true values for all data, direct mutation data, and reverse mutation data, respectively). Statistical significance was confirmed by z_all and p_all (significance metrics for correlation coefficient comparison derived from Steiger’s Z-test [102,103]), with z_all exceeding 50 and p_all less than 0.001. All three prediction methods were constructed with consideration of the hypothetical reverse mutation theory, and the effectiveness of this consideration is reflected in the models’ antisymmetric property [23]. The values of r_dir-rev (the Pearson correlation coefficient between the predicted values of direct and reverse mutation data) for the three methods are close to the ideal value of –1, and the values of ⟨δ⟩ (the average of the sums of the predicted values for each pair of direct and reverse mutations) are similarly close to the ideal value of 0.
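The two antisymmetry metrics can be computed directly from paired predictions, as in this sketch with toy values.

```python
import numpy as np

# Toy paired predictions: pred_dir[i] and pred_rev[i] belong to the same
# direct/reverse mutation pair.
pred_dir = np.array([1.2, -0.8, 2.5, 0.3, -1.9])
pred_rev = np.array([-1.1, 0.9, -2.4, -0.2, 1.8])

# r_dir-rev: Pearson correlation between direct and reverse predictions
# (ideal antisymmetric predictor: -1).
r_dir_rev = float(np.corrcoef(pred_dir, pred_rev)[0, 1])

# <delta>: mean of the per-pair sums of predictions
# (ideal antisymmetric predictor: 0).
delta = float(np.mean(pred_dir + pred_rev))

print(round(r_dir_rev, 3), round(delta, 3))
```

Here the toy predictor is nearly antisymmetric, so r_dir-rev lands close to –1 and ⟨δ⟩ close to 0, mirroring the behaviour reported for the three methods.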
We also compared the three prediction methods using the identical cross-validation sets on the low-conservation residue data and the low-similarity proteins. DDGWizard’s model achieved better performance than ACDC-NN and DDGun3D, with r_all of 0.64 and 0.72, respectively (see S6 Table and S7 Table).
Comparison with the test set.
To further compare the performance of DDGWizard’s prediction model with other ΔΔG prediction methods, we selected five additional representative methods, FoldX [18], DynaMut [26], DUET [27], mCSM [25], and SDM [28], to predict the test set. Table 6 and Fig 9 present the test results of the eight ΔΔG prediction methods. As shown, DDGWizard’s model achieved the best prediction performance on all data (with an r_all of 0.68), direct mutation data (with an r_dir of 0.66), and reverse mutation data (with an r_rev of 0.63). Its performance advantage is also statistically significant, as all p_all values from comparisons with other methods were less than 0.001. In terms of the antisymmetric property [23], DDGWizard’s model, ACDC-NN, and DDGun3D significantly outperformed the other methods.
Comparison with the S669 dataset.
Table 7 and Fig 10 present the test results on the widely used [48,104,105] S669 dataset [45] for the eight prediction methods, including DDGWizard’s model. Since 43 mutation data points from S669 were included in our training set, we excluded these data and retrained [77–79] DDGWizard’s model using the same features and hyperparameters as before for the comparison. In the evaluation on S669, DDGWizard’s model, ACDC-NN, and DDGun3D remained the top-performing prediction methods. Our model achieved the highest r_all of 0.63, and ACDC-NN exhibited the best antisymmetric performance with r_dir-rev of –0.98.
Comparison with the p53 dataset.
Table 8 and Fig 11 present the test results on the p53 dataset [25] for the eight prediction methods, including DDGWizard’s model. As four data points from the p53 dataset were included in DDGWizard’s training data, we excluded them and retrained [23,31,56] DDGWizard’s model using the same features and hyperparameters as before for the comparison. Based on the ranking of r_all, DDGWizard’s model outperformed the other methods (r_all of 0.79).
Accessibility and reproducibility
We developed DDGWizard as a freely available system for ΔΔG analysis and prediction. Users can access the DDGWizard application at https://github.com/bioinfbrad/DDGWizard. The feature calculation pipeline takes raw ΔΔG data as input and outputs new data with 1,547 features. DDGWizard’s ΔΔG prediction model takes basic mutation information and returns predicted ΔΔG values. Both the feature calculation pipeline and the ΔΔG prediction model support parallel processing to handle large-scale data. To further assist users in predicting ΔΔG, the program also provides tools for ΔΔG prediction of saturation mutagenesis and full-site mutagenesis. Detailed usage instructions can be found at https://ddgwizard.readthedocs.io/en/latest/. The DDGWizard dataset, the source code for model training and validation, and the evaluation and comparison data are released at https://zenodo.org/records/14512134.
Discussion
Thermostability has a significant impact on the broad application of proteins. Continuous efforts have been made to increase protein thermostability through various strategies, such as rational or semi-rational design. Since ΔΔG prediction can estimate the impact of mutations on thermostability in advance, it has become a powerful tool for rational and semi-rational design. Although a range of ΔΔG prediction methods have been developed, especially those based on ML, they still suffer from inadequate prediction performance, mainly because the features used to train the models are insufficiently informative. In fact, many computational resources are available to calculate features for ΔΔG prediction. However, little work has been done to integrate these resources for comprehensive feature calculation, which could provide more diverse feature information and facilitate further analysis, feature selection, and ΔΔG prediction.
In this study, we integrated 12 computational resources [18,32–42] to develop a pipeline that helps users enrich the features of their own datasets. It automatically outputs 1,547 calculated features covering diverse information, such as the structures and environments of wild-type proteins, structural and environmental changes between mutant and wild-type proteins, mutation types, and evolutionary information. Furthermore, we collected ΔΔG data and applied our pipeline to create the feature-enriched DDGWizard dataset, which includes 15,752 data points and serves as a valuable resource for ΔΔG research.
In addition, to identify more effective features for prediction, we carried out feature selection based on RFE (recursive feature elimination). During this process, the RFE curve first remained stable over a long range and then began to rise. At the peak, 69 features were selected as the optimal subset, resulting in a more accurate and robust model with an improved R2 and a decreased standard deviation. This can be attributed to the elimination of redundant features, allowing the model to focus on more informative ones [106]. Similar RFE patterns have been observed in previous studies [107–110]. According to the importance ranking of the optimal features, the difference in PSSM scores at the mutation site between mutant and wild-type proteins was the most important feature. This may be because the change in PSSM score at the mutation site reflects how well the mutation matches the preferred amino acid at that position. Larger differences indicate greater deviation, which may affect the protein's function or structure, since conserved positions are often critical for maintaining integrity. We also found that the energy terms derived from FoldX and changes in physicochemical properties related to certain secondary structures are important for prediction.
Finally, using the optimal features, we developed an accurate new prediction model. It outperformed ACDC-NN [23], DDGun3D [29], FoldX [18], DynaMut [26], DUET [27], mCSM [25], and SDM [28]. ACDC-NN employs a convolutional neural network and optimizes for antisymmetric properties. However, its input features include only encodings of the mutation type and the amino acid distribution around the mutation site, lacking prior knowledge–based features [24]. This limits the model's interpretability and may increase the risk of overfitting [111]. In contrast, the features used in our model make more direct contributions to ΔΔG owing to their knowledge-based design. Moreover, its training set, the S2648 dataset [112] (also employed by DynaMut [26], mCSM [25], and DUET [27]), contains 132 source proteins that are entirely covered by the 219 proteins in our training set. As a result, it has been trained on a relatively narrower range of proteins than our model, which could limit its generalization performance. DDGun3D uses four features to represent differences in conservation, hydrophobicity, sequence interaction energy, and structural interaction energy between mutant and wild-type amino acids, and it fits ΔΔG values through a linear combination. While this approach is intuitive, the linear combination may be insufficient to capture the complex nonlinear relationships between features and ΔΔG. A similar limitation is observed in FoldX [18], which computes rich and complex conformational energy terms of proteins but only performs simple linear weighting of these terms. mCSM and DynaMut introduce pharmacophore features and protein dynamics features based on NMA (normal mode analysis) [50], using Gaussian processes and random forest algorithms, respectively, to train their models. However, neither method is trained with amino acid conservation features [66,82], which were found to be important in our results. In addition, they did not incorporate the XGBoost algorithm, which demonstrated better performance in our model selection than the algorithms adopted in their models. Furthermore, they do not consider hypothetical reverse mutations [19], which prevents their models from learning reverse-mutation patterns and results in a relatively low γ_dir-rev (Pearson correlation between predictions of direct and reverse mutations). SDM [28] is a statistical potential function based on an environment-specific amino acid substitution table. While statistical approaches are valuable for understanding data distributions, their reliance on prior assumptions about data distributions might lead to prediction biases on new data [113]. DUET [27] is a consensus predictor that uses the outputs of SDM and mCSM as features and applies an SVM for training. Across the comparisons, DUET performed slightly better than SDM and mCSM individually, indicating the effectiveness of consensus prediction. However, its accuracy is still significantly lower than that of the top-performing methods.
Our current work focuses on integrating features from 12 computational resources [18,32–42] based on expert knowledge, identifying the optimal subset from the 1,547 integrated features using RFE, and developing an accurate prediction model with XGBoost. While XGBoost is a powerful tool that is effective for structured data and provides strong interpretability [114], its limitation is that it cannot perform complex transformations of input features to automatically learn new feature representations and contextual patterns in the data [115–117]. We aim to address this problem in future work by exploring the incorporation of deep learning (DL) to further improve model accuracy. DL allows the automated extraction of abstract representations [118] from data and often achieves better performance on large-scale datasets, despite its limited interpretability. DL-based representations of sequence conservation, such as the output embeddings of pre-trained protein language models (PLMs) [119], could be introduced. DL algorithms such as GNNs [120] or CNNs [121] could be used to further extract deep-learned representations from the distribution of amino acids, secondary structures, and amino acid interactions. We aim to integrate the current RFE-selected features with deep-learned representations in hybrid models to further improve performance.
Overall, the ΔΔG analysis and prediction system DDGWizard consists of an integrated feature calculation pipeline, a feature-enriched dataset, and an accurate prediction model. The system is freely available, and the source code for its training and validation procedures has been published to ensure accessibility and reproducibility.
Materials and methods
Development of feature calculation pipeline
The feature calculation pipeline was developed in the Python programming language (v3.10.12). It was programmed to read raw data (PDB ID, amino acid substitution, chain identifier, pH, and temperature) as input. It then downloads the structural files of the wild-type proteins from the RCSB PDB database [46] according to the provided PDB ID using the requests (v2.31.0) library and utilizes the homology modeling software Modeller (v10.4) [72] to generate the mutant protein structures using the wild-type protein structure as a template. Next, a series of computational resources [18,32–42] is called to calculate the feature values, and finally the calculated results are saved in CSV format. Detailed descriptions of the usage of each computational resource in the pipeline are provided in S8 Table.
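As a minimal illustration of the input-handling step (assuming the common single-letter substitution format such as "A123G" and the standard RCSB file download URL pattern; the function names are hypothetical, not the pipeline's actual API):

```python
import re

# Standard RCSB file endpoint pattern (assumption; the pipeline may use another route)
RCSB_URL = "https://files.rcsb.org/download/{pdb_id}.pdb"

def parse_substitution(sub: str):
    """Split an amino acid substitution string such as 'A123G' into
    (wild-type residue, position, mutant residue)."""
    m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", sub)
    if m is None:
        raise ValueError(f"unrecognized substitution: {sub!r}")
    return m.group(1), int(m.group(2)), m.group(3)

def structure_url(pdb_id: str) -> str:
    """Build the download URL for a wild-type structure from its PDB ID."""
    return RCSB_URL.format(pdb_id=pdb_id.upper())
```

The returned URL would then be fetched with requests and the parsed substitution passed to the modeling and feature-calculation steps.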
Data sources
In this study, three data sources were used:
VariBench. VariBench [43] is a benchmark database that includes mutation datasets, such as ΔΔG datasets, and follows seven principles (relevance; representativeness; non-redundancy; experimentally verified cases; positive and negative cases; scalability; reusability) to improve the quality of the collected datasets. Twenty datasets from the VariBench database were selected and then merged, filtered, and split to produce the training set and test set used for the ML tasks.
S669. The dataset S669 [45] contains 669 mutation data points from 87 different proteins. It is a high-quality benchmark dataset and has been used by several studies [77–79] for independent tests.
p53. The dataset p53 [25] contains 42 mutation data points of the tumor suppressor protein p53 (PDB ID: 2OCJ). Since the p53 dataset is widely used for comparing and testing ΔΔG prediction methods [27,28,77,79], it was also adopted in this study for testing and comparison.
Data augmentation based on hypothetical reverse mutation theory
The change in thermostability (ΔΔG) caused by a protein mutation is represented by the difference in protein folding free energy (ΔG) between the mutant and wild-type proteins. Because ΔG is a thermodynamic state function [122], ΔΔG should be reversible. Namely, at the same position in the protein, the ΔΔG for a mutation from amino acid A to amino acid B should equal the negative of the ΔΔG for the hypothetical reverse mutation from amino acid B to amino acid A [19], as shown in Eq 1:

ΔΔG(A→B) = −ΔΔG(B→A)    (Eq 1)

This is known as the hypothetical reverse mutation theory.
This theory has been widely applied in many studies [19,23,29,74], both in the testing [45,75,76] and the development [77–79] of ΔΔG prediction methods. According to this theory, a robust ΔΔG prediction method should perform well not only on direct mutations but also on hypothetical reverse mutations [19]. In the test set, a hypothetical reverse mutation data point can be generated from each direct mutation data point. This type of test-set augmentation allows comprehensive evaluation of prediction methods by additionally predicting reverse mutation data. Beyond testing, the theory should also be applied in the construction of ΔΔG prediction methods. Previous studies [45] have shown that incorporating it can effectively improve a method's performance on hypothetical reverse mutation data and allow the method to learn the antisymmetric property [23] of ΔΔG. In contrast, ΔΔG prediction methods that did not consider this theory achieved much poorer performance [45,75,76]. For ML-based ΔΔG prediction methods, the hypothetical reverse mutation theory can be used to generate reverse mutation data in the training set for data augmentation [74,77–79].
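The augmentation step above can be sketched as follows (a minimal illustration assuming a hypothetical record layout with wild-type residue, position, mutant residue, and measured ΔΔG; the real dataset carries many more fields):

```python
def reverse_record(record: dict) -> dict:
    """Create the hypothetical reverse mutation: swap the wild-type and
    mutant residues and negate the ddG value (Eq 1)."""
    return {
        "wt": record["mut"],
        "pos": record["pos"],
        "mut": record["wt"],
        "ddg": -record["ddg"],
    }

def augment(dataset: list) -> list:
    """Return the dataset augmented with one reverse record per direct record."""
    return dataset + [reverse_record(r) for r in dataset]
```

For example, a direct mutation A123G with ΔΔG = 1.5 yields the reverse record G123A with ΔΔG = −1.5, doubling the training data while preserving antisymmetry.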
Pair-level cross-validation
Among the ΔΔG prediction methods [74,77–79] that utilized the hypothetical reverse mutation theory to augment the training set, mutation-level cross-validation (randomly shuffling all mutation data during cross-validation [31]) was employed. However, since a pair of real data and its hypothetical reverse mutation data are correlated, randomly shuffling them during cross-validation may place a real data instance and its augmented counterpart in the training and validation sets, respectively. The validation set would then not be entirely unseen by the training set, so the two sets would not be independently separated. A previous study [123] suggested that, when conducting cross-validation after data augmentation, failure to separate training and validation data independently may cause data leakage and overly optimistic performance. To address this issue, we employed pair-level cross-validation, in which the dataset is split treating each pair of real data and its augmented counterpart as a unit. This ensures that each data pair appears entirely in the training set or entirely in the validation set, preventing unfair validation.
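A minimal sketch of such a pair-level split (assuming, for illustration, that the augmented dataset stores each direct record at row i and its reverse counterpart at row i + n_pairs; fold count and seed are arbitrary):

```python
import random

def pair_level_folds(n_pairs: int, k: int = 20, seed: int = 0):
    """Split pair indices into k cross-validation folds. Each pair (a direct
    mutation and its hypothetical reverse) stays together in one fold."""
    pair_ids = list(range(n_pairs))
    random.Random(seed).shuffle(pair_ids)
    # deal shuffled pair ids round-robin into k folds
    return [pair_ids[i::k] for i in range(k)]

def fold_to_rows(fold, n_pairs):
    """Expand pair ids to row indices of the augmented dataset: the direct
    record at row p and its reverse at row p + n_pairs."""
    return [i for p in fold for i in (p, p + n_pairs)]
```

Because splitting is done over pair ids rather than rows, no fold can contain a direct mutation whose reverse counterpart lies in another fold.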
Feature selection
Feature selection was implemented using the RFE (recursive feature elimination) algorithm. RFE can effectively eliminate redundant features and identify the optimal feature subset to improve model prediction performance, making it a widely used technique in ML tasks [124–126]. RFE relies on feature importance: it iteratively trains the model, evaluates its prediction performance, computes feature importances, and removes the least important feature in each round, ultimately selecting the subset of features that contributes most to the model's prediction performance. In this study, RFE was implemented based on the RFECV function [127] from the sklearn.feature_selection library [128]. The XGBoost algorithm was used to train the models during the RFE rounds and to output feature importances. The average R2 of 20-fold pair-level cross-validation was employed as the metric to evaluate model performance in each RFE round. Specifically, RFE performed the following three iterative steps (denoting the feature set at each round as X, initially set to include all candidate features):
- Train the XGBoost model using the feature set X, perform cross-validation, calculate the average R2, and record the result.
- Use the feature importances output by the XGBoost model to rank the features in descending order. Remove the lowest-ranked feature from X and record the remaining features.
- Repeat steps 1 and 2 until all features have been removed from X.
After completing RFE, the remaining features corresponding to the round with the highest average R2 are selected as the optimal features, finalizing the feature selection process.
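The RFE loop can be sketched in NumPy as follows (a stand-in that ranks features by the magnitude of standardized least-squares coefficients instead of the paper's XGBoost importances; the fold count and data sizes are illustrative):

```python
import numpy as np

def rfe_least_squares(X, y, k_folds=5, seed=0):
    """Minimal RFE sketch: each round, record the cross-validated R^2 of the
    current feature set, then drop the feature with the smallest standardized
    |coefficient|. Returns the feature set with the best recorded R^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    order = rng.permutation(n)
    folds = np.array_split(order, k_folds)
    active = list(range(d))
    history = []  # (feature subset, mean cross-validated R^2) per round
    while active:
        r2s = []
        for f in folds:
            train = np.setdiff1d(order, f)
            Xt, yt = X[train][:, active], y[train]
            Xv, yv = X[f][:, active], y[f]
            mu, sd = Xt.mean(0), Xt.std(0) + 1e-12  # standardize on train fold
            coef, *_ = np.linalg.lstsq(
                np.c_[np.ones(len(Xt)), (Xt - mu) / sd], yt, rcond=None)
            pred = np.c_[np.ones(len(Xv)), (Xv - mu) / sd] @ coef
            r2s.append(1 - ((yv - pred) ** 2).sum()
                       / ((yv - yv.mean()) ** 2).sum())
        history.append((list(active), float(np.mean(r2s))))
        # refit on all data to rank features, then drop the least important
        mu, sd = X[:, active].mean(0), X[:, active].std(0) + 1e-12
        coef, *_ = np.linalg.lstsq(
            np.c_[np.ones(n), (X[:, active] - mu) / sd], y, rcond=None)
        active.pop(int(np.argmin(np.abs(coef[1:]))))
    return max(history, key=lambda h: h[1])
```

On synthetic data where only a few features carry signal, the returned subset retains those features, mirroring how the RFE curve peaks once redundant features have been eliminated.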
Model development
The prediction model of DDGWizard was developed using the XGBoost [44] algorithm. The XGBoost algorithm is a powerful ML method [44] based on gradient boosting trees. It incorporates both the L1 and L2 regularization penalty terms to control the model complexity and reduce overfitting [129], while its post-split pruning strategy [130] further prevents unnecessary tree growth. The inclusion of L1 regularization also enables a more reliable estimation of feature importance [131], making it well-suited for integration with RFE-based feature selection. The implementation of XGBoost was achieved using the ML library scikit-learn (v1.3.1). The model’s hyperparameters were determined through Bayesian optimization [98], which is a sequential design strategy for global optimization of black-box functions, suitable for hyperparameter tuning in ML models. In this study, Bayesian optimization set the average R2 from the 20-fold pair-level cross-validation on dataset S7089 as the optimization target and was implemented using the library Bayesian optimization (v1.4.3).
Evaluation metrics
An MMD (maximum mean discrepancy) test [80] was conducted to evaluate the feature distribution difference between the direct and reverse mutation data. It is a widely used [132–134] method that quantifies the difference between two probability distributions in high-dimensional space. The metric MMD2 [80,135] was employed, and its formula is given by Eq 2 (where P and Q represent two distributions; samples x and y are drawn from P and Q, with sizes m and n; k is the RBF kernel function [80] implemented via the pairwise_kernels function of the sklearn.metrics library [128]):

MMD²(P, Q) = (1/m²) Σ_{i,j} k(x_i, x_j) + (1/n²) Σ_{i,j} k(y_i, y_j) − (2/(mn)) Σ_{i,j} k(x_i, y_j)    (Eq 2)
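A minimal NumPy sketch of this biased MMD² estimate with an RBF kernel (the kernel bandwidth gamma here is illustrative, not the paper's setting):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2(X, Y, gamma=1.0):
    """Biased MMD^2 estimate between samples X ~ P and Y ~ Q (Eq 2)."""
    m, n = len(X), len(Y)
    return (rbf_kernel(X, X, gamma).sum() / m**2
            + rbf_kernel(Y, Y, gamma).sum() / n**2
            - 2 * rbf_kernel(X, Y, gamma).sum() / (m * n))
```

Two samples from the same distribution give an MMD² near zero, while samples from shifted distributions give a clearly larger value.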
During cross-validation for feature selection and model development, the coefficient of determination (R2) between real and predicted ΔΔG values is used as the evaluation metric. Its formula is given by Eq 3 (where n is the total amount of data; ŷ_i and y_i represent the predicted and real values for the i-th data point; ȳ represents the mean of the real values):

R² = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} (y_i − ȳ)²    (Eq 3)
In the comparisons of prediction methods, a total of eight evaluation metrics, used in previous studies [23,45,75,79], were employed:
- Pearson correlation coefficient between the predicted and true values for all data, direct mutation data, and reverse mutation data (denoted γ_all, γ_dir, and γ_rev, respectively).
- Root mean square error between the predicted and true values for all data, direct mutation data, and reverse mutation data (denoted σ_all, σ_dir, and σ_rev, respectively).
- Pearson correlation coefficient between the predicted values of the direct mutation data and the reverse mutation data (denoted γ_dir-rev).
- The average of the sums of the predicted values for each pair of direct and reverse mutation data (denoted δ).
The formula for calculating the Pearson correlation coefficient (γ) is given by Eq 4 (where n is the total amount of data; ŷ_i and y_i represent the predicted and real values for the i-th data point; μ_ŷ and μ_y represent the means of the predicted and real values):

γ = Σ_{i=1}^{n} (ŷ_i − μ_ŷ)(y_i − μ_y) / √( Σ_{i=1}^{n} (ŷ_i − μ_ŷ)² · Σ_{i=1}^{n} (y_i − μ_y)² )    (Eq 4)
The formula for calculating the root mean square error (σ) is given by Eq 5 (where n is the total amount of data; ŷ_i and y_i represent the predicted and real values for the i-th data point):

σ = √( (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)² )    (Eq 5)
The formula for calculating the Pearson correlation coefficient between the predicted values of the direct and reverse mutation data (γ_dir-rev) is given by Eq 6 (where n is the total number of pairs of direct and reverse mutation data; ŷ_i^dir and ŷ_i^rev represent the predicted ΔΔG values for the i-th pair of direct and reverse mutation data, respectively; μ_dir and μ_rev represent the means of all predicted ΔΔG values for the direct and reverse mutation data):

γ_dir-rev = Σ_{i=1}^{n} (ŷ_i^dir − μ_dir)(ŷ_i^rev − μ_rev) / √( Σ_{i=1}^{n} (ŷ_i^dir − μ_dir)² · Σ_{i=1}^{n} (ŷ_i^rev − μ_rev)² )    (Eq 6)
The formula for calculating the average of the sums of the predicted values for each pair of direct and reverse mutation data (δ) is given by Eq 7 (where n is the total number of pairs of direct and reverse mutations; ŷ_i^dir and ŷ_i^rev represent the predicted ΔΔG values for the i-th pair of direct and reverse mutation data):

δ = (1/n) Σ_{i=1}^{n} (ŷ_i^dir + ŷ_i^rev)    (Eq 7)
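The metrics in Eqs 4–7 can be sketched in NumPy as follows (the function names are hypothetical; a perfectly antisymmetric predictor gives γ_dir-rev = −1 and δ = 0):

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient (Eq 4; also Eq 6 when applied to
    direct and reverse predictions)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    da, db = a - a.mean(), b - b.mean()
    return float((da * db).sum() / np.sqrt((da**2).sum() * (db**2).sum()))

def rmse(pred, true):
    """Root mean square error (Eq 5)."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return float(np.sqrt(((pred - true) ** 2).mean()))

def dir_rev_metrics(pred_dir, pred_rev):
    """Antisymmetry metrics: Pearson between direct and reverse predictions
    (Eq 6) and the average pair sum (Eq 7)."""
    pred_dir = np.asarray(pred_dir, float)
    pred_rev = np.asarray(pred_rev, float)
    return pearson(pred_dir, pred_rev), float((pred_dir + pred_rev).mean())
```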
The γ_all is the metric used to rank the compared methods. Steiger's Z-test [102] was employed to evaluate the statistical significance of the differences in γ_all between the DDGWizard model and the other methods. It determines whether two correlation coefficients associated with the same target variable are statistically significantly different [136]. The test was implemented using the online server of Cocor [103]. The inputs included the γ_all of the DDGWizard model, the γ_all of the compared method, the Pearson correlation coefficient between the predicted values of the DDGWizard model and the compared method, and the number of data points in the test set. The output included a Z-score (z_all) and a p-value (p_all). The z_all quantifies the statistical significance of the difference in γ_all between the DDGWizard model and the compared method, where a larger absolute value indicates stronger significance. The p_all (ranging from 0 to 1) represents the probability of obtaining the current statistical result, or a more extreme one, under the null hypothesis [137] that there is no difference in γ_all between the DDGWizard model and the compared method.
Supporting information
S1 Table. List of algorithms, training datasets, and feature sets used in representative ΔΔG prediction methods.
https://doi.org/10.1371/journal.pcbi.1013783.s001
(PDF)
S2 Table. List of collected datasets.
The list of 20 datasets that were collected from the VariBench database and merged.
https://doi.org/10.1371/journal.pcbi.1013783.s002
(PDF)
S3 Table. List of the remaining 69 features from feature selection based on the RFE algorithm.
https://doi.org/10.1371/journal.pcbi.1013783.s003
(PDF)
S4 Table. Performance comparison of different MLP hyperparameters with identical 20-fold pair-level cross-validation.
https://doi.org/10.1371/journal.pcbi.1013783.s004
(PDF)
S5 Table. Used hyperparameters for Bayesian optimization.
Hyperparameter ranges used for 100 rounds of Bayesian optimization.
https://doi.org/10.1371/journal.pcbi.1013783.s005
(PDF)
S6 Table. Comparison results of three ΔΔG prediction methods evaluated with identical cross-validation sets on the low-conservation residue data.
https://doi.org/10.1371/journal.pcbi.1013783.s006
(PDF)
S7 Table. Comparison results of three ΔΔG prediction methods evaluated with identical protein-level cross-validation sets.
https://doi.org/10.1371/journal.pcbi.1013783.s007
(PDF)
S8 Table. Detailed usage of computational resources.
https://doi.org/10.1371/journal.pcbi.1013783.s008
(PDF)
Acknowledgments
The authors acknowledge the use of the University of Bradford High Performance Computing Service in the completion of this work.
References
- 1. Kumwenda B, Litthauer D, Bishop OT, Reva O. Analysis of protein thermostability enhancing factors in industrially important thermus bacteria species. Evol Bioinform Online. 2013;9:327–42. pmid:24023508
- 2. Jiang B, Jain A, Lu Y, Hoag SW. Probing thermal stability of proteins with temperature scanning viscometer. Mol Pharm. 2019;16(8):3687–93. pmid:31306023
- 3. De Wit JN. Thermal stability and functionality of whey proteins. Journal of Dairy Science. 1990;73(12):3602–12.
- 4. Yousefi N, Abbasi S. Food proteins: solubility & thermal stability improvement techniques. Food Chemistry Advances. 2022;1:100090.
- 5. Xu Z, Cen Y-K, Zou S-P, Xue Y-P, Zheng Y-G. Recent advances in the improvement of enzyme thermostability by structure modification. Crit Rev Biotechnol. 2020;40(1):83–98. pmid:31690132
- 6. Wu H, Chen Q, Zhang W, Mu W. Overview of strategies for developing high thermostability industrial enzymes: discovery, mechanism, modification and challenges. Crit Rev Food Sci Nutr. 2023;63(14):2057–73. pmid:34445912
- 7. Nezhad NG, Rahman RNZRA, Normi YM, Oslan SN, Shariff FM, Leow TC. Thermostability engineering of industrial enzymes through structure modification. Appl Microbiol Biotechnol. 2022;106(13–16):4845–66. pmid:35804158
- 8. Minagawa H, Yoshida Y, Kenmochi N, Furuichi M, Shimada J, Kaneko H. Improving the thermal stability of lactate oxidase by directed evolution. Cell Mol Life Sci. 2007;64(1):77–81. pmid:17131051
- 9. Li G, Zhang H, Sun Z, Liu X, Reetz MT. Multiparameter optimization in directed evolution: engineering thermostability, enantioselectivity, and activity of an epoxide hydrolase. ACS Catal. 2016;6(6):3679–87.
- 10. Zhang Z-G, Yi Z-L, Pei X-Q, Wu Z-L. Improving the thermostability of Geobacillus stearothermophilus xylanase XT6 by directed evolution and site-directed mutagenesis. Bioresour Technol. 2010;101(23):9272–8. pmid:20691586
- 11. Chen C, Su L, Xu F, Xia Y, Wu J. Improved thermostability of maltooligosyltrehalose synthase from arthrobacter ramosus by directed evolution and site-directed mutagenesis. J Agric Food Chem. 2019;67(19):5587–95. pmid:31016980
- 12. Xiong W, Liu B, Shen Y, Jing K, Savage TR. Protein engineering design from directed evolution to de novo synthesis. Biochemical Engineering Journal. 2021;174:108096.
- 13. Chen C-W, Lin M-H, Chang H-P, Chu Y-W. Improvement of protein stability prediction by integrated computational approach. In: Proceedings of the 2020 10th International Conference on Bioscience, Biochemistry and Bioinformatics. 2020. p. 8–13. https://doi.org/10.1145/3386052.3386065
- 14. Zhao Y, Li D, Bai X, Luo M, Feng Y, Zhao Y, et al. Improved thermostability of proteinase K and recognizing the synergistic effect of Rosetta and FoldX approaches. Protein Eng Des Sel. 2021;34:gzab024. pmid:34671809
- 15. Go S-R, Lee S-J, Ahn W-C, Park K-H, Woo E-J. Enhancing the thermostability and activity of glycosyltransferase UGT76G1 via computational design. Commun Chem. 2023;6(1):265. pmid:38057441
- 16. Bi J, Chen S, Zhao X, Nie Y, Xu Y. Computation-aided engineering of starch-debranching pullulanase from Bacillus thermoleovorans for enhanced thermostability. Appl Microbiol Biotechnol. 2020;104(17):7551–62. pmid:32632476
- 17. Marabotti A, Scafuri B, Facchiano A. Predicting the stability of mutant proteins by computational approaches: an overview. Brief Bioinform. 2021;22(3):bbaa074. pmid:32496523
- 18. Guerois R, Nielsen JE, Serrano L. Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol. 2002;320(2):369–87. pmid:12079393
- 19. Fang J. A critical review of five machine learning-based algorithms for predicting protein stability changes upon mutation. Brief Bioinform. 2020;21(4):1285–92. pmid:31273374
- 20. Geng C, Xue LC, Roel-Touris J, Bonvin AMJJ. Finding the ΔΔG spot: Are predictors of binding affinity changes upon mutations in protein–protein interactions ready for it?. WIREs Comput Mol Sci. 2019;9(5).
- 21. Khan S, Vihinen M. Performance of protein stability predictors. Hum Mutat. 2010;31(6):675–84. pmid:20232415
- 22. Marabotti A, Del Prete E, Scafuri B, Facchiano A. Performance of web tools for predicting changes in protein stability caused by mutations. BMC Bioinformatics. 2021;22(Suppl 7):345. pmid:34225665
- 23. Benevenuta S, Pancotti C, Fariselli P, Birolo G, Sanavia T. An antisymmetric neural network to predict free energy changes in protein variants. J Phys D: Appl Phys. 2021;54(24):245403.
- 24. Xu H, Chen Y, Zhang D. Worth of prior knowledge for enhancing deep learning. Nexus. 2024;1(1):100003.
- 25. Pires DEV, Ascher DB, Blundell TL. mCSM: predicting the effects of mutations in proteins using graph-based signatures. Bioinformatics. 2014;30(3):335–42. pmid:24281696
- 26. Rodrigues CH, Pires DE, Ascher DB. DynaMut: predicting the impact of mutations on protein conformation, flexibility and stability. Nucleic Acids Res. 2018;46(W1):W350–5. pmid:29718330
- 27. Pires DEV, Ascher DB, Blundell TL. DUET: a server for predicting effects of mutations on protein stability using an integrated computational approach. Nucleic Acids Res. 2014;42(Web Server issue):W314-9. pmid:24829462
- 28. Pandurangan AP, Ochoa-Montaño B, Ascher DB, Blundell TL. SDM: a server for predicting effects of mutations on protein stability. Nucleic Acids Res. 2017;45(W1):W229–35. pmid:28525590
- 29. Montanucci L, Capriotti E, Birolo G, Benevenuta S, Pancotti C, Lal D, et al. DDGun: an untrained predictor of protein stability changes upon amino acid variants. Nucleic Acids Res. 2022;50(W1):W222–7. pmid:35524565
- 30. Berliner N, Teyra J, Colak R, Garcia Lopez S, Kim PM. Combining structural modeling with ensemble machine learning to accurately predict protein fold stability and binding affinity effects upon mutation. PLoS One. 2014;9(9):e107353. pmid:25243403
- 31. Quan L, Lv Q, Zhang Y. STRUM: structure-based prediction of protein stability changes upon single-point mutation. Bioinformatics. 2016;32(19):2936–46. pmid:27318206
- 32. Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB. Protein disorder prediction: implications for structural proteomics. Structure. 2003;11(11):1453–9. pmid:14604535
- 33. Clementel D, Del Conte A, Monzon AM, Camagni GF, Minervini G, Piovesan D, et al. RING 3.0: fast generation of probabilistic residue interaction networks from structural ensembles. Nucleic Acids Res. 2022;50(W1):W651–6. pmid:35554554
- 34. Schlessinger A, Yachdav G, Rost B. PROFbval: predict flexible and rigid residues in proteins. Bioinformatics. 2006;22(7):891–3. pmid:16455751
- 35. Ferruz N, Noske J, Höcker B. Protlego: a Python package for the analysis and design of chimeric proteins. Bioinformatics. 2021;37(19):3182–9. pmid:33901273
- 36. Kawashima S, Kanehisa M. AAindex: amino acid index database. Nucleic Acids Res. 2000;28(1):374. pmid:10592278
- 37. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–3. pmid:19304878
- 38. Grant BJ, Skjaerven L, Yao X-Q. The Bio3D packages for structural bioinformatics. Protein Sci. 2021;30(1):20–30. pmid:32734663
- 39. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22(12):2577–637. pmid:6667333
- 40. Landrum G. RDKit: a software suite for cheminformatics, computational chemistry, and predictive modeling. Greg Landrum. 2013;8(31.10):5281.
- 41. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402. pmid:9254694
- 42. Ng PC, Henikoff S. Predicting deleterious amino acid substitutions. Genome Res. 2001;11(5):863–74. pmid:11337480
- 43. Shirvanizadeh N, Vihinen M. VariBench, new variation benchmark categories and data sets. Front Bioinform. 2023;3:1248732. pmid:37795169
- 44. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. p. 785–94.
- 45. Pancotti C, Benevenuta S, Birolo G, Alberini V, Repetto V, Sanavia T, et al. Predicting protein stability changes upon single-point mutation: a thorough comparison of the available tools on a new dataset. Brief Bioinform. 2022;23(2):bbab555. pmid:35021190
- 46. Ogino S, Gulley ML, den Dunnen JT, Wilson RB, Association for Molecular Pathology Training and Education Committee. Standard mutation nomenclature in molecular diagnostics: practical and educational challenges. J Mol Diagn. 2007;9(1):1–6. pmid:17251329
- 47. Yang Y, Urolagin S, Niroula A, Ding X, Shen B, Vihinen M. PON-tstab: protein variant stability predictor. importance of training data quality. Int J Mol Sci. 2018;19(4):1009. pmid:29597263
- 48. Zhou Y, Pan Q, Pires DEV, Rodrigues CHM, Ascher DB. DDMut: predicting effects of mutations on protein stability using deep learning. Nucleic Acids Res. 2023;51(W1):W122–8. pmid:37283042
- 49. Panja AS, Bandopadhyay B, Maiti S. Protein thermostability is owing to their preferences to non-polar smaller volume amino acids, variations in residual physico-chemical properties and more salt-bridges. PLoS One. 2015;10(7):e0131495. pmid:26177372
- 50. Wako H, Endo S. Normal mode analysis as a method to derive protein dynamics information from the Protein Data Bank. Biophys Rev. 2017;9(6):877–93. pmid:29103094
- 51. Mamonova TB, Glyakina AV, Galzitskaya OV, Kurnikova MG. Stability and rigidity/flexibility-two sides of the same coin?. Biochim Biophys Acta. 2013;1834(5):854–66. pmid:23416444
- 52. Chu H-L, Chen T-H, Wu C-Y, Yang Y-C, Tseng S-H, Cheng T-M, et al. Thermal stability and folding kinetics analysis of disordered protein, securin. J Therm Anal Calorim. 2014;115(3):2171–8.
- 53. Ji Y-Y, Li Y-Q. The role of secondary structure in protein structure selection. Eur Phys J E Soft Matter. 2010;32(1):103–7. pmid:20524028
- 54. Marsh JA. Buried and accessible surface area control intrinsic protein flexibility. J Mol Biol. 2013;425(17):3250–63. pmid:23811058
- 55. Chen Y, Lu H, Zhang N, Zhu Z, Wang S, Li M. PremPS: Predicting the impact of missense mutations on protein stability. PLoS Comput Biol. 2020;16(12):e1008543. pmid:33378330
- 56. Giollo M, Martin AJ, Walsh I, Ferrari C, Tosatto SC. NeEMO: a method using residue interaction networks to improve prediction of protein stability upon mutation. BMC Genomics. 2014;15:1–11.
- 57. Huang A, Chen Z, Wu X, Yan W, Lu F, Liu F. Improving the thermal stability and catalytic activity of ulvan lyase by the combination of FoldX and KnowVolution campaign. Int J Biol Macromol. 2024;257(Pt 1):128577. pmid:38070809
- 58. Mahase V, Sobitan A, Rhoades R, Zhang F, Baranova A, Johnson M, et al. Genetic variations affecting ACE2 protein stability in minority populations. Front Med (Lausanne). 2022;9:1002187. pmid:36388927
- 59. Sobitan A, Edwards W, Jalal MS, Kolawole A, Ullah H, Duttaroy A, et al. Prediction of the effects of missense mutations on human myeloperoxidase protein stability using in silico saturation mutagenesis. Genes (Basel). 2022;13(8):1412. pmid:36011324
- 60. Tian J, Wu N, Chu X, Fan Y. Predicting changes in protein thermostability brought about by single- or multi-site mutations. BMC Bioinformatics. 2010;11:370. pmid:20598148
- 61. Aggarwal R, Koes DR. PharmRL: pharmacophore elucidation with deep geometric reinforcement learning. BMC Biol. 2024;22(1):301. pmid:39736736
- 62. Wilkinson HC, Dalby PA. Fine-tuning the activity and stability of an evolved enzyme active-site through noncanonical amino-acids. FEBS J. 2021;288(6):1935–55. pmid:32897608
- 63. Tang H, Shi K, Shi C, Aihara H, Zhang J, Du G. Enhancing subtilisin thermostability through a modified normalized B-factor analysis and loop-grafting strategy. J Biol Chem. 2019;294(48):18398–407. pmid:31615894
- 64. Camilloni C, Bonetti D, Morrone A, Giri R, Dobson CM, Brunori M, et al. Towards a structural biology of the hydrophobic effect in protein folding. Sci Rep. 2016;6:28285. pmid:27461719
- 65. Pace CN, Fu H, Fryar KL, Landua J, Trevino SR, Shirley BA, et al. Contribution of hydrophobic interactions to protein stability. J Mol Biol. 2011;408(3):514–28. pmid:21377472
- 66. Ahmad S, Sarai A. PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics. 2005;6:33. pmid:15720719
- 67. Studer RA, Dessailly BH, Orengo CA. Residue mutations and their impact on protein structure and function: detecting beneficial and pathogenic changes. Biochem J. 2013;449(3):581–94. pmid:23301657
- 68. Cao H, Wang J, He L, Qi Y, Zhang JZ. DeepDDG: predicting the stability change of protein point mutations using neural networks. J Chem Inf Model. 2019;59(4):1508–14. pmid:30759982
- 69. Li Y, Fang J. PROTS-RF: a robust model for predicting mutation-induced protein stability changes. PLoS One. 2012;7(10):e47247. pmid:23077576
- 70. Scandurra R, Consalvi V, Chiaraluce R, Politi L, Engel PC. Protein thermostability in extremophiles. Biochimie. 1998;80(11):933–41. pmid:9893953
- 71. DePristo MA, Weinreich DM, Hartl DL. Missense meanderings in sequence space: a biophysical view of protein evolution. Nat Rev Genet. 2005;6(9):678–87. pmid:16074985
- 72. Webb B, Sali A. Comparative protein structure modeling using MODELLER. Curr Protoc Bioinformatics. 2016;54:5.6.1-5.6.37. pmid:27322406
- 73. Kebabci N, Timucin AC, Timucin E. Toward compilation of balanced protein stability data sets: flattening the ΔΔG curve through systematic enrichment. J Chem Inf Model. 2022;62(5):1345–55. pmid:35201762
- 74. Capriotti E, Fariselli P, Rossi I, Casadio R. A three-state prediction of single point mutations on protein stability changes. BMC Bioinformatics. 2008;9 Suppl 2(Suppl 2):S6. pmid:18387208
- 75. Pucci F, Bernaerts KV, Kwasigroch JM, Rooman M. Quantification of biases in predictions of protein stability changes upon mutations. Bioinformatics. 2018;34(21):3659–65. pmid:29718106
- 76. Thiltgen G, Goldstein RA. Assessing predictors of changes in protein stability upon mutation using self-consistency. PLoS One. 2012;7(10):e46084. pmid:23144695
- 77. Fariselli P, Martelli PL, Savojardo C, Casadio R. INPS: predicting the impact of non-synonymous variations on protein stability from sequence. Bioinformatics. 2015;31(17):2816–21. pmid:25957347
- 78. Rodrigues CHM, Pires DEV, Ascher DB. DynaMut2: Assessing changes in stability and flexibility upon single and multiple point missense mutations. Protein Sci. 2021;30(1):60–9. pmid:32881105
- 79. Li B, Yang YT, Capra JA, Gerstein MB. Predicting changes in protein thermodynamic stability upon point mutation with deep 3D convolutional neural networks. PLoS Comput Biol. 2020;16(11):e1008291. pmid:33253214
- 80. Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, Smola A. A kernel two-sample test. The Journal of Machine Learning Research. 2012;13(1):723–73.
- 81. Volkova S. An overview on data augmentation for machine learning. In: International Scientific and Practical Conference Digital and Information Technologies in Economics and Management. 2023. p. 143–54.
- 82. Mohammadi A, Zahiri J, Mohammadi S, Khodarahmi M, Arab SS. PSSMCOOL: a comprehensive R package for generating evolutionary-based descriptors of protein sequences from PSSM profiles. Biol Methods Protoc. 2022;7(1):bpac008. pmid:35388370
- 83. Capra JA, Singh M. Predicting functionally important residues from sequence conservation. Bioinformatics. 2007;23(15):1875–82. pmid:17519246
- 84. Parthasarathy S, Murthy MR. Protein thermal stability: insights from atomic displacement parameters (B values). Protein Eng. 2000;13(1):9–13. pmid:10679524
- 85. Robson B, Suzuki E. Conformational properties of amino acid residues in globular proteins. J Mol Biol. 1976;107(3):327–56. pmid:1003471
- 86. Muñoz V, Serrano L. Intrinsic secondary structure propensities of the amino acids, using statistical phi-psi matrices: comparison with experimental scales. Proteins. 1994;20(4):301–11. pmid:7731949
- 87. Qian N, Sejnowski TJ. Predicting the secondary structure of globular proteins using neural network models. J Mol Biol. 1988;202(4):865–84. pmid:3172241
- 88. Cao Y, Miao Q-G, Liu J-C, Gao L. Advance and prospects of AdaBoost algorithm. Acta Automatica Sinica. 2013;39(6):745–58.
- 89. de Ville B. Decision trees. WIREs Computational Stats. 2013;5(6):448–55.
- 90. Larose DT, Larose CD. K-nearest neighbor algorithm. Wiley Data and Cybersecurity; 2014.
- 91. Ranstam J, Cook JA. LASSO regression. British Journal of Surgery. 2018;105(10):1348.
- 92. Yan J, Xu Y, Cheng Q, Jiang S, Wang Q, Xiao Y, et al. LightGBM: accelerated genomically designed crop breeding through ensemble learning. Genome Biol. 2021;22(1):271. pmid:34544450
- 93. Su X, Yan X, Tsai C. Linear regression. WIREs Computational Stats. 2012;4(3):275–94.
- 94. Popescu MC, Balas VE, Perescu-Popescu L, Mastorakis N. Multilayer perceptron and neural networks. WSEAS Transactions on Circuits and Systems. 2009;8(7):579–88.
- 95. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32.
- 96. Schulz E, Speekenbrink M, Krause A. A tutorial on Gaussian process regression: modelling, exploring, and exploiting functions. Journal of Mathematical Psychology. 2018;85:1–16.
- 97. Sabzekar M, Hasheminejad SMH. Robust regression using support vector regressions. Chaos, Solitons & Fractals. 2021;144:110738.
- 98. Snoek J, Larochelle H, Adams RP. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems. 2012;25.
- 99. Witz G, van Nimwegen E, Julou T. Initiation of chromosome replication controls both division and replication cycles in E. coli through a double-adder mechanism. Elife. 2019;8:e48063. pmid:31710292
- 100. Esters L, Rutgersson A, Nilsson E, Sahlée E. Non-local impacts on eddy-covariance air–lake CO2 fluxes. Boundary-Layer Meteorol. 2020;178(2):283–300.
- 101. Starr E, Goldfarb B. Binned scatterplots: a simple tool to make research easier and better. Strategic Management Journal. 2020;41(12):2261–74.
- 102. Steiger JH. Tests for comparing elements of a correlation matrix. Psychological Bulletin. 1980;87(2):245–51.
- 103. Diedenhofen B, Musch J. cocor: a comprehensive solution for the statistical comparison of correlations. PLoS One. 2015;10(3):e0121945. pmid:25835001
- 104. Umerenkov D, Nikolaev F, Shashkova TI, Strashnov PV, Sindeeva M, Shevtsov A, et al. PROSTATA: a framework for protein stability assessment using transformers. Bioinformatics. 2023;39(11):btad671. pmid:37935419
- 105. Mishra SK. PSP-GNM: predicting protein stability changes upon point mutations with a Gaussian network model. Int J Mol Sci. 2022;23(18):10711. pmid:36142614
- 106. Kumar V, Minz S. Feature selection. SmartCR. 2014;4(3):211–29.
- 107. Moshrefi A, Tawfik HH, Elsayed MY, Nabki F. Industrial fault detection employing meta ensemble model based on contact sensor ultrasonic signal. Sensors (Basel). 2024;24(7):2297. pmid:38610508
- 108. Wang J, Zhao J, Hua C, Zhang J. Constructing real-time meteorological forecast method of short-term cyanobacteria bloom area index changes in the Lake Taihu. Sustainability. 2025;17(18):8376.
- 109. Khan Rifat MdA, Kabir A, Huq A. An explainable machine learning approach to traffic accident fatality prediction. Procedia Computer Science. 2024;246:1905–14.
- 110. Khaleghi Ardabili A, Rice S, Bonavia AS. Diagnosing sepsis through proteomic insights: findings from a prospective ICU cohort. medRxiv. 2025.
- 111. Xu H, Chen Y, Zhang D. Worth of prior knowledge for enhancing deep learning. Nexus. 2024;1(1):100003.
- 112. Dehouck Y, Kwasigroch JM, Gilis D, Rooman M. PoPMuSiC 2.1: a web server for the estimation of protein stability changes upon mutation and sequence optimality. BMC Bioinformatics. 2011;12:151. pmid:21569468
- 113. Lac L, Leung CK, Hu P. Computational frameworks integrating deep learning and statistical models in mining multimodal omics data. J Biomed Inform. 2024;152:104629. pmid:38552994
- 114. Zheng J-X, Li X, Zhu J, Guan S-Y, Zhang S-X, Wang W-M. Interpretable machine learning for predicting chronic kidney disease progression risk. Digit Health. 2024;10:20552076231224225. pmid:38235416
- 115. Chauhan NK, Singh K. A review on conventional machine learning vs deep learning. In: 2018 International Conference on Computing, Power and Communication Technologies (GUCON). 2018. p. 347–52.
- 116. Attari V, Arroyave R. Decoding non-linearity and complexity: deep tabular learning approaches for materials science. Digital Discovery. 2025;4(10):2765–80.
- 117. McCarroll N, McShane P, O’Connell E, Curran K, Singh M, McNamee E, et al. Evaluating shallow and deep learning strategies for legal text classification of clauses in non-disclosure agreements. SN COMPUT SCI. 2025;6(7):784.
- 118. Johnston WJ, Fusi S. Abstract representations emerge naturally in neural networks trained to perform multiple tasks. Nat Commun. 2023;14(1):1040. pmid:36823136
- 119. Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell. 2022;44(10):7112–27. pmid:34232869
- 120. Zhang S, Tong H, Xu J, Maciejewski R. Graph convolutional networks: a comprehensive review. Comput Soc Netw. 2019;6(1):11. pmid:37915858
- 121. Kulikova AV, Diaz DJ, Loy JM, Ellington AD, Wilke CO. Learning the local landscape of protein structures with convolutional neural networks. J Biol Phys. 2021;47(4):435–54. pmid:34751854
- 122. Becktel WJ, Schellman JA. Protein stability curves. Biopolymers. 1987;26(11):1859–77. pmid:3689874
- 123. Lee H-T, Cheon H-R, Lee S-H, Shim M, Hwang H-J. Risk of data leakage in estimating the diagnostic performance of a deep-learning-based computer-aided system for psychiatric disorders. Sci Rep. 2023;13(1):16633. pmid:37789047
- 124. Duan K-B, Rajapakse JC, Wang H, Azuaje F. Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Trans Nanobioscience. 2005;4(3):228–34. pmid:16220686
- 125. Li L, Cui X, Yu S, Zhang Y, Luo Z, Yang H, et al. PSSP-RFE: accurate prediction of protein structural class by recursive feature extraction from PSI-BLAST profile, physical-chemical property and functional annotations. PLoS One. 2014;9(3):e92863. pmid:24675610
- 126. Liu W, Zhai J, Ding H, He X. The research of algorithm for protein subcellular localization prediction based on SVM-RFE. In: 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). 2017. p. 1–6.
- 127. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Machine Learning. 2002;46(1–3):389–422.
- 128. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O. Scikit-learn: machine learning in Python. The Journal of Machine Learning Research. 2011;12:2825–30.
- 129. Moradi R, Berangi R, Minaei B. A survey of regularization strategies for deep models. Artif Intell Rev. 2019;53(6):3947–86.
- 130. Osei-Bryson KM. Post-pruning in regression tree induction: an integrated approach. Expert Systems with Applications. 2008;34(2):1481–90.
- 131. Jitkrittum W, Hachiya H, Sugiyama M. Feature selection l1-penalized squared-loss mutual information. IEICE Trans Inf Syst. 2013;96(7):1513–24.
- 132. Gao H, Shao X. Two sample testing in high dimension via maximum mean discrepancy. Journal of Machine Learning Research. 2023;24(304):1–33.
- 133. Shekhar S, Kim I, Ramdas A. A permutation-free kernel two-sample test. Advances in Neural Information Processing Systems. 2022;35:18168–80.
- 134. Ding T, Li Z, Zhang Y. Testing the equality of distributions using integrated maximum mean discrepancy. Journal of Statistical Planning and Inference. 2025;236:106246.
- 135. Borgwardt KM, Gretton A, Rasch MJ, Kriegel H-P, Schölkopf B, Smola AJ. Integrating structured biological data by Kernel Maximum Mean Discrepancy. Bioinformatics. 2006;22(14):e49-57. pmid:16873512
- 136. Wilson GA, Martin SA. An empirical comparison of two methods for testing the significance of a correlation matrix. Educational and Psychological Measurement. 1983;43(1):11–4.
- 137. Sedgwick PM, Hammer A, Kesmodel US, Pedersen LH. Current controversies: null hypothesis significance testing. Acta Obstet Gynecol Scand. 2022;101(6):624–7. pmid:35451497