Fig 1.
A: Integrate 12 computational resources [18,32–42] to develop a feature calculation pipeline. B: Collect data from the VariBench [43] database, enrich the collected data with features calculated by the pipeline to obtain the DDGWizard dataset, and then split it into training and test sets for subsequent ML tasks. C: Perform feature selection based on the recursive feature elimination (RFE) algorithm, followed by a further analysis of feature importance. D: Develop a prediction model using the XGBoost [44] algorithm based on the optimal features. E: Evaluate the developed model and compare it with other representative prediction methods using identical cross-validation sets, the test set, the S669 dataset [45], and the p53 dataset [25].
Fig 2.
The feature calculation pipeline of DDGWizard.
The pipeline takes raw data as input (PDB ID, amino acid substitution, chain ID, pH, temperature, and ΔΔG value). It uses the PDB ID to download the wild-type protein structure file from the RCSB PDB database [46], employs Modeller [72] to construct the mutant protein structure file, and calls a series of computational resources [18,32–42] to calculate features, ultimately outputting a dataset containing 1,547 calculated features.
Table 1.
The computational resources used for feature calculation.
Fig 3.
The workflow of dataset construction and feature enrichment.
Fig 4.
t-SNE plot for both direct and reverse mutation data.
The t-SNE plot shows the distribution of direct and reverse mutation data, projected from the high-dimensional feature space into two dimensions. The blue points represent direct mutation data, while the red points represent reverse mutation data. MMD² quantifies the difference in feature distributions between the two types of data.
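As an illustration of how a squared maximum mean discrepancy (MMD²) can be estimated between two sets of feature vectors, the following is a minimal sketch. The Gaussian (RBF) kernel, the `gamma` bandwidth, the biased estimator, and the toy vectors are all assumptions made for illustration; the caption does not state which kernel or estimator was used.

```python
import math

def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel exp(-gamma * ||x - y||^2). Kernel choice is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mmd2(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples xs and ys."""
    def mean_kernel(a_rows, b_rows):
        total = sum(rbf_kernel(a, b, gamma) for a in a_rows for b in b_rows)
        return total / (len(a_rows) * len(b_rows))
    # MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Toy 2-D feature vectors (hypothetical values, not taken from the DDGWizard dataset).
direct = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3]]
reverse = [[1.0, 1.1], [0.9, 1.2], [1.1, 0.8]]

print("MMD2(direct, direct): ", mmd2(direct, direct))
print("MMD2(direct, reverse):", mmd2(direct, reverse))
```

A value near zero suggests the two samples are drawn from similar distributions, while larger values indicate a greater difference. The result depends on the kernel bandwidth, so values are only comparable under the same settings.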
Fig 5.
R² of each fold from cross-validation before and after feature selection.
Fig 6.
Feature selection and feature importance ranking.
A: The flowchart of feature selection based on the RFE algorithm. B: The RFE results, showing how the average R² of the 20-fold pair-level cross-validation changes as the number of RFE rounds increases and the number of features decreases. C: The top 10 most important features among the 69 features.
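The elimination loop behind RFE can be sketched as follows. This is a generic illustration, not the DDGWizard implementation: the function names are hypothetical, one feature is dropped per round (an assumption), and the importance and score functions are simple correlation-based stand-ins for the model-derived feature importance and cross-validated R² described in the caption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def rfe(columns, target, importance_fn, score_fn, min_features=1):
    """Score the current subset, drop the least important feature, and repeat."""
    remaining = list(columns)
    history = []  # one (feature names, score) entry per round
    while True:
        subset = {name: columns[name] for name in remaining}
        history.append((list(remaining), score_fn(subset, target)))
        if len(remaining) <= min_features:
            break
        importances = importance_fn(subset, target)
        weakest = min(remaining, key=lambda name: importances[name])
        remaining.remove(weakest)
    best_features, best_score = max(history, key=lambda entry: entry[1])
    return best_features, best_score, history

# Stand-in importance: |Pearson r| with the target (NOT a model-based importance).
def toy_importance(subset, target):
    return {name: abs(pearson_r(values, target)) for name, values in subset.items()}

# Stand-in score: mean |r| over the subset (NOT a cross-validated R^2).
def toy_score(subset, target):
    importances = toy_importance(subset, target)
    return sum(importances.values()) / len(importances)

# Hypothetical toy data.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
columns = {
    "signal": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],
    "weak": [2.0, 1.0, 3.0, 2.5, 4.0, 3.0],
    "noise": [0.5, -0.5, 0.5, -0.5, 0.5, -0.5],
}

best_features, best_score, history = rfe(columns, target, toy_importance, toy_score)
for names, score in history:
    print(len(names), "features:", names, "score = %.3f" % score)
print("best subset:", best_features)
```

In practice, `importance_fn` would refit a model on the current subset each round and `score_fn` would run the cross-validation; the subset with the best score across all rounds is the one retained.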
Table 2.
Details of the 10 most important features.
Table 3.
Average R² for the model selection under the same 20-fold pair-level cross-validation.
Fig 7.
Prediction results of DDGWizard’s model from the cross-validation.
A: Scatter plot comparing all predicted and true values. The red line indicates the overall regression fit. The plot also reports the regression equation, R², γ, and σ values for the overall prediction. B: Binned scatter plot comparing average predicted and experimental values within 10 bins that contain equal numbers of data points. The error bars represent the standard error of the residuals between the average predicted and true values within each bin. C: Scatter plot of the prediction results from the 20-fold cross-validation on the 3,970 mutation ΔΔG data points where the PSSM score of the mutant amino acid is less than 0. D: Scatter plot of the prediction results from the 20-fold protein-level cross-validation on the 30 proteins that have mutual sequence similarity less than 30%.
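The equal-count binning used for a plot like panel B can be sketched as below. This is a minimal illustration under stated assumptions: data points are sorted and binned by their experimental (true) value, the residual is taken as predicted minus true, and the function name and toy values are hypothetical.

```python
import math

def equal_count_bins(true_vals, pred_vals, n_bins=10):
    """Sort by true value, split into n_bins groups of (nearly) equal size,
    and return the per-bin means and the standard error of the residuals."""
    pairs = sorted(zip(true_vals, pred_vals))
    n = len(pairs)
    summaries = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # fewer data points than bins
        count = len(chunk)
        residuals = [pred - true for true, pred in chunk]
        mean_res = sum(residuals) / count
        if count > 1:
            variance = sum((r - mean_res) ** 2 for r in residuals) / (count - 1)
            sem = math.sqrt(variance / count)  # standard error of the mean residual
        else:
            sem = 0.0
        summaries.append({
            "count": count,
            "mean_true": sum(true for true, _ in chunk) / count,
            "mean_pred": sum(pred for _, pred in chunk) / count,
            "sem_residual": sem,
        })
    return summaries

# Hypothetical toy data: 40 points with predictions compressed toward zero.
true_ddg = [i * 0.5 - 5.0 for i in range(40)]
pred_ddg = [0.8 * t + 0.1 for t in true_ddg]

bins = equal_count_bins(true_ddg, pred_ddg, n_bins=10)
for row in bins:
    print("n=%d  mean true=%.2f  mean pred=%.2f  SEM=%.3f"
          % (row["count"], row["mean_true"], row["mean_pred"], row["sem_residual"]))
```

Plotting `mean_pred` against `mean_true` with `sem_residual` as error bars gives a binned view in which systematic over- or under-prediction at the extremes is easier to see than in the raw scatter plot.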
Table 4.
Comparison study on the inclusion of reverse mutation data.
Fig 8.
Pearson correlation coefficients of three prediction methods evaluated with the identical cross-validation sets.
Table 5.
Comparison results of three prediction methods evaluated with the identical cross-validation sets.
Fig 9.
Pearson correlation coefficients of eight prediction methods evaluated with the test set.
Table 6.
Comparison results of eight prediction methods evaluated with the test set.
Fig 10.
Pearson correlation coefficients of eight prediction methods evaluated with the S669 dataset.
Table 7.
Comparison results of eight prediction methods evaluated with the S669 dataset.
Fig 11.
Pearson correlation coefficients of eight prediction methods evaluated with the p53 dataset.
Table 8.
Comparison results of eight prediction methods evaluated with the p53 dataset.