Fig 1.

An overview of DDGWizard.

A: Integrate 12 computational resources [18,32–42] into a feature calculation pipeline. B: Collect data from the VariBench [43] database, enrich the collected data with features computed by the pipeline to obtain the DDGWizard dataset, and split it into training and test sets for subsequent ML tasks. C: Perform feature selection with the RFE (recursive feature elimination) algorithm, followed by a further analysis of feature importance. D: Develop a prediction model with the XGBoost [44] algorithm based on the optimal features. E: Evaluate the developed model and compare it with other representative prediction methods on identical cross-validation sets, the test set, the S669 dataset [45], and the p53 dataset [25].

Fig 2.

The feature calculation pipeline of DDGWizard.

The pipeline takes raw data as input (PDB ID, amino acid substitution, chain ID, pH, temperature, and ΔΔG value). It uses the PDB ID to download the wild-type protein structure file from the RCSB PDB database [46], employs Modeller [72] to construct the mutant protein structure file, and calls a series of computational resources [18,32–42] to calculate features, ultimately outputting a dataset containing 1,547 calculated features.
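The raw input described above can be sketched as a small record type. This is only an illustration: the class name, the field names, and the "A123G" substitution format are assumptions rather than DDGWizard's actual interface, and the URL pattern is the public RCSB file-download address, which the caption does not specify.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RawRecord:
    """Hypothetical container for one raw input row (field names are illustrative)."""
    pdb_id: str
    substitution: str  # assumed format "A123G": wild type, position, mutant
    chain_id: str
    ph: float
    temperature: float
    ddg: float         # experimental ΔΔG value


# Assumed one-letter substitution format; real datasets may use other notations.
SUBSTITUTION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitution(code):
    """Split 'A123G' into ('A', 123, 'G'); raise ValueError for other formats."""
    match = SUBSTITUTION_RE.match(code)
    if match is None:
        raise ValueError(f"unrecognized substitution: {code!r}")
    wild_type, position, mutant = match.groups()
    return wild_type, int(position), mutant


def rcsb_download_url(pdb_id):
    """Build the public RCSB download URL for a wild-type structure file."""
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"
```

Building the URL separately from the download keeps the sketch free of network access; an actual pipeline would fetch the file and pass it to the structure modelling step.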

Table 1.

The computational resources used for feature calculation.

Fig 3.

The workflow of dataset construction and feature enrichment.

Fig 4.

t-SNE plot for both direct and reverse mutation data.

The t-SNE plot shows the distribution of direct and reverse mutation data, projected from high-dimensional feature spaces into two dimensions. The blue points represent direct mutation data, while the red points represent reverse mutation data. MMD² quantifies the difference in feature distributions between the two types of data.
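As a rough illustration of how a squared maximum mean discrepancy can be computed between two sets of feature vectors, the sketch below uses the biased estimator with an RBF kernel. The kernel choice, the `gamma` bandwidth, and the function names are assumptions; the caption does not state which variant was used.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2) between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2(xs, ys, gamma=1.0):
    """Biased MMD^2 estimate: E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]."""
    return (
        mean_kernel(xs, xs, gamma)
        + mean_kernel(ys, ys, gamma)
        - 2.0 * mean_kernel(xs, ys, gamma)
    )
```

A value near zero indicates that the two samples are hard to distinguish under the chosen kernel, while larger values indicate more distinct distributions.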

Fig 5.

R² of each fold from cross-validation before and after feature selection.

Fig 6.

Feature selection and feature importance ranking.

A: The flowchart of feature selection based on the RFE algorithm. B: The RFE results show how the average R² of the 20-fold pair-level cross-validation changes as the number of RFE rounds increases and the number of features decreases. C: The top 10 most important features among the 69 features.
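The general shape of a recursive feature elimination loop can be sketched as follows. This is a generic illustration, not the authors' implementation: `importance_fn` and `score_fn` are hypothetical callables standing in for refitting a model on the current subset and for the average cross-validated R², and keeping the best-scoring subset is an assumed stopping rule.

```python
def recursive_feature_elimination(features, importance_fn, score_fn,
                                  drop_per_round=1, min_features=1):
    """Return (best_subset, history); history holds (n_features, score) per round.

    importance_fn(subset) -> dict mapping each feature to its importance
                             (hypothetical: would refit a model on the subset).
    score_fn(subset)      -> float, higher is better
                             (hypothetical: e.g. average cross-validated R²).
    """
    current = list(features)
    history = []
    best_subset, best_score = list(current), float("-inf")
    while True:
        score = score_fn(current)
        history.append((len(current), score))
        if score > best_score:
            best_subset, best_score = list(current), score
        if len(current) <= min_features:
            break
        # Re-rank on the current subset, then drop the least important features.
        importances = importance_fn(current)
        n_drop = min(drop_per_round, len(current) - min_features)
        ranked = sorted(current, key=lambda name: importances[name])
        dropped = set(ranked[:n_drop])
        current = [name for name in current if name not in dropped]
    return best_subset, history
```

Plotting the `history` pairs gives a curve of score against the number of remaining features, similar in spirit to panel B.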

Table 2.

Details of the 10 most important features.

Table 3.

Average R² for the model selection under the same 20-fold pair-level cross-validation.

Fig 7.

Prediction results of DDGWizard’s model from the cross-validation.

A: Scatter plot comparing all predicted and true values. The red line indicates the overall regression fit. The plot also reports the regression equation, R², γ, and σ values for the overall prediction. B: Binned scatter plot comparing average predicted and experimental values within 10 bins containing equal numbers of data points. The error bars represent the standard error of the residuals between the average predicted and true values within each bin. C: Scatter plot of the prediction results from the 20-fold cross-validation on the 3,970 mutation ΔΔG data points where the PSSM score of the mutant amino acid is less than 0. D: Scatter plot of the prediction results from the 20-fold protein-level cross-validation on the 30 proteins that have mutual sequence similarity below 30%.
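The equal-count binning behind a plot like panel B can be sketched as below. The function name and the returned keys are illustrative, and sorting by the experimental value before binning is an assumption; the caption does not say which axis defines the bins.

```python
import math
import statistics


def equal_count_bins(true_vals, pred_vals, n_bins=10):
    """Summarize predictions in bins holding (nearly) equal numbers of points.

    Points are sorted by true value (an assumption) and cut into n_bins chunks.
    Each summary holds the mean true value, the mean predicted value, and the
    standard error of the residuals (predicted minus true) within the bin.
    """
    pairs = sorted(zip(true_vals, pred_vals))
    n = len(pairs)
    summaries = []
    for i in range(n_bins):
        chunk = pairs[i * n // n_bins:(i + 1) * n // n_bins]
        if not chunk:
            continue  # fewer points than bins
        residuals = [pred - true for true, pred in chunk]
        if len(chunk) > 1:
            sem = statistics.stdev(residuals) / math.sqrt(len(chunk))
        else:
            sem = 0.0  # standard error is undefined for one point; use 0
        summaries.append({
            "mean_true": statistics.fmean(true for true, _ in chunk),
            "mean_pred": statistics.fmean(pred for _, pred in chunk),
            "sem": sem,
            "n": len(chunk),
        })
    return summaries
```

Each summary corresponds to one point with an error bar in a binned scatter plot.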

Table 4.

Comparison study on the inclusion of reverse mutation data.

Fig 8.

Pearson correlation coefficients of three prediction methods evaluated with the identical cross-validation sets.

Table 5.

Comparison results of three prediction methods evaluated with the identical cross-validation sets.

Fig 9.

Pearson correlation coefficients of eight prediction methods evaluated with the test set.

Table 6.

Comparison results of eight prediction methods evaluated with the test set.

Fig 10.

Pearson correlation coefficients of eight prediction methods evaluated with the dataset S669.

Table 7.

Comparison results of eight prediction methods evaluated with the dataset S669.

Fig 11.

Pearson correlation coefficients of eight prediction methods evaluated with the p53 dataset.

Table 8.

Comparison results of eight prediction methods evaluated with the p53 dataset.
