Deep geometric representations for modeling effects of mutations on protein-protein binding affinity

doi:10.1371/journal.pcbi.1009284

Fig 1.

Schematic overview of the GeoPPI pipeline.

(A) The self-supervised learning scheme, during which a geometric encoder learns to reconstruct the original structure of a complex given a perturbed one (in which the side-chain torsion angles of a residue are randomly sampled). The geometric encoder is a neural network that performs neural message passing on graph structures [19, 20]. Its input is the graph structure of a complex; to reduce computational complexity, only atoms within a predefined distance of either the mutated residues or the interface residues are considered (Materials and methods). (B) The prediction process of GeoPPI, where the trained geometric encoder produces geometric representations for a given wild-type complex and its mutant, and a gradient boosting tree (GBT) takes these representations as input to predict the corresponding affinity change.
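The distance-based atom selection mentioned in (A) can be illustrated with a minimal sketch. This is not the GeoPPI implementation: the function name `select_atoms`, the coordinate format, and the cutoff value are assumptions made for illustration, and the actual cutoff and graph construction are given in Materials and methods.

```python
import math

def select_atoms(coords, anchor_indices, cutoff):
    """Return indices of atoms lying within `cutoff` of any anchor atom.

    coords: list of (x, y, z) tuples, one per atom (hypothetical format).
    anchor_indices: indices of atoms belonging to the mutated or
        interface residues (assumed to be known in advance).
    cutoff: distance threshold, in the same units as the coordinates.
    """
    anchors = [coords[i] for i in anchor_indices]
    kept = []
    for idx, pos in enumerate(coords):
        # Keep the atom if it is close to at least one anchor atom.
        if any(math.dist(pos, a) <= cutoff for a in anchors):
            kept.append(idx)
    return kept

# Toy example: three atoms on a line, the first one is the anchor.
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
selected = select_atoms(toy_coords, anchor_indices=[0], cutoff=2.0)
```

Only the selected atoms would then become nodes of the graph passed to the encoder, which keeps the graph small regardless of the total size of the complex.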

Fig 2.

Visualization of the representation space of individual elements in the protein structure.

(A) The learned representation space of the atoms with different perturbed distances. (B) The representation space of the perturbed atoms by initialized neural weights (i.e., the weights are not tuned by self-supervised learning). (C) The learned representation space of α-carbon atoms, where the color stands for their locations (on or not on the interface) in complexes. (D) The space of α-carbon atoms by initialized neural weights. (E) The learned amino acid space, where the color indicates the corresponding group. (F) The amino acid space by initialized neural weights.

Table 1.

Comparison of individual methods for the single-point mutations in terms of Pearson correlation coefficient (R_P) and root-mean-square error (RMSE) on the S645, S1131, S4169 and S8338 datasets.

The methods are evaluated by split-by-structure cross-validation (SSCV), where ECOD is used in the data split so that complexes similar to the training data do not appear in the test set. A dash indicates that results for the corresponding method are not available. ^†: Results were obtained based on the released source code. ^‡: Results were obtained via the released tool.
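The idea behind a split-by-structure scheme can be sketched as a grouped fold assignment, in which all samples sharing a structural group land in the same fold. The sketch below is an assumption-laden illustration rather than the authors' procedure: the group labels stand in for whatever ECOD-derived grouping is used, and the round-robin assignment is chosen only for simplicity.

```python
def split_by_group(groups, n_folds):
    """Assign sample indices to folds so that no group spans two folds.

    groups: one group label per sample (hypothetical structural
        cluster identifiers, e.g. derived from ECOD).
    n_folds: number of cross-validation folds.
    """
    unique_groups = sorted(set(groups))
    # Round-robin assignment of whole groups to folds.
    fold_of_group = {g: i % n_folds for i, g in enumerate(unique_groups)}
    folds = [[] for _ in range(n_folds)]
    for idx, g in enumerate(groups):
        folds[fold_of_group[g]].append(idx)
    return folds

# Toy example: five mutations drawn from three structural groups.
toy_groups = ["a", "a", "b", "c", "b"]
toy_folds = split_by_group(toy_groups, n_folds=2)
```

Because whole groups are assigned at once, a model tested on one fold has never seen a structurally similar complex during training, at the cost of folds that may differ in size.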

Table 2.

Comparison of individual methods for the multi-point mutations in SSCV in terms of Pearson correlation coefficient (R_P) and root-mean-square error (RMSE) on the M1101 and M1707 datasets.

A dash indicates that results for the corresponding method are not available. ^†: Results were obtained based on the released data. ^‡: Results were obtained via the released tool.
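For reference, the two regression metrics reported in Tables 1 and 2 can be computed as in the following sketch. The function names and the toy values are illustrative assumptions, not data from the paper.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def rmse(x, y):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Toy values standing in for experimental and predicted affinity changes.
experimental = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
r_p = pearson_r(experimental, predicted)
err = rmse(experimental, predicted)
```

The two metrics are complementary: R_P measures how well the predictions track the ranking and trend of the experimental values, while RMSE measures the absolute size of the errors in the units of the affinity change.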

Fig 3.

Performance of the prediction models in the leave-one-structure-out cross-validation (CV).

(A) Distributions of the per-structure Pearson correlation coefficients of GeoPPI and TopGBT on the S645 dataset. (B) Distributions of the per-structure Pearson correlation coefficients of GeoPPI and MutaBind2 on the M1707 dataset. (C) The experimental values of the affinity changes and those predicted by GeoPPI on S645. (D) The experimental values of the affinity changes and those predicted by GeoPPI on M1707.

Fig 4.

Comparison of GeoPPI with the baseline methods in terms of prediction performance and computational speed.

(A) An example of the most conservative mutation and the predicted binding affinity changes by GeoPPI and TopGBT. (B) Prediction performance of GeoPPI and TopGBT on a subset consisting of the most conservative mutations in the S645 dataset. (C) Computational time (second/sample) needed for the prediction of individual methods.

Table 3.

Comparison of prediction performance of GeoPPI with that of different baseline methods on the S641 dataset.

In this test, the S1131 dataset is the training dataset of GeoPPI, TopGBT and TopNetTree. In addition to the regression performance, a binary classification experiment is conducted to evaluate the ability to classify mutations as stabilizing or destabilizing, in terms of classification accuracy (ACC), the area under the receiver operating characteristic curve (AUC) and the Matthews correlation coefficient (MCC).
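Two of the classification metrics named above, ACC and MCC, follow directly from the confusion matrix, as the sketch below shows. It takes binary labels as input; how affinity changes are thresholded into stabilizing and destabilizing classes is not stated in this caption, so that step is deliberately left out, and the function name is an assumption.

```python
import math

def acc_and_mcc(y_true, y_pred):
    """Return (accuracy, Matthews correlation coefficient) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / len(y_true)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return acc, mcc

# Toy labels: 1 and 0 stand for the two mutation classes.
toy_true = [1, 1, 0, 0]
toy_pred = [1, 0, 0, 0]
toy_acc, toy_mcc = acc_and_mcc(toy_true, toy_pred)
```

MCC is often reported alongside ACC because it remains informative when the two classes are imbalanced, which accuracy alone can hide.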

Fig 5.

A case study on the antibodies (Abs) that neutralize SARS-CoV-2 by binding with the receptor-binding domain (RBD) of the spike protein.

(A) Structurally similar SARS-CoV-2 neutralizing Abs and their CDR3 sequences (S9 Table). (B) Pairwise prediction performance between structurally similar Abs. The structures of these Abs have not been solved and are approximated by homology modeling. (C and D) Prediction performance of GeoPPI and TopGBT on the single-point mutations of SARS-CoV-2 complexed with individual Abs. This newly collected single-point mutation dataset (S10 Table) contains 98 mutations and the corresponding binding affinity changes, covering the complexes of SARS-CoV-2 bound to CR3022 [40], C002, C104, C105, C110, C121, C119, C135 and C144 [41]. Among them, GeoPPI obtains the highest correlation on the variants of C110. (E) The average predicted affinity changes of the mutations on each residue on the interface of C110 complexed with SARS-CoV-2. (F) The structure around site A107 on C110. (G) The structure around site W107 on C110 with the mutation A107W.
