Fig 1.
Pipeline of BindProf for predicting protein-binding affinity using features derived from interface structural profiles, wild-type (WT) and mutant sequences, and physics-based scoring of the structures of the WT and mutant complexes.
(1) Interface profile score features are derived by scoring profiles built from structural alignments of structurally similar interfaces; an interface similarity cutoff defines which aligned sequences are used to build the profile. (2) Physics-based scores are formed at the residue or atomic level by modeling the mutant monomeric protein and complex and evaluating the difference in energy. (3) Sequence features are formed from the difference between the WT and mutant sequences in the number of hydrophobic (V, I, L, M, F, W, or C), aromatic (Y, F, or W), charged (R, K, D, or E), hydrogen bond accepting (D, E, N, H, Q, S, T, or Y), and hydrogen bond donating (R, K, W, N, Q, H, S, T, or Y) residues, along with the difference in amino acid volume calculated from the sequence.
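The residue-count part of the sequence features in step (3) can be sketched as follows. This is a minimal illustration, not the BindProf implementation: the function names and the mutant-minus-WT sign convention are assumptions, and the amino acid volume term is omitted because the caption does not give a volume table.

```python
# Residue classes exactly as listed in the Fig 1 caption (one-letter codes).
RESIDUE_CLASSES = {
    "hydrophobic": set("VILMFWC"),
    "aromatic": set("YFW"),
    "charged": set("RKDE"),
    "hbond_acceptor": set("DENHQSTY"),
    "hbond_donor": set("RKWNQHSTY"),
}


def count_classes(sequence):
    """Count the residues of each class in a one-letter amino acid sequence."""
    return {
        name: sum(1 for aa in sequence if aa in members)
        for name, members in RESIDUE_CLASSES.items()
    }


def sequence_features(wt, mutant):
    """Return per-class count differences (assumed convention: mutant minus WT)."""
    wt_counts = count_classes(wt)
    mut_counts = count_classes(mutant)
    return {name: mut_counts[name] - wt_counts[name] for name in RESIDUE_CLASSES}
```

For a K-to-W substitution, for example, the charged count drops by one while the hydrophobic and aromatic counts each rise by one; the donor count is unchanged because both K and W are listed as hydrogen bond donors.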
Fig 2.
Comparison of the accuracy of mutant interface profile scores formed from different structural alignment methods in predicting ΔΔG of complex formation.
The iTM-score considers only structural similarity at the interface, Iscore considers structural similarity at the interface and the fraction of native contacts preserved, and PCscore considers both physicochemical and structural similarity at the interface. TM-score considers only structural alignment of the mutated monomeric protein. Profiles are constructed from sequences meeting each cutoff and the predicted ΔΔG values are calculated according to Eq 2.
Fig 3.
Dependence of the accuracy of ΔΔG prediction on the number of sequences that can be aligned at the site of the mutation and on the formation of an adaptive profile mixing sequences from high and low interface similarities.
Only single-site mutations are considered (81% of the total number of mutations). Nseq,mut and Nseq,add are the number of sequences that can be aligned at the site of the mutation and the number of lower similarity sequences added to the profile, respectively. (A) Pearson’s correlation coefficient between predicted and experimental ΔΔG values as a function of the number of sequences that can be aligned at the site of the mutation. (B) Fraction of the total number of single-site mutations as a function of the number of sequences that can be aligned at the site of the mutation. (C) Improvement in accuracy of an adaptive profile mixing sequences from high and low interface similarities over profiles formed purely using high or low interface similarity cutoffs.
Fig 4.
Comparison of the accuracy of interface profile scores in predicting ΔΔG with that of other physical, statistical, and sequence-based potentials for all mutations in the SKEMPI dataset.
See text for a description of each potential.
Fig 5.
Breakdown of the performance of the interface profile score compared to other potentials for different classes of mutations.
Favorable: ΔΔG ≤ 0 kcal/mol; Strongly Favorable: ΔΔG ≤ -1 kcal/mol; Unfavorable: ΔΔG ≥ 0 kcal/mol; Strongly Unfavorable: ΔΔG ≥ 1 kcal/mol; Neutral: ΔΔG ≤ 1 kcal/mol and ≥ -1 kcal/mol. See text for a description of each potential.
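The mutation classes can be expressed as a small labeling function. This sketch assumes symmetric cutoffs of ±1 kcal/mol for the strong classes and a neutral band of -1 to 1 kcal/mol, which is a reading of the caption rather than a confirmed definition; the function name is hypothetical. Because the classes overlap, the function returns a set of labels.

```python
def mutation_classes(ddg):
    """Return every class label that applies to a ΔΔG value (kcal/mol).

    Cutoffs are assumed: strong classes at +/-1 kcal/mol, neutral in [-1, 1].
    """
    labels = set()
    if ddg <= 0:
        labels.add("favorable")
    if ddg <= -1:
        labels.add("strongly favorable")
    if ddg >= 0:
        labels.add("unfavorable")
    if ddg >= 1:
        labels.add("strongly unfavorable")
    if -1 <= ddg <= 1:
        labels.add("neutral")
    return labels
```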
Fig 6.
Illustration of the interface residue types mapped onto the surface of the growth hormone-receptor complex structure (PDB ID: 1A22).
The monomer structure of one of the chains is shown on top with the complex structure on bottom. ‘Core’ residues (blue) are exposed in the monomeric structure but buried in the complex; ‘Support’ residues (green) are partly buried in the monomeric structure and fully buried in the complex; ‘Rim’ residues (orange) are fully exposed in the monomeric structure and partly buried in the complex; ‘Interior’ residues (sky blue) are fully buried in the monomer, while surface residues (red) are fully exposed in both the monomeric and complex structures.
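One way to turn these qualitative definitions into a rule is to compare the relative solvent accessibility (RSA) of a residue in the monomer and in the complex. The sketch below is an assumption-laden illustration: the 25% burial cutoff, the function name, and the use of RSA fractions are not given in the caption and may differ from the scheme used for the figure.

```python
def residue_type(rsa_monomer, rsa_complex, buried_cutoff=0.25):
    """Classify a residue from its RSA (0 to 1) in the monomer and the complex.

    buried_cutoff is an assumed threshold below which a residue counts as buried.
    """
    if rsa_monomer - rsa_complex <= 0:
        # Accessibility does not change on binding: not part of the interface.
        return "interior" if rsa_complex < buried_cutoff else "surface"
    if rsa_monomer < buried_cutoff:
        # Already mostly buried in the monomer, buried further in the complex.
        return "support"
    if rsa_complex > buried_cutoff:
        # Exposed in the monomer and still partly exposed in the complex.
        return "rim"
    # Exposed in the monomer but buried in the complex.
    return "core"
```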
Fig 7.
Median and interquartile ranges of experimental ΔΔG values by interface classification.
Full distributions can be found in the Supporting Information as S1 Fig.
Fig 8.
Median and interquartile ranges of the RMSD of the alignment at the mutation site at (A) low (Iscore = 0.19) and (B) high (Iscore = 0.25) interface similarity.
Fig 9.
Breakdown of the performance of the interface profile score compared to other potentials for different types of interface residues.
See Fig 6 for the definition of the interface residue types.
Fig 10.
Prediction of ΔΔG value by different combinations of the interface profile scores.
(A) Interface profile only; (B) Interface profile and residue-level potentials; (C) Interface profile, residue-level potentials, and atomic-level potentials. In each picture, the left panel shows the overall correlation between predicted and experimental ΔΔG values; the right panel shows the features of the random forest model sorted by their effect on the residual error (right) or on the node purity (a measure of the efficiency of splitting on a feature during the construction of the decision tree) (left). Correlation values are for 10-fold cross-validation repeated three times.
Fig 11.
Accuracy of ΔΔG prediction on a per protein basis after leave-one-protein-out cross-validation for the 24 proteins with more than 10 mutants available based on the standard error of prediction.
Proteins are arranged left to right from the lowest to the highest mean experimental ΔΔG value. The mean standard error across the set increases from 1.11 kcal/mol to 1.33 kcal/mol if the tested protein is left out during training.
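The leave-one-protein-out splitting described here can be sketched as a generator that holds out all mutants of one protein at a time. The function name and the input layout (one protein identifier per mutant) are hypothetical; the default of 11 mirrors the caption's "more than 10 mutants" criterion, and model training and error calculation are left out.

```python
from collections import defaultdict


def leave_one_protein_out(protein_ids, min_mutants=11):
    """Yield (protein, train_indices, test_indices) for each held-out protein.

    protein_ids has one entry per mutant; proteins with fewer than
    min_mutants mutants are never used as the held-out test set.
    """
    by_protein = defaultdict(list)
    for index, protein in enumerate(protein_ids):
        by_protein[protein].append(index)

    for protein, test_idx in by_protein.items():
        if len(test_idx) < min_mutants:
            continue
        held_out = set(test_idx)
        train_idx = [i for i in range(len(protein_ids)) if i not in held_out]
        yield protein, train_idx, test_idx
```

Small proteins still contribute to training in this sketch; they are only excluded from serving as the test set.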