Mining the Protein Data Bank to improve prediction of changes in protein-protein binding

doi:10.1371/journal.pone.0257614

Fig 1.

Program flow.

The user must provide an initial Protein Data Bank (PDB) ID and specify which relevant chains are in which of two interacting complexes (irrelevant chains may be left out). 1. The fasta_lwp program searches the PDB for structures containing chains homologous (E-value below eValueCutoff, here 10⁻¹¹) to those specified by the user. 2. We group the discovered homologous chains by PDB ID; each such PDB ID is referred to as a “homolog.” We loop over the homologs, performing three checks on each. 3. As a first check, we determine whether each homolog contains chains corresponding to all those specified by the user; homologs lacking any such chain are discarded. 4. Homologs in which any chain has less than 90% sequence identity to the corresponding user-specified chain are discarded. 5. We perform a rigid alignment of the entire homolog against the user-specified structure, based only on the user-specified chains. Non-corresponding (extraneous) chains are moved along with the rest of the complex. This is the most computationally expensive step, but it needs to be done only once for each homolog that reaches it; results of all three checks are saved persistently. 6. If RMSD > 6.0 Å (again based on corresponding chains), we discard the homolog. Most homologs rejected at this step contain the correct chains but in a different configuration. 7. We then compute the ΔΔG for the user-requested mutation, using the homolog structure and FoldX4. Steps 3–7 are repeated for each homolog. 8. We average ΔΔG over all homologs that reached and completed step 7 and report the result.
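The filtering and averaging logic of steps 3 to 8 can be illustrated with a minimal Python sketch. This is not the homologyScanner code: the `Homolog` record, the function names, and the toy values are hypothetical, and the sketch assumes that sequence identity, RMSD, and ΔΔG have already been computed for each candidate (in the real workflow, ΔΔG is computed only for homologs that pass the checks).

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional

# Cutoffs quoted in the caption; the constant names are illustrative.
MIN_SEQUENCE_IDENTITY = 90.0  # percent (step 4)
MAX_RMSD_ANGSTROM = 6.0       # angstroms (step 6)


@dataclass
class Homolog:
    """Hypothetical record describing one candidate PDB entry."""
    pdb_id: str
    # Maps each user-specified chain ID to the percent identity of the
    # corresponding chain found in this homolog.
    identity_by_chain: Dict[str, float]
    rmsd: float  # RMSD after rigid alignment on corresponding chains
    ddg: float   # ddG computed on this structure, in kcal/mol


def passes_checks(h: Homolog, required_chains: List[str]) -> bool:
    # Step 3: every user-specified chain needs a counterpart.
    if any(c not in h.identity_by_chain for c in required_chains):
        return False
    # Step 4: every counterpart must reach the identity cutoff.
    if any(h.identity_by_chain[c] < MIN_SEQUENCE_IDENTITY
           for c in required_chains):
        return False
    # Step 6: discard the homolog if RMSD exceeds the cutoff.
    return h.rmsd <= MAX_RMSD_ANGSTROM


def average_ddg(homologs: List[Homolog],
                required_chains: List[str]) -> Optional[float]:
    # Step 8: average ddG over the homologs that survived all checks.
    kept = [h.ddg for h in homologs if passes_checks(h, required_chains)]
    return mean(kept) if kept else None


# Toy data with placeholder IDs, not real PDB entries or real results.
homologs = [
    Homolog("hom1", {"A": 100.0, "B": 98.0}, 0.5, -1.0),
    Homolog("hom2", {"A": 95.0}, 0.4, 3.0),              # chain B missing
    Homolog("hom3", {"A": 99.0, "B": 85.0}, 0.6, 2.0),   # identity too low
    Homolog("hom4", {"A": 99.0, "B": 97.0}, 7.2, 5.0),   # RMSD too high
    Homolog("hom5", {"A": 92.0, "B": 91.0}, 1.1, -2.0),
]
result = average_ddg(homologs, ["A", "B"])
print(result)
```

In this toy set only `hom1` and `hom5` pass all three checks, so the reported value is the mean of their two ΔΔG values.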


Fig 2.

Illustration of the sequence and structure matching procedure.


Table 1.

Comparison of computing ΔΔG using single vs. multiple structures, for several subsets of our benchmark set.


Fig 3.

Scatterplot of ΔΔG_predicted vs. ΔΔG_experimental, for dataset C (single-position substitutions, where more than one structure was available).

Green circles: mutants averaged over multiple structures, N = 522. Black dots: mutants computed on a single structure; because multiple structures were available for each mutant, this set has a higher N = 4028. Note that the clear outliers are all single-structure points. Note also that the third quadrant is populated with True Positives (ΔΔG_predicted and ΔΔG_experimental both negative). The fourth quadrant, representing False Positives, contains no multiple-structure results with ΔΔG_predicted < -0.65 kcal/mol. The improvement in Positive Predictive Value is discussed elsewhere in this work.


Fig 4.

Receiver Operating Characteristic, comparing homologyScanner vs. calculation on single structures.

Here the Test Positives are defined as mutants with ΔΔG_predicted < threshold, where the threshold is varied.
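As a rough illustration of how such a curve is traced, the sketch below sweeps the threshold and records one (false positive rate, true positive rate) point per value. The caption defines only the Test Positives; treating mutants with ΔΔG_experimental < 0 as the actual positives is an assumption made here, and the function name and toy numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(predicted: Sequence[float],
               experimental: Sequence[float],
               thresholds: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, FPR, TPR) for each threshold."""
    # Assumption: a mutant is an actual positive if its experimental
    # ddG is negative.
    actual_pos = [e < 0 for e in experimental]
    n_pos = sum(actual_pos)
    n_neg = len(actual_pos) - n_pos
    points = []
    for t in thresholds:
        # Test Positive, as in the caption: predicted ddG below threshold.
        test_pos = [p < t for p in predicted]
        tp = sum(1 for tpos, apos in zip(test_pos, actual_pos)
                 if tpos and apos)
        fp = sum(1 for tpos, apos in zip(test_pos, actual_pos)
                 if tpos and not apos)
        tpr = tp / n_pos if n_pos else 0.0
        fpr = fp / n_neg if n_neg else 0.0
        points.append((t, fpr, tpr))
    return points


# Toy values in kcal/mol, for illustration only.
predicted = [-2.0, -0.5, 0.3, 1.5, -1.0, 0.8]
experimental = [-1.2, -0.3, 0.4, 2.0, 0.5, -0.1]
pts = roc_points(predicted, experimental, [-1.5, 0.0, 2.0])
for t, fpr, tpr in pts:
    print(f"threshold={t:+.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A strict threshold yields few Test Positives (low FPR and low TPR); raising it moves the point toward the upper right corner of the plot.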


Fig 5.

Positive Predictive Value (PPV) for single vs. multiple structures.

PPV is defined as TP / (TP + FP). We emphasize that the denominator, TP + FP, becomes small for ΔΔG_predicted < -1 kcal/mol (crosses), which is why the PPV becomes erratic in that range, at least for single structures.
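The sensitivity of PPV to a small denominator can be seen in a short sketch that returns both the PPV and TP + FP at a given threshold. As an assumption not stated in the caption, a Test Positive counts as a True Positive when ΔΔG_experimental < 0; the function name and toy numbers are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def ppv_at_threshold(predicted: Sequence[float],
                     experimental: Sequence[float],
                     threshold: float) -> Tuple[Optional[float], int]:
    """Return (PPV, TP + FP) for Test Positives with predicted < threshold."""
    tp = 0
    fp = 0
    for p, e in zip(predicted, experimental):
        if p < threshold:      # Test Positive
            if e < 0:          # assumed definition of a True Positive
                tp += 1
            else:
                fp += 1
    denom = tp + fp
    # PPV is undefined when there are no Test Positives at all.
    return (tp / denom if denom else None), denom


# Toy values in kcal/mol, for illustration only.
predicted = [-2.0, -0.5, 0.3, 1.5, -1.0, 0.8]
experimental = [-1.2, -0.3, 0.4, 2.0, 0.5, -0.1]
for t in (0.0, -1.5, -3.0):
    print(t, ppv_at_threshold(predicted, experimental, t))
```

With only one or two Test Positives left, a single reclassified mutant moves the PPV by a large step, which is the kind of erratic behavior the caption describes.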


Fig 6.

The homologyScanner public web server.

Users can provide a PDB ID, choose chains in each of two subunits, and specify a mutation to be computed. FoldX ΔΔG is computed for the query and all matching complexes and reported to the user. The results are available for browsing by others. Compute nodes are needed only for high-throughput runs. The software components are available on GitHub, simtk.org, and Docker Hub. A server has also been set up on a single-board computer for private deployment.
