Mining the Protein Data Bank to improve prediction of changes in protein-protein binding

Predicting the effect of mutations on protein-protein interactions is important for relating structure to function, as well as for in silico affinity maturation. The effect of mutations on protein-protein binding energy (ΔΔG) can be predicted by a variety of atomic simulation methods involving full or limited flexibility, and explicit or implicit solvent. Methods which consider only limited flexibility are naturally more economical, and many of them are quite accurate, however results are dependent on the atomic coordinate set used. In this work we perform a sequence and structure based search of the Protein Data Bank to find additional coordinate sets and repeat the calculation on each. The method increases precision and Positive Predictive Value, and decreases Root Mean Square Error, compared to using single structures. Given the ongoing growth of near-redundant structures in the Protein Data Bank, our method will only increase in applicability and accuracy.


Introduction
In this work we are interested in predicting the change in protein-protein interaction (PPI) energy (ΔΔG) resulting from amino acid substitutions at the protein-protein interface. This quantity determines the change in protein-protein binding affinity and is thus important for understanding signaling, complex assembly, host-pathogen interaction, and other functions. As the accuracy of computational methods increases, hope grows that these will match the accuracy of experimental ΔΔG measurement, heralding a new age of in the development of biologics-proteins which have therapeutic and/or diagnostic utility [1]. It would also help design proteins for purification, catalysis, and other purposes.

Methods of computing change in protein-protein interaction energy (ΔΔG)
ΔΔG = ΔG mutant − ΔG wild-type . ΔG wild-type is the free energy change upon binding in the wildtype complex, while ΔG mutant is the same quantity for the mutant complex. The equilibrium constant widely used in pharmacology is computed as Kd = exp(ΔG/RT), where R is the universal gas constant. Many computational methods exist to calculate ΔΔG predicted , an estimate of the (known or unknown) experimental value, ΔΔG experimental . Some methods use reduced representations [1,2], while others include all atoms [3]. such methods compute the protein-protein binding enthalpy (including electrostatic and van der Waals interactions) using physical formulae, but differ in the way they estimate the effect of solvent and side-chain entropy. The most successful and widely-used methods use implicit solvent to estimate these latter terms, namely they compute the solvent-accessible surface area on an atomic basis, then combine this quantity with the atom type and empirically-adjusted weight factors [3][4][5][6]. In recent years such approaches have made relatively small gains in accuracy.
Many limited-flexibility, implicit-solvent methods including FoldX offer good accuracy and economy [3,7,8]. Multiple workers have found it is actually counterproductive to minimize the structure globally (e.g. by Molecular Dynamics or MD) [9,10], especially when the mutation induces small conformational changes [11]. Rather it is better to model the substitution and limit further modifications to those required for annealing the resulting steric clashes in a mostly-local minimization-in other words, a perturbative approach [3]. All force fields are inevitably biased by the data they are fitted to, and most are classically formulated, considering quantum mechanical effects only indirectly. Thus modeling can only reduce the accuracy of 3D atomic coordinates, compared to X-ray crystallography or other high-resolution experimental methods. Also, potentials which successfully predict changes in PPI energy upon mutation (ΔΔG) are typically trained on experimental structures [3]. All of this argues in favor of limited, local minimization. The downside of local minimization is that it makes the results dependent on the idiosyncrasies of the experimental coordinates which may reflect crystallization conditions, which would change upon mutation, or which represent only one of many thermodynamically accessible configurations. This work addresses this limitation of local minimization, by identifying and using additional structural data. These additional structures are rapidly increasing in number, as we will explain. We repeat the ΔΔG calculation over all such available structures, and average over the results to increase precision.

Near-redundant structures in the Protein Data Bank (PDB)
Although the era of fold discovery is over, the growth of structural data is still accelerating (S1 Fig). Many of the new structures differ only slightly from existing entries, having been obtained to e.g. seek higher resolution, determine the effect of a mutation, or add a ligand or subunit. In this work "near-redundant structures" refer to those which have the same (or nearly the same) composition, at least in the protein-protein interface of interest, but which were resolved in separate experiments or crystallographic units. The experimental effort is dramatically reduced when following proven protocols as opposed to solving proteins of previously unknown structure. Thus it is economical to solve the same complex to probe structural variations, improve resolution, etc. As an illustrative example, the complex of the human Growth Hormone (hGH) bound to two copies of its Receptor (hGHR) was resolved in [12]. There exists a mutant of hGH which binds one copy of hGHR, and the same work also reports the 1:1 complex structure [12]. The same lab then remodeled part of the interface by point and phage display mutagenesis and reported the structure [13]. Lastly, they solved the 1:1 complex again at improved resolution [14]. Even minor differences in biopolymer sequence, number of additional subunits, experimental conditions, and fitting procedure can be expected to introduce differences in atomic positions on the order of tenths of Ångströms. While for some purposes these differences may be insignificant, they lead to variations in predicted ΔΔG when using FoldX and other limited-flexibility methods.

Why does averaging improve precision?
Accuracy is the closeness of the prediction or measurement to the correct value, whereas precision refers to the closeness of independent predictions or measurements to each other. The principle underlying homologyScanner is that ZEMu and other methods which perform only a local minimization are subject not only to limitations in accuracy (due to biases in the ΔΔG prediction potential, and to errors in the ΔΔG experimental used as a gold standard) but also in precision (due to particularities of the coordinate sets used).
According to the central limit theorem, for N independent random variables distributed with mean μ and standard deviation σ, the sample mean approaches a normal distribution with mean μ and standard deviation s= ffi ffi ffi ffi N p . Here the ΔΔG predicted computed using a single structure would be a random variable sampled from an underlying normal distribution of unknown mean μ and standard deviation σ. A ΔΔG predicted averaged from multiple calculations would have a smaller standard deviation, s= ffi ffi ffi ffi N p , about the same mean μ. Thus averaging multiple calculations obtained using independent structures provides a more-accurate estimate of the underlying mean μ. Note that even in the hypothetical case of very large N, μ would still be the mean of many individual values of ΔΔG predicted , subject to biases in the FoldX potential, and may not converge to an accurately measured ΔΔG experimental . Also the latter number depends on experimental conditions. And so though precision is increased by our method, one must consider limitations in the force fields and experimental measurements.

Methods
In our method, the user proposes a mutation, the PDB identifier of a single"query" structure, and two lists of chains, one for each subunit in the interaction of interest. Lastly, the user specifies the chain ID, residue position, and substituted residue type, for one or multiple simultaneous substitutions. The user typically needs not perform any further actions until the calculation completes, the remaining steps are automated, per the flowchart (Fig 1).
The automated steps are, in short: 1) Do a sequence search of the PDB for structures having all the chains specified by the user, within a specified e-value and sequence identity. 2) Do a structural alignment between the query and each subject structure, and accept the subject if the RMS Deviation (RMSD) meets the cutoff. 3) Translate the mutation to the numbering system of the subject structure, and compute ΔΔG. 4) Average ΔΔG over all structures, and report to the user.
The procedure is illustrated graphically in Fig 2. homologyScanner is an extension of MacroMoleculeBuilder (MMB), written in C++ and made available on github.

MacroMoleculeBuilder (MMB)
MMB is an open-source, general-purpose, multiscale modeling code. Its internal-coordinate framework [15] gives us full control over the flexibility of our molecular system, thus one can have chains which are fully rigid, fully flexible, or which are rigid in some parts and flexible in others. In past work we have used MMB for applications as diverse as morphing [16], local minimization [17], and fitting to low-resolution electron density maps [18]. MMB can also do homology modeling, which may be considered as alignment of a flexible chain (of presumed unknown structure) to a rigid chain of known structure [19]. In a related technique, both chains can be partially-flexible or fully-rigid, and here the alignment can be called template docking [20] or simply rigid alignment, depending on how constraints are applied. MMB can also use the Kabsch algorithm to compute the minimum Root Mean Square Deviation (RMSD) with which two complexes can be aligned-this is a very fast operation.

Sequence search and alignment
homologyScanner starts by searching the PDB [21] for chains which match the sequence of the user-specified chains with very high statistical significance (e-value < = 10 −11 ). Only structures which contain all the user-specified chains, each satisfying the e-value requirement, are kept and the rest are discarded. homologyScanner then uses the SeqAn-based [22] alignment tools in MacroMoleculeBuilder (MMB) to compute the sequence identity for each corresponding chain, viz: Matching residues / minimum(query chain length, subject chain length) Structures which do not satisfy the minimum sequence identity (> 90%) are discarded. We thus know that the remaining structures have the relevant chains, but do not yet know whether they have the correct tertiary and quaternary structure. For that we perform the final structural check using MMB as follows.

Structural alignment
The Kabsch structural alignment is based on residue-residue (and ultimately atom-atom) correspondence between the query and subject structures, which we obtain from the mentioned sequence alignment. MMB can robustly deal with missing or non-canonical atoms (often encountered in PDB structures), usually without user intervention [16]. The Kabsch alignment gives us the RMSD of the query vs. subject complex. If the RMSD meets a cutoff threshold (< = 6Å), the homolog complex is then passed on to the ΔΔG calculation. Note that this differs from the procedure of [23], in which the PDB is searched on structure but not sequence; we wished to use only structures of very similar sequence to maintain accuracy.

ΔΔG calculation
The user specifies a mutation(s) with chain ID(s) and residue number(s) in the context of the query structure. However different subject structures may employ different residue numbering conventions. We translate the user-specified mutation into the numbering system of the subject structure on the basis of the sequence alignment. We compute ΔΔG predicted on the query and all subject structures, using FoldX. FoldX was originally formulated and trained to predict changes in stability rather than PPI energy. However in [17] and in this work, we show that it The user must provide an initial Protein Data Bank (PDB) ID, specify which relevant chains are in which of two interacting complexes (irrelevant chains may be left out). 1. The fasta_lwp program searches the PDB for structures containing chains homologous (E-value below eValueCutoff, here 10 −11 ) to those specified by the user. 2. We group the thus-discovered homolog chains by PDB ID, each such PDB ID is referred to as a "homolog." We loop over the homologs, performing three checks on each. 3. As a first check, we determine whether the thus-discovered homologs contain chains corresponding to all those specified by the user; those not having all such chains are discarded. 4. Homologs in which all chains do not have at least 90% sequence identity vs. the corresponding userspecified chain are discarded. 5. We perform a rigid alignment of the entire homolog against the user-specified structure, based only on the user-specified chains. Non-corresponding (extraneous) chains are moved along with the rest of the complex. This is the most computationally-expensive process, but only needs to be done once for homolog that makes it to this step; results of all three checks are saved persistently. 6. If RMSD > 6.0 Å (again based on corresponding chains), we discard the homolog. Most homologs which are rejected at this step contain the correct chains but in a different configuration. 7. We then compute the ΔΔG for the user-requested mutation, using the homolog structure and FoldX4. Steps 3-7 are repeated for each homolog. 8. We average ΔΔG over all homologs that reached and completed step 7 and report the result.

PLOS ONE
Mining the Protein Data Bank to improve prediction of changes in protein-protein binding is also highly effective for the latter. SKEMPI was compiled long after [3], and so overfitting is not a significant problem. Readers are referred to [3] for details on PPI energy evaluation in FoldX.

Results
We benchmarked homologyScanner on the dataset used in [17], comprising 1243 mutations (see Table 1, dataset A). This is a very diverse dataset of 1243 mutants, including some mutants with single-substitutions and some with multiple simultaneous substitutions. We first tried using only single structures, as done in [17], and then repeated using multiple structures and quantified the improvement in correlation and Root Mean Square Error (RMSE) [17]: RMSE ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi ffi We also tested a subset of the above, comprising single-and multiple-substitution mutants for which multiple structures were available ( A scatterplot (Fig 3) of ΔΔG predicted vs. ΔΔG experimental for dataset C, shows the effect of averaging on outliers.
We also computed a Receiver Operating Characteristic (ROC) curve (Fig 4). This plots the True Positive Rate (TPR) vs. True Negative Rate (TNR) as the ΔΔG predicted threshold is moved from ΔΔG predicted = +1 (loosest) to -1 (strictest). Mutations with ΔΔG predicted < threshold are taken to be test positives, mutations with ΔΔG experimental < 0 are taken to be Gold Standard positives, and so e.g. mutations in the intersect set (i.e. those that meet both of these criteria) are True Positives (TP). Accordingly, FP (False Positives) are those mutations for which with ΔΔG predicted < threshold but ΔΔG experimental � 0. The rest of the quantities (TN: True Negatives, FN: False Negatives, etc.) are computed accordingly. The ROC was also computed based on single structures, and on multiple structures for comparison. Another useful quantity is the Positive Predictive Value (PPV) = TP/(TP + FP). This answers the question: for a given threshold, what fraction of test positives will be TPs? If the goal is to get ΔΔG > 0, then PPV tells us which fraction would have achieved this, for a given test threshold. Again we compute for single as well as multiple structures for the full range of test thresholds (Fig 5).

Discussion
As noted previously many ΔΔG prediction methods that limit structural rearrangements, particularly in regions distant from the mutation site, can yield good results, compared to no minimization or minimization of entire structures [3]. However a local minimization leaves us vulnerable to structural idiosyncrasies of the structure employed. In the present work, we diversify the coordinate data by identifying additional complexes including the relevant chains, in the relevant quaternary arrangement. Significantly, we introduce no adjustable parameters and so our results should apply to other potentials and localized minimization methods. The part of the error due to experimental uncertainty and limitations of the DDG potential, remains unchanged in our work but is the topic of ongoing research in the field [3,9,24].
Our benchmarking was done using the dataset of [17], itself compiled from SKEMPI [25]. The full dataset A contains mutants with single and multiple simultaneous substitutions. More to the point, for some mutants in dataset A only one structure is available, whereas for others there are several similar structures available (in one case 18 were found, see S1 Table). When only one structure is available of course homologyScanner makes no improvement, but since for many there were multiple structures, homologyScanner reduced RMSE by about 0.1 kcal/ mol.
The more relevant comparison is the case for which multiple structures are available (datasets B, C, and D). Dataset C comprises only single-substitution, D comprises multiple-substitution mutations, while B contains both. homologyScanner presented the largest RMSE improvement for D, but error was high overall, so in general we do not recommend using homologyScanner for multiple-substitutions. For the single-substitutions (dataset C), the best RMSE of all, 1.04 kcal/mol, was obtained, better than reported in related work [3,17].
Aggregate results however are only part of the story. Fig 3 highlights another important feature of homologyScanner:F whereas several outliers are evident when using single structures, there are arguably zero outliers when using multiple structures. For a given ΔΔG experimental , ΔΔG predicted is consistently closer to the trendline for multiple than for single structures. One may note that the slope ΔΔG predicted /ΔΔG experimental is not unity, this is a characteristic of FoldX which we do not reparameterize here; in any case the slope itself is not as important as the statistical measures of accuracy.
ROC curves are commonly used to evaluate binary classifiers with an adjustable thresholdin this case, we can classify mutations into those predicted to decrease ΔΔG (improve affinity), vs. those that should increase ΔΔG or leave it neutral. In ROC curves, Area Under the Curve (AUC) and slope at the point TNR = 1, TPR = 0 are two important measures. Larger AUC's correspond to more significant classifiers, whereas steeper slope indicates better performance for the highest-confidence cases, here those with lowest computed ΔΔG. Both quantities are larger when using multiple structures (Fig 4).
But perhaps the most important statistic, again for the purposes of design, is PPV, plotted in Fig 5. For single structures, PPV fluctuates around 0.5 for the highest-confidence mutants, that is to say in the range of ΔΔG predicted < -1 kcal/mol. For multiple structures, in contrast, PPV is a solid 1.0 in the same range of ΔΔG predicted . To reiterate, in an experiment all such mutants would have been found to improve affinity. While we believe this result is important and impressive, we also urge the reader to be cautious. This dataset is compiled from published data, which we strongly suspect contains more affinity-improving mutations than would be obtained by random mutagenesis. We reason that may investigators are looking to improve affinity, and will use tools at their disposal-published and unpublished data, structural calculations, bioinformatics, etc., prior to attempting a new mutation-and if they do not succeed they

PLOS ONE
may decide not to publish. So while the case is strong for using multiple structures, we believe PPV will be less than unity in new applications.
In conclusion, we have presented a protocol for taking advantage of the growing accumulation of near-redundant structures in the PDB to improve prediction of ΔΔG. Though the approach is simple, it provides an improvement which is remarkable since clearly demonstrable improvements in ΔΔG accuracy have been slow in recent years. The method should be compatible with other ΔΔG prediction methods which use perturbative energy minimizers, in addition to FoldX. As multiple programs are required to implement homologyScanner efficiently, and since there is considerable incentive to reuse calculations, we make the method publicly available on an easy to use web server.

Distribution
HomologyScanner is available on a public server at biodesign.scilifelab.se. The setup is shown in Fig 6. To request a ΔΔG calculation, the user goes to the Submit tab and provides the PDB ID of one suitable structure, and specifies the relevant chains in subunit 1 and subunit 2 of the interaction of interest (chains not in the interface can be left out). The user then specifies the mutation to be computed (one to four simultaneous substitutions). homologyScanner is then invoked, meaning the PDB is searched for structurally similar complexes, the FoldX ΔΔG calculation is performed for all such complexes found, and the user is notified by email (usually within a few hours, depending on queue status and job characteristics) when the job is done. The server saves all results, so the structure search needs be done only once per PDB ID and definition of subunits 1 and 2, and each FoldX calculation is only done once in total.
There is also a View tab where the public can browse all results by selecting a PDB ID, subunits, and mutation (all from drop-down lists). They will then see the ΔΔG for all structural homologs, as well as the average ΔΔG. A Jsmol window displays the protein structure in Cartoon render style, with the mutation highlighted in Sticks style.
The web server itself comprises a computer with at least two CPU cores, one of which is responsible for running Apache, MySQL, and other web services. The remaining cores are managed by the SLURM queueing system. The web server submits homologyScanner jobs to this queue upon user request. These jobs call homologyScanner itself, which interacts with the PDB to perform the sequence and structure search. homologyScanner spawns one breeder job for each suitable homologous structure found. Breeder is a program introduced in [17] which manages FoldX and stores ΔΔG and related results in the MySQL database.
Privacy may be important for some users, for example academics with unpublished data, or product developers in the pharmaceutical industry. For such users we have prepared a lowcost Udoo X86 Ultra single board computer implementation. The compact (120x85 mm) and light (under 200 g, excluding power adapter) format means it can easily be posted to the an academic or industry user. The X86 architecture ensures ease of compilation and update, compared to ARM architectures used by other single-board computers. The computer has a 2.56 GHz Intel Pentium quad-core processor. One core is used for running the web server, including MySQL database, and 1-3 cores are managed by SLURM for running homologyScanner. The computer can be fitted with an M.2-format Solid State Drive with up to 1TB capacity (we used 512 GB) as well as external HDs. We used a StarTech 300Mbps mini-wireless network adapter. An HDMI and three USB-3.0 type A ports mean it can be connected to a display, keyboard, mouse, and wireless network adapter. Alternatively such users can provide their own hardware (with mysql, docker, slurm) and use the dockerhub image (samuelflores/mmbubuntu:homologyscanner) which contains MMB, Breeder, and homologyScanner.
DDGs averaged over multiple structures are provided in.csv format, as (in a separate file) are those computed using only single structures, at biodesign.scilifelab.se/publicdata. Supporting information S1 Table. Root Mean Square Error (RMSE) grouped by primary (author-provided) PDB ID, for dataset A (N = 1190). Number of homologs ranged from 1 (41 PDB IDS, accounting for a total of 465 mutants) to 29 (2PCC, 12 mutants). Some PDB IDs appear more than once, because not all calculations converged for all homolog structures. Note the particularly low RMSE for 1A22 and 1DAN, which dominate the statistics for number of homologs = 3 and 4. This may reflect a low error in the ΔΔG experimental , due to quality experimental work in the Wells and Edgington labs, on proteins of medical interest. All data points are over the same set of mutants, namely single-and multiple-substitutions mutants, where 4 or more structures are available, N = 511. For all mutants, we randomly selected n = 1, 2, 3, and 4 of the available structures and computed RMSE; we repeated this five times. A maximum of n = 4 was selected to include the high-quality data associated with 1A22, and also to have a sufficiently high N. We suggest a model (based on normally distributed errors) in which total error σ total is given by: Thus for large n, the total error would converge to σ systematic -the error due to the perturbative assumption, crystallization artifacts, biases in the underlying force field, etc. The RMSE qualitatively appears to have this convergence, though the data admit other models. (TIF)