Assessment of software methods for estimating protein-protein relative binding affinities

A growing number of computational tools have been developed to accurately and rapidly predict the impact of amino acid mutations on protein-protein relative binding affinities. Such tools have many applications, for example, designing new drugs and studying evolutionary mechanisms. In the search for accuracy, many of these methods employ expensive yet rigorous molecular dynamics simulations. By contrast, non-rigorous methods use less exhaustive statistical mechanics, allowing for more efficient calculations. However, it is unclear if such methods retain enough accuracy to replace rigorous methods in binding affinity calculations. This trade-off between accuracy and computational expense makes it difficult to determine the best method for a particular system or study. Here, eight non-rigorous computational methods were assessed using eight antibody-antigen and eight non-antibody-antigen complexes for their ability to accurately predict relative binding affinities (ΔΔG) for 654 single mutations. In addition to assessing accuracy, we analyzed the CPU cost and performance for each method using a variety of physico-chemical structural features. This allowed us to posit scenarios in which each method may be best utilized. Most methods performed worse when applied to antibody-antigen complexes compared to non-antibody-antigen complexes. Rosetta-based JayZ and EasyE methods classified mutations as destabilizing (ΔΔG < -0.5 kcal/mol) with high (83–98%) accuracy and a relatively low computational cost for non-antibody-antigen complexes. Some of the most accurate results for antibody-antigen systems came from combining molecular dynamics with FoldX with a correlation coefficient (r) of 0.46, but this was also the most computationally expensive method. Overall, our results suggest these methods can be used to quickly and accurately predict stabilizing versus destabilizing mutations but are less accurate at predicting actual binding affinities. This study highlights the need for continued development of reliable, accessible, and reproducible methods for predicting binding affinities in antibody-antigen proteins and provides a recipe for using current methods.

Introduction Protein-protein binding is an essential physiological event that governs a large number of biological processes in the cell [1]. Amino acid mutations of these proteins can introduce diversity into genomes, and disrupt or modulate protein-protein interactions by changing the underlying binding free energy (ΔG, i.e. binding affinity), the amount of energy required to form protein complexes [2]. The binding free energy associated with a protein-protein complex determines the stability of the complex formation and the conditions for protein-protein association. Accurate prediction of binding free energies allows us to understand how these affinities can be modified, and leads to a more comprehensive understanding of protein interactions in living organisms [3].
Experimental biophysical methods can quantitatively measure change in the protein-protein binding free energy due to a mutation (i.e. relative binding affinity, ΔΔG), but these methods are typically costly, laborious, and time-consuming since all mutant proteins must be expressed and purified. Many researchers have developed and utilized computational methods to predict ΔΔG values for single-or multiple-amino acid mutations (see e.g. [4][5][6]). Historically, the most promising in terms of accuracy are rigorous methods based on statistical mechanics that use molecular dynamics (MD) simulations and thus automatically address conformational flexibility and entropic effects [7,8]. However, these methods are computationally expensive since they employ rigorous sampling and utilize classical mechanics [9] or quantum mechanics [10] approximations of intermolecular interactions, and require a large number of calculations per time-step. Because of the expense, rigorous methods are not wellsuited to studying large sets of mutations or large proteins thus necessitating less expensive, non-rigorous methods.
Non-rigorous high-throughput methods attempt to lower the computational cost, as compared to rigorous methods, while still providing accurate ΔΔG predictions. They accomplish this by including precalculated physico-chemical structural information in combination with predictive algorithms. The core mechanics that drive these methods fall under numerous classification umbrellas which have been covered by review articles [11,12]. These review articles provide a broad overview but do not provide an unbiased, rigorous, comparative analysis outside of what the original developers provide. The developers of any given method tend to provide comparisons with other methods of the same general class to define where their method fits in the current landscape. BindProfX, for example, is available as a web server and standalone and utilizes structure-based interface profiles with pseudo counts. Upon release, it was most notably compared to FoldX (a semi-empirical trained method [13]) and DCOMPLEX (a physics-based method [14]) [15,16]. iSEE, a statistically trained method based on 31 structure, evolution, and energy-based terms was tested against FoldX, BindProfX, and BeAtMuSiC (a machine learning-based approach [17]). Mutabind [18] and some other methods not explored in this work follow a similar testing methodology [19][20][21]. While these comparisons are beneficial in providing context for how a given model fits in the existing research landscape, they are not very robust, since only a narrow subset of methodologies are included. Conversely for folding stability, Kroncke et al. compared a large number of available software methods on a small dataset of transmembrane proteins providing a general overview of performance [6]. Despite the narrow dataset, this study provides a diverse, useful collection of evaluation metrics between multiple classes of methods. Our intent in this study is to provide a similar robust comparison of methods for non-rigorous binding affinity estimation.
In this work, we evaluate the ability of eight non-rigorous methods to predict relative binding affinities due to single amino acid mutations. We restrict our study to cases where both an experimental structure of the complex, and experimentally determined binding affinity values are available. To investigate the trade-off between speed and accuracy, we chose 16 proteinprotein test complexes with empirical ΔΔG values for observed mutations. We calculated the ΔΔG values for each mutation using all eight methods and compared the results against empirical ΔΔG values. The goal of this study was to determine whether software methods that use (most costly) energy functions with a wider variety of physico-chemical structural features would provide more accurate binding affinity and interface destabilization predictions compared to those that rely on a single descriptive (less costly) energy function. We have determined scenarios in which some of these methods may be better or worse than traditional computational methods in predicting ΔΔG values.

Compilation of experimental ΔΔG values
To assess the performance of a range of protein-protein binding affinity prediction methods, we first assembled a dataset containing single amino acid mutations with known experimental ΔΔG values. This list was assembled from Structural Kinetic and Energetic database of Mutant Protein Interaction (SKEMPI) version 2.0 [22]. SKEMPI uses data from a variety of different biophysical measurement techniques; these are converted to ΔΔG values if not explicitly reported. Overall, the error associated with experimental ΔΔG values reported in the SKEMPI dataset is thought to range from 0.25 to 1 kcal/mol [22]. While generating this list, we considered four aspects: (i) type of protein-protein complex; (ii) availability of quality 3-D structural information; (iii) range of experimental ΔΔG values; and (iv) the type of mutations at differing sites on the complex. Our final dataset contained 654 mutations from 16 protein-protein complexes and their respective experimental ΔΔG values. We further categorized these 16 complexes as either non-antibody-antigen (non-Ab) or antibody-antigen (Ab). Table 1 shows the complexes in our dataset with their respective non-Ab and Ab categories and the number of mutations associated with each complex. The dataset contains a total of 401 non-Ab mutations and 253 Ab mutations.

Selection of protein-protein binding affinity methods
Binding affinity prediction methods were chosen to have both a distinct approach to binding affinity calculation that utilized 3-D structural information and had functional standalone software in September 2020, available either online or upon request to the author. Table 2 summarizes the methods selected in this study, their approaches, and their type of scoring functions. For simplicity, we categorized scoring functions (mathematical functions to calculate ΔΔG values) as semi-empirical, statistical, or physics-based. Semi-empirical methods replace as many calculations as possible with pre-calculated data and are trained using existing crystal structures and known binding affinity measurements for mutations [79]. Statistical methods use pre-calculated data and consider changes in coarse structural features such as the change in overall volume [80]. Physics-based methods use molecular mechanics based-energy functions to estimate enthalpic binding contributions [14]. In general, statistical or semi-empirical scoring functions involve a training step where existing datasets are leveraged to determine the weight of input parameters. MD, JayZ, and EasyE were not developed by training the methods against experimental data designed to improve predictive power while all other methods utilized this step.

Calculation and comparison of computational speed
The methods in Table 2 were used to predict ΔΔG values for each mutation on our experimental list shown in Table 1. Detailed protocols for predicting ΔΔG values using each selected method are provided in the (see S1 File). Runtimes were determined by using a representative protein complex from each category: 1ppf, a non-Ab complex with 274 total amino acids, and 1yy9, an Ab complex with 1058 total amino acids (see Table 2). These runtimes are estimates provided to give a point of comparison between the speeds of different methods. Specific runtimes will be determined by hardware specifications, method parameters, the number of mutations being computed, and overall protein size. For MD+FoldX, computational runtime was the length of time of the MD simulation plus the FoldX runtime for a single mutation. Reporting runtime in this fashion highlights the large CPUh requirement needed in order to add the sampling of MD into FoldX calculations. We note that, in contrast to the other methods tested here, the MD simulations that must be performed for MD+FoldX can be completed very

Comparing experimental and predicted ΔΔG values
To carry out statistical analysis of our results we built an in-house Python script (see S2 File) that uses a combination of libraries including matplotlib, numpy, pandas, statistics, scipy, and sklearn. Using this script, we compared predicted values to experimental ΔΔG values for each method.
To evaluate the predictive ability of each method tested, we compared the following correlation coefficients using our script: concordance (ρ c ), Pearson (r), Kendall (τ), and Spearman (ρ) (see Table 3). We distinguish between methods that were trained to predict ΔΔG values from methods that compute metrics that are expected to linearly correlate with ΔΔG values. This distinction is important since for optimal performance we expect a regression line that passes through the coordinate origin and has a slope of 1, leading to a correlation coefficient equal to 1.
To compare the discriminating power of the methods, we generated receiver operating characteristic (ROC) curves (see Table 3). These curves quantify the ability of a method to correctly classify point mutations as destabilizing (ΔΔG < −0.5 kcal/mol) or neutral/stabilizing (ΔΔG > −0.5 kcal/mol). ROC curves that are skewed toward a higher true positive rate (sensitivity) classify mutations more accurately, as quantified by area under curve (AUC, ranging between 1.0 and 0.5 for perfect and chance classification, respectively).
We also used our script to parse the results on the basis of several physico-chemical and structural features to allow us to evaluate the methods based on these characteristics: wild type amino acid type, mutant amino acid type, protein-protein interacting versus antibody-antigen, secondary structure classification of the mutation [89,90], coordination number [91], Sneath index [92], mostly α-helical proteins versus mostly β-sheet proteins versus a mix of both αhelical and β-sheet proteins, percent exposure, location of the mutation, change in charge, change in polarity, change in volume, and whether or not the mutation location is predicted as an active or passive residue [93][94][95]. The script uses data from S3 File as an input and outputs scatter plots, correlation plots, receiver operating characteristic (ROC) curves, and box plots to visualize the data, as well as correlations and standard deviations for each method. All plots in this manuscript were generated using this script.

Correlation
Brief Description Type

Concordance
The concordance correlation coefficient (ρ c ) measures the degree to which the predicted ΔΔG value equals the actual experimental value (0 indicates no agreement and 1 perfect agreement).

Pearson
The Pearson correlation coefficient (r) measures the degree to which a uniform linear transformation of the predicted ΔΔG values (i.e., a shift and scale change) would yield the actual experimental values (0 indicates no agreement after transformation, 1 perfect agreement, and −1 perfect inverse agreement).

Kendall and Spearman
The rank correlation coefficient measures the degree to which the rank ordering of the predicted ΔΔG values matches the rank ordering of the actual experimental values (0 indicates no agreement after transformation, 1 perfect agreement, and −1 perfect inverse agreement). In a normal case, the Kendall correlation (τ) is considered more robust than the Spearman correlation (ρ) because of a smaller gross error sensitivity and more efficient due to a smaller asymptotic variance [88].
Rank AUC and ROC The receiver operating characteristic (ROC) curve tests several cutoff values for binning mutations as neutral or destabilizing between the most negative calculated ΔΔG value and the most positive calculated ΔΔG value, with true positive rates (sensitivity) calculated at each point. As the true positive rate is calculated, the classifier is moved to less extreme values; this yields the ROC curve. The area under curve (AUC) is a summary statistic that approximates how well the predictor actually discriminates between the two classifications.

Results
The purpose of our study was to assess the ability of eight different relative binding affinity calculation methods (see Table 2) to estimate ΔΔG values. We selected 16 different protein complexes (eight Ab, eight non-Ab, see Table 1) with a total of 654 single amino acid mutations. Each method was then used to estimate ΔΔG values of 654 mutations and a variety of statistical measures were employed to assess their predictive ability. We also examined the computational speed of each method in the context of accuracy to determine its efficiency.

Non-antibody-antigen (non-Ab) results
Our dataset of eight non-Ab test protein complexes contains 401 total mutations and are mainly classified as protein-protein systems formed by inhibitors and receptors that range from 199 to 583 residues in size. The distribution and our classification of experimental ΔΔG values for all non-Ab test complexes is as follows: 13% of point mutations resulted in ΔΔG values less than -0.5 kcal/mol (classified as destabilizing); 31% between -0.5 and 0.5 kcal/mol (neutral); and 56% greater than 0.5 kcal/mol (stabilizing). Figs 1 (blue data points and values) and 2 show various performance metrics for each method to assess their ability to predict the non-Ab ΔΔG values. Overall, EasyE has the highest correlation coefficient, r = 0.62, and iSEE has the lowest, r = 0.17 (see Figs 1 and 2). JayZ and EasyE, both of which utilize Rosetta's conformational sampling algorithms, consistently have the best r values for non-Ab mutations.  Table 2 reports CPUh required (i.e. runtimes) for each method to calculate ΔΔG for the entire list of mutations for a representative non-Ab protein complex. BindProfX, BindProfX (BPX)+FoldX, JayZ, and EasyE allow users to specify a list of mutations that the method is then able to calculate in one setting. This list can be optimized based on the available hardware to achieve efficiency. iSEE requires significant preparatory work (see S1 File) prior to calculation, but once completed, it calculates the ΔΔG values for the entire list of mutations nearly instantly. DCOMPLEX is not as flexible out of the box but can handle large numbers of mutations through an automated script. For MD+FoldX, 1yy9 (roughly four times larger than 1ppf) requires considerably more CPUh to calculate. All other methods calculate 1yy9 in a shorter time frame than 1ppf. This may seem counterintuitive. However, MD must statistically sample the conformational energy of the entire complex, while all other methods use algorithms that are likely impacted more by the number of residues involved in the interaction rather than the protein size. Overall, DCOMPLEX has a much faster runtime compared to other methods, and if the goal is to determine stabilizing and destabilizing non-Ab mutations, it offers similar discriminating power to JayZ and EasyE, at a fraction of the computational cost. JayZ estimates ΔΔG value of one mutation in~2.7 s, EasyE in~9.1 s, but DCOMPLEX requires just~0.25 s.
Overall, EasyE appears to be the best option for balancing accuracy and speed and DCOM-PLEX is recommended for discriminating between stability and destabilizing mutations.
A method might not be a good overall performer in predicting ΔΔG values but could still perform well for mutations with certain physico-chemical and structural features. Therefore, we calculated various statistical measures to assess each method on unique subsets of mutations (see Table 4 and S1-S4 Figs). This table shows eight different data subsets with two r per method. EasyE has the highest r for non-Ab for five out of eight subsets (wild type non-gly or non-pro, alpha helix, beta sheet, surface exposure, and large volume changes). Where this method did not have the highest r, it had either the second or third highest r. JayZ mirrors the performance of EasyE in all the same categories and performs better than Easy in the neutral charge subset. These results further highlight the versatility of EasyE's and JayZ's performance in estimating the effects of non-Ab mutations compared to the other methods tested in this study. All methods apart from iSEE and BindProfX perform surprisingly well in the WT Gly or Pro subset. iSEE's performance in this subset, while still mediocre compared to the other tested methods, is substantially better than in all other subsets.

Antibody-antigen (Ab) results
Our dataset of eight Ab test protein complexes contains 253 mutations and the proteins range in size from 352 to 1058 residues. The distribution and our classification of experimental ΔΔG values for all Ab test complexes is as follows: 5% of point mutations resulted in ΔΔG values less than -0.5 kcal/mol (classified as destabilizing); 40% between -0.5 and 0.5 kcal/mol (neutral); and 55% greater than 0.5 kcal/mol (stabilizing).
Figs 1 (data points and values in red), 4, and 5 show the performance of each method in predicting the ΔΔG values of Ab mutations. Overall, the highest correlation is for MD+FoldX with r = 0.39 and the lowest is iSEE with r = -0.09 (see Figs 1 and 4). An interesting trend is that the methods with the highest r values for non-Ab complexes do not have the highest r for Ab complexes. Fig 5 shows the ROC plot for all the tested Ab methods. These ROC plots highlight how well a method is actually able to discriminate between stabilizing and destabilizing mutations. Compared to non-Ab complexes, all methods performed better for antibody-antigen complexes except for FoldX and DCOMPLEX which were marginally worse. JayZ (0.97), EasyE (0.98), FoldX (0.85), and MD+FoldX (0.82) had the highest AUC values. Combined with the results from Figs 1 and 4, at least for the systems studied here, it appears that the MD+FoldX

Fig 2. Performance of each method for non-Ab complexes (401 total mutations) in predicting true ΔΔG values (ρ c ), linearly correlated ΔΔG values (r), and rank order (ρ and τ). The error for each method is reported under the correlation points.
https://doi.org/10.1371/journal.pone.0240573.g002 method is the best overall performer in terms of accuracy, discriminating stabilizing mutations from destabilizing, and ranking mutations based on their experimental ΔΔG values.
Compared to other methods, EasyE has a much faster runtime and is recommended if the goal is to discriminate between stabilizing and destabilizing (ΔΔG for one mutation takes~21 s, see Table 2). By comparison, MD+FoldX cost~941 CPUh for one mutation of 1yy9. DCOM-PLEX provides a slightly lower r (0.31) and computational cost (~0.35 s) for one mutation of 1yy9. Overall, MD+FoldX appears to be the best option for accuracy and EasyE or JayZ are the best options for discriminating between destabilizing and stabilizing mutations. Table 4 summarizes the ability of each method to predict ΔΔG values for subsets of Ab mutations. Most methods had mediocre r values less than 0.60. The exceptions to this are MD+FoldX and DCOMPLEX within the WT Gly or Pro subset with r = 0.71 and 0.89, respectively. BPX+FoldX has the highest r for Ab complexes for five of the eight subsets (WT nonGly or nonPro, beta sheet, surface exposure, neutral charge, hydrophobic to polar, and large volume changes) and performs equally well for the neutral charge subset as DCOMPLEX, which  also has the highest r for WT Gly or Pro subset. For the beta sheet subset, MD+FoldX had the second highest r. In the surface exposure subset, JayZ and EasyE both had nearly identical r (0.36 and 0.35 respectively), the highest for this subset, but substantially worse than they did for non-Ab complexes.

Discussion
We assessed the performance of eight distinct protein-protein binding affinity calculation methods that use 3-D structural information. To test the performance of these methods, we selected 16 different protein complexes (see Table 1) with a total of 654 single amino acid mutations: eight antigen-antibody complexes (Ab, 253 mutations) and eight non-antigen-antibody (Non-Ab, 401 mutations) complexes. Each method was used to estimate ΔΔG values of the 654 mutations, a variety of statistical measures, CPU cost, and physico-chemical structural features to assess the performance. Our results suggest each method has both strengths and weaknesses depending on the properties of the protein system. Most methods did not perform well when applied to mutations in Ab complexes compared to non-Ab complexes. Rosettabased JayZ and EasyE were able to classify mutations as destabilizing (ΔΔG < -0.5 kcal/mol) with high (83-98%) accuracy at relatively low computational cost. Some of the best results for Ab systems came from combining MD simulations with FoldX with a r coefficient of 0.39, but at the highest computational cost of all the tested methods. Fig 1 summarizes the performance of each method in terms of its ability to estimate ΔΔG values for all (non-Ab + Ab) single mutations. None of the test methods show a very high r between experimental and predicted ΔΔG values. Two of the best performing methods, JayZ and EasyE, both have an r of 0.49 for all mutations, with a higher r of 0.61 and 0.62 respectively for non-Ab complexes. These results agree with published results from the authors of JayZ and EasyE. Our results agree moderately with published results from iSEE (they obtained r = 0.25, we obtained r = 0.17) and BindProfX (they used a much larger dataset). Published results for DCOMPLEX show a very good correlation of r = 0.87; much larger than what we obtained here. This difference is very likely due to the dataset size and compilation; DCOMPLEX was originally tested against 69 experimental data points, compared to the 654 values used here. MD+FoldX has an r of 0.39 for Ab complexes and appears to perform well for larger systems, which could indicate the importance of conformational sampling for antibody-antigen systems. Other methods used in this study have little to no conformational sampling which could explain their poor performance on Ab complexes. By contrast, these same methods perform well for non-Ab complexes, suggesting that conformational sampling is not the limiting factor to achieve accurate results for these protein complexes. For example, FoldX has a trained scoring function derived using a dataset of mostly non-Ab complexes and performs poorly for Ab complexes when using a single structure (see Table 2). However, when used with snapshots from an MD simulation, this same method outperforms all other methods selected in this study. This highlights the need for conformational sampling for reliable and efficient predictions of binding affinity for some systems. In our previous study, we combined coarse-grained forcefield with umbrella sampling to calculate ΔΔG values for eight mutations of 3hfm Ab complex (one of the test systems in this study) and obtained better predictions than FoldX and MD+FoldX [96]. This study further emphasizes the need for better conformational strategies for some systems. A rigorous endpoint free energy method could potentially be employed to overcome the conformational sampling problem. Endpoint methods typically use molecular mechanics force fields to generate conformational ensembles at the two states of interest. These ensembles are then evaluated with implicit solvent models such as molecular mechanics generalized Born surface area (MM/GBSA) and molecular mechanics Poisson-Boltzmann surface area (MM/PBSA) [97][98][99]. These methods are computationally less expensive than other rigorous approaches since simulations are only performed for two states, however their accuracy is system-dependent and sensitive to simulation protocols such as sampling strategy and entropy calculation. MM/PBSA and MM/GBSA have been successfully used by several groups to estimate ΔΔG values for a small number of protein complexes and recently reviewed by Wang E et al [98] and Wang C et al [99]. These studies obtained consistently higher overall correlation to experimental ΔΔG values, albeit for a small subset of mutations, compared to the methods tested in our study, but at the expense of significantly higher computational costs.
Statistical measures used to analyze performance are listed and defined in Table 3. For Ab, BPX+FoldX, MD+FoldX, and DCOMPLEX have the highest r values of the methods in our study (see Fig 4). MD+FoldX appears to be the most accurate method for Ab complexes. Bind-ProfX, FoldX, JayZ, EasyE, and iSEE have low r and ρ c indicating that affinities estimated using these methods do not correlate well with experimental ΔΔG values using a linear transformation. Also, the τ and ρ were lower compared to MD+FoldX, indicating these methods do poorly at calculating a rank order that matches experimental data.
The ROC curves allow us to determine each method's ability to classify mutations as either destabilizing or neutral/stabilizing (Figs 3 and 5). For non-Ab complexes, JayZ (0.84 AUC) and EasyE (0.83 AUC) have the best true positive rate followed by DCOMPLEX (0.82 AUC). For Ab complexes, JayZ (0.97 AUC) and EasyE (0.98 AUC) have better true positive rates than MD+FoldX, the method with the highest r value. If classification of destabilizing vs stabilizing is the primary need, then JayZ or EasyE are both recommended over the other methods tested here due to their high accuracy and fast runtime.
While accuracy is generally the main reason for choosing a particular method, computational efficiency is also an important consideration, especially when predicting the effects of a large number of mutations. Here, we discuss the performance of each method in terms of its trade-off between speed and accuracy for predicting ΔΔG values. For all single mutations and our non-Ab subset, EasyE and JayZ performed well; JayZ is the faster method of the two with EasyE at a similar speed to FoldX. DCOMPLEX is more accurate than FoldX for all single mutations and has similar accuracy as FoldX for non-Ab mutations, but at much lower cost. MD+FoldX has similar accuracy to DCOMPLEX for all single mutations and has similar accuracy to FoldX in non-Ab mutations but is by far the most computationally expensive method we tested. Although a synergistic combination of BPX+FoldX implements several structural and physico-chemical interaction terms in its algorithm, computation time was longer than all but MD+FoldX without a concomitant improvement in r. We note that this method is perhaps the most accessible of those tested, due to the easy-to-use online server interface and accuracy that is similar to FoldX for most systems. BindProfX utilizes the same scoring profile as BPX +FoldX without the FoldX calculations. In this case, accuracy decreased while calculation speed remained similar to BPX+FoldX. iSEE, the least correlating method, employs the widest variety of information to obtain relative binding affinity predictions and is the fastest of all methods (not including the non-trivial preparation time). For Ab complexes, MD+FoldX, the slowest of all the methods, had the highest accuracy, followed by DCOMPLEX. iSEE is again the fastest of all methods but also the least accurate. BindProfX utilizes several pre-calculated physico-chemical structural data in its scoring function while, JayZ and EasyE each layer an additional predictive calculating feature on top of Rosetta's backbone sampling, adding complexity to the predictive algorithms. However, all three have similar r yet they do not achieve the accuracy of MD+FoldX. Overall, for non-Ab complexes, EasyE and JayZ appear to have the best balance between speed and accuracy of the methods we tested. For Ab complexes, DCOMPLEX appears to have the best balance.
We have demonstrated that all the tested methods have specific strengths and weaknesses; some perform better in specific contexts (Table 4), and some have longer runtimes to obtain similar predictive power to comparably faster methods. This highlights the complexity of the physico-chemical properties and structural features that drive, and limit, these predictive models. Moreover, our study highlights the need to separately evaluate the performance of future ΔΔG predictors for both Ab and non-Ab complexes. There is also a need for a much larger training dataset of experimentally measured binding affinities for both types of complexes. New binding affinity calculation approaches are also needed to properly account for the contribution of bridging water molecules that are often present at the protein-protein interface. Our results can be used to make informed decisions for methods that may be preferable for a particular study or system. Table 4 suggests that if the goal is to estimate only the order of magnitude or sign of relative binding affinities, then the preferred method will likely be very different than if the goal is to obtain the best possible accuracy for antibody-antigen systems. To improve accessibility, we have generated an in-house Python script (provided in the supplement with the full dataset used in this work) that can be used to parse any of the parameters used in this study and provide tailored information. This information in combination with the runtime and other details provided in this study can be used to inform users on methods that can provide the best accuracy and efficiency for a given protein-protein complex type, set of physico-chemical features or structural parameters, and set of mutations. Additionally, the script can be extended to other methods and feature-sets, potentially elucidating specific problems or areas of improvement to existing and future methods.

Conclusions
In this study, we have assessed the accuracy and efficiency of eight computational methods on predicting binding affinity changes due to single amino acid mutations. Methods were tested on 16 different protein complexes: eight antigen-antibody (Ab) and eight non-antigen-antibody (Non-Ab) complexes. While some methods perform consistently better than others, how well each performs depends on the physico-chemical and structural components of each complex. EasyE was the most accurate for non-Ab complexes, and MD+FoldX was most accurate for Ab complexes. JayZ and EasyE were better able to distinguish between destabilizing (ΔΔG > 0.5 kcal/mol) and stabilizing (ΔΔG < -0.5 kcal/mol) as compared to any other method. Future work could include more systems or different methods, including those that are solely web server-based in order to expand and better refine our conclusions on their predictive capability.