Estimation of the breadth of CD4bs targeting HIV antibodies by molecular modeling and machine learning

HIV is a highly mutable virus for which all attempts to develop a vaccine have been unsuccessful. Nevertheless, few long-infected patients develop antibodies, called broadly neutralizing antibodies (bnAbs), that have a high breadth and can neutralize multiple variants of the virus. This suggests that a universal HIV vaccine should be possible. A measure of the efficacy of a HIV vaccine is the neutralization breadth of the antibodies it generates. The breadth is defined as the fraction of viruses in the Seaman panel that are neutralized by the antibody. Experimentally the neutralization ability is measured as the half maximal inhibitory concentration of the antibody (IC50). To avoid such time-consuming experimental measurements, we developed a computational approach to estimate the IC50 and use it to determine the antibody breadth. Given that no direct method exists for calculating IC50 values, we resort to a combination of atomistic modeling and machine learning. For each antibody/virus complex, an all-atoms model is built using the amino acid sequence and a known structure of a related complex. Then a series of descriptors are derived from the atomistic models, and these are used to train a Multi-Layer Perceptron (an Artificial Neural Network) to predict the value of the IC50 (by regression), or if the antibody binds or not to the virus (by classification). The neural networks are trained by use of experimental IC50 values collected in the CATNAP database. The computed breadths obtained by regression and classification are reported and the importance of having some related information in the data set for obtaining accurate predictions is analyzed. This approach is expected to prove useful for the design of HIV bnAbs, where the computation of the potency must be accompanied by a computation of the breadth, and for evaluating the efficiency of potential vaccination schemes developed through modeling and simulation.


Introduction
Vaccination is a medical procedure which has played an essential role in protecting mankind against viral and bacterial infections since the time of Edward Jenner, who developed a vaccine for smallpox over 200 years ago. Although for some diseases caused by viruses, such as measles, a small number of vaccinations provide almost permanent immunity, for other such as influenza, an annual revaccination, which provides only limited protection, is required. Since the measles virus, like the flu virus, undergoes error-prone replication that introduces mutations, it is not clear why the measles vaccination works as well as it does. Recent research [1] suggests that the measles virus remains antigenically monotypic because mutations are almost always lethal, though the reason for that is not known.
For HIV, which is the focus of this paper, no approved vaccine exists, although we are now almost forty years into the HIV/AIDS epidemic. As is the case for influenza virus, HIV evolves rapidly so that there exist many different viral strains. Some of them can evade the immune response to a vaccination directed against only a small number of strains. Thanks to the development of antiretroviral therapies [2], people who are infected by HIV can live essentially normal lives, without succumbing to AIDS. Several years after infection, a small fraction of the HIV infected individuals develop antibodies that are referred to as broadly neutralizing antibodies (bnAbs), i.e. antibodies that are effective against many strains of the virus [3]; an example of a detailed structural study of the germline and mature bnAbs from a single patient is given in Fera et al. [4]. The fact that the immune system can develop bnAbs over time has led to renewed interest in the possibility of developing a vaccine that would be effective against HIV [5,6].
The "neutralization breadth" of an antibody is a measure of its ability to recognize and neutralize different variants of the virus. An antibody with a high breadth can recognize effectively many different variants, while a low breadth antibody is more specific. Experimentally, the neutralization breadth of an antibody is evaluated by testing its ability to inhibit a panel of antigens from different clades of the target virus [7]. In what follows we simply write "breadth" when referring to "neutralization breadth". For HIV, the Seaman virus panel contains 109 representatives of HIV clades A, B, C and circulating recombinant forms [7]. To evaluate the breadth of an antibody, the half maximal inhibitory concentration of the antibody (IC 50 ) is measured for each virus in the panel. The breadth is defined as the fraction of viruses for which the IC 50 is less than a given cutoff, usually set to 1 μg/ml.
The present paper describes a computational method to estimate the breadth of new HIV antibodies using only the sequence of their heavy and light chains, and the assumption that they form antibody/antigen complexes that are similar to a known crystal structure of an antibody/antigen complex that may not have high sequence homology. Of the known HIV antibodies, we select those that target the CD4 binding site (CD4bs) of the HIV Envelope glycoprotein, a binding site used by diverse bnAbs [5]. Because the CD4bs is highly conserved, an HIV vaccine designed to elicit bnAbs that bind to this site would have a high therapeutic potential: bnAbs would likely recognize the conserved core, as well as the variable regions in the neighborhood that are required for the broad-based character to develop [8].
However, a major obstacle to a successful computational design of bnAbs is a lack of accurate methods for computing the antibody breadth. Here, we use machine learning [9] coupled with descriptors obtained from all-atoms models of the antibody/antigen complex to predict the IC 50 values for the viruses in the Seaman panel from which the antibody breadth can be estimated. To predict the IC 50 we use a supervised artificial neural network [10,11], a multilayer perceptron (MLP); see Fig 1. The neural network is trained by backpropagation, where the errors in the outputs are propagated backwards into the artificial network structure to optimize the internal parameters of the model. An essential element for the successful application of artificial neural networks is the availability of a large data set for training the networks [11]. For HIV, experimental IC 50 values have been collected in the CATNAP database of neutralizing antibodies [12], which contains more than 40000 IC 50 values; of these, 6179 satisfy the criteria required for our machine learning approach.
In the body of the paper, we describe training of the MLP to predict either the actual value of the IC 50 (by regression), or whether or not the antibody recognizes the virus (by classification). The outputs of the two artificial neural networks are then used to estimate the breadth of known antibodies. In a Concluding Discussion, we consider the limitation and extension of the methodology described here, as well as its utility for vaccine design.

Neural network regression model for the IC 50
The first step toward the estimation of the breadth of a given antibody is to evaluate the IC 50 for the binding of the antibody with all the antigens in the Seaman virus panel [7]. Here, the IC 50 values are estimated using a Neural Network regression model (an MLP regressor), trained over experimental IC 50 values. Experimental values are obtained from the CATNAP database, which contains a total of about 40000 IC 50 values [12]. The values in the database were filtered to make sure that the amino acid sequences of both the antibody and antigen are available, that at least one crystallographic structure of the antibody bound to a related HIV antigen is known, and that the antibody binds to the CD4 binding site (CD4bs). Upon filtering the database for these attributes, a total of 6179 IC 50 values were obtained. Of these, 3864 are reported as exact values, while the remaining 2315 are reported as "greater than" a given cutoff (e.g. ">50 μg/ml"); see Table 1. We refer to these as "approximated" values. To train the MLP regression model, only the 3864 exact values are used; the approximate ones were discarded.
As inputs for the MLP regressor, we use the values of descriptors of the antibody/antigen bound system obtained from an atomistic model of the complex. A total of 24 descriptors were tested; they fall into four classes: atomic descriptors, protein/protein scoring functions, protein stability scoring functions, and entropy models (see Methods). The procedure for building the model for one antibody/antigen complex and evaluating all the descriptors requires about 13 minutes (see Methods); for evaluating the values of the subset of 19 chosen descriptors only about 20 seconds are required, once the model has been constructed (see Methods).
For training the MLP regressor the available experimental IC 50 values are randomly divided in half to obtain a training set and a test set. The training is then performed, and the obtained neural network used to predict the IC 50 . The training is repeated 30 times using each time a randomly generated training and validation set, and the results are averaged over the 30 neural network results to estimate the error in the predictions. It is composed of an input layer (the X i nodes) that contains the descriptors developed for the system being studied, a hidden layer (H i nodes) that combines non-linearly The correlation between the experimental and computed IC 50 is shown in Fig 2 for both the training and the test sets. The Pearson correlation coefficient for the training set is 0.686 (Spearman coefficient is 0.682). For the test set, the correlation coefficients decrease to 0.410 (Pearson) and 0.408 (Spearman). This is a significant improvement over the use of individual descriptors, which maximally reach a Pearson correlation coefficient of 0.28; see Fig 3 (last  column).
With these IC 50 values, the breadth of known antibodies can be computed and compared with the experimental values. In the CATNAP database, 24 antibodies have a known sequence and bind to the CD4bs with a known crystallographic structure; see Table 1. The breadth is calculated for all of these using the computed IC 50 values for the viruses in the Seaman panel. Specifically, for each virus in the Seaman panel, the IC 50 is estimated using the MLP regressor, and the fraction of viruses for which the IC 50 is less than 1 μg/ml is counted. Fig 4 (left) compares the experimental and computed breadths and their estimated errors (see Methods). For most antibodies the predicted breadth is in reasonable agreement with the experiments, although there is a tendency to overestimate it. The Pearson correlation coefficient is 0.800 (Spearman is 0.766) indicating a meaningful correlation. The breadth of the antibody CAP257-RH1 is significantly Table 1. Antibodies studied in this work. The second and third columns report the number of experimental IC 50 values available as exact or approximate values. The fourth column reports the experimental breadth calculated based on the Seaman antigen panel. The number in parenthesis is the number of antigens used to calculate the breadth (the Seaman panel contains 109 antigens, but the experimental IC 50 value is not available for all of them). The fifth column indicates the PDB ID used as the template for modeling the antibody/antigen complex. If only one PDB is indicated, that is used for the modeling "as is". If two are present, the first is used to model the antibody, the second for the antigen (after best-fit RMSD superimposition to the antigen in the first PDB). The last three columns report the breadth calculated using the IC 50 regressor with and without approximated values and using the IC 50 classifier (see text). overestimated (experimental is 0.013, computed is 0.61±0.30), but this antibody is the one for which only a single exact IC 50 value is available, and the computed uncertainty is highest. Given the limited accuracy for the prediction of the IC 50 values (Fig 2), this result is particularly encouraging. It indicates a low sensitivity of the breadth to the actual IC 50 values. This is rooted in the definition of the breadth, which does not require the exact value of the IC 50 , but only whether the antibody does or does not bind to a given virus.

Antibody
As a further test, all 6179 experimental values are used in the regression model including the "greater than" approximate values. For these, the maximum experimental value is used; e.g. if the IC 50 is expressed as >10 μg/ml and >50 μg/ml in two different studies, the value 50 μg/ml is used. The obtained regression model shows an increase in accuracy, in particular an increase ability in ranking different antibodies (Spearman correlation coefficient increases from 0.766 to 0.960), but it has a loss in sensitivity for all low-breadth antibodies predicting zero breadth for most antibodies with experimental breadth less than 0.4; see Fig 4 (center.) As a result, although the accuracy of the slope is improved, the regression line is shifted to the right, yielding worse values.

Neural network classifier for the IC 50
The breadth can be estimated more directly with an MLP classifier, which is trained to predict whether the IC 50 is less or greater than the 1 μg/ml cutoff. A higher accuracy in the correlation between experiments and predictions is expected, because a classification is a simpler problem than a regression. The methodology used for the MLP classifier is the same as that employed for the MLP regressor, except that the whole set of 6179 experimental IC 50 values is used; i.e.,  Estimation of the breadth of HIV antibodies by molecular modeling and machine learning the 2351 "approximate" values are included. Table 1 lists the antibodies and the number of "exact" and "approximate" IC 50 values available for each.
The correlation between the experimental and the computed breadth using the IC 50 classifier is shown in Fig 4 (right). As expected, the results are significantly better than for the MLP regressor, with the correlation with the experimental breadth increasing from 0.800 for the regressor to 0.973 for the classifier. Moreover, no outliers are present and the estimated breadths of all 24 antibodies lie on the diagonal with near unit slope.
The performance of the MLP classifier is evaluated further by the analysis of its confusion matrix, which contains the percentage of true negative, true positive, false negative and false positive values. As reported in Table 2, the accuracy in the test set, corresponding to the sum of true positives and true negatives, is 72.3±1.0, with an almost equal rate of false positive and false negative of about 13%.

Test of MLP classifier on germline or immature antibodies
Although our primary purpose for developing the MLP classifier is to evaluate antibodies designed to have increasing breadth (see Concluding Discussion), we decided to determine whether it could be used to compute the breadth of putative germlines, as compared to the breadth of the mature antibodies. We consider bnAbs VRC01, NIH45-46 and 3BNC60, for which the experimental breadths of the mature antibodies are available, see Table 3. For the putative germlines no experimental breadths are available, but the expected breadth should be low, as they appears to have no affinity for native HIV antigens [13][14][15].
The computed breadth values are reported in Table 3. For the mature antibodies the computed values are in good agreement with the experimental data. This is particularly interesting for the 3BNC60 antibody, for which no published structure of any antibody/antigen complex is currently available. For this reason, 3BNC60 was excluded from the training of the MLP regressor and classifier. However, a bound crystallographic structure is available for its putative germline precursor [16]. Assuming no significant change in the binding pose upon affinity maturation, the germline crystallographic structure was used for both the mature and germline antibody. This is likely to be a valid assumption, as indicated by the VRC01 and NIH45-46 Table 2. Confusion matrix for the IC 50 classifier. The accuracy (ACC) is defined as the sum of the true positive (TP) and true negative (TN): ACC = TP+TN; the balanced accuracy (BACC) is defined as the average between the true positive divided by total positive samples (P) and true negative divided by total negative (N) samples: BACC = (TP/P + TN/N)/2. antibodies for which crystallographic structures are available for both mature and germline forms bound to an antigen (see PDB 4JPK and 3NGB for VRC01 and 5IGX and 3U7Y for NIH45-46). The breadth of antibody 3BNC60 is slightly underestimated (0.53±0.21 vs 0.78), but it also has a higher than usual uncertainty. For the putative germlines, the calculated breadths range between 0.27 and 0.45. Importantly, in all cases the breadth of the germline is significantly lower than the breadth of the mature antibody. This finding is encouraging, because only mature antibodies were used in the training of the MLP classifier, as no quantitative information for germline binding to HIV viruses is available in the CATNAP database. The MLP classifier is thus biased towards mature antibodies.
Finally, we consider the DRVIA7 antibody, which is an immature form of a VRC01-class antibody [17]. Few experimental IC 50 values are available for this antibody; they show a breadth of about 0.23. The estimated value by modeling is 0.45±0.17, which is again higher than expected, but lower than the VRC01 mature value (0.79±0.08) and similar to the germline (0.45±0.23).

Analysis of descriptors used in the MLP classifier
An important aspect in the present application of machine learning for the prediction of antibody breadth is the choice of descriptors. The 24 descriptors considered can be broadly divided into four categories: atomic descriptors such as buried surface area or number of hydrogen bonds (15 descriptors in total), protein/protein binding affinity measures (4 descriptors: Prodigy, ZRANK, ZRANK2 and DFIRE), protein folding scoring functions (FoldX and two pairwise statistical potentials: RFHA and RFCB), and entropy models (normal mode entropy from two different elastic network models: ENM_EXP and ENM_R6); see Methods for more details. Their relative correlation coefficients and their correlation with the experimental values are reported in Fig 3. As expected, some descriptors are highly correlated with each other, such as the number of hydrogen bonds evaluated using different definitions, or the van der Waals energy with the buried solvent accessible surface area.
Selecting too few descriptors would risk making the model less general. Consequently, we kept a large number of descriptors, using as discriminant the computational cost needed to evaluate them. From the full set of 24 descriptors, 5 are initially removed due to their computational cost. The remaining 19 descriptors are all very quick to evaluate, about 20 seconds in total for one model, obtaining a 12.8x speedup in the calculation of the descriptors with respect to whole set of 24 descriptors. No decrease in accuracy is observed.
If too many descriptors are used, there is risk of overfitting. This is avoided here by the very large number of available experimental values. For the MLP classifier 6179 experimental values are available, half of which are used in the fitting. The MLP classifier with 19 descriptors and 10 hidden nodes contains 200 fitting parameters (see Methods), versus the 3089 (6179/2) experimental values used in the fitting. A ratio between the number of experimental values in the training and number of parameters in the model of ten or greater is suggested [11]; here the ratio is >15. As supporting evidence, Table 2 shows that the accuracy of the MLP classifier in the training and test sets is similar.
We also studied the effects of the number of descriptors used in the MLP classifier on the accuracy of the model as measured by the confusion matrix; see Fig 5. Each sample in these two plots is a different MLP classifier model fitted over a different set of descriptors. The first graph shows the accuracy of the given model, as measured by the confusion matrix as sum of the percentages of true positive and true negative, while the second one shows the Pearson correlation coefficients obtained when comparing the calculated and experimental breadths. The blue points correspond to models with randomly-chosen descriptors, while the red points represent models obtained by successively removing the descriptor that is most correlated with all other descriptors and less correlated with the experimental IC 50 (see Methods). The accuracy starts to decrease significantly when the number of descriptors is less than about 10. The Pearson coefficient is more robust and starts decreasing at around 7 descriptors. At this point, the maximum cross-correlation (Pearson coefficient) between the descriptors is about 0.4. These data suggest that a simpler model of about 7-10 descriptors could be used, which would have (approximately) the same accuracy of the 19-descriptors model; see Table 4. As expected from the earlier analysis, the breadth results converge more rapidly than the accuracy.
This raises the question of which descriptors are most important in the prediction of the neutralization ability of an antibody. There are a variety of methods to extract descriptor importance from an artificial neural network, but there is no apparent consensus on which is better, in particular if the descriptors show a relative high level of cross correlation [18]. However, it is possible to extract some information from the correlation of the descriptors with the experimental IC 50 values and with the deletion order reported in Table 4. The protein/protein scoring functions are on average more important than molecular descriptors, in particular the protein folding propensity descriptors (FoldX) and the two statistical pairwise potentials for estimating protein stability (RFHA and RFCB). Also, the entropy from elastic network models (ENM_EXP and ENM_R6) plays an important role, with ENM_EXP the last deleted descriptors with a relatively high correlation with the experiments (0.292 correlation coefficient). These highlight the importance of entropy in the binding, and the usefulness of complex protein potentials, with respect to simple molecular descriptors.

MLP classifier generalization ability
An issue in machine learning models is how good they are at generalizing to inputs outside their training set [11,19,20]. In computational studies about HIV neutralization epitopes it is relatively common to have one model trained for each antibody under study [21][22][23]. By Estimation of the breadth of HIV antibodies by molecular modeling and machine learning contrast, in this work only one model was trained to predict the neutralization breadth of all antibodies. We note that our training set did contain some experimental data from each antibody under study. This suggests the generalization ability of the model should be examined. Fig 6 shows how the predicted breadth of each antibody changes when an increasing number of experimental data relative to that antibody are included in the training set. At zero, no relevant experimental data are included; e.g. for the VRC01 antibody, at zero no experimental IC 50 value of the VRC01 antibody is included in the training set. In these plots, the blue line is the breath computed using all available experimental data in the training set, while the red line is the experimental neutralization breadth. Comparing the plots for different antibodies, about half of them have a good breadth prediction even at zero, and the predicted breadth does not change significantly upon inclusion of more data; see for examples antibodies 1B2530, 8ANC131, 8ANC134, CH103, VRC01, VRC07, VRC-PG20. For other antibodies the prediction at zero is quite far from the experimental value; e.g. antibodies NIH45-46, CAP257-RH1, CH235.12, or N6. However, all these antibodies converge to the correct experimental value when a relatively small number of experimental values are included in the training set, usually between 10 and 20, relative to the total number of values available (see Table 1). The Pearson correlation coefficient between the computed and experimental breadth is 0.50 for the classifier with zero experimental data when all but 4 out of the 23 antibodies studied are included in the analysis (excluded antibodies are N6, CAP257-RH1, CH235.12 and NIH45-46, which have the highest error in the computed breadth). When the entire set of antibodies is considered, the correlation coefficient is 0.14; including just one experimental value per antibody, the Pearson coefficient increases to 0.54, and it reaches 0.90 when including only four experimental values. Another point to consider is that the good results obtained for some antibodies are due to their close similarity to other antibodies in the training set; e.g. the 8ANC131/8ANC134 pair and the VRC01/VRC07 pair. These results show that the neural network model is able to Estimation of the breadth of HIV antibodies by molecular modeling and machine learning predict accurate values for the breadth in instances when a small number of experimental data for the antibody under study, or a close relative, are included in the training set.

Comparison with other machine learning techniques
It is of interest to compare the results from MLP classifier used here with other machine learning techniques. We compared the Neural Network with one hidden layer used in this paper with a Neural Network with two hidden layers, k-nearest neighbors (kNN) [24], random forests (RF) [25,26], and Support Vector Machine (SVM) [27]; see Methods. All methods produce very similar results, in both the correlation with the experimental breadths and the estimated error in the single breadth values; see Table 5. This is supporting evidence of the robustness of the results obtained with the MLP. Moreover, it suggests that to improve the results new information about the system needs to be introduced, e.g. in the form of different descriptors.

Discussion
To the best of our knowledge, we report in this paper the first attempt to estimate the breadth of HIV antibodies by the use atomistic modeling and machine learning techniques. We developed a method based on a Multi-Layer Perceptron (an artificial neural network), which is able to accurately reproduce and predict the breadth of CD4bs targeting HIV antibodies. For the development of such a model, many experimental IC 50 values to train the neural networks have to be available. Here we used experimental IC 50 values from the CATNAP database [12]. After cleaning and filtering the data, 6179 IC 50 values were obtained for the training of the classifier and 3864 for the regressor. These significantly outnumber the number of parameters in the neural network (200), reducing the possibility of overfitting. For best results, a small number of experimental values specific for the antibody under study, or a related antibody, need to be present in the training set. For the application of the same techniques to different HIV binding sites or different viruses, such as influenza, sufficient high-quality experimental data would have to be collected and validated.
An important concern in any HIV model is the glycosylation of the virus since the HIV Envelope protein is heavily glycosylated. In the CD4bs, the focus of this work, there are at least

Fig 6. Computed breadth for the 24 antibodies studied in this work as a function of the number of experimental IC 50 values for that particular antibody used in the training of the MLP classifier.
The blue line is the computed breadth using all available experimental data in the training set, the red line is the experimental breadth. The graphs are shown up to 50 included experimental data, the total number of available experimental data is reported in Table 1. https://doi.org/10.1371/journal.pcbi.1006954.g006

Table 5. Comparison of the MLP classifier results with other machine learning techniques: MLP classifier with two hidden layers, Random Forests (RF with 31 trees), Supported Vector Machine (SVM with radial basis function kernel), and k-nearest neighbors (with 15 neighbors and weights assigned inversely proportional to the distance).
The reported data are the accuracy of the classifier (from the confusion matrix), and the Pearson, Spearman, slope and intercept for the correlation of the calculated and experimental breadths. Estimation of the breadth of HIV antibodies by molecular modeling and machine learning five glycans (N197, N234, N276, N363, and N462) that can interact, and interfere, with the antibody binding. One glycan, N276, is particularly problematic in a comparison between germline and mature antibodies, as it has been observed that mature antibodies, like VRC01, introduce deletions or mutations to glycine in CDRL1 to avoid clashes and to accommodate the glycan [28,29]. This could be a major reason why the current model overestimates the breadth of putative germlines. The tools presented here are important for the computational design and screening of potential HIV antibodies. The usual focus in antibody design is in the optimization of the potency and specificity for a particular antigen [30]. As pointed out in the Introduction, this is only one part of the problem for highly mutable virus like HIV, for which the exposed antigens have a high variability, so that bnAbs are required. Thus, it is important to couple the calculation of the potency with an estimation of the breadth.

Method
The present technique for estimating antibody breadth based on IC 50 values could prove useful in the computational design of vaccination protocols to elicit HIV bnAbs. Promising computational/theoretical approaches for the design of vaccination strategies include a description of the affinity maturation (AM) process [31,32]. This is achieved by simulating how antibodies mutate during the AM from the germline precursors to the mature antibodies. One essential step in these simulations is evaluating the breadth of the produced antibodies. Since the method described here requires as inputs only the sequence of the antibody and a template for the bound structure, it is likely to be a useful part of the design process. Moreover, the techniques developed here can, in principle, be applied to other highly variables viruses for which a definition of antibody breadth is important, such as influenza or hepatitis C viruses, if sufficient experimental neutralization data are available for training the models. From preliminary examination of the literature, this appears to be true at least for influenza.
An alternative perspective is to consider the evaluation of the breadth from the antibody/ antigen binding affinity, see S1 Appendix. This requires an accurate, relatively rapid, method for the calculation of the binding affinity. Research on developing such a methodology is in progress [33]. The use of the "free energy of binding breadth", rather than the "IC 50 neutralizing breadth", would, in principle, bypass the need for machine learning techniques.

Methods
First, the source of experimental IC 50 values used in the training of the neural network is described. Then, the procedure to create an atomistic model of antibody/antigen complexes is presented, followed by the computation of all molecular descriptors derived from the atomistic models. The computational cost of creating the models and evaluating the descriptors is then reported. The final section is dedicated to the details of the neural network models.

Experimental IC 50 database
To calculate the breadth of an antibody, the IC 50 values for the antigens in the virus panel are required. In this work the experimental IC 50 are used to both calculate the experimental breadth and to train a neural network (see below) to predict the IC 50 values and breadth of new antibodies.
Experimental values of the IC 50 for different antibody/antigen complexes were obtained from the CATNAP database [12] (accessed on March 5, 2018). The IC 50 values are presented in the database in two forms. If available, the exact value is reported; if the complex has a low binding affinity, the exact value is not available, and the IC 50 is reported to be greater than a given cutoff, e.g. "IC 50 >50μg/ml". The full database contains information for 1024 HIV viruses and 334 antibodies, for a total of 23851 exact and 16467 approximate IC 50 values (most antibodies are tested on different viruses). Of interest in this work are antibodies binding on the CD4bs. Because our method requires atomistic models of the antibody/antigen complex, IC 50 values are selected such that the binding is at the CD4bs (based on known crystallographic structures of the bound antibody/antigen complex), the amino acid sequences of both the antibody and the antigen are known, and at least one crystallographic structure of the complex is experimentally available. After applying these filters, 24 antibodies (over the 334 available) are selected, for which a total of 3864 exact and 2315 approximate IC 50 values are available; see Table 1.

Creating 3D atomistic models
The protocol described in this work requires building of atomistic models for arbitrary antibody/antigen complexes and computing molecular descriptors to train the neural network models. The first task is to build a full 3D atomistic model of the antibody/antigen complex given only the sequences of the two proteins. A requirement is to have a template of the complex, which is used for homology modeling. We select one template for each antibody, neglecting differences in the antigens. The underlying assumption is that the binding mode of a given antibody to any antigen is very similar and can be approximated to be constant. Moreover, the structure of the antigen is assumed to be conserved over the sequence space. This is not generally true for antigens with significant insertion or deletions in their (hyper)variable loops. Because we focus on the CD4bs, the assumption is reasonable. Crystallographic structures of antibody/antigen complexes were used as templates for the 24 antibodies. In most cases, the antibody was not bound to naturally-occurring antigens, but to engineered monomeric (gp120) domains. In these cases, the engineered protein was substituted with the BG505 SOSIP structure (PDB 5FYJ) using best-fit RMSD alignment for the superimposition. The PDB codes used for each antibody are reported in Table 1.
Given the sequences of the antigen and the antibody and a template crystallographic structure for the complex, Modeller [34] was used to create a model of the complex. The structure was then refined by CHARMM [35], including adding missing hydrogen atoms, creating disulfide bonds, and minimizing the all-atom structure by 100 steps steepest descent. The system was then further minimized using OpenMM [36] until the potential energy was changing by less than 1 KJ/mol. The CHARMM36 force field [37] was used for the all-atoms energy minimizations. In the modeling, only one monomer of the Env gp160 trimer was built, the gp41 deleted, and only the variable part of the antibody was kept. Glycosylation was not included.

Molecular descriptors
The optimized structure obtained by molecular modeling (see previous paragraph) was used for the computation of all descriptors to train the neural networks. 24 molecular descriptors were computed. These are divided in four main classes: atomic descriptors, protein/protein binding scoring functions, protein stability scoring, and entropy models.
As "atomic descriptors" we consider descriptors that are simple function of the atomic coordinates. The two most obvious are the electrostatic (Coulomb) interaction between the antibody and the antigen (E_elec) and the dispersive (Lennard-Jones) interaction (E_vdw). These are modeled according to the CHARMM36 force field [37] and computed using OpenMM [36]. To approximate the solvation of the protein, the GBn2 [38,39] generalized Born implicit model was used as third descriptor (E_gbsa). No cutoff in either electrostatic or dispersion force was used. Other used descriptors are the buried surface area [40] (MD_sasa) upon binding, the number of hydrogen bonds according to Baker & Hubbard [41] (MD_h1) and according to Wernet, Nilsson et al. [42] (MD_h2) definitions, and hydrogen bond energy according to Kabsch & Sander [43] (MD_h3). These four descriptors were computed using the MDTraj [44] python module. To these, the number of interchain contacts classified according to polarity and charge are added (a total of six descriptors: IC_AA, IC_PP, IC_CC, IC_AP,  IC_CP, IC_AC), as well as the apolar and charged non-interacting surface area (NIS_A and NIS_C). These last eight descriptors were computed using Prodigy [45].
More complex descriptors are scoring functions developed for estimating the binding affinity of two proteins. These methods have been developed to score binding modes in protein/ protein docking software. Here, they are not used as independent scoring functions (as originally intended), but as descriptors. The used methods are: Prodigy [45], ZRANK [46], ZRANKr [47,48], and DFIRE [49].
The third class of descriptors consists of scoring functions optimized to reproduce the folding propensity of a protein or its stability. With these scoring functions, the binding affinity can be estimated as the difference between the "stability" of the complex and that of the separated antibody and antigen. The methods tested are: FoldX [50,51] (sidechains are optimized), and two pairwise statistical potentials (RF_HA_SRS and RF_CB_SRS_OD) [52,53].
The fourth class of descriptors contains two methods used to estimate the entropy change upon binding. Two classical approaches to evaluate the entropy of a macromolecule are the use of the normal mode analysis (NMA) or quasi-harmonic (QHA) [54]. Neither of these can be used here directly: NMA requires that the structure is at an energy minimum, and QHA requires a large ensemble of structures. Alternatives are Elastic Network Models (ENM) [55,56]. In these models, the protein is modeled as a set of atoms interconnected by elastic springs, which vibrate around the given input structure. The total energy E of the system is obtained as sum of pairwise potentials, each acting between a pair of atoms whose distance is under a given cutoff: The sum is over all pairs of atoms ij, d ij is the distance between them, d 0 is the reference distance (from the input structure), and k(d 0 ) is the force constant. k(d 0 ) can take different expressions. The first developed used a constant for each pairs under a given distance cutoff [55]. More accurate models use force constants that depend on the reference distance d 0 . In this paper two expressions are tested. In the first (ENM_EXP) the force constant decreases exponentially with d 0 : [57]. In the second (ENM_R6), the force constant decreases proportionally to d À 6 0 : kðd 0 Þ ¼ ad À 6 0 [58]. In both cases an arbitrary proportionality constant a has to be set. Here the value of 1000 is used (in arbitrary units). A 100-fold increase in the force constant caused no significant change in the results. From the energy expression, the Hessian matrix can be calculated, and, upon diagonalization of the mass-weighted Hessian matrix, the normal frequencies are obtained, and from them the harmonic entropy. These calculations are fast and do not require energy minimization as the input structure is considered to be the minimum.
A problem with most of the descriptors used in this work is their high sensitivity to the values of the atomic coordinates. To alleviate this problem, six models for each complex are generated with Modeller using random initial seeds, and the descriptors are averaged over the six models.

Timing
An important factor to consider when designing and choosing descriptors to use in neural networks and regressor models is the computational cost, or timing, needed to compute them. In this work, two are the main steps: first, generate the 3D atomistic model; second, compute all the descriptors.

Neural network models
To predict the IC 50 values a Multi-Layer Perceptron regressor (MLP regressor) is used. This type of artificial neural network takes as input a number of descriptors, parse them in a hidden layer composed, in this case, of ten nodes, and outputs the predicted IC 50 value. For the training, the 3864 exact IC 50 values are split randomly in half into a training set and a validation set. The total number of parameters N p to fit in the MLP is given by N p = N i N h +N h N o where N i , N h , and N o are the number of input descriptors, hidden nodes, and outputs. Using 19 descriptors, and requiring one output (the IC 50 value), the total number of parameters to fit is 200. As an empirical rule, the ratio between the available experimental values used in the training and the number of parameters to fit has to be much greater than one. In our case, the number of experimental IC 50 values used for the training is 3864/2, which is 9 times higher than the number of parameters to fit. A logistic activation function is used in the hidden nodes, and the lbfgs solver is used for optimization. Fig 3 shows the cross correlation between all descriptors and the experimental IC 50 values.
A similar artificial neural network is used to predict whether the IC 50 is less or greater than the experimentally-derived 1 μg/ml cutoff, which is used to calculate the antibody breadth. In this case, a Multi-Layer Perceptron classifier (MLP classifier) is trained instead of a regressor on the same set of descriptors. The same settings as for the MLP regressor were used, with the only difference that the output is filtered with a logistic function to be between 0 and 1. Here the total number of experimental values is 6179. Of these, 3089 are randomly selected and used to train the model. The ratio between the number of experimental values in the training (3089) and the number of parameters (200) is more than 15, decreasing the chance of overfitting. Table 6 compares the composition of the full dataset and a randomly generated set containing 50% of the values. All antibodies are represented in the training set, with the same percentages of binders and not-binders as in the whole set.
It was observed that the outputs of both the MLP regressor and MLP classifier would change, in some cases significantly, upon repeated training using different random splitting of the experimental values into training and validation sets, and upon fitting the neural networks using different random seeds. For this reason, we average over 30 independently generated neural networks using each time a new splitting of the training and validation set and a new Table 6. Composition of the IC 50 database used to train the IC 50 classifier. One set of data is available for each antibody: the line "All" indicates the statistics for the whole data set, "Random" for a randomly generated set of containing 50% of the data. Count is the number of complexes containing that antibody, with percentages relative to the full data set. "Bind" and "Not bind" are the numbers (percentages) of complexes for which binding and no binding is detected. random seed in the training. The standard deviation between the 30 replicas is used as an estimate of the statistical error.

Antibody
To select which descriptors to use in the neural network, the first criterium was to keep the scoring function as fast as possible, avoiding the more computationally expensive OpenMM energy-based terms and FoldX. Removing these four descriptors does not cause a decrease in accuracy, while reducing the required time by a factor of 12.8. Prodigy was also removed, as it is a linear combination of other descriptors (no effect is observed in the accuracy after removal). This way, the number of descriptors used decreases from 24 to 19 without loss in accuracy. To study the performance of simplified models including fewer descriptors, the models are sequentially simplified by removing the highest correlated descriptor. To do this, the cross-correlation matrix with all descriptors is analyzed to find the pair of descriptors with the highest correlation among themselves; see Fig 3. The descriptor with the lowest correlation with the experimental IC 50 values is removed. The process is repeated until only one descriptor is present. The list of descriptors which are deleted at each step (and their correlation coefficients) are reported in Table 4. All descriptors are normalized to zero mean and unit variance (standard scaler).
To verify the robustness of the neural network, the same analyses were repeated using other machine learning methods: k-nearest neighbors (kNN) [24], random forests (RF) [25,26], and support vector machine (SVM) [27]. First, the MLP were compared with a second MLP containing two hidden layers instead of one. Having observed the same results, it was compared with a kNN (with distance weighting), a RF (composed of 31 trees) and a SVM (with radial basis function kernel) and no significant improvement was observed with any of them.
All machine learning methods (MLP, kNN, RF, SVM) and the standard scaler are used as implemented in the scikit-learn [59] python module.
Supporting information S1 Appendix. Computation of the breadth from the antibody/antigen binding affinity. (PDF) S1 Data. Python scripts and data to reproduce the figures and tables in the text. (TGZ)

Acknowledgments
We thank Victor Ovchinnikov for useful comments throughout this research and for careful reading of the manuscript.
Funding acquisition: Martin Karplus. Estimation of the breadth of HIV antibodies by molecular modeling and machine learning