Modeling the sequence dependence of differential antibody binding in the immune response to infectious disease

Past studies have shown that incubation of human serum samples on high density peptide arrays followed by measurement of total antibody bound to each peptide sequence allows detection and discrimination of humoral immune responses to a variety of infectious diseases. This is true even though these arrays consist of peptides with near-random amino acid sequences that were not designed to mimic biological antigens. This “immunosignature” approach is based on a statistical evaluation of the binding pattern for each sample, but it ignores the information contained in the amino acid sequences that the antibodies are binding to. Here, similar array-based antibody profiles are instead used to train a neural network to model the sequence dependence of molecular recognition involved in the immune response of each sample. The binding profiles used resulted from incubating serum from 5 infectious disease cohorts (Hepatitis B and C, Dengue Fever, West Nile Virus and Chagas disease) and an uninfected cohort with 122,926 peptide sequences on an array. These sequences were selected quasi-randomly to represent an even but sparse sample of the entire possible combinatorial sequence space (~10^12). This very sparse sampling of combinatorial sequence space was sufficient to capture a statistically accurate representation of the humoral immune response across the entire space. Processing array data using the neural network not only captures the disease-specific sequence-binding information but aggregates binding information with respect to sequence, removing sequence-independent noise and improving the accuracy of array-based classification of disease compared with the raw binding data.
Because the neural network model is trained on all samples simultaneously, a highly condensed representation of the differential information between samples resides in the output layer of the model, and the column vectors from this layer can be used to represent each sample for classification or unsupervised clustering applications.


Introduction
The humoral immune response to infectious disease involves the production of pathogen-specific antibodies by B cells. While much has been learned about the mechanisms underlying the development of the adaptive humoral immune response upon infection [1-4], comprehensive models describing the dependence of the antibody molecular recognition profile on antigen structure and the variability of the immune response between individuals are lacking [5-7]. B-cell sequencing has started to fill this gap in terms of understanding the repertoire of antibodies produced [8], but the approach is difficult and expensive to apply to very large numbers of samples. Phage display and other display approaches, in which amino acid sequences of various lengths are incubated with serum and antibody binding is detected, have provided a means of more directly analyzing target molecular recognition [9], but they are generally strongly biased towards detecting strong binding interactions. Protein microarrays have also been used to look at serum antibody binding in specific proteomes, identifying potential targets for either biomarker or drug development [10,11].
Large-scale peptide microarray technology provides an alternate approach and has been used to measure binding of protein targets to large numbers of peptide sequences in the analysis of protein-protein and peptide-protein interactions [5,12-15]. Antibody reactivity profiling (epitope mapping) has been one of the main research fields employing the technology [12,15,16]. As peptide arrays have become more available and affordable, they have become attractive for looking broadly at sequence-binding relationships. Array-based approaches have the advantage that one can obtain datasets on large numbers of peptide-target binding interactions simultaneously, measuring binding across a large dynamic range [17,18].
While most of the antibody or immune response applications of peptide arrays have been focused on measuring the interaction of peptides that represent known or suspected antigens with antibodies, the Johnston lab and others have developed the use of high density quasi-random peptide arrays as a tool for generating antibody binding profiles [17-32]. A key feature of these arrays is that the peptide sequences are chosen to cover combinatorial sequence space as evenly as possible, rather than focusing on biological sequences or known epitopes. Due to the random nature of the peptide sequences, the probability that the peptide array contains sequences that exactly match immunogenic regions of a pathogen is low. As a result, this "immunosignature" approach captures mostly low to moderate affinity interactions of antibodies with the array peptides. Nonetheless, statistical analysis of serum antibody binding to the arrays has been shown to enable robust detection or identification of immune responses associated with numerous infectious and chronic diseases [18,22,23,25-27,30].
The immunosignature method involves applying a small amount of diluted serum to a dense array of peptides with nearly random sequences of amino acids, typically with >100,000 distinct peptide sequences of about 10 amino acids in length [17]. In most of the studies done, only 16 of the 20 natural amino acids were used to synthesize the peptides. The level of antibody binding to the peptides on the array is then detected quantitatively using a fluorescently labeled secondary antibody and imaged by an array scanner. Based on a statistical comparison of binding patterns between case and reference samples, classifier models can be built to distinguish one disease response from another [20].
The cognate epitopes of the antibodies involved in an immune response are highly unlikely to appear within a random set of ~10^5 sequences on a peptide array. For a linear epitope of 10 amino acids in length, there are ~10^13 possible amino acid combinations, yet somehow the interaction of serum antibodies with only ~10^5 sequences captures sufficient information to both detect and identify disease state with high accuracy [17,18,21-23,25-27,30,33]. If sufficient information can be obtained from a random sparse sampling of antibody binding to 1 out of every 10^8 possible sequences (~10^13/~10^5), then the antibodies associated with an immune response must recognize millions to billions of different sequences to some extent in a manner that is disease specific. The fundamental question of the current study is whether this amino acid sequence-dependent antibody binding can be modeled using sparse sampling of the combinatorial binding space. If so, such a relationship could potentially be used to more effectively aggregate information from the array or to design new panels of sequences that more effectively differentiate diseases.
Recently, our group modeled the sequence-binding relationships of nine different, well-characterized, isolated proteins to the peptide arrays described above [34]. Binding patterns of each protein were recorded, and a feed-forward neural network model was used to relate the amino acid sequences on the array to the binding values. Remarkably, it was possible to train the network with 90% of the sequence/binding value pairs and predict the binding of the remaining sequences with accuracy equivalent to the noise of the antibody binding measurements (the Pearson correlation coefficients (R) between the observed and predicted binding values were equivalent to those between measured binding values of multiple technical replicates, and in some cases as high as R = 0.99). In fact, accurate binding predictions (R > 0.9) for some protein targets could be achieved by training on as few as hundreds of randomly chosen sequence/binding value pairs from the array. In addition, the binding predictions were specific; the model captured not only the bulk binding of individual proteins but also the differential binding between proteins. Finally, a neural network trained on weakly binding sequences effectively predicted the binding values of array sequences that bound 1-2 orders of magnitude more strongly. At least in the context of the combinatorial space of possible sequences in this model array-based system (~10-residue peptides using 16 different amino acids with the C-terminus bound to the surface of a silica substrate), training on one set of thousands of randomly selected sequences resulted in statistically accurate prediction of the binding to any other randomly selected set of sequences.
Binding to antibodies, in this case IgG in human sera, represents a much more complex system than binding to isolated proteins, and one might expect substantially more complex sequence-binding relationships. Previous studies have developed such relationships for immune responses using various starting datasets. A number of groups have looked at overlapping peptides presented on microarrays or in phage display libraries generated by tiling antigens or entire proteomes [35-38]. Bio-panning of phage or bacterial peptide display libraries coupled with next generation sequencing has provided broader binding profiles [39,40]. The advantage of tiling and panning approaches is that one is starting with known or suspected binding sequences, and thus the dataset is naturally rich in strong binding information. In one particularly effective study in this regard, a method referred to as Protein-based Immunome Wide Association Study was used to explore sequence binding relationships in 31 systemic lupus erythematosus samples [41]. Here a large bacterial display library (10^10 12-mer sequences) was reduced to ~10^6 sequences found to bind to serum antibodies from the samples, and the enrichment of specific 5-mer and 6-mer sequences within the resulting library was determined. These enriched sequences were then used to identify autoantibodies in the human proteome, and the authors were successful at identifying several known autoantigens for the disease within their top candidates. The same group has used similar methods to perform epitope mapping of antibodies to SARS-CoV-2 [42].
Machine learning algorithms have also been used to develop sequence-based models predicting binding of proteins to peptides, antibodies, and DNA [43-49]. For example, machine learning models have been used to model anti-microbial peptides, infectious viral variants that escape protection, potential epitopes on target antigens, and high antibody binding regions on target proteins, and to optimize target DNA sequences for transcription factors. To do this, two approaches have primarily been used: 1) introducing single or multiple point mutations at a target site with known function to identify desired leads, and 2) use of proteomes of interest or known antigenic proteins to predict epitopes. For example, epitope prediction tools such as BepiPred-2.0 are generally developed using known antigens derived from crystal structures of antibody-antigen complexes [50]. With regard to modeling of serum binding to random sequences, Greiff et al. applied multivariate regression to serum antibody binding to a library of 255 random peptides [51]. In that study, serum antibody binding from naïve mice was well modeled by relating peptide composition to binding intensity, though binding of serum antibodies from previously infected mice proved more challenging to model.
The current work focuses on the feasibility of developing comprehensive sequence-binding relationships that describe the infectious disease specific binding of total IgG to our model library of 122,926 peptides, each between 7 and 12 residues in length and composed of 16 of the 20 natural amino acids. While this library is clearly limited in terms of size (only ~10^5 of the trillions of possible sequences), composition (16 of 20 natural amino acids) and context (C-terminus affixed to a silica surface), it is capable of distinguishing immune responses to different infectious agents, as described previously [17,18,21,26]. Neural network-based models were used to build quantitative relationships for sequence-antibody binding using sera from cohorts of individuals who are either uninfected (controls) or infected with one of five infectious agents including three closely related members of the family Flaviviridae (dengue virus, West Nile virus and hepatitis C virus), a more distantly related member of the family Hepadnaviridae (hepatitis B virus) and an extremely complex eukaryotic trypanosome (the agent of Chagas disease, Trypanosoma cruzi). This allowed a thorough evaluation of the model's ability to capture the disease-specific information content of the array binding. This study has shown that it is possible to create accurate sequence-binding models, which not only learn the disease specific information, but also effectively capture the binding information on the arrays for applications in noise suppression and disease classification.

Study design and initial analysis
The serum samples shown in Table 1 were incubated on identical peptide microarrays as described in Methods, and IgG bound to the array peptides was detected via subsequent incubation with a secondary anti-IgG antibody. The peptide sequence 'QPGGFVDVALSG' is present on the array as a set of replicate features (n = 276). This peptide sequence gives a consistently moderate to strong binding value from sample to sample and is used to assess the intra-array spatial uniformity of antibody binding intensities. Median normalized arrays with an intra-array replicate feature coefficient of variation (CV) ≥ 0.3 for this peptide sequence were set aside, as were arrays that showed significant physical defects or overall differences in binding intensity between different regions of the array (collectively these are referred to as "High CV samples"). In all, 20% of the 679 arrays measured were excluded from the initial part of the analysis but considered in the last section, which focuses on using the sequence-binding relationship to remove noise from the arrays. Thus, 542 arrays total were considered "Low CV Samples" in Table 1.
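The quality-control step above can be sketched as follows. This is an illustrative reconstruction, not the authors' pipeline code; the `split_by_cv` helper, its input layout, and the toy data are assumptions, with only the control peptide sequence and the 0.3 CV threshold taken from the text.

```python
import numpy as np

def replicate_cv(values):
    """Coefficient of variation (sample std / mean) of replicate feature intensities."""
    values = np.asarray(values, dtype=float)
    return values.std(ddof=1) / values.mean()

def split_by_cv(arrays, control_peptide="QPGGFVDVALSG", threshold=0.3):
    """Partition arrays into low-CV (kept) and high-CV (set aside) groups based on
    the replicate features of the control peptide, as described in the text.
    `arrays` maps an array name to {peptide sequence: list of replicate intensities}."""
    low, high = [], []
    for name, features in arrays.items():
        cv = replicate_cv(features[control_peptide])
        (low if cv < threshold else high).append(name)
    return low, high
```

In practice the real filter also set aside arrays with visible physical defects or intensity gradients, a step that requires image inspection and is not captured here.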
Comparison of average binding profiles of peptides to serum IgG. Fig 1 shows the cohort average serum IgG binding intensity distributions of the 122,926 unique peptide sequences. The samples were all median normalized prior to averaging each peptide binding value within the cohort. The log10 of the average binding is displayed on the x-axis, as the log distributions are much closer to a normal distribution than are the linear binding values. Sera from individuals infected with HCV, dengue virus or WNV have sharper distributions (smaller full width at half maximum) than the other samples, while sera from individuals infected with HBV show a distribution width similar to those from uninfected donors. Sera from individuals with Chagas disease have a broader binding distribution than the others, with a long tail on the high binding side. Overall, the width of the distribution increases with increasing proteome size. Interestingly, for the viruses with small proteomes, some of the higher binding antibodies are lost compared to uninfected samples. However, it is important to remember that the array peptides have no relationship to the viral proteomes, or indeed any biological proteome, except by chance. Thus, what is lost in strong binding in the small virus samples compared to uninfected samples may well be gained in more specific binding that is not immediately apparent.
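The normalization described above (median normalize each array, then work with log10 values) can be sketched as a minimal numpy function; the matrix layout (rows = samples, columns = peptides) is an assumption for illustration.

```python
import numpy as np

def median_normalize_log10(intensity_matrix):
    """Scale each array (row = sample, column = peptide) so that every sample
    has the same median intensity (1.0 after division), then return log10
    values, which are closer to normally distributed than the linear values."""
    X = np.asarray(intensity_matrix, dtype=float)
    medians = np.median(X, axis=1, keepdims=True)  # per-sample median
    return np.log10(X / medians)
```

Cohort-average distributions such as those in Fig 1 would then be obtained by averaging the normalized values for each peptide within a cohort before histogramming.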

Neural network analysis
The fundamental question of this study is whether it is possible to accurately predict the sequence dependence of the antibody binding associated with an immune response to a given pathogen, both in terms of accurately representing the IgG binding to each peptide sequence in individual serum samples and in terms of the ability of the neural network to capture sequence-dependent differences in IgG binding between samples and cohorts. Towards this end, the low CV samples (Table 1) were analyzed using a feed-forward neural network model [34]. All samples were fit simultaneously such that a single neural network was trained to predict the binding for all samples for any given sequence. The optimized network involved an input layer, three hidden layers with 542 nodes each and an output layer of width 465 corresponding to the number of target samples. The loss function used was the sum of least squares error based on a comparison of the predicted and measured values for the peptides in the sample.
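A forward pass through a network of this shape can be sketched in numpy. The layer widths (three hidden layers, output width equal to the number of samples) come from the text; the one-hot sequence encoding, the ReLU activation, the particular 16-letter alphabet and the random initialization are assumptions for illustration, and the demo below uses toy sizes.

```python
import numpy as np

AA = "ADEFGHKLNPQRSVWY"   # an assumed 16-amino-acid alphabet
MAX_LEN = 12               # array peptides are 7-12 residues long

def encode(seq):
    """One-hot encode a peptide into a fixed MAX_LEN x 16 binary vector;
    positions beyond the peptide's length stay zero."""
    x = np.zeros((MAX_LEN, len(AA)))
    for i, aa in enumerate(seq):
        x[i, AA.index(aa)] = 1.0
    return x.ravel()

def init_network(n_samples, hidden=542, n_layers=3, rng=None):
    """Random weights for input -> hidden x n_layers -> one output per sample."""
    rng = np.random.default_rng(rng)
    sizes = [MAX_LEN * len(AA)] + [hidden] * n_layers + [n_samples]
    return [rng.normal(0, 0.05, (a, b)) for a, b in zip(sizes, sizes[1:])]

def predict(weights, seq):
    """Forward pass: ReLU hidden layers, linear output layer whose columns
    give one predicted (log) binding value per serum sample."""
    h = encode(seq)
    for W in weights[:-1]:
        h = np.maximum(h @ W, 0.0)   # ReLU
    return h @ weights[-1]
```

Training would minimize the sum of squared errors between these outputs and the measured binding values over all sample/peptide pairs, e.g. by gradient descent in an autodiff framework.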
The neural network uses the sequence information to rapidly converge on a solution. Fig 2A shows the rate at which the loss function drops during training. When the correct sequence is paired with its corresponding binding value (blue and red lines, Fig 2A), the value of the loss function drops rapidly and the values for the training set and test set drop in concert; there is almost no overfitting. As a control, the same neural network was used to analyze data in which the order of the peptide sequences was randomized relative to their binding intensities. One would not expect any relationship between sequence and binding under these circumstances. In this case, the loss function value slowly decreases for the training set of peptides and slowly increases for the test set (yellow trace: test, purple trace: train), indicating overfitting to the training set. This implies that the neural network is capable of rapidly converging on a true relationship between the sequences and their binding values in the context of the array peptide library.
The neural network results in a comprehensive binding model applicable across the model sequence space used. Fig 2B shows a density-colored scatter plot [52] comparing the predicted and measured values from the neural network model. In this case, the model was trained on 95% of the peptide sequence-binding pairs, randomly selected, with the remaining 5%, or 6,146 peptide sequences, excluded from training and used for model testing (that is, 6,146 binding values for each of the 542 low CV samples used, or ~3.3 million binding values in the test set). Only the test set values are displayed in Fig 2B. Since the sequences used on the array are nearly random, these sequences should be statistically equivalent to any randomly selected set of sequences from the combinatorial space of possible sequences sampled by the array (peptides of about 10 residues utilizing any of 16 amino acids corresponds to about 10^12 sequences). The Pearson correlation coefficient (R) between the measured and predicted values for the test sequences shown is 0.956. Repeating the training 100 times with randomly selected train and test sets gives an average correlation of 0.962 ± 0.001 (standard error of the mean) and a root mean square average error of 0.103 ± 0.001. The correlation coefficient between measured and predicted binding for the 95% of the sequences used to train the neural network was 0.967 ± 0.001. This implies that there is almost no overfitting associated with the model (the quality of fit between the test and train data is similar), a conclusion also apparent in the loss function data of Fig 2A. S1 Fig shows the correlation coefficient between measured and predicted binding for each individual sample in the test dataset. While some cohorts and some samples were better fit than others, for the vast majority of the samples the correlation coefficients are greater than 0.9.
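The evaluation protocol above (random 95/5 peptide hold-out, Pearson R between measured and predicted values) can be sketched as two small helpers; the function names are illustrative, not from the paper.

```python
import numpy as np

def split_indices(n_peptides, test_frac=0.05, rng=None):
    """Randomly hold out a fraction of the peptide sequences for testing;
    returns (train_indices, test_indices)."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(n_peptides)
    n_test = int(round(test_frac * n_peptides))
    return idx[n_test:], idx[:n_test]

def pearson_r(measured, predicted):
    """Pearson correlation coefficient between measured and predicted binding."""
    m = measured - measured.mean()
    p = predicted - predicted.mean()
    return float((m * p).sum() / np.sqrt((m * m).sum() * (p * p).sum()))
```

Repeating the split/train/evaluate cycle over many random seeds would yield the mean ± standard-error correlations quoted in the text.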
10^3 to 10^4 peptides are sufficient to provide a reasonable description of the entire combinatorial peptide sequence space. Neural network models were trained with different numbers of randomly selected peptides, and binding was predicted for the remaining portion of the peptides. Fig 2C explores the dependence of the overall correlation coefficient between measured and predicted binding values for the test set of each of the sample cohorts as a function of the number of peptides used in the training. When at least 10,000 peptide sequences are used to train the neural network, the correlation coefficient is >0.9 for all cohorts, and the correlation is >0.85 when the model is trained using only 2,000 peptides. This implies that even a very sparse sampling of this sequence space provides a reasonably accurate model of the sequence-binding relationship. The correlation coefficients do continue to increase slowly as a function of training set size. Thus, even though a relatively small set of peptides gives a reasonable overall picture, the predictive power of the relationship continues to improve with more data, and if more peptide sequences were available for training than the 122,926 peptides on the array, further improvement in prediction would be expected.
There are commonalities in the binding of each sample that make simultaneous modeling of all samples as accurate as individual fits. It is possible either to build a single neural network model that fits all samples simultaneously (as done in this work), or to train entirely independent neural network models for each of the samples considered. Fig 3 shows a direct comparison of the measured vs. predicted correlation coefficients of each sample using the simultaneous and individual modeling approaches. In most cases, the simultaneous model is slightly more accurate in terms of the correlation coefficient. This implies that the network learns commonalities between IgG binding from serum across all samples and different cohorts and takes advantage of those commonalities to refine the weights of the 3 hidden layers. The differences between samples are learned in the output layer (the final weight matrix), with separate columns in that layer giving rise to the predicted binding values for each sample. Additionally, simultaneous modeling is significantly faster than fitting each sample dataset separately. For comparison, a simultaneous training required about 5-6 minutes to complete using a single GPU, while the individual modeling required about 10 hours even after optimizing parallel processing. Note that once the network has been trained on sufficient numbers of samples, new samples only require training of the final weight matrix, since the column vectors of this matrix are what differentiates the samples. Alternatively, the neural network can be used to extract information from the sequence/binding relationship for disease classification.
This relationship can either be used to recalculate predicted binding values for the array peptide sequences, forcing the data to always be consistent with the sequences (denoising, red line), or it can be projected onto a completely new set of sequences (an in silico array, orange line), and those projected binding values used in classification or determining the number of significant distinguishing peptides between disease pairs.
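Because the differences between samples reside in the columns of the final weight matrix, those columns can serve as condensed per-sample fingerprints for clustering or classification, as noted in the abstract. A minimal sketch, assuming the columns are simply unit-normalized and compared by cosine similarity (the normalization and similarity measure are illustrative choices, not specified by the paper):

```python
import numpy as np

def sample_fingerprints(output_weights):
    """Each column of the network's final weight matrix corresponds to one
    serum sample; return the columns normalized to unit length so they can
    be compared directly."""
    W = np.asarray(output_weights, dtype=float)
    return W / np.linalg.norm(W, axis=0, keepdims=True)

def cosine_similarity(fingerprints, i, j):
    """Cosine similarity between the fingerprints of samples i and j."""
    return float(fingerprints[:, i] @ fingerprints[:, j])
```

Such fingerprint vectors could be fed to any standard clustering or classification routine in place of the full 122,926-peptide binding profile.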
Values predicted by the neural network result in better ability to distinguish cohorts. In Fig 4C-4E, the number of peptide binding values that are significantly greater in one cohort (on the Y-axis) compared to another (on the X-axis) are shown in each grid. Significance was determined by calculating p-values for each peptide in each comparison using a T-test between cohorts, adjusted for multiple hypothesis comparisons using the Bonferroni correction. Significant peptides are those for which the p-value is less than 1/N (N = 122,926) with >95% confidence. Fig 4C shows comparisons between cohorts using the measured data from the arrays. As one might expect, the sera from donors infected with the Flaviviridae viruses are most similar to one another in terms of numbers of distinguishing peptides. In general, they are more strongly distinguished from HBV (except for WNV) and very strongly distinguished from Chagas donors. If one follows, for example, the top row of Fig 4C for HCV, moving to the right one sees that the numbers increase as more and more genetically dissimilar comparisons are made. West Nile virus is an exception in this regard. While it is more similar to Dengue virus than it is to Chagas, it is most similar, in terms of numbers of distinguishing peptides, to HBV (Fig 4C). Fig 4D is the same as Fig 4C except that in this case, the predicted values from the neural network model are used for the array sequences instead of the measured values. Because the network requires that a common relationship between sequence and binding be maintained for all sequences, it increases the signal-to-noise ratio in the system such that significantly more distinguishing peptides are identified in every comparison. The neural network was run 10 times and the results were averaged.
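The significance counting described above can be sketched as follows. This is an illustrative reconstruction: the Welch form of the t statistic and the normal approximation to its p-value are assumptions (reasonable for the large cohort sizes here; an exact implementation would use `scipy.stats.ttest_ind`), while the one-sided Bonferroni-corrected threshold of alpha / n_peptides follows the text.

```python
import numpy as np
from math import erfc, sqrt

def welch_t(a, b):
    """Per-peptide Welch two-sample t statistic; rows = samples, cols = peptides."""
    ma, mb = a.mean(axis=0), b.mean(axis=0)
    va = a.var(axis=0, ddof=1) / len(a)
    vb = b.var(axis=0, ddof=1) / len(b)
    return (ma - mb) / np.sqrt(va + vb)

def count_significant(a, b, alpha=0.05):
    """Count peptides whose binding is significantly greater in cohort `a`
    than in cohort `b` at a Bonferroni-corrected one-sided threshold of
    alpha / n_peptides.  p-values use a normal approximation to the
    t distribution (adequate for large cohorts)."""
    t = welch_t(a, b)
    p = np.array([0.5 * erfc(x / sqrt(2)) for x in t])  # P(T > t), one-sided
    return int((p < alpha / a.shape[1]).sum())
```

Applying this to every ordered cohort pair would fill one grid of the kind shown in Fig 4C-4E.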
Fig 4E shows results in the same format as panels C and D, but using in silico generated sequences with binding values predicted by the neural network model trained on peptide array binding data. These sequences were produced by taking the amino acids at each residue position in the original sequences and randomizing which peptide they were assigned to (considering the sequences as a matrix with rows representing peptides in the array and columns representing residue positions, the order of amino acids in each column was randomized separately and at the end any spaces due to varying peptide lengths were removed). This created an in silico array with a completely new set of sequences that had the same number, overall amino acid composition and average length as the sequences on the physical array to ensure a consistent comparison. The binding values for each sample were then predicted for this in silico array and those values were used in the cohort comparisons. The number of significant peptides identified using the new sequence set (Fig 4E) are identical to within error for each comparison with the predictions from the actual array peptide sequences used in the training (Fig 4D). Note that the result is the average number of significant peptides from analyzing ten different randomized in silico arrays.
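The column-wise randomization described above can be sketched directly; this is a minimal reconstruction of the described procedure (pad to a rectangle, shuffle each residue-position column independently, drop the gaps), with the padding character and function name as implementation choices.

```python
import numpy as np

def in_silico_sequences(peptides, rng=None):
    """Generate a new sequence set by independently shuffling, at each residue
    position, which peptide an amino acid is assigned to, then dropping the
    gaps left by length differences.  The result preserves the number of
    sequences, the per-position amino acid composition and the average length
    of the original library."""
    rng = np.random.default_rng(rng)
    max_len = max(len(p) for p in peptides)
    # pad to a rectangular character matrix; '-' marks positions past a peptide's end
    grid = np.array([list(p.ljust(max_len, "-")) for p in peptides])
    for col in range(max_len):
        rng.shuffle(grid[:, col])          # shuffle each column separately (in place)
    return ["".join(row).replace("-", "") for row in grid]
```

Predicting binding values for such a set with the trained network yields an "in silico array" that never existed physically but is statistically matched to the real one.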
Another way to understand how well distinguishing information is captured by the neural network model is to compare disease classification performance based on measured values vs. predicted values, where the predicted values are generated from the sequence-binding relationship resulting from the neural network analysis of Fig 2 (note that creation of the sequence-binding relationship is performed without knowledge of which sample belongs to which cohort, so it does not bias the classification). Fig 4B shows the result of applying a multiclass classifier, either to the original measured binding values, the binding values predicted for the array sequences, or binding values predicted for in silico generated sequences. A multiclass classifier was built using a neural network with a single hidden layer with 300 nodes (described in the Methods and in the supplementary information). This will be referred to simply as the "multiclass classifier" to avoid confusion with the neural network used to model the sequence-binding relationship. The multiclass classifier cannot effectively use all peptides for each sample. Peptide feature selection was performed using a peptide-by-peptide T-test between the binding values of each cohort vs. all others. Fifteen features were used per cohort, giving 83-84 features in total for all six cohorts combined due to some overlap between discriminating features found for different cohorts. The training target is a one-hot representation of the sample cohort identity. 80% of the samples were randomly selected and used to train the multiclass classifier and 20% were used as the test set. Both the feature selection and training were performed only on the training samples. Each test sample was then assigned a cohort label based on the largest value in the resulting predicted output vector. The process was repeated 100 times, on 100 randomly selected training and test sets, and overall prediction accuracy determined.
For every cohort, with the possible exception of HCV, classification was improved relative to direct use of the measured array values (blue bars) when using the predicted values. This was true using either predicted values for the array sequences (red bars) or predicted values resulting from projection of the trained network on the randomized in silico array sequences (orange bars).
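The cohort-vs-rest feature selection feeding the multiclass classifier can be sketched as follows; the Welch form of the t statistic and ranking by |t| (rather than by p-value) are illustrative assumptions, while the per-cohort feature count and the pooling that causes overlap come from the text.

```python
import numpy as np

def select_features(X, labels, per_cohort=15):
    """Rank peptides for each cohort by the magnitude of a cohort-vs-rest
    Welch t statistic and pool the top `per_cohort` features across cohorts.
    Duplicates collapse in the pooled set, which is why the paper reports
    83-84 features rather than 6 x 15 = 90.  X: rows = samples, cols = peptides."""
    chosen = set()
    for c in np.unique(labels):
        a, b = X[labels == c], X[labels != c]
        t = (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(
            a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
        chosen.update(np.argsort(-np.abs(t))[:per_cohort].tolist())
    return sorted(chosen)
```

In the full pipeline, selection would be run on the training samples only, and the resulting columns passed to the single-hidden-layer classifier with one-hot cohort targets.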

Understanding the noise reduction properties of neural network modeling
The results presented above show that by using the sequence/binding information to first train a neural network model and then predicting the binding using that model (on the same or a different set of sequences), it is possible to improve the signal-to-noise ratio in the data, at least for the purpose of differentiating between disease cohorts. To understand this in more detail, the effects of noise added to the data were explored.
Gaussian noise is effectively removed by the model. In Fig 5, noise was artificially added to each point in the measured dataset by replacing it with a draw from a random number generator based on a Gaussian distribution centered at the measured value:

log10(noise-added binding value) ~ N(μ, σ)

In the above equation, mu (μ) is the log10 of the median normalized measured binding value. Sigma (σ) was then varied from 0 to 1 to give different levels of added noise, with σ = 1 corresponding to a 10-fold change, up or down, in the width of the linear binding value distribution (due to the log10 scaling). Fig 5A shows the resulting distribution of peptide binding values after adding noise. The peptide binding values were mean normalized across all cohorts and then plotted as a distribution for each cohort (since this is the log10 of the mean normalized value, the distributions are centered at 0). As sigma is increased, the width of the resulting distribution after adding noise increases dramatically. Fig 5B plots the multi-class classification accuracy of each dataset for each sample cohort as a function of sigma (this uses the same multiclass classifier as Fig 4). The classification accuracy of the original measured data with increasing amounts of noise added drops rapidly (dashed lines). Since this is a 6-cohort multi-class classifier, random data would give an average accuracy of ~17%. The measured values with added noise approach that accuracy level at the highest noise. However, by running the data with noise added through the neural network and then using predicted values for the array sequences rather than the measured values, the accuracy changes only slightly for sigma values up to 1. Note that because this analysis is on a log scale, a sigma of 1 corresponds to causing the linear measured binding values to randomly vary over a distribution with a width covering between 10% and 1000% of their original values.
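The noise-addition step above can be sketched directly from the stated definition (Gaussian in log10 space, centered at the measured value); the function name is illustrative.

```python
import numpy as np

def add_log_noise(binding, sigma, rng=None):
    """Replace each median-normalized binding value with a draw from a
    Gaussian centered at mu = log10(value) with standard deviation sigma,
    then return to linear scale.  sigma = 1 lets a linear value wander
    roughly between 10% and 1000% of its original magnitude."""
    rng = np.random.default_rng(rng)
    mu = np.log10(np.asarray(binding, dtype=float))
    return 10 ** rng.normal(mu, sigma)
```

Retraining the sequence-binding network on such noisy data and predicting values for the array sequences is what recovers the classification accuracy shown by the solid lines in Fig 5B.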
To further demonstrate the noise reduction capability of the neural network, S4A Fig shows a scatter plot of the measured data plus noise (sigma = 1) vs. the original measured data, resulting in a correlation coefficient of only 0.356. In contrast, S4B Fig shows a scatter plot of the predicted binding values trained on the measured data plus noise vs. the original measured data, resulting in a correlation coefficient of 0.958, essentially identical to the predictive capability of the neural network when trained on the measured data without noise (Fig 2B).
[Fig 5 caption fragment: the multiclass classifier (as in Fig 4B) applied to the measured binding values plus noise (dashed lines) and to binding values predicted using the measured data plus noise to train the neural network (solid lines) at each value of sigma. Each classification was repeated 100 times (noise at each level was randomly added 10 times and each of these was reclassified 10 times, leaving out 20% of the samples as the test set). https://doi.org/10.1371/journal.pcbi.1010773.g005]

Neural network predictions of array signals improved classification of high CV samples. As described above, 137 samples were not used in the analyses because they either had high CV values calculated from repeated reference sequences across the array or because there were visual artifacts such as scratches or strong overall intensity gradients across the array. A neural network model was trained on all 679 samples in Table 1 (all 542 low CV + 137 high CV) simultaneously. Note that the model does not include any information about what cohort each sample belongs to, so modeling does not introduce a cohort bias. The overall predicted vs. measured scatter plots and correlations are given in Fig 6A and 6B for the low CV and high CV data, respectively. The number of points displayed was randomly selected to be constant between datasets to make the plots comparable. Prediction of the binding values for the high CV data results in more scatter relative to measured values, due to the issues with those particular arrays.
In Fig 6C, the measured and predicted values for the 542 low CV samples were used to train a multiclass classifier, which was then used to predict the cohort class of the high CV samples. Three different data sources were used: 1) the measured array data (blue bars), 2) binding values for the array peptide sequences predicted by the neural network model (red bars) and 3) projected values for in silico generated arrays similar to those used in Fig 4 (orange bars). The classifier was the same as that in Fig 4 and 15 features were selected for each cohort comparison. In every case except the non-disease samples, the use of predicted values resulted in a significantly better classification outcome.

A quantitative relationship between peptide sequences and serum IgG binding
The work described above shows that it is possible to use a relatively simple neural network model to generate a quantitative relationship between amino acid sequence and serum antibody binding over a large amino acid sequence space by training on a very sparse sampling of binding to that sequence space, similar to what was seen previously for isolated proteins binding to the array [34]. Indeed, a reasonably accurate prediction can be obtained with only thousands of sequences (Fig 2C).
The advantage of using neural networks to describe these sequence-antibody binding relationships is that there is no underlying assumption of a specific physical model. This allows the neural network to describe nonlinear interactions between antibodies and the peptide sequences that would be difficult or impossible to anticipate in an explicit physical description. Immune reactions are complex and polyclonal in nature; there is no single antibody or epitope sequence involved, and thus one would not expect a single binding constant or interaction mechanism. While the neural network does not itself provide a physical interpretation of the antibody-sequence binding, it does provide a rapid means of identifying which sequences are likely involved, for example, within the proteome of a pathogen, for further and more detailed investigation.
The model system used here to explore the relationship between antibody molecular recognition profiles and amino acid sequences has limitations. Only 16 of the 20 natural amino acids were used, for technical reasons (see Materials and methods). The sequences are also bound at one end to an array surface, and the other end has a free amine rather than a peptide bond as would be seen in a protein. In addition, the array peptides are short, linear and largely unstructured. This limits the range of molecular recognition interactions that can be observed, and thus the generality of the conclusions, but it also suggests that comprehensive and accurate structure/binding relationships for humoral immune responses should be possible to generate given binding data in a broader sequence context. Such relationships would be invaluable for epitope prediction, autoimmune target characterization, vaccine development, characterizing the effects of therapeutics on immune responses, etc. Even this rather simple model system for sequence space already shows the ability to capture differential binding information between multiple diseases simultaneously, including infectious diseases that involve closely related pathogens (Fig 4).
The fact that one can develop comprehensive sequence/binding relationships within this model sequence space also explains, at least in part, why the immunosignature technology is as effective as it is. Immunosignaturing as applied to diagnostics uses the quantitative profile of IgG binding to a chemically diverse set of peptides on an array, followed by statistical analysis and classification of the resulting binding pattern to distinguish diseases. The approach has been used successfully to discriminate between serum samples from many different diseases [17, 18, 21-23, 25-27, 29, 30] and has been particularly effective with infectious disease [17,18,21,31], as exemplified by the robust ability to classify the immune response to the infectious diseases studied here (Fig 4D). This raises the question: why would antibodies that are generated by the immune system to bind tightly and specifically to pathogens show any specificity of interaction with nearly random peptide sequences on an array? The success of the neural network in comprehensively modeling the sequence/binding interaction provides an answer. The information about disease-specific IgG binding is dispersed broadly across peptide sequence space, even in the interaction with sequences that themselves bind weakly and with low specificity, rather than being focused only on a few epitope sequences. It is not necessary to measure binding to the epitope itself if one has a selection of sequences that are broadly located in the vicinity of the epitope in sequence space.
Note also that by working with sequence/binding relationships, rather than purely statistical comparisons of binding values associated with specific sequences, one can combine information from arrays that contain different peptides. As shown in Fig 2C, when 50% of the array is used to predict the other 50%, the correlation coefficient on average is well over 0.9.

The advantage of analyzing many samples simultaneously
The results of Fig 3 demonstrate that simultaneous neural network analysis of all samples from all cohorts provides an overall description of binding that is slightly more accurate than sample-by-sample analysis, or at worst no less accurate. Conceptually, this suggests that the antibody molecular recognition profiles of the various samples have enough information in common that shared hidden layers, followed by an output layer with a distinct column for each sample, are sufficient to describe both the general and the sample-specific binding interactions. An added practical benefit of this approach is a significant reduction in computation time, as described above.

Using the sequence/binding relationship to suppress noise
In Fig 4, both the number of distinguishing peptides between cohorts and the classification accuracy improved when the measured values for each array sequence were replaced by the corresponding predicted values. Effectively, the neural network aggregates information from the entire peptide dataset into each predicted value. In Fig 5, random noise (sequence-independent variation) was purposely added to the array data. Since the noise is added to the log10 of the binding value, a sigma of 1 corresponds to a 10-fold increase in distribution width, as can be seen in Fig 5A; in other words, a sigma of 1 randomly changes the linear values up or down by roughly an order of magnitude. As a result, multi-class classification of the original data with noise added performs poorly (Fig 5B, dashed lines). However, because the neural network predictions effectively aggregate the combined information from nearly 123,000 sequence/binding values in generating the sequence/binding relationship, random noise is dramatically reduced and a sigma of 1 has very little effect on classification performance (Fig 5B, solid lines). This concept is taken further in Fig 6, where arrays that had been rejected for technical reasons because of excessive noise or physical artifacts affecting part of the array are included in the simultaneous analysis of all samples; their excess noise and defects are effectively repaired by requiring that all binding be strictly sequence dependent. This is done without the network that creates the sequence-binding relationship having any information about which cohort is which. The implication for array-based diagnostic applications is that replacing a purely statistical approach like immunosignaturing with a structure-based approach provides a means of eliminating noise that is unrelated to the binding properties of the sequences while retaining the real patient-to-patient variance.
Note that if this noise reduction approach were used in a diagnostic application, additional samples could be analyzed without retraining the entire model. Only the final layer, which contains all the discriminating information for the new samples, would need to be trained, and this is a linear optimization.
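A minimal sketch of that final-layer update follows, assuming the frozen hidden layers have already produced a 250-value feature vector for each array peptide. All matrices here are randomly generated placeholders standing in for the real hidden-layer outputs and measured binding values; the point is only that fitting one new sample's output column plus bias is ordinary linear least squares.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical frozen hidden-layer output: one 250-value feature vector per
# array peptide, produced by the already-trained shared layers.
n_peptides, n_hidden = 2000, 250
H = rng.normal(size=(n_peptides, n_hidden))

# Synthetic "measured" log10 binding values for one NEW serum sample,
# generated from a known weight vector so the fit can be checked.
true_w = rng.normal(size=n_hidden)
true_b = 0.5
y = H @ true_w + true_b + rng.normal(scale=0.01, size=n_peptides)

# Training only the final layer is linear least squares: append a column of
# ones so the bias term is fit along with the 250 output weights.
A = np.hstack([H, np.ones((n_peptides, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w_fit, b_fit = coef[:-1], coef[-1]
```

No gradient descent over the shared layers is needed; the new sample's 251 parameters come from a single closed-form solve.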

Using the neural network model itself for disease discrimination
As shown in both Figs 4 and 6, predicted binding values for a set of peptide sequences that approximately cover the same model sequence space as the array sequences can discriminate between cohorts of samples just as well as predicted values for the original array sequences. In fact, it is the sequence/binding relationship itself that contains the discriminating information, and it is not necessary to use predicted binding to real sequences at all. In the neural network used here for simultaneous analysis of all samples, the output layer consists of one column per sample. The length of each column equals the width of the last hidden layer (250 values in this case). The 250 values associated with each sample in this output layer, combined with a single bias value appended at the end, contain all of the distinguishing information for that sample and can effectively replace the ~123,000 measured sequence/binding values with only 251 values. Importantly, the neural network that gives rise to the final weight matrix is given no information about which cohort each sample belongs to. The column vectors representing each sample are simply a mathematical representation of the data that combines sequence and binding information in a much more compact form than the original ~123,000 sequence/binding pairs. Table 2 shows that this set of 251-element vectors can be used in place of the predicted binding values and performs just as well in a classification analysis. Fig 7 shows unsupervised clustering using the algorithm UMAP [53,54]. Fig 7A shows the clustering that results from using the ~123,000 measured values as input; Fig 7B shows the clustering that results from using the 251 values of the final weight matrix plus bias for each sample. In each case, UMAP reduced the dimensionality to 2 components. These component values are plotted for each sample, with the cohorts color coded.
When the original measured values are used for clustering (Fig 7A), the uninfected samples and the Chagas samples are well separated from the virus samples, but the virus samples are less well separated from each other. This is reasonable considering the levels of pathogen similarity. However, when the final weight matrix of the neural network is used as the input, one group of samples forms its own distinct cluster (Fig 7B). UMAP is a nonlinear clustering algorithm that looks for the most similar features in samples to determine clustering. Apparently, this cluster of individuals had some other, unknown immunological stimulus in common that distinguished them from all others. The ability to detect such clusters could prove useful in public health bio-surveillance applications.
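The assembly of the 251-element sample vectors used for this clustering can be sketched as follows. The weight matrix and biases here are randomly generated placeholders standing in for the trained final layer; the sketch only shows how each sample's output column and bias are stacked into one compact descriptor.

```python
import numpy as np

rng = np.random.default_rng(2)

n_hidden, n_samples = 250, 679
W_out = rng.normal(size=(n_hidden, n_samples))  # final weight matrix: one column per sample
b_out = rng.normal(size=n_samples)              # one bias value per sample

# Stack each sample's 250 output weights with its bias to form a
# 251-element vector; one row per sample.
sample_vectors = np.vstack([W_out, b_out[None, :]]).T   # shape (679, 251)

# These rows can replace the ~123,000 binding values per sample as input
# to a classifier or to an unsupervised embedding such as UMAP.
```

In practice the rows would be fed to umap-learn (or any other embedding/classification tool) in place of the raw array data.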

Conclusions
While infectious disease diagnostics and associated classification approaches are one possible use-case of the approaches described above, the key conclusions of the work are more general. First, measured binding of serum antibodies to a very sparse and nearly random subset of all possible 10-mer peptides (~10^5 out of ~10^12) is sufficient to capture all the relevant disease-specific information available from the array data (Figs 4, 5 and 6). Second, the representation of IgG binding is statistically comprehensive, as demonstrated by the high correlation (~0.96) between predicted and measured binding for peptide sequences not involved in training the network (Fig 2B). Third, creating a single model to fit all samples simultaneously effectively captures disease-specific information and takes significantly less time (~400 seconds on a single GPU) than fitting each of the 465 samples separately (about 12 hours even with parallel processing) (Fig 3). Fourth, the neural network imposes the requirement that all predicted binding values be consistent with their associated sequences, effectively removing any sequence-independent noise. Finally, the final layer of the neural network provides a condensed but complete representation of the disease-specific information in the array, with one column for each sample consisting of a compact set of 250 values plus a bias term instead of the ~123,000 binding values. Unsupervised clustering using the final weight matrix as input provides excellent separation of samples into their respective cohorts (Fig 7).

Peptide array content
The peptide arrays used were produced locally at ASU (see below). The synthesized wafers were cut into microscope-slide-sized pieces, each slide containing a total of 24 peptide arrays. Each array contained 122,926 unique peptide sequences that were 7-12 amino acids long (average of 10). A 3-amino-acid linker consisting of GSG was attached to each peptide, connecting the C-terminus to the array surface via aminosilane. The peptides were synthesized using 16 of the 20 natural amino acids (A, D, E, F, G, H, K, L, N, P, Q, R, S, V, W, Y) in order to simplify the synthesis process (C and M were excluded due to complications with deprotection and disulfide bond formation; I and T were excluded due to their similarity with V and S, and to decrease the overall synthetic complexity and the number of photolithographic steps required [55]). The arrays were created in 64 photolithographic steps (4 rounds through addition of the 16 amino acids) and sequences were chosen to sample all possible sequences as evenly as the synthesis would allow. A detailed description of the amino acid composition of the arrays and the peptide length distribution was published previously [34] (these arrays are referred to as CIMw189-S9 in that publication).
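For illustration only, a library with these sequence constraints could be sampled as follows. This is a hedged sketch: the real array selected sequences to cover the space as evenly as the synthesis allowed, not by the uniform random draws used here.

```python
import random

random.seed(0)

# The 16 amino acids used on the array (C, I, M and T excluded).
ALPHABET = "ADEFGHKLNPQRSVWY"
LINKER = "GSG"   # C-terminal linker connecting each peptide to the surface

def random_peptide(min_len=7, max_len=12):
    """Draw one quasi-random array peptide (illustrative uniform sampling,
    not the actual space-filling selection used for the real library)."""
    length = random.randint(min_len, max_len)
    return "".join(random.choice(ALPHABET) for _ in range(length)) + LINKER

# A small synthetic library; the real array held 122,926 unique sequences.
library = {random_peptide() for _ in range(1000)}
```

Even at the full array scale of ~10^5 sequences, such a library samples only about one in 10^7 of the ~10^12 possible sequences, which is the sparsity the paper refers to.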

Array synthesis methods
Peptide arrays were synthesized photolithographically using the method described previously [17]. Briefly, an oxide-coated silicon wafer functionalized with a monolayer of aminosilane and coupled to Boc-glycine was prepared as the array substrate. The wafer was coated with photoresist containing a photolabile acid generator. Binding sites were created by photoactivating selected areas of the wafer with UV light through a series of physical photomasks, resulting in patterned deprotection of the Boc-protected amines on the substrate. The photoactivation step was followed by addition of a selected Boc-protected amino acid dissolved in coupling solution, enabling coupling to the acid-activated regions. The process was repeated with a pre-determined sequence of amino acid and photolithographic mask pairs to create the peptide library with the desired sequences.

Serum samples
Deidentified serum samples were collected from three different sources: 1) blood donor samples from Creative Testing Solutions (CTS), Tempe, AZ, 2) LGC SeraCare, Milford, MA, and 3) Arizona State University (ASU) (Table 1). The dengue serotype 4 serum samples came from 2 of the above sources: 30 samples were provided by CTS and 35 samples were purchased by Lawrence Livermore National Labs (LLNL) from SeraCare before being donated to the Center for Innovations in Medicine (CIM) in the Biodesign Institute at ASU. Uninfected/control samples consisted of 200 CTS samples and 18 samples from healthy volunteers at ASU. All deidentified infectious case samples came from CTS. All samples provided by CTS were residual samples collected from blood donors who were asymptomatic at the time of blood donation and were identified as test-reactive for infectious disease markers during blood screening at CTS. At the time of donation, blood donors agreed to the use of their samples in research. Serum samples were frozen shortly after collection and not thawed before being received as aliquots. ASU samples were collected under IRB protocol STUDY00002876: DHS Immunosignaturing-A Platform for Detecting and Identifying Multiple Infectious Diseases-July 2015; these samples were likewise frozen at the time of collection and not thawed before being received as aliquots. Further sample description and in-house validation of disease state are described in the supplementary materials.

Sample processing and serum IgG binding measurements
Serum from the 6 sample cohorts (5 disease cohorts and uninfected) was diluted (1:1) in glycerol and stored at -20˚C. Before incubation, each serum sample was prepared as a 1:625 dilution in 625 μL of incubation buffer (phosphate buffered saline with 0.05% Tween 20, pH 7.2). The slides, each containing 24 separate peptide arrays, were loaded into an ArrayIt microarray cassette (ArrayIt, San Mateo, CA). Then, 20 μL of the diluted serum (1:625) was added to a Whatman 903T Protein Saver Card. From the center (12 mm circle) of the protein card, a 6 mm circle was punched, placed on top of each well in the cassette, and covered with an adhesive plate seal (3M, catalogue number 55003076). The diluted serum samples were incubated on the arrays for 90 minutes at 37˚C with rotation at 6 RPM in an Agilent rotary incubator. The arrays were then washed 3 times in distilled water and dried under nitrogen. A goat anti-human IgG (H+L) secondary antibody conjugated with either AlexaFluor 555 (Life Technologies) or AlexaFluor 647 (Life Technologies) was prepared in 1x PBST, pH 7.2, to a final concentration of 4 nM. Following incubation with the primary (serum) antibodies, the secondary antibody was added to the arrays, sealed with a 3M cover and incubated at 37˚C for 1 hour. The slides were then washed 3 times with PBST (137 mM NaCl, 2.7 mM KCl, 10 mM Na2HPO4, 1.8 mM KH2PO4, 0.1% Tween 20 (w/v)), followed by distilled water, removed from the cassette, sprayed with isopropanol, centrifuged, dried under nitrogen, and scanned at 0.5 μm resolution in an Innopsys InnoScan 910 laser scanner (Innopsys, Carbonne, France) with excitation at 547 nm and emission at 590 nm. Each image was analyzed (GenePix Pro 6.0, Molecular Devices, San Jose, CA) and the raw fluorescence intensity data were exported as a GenePix Results ('gpr') file.

Binding analysis using neural networks
The neural network used to relate the peptide sequences on the array to the measured binding of total serum IgG is similar to that described previously [34]. The amino acid sequences are input as one-hot representations and fed into a fully connected, feed-forward neural network (3 hidden layers, 250 nodes each), with each hidden layer followed by a rectified linear unit activation function. This is followed by a linear final output layer (whose width equals the number of samples) used to predict total serum IgG binding. The neural network is trained on the peptide sequence/binding value pairs by optimizing an L2 loss function (sum of squared errors) between the measured and predicted binding values. Model performance is assessed by calculating the Pearson correlation coefficient and the RMSE between the measured and predicted binding values for a test dataset not involved in the training. Except where otherwise stated, the neural networks used in this work were trained on all samples simultaneously, with all weights of the neural network shared across samples except those in the final layer. The neural network was trained on the log10 of the median-normalized binding values from the peptide array (normalized by the binding values of all peptides in a given sample). Any zeros in the dataset were replaced by 0.03x the median prior to taking the logarithm. A diagram and description are given in the supplementary information (S2 Fig).
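A minimal numpy sketch of this architecture (one-hot input, three ReLU hidden layers of 250 nodes, and a linear per-sample output layer) is shown below. The weights are randomly initialized and untrained, the maximum sequence length and sample count are placeholders, and the sketch illustrates only the encoding and forward-pass shapes, not the authors' implementation or training loop.

```python
import numpy as np

rng = np.random.default_rng(3)

AA = "ADEFGHKLNPQRSVWY"          # 16-letter alphabet used on the array
MAX_LEN = 15                     # placeholder: up to 12 residues plus the GSG linker

def one_hot(seq):
    """Encode a peptide as a flattened (MAX_LEN x 16) one-hot vector,
    zero-padded out to MAX_LEN positions."""
    x = np.zeros((MAX_LEN, len(AA)))
    for i, aa in enumerate(seq):
        x[i, AA.index(aa)] = 1.0
    return x.ravel()

def forward(x, weights, biases):
    """Three ReLU hidden layers, then a linear output layer with one
    column per serum sample (predicted log10 binding for each sample)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)       # ReLU hidden layer
    return h @ weights[-1] + biases[-1]      # linear output layer

# Random weights, just to show the shapes; training would minimize the L2
# loss between measured and predicted log10 binding values.
n_samples = 6
sizes = [MAX_LEN * len(AA), 250, 250, 250, n_samples]
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

pred = forward(one_hot("ADHKWGSG"), weights, biases)
```

One forward pass maps a single peptide to one predicted binding value per sample, which is how a shared network can model all samples simultaneously.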

Classification algorithm
A second neural network was used as a multiclass classifier. The input to the network was a subset of binding values for each sample; in some cases these were measured values and in other cases predicted values. In Table 2, the columns of the final output weight matrix of the sequence-binding neural network described above were used to represent each sample. In each case, feature selection was performed with one vs. all T-tests: the binding values of each peptide in a cohort were compared to the binding values in all the other cohorts, and 15 peptides were selected from each of the six one vs. all comparisons. These binding values (typically 83-84 in total because of some duplication between comparisons) were fed into the input layer of a neural network with 1 fully connected hidden layer of 300 nodes followed by a rectified linear unit activation function. This was followed by a linear output layer of size 300 x 6. The target values were one-hot vectors of length 6 representing the cohort each sample was associated with. A diagram and description are given in the supplementary information.

[S4 Fig caption: (A) Scatter plot of the log10 of the measured data values with sigma = 1 Gaussian noise added vs. the log10 of the measured data values without added noise. (B) In this scatter plot, the Y-axis is the log10 of the binding values predicted from a sequence-binding neural network model generated using the measured data values with sigma = 1 Gaussian noise added, and the X-axis is again the log10 of the measured values without noise. The function dscatter was used to generate a density-colored scatter plot (see the description of Fig 2B, ref. 52 in the text). The neural network sequence-binding relationship effectively removed the sequence-independent noise that was added to the data. (TIF)]
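The one vs. all feature selection step described above can be sketched as follows. The data here are synthetic (a random matrix with a few deliberately informative peptides planted in one cohort), and scipy's `ttest_ind` stands in for whatever T-test implementation was actually used.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(4)

# Synthetic log10 binding matrix: 60 samples x 500 peptides, 6 cohorts of 10.
n_per, n_pep = 10, 500
labels = np.repeat(np.arange(6), n_per)
X = rng.normal(size=(60, n_pep))
X[labels == 0, :15] += 2.0       # plant 15 informative peptides for cohort 0

def one_vs_all_features(X, labels, cohort, k=15):
    """Return indices of the k peptides with the smallest one-vs-all
    T-test p-values for the given cohort."""
    in_c = X[labels == cohort]
    out_c = X[labels != cohort]
    res = ttest_ind(in_c, out_c, axis=0)
    return np.argsort(res.pvalue)[:k]

# Union of the six one vs. all selections; duplicates between comparisons
# are merged, mirroring the "typically 83-84" features noted in the text.
selected = np.unique(np.concatenate(
    [one_vs_all_features(X, labels, c) for c in range(6)]))
```

The `selected` indices would then be used to slice each sample's binding vector before feeding it to the small classifier network.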