Maximum-Entropy Models of Sequenced Immune Repertoires Predict Antigen-Antibody Affinity

The immune system has developed a number of distinct complex mechanisms to shape and control the antibody repertoire. One of these mechanisms, the affinity maturation process, works in an evolutionary-like fashion: after binding to a foreign molecule, the antibody-producing B-cells exhibit a high-frequency mutation rate in the genome region that codes for the antibody active site. Eventually, cells that produce antibodies with higher affinity for their cognate antigen are selected and clonally expanded. Here, we propose a new statistical approach based on maximum entropy modeling in which a scoring function related to the binding affinity of antibodies against a specific antigen is inferred from a sample of sequences of the immune repertoire of an individual. We use our inference strategy to infer a statistical model on a data set obtained by sequencing a fairly large portion of the immune repertoire of an HIV-1 infected patient. The Pearson correlation coefficient between our scoring function and the IC50 neutralization titer measured on 30 different antibodies of known sequence is as high as 0.77 (p-value 10−6), outperforming other sequence- and structure-based models.


Preliminary deep sequencing data analysis
In the present Section, we outline the bioinformatic pipeline performed to extract from the raw deep sequencing data, consisting in nucleotide (nt) reads, a set of productive amino acid sequences provided with the putative IGHV and IGHJ germline genes of origin.
The raw data set contains 697079 nt reads. First the sequencing bar code has been cut away from the raw sequences. Primers (and their reverse complement) have been identified in sequences within two sequencing errors; Sequences presenting at least one primer have been maintained in the set and 3'5' sequences have been reverted and complemented. PCR bias due to different responses of the primers are expected to affect the relative abundance of sequences; Anyway, no null experiment has been performed and so there is no way to correct for these systematic errors. No further sequencing error correction scheme has been applied to nts sequences.
The resulting set of nts sequences has been analyzed with IgBLAST [8]; only sequences identified by the software as productive (i.e. that do not present a stop codon and for which the V and J genes are in frame) are kept and their amino acid (aa) translation has been considered in this work together with the first ranked IGHV and IGHJ putative germline genes of origin. Sequences presenting sequencing errors resulting in insertions and deletions whose net length is not a multiple of three are identified by IgBLAST as non-productive (V-J frame: out of frame) and they are thus automatically eliminated by this procedure.

Multiple sequence alignment
We first aligned our dataset according to the Kabat-Chothia numbering scheme using a modified version of the antibody-specific HMMs previously developed by us [3,5]. The first modification, following the IMGT [6] and AHo [4] numbering schemes, was to place the H3 insertions symmetrically in the central position between residue 94 and 101, thus obtaining a better alignment of the H3 regions neighboring the loop stems.
As shown in Figure S1, the length of the H3 loop of antibodies in the hypermutated cluster varies between 7 and 26 residues, with an average length of 15.35 residues (15.25 when only considering unique sequences). There are two main peaks in the distribution, corresponding to loops of length 14 and 17. H3 length is calculated as the number of residues between position 95 and 102 included according to the Chotia numbering scheme.
The variability in the length distribution of the sequences in the hypermutated cluster, together with the observed sequence diversity, can be attributed to the presence of multiple clones and of intraclonal diversity.

Clustering analysis
The set of sequences with IGHV1-2 and IGHJ2 as inferred germline gene of origin display the presence of two clearly separated clusters. This number of clusters has been confirmed by a stability analysis of the k-means algorithm.
To confirm the result, the set of sequences has been analyzed with the clustering algorithm proposed in [1]. This algorithm is an improved hierarchical clustering method based on the optimization of a cost function over trees of limited depth. The clustering utilizes the cavity method to find a bounded depth D spanning tree on a graph G(V, E) where V is the set of n vertices identified with the data points plus one additional root node and E is the set of edges with weights given by the dissimilarity matrix (the Hamming distance between sequences in our case) and by a unique distance λ from the root node.
The external parameter λ can be used to fix the number of clusters (for λ = 0 each point is a cluster, for large enough λ all points belong to a single cluster) and to check for the structural stability of the clustering itself. Indeed, if a small increase of λ induces large decrease in the number of clusters, we take this as an indication that the structure of the clustering is weak. Robust results can be detected when for large intervals of λ values, the clusters do not change. A typical fingerprint of a robust clustering is shown in Fig. S2, where for values of λ ∈ (50, 100), a stable bipartition of our data set is observed.
A direct inspection of the two clusters shows that one of the two clusters is on average more similar to the germline genes of origin (and thus named germline cluster ), while the second cluster results to be more similar to the broadly neutralizing antibody VRC-PG04 (and named hypermutated cluster ). The consensus sequences of the two clusters are displayed in Fig. S3 aligned with the relative germline sequences.

recombinant HIV-1 virus, in
In Figure S4 we  In Figure S7 we reported the comparison between the performance of the MGM score and the score computed with pseudo-likelihood algorithm (plmDCA). The plmDCA score is less effective in reproducing affinity measurement then the MGM score but still has a better performance than the factorized MGM score and the hmm-score, as expected.

Internal contacts
The gaussian DCA procedure [2] aimed at the internal contacts prediction does not give accurate results with the Rep-Seq data set under inspection. In the Main Text, the contact prediction has been performed taking as input the set of sequences belonging to the hypermutated cluster. Here we report the result of the contact inference over the whole Rep-Seq data set. Even if the enlargement of the set of sequence slightly increases the performances, the predictive power is still unsatisfactory as shown in Figure S8.

Ab-antigen interactions
In this Section, we present the supplementary results referred to a reduction of the horizontal size of the MSA (number of aligned residues in the sequences). By progressively eliminating columns in the MSA following their decreasing entropy (i.e. variability) rank some structural information on the Ab-antigen interaction mode is recovered.
In fact, as columns columns are removed starting from the more variable to the more constant, the reduction of the predictive power is fast and, in particular, the decreasing is more pronounced in correspondence of the deletion of a few residues that are highlighted   Figure S8: Contact prediction true positive rate of Mutual Information (pink) and Direct Information (light blue) analysis performed on the whole set of productive sequences. The reweighting parameter (see [2]) as been set to 0.01 in this case. Two residues are considered to be in contact if at least a couple of atom is at distance lower than 8Å.
with arrows in Figure S9. As discussed in the Main Text (Section 3.1 and Table 2), these residues are often observed to play a structural role in the antigen recognition. This kind of analysis shows how structural information on the antigen recognition mode can in principle be retrieved in cases in which the tridimensional structure of the Ab-antigen complex is lacking but Rep-Seq data and affinity measurements are available.  Figure S9: Iterative pruning of the most variable columns of the MSA is displayed. In both pictures, the correlation coefficient between the inferred MGM energy and the minimum neutralization titers against the tested virus isolates is displayed as the number of residues in the alignment is progressively reduced by eliminating columns according to decreasing entropy rank. The leftmost data point refers to the entire alignment. Residues whose deletion causes a remarkable decreasing of the correlation are highlighted.