Learning the statistics and landscape of somatic mutation-induced insertions and deletions in antibodies

Affinity maturation is crucial for improving the binding affinity of antibodies to antigens. This process is mainly driven by point substitutions caused by somatic hypermutations of the immunoglobulin gene. It also includes deletions and insertions of genomic material known as indels. While the landscape of point substitutions has been extensively studied, a detailed statistical description of indels is still lacking. Here we present a probabilistic inference tool to learn the statistics of indels from repertoire sequencing data, which overcomes the pitfalls and biases of standard annotation methods. The model includes antibody-specific maturation ages to account for variable mutational loads in the repertoire. After validation on synthetic data, we applied our tool to a large dataset of human immunoglobulin heavy chains. The inferred model allows us to identify universal statistical features of indels in heavy chains. We report distinct insertion and deletion hotspots, and show that the distribution of lengths of indels follows a geometric distribution, which puts constraints on future mechanistic models of the hypermutation process.


Introduction
The extraordinary ability of the immune system to detect and neutralize pathogens is partly assured by B cells, with the production of huge amounts of diverse immunoglobulins (Ig). Their initial diversity, provided by stochastic V(D)J recombination [1][2][3][4][5][6][7][8], is further increased by affinity maturation, a mutation-selection Darwinian process that takes place primarily in germinal centers (as well as extrafollicularly prior to germinal center formation [9]), where cells with the highest affinity to the pathogen compete for survival and proliferation [10][11][12].
During affinity maturation, cells diversify their immunoglobulin receptors through somatic hypermutations (SHM), which are initiated by the activation-induced cytidine deaminase (AID) enzyme [13] targeting immunoglobulin-coding genes. Its main effect is to cause point substitutions at a much higher rate than classical somatic mutations, namely ∼10⁻³ substitutions per base pair per cell division [14,15]. Since affinity maturation implies many proliferation and selection rounds, this process can lead to Ig with up to 30-40% of amino-acid substitutions with respect to the initial V(D)J rearrangement. Statistics of SHM have been extensively studied [5,7,[16][17][18][19][20][21][22], mainly thanks to high-throughput repertoire sequencing [2,3,6,8], although the link to the molecular details of AID functioning remains elusive.
But point mutations are not the only modifications found in matured Ig receptors. Deletions and insertions of one or multiple base pairs, collectively referred to as indels, have been detected in the V and J segments, albeit at a much lower rate than substitutions. Indels have been reported in both heavy and light Ig chains [23][24][25][26][27][28][29][30][31][32], with a frequency that grows with the substitution rate during affinity maturation [23,24,28,29]. Indels occur non-uniformly along the V germline templates, with a preference towards complementarity-determining regions (CDRs) with respect to framework regions (FWRs); these 'hotspots' often coincide with those for point mutations [24,[32][33][34].
Indels are known to be critical for the increase of antibody repertoire diversity, enhancing the immune response in the presence of viral and bacterial pathogens such as HIV-1 and influenza [35][36][37][38][39] and in response to vaccination [32]. Several works have shown a beneficial impact of indels for binding affinity and neutralization activity in vitro and in vivo [32,40,41], opening new evolutionary pathways beyond point mutations. A particularly striking example is given by anti-HIV-1 broadly-neutralizing antibodies (bnAbs), many of which have a high indel load [42][43][44]. In particular, bnAb CH31 was shown to owe its neutralization breadth to a particularly long insertion in its lineage [39].
Yet, little is known about the biological mechanisms behind the occurrence of indels. Observations show that inserted segments have a high homology with their flanking regions (either on the 3′ or 5′ side) [29,[31][32][33], possibly suggesting duplications rather than non-templated, random insertions. A proposed mechanism [45,46] involves template regions with repeats and palindromic sequences, where the polymerase can "slip" during replication and "jump" to a nearby homologous element on the templated strand, producing a loop in either the daughter or the templated strand. If this unpaired loop is not properly repaired by the specific DNA repair mechanisms, it will be propagated as an insertion or a deletion, depending on whether it lies on the daughter or the templated strand, respectively. This mechanism would also explain the colocalization of indels and SHM, since they could likely occur independently in the same susceptible regions (hotspots due to repeats and palindromes), or also as a product of the repair mechanisms targeting existing replication errors.
Deep high-throughput sequencing of Ig repertoires now allows us to gain insight into the mechanisms of indels, and to collect statistics about their frequency, length distributions, and positional preferences. By contrast, previous studies [33,34] relied on relatively small datasets, and therefore on small numbers of indel events due to their rarity. In addition, most studies have focused on productive sequences, for which it is difficult to decouple intrinsic preferences from selection effects. Here, we exploit a recently published large immunoglobulin heavy chain (IgH) repertoire dataset comprising sequences from 9 healthy donors [8]. We developed an inference methodology to provide robust indel statistics and to capture their universal features in the face of natural individual variability.
Our approach uses a probabilistic framework to overcome the issue of annotation errors. Classically, indels are identified by looking at the best-scoring annotation. Such deterministic annotation is sometimes wrong, and we will show that this leads to systematic biases. Our probabilistic method, which considers multiple indel and point-mutation scenarios weighted by their probabilities, allows us to remove these biases, in a similar way to what was done for inferring the statistics of VDJ recombination [5,7,47]. A second key ingredient of our approach is to assume that each sequence may have a different mutation rate, since repertoire data combines B cells from various stages of affinity maturation. We first tested and validated our method on synthetically produced data. We then applied it to Ig sequences with a frameshift in their CDR3 region, which are believed to be nonproductive and thus free of selection biases. This allows us to characterize in great detail the statistics and intrinsic preferences of indel generation.

Insertion and deletion events scale with the mutation rate
We conduct our analysis on a recently published high-throughput IgH repertoire dataset, obtained from the blood of 9 healthy donors [8]. IgM and IgG expressing cells were isolated and analyzed separately. Using raw sequence data, we further segregated IgH sequences into productive (P) and nonproductive (NP) sequences, depending on whether their CDR3 was in frame (P) or carried a frameshift (NP). Since hypermutation indels are rare (see below) compared to V(D)J recombination frameshifts, we assumed that nonproductive sequences were already faulty at generation. Cells with NP sequences owe their survival to a successful IgH rearrangement on the second chromosome, meaning that the NP sequences themselves are free of selection effects. Most productive sequences undergo various selection processes, which bias the statistics of their features [5]. Here we mainly restrict our analysis to nonproductive sequences to capture the biochemistry of the hypermutation machinery. We obtained 421,185 IgG and 459,165 IgM nonproductive sequences from 9 donors.
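As a minimal illustration of the P/NP segregation criterion, the in-frame check on the CDR3 nucleotide sequence can be sketched as follows (the function name and the additional stop-codon criterion are ours, not the paper's; the actual preprocessing is described in the Methods):

```python
def is_productive(cdr3_nt: str) -> bool:
    """A CDR3 nucleotide segment is a candidate productive junction when
    it is in frame (length a multiple of 3) and contains no stop codon."""
    if len(cdr3_nt) % 3 != 0:
        return False
    stop_codons = {"TAA", "TAG", "TGA"}
    codons = (cdr3_nt[i:i + 3] for i in range(0, len(cdr3_nt), 3))
    return not any(codon in stop_codons for codon in codons)
```

A frameshifted CDR3 (length not a multiple of 3) is thus directly flagged as nonproductive, regardless of its codon content.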
To get preliminary insights into indel statistics, we first annotated each sequence deterministically using IgBLAST [48]. We estimated the rates per base pair of SHM point mutations, deletions and insertions, in both IgM and IgG compartments, separately for each individual and each of the productive and nonproductive subsets (Fig 1A). Mutation rates for productive sequences are consistent with previous estimates [33], with more mutations in IgG than in IgM. Some individual variability is present, suggesting variations in the degree of antigen exposure across individuals. Nonproductive sequences have higher rates of SHM than productive ones, especially for insertions and deletions (∼10-fold difference for IgM and a bit less for IgG), suggesting that mutations are mainly deleterious. Fig 1B shows the distribution of the number of indels in the IgG subset across donors. A Poisson fit agrees well with the bulk of the data, but underestimates the number of sequences with many indels. Point mutations and indels both accumulate during affinity maturation and are mediated by AID [24,28,29,[32][33][34]. Indels are mostly present in highly point-mutated sequences. Both types of mutations have been reported to co-localize along the sequence, preferably in the CDRs [33,34]. They have also been shown to co-localize temporally, as deduced from the analysis of antibody phylogenetic trees [32,39]. Consistent with this picture, we observe that the average point-mutation rate per base pair in IgG grows linearly with the number of indel events (Fig 1C).

Fig 1. (A) Point-mutation (SHM), deletion (del) and insertion (ins) absolute rates per base pair for the 9 Briney donors [8]. Segregation into productive (P) and nonproductive (NP) sequences coded by colors; data points from IgM sequences are also reported for comparison. (B) Fraction of IgG sequences (productive and nonproductive) with exactly n indels, either separated by type (del or ins) or cumulated; mean and standard deviation over the 9 donors represented by the markers and bars, respectively. The dashed black line represents the closest Poisson trend for cumulated indels of both types. (C) Fraction of SHM (in nucleotides) along IgG sequences (productive and nonproductive) conditioned on having exactly n indels, again either cumulated or separated by type; averages and one-standard-deviation bars over the 9 donors. (D) Length profiles for deletions, after NP-P segregation. Average and standard deviation over the 9 donors represented by the solid line and the shaded area around it, respectively; the corresponding average lengths are represented by the two vertical dashed lines. (E) Same as (D), but for insertions. (F) Overlap between inserted base pairs and same-length flanking regions on either the 3′ or 5′ side along the sequence (the larger is considered); average and one standard deviation over the 9 donors given by the solid line and shaded area around it, respectively. For comparison, the overlap between random insertions and flanking regions is also reported (details in the main text). (G) Distribution of distances between deletions along the same sequence, measured as the number of base pairs in between, for IgG sequences (productive and nonproductive) hosting exactly two deletions and no insertions (data pooled from the 9 donors); the inset shows the full distribution, while the main plot focuses on the short-distance region.
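The Poisson comparison can be reproduced on any set of per-sequence indel counts. The sketch below uses hypothetical counts and shows how an excess of high-indel sequences over the maximum-likelihood Poisson fit would be detected:

```python
import math

# Hypothetical per-sequence indel counts (in real data, obtained from
# the annotation of each sequence).
counts = [0]*9000 + [1]*800 + [2]*150 + [3]*40 + [4]*10

# Maximum-likelihood Poisson fit: the rate equals the empirical mean.
lam = sum(counts) / len(counts)

def poisson_pmf(n, lam):
    return math.exp(-lam) * lam**n / math.factorial(n)

# An excess of empirical mass at large n over the Poisson tail signals
# a super-Poissonian (overdispersed) distribution of indel numbers.
emp_tail = sum(c >= 3 for c in counts) / len(counts)
poi_tail = 1.0 - sum(poisson_pmf(n, lam) for n in range(3))
```

With these illustrative counts the empirical tail P(n ≥ 3) exceeds the fitted Poisson tail by more than an order of magnitude, mirroring the underestimation visible in Fig 1B.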

Length and composition of inserted and deleted segments
Deletion and insertion length profiles, which show the distributions of the number of deleted or inserted base pairs, are shown in Fig 1D and 1E. Signatures of selection can be seen through an increased preference for multiples of 3 in productive versus nonproductive sequences, suggesting that indels that induce a frameshift are deleterious. The effect is only present at short deletion lengths, but holds even for long insertion lengths. Note that a weaker 3-fold periodicity is also present in nonproductive sequences, despite them being in principle free of selection effects. This could come from sequences that were previously productive but were frameshifted relatively recently by an indel in the CDR3. Long insertions are favored in productive sequences in comparison to long deletions, probably because of stability and functionality constraints [33]. By contrast, in nonproductive sequences deletions and insertions follow similar statistics, suggesting a common mechanism. Both insertions and deletions are approximately geometrically distributed, at least in the tail that describes large inserted or deleted segments. This geometric law puts strong constraints on the type of biophysical mechanisms by which indels may be created.
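A geometric law on lengths ℓ = 1, 2, … has maximum-likelihood parameter p̂ = 1/⟨ℓ⟩, so fitting it reduces to one line; a sketch with hypothetical length data:

```python
# Hypothetical deleted-segment lengths (bp); counts roughly halve with
# each extra base pair, as expected for a geometric law with p ≈ 0.5.
lengths = [1]*500 + [2]*250 + [3]*125 + [4]*60 + [5]*30 + [6]*15

# For P(l) = p*(1-p)**(l-1) on l = 1, 2, ..., the maximum-likelihood
# estimate of p is the inverse of the mean observed length.
mean_len = sum(lengths) / len(lengths)
p_hat = 1.0 / mean_len

# Expected decay factor between successive length frequencies; on a
# log scale the fitted profile is a straight line with slope log(decay).
decay = 1.0 - p_hat
```

In practice one would fit only the tail of the distribution, since the paper reports that the geometric behavior holds at least for large inserted or deleted segments.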
A proposed mechanism for indels is the polymerase slippage model [45,46]. In germline regions that have repeats and palindromic sequences, the polymerase can "slip" during replication and "jump" to a nearby homologous element on the templated strand, resulting in an insertion or deletion depending on whether the unpaired loop forms on the daughter or the templated strand. In this scenario, insertions would be the result of duplications, and would thus be expected to be homologous to their flanking regions. The analysis of inserted segments in nonproductive sequences does reveal a higher than expected overlap with the better matching of the two immediately flanking sequences (Fig 1F). Previous similar observations were made on small numbers of productive sequences [33], for which effects of selection may confound the effect. Note that the overlap is substantially different from 1 (∼0.8), suggesting 20% additional errors in the duplication, which is much higher than the mean point-substitution rate per base pair (around 5-6%). Also, we find a weak preference for the 5′ end as a duplication source (S1 Fig). However, we found no significant overlap between deleted segments and their flanking regions (S2 Fig). This suggests either that the slip-and-jump mechanism does not favor homologous regions for deletions, or that other mechanisms are responsible for deletions, such as double-strand breaks [34,49].
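The overlap statistic of Fig 1F can be illustrated as follows: for a known insertion position and length, compare the inserted segment with the better matching of its two same-length flanks. This is a simplified sketch (names are ours); a real analysis must also handle sequence ends and annotation uncertainty:

```python
def flank_overlap(seq: str, ins_start: int, ins_len: int) -> float:
    """Sequence identity between an inserted segment and the better
    matching of its two same-length flanking regions (5' and 3').
    Returns 0.0 when a flank is truncated by the sequence end."""
    ins = seq[ins_start:ins_start + ins_len]
    left = seq[max(0, ins_start - ins_len):ins_start]
    right = seq[ins_start + ins_len:ins_start + 2 * ins_len]

    def identity(a: str, b: str) -> float:
        if len(a) != len(b) or not a:
            return 0.0
        return sum(x == y for x, y in zip(a, b)) / len(a)

    return max(identity(ins, left), identity(ins, right))
```

An exact duplication of a flank gives an overlap of 1; the ∼0.8 average reported above corresponds to duplications carrying roughly 20% mismatches.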

Indels co-localize with each other and with point mutations even in absence of selection
Since point mutations tend to co-localize in hotspot regions, and indels are associated with point mutations, we wondered if indels co-localized as well. Fig 1G shows the distribution of distances between two consecutive deletion events along the same sequence. To rule out possible confounding factors, we focused on sequences with exactly two deletion events and no insertions. Since such sequences are rare, we pooled the IgG repertoires (P and NP) of all individuals for this analysis. The full distribution (inset) is highly non-uniform, consistent with the observation that indels concentrate on CDRs [33,34]. In addition, we observe a striking depletion of deletion pairs separated by less than 3 base pairs. This depletion is an artifact of the alignment software, which merges deletion events if they are too close to each other. This effect implies that the indels involving ∼50-60 base pairs reported in Fig 1D and 1E may result from such mergers, which standard alignment tools cannot disentangle. Our probabilistic framework will allow us to address this issue in the following sections.
We then studied the positional preference of indels along the V segment. To establish a universal profile across all V genes, we used the gapped absolute indexing provided by IMGT [50], and pooled the sequences of all 9 donors. The resulting profiles for nonproductive and productive sequences are shown in Fig 2A and 2B. Due to alignment gaps and finite read lengths, many IMGT positions are only observed in a fraction of sequences (black line). The indel profiles are highly non-uniform, with generally higher rates in the CDRs than in the framework regions (FWRs), with the exception of a deletion hotspot at the end of FWR3. We note that these trends, which are consistent with previous reports on productive sequences [33,34], are also found in nonproductive sequences, suggesting that they cannot be attributed to selective or functional effects. The different mutation profiles (point mutations, insertions, deletions, productive versus nonproductive) are correlated (Fig 2C-2E), meaning that indel and point-mutation hotspots are largely shared. Deletion profiles are well correlated with both point-mutation and insertion profiles. These correlations are larger than those between insertions and point mutations, suggesting a possibly different mechanism for the two types of indels. In addition, and consistent with this picture, deletions and point mutations show an enrichment of co-localization (within a few base pairs) relative to what would be expected from their shared positional preferences, while insertions and point mutations do not (S3 Fig). These observations are consistent with the leading model that deletions are caused by double-strand breaks [34,49].
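The profile correlations of Fig 2C-2E amount to Pearson coefficients computed over sufficiently covered IMGT positions; a minimal sketch (the function name and coverage threshold are illustrative choices, not the paper's exact procedure):

```python
import math

def profile_correlation(prof_a, prof_b, coverage, min_cov=0.5):
    """Pearson correlation between two positional mutation profiles,
    restricted to positions observed in at least min_cov of the reads
    (mirroring the coverage mask, black line in Fig 2A and 2B)."""
    pairs = [(a, b) for a, b, c in zip(prof_a, prof_b, coverage)
             if c >= min_cov]
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    n = len(pairs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Masking poorly covered positions prevents noisy, rarely observed IMGT indices from dominating the correlation estimate.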
Profiles in productive versus nonproductive sets are very similar, suggesting that the positional effect of functional selection on indel and point-mutation preferences is weak, and that most of the previously reported biases in indel positions [33] are due to the positional preferences of the involved enzymes rather than to selection.

A probabilistic alignment algorithm
While deterministic annotations give us a good picture of indel statistics, they may introduce systematic biases that could confound the analysis, such as the spurious gap in the inter-deletion interval distribution (Fig 1G) or the over-estimation of long indel events. To give just one example of possible mistakes by deterministic annotation algorithms, a long indel observed in a sequence may in fact have been caused by several shorter indel events at the same site or close by. To evaluate the probability of the end product, one should consider all possible such decompositions into two or more indel events, and sum their probabilities. To address this issue, we designed a probabilistic approach to annotate and analyze hypermutated V segments, similar to what was previously proposed to infer the statistics of V(D)J recombination from B- and T-cell receptor repertoires [5,47] and implemented in the IGoR tool [7].
We first define a generative model of point mutations, insertions and deletions, with unknown parameters that describe the main statistics of the mutation process: distribution of SHM rates across sequences, mean ratios of deletion and insertion rates to the point mutation rate, and insertion and deletion length profiles. These parameters are then learned from the data using a maximum-likelihood estimator, which is implemented in practice through an expectation-maximization algorithm. In this framework, the likelihood of a particular outcome of the SHM process is obtained by summing the probabilities of all possible scenarios of germline V templates, point mutations, insertions, and deletions that would yield that sequence. In principle, this means that all alignments to germline V segments should be taken into account, and not just the best-scoring one as in traditional methods. However, many unlikely alignment scenarios can be pruned out of the list before computation (see Methods). In practice, this implies that only one plausible germline V gene is typically considered. Once the parameters of the model are learned, they can be analyzed in their own right, or the model may be used to generate synthetic datasets to test hypotheses. Fig 3A summarizes the general workflow of the method.
The generative model is defined so as to be simple enough to be tractable, while recapitulating the main features of the SHM process. Its parameters are directly interpretable in terms of basic SHM statistics. Each sequence has a distinct point-mutation rate μ per base pair, corresponding to its maturation age, drawn from a distribution P(μ). Motivated by the linear relation between indel and point-mutation rates (Fig 1C), we assume that the rates of insertion and deletion events are given by β_ins × μ and β_del × μ, where β_ins and β_del are sequence-independent factors. Then insertions, deletions, and point mutations are drawn randomly with these three rates, uniformly along the sequence. The lengths of each insertion and deletion event are drawn from two distributions P_ins(ℓ_ins) and P_del(ℓ_del). The set of parameters to be inferred is collectively called θ = (P, β_ins, β_del, P_ins, P_del), and corresponds to the basic statistics of the SHM process. To calculate the sum of probabilities over all plausible scenarios efficiently, we use a forward algorithm that avoids computing the sum explicitly. Mathematical details of the probabilistic model and of the likelihood maximization are reported in the Methods section.
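The generation mode of such a model can be sketched directly from this description. The function and parameter names below, as well as the per-position Bernoulli draws for event numbers, are our simplifications; the actual inference machinery is described in the Methods:

```python
import random

BASES = "ACGT"

def geom_len(p, rng):
    """Sample an event length from a geometric law on {1, 2, ...}."""
    length = 1
    while rng.random() > p:
        length += 1
    return length

def simulate_shm(germline, mu, beta_ins, beta_del, p_len, rng):
    """Toy sampler for the generative model: point substitutions with
    rate mu per bp; insertion and deletion events with rates beta_ins*mu
    and beta_del*mu per bp, at uniform positions, with geometric lengths."""
    # Point substitutions: each base mutates independently with prob. mu.
    seq = [b if rng.random() > mu else rng.choice(BASES.replace(b, ""))
           for b in germline]
    n = len(germline)
    n_del = sum(rng.random() < beta_del * mu for _ in range(n))
    n_ins = sum(rng.random() < beta_ins * mu for _ in range(n))
    for _ in range(n_del):
        if seq:
            start = rng.randrange(len(seq))
            del seq[start:start + geom_len(p_len, rng)]
    for _ in range(n_ins):
        start = rng.randrange(len(seq) + 1)
        seq[start:start] = rng.choices(BASES, k=geom_len(p_len, rng))
    return "".join(seq)
```

Note that this sketch draws inserted bases uniformly at random, whereas real insertions show homology to their flanks (Fig 1F); the paper makes the same simplifying assumption in its generative model.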

Validation of the probabilistic approach on synthetic data
The ability of our inference procedure to recover the true indel statistics can be tested by using large enough synthetic datasets. To this aim, we generated N = 10 independent repertoires of 100,000 synthetic sequences each, restricted to the V segment, using realistic parameters similar to those obtained by the analysis of standard annotation (Fig 1). Naive sequences were first generated by drawing V templates according to their frequencies in the data. Then, point and indel mutations were added according to a generative SHM model belonging to the class described above. We assumed a mixture of 10% naive cells, defined as cells that have not gone through the hypermutation process at all (μ = 0), and 90% experienced cells with μ distributed according to a shifted Gamma distribution mimicking a realistic distribution for point mutations (Gamma with mode μ = 0.07, standard deviation 0.04, and shift −0.02; see Methods for details). The lengths of insertions and deletions followed geometric distributions with characteristic scale ℓ = 15 and a maximum value of 30 base pairs.

Fig 3. (A) Given the set of parameters θ for the probabilistic model P, for each sequence a set of alignment scenarios (given by template, read and SHM-indel pattern) is considered and the cumulated alignment likelihood L(seq.|θ) from all of them is computed. Indel statistics and the repertoire-wide age distribution of a repertoire R can be extracted by using a given model (evaluation mode) or by training a new model on R itself (inference mode). Finally, a synthetic repertoire of arbitrary size can be generated according to a given input model P(θ) (generation mode). (B) Repertoire-wide point-mutation rate distribution P(μ); mean and standard deviation over N = 10 independent realizations of a 100,000-sequence synthetic repertoire represented as a solid line with the shaded area around it, for both deterministic and probabilistic estimates. In the inset, the N estimations of the fraction of truly naive sequences; the color coding is the same as in the main plot.
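The mixture used for μ can be sampled as follows. Solving the Gamma shape and scale from the stated mode and standard deviation is our reconstruction of the parameterization, and the truncation at 0 after shifting is an assumption:

```python
import math
import random

def sample_mu(rng, naive_frac=0.10, mode=0.07, sd=0.04, shift=-0.02):
    """Per-sequence point-mutation rate: a fraction naive_frac of exact
    zeros (naive cells), the rest from a shifted Gamma with the given
    mode and standard deviation, truncated at 0. Solving shape k and
    scale s from (mode, sd) uses mode = (k-1)*s and sd**2 = k*s**2,
    which combine into s**2 + mode*s - sd**2 = 0."""
    if rng.random() < naive_frac:
        return 0.0
    scale = (-mode + math.sqrt(mode**2 + 4.0 * sd**2)) / 2.0
    shape = mode / scale + 1.0
    return max(0.0, rng.gammavariate(shape, scale) + shift)
```

Sampling many values reproduces the intended mixture: roughly 10% exact zeros, with the experienced-cell rates centered slightly below the Gamma mean because of the −0.02 shift.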
On each of these synthetic datasets, we compare two methods for estimating the parameters of the model. The first method is the one used in the previous sections of the paper: synthetic sequences are aligned to germline V segments using a deterministic tool; then, from these alignments, the statistics P(μ), β ins , etc, are directly computed from their frequencies. The second method is the probabilistic inference approach outlined in the previous section.
In Fig 3B we report results for a realistic rescaled indel rate, β_del = β_ins = 0.025, of the same order of magnitude as the ones found for nonproductive IgG sequences through standard annotation, see Fig 1A. The fraction of naive cells (μ = 0), shown in the inset, and the distribution of μ for experienced cells, shown in the main figure, are correctly inferred only by the probabilistic approach. An example of the error made by the deterministic approach involves naive sequences: sequences with no SHM are always labeled as "naive" by the best-scoring alignment, while the probabilistic method considers the possibility that they come from an experienced cell (μ > 0) in which no SHM has had the chance to occur (which happens with probability e^(−Lμ), where L is the length of the V segment). This leads the deterministic method to systematically overestimate the true number of naive sequences. The performance of the deterministic method degrades as the rescaled insertion rate β_ins is increased (and β_del with it, Fig 3C). In that regime of frequent indels, best-scoring alignments merge nearby events, underestimating their relative frequency and compensating the resulting errors by an increased mutation rate μ. By contrast, the probabilistic method recovers the ground truth even at very high indel densities, corresponding to 3 deletion and 3 insertion events per sequence on average. The merging of nearby indels also causes the deterministic algorithm to find an excess of non-existing long insertion events above 30 base pairs, while the probabilistic one correctly assigns their frequency to zero (Fig 3D). The analogs of panels 3C and 3D for deletions are reported in S4 Fig. Overall, this analysis on synthetic data shows that while the deterministic analysis correctly captures the main statistics of SHM, it may fail on other quantities such as the fraction of truly naive cells, or the frequency of rare, long indels.
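The size of this naive-labeling bias follows from Bayes' rule; a sketch with a discretized P(μ) (the function name is ours):

```python
import math

def p_naive_given_zero(pi_naive, mus, weights, L):
    """Posterior probability that a sequence with zero observed point
    mutations is truly naive. A naive cell always yields zero mutations;
    an experienced cell with rate mu does so with probability
    exp(-L*mu) (V-segment length L; indel events neglected here).
    (mus, weights) discretize the experienced-cell rate distribution."""
    p_zero_exp = sum(w * math.exp(-L * mu) for mu, w in zip(mus, weights))
    return pi_naive / (pi_naive + (1.0 - pi_naive) * p_zero_exp)
```

For instance, with a 10% naive prior and experienced cells mutating at μ = 0.001 over an L = 300 bp V segment, the posterior is only about 0.13: a deterministic annotator that calls every unmutated sequence naive would be wrong most of the time in this regime, while at higher rates (e.g. μ = 0.05) the posterior approaches 1 and the two methods agree.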

Inference within the probabilistic framework
We next applied our probabilistic inference method to the analysis of nonproductive IgH sequences from the 9 donors of [8] to learn the parameters of the SHM process free of selection effects. We thus obtain individual-specific distributions of maturation ages (Fig 4A, green). For comparison, the naive deterministic estimate, based on the frequency of errors in best-scoring alignments, is also reported (red). We observe substantial variability across individuals, probably due to diverse histories of experience with antigens. The probabilistic method finds more sequences with a small, non-zero maturation age (94.3%±4.1%) than the deterministic one (83.1%±8.6%), see Fig 4B. Because such sequences often have no mutations, the deterministic method underestimates that number (as confirmed by our synthetic data results, see Fig 3B), suggesting that the probabilistic approach gives a more accurate result. While we do not know the ground truth for the true distribution of maturation ages, we can verify this interpretation on the distribution of point mutations. In Fig 4C we compare predictions from the deterministic and probabilistic approaches for the IgG repertoire of one donor. Consistent with our intuition, the deterministic distribution overrepresents unmutated sequences, while the probabilistic distribution is in excellent agreement with the data. In addition, the model reproduces the super-Poissonian distribution of the number of indels (S5 Fig).
In Fig 4D we report estimates for rescaled indel rates, i.e. ratios of indel to point-mutation rates. They are consistent between the deterministic and probabilistic methods, as expected from the analysis on synthetic data, because of the relatively low indel rates in real data. They are also well conserved across individuals. While the raw mutation rates depend on the history of affinity maturation, ratios between them are not expected to depend on it. They should instead be determined by universal biophysical mechanisms of DNA damage and repair, which explains their relative conservation. Similarly, the indel length profiles (Fig 4E and 4F) are consistent between the deterministic and probabilistic methods, as well as across individuals, pointing again to the universality of the biophysical mechanisms at play. However, the average deletion and insertion lengths are slightly larger in the probabilistic estimate, due to a more reliable evaluation of low-probability events corresponding to the tails of the distributions.

Discussion
Most studies of SHM insertions and deletions, and even point mutations, consider productive rearrangements. Instead, we focused our analysis on nonproductive sequences to study the statistics of mutation inherent to the biochemistry of the SHM process, free of subsequent selection effects. However, we found a strong correlation between the productive and nonproductive positional profiles of SHM, implying that most of the heterogeneity is of biochemical origin, rather than stemming from selection. Further studies of the differences in SHM statistics between productive and nonproductive sequences could shed light on the structural and functional constraints that selection imposes on antibodies during affinity maturation.
Because nonproductive sequences and indel SHM are both rare, the combination of both is extremely rare, necessitating a very large dataset such as the one provided by Briney et al. [8]. This study included 9 subjects, which allowed us to assess inter-individual variability. We found that features of the SHM process associated with the biochemistry of the process, such as the positional profiles, the ratio of the number of indels to point mutations, and the distribution of insertion and deletion lengths, were mostly conserved across subjects. By contrast, the distribution of the mutational age of antibodies, which reflects their unique immune history, was specific to each individual. While we studied individual sequences, we know that those are organized in distinct lineages originating from naive germline sequences. This could lead to multiple counting of mutations occurring in several sequences of the same lineage. However, since this effect is random and decoupled from the mutation process itself in nonproductive sequences, we only expect it to increase errors due to sampling. The low individual variability reported for most summary statistics provides an upper bound on that error.
We report a large positional heterogeneity, with a similar relative variability between point and indel hypermutations. Indels tend to concentrate around the CDR1 and CDR2 loops, as do point mutations [51], consistent with previous reports on productive sequences [33]. However, our results show that the positional profiles of point, insertion, and deletion SHM are still distinct (albeit correlated), suggesting a complex picture of how the 3 processes result from AID selectivity and positional preference.
The precise mechanism causing SHM indels and point mutations is still elusive. Our results confirm the main prediction of the polymerase slippage model [45,46] for indels, namely that inserted segments are homologous to their immediately flanking regions. This complements earlier observations from [33], but without the interference of selection, and on a much bigger dataset allowing for more detailed statistics. By contrast, we found that deleted segments are not significantly homologous to their neighborhood, suggesting that the slippage is not driven by sequence homology, but reaches out to a random position along the DNA. An alternative hypothesis is that deletions are driven by double-strand breaks caused by AID, consistent with our and previous observations that they tend to co-localize with point mutations [34,49]. In the slippage model, the indel length corresponds to the distance by which the polymerase slips. Accordingly, both insertion and deletion lengths have a characteristic exponential tail in their distributions, compatible with the processive motion of the slipping polymerase with random stopping.
Our probabilistic approach is not only useful for annotating sequences and inferring rates, but it can also be used as a generative model to produce synthetic datasets for computational controls. However, the model makes a few simplifying assumptions for the sake of computational tractability. First, its positional profiles are constant, in contradiction with the data. Second, inserted segments are assumed to be uniformly random, ignoring homology to flanking regions. While these approximations are unlikely to affect our results, future refinements of the generative model would be welcome to create more realistic datasets.
We evaluated the conditions under which the probabilistic approach we introduced adds value over the classical deterministic annotation of mutated antibodies. For most of the bulk repertoire, where the rate of hypermutation is relatively low, both approaches yield similar results, with a notable exception: the fraction of "naive" sequences that have not undergone any affinity maturation. This definition of naive is distinct from the usual phenotypic characterizations based on isotype classification, flow cytometry [52], or single-cell RNA sequencing data [53], although they are correlated. The deterministic method systematically overestimates this fraction: since the SHM process is stochastic, some sequences that have undergone affinity maturation may not have accumulated any mutations at all. Such sequences are always considered naive by the deterministic approach, while the probabilistic method weighs both possibilities, correcting that bias.
Beyond the bulk of the repertoire, the probabilistic approach is most useful for analyzing heavily mutated sequences, where it is difficult to distinguish point mutations from nearby insertion and deletion events. Although a minority, such sequences are very important: they are found in the largest clonal families that have undergone the longest affinity maturation, and so are the most experienced and selected ones. HIV-1-related broadly neutralizing antibodies (bnAbs), which often contain many indels, are examples of such sequences [54]. The relatively low frequency of indels that we report could explain why effective bnAbs are rare, and only emerge naturally in a small percentage of infected individuals. In that regard, evidence that indels frequently appear in response to some vaccination protocols [32] suggests promising strategies for HIV vaccination.

Data and preprocessing
We exploited publicly available IgM and IgG repertoire data from a recent ultra-deep repertoire study of immunoglobulin heavy chains (IgH), including approximately 3 billion sequences in total from 9 different healthy donors [8]. The V region is well covered by the mRNA sequencing protocol, starting from nucleotide position ≈ 66 along the templates for almost all sequences; the CDR3 and J gene are also covered.
We first preprocessed raw sequences as in [22], using pRESTO (release 0.5.13) from the Immcantation pipeline [55]. Then, we performed a standard annotation with the IgBLAST standalone tool [48] (release 1.13.0, with default penalty values), referring to IMGT for germline templates [50] (only functional (F) genes are considered, totaling 265 different IGHV templates, alleles included). We also applied quality filters, keeping only sequences with a properly annotated V gene, J gene, and CDR3, and covering at least 200 base pairs of the V segment. We checked using synthetic data that this quality filter does not substantially alter the underlying indel distribution below length 60, although sequences with very long indels are slightly less likely to pass it (S6 Fig). Unique molecular identifiers (UMI) were used in the original study to correct for PCR amplification biases. We grouped reads with the same UMI to build consensus sequences for downstream analysis in which read counts were ignored, allowing for a maximum error rate of 10% within each group. Sequences corresponding to low numbers of reads per UMI (e.g. 1 or 2) are expected to be more affected by sequencing errors, because no consensus can be built from multiple reads. We verified that our results are not affected by this potential source of error by repeating the standard annotation analysis on UMIs represented by at least 3 reads (S7 Fig). Thus, to make the most of the available data, we did not discard sequences with low numbers of reads per UMI.
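As an illustration, the per-UMI consensus step can be sketched as a simple majority vote (a simplified stand-in for pRESTO's consensus-building tools; the `reads` format, the equal-length assumption, and the function name are ours):

```python
from collections import Counter, defaultdict

def umi_consensus(reads):
    """reads: list of (umi, sequence) pairs; sequences in a group assumed equal length."""
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)
    consensus = {}
    for umi, seqs in groups.items():
        # majority vote at each position across the reads sharing this UMI
        cons = "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))
        # keep the group only if every read is within a 10% error rate of the consensus
        if all(sum(a != b for a, b in zip(s, cons)) <= 0.1 * len(cons) for s in seqs):
            consensus[umi] = cons
    return consensus
```

A group whose reads disagree with the consensus in more than 10% of positions is discarded, mirroring the maximum error rate used above.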
It is very hard to call mutations in the CDR3 because of its variability, and the J region is very short. Therefore we restricted ourselves to the V region, starting from the beginning of the read, up to the conserved cysteine before the beginning of the CDR3. We call that sequence s.

Modeling approach
We start from a repertoire {s} of size N. Let {t} be the set of (truncated) V templates used for the alignment (the same 265 templates used for the IgBLAST annotation). For each s and t, a deterministic alignment algorithm, based on the standard Needleman-Wunsch (NW) algorithm for pairwise global alignments with affine gap scores [56], provides only the alignment with the highest score S(s, t).
The values of the parameters used in the NW algorithm were chosen according to the actual frequency of the corresponding events in human IgG heavy chains, as estimated through IgBLAST annotation (Fig 1). Assuming an average point-mutation rate of ≈ 5%, an indel rate of 0.05%, and an exponentially decaying length distribution of indels with characteristic length ℓ = 10, we obtain the following log-odds penalties (using natural logarithms): ≈ −0.05 for matching and ≈ −3 for mismatching aligned base pairs; ≈ −7.7 for opening a deletion gap; ≈ −0.1 for extending it; ≈ −9.1 for opening an insertion gap; and ≈ −1.5 for extending it (including the log(4) ≈ 1.4 penalty for choosing the inserted base pair) [56]. Note that since the alignment is global, only relative values of the penalties matter.
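The quoted numbers can be reproduced from the stated rates; the sketch below assumes that the gap-opening penalty absorbs one extension factor of the geometric length distribution (our reading of the values, not a statement from the original):

```python
import math

mu = 0.05          # average point-mutation rate per base pair
mu_indel = 5e-4    # indel rate per base pair
q = 1 - 1 / 10     # survival factor of the geometric length distribution (mean 10)

match    = math.log(1 - mu)             # ~ -0.05
mismatch = math.log(mu)                 # ~ -3
del_open = math.log(mu_indel * q)       # ~ -7.7
del_ext  = math.log(q)                  # ~ -0.1
ins_open = math.log(mu_indel * q / 4)   # ~ -9.1, includes the log(4) base-choice penalty
ins_ext  = math.log(q / 4)              # ~ -1.5
```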
The optimal alignment is built through a dynamic-programming strategy, relying recursively on the optimal alignments of shorter blocks. The key modification we introduce to the standard NW algorithm is to sum over all possibilities during the recursive scheme, rather than choosing the optimal one at each step. The result is then no longer an optimal global alignment score, but rather the sum of the likelihoods of all possible global alignments between s and t (later also summed over templates t). Each possible alignment contributes to the final sum proportionally to its likelihood, yielding a probabilistic likelihood of the sequence, in contrast with the deterministic score of the standard NW algorithm.
The resulting likelihood of the sequence s, L(s; θ), depends on the model parameters θ. The total likelihood of the whole repertoire is then given by

L({s}; θ) = ∏_s L(s; θ),

where we assumed independent sequences, ignoring the lineage structure of the repertoire (see Discussion). Maximizing the likelihood with respect to θ gives an estimate θ̂ for the alignment parameters that best describe the data.

Recursive algorithm for the alignment
We now give the details of the computation of L(s; θ). Consider a sequence s and a template t. Indexing starts from zero, including an initial "placeholder" position for potential gaps, and runs over the actual base pairs of s and t (gaps are not taken into account directly in this indexing). So s_i is the i-th symbol along s, while s_{0:i} is the portion of s running from the first actual symbol s_1 to the i-th symbol s_i included, preceded by the "empty" placeholder position 0. As in NW, the alignment procedure always starts from the beginning of the two sequences s and t, and goes through all the possibilities by extending previous alignments of shorter pieces of s and t. In our case, however, we sum over these possibilities, rather than keeping the best-scoring one. The indices on s and t are not constrained to increase at the same time; an asymmetric increase corresponds to a deletion or an insertion.
The alignment score for the portion of s up to position i included (s_{0:i}) and the portion of t up to position j included (t_{0:j}) is computed by relying on the previous alignments of shorter portions, according to the following possibilities:
• s_{i−1} and t_{j−1} were previously aligned to each other, so the alignment advances by one position on both sequences, with a match (s_i = t_j) or a mismatch (s_i ≠ t_j);
• s_i was previously aligned with t_k, k < j, so that a gap on s of length j − k has to be taken into account (i.e., a deletion of templated bases);
• t_j was previously aligned with s_{k′}, k′ < i, so that a gap on t of length i − k′ has to be taken into account (i.e., an insertion of non-templated bases).
In the standard NW algorithm, each of these possibilities is associated with a score, and only the largest among them is kept. Here, instead, we work directly with likelihoods, and sum over them according to the following recursion:

S(s_{0:i}, t_{0:j}) = M(s_i, t_j) S(s_{0:i−1}, t_{0:j−1}) + Σ_{ℓ≥1} Γ_del(ℓ) S(s_{0:i}, t_{0:j−ℓ}) + Σ_{ℓ≥1} Γ_ins(ℓ) S(s_{0:i−ℓ}, t_{0:j}),   (2)

where M(s_i, t_j) is the match/mismatch probability between nucleotides s_i and t_j, Γ_del(ℓ) is the probability of a deletion of length ℓ, and Γ_ins(ℓ) is the probability of an insertion of length ℓ. The initial conditions of the recursion are S(s_{0:0}, t_{0:0}) = 1, with S ≡ 0 whenever an index is negative. The final score S(s, t) ≡ S(s_{0:L_s}, t_{0:L_t}) gives the likelihood of s aligning to t, obtained as the sum over all possible alignments. The likelihoods M, Γ_del, and Γ_ins can be further specified as:

Γ_del(ℓ) = μ_del P_del(ℓ),   Γ_ins(ℓ) = μ_ins P_ins(ℓ) (1/4)^ℓ,

assuming uniformly random insertions, where P_del/ins(ℓ) is the distribution of deletion and insertion lengths, and:

M(s_i, t_j) = (1 − μ) I(s_i = t_j) + (μ/3) I(s_i ≠ t_j),

with I(·) being the indicator function (equal to one if its argument is true, otherwise equal to zero). The rates μ_del, μ_ins, and μ are interpreted as the deletion, insertion, and point-mutation rates per base pair. Based on evidence from the data (Fig 1C), we assume that these rates scale together with the mutational age, since both types of mutations accumulate with time during affinity maturation cycles. We thus write:

μ_del = β_del μ_s,   μ_ins = β_ins μ_s,

where β_del and β_ins are constant (repertoire-specific) ratios, and μ_s is sequence-specific, as suggested by the wide range of mutabilities in the data. The ratios β_del/ins will also be referred to as rescaled deletion and insertion rates. The mutation rate μ_s is treated as a hidden variable, whose distribution P(μ) is a model parameter.
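A minimal Python sketch of this sum-over-alignments recursion, assuming for simplicity a shared geometric length distribution for insertions and deletions (the function name and parameter values are illustrative, not the original implementation):

```python
import numpy as np

def alignment_likelihood(s, t, mu, beta_del=0.01, beta_ins=0.01, max_gap=10):
    """Sum of the likelihoods of all global alignments of read s to template t."""
    mu_del, mu_ins = beta_del * mu, beta_ins * mu      # rates scale with mutational age
    p_len = lambda l: 0.1 * 0.9 ** (l - 1)             # geometric P_del/ins (assumption)
    M = lambda a, b: (1 - mu) if a == b else mu / 3    # match/mismatch probability
    S = np.zeros((len(s) + 1, len(t) + 1))
    S[0, 0] = 1.0
    for i in range(len(s) + 1):
        for j in range(len(t) + 1):
            if i == j == 0:
                continue
            v = 0.0
            if i and j:                                 # advance on both: (mis)match
                v += M(s[i - 1], t[j - 1]) * S[i - 1, j - 1]
            for l in range(1, min(j, max_gap) + 1):     # deletion of l templated bases
                v += mu_del * p_len(l) * S[i, j - l]
            for l in range(1, min(i, max_gap) + 1):     # insertion of l random bases
                v += mu_ins * p_len(l) * 0.25 ** l * S[i - l, j]
            S[i, j] = v
    return S[-1, -1]
```

With the indel rates set to zero the sum collapses to the single gapless alignment, e.g. `alignment_likelihood("ACGT", "ACGT", mu=0.05, beta_del=0, beta_ins=0)` equals (1 − 0.05)⁴; enabling indels can only add probability mass.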
The likelihood of each sequence is obtained by summing over all mutation rates and templates:

L(s; θ) = ∫ dμ_s P(μ_s) L(s|μ_s; ϕ),   with   L(s|μ_s; ϕ) ≡ Σ_t p_t S(s, t|μ_s, ϕ),   (8)

with p_t uniform, and where the parameters θ have been separated into two components: θ = (P(μ), ϕ), with ϕ ≡ (β_del, β_ins, P_del(ℓ), P_ins(ℓ)). In Eq (8) we made explicit the dependence of the alignment score on the mutational age μ_s and the indel parameters ϕ. The repertoire is composed of a mixture of cells that have undergone some mutational process, presumably in germinal centers, and cells that have stayed naive. To properly model this, we further assume that the prior distribution of mutation rates has two parts:

P(μ) = f δ(μ) + (1 − f) P̃(μ),   (9)

where f is the fraction of naive cells, and P̃(μ) is a smooth probability density.
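Numerically, the marginalization over μ_s with this naive/experienced mixture prior can be sketched on a discretized μ grid (the grid, names, and discretization choices are ours):

```python
import numpy as np

def marginal_likelihood(lik_given_mu, f, p_smooth, mu_grid):
    """L(s; theta) = f * L(s | mu = 0) + (1 - f) * integral dmu P~(mu) L(s | mu).

    lik_given_mu: callable mu -> L(s | mu; phi), e.g. the forward alignment sum;
    p_smooth: values of the smooth density P~(mu) on mu_grid (uniform spacing).
    """
    dmu = mu_grid[1] - mu_grid[0]
    smooth = sum(p * lik_given_mu(m) for p, m in zip(p_smooth, mu_grid)) * dmu
    return f * lik_given_mu(0.0) + (1 - f) * smooth

# sanity check: with a flat likelihood, the marginal equals the total prior mass, i.e. 1
mu_grid = np.linspace(0.0, 1.0, 101)
p_smooth = np.ones_like(mu_grid)
p_smooth /= p_smooth.sum() * (mu_grid[1] - mu_grid[0])   # normalize P~ on the grid
L = marginal_likelihood(lambda m: 1.0, f=0.2, p_smooth=p_smooth, mu_grid=mu_grid)
```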
We denote by θ^t = (P^t(μ), ϕ^t) the values of the parameters at iteration t. Following EM, we define the pseudo-log-likelihood to be maximized with respect to θ at each iteration:

Q(θ|θ^t) = Σ_s ⟨ log [ L(s|μ_s; ϕ) P(μ_s) ] ⟩_{μ_s|s;θ^t},   (11)

where the mean ⟨·⟩_{μ_s|s;θ^t} is taken with respect to the posterior distribution:

P(μ_s|s; θ^t) = L(s|μ_s; ϕ^t) P^t(μ_s) / L(s; θ^t).

Because of the separation between naive (μ_s = 0) and experienced cells, this posterior may also be decomposed as:

P(μ_s|s; θ^t) = f_s δ(μ_s) + (1 − f_s) P̃(μ_s|s; θ^t),   (13)

with f_s the posterior probability that sequence s is naive, and P̃(μ_s|s; θ^t) the smooth part of the posterior. The pseudo-likelihood Q can be decomposed into two terms corresponding to the two factors inside the logarithm of Eq (11), log L(s|μ_s; ϕ) and log P(μ_s). The first one depends on θ only through ϕ, and the second one only through P(μ), allowing for their independent maximization.
Maximization with respect to P(μ). To maximize Q under the normalization constraint, we define the functional with a Lagrange multiplier λ:

F[P] = Q(θ|θ^t) + λ ( 1 − ∫ dμ P(μ) ).

The extremal condition with respect to P(μ′) gives:

Σ_s P(μ′|s; θ^t) / P(μ′) = λ.

Plugging the functional forms (13) and (9) into it, together with the extremal condition with respect to λ, we get:

P̃^{t+1}(μ) = Σ_s (1 − f_s) P̃(μ|s; θ^t) / Σ_s (1 − f_s),   f^{t+1} = (1/N) Σ_s f_s.

Intuitively, the new estimate of the smooth part P̃(μ) of the repertoire-wide distribution is the weighted average of the sequence-specific posterior distributions over the repertoire. Similarly, the new estimate of the fraction f of naive cells is given by the average over the repertoire of the probability of being naive.
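These M-step updates amount to simple averages of the per-sequence posteriors; a sketch on a discretized μ grid (array shapes and the function name are our choice):

```python
import numpy as np

def update_prior(f_post, smooth_post):
    """M-step update of the mixture prior.

    f_post: (N,) posterior probabilities f_s that each sequence is naive;
    smooth_post: (N, K) smooth posteriors P~(mu_k | s) on a mu grid, rows normalized.
    """
    f_new = f_post.mean()                    # average probability of being naive
    w = 1.0 - f_post                         # weight of the experienced component
    p_new = (w[:, None] * smooth_post).sum(axis=0) / w.sum()
    return f_new, p_new

# two toy sequences: one half-naive, one fully experienced
f_post = np.array([0.5, 0.0])
smooth_post = np.array([[1.0, 0.0], [0.0, 1.0]])
f_new, p_new = update_prior(f_post, smooth_post)
```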
Maximization with respect to ϕ. To maximize the first term of Q with respect to ϕ, we use Monte Carlo sampling to represent the integral over the posterior P(μ_s|s; θ^t):

Σ_s ⟨ log L(s|μ_s; ϕ) ⟩_{μ_s|s;θ^t} ≈ (1/N_MC) Σ_s Σ_{n=1}^{N_MC} log L(s|μ_{s,n}; ϕ),

with μ_{s,n} ∼ P(μ_s|s; θ^t).
Then, the extremal condition with respect to ϕ_i reads:

∂/∂ϕ_i Σ_s Σ_n log L(s|μ_{s,n}; ϕ) = 0.

Since we could not find an exact update rule for ϕ_i, we relied on gradient ascent to maximize the likelihood. Each gradient component ∇_i^t can be estimated numerically by infinitesimally shifting ϕ_i:

∇_i^t = (1/ε) Σ_s Σ_n [ log L(s|μ_{s,n}; ϕ̃_i^t) − log L(s|μ_{s,n}; ϕ^t) ],

where ϕ̃_i^t is equal to ϕ^t in all components but the i-th, which has been increased by ε. We updated ϕ according to a momentum gradient-ascent rule:

ϕ^{t+1} = P[ ϕ^t + α_t ∇^t + ω_t (ϕ^t − ϕ^{t−1}) ],   (23)

with learning rate α_t and inertial term ω_t, whose dependence on the iteration step t was chosen to optimize convergence and to avoid long-time oscillations [59], with T being the maximum number of iterations allowed. In Eq (23), P denotes the projection onto the simplex defined by the constraints β_del ≥ 0, β_ins ≥ 0, P_del(ℓ) ≥ 0, P_ins(ℓ) ≥ 0, and Σ_ℓ P_del(ℓ) = Σ_ℓ P_ins(ℓ) = 1. The projection was done using the procedure described in Ref. [60].
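The projection step on the length distributions can be implemented with the standard sort-based Euclidean projection onto the probability simplex (our reading of the procedure in Ref. [60]; the function name is ours):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]                               # sort in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]   # largest feasible support index
    theta = (css[rho] - 1.0) / (rho + 1)               # shift making the support sum to 1
    return np.maximum(v - theta, 0.0)
```

For example, `project_simplex(np.array([2.0, 0.0]))` returns `[1.0, 0.0]`, and a vector already on the simplex is returned unchanged.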

Speeding up the computation
The algorithm described so far is computationally very costly. Even the basic step of computing the alignment likelihood L(s|μ_s; ϕ) for a single sequence at fixed μ_s is time-consuming: if we allow for a maximum size ℓ = Θ for single-event deletions and insertions, then the required number of operations roughly scales as L_s × ⟨L_t⟩ × N_t × (2Θ + 1), where N_t is the number of V templates considered and ⟨L_t⟩ is their average length. This has to be repeated for each sequence, for each mutation rate, for each template, and for each direction of the gradient, and all of this at each iteration. Below we describe approximations that considerably speed up the code.
Pruning the alignment matrix. Indels are quite rare in real Ig sequences: most sequences have none, or at most one, located around well-defined regions. Allowing for possible indels (of size at most Θ) everywhere in every sequence contributes very little to the final cumulative alignment likelihood, wasting computational time. Dropping implausible terms dramatically reduces the computational cost, pushing the factor L_s × ⟨L_t⟩ × (2Θ + 1) almost down to L_s.
Plausible locations for gaps can be identified by computing, for each pair of positions (i, j) along s and t, the likelihood that the alignment goes through that pair. This is given by the product:

S(s_{0:i}, t_{0:j}) × S̄(s_{i:L_s}, t_{j:L_t}),

where the first term is computed as described above, and the second term similarly, but using a backward iteration on the reversed sequences. Normalizing by the total alignment likelihood, we obtain the relative probability that the correct alignment passes through (i, j):

Ŝ_{i,j} = S(s_{0:i}, t_{0:j}) S̄(s_{i:L_s}, t_{j:L_t}) / S(s, t).

The entries of this matrix can now be used to prune the terms of the recursive Eq (2). When Ŝ_{i,j} is smaller than a fixed threshold ϑ = 10^−5, we set S(s_{0:i}, t_{0:j}) to zero and avoid computing it or including it in further sums.
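A sketch of this forward-backward construction, under a simplified model with geometric indel lengths (parameter values and the pruning threshold below are illustrative):

```python
import numpy as np

def forward(s, t, mu=0.05, mu_del=5e-4, mu_ins=5e-4, max_gap=5):
    """Forward sum-over-alignments matrix F[i, j] = S(s_0:i, t_0:j) (simplified model)."""
    p_len = lambda l: 0.1 * 0.9 ** (l - 1)   # geometric indel-length distribution
    F = np.zeros((len(s) + 1, len(t) + 1))
    F[0, 0] = 1.0
    for i in range(len(s) + 1):
        for j in range(len(t) + 1):
            if i == j == 0:
                continue
            v = 0.0
            if i and j:  # (mis)match
                v += ((1 - mu) if s[i - 1] == t[j - 1] else mu / 3) * F[i - 1, j - 1]
            for l in range(1, min(j, max_gap) + 1):  # deletion of l templated bases
                v += mu_del * p_len(l) * F[i, j - l]
            for l in range(1, min(i, max_gap) + 1):  # insertion of l random bases
                v += mu_ins * p_len(l) * 0.25 ** l * F[i - l, j]
            F[i, j] = v
    return F

def passthrough(s, t, **kw):
    """Relative probability that the alignment visits each pair (i, j)."""
    F = forward(s, t, **kw)
    B = forward(s[::-1], t[::-1], **kw)      # backward pass on reversed sequences
    return F * B[::-1, ::-1] / F[-1, -1]     # F[i,j] * B[Ls-i, Lt-j] / total

mask = passthrough("ACGTACGT", "ACGTACGT") > 1e-5   # entries to keep (threshold 1e-5)
```

For identical sequences, the diagonal entries of the pass-through matrix are close to 1 (the gapless path dominates), while far off-diagonal entries are negligible and would be pruned.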
To implement this strategy, we first run the forward and backward iterations without any pruning, for an initial guess of the parameters (μ_s, ϕ). The Ŝ_{i,j} matrix computed in this way is then used to define the pairs of positions to be kept in future computations. After that, all subsequent evaluations of S(s, t) in the optimization algorithm are done with only those pairs of positions.
Reducing the number of templates. Another important performance improvement comes from pruning the list of plausible templates for each sequence s. In principle, the sum in Eq (8) runs over all templates t, but in practice only one of them contributes in a non-negligible way. Thus, we only keep the V template t*_s closest to s, as given by the alignment score of the standard NW algorithm: t*_s = argmax_t S_NW(s, t). The computational cost of L(s; θ) is then reduced by a further factor N_t.
Averaging over the hidden age factor. Another demanding operation is the average over the hidden variable μ_s by Monte Carlo sampling. In principle, the number of Monte Carlo samples N_MC should be very large. In practice, we used a single sample (N_MC = 1) at each iteration. This simplification works because the parameters evolve slowly from one iteration to the next, especially in the final convergence steps, so that time averaging substitutes for large MC sampling.
Posterior approximation. To calculate and sample from the posterior P(μ_s|s; θ), we evaluated it on a finite set of values of μ (called nodes) and interpolated the rest of the function using a smooth piecewise-cubic interpolation [61]. For each sequence, the nodes are placed around the naive estimate of the mutation rate μ_s based on the NW alignment, in addition to two key nodes at μ = 0 and 1. We used 15 to 25 nodes, depending on how mutated the sequence was.
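A sketch of this node-based evaluation, using SciPy's monotone piecewise-cubic interpolator as the smooth scheme (the node placement and the toy log-posterior below are our own illustration, not the actual posterior):

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

mu_hat = 0.06   # hypothetical NW-based point estimate of the mutation rate
nodes = np.unique(np.concatenate([[0.0, 1.0], mu_hat + np.linspace(-0.03, 0.03, 15)]))

# stand-in for the true log-posterior, evaluated only at the nodes
log_post_nodes = -0.5 * ((nodes - mu_hat) / 0.01) ** 2

interp = PchipInterpolator(nodes, log_post_nodes)   # smooth piecewise-cubic interpolation
grid = np.linspace(0.0, 1.0, 2001)                  # bin size 0.0005, as for P(mu)
post = np.exp(interp(grid))
post /= post.sum() * (grid[1] - grid[0])            # normalize on the dense grid
```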
As both alignment parameters ϕ and repertoire-wide distribution P(μ) get updated during inference, the positions of the nodes are updated for each sequence as well, by placing them around the previously estimated maximum of the posterior with a spread given by its inverse curvature.
The prior distribution P(μ) was numerically stored by binning uniformly the [0, 1] interval with a bin size of 0.0005.

Generation of synthetic data
In order to mimic realistic repertoire-wide distributions of the point-mutation rate P(μ), we used a shifted Gamma distribution:

P(μ) ∝ (μ + μ_0)^{α−1} e^{−β(μ+μ_0)},   μ ≥ 0,

with shift μ_0 = 0.02, normalized by the constant z(α, β, μ_0), and with α, β determined by the mode and standard deviation of the unshifted Gamma distribution: (α − 1)/β = 0.07 and √α/β = 0.04. In practice, we drew values from the Gamma distribution with parameters α and β, subtracted 0.02, and discarded negative values. The resulting distribution has non-zero probability for infinitesimally small values and a mode at μ = 0.05. We used this distribution for generating synthetic data when validating the software and reported as ground truth in
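The parameter values follow from the two stated conditions; a sketch of the sampling procedure (the seed and sample size are arbitrary):

```python
import numpy as np

# Solve (alpha - 1)/beta = 0.07 and sqrt(alpha)/beta = 0.04:
# alpha = (0.04 * beta)**2  =>  0.0016 * beta**2 - 0.07 * beta - 1 = 0
beta = (0.07 + np.sqrt(0.07 ** 2 + 4 * 0.0016)) / (2 * 0.0016)
alpha = 1 + 0.07 * beta

rng = np.random.default_rng(0)

def sample_mu(n):
    """Draw from Gamma(alpha, rate=beta), shift left by 0.02, discard negatives."""
    mu = rng.gamma(alpha, 1.0 / beta, size=n) - 0.02
    return mu[mu >= 0]

mus = sample_mu(100_000)   # mode of the shifted distribution: 0.07 - 0.02 = 0.05
```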