Consistency of VDJ Rearrangement and Substitution Parameters Enables Accurate B Cell Receptor Sequence Annotation

VDJ rearrangement and somatic hypermutation work together to produce antibody-coding B cell receptor (BCR) sequences for a remarkable diversity of antigens. It is now possible to sequence these BCRs in high throughput; analysis of these sequences is bringing new insight into how antibodies develop, in particular for broadly-neutralizing antibodies against HIV and influenza. A fundamental step in such sequence analysis is to annotate each base as coming from a specific one of the V, D, or J genes, or from an N-addition (a.k.a. non-templated insertion). Previous work has used simple parametric distributions to model transitions from state to state in a hidden Markov model (HMM) of VDJ recombination, and assumed that mutations occur via the same process across sites. However, codon frame and other effects have been observed to violate these parametric assumptions for such coding sequences, suggesting that a non-parametric approach to modeling the recombination process could be useful. In our paper, we find that indeed large modern data sets suggest a model using parameter-rich per-allele categorical distributions for HMM transition probabilities and per-allele-per-position mutation probabilities, and that using such a model for inference leads to significantly improved results. We present an accurate and efficient BCR sequence annotation software package using a novel HMM “factorization” strategy. This package, called partis (https://github.com/psathyrella/partis/), is built on a new general-purpose HMM compiler that can perform efficient inference given a simple text description of an HMM.


INTRODUCTION
The molecular sequences of B and T cell receptors (BCRs and TCRs) determine what antigens will be recognized by these lymphocytes, with B cells recognizing antigens via immunoglobulins and T cells recognizing antigenic peptides presented by the major histocompatibility complex.Together, the molecular sequences of BCRs and TCRs encode immunological memory against a great diversity of antigens.Diversity in these receptor loci is generated first by the process of VDJ recombination, in which germline-encoded V, D, and J genes are randomly selected, the ends of the genes are deleted by some amount, and then joined together with random non-templated insertions (Figure 1).BCR sequences diversify further through the Darwinian process of somatic hypermutation and clonal selection.
In the about 35 years since the discovery of VDJ recombination by Susumu Tonegawa, sequencing assays have been widely applied to BCRs and TCRs to further our understanding of this process.Recently, high throughput sequencing has given a remarkable perspective on the forces determining the immunoglobulin repertoire [1][2][3][4].In addition to advancing basic science understanding, such sequencing is being applied to learn how antibodies develop against antigens of medical interest, such as through influenza vaccination [5].There have also been spectacular advances using this technology to understand the ontogeny of HIV broadly neutralizing antibodies, with the still-elusive goal of eliciting them with an HIV vaccine [6][7][8][9].
A fundamental step in the analysis of such a sequencing data set is to reconstruct the origin of each nucleotide in each sequence: whether it came from a nontemplated insertion or from a germline V, D, or J gene, and if so, which one and where.Even if a complete collection of alleles for the germline V, D, and J genes were available, this problem would be challenging because deletion obscures the boundaries between insertions and germline V, D, and J gene sequences.The BCR case is made more difficult by the processes of somatic hypermutation and clonal selection: if a BCR sequence does not match a germline gene in the area of an insertion-deletion boundary, it may have come from a mutated germline position or a non-templated insertion.We will call this general problem of describing the source of each nucleotide in a BCR (or TCR) sequence the BCR (or TCR) annotation problem.Here we will focus on the more challenging BCR variant of the problem.
One approach is to leverage general-purpose tools for doing pairwise sequence search such as BLAST [10] and Smith-Waterman local alignment [11] for the annotation problem.Adding BCR-specific aspects to these basic algorithms has resulted in a collection of useful tools such as IgBLAST [12] and the online annotation tool on the IMGT [13] website.However, BCR sequence formation is quite complex (reviewed in [3]) and this complexity invites a modeling-based approach, specifically in the framework of hidden Markov models (HMMs).HMMs for sequence analysis consist of a directed graph on "hidden state" nodes with defined start and end states, with each

. . . C T G G G A A C T C C C T A C C T C T A A G
The VDJ recombination process, in which individual V, D, and J genes are randomly selected from a number of copies of each.These genes are then joined together via a process that deletes some randomly distributed number of nucleotides on their boundaries then joins them together with random "nontemplated insertions" (N, blue).The specificity of an antibody is primarily determined by the region defined by the heavy chain recombination site, referred to as the third complementarity determining region (CDR3).node potentially "emitting" a nucleotide base or amino acid residue [14,15].In the BCR case, the hidden states represent either (gene, nucleotide position) pairs or non-templated insertion, and the emission probabilities incorporate the probability of somatic hypermutation at that base.The HMM approach to BCR annotation has been elegantly implemented first in SODA [16], then iHMMunealign [17], and then SODA2 [18].The transition probabilities for these HMM methods are modeled parametrically: specifically, they use the negative binomial distribution as first used in [19], and the emission probabilities come from the same mutation process across positions (even if the process is context-dependent) [17].
High throughput sequencing has become commonplace since these HMMs were designed, which offers both a challenge and an opportunity for model-based approaches.It is a challenge because millions of distinct sequences are now available from a single sample of B cells, so methods must be efficient.On the other hand, it is an opportunity because such large data sets offer the opportunity to develop and fit models with much more detail.
We hypothesized that large data sets would reveal reproducible fine-scale details in the probabilistic rearrangement process that could be used for improved inference.In previous foundational work, researchers have used probability distributions such as the negative binomial [19] to model deletion lengths, and modeled the propensity for somatic hypermutation based on sequence context [17].
However, BCR sequences are protein-coding, and thus there are constraints on insertions and deletions that come from sequence frame; these types of constraints cannot be expressed by unimodal probability distributions with few parameters.
For example, the D gene is preferentially (though not exclusively) used in a specific frame post-selection [2,20], which in some cases is simply due to a stop codon being present in a specific frame [3].This suggests that a parameter-rich approach could be useful: instead of parametric distributions with a few parameters, using a large data set we could fit a per-allele categorical distribution for insertion and deletion lengths.Rather than modeling the process of somatic hypermutation with germline nucleotide context, we could simply infer the per-position per-allele mutation frequency and use this as the mutation probability.
In order to meet the challenge and opportunity of large scale data, we have built partis: a fast, flexible, and open source HMM framework to analyze BCR sequences.In order to process large data sets, we started by writing an efficient new HMM compiler ham which enables inference on an arbitrary HMM specified via a simple text file rather than having to write special-purpose computer code.
We then developed an HMM "factorization" strategy, which along with extensive caching and optimizations in the overall model topology, results in an order of magnitude faster execution than previous HMM implementations for BCR/TCR sequences.This work, along with parallelization, means that the largest data sets available today are comfortably within the capabilities of partis running on standard hardware.
We find that HMM inference using a non-parametric approach to fitting HMM parameter distributions outperforms previous approaches.The parameters for the Observed deletion frequencies for two V, four D, and two J alleles on the Adaptive data set.These alleles were chosen to be representative of the various shapes taken by the empirical distributions.In the complete set of plots (which are publicly available as described in the text), per-allele distributions are frequently multi-modal and appear similar between humans.
non-parametric transition and emission probabilities are fit "on the fly" for each individual data set, using a tiered aggregation strategy to scale the level of model detail to the amount of data.The accuracy of partis is further increased by using multiple sequences from the same rearrangement event (which differ due to somatic hypermutation) using pair-HMMs and their generalization to more than two emission sequences, which we will call "multi-HMMs".We also leverage the full power of HMMs by not only calculating the best (known as Viterbi) annotations of a sequence according to that HMM, but also computing forward probabilities for single and multi-HMMs.These forward probabilities allow one to integrate out uncertainty in quantities that are not of interest, retaining more accurate estimates (with uncertainty) of those parameters that are of interest.The partis software implementation has been engineered to be extensible and maintainable: it is open source and includes continuous integration and reproducible validation using the Docker software container system [21].

Empirical distributions of VDJ recombination process parameters deviate re-
producibly from simple distributions.We find that the probability distributions of rearrangement parameters deviate reproducibly from any commonly used distribution.For example, our inferred deletion length distributions take on a variety of shapes (Figure 2), which vary from allele to allele but appear consistent across humans (see also Figure S1).Mutation rates vary by an order of magnitude from position to position, creating a pattern which is unique to each allele but similar across humans for that allele (Figure 3 and S2).The overall level of mutation, however, varies between humans in our data sets (as noted previously [22]).Insertion lengths, on the other hand, vary much less across alleles, and are typically unimodal (Figure 4 and S3).The full collection of plots is available on Dryad at http://datadryad.org/review?doi=doi:10.5061/dryad.149m8.In .Typical observed mutation frequencies for two V, two D, and two J alleles on the Adaptive data set.Mutation frequencies are highly position-dependent.While the structure of these mutation distributions appears similar between humans, the overall level of mutation varies.The first base of the conserved cysteine and tryptophan codons (i.e. the CDR3 boundaries) are indicated with black vertical dashed lines.In the complete set of plots (which are publicly available as described in the text), mutation frequencies are highly variable across sites with a pattern that is similar between humans.
these plots the error bars are constructed using a bootstrapping procedure: we divide unique sequences in the data into ten subsets and plot the uncertainty as the standard deviation over these subsets (see Methods).By considering all of the parameter estimates together (Figures 5 and S4), we find that estimates are consistent between sequence subsets.That these parameter estimates are consistent between disjoint subsets of a collection of unique sequences, and to a lesser extent consistent between humans, indicates that they reflect biology rather than noise.Although this suggestion of consistency between humans is interesting, our strategy of per-human parameter fitting does not require any such assumption, and so we refrain from formalizing this suggestion of between-human consistency into a statistical statement.
Together, these observations suggest that a framework which is able to take these detailed distributions into account will be more accurate than one using simple parametric distributions.Furthermore, the easy accessibility of many reads from a single human suggests a new strategy for such modeling: to fit parameters "on the fly" for each data set.We implemented such a system, consisting of a new HMM compiler and a new annotation framework, which we test using an independent simulation engine.
ham.We implemented a new HMM compiler, called ham, because none of the existing implementations were suitable for our needs (see Methods for more details).
This program is able to perform classical, pair, and multi-HMM inference on any HMM within computational constraints given a text-file description of its topology.This text file description is in YAML format (http://yaml.org/),which is easy to write both by hand and by existing libraries for all popular programming languages.
Although our initial motivations lay more in usability than optimization, ham's general purpose C++ code is slightly faster and uses somewhat less memory than code generated by the well-known HMMoC [23], even ignoring the extra time for code generation and compilation in the latter (Table 1 1.Efficiency of ham compared to code generated by HM-MoC for the "occasionally dishonest casino" from [14].Elapsed CPU time and memory usage (user time and maximum resident set size from the Unix time command) for the Forward probability calculation are shown for sequences of length 1 million (mean of 300 runs) and 10 million (mean of 30 runs).These estimates do not include the extra time for code generation and compilation which is necessary in HMMoC.Uncertainties are the standard error on the mean.concerning the rearrangement process, and then perform annotation inference for each sequence in the set.It is written in Python, is open source (GPL v3 license), and is available at https://github.com/psathyrella/partis/.A Docker image with partis installed is available at https://registry.hub.docker.com/u/psathyrella/partis/.
Benchmarking on simulated data.We developed a simulation engine for B cell receptor sequences implemented in non HMM-based code separate from that used for inference (see Methods).This engine implicitly takes into account detailed dependencies between variables beyond what is possible in our inferential framework.Briefly, to create an unmutated ancestor sequence it begins by sampling a single point from the fully-joint empirical distribution over all rearrangement parameters observed in a data set.It then generates a random sequence with corresponding rearrangement parameters (e.g.VJ insertion length), and then simulates somatic hypermutation out to each leaf of a given tree using per-position mutation rates.The test data set for which we show performance comparisons may be found in the partis GitHub repository at http://git.io/Fxuk.We find that our strategy using flexible categorical distributions outperforms previous BCR annotation packages on this simulated data.Indeed, partis has higher accuracy than other methods in terms of the simple fraction of genes that method V correct (%) D correct (%) J correct (%)  2. Fraction of genes correctly inferred (up to allele) for all publicly-available BCR annotation methods on a simulated sample of 10,000 sequences.partis is shown both for single sequences (k = 1), and for a multi-HMM (k = 5), which performs simultaneous inference on five clonally related sequences.are correctly inferred (Figure 2), a metric used in previous studies [18,24].This metric, however, ignores the difference between slightly incorrect and very incorrect inferences in terms of underlying sequence (see Methods for details).
A more detailed comparison via the Hamming distance between inferred and true naive sequences (Hamming to true naive, HTTN) also shows that partis outperforms previous HMM-based implementations (Figure 6).Neither SODA [16] inferred -true  nor SODA2 [18] were available from the web or directly from the author for comparison.partis also gave the most accurate parameter inferences (the narrowest distributions around the correct value in Figure 7) for insertion lengths, deletion lengths, and mutation frequencies.Additionally, we find that using multiple sequences from a single rearrangement event (differing only in somatic hypermutation) leads to more accurate inference than using one sequence at a time.ham's ability to simultaneously emit an arbitrary number of symbols allows us to take advantage of this in a way that previous implementations could not, providing an additional boost in accuracy.
This annotation with multi-HMMs is shown in the preceding Figures as "partis Although the benefits of our more detailed model decrease with fewer sequences per sample, we find that our "tiered aggregation" strategy (see Methods) performs well with smaller numbers of sequences (Figure 8).Indeed, performance does not begin to degrade appreciably until sample size decreases to around a hundred sequences.
We also find that partis is computationally efficient, making it suitable for inference on large data sets.For efficiency tests, we used samples of one and ten thousand simulated sequences.The Smith-Waterman methods in ighutil require about 0.03 seconds per sequence.The HMM-based inference, alone, of partis takes about 0.04 seconds per sequence; however partis performs a preliminary Smith-Waterman step (see Methods), and must also write the HMM input files from the Smith-Waterman and initial HMM steps.In total, then, partis takes between 0.1 and 0.2 seconds per sequence, depending on sample size.The other available HMM-based method, iHMMunealign, takes about 5 seconds per sequence on both data sets.We suspect that this gain in speed is from the HMM factorization scheme described in the Methods.
We did not attempt speed comparison with web tools, although we note that as a group they are subject to periods of high load, and return free-form text outputs which are more directed towards human readability rather than computer parseability.

DISCUSSION
We find substantial complexity in the details of the rearrangement process; these details are consistent between data subsets and appear consistent between humans.This agrees with previous work on codon effects upon D segment frame usage [2,20], and extends it to the V and J gene segments.This complexity suggests the use of parameter-rich categorical distributions for transition and emission probabilities in a hidden Markov model (HMM) for BCR annotation.Indeed, we find that incorporating the peculiarities and richness of modern data sets enables more accurate annotation of BCR sequences.By letting the data inform parameters such as per-position mutation frequencies, we side-step the need to perform sophisticated modeling of processes such as context-sensitive hypermutation.The multi-HMM framework enables annotation of a number of sequences in a rearrangement group, resulting in increased accuracy when combined with special insert states.
We have extended the ideas of previous authors using HMMs for BCR sequence annotation [16][17][18] and previous work on HMM compilers [23,24] to build partis, a system to annotate BCR sequences.We find that this system substantially outperforms previous methods when given simulated data.The partis package and its dependencies are open source, and validation can be run using a supplied Docker image.
When partis is presented with a data set for the first time, it infers around one to ten thousand parameters for that data set; with this in mind, we would like to be very clear concerning the potential for over-fitting.For any given application of partis, these parameter estimates are intermediate steps for which we are not making a claim of generality, and are instead specific to that data set.This situation is analogous to fitting the rate parameters of a mutation rate matrix when performing a maximum likelihood phylogenetic analysis.In contrast, previous methods used a collection of sequences as a training set for parameter inference and applied HMMs with those general-purpose parameters to other sequences.
Thus it makes perfect sense that previous authors [18] test on simulated sequences made with a different set of parameters than the ones used for training.However, because partis fits parameters on the fly for each data set, it never needs to make out-of-sample predictions.This is not to say that the approach would not benefit from some form of across-sample estimation of parameters.One could use such overall parameter estimates as across-human priors via empirical Bayes estimation [25].Such an approach will become increasingly viable as we become more familiar with the population-level distributions of the relevant parameters.
We have also presented a robust validation framework for partis.For effective validation we need a set of sequences which is both representative of the data on which we will apply the method, and for which we know the true rearrangement history.Unfortunately real data, which is the most readily-available source of sequences which satisfy the first criterion, does not exist which also satisfies the second.In [17], for instance, validation was performed on two data sets which were each known to consist of a single clone with unknown true annotation, and the authors used the fraction of sequences annotated to have the same germline alleles as a proxy for accuracy.This approach cannot test whether the methods are inferring the correct allele, and more significantly, it tests on only two points in the entire, many-dimensional rearrangement space.In other words, validation was performed on a sample of two rearrangements, and gives no information as to how each method might perform on every other possible combination of germline allele choices and deletion boundaries.We thus chose to use simulation, for which by construction one knows the correct answer, but which also requires proof that it accurately represents data (see Methods for details).
The results presented here have all used the IMGT set of germline sequences.
We acknowledge the substantial discussion concerning this set's accuracy [26,27], but IMGT is still by far the most popular resource for germline information and tools.We have thus designed partisso it can switch to any other germline database using a simple command-line flag, for instance as new information becomes available [27,28].
We emphasize that the good accuracy obtained using the multi-HMM framework in simulation assumed that we knew that all of the sequences derived from a single rearrangement event.On real data, of course, we do not know which sequences derive from the same rearrangement event, and in fact this is a challenging problem.Our next step will be to use the HMM framework presented here to cluster sequences together by rearrangement event.
While we and others have found HMMs to be useful for the BCR annotation problem, they do have certain limitations.The Markov assumption (that the current state is ignorant of all states except for the previous one) makes it difficult to propagate information across the HMM.For example, levels of mutation are correlated between different segments of the BCR [22]; thus upon traversing the V segment of a query sequence we have information concerning the overall mutation rate in the rest of the sequence.The Markov chain's conditional independence, however, makes it impossible to propagate this information to the D and J segments in a strict HMM framework.Additionally, HMMs cannot account for palindromic insertions [29], complex strand interaction events [30,31], or the appearance of tandem D segments [2].Conditional Random Fields (reviewed in [32]) could provide a way around some of these limitations; linear-chain conditional random fields enjoy many of the attractive computational properties of HMMs while allowing for more complex dependencies.As a final note, we have not attempted to model insertion/deletion mutations within gene segments [33], which necessarily reduces the accuracy of inference for sequences with these mutations.
Such events, though rare in the Adaptive data set [22], could be incorporated into an HMM with additional transitions between the states representing insertions and deletions within the HMM, although we have not attempted that here.

Data sets used.
As our primary data set, we used the Illumina-sequenced data generated from a single time point sample from each of three humans, as recently described in [34].This data set is distinguished by its size (30 million total unique sequences, split between three humans and naïve versus memory compartments) and the experimental design, which used replicate wells to enable template count estimation and decrease error [35].This data will be available at http://adaptivebiotech.com/link/publicBCellResource upon publication of [35]; we will call it the "Adaptive" data set.
We also show results on the data from [5], which used Illumina paired-end sequencing to investigate memory B cell recall after vaccination in 15 humans.This data set used unique barcodes to decrease error, a technique which was validated using a reproducibility experiment.This data can be publicly accessed in Gen-Bank (http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000656.v1.p1); we will call this the "Vollmers" data set.
ham.We considered writing specialized code to do HMM inference on B cell receptor sequences as previous authors have done [16][17][18].However, for maximum flexibility and usability, we were inspired by other work [23,24] to build an HMM "compiler" that can do HMM inference given a description of the HMM in a format that is simpler to interpret and modify than specialized inferential machinery.
We were greatly inspired by the HMMoC [23] and StochHMM [24] tools, but neither of them fully satisfied our needs.We found the XML input format in HMMoC too complicated to script, and the paradigm of auto-generated C++ code to be extremely difficult to test and debug.StochHMM, meanwhile, had a custom configuration file format, lacked a pair-HMM implementation, and possessed an extraordinarily extensive but not entirely functional feature set.However, StochHMM's overall structure, and its basic idea of reading the HMM specification from a text file rather than generating code, were similar to what we desired, so we used it as a starting point for a complete rewrite.(As an aside, we note that the excellent HMMER tool [36] only implements profile HMMs, and thus is not appropriate for our needs.) We were thus led to build a new tool, which we call ham.ham uses the intuitive YAML text format, which is easy to read and write by hand for simple cases, while being equally simple to use for more complex models via existing programming libraries (such as for python and C++).This makes it trivial to test many different HMM topologies, since such modifications only require editing a few lines in the text file.
In a simple comparison on the canonical "occasionally dishonest casino" example [14] we find it slightly faster and more memory-efficient than HMMoC (Table 1).
To generate these results we ran the Viterbi algorithm using both programs on sequences of length one (resp.ten) million, and averaged CPU time and memory use over 300 (resp.30) runs.We note in passing that HMMoC's XML for this example consists of 5961 characters, while ham's YAML only requires 440.
ham can simultaneously emit onto an arbitrary number (k) of outputs, such as for a pair-HMM (k = 2).As we describe in Section , this ability is useful for BCR ancestral inference.
We have also included an optimization we call "chunk-caching" which greatly speeds repetitive dynamic programing (DP) table calculations.Any time the lefthand side of a sequence is the same as that of another sequence, the corresponding left hand side of their DP tables will be identical.We have thus included the ability A few states in the internal region of an HMM.On the left is a state representing a position in a gene with a germline G, which usually emits a G, but sometimes emits other bases (i.e.mutates).If this state is near enough to the start or end of the gene, there will likely be transitions from the initial state, or to the end state (upper arrows).On the other hand, if it is in the middle of a gene the path is more likely to simply traverse the states in order (straight horizontal arrows).
to cache and recall previous DP tables in order to avoid unnecessary recalculation in such cases.
HMM architecture.We follow previous work [16][17][18] by representing each germline base in each V, D, and J allele as an HMM state (Figure 9).Insertions are represented as a separate class of states, and 5' (3') deletions as transitions to or from germline states which are not at the start (end) of the gene.All of these states can be combined to create a single HMM for the entire VDJ rearrangement process.
While it can be useful to think of individually calculating the probability of each such path, in practice one traverses the dynamic programming (DP) table using the recursion relations found in [14].
Factorization.All inferential operations on the HMM described in the previous section* can be made faster by a process of factorization, which performs inference on a collection of smaller HMMs but returns exactly the same results as a monolithic HMM.Each of these smaller HMMs has the topology obtained by extracting it from the monolithic HMM, resulting in the topologies shown in Figure 10.In order to motivate the procedure, we observe that 1) once a path hits a state corresponding to a given allele, it cannot transition to any state corresponding to a different allele for that segment (for example, a path in allele D 1 cannot transition to D 2 ); and 2) the only external information needed to calculate the DP table on a single allele's HMM for a given query sequence is where in the query sequence to start and where to end.
In order to describe the factorization process in more detail, we will use the notation of [14] referring to a single HMM: e s (b) represents the probability of emitting b in state s, and a ss is the transition probability from state s to s .We also use p i (x, π) as an abbreviation for e πi (x i )a πi−1πi : the contribution of position x to the forward probability.
One can write the total probability of a sequence x under a given HMM, i.e. the forward probability, as a sum over paths π (1) (For the Viterbi algorithm we would replace the summation with argmax.)As written, this corresponds to a single, monolithic HMM, and we are implicitly summing over all paths through all alleles in the V, D, and J segments.In order to factorize this monolithic probability, we first write the probability of each path in more detail as a product over each position i in the sequence of length L: Here π i is the state at position i in path π, and x i is the symbol at position i in the query sequence.Thus e πi (x i ) is the probability of emitting x i from state π i , and a πi−1πi is the transition probability from the state at i − 1 to the state at i.
We now subdivide this product for the path π into three factors for the V, D, and J segments; this divides the sequence correspondingly into three sections of length l V , l D , and L − l V − l D .We will use P R (x, π, l V , l D ) to denote the forward probability of the path π through a single segment R ∈ [V, D, J].
We then use this factorization to rearrange the order of operations in the sum over paths (1): where q g is the probability of choosing each allele g.Here we have isolated the computationally expensive sum over all states π∈g as far to the right as possible, so that we sum only over the paths within each allele, and combine all alleles afterward.Factorization thus reduces a single HMM with a state for each germline position in each of a few hundred alleles, to a number of HMMs, each with states only for a single allele.Since at each step in the forward (and Viterbi) recursion relations we sum over all possible previous states for each current state, computing time scales roughly as the square of the number of states in each HMM.By reducing the maximum number of states per HMM by several orders of magnitude, factorization thus substantially decreases memory and computation time.
Insertions change this picture very little: we simply add them to the left side of each D and J HMM (they could as easily go on the right side of V and D).
The resulting HMM topologies are shown in Figure 10.In this version, we show a single insert state, with emission probabilities corresponding to the empirical insertion nucleotide distribution from data (compare 11).Multi-HMMs.As described above, ham is able to do inference under a model which simultaneously emits an arbitrary number of symbols k.When k = 2 this is typically called a pair HMM [14], and we call the generalized form a multi-HMM (k >= 2).In our setting, the k sequences resulting from such a multi-HMM model are the various sequences deriving from a single rearrangement event (which differ only according to point substitution from somatic hypermutation).
We can use this ability to perform improved inference on a collection of sequences deriving from the same rearrangement event, allowing us to integrate out much of the uncertainty due to somatic hypermutation in each single sequence.
In other words, not all mutations are shared between clonally related sequences, and this provides valuable information that is only available if the HMM can emit several sequences simultaneously.
Although this approach allows us to share inferential power among various sequences for the germline regions, with a standard VDJ HMM topology (as in Our insertion state topology, with four states instead of one, which gives improved discrimination for pairs or tuples of sequences because it does not ignore mutation information within insertions.Self-transitions for insert states ommitted for clarity. Figure 10) non-templated insertions do not contribute to the likelihood in a meaningful way.In order to allow contributions from the non-templated insertions, we replace the single insert state with four states, corresponding to naïve-sequence insertions of A, C, G, and T (Figure 11).The emissions of these four states are then treated as for actual germline states (Figure 9): the A state, for example, has a large probability of emitting an A, and a complementary probability (equal to the observed mutation probability) of emitting one of the other three bases.For example, if we pass in five sequences that share identical insertions except that one sequence has a single mutated base, the HMM uses the information from the four non-mutated insertions to conclude that the difference in the fifth sequence is likely to be a mutation from the un-mutated ancestral sequence.
Parameter estimation.The parameters of our HMM consist of allele usage probabilities, transition probabilities, and emission probabilities; as described above we estimate categorical distributions for each of these.The inferred distributions of these parameters come from either a previous generation of HMM estimates, or from Smith-Waterman (as described below).
In order to convert from mutation frequencies f to emission probabilities, we set the probability of emission for each non-germline base to f /3, and correspondingly set the probability of germline emission to 1 − f .In the HMM implementation, it is trivial to account for different probabilities of mutating to different bases (instead of simply using f /3).However, we do not know of phylogenetic sequence simulation software (we use bppseqgen from Bio++ [37]) that implements both this and per-position mutation rates, and thus lacking reliable means of validation for this feature we leave it out.
Transition probabilities are set according to empirical frequencies of inferred deletions and insertions.For example, again sub-setting by allele, the frequency with which we observe a 3' deletion of length one gives us the transition probability from the second-to-last state in this allele to the end state (thereby bypassing the last state in the allele).This logic is repeated for all positions in each allele, and also for the 5' end (where instead of a transition to the end state, it is a transition from the initial state).Note that we include 5' V deletions and 3' D deletions as a convenience (dashed lines in Figure 10) to account for varying read lengths.The insertion state self-transition probabilities are distributed geometrically with mean set to the observed mean insertion length in data.While the choice of a geometric distribution is simply a result of HMMs' inherent lack of memory, in practice we observe that insertion lengths are not far from geometrically distributed (Figure 4).
Finally, the allele choice probabilities denoted q g above are simply set to the observed frequency of each allele.
For all parameters, we perform an initial estimation step using the Smith-Waterman based methods of [38].These provisional parameters are used to build a set of HMMs which we then run on the same data in order to infer a more accurate set of parameters.We can feed these parameters back into the HMM and continue the process recursively (this is Viterbi training in the language of [14]), but in practice we find no significant improvement after the first iteration.In future versions we may implement full Baum-Welch parameter estimation [14].However, given the level of agreement which we observe for Viterbi training, we do not expect large improvement from moving to full Baum-Welch.
In order to correctly estimate parameter uncertainties we would like to know the sample size: the number of independent rearrangement events.This is not the same as the number of unique reads, because several reads can stem from a single rearrangement through somatic hypermutation and clonal expansion.In principle the sample size for each measurement should be be the number of observed rearrangement events.However, we can only observe sequences rather than rearrangement events, and thus do not know precisely how many rearrangement events we have observed.Previous work has shown that a significant majority of B cell rearrangements are represented by only one sequence in healthy humans, even in a deep sample from the memory compartment [39].In order to be conservative we posit that B-cells are members of clones with two members, and thus multiply all uncertainty measurements (e.g. in Figure 2) by a factor of √ 2 corresponding to this assumed two-fold reduction in sample size.
Parameter merging in small samples via tiered aggregation.As mentioned above, a more highly parameterized model is both allowed by and requires large data sets.
In order to facilitate the use of partis on data sets of varying size, and also to allow accurate inference of rare clones, we have thus implemented a scheme that automatically reduces the detail of the model on smaller samples.The basic idea of this scheme, which we call tiered aggregation, is to merge together the information from increasing numbers of germline alleles, in several tiers of similarity, as the sample size decreases.
As shown above, we observe significant parameter variation between alleles, and we thus initially subset all emission and transition probabilities by allele (i.e. use different probabilities for each allele of each gene in V, D, and J).For large data sets, and for BCR repertoires with relatively homogeneous allele frequency distributions, this approach is sufficient.If we observe a particular allele only a handful of times, however, we cannot confidently estimate parameters for that allele.In such cases we thus include information from similar alleles in the parameter estimation.
Specifically, if we observe an allele fewer than N times, we take a simple average over all the observations of its allelic variants when calculating its parameters, rather than just considering that allele.If we also fail to observe N occurrences of all these allelic variants together, we go on to include all alleles with the same IMGT primary version (e.g.all IGHV1 alleles).Finally, if with this criterion we still do not reach N observations, we include all alleles from that segment (e.g.all V alleles).Here N is a tunable parameter, and in practice gives good results in simulation at about 20, which is the value used for the results in this paper.
This procedure thus has the effect of automatically compensating for decreased parameter accuracy in smaller samples by switching to a less parameter-rich model.
We find this strategy to be effective in simulation (Figure 8).As expected, performance decreases with decreasing sample size, although the degradation is minimal for samples larger than a few hundred sequences.While partis still performs well on fewer than a few hundred sequences, there is little benefit to using it, as one is not taking advantage of its detailed models.Instead, one could use ighutil [38] on small data sets, as in such cases the two will perform comparably in terms of accuracy.
Validation.As described above, we find it necessary to validate BCR annotation methods on simulated sequences; we therefore wrote a simulator of rearrangement and somatic hypermutation.This simulator takes as input the fully-joint categorical distribution over all "rearrangement values" which are necessary to specify the rearrangement event (e.g.3' V deletion length, which V allele, etc.).It then selects a point in this space according to the given distribution, and generates a naive sequence according to the coordinates of this point.Non-templated insertions are generated with corresponding length, and insertion bases are chosen randomly according to the empirical nucleotide distribution.This simulation process implicitly models all correlations between "rearrangement values," even correlations that are ignored in the HMMs.
This gives us a single sequence, corresponding to the rearrangement event's unmutated ancestor.We then use TreeSim [40] to generate a lineage tree descending from this ancestral sequence, with a speciation rate of 1.0 and extinction rate of 0.5, conditioned on a desired number of leaves.The number of leaves is easily configurable in our simulation package either as a constant or as a random variable drawn from some distribution.For simplicity, all plots in this paper use simulation with a constant five leaves per tree; results were consistent when simulating with widely varying constant, as well as non-constant, leaves-per-tree (results not shown).For each rearrangement event we choose a tree at random, which is then rescaled according to the branch lengths observed in data in each segment (V, D, or J).This tree, together with the un-mutated ancestral sequence and the relative positional frequencies described above, are then passed to the bppseqgen mutation simulator in the Bio++ [37] package.In this step we use the empirical nucleotide model for each corresponding segment; insertions are also mutated, using parameters averaged over the three segments.This produces a final set of sequences corresponding to the descendants of a single rearrangement event.
To check that this simulator generates sequences that are close to real data, we compared output parameter distributions for this simulator to the inferred parameters from data (which are the simulator's input).These distributions are similar (Figures S5, S6, and S7) showing that the simulator accurately recapitulates the processes generating the true data.Given this accuracy, the compatibility between true and inferred distributions in Figure 7 shows that our original parameters inferred from data are indeed close to the true parameters in data.
The validation is run within the partis directory in the main repository using the command scons validate.
We emphasize that this simulator's only inputs are the inferred empirical parameter distributions, and that it is thus entirely independent of partis' HMM machinery.The programs, including partis, are only presented with the simulated sequences with no further information about the simulation.
FIGURE 2. Observed deletion frequencies for two V, four D, and two J alleles on the Adaptive data set.These alleles were chosen to be representative of the various shapes taken by the empirical distributions.In the complete set of plots (which are publicly available as described in the text), per-allele distributions are frequently multi-modal and appear similar between humans.

FIGURE 3
FIGURE 3. Typical observed mutation frequencies for two V, two D, and two J alleles on the Adaptive data set.Mutation frequencies are highly position-dependent.While the structure of these mutation distributions appears similar between humans, the overall level of mutation varies.The first base of the conserved cysteine and tryptophan codons (i.e. the CDR3 boundaries) are indicated with black vertical dashed lines.In the complete set of plots (which are publicly available as described in the text), mutation frequencies are highly variable across sites with a pattern that is similar between humans.

FIGURE 4 .
FIGURE 4. Typical observed insertion lengths at the VD and DJ boundaries for two D and two J alleles.In the complete set of plots (which are publicly available as described in the text), distributions have a similar shape and the per-allele plots appear similar between humans.

FIGURE 5 .
FIGURE 5.The across-subset mean and variance of inferred parameter values for each human in the Adaptive data set across 10 disjoint subsets of the data.

FIGURE 6 .
FIGURE 6.  Hamming distance between the inferred and true naive sequences for all available BCR annotation methods on a simulated sample of 10,000 sequences.partis is shown both for single sequences (k = 1), and for a multi-HMM (k = 5), which performs simultaneous inference on five clonally related sequences.

FIGURE 7 .
FIGURE 7. Difference between inferred and true values for deletion lengths, insertion lengths, and mutation frequency for all available BCR annotation methods on a simulated sample of 10,000 sequences.partis is shown both for single sequences (k = 1) and for a multi-HMM (k = 5), which performs simultaneous inference on five clonally related sequences.

FIGURE 8 .
FIGURE 8. partis performance as measured by Hamming distance between inferred and true naive sequences (HTTN) for data set sizes from 50 to 10,000.

(k = 5 )
" because, as an example, we do inference on five sequences known to form a rearrangement group.

FIGURE 10 .
FIGURE 10.The overall topology of the HMMs in the V, D, and J segments.Inserts are shown as a single state for clarity, but are replaced by four states in the actual HMM (Figure11).
FIGURE S1.Typical observed deletion frequencies for two V, four D, and two J alleles on the Vollmers data set.

FIGURE S2 .
FIGURE S2.Typical observed mutation frequencies for two V, two D, and two J alleles on the Vollmers data set.The first base of the conserved cysteine and tryptophan codons (i.e. the CDR3 boundaries) are indicated with black vertical dashed lines.The large uncertainties at the 5' end of V and 3' end of J reflect that our reads very rarely extend into these regions.