
Somatic hypermutation analysis for improved identification of B cell clonal families from next-generation sequencing data

  • Nima Nouri,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Department of Pathology, Yale School of Medicine, New Haven, Connecticut, United States of America, Center for Medical Informatics, Yale School of Medicine, New Haven, Connecticut, United States of America

  • Steven H. Kleinstein

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing

    Affiliations Department of Pathology, Yale School of Medicine, New Haven, Connecticut, United States of America, Center for Medical Informatics, Yale School of Medicine, New Haven, Connecticut, United States of America, Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America

This is an uncorrected proof.


Adaptive immune receptor repertoire sequencing (AIRR-seq) offers the possibility of identifying and tracking B cell clonal expansions during adaptive immune responses. Members of a B cell clone are descended from a common ancestor and share the same initial V(D)J rearrangement, but their B cell receptor (BCR) sequences may differ due to the accumulation of somatic hypermutations (SHMs). Clonal relationships are learned from AIRR-seq data by analyzing the BCR sequence, with the most common methods focused on the highly diverse junction region. However, clonally related cells often share SHMs which have been accumulated during affinity maturation. Here, we investigate whether shared SHMs in the V and J segments of the BCR can be leveraged along with the junction sequence to improve the ability to identify clonally related sequences. We develop independent distance functions that capture junction similarity and shared mutations, and combine these in a spectral clustering framework to infer the BCR clonal relationships. Using both simulated and experimental data, we show that this model improves both the sensitivity and specificity for identifying B cell clones. Source code for this method is freely available in the SCOPer (Spectral Clustering for clOne Partitioning) R package (version 0.2 or newer) in the Immcantation framework under the CC BY-SA 4.0 license.

Author summary

B cells recognize antigens through their BCR. During adaptive immune responses, antigen-specific B cells undergo intense proliferation. This B cell clonal expansion is coupled with a process of SHM, which results in the accumulation of mutations in the DNA encoding the BCR. Within the specialized micro-environment of the germinal center, these diversified B cells compete for antigen binding and presentation to follicular helper T cells. Successful binding leads to repeated cycles of proliferation, SHM and affinity-dependent selection ultimately resulting in the generation of high-affinity memory and antibody-secreting plasma cells. Driven by dramatic improvements in high-throughput sequencing technologies, large-scale characterization of BCR repertoires is now feasible. However, a critical barrier to quantitative analysis of these large-scale BCR repertoire data is the accurate identification of B cell clones. B cells are inferred to be clonally related if the distance between their BCR sequences is close enough. This paper develops a hybrid distance function that integrates information from the V(D)J recombination process (distance between CDR3 sequences), along with information from a common history of clonal expansion (shared SHMs in the V and J segments of the BCR) to improve the ability to identify clonally related sequences.


B cells recognize pathogens through their BCR. The ability to recognize and initiate a response to a wide variety of pathogens depends upon a large population of B cell lymphocytes each of which expresses a particular receptor for antigen. The diversity of the BCRs (also referred to as Immunoglobulin (Ig) receptors) is a result of genetic recombination and diversification mechanisms. BCRs are comprised of two identical heavy (IGH) and light (IGL) chain proteins. For IGH-chains, diversity is initially created in the germline via recombination of variable IGHV, diversity IGHD, and joining IGHJ genes (termed the V(D)J recombination process [1]). Diversity in IGH is further increased by addition of P- and N-nucleotides at the IGHV/IGHD and IGHD/IGHJ boundaries [2–4]. For IGL-chains, the IGLV gene is rearranged directly to the IGLJ gene. The region where IGHV, IGHD and IGHJ come together in IGH (or IGLV and IGLJ for IGL) is termed the CDR3 (the junction region is defined as the CDR3 plus the prefix and suffix conserved flanking amino acid residues), and this high diversity region is often involved in antigen-binding [5].

During T-dependent responses, antigen-activated B cells undergo clonal expansion and acquire additional diversity through SHM, an enzymatically-driven process introducing point substitutions into the BCR locus at a rate of ∼1/1000 bp/cell division [6]. B cells that acquire mutations that improve their ability to bind the pathogen are preferentially expanded leading to affinity maturation of the B cell population over time. Therefore, SHMs have important consequences for the kinetics, quality, and magnitude of B cell clones as the fundamental building blocks of immune repertoires [7].

Accurate identification of clonal relationships is important, as these clonal families form the basis for a wide range of repertoire analyses, including diversity analysis [8–10], lineage reconstruction and detection of antigen-specific sequences [11–13] and effector functionality [6, 14]. One way to monitor and track B cell clonal lineages is to perform large-scale sampling of B cell populations, amplifying and sequencing the expressed antibody gene rearrangements by next-generation sequencing (NGS) [15–18]. Recent studies by NGS have greatly expanded our understanding of B cell clonal lineage development in high-throughput Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) data [19–21]. However, clonal relationships are not directly measured and must be computationally inferred. To this end, several computational methods have been proposed to identify B cell clones from high-throughput AIRR-seq data [22–26].

Antibody diversity is largely dominated by the IGH-chain [5]. The IGH-chain owes this diversity to the: (1) use of an IGHD gene, which IGL-chains lack, (2) addition of short palindromic (P) nucleotides at the IGHV-IGHD and IGHD-IGHJ joints [3], (3) insertion of non-templated (N) nucleotides at the IGHV-IGHD and IGHD-IGHJ joints by terminal deoxynucleotidyl transferase (TdT) [2], and (4) higher rates of SHM than IGL-chains [27]. The IGH-chain junction region commonly serves as an identifier for clonal inference methodologies. For instance, sequences whose junctions are identical or have a high degree of homology (measured by string distance at the nucleotide level) are often classified as belonging to the same clone [28]. However, to avoid grouping together highly homologous yet distinct sequences, some studies also require sequences to have the same IGHV- and IGHJ-gene annotations to be considered clonally-related [29]. Many methods also assume that members of a clone share the same junction length, because SHMs introduced into the BCR sequence are predominantly point substitutions. Probabilistic models have also been developed to calculate the likelihood of sharing a common B cell ancestor and subsequently infer clonal grouping [23, 24]. However, these methodologies have computational costs that become substantial for large sequencing datasets. Overall, in practice, a common approach is to infer clones among sequences with high junction region similarity, as well as identical junction length and IGHV- and IGHJ-gene usage (referred to as the recombination-based model) [28].

While recombination-based strategies are common among current studies, clonal relationship inference solely based on the similarity of the junction region does not leverage the potential information in the V and J segments. It has been suggested that incorporating shared SHMs in these regions could improve recombination-based clonal inference [30]. Members of an expanded B cell clone often share specific somatic mutations and, sometimes, combinations of mutations across the BCR. Mutations may be shared among two or more members of a clone as a simple result of being passed down during cell division, or may be positively selected as part of the affinity maturation process [31–35]. This hierarchy of shared mutations can be considered as the “glue” binding all the members of a B cell clone together and shaping its lineage tree (Fig 1). This additional IGH-chain information could be leveraged to refine clonal relationships.

Fig 1. A B cell lineage tree showing the relationships between clonally-related cells.

The germline sequence (diamond) is shown at the root of lineage, and is connected by a single branch to the most recent common ancestor (MRCA) (square). This branch consists of mutations that are shared across all members of a clone. Several sub-branches descend from the MRCA to inferred sequences (triangles) carrying mutations that are shared by a subset of clone members. Finally, the inferred sequences are connected to observed sequences (circles) through mutations that are unique to each given observed sequence. Shared and unique mutations are marked at each branch by horizontal lines and arrowhead-lines, respectively.

In this study, we investigated whether shared SHM patterns in the V and J segments of the BCR can be leveraged along with the junction sequence to improve the ability to identify clonally related sequences. This model is implemented in the new version of SCOPer. The first version of SCOPer, a spectral clustering-based method for identifying clones from high-throughput B cell repertoire sequencing data, was presented in [26]. In the following sections, we discuss the main steps of the methodology and explain our implementation of the recent improvements upon the original framework. We further examine the performance of SCOPer using simulated and experimental datasets.

Materials and methods

The clonal inference procedure by SCOPer is composed of four main steps (Fig 2). First (1), the IGHV and IGHJ genes of the BCR sequences are identified. This can be done using various publicly available tools such as IMGT/HighV-QUEST [36] or IgBLAST [37]. Then (2), sequences are partitioned into groups (termed “VJ()-groups”) that share the same IGHV- and IGHJ-gene (gene-level grouping) and junction length (length-level grouping). The gene-level grouping is based on the assumption that the identity of the germline gene (the clone members' unmutated common ancestor) cannot change through affinity maturation. The length-level grouping is based on the assumption that sequences evolve only through point mutation (no indels). Next (3), within each given VJ()-group the defining metric that indicates common clonality among BCR sequence pairs is determined by combining the similarity among junction region sequences (subsection: recombination-based distance matrix calculation) and the V and J segment mutation profile (subsection: SHM-based distance matrix calculation) into an integrated distance function (subsection: graph composition and local scaling). Finally (4), BCR clones are identified using a spectral clustering-based approach built upon this distance function (subsection: spectral decomposition and clustering).

Fig 2. Overview of the SCOPer workflow.

First, AIRR-seq data are partitioned into VJ()-groups which contain sequences with the same IGHV gene annotation, IGHJ gene annotation, and junction length. Next, each VJ()-group is subject to a recombination-based and a SHM-based distance calculation. Finally, the outputs of these calculations are combined into an integrated distance function that is used as the basis for inferring the BCR clonal relationships using a spectral clustering-based approach.

Recombination-based distance matrix calculation

The recombination-based component of SCOPer is focused on the junction region of the sequencing reads. At this step, we generate a symmetric, non-negative pair-wise distance matrix Xij defined by the Hamming distance between the junction regions corresponding to the ith and jth sequences from a given VJ()-group. This is called the “junction-targeted” recombination-based distance matrix. The Hamming distance is defined as the number of positions at which the corresponding nucleotides are different. The recombination-based distance matrix can also be generated from the CDR3 region by excluding the three-nucleotide prefix and suffix from both ends of the junction (i.e., converting the junction segment to the CDR3 region). Henceforth, this is called a “CDR3-targeted” recombination-based distance matrix.
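The two variants of the recombination-based distance matrix can be illustrated with a minimal Python sketch. The function names (`hamming`, `to_cdr3`, `distance_matrix`) and the toy junctions are hypothetical and are not taken from SCOPer, which is an R package; the sketch only assumes equal-length nucleotide strings from one VJ()-group.

```python
import numpy as np

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def to_cdr3(junction: str) -> str:
    """Drop the three-nucleotide prefix and suffix of a junction."""
    return junction[3:-3]

def distance_matrix(seqs):
    """Symmetric pair-wise Hamming distance matrix X."""
    m = len(seqs)
    X = np.zeros((m, m), dtype=int)
    for i in range(m):
        for j in range(i + 1, m):
            X[i, j] = X[j, i] = hamming(seqs[i], seqs[j])
    return X

# Toy junctions (hypothetical); the third differs from the first
# only inside the conserved three-nucleotide prefix.
junctions = ["TGTGCAAGATGG", "TGTGCAAGTTGG", "TGCGCAAGATGG"]
X_junc = distance_matrix(junctions)                       # junction-targeted
X_cdr3 = distance_matrix([to_cdr3(s) for s in junctions])  # CDR3-targeted
```

In this toy example the first and third sequences are one mismatch apart in the junction-targeted matrix but identical in the CDR3-targeted matrix, because their only difference lies in the flanking prefix.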

SHM-based distance matrix calculation

The SHM-based component of SCOPer is focused on the V and J segments. We develop a model in which the occurrence of a mutation at the same nucleotide position in a pair of sequences (referred to as a “pair-wise shared mutation”) is used, together with the recombination-based component, to define a metric that indicates common clonality among BCR sequence pairs. We generate an SHM-based distance matrix so that a pair of sequences with a higher shared mutation rate is more likely to belong to the same clone, whereas a pair of sequences with a lower shared mutation rate is considered more independent.

We begin with identification of the pair-wise mutations. First, for each VJ()-group a single germline representative is generated by building the effective sequence of all germlines (allele-grouping). This facilitates identification of the pair-wise mutations in VJ()-groups whose germlines have different nucleotides at the same position (alleles). The representative germline is deterministic such that if a position contains different nucleotides, the effective sequence will contain an IUPAC (International Union of Pure and Applied Chemistry) character representing all of the nucleotides present. Henceforth, we refer to such a sequence as the “effective germline”. Then, in each VJ()-group, pairs of sequences are compared with the effective germline to identify mutations. We note that, depending upon the type of recombination-based matrix calculated in the previous step (i.e., junction- or CDR3-targeted), the junction or CDR3 region of the sequences and germlines is excluded from this analysis.
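As an illustration of allele-grouping, the following Python sketch collapses a set of aligned germlines into one effective germline using the standard IUPAC nucleotide codes. The function name `effective_germline` is hypothetical, and the treatment of non-ACGT germline characters (ignored, with `N` emitted when no informative base is present) is an assumption rather than a documented SCOPer behavior.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they represent.
IUPAC_CODE = {
    frozenset(bases): code
    for code, bases in {
        "A": "A", "C": "C", "G": "G", "T": "T",
        "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
        "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
    }.items()
}

def effective_germline(germlines):
    """Collapse equal-length germlines into a single IUPAC sequence."""
    if len({len(g) for g in germlines}) != 1:
        raise ValueError("germlines must be aligned to the same length")
    out = []
    for column in zip(*germlines):
        # Assumption: gaps and ambiguous characters carry no information.
        bases = frozenset(c for c in column if c in "ACGT")
        out.append(IUPAC_CODE.get(bases, "N"))
    return "".join(out)

# Two alleles that differ (G vs A) at the third position.
egl = effective_germline(["ACGT", "ACAT", "ACGT"])
```

Because the code at each position depends only on the set of nucleotides present, the result does not depend on the order in which the germlines are supplied.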

We continue with a categorical approach to classify the identified pair-wise mutations (Fig 3). For each pair of ith and jth sequences the mutations at each position are flagged with a binary variable and categorized in three classes: (1) a single mutation which occurs only in one of the sequences, $\alpha_{ij}^{n}$, (2) two unique mutations which occur in both sequences, $\beta_{ij}^{n}$, and (3) a shared mutation which occurs in both sequences, $\gamma_{ij}^{n}$. Here, the parameter n indicates the position of each nucleotide along the sequence string. The binary variables are retrieved to create two matrices. One of the matrices, Tij, accumulates the total number of mutations (Eq 1). A second matrix, Hij, accumulates the shared mutations (Eq 2). Here, Tij is a positive value and always larger than or equal to the positive value Hij. The term νij, the average number of informative positions (∈{A, C, G, T}) in the ith and jth sequences, is a normalizing factor used to prevent bias toward pairs of sequences with fewer non-ACGT positions.
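The three mutation classes can be sketched in Python as follows. This is a hedged illustration only: the names `classify_pair` and `shm_counts` are hypothetical, and the way the flags are combined (each flagged position counted once in the total, shared positions counted once, both divided by the average number of informative positions) is an assumption chosen to match the verbal description, not a transcription of Eqs 1 and 2.

```python
# Bases represented by each IUPAC character in the effective germline.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
BASES = set("ACGT")

def classify_pair(seq_i, seq_j, germline):
    """Count single (alpha), two-unique (beta) and shared (gamma) mutations."""
    alpha = beta = gamma = 0
    info_i = info_j = 0
    for a, b, g in zip(seq_i, seq_j, germline):
        info_i += a in BASES
        info_j += b in BASES
        allowed = IUPAC_BASES.get(g, "ACGT")  # unknown germline char: no call
        mut_a = a in BASES and a not in allowed
        mut_b = b in BASES and b not in allowed
        if mut_a and mut_b:
            if a == b:
                gamma += 1   # same substitution in both sequences
            else:
                beta += 1    # both mutated, different substitutions
        elif mut_a or mut_b:
            alpha += 1       # mutated in only one sequence
    nu = (info_i + info_j) / 2  # average number of informative positions
    return alpha, beta, gamma, nu

def shm_counts(seq_i, seq_j, germline):
    """Assumed normalization: total (T) and shared (H) mutations per informative position."""
    alpha, beta, gamma, nu = classify_pair(seq_i, seq_j, germline)
    return (alpha + beta + gamma) / nu, gamma / nu

counts = classify_pair("ACTTACGA", "ACTTCCGG", "ACGTACGT")
T, H = shm_counts("ACTTACGA", "ACTTCCGG", "ACGTACGT")
```

Under this assumed weighting the total is never smaller than the shared count, which agrees with the stated property that Tij is always larger than or equal to Hij.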

Fig 3. Pairs of sequences (seq) are compared with each other and with the VJ()-group effective germline (EGL) to identify unique and shared somatic hypermutation events.

The effective germline sequence is determined by IUPAC character representation of all the nucleotides present at each position across all germlines in a given VJ()-group (allele-grouping). Each nucleotide position of the ith and jth sequences is compared with the corresponding nucleotide position in the effective germline and somatic hypermutation events are flagged with binary variables: (1) α: a single mutation which occurs only in one of the sequences, (2) β: two unique mutations which occur in both sequences, and (3) γ: a shared mutation which occurs in both sequences. The average of the mutabilities of the germline (GL) 5-mer motifs in which a shared mutation occurred at the central position is also shown, with a superscript n indicating the position at which the mutation occurred. Mutation events are bold and underlined in the sequences.

We note that SHM biases have been reported [38, 39] both in the bases that are targeted [40, 41] as well as the substitutions that are introduced [42, 43]. These biases have been summarized by a hot- and cold-spot targeting model (the “S5F” model, which produces the background likelihood of a particular mutation based on the surrounding sequence context as well as the mutation itself) by [44]. We reasoned that mutations at hot-spot positions could be more likely to be shared by sequences that are not truly clonally related. In order to account for the potential influence of SHM biases, we incorporate a damping matrix Mij (Eq 3). Its entries are built from the average of the mutabilities of the germline micro-sequence motifs (e.g., a 5-mer from the “S5F” model) in which a mutation occurs at the central position n. Each value is subtracted from one to reverse the scaling direction (Fig 3).

We finalize the calculation by combining Eqs 1, 2, and 3 to calculate the SHM-similarity Sij between the ith and jth sequences (Eq 4). This function is built on a continuous Gaussian probability distribution whose parameters σT and σH are the standard deviations of the T and H matrices, capturing the variability of total and shared SHM events in each VJ()-group, respectively. It is important to note that for different VJ()-groups the level of similarity that indicates common clonality may be different. Therefore, using the Gaussian probability distribution, built upon the given VJ()-group, makes the model capable of adapting itself to the local mutation frequency. We further note that the SHM-similarity (∈[0, 1)) becomes non-zero only if the number of pair-wise shared mutations is non-zero (H ≠ 0). Conversely (i.e., H = 0), the SHM-similarity is forced to zero by the third term of Eq 4, even though non-shared mutations exist (T ≠ 0), and consequently the recombination-based part of SCOPer alone determines the clonal relationships. In practice, the behavior of the SHM-similarity function (Eq 4, ignoring the impact of SHM hot-spots, i.e. Mij = 1) comparing two pairs of sequences can be described as follows:

  • if no shared mutations are observed, then the SHM-similarity Sij is zero,
  • if the two pairs have the same total number of mutations, then the pair which accumulates more shared mutations will have higher SHM-similarity, and
  • if the two pairs have the same number of shared mutations, then the pair which accumulates fewer non-shared mutations, will have higher SHM-similarity. (Note that Tij is always larger than or equal to Hij).

Graph composition and local scaling

The spectral clustering at the core of SCOPer works based on a graph construction procedure where the vertices are the observed sequences to be clustered, and the edges between vertices are weighted dependencies among pairs of sequences. The graph construction relies on a quantitative notion of adaptive local neighborhoods in the dataset, which are encoded by a symmetric Kernel function. The Kernel function is used to capture intrinsic data geometries that approximate underlying manifold models from the data. To construct the kernel graph, first, we generate a weighted-distance matrix in the form of,

$$W_{ij} = \begin{cases} X_{ij} & \text{recombination-based model} \\ X_{ij}\,(1 - S_{ij}) & \text{integrated model} \end{cases} \tag{5}$$

The model is named “recombination-based” when only the recombination-based distance matrix X is involved in graph composition. The model is named “integrated” when the recombination-based X and SHM-based S distance matrices are both involved in graph composition. In the integrated model, each SHM-similarity value Sij is subtracted from one to reverse the scaling direction and transform it into a distance metric. Therefore, the pairs of sequences (i.e., the graph vertices) with higher SHM-similarity become closer to each other, and thereby more likely to belong to the same clone. The integrated model can be loosely thought of as Hooke’s Law (W = κX, where κ = 1 − S), which rules the attraction force between a pair of sequences using a “spring” with proportionality factor κ (see Fig 4). In the subsequent step, we generate a fully connected graph Kernel using a Gaussian similarity function in the form of,

$$K_{ij} = \exp\left(-\frac{W_{ij}^{2}}{w_{i}\,w_{j}}\right) \tag{6}$$

Here, parameters wi and wj are the scaling distances corresponding to the ith and jth sequences, respectively, which control the width of local neighborhoods, allowing the level of similarity to vary in different parts of the graph. In this way, the local neighborhoods are determined for each sequence, instead of selecting a universal scaling parameter for all.
The width of each local neighborhood is identified by a single weighted-distance value such that sequences inside the neighborhood are more similar to each other than the outsider sequences. In order to determine the sequence-to-sequence scaling parameters a self-tuning framework [45] (the so-called distance-gap procedure) is incorporated into SCOPer. The distance-gap procedure determines the scale parameter wi corresponding to the ith sequence by seeking a relatively large gap in the set of weighted-distances from ith sequence to the rest of the sequences. The distance-gap pipeline is performed as follows. First, the set of weighted-distances corresponding to the ith row of the matrix W is retrieved. Then, a binned Gaussian kernel density estimate of the weighted-distances is generated using the density function from the stats R package. Next, the set of extrema of the continuous density distribution is flagged by finding the weighted-distances at which the first derivative of the distribution is zero while the second derivative is positive, indicating a local minimum following a local maximum. Recall from univariate Calculus that the first and second derivative for some function f(x) corresponds to the slope of the tangent line and curvature of f at point x, respectively. Finally, the scale parameter wi associated with ith sequence is determined as the closest smaller weighted-distance to the extremum with the lowest density value. If such an extremum is not found, the scale parameter wi is simply determined as the first largest gap of the rank-ordered set of entries corresponding to the ith row of the matrix W.
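The graph composition and distance-gap steps can be sketched in Python under several stated assumptions. The function names are hypothetical; the kernel uses the self-tuning Gaussian form exp(−W²/(wi·wj)); the density estimate is a plain Gaussian kernel density with a Silverman-type bandwidth rather than the binned estimate of R's `density`; and in the fallback the scale is taken as the distance just below the first largest gap.

```python
import numpy as np

def weighted_distance(X, S=None):
    """W = X (recombination-based) or W = X * (1 - S) (integrated)."""
    X = np.asarray(X, dtype=float)
    return X if S is None else X * (1.0 - np.asarray(S, dtype=float))

def gaussian_density(values, grid, bandwidth):
    """Unbinned Gaussian kernel density estimate evaluated on a grid."""
    z = (grid[:, None] - values[None, :]) / bandwidth
    norm = len(values) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm

def local_scale(row, n_grid=512):
    """Distance-gap sketch: scale w_i for one row of W."""
    d = np.sort(np.asarray(row, dtype=float))
    spread = d.std()
    if spread > 0:
        bw = 0.9 * spread * len(d) ** (-0.2)  # assumed bandwidth rule
        grid = np.linspace(d[0], d[-1], n_grid)
        dens = gaussian_density(d, grid, bw)
        # Interior local minima of the density curve.
        minima = [k for k in range(1, n_grid - 1)
                  if dens[k] < dens[k - 1] and dens[k] <= dens[k + 1]]
        if minima:
            k = min(minima, key=lambda t: dens[t])  # lowest-density minimum
            return d[d <= grid[k]][-1]  # closest smaller observed distance
    gaps = np.diff(d)
    # Fallback: distance just below the first largest gap (assumption).
    return d[int(np.argmax(gaps))] if len(gaps) else 0.0

def kernel_matrix(W, eps=1e-6):
    """Locally scaled Gaussian kernel; eps guards against zero scales."""
    w = np.array([max(local_scale(r), eps) for r in W])
    return np.exp(-(W ** 2) / np.outer(w, w))

# Toy VJ()-group: sequences 0-2 and 3-4 form two well-separated sets.
X = np.array([[0, 1, 1, 10, 10],
              [1, 0, 1, 10, 10],
              [1, 1, 0, 10, 10],
              [10, 10, 10, 0, 1],
              [10, 10, 10, 1, 0]], dtype=float)
K = kernel_matrix(weighted_distance(X))
```

In the toy matrix every row has a clear gap between distances 1 and 10, so each local scale is 1 and the within-set kernel values are far larger than the between-set values.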

Fig 4. The integrated model pulls together clonally-related sequences to improve the B cell clonal inference process.

(A) V(D)J recombination generates a set of highly diverse (unmutated) sequences with large distances between independent clones (inter-clonal diversity). (B) Clonal expansion with SHM adds additional diversity, and leads the sequences to spread out around the initial points of creation (intra-clonal diversity). Some sequences from independent clones could end up with CDR3s that start to look similar (dashed-lines), and may lead to false positives in the clonal relationship inference process. (C) The SHM-similarity between pairs of sequences, expressed via shared mutations, acts as a spring that pulls clonally-related sequences toward each other resulting in a more accurate distinction of local neighborhoods. Black circles indicate observed sequences, while white circles indicate germlines (GL1 and GL2).

Local scaling is especially useful when the classification of the B cell repertoire contains multiple scales (e.g., if one clone is tight, while another one is sparse). By means of local scaling, the junction sequence similarities between different clones are lower than the similarities within any single clone. Therefore, edges between sequences in local neighborhoods are connected with relatively high kernels (i.e., Kij → 1), while edges between far away sequences have smaller kernels (i.e., Kij → 0). Allowing the level of sequence similarity to vary in different local neighborhoods (a biologically plausible assumption) is an important advantage of this methodology over other methodologies that partition sequences using a universal (fixed) level of similarity over all the sequences [25].

Spectral decomposition and clustering

Having defined a scheme to set the graph scale parameters automatically, followed by the calculation of the graph Kernel matrix K, the last unknown free parameter in the model is the number of clones k, which is determined by the eigen-decomposition of the Laplacian matrix. First, the Laplacian matrix L = D − K is calculated, where D is the diagonal matrix with its ith diagonal element being the sum of the ith row of K. Then, the Laplacian matrix is eigen-decomposed with eigenvalues {0 = λ1 ≤ λ2 ≤ ⋯ ≤ λm} and corresponding eigenvectors, where m indicates the number of sequences. Then, the number of clones k is determined by finding the largest gap within the eigenvalue spectrum (the so-called “eigen-gap” procedure), beyond which adding another clone does not give much better modeling of the data. Finally, we perform k-means Euclidean distance-based clustering over the k eigenvectors associated with the smallest k eigenvalues to find the members of each clone.
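The eigen-gap and clustering steps can be sketched in Python as follows. The names `spectral_clones` and `kmeans` are hypothetical, and the k-means routine is a small Lloyd-style loop with a deterministic farthest-point initialization (an assumption made to keep the example reproducible, not the initialization used by SCOPer).

```python
import numpy as np

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with farthest-point initialization (assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(d2))])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        updated = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels

def spectral_clones(K):
    """Infer the number of clones k and clone labels from a kernel matrix."""
    D = np.diag(K.sum(axis=1))
    L = D - K                               # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)    # eigenvalues in ascending order
    k = int(np.argmax(np.diff(eigvals))) + 1  # position of the largest eigen-gap
    U = eigvecs[:, :k]                      # eigenvectors of the k smallest eigenvalues
    return k, kmeans(U, k)

# Toy kernel with two disconnected sets of sequences (0-2 and 3-4).
a = np.exp(-1.0)
K = np.array([[1, a, a, 0, 0],
              [a, 1, a, 0, 0],
              [a, a, 1, 0, 0],
              [0, 0, 0, 1, a],
              [0, 0, 0, a, 1]])
k, labels = spectral_clones(K)
```

For this toy kernel the Laplacian has two (numerically) zero eigenvalues followed by a jump, so the eigen-gap selects two clones and the rows of the eigenvector matrix are constant within each set.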

Bulk B cell simulation and library preparation

Each simulated dataset was generated using the AbSim R package (version 0.2.6) in a B cell single-lineage fashion [46]. Each B cell clone simulation begins with a random selection from sets of IGHV, IGHD, and IGHJ germline sequences [47] to produce a unique V(D)J recombination event. Then, clones are made by introducing mutations using a local nucleotide context-dependent model (i.e., S5F model from [44]), along a phylogenetic tree in which branching events occur stochastically. This process was repeated to create a collection of 25 simulated datasets. The size of each repertoire was sampled from a normal distribution (mean equal to 600k and standard deviation equal to 100k) and the clone sizes were sampled from a gamma distribution (shape equal to 0.75, scale equal to 0.75, and amplitude sampled from a normal distribution with mean equal to 1k and standard deviation equal to 0.1k). The remaining parameters were set as default. After simulation, the V and J annotations along with the junction segment of each simulated sequence were identified using IgBLAST version 1.13.0 [37]. Then, the outputs were retrieved and tab-delimited database files were generated using the command line tool MakeDb, from Change-O (version 0.4.5) [48]. Quality checks were also undertaken to remove non-productive sequences. Specifically, each sequence was checked to satisfy a set of constraints that the: (1) whole sequence be annotated as functional, (2) whole sequence contains no stop codons, and (3) junction is in-frame (i.e., the length is a multiple of 3). Sequences which did not meet these criteria were excluded. At this point, sequences that are identical (i.e., copies that were generated coincidentally) are grouped together into “unique sequences”.
The simulated datasets were further processed using the SHazaM (version 0.1.11 or newer) and Alakazam (version 0.2.11 or newer) R packages from the Immcantation framework, resulting in new columns containing VJ()-group identifiers, mutation frequencies, and distance-to-nearest values (i.e., distribution of normalized Hamming distances from each junction sequence to its nearest non-identical neighbor in a given VJ()-group). Finally, the outcome was a single tab-delimited file per simulated dataset containing the metadata information associated with each sequence to be used as input to the clonal inference pipeline.

Table 1 presents an overview of 25 BCR simulated datasets used in this study. Furthermore, the global metrics of the BCR simulated repertoires, including: (1) junction length distribution, (2) distance-to-nearest distribution, (3) clonal relative abundance distribution, (4) clone size distribution, (5) mutation frequency distribution, (6) number of clones per VJ()-group, (7) average pair-wise SHM for clone, and (8) negative-control test, are presented in S1 Fig:1-25:A-H, respectively.

Table 1. Overview of 25 simulated datasets generated by the AbSim R package [46].

Each B cell clone is generated by one set of randomly selected unmutated human IGH-chain germline gene sequences [47] to produce the V(D)J recombination event. Then, the germline undergoes clonal expansion along a phylogenetic tree in which branching events occur stochastically. SHM along this tree is modeled using a local sequence context-dependent model (i.e., “S5F” model from [44]).


Pair-wise shared SHM are enriched in B cell clones

Clonally related cells will share SHMs that were accumulated by common ancestors over the course of clonal expansion. However, cells from distinct clones are also expected to share mutations at some positions, such as SHM hot-spots [49]. Therefore, we sought to evaluate the degree to which pair-wise shared mutations were enriched in B cell clones. For each simulated dataset, the pair-wise shared SHM matrix H was generated for each B cell clone by comparing the IGHV and IGHJ regions of each pair of sequences with the relevant germline sequence. Then, the average of the upper triangular elements was calculated (note that H is a symmetric matrix). We found that pair-wise shared SHMs could be identified in ∼ 95% of non-singleton B cell clones (i.e., clones with more than one member) across all simulated datasets. The non-singleton clones without shared mutations tended to be small (with <5 members), so the chance of observing pair-wise shared mutations is lower (S1 Fig:1-25:C).

We next sought to test whether this high rate of pair-wise SHM sharing was specific to clonally-related sequences. We generated a set of artificial clones (negative controls) by randomly sampling sequences across known clones. Specifically, for each clone from the 100 largest VJ()-groups (covering ∼ 30% of the total reads), we generated a set of 1000 negative controls with the same size as the given clone. We note that since sampling was performed within each VJ()-group, the negative controls were generated from sequences with the same junction length, IGHV, and IGHJ genes as the given clone, thus resulting in a conservative control experiment. Then, for each clone and corresponding set of negative controls, the pair-wise shared SHM matrix H was generated by comparing the IGHV and IGHJ regions of each pair of sequences with the relevant germline sequence. We performed this analysis for all simulated datasets and calculated the average of the upper triangular elements of H. We found that the true clones exhibited significantly (p < 0.001) higher pair-wise shared SHM rates (on average ∼16 ± 6 mutations per clone) compared with the set of negative controls (on average ∼5 ± 1 mutations per clone), with a percentage difference of ∼105% on average across all simulated datasets (S1 Fig:1-25:H). Thus, pair-wise shared SHM are enriched in BCR clones. These results support the idea that the pair-wise shared SHM frequency can be leveraged as a biometric (fingerprint) in the clonal relationship inference process.

Focusing on CDR3 improves performance of the recombination-based model

The original recombination-based model for identification of B cell clones used by SCOPer measures distance using the junction region of the BCR [26]. The junction includes the CDR3 along with the two flanking amino acids (one 5′ that is encoded by IGHV, and one 3′ that is encoded by IGHJ) [50]. As the two flanking positions are highly conserved, we sought to determine whether they were necessary to include in the distance measure. Indeed, we hypothesized that including these positions could even lead to decreased performance, as they are likely to be identical across independent clones and will have increasing influence on the distance for clones with shorter junction lengths. To test this hypothesis, we compare the performance of the recombination-based model using either the junction-targeted (termed as ham-junc) or CDR3-targeted (termed as ham-cdr3) approaches. Using simulated data, performance was quantified using the measures of sensitivity, specificity, and precision [25]:

  • True positive (TP) is defined by the number of clonally-related sequence pairs that are correctly identified.
  • False positive (FP) is defined by the number of unrelated sequence pairs that are incorrectly identified as clonally-related.
  • True negative (TN) is defined by the number of unrelated sequence pairs that are correctly identified as unrelated.
  • False negative (FN) is defined by the number of clonally-related sequence pairs that are incorrectly identified as unrelated.

The sensitivity (true positive rate) of each model is defined as the fraction of all sequence pairs from the same clone that were correctly inferred by the model (TP/(TP+FN)), while specificity (true negative rate) is defined as the fraction of pairs of unrelated sequences that were successfully inferred by the model to be in different clones (TN/(TN+FP)). Finally, the precision (positive predictive value) of each model is defined as how often inferred clonally related sequence pairs are truly clonally related (TP/(TP+FP)).
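The pair-based definitions above can be made concrete with a short Python sketch that counts TP, FP, TN, and FN over all sequence pairs, given true and inferred clone labels. The function name and the toy labels are hypothetical; this illustrates the definitions and is not the evaluation code used in the study.

```python
from itertools import combinations


def pairwise_metrics(true_labels, inferred_labels):
    """Count pair-level outcomes and derive sensitivity, specificity, precision."""
    tp = fp = tn = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_inferred = inferred_labels[i] == inferred_labels[j]
        if same_true and same_inferred:
            tp += 1  # related pair, correctly grouped
        elif same_inferred:
            fp += 1  # unrelated pair, incorrectly grouped
        elif same_true:
            fn += 1  # related pair, incorrectly separated
        else:
            tn += 1  # unrelated pair, correctly separated

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


# Toy example (hypothetical): five sequences, two true clones, one misassignment.
metrics = pairwise_metrics(["a", "a", "a", "b", "b"], [1, 1, 2, 2, 2])
```

Because the counts are taken over pairs, a single misassigned sequence in a large clone affects many pairs at once, which is why pair-level measures are sensitive to errors in the largest clones.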

We found that both approaches inferred the clonal relationships with high sensitivity, specificity, and precision, with values of >94.0% on average across all simulated datasets. However, each of the measures of accuracy was significantly (p < 0.001) improved when distance was based on the CDR3 region, rather than the junction region (Fig 5). Thus, the conserved positions flanking the CDR3 should not be used to define the distance between sequences.
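The dilution effect of the conserved flanking positions can be seen in a small Python sketch. It assumes a length-normalized Hamming distance on nucleotide sequences and that each flanking amino acid corresponds to one codon (three nucleotides) at either end of the junction; the function names and toy junctions are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def normalized_distance(a, b, trim=0):
    """Length-normalized Hamming distance after removing `trim` nt from each end."""
    if trim:
        a, b = a[trim:len(a) - trim], b[trim:len(b) - trim]
    return hamming(a, b) / len(a)


# Toy junctions (hypothetical): identical conserved flanking codons, differing CDR3s.
junction_a = "TGT" + "GCGAGA" + "TGG"
junction_b = "TGT" + "GCAACA" + "TGG"

d_junction = normalized_distance(junction_a, junction_b)      # whole junction
d_cdr3 = normalized_distance(junction_a, junction_b, trim=3)  # CDR3 only
```

The two mismatches give a distance of 2/12 over the junction but 2/6 over the CDR3, so the same pair of sequences appears closer when the conserved positions are included, and the effect grows as the junction gets shorter.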

Fig 5. Integrating information from CDR3 similarity (recombination-based distance) and shared mutations in the V and J segments (SHM-based distance) improves clonal relationship inference.

The spectral clustering-based framework was applied to identify clonally-related sequences in 25 simulated datasets (diamonds) generated via the AbSim R package [46] (Table 1). Performance was assessed by calculating (A) sensitivity, (B) specificity, and (C) precision by applying the recombination-based model on the junction (ham-junc) and CDR3 (ham-cdr3) regions, as well as the integrated model on the junction (ham-shm-junc) and CDR3 (ham-shm-cdr3) regions. Mean performance is indicated by the solid bars, while the error bars indicate one standard deviation. For the comparisons of interest, the asterisks (***) indicate p < 0.001 by paired t-test.

Shared mutations should be integrated with CDR3 distance to identify clones

We next asked whether incorporating shared SHMs in the V and J segments into the procedure leads to even better performance. We thus characterized the performance of the integrated model using the CDR3-targeted approach (termed ham-shm-cdr3). Including shared SHMs in the integrated model improved sensitivity, specificity, and precision to >96% on average across all simulated datasets. For completeness, we also characterized the performance of the integrated model using the junction-targeted approach (termed ham-shm-junc). Consistent with our analysis of the recombination-based model, we found that using the junction rather than the CDR3 region led to a significant (p < 0.001) decrease in performance (Fig 5).

These results indicate that the best performance within the spectral clustering-based framework is achieved when the integrated model is combined with a CDR3-targeted approach. Overall, when the original SCOPer model (ham-junc) is compared to the new integrated model (ham-shm-cdr3), a ∼2.5% improvement in sensitivity, a ∼1% improvement in specificity, and a <1% improvement in precision were achieved on average across all simulated datasets (p < 0.001) (Fig 5). We further note that the improvement in both sensitivity and specificity was observed in ∼2% of independent clonal lineages on average across all simulated datasets.

To better understand how the integrated model improves the performance of clonal relationship inference, we examined its operation in detail using one of the identified VJ()-groups with 53 unique sequences. As these are simulated data, we know that these sequences comprise three clones: one consisting of 27 sequences, one consisting of 25 sequences, and one consisting of a single sequence (singleton). Comparing the clonal relationships using the CDR3-targeted recombination-based model (ham-cdr3) and the CDR3-targeted integrated model (ham-shm-cdr3), we find that both models inferred three clones. However, the ham-cdr3 model failed to accurately infer the clonal relationships, which resulted in multiple false positives and false negatives (Fig 6A). In contrast, when the SHMs among sequences were expressed using the pair-wise SHMs (on average ∼42 ± 7 mutations were counted per pair, of which ∼10 ± 6 were shared), the clonally-related sequences were pulled toward each other (on average ∼20 ± 9 mutations per pair were counted for the clone of size 27 and ∼14 ± 4 mutations per pair for the clone of size 25), whereas the singleton (with at least 5-fold fewer shared pair-wise mutations than other clone members) remained separated. This improved the performance of the local scaling procedure (Fig 6B). Hence, the ham-shm-cdr3 model resulted in no false relationships in this particular case (Fig 6C).

Fig 6. The integrated model improves clonal inference by pulling clonally-related sequences toward each other.

The spectral clustering-based model was applied to infer the clonal relationships among 53 sequences from a given VJ()-group. These sequences belong to three clones, one consisting of 27 sequences (circles), one consisting of 25 sequences (diamonds), and the last one consisting of only one sequence (triangle). Clonal relationships were inferred (indicated by filled colors) via the CDR3-targeted recombination-based model (ham-cdr3), leading to three clones (Inferred-1, Inferred-2, and Inferred-3) (A), and via the CDR3-targeted integrated model (ham-shm-cdr3), also leading to three clones (Inferred-1, Inferred-2, and Inferred-3) (C). For visualization, the sequences were embedded in 2D space using the qgraph function from the qgraph R package, where the thickness of each edge indicates the inverse of the pair-wise ham-cdr3 (A) and ham-shm-cdr3 (C) distances. Pair-wise distances were normalized by the CDR3 length and compared in log scale (B).

The integrated model performs with high confidence on experimental data

Along with simulated data, we also evaluated the performance of the CDR3-targeted integrated model (ham-shm-cdr3) by estimating specificity using experimental BCR sequencing data from 58 individuals with acute dengue infection (note that two individuals with <1k total reads were excluded) [51]. These samples contained ∼1–9k (4056 [mean] ± 964 [standard deviation]) unique reads each and ∼235k unique reads in total (S1 Table). In experimental data, the true clonal relationships are unknown. However, we do know that, by definition, sequences derived from two different individuals cannot be part of the same clone (i.e., clones cannot span different individuals). Thus, if our method assigns sequences from different individuals to the same clone, then this is a false positive. We used the procedure proposed in [52] to estimate specificity using the ham-shm-cdr3 model. First, one of the individuals (the dataset with the largest number of unique sequences, 8773) was chosen as the “base”. Next, a single sequence was chosen randomly from each of the remaining individuals and added to the sequencing data from the base individual. Specificity was then defined by how often the sequences from non-base individuals were correctly determined to be singletons. Any grouping of these sequences into larger clones must be a false positive (see Fig 7). This procedure was then repeated for 100 cycles. The results indicated that the ham-shm-cdr3 model has a high specificity, with a value of ∼96.0% on average across all cycles. Thus, shared SHMs in the V and J segments of the BCR can be leveraged along with the CDR3 sequence to identify clonally related sequences with high specificity in experimental data.
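The spike-in procedure can be sketched in Python as follows. The clonal assignment step is passed in as a function; here a deliberately trivial stand-in (`toy_assign`, which groups only identical sequences) takes its place, and all names and toy sequences are hypothetical. This is a sketch of the bookkeeping under those assumptions, not the SCOPer implementation.

```python
import random
from collections import Counter


def estimate_specificity(base, others, assign_clones, cycles=100, seed=0):
    """Spike one random sequence per non-base individual into the base repertoire.

    A spiked sequence left as a singleton counts as a true negative; one grouped
    with any other sequence counts as a false positive.
    """
    rng = random.Random(seed)
    tn = fp = 0
    for _ in range(cycles):
        spiked = [rng.choice(seqs) for seqs in others]
        labels = assign_clones(list(base) + spiked)
        clone_sizes = Counter(labels)
        for label in labels[len(base):]:  # labels of the spiked sequences only
            if clone_sizes[label] == 1:
                tn += 1
            else:
                fp += 1
    return tn / (tn + fp)


def toy_assign(seqs):
    """Hypothetical stand-in for clonal inference: identical sequences share a clone."""
    return list(seqs)


# Toy repertoires (hypothetical): the second non-base individual matches the base.
base = ["AAA", "AAC", "CCC"]
others = [["GGG"], ["AAA"], ["TTT"]]
specificity = estimate_specificity(base, others, toy_assign, cycles=10, seed=1)
```

In this toy case two of the three spiked sequences remain singletons in every cycle, giving a specificity of 2/3. Note that two spiked sequences grouped with each other are also counted as false positives, since they come from different individuals.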

Fig 7. Schematic overview of specificity estimation using experimental data.

(A) BCR repertoires from 58 individuals with acute dengue infection are used to evaluate the performance of the clonal inference process. One of the individuals is chosen as the “base” (white plane). (B) A single sequence (gray circles) is chosen randomly from each of the remaining individuals (gray planes) and added to the sequencing data from the base individual (white circles). After clonal assignment, sequences from non-base individuals that are correctly determined to be singletons are counted as true negatives (TN, check mark), while any grouping of these sequences into larger clones is counted as a false positive (FP, cross mark). Specificity is calculated as TN/(TN+FP).

The SCOPer algorithm is efficiently parallelized

Computational efficiency is an important property considering the recent growth in the size of typical BCR repertoires [53, 54]. Using the recombination-based model, we found that clonal partitioning of ∼685 ± 60k (mean ± standard deviation) simulated sequences (the average repertoire size used in this study) took ∼35 ± 7 min, whereas partitioning with the integrated model took ∼263 ± 28 min. This assessment was performed using one core on a Linux computer with a 2.20 GHz Intel processor and 32 GB RAM. Two main factors drive this increased computational cost. First, in our current implementation, clonal inference is performed on the set of unique sequences (i.e., sequences with distinct nucleotide sequences). When using a recombination-based model that considers only the junction or CDR3, the chance of having identical sequences in each VJ()-group is high (on average across all simulated datasets ∼60% of CDR3s are unique per VJ()-group). This decreases the computational cost of the algorithm. In contrast, when using the integrated model, the V and J segments are also relevant, allowing fewer sequences to be combined into identical groups (i.e., leading to more unique sequences). The computational cost increases with this increasing number of sequences n, in particular through the eigen-decomposition algorithm, which scales as O(n^3) (we note that the matrix to be spectrally decomposed is symmetric, which improves the computational cost significantly). Second, the pair-wise SHM analysis brings additional computational complexity. For instance, the computational complexity of generating the pair-wise shared SHM matrix H is O(n^2). This run time is added to that of generating the pair-wise recombination-based matrix X, which has the same computational complexity. However, the SCOPer distributed implementation facilitates the clonal inference process by parallelizing the computation and greatly reducing the running time.
In our current implementation, parallelization is achieved by dynamically distributing the clonal inference process for each VJ()-group of sequences across processing cores. The parallelization is possible on cores from a single workstation or on high-performance computing (HPC) cluster facilities. For instance, using only five cores in parallel decreased the running time to ∼67 ± 7 min, a ∼4-fold improvement, for partitioning ∼685 ± 60k sequences via the integrated model. Our benchmarks across all simulated datasets demonstrate good scalability, resulting in a speedup, defined as the time it takes the integrated algorithm to execute with one processor divided by the time it takes to execute in parallel, that is approximately linear in the number of cores (<10) utilized (Fig 8).
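As a worked check of the speedup definition, the short Python sketch below applies it to the mean run times quoted above (263 min on one core, 67 min on five cores). The parallel efficiency (speedup divided by the number of cores) is an additional, standard quantity introduced here only for illustration; it is not reported in the study.

```python
def speedup(t_serial, t_parallel):
    """Serial run time divided by parallel run time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, cores):
    """Fraction of the ideal (linear) speedup achieved with `cores` cores."""
    return speedup(t_serial, t_parallel) / cores


# Mean run times (minutes) for the integrated model, as quoted in the text.
s = speedup(263, 67)        # about 3.9, i.e. the ~4-fold improvement
e = efficiency(263, 67, 5)  # about 0.79 of the ideal 5-fold speedup
```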

Fig 8. The SCOPer algorithm can be run efficiently on multiple cores.

The speedup, defined as the time it takes the algorithm to execute with one processor divided by the time it takes to execute in parallel, was calculated for the integrated model for different numbers of processing cores. In each case, speedup was calculated as the average across 25 simulated datasets (with error bars showing the standard deviation). Evaluation was carried out on a Linux computer with a 2.20 GHz Intel processor and 32 GB RAM. The linear fit is shown by a dashed line, while the ideal speedup is shown by the dot-dash line.

We further evaluated the computational cost of clonal inference using the dengue experimental datasets. Using both the recombination-based and integrated models, we found that it takes about one minute for a single core to infer clonal relationships among ∼4056 ± 964 experimental sequences. Since this was fast and efficient, we did not evaluate the algorithm's computational cost in parallel.


B cell clonal diversity is introduced through two main mechanisms. The first occurs during maturation in the bone marrow by stochastic joining of germline-encoded V, D, and J heavy chain genes (or V and J light chain genes) combined with the action of exonucleases and terminal deoxynucleotidyl transferase, which add diversity at the recombination boundaries. This diversity acts as a fingerprint that can be used to separate distinct clones based on the distance between their junction (or CDR3) nucleotide sequence (inter-clonal diversity). Subsequently, upon encountering cognate antigen, B cells can enter a germinal center and undergo further diversification through SHM and affinity maturation. The accumulation of SHMs has the effect of spreading out the sequences of B cell clonal variants around their initial points of creation (intra-clonal diversity). A significant challenge in the clonal relationships inference problem is to define meaningful metrics which can leverage inter-clonal diversity to recognize sequences that are part of independent clones (specificity), while also modeling intra-clonal diversity to recognize the variants that are clonally-related (sensitivity).

We developed an unsupervised learning algorithm based on spectral clustering that provides a framework for the inference of B cell clonal relationships. This model combines CDR3 similarity with shared SHM profiles in the V and J segments to capture both inter- and intra-clonal diversification. We showed that the inclusion of pair-wise shared SHM patterns improves the model's ability to identify clonally related sequences. This improvement translates into a substantial number of additional true relationships (about 3k) and the removal of false relationships (about 6k). Overall, the model determines B cell clones by: (1) common IGHV- and IGHJ-gene calls and identical CDR3 length, (2) identical or similar CDR3 nucleotide sequences, and (3) shared somatic hypermutation patterns in the V and J segments. These criteria result in a strict definition that will separate B cells carrying different alleles of the same V or J genes into independent clonal lineages, an important criterion for the clonal relationships inference problem.

In the absence of gold standard experimental data with known clonal relationship between sequences, the validation was performed using B cell simulations which offer a mechanism to generate data where the underlying clonal groups are known. However, using experimental data we also reported a measure of specificity based on the frequency of clones that are predicted to be shared across individuals.

A key step in the clonal inference process involves V and J gene assignment. In practice, gene assignment is performed prior to invoking SCOPer by using current state-of-the-art and publicly available tools such as IMGT/HighV-QUEST [36] or IgBLAST [37], which match BCR sequences against a database of known genes. In cases where genes are highly similar, or even technically indistinguishable, so that the assignment is uncertain, these tools may make multiple assignments to the same BCR. These multiple assignments are taken into account by SCOPer in the initial VJ()-grouping of sequences into partitions that share the same V gene, J gene, and junction length, such that similar or indistinguishable genes are grouped together. That is, the grouping brings together all sequences that share at least one matching V- or J-gene among the multiple potential assignments (i.e., a chain VJ()-grouping instead of an exact VJ()-grouping).
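The chain grouping can be sketched with a union-find structure in Python. The sketch assumes one particular reading of the rule above: two sequences are linked when they have the same junction length and share at least one V gene and at least one J gene among their candidate assignments, and links are transitive. The function name, record layout, and toy gene calls are hypothetical and do not reflect the SCOPer interface.

```python
def chain_groups(records):
    """Group records (v_genes, j_genes, junction_length) by transitive gene overlap.

    Returns one integer group id per record, numbered by first appearance.
    """
    parent = list(range(len(records)))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Naive all-pairs comparison; adequate for a sketch, quadratic in group size.
    for i in range(len(records)):
        v_i, j_i, len_i = records[i]
        for k in range(i + 1, len(records)):
            v_k, j_k, len_k = records[k]
            if len_i == len_k and v_i & v_k and j_i & j_k:
                root_i, root_k = find(i), find(k)
                if root_i != root_k:
                    parent[root_k] = root_i

    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(records))]


# Toy records (hypothetical): the second sequence has an ambiguous V assignment.
records = [
    ({"IGHV1-69"}, {"IGHJ4"}, 48),
    ({"IGHV1-69", "IGHV1-69D"}, {"IGHJ4"}, 48),
    ({"IGHV1-69D"}, {"IGHJ4"}, 48),
    ({"IGHV1-69"}, {"IGHJ4"}, 45),
    ({"IGHV3-23"}, {"IGHJ4"}, 48),
]
groups = chain_groups(records)
```

The first and third records share no V gene directly, but both overlap with the ambiguous second record and therefore end up in the same group; an exact grouping on the full assignment string would instead have split them into three partitions.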

The influence of SHM hot- and cold-spot biases in the clonal inference process has been incorporated using an SHM targeting model. The analysis described here uses the S5F targeting model for SHM that was previously constructed [44]. However, while hot- and cold-spot biases are generally conserved across individuals, these intrinsic biases can be altered by age [13], and may also differ across species [55]. Clonal identification could be improved by using a data-specific targeting model that can be built using toolkits available in the Immcantation framework. The S5F model seeks to avoid the biases introduced by selection, and to capture only the intrinsic biases introduced by the activation-induced cytidine deaminase (AID) binding preferences and error-prone DNA repair in a 5-mer micro-sequence context [44]. Future improvements to the SHM targeting model, such as including effects beyond motif-specificity [56], may also improve clonal relationship inference. However, these must be rigorously tested.

While the model presented here was developed and tested for sequencing data from the H chain only, cutting-edge technologies, including single-cell sequencing, provide paired IGH- and IGL-chain data [57–59]. These paired data can be incorporated into the proposed model by extending the criteria for the initial grouping of sequences (i.e., VJ()-groups) to include the same IGHV-gene, IGHJ-gene, IGH-CDR3 length, IGLV-gene, IGLJ-gene, and IGL-CDR3 length. BCR clonal inference can then be carried out as before on the H chain of these more refined groups. An alternative approach is to perform clustering on the H chain and then simply split clones that have multiple L chains. The low diversity of the IGL-chain junction region makes it unlikely that including this region in the clustering will provide a significant performance improvement [30, 60].

The definition of clone used in this work is based on the assumption that SHM introduces only point substitutions into the BCR sequence. However, it has been shown that insertions and deletions (indels) can also be introduced at a low frequency (e.g., <2–3% per mutation event [42]) [39, 61–66]. Distance functions that allow for sequences of different lengths could be used to identify clonally related sequences that differ by indels (leading, for example, to sequences with different CDR3 lengths). However, these must be rigorously tested.

The models described in this study have been implemented in the SCOPer (Spectral Clustering for clOne Partitioning) R package, which provides a computational framework to explore multiple approaches to infer clonal relationships in AIRR-seq data. This implementation of SCOPer is freely available as part of the Immcantation framework under the CC BY-SA 4.0 license. The input and output formats of SCOPer conform to the Change-O [48] and AIRR [21] file standards, and thus the method can be used seamlessly as part of the Immcantation tool suite, including methods for B cell clonal lineage reconstruction, lineage topology analysis, clonal diversity analysis, and other advanced repertoire analyses linked to the clonal landscape.

Supporting information

S1 Fig. BCR simulated repertoires overview.

The global metrics of the BCR simulated repertoires, including: (1) junction length distribution, (2) distance-to-nearest distribution, (3) clonal relative abundance distribution, (4) clone size distribution, (5) mutation frequency distribution, (6) number of clones per VJ-group, (7) average pair-wise SHM for clone, and (8) negative-control test (comparing pair-wise SHM sharing rate among real clones and a set of artificial clones generated by randomly sampling sequences across known clones).


S1 Table. BCR experimental repertoires overview.

The empirical statistics of BCR experimental repertoires, including: (1) number of total sequences, (2) number of unique sequences, (3) number of inferred clonal lineages, (4) size of largest inferred clonal lineage, (5) number of unique IGHV genes, and (6) number of unique IGHJ genes.



We thank the Yale Center for Research Computing for guidance and use of the research computing infrastructure. We wish to acknowledge Alina Aleksandrova for a careful reading of the manuscript. The authors also thank Susanna Marquez for useful comments related to the development of the code.


  1. 1. Tonegawa S. Somatic generation of antibody diversity. Nature. 1983;302(5909):575–581. pmid:6300689
  2. 2. Alt FW, Baltimore D. Joining of immunoglobulin heavy chain gene segments: implications from a chromosome with evidence of three D-JH fusions. Proceedings of the National Academy of Sciences. 1982;79(13):4118–4122.
  3. 3. Lafaille JJ, DeCloux A, Bonneville M, Takagaki Y, Tonegawa S. Junctional sequences of T cell receptor γδ genes: implications for γδ T cell lineages and for a novel intermediate of V-(D)-J joining. Cell. 1989;59(5):859–870. pmid:2590942
  4. 4. Murphy K. Janeway’s immunobiology. Garland Science; 2011.
  5. 5. Xu JL, Davis MM. Diversity in the CDR3 region of VH is sufficient for most antibody specificities. Immunity. 2000;13(1):37–45. pmid:10933393
  6. 6. McKean D, Huppi K, Bell M, Staudt L, Gerhard W, Weigert M. Generation of antibody diversity in the immune response of BALB/c mice to influenza virus hemagglutinin. Proceedings of the National Academy of Sciences. 1984;81(10):3180–3184.
  7. 7. Kepler TB, Perelson AS. Somatic hypermutation in B cells: an optimal control treatment. Journal of theoretical biology. 1993;164(1):37–64. pmid:8264243
  8. 8. Robins HS, Ericson NG, Guenthoer J, O’briant KC, Tewari M, Drescher CW, et al. Digital genomic quantification of tumor-infiltrating lymphocytes. Science translational medicine. 2013;5(214):214ra169–214ra169. pmid:24307693
  9. 9. Meng W, Zhang B, Schwartz GW, Rosenfeld AM, Ren D, Thome JJ, et al. An atlas of B-cell clonal distribution in the human body. Nature biotechnology. 2017;35(9):879. pmid:28829438
  10. 10. Rosenfeld AM, Chen DY, Meng W, Zhang B, Granot T, Farber DL, et al. PROTOCOL: computational evaluation of B-cell clone sizes in bulk populations. Frontiers in immunology. 2018;9:1472. pmid:30008715
  11. 11. Yaari G, Kleinstein SH. Practical guidelines for B-cell receptor repertoire sequencing analysis. Genome medicine. 2015;7(1):121. pmid:26589402
  12. 12. Tsioris K, Gupta NT, Ogunniyi AO, Zimnisky RM, Qian F, Yao Y, et al. Neutralizing antibodies against West Nile virus identified directly from human B cells by single-cell analysis and next generation sequencing. Integrative Biology. 2015;7(12):1587–1597. pmid:26481611
  13. 13. Hoehn KB, Vander Heiden JA, Zhou JQ, Lunter G, Pybus OG, Kleinstein S. Repertoire-wide phylogenetic models of B cell molecular evolution reveal evolutionary signatures of aging and vaccination. BioRxiv. 2019; p. 558825.
  14. 14. Sablitzky F, Wildner G, Rajewsky K. Somatic mutation and clonal expansion of B cells in an antigen-driven immune response. The EMBO journal. 1985;4(2):345–350. pmid:3926481
  15. 15. Ansorge WJ. Next-generation DNA sequencing techniques. New biotechnology. 2009;25(4):195–203. pmid:19429539
  16. 16. Weinstein JA, Jiang N, White RA, Fisher DS, Quake SR. High-throughput sequencing of the zebrafish antibody repertoire. Science. 2009;324(5928):807–810. pmid:19423829
  17. 17. Boyd SD, Marshall EL, Merker JD, Maniar JM, Zhang LN, Sahaf B, et al. Measurement and clinical monitoring of human lymphocyte clonality by massively parallel VDJ pyrosequencing. Science translational medicine. 2009;1(12):12ra23–12ra23. pmid:20161664
  18. 18. Metzker ML. Sequencing technologies—the next generation. Nature reviews genetics. 2010;11(1):31. pmid:19997069
  19. 19. Boyd SD, Joshi SA. High-throughput DNA sequencing analysis of antibody repertoires. In: Antibodies for Infectious Diseases. American Society of Microbiology; 2015. p. 345–362.
  20. 20. Rubelt F, Busse CE, Bukhari SAC, Bürckert JP, Mariotti-Ferrandiz E, Cowell LG, et al. Adaptive Immune Receptor Repertoire Community recommendations for sharing immune-repertoire sequencing data. Nature immunology. 2017;18(12):1274. pmid:29144493
  21. 21. Vander Heiden JA, Marquez S, Marthandan N, Bukhari SAC, Busse CE, Corrie B, et al. AIRR Community Standardized Representations for Annotated Immune Repertoires. Frontiers in immunology. 2018;9. pmid:30323809
  22. 22. Glanville J, Kuo TC, von Büdingen HC, Guey L, Berka J, Sundar PD, et al. Naive antibody gene-segment frequencies are heritable and unaltered by chronic lymphocyte ablation. Proceedings of the National Academy of Sciences. 2011;108(50):20066–20071.
  23. 23. Kepler TB. Reconstructing a B-cell clonal lineage. I. Statistical inference of unobserved ancestors. F1000Research. 2013;2. pmid:24555054
  24. 24. Ralph DK, Matsen IV FA. Likelihood-Based Inference of B Cell Clonal Families. PLoS computational biology. 2016;12(10):e1005086. pmid:27749910
  25. 25. Gupta NT, Adams KD, Briggs AW, Timberlake SC, Vigneault F, Kleinstein SH. Hierarchical clustering can identify B cell clones with high confidence in Ig repertoire sequencing data. The Journal of Immunology. 2017;198(6):2489–2499. pmid:28179494
  26. 26. Nouri N, Kleinstein SH. A spectral clustering-based method for identifying clones from high-throughput B cell repertoire sequencing data. Bioinformatics. 2018;34(13):i341–i349. pmid:29949968
  27. 27. Wood R, Gearhart PJ, Neuberger MS. Hypermutation in antibody genes-Preface; 2001.
  28. 28. Hershberg U, Prak ETL. The analysis of clonal expansions in normal and autoimmune B cell repertoires. Phil Trans R Soc B. 2015;370(1676):20140239. pmid:26194753
  29. 29. Zhang B, Meng W, Prak ETL, Hershberg U. Discrimination of germline V genes at different sequencing lengths and mutational burdens: a new tool for identifying and evaluating the reliability of V gene assignment. Journal of immunological methods. 2015;427:105–116. pmid:26529062
  30. 30. Zhou JQ, Kleinstein SH. Cutting edge: ig H chains are sufficient to determine most B cell clonal relationships. The Journal of Immunology. 2019;203(7):1687–1692. pmid:31484734
  31. 31. Clarke SH, Huppi K, Ruezinsky D, Staudt L, Gerhard W, Weigert M. Inter-and intraclonal diversity in the antibody response to influenza hemagglutinin. Journal of Experimental Medicine. 1985;161(4):687–704. pmid:3920342
  32. 32. Blier P, Bothwell A. A limited number of B cell lineages generates the heterogeneity of a secondary immune response. The Journal of Immunology. 1987;139(12):3996–4006. pmid:3500977
  33. 33. Diamond B, Katz JB, Paul E, Aranow C, Lustgarten D, Scharff MD. The role of somatic mutation in the pathogenic anti-DNA response. Annual review of immunology. 1992;10(1):731–757. pmid:1591002
  34. 34. Coker HA, Durham SR, Gould HJ. Local somatic hypermutation and class switch recombination in the nasal mucosa of allergic rhinitis patients. The Journal of Immunology. 2003;171(10):5602–5610. pmid:14607969
  35. 35. Furuta M, Ueno M, Fujimoto A, Hayami S, Yasukawa S, Kojima F, et al. Whole genome sequencing discriminates hepatocellular carcinoma with intrahepatic metastasis from multi-centric tumors. Journal of hepatology. 2017;66(2):363–373. pmid:27742377
  36. 36. Alamyar E, Duroux P, Lefranc MP, Giudicelli V. IMGT® tools for the nucleotide analysis of immunoglobulin (IG) and T cell receptor (TR) V-(D)-J repertoires, polymorphisms, and IG mutations: IMGT/V-QUEST and IMGT/HighV-QUEST for NGS. In: Immunogenetics. Springer; 2012. p. 569–604.
  37. 37. Ye J, Ma N, Madden TL, Ostell JM. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic acids research. 2013;41(W1):W34–W40. pmid:23671333
  38. 38. Elhanati Y, Sethna Z, Marcou Q, Callan CG Jr, Mora T, Walczak AM. Inferring processes underlying B-cell repertoire diversity. Philosophical Transactions of the Royal Society B: Biological Sciences. 2015;370(1676):20140243.
  39. 39. Yeap LS, Hwang JK, Du Z, Meyers RM, Meng FL, Jakubauskaitė A, et al. Sequence-intrinsic mechanisms that target AID mutational outcomes on antibody genes. Cell. 2015;163(5):1124–1137. pmid:26582132
  40. 40. Betz AG, Rada C, Pannell R, Milstein C, Neuberger MS. Passenger transgenes reveal intrinsic specificity of the antibody hypermutation mechanism: clustering, polarity, and specific hot spots. Proceedings of the National Academy of Sciences. 1993;90(6):2385–2388.
  41. 41. Shapiro GS, Ellison MC, Wysocki LJ. Sequence-specific targeting of two bases on both DNA strands by the somatic hypermutation mechanism. Molecular immunology. 2003;40(5):287–295. pmid:12943801
  42. 42. Smith DS, Creadon G, Jena PK, Portanova JP, Kotzin BL, Wysocki LJ. Di- and trinucleotide target preferences of somatic mutagenesis in normal and autoreactive B cells. The Journal of Immunology. 1996;156(7):2642–2652. pmid:8786330
  43. 43. Cowell LG, Kepler TB. The nucleotide-replacement spectrum under somatic hypermutation exhibits microsequence dependence that is strand-symmetric and distinct from that under germline mutation. The Journal of Immunology. 2000;164(4):1971–1976. pmid:10657647
  44. 44. Yaari G, Vander Heiden J, Uduman M, Gadala-Maria D, Gupta N, Stern JN, et al. Models of somatic hypermutation targeting and substitution based on synonymous mutations from high-throughput immunoglobulin sequencing data. Frontiers in immunology. 2013;4:358. pmid:24298272
  45. 45. Zelnik-Manor L, Perona P. Self-tuning spectral clustering. In: Advances in neural information processing systems; 2005. p. 1601–1608.
  46. 46. Yermanos A, Greiff V, Krautler NJ, Menzel U, Dounas A, Miho E, et al. Comparison of methods for phylogenetic B-cell lineage inference using time-resolved antibody repertoire simulations (AbSim). Bioinformatics. 2017;33(24):3938–3946. pmid:28968873
  47. 47. Giudicelli V, Chaume D, Lefranc MP. IMGT/V-QUEST, an integrated software program for immunoglobulin and T cell receptor V–J and V–D–J rearrangement analysis. Nucleic acids research. 2004;32(suppl_2):W435–W440. pmid:15215425
  48. 48. Gupta NT, Vander Heiden JA, Uduman M, Gadala-Maria D, Yaari G, Kleinstein SH. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics. 2015;31(20):3356–3358. pmid:26069265
  49. 49. Gadala-Maria D, Gidoni M, Marquez S, Vander Heiden JA, Kos JT, Watson CT, et al. Identification of subject-specific immunoglobulin alleles from expressed repertoire sequencing data. Frontiers in immunology. 2019;10:129. pmid:30814994
  50. 50. Lefranc MP. Immunoglobulin and T cell receptor genes: IMGT® and the birth and rise of immunoinformatics. Frontiers in immunology. 2014;5:22. pmid:24600447
  51. 51. Parameswaran P, Liu Y, Roskin KM, Jackson KK, Dixit VP, Lee JY, et al. Convergent antibody signatures in human dengue. Cell host & microbe. 2013;13(6):691–700.
  52. 52. Nouri N, Kleinstein SH. Optimized threshold Inference for Partitioning of Clones From high-throughput B Cell Repertoire sequencing data. Frontiers in immunology. 2018;9. pmid:30093903
  53. 53. Soto C, Bombardi RG, Branchizio A, Kose N, Matta P, Sevy AM, et al. High frequency of shared clonotypes in human B cell receptor repertoires. Nature. 2019;566(7744):398. pmid:30760926
  54. 54. Briney B, Inderbitzin A, Joyce C, Burton DR. Commonality despite exceptional diversity in the baseline human antibody repertoire. Nature. 2019;566(7744):393. pmid:30664748
  55. 55. Cui A, Di Niro R, Vander Heiden JA, Briggs AW, Adams K, Gilbert T, et al. A model of somatic hypermutation targeting in mice based on high-throughput Ig sequencing data. The Journal of Immunology. 2016;197(9):3566–3574. pmid:27707999
  56. 56. MacCarthy T, Kalis SL, Roa S, Pham P, Goodman MF, Scharff MD, et al. V-region mutation in vitro, in vivo, and in silico reveal the importance of the enzymatic properties of AID and the sequence environment. Proceedings of the National Academy of Sciences. 2009;106(21):8629–8634.
  57. DeKosky BJ, Kojima T, Rodin A, Charab W, Ippolito GC, Ellington AD, et al. In-depth determination and analysis of the human paired heavy- and light-chain antibody repertoire. Nature medicine. 2015;21(1):86. pmid:25501908
  58. Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161(5):1202–1214. pmid:26000488
  59. Briggs AW, Goldfless SJ, Timberlake S, Belmont BJ, Clouser CR, Koppstein D, et al. Tumor-infiltrating immune repertoires captured by single-cell barcoding in emulsion. bioRxiv. 2017; p. 134841.
  60. Townsend CL, Laffy JM, Wu YCB, Silva O’Hare J, Martin V, Kipling D, et al. Significant differences in physicochemical properties of human immunoglobulin kappa and lambda CDR3 regions. Frontiers in immunology. 2016;7:388. pmid:27729912
  61. Ohlin M, Borrebaeck CA. Insertions and deletions in hypervariable loops of antibody heavy chains contribute to molecular diversity. Molecular immunology. 1998;35(4):233–238. pmid:9736339
  62. Wilson PC, de Bouteiller O, Liu YJ, Potter K, Banchereau J, Capra JD, et al. Somatic hypermutation introduces insertions and deletions into immunoglobulin V genes. Journal of Experimental Medicine. 1998;187(1):59–70. pmid:9419211
  63. de Wildt RM, van Venrooij WJ, Winter G, Hoet RM, Tomlinson IM. Somatic insertions and deletions shape the human antibody repertoire. Journal of molecular biology. 1999;294(3):701–710. pmid:10610790
  64. Briney BS, Willis JR, Crowe J. Location and length distribution of somatic hypermutation-associated DNA insertions and deletions reveals regions of antibody structural plasticity. Genes and immunity. 2012;13(7):523–529. pmid:22717702
  65. Bowers PM, Verdino P, Wang Z, da Silva Correia J, Chhoa M, Macondray G, et al. Nucleotide insertions and deletions complement point mutations to massively expand the diversity created by somatic hypermutation of antibodies. Journal of Biological Chemistry. 2014;289(48):33557–33567. pmid:25320089
  66. Kepler TB, Liao HX, Alam SM, Bhaskarabhatla R, Zhang R, Yandava C, et al. Immunoglobulin gene insertions and deletions in the affinity maturation of HIV-1 broadly reactive neutralizing antibodies. Cell host & microbe. 2014;16(3):304–313.