Consistency of VDJ Rearrangement and Substitution Parameters Enables Accurate B Cell Receptor Sequence Annotation

doi:10.1371/journal.pcbi.1004409

Fig 1.

The VDJ recombination process, in which individual V, D, and J genes are first randomly selected from a number of copies of each.

These genes are then joined via a process that deletes a randomly distributed number of nucleotides at their boundaries and connects them with random “non-templated” nucleotides in the N-region (blue). The specificity of an antibody is to a large extent determined by the region defined by the heavy chain recombination site, referred to as the third complementarity determining region (CDR3).
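
The rearrangement steps described in this caption can be illustrated with a toy simulation. This is a minimal sketch, not the simulation used in the paper: the function name `simulate_rearrangement`, the uniform deletion and insertion length distributions, and the `max_del` and `max_insert` limits are all assumptions made for illustration (the paper's figures show that the real distributions are allele-specific and often multi-modal).

```python
import random

def simulate_rearrangement(v_genes, d_genes, j_genes, rng, max_del=3, max_insert=4):
    """Toy VDJ rearrangement (hypothetical helper, uniform distributions assumed)."""
    # Step 1: pick one gene of each type at random.
    v, d, j = rng.choice(v_genes), rng.choice(d_genes), rng.choice(j_genes)

    # Step 2: delete a random number of nucleotides at each joined boundary.
    v_3p, d_5p, d_3p, j_5p = (rng.randint(0, max_del) for _ in range(4))
    v_part = v[:len(v) - v_3p]
    d_part = d[d_5p:len(d) - d_3p]
    j_part = j[j_5p:]

    # Step 3: draw random non-templated nucleotides for the VD and DJ N-regions.
    def n_region():
        return "".join(rng.choice("ACGT") for _ in range(rng.randint(0, max_insert)))

    vd_insert, dj_insert = n_region(), n_region()
    sequence = v_part + vd_insert + d_part + dj_insert + j_part
    parts = {"v": v_part, "vd_insert": vd_insert, "d": d_part,
             "dj_insert": dj_insert, "j": j_part}
    return sequence, parts

rng = random.Random(0)
seq, parts = simulate_rearrangement(["ACGTACGTAC"], ["GGGTTTCCC"], ["TTGACCAGT"], rng)
```

The returned `parts` dictionary records which stretch of the sequence came from which step, which is the information an annotation method has to recover.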

Fig 2.

Observed exonuclease deletion length frequencies for two V, four D, and two J alleles on the three humans (A, B, and C) in the Adaptive data set.

These alleles were chosen to be representative of the various shapes taken by the empirical distributions. In the complete set of plots (which are publicly available as described in the text), per-allele distributions are frequently multi-modal and appear similar between humans.

Fig 3.

Mutation frequency versus position.

Typical observed mutation frequencies for two V, two D, and two J alleles on the three humans (A, B, and C) in the Adaptive data set. The x axis is the zero-indexed position along the IMGT germline allele. Mutation frequencies are seen to be highly position-dependent. While the structure of these mutation distributions appears similar between humans, the overall level of mutation varies. The first bases of the conserved cysteine and tryptophan codons (i.e. the CDR3 boundaries) are indicated with black vertical dashed lines. In the complete set of plots (which are publicly available as described in the text), mutation frequencies are highly variable across sites, with a pattern that is similar between humans.
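
A per-position mutation frequency of the kind plotted here could be tallied as in the following sketch. It assumes reads that are already aligned to the germline allele, position by position, and uses `-` as a hypothetical marker for positions a read does not cover; the function name and input layout are illustrative and are not taken from the paper's software.

```python
def per_position_mutation_freq(germline, aligned_reads, gap="-"):
    """Fraction of covering reads that differ from the germline at each position.

    Positions are zero-indexed along the germline allele. Positions with no
    coverage are reported as None rather than 0.
    """
    mutated = [0] * len(germline)
    covered = [0] * len(germline)
    for read in aligned_reads:
        for i, (g, r) in enumerate(zip(germline, read)):
            if r == gap:  # read does not cover this position
                continue
            covered[i] += 1
            if r != g:
                mutated[i] += 1
    return [m / c if c else None for m, c in zip(mutated, covered)]

freqs = per_position_mutation_freq("ACGT", ["ACGT", "ACGA", "TCGA"])
```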

Fig 4.

N-region lengths.

Typical observed N-region lengths at the VD and DJ boundaries for two D and two J alleles. In the complete set of plots (which are publicly available as described in the text), distributions have a similar shape and the per-allele plots appear similar between humans.

Fig 5.

Mean and variance of inferred parameters.

The across-subset mean and variance of inferred parameter values for each human in the Adaptive data set across 10 disjoint subsets of the data.
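
The across-subset summary can be sketched as follows. The caption does not say how the subsets were drawn or which variance estimator was used, so the interleaved split and the sample variance below are assumptions, and `estimator` stands in for whatever parameter inference is run on each subset.

```python
import statistics

def disjoint_subsets(items, n_subsets):
    """Split items into n_subsets non-overlapping groups by interleaving."""
    return [items[i::n_subsets] for i in range(n_subsets)]

def across_subset_mean_variance(values, estimator, n_subsets=10):
    """Apply estimator to each disjoint subset, then summarize the estimates."""
    estimates = [estimator(subset) for subset in disjoint_subsets(values, n_subsets)]
    # statistics.variance is the sample variance (n - 1 denominator); an assumption here.
    return statistics.mean(estimates), statistics.variance(estimates)

# Toy input: 100 observed deletion lengths, with the per-subset mean as the "parameter".
mean_est, var_est = across_subset_mean_variance(list(range(100)), statistics.mean)
```

A small across-subset variance relative to the mean indicates that the inferred parameter is stable when estimated from independent portions of the data.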

Table 1.

HMM compiler performance comparison.

Table 2.

Correct gene calls for all public BCR annotation methods.

Fig 6.

Correct gene calls versus sequence mutation frequency, i.e. the number of sequences for which the specified method called the correct gene (regardless of allele) divided by the total number of sequences, versus sequence mutation frequency, for publicly available BCR annotation methods on a simulated sample of 30,000 sequences.

partis is shown both for single sequences (k = 1), and for a multi-HMM (k = 5), which performs simultaneous inference on five clonally related sequences.

Fig 7.

Hamming distance between inferred and true naive sequences, for all available BCR annotation methods on a simulated sample of 30,000 sequences.

partis is shown both for single sequences (k = 1), and for a multi-HMM (k = 5), which performs simultaneous inference on five clonally related sequences.
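
For reference, the Hamming distance used as the accuracy measure counts the positions at which two equal-length sequences differ. The sketch below assumes the inferred and true naive sequences have already been brought to the same length; how length differences are handled in the paper is not stated in this caption.

```python
def hamming_distance(inferred, true):
    """Number of positions at which two equal-length sequences differ."""
    if len(inferred) != len(true):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(inferred, true))

dist = hamming_distance("ACGTACGT", "ACGAACGG")
```

A distance of zero means the naive sequence was reconstructed exactly.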

Fig 8.

True versus inferred parameters for all methods.

The difference between inferred and true values for exonuclease deletion lengths, N-region lengths, and mutation frequency for all available BCR annotation methods on a simulated sample of 30,000 sequences; more tightly peaked around zero is better. partis is shown both for single sequences (k = 1) and for a multi-HMM (k = 5), which performs simultaneous inference on five clonally related sequences.

Fig 9.

Performance versus sample size.

partis performance as measured by Hamming distance between inferred and true naive sequences when given 50 to 10,000 total sequences. When fewer sequences than a threshold are provided for a gene, partis uses a tiered aggregation strategy (see Methods) to obtain enough sequences on which to do parameter estimation, which partly recovers performance as shown here.

Fig 10.

A few states in the internal region of a B cell receptor sequence HMM.

On the left is a state representing a position in a gene with a germline G, which usually emits a G, but sometimes emits other bases (i.e. mutates). If this state is near enough to the start or end of the gene, there will likely be transitions from the initial state, or to the end state (upper arrows). On the other hand, if it is in the middle of a gene the path is more likely to simply traverse the states in order (straight horizontal arrows).
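
The emission behaviour of such a state can be sketched as below. This assumes that the probability of mutating is split evenly among the three non-germline bases, which is a simplification chosen for illustration; the function names are hypothetical and transitions to and from the initial and end states are left out.

```python
import math

def emission_probs(germline_base, mutation_freq):
    """Emission distribution for a state whose germline base is germline_base.

    Emits the germline base with probability 1 - mutation_freq and each other
    base with probability mutation_freq / 3 (even split is an assumption).
    """
    return {base: (1.0 - mutation_freq) if base == germline_base else mutation_freq / 3.0
            for base in "ACGT"}

def in_order_log_prob(germline, observed, mutation_freqs):
    """Log probability of emitting `observed` while traversing states in order."""
    return sum(math.log(emission_probs(g, mu)[o])
               for g, o, mu in zip(germline, observed, mutation_freqs))

probs = emission_probs("G", 0.06)
log_p = in_order_log_prob("GA", "GT", [0.06, 0.03])
```

Because each position carries its own mutation frequency, a mismatch at a frequently mutated site is penalized less than one at a rarely mutated site.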

Fig 11.

Overall topology of the HMMs in the V, D, and J segments.

Inserts are shown as a single state for clarity, but are replaced by four states in the actual HMM (Fig 12). Note that we include 5’ V and 3’ J exonuclease deletions as a convenience (dashed lines) to account for varying read lengths.

Fig 12.

HMM N-region topology.

Using four states instead of one improves discrimination for pairs or tuples of sequences because it does not ignore mutation information within N-regions. Self-transitions for insert states are omitted for clarity.
