Inference of B cell clonal families using heavy/light chain pairing information

doi:10.1371/journal.pcbi.1010723

Fig 1.

The partis workflow for paired clustering starts with (potentially multiply-) paired sequences, along with possibly some bulk (i.e. non-single cell) data.

It performs clonal family inference on each chain individually, then merges the resulting single chain partitions together into a “joint” or “paired” partition, with the option of approximately pairing bulk data with these clonal families.


Fig 2.

Schematic representation of our paired clustering method on two families in the heavy chain (left) and light chain (right), with droplet identifiers represented as single letters.

The naive sequences (green crosses) and trees (dashed black lines) represent the collision (i.e. occurrence of very similar VDJ rearrangements) of the two light chain families, while the heavy families are easily distinguishable. The first step of our method groups together apparently-clonal sequences using only single chain information, and would thus merge together the two light chain families. The second step, which refines the single chain partitions using information on which heavy and light chain sequences pair together (“pairing information” or “pair info”), would then use the difference in heavy clusters to split apart the light families.
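The refinement step described above can be illustrated with a minimal Python sketch. It is not the partis implementation: the function name `split_by_partner_cluster` and the toy droplet letters are hypothetical, and it assumes that every droplet contributes exactly one heavy and one light sequence, so a droplet identifier can stand in for both.

```python
from collections import defaultdict

def split_by_partner_cluster(light_cluster, heavy_partition):
    """Split one light chain cluster wherever its droplets fall in different heavy clusters.

    Hypothetical helper for illustration only; droplets with no heavy chain
    partner are simply grouped together under the key None.
    """
    # Map each droplet id to the index of the heavy cluster that contains it.
    heavy_index = {droplet: i
                   for i, cluster in enumerate(heavy_partition)
                   for droplet in cluster}
    groups = defaultdict(list)
    for droplet in light_cluster:
        groups[heavy_index.get(droplet)].append(droplet)
    return [sorted(group) for group in groups.values()]

# Toy data: the heavy families are distinguishable, but the two light families
# collided and were merged by single chain clustering.
heavy_partition = [["a", "b", "c"], ["d", "e"]]
merged_light = ["a", "b", "c", "d", "e"]
light_split = split_by_partner_cluster(merged_light, heavy_partition)
```

In this toy case the merged light cluster is split back into the droplets `a`, `b`, `c` and the droplets `d`, `e`, mirroring the heavy chain partition.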


Fig 3.

Clustering performance on simulation as a function of SHM (mean fraction of nucleotides mutated) for heavy chain (left) and light chain (right).

Each point is the mean F1 score (± standard error, often smaller than points) over three samples, each consisting of 10,000 simulated rearrangement events with family sizes drawn from a geometric distribution with mean three. In addition to the inference methods (shown in color), we include two synthetic partition methods (grey), which generate incorrect partitions starting from the true partition. These are purely to provide intuitive comparison: the first splits 20% of sequences, chosen at random, into singleton clusters; the second merges together families whose true naive sequences are closer than 3% in Hamming distance (“synth.”, see text for details). The F1 score’s component precision and sensitivity are plotted in S3 Fig, while the performance with and without using pairing information is compared for partis in S4 and S7 Figs, and SCOPer in S5 Fig. Performance of partis as a function of the number of families (with SHM constant) is shown in S6 and S7 Figs. Note that enclone by design discards some sequences with higher SHM levels, so we display its performance only for those samples where it passes at least 90% of sequences (it discards ≃9% of sequences at 10% SHM, ≃60% at 20%, and ≃94% at 30%).
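As a rough illustration of the first synthetic partition and of an F1 score on partitions, the following Python sketch splits a random 20% of sequences into singletons and scores the result. The per-sequence definitions of precision and sensitivity used here (each counting the sequence itself) are an assumption made for this sketch; the main text gives the definition actually used. All function names are hypothetical.

```python
import random

def split_into_singletons(partition, fraction, rng):
    """Move a randomly chosen fraction of sequences into singleton clusters."""
    seqs = [s for cluster in partition for s in cluster]
    chosen = set(rng.sample(seqs, round(fraction * len(seqs))))
    kept = [[s for s in cluster if s not in chosen] for cluster in partition]
    return [c for c in kept if c] + [[s] for s in sorted(chosen)]

def f1_score(true_partition, inferred_partition):
    """Harmonic mean of per-sequence precision and sensitivity (assumed definition)."""
    true_of = {s: set(c) for c in true_partition for s in c}
    inferred_of = {s: set(c) for c in inferred_partition for s in c}
    precisions, sensitivities = [], []
    for s in true_of:
        overlap = len(true_of[s] & inferred_of[s])
        precisions.append(overlap / len(inferred_of[s]))   # fraction of inferred cluster that is truly clonal
        sensitivities.append(overlap / len(true_of[s]))    # fraction of true family that was recovered
    precision = sum(precisions) / len(precisions)
    sensitivity = sum(sensitivities) / len(sensitivities)
    return 2 * precision * sensitivity / (precision + sensitivity)

# Toy "true" partition: 20 families of 5 sequences each.
true_partition = [[f"fam{i}_seq{j}" for j in range(5)] for i in range(20)]
synthetic = split_into_singletons(true_partition, 0.2, random.Random(1))
score = f1_score(true_partition, synthetic)
```

Because this synthetic method only splits families, precision stays at one and the drop in F1 comes entirely from sensitivity.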


Fig 4.

Effectiveness on simulation of the pair info cleaning method as a function of true family size, shown as the fraction of sequences correctly paired (top left); and the fraction not correctly paired, split into those mispaired (bottom left) and left unpaired (bottom right). We also show the fraction that are paired with a sequence from the correct family, though not necessarily the correct sequence from that family (top right). Each point is the mean (± standard error, often smaller than points) over three samples, each consisting of 3,000 simulated rearrangement events (around 12,000 total heavy and light sequences). The family sizes are drawn from a distribution inferred from real data. The effects of a variety of cluster size distributions on performance are shown in S8 Fig, in terms of both fraction correctly paired and of final clustering accuracy.


Fig 5.

Effectiveness on simulation of the approximate bulk data pairing method as a function of true family size, shown as the fraction of sequences paired with a sequence from the correct family (but not necessarily the correct sequence, left) and the fraction paired with a family similar to the correct family (right).

The method merges together a single cell and a bulk sample drawn from the same pool of B cells, and the “bulk data fraction” shows the fraction of this merged sample that stems from the bulk sample. “Similar” families are defined as families with true naive sequences separated by nucleotide Hamming distances of 3 or less. The fraction of sequences correctly paired, mispaired, and left unpaired are shown in S9 Fig. Other details same as in Fig 4.
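The “similar” criterion above can be written as a short Python check. This is a sketch under the assumption that the two naive sequences have already been brought to equal length, so that a position-by-position comparison is meaningful; the function names and toy sequences are hypothetical.

```python
def hamming_distance(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))

def are_similar(naive_a, naive_b, max_distance=3):
    """True if two naive sequences are within the caption's distance of 3 or less."""
    return hamming_distance(naive_a, naive_b) <= max_distance

# Toy sequences, far shorter than real naive sequences.
reference = "ACGTACGTAC"
two_away = "ACGTTCGAAC"
three_away = "TTTTACGTAC"
four_away = "TTTAACGTAC"
```

Here `two_away` and `three_away` would count as similar to `reference`, while `four_away` would not.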


Fig 6.

Time required for clustering for a variety of methods for both single chain clustering (top, for IgH [left] and IgK [right]) and paired clustering (including single chain clustering, bottom left).

Note that for partis the actual paired clustering takes only around five minutes on the 100k samples, so for partis the bottom plot time is roughly equal to the sum of the top two plots (at the moment the single chain clustering for heavy and light are not run concurrently). Each point is the mean (± standard error, often smaller than points) over two samples with the indicated size, run on a single desktop with a 14-core Intel i9-9940X processor and 128GB memory (maximum memory usage for partis on the 100k samples was around 9GB). The size of each family is drawn from a geometric distribution with mean 10. Note that because enclone by design discards sequences with high SHM, here it is clustering only the ≃80% of sequences that it passes.


Fig 7.

Cluster size distributions on real data (thin lines) and simulation (thick lines) for single chain (no paired clustering, blue/green) and joint (with paired clustering, black/red) partitions for IgH (top), IgK (bottom left) and IgL (bottom right).

The real data sample is from the 10X website [45], while the simulation sample is generated to match that particular real data sample, i.e. from parameters inferred from that real data sample. Direct comparisons of other parameter distributions for these two samples are shown for a handful of parameters in S10 Fig. Comparisons for all parameters, along with similar analyses for three other real data samples from [45] may also be found at https://doi.org/10.5281/zenodo.5860143.


Fig 8.

Pair info cleaning effectiveness on real data for a relatively well-paired sample (left, same sample as Fig 7) and a sample with substantial numbers of multiply-paired sequences (right, not yet published, but raw data included in https://doi.org/10.5281/zenodo.5860143).

The x axis shows the number of sequences paired with each sequence (“paired seqs per seq”) before and after application of our pair info cleaning algorithm. Thus for example a perfectly paired sample (with perfect allelic exclusion, no dropout, etc.) would have all sequences in the 1-bin (i.e. uniquely paired), while a sample with two cells in each droplet would have all sequences in the 3-bin. So an ideal application of our algorithm would leave every sequence uniquely paired (all in the 1-bin), but in practice some are left unpaired (0-bin).
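The binning described above can be reproduced with a few lines of Python. This sketch assumes, consistent with the caption's two-cell example, that every sequence counts all other sequences in its droplet as paired with it; the droplet identifiers and sequence names are hypothetical.

```python
from collections import Counter

def paired_seqs_per_seq(droplets):
    """Histogram of how many other sequences share a droplet with each sequence."""
    counts = Counter()
    for seqs in droplets.values():
        for _ in seqs:
            counts[len(seqs) - 1] += 1  # every other sequence in the droplet
    return counts

# Perfect pairing: one heavy and one light sequence per droplet.
perfect = {"d1": ["h1", "k1"], "d2": ["h2", "l2"]}
# Two cells captured in a single droplet.
doublet = {"d1": ["h1", "k1", "h2", "k2"]}
# A droplet in which the light chain dropped out.
dropout = {"d1": ["h1"]}

perfect_hist = paired_seqs_per_seq(perfect)
doublet_hist = paired_seqs_per_seq(doublet)
dropout_hist = paired_seqs_per_seq(dropout)
```

The perfectly paired sample places all four sequences in the 1-bin, the doublet places all four in the 3-bin, and the dropout sequence lands in the 0-bin.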


Fig 9.

Naive sequence inference accuracy on simulation as a function of tree imbalance.

Accuracy is measured as the Hamming distance separating the true and inferred naive sequences. Tree imbalance (the standard deviation of root-to-tip distances) is shown for samples with precisely zero imbalance (“n/a”, since it uses a different simulation framework), and samples generated from the imbalanced trees from [47]. Several partis versions are shown: “1-seq.” uses only information from each single sequence for its annotation (rather than from all family members); “star” approximates each family as a star tree (as in partis versions before May 2021, https://bit.ly/3KIjFiF); “full” is the current default, including subcluster annotation, which improves on the star tree assumption by iteratively annotating small subtrees. Each point is the mean (± standard error) over two samples with 15% mean nucleotide SHM, each consisting of 50 simulated rearrangement events with family sizes of around 50 (drawn from a geometric distribution for zero imbalance; otherwise taken from the trees from [47]).
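The imbalance measure named above, the standard deviation of root-to-tip distances, can be sketched in Python as follows. The tree encoding (a child-to-parent dictionary plus a branch length per child) and the use of the population standard deviation are assumptions made for this illustration, and the function names are hypothetical.

```python
import statistics

def root_to_tip_distances(parents, branch_lengths):
    """Sum branch lengths from each tip up to the root.

    parents maps each non-root node to its parent; branch_lengths maps each
    non-root node to the length of the branch above it.
    """
    internal = set(parents.values())
    distances = {}
    for node in parents:
        if node in internal:
            continue  # not a tip
        total, current = 0.0, node
        while current in parents:
            total += branch_lengths[current]
            current = parents[current]
        distances[node] = total
    return distances

def imbalance(parents, branch_lengths):
    """Population standard deviation of root-to-tip distances (assumed convention)."""
    return statistics.pstdev(list(root_to_tip_distances(parents, branch_lengths).values()))

# Star tree with equal branch lengths: every tip is equally far from the root.
star_parents = {"x": "root", "y": "root", "z": "root"}
star_lengths = {"x": 0.1, "y": 0.1, "z": 0.1}

# Ladder-like tree: tips sit at different depths.
ladder_parents = {"a": "root", "n1": "root", "b": "n1", "n2": "n1", "c": "n2", "d": "n2"}
ladder_lengths = {node: 0.05 for node in ladder_parents}

star_imbalance = imbalance(star_parents, star_lengths)
ladder_imbalance = imbalance(ladder_parents, ladder_lengths)
```

The star tree has zero imbalance, while the ladder-like tree, whose tips sit one, two, and three branches from the root, has a positive value.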


Fig 10.

Resolving discrepancies for one cluster between the heavy and light partitions (left) and incorporating a (different) group of resolved clusters into the final partition (right).

Left: After selecting a single cluster C_l from one partition, we take all clusters from the other partition that contain any of its three sequences a, b, and c. We then determine how C_l should be split apart using these overlapping clusters, and finish with the resulting “resolved” clusters. Right: After several rounds of discrepancy resolution, we have a partially complete final partition P_f into which we must incorporate each new group of resolved clusters. We do this by removing common sequences from the larger of any two clusters that share sequences (regardless of whether the larger cluster is in P_f or in the new group). In this case, this means we remove the e from de in P_f, and the f from ef in the new group of resolved clusters.
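The incorporation rule in the right-hand panel can be sketched in Python. This is only an illustration of the rule “remove shared sequences from the larger cluster”: the function name, the toy clusters, the order in which cluster pairs are visited, and the tie-break (equal sizes remove from the cluster already in the final partition) are all assumptions, not details taken from the paper.

```python
def incorporate(final_partition, resolved_clusters):
    """Add resolved clusters to a partial final partition without duplicating sequences."""
    final = [set(c) for c in final_partition]
    new = [set(c) for c in resolved_clusters]
    for f in final:
        for r in new:
            common = f & r
            if not common:
                continue
            # Remove the shared sequences from whichever cluster is larger.
            # Assumption: on a tie, remove from the final-partition cluster.
            larger = f if len(f) >= len(r) else r
            larger -= common  # in-place, so the change is visible in final/new
    return [sorted(c) for c in final + new if c]

# Toy example: P_f holds clusters {d, e} and {f}; the new group holds {e, f}.
partial_final = [["d", "e"], ["f"]]
new_group = [["e", "f"]]
merged = incorporate(partial_final, new_group)
```

With these toy inputs, e is removed from the cluster de and f is removed from the cluster ef, so each sequence ends up in exactly one cluster.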


Fig 11.

Schematic representation of the pair info cleaning algorithm showing five droplets, each with some combination of IgH, IgK, and IgL sequences, with subscripts (and color combinations) indicating clonal families (true clonal family id is also indicated as a subscript).

The top left, top right, and bottom left droplets have unique pairings, so require no action. The top middle and bottom right droplets, on the other hand, require disambiguation. In the top middle droplet, h₁ has two potential light chain partners: k₁ and k₃. Since k₁ has three h₁ family members paired with it across all the droplets, while k₃ has only one, we choose k₁ (red arrow). A similar procedure is then followed for the bottom right droplet.
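The disambiguation rule above can be sketched as a simple vote in Python. The droplet contents below are a hypothetical toy example chosen to match the counts in the caption (three versus one), not a transcription of the figure; each sequence is reduced to a (locus, family id) tuple, and ties fall to the first candidate listed, which is an assumption of this sketch.

```python
from collections import Counter

def count_family_pairings(droplets):
    """Count, over all droplets, how often each heavy family shares a droplet with each light family."""
    support = Counter()
    for droplet in droplets:
        heavies = [s for s in droplet if s[0] == "h"]
        lights = [s for s in droplet if s[0] != "h"]
        for heavy in heavies:
            for light in lights:
                support[(heavy, light)] += 1
    return support

def choose_light_partner(heavy, droplet, support):
    """Pick the light chain whose family is most often seen with the heavy chain's family."""
    candidates = [s for s in droplet if s[0] != "h"]
    return max(candidates, key=lambda light: support[(heavy, light)])

# Five toy droplets; ("h", 1) means an IgH sequence from clonal family 1.
droplets = [
    [("h", 1), ("k", 1)],            # unique pairing
    [("h", 1), ("k", 1), ("k", 3)],  # ambiguous: two candidate light chains
    [("h", 1), ("k", 1)],            # unique pairing
    [("h", 2), ("l", 2)],            # unique pairing
    [("h", 2), ("l", 2), ("k", 1)],  # ambiguous
]
support = count_family_pairings(droplets)
first_choice = choose_light_partner(("h", 1), droplets[1], support)
second_choice = choose_light_partner(("h", 2), droplets[4], support)
```

In the second droplet, the family-1 kappa chain is supported by three droplets and the family-3 kappa chain by one, so the family-1 partner is chosen.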


Table 1.

Default values for simulation parameters controlling germline set generation when simulating from scratch.
