Fig 1.
Stochastic sampling leads to variation in observed overlap.
The members of two hypothetical populations are represented by blue and green circles, respectively. Each population has 16 members, and s = 5 are shared members of both populations. In two independent sampling experiments, shown in top and bottom rows, na = nb = 8 members are sampled at random from each population (dark circles) while the other 8 members are not sampled (transparent circles). Observation of the first experiment finds an overlap of nab = 4, while observation of the second finds nab = 0.
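The sampling experiment in this figure can be imitated with a short simulation. The following is a minimal sketch, not the authors' code: the function name `simulate_overlap` and the labeling scheme are assumptions made for illustration, and only the population size (16), the true overlap (s = 5), and the sample sizes (na = nb = 8) come from the caption.

```python
import random

def simulate_overlap(pop_size=16, shared=5, n_a=8, n_b=8, rng=None):
    """Sample from two populations once and return the observed overlap n_ab.

    Hypothetical helper: members 0..shared-1 belong to both populations,
    and the remaining members are unique to one population.
    """
    rng = rng or random.Random()
    pop_a = list(range(shared)) + [f"a{i}" for i in range(pop_size - shared)]
    pop_b = list(range(shared)) + [f"b{i}" for i in range(pop_size - shared)]
    sample_a = set(rng.sample(pop_a, n_a))  # sampling without replacement
    sample_b = set(rng.sample(pop_b, n_b))
    return len(sample_a & sample_b)

# Repeat the experiment many times to see how much n_ab varies.
rng = random.Random(0)
overlaps = [simulate_overlap(rng=rng) for _ in range(10_000)]
mean_overlap = sum(overlaps) / len(overlaps)
print(min(overlaps), max(overlaps), round(mean_overlap, 2))
```

Because each shared member is sampled from each population independently with probability 8/16, the expected observed overlap is 5 × (8/16) × (8/16) = 1.25, well below the true s = 5, and individual experiments scatter around that value.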
Fig 2.
Inference and uncertainty using the posterior.
The posterior distribution over s is plotted for the realistic scenario of na = 47, nb = 32, and nab = 20 [line; Eq (6)]. The posterior mean provides our estimate of the true overlap [open circle; Eq (7)], and the interval accounting for at least 90% of the area under the posterior curve provides an equal-tailed 90% credible interval [shading; Eq (8)]. The pairwise type sharing estimate is shown for comparison [black cross; Eq (1)], and is typically less than or equal to the posterior mean.
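The quantities named in this caption can be sketched numerically. The code below is an illustration under stated assumptions, not a transcription of Eqs (6)-(8), which are not reproduced here: it assumes both repertoires contain 60 types, a uniform prior over s, and sampling without replacement modeled by two chained hypergeometric draws. All function names are hypothetical.

```python
from math import comb

def hypergeom_pmf(k, total, marked, draws):
    """P(k marked items in `draws` draws without replacement from `total`)."""
    if k < 0 or k > marked or draws - k < 0 or draws - k > total - marked:
        return 0.0
    return comb(marked, k) * comb(total - marked, draws - k) / comb(total, draws)

def likelihood(n_ab, s, n_a, n_b, size=60):
    """P(n_ab | s), assuming both repertoires have `size` types (assumption).

    k is the number of shared types that land in the first sample; the
    second sample must then hit exactly n_ab of those k types.
    """
    return sum(
        hypergeom_pmf(k, size, s, n_a) * hypergeom_pmf(n_ab, size, k, n_b)
        for k in range(n_ab, min(s, n_a) + 1)
    )

def posterior(n_ab, n_a, n_b, size=60):
    """Posterior over s = 0..size under a uniform prior (assumption)."""
    weights = [likelihood(n_ab, s, n_a, n_b, size) for s in range(size + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def equal_tailed_interval(post, mass=0.90):
    """Smallest s values at which the CDF reaches each tail cutoff."""
    tail = (1 - mass) / 2
    cum, lower, upper = 0.0, None, None
    for s, p in enumerate(post):
        cum += p
        if lower is None and cum >= tail:
            lower = s
        if upper is None and cum >= 1 - tail:
            upper = s
    return lower, upper

post = posterior(n_ab=20, n_a=47, n_b=32)
post_mean = sum(s * p for s, p in enumerate(post))
interval = equal_tailed_interval(post)
print(round(post_mean, 1), interval)
```

With this construction the interval always contains at least 90% of the posterior mass, because less than 5% lies below the lower bound and at most 5% lies above the upper bound.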
Fig 3.
Bayesian repertoire overlap consistently estimates true overlap.
Repertoires with true overlaps ranging from 0 to 60 were subsampled in simulations. As sampling rates increase from na = nb = 30 (left) to 40 (middle) and to 50 (right), the BRO estimates (colored circles) approach the true values (dotted lines) symmetrically. Estimates from pairwise type sharing (crosses) approach the true values from below, systematically underestimating the true overlap. This bias is worse at lower sampling rates [7]. Similar results are found when na ≠ nb, and when the total repertoire sizes differ from each other (S1 Fig).
Fig 4.
Credible intervals quantify uncertainty in overlap estimates.
Using Eq (8), 90% credible intervals are shown as error bars around the point estimates for varying true overlap s. As the sampling rate increases from na = nb = 30 (left) to 40 (middle) and to 50 (right), credible intervals shrink, indicating a reduction in uncertainty. In expectation, 90% of intervals cover the true overlap (dotted line).
Fig 5.
Reevaluation of published results.
In 2010, Albrecht et al. compared var repertoires from 5 populations using pairwise type sharing (see Refs. [18, 19, 27] for original data details). (left) Reproduction of the analysis of [19], rescaled from [0, 1] to [0, 60]. (middle) Reanalysis using Bayesian repertoire overlap [Eq (7)]. For all boxplots, boxes span inner quartiles; center lines show medians; whiskers extend to the 2.5 and 97.5 percentiles. (right) Histograms of Bayesian repertoire overlap distributions from Amele and Ariquemes clones (data identical to those in middle boxplots) colored by width of credible interval [Eq (8)], a measure of uncertainty. Differences in uncertainties are driven primarily by sampling rates, that is, by the differing average numbers of sequences per parasite obtained for Amele samples and for Ariquemes clones.
Fig 6.
Quantifying the decrease in uncertainty from increased sequencing.
Histograms show distributions of overlap estimates, computed using Eq (11), for various values of s, which are indicated by color-matched dotted lines. While all estimates are distributed around the true values of s, increasing the number of colonies c from 48 (top) to 96 (middle) and to 144 (bottom) substantially decreases the error of the estimates. For example, the bottom plot shows that successfully sequencing c = 144 colonies from each parasite is guaranteed to produce estimates that are off by at most 5 (8.3%) in either direction of the true s.