Bayes-optimal estimation of overlap between populations of fixed size

doi:10.1371/journal.pcbi.1006898

Fig 1.

Stochastic sampling leads to variation in observed overlap.

The members of two hypothetical populations are represented by blue and green circles, respectively. Each population has 16 members, and s = 5 are shared members of both populations. In two independent sampling experiments, shown in top and bottom rows, n_a = n_b = 8 members are sampled at random from each population (dark circles) while the other 8 members are not sampled (transparent circles). Observation of the first experiment finds an overlap of n_ab = 4, while observation of the second finds n_ab = 0.
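The sampling experiment in this caption can be reproduced with a short simulation. The sketch below is illustrative only: the function name `sample_overlap` and the integer labelling of population members are assumptions made for the example, not part of the original work.

```python
import random

def sample_overlap(pop_size=16, s=5, n_a=8, n_b=8, rng=None):
    """Run one sampling experiment and return the observed overlap n_ab."""
    rng = rng or random.Random()
    # Hypothetical labelling: ids 0..s-1 are the members shared by both populations.
    pop_a = list(range(pop_size))
    pop_b = list(range(s)) + list(range(pop_size, 2 * pop_size - s))
    # Sample without replacement from each population independently.
    sample_a = set(rng.sample(pop_a, n_a))
    sample_b = set(rng.sample(pop_b, n_b))
    return len(sample_a & sample_b)

rng = random.Random(0)
draws = [sample_overlap(rng=rng) for _ in range(10_000)]
print("min:", min(draws), "max:", max(draws), "mean:", sum(draws) / len(draws))
```

Under these assumptions each shared member is sampled from both populations with probability (n_a/16)(n_b/16) = 0.25, so the expected observed overlap is 5 × 0.25 = 1.25, well below the true s = 5, and individual experiments scatter around that value.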

Fig 2.

Inference and uncertainty using the posterior.

The posterior distribution over s is plotted for the realistic scenario of n_a = 47, n_b = 32, and n_ab = 20 [line; Eq (6)]. The posterior mean provides our estimate of the true overlap [open circle; Eq (7)], and the interval accounting for at least 90% of the area under the posterior curve provides an equal-tailed 90% credible interval [shading; Eq (8)]. The estimate is shown for comparison [black cross; Eq (1)], and is typically less than or equal to .
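Eqs (6)–(8) are not reproduced in this section, so the following is only a sketch of how such a posterior, its mean, and an equal-tailed credible interval could be computed. It assumes a uniform prior over s, two repertoires of equal size N = 60 (suggested by the [0, 60] range used in the other captions), and a two-stage hypergeometric likelihood derived from sampling without replacement. All function names are hypothetical, and the likelihood may differ in form from the paper's Eq (6).

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(k marked items among n draws without replacement; N items, K marked)."""
    if k < 0 or k > K or n - k < 0 or n - k > N - K:
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def likelihood(n_ab, s, n_a, n_b, N):
    """P(n_ab | s): sum over k, the number of shared members captured in sample a."""
    total = 0.0
    for k in range(n_ab, min(s, n_a) + 1):
        total += hypergeom_pmf(k, N, s, n_a) * hypergeom_pmf(n_ab, N, k, n_b)
    return total

def posterior(n_a, n_b, n_ab, N=60):
    """Posterior over s = 0..N under an assumed uniform prior."""
    weights = [likelihood(n_ab, s, n_a, n_b, N) for s in range(N + 1)]
    z = sum(weights)
    return [w / z for w in weights]

def posterior_mean(post):
    return sum(s * p for s, p in enumerate(post))

def credible_interval(post, mass=0.90):
    """Equal-tailed interval: at most (1 - mass) / 2 probability in each tail."""
    tail = (1.0 - mass) / 2.0
    cum, lo, hi = 0.0, None, len(post) - 1
    for s, p in enumerate(post):
        cum += p
        if lo is None and cum > tail:
            lo = s
        if cum >= 1.0 - tail:
            hi = s
            break
    return lo, hi

post = posterior(n_a=47, n_b=32, n_ab=20)
print("posterior mean:", posterior_mean(post))
print("90% credible interval:", credible_interval(post))
```

Because the interval endpoints are integers, the interval covers at least 90% of the posterior mass rather than exactly 90%, which matches the "at least 90%" wording in the caption.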

Fig 3.

Bayesian repertoire overlap consistently estimates true overlap.

Repertoires with true overlaps ranging from 0 to 60 were subsampled in simulations. As sampling rates increase from n_a = n_b = 30 (left) to 40 (middle) and to 50 (right), the estimates of BRO (colored circles) approach the true values (dotted lines) symmetrically. Estimates from (crosses) approach the true values from below, systematically underestimating the true overlap. This bias is worse with lower sampling rates [7]. Similar results are found when n_a ≠ n_b, and when the total repertoire sizes are different from each other (S1 Fig).
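The downward bias of a naive estimate can be illustrated by simulation. The sketch below assumes that the comparison estimate is pairwise type sharing in its usual form, 2 n_ab / (n_a + n_b), rescaled to [0, 60], and that both repertoires contain N = 60 members. Both are assumptions, since Eq (1) is not reproduced in this section, and the function names are hypothetical.

```python
import random

N = 60  # assumed repertoire size

def observed_overlap(s, n_a, n_b, rng):
    """Subsample two repertoires sharing s members; return the observed overlap n_ab."""
    rep_a = range(N)
    rep_b = list(range(s)) + list(range(N, 2 * N - s))
    return len(set(rng.sample(rep_a, n_a)) & set(rng.sample(rep_b, n_b)))

def pts_rescaled(n_ab, n_a, n_b):
    """Assumed naive estimate: pairwise type sharing rescaled from [0, 1] to [0, N]."""
    return N * 2 * n_ab / (n_a + n_b)

def mean_pts(s, n, trials=2000, seed=0):
    """Average rescaled estimate over repeated subsampling with n_a = n_b = n."""
    rng = random.Random(seed)
    total = sum(pts_rescaled(observed_overlap(s, n, n, rng), n, n)
                for _ in range(trials))
    return total / trials

for n in (30, 40, 50):
    print("n =", n, "mean naive estimate for s = 40:", round(mean_pts(40, n), 1))
```

Under these assumptions the expected naive estimate is s·n/N, so it approaches the true overlap from below as n grows and equals s only when sampling is complete (n = N).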

Fig 4.

Credible intervals quantify uncertainty in overlap estimates.

Using Eq (8), 90% credible intervals are shown as error bars around the point estimates for varying true overlap s. As the sampling rate increases from n_a = n_b = 30 (left) to 40 (middle) and to 50 (right), the credible intervals shrink, indicating a reduction in uncertainty. In expectation, 90% of intervals cover the true overlap (dotted line).

Fig 5.

Reevaluation of published results.

In 2010, Albrecht et al. compared var repertoires from 5 populations using pairwise type sharing (see Refs. [18, 19, 27] for original data details). (left) Reproduction of analysis of [19], rescaled from [0, 1]→[0, 60]. (middle) Reanalysis using Bayesian repertoire overlap [Eq (7)]. For all boxplots, boxes span inner quartiles; center lines show medians; whiskers extend to 2.5 and 97.5 percentiles. (right) Histograms of Bayesian repertoire overlap distributions from Amele and Ariquemes clones (data identical to those in middle boxplots) colored by width of credible interval [Eq (8)], a measure of uncertainty. Differences in uncertainties are driven primarily by sampling rates: Amele samples average sequences per parasite while Ariquemes clones average .

Fig 6.

Quantifying the decrease in uncertainty from increased sequencing.

Histograms show distributions of overlap estimates, computed using Eq (11), for various values of s, which are indicated by color-matched dotted lines. While all estimates are distributed around the true values of s, increasing the number of colonies c from 48 (top) to 96 (middle) and to 144 (bottom) substantially decreases the error of the estimates. For example, the bottom plot shows that successfully sequencing c = 144 colonies from each parasite is guaranteed to produce estimates that are off by at most 5 (8.3%) in either direction from the true s.
