Figure 1.
Outline of DivE species richness estimator.
DivE fits many models to rarefaction curves (black) and subsamples thereof (orange). Data are denoted by circles; fits by solid lines. Models are scored according to the following criteria: i) Discrepancy – mean percentage error between data points and model prediction; ii) Accuracy – error between full sample species richness (purple cross) and estimated species richness from subsample; iii) Similarity – area between subsample fit (orange) and full data fit (black); and iv) Plausibility – we require that S'(x) ≥ 0 and S''(x) ≤ 0. The best performing models are aggregated and extrapolated to estimate species richness. Model A performs poorly as criteria ii) and iii) are not satisfied. Model B performs well as all criteria are satisfied.
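The four scoring criteria can be illustrated with a minimal Python sketch. This is not the DivE implementation: the function names, the grid-based finite-difference check for plausibility, the trapezoid rule for the area, and the negative-exponential example model are all assumptions made for illustration, and any normalisation DivE applies to the scores is not reproduced here.

```python
import math

def discrepancy(xs, ys, model):
    """Criterion i: mean percentage error between data points and model predictions."""
    return 100.0 * sum(abs(model(x) - y) / y for x, y in zip(xs, ys)) / len(xs)

def accuracy(model_sub, x_full, s_full):
    """Criterion ii: relative error of the subsample fit at the full sample size."""
    return abs(model_sub(x_full) - s_full) / s_full

def similarity(model_sub, model_full, x_max, steps=1000):
    """Criterion iii: area between the two fitted curves on [0, x_max] (trapezoid rule)."""
    h = x_max / steps
    gaps = [abs(model_sub(i * h) - model_full(i * h)) for i in range(steps + 1)]
    return h * (sum(gaps) - 0.5 * (gaps[0] + gaps[-1]))

def plausible(model, x_max, steps=1000, tol=1e-9):
    """Criterion iv: S'(x) >= 0 and S''(x) <= 0, checked by finite differences on a grid."""
    h = x_max / steps
    s = [model(i * h) for i in range(steps + 1)]
    first = [b - a for a, b in zip(s, s[1:])]
    second = [b - a for a, b in zip(first, first[1:])]
    return all(d >= -tol for d in first) and all(d <= tol for d in second)

# Hypothetical fitted models (negative exponential form, parameters invented).
fit_full = lambda x: 100.0 * (1.0 - math.exp(-0.010 * x))
fit_sub = lambda x: 95.0 * (1.0 - math.exp(-0.011 * x))

print(accuracy(fit_sub, 500.0, fit_full(500.0)))
print(similarity(fit_sub, fit_full, 500.0))
print(plausible(fit_sub, 500.0))
```

A model such as Model A in the figure would score poorly on `accuracy` and `similarity` even if its `discrepancy` on the subsample is small.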
Figure 2.
Outline of DivE distribution generation algorithm.
A Truncated species frequency distribution with x individuals distributed among y species. The frequency of species Si after sampling x individuals is denoted Fx(Si). B Species accumulation data generated from frequency distribution. C An aggregate of the best performing models as returned by DivE is used to extrapolate to point (x+a, y+1), where the next species is predicted. D Species Sy+1 is assigned a frequency of (1 - pmax)(x+a), where pmax is the maximum-likelihood proportion of individuals occupied by the y previously observed species. The remaining pmax(x+a) individuals are distributed among species S1, …, Sy in proportion to their observed relative frequencies at x. Steps C and D are repeated until the predicted species richness is reached. See Text S1 for further details.
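Steps C and D can be sketched in Python as follows. This is a simplified illustration, not the published algorithm: `x_at_richness` stands in for the extrapolated aggregate model (here the inverse of an invented negative exponential curve), and `p_max_fn` is a placeholder for the maximum-likelihood proportion described in Text S1, whose actual form is not given in this caption. On repeated steps the sketch rescales species in proportion to their current frequencies, which is also an assumption.

```python
import math

def add_next_species(freqs, x_next, p_max):
    """One pass of steps C-D: existing species share p_max * x_next individuals
    in proportion to their current frequencies; the new species receives the
    remaining (1 - p_max) * x_next individuals."""
    total = sum(freqs)
    rescaled = [p_max * x_next * f / total for f in freqs]
    return rescaled + [(1.0 - p_max) * x_next]

def generate_distribution(freqs, x_at_richness, target_richness, p_max_fn):
    """Repeat steps C-D until the predicted species richness is reached."""
    freqs = list(freqs)
    while len(freqs) < target_richness:
        x_next = x_at_richness(len(freqs) + 1)   # step C: where species y+1 appears
        freqs = add_next_species(freqs, x_next, p_max_fn(x_next))  # step D
    return freqs

# Hypothetical observed data: x = 100 individuals among y = 3 species.
observed = [50.0, 30.0, 20.0]

# Hypothetical model S(x) = K * (1 - exp(-r * x)) chosen so that S(100) = 3.
K = 10.0
r = -math.log(0.7) / 100.0
x_at_richness = lambda s: -math.log(1.0 - s / K) / r

# Placeholder only: gives the new species exactly one individual.
p_max_fn = lambda x_next: 1.0 - 1.0 / x_next

result = generate_distribution(observed, x_at_richness, 6, p_max_fn)
print([round(f, 2) for f in result])
```

After each step the frequencies sum to the extrapolated sample size x+a, and the relative abundances of the previously observed species are preserved.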
Figure 3.
Comparison of species richness estimators.
A–D The Chao1bc (blue), ACE (grey), Bootstrap (green), Good-Turing (black), and negative-exponential (orange) estimators are applied to in silico random subsamples of observed data. Examples for HTLV-1, microbial, and TCR data are shown. Estimates systematically increase with sample size in datasets where rarefaction curves do not plateau (e.g. in I, J, K). Where rarefaction curves do plateau (e.g. in L), estimates are consistent. E–H DivE (red) is applied to the same subsamples as the other estimators. Performance of DivE was evaluated by comparing its estimates (Ŝobs) with the known number of species Sobs in the full observed data (purple line), i.e. error = |Sobs - Ŝobs|/Sobs. In all datasets, DivE accurately estimates the species richness of the full observed data from subsamples of that data. I–L Corresponding HTLV-1, microbial and TCR rarefaction curves: arrows denote the size of the subsample to which each estimator was applied.
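The subsampling test can be illustrated with a short Python sketch that draws random subsamples without replacement, applies one existing estimator, and computes the relative error defined above. The Chao1bc formula used is the standard bias-corrected form, S_obs + f1(f1 - 1) / (2(f2 + 1)); whether it matches the exact implementation used for this figure is an assumption, and the abundance data are invented.

```python
import random
from collections import Counter

def chao1_bc(counts):
    """Bias-corrected Chao1: S_obs + f1 * (f1 - 1) / (2 * (f2 + 1)),
    where f1 and f2 are the numbers of singletons and doubletons."""
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return s_obs + f1 * (f1 - 1) / (2.0 * (f2 + 1))

def subsample(counts, n, seed=0):
    """Draw n individuals without replacement; return per-species counts."""
    individuals = [sp for sp, c in enumerate(counts) for _ in range(c)]
    drawn = random.Random(seed).sample(individuals, n)
    return list(Counter(drawn).values())

def relative_error(s_obs, s_hat):
    """error = |Sobs - S_hat| / Sobs."""
    return abs(s_obs - s_hat) / s_obs

# Hypothetical full data set: 90 individuals among 10 species.
full = [40, 25, 10, 5, 3, 2, 2, 1, 1, 1]
s_obs_full = len(full)

for n in (20, 40, 80):
    estimate = chao1_bc(subsample(full, n))
    print(n, round(estimate, 2), round(relative_error(s_obs_full, estimate), 3))
```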
Figure 4.
Comparison of estimators: Effect of sample size on estimated diversity.
Normalized gradients measuring proportional increase in estimated diversity against proportional increase in sample size. Normalized gradients (shown for each estimator and each patient data set in Table S1) were calculated by linear regression. For the HTLV-1 and microbial data, all estimators except DivE show large normalized gradients that are significantly positive. The TCR normalized gradients, though significantly positive, are small and do not show a substantial bias with sample size. *, **, and *** signify p<0.01, p<0.001, and p<0.0001 respectively; two-tailed binomial test (n = 14, 16, 20 for the HTLV-1, TCR and microbial data respectively).
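A normalized gradient of this kind can be sketched in Python as the least-squares slope of estimated diversity against sample size after both are rescaled. The caption does not state the normalization, so dividing each axis by its mean is an assumption here, as is the reading of the binomial test as a two-tailed sign test on whether gradients are positive. All data values are invented.

```python
import math

def normalized_gradient(sizes, estimates):
    """Least-squares slope of estimates against sizes after dividing each
    axis by its mean (assumed normalization). A value near 1 means the
    estimate grows in proportion to sample size; near 0 means no dependence."""
    mean_x = sum(sizes) / len(sizes)
    mean_y = sum(estimates) / len(estimates)
    xs = [s / mean_x for s in sizes]
    ys = [e / mean_y for e in estimates]
    bar_x = sum(xs) / len(xs)
    bar_y = sum(ys) / len(ys)
    sxy = sum((x - bar_x) * (y - bar_y) for x, y in zip(xs, ys))
    sxx = sum((x - bar_x) ** 2 for x in xs)
    return sxy / sxx

def two_tailed_sign_test(k, n):
    """Two-tailed binomial p-value for k positive gradients out of n
    under the null hypothesis that positive and negative are equally likely."""
    pmf = [math.comb(n, i) * 0.5 ** n for i in range(n + 1)]
    lower = sum(pmf[: k + 1])
    upper = sum(pmf[k:])
    return min(1.0, 2.0 * min(lower, upper))

# Hypothetical estimator whose output rises with sample size.
sizes = [1000, 2000, 3000, 4000]
biased = [210.0, 395.0, 610.0, 790.0]
print(normalized_gradient(sizes, biased))
print(two_tailed_sign_test(14, 14))
```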
Table 1.
Comparison of estimator performance for TCR data.
Figure 5.
Existing estimators underestimate diversity in HTLV-1 infection.
For HTLV-1 Patient D, three samples are pooled. Rarefaction curves from the pooled sample (black circles) and a subsample (red circles) are shown. Chao1bc, ACE, Bootstrap, Good-Turing and negative exponential estimates (blue, grey, green, black, and orange lines respectively) from the subsample, and DivE estimates (red cross) from the same subsample are plotted. Existing estimators produce a single estimate of diversity, and so their estimates are shown as lines. The diversity in the blood must be at least as great as that observed by pooling the samples. All existing estimators estimate the total diversity to be less than that observed. Given that the observed diversity is likely to be a small fraction of the total diversity, this represents a considerable error. We used DivE to produce two estimates: the diversity in the pooled sample (i.e. in 15,000 cells, red cross) and the total diversity of the blood. DivE accurately estimates the pooled sample species richness from the subsample, and also predicts higher values of species richness in the blood, consistent with the unseen clones implied by the pooled rarefaction curve. See Figure S3 for further examples.
Figure 6.
Test of species richness estimators at different values of curvature parameter (Cp) using TCR data.
The curvature parameter Cp is plotted against the relative error (|Sobs - Ŝobs|/Sobs) of each estimator. Four patient data sets are shown: A total CD4+ from patient C; B total CD4+ from patient E; C total CD8+ from patient C; D total CD8+ from patient E. Each point represents an estimate from a subsample of data. Note that the plots have different y-axis scales and that the y-axes in C and D are segmented. Broadly, the accuracy of all estimators improves as Cp increases, and this improvement is more pronounced for DivE. For Cp > 0.1, DivE generally outperforms the existing estimators, but it is prone to error at very low values of Cp, when the rarefaction curve implies a near-constant rate of species accumulation.
Figure 7.
Validation of DivE distribution generation algorithm.
The DivE distribution generation algorithm (Figure 2) was applied to random samples (red dashed) of observed data (black solid). Accuracy was evaluated by comparing the estimated distribution (orange dashed) to the true distribution of the full observed data (black). Examples for HTLV-1 (A), TCR (B) and microbial (C) datasets are shown.
Table 2.
Performance of DivE frequency distribution generation algorithm.