Incorporating 16S Gene Copy Number Information Improves Estimates of Microbial Diversity and Abundance

doi:10.1371/journal.pcbi.1002743

Figure 1.

Conceptual diagram illustrating how variation in genomic 16S copy number could influence observed abundance of 16S gene sequences in a community.

Observed 16S gene sequence abundances (G) in an environmental sequencing data set (panel A) could be generated by a variety of underlying organismal abundance distributions (N; e.g. panels B or C), depending on the genomic copy number of the 16S gene (C) within individual cells of the organisms in the community (gray rectangles denote single cells, black symbols denote copies of the 16S gene from different organisms).
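
The relationship in this caption can be illustrated with a minimal sketch. It assumes the copy number of every taxon is known exactly, and the taxon names, counts, and the function name `estimate_organism_abundance` are hypothetical values chosen for illustration, not data or code from the paper.

```python
def estimate_organism_abundance(gene_counts, copy_numbers):
    """Divide each taxon's 16S gene count by its genomic copy number,
    then normalize so the estimated organismal abundances sum to 1."""
    cells = {taxon: gene_counts[taxon] / copy_numbers[taxon] for taxon in gene_counts}
    total = sum(cells.values())
    return {taxon: count / total for taxon, count in cells.items()}


# Hypothetical community: observed gene counts (G) and copy numbers (C).
gene_counts = {"taxon_a": 8, "taxon_b": 4, "taxon_c": 2}
copy_numbers = {"taxon_a": 8, "taxon_b": 2, "taxon_c": 1}

relative_abundance = estimate_organism_abundance(gene_counts, copy_numbers)
for taxon, value in relative_abundance.items():
    print(taxon, round(value, 3))
```

In this toy case taxon_a contributes most of the observed genes but is the least abundant organism once its eight copies per cell are accounted for.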

Figure 2.

Conceptual diagram showing how copy number can be estimated for environmental sequences using a reference phylogeny.

Given a reference phylogeny with copy number known for species A, B, and C, trait values for a hypothetical novel taxon or sequence X (A) can be estimated in a phylogenetically independent contrasts framework by rerooting the phylogeny at the ancestor of X and its closest relative in the reference phylogeny (B). After rerooting, a predicted trait value and standard error for X can be calculated using ancestral state reconstruction.

Figure 3.

Taxa-abundance and taxa-gene curves (number of species in log₂-abundance octaves) fit to a simulated distribution of organismal abundances (N_i; black) and resulting gene abundances (G_i; red) for 5000 species.

For each species, abundance P(N) was simulated as a zero-truncated lognormal distribution (mean = 2, variance = 4), copy number P(C) was simulated as a zero-truncated Poisson distribution (mean = 4, variance = 4), and P(G) was calculated as P(G) = P(N)P(C) following Equation 3.
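
A simulation of this kind could be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the lognormal parameters are taken to be on the log scale (mu = 2, sigma = 2), "zero-truncated" is implemented by rounding and redrawing zeros, the Poisson rate is set to 4 before truncation, and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def poisson(rng, lam):
    """Draw one Poisson variate with Knuth's multiplication method (small lam only)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def zero_truncated(draw):
    """Redraw until the value is at least 1 (rejection sampling)."""
    while True:
        value = draw()
        if value >= 1:
            return value


def octave_counts(values):
    """Count species in each log2-abundance octave (floor of log2)."""
    return Counter(int(math.log2(v)) for v in values)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_species = 5000

# Organismal abundances N_i: rounded lognormal draws, zeros redrawn (assumption).
N = [zero_truncated(lambda: round(rng.lognormvariate(2.0, 2.0))) for _ in range(n_species)]
# Copy numbers C_i: Poisson draws with rate 4, zeros redrawn (assumption).
C = [zero_truncated(lambda: poisson(rng, 4.0)) for _ in range(n_species)]
# Gene abundances G_i: organisms times gene copies per organism.
G = [n * c for n, c in zip(N, C)]

organism_octaves = octave_counts(N)
gene_octaves = octave_counts(G)
for octave in sorted(set(organism_octaves) | set(gene_octaves)):
    print(octave, organism_octaves.get(octave, 0), gene_octaves.get(octave, 0))
```

Because every copy number is at least 1, each G_i is at least as large as the matching N_i, so the gene curve sits at or to the right of the organism curve along the octave axis.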

Figure 4.

Rank abundance distributions and estimated species pool richness from 100 simulations of communities of (A) 1000, (B) 10000, and (C) 50000 individual genes or organisms sampled from an underlying distribution of abundances (P(N)) and genes (P(G)).

For each simulation, a distribution of organismal abundances (P(N); black) and resulting gene abundances (P(G); red) was generated for 5000 species following the methods described in the caption for Figure 3. Rank-abundance distributions are presented for a single randomly chosen simulation at each sampling intensity. For each simulation, we estimated the number of species S in the species pool using a parametric method [22], [23], with the true S = 5000. At all sampling intensities, estimates of species pool size based on N were significantly higher, and closer to the true value, than estimates based on G (ANOVA; P<0.01).

Figure 5.

Bacterial reference phylogeny with genomic 16S copy number indicated with black bars (bar length proportional to genomic 16S copy number) and taxonomic order (determined using RDP Taxonomic Classifier [43]) indicated with color shading of branches.

Figure 6.

The strength of correlations between true abundance (n_i) and either observed gene abundance (g_i) or estimated relative abundance for 100 simulated communities, each generated by drawing 100 taxa from the 484-taxon reference phylogeny and then estimating the phylogenetic placement and copy number for those taxa.

We simulated phylogenetic placement and copy number estimation using full-length 16S sequences and sequences trimmed to the 351 bp V2V3 hypervariable region to simulate pyrosequencing data. Letter codes at the top of the panel indicate simulations that differed according to a Tukey HSD test (P<0.05; simulations that share a letter are not significantly different).

Figure 7.

Rank-abundance distributions for two empirical microbial community data sets from (A) human skin microbiome and (B) ocean bacterial communities.

Solid line indicates the expected relative abundance distribution under a lognormal distribution. Gray points are the observed relative gene abundances (g_i) of sequences in each data set, and black points are the estimated relative organismal abundances.

Figure 8.

Comparison of relative abundance of the 20 most abundant taxonomic classes in (A) human microbiome and (B) ocean data sets based on observed gene abundances (g_i) and estimated organismal abundances.

Figure 9.

Hierarchical clustering (complete linkage) of communities from the microbiome of a human (subject F1-3 in [13]) based on phylogenetic similarity (weighted UniFrac distance metric) for observed relative gene abundances g_i (A) and for estimated organismal relative abundances (B).

Samples are shaded based on human microbiome habitat characteristics (black = gut/mouth, gray = moist skin sites, white = dry skin sites).