A statistical model for describing and simulating microbial community profiles

doi:10.1371/journal.pcbi.1008913

Fig 1.

A hierarchical model for microbial community feature profiles.

A) SparseDOSSA comprises a hierarchical model to capture the generation mechanism of microbial sequencing counts, including components for “hidden” absolute abundances, sequencing depth (and thus compositional relative abundances), zero inflation, and feature-feature and feature-environment interactions. Notations not defined in the figure include the cumulative density function (CDF) for the absolute abundance of feature A_j, and μ_D and σ²_D, the mean and variance of the log normal sequencing depth distribution. B) SparseDOSSA can be fitted by users to varied microbial community types using cross-validation procedures; the software also provides pre-trained models for human microbiome template datasets. C) This allows for simulation of synthetic datasets that are either null or spiked with "true positive" associations, to facilitate microbiome benchmarking or power analysis studies.
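The layers named in panel A can be illustrated with a minimal generative sketch. This is not the SparseDOSSA implementation: it assumes independent features with zero-inflated log-normal absolute abundances and multinomial read sampling, and it omits the feature-feature and feature-environment interactions. The function name `simulate_sample` and the parameters `pi0`, `mu`, `sigma`, `mu_depth`, and `sigma_depth` are hypothetical and chosen only for illustration.

```python
import random


def simulate_sample(features, mu_depth, sigma_depth, rng):
    """Draw one sample of read counts from a simplified hierarchical model.

    features: list of (pi0, mu, sigma) tuples (hypothetical parameters):
      pi0 is the probability that the feature is absent, and mu/sigma are the
      log-scale mean and standard deviation of its abundance when present.
    """
    # Hidden absolute abundances: zero with probability pi0, else log-normal.
    absolute = [
        0.0 if rng.random() < pi0 else rng.lognormvariate(mu, sigma)
        for pi0, mu, sigma in features
    ]
    total = sum(absolute)

    # Log-normal sequencing depth, rounded to a whole number of reads.
    depth = max(1, round(rng.lognormvariate(mu_depth, sigma_depth)))
    if total == 0:
        # Every feature was absent, so there is nothing to sequence.
        return [0] * len(features), depth

    # Compositional relative abundances.
    relative = [a / total for a in absolute]

    # Multinomial read sampling: assign each read to a feature.
    counts = [0] * len(features)
    for idx in rng.choices(range(len(features)), weights=relative, k=depth):
        counts[idx] += 1
    return counts, depth
```

Calling `simulate_sample([(0.0, 1.0, 0.5), (0.5, 0.0, 1.0)], 8.0, 0.2, random.Random(0))` returns one count vector together with its sequencing depth; repeating the call yields a synthetic count table.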


Fig 2.

SparseDOSSA accurately recapitulates different microbial community structures.

We compared SparseDOSSA 2 simulated microbial counts versus those of three human microbiome training template datasets (Stool, Vaginal, and IBD). A) Bray-Curtis ordination shows global agreement between SparseDOSSA simulated microbial abundance profiles and those of their originating real-world populations. B) This was quantified by PERMANOVA R² statistics, showing that SparseDOSSA simulated samples were significantly less systematically differentiated from their targets than samples simulated by the existing DM and metaSPARSim methods in almost all cases (Wilcoxon rank sum test p-values included in S3 Table). R² values for randomly split original real-world data are included as baseline controls. C) Representative features from each environment are similarly distributed between real-world and SparseDOSSA simulated samples, as shown in empirical cumulative distribution functions (CDFs) of log10 relative abundances (with pseudo value 1e-6 to visually represent zeros). D) Per-feature Kolmogorov-Smirnov (K-S) summary statistics quantify that SparseDOSSA outperforms existing methods in simulating realistic feature-level relative abundance distributions. First, the similarity between the model-simulated feature abundance distribution and that in the real-world dataset is quantified with K-S statistics. Then, the K-S statistics for SparseDOSSA and the other two models (DM and metaSPARSim) are plotted on the x- and y-axis, respectively (each point represents one feature; smaller K-S statistics indicate better approximation). Lastly, the K-S statistics of SparseDOSSA versus other models are formally tested using Wilcoxon signed rank tests (p-values are significant and included in S4 Table).
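The per-feature comparison in panel D rests on the two-sample K-S statistic, which is the largest vertical gap between two empirical CDFs. The following is a minimal generic sketch of that statistic, not the authors' analysis code; the function name `ks_statistic` is hypothetical.

```python
def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the empirical CDFs of x and y."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Advance both pointers past every value <= v so ties are handled.
        while i < n and xs[i] <= v:
            i += 1
        while j < m and ys[j] <= v:
            j += 1
        # i / n and j / m are the two empirical CDFs evaluated at v.
        d = max(d, abs(i / n - j / m))
    return d
```

Applied to one feature at a time, `x` would hold its relative abundances in the real-world dataset and `y` those in a simulated dataset. Because the statistic depends only on the ordering of values, it is the same whether or not the abundances are log-transformed.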


Fig 3.

SparseDOSSA can add feature-phenotype and feature-feature associations to modeled microbial community simulations.

A,B) SparseDOSSA 2 correctly simulated feature-phenotype associations targeting the prescribed non-zero relative abundance (A) and prevalence (B) effect sizes of the spiked features, while maintaining non-associations of null features. True associated (spiked) microbial features (red) are well differentiated from null features (black) through Bonferroni corrected p-values (non-significant features marked in gray; test based on linear/generalized linear regression against the spiked metadata variable, see Methods for details). The horizontal dashed lines indicate true spike-in effect sizes: the red lines mark the positive and negative true effect sizes, and the black line marks the null effect (0). C) SparseDOSSA can also prescribe feature-feature associations. Bottom left triangles are Spearman correlations in the simulated absolute abundances. As prescribed, only true association feature pairs are correlated. Top right triangles are Spearman correlations in the corresponding simulated relative abundances. Note that in this example, Spearman correlation does not differentiate between true (“biological”) covariations and those induced spuriously by compositionality (as is also the case in the underlying data on which SparseDOSSA’s model is fit). As expected, both true signals and spurious correlations caused by compositionality can be observed for such data. TP: true positives.
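The compositionality effect described for panel C can be shown with a small deterministic sketch. It uses the extreme case of a two-feature community, chosen only because the outcome is exact: the absolute abundances below are uncorrelated by rank, yet their relative abundances are perfectly anti-correlated because they must sum to one. The helper names `ranks` and `spearman` and the toy abundance values are hypothetical.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy absolute abundances of two features across four samples.
abs_a = [1.0, 2.0, 3.0, 4.0]
abs_b = [2.0, 5.0, 1.0, 3.0]

# Relative abundances after normalizing each sample to sum to one.
rel_a = [a / (a + b) for a, b in zip(abs_a, abs_b)]
rel_b = [b / (a + b) for a, b in zip(abs_a, abs_b)]

rho_absolute = spearman(abs_a, abs_b)  # no rank correlation
rho_relative = spearman(rel_a, rel_b)  # induced purely by normalization
```

With more than two features the induced correlations are weaker than in this extreme case, but they arise from the same sum-to-one constraint.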


Fig 4.

SparseDOSSA enables comparative benchmarking and power analysis of microbial community statistical association tests.

For any originating community type of interest, datasets simulated from a SparseDOSSA model fit can be spiked with known "phenotypes" and feature effect sizes to estimate method performance (power, FPR, etc.) during (A) benchmarking as well as (B) power analysis, across controlled combinations of potential effect sizes and sample sizes. Points indicate average performance across simulation repetitions and error bars indicate standard error (Methods).
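The logic of estimating power and FPR from repeated simulations can be sketched as follows. This is a generic illustration under strong simplifying assumptions, not the benchmarking pipeline of the paper: a single Gaussian feature with known unit variance stands in for a simulated microbial feature, and a two-sided z-test stands in for the association methods under evaluation. The function names and default values are hypothetical.

```python
import math
import random


def z_test_pvalue(x, y, sd=1.0):
    """Two-sided z-test p-value for a difference in means, assuming known sd."""
    diff = sum(x) / len(x) - sum(y) / len(y)
    z = diff / (sd * math.sqrt(1 / len(x) + 1 / len(y)))
    return math.erfc(abs(z) / math.sqrt(2))


def rejection_rate(effect, n_per_group, reps=200, alpha=0.05, seed=0):
    """Fraction of repetitions in which the test rejects, with its standard error.

    With effect != 0 this estimates power; with effect == 0 it estimates FPR.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        case = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
        hits += z_test_pvalue(case, control) < alpha
    rate = hits / reps
    # Binomial standard error of the estimated rate.
    se = math.sqrt(rate * (1 - rate) / reps)
    return rate, se
```

Evaluating `rejection_rate` over a grid of `effect` and `n_per_group` values gives the kind of effect-size by sample-size table that a power analysis summarizes.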


Fig 5.

SparseDOSSA correctly models the effects of diet and time on the murine gut microbiome by reproducing effects from amplicon sequencing profiles.

A) SparseDOSSA 2 was fitted to subsets of samples from [24] that included up to three time points each from collections of mice fed chow, raw or cooked tubers, and meat. The resulting models were then used to simulate controlled microbial community profiles, which correctly reproduced the beta-diversity structures present in the original study (MDS ordination by Bray-Curtis dissimilarities, corresponding to Fig 1A of [24]). The SparseDOSSA model was also able to model and synthetically replicate changes in "Bacteroidetes" and "Firmicutes" phyla in response to raw vs. cooked diets, including B) overall community alpha-diversity (Shannon index), C) the resulting "Firmicutes" vs. "Bacteroidetes" ratio, and D) overall whole-community effective biomass. These correspond to [24]’s Fig 1F–1H, respectively. TRF = raw tuber (free-fed); TCF = cooked tuber (free-fed); TCR = cooked tuber (restricted ration).
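Two of the summaries in panels B and C, the Shannon index and the phylum ratio, are simple functions of a sample's counts. The sketch below shows their standard definitions; it is not taken from the paper's code, and the function names and toy counts are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon alpha-diversity H = -sum(p * ln p) over features with p > 0."""
    total = sum(counts)
    return sum(-(c / total) * math.log(c / total) for c in counts if c > 0)


def phylum_ratio(phylum_counts, numerator="Firmicutes", denominator="Bacteroidetes"):
    """Ratio of two phylum-level totals within one sample."""
    return phylum_counts[numerator] / phylum_counts[denominator]


# Toy sample with four equally abundant features.
h_even = shannon_index([25, 25, 25, 25])

# Toy phylum-level totals for one sample.
fb_ratio = phylum_ratio({"Firmicutes": 300, "Bacteroidetes": 100})
```

An even community of k features attains the maximum Shannon index ln(k), while a community dominated by a single feature has an index of zero.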
