BMDD: A probabilistic framework for accurate imputation of zero-inflated microbiome sequencing data

doi:10.1371/journal.pcbi.1013124

Fig 1.

Model fit and imputation performance of BMDD.

a: BMDD fit on a real human gut microbiome dataset from the American Gut Project. The blue curve shows the marginal density of the posterior Dirichlet distribution obtained by fitting the Dirichlet-multinomial model, and the red curve shows the kernel density estimate based on the posterior means obtained from the BMDD fit. b: Overview of BMDD. The upper part shows the data generation mechanism of BMDD, and the lower part shows the model computation and inference. Deeper colors indicate higher abundances, and blanks indicate zeros. Details can be found in Section 2.1. c: Imputation performance of BMDD compared with competing methods across 15 distance metrics between the estimated and true composition matrices under simulation setting S1. Performance improvement is expressed as the percentage reduction in each distance metric relative to the respective method. For SAVER, values above zero (the dashed horizontal line) give the distance reduction of BMDD relative to SAVER, and values below zero give the distance reduction of SAVER relative to BMDD. The same convention applies to the other competing methods.
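The signed percentage reduction plotted in panel c can be illustrated with a small sketch. The exact formulas did not survive in this caption, so the version below is an assumption based on the wording "relative to the respective method": when BMDD has the smaller distance, the reduction is taken relative to the competing method and reported as positive; otherwise it is taken relative to BMDD and reported as negative. The function name and arguments are hypothetical.

```python
def signed_percent_reduction(d_bmdd, d_other):
    """Signed percentage reduction between two non-negative distances.

    Assumed convention (not confirmed by the caption):
      positive -> BMDD's distance is smaller; reduction relative to the other method
      negative -> the other method's distance is smaller; reduction relative to BMDD
    """
    if d_bmdd <= d_other:
        if d_other == 0:
            return 0.0  # both distances are zero, so there is nothing to reduce
        return 100.0 * (d_other - d_bmdd) / d_other
    # Here d_bmdd > d_other >= 0, so the denominator is strictly positive.
    return -100.0 * (d_bmdd - d_other) / d_bmdd


# Hypothetical distances between estimated and true composition matrices.
print(signed_percent_reduction(2.0, 4.0))  # BMDD is closer to the truth
print(signed_percent_reduction(4.0, 2.0))  # the competing method is closer
```

Under this convention the value is bounded between -100 and 100, which keeps methods with very different error scales on one axis.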

Fig 2.

Imputation performance comparison of BMDD with competing methods across 15 distance metrics between the estimated and true composition matrices under settings S2–S9.

Performance improvement is expressed as the percentage reduction in each distance metric relative to the respective method. For SAVER, values above zero (the dashed horizontal line) give the distance reduction of BMDD relative to SAVER, and values below zero give the distance reduction of SAVER relative to BMDD. The same convention applies to the other competing methods.

Fig 3.

Execution time of the BMDD algorithm.

Computing environment: R version 4.3.1 (2023-06-16); platform: aarch64-apple-darwin20; CPU: Apple M2 Max; memory: 32 GB. The y-axis shows the time (in seconds) that the BMDD algorithm needs to complete one iteration.

Fig 4.

Performance of BMDD in differential abundance analysis of microbiome data.

a: Simulation results for differential abundance analysis under the gamma model. Empirical false discovery rates and true positive rates were averaged over 500 simulation runs. Error bars represent the 95% confidence intervals, and the dashed vertical line indicates the target FDR level of 0.05. b: (Bottom) Number of discoveries vs. target FDR level for the real datasets; (Top) Empirical FDR vs. target FDR level for the shuffled real datasets. The dashed gray line represents the diagonal. The results were averaged over 100 simulation runs. c: Overlaps of differential taxa at a target FDR level of 0.1 for the four real datasets, IBD-1, IBD-2, IBD-3 and IBD-4.
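As a reminder of how the empirical rates in this figure are conventionally computed, the sketch below uses the standard definitions: the false discovery proportion of a run is the share of discoveries that are not truly differential, the true positive rate is the share of truly differential taxa that were discovered, and both are averaged over runs. The function names and the toy taxa labels are hypothetical, and the paper's exact procedure may differ in details such as how runs with no discoveries are handled.

```python
def fdp_and_tpr(discovered, truly_differential):
    """False discovery proportion and true positive rate for one simulation run."""
    discovered = set(discovered)
    truth = set(truly_differential)
    true_pos = len(discovered & truth)
    false_pos = len(discovered - truth)
    # Assumed convention: a run with no discoveries contributes an FDP of 0.
    fdp = false_pos / max(len(discovered), 1)
    tpr = true_pos / len(truth) if truth else 0.0
    return fdp, tpr


def average_over_runs(runs):
    """Average FDP (the empirical FDR) and TPR over (discovered, truth) pairs."""
    results = [fdp_and_tpr(d, t) for d, t in runs]
    n = len(results)
    return sum(r[0] for r in results) / n, sum(r[1] for r in results) / n


# Two toy runs with made-up taxa labels.
runs = [({"t1", "t2"}, {"t1"}), (set(), {"t1"})]
print(average_over_runs(runs))
```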

Fig 5.

Simulation results for differential abundance analysis under the log-normal, Poisson, and negative-binomial models.

Error bars represent the 95% confidence intervals and the dashed vertical line indicates the target FDR level of 0.05.

Fig 6.

The BMDD model.

The diagram shows the model hyperparameters, the unobserved mode assignment variable, the Dirichlet parameter, the unobserved true composition generated from the Dirichlet distribution with that parameter, and the observed count data, which are generated from the true composition and a known sequencing depth N.
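The data generation mechanism described in this caption can be sketched as a short simulation. This is an illustration under stated assumptions, not the authors' implementation: the mode assignment is assumed to be an independent Bernoulli draw per taxon, the Dirichlet parameter is assumed to switch between a low-mode and a high-mode value per taxon, and the counts are assumed to be multinomial given the composition and depth N. All names (`pi`, `alpha_low`, `alpha_high`, `simulate_sample`) are hypothetical.

```python
import random
from collections import Counter


def simulate_sample(pi, alpha_low, alpha_high, depth, rng):
    """Draw one sample from an assumed bimodal Dirichlet-multinomial scheme."""
    m = len(pi)
    # Assumption: mode assignment is Bernoulli(pi[j]) per taxon; 1 selects the high mode.
    delta = [1 if rng.random() < pi[j] else 0 for j in range(m)]
    # Assumption: the Dirichlet parameter is picked per taxon by the mode assignment.
    alpha = [alpha_high[j] if delta[j] else alpha_low[j] for j in range(m)]
    # Dirichlet draw via normalized Gamma(alpha_j, 1) variates.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    composition = [g / total for g in gammas]
    # Assumption: counts are multinomial with the known sequencing depth.
    draws = Counter(rng.choices(range(m), weights=composition, k=depth))
    counts = [draws.get(j, 0) for j in range(m)]
    return delta, composition, counts


rng = random.Random(0)  # fixed seed so the sketch is reproducible
delta, composition, counts = simulate_sample(
    pi=[0.5] * 4, alpha_low=[0.1] * 4, alpha_high=[5.0] * 4, depth=1000, rng=rng
)
print(delta, counts)
```

Taxa assigned to the low mode tend to receive very small proportions and therefore often zero counts at finite depth, which is the zero inflation the imputation targets.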
