A Bayesian method for detecting pairwise associations in compositional data

doi:10.1371/journal.pcbi.1005852

Fig 1.

BAnOCC infers log-basis correlation and precision matrices from compositions by modeling unobserved and unconstrained counts.

In the BAnOCC model, the observed compositions, C_i, are derived by normalizing the unobserved counts X_i. The BAnOCC model assumes that the X_i follow a log-normal distribution, parametrized by the log-basis mean m and covariance S. It places a normal prior on m, the GLASSO prior of [13] on the log-basis precision matrix O, and a hyperprior on the GLASSO shrinkage parameter λ (see Methods).
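The data-generating part of this model can be sketched in a few lines. The following is only an illustration under stated assumptions, not the authors' implementation: the function names `cholesky` and `simulate_compositions` and the example values of m and S are hypothetical, and the priors on m, O, and λ are not sampled here (m and S are simply fixed).

```python
import math
import random


def cholesky(S):
    """Return lower-triangular L with L * L^T = S (S symmetric positive definite)."""
    p = len(S)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - s)
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]
    return L


def simulate_compositions(m, S, n, seed=0):
    """Draw n samples with ln X_i ~ N(m, S), then normalize each X_i to C_i."""
    rng = random.Random(seed)
    L = cholesky(S)
    p = len(m)
    counts, compositions = [], []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # Correlated log-basis values: m + L z
        log_x = [m[j] + sum(L[j][k] * z[k] for k in range(j + 1)) for j in range(p)]
        x = [math.exp(v) for v in log_x]  # unobserved, unconstrained counts
        total = sum(x)
        counts.append(x)
        compositions.append([v / total for v in x])  # observed composition
    return counts, compositions


# Hypothetical log-basis mean and covariance for three features.
m = [1.0, 2.0, 0.5]
S = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 0.25]]
counts, compositions = simulate_compositions(m, S, n=50, seed=1)
```

Only `compositions` would be available to an analyst; `counts` plays the role of the unobserved X_i that the model reasons about.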


Fig 2.

Spurious correlation is not constrained as a function of feature count, mean, and variance.

(A) The approximate compositional correlation (based on Eq (3)) between features j and k when σ_X,jk = 0, as a function of the proportion of the total mean and proportion of total variability they contribute. (B-D) Examples of compositions that display positive (B) and negative (C-D) compositional correlations; in each, the top panel shows the correlation of the unconstrained and unobserved abundances across samples, while the bottom panel shows the correlation of the relative abundances across samples. The spurious correlation can be positive or negative, and of arbitrary magnitude, depending on the characteristics of the unobserved abundances.
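The effect can be reproduced with a small simulation. The sketch below is an illustration only and does not implement Eq (3): it draws independent log-normal abundances (so the true covariance between features is zero), normalizes them, and compares Pearson correlations before and after normalization. All function names and parameter values are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def independent_lognormal(log_means, log_sds, n, seed=0):
    """Draw n samples whose features are mutually independent log-normals."""
    rng = random.Random(seed)
    return [[math.exp(rng.gauss(mu, sd)) for mu, sd in zip(log_means, log_sds)]
            for _ in range(n)]


def closure(rows):
    """Normalize every sample so that its features sum to one."""
    return [[v / sum(row) for v in row] for row in rows]


def count_vs_composition_corr(log_means, log_sds, j=0, k=1, n=2000, seed=0):
    """Correlation of features j and k in the abundances and in the compositions."""
    x = independent_lognormal(log_means, log_sds, n, seed)
    c = closure(x)
    r_abundance = pearson([row[j] for row in x], [row[k] for row in x])
    r_composition = pearson([row[j] for row in c], [row[k] for row in c])
    return r_abundance, r_composition


# Two small, stable features next to one large, highly variable feature:
# both relative abundances are driven by the same denominator.
print(count_vs_composition_corr([0.0, 0.0, 3.0], [0.1, 0.1, 1.0]))

# Only two features: each relative abundance is one minus the other.
print(count_vs_composition_corr([0.0, 0.0], [0.5, 0.5]))
```

In the first setting the compositional correlation is strongly positive even though the abundances are independent; in the second it is forced to -1 because the two relative abundances sum to one.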


Fig 3.

BAnOCC infers the correct unobserved abundance correlation matrix in four scenarios simulated to be challenging.

Each column represents one of four datasets simulated to evaluate methods for identification of correlations from compositional data: “simple”, with no true correlations and no negative dominant correlation; “high spurious”, with no true correlations and the presence of a negative dominant correlation; “retained spike”, with several true correlations and no negative dominant correlation; and “reversed spike”, with several true correlations and a negative dominant correlation between two positively correlated features. The top row shows the true correlation matrix. The second row shows the uncorrected compositional correlations as estimated using the 1,000 samples in the simulated data. Each of the subsequent rows shows the log-basis correlation estimate and the associated inference using the compositional data for Pearson correlation, BAnOCC, and CCLasso, respectively.


Fig 4.

The BAnOCC model controls type I error while maintaining power.

Results are shown for simulated data comprising SparseDOSSA-derived compositions modeled on a low-diversity dataset with 14 features. The type I error rate is controlled at the 0.05 level for BAnOCC and approximately so for SparCC, CCLasso, and SPIEC-EASI (MB), but not for simplicial variation or Spearman correlation (on the composition, a negative control). BAnOCC maintains good power across all true correlation values, but as expected has better power for stronger true correlation values. Type I and type II error rates are determined by correct or incorrect rejection of H₀ based on inference (simplicial variation, SparCC, Spearman correlation, and BAnOCC) or estimation (CCLasso and SPIEC-EASI). * = rejection of H₀ based on estimation; ** = rejection of H₀ based on inference from credible intervals; all others, rejection of H₀ based on inference from p-values. See also S6 Fig and S7 Fig.


Table 1.

Methods included in an evaluation on simulated data.

Type I and type II error rates were determined for these methods by the correct or incorrect rejection of H₀; for CCLasso and SPIEC-EASI, no inferential methodology was provided and so the correct or incorrect estimation of w_jk as zero was used. Note that although SPIEC-EASI infers the precision matrix, construction of the true correlation matrix in the simulated data guarantees that the same elements will be non-zero in the precision and covariance matrices.
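The bookkeeping behind these error rates can be sketched as follows. This is an assumed reading rather than the evaluation code used for the table: each feature pair (j, k) with j < k is treated as one test, a pair is "called" non-zero either because its estimate is non-zero (estimation-based methods) or because its interval excludes zero (one plausible rule for interval-based inference), and the rates are proportions over the truly null and truly non-null pairs. All function names are hypothetical.

```python
def calls_from_estimate(estimate, tol=0.0):
    """Estimation-based call: an element is called non-zero if |estimate| > tol."""
    return [[abs(v) > tol for v in row] for row in estimate]


def calls_from_intervals(lower, upper):
    """Interval-based call (assumed rule): reject H0 when the interval excludes zero."""
    return [[not (lo <= 0.0 <= hi) for lo, hi in zip(row_lo, row_hi)]
            for row_lo, row_hi in zip(lower, upper)]


def error_rates(true_corr, called_nonzero):
    """Type I and type II error rates over the off-diagonal pairs j < k."""
    false_pos = true_neg = false_neg = true_pos = 0
    p = len(true_corr)
    for j in range(p):
        for k in range(j + 1, p):
            truly_nonzero = true_corr[j][k] != 0
            called = bool(called_nonzero[j][k])
            if truly_nonzero and called:
                true_pos += 1
            elif truly_nonzero:
                false_neg += 1
            elif called:
                false_pos += 1
            else:
                true_neg += 1
    nulls = false_pos + true_neg
    non_nulls = false_neg + true_pos
    type1 = false_pos / nulls if nulls else float("nan")
    type2 = false_neg / non_nulls if non_nulls else float("nan")
    return type1, type2


# Toy example: one true correlation among three features.
true_corr = [[1.0, 0.5, 0.0],
             [0.5, 1.0, 0.0],
             [0.0, 0.0, 1.0]]
estimate = [[1.0, 0.4, 0.1],
            [0.4, 1.0, 0.0],
            [0.1, 0.0, 1.0]]
type1, type2 = error_rates(true_corr, calls_from_estimate(estimate))
```

Power is one minus the type II error rate, so the same function covers both quantities reported for the simulated data.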


Fig 5.

BAnOCC association networks from the Human Microbiome Project.

The association networks inferred from three HMP body sites: stool (A), buccal mucosa (B), and posterior fornix (C). Using four chains with a minimum of 5000 iterations, we ran BAnOCC until convergence (see details in S3 Text). Only significant correlations stronger than 0.15 are shown (see S1 Table, S2 Table and S3 Table). The GLASSO prior results in sparse networks for these datasets, highlighting individual associations between taxa.
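The edge filter described here can be sketched as below. This is a hedged illustration, not the code used for the figure: it assumes that "significant" means the interval for a correlation excludes zero and that "stronger than 0.15" refers to the absolute value of the estimate. The function name `network_edges`, the taxon labels, and all numbers are hypothetical.

```python
def network_edges(taxa, estimate, lower, upper, min_abs=0.15):
    """Keep pairs whose interval excludes zero and whose |estimate| exceeds min_abs."""
    edges = []
    p = len(taxa)
    for j in range(p):
        for k in range(j + 1, p):
            significant = not (lower[j][k] <= 0.0 <= upper[j][k])
            if significant and abs(estimate[j][k]) > min_abs:
                edges.append((taxa[j], taxa[k], estimate[j][k]))
    return edges


# Hypothetical three-taxon example (only elements with j < k are read).
taxa = ["taxon_a", "taxon_b", "taxon_c"]
estimate = [[1.0, 0.40, -0.10],
            [0.0, 1.0, 0.30],
            [0.0, 0.0, 1.0]]
lower = [[1.0, 0.20, -0.18],
         [0.0, 1.0, -0.10],
         [0.0, 0.0, 1.0]]
upper = [[1.0, 0.60, -0.02],
         [0.0, 1.0, 0.70],
         [0.0, 0.0, 1.0]]
edges = network_edges(taxa, estimate, lower, upper)
```

In this toy input only the first pair survives: the second is significant but too weak, and the third is strong enough but its interval contains zero.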
