Interactions between species introduce spurious associations in microbiome studies

doi:10.1371/journal.pcbi.1005939

Fig 1.

Microbial interactions generate spurious associations.

(A) A hypothetical interaction network of five species together with their dynamics in disease. Only two species (shown in color) are directly linked to host phenotype. These directly linked species inhibit or promote the growth of the other members of the community (shown with arrows). As a result, all five species have different abundances between case and control groups. (B) Microbial interactions are visualized via a hierarchically clustered correlation matrix computed from the data in Ref. [21]. We used Pearson’s correlation coefficient between log-transformed abundances to quantify the strength of co-occurrence for each genus pair. Dark regions reflect strong interspecific interactions that could potentially generate spurious associations. See S1 Text for the list of the 47 most prevalent genera included in the plot.
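
The co-occurrence measure in panel (B) can be illustrated with a minimal sketch. It computes Pearson's correlation between log-transformed abundances for every genus pair. The function names, the dictionary input format, and the `pseudocount` used to avoid taking the logarithm of zero are assumptions made for this illustration and are not taken from the paper; the hierarchical clustering step used to order the matrix is omitted.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def log_correlation_matrix(abundances, pseudocount=1e-6):
    """Correlate log-abundances for every pair of genera.

    abundances: dict mapping a genus name to a list of per-sample abundances
                (hypothetical input format).
    pseudocount: small constant added before the log; an assumption of this
                 sketch, since the handling of zero counts is not given here.
    Returns the genus names and a square matrix (list of lists) of correlations.
    """
    names = list(abundances)
    logs = {g: [math.log(v + pseudocount) for v in abundances[g]] for g in names}
    matrix = [[pearson(logs[a], logs[b]) for b in names] for a in names]
    return names, matrix
```

Genera whose log-abundances rise and fall together across samples yield entries near +1, and genera that exclude each other yield entries near −1.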

Fig 2.

Signatures of indirect associations in synthetic and IBD data sets.

The synthetic data set was generated to match the statistical properties of the IBD data set from Ref. [21], but with a predefined number of 6 directly associated taxa (see S1 Text). (A) In synthetic data, DAA identifies no spurious associations and detects 4 out of 6 directly associated genera. All 6 genera and no false positives are detected when the sample size is increased further (S9 Fig). In sharp contrast, a large number of spurious associations is observed for metrics that rely on changes in abundance between cases and controls and do not correct for microbial interactions. The number of false positives grows rapidly with statistical power until all taxa are reported as significantly associated with the disease. (B) All spurious associations show substantial differences between cases and controls and, therefore, cannot be discarded based on their effect sizes. To quantify the effect size, we estimated the magnitude of the fold change for each genus. Specifically, we first computed the difference in the mean log-abundance between cases and controls and then exponentiated the absolute value of this difference. The plot shows how the median effect size for significantly associated genera depends on the sample size. Larger sample sizes result in a much higher number of associations, but only a small drop in the typical effect size. (C) and (D) are the same as (A) and (B), but for the IBD data set. The results are consistent between the two data sets, suggesting that most associations detected by traditional MWAS are spurious. The complete list of indirect associations inferred from the IBD data set is shown in S1 Text, and the results for different synthetic data sets are shown in S14 Fig.
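
The effect size described for panel (B) is the exponentiated absolute difference in mean log-abundance between cases and controls. The sketch below follows that description for a single genus; the function name and the `pseudocount` guarding against zero abundances are assumptions of this illustration.

```python
import math


def fold_change_effect_size(case_abundances, control_abundances, pseudocount=1e-6):
    """Magnitude of the fold change for one genus.

    Computes exp(|mean(log case) - mean(log control)|), so the result is
    always >= 1 regardless of the direction of the change.
    pseudocount is an assumption of this sketch (not specified in the caption).
    """
    mean_log_case = sum(
        math.log(v + pseudocount) for v in case_abundances
    ) / len(case_abundances)
    mean_log_control = sum(
        math.log(v + pseudocount) for v in control_abundances
    ) / len(control_abundances)
    return math.exp(abs(mean_log_case - mean_log_control))
```

Because of the absolute value, a genus that doubles in cases and a genus that halves in cases both receive an effect size of about 2. The median over all significantly associated genera could then be taken with `statistics.median`.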

Fig 3.

Network of direct associations with Crohn’s Disease.

Five species and four genera were found to be significantly associated with Crohn’s Disease (q < 0.05) after correcting for microbial interactions (S1 and S4 Figs). The links correspond to significant interactions (q < 0.05) between the taxa with J_ij > 0.27 or J_ij < −0.15; the width of the arrows reflects the strength of the interactions. For comparison, the correlation-based network for directly associated taxa is shown in S7 and S5 Figs, and a complete summary of correlations and interactions for all species pairs is provided in S1 Text.

Fig 4.

Direct associations analysis corrects p-value inflation and retains diagnostic accuracy.

(A) The distribution of p-values in DAA closely follows the expected uniform distribution. Because conventional MWAS does not correct for microbial interactions, it yields an excess of low p-values, which is a strong signature of indirect associations. For both methods, p-values were computed using a permutation test. The expected uniform distribution was obtained by sampling from a generator of uniform random numbers. The ranked plot of p-values visualizes their cumulative distribution functions; this is a variant of a Q-Q plot. (B) Direct associations are a small subset of all associations with IBD (see S4 Fig), yet they retain full power in classifying samples as cases or controls. In contrast, the classification power is substantially reduced for an equally sized subset of randomly chosen indirect associations. In each case, we used sparse logistic regression to train a classifier on 80% of the data and tested its performance on the remaining 20% (Methods). The shaded regions show one standard deviation obtained by repeatedly partitioning the data into training and validation sets. Identical results were obtained with random forest [64, 65] and support vector machine [66] classifiers (S8 Fig).
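
A permutation test of the kind mentioned for panel (A) can be sketched as follows. The caption does not state which test statistic was permuted, so this illustration assumes the absolute difference in group means (for example, of log-abundances); the function name, the number of permutations, and the seed are likewise hypothetical.

```python
import random


def permutation_p_value(values, is_case, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    values: per-sample measurements for one taxon (e.g. log-abundances).
    is_case: list of booleans, True for cases and False for controls.
    The test statistic (absolute difference in means) is an assumption
    of this sketch; the caption does not specify it.
    """
    rng = random.Random(seed)

    def statistic(labels):
        cases = [v for v, c in zip(values, labels) if c]
        controls = [v for v, c in zip(values, labels) if not c]
        return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

    observed = statistic(is_case)
    shuffled = list(is_case)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling the labels breaks any link between taxon and phenotype.
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)
```

Sorting the p-values obtained for all taxa and plotting them against sorted uniform random numbers gives a ranked plot like the one described above: under the null hypothesis the two curves should coincide, whereas an excess of low p-values pulls the observed curve below the uniform one.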
