Sparse and Compositionally Robust Inference of Microbial Ecological Networks

doi:10.1371/journal.pcbi.1004226

Fig 1.

Conditional independence vs correlation analysis for a toy dataset.

In an ecosystem, the abundance of any OTU potentially depends on the abundances of other OTUs in the ecological network. a) Here, we simulate abundances from a network in which OTU 3 directly influences (via some set of biological mechanisms) the abundances of OTUs 1, 2 and 4. The inference goal is to recover the underlying network from the simulated data. b) Absolute abundances of these four OTUs were drawn from a negative-binomial distribution across 500 samples according to the true network (as described in the Methods section). c) Computing all pairwise Pearson correlations yields a symmetric matrix showing patterns of association (positive correlations are green and negative are red). We thresholded entries of the correlation matrix to generate relevance networks. d) A threshold at |ρ| ≥ 0.35 (represented by dashed and solid edges) results in a network in which OTU 3 is connected to all other OTUs, with an additional connection between OTU 2 and OTU 4. A more stringent threshold at |ρ| ≥ 0.5 results in a sparser relevance network (notably missing the edge between OTU 3 and OTU 1), represented in d by solid edges only. Importantly, no single threshold recovers the true underlying hub topology. e) The inverse sample covariance matrix is a symmetric matrix whose entries are approximately zero if the corresponding OTU pairs are conditionally independent. f) The network inferred from the non-zero entries (colored in blue in e) identifies the correct hub network. Thus, it is possible to choose a threshold for the sample inverse covariance that faithfully recovers the true network. Such a threshold is not guaranteed to exist for correlation or covariance (the metric used by SparCC and CCREPE). Intuitively, this is because simultaneous direct connections can induce strong correlations between nodes that do not have direct relationships (e.g. OTU 2-4). Conversely, weak correlations can arise between directly connected nodes (e.g. OTU 1-3).
Although correlation is a useful measure of association in many contexts, it is a pairwise metric and therefore limited in a multivariate setting. In contrast, SPIEC-EASI’s estimates of the entries in the inverse covariance matrix depend on the conditional states of all available nodes. This feature helps SPIEC-EASI avoid detecting indirect network interactions.
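
The contrast between panels c-d and e-f can be reproduced in a few lines. The following is a minimal sketch, not the authors' simulation: it assumes Gaussian data rather than the negative-binomial counts used in the figure, and the precision-matrix values, the function name `hub_demo`, and the seed are hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical precision (inverse covariance) matrix for a 4-node hub network.
# Index 2 plays the role of OTU 3 and is the only node with direct links;
# the zero at [1, 3] encodes that OTU 2 and OTU 4 are conditionally independent.
PRECISION = np.array([
    [1.0,  0.0,  0.3,  0.0],
    [0.0,  1.0, -0.6,  0.0],
    [0.3, -0.6,  1.0, -0.6],
    [0.0,  0.0, -0.6,  1.0],
])


def hub_demo(n_samples=500, seed=0):
    """Return (correlation, partial correlation) matrices estimated from samples."""
    rng = np.random.default_rng(seed)
    cov = np.linalg.inv(PRECISION)
    x = rng.multivariate_normal(np.zeros(4), cov, size=n_samples)

    # Pairwise (marginal) Pearson correlations.
    corr = np.corrcoef(x, rowvar=False)

    # Invert the sample covariance and rescale it to partial correlations.
    sample_precision = np.linalg.inv(np.cov(x, rowvar=False))
    d = np.sqrt(np.diag(sample_precision))
    partial = -sample_precision / np.outer(d, d)
    np.fill_diagonal(partial, 1.0)
    return corr, partial


if __name__ == "__main__":
    corr, partial = hub_demo()
    print("correlation OTU 2-4:        ", round(corr[1, 3], 2))
    print("partial correlation OTU 2-4:", round(partial[1, 3], 2))
    print("partial correlation OTU 1-3:", round(partial[0, 2], 2))
```

Under these assumed values, the unlinked pair OTU 2-4 has a population correlation of about 0.65, larger in magnitude than the roughly 0.57 of the directly linked pair OTU 1-3, while its partial correlation is zero in the population and close to zero in a sample.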

Fig 2.

Workflow of the SPIEC-EASI pipeline.

The SPIEC-EASI pipeline consists of two independent parts for a) synthetic data generation and b) network inference. a) Synthetic data generation requires an OTU count table and a user-selected network topology. Internally, the parameters of a statistical distribution (the zero-inflated negative binomial model is suggested) are fit to the OTU marginals of the real data, and are combined with the randomly generated network in the Normal to Anything (NorTA) approach to generate correlated count data. b) Network inference proceeds in three stages on synthetic or real OTU count data. First, data are pre-processed and centered log-ratio (CLR) transformed to ensure compositional robustness. Next, the user selects one of two graphical model inference procedures: 1) neighborhood selection (the MB method) or 2) inverse covariance selection (the glasso method). SPIEC-EASI network inference assumes that the underlying network is sparse. Finally, we infer the correct model sparseness by the Stability Approach to Regularization Selection (StARS), which involves random subsampling of the dataset to find a network with low variability in the selected set of edges. SPIEC-EASI outputs include an ecological network (from the non-zero entries of the inverse covariance matrix) and an invertible covariance matrix. If the network was inferred from synthetic data, it can be compared with the input network to assess inference quality.
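
As an illustration of the first inference stage, the centered log-ratio transform can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `clr_transform` and the default pseudocount of 1 (added so that zero counts have a finite logarithm) are illustrative choices and are not taken from the caption.

```python
import math


def clr_transform(counts, pseudocount=1.0):
    """Centered log-ratio transform of a samples-by-OTUs count table.

    Each row (sample) is log-transformed and centered by its mean log value,
    which equals the log of the geometric mean of that sample.
    The pseudocount is an assumption made here to handle zero counts.
    """
    transformed = []
    for row in counts:
        logs = [math.log(c + pseudocount) for c in row]
        mean_log = sum(logs) / len(logs)
        transformed.append([v - mean_log for v in logs])
    return transformed


if __name__ == "__main__":
    table = [[10, 20, 70], [0, 5, 95]]  # two hypothetical samples, three OTUs
    for row in clr_transform(table):
        print([round(v, 3) for v in row])
```

Each transformed row sums to zero, and with a pseudocount of 0 the result does not change when a sample is multiplied by a constant, which is the sense in which the transform is insensitive to sequencing depth.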

Fig 3.

a) Bivariate illustration of the NorTA approach.

First, normal data incorporating the target correlation structure are generated. Uniform data are then generated for each margin via the normal cumulative distribution function. These are then converted to an arbitrary marginal distribution (Poisson and zero-inflated negative binomial shown as examples) via its quantile function. To generate realistic synthetic data, parameters for these margins are fit to real data. b) Examples of band-like, cluster, and scale-free network topologies.
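
The three NorTA steps in panel a can be sketched for the bivariate case with Poisson margins. This is an illustrative sketch only: the function names, the correlation of 0.7, and the Poisson means are hypothetical, and the margins are not fit to real data as they would be in the pipeline.

```python
import math
import random
from statistics import NormalDist


def poisson_quantile(u, lam):
    """Smallest k with P(X <= k) >= u for X ~ Poisson(lam)."""
    u = min(u, 1.0 - 1e-12)  # guard against u rounding to exactly 1
    k = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < u and k < 10_000:
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k


def norta_poisson_pairs(n, rho, lam_x, lam_y, seed=0):
    """Draw n correlated count pairs with Poisson margins."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        # Step 1: bivariate standard normal with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Step 2: map each margin to uniform via the normal CDF.
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        # Step 3: map uniforms to counts via the target quantile function.
        pairs.append((poisson_quantile(u1, lam_x), poisson_quantile(u2, lam_y)))
    return pairs


if __name__ == "__main__":
    for x, y in norta_poisson_pairs(5, rho=0.7, lam_x=4.0, lam_y=9.0):
        print(x, y)
```

The correlation of the resulting counts is generally not identical to the correlation of the underlying normal data, because the quantile mapping is nonlinear.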

Fig 4.

Precision-recall performance on synthetic datasets.

a) Area under the precision-recall curve (AUPR) vs. number of samples n is shown for different κ values. Colors: red = S-E(glasso), orange = S-E(MB), purple = SparCC, blue = CCREPE, green = Pearson correlation, black = random. Bars represent the average over 20 synthetic datasets, and error bars represent the standard error. Asterisks denote conditions under which SPIEC-EASI methods had significantly higher AUPR than all control methods (P<0.05 for all one-sided t-tests). b) Representative precision-recall curves for p = 68, n = 102, κ = 100; solid and dashed lines denote SPIEC-EASI and control methods, respectively.
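
For readers unfamiliar with precision-recall evaluation of inferred edges, a minimal sketch follows. It assumes each method assigns a confidence score to every candidate edge and that the set of true edges is known and non-empty; the step-wise area used here (average precision) is one common approximation of AUPR and is not necessarily the computation used for the figure. All names and scores are hypothetical.

```python
def precision_recall_curve(scores, true_edges):
    """Return (recall, precision) points obtained by adding edges in score order.

    scores: dict mapping an edge (a tuple of node labels) to a confidence score.
    true_edges: non-empty set of edges present in the true network.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    true_positives = 0
    curve = []
    for k, edge in enumerate(ranked, start=1):
        if edge in true_edges:
            true_positives += 1
        curve.append((true_positives / len(true_edges), true_positives / k))
    return curve


def average_precision(curve):
    """Step-wise area under the precision-recall curve."""
    area = 0.0
    previous_recall = 0.0
    for recall, precision in curve:
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


if __name__ == "__main__":
    # Hypothetical scores for the six possible edges among four OTUs.
    scores = {(1, 3): 0.9, (2, 3): 0.8, (2, 4): 0.7,
              (3, 4): 0.6, (1, 2): 0.1, (1, 4): 0.05}
    truth = {(1, 3), (2, 3), (3, 4)}
    print(average_precision(precision_recall_curve(scores, truth)))
```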

Fig 5.

a) Predicted degree distributions (colored) are overlaid with the true degree distribution (white) for n = 1360 samples, p = 205 OTUs, κ = 100.

Lighter shades correspond to regions of overlap between predicted and true distributions. Dissimilarity between the distributions is measured by the Kullback-Leibler (KL) divergence, D_KL. b) Bars represent the average D_KL over three independent sets of synthetic datasets (7 datasets per set); error bars represent standard error. Divergences were compared between S-E and control methods using one-sided t-tests; ***, **, and * correspond to P<0.001, P<0.01, and P<0.05, respectively.
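
A degree-distribution comparison of this kind can be sketched as below. The direction of the divergence (true relative to predicted), the smoothing constant `eps` that avoids division by zero in empty bins, and the cap on the maximum degree are assumptions made for this sketch; the caption does not specify them.

```python
import math
from collections import Counter


def degree_distribution(edges, n_nodes, max_degree):
    """Fraction of nodes at each degree 0..max_degree (larger degrees are capped).

    Nodes are assumed to be labeled 0..n_nodes-1.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    histogram = [0] * (max_degree + 1)
    for node in range(n_nodes):
        histogram[min(degree[node], max_degree)] += 1
    return [h / n_nodes for h in histogram]


def kl_divergence(p, q, eps=1e-9):
    """D_KL(p || q) in nats after adding eps to every bin and renormalizing."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    p_s = [x + eps for x in p]
    q_s = [x + eps for x in q]
    p_total, q_total = sum(p_s), sum(q_s)
    return sum((a / p_total) * math.log((a / p_total) / (b / q_total))
               for a, b in zip(p_s, q_s))


if __name__ == "__main__":
    true_edges = [(0, 1), (0, 2), (0, 3)]        # hypothetical hub network
    predicted_edges = [(0, 1), (1, 2), (2, 3)]   # hypothetical chain network
    p = degree_distribution(true_edges, n_nodes=4, max_degree=3)
    q = degree_distribution(predicted_edges, n_nodes=4, max_degree=3)
    print(kl_divergence(p, q))
```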

Fig 6.

a) Network reproducibility for inference methods (see main text for details).

Bars represent mean Hamming distance, and error bars are 95% confidence intervals. b) Visualization of edge overlap between networks inferred with SPIEC-EASI, SparCC, and CCREPE. c) Network visualizations: OTU nodes are colored by Family lineage (or Order, when the Family of the OTU is unknown), edges are colored by sign (positive: green, negative: red), and node diameter is proportional to the geometric mean of that OTU’s relative abundance.
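
The Hamming distance between two inferred networks counts the edges present in one network but not the other. A minimal sketch, assuming both networks are undirected, defined over the same set of OTUs, and given as edge lists (the function name and example edges are hypothetical; how the compared networks are produced is described in the main text):

```python
def hamming_distance(edges_a, edges_b):
    """Number of undirected edges that appear in exactly one of the two networks."""
    set_a = {frozenset(edge) for edge in edges_a}
    set_b = {frozenset(edge) for edge in edges_b}
    return len(set_a ^ set_b)  # symmetric difference


if __name__ == "__main__":
    network_a = [(1, 3), (2, 3), (3, 4)]
    network_b = [(3, 1), (2, 3), (2, 4)]  # edge order within a pair is ignored
    print(hamming_distance(network_a, network_b))
```

A smaller distance between networks inferred from different subsets of the same data indicates a more reproducible method.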
