Abstract
The development of high-throughput sequencing technologies for 16S rRNA gene profiling provides higher quality compositional data for microbial communities. Inferring the direct interaction network under a specific condition and understanding how the network structure changes between two different environmental or genetic conditions are two important topics in biological studies. However, the compositional nature and high dimensionality of the data make network and differential network recovery challenging. To address this problem, we propose two new loss functions that incorporate the data transformations developed for compositional data analysis into the D-trace loss for network and differential network estimation, respectively. The sparse matrix estimators are defined as the minimizers of the corresponding lasso penalized losses. Our method is straightforward to apply because its numerical solution is based on the ADMM algorithm. Simulations show that the proposed method outperforms other state-of-the-art methods in network and differential network inference under different scenarios. Finally, as an illustration, our method is applied to a mouse skin microbiome dataset.
Citation: He S, Deng M (2019) Direct interaction network and differential network inference from compositional data via lasso penalized D-trace loss. PLoS ONE 14(7): e0207731. https://doi.org/10.1371/journal.pone.0207731
Editor: Kazuhiro Takemoto, Kyushu Institute of Technology, JAPAN
Received: November 1, 2018; Accepted: July 2, 2019; Published: July 24, 2019
Copyright: © 2019 He, Deng. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data underlying the results presented in the study are available from https://www.nature.com/articles/ncomms3462.
Funding: This work was supported by National Science Foundation of China grant No.31471246. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Microbes play critical roles in Earth’s biogeochemical cycles [1] and impact the health of humans significantly [2]. Understanding interactions among microbes under a specific condition is a key research topic in microbial ecology [3]. Bandyopadhyay et al. [4] also showed that these interactions can change under various environmental or genetic conditions. With the development of high-throughput sequencing technology, 16S rRNA gene sequences can be amplified, sequenced, and grouped into common Operational Taxonomic Units (OTUs), and as a result, microbial abundance information can be obtained for further exploration [5]. One of the major challenges is to discover associations among microbes and how these associations change under different conditions, which could in turn help us to unravel the underlying interaction network and offer an insight into community-wide dynamics.
Correlation analysis is commonly used to infer the interaction network for absolute abundance data. However, applying traditional correlation analysis to compositional data, as only representative of relative abundances of microbial species, may yield spurious results [6, 7]. Recent methods, such as SparCC [7], CCREPE [8, 9], REBACCA [10] and CCLasso [11], have been proposed to address compositional bias and infer the correlation network of microbe communities. However, pairwise correlations contain both direct and indirect interactions, and correlations may arise when microbes are connected indirectly [12]. Thus, the conditional dependence network describing direct interactions is often more intrinsic and fundamental [13, 14].
For absolute abundance, conditional dependence networks are frequently modeled as Gaussian graphical models where direct interactions correspond to the support of the precision matrix [15, 16]. Meinshausen and Bühlmann [17] proposed a neighborhood selection approach to recover the precision matrix row-by-row by fitting a lasso penalized least squares regression model [18]. Yuan and Lin [19] derived the likelihood for Gaussian graphical models and suggested using the maxdet algorithm to compute the corresponding lasso penalized estimator. Friedman et al. [20] developed a more efficient algorithm called the graphical lasso. Zhang and Zou [21] proposed a new loss function called D-trace loss and introduced a sparse precision matrix estimator as the minimizer of lasso penalized D-trace loss. Several methods have been proposed to infer the direct interaction network from compositional data. Biswas et al. [22] suggested learning the direct interactions from compositional data with a Poisson-multivariate normal hierarchical model called MInt. Kurtz et al. [12] proposed a method called SPIEC-EASI, which combines centered log-ratio (clr) transformation [6, 23] for compositional data with the neighborhood selection approach [17] or graphical lasso [20] to estimate the precision matrix. Similar to the idea of Yuan and Lin [19], Fang et al. [14] first derived the likelihood with compositional data for Gaussian graphical models and then estimated the precision matrix with a lasso penalized maximum likelihood method called gCoda. Yuan et al. [24] introduced a compositional D-trace loss (CD-trace) based on D-trace loss to estimate the precision matrix. In this paper, we propose a new loss function called CDTr, with a more concise form than CD-trace, which incorporates clr transformation [6, 23] into D-trace loss [21] to estimate the precision matrix from compositional data.
Biological networks often vary according to different environmental or genetic conditions [4]. Understanding how networks change and estimating differential networks are important tasks in biological studies. In recent years, researchers have actively sought methods of estimating differential networks for absolute abundance data. Chiquet et al. [25], Guo et al. [26] and Danaher et al. [27] estimated the precision matrices and their differences jointly by penalizing the joint log-likelihood with different penalties. Zhao et al. [28] developed an ℓ1-minimization method for direct estimation of differential networks, which does not require sparsity of precision matrices or their separate estimation. Yuan et al. [29] proposed a new loss function called DTL based on D-trace loss [21] to estimate the precision matrix difference directly. In this paper, we also extend our method to incorporate clr transformation [6, 23] into DTL [29] to estimate the differential network from compositional data.
The remainder of the paper is organized as follows. In Section 2, we introduce our new loss functions, which incorporate clr transformations for compositional data analysis into D-trace loss and thereby enable us to estimate both the direct interaction network and the differential network from compositional data. In Section 3, the performance of our method is evaluated and compared with other state-of-the-art methods under various simulation scenarios. In Section 4, the proposed methods are illustrated with an application to a mouse skin microbiome dataset.
2 Materials and methods
2.1 Compositional data and clr transformation
We begin with some notations and definitions for convenience. For a p × p matrix $X = (X_{ij}) \in \mathbb{R}^{p \times p}$, its transpose, trace and determinant are denoted as XT, tr(X) and det X, respectively. Let $\|X\|_F = (\sum_{i,j} X_{ij}^2)^{1/2}$, ‖X‖∞ = maxi ∑j |Xij|, ‖X‖1 = maxj ∑i|Xij|, |X|1 = ∑i,j |Xij|, and |X|1,off = ∑i≠j |Xij| be the Frobenius norm, ∞-norm, 1-norm, ℓ1-norm and off-diagonal ℓ1-norm of X. Denote by vec(X) the p2-vector obtained by stacking the columns of X, and X ≻ 0 means that X is positive definite. For two matrices $X, Y \in \mathbb{R}^{p \times p}$, let X ⊗ Y be the Kronecker product of X and Y. We use 〈X, Y〉 to denote tr(XYT) throughout this paper.
Suppose that there are p microbe species and that their absolute abundances are z = (z1, z2, …, zp) respectively. However, instead of absolute abundances, it is often the case that only the relative abundances (or closed compositions) x = (x1, x2, …, xp), where
$$x_i = \frac{z_i}{s}, \quad s = \sum_{j=1}^{p} z_j, \quad i = 1, 2, \dots, p, \tag{1}$$
can be observed in real experiments. If the log-transformed absolute abundances ln z follow a multivariate Gaussian distribution with mean μ and nonsingular covariance matrix Σ, the precision matrix Θ = Σ−1 depicts the direct interaction network among microbial species since ln zi and ln zj are conditionally independent given other components of z if and only if Θij = 0 [13]. Moreover, we can describe this direct interaction network with an undirected graph if we represent the p microbe species with p vertices and connect the conditionally dependent species pairs.
Log-ratios [6, 23] are commonly used in compositional data analysis, since ratios are preserved when the absolute abundances are expressed as relative abundances [12]. Aitchison [6, 23] also proposed a statistically equivalent centered log-ratio (clr) transformation. The centering matrix is
$$G = I - \frac{1}{p} 1_p 1_p^T,$$
where 1p is a p-dimensional all-ones vector and I is the identity matrix. Applying the clr transformation and using ln x = ln z − 1p ln s and G1p = 0p, it follows that
$$G \ln x = G(\ln z - 1_p \ln s) = G \ln z. \tag{2}$$
Denoting by Σln x the covariance matrix of the log-transformed relative abundances, we have
$$G \Sigma_{\ln x} G = G \Sigma G. \tag{3}$$
Eqs (2) and (3) thus establish a bridge between the observed relative abundances and the unobserved absolute abundances. SPIEC-EASI [12] assumes that GΣln xG serves as a good approximation of Σ since G − I ≈ 0 when p ≫ 0, and applies the neighborhood selection approach [17] or graphical lasso [20] to the clr-transformed relative abundances for precision matrix estimation.
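The bridge between relative and absolute abundances can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the simulated abundances and all variable names are illustrative assumptions. It verifies that the clr transform of the compositions equals the clr transform of the unobserved absolute abundances, and that the centered covariances coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 200

# Hypothetical log absolute abundances ln z (one sample per row).
log_z = rng.normal(size=(n, p))
z = np.exp(log_z)

# Closed compositions: each row is divided by its total s = sum_j z_j.
x = z / z.sum(axis=1, keepdims=True)
log_x = np.log(x)

# Centering matrix G = I - (1/p) 1 1^T; it is symmetric and G 1 = 0.
G = np.eye(p) - np.ones((p, p)) / p

# clr transform applied row-wise: G ln x equals G ln z.
clr_x = log_x @ G
clr_z = log_z @ G

# The same identity holds for the sample covariances: G S_lnx G = G S_lnz G.
S_lnx = np.cov(log_x, rowvar=False)
S_lnz = np.cov(log_z, rowvar=False)
lhs = G @ S_lnx @ G
rhs = G @ S_lnz @ G

print(np.allclose(clr_x, clr_z), np.allclose(lhs, rhs))
```

The identity is exact because the per-sample total ln s is added to every component and is therefore removed by the centering.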
2.2 CDTr: Compositional network analysis with D-trace loss
From the empirical loss minimization perspective, SPIEC-EASI is not the most natural and concise because of the approximation and the log-determinant term in graphical lasso [20]. In this section, we introduce an innovative loss function to estimate the direct interaction network from compositional data with D-trace loss. The new D-trace loss for compositional data (CDTr loss) is proposed as
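$$L_{CD}(\Theta, \Sigma_{\ln x}) = \frac{1}{2}\langle \Theta^2, G\Sigma_{\ln x}G \rangle - \mathrm{tr}(\Theta G). \tag{4}$$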
We can view the CDTr loss as an analogue of the D-trace loss [21], $L_D(\Theta, \Sigma) = \frac{1}{2}\langle \Theta^2, \Sigma \rangle - \mathrm{tr}(\Theta)$. The purpose of incorporating the clr transformation into the original D-trace loss is to avoid the unobserved absolute abundances and to account for compositionality. If we know the absolute abundance data, we can simply substitute the finite sample estimator of Σ (denoted by $\hat{\Sigma}$) into D-trace loss and estimate the precision matrix Θ with the corresponding lasso penalized estimator. However, for relative abundances or compositional data, only the finite sample estimator of Σln x (denoted by $\hat{\Sigma}_{\ln x}$) is available, instead of the finite sample estimator of Σ. Thanks to the clr transformation and the bridge Eq (3), we can estimate GΣG with $G\hat{\Sigma}_{\ln x}G$, even though $\hat{\Sigma}$ is not available.
It is easy to check that CDTr loss can be written as
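$$L_{CD}(\Theta, \Sigma_{\ln x}) = \frac{1}{2}\left\| \Sigma^{1/2} G \Theta - \Sigma^{-1/2} G \right\|_F^2 - \frac{1}{2}\mathrm{tr}(G \Sigma^{-1} G). \tag{5}$$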
To ensure that Σ−1 minimizes LCD, namely Σ1/2GΘ − Σ−1/2G = 0 when Θ = Σ−1, we need the following exchangeable condition:
$$G\Sigma = \Sigma G. \tag{6}$$
Denote by σij and ρij the covariance and correlation between ln zi and ln zj, respectively. Then, the exchangeable condition is equivalent to ∑l σil = ∑l σjl for all i, j = 1, 2, …, p, which is similar to the assumption ∑l≠i σil = 0, i = 1, 2, …, p in SparCC [7]. If the variances σii, i = 1, 2, …, p are all the same, then the exchangeable condition simplifies to the requirement that the sums ∑l≠i ρil, i = 1, 2, …, p are all the same, which means that the average correlation with the other species is the same for each species. Analogously, the assumption in SparCC simplifies to ∑l≠i ρil = 0, i = 1, 2, …, p, which implies that the average correlations are very small. In the numerical experiments of Section 3, we show that CDTr still performs well, even when the exchangeable condition does not hold.
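The role of the exchangeable condition can be illustrated with a small numerical check. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it takes the CDTr loss in the form ½ tr(Θ GΣG Θ) − tr(ΘG), uses the population covariance Σ directly (which is valid because GΣln xG = GΣG), and builds a hypothetical ring-shaped precision matrix whose inverse has equal row sums. The helper names are invented for this example.

```python
import numpy as np

def centering_matrix(p):
    return np.eye(p) - np.ones((p, p)) / p

def cdtr_loss(theta, sigma, G):
    """Assumed CDTr form: 0.5 * tr(Theta A Theta) - tr(Theta G), A = G Sigma G."""
    A = G @ sigma @ G
    return 0.5 * np.trace(theta @ A @ theta) - np.trace(theta @ G)

def cdtr_gradient(theta, sigma, G):
    """Gradient of the loss above with respect to a symmetric Theta."""
    A = G @ sigma @ G
    return 0.5 * (A @ theta + theta @ A) - G

def ring_precision(p, w):
    """Circulant precision matrix: unit diagonal, weight w between ring neighbours."""
    theta = np.eye(p)
    idx = np.arange(p)
    theta[idx, (idx + 1) % p] = w
    theta[(idx + 1) % p, idx] = w
    return theta

p = 6
G = centering_matrix(p)

# Exchangeable case: a circulant Sigma has equal row sums, so G Sigma = Sigma G.
sigma_exch = np.linalg.inv(ring_precision(p, 0.2))
# Non-exchangeable case: unequal variances give unequal row sums.
sigma_other = np.diag(np.arange(1.0, p + 1.0))

dev_exch = np.linalg.norm(G @ sigma_exch - sigma_exch @ G)
dev_other = np.linalg.norm(G @ sigma_other - sigma_other @ G)

# Gradient of the loss evaluated at the true precision matrix.
grad_exch = cdtr_gradient(np.linalg.inv(sigma_exch), sigma_exch, G)
grad_other = cdtr_gradient(np.linalg.inv(sigma_other), sigma_other, G)

print(dev_exch, np.linalg.norm(grad_exch))    # both numerically zero
print(dev_other, np.linalg.norm(grad_other))  # both clearly non-zero
```

When the condition holds, the gradient vanishes at Θ = Σ−1, so the true precision matrix is a minimizer of the convex loss; when it fails, the minimizer is shifted, which is why the simulations track the deviation ‖GΣ − ΣG‖F.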
In practical applications, we use the empirical version of CDTr loss as
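$$L_{CD}(\Theta, \hat{\Sigma}_{\ln x}) = \frac{1}{2}\langle \Theta^2, G\hat{\Sigma}_{\ln x}G \rangle - \mathrm{tr}(\Theta G). \tag{7}$$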
Since most species do not interact directly when the number of species p is large, we further assume that the direct interaction network, or Θ, is sparse, which also helps to solve the under-determined problem caused by compositionality and dimensionality [11, 14, 19]. We employ the commonly used ℓ1 penalty [18, 19, 21] to handle the sparse assumption, and our sparse estimator of the precision matrix Θ is proposed as
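$$\hat{\Theta} = \arg\min_{\Theta \succ 0} L_{CD}(\Theta, \hat{\Sigma}_{\ln x}) + \lambda |\Theta|_{1,\mathrm{off}}, \tag{8}$$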
where λ ≥ 0 is the tuning parameter for the tradeoff between the model fitting and the sparsity of $\hat{\Theta}$. Following the idea of Zhao et al. [28], the tuning parameter is selected by minimizing the Bayesian Information Criterion (BIC) [30] as
(9)
where |Θ|0 is the number of non-zero elements in the upper-triangle of Θ, and n is the sample size.
Zhang and Zou [21] developed an efficient algorithm based on alternating direction methods [31] for the solution of the penalized D-trace loss estimator. We can simply replace $\hat{\Sigma}$ and I in D-trace loss with $G\hat{\Sigma}_{\ln x}G$ and G in our CDTr loss and use the algorithm of Zhang and Zou [21] for the numerical solution of (8). Following the idea of Zhang and Zou [21] and Scheinberg et al. [31], we introduce two new matrices, Θ0 and Θ1. The augmented Lagrangian function of (8) is considered, where Λ0 and Λ1 are Lagrangian multipliers and ρ is the penalty parameter. The steps of the ADMM algorithm for the lasso penalized CDTr loss estimator are summarized as follows.
- (a). Initialization: k = 0,
and
;
- (b).
;
- (c).
and
;
- (d).
and
;
- (e). k = k+1;
- (f). Repeat (b)-(e) until convergence.
The definitions of matrix operators H(X), S(X) and [X]+ are listed in S1 Appendix. Compared with CD-trace loss [24] which is also based on D-trace loss and has three terms, our CDTr is more concise with only two terms. The simpler structure of CDTr makes the application of ADMM algorithm straightforward, while a symmetrization step and more auxiliary matrices are needed before applying ADMM algorithm in CD-trace.
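To make the structure of such an ADMM scheme concrete, here is a simplified sketch. It is not the algorithm listed above: it uses a single auxiliary matrix Z and omits the positive-definiteness step, so it only solves the lasso penalized problem for the assumed loss ½ tr(Θ A Θ) − tr(ΘG) with A = GΣ̂ln xG. The function names, the default ρ and the stopping rule are assumptions made for illustration.

```python
import numpy as np

def solve_quadratic(A, B):
    """Solve 0.5 * (A @ T + T @ A) = B for symmetric positive definite A."""
    w, U = np.linalg.eigh(A)
    C = 2.0 / (w[:, None] + w[None, :])
    return U @ ((U.T @ B @ U) * C) @ U.T

def soft_threshold_offdiag(X, t):
    """Soft-threshold the off-diagonal entries of X and keep its diagonal."""
    S = np.sign(X) * np.maximum(np.abs(X) - t, 0.0)
    S[np.diag_indices_from(S)] = np.diag(X)
    return S

def cdtr_admm(log_x, lam, rho=1.0, n_iter=500, tol=1e-7):
    """Simplified ADMM for min 0.5*tr(T A T) - tr(T G) + lam*|Z|_{1,off}, T = Z."""
    n, p = log_x.shape
    G = np.eye(p) - np.ones((p, p)) / p
    A = G @ np.cov(log_x, rowvar=False) @ G
    Z = np.eye(p)          # sparse copy of the estimate
    L = np.zeros((p, p))   # Lagrange multiplier for the constraint T = Z
    for _ in range(n_iter):
        # T-update: stationarity of the augmented Lagrangian in T.
        T = solve_quadratic(A + rho * np.eye(p), G + rho * Z - L)
        # Z-update: proximal step of the off-diagonal l1 penalty.
        Z_new = soft_threshold_offdiag(T + L / rho, lam / rho)
        # Dual update.
        L = L + rho * (T - Z_new)
        change = np.linalg.norm(Z_new - Z) + np.linalg.norm(T - Z_new)
        Z = Z_new
        if change < tol * max(1.0, np.linalg.norm(Z)):
            break
    return Z

# Usage on simulated compositions from a hypothetical tridiagonal precision matrix.
rng = np.random.default_rng(0)
p, n = 8, 200
theta_true = np.eye(p) + 0.3 * (np.eye(p, k=1) + np.eye(p, k=-1))
log_z = rng.multivariate_normal(np.zeros(p), np.linalg.inv(theta_true), size=n)
log_x = log_z - np.log(np.exp(log_z).sum(axis=1, keepdims=True))

theta_hat = cdtr_admm(log_x, lam=0.1)
theta_sparse = cdtr_admm(log_x, lam=1e3)  # a very large penalty removes all edges
```

Adding ρI to A makes the linear system in the first update well posed even though GΣ̂ln xG is singular, which is one reason the splitting is convenient here.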
2.3 DCDTr: Differential compositional network analysis with D-trace loss
Consider that the absolute abundances of p microbe species become $z^* = (z_1^*, z_2^*, \dots, z_p^*)$ under another condition and that the relative abundances are $x^* = (x_1^*, x_2^*, \dots, x_p^*)$. Similarly, we assume $\ln z^* \sim N(\mu^*, \Sigma^*)$. Thus, we want to estimate the difference between direct interaction networks under different conditions, i.e., the resultant differential network Δ = Σ*−1 − Σ−1.
A straightforward approach to estimate Δ is to estimate Σ−1 and Σ*−1 separately and then subtract the estimates under the key assumption that both precision matrices are sparse. However, a more reasonable assumption is that the difference between the precision matrices is sparse, not that both matrices are sparse, since direct interactions may not be sparse while the changes under different conditions are often sparse [29]. Therefore, we propose a new loss function for differential network estimation with compositional data (DCDTr loss) to estimate Δ directly, under the assumption that the differential network Δ is sparse. The DCDTr loss is proposed as
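$$L_{DCDTr}(\Delta, \Sigma_{\ln x}, \Sigma_{\ln x^*}) = \frac{1}{4}\left( \langle G\Sigma_{\ln x}G\Delta, \Delta G\Sigma_{\ln x^*}G \rangle + \langle G\Sigma_{\ln x^*}G\Delta, \Delta G\Sigma_{\ln x}G \rangle \right) - \langle \Delta, G\Sigma_{\ln x}G - G\Sigma_{\ln x^*}G \rangle. \tag{10}$$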
Similarly, our DCDTr loss can be regarded as an analogue to the DTL loss $L_{DTL}(\Delta, \Sigma, \Sigma^*) = \frac{1}{4}\left( \langle \Sigma\Delta, \Delta\Sigma^* \rangle + \langle \Sigma^*\Delta, \Delta\Sigma \rangle \right) - \langle \Delta, \Sigma - \Sigma^* \rangle$, which was proposed by Yuan et al. [29] to estimate the differential network Δ when the absolute abundances are known. Again, our DCDTr loss takes advantage of the bridge Eq (3) to avoid the unobserved absolute abundances and to account for compositionality. From another perspective, we can arrive at our DCDTr loss (10) by substituting the approximation Σ ≈ GΣln x G, Σ* ≈ GΣln x* G into DTL loss. In the numerical experiments of Section 3, we also investigate the performance of procedures which combine the approximation Σ ≈ GΣln x G, Σ* ≈ GΣln x* G with other methods for differential network estimation, including the ℓ1-minimization method [28] for direct estimation of differential networks and the joint graphical lasso (FGL, GGL) [27] for joint estimation of precision matrices. The detailed formulas are given in S1 Appendix.
Under the exchangeable condition GΣ = ΣG and GΣ* = Σ*G, it is easy to check that
(11)
Obviously, Δ = Σ*−1 − Σ−1 is a minimizer of our DCDTr loss LDCDTr. In practical applications, we incorporate the finite sample estimators of Σ, Σ* and ℓ1 penalty into DCDTr loss, and our sparse estimator for the differential network Δ is proposed as
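$$\hat{\Delta} = \arg\min_{\Delta} L_{DCDTr}(\Delta, \hat{\Sigma}_{\ln x}, \hat{\Sigma}_{\ln x^*}) + \lambda |\Delta|_1. \tag{12}$$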
The tuning parameter λ is selected by minimizing the Bayesian Information Criterion (BIC) [28–30] as
(13)
where |Δ|0 is the number of non-zero elements in the upper-triangle of Δ, and n and n* are the sample sizes under the two conditions.
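The claim that Δ = Σ*−1 − Σ−1 minimizes the DCDTr loss under the exchangeable condition can be checked numerically. The sketch below assumes the loss has the DTL form with Σ and Σ* replaced by A = GΣG and A* = GΣ*G, namely ¼(〈AΔ, ΔA*〉 + 〈A*Δ, ΔA〉) − 〈Δ, A − A*〉, whose gradient is ½(AΔA* + A*ΔA) − (A − A*). The ring-shaped precision matrices and all function names are hypothetical.

```python
import numpy as np

def dcdtr_loss(delta, sigma, sigma_star, G):
    """Assumed DCDTr form, written with traces (<X, Y> = tr(X Y^T))."""
    A = G @ sigma @ G
    A_star = G @ sigma_star @ G
    quad = 0.25 * (np.trace(A @ delta @ A_star @ delta.T)
                   + np.trace(A_star @ delta @ A @ delta.T))
    return quad - np.trace(delta @ (A - A_star).T)

def dcdtr_gradient(delta, sigma, sigma_star, G):
    """Gradient of the loss above with respect to delta."""
    A = G @ sigma @ G
    A_star = G @ sigma_star @ G
    return 0.5 * (A @ delta @ A_star + A_star @ delta @ A) - (A - A_star)

def ring_precision(p, w):
    """Circulant precision matrix: unit diagonal, weight w between ring neighbours."""
    theta = np.eye(p)
    idx = np.arange(p)
    theta[idx, (idx + 1) % p] = w
    theta[(idx + 1) % p, idx] = w
    return theta

p = 6
G = np.eye(p) - np.ones((p, p)) / p

# Two exchangeable conditions: circulant precision matrices with different weights.
theta = ring_precision(p, 0.2)
theta_star = ring_precision(p, 0.3)
sigma = np.linalg.inv(theta)
sigma_star = np.linalg.inv(theta_star)

delta_true = theta_star - theta
grad_at_truth = dcdtr_gradient(delta_true, sigma, sigma_star, G)
print(np.linalg.norm(grad_at_truth))  # numerically zero
```

Because the quadratic part is convex in Δ, a vanishing gradient at the true difference is enough for it to be a minimizer.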
Taking advantage of the algorithm developed by Yuan et al. [29] for the numerical solution of the lasso penalized DTL loss estimator, the algorithm for the numerical solution of (12) is straightforward, essentially because we can simply replace $\hat{\Sigma}$ and $\hat{\Sigma}^*$ in DTL loss with $G\hat{\Sigma}_{\ln x}G$ and $G\hat{\Sigma}_{\ln x^*}G$ in our DCDTr loss. Following the idea of Yuan et al. [29], we introduce three new matrices Δ1, Δ2, Δ3, Lagrangian multipliers Λ1, Λ2, Λ3 and a penalty parameter ρ for the solution of (12). The steps of the ADMM algorithm for the lasso penalized DCDTr loss estimator are presented as follows.
- (a). Initialization: k = 0,
and
;
- (b).
;
- (c).
;
- (d).
;
- (e).
,
and
;
- (f). k = k+1;
- (g). Repeat (b)-(f) until convergence.
The definitions of matrix operators K(X) and S(X) are listed in S1 Appendix.
3 Numerical results
In this section, we conduct several numerical experiments under different settings and compare our methods with other state-of-the-art methods. Given mean μp and precision matrix Θ, we first generate the log-transformed absolute abundance ln zi = (ln zi1, ln zi2, …, ln zip) from the multivariate normal distribution $N(\mu_p, \Theta^{-1})$, and then the relative abundances are $x_i = z_i / \sum_{j=1}^{p} z_{ij}$, i = 1, 2, …, n. For another given mean $\mu_p^*$ and precision matrix Θ* under a new condition, the samples $x_i^*$, i = 1, 2, …, n are similarly generated. In the following simulations, we take p = 50 and μp sampled from the uniform distribution
.
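The data-generating procedure described above can be sketched as follows. This is a minimal illustration, not the simulation code used in the paper: the range of the uniform distribution for the mean is not recoverable from the text, so the interval used below is an explicit assumption, and the tridiagonal precision matrix is only a placeholder.

```python
import numpy as np

def simulate_compositions(theta, mu, n, rng):
    """Draw ln z ~ N(mu, theta^{-1}) and close each sample to relative abundances."""
    sigma = np.linalg.inv(theta)
    log_z = rng.multivariate_normal(mu, sigma, size=n)
    z = np.exp(log_z)
    return z / z.sum(axis=1, keepdims=True)

rng = np.random.default_rng(42)
p, n = 50, 100

# Placeholder precision matrix (tridiagonal, positive definite).
theta = np.eye(p) + 0.3 * (np.eye(p, k=1) + np.eye(p, k=-1))

# ASSUMPTION: the interval [-0.5, 0.5] is illustrative only; the source range is missing.
mu = rng.uniform(-0.5, 0.5, size=p)

x = simulate_compositions(theta, mu, n, rng)
print(x.shape, x.sum(axis=1)[:3])
```

Samples under a second condition would be generated the same way from a second mean and precision matrix.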
3.1 Simulations for CDTr loss
To investigate the performance of CDTr loss and the influence of the exchangeable condition, we considered the following network structures for Θ.
- Band graph:
- Cluster graph: Divide p nodes into 5 clusters evenly. The nodes in different clusters are not connected, while the network for each cluster is the same as matrix C = (cij)10×10, where
The link strength is uniformly distributed in [l, u]. To be specific, θij is replaced with θijsij, where sij ∼ U[l, u]. We take (l, u) = (0.1, 0.1), (0.05, 0.15) and (0, 0.2) separately to study the performance of CDTr loss when the exchangeable condition is satisfied to different degrees. These scenarios are named Band-exact (Band-e), Band-approx1 (Band-a1), Band-approx2 (Band-a2) and Cluster-exact (Cluster-e), Cluster-approx1 (Cluster-a1), Cluster-approx2 (Cluster-a2), respectively. To obtain a positive definite precision matrix Θ, we first compute the smallest eigenvalue of Θ (denoted by e); then the diagonal elements of Θ are set as |e| + 0.3. The deviation from the exchangeable condition is measured with dev = ‖GΣ − ΣG‖F. The deviations under the aforementioned six scenarios are listed in Table 1. For each combination of the six network structures and four sample sizes n = 50, 100, 150, 200, a total of 100 datasets are generated and used to recover the network structure. Four state-of-the-art methods for network recovery are investigated, including gCoda [14], CD-trace [24], SPIEC(MB) and SPIEC(GL) [12]. We further consider an approximation method called aCDTr, which approximates Σ with GΣln x G [12] and employs D-trace loss to estimate Θ = Σ−1. Specifically, the estimator of aCDTr is
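$$\hat{\Theta}_{aCDTr} = \arg\min_{\Theta \succ 0} \frac{1}{2}\langle \Theta^2, G\hat{\Sigma}_{\ln x}G \rangle - \mathrm{tr}(\Theta) + \lambda |\Theta|_{1,\mathrm{off}}. \tag{14}$$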
The true positive rate and true negative rate are evaluated at different tuning parameters and used to generate the receiver operating characteristic (ROC) curve. We use the area under the curve (AUC) to quantify the ability to recover the true underlying network.
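The evaluation just described can be sketched as follows. This is an illustrative implementation, not the authors' evaluation script: the tolerance for calling an entry non-zero, the trapezoidal rule, and the addition of the end points (0, 0) and (1, 1) to the curve are assumptions, and the function names are invented.

```python
import numpy as np

def support_rates(theta_hat, theta_true, tol=1e-8):
    """True positive and true negative rates over the off-diagonal upper triangle."""
    iu = np.triu_indices_from(theta_true, k=1)
    est = np.abs(theta_hat[iu]) > tol
    true = np.abs(theta_true[iu]) > tol
    tpr = (est & true).sum() / max(true.sum(), 1)
    tnr = (~est & ~true).sum() / max((~true).sum(), 1)
    return float(tpr), float(tnr)

def roc_auc(tprs, tnrs):
    """Area under the ROC curve from rates collected along a tuning-parameter path."""
    fpr = np.concatenate([[0.0], 1.0 - np.asarray(tnrs, dtype=float), [1.0]])
    tpr = np.concatenate([[0.0], np.asarray(tprs, dtype=float), [1.0]])
    order = np.lexsort((tpr, fpr))  # sort by false positive rate, then by TPR
    fpr, tpr = fpr[order], tpr[order]
    # Trapezoidal rule written out explicitly.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy example: one true edge (0, 1); the estimate adds a false edge (0, 2).
theta_true = np.array([[1.0, 0.3, 0.0], [0.3, 1.0, 0.0], [0.0, 0.0, 1.0]])
theta_hat = np.array([[1.0, 0.2, 0.1], [0.2, 1.0, 0.0], [0.1, 0.0, 1.0]])
tpr, tnr = support_rates(theta_hat, theta_true)
```

In a full experiment, `support_rates` would be called once per value of λ and the collected rates passed to `roc_auc`.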
In Table 2, we present the mean AUC scores of the above-mentioned methods under different settings. The mean AUC scores of CDTr and aCDTr are superior to the other four methods in all cases, even when the exchangeable condition does not hold exactly, which implies that CDTr and aCDTr outperform other methods in direct interaction network recovery. Moreover, the mean AUC of CDTr is slightly higher than that of aCDTr, except for the cluster graph and sample size n = 50. With increasing deviation, the performance of CDTr and aCDTr decreases, which is reasonable if the exchangeable condition does not exactly hold. Interestingly, the performance for the other four methods also decreases with increasing deviation. For all network structures and methods, the mean AUC scores increase as the sample size increases.
We further conducted several experiments on the following six representative network structures, without considering the exchangeable condition.
- Random graph: Two nodes are connected with probability 0.1, and the strength is generated from a uniform distribution in [−0.2, −0.1] ∪ [0.1, 0.2].
- Band graph: Connect pair (i, j) with strength uniformly distributed in [0.05m − 0.3, 0.05m − 0.25] ∪ [0.25 − 0.05m, 0.3 − 0.05m], if |i − j| = m, m = 1, 2, 3, 4.
- Neighbor graph: Select p points from
and connect the 5 nearest neighbors for each point with strength sampled from a uniform distribution in [−0.15, −0.05] ∪ [0.05, 0.15].
- Scale-free graph: A scale-free graph is produced, following the B-A algorithm [32]. The initial graph has two connected nodes, and each new node is connected to only one node in the existing graph, chosen with probability proportional to the degree of each node in the existing graph. This results in p edges in the graph, and the strength between connected nodes is generated from a uniform distribution in [−0.2, −0.1] ∪ [0.1, 0.2].
- Hub graph: Partition the nodes into 3 disjoint groups evenly and select a node as hub for each group. The hubs are connected with the non-hubs in the same group with strength uniformly distributed in [−0.2, −0.1] ∪ [0.1, 0.2].
- Block graph: Divide p nodes into 5 blocks evenly. Connect pairs in the same block with probability 0.3 and pairs in different blocks with probability 0.1. The strength between connected nodes is uniformly distributed in [−0.2, −0.1] ∪ [0.1, 0.2].
Similarly, the diagonal elements of Θ are set as |e| + 0.3, where e is the smallest eigenvalue of Θ. The deviations from the exchangeable condition of these networks are listed in Table 3.
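As an illustration of how such a precision matrix can be built, the sketch below generates the random graph, applies the diagonal shift |e| + 0.3, and computes the deviation ‖GΣ − ΣG‖F. It is a hedged example rather than the original simulation code; the function names and the seed are arbitrary.

```python
import numpy as np

def random_graph_precision(p, prob, rng, low=0.1, high=0.2):
    """Random graph: connect pairs with probability prob, strengths in +/-[low, high]."""
    upper = np.triu(rng.random((p, p)) < prob, k=1)
    strength = rng.uniform(low, high, size=(p, p)) * rng.choice([-1.0, 1.0], size=(p, p))
    theta = np.where(upper, strength, 0.0)
    theta = theta + theta.T
    # Shift the diagonal so that the smallest eigenvalue becomes 0.3.
    e = np.linalg.eigvalsh(theta).min()
    theta[np.diag_indices(p)] = abs(e) + 0.3
    return theta

def exchangeable_deviation(theta):
    """dev = ||G Sigma - Sigma G||_F with Sigma the inverse of theta."""
    p = theta.shape[0]
    G = np.eye(p) - np.ones((p, p)) / p
    sigma = np.linalg.inv(theta)
    return float(np.linalg.norm(G @ sigma - sigma @ G, "fro"))

rng = np.random.default_rng(7)
theta = random_graph_precision(50, 0.1, rng)
dev = exchangeable_deviation(theta)
print(np.linalg.eigvalsh(theta).min(), dev)
```

Since the off-diagonal part has zero trace, its smallest eigenvalue e is non-positive, so the shifted matrix has smallest eigenvalue 0.3 and is positive definite.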
We generated 100 datasets for each setting and used them to estimate the true precision matrix. The mean AUC scores of different methods under different settings are shown in Table 4. We can see that CDTr performs better than other methods in all cases, while the results of aCDTr are comparable to those of gCoda and CD-trace, and the results of SPIEC(MB) and SPIEC(GL) are worse than the others. Note that we did not consider the exchangeable condition when we set up the networks, implying that CDTr still works, even when the exchangeable condition does not hold. Although the objective functions and performances of CDTr and aCDTr are similar as shown in Tables 2 and 4, they are derived from two quite different perspectives. aCDTr is based on the approximation Σ ≈ GΣln x G and assumes that the inverse of GΣln x G also approximates the inverse of Σ. However, as Fang et al. [14] stated, this approximation depends strongly on the condition number of the inverse covariance matrix. CDTr does not need the aforementioned approximation and can guarantee that the inverse of Σ minimizes CDTr loss exactly under the exchangeable condition. The value of CDTr is that it avoids the use of approximation assumptions and provides a different perspective for precision matrix estimation.
3.2 Simulations for DCDTr loss
We investigate the performance of DCDTr loss with some experiments in this section. The first precision matrix Θ is generated as follows:
- Random graph: For Θ, two nodes are connected with probability 0.5, and the strength is generated from a uniform distribution in [−0.4, −0.2] ∪ [0.2, 0.4].
- Band graph: Connect pair (i, j) with strength uniformly distributed in [0.05m − 0.3, 0.05m − 0.25] ∪ [0.25 − 0.05m, 0.3 − 0.05m], if |i − j| = m, m = 1, 2, 3, 4.
- Neighbor graph: Select p points from
and connect the 10 nearest neighbors for each point with strength sampled from a uniform distribution in [−0.4, −0.2] ∪ [0.2, 0.4].
- Scale-free graph: The scale-free graph is generated with the B-A algorithm [32]. The strength between connected nodes is generated from a uniform distribution in [−0.4, −0.2] ∪ [0.2, 0.4].
- Hub graph: Partition the nodes into 3 disjoint groups evenly and select a node as hub for each group. The hubs are connected with the non-hubs in the same group with strength uniformly distributed in [−0.4, −0.2] ∪ [0.2, 0.4].
- Block graph: Divide p nodes into 5 blocks evenly. Connect pairs in the same block with probability 0.5 and pairs in different blocks with probability 0.3. The strength between connected nodes is uniformly distributed in [−0.4, −0.2] ∪ [0.2, 0.4].
Then 10% of the connected pairs in Θ are changed to an unconnected state, while the same number of unconnected pairs in Θ are changed to a connected state, which yields another precision matrix Θ*. For the scale-free and hub graphs, the ratio of change is 40% because of the sparsity of the two graphs. The diagonal elements of Θ and Θ* are set as |e| + 0.3, where e is the smallest eigenvalue of Θ or Θ*, respectively. The deviations from the exchangeable condition of Θ and Θ* are listed in Table 5. The differential matrix is therefore Δ = Θ* − Θ. The two precision matrices Θ and Θ* are used to generate data separately. Four methods, DCDTr, FGL, GGL and ℓ1-M, are used to estimate the true differential matrix Δ. Similarly, we evaluate the true positive rate and true negative rate at different tuning parameters and then compute the area under the ROC curve (AUC). We take the sample size n = 100, 200, 300, 400 and repeat this procedure 100 times.
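The construction of Θ* from Θ can be sketched as below. The text does not say how the strengths of newly connected pairs are drawn, so this example assumes they follow the same ±[0.2, 0.4] uniform rule as the original edges; the function name and the tridiagonal starting matrix are also illustrative.

```python
import numpy as np

def perturb_support(theta, frac, rng, low=0.2, high=0.4):
    """Disconnect a fraction of the edges of theta and connect as many new pairs."""
    p = theta.shape[0]
    iu = np.triu_indices(p, k=1)
    vals = theta[iu].copy()
    connected = np.flatnonzero(vals != 0)
    unconnected = np.flatnonzero(vals == 0)
    k = min(int(round(frac * connected.size)), unconnected.size)
    drop = rng.choice(connected, size=k, replace=False)
    add = rng.choice(unconnected, size=k, replace=False)
    vals[drop] = 0.0
    # ASSUMPTION: new edges get strengths from the same uniform rule as existing edges.
    vals[add] = rng.uniform(low, high, size=k) * rng.choice([-1.0, 1.0], size=k)
    off = np.zeros((p, p))
    off[iu] = vals
    off = off + off.T
    # Reset the diagonal to |e| + 0.3, with e the smallest eigenvalue of the new graph.
    e = np.linalg.eigvalsh(off).min()
    off[np.diag_indices(p)] = abs(e) + 0.3
    return off

rng = np.random.default_rng(11)
p = 20
theta = np.eye(p) + 0.3 * (np.eye(p, k=1) + np.eye(p, k=-1))  # 19 edges
theta_star = perturb_support(theta, 0.1, rng)                 # 2 removed, 2 added
delta = theta_star - theta
```

Note that Δ computed this way also differs from zero on the diagonal whenever the two diagonal shifts differ.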
Table 6 presents the mean AUC scores of different methods under different settings. We see that no method is generally better than the others in all cases. DCDTr performs better than other methods in random graph, neighbor graph and block graph, while GGL achieves higher AUC in scale-free and hub graph. With the increase of sample size, the advantage of DCDTr becomes increasingly significant. Generally speaking, our proposed DCDTr performs well in different network estimations.
4 Real data analysis
In this section, we illustrate our proposed method with an application to mouse skin microbiome data [33]. A total of 261 mice were divided into 3 groups: 78 non-immunized controls (Control), 119 immunized healthy individuals (Healthy) and 64 immunized epidermolysis bullosa acquisita individuals (EBA), according to the health conditions of skin immunizations. The OTUs appearing in less than 50% of the samples are filtered out, and the samples with a number of nonzero OTU counts less than 50% of the total selected OTUs are also removed. We finally arrived at a dataset with p = 77 OTUs and n = 232 samples (63 Control, 114 Healthy and 55 EBA). We use Bayesian-multiplicative replacement [34–36] to impute zero counts and normalize the data to compositional data.
Since the underlying true direct interaction networks were not available and the accuracy of estimated networks was unobtainable, we evaluated the performance of the proposed methods with reproducibility, as Fang et al. [14] and Kurtz et al. [12] suggested. More specifically, we first constructed a reference network est1 (precision matrix or differential matrix) with all data for each group and method. We then selected half of the samples randomly to estimate the precision matrix or differential matrix (denoted by est2) again. The reproducibility was measured by the fraction of edges in the reference network est1 that were also present in est2.
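The reproducibility measure can be written in a few lines. The sketch below is illustrative: the tolerance used to decide that an entry is an edge and the function names are assumptions, and `est1` and `est2` are toy matrices rather than results from the data.

```python
import numpy as np

def edge_set(mat, tol=1e-8):
    """Set of off-diagonal pairs (i, j), i < j, whose entry is non-zero."""
    p = mat.shape[0]
    return {(i, j) for i in range(p) for j in range(i + 1, p) if abs(mat[i, j]) > tol}

def reproducibility(est1, est2, tol=1e-8):
    """Fraction of edges of the reference estimate est1 that also appear in est2."""
    ref = edge_set(est1, tol)
    if not ref:
        return float("nan")
    return len(ref & edge_set(est2, tol)) / len(ref)

# Toy example: est1 has edges (0, 1) and (1, 2); est2 has edges (0, 1) and (0, 2).
est1 = np.array([[1.0, 0.2, 0.0], [0.2, 1.0, 0.3], [0.0, 0.3, 1.0]])
est2 = np.array([[1.0, 0.1, 0.4], [0.1, 1.0, 0.0], [0.4, 0.0, 1.0]])
score = reproducibility(est1, est2)  # one of two reference edges is recovered
```

In the procedure described above, `est2` would be re-estimated on a random half of the samples and the score averaged over the repetitions.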
For each group and each method of precision matrix estimation, the procedure stated above was repeated 20 times. The mean reproducibility is summarized in Table 7. CDTr and aCDTr outperformed the other four methods in terms of reproducibility in all three groups, implying that CDTr and aCDTr are more stable and accurate in direct interaction estimation. We also estimated the differential network for the Control-Healthy and Control-EBA groups, and the evaluation procedure was also repeated 20 times. The mean reproducibility is listed in Table 8. The highest reproducibility of DCDTr also implies that DCDTr is more stable and accurate in differential network estimation.
Finally, we employed all methods to build a candidate microbiome association network from the unified dataset for each group and group pair. In Fig 1, we present the number of shared edges for direct interaction networks recovered from various methods via Venn diagrams. We can see that the direct interaction network from CDTr is close to that of CD-trace, while the networks from SPIEC(GL) and SPIEC(MB) are more similar to each other. A total of 21, 38 and 22 edges are shared by all candidate networks for control, healthy and EBA groups, respectively, comprising the core interaction network among OTUs. Moreover, almost all direct interactions discovered by CDTr are in this core interaction network, while SPIEC(GL), SPIEC(MB) and gCoda discover some eccentric interactions. The numbers of shared edges for differential networks are shown in Fig 2. The situation for differential networks is much more complicated. ℓ1-M discovered many eccentric differential edges in both groups, but these were not confirmed by other methods. The differential edges from GGL and FGL are almost the same for both groups, and are more numerous than the edges from DCDTr. Most differential edges from DCDTr were verified by both GGL and FGL for both groups, implying that DCDTr is good at inferring the crucial differential edges without mixing in nonessential edges.
To investigate the influence of zeros in the compositional data, we first divide the 77 variables into 7 sets evenly according to the proportion of nonzero measurements in each variable, and then calculate the percentage of nonzero measurements (named nonzero density) in each set. The average degree of variables (i.e., nodes) in the same set is computed for each network constructed by the above-mentioned methods. The nonzero density and average degree for each set are summarized in Tables 9 and 10 for the Control, Healthy, EBA groups and the Control-Healthy, Control-EBA group pairs, respectively. For the Control and EBA groups, the average degree tends to be larger with larger nonzero density for all methods. When the nonzero density is 20% in Set1 for the Control group and 49% in Set1 for the EBA group, aCDTr and CDTr do not recover any connections with these rare bacteria, which implies that the recovered connections are not due to zero corrections. For Healthy, Control-Healthy and Control-EBA, which have fewer zeros in the data, the average degree does not show a clear pattern and is closer to a random distribution, which implies that zero measurements do not influence network inference significantly when zeros in compositional data are relatively few.
5 Conclusion
Inferring the direct interactions among microbial species and understanding how the network structure changes are important in the study of ecology and medicine. In this paper, we propose two loss functions to estimate the direct interaction network and differential network from compositional microbial data based on clr transformation and D-trace loss for absolute abundance data. Although the proposed CDTr loss and DCDTr loss are derived from an exchangeable condition, we show that they still perform well and better than other methods under different scenarios in our numerical simulations. However, the reasonableness of the exchangeable condition should be further examined in theory and biology. Finally, the consistency of the estimators does not come with a theoretical guarantee, which is a common limitation of gCoda, SPIEC, CDTr and DCDTr. For future work, we are interested in developing theorems about the consistency property in both direct interaction network and differential network estimation.
Supporting information
S1 Appendix. Supplementary for compositional data analysis via lasso penalized D-trace loss.
The matrix operators S(X), K(X), H(X) and [X]+ used in Algorithm 1 and Algorithm 2 for the numerical solutions of the lasso penalized CDTr and DCDTr losses are presented in this Supplementary. We also demonstrate the relationship between the D-trace loss and the CDTr loss, as well as the relationship between the DTL loss and the DCDTr loss. The detailed formulas of the ℓ1-minimization method and the joint graphical lasso (FGL, GGL) are listed in this Supplementary.
https://doi.org/10.1371/journal.pone.0207731.s001
(PDF)
References
- 1. Falkowski PG, Fenchel T, Delong EF. The microbial engines that drive Earth’s biogeochemical cycles. Science. 2008;320(5879):1034–1039. pmid:18497287
- 2. Thiele I, Heinken A, Fleming RM. A systems biology approach to studying the role of microbes in human health. Current opinion in biotechnology. 2013;24(1):4–12. pmid:23102866
- 3. Konopka A. What is microbial community ecology? The ISME journal. 2009;3(11):1223. pmid:19657372
- 4. Bandyopadhyay S, Mehta M, Kuo D, Sung MK, Chuang R, Jaehnig EJ, et al. Rewiring of genetic networks in response to DNA damage. Science. 2010;330(6009):1385–1389. pmid:21127252
- 5. Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS computational biology. 2010;6(2):e1000667. pmid:20195499
- 6. Aitchison J. The statistical analysis of compositional data. Monographs on Statistics and Applied Probability, Chapman and Hall, London, UK. 1986.
- 7. Friedman J, Alm EJ. Inferring correlation networks from genomic survey data. PLoS computational biology. 2012;8(9):e1002687. pmid:23028285
- 8. Faust K, Sathirapongsasuti JF, Izard J, Segata N, Gevers D, Raes J, et al. Microbial co-occurrence relationships in the human microbiome. PLoS computational biology. 2012;8(7):e1002606. pmid:22807668
- 9. Faust K, Raes J. Microbial interactions: from networks to models. Nature Reviews Microbiology. 2012;10(8):538. pmid:22796884
- 10. Ban Y, An L, Jiang H. Investigating microbial co-occurrence patterns based on metagenomic compositional data. Bioinformatics. 2015;31(20):3322–3329. pmid:26079350
- 11. Fang H, Huang C, Zhao H, Deng M. CCLasso: correlation inference for compositional data through Lasso. Bioinformatics. 2015;31(19):3172–3180. pmid:26048598
- 12. Kurtz ZD, Müller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA. Sparse and compositionally robust inference of microbial ecological networks. PLoS computational biology. 2015;11(5):e1004226. pmid:25950956
- 13. Friedman N. Inferring cellular networks using probabilistic graphical models. Science. 2004;303(5659):799–805. pmid:14764868
- 14. Fang H, Huang C, Zhao H, Deng M. gCoda: conditional dependence network inference for compositional data. Journal of Computational Biology. 2017;24(7):699–708. pmid:28489411
- 15. Whittaker J. Graphical models in applied multivariate statistics. Wiley Publishing; 2009.
- 16. Markowetz F, Spang R. Inferring cellular networks–a review. BMC bioinformatics. 2007;8(6):S5. pmid:17903286
- 17. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. The annals of statistics. 2006;34(3):1436–1462.
- 18. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological). 1996;58(1):267–288.
- 19. Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94(1):19–35.
- 20. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9(3):432–441. pmid:18079126
- 21. Zhang T, Zou H. Sparse precision matrix estimation via lasso penalized D-trace loss. Biometrika. 2014;101(1):103–120.
- 22. Biswas S, McDonald M, Lundberg DS, Dangl JL, Jojic V. Learning microbial interaction networks from metagenomic count data. Journal of Computational Biology. 2016;23(6):526–535. pmid:27267776
- 23. Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R. Modeling and analysis of compositional data. John Wiley & Sons; 2015.
- 24. Yuan H, He S, Deng M. Compositional data network analysis via lasso penalized D-trace loss. Bioinformatics. 2019;.
- 25. Chiquet J, Grandvalet Y, Ambroise C. Inferring multiple graphical structures. Statistics and Computing. 2011;21(4):537–553.
- 26. Guo J, Levina E, Michailidis G, Zhu J. Joint estimation of multiple graphical models. Biometrika. 2011;98(1):1–15. pmid:23049124
- 27. Danaher P, Wang P, Witten DM. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2014;76(2):373–397.
- 28. Zhao SD, Cai TT, Li H. Direct estimation of differential networks. Biometrika. 2014;101(2):253–268. pmid:26023240
- 29. Yuan H, Xi R, Chen C, Deng M. Differential network analysis via lasso penalized D-trace loss. Biometrika. 2017;104(4):755–770.
- 30. Schwarz G, et al. Estimating the dimension of a model. The annals of statistics. 1978;6(2):461–464.
- 31. Scheinberg K, Ma S, Goldfarb D. Sparse inverse covariance selection via alternating linearization methods. In: Advances in neural information processing systems; 2010. p. 2101–2109.
- 32. Barabási AL, Albert R. Emergence of scaling in random networks. Science. 1999;286(5439):509–512. pmid:10521342
- 33. Srinivas G, Möller S, Wang J, Künzel S, Zillikens D, Baines JF, et al. Genome-wide mapping of gene–microbiota interactions in susceptibility to autoimmune skin blistering. Nature communications. 2013;4:2462. pmid:24042968
- 34. Martín-Fernández JA, Hron K, Templ M, Filzmoser P, Palarea-Albaladejo J. Bayesian-multiplicative treatment of count zeros in compositional data sets. Statistical Modelling. 2015;15(2):134–158.
- 35. Rivera-Pinto J, Egozcue J, Pawlowsky-Glahn V, Paredes R, Noguera-Julian M, Calle M. Balances: a new perspective for microbiome analysis. MSystems. 2018;3(4):e00053–18. pmid:30035234
- 36. Palarea-Albaladejo J, Martin-Fernandez JA. zCompositions—R package for multivariate imputation of left-censored data under a compositional approach. Chemometrics and Intelligent Laboratory Systems. 2015;143:85–96.