Direct interaction network and differential network inference from compositional data via lasso penalized D-trace loss

Shun He; Minghua Deng

doi:10.1371/journal.pone.0207731

Abstract

The development of high-throughput sequencing technologies for 16S rRNA gene profiling provides higher quality compositional data for microbial communities. Inferring the direct interaction network under a specific condition and understanding how the network structure changes between two different environmental or genetic conditions are two important topics in biological studies. However, the compositional nature and high dimensionality of the data are challenging in the context of network and differential network recovery. To address this problem, we propose two new loss functions that incorporate the data transformations developed for compositional data analysis into D-trace loss for network and differential network estimation, respectively. The sparse matrix estimators are defined as the minimizers of the corresponding lasso penalized losses. Our method is straightforward to apply, using the ADMM algorithm for the numerical solution. Simulations show that the proposed method outperforms other state-of-the-art methods in network and differential network inference under different scenarios. Finally, as an illustration, our method is applied to a mouse skin microbiome dataset.

Citation: He S, Deng M (2019) Direct interaction network and differential network inference from compositional data via lasso penalized D-trace loss. PLoS ONE 14(7): e0207731. https://doi.org/10.1371/journal.pone.0207731

Editor: Kazuhiro Takemoto, Kyushu Institute of Technology, JAPAN

Received: November 1, 2018; Accepted: July 2, 2019; Published: July 24, 2019

Copyright: © 2019 He, Deng. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data underlying the results presented in the study are available from https://www.nature.com/articles/ncomms3462.

Funding: This work was supported by National Science Foundation of China grant No.31471246. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

Microbes play critical roles in Earth’s biogeochemical cycles [1] and significantly affect human health [2]. Understanding interactions among microbes under a specific condition is a key research topic in microbial ecology [3]. Bandyopadhyay et al. [4] also showed that these interactions can change under various environmental or genetic conditions. With the development of high-throughput sequencing technology, 16S rRNA gene sequences can be amplified, sequenced, and grouped into common Operational Taxonomic Units (OTUs), and as a result, microbial abundance information can be obtained for further exploration [5]. One of the major challenges is to discover associations among microbes and how these associations change under different conditions, which could in turn help us to unravel the underlying interaction network and offer an insight into community-wide dynamics.

Correlation analysis is commonly used to infer the interaction network for absolute abundance data. However, applying traditional correlation analysis to compositional data, which represent only the relative abundances of microbial species, may yield spurious results [6, 7]. Recent methods, such as SparCC [7], CCREPE [8, 9], REBACCA [10] and CCLasso [11], have been proposed to address compositional bias and infer the correlation network of microbial communities. However, pairwise correlations contain both direct and indirect interactions, and correlations may arise when microbes are connected only indirectly [12]. Thus, the conditional dependence network describing direct interactions is often more intrinsic and fundamental [13, 14].

For absolute abundance, conditional dependence networks are frequently modeled as Gaussian graphical models, where direct interactions correspond to the support of the precision matrix [15, 16]. Meinshausen and Bühlmann [17] proposed a neighborhood selection approach to recover the precision matrix row-by-row by fitting a lasso penalized least squares regression model [18]. Yuan and Lin [19] derived the likelihood for Gaussian graphical models and suggested using the maxdet algorithm to compute the corresponding lasso penalized estimator. Friedman et al. [20] developed a more efficient algorithm called the graphical lasso. Zhang and Zou [21] proposed a new loss function called D-trace loss and introduced a sparse precision matrix estimator as the minimizer of lasso penalized D-trace loss. Several methods have been proposed to infer the direct interaction network from compositional data. Biswas et al. [22] suggested learning the direct interactions from compositional data with a Poisson-multivariate normal hierarchical model called MInt. Kurtz et al. [12] proposed a method called SPIEC-EASI, which combines the centered log-ratio (clr) transformation [6, 23] for compositional data with the neighborhood selection approach [17] or graphical lasso [20] to estimate the precision matrix. Similar to the idea of Yuan and Lin [19], Fang et al. [14] first derived the likelihood with compositional data for Gaussian graphical models and then estimated the precision matrix with a lasso penalized maximum likelihood method called gCoda. Yuan et al. [24] introduced a compositional D-trace loss (CD-trace) based on D-trace loss to estimate the precision matrix. In this paper, we propose a new loss function called CDTr, with a more concise form than CD-trace, which incorporates the clr transformation [6, 23] into D-trace loss [21] to estimate the precision matrix from compositional data.

Biological networks often vary according to different environmental or genetic conditions [4]. Understanding how networks change and estimating differential networks are important tasks in biological studies. In recent years, researchers have actively sought methods of estimating differential networks for absolute abundance data. Chiquet et al. [25], Guo et al. [26] and Danaher et al. [27] estimated the precision matrices and their differences jointly by penalizing the joint log-likelihood with different penalties. Zhao et al. [28] developed an ℓ₁-minimization method for direct estimation of differential networks, which does not require sparsity of precision matrices or their separate estimation. Yuan et al. [29] proposed a new loss function called DTL based on D-trace loss [21] to estimate the precision matrix difference directly. In this paper, we also extend our method to incorporate the clr transformation [6, 23] into DTL [29] to estimate the differential network from compositional data.

The remainder of the paper is organized as follows. In Section 2, we introduce our new loss functions, which incorporate the clr transformation for compositional data analysis into D-trace loss and thereby enable us to estimate both the direct interaction network and the differential direct interaction network from compositional data. In Section 3, the performance of our method is evaluated and compared with other state-of-the-art methods under various simulation scenarios. In Section 4, the proposed methods are illustrated with an application to a mouse skin microbiome dataset.

2 Materials and methods

2.1 Compositional data and clr transformation

We begin with some notation and definitions for convenience. For a p × p matrix X = (X_ij), its transpose, trace and determinant are denoted as X^T, tr(X) and det X, respectively. Let ‖X‖_F = (∑_i,j X_ij²)^1/2, ‖X‖_∞ = max_i ∑_j |X_ij|, ‖X‖₁ = max_j ∑_i|X_ij|, |X|₁ = ∑_i,j |X_ij|, and |X|_1,off = ∑_i≠j |X_ij| be the Frobenius norm, ∞-norm, 1-norm, ℓ₁-norm and off-diagonal ℓ₁-norm of X. Denote by vec(X) the p²-vector from stacking the columns of X, and X ≻ 0 means that X is positive definite. For two matrices X and Y, let X ⊗ Y be the Kronecker product of X and Y. We use 〈X, Y〉 to denote tr(XY^T) throughout this paper.

Suppose that there are p microbe species and that their absolute abundances are z = (z₁, z₂, …, z_p). However, instead of absolute abundances, it is often the case that only the relative abundances (or closed compositions) x = (x₁, x₂, …, x_p), where

x_i = z_i / s, s = ∑_j z_j, i = 1, 2, …, p, (1)

can be observed in real experiments. If the log-transformed absolute abundances ln z follow a multivariate Gaussian distribution with mean μ and nonsingular covariance matrix Σ, the precision matrix Θ = Σ⁻¹ depicts the direct interaction network among microbial species, since ln z_i and ln z_j are conditionally independent given the other components of ln z if and only if Θ_ij = 0 [13]. Moreover, we can describe this direct interaction network with an undirected graph if we represent the p microbe species with p vertices and connect the conditionally dependent species pairs.

Log-ratios [6, 23] are commonly used in compositional data analysis, since ratios are preserved when the absolute abundances are expressed as relative abundances [12]. Aitchison [6, 23] also proposed a statistically equivalent centered log-ratio (clr) transformation. The centering matrix is G = I − (1/p)1_p1_p^T, where 1_p is a p-dimensional all-ones vector and I is the identity matrix. Applying the clr transformation and using ln x = ln z − 1_p ln s and G1_p = 0_p, it follows that

G ln x = G ln z. (2)

Denoting by Σ_{ln x} the covariance matrix of the log-transformed relative abundances, we have

GΣ_{ln x}G = GΣG. (3)

Thus, Eqs (2) and (3) establish a bridge between the observed relative abundances and the unobserved absolute abundances. SPIEC-EASI [12] assumes that GΣ_{ln x}G serves as a good approximation of Σ since G − I ≈ 0 when p ≫ 0, and applies the neighborhood selection approach [17] or graphical lasso [20] to the clr-transformed relative abundances for precision matrix estimation.
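The bridge between relative and absolute abundances can be checked numerically. The following is a minimal sketch, not the authors' code: the helper names `centering_matrix` and `clr`, the sample size and the simulated abundances are all illustrative assumptions. It verifies that centering the log relative abundances gives the same result as centering the log absolute abundances, both for the data and for the sample covariance matrices.

```python
import numpy as np

def centering_matrix(p):
    """Centering matrix G = I - (1/p) 1 1^T."""
    return np.eye(p) - np.ones((p, p)) / p

def clr(x):
    """Centered log-ratio transform applied to each row (sample) of x."""
    log_x = np.log(x)
    return log_x - log_x.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, p = 200, 5
log_z = rng.normal(size=(n, p))          # hypothetical log absolute abundances
z = np.exp(log_z)
x = z / z.sum(axis=1, keepdims=True)     # closed compositions: each row sums to 1

G = centering_matrix(p)

# Rows are samples, so "G ln x" for one sample is a right-multiplication here.
eq2_holds = np.allclose(clr(x), log_z @ G)

# Sample-covariance version of the bridge: G S_lnx G equals G S_lnz G.
S_lnx = np.cov(np.log(x), rowvar=False)
S_lnz = np.cov(log_z, rowvar=False)
eq3_holds = np.allclose(G @ S_lnx @ G, G @ S_lnz @ G)
```

Both checks hold up to floating-point error because the unknown total abundance only shifts every component of ln x by the same amount, and G removes any such common shift.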

2.2 CDTr: Compositional network analysis with D-trace loss

From the empirical loss minimization perspective, SPIEC-EASI is not the most natural and concise approach because of the approximation and the log-determinant term in graphical lasso [20]. In this section, we introduce a new loss function to estimate the direct interaction network from compositional data with D-trace loss. The new D-trace loss for compositional data (CDTr loss) is proposed as

L_CD(Θ, Σ_{ln x}) = ½〈Θ², GΣ_{ln x}G〉 − 〈Θ, G〉. (4)

We can view the CDTr loss as an analogue of the D-trace loss [21], L_D(Θ, Σ) = ½〈Θ², Σ〉 − tr(Θ). The purpose of incorporating the clr transformation into the original D-trace loss is to avoid the unobserved absolute abundances and to account for the compositionality. If we know the absolute abundance data, we can simply substitute the finite sample estimator of Σ (denoted by Σ̂) into D-trace loss and estimate the precision matrix Θ with the corresponding lasso penalized estimator. However, for relative abundances or compositional data, only the finite sample estimator of Σ_{ln x} (denoted by Σ̂_{ln x}) is available, instead of the finite sample estimator of Σ. Thanks to the clr transformation and the bridge Eq (3), we can estimate GΣG with GΣ̂_{ln x}G, even though Σ̂ is not available.

It is easy to check that CDTr loss can be written as

L_CD(Θ, Σ_{ln x}) = ½‖Σ^1/2GΘ − Σ^−1/2G‖_F² − ½tr(GΣ⁻¹G). (5)

To ensure that Σ⁻¹ minimizes L_CD, namely Σ^1/2GΘ − Σ^−1/2G = 0 when Θ = Σ⁻¹, we need the following exchangeable condition:

GΣ = ΣG. (6)

Denote by σ_ij and ρ_ij the covariance and correlation between ln z_i and ln z_j, respectively. Then, the exchangeable condition is equivalent to ∑_l σ_il = ∑_l σ_jl for all i, j = 1, 2, …, p, which is similar to the assumption ∑_l≠i σ_il = 0, i = 1, 2, …, p in SparCC [7]. If the variances σ_ii, i = 1, 2, …, p are all the same, then the exchangeable condition simplifies to the requirement that ∑_l≠i ρ_il, i = 1, 2, …, p are all the same, which implies that the average correlation with other species is nearly the same for each species. Analogously, the assumption in SparCC simplifies to ∑_l≠i ρ_il = 0, i = 1, 2, …, p, which implies that the average correlations are very small. In the numerical experiments of Section 3, we show that CDTr still performs well, even when the exchangeable condition does not hold.
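The role of the exchangeable condition can be illustrated with a small numerical check. This is a sketch under a stated assumption: the CDTr loss is taken to have the form ½〈Θ², GΣG〉 − 〈Θ, G〉, obtained by replacing Σ and I in the D-trace loss with GΣG and G as described in this section. The function names and the two example covariance matrices are hypothetical.

```python
import numpy as np

def centering_matrix(p):
    """Centering matrix G = I - (1/p) 1 1^T."""
    return np.eye(p) - np.ones((p, p)) / p

def exchange_deviation(sigma):
    """Frobenius norm of G Sigma - Sigma G (zero iff the condition holds)."""
    G = centering_matrix(sigma.shape[0])
    return np.linalg.norm(G @ sigma - sigma @ G, "fro")

def cdtr_gradient(theta, sigma):
    """Gradient of 1/2 <Theta^2, G Sigma G> - <Theta, G> at a symmetric Theta.

    Assumed loss form, see the lead-in paragraph.
    """
    G = centering_matrix(sigma.shape[0])
    M = G @ sigma @ G
    return 0.5 * (theta @ M + M @ theta) - G

p = 6
# Equal row sums: the exchangeable condition holds exactly.
sigma_equal = 0.7 * np.eye(p) + 0.3 * np.ones((p, p))
# Unequal variances, no correlations: row sums differ, condition fails.
sigma_unequal = np.diag(np.arange(1.0, p + 1.0))

dev_equal = exchange_deviation(sigma_equal)
dev_unequal = exchange_deviation(sigma_unequal)
grad_equal = cdtr_gradient(np.linalg.inv(sigma_equal), sigma_equal)
grad_unequal = cdtr_gradient(np.linalg.inv(sigma_unequal), sigma_unequal)
```

In the first case the gradient vanishes at Θ = Σ⁻¹, so the true precision matrix is a stationary point of the (convex) loss. In the second case the gradient is nonzero, so the population minimizer differs from Σ⁻¹, although the simulations in Section 3 suggest that the support can still be recovered well.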

In practical applications, we use the empirical version of CDTr loss as

L_CD(Θ, Σ̂_{ln x}) = ½〈Θ², GΣ̂_{ln x}G〉 − 〈Θ, G〉. (7)

Since most species do not interact directly when the number of species p is large, we further assume that the direct interaction network, or Θ, is sparse, which also helps to solve the under-determined problem caused by compositionality and dimensionality [11, 14, 19]. We employ the commonly used ℓ₁ penalty [18, 19, 21] to handle the sparsity assumption, and our sparse estimator of the precision matrix Θ is proposed as (8) where λ ≥ 0 is the tuning parameter for the tradeoff between the model fitting and the sparsity of Θ̂. Following the idea of Zhao et al. [28], the tuning parameter is selected by minimizing the Bayesian Information Criterion (BIC) [30] as (9) where |Θ|₀ is the number of non-zero elements in the upper-triangle of Θ, and n is the sample size.

Zhang and Zou [21] developed an efficient algorithm based on alternating direction methods [31] for the solution of the penalized D-trace loss estimator. We can simply replace Σ̂ and I in D-trace loss with GΣ̂_{ln x}G and G in our CDTr loss and use the algorithm of Zhang and Zou [21] for the numerical solution of (8). Following the idea of Zhang and Zou [21] and Scheinberg et al. [31], we introduce two new matrices, Θ₀ and Θ₁. The augmented Lagrangian function of (8) is considered, and Λ₀, Λ₁, ρ are Lagrangian multipliers. The steps of the ADMM algorithm for the lasso penalized CDTr loss estimator are summarized as follows.

(a). Initialization: k = 0, and ;
(b). ;
(c). and ;
(d). and ;
(e). k = k+1;
(f). Repeat (b)-(e) until convergence.

The definitions of matrix operators H(X), S(X) and [X]₊ are listed in S1 Appendix. Compared with CD-trace loss [24] which is also based on D-trace loss and has three terms, our CDTr is more concise with only two terms. The simpler structure of CDTr makes the application of ADMM algorithm straightforward, while a symmetrization step and more auxiliary matrices are needed before applying ADMM algorithm in CD-trace.
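To make the structure of such an ADMM iteration concrete, the following is a simplified sketch rather than the algorithm of this paper. It assumes the loss ½〈Θ², GΣ̂_{ln x}G〉 − 〈Θ, G〉 with an off-diagonal ℓ₁ penalty, uses a single auxiliary matrix, and omits the positive-definiteness step (the [X]₊ operator and the second auxiliary matrix). The function names, the stopping rule and the default value of `rho` are illustrative choices.

```python
import numpy as np

def solve_symmetric(A, B):
    """Solve 0.5 * (Theta A + A Theta) = B for symmetric A > 0 and symmetric B."""
    d, U = np.linalg.eigh(A)
    C = U.T @ B @ U
    return U @ (2.0 * C / (d[:, None] + d[None, :])) @ U.T

def soft_threshold_offdiag(X, t):
    """Soft-threshold the off-diagonal entries of X at level t, keep the diagonal."""
    S = np.sign(X) * np.maximum(np.abs(X) - t, 0.0)
    S[np.diag_indices_from(S)] = np.diag(X)
    return S

def cdtr_admm(S_lnx, lam, rho=1.0, max_iter=500, tol=1e-8):
    """Simplified ADMM for an l1-penalized CDTr-type loss (no PD constraint)."""
    p = S_lnx.shape[0]
    G = np.eye(p) - np.ones((p, p)) / p
    M = G @ S_lnx @ G                      # plays the role of G Sigma_hat_lnx G
    A = M + rho * np.eye(p)
    theta0 = np.eye(p)                     # auxiliary (sparse) copy of Theta
    Lam = np.zeros((p, p))                 # multiplier for Theta = theta0
    for _ in range(max_iter):
        # Smooth step: quadratic problem solved in closed form.
        theta = solve_symmetric(A, G + rho * theta0 - Lam)
        # Sparse step: proximal operator of the off-diagonal l1 penalty.
        theta0_new = soft_threshold_offdiag(theta + Lam / rho, lam / rho)
        # Multiplier update.
        Lam = Lam + rho * (theta - theta0_new)
        primal = np.linalg.norm(theta - theta0_new)
        dual = rho * np.linalg.norm(theta0_new - theta0)
        theta0 = theta0_new
        if max(primal, dual) < tol:
            break
    return theta0

rng = np.random.default_rng(1)
log_x = rng.normal(size=(500, 5))          # hypothetical log relative abundances
S_lnx = np.cov(log_x, rowvar=False)
theta_dense = cdtr_admm(S_lnx, lam=0.0)
theta_sparse = cdtr_admm(S_lnx, lam=10.0)
```

With `lam=0.0` the iterate converges to a stationary point of the unpenalized loss, and with a large penalty all off-diagonal entries are shrunk to exactly zero. Because GΣ̂_{ln x}G is singular, the closed-form update relies on the added ρI term to keep the linear system well posed.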

2.3 DCDTr: Differential compositional network analysis with D-trace loss

Consider that the absolute abundances of p microbe species become z* = (z*₁, z*₂, …, z*_p) under another condition and that the relative abundances are x* = (x*₁, x*₂, …, x*_p), respectively. Similarly, we assume that ln z* follows a multivariate Gaussian distribution with mean μ* and nonsingular covariance matrix Σ*. Thus, we want to estimate the difference between direct interaction networks under different conditions, i.e., the resultant differential network Δ = Σ*⁻¹ − Σ⁻¹.

A straightforward approach to estimate Δ is to estimate Σ⁻¹ and Σ*⁻¹ separately and then subtract the estimates under the key assumption that both precision matrices are sparse. However, a more reasonable assumption is that the difference between the precision matrices is sparse, not that both matrices are sparse, since direct interactions may not be sparse while the changes under different conditions are often sparse [29]. Therefore, we propose a new loss function for differential network estimation with compositional data (DCDTr loss) to estimate Δ directly, under the assumption that the differential network Δ is sparse. The DCDTr loss is proposed as

L_DCDTr(Δ, Σ_{ln x}, Σ_{ln x*}) = ¼(〈GΣ_{ln x}GΔ, ΔGΣ_{ln x*}G〉 + 〈GΣ_{ln x*}GΔ, ΔGΣ_{ln x}G〉) − 〈Δ, GΣ_{ln x}G − GΣ_{ln x*}G〉. (10)

Similarly, our DCDTr loss can be regarded as an analogue of the DTL loss L_D(Δ, Σ, Σ*) = ¼(〈ΣΔ, ΔΣ*〉 + 〈Σ*Δ, ΔΣ〉) − 〈Δ, Σ − Σ*〉, which was proposed by Yuan et al. [29] to estimate the differential network Δ when the absolute abundances are known. Again, our DCDTr loss takes advantage of the bridge Eq (3) to avoid the unobserved absolute abundances and account for the compositionality. From another perspective, we can arrive at our DCDTr loss (10) by substituting the approximation Σ ≈ GΣ_{ln x} G, Σ* ≈ GΣ_{ln x*} G into DTL loss. In the numerical experiments of Section 3, we also investigate the performance of procedures which combine the approximation Σ ≈ GΣ_{ln x} G, Σ* ≈ GΣ_{ln x*} G with other methods for differential network estimation, including the ℓ₁-minimization method [28] for direct estimation of differential networks and joint graphical lasso (FGL, GGL) [27] for joint estimation of precision matrices. The detailed formulas are left in S1 Appendix.

Under the exchangeable condition GΣ = ΣG and GΣ* = Σ*G, it is easy to check that (11) Obviously, Δ = Σ*⁻¹ − Σ⁻¹ is a minimizer of our DCDTr loss L_DCDTr. In practical applications, we incorporate the finite sample estimators of Σ_{ln x} and Σ_{ln x*} and the ℓ₁ penalty into DCDTr loss, and our sparse estimator for the differential network Δ is proposed as (12) The tuning parameter λ is selected by minimizing the Bayesian Information Criterion (BIC) [28–30] as (13) where |Δ|₀ is the number of non-zero elements in the upper-triangle of Δ, and n and n* are the sample sizes.
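The claim that Δ = Σ*⁻¹ − Σ⁻¹ minimizes the loss under the exchangeable condition can be checked numerically. The sketch below assumes that the DCDTr loss is the DTL loss of Yuan et al. [29], taken here as ¼(〈ΣΔ, ΔΣ*〉 + 〈Σ*Δ, ΔΣ〉) − 〈Δ, Σ − Σ*〉, with Σ and Σ* replaced by GΣG and GΣ*G. The function names and the two equicorrelation covariance matrices are hypothetical.

```python
import numpy as np

def centered(sigma):
    """Return G Sigma G for the centering matrix G = I - (1/p) 1 1^T."""
    p = sigma.shape[0]
    G = np.eye(p) - np.ones((p, p)) / p
    return G @ sigma @ G

def dcdtr_loss(delta, sigma, sigma_star):
    """Assumed DCDTr-type loss, with <X, Y> = tr(X Y^T)."""
    M, M_star = centered(sigma), centered(sigma_star)
    quad = 0.25 * (np.trace(M @ delta @ M_star @ delta.T)
                   + np.trace(M_star @ delta @ M @ delta.T))
    return quad - np.trace(delta @ (M - M_star))

def dcdtr_gradient(delta, sigma, sigma_star):
    """Gradient of the assumed loss at a symmetric Delta."""
    M, M_star = centered(sigma), centered(sigma_star)
    return 0.5 * (M @ delta @ M_star + M_star @ delta @ M) - (M - M_star)

p = 6
# Two covariance matrices with equal row sums (exchangeable condition holds).
sigma = 0.7 * np.eye(p) + 0.3 * np.ones((p, p))
sigma_star = 1.2 * np.eye(p) + 0.1 * np.ones((p, p))
delta_true = np.linalg.inv(sigma_star) - np.linalg.inv(sigma)

grad_true = dcdtr_gradient(delta_true, sigma, sigma_star)
loss_true = dcdtr_loss(delta_true, sigma, sigma_star)

rng = np.random.default_rng(3)
E = rng.normal(size=(p, p))
E = 0.1 * (E + E.T)                        # symmetric perturbation
loss_perturbed = dcdtr_loss(delta_true + E, sigma, sigma_star)
```

The gradient is zero at the true difference and, because the quadratic part is convex, any symmetric perturbation cannot decrease the loss.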

Taking advantage of the algorithm developed by Yuan et al. [29] for the numerical solution of the lasso penalized DTL loss estimator, the algorithm for the numerical solution of (12) is straightforward, essentially because we can simply replace Σ̂ and Σ̂* in DTL loss with GΣ̂_{ln x}G and GΣ̂_{ln x*}G in our DCDTr loss. Following the idea of Yuan et al. [29], we introduce three new matrices Δ_1,2,3 and Lagrangian multipliers Λ_1,2,3, ρ for the solution of (12). The steps of the ADMM algorithm for the lasso penalized DCDTr loss estimator are presented as follows.

(a). Initialization: k = 0, and ;
(b). ;
(c). ;
(d). ;
(e). , and ;
(f). k = k+1;
(g). Repeat (b)-(f) until convergence.

The definitions of matrix operators K(X) and S(X) are listed in S1 Appendix.

3 Numerical results

In this section, we conduct several numerical experiments under different settings and compare our methods with other state-of-the-art methods. Given mean μ_p and precision matrix Θ, we first generate the log-transformed absolute abundances ln z_i = (ln z_i1, ln z_i2, …, ln z_ip) from the multivariate normal distribution N(μ_p, Θ⁻¹), and then the relative abundances are x_i = z_i / ∑_j z_ij, i = 1, 2, …, n. For another given mean μ*_p and precision matrix Θ* under a new condition, the samples x*_i, i = 1, 2, …, n are similarly generated. In the following simulations, we take p = 50 and μ_p sampled from the uniform distribution .
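The data-generating procedure described above can be sketched as follows. The precision matrix used here is a simple diagonally dominant band matrix chosen only for illustration, and the range of the uniform distribution for the mean is an assumption, since it is not recoverable from the text. The function name `simulate_compositions` is hypothetical.

```python
import numpy as np

def simulate_compositions(theta, mu, n, rng):
    """Draw ln z ~ N(mu, theta^{-1}) and close each sample to relative abundances."""
    sigma = np.linalg.inv(theta)
    sigma = 0.5 * (sigma + sigma.T)        # remove round-off asymmetry
    log_z = rng.multivariate_normal(mu, sigma, size=n)
    z = np.exp(log_z)
    x = z / z.sum(axis=1, keepdims=True)
    return z, x

rng = np.random.default_rng(2)
p, n = 50, 100
# Illustrative precision matrix (assumption): unit diagonal, 0.2 on the
# first off-diagonals, which is diagonally dominant and hence positive definite.
theta = np.eye(p) + 0.2 * (np.eye(p, k=1) + np.eye(p, k=-1))
mu = rng.uniform(0.0, 1.0, size=p)         # hypothetical range for the mean
z, x = simulate_compositions(theta, mu, n, rng)
```

Only `x` would be passed to a compositional estimator, while `z` is kept here to allow comparisons against methods that use absolute abundances.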

3.1 Simulations for CDTr loss

To investigate the performance of CDTr loss and the influence of the exchangeable condition, we considered the following network structures for Θ.

Band graph:
Cluster graph: Divide p nodes into 5 clusters evenly. The nodes in different clusters are not connected, while the network for each cluster is the same as matrix C = (c_ij)_10×10, where

The link strength is uniformly distributed in [l, u]. To be specific, θ_ij is replaced with θ_ijs_ij, where . We take (l, u) = (0.1, 0.1), (0.05, 0.15) and (0, 0.2) separately to study the performance of CDTr loss when the exchangeable condition is satisfied to different degrees. These scenarios are named Band-exact (Band-e), Band-approx1 (Band-a1), Band-approx2 (Band-a2) and Cluster-exact (Cluster-e), Cluster-approx1 (Cluster-a1), Cluster-approx2 (Cluster-a2), respectively. To obtain a positive definite precision matrix Θ, we first compute the smallest eigenvalue of Θ (denoted by e); then the diagonal elements of Θ are set as |e| + 0.3. The deviation from the exchangeable condition is measured with dev = ‖GΣ − ΣG‖_F. The deviations under the aforementioned six scenarios are listed in Table 1. For each combination of the six network structures and four sample sizes n = 50, 100, 150, 200, a total of 100 datasets are generated and used to recover the network structure. Four state-of-the-art methods for network recovery are investigated, including gCoda [14], CD-trace [24], SPIEC(MB) and SPIEC(GL) [12]. We further consider an approximation method called aCDTr, which approximates Σ with GΣ_{ln x} G [12] and employs D-trace loss to estimate Θ = Σ⁻¹. Specifically, the estimator of aCDTr is (14) The true positive rate and true negative rate are evaluated at different tuning parameters and used to generate the receiver operating characteristic (ROC) curve. We use the area under the curve (AUC) to quantify the ability to recover the true underlying network.
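Two of the steps in this paragraph lend themselves to a short sketch: setting the diagonal to |e| + 0.3 to obtain a positive definite matrix, and computing the AUC from true positive and true negative rates along a path of tuning parameters. The code below assumes that the off-diagonal pattern starts with a zero diagonal and that the ROC curve is completed with the points (0, 0) and (1, 1) and integrated with the trapezoidal rule. The function names and the toy estimates are hypothetical.

```python
import numpy as np

def set_diagonal_for_pd(theta_off, margin=0.3):
    """Set every diagonal entry to |e| + margin, e = smallest eigenvalue
    of the matrix with zero diagonal (assumption)."""
    theta = theta_off.copy()
    np.fill_diagonal(theta, 0.0)
    e = np.linalg.eigvalsh(theta).min()
    np.fill_diagonal(theta, abs(e) + margin)
    return theta

def edge_rates(theta_true, theta_est):
    """True positive and true negative rates over the upper-triangular pairs."""
    iu = np.triu_indices_from(theta_true, k=1)
    truth = theta_true[iu] != 0
    pred = theta_est[iu] != 0
    tpr = (pred & truth).sum() / truth.sum()
    tnr = (~pred & ~truth).sum() / (~truth).sum()
    return float(tpr), float(tnr)

def auc_from_path(theta_true, estimates):
    """Area under the ROC curve traced by estimates at different tuning parameters."""
    points = {(0.0, 0.0), (1.0, 1.0)}
    for est in estimates:
        tpr, tnr = edge_rates(theta_true, est)
        points.add((1.0 - tnr, tpr))
    pts = sorted(points)
    fpr = np.array([a for a, _ in pts])
    tpr = np.array([b for _, b in pts])
    return float(np.sum(0.5 * (tpr[1:] + tpr[:-1]) * (fpr[1:] - fpr[:-1])))

p = 4
theta_off = np.zeros((p, p))
theta_off[0, 1] = theta_off[1, 0] = 0.1
theta_off[2, 3] = theta_off[3, 2] = 0.1
theta_true = set_diagonal_for_pd(theta_off)
min_eig = np.linalg.eigvalsh(theta_true).min()

# Toy path: an empty network, the exact network, and a full network.
path = [np.eye(p), theta_true, np.ones((p, p))]
auc = auc_from_path(theta_true, path)
```

Since the smallest eigenvalue of a zero-diagonal symmetric matrix is never positive, the resulting matrix has smallest eigenvalue equal to the margin. The toy path contains the exact network, so its ROC curve passes through the top-left corner.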

Table 1. Deviations from the exchangeable condition under different scenarios.

https://doi.org/10.1371/journal.pone.0207731.t001

In Table 2, we present the mean AUC scores of the above-mentioned methods under different settings. The mean AUC scores of CDTr and aCDTr are superior to the other four methods in all cases, even when the exchangeable condition does not hold exactly, which implies that CDTr and aCDTr outperform other methods in direct interaction network recovery. Moreover, the mean AUC of CDTr is slightly higher than that of aCDTr, except for the cluster graph and sample size n = 50. With increasing deviation, the performance of CDTr and aCDTr decreases, which is reasonable if the exchangeable condition does not exactly hold. Interestingly, the performance for the other four methods also decreases with increasing deviation. For all network structures and methods, the mean AUC scores increase as the sample size increases.

Table 2. The mean AUC scores of different methods under different settings.

https://doi.org/10.1371/journal.pone.0207731.t002

We further conducted several experiments on the following six representative network structures, without considering the exchangeable condition.

Random graph: Two nodes are connected with probability 0.1, and the strength is generated from a uniform distribution in [−0.2, −0.1] ∪ [0.1, 0.2].
Band graph: Connect pair (i, j) with strength uniformly distributed in [0.05m − 0.3, 0.05m − 0.25] ∪ [0.25 − 0.05m, 0.3 − 0.05m], if |i − j| = m, m = 1, 2, 3, 4.
Neighbor graph: Select p points from and connect the 5 nearest neighbors for each point with strength sampled from a uniform distribution in [−0.15, −0.05] ∪ [0.05, 0.15].
Scale-free graph: A scale-free graph is produced, following the B-A algorithm [32]. The initial graph has two connected nodes, and each new node is connected to only one node in the existing graph with probability proportional to the degree of each node in the existing graph. This results in p edges in the graph, and the strength between connected nodes is generated from a uniform distribution in [−0.2, −0.1] ∪ [0.1, 0.2].
Hub graph: Partition the nodes into 3 disjoint groups evenly and select a node as hub for each group. The hubs are connected with the non-hubs in the same group with strength uniformly distributed in [−0.2, −0.1] ∪ [0.1, 0.2].
Block graph: Divide p nodes into 5 blocks evenly. Connect pairs in the same block with probability 0.3 and pairs in different blocks with probability 0.1. The strength between connected nodes is uniformly distributed in [−0.2, −0.1] ∪ [0.1, 0.2].
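As an illustration of how such structures can be generated, the sketch below builds the off-diagonal part of the random graph and the hub graph. It assumes that the sign of each strength is chosen with equal probability, that the first node of each group acts as its hub, and that "evenly" means group sizes differing by at most one when p is not divisible by the number of groups. The function names are hypothetical.

```python
import numpy as np

def signed_strengths(size, low, high, rng):
    """Draw strengths from [-high, -low] U [low, high] with a random sign."""
    magnitude = rng.uniform(low, high, size=size)
    sign = rng.choice([-1.0, 1.0], size=size)
    return magnitude * sign

def random_graph(p, prob, low, high, rng):
    """Connect each pair independently with probability prob."""
    theta = np.zeros((p, p))
    rows, cols = np.triu_indices(p, k=1)
    mask = rng.random(rows.size) < prob
    theta[rows[mask], cols[mask]] = signed_strengths(int(mask.sum()), low, high, rng)
    return theta + theta.T

def hub_graph(p, n_groups, low, high, rng):
    """Connect the first node of each group (the hub) to all other group members."""
    theta = np.zeros((p, p))
    for group in np.array_split(np.arange(p), n_groups):
        hub, others = group[0], group[1:]
        theta[hub, others] = signed_strengths(others.size, low, high, rng)
    return theta + theta.T

rng = np.random.default_rng(4)
p = 50
theta_random = random_graph(p, prob=0.1, low=0.1, high=0.2, rng=rng)
theta_hub = hub_graph(p, n_groups=3, low=0.1, high=0.2, rng=rng)
n_hub_edges = int((np.triu(theta_hub, k=1) != 0).sum())
```

Both functions return symmetric matrices with a zero diagonal, so the diagonal still has to be set afterwards to make the precision matrix positive definite.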

Similarly, the diagonal elements of Θ are set as |e| + 0.3, where e is the smallest eigenvalue of Θ. The deviations from the exchangeable condition of these networks are listed in Table 3.

Table 3. Deviations from the exchangeable condition of six different network structures.

https://doi.org/10.1371/journal.pone.0207731.t003

We generated 100 datasets for each setting and used them to estimate the true precision matrix. The mean AUC scores of different methods under different settings are shown in Table 4. We can see that CDTr performs better than other methods in all cases, while the results of aCDTr are comparable to those of gCoda and CD-trace, and the results of SPIEC(MB) and SPIEC(GL) are worse than the others. Note that we did not consider the exchangeable condition when we set up the networks, implying that CDTr still works, even when the exchangeable condition does not hold. Although the objective functions and performances of CDTr and aCDTr are similar, as shown in Tables 2 and 4, they are derived from two quite different perspectives. aCDTr is based on the approximation Σ ≈ GΣ_{ln x} G and assumes that the inverse of GΣ_{ln x} G also approximates the inverse of Σ. However, as Fang et al. [14] stated, this approximation depends strongly on the condition number of the inverse covariance matrix. CDTr does not need the aforementioned approximation and can guarantee that the inverse of Σ minimizes CDTr loss exactly under the exchangeable condition. The significance of CDTr is that it avoids the use of approximation assumptions and provides a different perspective for precision matrix estimation.

Table 4. The mean AUC scores of different methods under different settings.

https://doi.org/10.1371/journal.pone.0207731.t004

3.2 Simulations for DCDTr loss

We investigate the performance of DCDTr loss with some experiments in this section. The first precision matrix Θ is generated as follows:

Random graph: For Θ, two nodes are connected with probability 0.5, and the strength is generated from a uniform distribution in [−0.4, −0.2] ∪ [0.2, 0.4].
Band graph: Connect pair (i, j) with strength uniformly distributed in [0.05m − 0.3, 0.05m − 0.25] ∪ [0.25 − 0.05m, 0.3 − 0.05m], if |i − j| = m, m = 1, 2, 3, 4.
Neighbor graph: Select p points from and connect the 10 nearest neighbors for each point with strength sampled from a uniform distribution in [−0.4, −0.2] ∪ [0.2, 0.4].
Scale-free graph: The scale-free graph is generated with the B-A algorithm [32]. The strength between connected nodes is generated from a uniform distribution in [−0.4, −0.2] ∪ [0.2, 0.4].
Hub graph: Partition the nodes into 3 disjoint groups evenly and select a node as hub for each group. The hubs are connected with the non-hubs in the same group with strength uniformly distributed in [−0.4, −0.2] ∪ [0.2, 0.4].
Block graph: Divide p nodes into 5 blocks evenly. Connect pairs in the same block with probability 0.5 and pairs in different blocks with probability 0.3. The strength between connected nodes is uniformly distributed in [−0.4, −0.2] ∪ [0.2, 0.4].

Then 10% of the connected pairs in Θ are changed to an unconnected state, while the same number of unconnected pairs in Θ are changed to a connected state, which gives another precision matrix Θ*. For the scale-free and hub graphs, the ratio of change is 40%, owing to the sparsity of the two graphs. The diagonal elements of Θ and Θ* are set as |e| + 0.3, where e is the smallest eigenvalue of Θ or Θ*, respectively. The deviations from the exchangeable condition of Θ and Θ* are listed in Table 5. Therefore, the differential matrix Δ is Θ* − Θ. The two precision matrices Θ and Θ* are used to generate data separately. The aforementioned four methods, including DCDTr, FGL, GGL and ℓ₁-M, are used to estimate the true differential matrix Δ. Similarly, we evaluate the true positive rate and true negative rate at different tuning parameters and then compute the area under the ROC curve (AUC). We take the sample size n = 100, 200, 300, 400 and repeat this procedure 100 times.

Table 5. Deviations from the exchangeable condition of six different network structures.

https://doi.org/10.1371/journal.pone.0207731.t005

Table 6 presents the mean AUC scores of different methods under different settings. We see that no method is generally better than the others in all cases. DCDTr performs better than other methods in random graph, neighbor graph and block graph, while GGL achieves higher AUC in scale-free and hub graph. With the increase of sample size, the advantage of DCDTr becomes increasingly significant. Generally speaking, our proposed DCDTr performs well in different network estimations.

Table 6. The mean AUC scores of different methods under different settings.

https://doi.org/10.1371/journal.pone.0207731.t006

4 Real data analysis

In this section, we illustrate our proposed method with an application to mouse skin microbiome data [33]. A total of 261 mice were divided into 3 groups: 78 non-immunized controls (Control), 119 immunized healthy individuals (Healthy) and 64 immunized epidermolysis bullosa acquisita individuals (EBA), according to the health conditions of skin immunizations. The OTUs appearing in less than 50% of the samples are filtered out, and the samples with a number of nonzero OTU counts less than 50% of the total selected OTUs are also removed. We finally arrived at a dataset with p = 77 OTUs and n = 232 samples (63 Control, 114 Healthy and 55 EBA). We use Bayesian-multiplicative replacement [34–36] to impute zero counts and normalize the data to compositional data.

Since the underlying true direct interaction networks were not available and the accuracy of the estimated networks could not be obtained, we evaluated the performance of the proposed methods with reproducibility, as Fang et al. [14] and Kurtz et al. [12] suggested. More specifically, we first constructed a reference network est₁ (precision matrix or differential matrix) with all data for each group and method. We then selected half of the samples randomly to estimate the precision matrix or differential matrix (denoted by est₂) again. The reproducibility was measured by the fraction of edges in the reference network est₁ that were also present in est₂.
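The reproducibility measure can be written down in a few lines. This is a minimal sketch in which an edge is taken to be any nonzero off-diagonal entry of the estimated matrix. The function names and the two toy estimates are hypothetical.

```python
import numpy as np

def edge_set(mat):
    """Set of (i, j) pairs with i < j whose estimated entry is nonzero."""
    rows, cols = np.triu_indices_from(mat, k=1)
    return {(int(i), int(j)) for i, j in zip(rows, cols) if mat[i, j] != 0}

def reproducibility(est_full, est_half):
    """Fraction of edges of the reference network that reappear in est_half."""
    reference = edge_set(est_full)
    if not reference:
        return float("nan")                # undefined for an empty reference
    return len(reference & edge_set(est_half)) / len(reference)

# Toy example: the reference has edges (0, 1) and (1, 2); the half-sample
# estimate has edges (0, 1) and (0, 2), so one of two edges is reproduced.
est_full = np.array([[1.0, 0.3, 0.0],
                     [0.3, 1.0, -0.2],
                     [0.0, -0.2, 1.0]])
est_half = np.array([[1.0, 0.4, 0.1],
                     [0.4, 1.0, 0.0],
                     [0.1, 0.0, 1.0]])
score = reproducibility(est_full, est_half)
```

In the procedure described above, this score would be averaged over repeated random half-samples for each group and method.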

For each group and each method of precision matrix estimation, the procedure stated above was repeated 20 times. The mean reproducibility is summarized in Table 7. CDTr and aCDTr outperformed the other four methods in terms of reproducibility in all three groups, implying that CDTr and aCDTr are more stable and accurate in direct interaction estimation. We also estimated the differential network for the Control-Healthy and Control-EBA groups, and the evaluation procedure was also repeated 20 times. The mean reproducibility is listed in Table 8. The highest reproducibility of DCDTr also implies that DCDTr is more stable and accurate in differential network estimation.

Table 7. The mean reproducibility for various methods and groups.

https://doi.org/10.1371/journal.pone.0207731.t007

Table 8. The mean reproducibility for various methods and groups.

https://doi.org/10.1371/journal.pone.0207731.t008

Finally, we employed all methods to build a candidate microbiome association network from the unified dataset for each group and each pair of groups. In Fig 1, we present the number of shared edges for direct interaction networks recovered from various methods via Venn diagrams. We can see that the direct interaction network from CDTr is close to that of CD-trace, while the networks from SPIEC(GL) and SPIEC(MB) are more similar to each other. A total of 21, 38 and 22 edges are shared by all candidate networks for the Control, Healthy and EBA groups, respectively, comprising the core interaction network among OTUs. Moreover, almost all direct interactions discovered by CDTr are in this core interaction network, while SPIEC(GL), SPIEC(MB) and gCoda discover some eccentric interactions. The numbers of shared edges for differential networks are shown in Fig 2. The situation for differential networks is much more complicated. ℓ₁-M discovered many eccentric differential edges in both groups, but these were not confirmed by other methods. The differential edges from GGL and FGL are almost the same for both groups, and are more numerous than the edges from DCDTr. Most differential edges from DCDTr were verified by both GGL and FGL for both groups, implying that DCDTr is good at inferring the crucial differential edges without mixing in nonessential edges.

Fig 1. Venn diagrams of shared edges among direct interaction networks from various methods.

https://doi.org/10.1371/journal.pone.0207731.g001

Fig 2. Venn diagrams of shared edges among differential networks from various methods.

https://doi.org/10.1371/journal.pone.0207731.g002

To investigate the influence of zeros in the compositional data, we first divide the 77 variables into 7 sets evenly according to the proportion of nonzero measurements in each variable, and then calculate the percentage of nonzero measurements (named nonzero density) in each set. The average degree of variables (i.e., nodes) in the same set is computed for each network constructed by the above-mentioned methods. The nonzero density and average degree for each set are summarized in Tables 9 and 10 for the Control, Healthy and EBA groups and for the Control-Healthy and Control-EBA group pairs, respectively. For the Control and EBA groups, the average degree tends to be larger with larger nonzero density for all methods. When the nonzero density is 20% in Set1 for the Control group and 49% in Set1 for the EBA group, aCDTr and CDTr do not recover any connections with these rare bacteria, which implies that the recovered connections are not due to zero corrections. For the Healthy, Control-Healthy and Control-EBA cases, which have fewer zeros in the data, the average degree does not show a clear pattern and is closer to a random distribution, which implies that zero measurements do not influence network inference significantly when zeros in the compositional data are relatively few.
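The per-set summary described above can be sketched as follows, assuming that the variables are sorted by their proportion of nonzero counts, split into consecutive sets of (nearly) equal size, and that the degree of a node is its number of nonzero off-diagonal entries in the estimated network. The function name and the toy data are hypothetical.

```python
import numpy as np

def density_and_degree(counts, adjacency, n_sets=7):
    """Return (nonzero density, average degree) for each set of variables."""
    nonzero_frac = (counts > 0).mean(axis=0)            # one value per variable
    order = np.argsort(nonzero_frac, kind="stable")     # rarest variables first
    offdiag = (adjacency != 0) & ~np.eye(adjacency.shape[0], dtype=bool)
    degree = offdiag.sum(axis=0)
    summary = []
    for members in np.array_split(order, n_sets):
        summary.append((float(nonzero_frac[members].mean()),
                        float(degree[members].mean())))
    return summary

# Toy data: 4 samples (rows) and 4 variables (columns) with nonzero
# proportions 0.25, 0.5, 0.75 and 1.0.
counts = np.array([[0, 1, 1, 1],
                   [0, 0, 1, 1],
                   [0, 1, 1, 1],
                   [1, 0, 0, 1]])
adjacency = np.zeros((4, 4))
adjacency[2, 3] = adjacency[3, 2] = 0.5
adjacency[1, 3] = adjacency[3, 1] = -0.4
summary = density_and_degree(counts, adjacency, n_sets=2)
```

For the real data, `counts` would be the filtered OTU table of one group and `adjacency` the network estimated by one of the methods, with `n_sets=7`.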

Table 9. The nonzero density and average degree for each set and networks constructed by various methods in control, healthy and EBA group.

https://doi.org/10.1371/journal.pone.0207731.t009

Table 10. The nonzero density and average degree for each set and networks constructed by various methods in Control-Healthy and Control-EBA group.

https://doi.org/10.1371/journal.pone.0207731.t010
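
The grouping behind the nonzero density and average degree summaries can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the authors' script: the function name, the samples-by-variables layout of the count matrix and the toy data are hypothetical, and variables are assumed to be sorted by nonzero proportion before being cut into equally sized sets.

```python
import numpy as np


def nonzero_density_and_degree(counts, adjacency, n_sets=7):
    """Summarize nonzero density and average degree per set of variables.

    counts:    samples x variables matrix of raw counts (assumed layout).
    adjacency: variables x variables symmetric 0/1 matrix of an inferred network.
    Returns one (nonzero_density, average_degree) pair per set, ordered from
    the sparsest variables to the densest ones.
    """
    counts = np.asarray(counts)
    adjacency = np.asarray(adjacency)
    # Proportion of nonzero measurements for each variable (column).
    nonzero_prop = (counts != 0).mean(axis=0)
    # Node degree, ignoring any self-loops on the diagonal.
    off_diagonal = (adjacency != 0) & ~np.eye(adjacency.shape[0], dtype=bool)
    degree = off_diagonal.sum(axis=0)
    # Sort variables by nonzero proportion and cut them into n_sets groups.
    order = np.argsort(nonzero_prop, kind="stable")
    groups = np.array_split(order, n_sets)
    return [
        (float(nonzero_prop[g].mean()), float(degree[g].mean())) for g in groups
    ]


# Toy data: 4 samples, 4 variables, split into 2 sets.
toy_counts = [[0, 1, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1], [1, 0, 0, 1]]
toy_adjacency = [[0, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
summary = nonzero_density_and_degree(toy_counts, toy_adjacency, n_sets=2)
print(summary)
```

Because every variable has the same number of samples, the mean of the per-variable nonzero proportions within a set equals the percentage of nonzero measurements in that set. With 77 variables and `n_sets=7`, `np.array_split` yields sets of 11 variables each.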

5 Conclusion

Inferring the direct interactions among microbial species and understanding how the network structure changes are important in the study of ecology and medicine. In this paper, we propose two loss functions to estimate the direct interaction network and the differential network from compositional microbial data, based on the clr transformation and the D-trace loss for absolute abundance data. Although the proposed CDTr loss and DCDTr loss are derived under an exchangeable condition, our numerical simulations show that they perform well, and better than other methods, under different scenarios. However, the reasonableness of the exchangeable condition should be examined further, both theoretically and biologically. Finally, the consistency of the estimators does not come with a theoretical guarantee, which is a limitation shared by gCoda, SPIEC, CDTr and DCDTr. For future work, we are interested in developing theorems about the consistency property in both direct interaction network and differential network estimation.
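
As a reminder of the two ingredients named above, the following Python sketch implements the clr transformation and the standard D-trace loss of Zhang and Zou [21] for a precision matrix Θ and a covariance matrix Σ, namely ½ tr(Θ²Σ) − tr(Θ). It does not implement the CDTr or DCDTr losses proposed in this paper; the function names and toy matrices are hypothetical, and the penalty on off-diagonal entries only is an assumption of this sketch.

```python
import numpy as np


def clr(compositions):
    """Centered log-ratio transform of each row (all parts must be positive)."""
    logs = np.log(np.asarray(compositions, dtype=float))
    return logs - logs.mean(axis=1, keepdims=True)


def d_trace_loss(theta, sigma):
    """Standard D-trace loss: 0.5 * tr(Theta^2 Sigma) - tr(Theta)."""
    theta = np.asarray(theta, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return 0.5 * np.trace(theta @ theta @ sigma) - np.trace(theta)


def lasso_penalty(theta, lam):
    """lam times the sum of absolute off-diagonal entries (assumed penalty form)."""
    theta = np.asarray(theta, dtype=float)
    return lam * (np.abs(theta).sum() - np.abs(np.diag(theta)).sum())


# Without a penalty, the D-trace loss is minimized at the inverse covariance.
sigma = np.diag([2.0, 4.0])
theta_star = np.linalg.inv(sigma)
loss_at_inverse = d_trace_loss(theta_star, sigma)
loss_perturbed = d_trace_loss(theta_star + 0.1 * np.eye(2), sigma)
print(loss_at_inverse, loss_perturbed)

# clr-transformed rows sum to zero by construction.
clr_rows = clr([[0.2, 0.3, 0.5], [0.1, 0.1, 0.8]])
print(clr_rows.sum(axis=1))
```

Setting the gradient ½(ΘΣ + ΣΘ) − I to zero gives Θ = Σ⁻¹, which is why the perturbed matrix in the example has a larger loss than the exact inverse.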

Supporting information

S1 Appendix. Supplementary for compositional data analysis via lasso penalized D-trace loss.

The matrix operators S(X), K(X), H(X) and [X]₊ used in Algorithm 1 and Algorithm 2 for the numerical solutions of the lasso penalized CDTr and DCDTr losses are presented in this Supplementary. We also demonstrate the relationship between D-trace loss and CDTr loss, as well as the relationship between DTL loss and DCDTr loss. The detailed formulas of the ℓ₁-minimization method and the joint graphical lasso (FGL, GGL) are listed in this Supplementary.

https://doi.org/10.1371/journal.pone.0207731.s001

(PDF)

References

1. Falkowski PG, Fenchel T, Delong EF. The microbial engines that drive Earth’s biogeochemical cycles. Science. 2008;320(5879):1034–1039. pmid:18497287
2. Thiele I, Heinken A, Fleming RM. A systems biology approach to studying the role of microbes in human health. Current opinion in biotechnology. 2013;24(1):4–12. pmid:23102866
3. Konopka A. What is microbial community ecology? The ISME journal. 2009;3(11):1223. pmid:19657372
4. Bandyopadhyay S, Mehta M, Kuo D, Sung MK, Chuang R, Jaehnig EJ, et al. Rewiring of genetic networks in response to DNA damage. Science. 2010;330(6009):1385–1389. pmid:21127252
5. Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS computational biology. 2010;6(2):e1000667. pmid:20195499
6. Aitchison J. The statistical analysis of compositional data. Monographs on Statistics and Applied Probability, Chapman and Hall, London, UK. 1986.
7. Friedman J, Alm EJ. Inferring correlation networks from genomic survey data. PLoS computational biology. 2012;8(9):e1002687. pmid:23028285
8. Faust K, Sathirapongsasuti JF, Izard J, Segata N, Gevers D, Raes J, et al. Microbial co-occurrence relationships in the human microbiome. PLoS computational biology. 2012;8(7):e1002606. pmid:22807668
9. Faust K, Raes J. Microbial interactions: from networks to models. Nature Reviews Microbiology. 2012;10(8):538. pmid:22796884
10. Ban Y, An L, Jiang H. Investigating microbial co-occurrence patterns based on metagenomic compositional data. Bioinformatics. 2015;31(20):3322–3329. pmid:26079350
11. Fang H, Huang C, Zhao H, Deng M. CCLasso: correlation inference for compositional data through Lasso. Bioinformatics. 2015;31(19):3172–3180. pmid:26048598
12. Kurtz ZD, Müller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA. Sparse and compositionally robust inference of microbial ecological networks. PLoS computational biology. 2015;11(5):e1004226. pmid:25950956
13. Friedman N. Inferring cellular networks using probabilistic graphical models. Science. 2004;303(5659):799–805. pmid:14764868
14. Fang H, Huang C, Zhao H, Deng M. gCoda: conditional dependence network inference for compositional data. Journal of Computational Biology. 2017;24(7):699–708. pmid:28489411
15. Whittaker J. Graphical models in applied multivariate statistics. Wiley Publishing; 2009.
16. Markowetz F, Spang R. Inferring cellular networks–a review. BMC bioinformatics. 2007;8(6):S5. pmid:17903286
17. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. The annals of statistics. 2006;34(3):1436–1462.
18. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological). 1996;58(1):267–288.
19. Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94(1):19–35.
20. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9(3):432–441. pmid:18079126
21. Zhang T, Zou H. Sparse precision matrix estimation via lasso penalized D-trace loss. Biometrika. 2014;101(1):103–120.
22. Biswas S, McDonald M, Lundberg DS, Dangl JL, Jojic V. Learning microbial interaction networks from metagenomic count data. Journal of Computational Biology. 2016;23(6):526–535. pmid:27267776
23. Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R. Modeling and analysis of compositional data. John Wiley & Sons; 2015.
24. Yuan H, He S, Deng M. Compositional data network analysis via lasso penalized D-trace loss. Bioinformatics. 2019;.
25. Chiquet J, Grandvalet Y, Ambroise C. Inferring multiple graphical structures. Statistics and Computing. 2011;21(4):537–553.
26. Guo J, Levina E, Michailidis G, Zhu J. Joint estimation of multiple graphical models. Biometrika. 2011;98(1):1–15. pmid:23049124
27. Danaher P, Wang P, Witten DM. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2014;76(2):373–397.
28. Zhao SD, Cai TT, Li H. Direct estimation of differential networks. Biometrika. 2014;101(2):253–268. pmid:26023240
29. Yuan H, Xi R, Chen C, Deng M. Differential network analysis via lasso penalized D-trace loss. Biometrika. 2017;104(4):755–770.
30. Schwarz G, et al. Estimating the dimension of a model. The annals of statistics. 1978;6(2):461–464.
31. Scheinberg K, Ma S, Goldfarb D. Sparse inverse covariance selection via alternating linearization methods. In: Advances in neural information processing systems; 2010. p. 2101–2109.
32. Barabási AL, Albert R. Emergence of scaling in random networks. Science. 1999;286(5439):509–512. pmid:10521342
33. Srinivas G, Möller S, Wang J, Künzel S, Zillikens D, Baines JF, et al. Genome-wide mapping of gene–microbiota interactions in susceptibility to autoimmune skin blistering. Nature communications. 2013;4:2462. pmid:24042968
34. Martín-Fernández JA, Hron K, Templ M, Filzmoser P, Palarea-Albaladejo J. Bayesian-multiplicative treatment of count zeros in compositional data sets. Statistical Modelling. 2015;15(2):134–158.
35. Rivera-Pinto J, Egozcue J, Pawlowsky-Glahn V, Paredes R, Noguera-Julian M, Calle M. Balances: a new perspective for microbiome analysis. MSystems. 2018;3(4):e00053–18. pmid:30035234
36. Palarea-Albaladejo J, Martin-Fernandez JA. zCompositions—R package for multivariate imputation of left-censored data under a compositional approach. Chemometrics and Intelligent Laboratory Systems. 2015;143:85–96.

