Application of fused graphical lasso to statistical inference for multiple sparse precision matrices

In this paper, the fused graphical lasso (FGL) method is used to estimate multiple precision matrices from multiple populations simultaneously. The lasso penalty in the FGL model constrains the sparsity of the precision matrices, and a moderate penalty on pairs of precision matrices from distinct groups encourages a similar structure across groups. In high-dimensional settings, an oracle inequality is provided for the FGL estimators, which is necessary for establishing the central limit law. We not only focus on point estimation of a precision matrix, but also consider hypothesis testing for a linear combination of the entries of multiple precision matrices. We apply a de-biasing technique to obtain a new consistent estimator with a known distribution for conducting statistical inference, and extend the inference problem to multiple populations. The corresponding de-biased FGL estimator and its asymptotic theory are provided. A simulation study and an application to diffuse large B-cell lymphoma data show that the proposed test works well in high-dimensional situations.


Introduction
Undirected graphical models are popular tools for representing the network structure of data and have been widely applied in many domains, such as machine learning, genetics, and finance. Letting x = (x_1, ..., x_p)^T be a p-variate normal random vector with mean vector µ and covariance Σ_0 (Σ_0 is positive definite), the precision matrix (or concentration matrix) is defined as the inverse of the covariance matrix, i.e., Θ_0 := Σ_0^{-1}. Graphical models capture conditional dependence relationships between random variables via the non-zero entries of the precision matrix. If Θ_{0ij} ≠ 0, then x_i and x_j, i, j = 1, ..., p, are dependent on each other given all other variables. Conversely, the zero entries of the precision matrix correspond to pairs of variables that are conditionally independent given the other variables. Therefore, the graphical model is closely related to the precision matrix. Estimation and testing of a precision matrix have been a rapidly growing research direction in the past few years.
Letting x_1, ..., x_n be a sequence of independent and identically distributed (i.i.d.) observations from the population x, write X_{p×n} := (x_1, ..., x_n). A natural estimator of the precision matrix is the inverse of the sample covariance matrix Σ̂, where Σ̂ = (1/n) X X^T. On the one hand, in high-dimensional settings, Johnstone [10] showed that the eigenvalues of the sample covariance matrix do not converge to the corresponding eigenvalues of the population covariance matrix even for Σ = I. Consequently, this estimator becomes invalid when the dimension p is comparable to the sample size n. On the other hand, the sample covariance matrix is singular when p > n − 1. This produces non-negligible errors when Σ̂^{-1} is used to estimate Θ_0. In addition, a sparsity assumption (i.e., many entries are zero or nearly so) on a high-dimensional precision matrix is essential, since the zero entries encode the conditional independence structure, which is what we are most concerned with in the graphical model. In general, Σ̂^{-1} does not have a sparse structure. How to estimate a sparse precision matrix in high-dimensional settings is therefore a challenging problem.
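The singularity issue described above can be seen in a minimal numeric sketch; the dimensions and random seed here are illustrative, not from the paper:

```python
import numpy as np

# When p > n - 1, the sample covariance matrix is singular,
# so its inverse cannot serve as a precision-matrix estimator.
rng = np.random.default_rng(0)
n, p = 20, 50                      # dimension exceeds sample size
X = rng.standard_normal((n, p))    # rows are i.i.d. observations
X -= X.mean(axis=0)                # center the data
S = X.T @ X / n                    # p x p sample covariance matrix

rank = np.linalg.matrix_rank(S)
print(rank)          # at most n - 1 = 19, far below p = 50
print(rank < p)      # True: S is singular and not invertible
```

Centering removes one degree of freedom, which is why the rank is at most n − 1 rather than n.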
In recent years, various proposals have been put forward for estimating a precision matrix in high-dimensional situations, among which graphical models with sparsity-promoting penalties are effective for obtaining a sparse estimator. By applying the l_1 (lasso) penalty to the entries of the concentration matrix, Yuan and Lin [16] proposed a max-det algorithm to obtain an estimator of Θ_0. The convergence of this estimator is derived under the assumption that p is fixed. Using a coordinate descent procedure, Friedman et al. [5] provided a remarkably fast algorithm for computing the graphical lasso estimator, even when p > n.
Rothman et al. [13] investigated a sparse permutation-invariant covariance estimator, established a convergence rate of the estimator in the Frobenius norm as both the dimension p and the sample size n are allowed to grow, and showed that the rate depends explicitly on how sparse the true concentration matrix is. For additional theoretical details on penalized likelihood methods for graphical models, see Fan et al. [4], Ravikumar et al. [11], Xue and Zou [14], and Yuan et al. [17].
The above-mentioned methods focus on estimating a single graphical model. However, for data from multiple graphical models that share a similar, but not identical, structure, joint estimation recovers the true graphs better than separate estimation. Guo et al. [6] studied joint estimation of precision matrices under a hierarchical structure assumption. Liu and Lee proposed a joint estimator of multiple precision matrices under the assumption that each precision matrix decomposes into the sum of two components. The fused graphical lasso was proposed by Danaher et al. [3], with a penalty that encourages a similar structure of the precision matrices across groups.

Supposing that X^{[k]}_{p×n_k} are sample matrices and x_i^{[k]} ∈ R^p (i = 1, ..., n_k) are sampled i.i.d. from a distribution with mean µ^{[k]} and covariance Σ_0^{[k]}, for k = 1, ..., K, we assume µ^{[k]} = 0 without loss of generality. To simplify notation, we omit the subscript p×n_k and denote the sample matrices by X^{[k]}. The population precision matrix is defined as the inverse of the population covariance matrix, i.e., Θ_0^{[k]} := (Σ_0^{[k]})^{-1}. The estimators {Θ̂^{[k]}} are obtained by minimizing the penalized negative log-likelihood

  min_{{Θ^{[k]}} ⊂ S_{++}} Σ_{k=1}^K n_k [ tr(Σ̂^{[k]} Θ^{[k]}) − log det Θ^{[k]} ] + P({Θ^{[k]}}),   (1)

where P({Θ^{[k]}}) denotes the penalty function, {Θ̂^{[k]}} are the minimizers of (1), and we optimize over the set S_{++} of symmetric positive-definite matrices. The fused graphical lasso (FGL) is the solution to optimization problem (1) with the fused lasso penalty

  P({Θ^{[k]}}) = λ Σ_{k=1}^K ||(Θ^{[k]})^−||_1 + ρ Σ_{k<k'} ||Θ^{[k]} − Θ^{[k']}||_1,   (2)

where λ and ρ are non-negative regularization parameters, (Θ^{[k]})^− denotes the matrix obtained by setting the diagonal elements of Θ^{[k]} to zero, and || · ||_1 denotes the l_1 norm of a vector or matrix. It is reasonable to penalize only the off-diagonal elements of Θ^{[k]}, since we are most concerned with conditional independence across different variables. Note that the first term in (2) is the classical lasso penalty, which shrinks the coefficients toward 0 as λ increases and guarantees sparsity of the estimators {Θ̂^{[k]}}.
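A minimal sketch of evaluating the objective (1) with the fused penalty (2); the function name and toy inputs are illustrative, not part of the paper:

```python
import numpy as np

def fgl_objective(thetas, sigmas, ns, lam, rho):
    """Penalized negative log-likelihood of the fused graphical lasso.

    thetas: list of K symmetric positive-definite p x p matrices Theta^[k]
    sigmas: list of K sample covariance matrices Sigma_hat^[k]
    ns:     list of K sample sizes n_k
    lam, rho: lasso and fusion tuning parameters
    """
    # negative log-likelihood part of (1)
    nll = sum(
        n_k * (np.trace(S @ Th) - np.linalg.slogdet(Th)[1])
        for Th, S, n_k in zip(thetas, sigmas, ns)
    )
    # lasso penalty on off-diagonal entries: lambda * sum_k ||(Theta^[k])^-||_1
    off = sum(np.abs(Th).sum() - np.abs(np.diag(Th)).sum() for Th in thetas)
    # fusion penalty over all pairs of groups k < k'
    fuse = sum(
        np.abs(thetas[k] - thetas[l]).sum()
        for k in range(len(thetas)) for l in range(k + 1, len(thetas))
    )
    return nll + lam * off + rho * fuse

# toy check with K = 2 identity precision and covariance matrices
p = 3
thetas = [np.eye(p), np.eye(p)]
sigmas = [np.eye(p), np.eye(p)]
val = fgl_objective(thetas, sigmas, ns=[10, 10], lam=0.1, rho=0.1)
print(val)   # trace terms give 10*3 + 10*3 = 60; both penalties are zero here
```

With identical diagonal inputs, the off-diagonal lasso term and the fusion term both vanish, so only the likelihood part contributes.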
Approaches to the estimation of joint graphical models rely largely on penalized estimation.
The penalty biases the estimates toward the assumed structure, which makes hypothesis tests for precision matrices more challenging. Work on statistical inference for low-dimensional parameters in graphical models has recently been carried out (Janková and van de Geer [7]; Janková and van de Geer [8]; Ren et al. [12]; Yu et al. [15]) based on the l_1-penalized estimator. Janková and van de Geer [7] provided a de-biasing technique to obtain a new consistent estimator with a known distribution. However, these approaches were developed only in the setting in which the parameters of a single graph are inferred. In contrast, studies of inference techniques using estimators obtained from cross-group penalization are much fewer, and statistical inference for multiple graphical models remains an interesting area open for future research. Inspired by Janková and van de Geer [7], we not only give FGL estimators of multiple precision matrices from co-moving data, but also test linear combinations of the entries of these precision matrices. The core of the proposed method is the de-biasing technique, and we carry out statistical inference for the precision matrices in high-dimensional settings based on the proposed central limit theorem.
The rest of this paper is organized as follows. In Section 2, we give the oracle inequality for multiple estimators with an FGL penalty and its weighted version. Testing hypotheses on linear combinations of corresponding entries of multiple precision matrices is also considered in this section. Based on the de-biasing technique, the CLT of the proposed statistics for multiple populations is also derived in Section 2.
In Section 3, we report the results of simulations. All technical details are relegated to the Appendix.

Main results
We use the following notation throughout the paper. For a matrix A = (a_ij), i, j = 1, ..., p, we denote by (A)_ij its (i, j)-entry, or write A_ij to simplify notation. We write |A| for the determinant of A, and tr(A) for the trace of A. Letting A^+ = diag(A) be the diagonal matrix with the same diagonal as A, we set A^− = A − A^+. ||A||_F^2 = Σ_{i,j} a_ij^2 denotes the squared Frobenius norm (the entrywise l_2 norm), and ||A||_∞ = max_{i,j} |a_ij| denotes the supremum norm of the matrix A. In the common high-dimensional setting, the dimension p is allowed to grow to infinity.
The dimension may be comparable to, substantially larger than, or smaller than the sample size. We assume throughout that the sample sizes satisfy n_1 ≍ ... ≍ n_K ≍ n, and that n_* = n_1 + ... + n_K tends to infinity. Furthermore, for notational simplicity, we assume that n_1 = ... = n_K = n.
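As a quick numeric illustration of the matrix notation above (a sketch; the matrix A is arbitrary):

```python
import numpy as np

# Small sketch of the matrix notation used in the text.
A = np.array([[2.0, -3.0],
              [1.0,  4.0]])

A_plus = np.diag(np.diag(A))    # A^+ : diagonal part of A
A_minus = A - A_plus            # A^- : off-diagonal part of A
fro = np.sqrt((A ** 2).sum())   # ||A||_F, the entrywise l2 (Frobenius) norm
sup = np.abs(A).max()           # ||A||_inf, the entrywise supremum norm

print(fro)   # sqrt(4 + 9 + 1 + 16) = sqrt(30)
print(sup)   # 4.0
```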

Oracle inequality
To obtain the oracle inequality for multiple estimators of FGL models, we introduce some notation related to the sparsity assumptions on the entries of the true precision matrices. Letting S_k := {(i, j) : i ≠ j, Θ_{0ij}^{[k]} ≠ 0}, where Θ_{0ij}^{[k]} is the (i, j)-entry of Θ_0^{[k]} and s_k = |S_k| is the cardinality of S_k, we adopt the boundedness of the eigenvalues of the true precision matrices and certain tail conditions proposed by Janková and van de Geer [7].

Condition 1 (Bounded eigenvalues).
There exist universal constants L_k ≥ 1, k = 1, ..., K, such that 1/L_k ≤ Λ_min(Θ_0^{[k]}) ≤ Λ_max(Θ_0^{[k]}) ≤ L_k, where Λ_min and Λ_max denote the minimum and maximum eigenvalues of a matrix, respectively.
Remark 1. From the inequality, we must select λ so that λp → 0 as n → ∞ to ensure consistency; for a sub-Gaussian random vector, λ is of order √(log p / n), so this cannot hold when p grows too fast. Thus, the condition λp → 0 excludes the p ≫ n situation.
The FGL does not take into account that the variables in general have different scalings. Thus, we consider the weighted FGL, i.e., the minimizer of the optimization problem (1) with a weighted FGL penalty. Further, the population correlation matrix is denoted R_0^{[k]} and the sample correlation matrix is denoted R̂^{[k]}. If we substitute R̂^{[k]} for Σ̂^{[k]}, the minimizer of (1) with the FGL penalty (2) is denoted {Θ̂_R^{[k]}}, which amounts to estimating the parameters from the normalized data.

Theorem 2. Under the conditions of Theorem 1, the analogous oracle inequalities hold on the same set for the weighted estimators.

It is natural to extend this conclusion to the K > 2 FGL model. For k = 1, ..., K and the K > 2 situation, we obtain the following theorems.
Theorem 3 (Multiple FGL model). Suppose that Conditions 1 and 2 hold. Then, for K > 2, the oracle inequalities of Theorem 1 hold with the pairwise fusion terms weighted by 2/(K(K−1)).

Theorem 4 (Multiple FGL model, weighted version). Under the conditions of Theorem 3, the analogous bounds hold on the same set for the weighted estimators.

Asymptotic property
We focus not only on point estimation of multiple precision matrices, but also on hypothesis testing for linear combinations of the entries of the precision matrices over two groups. One may want to test whether the entries of the precision matrices of two groups are equal:

  H_0: Θ_{0ij}^{[1]} = Θ_{0ij}^{[2]}  versus  H_1: Θ_{0ij}^{[1]} ≠ Θ_{0ij}^{[2]}.   (18)

To test Hypothesis (18), we construct confidence intervals for estimators based on the de-biasing technique, which eliminates the bias associated with the penalty. The de-biased estimator is defined as

  Θ̂_d^{[k]} = 2Θ̂^{[k]} − Θ̂^{[k]} Σ̂^{[k]} Θ̂^{[k]}.

The difference between the de-biased estimator and the true value can be decomposed into two parts as follows:

  Θ̂_d^{[k]} − Θ_0^{[k]} = −Θ_0^{[k]} (Σ̂^{[k]} − Σ_0^{[k]}) Θ_0^{[k]} + Υ^{[k]},

where Υ^{[k]} collects the higher-order remainder terms. Under the compatibility conditions, Janková and van de Geer [9] showed that the (i, j)-entry of Θ̂_d has an asymptotic normality property, and that √n ||Υ^{[k]}||_∞ converges to zero in probability. Thus, for testing Hypothesis (18), we construct the test statistic from the de-biased estimators.
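Assuming the de-biased estimator takes the de-sparsified form sketched above, it can be computed in one line; the function name is illustrative:

```python
import numpy as np

def debias(theta_hat, sigma_hat):
    """De-biased (de-sparsified) precision estimator
    Theta_d = 2 * Theta_hat - Theta_hat @ Sigma_hat @ Theta_hat,
    following the construction attributed to Jankova and van de Geer."""
    return 2 * theta_hat - theta_hat @ sigma_hat @ theta_hat

# sanity check: if Theta_hat equals Sigma_hat^{-1} exactly,
# de-biasing leaves it unchanged (2*Theta - Theta = Theta)
rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
sigma = B @ B.T + 4 * np.eye(4)        # positive definite covariance
theta = np.linalg.inv(sigma)
print(np.allclose(debias(theta, sigma), theta))   # True
```

In practice theta_hat is the penalized FGL estimate, so the correction term removes the first-order bias introduced by the penalty.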
For K = 2, we let T_ij = (Θ̂_d^{[1]})_ij − (Θ̂_d^{[2]})_ij. Next, we establish the central limit theorem for T_ij.
Theorem 5. Assume Conditions 1 and 2, λ ≍ ρ ≍ √(log p / n), and a suitable bound on (p + s). Then, under the null hypothesis, √n T_ij / σ_ij converges in distribution to the standard normal law, where

  σ_ij^2 = Θ_{0ii}^{[1]} Θ_{0jj}^{[1]} + (Θ_{0ij}^{[1]})^2 + Θ_{0ii}^{[2]} Θ_{0jj}^{[2]} + (Θ_{0ij}^{[2]})^2,

and o_p denotes convergence in probability. To complete the testing procedure, we use the consistent estimator σ̂_ij^2 = (Θ̂^{[1]})_ii (Θ̂^{[1]})_jj + (Θ̂^{[1]})_ij^2 + (Θ̂^{[2]})_ii (Θ̂^{[2]})_jj + (Θ̂^{[2]})_ij^2 in Theorem 5. Theorem 5 provides a practical and efficient way of obtaining the p-value and critical value for the test statistic. For an α level of significance, we reject H_0 when the standardized statistic exceeds ξ_α, where ξ_α is the 1 − α upper quantile of the standard normal distribution. The weighted version Θ̂_w is treated analogously. We do not need to impose the so-called irrepresentability condition on Σ to derive the theoretical properties of our estimators, in contrast to Brownlees et al. [2].
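The testing procedure can be sketched as follows, using the statistic T_ij and the variance estimator σ̂²_ij from the text; the function name, argument layout, and toy inputs are assumptions for illustration:

```python
import numpy as np
from statistics import NormalDist

def two_sample_entry_test(theta_d1, theta_d2, theta1, theta2, n, i, j,
                          alpha=0.05):
    """Test H0: Theta0^[1]_ij = Theta0^[2]_ij via the de-biased estimators.

    theta_d1, theta_d2: de-biased estimators for the two groups
    theta1, theta2:     FGL estimators used in the variance formula
    Returns (standardized statistic, two-sided p-value, reject flag)."""
    t_ij = theta_d1[i, j] - theta_d2[i, j]
    # consistent variance estimator sigma_hat_ij^2 from the text
    var = (theta1[i, i] * theta1[j, j] + theta1[i, j] ** 2
           + theta2[i, i] * theta2[j, j] + theta2[i, j] ** 2)
    z = np.sqrt(n) * t_ij / np.sqrt(var)
    pval = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, pval, pval < alpha

# toy example: identical estimates give z = 0 and p-value 1
theta = np.eye(3)
z, pval, rej = two_sample_entry_test(theta, theta, theta, theta,
                                     n=100, i=0, j=1)
print(z, pval, rej)   # 0.0 1.0 False
```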
In addition, for the multi-sample precision matrix hypothesis problem, one may want to test the linear hypothesis

  H_0: Σ_{k=1}^K a_k Θ_{0ij}^{[k]} = 0  versus  H_1: Σ_{k=1}^K a_k Θ_{0ij}^{[k]} ≠ 0,

where a_1, ..., a_K are known constants. Analogously to the two-sample case, we propose the test statistic T_ij = Σ_{k=1}^K a_k (Θ̂_d^{[k]})_ij. For the K > 2 multiple situation, we set s = max{s_1, ..., s_K} and d = max{d_1, ..., d_K}. We then establish the asymptotic normality of the proposed statistic in the following corollary, Corollary 1.
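A sketch of the linear-combination statistic (the variance estimation from Corollary 1 is omitted here); the function name is illustrative:

```python
import numpy as np

def linear_combo_stat(theta_ds, a, i, j):
    """T_ij = sum_k a_k (Theta_d^[k])_ij for the K-group linear hypothesis
    H0: sum_k a_k Theta0^[k]_ij = 0."""
    return sum(ak * th[i, j] for ak, th in zip(a, theta_ds))

# with K = 2 and a = (1, -1) this reduces to the two-sample statistic;
# here a = (2, -1) on diagonal entries 1 and 2 gives 2*1 - 1*2 = 0
t = linear_combo_stat([np.eye(2), 2 * np.eye(2)], a=[2.0, -1.0], i=0, j=0)
print(t)   # 0.0
```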
Corollary 1. Under the assumptions of Theorem 5, the statistic √n T_ij / σ_ij is asymptotically standard normal. The asymptotic variance σ_ij in Corollary 1 is unknown, so to construct confidence intervals we use a consistent estimator built from the terms (Θ̂^{[1]})_ii (Θ̂^{[1]})_jj + (Θ̂^{[1]})_ij^2, ..., of the K estimated precision matrices. In addition, a weighted version is given as follows. Corollary 2. Under the assumptions of Theorem 6, the residual term in (33) converges in probability at rate 1/√n, and the CLT in (34) holds with Θ̂ replaced by Θ̂_w, which is obtained by solving the weighted FGL optimization problem.

Numerical study
Simulation experiments were carried out to evaluate the performance of the proposed de-biased FGL test. We considered the sparse graphical model, and random samples were generated from the multivariate normal distribution N(0_p, (Θ_0^{[k]})^{-1}), with the population covariance matrix defined as the inverse of the population precision matrix.
To solve the graphical lasso problem with a given penalty, we use the alternating direction method of multipliers (ADMM) algorithm, since it is guaranteed to converge to the global optimum. For more details, the reader is referred to Boyd et al. [1] and Danaher et al. [3]. When an objective method for selecting the tuning parameters λ and ρ is required, approximations of the Akaike information criterion (AIC), the Bayesian information criterion, or cross-validation can be used.
The AIC method was chosen for the following simulations, and λ and ρ both range over 30 equally spaced values from 0.05 to 0.3 with a step of approximately 0.0086, derived as (0.3 − 0.05)/(30 − 1).
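The tuning grid described above can be reproduced directly (a sketch):

```python
import numpy as np

# 30 equally spaced values from 0.05 to 0.3,
# giving a step of (0.3 - 0.05) / (30 - 1) ~= 0.0086
grid = np.linspace(0.05, 0.3, 30)
step = grid[1] - grid[0]
print(round(step, 4))            # 0.0086
print(len(grid), grid[0], grid[-1])
```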
In addition, all reported simulation results are based on 500 replications with a nominal significance level of 0.05, and the dimension is set to p = 100.

Fluctuations of test
We illustrate the theoretical asymptotic normality result on simulated data for testing the two-sample problem (18), with the precision matrices set equal under the null hypothesis, i.e., Θ_0^{[1]} = Θ_0^{[2]}. Letting G be a p × p symmetric graph matrix with diagonal entries 0 and a proportion α of the off-diagonal elements equal to 1, and U be a p × p matrix with elements generated i.i.d. from the uniform distribution on the interval (0, 1), i.e., U(0, 1), we denote the elements of the symmetric matrix Θ̃ by θ̃_ij. For i > j, θ̃_ij = (u_ij − 1) 1{g_ij = 1}, where g_ij and u_ij are the (i, j)-entries of G and U, respectively, and 1{·} is the indicator function. For i < j, we set θ̃_ij = θ̃_ji. The diagonal entries of the matrix Θ̃ are zero. The precision matrix is then generated by shifting the diagonal of Θ̃, so that the resulting matrix is symmetric and positive definite. Subtracting 1 from the non-zero elements keeps them away from 0 and yields a sparse matrix. The generation procedure shows that α is a parameter controlling the sparsity: when α = 1, a dense matrix is generated. As is well known, sparsity of a matrix requires not only a small number of non-zero elements, but also non-zero elements of large absolute value. The parameter α controls sparsity in terms of the number of non-zero elements.
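A sketch of the generation procedure above; the exact diagonal shift used to enforce positive definiteness was lost in extraction, so the constant below is an assumption:

```python
import numpy as np

def generate_precision(p, alpha, rng):
    """Sketch of the sparse precision-matrix generator described in the text.
    The diagonal shift enforcing positive definiteness is an assumption;
    the exact constant did not survive in the original display."""
    # symmetric 0/1 graph matrix G: roughly a proportion alpha of
    # off-diagonal entries set to 1, zero diagonal
    G = (rng.uniform(size=(p, p)) < alpha).astype(float)
    G = np.triu(G, 1)
    G = G + G.T
    # magnitudes from U(0,1), pushed away from 0 by subtracting 1
    U = rng.uniform(size=(p, p))
    T = np.triu((U - 1.0) * G, 1)
    theta_tilde = T + T.T                       # symmetric, zero diagonal
    # shift the diagonal to make the matrix positive definite (assumed step)
    lam_min = np.linalg.eigvalsh(theta_tilde).min()
    return theta_tilde + (abs(lam_min) + 1.0) * np.eye(p)

rng = np.random.default_rng(0)
Theta = generate_precision(p=20, alpha=0.1, rng=rng)
print(np.linalg.eigvalsh(Theta).min() > 0)   # True: positive definite
```

The shift guarantees a smallest eigenvalue of at least 1, so the inverse (the population covariance) always exists.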
We examined the fluctuations of √n T_ij / σ_ij under the (p, n) = (100, 200) and (p, n) = (100, 400) settings for the extremely sparse and the dense precision matrix cases, respectively. For the extremely sparse case we set α = 0.01, and for the dense case we used α = 1. The simulated fluctuations are shown in Fig. 1 for the extremely sparse case and in Fig. 2 for the dense case. The index (i, j) in the simulation was chosen arbitrarily; in fact, the CLT provides a method for testing any element of the linear combination of the precision matrices. Theoretically, we can test whether the true value of any (i, j)-entry of Θ_0 is zero or not.

Average coverage probabilities
We demonstrate the performance of the test method for the K = 2 situation on the following hypotheses.
From a global perspective, we used the average coverage, as also considered in Janková and van de Geer [7]. Letting (38) be the 95% asymptotic confidence interval for Θ_{0ij}, we substitute the estimator σ̂_ij for σ_ij to obtain the empirical version. The frequency with which the true value is covered by the confidence interval (38) is denoted θ̂_ij. Then, the average coverage over a set A is the mean of θ̂_ij over (i, j) ∈ A, and S denotes the set of non-zero off-diagonal entries of Θ_0^{[1]}. It is easy to check that S = S_1 = S_2, because Θ_0^{[1]} and Θ_0^{[2]} have the same sparsity structure under the Equal Null and Linear Null cases. Thus, for the different null hypotheses, we simulated the average coverage over S and its complement S^c. The sparsity parameter takes the values α = 0.1, 0.5, and 0.9.
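The average-coverage computation can be sketched as follows; the array layout (one row per Monte Carlo replication) and the function name are illustrative assumptions:

```python
import numpy as np

def average_coverage(theta0, theta_d, sigma_hat_se, n, z=1.96):
    """Average empirical coverage of the 95% intervals
    theta_d_ij +/- z * sigma_hat_ij / sqrt(n), averaged over
    replications (rows) and the selected entries (columns)."""
    lower = theta_d - z * sigma_hat_se / np.sqrt(n)
    upper = theta_d + z * sigma_hat_se / np.sqrt(n)
    covered = (lower <= theta0) & (theta0 <= upper)
    return covered.mean()

# toy check: exact estimates with positive standard errors always cover
theta0 = np.zeros((500, 4))
cov = average_coverage(theta0, theta0.copy(), np.ones((500, 4)), n=200)
print(cov)   # 1.0
```

In the simulation itself, the columns would be restricted to the index set S or its complement S^c before averaging.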
The results in Tab. 1 partially meet our expectations. However, we do not deny that the simulations are affected by randomness. In addition, the proposed method combines estimation and hypothesis testing, which accumulates error. The simulation results nevertheless provide guidance for practice.

Multiple FGL case
For the multiple FGL case, we examined the fluctuation of the statistic T_ij for the K = 3 situation on the following hypothesis.
Appendix A. Proofs

A.1. Proof of Theorem 1

Proof 1. According to the Cauchy–Schwarz inequality, on the given sets, and for L_k ≥ 1 satisfying Condition 1, based on the definitions of ∆_k and Θ̃_k and on Lemma 7, we obtain the stated bounds. In particular, we choose c = 1/(8L^2), and inequality (49) still holds. Using bounds (46) and (49), inequality (43) becomes (50). We rearrange some terms of inequality (50) and combine them to obtain inequality (51).

Next we need to prove three inequalities, (52), (53), and (54). Since the required bounds hold, inequality (52) is proved. By the triangle inequality, we naturally obtain inequality (53). Inequality (54) follows in the same way. Thus, inequality (51) yields the desired bound. By taking 2(ρ + λ_0) < λ, we conclude the claim. By the definition of ∆_k, we deduce that the corresponding bound holds. By the inequality of arithmetic and geometric means, and using xy ≤ (x^2 + y^2)/2, inequality (64) implies the next bound. Combining these estimates gives the conclusion.

Next, we prove that, substituting Θ̃_k for Θ̂_k, the conclusion still holds. According to the condition, ||∆_k||_F is bounded by M/2. In addition, according to inequality (69), we conclude the stated upper bound.

A.2. Proof of Theorem 2

Proof 2. The minimizer {Θ̂_R^{[k]}} satisfies inequality (68). The diagonal elements of R̂^{[k]} and R_0^{[k]} are all 1. Moreover, the conclusion for the l_1-operator norm follows. For the minimizer (Θ̂_w^{[1]}, Θ̂_w^{[2]}), the analogous inequality holds. To draw the conclusion, we use the fact that a sub-Gaussian vector with covariance Σ_0^{[k]} implies that √(n/log p) times the relevant supremum norms is bounded by a constant.

A.3. Proof of Theorem 3
Proof 3. Similarly, the Θ̂_k minimize the fused graphical lasso objective for k = 1, ..., K, so we obtain the corresponding basic inequality. Using the notation ∆_k, we obtain expression (82). For the multiple case, we select a constant L satisfying 1/L ≤ 1/L_k and L_k ≤ L. By a similar analysis, for M in (0, 1/(2L)], inequalities (48) and (49) still hold.
For K-group data, based on inequalities (46) and (49), inequality (82) becomes the next bound. When k = 1, 2, ..., K, inequalities (52) and (53) still hold. Similarly, we have inequality (86). Thus, by inequalities (52), (53), and (86), inequality (85) yields the stated bound. Since K is a fixed constant, and using the factor 2/(K(K−1)), on the basis of inequality (62) we deduce the corresponding bound. In addition, one can obtain the next inequality. Based on xy ≤ (x^2 + y^2)/2 and inequality (66), inequality (90) implies the following bound. Using the relation between the Frobenius norm and the supremum norm, we have inequality (93). According to inequality (93), and according to λ_0 ≤ λ/2 and the condition λ ≤ c/(8L), we obtain the conclusion, which completes the proof.
A.4. Proof of Theorem 4

Proof 4. We start from (92) and similarly derive the analogous bounds. Finally, using inequality (79) and the analysis of the upper bound of ||W|| ||Θ̂_w^{[k]} − Θ_0^{[k]}||, the conclusion follows.

A.5. Proof of Theorem 5

Proof 5. First of all, we prove that the remainder converges in probability at the rate 1/√n.

Table 1 :
Estimated average coverage probabilities for the K = 2 situation.
We have that ||∆_k||_F is bounded by M/2. Further, we can derive ||Θ̃_k − Θ_{0k}||_F ≤ M, which means that we can substitute Θ̃_k for Θ̂_k, so that inequality (93) also holds for Θ̃_k.