SILGGM: An extensive R package for efficient statistical inference in large-scale gene networks

Gene co-expression network analysis is extremely useful for interpreting complex biological processes. The recent droplet-based single-cell technology routinely generates much larger gene expression data sets, with thousands of samples and tens of thousands of genes. To analyze such large-scale gene-gene networks, remarkable progress has been made in rigorous statistical inference of the high-dimensional Gaussian graphical model (GGM). These approaches provide a formal confidence interval or a p-value, rather than only a single point estimate, for the conditional dependence of a gene pair, and are therefore more desirable for identifying reliable gene networks. To promote their widespread use, we herein introduce an extensive and efficient R package named SILGGM (Statistical Inference of Large-scale Gaussian Graphical Model) that includes four main approaches for statistical inference of high-dimensional GGM. Unlike the existing tools, SILGGM provides statistically efficient inference both on individual gene pairs and on all gene pairs at the whole-network scale. It has a novel and consistent false discovery rate (FDR) procedure across all four methodologies. Based on a user-friendly design, it provides outputs compatible with multiple platforms for interactive network visualization. Furthermore, comparisons in simulation illustrate that SILGGM accelerates the existing MATLAB implementation by several orders of magnitude and further improves the speed of the already very efficient R package FastGGM. Testing results from simulated data confirm the validity of all the approaches in SILGGM even in a very large-scale setting with the number of variables or genes on the order of ten thousand. We have also applied our package to a novel single-cell RNA-seq data set of pan T cells. The results show that the approaches in SILGGM significantly outperform the conventional ones in a biological sense. The package is freely available via CRAN at https://cran.r-project.org/package=SILGGM.

Here, each run of scaled Lasso regression is tuning-free, with the tuning parameter taken as $\lambda = \sqrt{2\log(p/\sqrt{n})/n}$. Once $\hat\beta$ is estimated from (2), the estimated residuals $\hat\epsilon_A = X_A - X_{A^c}\hat\beta$ can be obtained. Then $\Omega_{A,A}$ for the pair of variables $i$ and $j$ ($A = \{i, j\}$) can be estimated as

$$\hat\Omega_{A,A} = \begin{pmatrix} \hat\omega_{ii} & \hat\omega_{ij} \\ \hat\omega_{ji} & \hat\omega_{jj} \end{pmatrix} = \left(\hat\epsilon_A^T\hat\epsilon_A/n\right)^{-1}. \quad (3)$$

Under the minimal sparseness assumption $s = o(\sqrt{n}/\log(p))$, each estimator $\hat\omega_{ij}$ has been shown to be asymptotically normal and efficient:

$$\sqrt{n}\,(\hat\omega_{ij} - \omega_{ij})\big/\sqrt{\hat\omega_{ii}\hat\omega_{jj} + \hat\omega_{ij}^2} \rightarrow N(0, 1). \quad (4)$$

According to (4), we can estimate the corresponding p-value and confidence interval of each $\omega_{ij}$.
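As a concrete illustration of (3) and (4), the following Python/NumPy sketch (hypothetical helper name, not SILGGM's R interface) inverts the 2x2 residual covariance for a pair $A = \{i, j\}$ and converts the off-diagonal entry into a z-based p-value and confidence interval; the scaled Lasso fit that produces the residuals is assumed to be done elsewhere:

```python
import numpy as np
from scipy.stats import norm

def pairwise_precision_inference(resid, level=0.95):
    """Estimate the 2x2 block Omega_{A,A} from the residuals of the pair
    A = {i, j}, and form a z-based p-value and CI for omega_ij.

    resid : (n, 2) array of residuals (eps_i_hat, eps_j_hat)."""
    n = resid.shape[0]
    # (3): Omega_hat_{A,A} = (resid' resid / n)^{-1}
    omega = np.linalg.inv(resid.T @ resid / n)
    w_ii, w_jj, w_ij = omega[0, 0], omega[1, 1], omega[0, 1]
    # (4): sqrt(n) (w_ij_hat - w_ij) / sqrt(w_ii w_jj + w_ij^2) -> N(0, 1)
    se = np.sqrt((w_ii * w_jj + w_ij ** 2) / n)
    z = w_ij / se
    pval = 2 * norm.sf(abs(z))                 # two-sided p-value
    half = norm.ppf(0.5 + level / 2) * se      # CI half-width
    return w_ij, pval, (w_ij - half, w_ij + half)
```

Any sparse regression yielding residuals could be plugged in here; the choice of scaled Lasso only affects how the residuals are produced.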
The naive implementation of the procedure requires $O(p^2)$ runs of scaled Lasso regression, but the total number of runs of regression can be reduced to $O(p)$ following the comments in [1] and the implementation in [2].

The de-sparsified nodewise scaled Lasso (D-S_NW_SL)
D-S_NW_SL [3] is based on $p$ runs of nodewise scaled Lasso regression of the $j$th variable $X_j$ against all the other variables $X_{-j}$:

$$X_j = X_{-j}\gamma_j + \epsilon_j. \quad (5)$$

Again, the tuning parameter for each run of regression is taken as $\lambda = \sqrt{2\log(p/\sqrt{n})/n}$. Unlike the previous B_NW_SL, the procedure deals with the regression coefficients rather than the regression noise: the initial estimator $\hat\Omega = (\hat\omega_{ij})_{p \times p}$ is built from the estimated coefficients $\hat\gamma_j$ and noise levels $\hat\tau_j^2$ as

$$\hat\omega_{jj} = 1/\hat\tau_j^2, \qquad \hat\omega_{ij} = -\hat\gamma_{j,i}/\hat\tau_j^2 \ (i \neq j). \quad (6)$$

However, it is well known that the initial estimators in (6) are biased, so the authors have proposed a bias correction procedure on $\hat\Omega$. Under the Karush-Kuhn-Tucker (KKT) conditions, the de-sparsified (de-biased) estimator $\check\Omega = (\check\omega_{ij})_{p \times p}$ is

$$\check\Omega = \hat\Omega + \hat\Omega^T - \hat\Omega^T\hat\Sigma\hat\Omega, \quad (7)$$

where $\hat\Sigma = X^TX/n$.
Under the minimal sparseness assumption $s = o(\sqrt{n}/\log(p))$, each de-biased estimator $\check\omega_{ij}$ achieves the asymptotically efficient result

$$\sqrt{n}\,(\check\omega_{ij} - \omega_{ij})\big/\sqrt{\hat\omega_{ii}\hat\omega_{jj} + \hat\omega_{ij}^2} \rightarrow N(0, 1). \quad (8)$$

According to (8), the corresponding p-value and confidence interval can be estimated for each $\omega_{ij}$.
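The bias-correction step in (7) is a single matrix identity, sketched below in Python/NumPy (hypothetical function name; the nodewise regressions producing $\hat\Omega$ are assumed done elsewhere). A useful sanity check: if $\hat\Omega$ were exactly $\hat\Sigma^{-1}$, the correction would leave it unchanged.

```python
import numpy as np

def desparsify_nodewise(omega_init, sigma_hat):
    """(7): Omega_check = Omega_hat + Omega_hat' - Omega_hat' Sigma_hat Omega_hat.

    omega_init : (p, p) initial nodewise estimator Omega_hat.
    sigma_hat  : (p, p) sample covariance X'X / n."""
    return omega_init + omega_init.T - omega_init.T @ sigma_hat @ omega_init
```

The correction removes the first-order bias introduced by the Lasso shrinkage while keeping the computation to two matrix products.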

The de-sparsified graphical Lasso (D-S_GL)
D-S_GL [4] also depends on a bias correction procedure very similar to the one in D-S_NW_SL. However, the initial estimator $\hat\Omega = (\hat\omega_{ij})_{p \times p}$ here is obtained by solving a graphical Lasso optimization problem for $\Omega$:

$$\hat\Omega = \arg\min_{\Omega \succ 0}\left\{\operatorname{tr}(\hat\Sigma\Omega) - \log\det(\Omega) + \lambda\|\Omega\|_1\right\}. \quad (9)$$

Even though (9) is not tuning-free, the tuning parameter can be taken as $\lambda = \sqrt{\log(p)/n}$ according to the suggestion in [4]. Then, with the same idea of bias correction as in D-S_NW_SL, the de-sparsified (de-biased) estimator $\check\Omega = (\check\omega_{ij})_{p \times p}$ is

$$\check\Omega = 2\hat\Omega - \hat\Omega\hat\Sigma\hat\Omega. \quad (10)$$

Under the minimal sparseness assumption $s = o(\sqrt{n}/\log(p))$, each de-biased estimator $\check\omega_{ij}$ achieves the same asymptotically efficient result as the one in (8).
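The correction in (10) can be sketched the same way (hypothetical function name; the graphical Lasso solve producing $\hat\Omega$ is assumed done elsewhere). Note that because the graphical Lasso estimator is symmetric, (10) coincides with the nodewise-style correction $\hat\Omega + \hat\Omega^T - \hat\Omega^T\hat\Sigma\hat\Omega$:

```python
import numpy as np

def desparsify_glasso(omega_init, sigma_hat):
    """(10): Omega_check = 2*Omega_hat - Omega_hat Sigma_hat Omega_hat.

    omega_init : (p, p) graphical Lasso estimator (symmetric).
    sigma_hat  : (p, p) sample covariance X'X / n."""
    return 2 * omega_init - omega_init @ sigma_hat @ omega_init
```

As with (7), an exact inverse of $\hat\Sigma$ is a fixed point of the correction, which is a convenient unit test for an implementation.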

The Gaussian graphical model (GGM) estimation with false discovery rate (FDR) control using scaled Lasso or Lasso (GFC_SL or GFC_L)
While the previous three methods were originally developed for individual inference on each $\omega_{ij}$, GFC_SL or GFC_L [5] is proposed particularly for global inference on all $\omega_{ij}$'s. The approach is based on a bias correction of the sample covariance of residuals between each pair of variables $i$ and $j$. To obtain the estimated residuals, the first step of the method needs $p$ runs of nodewise scaled Lasso regression, the same as those in (5), or of nodewise Lasso regression of the $j$th variable $X_j$ against all the other variables $X_{-j}$:

$$X_j = X_{-j}\gamma_j + \epsilon_j. \quad (11)$$

Once the estimated coefficients $\hat\gamma_j$'s are obtained from (5) or (11), we can compute the estimated residuals $\hat\epsilon_j = X_j - X_{-j}\hat\gamma_j$ and the estimated sample covariance of residuals for the $(i, j)$th pair of variables,

$$\hat r_{ij} = \frac{1}{n}\sum_{k=1}^{n}\hat\epsilon_{ki}\hat\epsilon_{kj}. \quad (12)$$

The second step is to make a bias correction on $\hat r_{ij}$ to obtain $\hat T_{ij} = \hat r_{ij} + \hat r_{ii}\hat\gamma_{j,i} + \hat r_{jj}\hat\gamma_{i,j}$ (where $\hat\gamma_{i,j}$ denotes the estimated coefficient of $X_j$ in the regression of $X_i$) and construct a new test statistic

$$\hat W_{ij} = \hat T_{ij}\big/\sqrt{\hat r_{ii}\hat r_{jj}/n} \quad (13)$$

for testing the null hypothesis of conditional independence of the $(i, j)$th pair,

$$H_{0,ij}: \omega_{ij} = 0. \quad (14)$$
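Under the index convention assumed in the reconstruction above ($\hat\gamma_{i,j}$ is the coefficient of $X_j$ in the regression of $X_i$), the bias correction and standardization in (12)-(13) can be sketched in Python/NumPy as follows (hypothetical helper, not SILGGM's API; the nodewise fits are assumed done elsewhere):

```python
import numpy as np

def gfc_statistics(resid, gamma_hat):
    """Bias-corrected residual covariances T and standardized statistics W.

    resid     : (n, p) matrix of nodewise regression residuals eps_hat.
    gamma_hat : (p, p) matrix with zero diagonal; gamma_hat[i, j] is the
                estimated coefficient of X_j in the regression of X_i."""
    n = resid.shape[0]
    r = resid.T @ resid / n                       # (12): r_hat_ij
    d = np.diag(r)
    # T_ij = r_ij + r_ii * gamma_hat[j, i] + r_jj * gamma_hat[i, j]
    T = r + d[:, None] * gamma_hat.T + d[None, :] * gamma_hat
    # (13): W_ij = T_ij / sqrt(r_ii r_jj / n); N(0, 1) under H_{0,ij}
    W = T / np.sqrt(np.outer(d, d) / n)
    return T, W
```

Only the off-diagonal entries of $W$ are meaningful for testing; the diagonal is never used in the FDR procedure.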
Under the null hypothesis in (14) and the same minimal sparseness assumption as before, the statistic in (13) has an asymptotically normal result:

$$\hat W_{ij} \rightarrow N(0, 1). \quad (15)$$
Since GFC_SL or GFC_L is developed for global inference, another main component of this method is a novel framework for FDR control that has been theoretically proven valid in high-dimensional settings. The false discovery proportion (FDP) with a threshold $t$ can be written as

$$\mathrm{FDP}(t) = \frac{\sum_{(i,j) \in \mathcal{H}_0} I(|\hat W_{ij}| \geq t)}{\max\left\{\sum_{1 \leq i < j \leq p} I(|\hat W_{ij}| \geq t),\, 1\right\}}, \quad (16)$$

where $\mathcal{H}_0 = \{(i, j): \omega_{ij} = 0,\ 1 \leq i < j \leq p\}$ is the set of true nulls. Controlling the FDR requires controlling (16), since $E(\mathrm{FDP}(t)) = \mathrm{FDR}(t)$. The numerator of (16) is generally unknown, but the author of [5] has proved that it can be approximated by

$$\sum_{(i,j) \in \mathcal{H}_0} I(|\hat W_{ij}| \geq t) \approx 2(1 - \Phi(t)) \cdot \frac{p^2 - p}{2}, \quad (17)$$

where $\Phi(\cdot)$ is the standard normal cumulative distribution function. Therefore, for a pre-specified FDR level $\alpha$, we can choose the threshold

$$\hat t = \inf\left\{0 \leq t \leq 2\sqrt{\log(p)}: \frac{2(1 - \Phi(t))(p^2 - p)/2}{\max\left\{\sum_{1 \leq i < j \leq p} I(|\hat W_{ij}| \geq t),\, 1\right\}} \leq \alpha\right\}. \quad (18)$$

We reject $H_{0,ij}$ in (14) if $|\hat W_{ij}| \geq \hat t$.
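The threshold search in (18) reduces to a one-dimensional scan over candidate values of $t$; a minimal Python/SciPy sketch follows (hypothetical function name; the grid resolution is an implementation choice, not part of the method):

```python
import numpy as np
from scipy.stats import norm

def fdr_threshold(W, alpha=0.1, n_grid=2001):
    """Search (18): the smallest t in [0, 2*sqrt(log p)] whose estimated FDP
    2(1 - Phi(t)) * (p^2 - p)/2 / max(#{|W_ij| >= t}, 1) is <= alpha."""
    p = W.shape[0]
    iu = np.triu_indices(p, k=1)
    absw = np.abs(W[iu])                     # upper-triangular statistics
    t_max = 2 * np.sqrt(np.log(p))
    for t in np.linspace(0.0, t_max, n_grid):
        denom = max(int(np.sum(absw >= t)), 1)
        est_fdp = 2 * (1 - norm.cdf(t)) * (p * p - p) / 2 / denom
        if est_fdp <= alpha:
            return t
    return t_max  # conventional fallback when no t qualifies
```

Because the numerator of the estimated FDP shrinks rapidly in $t$ while the rejection count changes in steps, the scan typically terminates well before $t_{\max}$ when strong signals are present.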
As an alternative to the tuning-free scaled Lasso regression, each run of the Lasso regression in (11) for $X_j$ against $X_{-j}$ requires a tuning parameter, which is selected data-adaptively from a grid of $N$ candidate values. Here, $N = 20$ is set by default and a different value of $N$ can be set up in practice.
Since the previous three methods have asymptotically normal results in terms of (4) and (8), the FDR framework described in (17) and (18) can also be applied to them by replacing $\hat W_{ij}$ with a different test statistic based on $\hat\omega_{ij}$ or $\check\omega_{ij}$. Therefore, the implementations of B_NW_SL, D-S_NW_SL and D-S_GL allow global inference as well.
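For example, the standardized statistic implied by (8) can be computed from the de-biased and initial estimates and then passed to the same thresholding rule; a sketch under the notation above (hypothetical helper, not SILGGM's exported interface):

```python
import numpy as np

def z_scores(omega_check, omega_init, n):
    """Standardize de-biased entries per (8), under H0 (omega_ij = 0):
    z_ij = sqrt(n) * omega_check_ij / sqrt(w_ii * w_jj + w_ij^2),
    where w refers to the initial estimator omega_init."""
    d = np.diag(omega_init)
    var = np.outer(d, d) + omega_init ** 2   # plug-in asymptotic variance
    return np.sqrt(n) * omega_check / np.sqrt(var)
```

The resulting matrix of z-scores plays the role of $\hat W$ in (16)-(18), so the same threshold search applies unchanged; only off-diagonal entries are used.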