Empowering differential networks using Bayesian analysis

Differential networks (DNs) are important tools for modeling the changes in conditional dependencies between multiple samples. A Bayesian approach for estimating DNs is introduced, together with a computationally efficient threshold selection for graphical model determination. The algorithm separately estimates the precision matrices of the DN using the Bayesian adaptive graphical lasso procedure. Synthetic experiments illustrate that the Bayesian DN performs exceptionally well in numerical accuracy and graphical structure determination in comparison to state-of-the-art methods. The proposed method is applied to South African COVID-19 data to investigate the change in DN structure between various phases of the pandemic.


Introduction
Probabilistic networks are becoming ever-present in a multitude of scientific disciplines. These networks aim to illustrate the relationships, if any, between the components of complex systems [1]. If the data is assumed to be Gaussian distributed with mean µ and covariance matrix Σ, the precision matrix Θ = Σ −1 directly determines the conditional dependence relations and the structure of the undirected graphical model [2].
Numerous statistical methods exist for Gaussian covariance and precision matrix estimation, as well as graphical model determination. In particular, from a frequentist approach, [3] introduce a computationally efficient neighborhood selection procedure. The lasso is used for neighborhood estimation, which enjoys consistency for sparse high-dimensional graphs. The approach is quite effective, in that the sparse precision matrix is estimated by fitting the lasso to each variable using the remaining variables as predictors. Finally, the estimated precision matrix entry θ ij is non-zero if the estimated coefficient of i on j, or vice versa, is non-zero. Importantly, their algorithm can consistently estimate the set of non-zero entries in Θ [4]. For a penalised likelihood methodology for sparse precision matrix estimation see [5, 6]. Furthermore, [7] estimate the undirected graphical model using both a block coordinate descent algorithm and Nesterov's first order method [8]. Additionally, [9] propose an ℓ1-constrained estimation technique for both sparse and non-sparse high dimensional matrices, applicable to a wide range of sparsity patterns and classes of matrices; precision estimation in Gaussian graphical models for example. For a joint graphical model estimation approach see [10, 11].
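The neighborhood selection idea above can be sketched in a few lines. The snippet below is a minimal illustration, not the implementation in [3]: `lasso_cd` is a plain coordinate-descent lasso written for this example, and the OR rule marks edge (i, j) whenever either nodewise regression yields a non-zero coefficient.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate-descent lasso for min (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]          # partial residual excluding column j
            rho = X[:, j] @ r_j / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

def neighborhood_select(X, lam):
    """Regress each variable on the rest; keep edge (i, j) if either coefficient is non-zero."""
    p = X.shape[1]
    adj = np.zeros((p, p), dtype=bool)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        b = lasso_cd(X[:, others], X[:, j], lam)
        adj[j, others] = b != 0
    return adj | adj.T                                 # the OR rule symmetrises the graph
```

On data in which only two variables are dependent, a suitably chosen `lam` recovers exactly that edge.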
Fully Bayesian treatments of Gaussian graphical models are also well rooted in the literature. In particular, [12] introduce the Bayesian adaptive graphical lasso (BAGLASSO), which utilises a generalised Pareto distribution in the hierarchical formulation of the Bayesian graphical lasso. [13] provide a method for graphical model determination by invoking positive prior mass on the event that there are no conditional dependencies between variables. In terms of joint graphical model inference from a Bayesian perspective, see [14]. Lastly, [15] propose using the Kullback-Leibler divergence and cross-validation for graphical model structure estimation.

Background
DN analysis is a statistical methodology that involves functions of at least two graphical models. Numerous measures exist for comparing and evaluating the differences between these graphical structures [1]. Let G = (V, E) define a graphical model with nodes V = {1, 2, ..., p} and a set of edges E ⊆ V × V. The graph visually depicts the conditional dependence structure between the nodes of the system. For this work, the focus will be on the difference of two Gaussian graphical models, G 1 and G 2 , that share the same set of nodes V. In particular, the edge sets given here are equivalent to the adjacency matrices obtained from the Gaussian graphical model estimation. More specifically, assume that the observations x 1 , x 2 , ..., x n1 and y 1 , y 2 , ..., y n2 are generated from p-variate Gaussian distributions N p (µ 1 , Σ 1 ) and N p (µ 2 , Σ 2 ), respectively. The interest here is estimating the DN, that is, the difference between the two precision matrices. DN analysis is becoming increasingly popular and important, for example in biological systems where protein interaction networks can be utilised as informative biosignatures for prevalent diseases [16, 17]. The fundamental idea here is that, if two molecules interact with one another, then a statistical dependency between them should be observed. Additionally, another application of DNs is multivariate statistical quadratic discriminant analysis [18, 19], under the Gaussian distribution assumption.
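A small numerical example makes the object of interest concrete. The two precision matrices below are hypothetical; the DN is simply their difference, and its non-zero off-diagonal entries flag the conditional dependencies that changed between the two samples.

```python
import numpy as np

# Two hypothetical 4x4 precision matrices sharing the node set V = {1, 2, 3, 4}.
theta1 = np.array([[2.0, 0.6, 0.0, 0.0],
                   [0.6, 2.0, 0.6, 0.0],
                   [0.0, 0.6, 2.0, 0.6],
                   [0.0, 0.0, 0.6, 2.0]])
theta2 = np.array([[2.0, 0.6, 0.0, 0.0],
                   [0.6, 2.0, 0.0, 0.0],
                   [0.0, 0.0, 2.0, 0.6],
                   [0.0, 0.0, 0.6, 2.0]])

delta = theta1 - theta2                              # the differential network
diff_edges = (delta != 0) & ~np.eye(4, dtype=bool)   # edges whose strength changed
```

Here only the (2, 3) dependency differs between the two samples, so `diff_edges` flags exactly that edge.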
Recently, a plethora of statistical techniques have emerged for estimating DNs. These techniques can largely be classified into two main categories. The first estimates the individual precision matrices Θ 1 and Θ 2 separately, where the estimated DN is the difference between the estimated precision matrices. For example, the methods and references for Gaussian graphical model estimation outlined in the introduction can be used directly to estimate ∆. The second methodology estimates both precision matrices simultaneously, typically by penalising a joint loss function for both precision matrices. [20] provide a methodology for inference and estimation of functions of Gaussian graphical models. In particular, their Intertwined Graphical LASSO (IGL) approach biases the estimation of the precision matrices towards a common value, while their Graphical Cooperative-LASSO (GCL) utilises a group-penalty for solutions that favour a common sparsity pattern. [10] and [21] estimate separate graphical models using a joint penalised loss function. [22] propose a method for estimating ∆ directly, which relaxes the need for the individual precision matrices to be sparse or to be estimated directly. Similarly, [23] and [19] utilise an alternating direction method of multipliers (ADMM) algorithm for estimating ∆ from their joint ℓ1-penalised convex loss function. More recently, [24] introduce a computationally efficient iterative shrinkage-thresholding algorithm for minimising the ℓ1-penalised loss function defined in [19], namely L(∆) = L D (∆) + λ‖∆‖ 1 , (1) where the D-trace loss L D (∆) = tr(∆S 1 ∆S 2 )/2 − tr(∆(S 1 − S 2 )) is convex and S 1 and S 2 are the sample covariance matrices. The DN estimate is obtained by minimising the penalised loss Eq (1). An analogous symmetric convex loss function and estimator is proposed by [23].
The shrinkage-thresholding algorithm proposed by [24], based on the fast iterative shrinkage-thresholding algorithm in [25], aims to minimise Eq (1) without computing matrix inverses in the estimation process. The objective is arg min ∆ L(∆), and the algorithm converges to the solution sequentially using a quadratic approximation and gradient descent steps. This construction results in superior computational complexity in contrast to the ADMM approaches of [23] and [19].
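A minimal sketch of such a shrinkage-thresholding scheme is given below, assuming the commonly used ℓ1-penalised D-trace loss L(∆) = tr(∆S1∆S2)/2 − tr(∆(S1 − S2)) + λ‖∆‖1; the fixed step size and iteration count are illustrative choices rather than the accelerated scheme of [24].

```python
import numpy as np

def soft(x, t):
    """Entrywise soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista_dtrace(S1, S2, lam, step=0.1, n_iter=500):
    """ISTA sketch for L(D) = 0.5*tr(D S1 D S2) - tr(D (S1 - S2)) + lam*||D||_1."""
    p = S1.shape[0]
    D = np.zeros((p, p))
    for _ in range(n_iter):
        grad = 0.5 * (S1 @ D @ S2 + S2 @ D @ S1) - (S1 - S2)  # gradient of smooth part
        D = soft(D - step * grad, step * lam)                 # proximal (soft-threshold) step
        D = (D + D.T) / 2                                     # keep the estimate symmetric
    return D
```

Note that each iteration uses only matrix products, never a matrix inverse, which is the point of the shrinkage-thresholding construction.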

Contributions
In this paper, the objective is to develop a framework for Bayesian DN estimation, which remains unexplored. In particular, the BAGLASSO is adapted for Bayesian DN estimation; frequent references thereto are included throughout the development of the novel DN architecture. A block Gibbs sampler is used for estimating each component of the DN. An adjusted edge inclusion threshold for graphical structure learning, based on a conjugate Wishart prior, is also exhibited. Comparisons in synthetic data studies illustrate that the Bayesian DN is proficient in both graphical structure learning and matrix estimation when compared to current state-of-the-art methods.
Finally, it is worth noting the main contributions of this paper. First, a novel Bayesian approach to estimating DNs, using the Bayesian adaptive graphical lasso (BAGLASSO), is introduced. Second, a threshold selection strategy for graphical structure determination, based on a conjugate Wishart prior, is explored. Third, an application to South African COVID-19 data investigates the change in DN structure of key daily metrics between various phases of the pandemic. Lastly, an R package, abglasso, has been developed for the BAGLASSO block Gibbs sampler. The Markov chain Monte Carlo (MCMC) sampler simulates precision matrices from the posterior distribution of the BAGLASSO. The package is available on the Comprehensive R Archive Network (CRAN).

The Bayesian DN
A fully Bayesian treatment of DNs remains unexplored, and the novel methodology here aims to develop a simple yet highly accurate Bayesian DN estimation procedure. The novel contribution utilises the BAGLASSO as a launching point to separately estimate the components of the DN. Moreover, the framework has been developed for low to moderate dimensions, p ∈ [10, 100], where n ≥ p.

The Bayesian graphical lasso prior
Recall that the graphical lasso objective is maximising the penalised log-likelihood arg max Θ∈M + { log det Θ − tr(SΘ) − ρ‖Θ‖ 1 }, (2) where M + is the space of positive definite matrices and S is the sample covariance matrix. Moreover, ρ ≥ 0 is the shrinkage parameter and Θ = (θ ij ) is the precision matrix. The Bayesian connection to the graphical lasso problem is the maximum a posteriori (MAP) estimate, assuming a random sample from N p (µ, Θ −1 ), under the prior p(Θ | λ) ∝ ∏ i<j DE(θ ij | λ) ∏ i EXP(θ ii | λ/2) 1 Θ∈M + . (3) The prior distribution is given by the product of a double exponential (DE) with form p(y) = λ/2 exp(−λ|y|) for the off-diagonal elements and an exponential (EXP) with form p(y) = λ exp(−λy)1 y>0 , otherwise. The value of Θ which maximises the posterior density is the graphical lasso estimate when ρ = λ/n.
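The objective in Eq (2) is easy to evaluate directly, which is useful for sanity-checking a candidate estimate. The helper below is a sketch that computes the penalised log-likelihood, penalising all entries of Θ for simplicity.

```python
import numpy as np

def glasso_objective(theta, S, rho):
    """Penalised Gaussian log-likelihood (up to constants):
    log det(Theta) - tr(S Theta) - rho * ||Theta||_1 over all entries (a sketch)."""
    sign, logdet = np.linalg.slogdet(theta)
    assert sign > 0, "Theta must be positive definite"
    return logdet - np.trace(S @ theta) - rho * np.abs(theta).sum()
```

With ρ = 0 and S the identity, the objective is maximised at Θ = S −1 = I, which the unpenalised likelihood theory predicts.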
Hierarchical representation
[12] propose a hierarchical representation of the graphical lasso prior Eq (3), using the Bayesian lasso formulation in [26]. The Gibbs sampler in [26] utilises the structure of the double exponential distribution as a scale mixture of Gaussians (assuming independence of the conditional double exponential priors) ([27]; [28]) to simulate from the desired posterior distribution. The positive definite constraint on the precision matrix in the graphical lasso Eq (3) implies that the Gaussian components for θ ij in the scale mixture formulation are no longer independent given the scale parameters. To address this issue, the hierarchical representation of the graphical lasso prior is given by p(θ | τ, λ) ∝ ∏ i<j N(θ ij | 0, τ ij ) ∏ i EXP(θ ii | λ/2) 1 Θ∈M + , (4) where θ = {θ ij } i≤j is a vector of the upper triangular entries of Θ and τ = {τ ij } i<j are the scale parameters. The normalising constant has no closed-form solution. To obtain the marginal distribution Eq (3), [12] propose an exponential mixing density for the τ ij with rate parameter λ 2 /2. Simple substitution shows that this mixing density circumvents the intractable normalising constant. Finally, the hierarchical representation in Eq (4) is used in the development of the data-augmented block Gibbs sampler, available in the supplementary material, with a target distribution given by the joint posterior of (Θ, τ).
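The scale-mixture identity underlying the hierarchy can be checked empirically: drawing τ from an exponential density with rate λ²/2 and then θ | τ ∼ N(0, τ) yields a double exponential DE(λ) marginal, whose variance is 2/λ². The snippet below is a simulation check of this identity, not part of the sampler itself.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n = 2.0, 200_000

# tau ~ Exp(rate = lam^2 / 2), i.e. mean 2 / lam^2, then theta | tau ~ N(0, tau).
tau = rng.exponential(scale=2.0 / lam**2, size=n)
theta = rng.normal(0.0, np.sqrt(tau))

# Marginally theta ~ DE(lam), so its variance should be close to 2 / lam^2 = 0.5.
```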

BAGLASSO
It is well known that the double exponential prior in Eq (3) may over-shrink (under-shrink) large (small) coefficients in Θ. These limitations have been studied within a regression context in ([29]; [30]; [31]), with alternative proposals. The BAGLASSO, the Bayesian analogue of the adaptive graphical lasso, exploits the framework and flexibility of the hierarchical representation in Eq (4) to address the aforementioned limitation by assigning element-specific shrinkage, where ξ ij = 1/|θ̃ ij | are the adaptive weights and the weight matrix (θ̃ ij ) is the sample precision matrix.
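The adaptive weights are straightforward to form from data. The sketch below inverts a slightly ridge-regularised sample covariance to obtain a workable weight matrix when S is near-singular; the regularisation constant `eps` is an assumption for this illustration, not part of the BAGLASSO specification.

```python
import numpy as np

def adaptive_weights(X, eps=1e-6):
    """xi_ij = 1 / |theta_tilde_ij| from a (ridge-regularised) sample precision matrix."""
    S = np.cov(X, rowvar=False)
    theta_tilde = np.linalg.inv(S + eps * np.eye(S.shape[0]))
    return 1.0 / (np.abs(theta_tilde) + eps)   # eps also guards against division by zero
```

Strong conditional dependencies (large |θ̃ij|) receive small weights, and hence little shrinkage, while near-zero entries receive heavy shrinkage.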
The Bayesian graphical lasso Eq (3) enables the selection of an appropriate hyperprior on the shrinkage parameter λ; recall that ρ = λ/n in the Bayesian formulation of Eq (2). Adhering to the positive definite constraint on Θ, the normalising constant in Eq (3) for a single prior λ for all elements in Θ can be obtained by applying the substitution Θ̃ = λΘ. Thereafter, a diffuse gamma prior λ ∼ GA(r, s) and the corresponding conditional posterior λ | Θ ∼ GA(r + p(p + 1)/2, s + ‖Θ‖ 1 /2) can be obtained and sampled from. When allowing for individual λ ij 's for different θ ij 's, the normalising constant C will inevitably depend on the λ ij , and a hierarchical formulation can be used to construct a set of prior distributions for the various λ ij . In particular, assuming a random sample from N p (µ, Θ −1 ), independent GA(r, s) hyperpriors are placed on the off-diagonal λ ij in the hierarchical model Eq (7). The normalising constant C {λij } i≤j is intractable and the set {λ ii } p i=1 are hyperparameters for the diagonal elements of Θ. As it happens, the computation of λ ij is simplified by circumventing the intractable normalising constant.
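For the single-λ case, the conditional update can be sketched as below; the shape r + p(p + 1)/2 and rate s + ‖Θ‖1/2 follow the standard Bayesian graphical lasso derivation, with ‖Θ‖1 taken as the sum of absolute values of all entries of Θ.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_lambda(theta, r=1e-2, s=1e-6):
    """One draw of lambda | Theta ~ GA(r + p(p+1)/2, s + ||Theta||_1 / 2) (a sketch)."""
    p = theta.shape[0]
    l1 = np.abs(theta).sum()                 # ||Theta||_1 over all entries
    shape = r + p * (p + 1) / 2
    rate = s + l1 / 2
    return rng.gamma(shape, 1.0 / rate)      # numpy parameterises gamma by scale = 1/rate
```

Averaging many draws recovers the gamma posterior mean shape/rate, which provides a quick correctness check on the update.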
The BAGLASSO selects the amount of shrinkage λ ij proportionally to the current value of θ ij . To see this, [12] demonstrate that the conditional posterior, λ ij | Θ ∼ GA(1 + r, |θ ij | + s), has an expected value that is inversely related to the magnitude of θ ij . The data-augmented block Gibbs sampler for the hierarchical representation in Eq (7) is the fundamental building block upon which the novel Bayesian DN is devised.

Technicalities on conditional dependencies
A conjugate Wishart prior may be used to infer on events such as θ ij = 0 when purely Bayesian posterior inference regarding the network structure is desired. An alternative thresholding strategy is presented here, which is an adaptation of the recommendation by [31]. In particular, the conjugate Wishart W(3, (εI p ) −1 ) prior is used. The corresponding posterior is W(3 + n, (S + εI p ) −1 ), where ε = 0.001 and I p is the p-dimensional identity matrix. The posterior samples are used to compute the posterior distribution of the p × p partial correlation matrix. The strategy recommended here suggests including edge (i, j) whenever the posterior mean of the absolute partial correlation exceeds the threshold η (Eq (8)). The Bayesian posterior thresholding recommendation by [12] (Eq (9)) instead compares ρ̄ ij , the posterior sample mean estimate of the partial correlation under the graphical lasso priors Eq (3), with the corresponding quantity obtained under the standard conjugate Wishart W(3, I p ) prior. Moreover, η ∈ [0, 1], with the lower and upper bounds resulting in a completely dense or sparse estimate, respectively. The original recommendation for η in Eq (9) is 0.5. The forthcoming synthetic data analysis section describes the simulation procedure, as well as illustrates the performance of the Bayesian DN with regards to different graph structures, namely AR(1), AR(2), sparse random, scale-free, band, cluster, star and circle. The goal here, however, is to suggest a suitable sparsity threshold region under the varying graph structures. The Bayesian DN is applied across all graph structures with thresholds η in the range of 0.2 to 0.6 in increments of 0.02. The absolute sparsity error is computed for each graph structure and threshold candidate for both Bayesian sparsity criteria in Eq (9) and Eq (8). The results are based on the median of 40 replications, and the Matthews correlation coefficient (MCC), see [32], is used to determine the best performing threshold. Fig 1 (1a)-(1i) displays the optimal threshold, based on the top performing MCC, for each graph structure and Bayesian sparsity criterion for p = 10. The optimal threshold plots for p = 30 and p = 100 are available in the supplementary material. The optimal threshold based on Eq (8), η *, for the Bayesian DN is, in most cases, in the neighborhood of the minimum absolute sparsity error and in the region of η * ∈ [0.2, 0.4]. Both Bayesian sparsity criteria perform comparably well, noting, however, that Eq (8) requires less computation.
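The edge-inclusion rule based on partial correlations can be sketched as follows; `threshold_edges` is a hypothetical helper that applies a single threshold η to one (e.g. posterior mean) precision matrix, rather than to a full set of posterior samples.

```python
import numpy as np

def partial_correlations(theta):
    """Partial correlations from a precision matrix: rho_ij = -theta_ij / sqrt(theta_ii * theta_jj)."""
    d = np.sqrt(np.diag(theta))
    rho = -theta / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho

def threshold_edges(theta, eta=0.3):
    """Keep edge (i, j) when |rho_ij| exceeds eta (a sketch of the Eq (8) style rule)."""
    adj = np.abs(partial_correlations(theta)) > eta
    np.fill_diagonal(adj, False)
    return adj
```

Lower η yields denser graphs and η close to 1 yields empty graphs, matching the bounds discussed above.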

Synthetic data analysis
The synthetic experiment is designed to test the parameter estimation and graphical structure determination of the DN estimation for both the novel Bayesian approach (referred to as 'B-net') and the iterative shrinkage-thresholding estimator (referred to as 'D-net') from [24]. The iterative shrinkage-thresholding estimator uses the lasso penalty and the Bayesian information criterion (BIC) for model estimation and selection, respectively. For all simulations, the assumption is that the observations x 1 , x 2 , ..., x n1 and y 1 , y 2 , ..., y n2 are generated from Gaussian distributions N p (0, Σ 1 ) and N p (0, Σ 2 ), respectively. The true DN is ∆ = Θ 1 − Θ 2 , the difference of the true precision matrices Θ 1 = Σ 1 −1 and Θ 2 = Σ 2 −1 . The Bayesian DN applies the BAGLASSO Eq (7) to each sample, i.e. separately estimates the precision matrices. Furthermore, for excellent performance, the hyperparameters of the prior distributions of λ ij for i < j are set to r = 10 −2 and s = 10 −6 (see the supplementary material for more details), with λ ii = 1 for i = 1, ..., p. The iterative shrinkage-thresholding approach jointly estimates the precision matrices via Eq (1). The following 9 graphical structure variations are considered in the simulation, where the structure of each is applied to each component in the DN's composition to achieve the desired structure in the DN itself: • structure 1 : An AR(1) model.
• structure 2 : An AR(2) model.
• structure 3 : A sparse random model where both components have approximately up to 80% off-diagonal elements set to zero.
• structure 4 : A moderately sparse random model where both components have approximately up to 40% off-diagonal elements set to zero.
• structure 5 : A scale-free model where the second component is a scalar multiple of the first.
• structure 6 : A band or diagonal model.
• structure 7 : A cluster model containing two disjoint groups.
• structure 8 : A star model with every node connected to the first node.
• structure 9 : A circle model.
The sample sizes and dimensions for each model are n 1 = n 2 ∈ {50, 100, 200} and p 1 = p 2 ∈ {10, 30, 100}, respectively. The Bayesian estimates are based on 10000 Monte Carlo iterations after 5000 burn-in iterations. To assess the performance of DN matrix estimation, six loss functions are considered, defined in Table 1, where p denotes the dimension and γ i the i th eigenvalue, respectively. Notice that some loss functions utilise the true DN matrix and its estimates, while others utilise the eigenvalues and their respective estimates. Table 2 reports the median of L1, L2, EL1, EL2, MAXEL1 and MINEL1 for p = 10, 30, 100 in structures 1 − 9 based on 40 replications. For each scenario, the best performing measure is boldfaced.
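The data-generating step of the experiment can be sketched as below for the AR(1)-type structure; the tridiagonal parameterisation and the values 0.45 and 0.25 are illustrative assumptions, chosen only so that both precision matrices are positive definite and their difference forms a non-trivial band.

```python
import numpy as np

rng = np.random.default_rng(42)

def ar1_precision(p, phi):
    """Tridiagonal AR(1)-type precision matrix (hypothetical parameterisation)."""
    theta = np.eye(p)
    idx = np.arange(p - 1)
    theta[idx, idx + 1] = phi
    theta[idx + 1, idx] = phi
    return theta

p, n1, n2 = 10, 200, 200
theta1, theta2 = ar1_precision(p, 0.45), ar1_precision(p, 0.25)
x = rng.multivariate_normal(np.zeros(p), np.linalg.inv(theta1), size=n1)  # first sample
y = rng.multivariate_normal(np.zeros(p), np.linalg.inv(theta2), size=n2)  # second sample
delta_true = theta1 - theta2          # the target of the DN estimators
```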
Table 1. Loss functions used in the synthetic data analysis to assess the numerical accuracy of the B-net and D-net estimates.

The eigenvalue based loss functions are designed to investigate the extremes of the eigenvalue spectrum. In particular, the MAXEL1 loss function highlights which estimator is favourable in a principal component setting [33]. A few observations are worth noting from Table 2 and Table 3. First, the D-net estimator performs better with the AR(1) structure. Second, the B-net estimator performs exceptionally well in the remaining structures. Third, the standard errors for both DN estimation techniques remain relatively consistent throughout the dimension spectrum considered, noting that the D-net estimator yields, in general, better results. This may be due to the fact that the best performing tuning parameter in the solution path leads to highly sparse estimates. The B-net estimation procedure inherits the utilisation of multiple penalty parameters in the precision matrix estimation, leading to robust estimation of the precision matrices.
To assess the performance on graphical structure determination, the specificity, sensitivity, false negative rate, F1 score and the MCC are computed, as defined in Table 5. Here TP, TN, FP and FN denote the number of true positives, true negatives, false positives and false negatives, respectively. Values of specificity, sensitivity, F1 score and MCC closer to one imply better classification performance; the closer the values of the false negative rate are to zero, the better. The sparsity of the B-net estimator is determined by the thresholding rule in Eq (8) and Fig 1 (1a)-(1i). Similarly, the best performing tuning parameter in the solution path of the D-net algorithm determines the sparsity of that estimator. The median performance scores, based on 40 repetitions, for each graphical structure are presented in Table 4. The main diagonals of the adjacency matrices were not included in the scoring.
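The classification scores can be computed from the off-diagonal entries of the true and estimated adjacency matrices as sketched below, using the standard definitions of the confusion-matrix quantities.

```python
import numpy as np

def structure_scores(true_adj, est_adj):
    """SE, SP, FNR, F1 and MCC from the off-diagonal entries of two adjacency matrices."""
    mask = ~np.eye(true_adj.shape[0], dtype=bool)   # main diagonal excluded from scoring
    t, e = true_adj[mask], est_adj[mask]
    tp = np.sum(t & e)
    tn = np.sum(~t & ~e)
    fp = np.sum(~t & e)
    fn = np.sum(t & ~e)
    se = tp / (tp + fn)                              # sensitivity
    sp = tn / (tn + fp)                              # specificity
    fnr = fn / (fn + tp)                             # false negative rate
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return dict(SE=se, SP=sp, FNR=fnr, F1=f1, MCC=mcc)
```

When a class is absent (single class classification), some denominators vanish and the scores are undefined, which matches the NA entries noted in Table 4.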

The B-net estimator generally outperforms the D-net estimator across all models and all dimensions according to the MCC, F1 score, sensitivity and false negative rate, with the exception of the star case for p = 100.

Real data analysis
This section focuses on applying the novel Bayesian DN estimator, B-net, as well as the iterative shrinkage-thresholding estimator, D-net, to the spambase dataset, available at https://archive.ics.uci.edu/ml/datasets/spambase, to investigate changes in DN structure between spam and non-spam data. In addition, the B-net estimator is applied to South African COVID-19 data, obtained from https://www.nicd.ac.za/diseases-a-z-index/covid-19/surveillance-reports, to investigate the change in DN structure between various phases of the pandemic.

Spam data
The objective here is to compare the B-net and D-net graphical model estimates of the spam and non-spam emails. The dataset consists of 1813 spam emails and 2788 non-spam emails. The attributes include, amongst others, the average length of uninterrupted sequences of capital letters; the total number of capital letters in the e-mail; and an indicator denoting whether the e-mail was considered spam or not, in this study.
Following the approach of [24], the data is standardised using a non-paranormal transformation in order to satisfy the Gaussian assumption. The B-net estimates are based on 10000 iterations of the Monte Carlo sampler after 5000 burn-in iterations. Figure 4 illustrates the difference between the B-net and D-net estimates. Both estimators indicate the presence of several common hub features, namely 'edu', 'original', 'direct', 'lab', 'telnet' and 'addresses'. It is clear from both panes that the state of the covariance matrix structure between the spam and non-spam emails may very well be different. Furthermore, given that Hewlett-Packard Labs donated the data, words such as 'telnet' and 'hp' appear more often in the non-spam emails and can be used to distinguish between spam and non-spam emails.

South African COVID-19 data
The 2019 novel coronavirus (COVID-19) has affected more than 180 countries around the world, including South Africa. Understanding the interaction of key metrics and attributes between various phases, cycles or waves of the pandemic may prove to be invaluable in strategic planning and prevention. The goal here is to use the Bayesian DN, B-net, to illustrate that the interactivity of key daily metrics between suspected homogeneous and heterogeneous phases within the pandemic life cycle is ever changing.
In particular, the B-net is used to model the interactivity of daily metrics between the first two peaks or waves; between the first wave and the following plateau; and finally between the first and second post-wave plateaus. The data consists of 446 observations from the 7th of February 2020 to the 27th of April 2021. The daily metrics include deaths; performed tests; positive test rate; active cases; tests per active case; recoveries; hospital admissions; hospital discharges; ICU admissions and the number of ventilated patients. Due to the irregularities in data capturing and publishing, a seven-day moving average is applied across all daily metrics. The data is standardised using a non-paranormal transformation in order to satisfy the Gaussian assumption. The B-net is applied to the data using 10000 iterations of the Monte Carlo sampler after 5000 burn-in iterations. Figure 5 highlights the temporal nature of the pandemic between suspected homogeneous and heterogeneous phases. In other words, comparing the cyclical behaviour of individual daily metrics may seem clearly distinctive over time; a peak or wave is always followed by a plateau. Furthermore, extrapolation of the temporal behaviour of individual daily metrics may incorrectly allude to distinct multimodality of multiple daily metrics. Upon observing multiple metrics simultaneously, the crisp group-wise multimodality diminishes rather rapidly. The figures in Table 6 illustrate the higher proportions of hub features present in the DNs. Interestingly, the Bayesian DN provides insight into the change in interaction between daily metrics between perceived homogeneous pandemic phases, that is, comparisons between the two peaks and the two post-peak plateaus. This change in behaviour could be a result of changes in population adherence to public sanitation awareness; weather conditions; virus mutations or complacency over time.
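The two preprocessing steps described above can be sketched as follows; the rank-based normal-scores transform is one common flavour of the non-paranormal and need not match the exact transformation used by [24].

```python
import numpy as np
from statistics import NormalDist

def moving_average(x, window=7):
    """Trailing moving average to smooth daily-reporting irregularities."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

def nonparanormal(x):
    """Rank-based normal-scores transform (one flavour of the non-paranormal)."""
    n = len(x)
    ranks = np.argsort(np.argsort(x)) + 1
    u = ranks / (n + 1)                        # ranks mapped into (0, 1)
    inv = NormalDist().inv_cdf                 # standard normal quantile function
    return np.array([inv(v) for v in u])
```

The transform is monotone, so it preserves the ordering of observations while Gaussianising their marginal distribution.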

Fig 1 .
Fig 1. The median of the absolute sparsity error and best performing MCC for various graph structures under varying thresholds for each Bayesian sparsity criterion in Eq (9) (dotted) and Eq (8) (dot-dash) for dimension p = 10. The best performing threshold is indicated by a vertical line with the accompanying MCC value displayed in the legend.
Figure 3 (1a)-(1i) displays the true and inferred undirected DN graphs for both the B-net and D-net estimators for p = 10; higher dimensions are available in the supplementary material. Lastly, Figure 2 (2a)-(2i) displays the true and inferred adjacency matrices for p = 10. Both Figures 3 and 2 visually demonstrate the superiority of the B-net estimator.

Fig 2 .
Fig 2. Comparison of the true DN, B-net and D-net adjacency matrices for an AR(1), AR(2), sparse random, scale-free, band, cluster, star and circle graphical model and p = 10.

(a) The Bayesian DN for the spam emails dataset. (b) The iterative-shrinkage DN for the spam emails dataset.

Fig 4 .
Fig 4. A comparison of the D-net and B-net DN estimates for the spam emails dataset.

Fig 5 .
Fig 5. 7-day moving average filled area line plots with standardised counts for daily new cases; deaths; tests; positive test rate; active cases; tests per active case; recoveries; hospital admissions; hospital discharges; ICU admissions and ventilated patients.

Fig 2 .
Fig 2. Comparison of the true DN, B-net and D-net adjacency matrix estimates for the scale-free, band and cluster structure for p = 30.

Fig 3 .
Fig 3. Comparison of the true DN, B-net and D-net graphical structure estimates for the scale-free, band and cluster structure for p = 100.

Fig 4 .
Fig 4. Comparison of the true DN, B-net and D-net adjacency matrix estimates for the scale-free, band and cluster structure for p = 100.

Fig 5 .
Fig 5. The median of the absolute sparsity error and best performing MCC for the scale-free, band and cluster structure under varying thresholds for each Bayesian sparsity criterion in Eq (9) (dotted) and Eq (8) (dot-dash) for dimension p = 30. The best performing threshold is indicated by a vertical line with the accompanying MCC value displayed in the legend.

Fig 6 .
Fig 6. The median of the absolute sparsity error and best performing MCC for the scale-free, band and cluster structure under varying thresholds for each Bayesian sparsity criterion in Eq (9) (dotted) and Eq (8) (dot-dash) for dimension p = 100. The best performing threshold is indicated by a vertical line with the accompanying MCC value displayed in the legend.

Table 2 .
Summary of L1, L2, EL1, EL2, MAXEL1 and MINEL1 for an AR(1), AR(2), sparse random, scale-free, band, cluster, star and circle graphical model. The median loss values reported here are based on 40 replications for both the B-net and D-net estimators. The best performing values are boldfaced.

Table 3 .
Summary of L1, L2, EL1, EL2, MAXEL1 and MINEL1 for an AR(1), AR(2), sparse random, scale-free, band, cluster, star and circle graphical model. The median standard errors reported here are based on 40 replications for both the B-net and D-net estimators. The best performing values are boldfaced.

Table 4 .
Summary of SE, SP, FNR, F1 and MCC for an AR(1), AR(2), sparse random, scale-free, band, cluster, star and circle graphical model. The median performance scores reported here are based on 40 replications for both the B-net and D-net estimators. The best performing values are boldfaced, and scores that were not attainable due to single class classification are encoded as NA.

Table 5 .
Performance measures used to assess the classification accuracy of the B-net and D-net graphical model estimates.