Gene Regulatory Networks from Multifactorial Perturbations Using Graphical Lasso: Application to the DREAM4 Challenge

A major challenge in the field of systems biology consists of predicting gene regulatory networks based on different training data. Within the DREAM4 initiative, we took part in the multifactorial sub-challenge that aimed to predict gene regulatory networks of size 100 from training data consisting of steady-state levels obtained after applying multifactorial perturbations to the original in silico network. Due to the static character of the challenge data, we tackled the problem via a sparse Gaussian Markov Random Field, which relates network topology with the covariance inverse generated by the gene measurements. As for the computations, we used the Graphical Lasso algorithm which provided a large range of candidate network topologies. The main task was to select the optimal network topology and for that, different model selection criteria were explored. The selected networks were compared with the golden standards and the results ranked using the scoring metrics applied in the challenge, giving a better insight in our submission and the way to improve it. Our approach provides an easy statistical and computational framework to infer gene regulatory networks that is suitable for large networks, even if the number of the observations (perturbations) is greater than the number of variables (genes).


Introduction
Traditional methods where one gene or one chemical reaction was studied at a time, have taken step to more sophisticated ones, which try to elucidate the complex machinery connecting all the biochemical reactions happening in a cell. Advanced data collection techniques are able to produce a great variety of data that aim to be the vehicle to better understand the processes within a cell. Development of statistical and mathematical methodology to study such data plays a key role to elucidate and model the mechanisms behind the cell biochemical complex architecture. In particular, it is of great interest to represent the cell biochemistry into networks that mimic the chemical reactions taking place in the cell.
The DREAM project [1,2], acronym for Dialogue on Reverse Engineering Assessment and Methods, is an initiative that tries to motivate the systems biology community to investigate and develop methodologies that translate biochemical processes into gene regulatory networks, by challenging the participants to infer network structure from some given in silico gene expression data sets. This in silico data were generated by the GeneNetWeaver tool version 2.0 [3] based on the ideas in [4]. The multifactorial subchallenge, posted in the DREAM4 initiative web page [5] aimed to reverse engineer five gene regulatory networks of size 100 with an experimental scenario assuming that extensive knockout/ knockdown or time series experiments, could not be performed.
The data for this multifactorial sub-challenge consisted of measurements of steady-state levels of the network, which were obtained by applying 100 multifactorial perturbations to the original network. These steady-state level measurements intrinsically do not give information about the regulatory network dynamics, but about the system equilibrium once it has recovered after the intervention or perturbation.
Given the steady-state nature of the multifactorial sub-challenge data, we focused on Gaussian Markov Random Field theory [6] that leads to the estimation of undirected graphical models [7]. Understanding the topology of a gene regulatory network is equivalent to know which are the connections between the genes involved in the network summarized in the adjacency matrix, that represents the web of connections between the genes of the network.
Gaussian Markov Random Fields theory (GMRF) relates the inverse of the process covariance matrix, described by the elements of the network, in our case a set of genes, with the adjacency matrix that describes the topology of the network. If the (i,j) th element of the covariance inverse matrix is zero, then variables i and j are conditionally independent given the others and do not have an edge in the network. Due to the symmetric nature of inverse covariance matrix the estimated network topology is undirected.
This relation between the covariance inverse and the adjacency matrix links GMRF theory with graphical models, so extending the graphical models provided by relevance networks [8]. The graphical Lasso algorithm [9] is an appealing, new approach to estimate the process covariance inverse and thus appeared very suitable to provide the gene regulatory network under the GMRF umbrella. The graphical Lasso computes the covariance inverse matrix by applying an L 1 penalty to the GMRF loglikelihood [9,10], as in the regular lasso [11]. The L 1 penalty is the sum of the absolute values of the entries of the covariance inverse and due to the geometry of this penalty, the resulting covariance inverse contains entries being exactly zero. The corresponding network is thus sparse. This is an attractive feature of the graphical Lasso, as many of the cell metabolic or enzymatic process networks are known to be sparse [12]. Networks which are very densely connected are unlikely to represent the true biochemical processes within a cell.

Data sets
The data provided in the multifactorial sub-challenge of DREAM4, consisted of in silico networks of gene expression measurements of steady-state levels, obtained by applying 100 different multifactorial perturbations to the original network, containing in total 100 genes. The multifactorial perturbations were induced by slightly increasing or decreasing the basal activation of all the genes in the network simultaneously by different random amounts [5]. If we think of the data in a matrix format, the data set for each network (Fig. 1) consists of a matrix with 100 rows and 100 columns. Each row of this matrix contains the 100 genes expression measurements for the network for a given perturbation, and each column stores the expression levels for a given gene for all the perturbations. From the data matrix we can compute the sample covariance matrix of the gene expression measurements, but this will be a poor estimate of the true covariance matrix S because of the low number of perturbations.

Gaussian Markov Random Fields (GMRF)
With the aim of predicting the network structures in the multifactorial sub-challenge and considering that the only available data consisted of static records (i.e steady-state levels), it seemed reasonable to tackle the problem by a Gaussian Markov Random Field [6]. A Gaussian Markov Random Field (GMRF) consists of a finite set of random vectors x~(x 1 , . . . ,x p )[< p g that have a multivariate Gaussian probability density function with mean vector m~(m 1 , . . . ,m p ) and p|p covariance matrix S. The multivariate Gaussian character of the set x gives the Gaussian random field its name. In our case p is the number of genes and x i (i~1, . . . ,p) is a random variable representing the in silico gene expression measurements for node ( = gene) i of the network. We consider the rows of the data matrix as a random sample of size n~100 of x.
The Markov adjective is for the Markovian global, local and pairwise conditional independencies which describe the relationships between the elements of the GMRF network [13]. These three types of conditional independency are equivalent and govern the factorization of the joint probability distribution [14]. In particular, the independence of two random variables x i ,x j given the rest (i.e. the pairwise conditional independence) implies that the corresponding entry in the covariance inverse (S {1 ) ij is zero, which indicates the non-existence of an edge between variables x i and x j in the network. Consequently, pairwise conditional independence plays a fundamental role in network reconstruction since it provides information about the existence of edges between any pair of elements of the GMRF. Due to the symmetry of S {1 , the inferred networks will be undirected. The graphical representation of a GMRF consists of an undirected graph that is defined as tuple G~(n, E) of a set of nodes n or genes in our case, and a set of edges, e, that describe the connections between the nodes or genes of the network.
In summary, estimating an undirected gene regulatory network graph is analogous to estimating the pairwise conditional independencies between the genes and, in our GMRF approach, is analogous to finding the zero entries of the inverse covariance matrix of the genes in the network. The covariance inverse H~S {1 is also known as the precision matrix.
In a GMRF the conditional mean of x i given the rest (x {i ) is linear in the measurements at the other nodes: which has the same form as a multiple linear regression of x i on and depends only on the variables/nodes that are connected to x j .

Graphical Lasso Algorithm
The sample covariance matrix S of the gene expression measurements is a poor estimate of the true covariance matrix S because of the low number of perturbations; its inverse, when it exists, will be dense. Equation (2) suggests that it is possible to learn about the dependence of x i on x {i via (penalized) multiple linear regression [15,16], in particular via the lasso [11] to obtain sparsity. However, things are a bit more complicated as x i appears not only as response variable as in equation (2), but also as predictor in the equations for j=i.
A both rigorous and efficient solution is the graphical lasso [9]. This maximizes the L 1 penalized loglikelihood l of the GMRF [10], defined by with respect to the precision matrix H~Ŝ S {1 . Here, EHE 1 is the L 1 norm of H, that is the sum of absolute values of the elements of H, and r is a penalty that governs the sparsity of the network. In practice, this optimization problem is carried out for a series of r values, resulting in a series of networks that vary from very dense networks for low values for r to very sparse networks for high values ( Fig. 2A). We now present a derivation of the graphical lasso algorithm [9], ending with an intuitive view of it. The gradient equation of the graphical lasso problem is [9,10] where W is the estimate of S, i.e. W~H {1 . It is possible to solve this gradient equation (4), in an iterative block descendant fashion [10], by considering the partition of the GMRF x~(P 1 ,P 2 ) into the two groups P 1~f x 1 , . . . ,x p{1 g and P 2~xp . The corresponding partition of H and its inverse W is where the column vector w 12 contains the marginal covariances between x p and the other elements in the GMRF x. We partition S correspondingly. Friedman and coauthors [9] showed that for each given partition and W 11 , equation (4) can be solved for w 12 and w 22 by a fast, regular lasso algorithm. The loglikelihood (3) is then maximized by considering all the possible partitions P 1~f x {i g, P 2~xi of the GMRF x in turn and by iterating this process (Table 1). A key element is that, after w 12 and w 22 are calculated, they are inserted in the full W before a new partition is created. The matrix W 11 , for a given partition, thus changes across iterations, until convergence.
We now show how, for a given partition, w 12 , w 22 and the corresponding covariance inverse estimates can be obtained by a regular lasso algorithm. For a given partition, the partitioned version of equation (4) yields and also so that w 22~s22 zr. Equation (7) is less easy to solve as we do not know the sign of H 12 yet. But, as W~H {1 , the partitioned version of gives so that The sign of H 12 is thus opposite to the sign of b:W {1 11 w 12 since H 22 w0. Rewriting equation (7) in terms of b on using equation (11) gives Equation (12) can be recognized as the gradient equation of the lasso problem [11] y~Xbze, s:t EbE 1 ƒt ð13Þ with s 12 and W 11 replacing X T y and X T X, respectively. Each individual problem (12) is solved by coordinate descent [9,17].
The graphical lasso algorithm (Table 1) can thus intuitively be viewed as a set of coupled lasso regression problems that share the same W and H~W {1 . Table 1 summarizes the algorithm.

Network selection
The efficiency of the Graphical Lasso algorithm allows to compute a great variety of network topologies just by evaluating a grid of penalty values r. Since the parameter r is responsible for the network sparsity, it is of particular interest to find which is the optimal estimated network in terms of this parameter. This problem of optimal network selection is equivalent to that of  traditional model selection. We would want to have a model selection method that enables to select the adjacency matrix that best predicts the true topology. The main challenge here, is not only to discover the best network in terms of prediction accuracy, but to find a trade off between network sparsity and prediction accuracy in the hope to get closer to the true network. There are many model selection techniques. Cross validation [18], based on the performance of the estimated network into a test data set, is one of the most widely used. However, cross validation does not take into account the complexity of the selected network. With the goal in mind of finding sparse networks, we decided to minimize the Bayesian information criterion [19] and, just for the record, the Akaike criterion [20] BIC(m)~{2Lzln(n)ed(m) ð14Þ where L corresponds to the log-likelihood of model m (l(H) in equation (3) without the penalty term), ed(m) is the effective model dimension, here the number of non-zero edges in the network corresponding to model m [21][22][23], and n is the number of observations (perturbations). The BIC best enjoys the fame of being a trade off between model prediction and model complexity and thus to select sparser networks than those chosen by AIC or by cross validation. This is illustrated by the arrows in Fig. 2A showing the values of r that minimize the BIC and AIC for data set 3.
In the submission to the DREAM4 challenge, each network had to consist of a ranked list of regulatory links ordered according to their confidence value. In the original description, links had to be directed but, as directionality is difficult to detect without experimental interventions, we consider here only undirected links. In our submission, we determined the confidence value of a link (edge) in a rather ad-hoc fashion as follows. We first set a series of 100 equispaced values in terms of log(r) (Fig. 2). For each value of r, we then calculated the covariance inverse and associated BIC value. We then ordered the covariance inverses according to BIC, selected the 50 best ones and converted each to a network, i.e. adjacency matrix A~(abs(Hw0)), yielding 50 networks. The assigned confidence of an edge was the number of networks in which the edge was present divided by 50, the rationale being that we are more confident about an edge if it appears in more networks. As the resulting ranked network uses an ensemble of 50 networks, we term it the Ensemble network. In hindsight, we feel that the procedure is rather ad-hoc as it depends on the selected range of penalty values and the fraction of networks used in the ranking, which are both rather arbitrary. For that reason, we evaluated for this paper also the best BIC (and AIC) network per data set. For this evaluation we ordered the edges of the network according to the absolute value of their covariance inverse entry (abs(H ij )), the rationale being that the size of H ij is traded against the loglikelihood of the network in equation (3) and thus has at least some statistical meaning. The network ranked in this way is termed the BIC (AIC) network. Selecting a particular r matters, because the models obtained via the graphical lasso algorithm for a grid of different penalties are not necessarily nested and therefore an edge that appears in a network with high r value can disappear when a smaller r is considered. The corresponding entries of the covariance inverse may change non-monotonically with r. A referee suggested another ranking scheme, in which a given edge is ranked according to the maximal value of the penalty r for which the putative edge is present in the predicted network. The network ranked in this way is termed MAX r.

Post-hoc network validation
After submission, the true networks were released and it is thus possible to evaluate each submitted network according to the true one. Because of the confidence rating of edges, each submitted network is not just a single network but a ranked list of networks, containing from one to many edges, depending on the required confidence for an edge to exist and the total list size. For each given confidence threshold, the resulting network can be evaluated and compared with the golden standard, as follows.
Given two nodes in a network x i and x j , the edge prediction problem can have four possible outcomes when compared with the true network: (i) if the edge occurs in both the true and the predicted network, the prediction is called a true positive (TP), (ii) if the edge is predicted but does not occur in the true network, it is a false positive (FP), (iii) if the edge does neither occur in the true network nor in the predicted one, it is a true negative (TN), (iv) if the edge occurs in the true network, but is not predicted, it is a false negative (FN). Once the TP,TN,FN,FP events are counted, it is possible to calculate True Positive Rate (TPR) and False Positive Rate (FPR) and precision and recall [24][25][26] Precision~T P TPzFP , Precision is a measure of the exactness or fidelity of the network forecast, recall ( = TPR) is a measure of completeness, whereas FPR is the statistical Type I error (false alarm). In the words of [27], ''Precision may be defined as the probability that an object is relevant given that it is returned by the system, while the recall is the probability that a relevant object is returned''.
By sliding the confidence threshold, the pairs (TPR, FPR) and (precision, recall) give rise to the Receiver Operating Characteristic (ROC) curve and Precision-Recall (PR) curve, respectively (Fig. 3). Popular overall measures of performance are then the Area Under the ROC and PR curves (AUROC and AUPR, respectively). The challenge organizers provided three more performance measures based on the P-value. The P-value is ''the probability that a given or larger area under the curve value is obtained by random ordering of the T potential network links. Distributions for AUROC and AUPR were estimated from 100,000 instances of random network link permutations.'' The overall P-value is then the geometric mean of the P-values of the individual data sets and the associated score is {log10(P). This score is calculated for both AUROC and AUPR and the two values are averaged to obtain the overall score.
We also investigated how well the graphical lasso could have done once we know the true networks. For this we determined the r values maximizing AUROC and AUPR, yielding the AUROC and AUPR networks. The gap between the maximum AUROC and AUPR and those of the AIC, BIC and Ensemble networks indicates how much the results could have improved if we would have an ideal method of penalty selection.

Results
The goal of the multifactorial sub-challenge was to reverse engineer five gene regulatory networks from training data consisting of steady-states levels of variation of the original networks, obtained after applying multifactorial perturbations to the system. The type of training data (only steady state, neither time series nor knockout/knockdown nor any other intervention data) motivated our choice for the GMRF approach to solve the problem in question.
The network topology was estimated by setting the edges to correspond to the nonzero elements of the estimated covariance inverse matrix H. This covariance inverse was estimated from the training data by maximizing the penalized likelihood (3). The graphical lasso algorithm performed the computations in a very efficient and fast fashion, making it possible to compute the best covariance inverse for a series of 100 r values within 60 seconds per data set on a laptop. Fig. 2A shows for data set 3 how the number of nonzero elements in the covariance inverse decreases with increasing penalty r and also indicates the values of r that minimize (from left to right) AIC, MAX-AUROC, MAX-AUPR and BIC. It also shows the range of the 50 best r values used in the Ensemble network. As expected BIC yielded a much sparser network (ca. 500 edges) than AIC and Ensemble (both 5000 edges), whereas the true number of edges was 192 in network 3. The methods that select r knowing the truth (MAX-AUROC and MAX-AUPR) produce networks that have both more edges than the network selected by BIC ( Fig. 2A). Fig. 2B shows that r values minimizing BIC vary little across the five data sets. Fig. 3 shows the ROC and PR curves for the different r selection methods, averaged across the five data sets, while Table 2 shows the performance numerically. Fig. 3 and Table 2 show that AIC networks produced very poor results and that the Ensemble and MAX r networks performed remarkably similar. BIC performed better than these in terms of AUPR score, but worse in terms of AUROC score. The network selected on the basis of maximum AUPR was better in terms of AUPR score, and about equal to Ensemble in terms of AUROC score. The network selected on the basis of maximum AUROC was only slightly better in terms of AUROC, and about equal to BIC in terms of AUPR score. The ensemble network that we submitted ranked fifth on the overall score. The modifications we investigated afterwards gave only modest improvement, as the different r selection methods gave very similar results, except for the AIC method. Overall, BIC would have done slightly better, as it can be seen in Table 2. The winning team in the undirected 100 multifactorial sub-challenge had 37.3 and the team just above us had an overall score of 27.99. The worst score was close to 0.
Furthermore, we studied the performance of the presented methodology with only half of the 100 perturbations. The results show for all the methods a decrease in the overall scores of about 20 percent (Table 3).
We also compared our approach with simple correlation networks, both for the full data (n = 100) and half the data (n = 50). Correlation networks were obtained by connecting two genes with an edge if the absolute value of their correlation was higher than a predefined threshold. The ranking of the edges was done according to the absolute value of the correlations. Table 4 shows that the performance depends somewhat on the threshold with the highest scores for threshold 0, that is, the relevance network (REL. in Table 4). The overall scores for the relevance networks (31.64 for n = 100, and 26.30 for n = 50) are higher than those obtained with the graphical lasso ( Table 2 and Table 3). The overall score for n~100 is equal to that of the second team in the multifactorial sub-challenge. The only two places were the graphical lasso wins over the relevance network are the precision at 1% and 10% recall for the full data. This advantage is then lost again for half the data.

Discussion
We used a GMRF framework to tackle the problem of reverse engineering of regulatory networks based on data from random multifactorial perturbations, as posted in the DREAM4 challenge. The graphical lasso algorithm was used to compute the network topologies offering a very fast and easy computational set up, to provide a large range of candidate network topologies. This subchallenge consisted of inferring directed networks, however, with the static nature of the provided training data, we believe that it is very complex to infer directionality or similarly causal relationships, and therefore we focused on the estimation of undirected networks which motivated the selection of our approach to tackle the problem.
We submitted networks with edge ranking based on edge frequency in an ensemble of the 50 best (out of 100) BIC networks. This ranked network turned out to perform very similar to MAX r and similar to the single networks selected by BIC and ranked with confidence value proportional to the absolute value of the entry in the corresponding covariance inverse. The ensemble and MAX r networks contained a larger number of edges in than the networks selected by BIC. Other ways to construct ensembles might perhaps improve the quality of the predictions and are a topic of further research.
The similarity between the ensemble and MAX r networks goes beyond their performance in Tables 2 and 3. The Spearman correlation between the confidence values, averaged across the five networks, was 0.97 and their ranked networks are therefore very similar. An explanation is that an edge that comes in early (at a high penalty value) gets high confidence in the MAX r method, is likely to stay in the model for a large range of smaller r values and thus occurs in many of the networks we consider. Such an edge is likely to get a high rank in the ensemble method as well, because this method assigns the rank on the basis of the number of networks (among the 50 best BIC networks, see Fig. 2A) in which the edge occurs. Vice versa, a edge that comes in late (at a low penalty value, so that it gets low confidence in the MAX r method) cannot occur in many of the 50 best networks and will thus receive low confidence in the ensemble method.
In our approach we assumed multivariate normality, that is normality of all marginal and conditional distributions of the measurements and, related to this, linearity between the conditional mean expression of a gene and the expression levels of its neighboring genes (equation (2)). These are strong assumptions, which are unlikely to hold true exactly. With few observations, these assumptions are hard to check. Q-Q plots, made assuming a sparse covariance inverse, did not show gross deviations from normality. A log-transformation of the measurements did not improve performance. The small data set size requires a simple model to produce reasonable results. Simplicity and speed are the key features of our approach.
Our study contributes to a better understanding of the properties and performance of the graphical lasso algorithm to estimate undirected networks. We showed that the method also  works when the number of genes is larger than the number of perturbations. However, in this challenge relevance networks have shown a better performance, both for the full data and for half the data. For networks containing cliques that are locally dense, correlation networks might have an advantage compared to the sparsity imposed by the graphical Lasso algorithm with a single penalty term, as used in this study.