Conceived and designed the experiments: PM YK CJFtB FvE. Performed the experiments: PM YK CJFtB FvE. Analyzed the data: PM YK CJFtB FvE. Contributed reagents/materials/analysis tools: PM YK CJFtB FvE. Wrote the paper: PM YK CJFtB FvE.

The authors have declared that no competing interests exist.

A major challenge in the field of systems biology consists of predicting gene regulatory networks based on different training data. Within the DREAM4 initiative, we took part in the multifactorial sub-challenge that aimed to predict gene regulatory networks of size 100 from training data consisting of steady-state levels obtained after applying multifactorial perturbations to the original

Due to the static character of the challenge data, we tackled the problem

Our approach provides a simple statistical and computational framework to infer gene regulatory networks that is suitable for large networks, even if the number of observations (perturbations) is smaller than the number of variables (genes).

Traditional methods, in which one gene or one chemical reaction was studied at a time, have given way to more sophisticated ones that try to elucidate the complex machinery connecting all the biochemical reactions happening in a cell. Advanced data collection techniques produce a great variety of data intended to help us better understand the processes within a cell. The development of statistical and mathematical methodology to study such data plays a key role in elucidating and modeling the mechanisms behind the complex biochemical architecture of the cell. In particular, it is of great interest to represent the cell's biochemistry as networks that mimic the chemical reactions taking place in the cell.

The DREAM project

Given the steady-state nature of the multifactorial sub-challenge data, we focused on Gaussian Markov Random Field theory

Gaussian Markov Random Field (GMRF) theory relates the inverse of the covariance matrix of the process defined on the elements of the network, in our case a set of genes, to the adjacency matrix that describes the topology of the network. If the

This relation between the covariance inverse and the adjacency matrix links GMRF theory with graphical models, so extending the graphical models provided by relevance networks

The data provided in the multifactorial sub-challenge of DREAM4 consisted of

Visualization of the gene levels for all the perturbations ordered according to the first principal component.

With the aim of predicting the network structures in the multifactorial sub-challenge and considering that the only available data consisted of static records (i.e., steady-state levels), it seemed reasonable to tackle the problem by a Gaussian Markov Random Field

with mean vector

The

In summary, estimating an undirected gene regulatory network graph is analogous to estimating the pairwise conditional independencies between the genes and, in our GMRF approach, is analogous to finding the zero entries of the inverse covariance matrix of the genes in the network. The covariance inverse

In a GMRF the conditional mean of

which has the same form as a multiple linear regression of

The sample covariance matrix

A solution that is both rigorous and efficient is the graphical lasso

A: Number of edges versus penalty for data set 3 in the multifactorial challenge with down arrows indicating the chosen

We now present a derivation of the graphical lasso algorithm

Friedman and coauthors

Graphical Lasso algorithm |

We now show how, for a given partition,

and also

so that

gives

so that

The sign of

Equation (12) can be recognized as the gradient equation of the lasso problem

with

The graphical lasso algorithm (

The efficiency of the graphical lasso algorithm makes it possible to compute a great variety of network topologies just by evaluating a grid of penalty values

where

In the submission to the DREAM4 challenge, each network had to consist of a ranked list of regulatory links ordered according to their confidence value. In the original description, links had to be directed but, as directionality is difficult to detect without experimental interventions, we consider here only undirected links. In our submission, we determined the confidence value of a link (edge) in a rather ad-hoc fashion as follows. We first set a series of 100 equispaced values in terms of

After submission, the true networks were released, so each submitted network can be evaluated against the true one. Because of the confidence rating of edges, each submitted network is not just a single network but a ranked list of networks, containing from one to many edges, depending on the required confidence for an edge to exist and the total list size. For each given confidence threshold, the resulting network can be evaluated and compared with the gold standard, as follows.

Given two nodes in a network

and

Precision measures the exactness or fidelity of the network forecast and recall (= TPR) measures its completeness, whereas FPR is the statistical Type I error rate (false alarm rate). In the words of

By sliding the confidence threshold, the pairs (TPR, FPR) and (precision, recall) give rise to the Receiver Operating Characteristic (ROC) curve and Precision-Recall (PR) curve, respectively (

The ROC and PR curves (Ensemble, AIC, BIC, MAX_AUPR and MAX_AUROC) are vertical averages of the curves for the five data sets.
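As a minimal illustration of the evaluation measures described above, the following Python sketch walks down a ranked list of undirected edges and records precision, recall (TPR) and FPR at each cut-off. The function names and the four-gene toy example are hypothetical, and the trapezoidal area is a simplification that need not coincide with the official DREAM4 scoring.

```python
def pr_roc_points(ranked_edges, true_edges, n_nodes):
    """Precision, recall (TPR) and FPR after each edge of a ranked list.

    ranked_edges: undirected edges (i, j), most confident first.
    true_edges:   undirected edges of the gold-standard network.
    """
    truth = {tuple(sorted(e)) for e in true_edges}
    n_pairs = n_nodes * (n_nodes - 1) // 2      # all possible undirected edges
    n_pos, n_neg = len(truth), n_pairs - len(truth)
    tp = fp = 0
    precision, recall, fpr = [], [], []
    for edge in ranked_edges:
        if tuple(sorted(edge)) in truth:
            tp += 1                              # predicted edge is in the truth
        else:
            fp += 1                              # predicted edge is absent
        precision.append(tp / (tp + fp))
        recall.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return precision, recall, fpr


def trapezoid_area(x, y):
    """Area under a piecewise-linear curve through the points (x, y)."""
    return sum((x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2 for i in range(1, len(x)))


# Toy example (hypothetical): 4 genes, 2 true edges, 4 ranked predictions.
truth = [(0, 1), (1, 2)]
ranked = [(0, 1), (2, 3), (2, 1), (0, 3)]
precision, recall, fpr = pr_roc_points(ranked, truth, n_nodes=4)
# Area under the part of the ROC curve covered by the submitted list.
partial_auroc = trapezoid_area([0.0] + fpr, [0.0] + recall)
print(precision, recall, fpr, partial_auroc)
```

Because a submitted list need not contain every possible edge, the curve traced here may stop before reaching the point (1, 1); how the remainder is handled is a scoring convention that this sketch does not attempt to reproduce.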

We also investigated how well the graphical lasso could have done once we know the true networks. For this we determined the

The goal of the multifactorial sub-challenge was to reverse engineer five gene regulatory networks from training data consisting of steady-state levels of variations of the original networks, obtained after applying multifactorial perturbations to the system. The type of training data (only steady state, with neither time series nor knockout/knockdown nor any other intervention data) motivated our choice of the GMRF approach to solve the problem in question.

The network topology was estimated by setting the edges to correspond to the nonzero elements of the estimated covariance inverse matrix

Measures | Ensemble | AIC | BIC | MAX-AUPR | MAX-AUROC | MAX
 | 0.23(0.04) | 0.05(0.01) | 0.26(0.06) | 0.28(0.06) | 0.26(0.08) | 0.23(0.03)
 | 0.67(0.05) | 0.58(0.02) | 0.65(0.04) | 0.68(0.04) | 0.69(0.04) | 0.68(0.04)
 | 0.84(0.26) | 0.18(0.27) | 1.00(0.00) | 0.82(0.25) | 0.79(0.32) | 0.81(0.27)
 | 0.66(0.13) | 0.06(0.02) | 0.83(0.14) | 0.83(0.13) | 0.79(0.32) | 0.65(0.13)
 | 0.10(0.04) | 0.05(0.01) | 0.07(0.02) | 0.10(0.04) | 0.11(0.04) | 0.11(0.04)
 | 0.05(0.00) | 0.04(0.01) | 0.05(0.01) | 0.05(0.01) | 0.05(0.01) | 0.05(0.01)
 | 36.58 | 2.25 | 43.00 | 47.30 | 43.55 | 35.29
 | 11.19 | 3.49 | 8.52 | 11.26 | 12.80 | 11.79
 | 23.89 | 2.87 | 25.76 | 29.28 | 28.17 | 23.54
 | − | −6.00(0.00) | −2.38(0.08) | −2.60(0.07) | −2.94(0.34) | −

Furthermore, we studied the performance of the presented methodology with only half of the 100 perturbations. For all methods except AIC, whose overall score increased, the results show a decrease in the overall scores of roughly 20 to 30 percent (

Measures | Ensemble | AIC | BIC | MAX-AUPR | MAX-AUROC | MAX
 | 0.18(0.04) | 0.19(0.05) | 0.21(0.05) | 0.22(0.06) | 0.22(0.06) | 0.18(0.03)
 | 0.64(0.05) | 0.63(0.04) | 0.64(0.04) | 0.63(0.04) | 0.64(0.05) | 0.64(0.05)
 | 0.83(0.24) | 0.82(0.26) | 0.82(0.26) | 0.92(0.18) | 0.89(0.26) | 0.83(0.24)
 | 0.53(0.17) | 0.64(0.23) | 0.69(0.16) | 0.73(0.14) | 0.70(0.16) | 0.53(0.16)
 | 0.07(0.02) | 0.07(0.02) | 0.07(0.02) | 0.07(0.02) | 0.07(0.02) | 0.07(0.02)
 | 0.05(0.01) | 0.04(0.01) | 0.04(0.01) | 0.04(0.01) | 0.05(0.01) | 0.05(0.01)
 | 26.83 | 28.07 | 32.51 | 35.55 | 34.32 | 26.58
 | 7.99 | 7.29 | 7.69 | 7.60 | 8.39 | 8.02
 | 17.41 | 17.68 | 20.10 | 21.57 | 21.36 | 17.30
 | − | −3.00(0.00) | −2.72(0.27) | −2.41(0.15) | −2.66(0.23) | −

We also compared our approach with simple correlation networks, both for the full data (n = 100) and half the data (n = 50). Correlation networks were obtained by connecting two genes with an edge if the absolute value of their correlation was higher than a predefined threshold. The ranking of the edges was done according to the absolute value of the correlations.
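The correlation-network baseline described above can be sketched in a few lines of Python. This is an illustration under our own assumptions rather than the code used for the reported results: NumPy is assumed to be available, the data matrix is assumed to have one row per perturbation and one column per gene, and the function names and toy data are hypothetical.

```python
import numpy as np


def rank_edges_by_correlation(X):
    """Rank all gene pairs by absolute Pearson correlation.

    X: array with one row per perturbation and one column per gene.
    Returns a list of (gene_i, gene_j, |correlation|), strongest first.
    """
    corr = np.corrcoef(X, rowvar=False)
    i, j = np.triu_indices(corr.shape[0], k=1)   # each undirected pair once
    score = np.abs(corr[i, j])
    order = np.argsort(-score, kind="stable")
    return [(int(i[k]), int(j[k]), float(score[k])) for k in order]


def correlation_network(X, threshold):
    """Edges whose absolute correlation exceeds the threshold."""
    return [(i, j) for i, j, s in rank_edges_by_correlation(X) if s > threshold]


# Toy data (hypothetical): 5 perturbations, 3 genes.
X = np.array([[1.0, 5.0, 1.0],
              [2.0, 4.0, -1.0],
              [3.0, 3.0, 0.0],
              [4.0, 2.0, -1.0],
              [5.0, 1.0, 1.0]])
ranking = rank_edges_by_correlation(X)
edges = correlation_network(X, threshold=0.5)
```

In the toy data the first two genes are perfectly anti-correlated and the third is uncorrelated with both, so only the pair (0, 1) passes the threshold. The sign of the correlation plays no role in the ranking.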

Measures | | | | | |
 | 0.26(0.06) | 0.19(0.07) | 0.05(0.02) | 0.24(0.06) | 0.20(0.06) | 0.06(0.01)
 | 0.74(0.02) | 0.61(0.04) | 0.51(0.01) | 0.70(0.04) | 0.61(0.03) | 0.51(0.00)
 | 0.73(0.30) | 0.73(0.30) | 0.38(0.43) | 0.84(0.36) | 0.84(0.36) | 0.84(0.36)
 | 0.67(0.18) | 0.67(0.17) | 0.05(0.01) | 0.73(0.15) | 0.73(0.15) | 0.05(0.01)
 | 0.16(0.04) | 0.06(0.01) | 0.04(0.01) | 0.11(0.03) | 0.06(0.01) | 0.04(0.01)
 | 0.06(0.00) | 0.04(0.01) | 0.04(0.01) | 0.05(0.01) | 0.04(0.01) | 0.04(0.01)
 | 45.53 | 30.75 | 2.30 | 39.12 | 32.02 | 3.24
 | 17.75 | 5.63 | 0.45 | 13.48 | 5.84 | 0.52
 | 31.64 | 18.19 | 1.38 | 26.30 | 18.93 | 1.88

We used a GMRF framework to tackle the problem of reverse engineering regulatory networks from data on random multifactorial perturbations, as posed in the DREAM4 challenge. The graphical lasso algorithm was used to compute the network topologies; its fast and simple computational setup provides a large range of candidate network topologies. The sub-challenge asked for directed networks. However, given the static nature of the provided training data, we believe that it is very difficult to infer directionality or, similarly, causal relationships. We therefore focused on the estimation of undirected networks, which motivated our choice of approach.
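To make the graphical lasso step more concrete, the following NumPy sketch implements the block coordinate descent scheme of Friedman and coauthors and reads off the edges as the nonzero off-diagonal entries of the estimated covariance inverse. It is a didactic sketch, not the implementation used for the submission: the function names, the symbols `S` (sample covariance) and `rho` (penalty), the tolerances and the 3-gene toy matrix are all our own assumptions.

```python
import numpy as np


def soft_threshold(x, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    return np.sign(x) * max(abs(x) - t, 0.0)


def graphical_lasso(S, rho, max_sweeps=100, tol=1e-6):
    """L1-penalised estimate of the inverse covariance matrix.

    S: sample covariance matrix; rho: penalty (both names are illustrative).
    """
    p = S.shape[0]
    W = S + rho * np.eye(p)              # current covariance estimate
    B = np.zeros((p, p))                 # column j: lasso coefficients for gene j
    for _ in range(max_sweeps):
        W_old = W.copy()
        for j in range(p):
            idx = np.arange(p) != j      # all genes except gene j
            W11, s12 = W[np.ix_(idx, idx)], S[idx, j]
            beta = B[idx, j]
            for _ in range(100):         # inner lasso solved by coordinate descent
                beta_old = beta.copy()
                for k in range(p - 1):
                    r = s12[k] - W11[k] @ beta + W11[k, k] * beta[k]
                    beta[k] = soft_threshold(r, rho) / W11[k, k]
                if np.max(np.abs(beta - beta_old)) < tol:
                    break
            B[idx, j] = beta
            W[idx, j] = W[j, idx] = W11 @ beta
        if np.max(np.abs(W - W_old)) < tol:
            break
    theta = np.zeros((p, p))             # recover the covariance inverse from W and B
    for j in range(p):
        idx = np.arange(p) != j
        theta[j, j] = 1.0 / (W[j, j] - W[idx, j] @ B[idx, j])
        theta[idx, j] = -B[idx, j] * theta[j, j]
    return (theta + theta.T) / 2         # symmetrise against rounding error


def edge_count(theta, eps=1e-8):
    """Number of undirected edges: nonzero pairs above the diagonal."""
    return int(np.sum(np.abs(np.triu(theta, k=1)) > eps))


# Toy covariance for 3 genes (hypothetical) evaluated on a small penalty grid.
S = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.3],
              [0.0, 0.3, 1.0]])
for rho in (0.05, 0.2, 0.5, 0.9):
    print(rho, edge_count(graphical_lasso(S, rho)))
```

Evaluating the function on a grid of penalties, as in the loop at the end, yields one candidate topology per penalty value. Once the penalty is at least as large as every absolute off-diagonal entry of `S`, the estimate becomes diagonal and the network is empty.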

We submitted networks with edge ranking based on edge frequency in an ensemble of the 50 best (out of 100) BIC networks. This ranked network turned out to perform very similar to MAX

The similarity between the ensemble and MAX

In our approach we assumed multivariate normality, that is, normality of all marginal and conditional distributions of the measurements and, related to this, linearity between the conditional mean expression of a gene and the expression levels of its neighboring genes (equation (2)). These are strong assumptions that are unlikely to hold exactly and, with few observations, they are hard to check. Q-Q plots, made assuming a sparse covariance inverse, did not show gross deviations from normality, and a log-transformation of the measurements did not improve performance. The small data set size requires a simple model to produce reasonable results; simplicity and speed are the key features of our approach.

Our study contributes to a better understanding of the properties and performance of the graphical lasso algorithm for estimating undirected networks. We showed that the method also works when the number of genes is larger than the number of perturbations. However, in this challenge relevance (correlation) networks performed better, both for the full data and for half the data. For networks containing locally dense cliques, correlation networks might have an advantage over the sparsity imposed by the graphical lasso algorithm with a single penalty term, as used in this study.

The authors would like to thank Alex Lenkoski and an anonymous referee for their constructive comments and suggestions that have helped to improve the manuscript.