The authors have declared that no competing interests exist.
Conceived and designed the experiments: HC BB JP. Performed the experiments: HC. Analyzed the data: HC. Wrote the paper: HC BB JP.
Reverse-engineering of biological networks is a central problem in systems biology. Intervention data, such as gene knockouts or knockdowns, are typically used to tease apart causal relationships among genes. Under time or resource constraints, one needs to choose carefully which intervention experiments to carry out. Previous approaches for selecting the most informative interventions have largely focused on discrete Bayesian networks. However, continuous Bayesian networks are of great practical interest, especially in the study of complex biological systems and their quantitative properties. In this work, we present an efficient, information-theoretic active learning algorithm for Gaussian Bayesian networks (GBNs), which serve as important models for gene regulatory networks. In addition to providing linear-algebraic insights unique to GBNs, which lead to significant runtime improvements, we demonstrate the effectiveness of our method on data simulated with GBNs and on the DREAM4 network inference challenge data sets. Compared with random selection of intervention experiments, our method generally leads to faster recovery of the underlying network structure and faster convergence to the final distribution of confidence scores over candidate graph structures obtained using the full data.
Molecules in a living cell interact with each other in a coordinated fashion to carry out important biological functions. Building a rich network of these interactions can greatly facilitate our understanding of human diseases by providing useful mechanistic interpretations of various phenotypes. Recent advances in high-throughput technologies have given rise to numerous algorithms for reverse-engineering interaction networks from molecular observations, as these technologies provide an efficient and systematic way of measuring the molecular state of a large number of genes. One class of interaction networks that has generated much interest in recent years is transcriptional gene regulatory networks, which specify the set of genes that influence a given gene's expression level. Such regulatory relationships can be naturally modeled as a causal graph or Bayesian network.
Bayesian networks provide a compact way of representing causal relationships among random variables [
A key insight behind active learning is that not every variable is equally informative when intervened upon. For instance, if
Several researchers have developed active learning frameworks for causal structure learning during the last decade. In the Bayesian setting, Tong and Koller [
In this paper, we derive an efficient active learning algorithm for biological networks based on the framework of Murphy [
Let
A standard approach to inferring Bayesian network structure from data involves defining a score that reflects how well a given graph explains the data and searching for high-scoring graphs in the space of DAGs or causal node orderings. Typically, a Markov chain Monte Carlo (MCMC) method based on random walks is used to explore the space of candidate graph structures and to select the highest-scoring graph structure. In this section, we describe a Bayesian scoring function, which evaluates the posterior probability of a structure given the data. This scoring function constitutes an important component of the active learning algorithm we will develop next.
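As a concrete illustration, the random-walk MCMC over DAG structures described above can be sketched as follows. This is a minimal, hypothetical sketch: the toy score simply favors sparse graphs and stands in for the Bayesian scoring function described in this section, and all function names are our own.

```python
import itertools
import math
import random

import numpy as np


def is_dag(adj):
    """Check acyclicity by repeatedly peeling off nodes with no incoming edges."""
    remaining = set(range(len(adj)))
    while remaining:
        free = [i for i in remaining if not any(adj[j][i] for j in remaining)]
        if not free:
            return False          # a cycle remains among the leftover nodes
        remaining -= set(free)
    return True


def neighbors(adj):
    """All DAGs reachable by a single edge insertion, deletion, or reversal."""
    out = []
    for i, j in itertools.permutations(range(len(adj)), 2):
        if adj[i][j]:
            d = adj.copy(); d[i][j] = 0
            out.append(d)                          # deletion always keeps acyclicity
            r = adj.copy(); r[i][j] = 0; r[j][i] = 1
            if is_dag(r):
                out.append(r)                      # valid reversal
        elif not adj[j][i]:
            a = adj.copy(); a[i][j] = 1
            if is_dag(a):
                out.append(a)                      # valid insertion
    return out


def mh_step(adj, log_score, rng):
    """One Metropolis-Hastings move with a uniform proposal over the neighborhood.
    The |N(G)| / |N(G')| term corrects for asymmetric neighborhood sizes."""
    nbrs = neighbors(adj)
    cand = nbrs[rng.randrange(len(nbrs))]
    log_ratio = (log_score(cand) - log_score(adj)
                 + math.log(len(nbrs)) - math.log(len(neighbors(cand))))
    return cand if rng.random() < math.exp(min(0.0, log_ratio)) else adj


# Toy stand-in score favoring sparse graphs; a real run would plug in the
# Bayesian posterior score.
log_score = lambda g: -float(g.sum())

rng = random.Random(0)
g = np.zeros((4, 4), dtype=int)
for _ in range(100):
    g = mh_step(g, log_score, rng)
```

After any number of steps the chain state remains a valid DAG, and higher-scoring structures are visited more often in the long run.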
Given an instance of
Under an intervention (e.g., gene knockout or RNAi), a subset of random variables in
Now, let
By arranging terms for each family (i.e., a node and its parents) across data instances, this can be rewritten as
The fact that the likelihood over intervention data still decomposes into family-specific terms (each over a mutually exclusive set of parameters) enables the use of a conjugate prior similar to the one introduced by Geiger and Heckerman [
Specifically, for each node
Moreover, the
Since
Most network inference methods, including the one presented in the previous section, assume that the data set is obtained and fixed prior to learning. However, in a real world setting, one can perform additional intervention experiments and combine them with existing data to improve the quality of learned networks. An active learning framework allows us to reason about how
Here, we present our active learning algorithm for inferring the structure of GBNs. We adopt the information-theoretic framework developed by Murphy [
Let
Computing expectations over
The overall active learning procedure, with the optimization technique discussed in the following section, is outlined in Algorithm 1 and
Sample
Sample
Using Eqs (
Compute
Using Eqs (
Using
Perform experiment under
Sample
We first estimate our belief over candidate graph structures from the initial data set, which contains observational and/or intervention samples. We then iteratively acquire new data instances by carrying out the intervention experiment predicted to cause the largest expected change in our belief and updating the belief accordingly. The final belief is summarized into a predicted network via Bayesian model averaging.
The computational bottleneck of our algorithm is in the evaluation of
We assessed the performance of our learning algorithm in several different ways. To analyze how accurately we learned the underlying causal structure, we followed the evaluation scheme used in the DREAM4 challenge [
In addition to analyzing the trajectory of different accuracy measures over the course of the iterative learning procedure, in which one intervention experiment is added at a time, we also considered a metric that is agnostic to whether we have access to the ground truth network. Given a data set with pre-generated interventions and their outcomes, we can retroactively evaluate, for any subset of the data, how close we are to the final belief over candidate graph structures obtained using the whole data set. The final belief is expected to better reflect the ground truth, so faster convergence to the final belief is desirable in most cases. Intuitively, this metric evaluates how much information we would lose if we had only enough resources to perform a small subset of the provided intervention experiments. We measure it by calculating the KL divergence of the final belief from the current belief over 5000 randomly chosen candidate graphs.
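Assuming each belief is represented as normalized weights over the same fixed sample of candidate graphs, this convergence metric can be sketched as follows; the log scores below are placeholders, not values from the actual data.

```python
import numpy as np


def belief_from_log_scores(log_scores):
    """Normalize unnormalized log posterior scores into a belief vector,
    using the log-sum-exp trick for numerical stability."""
    s = np.asarray(log_scores, dtype=float)
    w = np.exp(s - s.max())
    return w / w.sum()


def kl_from_final(final_belief, current_belief, eps=1e-300):
    """KL divergence of the final belief from the current belief, both
    defined over the same fixed set of candidate graphs."""
    p, q = np.asarray(final_belief), np.asarray(current_belief)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / np.maximum(q[m], eps))))


# Placeholder scores for three candidate graphs.
final = belief_from_log_scores([-10.0, -11.0, -14.0])
same = kl_from_final(final, final)        # -> 0.0
```

The divergence is zero exactly when the current belief already matches the final belief, and it shrinks toward zero as interventions are accumulated.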
We first set out to test whether the model assumptions of GBNs (acyclicity and Gaussianity) are too restrictive to be effectively applied to real biological data. We ran our algorithm on gene expression data collected by Sachs et al. [
We applied our Bayesian structure learning algorithm based on GBNs to uncover the signaling pathway of 11 human proteins from expression data provided by Sachs et al. [
To demonstrate the effectiveness of our active learning algorithm, we randomly generated a GBN with 10 nodes (
We compared edge prediction performance between the active and random learners, summarized over five trials. The dotted lines are drawn at one standard deviation from the mean in each direction. The active learner achieves higher accuracy and faster convergence than the random learner.
The parameters of the ground truth GBN are generated as follows. Each edge weight
For the MH algorithm used for sampling graphs from the posterior distribution at each iteration, we used a proposal distribution that assigns uniform weight to each DAG in the neighborhood that is reachable by a single-edge insertion, deletion, or reversal, following the suggestions of Giudici et al. [
The results are summarized in
We next asked whether we can achieve a similar improvement on a data set that more closely resembles biological data. To this end, we tested our method on data from the DREAM4 10-node in-silico network reconstruction challenge [
The results from the DREAM4 analysis are summarized in
The results are summarized over five trials. The dotted lines are drawn at one standard deviation from the mean in each direction. The active learner achieves higher accuracy and faster convergence than the random learner.
He and Geng [
We evaluated the final prediction accuracy of our active learning algorithm in identifying edges in the undirected skeleton of the ground truth network. The resulting precision-recall (PR) curves were compared to PC with different values of
Lastly, we tested the extent to which our optimization based on rank-one updates to the matrix inverse and determinant improves the runtime of our algorithm. The cumulative runtime of the iterative learning procedure on our simulated data is shown in
The results are summarized over three trials (error bands are not visible due to low variance). Our optimization technique, specific to GBNs, leads to a significant improvement in runtime.
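The two identities underlying this optimization, the Sherman-Morrison formula for the inverse and the matrix determinant lemma for the log-determinant, can be sketched and checked numerically as follows. The matrices here are random stand-ins, not the actual posterior matrices from our scoring function.

```python
import numpy as np


def rank_one_update(A_inv, log_det_A, u, v):
    """O(n^2) update of the inverse and log-determinant of A after the
    rank-one change A -> A + u v^T, via the Sherman-Morrison formula and
    the matrix determinant lemma (assumes 1 + v^T A^{-1} u > 0)."""
    Au = A_inv @ u
    vA = v @ A_inv
    c = 1.0 + v @ Au
    return A_inv - np.outer(Au, vA) / c, log_det_A + np.log(c)


# Sanity check against a direct O(n^3) recomputation on a random matrix.
rng = np.random.default_rng(0)
A = 3.0 * np.eye(4) + 0.1 * rng.normal(size=(4, 4))
u = 0.1 * rng.normal(size=4)
v = 0.1 * rng.normal(size=4)
inv_fast, logdet_fast = rank_one_update(
    np.linalg.inv(A), np.linalg.slogdet(A)[1], u, v)
```

Because each candidate intervention changes the relevant matrices by low-rank terms, replacing full inversions and determinant evaluations with such updates reduces the per-candidate cost by an order of the matrix dimension.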
In this paper, we derived an efficient active learning algorithm for Gaussian Bayesian networks and demonstrated its effectiveness on several data sets. We showed that our algorithm achieves a clear improvement in uncovering the true network as long as the underlying causal structure does not significantly violate the acyclicity assumption inherent in GBN models. Even when the model assumptions were violated, we observed a superior convergence rate for the active learner, which further supports the effectiveness of our method.
There are several important ways in which this work could be extended for better applicability in systems biology. First, we could develop a systematic way of selecting a batch of intervention experiments to be performed simultaneously, which is a more suitable setup for high-throughput assays. Second, we could adapt our method to support perturbation experiments in which we observe only the response of a single reporter gene, whose phenotype (e.g., luminescence) is easier to quantify than a full expression profile. Third, it would be interesting to look for better ways of finding the optimal intervention than exhaustive enumeration followed by a linear search. This capability is of particular interest for higher-order interventions on multiple variables, where the number of candidate interventions grows combinatorially.
Even when the ground truth network contains numerous short cycles, our method still achieves significantly faster convergence to the final belief. However, due to the violation of the model assumptions, our method achieves generally lower final accuracies than on data sets 4 and 5 and does not clearly outperform the random learner. The results are summarized over five trials. The dotted lines are drawn at one standard deviation from the mean in each direction.