Expectation propagation for large scale Bayesian inference of non-linear molecular networks from perturbation data

Inferring the structure of molecular networks from time series protein or gene expression data provides valuable information about the complex biological processes of the cell. Causal network structure inference has been approached using different methods in the past. Most causal network inference techniques, such as Dynamic Bayesian Networks and ordinary differential equations, are limited by their computational complexity and thus make large scale inference infeasible. This is specifically true if a Bayesian framework is applied in order to deal with the unavoidable uncertainty about the correct model. We devise a novel Bayesian network reverse engineering approach using ordinary differential equations with the ability to include non-linearity. Besides modeling arbitrary, possibly combinatorial and time dependent perturbations with unknown targets, one of our main contributions is the use of Expectation Propagation, an algorithm for approximate Bayesian inference over large scale network structures in short computation time. We further explore the possibility of integrating prior knowledge into network inference. We evaluate the proposed model on DREAM4 and DREAM8 data and find it competitive against several state-of-the-art existing network inference methods.


Introduction
Cellular components function through their interaction in form of biological networks, such as regulatory and signaling pathways [1]. With the advances of experimental methods and the emergence of high-throughput techniques, such as DNA microarray and next generation sequencing, the measurement of expression values of genes on whole genome scale is now possible. These advances have motivated attempts to learn molecular networks from experimental data. However, network inference from experimental data is computationally nontrivial, because the number of variables (typically genes, or proteins) usually exceeds the number of samples. Moreover, the number of possible network structures increases super-exponentially with the number of a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 network nodes. Therefore the search space to look for the true network is very large even for small graph instances and thus prevents the use of exact methods.
With the emergence of targeted perturbation techniques such as RNAi [2] and more recently CRISPR-Cas9 [3], it becomes possible to study the effect of specific gene silencing on a whole molecular network in a systematic manner, thus enabling identification of causal networks based on multiple intervention effects. Based on this fact, several groups of researchers have focused their work on network learning from perturbation data.
In this work, we follow a mechanistic ODE modelling framework such as one described in [15,[19][20][21]. Molinelli et al. described an efficient inference method based on Belief Propagation (BP) and showed that their method has similar performance as exact Bayesian inference [16]. That method relies on a steady state assumption and searches over a finite set of edge weights in order to ensure convergence. The required data discretization is a principal limitation of the method and potential source of error in a practical application. Furthermore, the steady state assumption is only valid in specific applications, excluding time series data.
In contrast, the proposed method called FBISC (Fast Bayesian Inference of Sparse Causal networks), neither requires any data discretization step nor does it rely on a steady state assumption. It can be applied to time series as well as steady state data. Akin to Molinelli et al. we allow for modeling non-linear regulatory relationships between molecules. To enable causal inference we allow for arbitrary, possibly combinatorial interventions. Our method is Bayesian and thanks to the employed Expectation Propagation (EP) scheme [22] computationally attractive.
Sometimes prior information is available about the structure of the network we wish to infer. The Bayesian framework used in our method allows us to incorporate prior knowledge into the network inference procedure in a natural and flexible way. This is achieved via a "spike and slab" prior [23], which at the same time enforces sparsity of the network. The spike and slab prior is known to yield less biased estimates than lasso-type L1 penalties, as e.g. employed in the "Inferelator" [24].

Modeling perturbations and dynamics
To ease the representation of our model, we first assume we have time-series data. The case of steady state data will be explained later. Let Y = {Y 1 ,Y 2 ,. . .,Y n } be the molecules (genes, proteins, etc.) for which we have available measurements at different time points (t). We would like to estimate the unknown network topology underlying their interaction. The time-series dataset can be described as D ¼ fy c ir ðtÞji ¼ 1; . . . ; n; r ¼ 1; . . . ; q i ; c ¼ 1; . . . ; mg, where q i is the number of replicate measurements for molecule i, and m is the number of perturbations. C = {1,2,. . .,m} represents the set of all perturbation experiments, in which each c 2 C can either directly influence one molecule or a subset of molecules Y c Y, i.e. perturbations can be targeted or combinatorial.
We have to take into account that perturbations may not exhibit the same quantitative effect on all directly influenced molecules. For example, we can think c to consist of a treatment with two ligands A and B. At given concentrations A might strongly inhibit protein P1, whereas B might exhibit a moderate effect on protein P2. In our model, we capture this behavior by including a set of extra nodes in our network, and a set of extra edges connecting perturbations with perturbed molecules. All the edges in our network are weighted, and with our model, we will infer the difference in the perturbation strength on different target molecules based on data. In conclusion, our network is a graph Γ = (Y[C,ε), with the set of nodes Y[C and set of edges . The set of edges for this graph consists of all possible connections between regular nodes (i.e. all nodes except the perturbation nodes) and also the connections from our perturbation nodes to our regular nodes. So, our adjacency matrix to be inferred is a (n+m)×(n+m) weight matrix W = (w ij ). We do not aim to infer any edge pointing to the perturbation nodes. By means of this representation we are able to model interactions between our network nodes (genes, proteins), and also between perturbation nodes and their targets as shown in Fig 1. In general c 2 C can be time dependent, represented as x c (t), which can be Boolean (1 indicating perturbation, 0 no perturbation) or fully quantitative. A special case is when x c (t) = 1 for all t = 1,. . .,T. Targets of perturbation nodes do not have to be fully known; they can be inferred from data with our method. The technique for that (Expectation Propagation) is described later.
We can in principle model perturbations either as perfect (ideal) interventions or as soft interventions: Perfect interventions remove the influence of any other on the perturbed node, whereas soft interventions just increase the probability of the target node to be perturbed [25]. By using explicit intervention nodes and correspondingly weighted edges we here rely on a soft intervention scheme, which we believe to be closer to biological reality.
We assume an ordinary differential equation system (ODE) for describing the dynamics of the molecular network relative to known interventions: Function f can be linear or non-linear [26]. Linear ODEs are not capable of capturing important biological phenomena such as coupled perturbation effects, nonlinear interactions, and switch-like behavior [15]. Inspired by [15], we thus propose the following approach for representing the time dependent behavior of system measurements, fy c ir ðtÞg, via a set of coupled non-linear differential equations: An example network of 3 genes (g 1 , g 2 , g 3 ) and 2 perturbation nodes (p 1 , p 2 ) and its related weight matrix. The goal is to infer network edges (w elements). The matrix is interpreted in the sense that elements in each row represent the weights of all incoming edges to a specific node (here: g1). The last two rows are zero, because we do not have edges directed towards perturbation nodes.
Here z c pr ðtÞ represents the time series of a measurement or perturbation node. The upper and lower bounds of d dt y c ir are controlled by positive parameters α i and β i . Eq (2) can be linearly approximated as yielding where Δ t is the length of the known time interval between subsequent measurements t and t +1. There is no constraint on equality of interval lengths. Δ t weights the influence of measurements at time t on those at time point t+1. Shorter time intervals increase the influence of y c ir ðtÞ and decrease the effect of R c ir ðtÞ. Under steady state conditions we have d dt y c ir ¼ 0 and hence we obtain: Notably, we have removed the dependency on time t in the formula here due to the steady state condition.

Bayesian model fitting
. .,n. Furthermore, let σ i denote the Gaussian measurement noise for molecule i. The likelihood of measured data D given weight matrix W and known measurement noise can then be written as: Typically the number of replicate measurements q i per molecule is small, and thus the empirical variance is an unreliable estimate of the true s 2 i In order to account for this fact we assume the true noise variance s 2 i to be drawn from an inverse gamma distribution: With this setting the marginal likelihood pðy c ir ðtÞjm i ðtÞÞ, integrating out the unknown s 2 i , can be computed in closed analytical form [27]: Accordingly, the marginal likelihood of the data given weight matrix W is given by Notably, using Eq (5) in the steady state situation Eq (6) changes into: Note that we dropped t here to make the independence of time explicit. In our method we use Eq (6) and Eq (7) to score weight matrices for time series and steady state data, respectively.

2.2.b Prior knowledge and network sparsity.
To enforce sparsity of W we use a spikeand-slab prior [23] on the edge weights: We introduce a binary latent variable, γ ij , for each w ij indicating the presence (γ ij = 1) or absence (γ ij = 0) of edge i!j. Given γ ij the spike-and-slab prior on is defined as: The variance σ 1 of the slab distribution can be set sufficiently large (here: 10) in order to achieve a low bias of weight estimates for present edges. On the other hand, σ 2 is set close to zero (σ 2 !0) to approximate a delta function centered at zero (the spike). The mixture coefficient γ ij is drawn from a Bernouli distribution: Parameter ρ ij reflects the prior probability for that. This allows us to incorporate prior knowledge in a similar way as e.g. described in [28].

2.2.c Bayesian model inference via expectation propagation.
Expectation propagation (EP) has been introduced by [22] as a computationally efficient approximation of full Bayesian inference. It extends the technique of moment matching [29].
Let Ө denote the set of all inferable parameters (W,α,β) of our model. Similar to Variational Bayesian methods, EP minimizes Kullback-Leibler (KL) divergence between the true joint distribution p(Ө,D) and some approximation, q(Ө, D). For that purpose it is essential to factorize the joint distribution p(Ө,D), for example as: . This is done using matching first moments, i.e. expectations. Notably, the EP algorithm always converges when the approximating factors are in the exponential family [22].

2.2.d Implementation.
For the implementation of the EP algorithm we use microsoft Infer.NET [30], a framework for Bayesian inference on graphical models. Our proposed model can be interpreted as a special type of Dynamic Bayesian Network (DBN), connecting each network node to its parents i.e. the node values in previous time-step (Fig 2). The code of our model-written in C#-is provided in the Supplemental material (S1 Code).
The same weight matrix (or the same set of weight parameters) is used for all layers of our DBN; for example if we assume w 12 as the weight of the edge from g 1 at time-point 0 to g 2 at time-point 1, the edge connecting the same two nodes from time-point 1 to time-point 2 would have same weight parameter. This implies that we assume the network structure not to change over time.

Dealing with non-linearity within the EP framework
The non-linear sigmoid function shown in Eq (4) yields severe convergence issues within the EP inference framework. We thus use a piece-wise approximation of the sigmoid function g Z ð Þ ¼ 1 1þexpðÀ ZÞ appearing on the right hand side of Eqs (4) and (5) where max(y) and min(y) denote the maximally and minimally measured concentrations for one particular molecule (per replicate) in a certain condition over a complete time series. Note Eq (9) provides an upper an lower bound for the concentration dependent change of each molecular in dependency of its regulators. The function F in the simplest case could be the identity function, as proposed in Bonneau, Reiss et al. (2006). In that case between the upper and lower bound the function g is fully linear, and deviates significantly from the original sigmoid curve if the argument Z is far away from 0.5. Furthermore, a linear concentration change is principally non-physiological. In order to account for these facts we thus suggest to define F in Eq (9) as a non-parametric B-spline basis expansion [31]. The B-spline expansion can be computed in a pre-processing step of our method, which maps the original time-series data into a cubic B-spline space. That means we replace each z c pr in Eqs (4) and (5) by FðZ c pr Þ ¼ and use Eq (9) as an approximation of the sigmoid function. In Eq (10) B cpr k denotes a (cubic) B-spline basis function and the sum runs over the different spline knots k. After choosing an appropriate number of spline knots, functions of the form of Eq (10) can be fitted to each measured time series (on log scale). These functions can in principle be evaluated up to every desired resolution. Correspondingly, interpolated input data can now be fed to our model rather than the original raw data.
In practice we tried the following different spline interpolation methods here: 1.
Smoothing B-splines, as implemented in the "smooth.spline" function in R [32] 2. Interpolated cubic B-splines, as implemented in the "spline" function in R [33]

Simulation studies
In order to better understand the principle behavior of our method named FBISC (Fast Bayesian Inference of Sparse Causal networks) under different conditions, we performed several simulation experiments. A rigorous comparison against competing methods is shown in later Sections. Network topologies with 10, 50 and 100 nodes and corresponding time series data with 5, 10, 15 and 20 measurement time points were simulated via the GeneNetWeaver tool [34]. Gen-eNetWeaver samples random sub-networks from known large-scale yeast and E. coli transcriptional networks. The network topologies used in this paper are shown in the (S1 Fig). Data simulation for each of these topologies was repeated 5 times, and no perturbations were applied at first place.
The area under ROC (AUROC) and area under precision-recall curve (AUPR) are used to evaluate prediction of method. Notably, these measures are independent of a specific confidence or p-value cutoff. Fig 3 depicts the influence of the number of time points and the number of network nodes on reconstruction performance when using the most basic version of our method without spline interpolation (i.e. a piecewise linear approximation of the sigmoid curve). As expected, AUPR and AUROC values increase with more time points. For networks with 10 nodes and 15 time points, AUROC and AUPR reach 60% and 85%, respectively, while for networks with 100 nodes AUPR drops to 15%, but the AUROC is still close to 60%.
Next we investigated the effect of using perturbation data with the same basic version of FBISC. For that purpose, we randomly picked 20% of the nodes of the network with 100 nodes and each of them affected by a different perturbation. Perturbations were assumed to represent a constant signal over time, and ten time points were simulated for time courses of each network node. We compared three situations 1) targets of perturbations are fully known; 2) targets of perturbations are unknown; 3) purely observational data. Fig 4 represents our results for four network structures for each of these situations.
As indicated by Fig 4 perturbation data generally increased AUPR and AUROC compared to purely observational data dramatically: AUPR was at least twice as high than with purely observational data, and the AUROC increased by more than 10%. If targets of individual perturbations were fully known to the algorithm (i.e. the prior probability for edges connecting perturbations to their targets was one) AUPR and AUROC values were on average~5% higher than in the case were targets of perturbations were unknown.
In the last experiment we compared the different spline methods discussed in Section 2. In conclusion, our simulations demonstrate that our method can successfully exploit perturbation information and profits from spline interpolated time series data. Furthermore, reconstruction performance is expected to be relatively robust, even if large networks are estimated.

Evaluation on DREAM challenge data and comparison against competing methods
We downloaded data from the DREAM4 [35] and DREAM8 [36,37] challenges for the further evaluation of our approach and comparison to competing methods (see http://dreamchallenges. org/project-list/closed/). The gold standard networks provided in these data are used for evaluation. DREAM4 provides simulated data for five networks of size 10 nodes and five networks of size 100 nodes. For each network perturbation time series and steady state data were retrieved. Time series data comprise 21 time points (t 0 = 0 to t 20 = 1000) reflecting measurements of each network node. Perturbations are always applied at time 0 and removed at time 500. Information regarding to the exact targets of perturbations are not available. Each time series is measured with 5 replicates for the 10-node network and 10 replicates for the 100-node network. Each replicate represents a perturbation experiment in which different nodes (about one-third of the network) are perturbed.
In addition to time series different kinds of steady state data are available from DREAM4. Here we employed knock-out, knock-down, and multifactorial perturbation data. Knock-out and knock-down data reflect steady state measurements of all network nodes after perturbation of exactly one known node. Multifactorial data corresponds to combinatorial perturbations of unknown nodes in each experiment. No replicate measurements are available for steady state data.
DREAM8 provides experimental data of a signaling network with 20 nodes at 11 time points. Here the perturbations correspond to compound treatments. Exact concentrations of perturbation sources are not given but specified with qualitative values (high/low). Following Young et al. [38] we normalized each measured time series by subtracting its mean.
We compared FBISC with ScanBMA [38], iBMA [39], LASSO [40], ebdnet (Empirical Bayes Estimation of Dynamic Bayesian Networks [41], ARACNE [42], and CLR [43]. As opposed to the first three methods LASSO, ARACNE and CLR are not per se designed for time series data. In order to adapt these methods to our situation we considered for each network gene of interest all other genes as potential regulators. For each potential regulator its measurements excluding the last time point were used. We then asked, which subset of regulators could predict the measured time series of the network gene of interest excluding the first time point. This estimation procedure was repeated for all network genes. Notably, information about perturbations was included into all methods competing with our FBISC approach. This was done by adding perturbations as additional potential "regulators" of each node (similar to the way that FBISC treats perturbations). In case that perturbation targets were known, perturbations were only considered as potential regulators of directly targeted nodes.
All tested network inference algorithms produce either a confidence measure (FBISC, ScanBMA, iBMA) or a p-value (ebdnet, CLR, ARACNE) for each possible edge. Correspondingly, AUROC and AUPR are used to evaluate prediction performances of methods. Notably, these measures are independent of a specific confidence or p-value cutoff.

Results for DREAM4 data
Using the above described time series perturbation data we compared results obtained by our method with the ones reported in Young et al. [38]. Results generated by our method are shown in the last two rows of Table 1, indicating a higher AUROC/AUPR for 10-gene and higher AUPR for 100-gene networks than competing methods, specifically when using the B-spline method. ROC and PR curves regarding DREAM4 networks of size 100 are provided in S2 Fig. Next we compared our method to the competing approaches on the basis of various kinds of steady state data (Table 2), showing a clear improvement compared to ARACNE, LASSO and CLR (which are the only competing methods using steady state data) in all cases. As expected, AUROC and AUPR values for steady state data are typically a bit below those observed for time series data. However, FBISC using knock-out data could still achieve an AUROC of 70% for the 100-gene network.

Results for DREAM8 data
Next, we focused on the DREAM8 challenge data. In contrast to before, results for this dataset were obtained by our own implementation of competing methods. More specifically, we used R package NetworkBMA [44] for ScanBMA, minet [45] for ARACNE and CLR, ebdbnet [41], and glmnet [46] for LASSO. Results are presented in Table 3, indicating a slightly better AUROC than the best competing approach (CLR).

Effect of incorporating prior knowledge
Next we tested, in how far the previously presented results would change in dependency of prior knowledge. Only time series data were used at this point. Following the approach used by Praveen et al. [47] we considered 0, 5, 10, 25 and 50 percent of true network edges to be known with probability of 90%, and the FBISC-B-spline method with the same setting reported in Sections 3.3 and 3.4 was used. The results of this experiment (Fig 6) show an increase of AUROC and AUPR as fractions of confidently known edges increase. Notably, with 50% known edges we could achieve an AUROC of close to 75% for the 100 gene network from the DREAM4 challenge and about the same performance for the DREAM8 challenge network.

Run time and parallelization
FBISC is a Bayesian approach. Frequentist methods like CLR and ARACNE are based on mutual information and conceptually far simpler. They are thus computationally comparably cheap. From the practical point of view, a question is thus, how the computing time of FBISC compares to competing methods. The run time of ScanBMA depends on the number of potential parents (nvar) per node. Here we tested ScanBMA with nvar = 10 and nvar = 20 (for larger values of nvar the run time Table 2. Average performance results based on DREAM4 10-gene and 100-gene networks with steady state knock-down, knock-out, and multifactorial data. Notably, FBISC is only applicable without spline interpolation in these situations. For the 10-gene networks LASSO failed due to low number of available samples and is thus not shown.

Method
Type of perturbation AUROC (10-gene network) AUPR ( increases dramatically). Results for our method compared to the competing approaches considered in this paper are reported in Table 4, indicating a good computational scalability of our FBISC approach. Notably, our algorithm can be parallelized, because the inference problem can be solved independently for each network node. This allows for an additional gain in computation time. Corresponding results shown in Table 4 (named FBISC parallel) refer to the use of 2 cores (Intel1 Core™ i5-5257U dual core processor with 4 parallel threads). More cores would reduce calculation time even more. For DREAM8 and DREAM4 networks of size 100 there is speed up by factor 2 due to parallelization (Table 4). At the same time, the memory use was rather modest and allowed to run all computations on a standard laptop computer. Overall, the Bayesian inference scheme used in FBISC thus seems to be applicable even for large networks.

Discussion
We proposed a Bayesian approach for computationally efficient inference of large scale molecular networks from complex perturbation data. Our FBISC method is highly flexible and applicable even in situations where the exact targets of perturbations are unknown (such as stress experiments), which is frequently the case in biology. A further strength is that we consider perturbations themselves as time dependent, as e.g. reflected in the DREAM4 data. FBISC uses a biochemically inspired model to describe the non-linear dynamical behavior of molecular networks and integrates this description into a graphical modeling framework. This allows for the application of efficient approximate inference schemes, such as expectation propagation.
Notably, the output of our method is a posterior distribution over edge weights, which accounts for the unavoidable uncertainty of any network inference. We enforced sparsity of inferred networks in form of a spike and slab prior. This type of prior forces many edge weights to exactly zero and naturally allows for the integration of prior background knowledge, which demonstrated useful in our results. Altogether we see the combination of a highly flexible modeling framework (reflected by non-linear dynamics, arbitrary complex perturbation schemes and probabilistic integration of prior knowledge), which is applicable to time series as well as steady state data and uses computationally scalable Bayesian inference as differentiation of FBISC to existing techniques. The advantage for the user lies in a unified method, which allows for automatically adapting literature derived network information to experimental data and produces confidence measures. Our results showed an attractive prediction performance of our method. We thus believe that our proposed FBISC method is an attractive alternative to existing methods to learn causal network structures from complex perturbation data. The C# code of our method is included in the supplements of this paper (S1 Code).