Conceived and designed the experiments: AM EVE RB. Performed the experiments: AM. Analyzed the data: AM. Contributed reagents/materials/analysis tools: AM AG. Wrote the paper: AM AG RB.

The authors have declared that no competing interests exist.

Many current efforts to learn regulatory networks from systems biology data must balance model complexity against data availability and quality. Methods that learn regulatory associations based on unit-less metrics, such as Mutual Information, are attractive in that they scale well and reduce the number of free parameters (model complexity) per interaction to a minimum. In contrast, methods for learning regulatory networks based on explicit dynamical models are more complex and scale less gracefully, but are attractive as they may allow direct prediction of transcriptional dynamics and resolve the directionality of many regulatory interactions.

We aim to investigate whether scalable information-based methods (like the Context Likelihood of Relatedness method) and more explicit dynamical models (like Inferelator 1.0) prove synergistic when combined. We test a pipeline in which a novel modification of the Context Likelihood of Relatedness (mixed-CLR, modified to use time-series data) is first used to define likely regulatory interactions, and Inferelator 1.0 is then used for final model selection and to build an explicit dynamical model.

Our method ranked 2nd out of 22 in the DREAM3 100-gene

For decades the biological community has had a keen interest in characterizing the genetic regulatory networks that are largely responsible for an organism's ability to adapt to its constantly changing environment. An ever-increasing number of functional genomics projects continue to make this a key problem in modern biology. It remains unclear, however, what constitutes the most efficient paradigm for characterizing regulatory networks, i.e. what experiments to perform, what data to collect, and what methods to use for learning biological regulatory networks. Moreover, the number of proposed methods for learning regulatory networks from systems data is growing, and it is difficult to compare the relative merit of these methods unless they are evaluated on similar datasets using similar metrics. The DREAM (Dialogue for Reverse Engineering Assessments and Methods) project

There are several broad classes of regulatory network inference methods that aim to reconstruct and model the underlying regulatory networks at varying degrees of detail. It is beyond the scope of this introduction to review more than a small subset of these methods, as they represent a very large body of work. For a more thorough review of network reconstruction methods we refer the reader to

Mutual information based methods

Ordinary Differential Equation based methods

Here we employ a modified version of the MI based method Context Likelihood of Relatedness (CLR)

The Inferelator 1.0 is a scalable method that uses an additive ODE model to approximate regulatory dynamics. At the core of the method is an

Several network reconstruction methods, including the method described here, restrict the number of considered regulatory interactions using a correlation or MI based pre-processing step. For example, the Sparse Candidate Algorithm, a Bayesian network approach for learning biological regulatory networks, employed a mutual information pre-processing step, aimed at reducing complexity and improving the scaling of the algorithm to the genome scale

Here we describe the three step pipeline (

For each regulatory interaction,

The dynamical variables available from observations are the simulated mRNA levels of genes:

We are given data sets that contain observations taken from five different networks

The DREAM3

To determine rankings, we will define a confidence score,

Without loss of generality we can assume that time-series observations resulted from one perturbation experiment, i.e. we can write them in the form of a

There are two main sets of steady-state experiments: measurements of all genes when one gene (per experiment),

Denote by

Note that unlike typical genome-wide mRNA observations, the observations given in DREAM3 ranged from zero to one (e.g. microarray and RNA-seq can exhibit multiple

As the first step in our pipeline we apply our modified CLR algorithm (mixed-CLR) to reduce the number of likely regulators for each target (i.e. gene). This procedure has two parts: 1) computing static and dynamic Mutual Information (MI) between each potential regulator and target pair, followed by 2) a background correction step, for which we use the procedure originally described in

We use MI as a metric of statistical dependency between two genes. MI between two random variables

When computing MI from continuous data a binning approach is often used

Using both time-series and steady-state observations (the full set of provided experiments) we compute the static MI between the observed expression levels of every gene pair,

Computing MI between the expression levels of genes with the purpose of characterizing regulatory interactions has two major limitations: 1) a pair of genes can have a high MI value for many reasons other than a regulatory interaction, e.g. the two genes can share a regulator; and 2) MI between the expression levels of two genes is a symmetric quantity, and thus cannot resolve causality, i.e. cannot resolve the directionality of the regulatory interaction. To partially resolve these limitations we compute dynamic MI values, derived from a linear additive ODE model, motivated by our previous work on Inferelator 1.0

We assume that the time evolution in the

The next two steps aim to separate the terms in (8) that involve the putative regulators (i.e. the explanatory variables) from the terms in (8) that involve the target (i.e. the response), first for time-series experiments and then for steady-state experiments.

For time-series experiments we can write (8) using a finite difference approximation as

For steady state experiments we can write (8) by setting the derivative to zero as

Combining the time-series and steady-state response variables, we get the response vector:

Combining the corresponding time-series and steady-state explanatory variables together, we get the explanatory variables vector:

We compute the dynamic MI between every pair of response-vector and explanatory-variable vector,

Note that static- and dynamic-MI values are estimated using the same number of observations. Next we describe how to use dynamic MI values as part of a modified CLR background correction.
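The construction of the response and explanatory vectors, and the MI computed between them, can be sketched as follows. This is a minimal illustration in Python (the pipeline itself was implemented in R), not the authors' code. It assumes a model of the form tau * dx_i/dt = -x_i + sum_j beta_ij * x_j, which is an assumption here because the model equation is not reproduced in this text; `tau`, `n_bins`, and all function names are hypothetical. Equal-width binning is used purely for illustration and is not necessarily the estimator used in the paper. The sketch drops the last time point of each series, so it does not reproduce whatever convention the authors use to keep observation counts equal between static and dynamic MI.

```python
import math
from collections import Counter


def binned_mi(x, y, n_bins=10):
    """Estimate mutual information (in nats) between two equal-length
    sequences by discretizing each into equal-width bins."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_bins or 1.0  # constant input falls into one bin
        return [min(int((v - lo) / width), n_bins - 1) for v in values]

    bx, by = to_bins(x), to_bins(y)
    n = len(x)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), c in joint.items():
        # p(i, j) * log( p(i, j) / (p(i) * p(j)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[i] * py[j]))
    return mi


def dynamic_response(series, times, tau):
    """Time-series response for one target under the ASSUMED model:
    y_k = tau * (x[k+1] - x[k]) / (t[k+1] - t[k]) + x[k]."""
    return [
        tau * (series[k + 1] - series[k]) / (times[k + 1] - times[k]) + series[k]
        for k in range(len(series) - 1)
    ]


def dynamic_mi(target_ts, regulator_ts, times, target_ss, regulator_ss,
               tau, n_bins=10):
    """MI between a target's response vector and a regulator's
    explanatory vector, pooling time-series and steady-state data."""
    # Steady state: derivative is zero, so the response is the target level.
    response = dynamic_response(target_ts, times, tau) + list(target_ss)
    # Explanatory values are regulator levels at the start of each interval.
    explanatory = list(regulator_ts[:-1]) + list(regulator_ss)
    return binned_mi(response, explanatory, n_bins)
```

Because the response depends on the target's rate of change while the explanatory variable is the regulator's level, swapping the roles of the two genes generally gives a different value, which is what allows a dynamic MI to carry directional information that the symmetric static MI cannot.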

At the core of both the original CLR method and our modified CLR variant, mixed-CLR, is a background correction step that computes the significance of a given regulator-target MI value by comparing that value to all MI values for that regulator and all MI values for the given target. This background correction step can be briefly described as follows:

Let

We have computed the pseudo z-scores in three variations:

To apply this procedure we first, as was done for dynamic-CLR above, compute the Z-score of

Second, we compute the Z-scores of

Lastly, we combine the previous two Z-scores into a pseudo Z-score,

Note that

In order to decide which CLR variant to use for DREAM3 predictions, we evaluated the performance of all three CLR variants, on the two DREAM2
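The background correction at the core of CLR, described at the start of this section, can be sketched as follows. This is a minimal Python illustration of the original (static) CLR form only: each MI value is z-scored against all MI values of its target and against all MI values of its regulator, and the two z-scores are combined. Clipping negative z-scores to zero, combining by the square root of the sum of squares, and using the population standard deviation are assumptions taken from the commonly used CLR formulation; the dynamic and mixed variants are not reproduced here, and the function names are hypothetical.

```python
import math


def clipped_zscores(values):
    """Z-score each value against the mean and (population) standard
    deviation of the list, clipping negative scores to zero."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0] * n
    return [max(0.0, (v - mean) / std) for v in values]


def clr_scores(mi):
    """mi[i][j] is the MI between target i and regulator j.
    Returns a matrix of combined pseudo z-scores of the same shape."""
    n_targets, n_regulators = len(mi), len(mi[0])
    # Background of each target: all MI values in its row.
    z_target = [clipped_zscores(row) for row in mi]
    # Background of each regulator: all MI values in its column.
    z_regulator = [
        clipped_zscores([mi[i][j] for i in range(n_targets)])
        for j in range(n_regulators)
    ]
    return [
        [math.sqrt(z_target[i][j] ** 2 + z_regulator[j][i] ** 2)
         for j in range(n_regulators)]
        for i in range(n_targets)
    ]
```

An interaction scores highly only when its MI stands out against the background of both genes involved, which down-weights genes that have uniformly high MI with many partners.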

For our second step we perform crude filtration to remove the most unlikely regulatory interactions given our knowledge of gene knock-outs and knock-downs. This step is solely based on the genetic perturbations collected as steady-state observations.

For each interaction,

The regulatory interactions scored in

Here we use the results of the previous two steps, contained in

We use Inferelator 1.0 to learn a sparse ODE model for each

Least Angle Regression (LARS)

Ten-fold cross validation is used to select the minimum value of
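Sparse model selection with cross validation can be illustrated by the sketch below. It is not the Inferelator 1.0 implementation: cyclic coordinate descent on an L1-penalized least-squares objective is swapped in for LARS because it fits in a few lines of standard-library Python, and the selection rule (the penalty with the lowest total held-out squared error) is an assumption, since the exact criterion is not reproduced in this text. The names `lasso_cd`, `cv_select_lambda`, `lam`, and `n_iter` are hypothetical; rows of `X` would hold the explanatory variables (candidate regulators) and `y` the response for one target.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic
    coordinate descent (no intercept term)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # Correlation of column j with the partial residual.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def cv_select_lambda(X, y, lambdas, n_folds=10):
    """Return the penalty with the lowest total held-out squared error."""
    n = len(X)
    best_err, best_lam = None, None
    for lam in lambdas:
        err = 0.0
        for fold in range(n_folds):
            train = [i for i in range(n) if i % n_folds != fold]
            beta = lasso_cd([X[i] for i in train], [y[i] for i in train], lam)
            for i in range(fold, n, n_folds):  # held-out observations
                pred = sum(b * x for b, x in zip(beta, X[i]))
                err += (y[i] - pred) ** 2
        if best_err is None or err < best_err:
            best_err, best_lam = err, lam
    return best_lam
```

Larger penalties drive more coefficients exactly to zero, so the selected penalty controls how many candidate regulators remain in the model for each target.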

To produce the ranks required by the challenge we combine the Inferelator 1.0 model weights (

Regulatory interactions that were supported by mixed-CLR and not filtered out all have corresponding confidence scores

To ensure that Inferelator 1.0 confidence scores are on equal footing with the previous confidence scores, stored in

We store our final confidence scores for regulatory interactions that were supported by mixed-CLR and Inferelator 1.0, in

The regulatory interactions scored in

We have implemented all the steps in our pipeline using the R statistical language

Inferelator 1.0 as previously described in

After a network inference method suggests potential regulatory interactions, validation of these interactions typically requires significant effort (often requiring the coordination of multiple experiments). Hence, a regulatory network inference method should ideally produce a small number of false positives (FP) even at the expense of a higher false negative (FN) rate. When testing such a method, the performance metric should be sensitive to the method's ability to avoid FPs. Therefore, throughout this section we used area-under-curve of precision (

We used the DREAM2 50-gene data for testing our pipeline prior to the DREAM3 100-gene challenge. On both this pre-competition data and the actual DREAM3 data, mixed-CLR with Inferelator 1.0 outperformed the other potential pipelines we evaluated, and was thus the method we initially used for the DREAM3 competition. From

We evaluated the performance of Inferelator 1.0 and three different versions of CLR—namely: original-CLR (CLR), dynamic-CLR, and mixed-CLR—with or without Inferelator 1.0, at three levels of knock-out filtration,

The same trend, in which mixed-CLR coupled with Inferelator 1.0 outperforms the other evaluated method combinations for a large range of tested filtration cutoffs, holds for DREAM3 50-gene networks (for which our method ranked 4th out of 27) and 10-gene networks (for which our method ranked 5th out of 29) (data not shown). As for the DREAM3 50-gene networks, our pipeline did not outperform the DREAM2 50-gene challenge best performers

One important question that the DREAM initiative aims to answer is which data sets are most useful for characterizing regulatory networks. We compared the performance of five methods (CLR, mixed-CLR, Inferelator 1.0, and mixed-CLR or CLR with Inferelator 1.0) over four partitions of the full range of provided experiments, namely: knock-down, knock-out, time-series, and the former three combined. From

We evaluated the contribution of each data set (namely: knock-down (‘

Determining causation (the directionality of regulatory interactions) is one of the tougher problems to solve when inferring regulatory networks. In practice,

We compared the relative merit of five methods (CLR, mixed-CLR, Inferelator 1.0, and mixed-CLR or CLR with Inferelator 1.0) with or without knock-out filtration to determine causation. From

We present the relative merit of five methods, with and without knock-out filtration, to resolve causation (i.e. directionality of regulatory interactions). For each method we computed the fraction of correctly resolved true regulatory interactions (true positives, TPs) out of the total number of TPs the method had identified. We define a TP interaction,

Biological regulatory networks are typically sparse, i.e. they have a relatively small number of regulatory edges when compared to the total number of possible edges. Network sparsity is commonly used to gauge the dynamic complexity that a network would exhibit if it could be simulated or observed (the sparser a network is, the simpler its dynamic behaviour becomes). Network sparsity can in turn be separated into two more detailed measures: the network in-degree distribution, derived from the number of regulatory edges entering each target gene, and the network out-degree distribution, derived from the number of regulatory edges leaving each regulator. The degrees in each distribution sum to the number of regulatory edges in the network. We find that, as expected, our method's median error increases with the genes' median in-degree (see

Here we evaluate the performance of mixed-CLR, filtration cutoff of
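The in-degree and out-degree measures discussed above can be computed from a directed edge list as in the following sketch. The `(regulator, target)` edge representation, the gene names, and the function name are hypothetical choices made for illustration.

```python
from collections import Counter
from statistics import median


def degree_counts(edges, genes):
    """edges: iterable of (regulator, target) pairs.
    Returns (in_degree, out_degree) dicts keyed by gene name."""
    edges = list(edges)
    in_deg = Counter(target for _, target in edges)
    out_deg = Counter(regulator for regulator, _ in edges)
    # Genes with no edges get an explicit degree of zero.
    return ({g: in_deg[g] for g in genes},
            {g: out_deg[g] for g in genes})


# Toy network: G1 regulates G2 and G3, and G2 regulates G3.
toy_edges = [("G1", "G2"), ("G1", "G3"), ("G2", "G3")]
toy_genes = ["G1", "G2", "G3"]
in_degree, out_degree = degree_counts(toy_edges, toy_genes)

# Both degree sequences sum to the number of edges in the network.
assert sum(in_degree.values()) == sum(out_degree.values()) == len(toy_edges)
median_in_degree = median(in_degree.values())
```

A gene with a high in-degree has several regulators acting on it at once, which is the setting in which a per-target model has the most terms to resolve.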

One unexpected problem with mixed-CLR (that the DREAM3 challenge revealed) is that we have no guarantee that static and dynamic MI values will be in the same range for a given data set (which we assumed when constructing mixed-CLR). Indeed, from

We computed static and dynamic Mutual Information (MI) values for every possible regulatory interaction. Vertical lines represent distribution means. We present the combined probability densities for the five

Also, we can see from

We hypothesized that dynamic MI will decrease false statistical dependencies between gene pairs (i.e. dependencies that are not due to direct regulatory interactions), assisting in the identification of true regulatory interactions. To test this hypothesis we computed MI between the expression levels of every gene pair (static MI), and between every pair of dynamic response and explanatory variable (dynamic MI). For both static and dynamic MI values, we computed a z-score for each true regulatory interaction (true positive, TP) and false regulatory interaction (true negative, TN) by assuming its MI value is taken from the distribution of MI values involving the target in that interaction, i.e. the first z-score from dynamic-CLR or mixed-CLR. Indeed, from

We computed static and dynamic Mutual Information (MI) values for every possible regulatory interaction for all five 100-gene networks. For both static and dynamic MI values, we computed z-scores for true regulatory interactions (true positives, TPs) and false regulatory interactions (true negatives, TNs). We present the static (

As mentioned previously, in biology it is desirable that methods have high precision even at the expense of recall (completeness). Here we examine precision at several recall values ranging from low to high recall (

[Table: one row per method, in the order mixed-CLR+Inferelator 1.0, mixed-CLR, dynamic-CLR+Inferelator 1.0, dynamic-CLR, CLR+Inferelator 1.0, Inferelator 1.0, CLR. The row labels appear in two blocks; column headers and cell values are missing from this text.]
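Precision at a given recall, and the area under the precision-recall curve used throughout this section, can be computed from a ranked list of predicted edges as in the sketch below. This is an illustration only and not the official DREAM scoring procedure; the trapezoidal rule, the choice to start the curve at recall zero with the first observed precision, and all names are assumptions.

```python
def precision_recall_points(ranked_edges, true_edges):
    """Walk down a ranked list of directed (regulator, target) edges and
    record (recall, precision) after each prediction."""
    true_edges = set(true_edges)
    true_positives = 0
    points = []
    for k, edge in enumerate(ranked_edges, start=1):
        if edge in true_edges:
            true_positives += 1
        points.append((true_positives / len(true_edges), true_positives / k))
    return points


def area_under_pr(points):
    """Trapezoidal area under the precision-recall points."""
    area = 0.0
    prev_recall, prev_precision = 0.0, points[0][1]
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2.0
        prev_recall, prev_precision = recall, precision
    return area


def precision_at_recall(points, min_recall):
    """Precision at the first point whose recall reaches min_recall,
    or None if that recall is never reached."""
    for recall, precision in points:
        if recall >= min_recall:
            return precision
    return None


# Toy example: the edge ("B", "A") is distinct from ("A", "B"), so a
# prediction with the wrong direction counts as a false positive.
ranked = [("A", "B"), ("C", "D"), ("B", "A")]
truth = {("A", "B"), ("B", "A")}
pr_points = precision_recall_points(ranked, truth)
pr_area = area_under_pr(pr_points)
```

Because precision is computed over the top of the ranked list, false positives among the highest-ranked predictions lower this area sharply, which is why the measure suits settings where each prediction is expensive to validate.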

We have shown that explicitly modeling dynamics using a simple ODE model increases the ability of our pipeline to identify true regulatory interactions (when compared to a static model) and helps resolve the directionality of these interactions. Specifically, analysis of our performance on the DREAM3 100-gene networks shows that: 1) the full pipeline (mixed-CLR followed by knock-out filtration and Inferelator 1.0) outperformed other tested combinations of dynamic and static methods (

We observed a drop in performance as the median in-degree of a network increases (

Interestingly, we did not observe a similar drop in performance as a network's median out-degree increased (

We learned that the Inferelator 1.0

We were encouraged to see that even a very simple dynamical model was able to significantly increase performance (compared to a static model) at identifying true regulatory interactions and resolving their causation. Moreover, the two dynamic methods, mixed-CLR and Inferelator 1.0, proved complementary.

Knock-out observations were instrumental for characterizing the DREAM3 100-gene regulatory networks (

To conclude, the pipeline we have described here was developed with the aim of producing a sorted, enriched subset of true direct regulatory interactions. We find that our full pipeline was able to find a significant fraction of the true positive regulatory interactions. We also find that our top-ranked predictions have a very low error rate, suggesting that our method is useful in the context of an active genomics consortium, where network models are improved in an iterative manner: highly ranked predictions of target-gene interactions are validated with new data collection, causing the generative model to be updated, allowing for new predictions and validation, etc.

We thank Dennis Shasha, Peter Waltman and Thadeous Kacmarczyk (NYU) for both creative and technical input into the work described. We wish to thank Boris Hayete and Bruce W. Church for helpful discussions and insights.