
The authors have declared that no competing interests exist.

Conceived and designed the experiments: AFV JR FM JRB. Performed the experiments: AFV. Analyzed the data: AFV JRB. Contributed reagents/materials/analysis tools: JR FM. Wrote the paper: AFV JRB.

The prediction of links among variables from a given dataset is a task referred to as network inference or reverse engineering. It is an open problem in bioinformatics and systems biology, as well as in other areas of science. Information theory, which uses concepts such as mutual information, provides a rigorous framework for addressing it. While a number of information-theoretic methods are already available, most of them focus on a particular type of problem, introducing assumptions that limit their generality. Furthermore, many of these methods lack a publicly available implementation. Here we present MIDER, a method for inferring network structures with information-theoretic concepts. It consists of two steps: first, it provides a representation of the network in which the distance between nodes indicates their statistical closeness. Second, it refines the prediction of the existing links, distinguishing between direct and indirect interactions and assigning directionality. The method accepts as input time-series data describing quantitative features of the network nodes (such as concentrations, if the nodes are chemical species). It takes time delays between variables into account, and allows choosing among several definitions and normalizations of mutual information. It is general purpose: it may be applied to any type of network, cellular or otherwise. A Matlab implementation including source code and data is freely available (

Reverse engineering a network consists of inferring the structure of interactions between its components from a set of data. This problem appears in many different contexts, such as chemistry (construction of chemical reaction mechanisms), biology (inferring gene regulatory networks), engineering (system identification), or social sciences

Reviews of network inference methods typically find large discrepancies among the predictions of different algorithms, and usually conclude that there is no single best method for all problems

The present work addresses the problem of recovering the structure of a network from the available data in its most general form. This entails that no assumptions about the underlying structure are made, and previous knowledge is not taken into account. Interactions should be deduced only from the statistical features of the data, without resorting to biological intuition. To reach this goal, many methods have exploited the analytical tools provided by information theory. The fundamental concept of information theory is entropy, which was introduced by Shannon

The relative entropy, which is also known as Kullback–Leibler divergence or information gain, is a measure of the distance between two distributions. It is defined as
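In standard notation, for two discrete probability distributions p and q over the same alphabet, the textbook definition reads:

```latex
D_{\mathrm{KL}}(p \,\|\, q) \;=\; \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
```

It is non-negative and equals zero if and only if p = q; since it is not symmetric in its arguments, it is a measure of distance only in a loose sense, not a metric.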

The mutual information measures the amount of information that one random variable contains about another. In other words, it is the reduction in the uncertainty of one variable due to the knowledge of another. Since it does not assume any property of the dependence between variables, such as linearity or continuity, it is more general than linear measures such as the correlation coefficient, and is able to detect more interactions
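Since mutual information is the Kullback–Leibler divergence between the joint distribution and the product of the marginals, it vanishes only under independence. The following toy sketch (a naive equal-width binned estimator, not the adaptive one used by MIDER) illustrates the contrast with correlation: for y close to x squared, with x symmetric about zero, the correlation coefficient is near zero while the estimated mutual information is clearly positive.

```python
import math, random

def mi_hist(x, y, bins=8):
    """Naive histogram estimate of the mutual information I(X;Y), in nats."""
    n = len(x)
    xmin, xmax, ymin, ymax = min(x), max(x), min(y), max(y)
    bx = lambda v: min(int((v - xmin) / (xmax - xmin + 1e-12) * bins), bins - 1)
    by = lambda v: min(int((v - ymin) / (ymax - ymin + 1e-12) * bins), bins - 1)
    pxy, px, py = {}, {}, {}
    for xi, yi in zip(x, y):
        i, j = bx(xi), by(yi)
        pxy[(i, j)] = pxy.get((i, j), 0) + 1
        px[i] = px.get(i, 0) + 1
        py[j] = py.get(j, 0) + 1
    # sum over occupied cells of p(i,j) * log( p(i,j) / (p(i) p(j)) )
    return sum(c / n * math.log(c * n / (px[i] * py[j]))
               for (i, j), c in pxy.items())

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

random.seed(0)
x = [random.uniform(-1, 1) for _ in range(5000)]
y = [xi ** 2 + 0.05 * random.gauss(0, 1) for xi in x]  # nonlinear, symmetric

print(f"Pearson r  = {pearson(x, y):+.3f}")    # near zero
print(f"I(X;Y) est = {mi_hist(x, y):.3f} nats")  # clearly positive
```

A linear measure sees no association in this data, while the binned MI estimate immediately does.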

In the next section (Methods) we present a methodology and software toolbox called MIDER (Mutual Information Distance and Entropy Reduction). MIDER seeks to achieve high precision on small and medium-scale networks of any kind, cellular or otherwise, although it can also be applied to large-scale problems. It is designed with the aim of accurately distinguishing between direct and indirect interactions, thus minimizing the number of false positives. In the

The MIDER workflow is shown in

A recent review on information-theoretic network inference methods can be found in

Thus it defines a matrix of distances between species, and by applying Multidimensional Scaling (MDS) it obtains a two-dimensional map that serves as an indication of species connectivity. EMC was designed as a generalization of a previous method called CMC, Correlation Metric Construction

A different way of distinguishing direct from indirect interactions is carried out by the ARACNE method

Context Likelihood of Relatedness, CLR

The Minimum Redundancy Maximum Relevance method (MRMR) introduced in

The rationale is to rank direct interactions better than indirect interactions. MRNET was implemented in the R package MINET

Another way of discriminating between direct and indirect interactions is given by MI3, three-way mutual information

Finally, some authors have proposed to redefine the concept of entropy in order to make it more suited for inferring networks where long-range interactions exist.

This subsection explains how MIDER (1) estimates mutual information from data using an adaptive partitioning algorithm, (2) provides several normalizations of the mutual information, and (3) plots three-dimensional landscapes of the mutual information pairs as a function of the time lag between variables.

Mutual information can be either analytically calculated or estimated from experimental data. For reverse engineering purposes, knowledge of the underlying system equations cannot be assumed; therefore it is necessary to estimate mutual information from the available datasets. This is far from trivial, and several algorithms have been proposed for this task. The simplest one is a naive estimation, where the data is binned into equally sized intervals and an indicator function

This simple approach gives good results if the number of data points is large; otherwise the finite-size effects lead to overestimation of the mutual information
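This overestimation is easy to reproduce with a toy experiment (again using a naive equal-width binning estimator, not MIDER's adaptive one): for two independent uniform samples, whose true mutual information is zero, the plug-in estimate is positive and grows with the number of bins.

```python
import math, random

def mi_hist(x, y, bins):
    """Naive equal-width binning estimate of I(X;Y), in nats."""
    n = len(x)
    def idx(v, lo, hi):
        return min(int((v - lo) / (hi - lo + 1e-12) * bins), bins - 1)
    xmin, xmax, ymin, ymax = min(x), max(x), min(y), max(y)
    pxy, px, py = {}, {}, {}
    for a, b in zip(x, y):
        i, j = idx(a, xmin, xmax), idx(b, ymin, ymax)
        pxy[(i, j)] = pxy.get((i, j), 0) + 1
        px[i] = px.get(i, 0) + 1
        py[j] = py.get(j, 0) + 1
    return sum(c / n * math.log(c * n / (px[i] * py[j]))
               for (i, j), c in pxy.items())

random.seed(1)
x = [random.random() for _ in range(100)]  # independent samples,
y = [random.random() for _ in range(100)]  # so the true MI is zero
for bins in (4, 8, 16):
    print(bins, round(mi_hist(x, y, bins), 3))  # bias grows with bins
```

With only 100 data points, the finite-size bias inflates the estimate more and more as the partition is refined, which is precisely why adaptive partitioning is preferable for short datasets.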

These reasons support the choice of the aforementioned adaptive algorithm

A characteristic of mutual information is that its range of values is in principle unknown. A number of normalizations have been proposed in the literature. An early one was the definition by Linfoot

In

The distance measure is then defined as

Studholme et al

MIDER lets the user choose any of these normalizations or the standard definition of mutual information. While normalization changes the numerical range of the distance matrix, in practice its effect on the reconstructed network is very small.
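For illustration, two normalizations commonly found in the literature, the symmetric uncertainty 2*I(X;Y)/(H(X)+H(Y)) and the entropy-based distance 1 - I(X;Y)/H(X,Y), can be computed from plug-in entropies as in the following toy sketch on discrete data (these are examples of the kind of normalization discussed, not necessarily the exact set offered by MIDER):

```python
import math
from collections import Counter

def H(seq):
    """Plug-in Shannon entropy (bits) of a discrete sequence."""
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())

# toy discrete observations of two variables
x = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0]
y = [0, 0, 1, 1, 2, 0, 0, 1, 2, 1]

hx, hy = H(x), H(y)
hxy = H(list(zip(x, y)))           # joint entropy H(X,Y)
mi = hx + hy - hxy                 # standard mutual information
sym_unc = 2 * mi / (hx + hy)       # symmetric uncertainty, in [0, 1]
d = 1 - mi / hxy                   # entropy-normalized distance, in [0, 1]
print(round(mi, 3), round(sym_unc, 3), round(d, 3))
```

Both quantities map the unbounded mutual information into [0, 1], which is what makes them usable as entries of a distance matrix.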

Furthermore, the user can choose between the classic definition of mutual information (

MIDER generates 3D plots of the mutual information between every variable and all the others, for every time lag considered. They are a graphical representation of the time-varying dependency between variables. To make visualization easier, the mutual information is normalized to the range [0,1] according to

One of the MIDER outputs, shown for Benchmark B2: a 3D plot of the mutual information between a variable (X3) and the rest, for different time lags between variables.

MIDER uses mutual information

If time series data is available, the mutual information between two variables
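The idea can be sketched as follows (a toy example with an assumed delay of 3 samples, using a simple plug-in estimate on discrete data rather than MIDER's estimator): the lag at which the mutual information peaks suggests the delay of the interaction.

```python
import math, random
from collections import Counter

def mi(xs, ys):
    """Plug-in mutual information (bits) of two discrete sequences."""
    n = len(xs)
    def H(seq):
        return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())
    return H(xs) + H(ys) - H(list(zip(xs, ys)))

# toy time series: y follows x with a delay of 3 samples (assumed example)
random.seed(2)
x = [random.randrange(4) for _ in range(500)]
y = [0, 0, 0] + x[:-3]

# MI between x(t) and y(t + tau) for candidate lags tau
scores = {tau: mi(x[:len(x) - tau], y[tau:]) for tau in range(6)}
best = max(scores, key=scores.get)
print(best)  # the lag that maximizes the mutual information
```

For this construction the score is maximal at lag 3, recovering the delay built into the series; in real data the peak is blurred by noise, which is why a range of lags must be scanned.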

MIDER implements an entropy reduction procedure that is inspired by the one proposed in

The numerical values, such as the upper limit of 0.2, were empirically chosen; this tuning was carried out with the datasets used in the

Note that other measures of entropic reduction have been proposed elsewhere for similar tasks, and could also be used at this step. For example, in the area of machine learning the concept of variable relevance, defined as the relative reduction of uncertainty of one variable due to the knowledge of another, was formalized

MIDER implements the entropy reduction step according to

There are several ways of estimating the strength of an interaction between two variables

Links between variables are plotted as arrows, which represent directional (causal) relationships. Inferring causality is a subtle matter, with deep theoretical implications, and currently an open problem in biological applications

For every pair of variables (

The performance of MIDER has been evaluated with the seven benchmark problems listed in

Number | Description | Publication | Type | Data | Data points | Variables
B1 | Glycolytic pathway | | Metabolic | Real | 57 | 10
B2 | 8 species mechanism | | Metabolic | Simulated | 250 | 8
B3 | 4 species mechanism | | Metabolic | Simulated | 100 | 4
B4 | IRMA benchmark | | Genetic regulatory | Real | 125 | 5
B5 | MAPK cascade | | Protein signaling | Simulated | 210 | 12
B6 | DREAM4 10 genes–1 | | Genetic regulatory | Simulated | 105 | 10
B7 | DREAM4 100 genes–1 | | Genetic regulatory | Simulated | 210 | 100

To carry out objective comparisons between inference methods it is necessary to have quantitative measures of their performance. Two common measures are precision (

Other common measures are the true positive rate (
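With TP, FP, FN, and TN counts obtained by comparing a predicted edge set against the true network, these measures can be computed as in the following sketch (the node names and edge lists are illustrative):

```python
from itertools import combinations

def inference_metrics(predicted, actual, all_pairs):
    """Precision, recall (= true positive rate), and false positive rate
    from sets of undirected links."""
    tp = len(predicted & actual)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    tn = len(all_pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, fpr

nodes = ["A", "B", "C", "D"]
all_pairs = {frozenset(p) for p in combinations(nodes, 2)}  # 6 possible links
actual = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("C", "D")]}
predicted = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("A", "D")]}
print(inference_metrics(predicted, actual, all_pairs))
# precision = 2/3, recall = 2/3, FPR = 1/3
```

Undirected links are stored as frozensets so that (A, B) and (B, A) count as the same edge, matching the evaluation convention used below, where direction is ignored.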

Most inference methods have some tunable parameter that represents the minimum strength that an interaction must have in order to be considered real, and not an artifact of the data. By changing this threshold and recording the different outcomes of a method, one can plot either Precision-Recall curves (PR), which show how

Precision-Recall curves provide quantitative measures of a method's performance for a variety of settings. However, they do not give information about what performance is to be expected with the method's

While MIDER and TD-ARACNE infer interaction direction, ARACNE, MRNET, and CLR do not. To enable direct comparison of these methods, we do not take direction into account when classifying a link as true or false.

Previous evaluations of network inference methods, such as the ones carried out in the DREAM initiative, have stressed the importance of the “wisdom of crowds”
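One simple way to build such a community prediction is to average each link's rank across methods, as popularized by the DREAM challenges. The sketch below illustrates this rank-averaging idea under that assumption (the scores are illustrative and the paper's exact aggregation rule is not restated here):

```python
# hypothetical confidence scores assigned to three candidate links
# by three inference methods (all values illustrative)
scores = {
    "aracne": {"AB": 0.9, "BC": 0.2, "AC": 0.5},
    "clr":    {"AB": 0.7, "BC": 0.6, "AC": 0.1},
    "mrnet":  {"AB": 0.8, "BC": 0.4, "AC": 0.3},
}

def rank(d):
    """Map each link to its rank (1 = strongest) within one method."""
    order = sorted(d, key=d.get, reverse=True)
    return {e: i + 1 for i, e in enumerate(order)}

edges = next(iter(scores.values())).keys()
avg_rank = {e: sum(rank(m)[e] for m in scores.values()) / len(scores)
            for e in edges}
community = sorted(avg_rank, key=avg_rank.get)  # lowest average rank first
print(community)
```

Ranking before averaging makes the aggregation insensitive to the different numerical scales the individual methods use for their confidence scores.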

PR curves (recall on the horizontal axis, precision on the vertical axis) of all the benchmarks (B1–B7) for five network inference methods (ARACNE, CLR, MRNET, TD-ARACNE, and MIDER) and for the community prediction. Solid lines and small dots correspond to the (P,R) values obtained by changing the threshold for detecting interactions. Large square points correspond to the (P,R) values obtained with the default (out-of-the-box) settings of each method.

The color maps show precision (left panel) and recall (central panel) achieved by each method and for each benchmark with its default settings, as well as the area under the precision-recall curve (AUPR, right panel). Numerical values are in the range [0,1], and are represented in colors according to the scale on the right (green = good, blue = bad).

It should be noted that the community prediction turned out to be comparable to the best result obtained by any single method in six of the seven benchmarks. In other words, the community prediction lies on the Pareto front of non-dominated solutions in those cases: no method has simultaneously better precision and better recall than the community. Thus, the community prediction provides an optimal trade-off between precision and recall for a given weight of

These results show, on the one hand, that MIDER is a good option for network inference in a variety of settings, and on the other hand, that it is advantageous to take into account the outcomes of several algorithms. The following subsections describe the benchmark problems and analyze the results in more detail.

As a first example we considered the first steps of the glycolytic pathway, which are depicted in the upper left panel of

First reaction steps of glycolysis.

As a second example we chose a simulated metabolic pathway, the chemical reaction network represented by

Reaction chain with 8 species.

Next we considered the following small linear chain of reactions,

Reaction chain with 4 species.

IRMA (In vivo Reverse-engineering and Modeling Assessment)

IRMA.

It must be noted that all five methods predict a link between SWI5 and GAL4, which does not exist in reality (SWI5 is linked to GAL4 only indirectly, through GAL80). GAL4 and GAL80 form a complex, and it was already acknowledged in the original publication

The classic Mitogen-Activated Protein Kinase model presented by Huang & Ferrell

MAPK cascade.

Finally, we tested the methods using benchmark problems generated for the DREAM4 in silico network challenge (

The performance of all five methods compared in this section was relatively modest for the network of size 10. Precision values ranged from

For the network of size 100, all methods obtained poor results. We found a clear distinction between methods that favored precision (ARACNE, TD-ARACNE, MIDER) and methods that favored recall (CLR, MRNET).

The present work has introduced a methodology for network inference called MIDER. It is based on information theoretic concepts, and combines the use of mutual information-based distances and entropy reduction. It outputs a visual representation of the inferred system, as well as estimates of the strength and directionality of the interactions, and time-lagged plots of the mutual information between variables. Among other options, it offers the possibility of choosing from different normalizations of the mutual information, and even a nonextensive version.

One of the strengths of MIDER is its generality: it makes no assumptions about the characteristics of the network, which makes it suitable for inferring connections in systems where little is known. Indeed, the only necessary input is the experimental data. Another advantage of the method is that, although it has some tunable parameters that can be modified if desired, it requires no expertise from the user. Due to the adaptive nature of its subroutines, its default settings provide good results for a variety of problems. It has been tested on seven different benchmarks including metabolic, gene regulatory, and protein signaling networks, and has performed well when compared to other state-of-the-art techniques.

Regarding its theoretical foundations, a strength of MIDER is its ability to detect multiple interactions while avoiding false positives. It ranked first in precision among the tested methods in five of the seven benchmark problems considered, and achieved precision scores close to the best performer in the other two. Since every reverse engineering method faces a trade-off between precision and recall, this emphasis on precision entails that MIDER can yield low recall for large-scale problems. However, for smaller-scale networks (up to ten nodes in our tests) it manages to obtain simultaneously high precision and high recall.

The main hurdle to surmount in order to accurately discard false positives is the need for large amounts of data, which are required to carry out more than three entropy reduction rounds. This limitation stems from the difficulty of reliably estimating high-dimensional joint entropies (i.e., of four or more species), and is hence shared by all information-theoretic methods. For networks with a large number of components, performing more than three entropy reduction rounds may also involve high computational costs, particularly if many possible time lags are taken into account. To alleviate this burden, MIDER estimates the mutual information using an algorithm that is much faster than the ones used by some preceding methods. Furthermore, since the related calculations are carried out in arrays and are amenable to parallelization, this limitation can be easily overcome. As a future development we plan to implement a parallel version of MIDER.

We hope that MIDER will be a valuable addition to the existing methodologies for network inference, either by itself or in combination with other algorithms to create a community prediction. To facilitate its use, we provide the source code along with the datasets required to reproduce the results reported in this paper. We envision that it will be particularly useful for the community of Matlab users; to the best of the authors' knowledge, this is the first time that a Matlab implementation of a comparable method has been made available.


The authors thank Alfonso Albano for kindly providing an implementation of the algorithm in