RegnANN: Reverse Engineering Gene Networks Using Artificial Neural Networks

RegnANN is a novel method for reverse engineering gene networks based on an ensemble of multilayer perceptrons. The algorithm builds a regressor for each gene in the network, estimating its neighborhood independently. The overall network is obtained by joining all the neighborhoods. RegnANN makes no assumptions about the nature of the relationships between the variables, potentially capturing high-order and nonlinear dependencies between expression patterns. The evaluation focuses on synthetic data mimicking plausible submodules of larger networks and on biological data consisting of submodules of Escherichia coli. We consider Barabasi and Erdös-Rényi topologies together with two methods for data generation. We verify the effect of factors such as network size and amount of data on the accuracy of the inference algorithm. The accuracy scores obtained with RegnANN are methodically compared with the performance of three reference algorithms: ARACNE, CLR and KELLER. Our evaluation indicates that RegnANN compares favorably with the inference methods tested. The robustness of RegnANN, its ability to discover second order correlations and the agreement between results obtained with this new method on both synthetic and biological data are promising and stimulate its application to a wider range of problems.


INTRODUCTION
Since the first examples dating back to the early seventies (Glass and Kauffman (1973)), the challenge of reconstructing the links among genes in a regulatory network starting from their expression signals has been tackled by several laboratories worldwide. These initial efforts have originated a number of related publications that has been growing exponentially in the last few years.
The inference methods generally employed are of very different nature, ranging from deterministic, e.g., systems of differential equations (Bansal et al. (2007)) and Groebner bases (Dimitrova et al. (2007)), to stochastic approaches, e.g., Boolean (Kauffman (1993)) or Bayesian (Friedman et al. (2000)) algorithms. Such approaches may also start from different types of gene expression data: time-course or steady states. Furthermore, the detail and the complexity of the considered network can also vary, as the links may carry information about the direction of the relation (directed graph) and a weight may be associated to the strength of each link (weighted directed graph) (Markowetz and Spang (2007); Karlebach and Shamir (2008)). Generally, the reconstruction accuracy is far from being optimal in many situations, with the presence of several pitfalls related to both the methods and the available data (He et al. (2009)).
Citing Baralla et al. (2009), "Inferring gene networks is a daunting task", not only in terms of devising an effective algorithm, but also in terms of quantitatively interpreting the obtained results. Only recently have efforts been carried out towards an objective comparison of network inference methods, also highlighting their limitations (Krishnan et al. (2007); Altay and Emmert-Streib (2010); Marbach et al. (2010)).
This work compares four network reverse engineering methods, first in a controlled setting with synthetic data and then in a biological setup by analysing transcriptional subnetworks of Escherichia coli. In order to simplify our comparative evaluation, we will only consider the underlying topology, thus neglecting both weight and direction of the links among the genes. In doing so, we confine the analysis of the reconstructed network to the binary existence or non-existence of an edge. The general performance of the network inference task is evaluated in terms of the Matthews Correlation Coefficient (MCC, Matthews (1975); see Supplementary Material for details). MCC is becoming the measure of choice in many application fields of machine learning and bioinformatics: it is one of the best methods for summarizing into a single value the confusion matrix of a binary classification task. Recently it has also been used for comparing network topologies (Stokic et al. (2009)).
In this paper we introduce a novel inference method called Reverse Engineering Gene Networks with Artificial Neural Networks (RegnANN). This approach is based on an ensemble of multilayer perceptrons trained using steady state data. Its performance is compared with that of top-scoring methods such as KELLER (Song et al. (2009)), ARACNE (Margolin et al. (2006)) and CLR (Faith et al. (2007)) while assessing possible sources of instability. To improve the general efficiency of RegnANN we implement the algorithm using GPGPU (Lahabar et al. (2008)). The extensive evaluation on both synthetic and biological data indicates that the algorithms tested suffer from instability and variability issues with regard to the reconstruction of the network topology. This instability makes it objectively very hard to establish which method performs best. Nevertheless, RegnANN shows MCC scores that compare very favorably with all the other inference methods tested.

RegnANN: network inference using ANN
To infer gene regulatory networks we adopt an ensemble of feed-forward multilayer perceptrons (Bishop (1995)) trained using the back-propagation algorithm. Each member of the ensemble is essentially a multi-variable regressor (one to many) trained using an input expression matrix to learn the relationships (correlations) among a target gene and all the other genes in the network. We determine the interactions among genes separately and then join the information to form the overall network. From each row of the gene expression matrix (footnote 1) we build a set of input and output patterns used to train a selected multilayer perceptron. Each input pattern corresponds to the expression value for the selected gene of interest. The output pattern is the row-vector of expression values for all the other genes for the given row in the gene expression matrix (Figure 1). By cycling through all the rows in the matrix, each regressor in the ensemble is trained to learn the correlations among one gene and all the others. Repeating the same procedure for all the columns in the expression matrix, the ensemble of multi-variable regressors is trained to learn the correlations among all the genes.
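The pattern construction described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `build_patterns` and the toy matrix are hypothetical, and the expression matrix is assumed to store one expression profile per row and one gene per column.

```python
import numpy as np

def build_patterns(expr, gene):
    """Return (inputs, targets) for the regressor of one target gene.

    expr : array of shape (M, N), M expression profiles for N genes.
    gene : column index of the gene of interest (hypothetical argument name).
    """
    expr = np.asarray(expr, dtype=float)
    inputs = expr[:, [gene]]                 # shape (M, 1): value of the selected gene
    targets = np.delete(expr, gene, axis=1)  # shape (M, N - 1): all the other genes
    return inputs, targets

# Toy matrix with M = 2 profiles and N = 3 genes.
expr = np.array([[0.1, -0.2, 0.3],
                 [0.4, 0.5, -0.6]])
x, y = build_patterns(expr, gene=1)
```

Each row of `x` paired with the same row of `y` is one training pattern; calling the function once per column yields the training sets for the whole ensemble.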
The procedure of determining separately the interactions among genes is very similar to the one presented in Song et al. (2009), where the authors propose to estimate the neighborhood of each gene (the correlations among one gene and all the others) independently and then to join these neighborhoods to form the overall network, thus reducing the problem to a set of identical atomic optimizations (Section 2.2).
Here we build N multilayer perceptrons, one for each of the N genes in the network, each with one input node, one layer of hidden nodes and one layer of N − 1 output nodes. The input node takes the expression value of the selected gene rescaled in [−1, 1]. The number of hidden nodes is set empirically to the square root of the number of inputs times the number of outputs, resulting in √(N − 1). The activation function is the hyperbolic tangent, which provides output values in the range [−1, 1], thus making the output values interpretable in terms of positive correlation (+1), anticorrelation (−1) and no correlation (0). The other parameters used to train each multilayer perceptron are as follows: learning rate equal to 0.01; momentum equal to 0.1; learning epochs equal to 10000; bias equal to 0 (footnote 2).
Finally, the topology of the gene regulatory network is obtained by applying a second procedure. The correlation of each gene with all the others is extracted by passing a purposely made test pattern to the regressor: considering separately each multilayer perceptron in the ensemble, a value of 1 is passed to its input neuron and the resulting output values are recorded. In this way, the correlation between the corresponding gene and all the others is obtained as a vector of values in [−1, 1]. By cycling through all the members of the regression system, we obtain the adjacency matrix of the sought gene network. It is important to note that this procedure does not allow the discovery of gene self-correlation (regulation) patterns, but only of correlation patterns among different genes. Moreover, the algorithm proposed here cannot estimate future values because it is not a predictor, as in the case of GRNN (Specht (1993)): instead, it models static correlations between genes. As in Song et al. (2009), it is possible to extend the regression system to take into account dynamic rewiring of the topology, but this is beyond the scope of the present work.
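A compact NumPy sketch of the training and probing steps may help to fix ideas. It is not the GPGPU implementation used in this work. Several details are assumptions made only for illustration: full-batch gradient descent on a squared error, weights initialized uniformly in [−0.5, 0.5], a reduced number of epochs, and the hypothetical names `train_regressor` and `regnann_scores`. The architecture (one input, roughly √(N − 1) hidden nodes, N − 1 outputs, tanh activations, no bias, learning rate 0.01, momentum 0.1) follows the description above.

```python
import numpy as np

def train_regressor(x, y, epochs=1000, lr=0.01, momentum=0.1, seed=0):
    """Train a 1 -> H -> (N - 1) tanh perceptron without bias terms."""
    rng = np.random.default_rng(seed)
    n_out = y.shape[1]
    n_hidden = max(1, int(round(np.sqrt(n_out))))    # sqrt(inputs * outputs), inputs = 1
    w1 = rng.uniform(-0.5, 0.5, size=(1, n_hidden))  # assumed initialization range
    w2 = rng.uniform(-0.5, 0.5, size=(n_hidden, n_out))
    v1, v2 = np.zeros_like(w1), np.zeros_like(w2)
    for _ in range(epochs):
        h = np.tanh(x @ w1)                          # hidden activations
        out = np.tanh(h @ w2)                        # network outputs
        d_out = (out - y) * (1.0 - out ** 2)         # output deltas (squared error)
        d_hid = (d_out @ w2.T) * (1.0 - h ** 2)      # back-propagated hidden deltas
        v2 = momentum * v2 - lr * (h.T @ d_out)      # momentum updates
        v1 = momentum * v1 - lr * (x.T @ d_hid)
        w2 += v2
        w1 += v1
    return w1, w2

def regnann_scores(expr, **train_args):
    """Row g holds the outputs of gene g's regressor when its input is set to 1."""
    expr = np.asarray(expr, dtype=float)
    n = expr.shape[1]
    scores = np.zeros((n, n))                        # diagonal stays 0: no self-correlation
    probe = np.ones((1, 1))                          # the purposely made test pattern
    for g in range(n):
        x = expr[:, [g]]
        y = np.delete(expr, g, axis=1)
        w1, w2 = train_regressor(x, y, **train_args)
        out = np.tanh(np.tanh(probe @ w1) @ w2).ravel()
        scores[g, np.arange(n) != g] = out
    return scores

# Toy data: 40 profiles of 5 genes in [-1, 1]; gene 1 is an exact copy of gene 0.
rng = np.random.default_rng(1)
expr = rng.uniform(-1.0, 1.0, size=(40, 5))
expr[:, 1] = expr[:, 0]
scores = regnann_scores(expr)
```

The resulting score matrix is in general not symmetric, since entry (i, j) comes from the regressor of gene i and entry (j, i) from that of gene j; a symmetrization and thresholding step is still needed to obtain a binary topology.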
To improve the general efficiency of the algorithm and thus allow a systematic comparison of its performance with the other gene network reverse engineering methods tested (Subsection 2.2), we implemented the ANN based regression system using the GPGPU programming paradigm (Lahabar et al. (2008); Scanzio et al. (2010)).

Alternative inference methods
As reference methods we select three alternative algorithms widely used in literature: ARACNE, CLR and KELLER.
KELLER: it is a kernel-reweighted logistic regression method (Song et al. (2009)) introduced for reverse engineering the dynamic interactions between genes based on the time series of their expression values. It estimates the neighborhood of each gene separately and then joins the neighborhoods to form the overall network. The approach aims at reducing the network inference problem to a set of identical atomic optimizations. KELLER makes use of the l1-regularized logistic regression algorithm and operates by modeling the distribution of interactions between genes as a binary pair-wise Markov Random Field. The method has been applied to reverse engineer genome-wide interactions taking place during the life cycle of Drosophila melanogaster. Although KELLER has been developed to uncover dynamic rewiring of gene transcription networks (e.g., dynamic changes in their topology), here we consider constant network topology for a given gene expression matrix. In this work we make use of the reference implementation of the algorithm provided in Song et al. (2009).
Footnote 1: In this work we consider gene expression matrices of dimension M × N: N genes whose expression levels are recorded M times.
Footnote 2: These values are evaluated empirically during preliminary tests on synthetic data.
ARACNE: it is a general method able to address a wide range of network deconvolution problems, from transcriptional (Margolin et al. (2006)) to metabolic networks (Nemenman et al. (2007)), that was originally designed to scale up to the complexity of regulatory networks in mammalian cells. The method makes use of an information theoretic approach to eliminate the majority of indirect interactions inferred by co-expression methods. ARACNE removes the vast majority of indirect candidate interactions using a well-known information theoretic property: the data processing inequality (Cover and Thomas (1991)). In this work we use the reference implementation of the algorithm provided in Meyer et al. (2008) with the default value for the data processing inequality tolerance parameter.
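The pruning step based on the data processing inequality can be illustrated as follows. This is a simplified sketch and not the reference implementation used in this work: it assumes a precomputed symmetric mutual information matrix, and the function name `apply_dpi` and the multiplicative tolerance `tol` are hypothetical (implementations differ in how the tolerance enters the comparison).

```python
import numpy as np
from itertools import combinations

def apply_dpi(mi, tol=0.0):
    """Zero the weakest edge of every gene triplet when it violates the DPI."""
    mi = np.asarray(mi, dtype=float)
    n = mi.shape[0]
    keep = np.ones((n, n), dtype=bool)
    for i, j, k in combinations(range(n), 3):
        edges = [((i, j), mi[i, j]), ((i, k), mi[i, k]), ((j, k), mi[j, k])]
        (a, b), weakest = min(edges, key=lambda e: e[1])
        second = sorted(w for _, w in edges)[1]   # smaller of the two remaining edges
        if weakest < second * (1.0 - tol):
            keep[a, b] = keep[b, a] = False       # flagged as an indirect interaction
    pruned = np.where(keep, mi, 0.0)              # decisions always use the original MI values
    np.fill_diagonal(pruned, 0.0)
    return pruned

# Chain 0 - 1 - 2: the weak 0 - 2 link is explained by the two stronger ones.
mi = np.array([[0.0, 0.8, 0.3],
               [0.8, 0.0, 0.7],
               [0.3, 0.7, 0.0]])
pruned = apply_dpi(mi)
```

In the toy example only the 0-2 entry is removed, while the two direct links of the chain keep their mutual information values.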
CLR: it is an extension of the relevance networks class of algorithms (Faith et al. (2007)), which predicts regulations between transcription factors and genes making use of the mutual information score. CLR proposes an adaptive background correction step that is added to the estimation of mutual information. For each gene, the statistical likelihood of the mutual information score is computed within its network context. Then, for each transcription factor-target gene pair, the mutual information score is compared to the context likelihood of both the transcription factor and the target gene, and turned into a z-score. We adopt the reference implementation of the algorithm provided in Meyer et al. (2008).
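The background correction can be sketched as below, assuming a precomputed symmetric mutual information matrix. The sketch follows the commonly reported form of the CLR score, in which the two per-gene z-scores are clipped at zero and combined as the square root of the sum of their squares; the function name `clr_scores` and the toy matrix are hypothetical, and the reference implementation may differ in details such as the treatment of the diagonal.

```python
import numpy as np

def clr_scores(mi):
    """Combine, for each gene pair, the z-scores of MI within both gene contexts."""
    mi = np.asarray(mi, dtype=float)
    mean = mi.mean(axis=1, keepdims=True)     # background of each gene (row)
    std = mi.std(axis=1, keepdims=True)
    std[std == 0] = 1.0                       # guard against constant rows
    z = np.maximum(0.0, (mi - mean) / std)    # z[i, j]: MI(i, j) against gene i's context
    return np.sqrt(z ** 2 + z.T ** 2)         # z.T[i, j]: MI(i, j) against gene j's context

# Toy MI matrix with one clearly dominant pair (genes 0 and 1).
mi = np.array([[0.0, 0.9, 0.1, 0.1],
               [0.9, 0.0, 0.1, 0.2],
               [0.1, 0.1, 0.0, 0.1],
               [0.1, 0.2, 0.1, 0.0]])
clr = clr_scores(mi)
```

The score matrix is symmetric and non-negative, and in this toy case the pair (0, 1) receives the highest score.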

Experimental protocol
We are interested in comparing the performance of the selected reverse engineering methods in inferring the underlying topology of regulatory networks. As proposed in Song et al. (2009), we focus on the estimation of the interaction structures between genes, rather than the strength of these interactions. The inferred adjacency matrix is symmetric and discretized with values in {0, 1} by thresholding.
The binarization of the inferred network obtained with RegnANN is achieved by using a threshold value of 0.5. In the case of KELLER, the reference implementation (Song et al. (2009)) returns a symmetric and discrete (with values in {0, 1}) adjacency matrix; binarization is obtained by rounding values bigger than 10^-3 to 1. Results obtained with ARACNE are discretized as in the case of KELLER. Usually, the cutoff value for the mutual information is estimated for each data set separately using a significance measure (e.g., the F-score (Altay and Emmert-Streib (2010))) or by building a Precision-Recall curve and selecting the desired threshold value (Margolin et al. (2006)). Here, the threshold value is kept constant to avoid the introduction of a selection bias in the outcome of the ARACNE algorithm. The same procedure is applied to CLR (threshold value of 10^-3).
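The thresholding step can be illustrated with a short helper. The text does not specify how a possibly asymmetric score matrix is made symmetric, nor whether negative scores count as links, so the two choices below (absolute values, then the larger of the two directions) are assumptions; the name `binarize` is hypothetical.

```python
import numpy as np

def binarize(scores, threshold):
    """Turn a real-valued score matrix into a symmetric {0, 1} adjacency matrix."""
    scores = np.abs(np.asarray(scores, dtype=float))  # assumption: anticorrelation also counts
    sym = np.maximum(scores, scores.T)                # assumption: keep the stronger direction
    adj = (sym > threshold).astype(int)
    np.fill_diagonal(adj, 0)                          # no self-links
    return adj

scores = np.array([[0.0, 0.8, -0.1],
                   [0.3, 0.0, -0.7],
                   [0.2, 0.1, 0.0]])
adj = binarize(scores, threshold=0.5)
```

With the thresholds reported above, the same helper would be called with 0.5 for RegnANN-like scores and with 1e-3 for mutual information based scores.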
The accuracy (in terms of MCC) of the inference methods is first evaluated on synthetic data (Section 3) by varying the topology of the network, its size, the amount of data available, the method adopted to synthesize the data and the method adopted to normalize the data prior to network inference (see Supplementary Material for details). Methodically, we vary one parameter at a time and then measure the performance of the systems as the mean of 10 randomly initialized runs. For each run, the network topology is randomly generated with the desired number of genes (N), the expression profiles (the data) are randomly generated the required number of times (M), the selected normalization method is applied and the MCC values for the applied reverse engineering method are recorded. The error of the measurement is expressed as twice the standard deviation of the 10 independent runs.
Finally, the performance of the four network inference algorithms is tested on 7 selected gene network modules of Escherichia coli (Peregrin-Alvarez et al. (2009)). While ARACNE, CLR and KELLER are deterministic algorithms, RegnANN may produce different results depending on the random initialization of the weights in the ensemble of multilayer perceptrons. Thus, in order to smooth out possible local minima, we adopted a majority voting scheme: for each network module, the RegnANN algorithm is applied 10 times and the inferred adjacency matrices are accumulated. The final topology is obtained by selecting those links that appeared with a frequency higher than 7 (out of 10). The entire procedure is repeated 10 times; the final prediction is estimated as the mean, and the associated error as twice the standard deviation, of the 10 independent runs.
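The majority voting step amounts to counting, for every candidate link, the number of runs in which it was inferred. A minimal sketch follows; the function name `majority_vote` and the toy runs are hypothetical.

```python
import numpy as np

def majority_vote(adjacencies, min_votes=7):
    """Keep the links that appear in more than `min_votes` of the binary matrices."""
    votes = np.sum(np.asarray(adjacencies), axis=0)   # accumulate the inferred matrices
    return (votes > min_votes).astype(int)

# Ten toy runs on a 3-gene module.
runs = []
for r in range(10):
    adj = np.zeros((3, 3), dtype=int)
    if r < 9:                        # link 0-1 is inferred in 9 of the 10 runs
        adj[0, 1] = adj[1, 0] = 1
    if r < 7:                        # link 0-2 is inferred in exactly 7 runs
        adj[0, 2] = adj[2, 0] = 1
    runs.append(adj)
consensus = majority_vote(runs)
```

Because the frequency must be strictly higher than 7, the link found in exactly 7 runs is discarded while the one found in 9 runs is kept.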

DATA
Synthetic data: we benchmark the reverse engineering algorithms considered here using both synthetic and biological data. Synthetic data are obtained considering two different network topologies: Barabasi-Albert (Barabasi and Albert (1999)) and Erdös-Rényi (Erdös and Renyi (1959)). Furthermore, we apply two different gene expression synthesis methods: the first one considers only linear correlation among selected genes (SLC), the second one is based on a gene network/expression simulator recently proposed to assess reverse engineering algorithms (GES, Di Camillo et al. (2009)). See Supplementary Material for full details.
Escherichia coli data: the task for the biological experiments is the inference of a few transcriptional subnetworks of the model organism Escherichia coli starting from a set of steady state gene expression data. The data are obtained from different sources and they consist of three different elements, namely the whole Escherichia coli transcriptional network, the set of the transcriptional subnetworks and the gene expression profiles to infer the subnetworks from. The Escherichia coli transcriptional network is extracted from the RegulonDB database, version 6.4 (2010), and it consists of 3557 experimentally confirmed regulations between 1442 genes, amongst which are 172 transcription factors. The 117 subnetworks are defined in Marr et al. (2010): in our experiments we use 7 of these subnetworks, including a number of genes ranging from 7 to 104. The expression data were originally used in Faith et al. (2007) and consist of 445 Escherichia coli Affymetrix Antisense2 microarray expression profiles for 4345 genes, collected under different experimental conditions such as pH changes, growth phases, antibiotics, heat shock, varying oxygen concentrations and numerous genetic perturbations. MAS5 preprocessing is chosen among the available options (MAS5, RMA, gcRMA, DChip).

RESULTS
Due to space constraints, hereafter we present a selection of the outcomes of the experimental evaluation with emphasis on the reconstruction variability; for previous usage of MCC in network theory and applications see Stokic et al. (2009); Supper et al. (2007).
Synthetic data: Figure 2 illustrates the MCC scores obtained with ARACNE, CLR, KELLER and RegnANN for synthetic Barabasi networks (scale free, exponent P = 1), varying the number of nodes. In order to provide a similar amount of information to the inference algorithms while varying the size of the network, we kept constant the data ratio, i.e. the ratio of the number of expression profiles to the number of nodes (80%): e.g., 50 nodes with 40 different expression profiles, or 200 nodes with 160 different expression profiles. Expression values are linearly rescaled in [−1, 1]. Figure 2 indicates that the MCC scores on Barabasi networks depend on both the inference algorithm and the data synthesis method, while the size of the network (number of nodes considered) has a somewhat smaller impact on the performance. RegnANN-GES scores 0.5 ± 0.1 on a network of 200 nodes, while RegnANN-SLC scores 0.34 ± 0.08 on a similarly sized network. KELLER scores 0.4 ± 0.1 irrespective of the data synthesis method applied on the 200 nodes network. On the same sized network, ARACNE-GES scores 0.42 ± 0.04 while ARACNE-SLC scores 0.28 ± 0.06. Finally, CLR shows the worst performance of the four algorithms tested, irrespective of the network size and the data synthesis adopted: e.g., 0.17 ± 0.02 (GES) and 0.18 ± 0.01 (SLC) for a network of 200 nodes.
Figure 3 shows the MCC scores for the same network inference methods as above, varying the number of expression profiles considered while keeping constant the size of the Barabasi network (100 nodes). Expression values are statistically normalized. Figure 3 indicates that the MCC scores vary greatly when considering statistically normalized values while varying the amount of data generated (the number of expression profiles). The data synthesis method adopted can also greatly affect the performance score. MCC scores for RegnANN, ARACNE and CLR are positively affected when the number of generated expression profiles is increased from 10 to 40: RegnANN-GES scores 0.20 ± 0.02 considering only 10 profiles, while scoring 0.50 ± 0.08 with 40 different profiles. Adopting SLC data synthesis, RegnANN scores 0.07 ± 0.04 and 0.24 ± 0.06 with 10 and 40 expression profiles respectively. Similarly, ARACNE-GES scores 0.28 ± 0.06 and 0.35 ± 0.04 with 10 and 40 expression profiles respectively. ARACNE-SLC scores 0.15 ± 0.08 and 0.31 ± 0.06 with 10 and 40 expression profiles respectively. On the other hand, as also shown in Figure 2, CLR shows performance curves that are not influenced by the data synthesis method adopted: with 10 expression profiles it scores 0.13 ± 0.01 (GES) and 0.10 ± 0.04 (SLC). With 40 expression profiles CLR-GES scores 0.22 ± 0.04 and CLR-SLC scores 0.21 ± 0.04. On the contrary, Figure 3 shows that the performance of KELLER is greatly influenced by the data synthesis method, while the number of expression profiles has a somewhat limited impact: KELLER scores 0.44 ± 0.06 when synthesizing expression profiles with GES (40 in total), and 0.18 ± 0.02 when using SLC to generate 40 profiles.
Figure 4 shows the MCC scores obtained with ARACNE, CLR, KELLER and RegnANN by varying data normalization methods while keeping constant the network size (200 nodes) and the number of expression profiles generated (160). Only the SLC data synthesis is considered. Figure 4 indicates that ARACNE, CLR and RegnANN MCC scores are not significantly affected by the normalization method, considering the error of the measure: RegnANN scores 0.42 ± 0.06, 0.4 ± 0.1 and 0.4 ± 0.1 applying respectively discretization, linear rescaling and statistical normalization to the data. Similarly, ARACNE scores 0.24 ± 0.04, 0.28 ± 0.03 and 0.28 ± 0.03 when the expression values are discretized, linearly rescaled and statistically normalized. Finally, CLR scores 0.14 ± 0.04, 0.17 ± 0.01 and 0.17 ± 0.01 for the very same normalization methods (discretization, linear rescaling, statistical normalization). On the other hand, KELLER MCC scores are highly influenced by the normalization method applied to the synthetic data. In the case of discretization and in the case of statistical normalization KELLER scores 0.10 ± 0.01 and 0.19 ± 0.01 respectively. In the case of linear rescaling it scores a higher value: 0.40 ± 0.07.
Figure 5 shows the MCC scores obtained with ARACNE, CLR, KELLER and RegnANN for synthetic Erdös-Rényi networks (random graph, mean degree D = 1), varying the number of nodes. In order to provide a similar amount of information to the inference algorithms while varying the size of the network, we kept constant the data ratio (80%). Expression values are linearly rescaled in [−1, 1]. In the case of Erdös-Rényi networks the MCC curves are greatly and unevenly affected by all the parameters explored: inference method, size of the network and data synthesis method. ARACNE and CLR show a decreasing MCC score (although not strictly statistically significant) when the number of nodes in the network is increased from 50 to 200: ARACNE-GES scores 0.29 ± 0.08 with network size 50 and 0.25 ± 0.04 with network size 200. Similarly, CLR-GES scores 0.19 ± 0.06 with network size 50 and 0.11 ± 0.02 with network size 200; a similar negative trend is recorded in the case of SLC data synthesis. On the other hand, KELLER and RegnANN have higher MCC when the number of nodes in the network is increased from 50 to 200: KELLER-SLC scores 0.39 ± 0.08 with network size 50 and 0.65 ± 0.08 when the network size is 200. Similarly, RegnANN-SLC scores 0.4 ± 0.1 and 0.64 ± 0.04 for network size 50 and 200 respectively. Considering GES for synthetic data generation, the MCC curves are significantly different for both KELLER and RegnANN: KELLER scores 0.37 ± 0.04 for network size 200 while RegnANN scores 0.20 ± 0.04 for similarly sized networks (200 nodes).
Figure 6 shows the MCC scores for the same network inference methods as above, varying the number of expression profiles considered while keeping constant the size of the Erdös-Rényi network (100 nodes). Expression values are statistically normalized. As indicated in Figure 6, KELLER and RegnANN show opposite MCC curves when increasing the number of expression profiles generated. RegnANN-GES shows rapidly increasing scores when varying the number of expression profiles from 10 to 80: 0.12 ± 0.02 and 0.6 ± 0.1 respectively. KELLER-GES scores 0.28 ± 0.06 and KELLER-SLC scores 0.13 ± 0.04 with 10 expression profiles. KELLER-GES scores 0.16 ± 0.01 and KELLER-SLC scores 0.17 ± 0.04 with 80 expression profiles. On the other hand, MCC curves for CLR are only marginally affected by the number of expression profiles or by the data generation methodology: with 80 expression profiles it scores 0.15 ± 0.02 using GES and 0.16 ± 0.02 using SLC for data synthesis.
Figure 7 shows the MCC scores obtained with ARACNE, CLR, KELLER and RegnANN varying the data normalization method while keeping constant the network size (200 nodes) and the number of expression profiles generated (160). Only the SLC data synthesis is considered. As in the case of Barabasi networks (Figure 4), Figure 7 shows that ARACNE, CLR and RegnANN MCC scores are not significantly affected by the normalization method. On the contrary, KELLER is significantly affected: it scores 0.11 ± 0.01 and 0.15 ± 0.02 when the expression values are discretized and statistically normalized respectively. A higher value of 0.65 ± 0.08 is recorded in the linearly rescaled case.
Selected Escherichia coli subnetworks: Table 1 summarizes the results obtained on a selection of Escherichia coli gene subnetworks (Peregrin-Alvarez et al. (2009)) for the four inference algorithms. Gene expression values are linearly rescaled in [−1, 1]. As for the case of synthetic data, Table 1 indicates great variability of the MCC scores across the different network modules for all the inference methods tested. ARACNE scores range from 0.78 (module 81) to 0.00 (module 88). CLR values range between 0.45 and 0.02 for module 81 and 96 respectively. KELLER scores range between 0.63 and −0.12 (module 12 and module 81 respectively). Finally, RegnANN scores range between 0.32 ± 0.005 (module 12) and −0.05 ± 0.02 (module 88). It is interesting to note that the MCC score varies unevenly for the different inference algorithms.

CONCLUSION
In this work we presented a novel method for network inference based on an ensemble of multilayer perceptrons configured as multi-variable regressors (RegnANN). We compared its performance to the performance of three different network inference algorithms (ARACNE, CLR and KELLER) on the task of reverse engineering the gene network topology, in terms of the associated MCC score.
Our extensive evaluation indicates that all the algorithms suffer from instability in the reconstruction of the network topology due to various sources of variability, possibly not limited to the relatively small set of parameters explored here. Because of such instability, it is objectively very difficult to establish which method performs best. Generally, the newly introduced RegnANN shows performance scores that compare very favorably with all the other inference methods tested. Nonetheless, further efforts are required in order to effectively cope with the difficulty of the task and minimize the variability of the inference process.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (1)$$
where TP indicates the fraction of true positives, while FN indicates the fraction of false negatives.
On the other hand, precision measures the fraction of true interactions among all inferred ones, and it is computed as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
where FP indicates the fraction of false positives.
In this work we adopt instead the Matthews correlation coefficient, MCC (Baldi et al. (2000); Matthews (1975)): this is a measure that takes into account both true/false positives and true/false negatives and it is generally regarded as a balanced measure, especially useful in the case of unbalanced classes (i.e., unequal numbers of positive and negative examples).
The MCC is in essence a correlation coefficient between the observed and predicted binary classifications: it returns a value between −1 and +1. A coefficient value equal to +1 represents a perfect prediction, 0 indicates an average random prediction, while −1 indicates an inverse prediction (Baldi et al. (2000); Matthews (1975)). In the context of network topology inference the observed class is the true network adjacency matrix, while the predicted class is the inferred one.
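The Matthews correlation coefficient is obtained according to the following equation:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TN indicates the fraction of true negatives. Recently MCC has also been used for comparing network topologies (Supper et al. (2007); Stokic et al. (2009)).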
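For undirected networks the MCC can be computed directly from the true and the inferred adjacency matrices. The sketch below makes two assumptions that are not stated in the text: only the upper triangle is counted (each undirected link once, no self-links), and the score is set to 0 when the denominator vanishes. The function name `mcc_from_adjacency` is hypothetical.

```python
import numpy as np

def mcc_from_adjacency(true_adj, pred_adj):
    """Matthews correlation coefficient between two symmetric {0, 1} matrices."""
    true_adj = np.asarray(true_adj)
    pred_adj = np.asarray(pred_adj)
    iu = np.triu_indices(true_adj.shape[0], k=1)   # each undirected pair once
    t = true_adj[iu].astype(bool)
    p = pred_adj[iu].astype(bool)
    tp = int(np.sum(t & p))
    tn = int(np.sum(~t & ~p))
    fp = int(np.sum(~t & p))
    fn = int(np.sum(t & ~p))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def adjacency(n, links):
    """Helper for the example: symmetric matrix from a list of node pairs."""
    adj = np.zeros((n, n), dtype=int)
    for i, j in links:
        adj[i, j] = adj[j, i] = 1
    return adj

true_net = adjacency(4, [(0, 1), (1, 2), (2, 3)])   # chain of four genes
pred_net = adjacency(4, [(0, 1), (1, 2), (0, 3)])   # one missed and one spurious link
perfect = mcc_from_adjacency(true_net, true_net)
partial = mcc_from_adjacency(true_net, pred_net)
```

In the second comparison the six gene pairs give TP = 2, TN = 2, FP = 1 and FN = 1, so the score is (4 − 1)/9 = 1/3.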

SYNTHETIC DATA GENERATION
The synthetic data sets used in the main paper are obtained starting from an adjacency matrix describing the selected topology. In this work we consider undirected graphs: we are interested in estimating the structures of interaction between nodes/genes, rather than the detailed strength or the direction of these interactions. Thus, we consider only symmetric and discrete adjacency matrices, representing with a value of 1 the presence of a link between two nodes. A value equal to 0 in the adjacency matrix indicates no interaction.
Once the topology of the network is (randomly) generated, the output profiles of each node are generated according to the approaches in the following section.
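As an illustration of the topology generation step, a symmetric and discrete adjacency matrix for an Erdös-Rényi random graph can be drawn as follows. This is a sketch under assumptions: the function name `erdos_renyi_adjacency` is hypothetical, and the link probability is chosen as D / (N − 1) so that the expected degree of each node equals the desired mean degree D; the generator actually used for the experiments is not specified here.

```python
import numpy as np

def erdos_renyi_adjacency(n, mean_degree=1.0, seed=0):
    """Random symmetric {0, 1} adjacency matrix with zero diagonal."""
    rng = np.random.default_rng(seed)
    p = mean_degree / (n - 1)                        # assumed link probability
    upper = np.triu(rng.random((n, n)) < p, k=1)     # draw each node pair once
    return (upper | upper.T).astype(int)             # mirror to make it symmetric

adj_er = erdos_renyi_adjacency(50, mean_degree=1.0)
```

A value of 1 in `adj_er` marks a link between two nodes and 0 marks no interaction, matching the convention above.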

Simple Linear Correlation (SLC):
similarly to the simulation of gene expression data presented in the supplementary material of Langfelder and Horvath (2007), we consider a set of seed expressions (a matrix M × N, i.e. N genes whose expression profiles are recorded M times, with values uniformly distributed in [−1, 1]) and the desired topology expressed by the adjacency matrix adjM (N × N). The gene expression profiles (gep, a matrix M × N) are calculated from the seed expressions and adjM using the operations '+' and '⋆', where the symbol '+' indicates element-wise summation and the symbol '⋆' indicates row-column matrix multiplication. With this method, the seed expression columns are linearly correlated (correlation equal to 1) with the columns of the same matrix as described by the discrete input adjacency matrix adjM.
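One reading of the SLC synthesis that is consistent with the stated dimensions and operators is gep = seed + seed ⋆ adjM, i.e. each gene profile is its own seed plus the seeds of its neighbors. This formula is an assumption made for the sketch below, as are the names `slc_profiles` and `seed_expr`; it should not be taken as the exact expression used to generate the data.

```python
import numpy as np

def slc_profiles(adj, n_profiles, seed=0):
    """Assumed SLC synthesis: own seed expression plus the seeds of linked genes."""
    rng = np.random.default_rng(seed)
    adj = np.asarray(adj, dtype=float)
    n_genes = adj.shape[0]
    seed_expr = rng.uniform(-1.0, 1.0, size=(n_profiles, n_genes))  # M x N seed matrix
    return seed_expr + seed_expr @ adj       # '+' element-wise, '@' row-column product

# Genes 0 and 1 are linked; gene 2 is isolated.
adjM = np.array([[0, 1, 0],
                 [1, 0, 0],
                 [0, 0, 0]])
gep = slc_profiles(adjM, n_profiles=20)
```

Under this reading the two linked genes of the toy network receive identical profiles (correlation equal to 1), while the isolated gene keeps its own seed expression.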

Gene Expression Simulator (GES):
this second methodology is based on a gene network simulator recently proposed to assess reverse engineering algorithms (Di Camillo et al. (2009)). Given an input adjacency matrix, the network simulator uses fuzzy logic to represent interactions among the regulators of each gene and adopts differential equations to generate continuous data. As in Margolin et al. (2006), we obtain synthetic expression values of each gene n (n = 1, ..., N) by simulating its dynamics until the expression value reaches its steady state. We obtain M different values for each gene by repeating the process M times and recording the expression value at steady state. The synthesis of each gene profile is randomly initialized by the simulator.
Fig. 1. The ad hoc procedure proposed to build the training input/output patterns starting from a gene expression matrix. Each input pattern corresponds to the expression value for the selected gene of interest. The corresponding output pattern is the vector of expression values for all the other genes for the given row in the gene expression matrix.

Fig. 2. MCC scores of the different network inference algorithms for synthetic Barabasi networks (scale free, exponent P = 1), varying the number of nodes and keeping constant the data ratio: the number of expression profiles to the number of nodes (80%). Both methods (GES and SLC) for data synthesis are considered. Expression values are linearly rescaled in [−1, 1].

Fig. 3. MCC scores of the different network inference algorithms for synthetic Barabasi networks (scale free, exponent P = 1), varying the number of expression profiles and keeping constant the number of nodes (100). Both methods (GES and SLC) for data synthesis are considered. Expression values are statistically normalized (zero mean and unit standard deviation).

Fig. 4. MCC scores of the different network inference algorithms for synthetic Barabasi networks (scale free, exponent P = 1), varying the data normalization method (Discretization, Linear Rescaling and Statistical Normalization) and keeping constant the network size (200 nodes) and the number of expression profiles generated (160). Only SLC data synthesis is considered.

Fig. 5. MCC scores of the different network inference algorithms for synthetic Erdös-Rényi networks (random graph, mean degree D = 1), varying the number of nodes and keeping constant the data ratio: the number of expression profiles to the number of nodes (80%). Both methods (GES and SLC) for data synthesis are considered. Expression values are linearly rescaled in [−1, 1].

Fig. 6. MCC scores of the different network inference algorithms for synthetic Erdös-Rényi networks (random graph, mean degree D = 1), varying the number of expression profiles and keeping constant the number of nodes (100). Both methods (GES and SLC) for data synthesis are considered. Expression values are statistically normalized (zero mean and unit standard deviation).

Fig. 7. MCC scores of the different network inference algorithms for synthetic Erdös-Rényi networks (random graph, mean degree D = 1), varying the data normalization method (Discretization, Linear Rescaling and Statistical Normalization) and keeping constant the network size (200 nodes) and the number of expression profiles generated (160). Only SLC data synthesis is considered.

Table 1. MCC scores of the different network inference algorithms on selected Escherichia coli network modules.

Table 2. Accuracy [MCC] scores in network topology inference for the different reverse engineering algorithms and their stability [MCC]. Column Accuracy indicates the mean MCC score in reconstructing the target network topology, column A.Err the associated error. Column Stability indicates the mean distance [MCC] among all the inferred topologies, column S.Err the associated error.