A Parallel Implementation of the Network Identification by Multiple Regression (NIR) Algorithm to Reverse-Engineer Regulatory Gene Networks

The reverse engineering of gene regulatory networks using gene expression profile data has become crucial to gain novel biological knowledge. Large amounts of data that need to be analyzed are currently being produced due to advances in microarray technologies. Using current reverse engineering algorithms to analyze large data sets can be very computational-intensive. These emerging computational requirements can be met using parallel computing techniques. It has been shown that the Network Identification by multiple Regression (NIR) algorithm performs better than the other ready-to-use reverse engineering software. However it cannot be used with large networks with thousands of nodes - as is the case in biological networks - due to the high time and space complexity. In this work we overcome this limitation by designing and developing a parallel version of the NIR algorithm. The new implementation of the algorithm reaches a very good accuracy even for large gene networks, improving our understanding of the gene regulatory networks that is crucial for a wide range of biomedical applications.


Introduction
Microarray analysis methods produce large sets of gene expression data that can be exploited for novel insights into the fundamentals of molecular biology research. Inferring gene regulating networks from microarray gene expression data has become one of the major topics in system biology. Inferring or 'reverse-engineering' gene networks can be defined as the process of identifying regulatory gene interactions from experimental data through computational analysis.
Since the advent of microarray, various methods have been developed to infer the underlying gene regulatory network. The pioneer methods were simple clustering algorithms [1] where the similarity between genes were measured by a distance or ''pseudodistance'' metric such as the clustering coefficient.
Since the structure to infer is a gene network, graphical models have been proposed and developed. This is the case of BANJO [2], that assumes that the gene network can be modeled as a Bayesian network. Bayesian statistics can be applied under some constrain [3].
Information theoretic approaches have been first proposed in [4] but the very first application was ARACNe [5]. ARACNe computes a pairwise pseudo-distance between each pair of genes to check their dependence. Theoretically ARACNe can be run to infer networks of any dimension.
All the mentioned algorithms have a ready-to-use software, and their performances have been tested and compared on both insilico and in-vivo gene networks [6].
Here we will focus on an ODE-based algorithm, NIR [7], that relates the expression of each gene with the expression of other genes in the cell. It has been shown that NIR is able to correctly identify these relations called ''influence interactions''. The ensemble of these interactions is referred to as gene network. Each of the recovered interactions within the gene network implies a regulatory interaction between components (Proteins, mRNAs, metabolites, etc.) of the cell. Gene networks can be used to identify the functional modules i.e. the subset of genes that regulate each other with multiple indirect interactions; to predict the response of the network to external perturbations and identify the genes directly hit; to identify physical interactions when integrated with additional information coming from sequence data and other experimental data.
Gene network inference algorithms based on ODEs relate gene transcript concentration changes to each other and to an external perturbation. The external perturbation is generally an experimental treatment that can alter the transcription rate of the genes in the cell. An example of perturbation is the treatment with a drug or a genetic perturbation that results in the overexpression or the downregulation of particular gene. NIR algorithm achieves better results than the other network inference algorithms being able to reach peaks of 95% of correct predicted interactions. This has been shown in [8] where there have been selected and compared the algorithms capable of solving the network inference problem with an available ready-touse software. In that work it wasn't possible to run NIR on data set with more than 100 genes because it would have taken too much time. Microarray technology allows the measurements of many thousands of transcripts at once and the mammalian gene regulatory network is of the order of 10 4 . The complexity in handling networks of this size [6], motivated us to design, develop and run a parallel computing algorithm.
In section Methods we describe the NIR algorithm and its parallel implementation; in section Results we describe experimental results and in section Discussion we give our conclusions.

The NIR algorithm
It is possible to describe the gene network as a system of differential equations in which each equation describes the variation in time of the concentration of a particular transcript (gene), x i , as a non linear function, f i , of the concentrations of the other transcripts: where x(t)~½x 1 (t), . . . ,x n (t) is a vector whose components are the concentrations of the transcripts measured at time t, u i (t) is the external perturbation applied at gene i at time t, h i is a set of the rate of change in transcription of transcript x i and N is the number of genes.
To reverse-engineer a network using ODEs means to choose a functional form for f i and then to estimate the unknown parameters h i for each i by using the gene expression data. With the ODE-based approach the resulting gene network will be a directed graph (i.e., if a ij is the interaction between genes i and j, it specifies the direction of the interaction, that is, gene j regulates gene i and not vice versa, a ij =a ji ).
We expand f i in a Taylor series around x 0 , the point in which measurements were made. If we assume that perturbations around this point are sufficiently small, it is possible to truncate the Taylor series after the first order term, and obtain a linear expression: where a ij represents the influence of gene j on gene i, b i represents the effect of the external perturbation on x i (in subsequent analysis, we set b i equal to one for the sake of simplicity) and u(t) is the vector of the external perturbations at time t (a ij and b i are the h i in equation (1)). _ x x i (t) is the rate of change of concentration of gene i at time t, i.e., the first derivative of the mRNA concentration of gene i at time t. In proximity of the steadystate the concentrations of the N transcripts don't change in time (2) can be rewritten as: Let us suppose that we have conducted M experiments such that we know the genes directly perturbed (u i (k), k~1, . . . ,M) as well as the expression profiles following the experiments (transcript concentration levels from microarray data x j (k), k~1, . . . ,M). We can then solve the equation (3) for the unknown parameters a ij , and thus obtain the ingoing edges per gene. NIR applies the multiple linear regression method to estimate the unknown model parameters. It relies on the assumptions that the data x are realizations of a normally distributed random variable with known variances and the perturbations, u, are random variables, also normally distributed with known variances. Generally, the response y may be related to k regressors and the is termed a multiple linear regression with k regressors.
Having M experiments (response observation points) at our disposal, the model for each gene of the network becomes: The response (y) is given by the experimental perturbation values u i [ < 1|M , the regression variable values (X ) are given by the concentrations of the gene transcripts and the regression variable parameters are given by the components of the vector a i , so that, in matrix form, the model becomes: with X [ < N|M . The regression analysis aims to best-fit the data by estimating the parameters of the model. NIR estimates the parameters of the regression variables for each gene, using the least squares method. These are the values for which the first derivative of the residual sum square function is zero: under the assumption that the regressors are linearly independent. Biological networks are sparse [9], thus NIR adopts the sparsity assumption that imposes an upper bound on the number of ingoing edges per gene (i.e. maximum number of regulators per gene), restkvN, which can be chosen by the user.
For each gene the restk parameters that result in the smallest mean square deviation identify the restk ingoing edges for that gene. The weight of the identified edges is given by the value of the estimated parameters. The choice of restk affects either the sensitivity to measurement errors or the execution time. A low value of restk induces an increase in the solution sensitivity to measurement errors. A high value prohibitively increases the computational time needed to identify the regulatory network due to the high number of the regressor combinations to be included in the model. This number is equal to the number of combinations without repetitions of N objects taken restk at a time: This is polynomial of degree restk in the number of genes. The exhaustive approach which evaluates the regression for each combination is not feasible for gene networks larger than 100 genes (with 100 genes and restk~10 the number of combinations is of the order of 10 13 ), thus NIR uses the following heuristic approach.  N At step kz1 NIR computes (7) by considering the remaining N{k variables jointly with each of the topd sets of k variables selected at the previous step, that is topd(2N{2k{topdz3) 2 sets of k variables are considered; the topd sets of kz1 variables for which the sum of the squared deviations is minimized are selected as possible sets of kz1 ingoing edges for the gene.
N The process ends when the number of regression variables selected reaches restk; the set of restk variables for which the sum of the squared deviations is minimized identifies the set of parameters a i corresponding to the input regulations affecting expression profile of gene i.
The final output is an adjacency matrix, where each element is the edge a ij , that encodes the directed graph. The number of times (7) is calculated for each gene is O(restk : topd : N), so the overall number of times (7) is calculated is O(restk : topd : N 2 ). The computational complexity of (7) at step kz1, for the submatrix of X whose rows correspond to a set of k variables (0ƒkƒrestk{1), is O(k 2 N). The overall computational complexity is therefore O(topd(restk : N) 3 ).

The NIR parallel version and its implementation
The NIR algorithm can be easily parallelized to handle large problems in a computationally efficient manner by distributing the overall computational burden among different processors to reduce the total execution time. In order to address the high computational cost issue of the NIR algorithm we have applied some specific implementation optimizations along with parallel programming techniques.
The computational core of the NIR algorithm is the equation (7) where X is a submatrix of the gene expression matrix composed only of k rows (with k~1,2,:::,restk). From the matrix-matrix product definition applied to the submatrix X (v,v), with v~½i 1 , . . . ,i k vector of k indices, it follows that: Therefore for each step of the algorithm, we don't compute any matrix-matrix product operation X (v,1 : N)(X (v,1 : N)) T . On the contrary the product XX T is computed once and for all at the beginning of the program. At each step k, our implementation just selects the symmetric submatrix of XX T whose row and column indices correspond to the k possible ingoing edges for the gene. Let S be this submatrix of dimension k stored in packed format.
In each experiment only one gene is perturbed. This implies that for each gene i the perturbation vector u i is equal to (0, . . . ,0, 1 i ,0, . . . ,0) and then the product u i X T reduces to the i-th row of X T . Denote this row by r.
S is positive definite so we can apply the Cholesky factorization to the matrix S in order to computeâ a i as solution of the system of equations Sa i~{ r, thus avoiding the matrix inversion. We rely on the LAPACK [10] routine DPPSV to solve this system of equations with a computational complexity of O(k 3 ).
By avoiding the matrix product in (7), the parallel algorithm complexity is decreased by one order of magnitude: at the generic step the computational complexity is O(k 3 ), the overall computational complexity is therefore O(topd : restk 4: N 2 ) The parallelization is implemented by assigning different genes to different computing processes: each process takes care of N=p genes where p is the number of processes available. The computing steps described in the previous section can be performed independently for each gene so each process can compute the results for its genes independently without communication. The parallel algorithm has been implemented in C using the MPI standard.

Results
We carried out two kind of tests: to measure the result accuracy of the algorithm and to measure the efficiency in terms of speedup. In order to measure the result accuracy we ran the program, by using the 'in silico' data generated by [8], on 20 different networks counting 1000 genes, with 10 as average in degree per gene. 'In silico' data are gene expression data generated by a computer model of the gene regulation that enable one to check the performance of algorithms against a perfectly known truth. For each network we generated 1000 experiments perturbing a different single gene at time (local steady-state data). The program has been executed on 100 processors of an HP XC6000 Cluster with Itanium 2 biprocessors nodes and a Quadrics ELAN 4 network. On average it took 984 seconds to generate the results for each gene network. The program recovered most of the true interactions as shown in Table 1. We compared the results from the simulations obtained by our implementation with the ones obtained by the network inference software reviewed in [8], setting the parameters restk~11, topd~50. Some software infers the network just as an undirected graph (i.e. the direction of the interaction is not specified, a ij~aji ), while NIR generates directed and signed graphs. In order to be able to compare the different types of software, we computed PPV and Se by first transforming the real (signed directed graphs) and inferred networks yielded by our implementation into undirected graphs (labeled u in Table 1). As shown in Table 1, NIR performs better than the other ready-to-use software even for 1000 gene networks.
Moreover we ran the program on a 2500 gene network, even though we didn't have any other software predictions to compare the results with. We set the same values for the parameters restk and topd as before. It took around 12450 seconds to generates the results and we obtained the following values for PPV and Sensitivity: 0.26 d , 0.27 u and 0.10 d , 0.11 u respectively. In order to measure the parallel efficiency we run the program for a 1000 gene network on different number of processors. The execution times and speedups are shown in Table 2. To evaluate the parallel speed-up, we developed a serial version of the algorithm with the implementation optimizations discussed in the previous paragraph.

Discussion
We have designed, developed and tested a parallel version of the NIR algorithm whose purpose is to infer gene regulating networks from microarray gene expression data. Our parallel algorithm reduces the time complexity of the original NIR algorithm by one order of magnitude by avoiding the useless repetition of matrix multiplication. The algorithm uses data parallelism, distributing the gene expression data over the available processors. The tests were performed on large networks (N = 2500) that couldn't be efficiently analyzed by the original serial version. The results confirm the improvements in accuracy, in terms of positive predicted values, and sensitivity, in terms of false negatives, with the respect to the other inferring methods. In addition, the parallel algorithm scales well as the number of processors increases and has a linear speedup.