Gene Regulatory Network Inference from Multifactorial Perturbation Data Using both Regression and Correlation Analyses

An important problem in systems biology is to reconstruct gene regulatory networks (GRNs) from experimental data and other a priori information. The DREAM project offers some types of experimental data, such as knockout data, knockdown data, time series data, etc. Among them, multifactorial perturbation data are easier and less expensive to obtain than other types of experimental data and are thus more common in practice. In this article, a new algorithm is presented for the inference of GRNs using the DREAM4 multifactorial perturbation data. The GRN inference problem among genes is decomposed into different regression problems. In each of the regression problems, the expression level of a target gene is predicted solely from the expression level of a potential regulation gene. For different potential regulation genes, different weights for a specific target gene are constructed by using the sum of squared residuals and the Pearson correlation coefficient. Then these weights are normalized to reflect effort differences of regulating distinct genes. By appropriately choosing the parameters of the power law, we constructe a 0–1 integer programming problem. By solving this problem, direct regulation genes for an arbitrary gene can be estimated. And, the normalized weight of a gene is modified, on the basis of the estimation results about the existence of direct regulations to it. These normalized and modified weights are used in queuing the possibility of the existence of a corresponding direct regulation. Computation results with the DREAM4 In Silico Size 100 Multifactorial subchallenge show that estimation performances of the suggested algorithm can even outperform the best team. Using the real data provided by the DREAM5 Network Inference Challenge, estimation performances can be ranked third. Furthermore, the high precision of the obtained most reliable predictions shows the suggested algorithm may be helpful in guiding biological experiment designs.


Introduction
Reconstructing the structure of a gene regulatory network (GRN) from experimental data and other a priori information is very helpful in understanding the development, pathology and functioning of all biological organisms. Recently, with the development of high-throughput technologies, such as DNA microarrays, mass spectroscopy, etc., it is possible to reconstruct GRNs from some types of experimental data. In practice, the common data types contain knockout data, knockdown data, time series data, etc. Various models and methods have been suggested to attack this problem based on these types of experimental data, such as Boolean networks [1], Bayesian networks [2], information theory based algorithms [3], ordinary differential equation (ODE) based methods [4], etc.
Recently, the Dialogue for Reverse Engineering Assessments and Methods (DREAM) has been providing not only a set of benchmark networks extracted from actual biological networks of some most important and typical biological modules, such as Escherichia coli transcriptional regulatory network and Saccharomyces cerevisiae (yeast) transcriptional regulatory network [5], but also some types of In Silico gene expression data sets generated by the GeneNetWeaver tool version 2.0 [6], to motivate the systems biology community to investigate and develop structure identification methods for GRNs. In particular, the DREAM project offers an alternative type of steady-state data, i.e., multifactorial perturbation data, which are obtained by slightly perturbing all genes simultaneously so that the basal activation of all genes of the network is slightly increased or decreased simultaneously by different random amounts [5]. Multifactorial perturbation data might be regarded as expression profiles obtained from different patients [5]. Therefore, such data are easier and less expensive to be obtained than other types of experimental data and are thus more common in practice [7]. On the other hand, such data provide less information about GRNs with respect to other types of data which make the GRN identification problem more formidable [7].
Several methods have been shown to be effective in inferring the structure of a GRN through participating in the DREAM4 In Silico Size 100 Multifactorial subchallenge. For example, the best performer has developed GENIE3 algorithm for the inference of GRNs, which decomposes the prediction of a regulatory network among p genes into p different regression problems. In each of the regression problems, the expression pattern of a target gene is predicted from the expression patterns of all the other genes, using tree-based methods [7]. The second place team tackled the problem via a sparse Gaussian Markov Random Field, which relates network topology with the covariance inverse generated by the gene measurements. And, the Graphical Lasso algorithm is used to compute the covariance inverse. Then, the optimal network is selected by different model selection criteria [8]. On the other hand, a GRN can be modeled as a correlation network [9], which is obtained by computing the correlation coefficient between arbitrary two genes. Surprisingly but also interestingly, this simple method was proved to be placed at the second (tie) for the DREAM4 In Silico Size 100 Multifactorial subchallenge. However, due to the symmetry of the correlation coefficient, the estimated correlation network topology is undirected.
Motivated by the GENIE3 algorithm, an identification algorithm is developed in this paper for GRN topology inference, based on the regression analysis and the correlation analysis. Specifically, the GRN inference problem among p genes is decomposed into p|(p{1) different regression problems. And, in each of the regression problems, the expression level of a target gene is predicted solely from the expression level of a potential regulation gene. For different potential regulation genes, different weights for a specific target gene are constructed. The larger the sum of squared residuals is, the weaker the direct regulatory interaction will be. And, the higher the Pearson correlation coefficient is, the stronger the rationality is for the application of the linear regression. To take both into consideration, the weight corresponding to a possible direct regulation is selected as their product. Then these weights are normalized to reflect effort differences of regulating distinct genes.
It has been observed that most large scale gene regulatory networks are sparse. Mathematically, the sparsity of a GRN may be characterized by the power law [4]. And, the in-degree distribution of a GRN can be obtained by means of the power law. In this paper, the so-called in-degree distribution means the number of genes with in-degree equal to m,m~1,2, Á Á Á. By appropriately choosing the in-degree distribution of a GRN, this paper suggest a method to utilize the sparsity quantitatively. Through constructing loss functions and incorporating power law, and solving a 0-1 integer programming problem, the direct regulation genes for an arbitrary gene can be estimated. Then, the above normalized weights can be further adjusted based on these estimated direct regulatory relationships.
In general, these weights are used to queue the possibility of the direct causal regulation. The larger the adjusted weight is, the higher the confidence is for the existence of the direct causal regulation. When a threshold is provided, this queue can lead to an estimate about the structure of a GRN. Computation results with the DREAM4 In Silico Size 100 Multifactorial subchallenge show that estimation performances of the suggested algorithm can even outperform the best team. Using the real data provided by the DREAM5 Network Inference Challenge, estimation performances by the proposed method can be ranked third. Furthermore, the high precision of the obtained most reliable predictions implies that the suggested algorithm may be helpful in guiding biological validation experiment designs.
The outline of this paper is as follows. At first, the structure estimation algorithm is illustrated. Afterwards, the proposed estimation method is assessed using the data sets of the DREAM4 In Silico Size 100 Multifactorial subchallenge and the DREAM5 Network Inference Challenge. Variations of estimation performances with respect to parameters of the suggested method will also be reported. Finally, some concluding remarks are given about the characteristics of the suggested method, as well as some future works worthy of further efforts.

Problem Statement
Considering a GRN with p genes, it is assumed that the targeted network is a directed graph, in which each node represents a gene, and an edge directed from one gene i to another gene j indicates that gene i regulates the expression of gene j directly. The goal of gene regulatory network inference in this paper is to recover the network solely from multifactorial perturbation data. A set of multifactorial perturbation data can be obtained by first perturbing all genes simultaneously, and then measuring steady-state levels of all genes. Different data sets can be obtained by implementing different perturbations to the network [5]. At the same time, such data do not give information about the regulatory network dynamics, but about the system equilibrium once it has recovered after the perturbation [8].
Denote S by N sets of multifactorial perturbation data: where, x i k represents the steady-state levels of gene k in the i-th experiment. Specifically, the problem of recovering regulatory networks is addressed as follows: Utilizing data set S, design a GRN inference algorithm and assign weights w ik §0,i,k~1,2, Á Á Á ,p. The larger the weight w ik is, the higher the confidence is for the existence of the direct causal regulation from gene i to gene k.
For most of large scale networks, it has been observed that the distribution of the number of chemical elements that have direct regulatory effects on a randomly chosen chemical element, obeys approximately a power law [4,[10][11][12][13]. More specifically, let Pr m f g denote the probability that the number of direct regulations on a randomly chosen chemical element equals to m, then there exist a positive number c and a positive integer k min , such that [4] Pr m f g~c . This important prior structural information is also incorporated into our estimation procedures.

Regression Analysis
It is well known that the relevance between any two genes can be represented by the Pearson correlation coefficient [9]. But this method is non-causal. On the other hand, the GENIE3 algorithm decomposes the prediction of a regulatory network among p genes into p different regression problems. In each of the regression problems, the expression pattern of a target gene is predicted from the expression patterns of all the other genes, using tree-based methods [7]. Motivated by this idea, we decompose the GRN inference problem among p genes into p|(p{1) different regression problems. The novelty is as follows. In each of the regression problems, the expression level of a target gene is predicted solely from the expression level of a potential regulation gene. For different potential regulation genes, different weights for a specific target gene are constructed. That is, for a particular gene k and its potential regulation gene i, the aim of the regression analysis is to establish a function, i.e., y~f (x)ze xy . Obviously, this function reveals the causal relationship between them. Here, y and x represent the steady state expression concentrations of the genes k and i, respectively. In practice, y is not completely determined by x, because there are many factors which may affect y. Therefore, e xy is used to represent the unknown secondary factors or/and the random errors, all of which may affect y. An important parameter in the regression model is the variance of e xy , i.e., s 2 xy . In essence, s 2 xy is the mean squared error when y is approximated by an suitable function f (x) [14]. Generally, when x is reasonably selected as the most important factor, then the value of s 2 xy will be relatively smaller; otherwise, the value of s 2 xy will be relatively larger. In other words, for the particular gene k, the smaller the magnitude of s 2 ik is, the larger the probability is for the existence of the direct causal regulation from gene i to gene k.
In practice, although s 2 ik is unavailable, it can be estimated from the sum of squared residuals by using linear regression (leastsquares estimation). Therefore, we can construct the weight w ik based on the above discussion. A practical network prediction is obtained by setting a threshold on the ranking of weights from the most to the less significant. In this paper, we focus on the task of constructing weights, while the question of the choice of an optimal confidence threshold, although important, will be left open.

Weight Construction
Denote by x k the steady-state level of gene k. The steady-state level of gene k may be directly affected by all other genes expression levels. Therefore we have the following expression: The function f : ð Þ in Equation (1) not only contains lots of arguments, but also may be a non-linear function. Thus, it is hard to directly estimate the function f : ð Þ. On the other hand, from the definition of the weight w ik , we know that w ik represents the probability of the direct causal regulation only from gene i to gene k. That is, when the weight w ik is computed, the function f : ð Þ in Equation (1) can be approximated by the following expression: The form of the function h ik x i ð Þ, however, is not clear and might be non-linear. Hence, the linear regression technique is used to analyze the direct causal regulation from gene i to gene k. And, the function h ik x i ð Þ is approximated by its first order Taylor expansion, i.e., where, e ik represents the approximation error or/and the measurement error.
Consequently, from Equation (2) and Equation (3), we have The regression coefficients a ik and b ik can be estimated by the least squares estimation. Let X k~x Moreover, the sum of squared residuals SSE ik is also obtained in this process, i.e., The value of SSE ik might be regarded as the capability of the direct regulatory interaction from gene i to gene k. In other words, we assume that the larger the sum of squared residuals SSE ik is, the weaker the direct regulatory interaction from gene i to gene k will be. For this reason, the constructed weight should utilize this characteristic provided by SSE ik . On the other hand, for arbitrary two data sets X k and X i , not matter whether there exists the linear correlation between them, the sum of squared residuals SSE ik can always be obtained by Equation (5). However, if there does not exist the linear correlation between them, the application of the linear regression is meaningless. To test whether the data sets X k and X i are linear correlation, the correlation coefficient is the most frequently used test statistic. The expression for the correlation coefficient r ik is as follows: According to the discussion above, it is clear that the larger the sum of squared residuals SSE ik is, the weaker the direct regulatory interaction from gene i to gene k will be. And, the larger the Pearson correlation coefficient r ik is, the stronger the rationality is for the application of the linear regression on data sets X k and X i . To take both of them into consideration, a weight w ik corresponding to a possible direct regulation from gene i to gene k is constructed as follows: For the particular gene k, the larger the magnitude of w ik is, the larger the confidence is that gene k is directly regulated by gene i.

Weight Normalization
It is important to note that in GRN topology inferences the larger the value of w ik is, the larger the probability is for the existence of a direct regulation from gene i to gene k. Define a p|p dimensional matrix W with its i-th row k-th column element being the estimate of w ik when i=k and its diagonal element being zero, and denote its i-th column vector by W i . And then, it is clear that this matrix contains information about the probability of the existence of a direct regulation between any two different genes in a GRN. However, to infer the structure of a GRN from this matrix, an important fact must be taken into account. That is, in a GRN, some genes may be easily regulated by other genes, while regulations on some other genes may need more efforts [15][16][17]. This implies that direct regulations to different genes may lead to weights of different magnitude orders. Therefore, in order to obtain a good estimate from the matrix W about the topology of a GRN, an appropriate normalization is still required for the estimated w ik s among different genes.
In [17], it is suggested to use the q-norm of the vector W i and the geometric average of its non-zero elements to achieve the normalization. More specifically, when q is adopted as 3.5, the structure inference performance is improved the most. Therefore, in this paper, it is suggested to also use the q-norm of the vector W i to achieve this normalization, that is, w ik is replaced by It is worthwhile to note that this normalization does not change the diagonal elements. For presentation conciseness, the normalized matrix W using the vector q-norm is denoted by W ½q in the rest of this paper. The normalization is firstly proposed in [17], in which the weight is represented by the RELV (relative expression level variation). The goal of the normalization is to guarantee that the weights for different genes hold the same magnitude order. For a GRN, in the last ranking list of w ½q ik , if the magnitude is larger, the corresponding transcription regulation will be established in a larger probability.

In-degree Estimation and Weight Magnitude Modification
To compute the weight w ik , the multivariate function f : ð Þ is approximated by a univariate function h ik x i ð Þ, which implies that the in-degree for an arbitrary gene k is assumed as one. Thus, the constructed weights do not employ the information about the combinatorial regulation to a gene. In this subsection, we try to estimate the in-degrees of genes in a GNR to utilize the information about the combinatorial regulation.
It is clear that the value of SSE ik represents the capability of the direct regulatory interaction from gene i to gene k, that is, the smaller the sum of squared residuals SSE ik is, the stronger the direct regulatory interaction from gene i to gene k will be. Sort the sum of squared residuals of gene k in a non-decreasing order, and denote the sorted results as follows: In this ranking k 1~k , so it is assumed that the top m genes from gene k 2 to gene k mz1 have great chance to combinatorially regulate gene k. Therefore, the multivariate function f : ð Þ can be approximated by a m-variable function in such case, i.e.: The form of the function mk : ð Þ, however, is also not clear and might be non-linear. Hence, the linear regression technique is used again. Applying the first order multiple Taylor expansion to the function mk : ð Þ, we have where, z mk represents the approximation error or/and the measurement error.
Using the least squares again, not only the regression coefficients but also the sum of squared residuals E mk and the sum of deviation squares R mk can be estimated. Let , and, Define a loss function P mk as follows: Here, the value of the sum of squared residuals E mk represents the capability of a direct combinatorial regulation from genes numbered k 2 , Á Á Á ,k mz1 to gene k. Obviously, it can be thought that the smaller the sum of squared residuals E mk is, the stronger the direct combinatorial regulation interaction will be. And, is the test statistic. The larger the test statistic is, the stronger the rationality is for the application of the multiple linear regression analysis. Therefore, to take both into consideration, the loss function P mk is defined as Equation (11). And, it can be presumed that the smaller the value of P mk is, the higher the probability is for the establishment of a direct combinatorial regulation from genes numbered k 2 , Á Á Á ,k mz1 to gene k.
To estimate the in-degree for a specific gene k optimally, one can search m from 1 to p{1 to find the minimum of the loss function P mk at m~m Ã . In such case, the optimal in-degree for the specific gene k is m Ã and genes numbered k 2 , Á Á Á ,k m Ã z1 are most likely to have a direct regulation effect on gene k. However, to estimate the in-degree for every gene in a GRN optimally, the structural characteristics of GRNs should be taken into consideration, that is, the power low could be taken into consideration. let M(Mvp) and l m denote respectively the maximum in-degree of a GRN and the number of genes with its in-degree equalling to m. Then, from the power law, it is clear that l m~t p| Pr m f gs. Since each gene has a unique in-degree, we can utilize the following 0-1 integer optimization to estimate the in-degree for every gene optimally.

min
X M m~1 X p k~1 P mk f mk s:t: Problem (12) can be solved by using a linear programmingbased branch-and-bound algorithm [18,19], and its optimal estimates can be denoted byf f mk D k~1,2,ÁÁÁ,p m~1,2,ÁÁÁ,M . For gene k, iff f mk~1 , with 1ƒmƒM, then, from the above problem description, it is clear that the optimal estimate for the in-degree of this gene is m, and genes numbered k 2 , Á Á Á ,k mz1 are most likely to have a direct regulation effect on this gene. In GRN topology estimation, another important thing worthy of considering is that genes estimated to have a direct regulation should correspond to a weight with a magnitude greater than those estimated not to have a direct regulation [20,21]. To achieve this purpose, the following adjustment is suggested in this paper. Define d as  With this value, the normalized weights for an arbitrary gene k are modified as follows, w w Here, for each gene k, m k is determined by the solution of Problem (12). Denote by W W ½q the p|p dimensional matrix with its i-th row kth column element being w w q ½ ik . Elements of this matrix are directly used to infer the structure of a GRN. The bigger the i-th row k-th element is, the higher the probability is that gene k is directly regulated by gene i.
It should be stressed here that the effectiveness of the indegree estimation depends on the veracity of the prior structural information. In this paper, the sparsity of a GRN is characterized by the power law. Therefore, the number of genes, whose in-degree are equal to m, can be represented as l m~t p| Pr m f gs. Here, Pr m f g is the so-called power law. That is, the solution of Problem (12) depends on the parameters of the power law. If the in-degree distribution of a GRN is pertinent and appropriate, the effectiveness of this step may be positive. Otherwise the performance may deteriorate. The results from Table 1,2,3 in the following section may support the argument.

Estimation Algorithm
In summary, on the basis of the regression analysis and the correlation analysis, the algorithm suggested in this paper for identifying direct regulations of a GRN consists of the following steps.
1. Compute the weight matrix W according to Equations (5), Equations (6) and Equations (7). 2. Normalize the weight matrix W according to Equations (8). 3. Choose appropriate values for c, k min and M, and solve the Problem (12), and modify the matrices W ½q according to Equations (13) and (14). (This is an optional step, not necessary.) Using elements of these matrices W ½q (or W W ½q ), queue possibilities of the existence of a direct regulation from the gene with the same number of the row to the gene with the same number of the column. The bigger the element is, the higher the confidence is for the existence of the direct causal regulation.

Data Sets and Assessment Metrics
To illustrate the effectiveness of the developed inference algorithm, tests are firstly performed on the DREAM4 In Silico Size100 Multifactorial subchallenge, which are designed to assess performances of an identification method for the structure of a large scale GRN [22]. They respectively contain 5 different benchmark networks with 100 genes which are obtained through extracting some important and typical modules from actual biological networks of E. coli and S. cerevisiae. Auto-regulatory interactions are removed, that is, there are no self-interactions in the in silico networks. For each network, 100 sets of multifactorial perturbation data are supplied.
Predictions are compared with the actual structure of the networks by the DREAM project organizers using the following two different metrics in topology prediction accuracy evaluations. Similarly, we can define a specification for each network as score i~{ 0:5 log 10 p i i~1, Á Á Á ,5. Based on the above discussion, we know More detailed explanations can be found in [22] or on the web site of the DREAM project at http://wiki.c2b2.columbia.edu/ dream/. Moreover, to evaluate performance on real data, tests are also performed on the DREAM5 Network Inference Challenge. Finally, the computation time needed by the suggested method is discussed. To evaluate the prediction accuracy of W ½q , W i is normalized by using some typical vector norms, such as the 1-norm and the Euclidean norm. Moreover, it is reported that when q is adopted as 3.5, the structure inference performance is improved significantly [17]. Thus, each column of W is also normalized by using the 3.5-norm. The corresponding results are given in Table 1. Also, the Performance of W is include in Table 1.
To compare prediction performances with the best team, the corresponding specifications are also included in Table 1, obtained directly from the web site of the DREAM project. Their digit lengthes are different from the other results that are obtained through actual computations. In addition, the corresponding pvalue for each specification is given in parentheses. In the last column of Table 1, the obtained scores are also given for each method. From Table 1, it is clear that by the normalization step the structure inference performance is improved remarkably. Specifically, when q is chosen as 2 and 3.5, their final scores even outperform the best team's final score.
The final score is a pretty important specification in inferring the structure of GRNs, while the precision specification can not be revealed by the final score. In topology estimations, highly confident predictions can become a good guidance to biological experiment designs [22]. However, these predictions will be helpful only if their precisions are sufficiently high. This requires that a desirable estimation algorithm should have a PR (precisionrecall) curve starting from the left upper corner, and decreasing monotonically and slowly with the increment of the recall rate. The ROC curve and PR curve of each network according to W , W ½1 , W ½2 and W ½3:5 are represented in Figure 1.
From Figure 1, we can draw some conclusions as follows. The AUPR and AUROC measures of W ½q ,q~1,2,3:5, are improved much more by the normalization step compared with these measures of W . What's more, when the weight matrix is adopted as W ½q , most of the PR curves start from the left upper corner. Specifically, when q is chosen as 1 and 3.5, the precision specification is pretty well for all the five networks. And, when the weight matrix is adopted as W ½2 , except the network 4, the PR curves start from the left upper corner for all other networks. This high precision implies that the suggested algorithm may be helpful in guiding biological validation experiment designs.
To investigate how the AUPR and AUROC measures and the final score of W ½q are influenced by q, q is searched over the interval 1,10 ½ through an equally spaced sampling with 90 samples. The corresponding results are given in Figure 2.
The results in Figure 2(a) suggest that when q[(1,3), the AUROC measure for each network maintains growth along with the increase of q. And when q[ (3,10), the AUROC measure for each network nearly remains unchanged. On the other hand, for the networks 2,3,4, when q[(1,2), the AUPR measure maintains growth; and when q[(2,10), the AUPR measure slowly falls. The situation for the network 1 is similar, while the inflexion point is about q~3. For the network 5, when q[ (1,4), the AUPR measure maintains growth; then this measure nearly remains unchanged. As for the final score, when q[ (2,4), it is more than 38. And, the results in Figure 2(b) confirm again that when q is adopted as 3.5, the structure inference performance is improved the most.

Prediction Performances of W W ½q
In the previous subsection, it is clear that prediction performances are improved by the normalization step, especially when the weight matrix is adopted as W ½3:5 . In this subsection, the prediction performances of W W ½q is under investigation. For convenience, q is adopted as 3.5 in this subsection.
To investigate influences of power low parameters on the prediction accuracy of the estimation algorithm, optimal values are searched for both k min and c. Particularly, for every network, the optimal k min is searched over the set 1,2,3 f g, and the optimal c is over the interval 0:1,10 ½ through an equally spaced sampling with 100 samples. In this optimization, the desirable k min and c are selected to be the sample that maximizes the score i specification, i~1,2, Á Á Á ,5. The corresponding results are given in Table 2.
Taking the exponential decay of power law into account, M~25 is utilized in these estimations. To compare prediction performances with the best team, the corresponding specifications are also included in Table 2, obtained directly from the web site of the DREAM project. The best values of the AUROC and the AUPR specifications for each network are written in boldface. In addition, the corresponding p-value for each specification is given in parentheses. In the last column of Table 2, the obtained scores are also given for each method. Furthermore, the optimal k min and c for each network are given in the last two lines. From results of Table 2, it is clear that compared with the method adopted by the best team, although there are networks with which the AUROC specification of the suggested method is slightly worse, its AUPR specification is much better than the best team for every network. Therefore, the final score for the suggested method is greater than the best team.
It is worthwhile to note that in actual applications, the optimal k min and c are usually not available. On the other hand, it is currently well known that for most biology systems, the parameter c belongs to the interval 2,3 ½ [23]. To test practical effectiveness of the suggested method, its estimation performances with the power law parameters taking some typical values, i.e., k min [ 1,2,3 f gand c[ 2,3,4,5 f ghave been studied. The corresponding results are given in Table 3. For each case, the AUROC and the AUPR specifications with the corresponding p-value written in parentheses are presented. And, in the last column of Table 3, the obtained scores are given for each case. In addition, similarly to Table 2, the prediction specifications of the best team are also included in Table 3. It is obvious that the performance of this step is affected by the parameters of the power law. Although estimation performance deteriorates slightly when k min and c deviate from their optimal values, it is still better than the available methods.
The ROC curve and PR curve of each network with empirical and optimal power law parameters are presented in Figure 3. Here, the empirical power law parameter means that k min~3 and c~4 for every network. From Multifactorial Data to Networks Figure 3 show that the precision specification is also very well, when the weight matrix is adopted as W W ½3:5 . More importantly, the third step of the proposed method may guarantee that the PR curve starts from the left upper corner. This phenomenon is verified by Figure 4. Figure 4 contains two PR curves. The one is Net4 without the third step, while the other is also Net4 when its weight matrix is adopted as W W ½2 . It is clear that the PR curve of W W ½2 starts from the left upper corner. This feature is a good guidance to biological experiment designs.
Most large scale networks may have the sparse property, which may be approximated by the power law. The developed algorithm has quantitatively employed this property by constructing a 0-1 integer programming problem. Consequently, direct regulation genes for an arbitrary gene can be (sub)optimally estimated. Furthermore, this information is incorporated into the developed algorithm by the manipulation of Equations (13) and (14). That is the reason why the propose method has a property of high confident predictions. On the other hand, there are some potential risks when the third step is used. Specifically, when the distribution of in-degree is not accurate, the prediction accuracy of W W ½q may deteriorate with respect to W ½q . For example, when k min~2 and c~5, the final score of W W ½3:5 is less than W ½3:5 . Therefore, it is suggested that when the in-degree distribution is unreliable or unavailable in practice, the operations of the third step should be used with caution.

Performances on the DREAM5 Network Inference Challenge
To evaluate the performance on real data, tests are performed on the DREAM5 Network Inference Challenge. Here, all of gene expression data offered by the DREAM5 organizers are regarded as multifactorial perturbation data. To better reconstruct the real GRNs, some special improvements are taken into consideration. First, the networks in the DREAM5 Network Inference Challenge are much more complicated than those in the DREAM4. The function h ik x i ð Þ may not be properly approximated by its first order Taylor expansion. In general, if the order of the Taylor series is high enough, h ik x i ð Þ will be obtained precisely. However, this treatment may bring some adverse impacts. Especially, when h ik x i ð Þ is approximated by its fourth order Taylor expansion, the matrix inversion operation will be infeasible when the least squares estimation is used. Therefore, we use the third order Taylor expansion to approximate it, i.e., With the help of the Least Squares, the coefficients in above equation and the sum of squared residuals SSE ik can be obtained. Second, consider two genes i and j. Assume that gene k regulates gene i and gene l has no direct effect on gene j. And, suppose r ki &r lj and SSE ki is slightly smaller than SSE lj . In this case, w lj may be very close to w ki in the weight ranking list. To overcome this drawback, the factor exp ({SSE ik ) in Equation (7) 1=3 is 0.0190. In general, the value of m is larger than 1, but it can not be too large, to avoid Dr ik D m tending to 0. Based on our computational experience, when m is set as 4, the performance is improved significantly. Therefore, Equation (7) is replaced by the From Multifactorial Data to Networks following expression: Due to the reason that the in-degree distribution is unreliable (unavailable), the operations of the third step are canceled. The prediction performances of W ½2 are shown in Table 4.
The final score in Table 4 is better than the third team. Further more, the improved method is also tested on the DREAM4 In Silico Size100 Multifactorial subchallenge. The final performances are represented in Table 5, and the estimation performances of the improved algorithm significantly outperform the best team. These results shows that our improved method may be competent to infer gene regulatory networks.

Computation Time
In this section, the computational complexity of the proposed method is discussed. It is well known that integer programming is an NP-complete problem and there is no known polynomial-time algorithm to solve it [18,19]. Therefore, we only discuss the computational complexity of the first two steps. The main calculating module is the least squares estimator. More precisely, this estimator involves a large matrix multiplication operation, for instance K T K. Here,  where, N represents the experiment number, and r represents the order of the Taylor series. Therefore, for a particular network including p genes, in which the number of transcription factors is f , the computational complexity of the proposed method is O (rz1) 2 pfN À Á . In general, r%p,f ,N, that is, the computational complexity is O pfN ð Þ. Using the first order Taylor expansion, the computation time for each network in the DREAM4 In Silico Size100 Multifactorial subchallenge is respectively: 0.1047 s, 0.1054 s, 0.1042 s, 0.1052 s, and 0.1046 s. While, using the third order Taylor expansion the computation time is respectively: 0.7725 s, 0.7285 s, 0.7338 s, 0.7281 s, and 0.7272 s. For the DREAM5 Network Inference Challenge, the computation time is respectively: 65.5030 s, 378.9817 s, and 295.2905 s.
The computation is performed on a PC with Inter(R) Core (TM) i5-2400 CPU, 4 GB RAM, and Matlab 2008a.

Concluding Remarks
In this paper, an algorithm is developed for the GRN topology inference from steady state multifactorial perturbation data. The GRN inference problem among p genes is decomposed into p|(p{1) different regression problems. In each of the regression problems, the expression level of a target gene is predicted solely from the expression level of a potential regulation gene. For different potential regulation genes, different weights for a specific target gene are constructed by using the sum of squared residuals and the Pearson correlation coefficient. The larger the sum of squared residuals is, the weaker the direct regulatory interaction will be. And, the higher the Pearson correlation coefficient is, the stronger the rationality is for the application of the regression analysis. Then, the constructed weight of a gene is normalized. To employ the network sparse property quantitatively, a 0-1 integer programming problem is constructed. By solving this problem, direct regulation genes for an arbitrary gene can be estimated. Lastly, the normalized weight of a gene is modified, on the basis of the estimation results about the existence of direct regulations to it. These normalized and modified weights are used in queuing the possibility of the existence of a corresponding direct regulation.
Computational results with the DREAM4 In Silico Size100 Multifactorial subchallenge show that this method can outperform the available method, particularly in improving the AUPR specifications. Using the real data provided by the DREAM5 Network Inference Challenge, estimation performances can be ranked third. In addition, if the veracity of the prior structural information is certifiable, the third step of this method not only improve the final score but also could guarantee the PR curve starts from the left upper corner, which may be helpful in guiding designs of a biological validation experiment.
Although the computational results are promising, many important issues still need further efforts. Among them, how to utilize the experimental data to obtain the in-degree distribution of a GRN is currently under investigations.