Inference of Gene Regulatory Networks from Genetic Perturbations with Linear Regression Model

It is an effective strategy to use both genetic perturbation data and gene expression data to infer regulatory networks that aims to improve the detection accuracy of the regulatory relationships among genes. Based on both types of data, the genetic regulatory networks can be accurately modeled by Structural Equation Modeling (SEM). In this paper, a linear regression (LR) model is formulated based on the SEM, and a novel iterative scheme using Bayesian inference is proposed to estimate the parameters of the LR model (LRBI). Comparative evaluations of LRBI with other two algorithms, the Adaptive Lasso (AL-Based) and the Sparsity-aware Maximum Likelihood (SML), are also presented. Simulations show that LRBI has significantly better performance than AL-Based, and overperforms SML in terms of power of detection. Applying the LRBI algorithm to experimental data, we inferred the interactions in a network of 35 yeast genes. An open-source program of the LRBI algorithm is freely available upon request.


Introduction
Exploring the structure of Gene Regulatory Networks (GRN) is a key element in understanding gene functions, especially in some complex diseases [1][2][3]. Direct experimental methods to explore the relationships among genes are time-consuming and laborintensive. Statistical inference on GRN is a process of identifying gene interactions from limited experimental data using computational analysis, and is much more efficient.
Several models have been applied to describe the GRN. An intuitive and frequently applied method is to model the GRN as graphs [4][5][6], where the genes are considered as nodes and the interactions among them represented as edges. Several graphical methods, including directed acyclic graphs and directed cyclic graphs, have been proposed in [7][8][9]. GRN can also be modeled by the graphical Gaussian model [10], or the Bayesian network model [11]. Information theory, for instance, mutual information and synergy, can be also used to infer the GRN [12,13]. Due to high measurement cost of gene chip technology, only limited number of samples can be obtained. This limitation may result in low inference accuracy when applying synergy or mutual information to analyze the GRN.
In the last decade, Structural Equation Modeling (SEM) [14] has been used to infer GRN [9,15,16]. Exploiting genetic perturbation data and gene expression data, the work in [16] used SEM model via an adaptive Lasso based algorithm (AL-Based) to infer the networks. With simulations, the authors showed that the AL-based method had better performance than all other existing methods. With the two same types of data, Cai et. al.
introduced a sparse SEM model, and stated that their Sparsityaware Maximum Likelihood (SML) algorithm significantly outperformed all other algorithms, including the AL-based one [17,18].
In this paper, we also study the gene regulatory networks with SEM model using both genetic perturbation data and gene expression data, and transfer the SEM to a Linear Regression (LR) model through matrix transformation. In this transformation process, regulatory information in GRN will not be lost. Instead of ML approaches or classic Lasso methods, we propose an approach to infer the networks via the LR model by using a Bayesian method (LRBI). Simulations show that our LRBI algorithm is effective and reliable, and offers significantly better performance than the AL-based algorithm. Compared with SML, LRBI has significantly better performance in terms of power of detection, but has slightly worse performance in false discovery rate. LRBI also has the advantages that the estimation of the initial parameters and the consideration of the data sensitivity are not needed.

Model and Methods
The LR model for gene network inference We consider m genes, n individuals' measurement using microarray. Without loss of generality, we assume that there are m makers. As in [9,16,17], the GRN obeys the form of SEM, where genes are the nodes, and interactions among genes are the edges, i.e.
where P is an m|n matrix, p kj is the jth expression level of the kth gene; B is an m|m matrix, defining the structure of the gene regulatory networks, b kj is the regulatory effect of the jth gene on the kth gene; X is an m|n matrix, x kj is the genotype of the kth marker in the jth perturbation; A is an m|m matrix representing the effect of each eQTL; e is an m|n matrix, and e kj is the jth measurement noise of the kth gene. All elements in e are independent and identically distributed (i.i.d).
We assume that there is no self-loop of each gene, so that all diagonal entries of B are zeros. We also assume that each gene has its own corresponding QTL, and the loci of the m eQTLs have been determined by an existed method, but the effects of these eQTLs are unknown yet. Therefore A has m unknown entries, and all other entries are zeros. Without loss of generality, we assume that all the unknowns in A are the diagonal entries.
With the predetermined eQTLs matrix X and the gene expression data P, the inference for GRN is to determine the unknown entries of B and A with appropriate optimization methods.
Since all the unknown parameters are in (B,A), (1) can be written as follows

P~PVze ð2Þ
where P is still the m|n matrix defined above, P~BA ð Þ, V~P X . We further rewrite (2) to where Y k is the kthrow of P, L k is the kthrow of P, e k is the kthrow of e. By the definition and the structure of (3), we can infer the parameters row by row. Therefore, the problem can be decomposed into Y k~Lk Vze k ,k~1,2,:::,m ð4Þ In (4), the parameters that need to be inferred are L i,j ,i,j~1,2,:::,m,i=j, and L izm,i ,i~1,2,:::,m.

Bayesian inference for the LR models
In gene regulatory networks, most entries of L k are zeros, so L k is sparse. Therefore, we assume that all entries of L k follow Gaussian distribution with mean zeros. We also assume that entries of e k are i.i.d, and normally distributed with mean zeros and variance Y k~Q ek I, where I is an n|n identity matrix.
With known V, the parameters to be estimated in (4) are h k~Lk ,Y k ð Þ. The joint prior distribution can be factorized as: Rewriting (5) leads to We assume that L k ,Q {1 ek À Á has a joint prior distribution of Gaussian-Gamma [19], with where a 0ek ,b 0ek ,L 0k are hyper parameters, should be preset to fixed values. H 0yk is a symmetric positive definite matrix. We will set it to an identity matrix in the implementation of algorithm for simplification.
The likelihood is where V i is the ith column of V. The joint posterior distribution of L k ,Q ek ð Þ is proportional to the product of the prior and the likelihood According to the prior distribution in (7,8) and the likelihood in (9), the joint posterior distribution (10) can be written as More details of the derivation can refer to Text S1. A similar result was shown in [14,20], where (12,16) were parts of an iterative process to solve the Confirmatory Factor Analysis (CFA) model.
We can sample the posterior distributions in (12) and (13) to constitute an iterative process. Since the values in (14), (16) are all determined, the parameters of the Gamma distribution in (12), 2 {1 nza 0ek and b ek , are all fixed. As a result, the posterior distribution of Q {1 ek will not be affected by the samples of L k . It is noted that only Q {1 ek ,L k can be sampled, therefore, the iterative process may be not effective and accurate. Thus, we modify the calculations of a k ,b ek in (15) and (16), and substitute L 0k by L k .
The combination of (12,13) and (17,18) forms an iterative process. We execute this iterative process with a sufficient number of times, and until a steady state is reached. A sequence of sets of L (i) k are obtained by sampling from the posterior distribution in (13), which are then averaged to get the estimated parameters of L k . To get accurate results, we must guarantee that the iteration reaches its steady state. A simple stopping condition is to test the value of the square difference of the inferred parameters between two successive iterations, i.e.
If the difference is small enough (say tv0:001), the iteration has reached a stable state. The choice of t can influence the accuracy. The smaller t is, the higher the accuracy of the parameter approximation is, naturally at the cost of more iterations and increased computational time.
The sketch of an algorithm is as follows: Input the eQTLs matrix X, and the gene expression data P; Set the initial hyperparameters a 0ek~n =5, b 0ek~1 , H 0yk~I . L 0k is set to a 1 Ã 2mvector, where only the kth entry is 1, all other entries are zeros. k~1,2,:::,m; Assign a small value to t.
For k~1,2,:::,m Calculate A k ,a k ,b ek by (14,16); Repeat: 1. Get the sample of Q {1 ek from the Gamma distribution by (12); 2. Get the sample of L k from the Normal distribution by (13); 3. Calculate a k ,b ek by (17)(18) 5. If Svt,then end the iteration, else go to step 1; End for More details of the algorithm implementation can refer to the software package F1.
It should be noted that it would be better to choose the second half of the samples and average them to get accurate result. The reason is that at the beginning of the iteration, the gap between the estimated L (i) k and the true values is large.

Simulations
Let Logsdon and Mezey [16] had shown that the AL-based algorithm outperformed the PC-algorithm [21,22], the QTLnet algorithm [23], and had comparable performance with the QDG algorithm [24]. Cai et al. stated that their SML algorithm offered significantly better performance than the AL-based algorithm and the QTL algorithm in PD and FDR [17,18]. Therefore, we shall compare our LRBI algorithm with SML and AL-based.
Firstly, we carried out simulations following the setups in [16]. We simulated two types of directed acyclic gene networks: one with 10 genes and the other one with 30 genes. Averaged N e~3 edges were created per gene, which meant that there were on average 3 edges created between one gene and all other genes. If an edge existed from node j to node i, then b ij was sampled from a uniform distribution on the interval {1{0:5 ð Þ | 0:51 ð Þ; otherwise b ij was set to 0. Entries of X took values from the set 1,2,3 f gwith the corresponding probabilities 0.25, 0.5, and 0.25 respectively. Each gene has its own corresponding QTL, and A is assumed to be an identity matrix. Each entry of e in (1) was sampled from a Gaussian distribution N 0,0:01 ð Þ. P was calculated by (1).
We generated cyclic or acyclic networks for simulations, and used LRBI to infer the parameters of the simulated networks. For cyclic networks,LRBI can obtain the steady-state solutions naturally. By inference, the steady regulatory relations can be got, if some cyclic regulatory relations among genes existed.
Due to the inference characteristic of Bayesian methods, the estimated parameters are not regressed to zeros as in Lasso methods. Therefore, an edge from gene j to gene i is considered to be present if b 0 ij w0:05, otherwise, there is no edge from gene j to gene i.
Simulation results for the setups described above are shown in Figure 1, where (a) and (b) are for the gene network of m~10, (c) and (d) are for the gene network of m~30. LRBI has a better performance than SML in terms of PD, but SML outperforms LRBI algorithm in terms of FDR. Both LRBI and SML significantly outperform the AL-Based algorithm in terms of PD and FDR. The PD of LRBI reaches 1 when the number of samples is 20 or more for both the two scenarios m~10 and m~30.
Secondly, we simulated two types of directed cyclic gene networks: one with 10 genes and the other one with 30 genes. Averaged N e~3 edges were created per gene. We employed the same procedure used in the acyclic scenario to generate B,A,X,P,e. Simulation results are shown in Figure 2, where (a) and (b) are for the gene network of m~10, (c) and (d) are for the gene network of m~30. LRBI has significantly better performance than SML in terms of PD, and SML outperforms LRBI algorithm in terms of FDR. When the number of samples is large enough, the FDRs of LRBI and SML are all close to zeros. Both LRBI and SML significantly outperform the AL-Based algorithm in PD and FDR.
Thirdly, we simulated the impact of different decision thresholds on performances. We used a bigger network m~100. Averaged N e~3 edges were created per gen, and the variance of noise was 0.01. Three decision thresholds j~0:05,0:1,0:2 were simulated.
In simulations, if we found that b 0 ij ƒj, then we set b 0 ij~0 . A directed acyclic network and a directed cyclic network were simulated and the results were separately shown in Figure 3  We continued the simulations with a even bigger network with N e~3 , m~300 to study the impact of decision thresholds on performance. The variance of noise was 0.01. Two decision thresholds j~0:1 and j~0:2 were simulated respectively. Both directed acyclic network and directed cyclic network were simulated, and the results were separately shown in Figure 4 (a) (b) and Figure 4 (c) (d). As confirmed by Figure 3 and Figure 4, a large decision threshold can reduce the FDR, but it also lowers the PD. Therefore, the decision thresholds used in simulations or applications should be chosen carefully.
Finally, we evaluated the impact of noise levels on the performance of LRBI. Here, we used networks with m~100, N e~3 . Again, we applied LRBI to both directed acyclic and directed cyclic networks. The variance of noise was set to 0.01, 0.05, and 0.1 respectively. The simulation results are shown in Figure 5, where (a) and (b) are for directed acyclic network, (c) and (d) are for directed cyclic network. We find that the PD performance is always excellent, but the FDR of LRBI is worse when the noise level increases, even when the number of samples is relatively large.
We have stated that LRBI cannot infer parameters to zeros automatically, but can infer them with high precision. In most of the simulations we conducted, the decision thresholds are 0.05. That is to say, if the value inferred is lower than 0.05, the entry is considered to be zero. This implies that the numerical difference between the inferred value and the original value is less than 0.05 for most of the entries in regulatory networks. We define that numerical difference as INEr i,j ð Þ~b ij {b 0 ij . Through simulations, we found that INEr was also very small for the entry whose value is nonzero in the original network. This feature is very meaningful, because the inferred parameters can accurately indicate the regulatory relationships among genes. An acyclic network was simulated, with m~30,N e~3 , sigma2 = 0.01, decision threshold j~0:1. Some results are shown in Table 1.

Case study
Here, we applied LRBI to infer the gene regulatory networks using the gene expression data and the genetic makers, which were assayed in 112 segregants of a cross between the yeast strains BY4716 and RM11-1a [25]. The cross had 5727 genes with small number of samples, so a pretreatment process was needed to select strong cis-eQTLs and interactions among genes [16]. We dealt with the filtered dataset provided by Logsdon [16], in which only 35 genes were used . The 35 yeast genes are SEO1, NUP60,  RCY1, IRC18, TPK3, PHD1, JLP1, SNF7, PCD1, RPL19A,  SEN1, OST6, BUB2, BUL1, PHA2, ORC5, FYV6, SLM2,  HAL9, RDL1, POC4, ASA1, ECM13, TYR1,       With 112 samples for these 35 genes, and the eQTLs data, we inferred the regulatory network as shown in Figure 6. It is noted that our algorithm doesn't need to assume the network is cyclic or acyclic. There are 145 edges in the inferred network. A total of 31 genes are regulators of at least one target, and 32 genes have at least one regulator. A total of 28 genes occur both as regulators and targets.
There were only 4 instances of reciprocal regulation (two genes act on each other) presented: ORC5 / ?   Among the 145 regulator-target pairs, there are 78 positive regulations, and 67 negative regulations. To verify the inference result, we used the Generate Regulation Matrix tool in the website of YEASTRACT [26] to create the gene regulatory network with the 35 selected yeast genes described above. In the network generated by the tool, there are only three regulatory relationship, HIM1 regulated by HAL9 [27], YEL073C regulated by PHD1 [27], and FYV6 regulated by PHD1 [28]. From the experimental yeast data, we deduced two out of the three regulatory relationships, HIM1 regulated by HAL9, YEL073C regulated by PHD1.

Conclusion and Discussion
We modeled the gene regulatory networks by using a LR model, and proposed a Bayesian method to complete the inference. We conducted a series of simulations to evaluate the performance of the proposed algorithm LRBI, and compared LRBI with another two algorithms, the AL-Based and the SML algorithms. LRBI had a significantly better performance than AL-based regarding to both PD and FDR. Compared to SML, LRBI showed a better performance in PD and slightly worse in FDR. This feature of LRBI makes more sense. Considering two cases, one is that we can find less false edges but loss more true edges, the other one is that we can find more, or even all true edges among genes, but with slightly more false edges, the latter one is more meaningful.
The proposed algorithm was accurate, and the gap between the inferred and the original parameters was less than 5% (even 2%) in most case. The proposed algorithm was also very effective. We inferred the GRN of the 35 yeast genes in a short time (1.2 seconds in a laptop), while for the SML algorithm, a program error occurred after about 52 minutes' run with the same 35 yeast genes data set. LRBI also had the benefit that the dependency of the performance on the estimates of initial parameters is not strong. For simplicity, we just assign some constants to these parameters in simulations and case studies. Therefore, the LR model and the LRBI algorithm can provide an effective way of exploiting both gene expression and perturbation data to infer GRN.
The reason our method seemed to perform better was that, LRBI fully exploited the structure of the SEM, and transformed it into a linear regression model without information loss, while AL-Based only partly exploited the structure of the SEM and used the adaptive Lasso to infer the networks, so LRBI was more effective. LRBI used the Bayesian method, while SML essentially used the maximum likelihood method to infer the GRN, therefore SML was not efficient and sensitive to data. However, there are many other methods for linear regression problems, such as hierarchical Bayesian, variational approximation, and so on. These methods can potentially improve the inference accuracy of GRN with the linear regression model proposed by this paper.
However, the FDR of LRBI is considerably high when the noise level is large, and another issue is the ability of dealing with largescale gene networks. Thus, a future work is to decrease FDR in high-noise context, and apply new strategies to handle large-size gene networks.

Supporting Information
Text S1 Bayesian Inference for the Linear Regression Model.