A robust gene regulatory network inference method base on Kalman filter and linear regression

The reconstruction of the topology of gene regulatory networks (GRNs) using high throughput genomic data such as microarray gene expression data is an important problem in systems biology. The main challenge in gene expression data is the high number of genes and low number of samples; also the data are often impregnated with noise. In this paper, in dealing with the noisy data, Kalman filter based method that has the ability to use prior knowledge on learning the network was used. In the proposed method namely (KFLR), in the first phase by using mutual information, the noisy regulations with low correlations were removed. The proposed method utilized a new closed form solution to compute the posterior probabilities of the edges from regulators to the target gene within a hybrid framework of Bayesian model averaging and linear regression methods. In order to show the efficiency, the proposed method was compared with several well know methods. The results of the evaluation indicate that the inference accuracy was improved by the proposed method which also demonstrated better regulatory relations with the noisy data.


Introduction
The study of gene regulatory networks (GRNs) structure is important in understanding cellular function. GRNs are typically represented by graphs in which the nodes represent the genes and the edges show the regulatory or interaction between genes. There are many methods for inference of GRNS. One of these methods is computational methods. Many computational methods have been proposed in the literature to model GRNs. These methods can be classified into the co-expression based methods [1], supervised learning-based methods [2,3], modelbased methods [4,5] and information theory-based methods [6,7]. Co-expression based methods have low complexity but lack inference direction of interaction. The supervised learning methods such as GENIES [8] and SIRENE [9] require information about some interactions in order to learn the models.
Model-based methods can be categorized into ordinary differential equation [10], multiple linear regression [11], Boolean networks [12] and probabilistic graphical models including Bayesian Network (BN) and Dynamic Bayesian Network (DBN) [13]. They infer GRNs with a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 high accuracy and can identify direction of interaction. However, these methods are time consuming and require many parameters to be set up, and thus cannot be used for large-scale networks. There are two suggestions for addressing this problem: Searching the optimal graph from all possible graphs and using decomposition technique in regression based methods for network structure inference. The inference of regulatory interactions for N genes is decomposed into N independent sub-problems, with sub-problems inferring the regulators of a target gene. Narimani et all proposed a new Bayesian network reverse engineering method using ordinary differential equations with the ability to include non-linearity. In this method, Expectation Propagation is used for approximate Bayesian inference [14]. Due to Bayesian network (BN) methods cannot handle large-scale networks in [15] present a novel method, namely local Bayesian network (LBN), to infer GRNs from gene expression data by using the network decomposition strategy and false-positive edge elimination scheme.
There are many significant advantages to use Bayesian network model. Bayesian networks can be easily understood and allow researchers to use their domain expert knowledge for determine the Bayesian network structure. When sample size is small Bayesian networks are less influenced and they use the probability theory, which is suitable for dealing with noise in biological data. Furthermore, Bayesian networks where complete data are not available, can produce relatively accurate prediction. Although a few disadvantages exist, such as computational complexity and need to set many parameters, therefore they cannot be used for largescale networks.
To address this problem within this context, this paper presents a new method that uses Bayesian model averaging based on Kalman filter and linear regression to infer GRNs. In this method, a new solution is applied to calculate the posterior probabilities of the edges from possible regulators to the target gene, which leads to high prediction accuracy and high computational efficiency. This method is the best performer among well-known existing methods in the DREAM4 in silico challenge and IRMA Dataset [16][17].
Another important category of GRN inference methods is based on regression methods, which are used to predict one target gene based on one or more input genes such as artificial neural networks (ENFRN) [18], support vector machines (SIRENE)], rotation forest (GEN-IRF) [19], random forests (GENIE3) [20] and Bayesian Model Averaging for Linear Regression (BMALR) [21].
Furthermore, information theory-based methods are used for inferencing GNRs, such as conditional mutual information (CMI) [6] and mutual information (MI) [15]. This method can be used for large scale networks.
MI measures the dependency between two genes. A higher mutual information value for two genes shows that one gene is related with the other. However, MI cannot distinguish indirect regulators from direct ones. Consequently, this leads to possible false positives [22]. Although CMI-based methods are able to distinguish indirect regulators from direct ones, they cannot locate the directions of interactions in the network and also in some cases underestimate the interactions strength.  [27] assume that correlation between genes expression are indicative of a regulatory interaction. In [28] for capture coarse-grained dynamics propose a new mutual information based Boolean network inference (MIBNI) method. In This method, using mutual information first selected a set of initial regulatory genes, and then by iteratively swapping a pair of genes between the selected regulatory genes and the other genes, improves the dynamics prediction accuracy.
The rest of this paper is organized as follows. Details of the Kalman filter are given in section two. Conditional Mutual Information is given in section three. The proposed method is presented in section four. In section five, the results of the proposed method are shown on data collection DREAM4 and other datasets. Finally, conclusions are summarized in section six.

Kalman filter
To infer gene regulatory network, one way is to find the Bayesian network structure. This is normally achieved by maximizing the likelihood of the observed dataset (maximum likelihood) or the posterior probability of the structure given the observed data (maximum a posteriori). In this paper, because the data are time series and contain noise, the Kalman filter is used to find the Bayesian network structure [29].
Kalman filter is an algorithm that uses a series of measurements observed over time, containing statistical noise. Applying the Kalman filter for this purpose, assuming input and output, is given as: where F k is the state transition model which is applied to the previous state x k , G k is the control matrix which is applied to the w k . w k is the process noise which is assumed to be drawn from a zero mean multivariate normal distribution with covariance Q k . H k is the observation model which maps the true state space into the observed space, D k is the control matrix which is applied to the v k , and v k is the observation noise which is assumed to be zero mean Gaussian white noise with covariance R k [29]. The details of Kalman filter is shown in Fig 1. First, prior probability p(x k−1 |y k−1 ) has a random value and is related to prior knowledge as follows: pðx kÀ 1 jy kÀ 1 Þ $ Nðx kÀ 1jkÀ 1 ; p kÀ 1jkÀ 1 Þ ð2Þ Prediction and updating phases can alternatively be used to calculate the posterior probability. In the prediction phase, the value of p(x k |y k−1 ) is obtained as follows: Instead of predicting this probability, the mean and variance, x k|k−1 and p k|k−1 , are predicted respectively. In the updating phase, the value of p(x k |y k ) is obtained as follows. As a matter of, in this step the parameters of the mean and variance of the posterior probability are updated.
This probability is achieved over time and periodically until the posterior probability is calculated. By using the following equations, the mean and variance can be obtained as follows: x kjk ¼x kjkÀ 1 þ K k ðy k À H kx kjkÀ 1 Þ ð7Þ Where K k is Kalman rate and it is calculated as follows:

Conditional mutual information
Mutual Information (MI) and Conditional Mutual Information (CMI) have been used to construct GRNs [30] owing to their ability to detect nonlinear dependencies between genes with Gaussian noise. The mutual information (MI) is a measure of the mutual dependence between two genes X i and X j . Thus, its value can be used to evaluate the strengths between genes. For measuring the conditional dependency between two genes X i and X j given another gene X k , CMI can be used, which can quantify the undirected regulation. For discrete variables X and Y, MI is defined as [31]: where p(x) and p(y) are the marginal probability distributions of X and Y, respectively, p (x, y) is the joint probability distribution of X and Y, H(X,Y) is the joint entropy of X and Y, and H (X) and H(Y) are the entropies of X and Y, respectively. CMI between two variables X and Y  where H(X,Z), H(Y,Z) and H(X,Y,Z) are the joint entropies, and p(x,y|z), p(x|z) and p(y|z) are the conditional probability distributions, respectively.

Proposed method
KFLR constructs GRNs using Bayesian network and linear regression which lead to a directed graph of regulatory interactions between genes with high accuracy. This method mainly consists of three distinct phases. In the first phase prior knowledge is extracted from the data using MI, then in the next phase Bayesian network is constructed based on prior knowledge and Kalman filter, and in the last phase the network is modified using CMI. The proposed method is described in Fig 2. In the next subsection, detailed description of each of these phases will be presented.

Phase 1: Knowledge extraction with MI
In this step, the MI values between all genes are computed and the knowledge matrix is created. If MI (i, j) is smaller than a threshold, the cell (i,j) in the knowledge matrix will be zero, otherwise this cell is one. Phase 2: Building Bayesian network using Kalman filter. For inferring gene regulatory network, the proposed idea is based on prior knowledge from the knowledge matrix, and Kalman filter is used to construct Bayesian network. The proposed method integrates a Bayesian model averaging method with a linear regression approach. A new method is used to calculate the posterior probability edges based on Kalman filter. In the proposed method, linear regression is used on the target gene and all combinations of other genes. The final score of the edge between the parent and target genes, is the sum of the all posterior probability of the linear regression models containing this edge.

Bayesian model averaging
One methods for inference of gene regulatory network is finding the structure N of Bayesian network that better explains the data. There are many methods for finding Bayesian network structure such as maximizing the likelihood of the observed data (maximum likelihood, ML) or the posterior probability of the structure N given the observed data (maximum posteriori, MAP). This paper makes use of Kalman filter in achieving the posterior probability. There exist a lot of Bayesian network structures that best describe the data when the number of observations in gene data are limited. We can find best structure using heuristic search.
But the heuristic search methods have high computational complexity and do not guarantee global optimal. Thus, Bayesian model averaging can be used instead of searching for the best structure among the existing Bayesian structures. In other words, the probability of an edge (f) given the observed dataset (D) between node i and j in a structure (N) can be calculated with the posterior probability of f: This probability shows the posterior probability of f given the observed dataset (D). In this equation, if the Bayesian network N contains edge f, f (N) equals to 1, otherwise it is 0. Therefore, 100 Bayesian networks are built with different structures using Kalman filter and then the final score edge from node i to node j, can be obtained based on Bayesian structures posterior probability of this edge. In other words, in the construction of Bayesian networks using Kalman filter theory, each node X i , has a probability distribution P(X i |Parents(X i )), that shows the effect of parent nodes on this node to be numerical. In this step, the parent sets which are obtained with prior knowledge of the first phase, are checked and genes having smaller MI degree than a threshold are not selected with gene X i . In order to have an accurate estimate for posterior probability of the edge, with k-nearest neighbor (kNN) method, the network is decomposed into a set of smaller sub networks according to the relationship among genes in the network. In the graph structure, according to their shortest path distance the k nearest neighbors of each gene are selected. In this paper, the k-nearest neighbors with k = 2 containing the Markova blanket of the gene are applied for each gene in order to decompose a global network to a set of sub networks.
where X i is the expression level of gene i, w ji is a weight between gene i and j and showing the effect of gene j on gene i. If w ji is zero, then in the gene regulatory network there is no edge from j to i. If w ji is non-zero, j is one of the i's candidate regulators (parents) and ε i denotes the noise. The posterior probability for each edge calculate base on the sum of the posterior probabilities of all the sub structures containing the edge [19]. Using the following equation, the posterior probability of an edge feature f is calculated: where S i is the set of all possible parent sets of X i . X i is the target of the edge feature f. D Pa;x i denotes the data restricted to X i and the genes in Pa. N Pa is a sub structure that is composed of the edges from the genes in Pa, a parent set of gene X i . If the sub structure N Pa contains f, f (N Pa ) is equal to 1, otherwise it is 0.

Phase 3: Modifying the network
After gene regulatory network inference, the network is modified to achieve better results. MI method commonly cannot estimate the regulation degrees between genes. Because it does not consider the joint regulations into two or more genes, the rate of false positive edges is high. In this phase, by computing the first-order CMI (i, j|k) and second-order CMI (i, j|k, l), false positive edges are removed. By so doing, if CMI (i, j|k) (or CMI (i, j|k, l)) is smaller than a threshold α, the edge between genes i and j is removed from the network.

Data set
The Another dataset we have used is the IRM dataset. IRMA network is a subnetwork embedded in Saccharomyces cerevisiae which consist of 5 genes: CBF1, GAL4, SWI5, GAL80, and ASH1. Gene expression data are time-series and include switch-off data and switch-on data. The switch-off data is taken from 4 experiments and the switch-on data is taken from 5 experiments, with a total of 142 samples measured (see S1 Data) [33].

Performance metrics
The proposed method is evaluated using the area under the precision versus recall curve and receiver operating characteristic (ROC) curve for the whole set of link predictions for a network.
A precision-recall (PR) curve plots fraction of retrieved instances that are relevant (Precision) versus the fraction of relevant instances that are retrieved (Recall), whereas a ROC curve plots the true positive rate versus the false positive rate [34]. To summarize these curves, the DREAM organizers proposed different statistics. AUPR and AUROC are respectively the area under the PR and ROC curve. AUPR p-value and AUROC p-value are the probability that random ordering of the potential links is given or larger of AUPR and AUROC.
The overall p-values: p aupr and p auroc of the five networks constituting each DREAM4 sub challenge were defined as the geometric mean of the individual p-values, as shown in Eq 15 [35]: The overall score for each method was the log-transformed geometric mean of the overall AUROC p-value and the overall AUPR p-value, as shown in Eq 16 [35]:

Performance comparison on the DREAM4 dataset
In this section, evaluation of five inferred sub network using the proposed method before and after adding noise into data has been studied and to demonstrate the performance, the proposed method has been compared with thirteen common methods in the field of gene regulatory networks construction. Methods used for comparison are as follows: GENIE3, is an algorithm for inferring regulatory networks from expression data using treebased methods. The implementation of matlab codes by its authors and with default parameters and protocols are used [18]. BMALR is an algorithm for inferring cellular regulatory networks with Bayesian model averaging for linear regression algorithm. The author's system code is used [19]. CLR [21], ARACNE [23] and MRNET [25] algorithms: These three algorithms have been implemented by the minet package into R Language. BGRMI, Bayesian Gene Regulation Model Inference, a model-based method for inferring GRNs from timecourse gene expression data. BGRMI uses a Bayesian framework to calculate the probability of different models of GRNs and a heuristic search strategy to scan the model space efficiently [36]. G1DBN is a method based on dynamic Bayesian network [37]. NARROMI is a noise and Inference gene regulatory network using Kalman filter redundancy reduction technique improves accuracy of gene regulatory network inference [5]. TIGRESS, this method solves the network inference problem by using a feature selection technique (LARS) combined with stability selection. In the method Web-based platform is performed [38]. GENIRF, this method decomposes the prediction of a gene regulatory network between p genes into p different regression problems. Each regression problem is constructed with singular value decomposition and rotation forest [17]. MIBNI, in this method, first selected a set of initial regulatory genes using mutual information, and then, improves the dynamics prediction accuracy by iteratively swapping a pair of genes between the selected regulatory genes and the other genes [26]. The implementation of java codes by its authors and with default parameters and protocols are used. FBISC, in this method, expectation propagation is used for approximate Bayesian inference [14]. The implementation of C# codes by its authors and with default parameters and protocols are used. CMI2NI, CMI2 is used to quantify the mutual information between two genes given a third one through calculating the Kullback-Leibler divergence between the postulated distributions of including and excluding the edge between the two genes [6]. The implementation of matlab codes by its authors and with default parameters and protocols are used. In the following, the results in the form of AUPR and AUROC values, ROC and PR curves are examined and an overall score is calculated for each method. As earlier mentioned, 5 sub networks in DREAM4 dataset were used for evaluation. The goal of each 5 sub network is finding the rank for edges and directional regulatory relations. Table 1 shows the AUPR and AUROC values for different methods in 5 sub networks without noise. Comparing stability Inference gene regulatory network using Kalman filter against noise, Table 2 shows AUPR and AUROC values for different methods in noisy data. It should be noted that 10% Gaussian noise with mean = 0 and standard deviation = 1 was added to the data. From the results, the proposed method is robust against noise than the other methods. The results show that the proposed method has higher accuracy, because of the use of the knowledge extraction phase in network constructions and removal of many false positives edges. Also, the use of Kalman filter probability theory can thus deal with noise data in which the Kalman filter removes noisy regulations.
According to Tables 1 and 2, the rate of improvement of the KFLR in sub network 1 is also less, while the rate of improvement is higher in sub network 2,3, 4 and 5, because of extracting more false positive edges. Therefore, with the more accurate obtained knowledge in the first phase, KFLR results to better network. In fact, in the Bayesian network construction phase using the Kalman filter, each node X i have one conditional probability distribution P(X i | Parents(X i )) which shows the effect of parents on this node numerically. In this phase, parents are selected with obtained knowledge from first phase and not allowed to select genes which are very similar to each other. So this work changes the value of relationships between one gene and its parents in comparison with inferred network by other algorithm. When the number of extracted knowledge increases, more improvement is achieved compared with other algorithm. In KFLR, the refining network using CMI coefficient is done. This phase will Inference gene regulatory network using Kalman filter improve the amounts of regulatory relations between pairs of genes using biology significant relationships between them and this work improves the results of each sub network slightly. Table 3 shows p-values of AUROC and AUPR for different methods and each subnet separately. This shows that the predictions of this method is significantly better than a random guess compared to other methods. The overall scores of each method in the whole network are shown in Table 4. The results indicate that the proposed method performs better than the other methods.
Recall and precision are the ratios of the numbers of correctly inferred interactions vs all interactions in the gold standard networks and the reconstructed networks respectively The Area under the PR curve (AUPR) provides an unbiased scalar estimate of the accuracies of the reconstructed GRNs. ROC curves (AUROC) is a measure of the overall performance of a model. Therefore, for better compression, ROC curves in noise data are drawn for three subnets and some methods is presented in Fig 3-5. According to all the figures, the KFLR method generally has a better result. Also, the PR curves in noise data are shown in Fig 6-8 for some methods and three subnets individually. According to all the PR figures, the KFLR approach in general has better and more accurate results.

Performance comparison on the IRMA dataset
The different GRN inference methods were applied to reconstruct the IRMA (In vivo Reverseengineering and Modeling Assessment) network. Table 5 shows the AUPRs of the GRNs Inference gene regulatory network using Kalman filter inferenced in the noise data and in the main data. In the main data, KFLR is competitive with BGRMI method when inferring the network from the switch-on data. In the case of the switchoff data, KFLR had the highest accuracy. But in the noise data KFLR outperform other method. These results show that KFLR performs well on in-silico datasets and on in-vivo experimental data.

Conclusion
In this paper, a new method was proposed to improve the accuracy of reconstructed GRN from time series gene expression data by using two approachs, i.e, the false-positive interactions deletion and the inference using model averaging. In this paper, by using CMI and MI, false-positive interactions were deleted and in the model averaging approach, Kalman filter was proposed to compute the posterior probabilities of the edges from possible regulators to the target gene with the combination of Bayesian model averaging and linear regression methods. The Kalman filter is a linear state-space model that operates recursively on noisy and time series input gene expression data to produce a statistically optimal estimate of the gene regulatory network. The results on the benchmark gene regulatory networks from the DREAM4 challenge and in Vivo IRMA Network showed that the proposed method significantly outperforms other state-of-theart methods. Also, it was established that this method is more robust to the noisy data.