Using Consensus Bayesian Network to Model the Reactive Oxygen Species Regulatory Pathway

Bayesian network is one of the most successful graph models for representing the reactive oxygen species regulatory pathway. With the increasing number of microarray measurements, it is possible to construct the Bayesian network from microarray data directly. Although large numbers of Bayesian network learning algorithms have been developed, when applying them to learn Bayesian networks from microarray data, the accuracies are low due to that the databases they used to learn Bayesian networks contain too few microarray data. In this paper, we propose a consensus Bayesian network which is constructed by combining Bayesian networks from relevant literatures and Bayesian networks learned from microarray data. It would have a higher accuracy than the Bayesian networks learned from one database. In the experiment, we validated the Bayesian network combination algorithm on several classic machine learning databases and used the consensus Bayesian network to model the 's ROS pathway.


Introduction
Reactive Oxygen Species (ROS) are formed as by-products of normal metabolism of aerobic organisms, they can react with DNA and produce damage [1]. Cells protect themselves from ROS by detoxification mechanisms and repair mechanisms [2,3]. Microarray is a powerful tool for measuring a large number of genes' expressions. Given the microarray expressions, it is possible to construct the regulatory pathway that organisms response to the oxidative stress directly.
An outstanding idea is the use of Bayesian network for representing regulatory pathway [4][5][6][7]. Bayesian network is a Directed Acyclic Graph (DAG) used for representing probabilistic relationships between variables. It was first proposed by Pearl [8], and Jensen [9] gave an intuitive definition. A lot of work has been done in the automatic learning of Bayesian network from database. Consequently, large numbers of Bayesian network learning algorithms based on different methodologies have been developed [10][11][12][13] and they have high accuracies in learning Bayesian networks from classic machine learning databases. However, when applying these algorithms to learn Bayesian networks from microarray data, the accuracies are low. Careful studies show that this is because the databases they used to learn Bayesian networks contain too few microarray data. On the other hand, microarray chip is expensive, it is difficult to obtain a large number of microarray data from one laboratory or one database, and a few hundred expression data can not guarantee a high learning accuracy.
To overcome this problem, we propose a consensus Bayesian network which is constructed by combining several Bayesian networks. This consensus Bayesian network is approximately equal to the Bayesian network learned from the database obtained by merging all these combined Bayesian networks' corresponding databases, then its equivalent database may have enough data and the accuracy can be improved. The main procedure of construction of consensus Bayesian network can be described as follow: (1) Review all relevant literatures and derive the Bayesian networks.
(2) Search microarray expressions which are not used in relevant literatures and download them to learn Bayesian networks. (3) Combine all these Bayesian networks to construct the consensus Bayesian network.
Combination of Bayesian networks includes combination of graph models and aggregation of probability distributions [14][15][16][17]. Utz [18] proposed a method to combine many different Bayesian networks into an undirected graph, and each edge in the graph has a weight represents the frequency with which the edge occurs in the component networks. Zhang et al. [19] proposed a method for fusing Bayesian networks. They construct an initial network based on the union and intersection of the Bayesian networks, and then search for the structure which maximizes the scoring function(CH criterion). Our Bayesian network combination algorithm is based on the properties of probability. Due to probabilistic independence, Conditional Probability Tables (CPTs) can be extended, then corresponding nodes' CPTs can be changed into a same form and the aggregation function can be applied to these CPTs. After extending every corresponding CPTs, the combination of Bayesian networks changed into the aggregations of every corresponding nodes' CPTs if these Bayesian networks' variables' prior orders are consistent with each other. Some nodes' CPTs were extended previously, so they may have bogus parents after combination, then we should find them, delete the bogus edges and simplify the CPTs. The combination algorithm can also be applied to the combination of Bayesian networks defined over different variable sets by using the extension of Bayesian network.
Escherichia coli MG1655 was used in the experiment, a constructed ROS pathway was derived from the literature wrote by Hodges et al. [20] and 612 microarray expression data were downloaded from the Many Microbe Microarrays Database(M3D) [21]. 27 genes were identified from the EcoCyc [22] ROS detoxification pathway. A consensus Bayesian network using the 27 genes as variables was constructed by combining the Bayesian network from the literature and the Bayesian network learned from the 612 microarray expressions. For demonstrating the combination of Bayesian networks defined over different variable sets, we used a prediction program to find genes may be involved in the ROS pathway, learned a Bayesian network which using the 27 genes and the newly found genes as variables, and then combined this Bayesian network and the Bayesian network from the literature to construct a new consensus Bayesian network.

Validation on classic machine learning databases
In order to validate whether the consensus Bayesian network BN c constructed by combining Bayesian networks BN 1 and BN 2 is equivalent to the Bayesian network learned from the database obtained by merging the two Bayesian networks' corresponding databases DB 1 and DB 2 or not, 6 databases were downloaded from the UCI Machine Learning Repository (http://archive.ics. uci.edu/ml/), and the databases of ALARM net and Chest-clinic net were generated by the BN PowerConstructor. For each database DB, we chose n samples (about 1=3*1=2 of the samples in DB) randomly and used them as DB 1 , the rest samples in DB were used as DB 2 . Two Bayesian networks BN 1 and BN 2 were learned from DB 1 and DB 2 , respectively. Consensus Bayesian network BN c was constructed by combining BN 1 and BN 2 . After that, another Bayesian network BN used as a reference was learned from DB, BN c was compared with BN and the proportion of the number of identical edges between BN c and BN to the total number of edges in BN c and BN (similarity S) was computed. The program was run 100 times to compute the average similarity. All results of the experiments are shown in Table 1. Table 1 shows that all the average similarities are greater than 75%. So, consensus Bayesian network BN c is approximately equal to Bayesian network BN. Although the combination algorithm is validated with 8 different databases and the types of data in these databases are very different, it doesn't affect the results. The Bayesian network learning algorithm just compute the distributions by counting the number of samples, and determine the relationships between the variables by analyzing the distributions. The Bayesian network combination algorithm is used to combine Bayesian networks and it doesn't involve the data. So, the type of data doesn't affect the validation.
Consensus Bayesian network BN c is approximately equal to Bayesian network BN, then we can view BN's database DB as BN c 's equivalent database, and DB have more samples than DB 1 or DB 2 . So, the use of consensus Bayesian network helps to solve the problem of lack of data in partial databases and the accuracy can be improved. The true structures of ALARM net and Chestclinic net are known. Then we compared the learned networks with the known networks, the results are shown in Table 2. Table 2 shows that BN c has a higher accuracy than BN 1 or BN 2 .

Construction of consensus Bayesian network for modeling Escherichia coli's ROS pathway
The consensus Bayesian network is constructed by combining Bayesian networks derived from literatures and Bayesian networks learned from microarray data. First, relevant literatures were reviewed and a ROS pathway was derived from the literature wrote by Hodges et al. [20], denoted as BN 1 . In the literature, 27 genes identified from the EcoCyc ROS detoxification pathway were chosen as variables and 305 microarray expressions were used to learn the Bayesian network. Second, microarray data was searched and a microarray expression build with 612 microarray expressions was downloaded from the M3D database. Then Bayesian network BN 2 which also uses the 27 genes as variables was learned from these microarray expressions. Finally, consensus Bayesian network BN c was constructed by combining these two Bayesian networks, and the result is shown in Figure 1. In the combination program, we take weights W 1~3 05, W 2~6 12 and threshold e~0:026.
A novel prediction algorithm based on the computation of mutual information was developed to identify genes which are strongly associated with a particular gene in the regulatory pathway. If R is a gene in the regulatory pathway, gene O is strongly associated with R, then O may work together with R and also be involved in the pathway. The main procedure of this algorithm can be described as follow: assume set R includes all the known genes in the regulatory pathway, and set O includes the rest genes of the organism. Choose one gene The program is ended until every gene in O has been tested.
27 genes identified from the EcoCyc ROS detoxification pathway were used as set R, while the rest genes in Escherichia coli were used as set O. The program found 4 genes may be involved in the ROS pathway, and the results are shown in Table 3

Discussion
In the discussion, we address this question: does the Bayesian network learned from microarray expressions match with a known regulatory pathway?
Before answering this question, we carried out an experiment. The procedure of the experiment can be described as follow: assume that V includes all of the genes of Escherichia coli, and then we construct an undirected graph Let C be the largest connected subgraph of G V . Then 99:3% of the genes in Escherichia coli were included in C. Mutual information I(X i ,X j )we means genes X i and X j are interacted, so this phenomenon shows that almost all genes in Escherichia coli are related directly or indirectly. We can infer that some genes may be involved in different regulatory pathways, simultaneously. Otherwise, if there is no gene be involved in more than one regulatory pathway, that is, the regulatory pathways in Escherichia coli have no intersection, then we can't observe the phenomenon that thousands of genes related directly or indirectly. On the other hand, before microarray measurements, the Escherichia coli was alive, so almost all of the regulatory pathways of Escherichia coli were at work. Then although two genes must be interacted if there is a directed edge between them in the Bayesian network, it is hard to determine the directed edge belongs to which regulatory pathway. For example, there is a directed edge between marA and marR in BN c (Figure 1), then there must be an interaction between marA and marR. They are involved in the regulation of transcription (EcoCyc database) and this biological process was working when measuring the expressions of these genes using microarray, therefore, the existence of marR?marA maybe due to that they are regulating the transcription. However, the ROS detoxification pathway (EcoCyc database) also contains marA and marR, then the existence of marR?marA maybe due to that they are regulating the response to the oxidative stress. So, it is hard to determine the directed edge marR?marA belongs to which regulatory pathway. If there is no edge between two genes in the Bayesian network, then the two genes are not interacted directly in any regulatory pathway. So, if a known regulatory pathway contains n genes and we use these n genes as variables to learn a Bayesian network from microarray expressions. Then all of the interactions between the n genes are contained in the Bayesian network, however, some of these interactions may not contained in this regulatory pathway. This means the regulatory pathway is a subgraph of the Bayesian network. Although the Bayesian network is not equivalent to the regulatory pathway, it still has important significance. With its guidance, the number of biological experiments could be greatly reduced when modeling a regulatory pathway.

Data preprocessing
The algorithms can only process discrete data in this paper. However, the 612 microarray expression data of Escherichia coli MG1655 downloaded from the M3D database are continuous. Then expression data for each gene was discretized using a maximum entropy approach which uses three equally-sized bins (q3 quantization). And the genes' expressions were divided into three categories: underexpressed, normal, overexpressed.
Usually, Bayesian networks derived from literatures only have a structure, then we have three ways to obtain the parameters: (1) If the program of the learning algorithm is available on the internet, then both the structure and the parameters of the Bayesian network can be obtained by run the program directly. (2) If the microarray data used in the literatures were collected in a database available on the internet, then we can download these microarray data to learn the parameters. (3) Sometimes the corresponding database is unable to be found, or the Bayesian network is not learned form database, but constructed by biological experiments directly. Then distribution for each node can be estimated by analyzing the genes' special characteristics and the relationships between genes.

Bayesian network
A Bayesian network defined over a variable set V can be represented as a pair (G,P), where G is a DAG and each directed edge in the DAG represents a dependence, P is a group of CPTs and each node in the DAG has a CPT. Usually, G is called Bayesian network's structure and can be represented as a pair (V ,E), where E is the edge set; P is called Bayesian network's parameter. G is a directed acyclic graph, that is, the nodes in G have a topological order, and we call it prior order. Let BN 1 and BN 2 be two Bayesian networks and their DAGs are G 1~( V 1 ,E 1 )    Figure 3(c), it shows B is a parent of A and x 00 means P(A~a 0 jB~b 0 )~x 00 . Assume that CPT cpt 1 represents P 1 (AjB,C), cpt 2 represents P 2 (AjB,C) and cpt 3 represents P 3 (AjC). Then cpt 1 and cpt 2 are two tables with the same structure and the conditional probabilities in the corresponding positions of the two tables represents the same conditional probability, so we say they have a same form. While cpt 1 and cpt 3 do not have a same form.

Bayesian network learning algorithm
Usually, Bayesian network is learned from database, it represents the probabilistic relationships between the variables in the database. So, a Bayesian network matches with a database, and we call this database Bayesian network's corresponding database. Bayesian network learning includes structure learning and parameter learning. We use an information-theory based learning algorithm proposed by Cheng et al. [11] to learn Bayesian network's structure in this paper.
Dependence between two variables can be quantitatively computed by using mutual information. Mutual information I(X i ; X j ) between two variables X i and X j can be defined as: where x i , x j are the expression values of X i and X j , respectively. Mutual information is non-negative, it means I(X i ; X j ) §0. I(X i ; X j )~0 holds if and only if X i and X j are independent. Given a threshold e (ew0), X i and X j are related if I(X i ; X j )we. Similarly, conditional mutual information I(X i ,X j jX k ) can be defined as: Then the main procedure of Cheng's Bayesian network structure learning algorithm can be described as follow: Step 1. Create initial undirected graph. A Maximum Weight Span Tree (MWST) [23] is used as the initial graph. Let L~f(X i ,X j )jX i ,X j [V ,I(X i ; X j )weg be an undirected edge list, where V is the variable set. Sort L in descending order of mutual information. For each (X i ,X j )[L, add it into the undirected graph(and delete it from L) if it doesn't form a circle. End this loop until the graph contains n{1 edges.
Step 2. Add edges. Assume that set D Xi (X i ,X j ) contains all the nodes which are in the paths between X i and X j and in the neighborhood of X i , simultaneously. D X (X i ,X j ) represents one of sets D Xi (X i ,X j ) and D Xj (X i ,X j ) which contains less nodes. For each (X i ,X j )[L, add it into the undirected graph(and delete it from L) if I(X i ; X j jD X (X i ,X j ))we holds.
Step 3. Remove redundant edges. For each edge (X i ,X j ) in the undirected graph, delete it if I(X i ; X j jD X (X i ,X j ))ve holds.
Step 4. Determine edges' directions. For each holds, where threshold dw0. Some undirected edges' directions can be determined by using Bayesian network's acyclic property. For the rest undirected edges, use the local Minimal Description Length (MDL) score [24] to choose the direction which makes the MDL score more smaller.    In Bayesian network parameter learning, the following equation is used to compute the conditional probabilities in each node's CPT: where N(Conditions) is the number of samples satisfies Conditions in the database. Corollary 1. Given A, B and any other variable C, then P(AjB,C)~P(AjC) (P(BjC)=0) holds if A and B are independent given C.

Extension and simplification of CPT
Suppose we have the CPT of node A as shown in Figure 3(a), it can be extended into the form as shown in Figure 3(b) if A and B are independent of each other. Since A and B are independent, B can not affect the distribution of A, then for Vb j [B, P(A~a i jB~b j )~P(A~a i ) holds. According to that, two CPTs of a same node in different Bayesian networks can be extended into a same form, and then can be aggregated even if the node does not have a same parent set in these Bayesian networks. Specifically, for a node A, and its parent sets are Pa 1 (A) and Pa 2 (A) (Pa 2 (A)=Pa 1 (A)) in BN 1 and BN 2 , respectively. Then the two CPTs of A do not have a same form, and the aggregation function can't be applied (See the CPTs of A shown in Figure 3(b) and Figure 3(c), they have a same form, then the aggregation function can be applied to aggregate the conditional probabilities in the corresponding position of the two CPTs, and the aggregation function can't be applied to aggregate the CPTs shown in Figure 3(a) and Figure 3(c)). However, we can take Pa(A)~Pa 1 (A)|Pa 2 (A) as the parent set and extend both the CPTs of A in BN 1 and BN 2 into form P(AjPa(A)), then the two CPTs of A have a same form and the aggregation function can be applied. This means we also view the nodes in Pa 2 (A){Pa 1 (A) as the parents of A in BN 1 , although they are not real parents and do not affect A's conditional probability. We call these parents bogus parents and the directed edges between a node and its bogus parents bogus edges. As shown in Figure 4(c), C is a bogus parent of A, and C?A is a bogus edge. Theorem 2 and Corollary 2 can be used to determine whether two nodes are independent of each other or not. Suppose we have the CPT of node A as shown in Figure 3(c), if x 00~x10 , x 01~x11 or they are approximately equal, it deduces that A and B are independent, and B is not the parent node of A. Then the CPT of node A can be simplified into the form as shown in Figure 3

Aggregation function
Assume that the conditional probability of node A in Bayesian networks BN 1 and BN 2 are P 1 (A~a i jPa(A)~p k )~x 1 and P 2 (A~a i jPa(A)~p k )~x 2 , respectively. Then in the consensus Bayesian network, the conditional probability of A can be computed using the following equation: where W 1 is the weight of BN 1 and W 2 is the weight of BN 2 . Weight W is a positive integer representing a belief to the Bayesian network. W 1 wW 2 means BN 1 is more reliable than BN 2 ; W 1 ?? means BN 1 is absolutely reliable.
Next, we would like to discuss why we choose this aggregation function. The combination of Bayesian networks must satisfies this property: the consensus Bayesian network BN c constructed by combining Bayesian networks BN 1 and BN 2 is equivalent to the Bayesian network learned from the database obtained by merging the two Bayesian networks' corresponding databases DB 1 and DB 2 . Then the aggregation function should satisfy it too. Assume that node X is not the parent of A in any Bayesian network, then in the consensus Bayesian network, X can not be the parent of A. Then CPTs of A in different Bayesian networks after extension not only have a same form, but also contains all of A's possible parent nodes. So, we needn't consider the nodes which are not included in the parent set of A when aggregating the CPTs. When computing the conditional probability of one node in Bayesian network, Equation (4) is used. The conditional probability of A in Bayesian networks BN 1 and BN 2 are P 1 (A~a i jPa(A)~p k )~x 1 and P 2 (A~a i jPa(A)~p k )~x 2 , respectively. Assume that the number of samples satisfy Pa(A)~p k in DB 1 is n 1 , then the number of samples satisfy A~a i and Pa(A)~p k in DB 1 is n 1 Ã x 1 ; the number of samples satisfy Pa(A)~p k in DB 2 is n 2 , then the number of samples satisfy A~a i and Pa(A)~p k in DB 2 is n 2 Ã x 2 . So, the conditional probability of A in the Bayesian network learned from the database obtained by merging DB 1 and DB 2 is: On the other hand, samples satisfy Pa(A)~p k in DB 1 and in DB 2 obey the same distribution. So, we have: where N 1 and N 2 are the total numbers of samples in DB 1 and DB 2 , respectively. Then the conditional probability of A changed to be: Total numbers of samples in databases are unable to be known sometimes, so we use the weights of the Bayesian networks instead of them, then Equation (8) changed into Equation (5). In the experiment, we still use the total numbers of samples as they are already known.

Combination of Bayesian networks
If two Bayesian networks are defined over the same variable set and their variables' prior orders are consistent with each other, then they can be combined using the method described as follow: Step 1. Extend every corresponding CPTs in the two Bayesian networks into same form. Then the structures of the two Bayesian networks are completely the same(although some of their edges are bogus edges).
Step 2. Use the aggregation function to aggregate the conditional probabilities in the corresponding positions of each corresponding CPTs.
Step 3. In each CPT after aggregation, compute variance D X for each parent node X , determine whether D X ve holds or not to judge node X is a bogus parent or not, then simplify the CPT and delete the bogus edge if D X ve holds.
After simplifying the CPTs and deleting the bogus edges, the consensus Bayesian network is obtained. Figure 4 shows an example of combination of two Bayesian networks. However, Bayesian networks' variables' prior orders do not always consistent with each other, then it needs to reverse some directed edges sometimes. The principle of reversal is to ensure that the Bayesian network after reversal is equivalent to the original Bayesian network.

Extension of Bayesian network
Sometimes the Bayesian networks going to be combined may not defined over the same variable set, then they need to be extended. Specifically, given two Bayesian networks BN 1 and BN 2 with their variable sets satisfy V 1 =V 2 and V 1 \V 2 =1, if their variables' prior orders are consistent with each other, BN 1 can be extended into BN 1 +BN 2 using the method described as follow: Step 1. Extend BN 1 's DAG G 1 into G 1 +G 2 . Let G 1 +G 2~( V 1 |V 2 ,E 1 ), and then add all the directed edges satisfy into graph G 1 +G 2 . These added edges are not in BN 1 originally, so we call them extended edges.
Step 2. Compute each node's CPT. For a node A[G 1 +G 2 , if A[V 2 {V 1 , then its CPT is the same as the CPT of A in BN 2 ; if A[V 1 and there is no directed edge satisfies f(C,A)[E 2 jC[V 2 {V 1 g, then its CPT is the same as the CPT of A in BN 1 ; if A[V 1 and has directed edges satisfy f(C,A)[E 2 jC[V 2 {V 1 g, in this case, there are three possible situations may appeared in the extended Bayesian network as shown in Figure 5. Then the conditional probabilities of A in these three situations can be computed using the following equations, respectively: In Figure 5(a) P(A~a i jB~b j ,C~c k )P 1 (A~a i jB~b j ) Ã P 2 (C~c k jA~a i ,B~b j ) P 2 (C~c k jB~b j ) P m i~1 P 1 (A~a i jB~b j ) Ã P 2 (C~c k jA~a i ,B~b j ) where m is the number of expression values of A. P 2 (C~c k jA~a i ,B~b j ) and P 2 (C~c k jB~b j ) can be computed using the standard Bayesian network inference algorithm [25] in BN 2 , while P 1 (A~a i jB~b j ) is already known in BN 1 .
In Figure 5(b) P(A~a i jB~b j ,C~c k )P 1 (A~a i jB~b j ) Ã P 2 (C~c k jA~a i ) P 2 (C~c k ) P m i~1 P 1 (A~a i jB~b j ) Ã P 2 (C~c k jA~a i ) P 2 (C~c k ) In Figure 5(c) P(A~a i jC~c k )~P 1 (A~a i ) Ã P 2 (C~c k jA~a i ) P 2 (C~c k ) P m i~1 P 1 (A~a i ) Ã P 2 (C~c k jA~a i ) If B and A are disconnect in BN 2 , it can only deduce that B and A are independent given C, however, it doesn't affect the conditional probabilities P 2 (C~c k jB~b j ,A~a i ) and P 2 (C~c k jB~b j ), then the conditional probability of A can be computed using Equation (9). If both B and C, B and A are disconnect in BN 2 , it deduces B and C are independent, then the conditional probability of A can be computed using Equation (10). After obtaining every node's CPT in G 1 +G 2 , the extension of Bayesian network BN 1 is finished.