Constructing Biological Pathways by a Two-Step Counting Approach

Networks are widely used in biology to represent the relationships between genes and gene functions. In Boolean biological models, it is mainly assumed that there are two states to represent a gene: on-state and off-state. It is typically assumed that the relationship between two genes can be characterized by two kinds of pairwise relationships: similarity and prerequisite. Many approaches have been proposed in the literature to reconstruct biological relationships. In this article, we propose a two-step method to reconstruct the biological pathway when the binary array data have measurement error. For a pair of genes in a sample, the first step of this approach is to assign counting numbers for every relationship and select the relationship with counting number greater than a threshold. The second step is to calculate the asymptotic p-values for hypotheses of possible relationships and select relationships with a large p-value. This new method has the advantages of easy calculation for the counting numbers and simple closed forms for the p-value. The simulation study and real data example show that the two-step counting method can accurately reconstruct the biological pathway and outperform the existing methods. Compared with the other existing methods, this two-step method can provide a more accurate and efficient alternative approach for reconstructing the biological network.


Introduction
One great challenge of postgenomic research is to explore complex biological pathways from genomic data such as DNA sequences, protein sequences, and gene expression profiles. The network building method is widely used throughout biology to reconstruct complex biological pathways.
We take MAPK pathway as an example. The MAPK/ERK pathway is a signal transduction pathway that couples intracellular responses to the binding of growth factors to cell surface receptors. Robert et al. [1] and related studies [2][3][4][5] based on biology experiments provide the MAPK pathway ( Figure 1).
It would be interesting if Figure 1 can be reconstructed in terms of their expression profile of Wsc1/2/3, Mid2,…, etc. To reduce the cost of experiments, one possibility is to predict the activation status of these genes through their microarray expression data for inferring the pathway.
There have been methods proposed in literature for reconstructing genetic regulatory networks in terms of microarray data. For instance, the Bayesian network model is an important technique that has been studied in the last two decades [6][7][8]. In addition, Wei and Li [9] proposed a hidden spatial-temporal Markov random field model to identify genes that are related to biological pathway. Allocco et al. [10] provided a variety of methods to find the gene-pairs with similarity relationship. Moreover, other algorithms using linear models [11,12], differential equation [11,13], neural network [14] and structural equation modeling [15] have been proposed to explore gene regulatory networks based on genomewide data. However, most of these methods have limitations in dealing with large-scale gene regulatory network because of their complex model structures. Also, careful discretization can be used to denoise high-throughput data. One such example can be found in Xing and Karp [16].
To overcome the disadvantage of the mentioned methods, we consider a simple model based on the Boolean network to reconstruct a large scale gene network in this study. Boolean networks have been proposed and investigated for a long time in literature. Kauffman [17,18] considered a dynamic version of Boolean networks. Liang et al. [19] proposed the algorithm construct a DAB network from the noisy array data. Since it involves noisy data, the reconstruction of the pathways cannot employ a deterministic inference. Instead, we need to establish a statistical model to capture its random characteristics. A DAB network is characterized by two kinds of pairwise relationships: similarity and prerequisite. The former represents a pair of elements with coherent on-off states. The latter is a partial order relationship, namely, the on-status of one element is a prerequisite for the on-status of another element. More specifically, if one element is a prerequisite to another element, the off-status of one element will restrict another element's off-status. A DAB network is uniquely determined by its state space: all possible on-off states subjected to the pairwise relationships.
Recently, a Boolean implication network is proposed with similar aspect as the DAB network, which investigates all Boolean implications between pairs of genes for large-scale genome microarray datasets [27]. For any pair of elements, they use two statistics to test whether there is any specific relationship between the pair of elements. However, the methods are more applicable for dealing with mass information of datasets.
The approach of building a DAB network based on the expectation-maximization (EM) algorithm to derive the maximum likelihood estimator [28] for a statistical model is established in Li and Lu [29]. Their strategy is to build up a statistical model with measurement error and assign scores for the possible relationships between two genes, and then use the scores to select the true relationship. This method involves more computation and cannot provide a simple closed-form statistic to recover the true relationships between genes.
In this study, we propose a simple method to estimate pairwise relationships between elements from noisy array data. The approach is based on two steps: the first one is to count the numbers of different pairwise relationships in a sample, and the second one is to test the relationship hypotheses according to their asymptotic p-values. Compared with the Li and Lu [29] method, this new approach has a simple closed form and it is not timeconsuming. In addition, the proposed counting approach shows substantial improvement compared to the Sahoo et al. [27] method. We conduct a simulation study to an example used in Li and Lu [29]. It is shown that the proposed method can recover all of the true relationships. A simulation study for a larger scale network is given in the supplementary material. In addition, the proposed method is implemented on the MAPK pathway example. It can recover 6 true relationships among seven relationships, however, Li and Lu's method only recovers one true relationship in this example. In this real data example, the new method shows a significant improvement in adopting a DAB network for exploring the pathway.

Methods
To describe the model and notations, we adopted a simple example used in Li and Lu [29] to illustrate the model assumption. Figure 2 shows the relationships of the seven elements in this example derived from the 13 states of Table 1. In the diagram, the notation A?B denotes that A is a prerequisite of B and the notation E{B denotes that B and E are similar. Note that A,:::,G in Figure 2 are called elements. The definitions of prerequisite and similar relationships for any two elements A and B are defined as follows.
Assume that an element only has two levels, on or off. We use ''0'' and ''1'' to represent ''off'' and ''on'' states respectively. For two elements A and B, A is a prerequisite for B if the on-state of A is necessary for the on-state of B, and we denote it by A[B. When A and B are on and off simultaneously, the relationship between A and B is called similar and is denoted by A*B: We define A A to be the dual state of A. It means that A A~0 when A~1.  There are 4 possible situations for the prerequisite relationship, and 2 possible situations for the similar relationship, see Table 2. Totally, there are 6 possible relationships for any two genes. The prerequisite relationship is a partial order. It is transitive on the ground-set, namely, A[B and B[C implies A[C. The notations ''z'' and ''{'' in Table 2 denote the possible states of A and B and the impossible states of A and B under the relationship, respectively.
Because of the misclassification error, m 00 may be split up into four categories. We use the notations m 00,00 ,m 00,01 ,m 00,10 and m 00,11 to represent the counts of four cells split from m 00 . Analogous notations are defined for m 01 ,m 10 and m 11 . Consequently, their generating probabilities are calculated as follows: q ij,kl~p Di{kDzDj{lD (1{p) 2{Di{kD{Dj{lD q ij . Here, we adopt the notation q ij,kl analogous to m ij,kl . The splitting counts and probabilities implied by misclassification error are given in Tables 4 and 5. Now we go back to the example of Figure 2 which includes 7 elements. There are a total of 2 7~1 28 states for a seven-element network. Only thirteen of these states in Table 1 are compatible with the twelve pairwise relationships in the above example. From Figure 2, there are 12 true relationships between the elements, which are Under the measurement error model assumption, we do not directly observe the 13 states but observe states with measure error. We aim to reconstruct the true pathway. A proposed method is given in the following.
The two-step counting method  Table 1, there is a sample of size 13 for seven genes. We propose a two-step approach to recover their relationships.

The first step: counting
For a pair of genes, say A and B, we can count the numbers for 6 relationships in Table 2 for the n states. The relationships with a counting number greater than a given threshold are regarded as potential relationships.
If there are no measurement errors, it is reasonable to expect that the counting number of two elements, say A and B, satisfying the true relationship is exactly equal to n. However, since it involves measurement errors, the counting number with respect to the true relationship may not be exactly equal to n. For each pair   Table 2. Patterns for the six pairwise relationships assuming exhaustive sampling and no measurement error. Table 3. The six pairwise between the two elements A and B.

Relationship Hypothesis
of elements, we count the numbers satisfying the 6 relationships respectively, say c 1 ,:::,c 6 . Since we expect that the misclassification probability is low, the counting number(ƒn) corresponding to the true relationship should be close to n. Thus, we can select the relationships with a counting number greater than a threshold. The threshold selection is suggested as follows.
Threshold Selection. The suggested thresholds for the similar and prerequisite relationships are respectively, where w~pzz a=2 p 1{p ð Þ=n ð Þ 1=2 . Here, the misclassification error probability p can be assumed to be known from empirical experiences. If p is unknown, the maximum likelihood approach for estimating p is given in Appendix D in the materials section.
The argument for the threshold selection is given in Appendix A in the materials section. It is based on a confidence bound approach associated with the counting number formulas. The approach is to derive the formulas for the two kinds of relationships, and then uses a confidence interval approach to obtain a lower bound for the counting number formulas. The forms n((1{w) 2 zw 2 ) and n((1{w) 2 zw(1{w)) are inferred by the counting number formulas with misclassification probability p, where w value is derived by a confidence bound approach.

The second step: asymptotic p-value
Besides directly counting the relationships' numbers, the second step is to test the relationships in Table 3 using an asymptotic pvalue. Then we combine both steps to estimate the true relationship between two elements.
The following simulation study shows that the two steps are both essential for selecting the true relationship. If any one of the steps is used solely in selecting the true relationship, the simulation shows that it cannot select the true relationships very accurately.
The p-values derived for the 6 hypotheses with misclassification probability p corresponding to the 6 relationships are listed as follows. The derivations are given in Appendix B in the materials section.
The extremeness of the observed value for the test statistic under the null hypothesis leads to a small p-value, which would imply rejection of the null hypothesis. Thus, if the null hypothesis is the true relationship, we expect to obtain a higher p-value. In the second step, we also set a threshold for the asymptotic p-value such that the relationships with asymptotic p-value greater than or equal to the threshold are selected.
A large p-value indicates a larger possibility that the null hypothesis holds. Note that the p-value is less than or equal to 1. In this study, we use the threshold 1 for the p-value criterion in the examples because the largest p-value for each relationship is one. From the simulation study and the real data example discussed in this study, setting 1 to be a threshold for p-value criterion can lead to very accurate results. Note that for other examples, it is possible that the largest p-value is not 1. In this case, we need to observe the p-values to select a suitable threshold.
It is worth noting that the hypothesis testing procedure corresponds to a confidence interval approach [30]. From the confidence interval viewpoint, when the p-value is large enough (close to 1) or small enough (close to 0), we have confidence to accept or reject the null hypothesis. Therefore, in this study, when p-value is 1, we have confidence to accept the null hypothesis.
The two-step method is described as follows.
Procedure for selecting the true relationship of m elements Step 1. For a sample of size n for m elements, calculate the counting numbers for the 6 relationships of each pair of the elements. Set a threshold for the counting numbers. Select the relationships with a counting number greater than the threshold.
Step 2. For each pair of elements, derive the asymptotic pvalues for each relationship and set a threshold for the p-value. Select the relationships with a asymptotic p-value greater than or equal to the threshold. Table 5. Splitting probabilities caused by misclassification error.
Note that it is possible to have more than one relationship satisfying both criteria for two elements. But from a simulation result and a real data application, it shows that in most situation, there is only one relationship satisfying both criteria.
The asymptotic p-value has a closed form which can be easily calculated and the counting number can also be easily calculated. This shows that this method can provide a convenient way to recover the biological pathway.

An Example
We revisit the example of Figure 2 to illustrate the counting step. Assume that we only have a sample of the states for the 7 elements and we want to recover the 12 true relationships. Note that there are totally C 7 2~2 1 pairs of the 7 elements and there are only 12 pairs with relationships in this example. When considering the case without measurement error, we can reconstruct the pathway from a sample using the counting number method if the sample size is large enough. We can construct the Boolean network for the example by identifying prerequisite or similar relationships. From Table 1, we list the relationship corresponding to the highest counting number for each pair as follows: where c AB denotes the counting number corresponding to the indicating relationship (A[B).
If there is only one relationship corresponding to the highest counting number, we list that one, such as (A[B); if there are more than one relationship corresponding to the highest counting number, we list all of the relationships, such as . In the 21 pairs, the relationships corresponding to the highest counting number 13 is the 12 true relationships, and the relationships with the counting number less than 13 are not the true relationships.

Comparison
We consider two existing methods for detecting the pairwise relationships between any two elements. A simulation study is conducted to compare the proposed method with the existing methods for the measurement error case.

Existing methods
Li and Lu [29] proposed the directed acyclic Boolean network to recover the genetic network. For any pair of element, they use the EM algorithm to calculate the maximum likelihood estimator of misclassification rate p under the multinomial distribution model structure and adopt a criterion that requires a true relationship to correspond to a small estimator of p in order to select a relationship. Besides the disadvantage that the EM algorithm is time-consuming, this method is also shown to be less accurate than the counting method from a simulation study.
Another method for inferring the relationship of any two elements is proposed by Sahoo et al. [27]. For any two genes A and B, let n 00 , n 01 , n 10 and n 11 denote the numbers of the four states where ''expected'' and ''observed'' denote the values of (n 01 zn 00 ) |(n 01 zn 11 )=(n 00 zn 01 zn 10 zn 11 ) and n 01 , respectively.
The relationship A[B in Sahoo et al. method is regarded as true when the ''error rate'' value is less than 0.1 and ''statistics'' value is greater than 3 [27]. However, from our calculation, the method may lead to inaccurate results when the sample size is not large. For instance, suppose the number of experiments we observed is 91 and the numbers of states corresponding to (0,1), (0,0), (1,0) and (1,1) are 1, 30, 30 and 30 respectively, resulting in a small ''statistic'' value of 2.94. Note that the state (0,1) indicates that the relationship A[B does not hold. However, since the state (0,1) only occurs once, it may be due to a measurement error. In this case, the method does not select the relationship A[B. This shows that the criterion is too conservative to select a potential relationship when the sample size is not large enough.

Simulation
We conduct a simulation study using the example of Figure 2 Tables 6 and 7 denote the relationships in order in Table 3.
In this case, the maximum value for the counting number is 100 because the sample size is 100. As discussed in above, we can set a threshold (2) for the counting number. In this case, the thresholds for the similar and prerequisite relationships are 86 and 93. And we set the threshold for the p-value to be 1 because the highest pvalue for each pair is 1 in this case. The relationships corresponding to the hypotheses with a p-value 1 are the candidates for the true relationship.
For any pair of the 7 elements, there are 6 possible relationships of each pair. Since there are 21 pairs for the 7 elements, there are totally 126 possible relationships. Using the two thresholds set above, there are only 12 relationships satisfying the conditions, which are exactly the true relationships (1). There are many relationships among the 126 relationships satisfying only one condition, but not both. For example, the relationships H 00 in (A,C) and H 01 in (A,D) satisfy the counting number condition, but not the p-value condition; the relationships H 0110 and H 11 in (D,F ) satisfy the p-value condition, but not the counting number condition. It shows that any one of the two steps is an important condition for identifying the true relationships. In this case, we can recover all true relationships using the proposed method.
Next, we implement the algorithm of Li and Lu [29] in the simulated data. Since this algorithm does not provide a specific threshold selection method, we adopt different thresholds and find that the best situation is to recover 11 relationships . In this case, one relationship A[G is misjudged to be A*G.
In order to compare the counting method with Sahoo et al. [27], we implement their method in this simulated data. There are only two relationships B*E and D[F recovered from their method with ''statistic'' values 4.05 and 3.18, respectively. The ''statistic'' values for relationships of other pair elements are smaller than 3, resulting in inaccuracy of identifying the other true relationships. It shows that the method of [27] is less efficient and accurate in recovering the true relationships than the counting method from the simulation study for the case with measurement error.
Beside the example with 7 elements, a more comprehensive example with a larger network ( Figure S1) that shows the superiority of the proposed method is given in the supplementary material.

Yeast expression data
We revisit the MAPK pathway example from the Introduction. The datasets used in analyzing the MAPK pathway include 81 experimental data excluding two data with missing values, 57 from Spellman et al. [31] and 26 from Zhu et al. [32] . The datasets from Spellman et al. [31] include 18 data from the alpha factor experiments, 14 data set from the Elutrtation experiments and 24 data sets from cdc15 experiments. The datasets from Zhu et al. [32] include 25 data from Forkhead experiments. The raw data can be download from the Stanford Microarray Database [33]. We adopt values corresponding to the Log(base2) column in the raw dataset to reconstruct the MAPK pathway, which are log ratio values of red to green signal.
A gene state is regarded as on state or off state when the log ratio value of red to green signal is greater than or less than 0, respectively. The gene expression data for the 81 experimental data (Table S1) are given in the supplementary material.
In this study, we apply the two-step approach to explore the expression profiles, and show exploratory results on the pathway. The results are also compared with the Li and Lu's method [29] and Shaoo et al. method [27].
We implement the proposed method to the yeast cell cycle data [31,32]. In our analysis, we assume that the level a~0:1. According to the threshold selection formulas (2), the thresholds for the similar and prerequisite relationships are 61 and 69, respectively. And the threshold for the asymptotical p-value we selected is 1.
According to the network structure reconstructed using our proposed approach, we can see that Wsc2p and Mid2p activate Rho1p, Pkc1p and Bck1p which results in activation of the downstream of MAPK cascade, Mkk1p and Mlp1p. Activated Table 6. The counting numbers for the 21 pairs in the 100 states under each relationship.  Wsc2p also interacts with Mid2p. The functions of genes Swi4p, Swi6p and Rlm1p in the downstream of the network are not significant in our approach. The reconstruction results of the DAB network using the twostep approach and the method of Li and Lu [29] are illustrated in Figure 3(a) and Figure 3(b), respectively. In addition, we also implemented the method of Shaoo et al. [27] in this real yeast data. The results show that there are no pair relationships detected by the method of Shaoo et al. [27], because all ''statistic'' values are smaller than 3 for any two elements. Therefore, compared with the methods in Li and Lu [29] and Shaoo et al. [27], our proposed method is more useful for finding the cascade relationship.

Discussion
For the implementation of the network reconstruction algorithm, the greatest complexity lies in the computation of p-value for every two elements. The number of all pair is n(n{1)=2 where n is the number of elements. Therefore, the time complexity for the proposed approach is O(n 2 ) showing that the proposed method is capable of handling thousands of genes simultaneously.
This study mainly focuses on reconstructing pathway by gene expression. Although pathway reconstruction methods based on gene expression have been widely discussed in the literature, there is a limitation on the gene expression methods. A biological pathway comprises more than genetic interactions alone. Long chains of vents may happen on the protein level (e.g. (de)activation by phosphorylation) which does not necessarily have to be regulated via gene expression. Therefore, these gene expression methods can be expected to reconstruct pathways that are regulated via gene expression, but not other biological interactions.
In summary, we propose a two-step approach to test the biological pathways from noisy array data. This new method has the advantages of easy calculation for the counting numbers and simple closed forms of the p-value. From the simulation results, we can see that this method can precisely estimate the true relationships for most of the situations. Compared with the other existing methods, it can provide a more accurate and efficient alternative approach for reconstructing the biological network.

Materials and Methods
Appendix A: Threshold Selection (i) Suppose the misclassification probability is p. For a similar relationship such as the case A*B, in this case, we have m 01~m10~0 and m 00 zm 11~n : With misclassification error, the counting number corresponding to the relationship is ((1{p) 2 zp 2 )m 00 z2p(1{p)m 01 z2p(1{p)m 10 z((1{p) 2 zp 2 )m 11 : By (3), the last equation is equal to n(p 2 z(1{p) 2 ), which is the mean of the counting number if this similar relationship holds. Since we cannot expect that the counting number is always equal to the mean, we look for a lower bound of the counting number as a threshold. From the viewpoint of constructing confidence interval, if p is unknown, a 1{a upper bound of p iŝ p pzz a=2 (p p(1{p p)=n) 1=2 , wherep p is an estimator of p and z a=2 is the upper a=2 quantile of the standard normal distribution. The bound d~p pzz a=2 (p p(1{p p)=n) 1=2 is an upper bound of p. Then 1{d is a lower bound of 1{p. Here we replacep p by p in the upper bound and suggest n(w 2 z(1{w) 2 ), where w~pz z a=2 (p(1{p)=n) 1=2 as a threshold. We expect that the counting number is greater than the threshold if the similar relationship holds. Beside using the conventional confidence interval, we can also consider some improved intervals discussed in literature [34][35][36][37].
(ii) Assume for two elements A and B, a prerequisite relationship holds. In this case, we have m 01~0 and m 00 zm 11 zm 10~n : ð4Þ With misclassification errors, the counting number corresponding to the relationship is By a similar argument as in (i), we suggest ((1{w) 2 z w(1{w))n as a threshold for the prerequisite relationship.

Appendix B: Computational details
The methods for testing the 6 hypotheses in Table 3 are listed as following.  we can consider the following two different situations. Note that the condition q 01~q10~0 in hypothesis H 0 is equivalent to q 01 zq 10~0 because q 01 §0 and q 10 §0.