Exploring candidate biological functions by Boolean Function Networks for Saccharomyces cerevisiae

The great amount of gene expression data has brought a big challenge for the discovery of Gene Regulatory Network (GRN). For network reconstruction and the investigation of regulatory relations, it is desirable to ensure directness of links between genes on a map, infer their directionality and explore candidate biological functions from high-throughput transcriptomic data. To address these problems, we introduce a Boolean Function Network (BFN) model based on techniques of hidden Markov model (HMM), likelihood ratio test and Boolean logic functions. BFN consists of two consecutive tests to establish links between pairs of genes and check their directness. We evaluate the performance of BFN through the application to S. cerevisiae time course data. BFN produces regulatory relations which show consistency with succession of cell cycle phases. Furthermore, it also improves sensitivity and specificity when compared with alternative methods of genetic network reverse engineering. Moreover, we demonstrate that BFN can provide proper resolution for GO enrichment of gene sets. Finally, the Boolean functions discovered by BFN can provide useful insights for the identification of control mechanisms of regulatory processes, which is the special advantage of the proposed approach. In combination with low computational complexity, BFN can serve as an efficient screening tool to reconstruct genes relations on the whole genome level. In addition, the BFN approach is also feasible to a wide range of time course datasets.


Introduction
One of the challenging fields of computational biology is the study of gene regulatory networks (GRNs). The demanding task of recovering the hidden relations between genes at the wholegenome level can provide insights to the comprehensive understanding of biological pathways and their mechanisms. It can also enhance the developments for disease treatments and biological technology. Currently there are two major experimental approaches to identify regulatory relations between genes. The first type uses perturbation (knockout or overexpression) experiments to explore regulatory targets of specific gene. The second type detects targets for PLOS  of false positive links. The output of algorithm is controlled by two threshold parameters p 1 and p 2 which set the significance level of each test in the first and the second step. As comparison of performance, we evaluate the accuracy of BFN and Boolean Network [25], Dynamic Bayesian Network [26], information theory based and graph based GRNI methods [27]. The prior knowledge in form of list of regulators can be incorporated to BFN as well in order to enhance accuracy of prediction.
BFN can be considered as dynamic GRNI since it uses adaptive time delay between genes. Another important characteristic of BFN which is seldom addressed by the most existing approaches is the ability to identify the regulatory function describing the relation between gene pairs in GRNs. Among the methods reported in literature, the Boolean models (BN and PBN) are few methods that have such capacity. However, these methods use a limit number of Boolean functions, such as three Boolean gates (AND, OR, NOT) and their combinations. We will extend the number of Boolean functions in the proposed approach to describe the complex nature of regulatory relations.
Furthermore, BFN has low computational complexity and computation time. Therefore, it can be applied to high-throughput data, which is demonstrated for whole genome S. cerevisiae dataset in the section of "Results".

Method overview
The proposed BFN method is based on a hidden Markov model (HMM). We assume that the true status of a gene at discrete moment of time is a hidden state, while the measured expression level of gene transcript at the specific time is an observation. In a nutshell, the BFN method consists of two major steps: identification of pairwise dependencies between genes by Test 1 and the subsequent check whether those links are direct or indirect by Test 2.
Both Test1 and Test 2 are based on the comparison of likelihoods of two alternative models. Test 1 search for the best Boolean function and time delay between two genes, which would maximize the likelihood ratio of a model with a link over a model without a link. provides a graphical illustration of Test 1. Analogously, Test 2 is illustrated with Fig 2 that compares the likelihoods of the model with a direct link and that with an indirect link. The reason why Test 2 is needed is that it helps to avoid collisions in causation order upon structural network reconstruction. Suppose, we have detected the following links x 1 ! x 2 , x 1 ! x 3 , x 2 ! x 3 by Test 1. As it is shown at Fig 2 there are two ways to place those links at the map. In the first scenario, x 1 has indirect effect on x 3 through x 2 . In the second scenario, both x 1 and x 2 regulate x 3 directly. Therefore the question arises which one of the two maps is correct? With the increasing number of genes, the resulting network can differ dramatically if the directness of the links is not tested. Further method's details are explained in the section of Materials and Methods.
The proposed BFN method needs time course data to discover the regulatory relationships. The minimum number of time points required for proper inference is 10. To investigate the performance of BFN, we will use the widely used data of yeast expression in Spellman et al. [28]. Details about the dataset and its preprocessing are described in the section of Materials and Methods. Whole genome analysis and comparison to the other methods Ma et al. [27] provided the combined binding and knockout data in 3 golden-standard networks for various levels of significance from the most conservative gene relationship #1 to the most liberal gene relationship #3. They used those as reference to evaluate the performance of 18 statistical approaches over 4 data types based on the assessments of 3 combined statistical metrics. We evaluate the performance of the proposed BFN with those 18 methods by the same assessment metrics. However, those data sets tested in Ma et al. [27] do not contain sufficient time course data for the application of the proposed BFN method. Thus, we will use the dataset in Spellman et al. [28] that is of the same data type for observational data obtained across time and/or environmental conditions studied in Ma et al. [27].
For BFN, we ran Test 1 and Test 2 consecutively with the p-value threshold of 0.05 for both tests. The time delay range is set to be the range from 0 to 5 on the Spellman dataset. 185 genes are assigned as TFs in according to SGD (Saccharomyces Genome Database) [29], which are served as sources. The list of genes and TFs can be found in the columns 1 and 2 of S7 Table. We found the gene relationships for the entire network that are comprised of 335531 direct links in S1 Table. Table 1 summarizes the performance of the BFN method for 3 networks and the details are explained in S2 Table. The performance in Table 1 is assessed with seven metrics: sensitivity or true positive rate (TPR), specificity or true negative rate (TNR), precision or positive predictive value (PPV), negative predictive value (NPV) and 3 combined metrics which show Euclidean distance from the ideal performance point ). Since these three combined metrics describe the distance from the ideal performance point (e.g., sensitivity = 1, specificity = 1 in C1), the smaller score indicates the better performance. According to the scores reported in Table 1, the performance of the BFN method is similar for three gold-standards. The performance is slightly better for the gold-standard GRN #1 which contains only regulatory relations with highly significant links. Table 2 provides an overview of BFN performance compared to the performance of 18 statistical approaches reported in Ma et al. [27] by 3 combined metrics. Each sub table in Table 2 reports the performance comparison to every specific combined metric, C1, C2 or C3. The performance comparison incorporates the followings. 1) The average score over 18 approaches and 4 observational datasets obtained by changing time and/or environmental conditions (i.e., Gresham et al. [30], Gasch et al. [31], Smith et al. [32], Yeung et al. [33]) measured with one of three GRN gold-standard networks. 2) The corresponding BFN score. 3) The best values among 18 methods in each of 4 datasets.
From these comparisons, the performance of BFN is better than the average results. The performance in C1 and C3 metrics shows that the BFN can achieve improvements on the average. In particular, the improvement of BFN is appealing for C1 metric (sensitivity and specificity) across all GRNs gold-standards. When compared with the best individual scores, the BFN has better performance in C1 metric than any method applied to Smith and Yeung datasets, while no advantage can be seen for C2 and C3 metrics. It should be noted that performance ranks of those methods in Table 2 differ with distinct metrics. For example, the method producing the best score (0.69) for C1 metric (sensitivity and specificity) will generate the worst score (0.98) in C2 (PPV and NPV) in the Gasch dataset when GRN #1 is used as the gold-standard. Thus it is rather difficult to make comparison of performance among individual approaches and average scores seem to be more informative.
Performance comparison with Boolean network and Bayesian network models on simulated data  [25]) and the Dynamic Bayesian Network (DBN) inference (coded in the R "G1DBN" R package [26]). The REVEAL approach has failed to infer any network from given data, while the best results of Best-Fit, G1DBN and BFN are provided in Table 3. Table 4 contains the performance analysis for Boolean Network (Best-Fit), Dynamic Bayesian Network (G1DBN) and Boolean Function Network (BFN). To make a comprehensive comparison, we conduct the Boolean Network method of Best-Fit with three different discretization approaches (k-means, edgeDetector and scanStatistic). The result with edgeDetector is removed from Table 4 due to the poor performance. We evaluate the performance of k-means and scanStatistic by three options for maximum arity of Boolean functions (2, 3 or 4). The DBN was tested with different approaches to regression estimates (Huber, Tukey and Least Squares). However, only the Least Squares method managed to converge in given number iterations.
As it can be seen from Table 4, the BFN performs better in this example with respect to most performance measures (sensitivity, specificity, precision and NPV). Moreover, the BFN can assign both Boolean function and time delay for each relation.

Performance evaluation of BFN by Saccharomyces Genome Database (SGD) regulatory information
The performance improvement of BFN over the classical approach of Pearson correlation is evaluated by the Saccharomyces Genome Database (SGD) regulatory information. We focus on the 103 cell-cycle regulated S. cerevisiae genes annotated from previous studies [28]. Among them, 7 are TFs according to SGD and they are used as source genes. The relations discovered with each method were assessed with the SGD regulatory information. The detailed information about numbers of true positive, false positive, true negative and false negative links obtained with BFN at varying p-value thresholds of Test1 is described in S3 Table. Moreover, S3 Table contains Table 3. Inference results of Boolean network (Best-Fit with optimal discretization k-means and maxK = 2), Dynamic Bayesian Network (with least squares estimation, and default parameters alpha1 = 0.5 and alpha2 = 0.05 for first order dependencies and full dependencies correspondingly) and Boolean Function Network (maxTimeDelay = 10 and p1 = 0.005). Cell cycle genes and their succession along cell cycle phases

Best-Fit
Conventionally cell cycle is partitioned into four consecutive phases: mitosis (M), gap1 (G1), synthesis (S) and gap 2 (G2). In this study, we focus on the list of 103 cell-cycle genes [28] along with their corresponding cell phase labels assigned according to literature (Columns 5 and 6 in S7 Table). We allow all these 103 genes to be sources and targets. The cut-off values of p 1 = 0.005 and p 2 = 0.05 are used for Test1 and Test2 correspondingly. The time delay range varies from 1 to 5 sample time points. We consider links of positive Boolean functions only. As a result, we obtained 68 links in S4 Table. We check the consistency of 68 relations discovered via the BFN method with cell phase annotation of these 103 genes. If the label of one target  gene is the same as its source's label or is successive to the source's label, then this relation is consistent with cell cycle information. It turns out that all these 68 links listed in S4 Table are consistent with cell cycle information.

Functional enrichment analysis
The capacity of BFN to identify a Boolean function and a time delay for each regulatory relation can be utilized not only for the reconstruction of gene regulatory map but also the analysis of gene sets. For each transcription factor (TF), the pool of target genes can be naturally divided into groups according to their Boolean functions and time delay parameters detected by BFN. For each TF, the results of the whole genome S. cerevisiae analysis (S1 Table) can be divided up to 36 groups because there are 6 possible Boolean functions and 6 possible time delays. S5 Table contains the results of gene ontology (GO) and pathway enrichment for detected groups of target genes (with TFs excluded from target groups) which are positively regulated by each of 185 TFs with time delays of 0 and 1. Annotations were obtained with the YeastMine, which is the embedded tool in SGD [29] for data searching, retrieving and annotating. In total, there are 210 out of 370 possible gene groups received annotation of Molecular Function, Biological Process or Pathway.
The obtained information is useful both for elaboration of TF's function and for establishing candidate genes of regulatory targets. To illustrate the former case, we consider the YHL020C (OPI1) transcription factor. The annotation by SGD shows that the regulatory targets are related to the "carboxylic acid biosynthetic process" (GO: 0046394). When we consider the subgroup according to time delay 0 (No. 37 in S5 Table), the targets of YHL020C suggest the "methionine biosynthetic process" (GO:0009086) annotation which is one more specific GO category positioned two levels below in GO hierarchy. Therefore, the BFN suggests that YHL020C can be involved in the positive regulation of biosynthesis of methionine and this process is time-constrained within one time interval.
As an example for the second type of information that can be extracted from the functional annotation results by BFN, we consider the YPR008W (HAA1) transcription factor (group No. 208 in S5 Table). According to the results positively regulates genes, the BFN suggests that the target genes are YML100W (TSL1), YBR126C (TPS1), YDR074W (TPS2) and YMR261 (TPS3) with time delay 1. Protein subunits TPS1, TPS2, TSL1 and TPS3 constitute the alpha, alpha-trehalose-phosphate synthase complex which converts glucose 6-phosphate plus UDPglucose to trehalose in two-steps trehalose biosynthetic pathway. Currently, only TSL1 is a regulatory target of HAA1 registered in SGD. However, according to our functional annotation, the other three genes YBR126C(TPS1), YDR074W(TPS2), YMR261(TPS3) are also strong candidates to be the regulatory targets of YPR008W(HAA1) together with YML100W(TSL1) because they show highly significant (p-value = 0.0001) pathway enrichment. The detailed information about the list of genes referring to each GO category is not included in S5 Table  for table concreteness. But they can be easily obtained if the list of target genes (excluding TFs) for a specific TF in S1 Table is loaded into YeastMine.

Transcriptional regulation of metabolic pathway
Based on the references [39, 40], we schematically illustrate the conversion of D-glucose to Pyruvate in the process of glycolysis in Fig 7. Using the SGD regulatory information for genes encoding 9 essential glycolytic enzymes (provided in Fig 7) as the reference to eliminate false positive results, we are able to highlight the regulatory subnetwork for glycolytic genes from our whole-network results in S1 Table. The results can be found in S6 Table which contains 147 regulatory relations out of 398 registered with SGD for these 9 genes. We use Cytoscape

Discussion
The proposed BFN approach is a fast and efficient way to explore the regulatory relations between genes for further experimental analysis. By adjusting those two threshold parameters p 1 and p 2 which set significance level in two consecutive tests, we can trade off the numbers of false positive and false negative links. Thus, the smaller values of p 1 and p 2 are set, the more stringent restrictions we apply for the output. Consequently, the smaller size of output links will be generated with the smaller number of false positive links and the larger number of falsely rejected links. The user can utilize the prior knowledge about known regulatory relations to decide the level of significance. Moreover, the choice of p 1 and p 2 also depends on the number of time-points in dataset. The more time-points there are at disposal, the more links will be discovered to be significant, thus more stringent limits on p 1 and p 2 need to be applied. Even though the range of time delay is another parameter to be set, it can be easily decided based on the number of time points and biological knowledge such as periodicity of cell cycle or circadian rhythm as demonstrated in this study. The lower bound of time delay range should be chosen depending on whether we need to consider co-regulated sets of genes (time delay equals to 0) or we are only interested in pairwise relations when regulatory effect can be seen over time (time delay equals to 1 or higher). Based on our empirical experience, the maximum time delay should be no larger than one third of number of time-points in dataset and at least a half of the time interval between cell's steady states.
Similar to any other computational methods based solely on transcriptome data, the BFN is not sufficient to reconstruct GRN entirely because the posttranslational modification should be taken in to account. Moreover, the observational dataset can reflect the status of GRN only under specific experimental conditions. However, the proposed BFN method can become valuable tool for biologists to reduce the search space for relations between genes. And it will help to recover the overall picture of regulatory pathways when it is applied to several related time-series data under different experimental conditions.
When the prior knowledge is available, such as a list of TFs, it can be integrated to the proposed BFN to reduce computational complexity and improve prediction precision. The apparent advantage of the BFN method is that it not only determines direct relationships between genes but also provides direction and Boolean function with time delay. The follow-up division into groups based on the assignment of Boolean functions and time delays to each relation can be incorporated to clustering and for the analysis of functional enrichment.
The proposed BFN can identify directness of a link between a pair of genes. It could be expanded to discover structures of three and more genes with higher computational complexity in future studies.

S. cerevisiae dataset and its preprocessing
Spellman et al. [28] provide data from 3 microarray experiments with different synchronization techniques. For this study, we use data obtained with α-factor cells arrest since this experiment has the highest time resolution (the measurements of RNA were taken every 7 min). The total number of genes in data set is 6075 measured along 18 time-points. There are 59 genes excluded from analysis since there were 4 or more missing values for each gene. For the rest of genes, the missing values were replaced with spline extrapolation. Therefore the whole genome dataset analyzed in this study consists of 6016 genes.

Discretization
The expression profile is the measured abundance of mRNAs for each determined point in time. The source for this type of data can be microarray, next generation sequencing and other types of biochip experiments. Hereafter, we arrange the variables (genes) horizontally and n is the number of genes. The observations (time-points) are arranged vertically with the total number of columns is m, n)m. Naturally, the range of values varies greatly from one gene to another. In order to enable comparison of the expression profiles of different genes, the expression values have to be standardized to the same scale, i.e. converted to the standard range of [0, 1] for every gene. We will apply the approach of empirical cumulative distribution function (ECDF) transformation, which can be described as follows.
For each gene x i : • Sort observations x ij along m time-points in ascending order.
• Assign to corresponding observation a probability p iI j ¼ j m , j = 1. . .m, where I is array of sorting indices.
If there are ties (the same values of observations) for some genes, then the above standardization procedure can generate skewness. Thus, we use the following modified procedure for standardization in this study.
For each gene x i: • Sort observations x ij along m time-points in ascending order.
• Identify unique values u ik , k = 1. . .K, K < m • For each unique value u ik : • Count the number of ties c k for given unique value • Compute and assign to corresponding x iI j ,. . ., x iI jþc k À 1 probabilities fp iI j . . . p iI jþc k À 1 g ¼ Cþc k m , where I is array of sorting indices and C ¼ X k c kÀ 1 .
After applying the above standardization, we obtain the corresponding empirical cdf valuê F i ðtÞ for every gene i at time point t.

Boolean Network and Boolean Functions
We define the Boolean network as a set of vertices V = {x 1 . . .  Tables 5 and 6, we enumerate all possible non-trivial Boolean unary and binary functions correspondingly: Besides all possible non-trivial Boolean functions with unique definite output, we also consider functions with two possible outputs, {0, 1}, which means that either 0 or 1 may appear in the output for the same input assignment. In Table 5, function f1 (equivalence) represents upregulation of gene x2 by gene x1. Function f2 (negation) stands for downregulation of gene x2 by gene x1. Functions f3 and f4 reflect relation of necessity and its negation correspondingly. That is, function f3 explains condition "gene x2 cannot be turned on unless gene x1 is turned on" and function f4 states opposite "gene x2 cannot be turned off unless gene x1 is turned on". Functions f6 express sufficiency and f5 is its negation. If gene x1 is sufficient for gene x2 it means that knowing that gene x1 is on we can claim that gene x2 is on as well. However, it is not legit to assert that if gene x1 is off then gene x2 is off too. Whilst function f5 stands for statement "if gene x1 is on then gene x2 must be off".
In Table 6, functions F1-F42 with binary inputs may or may not have simple and intuitive forms. For instance, F1 realizes an AND gate of two inputs; yet F24 does not have a simple Boolean functional form. In this study, we concern primarily pairwise dependencies in the gene network. The Boolean functions with binary inputs are auxiliary in determining the directness of links.

Test 1
In order to identify the pairwise dependencies between genes, we examine two models for every possible pair of genes. One model represents the situation where genes are linked and the other model suggests there is no link between genes under consideration.
Assume that y 1 (t) and y 2 (t) are continuous observed values of Gene 1 and Gene 2 at time point t respectively. The notations of x 1 (t) and x 2 (t) are the corresponding discrete latent variables. The notation of τ represents the time delay between genes. For example, τ = 1 means the time delay of 7 minutes for the expression data of α -factor cells arrest in Spellman et al. [28]. Two competitive models are shown at Fig 1. In order to establish which one of the models Table 6. Truth table for binary functions F1-F42. x1 and x2 are input of the function and x3 is output.   x1  x2  x3   F1  F2  F3  F4  F5  F6  F7  F8  F9  F10  F11  F12  F13  F14   0  0  0  0  0  0  0  1  1  1  1  1  {0,1}  {0,1}  {0,1}  {0,1}   0  1  0  0  1  1  1  0  0  0  1  explains data better, we use the likelihood ratio to evaluate: The larger this ratio is, the more significant the link is. The likelihoods of models can be written respectively: where the product is taken over all time points; at each time point t the likelihood score is marginalized over all possible latent variable states of x 1 (t) and x 2 (t). According to Bayes' theorem, When P(y 1 (t)|x 1 (t)) and P(y 2 (t)|x 2 (t)) in formulas (1) and (2) are replaced with (3), we obtain the followings: The terms of P(y 1 (t)) and P(y 2 (t)) are taken outside of the sum in both formulas because P (y k (t)) is constant as y k (t) is the realization of random variable x k (t). They will be cancelled out in likelihood ratio and they can be omitted in formulas for L not linked and L linked . So, the formulas be rewritten as next: The estimate of conditional probability P(x k (t)|y k (t)) is the empirical cdfF k ðtÞ and Pðx k ðtÞ ¼ 1jy k ðtÞÞ ¼F k ðtÞ; Pðx k ðtÞ ¼ 0jy k ðtÞÞ ¼ 1 ÀF k ðtÞ: For simplicity, we will use p x1x2 t notation instead of product P(x 1 (t)|y 1 (t))‧P(x 2 (t + τ)|y 2 (t + τ)). For example, p 00 t = P(x 1 (t) = 0|y 1 (t))‧P(x 2 (t + τ) = 0|y 2 (t + τ)).
Note that P(x k (t)) is a marginal probability and it can be computed as follows: The conditional probability of Boolean state of variable x 2 (t+τ) given x 1 (t) in (5) becomes: Eq (6) specifies the pattern for each of six possible Boolean functions of one variable. If it is f1 (equivalence), then p 00 and p 11 are given weight 1, while p 01 and p 10 are set to zero for the accordance to the truth table of f1. For computation reason in practice, we will use ε and 1−ε instead of 0 and 1 to avoid computing log(0) in log likelihood. The parameter ε can be adjusted if needed. Based on our empiric experience, it does not notably affect output. The default value of ε in software implementation is set to 0.005 in this study. However, the decrease of ε can slightly increase number of regulatory relations in output. For the functions which have indefinite output for one of inputs (f3-f6), we use the marginal probability of the second gene to be 1 or 0 as weight function. With all notations explained above, the likelihoods corresponding to However, it is unnecessary to compute all parts since we are only interested in the difference, that is, the unary function of f(x 3 ) against the binary function of F(x 1 , x 3 ). Since the link x 1 !x 3 is present in both models and it does not contribute to models differentiation, we can remove it from computation. Thus the corresponding likelihoods for two models can be written as next:  Table. Source/target annotation for whole genome and 103 cell-cycled genes experiments; cycle phase annotation for 103 genes. (XLSX)