
The authors have declared that no competing interests exist.

Conceived and designed the experiments: RS AQF BA. Performed the experiments: RS AQF. Analyzed the data: RS AQF BA. Wrote the paper: RS AQF BA.

Current address: Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America.

Inferring the combinatorial regulatory code of transcription factors (TFs) from genome-wide TF binding profiles is challenging. A major reason is that TF binding profiles significantly overlap and are therefore highly correlated. Clustered occurrence of multiple TFs at genomic sites may arise from chromatin accessibility and local cooperation between TFs, or binding sites may simply appear clustered if the profiles are generated from diverse cell populations. Overlaps in TF binding profiles may also result from measurements taken at closely related time intervals. It is thus of great interest to distinguish TFs whose binding is causally most immediate to gene expression from those whose binding profiles are merely correlated with it.

Transcription factors (TFs) are proteins that bind to DNA and regulate gene expression. Recent technological advances make it possible to map TF binding patterns across the whole genome. Multiple single-gene studies have shown that combinatorial binding of multiple transcription factors determines the gene transcriptional output. A common naive assumption is that correlated binding profiles indicate combinatorial binding. However, it has been found that many TFs bind to distinct hotspots whose role is currently unclear. It is thus of great interest to find transcription factor combinations whose correlated binding is causally most immediate to gene expression. Building upon theories of statistical dependence and causality, we develop novel graphical model-based algorithms that handle highly correlated transcription factor binding profiles more efficiently and reliably than existing algorithms do. These algorithms can also be applied to other biological areas involving highly correlated variables, such as the analysis of high-throughput gene knock-down experiments.

A major area in genome research is understanding how the regulatory information is encoded. Work over the past few decades has resulted in the notion of a combinatorial regulatory code: the concerted binding of a context-specific set of transcription factors (TFs) to regulatory sequences, which is crucial for proper gene expression. Studies of a handful of single genes and their few well-characterised enhancers prevailed in the early days (see

Similar to the gene regulation problem described above, many other biological problems involve highly correlated features, and high correlation does not necessarily indicate functional relevance. Machine learning approaches, especially classification methods, have been developed to use the measurements of these features (or “explanatory variables”) to predict biological outcomes (or “target variables”), e.g. using core promoter DNA motifs to predict transcription start site locations

In contrast, graphical models (GM)

Two concepts are particularly important in the theory of Bayesian networks: the causal neighbourhood and the Markov blanket. Specifically, if there is a directed edge from variable A to the target variable T in the network, then variable A is defined as the causal parent of T. If the directed edge goes from T to A, then A is the causal child of T. The causal neighbourhood of the target variable consists of the causal parents and causal children of the target variable. It is thus the set of variables that are most “causally immediate” for the target variable. The Markov blanket of the target variable T contains its causal neighbourhood as well as other causal parents of T's causal children (these other causal parents are T's causal spouses). From the information-theoretical perspective, the Markov blanket contains all the information about the target variable
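
The decomposition of the Markov blanket into causal parents, causal children and causal spouses can be illustrated with a short sketch; the graph and variable names below are illustrative, not taken from this paper:

```python
# Sketch: computing the Markov blanket of a target node from a known DAG,
# stored as a dict mapping each node to the set of its causal parents.
def markov_blanket(parents, target):
    """Parents + children + spouses (co-parents of children) of `target`."""
    nodes = set(parents)
    children = {n for n in nodes if target in parents[n]}
    spouses = set()
    for child in children:
        spouses |= parents[child] - {target}
    return (parents[target] | children | spouses) - {target}

# Illustrative DAG: A -> T, T -> C, S -> C  (S is a causal spouse of T)
dag = {"A": set(), "T": {"A"}, "C": {"T", "S"}, "S": set()}
print(sorted(markov_blanket(dag, "T")))  # ['A', 'C', 'S']
```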

In terms of statistical inference, existing algorithms for inferring Bayesian networks can be broadly classified into constraint-based, score-based and hybrid algorithms

In this paper we develop a novel constraint-based graphical model method, the Neighbourhood Consistent PC (NCPC) algorithms, to infer the causal neighbourhood and the Markov blanket of a target variable. Using synthetic data, we demonstrate that our algorithms outperform existing algorithms when the variables are highly correlated, the data for the target variable are sparse, and the coupling between the target variable and the other variables is weak.

We also develop a novel graphical representation, the Direct Dependence Graph (DDGraph), which can represent the dependence patterns inferred from the NCPC algorithms. This representation is broader than the common representation in DAGs, and is useful for exploratory analyses of NCPC results. In particular, the DDGraph shows the conditional independencies in the data even if the underlying network is cyclic or non-faithful to a DAG. Both NCPC and DDGraph are implemented in the R package ddgraph, which is part of Bioconductor (

Applying our algorithm to genome-wide TF profiles and expression profiles of cis-regulatory modules (CRMs) published in

We illustrate the concepts of direct and indirect dependencies in terms of the combinatorial binding code of transcription factors. Our aim is to identify transcription factors that

Below we formally define the types of statistical dependencies our NCPC algorithm and its extension detect. We use X_i (i = 1, …, m) to denote the explanatory variables (here, the TF binding profiles) and T to denote the target variable (here, the CRM activity).

X_i and T are directly dependent if X_i and T are marginally dependent (i.e., X_i and T are dependent) and no subset of the remaining variables renders them conditionally independent. That is, it holds that X_i and T are dependent given S for every subset S of the remaining variables.

X_i and T are conditionally dependent if X_i and T are marginally independent (i.e., X_i and T are independent) but there exists a subset S of the remaining variables such that X_i and T are dependent given S.

X_i and T are indirectly dependent if X_i and T are marginally dependent but there exists a subset S of the remaining variables such that X_i and T are independent given S.

Note that, in the example above, A and T are directly dependent, whereas B and T are indirectly dependent. When many TFs are involved, often several TFs have similar types of dependence with T. Such collections of TFs are of interest in understanding the complex transcriptional regulatory network and are related to the causal neighbourhood and Markov blanket introduced in the previous section and formally defined below.
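
The distinction between direct and indirect dependence can be checked by simulation. In the sketch below (flip probabilities are illustrative, not the paper's data), B is correlated with T only through A, so conditioning on A removes the dependence; the empirical mutual information I(B;T) is large while I(B;T|A) is near zero:

```python
# Sketch: indirect dependence in the chain T <- A -> B.
import math
import random
from collections import Counter

random.seed(0)
flip = lambda x, p: x ^ (random.random() < p)  # flip a binary value w.p. p
data = []
for _ in range(20000):
    a = random.random() < 0.5
    data.append((a, flip(a, 0.1), flip(a, 0.1)))  # (A, T, B)

def mi(pairs):
    """Empirical mutual information (in nats) between two binary variables."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * math.log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

i_bt = mi([(b, t) for _, t, b in data])  # marginal dependence of B and T
i_bt_a = sum(  # average conditional MI over the two strata of A
    mi([(b, t) for a, t, b in data if a == v]) * sum(a == v for a, _, _ in data) / len(data)
    for v in (True, False)
)
print(round(i_bt, 4), round(i_bt_a, 4))  # I(B;T) large, I(B;T|A) near zero
```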

The causal neighbourhood of T is the set of all variables X_i in direct dependence with T.

The Markov blanket of T is the set of all variables X_i in direct or conditional dependence with T.

As mentioned in the

Here we present two versions of the Neighbourhood Consistent PC (NCPC) algorithm, which are based on the PC algorithm. Beyond the standard conditional independence tests of the PC algorithm, the NCPC algorithms examine pairs of variables X_i and X_j for the two joint patterns of dependence with T defined below.

X_i and X_j have a joint dependence with T if X_i and T are conditionally independent given X_j (and possibly other variables), and X_j and T are conditionally independent given X_i (and possibly other variables), yet T is dependent on X_i and X_j. X_i and X_j in this pattern are candidates for having direct dependence with T.

X_i and X_j have a conditional joint dependence with T if X_i and T are conditionally independent given X_j and some set of variables, and X_j and T are conditionally independent given X_i and the same set, where the set contains the variables X_i and X_j are conditional on, and possibly other variables (excluding X_i and X_j). X_i and X_j in this pattern are candidates for having conditional dependence with T.

Although these candidate patterns are mathematically inconsistent (see proof in Supplementary Text), we show in the subsequent section on synthetic data that these patterns can arise in applications with highly correlated variables, and thus should not be discarded.

Between the two versions, the basic NCPC algorithm, shown in

Input:

Matrix X with columns X_1, X_2, …, X_m (the explanatory variables)

Column vector T (the target variable)

Conditional independence test appropriate for the dataset

Algorithm:

1. Initialise a set of direct dependence candidates D with all variables X_i marginally dependent on T.

2. Let n = 1.

3. Repeat:

Enumerate all subsets S of D of size n.

For every X_i in D, remove X_i from D if X_i and T are conditionally independent given some subset S not containing X_i.

Set n = n + 1.

Break out of the loop if no variable was removed or n exceeds the size of D.

4. Label the candidates remaining in D as having direct dependence.

5. Systematically check for joint patterns of dependence in the tests performed in Step 3.

6. If a variable X_i removed in Step 3 shows a joint pattern of dependence, label it as having joint dependence.

7. Label all variables removed in Step 3 not having joint dependence as having indirect dependence.

8. Label all remaining variables as having no dependence.

9. Return calls for each of the variables in X.
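
A minimal sketch of the candidate-elimination loop described above might look as follows. The conditional independence test is abstracted as a function `ci(i, cond)` (in practice a statistical test on the data), and the joint-pattern bookkeeping is omitted; names are illustrative, not the ddgraph implementation:

```python
# Sketch of the NCPC candidate-elimination loop.
from itertools import combinations

def ncpc_candidates(variables, ci):
    """Return variables surviving elimination (direct-dependence candidates).

    ci(i, cond) returns True when X_i is independent of T given `cond`.
    """
    d = {i for i in variables if not ci(i, frozenset())}  # marginally dependent
    n = 1
    while n < len(d):
        removed = set()
        for i in d:
            for s in combinations(d - {i}, n):  # conditioning sets of size n
                if ci(i, frozenset(s)):
                    removed.add(i)
                    break
        if not removed:
            break  # no variable was removed at this size
        d -= removed
        n += 1
    return d

# Toy oracle for the chain T <- X1 -> X2: X2 is independent of T given X1.
oracle = {("X1", frozenset()): False, ("X2", frozenset()): False,
          ("X1", frozenset({"X2"})): False, ("X2", frozenset({"X1"})): True}
ci = lambda i, cond: oracle[(i, cond)]
print(ncpc_candidates({"X1", "X2"}, ci))  # {'X1'}
```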

The NCPC* algorithm differs from the NCPC algorithm in two main ways. Firstly, during the initialisation step, in addition to the candidate set

The NCPC and NCPC* algorithms have similar computational complexity to the PC algorithm. That is, in the worst case, the number of required tests increases exponentially with the size of the causal neighbourhood (NCPC) or that of the Markov blanket (NCPC*), although in real-life applications the size of the causal neighbourhood and that of the Markov blanket of the target variable are typically small.
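
To see the worst-case growth concretely: with m explanatory variables, up to C(m−1, k) conditioning sets of size k may need testing for each candidate, so the number of tests grows combinatorially with the neighbourhood size explored. A quick illustration with m = 15, as in the fly dataset:

```python
# Number of conditioning sets of size k drawn from the other m-1 variables.
from math import comb

m = 15  # number of explanatory variables
print([comb(m - 1, k) for k in range(5)])  # [1, 14, 91, 364, 1001]
```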

As local network reconstruction algorithms, our NCPC algorithms assume that there are no hidden variables or directed cycles (i.e., feedback loops) in the Markov blanket of the target variable.

Assuming an infinite sample size, a perfect statistical test (“conditional independence oracle”) and a dependence structure faithful to a DAG without hidden (i.e. unmeasured) variables, the NCPC* algorithm can correctly label all the variables in the network; that is, this algorithm is asymptotically correct for a distribution faithful to a DAG (see

NCPC and NCPC* output labels for the explanatory variables

DDGraphs use both directed edges (ending in dots) and undirected edges to capture a multitude of dependency patterns between the explanatory variables X_i, X_j and the target variable T.

The vocabulary consists of five types of nodes and two types of edges. For the edges, directed edges ending with dots indicate conditional independencies (e.g., an edge from X_k to X_i indicates that X_i and T are conditionally independent given X_k), while undirected edges connect pairs of variables X_i and X_j in joint dependence.

A DDGraph and a DAG with the same dependence patterns around the target variable


A DDGraph also represents joint and conditional joint dependency patterns, which are mathematically inconsistent and thus impossible to represent with DAGs (

We generated synthetic data based on the 15 correlated TF binding profiles in the fly mesoderm dataset, designating two variables as the causal neighbours (X_1, X_2) for the target variable T.

We further introduced a third variable (X_3) as the confounding variable in the network and generated correlated data for two realistic scenarios:

Time - The two causal neighbours (X_1, X_2) and the third variable (X_3) represent the binding profiles of the same TF at three time points, such that X_1→X_2→X_3, in which the correlation between X_1 and X_3 is smaller than that between X_1 and X_2 and between X_2 and X_3 (

Hidden - The three variables are correlated with a common unobserved cause, e.g., the chromatin and/or cell population structure (represented by
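
The “Time” chain above could be generated as in the following sketch, where each variable copies its predecessor and flips with a small probability, so that the correlation between X_1 and X_3 decays along the chain; the flip probability is illustrative:

```python
# Sketch: generating the "Time" scenario chain X1 -> X2 -> X3.
import random

def time_chain(n, p, seed=1):
    """n samples of (X1, X2, X3); each variable flips its parent w.p. p."""
    random.seed(seed)
    rows = []
    for _ in range(n):
        x1 = random.random() < 0.5
        x2 = x1 ^ (random.random() < p)
        x3 = x2 ^ (random.random() < p)
        rows.append((x1, x2, x3))
    return rows

def agreement(rows, i, j):
    """Fraction of samples where variables i and j agree (a correlation proxy)."""
    return sum(r[i] == r[j] for r in rows) / len(rows)

rows = time_chain(10000, 0.15)
# agreement(X1, X3) is lower than agreement(X1, X2) and agreement(X2, X3)
print(agreement(rows, 0, 1), agreement(rows, 1, 2), agreement(rows, 0, 2))
```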

While the synthetic data were generated for a network of 15 explanatory variables, only the variables X_1 and X_2 have direct dependence with the target variable T, and therefore constitute the causal neighbourhood of T.

With these synthetic data, we focus on the performance of separating direct from indirect dependence and detecting the causal neighbourhood. We applied our NCPC and NCPC* algorithms, at an

We measured the proportion of correct predictions from these algorithms over 1000 data sets generated for each combination of sample size and correlation in either of the two scenarios. A prediction is correct when only the two causal neighbours and no other variables are identified. These prediction rates for the “Time” scenario are summarised in

Each cell shows the mean proportion of correct predictions (with 95% confidence intervals) averaged over 1000 data sets generated in each case. Highest prediction proportions accounting for variation in the data (pairwise T-tests with a cut-off of 0.001 for the P values) are shown in bold. See

The NCPC algorithm identifying variables in both direct and joint dependence (“NCPC dir+jnt”) has the highest rate of correct predictions (accounting for variation in the simulated data) amongst all the algorithms in all the cases examined here, except in the biggest dataset with zero correlation. This superior performance is particularly notable when the correlation between the variables is high and the dataset is small. By including the variable pairs in joint dependence, “NCPC dir+jnt” achieves better performance than “NCPC dir” because this inclusion drastically improves recall (corresponding to low false negative rates), especially when the sample size is not large, although the inclusion lowers precision (corresponding to higher false positive rates) slightly (see rates of precision and recall defined in

Increasing the sample size improves the prediction for most algorithms, as we expected. However, when the correlation in the data is 0.75, the NCPC and NCPC* algorithms have lower rates of correct predictions for data with a sample size of 500 than for data with a sample size of 300. This may be due to the

Zinzen et al.

Note that the cluster that consists of Mef2 8–12 h and Bin 6–12 h (lower left corner of the matrix) is anti-correlated with early Twi 2–4 h binding.

Here we applied the NCPC and NCPC* algorithms to the same 310 CRMs with the 15 TF binding profiles. The advantage of this dataset is that any computational predictions can be benchmarked against a wealth of previously established biological results. At an

Variables in green circles are target variables. Variables in ovals are inferred causal neighbours. Variables in rectangles are inferred to have indirect dependence with the target. Values on the edges are (unadjusted) P-values from conditional independence tests. The same NCPC algorithm with no multiple testing correction was used as in the synthetic data benchmark. See

We also applied other algorithms benchmarked in the previous section to this data set. Hill-climbing with BIC identified a smaller but overlapping set of variables (Supplementary Figure S8 in

We applied our method on the dataset of early mesoderm development in the Drosophila embryo

After identifying the causal neighbourhood, we further examined which specific TF combinations are enriched or depleted in each of the five expression classes, compared with the rest of the 310 CRMs analysed here (

For each combinatorial pattern we show the number of CRMs with this pattern in the CRM class and that in the rest of the CRMs (percentages are given in parentheses). The difference between the two frequencies (CRM class vs rest) and the corresponding P-value are given in the last two columns. P-values were computed from Fisher's exact test for each combination and adjusted for multiple testing using the Benjamini-Hochberg method. See

In addition to the previously established regulatory principles outlined above, the genome-wide statistics also suggest a thus far uncharacterised mechanism of prevention of early Twi binding at 2–4 h of embryogenesis for the class of CRMs active in visceral and somatic muscle (VM&SM) at 8–12 h of development. This suggests that these CRMs are selectively shut off during early embryogenesis, but are bound later on by tissue-specific transcription factors:

Twi 2–4 h is identified to also have direct dependence with this VM&SM CRM class (

In this paper we present a novel graphical model-based method that distinguishes direct from indirect dependencies between explanatory variables (or features) and the target variable. Our NCPC and NCPC* algorithms work particularly well in cases of highly correlated features and of sparse or weak signals, as seen in comparison with other algorithms on synthetic data.

We applied our algorithms to data published in

Our NCPC algorithms assume no hidden variables in the Markov blanket of the target variable. This assumption is frequently not met in reality; for example, in the case of the transcriptional regulation, a number of relevant TFs might not have been measured. In that case, a seemingly irrelevant TF might be inferred as a causal neighbour if it is correlated with the unmeasured relevant TF (e.g. due to open chromatin structure). Such a TF would be a “proxy” for the binding of the relevant TF.

Our NCPC algorithms also assume no feedback loops in the Markov blanket of the target variable. This may not be the case in a real biological system. However, if time course data are available and informative enough such that the underlying Markov blanket is acyclic at each time point, then our NCPC algorithms can still be applied (similar to the way we re-analysed the fly mesoderm development data) to identify causal neighbours. Transcriptional responses are typically slow (on the order of minutes

The statistical tests our algorithms perform for the variables in these systems tend to be highly dependent. It is still a challenge to control the false discovery rate for highly dependent tests. We implemented the multiple testing procedure of

The NCPC algorithms infer the causal neighbourhood and do not optimise the prediction accuracy of the target variable. Hence, we do not expect these algorithms to be an optimal feature selection procedure for classification. Nonetheless, the NCPC algorithms may in principle be used for feature selection to improve prediction accuracy, for example, by using cross-validation to choose a P-value threshold that minimises the cross-validation error. Directly incorporating the dependence structure in a classifier is still challenging, since it is difficult to robustly estimate higher-order conditional probabilities from small datasets (a Naive Bayesian Classifier has been used in practice; see

A wealth of genome-wide data have been and are currently produced, featuring binding sites of transcription factors, chromatin marks and RNA levels

Although we have focused on TF binding and CRM activity in this paper, our NCPC algorithms are applicable to other biological problems involving possible highly correlated features. For instance, high-throughput imaging of knock-down strains can produce large sets of highly correlated visual features describing cell shape

A unified interface to all causal neighbourhood/Markov blanket methods benchmarked in this paper, including the NCPC/NCPC* algorithms and the DDGraph representation, is available as the R package ddgraph, which is part of Bioconductor (

We used the data from Supplementary Figure 8 of

To construct the synthetic dataset we used Hill-climbing with BIC to infer a Bayesian network from the real biological dataset (

To generate the CRM class target variables we considered a causal neighbourhood of size 2 and used a noisy AND function, representing the simplest combinatorial code of 2 TFs. The noise in the AND function is incorporated into both the inputs and the output of the function. The noise in the inputs models the activity of other TFs, which might, for example, inhibit the CRM activity in the presence of the TF, or activate the CRM in the absence of the TF. The noise in the output models the noise in the reporter assay used to find the activity of a CRM. Let X_A and X_B denote the two causal inputs: each input is first flipped independently with a small probability (the input noise), the AND of the two noisy inputs is taken, and the result is flipped again with a small probability (the output noise) to produce the target variable T.
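
A sketch of such a noisy AND target follows; the noise probabilities are illustrative, not the values used in the simulations:

```python
# Sketch: noisy AND of two binary causal inputs with input and output noise.
import random

def noisy_and(xa, xb, p_in=0.1, p_out=0.05):
    xa = xa ^ (random.random() < p_in)   # input noise on X_A (other TFs)
    xb = xb ^ (random.random() < p_in)   # input noise on X_B
    return (xa and xb) ^ (random.random() < p_out)  # output (assay) noise

random.seed(2)
samples = [(a, b, noisy_and(a, b)) for a in (0, 1) for b in (0, 1) for _ in range(1000)]
rate = lambda a, b: sum(t for x, y, t in samples if (x, y) == (a, b)) / 1000
# T is mostly 1 only when both inputs are 1
print(rate(1, 1), rate(1, 0), rate(0, 0))
```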

To incorporate the two scenarios “Time” and “Hidden” described in the main text, we randomly chose three variables in each simulated network and rewired them to match each scenario. For the “Time” scenario we allowed the first variable to keep its causal parents from the unmodified network, while variables two and three received causal parents only from the scenario; however, all three retained their original causal children from the unmodified network. This ensured that we could fully control the correlation between the three variables while leaving them embedded, as much as possible, in the context of the rest of the network. In the “Hidden” scenario, we generated an additional hidden variable and made it a causal parent of the three variables; here the three variables retained only their original causal children, not their causal parents. To generate the binary profile of the target variable, we applied the noisy AND function as before.

The hill-climbing and IAMB algorithms were applied using the bnlearn R package, and the PC algorithm was applied using the pcalg R package. Both can be accessed through a unified interface in our R package ddgraph.

For NCPC and NCPC* we used the Monte-Carlo chi-square test, while for the IAMB algorithms we used the Mutual Information test recommended by the authors

To assess the performance of the algorithms, we defined a prediction as correct if there are no false positives and no false negatives. Accuracy was measured by the prediction rate, i.e., the proportion of correct predictions over all the synthetic networks. We also defined precision as TP/(TP+FP), where TP is the number of true positives and FP the number of false positives, and recall as TP/(TP+FN), where FN is the number of false negatives. Rates of precision and recall were also averaged over all the synthetic networks.
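
These measures can be written down directly:

```python
# Precision, recall and the "correct prediction" criterion defined in the text.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def correct(predicted, truth):
    return set(predicted) == set(truth)  # no false positives, no false negatives

# Illustrative example: truth = {X1, X2}; prediction = {X1, X3}
tp, fp, fn = 1, 1, 1
print(precision(tp, fp), recall(tp, fn), correct({"X1", "X3"}, {"X1", "X2"}))
# 0.5 0.5 False
```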

As the size of the conditioning set increases, the power of the test decreases. To increase power, we limited the total count

Alternatively, one may constrain the size of the conditioning set. Since our data are binary, we set the maximal size of the conditioning set _{min}

For


The authors wish to thank David Molnar and Rob Foy for valuable discussion.