A Graphical Modelling Approach to the Dissection of Highly Correlated Transcription Factor Binding Site Profiles

doi:10.1371/journal.pcbi.1002725

Figure 1.

The graphical vocabulary of the DDGraph.

The vocabulary consists of five types of nodes and two types of edges. Directed edges ending with dots indicate conditional independence between X_k and the target variable T given X_i. Undirected edges indicate dependencies that involve T in different ways, as well as conditional independencies between X_i and X_j given T. Consider a non-faithful distribution in which T is an XOR function of X1 and X2, with parameters set carefully so that, from the data, X1 and X2 each appear marginally independent of T. In this case, X1 and X2 would be conditionally dependent when conditioning on each other. This distribution would be represented as two dotted nodes with a dotted line between them, but disconnected from T. This kind of graph signals a non-faithful distribution in which the neighbourhood and Markov blanket are not defined by traversing undirected edges from T.
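The XOR case can be checked numerically. The following is a minimal sketch, assuming X1 and X2 are independent fair binary variables and T = X1 XOR X2 exactly; this is the simplest parameter choice and not necessarily the one used for the figure. The helper names are illustrative only. The sketch computes exact (conditional) mutual information from the joint distribution: zero means independence, a positive value means dependence.

```python
from itertools import product
from math import log2

# Index positions of the variables in each joint state (x1, x2, t).
X1, X2, T = 0, 1, 2

def joint_xor():
    """Exact joint P(x1, x2, t) assuming fair, independent X1 and X2 and T = X1 XOR X2."""
    return {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}

def marginal(joint, keep):
    """Sum the joint distribution down to the variables at positions `keep`."""
    out = {}
    for state, p in joint.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def mutual_information(joint, a, b):
    """I(A;B) in bits; zero if and only if A and B are independent."""
    pab, pa, pb = marginal(joint, (a, b)), marginal(joint, (a,)), marginal(joint, (b,))
    return sum(p * log2(p / (pa[(x,)] * pb[(y,)]))
               for (x, y), p in pab.items() if p > 0)

def conditional_mutual_information(joint, a, b, c):
    """I(A;B|C) in bits; zero if and only if A and B are independent given C."""
    pabc = marginal(joint, (a, b, c))
    pac, pbc, pc = marginal(joint, (a, c)), marginal(joint, (b, c)), marginal(joint, (c,))
    return sum(p * log2(p * pc[(z,)] / (pac[(x, z)] * pbc[(y, z)]))
               for (x, y, z), p in pabc.items() if p > 0)

joint = joint_xor()
print("I(X1;T)    =", mutual_information(joint, X1, T))
print("I(X2;T)    =", mutual_information(joint, X2, T))
print("I(X1;T|X2) =", conditional_mutual_information(joint, X1, T, X2))
print("I(X1;X2|T) =", conditional_mutual_information(joint, X1, X2, T))
```

Under these assumptions both marginal quantities are zero while both conditional quantities are one bit, which is the pattern that a test of marginal dependence with T alone would miss.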


Figure 2.

Comparison of DDGraphs and DAGs.

(A) The causal neighbourhood of the target variable T consists of variables X1 and X2, while T's Markov blanket consists of X1, X2 and X4 (in ovals). The remaining variables, X3 and X5, have indirect dependence (in rectangles). The DDGraph (left) and the DAG (right) represent the same conditional dependencies. The causal neighbourhood, the Markov blanket and the variables with indirect dependence are distinguishable by the variable shapes in the DDGraph, but have to be inferred in the DAG by following the edges. (B) Joint dependency patterns representable in the DDGraph (left) cannot be represented by DAGs (right). The DAG shown here represents the conditional independencies between X1 (or X2) and T given X2 (or X1), but it does not represent the marginal dependency between X1 (or X2) and T. Neither this DAG nor any other DAG can represent the entire joint dependency pattern.


Figure 3.

Two scenarios for generating the synthetic data with correlated variables.

While the synthetic data were generated for a network of 15 explanatory variables, only variables X1 and X2 have direct dependence with the target variable T, and therefore constitute the causal neighbourhood of T. Variable X3 is included as the confounding variable. (A) The “Time” scenario, in which X1, X2 and X3 correspond to three time points, with stronger correlation between X1 and X2 and between X2 and X3 than between X1 and X3. (B) The “Hidden” scenario, in which X1, X2 and X3 are correlated due to a common cause H in the network. This common cause is used in data generation, but is not available to the algorithms.


Figure 4.

Proportion of correct predictions for the “Time” scenario.

Each cell shows the mean proportion of correct predictions (with 95% confidence intervals) averaged over 1000 data sets generated in each case. The highest prediction proportions, accounting for variation in the data (pairwise t-tests with a cut-off of 0.001 for the P-values), are shown in bold. See Materials and Methods for the generation of the synthetic data and for the calculation of the correct prediction proportion.
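As an illustration of how one cell of such a table could be summarised, the sketch below computes a mean proportion with a 95% confidence interval from per-data-set proportions. It assumes a normal approximation (mean ± 1.96 standard errors); the caption does not state which interval was used, so this is an assumption, and the function name and example values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def mean_with_ci95(proportions):
    """Return (mean, lower, upper) using a normal-approximation 95% interval.

    `proportions` holds one correct-prediction proportion per synthetic data set.
    The 1.96 multiplier is an assumption; the source does not specify the method.
    """
    n = len(proportions)
    centre = mean(proportions)
    half_width = 1.96 * stdev(proportions) / sqrt(n)  # stdev is the sample standard deviation
    return centre, centre - half_width, centre + half_width

# Hypothetical proportions for four data sets (the figure averages over 1000).
centre, lower, upper = mean_with_ci95([0.5, 1.0, 0.5, 1.0])
print(f"{centre:.3f} [{lower:.3f}, {upper:.3f}]")
```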


Figure 5.

Clustered pairwise correlation matrix of the 15 transcription factor binding profiles over all 310 CRMs.

Note that the cluster that consists of Mef2 8–12 h and Bin 6–12 h (lower left corner of the matrix) is anti-correlated with early Twi 2–4 h binding.


Figure 6.

DDGraphs for the 5 CRM classes inferred by the NCPC algorithm at α = 0.05.

Variables in green circles are target variables. Variables in ovals are inferred causal neighbours. Variables in rectangles are inferred to have indirect dependence with the target. Values on the edges are (unadjusted) P-values from conditional independence tests. The same NCPC algorithm with no multiple testing correction was used as in the synthetic data benchmark. See Figure 1 for the graphical vocabulary.


Figure 7.

Combinatorial patterns of TFs in inferred causal neighbourhoods.

For each combinatorial pattern we show the number of CRMs with this pattern in the CRM class and in the remaining CRMs (percentages are given in parentheses). The difference between the two frequencies (CRM class vs rest) and the corresponding P-value are given in the last two columns. P-values were computed using Fisher's exact test for each combination and adjusted for multiple testing using the Benjamini-Hochberg method. See Materials and Methods for details. Frequency differences are colour-coded: blue for a decrease in the CRM class, and orange for an increase in the CRM class.
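The testing procedure described in this caption can be sketched in plain Python as follows. This is an illustrative implementation of a two-sided Fisher's exact test on a 2x2 table and of Benjamini-Hochberg adjustment, not the authors' code; the function names and the example counts are hypothetical, and the two-sided rule used here (summing all tables no more probable than the observed one) is one common convention that may differ from the one used in the paper.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]].

    Rows: CRM class vs rest; columns: pattern present vs absent.
    Sums the hypergeometric probabilities of all tables that are no more
    probable than the observed one (an assumed two-sided convention).
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical counts: (in class with pattern, in class without,
#                       rest with pattern, rest without).
tables = [(12, 28, 20, 250), (5, 35, 40, 230), (3, 37, 25, 245)]
raw = [fisher_exact_two_sided(*t) for t in tables]
for t, p, q in zip(tables, raw, benjamini_hochberg(raw)):
    print(t, f"P={p:.4g}", f"adjusted={q:.4g}")
```

Adjusting after computing all per-pattern P-values keeps the correction tied to the number of patterns tested, which is why the two steps are kept as separate functions.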
