Supplementary Information

Supplementary Text S1: Joint and Conditional Joint Dependence Patterns of Random Variables


In this section we prove that no discrete random variables can have the joint and conditional joint dependence patterns of conditional independence (Candidate Patterns 1 and 2).
Theorem 1. There are no discrete random variables A, B, T for which the following set of dependencies are all simultaneously true:

¬(A ⊥⊥ T) (1)
¬(B ⊥⊥ T) (2)
A ⊥⊥ T | B (3)
B ⊥⊥ T | A (4)

when all of P(A|B), P(B|A), P(A), P(B) have non-zero probabilities.
Proof. We will prove this theorem by contradiction. We start by assuming that all four Equations 1-4 are true. In that case, the joint probability P(A, B, T) can be written as:

P(A, B, T) = P(T | A, B) P(A, B) = P(T | B) P(A, B) (5)
P(A, B, T) = P(T | A, B) P(A, B) = P(T | A) P(A, B) (6)

where Equation 5 uses Equation 3 and Equation 6 uses Equation 4. Now we use the property that P(A|B), P(B|A), P(A), P(B), and thus P(A, B), have non-zero probabilities. When these are zero, the system reduces to trivial cases where either A and B are identical, or one of them is a constant.
We divide both sides of the equation by P(A, B) to obtain:

P(T | B) = P(T | A) (7)

We can repeat this derivation for all values of the random variables A, B, T, and for any of these values we would arrive at the same equation. Without loss of generality, assume that A, B, T are binary. In that case, for A = 1, B = 1, T = 1 and for A = 0, B = 1, T = 1 we get:

P(T = 1 | B = 1) = P(T = 1 | A = 1)
P(T = 1 | B = 1) = P(T = 1 | A = 0)

This implies that P(T = 1 | A = 0) = P(T = 1 | A = 1), which in turn implies that A and T are marginally independent when T = 1. The same reasoning can be applied to the case when T = 0, thus showing that A and T are marginally independent, which contradicts the initial assumption that they are dependent (Equation 1). A similar argument can be made for discrete variables with more than two values.
The same theorem can be proven for the conditional joint pattern of dependence. Both the formulation and the proof are equivalent, but conditional on a set of variables.
We conjecture that the same theorem and reasoning might apply for continuous random variables as well.
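The binary case of Theorem 1 can also be checked numerically. The sketch below is our illustration, not part of the proof: it enumerates a grid of full-support joint distributions of binary A, B, T (cell probabilities are multiples of 1/STEPS) and confirms that whenever both conditional independencies hold, A and T are also marginally independent, so the pattern of Equations 1-4 never occurs. The grid resolution STEPS and tolerance TOL are arbitrary choices.

```python
# Numerical sanity check of Theorem 1 for binary A, B, T (illustration only).
STEPS, TOL = 16, 1e-9

def compositions(total, parts):
    # all tuples of non-negative integers of length `parts` summing to `total`
    if parts == 1:
        yield (total,)
        return
    for head in range(total + 1):
        for tail in compositions(total - head, parts - 1):
            yield (head,) + tail

def pt1(cells, a, b):
    # P(T = 1 | A = a, B = b); cells maps (a, b, t) to a probability
    return cells[(a, b, 1)] / (cells[(a, b, 0)] + cells[(a, b, 1)])

def both_cis_hold(cells):
    # A ⊥⊥ T | B: P(T=1|A,B=b) equal across A; B ⊥⊥ T | A: equal across B
    a_ci = all(abs(pt1(cells, 0, b) - pt1(cells, 1, b)) <= TOL for b in (0, 1))
    b_ci = all(abs(pt1(cells, a, 0) - pt1(cells, a, 1)) <= TOL for a in (0, 1))
    return a_ci and b_ci

def marg_pt1_given_a(cells, a):
    # P(T = 1 | A = a), marginalising over B
    num = cells[(a, 0, 1)] + cells[(a, 1, 1)]
    den = sum(cells[(a, b, t)] for b in (0, 1) for t in (0, 1))
    return num / den

violations = 0   # distributions with both CIs but A, T marginally dependent
ci_count = 0     # distributions on the grid where both CIs hold
for extra in compositions(STEPS - 8, 8):
    probs = [(1 + e) / STEPS for e in extra]      # full support by construction
    cells = {(a, b, t): probs[4 * a + 2 * b + t]
             for a in (0, 1) for b in (0, 1) for t in (0, 1)}
    if both_cis_hold(cells):
        ci_count += 1
        if abs(marg_pt1_given_a(cells, 0) - marg_pt1_given_a(cells, 1)) > TOL:
            violations += 1
print("grid distributions contradicting Theorem 1:", violations)
```

The check is not vacuous: at least the uniform distribution on the grid satisfies both conditional independencies, and no enumerated distribution exhibits the forbidden pattern.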
Conditional independence can be established in at least two ways: (i) by performing conditional independence tests for subsets of variables, the approach taken by constraint-based network inference algorithms; and (ii) by building a Bayesian network for all the variables, the approach taken by score-based algorithms.
The PC algorithm is one of the first algorithms proposed for inference of Bayesian networks [15]. It is constraint-based and builds a Bayesian network in two steps. In the first step (PC-skeleton), the algorithm starts with a fully connected graph and then performs conditional independence tests of increasing order to delete edges. This recovers the structure of direct and indirect dependencies, that is, the undirected network skeleton. In the second step, edges are oriented using a set of rules.
Our algorithm uses a procedure similar to the PC-skeleton algorithm to distinguish direct from indirect dependencies. However, it does not perform the second step, as we are not interested in inferring causation. For completeness, we introduce the PC-skeleton algorithm below. With Adj(X) we denote the nodes in a graph that have an edge to X, i.e. the nodes adjacent to X.

PC-skeleton algorithm
1. Set n = 0.
2. Let G be a complete undirected graph with nodes representing the variables from D.

Repeat:
(a) For all pairs of variables (X, Y) connected by an edge in G, test if they are conditionally independent given a set of variables S of size n, where S ⊂ Adj(X) or S ⊂ Adj(Y). If they are conditionally independent, remove the edge X − Y from the graph.
(b) Set n = n + 1, until no variable has more than n adjacencies, or a stopping condition is satisfied.

Return G
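As an illustration, the PC-skeleton loop above can be sketched as follows. This is a minimal reading of the pseudocode, not a reference implementation; the callback ci_test(x, y, s) is a hypothetical stand-in for a conditional independence test.

```python
from itertools import combinations

def pc_skeleton(variables, ci_test, max_order=None):
    """Sketch of the PC-skeleton step: start from a complete undirected
    graph and delete every edge that some conditional independence test
    explains away. `ci_test(x, y, s)` returns True when x is independent
    of y given the set s."""
    adj = {v: set(variables) - {v} for v in variables}
    n = 0
    while True:
        # stop when no node has more than n remaining neighbours,
        # or when the optional cap on the conditioning-set size is hit
        if all(len(adj[x]) - 1 < n for x in variables):
            break
        if max_order is not None and n > max_order:
            break
        for x in variables:
            for y in list(adj[x]):
                # condition on size-n subsets of Adj(x) \ {y}
                for s in combinations(adj[x] - {y}, n):
                    if ci_test(x, y, set(s)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        break
        n += 1
    return {frozenset((x, y)) for x in variables for y in adj[x]}

# Toy oracle for a chain X - Y - Z: the only independence is X ⊥⊥ Z | Y.
def chain_oracle(a, b, s):
    return {a, b} == {"X", "Z"} and "Y" in s

skeleton = pc_skeleton(["X", "Y", "Z"], chain_oracle)
print(sorted(sorted(e) for e in skeleton))  # edges X-Y and Y-Z remain
```

With the oracle test, the indirect edge X − Z is removed at order n = 1 (conditioning on Y), leaving only the direct edges of the chain.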
The algorithm assumes that the conditional independence test gives reliable results. In practice, a conditional independence test, like a chi-square test, is used with a fixed P-value cut-off α (typically set to α = 0.05). This cut-off specifies the Type I error (false positive) rate under the null hypothesis. The Type II error (false negative) rate of conditional independence tests is harder to control. This rate depends on the sample size, the size of the conditioning set, the size of the effects we wish to detect, and the α value. Either the power can be estimated taking into account these four factors, or the maximal size of the conditioning set (or the number of data points per conditioning set) can be capped [14]. The Type II error rate increases as the size of the conditioning set S increases, that is, the test loses power as we condition on more variables.
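A stratified chi-square conditional independence test of the kind mentioned above can be sketched as follows. This is a simplified illustration under our own assumptions (function names and data layout are ours, and real implementations typically also enforce a minimum number of observations per stratum): one Pearson chi-square test is computed per stratum of the conditioning variable, and statistics and degrees of freedom are summed.

```python
import math
import numpy as np

def chi2_sf(x, dof):
    # chi-square survival function via the standard recurrence
    # Q_{v+2}(x) = Q_v(x) + (x/2)^(v/2) * exp(-x/2) / Gamma(v/2 + 1)
    if dof % 2 == 0:
        q, v = math.exp(-x / 2), 2
    else:
        q, v = math.erfc(math.sqrt(x / 2)), 1
    while v < dof:
        q += (x / 2) ** (v / 2) * math.exp(-x / 2) / math.gamma(v / 2 + 1)
        v += 2
    return q

def conditional_chi2(x, y, z, alpha=0.05):
    """Stratified Pearson chi-square test of x independent of y given z,
    for binary 0/1 arrays: one 2x2 test per stratum of z, statistics and
    degrees of freedom summed over strata. Returns True when independence
    is not rejected at level alpha."""
    stat, dof = 0.0, 0
    for zv in np.unique(z):
        mask = z == zv
        table = np.zeros((2, 2))
        for xi, yi in zip(x[mask], y[mask]):
            table[xi, yi] += 1
        row = table.sum(axis=1, keepdims=True)
        col = table.sum(axis=0, keepdims=True)
        expected = row @ col / table.sum()
        if (expected == 0).any():
            continue                      # skip degenerate strata
        stat += ((table - expected) ** 2 / expected).sum()
        dof += 1                          # each 2x2 table adds one dof
    if dof == 0:
        return True
    return chi2_sf(stat, dof) > alpha

# Toy data: Z confounds X and Y, so X and Y are marginally dependent
# but conditionally independent given Z (counts chosen to be exact).
counts = [(0, 0, 0, 81), (0, 1, 0, 9), (1, 0, 0, 9), (1, 1, 0, 1),
          (0, 0, 1, 1), (0, 1, 1, 9), (1, 0, 1, 9), (1, 1, 1, 81)]
x = np.array([xv for xv, _, _, n in counts for _ in range(n)])
y = np.array([yv for _, yv, _, n in counts for _ in range(n)])
z = np.array([zv for _, _, zv, n in counts for _ in range(n)])
print(conditional_chi2(x, y, z))                 # independence given Z is not rejected
print(conditional_chi2(x, y, np.zeros_like(z)))  # marginal independence is rejected
```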
More recently, a framework for generalized local learning (GLL) was proposed [13]. Instead of systematically doing tests of increasing order, the Markov blanket for each of the nodes is found separately, and then a network is constructed by violating the smallest number of inferred Markov blankets. These algorithms have a forward and a backward phase: in the forward phase, variables are added to the tentative Markov blanket of a node, and in the backward phase, conditional independence tests are used to remove false positives from the tentative Markov blanket. The advantage of such an approach is that it separately finds the best local Markov blanket for each node, and is thus less sensitive to error propagation through the network inference steps, as is the case in the PC algorithm.
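The forward/backward scheme can be illustrated with a minimal IAMB-style sketch. This is our simplified rendering of the general idea, not the exact GLL procedure of [13]; assoc and ci_test are hypothetical callbacks for an association measure and a conditional independence test.

```python
def iamb_blanket(target, variables, assoc, ci_test):
    """Minimal IAMB-style forward/backward sketch. `assoc(v, target, s)`
    scores the association of v with the target given set s, and
    `ci_test(v, target, s)` returns True when v is independent of the
    target given s."""
    blanket = []
    # forward phase: greedily add the most associated variable that is
    # still dependent on the target given the current tentative blanket
    while True:
        candidates = [v for v in variables
                      if v not in blanket and not ci_test(v, target, set(blanket))]
        if not candidates:
            break
        blanket.append(max(candidates, key=lambda v: assoc(v, target, set(blanket))))
    # backward phase: remove false positives that the rest of the
    # tentative blanket renders conditionally independent of the target
    for v in list(blanket):
        rest = set(blanket) - {v}
        if ci_test(v, target, rest):
            blanket.remove(v)
    return blanket

# Toy oracle: A directly influences T, B only indirectly through A.
def oracle_ci(v, t, s):
    return v == "B" and "A" in s

def oracle_assoc(v, t, s):
    return {"A": 0.9, "B": 0.4}[v]

blanket = iamb_blanket("T", ["A", "B"], oracle_assoc, oracle_ci)
print(blanket)  # only A survives both phases
```

In the toy example, B enters no blanket because conditioning on A already renders it independent of the target, mirroring how the backward phase prunes false positives.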

Detailed description of NCPC and NCPC* algorithms
Below we give more detailed pseudocode for the NCPC and NCPC* algorithms. A conditional independence test Iα returns true if the P-value of the test is greater than α, and false otherwise.
NCPC algorithm

Input:

• Column vector T of target variable values, with observations corresponding to those of X.
• Conditional independence test Iα (with a given α value threshold and, optionally, a minimal number of observations l per set of values in the conditioning set).
• (Optionally) Maximal size of the conditioning set k.

If a test pair is of the form Iα(Xi, T | {Xj, S}) = true and Iα(Xj, T | {Xi, S}) = true, then mark this test pair as inconsistent.
7. If a variable Xi is removed only using inconsistent tests, then mark it as having joint dependence.

8. Mark all remaining variables without any calls as having no dependence.

9. Return the calls for each of the variables in X.

NCPC* algorithm

Input: (same as for the NCPC algorithm)

Algorithm:

1. Initialise a list of candidates C of variables for direct dependence with ordered pairs (Xi, Xj), where Xj is a conditioning variable that can be ∅.

Extend the list of candidate variables with ordered pairs (conditionally dependent variable, conditioning set):

(a) Enumerate all subsets Sn: {S : S ⊂ C ∧ ||S|| = n}. Since S consists of ordered pairs, ||S|| measures the number of non-empty values in all pairs, e.g. ||{(Xi, ∅), (Xj, Xk)}|| = 3.

(b) For each ordered pair (Xi, Xj) ∈ C (Xj can be ∅) and S ∈ Sn (Xi ∉ S), test Iα(Xi, T | {Xj, S}) and save the result.

(c) For each (Xi, Xj) ∈ C, check if there is at least one S for which Iα(Xi, T | {Xj, S}) = true. If such an S exists, remove (Xi, Xj) from the list of candidates C, together with all ordered pairs in which Xi is in the second place of the pair.

(d) Set n = n + 1.

(e) Break out of the loop if n > k or if n ≥ ||C||.

5. Mark all remaining pairs (Xi, Xj) ∈ C as having direct dependence if Xj = ∅ and conditional dependence otherwise.

6. Mark all pairs (Xi, ∅) removed in Step 3 as having indirect dependence.

7. For each pair of ordered pairs (Xi, Xj), (Xp, Xr) removed in Step 3, examine all tests performed in Step 3. If a test pair is of the form Iα(Xi, T | {Xp, Xj, S}) = true and Iα(Xp, T | {Xi, Xr, S}) = true, then mark this test pair as inconsistent.

8. If an ordered pair (Xi, Xj) is removed only using inconsistent tests, then mark it as having joint dependence if Xj = ∅ and conditional joint dependence otherwise.

9. Mark all remaining variables without any calls as having no dependence.

10. Return the calls for each of the variables in X.

We also implemented NCPC and NCPC* with multiple testing correction, as suggested by [21] for the PC algorithm. Briefly, at each iteration of Step 3 in NCPC (Step 4 in NCPC*), and for each candidate retained, a list is kept of the P-values from conditional independence tests given the other candidates in C. The maximum of these P-values is used as the candidate's P-value, and multiple testing correction is applied to these maximal P-values over all candidates. At a predefined false discovery rate (FDR), candidates with adjusted P-values above the FDR threshold are removed from the candidate set. This procedure controls the false discovery rate of edges with direct dependence. Either the Benjamini-Hochberg correction [19] for independent tests or the Benjamini-Yekutieli correction [20] for dependent tests can be used.
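The multiple testing correction described above can be sketched as follows, assuming the Benjamini-Hochberg procedure for independent tests; the candidate names and data layout are illustrative only.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg step-up adjusted P-values."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running = 1.0
    # walk from the largest P-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running = min(running, p[i] * m / rank)
        adj[i] = running
    return adj

def filter_candidates(candidate_pvals, fdr=0.05):
    """candidate_pvals maps a candidate to the list of P-values from its
    conditional independence tests given other candidates; the maximum
    per candidate is BH-adjusted, and candidates whose adjusted P-value
    exceeds the FDR threshold are dropped."""
    names = list(candidate_pvals)
    max_p = [max(candidate_pvals[c]) for c in names]
    adjusted = bh_adjust(max_p)
    return [c for c, a in zip(names, adjusted) if a <= fdr]

# Hypothetical candidates with P-values from tests against other candidates
pvals = {"X1": [0.001, 0.003], "X2": [0.2, 0.04], "X3": [0.01, 0.012]}
kept = filter_candidates(pvals, fdr=0.05)
print(kept)  # ['X1', 'X3']
```

Taking the maximum P-value per candidate is the conservative choice described in the text: a candidate survives only if it remains dependent on the target in every test considered.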
Supplementary Figure S2. Precision and recall for the "Time" scenario. Precision is measured as TP/(TP+FP) and recall as TP/(TP+FN), where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives. The best precision and recall rates are shown in bold, together with other values that are not significantly different (where the difference is smaller than 0.04, which is roughly two times the 95% confidence interval for these values).

Supplementary Figure S4. Results in the "Time" scenario with α = 0.01. Correct predictions, precision and recall in the "Time" scenario, but using the α = 0.01 P-value cut-off. With this cut-off we recover the expected high performance of the NCPC algorithm in the most correlated case. The best values and those not significantly different are shown in bold.
Supplementary Figure S5. Results in the "Hidden" scenario with α = 0.01. Correct predictions, precision and recall in the "Hidden" scenario, but using the α = 0.01 P-value cut-off. With this cut-off we recover the expected high performance of the NCPC algorithm in the most correlated case. The best values and those not significantly different are shown in bold.

Supplementary Figure S6. Causal neighbourhood marked with rectangles and marginal dependence colour-coded for the mesodermal dataset. The layout of these plots is the same as that of Figure 4 of the original paper where the data were first published [18]. The colour-coding of marginal dependence is the same as in Figure 6. The numbers represent the fold difference for binding of TFs in each of the classes. For instance, in the "Meso" CRMs, Twi 2-4h is bound 1.83 times more frequently than in the rest of the CRMs. With solid rectangles we mark those combinations of TFs and time points that are in the causal neighbourhood of a CRM class (direct or joint dependence). The last figure summarizes the causal neighbourhoods for the 5 CRM classes. With "x" we mark those TF/time interval combinations that are not in the causal neighbourhood of any of the CRM classes.

Supplementary Figure S7. DDGraph for results of NCPC* on 5 CRM classes of mesoderm at α = 0.05. The result is the same as Figure 6 with the exception of Meso and VM&SM classes. For the Meso class we find two additional variables with conditional joint dependence. Following our results on synthetic data we discard them as false positives. For the VM&SM class we find that Bin 10-12h has not only joint dependence, but also conditional dependence (when conditioning on Twi 2-4h). This further confirms that Bin 10-12h is a member of the Markov blanket of VM&SM CRM class.
Supplementary Figure S10. IAMB applied to the 5 CRM classes at α = 0.05. IAMB finds similar, but sometimes seemingly wrong, causal neighbours. For instance, for Meso it finds that Bin 10-12h has direct dependence (being a causal child of Meso CRMs), while Twi 2-4h is a causal spouse, and thus independent of Meso CRMs. However, the situation is exactly the reverse: Bin 10-12h is independent of Meso CRMs and Twi 2-4h has a strong dependence on Meso CRMs. For VM and SM, IAMB does not find any variables that directly influence these CRMs.
Supplementary Figure S12. InterIAMB applied to the 5 CRM classes at α = 0.05. InterIAMB finds variables similar to those found by the other algorithms of the IAMB class. In the case of Meso, it also does not find Twi 2-4h as having direct dependence.