Voting-based integration algorithm improves causal network learning from interventional and observational data: an application to cell signaling network inference

In order to increase statistical power for learning a causal network, data are often pooled from multiple observational and interventional experiments. However, if the direct effects of interventions are uncertain, multi-experiment data pooling can result in false causal discoveries. We present a new method, “Learn and Vote,” for inferring causal interactions from multi-experiment datasets. In our method, experiment-specific networks are learned from the data and then combined by weighted averaging to construct a consensus network. Through empirical studies on synthetic and real-world datasets, we found that for most of the larger-sized network datasets that we analyzed, our method is more accurate than state-of-the-art network inference approaches.

Causal modeling is an important analytical paradigm in action planning, predictive applications, research, and medical diagnosis [1,2]. A primary goal of causal modeling is to discover causal interactions of the form V_i → V_j, where V_i and V_j are observable entities and the arrow indicates that the state of V_i influences the state of V_j. Causal models can be fit to passive observational measurements ("seeing") as well as measurements that are made after performing external interventions ("doing").

In many settings, observational measurements [3] are more straightforward to obtain than interventional measurements, and thus observational datasets are frequently used for causal inference. However, given only observational data, it is difficult to distinguish between compatible Markov equivalent models [4,5]. For example, the three causal [...]

[...] batch effects, data collected from two different experiments might not be identically distributed, and thus the two experiments may be incoherent from the standpoint of a causal network model. As a result, directly combining data from different experiments can lead to errors in network learning. Interventions, too, pose a challenge because in real-world settings many interventions are (i) imperfect, meaning that interventions are unreliable and have soft targets (a "soft" target intervention, or "mechanism change," is an intervention that changes the parameters of a target node's distribution but does not render the node independent of its parent nodes [7]), and (ii) uncertain, meaning that the "off-target" nodes are unknown. Classical causal learning algorithms are based on the assumption that interventions are perfect [1]; applying such algorithms to a dataset derived from imperfect interventions would likely yield spurious interactions.
Eberhardt [8] classifies such errors into two types: [...]

Consensus has yet to emerge on the question of how, given two or more datasets generated from different interventions, the datasets should be combined to minimize such errors in the learned network model.

In this paper, we demonstrate in detail the performance of our proposed method, "Learn and Vote" [9], for inferring causal networks from multi-experiment datasets. "Learn and Vote" can be used to analyze datasets from mixed observational and interventional studies, and it is compatible with uncertain interventions. As it is fundamentally a data integration method, "Learn and Vote" is compatible with a variety of underlying network inference algorithms; our reference implementation combines "Learn and Vote" data integration with the Tabu search algorithm [10] and the Bayesian Dirichlet equivalent uniform (BDeu) [6,11,12] network score, as described below. To characterize the performance of "Learn and Vote", we empirically analyzed the network learning accuracies of "Learn and Vote" and six previously published causal network learning methods (including methods that are designed for learning from heterogeneous datasets) applied to six different network datasets. Of the six network datasets, the largest real-world dataset is a cell biology-based, mixed dataset (the Sachs et al. dataset [13]) with a known ground-truth network structure. On larger networks, we report superior (or, in the worst case, comparable) performance of "Learn and Vote" relative to the six previously published network inference methods.

In this section, we introduce notation and describe how perturbations affecting two or more variables in a causal model can lead to spurious dependencies or independencies.

Mathematically, a causal model M_c is described by a directed acyclic graph (DAG) containing a pair (V, E), where V is a set of observable nodes (corresponding to random [...]

Illustration of a simple hypothetical causal model M_c with three observable entities (V_i, V_j, and V_k). Two different interventional experiments are depicted: experiment M_e1 involves intervention I_1, and experiment M_e2 involves intervention I_2. Pooling measurements from the two experiments can cause two types of network inference errors: false positive edges (shown in (a) as a red arrow between V_i and V_j), and false negative edges (shown in (b) as blue arrows between V_i and V_k and between V_j and V_k).
[...] (dashed arrow in Fig. 1). Applying classical network inference algorithms to measurements pooled from multiple interventional experiments can lead to two different types of learning errors, as we explain below.

[...] intervention I_2 on V_k removes all the incident arrows for V_k and cuts off the causal influences of V_i and V_j on V_k, causing V_k ⊥⊥ Pa(V_k). Pooling data from such models can cause the causal dependencies V_i → V_k and V_j → V_k in M_c to be missed (i.e., a "false negative" in the inferred network).
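The false-negative mechanism described above can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes a hypothetical binary model V_i → V_k ← V_j with made-up probabilities (0.9 and 0.1), a perfect intervention that sets V_k by an independent coin flip, and arbitrary sample sizes. It only shows that the marginal association between V_i and V_k is diluted when interventional samples are pooled with observational ones, which is one way a true arc can fall below a detection threshold.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def sample(n, rng, intervene_on_vk=False):
    """Draw n samples (v_i, v_j, v_k) from the hypothetical model V_i -> V_k <- V_j.

    With intervene_on_vk=True, V_k is set by an independent coin flip,
    so it no longer depends on its parents (a perfect intervention).
    """
    rows = []
    for _ in range(n):
        vi = int(rng.random() < 0.5)
        vj = int(rng.random() < 0.5)
        if intervene_on_vk:
            vk = int(rng.random() < 0.5)
        else:
            # Assumed mechanism: V_k is likely "on" if either parent is "on".
            vk = int(rng.random() < (0.9 if (vi or vj) else 0.1))
        rows.append((vi, vj, vk))
    return rows

def corr_vi_vk(rows):
    """Correlation between the V_i and V_k columns."""
    return pearson([r[0] for r in rows], [r[2] for r in rows])

rng = random.Random(0)  # fixed seed so the run is repeatable
observational = sample(2000, rng)
interventional = sample(6000, rng, intervene_on_vk=True)
pooled = observational + interventional

r_obs = corr_vi_vk(observational)
r_pooled = corr_vi_vk(pooled)
print(f"corr(V_i, V_k), observational only: {r_obs:.2f}")
print(f"corr(V_i, V_k), pooled:             {r_pooled:.2f}")
```

Under these assumptions the pooled correlation should come out markedly weaker than the observational one, because the interventional samples carry no V_i to V_k signal.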

Review of prior literature

Classical causal learning methods fall into two classes: constraint-based methods (e.g., PC [2], FCI [14]), in which the entire dataset is analyzed using conditional independence tests; and score-based methods (e.g., GES, GIES [15]), in which a score is computed from the dataset for each candidate network model. Both classes of methods were designed to analyze a single observational dataset, with the attendant limitations (in the context of multi-experiment datasets) that we described above. Several multi-dataset network inference approaches have been proposed that circumvent the above-described problems associated with cross-experiment measurement pooling. Cooper and Yoo [6] proposed a score-based algorithm that combines data from multiple experiments, each having perfect interventions with known targets. The approach was later refined by Eaton and Murphy [7] for uncertain and soft interventions [16]. The method of Claassen and Heskes [17] is based on imposing the causal invariance property across environment changes. Sachs [...]

[...] exploits the 'local' aspect of causal V-structures [23]. However, none of these methods produces experiment-specific weighted graphs, instead enumerating experiment-specific partial ancestral graphs that are consistent with the data. In real-world datasets, due to a variety of factors (finite sampling, experiment-specific biases and confounding effects, measurement error, missing data, and uncertain/imperfect interventions), the confidence with which a given causal interaction V_i → V_j can be predicted within a given experiment will in many cases vary significantly from experiment to experiment (and in the case of incomplete measurements, may not be quantifiable at all in a given experiment). Thus, a network combination method compatible with experiment-specific edge weights would seem to offer a distinct advantage in the context of multi-experiment network inference. Furthermore, all of these methods assume that a single underlying [...]

2. Although pooling data adds more confidence in learning the true causal arcs, it can also introduce spurious arcs with incorrect direction (see Fig. 4).
3. Each intervention might alter a mechanism or influence the local distribution in an unknown way [24].

To avoid the problems arising from pooling data from different experiments in causal network learning, we propose the "Learn and Vote" method (shown in Fig. 3 and Algorithm 1). The method's key idea is to (1) learn a separate weighted causal network from the data generated in each experiment (which may be interventional or [...]

Algorithm 1 (fragment): avgArcs = avgNetwork(arcProb)

Scoring Function

We incorporate the effect of intervention in the score component associated with each node by modifying the standard Bayesian Dirichlet equivalent uniform score (BDeu) [6,11,12]. [...] assigned to the choice of set U as parents of V_i, and the right part is the probability of the data integrated over every possible parameterization (θ) of the distribution.

Step 2 - Creating 100 random DAGs using the observed nodes, as a starting point.
Step 3 - Optimizing each of the 100 DAGs with data using Tabu search.
Step 4 - Calculating the probability (in terms of strength and direction) of occurrence for every possible arc from the 100 optimized DAGs and storing them in tables.
Step 5 - Combining votes from all the tables by weighted averaging and constructing the final causal network, with arc strengths above a threshold (in this case 50%).
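The modified scoring equation itself is not recoverable from the text above, so the following is only a sketch of how an intervention-aware BDeu local score is commonly computed. It assumes the convention of Cooper and Yoo [6]: when scoring a node, samples in which that node was an intervention target are left out of its counts. The function name `bdeu_local_score`, its arguments, and the default equivalent sample size are hypothetical, and the structure prior over parent sets is omitted (treated as uniform).

```python
from collections import Counter
from math import lgamma, log

def bdeu_local_score(data, node, parents, arity, intervened=None, ess=1.0):
    """Log BDeu marginal likelihood of `node` given `parents` (hypothetical helper).

    data       : list of dicts mapping variable name -> integer state
    arity      : dict mapping variable name -> number of states
    intervened : optional list (one set per sample) of variables targeted
                 in that sample; samples targeting `node` are skipped
                 (assumed Cooper-Yoo convention)
    ess        : equivalent sample size of the BDeu prior
    """
    r = arity[node]
    q = 1
    for p in parents:
        q *= arity[p]
    alpha_j = ess / q          # prior mass per parent configuration
    alpha_jk = ess / (q * r)   # prior mass per (configuration, state) cell

    n_j, n_jk = Counter(), Counter()
    for idx, row in enumerate(data):
        if intervened is not None and node in intervened[idx]:
            continue  # this sample says nothing about node's own mechanism
        j = tuple(row[p] for p in parents)
        n_j[j] += 1
        n_jk[(j, row[node])] += 1

    # Cells with zero counts contribute exactly 0, so only observed cells are summed.
    score = 0.0
    for count in n_j.values():
        score += lgamma(alpha_j) - lgamma(alpha_j + count)
    for count in n_jk.values():
        score += lgamma(alpha_jk + count) - lgamma(alpha_jk)
    return score

# Toy usage: binary A -> B, where the last two samples intervened on B.
arity = {"A": 2, "B": 2}
data = [{"A": 0, "B": 0}, {"A": 1, "B": 1}, {"A": 1, "B": 1},
        {"A": 0, "B": 1}, {"A": 1, "B": 0}]
targets = [set(), set(), set(), {"B"}, {"B"}]
with_mask = bdeu_local_score(data, "B", ["A"], arity, intervened=targets)
subset_only = bdeu_local_score(data[:3], "B", ["A"], arity)
```

In this toy usage `with_mask` and `subset_only` coincide, since masking the intervened samples is equivalent to scoring B on the first three samples only; the score of A would still use all five samples.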
February 26, 2020

[...] networks in Net. We average the arc strengths for every directed arc over the networks in which the corresponding target node was not intervened upon and store them as a list arcProb.

Combining results from experiments

Given arc information (in arcProb, see Algorithm 1) from each experiment, we average their strengths and directions over the number of experiments in which the given arc is valid (using procedure avgNetwork). Finally, we compute the averaged arc strengths as avgArcs and threshold on arc strength (using a predefined Threshold) in order to produce the final DAG (using procedure learnDAG). We found that our method performs best at a 50% threshold. We implemented "Learn and Vote" in the R programming language, making use of the bnlearn package [25].
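The voting step described above can be sketched as follows. This is an illustrative Python sketch, not the authors' R implementation: the function names `arc_strengths` and `learn_and_vote` are hypothetical stand-ins for the roles of arcProb, avgNetwork, and learnDAG, each experiment is assumed to be given as a list of already-optimized DAGs plus the set of intervened nodes, valid experiments are weighted equally, and no acyclicity check is applied to the thresholded result.

```python
from collections import defaultdict

def arc_strengths(dags):
    """Fraction of DAGs (each a set of (parent, child) arcs) that contain each arc."""
    counts = defaultdict(int)
    for dag in dags:
        for arc in dag:
            counts[arc] += 1
    return {arc: c / len(dags) for arc, c in counts.items()}

def learn_and_vote(experiments, threshold=0.5):
    """Average per-experiment arc strengths and keep arcs above `threshold`.

    experiments : list of (dags, targets) pairs, where `targets` is the set
                  of nodes intervened upon in that experiment.
    An arc parent -> child only counts in experiments where `child` was
    not an intervention target.
    """
    totals = defaultdict(float)
    for dags, targets in experiments:
        for (parent, child), strength in arc_strengths(dags).items():
            if child not in targets:
                totals[(parent, child)] += strength

    averaged = {}
    for (parent, child), total in totals.items():
        # Number of experiments in which this arc is valid (child not targeted).
        n_valid = sum(1 for _, targets in experiments if child not in targets)
        averaged[(parent, child)] = total / n_valid

    final_arcs = {arc for arc, s in averaged.items() if s > threshold}
    return final_arcs, averaged

# Toy usage with two experiments over nodes A, B, C (four DAGs each).
observational = ([{("A", "B"), ("B", "C")}, {("A", "B")},
                  {("B", "A"), ("B", "C")}, {("A", "B"), ("B", "C")}], set())
intervention_on_b = ([{("A", "B"), ("B", "C")}, {("B", "C")},
                      {("B", "C")}, {("B", "C"), ("C", "A")}], {"B"})
final_arcs, averaged = learn_and_vote([observational, intervention_on_b])
print(sorted(final_arcs))
```

In the toy input, votes for A → B from the second experiment are discarded because B was the intervention target there, so that arc is judged on the observational experiment alone.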

Datasets that we used for empirical performance analysis

From six published networks, we obtained nine datasets (with associated ground-truth networks) that we analyzed in this work. To avoid bias, from each network we used both observational and interventional datasets. For synthetic networks, we drew random samples as observations. As interventions, we set some target nodes to fixed values. Next, in order to model uncertainty, we also set one or more of the target's children to different values (like "fat-hands" [7]), which are assumed to be unknown. Finally, we sampled data from each of the mutilated networks [26]:

• Lizards: a real-world dataset having three variables representing the perching behaviour of two species of lizards on South Bimini island [27]. We generated one observational dataset and two interventional datasets from the lizards network.

• Asia: a synthetic network of eight variables [28] about the occurrence of lung diseases and their relation to visits to Asia. For our empirical study, we created two mutilated networks: Asia mut1 has one observational and one interventional dataset, and Asia mut2 has one observational and two interventional datasets.

• Alarm: a synthetic network of thirty-seven variables representing an alarm messaging system for patient monitoring [29]. For our study, we created two mutilated networks: Alarm mut1 has three observational and six interventional datasets, and Alarm mut2 has five observational and ten interventional datasets.

• Insurance: a synthetic network of twenty-seven variables for evaluating car insurance risks [30]. We created two mutilated networks: Insurance mut1, from [...]

Causal network learning methods that we compared to "Learn and Vote"

Using the aforementioned networks and datasets, we compared the accuracy of "Learn and Vote" for network inference to the following six algorithms (implemented in R): [...]

For each of these methods except PC, the method implementations that we used were adapted for heterogeneous datasets (see citations above). [...]

Table 1.

Effect of interventions on network inference

Based on prior studies suggesting that incorporating data from interventional studies improves network inference (see Introduction), we re-analyzed the Sachs et al. [13] biological cell signaling dataset (for which a ground-truth network was published [13]) using their published inference approach twice, first using observational samples only (Fig. 4a) and then using an equal number of samples comprising 50% observational and 50% interventional data (Fig. 4b). We found that sensitivity for detecting cell signaling interactions increases when data from observational and interventional experiments are co-analyzed (Fig. 4b), versus when only data from observational experiments are used (Fig. 4a). These results illustrate the benefit of using data from interventional experiments for causal network reconstruction.

[...] (Fig. 4b), use of "Learn and Vote" significantly reduced false positives, while increasing the overall robustness of the network learning (Fig. 4c).

Systematic comparative studies

To study the performance characteristics of "Learn and Vote" for a broader class of network inference applications, we carried out a systematic, empirical comparison of our method's performance with six previously published causal network learning methods using nine datasets (from six underlying networks of small to medium size, as described above in Methods and Datasets), spanning a variety of application domains. [...]

Moreover, our method had the lowest number of false positives among all seven methods and was tied for second-highest in terms of the number of true positives (Fig. 5h).

Each row corresponds to a specific dataset derived from a specific underlying ground-truth network (as described in detail in Methods and Datasets). Each row is split into three performance measures (precision, recall, and the "F1" harmonic mean of precision and recall). For each sub-row, the highest performance measurement is boldfaced. Each column corresponds to a specific method for causal network inference (as described in detail in Methods and Datasets), with the performance measures of our method ("Learn and Vote") in the rightmost column. The symbol "n/a" denotes that no performance results were available for that method on that dataset. Here, the method "simy" is only feasible for networks containing up to 20 nodes, so it failed to produce results on the larger networks. The network size denotes the number of nodes in the indicated network. The network type is as follows: RW, real-world; S, synthetic.

[...] terms of precision, with the ICP method having the second-best performance. The positive predictive rate of our approach is higher for small or medium-sized networks (i.e., fewer than 20 nodes) but decreases as the size of the network increases. In contrast, the greedy algorithms (GDS, GIES) perform well for smaller networks but suffer from lower precision on larger networks.
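For reference, the three performance measures used in the comparison (precision, recall, and their harmonic mean F1) can be computed from sets of directed arcs as in the sketch below. The function name `arc_metrics` and the toy arcs are hypothetical. The sketch also assumes that a reversed arc counts as both a false positive and a false negative; how the evaluation in this paper scores reversed arcs is not stated in the text.

```python
def arc_metrics(predicted, truth):
    """Precision, recall, and F1 of predicted directed arcs against ground truth.

    Both arguments are iterables of (parent, child) tuples. Arc direction
    matters: ("A", "B") and ("B", "A") are treated as different arcs.
    """
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)   # arcs recovered with the correct direction
    fp = len(predicted - truth)   # predicted arcs absent from the ground truth
    fn = len(truth - predicted)   # ground-truth arcs that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy usage: one reversed arc (C -> B) and one spurious arc (A -> D).
truth = {("A", "B"), ("B", "C"), ("C", "D")}
predicted = {("A", "B"), ("C", "B"), ("C", "D"), ("A", "D")}
precision, recall, f1 = arc_metrics(predicted, truth)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are reasonably high, which is why a conservative method can score well on precision yet poorly on F1.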
In terms of F1, our approach outperformed the others in five out of nine studies and is more stable even when the network size increases. For very small networks (i.e., fewer than 10 nodes), the PC-based approach has good sensitivity; however, it leaves many of the arc directions ambiguous (Fig. 5a).

[...] samples), the performance of "Learn and Vote" is no better than the pooling-based Sachs et al. method (Fig. 7b-c). Hence, when only a small amount of data is available, it is a good idea to combine the data irrespective of experimental conditions. However, for a large enough sample size, we see that pooling degrades accuracy of network [...]

Taken together, our results (Fig. 5 and Table 1) suggest that for analyzing datasets from studies that have imperfect interventions, greedy analysis methods (e.g., GDS, GIES) are not as accurate as "Learn and Vote". On the other hand, ICP is conservative due to its strict invariance property and helps reduce false causal arcs to a great extent, but at the cost of sensitivity (Fig. 5d). The relatively poor performance of the PC [...]

[...] structure. Another open question for our approach is how the number of random DAGs (we used 100) should scale with network size. For example, for larger graphs, 100 might not be sufficient, while for smaller graphs it could be overkill. One possible improvement to "Learn and Vote" would be an adaptive method for selecting the number of random initial DAGs; this is an area of planned future work.

We report a new approach, "Learn and Vote," for learning a causal network structure from multiple datasets generated from different experiments, including the case of hybrid observational-interventional datasets. Our approach assumes that each dataset is generated by an unknown causal network altered under different experimental conditions (and thus, that the datasets have different distributions). Manipulated distributions imply manipulated graphs over the variables, and therefore combining them to learn a network might increase statistical power, but only under the assumption that a single network is true for every dataset. Unfortunately, this is not always the case under uncertain interventions. Our results are consistent with the theory that simply pooling measurements from multiple experiments with uncertain interventions leads to spurious changes in correlations among variables and increases the rate of false positive arcs in the consensus network. In contrast, our "Learn and Vote" method avoids the problems of pooling by combining experiment-specific weighted graphs. We compared "Learn and Vote" with six other causal learning methods on observational and interventional datasets having uncertain interventions. We found that for most of the larger-network datasets that we analyzed, "Learn and Vote" significantly reduces the number of false positive arcs and achieves superior F1 scores. However, for cases where the sample size per experiment is very small, we found that pooling works better. Our findings (i) motivate the need to focus on the uncertain and unknown effects of interventions in order to improve causal network learning precision, and (ii) suggest caution in using causal learning algorithms that assume perfect interventions in the context of real-world domains that have uncertain intervention effects.