Inferring Tree Causal Models of Cancer Progression with Probability Raising

Existing techniques to reconstruct tree models of progression for accumulative processes, such as cancer, seek to estimate causation by combining correlation and a frequentist notion of temporal priority. In this paper, we define a novel theoretical framework called CAPRESE (CAncer PRogression Extraction with Single Edges) to reconstruct such models based on the notion of probabilistic causation defined by Suppes. We consider a general reconstruction setting complicated by the presence of noise in the data due to biological variation, as well as experimental or measurement errors. To improve tolerance to noise we define and use a shrinkage-like estimator. We prove the correctness of our algorithm by showing asymptotic convergence to the correct tree under mild constraints on the level of noise. Moreover, on synthetic data, we show that our approach outperforms the state-of-the-art, that it is efficient even with a relatively small number of samples and that its performance quickly converges to its asymptote as the number of samples increases. For real cancer datasets obtained with different technologies, we highlight biologically significant differences in the progressions inferred with respect to other competing techniques and we also show how to validate conjectured biological relations with progression models.

We proceed by rewriting P(b | a)/P(b | ¬a) > P(a | b)/P(a | ¬b) as

P(a, b)P(¬a) / (P(¬a, b)P(a)) > P(a, b)P(¬b) / (P(a, ¬b)P(b)),

which means that

P(b | a)/P(b | ¬a) > P(a | b)/P(a | ¬b) ⟺ P(¬a) / (P(¬a, b)P(a)) > P(¬b) / (P(a, ¬b)P(b)). (14)

We can rewrite the right-hand side of (14) by using x, y, z, where x = P(¬a, b), y = P(a, b) and z = P(a, ¬b), so that P(a) = P(a, b) + P(a, ¬b) = y + z and P(b) = P(a, b) + P(¬a, b) = x + y, and then perform suitable algebraic manipulations. We have

(1 − y − z) / (x(y + z)) > (1 − x − y) / (z(x + y)) ⟺ yz − y²z − xz² − yz² > xy − x²y − x²z − xy², (15)

when x(y + z) ≠ 0 and z(x + y) ≠ 0. To check that the right side of (15) holds we show that

(xy − x²y − x²z − xy²) − (yz − y²z − xz² − yz²) < 0.

First, we rearrange the left-hand side as (x − z)[y − y² − xz − y(x + z)] to show that it is always negative. By the assumption P(a) > P(b) we have z > x, and thus (x − z) < 0; by equation (13) we have y(1 − y − x − z) − zx > 0, so the bracketed factor is positive and the product is negative, which concludes the (⟹) direction. The other direction (⟸) follows immediately by contraposition: assume that P(a | b) > P(a | ¬b), that P(b | a)/P(b | ¬a) > P(a | b)/P(a | ¬b), and that P(b) ≥ P(a). We distinguish two cases: 1. P(b) = P(a); then P(b | a)/P(b | ¬a) = P(a | b)/P(a | ¬b).
2. P(b) > P(a); then, since dependency is symmetric, P(a | b) > P(a | ¬b) implies P(b | a) > P(b | ¬a), and by the (⟹) direction of the proposition (with the roles of a and b exchanged) it follows that P(b | a)/P(b | ¬a) < P(a | b)/P(a | ¬b).
In both cases we have a contradiction. This completes the proof.
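As a quick numerical sanity check (not part of the original proofs), one can sample random joint distributions satisfying the hypotheses, temporal priority P(a) > P(b) together with dependency P(a, b) > P(a)P(b), and verify the claimed ratio inequality. A minimal Python sketch, with x = P(¬a, b), y = P(a, b), z = P(a, ¬b) as in the proof:

```python
import random

def pr_ratios(x, y, z):
    """x = P(not-a, b), y = P(a, b), z = P(a, not-b).
    Returns P(b|a)/P(b|not-a) and P(a|b)/P(a|not-b)."""
    pa, pb = y + z, x + y
    return (y / pa) / (x / (1 - pa)), (y / pb) / (z / (1 - pb))

random.seed(0)
checked = 0
while checked < 10_000:
    x, y, z = (random.uniform(0.01, 0.3) for _ in range(3))
    pa, pb = y + z, x + y
    # keep only joints satisfying the hypotheses: temporal priority
    # P(a) > P(b) (i.e., z > x) and dependency P(a, b) > P(a)P(b)
    if not (pa > pb and y > pa * pb):
        continue
    lhs, rhs = pr_ratios(x, y, z)
    assert lhs > rhs  # P(b|a)/P(b|not-a) > P(a|b)/P(a|not-b)
    checked += 1
print("verified on", checked, "random joint distributions")
```

The sampling ranges are illustrative; any strictly positive x, y, z with x + y + z < 1 satisfying the two hypotheses would do.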

Proof of Proposition 4 (Monotonic normalization).
Proof. We prove the forward direction (⟹); the converse follows by a similar argument. Assume then that the left-hand side of the implication holds. By (17), two equivalent inequalities hold, and hence the right-hand side of the implication holds as well.
Proof of Proposition 5 (Coherence in dependency and temporal priority).
Proof. The proof for dependency follows from Proposition 1 and its implication; moreover, since dependency is symmetric by definition, the proof for temporal priority follows directly from Proposition 2. The properties outlined so far are sketched informally in the following diagram.

[Diagram omitted: informal sketch of the properties above, centered on temporal priority.]

Proof of Theorem 1 (Independent progressions).
Proof. For any a_i ∈ G*, it holds that P(a_i) > P(b) and m_{a_i→b} > 0, since each a_i is prima facie for b; moreover, a* → b is the edge selected by CAPRESE, being the max{·} over G*. Thus, CAPRESE selects ⇧ → b instead of a* → b if, for any a_i, the inequality of the independent progressions filter holds.
With some algebraic manipulations we rewrite this condition, which gives the inequality in the theorem statement.

Proof of Theorem 2 (Algorithm correctness).
Proof. It is clear that CAPRESE does not create disconnected components since, to each node in G, a unique parent is attached (either from G or ⇧). For the same reason, no transitive connections can appear.
The absence of cycles results from Propositions 3, 4 and 5. Indeed, suppose for contradiction that there is a cycle (a_1, a_2), (a_2, a_3), . . . , (a_n, a_1) in E; then by the three propositions we have P(a_1) > P(a_2) > . . . > P(a_n) > P(a_1), which is a contradiction.

Proof of Theorem 3 (Asymptotic convergence).
Proof. For any u ∈ G, when s → ∞ the observed probability P(u) (evaluated from D) is equivalent to the product of the probabilities (in T) obtained by traversing the forest from the root ⇧ to u (Definition 1). Thus, P(u) ∈ (0, 1), since the traversal probabilities are in (0, 1) too; hence all events are distinguishable and Algorithm 1 reconstructs a tree with the same event set G as T.
We now observe that the distribution induced by T (Definition 1) respects a single-cause prima facie topology, where each event is assigned at most a single cause. In other words, Definition 2 holds for any edge (u, v) ∈ E: • by the event-persistence property usually assumed in cancer (fixating mutations are present in the progeny of a clone), the occurring times satisfy t_u < t_v, which, in a frequentist sense, implies P(u) > P(v); • it holds by construction (Definition 1) that P(v, u) = P(v), thus P(v | u) = P(v)/P(u), which is strictly positive since P(v) and P(u) are; and that P(v, ¬u) = 0, thus P(v | ¬u) = 0.
To correctly reconstruct T we rely on the fact that our score m_{u→v} is consistent with prima facie probabilistic causation because of: • Proposition 3, which states that probability raising (embedded as α_{u→v} in m) subsumes a good temporal priority model of occurring times, as stated above; • Propositions 4 and 5, which ensure the monotonicity and sign coherency between α_{u→v} and β_{u→v} in m. Thus, m is consistent with a single-cause prima facie topology. We now show that Algorithm 1 correctly reconstructs a generic edge in E, and hence also T.
Consider an event v ∈ G and the edge (u, v) ∈ E. The set of its "candidate" parent events is G \ {v}; we partition it into three disjoint sets G, S and N: • G, genuine: all the events in G \ {v} that are backward-reachable from v; • S, spurious (or ambiguous): all the events (but v) in the sub-forest which includes the path from ⇧ to v, which are not in G; • N, non prima facie: all other events. This way of partitioning events according to the structure of T subsumes an equivalent partitioning based on the score α ∈ [−1, 1], which we use to prove the correctness of our algorithm. We now show that CAPRESE correctly selects u ∈ G: • a non prima facie event x ∈ N either satisfies the hypothesis of Proposition 1, and thus m_{x→v} < 0, or it is a descendant of v, which means that P(x) < P(v); by construction, CAPRESE considers as candidate parents of v only non-descendant events with positive score (see step 3); • a spurious event x ∈ S is prima facie for v, but α_{x→v} < 1 since P(x)P(v) < P(v, x) < P(v) (otherwise x would be backward-reachable from v and thus in G), hence P(v | ¬x) > 0; CAPRESE will thus not select any of these events as the cause of v if there exists an event with α_{x→v} = 1, which is actually the case for genuine causes; • genuine causes are the real cause u of v, plus all the transitively backward-reachable events. Any such x has maximum score α_{x→v} = 1 since P(x)P(v) < P(v, x) = P(v) and P(v, ¬x) = 0; by the latter, P(v | ¬x) = 0, which implies α_{x→v} = 1.
Thus, CAPRESE will pick an event from G, and not from S. We need to show that u is the event with maximum score.
Enumerate the events in G as g_1 (which is u), . . . , g_k in such a way that P(g_1) < . . . < P(g_k); recall that this is a total ordering induced by temporal priority, and that it is consistent with the coefficient β, so that g_1, the event closest in time to v, maximizes β. This event, namely u, is chosen by the algorithm as the real cause of v.
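As an illustration of the partition above, the following sketch (toy tree and edge probabilities are ours, not from the paper) enumerates the exact distribution induced by a small tree ⇧ → a → b → c with a sibling edge a → d, and evaluates the raw probability-raising score α_{u→v} = (P(v | u) − P(v | ¬u))/(P(v | u) + P(v | ¬u)): the genuine ancestors a, b of c score exactly 1, while the spurious sibling d scores strictly below 1.

```python
from itertools import product

# Toy tree (edge probabilities are illustrative):
# root -> a (0.8), a -> b (0.7), b -> c (0.6), a -> d (0.5).
# An event occurs iff its father occurs and its incoming edge "fires".
edges = {"a": (None, 0.8), "b": ("a", 0.7), "c": ("b", 0.6), "d": ("a", 0.5)}

def joint():
    """Exact joint distribution over occurrence patterns, obtained by
    enumerating all edge-firing outcomes."""
    names = list(edges)  # fathers precede children in insertion order
    dist = {}
    for fires in product([False, True], repeat=len(names)):
        p = 1.0
        for name, fired in zip(names, fires):
            q = edges[name][1]
            p *= q if fired else 1 - q
        occurred = set()
        for name, fired in zip(names, fires):
            father = edges[name][0]
            if fired and (father is None or father in occurred):
                occurred.add(name)
        key = frozenset(occurred)
        dist[key] = dist.get(key, 0.0) + p
    return dist

def alpha(u, v, dist):
    """Raw probability-raising score (P(v|u) - P(v|not u)) / (P(v|u) + P(v|not u))."""
    pu = sum(p for s, p in dist.items() if u in s)
    pv_u = sum(p for s, p in dist.items() if u in s and v in s) / pu
    pv_nu = sum(p for s, p in dist.items() if u not in s and v in s) / (1 - pu)
    return (pv_u - pv_nu) / (pv_u + pv_nu)

dist = joint()
# genuine ancestors of c score exactly 1; the spurious sibling d does not
print(round(alpha("b", "c", dist), 3),
      round(alpha("a", "c", dist), 3),
      round(alpha("d", "c", dist), 3))  # -> 1.0 1.0 0.2
```

Note that the score of the genuine causes is exactly 1 because every occurrence of c co-occurs with its ancestors, so P(c | ¬x) = 0 for x ∈ {a, b}.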
Finally, we show that the last step of the algorithm (the independent progressions filter, step 4) does not invalidate the edge (u, v). In fact, the algorithm would replace such an edge with (⇧, v) only if the filter inequality held for all nodes x backward-reachable from v (i.e., those in G ∪ S).
It thus suffices to show that the above inequality is violated by at least one of the backward-reachable nodes: we pick u ∈ G. We have that P(u) < 1, P(v) < 1 and, by construction, P(u | v) = 1, because all the instances of v co-occur with those of u (but not the converse). Thus the filter inequality is violated for u, which ensures that the edge (u, v) is maintained and concludes the proof.
Proof. As shown in [22], the uniform rates ε₊ and ε₋ affect the observed probabilities as follows:

P(i)* = P(i)(1 − ε₋) + (1 − P(i))ε₊ (18)

where π_{ij} = P(i) + P(j) − P(i, j). It is important to note (Lemma 1, [22]) that P(i) > P(j) ⟹ P(i)* > P(j)*, namely, uniform noise still implies temporal priority. Because of this, and since the raw estimator α is monotonic relative to temporal priority, all the derivations for Theorem 3 remain valid in this context, and the algorithm selects the correct genuine cause for each effect.
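A consequence of (18) worth making explicit: the map P(i) ↦ P(i)* is affine with slope 1 − ε₊ − ε₋, so it is strictly increasing whenever ε₊ + ε₋ < 1, and therefore preserves the ordering of marginal probabilities. A minimal sketch checking this numerically (the rate ranges are illustrative assumptions):

```python
import random

def observed(p, eps_plus, eps_minus):
    """Equation (18): observed marginal under uniform false positives
    (eps_plus) and false negatives (eps_minus)."""
    return p * (1 - eps_minus) + (1 - p) * eps_plus

# The map p -> observed(p) is affine with slope 1 - eps_plus - eps_minus,
# hence strictly increasing whenever eps_plus + eps_minus < 1: it
# preserves the ordering P(i) > P(j), i.e., temporal priority.
random.seed(1)
for _ in range(10_000):
    eps_plus, eps_minus = random.uniform(0, 0.4), random.uniform(0, 0.4)
    pi, pj = random.random(), random.random()
    if pi > pj:
        assert observed(pi, eps_plus, eps_minus) > observed(pj, eps_plus, eps_minus)
print("ordering preserved under uniform noise")
```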
To guarantee that no valid connection is broken by the independent progressions filter, we again rely on Szabo's result (Reconstruction Theorem 1, [22]). In particular, for any edge (u, v) correctly selected by our algorithm, since we implement Desper's filter (or, analogously, Szabo's) for independent progressions, we delete (u, v) by mistake only if their algorithms do so as well. Since this is not the case when ε₊ < √p_min (1 − ε₊ − ε₋), the proof is concluded.

Figure S1: Reconstruction with noisy synthetic data and λ → 0. The settings of the experiments are the same as those used in Figure 6, but in this case the estimator is shrunk by λ → 0, i.e., λ = 0.01. In the magnified image one can see that the performance of CAPRESE converges to Desper's already at ν ≈ 0.01, hence much faster than in the case of λ ≈ 1/2 (Fig. 6).

S2 Synthetic data generation
A set of random trees is generated to prepare synthetic tests. Let n be the number of considered events and let p_min = 0.05 = 1 − p_max; a single tree with maximum depth log(n) is generated as follows:

1: pick an event r ∈ G as the tree root;
2: assign to each event but r an integer value in [2, log(n)] representing its depth in the tree, ensuring that at least one event is assigned to each level (0 is reserved for ⇧, 1 for r);
3: for all events e ≠ r do
4:    let l be the level assigned to e;
5:    assign a father to e, selecting an event among those assigned to level l − 1;
6:    add the selected pair to the set of edges E;
7: end for
8: for all edges (i, j) ∈ E do
9:    assign α((i, j)) a random value in [p_min, p_max];
10: end for
11: return the generated tree;

When a forest is to be generated, we repeat the above algorithm to create its constituent trees. These trees (or forests) are, in turn, used to sample the input matrix for the reconstruction algorithms, with the parameters described in the main text.
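The generator above can be sketched in Python as follows; the exact level-assignment policy and the integer rounding of log(n) are our assumptions, and all names are hypothetical:

```python
import math
import random

def random_progression_tree(n, p_min=0.05, p_max=0.95, seed=None):
    """Sketch of the generator above (our interpretation): events are
    0..n-1, the root sits at level 1, the remaining events at levels
    2..max(2, round(log n)), each non-root event gets a father one level
    above, and each edge gets a probability in [p_min, p_max]."""
    rng = random.Random(seed)
    depth = max(2, round(math.log(n)))
    events = list(range(n))
    rng.shuffle(events)
    root, rest = events[0], events[1:]
    # assign levels, ensuring every level 2..depth hosts at least one event
    levels = {root: 1}
    for i, e in enumerate(rest):
        levels[e] = 2 + i if i < depth - 1 else rng.randint(2, depth)
    by_level = {}
    for e, l in levels.items():
        by_level.setdefault(l, []).append(e)
    edges = {}
    for e in rest:
        father = rng.choice(by_level[levels[e] - 1])  # father one level up
        edges[(father, e)] = rng.uniform(p_min, p_max)
    return root, edges

root, edges = random_progression_tree(10, seed=42)
print(len(edges))  # one incoming edge per non-root event -> 9
```

A forest is obtained by calling the generator repeatedly, as described in the text.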

S3 Further results
We show here the results of the experiments discussed but not presented in the main text.
Reconstruction of noisy synthetic data with λ → 0. Although we know that λ → 0 is not the optimal value of the shrinkage-like coefficient for noisy data, we show in Figure S1 the analogue of Figure 6 when the estimator is shrunk by λ → 0, i.e., λ = 0.01. Compared to Figure 6, it is clear that the best performance of CAPRESE is obtained with λ ≈ 1/2, as suggested by Figure 2.

Figure S2: Reconstruction with CAPRESE compared to oncotrees and h-CBN with noisy synthetic data. Performance of CAPRESE compared to oncotrees and h-CBNs as a function of the number of samples and of the noise ν. The parameter used for CAPRESE is λ = 1/2, and the reconstructed topologies contain 10 nodes each.
Comparison with hidden Conjunctive Bayesian Networks (h-CBNs). We here compare the performance of CAPRESE to hidden Conjunctive Bayesian Networks (h-CBNs) [15], as well as to oncotrees. The settings of this experiment differ slightly from those of the previous analyses: we used 100 distinct random trees of 10 events each, and we ranged the number of samples available for reconstruction from 50 to 200, with a step size of 50. The settings used for running h-CBNs are relatively standard: we allowed 50 annealing iterations with an initial temperature equal to 1. Since h-CBNs reconstruct DAGs, their performance cannot be quantified with the Tree Edit Distance, as we did in the comparison with oncotrees. Instead, we here adopt the Hamming distance (computed on the adjacency matrix of the connections), a closely related and computationally feasible alternative for measuring performance [33].
The results of the experiment, shown in Figure S2, indicate that CAPRESE clearly outperforms h-CBNs. In particular, for all the analyzed values of noise and sample size, both CAPRESE and oncotrees display an (average) Hamming distance between the reconstructed model and the original tree topology that is significantly lower than that of h-CBNs, with the largest differences observed in the noise-free case. This result points at a much faster convergence of CAPRESE with respect to the number of samples, also in the presence of moderate levels of noise.
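As a concrete reading of the metric used here, the Hamming distance between two reconstructions is simply the number of directed node pairs on which their adjacency matrices disagree; a minimal sketch with hypothetical edge sets:

```python
def hamming_distance(edges_a, edges_b, nodes):
    """Number of directed node pairs on which the two edge sets disagree,
    i.e., the Hamming distance between their adjacency matrices."""
    return sum(
        ((u, v) in edges_a) != ((u, v) in edges_b)
        for u in nodes for v in nodes
    )

true_edges = {("a", "b"), ("b", "c")}
inferred   = {("a", "b"), ("a", "c")}
print(hamming_distance(true_edges, inferred, ["a", "b", "c"]))  # -> 2
```

Unlike the Tree Edit Distance, this metric is well defined for DAGs, which is what makes the comparison with h-CBNs feasible.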
A few remarks are warranted about this experiment. First, in contrast to the comparison with oncotrees, we ran each experiment exactly once, rather than averaging the results over 10 repetitions, and on relatively smaller trees. These limitations are a consequence of the extremely high time complexity of the simulated annealing step of h-CBNs. However, the difference in performance between CAPRESE and h-CBNs is so large that we do not expect this to have a significant impact. Second, the results obtained by h-CBNs are perhaps worse than expected based on the noise-free results presented in [24], which were however based on a unique tree topology. This outcome may have been influenced by the estimation procedure for the noise parameter in h-CBNs, by the adopted annealing procedure, or by the number of iterations used. In future work we plan to extend our algorithm to extract more general topologies and to compare both methods in greater detail.
Inference of models with multiple conjunctive parents. CAPRESE is specifically tailored to reconstruct models with independent progressions and a unique cause for each event (i.e., trees or forests), while other approaches, such as CBNs, can reconstruct models where multiple conjunctive parents co-occur to cause an effect (i.e., a ∧ b causes c). It is thus reasonable to use such conjunctive approaches, rather than CAPRESE, to infer more complex models.

Figure S3: Performance of CAPRESE in reconstructing models with conjunctive parents from noisy data. Performance of CAPRESE measured in terms of the number of false positives/negatives in the reconstructed model, when data are generated from directed acyclic graphs with 10 nodes where each event is caused by at most 3 conjunctive events (randomly assigned). The parameter λ is set to 1/2.
However, it is interesting to assess CAPRESE's performance when (synthetic) data are sampled from a model with multiple parents and noise. By sampling input data from random directed acyclic graphs with 10 nodes, where each event is caused by at most 3 conjunctive events (randomly assigned), we assess the number of false positives and false negatives in the model reconstructed by CAPRESE. The results are shown in Figure S3 and indicate that, as the sample size increases, the number of false positives approaches 0: for a sufficiently large number of samples, all the causal claims returned by CAPRESE are true. In addition, the number of false negatives is always higher and proportional to the connectivity of the target model. This is to be expected, since CAPRESE assigns at most one parent (the cause) to every node.
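The false-positive/false-negative counts used in this last experiment reduce to set differences between the true DAG's edge set and the reconstructed tree's edge set; a minimal sketch (the edge sets are illustrative):

```python
def fp_fn(true_edges, inferred_edges):
    """False positives: inferred edges absent from the true model.
    False negatives: true edges the reconstruction missed."""
    return len(inferred_edges - true_edges), len(true_edges - inferred_edges)

# Toy conjunctive model: c has two parents, but a tree keeps at most one.
true_dag = {("a", "c"), ("b", "c"), ("a", "b")}
tree     = {("a", "b"), ("b", "c")}
print(fp_fn(true_dag, tree))  # -> (0, 1): no false positives, one missed parent
```

The toy example also illustrates why false negatives grow with the connectivity of the target model: each node with k conjunctive parents contributes at least k − 1 missed edges to any tree reconstruction.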