phyC: Clustering cancer evolutionary trees

Multi-regional sequencing provides new opportunities to investigate genetic heterogeneity within or between common tumors from an evolutionary perspective. Several state-of-the-art methods have been proposed for reconstructing cancer evolutionary trees based on multi-regional sequencing data to develop models of cancer evolution. However, there have been few studies comparing sets of cancer evolutionary trees. We propose a clustering method (phyC) for cancer evolutionary trees, in which sub-groups of the trees are identified based on topology and edge length attributes. For interpretation, we also propose a method for evaluating the sub-clonal diversity of the trees in the clusters, which provides insight into the acceleration of sub-clonal expansion. Simulations showed that the proposed method can detect true clusters with sufficient accuracy. Application of the method to actual multi-regional sequencing data of clear cell renal cell carcinoma (ccRCC) and non-small cell lung cancer (NSCLC) allowed for the detection of clusters related to cancer type or phenotype. phyC is implemented with R (≥3.2.2) and is available from https://github.com/ymatts/phyC.

This two-parameter distribution can alternatively be parametrised in terms of the mean µ and the size parameter s, also called the dispersion parameter, via

P(K = k) = Γ(k + s) / (k! Γ(s)) · (s / (s + µ))^s · (µ / (s + µ))^k, k = 0, 1, 2, . . . ,

where the variance is σ² = µ + µ²/s. We denote K ∼ NB(µ, s) if the random variable K follows this negative binomial distribution.
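As a concrete check of this parametrisation (a minimal sketch: NumPy's negative_binomial uses an (n, p) convention counting failures before n successes, so the mean/size form NB(µ, s) must be mapped onto it via n = s and p = s / (s + µ)):

```python
import numpy as np

def rnbinom(mu, s, size, rng):
    """Draw from NB(mu, s): mean mu, variance mu + mu**2 / s.

    NumPy's negative_binomial(n, p) counts failures before n successes;
    setting n = s and p = s / (s + mu) recovers the mean/size form.
    """
    return rng.negative_binomial(s, s / (s + mu), size=size)

rng = np.random.default_rng(1)
mu, s = 10.0, 2.0
k = rnbinom(mu, s, 200_000, rng)
# Empirical moments should be close to mu = 10 and mu + mu**2 / s = 60.
print(k.mean(), k.var())
```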

S-1-2 Details of simulation I
We generated random trees by generating random tree topologies as follows. Let E = {e_i; i = 1, 2, . . . , m} be the edge set of a tree of any of the five types defined in Figure 2. We generated random trees by editing the edges in E. The editing operation, denoted by a function OP(E, N), where E is the edge set and N is the number of operations per tree, randomly chooses among "add" (add an edge), "delete" (delete an edge), and "none" (do nothing). We controlled the topological variance through N: a larger N results in a larger topological variance. We generated 10 trees for each class of tree topology and examined topological variances N = 2, 3, 4, 5. In Figure 3, the variance index corresponds to 1/N.
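The editing operation OP(E, N) can be sketched as follows. This is a toy illustration only: edges are stored as parent-child pairs, and candidate edges for addition are drawn from a fixed pool, whereas the actual simulation constrains edits to valid tree topologies.

```python
import random

def OP(edges, N, pool, rng):
    """Apply N random edit operations ("add", "delete", or "none") to an edge set."""
    edges = set(edges)
    for _ in range(N):
        op = rng.choice(["add", "delete", "none"])
        if op == "add" and pool - edges:
            edges.add(rng.choice(sorted(pool - edges)))   # add a new edge from the pool
        elif op == "delete" and edges:
            edges.discard(rng.choice(sorted(edges)))      # delete an existing edge
        # "none" leaves the tree unchanged
    return edges

rng = random.Random(0)
base = {("r", "a"), ("a", "b"), ("a", "c")}          # a small base topology
pool = base | {("r", "d"), ("c", "e"), ("b", "f")}   # edges that may be added
trees = [OP(base, N=3, pool=pool, rng=rng) for _ in range(10)]  # 10 trees per class
```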

S-1-3 Details of simulation II & III
We generated random trees by generating random edge lengths as follows. Let l = (l_1, l_2, . . . , l_m) be the vector of edge lengths of a tree of any of the three types of simulation II or the nine types of simulation III in Figure 2. We generated random trees by drawing each edge length independently from a negative binomial distribution: we fixed the size parameter s, set the mean parameter for each edge length l_i to µ_i = l_i (i = 1, 2, . . . , m), and generated a random tree with edge lengths (Z_1, Z_2, . . . , Z_m) via Z_i ∼ NB(µ_i, s). In this simulation, we generated 10 trees for each class and examined size parameters s = 1, 2, . . . , 10.
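Under this scheme, one tree's edge lengths can be drawn in a single vectorised call (a hedged sketch; the template lengths l below are illustrative, and NumPy's (n, p) convention is again reparametrised to the NB(µ_i, s) form):

```python
import numpy as np

def random_edge_lengths(l, s, rng):
    """Draw Z_i ~ NB(mu_i = l_i, s) independently for each edge."""
    l = np.asarray(l, dtype=float)
    return rng.negative_binomial(s, s / (s + l))  # broadcasts over the edge vector

rng = np.random.default_rng(2)
l = [5, 4, 3, 2, 1]                 # illustrative template edge lengths for one class
for s in range(1, 11):              # the size parameters examined, s = 1, ..., 10
    trees = [random_edge_lengths(l, s, rng) for _ in range(10)]  # 10 trees per class
```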

S-1-4 External clustering validation indices
Purity (PR) measures the accuracy of cluster assignments against the correct classes, with each cluster labelled by its majority class. Let {ω_k; k = 1, 2, . . . , n} be the set of clusters and {c_j; j = 1, 2, . . . , m} the set of classes; then

PR = (1/N) Σ_k max_j |ω_k ∩ c_j|,

where N is the number of objects. PR lies in [0, 1], and higher PR is considered good. However, since PR does not take the number of clusters into account, a high PR is easy to achieve when the number of clusters is large; e.g., PR is 1 if each object gets its own cluster. PR also does not consider the false positive rate. The two indices below incorporate the false positive rate or the number of clusters into their evaluation. The Rand index (RI) measures the accuracy of the clustering assignments in terms of pairs of objects: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). If a pair of similar objects is assigned to the same cluster, the pair is counted as a TP; conversely, if a pair of dissimilar objects is assigned to different clusters, the pair is counted as a TN. FP and FN count the pairs of dissimilar and similar objects that are assigned to the same and to different clusters, respectively. RI is defined as

RI = (TP + TN) / (TP + TN + FP + FN).

RI lies in [0, 1], and a high RI means high true positive and true negative rates. Normalized mutual information (NMI) measures cluster assignment accuracy through mutual information entropy,

NMI = I(Ω; C) / [(H(Ω) + H(C)) / 2],

where I(Ω; C) is the mutual information between the cluster assignments Ω and the classes C, and H(·) denotes entropy. NMI lies in [0, 1], and a high NMI means that subgroups of objects form meaningful, non-random clusters.
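All three indices can be computed directly from the cluster and class label vectors. A self-contained sketch of the standard definitions (not the paper's own code):

```python
from collections import Counter
from itertools import combinations
from math import log

def purity(pred, true):
    # For each predicted cluster, count its most common true class.
    clusters = {}
    for p, t in zip(pred, true):
        clusters.setdefault(p, []).append(t)
    return sum(Counter(m).most_common(1)[0][1] for m in clusters.values()) / len(pred)

def rand_index(pred, true):
    # RI = (TP + TN) / (total number of pairs): a pair counts when the two
    # labelings agree on whether its members belong together.
    agree = sum((pred[i] == pred[j]) == (true[i] == true[j])
                for i, j in combinations(range(len(pred)), 2))
    n_pairs = len(pred) * (len(pred) - 1) // 2
    return agree / n_pairs

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def nmi(pred, true):
    # Mutual information normalised by the average of the two entropies.
    n = len(pred)
    joint = Counter(zip(pred, true))
    pc, tc = Counter(pred), Counter(true)
    mi = sum(c / n * log(n * c / (pc[p] * tc[t])) for (p, t), c in joint.items())
    return mi / ((entropy(pred) + entropy(true)) / 2)
```

For a perfect clustering all three indices equal 1; splitting one class across clusters lowers RI and NMI while purity degrades more slowly.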

S-2 Sampling effects on the clustering results
We examined how the clustering results of phyC depend on sampling effects as follows. We downsampled the number of samples in all patients to the same value (the minimum number of samples among patients), and the cancer evolutionary trees based on the downsampled VAF profiles were clustered by phyC. In the downsampling, we randomly chose the subsamples from each patient. To account for this randomness, we generated 100 sets of downsampled VAF profiles, constructed the evolutionary trees from each set, and took the median over the resulting tree distances.
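The resampling scheme can be sketched as follows. This is a toy stand-in: random matrices play the role of VAF profiles, and a placeholder distance replaces the tree distance used by phyC.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: each "patient" is a VAF matrix (mutations x samples).
patients = [rng.random((50, k)) for k in (3, 5, 7)]
min_samples = min(p.shape[1] for p in patients)   # downsample everyone to this

def dist(a, b):
    # Placeholder distance between two downsampled VAF profiles;
    # phyC would instead compare the evolutionary trees built from them.
    return abs(a.mean() - b.mean())

reps = []
for _ in range(100):   # 100 random downsampling draws
    sub = [p[:, rng.choice(p.shape[1], min_samples, replace=False)] for p in patients]
    d = np.array([[dist(a, b) for b in sub] for a in sub])
    reps.append(d)

# Element-wise median of the pairwise distances over the 100 draws.
median_dist = np.median(np.stack(reps), axis=0)
```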
We made contingency tables whose rows and columns represent the cluster assignments without and with downsampling, respectively (Tables S4, S5, and S6).
The cluster assignments were identical without and with downsampling in the ccRCC and NSCLC datasets, respectively (Tables S4 and S5). In the combined ccRCC&NSCLC dataset, only ccRCC-EV005 was classified into a different cluster compared with the assignment without downsampling (Table S6). There is a long edge in a branch of ccRCC-EV005, which is considered to account for its cluster assignment without downsampling, compared with the assignment with downsampling.
Our method captures both tree topology and tree shape through the edge length vector, as shown in equation (2) of the manuscript; this indicates that phyC tends to emphasize differences in edge length rather than in tree topology, as implied by simulations II and III in the manuscript. In actual data, the number of samples may lead to more complex tree topologies rather than to changes in edge length, which is considered to be why there was no large difference in the clustering results.

S-3 Supplementary tables and figures

Step 1. Mapping starts with the subtrees having the maximum depth, i.e., the subtree of node A including node B (d = 3) (①-③).
Step 2. Within the subtrees of step 1, mapping is performed from the edge with the maximum total edge length, i.e., the edge including node C (l = 5 + 4 + 3 = 12) (①).
Step 4. Repeat steps 1-3 for the subtrees with the next maximum depth, i.e., the next subtree including node E (d = 2) (④).