## Abstract

We present methods to construct phylogenetic models of tumor progression at the cellular level that include copy number changes at the scale of single genes, entire chromosomes, and the whole genome. The methods are designed for data collected by fluorescence *in situ* hybridization (FISH), an experimental technique especially well suited to characterizing intratumor heterogeneity using counts of probes to genetic regions frequently gained or lost in tumor development. Here, we develop new provably optimal methods for computing an edit distance between the copy number states of two cells given evolution by copy number changes of single probes, all probes on a chromosome, or all probes in the genome. We then apply this theory to develop a practical heuristic algorithm, implemented in publicly available software, for inferring tumor phylogenies on data from potentially hundreds of single cells by this evolutionary model. We demonstrate and validate the methods on simulated data and published FISH data from cervical cancers and breast cancers. Our computational experiments show that the new model and algorithm lead to more parsimonious trees than prior methods for single-tumor phylogenetics and to improved performance on various classification tasks, such as distinguishing primary tumors from metastases obtained from the same patient population.

## Author Summary

Cancer is an evolutionary system whose growth and development are attributed to aberrations in well-known genes and to cancer-type specific genomic imbalances. Here, we present methods for reconstructing the evolution of individual tumors based on cell-to-cell variations between copy numbers of targeted regions of the genome. The methods are designed to work with fluorescence *in situ* hybridization (FISH), a technique that allows one to profile copy number changes in potentially thousands of single cells per study. Our work advances the prior art by developing theory and practical algorithms for building evolutionary trees of single tumors that can model gain or loss of genetic regions at the scale of single genes, whole chromosomes, or the entire genome, all common events in tumor evolution. We apply these methods to simulated and real tumor data to demonstrate substantial improvements in tree-building accuracy and in our ability to accurately classify tumors from their inferred evolutionary models. The newly developed algorithms have been released through our publicly available software, FISHtrees.

**Citation:** Chowdhury SA, Shackney SE, Heselmeyer-Haddad K, Ried T, Schäffer AA, Schwartz R (2014) Algorithms to Model Single Gene, Single Chromosome, and Whole Genome Copy Number Changes Jointly in Tumor Phylogenetics. PLoS Comput Biol 10(7): e1003740. doi:10.1371/journal.pcbi.1003740

**Editor:** Sergei L. Kosakovsky Pond, University of California San Diego, United States of America

**Received:** February 2, 2014; **Accepted:** June 4, 2014; **Published:** July 31, 2014

This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

**Funding:** This research was supported in part by the Intramural Research Program of the U.S. National Institutes of Health, National Cancer Institute, and National Library of Medicine, and by U.S. National Institutes of Health grants 1R01CA140214 (RS and SAC) and 1R01AI076318 (RS). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests:** Dr Shackney is an employee of Intelligent Oncotherapeutics. All other authors declare that no competing interests exist.

## Introduction

In this paper, we develop new methods to advance the theory of phylogenetic inference for reconstructing evolutionary histories of cell populations in solid tumors. The work is specifically designed for use in tracking tumor evolution by gain and loss of genomic regions as assessed by multicolor fluorescence *in situ* hybridization (FISH), which measures the copy numbers of targeted genes and chromosomes in potentially hundreds of individual cells of a tumor. This technology was the basis of the earliest methods for phylogenetic reconstruction of single tumors [1], [2]. FISH remains uniquely valuable for such studies because the large number of cells that FISH can profile makes it possible to collect data on enough tumors in enough detail to build cell-by-cell phylogenies for populations of tumors and begin to study the common features of these phylogenies. In the present work, we specifically extend our previously developed inference algorithms to encompass a more complicated but realistic model of evolution of FISH probe counts, accounting for gain and loss of genetic material at the level of single gene probes, multiple probes on a single chromosome, or a probe set distributed across the whole genome. We demonstrate the value of these algorithmic improvements to more accurate phylogenetic inference and improved effectiveness of the resulting phylogenies in downstream prediction tasks.

The present work adds to the growing list of phylogenetic methods in cancer modeling, which were reviewed through 2008 in [3]. These include methods for analyzing comparative genomic hybridization (CGH) or other genetic gain/loss data in a single tumor type [4]–[11], for defining the cell type lineage of single tumors [1], [2], [12], for organizing a taxonomy of tumor types [14], for reconstructing a partial order of genetic changes in multiple samples from one patient [15], and for reconstructing progression from cell types inferred from bulk genomic assays [16]. Recent high-throughput sequencing studies have also used ad hoc phylogenetic methods to infer putative tumor progression scenarios, e.g., [17]–[20]. Like many of these methods, the present work is aimed at building tree models that provide a proposed partial order on the observed cell states, a strategy motivated originally by the work of Fearon and Vogelstein, proposing a linear order for four types of events in colorectal cancer and associating each event with a tumor stage [21]. Other ordering methods have been proposed, mostly for CGH or breakpoint data [15], [22]–[28] and, more recently, sequencing data [29], [30].

The present work specifically advances the reconstruction of phylogenetic histories of single tumors from intratumor cellular heterogeneity data. The use of phylogenetic methods to reconstruct histories of single tumors was first developed in our prior work [1], [2] by taking advantage of the ability of FISH to profile genetic changes in large numbers of single cells, allowing one to survey hundreds of cells per tumor in populations of tens of tumors [31]. This early work showed that even small numbers of markers could reveal numerous genetically distinct cell populations in single tumors, which could be resolved by phylogenetic inference to reveal multiple distinct pathways of progression between tumors and even within single tumors. Numerous studies since then, using multicolor FISH [2], [31]–[36] and, more recently, single-cell sequencing [19], [37]–[39] have greatly increased our ability to identify distinct cell populations and, in the process, revealed far more extensive intratumor heterogeneity than had been suspected prior to 2010 (reviewed in [40]). The repeated observation of intratumor heterogeneity has necessitated a reconsideration of Nowell's [41] theory that tumors evolve clonally, showing that a tumor may contain many subpopulations relevant to the clinical prognosis of the patient [42] and that rare subpopulations may be more relevant to prognosis than the most common ones [43]. Furthermore, a simulation study has suggested that methods based on average copy number data perform poorly when there is substantial intratumor heterogeneity [44]. Such findings suggest a need for improved methods for organizing the dozens or hundreds of observed cell states in single tumors to infer the evolutionary processes that produced them.

Despite extensive work on tumor phylogenetics, however, the study of algorithms for reconstructing tumor evolution from large numbers of single cells has lagged far behind advances in data generation. The standard in practice for single-cell tumor phylogenetics remains the use of simple generic phylogeny algorithms (e.g., neighbor-joining [45]) that are not designed to model the patterns of copy number changes one would expect from the chromosome abnormalities that largely drive tumor evolution. Until recently, algorithms designed specifically for inferring phylogenies of single tumors from FISH data have been limited to just a few probes per cell and lacked robust, publicly available software implementations [1], [2], [34]. In prior work [46], we developed algorithms to find copy-number phylogenies for, in principle, arbitrary numbers of probes and cells. That work, however, was itself limited to a simple model in which tumor cells evolve by gain or loss of a single copy of a single probe at each mutation step. In real tumors, gene copy numbers can change due to a variety of mechanisms, including:

- Single gene duplication/loss events (SD), in which one copy of a genetic region covered by a single probe is gained or lost.
- Chromosome duplication/loss events (CD), in which an entire chromosome, along with potentially several probes, is unequally distributed between daughter cells during mitosis.
- Whole genome duplication events (GD), in which a cell fails to divide during mitosis, leading to doubling of all genetic material and all probe counts.

These events are illustrated schematically in Figure 1. While more complex probabilistic models of tumor evolution have been developed for inference of small phylogenies, with approximately ten taxa per tumor corresponding to distinct biopsies (e.g., [47]), the class of inference algorithms such models require would not be expected to scale to phylogenies of hundreds of single cells per tumor such as those examined in the present work.

A copy number profile of four genes is shown as an ordered set for homologous chromosome pairs and respectively, where the gene located on the top position in the chromosome precedes the gene located on the bottom position in the ordering. After the (A) Single gene duplication event, the copy number of a gene located on gets increased by 1. After the (B) Single chromosome duplication event, the chromosome gets duplicated and the cell has one extra copy of that chromosome as chromosome . After the (C) Whole genome duplication event, all the chromosomes are duplicated and the total number of chromosomes in the daughter cell is twice the number of chromosomes in the mother cell.
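As a minimal sketch of the three event types acting on a vector of probe counts, the Python below applies an SD, CD, or GD event to a copy number profile. The function names, the tuple representation, and the rule that a CD event shifts every probe on the affected chromosome by one copy are illustrative assumptions, not the FISHtrees implementation.

```python
from typing import Optional, Sequence, Tuple

Profile = Tuple[int, ...]  # one copy number per FISH probe (assumed representation)


def sd_event(profile: Profile, probe: int, delta: int) -> Optional[Profile]:
    """Single gene duplication/loss: change one probe count by +1 or -1."""
    if delta not in (1, -1) or profile[probe] + delta < 0:
        return None  # invalid step: counts may not drop below zero
    counts = list(profile)
    counts[probe] += delta
    return tuple(counts)


def cd_event(profile: Profile, chromosome_probes: Sequence[int],
             delta: int) -> Optional[Profile]:
    """Chromosome duplication/loss: shift every probe on one chromosome
    by +1 or -1 (an assumption made for this sketch)."""
    if delta not in (1, -1):
        return None
    if any(profile[p] + delta < 0 for p in chromosome_probes):
        return None
    counts = list(profile)
    for p in chromosome_probes:
        counts[p] += delta
    return tuple(counts)


def gd_event(profile: Profile) -> Profile:
    """Whole genome duplication: double every probe count."""
    return tuple(2 * c for c in profile)


diploid = (2, 2, 2, 2)
print(sd_event(diploid, 0, +1))        # (3, 2, 2, 2)
print(cd_event(diploid, [2, 3], +1))   # (2, 2, 3, 3)
print(gd_event(diploid))               # (4, 4, 4, 4)
```

Here probes 2 and 3 are treated as sharing a chromosome purely for the sake of the example.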

The work presented here seeks to fill this need for scalable phylogenetic algorithms capable of fitting more realistic models of tumor-like evolution to data sets of hundreds of single cells per tumor. We improve on our prior work for inferring tumor evolutionary models considering only SD events [46] to now include CD and GD events, which are also frequently observed in tumor progression. We specifically focus on the problem of accurately inferring evolutionary distances between distinct cells in terms of maximum parsimony combinations of SD, CD, and GD events. The major contributions of the work are:

- algorithms to compute minimum evolutionary distances between pairs of cell states in terms of SD and CD events and in terms of SD, CD, and GD events;
- a heuristic Steiner tree method based on the median-joining method [48] and our prior work on SD-only inference [46];
- software implementation of the new methods to compute these distances and of their use to construct tumor progression trees;
- evaluation of the new methods on simulated data, which shows that they do better than the SD-only approach at recovering simulated tree topologies;
- application of the methods to published data on cervical cancer (CC, [49]) and breast cancer (BC, [36]);
- demonstration of improved ability to classify tumor types from phylogenetic features using a strategy in the spirit of the genomic progression scores (GPS) of Rahnenführer et al. [50].
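To make the notion of an event distance concrete, the following sketch finds the minimum number of SD, CD, and GD events that turn one small profile into another by exhaustive breadth-first search. This brute-force search is a stand-in for illustration only and is not the provably optimal algorithm developed in this work. The copy-number cap `max_copy`, the function names, and the CD rule (every probe on the chromosome changes by one) are assumptions.

```python
from collections import deque


def neighbors(profile, chromosomes, allow_cd=True, allow_gd=True, max_copy=8):
    """All profiles reachable from `profile` by one event."""
    out = []
    # SD: one probe gains or loses one copy.
    for i, c in enumerate(profile):
        for d in (1, -1):
            if 0 <= c + d <= max_copy:
                out.append(profile[:i] + (c + d,) + profile[i + 1:])
    # CD: all probes on one multi-probe chromosome gain or lose one copy.
    if allow_cd:
        for probes in chromosomes:
            for d in (1, -1):
                new = list(profile)
                for p in probes:
                    new[p] += d
                if all(0 <= c <= max_copy for c in new):
                    out.append(tuple(new))
    # GD: every probe count doubles.
    if allow_gd:
        doubled = tuple(2 * c for c in profile)
        if max(doubled) <= max_copy:
            out.append(doubled)
    return out


def event_distance(source, target, chromosomes, **kwargs):
    """Fewest events from source to target, or None if unreachable under the cap."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        state, dist = queue.popleft()
        for nxt in neighbors(state, chromosomes, **kwargs):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


src, dst, chroms = (2, 2, 2), (4, 4, 5), [(0, 1)]
print(event_distance(src, dst, chroms, allow_cd=False, allow_gd=False))  # 7
print(event_distance(src, dst, chroms, allow_gd=False))                  # 5
print(event_distance(src, dst, chroms))                                  # 2
```

The distance is directed because GD only doubles; the search space grows exponentially with the number of probes, which is why a dedicated algorithm is needed at realistic sizes.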

The new methods are implemented in version 2 of our software FISHtrees (ftp://ftp.ncbi.nlm.nih.gov/pub/FISHtrees). The work addresses a critical need in modern cancer research for algorithms capable of inferring evolutionary trajectories of hundreds of single cells per tumor under plausible models of evolution including both gene-specific and chromosome abnormalities that are central drivers of true tumor evolution.

## Results

We used data collected from cervical cancer (CC) [49] and breast cancer (BC) [36] patients to evaluate our methods. Figure 2(A) shows a tumor progression tree inferred from one of the cervical cancer samples. For comparison, Figure 2(B) shows a progression tree inferred on the same sample using our prior SD model [46]. Visual inspection shows that large regions of the two trees are identical but that allowing CD and GD events leads to some rearrangement and a reduction in tree depth and overall size. Next we evaluate the changes induced by adding CD and GD events to the SD-only model, using simulated data to show effectiveness of the methods in finding more parsimonious solutions to the broader model and using the real CC and BC data to show the biological relevance of the improvements. We further show that our algorithms infer trees with higher accuracy than the prevailing alternative algorithms for single-tumor phylogenetic inference. Finally, we perform statistical experiments to evaluate the effects of tumor sample size on the performance of our tree building algorithm.

Trees are built considering (A) all of SD, CD and GD and (B) only SD model of tumor evolution. Each node represents a configuration of the four gene probes *LAMP3*, *PROX1*, *PRKAA1* and *CCND1*. Nodes with solid and dotted borders represent cells present in the collected sample and inferred Steiner nodes respectively. Green and red edges model gene gain and gene loss, respectively. The weight value on each edge connecting two nodes and is the distance between the states of and , computed using the particular model of tumor progression under consideration. The weight on each node describes the fraction of cells in the sample with the particular copy number profile modeled by that node; Steiner nodes are assigned weight 0.

### Simulation experiments

To measure accuracy of the methods for FISH datasets with a known ground truth, we generated a dataset of trees with six probes, two of which were treated as being on the same chromosome. Each tree was generated by starting from a diploid root node and executing a branching process in which each node was recursively assigned a number of children drawn from a geometrically distributed random variable with mean . Each child was distinguished from its parent by selecting an SD, CD, or GD event with probability for each of the six possible SD events, of a CD event, and of a GD event. This process terminated when all leaf nodes had been assigned zero children by the sampling. We then generated simulated FISH data for each tree by uniformly sampling cells from the nodes in this topology. The simulated data corresponds to counts of probes for each sampled cell in the tree. We applied Algorithm 3 (see Methods) to find a minimum-cost tree for each of four event models: (i) SD only, (ii) SD and CD, (iii) SD and GD, and (iv) SD, CD and GD.
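The branching process described above can be sketched as follows. All numeric values (mean number of children, event probabilities, node cap, number of sampled cells) are hypothetical placeholders chosen so the example runs, and the equal chance of gain versus loss for SD and CD events is likewise an assumption rather than a parameter taken from the experiments.

```python
import random


def geometric_children(rng, mean):
    """Draw k >= 0 with P(k) = (1 - q) * q**k, whose mean is q / (1 - q)."""
    q = mean / (1.0 + mean)
    k = 0
    while rng.random() < q:
        k += 1
    return k


def mutate(rng, profile, kind, probe, chrom):
    """Apply one SD, CD, or GD event to a parent profile."""
    if kind == "GD":
        return tuple(2 * c for c in profile)
    targets = [probe] if kind == "SD" else list(chrom)
    delta = rng.choice((1, -1))          # assumed: gain and loss equally likely
    if any(profile[t] + delta < 0 for t in targets):
        delta = 1                        # never drop a count below zero
    counts = list(profile)
    for t in targets:
        counts[t] += delta
    return tuple(counts)


def simulate_tree(rng, n_probes=6, chrom=(0, 1), mean_children=0.9,
                  p_sd=0.12, p_cd=0.20, p_gd=0.08, max_nodes=200):
    """Grow a tree from a diploid root; returns (profiles, parent indices)."""
    events = [("SD", i) for i in range(n_probes)] + [("CD", None), ("GD", None)]
    weights = [p_sd] * n_probes + [p_cd, p_gd]
    nodes, parent, frontier = [tuple([2] * n_probes)], [-1], [0]
    while frontier:
        idx = frontier.pop()
        for _ in range(geometric_children(rng, mean_children)):
            if len(nodes) >= max_nodes:  # safety cap, not part of the described process
                break
            kind, probe = rng.choices(events, weights=weights)[0]
            nodes.append(mutate(rng, nodes[idx], kind, probe, chrom))
            parent.append(idx)
            frontier.append(len(nodes) - 1)
    return nodes, parent


def sample_cells(rng, nodes, n_cells):
    """Uniformly sample observed cells from the tree nodes."""
    return [rng.choice(nodes) for _ in range(n_cells)]


rng = random.Random(0)
nodes, parent = simulate_tree(rng)
cells = sample_cells(rng, nodes, 100)
```

The process stops on its own once every leaf draws zero children; the `max_nodes` cap only guards against very large trees when the mean number of children is close to or above one.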

We quantified the accuracy of tree inference by comparing each simulated true tree to its corresponding inferred tree derived from the sampled cells. This assessment was performed at the level of accuracy of tree edges by the following procedure:

- We pruned the real tree so as to remove any subtree for which no cell in the tree was sampled. This step was intended to avoid penalizing for “impossible” inferences of subtrees unsupported by any data.
- We computed a maximum matching of edges between the real subtree and the inferred tree, with each pair of edges weighted by the maximum number of nodes in agreement between the corresponding parts of the bipartitions that the two edges define [46], [51]. We used the Hungarian algorithm [52] for computing the maximum matching (applying the function “Hungarian” by Alexander Melin from the Matlab Central File Exchange).
- We calculated a reconstruction error of the inferred tree using the following formula:

where is the weight of the maximum matching, is set of taxa in common between the real and inferred trees, and and represent the sets of nontrivial bipartitions in the real and inferred trees, respectively.

Intuitively, this formula measures the fractional agreement between bipartitions of the trees relative to the total number of bipartitions. We use a matching-based formula, rather than the more familiar Robinson-Foulds metric [53], both because of its greater sensitivity to small changes in trees and because the Robinson-Foulds measure is not defined for trees with different node sets. We also note that we use a different normalization factor than in our prior work [46], normalizing essentially by the total number of edges between the two trees, to control properly for the fact that different inference methods may infer different numbers of tree edges. The reconstruction error ranges in value from , if the real and inferred trees are isomorphic, to an upper bound of in the limit of complete disagreement.

To illustrate the meanings of the terms of the equation for , we present a simple example using a hypothetical ground truth and an inferred tree presented in Figure 3(A) and Figure 3(B), respectively. The set of nontrivial bipartitions in the ground truth are

(A) A hypothetical simulated ground truth tree on the set of taxa . (B) Example inferred tree built on the sampled set of taxa on the dataset resulting from the ground truth tree.

and the nontrivial bipartitions in the inferred tree are

If we apply the matching algorithm on these two sets of bipartitions, the first and second bipartitions in the ground truth tree are matched with the first and second bipartitions in the inferred tree, respectively. The weight of the matching is . The number of common taxa between these two datasets is . The total number of nontrivial bipartitions in the real and inferred trees are and . Plugging these values into the equation for , we calculate .
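The matching-based comparison can be sketched in a few lines of Python. The sketch assumes the error takes the form 1 − 2W / (|T| (|B_r| + |B_i|)), which is one normalization consistent with the description above (zero for identical bipartition sets, normalized by the total number of bipartitions in both trees); the symbols W, T, B_r, and B_i are names chosen here. It also swaps the Hungarian algorithm for a brute-force search over assignments, which is only practical for very small trees.

```python
from itertools import permutations


def agreement(a, b, taxa):
    """Maximum number of taxa on which two bipartitions agree.

    Each bipartition is given by one side (a frozenset); the other side is
    its complement in `taxa`. Both orientations are tried.
    """
    same = len(a & b) + len((taxa - a) & (taxa - b))
    return max(same, len(taxa) - same)


def max_matching_weight(real, inferred, taxa):
    """Weight of a maximum-weight matching, found by brute force."""
    small, large = sorted((real, inferred), key=len)
    best = 0
    for perm in permutations(range(len(large)), len(small)):
        weight = sum(agreement(small[i], large[j], taxa)
                     for i, j in enumerate(perm))
        best = max(best, weight)
    return best


def reconstruction_error(real, inferred, taxa):
    """Assumed form: 1 - 2W / (|T| * (|B_r| + |B_i|))."""
    total = len(real) + len(inferred)
    if total == 0:
        return 0.0
    w = max_matching_weight(real, inferred, taxa)
    return 1.0 - 2.0 * w / (len(taxa) * total)


taxa = frozenset("abcdef")                       # hypothetical taxa
real = [frozenset("ab"), frozenset("abc")]       # hypothetical bipartitions
inferred = [frozenset("ab"), frozenset("abd")]
print(reconstruction_error(real, real, taxa))     # 0.0
print(reconstruction_error(real, inferred, taxa))
```

For trees of realistic size, a polynomial-time assignment solver would replace the permutation loop.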

A comparison of the four models is presented in Figure 4. The SD model showed reconstruction error with standard deviation (s.d.) of across the trees. The SD+CD model yielded error with s.d. . SD+GD yielded error with s.d. . The full SD+CD+GD model yielded error with s.d. . Collectively, the results suggest that one can reconstruct reasonably accurate trees even from the SD-only model, despite the fact that the trees were generated from a model of all three event types, although accuracy improves with each event type added. Accounting for GD events made a larger difference in accuracy than accounting for CD events, presumably because a missed GD event might require many SD or CD events to explain it, while a missed CD event could be explained with just two SD events. The reconstruction error for the full model is reduced by more than 1.7-fold relative to the SD-only model considered in our prior work.

Variants of our phylogenetic algorithms and two competing methods from the literature were applied to simulated FISH datasets describing evolution by combinations of single-gene (SD), chromosome (CD), and whole-genome (GD) duplication and loss events. Results are reported for inference by our methods from simulated trees, allowing for SD events alone, SD+CD events, SD+GD events, and SD+CD+GD events. We compared these results to inference by neighbor-joining (NJ) and pure maximum parsimony (MP) as implemented in MEGA, version 6. Accuracy is assessed by mean reconstruction error of bipartitions between true and inferred trees. Error bars show plus or minus one standard deviation across the samples for each method.

We further compared these results to those derived using generic phylogenetic methods that have been used in much of the single tumor phylogenetics work to date [16], [54]. We tested the accuracy of reconstruction of the simulated trees described above using generic neighbor joining (NJ) with Euclidean distance and pure maximum parsimony (MP) treating copy numbers as arbitrary characters, approaches chosen because they have been the primary alternatives to our specialized algorithms in the single-tumor phylogeny literature. We omit here comparison to more complicated Bayesian phylogenetic models (e.g., [47]) because such approaches are not scalable to the numbers of cells we examine. We then used the weighted matching based similarity method, described above, to calculate the mean percentage reconstruction error between the inferred and the ground truth trees. The mean reconstruction errors for NJ and MP were (s.d. ) and (s.d. ), respectively, in contrast to the error of (s.d. ) for the SD+CD+GD algorithm proposed here. The test thus demonstrates that when the underlying evolutionary process includes cancer-like chromosome abnormalities, errors are substantially reduced by using an algorithm designed for that model relative to standard off-the-shelf algorithms still widely used for single-tumor phylogenetics work.

We performed additional experiments to evaluate the effects of different evolutionary parameters on the accuracy of inference of tumor progression trees by FISHtrees. For this experiment, we selected five different combinations of probabilities of SD, CD and GD events for generating the ground truth trees and then used SD, SD+CD, SD+GD and SD+CD+GD models to infer the tumor phylogenies. These data sets again each used six probes with two of the six on a common chromosome. The selected five combinations of (SD,CD,GD) event probabilities are: , , , and . These combinations of event probabilities were chosen to yield trees of comparable complexity to the real data while producing test sets enriched in distinct combinations of the three event types. They thus allow us to consider how robust our algorithms are to contributions from each of the three event types, singly or in combination. We report the reconstruction error for trees for each of these combinations of event probabilities in Table 1. These results again show that accuracy improves with each event type added. When the probability of SD events is high (as in combination 3), the SD model results in highly accurate trees (mean reconstruction error of with s.d. ). Accounting for GD events in combination with SD events always results in a larger improvement in the reconstruction error than the SD+CD model does, even when CD events are very frequent (as in combinations 2 and 4). Finally, accounting for GD events in combination with SD and CD events results in the largest improvements when the probability ratio of GD events to SD+CD events is highest, as can be seen from comparison of parameter sets 1 and 2.

Next, we performed simulation tests to evaluate the effects of non-uniform distributions of cells across different levels of the trees on the performance of our tree inference method. In our initial simulation experiments described above, we assumed that observed cells were sampled uniformly across clones. In real tumors, the distribution of cells would not typically be uniform due to differences in age and fitness of clones. In order to test robustness of our method to non-uniformity of clone frequencies, we sampled the cells following a non-uniform model in which the sampling frequency of a clone varies geometrically with its depth in the tree with a parameter . We used values of and for in our experiments. When , of the total cells are located in the first three levels of the trees, while for , this fraction is . We generated trees in each case with probabilities of SD, CD and GD events fixed at and . We again used SD, SD+CD, SD+GD and SD+CD+GD models to infer the tumor progression trees. We present the results from this experiment in Table 2, where we also show the results from the uniform sampling of the cells. Additionally, we report the results on the trees inferred using NJ and MP for these three different cell distributions. From the table, we can see that the reconstruction error increases with increasing for all methods. The SD+CD+GD model, however, shows the best performance among all the models for all three values of and the least loss of performance with increasing .
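One way to realize depth-dependent sampling is sketched below. The weighting `(1 - p) ** depth` is an assumed parameterization of "varies geometrically with depth"; the exact form and parameter values used in the experiments are not specified here, and the function names and example tree are hypothetical.

```python
import random


def node_depths(parent):
    """Depth of each node, given parent indices with parent[i] < i and -1 for the root."""
    depth = []
    for p in parent:
        depth.append(0 if p < 0 else depth[p] + 1)
    return depth


def sample_by_depth(parent, n_cells, p, rng):
    """Sample node indices with probability proportional to (1 - p) ** depth (assumed form)."""
    weights = [(1.0 - p) ** d for d in node_depths(parent)]
    return rng.choices(range(len(parent)), weights=weights, k=n_cells)


parent = [-1, 0, 0, 1, 3]            # small hypothetical tree
rng = random.Random(0)
uniform = sample_by_depth(parent, 1000, 0.0, rng)   # p = 0 recovers uniform sampling
skewed = sample_by_depth(parent, 1000, 0.5, rng)    # shallower clones sampled more often
```

Under this parameterization, `p = 0` gives every clone equal weight and larger `p` concentrates sampled cells near the root.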

Finally, we performed simulation experiments to understand the effects of varying the numbers of chromosomes with multiple probes. We created a simulated dataset of trees with eight probes where two pairs of probes each reside on two different chromosomes and the remaining four probes reside on four separate chromosomes. The probabilities of each of the SD, CD and GD events were fixed at , and , respectively. We report the results from this experiment in Table 3, which compares the results from this experiment with our earlier result using only a single chromosome with two probes and four other probes located on separate chromosomes. The table shows that inclusion of the extra possible CD event results in higher accuracy for all the models except for the SD-only model. The performance drop in the SD model is expected, as it would require more SD events to explain a greater number of missed CD events. The highest gain in performance is observed for the SD+CD+GD model. These results show that our algorithm will tend to yield comparatively more advantage over the earlier work with more complicated scenarios of sharing probes across chromosomes, suggesting its utility will increase as improvements in technology allow for larger probe sets.

### Application to real cervical and breast cancer data

We applied the algorithm to two sets of real data:

- A set of CC [49] FISH data consisting of samples organized into primary samples of metastatic patients, paired metastasis samples from the same patients, and primary samples from patients who did not progress to metastasis. Each sample consisted of cells profiled on four FISH probes: *LAMP3* (Entrez Gene Id 27074) [55], *PROX1* (5629) [56], *PRKAA1* (5562) [57] and *CCND1* (595) [58]. All of these four genes are oncogenes, which typically show copy number gains in tumor cells. Each of the genes belongs to a distinct chromosome.
- A set of BC [36] FISH data consisting of paired (from the same patient) ductal carcinoma in situ (DCIS) and invasive ductal breast carcinoma (IDC) samples with cells per sample profiled on eight FISH probes: *COX-2* (5743) [59], *MYC* (4609) [60], *CCND1* [58], *HER-2* (2064) [61], *ZNF217* (7764) [62], *DBC2* (23221) [63], *CDH1* (999) [64] and *TP53* (7157) [65]. The first five genes in this list are oncogenes and the last three genes are tumor suppressors. In tumor cells, tumor suppressors are typically associated with loss in copy numbers.

Among the eight genes in the BC dataset, *DBC2* and *MYC* reside on chromosome 8 and *HER-2* and *TP53* reside on chromosome 17. The other four genes belong to distinct chromosomes. The oncogene Cyclin D1 (*CCND1*), which plays a role in many solid tumor types, is in both the BC and CC datasets. However, in some other tumor types, such as oral cancer, *CCND1* is part of a larger region with recurrent copy number gains on chromosome 11, and other nearby genes have also been suggested to play a role in oncogenesis [66].

We evaluated the SD+CD+GD method by its effectiveness in reducing the parsimony score (total number of mutation events) of the resulting trees relative to the prior SD-only model. With the primary CC samples, the SD+CD+GD method found a lower-cost tree in of cases, a tree of equal weight in cases, and a higher-cost tree in cases. In each case of increased weight, the increase was by and appears to result from the subtree regrafting heuristic used in handling GD events (see Methods). These results suggest that the heuristic tree search may more often yield a suboptimal result for the SD+CD+GD model than it does for the SD-only model. The benefit of the more realistic model, however, outweighs the cost of this suboptimality in a large majority of instances. For trees derived from metastatic samples, of trees had lower weight for the full SD+CD+GD model and the remainder all had equal weight for the two models. Metastatic data sets tend to have fewer distinct cell types than do primary trees and thus may represent an easier optimization challenge. For the BC samples, of DCIS (samples 1–13) and of IDC (samples 14–26) had lower weight for the full model, with the remaining one sample having equal weight. Parsimony scores by tree are provided in Figures 5 and 6.

Comparison of (A) Primary and (B) Metastatic CC tumor progression tree weights built considering only SD and combined SD, CD and GD models. “Total Cell Type” refers to the total number of unique probe copy number configurations in the dataset, providing a lower bound on the minimum possible parsimony score for a given data set.

Comparison of DCIS (id 1–13) and IDC (id 14–26) BC tumor progression tree weights built considering only SD and combined SD, CD and GD models. “Cell Types” refers to the total number of unique probe copy number configurations in the dataset, providing a lower bound on the minimum possible parsimony score for a given data set.

We next evaluated effects of the improved model on overall tree topology, based on results of our prior work [46] that tree topology can significantly distinguish trees drawn from distinct progression stages of a given tumor type, with possible implications for the varying balance of diversification and selection acting on different stages of tumor progression. Figure 7 quantifies the topology for each sample set based on fractions of cells inferred at each tree depth from to . The figure shows similar qualitative trends for both SD and SD+CD+GD methods, although with small quantitative differences. For example, both SD and SD+CD+GD trees recapitulate a tendency for CC primary trees to show relatively broad topology (Figure 7(A)) while CC metastatic trees prune rapidly beyond the first few tree levels (Figure 7(B)). There is, however, an overall shift to lower depth in the SD+CD+GD trees. For CC primary trees, of cells are located in the first tree levels for SD versus for SD+CD+GD. For CC metastatic, of cells are located in the first tree levels for SD versus for SD+CD+GD. For BC, the comparable numbers of cells in depths are for SD versus for SD+CD+GD in DCIS and for SD versus for SD+CD+GD in IDC. These results suggest that the overall tree topology is not greatly sensitive to the combination of event types, although there is a noticeable shift towards lower depth in the full model.

Distribution of cells across different levels are shown for (A) Primary and (B) Metastatic CC, and (C) DCIS and (D) IDC BC tumor progression trees.
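The per-level cell fractions summarized above can be computed from a tree and its node weights as in the sketch below. The parent-index encoding, the function name, and the example values are assumptions for illustration; Steiner nodes carry weight 0 as in the tree figures.

```python
def level_fractions(parent, node_weight, max_depth):
    """Fraction of sampled cells at each tree depth 0..max_depth.

    parent[i] is the index of node i's parent (-1 for the root, parent[i] < i);
    node_weight[i] is the fraction (or count) of cells observed at node i.
    """
    depth = []
    for p in parent:
        depth.append(0 if p < 0 else depth[p] + 1)
    total = float(sum(node_weight))
    fractions = [0.0] * (max_depth + 1)
    for d, w in zip(depth, node_weight):
        if d <= max_depth:
            fractions[d] += w / total
    return fractions


# Hypothetical tree: root, two children (one a Steiner node), one grandchild.
parent = [-1, 0, 0, 1]
weights = [0.5, 0.2, 0.0, 0.3]
fractions = level_fractions(parent, weights, max_depth=2)
print(fractions)
```

Summing the first few entries of `fractions` gives the share of cells in the top levels, the quantity compared between the SD and SD+CD+GD trees.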

An additional evaluation was possible for the BC trees, because for the BC data, a probabilistic model and expert annotation based on two additional centromere probes made it possible to estimate the cell ploidy [36], which we define as the mode among the number of copies of the twenty-two autosomal chromosomes in a cell. Each cell in that dataset is thus annotated with an expert-curated overall ploidy estimate. We used these ploidy estimates to validate our inference of GD events based on whether edges assigned to GD events in our trees correspond to doubling of annotated ploidy. The percentage agreement by edge between GD events and annotated doubling in ploidy is across DCIS trees and across IDC trees. In of all inferred GD events, at least one endpoint of the corresponding edge is a Steiner node, and the uncertainty about whether a GD event occurred before or after the emergence of the Steiner node may explain why the per-edge agreement is not higher. Nonetheless, the data support the conclusion that inferred GD events are correct in a majority of cases.

As a final step, we repeated an approach developed in our prior work [46] to both validate the biological relevance of the trees and develop a practical application of them by treating the trees as sources of features for classification tasks applied to the CC data. For this purpose, we developed several sets of quantitative features based on inferred trees as well as comparative features derived from raw FISH probe counts. We used the following set of tree-based features:

- Edge count: features corresponding to fraction of progression tree edges showing gains and losses of each gene.
- Tree level cell percentage: features corresponding to the fraction of cells at each of the first levels for the progression trees.

We omitted a third feature set, bin count, used in our prior work because it is not easily comparable between SD and SD+CD+GD trees. We compared these features to four features derived directly from FISH probe counts without reference to the trees:

- Mean gain and loss of individual genes.
- Maximum copy number of individual genes.
- An information theoretic measure, Shannon index [67]. For each gene, each combination of gene copy number and cellular ploidy represents a species. If we denote the frequency of species among all tumors by , then Shannon index is given by the formula .
- Simpson's index [67], which is defined as .
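The two diversity measures in the list above can be sketched as follows. This is a minimal illustration, not the authors' Matlab code: it assumes the standard forms of the Shannon index (negative sum of p ln p, natural logarithm) and Simpson's index (sum of squared frequencies), and the function names and the toy observations are hypothetical.

```python
import math
from collections import Counter


def species_frequencies(observations):
    """Relative frequency of each distinct 'species' label."""
    counts = Counter(observations)
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def shannon_index(freqs):
    """Shannon index H = -sum(p * ln p); zero frequencies contribute nothing."""
    return -sum(p * math.log(p) for p in freqs if p > 0)


def simpson_index(freqs):
    """Simpson's index D = sum(p^2) (assumed form; variants such as 1 - D exist)."""
    return sum(p * p for p in freqs)


# Hypothetical observations for one gene: (gene copy number, cell ploidy) pairs,
# so that each distinct pair is treated as one species.
cells = [(2, 2), (2, 2), (3, 2), (4, 4)]
freqs = species_frequencies(cells)
print(shannon_index(freqs), simpson_index(freqs))
```

A higher Shannon index and a lower Simpson's index both indicate a more heterogeneous cell population under these definitions.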

We used each feature set as input to the Matlab support vector machine (SVM) classifier with a quadratic kernel using rounds of bootstrap replicates per test with leave-one-out cross-validation to compute mean and standard deviation of accuracy. We used Matlab functions “svmtrain” and “svmclassify” for training and testing of the SVM classifier.

We then applied these methods for three classification tasks: (i) distinguishing primary samples that progressed to metastasis from their paired metastatic samples, (ii) distinguishing all primary samples from all metastatic samples, and (iii) distinguishing primary samples that metastasized from primary samples that did not metastasize. The first two tasks are relevant to identifying features that help us understand the differences in evolutionary mechanisms of primary and metastatic samples. The third is intended to model an important practical problem in cancer treatment: determining whether a given primary tumor will metastasize.

Figure 8 shows results on each task. For task (i), allowing SD+CD+GD events increased accuracy relative to SD trees from to for edge counts and from to for tree level cell count. The SD+CD+GD tree level cell count was the most effective of all features, tree-based or not. For task (ii), we similarly saw a substantial improvement in prediction accuracy for SD+CD+GD trees relative to SD trees. Classification accuracy improved from to for edge count features and from to for tree level features. In this case, both SD+CD+GD tree feature sets outperformed all other feature sets, tree-based or otherwise. These results provide an indirect validation that using a more general tree model gets closer to the biological ground truth. For task (iii), we saw no improvement, with identical results for SD and SD+CD+GD trees for either feature set. All tree-based feature sets significantly outperformed all non-tree-based feature sets for this task. We conclude that the more realistic evolutionary models appear not to reveal any more information to the classifiers for predicting which primary samples will go on to metastasize than the SD trees, which were already quite effective for that task.

Prediction accuracy on three different classification tasks of CC samples of an SVM classifier using tree-based and cell-based features. Each of the two tree-based features, edge count and tree level cell percentage, is derived from phylogenetic trees built using two different models of tumor progression, namely SD and combination of SD, CD and GD. Two cell-based features, average gain/loss and maximum copy number of each gene, and two information theoretic measures of cell heterogeneity, Shannon entropy and Simpson's index, are used.

### Dependence on data size

A key advantage of FISH for profiling tumor heterogeneity is that it makes it cost-effective to profile much larger numbers of cells than alternatives such as single-cell sequencing. To assess the practical importance of this advantage, we asked two related questions: (1) how many cells do we need per tumor to accurately reconstruct single-cell phylogenies and (2) how many tumors do we need to examine to identify reproducible, statistically significant features across trees.

We first assessed the number of cells needed per tumor by using our first simulated dataset of trees described above with subsamples of varying numbers of cells per tumor, measuring reconstruction error of our SD+CD+GD algorithm with the weighted matching algorithm. The mean reconstruction errors calculated across cases for subsamples of , , , and cells were (s.d. ), (), (), (), and () respectively. We can thus conclude that accuracy improves noticeably with increasing numbers of cells to at least cells per tumor before plateauing at approximately error.

We next assessed numbers of tumors needed to identify meaningful statistically significant properties of tumor classes by analysis of the CC paired and primary samples. We randomly subsampled from among the pairs and, for each subsample, calculated the following three tree statistics on progression trees inferred from our SD+CD+GD algorithm:

- Shannon index based on distribution of cells across different tree levels.
- Weighted mean depth of the trees.
- Sum of differences of fractional gain and loss of each gene across the tree edges.

We then compared distributions of each statistic on primary vs. metastatic trees by a Wilcoxon signed rank test. As the samples were selected randomly, no ordering among the samples was considered. Figure 9 shows the 1-sided p-values of the three statistical tests when the number of randomly selected samples is increased from to . The figure shows that the ability to distinguish the two tumor subsets improves with increasing number of tumors. While the threshold for significance varies by statistic, each reaches weak significance (p < 0.05) between and tumors. We can thus conclude that finding reproducible features distinguishing the tree types requires on the order of tens of tumors, at least for the candidate probe sets examined here.

Wilcoxon signed rank test 1-sided p-values for separating the primary CC samples from the metastases across subsets of increasing numbers of randomly selected tumor samples. For each set of tumors, samples were randomly selected from paired CC primary and metastatic tumors with at least one of each type and then a Wilcoxon signed rank test was used to calculate the p-values for separating the primary from metastases based on three different statistics: (A) Shannon index calculated using the distribution of cells across different tree levels, (B) weighted mean depth of the trees and (C) sum of differences of fractional gain and loss of each gene across the tree edges.

Taken together, these two results demonstrate that building accurate trees on a large enough scale to meaningfully distinguish primary from metastatic trees requires data sets on the order of thousands of single cells (hundreds of cells per tumor for tens of tumors), a scale of data that has so far been achieved only by FISH studies of tumor heterogeneity. We note, however, that one would expect these numbers to vary depending on the degree of tumor heterogeneity, the classes of trees one wishes to distinguish, and the specific markers examined.

## Discussion

This paper has presented novel theory and algorithms for reconstructing evolutionary trajectories of gene copy numbers in solid tumors in terms of a model of tumor evolution incorporating changes at the scale of single gene probes, full chromosomes, or all probes in the genome. We have derived algorithms to reconstruct maximum parsimony sequences of events, and thus estimates of evolutionary distance, between pairs of cells assayed by FISH probes. We have further incorporated these inferences into a method for building phylogenies of hundreds of cells in single tumors. These methods have been added to FISHtrees [46], our software for inferring tumor phylogenies from single-cell copy number data. Experimental results on simulated data confirm the ability of the new methods to improve phylogenetic inference accuracy relative to simpler models by adding CD and GD events that model chromosome-scale and whole-genome copy number changes that are frequently observed in tumor evolution. Application to observed human tumor data shows that these extended evolutionary models are able to yield more parsimonious tree reconstructions and that the resulting trees lead to improved accuracy in prediction tasks related to diagnosis and prognosis.

In future work, we hope to extend the theory developed here to handle even more realistic models and more challenging data types. One important direction will be advancing the theory developed here to improve upon the heuristic approximations used in the Steiner tree inference to better approach the goal of finding globally optimal trees for the most computationally challenging FISH data sets. The evolutionary models, likewise, might be further extended to go beyond the three mutational event types considered here to better approximate the numerous distinct mutational mechanisms by which copy number profiles of tumor cells might evolve. The data sets studied here do not include geographical information about locations of individual cells in the tumor, but other data sets for analyzing tumor heterogeneity do include such geographical information [38], [68]. We expect it would be interesting to construct phylogenies with distance functions that combine spatial distance in three dimensions with combinatorial distance measures between the cell count patterns, as we have studied here. Further, while FISH for the moment retains a unique advantage in the large number of cells it can profile, one can reasonably anticipate that single-cell sequencing will eventually become practical for comparable cross-tumor studies. There would thus be value in extending the theory developed here to single-cell sequencing data, a goal that would pose substantial algorithmic challenges due to the much larger number and variety of markers it can reveal as well as the more complicated error models it would entail. Finally, we hope to make more use of these single-tumor phylogenetic models in clinically relevant prediction tasks and further explore the biological insights one can gain from more accurate tumor phylogenies.

## Methods

Our main theoretical result is a method for inferring minimum distances between two states within a copy number phylogeny when duplication/loss of single genes (SD), duplication/loss of all genes on a common chromosome (CD), and duplication of all genes in the full genome (GD) events are possible. We first establish some mathematical results and then develop an algorithm for accurate distance computation. This algorithm then becomes a subroutine in a heuristic Steiner tree algorithm for inferring copy number phylogenies in the presence of SD, CD, and GD events. We introduce some notation required for specifying and proving the theoretical results:

- : A set of copy numbers of one or more genes , which we call a “configuration”. When are clear from the context, we use as shorthand.
- : or rectilinear distance between two configurations and .
- , , : Distance between two configurations and when considering SD+CD (s,ch), SD+GD (s,g), or SD+CD+GD (s,ch,g) events, respectively.
- , , : Operations corresponding to single chromosome (CD) events corresponding to either gain (g), loss (l), or either (no subscript) of all genes belonging to the same chromosome from starting configuration , while keeping the copy numbers of genes on other chromosomes unchanged.
- , : Operations corresponding to doubling () or halving () counts of all genes in configuration . In the case of halving, it is assumed that all genes in have even counts.
- *even*, *odd* configuration: A configuration (copy number profile) is denoted an *even* configuration if . Otherwise, it is denoted an *odd* configuration.
- : The set of “nearest even” values for each in , i.e., if then For example, .
- An operation is *valid* on a configuration if satisfies for all given predefined lower-bound LB and upper-bound UB. Otherwise, is *invalid* on . LB = 0 and UB = 9 are used in the software, but the theory only requires that UB LB.
- A sequence of operations is *boundary-sensitive* on configuration if satisfies for all and . We use *boundary-insensitive* to refer to a sequence on which this condition has not been checked.
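The notation above can be made concrete with a few small helpers. This is a sketch under stated assumptions: configurations are tuples of integers, an even configuration is read as one whose entries are all even, and the "nearest even" set is read as replacing each odd entry by the value one below or one above it while leaving even entries unchanged. All function names are hypothetical; only the bounds LB = 0 and UB = 9 come from the text.

```python
from itertools import product

LB, UB = 0, 9  # bounds quoted in the text for the software


def l1_distance(x, y):
    """Rectilinear (L1) distance between two configurations."""
    return sum(abs(a - b) for a, b in zip(x, y))


def is_even_configuration(x):
    """Assumed reading: every copy number in the configuration is even."""
    return all(v % 2 == 0 for v in x)


def within_bounds(x, lb=LB, ub=UB):
    """True if every copy number lies in [lb, ub]."""
    return all(lb <= v <= ub for v in x)


def nearest_even(x, lb=LB, ub=UB):
    """Assumed reading of the 'nearest even' set: each odd entry v becomes
    v - 1 or v + 1, each even entry is kept; out-of-bounds results are dropped."""
    options = [[v] if v % 2 == 0 else [v - 1, v + 1] for v in x]
    return [c for c in product(*options) if within_bounds(c, lb, ub)]


print(nearest_even((3, 4)))  # two candidates: (2, 4) and (4, 4)
print(nearest_even((9, 1)))  # 10 exceeds UB, so only (8, 0) and (8, 2) remain
```

Under this reading, a configuration with n odd entries has at most 2^n nearest even neighbors, each at L1 distance n from it.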

### Progression model considering SD and CD events

We develop the theory for inference of the Steiner (unsampled or extinct cell configurations) nodes in the paths formed by the sequence of gene copy number gains and losses from an initial configuration to a final configuration . We first extend the prior theory to account for SD and CD events. Our model assumes that on division of a tumor cell, the configuration can change either by gain or loss of one copy of a single gene (SD event) or by gain or loss of one copy of each gene on a single chromosome (CD event). For example, a configuration of four genes with the first two genes on the same chromosome might evolve in a single mutational event to by an SD event or to by a CD event. We propose Algorithm 1, provided in Figure 10, to calculate the minimum number of steps required to transform into considering SD and CD events, where, without loss of generality, we assume that the genes on a common chromosome have consecutive indices in . Algorithm 1 also identifies a minimum-length sequence of events, although this sequence is not necessarily unique. For example, if there are four genes on one chromosome and we want to get from configuration to configuration , then a shortest sequence of SD and CD events would be CD to , SD to , SD to , and SD to . Other orders of the same four events are also possible.

Algorithm 1 converts a set of boundary-insensitive events to boundary-sensitive events; lines 3–17 are used for chromosomes on which all CD events are losses and lines 18–32 are used for chromosomes on which all CD events are gains.

The above example focuses on a single chromosome because, as explained below, the problem of finding the shortest SD+CD path can be solved one chromosome at a time. We begin by establishing the following lemmas:

#### Lemma 1.

*A minimum-length boundary-insensitive sequence of CD and SD events cannot have both a gain of chromosome ** and a loss of the same chromosome **.*

*Proof.* By contradiction. Suppose is a sequence of events that has both a gain and a loss of the same chromosome. Then removing one gain and one loss produces a new sequence that is shorter and has the same final state.

#### Lemma 2.

*For any gene **, a minimum-length boundary-insensitive sequence of events cannot have both a gain of ** and a loss of **.*

*Proof.* By contradiction. Suppose is a sequence of events that has both a gain of and a loss of . Then removing one gain and one loss produces a new sequence that is shorter and has the same final state.

#### Lemma 3.

*The following sequence of events describes a minimum-length boundary-insensitive sequence of SD and CD events for transforming ** into **:*

- *Perform CD events in arbitrary order starting from so that each successive event decreases the distance between the intermediate configurations and until any further CD event will increase the distance. We define the final configuration reached after this step to be .*
- *Perform SD events in arbitrary order starting at so that the distance between and decreases on each step until the distance becomes zero.*

*The total number of events required will be .*

*Proof.* Since the sequence of events is boundary-insensitive and addition is commutative, we can change the order of events without changing the endpoints or the cost. Therefore, we assume that all CD events precede all SD events. The construction of the above sequence of the events ensures that it uses a maximum number of possible CD events. If we denote the number of genes on the common chromosome by and the number of CD events by , then the total number of events required is . If there exists a shorter sequence of events to transform to , then that sequence must have a larger number of CD events, which is contradicted by the construction. Thus, the number of events is minimized.
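The two-phase construction of Lemma 3 can be sketched for the genes of a single chromosome as follows. This is an illustrative, boundary-insensitive sketch rather than the FISHtrees implementation; the function names are hypothetical, and the L1 distance is used as the measure that each CD event must decrease.

```python
def l1_distance(x, y):
    """Rectilinear (L1) distance between two configurations."""
    return sum(abs(a - b) for a, b in zip(x, y))


def greedy_sd_cd_count(x, y):
    """Number of events used by the two-phase construction for one chromosome.

    Phase 1: apply whole-chromosome gains or losses while each one strictly
    decreases the L1 distance to y. Phase 2: finish with single-gene events,
    one per remaining unit of L1 distance. Bounds are ignored.
    """
    current = list(x)
    cd_events = 0
    while True:
        dist = l1_distance(current, y)
        gained = [v + 1 for v in current]
        lost = [v - 1 for v in current]
        if l1_distance(gained, y) < dist:
            current = gained
        elif l1_distance(lost, y) < dist:
            current = lost
        else:
            break  # no further CD event decreases the distance
        cd_events += 1
    return cd_events + l1_distance(current, y)


# One chromosome gain followed by no SD on the first three genes and one
# SD loss on the last gene: two events instead of three single-gene gains.
print(greedy_sd_cd_count((2, 2, 2, 2), (3, 3, 3, 2)))
```

A gain and a loss can never both decrease the distance from the same configuration, so the loop never has to choose between directions, in line with Lemma 1.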

The above lemmas show how to construct a minimum-length boundary-insensitive sequence of events. We now establish that this sequence can be used to derive a minimum-length boundary-sensitive sequence of events:

#### Lemma 4.

*For any boundary-insensitive minimum-length sequence of SD and CD events ** transforming ** to **, there exists a boundary-sensitive sequence of SD and CD events ** such that ** and ** have equal length.*

*Proof.* We analyze one chromosome at a time because in this section the events on different chromosomes are independent. By Lemma 1, on any specific chromosome all the CD events are gains or all the CD events are losses. We analyze in detail the case in which all CD events are losses; the case of all gains is symmetric.

The proof is constructive. Specifically, we will show that the upper part of Algorithm 1 will transform a boundary-insensitive to a boundary-sensitive of equal cost solely by reordering events. Without loss of generality, suppose the only CD events in are chromosome losses. There is a symmetric algorithm, shown as the lower part of Algorithm 1, for the case where all the chromosome events are gains. We add the following definition:

A gene is defined as unidirectional with respect to if there are no gains of in . A gene is defined as bidirectional with respect to if includes gains of . For unidirectional genes, the order of chromosome losses and gene losses can never cause a boundary to be crossed because the copy numbers are monotonically decreasing. The situations we need to avoid are:

- A bidirectional gene has copy number UB and the next operation affecting is a gain of .
- A bidirectional gene has copy number LB and the next operation affecting is a chromosome loss.

Chromosome gains are excluded by Lemma 1 and our assumption without loss of generality that all CD events are losses. Gene losses for bidirectional genes are excluded by Lemma 2.

To prove correctness of the algorithm, we note that can never cross LB for the unidirectional genes because their net loss equals their total loss. can never cross LB for the bidirectional genes, because when their copy number is at LB, a gene gain must still be pending and the gene gains alternate in the first while loop until no chromosome losses or gene gains are remaining. can never cross UB for the unidirectional genes because they have only losses. can never cross UB for the bidirectional genes because of the test (line 8) before any gene gain is done. Further, all the chromosome losses will be used because one chromosome loss happens on each pass through the first while loop, if any chromosome losses remain. All gene gains in will be used in the first while loop because the net change for any gene must keep its copy number below UB. All the gene losses for the unidirectional genes are used in the second while loop. The unordered set of events and total change in each gene is thus preserved between and , while guarantees that the sequence is boundary-sensitive.
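The role of event ordering in Lemma 4 can be illustrated with a small checker. This is a hedged sketch, not Algorithm 1 itself: it only replays a given order of events and reports whether every intermediate configuration stays within the bounds. The event encoding (a tuple of gene indices plus a signed step) and all names are hypothetical.

```python
LB, UB = 0, 9  # bounds quoted in the text for the software


def apply_events(start, events, lb=LB, ub=UB):
    """Replay events in order; return (final configuration, stayed_in_bounds)."""
    config = list(start)
    stayed_in_bounds = True
    for genes, delta in events:
        for g in genes:
            config[g] += delta
        if any(v < lb or v > ub for v in config):
            stayed_in_bounds = False
    return tuple(config), stayed_in_bounds


# Four genes on one chromosome; gene 0 is bidirectional because it needs an
# SD gain to offset the chromosome loss.
chromosome = (0, 1, 2, 3)
cd_loss = (chromosome, -1)
sd_gain_gene0 = ((0,), +1)
start = (0, 2, 2, 2)

loss_first = apply_events(start, [cd_loss, sd_gain_gene0])
gain_first = apply_events(start, [sd_gain_gene0, cd_loss])
print(loss_first)  # same endpoint, but gene 0 dips below LB on the way
print(gain_first)  # same endpoint, every intermediate state within bounds
```

Both orders use the same two events and reach the same final configuration, so reordering changes validity but not cost, which is the property the lemma relies on.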

We use the preceding result to derive the main theorem of this section, which establishes a method to find a minimum-length sequence of SD and CD events transforming to . As in the proof of Lemma 4, we can consider each chromosome separately since each SD and CD event affects only one chromosome.

#### Theorem 5.

*Assume we partition the gene list by chromosomes such that each chromosome ** corresponds to a consecutive subset of genes **. Further define ** and **. Then we can construct a minimum-length boundary-sensitive sequence of events transforming ** to ** by constructing a minimum-length boundary-sensitive sequence of events ** transforming ** to ** for each chromosome ** and interleaving each ** in arbitrary order.*

*Proof.* The distance function can be decomposed into individual parts for genes belonging to distinct chromosomes as follows:

Because the distance cost can be decomposed in this way and each CD or SD event contributes to only a single term of the outer sum, we can minimize the cost of events for each chromosome independently and combine the events from distinct chromosomes in arbitrary order without changing the value of the objective function. Likewise, since the events on each chromosome affect a disjoint subset of genes, boundary-sensitive sequences for each chromosome will yield a boundary-sensitive sequence across all genes.
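The per-chromosome decomposition can be sketched as a sum of independent single-chromosome costs. This is a minimal sketch under assumptions: it ignores the bounds, it computes each chromosome's cost by trying every net number k of chromosome events (positive for gains, negative for losses) and paying the remaining per-gene differences with SD events, and the function names and the list-of-index-lists encoding of chromosomes are hypothetical.

```python
def chromosome_cost(x, y):
    """Minimum SD+CD events for the genes of one chromosome, ignoring bounds.

    With k net chromosome events, each gene still needs |d - k| single-gene
    events, where d is its required change.
    """
    diffs = [b - a for a, b in zip(x, y)]
    lo, hi = min(0, *diffs), max(0, *diffs)
    return min(abs(k) + sum(abs(d - k) for d in diffs) for k in range(lo, hi + 1))


def sd_cd_distance(x, y, chromosomes):
    """Sum the independent per-chromosome costs.

    `chromosomes` is a list of lists of gene indices, one list per chromosome.
    """
    total = 0
    for genes in chromosomes:
        total += chromosome_cost([x[g] for g in genes], [y[g] for g in genes])
    return total


x, y = (2, 2, 2, 2), (3, 3, 3, 2)
print(sd_cd_distance(x, y, [[0, 1, 2, 3]]))        # all genes on one chromosome
print(sd_cd_distance(x, y, [[0], [1], [2], [3]]))  # every gene on its own chromosome
```

The same pair of configurations can therefore have different distances depending on how the probes are assigned to chromosomes, since shared chromosomes allow one CD event to replace several SD events.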

### Progression model combining SD, CD and GD events

We now extend the theory from the prior section to include SD, CD, and GD events. We assume in the proofs and discussion below that , where denotes lexicographical ordering. This assumption reduces the number of cases in several proofs. If instead, , the proofs are identical or symmetric except that GD events may be used in the wrong direction (halving instead of doubling). The use of halving events is corrected heuristically by a procedure of subtree pruning and regrafting at line 24 of the pseudocode of Algorithm 3, described below, and in FISHtrees. We will produce the complete proof by deriving a series of lemmas for three cases that together will cover all possible and :

#### Lemma 6.

*For an even configuration **, if there exists an optimal sequence of copy number change events from ** to ** composed of one or more SD and CD events and a single GD event, then the following sequence of events is of minimum length:*

- *SD and CD events to transform into , constructed as described in the subsection “Progression model considering SD and CD events”.*
- *A single GD event to transform into .*

*Proof.* We prove the statement by considering the three different ways that can be used to transform to using single GD and multiple SD and CD events. The statement of the lemma presents one case and the remaining two possibilities are as follows:

- A single GD event to transform into and then multiple SD and CD events to transform into .
- Multiple SD and CD events to transform to an intermediate configuration , a single GD event to transform into , and multiple SD and CD events to transform into .

We show that for either of these alternative cases, we can produce a sequence satisfying the conditions of the lemma with equal or smaller length. For the first case, we have to show that

If all genes are located on distinct chromosomes, then, and the claim follows directly.

Now, assume the genes are partitioned into sets of chromosomes such that each chromosome corresponds to a consecutive subset of genes . We focus on a specific chromosome and consider the problem of updating just genes of that chromosome from their values in to their values in . Either zero or a positive even number of CD events must be performed to convert these genes from to and along with zero or a positive even number of SD operations on each gene. If an odd number of CD operations are performed on , then we get an odd configuration and at least one or an odd number of SD operations must be performed on each gene of this odd configuration to convert it to the even configuration . But a combination of single SD operations acting on each of the individual genes in has the same effect as a single CD operation on chromosome and this combination therefore cannot be minimal. Therefore, the number of CD operations is even. If a total of CD operations and SD operations are needed to convert to , then a total of CD operations and SD operations are needed to convert to . So,

For alternative 2, we can write the distance function as:

The distance function for our proposed optimal sequence can be written as:

As shown for alternative 1, we can write: which implies .

#### Lemma 7.

*For an odd configuration **, if the optimal sequence of copy number change events from ** to ** is composed of one or more SD and CD events, followed by a single GD event, followed by one or more SD and CD events, then the configuration from which the final set of SD and CD events take place is a member of **.*

*Proof.* We denote the intermediate configuration following the GD event to be . We will show by contradiction that if there exists any optimal sequence of events for which then there must exist an alternative, shorter sequence of events. Define the full sequence of events from to to be , subdivided into the subsequences . First, we note that if there is any duplicated event in then we can construct a more parsimonious solution by replacing the duplicate in with a single copy of the event in . Therefore, no event appears more than once in . There are exactly two SD and CD events that can increase the count of any given probe (SD of that probe or CD of its chromosome) and similarly exactly two events that can decrease the count of any probe. Thus, no probe's value changes by more than in the transition from to in . Finally, we note that since immediately follows a GD event, it must be an even configuration. Together, these assertions establish that for any optimal path .

#### Lemma 8.

*For an odd configuration **, if the optimal sequence of copy number change events from ** to ** is composed of one or more SD and CD events and a single GD event, then the optimum sequence of events follows the following path:*

- *Generate .*
- *SD and CD events to transform into .*
- *A single GD event to transform into .*
- *SD and CD events to transform into .*

*The optimal sequence is an element of the set of sequences generated using this procedure.*

*Proof.* The proof follows from application of Lemma 6 and Lemma 7. As is an odd configuration, the final step cannot be a GD event. So, the last steps have to be a combination of SD and/or CD events; in that case, Lemma 7 shows that the configuration reached as a result of GD must be a member of , which we denote by . Lemma 6 shows that to reach any member of , which are even configurations, the optimal sequence of events is to generate SD and CD events to transform into first and then to perform a GD event to transform into . This sequence of events matches the sequence proposed in the lemma.

The above lemmas allow us to derive Algorithm 2 to transform to using a minimum-length combination of SD, CD and GD events. The pseudocode of Algorithm 2 is presented in Figure 11. To illustrate the algorithm, suppose and , where we will assume we have two probes on a single chromosome. Since is an odd configuration, we first generate its nearest even neighbors and calculate . The algorithm tests for two stopping conditions by which a solution can be constructed (lines and in Algorithm 2), neither of which applies to any of the solutions at this point. are therefore considered for the next iteration. , , and are odd configurations, so we generate their neighbor sets , , and . One stopping condition is satisfied for each of the elements of these neighbor sets, so ,, and are each considered in turn as the next candidate neighbor. is an even configuration, so we only need to consider one possible stopping condition (line 11), which it satisfies, so it is also considered as a possible next candidate neighbor. Among the four possibilities, we will conclude that using as the immediate neighbor will lead to the smallest possible number of steps when accumulating SD+CD events from to the candidate, a single GD event from the candidate to its double, and SD+CD events from that double to . Following some postprocessing updates (procedure CheckSrcNeighbor), the algorithm computes a minimum-length solution of and returns the corresponding length .

Algorithm 2 finds the shortest directed distance between two configurations using SD, CD, and GD events.

Algorithm 2 satisfies the following theorem, which constitutes the major result of this section:

#### Theorem 9.

*Algorithm 2 returns the minimum distance between two configurations ** and **, where **.*

*Proof.* We use induction on the minimum number of steps to get from to , which we denote by .

**Base case.** For the base case, we have . We must consider two sub-cases: (i) and (ii) . For case (i), is an even configuration. The condition at line in Algorithm 2 fails and is considered for the next iteration. In the next iteration, if is an even configuration then the condition at line is now satisfied and is assigned the value in the CheckSrcNeighbor procedure called at line in the main procedure. If is an odd configuration, then the condition at line is satisfied for each of the even neighbors of and is assigned the value in the CheckSrcNeighbor procedure called at line . For case (ii), one of the conditions at line or line is satisfied in the first iteration of the algorithm depending on whether is an even or odd configuration and is assigned the value at line or .

**Induction step.** For the induction hypothesis, we assume that the algorithm uses the minimum number of steps for all cases where . Then, suppose that an adversary selects an example that has complexity . Let us assume that the penultimate configuration in the optimal solution is . If is an even configuration, then it can be reached from by using (i) a GD event, (ii) an SD event, or (iii) a CD event. According to the induction hypothesis, for each of these cases, Algorithm 2 uses the minimum number of steps to generate from . If there is at least one GD event in the optimal solution, then Algorithm 2 first calculates . The induction hypothesis ensures that and thus, Algorithm 2 returns a solution with a maximum length of . If there is no GD event in the optimal solution from to , then Algorithm 2 uses the procedure described in the subsection “Progression model considering SD and CD events” to calculate the optimal path from to and, combining it with the optimal solution from to , returns the optimal path between and . Now, if is an odd configuration, then going from the penultimate configuration to can only be achieved using either an SD or a CD event. For odd , Algorithm 2 first generates its even neighbors which are steps from . If , the proof follows directly from the inductive hypothesis. If , then there is a such that is located on the optimal path between and formed using SD and CD events only. If is the total number of genes with odd copy number values in , then and . Using the induction hypothesis, we can write,

As Algorithm 2 uses the procedure described in the subsection “Progression model considering SD and CD events” to construct the optimal path between and , we can see that it returns a path with .
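For very small instances, the minimum directed distance under SD, CD, and GD events can be cross-checked by exhaustive search. The sketch below is explicitly not Algorithm 2: it is a brute-force breadth-first search over the bounded lattice, useful only as a reference on a handful of probes. It assumes that GD doubles every copy number, that halving is not allowed, and that every intermediate configuration must stay within the bounds; the function names and the encoding of chromosomes as tuples of gene indices are hypothetical.

```python
from collections import deque


def neighbors(config, chromosomes):
    """All configurations one SD, CD, or GD (doubling) event away."""
    for g in range(len(config)):          # SD: one gene, one copy
        for delta in (1, -1):
            nxt = list(config)
            nxt[g] += delta
            yield tuple(nxt)
    for genes in chromosomes:             # CD: every gene on one chromosome
        for delta in (1, -1):
            nxt = list(config)
            for g in genes:
                nxt[g] += delta
            yield tuple(nxt)
    yield tuple(2 * v for v in config)    # GD: doubling only


def bfs_distance(x, y, chromosomes, lb=0, ub=9):
    """Fewest events turning x into y with all states inside [lb, ub]."""
    x, y = tuple(x), tuple(y)
    seen = {x}
    queue = deque([(x, 0)])
    while queue:
        config, steps = queue.popleft()
        if config == y:
            return steps
        for nxt in neighbors(config, chromosomes):
            if nxt not in seen and all(lb <= v <= ub for v in nxt):
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return None  # unreachable within the bounds


one_chromosome = [(0, 1)]
print(bfs_distance((1, 1), (4, 4), one_chromosome))  # two doublings
print(bfs_distance((4, 4), (1, 1), one_chromosome))  # three chromosome losses
```

The two calls show the asymmetry of the directed distance: doubling shortens the path in one direction, while the reverse direction has no halving event available. The search visits up to (UB - LB + 1)^n states for n probes, so it is practical only for checking small cases.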

### Runtime analysis of Algorithm 2

We provide an upper bound on the runtime of Algorithm 2 as a function of the number of genes and their copy numbers. Considering all three events, where , the maximum number of doublings required is , where denotes the copy number of the first gene where and . At each stage of the algorithm, the maximum number of nodes generated as a result of a operation is . SD and CD events are used to create each of those nodes in the case of an odd configuration. So, the maximum number of required operations is . Therefore, the number of operations performed during the execution of Algorithm 2 is .

### Generating tumor phylogenies

We implemented Algorithm 2 and integrated it with our approximate median-joining-based algorithm from our prior SD-only FISHtrees [46] code. The key steps of this algorithm are summarized in Algorithm 3 (Figure 12), which we describe at a high level here. The phylogeny algorithm first relies on Algorithm 2 to derive a matrix of pairwise distances between observed cell configurations, which are treated as states on a truncated integer lattice of dimension with a maximum value (UB) set to 9 in the current code. It then repeatedly samples triplets of nodes, identifying as potential Steiner nodes those that agree in each dimension with at least one of the triplet. Those Steiner nodes that lead to reduced minimum spanning tree cost are added to the node set, with the process repeated until there is no further improvement. Finally, a series of post-processing steps are performed to prune Steiner nodes that are not needed for the final tree and to apply subtree regrafting to correct for a potential source of suboptimality arising from the fact that the core phylogeny algorithm assumes symmetric distances but GD operations are asymmetric.

Figure 12 (Algorithm 3) gives the main steps in the algorithm to generate tumor progression trees; generate_distance_matrix uses Algorithm 2 on each distinct pair of nodes in the set of nodes it is passed. To compute the minimum spanning tree (function *mst*, called at lines 4 and 16), we implemented Prim's algorithm.
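One round of the Steiner-node test described above can be illustrated with a minimal Python sketch. This is not the FISHtrees implementation: the function names are hypothetical, a lower lattice bound of 0 is assumed, and a simple rectilinear (L1) distance stands in for Algorithm 2, so it reflects single-gene steps only and ignores CD and GD events. The sketch builds a distance matrix, computes the minimum spanning tree cost with Prim's algorithm, enumerates the lattice points that agree with at least one member of a triplet in every dimension, and keeps the candidates that lower the tree cost.

```python
from itertools import product


def l1_distance(u, v):
    """Stand-in distance (an assumption, not Algorithm 2): rectilinear distance."""
    return sum(abs(x - y) for x, y in zip(u, v))


def distance_matrix(nodes, dist=l1_distance):
    """Pairwise distances between all configurations in `nodes`."""
    return [[dist(u, v) for v in nodes] for u in nodes]


def prim_mst_cost(dmat):
    """Total weight of a minimum spanning tree via Prim's algorithm (dense, O(n^2))."""
    n = len(dmat)
    if n == 0:
        return 0
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known edge linking each node to the tree
    best[0] = 0
    total = 0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v] and dmat[u][v] < best[v]:
                best[v] = dmat[u][v]
    return total


def steiner_candidates(a, b, c, upper_bound=9):
    """Lattice points that match a, b, or c in every dimension, minus the triplet."""
    triplet = {tuple(a), tuple(b), tuple(c)}
    per_dim = [sorted({x, y, z}) for x, y, z in zip(a, b, c)]
    return sorted(
        p for p in product(*per_dim)
        if p not in triplet and all(0 <= v <= upper_bound for v in p)
    )


def improving_steiner_nodes(nodes, triplet, dist=l1_distance):
    """Return the current MST cost and the candidates that would reduce it."""
    base = prim_mst_cost(distance_matrix(nodes, dist))
    existing = {tuple(n) for n in nodes}
    improving = []
    for cand in steiner_candidates(*triplet):
        if cand in existing:
            continue
        cost = prim_mst_cost(distance_matrix(list(nodes) + [cand], dist))
        if cost < base:
            improving.append((cand, cost))
    return base, improving


# Three hypothetical cells with three probes each.
cells = [(2, 2, 3), (2, 3, 2), (3, 2, 2)]
base_cost, improving = improving_steiner_nodes(cells, cells)
print(base_cost, improving)
```

In this toy input, the three cells are pairwise two steps apart, so the spanning tree costs 4; adding the inferred node (2, 2, 2) lowers the cost to 3, which is the kind of improvement that would cause a Steiner node to be retained. Because the stand-in distance is symmetric, the sketch does not exhibit the asymmetry of GD operations that the post-processing steps are meant to address.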

### Inferring tumor phylogenies using Neighbor Joining (NJ) and Maximum Parsimony (MP) methods

Neighbor Joining (NJ) and Maximum Parsimony (MP) methods have been commonly used for building single-tumor phylogenies [16], [54] and we therefore compared their accuracy to that of our own methods in inferring copy number phylogenies. We applied these two traditional phylogenetic tree building methods to build tumor progression trees using the individual copy number profiles as taxa and compared them with the trees built using our algorithms. We used implementations of both approaches in MEGA version 6 [69]. For NJ, we used Euclidean distances between cell copy number profiles to build the pairwise distance matrix. For MP, we treated copy number profiles of the genes in individual cells as sequences of arbitrary phylogenetic characters. We used the “Close-Neighbor-Interchange on Random Trees” search method. For the parameters “Number of Initial Trees” and “MP search level”, we used values of and respectively.
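As an illustration of the NJ input described above, the following Python sketch computes a pairwise Euclidean distance matrix from copy number profiles. The function name and the example profiles are hypothetical, and the sketch covers only the distance computation; it does not reproduce how MEGA reads or processes the matrix.

```python
import math


def euclidean_distance_matrix(profiles):
    """Pairwise Euclidean distances between copy number profiles (one tuple per cell)."""
    return [[math.dist(p, q) for q in profiles] for p in profiles]


# Three hypothetical cells, each with copy numbers for three probes.
profiles = [(2, 2, 2), (4, 2, 2), (2, 5, 6)]
dmat = euclidean_distance_matrix(profiles)
for row in dmat:
    print(["%.3f" % d for d in row])
```

Note that `math.dist` requires Python 3.8 or later. Unlike the event-based distance of Algorithm 2, the Euclidean distance treats each probe independently and does not account for chromosome-level or genome-level events.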

## Acknowledgments

We thank Darawalee Wangsa for collecting the CC data and we thank Lissa Berroa Garcia, Amanda Bradley, and Clarymar Ortiz-Melendez for help in collecting the BC data.

## Author Contributions

Conceived and designed the experiments: SAC SES KHH TR AAS RS. Performed the experiments: SAC AAS RS. Analyzed the data: SAC SES KHH TR AAS RS. Wrote the paper: SAC SES KHH TR AAS RS. Designed the software used in the analysis: SAC AAS RS.

## References

- 1. Pennington G, Smith CA, Shackney S, Schwartz R (2006) Cancer phylogenetics from single-cell assays. Technical report, Carnegie Mellon University.
- 2. Pennington G, Smith CA, Shackney S, Schwartz R (2007) Reconstructing tumor phylogenies from heterogeneous single-cell data. J Bioinform Comput Biol 5: 407–427. doi: 10.1142/s021972000700259x
- 3. Attolini CSO, Michor F (2009) Evolutionary theory of cancer. Ann NY Acad Sci 1168: 23–51. doi: 10.1111/j.1749-6632.2009.04880.x
- 4. Desper R, Jiang F, Kallioniemi OP, Moch H, Papadimitriou CH, et al. (1999) Inferring tree models of oncogenesis from comparative genomic hybridization data. J Comput Biol 6: 37–51. doi: 10.1089/cmb.1999.6.37
- 5. Desper R, Jiang F, Kallioniemi OP, Moch H, Papadimitriou CH, et al. (2000) Distance-based reconstruction of tree models for oncogenesis. J Comput Biol 7: 789–803. doi: 10.1089/10665270050514936
- 6. Szabo A, Boucher K (2002) Estimating an oncogenetic tree when false negatives and positives are present. Math Biosci 176: 219–236. doi: 10.1016/s0025-5564(02)00086-x
- 7. McGlynn KA, Edmonson MN, Michielli RA, London WT, Lin WY, et al. (2002) A phylogenetic analysis identifies heterogeneity among hepatocellular carcinomas. Hepatology 36: 1341–1348. doi: 10.1002/hep.1840360608
- 8. Beerenwinkel N, Rahnenführer J, Däumer M, Hoffmann D, Kaiser R, et al. (2005) Learning multiple evolutionary pathways from cross-sectional data. J Comput Biol 12: 584–598. doi: 10.1089/cmb.2005.12.584
- 9. Beerenwinkel N, Rahnenführer J, Kaiser R, Hoffmann D, Selbig J, et al. (2005) Mtreemix: a software package for learning and using mixture models of mutagenetic trees. Bioinformatics 21: 2106–2107. doi: 10.1093/bioinformatics/bti274
- 10. Bogojeska J, Alexa A, Altmann A, Lengauer T, Rahnenführer J (2008) Rtreemix: an R package for estimating evolutionary pathways and genetic progression scores. Bioinformatics 24: 2391–2392. doi: 10.1093/bioinformatics/btn410
- 11. Bogojeska J, Lengauer T, Rahnenführer J (2008) Stability analysis of mixtures of mutagenetic trees. BMC Bioinformatics 9: 165. doi: 10.1186/1471-2105-9-165
- 12. Frumkin D, Wasserstrom A, Itzkovitz S, Stern T, Harmelin A, et al. (2008) Cell lineage analysis of a mouse tumor. Cancer Res 68: 5924–5931. doi: 10.1158/0008-5472.can-07-6216
- 13. Shlush LI, Chapal-Ilani N, Adar R, Pery N, Maruvka Y, et al. (2012) Cell lineage analysis of acute leukemia relapse uncovers the role of replication-rate heterogeneity and microsatellite instability. Blood 120: 603–612. doi: 10.1182/blood-2011-10-388629
- 14. Liu J, Bandyopadhyay N, Ranka S, Baudis M, Kahveci T (2009) Inferring progression models for CGH data. Bioinformatics 25: 2208–2215. doi: 10.1093/bioinformatics/btp365
- 15. Letouzé E, Allory Y, Bollet MA, Radvanyi F, Guyon F (2010) Analysis of the copy number profiles of several tumor samples from the same patient reveals the successive steps in tumorigenesis. Genome Biol 11: R76. doi: 10.1186/gb-2010-11-7-r76
- 16. Subramanian A, Shackney S, Schwartz R (2012) Inference of tumor phylogenies from genomic assays on heterogeneous samples. J Biomed Biotechnol: 797812.
- 17. Campbell PJ, Pleasance ED, Stephens PJ, Dicks E, Rance R, et al. (2008) Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing. Proc Natl Acad Sci USA 105: 13081–13086. doi: 10.1073/pnas.0801523105
- 18. Notta F, Mullighan CG, Wang JC, Poeppl A, Doulatov S, et al. (2010) Evolution of human BCR-ABL1 lymphoblastic leukaemia-initiating cells. Nature 469: 362–367. doi: 10.1038/nature09733
- 19. Tao Y, Ruan J, Yeh SH, Lu X, Wang Y, et al. (2011) Rapid growth of a hepatocellular carcinoma and the driving mutations revealed by cell-population genetic analysis of whole-genome data. Proc Natl Acad Sci USA 108: 12042–12047. doi: 10.1073/pnas.1108715108
- 20. Hou Y, Song L, Zhu P, Zhang B, Tao Y, et al. (2012) Single-cell exome sequencing and monoclonal evolution of a JAK2-negative myeloproliferative neoplasm. Cell 148: 873–885. doi: 10.1016/j.cell.2012.02.028
- 21. Fearon E, Vogelstein B (1990) A genetic model for colorectal tumorigenesis. Cell 61: 759–767. doi: 10.1016/0092-8674(90)90186-i
- 22. Höglund M, Gisselsson D, Mandahl N, Johansson B, Mertens F, et al. (2001) Multivariate analyses of genomic imbalances in solid tumors reveal distinct and converging pathways of karyotypic evolution. Genes Chromosomes Cancer 31: 156–171. doi: 10.1002/gcc.1129.abs
- 23. Newton MA (2002) Discovering combinations of genomic aberrations associated with cancer. J Am Stat Assoc 97: 931–942. doi: 10.1198/016214502388618726
- 24. Bilke S, Chen QR, Westerman F, Schwab M, Catchpoole D, et al. (2005) Inferring a tumor progression model for neuroblastoma from genomic data. J Clin Oncol 23: 7322–7331. doi: 10.1200/jco.2005.03.2821
- 25. Hjelm M, Höglund M, Lagergren J (2006) New probabilistic network models and algorithms for oncogenesis. J Comput Biol 13: 853–865. doi: 10.1089/cmb.2006.13.853
- 26. Gerstung M, Baudis M, Moch H, Beerenwinkel N (2009) Quantifying cancer progression with conjunctive bayesian networks. Bioinformatics 25: 2809–2815. doi: 10.1093/bioinformatics/btp505
- 27. Oesper L, Mahmoody A, Raphael BJ (2013) Theta: Inferring intra-tumor heterogeneity from high-throughput DNA sequencing data. Genome Biol 14: R80. doi: 10.1186/gb-2013-14-7-r80
- 28. Shahrabi Farahani H, Lagergren J (2013) Learning oncogenetic networks by reducing to mixed integer linear programming. PLoS ONE 8: e65773. doi: 10.1371/journal.pone.0065773
- 29. Greenman CD, Pleasance ED, Newman S, Yang F, Fu B, et al. (2010) Estimation of rearrangement phylogeny for cancer genomes. Genome Res 22: 346–361. doi: 10.1101/gr.118414.110
- 30. Purdom E, Ho C, Grasso CS, Quist MJ, Cho RJ, et al. (2013) Methods and challenges in timing chromosomal abnormalities within cancer samples. Bioinformatics 29: 3113–3120. doi: 10.1093/bioinformatics/btt546
- 31. Janocko LE, Brown KA, Smith CA, Gu LP, Pollice AA (2001) Distinctive patterns of Her-2/neu, c-myc, and cyclin D1 gene amplification by fluorescence in situ hybridization in primary breast cancers. Cytometry 46: 136–149. doi: 10.1002/cyto.1098
- 32. Heselmeyer-Haddad K, Chaudhri N, Stoltzfus P, Cheng JC, Wilber K, et al. (2002) Detection of chromosomal aneuploidies and gene copy number changes in fine needle aspirates is a specific, sensitive, and objective genetic test for the diagnosis of breast cancer. Cancer Res 62: 2365–2369.
- 33. Snuderl M, Fazlollahi L, Le LP, Nitta M, Zhelyazkova BH, et al. (2011) Mosaic amplification of multiple receptor tyrosine kinase genes in glioblastoma. Cancer Cell 20: 810–817. doi: 10.1016/j.ccr.2011.11.005
- 34. Martins FC, De S, Almendro V, Gönen M, Park SY, et al. (2012) Evolutionary pathways in BRCA1-associated breast tumors. Cancer Discov 2: 503–511. doi: 10.1158/2159-8290.cd-11-0325
- 35. Szerlip NJ, Pedraza A, Chakravarty D, Azim M, McGuire J, et al. (2012) Intratumoral heterogeneity of receptor tyrosine kinases EGFR and PDGFRA amplification in glioblastoma defines subpopulations with distinct growth factor response. Proc Natl Acad Sci USA 109: 3041–3046. doi: 10.1073/pnas.1114033109
- 36. Heselmeyer-Haddad K, Berroa Garcia LY, Bradley A, Ortiz-Melendez C, Lee WJ, et al. (2012) Single-cell genetic analysis of ductal carcinoma in situ and invasive breast cancer reveals enormous tumor heterogeneity, yet conserved genomic imbalances and gain of *MYC* during progression. Am J Pathol 181: 1807–1822. doi: 10.1016/j.ajpath.2012.07.012
- 37. Navin N, Kendall J, Troge J, Andrews P, Rodgers L, et al. (2011) Tumour evolution inferred by single-cell sequencing. Nature 472: 90–94. doi: 10.1038/nature09807
- 38. Gerlinger M, Rowan AJ, Horswell S, Larkin J, Endesfelder D, et al. (2012) Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N Engl J Med 366: 883–892. doi: 10.1056/nejmoa1113205
- 39. Xu X, Hou Y, Yin X, Bao L, Tang A, et al. (2012) Single-cell exome sequencing reveals single-nucleotide mutation characteristics of a kidney tumor. Cell 148: 886–895. doi: 10.1016/j.cell.2012.02.025
- 40. Marusyk A, Polyak K (2010) Tumor heterogeneity: causes and consequences. Biochim Biophys Acta (BBA)-Reviews on Cancer 1805: 105–117. doi: 10.1016/j.bbcan.2009.11.002
- 41. Nowell PC (1976) The clonal evolution of tumor cell populations. Science 194: 23–28. doi: 10.1126/science.959840
- 42. Ding L, Raphael BJ, Chen F, Wendl MC (2013) Advances for studying clonal evolution in cancer. Cancer Lett 340: 212–219. doi: 10.1016/j.canlet.2012.12.028
- 43. Urbschat S, Rahnenführer J, Henn W, Feiden W, Wemmert S, et al. (2011) Clonal cytogenetic progression within intratumorally heterogeneous meningiomas predicts tumor recurrence. Int J Oncol 39: 1601–1608. doi: 10.3892/ijo.2011.1199
- 44. Sprouffske K, Pepper JW, Maley CC (2011) Accurate reconstruction of the temporal order of mutations in neoplastic progression. Cancer Prev Res 4: 1135–1144. doi: 10.1158/1940-6207.capr-10-0374
- 45. Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4: 406–425.
- 46. Chowdhury SA, Shackney SE, Heselmeyer-Haddad K, Ried T, Schäffer AA, et al. (2013) Phylogenetic analysis of multiprobe fluorescence in situ hybridization data from tumor cell populations. Bioinformatics 29: i189–i198. doi: 10.1093/bioinformatics/btt205
- 47. Sottoriva A, Spiteri I, Shibata D, Curtis C, Tavaré S (2013) Single-molecule genomic data delineate patient-specific tumor profiles and cancer stem cell organization. Cancer Res 73: 41–49. doi: 10.1158/0008-5472.can-12-2273
- 48. Bandelt H, Forster P, Röhl A (1999) Median-joining networks for inferring intraspecific phylogenies. Mol Biol Evol 16: 37–48. doi: 10.1093/oxfordjournals.molbev.a026036
- 49. Wangsa D, Heselmeyer-Haddad K, Ried P, Eriksson E, Schäffer AA, et al. (2009) Fluorescence in situ hybridization markers for prediction of cervical lymph node metastases. Am J Pathol 175: 2637–2645. doi: 10.2353/ajpath.2009.090289
- 50. Rahnenführer J, Beerenwinkel N, Schulz WA, Hartmann C, Deimling AV, et al. (2005) Estimating cancer survival and clinical outcome based on genetic tumor progression scores. Bioinformatics 21: 2438–2446. doi: 10.1093/bioinformatics/bti312
- 51. Lin Y, Rajan V, Moret BME (2012) A metric for phylogenetic trees based on matching. IEEE/ACM Trans Comput Biol Bioinform 9: 1014–1022. doi: 10.1109/tcbb.2011.157
- 52. Kuhn HW (1955) The Hungarian method for the assignment problem. Nav Res Logist Q 2: 83–97. doi: 10.1002/nav.3800020109
- 53. Robinson D, Foulds LR (1981) Comparison of phylogenetic trees. Math Biosci 53: 131–147. doi: 10.1016/0025-5564(81)90043-2
- 54. Navin N, Krasnitz A, Rodgers L, Cook K, Meth J, et al. (2010) Inferring tumor progression from genomic heterogeneity. Genome Res 20: 68–80. doi: 10.1101/gr.099622.109
- 55. Kanao H, Enomoto T, Kimura T, Fujita M, Nakashima R, et al. (2005) Overexpression of *LAMP3/TSC403/DC-LAMP* promotes metastasis in uterine cervical cancer. Cancer Res 65: 8640–8645. doi: 10.1158/0008-5472.can-04-4112
- 56. Wigle JT, Oliver G (1999) *PROX1* function is required for the development of the murine lymphatic system. Cell 98: 769–778. doi: 10.1016/s0092-8674(00)81511-1
- 57. Huang FY, Chiu PM, Tam KF, Kwok YKY, Lau ET, et al. (2006) Semi-quantitative fluorescent PCR analysis identifies *PRKAA1* on chromosome 5 as a potential candidate cancer gene of cervical cancer. Gynecol Oncol 103: 219–225. doi: 10.1016/j.ygyno.2006.02.028
- 58. Fu M, Wang C, Li Z, Sakamaki T, Pestell R (2004) Minireview: Cyclin D1: normal and abnormal functions. Endocrinology 145: 5439–5447. doi: 10.1210/en.2004-0959
- 59. Howe L, Subbaramaiah K, Brown A, Dannenberg A (2001) Cyclooxygenase-2: a target for the prevention and treatment of breast cancer. Endocr Relat Cancer 8: 97–114. doi: 10.1677/erc.0.0080097
- 60. Wolfer A, Ramaswamy S (2011) *MYC* and metastasis. Cancer Res 71: 2034–2037. doi: 10.1158/0008-5472.can-10-3776
- 61. Tan M, Yu D (2007) Molecular mechanisms of erbB2-mediated breast cancer chemoresistance. In: Breast Cancer Chemosensitivity, Springer. pp. 119–129.
- 62. Nonet GH, Stampfer MR, Chin K, Gray JW, Collins CC, et al. (2001) The *ZNF217* gene amplified in breast cancers promotes immortalization of human mammary epithelial cells. Cancer Res 61: 1250–1254.
- 63. Hamaguchi M, Meth JL, von Klitzing C, Wei W, Esposito D, et al. (2002) *DBC2*, a candidate for a tumor suppressor gene involved in breast cancer. Proc Natl Acad Sci USA 99: 13647–13652. doi: 10.1073/pnas.212516099
- 64. Birchmeier W, Behrens J (1994) Cadherin expression in carcinomas: role in the formation of cell junctions and the prevention of invasiveness. Biochim Biophys Acta (BBA)-Reviews on Cancer 1198: 11–26. doi: 10.1016/0304-419x(94)90003-5
- 65. Vousden KH, Lane DP (2007) *P53* in health and disease. Nature Rev Cell Biol 8: 275–283. doi: 10.1038/nrm2147
- 66. Huang X, Gollin S, Raja S, Godfrey T (2002) High-resolution mapping of the 11q13 amplicon and identification of a gene, *TAOS1*, that is amplified and overexpressed in oral cancer cells. Proc Natl Acad Sci USA 99: 11369–11374. doi: 10.1073/pnas.172285799
- 67. Park SY, Gönen M, Kim HJ, Michor F, Polyak K (2010) Cellular and genetic diversity in the progression of in situ human breast carcinomas to an invasive phenotype. J Clin Invest 120: 636–644. doi: 10.1172/jci40724
- 68. Almendro V, Cheng Y, Randles A, Itzkovitz S, Marusyk A, et al. (2014) Inference of tumor evolution during chemotherapy by computational modeling and in situ analysis of genetic and phenotypic cellular diversity. Cell Rep 6: 514–527. doi: 10.1016/j.celrep.2013.12.041
- 69. Tamura K, Stecher G, Peterson D, Filipski A, Kumar S (2013) MEGA6: molecular evolutionary genetics analysis version 6.0. Molecular Biol Evol 30: 2725–2729. doi: 10.1093/molbev/mst197