SubClonal Hierarchy Inference from Somatic Mutations: Automatic Reconstruction of Cancer Evolutionary Trees from Multi-region Next Generation Sequencing

Recent improvements in next-generation sequencing of tumor samples and the ability to identify somatic mutations at low allelic fractions have opened the way for new approaches to model the evolution of individual cancers. The power and utility of these models is increased when tumor samples from multiple sites are sequenced. Temporal ordering of the samples may provide insight into the etiology of both primary and metastatic lesions and rationalizations for tumor recurrence and therapeutic failures. Additional insights may be provided by temporal ordering of evolving subclones—cellular subpopulations with unique mutational profiles. Current methods for subclone hierarchy inference tightly couple the problem of temporal ordering with that of estimating the fraction of cancer cells harboring each mutation. We present a new framework that includes a rigorous statistical hypothesis test and a collection of tools that make it possible to decouple these problems, which we believe will enable substantial progress in the field of subclone hierarchy inference. The methods presented here can be flexibly combined with methods developed by others addressing either of these problems. We provide tools to interpret hypothesis test results, which inform phylogenetic tree construction, and we introduce the first genetic algorithm designed for this purpose. The utility of our framework is systematically demonstrated in simulations. For most tested combinations of tumor purity, sequencing coverage, and tree complexity, good power (≥ 0.8) can be achieved and Type 1 error is well controlled when at least three tumor samples are available from a patient. Using data from three published multi-region tumor sequencing studies of (murine) small cell lung cancer, acute myeloid leukemia, and chronic lymphocytic leukemia, in which the authors reconstructed subclonal phylogenetic trees by manual expert curation, we show how different configurations of our tools can identify either a single tree in agreement with the authors, or a small set of trees, which include the authors’ preferred tree. Our results have implications for improved modeling of tumor evolution and the importance of multi-region tumor sequencing.


Introduction
The clonal evolution hypothesis in cancer states that cancer genomes are shaped by numerous rounds of cellular diversification, selection and clonal expansion [1,2]. Recent methods to characterize tumor clonal evolution can be divided into two broad classes-sample tree reconstruction and subclone tree reconstruction. The first class of methods models the history of clonal evolution in an individual as a phylogenetic tree with leaves being the individual's tumor samples, yielding a relative temporal ordering and estimate of divergence between the samples [3][4][5]. The second class aims at reconstructing the history of clonal evolution as a tree, which summarizes lineage relationships between cellular subpopulations [6][7][8][9].
Until single-cell sequencing data is widely available, accurate high resolution modeling of tumor evolution [10] will likely remain exceedingly difficult, if not impossible. On the other hand, the coverage depth of current next generation sequencing experiments limits the number of cellular subpopulations or subclones detectable in tumor samples to a few (approximately 5-10) that have undergone signficant clonal expansions [4,[11][12][13][14]. Each of these subclones emerge from a parental population of cells by acquiring additional somatic mutations, and cells within each subclone can be assumed to be homogeneous. Modeling of subclone evolution often involves estimating the fraction of cancer cells harboring each somatic mutation i.e., somatic mutation cellularity, which can be inferred from next generation sequencing read count data. For example, PyClone [15] employs a Markov Chain Monte Carlo method to identify groups of mutations with similar cellularities, and SciClone [16] uses variational Bayes mixture models to cluster somatic mutations by their read count frequencies, which can be a proxy for cellularities.
Most recently, methods that couple the problems of somatic mutation clustering and phylogenetic reconstruction have emerged. PhyloSub applies a tree-structured stick breaking process that introduces tree-compatible cellularity values for mutation clusters [6]. A combinatoric approach based on an approximation algorithm for binary tree paritions [7] and a mixture deconvolution algorithm [8] have also been developed. However to our knowledge, most recently published studies of multi-region tumor sequencing continue to employ manual curation to construct a subclone phylogeny, after mutation cellularity has been estimated computationally [12][13][14].
We propose that progress in methods to reconstruct subclonal phylogenies will be substantially enabled by decoupling the problems of temporal ordering of subclones from that of mutation cellularity estimation. The SubClonal Hierarchy Inference from Somatic Mutations (SCHISM) framework described here can incorporate a variety of methods to estimate the cellularity of individual mutations, the cellularity of mutation clusters, and to build phylogenetic trees. First, we derive a novel mathematical formulation of assumptions about lineage precedence and lineage divergence in tumor evolution that have been fundamental to other subclone tree reconstruction methods [6][7][8][9]. Lineage precedence is modeled in terms of a statistical hypothesis test, based on a generalized likelihood ratio. Hypothesis test results are combined with lineage divergence assumptions and formulated as a fitness function that can be used to rank tree topologies, generated by a phylogenetic algorithm. In this work, we designed an implementation of genetic algorithms to build phylogenetic trees. However, the fitness function can also be combined with other approaches to phylogenetic tree reconstruction. The hypothesis test can be combined with any method to estimate mutation or cluster cellularities to infer their temporal orderings.
We use simulations to evaluate the power of the hypothesis test and show that for many combinations of tumor purity, sequencing coverage, and phylogenetic tree complexity, the hypothesis test has good power (! 0.8) and Type 1 error is well controlled, when at least three samples from a patient are available. The simulations also confirm that the problem of subclonal phylogenetic tree reconstruction is underdetermined in many settings when the tumor sample count per individual is smaller than the number of subclones i.e, nodes in the phylogenetic tree. In these cases, we may see that the genetic algorithm identifies multiple equally plausible phylogenetic trees. However, when the problem is sufficiently determined, in general when the number of samples equals or exceeds the tumor sample count, the genetic algorithm reliably reconstructs the true tree, in most combinations listed above.
Using data from three published multi-region tumor sequencing studies of murine small cell lung cancer [13], acute myeloid leukemia [12] and chronic lymophocytic leukemia [14], we show how SCHISM can be configured with a variety of inputs. For all samples in these three studies, SCHISM identified either a single tree in agreement with the tree reconstructed manually by the authors, or a small set of trees, which include the authors' published tree.

Simulations
Generalized likelihood ratio hypothesis test. The hypothesis test yielded good power on average and Type 1 error was well controlled (Fig 1). Power improved as the number of samples per individual increased. As the number of nodes in the subclone tree increased, yielding a more complex tree, more samples were required to achieve the same level of power. Even at the lowest purity level (50%) included in our experiments, good power (! 0.8) was achieved with 1000X coverage and three or more samples.
Automated subclone tree reconstruction. The performance of the genetic algorithm (GA) used for tree reconstruction varied substantially, depending on the simulation inputs: number of tumor samples, node count and topology of the true tree, mutation cluster cellularity values, tumor purity, and sequencing coverage (Fig 2). Given sample count exceeding or equal to node count, the GA most frequently (with probability ! 0.5) identified the true tree or a pair of maximum fitness trees that included the true tree. This probability increased to ! 0.75 at high purity (0.9) and coverage (1000X). As expected, simpler trees, e.g., 3-or 4-node trees, were frequently identified even when the sample count was small. As trees grew more complex, a larger sample count was required, and even the most complex trees in the simulation, which had 8 nodes, were identified frequently when 10 samples were available. However, we also identified combinations of inputs for which the GA had limited success in finding the true tree. We decomposed the performance of the GA into two stages. In Stage 1, we assessed whether the tree reconstruction problem was sufficiently determined by our inputs, meaning that a single maximum fitness tree or a pair of two maximum fitness trees was identified. The GA was more likely to fail in Stage 1 when sample count was smaller than node count (Fig 2A1 and 2B1). Furthermore, the settings of purity and sequencing coverage used in our simulations had less of an effect on Stage 1 success than sample count and tree node count. In Stage 2, we assessed whether the single maximum fitness or pair of maximum fitness trees included the true tree. Samples counts ! 5 had the most stable Stage 2 success rates, and the correct tree was recovered with increasing frequency, given higher sample counts, coverage, and purity. As expected, probability of Stage 2 success was higher for trees with smaller node counts. Our estimates of Stage 2 success were noisy when sample count was small and node count high. This behavior was a result of higher failure rates at Stage 1 under these conditions (Fig 2A2 and 2B2). Performance of the genetic algorithm evaluated in two stages. Stage 1: Fraction of simulation runs where the genetic algorithm's fitness function identified either a single maximum fitness tree (A1) or two maximum fitness trees (B1). Stage 2: Given success in stage 1, fraction of simulation runs where the correct tree was either the single maximum fitness tree (A2) or one of the top two maximum fitness trees (B2). For each combination of coverage and purity, results are shown for trees with node counts from three to eight. Simulations where (sample count) ! (node count) are marked by a double circle.

Multi-sample sequencing studies
Recent studies of small cell lung cancer (SCLC) in mice [13], acute myeloid leukemia (AML) [12], and chronic lymphocytic leukemia (CLL) [14] attempted to infer the subclonal phylogeny underlying tumor progression, based on sequencing of multiple tumor samples. All of the studies applied computational methods to cluster somatic mutations. In some cases mutation cluster cellularities were provided, while in others read counts or cluster mean variant allele fraction was provided. The authors did not use computational methods to reconstruct the subclonal phylogenies.
We applied SCHISM to these datasets, using a variety of configurations. In cases where mutation cluster cellularities were available [13], we used the hypothesis test on pairs of mutation clusters. If mean variant allele fraction for clusters was available [12], we inferred cellularity as described in (S1 Text:Eq.S4) and used the hypothesis test on pairs of mutation clusters. When only read counts and mutation cluster assignments were available [12,14], we used our own naive estimator to derive cellularity values, and applied the hypothesis test to pairs of mutations. For all configurations, we constructed a precedence order violation matrix for all pairs of mutations (POV matrix) or all pairs of mutation clusters (CPOV matrix) and ran the genetic algorithm. This approach consistently identified phylogenies identical to those manually constructed by the authors as either the single maximum fitness tree or among a small set trees tied for the maximum fitness, in underdetermined cases. In the AML study, the authors did not construct phylogenies for three out of eight patients, but they predicted which of two general clonal evolution models best explained relapse in these three patients. For two of these patients, our phylogenies were in agreement with the authors' clonal evolution model. For the third patient, our phylogenies suggested that either of the two clonal evolution models might explain relapse. Methods details are all described in Methods (sections on Hypothesis test, Precedence Order Violation Matrix, Application to mutation clusters, Vote Aggregation and Subclonse size estimation) and S1 Text (sections on Naive mutation cellularity estimate and Cluster cellularity estimation).
Murine small cell lung carcinoma. This study sequenced small cell lung cancer (SCLC) tumor samples from a cohort of transgenic mice with lung-specific Trp53 and Rb1 compound deletion. The full study included whole exome sequencing (WES) (150X coverage) of 27 primary tumors and metastases from six individual animals [13]. The authors applied ABSO-LUTE to call copy number alterations, combine them with mutation read counts, and estimate cellularity values of each mutation. Mutation clusters were defined by clustering mutations with similar cellularity values. For three of these animals (with identifiers 3588, 3151, 984), the authors manually built subclonal phylogenetic trees.
Using the hypothesis test on the ABSOLUTE cluster mean cellularities ( Fig. 5 in [13]), we generated a cluster-level precedence order violation matrix (CPOV) and a cluster cellularity by sample matrix for each mouse.
Animals 3588 and 3151 had data available for one primary and two metatastic tumors. For animal 3588, SCHISM identified an 8-node single maximum fitness tree and that tree (Fig 3A) was identical to the authors' manually built tree. For animal 3151, SCHISM identified six 9-node maximum fitness trees, and one of these trees was identical to the authors' tree. While the problem was insufficiently determined, interestingly the trees shared a significant number of lineage relationships, and the main discrepancies among trees were the parental lineage for Clone3 and Clone2b (using notation from [13]) ( Fig 3B). For animal 984, data was available for one primary one metastatic tumor. SCHISM identified six 7-node maximum fitness trees (i.e., underdetermined problem), and one of these trees was identical to the authors' tree ( Fig 3C).
Chronic lymphocytic leukemia. This longitudinal study of subclonal evolution in B-cell CLL tracked three patients (CLL 003, CLL006, CLL077) over a period of up to seven years [14]. Reconstruction of subclonal phylogenies in murine models of SCLC. A. Animal 3588. SCHISM identified a single maximum fitness 8-node tree using one primary and two metastatic tumors. B. Animal 3151. Six maximum fitness 9-node trees were identified using one primary and two metastatic tumors. C. Animal 984. Six maximum fitness 7-node trees were identified using one primary and one metastatic tumor. Solid arrows represent lineage relationships shared by all six trees and dashed arrows represent lineage relationships shared by only a subset of the trees. Each arrow is labeled with the fraction of maximum fitness trees that include the lineage relationship. Highlighted arrows indicate the tree manually constructed by the study authors.
For each patient, five longitudinal peripheral blood samples were collected, and each sample was whole-genome sequenced. For selected somatically mutated sites, they further applied targeted deep sequencing (at reported 100,000X coverage). K-means clustering and expert curation were used to infer mutation clusters and subclonal phylogenetic trees were manually built.
We used mutation cluster assignments and read counts from targeted deep sequencing, for each mutation in each sample (Tables S6, S7, or S8 in [14]). Purity was estimated by identifying the mutation cluster with the maximum mean variant allele frequency in each sample. Next, based on purity and read count, we calculated cellularity (and standard error) for each mutation in diploid or copy number = 1 regions with the naive estimator. The hypothesis test was performed for each pair of mutations and a cluster precedence order violation (CPOV) matrix was constructed, using a vote aggregation scheme (Methods).
For patients CLL003 and CLL077, SCHISM identified a single 4-node maximum fitness tree that was identical to the authors' manually built tree (Fig 4A and 4C). For patient CLL006, two 5-node maximum fitness trees were identified, and one was identical to the authors' tree ( Fig 4B).
Acute myeloid leukemia. This study of relapse in acute myeloid leukemia (AML) consisted of whole genome sequencing for primary and relapse samples from eight patients (AML1, AML15, AML27, AML28, AML31, AML35, AML40, AML43) [12]. For AML1, the authors identified mutation clusters with MClust [17] and manually constructed a subclonal phylogeny. For the other patients, mutation cluster means were inferred using kernel density estimation. Phylogenetic trees were constructed as in AML1 for four of the patients (AML40, AML27, AML35, AML43).
The authors described two distinct models of clonal evolution to explain relapse. In the first model, the dominant subclone present in the primary leukemia is not eliminated by therapy, but it acquires new mutations and thrives in the relapse. The patient may not have received a sufficiently aggressive treatment or may have harbored resistance mutations. In the second model, the dominant subclone is eliminated by therapy and a minor subclone in the primary acquires new mutations and thrives in the relapse, while some mutations in the primary are absent in the relapse. The mutations that allow the minor subclone to survive may have been present early on or have been acquired during or after chemotherapy, or both [12].
For patients where the authors had constructed a tree, we compared it to the best tree(s) identified by SCHISM. For the other patients, we considered whether the SCHISM trees were consistent with their suggested clonal evolution models.
For AML1, we used the published variant allele fractions and mutation cluster assignments ( Table S5a in [12]). In each sample, the naive estimator was used to derive cellularity values (S1 Text) for the subset of mutations which were located in diploid or hemizygous region and had coverage of ! 50. Hypothesis tests were performed for pairs of mutations and the CPOV matrix was constructed by voting (Methods). SCHISM identified a single maximum fitness tree, which was identical to the authors' manually generated tree ( Fig 5A). For the remaining seven patients, we used the published cluster mean variant allele frequencies (Table S10 in [12]) and combined them with the authors' purity estimates to infer cluster mean cellularities (S1 Text:Eq. S4).
Patients AML27, AML35, and AML40 were reported to harbor only two mutation clusters each, and SCHISM identified a single maximum fitness tree for each, which was identical to the authors' tree ( Fig 5B). Patient AML15 was reported to harbor three mutation clusters, and GL = germline state. Cluster precedence order violation (CPOV) matrices are shown to the left of each tree. Columns and rows represent subclones (or Clones in the terminology of [13]). Each red square represents a pair of subclones (I,J) for which the null hypothesis that I could be the parent of J was rejected. Each blue square represents a pair for which the null hypothesis could not be rejected.  CLL006. Two maximum fitness 5-node trees were identified using 5 samples. Solid arrows represent lineage relationships shared by both trees and dashed arrows represent lineage relationships specific to one of the trees. C. CLL077. A single maximum fitness 4-node tree was identified using 5 samples. Each arrow is labeled with the fraction of maximum fitness trees that include the lineage relationship. Highlighted arrows indicate the tree manually constructed by the study authors. GL = germline state. Cluster precedence order violation (CPOV) matrices are shown to the left of each tree. Columns and rows represent SCHISM identified two maximum fitness trees. Each tree supported one of the authors' two alternative models of AML relapse ( Fig 5C). In one tree, the relapse-specific mutation cluster 3 descended from the dominant subclone in the primary. This subclone consisted of the 92% of cells in the primary that carried both cluster 1 and cluster 2 mutations. In the other tree, it descended from the minor subclone in the primary (the 8% of cells in the primary carrying only cluster 1 mutations). Patient AML28 had five mutation clusters, and SCHISM identified a single maximum fitness tree. While no subclone tree for AML28 was constructed by the authors, they proposed that this patient fit the second clonal evolution model of relapse driven by a minor subclone in the primary. The SCHISM tree was consistent with this model, as the relapse-specific mutation cluster 5 descended from a minor subclone (1% of cells), which harbored only mutation cluster 1, and mutation clusters 3 and 4, which were present in the primary were absent in the relapse cluster ( Fig 5D). Patient AML31 had four mutation clusters, and SCHISM identified a single maximum fitness tree ( Fig 5E). Although no tree was provided by the authors, they proposed that this patient fit the second clonal evolution model, which was consistent with this tree. In the tree, the relapse-specific mutation cluster 4 descended from a minor subclone (21% of cells) in the primary. Cells in this subclone carried mutations in cluster 1, but not clusters 2 and 3. AML43 was reported to have four mutation clusters and SCHISM identified a single maximum fitness tree that was identical to the authors' tree ( Fig 5F).
SCHISM runtime. On a MacBook Pro labptop with Intel Core i7, 4 GB of memory and 2.7 GHz CPU, a GA run with 20 generations modeling three samples and nine mutation clusters completed in 2 minutes.

Discussion
Representing tumor evolution as a phylogenetic tree of cell subpopulations can inform critical questions regarding the temporal order of mutations driving tumor progression and the mechanisms of recurrence and metastasis. As the cost of next generation sequencing with high coverage depth decreases, many labs are employing multi-region tumor sequencing strategies to study tumor evolution. However, going from multi-region sequencing data to a subclonal phylogeny is a computationally challenging task and methods are still in their early days. Here we derived a novel framework to approach the problem. We described a statistical hypothesis test and mathematical representation of constraints on subclone phylogenies, based on rules of lineage precedence and divergence that have informed previous works in the field. We designed a new fitness function that can be used to constrain the process of subclone tree reconstruction. These tools comprise a flexible framework called SCHISM, which can be integrated with many existing methods for mutation cellularity estimation and phylogenetic reconstruction. Combined with a new implementation of genetic algorithms, we demonstrated the utility of SCHISM with simulations and by application to published multi-region sequencing studies. We were able to reconstruct the subclonal phylogenies derived by manual curation in these studies with high fidelity.
Today's multi-region sequencing studies may often have a limited number of tumor samples, due to restrictions on the number of biopsies likely to be performed for living patients. Our results suggest that even when only a few samples are available, more accurate estimates of mutation cellularity at higher purity and coverage increase the power of the SCHISM hypothesis test. A more subtle result is that the power and Type 1 error of the test also depend on the accuracy of the standard error estimates for cellularity values. The dependency can be seen mutation clusters. Each square represents a pair of mutation clusters (I,J) and the numeric value in the square shows the fraction of mutation pairs (i,j) for which the null hypothesis was rejected (Section Vote Aggregation). The mutated genes assigned to each cluster in [6,14]   For each patient, two samples were available from primary and relapse cancers. Purple nodes represent mutation clusters present in both primary and relapse samples. Gray nodes are present in primary but not relapse and green nodes are present in the relapse but not primary. A. AML1. SCHISM identified a single maximum fitness 5-node tree. CPOV matrix columns and rows represent mutation clusters. Each square represents a pair of mutation clusters (I,J) and the numeric value in the square shows the fraction of mutation pairs (i,j) for which the null hypothesis was rejected (Section Vote Aggregation). B. AML27,35,40. For each patient, SCHISM identified a single maximum fitness 2-node tree. C. directly in the derivation of the test statistic itself (Methods:Eq 13) and indirectly in the ability of SCHISM to reconstruct complex subclone phylogenies in murine models of SCLC [13]. Although only two or three samples were sequenced from each mouse, the authors provided robust statistical estimates of mutation cluster cellularity and standard deviations.
To our knowledge, our study is the first to apply genetic algorithms to the problem of subclone tree reconstruction. Current sequencing technologies limit discovery to approximately 5-10 major subclones in patient tumor samples, equivalent to phylogenetic trees with 5-10 nodes. It is likely that in the near future, improved technology will enable discovery of a larger number of subclones. The number of topologies for a tree with n nodes is equal to n n−2 [18]. Therefore it becomes increasingly difficult to use exhaustive enumeration over all topologies when n = 9 (approximately 4.8 million topologies), n = 10 (approximately 100 million topologies), and certainly for n > 10. The genetic algorithm presented here enables heuristic searches over very large numbers of topologies and consequent evaluation of candidate phylogenetic trees, according to the extent to which they violated the rules of lineage precedence and divergence. However, the genetic algorithm will not always succeed when n is very large, and its success depends on the topology of the true tree and the distribution of mutation cluster cellularities (S1 Text). Alternative heuristic approaches might also prove useful in this setting such as tabu search [19,20], simulating annealing [21,22], or iterated local search [23].
Currently, the topology cost component of the fitness function is informed by a statistical hypothesis test that addresses a dichotomous question about the ancestral relationship of two nodes. The mass cost is a numeric measure that quantifies the extent to which the lineage divergence rule is violated, and tree fitness depends on this numeric value, rather than on acceptance or rejection of a null hypothesis. We took into consideration the power loss that would result from a statistical test based on mass cost. Such a test would depend on the estimated cellularities of a parental cluster and the sum of cellularities of its child clusters (Methods:Eq 31). A possible null hypothesis could be that the difference between parental and sum of child cluster cellularities is non-negative (no violation of lineage divergence rule), and where the absolute value of the difference measures the magnitude of the violation. This test would be underpowered compared to the topology cost test, because the expected confidence interval for the sum of mutation cluster cellularity values (Eq 2) will be larger than that for a single mutation cluster. Thus, we chose to focus our hypothesis testing framework on the topology cost. The value of our current strategy to include a numeric measure of mass cost in the fitness function is explored in (S1 Text and S1 Fig).
The fitness function used in this work could be further improved by incorporating measures of mutation or mutation cluster importance, using knowledge about ordering of specific driver mutations based on tumor biology, synthetic lethality, or results from single-cell sequencing. The genetic algorithm used in this work could itself be improved by the addition of online termination criteria and adaptive modulation of its key parameters, such as crossover and mutation probabilities.
A number of excellent methods to reconstruct subclonal phylogenies have been recently published [6][7][8][9], and we believe all of them are likely to be useful to the cancer research AML15. Two maximum fitness 3-node trees were identified. Solid arrows represent lineage relationships shared by both trees and dashed arrows represent lineage relationships specific to one of the trees. D. AML28. A single maximum fitness 5-node tree. E. AML31. A single maximum fitness 4-node tree was identified. F. AML43. A single maximum fitness 4-node tree. Each arrow is labeled with the fraction of maximum fitness trees that include the lineage relationship. Highlighted arrows indicate the tree manually constructed by the study authors, if it was available. GL = germline state. CL = cluster. Cluster precedence order violation (CPOV) matrices are shown to the left of each tree. Columns and rows represent mutation clusters. Each red square represents a pair of mutation clusters (I,J) for which the null hypothesis that I could be the parent of J was rejected. Each blue square represents a pair for which the null hypothesis could not be rejected.
doi:10.1371/journal.pcbi.1004416.g005 community. A feature matrix comparing attributes of SCHISM to four other published methods is in S1 Table. Key contributions of SCHISM are that its genetic algorithm can handle more complex tree topologies than are tractable by brute force while retaining lack of in-built bias towards linear or branched tree topologies, its modularity, and its capability to integrate information from multiple samples (biopsies from a patient) in a new statistical framework.
Finally, it is clear that under many circumstances, particularly when sample count is low and tree complexity is high, the problem of subclone tree reconstruction is underdetermined. It is likely that for at least some tumor types, the true subclone trees may be very complex. In the future, sequencing studies with a large number of samples per patient will be essential to accurately characterize these trees.

Framework overview
The hypothesis test and genetic algorithm used in this work are components in a general framework that decomposes the problems of mutation cellularity estimation, mutation clustering and subclone tree reconstruction (Fig 6). Given aligned reads from whole-genome, wholeexome or targeted deep next generation sequencing, any method for mutation cellularity estimation and/or clustering can be combined with the hypothesis test described in this section. If cellularities are estimated for a cluster of mutations, the test can be applied directly to temporal ordering of clusters. If cellularities are estimated for specific mutations, the test can be applied to infer temporal ordering of mutation pairs. Given assignments of mutations to clusters, a voting aggregation scheme can be used to order the clusters themselves. The precedence order violation matrix and cluster precedence order violation matrix summarize the output of the hypothesis tests. They can be used to visualize the statistical support for potential temporal orderings (as in Figs 3, 4 and 5). Finally, a fitness function that depends on constraints for possible values of cluster cellularities (mass cost) and the results of the hypothesis test (topology cost) can be used to rank possible topologies of subclone phylogenetic trees. The fitness function is independent of the genetic algorithm search strategy proposed in this work.

Modeling assumptions
According to the infinite sites assumption [6][7][8], each somatic mutation in a tumor arises only once throughout the history of the disease, and once a mutation occurs in a cell it is inherited by all descendants of the cell. It follows that given multiple tumor samples from an individual, a mutation may be present in varying proportions of the tumor cells in each sample, referred to as varying cellularity of the mutation across samples.
SCHISM constructs a rooted phylogenetic tree to represent the history of tumor clonal evolution in an individual. Each tree node represents cells harboring a unique compartment of mutations, defining a subclone. Each edge represents a set of mutations, acquired by the cells in the child node and differentiating them from the cells in its parental node. The somatic mutations of each tumor cell then uniquely map it to one of the nodes in the tree.
From the infinite sites assumption, we can infer that each mutation is uniquely assigned to an edge and the cells represented by a node harbor all mutations present in their parental node. Furthermore, a mutation present at a node cannot have cellularity greater than the mutations at its parental node, defining a lineage precedence rule. Also, the sum of mutation cellularities occuring in child nodes cannot exceed the mutation cellularity of their parent, because these mutations occur in mutually exclusive cellular populations, defining a lineage divergence rule.
Many methods have been proposed to estimate mutation cellularities, and our framework can be used with any of these methods. In our reconstruction of subclone phylogenies from Overview of SCHISM framework. The framework decouples estimation of somatic mutation cellularities and reconstruction of subclone phylogenies. Given somatic mutation read counts from next generation sequencing data and somatic copy number calls if available, any tools for mutation cellularity estimation and mutation clustering can be applied. Their output is used to estimate the statistical support for temporal ordering of mutation or mutation cluster pairs, using a generalized likelihood ratio test (GLRT). Other approaches to tree reconstruction can be applied, by using the fitness function as the objective for optimization. GA = genetic algorithm, WGS = whole genome sequencing, WES = whole exome sequencing, DS = (targeted) deep sequencing. KDE = kernel density estimation. POV = precedence order violation. three multi-sample sequencing studies (Results), cellularities were derived using ABSOLUTE and our own naive estimator (S1 Text). Other methods such as PyClone or SciClone could also be applied.

Hypothesis test
Generalized likelihood ratio test. The lineage precedence rule for a pair of mutations i and j, where i precedes j in the same lineage implies that for s in {1, . . ., S}, where C s i and C s j represent the cellularity of mutations i and j in sample s and S denotes the total number of samples from an individual. Then, for each ordered pair of mutations i and j, the null hypothesis (H i!j 0 ) to be tested is whether mutation i can be an ancestor of mutation j, and the alternative hypothesis (H i!j A ) is that it is not possible for i to be ancestral to j. Let the estimated cellularity for mutation i in sample s be (Ĉ s i ). It can be represented as a draw from a normal distribution centered at the true cellularity of mutation i in sample s (C s i ) and with standard deviation Assuming independence between the cellularity estimates for mutations i and j where d s ij represents the true difference in cellularity of mutations i, and j in sample s, and s s ij is the standard deviation ofd s ij , the observed difference in cellularity of mutations i and j in sample s. Under H i!j 0 , the cellularity of mutation i should exceed or be equal to that of mutation j in all tumor samples, based on the lineage precedence rule. Thus, The numerator represents the maximum likelihood for observationsd s ij when d s ij in ω 0 , the parameter space corresponding to H i!j 0 (Eq 6).
which simplifies to Therefore, we reject the null hypothesis that mutation i could be an ancestor of mutation j (H i!j 0 ) if the test statistic T is significantly large.
where I is a binary indicator variable. Eq 13 explicitly shows how T depends on the accuracy of cellularity values and their standard deviation. When standard deviation is overestimated, T is underestimated, yielding power loss. When standard deviation is underestimated, T is overestimated, yielding Type 1 error inflation.
Since the standard deviation is unknown, standard error was used to calculate the test statistic.
Significance evaluation. To assess the significance of an observed value of the GLRT test statistic (T) (Eq 13), we consider the distribution of T under H i!j 0 and derive the Type 1 error probability of the test. Similar results have previously been derived for a more general class of GLRTs [24].
Next, let δ V be a binary random variable representing the event when, for a particular subset V of {1, . . ., S}, we have z s ij < 0 for s 2 V and z s ij ! 0 for s = 2 V. By the law of total probability, or equivalently, But the reduced null hypothesis (Eq 15) states that all d s ij in Eq 4 are set to zero, sô or equivalently, Also, for a set of independent identically distributed random draws from N(0, 1), the value of the summation in Eq 18 is independent of the signs of fz 1 ij ; :::; z S ij g, and is a χ 2 random variable with jVj degrees of freedom.
Eq 20 implies that each random variable z s ij assumes positive and negative signs with equal probability. Thus, the probability of observing a particular sequence of signs for random variables z 1 ij ; :::; z S ij , i.e. the probability of each δ V being true, is 1 2 À Á S . X V &f1;:::;Sg Finally, summarizing the sum above over all possible values jVj can take, X V &f1;:::;Sg Thus it can be concluded that the distribution of GLRT test statistic T under the reduced null hypothesis is that of a random variable drawn from a mixture of χ 2 distributions, with degrees of freedom varying in k 2 {0, . . ., S}, and the weight of each mixture component equal to S k ð Þ 2 S . Here, a w 2 0 random variable is defined as one that is fixed at zero. We use this derivation to assign a conservative estimate of significance level to an observed value of the test statistic T under (H i!j 0 ). Precedence order violation matrix. For each possible ordered pair of mutations (i, j) characterized in a set of tumor samples from the same individual, we test the hypothesis that mutation i is a potential ancestor of mutation j. A fixed common significance level of α = 0.05 is assigned to decide the outcome of each pairwise test. These results can then be organized as a binary Precedence Order Violation (POV) matrix, where non-zero entries mark mutation pairs (i, j) for which the null hypothesis H i!j 0 was rejected. Application to mutation clusters. Given a mapping of individual mutations onto clusters, the hypothesis test can be applied to pairs of clusters rather than to pairs of mutations, and in this case is used to generate a straightforward extension of the POV matrix Cluster Precedence Order Violation (CPOV matrix) where non-zero entries mark mutation cluster pairs (I, J) for which the null hypothesis H I!J 0 was rejected. An alternate approach for generating a CPOV matrix by vote aggregation is described next.

Vote aggregation
The Cluster Precedence Order Violation (CPOV) matrix can be generated by the following vote aggregation approach. Let the set of mutations assigned to cluster I be M(I). Rows and columns of the POV matrix can be reordered so that mutations belonging to the same cluster are adjacent. Then the ordered interaction of any pair of clusters (I, J) is represented by a block of matrix entries with addresses The support for potential lineage precedence of cluster I to cluster J can be summarized by a vote of the matrix elements within the block, represented as an element (I, J) in a cluster-level POV matrix CPOV where jM(X)j denotes the number of mutations in cluster X.

Genetic algorithm
A genetic algorithm (GA) is a heuristic search inspired by the process of natural selection. In an initial generation, a set of random objects is created and their fitness with respect to a fitness criteria is evaluated. Next, objects from the initial generation are selected according to their fitness to be parents of the following generation, with a preference for the fittest parents. The parental objects reproduce themselves, and their progeny may harbor new variation. The process is repeated for either a fixed number of generations or until a pre-defined convergence criteria is reached. In our implementation, the GA searches through a space of phylogenetic tree topologies, ranking them with a fitness function that we derived based on our model assumptions. In the initial generation, we generate G 0 = 1000 random tree topologies and evaluate the fitness of each tree. A sample of size 0.8 Ã G 0 trees are selected for reproduction by a fitness proportional selection method [25] and their progeny are generated, using crossover and mutation operations (Figs 7 and 8). To increase diversity and avoid too fast convergence to a local optimum, 0.2 Ã G 0 random tree topologies are also generated. The following generation then consists of a mixture of the progeny of the previous generation(s) and new random trees, and the total number of trees is the same as in the previous generation, so that G 1 = G 0 . The process is repeated for a fixed number γ = 20 generations. For each generation, the trees selected to be parents are not limited to the previous generation only, but can be selected from any preceding generations. To avoid getting trapped in local optima, four independent runs of the GA are performed, each with 20 generations (1000 trees per generation), and the entire ensemble of trees sampled in the four runs is ranked by tree fitness.
In this work, the number of mutation clusters and the cellularity of each mutation cluster in each sample is assumed to be known, and the GA is applied to explore the space of tree topologies with a given node count, including both linear and branched topologies.
Random topology generation. To generate a random tree topology, a mutation cluster is randomly selected and assigned to the incident edge downstream of the root node. An incident node downstream of the edge is appended. Next, one of the remaining mutation clusters and a non-root node are randomly selected and the cluster is assigned to the incident edge downstream of this node, again appending a new incident node downstream of the edge. The process continues until all mutation clusters in the data have been assigned.
Mutation and crossover operations. The tree topologies selected for sexual reproduction are randomly paired, and each pair undergoes a crossover operation with probability P c to yield two progeny intermediate trees (Fig 7). Next, a mutation operation is applied to each progeny intermediate tree with probability P m . There are two possible mutation operations (Fig 8), and for each tree one is selected with probability P s . The result is a collection of new progeny trees that will appear in the next generation. (Default P c = 0.25 and P m = 0.9 and P s = 0.6.
Fitness function. Tree fitness is evaluated as where TC(T) is a topology cost that summarizes violations of the lineage precedence rule, and MC(T) is a mass cost that summarizes violations of the lineage divergence rule. The two components of the fitness function are useful because they represent biological properties of tumor evolution and practically, their combination can identify the true tree in cases where the topology cost or mass cost alone is insufficient (S1 and S2 Figs).
The topology cost (TC) of tree T is where C s pðnÞ is the cellularity of the mutation cluster associated with the upstream edge incident to node n, i.e., p(n) in sample s, and P q2DðnÞ C s q is the sum of cellularities of mutation clusters associated with its set of immediate descendant edges D(n). The fitness F of the tree T is then a monotonically decreasing function of the tree cost Z(T).
where f c is a positive-valued scaling coefficient (default f c = 5), yielding fitness reduction by a factor of * 150X for each unit increase in total cost.

Simulations
The simulations were designed to generate data compatible with a set of likely tree topologies and assess how well SCHISM could recover these topologies from the data. Given a tree topology, a simulation produces a set of tumor samples consistent with lineage relationships summarized in the tree. We assume that while the samples share these lineage relationships, each represents an independent instantiation of cellularity distributions over the edges of the tree. The variability among these simulated samples captures the stochastic process of preferential sampling of tumor cells in an individual's multiple tumor samples. In each simulated sample, we model variant and reference read counts for mutations belonging to each edge in the tree, taking into account sequencing coverage depth, sample purity level and mutation cluster cellularity. Generating subclonal phylogenies. Simulated trees range in size from three to eight nodes, with no restrictions on the number of child nodes. For trees with three to five nodes, an exhaustive set of topologies is generated. Otherwise, ten topologies are selected (S3 Fig). Each unique topology at a given node count is considered an instance. A Poisson process with rate parameter λ = 10 is used to simulate the number of mutations that occurred along each each edge in the tree. Node count does not include the root node, which represents the germline state prior to any somatic mutations.
A detailed description of how mutation cellularities and mutation variant allele fractions are generated in the simulations is presented in S1 Text.

Subclone size estimation
Based on our modeling assumptions, it is straightforward to conclude that in each sample s, the fraction of tumor cells belonging to the subclone described by node n in a tree can be calculated as where C s pðnÞ is the cellularity of the mutation cluster associated with the upstream edge incident to node n, i.e., p(n) in sample s, and P q2DðnÞ C s q is the sum of cellularities of mutation clusters associated with its set of immediate descendant edges D(n).

Assessment of hypothesis test
Each element of the precedence order violation matrix POV[I, J] is a binary indicator of whether the null hypothesis that mutation i can be ancestral to mutation j is rejected. Each element in the POV matrix is compared with its true value, given the correct tree topology. Performance is summarized by power and Type 1 error.
Number of maximum fitness trees identified by the genetic algorithm. In underdetermined cases, ranking of trees generated by the GA for a replicate may result in ties. We conservatively define a Stage 1 success for a replicate as an outcome where only one or two maximum fitness trees have been identified. To estimate the probability of Stage 1 success, the frequency of success of all tree instances and their replicates for each of 240 (2x2x6x10) unique settings of the key variables is computed. In practice, multiple maximum fitness trees can be a useful result, but limiting the number of ties in this way makes the assessment of the simulations more tractable. Note that Stage 1 success is only an indicator that the phylogeny reconstruction data is sufficiently determined to identify a good solution.
Agreement of maximum fitness tree(s) with the true tree. Even in the absence of ties, the maximum fitness tree discovered by the GA may not be the true tree that was used as the basis for the simulation. For this assessment, Stage 2 success for a replicate is an outcome where either a single maximum fitness or top two maximum fitness trees are the true tree. To assess Stage 2 success, we eliminate replicates where Stage 1 failed and of those remaining, calculate the fraction in which the true tree is either the maximum fitness or one of two maximally fit trees.
Software availability. SCHISM software is available for download at http://karchinlab. org/apps/appSchism.html Supporting Information S1 Text. Supplementary Methods and Results. (PDF) S1 Fig. Limitation of the topology cost in identifying the true tree. Our fitness function considers both a topology cost and a mass cost. If only topology cost is used, in the case of a completely branched tree (A) and a CPOV matrix with power of 1.0 and Type 1 error of 0, the mass cost is not necessary to narrow down candidate tree topologies. However, if the tree contains linear topologies (B), the topology cost is not sufficient to identify the true tree. A and B.  14) in seven simulated biopsies S1-7 C. CPOV matrix depicting the hypothesis test outcomes. Each red square represents a pair of mutation clusters (I,J) for which the null hypothesis that I could be the parent of J was rejected. Each blue square represents a pair for which the null hypothesis could not be rejected. D. The Jaccard index (S1 Text, Topology Similarity Measure) of the true tree (A) and the maximum fitness tree(s) in population at the end of each GA generation. In cases where more than a single highest maximum fitness tree was present, the maximum of the Jaccard indices is plotted. Each color trace represent one of ten independent GA runs performed on inputs in (B, C). E. The Jaccard index of the true tree and the maximum fitness trees at the end of each generation for a sample GA run. Each transparent circle represents a single maximum fitness tree. This GA run identified 9 maximum fitness tree topologies, one of which was the true tree (A).  14) in seven simulated biopsies S1-7 C. CPOV matrix depicting the hypothesis test outcomes. Each red square represents a pair of mutation clusters (I,J) for which the null hypothesis that I could be the parent of J was rejected. Each blue square represents a pair for which the null hypothesis could not be rejected. D. The Jaccard index (S1 Text, Topology Similarity Measure) of the true tree (A) and the maximum fitness tree(s) in population at the end of each GA generation. In cases where more than a single maximum fitness tree was present, the maximum of the Jaccard indices is plotted. Each color trace represent one of ten independent GA runs performed on inputs in (B, C). Only one of the ten GA runs (dark purple trace) identified the true tree (indicated by reaching Jaccard Index = 1). E. The Jaccard index of the true tree and the maximum fitness trees at the end of each generation for the successful run. Each transparent circle represents a single maximum fitness tree. This GA run also found 16 other phylogenetic trees sharing the maximum fitness with the true tree. . Subclone Seeker and TrAp have similar inputs to SCHISM but the modeling task and outputs are different. A SCHISM models a single, unified tree across multiple samples. B Subclone Seeker and C TrAp model trees for each individual sample. The AML1 patient has one primary and one relapse sample [12]. SubcloneSeeker reports six trees for the primary sample and one tree for the relapse (secondary) sample. Primary Tree 11 is reported as compatible with Secondary Tree 1 (marked with asterisks). TrAp reports a top-scoring pair of trees-one tree (left) for the primary and one tree (right) for the relapse sample. (TIF) S1 Table. Feature matrix comparing subclonal phylogeny reconstruction methods. (PDF) S2 Table. Detailed performance of cost function for multi-sample sequencing studies. (PDF)