Skip to main content
Advertisement
  • Loading metrics

Orchard: Building large cancer phylogenies using stochastic combinatorial search

  • Ethan Kulman,

    Roles Conceptualization, Data curation, Formal analysis, Software, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Computational and Systems Biology Program, Memorial Sloan Kettering Cancer Center, New York, New York, United States of America, Department of Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota, United States of America

  • Rui Kuang,

    Roles Formal analysis, Investigation, Supervision, Writing – review & editing

    Affiliation Department of Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota, United States of America

  • Quaid Morris

    Roles Conceptualization, Investigation, Resources, Supervision, Writing – original draft, Writing – review & editing

    MorrisQ@mskcc.org

    Affiliation Computational and Systems Biology Program, Memorial Sloan Kettering Cancer Center, New York, New York, United States of America

Abstract

Phylogenies depicting the evolutionary history of genetically heterogeneous subpopulations of cells from the same cancer, i.e., cancer phylogenies, offer valuable insights about cancer development and guide treatment strategies. Many methods exist that reconstruct cancer phylogenies using point mutations detected with bulk DNA sequencing. However, these methods become inaccurate when reconstructing phylogenies with more than 30 mutations, or, in some cases, fail to recover a phylogeny altogether. Here, we introduce Orchard, a cancer phylogeny reconstruction algorithm that is fast and accurate using up to 1000 mutations. Orchard samples without replacement from a factorized approximation of the posterior distribution over phylogenies, a novel result derived in this paper. Each factor in this approximate posterior corresponds to a conditional distribution for adding a new mutation to a partially built phylogeny. Orchard optimizes each factor sequentially, generating a sequence of incrementally larger phylogenies that ultimately culminate in a complete tree containing all mutations. Our evaluations demonstrate that Orchard outperforms state-of-the-art cancer phylogeny reconstruction methods in reconstructing more plausible phylogenies across 90 simulated cancers and 14 B-progenitor acute lymphoblastic leukemias (B-ALLs). Remarkably, Orchard accurately reconstructs cancer phylogenies using up to 1,000 mutations. Additionally, we demonstrate that the large and accurate phylogenies reconstructed by Orchard are useful for identifying patterns of somatic mutations and genetic variations among distinct cancer cell subpopulations.

Author summary

Somatic mutations accumulate during cancer progression, leading to genetically distinct subpopulations of cancer cells. Understanding the evolutionary relationships among these subpopulations reveals critical events and offers clinical insights for treatment. We developed a novel machine learning algorithm, Orchard, which uses point mutations detected via bulk DNA sequencing to reconstruct phylogenies depicting the genetic evolution of an individual cancer. We evaluated Orchard on 90 simulated cancers and 14 B-progenitor acute lymphoblastic leukemias. Our results demonstrate that Orchard is fast and accurate at reconstructing phylogenies with up to 1,000 mutations detected in 100 cancer samples. Orchard’s ability to reconstruct large phylogenies motivated a novel approach: applying agglomerative clustering to these trees to identify clusters of mutations that co-occur in cells. Our results show that this new clustering approach accurately identifies mutation clusters in real data, often outperforming state-of-the-art methods.

1 Introduction

Cancerous cells contain somatic mutations that override normal cellular controls [1, 2]. Evolution progresses as cancer cells acquire additional mutations that provide selection advantages [3], leading to genetically distinct cancerous cell subpopulations (i.e., subclones) characterized by sets of shared mutations [4]. Phylogenies describing the ancestral relationships of subclones in an individual cancer can reveal important evolutionary events, help to characterize the cancer’s heterogeneity [5] and progression [6, 7], support the discovery of new driver mutations [8], and provide clinical insight for treatment regimens [9]. Many cancer evolution studies use bulk DNA from multiple cancer samples to reconstruct phylogenies depicting the evolution of subclones in an individual cancer [1016]. These phylogenies are also prerequisites for inferring migration histories of metastases [17, 18].

A mutation tree is an arborescence that can represent portions of a cancer phylogeny that satisfy the infinite sites assumption (ISA). The ISA posits that each mutation is acquired only once and is never reverted. A phylogeny that adheres to the ISA is a perfect phylogeny; cancer phylogenies are often perfect because of low mutation rates (∼1–10 mutations / megabase) [19]. By definition, a mutation tree is a perfect phylogeny, and each non-root node in a mutation tree represents a somatic mutation present in one or more cancerous cell subpopulations. Bulk DNA data obtained by sequencing multiple tissue samples from the same cancer can be used to reconstruct a mutation tree by solving the mixed sample perfect phylogeny (MSPP) problem.

In the MSPP problem, each cancer sample is an unknown mixture of clonal genotypes, i.e, clones, that are related to one another via a perfect phylogeny. The input to the MSPP problem is a matrix of the frequencies of each mutation in each sample; these frequencies can be estimated from the variant allele frequencies (VAFs) measured by short read sequencing of the bulk samples. Although perfect phylogeny reconstruction has a linear time algorithm [20], MSPP reconstruction is NP-complete even if the mutation frequency estimates are noise-free [21]. There are some special cases where the noise-free MSPP problem can be solved in quadratic time [2224].

Until recently, no MSPP reconstruction algorithm could reliably build mutation trees with more than 10 mutations [25]. State-of-the-art MSPP reconstruction algorithms such as Pairtree [25] and CALDER [26] are capable of accurately reconstructing 30-node mutation trees, and in some rare cases, 100-node mutation trees. However, most MSPP problems have more than 30 mutations, and this necessitates that mutations are pre-clustered based on their VAFs prior to MSPP reconstruction [2729]. When mutations are clustered together, the reconstructed tree is known as a clone tree, and many MSPP algorithms treat each pre-defined cluster of mutations as a single mutation during reconstruction. Clustering mutations is often referred to as subclonal reconstruction, because these mutation sets uniquely define subclones which are subpopulations of cancer cells that correspond to clades in the phylogeny. Unfortunately, using mutation VAFs to define subclones is an error-prone process that can lead to incorrectly estimating the number of subclones [30, 31], and incorrectly grouping mutations together that are in unrelated clones [32]. Reconstructing a clone tree using poorly defined subclones can yield inaccurate trees and lead to incorrect conclusions about the evolution of a cancer [3133]. Pre-defining subclones before tree reconstruction can also result in the loss of critical information such as the partial order of mutation acquisition, which is crucial for analyzing somatic mutation patterns and determining prognosis and treatment strategies for certain hematological cancers [6, 7, 3436].

Here, we introduce a new MSPP reconstruction algorithm, Orchard, which reliably reconstructs extremely large mutation trees, alleviating the need to pre-cluster mutations. Orchard accurately solves MSPP problems with as many as 1000 mutations and 100 samples. We benchmark Orchard’s reconstructions against Pairtree and CALDER on 14 B-progenitor acute lymphoblastic leukemias (B-ALLs) and 90 simulated cancers. These results demonstrate that: (1) Orchard matches or outperforms Pairtree and CALDER on reconstruction problems with 30 or fewer mutations while being 5–10x faster, and (2) Orchard greatly outperforms both competing methods on reconstruction problems with more than 30 mutations. Orchard achieves greater accuracy and scalability for large reconstruction problems by sampling from a factorized approximation of the mutation tree posterior, a novel result derived in this paper. Each factor in this approximate posterior represents a conditional distribution for adding a new mutation to a partially built tree. Consequently, Orchard samples from this approximate posterior by generating sequences of progressively larger trees, culminating in a complete tree containing all mutations. Orchard uses Gumbel Soft-Max Tricks [3739] to sample trees for each factor, enabling it to sample complete mutation trees without replacement. This approach differs from existing methods, such as Pairtree [25], which uses Markov Chain Monte Carlo (MCMC) to explore the space of complete mutation trees, or CALDER [26], which formulates the MSPP problem as a mixed integer linear program (MILP). These existing algorithmic approaches have critical pitfalls: MCMC-based methods often converge slowly and are prone to becoming stuck in local maxima because samples are highly auto-correlated, while MILP-based methods can be extremely slow, may fail to recover a solution, or may only be able to produce a single solution in a reasonable amount of time. Orchard avoids these pitfalls because of two crucial properties of the algorithm: it samples without replacement, which mitigates the risk of becoming stuck in local maxima, and its search strategy retains and grows partial tree structures that adhere closely to the ISA, effectively pruning large portions of the search space.

Orchard’s ability to accurately reconstruct large mutation trees enables a novel strategy to flexibly define subclones. To illustrate this new strategy, we introduce a simple phylogeny-aware clustering algorithm that infers subclones from a reconstructed mutation tree. On real data, this simple algorithm recovers expert-defined subclones as well as, and often better, than state-of-the-art VAF based clustering algorithms [2729]. Orchard, along with the phylogeny-aware clustering algorithm, is accessible for free under the MIT License at http://www.github.com/morrislab/orchard.

2 Materials and methods

Orchard reconstructs cancer phylogenies using point mutations detected via bulk DNA sequencing. Below, we first describe the mixed sample perfect phylogeny (MSPP) problem and its definition using variant allele frequency data. Next, we derive a novel factorized approximation to the mutation tree posterior and demonstrate how to use Gumbel Soft-Max Tricks [37, 38] to sample without replacement from it. We then present the Orchard algorithm’s pseudo-code and discuss implementation-specific details. Lastly, we describe our phylogeny-aware clustering algorithm that infers clones and clone trees using a mutation tree.

2.1 The cancer-specific mixed sample perfect phylogeny problem

The input to a mixed sample perfect phylogeny problem is a matrix of mutation frequencies, , and the goal is to infer a perfect-phylogeny-compatible binary genotype matrix B ∈ {0, 1}n×n. Each column B:v in B represents the binary genotype for clone v. The rows of B represent mutations, where if Bjv = 1, then clone v’s genome contains mutation j. The rows of F represent mutations and the columns represent samples. Each entry Fjs is the frequency of mutation j in sample s, where Fjs ∈ [0, 1]. Since multiple clones may harbor mutation j, Fjs equals the sum of the frequency of those clones in sample s, i.e., Fjs = ∑v BjvUvs, where Uvs ≥ 0 and ∑vUvs ≤ 1. The mixing coefficients, Uvs, represent the proportion of cells in samples s with the clonal genotype v, i.e., the clonal proportions. The general relationship between F, B, and U, is: (1) The basic MSPP problem assumes that F is noise free, and solving it requires finding a perfect-phylogeny-compatible binary genotype matrix B and a clonal proportion matrix that satisfy Eq 1.

In the cancer-specific, noisy MSPP problem, we assume that F cannot be observed directly, and instead, we are provided variant allele frequency data D, from which a noisy estimate of F can be derived. The goal is to find B and U that admit an F with the highest possible data fit to D. The next section describes how we define a noisy MSPP problem, one step of which is estimating using the observed bulk DNA data. After that, we formally define a mutation tree.

2.2 The input to Orchard

The inputs to Orchard include VAFs for n point mutations in m samples from the same cancer. For each point mutation j, Orchard requires a set of tuples , where s indexes the bulk cancer sample. The entire data set for all n mutations is represented by a set .

Each tuple (ajs, bjs, ωjs) ∈ Xj represents the bulk DNA data from sample s for the genomic locus containing mutation j: bjs is the count of reads containing the variant allele j; ajs is the count of reads containing the reference allele; and ωjs is the variant read probability, which is the estimated proportion of alleles per cancer cell that contain the variant allele j. The latter value, 0 ≤ ωjs ≤ 1, is required to convert from VAFs to mutation frequencies; and the dependence on s permits modelling of sample-specific copy number aberrations (CNAs). Under the ISA, a mutation j initially only affects one allele, so in diploid and haploid regions free of CNAs, and ωjs = 1, respectively. In some cases, ωjs can be estimated by performing an allele-specific copy number analysis [40]. If mutations are heavily impacted by CNAs, Orchard requires reliable sample specific copy number estimates to produce accurate reconstructions; if these are unavailable, mutations affected by CNAs should be omitted from the inputs provided to Orchard.

The observed VAF for mutation j in sample s, denoted by , is given by . This VAF can be converted to an observed mutation frequency, , using the following formula: (2) The value of is an estimate of the percentage of cells in sample s that have mutation j.

2.3 Mutation tree representations of perfect phylogenies

Orchard represents perfect phylogenies of cancer clones using a mutation tree. A mutation tree is a perfect phylogeny represented as a rooted, directed tree, i.e., an arborescence. We denote it with t = {V, E, M}, where V is a set of nodes, indexed by v ∈ {r} ∪ {1, 2, …, n}, with r indicating the root node representing the germline; E is a set of directed edges defining the child/parent relationships among the nodes; and M is a vector of mutations, one for each non-root node. Each node vV \ {r} defines a genetically distinct cell population or clone whose clone-defining mutation is Mv. Each directed edge (v, u) ∈ E indicates that v is a parent of u, and that u inherits the mutation Mv, along with all mutations that v inherited from its ancestors. One implication of this inheritance is that the set of clones in the subtree rooted at v are uniquely defined by having the mutation Mv, thus Mv also defines a subclone rooted at v. For convenience, we will assign the same index to a clone, its clone-defining mutation, and the node associated with the clone, so clone v is represented by mutation v and Mv = v.

Orchard also considers partial mutation trees, denoted as t() = {V(), E(), M()}, where the superscript denotes the number of mutations the tree contains. A tree containing only the root node and a single mutation is denoted as t(1), and a complete mutation tree containing all n mutations and the root node is denoted as t(n). If 0 < < n, then V()V and only the entries Mv for vV() \ {r} are defined. To complete a partial tree t(), the remaining mutations V \ V() need to be added to t(). Adding vV \ V() to t() results in a new tree t(+1) = {V() ∪ {v}, E(+1), M(+1)}. The mutation v can be incorporated into t() either as an internal node, where it is ancestral to one or more mutations already in t(), or as a leaf node.

Each unique mutation tree is consistent with at most one binary genotype matrix B that can be resolved in linear time [20], and vice versa. Consequently, we establish equivalent representations for binary genotype matrices as t(1)B(1), t()B(), and t(n)B(n). Adding vV \ V() to t() uniquely extends its binary genotype matrix B(). If t(+1) is defined as above, then , where a(+1) is a length binary column vector indicating v’s ancestors in t(+1), d(+1) is a length binary row vector indicating v’s descendants, and B() is unchanged. Without loss of generality, we can re-order the rows and columns of B, and its submatrices B(), so that if the v-th row of B() represents the genotype of clone v, then the v-th column of B() corresponds to the clone defining mutation Mv for clone v. As such for any v. This row and column ordering guarantees that if , then either v = w or clone v is an ancestor of clone w in t(). Fig 1 depicts the relationship between t() and B() as mutations are added, and demonstrates how the rows and columns of the genotype matrix can be reordered without affecting the underlying tree structure.

thumbnail
Fig 1. Overview of how adding a new mutation changes a tree and its binary genotype matrix.

a. Introducing a new mutation to the tree t() corresponds to adding a new row a(+1) and column d(+1) to B(). These additions must be filled in such a way that B(+1) remains a perfect-phylogeny-compatible binary genotype matrix. b. The rows and columns of a binary genotype matrix can be reordered without affecting the underlying tree structure. If we reorder the rows and columns such that ancestors come before their descendants, then the matrix becomes upper triangular as shown here.

https://doi.org/10.1371/journal.pcbi.1012653.g001

2.4 Motivating the approximate posterior

Orchard’s goal is to sample without replacement from the mutation tree posterior defined by: (3) where B is a genotype matrix and D is a data set for n point mutations across m bulk samples. Note that computing P(B, D) = ∫UP(B, U, D)dU requires marginalizing over the clonal proportion matrix U. According to Cayley’s formula [41], (n + 1)n−1 unique mutation trees can be constructed with n mutations and a root node r. Consequently, if D contains data for a large number of mutations, then it is intractable to compute the normalizing constant, P(D) = ∑BU P(B, U, D)dU, since this requires instantiating (n + 1)n−1 perfect-phylogeny-compatible B-matrices [41], and integrating over the possible values of U for each B. Fortunately, there are a variety of Markov Chain Monte Carlo (MCMC) techniques which do not require computing P(D). These techniques rely on the following: (4) Eq 4 can be used to design an MCMC sampling algorithm, where B-matrices are sampled and subsequently evaluated using the likelihood P(D|B, U) by computing (or estimating) the integral over U, see, e.g., [25]. MCMC methods can be slow to converge and mix, and if P(B|D) has low entropy, then many samples will need to be drawn before k distinct ones are obtained. Here, we propose an approximation to the posterior, Qπ(B|D), from which one can sample trees containing all n mutations by progressively adding one mutation at a time to a partial tree structure. We will then show how Gumbel-Max tricks [3739] can be used to efficiently sample without replacement from Qπ(B|D).

We define Qπ(B|D) by choosing a fixed order in which mutations are added to the tree; the accuracy of its approximation of P(B|D) depends on this ordering. We denote a mutation order with the vector π representing a permutation, whose -th element, π, is the index of the -th mutation added. Each mutation appears in π exactly once. We use the notation D(+1) to represent the data associated with the first + 1 mutations in π. Given π, we write: (5) (6) Here, the genotype matrix B(+1) is obtained from B() by appending a(+1), representing the ancestors of the new mutation in t(+1), and d(+1), representing its descendants (see Section 2.3). Eq 6 follows directly from factoring P(B|D) into sets of binary random variables, i.e., {(a(2), d(2)), (a(3), d(3)), …, (a(n), d(n))} associated with the elements in the newly added row and column in B(+1) versus B() and noting that B(1) is a 1 × 1 binary matrix containing a 1. See Section A3.2 in S1 Appendix for more details. Sampling from Qπ(B|D) in place of P(B|D) is dramatically more efficient. Specifically, computing each factor P(B(+1)|D, B()) requires evaluating all mutation trees of size n that include B(+1) as a submatrix. In contrast, computing Qπ(B(+1)|D(+1), B()) requires evaluating only mutation trees of size + 1 that contain B() as a submatrix. The difference in computational complexity is super-exponential.

In general, Eq 5 is equal to Eq 6 if and only if: (7) i.e., that, given B() and the data for the first + 1 mutations, D(+1), B(+1) is conditionally independent from the data for the last n − ( + 1) mutations in π, D \ D(+1). This conditional independence assumption states that data for mutations not yet in the tree can be ignored when placing new mutations. Ordering the mutations so that each mutation is placed in the tree before its descendants will often ensure that this approximation is accurate. However, there are counterexamples where this is not true. We discuss this issue in more detail in Section A2.1 in S1 Appendix.

Evaluating each placement of π+1 into t() requires computing the conditional likelihood P(B(+1)|D(+1), B()). This term can be rewritten as follows: (8)

We discuss computing the data likelihood P(D(+1)|B(+1)) in Section 2.5.5. Eq 8 introduces a new term, P(B(+1)|B()), which we define as: (9) where τ is the number of possible settings of a(+1) and d(+1) such that B(+1) is a perfect phylogeny genotype matrix.

2.5 Orchard algorithm

Orchard performs beam search in a search tree to generate k samples without replacement from Qπ(B|D). In this search tree, each node represents a unique mutation tree, leaf nodes are complete mutation trees, and internal nodes are partial mutation trees (see Fig 2). Each node in the search tree is labeled with the perturbed (unnormalized) log probability of its corresponding mutation tree. The search tree has a total of (n + 1)n−1 leaves, representing each unique mutation tree that can be constructed from n mutations. Orchard’s search routine finds the top-k leaves of the search tree with the largest perturbed log probabilities without exhaustive exploration [38]. To further reduce run times, Orchard employs heuristic optimizations that, empirically, do not compromise its accuracy.

thumbnail
Fig 2. Example of Orchard’s mutation tree search with k = 2, f = ∞.

Mutation trees are depicted using genotype matrices. Search begins with a genotype matrix B(1) containing the first mutation in π. During each iteration, the best tree t() is popped from the queue and extended. The extensions are scored and reintroduced into the queue. Only the k trees with the highest scores in the queue are kept, while others are discarded. The bars next to each genotype matrix indicate its perturbed log probability, Gϕ. Bars with grey fill correspond to the top-k trees that are retained and extended. Genotype matrices within dashed boxes denote parts of the search space that are not explored further. Orchard’s best reconstructed tree can then be input into the phylogeny-aware clustering algorithm. This algorithm conducts agglomerative clustering on the mutation trees to produce a set of clone trees. Each clone tree’s set of clones is scored, and the algorithm yields the clone tree that minimizes the Generalized Information Criterion (GIC). See Section 2.6 and Section A4 in S1 Appendix for complete details.

https://doi.org/10.1371/journal.pcbi.1012653.g002

2.5.1 Background on Ancestral Gumbel-Top-k trick.

Sampling by finding the largest perturbed log probabilities is an example of a Gumbel-Max trick. These tricks derive from the motivating observation that if ϕi is the unnormalized log probability of the i-th item in a set, {ϕ1, …, ϕf}, and G(i) ∼ Gumbel(0, 1) is an i.i.d. sample from the standard Gumbel distribution—where Gumbel(μ, β) has a location parameter μ and a scale parameter β—then the following relationship holds: (10) Eq 10 is recognized as the original Gumbel-Max trick [42], and we refer to the quantities as perturbed log probabilities, noting that . It shows that finding the index of the largest perturbed log probability in a set is equivalent to sampling from a categorical distribution, defined by unnormalized log probabilities associated with each element. This powerful observation can be used to convert various combinatorial optimization algorithms into sampling algorithms. In a simple example, one can draw k samples without replacement from {ϕ1, …, ϕf}, using an extension of the Gumbel-Max trick called the Gumbel-Top-k trick [39], by finding the k largest perturbed log probabilities in the set by, e.g., sorting the set [39].

Applying Gumbel-Top-k to sample from either P(B|D) or Qπ(B|D) would require instantiating all (n + 1)n−1 unique trees over n mutations. Fortunately, the conditional independence structure of Qπ(B|D) supports a tractable extension called the Ancestral Gumbel-Top-k trick. The Ancestral Gumbel-Top-k trick can be used alongside a branch-and-bound search to find the highest-scoring trees, leveraging a property of Gumbel variables known as max-stability. Specifically, the maximum of a set of independent Gumbel random variables follows a Gumbel distribution with a location equal to the log-sum-exp of their log probabilities {ϕ1, ϕ2, …, ϕf} [38, 42, 43], i.e., (11)

Max-stability can be used to determine the maximum perturbed log probability in a subtree of the search tree. This property allows a branch-and-bound beam search to efficiently find the top k largest perturbed log probabilities in the search tree, avoiding exhaustive exploration. To illustrate this, first, let B() be a genotype matrix with log probability ϕ = log Qπ(B()|D()), and let B(+1,i) be its i-th extension (out of f possible ones) with log probability ϕi = log Qπ(B(+1,i)|D(+1)). Note that ϕ is the log-sum-exp of the set : (12) (13) (14) where Eqs 13 and 14 result from marginalizing over all f extensions of B(). Thus we can write (15) So, if computing ϕ is tractable, then one can determine the maximum perturbed log probability of the set without computing any of the ϕis. Additionally, because Eq 15 can be applied recursively to each extension of B(), and so on, down to the leaves of the search tree rooted at B(), the perturbed log probability of B() is also the maximum of any complete mutation trees that have B() as a submatrix.

Orchard leverages this property by using Gϕ as an estimate of the best complete tree that can be found by sampling the leaves in the subtree below B() and employs a beam search to explore only those parts of the search tree containing the top-k leaves.

2.5.2 Computing the “shifted” perturbed log probabilities of extensions.

When Orchard extends t(), it has already sampled Gϕ. For each extensions t(+1,i) that’s considered, is sampled. However, each must be sampled conditionally such that (Eq 15). Orchard achieves this by initially sampling the set , then applying a transformation, as described in [38], to ensure this condition is satisfied, resulting in a new set . We briefly describe this transformation here, but encourage the reader to refer to [38] for a complete proof of correctness. Let and . We assume is also readily available, as it can be initialized to 0 at the start of the algorithm. The transformation sets Z = T, and shifts each to be less than or equal to Z using the truncated Gumbel CDF: (16) where is the “shifted” perturbed log probability of the i-th extension, Fϕ,T(g) is the CDF of the Gumbel distribution truncated at T, and is the inverse CDF of the Gumbel distribution truncated at T. Please see [38] for a numerically stable version of Eq 16.

2.5.3 Stochastic beam search.

Algorithm 1 contains the pseudocode for the Orchard algorithm. Orchard’s search routine can be considered a stochastic beam search, and is adapted from the algorithms described in [37, 38]. Orchard keeps track of partial trees using a queue with a fixed size k, a user-defined parameter called the beam width. At each iteration, the partial tree t() with the largest is popped from the queue and extended. For each extension t(+1,i) considered, we compute a perturbed log probability, but to do so, we need to compute, or estimate, P(D(+1)|B(+1,i)), which is a computationally intensive operation. As such, we use a heuristic, denoted by the function H(t(), f), which uses an analytical approximation to score all extensions based on their adherence to the ISA, and selects the top-f extensions based on these scores. The parameter f is user-defined, and is called the branching factor. In practice, we find this heuristic decreases Orchard’s run time without compromising its accuracy. Complete details for this heuristic are provided in Section A3.1 in S1 Appendix.

Algorithm 1: Orchard

input: A dataset D containing allele frequency data and copy number states for

    n mutations in m tissue samples

parameters: k (beam width), f (branching factor), π (mutation order)

output: , the top-k trees

 // Initialize first tree assuming that

1

2 // initialize scores

3

4 while Q is not empty do

5  

   // H(t(), f) returns a set of f indices corresponding to extensions of t() with the best adherence to the ISA—this helps speed up lines 14, 16

6  

7  if + 1 == n then

8   k = k − 1

9   yield t* = arg maxi S // yield best tree in S

10   Return to line 5

11  else

   // “Shift” perturbed log probabilities using Eq 16

12   Z = maxi S

13   T = Gϕ

14   

15   

16   Sort Q in descending order according to , discard all but top-k

2.5.4 Approximating the MLE of the clonal proportion matrix.

Ideally, if we seek to score a mutation tree represented by its binary genotype matrix B, we would integrate out the clonal proportions U. In this case, the data likelihood for B, P(D|B), assuming a uniform prior on U, is given by (17) where Ds are the data associated with sample s and the integration is over non-negative values of Uvs such that ∑vUvs ≤ 1. We know of no analytical solution to this integral, instead we use a point estimate of U, U* and use the approximation

A good point estimate of U would be its maximum likelihood estimate (MLE) given D and B. Instead, to speed up Orchard, we use the projection algorithm [44, 45] which finds a point estimate U* by optimizing the following quadratic approximation to the binomial log likelihood of U given B and D: (18) where ‖⋅‖ is the Frobenius norm, 1 is a vector of “1”s, are the observed mutation frequencies, W is an n × m matrix of inverse-variances for each mutation in each sample derived from D, and ⊙ is the Hadamard, i.e., element-wise product. Further details on computing the elements of W and U* can be found in Section A2.3 in S1 Appendix.

2.5.5 Approximating the complete data likelihood.

Orchard evaluates a genotype matrix B and its corresponding clonal proportion matrix U using an approximation of the complete data likelihood of the input data D. This approximation assumes that the observed bulk read count data for each locus is generated according to a binomial sampling model, i.e., (19) where bjs is the count of reads containing the variant allele j in sample s, Njs = ajs + bjs is the total number of reads mapped to the locus containing the variant allele j in sample s, and λjs is the variant allele frequency for the variant allele j in sample s. We do not observe λjs, but instead infer it from the mutation frequency matrix F. Given a genotoype matrix B and a clonal proportion matrix U, we calculate the mutation frequency matrix with F = BU (Eq 1). We can use the following formula to estimate the variant allele frequency that generated the observed bulk read count data for mutation j in sample s: (20) where λjs is the inferred variant allele frequency, Fjs is the frequency of mutation j in sample s, and ωjs is the variant read probability provided as an input along with the read count data. We then compute the approximate complete data likelihood as follows: (21) (22)

We assume that the samples are exchangeable in Eq 21. To complete our definition for Eq 22, we define as the binomial probability of observing b variant reads out of a total of N = a + b reads mapping to its locus when the variant allele frequency is λ.

2.6 Phylogeny-aware clustering

Orchard contains a “phylogeny-aware” clustering algorithm that uses a reconstructed mutation tree and its corresponding mutation frequency matrix F to cluster mutations into subclones and resolve clone trees. This algorithm performs agglomerative clustering and joins adjacent nodes u and v in a tree using the following formula: where nu is the number of mutations associated with node u, is the average mutation frequency of the mutations contained in node u, and ‖⋅‖ is the Euclidean norm. This definition for d(u, v) is known as Ward’s method. The agglomerative clustering is repeated, successively joining adjacent nodes, until it reaches a clone tree consisting of only the root node and a single node containing all mutations. Note that only parent-child pairs, where the child is the parent’s only child, are considered for joining. Once no such pairs remain, any adjacent nodes can then be joined. The algorithm outputs n different clusterings containing 1, …, n clones. Each clone tree containing c clones has a corresponding n × c binary matrix Z, where Zji = 1 if and only if mutation j originated in clone i. Associated with each clone i in each sample s is its variant allele frequency, λis. If mutation j is assigned to clone i, then it is assumed that the observed variant allele frequency for mutation j in sample s, , is a noisy observation of λis, such that , where ϵ is a noise term. So, given F, Z, and D, the likelihood of a clustering can be expressed as: (23) where is defined as the binomial probability of observing b variant reads out of a total of N = a + b reads mapping to its locus when the variant allele frequency is λ, and is the indicator function which evaluates to 1 if mutation j is assigned to clone i, and 0 otherwise. Maximizing Eq 23 may result in overfitting, where each mutation is assigned to its own clone. To mitigate this issue, the clone tree with the clustering that minimizes the Generalized Information Criterion (GIC) [46] is yielded. The implementation details of this algorithm are provided in Section A4 in S1 Appendix.

3 Results

3.1 Evaluation overview

We compare Orchard with two state-of-the-art MSPP reconstruction algorithms: CALDER [26], and Pairtree [25]. For CALDER we used its non-longitudinal model and v8.0.1 of the Gurobi optimizer. CALDER only outputs a single tree which is optimal under its mixed integer linear programming formulation, or otherwise fails to output a valid tree. Pairtree uses Markov Chain Monte Carlo to sample trees from a data-implied posterior over trees. We set Pairtree’s trees-per-chain parameter to 5000, and otherwise used the default parameters. We ran the Pairtree algorithm in parallel on 32 CPU cores, generating 160000 trees (32 × 5000). All Orchard experiments used the same branching factor f = 20, but we evaluated two beam widths: k = 1, and k = 10. We also ran the Orchard algorithm in parallel on 32 CPU cores; generating 32 trees (32 × 1) and 320 trees (32 × 10), respectively. Parallel instances of Orchard sample trees independently, so we are only guaranteed a minimum of 1 or 10 unique trees, respectively. Running parallel instances of Orchard and Pairtree means that each algorithm was effectively executed 32 times on each dataset, with each parallel instance having a unique random seed. CALDER, being deterministic when run on identical computing hardware with the same inputs, requires only a single run to obtain its best reconstruction.

We assess the mutation frequency matrix F and genotype matrix B for the tree(s) output by a MSSP reconstruction method by comparing them with a baseline mutation frequency matrix F(baseline) and, when available, a set of ground-truth genotype matrices. For simulated bulk data, F(baseline) corresponds to the ground-truth mutation frequency matrix F(true) used to generate the simulated VAF data. For real bulk data where an expert-defined clone tree is available, F(baseline) corresponds to a maximum a posteriori (MAP) mutation frequency matrix F(MAP) that we fit to the expert-defined tree (see Section A5.2 in S1 Appendix). Each mutation in F(MAP) is assigned a frequency according to the MAP mutation frequency of the subclone the mutation belongs to in the expert-defined clone tree. We compare the reconstructed mutation frequencies in F to F(baseline) using a metric called the log perplexity ratio [25]. The log perplexity ratio is the log of the ratio between the perplexity of F and the perplexity of F(baseline). A negative log perplexity ratio indicates that the mutation frequency matrix F for the tree(s) reconstructed by a method agree with the VAF data better than F(baseline). Perplexity is evaluated under a binomial probability model. We compare the reconstructed genotype matrix B to the set of ground-truth genotype matrices using a metric called the relationship reconstruction loss [25]. The relationship reconstruction loss compares the evolutionary relationships in B to those in the set of ground-truth genotype matrices using the Jensen-Shannon divergence. The log perplexity ratio and relationship reconstruction loss are computed as likelihood-weighted averages over all distinct trees returned by a method. This approach improves the robustness and reliability of our evaluations by incorporating the uncertainty in the reconstructions. We also evaluate MSSP reconstruction methods based on their wall clock runtime, representing the time (in seconds) each method requires to complete on a data set. For more details on these metrics see Section A5 in S1 Appendix.

3.2 Orchard produces better reconstructions than competing methods on 90 simulated cancers

We used the Pearsim software (https://github.com/morrislab/pearsim) from [25] to generate 90 simulated mutation trees and corresponding VAF data which adheres to the ISA. The data sets vary in the number of mutations (10, 30, 50, 100, 150, 200), cancer samples (10, 20, 30, 50, 100), and sequencing depths (50x, 200x, 1000x). These simulations all adhere to the ISA.

We ran Orchard, CALDER, and Pairtree on each simulated data set a single time, and the box plots in Fig 3b, 3c and 3d show the distribution of results for each evaluation metric. Table 1 contains counts of data sets, separated by problem size, where a method had the best log perplexity ratio (P) or relationship reconstruction loss (R). CALDER was the only method to fail on a reconstruction problem, its failure rate increased as the problem size increased, and it failed to produce a valid tree on any reconstruction problem with more than 100 mutations.

thumbnail
Fig 3. Evaluation of reconstructions for 90 simulated mutation trees.

Results are grouped by the size of the simulated mutation trees (rows), i.e., the problem size. a. Bar plots show the percentage of data sets where a method produces at least one valid tree. A red x means the method did not succeed on any of the data sets for that problem size. A red arrow means the results for the method on a problem size occur beyond the x-axis limit. The distributions, represented by box plots, in (b,c,d) only include data sets where the method was successful. b. The distribution of log perplexity ratios, a measure of VAF data fit. Ratios are relative to the perplexity of the ground truth mutation frequency matrix F(true), and can be negative. Lower log perplexity ratios indicate better reconstructions. c. The distributions of relationship reconstruction loss for each method on a problem size. This loss can range between zero bits (complete match of pairwise relationships) and one bit (complete mismatch of pairwise relationships). d. The distributions of wall clock run time.

https://doi.org/10.1371/journal.pcbi.1012653.g003

thumbnail
Table 1. Count of simulated mutation tree data sets, out of 15 per column, where a model had the best log perplexity ratio (P) or relationship reconstruction loss (R).

Bold indicates column max.

https://doi.org/10.1371/journal.pcbi.1012653.t001

Fig 3 and Table 1 illustrate a problem-size-dependency in relative performance. Orchard generally outperformed Pairtree and CALDER on the small reconstruction problems (10–30 mutations) on all metrics. Although Orchard’s reconstructions are only slightly better than Pairtree for these problems, Orchard is 5–10x faster. In contrast, Orchard consistently outperformed Pairtree and CALDER on all problems with 50 or more mutations. For these problems, in either setting of k, Orchard finds trees that are significantly better than those found by Pairtree or CALDER. Unfortunately, the run time required by Orchard increases significantly on larger problem sizes. Detailed run time breakdowns for each method across problem sizes can be found in Table D in S1 Appendix. Orchard (k = 1) generally outperforms Orchard (k = 10) on smaller problems (10–30 mutations), but this switches on larger problems. To understand this change, we inspected the trees recovered by Orchard (k = 1 and k = 10). In nearly all 10 or 30-mutation tree reconstruction problems, both Orchard configurations found the same best tree. However, Orchard (k = 10) yields nine other sub-optimal trees which are included in the weighted averages used to compute the metrics resulting in slightly worse average scores. On larger problems, the trees reconstructed by Orchard (k = 10) are generally much better than those reconstructed by Orchard (k = 1).

3.3 Orchard reconstructs more plausible trees on 14 B-progenitor acute lymphoblastic leukemias

We applied Orchard, CALDER and Pairtree on real bulk data from 14 B-progenitor acute lymphoblastic leukemias (B-ALLs) originally studied in [47]. The B-ALL data sets vary in the number of subclones (between 5 and 17), represented by expert-defined subclones, tissue samples (between 13 and 90), and mutations (between 16 and 292). All of the B-ALL data sets had an approximate sequencing depth of 212x [47]. Each B-ALL patient had samples taken at diagnosis and relapse, and then each of these samples was transplanted into immunodeficient mice resulting in multiple patient derived xenografts (PDXs). The VAF data for each B-ALL was used by experts to manually cluster mutations into subclones and construct clone trees, both of which were reviewed for biological plausibility [47].

Table A in S1 Appendix presents a comprehensive comparison of each method’s log perplexity ratios on the 14 B-ALL clone tree reconstruction problems. Orchard generally outperformed Pairtree, achieving a slightly lower log perplexity ratio on 10/14 data sets, with an average reduction of -2.17e-5 bits. CALDER, on the other hand, failed on 2 of the data sets and performed the worst on those it did succeed on. It is unsurprising that both Orchard and Pairtree perform well on the B-ALL data, given that each data set contains at least 13 samples, and at most 17 clones. Typically, as the number of samples increases, the number of plausible trees typically decreases, making accurate reconstruction more feasible [25].

In most practical scenarios, expert-defined subclones are unavailable. To emulate these conditions using the B-ALL data, we evaluate two approaches:

  1. The traditional approach of clustering mutations into subclones using VAF-based clustering algorithms and then constructing a clone tree from these subclones.
  2. Our new approach of reconstructing a mutation tree using the VAF data and subsequently utilizing this mutation tree to guide the inference of subclones.

We provide a detailed assessment of both approaches in Section A6.8 in S1 Appendix. To assess the second approach, we first reconstructed mutation trees for each of the 14 B-ALL data sets using both Orchard (k = 10) and Pairtree. We provide a comprehensive analysis of the mutation trees reconstructed by each method in Section A6.5 in S1 Appendix. Orchard and Pairtree reconstruct similarly accurate mutation trees on B-ALL data sets containing fewer than 50 mutations, but Orchard reconstructs significantly more plausible trees than Pairtree for B-ALLs with more than 50 mutations.

The accuracy of Orchard’s mutation tree reconstructions, compared to Pairtree, is highlighted by further analyzing the B-ALL data set SJBALL022611. This data set has 84 mutations identified across 29 different samples. First, Orchard (k = 10) and Pairtree were used to reconstruct trees for this data using a varied number of samples (2,3,5,10,15,20,25,29). For each number of samples, we selected those with the largest number of variant reads. Fig 4a shows the log perplexity ratio for the trees reconstructed by Pairtree and Orchard when varying the number of samples used. The baseline mutation frequencies were chosen to be the MAP mutation frequencies, F(MAP), fit to the expert-defined clone tree for SJBALL022611. Counter intuitively, Fig 4a demonstrates that as more samples are used during reconstruction, the mutation frequencies reconstructed by Pairtree exhibit poorer fit to the VAF data. This phenomenon was observed across all B-ALL data sets with more than 60 mutations, indicating that Pairtree’s reconstructions become increasingly less reliable with more data. In contrast, Orchard’s reconstructions exhibit excellent data fit irrespective of the number of mutations or samples. In Section A6.6 in S1 Appendix, we conducted a similar experiment using random subsets of samples for reconstruction. The results of this experiment strongly align with those shown in Fig 4a.

thumbnail
Fig 4. Evaluation of reconstructions by Orchard and Pairtree for SJBALL022611.

a. Log perplexity ratio for the trees reconstructed by Orchard and Pairtree as a function of the number of samples. Orchard’s reconstructions are accurate regardless of the number of samples provided, while Pairtree’s reconstructions worsen with more samples. b,c. Absolute difference between the VAFs inferred by Orchard and Pairtree and the VAFs implied by the bulk data. Large values indicate divergence between VAFs inferred by a method and the VAFs implied by the data. VAFs inferred by Orchard adhere very closely to the data, while also adhering to the ISA. Pairtree’s poorly reconstructed tree results in innaccurate VAF estimates for many mutations. The same row in each heatmap corresponds to same unique mutation, and each column corresponds to the same unique sample.

https://doi.org/10.1371/journal.pcbi.1012653.g004

The poor data fit for Pairtree’s best reconstructed tree for SJBALL022611 using all 29 samples is exemplified in Fig 4c. This figure presents a heatmap of the absolute differences between the VAFs inferred by Pairtree and those implied by the bulk data for each mutation, and it shows that Pairtree’s inferred VAFs significantly diverge from those implied by the bulk data, indicating a poor reconstruction. Again, in contrast, Fig 4b demonstrates that Orchard’s reconstructions fit the VAF data very well, as evidenced by the minimal difference between the data-implied VAFs and those inferred by Orchard. Comparing these two heatmaps reveals that Pairtree infers significantly larger differences than Orchard (confirmed by one-tailed t test P < 0.001).

Next, we use the phylogeny-aware clustering algorithm to infer clones from each method’s best tree that was reconstructed using all of the samples for SJBALL022611. We can compare the inferred clones to the expert-defined clones using the Adjusted Rand Index (ARI). The subclones inferred from Orchard’s tree exhibited an ARI of 0.96, indicating a near-perfect match with the expert-defined subclones. In contrast, the subclones inferred from Pairtree’s tree were more crude, obtaining a notably lower ARI of 0.82. With the exception of two B-ALL data sets (SJBALL022609, SJBALL022610), the subclones identified by the phylogeny-aware clustering algorithm using Orchard’s tree matched the expert-defined subclones as well or better than those identified using Pairtree’s tree. These results are consistent with our expectations, as we believe better tree reconstructions should be more informative for guiding clone inference. We provide a more detailed discussion of the phylogeny-aware clustering results for the B-ALL data in Section A6.7 in S1 Appendix.

4 Discussion

Modern cancer evolution studies generate large amounts of bulk sequencing data often containing hundreds of mutations and multiple cancer samples. This vast amount of data presents a promising opportunity to improve our understanding of the evolutionary events that drive cancer progression. Until now, existing mixed sample perfect phylogeny (MSPP) reconstruction algorithms were unable to reliably reconstruct trees with more than 30 mutations. These limitations necessitated clustering mutations into subclones prior to reconstructing trees, which is an error prone process [3032]. We introduced Orchard, a novel MSPP reconstruction algorithm that uses variant allele frequency data to build large and accurate mutation trees. We derived an approximation to the mutation tree posterior, then showed how Orchard adapts Ancestral Gumbel-Top-k sampling to efficiently sample without replacements from it. Our results demonstrate that Orchard can reliably match or beat ground-truth (or expert-derived) baselines on tree reconstruction problems with up to 300 mutations and up to 100 samples from the same cancer. Orchard can recover larger trees on simulated data, and in Section A6.4 in S1 Appendix we use Orchard to successfully reconstruct 1000-node simulated mutation trees in ≈26 hours of wall clock time. Orchard’s ability to reconstruct extremely large trees motivated a new strategy where the mutation tree structure is used to guide the inference of clones and clone trees. To this end, we introduced a “phylogeny-aware” clustering algorithm that uses agglomerative clustering to infer clones and clone trees from mutation trees.

There are a few potential areas of improvement for Orchard. First, the majority of Orchard’s run time is used by the projection algorithm [44] to estimate the clonal proportion matrix U each time it adds a new mutation to a tree. Although the computational complexity for estimating U is, at worst, quadratic in the number of mutations (i.e., rows) in U, the wall clock times grows quickly with increasing problem size, beam width (k), or branching factor (f). One potential improvement, because the frequency of most mutations should not be changed by introducing a new mutation into the tree, is developing a new version of the projection algorithm that only performs local updates to the U matrix. Also, Orchard marginalizes over all possible ways to place a new mutation into a partially built tree, which requires an exponential number of evaluations when the tree is a star tree, i.e., a tree where all mutations are children of the root. Unless specified otherwise, Orchard will try all 2 + possible ways to place a new mutation into a star tree with mutations. Fortunately, under the ISA, such trees are uncommon and linear branching becomes more likely as the number of children for a node increases. So, in practice, it is rare for nodes to have large numbers of children.

The current implementation of the “phylogeny-aware” clustering algorithm uses a non-probabilistic merging strategy to identify clones in a mutation tree. Despite this, the algorithm already matches, or in some cases, outperforms state-of-the-art probabilistic VAF-based clustering algorithms on real data (see Section A6.7 in S1 Appendix). These results suggest that the mutation tree structure is useful for guiding the inference of clones and clones trees. We anticipate that improvements to the current algorithm to make it probabilistic, so that its merging choices are made based on an appropriate likelihood function, will result in even greater performance.

We further anticipate the development of new algorithms that utilize large mutations trees reconstructed by Orchard. For example, these trees could be used to study patterns of somatic mutations, recover lineage-specific mutation signature activities, or identify patterns of metastatic spread [17]. The Orchard algorithm could also be adapted to reconstruct mutation trees from single-cell data, though this requires updating Orchard’s noise model and sampling routine to accommodate single cell data.

Supporting information

References

  1. 1. Hanahan D, Weinberg RA. The Hallmarks of Cancer. Cell. 2000;100(1):57–70. pmid:10647931
  2. 2. Nowell PC. The clonal evolution of tumor cell populations. Science (New York, NY). 1976;194(4260):23–28. pmid:959840
  3. 3. Greaves M, Maley CC. Clonal evolution in cancer. Nature. 2012;481(7381):306–313. pmid:22258609
  4. 4. Gerstung M, Jolly C, Leshchiner I, Dentro SC, Gonzalez S, Rosebrock D, et al. The evolutionary history of 2,658 cancers. Nature. 2020;578(7793):122–128. pmid:32025013
  5. 5. Dentro SC, Leshchiner I, Haase K, Tarabichi M, Wintersinger J, Deshwar AG, et al. Characterizing genetic intra-tumor heterogeneity across 2,658 human cancer genomes. Cell. 2021;184(8):2239–2254.e39. pmid:33831375
  6. 6. Hirsch P, Zhang Y, Tang R, Joulin V, Boutroux H, Pronier E, et al. Genetic hierarchy and temporal variegation in the clonal history of acute myeloid leukaemia. Nature Communications. 2016;7(1):12475. pmid:27534895
  7. 7. Kent DG, Green AR. Order Matters: The Order of Somatic Mutations Influences Cancer Evolution. Cold Spring Harbor Perspectives in Medicine. 2017;7(4):a027060. pmid:28096247
  8. 8. Koch L. Using tumour phylogenies to identify drivers. Nature Reviews Genetics. 2022;23(4):196–196. pmid:35140352
  9. 9. Pogrebniak KL, Curtis C. Harnessing Tumor Evolution to Circumvent Resistance. Trends in genetics: TIG. 2018;34(8):639–651. pmid:29903534
  10. 10. Adams EN. Consensus Techniques and the Comparison of Taxonomic Trees. Systematic Zoology. 1972;21(4):390–397.
  11. 11. Nik-Zainal S, Van Loo P, Wedge DC, Alexandrov LB, Greenman CD, Lau KW, et al. The life history of 21 breast cancers. Cell. 2012;149(5):994–1007. pmid:22608083
  12. 12. Jamal-Hanjani M, Wilson GA, McGranahan N, Birkbak NJ, Watkins TBK, Veeriah S, et al. Tracking the Evolution of Non–Small-Cell Lung Cancer. New England Journal of Medicine. 2017;376(22):2109–2121. pmid:28445112
  13. 13. Light N, Layeghifard M, Attery A, Subasri V, Zatzman M, Anderson ND, et al. Germline TP53 mutations undergo copy number gain years prior to tumor diagnosis. Nature Communications. 2023;14(1):77. pmid:36604421
  14. 14. Sakamoto H, Attiyeh MA, Gerold JM, Makohon-Moore AP, Hayashi A, Hong J, et al. The Evolutionary Origins of Recurrent Pancreatic Cancer. Cancer Discovery. 2020;10(6):792–805. pmid:32193223
  15. 15. Schuh A, Becq J, Humphray S, Alexa A, Burns A, Clifford R, et al. Monitoring chronic lymphocytic leukemia progression by whole genome sequencing reveals heterogeneous clonal evolution patterns. Blood. 2012;120(20):4191–4196. pmid:22915640
  16. 16. Villani A, Davidson S, Kanwar N, Lo WW, Li Y, Cohen-Gogo S, et al. The clinical utility of integrative genomics in childhood cancer extends beyond targetable mutations. Nature Cancer. 2022; p. 1–19. pmid:36585449
  17. 17. El-Kebir M, Satas G, Raphael BJ. Inferring parsimonious migration histories for metastatic cancers. Nature Genetics. 2018;50(5):718–726. pmid:29700472
  18. 18. Turajlic S, Xu H, Litchfield K, Rowan A, Horswell S, Chambers T, et al. Deterministic Evolutionary Trajectories Influence Primary Tumor Growth: TRACERx Renal. Cell. 2018;173(3):595–610.e11. pmid:29656894
  19. 19. Chalmers ZR, Connelly CF, Fabrizio D, Gay L, Ali SM, Ennis R, et al. Analysis of 100,000 human cancer genomes reveals the landscape of tumor mutational burden. Genome Medicine. 2017;9(1):34. pmid:28420421
  20. 20. Gusfield D. Efficient algorithms for inferring evolutionary trees. Networks. 1991;21(1):19–28.
  21. 21. El-Kebir M, Oesper L, Acheson-Field H, Raphael BJ. Reconstruction of clonal trees and tumor composition from multi-sample sequencing data. Bioinformatics. 2015;31(12):i62–i70. pmid:26072510
  22. 22. Sundermann LK, Wintersinger J, Rätsch G, Stoye J, Morris Q. Reconstructing tumor evolutionary histories and clone trees in polynomial-time with SubMARine. PLOS Computational Biology. 2021;17(1):e1008400. pmid:33465079
  23. 23. Eskin E, Halperin E, Karp RM. Efficient reconstruction of haplotype structure via perfect phylogeny. Journal of Bioinformatics and Computational Biology. 2003;1(1):1–20. pmid:15290779
  24. 24. Gusfield D. Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. In: Proceedings of the sixth annual international conference on Computational biology. RECOMB’02. New York, NY, USA: Association for Computing Machinery; 2002. p. 166–175. Available from: https://doi.org/10.1145/565196.565218.
  25. 25. Wintersinger JA, Dobson SM, Kulman E, Stein LD, Dick JE, Morris Q. Reconstructing Complex Cancer Evolutionary Histories from Multiple Bulk DNA Samples Using Pairtree. Blood Cancer Discovery. 2022;3(3):208–219. pmid:35247876
  26. 26. Myers MA, Satas G, Raphael BJ. CALDER: Inferring Phylogenetic Trees from Longitudinal Tumor Samples. Cell Systems. 2019;8(6):514–522.e5. pmid:31229560
  27. 27. Caravagna G, Heide T, Williams MJ, Zapata L, Nichol D, Chkhaidze K, et al. Subclonal reconstruction of tumors by using machine learning and population genetics. Nature Genetics. 2020;52(9):898–907. pmid:32879509
  28. 28. Gillis S, Roth A. PyClone-VI: scalable inference of clonal population structures using whole genome data. BMC Bioinformatics. 2020;21(1):571. pmid:33302872
  29. 29. Miller CA, White BS, Dees ND, Griffith M, Welch JS, Griffith OL, et al. SciClone: Inferring Clonal Architecture and Tracking the Spatial and Temporal Patterns of Tumor Evolution. PLOS Computational Biology. 2014;10(8):e1003665. pmid:25102416
  30. 30. Salcedo A, Tarabichi M, Espiritu SMG, Deshwar AG, David M, Wilson NM, et al. A community effort to create standards for evaluating tumor subclonal reconstruction. Nature Biotechnology. 2020;38(1):97–107. pmid:31919445
  31. 31. Sun R, Hu Z, Sottoriva A, Graham TA, Harpak A, Ma Z, et al. Between-region genetic divergence reflects the mode and tempo of tumor evolution. Nature Genetics. 2017;49(7):1015–1024. pmid:28581503
  32. 32. Williams MJ, Werner B, Barnes CP, Graham TA, Sottoriva A. Identification of neutral tumor evolution across cancer types. Nature Genetics. 2016;48(3):238–244. pmid:26780609
  33. 33. Satas G, Raphael BJ. Tumor phylogeny inference using tree-constrained importance sampling. Bioinformatics. 2017;33(14):i152–i160. pmid:28882002
  34. 34. Benard BA, Leak LB, Azizi A, Thomas D, Gentles AJ, Majeti R. Clonal architecture predicts clinical outcomes and drug sensitivity in acute myeloid leukemia. Nature Communications. 2021;12(1):7244. pmid:34903734
  35. 35. Ortmann CA, Kent DG, Nangalia J, Silber Y, Wedge DC, Grinfeld J, et al. Effect of Mutation Order on Myeloproliferative Neoplasms. New England Journal of Medicine. 2015;372(7):601–612. pmid:25671252
  36. 36. Schwede M, Jahn K, Kuipers J, Miles LA, Bowman RL, Robinson T, et al. Mutation order in acute myeloid leukemia identifies uncommon patterns of evolution and illuminates phenotypic heterogeneity. Leukemia. 2024; p. 1–10. pmid:38467769
  37. 37. Kool W, van Hoof H, Welling M. Stochastic Beams and Where to Find Them: The Gumbel-Top-k Trick for Sampling Sequences Without Replacement; 2019. Available from: http://arxiv.org/abs/1903.06059.
  38. 38. Kool W, Hoof Hv, Welling M. Ancestral Gumbel-Top-k Sampling for Sampling Without Replacement. Journal of Machine Learning Research. 2020;21(47):1–36.
  39. 39. Paulus MB, Choi D, Tarlow D, Krause A, Maddison CJ. Gradient Estimation with Stochastic Softmax Tricks; 2021. Available from: http://arxiv.org/abs/2006.08063.
  40. 40. Tarabichi M, Salcedo A, Deshwar AG, Ni Leathlobhair M, Wintersinger J, Wedge DC, et al. A practical guide to cancer subclonal reconstruction from DNA sequencing. Nature Methods. 2021;18(2):144–155. pmid:33398189
  41. 41. Cayley A. A theorem on trees. Quarterly Journal of Pure and Applied Mathematics. 1889;XXIII:376–378.
  42. 42. Maddison CJ, Tarlow D, Minka T. A\ast Sampling. In: Advances in Neural Information Processing Systems. vol. 27. Curran Associates, Inc.; 2014. Available from: https://papers.nips.cc/paper_files/paper/2014/hash/309fee4e541e51de2e41f21bebb342aa-Abstract.html.
  43. 43. Gumbel EJ. Statistical Theory of Extreme Values and Some Practical Applications. A Series of Lectures. National Bureau of Standards, Washington, D. C. Applied Mathematics Div.; 1954. PB175818. Available from: https://ntrl.ntis.gov/NTRL/dashboard/searchResults/titleDetail/PB175818.xhtml.
  44. 44. Jia B, Ray S, Safavi S, Bento J. Efficient Projection onto the Perfect Phylogeny Model. In: Advances in Neural Information Processing Systems. vol. 31. Curran Associates, Inc.; 2018. Available from: https://proceedings.neurips.cc/paper/2018/hash/d198bd736a97e7cecfdf8f4f2027ef80-Abstract.html.
  45. 45. Ray S, Jia B, Safavi S, van Opijnen T, Isberg R, Rosch J, et al. Exact inference under the perfect phylogeny model; 2019. Available from: http://arxiv.org/abs/1908.08623.
  46. 46. Kim Y, Kwon S, Choi H. Consistent Model Selection Criteria on High Dimensions;.
  47. 47. Dobson SM, García-Prat L, Vanner RJ, Wintersinger J, Waanders E, Gu Z, et al. Relapse-Fated Latent Diagnosis Subclones in Acute B Lineage Leukemia Are Drug Tolerant and Possess Distinct Metabolic Programs. Cancer Discovery. 2020;10(4):568–587. pmid:32086311