Abstract
Motivation
DNA sequencing of multiple bulk samples from a tumor provides the opportunity to investigate tumor heterogeneity and reconstruct a phylogeny of a patient’s cancer. However, since bulk DNA sequencing of tumor tissue measures thousands of cells from a heterogeneous mixture of distinct sub-populations, accurate reconstruction of the tumor phylogeny requires simultaneous deconvolution of cancer clones and inference of ancestral relationships, leading to a challenging computational problem. Many existing methods for phylogenetic reconstruction from bulk sequencing data do not scale to large datasets, such as recent datasets containing upwards of ninety samples with dozens of distinct sub-populations.
Results
We develop an approach to reconstruct phylogenetic trees from multi-sample bulk DNA sequencing data by separating the reconstruction problem into two parts: a structured regression problem for a fixed tree $\mathcal{T}$, and an optimization over tree space. We derive an algorithm for the regression sub-problem by exploiting the unique, combinatorial structure of the matrices appearing within the problem. This algorithm has both asymptotic and empirical improvements over linear programming (LP) approaches to the problem. Using our algorithm for this regression sub-problem, we develop fastBE, a simple method for phylogenetic inference from multi-sample bulk DNA sequencing data. We demonstrate on simulated data with hundreds of samples and upwards of a thousand distinct sub-populations that fastBE outperforms existing approaches in terms of reconstruction accuracy, sample efficiency, and runtime. Owing to its scalability, fastBE enables both phylogenetic reconstruction directly from individual mutations, without requiring the clustering of mutations into clones, and a new phylogeny-constrained mutation clustering algorithm. On real data from fourteen B-progenitor acute lymphoblastic leukemia patients, fastBE infers mutation phylogenies with fewer violations of a widely used evolutionary constraint and better agreement with the observed mutational frequencies. Using our phylogeny-constrained mutation clustering algorithm, we also find mutation clusters with lower distortion compared to state-of-the-art approaches. Finally, we show that on two patient-derived colorectal cancer models, fastBE infers mutation phylogenies with fewer violations of a widely used evolutionary constraint compared to existing methods.
Author summary
DNA sequencing of a bulk tumor sample measures the genomes of the heterogeneous mixture of cells that comprise a tumor. Reconstructing the evolutionary history of a cancer from such admixed measurements is challenging, as standard phylogenetic techniques assume that genomes of individual cells are measured. Multiple specialized techniques aim to simultaneously infer the unmeasured genomes and construct the evolutionary history of these genomes, but many of these methods do not scale to large numbers of genomes in the mixture. We introduce a new tool, fast Bulk Evolution (fastBE), which accurately reconstructs the evolutionary history of tumors containing hundreds to thousands of genomes from bulk DNA sequencing data. Key to the success of fastBE are new algorithmic insights that make this task tractable. fastBE is a useful tool to analyze large multi-region tumor sequencing datasets.
Citation: Schmidt H, Raphael BJ (2024) A regression based approach to phylogenetic reconstruction from multi-sample bulk DNA sequencing of tumors. PLoS Comput Biol 20(12): e1012631. https://doi.org/10.1371/journal.pcbi.1012631
Editor: Teresa M. Przytycka, National Library of Medicine, UNITED STATES OF AMERICA
Received: July 9, 2024; Accepted: November 12, 2024; Published: December 4, 2024
Copyright: © 2024 Schmidt, Raphael. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript, its Supporting information files, and the GitHub repository at https://github.com/raphael-group/fastBE.
Funding: This research is supported by Ludwig Cancer Research and by National Cancer Institute (NCI) grants U24CA248453 and U24CA264027 to B.J.R. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Tumor evolution is characterized by the accumulation of somatic genomic alterations that alter the fitness of sub-populations of cells, leading to unregulated growth. Over the past ten years, high-coverage DNA sequencing of bulk tumor samples has proven tremendously successful in deciphering this complex evolutionary process [1–3]. There are now dozens of computational techniques [4–12] to accurately and efficiently identify distinct sub-populations of cells in a tumor sample and reconstruct the evolutionary history, or a phylogeny, of these populations. Application of these techniques can help identify the genomic alterations that drive tumor growth [13, 14]. Recent studies have demonstrated that intra-tumor heterogeneity is more prevalent than previously reported (reviewed in [15]). For example, the TracerX Consortium [2] found up to fifteen distinct sub-populations of cells, or subclones, from multi-region sequencing of a patient biopsy of non-small-cell lung cancer. Further, they noticed that without this multi-region sequencing, up to 65% of subclones and 76% of subclonal mutations would have been missed, suggesting that perhaps further heterogeneity could be uncovered by increasing the number of sequenced samples. In tandem with decreasing sequencing costs, high levels of intra-tumor heterogeneity have led to bulk-DNA sequencing datasets with an increasingly large number of samples and subclones—containing up to 90 samples and 26 subclones for a single cancer patient [16].
While numerous methods have been developed to build phylogenies from bulk DNA sequencing of tumors, few of these methods scale to datasets with dozens of samples and subclones from the same patient. Recently [11] demonstrated that existing methods fail to scale beyond even ten subclones, making their application to datasets with high amounts of intra-tumor heterogeneity challenging. As another consequence of poor scalability, all existing methods—except the newly introduced method Orchard [12]—infer phylogenies using summary statistics computed from clusters of mutations rather than the individual mutation read counts, potentially missing valuable phylogenetic signal.
Contribution. In this work, we describe a structured regression formulation of the successful matrix factorization model [5–7, 9, 10] for phylogenetic inference from somatic single nucleotide variants (SNVs) measured via DNA sequencing of one or more bulk tumor samples from the same patient. In particular, we identify a tractable, ℓ1-regression problem hidden within the NP-complete, variant allele frequency (VAF) factorization problem [5, 17]—analogous to the method of minimum evolution in species phylogenetics [18, 19]—where a tractable regression sub-problem [20] is solved within an NP-complete [21, 22] optimization problem. By studying the unique, combinatorial structure of the clonal matrices [5] appearing within this ℓ1-regression sub-problem, we derive an algorithm which obtains both asymptotic and empirical improvements over a naïve, linear programming based approach. Further, our regression algorithm efficiently recomputes the solution to the ℓ1-regression sub-problem upon slight modifications to the tree topology, such as the addition of vertices and subtree prune-and-regraft (SPR) operations [23].
Utilizing our fast ℓ1-regression algorithm and incorporating recently introduced combinatorial search techniques [12], we develop a simple method, fastBE (fast Bulk Evolution), for phylogenetic inference from multi-sample bulk DNA sequencing data. We show that on simulated data, fastBE outperforms existing methods for inferring the ground truth phylogeny across a variety of metrics—including sample efficiency—while running orders of magnitude faster. For example, fastBE solves simulated instances with up to 1000 clones and 100 samples in under half an hour. Building on the scalability of fastBE, we develop a phylogeny constrained mutation clustering algorithm which clusters mutations using a mutation phylogeny inferred by fastBE and outperforms state-of-the-art mutation clustering methods. Applying fastBE to a multi-sample dataset from fourteen patients with acute lymphoblastic leukemias [16], we show that our inferred mutation phylogenies are similar to those inferred by Pairtree [11] and Orchard [12], but better recapitulate observed mutational frequencies and possess fewer violations of the sum condition [5–7, 24]. Finally, we demonstrate that on two patient-derived colorectal cancer models fastBE finds mutation phylogenies with fewer violations of the sum condition compared to existing methods.
2 Materials and methods
2.1 Background and related work
Following previous work, we restrict attention to somatic, single-nucleotide mutations in copy neutral regions as phylogenetic characters of cancer evolution. We assume that each genomic locus is mutated exactly once during this evolutionary process, known as the infinite sites assumption [5–11]. The problem of inferring a phylogeny from DNA sequencing data from multiple bulk samples then corresponds to a matrix factorization problem—called the variant allele frequency factorization problem (VAFFP) [5]—which we describe below. For extending this model under violations of the infinite sites assumption, see Sections B.5, B.6 in S1 Text.
Under these assumptions, the evolutionary history of a tumor is described as a rooted, phylogenetic tree where the vertices V correspond to sub-populations of cells containing identical sets of mutations, or clones, and the edges E define ancestral relationships between the clones. Mathematically, a mutation is represented by its position j in the genome and a clone bi is a length n binary vector where bi,j = 1 (resp. bi,j = 0) denotes the presence (resp. absence) of mutation j in the ith clone. As each mutation occurs exactly once during the evolutionary process, there are n distinct clones bi, and they form an important subclass of perfect phylogenies [25, 26] where the internal vertices, in addition to the leaves, are labeled [5, 27]. Then, the evolutionary history of a tumor is given by a vertex labeled, perfect phylogeny with n vertices, which we call an n-clonal tree to emphasize that each of the n vertices corresponds to a distinct tumor clone.
In practice, rather than modeling the evolution of individual mutations, mutations are typically clustered into mutation clusters using tools such as PyClone [28, 29] or SciClone [30] to make inference computationally tractable. These clusters of mutations are then assumed to both evolve together and satisfy the infinite sites assumption.
Definition 1. A rooted tree $\mathcal{T}$ with vertices V = [n] is an n-clonal tree if each edge (i, j) is labeled by mutation j. We simply write clonal tree when n is clear by context.
The root of an n-clonal tree $\mathcal{T}$ is assigned the unique mutation that does not appear as a label on any edge and is denoted r when the tree is clear by context. The vertices and edges of $\mathcal{T}$ are denoted as $V(\mathcal{T})$ and $E(\mathcal{T})$. The parent of a non-root vertex i in $\mathcal{T}$ is written as δ(i) and the set of children of a vertex i is written as C(i). The depth d(i) of a vertex i in $\mathcal{T}$ is the number of edges on the path from the root r to i. The depth of the tree $\mathcal{T}$ is the maximum depth of any vertex i in $\mathcal{T}$. The clone bi corresponds to vertex i of the n-clonal tree and contains all mutations occurring on the unique path from r to i in $\mathcal{T}$. We summarize all clones in $\mathcal{T}$ with an n-clonal matrix where each clone is a row of the matrix.
Definition 2. The n-clonal matrix $B_{\mathcal{T}}$ of an n-clonal tree $\mathcal{T}$ is the n-by-n binary matrix such that bi,j = 1 if and only if either i is a descendant of j in $\mathcal{T}$ or i = j. We drop the subscript $\mathcal{T}$ when it is clear by context.
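Definition 2 can be illustrated with a minimal sketch that builds the clonal matrix by walking from each vertex up to the root. The function name `clonal_matrix`, the 0-indexed vertices, and the convention that the root has parent `-1` are assumptions made for this example and are not part of fastBE.

```python
def clonal_matrix(parent):
    """Build the n-by-n clonal matrix of a rooted tree.

    parent[i] is the parent of vertex i, or -1 for the root (assumed
    convention). Row i is clone i: B[i][j] = 1 exactly when j == i or
    j is an ancestor of i.
    """
    n = len(parent)
    B = [[0] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j >= 0:          # walk from i up to the root
            B[i][j] = 1
            j = parent[j]
    return B


# Hypothetical 4-vertex tree: 0 is the root, 1 and 2 are its children,
# and 3 is a child of 1.
B = clonal_matrix([-1, 0, 0, 1])
for row in B:
    print(row)
```

Each row lists the mutations on the path from the root to that vertex, so the root row contains a single 1 and every diagonal entry is 1.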
The model of bulk DNA sequencing is then as follows: each of m samples consists of a mixture of distinct clones and the sequencing experiment measures the frequency of all mutations in this mixture. More formally, it is assumed that each of the m measurements is a convex combination [31] of the n clones in $\mathcal{T}$. That is,
$$f_i = \sum_{j=1}^{n} u_{i,j}\, b_j, \qquad u_{i,j} \ge 0, \qquad \sum_{j=1}^{n} u_{i,j} = 1, \tag{1}$$
where clone bi is the ith row of B. This model is often summarized compactly in matrix notation as F = UB, where F = [fi] and U = [ui] are m-by-n matrices, fi,j is the frequency of mutation j in sample i, and ui,j is the fraction of clone j in sample i. As such, we call F a frequency matrix and any right stochastic matrix U a usage matrix.
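To make the mixture model F = UB concrete, the following sketch mixes the clones of a small hypothetical tree. The matrices `B` and `U` and the helper name `mix` are illustrative assumptions and do not come from the paper's data.

```python
def mix(U, B):
    """Return F = U B for a usage matrix U (m-by-n) and clonal matrix B (n-by-n)."""
    n = len(B)
    return [[sum(u[k] * B[k][j] for k in range(n)) for j in range(n)] for u in U]


# Clonal matrix of a hypothetical tree: vertex 0 is the root, vertices 1
# and 2 are its children, and vertex 3 is a child of vertex 1.
B = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
]

# Two samples; each row is non-negative and sums to one.
U = [
    [0.5, 0.25, 0.25, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]

F = mix(U, B)
for row in F:
    print(row)
```

In this example the root mutation has frequency one in both samples because every clone carries it, while the frequency of each other mutation is the total usage of the clones in its subtree.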
Under this model, the problem of reconstructing the evolutionary history of the tumor becomes equivalent to factoring the observed frequency matrix F into its constituents U and B. While it would be desirable to solve this factorization problem exactly, imperfect measurement makes this challenging in practice and instead, most methods attempt to infer U and B such that F ≈ UB. For example, CITUP [7] attempts to find U and B such that the squared error $\|F - UB\|_2^2$ is minimized. In the general setting, we have a loss function L(F, U, B) that provides a measure of error between the observed frequency matrix F and the inferred matrices U and B.
Problem 1 (The Variant Allele Frequency L-Factorization Problem (L-VAFFP)). Given a frequency matrix F, find a clonal matrix B and a usage matrix U such that the loss L(F, U, B) is minimized.
Multiple variations of the L-VAFFP have been studied in the literature, with different choices of loss function L(F, U, B) and additional constraints, as summarized in Table 1. In particular, CITUP [7] jointly clusters mutations and infers a phylogeny, using a regularized L2 loss to avoid overfitting on the number of mutation clusters. CITUP is an exact algorithm to solve this problem based on exhaustive enumeration of clonal trees and quadratic integer programming. LICHeE [6] also clusters mutations, and minimizes the total squared violation of the sum condition [5–7, 24] in their inferred tree using quadratic programming. AncesTree [5], on the other hand, does not cluster mutations, and studies two different loss functions. The first loss they study is the 0–1 loss $L_{0\text{-}1}(F, U, B) = \mathbb{1}[F \neq UB]$, which is zero if and only if F = UB. Interestingly, they show that under this loss function, the variant allele frequency factorization problem is NP-complete, which implies that it is also NP-complete under any Lp loss. As real data is quite noisy, they also study a variant of the L1 loss obtained from adding the hard constraint that |Fij − (UB)ij| ≤ ϵi,j for some ϵ > 0. For both loss functions, they use an integer linear programming formulation to solve the problem exactly. CALDER [10] builds on the approach of AncesTree, but adds an additional hard constraint that the inferred matrices U and B are longitudinally consistent, leveraging the temporal information present in certain experimental settings. Further, rather than minimizing the L1 loss, they minimize the L0 loss on the usage matrix U. Again, they utilize integer linear programming to solve this optimization problem exactly. All methods are summarized in Table 1.
The loss functions L(F, U, B) and additional constraints beyond are noted for each method. Due to space constraints, we do not describe the regularization term(s) appearing in the loss function for several methods [6–9, 11], which penalize the total number of mutation clusters. *CALDER enforces an additional hard constraint that the inferred matrices are longitudinally consistent, when provided with additional longitudinal information.
The main focus of this work is the structured regression problem that appears when the clonal matrix B is fixed. Specifically, when B is fixed, the problem reduces to finding a usage matrix U such that the loss L(F, U, B) is minimized. We call this a structured regression problem as the aim is to regress the frequency matrix F against the clonal matrix B where B has a unique, combinatorial structure which we describe in Section 2.2.1.
Problem 2 (The Variant Allele Frequency L-Regression Problem (L-VAFRP)). Given a frequency matrix F and a clonal matrix B, find a usage matrix U such that the loss L(F, U, B) is minimized. We call this minimum loss L*(F, B).
In the special case where the loss L is the Lp loss for p ∈ {1, 2}, this regression problem is solvable in polynomial time. More formally, since ui,j ≥ 0 and $\sum_{j=1}^{n} u_{i,j} = 1$ are linear constraints (1) on the matrix U, the L1 and L2 regression problems can be formulated as linear and convex quadratic programs respectively, which are both solvable in polynomial time. Throughout the remainder of this work, we will focus on the L1 regression and factorization problems, which we call the variant allele frequency ℓ1-regression problem (ℓ1-VAFRP) and variant allele frequency ℓ1-factorization problem (ℓ1-VAFFP), respectively. Further, though a slight abuse of notation, we will write $\|F - UB\|_p$ instead of Lp(F, U, B).
In contrast to the aforementioned approaches, which preprocess the read count data to obtain the observed frequency matrices F, probabilistic approaches such as PhyloWGS [8], PASTRI [9], Pairtree [11], and Orchard [12] explicitly model the read count data. Further, rather than minimize a loss, these methods attempt to sample from the posterior distribution of phylogenetic trees using a variety of sampling techniques. Since these probabilistic approaches are challenging to succinctly describe, we refer to the original publications [8, 9, 11, 12] for a complete description and denote their loss generically as in Table 1.
General purpose linear programming software solves the ℓ1-VAFRP in polynomial time. However, linear programming solvers do not exploit the special structure of B and have worse asymptotic complexity as compared to the algorithm we derive in this work. Numerical methods from convex optimization such as the Alternating Direction Method of Multipliers and Projected Gradient Descent Method [31] can be used to solve the L2 variant allele frequency regression problem to a guaranteed optimality threshold. However, these blackbox methods are again quite slow and do not yield an exact solution. As such, [32] designed an improved algorithm that solves the L2 regression problem exactly in time.
2.2 A structured regression model for the ℓ1-VAFFP
Structured regression and local search have played a pivotal role in the success of methods for performing distance based reconstruction of phylogenetic trees. In particular, state-of-the-art methods for distance based phylogenetics such as FastME [33] and FastTree [34, 35] work by locally exploring phylogenetic tree space (i.e. a tree space polytope [36, 37]) and regressing the observed distances against the tree to infer the unknown branch lengths, picking the tree which best explains the observed distances (Fig 1). Pivotal to the success of these methods are efficient algorithms [20, 38, 39] for computing and recomputing the solution to the structured regression problem. Importantly, these algorithms leverage the structured relationship between the branch lengths, the tree, and the induced distances.
The usage matrix U describes the fraction of each clone across all samples, and the clonal matrix BT describes the genotype of each clone in T. The matrix AT describes the system of linear equations relating the branch lengths x to the distance vector d.
Despite the success of structured regression as a tool for distance based phylogenetics, such an approach has not yet been studied in the context of tumor evolution. Here, we derive an efficient algorithm for the ℓ1-VAFRP by exploiting the combinatorial structure of clonal matrices. In particular, we start by studying the structure of clonal matrices, both summarizing and extending existing results. Then, narrowing our focus, we derive an equivalent characterization of the ℓ1-VAFRP, which emphasizes the tree structure inherent in the problem. Finally, we use our characterization to design an efficient algorithm for the ℓ1-VAFRP which enables fast recomputation upon subtree prune-and-regraft (SPR) operations [23], providing the main theoretical result of our work, as stated below.
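Theorem 1. Given a clonal tree $\mathcal{T}$ with n vertices and an m-by-n frequency matrix F, the minimum
$$L^*(F, B_{\mathcal{T}}) = \min_{U} \|F - U B_{\mathcal{T}}\|_1 \tag{2}$$
over all usage matrices U can be found in $O(mnd)$ time, where d is the depth of $\mathcal{T}$.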
As mentioned above, our algorithm for solving the ℓ1-VAFRP is also able to efficiently recompute the minimum upon slight modifications to the tree topology, in the sense of the following corollary.
Corollary 1. Given a clonal tree with n vertices and an m-by-n frequency matrix F, the following queries can be efficiently answered after
pre-processing time using
space.
- (i) For a subtree prune-and-regraft (SPR) operation on vertices i and j which results in a tree
, the minimum
can be queried in
time.
- (ii) For the operation of attaching a new vertex j as a child of a vertex i to obtain a tree
and appending a corresponding column to the frequency matrix F to obtain F′, the minimum
can be queried in
time.
Importantly, the depth d of a tree on n vertices is at most n − 1, implying our algorithm runs in quadratic (in n) time in the worst case and improves upon linear programming based approaches. However, for reasonable classes of trees, the complexity is much better. For example, the expected depth of a rooted spanning tree drawn uniformly at random from the complete graph Kn is of order $\sqrt{n}$; similar bounds can also be derived for the random spanning trees of an arbitrary graph G [40, 41]. All proofs can be found in Supplementary Results A in S1 Text.
2.2.1 Clonal trees and matrices.
The most salient feature of clonal matrices is that they are (two-state) perfect phylogeny matrices [5], and this allows us to tap into the theory which studies such matrices. However, the class of clonal matrices is much more restrictive: there are a handful of useful results concerning clonal matrices which are not applicable to perfect phylogeny matrices. For example, a perfect phylogeny matrix need not be square and thus it is not necessarily invertible, while clonal matrices are always invertible [5].
We continue the study of clonal matrices by drawing an analogy to perfect phylogeny. Specifically, we introduce a new, recursive definition of clonal matrices that is inspired by the recursive definition of perfect phylogeny matrices described by Gusfield [25, 26]. Formally, we show that a matrix B is an n-clonal matrix if and only if the rows and columns of B can be reordered such that B is clonally canonical, a term we define below.
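Definition 3. A matrix B is clonally canonical if i) Bi,1 = 1 for all i ∈ [n], ii) B1,j = 0 for all j ∈ {2, …, n}, and iii) there exist clonally canonical matrices B1, B2, …, Bk such that B2:n,2:n is block diagonal with blocks B1, B2, …, Bk. That is, B is clonally canonical if it has the form:
$$B = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ \mathbf{1} & B_1 & 0 & \cdots & 0 \\ \mathbf{1} & 0 & B_2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{1} & 0 & 0 & \cdots & B_k \end{bmatrix}$$
where $\mathbf{1}$ denotes an all-ones column vector and 0 denotes a zero block of the appropriate size.
Conveniently, for a given clonal tree $\mathcal{T}$, it is straightforward to construct a clonally canonical matrix associated with $\mathcal{T}$ by relabeling the clones by their preorder traversal index during a depth first search starting at r. With this definition, we are now ready to state several equivalent characterizations of clonal matrices.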
Proposition 1. For any binary n-by-n matrix B, the following conditions are equivalent:
- (i) B is an n-clonal matrix.
- (ii) The rows and columns of B can be reordered to make B clonally canonical.
- (iii) B satisfies the following conditions [5]:
- (a) There exists exactly one index r such that $\sum_{j=1}^{n} B_{r,j} = 1$.
- (b) For all i ≠ r, there exists exactly one j such that Bj,k = 1 implies Bi,k = 1 and $\sum_{k=1}^{n} (B_{i,k} - B_{j,k}) = 1$.
- (c) Bi,i = 1 for all indices i ∈ [n].
The clonally canonical form of the clonal matrix associated with a clonal tree $\mathcal{T}$ is especially useful since it enables efficient multiplication and inversion of B.
Proposition 2. Let B be an n-by-n clonally canonical matrix associated with the clonal tree $\mathcal{T}$. Then, the four matrix vector products $Bv$, $B^{T}v$, $B^{-1}v$, and $B^{-T}v$ can be computed in $O(n)$ time for any vector $v \in \mathbb{R}^n$ when given the clonal tree $\mathcal{T}$.
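One way the linear-time products of Proposition 2 might be realized is sketched below. It never forms B and uses only the tree: Bv is a sum over ancestors, B^T v is a sum over subtrees, and the two inverse products are local differences. The function name `clonal_products`, the 0-indexed vertices, and the parent-array input with `-1` for the root are assumptions of this sketch, which does not rely on the canonical ordering of the proposition.

```python
def clonal_products(parent, v):
    """Return (Bv, B^T v, B^-1 v, B^-T v) in O(n) time without forming B.

    parent[i] is the parent of vertex i, or -1 for the root (assumed
    convention); B[i][j] = 1 exactly when j == i or j is an ancestor of i.
    """
    n = len(parent)
    children = [[] for _ in range(n)]
    for i, p in enumerate(parent):
        if p >= 0:
            children[p].append(i)

    # Breadth-first order from the root: parents always precede children.
    order = [i for i, p in enumerate(parent) if p < 0]
    for u in order:
        order.extend(children[u])

    # (Bv)_i sums v over i and its ancestors: a top-down running sum.
    Bv = [0.0] * n
    for u in order:
        Bv[u] = v[u] + (Bv[parent[u]] if parent[u] >= 0 else 0.0)

    # (B^T v)_j sums v over the subtree of j: a bottom-up accumulation.
    BTv = [float(x) for x in v]
    for u in reversed(order):
        if parent[u] >= 0:
            BTv[parent[u]] += BTv[u]

    # (B^-1 v)_i = v_i - v_parent(i), and the root keeps v_i.
    Binv_v = [v[u] - (v[parent[u]] if parent[u] >= 0 else 0.0) for u in range(n)]

    # (B^-T v)_i = v_i minus the sum of v over the children of i.
    BinvT_v = [v[u] - sum(v[c] for c in children[u]) for u in range(n)]
    return Bv, BTv, Binv_v, BinvT_v


# Hypothetical tree: 0 is the root, 1 and 2 its children, 3 a child of 1.
products = clonal_products([-1, 0, 0, 1], [1, 2, 3, 4])
```

Each of the four loops visits every vertex or edge once, which is where the O(n) bound comes from.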
We conclude this section with an algebraic description of the inverse of B, which was previously described in [32], and will serve as a key technical ingredient for deriving an equivalent formulation of the ℓ1-VAFRP. Define the adjacency matrix A of $\mathcal{T}$ as the n-by-n matrix such that Ai,j = 1 if j is the parent of i in $\mathcal{T}$ and 0 otherwise. Then, the inverse of B is (I − A), where I is the identity matrix.
Lemma 1. For any clonal matrix B associated with a clonal tree $\mathcal{T}$ having adjacency matrix A, B = (I − A)−1. Consequently, [B−1v]i = vi − vδ(i) if i ≠ r, and [B−1v]i = vi when i = r.
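As a small worked instance of the consequence stated in Lemma 1, consider a hypothetical three-vertex path with root 1, δ(2) = 1, and δ(3) = 2. The clonal matrix below follows Definition 2, and its inverse can be checked by direct multiplication.

```latex
B = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix},
\qquad
B^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ 0 & -1 & 1 \end{bmatrix},
\qquad
B^{-1} v = \begin{bmatrix} v_1 \\ v_2 - v_1 \\ v_3 - v_2 \end{bmatrix}.
```

Each non-root entry of B−1v is the difference between a vertex value and its parent's value, and the root entry is left unchanged.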
2.2.2 An equivalent formulation of the VAF ℓ1-regression problem.
In this section, we show that the ℓ1-VAFRP is equivalent to a constrained vertex labeling problem on the clonal tree $\mathcal{T}$ associated with B. Specifically, we prove that the ℓ1-VAFRP is equivalent to a special case of the dot product tree labeling problem, defined as follows, for the case when m = 1 and F = fT. The general case of m > 1 follows from the separability of the objective $\|F - UB\|_1 = \sum_{i=1}^{m} \|f_i - u_i B\|_1$ across the rows of F and U.
Problem 3 (Dot Product Tree Labeling Problem (DPTLP)). Given a rooted tree $\mathcal{T}$ with vertices [n] and a vector $w \in \mathbb{R}^n$, find a non-negative vector $x \in \mathbb{R}^n$ such that xTw is maximized and |xi − xδ(i)| ≤ 1 for all i ≠ r, where δ(i) is the parent of vertex i in $\mathcal{T}$.
To derive the equivalence between the ℓ1-VAFRP and the DPTLP, we start by writing the ℓ1-VAFRP as a linear program (LP) in standard form [31]. Using the usual trick [31] for converting the ℓ1 norm to a linear objective with linear constraints, we write the ℓ1-VAFRP as an LP (Fig 2). Then, we write out the dual problem by associating a dual variable αi with the constraint in (3), a dual variable βi with the constraint in (4), and a dual variable γ with the constraint in (5), thus obtaining our dual LP (Fig 2).
To simplify the dual form of the LP, we perform a change of variables by setting λi = βi − αi. Since αi and βi are non-negative and their sum is bounded by 1, λi ∈ [−1, 1]. Then, writing the constraints in matrix form and using a slack variable ψ to remove the inequality constraint, we have the following equivalent, dual LP.
(8)
(9)
Applying Lemma 1 to the matrix B in (8) and invoking LP duality then proves the following theorem.
Theorem 2. Given a length n frequency vector f and an n-by-n clonal matrix B corresponding to a clonal tree with root vertex 1, the minimum of ‖fT − uTB‖1 over all usage vectors is equal to the maximum of
(10) over all non-negative vectors
such that |x1 − xn+1| ≤ 1 and |xi − xδ(i)| ≤ 1 for all i ∈ [n].
In other words, the above theorem states that the ℓ1-VAFRP is equivalent to a special case of the DPTLP where we append a parent labeled n + 1 to the root vertex and appropriately set the vector w. Interestingly, the sum condition [5–7, 24], that is, the requirement that
$$f_i \ge \sum_{j \in C(i)} f_j \quad \text{for all } i \in [n], \tag{11}$$
appears almost unexpectedly in (10). Using the appearance of the sum condition in (10), we extend the theory of El-Kebir et al. [5] which states that we can find a usage vector uT such that fT = uTB if and only if the sum condition (11) is satisfied. In particular, we show that the total violation of the sum condition also provides a lower bound on the ℓ1 error.
Corollary 2. Let f be a frequency vector of length n, let B be an n-by-n clonal matrix, and let . Then
or the total violation of the sum condition (11). Thus,
if and only if the sum condition is satisfied.
Finally, we observe that as a consequence of the above corollary, the sum condition is somewhat redundant when minimizing the ℓ1 error ‖F − UB‖1. This is because, by the above corollary, the ℓ1 error is an upper bound on the total violation of the sum condition, so driving the ℓ1 error toward zero also drives the total violation of the sum condition toward zero.
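The sum condition discussed above can be checked directly from a tree and a frequency vector. The sketch below assumes that the total violation is the sum, over vertices, of the amount by which the children's frequencies exceed the parent's frequency; this definition, the function name `sum_condition_violation`, and the parent-array input with `-1` for the root are assumptions of the example rather than the paper's exact statement.

```python
def sum_condition_violation(parent, f):
    """Total violation of the sum condition for one sample.

    parent[i] is the parent of vertex i, or -1 for the root (assumed
    convention). The violation at vertex i is taken to be
    max(0, sum of f over the children of i, minus f[i]).
    """
    n = len(parent)
    child_sum = [0.0] * n
    for i, p in enumerate(parent):
        if p >= 0:
            child_sum[p] += f[i]
    return sum(max(0.0, child_sum[i] - f[i]) for i in range(n))


# Hypothetical star tree: root 0 with children 1 and 2.
print(sum_condition_violation([-1, 0, 0], [1.0, 0.5, 0.25]))  # condition holds
print(sum_condition_violation([-1, 0, 0], [0.5, 0.5, 0.25]))  # children exceed the root
```

A value of zero means every vertex has a frequency at least as large as the combined frequency of its children.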
2.2.3 An algorithm for the DPTLP.
In this section, we develop an efficient algorithm for the DPTLP. There are two key ideas underlying our algorithm. The first idea is that exploiting the tree structure enables us to express the solution for the subtree rooted at a vertex i in terms of the solution for the subtrees rooted at the children of i. This expression is derived using standard techniques for dynamic programming on trees [42]. The second idea is more technical, and is based on the observation that the solution of this recurrence is a concave piecewise linear function, which we represent compactly as a list of size $O(d)$, where d is the depth of $\mathcal{T}$. Combining these two ideas yields an $O(nd)$ algorithm for the DPTLP, as stated below.
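Theorem 3. Given a rooted tree $\mathcal{T}$ of depth d with vertices [n] and a vector $w \in \mathbb{R}^n$, the DPTLP can be solved in $O(nd)$ time with the following recurrence,
$$g_i(\psi) = w_i \psi + \sum_{j \in C(i)} \max_{\psi' \ge 0,\ |\psi' - \psi| \le 1} g_j(\psi'), \tag{12}$$
where gi(ψ) is the optimal solution to the DPTLP for the subtree rooted at vertex i, when vertex i is assigned the label ψ.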
We will prove this theorem by first deriving a recurrence relation for the solution of the DPTLP. Assume gi(ψ) is the optimal solution to the DPTLP for the subtree rooted at i when the root vertex i is assigned the label ψ. Then, gi(ψ) satisfies the following recurrence relation:
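$$g_i(\psi) = w_i \psi + \sum_{j \in C(i)} h_j(\psi), \qquad h_j(\psi) = \max_{\psi' \ge 0,\ |\psi' - \psi| \le 1} g_j(\psi'). \tag{13}$$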
Importantly, to solve the DPTLP, it is necessary and sufficient to compute maxψ≥0gr(ψ) for the root vertex r. Unfortunately, however, the straightforward technique [42] of storing a dynamic programming table for this recurrence will not work because the number of possible values of ψ ≥ 0 is infinite. Thus, we need to find an alternative way to describe and represent the functions gj and hj appearing in this recurrence.
The key mathematical idea underpinning our algorithm is that the functions gj and hj appearing in the recurrence are not arbitrary. Rather, they form a special class of functions which admit a compact representation and are convenient to work with. In particular, the functions gj and hj are concave (and thus continuous) piecewise linear functions with a finite number of breakpoints at coordinates {1, 2, …, k}, formally defined below.
Definition 4. Let $\mathcal{L}_k$ be the set of concave piecewise linear functions with breakpoints at integers 1, …, k; i.e., a continuous function $f : [0, \infty) \to \mathbb{R}$ is in $\mathcal{L}_k$ if f is linear with slopes s1 ≥ s2 ≥ … ≥ sk+1 on the intervals I1 = [0, 1), I2 = [1, 2), …, Ik+1 = [k, ∞) respectively.
Note that the sets $\mathcal{L}_k$ form a strictly increasing nested sequence $\mathcal{L}_0 \subset \mathcal{L}_1 \subset \mathcal{L}_2 \subset \cdots$ of sets of piecewise linear functions. Further, every function f in the class $\mathcal{L}_k$ is represented by a tuple (y, s1, …, sk+1) of size k + 2 giving the intercept y and slopes si of each piece of f.
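The tuple representation can be made concrete with a short evaluator. This is a minimal sketch that assumes the intercept y is the value f(0) and stores the slopes s1, …, sk+1 in a Python list; the function name `evaluate` is hypothetical.

```python
def evaluate(y, slopes, psi):
    """Evaluate f(psi) for psi >= 0 from its tuple representation.

    y is taken to be f(0) (an assumption of this sketch), slopes[t] is the
    slope on [t, t + 1), and the last slope extends to infinity.
    """
    value = y
    for t, s in enumerate(slopes):
        is_last = t == len(slopes) - 1
        # Length of the part of [0, psi] that falls inside piece t.
        width = psi - t if is_last else min(1.0, psi - t)
        if width <= 0:
            break
        value += s * width
    return value


# A hypothetical member of L_3: it rises until psi = 2 and then falls.
g = (0.0, [2.0, 1.0, -1.0, -3.0])
values = [evaluate(*g, psi) for psi in (0, 1, 2, 2.5, 4)]
```

Because the slopes are non-increasing, the maximum of such a function over ψ ≥ 0 is attained at an integer breakpoint or at zero whenever the last slope is non-positive.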
Now, we will prove that gj and hj are in the class $\mathcal{L}_{d+1}$, where d is the depth of $\mathcal{T}$. To start, notice that if i is a leaf vertex, then gi(ψ) = ψ ⋅ wi is a linear function and is thus in $\mathcal{L}_0$. This provides us with the necessary base case required to prove that gj is in $\mathcal{L}_{d+1}$. Next, we prove the inductive step of our claim: if $g_j \in \mathcal{L}_k$, then $h_j \in \mathcal{L}_{k+1}$. The results then follow by induction on the depth of $\mathcal{T}$. We start with the following description of hj in terms of gj.
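Lemma 2. Suppose $g \in \mathcal{L}_k$ and is represented by the tuple (y, s1, s2, …, sk+1). Let i* be the largest index i such that si ≥ 0. If si < 0 for all i, we set i* = −∞ and if si ≥ 0 for all i, we set i* = ∞. Then,
$$h(\psi) = \max_{\psi' \ge 0,\ |\psi' - \psi| \le 1} g(\psi')$$
is in $\mathcal{L}_{k+1}$ and satisfies
$$h(\psi) = \begin{cases} g(\psi + 1) & \text{if } \psi < i^* - 1, \\ g(i^*) & \text{if } i^* - 1 \le \psi \le i^* + 1, \\ g(\max\{\psi - 1, 0\}) & \text{if } \psi > i^* + 1. \end{cases} \tag{14}$$
Proof. If i* = ∞, all slopes si are non-negative and the function g(ψ′) is non-decreasing. Thus, the maximum of g(ψ′) over any interval [ψ − 1, ψ + 1] is achieved at the interval's rightmost value ψ + 1.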
If i* = −∞, all slopes si are negative and the function g(ψ′) is strictly decreasing. Thus, the maximum of g(ψ′) over any interval [ψ − 1, ψ + 1], is achieved at the interval’s left most value ψ − 1. However, since ψ′ is constrained to be non-negative, if ψ < 1 the maximum is achieved at ψ′ = 0.
If i* ≠ ∞, −∞, then g(ψ′) is non-decreasing on the interval [0, i*] and non-increasing on the interval [i*, ∞). Further, the maximum of g(ψ′) over all non-negative ψ′ is bounded and equal to g(i*). The result then follows by a case analysis on the value of ψ. If ψ is in [i* − 1, i* + 1], then we can take ψ′ = i* and achieve the maximum. If ψ < i* − 1, then the function g(ψ′) is non-decreasing on the interval [ψ − 1, ψ + 1] and the maximum is achieved at the interval's rightmost value ψ + 1. Symmetrically, if ψ > i* + 1, then the function g(ψ′) is non-increasing on the interval [ψ − 1, ψ + 1] and the maximum is achieved at the interval's leftmost value ψ − 1, which is always non-negative since ψ > 1. As this covers all possible cases, this proves that h has the form (14).
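To see that h(ψ) is continuous, observe that
$$h(i^* - 1) = g\big((i^* - 1) + 1\big) = g(i^*) \quad \text{and} \quad h(i^* + 1) = g\big((i^* + 1) - 1\big) = g(i^*)$$
at the only candidates for discontinuity, i* − 1 and i* + 1. To see that it is concave, note that
$$h''(\psi) = \begin{cases} g''(\psi + 1) & \text{if } \psi < i^* - 1, \\ 0 & \text{if } i^* - 1 < \psi < i^* + 1, \\ g''(\psi - 1) & \text{if } \psi > i^* + 1. \end{cases}$$
As g(ψ) is concave, its second derivative g″(ψ) is non-positive, which implies that h(ψ) is also concave. Since h is trivially piecewise linear by (14), the proof is complete.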
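The case analysis in the proof can be checked numerically. The following is a minimal sketch, not the fastBE implementation: it assumes that the slope si of the tuple (y, s1, …, sk+1) applies on the unit interval [i − 1, i] (which is how we read the statement that g is non-decreasing on [0, i*]), and the helper names `evaluate` and `window_max` are hypothetical.

```python
import math

def evaluate(rep, psi):
    """Evaluate a piecewise linear function at psi >= 0.

    rep = (y, s_1, ..., s_{k+1}). ASSUMPTION: slope s_i applies on the unit
    interval [i-1, i], and the last slope extends to infinity.
    """
    y, slopes = rep[0], rep[1:]
    value = y
    for idx, slope in enumerate(slopes):
        left = idx
        right = idx + 1 if idx < len(slopes) - 1 else math.inf
        if psi <= left:
            break
        value += slope * (min(psi, right) - left)
    return value

def window_max(rep, psi):
    """Maximum of g over [max(0, psi - 1), psi + 1], using the three cases of the proof."""
    slopes = rep[1:]
    nonneg = [i for i, s in enumerate(slopes, start=1) if s >= 0]
    if not nonneg:                      # i* = -infinity: g is strictly decreasing
        return evaluate(rep, max(psi - 1, 0))
    i_star = max(nonneg)
    if i_star == len(slopes):           # i* = infinity: g is non-decreasing
        return evaluate(rep, psi + 1)
    if psi < i_star - 1:                # window lies left of the peak
        return evaluate(rep, psi + 1)
    if psi > i_star + 1:                # window lies right of the peak
        return evaluate(rep, psi - 1)
    return evaluate(rep, i_star)        # window contains the peak i*
```

For a concave input with tuple (0, 3, 1, −2, −4), the peak is at i* = 2, so `window_max` returns g(2) = 4 for every ψ in [1, 3].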
Stated in terms of the tuple representations of hj and gj, we have the following equivalent result.
Proposition 3. Suppose f ∈ Lk and is represented by the tuple (y, s1, s2, …, sk+1). Let i* be the largest index i such that si ≥ 0. If si < 0 for all i, we set i* = −∞ and if si ≥ 0 for all i, we set i* = ∞. Then,
is in
and is represented by the tuple
In summary, we have shown that i) if i is a leaf, gi is in , and ii) if gj is in
, then hj is in
. The final step is to show that if the function hj is in
for all children j of i, then gi is also in
. This follows from the observation that the class
is closed under addition. The result is summarized below.
Proposition 4. Let gi(ψ) be the optimal solution of the DPTLP for the subtree rooted at vertex i such that i is assigned the label ψ. Then, gi is in
.
We are now ready to prove the main result of this section.
Proof of Theorem 3. The result follows by induction on n, the number of vertices in . In particular, assume that the representation of the function gi for the root vertex i is computable in
time for all trees with fewer than n vertices and depth at most d. Clearly, this holds for n = 1.
Then, let r be the root of and let C(r) be the set of children of r. Let
denote the subtree rooted at j ∈ C(r) and nj denote the size of
. By the inductive hypothesis, we can compute the representation of gj for j ∈ C(r) in
time, since the depth of the subtree rooted at j ∈ C(r) is at most d + 2. Using Proposition 3, we can compute the representation of hj for all j ∈ C(r) in
time as these functions are represented by tuples of length at most d + 3.
Finally, we compute the representation of gr by observing that the class is closed under addition and that the representation of gr is easily computed from the representations of hj. In particular, summing the tuples representing hj coordinate-wise, we obtain the representation of gr.
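To illustrate why coordinate-wise addition of tuples corresponds to addition of the underlying functions, consider the following small sketch. It makes two assumptions that are not stated in the text: slope si applies on the unit interval [i − 1, i], and a shorter tuple is padded by repeating its final slope (which leaves the function it represents unchanged). The names `evaluate` and `add_reps` are hypothetical.

```python
import math

def evaluate(rep, psi):
    """Evaluate rep = (y, s_1, ..., s_{k+1}) at psi >= 0.

    ASSUMPTION: slope s_i applies on [i-1, i]; the last slope extends to infinity.
    """
    y, slopes = rep[0], rep[1:]
    value = y
    for idx, slope in enumerate(slopes):
        left = idx
        right = idx + 1 if idx < len(slopes) - 1 else math.inf
        if psi <= left:
            break
        value += slope * (min(psi, right) - left)
    return value

def add_reps(a, b):
    """Add two tuple representations coordinate-wise.

    The shorter tuple is padded by repeating its last slope, which does not
    change the function because that slope already extends to infinity.
    """
    length = max(len(a), len(b))
    pad = lambda r: tuple(r) + (r[-1],) * (length - len(r))
    return tuple(x + y for x, y in zip(pad(a), pad(b)))
```

Under these assumptions, `evaluate(add_reps(a, b), psi)` agrees with `evaluate(a, psi) + evaluate(b, psi)` for every ψ ≥ 0, and the sum of two concave inputs remains concave because the summed slopes are still non-increasing.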
Then, by reducing the ℓ1-VAFRP to the DPTLP using Theorem 2 and applying Theorem 3, we complete the proof of the first part of Theorem 1 for the special case where m = 1. The general case of m > 1 follows by noting that the objective is separable and that each term
can be minimized independently.
To see Corollary 1, notice that after pruning and regrafting any subtree as a child of a vertex j, gk only changes if the vertex k is on the path from root
to the vertex δ(i) or the vertex j—this follows from Eq (13). Since there are at most 2 ⋅ max{d(j), d(i)} vertices on these paths, this observation yields the statement (i) in Corollary 1, as long as the representations of gk are stored. The second statement (ii) in Corollary 1 follows by a similar argument.
2.3 A deterministic search algorithm for the ℓ1-VAFFP
Here we describe a deterministic search algorithm, fastBE, for the ℓ1-VAFFP which builds upon the fast regression algorithm described in Section 2.2 and is inspired by the beam search techniques used by Orchard [12]. Our algorithm iteratively constructs the inferred tree one mutation at a time, choosing the best vertex placement for each mutation in the current tree, while allowing added vertices to “adopt” children from their parent. A vertex u is said to adopt the child w of a vertex v upon removing the edge (v, w) from the tree and adding the edge (u, w) to the tree. This procedure is described formally as follows:
- Fix an order O = {o1, …, on} in which to append the n mutations and initialize the starting tree as the single-vertex tree with root o1.
- For i = 2, …, n:
- (a) Let the current tree be the tree built from the mutations o1, …, oi−1, and let the current frequency matrix F′ be the submatrix of F spanned by the columns o1, …, oi.
- (b) Find the parent p and the subset S of the children of p such that the tree obtained from attaching oi as a child of p in the current tree and adopting the children S as children of oi minimizes the loss.
- (c) Set the next tree to be this minimizing tree.
- Output the final tree.
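The procedure above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the fastBE implementation: the tree is stored as a hypothetical child-to-parent dictionary, every subset of a parent's children is enumerated explicitly, and the loss is a simple stand-in (the total sum-condition violation of the frequencies) rather than the ℓ1-VAFRP objective computed by the regression algorithm of Section 2.2.

```python
from itertools import combinations

def children_of(parent):
    """Map each vertex to the list of its children."""
    kids = {v: [] for v in parent}
    for v, p in parent.items():
        if p is not None:
            kids[p].append(v)
    return kids

def placements(parent, new):
    """Yield every tree obtained by attaching `new` under a vertex p and
    letting `new` adopt a subset S of the children of p."""
    kids = children_of(parent)
    for p in parent:
        for size in range(len(kids[p]) + 1):
            for adopted in combinations(kids[p], size):
                tree = dict(parent)
                tree[new] = p
                for w in adopted:
                    tree[w] = new      # remove edge (p, w), add edge (new, w)
                yield tree

def violation_loss(parent, freqs):
    """STAND-IN loss (an assumption): total violation of the sum condition."""
    kids = children_of(parent)
    num_samples = len(next(iter(freqs.values())))
    total = 0.0
    for v in parent:
        for s in range(num_samples):
            child_sum = sum(freqs[w][s] for w in kids[v])
            total += max(child_sum - freqs[v][s], 0.0)
    return total

def greedy_build(order, freqs, loss=violation_loss):
    """Insert mutations one at a time, keeping the best placement each round."""
    tree = {order[0]: None}
    for new in order[1:]:
        tree = min(placements(tree, new), key=lambda t: loss(t, freqs))
    return tree
```

Each placement is a pair (p, S), so a vertex with c children contributes 2^c candidate trees in this sketch; evaluating the loss from scratch for each candidate is exactly the naïve approach that the recomputation procedure of Theorem 1 is designed to avoid.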
Since at every iteration i of the algorithm there are a total of where
placements of mutation oi to consider, the running time of this algorithm is dominated by the time to compute the objective function
over all such placements. Using the efficient recomputation procedure outlined in the latter part of Theorem 1, we can avoid the naïve approach which takes
time and instead perform all such computations in
time, where
is the average depth of the tree
. Further details on these steps are supplied in Section B.4 in S1 Text.
2.4 Inference of mutation clusters with phylogeny constrained clustering
The scalability of fastBE enables the inference of clone trees directly from mutation-level read count data, foregoing the need to cluster mutations prior to running fastBE. Still, it is often desirable to cluster similar mutations together, for example, to improve interpretability or to estimate tumor heterogeneity. Typically, clustering of mutations is done with specialized tools such as PyClone [28, 29], SciClone [30], EXPANDS [43], or QuantumClone [44]. However, these tools do not exploit the phylogenetic relationships between the mutations, as thus far, no method was able to infer clone trees at mutation granularity.
To incorporate phylogenetic information, we formalize the mutation clustering problem in a k-means fashion, with the additional constraint that the selected mutation clusters form connected components on the inferred clone tree.
Problem 4 (p-Phylogeny Constrained Mutation Clustering (p-PCMC)). Given an n-clonal tree, an m-by-n frequency matrix F, and a number of clusters k, find a clustering C = {C1, …, Ck} of the vertices of the tree and a set of cluster centers c1, …, ck such that the loss
$$\sum_{i=1}^{k} \sum_{j \in C_i} \lVert F_{\cdot,j} - c_i \rVert_p^p \tag{15}$$
is minimized, the clustering C partitions the vertices of the tree, and each cluster Ci forms a connected component in the tree.
Due to the constraint that the clusters Ci form connected components in the tree, we identify each clustering with a set of k − 1 edges describing the cut of the partition C. Consequently, there are $\binom{n-1}{k-1}$ possible clusterings for the p-PCMC problem, corresponding to each selection of k − 1 of the n − 1 edges. Further, for a fixed clustering C, the optimal centers ci are obtained by taking the median vector if p = 1 and the mean vector if p = 2. Thus, the p-PCMC is solvable in polynomial time for fixed k when p ∈ {1, 2} by checking all $\binom{n-1}{k-1}$ possible clusterings.
While the p-PCMC is solvable in polynomial time for fixed k where p ∈ {1, 2}, this naïve solution is computationally prohibitive for clone trees containing upwards of n = 1000 mutations and k = 20 clones. Consequently, we use a heuristic approach to solve the p-PCMC problem, which exploits the fact that the p-PCMC problem is solvable in time for k = 2. In particular, we use a divisive clustering algorithm [45], which builds a clustering by recursively splitting the clustering until k clusters are formed. Divisive clustering runs in the opposite direction to the more frequently used hierarchical, or agglomerative, clustering and is typically not used because finding the optimal split of an existing cluster is a hard problem. However, due to the phylogenetic constraints imposed by the p-PCMC, we are able to optimally solve the splitting step in
time, leading to an extremely effective,
time algorithm for the p-PCMC problem which is optimal when k = 2.
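The divisive strategy can be illustrated with a brute-force sketch for the p = 2 case. This is not the efficient splitting routine referred to above: it simply tries every edge inside a cluster and recomputes the cost from scratch. The child-to-parent dictionary, the per-mutation frequency lists, and the rule of always taking the split that lowers the total loss the most are all assumptions made for illustration.

```python
def cost(cluster, freqs):
    """Sum of squared distances to the mean column (the p = 2 case)."""
    members = list(cluster)
    dims = len(freqs[members[0]])
    center = [sum(freqs[v][s] for v in members) / len(members) for s in range(dims)]
    return sum((freqs[v][s] - center[s]) ** 2 for v in members for s in range(dims))

def subtree_within(v, cluster, parent):
    """Vertices of the connected cluster that lie at or below v."""
    below = {v}
    changed = True
    while changed:
        changed = False
        for u in cluster:
            if u not in below and parent[u] in below:
                below.add(u)
                changed = True
    return below

def best_split(cluster, parent, freqs):
    """Try every edge inside the cluster; return (cost, part_a, part_b) or None."""
    best = None
    for v in cluster:
        if parent[v] in cluster:        # edge (parent[v], v) lies inside the cluster
            below = subtree_within(v, cluster, parent)
            rest = cluster - below
            total = cost(below, freqs) + cost(rest, freqs)
            if best is None or total < best[0]:
                best = (total, below, rest)
    return best

def divisive_clustering(parent, freqs, k):
    """Recursively split clusters along tree edges until k clusters exist."""
    clusters = [set(parent)]
    while len(clusters) < k:
        options = []
        for idx, c in enumerate(clusters):
            split = best_split(c, parent, freqs)
            if split is not None:
                options.append((split[0] - cost(c, freqs), idx, split))
        if not options:                 # every cluster is a single vertex
            break
        _, idx, (_, a, b) = min(options, key=lambda o: o[:2])
        clusters[idx:idx + 1] = [a, b]
    return clusters
```

Because each split removes one tree edge from a connected cluster, both parts remain connected, so the output always satisfies the connectivity constraint of the p-PCMC.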
3 Results
3.1 Runtime comparison to linear programming solvers
We compared an implementation of our structured regression algorithm for the ℓ1-VAFRP to a linear programming (LP) approach on simulated data. In particular, we implemented the natural primal LP formulation solving the ℓ1-VAFRP (Section 2.2.2) using two commercial LP solvers: Gurobi v9.0.3 [46] and CPLEX v22.1.0 [47]. We generated 264 pairs of frequency matrices F and clonal matrices B as described in Section B.1 in S1 Text, and measured the wall-clock runtime of our algorithm and the LP solvers on these simulated instances. Excluding the time required to construct the LP—which would unfairly penalize the LP solvers—we found that our algorithm was a mean of 95.6 times faster than Gurobi and 105.1 times faster than CPLEX (S1 and S2 Figs).
Next, we tested the warm start capability of our structured regression algorithm for the ℓ1-VAFRP upon perturbations to the topology of the input tree. For each of the 264 instances constructed above, we measured the time to solve the ℓ1-VAFRP for 25,000 trees obtained by applying a single random SPR operation to the input clonal tree. We performed this measurement both in the setting where we employ our regression algorithm as a black-box (the cold start setting) and the setting where we used the recomputation procedure outlined in Corollary 1 (the warm start setting). We found that our algorithm was a mean of 6.2 times faster in the warm start setting than in the cold start setting (S3 and S4 Figs). This implies that our regression algorithm possesses another advantage over naïve LP approaches, which do not provide any warm starting capabilities.
3.2 Evaluation of fastBE on simulated data
We evaluated our factorization algorithm, fastBE, on simulated data, and compared it to four other state-of-the-art factorization algorithms: Pairtree [11], Orchard [12], CALDER [10], and CITUP [7]. To perform our evaluation, we simulated ground truth clone trees and usage matrices, and measured the ability of each algorithm to reconstruct this ground truth. To construct each simulated instance, we generated a clone tree and usage matrix U, computed the frequency matrix F = UB, and sampled both variant and non-variant reads from F at 40× coverage. Complete details describing the simulations, parameters, and evaluation metrics are provided in Sections B.1, B.2, B.3 in S1 Text.
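The relationship F = UB used to build each instance can be made concrete with a short sketch. It assumes the usual convention that entry (i, j) of the clonal matrix B is 1 when clone i carries mutation j, that is, when j is i itself or an ancestor of i in the clone tree. The read sampling shown here (a fixed depth of 40 reads per site with Binomial variant counts) is a simplification chosen for illustration; the actual simulation protocol is the one described in Section B.1 in S1 Text, and all function names are hypothetical.

```python
import random

def clonal_matrix(parent, order):
    """B[i][j] = 1 if clone i carries mutation j (j is i or an ancestor of i)."""
    index = {v: t for t, v in enumerate(order)}
    B = [[0] * len(order) for _ in order]
    for v in order:
        u = v
        while u is not None:            # walk from v up to the root
            B[index[v]][index[u]] = 1
            u = parent[u]
    return B

def matmul(U, B):
    """Plain matrix product F = U B on lists of lists."""
    return [[sum(U[s][i] * B[i][j] for i in range(len(B)))
             for j in range(len(B[0]))] for s in range(len(U))]

def sample_reads(F, coverage=40, seed=0):
    """ASSUMED read model: `coverage` total reads per site, each a variant
    read independently with probability F[s][j]."""
    rng = random.Random(seed)
    variant = [[sum(rng.random() < f for _ in range(coverage)) for f in row]
               for row in F]
    total = [[coverage] * len(row) for row in F]
    return variant, total
```

For the tree with root A, children B and C of A, and child D of B, a usage row (0.1, 0.2, 0.3, 0.4) gives frequencies of approximately (1.0, 0.6, 0.3, 0.4): each mutation's frequency is the total usage of the clones in its subtree.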
On simulated instances with few clones (n = 3, 5, 10) and samples (m = 5, 10, 25), all algorithms terminated in under 24 hours on the majority of the 108 simulated instances (fastBE: 108/108, Pairtree: 108/108, Orchard: 108/108, CALDER: 106/108, CITUP: 107/108). fastBE, Pairtree, and Orchard accurately recovered pairwise relationships in this setting (Fig 3 and S5 Fig), with fastBE, Pairtree, and Orchard performing nearly identically for the n = 10 clone setting in terms of mean F1-score (fastBE: 0.965, Pairtree: 0.972, Orchard: 0.965).
Fig 3. (Left) The F1-score versus the number of clones. (Right) The wall-clock runtime on instances with ≥100 clones versus the number of samples. Methods that did not scale to instances with many clones are excluded from the plot.
In contrast, CITUP and CALDER struggled to accurately reconstruct pairwise relationships (Fig 3 and S5 Fig) when there were 5 or more clones. In terms of recovering the ground truth usage matrix U and frequency matrix F, fastBE significantly outperformed CITUP and CALDER, while performing similarly to Pairtree and Orchard (S6 Fig).
On simulated instances with a modest number of clones (n = 20, 30, 50) and samples (m = 25, 50), CITUP and CALDER were unable to terminate on the majority of the simulated instances within 24 hours—consistent with the findings of [11, 12]—and were excluded from our evaluation. On instances with n = 20, 30 clones, fastBE, Pairtree, and Orchard performed similarly in terms of reconstructing pairwise relationships (Fig 3 and S7 Fig). However, on instances with n = 50 clones, fastBE and Orchard outperformed Pairtree (mean F1 fastBE: 0.825, Pairtree: 0.749, Orchard: 0.805) in reconstructing pairwise relationships (S13 Fig). In terms of recovering the ground truth usage matrix U and frequency matrix F, all methods performed quite well in recovering F, but Pairtree was less accurate in recovering U and F when the number of clones was large (S8 Fig). Finally, fastBE was an order of magnitude faster than Pairtree and Orchard, running for an average of 1.21 seconds on instances with n = 50 clones (S13 Fig).
In the regime with a large number of clones (n = 100, 250, 500, 1000) and samples (m = 50, 100), only fastBE and Orchard were able to terminate within a 24-hour time limit. On these instances, fastBE and Orchard performed nearly identically in terms of recovering ground truth pairwise relationships (mean F1 fastBE: 0.773, Orchard: 0.783). The methods also had similar performance in recovering the ground truth usage matrix U and frequency matrix F (S8 Fig). In terms of runtime, however, fastBE was several orders of magnitude faster than Orchard (Fig 3 and S13 Fig). For example, fastBE took a mean of 1229.8 seconds to run on instances with n = 1000 clones and terminated on all such instances, whereas Orchard took a mean of 71749.1 seconds and terminated on 19/24 such instances when allotted 48 hours and a dedicated 32-core processor.
We also found that fastBE was more sample efficient than other methods, requiring fewer samples to recover the ground truth clonal relationships. In particular, the pairwise reconstruction accuracy strictly improved for fastBE, Pairtree, and Orchard as the number of samples increased (Fig 3 and S12 Fig), while this was not necessarily the case for CALDER and CITUP (S9 and S10 Figs). However, the reconstruction accuracy improved for fastBE more quickly than for Pairtree and Orchard (Fig 3 and S12 Fig) as the number of samples increased, and fastBE obtained near perfect recovery (median F1: 0.987) with the number of samples m ≥ 50 and the number of clones n < 100. This observation led us to investigate the reconstruction accuracy as the ratio of samples to clones increased. Interestingly, we observed a sharp transition in pairwise reconstruction accuracy for fastBE as the ratio of samples to clones approached one (S11 Fig).
3.3 Evaluation of phylogeny constrained clustering with fastBE on simulated data
We compared our phylogeny constrained mutation clustering algorithm (Section 2.4) to both the mutation clustering algorithm in Orchard [12] and the method PyClone-VI [29] on a low-coverage simulated dataset. To perform our evaluation, we created ground truth mutation clusters by simulating a clone tree and a usage matrix U, assigning each mutation to one of the clones in the tree. To construct each simulated instance, we generated a clone tree and usage matrix U, computed the frequency matrix F = UB, and sampled both variant and non-variant reads for each mutation from F at 20× coverage. By construction, the variant read sampling frequency was identical for mutations belonging to the same clone and sample. Complete details describing the simulations are provided in Section B.1 in S1 Text.
To infer mutation clusters, we first built phylogenetic trees from the mutation-level variant and total read count matrices for fastBE and Orchard. Using these inferred phylogenies, we then applied our mutation clustering algorithm (Section 2.4) and Orchard’s mutation clustering algorithm, respectively, passing in the ground truth number of clusters. For PyClone-VI, we simply passed in the mutation-level variant and total read count matrices to obtain mutation clusters. To evaluate the quality of the inferred clusterings, we applied two metrics, the adjusted Rand index (ARI) and the normalized mutual information score (NMI), to the ground truth and inferred mutation clusters. Across all simulated settings, fastBE had both the highest mean ARI and NMI, with Orchard falling in second place (S14 Fig). Interestingly, on several instances Orchard exactly inferred the true mutation clusters; however, Orchard’s performance was quite variable, sometimes attaining an ARI of approximately 0. PyClone-VI, though state-of-the-art for mutation clustering, performed worst on all simulated settings, illustrating the utility of using a phylogeny-aware approach.
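For readers unfamiliar with the ARI, the following self-contained sketch computes it from pair counts using the standard formula. It is provided only to make the metric concrete; it is not the evaluation code used for the experiments, and the function name is hypothetical. It requires at least two items, and the convention of returning 1.0 in the degenerate case is an assumption.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, predicted):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(truth)
    # Pairs of items placed together in both clusterings.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    # Pairs placed together within each clustering separately.
    pairs_truth = sum(comb(c, 2) for c in Counter(truth).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = pairs_truth * pairs_pred / comb(n, 2)   # chance agreement
    maximum = (pairs_truth + pairs_pred) / 2
    if maximum == expected:     # degenerate case, e.g. both clusterings trivial
        return 1.0
    return (pairs_joint - expected) / (maximum - expected)
```

The index equals 1 when the two clusterings agree up to a relabeling of the clusters and is close to 0 when the agreement is no better than expected by chance, which is the sense in which an ARI of approximately 0 indicates a poor clustering.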
3.4 Analysis of B progenitor acute lymphoblastic leukemia patient samples
We applied fastBE to infer phylogenetic trees from multi-sample bulk DNA sequencing data of fourteen patients with B progenitor acute lymphoblastic leukemia (B-ALL) [16]. This dataset was generated by whole exome sequencing (≈ 200× coverage) of tissue samples from fourteen B-ALL patients at both diagnosis and relapse time points. Both diagnosis and relapse samples were subsequently grafted onto immunodeficient mice, generating additional patient-derived xenografts further sequenced using targeted sequencing. Using orthogonal copy number information, mutations were excluded from downstream analysis if they did not lie in copy number neutral regions [16]. Otherwise, all mutations were included for downstream analysis, regardless of their driver or passenger status. Taking both patient and derived xenograft samples together, this process resulted in fourteen patient samples containing a median of 42 samples (min: 13, max: 90) and a median of 42 mutations (min: 17, max: 293) per patient.
We inferred phylogenies using fastBE, Pairtree, and Orchard directly from the mutation-level variant and total read counts provided by Pairtree [11]. While fastBE and Orchard were successfully able to infer phylogenies on all fourteen patient samples, Pairtree failed to terminate on the two largest samples, SJETV010 and SJBALL022610, which contained 130 and 293 mutations respectively. On the remaining samples, fastBE was substantially faster than both Orchard and Pairtree (S15 Fig), terminating in less than 80 seconds on all samples. On all 14 of the patient samples, each of fastBE, Pairtree, and Orchard inferred distinct trees. However, directly quantifying the similarity of the inferred trees was challenging due to long chains of mutations, many of which can be arbitrarily reordered without affecting the reconstruction accuracy.
We quantified the differences between the phylogenetic trees inferred by fastBE, Pairtree, and Orchard by examining two metrics of concordance with the observed data: the frequency matrix estimation error and violations of the sum condition (Eq (11)) [5–7, 24]. For all methods, the normalized frequency matrix estimation error (Section B.3 in S1 Text) was less than 0.1 in all cases (S16 Fig). However, the trees inferred by fastBE had substantially lower frequency matrix estimation error (mean error fastBE: 1.2 × 10−3, Pairtree: 2.1 × 10−3, Orchard: 1.7 × 10−3), suggesting a better fit to the observed mutational frequencies. The sum condition requires that the frequency Fi,j of a mutation j gained at a clone in sample i is greater than or equal to the sum ∑k∈C(j) Fi,k of the frequencies of the mutations gained at the clone’s children. The sum condition follows from the perfect phylogeny assumption, which states that each mutation is gained at most once and never lost. Consequently, if a mutation is present in a clone, then the mutation is present in all the clone’s children. Importantly, all of the methods benchmarked [7, 10–12] make the perfect phylogeny assumption, and thus if the frequency matrix is correctly measured, the inferred trees should satisfy the sum condition. For a sample i and mutation j, we define the violation Vi,j = max{∑k∈C(j) Fi,k − Fi,j, 0} of the sum condition. The total violation V of the sum condition is the sum V = ∑i,j Vi,j of the violations over all samples and mutations. We found that the phylogenies inferred by fastBE had a lower total violation V (mean V of 23.9 over 14 patients) compared to both Pairtree (mean V of 30.1 over 12 patients) and Orchard (mean V of 29.4 across all 14 patients). The reduced violation of the sum condition demonstrated by fastBE also held across individual experiments and mutations (S17 and S18 Figs).
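The definition of the total violation V translates directly into code. The sketch below is illustrative only: the tree is given as a hypothetical child-to-parent dictionary over mutations, and F is a list of per-sample rows whose columns follow the order of the list `mutations`.

```python
def sum_condition_violations(parent, F, mutations):
    """Return the matrix V with V[i][j] = max(sum_{k in C(j)} F[i][k] - F[i][j], 0)."""
    col = {mut: j for j, mut in enumerate(mutations)}
    children = {mut: [] for mut in mutations}
    for mut, par in parent.items():
        if par is not None:
            children[par].append(mut)
    V = [[0.0] * len(mutations) for _ in F]
    for i, row in enumerate(F):
        for mut in mutations:
            child_sum = sum(row[col[k]] for k in children[mut])
            V[i][col[mut]] = max(child_sum - row[col[mut]], 0.0)
    return V

def total_violation(parent, F, mutations):
    """Total violation V: the sum of V[i][j] over all samples and mutations."""
    return sum(sum(row) for row in sum_condition_violations(parent, F, mutations))
```

For example, with mutations B and C placed as children of A and a sample in which the frequencies of A, B, and C are 1.0, 0.6, and 0.5, the children sum to 1.1 and the sample contributes a violation of about 0.1 at A; a sample with frequencies 0.8, 0.3, and 0.2 contributes none.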
Next, we inferred mutation clusters on the fastBE and Orchard phylogenies using the phylogeny-aware clustering algorithms described in Section 2.4 and [12], respectively. Since both of these methods require the number of clusters k as input, we performed a manual elbow analysis to select the number of clusters (S19 Fig), providing the selected number of clusters k to both methods. Interestingly, both methods inferred relatively similar mutation clusters, obtaining a mean ARI of 0.53 (S20 Fig), though this varied between samples. For example, while for the SJBALL022612 patient sample, fastBE and Orchard inferred nearly identical mutation clusters, for SJETV010 the clusters inferred by the two methods varied drastically. To evaluate the mutation clusters output by both methods, we computed the mutation cluster distortion for the clusters output by each method. While quite similar overall, fastBE had a slightly lower distortion on average, with an average distortion of 77.2 as opposed to 80.1 (S21 Fig).
Qualitatively, we observed differences between the trees inferred by fastBE as compared to those inferred by Pairtree and Orchard. For example, Pairtree and Orchard (median average depth Pairtree: 15.3, Orchard: 17.04) tended to infer deeper trees as compared to fastBE (median average depth: 10.2), and also tended to place mutations on long chains of degree two vertices. As a concrete example, we took a closer look at the phylogenetic trees inferred for patient sample SJBALL022613. For this patient sample, which contained 20 samples and 72 mutations, both Orchard and Pairtree inferred a linear phylogeny, suggesting linear evolution [48]. In contrast, the fastBE phylogeny branched into two distinct lineages, suggesting a branched evolution (Fig 4A–4C). The phylogeny inferred by fastBE had lower total violation of the sum condition compared to the phylogenies inferred by Pairtree and Orchard, though Orchard had substantially less total violation than Pairtree. The mutation clusters inferred by fastBE and Orchard were similar, with an ARI of 0.57, but contained notable differences (Fig 4D and 4E). For example, the mutation clusters inferred by fastBE were more uniform in size, and placed all mutations on the X-chromosome into a single mutation cluster off the root of the phylogeny. While Orchard also inferred that the X-chromosome mutations occurred early in the tumor’s evolution, it spread these mutations across multiple clusters. Finally, the clusters inferred by fastBE had lower mutation cluster distortion than those inferred by Orchard (Fig 4D and 4E).
Fig 4. The mutation phylogenies inferred by a) fastBE, b) Pairtree [11], and c) Orchard [12] for the patient sample SJBALL022613 from multi-sample bulk DNA sequencing data of fourteen patients with B progenitor acute lymphoblastic leukemia [16]. For ease of visualization, we collapsed vertices with out-degree 1 into a single node, preserving the order of mutations as they appeared in the original tree. The mutation clusters inferred using the phylogeny-aware clustering algorithms of fastBE d) and Orchard e) for the patient sample SJBALL022613.
3.5 Analysis of patient-derived colorectal cancer models
We compared fastBE to Pairtree and Orchard on two patient-derived xenograft models of colorectal cancer, POP66 and CSC28 [49], from which multiple bulk samples underwent whole-exome sequencing. The POP66 model contained eight samples collected in the parent tumor (P0), first generation xenograft (G0), and regrowth xenografts, and 25 mutation clusters were inferred across these samples in [49]. The CSC28 model consisted of four samples collected in the first generation (G0) and regrowth xenografts, and 11 mutation clusters were inferred across these samples in [49]. Due to the high read depth of whole-exome sequencing (≈50× sequencing), we inferred phylogenies directly from the mutation-level variant and total read counts rather than the previously reported mutation clusters. Following the original publication [49], we excluded mutations that were contained in copy number aberrations in any of the samples, and used the copy number corrected mutation-level variant and total read counts provided by [49].
We found that the phylogenetic trees inferred by fastBE were quite different from those inferred by Pairtree and Orchard in terms of both their implied pairwise relationships and overall structure (S22 and S23 Figs). For example, while both the fastBE and Orchard CSC28 phylogenies had the mutation ORG13G occurring off the root, only the fastBE CSC28 phylogeny implied a polyclonal tumor origin. Further, the CSC28 phylogeny inferred by fastBE was substantially less deep than those inferred by Pairtree and Orchard. A similar story appears for the POP66 phylogenies, where the differences are even further exaggerated due to the large number of mutations.
Finally, we quantified the frequency matrix estimation error and total violation of the sum condition in the phylogenetic trees inferred by fastBE, Pairtree, and Orchard. We found that fastBE had both the lowest frequency matrix estimation error and total violation of the sum condition, though Orchard outperformed Pairtree by a large margin (S24 Fig).
4 Discussion
We defined a linear optimization problem, the ℓ1-VAFRP, a subproblem of the NP-complete ℓ1-VAFFP. By exploiting the special structure of the matrices which appear in this regression problem, we derived an algorithm which runs in time where m is the number of samples, n is the number of clones, and d is the depth of the input tree
, obtaining asymptotic and empirical speedups over state-of-the-art linear programming solvers. Using our regression algorithm, we developed a method, fastBE, for the ℓ1-VAFFP which scales to large, multi-sample bulk DNA sequencing datasets. While fastBE serves as a practically useful tool for phylogenetic inference, we also believe our ℓ1-regression algorithm and structured regression model are of independent interest, and will serve as useful tools for the development of other algorithms for phylogenetic inference from multi-sample bulk DNA sequencing data.
There are several limitations of the present approach, which are directions for future work. On the theoretical side, it is an open question whether the time complexity of our regression algorithm can be improved from the current time to the optimal
time, which is the size of the input. On the practical side, extending our model and regression algorithm to additional classes of evolutionary models is desirable. Here, we analyzed the simplest case of single nucleotide variants in copy neutral regions. Accounting for copy number heterogeneity by replacing the VAF with either the cancer cell fraction [50–52] or the descendant cell fraction [53], could improve the performance of our method on real datasets. Furthermore, using evolutionary models that allow for mutation loss—e.g., the Dollo model [5, 54], or generalizations [55, 56]—is a challenging future direction. Finally, extending or applying fastBE to infer repeated evolutionary trajectories [57–60] across patients may extend the utility of fastBE beyond the single-patient setting.
Supporting information
S1 Text. Supplementary text file (PDF) containing supplementary methods, results, and proofs.
https://doi.org/10.1371/journal.pcbi.1012631.s001
(PDF)
S1 Fig. Relative runtime analysis of our ℓ1 regression algorithm and LP solvers.
https://doi.org/10.1371/journal.pcbi.1012631.s002
(TIFF)
S2 Fig. Absolute runtime analysis of our ℓ1 regression algorithm and LP solvers.
https://doi.org/10.1371/journal.pcbi.1012631.s003
(TIFF)
S3 Fig. Relative runtime analysis of warm versus cold starting our ℓ1 regression algorithm.
https://doi.org/10.1371/journal.pcbi.1012631.s004
(TIFF)
S4 Fig. Absolute runtime analysis of warm versus cold starting our ℓ1 regression algorithm.
https://doi.org/10.1371/journal.pcbi.1012631.s005
(TIFF)
S5 Fig. FPR and FNR of inferring pairwise relations for ≤ 10 clone and ≤ 25 sample simulated instances.
https://doi.org/10.1371/journal.pcbi.1012631.s006
(TIFF)
S6 Fig. Inferred matrix error for ≤ 10 clone and ≤ 25 sample simulated instances.
https://doi.org/10.1371/journal.pcbi.1012631.s007
(TIFF)
S7 Fig. FPR and FNR of inferring pairwise relations for ≥ 20 clone simulated instances.
https://doi.org/10.1371/journal.pcbi.1012631.s008
(TIFF)
S8 Fig. Inferred matrix error for ≥ 20 clone simulated instances.
https://doi.org/10.1371/journal.pcbi.1012631.s009
(TIFF)
S9 Fig. FPR and FNR of inferring pairwise relations versus ratio of samples to clones with ≤ 10 clones and ≤ 25 samples.
https://doi.org/10.1371/journal.pcbi.1012631.s010
(TIFF)
S10 Fig. FPR and FNR of inferring pairwise relations versus number of samples with ≤ 10 clones.
https://doi.org/10.1371/journal.pcbi.1012631.s011
(TIFF)
S11 Fig. FPR and FNR of inferring pairwise relations versus ratio of samples to clones with between 20 and 100 clones.
https://doi.org/10.1371/journal.pcbi.1012631.s012
(TIFF)
S12 Fig. FPR and FNR of inferring pairwise relations versus the number of samples with between 20 and 100 clones.
https://doi.org/10.1371/journal.pcbi.1012631.s013
(TIFF)
S13 Fig. Runtime analysis of fastBE, Pairtree, and Orchard.
https://doi.org/10.1371/journal.pcbi.1012631.s014
(TIFF)
S14 Fig. ARI and NMI of inferring mutation clusters.
https://doi.org/10.1371/journal.pcbi.1012631.s015
(TIFF)
S15 Fig. Runtime analysis for B-ALL patient phylogenies.
https://doi.org/10.1371/journal.pcbi.1012631.s016
(TIFF)
S16 Fig. ℓ1 matrix error for B-ALL patient phylogenies.
https://doi.org/10.1371/journal.pcbi.1012631.s017
(TIFF)
S17 Fig. Total violation of the sum condition for B-ALL patient phylogenies.
https://doi.org/10.1371/journal.pcbi.1012631.s018
(TIFF)
S18 Fig. Per mutation total violation of the sum condition for B-ALL patient phylogenies.
https://doi.org/10.1371/journal.pcbi.1012631.s019
(TIFF)
S19 Fig. B-ALL mutation clustering distortion versus the number of clusters.
https://doi.org/10.1371/journal.pcbi.1012631.s020
(TIFF)
S20 Fig. B-ALL mutation clustering ARI between fastBE and Orchard clusterings.
https://doi.org/10.1371/journal.pcbi.1012631.s021
(TIFF)
S21 Fig. B-ALL mutation clustering distortion difference for fastBE and Orchard clusterings.
https://doi.org/10.1371/journal.pcbi.1012631.s022
(TIFF)
S24 Fig. Sum condition violation and frequency matrix estimation error on POP66 and CSC28.
https://doi.org/10.1371/journal.pcbi.1012631.s025
(TIFF)
S25 Fig. F1 score and runtime analysis of fastBE on imperfect phylogenies.
https://doi.org/10.1371/journal.pcbi.1012631.s026
(TIFF)
References
- 1. Gundem G, Van Loo P, Kremeyer B, Alexandrov LB, Tubio JMC, Papaemmanuil E, et al. The evolutionary history of lethal metastatic prostate cancer. Nature. 2015;520(7547):353–357. pmid:25830880
- 2. Jamal-Hanjani M, Wilson GA, McGranahan N, Birkbak NJ, Watkins TBK, Veeriah S, et al. Tracking the Evolution of Non–Small-Cell Lung Cancer. New England Journal of Medicine. 2017;376(22):2109–2121. pmid:28445112
- 3. Pan-cancer analysis of whole genomes. Nature. 2020;578(7793):82–93. pmid:32025007
- 4. Strino F, Parisi F, Micsinai M, Kluger Y. TrAp: a tree approach for fingerprinting subclonal tumor composition. Nucleic Acids Research;41(17):e165–e165. pmid:23892400
- 5. El-Kebir M, Oesper L, Acheson-Field H, Raphael BJ. Reconstruction of clonal trees and tumor composition from multi-sample sequencing data. Bioinformatics;31(12):i62–i70. pmid:26072510
- 6. Popic V, Salari R, Hajirasouliha I, Kashef-Haghighi D, West RB, Batzoglou S. Fast and scalable inference of multi-sample cancer lineages. Genome Biology;16(1):91. pmid:25944252
- 7. Malikic S, McPherson AW, Donmez N, Sahinalp CS. Clonality inference in multiple tumor samples using phylogeny. Bioinformatics;31(9):1349–1356. pmid:25568283
- 8. Deshwar AG, Vembu S, Yung CK, Jang GH, Stein L, Morris Q. PhyloWGS: Reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biology;16(1):35. pmid:25786235
- 9. Satas G, Raphael BJ. Tumor phylogeny inference using tree-constrained importance sampling. Bioinformatics;33(14):i152–i160. pmid:28882002
- 10. Myers MA, Satas G, Raphael BJ. CALDER: Inferring Phylogenetic Trees from Longitudinal Tumor Samples. Cell Systems;8(6):514–522.e5. pmid:31229560
- 11. Wintersinger JA, Dobson SM, Kulman E, Stein LD, Dick JE, Morris Q. Reconstructing Complex Cancer Evolutionary Histories from Multiple Bulk DNA Samples Using Pairtree. Blood Cancer Discovery;3(3):208–219. pmid:35247876
- 12. Kulman E, Kuang R, Morris Q. Orchard: building large cancer phylogenies using stochastic combinatorial search. arXiv preprint arXiv:2311.12917. 2023.
- 13. Tarabichi M, Salcedo A, Deshwar AG, Ni Leathlobhair M, Wintersinger J, Wedge DC, et al. A practical guide to cancer subclonal reconstruction from DNA sequencing. Nature methods. 2021;18(2):144–155. pmid:33398189
- 14. Cortés-Ciriano I, Gulhan DC, Lee JJK, Melloni GE, Park PJ. Computational analysis of cancer genome sequencing data. Nature Reviews Genetics. 2022;23(5):298–314. pmid:34880424
- 15. Marusyk A, Janiszewska M, Polyak K. Intratumor heterogeneity: the rosetta stone of therapy resistance. Cancer cell. 2020;37(4):471–484. pmid:32289271
- 16. Dobson SM, García-Prat L, Vanner RJ, Wintersinger J, Waanders E, Gu Z, et al. Relapse-Fated Latent Diagnosis Subclones in Acute B Lineage Leukemia Are Drug Tolerant and Possess Distinct Metabolic Programs. Cancer Discovery;10(4):568–587. pmid:32086311
- 17. El-Kebir M, Satas G, Oesper L, Raphael BJ. Inferring the mutational history of a tumor using multi-state perfect phylogeny mixtures. Cell systems. 2016;3(1):43–53. pmid:27467246
- 18. Rzhetsky A, Nei M. A simple method for estimating and testing minimum-evolution trees. Mol Biol Evol. 1992;9(5):945–967.
- 19. Rzhetsky A, Nei M. Theoretical foundation of the minimum-evolution method of phylogenetic inference. Molecular biology and evolution. 1993;10(5):1073–1095. pmid:8412650
- 20. Bryant DJ, Waddell PJ. Rapid evaluation of least squares and minimum evolution criteria on phylogenetic trees. Molecular Biology and Evolution. 1997.
- 21. Day WH. Computational complexity of inferring phylogenies from dissimilarity matrices. Bulletin of mathematical biology. 1987;49(4):461–467. pmid:3664032
- 22. Bastkowski S, Moulton V, Spillner A, Wu T. The minimum evolution problem is hard: a link between tree inference and graph clustering problems. Bioinformatics. 2016;32(4):518–522. pmid:26500153
- 23. Allen BL, Steel M. Subtree transfer operations and their induced metrics on evolutionary trees. Annals of combinatorics. 2001;5:1–15.
- 24. Jiao W, Vembu S, Deshwar AG, Stein L, Morris Q. Inferring clonal evolution of tumors from single nucleotide somatic mutations. BMC bioinformatics. 2014;15:1–16. pmid:24484323
- 25. Gusfield D. Efficient algorithms for inferring evolutionary trees. Networks;21(1):19–28.
- 26. Pe’er I, Shamir R, Sharan R. Incomplete Directed Perfect Phylogeny. In: Giancarlo R, Sankoff D, editors. Combinatorial Pattern Matching. Lecture Notes in Computer Science. Springer. p. 143–153.
- 27. Qi Y, Pradhan D, El-Kebir M. Implications of non-uniqueness in phylogenetic deconvolution of bulk DNA samples of tumors. Algorithms for Molecular Biology;14(1):19. pmid:31497065
- 28. Roth A, Khattra J, Yap D, Wan A, Laks E, Biele J, et al. PyClone: statistical inference of clonal population structure in cancer. Nature methods. 2014;11(4):396–398. pmid:24633410
- 29. Gillis S, Roth A. PyClone-VI: scalable inference of clonal population structures using whole genome data. BMC bioinformatics. 2020;21(1):1–16. pmid:33302872
- 30. Miller CA, White BS, Dees ND, Griffith M, Welch JS, Griffith OL, et al. SciClone: inferring clonal architecture and tracking the spatial and temporal patterns of tumor evolution. PLoS computational biology. 2014;10(8):e1003665. pmid:25102416
- 31. Boyd SP, Vandenberghe L. Convex optimization. Cambridge University Press; 2004.
- 32. Jia B, Ray S, Safavi S, Bento J. Efficient Projection onto the Perfect Phylogeny Model. In: Advances in Neural Information Processing Systems. vol. 31. Curran Associates, Inc.
- 33. Lefort V, Desper R, Gascuel O. FastME 2.0: a comprehensive, accurate, and fast distance-based phylogeny inference program. Molecular biology and evolution. 2015;32(10):2798–2800. pmid:26130081
- 34. Price MN, Dehal PS, Arkin AP. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Molecular biology and evolution. 2009;26(7):1641–1650. pmid:19377059
- 35. Price MN, Dehal PS, Arkin AP. FastTree 2–approximately maximum-likelihood trees for large alignments. PloS one. 2010;5(3):e9490. pmid:20224823
- 36. Haws DC, Hodge TL, Yoshida R. Optimality of the neighbor joining algorithm and faces of the balanced minimum evolution polytope. Bulletin of mathematical biology. 2011;73:2627–2648. pmid:21373975
- 37. Forcey S, Keefe L, Sands W. Facets of the balanced minimal evolution polytope. Journal of mathematical biology. 2016;73:447–468. pmid:26714816
- 38. Desper R, Gascuel O. Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. In: Algorithms in Bioinformatics: Second International Workshop, WABI 2002 Rome, Italy, September 17–21, 2002 Proceedings 2. Springer; 2002. p. 357–374.
- 39. Mihaescu R, Pachter L. Combinatorics of least-squares trees. Proceedings of the National Academy of Sciences. 2008;105(36):13206–13211. pmid:18779558
- 40. Rényi A, Szekeres G. On the height of trees. Journal of the Australian Mathematical Society;7(4):497–507.
- 41. Chung F, Horn P, Lu L. Diameter of random spanning trees in a given graph. Journal of Graph Theory;69(3):223–240.
- 42. Sankoff D, Rousseau P. Locating the vertices of a Steiner tree in an arbitrary metric space. Mathematical Programming. 1975;9:240–246.
- 43. Andor N, Harness JV, Mueller S, Mewes HW, Petritsch C. EXPANDS: expanding ploidy and allele frequency on nested subpopulations. Bioinformatics. 2014;30(1):50–60. pmid:24177718
- 44. Deveau P, Colmet Daage L, Oldridge D, Bernard V, Bellini A, Chicard M, et al. QuantumClone: clonal assessment of functional mutations in cancer based on a genotype-aware method for clonal reconstruction. Bioinformatics. 2018;34(11):1808–1816. pmid:29342233
- 45. Roux M. A comparative study of divisive and agglomerative hierarchical clustering algorithms. Journal of Classification. 2018;35:345–366.
- 46. Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual; 2023. Available from: https://www.gurobi.com.
- 47. International Business Machines Corporation, LLC. IBM ILOG CPLEX Optimization Studio Reference Manual; 2022. Available from: https://www.ibm.com/docs/en/icos/22.1.0.
- 48. Davis A, Gao R, Navin N. Tumor evolution: Linear, branching, neutral or punctuated? Biochimica et Biophysica Acta (BBA)-Reviews on Cancer. 2017;1867(2):151–161. pmid:28110020
- 49. Rehman SK, Haynes J, Collignon E, Brown KR, Wang Y, Nixon AM, et al. Colorectal cancer cells enter a diapause-like DTP state to survive chemotherapy. Cell. 2021;184(1):226–242. pmid:33417860
- 50. Van Loo P, Nordgard SH, Lingjærde OC, Russnes HG, Rye IH, Sun W, et al. Allele-specific copy number analysis of tumors. Proceedings of the National Academy of Sciences. 2010;107(39):16910–16915. pmid:20837533
- 51. Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, et al. Absolute quantification of somatic DNA alterations in human cancer. Nature biotechnology. 2012;30(5):413–421. pmid:22544022
- 52. Grigoriadis K, Huebner A, Bunkum A, Colliver E, Frankell AM, Hill MS, et al. CONIPHER: a computational framework for scalable phylogenetic reconstruction with error correction. Nature Protocols. 2023; p. 1–25. pmid:38017136
- 53. Satas G, Zaccaria S, El-Kebir M, Raphael BJ. DeCiFering the elusive cancer cell fraction in tumor heterogeneity and evolution. Cell Systems. 2021;12(10):1004–1018. pmid:34416171
- 54. Bonizzoni P, Ciccolella S, Della Vedova G, Soto M. Beyond perfect phylogeny: Multisample phylogeny reconstruction via ILP. In: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics; 2017. p. 1–10.
- 55. Satas G, Zaccaria S, Mon G, Raphael BJ. SCARLET: single-cell tumor phylogeny inference with copy-number constrained mutation losses. Cell systems. 2020;10(4):323–332. pmid:32864481
- 56. Sashittal P, Zhang H, Iacobuzio-Donahue CA, Raphael BJ. ConDoR: Tumor phylogeny inference with a copy-number constrained mutation loss model. Genome biology. 2023;24(1):272. pmid:38037115
- 57. Caravagna G, Giarratano Y, Ramazzotti D, Tomlinson I, Graham TA, Sanguinetti G, et al. Detecting repeated cancer evolution from multi-region tumor sequencing data. Nature methods. 2018;15(9):707–714. pmid:30171232
- 58. Khakabimamaghani S, Malikic S, Tang J, Ding D, Morin R, Chindelevitch L, et al. Collaborative intra-tumor heterogeneity detection. Bioinformatics. 2019;35(14):i379–i388. pmid:31510674
- 59. Luo XG, Kuipers J, Beerenwinkel N. Joint inference of exclusivity patterns and recurrent trajectories from tumor mutation trees. Nature communications. 2023;14(1):3676. pmid:37344522
- 60. Pellegrina L, Vandin F. Discovering significant evolutionary trajectories in cancer phylogenies. Bioinformatics. 2022;38(Supplement_2):ii49–ii55. pmid:36124798