A zero-agnostic model for copy number evolution in cancer

Motivation New low-coverage single-cell DNA sequencing technologies enable the measurement of copy number profiles from thousands of individual cells within tumors. From this data, one can infer the evolutionary history of the tumor by modeling transformations of the genome via copy number aberrations. Copy number aberrations alter multiple adjacent genomic loci, violating the standard phylogenetic assumption that loci evolve independently. Thus, specialized models to infer copy number phylogenies have been introduced. A widely used model is the copy number transformation (CNT) model in which a genome is represented by an integer vector and a copy number aberration is an event that either increases or decreases the number of copies of a contiguous segment of the genome. The CNT distance between a pair of copy number profiles is the minimum number of events required to transform one profile to another. While this distance can be computed efficiently, no efficient algorithm has been developed to find the most parsimonious phylogeny under the CNT model. Results We introduce the zero-agnostic copy number transformation (ZCNT) model, a simplification of the CNT model that allows the amplification or deletion of regions with zero copies. We derive a closed form expression for the ZCNT distance between two copy number profiles and show that, unlike the CNT distance, the ZCNT distance forms a metric. We leverage the closed-form expression for the ZCNT distance and an alternative characterization of copy number profiles to derive polynomial time algorithms for two natural relaxations of the small parsimony problem on copy number profiles. While the alteration of zero copy number regions allowed under the ZCNT model is not biologically realistic, we show on both simulated and real datasets that the ZCNT distance is a close approximation to the CNT distance. Extending our polynomial time algorithm for the ZCNT small parsimony problem, we develop an algorithm, Lazac, for solving the large parsimony problem on copy number profiles. We demonstrate that Lazac outperforms existing methods for inferring copy number phylogenies on both simulated and real data.


Introduction
Tumor evolution is characterized by both small and large genomic alterations that alter the fitness of cancer cells [1].Copy number aberrations, i.e. modifications to the number of copies of a genomic segment, are an important and frequent sub-class of such alterations that drive prognostic and metastatic outcomes [2].Deriving the evolutionary history of copy number aberrations, herein referred to as copy number phylogenies, is thus important for understanding the emergence of primary tumors and the development of subpopulations of cells that evade treatment and/or metastasize to other anatomical sites.
Recent technological and computational improvements in single-cell sequencing have enabled the mapping of high resolution copy number profiles in single cells.For example, the high-throughput 10x Genomics Single-cell Copy Number Variation solution [3,4] produces ultra-low coverage (< 0.05×) whole genome sequencing data from � 2000 individual cells.Other recent technologies, including DLP/DLP+ [5][6][7] and ACT [8], produce similar data.Multiple computational methods [4,[9][10][11][12][13] have been introduced to infer high resolution copy number profiles, integer vectors that contain the number of copies of each genomic segment, from this type of data.Other recent methods can infer copy number profiles from thousands of cells or spatial locations from single-cell RNA sequencing (scRNA-seq) [14], scATAC-seq [15], or spatial transcriptomics data [16].
The increasing availability of technologies to measure genomic copy number in thousands of cell motivates development of methods to infer the cellular phylogenies from copy number profiles.However, there are multiple challenges in inferring phylogenies from copy number profiles.First, copy number aberrations are diverse, ranging from small duplications and deletions [17] to whole chromosome shattering and reconstruction events [18].Second, a single copy number aberration can alter the number of copies of a large section of the genome simultaneously.This means that loci on the genome cannot be treated as independent phylogenetic characters, a widely-used assumption in phylogenetics [19][20][21][22].Finally, the increasing size (> 10, 000 cells) and resolution (< 5Kb bins) of copy number profiles require increasingly scalable algorithms.
One widely used model of copy number evolution is the copy number transformation (CNT) model [23].In the CNT model, a genome is represented as a vector of non-negative integers and copy number aberrations correspond to the increase or decrease of the entries in a contiguous interval of coordinates in the vector, explicitly modeling the non-independence of copy number amplifications and deletions.The CNT distance is the minimum number of copy number events needed to transformation one profile to another.The CNT distance is computable in linear time [24] and has been used to define an evolutionary distance between profiles.Since the CNT distance is not symmetric, a variety of symmetrized CNT distances have also been used to construct copy number phylogenies using distance-based phylogenetic methods [23,[25][26][27].Further, owing to its effectiveness, the CNT model has become the basis of a variety of distinct models [26][27][28] for copy number evolution.
While the CNT model is described by specific events-or mutations-there has been little work on constructing phylogenetic trees under the CNT model using the method of maximum parsimony.Even the small parsimony problem-where the topology of the tree is given and one aims to infer the ancestral profiles that minimizes the total number of copy number events on the tree-has no known efficient solution.For example, for the special case of a two leaf tree, the best algorithm for the CNT small parsimony problem [24] runs in OðnB 7 Þ time where B is the largest allowed copy number and n is the number of loci [25].Without an efficient algorithm for the small parsimony problem under the CNT model, one cannot hope to solve the large parsimony problem, where the topology of the tree is unknown.
We introduce a relaxation of the CNT model, called the zero-agnostic copy number transformation (ZCNT) model, that approximates the CNT model and has a number of desirable properties.Unlike the CNT model, the ZCNT model allows for the amplification of zero copy number regions.While such an operation is not biologically realistic, we show that this relaxation makes the ZCNT distance a metric, in contrast to the CNT distance.Moreover, we derive a closed form expression for the ZCNT distance between two profiles.We use this closed form expression as well as an alternative characterization of copy number profiles to solve two relaxations of the small parsimony problem in polynomial time and to derive a linear time, 2-approximation algorithm for the small parsimony problem.To our knowledge, this is the first attempt to solve the small parsimony problem for a segment-based (i.e.non-independent) model of copy number evolution.We then use our efficient algorithm for the (relaxed) small parsimony problem to design an algorithm, Lazac (Large-scale Analysis of Zero Agnostic Copy number), for inferring copy number phylogenies by solving the large parsimony problem.We show on simulated data that Lazac is > 100× faster than other phylogenetic methods and also more accurate in recovering the ground truth phylogeny.On single-cell wholegenome sequencing data from human breast and ovarian tumors, Lazac finds phylogenies that are more consistent with both copy number clones and single-nucleotide variants (SNVs).

Copy number transformations
A copy number profile p = [p 1 , . .., p m ] is a vector of non-negative integers where p j 2 {0, . .., B} is the the number of copies of locus j.Suppose we measure the copy number profile of n cells of a tumor across m loci in a single-cell DNA sequencing experiment.We encode the copy number profiles in a n × m copy number matrix M = [M i,j ] where M i,j 2 {0, . .., B} is the copy number of cell i at locus j.The copy number profile p of cell i is then the i th row of this matrix, and is denoted M i .
One of the most basic phylogenetic principles is that nearly perfect measurement of evolutionary distances enables exact recovery of the evolutionary history [29].It is thus not surprising that many of the successful attempts at inferring copy number phylogenies focus on finding good methods to compute an evolutionary distance between a pair of copy number profiles.Early methods for computing evolutionary distances on copy number data [17,30] employed simple measures of distance such as the Hamming, weighted Hamming, and ℓ 1 distance between copy number profiles.However, these distances do not account for dependencies between loci caused by long CNAs spanning contiguous segments of the genome, leading to inaccurate phylogenetic reconstruction [23,27].
In this section, we describe and investigate the copy number transformation (CNT) model, one of the most well-known and successful evolutionary models for copy number evolution in cancer.The CNT model was originally introduced in MEDICC [23] and extended in subsequent studies [24][25][26][27][28]. Since the CNT model only allows intrachromosomal copy number events, it is sufficient to consider the case of a single chromosome, and thus for ease of exposition we will describe the model using a single chromosome.
The fundamental operation in the CNT model is a copy number event which increases or decreases (by one) the entries in a contiguous interval of a copy number profile, defined formally as follows.
Definition 1 (Copy number event) A copy number event c s;t;b : where s � t and b 2 {+ 1, −1}.We denote such a function as c when clear by context.That is, an amplification (resp.deletion) increases (resp.decreases) the copy number of all non-zero entries in the interval between positions s and t, or alternatively a copy number event skips the zero entries (Fig 1).Thus, once a locus is lost (i.e.p i = 0), the locus cannot be regained or deleted further.A copy number transformation (CNT) C is the composition of multiple copy number events and we denote this function as Several copy number problems have been previously studied to compute evolutionary distances under the CNT model.The first, and simplest, is the copy number transformation problem, originally introduced in [23], which defines a distance, σ(u, v), between two copy number profiles.Put simply, the distance between two profiles is the length of the shortest copy number transformation needed to transform one profile to another.
Definition 2 (Copy number transformation distance).Given two copy number profiles u and v, the copy number transformation distance is sðu; vÞ :¼ min where C = (c 1 , . .., c n ) is a CNT.Alternatively, σ(u, v) = 1 if no such transformation exists.[24,31] show there is a (non-trivial) strongly linear time algorithm (i.e.time complexity Oðjuj þ jvjÞ) for computing the CNT distance σ(u, v).Unfortunately, the CNT distance σ(u, v) is not symmetric (i.e.σ(u, v) 6 ¼ σ(v, u)), which makes it difficult to use in distance based phylogenetic methods such as neighbor joining [32].
In order to apply distance-based phylogenetic methods, multiple approaches to symmetrize the distance σ(u, v) have been introduced.[24] use a mean correction replacing the asymmetric σ(u, v) with a symmetric distance σ 0 (u, v) defined as s 0 ðu; vÞ ¼ sðu; vÞ þ sðv; uÞ 2 : Alternatively, several authors [23,25,27] define the distance between two profiles in terms of a closely related, median profile w.Specifically, the median distance between two profiles u and v is defined to be the smallest value of sðw; uÞ þ sðw; vÞ over all profiles w.Computing this median distance is called the copy number triplet problem in [25].Unfortunately, no efficient algorithm is known for the copy number triplet problem.The fastest algorithm uses OðmB 7 Þ time and OðmB 4 Þ space where B is the maximum allowed copy number [25].

Small and large copy number parsimony
The small parsimony problem for copy number profiles is the following: given a tree T whose leaves are labeled by copy number profiles, infer ancestral copy number profiles that minimize the total dissimilarity between profiles across all edges (Fig 1).For evolutionary models in which each character evolves independently and has finitely many states (e.g.single nucleotide substitution models), the small parsimony problem is solved in polynomial time via Sankoff's algorithm, a dynamic programming algorithm [33].Unfortunately, the CNT model presents two major challenges in solving the small parsimony problem.First, since copy number events affect multiple loci simultaneously, the loci cannot be analyzed independently, in contrast to most phylogenetic characters.Second, the space of possible copy number profiles is a priori unbounded, since the maximum copy number of a segment in a genome is unknown.Thus, it is not surprising that there is no published solution to the small parsimony problem for CNT dissimilarity, with the exception of the special case of two-leaf trees [25].Here, we formalize both the CNT small parsimony problem and the corresponding large parsimony problem, the latter of which was previously described in [25].
A copy number phylogeny ðT ; 'Þ is a rooted tree T and leaf labeling ℓ.Let VðT Þ, EðT Þ, and LðT Þ denote the vertices, edges, and leaves of T , respectively.In our applications below, each leaf of T represents one of the n cells (or bulk samples) from a tumor.An ancestral labeling ' of a copy number phylogeny is a vertex labeling of T that agrees with ℓ on the leaves of T , i.e.
'ðvÞ ¼ 'ðvÞ when v 2 LðT Þ.We say that ðT ; 'Þ is a copy number phylogeny for copy number matrix M if T has n leaves such that ℓ labels each leaf by a row of M. Formally, if ðT ; 'Þ is a copy number phylogeny for a copy number matrix M, then there exists a cell assignment p : We define the cost JðT ; 'Þ of a vertex labeled, copy number phylogeny as the total number of copy number events required to explain the phylogeny: sð 'ðuÞ; 'ðvÞÞ: We now introduce the small parsimony problem [34] under the copy number transformation model.
Problem 1 (CNT small parsimony).Given a copy number phylogeny ðT ; 'Þ find an ancestral The parsimony score is defined as the cost JðT ; 'Þ of the solution ' to the CNT small parsimony problem.To the best of our knowledge the CNT small parsimony problem (Problem 1) has not been analyzed in the literature.We believe this is due to the difficulty of solving the CNT small parsimony problem.That is, for even a special case of two-leaf trees, referred to as the copy number triplet problem [25], no strongly polynomial time algorithm is known (Section 2.1).
The CNT large parsimony problem defined in [25] aims to find a vertex labeled, copy number phylogeny ð T ; 'Þ for a matrix M with minimum cost such that the root of T is labeled by the normal, diploid copy number profile.Problem 2 (CNT large parsimony).Given copy number matrix M, find a copy number phylogeny ðT ; 'Þ for M and ancestral labeling ' such that JðT ; 'Þ is minimized and ℓ(r) = (2, . .., 2) for the root r of T .Unsurprisingly, [25] showed that the above large parsimony problem (Problem 2) is NPhard.They also formulated an integer linear program (ILP) to solve the problem exactly.However, this ILP consists of O(n 2 m + nm log B) variables and does not scale to the size of current real data sets with thousands of cells.

The zero-agnostic CNT model
The copy number transformation (CNT) model imposes the constraint that once a locus is lost (has zero copy number), the locus remains with zero copies for all time.While this constraint is biologically realistic, the constraint also makes the inference problems-including the CNT small (and large) parsimony problems-computationally hard to solve.Here, we show that relaxing the constraint that copy number events do not alter zero entries leads to a simpler model with favourable mathematical properties.We call this the zero-agnostic copy number model (Fig 1 ) to indicate that the model allows the amplification and deletion of loci with zero copies.Formally, we define a zero-agnostic copy number event as follows.
Definition 3 (Zero-agnostic copy number event).A zero-agnostic copy number event c s;t;b : Z n !Z n is a function that maps a profile p 2 Z n to a profile written c s,t,b (p) described by its entries as where s � t and b 2 {+ 1, −1}.We denote such a function as c when clear by context.
Thus, a zero-agnostic copy number event either increases or decreases the number of copies of all loci in the interval (s, t) regardless of whether the loci have zero copies.Analogous to our definition of a copy number transformation, we define a zero-agnostic copy number transformation C as the composition of multiple zero-agnostic copy number events and denote this function as C = (c 1 , . .., c n ) where C(p) = c n (� � �(c 2 (c 1 (p)))).Note that our formulation of a zeroagnostic copy number transformation (ZCNT) allows for the number of copies of a locus to decrease below zero, one can show that given two profiles with non-negative entries, it is always possible to find a minimum length zero-agnostic copy number transformation such that no intermediate profile has negative entries.See Section 2.3.2 below.
Due to space constraints, we do not include all proofs in the main text.Any proof not present in the main text can be found in Sections B and C in S1 Text.
2.3.1 Delta profiles.We simplify our analysis of zero-agnostic copy number events by examining their effect on the differences between the copy number of adjacent loci.In particular, while a zero-agnostic copy number event c s,t,b increments (or decrements) all entries p i where i 2 {s, . .., t}, c s,t,b only alters two differences between adjacent loci, namely the difference p s − p s−1 and the difference p t+1 − p t .To formalize this idea, we first define the delta profile, a vector obtained by taking the differences in copy number between adjacent loci.Definition 4. A delta profile is any vector q 2 Z n that satisfies the balancing condition: Or equivalently, P q i >0 jq i j ¼ P q i <0 jq i j.We denote the set of delta profiles in Z n as D n .The above definition provides us with a convenient (and useful) description of the image of the following difference transformation, which we call the delta map.
Definition 5.The delta map D : Z n !D nþ1 maps a copy number profile p to a delta profile Δ(p) by taking the differences in the copy number of adjacent loci after appending normal, diploid copy number loci to both ends of p. Specifically, where the constant 2 represents a normal, diploid copy number.
A basic property of the delta map Δ is that it is invertible.Proposition 1.The delta map D : Since Δ is one-to-one and onto with respect to D nþ1 , each delta profile q then corresponds to a unique copy number profile p = Δ −1 (q).Interestingly, a copy number event c s,t,b applied to a copy number profile p only affects two entries of the delta profile Δ(p), meaning that loci of the corresponding delta profile are (nearly) independent.We formalize this in the following definition of a delta event.
Definition 6 (Delta event).A delta event d s;t;b : D n !D n is a function that maps a delta profile q 2 D n to a delta profile δ s,t,b (q) described by its entries as where s � t and b 2 {+ 1, −1}.We denote such a function as δ when clear by context.
A delta transformation D = (δ 1 , . .., δ n ) is the composition of multiple delta events, where D (q) = δ n (� � �(δ 2 (δ 1 (q)))).We now state the connection between delta events and zero-agnostic copy number (ZCNT) events in the following theorem and corollary.) be the minimum number of zeroagnostic copy number events needed to transform the copy number profile p to p 0 .In this section we derive a closed form expression for d(p, p 0 ).
We begin by noting that d(p, p 0 ) is equal to the minimum number d 0 (Δ(p), Δ(p 0 )) of delta events needed to transform delta profile Δ(p) to Δ(p 0 ).This follows from the equivalence between the copy number transformations and the corresponding delta transformation (Corollary 1).Thus, it suffices to only consider delta profiles and delta events; for the rest of the section all profiles q and q 0 are delta profiles unless otherwise specified.
We start by observing two basic facts: delta transformations are commutative and d 0 (q, q 0 ) forms a metric.
Proposition 2. A delta transformation D = (δ 1 , . .., δ n ) is commutative.That is, the application of D to a profile is identical to the application of D s ¼ ðd s 1 ; . . .; d s n Þ where σ is any permutation of {1, . .., n}.
Proposition 3. d 0 (q, q 0 ) is a distance metric.Further, d 0 ðq; q 0 Þ ¼ d 0 ðq À q 0 ; 0Þ ¼ d 0 ðq 0 À q; 0Þ: Note that this also implies that zero-agnostic copy number transformations are commutative and that d(�, �) is a distance metric.To see this, let C ¼ ðc s 1 ;t 1 ;b 1 ; . . .; c s n ;t n ;b n Þ be a zeroagnostic copy number transformation and D ¼ ðd s 1 ;t 1 ;b 1 ; . . .; d s n ;t n ;b n Þ be the corresponding delta transformation, then for any pair of copy number profiles p; p 0 2 Z n and permutation σ, where the first equivalence follows from Corollary 1, the second from Proposition 2, and the third from Corollary 1.This implies that C(p) = C σ (p), which proves that a zero-agnostic copy number transformation is commutative.To see that d(�, �) is a distance metric, it suffices to observe that d(p, p 0 ) = d 0 (Δ(p), Δ(p 0 )) implies symmetry and reflexivity.The triangle inequality is satisfied since the composition of a zero-agnostic copy number transformation from p to r and r to p 0 yields a copy number transformation from p to p 0 .
As a corollary to the commutativity of zero-agnostic copy number transformations, given any zero-agnostic copy number transformation C, one can re-order the events such that amplifications always occur first.After this reordering of events, no intermediate profile has negative entries as long as both the initial and resultant profile have no negative entries.Thus, it is always possible to find a minimum length zero-agnostic copy number transformation such that no intermediate profile has negative entries.
From our characterization of delta profiles, we derive our expression for the distance between delta profiles.
Proof.Since each delta event decreases the total magnitude of kqk 1 by at most two, to transform q to the 0 profile requires at least 1  2 kqk 1 events.We prove the other direction by induction on P q i >0 jq i j.Clearly, if the sum is zero, the claim holds.Otherwise, by the balancing condition (1), we can choose δ to be any delta event that decrements i 2 {i : q i > 0} and increments j 2 {j : q i < 0}.Applying δ to q results in a delta profile δ(q) such that kδ(q)k 1 = kqk 1 − 2. Invoking the induction hypothesis then yields a sequence of 1  2 kqk delta events to transform q to the 0 profile.The second statement follows from Proposition 3. As a corollary to the above theorem and the equivalence between zero-agnostic copy number transformations and delta transformations (Corollary 1), we have our closed form expression for the ZCNT distance between copy number profiles.
Corollary 2. For all copy number profiles p and p 0 in Z n , Further, as a corollary to the fact that d(p, p 0 ) is a distance metric, the following median distance is trivially computed in linear time: Corollary 3. Given two copy number profiles p and p 0 in Z n , both p and p 0 minimize the median distance d(r, p) + d(r, p 0 ) over all choices of copy number profiles r.Thus, min r2Z n fdðr; pÞ þ dðr; p 0 Þg ¼ dðp; p 0 Þ:

ZCNT small parsimony
We show below that the special form of the ZCNT model enables us to solve three variants and special cases of the small parsimony problem polynomial time.First, using the equivalence between copy number profiles and delta profiles described above, we formulate the small parsimony problem (Problem 1) using the ZCNT model as follows, where we drop the constant factor of 1/2 for ease of exposition.
Problem 3 (ZCNT Small Parsimony).Given a copy number matrix M, a tree T and an assignment π of cells to leaves, find a vertex labeling ' : VðT Þ !Z m minimizing the cost JðT ; 'Þ, or equivalently the sum such that the following two conditions are satisfied: ii. l(u) satisfies the balancing condition (1) for all vertices u 2 VðT Þ.
To solve the above problem, we recall the general form of the Sankoff-Rousseau recurrence [33,35] for solving the small parsimony problem.Let cðT ; xÞ be the cost of the optimal labeling ' of VðT Þ that satisfies condition (i) of Problem 3 and has label x for the root.Let T w denote the sub-tree rooted at w and suppose that w has children u and v.Then, by condition (ii), and the requirement that the ancestral labeling lies in Z m , we have the following recurrence relation [35]: This recurrence has several difficulties.First, y and z are unbounded and can take on any value in Z m .Thus, it is impossible to store a dynamic programming table for c(T w ; x) without imposing bounds on the maximum copy number.Further, even when the entries are constrained to a bounded interval {0, . .., B} m , the dynamic programming table has size (B + 1) m , exponentially large.Second, because of the balancing conditions (1), P n i¼1 y i ¼ P n i¼1 z i ¼ 0, one cannot analyze the loci independently.
2.4.1 Two polynomial time relaxations.Despite these challenges, the recurrence ( 2) is a substantial improvement over the analogous recurrence under the CNT model.In fact, if we remove either the balancing (1) or the integrality condition, we can solve this recurrence in (resp.strong or weak) polynomial time.These relaxations not only provide us with lower bounds for the value of the true solution that can be computed efficiently, but these lower bounds are empirically quite tight (Section 3.4).
Theorem 3 If the balancing condition (1) is dropped, the ZCNT small parsimony problem can be solved in OðmnÞ time.If the integrality condition is dropped, the ZCNT small parsimony problem can be solved in (weakly) polynomial time using a linear program with OðmnÞ variables and constraints.
Both of these facts derive from our closed form expression for the ZCNT distance between two copy number profiles in terms of the ℓ 1 norm.We sketch the ideas here, and refer to Sections B.1 and B.2 in S1 Text for proofs of these claims.
For the first case when we drop the balancing condition, we can analyze the loci independently as there is no constraint on the entries of the ancestral profiles.Then, it suffices to observe that since the distance corresponds to the absolute difference, the function cðT w ; xÞ has a nice structure and we do not have to store an infinitely large dynamic programming table.When the integrality is removed, then, since both (i) and (ii) are linear constraints on the profiles and because ℓ 1 norm minimization can be written as a linear program (LP), there is an LP formulation of the ZCNT small parsimony problem.As it is well known we can solve LPs in (weakly) polynomial time, this concludes the second case.

A relabeling strategy and approximation algorithm.
Since solutions to the relaxed ZCNT small parsimony problem do not necessarily yield solutions to the original problem, e.g. by not satisfying the balancing condition, it is natural to ask if it is possible to "fix" such solutions.Here, we show that given any ancestral labeling of a tree T , we can perform local fixes that do not substantially increase the small parsimony objective and ensure that the resultant labeling satisfies the balancing condition.Then, we will show that these local fixes lead to a linear time, 2-approximation algorithm for the ZCNT small parsimony problem.While the results here are stated (and proven) only for binary trees, these results are without loss of generality as polytomies can be (arbitrarily) resolved to create binary trees without increasing the ZCNT small parsimony score.We sketch the ideas here and refer to Section B.3 in S1 Text for proofs of these claims.
Given a labeling ' : VðT Þ !Z m of T , the discrepancy of a vector l(u) is defined as and is also called the discrepancy of the vertex u.Interestingly, the vertex discrepancy provides us with a bound on the cost to relabel ℓ to ensure it satisfies the balancing condition (1), as stated in the following corollary.Corollary 4. Let ' : VðT Þ !Z m be any labeling of a binary tree T such that ℓ(π(i)) = Δ(M i ) for all cells i 2 [n].Then, we can construct a new labeling ℓ 0 satisfying the balancing condition (1) such that: Further, we can compute the labeling ℓ 0 in OðmnÞ time.Said another way, the above corollary provides us with an approximation algorithm for the ZCNT small parsimony problem where the approximation error is bounded by the total discrepancy.In fact, this general strategy gives us a linear time, 2-approximation algorithm.To see this, let ℓ be a solution to the ZCNT small parsimony problem when the balancing condition ( 1) is dropped.Then, the cost JðT ; 'Þ is at least as large as the total discrepancy, as stated in the following lemma.
Lemma 1.Let ' : VðT Þ !Z m be any labeling of a binary tree T such that ℓ(π Then, relabel ℓ using the above strategy to obtain ℓ 0 .Since JðT ; 'Þ is a lower bound of the true cost and since JðT 0 ; 'Þ is at most twice JðT ; 'Þ, the cost of ℓ 0 is at most twice the true cost.Further, since we can compute both the ℓ and ℓ 0 in linear time, this leads to the following theorem.
Theorem 4. For binary trees, the ZCNT small parsimony problem can be solved to within a constant factor of 2 in OðmnÞ time.

A special case:
The multiple median problem.We conclude this section with an investigation into a special case where the input tree has a single internal node, also known as the (multiple) median problem [25].More formally, let p 1 , . .., p n be a set of copy number profiles, the ZCNT median problem is to compute the the profile r minimizing the total distance:

1). It then suffices to show that we can efficiently compute A[j, k].
To compute A[j, k], first notice there is always an optimal solution such that r j 2 {−B, . .., B} for all j 2 [m] where B is the maximum copy number in the input.Then, the vertex discrepancy k is in {−Bm, . .., Bm}, and we can represent A[j, k] as an m × Bm matrix.Matrix A then satisfies the recurrence For the boundary conditions, we set A[i, k] = 0.As it takes 2Bn computations to fill in each A [j, k], it takes Oðnm 2 B 2 Þ total time to fill in A. Reading off a solution takes the same time.Thus, we have solved the ZCNT median problem in Oðnm 2 B 2 Þ time.

Lazac algorithm for ZCNT large parsimony
We develop a tree-search algorithm, Lazac, to find approximate solutions to the ZCNT large parsimony problem (Problem 2).Our procedure searches the space of copy number trees for a given copy number matrix C using sub-tree interchange operations [36] and relies heavily on the efficient algorithm we developed for the small parsimony problem (Problem 3) when the balancing condition (1) is dropped.The procedure is similar to the tree search procedure we developed for lineage tracing data [37] based off IQ-TREE [36].Complete details on our tree search procedure are in Section A in S1 Text Lazac is implemented in C++17 and is freely available at: github.com/raphael-group/lazac-copy-number.

Comparison of copy number distances and phylogenies on prostate cancer data
We first investigated the differences between the CNT and ZCNT distances on copy number profiles inferred from bulk whole-genome sequencing data from ten metastatic prostate cancer patients [38].We analyzed the copy number profiles for these patients published in [27].For each pair of copy number profiles from distinct samples (e.g.anatomical sites) from the same patient, we computed the CNT distance d CNT and ZNCT distance d ZCNT .We found that for all ten patients, the median relative difference |d CNT /d ZCNT − 1| over all pairs of samples was less than 5%-and for many patients the relative difference was even smaller (S5 and S6 Figs).
Lazac inferred the most accurate phylogenetic trees across varying number of cells (n = 100, 150, 200, 250, 600) and loci (l = 1000, 2000, 3000, 4000).In particular, we found that on all but one parameter setting, Lazac had the lowest median RF (Fig 2 ) and Quartet distance (Section A.5 in S1 Text) on simulated instances (S1 Fig) .Further, on large phylogenies containing n = 600 cells, Lazac showed an even larger improvement in median RF (Fig 2 ) and Quartet distance (Section A.5 in S1 Text) over other methods, owing to its scalability.In terms of speed, Lazac was the fastest method on every instance, taking less than *250 seconds to run on the largest simulated dataset containing 600 cells.Further, it was *100 times faster than the other top performing methods Sitka and MEDICC2 (Fig 2B).
As a further evaluation of the differences between the ZCNT and CNT distances, we compared the trees obtained using distance-based phylogenetic methods with the ZCNT and CNT distances.Specifically, we compared the performance of applying neighbor joining on the ZCNT distances, referred to as Lazac-NJ, to three distance-based methods for reconstructing copy number phylognies: MEDICC2 [27], WCND [26], and MEDALT [31] on simulated data.MEDICC2 and WCND compute distances based on extensions of the CNT model and then apply neighbor joining to infer phylogenies.As such, they allow for a natural benchmark with which to compare our simpler, ZCNT distance.Lazac-NJ had nearly identical (within 1%) median RF and Quartet distance compared to other distance based methods (S2 and S3 Figs).This provides evidence that even by itself, the ZCNT distance is useful for phylogenetic reconstruction.However, WCND performed at least as well or strictly better than Lazac-NJ in terms of median RF and Quartet distance for all but two instances (S3 Fig) , emphasizing the importance of accurate estimation of the true evolutionary distance.

Single-cell DNA sequencing data
We used Lazac to analyze single-cell whole genome sequencing (WGS) data from 28 human breast and ovarian tumor samples [7].This dataset was generated using the DLP+ [6], singlecell whole-genome sequencing technology which produces � 0.04× coverage from a median (resp.mean) of 636 (resp.1457) cells per sample.The original study used Sitka [39], a method that uses the breakpoints between copy number segments as phylogenetic markers, to construct copy number phylogenies using this data.
We found that the phylogenies inferred by Lazac are substantially different than the phylogenies constructed by Sitka.Specifically, the normalized RF distance between pairs of phylogenies was greater than 0.5 in most cases, which means that at least half the edges in the phylogeny separate distinct sets of leaves (Fig 3A).The normalized distances were also large across other metrics of tree dissimilarity, including the Quartet and Matching Split metrics.Other summary statistics calculated from the structure of the inferred phylogenetic trees also highlight these differences; for example, while the Sitka phylogenies were highly unresolved, that is, containing many nodes with more than two children, the Lazac phylogenies were typically much more resolved.
To further investigate these differences, we first examined whether the more resolved phylogenies inferred by Lazac had a meaningful difference in discerning neighboring cells.Specifically, we define the sibling dissimilarity metric to be the normalized Hamming distance between the copy number profiles of a pair of siblings in the phylogeny.We found that in 27/ 28 of the samples, the Lazac phylogenies had lower mean sibling dissimilarity scores compared to the Sitka phylogenies.In some samples, such as SA1055, the sibling dissimilarity score differs by several orders of magnitude (Figs 3B and 4C).This indicates that the more resolved phylogenies inferred by Lazac helps separate cells with distinctive copy number profiles.Or equivalently, the highly unresolved groups of cells in the phylogenies inferred by Sitka contain substantial variation in their copy number profiles.
Second, we analyzed the concordance between the phylogenies and the assignments of cells to copy number clones that was reported in the original publication [7].The clone labels are obtained from the assignment of each cell to one of k clones, defined by the clustering of cells according to their copy number profiles performed in [7].For a dataset with k clone labels, the minimum possible clonal discordance score for a tree is k − 1, corresponding to the case where each clone label forms a clade in the tree.We find that on 20/28 of the samples, the Lazac phylogenies had substantially lower clonal discordance scores than the Sitka phylogenies (S4 and S12 Figs) showing that the Lazac phylogenies were more concordant with the copy number clones compared to the published phylogenies.
Third, we verified that Lazac better optimizes ZCNT small parsimony score as compared to Sitka.Specifically, we computed the exact ZCNT small parsimony score on the phylogenies inferred by both algorithms using integer linear programming.Unsurprisingly, we found that the Lazac inferred phylogenies had lower ZCNT small parsimony scores than Sitka in every case (S9 Fig) .Perhaps more surprising, we noticed that on several samples, such as SA1096 and SA1182, the ZCNT small parsimony scores for the Sitka phylogenies were quite close (within 5%) to the scores for the Lazac phylogenies.These samples were also the ones where the Sitka phylogenies had lower clonal discordance scores than Lazac, supporting the hypothesis that ZCNT parsimony score concords well with the clonal discordance score.
Fourth, we verified that the ancestral labeling inferred by our ZCNT model was biologically meaningful by checking whether it assigned negative copy numbers to ancestors or included amplifications of regions with zero copy number.Negative copy numbers and amplification of zero copy numbers are allowed by the ZCNT model but are violations of the CNT model.In 10/28 patient samples, none of the ancestors were labeled with negative copy number regions and in all 28 samples, none of the ancestors were labeled with a copy number below -1.Copy number of -1 was extremely infrequent: across all samples, the total number of negative stretches normalized by the number of edges in the tree was at most 0.1 (median 0.002) (S10  Amplifications of zero (or negative) copy number loci were also quite rare: across all samples, the total number of such amplifications normalized by the number of edges in the tree was less than 0.33 (median 0.042) (S10 Fig).These CNT violations had negligible effect on the parsimony score: for all samples, CNT violations contributed less than 2% of the ZCNT small parsimony score (S11 Fig).Thus, while our ZCNT model allows for both amplifications of zero copy number loci and negative copy number loci, these events were rare on the patient samples we analyzed and contributed little to the ZCNT small parsimony score.
As a final quantitative evaluation of the Lazac and Sitka phylogenies, we examined whether somatic single-nucleotide variants (SNVs) supported the splits in each phylogeny, following the approach of [4].Note that these SNVs were not used in the inference of either phylogeny, and thus they provide independent validation of the phylogeny.Given the extremely low sequence coverage (0.04× per cell), it is not possible to reliably measure SNVs of individual cells.Thus, we performed this analysis on the three samples (SA039, SA604, SA1035) with the largest number of cells.We identified subtrees in the phylogeny with at least 5% and at most 15% of the cells and identified SNVs present in the subpopulation of cells in these subtrees.Following the approach in [4], we perform a permutation test to determine whether the subtree is supported by more SNVs than expected (Section A.4 in S1 Text).For all three samples, we found that the Lazac phylogenies had a greater fraction of supported subtrees (P < 0.05) than the Sitka phylogenies (S13 Fig) .On the largest sample, SA1035, we identified five out of six supported subtrees (supported by 3175, 3334, 3799, 3435, and 3402 SNVs) for the Lazac phylogeny compared to only three of eight statistically significant subtrees (supported by 3426, 3129, and 3362 SNVs) for the Sitka phylogeny.
Finally, we performed a qualitative analysis of one of the smaller patient samples, SA1292, to demonstrate the differences between the Sitka and Lazac phylogenies.To perform this investigation, we first drew the Sitka and Lazac phylogenies for sample SA1292 and colored the leaves by the clone labels (Fig 4A).From this drawing, we noticed that the structure of the Sitka trees was much less resolved (mean degree 6.84) as compared to the Lazac trees (mean degree 3.04).We randomly sampled one of these high degree nodes in the Sitka phylogeny, which was the parent of 40 cells.While all 40 cells were siblings in the Sitka phylogeny, in the Lazac phylogeny the 40 cells were part of distinct clades.We split the 40 cells into two groups by picking the edge furthest from the root which partitions the cells into two groups, each containing at least 10 cells.We name these groups Group 1 and Group 2, and color the cells from these groups blue and pink, respectively (Fig 4A).We then looked at the copy number profiles of the cells (Fig 4B ) and noticed distinct chromosomal aberrations between the two groups.Specifically, 15 of the 26 cells in Group 1 had amplifications in both the p arm of chromosome 17 and the q arm of chromosome 10.In contrast, only 2 of the 14 cells in Group 2 had both amplifications, a significant difference in proportion (p < 0.01; two-sided binomial test of proportion; Fig 4D).

Approximation error of ZCNT small parsimony relaxations
We investigated the approximation error produced by the relaxations (Section 2.4) used in our two polynomial time algorithms for the ZCNT small parsimony problem.To perform this investigation, we first generated a set of 200 copy number phylogenies by perturbing the published phylogeny [39] using from 1 to 200 random nearest neighbor interchange (NNI) operations (Section 3.3).Then, for each phylogeny, we computed the optimal solution to the ZCNT small parsimony problem and its two relaxations using (integer) linear programming.
Importantly, we found that the exact solution to the ZCNT small parsimony problem and the solution obtained by relaxing the integrality condition were nearly identical in every case.Specifically, they had the exact same solution on 180/200 instances and the solution differed by at most 0.7 on the remaining 20 instances.This leads us to believe that the relaxed linear program often has a special structure, which makes it a good approximation to the ZCNT small parsimony problem.
Removing the balancing condition resulted in solutions with a lower score, implying that the balancing condition does meaningfully constrain the solution space.However, the rankings of phylogenies with and without the balancing condition were highly concordant (Spearman correlation R 2 = 0.9644, p < 10 −100 ) across the 200 copy number phylogenies (S7 and S8 Figs).Thus, when ranking phylogenies by ZCNT parsimony score-as is done when solving the ZCNT large parsimony problem-removing the balancing condition does not change the result substantially.

Discussion
We introduced the zero-agnostic copy number transformation (ZCNT) model, a relaxation of the CNT model that allows for modification of zero copy number regions.We derived a closed-form expression for the ZCNT distance and presented polynomial time algorithms to solve two natural approximations of the small parsimony problem for copy number profiles.We used our efficient algorithm for the small parsimony problem to derive a method Lazac, to solve the large parsimony problem for copy number profiles.We demonstrated that on both real and simulated data, Lazac found better copy number phylogenies than existing methods.
There are multiple directions for future work.First, the complexity of the small parsimony problems for both the CNT and ZCNT models remains open, though we conjecture, and provide empirical evidence, that the latter is polynomial.Second, the algorithm we developed for the ZCNT large parsimony problem relies on a simple, hill climbing search using nearestneighbor interchange operations.We expect that a more advanced approach that uses state-ofthe-art techniques from phylogenetics [36,40] could substantially improve both inference speed and accuracy.Finally, is to apply Lazac to other large single-cell WGS datasets [4,8].We anticipate that the scalability and accuracy of Lazac will be useful in analyzing the increasing amount of single-cell WGS cancer sequencing data.

Fig 1 .
Fig 1.(a) The results of applying a copy number transformation c 2,6,+1 under both the CNT model and ZCNT model.Under the CNT model a zero cannot be increased via the amplification, but the zero-agnostic CNT model allows the zero to increase to one copy.(b) An instance of the ZCNT small parsimony problem: given a tree with copy number profiles labeling the leaves, the goal is to infer the ancestral copy number profiles that minimize the total ZCNT distance across all edges.https://doi.org/10.1371/journal.pcbi.1011590.g001 d 0 ðDðrÞ; Dðp n ÞÞ: We can solve this with an elegant dynamic program.Define A[j, k] as the minimum of d 0 ðr½1 . . .j�; Dðp 1 Þ½1 . . .j�Þ þ � � � þ dðr½1 . . .j�; Dðp n Þ½1 . . .j�Þ over all r where r[1. ..j] is the j th prefix of r and r[1. ..j] has discrepancy k.Then, clearly A[m, 0] is the score of the optimal solution to the ZCNT median problem, since if r has discrepancy zero it satisfies the balancing condition ( ends in b and has discrepancy k, the prefix r [1. ..j − 1] has discrepancy k − b.

Fig 4 .
Fig 4. (a) The copy number phylogenies inferred by Lazac and Sitka on sample SA1292 with the leaves colored by the corresponding clone labels and a subset of cells colored by their assignment to one of two groups (blue and pink).(b) The copy number profiles of the cells in (left) Group 1 and (right) Group 2. (c) The distribution of sibling dissimilarity scores for sample SA1292.(d) The results of a two-sample binomial test of proportion for the observed frequencies of chr10 and chr17 amplifications present in Group 1 and Group 2. https://doi.org/10.1371/journal.pcbi.1011590.g004

Theorem 1 .
Let c s,t,b be a zero-agnostic copy number event and δ s,t,b be a delta event.Then, p 0 ¼ c s;t;b ðpÞ if and only if Dðp 0 Þ ¼ d s;t;b ðDðpÞÞ: Corollary 1.Let C ¼ ðc s 1 ;t 1 ;b 1 ; . . .; c s n ;t n ;b n Þ be a zero-agnostic copy number transformation and D ¼ ðd s 1 ;t 1 ;b 1 ; . . .; d s n ;t n ;b n Þ be the corresponding delta transformation.Then, p 0 ¼ CðpÞ if and only if Dðp 0 Þ ¼ DðDðpÞÞ Proof.The corollary follows by induction on |C| and repeated application of (Theorem 1).