Assessing the performance of methods for copy number aberration detection from single-cell DNA sequencing data

Single-cell DNA sequencing technologies are enabling the study of mutations and their evolutionary trajectories in cancer. Somatic copy number aberrations (CNAs) have been implicated in the development and progression of various types of cancer. A wide array of methods for CNA detection has been either developed specifically for or adapted to single-cell DNA sequencing data. Understanding the strengths and limitations that are unique to each of these methods is very important for obtaining accurate copy number profiles from single-cell DNA sequencing data. We benchmarked three widely used methods–Ginkgo, HMMcopy, and CopyNumber–on simulated as well as real datasets. To facilitate this, we developed a novel simulator of single-cell genome evolution in the presence of CNAs. Furthermore, to assess performance on empirical data where the ground truth is unknown, we introduce a phylogeny-based measure for identifying potentially erroneous inferences. While single-cell DNA sequencing is very promising for elucidating and understanding CNAs, our findings show that even the best existing method does not exceed 80% accuracy. New methods that significantly improve upon the accuracy of these three methods are needed. Furthermore, with the large datasets being generated, the methods must be computationally efficient.

have been developed for detecting CNAs from NGS data. SeqCNA [12] filters out potentially false-positive CNAs and corrects GC content for a more accurate CNA detection. CNAseg [14] analyzes flowcell-to-flowcell variability to avoid false-positive CNAs. CNAnorm [15] addresses the normal contamination and aneuploidy of the tumor sample to infer CNAs accurately. Paired-end NGS data provides another modality in addition to the read depth to infer CNA, and a few bioinformatics tools use this, including, for example, CNVer [17], CNVnator [18], ReadDepth [19], and Mseq-CNV [20].
Although both aCGH and NGS data can be used to detect CNAs, they do not provide high-throughput data at the single-cell resolution that is ideal for understanding tumor growth. In particular, intratumor heterogeneity [22] is best understood by using data obtained from individual cells within the tumor tissue. Indeed, in the last ten years, the field has made great strides towards developing technologies for single-cell DNA sequencing. Data generated by these technologies can be analyzed to detect CNAs and other types of mutations in individual cells and individual clones within the heterogeneous tumor [23]. For example, DOP-PCR is a PCR amplification method that generates low-coverage data suitable for CNA detection in single-cell data [24][25][26][27]. However, it also suffers from uneven coverage and allelic dropout [23] that could lead to false-positive calls of CNAs. Beyond this method, three tools have been extensively applied to single-cell sequencing data for CNA detection: AneuFinder [28,29], CopyNumber [30], and Ginkgo [31]. Like HMMcopy, AneuFinder uses a Hidden Markov Model (HMM) to determine the segmentation of the genome, i.e., to computationally segment the genome into non-overlapping regions so that each region has a homogeneous copy number, and the absolute copy number of each segment. CopyNumber [30] pools all the cells together for joint segmentation to improve boundary detection accuracy. Since cancer cells in the same subclone mostly share the same CNA boundaries, such a strategy can improve the nucleotide resolution of the boundary by implicitly amplifying the signal in the data. Ginkgo [31] uses Circular Binary Segmentation (CBS) [7] to segment the genome, followed by inferring the integer value of the absolute copy number. It is worth noting that some methods designed for aCGH and NGS data have also been extensively used on single-cell data, and this is especially true for HMMcopy [32][33][34][35][36]. As both AneuFinder and HMMcopy are HMM-based methods, we focus on HMMcopy as a representative of the HMM-based approach due to its wide application to single-cell sequencing data in multiple studies [32][33][34][35][36]. However, considering AneuFinder appears later than HMMcopy, we also benchmark its performance particularly on different ploidies (average copy number across the genome), as from our observation of HMMcopy, correctly inferring the ploidy level is a challenge to such HMM-based approaches. A more recent method is SCNV [37], which uses a bin-free segmentation method to do segmentation and copy number profiling. However, the method has not been widely applied to single-cell DNA studies. Most recently, Chisel [38] was introduced for detecting CNAs in single-cell sequencing data. With the help of a matched normal sample or identified normal single cells, Chisel phases CNAs by germline single nucleotide polymorphisms at a low coverage, and further infers allele-and haplotypespecific CNAs. Such allele-and haplotype-specific CNAs can subsequently help improve single-cell clustering and phylogenetic tree inference. As our study here does not deal with phasing, we did not include Chisel in our study. Among CopyNumber, Ginkgo, and HMMcopy, only CopyNumber utilized the pooled information from single-cell data. The other two methods can be equally well applied to bulk samples. Moreover, HMMcopy was designed for aCGH data originally, and thus does not take into account the specific error profiles that characterize single-cell sequencing data, such as low and uneven coverage, or the computational challenges that arise due to biological processes such as aneuploidy in a tumor single cell.
In this paper, we compared and benchmarked three methods that have been widely applied to CNA detection on single-cell DNA cancer data: Ginkgo, HMMcopy, and CopyNumber. We developed a simulator of cancer genome evolution in the presence of CNAs and used it to investigate the accuracy, running time and memory consumption of the three methods. We also investigated their performance on a real dataset and assessed their consistency. An important contribution of this paper is the use of phylogeny-based analysis of CNA calls to identify potentially erroneous ones. In particular, the use of a phylogeny helps identify regions that have witnessed a large number of mutations which could be further explored for potential error or some interesting biological explanation.
We found that in terms of the accuracy of the detected breakpoints (positions on the genome where segmentation occurs) and memory consumption, HMMcopy is the best of the three methods, and in terms of running time, it is slower than CopyNumber but faster than Ginkgo. However, when evaluated the methods in terms of the actual copy number profiles they detect, we found that Ginkgo is more accurate than HMMcopy; in fact, we found that HMMcopy is not stable at this task (paradoxically, CopyNumber does not detect actual copy numbers). In terms of accuracy, CopyNumber has a very poor performance. While Knouse et al. [34] assessed the performance of CBS and HMM-based methods on single-cell DNA sequencing data, their evaluation is limited to CNVs in brain and skin cells. Moreover, they did not investigate the effect of the ploidy on the accuracy of the methods. Our results highlight the need for developing new accurate and efficient methods for CNA detection from singlecell DNA data.

Results
To better understand the strengths and limitations of current approaches for CNA detection from single-cell DNA sequencing data, we selected HMMcopy, Ginkgo, and CopyNumber using both simulation and realistic data.
For simulation, we designed three experiments to evaluate the performance of the three methods under different conditions. The first experiment was designed to evaluate the recall and precision of the CNA detection methods. We designed the simulation study in this experiment to produce single cells that have ideal read count variability and normal ploidy levels ranging between 2 and 3, so that we can learn how the methods perform when the data is relatively not challenging. The second experiment was designed to evaluate the performance of each method under a variety of ploidy levels. Specifically, we simulated single cells whose ploidies range from 1.5 to 5.26. We then compared the recall and precision of the three methods on the simulated data at different ploidies. The third experiment was designed to assess the performance of each method under different coverage variabilities. In particular, we simulated single cells whose coverage variabilities mimic those produced by three single-cell sequencing technologies (MALBAC, DOP-PCR and TnBC) [39] that have been used for CNA detection.

Performance on simulated datasets
In the first experiment, we simulated the evolution of 10,000 cells from which we generated, through sampling without replacement, three 1000-cell datasets. For each cell, we simulated read data using a simulator that we developed (Methods). We then aligned the reads back to the reference genome using BWA with default parameters [40,41]. Finally, we ran each of three methods on the resulting BAM files, and computed the recall and precision of each method based on the ground truth generated by the simulator.
We assessed the methods' performances in coarse-and fine-grained analyses. For the coarse-grained evaluation, we inspected the breakpoint positions as well as whether they were consistent with the ground truth in terms of the estimated gain/loss state (rather than the actual value) in the copy number. The predicted breakpoint is counted as consistent with the ground truth whenever it has the same status (i.e., the copy number increases or decreases) and its genomic location is within a certain distance of its counterpart in the ground truth. We varied the value of this distance to study the methods' accuracies. Each ground truth breakpoint was matched by at most one predicted breakpoint to avoid double counting of the truepositive calls. For each method, we varied a parameter to obtain the receiver operator characteristic (ROC) curve, the details of which are described in the caption of Fig 1. A preliminary analysis of CopyNumber on the data revealed that the method achieves extremely low recall and precision. Since CopyNumber pools all the cells together for breakpoint detection, we suspected that this poor performance owes mainly to the method's lack of sensitivity in detecting breakpoints shared by only a small number of cells. Therefore, to allow for more meaningful comparison of CopyNumber to the other two methods, we eliminated breakpoint pairs shared by fewer than five cells in the ground truth and used the resulting new ground truth to evaluate CopyNumber's recall and precision (but we did not filter the breakpoints for the other two methods). As Fig 1a shows, CopyNumber still has, by far, the poorest performance. We hypothesize that for a breakpoint to be detectable by CopyNumber, it needs to be shared by a large number of cells. We further checked this by calculating the number of cells sharing a breakpoint that is called or missed by CopyNumber, and found that there is a significant difference between the two sets (p-value < 9.019e-09 for Student's t-test with mean 9.27 versus 5.35). We also observe that as the tolerance threshold for the detected breakpoint position increases, improvement in CopyNumber's performance is much larger than the improvement in the performance of the other two methods. However, even with the most forgiving threshold (two breakpoints are considered to be the same if their positions are within 1 million basepairs of each other), CopyNumber still has poorer performance than the other two methods even under the most stringent threshold. Overall, the results in Fig 1a show that (1) HMMcopy generally outperforms the other two methods, with Ginkgo being a close second, and (2) that even HMMcopy's best recall and precision are around 0.8 and 0.7, respectively.
In the fine-grained analysis, we focused on the agreement of the absolute copy numbers on both 5' and 3' of an inferred breakpoint with those of the ground truth, in addition to the requirements on gain/loss consistency and distance. Since CopyNumber does not predict the absolute copy numbers for an individual cell, it is not considered in this analysis. Surprisingly, HMMcopy's prediction of the absolute copy number is not stable, leading to a bimodal distribution of both recall and precision (S1 Fig). We found that cells with low recall and precision correspond mainly to cases where HMMcopy made inaccurate estimates of the cells' ploidies (S2 Fig). We then selected only those cells for which the ploidy was correctly predicted (i.e., 2 or 3), and plotted the ROC curve of HMMcopy on them. We found that HMMcopy performed generally better than Ginkgo (Fig 1b), which is in agreement with the coarse-grained analysis. The recall and precision for the two methods dropped, which is expected since the true positives and negatives are now measured most stringently. However, we observed that the difference in results between the coarse-and fine-grained analyses is not large, suggesting that once the breakpoint is found by these methods, predicting the absolute copy number can be attained quite accurately. This is especially true for Ginkgo whose ploidy prediction is stable.
Similar results were observed on the other two datasets (S3 and S4 Figs). The results in Fig 1 were obtained under default parameters except for the parameters that were tuned to generate the ROC curves (alpha, gamma, and nu). However, we found that the value of parameter strength in HMMcopy has to be much larger than the default value in order to make the results more expected, i.e., increasing recall is accompanied with decreasing precision, and vice versa. We therefore set strength to be 10 million. According to HMMcopy's users' guide, For each method, the results based on three thresholds of correctness are plotted. For HMMcopy, nu, which controls the suggested degree of freedom between states, was tuned to take on the values 0.01 (rightmost), 0.1, 2.1 (the tool's default), 4, 10, and 20 (leftmost). For Ginkgo, alpha, which controls the significance level to accept a change point, was tuned to take on the values 1e-1000 (rightmost), 1e-100, 1e-10, 1e-5, 1e-4, 1e-3, 1e-2 (the tool's default), 0.02 and 0.05 (leftmost). The dots corresponding to values 1e-5 and 1e-10 in coarse-grained analysis overlap. For CopyNumber, gamma, which is the weight of the penalty on changing a state, was tuned to take on the values 40 (rightmost, and the tool's default), 10, 5, 4, 3, 2, and 1 (leftmost). https://doi.org/10.1371/journal.pcbi.1008012.g001

PLOS COMPUTATIONAL BIOLOGY
Benchmark methods on copy number aberration detection from single-cell DNA sequencing data strength is the parameter that controls how much the initial values of e, which controls the probability of extending a segment, remains the same throughout the iterations. We found that setting strength would help to have a good quality control of the result by making the initial setting of e last throughout all the iterations. Apart from parameters nu and strength, we found that e is also an important parameter in HMMcopy. The larger the value of e, the smaller the chance that the breakpoint is detected. To explore which combination can yield the best performance for HMMcopy, we varied both e and nu and calculated the F1 score. The performance of HMMcopy is the best when nu is 4 and e � 0.999999 (S5 Fig).
We also analyzed the computational requirements in terms of running time and memory consumption for the three methods on a 1000-cell dataset (Fig 2). Ginkgo is the slowest among the three and CopyNumber is the fastest. HMMcopy is in between Ginkgo and CopyNumber in terms of running time. For memory consumption, Ginkgo is the most demanding of the three, followed by CopyNumber. HMMcopy is the lightest in terms of memory consumption. Note that in running Ginkgo, we eliminated the steps of generating figures such as heatmaps and copy number profile, so that these do not affect the running time and memory in comparison with the other two methods. For CopyNumber, an extra step is required to generate its input file. We used the intermediate result of HMMcopy, i.e., the GC corrected read count on each bin, as the input to CopyNumber. We take the time for calculating this intermediate file into account for CopyNumber for a fair comparison. Since CopyNumber processes all the cells together, we suspect that more cells will require more memory, whereas Ginkgo and HMMcopy's memory requirements are not affected by the number of cells involved.
In summary, we found that HMMcopy is the most accurate in predicting breakpoints among the three. When HMMcopy's prediction of ploidy is accurate, its recall and precision of predicting the absolute copy number are the best among the three methods. However, it is not as stable as Ginkgo in terms of the absolute copy number detection since its prediction of ploidy is wrong for many cells (49.4% for default values of nu and e). CopyNumber's recall and precision are the worst of the three methods. Moreover, it cannot predict the absolute copy number for each individual cell, and thus is not as applicable as the other two methods.

The effect of ploidy on performance
To test the robustness of the methods to different ploidies, we varied the ploidy by tuning the parameters that control whole chromosomal amplifications and the rate of deletion (see the "Methods" section). We varied the ploidy from 1.5 to 5. Specifically, we used 1.5, 2.1, 3, 3.8, and 5.26 (we refer to them as 1, 2, 3, 4, and 5, respectively, hereafter), and generated three datasets for each ploidy. We tuned the coverage parameter for each ploidy so that the total number of reads for different ploidies are approximately the same to avoid adding reads for larger genomes resulting from higher ploidies.
We ran each method using their default parameters (except the strength parameter in HMMcopy). Finding CopyNumber's recall to be zero using the default gamma, we tuned gamma using the optimal value, i.e., 5, shown in Fig 1. We then found the recall greatly increased with this setting. Similar to the previous experiment, we again removed those breakpoint pairs shared by � 5 cells from the ground truth for evaluating CopyNumber's performance.
We used different combinations of the parameters to simulate high-ploidy cells (details are in the "Methods" section), i.e., 4 and 5, and found that in the absence of odd and intermediate copy numbers, HMMcopy's inference of the ploidy and absolute copy numbers were inaccurate (S6 Fig). This is also the case for Ginkgo in the absence of the odd copy numbers. However, despite the lack of intermediate copy numbers, Ginkgo correctly predicted absolute copy numbers for the case of ploidy = 5, showing that Ginkgo is more robust to changes in ploidy than HMMcopy in terms of predicting absolute copy numbers. In summary, the lack of odd or intermediate copy numbers in the data led to wrong predictions of absolute copy numbers. We then tuned the simulator's parameters so that in higher ploidies there are odd and intermediate copy numbers to avoid the extremely hard cases for CNA detection (details are in the "Methods" section). Fig 3 shows the precision and recall for the three methods for coarse-and fine-grained analyses, respectively.
In the coarse-grained analysis, Ginkgo's recall is > 0.8 in general, but its precision is relatively low (i.e., < 0.4) for ploidy 2 and 3. This is probably because it was affected by the variability of the read counts and over-segmented the genome. With similar recall, HMMcopy has higher precision at all ploidies. CopyNumber's recall and precision are low (< 0.4) for all ploidies, with low recall and precision for ploidy 5 and low precision for ploidy 2.
In the fine-grained analysis, Ginkgo's recall and precision dropped by about 10% as compared with the coarse-grained results. Its recall dropped the most for ploidy 5, indicating the challenge in accurately predicting the absolute copy number for high-ploidy cells. Although the odd and intermediate copy number states are present in this simulated data, HMMcopy's precision and recall are still bimodally distributed for all ploidies.
As we observed cells whose incorrect ploidy prediction led to wrong prediction of absolute copy numbers in the previous experiment, these bimodal distributions further showed that such wrong prediction can widely occur to cells with different ploidies. Similar results were observed on the other two datasets (S7, S8, S9 and S10 Figs). We then plotted the predicted versus actual ploidies for all cells for five simulated ploidies on one dataset, and found that HMMcopy's predicted ploidies deviate from the true ploidy for all simulated ploidies, whereas Ginkgo's prediction of ploidy is relatively accurate (S11 and S12 Figs). Specifically, we noticed that when the ploidy is between 1.5 and 3, most of the incorrectly predicted ploidies accumulated at the values that are multiples of the actual ploidy. For example, when the actual ploidy is 2, the incorrectly predicted ploidies were at 4 and 6, as seen in the histogram in S12(c) Fig. When the actual ploidy is greater than 4, the incorrectly predicted ploidies centered around the actual ploidy but with a large deviation. While both phenomena are probably due to the wrong selection of the best-fitting ploidy, which can be further traced to a suboptimal scoring algorithm for selecting the best ploidies in HMMcopy, the first phenomenon of having multiples of the actual ploidy is possibly also due to that the algorithm mistook the noises as the intermediate copy numbers, leading to a much larger predicted ploidy. In addition, while both Ginkgo and HMMcopy adopt a post-segmentation step for inferring the ploidy, the fact that Ginkgo allows non-integer ploidy but HMMcopy does not may lead to their performance difference in inferring the ploidy. Since AneuFinder and HMMcopy are both HMM-based methods but use different emission probability distributions (HMMcopy uses Gaussian whereas AneuFinder uses negative binomial), we also tested AneuFinder on datasets of varying ploidies (1.14.0, method="edivisive", binsizes = 2e5). We applied AneuFinder to the same dataset we described above and found that AneuFinder's recall and precision are generally lower than that of HMMcopy's, and the difference is extremely large when the ploidy is above 3. In particular, AneuFinder's precision drops below 0.2 for ploidy at 4 and 5 with recall between 0.4 and 0.6 (S13 Fig) for coarsegrained analysis. In other words, at a lower recall, the precision of AneuFinder for ploidy above 3 is at least 50% smaller than those of HMMcopy's. When the ploidy is between 2 and 3, while the recall of AneuFinder is close to that of HMMcopy, its precision is still much poorer than that of HMMcopy. For fine-grained analysis, AneuFinder's recall further drops near 0.2 when the ploidy is 4, and 0.05 when the ploidy is 5. Its precision also drops near 0.05 when the ploidy is 5 (S14 Fig). In all, we conclude that although AneuFinder does not show bimodality on fine-grained analysis, its recall and precision are both much lower than those of HMMcopy especially at high-ploidy datasets.

The effect of coverage on performance
To evaluate the performance of each method under different single-cell sequencing technologies, we sampled 20 cells from the simulated dataset and simulated their sequencing at four levels of coverage variabilities, corresponding to MALBAC, DOP-PCR, TnBC and Bulk sequencing (see details in the "Methods" section) and ran the three methods on each of them. We generated three datasets for each level of variability. Fig 4 shows the performance on one of the datasets. With decreasing variability, all three methods' recall increased under both the coarse-and fine-grained analyses. Ginkgo's and HMMcopy's precisions increased with decreasing variability. CopyNumber's precision, on the other hand, stays the same regardless of the coverage variability level, whereas its recall generally increases. In summary, better sequencing technology leads to better performance. The best that can be ever obtained (Bulk sequencing) is about 15% higher than the worst (MALBAC) for recall, and slightly higher for precision.
We looked into the copy number profiles in cases where HMMcopy's precision and recall were effectively 0 (one such case is illustrated in S15 Fig). We found that choosing a wrong ploidy from the set of candidate ploidies by HMMcopy may result in a copy number profile in which all the segments are predicted to have the same absolute copy number, whereas the closest profile to the ground truth is among the reported non-optimal results. We observed that in such cases, the wrong choice of ploidy may affect both the segmentation and inference of the absolute copy number of those segments.
Similar results were observed on the other two repetitions (S16, S17, S18 and S19 Figs). Performance on a real dataset. In real data analysis, due to the lack of ground truth, we evaluated the performance of the three methods in two ways. First, we evaluated the consistency among the three methods. The more overlap among the methods, the more consistent they are. Second, we inferred a maximum parsimony tree using PAUP [42] from the inferred copy number profiles and calculated the smallest number of copy number changes for each bin across all the branches of the tree, where the genome at the root of the tree is assumed to be diploid. The rationale for the latter way of assessing performance is that if the CNAs called by a method result (under a parsimony analysis) in a very large number of changes of the copy number at any bin, then one plausible explanation is error in the calls (another plausible prediction is that, for some reason, the locus corresponding to that bin has repeatedly gained and lost copies during the evolution of the cells which could be indicative of some interesting biological processes at play).
We downloaded single-cell DNA sequencing data of seven samples (the median number of cells in the seven samples is 47) from [43] (NCBI Sequence Read Archive under accession SRP114962) and selected those pre-treatment samples whose CNA profiles had not changed due to treatment compared with mid-treatment and post-treatment ones. We then ran the three methods with default parameters (except for the strength parameter in HMMcopy, as discussed above) on the single cells in each sample.
For assessing accuracy, we generated for each sample a Venn diagram of the predictions based on all three methods, where predictions by two methods were deemed consistent according to the same rule we used in the simulation study (in assessing consistency between predictions and the ground truth). Fig 5a shows the results for Sample 102 (S20 Fig  shows results for the other six samples). It can be seen that 47% of Ginkgo's calls overlapped with the other two methods, leaving a large portion as unique calls. HMMcopy overlapped well with the other two methods, with 22% of unique calls. In particular, HMMcopy overlapped well with Ginkgo: 76% of HMMcopy's calls overlapped with Ginkgo. CopyNumber's overlap with Ginkgo was larger than that of HMMcopy (65% versus 43%). The overlap among the three methods is a very small portion of the union of all calls (8%), indicating a very large inconsistency among the three methods. From these results, we observe that HMMcopy performed the best among the three in breakpoint calling, if we consider consistency with other methods as a metric of quality, which is consistent with what we observed on the simulated data. We then investigated the smallest number of changes required to explain the copy numbers detected by Ginkgo and HMMcopy (again, CopyNumber does not detect absolute copy numbers, which is why it is excluded in this analysis). To accomplish this, we inferred a phylogenetic tree from the inferred copy number profiles of the individual cells under the maximum parsimony criterion. We then inferred ancestral copy number profiles, also under the maximum parsimony criterion, and tabulated, for each bin, the number of changes that occurred along the branches of the tree. A bin that has a very large number of changes in the copy number could be pursued for further analysis. Two possible explanations of large number of changes are erroneous inferred copy numbers in this bin, or some real biological phenomenon that is worth pursuing further. Note that in this analysis we make the independent-bin assumption (copy number aberrations on each bin are independent of each other) and assume that bins undergo changes once at a time. More complex evolutionary models such as considering a group of neighboring bins together for a copy number aberration are beginning to emerge, but computing the parsimony score under those models is intractable [44]. Fig 5b and 5c show the distributions of copy number changes based on HMMcopy and Ginkgo's results, respectively (S21 and S22 Figs show results for the other six samples). Interestingly, for Ginkgo, four out of seven samples (samples 102, 132, 152, and 302) showed a higher number of bins that have one copy number change than ones with no copy number changes. The other three samples (samples 126, 129, 615) have the most bins that had no copy number changes at all. Generally the number of bins that had copy number changes decreased with the increasing number of changes. On the other hand, based on the HMMcopy results, all samples showed much higher percentage of no copy number change than those with some copy number change.

Discussion
Single-cell DNA sequencing data holds great promise for elucidating the complex biological processes that underly human diseases, including cancer. Copy number aberrations have been implicated in cancer, and their accurate detection from single-cell DNA sequencing data is of great potential to diagnosis and treatment of cancer. In this paper, we investigated the performance of three representative methods, Ginkgo, HMMcopy, and CopyNumber, that have been widely applied to single-cell sequencing cancer data for CNA detection.
We compared the three methods on simulated data generated under different settings that reflect varying degrees of complexity in the data. To accomplish this task, we developed a simulator that is flexible to simulate different scenarios and also mimic realistic data. We found that HMMcopy performs the best for breakpoint detection. However, HMMcopy is not stable in inferring the absolute copy number. Ginkgo performs well for both breakpoint detection and inference of the absolute copy number. CopyNumber is not as sensitive as the other two methods. We also looked into the performance of the three methods when ploidies were varied. We found that data with higher ploidies presented challenges for Ginkgo. HMMcopy is the most robust in terms of breakpoint detection among the three methods regardless of the ploidy, but its inference of the absolute copy number is not accurate for all ploidies. However, since the program yields copy number profiles for all possible integer ploidies, selecting the correct ploidy and obtaining the right copy number profiles will be trivial if given the pre-specified ploidy information, for example from DAPI staining. Both recall and precision of Copy-Number are the worst among the three methods. To explore the effect of technology artifacts on the accuracy of the methods, we simulated data that mimics the variability in coverage corresponding to MALBAC, DOP-PCR, TnBC, and Bulk. We found that all three methods' recall generally increases with the improvement in the technology, with smaller observed change in their precision. We then applied the three methods to real data and evaluated their performance by analyzing the shared and unique detections they made as well as counting the total number of copy number changes that must be invoked based on their detections. We found a good amount of overlap in detections between Ginkgo and HMMcopy. We also found that HMMcopy's detections result in fewer copy number changes than Ginkgo's.
Our benchmarking study highlights several points. First, the ploidy and coverage of the genome under analysis affect the ability of a computational method to detect CNAs. Second, there is much more power in analyzing all cells in a data set simultaneously, as the fact that they all evolved from a common ancestral cell provides not only signal for the inference, but also accounts naturally for model complexity and regularizes the number of changes in copy numbers. Third, using algorithms and tools from the field of phylogenetics can help significantly in this area. As we demonstrated above on the biological data set, parsimony analysis of phylogenetic trees can be used to identify regions with large numbers of changes in the copy number, especially convergent changes, and those can be further inspected for determining whether the high rate of change reflects a biological process that is worth pursuing or it is due to sequencing or computational inference error artifacts. Finally, our results show that while existing methods are a good step in the right direction, there is need for developing more accurate methods for CNA detection, especially ones that are designed specifically to model the specifics of single-cell DNA sequencing.

Simulation
Two steps are involved in simulating reads for single-cell sequencing. First, the cell tree is generated, where the nodes are the cells, and the edges represent the parent-daughter cell relation. The leaf nodes represent the single cells that are sampled from the patient; the internal nodes represent the cells that existed in the past and were not sampled. We set the root node as a human normal genome without any CNA, assuming that all CNAs are somatic. The tree is simulated by the Beta-Splitting model (see below), which allows producing imbalanced trees, consistent with what was observed in the real data [45].
On each edge (except for those attached to the root of the tree; see below), we simulate a number of CNAs, the number of which corresponds to a Poisson distribution (by default, λ = 2). λ relates with the mutation rate which has been studied for two decades [46,47]. There has not yet been a comprehensive knowledge of the mutation rate of CNAs, but according to the data from [43], we found there are about several dozens of CNAs in this dataset. The same can be found in a pan-cancer study [48]. Setting default λ to be 2 will lead to the similar number of CNAs at the leaves for a tree. The daughter cell of the edge inherits all CNAs in the parent node, in addition to its unique CNAs. To simulate a CNA, we randomly choose the allele, and the chromosome and position on the allele that CNA is going to occur. First, we sample the allele on which the CNA is going to occur from the paternal and maternal alleles according to a binomial distribution (default p = 0.5). We designed the simulator in a framework which keeps track of the allele at which the CNA occurs so that in our future work of simulating single nucleotide variations (SNV) simultaneously, the allele that is dropped due to the high allelic dropout rate can be traced. For the CNA size, we sample from an exponential distribution (default mean = 5Mbp), plus a minimum CNA size (default 2Mbp). We set the minimum CNA size by default to be 2Mbp because these CNAs are rare and commonly associated with disease [49], and also because of the limited resolution of single-cell data. The exponential distribution with mean 5Mbp is to render a wide range of CNA size. According to [34,50], the larger the CNA size, the smaller the CNAs possibility. We choose copy number gain versus loss by a binomial distribution (default p = 0.5). We set the default parameter to be 0.5 so that copy number gain and loss are equally distributed. If a copy number gain is sampled, we sample from a geometric distribution (default p = 0.5) to determine the number of copies to be gained (mean = 1/p). This choice of a distribution is motivated by the observation that extremely high copy number gains are very rare and are often observed by double minutes amplification [51], which we do not take into account currently. Once a whole-genome DNA sequence is simulated with the CNAs, the gained copies are placed in tandem with the original copy. If a copy number loss is sampled, the whole sequence on that region of the allele is deleted.
The CNAs on the edges attached to the root node are simulated differently. In particular, clonal whole chromosomal amplifications can occur on these edge, as indicated in the punctuated evolution model observed in [45]. We simulate the chromosomal amplifications in addition to the focal CNAs. We set the probability of a chromosome to be amplified to be according to a binomial distribution (default p = 0.2). This default value is used so that while the whole chromosome amplification is introduced, 20% chromosomes in the genome will be changed. The number of the amplified copy is sampled from a geometric distribution (default mean is p = 1) multiplied by a value (default is 1) to amplify the copy numbers simulated without changing the distribution. The distribution of the whole chromosomal amplification can be turned off as an option.
At the edge to the root, we also add an option to allow more CNAs than the other edges. This is again to mimic a scenario of punctuated evolution [45]. To do that, we sample a value from a Poisson distribution (by default, λ = 4) which is the multiplier of the average number of the CNAs that occur to the edges other than the root. Thus the edge to the root has on average 4 times (default parameter) more focal CNAs than those of other edges. The higher this number, the more focal CNAs the edge to the root carries. This parameter is introduced to allow the user to simulate data that mimics the punctuated evolution model. However, due to the diversity of models that have been summarized for cancer evolution [52], users can turn off this option or tune the parameter λ so that the simulated data corresponds to their observation and experience. In our study, the value of λ was chosen according to the length of the trunk observed in [45].
Once we have the tree and the DNA sequences for all leaf nodes, we simulate the generation of read data from the genomes. Given the coverage of the genome (by default 0.04X), the simulator divides the genome into non-overlapping bins each of which has a default size of 200,000bp. To simulate the coverage variability observed in single-cell data, we use a Markov Chain Monte Carlo (MCMC) Metropolis-Hastings algorithm to determine a sequence of numbers of read pairs to be sampled for each bin.
An input of the variability information is a point on the Lorenz curve, whose X axis represents the percentage of the reads, and Y axis represents the percentage of the coverage. We transform it to a Beta distribution by Equations (1) and (2) in [53]. Through this transformation, we can sample read counts from a Beta distribution that corresponds to the given Lorenz curve. The followings are mathematical equations in [53] that are used to calculate the parameters (α and β) for the Beta distribution. In more detail, suppose X is a random variable whose cumulative distribution function F corresponds to a Beta distribution with parameters α and β. A point x sampled from this distribution has its corresponding X and Y positions on the Lorenz curve as F(x) and ϕ(x), where and Given a point (F(x), ϕ(x)) on the Lorenz curve, we can calculate α and β for the Beta distribution.
Given the Beta distribution's parameters, we can then sample read count for each bin by MCMC Metropolis-Hastings algorithm. Starting from the first bin whose read count is assigned as the expected coverage x 0 , we sample the next bin's proposed read count x 0 by a Gaussian distribution, and accept it if compared with the previous bin's read count x 0 , where Gaus(x 0 |x 0 ) is the proposal probability of proposing x 0 given x 0 , and u is the acceptance ratio. We set u to be 0.5 by default. We set the same standard deviations for Gaus(x 0 |x 0 ) and Gaus(x 0 |x 0 ), centered at x 0 and x 0 , respectively. Thus the two Gaussian distributions canceled out. The rest term, I x 0 =I x 0 � u, controls how much the next bin's read count differs from the current one. The read counts drawn are thus corresponding to a Beta distribution, and are simultaneously constrained by the acceptance ratio of the Metropolis-Hastings algorithm. This is to mimic the realistic data whose read coverage fluctuates, but the read count changes smoothly without sharp changes between neighboring bins.

Running the programs
In all experiments, we eliminated reads that have mapping quality score < 40. We eliminated the cells that HMMcopy predicted as normal cells (predicted to be diploid and found no copy number aberration) in all experiments, the percentage of which was very small (< 0.2%).

Parameters of simulator
The simulator is designed to be flexible, with user-specified parameters, as now describe. Parameters for varying the ploidy level. To generate data with different ploidies, parameters associated with whole chromosomal amplification can be set for that purpose.
• -W (-whole-amp) Controls whether there are whole chromosomal amplifications or not.
• -d (-del-rate) The rate of copy number loss versus copy number gain.
• -C (-whole-amp-rate) The possibility that a chromosome is selected to have whole chromosomal amplification.
• -E (-whole-amp-num) For those chromosomes that are selected to be amplified, multiplying this number with the sampled value from a geometric distribution, whose p is "-J" described below, renders the final number of copies to be amplified.
• -J (-amp-num-geo-par) The parameter p in the geometric distribution from which the number of copy of the chromosome to be amplified is sampled. Combination of -J and -E can make a variety of copy number distributions and make it convenient to attain higher copy number gains when necessary.
In our experiment above where we varied the ploidy level, we use a combination of these five parameters to generate data whose ploidies range from 1.5 to 5 as follows.
• Ploidy = 1.5: -W 0 -d 1 No amplifications are allowed, and all copy number aberrations come from deletion.
• Ploidy = 3: -W 1 -d 0.5 -C 0.5 -E 1 -J 1 Amplification is allowed, and the average number of amplification for the whole genome is 0.5 for one allele. The final ploidy is 3.
• Ploidy = 4, the case that lacks odd copy numbers: -W 1 -d 0.5 -C 0.5 -E 2 -J 1 Amplification is allowed, and the average number of amplification for the whole genome is 1 for one allele. The final ploidy is 4. Note that since the parameter p in the geometric distribution (-J) is set to be one, the copy number is amplified by two for the allele that is selected for amplification. This causes the lack of intermediate copy numbers such as three, five, etc.
• Ploidy = 3.8, the case that has odd copy numbers: -W 1 -d 0.5 -C 0.9 -E 1 -J 1 Amplification is allowed, and the average number of amplification for the whole genome is 0.9 for one allele. The final ploidy is 3.8. Compared with the previous case which lacks odd copy numbers, we increase the copy number by doubling 90% of the chromosomes. The following local copy number aberrations that are performed based on the amplified genome will then generate regions that have different copies, including the odd copies. In the absence of odd copy numbers, copy numbers 2, 4 and 6 will be considered as 1, 2, and 3 by any method. Thus, without copy numbers 1, 3 and 5, there is no way for a method to tell the correct absolute copy number.

-E 3 -J 1
Amplification is allowed, and the average number of amplifications for the whole genome is 1 for one allele. The final ploidy is 5. Note that since the parameter p in the geometric distribution (-J) is set to be one, the copy number is amplified by three for the allele that is selected for amplification. This causes a scenario where most of the copy numbers are two, five and eight.
• Ploidy = 5.26, the case that has intermediate copy numbers: -W 1 -d 0.5 -C 0.9 -E 1 -J 0.55 Amplification is allowed, and the average number of amplification for the whole genome is 0.9 for one allele. Setting parameter J to be 0.55, the total amplified copy number for each allele is 1.63 (from 1/p × 0.9). The final ploidy is 5.26.
Parameters for varying the read count distribution. Since Lorenz curves have been used to evaluate the variability of read counts [32,39,54], we used the Lorenz curves reported in [39] for simulating variabilities at different levels. We sampled the read counts for each bin by the distribution (Beta distribution) corresponding to their Lorenz curves using a Markov

The Beta-splitting model
For generating the underlying evolutionary trees, we followed a generalization of the Blum-François Beta-splitting model [55] which is inspired by Aldous' Beta-splitting model [56]. The construction of a tree based on this model [57] consists of two major steps: First, we generate two sequences of random values: B = (b 1 , b 2 , � � �) and U = (u 1 , u 2 , � � �), B is a sequence of independent and identically distributed (i.i.d.) random variables sampled from the Bða þ 1; b þ 1Þ distribution, and, U is a sequence of i.i.d. random variables with the uniform distribution on [0, 1]. We call fg i ¼ ðu i ; b i Þ i2N g the generating sequence which is the basis of incremental construction of a tree. At the second step, we run the following algorithm on the random values generated at the first step. The process of constructing an evolutionary tree for n cells/leaves based on the Beta-splitting model with the parameters (α, β) combining these two steps is described in the following pseudocode. Algorithm 1 Algorithm for constructing a tree T with n leaves with the Beta parameters α and β. 1: function BUILDBETASPLITTINGTREE (n, α, β) 2: Create the root of T 3: T :root:label ð0; 1Þ 4: for i = 1� � �n do 5: Sample u i � U(0, 1) 7: for each leaf 2 T :leaves do 8: (x, y) leaf.label 9: if u i 2 [x, y] then 10: l Create the left child of leaf 11: r Create the right child of leaf 12: r.label leaf

Software availability
The simulator has been implemented in Python and is freely available at https://github.com/ compbiofan/SingleCellCNABenchmark.git, which also includes the scripts for regenerating the comparison results for both simulated and real datasets. The new version of HMMcopy was downloaded from https://github.com/shahcompbio/single_cell_pipeline/tree/master/ single_cell/workflows/hmmcopy. The scripts to preprocess files for HMMcopy were downloaded from https://shahlab.ca/projects/hmmcopy_utils/. We use hg19 for all experiments in this manuscript and the mappability file used by HMMcopy was downloaded from http:// genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability. CopyNumber was downloaded from https://bioconductor.org/packages/release/bioc/html/copynumber.html. Ginkgo's command line version which was used in this manuscript was downloaded from https://github.com/robertaboukhalil/ginkgo. For HMMcopy, nu, which controls the suggested degree of freedom between states, was tuned to take on the values 0.01 (rightmost), 0.1, 2.1 (the tool's default), 4, 10, and 20 (leftmost). For Ginkgo, alpha, which controls the significance level to accept a change point, was tuned to take on the values 1e-1000 (rightmost), 1e-100, 1e-10, 1e-5, 1e-4, 1e-3, 1e-2 (the tool's default), 0.02 and 0.05 (leftmost). The dots corresponding to values 1e-5 and 1e-10 in coarse-grained analysis overlap. For CopyNumber, gamma, which is the weight of the penalty on changing a state, was tuned to take on the values 40 (rightmost, and the tool's default), 10, 5, 4, 3, 2, and 1 (leftmost). (PNG) ploidy in HMMcopy leads to wrong estimation of breakpoints. X and primary Y axes represent the whole genome segmented into bins and the raw read count for each bin, respectively. The secondary Y axis represents the absolute copy number. Black dots represent the raw read count for each bin. Red and blue horizontal lines represent the copy number profiles of HMMcopy on the condition when its ploidy estimation is 2 and 1, respectively, the latter of which was selected as the optimal ploidy by HMMcopy. (PDF) S16 Fig. Recall and precision  Nakhleh.