Comparative Analysis of Methods for Identifying Recurrent Copy Number Alterations in Cancer

Recurrent copy number alterations (CNAs) play an important role in cancer genesis. While a number of computational methods have been proposed for identifying such CNAs, their relative merits remain largely unknown in practice since very few efforts have been focused on comparative analysis of the methods. To facilitate studies of recurrent CNA identification in cancer genome, it is imperative to conduct a comprehensive comparison of performance and limitations among existing methods. In this paper, six representative methods proposed in the latest six years are compared. These include one-stage and two-stage approaches, working with raw intensity ratio data and discretized data respectively. They are based on various techniques such as kernel regression, correlation matrix diagonal segmentation, semi-parametric permutation and cyclic permutation schemes. We explore multiple criteria including type I error rate, detection power, Receiver Operating Characteristics (ROC) curve and the area under curve (AUC), and computational complexity, to evaluate performance of the methods under multiple simulation scenarios. We also characterize their abilities on applications to two real datasets obtained from cancers with lung adenocarcinoma and glioblastoma. This comparison study reveals general characteristics of the existing methods for identifying recurrent CNAs, and further provides new insights into their strengths and weaknesses. It is believed helpful to accelerate the development of novel and improved methods.


Introduction
Identifying recurrent copy number alterations (CNAs) in cancer genomes is an important step in locating cancer driver genes and understanding the mechanisms of tumor initiation. Many human cancers including ovarian serous carcinoma [1], lung adenocarcinoma [2], glioblastoma multiforme [3], and other types of cancers [4,5], have been largely explored by analyzing CNAs. However, the identified CNAs with high frequency of occurrence across multiple samples only account for a small fraction of clinically or biologically relevant aberrations for many cancers. The most common reason for missing some well-known driver mutations is that almost all cancers are heterogeneous [6], indicating that many recurrent CNAs only appear in a subset of samples (i.e., samples within subtypes) and accordingly their frequencies are less-extreme across the whole samples. For this challenge, a number of statistical and computational methods with promising results have been reported. They are divided into onestage [7,8,9,10] and two-stage approaches [3,4,11,12,13]. Many of them were reviewed and discussed by Rueda and Diaz-Uriarte in their latest paper [14].
One outstanding phenomenon of copy number profiles is that a part of markers are changed in identical regions in multiple genomes and the remainder markers are changed in random places of the genomes. Thus, the frequency of CNA occurrence across samples is usually used to help distinguish recurrent events from random markers. However, due to the complicated structures of copy number data, the identification of less-extreme recurrent CNAs is an extremely difficult task. Below we profile a real copy number dataset to show the complexity of CNAs, and further use it as an example to illustrate why the less-extreme CNAs are difficult to detect. Figure 1a and Figure 1b depict the rate of CNA occurrence across the entire genome and its frequency across the samples in a set of lung cancers, which contains 371 samples and 216,327 markers [3,5]. It can be noted from the figures that most of the markers are changed (amplified or deleted) in at least one sample and many of them are overlapped by a part of samples. Additionally, the sizes of CNA regions vary from chromosome to chromosome. For a given set of N cancer samples, assuming all the observed CNAs are randomly distributed across the genome in each sample, the expected probability (E(P)) of one CNA marker shared by at least n samples (corresponding to a percentage f of the whole samples) can be estimated using Equation (1), and consequently the expected number (E(l)) of such shared markers in the genome can be expressed by Equation (2).

E(P(f ))~X
N n~Nf X N n À Á k~1 ( P n i~1 r ki P N{n j~1 (1{r kj )) ð1Þ where L is the length of the genome being analyzed; r ki and r kj are the CNA rates of the i-th and j-th samples in the k-th subset, which refers to the k-th combination of n samples chosen from the whole N samples. Here, the total number of combinations of choosing n from N is represented by N n À Á . Let us consider a set of 100 samples with each having 1000 markers, and in each sample the rates of CNA are 0.035 for amplification and 0.040 for deletion (these frequencies are relatively less than the means of the above lung cancer dataset). If we assume the CNAs are randomly placed in the genome, the probability of one marker shared by at least 100f (0, f #1) samples can be regarded as a cumulative probability, termed P c (f ) (shown in Equation (3)). For instance, P c (0.1) equals to 0.0027 in the case of amplification, indicating that the probability of one marker amplified in at least 10 (0.1 multiplies 100) samples is 0.0027. Figure 2 shows such cumulative probability versus the frequency of one CNA marker across the 100 samples. Consequently, the number of such markers in the whole genome can be estimated as 1000 P c (f ).
If the frequency is used as a statistic to test the significance of CNAs individually, the estimated p-value for the marker with frequency f can be calculated using Equation (4), which is under the max-T procedure to control the family-wise error rate (FWER) [15]. For clearly understanding the relationship between the CNA frequency and its p-value, we demonstrate the p-value as a function of the frequency ranging from 0.01 to 1 for amplification and deletion separately in Figure 3. It can be noted that the p-value decreases with the increased frequency of the CNA, and particularly, p-value is 0.05 when f = 0.13 in the case of amplification and p-value equals to 0.05 when f = 0.14 in the case of deletion. These suggest that if a p-value cutoff 0.05 is employed, the CNA markers with frequency less than 0.13 for amplification (or less than 0.14 for deletion) could not be detected, while in real data such frequency may be of significant biological relevance since many CNAs may affect only a minority of cancer samples [3,7].
Generally, the frequency-statistic and random permutation of markers in the above example is just a basic strategy to test significance. To complement this strategy, many methods design various statistics and null distributions for this challenge. For example, STAC (Significance Testing for Aberrant Copy number) [4] proposes a new statistic ''footprint'' to score each marker and establishes the distribution under the null hypothesis that the observed CNA regions are equally placed anywhere across the genome; GISTIC (Genomic Identification of Significant Targets In Cancer) [3] scores each marker by combing frequency and amplitude, and constructs a semi-exactly approximated null distribution, and its extension GISTIC2.0 [11] considers the distinction of the background frequency between focal CNAs and broad CNAs and scores each marker proportional to its amplitude; CMDS (Correlation Matrix Diagonal Segmentation) [9] scores each marker based on its correlations with its surrounding sites and constructs a student's t distribution; and DiNAMIC (Discovering Copy Number Aberrations Manifested In Cancer) [13] employs a summary statistic and a cyclic permutation scheme to generate the null distribution. In addition, to adjust statistic values and improve null distributions, many methods employ a peel-off algorithm to iteratively test CNAs [3,13,16,17]. This will help much in identifying low-to-moderate-frequency (or/and low-tomoderate-amplitude) markers.
Along with recent advance of genomic technologies and rapid production of huge datasets, new methods with more sophisticated capabilities and features for detecting recurrent CNAs continue to emerge. However, the relative strengths and weaknesses of the existing methods are difficult to discern, due to the lack of comprehensive performance comparisons. This is a true problem especially from the perspective of biological researchers who need to choose a method for a dataset of interest. In this paper, we compare six classic and publicly available methods based on criteria including type I error rate, detection power, Receiver Operating Characteristics (ROC) curve and the area under curve (AUC), and computational complexity, so that users can quickly get an overview of them and their performance. Various simulation datasets and two real datasets obtained for lung adenocarcinoma and glioblastoma samples are used to evaluate the methods.

Methods for Identifying Recurrent CNAs
A variety of statistical and computational methods have been proposed recently for identifying recurrent CNAs. These methods can be categorized in different ways, such as frameworks, strategies for establishing null distributions, source codes, and so on. Generally, different cancer datasets have distinct profiles and patterns of copy number alterations, and they may require different computational methods for analysis, as there is no single method that could be suitable for all datasets. It is necessary to explore those methods that possess distinct features and different advantages. To mirror this, we carefully select six representative methods for assessment and comparison, based on their reported effectiveness in real applications. We list the six methods in Table 1 as well as their properties for an overview. These methods have been developed under different rationales in the latest six years and some of them have been widely used in cancer data analysis [2,18,19]. For a general understanding of them, we give a brief summary of their principles as follows.
(1) STAC [4]. The input of STAC is a binary matrix X, in which each element x ij represents the status of j-th marker at sample i. Specifically, x ij = 1 stands for amplification (or deletion), x ij = 0 means normal. It analyzes amplification and deletion matrices separately, and tests significance of them in the same way. The null hypothesis behind STAC is that the observed CNA segments are randomly placed anywhere in the chromosome being considered [4,17], hence permuted samples can preserve the original structures of the copy number data. STAC adopts two statistics, frequency of aberration and ''footprint'', to assess pvalues for each marker, and controls the family-wise error rate (FWER) based on the extreme right-hand tail probability [4,13,20].
The ''frequency'' for marker x is calculated as the proportion of samples sharing the aberration, while the ''footprint'' for marker x is calculated as a number of locations contained in a stack, which is a set of intervals containing x across samples [4]. The principle behind the ''footprint'' is that the tighter alignments of aberrations are less likely to be expected by chance and thus are more likely to suggest biologically relevant events, while the more relaxed alignments of aberrations might suggest passenger mutations with higher probability.
(2) GISTIC [3]. This method requires segmented input data with continuous log 2 -values resulted from single sample analysis methods such as CBS [21] and GLAD [22]. It permutes individual markers on the whole genome by assuming that the markers are independent [3,17], and derives a semi-exact estimated null distribution based on the convolution function [3] of where h amp i is the distribution (histogram) of amplification in the ith sample. Based on the null distribution, GISTIC uses a G-score combining both frequency and amplitude (Equation 6) to assess the significance for each marker and corrects for multiple hypothesis testing through the Benjamini-Hochberg FDR proce-dure [23]. The same procedure is applied to the analysis of deletion and LOH (loss of heterozygosity). The intuition behind the G-score is that an aberration with higher amplitude and frequency is more likely to be a driver event. In order to relieve the side effect of peak regions with the highest amplitude and frequency, GISTIC adopts a ''peel-off'' algorithm to iteratively test the CNAs within the significant regions.
(3) KC-SMART [8]. Different from the above two methods, one-stage framework is embraced for this method without requiring a prior step of segmenting (smoothing) copy number profiles. The principle behind KC-SMART is that it imposes a kernel function at each location m to construct a statistic, kernel smoothed estimate (KSE) [8]: where a i is a summed positive or negative log 2 -ratios across all samples for each location, g i (m) is a kernel function (e.g. flat-top Gaussian kernel function), and M i is a set of markers around location m and it is usually determined based on the width of the kernel function. Theoretically, this statistic considers the correlations among copy number data and incorporates information gained from neighboring markers. To identify peak locations (i.e., recurrent CNAs), the method compares the observed KSE of each location against a null distribution which is established through permutations of individ-  (4) CMDS [9]. The input data to CMDS is largely similar to KC-SMART. This method doesn't directly utilize the frequency and amplitude of copy number aberrations to construct test statistic. It assigns a RCNA score to each marker. The RCNA score is an averaged correlation value over the surrounding sites of the marker. The null hypothesis of CMDS is that there is no correlation between markers within chromosomes, thus it can be created by randomly permuting individual markers in the stretch of the chromosome being considered. To save computational time, CMDS uses the information from the observed correlation values in the copy number genome to establish a standard normal distribution, as a closely approximated t distribution. The multipletesting effect is also corrected using Bonferroni strategy, exactly like the KC-SMART method.
The intuitive notion behind CMDS is that the copy number noise is not correlated while the recurrent CNAs are in high correlation. Another outstanding feature of CMDS is that it doesn't analyze amplification and deletion separately, but uses the average copy number value over the predefined window across all samples and its significance level [9] to determine whether the corresponding marker is amplification or deletion. This is different from most other existing methods.
(5) DiNAMIC [13]. This method accepts both continuous raw signal and discrete segmented data. It adopts a global summary statistic that incorporates both frequency and amplitude of each marker for analyzing either amplification or deletion. Two novel features underlying DiNAMIC are concluded as follows. First, it employs a cyclic permutation strategy to generate the null distribution [13,17], which preserves the structures of the original copy number data at a higher degree than most other methods such as STAC [4] and GISTIC2.0 [11]. Second, to increase the power for detecting less-extreme CNA markers, the method utilizes a ''peel-off'' algorithm different from that used by GISTIC [3], which assesses the significances of new regions by removing all aberrations overlapped by the previously detected recurrent regions, while DiNAMIC re-tests markers by generating a new null distribution on a new data matrix in which the previously detected markers K are null and the markers contribute to the significance of K are scaled using a factor. This method is supposed to test one marker during each ''peeloff'' iteration procedure, thus computational cost will be a significant issue, especially when a large number of iterations are required. For this, DiNAMIC provides Quick Look and Detailed Look platforms for the user's options. In the first one, the original null distribution is re-used to test the significance of the most extreme markers, and thus accordingly saves a piece of computational time. In addition, the significance for multiple testing is corrected using max-T procedure exactly like STAC [4].
(6) GAIA [16]. In contrast to other existing methods [3,13,24], GAIA (Genomic Analysis of Important Alterations) incorporates within-sample homogeneity into the ''peel-off'' procedure under its statistical hypothesis framework: first, individual markers are randomly permuted to generate a null distribution, based on which the observed count (the number of aberrations across samples, this is equivalent to the effect of frequency of aberrations) of each marker is assessed and assigned with a significance level; second, GAIA defines a homogeneity value for each paired adjacent markers in every sample and produces a new data matrix called H (N6M-1), in which each element H ij M{0, 0.5, 1}, represents maximum, medium or minimum homogeneity; finally, a homogeneous peel-off is performed on the matrix H to expand the boundaries of the significant regions detected previously. This ''peel-off'' scheme was expected to identify more recurrent CNA peaks and omit spurious peaks.

Evaluation of the Methods
Fairly evaluating the relative merits of these methods is necessary, but this is complicated due to several realistic issues. First of all, the input data formats (segmented or raw) to different algorithms are not always the same, and those requiring segmented inputs usually adopt different segmentation algorithms. For example, the default segmentation algorithms used by STAC, GISTIC, DiNAMIC, and GAIA are GenePix Pro 4.0 [25], GLAD [22], CBS [26], and VEGA [27] respectively. Considering that different segmentation algorithms might have different abilities in processing individual CNA profiles, and thus will pose great impact on downstream analysis, we choose to use the CBS segmentation algorithm [26] for all the two-stage methods in this comparison study, since CBS is a very popular algorithm and it performs consistently well in detecting copy number changes [28]. Secondly, the significance outputs of the six methods include two types: p-values (STAC, KC-SMAR, CMDS, and DiNAMIC) and q-values (GISTIC and GAIA), and the thresholds for declaring significant in these methods are different. For a fair comparison, we choose the commonly used thresholds 0.05 for p-value and 0.25 for q-value here. Thirdly, the parameters in different methods differ greatly. For example, DiNAMIC requires an input of number of iterations, where the default setting is 10. However, such a setting is usually not large enough in real applications, since there might be a great number of aberrant markers which should be assessed. Thus, we change this default setting into a larger number in the implementation of the algorithm. For most of the algorithm parameters, we use the default settings as much as possible or the values suggested in the papers or program documents. Finally, different algorithms were written in various languages and implemented in different platforms, as shown in Table 1. This will increase the difficulties to compare the computational time of the methods in practice.
To quantitatively evaluate the performance of the methods, we test four commonly used criteria [13,28,29,30] based on a large number of simulation datasets. The criteria are described in detail below.
1. Type I error rate. The purpose of assessing type I error rate is to investigate the meaning of the significance levels resulted from the statistical methods for detecting recurrent CNAs [13,30]. If type I error rate is too conservative or too aggressive, the intended meaning of the p-values (or q-values) would be reduced or lost, and it doesn't agree with the real false positive rate in results. Thus the accuracy of the type I error rate is a critical index for evaluating methods. To this aim, we simulate a large number (S) of replicated datasets with null ground truth CNAs, and calculate the type I error rate using Equation (8): where a is the threshold for calling significant (e.g. a~0:05), and X i (a) is an indicator function, i.e., if any CNAs in dataset X i are declared significant, then X i (a)~1; otherwise, X i (a)~0. Thus, Equation (8) is actually a calculation of family-wise type I error rate [17].
2. Detection power. Since CNA is a structural unit and it usually includes a number of markers, the detection power can be calculated through two ways: unit-based and marker-based calculations.
CNA unit-based detection power: for a ground truth (recurrent) CNA unit, it is necessary to observe how likely it can be successfully declared significant by a method. We define this detection power as the sensitivity to detect the recurrent CNA unit. Generally, exactly detecting the boundaries of (or all the markers within) the recurrent CNA unit is difficult to achieve, and this is not always necessary for locating the genes covered by CNA. For example, the genes can be mapped if a part of markers within them are overlapped by the detected CNA units. For a convenient assessment, we use the middle marker of the recurrent CNA unit to determine whether the unit is declared, i.e., if the middle marker is detected, then we suppose that the unit is successfully detected, otherwise, it is not. Accordingly, the CNA unit-based detection power of a method can be calculated by [30] Power~1 S : R where R is the total number of ground truth CNA units in each simulated dataset, and X i (R,a) indicates the number of ground truth CNA units that are declared significant in the i-th dataset. CNA marker-based detection power: in addition to the location of cancer driver genes, recurrent CNAs can also be used to analyze chromosomal instability index and other biological meanings [1]. So it is necessary to see how many ground truth markers are detected. Accordingly, we define this power as Equation (10) [30], in which R 0 is the total number of ground truth CNA markers and X i (R 0 ,a) indicates the number of ground truth markers that are successfully detected in the i-th dataset.
3. Receiver operating characteristics (ROC) curve and AUC measure. We further assess the overall performance of the six methods, measured by both sensitivity and specificity through ROC curves, which shows how much percentage of ground truth markers are selected conditioned on a given false positive rate. Additionally, we measure the area under curve (AUC) for these methods with the purpose of evaluating their average performance especially when some ROC curves have crossed. 4. Computational complexity. We evaluate the computational complexity based on execution time and memory usage. Since different methods are usually implemented in different platforms such as C++, R language, and JAVA, the comparison of the computational time might be influenced. To overcome this issue and provide a general comparison of the efficiency of the six methods, we give big-O complexities for them, in addition to the actual running times.

Simulation Datasets
Real datasets rarely have absolutely confirmed ground truth CNAs, and thus can't be used to evaluate the performance of the methods. However, simulation technologies provide a reasonable way to solve this problem [31]. Since the four evaluation criteria illustrated above are utilized to quantify the methods from different perspectives, it is necessary to employ different simulation schemes to generate a variety of datasets.
For the first criterion of testing type I error rate, we adopt the simulation algorithm introduced by Hsu et al [32] and Walter et al [13] to create null datasets. The algorithm is based on an instability-selection model [33], which has been originally used by many researchers to model LOH (loss of heterozygosity). The principle of simulating copy number aberrations under the instability-selection model can be simply summarized as follows [13]. The marker status is firstly denoted either by 0 as no aberration or by 1 as aberration. To generate contiguous markers which are inherent correlated along one chromosome with length of M, an initial marker location x k (kM{1, 2, …, M}) is prespecified and the status of its neighboring marker x k+1 is then modeled based on the transition probability [13], p a, b (d) = p(T(x k+1 ) = a | T(x k ) = b), where a, b = 0, 1, and d is the distance between adjacent markers x k and x k+1 . Specifically, the transition probabilities have been defined as [13,33]: where m is the background or sporadic probability of aberration at a marker, and l is the transition rate between regions of aberration and normality (i.e., no aberration). The other transition probabilities are p 0, 0 (d) = 1-p 1, 0 (d) and p 1, 1 (d) = 1-p 0, 1 (d). According to such probabilities, the status of the markers x k+1 , …, x M is determined based on a Binomial distribution. For the starting marker x k , the status is assigned using a binomial random variable with probability m [13]. The left part of the chromosome can be also determined similarly.
To get an idealized copy number data, the above simulation process is conducted two times, and the two simulated profiles are then combined to generate an individual sample [13]. To make the simulation data more realistic, a normal cell contamination with a random proportion , Uniform (0.7, 0.9) will be added to each sample, as well as a Gaussian noise with mean 0 and standard deviation 0.25. For a more detailed description of this simulation algorithm, interested readers can refer to [33], [13], and [32].
For the second criterion of testing statistical power of the methods, we combine the features of the simulation strategies introduced by Willenbrock et al [34] and Zhang et al [9], to generate multiple ratio profiles with ground truth CNA regions, and we further consider the signal scenarios summarized by Rueda and Diaz-Uriarte such as scenarios I-III, and scenario V [14]. We create an initial data matrix in which each element is assigned with a normal copy number level. Based on this matrix, we insert the ground truth CNA regions by considering the following factors that are generally regarded to affect the statistical power of detecting recurrent CNAs: length (L) and amplitude (CN) of recurrent CNA, frequency (F) of recurrent CNA across samples [9], signal noise level (s) of the ratio profiles, normal cell contamination (d) in tumor samples [35]. To make the simulated data more realistic, we add a number of randomly placed background CNA regions to each sample. The lengths of these regions are generally similar to that of the recurrent CNAs. For the third and last evaluation criteria, we still adopt this simulation scheme but use different factor settings. Particularly for the last criterion, we focus on simulating the scale of datasets, i.e., the size of samples and the length of genome, since these are generally thought to be the main factors influencing computational complexity.
To fully investigate the performance of the six methods under different criteria, in each simulation scheme, we set the related parameters to different values to configure various data replications. The details are demonstrated in each scenario in the next.

Performance Evaluation on Simulations
According to the four evaluation criteria and their corresponding simulation schemes, we present and analyze the comparative results of the methods and also explore and discuss the principles and features of the underlying methods.
Scenario 1 (Evaluation of type I error rate). Based on the default parameter settings of the simulation algorithm introduced by Walter et al [13], we create four types of null datasets by altering the distance (d) between adjacent markers. The distance is used to emulate the density of CNA markers distributed on the genome. For example, the equally spaced copy number data means that the markers are uniformly distributed in the stretch of the genome, while 50% clumped copy number data means that half of the markers are contained in the clumps accounting for 5% of the genome size, and the other half of the markers are equally spaced in the remainder intervals accounting for 95% of the genome size. 25% and 75% clumped copy number datasets have the similar interpretations. To accurately evaluate the type I error rate for a method, under each setting of the distance d, we generate 10000 data replications, each of which consists of 50 samples and 2000 markers, and we calculate the type I error rates for amplification and deletion separately. The experimental results are shown in Figure 4, which suggests that the value of the type I error observed in the DiNAMIC method is closer to the significance threshold than other five methods, and the STAC method appears relatively more liberal while the GAIA method shows relatively more conservative. It should be noted that here we use the same threshold a~0:05 for all the methods (including p-value and qvalue output methods) for a convenient comparison.
The potential explanations about the deviations from type I error rate of 0.05 are concluded based on a careful inspection of the principles behind the methods. For the STAC method, two reasons may result in a liberal type I error rate. The first reason is the partial contribution of the non-changed locations to the null distribution, since STAC creates null distribution by permuting segments obtained from individual sample analysis methods [22,26] but not removing the non-changed segments from the genome. The second reason is that STAC tests the significance of markers on chromosome-wide or chromosome arm-wide and controls FWER only within the chromosome. From the perspective of genome-wide, the p-value threshold of 0.05 might be too liberal since there are usually 23 multiple tests (i.e., 23 chromosomes) in the analysis of either amplification or deletion. For the GAIA method, the primary reason for conservativeness is that it does not take into account the correlations among copy number data in both the statistic design and null distribution generation, and also it does not incorporate signal amplitude into the statistic. Theoretically, the dependency among test statistics may also lead to conservativeness [36]. In addition, extreme loci usually present higher amplitude than that of sporadic loci [1] so that the methods ignoring amplitude may decrease the detection power of extreme loci, and thus generate conservative results.
For the CMDS method, we observe that it produces a distinct difference of the type I error rates between amplification and deletion, i.e., it behaves much liberal on amplification while conservative on deletion in all the cases of Figure 4. The most likely reason is that CMDS does not deal with the overlaps of amplification and deletion in the same regions (or windows). For example, within one specific window, if a part of individuals are deleted at an average copy number CN d and another part of individuals are amplified at an average copy number CN a , the copy number mean of the window across all samples might be larger than 2, showing an amplification state, since the value of CN a (CN a M (2, +')) can be generally much larger than that of CN d (CN d M [0, 2)). Therefore, the deletion events tend to be masked by the amplification events from the perspective of copy number mean that has been used by CMDS to define amplification and deletion states. Accordingly, in the datasets with evenly distributed copy number gains and losses, CMDS may result in more amplification markers with low significant levels while less deletion markers.
Scenario 2 (Evaluation of power based on CNA units and markers). Since amplifications and deletions of copy number are regarded to play different biological roles, and thus are usually analyzed separately for understanding cancer pathogenesis [1,11,18], we set out to calculate the power of the methods for detecting amplification and deletion separately. To mirror this, we first simulate three amplification events (i.e., ground truth recurrent amplifications) according to the following basic settings: from which the normal cell proportion d is randomly drawn for each sample. Moreover, 2 to 5 background CNA regions with lengths varying from 10 to 50 and copy numbers varying from 3 to 5 are inserted in each sample. It is worth noting that these settings have a clear interpretation: the CNA frequency across samples and its amplitude are inversely proportional to the CNA length [11].
To generate various datasets for understanding how the factors (L, s, d, et al.) influence the power of the methods, we use three parameters b L , b s , and b d to modify the lengths of the recurrent CNAs, the noise level, and the normal cell contamination of the copy number profiles. Under each configuration, we simulate 50 replications with log 2 -ratio values, each of which contains 50 samples and 2000 markers. The unit-based and marker-based detection powers are calculated using Equations (9) and (10), respectively, and the corresponding results are shown in Figure 5. Similar to the analysis of amplification, we also evaluate the detection power for the methods by simulating deletion events (i.e., ground truth recurrent deletions). The results are presented in Figure S1.
As shown in these figures, most methods achieve slightly higher power in the case of lower levels of noise and lower fractions of normal cell contamination, and the unit-based power is usually higher than the marker-based power, implying the generally recognized fact that the boundaries of recurrent CNA regions are more difficult to detect than the middle loci of the regions [16,28,37]. Moreover, it can be noted in the figures that the STAC method is more sensitive to recurrent CNAs in most cases than other methods. Examination of the principles underlying STAC reveals two reasons for this advantage. One contributor is the ''footprint'' statistic, which accounts for the lengths of aberrant intervals covering the markers being tested and directly measures tight alignment of the intervals (the intervals are contained in different samples and are defined as a stack) [4]. The tight alignment is a reasonable reflection of the concordance of CNAs. The second and more important reason is that STAC generates the null distribution corresponding to the number of intervals in a stack covering the tested marker. For example, if one stack contains 10 intervals (i.e., there are 10 samples aberrant in the same region), in order to assess the significance of the anchor markers contained in the stack, STAC calculates the footprints of all stacks that consists of exactly 10 intervals on permuted data, and then compares the observed statistic values to these footprints to get p-values. Theoretically, such null distributions account for subspace of the whole samples and thus facilitate finding consistent CNAs within subsets of samples. Therefore, the combination of the ''footprint'' statistic and the null distribution reduces the influence of highly frequent copy number aberrations across whole samples on detecting less-extreme markers.
Scenario 3 (ROC curve and AUC comparison of the six methods). Since the identification of recurrent CNAs usually incorporates more or less false positives in the results, it is necessary to see how many ground truth CNAs can be identified given a false positive rate. For this purpose, we employ the similar   Figure 6, from which we clearly see that the STAC and GAIA methods perform consistently well. It is important to note that in Figure 6 we focus on the top-left part of the ROC curves. Specifically, we start TPR from 0.5 in Y-axis and end FPR at 0.1 in X-axis, since such curves would provide a clearer observation of the difference between the methods than those ranging from 0 to 1 in both X   and Y axis. Moreover, a TPR level higher than 0.5 and a FPR level lower than 0.1 are usually required in real problems especially in the analysis of high-resolution datasets, due to the need of controlling true and false discovery rates [9]. Under the same simulation configurations, we measure the area under curve (AUC) for the six methods. The results are shown in Table 2, from which one can find that STAC and GAIA have the highest AUC values in most cases. The observation from this comparison is roughly consistent with that from the ROC curve comparison.
Additionally, to test the methods in analyzing deletions of copy number, we make a comparison of ROC curves in Figure S2, where we can also see the advantages of STAC and GAIA.

Scenario
4 (Assessment of computational complexity). Computational complexity of a method is a key factor for its application to real data due to the huge size of the genome being analyzed. Generally, three factors are regarded to determine the computational complexity, sample size, genome length, and algorithm for analyzing the data. To achieve a reasonable comparison of computational complexities for the six methods, we perform experiments using modest computational resources, available to all researchers. Specifically, our computer platform is: Linux OS (Ubuntu 9.04) and Windows XP OS, 5.99 GB RAM, and 2.00 GHz CPU. Various simulation datasets are generated by changing genome length or sample size based on the simulation scheme illustrated in previous text. Specifically, we set the sample size to 50 and change the genome length from 1000 to 10000, and then set the genome length to 2000 and change the sample size from 50 to 500 in the data simulations, with the purpose of observing the changes of running time with respect to the genome length and sample size. In each setting of parameters (i.e., sample size and genome length), we simulate 50 replicated datasets, and measure the running times and memory usages of the six methods by averaging over the replications. In addition, for the two-stage methods, including STAC, GISTIC, DiNAMIC, and GAIA, we add the costs of the preprocessing of datasets (i.e., single-sample analysis) to the running times for making a fair comparison. The comparative results of the running times are depicted in Figure 7, which suggests that CMDS and GAIA are  Table 3. Comparison of big-O computation complexity for the six methods. the fastest, while KC-SMART and DiNAMIC are the slowest. It is worthy noting that in Figure 7 the efficiency of KC-SMART is seriously affected with the increased genome length while it is slightly affected with the increased sample size. Similarly, we present the comparative results of the memory usages in Figure 8, which suggests that CMDS and GAIA require the lowest memory size while KC-SMART requires the highest memory size. Here, it is important to note that the results shown in Figures 7 and 8 should not be strongly platform-dependent since the machine memory is sufficient for the six algorithms to analyze the simulation datasets that we have considered. Different algorithms were written with different languages such as R, Matlab, and Java, thus it is insufficient to assess the computational complexities of the method only using the practical running times and memory usages. To provide a general comparison of the methods, we examine the principles of the algorithms carefully and give big-O time costs and memory requirements for them in Table 3, where N denotes the sample size, L denotes the genome length (number of markers), T represents the number of permutations, and l is the averaged length of the segments obtained from the single-sample analysis of the whole genome. Comparing Figure 7, Figure 8, and Table 3, we find that the big-O computational complexities are roughly consistent with the running time and memory usage results.

Real Applications
To investigate the behaviors of the methods in real applications, we first test them on a cancer dataset with 371 lung adenocarcinoma samples [2,9,17]. Raw CNA profiles are segmented using the CBS algorithm [26] and are transformed into the input formats to the two-stage methods. Since this dataset has been already analyzed by the GISTIC and CMDS methods in their papers [3,9], we do not run them again but only use their presented results for comparing to other four methods (STAC, KC-SMART, DiNAMIC, and GAIA). Recurrent copy number alterations usually include both arm-level and focal events, which are regarded to have different lengths and distinct biological consequences [3,11]. Specifically, focal events are generally defined as the peak regions with the most significance levels among the recurrent regions. Here, we focus on the comparison of focal events being identified by the six methods, since these events encompass limited number of genes and usually contain cancer driver genes [2,11]. To mirror this, we employ the log 2 -ratio thresholds of 0.848 and 20.737 for defining amplifications and deletions to fit those methods (STAC, GISTIC, and GAIA) that require copy number calls. Actually, these thresholds are the default settings of GISTIC in its applications [2,3].
We produce the result of each method using its default significance thresholds as aforementioned. For example, in the STAC method, we use a p-value of 0.05 as a cutoff for declaring significant, and in the GAIA method, we use a q-value of 0.25 as the cutoff. The results derived by the methods are presented in Table S1. Since there are no definite ground-truth driver genes in real datasets, it is extremely difficult to quantify the performance of the methods using the previously described criteria (e.g., power and true positive rate vs false positive rate). Instead, we first compare the total number of identified recurrent regions of the six methods, and give the numbers of overlapped regions between any two methods (shown in Figure 9). This will help observe which method is the highest concordant one, i.e., which method overlaps other methods in the largest area. From Figure 9, we observe that GAIA overlaps the largest area with other methods. Specifically,   [2,9]. Moreover, based on a collection of previous studies on lung cancers and their subtypes [2,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56], we compare the methods using the numbers of lung cancer related regions that are identified by them. The comparative result is presented in Table 4. Here it should be noted that we have only counted the regions reported to be associated with lung cancer in previous studies, the remainder regions in the result of each method are generally regarded as candidates and need further investigation in terms of their biological relevance.
Similar to the analysis of lung adenocarcinoma dataset, we further compare the methods based on a set of glioblastoma samples [3,57]. We still use the log 2 -ratio thresholds of 0.848 and 20.737 for defining amplification and deletion markers, and also use the result already reported by the GISTIC method [3] in this comparison. The numbers of recurrent regions identified by the six methods and the numbers of overlapped events among them are presented in Figure 10, from which we see that GAIA detects the largest number of regions including 42 amplifications and 91 deletions, and also overlaps the largest area with other five methods. For example, GAIA overlaps STAC with 94 (81.7% of 115 regions identified by STAC) events and overlaps DiNAMIC with 24 (47% of 51 regions identified by DiNAMIC) events. The detailed information of the identified regions by the six methods is given in Table S2. Based on a large collection of previous studies on glioblastoma patients [3,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75], we summarize the numbers of glioblastoma related regions resulted by the methods for a further comparison. This is shown in Table 5.
It should be noted that in this comparison we have used the default segmented profiles [3] of the glioblasotma samples for both the one-stage and two-stage methods. Specifically, we test the KC-SMART and CMDS methods on the segmented data rather than raw data, since these two methods also allow for segmented input signals [8,9]. This will help to investigate the performance of the methods on different data platforms. Comparing to the result of the lung adenocarcinoma dataset (Figure 9), most of the methods identify less regions in the glioblastoma dataset ( Figure 10). However, the number (92) of regions detected by CMDS in Figure 10 is significantly larger than that (46) in the lung adenocarcinoma dataset. The most likely reason is that the noise level of the segmented profiles is significantly lower than that of raw profiles, and the underlying principle of CMDS is to distinguish recurrent copy number alterations from noise copy number. Therefore, the input signals to CMDS with lower noise level would result in more regions.

Brief Summary
Six representative methods for identifying recurrent copy number alterations are examined and compared in this paper. It  is not straightforward to state which method is the best one. However, according to experimental results on a larger number of simulation datasets, we find that the STAC and GAIA methods perform consistently well except for their liberal and conservative significance levels. By observing the results under the three performance assessment criteria (i.e., type I error, power, and ROC), we further find that the behavior of type I error does not has a clear relationship with that of power and ROC, while the behavior of power displays a linear relationship with that of ROC curves. In terms of computational complexity, KC-SMART is heavily affected by the increased genome size while slightly by the sample size. This is mainly because KC-SMART assigns KSE score to each marker by incorporating information of a large number of neighboring markers, which needs much computational time. DiNAMIC is also the slowest method, which provides two options (Quick Look and Detailed Look) for statistical significance assessment. It should be noted that in our experimental comparison, we always choose the Quick Look scheme in DiNAMIC, which needs relatively less computational time when compared with Detailed Look scheme with time complexity of O(T9TNL), where T9 denotes the number of iterations. If we test all the markers in the genome, T9 equals to L. The methods requiring implementation of permutations are generally much slower than those without permutations.
In the applications on the lung adenocarcinoma and glioblastoma datasets, the six methods identify different numbers of recurrent CNA regions and display various pair-wise overlaps (Figure 9 and Figure 10). Among these methods, GAIA identifies the largest number of recurrent regions and shows the highest concordance with other methods, and it further provides the largest number of lung cancer and glioblastoma related regions (Table 4 and Table 5). Nevertheless, the proportion of the cancer related regions to the identified regions in GAIA is not as high as that in GISTIC (e.g., 70.8% of 31 regions resulted from lung cancer dataset). In addition, the STAC method also shows a high concordance performance either in method comparison ( Figure 9 and Figure 10) or cancer related gene comparison (Table 4 and  Table 5). Inevitably, the values in Table 4 and Table 5 might contain some bias, since a large number of new studies on lung cancer and glioblastoma are emerging, making it difficult to track every novel discovery in this area especially for those being published or unpublished. In these real applications, another observation is that most methods identify more deletion regions than amplification regions, but reveal relatively less deletion regions associated with lung cancer or glioblastoma. The most common explanation is that the homogeneous deletions are very frequent in the lung cancer profiles, and the threshold (20.737) for defining deletions is critical so that most homogeneous deletions are preserved but some heterogeneous deletion regions (low-to moderate-amplitude) with biological relevance might be filtered out [11].

Challenging Issues
Although a number of methods have been developed for the identification of recurrent copy number alterations in multiple samples and new methods with more sophisticated capabilities continue to emerge, several important and challenging issues in this area have not yet been well solved. In the first place, the relationship between significant, recurrent, and cancer driver CNAs is less modeled in a decent way. Many existing algorithms stop after obtaining significant CNA regions. Practically, the significant regions may contain a huge number of genes, and not all of them are the cancer driver genes. Accordingly, only identifying significant regions is not enough to define biological events, since the number of cancer driver genes is generally limited. Thus, distinguishing driver mutations from significant or recurrent CNAs is a critical and challenging issue. Interestingly, GISTIC performs a further step to define ''peak regions'' with the most significant levels from each significant region and employs a relatively high threshold (e.g. over 3.6 copies of amplifications and less 1.2 copies for deletions) to detect focal regions, which contain limited numbers of genes [3]. Considering that the high threshold may filter out low-to moderate-amplitude focal events, a new version called GISTIC2.0 was proposed to separate the whole genome into arm-level and focal regions, which were then analyzed respectively by estimating the corresponding background rates [11]. The focal events will be paid much attention to locate biologically relevant genes.
The second challenge is to detect driver aberrations for cancer subtypes, since most cancers are heterogeneous [76,77] and each subtype displays distinct copy number patterns [1,4]. The answer to this question will help much in providing important information for diagnosis and treatment of subtypes. Many existing methods have already tried to investigate the CNA patterns in cancer subtypes. For example, STAC utilizes unsupervised two-way hierarchical scheme to cluster samples based on its detected regions [4], and pREC-S identifies subsets of samples that share regions of CNA based on Hidden Markov Model (HMM) [14,78]. Some other methods use clinical information (e.g., primary vs recurrent, high grade vs low grade, and survival time) or phenotype to divide cancer samples into subsets and then analyze each subset separately [1,79]. However, many cancers may be involved with highly complicated heterogeneous structures. For example, in the high grade ovarian cancer subtype, different samples have different survival times and distinct copy number patterns. This greatly increases the difficulties of defining small cancer subtypes and identifying characteristic CNA regions associated with specific subtypes.

Potential Extensions
Based on previous analysis and discussions, we conclude several potential directions that can be pursued to improve the identification of driver copy number alterations. First, the advantages of the existing methods can be combined to refine the methodology. For instance, STAC can be refined by incorporating the peel-off scheme of the GISTIC method, so as to detect a set of ''peak regions'' from the significant STAC regions. This would be beneficial for reducing the scale of the identified regions and improving the purity of cancer driver alterations in the result. Second, one issue that has not been taken into consideration in the current methods is the contribution of the null markers (i.e., not changed markers) to null distributions. Specifically, when the threshold for defining aberrations is very high, a great number of markers are filtered out as null markers, which are not removed from the genome when performing permutations and generating the null distribution. Consequently, the mean of the null distribution is theoretically a little biased to the left, from the viewpoint of distinguishing recurrent CNAs from random background aberrations. Although it is not clear how much benefit could be obtained by removing the contribution of the null markers, this can be worthy of trying. Third, since the samples within a cancer subtype may themselves be heterogeneous (e.g., high grade ovarian cancer), only performing one-dimensional (i.e., across-genome) permutations in the whole samples may achieve a limited number of subtype-specific CNA events. The suggestion here is to adopt two-dimensional (i.e., across genome and sample spaces) permutations to establish the null distribution. One possible way to realize this is to incorporate the principle of pREC-S [78] and bootstrap sampling scheme into one of the six methods. Finally, analyzing absolute copy numbers in somatic DNA alterations of cancers would be much helpful [17,80,81], in either identifying cancer driver genes or exploring intra-tumor heterogeneity. Currently, most existing methods have only handled the relative copy number data that are generated directly from array or sequencing experiments. Figure S1 Power comparison of the six methods by testing CNA deletions. The unit-based and marker-based powers for each parameter are calculated based on 50 simulated replications, which are depicted with pink and dark red bars respectively. (DOC)