SAAS-CNV: A Joint Segmentation Approach on Aggregated and Allele Specific Signals for the Identification of Somatic Copy Number Alterations with Next-Generation Sequencing Data

Cancer genomes exhibit profound somatic copy number alterations (SCNAs). Studying tumor SCNAs using massively parallel sequencing provides unprecedented resolution and meanwhile gives rise to new challenges in data analysis, complicated by tumor aneuploidy and heterogeneity as well as normal cell contamination. While the majority of read depth based methods utilize total sequencing depth alone for SCNA inference, the allele specific signals are undervalued. We proposed a joint segmentation and inference approach using both signals to meet some of the challenges. Our method consists of four major steps: 1) extracting read depth supporting reference and alternative alleles at each SNP/Indel locus and comparing the total read depth and alternative allele proportion between tumor and matched normal sample; 2) performing joint segmentation on the two signal dimensions; 3) correcting the copy number baseline from which the SCNA state is determined; 4) calling SCNA state for each segment based on both signal dimensions. The method is applicable to whole exome/genome sequencing (WES/WGS) as well as SNP array data in a tumor-control study. We applied the method to a dataset containing no SCNAs to test the specificity, created by pairing sequencing replicates of a single HapMap sample as normal/tumor pairs, as well as a large-scale WGS dataset consisting of 88 liver tumors along with adjacent normal tissues. Compared with representative methods, our method demonstrated improved accuracy, scalability to large cancer studies, capability in handling both sequencing and SNP array data, and the potential to improve the estimation of tumor ploidy and purity.

merging segments step and SCNA calling step, we set the significance p-value threshold as 0.05. For the analysis of HKU HCC WES data, we skipped the merging step to allow for better sensitivity.
For each analysis, the pipeline output a list of segments, each of which was called as gain, loss, CN-LOH, normal or undecided. For NA18507 WES and WGS data, the called gain, loss, and CN-LOH segments were counted towards false positives. We only considered results for autosome for easy comparison between platforms and analysis methods.

(C) Some remarks on segment merging step
In the segment-merging step, we need to ensure the size of the "null" data is large enough to produce a certain number of random samples (n=1000 in this study). In a majority of cases, the number of loci in the null data is much larger than 1000 plus the number of loci within the two neighboring segments of the longest total size, so that we can draw enough distinct random samples. If this is not the case in less common scenarios, we simply capped the number of loci within resampled segments at certain upper limit. For example, if the number of loci in the "null" data is 10000, to make sure enough random samples can be drawn, we can set the upper limit to, say, 3000, such that the total size of resampled neighboring segments is 6000. This upper limit is automatically adjusted in the implementation of the saasCNV package. This procedure is still valid even though the initially identified cluster of segments is one corresponding to SCNA rather than "normal", because our purpose is to estimate the variability in medians of neighboring segments with the same SCNA status and such segments are expected to clump together within the same cluster (see Fig. 2(B) and (D) in the main text).

(D) Some remarks on SCNA calling step
Balanced gains and losses are much less common than imbalanced gains and losses. We used GAP [8] on HKU HCC SNP array data (which is used as benchmark in method comparison; see results section in the main text) to investigate the prevalence of balanced gains and losses. Please note that double deletion (i.e., 0 copy in diploidy genome) is not considered as balanced losses, because this alteration presents signals in log2mBAF due to larger variance of BAF in 0-copy regions than normal regions. The case of two copies with heterozygous genotype in tetraploidy samples is regarded as balanced losses. As a result, a total of 16 out of 84 samples (excluding 4 samples which have larger noise and on which GAP did not perform satisfactorily) presented balanced gains and on average, 10.7% of the autosomes were affected by balanced gains among the 16 samples. A total of 2 out of 16 tetraploidy samples (as inferred and visually checked with GAP) presented balanced losses and on average, 4.5% of the autosomes were affected by balanced losses.

(E) ExomeCNV analysis of NA18507 WES data
For each synthesized pair of replicated WES data of the HapMap sample NA18507, with one regarded as tumor and the other as normal, we performed CNV and CN-LOH analysis using R package ExomeCNV (version 1.4) [9] and following the guidance at: https://secure.genome.ucla.edu/index.php/ExomeCNV_User_Guide. We used the recommended GATK command to extract read depth information at each exon, which was taken as input in the workflow.
The CNV analysis consists of three major steps: 1) calculating log read depth ratio of tumor to normal (calculate.logR); 2) calling CNV for each individual exon (classify.eCNV); 3) combining exonic CNV into segments using Circular Binary Segmentation (CBS) (multi.CNV.analyze). The first step was performed as recommended. In the second step, we used the recommended parameters, except that we set read length (read.len) to be 100bp and the normal cell admixture (admix) to be 1, reflecting the fact that the "tumor" data is actually from normal cells and making the inference more conservative against false positives. In the third step, we still used the recommended parameters, except that we set read length (read.len) to be 100bp, the normal cell admixture (admix) to be 0.9 (for it would cause program running failure when setting it to be 1), and the significance level for CBS test to accept change-points (alpha) to be 1e-5, more conservative against false positive.
The CN-LOH analysis initiates with BAF information for all heterozygous sites in the exome, which is extracted from VCF files, and consists of two major steps: 1) calling CN-LOH at each heterozygous site (LOH.analyze); 2) combining CN-LOH at multiple sites into segments (multi.LOH.analyze). All parameters were set as recommended, except that, in the first step, we set type I error rate in the heterozygoussite-wise LOH test (alpha) to be 1e-4, and, in the second step, we set Type I error rate for the statistical test (test.alpha) to be 1e-4, for the inference to be more conservative against false positive.
We summarized results for autosome. The segments with copy number 1 or 3 (normal being 2) resulting from the CNV analysis and the segments called as CN-LOH resulting from the CN-LOH analysis were regarded as false positives.

(F) PatternCNV analysis of NA18507 WES data
To run PatternCNV [10] on WES data of the HapMap sample NA18507, one replicate was taken as tumor in turn and the rest five replicates as normal at a time. We followed the manual (http://bioinformaticstools.mayo.edu/research/patterncnv/) and used the default parameters.
Since PatternCNV calls CNVs at exon level but does not provide segment-level CNV calls -it uses circular binary segmentation (CBS) [11] (implemented by the R package "DNAcopy") merely for visualization purpose, we were only allowed to count false positives at exon level. However, it is still reasonable to compare exon-wise false CNV calls from PatternCNV with segment-wise false CNV calls from SAAS-CNV. Take the case shown in Fig. S5 for example: A total of 8949 exons out of 368146 (2.4%) was falsely called as CNV by PatternCNV (Fig. S5(B)), whereas one out of 111 (0.9%) segment was falsely called as CNV by SAAS-CNV ( Fig. S5(C)). The proportions of false calls to the number of exons/segments are not strikingly different, but the difficulty in interpreting thousands of CNV calls from PatternCNV is evident. Fig. S5(B) shows that the gains and losses (red and blue dots) called by PatternCNV are distributed all over the genome. Even though being visualized via CBS (black segments), a large number of small segments can be visually identified as false CNV calls. In practice, the number of CNV calls really matters for the interpretability and applicability of the results in downstream analysis. Therefore, such comparison in our study is meaningful.
Since it is difficult to interpret the exon-wise results from PatternCNV in real data analysis and the comparison of all other methods based on Dataset II is at segment level, we did not apply it to Dataset II.

(G) CNAnorm analysis of NA18507 WGS data
For each synthesized pair of replicated WGS data of the HapMap sample NA18507, with one regarded as tumor and the other as normal, we performed CNV analysis using R package CNAnorm (version 1.4.0) [12] and following the manual (version November 1,

2013).
We used the PERL script bam2windows.pl (http://bam2windows.googlecode.com/svn/trunk/bam2windows.pl) to produce the text file used as input for CNAnorm, which computed read depth for each 1kb nonoverlapping window for tumor and normal, respectively, as well as the average GCcontent for each window.
The workflow consists of 6 primary steps: 1) calculating the ratio of tumor to normal read depth for each window (dataFrame2object); 2) correcting for GC-content segmentation of the ratio signal using CBS (addDNAcopy); 6) normalizing ratio signal (discreteNorm). We carried out steps 1-3 and 6 with default parameters. In step 4, we set maximum ploidy (ploidyToTest) to 7 (which is the minimum possible value to avoid program running failure), in order to suppress over-fitting of mixture model for the "null" data, which is supposed to exhibit only one mode. The model fitting method (method) was set as "mixture", unless the program reported failure, in case of which we changed the method to "density". In step 5, we modified the original function (addDNAcopy) to allow more stringent criteria for CBS (alpha=1e-5, min.width=5, undo.splits="sdundo", undo.SD=1) to prevent DNAnorm from generating excessive false positives.
CNAnorm is able to estimate the copy number for each segment after normalization and correction for normal cell contamination. We defined segments with 0.5 copies below identified baseline as losses and 0.5 copies above baseline as gains, which were counted towards false positives. We summarized results for autosome.

(H) Control-FREEC analysis of NA18507 WGS data
For each synthesized pair of replicated WGS data of the HapMap sample NA18507, with one regarded as tumor and the other as normal, we applied Control-FREEC [13] according to the tutorial (http://bioinfo-out.curie.fr/projects/freec/tutorial.html).
We used "samtools mpileup" as instructed to produce the text file in pileup format [6] as input to Control-FREEC. We set the size of sliding window as 1kb while using the default settings as demonstrated in the example configuration file distributed along with the package. We counted those somatic CNVs and CN-LOHs called by Control-FREEC towards false positives. Results were summarized for autosome.

(I) Some details about the analysis of Dataset II using different methods
ExomeCNV. We applied the same workflow as described in Section (E) to analyze the synthesized WES data in Dataset II. We used the default parameters as demonstrated in https://secure.genome.ucla.edu/index.php/ExomeCNV_User_Guide, except that we set read length (read.len) to be 100bp and the significance level for CBS test to accept change-points (alpha) to be 1e-5, more stringent than the default one to produce a reasonable number of segments. CN-LOH detection was not performed separately because in accuracy comparison, CN-LOH and normal status are collapsed down to non-CNV status (see methods section in the main text).
CNAnorm. We applied the same workflow as described in Section (G) to analyze Dataset II, except that in step 4, we set ploidyToTest=12, the default value of the package, to allow more flexibility for mixture model fitting.

Control-FREEC.
We applied the same workflow as described in Section (H) to analyze Dataset II.

(J) Some comments on correlation calculation
We computed Pearson correlation coefficient of read depths in each synthesized pair of NA18507 WES/WGS data, across heterozygous sites for SAAS-CNV, across exons for ExomeCNV and across 1kb non-overlapping windows for CNAnorm and Control-FREEC. In our data analysis, the implemented GATK pipeline set a down-sampling cap of 250 on the total read depth of reference allele and alternative allele at heterozygous loci for computationally efficient variant calling, while the read depth derived from averaging over exons or 1kb windows did not undergo down-sampling with an upper bound. We observed that the correlation could be boosted by a few exons with large read depth in WES data ( Fig. S14 and Table S3), and, more sharply, by a few windows with excessive read depth in WGS data ( Fig. S15 and Table S4). Therefore, we excluded the sites with read depth greater than 250 from the computation of correlation, and named it as truncated correlation (see Tables S3-4). The correlation for ExomeCNV, CNAnorm and Control-FREEC refers to the truncated correlation in the main text.

(K) Calculation of theoretical mBAF
Assume that the tumor genotype involves n A A alleles and n B B alleles for a segment and the tumor purity is ρ . With the definition of mBAF, we only need to calculate its value for n A ≤ n B .
In case of allelic imbalance (i.e., n A < n B ), .
In case of allelic balance (i.e., n A = n B ), where σ is a robust estimate of noise level of BAF signal (see Section (A)) and Φ −1 (⋅) is the inverse normal cumulative distribution function (CDF). The formula is derived from half-normal distribution.

(L) Some remarks on GC content adjustment in data normalization
We  Table S5 below, the proportion of the variance of tumor read depth explained by GC content was negligible as compared to matched normal read depth, CNV associated read depth and even random noise (residual) in all datasetsr.
Therefore, in paired tumor-normal study with the matched two samples properly processed by the same procedure in parallel, the matched normal is expected to play a dominant role in normalization of tumor data as compared to GC content. Just in case, we added an option for GC content adjustment on log2ratio in an updated version of the saasCNV package.   Table S3).  Table S4).