A Streamlined Method for Detecting Structural Variants in Cancer Genomes by Short Read Paired-End Sequencing

Defining the architecture of a specific cancer genome, including its structural variants, is essential for understanding tumor biology, mechanisms of oncogenesis, and for designing effective personalized therapies. Short read paired-end sequencing is currently the most sensitive method for detecting somatic mutations that arise during tumor development. However, mapping structural variants using this method leads to a large number of false positive calls, mostly due to the repetitive nature of the genome and the difficulty of assigning correct mapping positions to short reads. This study describes a method to efficiently identify large tumor-specific deletions, inversions, duplications and translocations from low coverage data using SVDetect or BreakDancer software and a set of novel filtering procedures designed to reduce false positive calls. Applying our method to a spontaneous T cell lymphoma arising in a core RAG2/p53-deficient mouse, we identified 40 validated tumor-specific structural rearrangements supported by as few as 2 independent read pairs.


Introduction
Somatic structural variants (SVs), including large deletions, insertions, inversions, duplications and translocations are important hallmarks of cancer genomes, responsible for the creation of fusion genes, copy number and regulatory changes leading to activation or overexpression of oncogenes and inactivation of tumor suppressor genes [1,2,3,4,5,6]. Defining the architecture of a specific cancer genome is therefore essential not only as a first step towards understanding the biology of the tumor and mechanisms of oncogenesis, but also clinically towards designing effective personalized therapies [7,8].
Recent advances in high throughput sequencing technology [9,10] have made it possible to study whole genomes at unprecedented high resolution and relatively low cost. However, the current short read paired-end sequencing technologies carry many challenges, especially apparent when attempting to study SVs in cancer. First, the inherent complexity of tumor tissue [11,12,13] is a challenge in itself, since tumors are rarely monoclonal and are often mixed with normal tissue, so the sequencing coverage must be deeper than for SV detection in the germline. Second, short reads generated by paired-end sequencing (typically, 50-100 bp from each end of the 300-400 bp DNA fragment) prove to be difficult to map correctly back onto the reference genome due to the high percentage of repetitive genomic sequences [14,15,16,17]. All this leads to a large number of false positive calls, generating unacceptable levels of noise. Retrotransposon activity, common in human and mouse genomes [18,19], additionally complicates the data analysis leading to certain types of false positive calls. Finally, DNA library preparation artefacts arising from PCR amplification combined with sequencing errors add another level of complexity.
This work describes a whole genome sequencing based approach to identify 4 types of SVs: large deletions, inversions, duplications and translocations. We used SVDetect [20] and BreakDancer [21] to call SVs in a mouse lymphoma genome from a set of paired-end reads obtained on the Illumina's HiSeq platform. In order to reduce the high number of false positive calls, we developed a filtering procedure that allows detection of tumorspecific events at relatively low coverage (17x). First, we found it essential to compare the tumor dataset to a germline sample obtained from the same animal, to remove a large number of germline SVs (mostly arising from retrotransposon activity) detected in the experimental animal when compared to the reference genome. Second, we developed methods to remove read pairs marked as discordant due to alignment errors, as well as imperfect PCR duplicates arising from DNA library preparation and sequencing errors. Third, we applied several filters on the results produced by SV calling programs, such as overlaps with annotated simple repeats and low mappability regions, in order to identify high confidence SV candidates. We show PCR and Sanger sequencing validation of 40 tumor-specific SVs in a single tumor genome supported by as few as 2 independent read pairs.
In summary, the method presented here simplifies the analysis, increasing sample throughput. It also provides high sensitivity, allowing detection of rare variant clones in complex mixtures that may have important prognostic or therapeutic consequences.

Establishing Initial Analysis Parameters
We used paired-end (PE) sequencing simulations as a tool to establish the initial analysis parameters, to quantify the effect of sequencing depth on detection of known SVs, and to study alignment related false positives. We simulated a rearranged genome based on C57BL/6J mouse reference (mm9), introducing 10 interchromosomal translocations and 10 large deletions into areas of varying mappability (Table 1). Read length, mean insert size and standard deviation of the insert size were chosen to be representative of our experimental data (50, 315, 44, respectively). Using three independent simulated datasets with 10, 20, 40, 80 and 160 million read pairs, we assessed the number of detected real and false positives, as well as the detection probability as a function of local mappability.
PE sequencing proved to be an efficient method for SV detection at coverage levels corresponding to 80 or more million read pairs. 90% of events in our simulated rearranged genome were detected with 160 million read pairs, about the minimum currently obtainable from a single lane using the Illumina HiSeq platform (Fig. 1A). As expected, detectability of a certain rearrangement strongly depended on the breakpoint microenvironment, with more coverage needed to detect events in regions of lower mappability (Fig. 1B). When assessing false positives, we found that 97% of total SV calls were attributed to reads with more than one equally valid mapping position. These reads originate from various repetitive genomic regions (such as centromeric satellite sequences, retroelements, RNA genes, etc.) and had to be removed from the analysis. After examining BWA mapping quality scores of reads contributing to real and false positives, we chose a cutoff of 23 for our analysis (for further discussion, see ''False positives arising from BWA alignment errors''). It should be noted that cutoff is chosen based on the desired ratio of real and false positives, with lower cutoff increasing sensitivity at the expense of specificity. After applying the BWA mapping quality cutoff to our simulated datasets, we observed no more false positives related to read mapping errors. However, we noticed size-related false positives that appeared with the increasing coverage. These false positives were small deletions originating from higher end and duplications originating from the lower end of the normal DNA library fragment size distribution. To correct for insert size related false positives, we used a size cutoff of 8 standard deviations and applied it to our analysis. This parameter should be determined for each library individually, depending on the desired sensitivity: increasing the standard deviation cutoff will lead to increasing the minimal detectable deletion and duplication size. Depending on the analysis needs, it may be beneficial using lower standard deviation cutoffs together with an assessment of the number of supporting read pairs, as SVs with a higher number of supporting read pairs can indicate a real event. However, this approach should be used with caution when analyzing tumor samples where loss or gain of copy number can lead to false conclusions.
Simulations of PE sequencing proved to be a useful tool in developing the data filtering strategy. After optimizing the initial parameters described above and removing all false positive calls from simulated datasets, SV calls in the experimental dataset could be attributed to the sample and the experimental procedure itself, rather than analysis artefacts. Simulations were also useful as a means to predict necessary coverage for detecting certain types of events. Importantly, when relating simulations to the experimental data analysis, it has to be taken into account that expected frequency of rearrangements, and hence the needed coverage, will normally be 50% due to the diploid nature of the genome. In case of heteroclonal or impure samples (the usual case when dealing with tumor samples), this frequency is expected to be even lower.

Data Filtering
As our experimental dataset, we chose an uncharacterized thymic lymphoma obtained from a Rag2 c/c p53 2/2 mouse. Thymic lymphomas arising spontaneously in this mouse model harbor a large number of structural rearrangements such as translocations, large deletions and amplifications [22]. Illumina's paired-end sequencing was chosen over the mate pair strategy, which we abandoned in the early course of this work due to difficulties in DNA library preparation. We sequenced two genomic libraries, one obtained from the solid tumor tissue and the other from the liver of the same animal (germline control). We found the control library to be essential due to a large number of germline SVs originating from remains of a 129 strain background (the mouse was initially created as a 129SvEv/C57BL6 hybrid). The tumor and control library were sequenced to 17x and 9x physical coverage, respectively (Table 2, Fig. 2). We used SVDetect (Fig. 3A) and BreakDancer (Fig. 3B) to call initial SVs, as these are the two most widely used large structural variant detection programs applicable to 50 bp read PE data. Generally, the analysis using the BreakDancer initially produced more intrachromosomal and less interchromosomal SV calls compared to SVDetect, perhaps due to differences in the clustering strategy. The same analysis parameters and filtering procedure was applied to both programs, yielding similar results at the end.
In contrast to simulations, analysis of experimental data led to a large number of false positive calls after applying initially established analysis parameters described above. We define these false positives as events supported by reads mapping to repetitive genomic regions, as well as those that span regions with retroelement activity. The number of false positives was especially large among interchromosomal SVs, explained by the higher likelihood of a repetitive read being misaligned to a chromosome different from its mate. In order to find and validate real tumorspecific variants, it was necessary to analyze the source of these calls and reduce them to a manageable number. We identified 3 main types of false positive calls, depending on their source: 1) false positives related to variation between mouse strains, 2) false positives arising from alignment errors, and 3) false positives related to PCR duplicates originating from sample preparation combined with sequencing errors. We developed different pre-and post-detection filtering procedures in order to work around these challenges.

False Positives Related to Structural Variation between Laboratory Mouse Strains
Structural variation among commonly used laboratory mouse strains, similar to structural variation between individual humans, has already been documented in great detail [23,24,25]. Most knock-in mice, including the one used in this study, can be classified as hybrid strains, even if the animals were backcrossed a number of times to the reference genome strain (C57BL/6J). Observed SVs can mostly be attributed to germline retroelement activity, and are manifested as insertions of SINE, LINE and LTR elements as well as reverse-transcribed intronless genes (retrogenes). When an experimental dataset is compared to the C57BL/ 6J reference genome, several types of structural variants are called. Most commonly, retroelement insertions present in the reference, but missing in the sample strain, will be called as deletions, while those present in the sample strain, but missing in the reference, will be called as balanced translocations. Insertions of retrogenes can be recognized as a number of deletions encompassing introns, accompanied by a translocation call from the chromosome of origin to the recipient chromosome (Fig. 4).
In order to filter out germline SVs described above, we found it necessary to obtain a control dataset by sequencing normal tissue originating from the same animal. In this study, a control dataset was prepared using liver tissue and compared to the tumor dataset. Using this strategy, we were able to remove most germline SVs. However, certain SVs failed to be detected as germline, due to lack of overlap between supporting read pairs. Therefore, we found it necessary to examine each SV manually for potentially missed overlap with the control. Even after applying the comparison procedure, a number of events we identified as high quality candidates were validated as germline (30% of intrachromosomal and 50% of interchromosomal SVs). This result can be attributed to lower coverage in our control dataset, leading to lower sensitivity of germline SV detection. Aneuploidy of tumor tissue (additional copies of some chromosomes or loss of others) creates local differences in coverage between the tumor and control dataset, which adds to the complexity of the analysis (Fig. 2).

False Positives Arising from BWA Alignment Errors
To remove false positives related to alignment errors, we tested the effect of BWA mapping quality score-based filtering on the number of resulting SV calls. Although BWA authors designate reads with 0-10 mapping quality as ''unreliably mapped'' [26], we found the best cutoff range for mapping quality score in our experiment to be 0-22 (Fig. 5). To partially correct for undesired removal of real SV candidates in less unique genomic regions, calls with large numbers of supporting read pairs were examined manually. However, none of the examined removed SVs could be designated as high quality candidates, since they all involved genomic regions of low mappability. After applying this read mapping quality filter before any other filtering is applied, the number of called SVs was reduced to 85% for intrachromosomal and 36-39% for interchromosomal events (Fig. 3).
To further reduce the number of SV calls resulting from misalignment of reads originating from repetitive regions, we tested the strategy of removing SVs with overlap with the RepeatMasker [27] and the simple repeats track of the UCSC Genome Browser. We found that RepeatMasker strategy reduces the number of false positive calls significantly, but filters out 12% of previously validated rearrangements, including some with potential biological importance (eg. Pten deletion). Importantly, reads coming from RepeatMasker annotated regions are not necessarily difficult to map uniquely, since this track contains many ancient repeated elements that have significantly diverged through evolution. RepeatMasker filtering strategy was finally used only to identify high confidence candidates among interchromosomal events with low numbers of supporting read pairs. In contrast to the RepeatMasker, overlap with simple repeats track was found to be successful in filtering out alignment error related false positives only.
As another strategy of dealing with repetitive element-related false positives, we tested the efficiency of filtering SVs against the low mappability regions, calculated based on the mappability data of the UCSC Genome Browser (see Materials and Methods). This strategy proved to very successful, removing significant numbers of false positive calls, especially efficient in the case of interchromosomal SVs (Fig. 3).

False Positives Related to Errors in Duplicate Calling
In the course of our analysis, we observed false positives called from small clusters of 2 or 3 read pairs, with both reads mapping at positions 0-2 bp away from one another (Fig. 6). As already discussed by others in the field [28], most of these ''imperfect duplicates'' probably originated from one DNA fragment and diverged either during PCR amplification, perhaps due to template strand slipping, or sequencing errors at the beginning or the end of the read during the sequencing procedure. These bona fide duplicates cannot be removed using existing tools such as Picard's MarkDuplicates since they do not have identical mapping positions. Percentage of imperfect duplicates appears to be correlated with the percentage of perfect PCR duplicates: specific datasets with high perfect duplicate percentage will show higher percentage of imperfect duplicates (M. Mijušković, results not part of this study).
We defined imperfect duplicates as pairs with the same mapping position of both reads with the possible offset up to 2 bp. Detection of these duplicates was done during clustering of discordant read pairs by SVDetect or BreakDancer, using different strategies (see Materials and Methods). After applying this filter, the number of intrachromosomal and interchromosomal SVs was reduced by 0.3-1.7% and 3.9-19.5%, respectively ( Figure 3). Importantly, these numbers might underestimate the total imperfect duplicate percentage since in this case they were detected after removing low mapping quality reads.

Validating Structural Variants
We created the final list of 61 high confidence SVs (see Materials and Methods) after manual examination of 381 intrachromosomal and 130 interchromosomal SVs detected by SVDetect and 328 intrachromosomal and 64 interchromosomal SVs detected by BreakDancer obtained after applying our filtering procedure. The majority of these calls, called by both programs, were found to either be a result of alignment errors related to repeats (59%), or previously unidentified germline SVs such as retroelement or retrogene insertions (23%). BreakDancer detected only a subset of high confidence SVs found by SVDetect (47 out of 61), even before any filtering was applied, perhaps due to differences in the clustering algorithm.
We used PCR to test 57 intrachromosomal and 4 interchromosomal high confidence SVs found by the BreakDancer and/or SVDetect (Table S1). From this set, we validated 23 large (1-539 kb) deletions, 10 inversions, 5 duplications and 2 translocations as tumor-specific, and the specificity of the PCR products was confirmed by Sanger sequencing (Table 3). Thus, 40 of the 61 high confidence SVs identified by our method were validated as tumor specific SVs. The other 19 intrachromosomal and 2 interchromosomal events were PCR validated as germline SVs. 16 out of 21 of these SVs had at least one supporting read pair in the original control dataset and failed to be detected due to our 2 supporting read cutoff. These false positives can be avoided either by sequencing the control dataset to higher coverage, when possible, or examining the control dataset using the 1 read pair cutoff.   Among validated tumor-specific SVs, we found several tumorsuppressor gene deletions, as well as some expected canonical antigen receptor gene rearrangements (Table 3). Notably, two tumor-specific translocations, two inversions and one validated tumor-specific duplication show signs of a complex rearrangement [29].

Conclusions
First, our work shows that simulating paired-end sequencing can be an effective way to develop the analysis strategy, predict coverage necessary to detect DNA breakpoints in different genomic environments and to separate sources of false positive calls into sample related and those that arise due to analysis artefacts.
Second, we have found that a control dataset obtained from the same animal is essential to reduce a large number of germline SVs that exist between commonly used laboratory mouse strains, even in cases when the animals are backcrossed a number of times to the reference genome strain.
Third, we have defined two types of duplicated reads leading to false SV prediction, both arising from PCR over-amplification during sample preparation: perfect duplicates, with matching genomic coordinates, and those with 1-2 bp coordinate offset that are not detected using existing tools. We present a method to remove SVs resulting from those reads using either SVDetect or BreakDancer.
Fourth, we find that removing reads with low BWA mapping quality, as well as SV calls that overlap with genomic regions of low mappability, is a very efficient way to filter our large numbers of false positives that arise due to alignment errors.
Finally, using this method, we validated a fairly large number of true tumor-specific SVs from a rather small dataset. Starting with a large number of candidate events, we were able to rapidly discard majority of false positives and focus on a tractable number of candidates for manual analysis (,5% of the initial number of calls from this dataset). We validated our filtering method with two widely used SV detection programs, SVDetect and BreakDancer, showing that it is universally applicable, rather than being restricted to a single program and its possible shortcomings. The final number of candidate events, as well as the number of false negatives, is a function of coverage and the stringency of filtering parameters. Depending on the needs of the experiment, these parameters can be set to a desired level in order to achieve an acceptable number of false positives vs. false negatives.
Our method should be applicable for future work in model organisms as well as in human tumors. In the clinical context, higher coverage would be needed to reduce the number of undetected germline SVs, as well as to improve the detection of low frequency somatic SVs.

Simulating PE Sequencing Data
Simulated PE sequencing datasets were created based on a mutated mouse reference genome (mm9) containing 10 translocations and 10 large deletions introduced using the EMBOSS tools (http://emboss.sourceforge.net). Illumina format fastq files were written using our PE.pl program (http://sourceforge.net/projects/ svdetection) that selects random positions in the user-provided genome, normalized for different chromosome lengths. Userdefined parameters include the number of read pairs, read length, mean insert size and standard deviation.

PE Read Alignment and Quality Filtering
Fastq files were generated using Casava 1.8 (Illumina) and reads were aligned using BWA [26]. Output files were manipulated by Samtools as needed [30]. Perfect PCR duplicates were removed using Picard's MarkDuplicates tool (http://sourceforge.net/apps/ mediawiki/picard). BWA-designated concordant read pairs and read pairs with low BWA mapping quality scores were removed using our own software (http://sourceforge.net/projects/ svdetection), as needed.

Calling Structural Variants and Removing Imperfect Duplicates
SVDetect [20] or BreakDancer [21] were used to call intrachromosomal and interchromosomal rearrangements from discordant, quality pre-filtered read pairs. Mean insert size and standard deviation used in this analysis were obtained by Picard's InsertSizeMetrics tool (http://sourceforge.net/apps/mediawiki/ picard). SVDetect and BreakDancer were configured to detect rearrangements with 2 or more supporting read pairs using 8 times standard deviation as threshold for both deletions and duplications. SVDetect built-in ''compare'' function was used for comparison of the tumor and control datasets. When comparing the calls, the option for comparing only the same SV type was turned off. For SV detection with BreakDancer, tumor to normal comparison was done using BEDTools [31].
To remove PCR duplicates with 1-2 bp offset in coordinates (''imperfect duplicates''), we manipulated the output file created by the SVDetect "linking" function using our own software (http:// sourceforge.net/projects/svdetection). This file lists clusters of read pairs supporting the same rearrangement and contains coordinates of individual supporting reads. Pairs where both reads are positioned 0, 1 or 2 base pairs away from each other, in the same orientation, were removed as imperfect duplicates. In BreakDancer-based SV analysis, we changed the minimum SV anchoring region setting to 3, in order to avoid SVs being called from clusters of imperfect PCR duplicates. We also examined reads supporting SV calls in BreakDancer-produced bed files and used our own software to remove any SVs resulting from imperfect duplicates (http://sourceforge.net/projects/svdetection).

Defining High Confidence SV Candidates
Structural variants called by SVDetect were additionally filtered based on the overlap with low mappability regions, simple repeats and RepeatMasker data extracted from the UCSC Table Browser [32]. Overlap between these regions and SVDetect links was assessed using Galaxy tools [33,34,35]. Low mappability regions were assembled as adjacent intervals of 50 bp with Duke ENCODE uniqueness scores less than 0.5 (the 50 bp sequence occurs more than 2 times in the genome). SVs with links overlapping these regions were removed, with the cutoff at 85% and 50% overlap for intrachromosomal and interchromosomal events, respectively. For overlap with simple repeat regions, the cutoff was 50% or greater. RepeatMasker overlap was used as a filter only for interchromosomal events supported by 2 or 3 read pairs, with the cutoff set to 80%. For intrachromosomal events, the additional custom filtering was applied to remove SVs called from read pairs arising from DNA fragments deviating from the expected library insert size range that were not removed by our standard deviation cutoff. To account for this, deletion size cutoff was set to 600 bp and duplication to 300 bp.
Tumor-specific SVs called by SVDetect and BreakDancer were finally examined manually to generate the list of high confidence candidates. SVs originating from alignment errors (related to repetitive genomic regions), failed tumor-control comparison filtering, as well as germline SVs (retroelement and retrogene insertions) were removed from the list or designated as low confidence candidates.

Validation of SV Calls
High confidence SV candidates were validated by PCR using custom designed primers mapping to SV ''linking'' regions, in the appropriate orientation. SVs validated as tumor-specific were cloned using the TOPOH TA Cloning Kit (Invitrogen, K4500-01). For each SV, two independent clones were sequenced by the Sanger method. Resulting sequences were mapped using BLAT [36].

Ethics Statement
This study was carried out in strict accordance with the recommendations in the Guide for the Care and Use of Laboratory Animals of the National Institutes of Health. The protocol was approved by the Institutional Animal Care and Use Committee of the University of Pennsylvania (Permit Number: 803893).