Conceived and designed the experiments: CAM AM. Performed the experiments: CAM. Analyzed the data: CAM OH CC. Contributed reagents/materials/analysis tools: CAM OAH CC. Wrote the paper: CAM AM. Algorithmic design and implementation: CAM. Project leadership and supervision: AM.
The authors have declared that no competing interests exist.
Copy number alterations are important contributors to many genetic diseases, including cancer. We present the readDepth package for R, which can detect these aberrations by measuring the depth of coverage obtained by massively parallel sequencing of the genome. In addition to achieving higher accuracy than existing packages, our tool runs much faster by utilizing multi-core architectures to parallelize the processing of these large data sets. In contrast to other published methods, readDepth does not require the sequencing of a reference sample, and uses a robust statistical model that accounts for overdispersed data. It includes a method for effectively increasing the resolution obtained from low-coverage experiments by utilizing breakpoint information from paired end sequencing to do positional refinement. We also demonstrate a method for inferring copy number using reads generated by whole-genome bisulfite sequencing, thus enabling integrative study of epigenomic and copy number alterations. Finally, we apply this tool to two genomes, showing that it performs well on genomes sequenced to both low and high coverage. The readDepth package runs on Linux and MacOSX, is released under the Apache 2.0 license, and is available at
Copy number alterations (CNAs) that arise due to genomic duplications or deletions have gradually been recognized as a major contributor to genetic variation and disease
Specifically, it has been shown that when whole-genome shotgun sequencing is performed on massively parallel instruments, the number of sequence reads that align to a position in the genome is proportional to the copy number at that position
Another challenge in extracting copy number information from sequence data is that the genome contains many repetitive elements, and aligning reads to these positions is impossible using current short-read technologies. Several groups sidestep this problem by sequencing a reference genome alongside their target genome
We present readDepth, a new R package for CNA detection that does not require the sequencing of matched normal sample. Using a binning procedure, readDepth calls copy number variants based on sequence depth, and then invokes a circular binary segmentation algorithm to call segment boundaries. If the reads are obtained from paired ends, breakpoint information can be used to refine segment boundaries. The algorithm accepts both regular and bisulfite-treated DNA reads, thus enabling integrative study of structural and epigenomic alterations. A key feature of the algorithm is an improved statistical model that is applied to adjust for a number of types of bias, including GC-content, mapability, and other sources of distortion introduced by the preparation and sequencing processes. The algorithm also allows for explicit control of the false discovery rate (FDR), which minimizes the number of false positive aberrations detected. Furthermore, by leveraging parallel processing, readDepth produces results many times faster than comparable algorithms.
We validate readDepth on simulated data and show that it has high sensitivity and specificity and outperforms a comparable R package. We then apply it to several genomes. We show this algorithm's ability to resolve complex and highly amplified regions from low-coverage data on the MCF-7 breast cancer cell line. We also demonstrate a method for integrating information from paired-end sequencing information to create more accurate boundary calls. Finally, we showcase a methylation-sensitive method of correcting for GC-content bias by applying it to high-coverage bisulfite-treated data from the H1 embryonic stem cell line.
We first examined several genomes and assessed the distribution of reads uniquely mapping to the genome in order to create a statistical model. We started with the assumption that all reads were chosen randomly from the genome, which meant that the number of reads in any given region would follow a Poisson distribution with mean proportional to the copy number of the region. After examining several complete genomes (Yoruban
Thus, we developed a model that uses a negative-binomial distribution to approximate an overdispersed Poisson distribution. The negative binomial distribution can be seen as a mixture of Poisson distributions where the median values (λ) are drawn from a gamma distribution
a) Illumina reads from the Yoruban genome are not fit well by a Poisson model. b) Modeling the reads using a negative binomial distribution with a variance/mean ratio of 3 results in a much better fit, with a root mean square error three times smaller. c) The use of bin sizes that are too small results in an inability to cleanly separate peaks with copy number of one and two, resulting in a large number of false-positive calls in the overlapping region. d) Increasing the bin size allows us to trade resolution for better separation.
We next developed readDepth, a tool that uses our model to identify sets of optimal parameters that correspond to specific false-discovery rates. To allow for improvements in the sequencing process or introduction of new platforms that may result in different distributions, the VMR parameter (set to 3 in the following experiments) is adjustable.
Once a bin size is established and the number of reads that fall into each bin is counted, readDepth corrects for bias introduced by the inability to map reads into repetitive regions of the genome. To do this, we created mapability tracks via self-alignment of the reference genome. ReadDepth takes these tracks as input and uses them to proportionally scale the number of reads in bins that have less than 100% mapability. Previous studies have described significant bias in the number of reads generated from regions with differing GC-content
Once read counts are adjusted for each bin in the genome, readDepth applies circular binary segmentation, as implemented in the DNAcopy R package
We first validated readDepth's ability to detect regions of copy number gain and loss using simulated data. We randomly generated reads from chromosome one, drawing from a distribution with copy number of two that conforms to our negative-binomial model with a VMR of three. We then insert a copy number gain by replacing a region of the specified size with reads drawn from a distribution with a copy number of three. One thousand simulations were run, and sensitivity was measured by determining if both edges of the CNA call matched the seeded CNA with a tolerance of one bin size. A false positive was defined as a putative alteration call where both ends did not match the seeded CNA. We measured specificity by taking one minus the percentage of trials which had one or more false positive calls.
We show that readDepth detects CNAs with high sensitivity and specificity, even at low levels of genomic coverage. With a single lane of 76 bp Illumina paired-end sequencing that provides 0.5x coverage, readDepth reliably detects alterations that exceed 200 kbp in size. (
a) readDepth sensitively detects small copy number alterations even at very low levels of sequence coverage. As additional reads increase the coverage, the algorithm is able to detect smaller alterations. b) Controlling the false discovery rate keeps the number of false positives very low. c/d) the readDepth algorithm is applied to simulated data with 1x coverage. When the read distribution is overdispersed, readDepth uses larger bins, effectively trading accuracy for resolution. e/f) When the cnv-seq algorithm is applied to the same data set, it tends to call many false breakpoints, resulting in lowered sensitivity and specificity. This problem is exacerbated by overdispersed data.
We then compared the performance of readDepth to cnv-seq, which is another R package for detecting CNAs from sequencing data
To further test readDepth, we analyzed over 47 million uniquely mapping reads from the MCF-7 breast cancer cell line generated as 55 bp mate-pairs with the Illumina GAII sequencer. This represents 0.85-fold coverage of the reference genome. ReadDepth was applied to this data using an FDR rate of 0.01, which resulted in bins with a size of 25.3 kbp. As shown in
a) a log2 plot of copy number alterations found in the MCF-7 breast cancer cell line. Sequence based copy number calls made with the readDepth package (bottom) reveal the same gross morphology seen by an assay done with a 100 k SNP array (top) b) an absolute copy number plot of MCF-7 chromosome 20, showing high level amplifications and fine-scale copy number changes not detectible with array-based methods.
We then integrated breakpoint information derived from the mate-pair reads with information about read depth in order to more accurately delineate the boundaries of copy number alterations. Genomic breakpoints were detected from mate-pair data using a custom pipeline and the locations of these breakpoints were used as input to readDepth. After identifying segments of gain and loss, the algorithm adjusted segment boundaries if an unambiguous single breakpoint was located less than half of the bin size away from the segment boundary. This step resulted in the refinement of 99 breakpoints, effectively increasing the mean resolution of those copy number boundaries to 3.396 kbp, which is far lower than the 25.3 kbp resolution possible through the use of bins.
The use of bisulfite reads presents some unique challenges, as prior to sequencing, they are treated such that all non-methylated cytosines are converted to uracils, which are then read out of the sequencer as thymine bases
With reads representing approximately 37-fold coverage of the genome, the algorithm is able to use a window size of 500 bp, resulting in a very high resolution picture of the CNAs in this cell line. We detect 6,722 copy number variants, mostly small, with 39% smaller than 10 kbp, and 80% smaller than 100 kbp. The number of deletions is over ten-fold higher, with 91% losses and 9% gains, but they tend to be smaller, with a median size of 17.5 kbp, compared to 45 kbp for amplifications. Chromosome 12 is especially aberrant, with most of the chromosome amplified (
Through the use of the multicore and foreach R packages, readDepth splits the most CPU-intensive parts of the calculations among multiple processors
Given
We generate the distributions using the rnbinom function in R, setting µ = λ, and the size parameter = λ/(d-1), where d is the input VMR. We then generate distributions using the expected number of reads with copy number of one, two, and three, and choose a threshold value for gains and losses that minimizes the number of bins that are misclassified. The FDR rate can then be calculated as the number of misclassified bins divided by the total number of bins. We progressively test lower bin sizes until we reach the smallest size possible under the constraints of the given false-discovery rate.
We developed mapability tracks by aligning all genomic sequences of a given length back to the human reference genome (hg18) with BWA
For bisulfite-mapability tracks, the same process was followed for read generation, then all cytosines were replaced with thymines to mimic the effects of bisulfite treatment. These were aligned to the reference genome using Pash 3.0, and read depth correction was carried out as above.
To correct for sequencing biases that arise due to preferential sequencing of certain levels of GC content, we normalize the read depth based on the GC content of each bin. These values incorporate mapability information, such that we only consider the GC content of mappable bases. To correct the data, we first calculate the average read depth for bins with GC content in intervals of 0.1%. We then use the LOESS method to fit a regression line to this data. The correction value for each bin is equal to the difference between the median read depth and the average read depth of that bin. We then scale the values such that the correction is neutral with respect to the total number of reads.
For bisulfite reads, we generate a methylation report for each cytosine in the genome using a pipeline built around Pash 3.0
Segmentation is performed using circular binary segmentation, as implemented in the R package DNAcopy
MCF-7 copy number data assayed with the Affymetrix 100 k chip (50k_Xba240 and 50k_Hind240) was obtained from
MCF-7 sequence reads are available at
The readDepth package is under active development and we anticipate adding a number of additional features in the future. Functions that will allow better visualization of results are currently being developed, as well as methods for detecting overdispersion automatically, rather than requiring it as an input parameter. We are also exploring methods for increasing the resolution of breakpoint calls when deep-sequencing data is available.
We expect that readDepth will be useful in a variety of different scenarios. Projects that produce low-coverage paired end sequencing can benefit from its ability to integrate breakpoint information for accurate boundary calls. Deep-sequencing projects will be able to leverage the multiple cores present in modern computers to detect CNAs quickly, even when analyzing billions of reads. Additionally, readDepth's first-of-its-kind ability to correct for GC-biases specific to bisulfite sequencing mean that it is uniquely well-suited to dealing with data coming from the burgeoning field of epigenomics.
The readDepth package is under active development and we anticipate adding a number of additional features in the future. Functions that will allow better visualization of results are currently being developed, as well as methods for detecting overdispersion automatically, rather than requiring it as an input parameter. We are also exploring methods for increasing the resolution of breakpoint calls when deep-sequencing data is available.
We expect that readDepth will be useful in a variety of different scenarios. Projects that produce low-coverage paired end sequencing can benefit from its ability to integrate breakpoint information for accurate boundary calls. Deep-sequencing projects will be able to leverage the multiple cores present in modern computers to detect CNAs quickly, even when analyzing billions of reads. Additionally, readDepth's first-of-its-kind ability to correct for GC-biases specific to bisulfite sequencing mean that it is uniquely well-suited to dealing with data coming from the burgeoning field of epigenomics.
The readDepth package runs on Linux and MacOSX, is released under the Apache 2.0 license, and is available at
(TIF)
(TIF)
(DAT)
(DAT)
We thank the communities at stats.stackexchange.com and stackoverflow.com for useful advice concerning statistics and R.