Conceived and designed the experiments: AB VB BR. Performed the experiments: AB. Analyzed the data: AB VB BR. Contributed reagents/materials/analysis tools: SV CC. Wrote the paper: AB VB BR.

The authors have declared that no competing interests exist.

Paired-end sequencing is emerging as a key technique for assessing genome rearrangements and structural variation on a genome-wide scale. This technique is particularly useful for detecting copy-neutral rearrangements, such as inversions and translocations, which are common in cancer and can produce novel fusion genes. We address the question of how much sequencing is required to detect rearrangement breakpoints and to localize them precisely using both theoretical models and simulation. We derive a formula for the probability that a fusion gene exists in a cancer genome given a collection of paired-end sequences from this genome. We use this formula to compute fusion gene probabilities in several breast cancer samples, and we find that we are able to accurately predict fusion genes in these samples with a relatively small number of fragments of large size. We further demonstrate how the ability to detect fusion genes depends on the distribution of gene lengths, and we evaluate how different parameters of a sequencing strategy impact breakpoint detection, breakpoint localization, and fusion gene detection, even in the presence of errors that suggest false rearrangements. These results will be useful in calibrating future cancer sequencing efforts, particularly large-scale studies of many cancer genomes that are enabled by next-generation sequencing technologies.

Cancer is driven by genomic mutations that can range from single nucleotide changes to chromosomal aberrations that rearrange large pieces of DNA. Often, these chromosomal aberrations disrupt a gene sequence, and even fuse the sequences of two genes, producing a “fusion gene.” Fusion genes have been identified as key participants in the development of several types of cancer. Using genome-sequencing technology it is now possible to identify chromosomal aberrations genome-wide and at high resolution. In this paper, we address the question of how much sequencing is required to detect a chromosomal aberration and to determine the location of the aberration precisely enough to identify if a fusion gene is created by this aberration. We derive a mathematical formula that accurately predicts a number of fusion genes in a breast cancer sequencing study. We also demonstrate how the ability to detect chromosomal aberrations and fusion genes depends on both the size of the fusion gene and the parameters of the genome sequencing strategy that is used. These results will be useful in calibrating future cancer sequencing efforts, especially those using next-generation sequencing technologies.

Cancer is a disease driven by selection for somatic mutations. These mutations range from single nucleotide changes to large-scale chromosomal aberrations such as deletions, duplications, inversions, and translocations. While many such mutations have been cataloged in cancer cells via cytogenetics, gene resequencing, and array-based techniques (e.g. comparative genomic hybridization), there is now great interest in using genome sequencing to provide a comprehensive understanding of mutations in cancer genomes. The Cancer Genome Atlas (

Until recently, it was generally believed that recurrent translocations and their resulting fusion genes occurred only in hematological disorders and sarcomas, with few suggesting that such recurrent events were prevalent across all tumor types including solid tumors

These studies raise the question of what other recurrent rearrangements remain to be discovered. One strategy for genome-wide high-resolution identification of fusion genes and other large scale rearrangements is paired-end sequencing of clones, or other fragments of genomic DNA, from tumor samples. The resulting end-sequence pairs, or

(A) The endpoints of a clone

Whole genome paired-end sequencing approaches allow for a genome-wide survey of all potential fusion genes and other rearrangements in a tumor. This approach holds several advantages over transcript or protein profiling in cancer studies. First, discovery of fusion genes using mRNA expression

In this paper, we address a number of theoretical and practical considerations for assessing cancer genome organization using paired-end sequencing approaches. We are largely concerned with detecting a rearrangement breakpoint, where a pair of non-adjacent coordinates in the reference genome is adjacent (i.e. fused) in the cancer genome. In particular, we extend this idea of a breakpoint to examine the ability to detect fusion genes. Specifically, if a clone with end sequences mapping to distant locations identifies a rearrangement in the cancer genome, does this rearrangement lead to formation of a fusion gene? Sequencing the clone will answer this question, but this requires additional effort and cost and may be problematic; e.g. most next-generation sequencing technologies do not “archive” the genome in a clone library for later analysis (for the sake of simplicity we will use the term “clone” to refer to any contiguous fragment that is sequenced from both ends). We derive a formula for the probability of fusion between a pair of genomic regions (e.g. genes) given the set of all mapped clones and the empirical distribution of clone lengths. These probabilities are useful for prioritizing follow-up experiments to validate fusion genes. In a test experiment on the MCF7 breast cancer cell-line, 3,201 pairs of genes were found near clones with aberrantly mapping end-sequences. However, our analysis revealed only 18 pairs of genes with a high probability (>0.5) of fusion, of which six were tested and five experimentally confirmed (

| Start Gene | End Gene | Fusion Probability | Cluster Size | Sequencing Supporting Fusion | Cell Line/Primary Tumor |
|---|---|---|---|---|---|
| ASTN2 | PTPRG | 1 | 2 | Yes | MCF7 |
| BCAS4 | BCAS3 | 1 | 20 | Yes | MCF7 |
| KCND3 | PPM1E | 0.99 | 12 | Yes | MCF7 |
| NTNG1 | BCAS1 | 0.99 | 6 | Yes | MCF7 |
| BCAS3 | ATXN7 | 0.83 | 8 | Yes | MCF7 |
| ZFP64 | PHACTR3 | 0.6322 | 2 | No | BT474 |
| CT012_HUMAN | UBE2G2 | 0.0880 | 1 | No | Breast |
| VAPB | ZNFN1A3 | 0.0842* | 3 | Yes | BT474 |
| BMP7 | EYA2 | 0.0324 | 4 | No | MCF7 |
| KCNH7 | TDGF1 | 0.0215 | 1 | No | Breast |
| SULF2 | TBX4 | 0.00656 | 2 | No | MCF7 |
| NACAL | NCOA3 | 0.0057 | 2 | No | MCF7 |
| MRPL45 | TBC1D3C | 0.0005 | 1 | No | BT474 |
| U1 | NP_060028.2 | 0.0005 | 1 | No | Breast |
| RBBP9 | ITGB2 | 0.0005 | 1 | No | Breast |
| Y | SYNPR | <0.0001 | 4 | No | MCF7 |
| PRR11 | TMEM49 | <0.0001 | 9 | No | MCF7 |
| BMP7 | Q96TB | <0.0001 | 3 | No | MCF7 |

The gene order shown indicates “start” and “end” positions with respect to the direction of transcription. Note that

*A single clone contained more than two chromosomal segments, i.e. the clone is not a simple fusion of two genomic loci.
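As a small illustration of how such probabilities can be used to prioritize follow-up experiments, the sketch below filters and ranks a subset of the rows from the table above. The `Candidate` record, the `prioritize` helper, and the tie-break on cluster size are illustrative assumptions, not part of the original analysis.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    start_gene: str
    end_gene: str
    probability: float
    cluster_size: int
    supported: bool  # did follow-up sequencing support a fusion?
    sample: str


# A subset of rows copied from the table above.
CANDIDATES = [
    Candidate("ASTN2", "PTPRG", 1.0, 2, True, "MCF7"),
    Candidate("BCAS4", "BCAS3", 1.0, 20, True, "MCF7"),
    Candidate("KCND3", "PPM1E", 0.99, 12, True, "MCF7"),
    Candidate("NTNG1", "BCAS1", 0.99, 6, True, "MCF7"),
    Candidate("BCAS3", "ATXN7", 0.83, 8, True, "MCF7"),
    Candidate("ZFP64", "PHACTR3", 0.6322, 2, False, "BT474"),
    Candidate("VAPB", "ZNFN1A3", 0.0842, 3, True, "BT474"),
    Candidate("BMP7", "EYA2", 0.0324, 4, False, "MCF7"),
]


def prioritize(candidates, min_probability=0.5):
    """Keep candidates above the probability cutoff, most probable first.

    Ties are broken by larger cluster size (an assumption made here
    purely for illustration).
    """
    kept = [c for c in candidates if c.probability > min_probability]
    return sorted(kept, key=lambda c: (-c.probability, -c.cluster_size))


if __name__ == "__main__":
    for c in prioritize(CANDIDATES):
        print(f"{c.start_gene}-{c.end_gene}\t{c.probability}\t{c.sample}")
```

The 0.5 cutoff mirrors the threshold used in the text; a different cutoff may suit a different validation budget.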

The advent of high-throughput sequencing strategies raises important experimental design questions in using these technologies to understand cancer genome organization. Sequencing more clones improves the probability of detecting fusion genes and breakpoints. However, even with the latest sequencing technologies, it would be neither practical nor cost effective to shotgun sequence and assemble the genomes of thousands of tumor samples. Thus, it is important to maximize the probability of detecting fusion genes with the least amount of sequencing. This probability depends on multiple factors including the number and length of end-sequenced clones, the length of genes that are fused, and possible errors in breakpoint localization. Here, we derive (theoretically and empirically) several formulae that elucidate the trade-offs in experimental design of both current and next-generation sequencing technologies. Our probability calculations and simulations demonstrate that even with current paired-end technology we can obtain an extremely high probability of breakpoint detection with a very low number of reads. For example, more than 90% of all breakpoints can be detected with paired-end sequencing of fewer than 100,000 clones (

| Clone Length (L) | Paired Reads (N) | Clone Coverage (c) | E(\|Θ_{ζ}\|) | P_{ζ} | E(\|Θ_{ζ*}\|) | P_{ζ*} |
|---|---|---|---|---|---|---|
| 1 kb | 40×10^{6} | 13.3× | 295 | >.99 | 289 | .99 |
| 1 kb | 1×10^{6} | .33× | 972 | .15 | 658 | .012 |
| 2 kb | 20×10^{6} | 13.3× | 593 | >.99 | 581 | .99 |
| 2 kb | 1×10^{6} | .66× | 1889 | .28 | 1296 | .044 |
| 10 kb | 5×10^{6} | 16.7× | 2393 | >.99 | 2378 | >.99 |
| 10 kb | 1×10^{6} | 3.3× | 7342 | .81 | 5657 | .50 |
| 40 kb | 2×10^{6} | 26.7× | 5998 | >.99 | 5997 | >.99 |
| 40 kb | .1×10^{6} | 1.33× | 35587 | .49 | 25124 | .14 |
| 150 kb | .5×10^{6} | 25× | 23997 | >.99 | 76807 | .71 |
| 150 kb | .1×10^{6} | 5× | 93169 | .92 | 72022 | .80 |
| 150 kb | .012×10^{6} | .6× | 142510 | .26 | 97457 | 0.037 |

The probability P_{ζ} of detecting a fusion point and the expected length E(|Θ_{ζ}|) of a breakpoint region under various clone lengths (P_{ζ*} and E(|Θ_{ζ*}|) correspond to the probability for, and expected size of, a breakpoint region in the case when E(|Θ_{ζ}|) and P_{ζ} = .99 over a continuous range of clone lengths, see

Given a set of clones from a cancer genome, we want to compute the probability that these clones identify a fusion gene in the cancer genome, i.e. a fusion of two different genes from the reference genome. We consider the cancer genome as a rearranged version of the reference human genome and assume that there exists a mapping between coordinates of the two genomes. The reference genome is described by a single interval of length _{C}_{C}_{C}_{C}_{C}_{C}_{min},_{max}], and we assume that: (i) only a _{C}_{C}_{C}_{C}_{C}_{C}_{min} = 0) (

If multiple clones contain the same fusion point ζ, then the corresponding breakpoint (

The rectangle indicates the possible locations of a breakpoint on chromosomes 1 and 20 that would result in a fusion between NTNG1 and BCAS1. Each trapezoid indicates possible locations for a breakpoint consistent with an invalid pair. Assuming that all clones contain the same breakpoint, this breakpoint must lie in the intersection of the trapezoids (shaded region). Approximately 69% of this shaded region intersects (darkly shaded region) the fusion gene rectangle, giving a probability of fusion of approximately 0.69. The empirical distribution of clone lengths reveals that not all clone lengths are equally likely (e.g. extremely long or short clones are rare). Using this additional information, our improved estimate for the probability of fusion is >0.99.
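The geometric calculation described in this caption can be sketched numerically. The code below assumes one particular parameterization that is not spelled out in the surviving text: a clone whose ends map to `x` and `y` is consistent with a breakpoint `(a, b)` when `a >= x`, `b <= y`, and the implied clone length `(a - x) + (y - b)` lies in `[l_min, l_max]`. The function names and the uniform weighting over the region are illustrative assumptions; the uniform case corresponds to the simpler estimate in the caption, not the refined one based on the empirical clone-length distribution.

```python
def b_interval(a, clones, l_min, l_max):
    """For a fixed coordinate a, return the (lo, hi) range of b that is
    consistent with every clone, or None if no b is consistent."""
    lo, hi = float("-inf"), float("inf")
    for x, y in clones:
        s = a - x  # part of the clone to the left of the breakpoint
        if s < 0 or s > l_max:
            return None
        lo = max(lo, y - (l_max - s))
        hi = min(hi, y - max(0.0, l_min - s))
    return (lo, hi) if hi > lo else None


def fusion_fraction(clones, l_min, l_max, gene_u, gene_v, steps=2000):
    """Fraction of the region consistent with all clones that falls inside
    the gene rectangle gene_u x gene_v (midpoint-rule integration over a)."""
    a_lo = max(x for x, _ in clones)
    a_hi = min(x for x, _ in clones) + l_max
    if a_hi <= a_lo:
        return 0.0
    da = (a_hi - a_lo) / steps
    total = inside = 0.0
    for i in range(steps):
        a = a_lo + (i + 0.5) * da
        interval = b_interval(a, clones, l_min, l_max)
        if interval is None:
            continue
        lo, hi = interval
        total += (hi - lo) * da
        if gene_u[0] <= a <= gene_u[1]:
            overlap = min(hi, gene_v[1]) - max(lo, gene_v[0])
            inside += max(0.0, overlap) * da
    return inside / total if total > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical coordinates, chosen only to exercise the function.
    clones = [(0.0, 1000.0), (10.0, 990.0)]
    print(fusion_fraction(clones, 0.0, 100.0, (0.0, 50.0), (900.0, 1000.0)))
```

Replacing the uniform weighting with weights drawn from an empirical clone-length distribution would change the estimate, as the caption notes for the NTNG1/BCAS1 example.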

Now, each gene in the reference genome defines an interval

We made predictions of fusion genes for the MCF7, BT474, and SKBR3 breast cancer cell lines as well as two primary tumors using data from end sequence profiling of these samples

We applied our method of computing fusion gene probability to each of these samples, using the distribution of clone lengths in each library for these calculations.

(A) Probability of fusion vs. the product of gene lengths involved in the fusion indicates higher fusion probabilities for pairs of larger genes. Larger circles indicate gene pairs experimentally validated by further sequencing. A “Positive Result” indicates a predicted fusion for which sequencing results supported a fusion gene. A “Negative Result” indicates a predicted fusion for which sequencing results did not support a fusion gene. (B) The number of fusion genes in chimerDB

We now consider the problem of how much sequencing is required to detect a genome rearrangement and to localize the breakpoint of a rearrangement. Consider an idealized model in which ^{6} bp) and end-sequenced. These end sequences are mapped to the reference genome and the fraction

A fusion point, ζ, on the cancer genome is detected if a uniquely mapped clone contains it (_{ζ} of detection of ζ is given by_{ζ}, as the interval determined by the intersection of all clones that contain ζ. Thus, |Θ_{ζ}| defines the localization of ζ, or the uncertainty in mapping ζ. Since localizing a fusion point to within

A fusion point ζ on the cancer genome contained in multiple clones. The leftmost and rightmost clones determine the breakpoint region Θ_{ζ} in which the fusion point can occur.
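A minimal simulation sketch of this picture follows, assuming clones of one fixed length placed uniformly at random on a toy genome. All names and parameter values are hypothetical. The closed form in `expected_region_length` is derived here under a Poisson approximation for the number of covering clones; it is offered as a plausibility check and is not claimed to be the paper's Equation 5.

```python
import math
import random


def simulate_breakpoint_region(n_clones, clone_len, genome_len, zeta, rng):
    """Place clones uniformly at random; return the length of the
    intersection of all clones containing zeta, or None if none does."""
    min_start = max_start = None
    for _ in range(n_clones):
        start = rng.uniform(0, genome_len - clone_len)
        if start <= zeta <= start + clone_len:
            min_start = start if min_start is None else min(min_start, start)
            max_start = start if max_start is None else max(max_start, start)
    if max_start is None:
        return None
    # Left end of the right-most clone to right end of the left-most clone.
    return (min_start + clone_len) - max_start


def expected_region_length(coverage, clone_len):
    """Expected region length given that zeta is covered, assuming a
    Poisson(coverage) number of covering clones (derived for this sketch)."""
    covered = 1.0 - math.exp(-coverage)
    return 2.0 * clone_len * (covered - coverage * math.exp(-coverage)) / (coverage * covered)


def run_trials(trials=2000, n_clones=300, clone_len=1000,
               genome_len=100_000, seed=0):
    """Return (fraction of trials detecting zeta, mean region length)."""
    rng = random.Random(seed)
    zeta = genome_len / 2
    lengths = [simulate_breakpoint_region(n_clones, clone_len, genome_len, zeta, rng)
               for _ in range(trials)]
    detected = [x for x in lengths if x is not None]
    return len(detected) / trials, sum(detected) / len(detected)


if __name__ == "__main__":
    p_hat, mean_len = run_trials()
    c = 300 * 1000 / 100_000
    print(p_hat, 1 - math.exp(-c))
    print(mean_len, expected_region_length(c, 1000))
```

The toy genome is far smaller than a human genome so that the simulation runs quickly; only the coverage (number of clones times clone length divided by genome length) matters for the comparison.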

These equations allow us to estimate the expected length of Θ_{ζ}, _{ζ} is not defined) as_{ζ}|. The relative error between the average observed length of the breakpoint region and Equation 5 was 0.02.

We also assessed the effect of different clone lengths, _{ζ}|), around a specific fusion point, ζ. _{ζ}|) decreases. Interestingly, note that the 40 kb clones are most advantageous when localization |Θ_{ζ}| = 40 kb is desired. A similar effect was observed for the 150 kb and 2 kb clones. Thus, there is a direct correlation between the clone length and the ability to

A fusion point ζ is localized to length _{ζ} has length

Formulas 2 and 5 provide a framework for examining a variety of choices of sequencing parameters _{ζ}) and extremely high resolution of fusion points (small |Θ_{ζ}|).
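To make the trade-off between number of clones and clone length concrete, here is a rough sketch based on the Lander-Waterman style approximation that a point is covered with probability 1 − e^{−c}, where c is the clonal coverage. The genome length of 3×10^9 bp and the `mapped_fraction` parameter (the share of clones whose ends both map uniquely) are assumptions of this sketch; tabulated values in the paper may reflect additional factors not modeled here.

```python
import math

GENOME_LENGTH = 3e9  # assumed human genome length in bp


def clonal_coverage(n_clones, clone_len, genome_len=GENOME_LENGTH):
    """Clonal coverage c = N * L / G."""
    return n_clones * clone_len / genome_len


def detection_probability(coverage, mapped_fraction=1.0):
    """Poisson approximation to the probability that at least one usable
    clone contains a given fusion point."""
    return 1.0 - math.exp(-coverage * mapped_fraction)


def clones_needed(target_prob, clone_len, genome_len=GENOME_LENGTH,
                  mapped_fraction=1.0):
    """Smallest number of clones whose approximate detection probability
    reaches target_prob (0 < target_prob < 1)."""
    required_coverage = -math.log(1.0 - target_prob)
    return math.ceil(required_coverage * genome_len / (clone_len * mapped_fraction))


if __name__ == "__main__":
    for clone_len in (1_000, 10_000, 40_000, 150_000):
        n = clones_needed(0.9, clone_len)
        print(clone_len, n, round(clonal_coverage(n, clone_len), 2))
```

Under these idealized assumptions, clones of 150 kb reach a 90% detection probability with well under 100,000 clones, while shorter clones need proportionally more.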

Since our simulations revealed that the choice of sequencing parameters affects the ability to localize breakpoint regions to intervals of different lengths (

All genes: The “known genes” track in the UCSC Genome Browser

The variation in gene sizes for different classes of genes (

(A) The number of paired reads necessary to detect fusion genes with fusion probability greater than 0.5 as a function of gene size for different clone lengths. The vertical lines indicate median (20 kb) and mean (40 kb) sizes for all known genes as well as the median (40 kb) and mean (90 kb) sizes for chimerDB genes. (B) The number of paired reads necessary to detect fusion genes with fusion probability greater than 0.5 as a function of clone length for different fusion gene sizes (log scale in both axes). Each point in these plots is the average over 100 different fusion genes and 100 different simulations of clone sets from the genome. Thus, each data point represents the average value of 10^{4} simulations. In each simulation, a pair of genes was chosen such that the area of the resulting gene rectangle (

There is also a relationship between the size of a fusion gene and the probability of detecting the fusion (

There are numerous sources of error in paired-end sequencing strategies for rearrangement identification including experimental artifacts, genome assembly errors or mis-mapping of end sequences. These errors can lead to incorrect predictions of fusion genes, or false positives. A major source of experimental artifacts in current sequencing approaches is chimeric clones that are produced when two non-contiguous regions of DNA are joined together during the cloning procedure. Approximately 1–2% of clones in modern BAC libraries are chimeric

In order to assess the rate of false positive predictions of fusion genes in the presence of errors, we simulated 100 random genome rearrangements with 1% of the paired-end sequences arising from chimeric clones. For several clone lengths, we recorded the number of fusion genes correctly identified (true positives) and the number of incorrect fusion gene predictions (false positives) as the minimum fusion gene probability required for identification was increased (

(A) Number of false positive (FP) and true positive (TP) fusion gene predictions for a simulated genome with 100 translocations and 10,000 paired reads. Each curve represents the average of 50 simulations with clones of a fixed length (2 kb, 40 kb, 150 kb clones). The minimum fusion probability threshold for indicating that a fusion gene was predicted was decreased from >.95 (leftmost point) to >0 (rightmost point) in increments of 0.05 and the number of true and false predictions was determined. For all figures 19 true fusion genes were present in the rearranged genome. These 19 events were not selected for but rather they resulted from random rearrangement of the genome. (B) 100,000 paired reads. (C) 1,000,000 paired reads. (D) 10,000,000 paired reads.
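The threshold sweep described in this caption can be sketched as follows. The gene pairs and probabilities in the usage example are hypothetical placeholders, and `sweep_thresholds` is an illustrative helper, not code from the study.

```python
def sweep_thresholds(predictions, true_fusions, step=0.05):
    """Count true and false positives as the minimum fusion probability
    is lowered from >0.95 to >0 in fixed increments.

    predictions: dict mapping a gene pair to its fusion probability.
    true_fusions: set of gene pairs that are real fusions.
    Returns a list of (threshold, true_positives, false_positives).
    """
    results = []
    n_steps = round(0.95 / step)
    for i in range(n_steps + 1):
        # Round to suppress floating-point drift in the threshold value.
        threshold = round(0.95 - i * step, 10)
        called = {pair for pair, p in predictions.items() if p > threshold}
        tp = len(called & true_fusions)
        fp = len(called - true_fusions)
        results.append((threshold, tp, fp))
    return results


if __name__ == "__main__":
    # Hypothetical predictions, for illustration only.
    predictions = {
        ("GENE_A", "GENE_B"): 0.99,
        ("GENE_C", "GENE_D"): 0.60,
        ("GENE_E", "GENE_F"): 0.30,
        ("GENE_G", "GENE_H"): 0.02,
    }
    true_fusions = {("GENE_A", "GENE_B"), ("GENE_E", "GENE_F")}
    for threshold, tp, fp in sweep_thresholds(predictions, true_fusions):
        print(f"{threshold:.2f}\tTP={tp}\tFP={fp}")
```

Plotting the false-positive count against the true-positive count for each threshold gives curves of the kind shown in the figure.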

Finally, we examined the effect of chimeric clones on our ability to identify breakpoints from invalid

These probabilities were computed using Equation 27, with clone length

We provided a computational framework to evaluate paired-end sequencing strategies for detection of genome rearrangements in cancer. Our probability calculations and simulations show that current paired-end technology can obtain an extremely high probability of breakpoint detection with a very low number of reads. For example, more than 90% of all breakpoints can be detected with paired-end sequencing of fewer than 100,000 clones (

We derived formulae that provide estimates of the probability of detecting rearrangement breakpoints and localizing them precisely. For a genome of length

The natural question for the practitioner is: what sequencing strategy maximizes information about rearrangements in the cancer genome for minimum cost? Three considerations preclude a definitive answer to the question. First, the goal of “maximizing information about rearrangements” in cancer genomes requires further specification. Second, the parameters of a sequencing strategy cannot be set arbitrarily, but are restricted by the chosen technology. Third, the complexity of cancer genomes at the sequence level – including the number and type of rearrangements and the sequence characteristics of rearrangement breakpoints – is currently unknown. We discuss each of these issues below and then conclude by describing further extensions of our methodology.

When studying genome rearrangements by paired-end sequencing approaches, there are two interrelated goals that affect the choice of sequencing strategy. First, one might be interested in detecting as many rearrangement breakpoints as possible with the minimum amount of sequencing. In this case, the goal is to maximize the clonal coverage

Better localization of breakpoints is desirable if one wants to determine with certainty that a gene is fused or disrupted by a genome rearrangement. Our results showed the correlation between clone length and the probability of localizing breakpoints to an interval of a specific length.

Better localization is also desirable when one wants to validate a breakpoint via PCR, perhaps to determine if the breakpoint is recurrent across multiple samples. In this case, the breakpoint must be localized to an interval length that can be amplified via PCR, typically less than a few kilobases, and thus smaller clones are appropriate. On the other hand, in many cases rearrangement breakpoints are known to vary across kilobases in different patients

There are several next-generation sequencing technologies now on the market, and others that soon will be commercially available. Information about the capabilities of many of these machines, particularly with regard to paired-end sequencing, is presently limited. In addition, the field is developing rapidly and any claims stated about read lengths, sequencing error rates, etc. are undergoing continual revision. While our analysis focused on several key parameters including number of paired reads, clone length, and percent of chimeric clones, in reality only some of these parameters are adjustable while others (e.g. error rate) are fixed by the chosen sequencing technology.

One issue not considered in our model that is closely tied to the sequencing technology is the mapping of reads to the reference genome. Different sequencing technologies produce reads of varying length and quality that can have a dramatic effect on the ability to map paired reads. On one extreme, conventional paired-end sequencing of cloned genomic fragments employed by current ESP studies

Our simulations made certain simplifying assumptions about the character of cancer genomes. Most notably, we assumed that the size of the cancer genome (equal to the parameter _{ζ} increases, approximately following 1−e^{−ca}, assuming that the genome size is constant under the amplification. Since highly amplified regions can have complex organization due to duplication mechanisms

An additional consideration is whether cancer rearrangement breakpoints are biased to certain regions of the genome. For example, if rearrangement breakpoints are in highly repetitive regions, it might be difficult to map sequences that are too close to the breakpoints, and thus larger clones are appropriate. On the other hand, if there are multiple rearrangements clustered in a small genomic interval as observed in the multiple breakpoints found in some sequenced BACs and also in other recent sequencing studies

Our formula for the probability of a fusion gene is readily extended to fusions of other genomic features. For example, we can compute the probability of regulatory fusions that result from the fusion of the promoter of one gene to the coding region of another gene. Other genomic assays such as array comparative genomic hybridization (CGH) can be used in combination with paired-end sequencing. Array CGH identifies breakpoints involved in deletions and amplifications at average resolutions of less than 10 kb

We assume that each clone _{C}_{C}_{C}_{min}_{max}_{C}

If a pair (_{C}_{C}_{max}_{min}_{C}_{C}_{C}_{C}_{C}_{C}_{C}_{C}

Clones containing predicted fusion genes were draft sequenced (1× coverage) by subcloning into 3 kb plasmids as described in

Define _{(a,b)} as the event that a clone _{C}_{C}_{C}_{C}_{C}_{C}_{C}_{(a,b)} implies the event _{C}_{C}_{C}_{C}_{C}

Now consider a pair of genes spanning genomic intervals _{min}_{max}

In this case, Equation 11 gives the fraction of the trapezoid (Equation 1) that intersects _{C}_{C}

Next, we extend the equations to include the case when a set {^{(1)},^{(2)},…} of multiple clones overlap the breakpoint (^{(j)} overlap the breakpoint (_{(a,b)} implies

The naive approach for computing Pr(∪_{(a,b)∈U×V}_{(a,b)}|^{(j)}, which is time consuming. We exploit several features of this equation to make the computation more efficient. First, it is not necessary to compute _{c}_{C}_{C}_{C}

For an integer _{s}_{s}D_{s}_{s}_{s}_{s}_{s}_{s}

We now compute the probability of detecting a fusion point and the expected number of fusion points that are detected as a function of the number and length of clones that are end-sequenced. Recall that a _{ζ} of detection is given by _{1},…,_{M}_{i}_{i}^{−c})

If one or more clones contain a fusion point ζ, the _{ζ} as the intersection of all clones that cover ζ (_{ζ} as follows. Following Lander-Waterman _{ζ} is determined by the left endpoint of the right-most clone that contains ζ and the right endpoint of the left-most clone that contains ζ. Define for 0≤_{j}_{j}_{j}_{ζ} as_{s}_{−j} and _{j}_{ζ}|, we have two cases. For _{ζ}| = _{ζ} conditioned on ζ being covered by a clone; otherwise Θ_{ζ} is undefined. Since the event |Θ_{ζ}|≤

Because of the presence of chimeric clones, it is useful to consider a fusion point ζ to be detected if it is covered by a

It is also useful to compute the probability that two or more chimeric clones form a cluster. Let _{min},_{max}], then

Supporting Methods.

(0.05 MB PDF)

Distribution of MCF7 clone lengths. The mean for this distribution is 122 kb, and the standard deviation is 24 kb. Fusion Probabilities in

(0.03 MB PDF)

Length of a breakpoint region (BPR) for varying amounts of clonal coverage. The blue curve shows the expected length (Equation 5), while the red curve is the average observed length over 50 simulations.

(0.03 MB PDF)

Clone length vs. _{ζ} vs. |Θ_{ζ}| for varying _{ζ} (detection probability), compared to smaller clone lengths, which have the advantage of better localization (smaller |Θ_{ζ}|). Different lines originating from 0 refer to different number of reads. As the number of reads grows, the trade-off converges to high detection, and better localization. (A) shows values in a mesh graph, while (B) shows raw values.

(0.45 MB PDF)

The effect of clone length and number of paired reads on P_{ζ} and |Θ_{ζ}|. (A) P_{ζ} increases as the number of paired reads or the clone length increases. (B) |Θ_{ζ}| decreases as the number of paired reads increases or the clone length decreases. Note that all axes are log values (with the exception of P_{ζ} in [A]).

(0.42 MB PDF)

_{ζ} and |Θ_{ζ}| for different _{ζ}, for different clone lengths and varying number of mapped paired reads. (B) The expected length of a breakpoint region, |Θ_{ζ}|, around a fusion point (assuming that the fusion point is contained in a clone).

(0.18 MB PDF)

The number of paired reads (and resulting E(|Θ_{ζ}|)) needed to obtain a P_{ζ} of 0.99 for clone lengths varying from 1 to 150 kb. The x-axis indicates clone length, _{ζ}|. The vertical line indicates the intersection point between the two lines at ∼16,000 bp.

(0.33 MB PDF)

Average fusion probability vs. number of mapped reads. The average fusion probability with mean and standard deviations as a function of

(0.06 MB PDF)

Effect of chimeric clones. The probability of observing at least one chimeric cluster for a fixed number of paired reads as a function of the percent of chimeric clones indicates that the observed rate of chimerism is lower for smaller clones. (A) 1 kb clones, (B) 10 kb clones, (C) 40 kb clones, and (D) 150 kb clones.

(0.05 MB PDF)

We would like to thank the members of the Bafna and Pevzner labs at UCSD for helpful suggestions and discussions.