deFuse: An Algorithm for Gene Fusion Discovery in Tumor RNA-Seq Data

Gene fusions created by somatic genomic rearrangements are known to play an important role in the onset and development of some cancers, such as lymphomas and sarcomas. RNA-Seq (whole transcriptome shotgun sequencing) is proving to be a useful tool for the discovery of novel gene fusions in cancer transcriptomes. However, algorithmic methods for the discovery of gene fusions using RNA-Seq data remain underdeveloped. We have developed deFuse, a novel computational method for fusion discovery in tumor RNA-Seq data. Unlike existing methods that use only unique best-hit alignments and consider only fusion boundaries at the ends of known exons, deFuse considers all alignments and all possible locations for fusion boundaries. As a result, deFuse is able to identify fusion sequences with demonstrably better sensitivity than previous approaches. To increase the specificity of our approach, we curated a list of 60 true positive and 61 true negative fusion sequences (as confirmed by RT-PCR), and have trained an adaboost classifier on 11 novel features of the sequence data. The resulting classifier has an estimated value of 0.91 for the area under the ROC curve. We have used deFuse to discover gene fusions in 40 ovarian tumor samples, one ovarian cancer cell line, and three sarcoma samples. We report herein the first gene fusions discovered in ovarian cancer. We conclude that gene fusions are not infrequent events in ovarian cancer and that these events have the potential to substantially alter the expression patterns of the genes involved; gene fusions should therefore be considered in efforts to comprehensively characterize the mutational profiles of ovarian cancer transcriptomes.

High Grade Serous (HGS) Highly proliferative ovarian carcinoma subtype characterized by genomic instability due to TP53 loss and in some cases BRCA1/2 mutations. This cancer may originate in the fallopian tube.
Clear cell carcinoma (CCC) Ovarian carcinoma subtype characterized by large epithelial cells with abundant clear cytoplasm.
Endometrioid tumor (EMD) Ovarian carcinoma subtype composed of tubular glands bearing a close resemblance to benign or malignant endometrium.

Mucinous tumor (MUC)
Ovarian carcinoma with similarities to mucinous colonic carcinomas.
Yolk sac tumor (YKS) Ovarian germ cell tumor that represents a proliferation of both yolk sac endoderm and extraembryonic mesenchyme.
Granulosa cell tumor (GRC) Ovarian tumors that arise from granulosa cells characterized by a single nucleotide variation in FOXL2.
Small cell hypercalemic (SCH) Ovarian cancer subtype characterized by diffuse sheets of cells punctured by variable numbers of follicle-like spaces. Often presents with hypercalcemia.
2 Summary of Supporting Information   Text S1 This document 2. 1 Table S1 -Table of all gene fusion predictions   The table of 2 Table S2 -Table of predicted interrupted genes   The table of

Table S3 -Table of predicted CNVs
The table of predicted copy number variations (CNVs) is provided in tab delimited format with the following unnamed columns.

Table S8 -Table of positive and negative controls
The table of all positive and negative control deFuse predictions is provided in tab delimited format with named columns as for Table S1, with the following exceptions. The probability and classification columns are not included. A column named leave one out probability is included corresponding to the probability estimate calculated for the given fusion when the adaboost model is trained on all but the given fusion and then used to classify that fusion.

Table S9 -UMOD aligned read counts
A table of read counts for each transcript for each library is provided in tab delimited format.
The first column is the library name, the following 4 columns are the counts of number of reads aligning in paired end mode to the 4 transcript variants of UMOD. Table S10 -Gene names and their ensembl ids

Dataset S1 -RT-PCR sequence traces
The sequence traces for all fusions successfully validated by RT-PCR are provided in the form of ab1 files.

Dataset S2 -FISH images
FISH images are provided for all attempted FISH experiments.

Dataset S3 -MapSplice Output
Two files are provided for each of the 6 libraries for which MapSplice was used to predict fusions. The fusion.junction file corresponds to the similarly named file produced by Map-Splice. The junctions.txt file contains a list of all predicted splice sites calculated by parsing the CIGAR strings of each alignment in the alignments.sam file produced by MapSplice.

Dataset S4 -FusionSeq Output
The confidence.gfr file is provided for each of the 3 libraries for which FusionSeq was used to predict fusions.

Towards a classifier for gene fusions predictions
We sought to develop a classifier for gene fusion predictions so that we would not have to rely on arbitrary thresholds. We selected the following 11 features, described in detail in section 4.7. We chose to not select features that could be related to expression, such as the number of split or spanning reads, since we did not wish to bias the classifier towards highly expressed fusions. Let r be the read length, f ivep(a X ) be the aligned position in transcript X of the 5' end of the read and let strand(a X ) be the strand of that alignment a X . Then the fusion boundary region is given by equation 1.
Let a X , a Y , b X and b Y be the alignments to transcript X and Y of paired end reads a and b. We define the overlapping boundary region condition as the condition that the fusion boundary regions in each transcript must overlap in order to consider paired end reads a and b to have originated from the same fusion transcript. The overlapping boundary region condition ensures that there exists a valid location for the fusion boundary in transcript X and transcript Y that would simultaneously explain both paired end alignments. Included in the overlapping boundary region condition is the condition that strand(a X ) = strand(a Y ) and strand(b X ) = strand(b Y ). The overlapping boundary region condition is defined specifically as given in equation 2.
Suppose now that transcripts X and Y are concatenated together as fusion transcript XY with a +− alignment configuration (alignments are to the + strand of X and the − strand of Y ). The location of the fusion boundary in each transcript is unknown, as is the variable d that corresponds to the distance between the two fusion boundaries in the concatenated sequence. The fragment lengths l a and l b of fragments a and b are unknown also. However, it is possible to calculate the difference between the fragment lengths as |l a − l b | = |z a − z b | as shown in figure 2. We define the similar fragment length condition as the constraint that |l a − l b | must be no more than l max − l min for us to consider paired end reads a and b to have originated from the same fusion transcript.
Unknown Fusion Boundaries Fragment b Trivially, if XY produces a −+ alignment configuration then Y X will produce a +− configuration and should be considered instead. However, it may also be interesting to consider the situation in which XY results in a −− or ++ configuration because although the prediction may not represent a chimeric transcript with preserved open reading frame, it may represent an expressed structural variation or gene interruption. For this situation, a +− configuration can be obtained by considering the reverse complement of either X or Y and recalculating the alignment positions to that reverse complemented sequence.

|za − z b| = | l a + d − (l b + d)| = | l a − l b| ≤ l max − l min
In practise, however, it is not necessary to remap the position of each alignment to the concatenated sequence described above, since any offset added to the positions of alignments to X or Y will be incorporated into the value d and will cancel out when calculating |z a − z b |.
For the same reason, if it is necessary to reverse complement either X or Y , all that is required is to consider the negation of the positions of alignments to whichever of X or Y it was necessary to reverse complement, since any additional offset will be incorporated into the value d, and will cancel out. The value z a (and z b ) can be calculated, with consideration for the strand of the alignments, using equation 3. Note that this formulation of the similar fragment length condition is equivalent to that given in the main text, and allows for easier calculation of maximal valid clusters using the method in 4.2.

Generating Maximal Valid Clusters
We provide a polynomial time algorithm for calculating a set of clusters of paired end alignments, such that any two paired end alignments satisfy the overlapping boundary region and similar fragment length conditions, and such that those clusters are maximal.
Let G be the set of transcripts under consideration. Let S = {+, −} be the set of strands.
Let A X,Y,S,T be the set of alignments such that one end finds at least one alignment to strand S of transcript X and the other end finds at least one alignment to strand T of transcript Y . Consider all distinct sets A X,Y,S,T = ∅. Let A X be the alignments to transcript X and A Y be the alignments to transcript Y . Maximal paired end alignment clusters P X,Y,S,T satisfying both conditions can be computed in polynomial time as follows.
1. Create the fusion boundary region clusters C X for transcript X. The fusion boundary region clusters can be created using a polynomial time algorithm as described in [2], reiterated here. Fusion boundary regions br(A X ) are sorted by their start coordinate. Clustering proceeds by adding regions in left to right order to cluster C k X until a region is encountered that does not overlap with all other regions in C k X . Cluster C k X is kept unless it is a proper subset of C k−1 X . Cluster C k+1 X is initialized to C k X \ a where a is the region in C k X with the leftmost end coordinate and the process repeats. Repeat for transcript Y creating C Y .

Create clusters of paired end alignments
For any D C X ,C Y it should be true that any two paired end alignments in D C X ,C Y satisfy the overlapping boundary region condition.
3. Refine clusters of paired end alignments D C X ,C Y into clusters of paired end alignments {D i C X ,C Y } that also satisfy the similar fragment length condition. For each paired end alignment a in D C X ,C Y calculate the value z a . Sort the alignments by z and use a sliding window of size l max − l min to calculate clusters {D i C X ,C Y }. Specifically, proceed by adding alignments to cluster D k C X ,C Y in order of increasing z while maintaining the property that the difference between the lowest and highest z values in D k where a is the paired end alignment with the lowest z value. 4. Remove any cluster that is the subset of another cluster. Let P X,Y,S,T = {D i C X ,C Y } be the resulting set of clusters. It can be easily verified that P X,Y,S,T is the set of maximal paired end alignment clusters satisfying both conditions.

Split read boundary sequence prediction
Let C X,Y,S,T be a paired end alignment cluster that is evidence between strand S of transcript X and strand T of transcript Y . Let A X and A Y be the end alignments of each paired end to transcripts X and Y respectively. Let br(A X ) = ∩ a X ∈A X br(a X ) and br( For each alignment with one end aligning to transcript X we calculate the mate alignment region denoted mate(a X ) as in equation 4.
For each alignment with one end aligning to transcript X, if br(A X )∩mate(a X ) = ∅ then add the sequence of the end that does not align to transcript X to M X . Repeat the process for transcript Y to create M Y . Create the sequence S X by extracting the sequence of transcript X in the range br(A X ) expanded by r on each side.. Repeat for transcript Y to create S Y . Reverse complement S Y if S = T . Reverse complement the sequences in M X . Reverse complement the sequences in M Y if S = T . For each candidate split read r ∈ M X ∪ M Y = M align r to S X using dynamic programming based local alignment and penalizing initial gaps in the end sequence. Repeat with the reverse of sequence r and the reverse of sequence S Y (see supplementary section 4.4). Proceed as described in the main text of the paper.

Dynamic programming matrix definition
We use dynamic programming based local alignment penalizing initial gaps in the read sequence as part of the method for finding read sequences split by the fusion boundary. Let δ(p, q) = m if p = q otherwise δ(p, q) = u, thus m is the match score. Let g be the score given for a gap in either the read sequence of the transcript sequence. Let r be the read sequence and S the reference sequence on one side of the fusion boundary. The dynamic programming matrix may be defined as follows [8].

Covariance between the lengths of fragments spanning a fusion boundary
We do not assume that the set of fragment lengths {l i } of paired end reads spanning the same fusion boundary are drawn independently from the fragmet length distribution P (L). Thus the variance ofl includes a covariance term Cov(L 1 , L 2 ) as given by equation 6. The covariance Cov(L 1 , L 2 ) represents the degree to which two fragments overlapping the same position are likely to have the same length.
V ar(L) = nV ar(L) We estimate the covariance between the lengths of two fragments originating from the same location in the transcriptome using concordant alignments to cDNA. Concordant alignments to cDNA often contain paired end alignments that are consistently aligned to the wrong splice variant causing some alignments to imply the wrong fragment length. In an attempt to mitigate this affect we only consider paired end alignments for which the implied fragment length is in the range [µ − 3σ µ + 3σ] where µ and σ are the mean and standard deviation of inferred fragment length distribution. We begin by selecting n positions in the transcriptome at random. For each position we select at random, if they exist, two paired end alignments with one end aligning entirely to the left and one end aligning entirely to the right of that position. Let the fragment lengths implied by the two paired end alignments selected for position i be given by l i1 and l i2 . Equation 7 is used to estimate the covariance between the two random variables L 1 and L 2 representing the fragment lengths of two reads spanning the same fusion boundary.Ĉ

Covariance between split read statistics for reads split by a fusion boundary
We do not assume that the values p i calculated for reads split by a fusion boundary are drawn independently from a uniform distribution. To model dependency we estimate the covariance Cov(p i , p j ). We begin by selecting n positions in the transcriptome at random. For each position we select at random, if they exist, two paired end alignments with one end overlapping that position by at least n anchor nucleotides. We calculate p 1 and p 2 for both of these split alignments as given by equation 10. Equation 8 is then used to estimate the covariance between two random variables P 1 and P 2 representing p i values of two reads split by the same fusion boundary. An equivilent analysis is used to estimateĈov(Q 1 , Q 2 ) for q i values as calculated by equation 10.

Features
For each fusion prediction we calculate a number of features to assist in the discrimination between real fusions and false positives.
Spanning read count Number of reads spanning the fusion boundary.
Spanning read coverage Normalized spanning read coverage (section 4.7.1).

Split read count Number of reads split by the fusion boundary.
Split position p-value P-Value for the hypothesis that the split position statistic was calculated from split reads that are evenly distributed across the fusion boundary (section 4.7.2).
Minimum split anchor p-value P-Value for the hypothesis that the minimum split anchor statistic was calculated from split reads that are evenly distributed across the fusion boundary (section 4.7.2).
Corroboration p-value P-Value for the hypothesis that the lengths of reads spanning the fusion boundary were drawn from the fragment length distribution (section Corroborating spanning and split read evidence in the main text).
Concordant ratio Proportion of spanning reads supporting a fusion that have a concordant alignment using blat with default parameters. cDNA adjusted percent identity Maximum adjusted percent identity (section 4.7.5) for the alignments of the predicted sequence to any cDNA.
Genome adjusted percent identity Maximum adjusted percent identity (section 4.7.5) for the alignments of the predicted sequence to the genome.
EST adjusted percent identity Maximum adjusted percent identity (section 4.7.5) for the alignments of the predicted sequence to any EST.
EST island adjusted percent identity Maximum adjusted percent identity (section 4.7.5) for the alignments of the predicted sequence to any EST island (section 4.7.6).

Normalized spanning read coverage
For each fusion partner gene X we calculate c X , the number of nucleotides matched in X by at least one of the prediction's spanning reads alignments. We then normalize c X by the expected coverage l avg − r min where l avg is the mean fragment length and r min is the minimum read length. The normalized spanning read coverage for a prediction is the minimum of the normalized coverage calculated for each gene predicted as fused (equation 9). PCR duplicates of poor quality reads, or systematic alignment errors for small homologous regions are expected to result in smaller values for the normalized spanning read coverage than predictions representing real fusions.
Normalized spanning read coverage = min(c X , c Y ) l avg − r min (9)

Split position p-value and minimum split anchor p-value
Split read alignments are prone to systematic alignment errors that produce false positive fusion boundary predictions. We expect a true positive to produce a certain number of reads split approximately in half by the fusion boundary, whereas many false positives are identified by the lack of any reads that are split approximately in half. We calculate two statistics in order to identify false positive split alignments.
A dependence between p i values for reads split by the same fusion boundary means that the sample variance of a set of n p i values includes a covariance term. The covariance term and sample variance of n p i values are calculated as described in 4.6. A dependence between q i is resolved similarly. The samples means of the n p i and n q i values are assumed normally distributed.
A two sided z-test with alternative hypothesis E[p] = 0.5 is used to calculate the split position p-value. A one sided z-test with alternative hypothesis that E[q] < 0.5 is used to calculate the minimum split anchor p-value. Significant values for these p-values represents evidence to reject the null hypothesis that the split reads are uniformly distributed across the fusion boundary.

Fusion boundary di-nucleotide entropy
A common source of false positive fusion boundary predictions using split alignments results from the alignment of low complexity reads such as poly-A reads to low complexity regions in genes. In order to identify spurious fusion boundary predictions caused by low complexity reads, we calculate the di-nucldeotide entropy of the predicted fusion boundary sequence. Let D = {n i n j : n i , n j ∈ {A, C, T, G}} be the set of all possible di-nucldeotides. Let S be a sequence of length m and let count(d, S) be the number of occurrences of di-nucleotide d in sequence S. The di-nucleotide entropy of the sequence S can be calculated as given by equation 12.
Let S u be the 40 nucleotides of the predicted sequence upstream of the fusion boundary, and let S d be the 40 nucleotides of the predicted sequence downstream of the fusion boundary.
For the purposes of this study we use m = 40. We calculate the fusion boundary di-nucleotide entropy as min(H(S u ), H(S d )). The fusion boundary di-nucleotide entropy is expected to be lower for fusion boundary predictions involving low complexity sequence on either side of the fusion boundary

Fusion boundary homology
Reverse transcriptase (RT) during cDNA preparation has been identified previously as a mechanism for producing chimeric cDNA fragments [3]. An identifying feature of chimeric cDNA produced by template switching is the existence of short homologous sequence at the 'splice site' implied by the cDNA sequence [3]. Thus, to identify predictions resulting from chimeric reads produced by template switching during RT, we calculate the length of homologous sequence at the fusion boundary.
Let S be the predicted sequence for a fusion prediction between gene X and gene Y , and let l be length of S. Let m X and m Y be the number of matches minus mismatches for the best alignments of S to all splice variants of X and Y respectively. We calculate an estimate of the fusion boundary homology as given by equation 13.

Fusion boundary homology
Note that if a prediction is caused by misalignments of non-chimeric reads from a single gene, the predicted sequence may align with high sequence similarity to only that gene. This situation will also produce a higher than normal value for the fusion boundary homology, also indicating a likely false positive. All alignments of S to splice variants of X and Y were obtained using blat [4].

Adjusted percent identity
We sought to identify concordant alignments of the predicted fusion sequence to cDNA, EST and chromosome sequences. However, some predicted fusion sequences are asymmetrical: they involve only a small amount of sequence from one of the genes predicted as fused. As a result, reporting a simple percent identify for the alignment of the predicted sequence to a cDNA, EST, or chromosome would be biased against asymmetrical fusion prediction sequences. We use the adjusted percent identity, described below, as an alternative to the percent identity that does not suffer from a bias against asymmetrical fusion prediction sequences.
Let S be the predicted sequence for a fusion prediction between gene X and gene Y , let ζ be fusion boundary in S, and let l be the length of S. Also let S X and S Y be the sequences on the X and Y sides of ζ respectively, with lengths l X and l Y respectively. Given an alignment of S to a cDNA, EST or chomosome sequence, let m be the matches minus mismatches for the alignment. We first assume that the longer of S X and S Y is matched exactly in the alignment, and any remaining matches exist in the shorter of S X and S Y . We then calculate the adjusted percent identity as the percent identity of the alignment within the shorter of S X and S Y under these assumptions (equation 14). All alignments of S to cDNA, EST and chromosome sequences were obtained using blat [4].

EST islands
We sought to identify predictions that could be explained by alternative splicing as opposed to underlying genomic structural variation. We use UCSC's spliced EST alignments [6] as evidence of co-transcription of genomic regions. An EST island is then defined as the set of minimal genomic regions such that any splice EST alignment that overlaps with an EST island is contained within that EST island. EST islands represent islands of co-transcription in the genome as evident by EST alignments. The EST island adjusted percent identity for a fusion prediction is the adjusted percent identity of a spliced alignments of the predicted sequence that falls entirely within an EST island.

Filtering
A principled machine learning approach to discriminating between true and false positives is difficult without a significant number of positive and negative controls. Thus in order to roughly discriminate between real fusions and false positives, we initially used a set of thresholds on a subset of the features calculated in section 4.7. These thresholds are given below.

Probabilistic motivation for clustering conditions
The two conditions for clustering paired end alignments can be motivated probabilistically by considering the likelihood of two paired end alignments given that those paired end reads represent the same fusion transcript. Consider the alignments of two discordant paired end reads, a and b. Suppose a has an alignment of end a X to transcript X and end a Y to transcript Y. Similarly, suppose b has an alignment of end b X to transcript X and an alignment of end b Y to transcript Y. Figure 3 shows a possible configuration of the alignments.
The distances d X and d Y are the differences between the positions of alignments on transcript X and transcript Y respectively. Also, v is the latent variable representing the unknown Figure 3: Paired end configuration length of the unsequenced region of paired end a. Given v, we can calculate the fragment lengths x a and x b of paired end reads a and b as, where r is the read length.
Thus given v and supposing that paired end reads a and b result from the same fusion isoform F , we can calculate the probability P (d X , d Y |v, F ) as where µ and σ are the inferred fragment length mean and standard deviation. We can now use the fact that Z is a normalization constant calculated as follows. Figure 4 shows the probability distribution P (d X , d Y |F ) for r = 50, µ = 200 and σ = 30.
The overlapping boundary region condition and the similar fragment length condition have equivalent formulations as constraints on d X and d Y . The overlapping boundary region condition is equivalent to the constraints given by equations 15 and 16. Any values for d X or d Y outside these constraints will result in non overlapping fusion boundary regions for transcript X or transcript Y . Values for d X and d Y that satisfy constraints given by both equations 15 and 16 will have overlapping boundary regions and will satisfy the overlapping boundary region condition. The similar fragment length condition is equivalent to the constraint −(l max − l min ) ≤ d X + d Y ≤ l max − l min , which is simply a reformulation of the equation in figure 2c.
We compared the region of the d x × d y configuration space that satisfies the overlapping boundary region condition and similar fragment length condition for α = 0.05 with a region contained within an equivalent contour of P (d X , d Y |F ). We used r = 50, µ = 200 and σ = 30 as was used for figure 4. We calculated l max − l min for α = 0.05 and then calculated q = |l i −l j |<lmax−l min P (l i )P (l j ) = 0.99422. The value q represents the combined probablity of seeing two fragments that satisfy constraints given by equations 15, 16, and 17 for α = 0.05. We then calculated the contour of P (d X , d Y |F ) that contains probability mass equal to q. Figure 5 shows the region of the configuration space satisfying the two conditions together with the equivalent contour of P (d X , d Y |F ).

FusionSeq predictions
FusionSeq version 0.6.1 was used to predict gene fusions in CCC15, CCC16 and EMD6. These 3 cases were chosen because they contained the greatest number of validated predictions, with 6, 3 and 4 validations respectively. We followed the instructions provided on the FusionSeq and RSeqTools websites, reiterated here. We first downloaded the hg18 bundled dataset. We then created a junction library from the ucsc provided 2bit genome and the gene models provided in the bundled dataset using the following command: createSpliceJunctionLibrary hg18.2bit knownGeneAnnotationTranscriptCompositeModel.txt 45 Next we used bowtie-build to generate bowtie indices for the human genome and junction library combined. Bowtie was used with default parameters to independently generate alignments for each end of the paired end reads. The two bowtie outputs were converted into MRF format using the bowtiePairedFix executable provided by the author and bowtie2mrf from RSeqTools. To compare the overlap between FusionSeq predictions and deFuse predictions, we aligned the FusionSeq read evidence to fusion sequences predicted by deFuse using bowtie. Comparing the results in this way avoided problems that would result from trying to compare gene identifiers from different sets of gene annotations.
We also sought to validate fusions predicted by FusionSeq that were not predicted by deFuse. In order to maximize our chances of successful validation, we applied a set of filters to the FusionSeq output before selecting fusions to validate. We first sought to classify as concordant reads that were evidence for the FusionSeq predictions. We aligned the read evidence to the genome and ESTs from UCSC, and searched for alignments of each within 1000nt of each other on the same chromosome/EST. We removed any FusionSeq prediction for which at least one read could be classified as concordant using this method. We also removed FusionSeq predictions for which at least one end of one read aligned to a ribosomal RNA (ensembl 54 gene models). Since we were were not interested in differences between the results that arose because of the use of different sets of gene annotations, we removed Fu-sionSeq predictions for which none of the reads aligned using blat to gene regions considered by deFuse. Several of the predictions were removed because they involved reads that did not align to a contiguous region of the genome or to contiguous exons, making it difficult to pinpoint a breakpoint and design primers. Finally, we removed fusions also predicted by deFuse, and selected the 3 predictions from each library with the highest RESPER score. This produced 3 candidates for CCC16 and EMD6, and 2 candidates for CCC16 which only contained 2 fusions after filtering.

MapSplice predictions
MapSplice version 1.14.1 was first used predict fusions in CCC15, CCC16 and EMD6. To reiterate, these 3 cases were chosen because they contained the greatest number of validated predictions, with 6, 3 and 4 validations respectively. We followed the set of instructions on the MapSplice website, downloading the ucsc genome and building a bowtie index. The default paired end configuration file was used, with the following differences. read length = 50 segment length = 16 junction type = non-canonical run MapPER = yes full running = no do fusion = yes We then searched the MapSplice results for the validated deFuse predictions. We selected all of the sequences in the synthetic sequence column of fusion.junction file and used blat with default parameters to find an alignment of those junction sequences to the sequences predicted by deFuse. We also extracted all splice junction predictions from the CIGAR string of each alignment in the alignments.sam file, and compared those splice junction predictions with the validated deFuse predictions.
We suspected MapSplice might perform better on the 75mer libraries. Thus, we ran Map-Splice on the 75mer reads from SCH1, EMD5 and GRC5. The default paired end configuration file was used, with the following differences. We sought to validate fusions predicted by MapSplice that were not predicted by deFuse. In order to maximize our chances of successful validation, we applied a set of conservative filters to the MapSplice output before selecting fusions to validate. From the fusion.junction file we selected fusions with at least 2 supporting reads that were predicted to occur within the boundaries fo the ensembl genes we were considering in this study. We then removed predictions for which the synthetic sequence aligned with greater than 90% identity by blat to the genome, or greater than 50% identity to ribosomal RNA. After applying these filters we were left with 14 predictions from CCC15, CCC16, EMD6, SCH1, EMD5 and GRC5.

Running deFuse on melanoma RNA-Seq datasets
RNA-Seq datasets for 13 melanoma samples and cell lines were downloaded from the short read archive. These datasets are half the size of our average sarcoma or ovarian cancer datasets, and 4 of the fusions represented in these datasets have 5 or less supporting reads. Thus we adjusted the following parameters of deFuse so that deFuse would be able to predict fusions in these datasets. The clustering precision parameter is equal to 1 − α.

Calculating expression from RNA-Seq alignments
Reads were aligned to the genome (NCBI36/hg 18) using MAQ (0.7.1) [5] and allowing for up to 5 mismatches. Raw expression values (read counts) were obtained by summing the number of reads that mapped to human genes based on the Ensembl database (Release 51). Gene expression values were normalized using a quantile normalization procedure using aroma.light (1.16.0.) package in R (2.11.1).
4.14 Inferring copy number from Affymetrix SNP 6.0 The Affymetrix SNP6.0 arrays were normalized using CRMAv2 [1] using the default settings for performing allelic-crosstalk calibration, probe sequence effects normalization, probe-level summarization, and PCR fragment length normalization. Log ratios are then computed by normalizing against a pooled reference generated using a normal dataset of 270 HapMap samples obtained from Affymetrix. Segmentation is performed using a modified version of a hidden Markov model for detecting aCGH copy number, CNA-HMM [7]. The model has been extended to analyze high-density genotyping array platforms (available for download at http://compbio.bccrc.ca/) . The HMM model performs segmentation of the log ratio intensity data and predicts discrete copy number status for each resulting segment from the set of 6 possible states (homozygous deletion, hemizygous deletion, neutral, gain, amplification, and high-level amplifcation). The boundaries of the segments provide candidate breakpoints in the genome as a result of copy number alteration events. CNV predictions are provided in supplementary table S3. The given data is formated to be visualized using IGV (http://www.broadinstitute.org/igv).

Paired-End RNA Sequencing
For each patient sample polyadenylated RNA was purified from 10ug of DNAse1 (Invitrogen, Carlsbad, CA) treated total RNA using the MACSTM mRNA Isolation Kit (Miltenyi Biotec, Germany). Double-stranded cDNA was synthesized from the purified polyA+RNA using Superscript TM . Double-Stranded cDNA Synthesis kit (Invitrogen, Carlsbad, CA) and random hexamer primers (Invitrogen) at a concentration of 5M. The resulting cDNA was sheared using a Sonic Dismembrator 550 (Fisher Scientific, Canada) and size separated by PAGE (8%). The 190-210bp DNA fraction was excised, eluted overnight at 4C in 300 uL of elution buffer (5:1, LoTE buffer (3 mM Tris-HCl, pH 7.5, 0.2 mM EDTA)-7.5 M ammonium acetate) and purified using a QIAquick purification kit (Qiagen, Mississauga, ON). The sequencing library was prepared following the Illumina Genome Analyzer paired end library protocol (Illumina Inc., Hayward, CA) with 10 cycles of PCR amplification. PCR products were purified on Qiaquick MinElute columns (Qiagen, Mississauga, ON) and assessed and quantified using an Agilent DNA 1000 series II assay and Qubit fluorometer (Invitrogen, Carlsbad, CA) respectively. The resulting libraries were sequenced on an Illumina Genome Analyzer II following the manufacturer's instructions. Sequencing read lengths varied between 36 and 75 nucleotides. Image analysis and basecalling was performed by the GA pipeline v1.0 (Illumina Inc., Hayward, CA) using phasing and matrix values calculated from a control phiX174 library run on each flowcell. Raw Quality scores were calibrated by alignment to the reference human genome (NCBI build 36.1, hg18) using ELAND (Illumina Inc., Hayward, CA).

Direct sequencing
RNA was extracted from frozen tumors using Qiazol (Qiagen, Valencia, CA) and reverse transcribed using SuperScriptIII (Invitrogen, Carlsbad, CA). The cDNA was amplified using the primers as given in supplementary table S5 using PCR SuperMix High Fidelity (Invitrogen, Carlsbad, CA). The cycling parameters were: an initial denaturation of 94C for 1 min. followed by 35 cycles of 94C 30sec denaturation, 58C 30sec annealing and 72C 30sec extension, followed by a final extension of 72C for 5min. PCR was performed on a MJ Research Tetrad (Ramsey, MN). PCR products were purified using a MinElute PCR purification kit (Qiagen, Valencia, CA) and bi-directionally sequenced using an ABI BigDye terminator v3.1 cycle sequencing kit (Applied Biosystems, Foster City, CA) and an ABI Prism 3130xl Genetic Analyzer (Applied Biosystems, Foster City, CA). Sequence traces in ab1 format are provided in supplementary data S9.

Fluorescent in situ hybridization
Metaphases and metaphase slides were produced by using standard methods. Locus-specific FISH analysis was performed by using BACs from Human BAC library RPC1-11 (BACPAC Resources Centre, Childrens Hospital, Oakland Research Institute). Supplementary table S4 shows the locations of the BAC probes used for gene fusion validation. BACs were directly labeled with either Spectrum green or Spectrum orange (Vysis, Downer's Grove, IL). The chromosomal locations of all BACs were validated by using normal metaphases (results not shown). Probe labeling and FISH was performed by using Vysis reagents according to the manufacturer's protocols. Slides were counterstained with DAPI for microscopy. For all slides, FISH signals and patterns were identified on a Zeiss Axioplan epifluorescent microscope. Signals were interpreted manually, and images were captured by using the ISIS FISH imaging software (MetaSystems Group, Belmont, MA). A cutoff of 2 breaks per 100 nuclei was selected for a positive score based on examining 230 other soft-tissue tumors. Efficiency of the break-apart FISH probes on TMAs was demonstrated with the t(X;18) in synovial sarcomas [9]. FISH images are provided in supplementary data S10.