RNA-Seq Mapping and Detection of Gene Fusions with a Suffix Array Algorithm

doi:10.1371/journal.pcbi.1002464

Figure 1.

RNA-Seq mapping and splice junction detection methodology.

A. Four reads that span the junction (spliced single reads) and three reads that bridge it (paired-end reads) are shown. The top chart shows a bird's-eye view of the genomic alignments detected for seven pairs of reads between the two exons. Areas of a read highlighted in red correspond to colors that do not align to the genomic reference, and dots in the reference are unknown colors/bases. B. The mapping pipeline is described in the Methods section. Candidate junctions correspond to a sparse graph of junction evidence. After the candidates are found, splice junction and fusion predictions are made with optional quality thresholds. C. As a first step in SASR, 10 to 35 bp from each end of each exon are stored in two lexicographically sorted dictionaries. The start of each stored suffix is shown as a vertical stop and its end as an empty triangle. D. Ten base pairs from the left and right ends of the read (decamers) are searched in the 3′ and 5′ end dictionaries, respectively, with a binary string search. Decamers must match without mismatches. Matching decamers are extended as far as possible (with up to two mismatches) to determine whether they cover the entire suffix. Mismatches are illustrated as vertical lines. Up to ten bases are clipped from the ends of the read until a match is found. E. Decamer block size frequency in the hg18 RefSeq database.
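The lookup in panels C and D can be illustrated with a minimal sketch. It is not the authors' SASR implementation: it works in base space rather than color space, covers only the 3′ end dictionary (the 5′ side would mirror it), and the names `build_end_dictionary` and `search_left_end` are hypothetical. The constants are taken from the caption.

```python
from bisect import bisect_left

SEED = 10          # decamer length (from the caption)
MAX_END = 35       # longest exon end stored (from the caption)
MAX_MISMATCH = 2   # mismatches allowed while extending a seed
MAX_CLIP = 10      # bases that may be clipped from the read end


def build_end_dictionary(exons):
    """Return a sorted list of (suffix, exon_id) for exon 3' ends.

    For every exon, each suffix of length 10..35 is stored, so a read
    may start anywhere within the last 35 bases of the exon.
    """
    entries = []
    for exon_id, seq in exons.items():
        for length in range(SEED, min(MAX_END, len(seq)) + 1):
            entries.append((seq[-length:], exon_id))
    entries.sort()  # lexicographic order enables binary search
    return entries


def search_left_end(read, entries):
    """Find exon ends covered by the left end of a read.

    Returns (clipped_bases, hits) where each hit is
    (exon_id, bases_in_exon, mismatches), or None if nothing matches.
    """
    for clip in range(MAX_CLIP + 1):
        clipped = read[clip:]
        if len(clipped) < SEED:
            break
        seed = clipped[:SEED]
        # Binary search for the first entry that is >= the seed.
        i = bisect_left(entries, (seed,))
        hits = []
        # Entries sharing the exact seed as a prefix are contiguous.
        while i < len(entries) and entries[i][0].startswith(seed):
            suffix, exon_id = entries[i]
            if len(clipped) >= len(suffix):
                # Extend past the seed to the exon end, counting mismatches.
                mismatches = sum(a != b for a, b in zip(clipped, suffix))
                if mismatches <= MAX_MISMATCH:
                    hits.append((exon_id, len(suffix), mismatches))
            i += 1
        if hits:
            return clip, hits
    return None
```

Each hit reports how many read bases fall inside the exon, which gives the position in the read where the junction would lie. Storing every suffix explicitly is a simplification chosen for readability; a real suffix array would index positions instead of copying strings.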

Table 1.

Mapping and splicing statistics for paired-end runs.

Figure 2.

Combined evidence improves specificity of splice and fusion detection.

Scatterplots show increasing mapped coverage (x-axis) versus (left) known RefSeq junctions, (middle) putative junctions, and (right) fusion junctions. The top track shows results for UHR and the bottom track for HBR. Three evidence thresholds were compared: 1) red line: one SPAN evidence (SR) required for a junction call; 2) magenta line: two SPAN evidence (2-SR) required for a junction call; and 3) blue line: one SPAN and one BRIDGE evidence (1-SR-1-PE) required for a junction call.
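The three thresholds amount to simple predicates on per-junction evidence counts. The sketch below shows one way to express them; the `Junction` record, its field names, and `call_junctions` are hypothetical and are not taken from the paper's software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Junction:
    name: str
    span: int    # spliced single reads (SR) crossing the junction
    bridge: int  # paired-end reads (PE) bridging the junction


# Threshold labels follow the caption; the predicates are an
# interpretation of "required for a junction call".
THRESHOLDS = {
    "SR": lambda j: j.span >= 1,
    "2-SR": lambda j: j.span >= 2,
    "1-SR-1-PE": lambda j: j.span >= 1 and j.bridge >= 1,
}


def call_junctions(candidates, threshold):
    """Return the names of candidate junctions that pass a threshold."""
    passes = THRESHOLDS[threshold]
    return [j.name for j in candidates if passes(j)]
```

Note that 2-SR and 1-SR-1-PE are not nested: a junction with two spanning reads and no bridging pair passes the former but not the latter, and the reverse holds for one spanning read plus one bridging pair.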

Figure 3.

Improvements by junction confidence value and comparison to TopHat.

A. The logarithms of the numbers of known and putative junctions are shown with yellow and blue bars, respectively. The ratio of known to putative junctions is shown with a dashed line. The dataset consisted of 64,000 sample UHR junctions called with default thresholds. B. TopHat and Lifescope candidate calls were compared to each other and to the RefSeq database. TopHat junctions were filtered with score > 5, and Lifescope junctions were filtered with the 1-SR-1-PE threshold (requiring one span and one bridge evidence).

Table 2.

Validated MCF-7 gene fusions and TaqMan expression ratios.

Figure 4.

Localization of gene fusions on specific chromosomal regions.

Circular graphs of gene fusions across A. the whole genome and B. chromosomes 1, 17 and 20. Red lines represent inter-chromosomal gene fusions, blue lines represent inverted intra-chromosomal fusion events, and black lines represent same-strand intra-chromosomal fusion events. Graphs were drawn with Circos software [61].

Figure 5.

Fusion breakpoints are biased toward the 5′ end of the genes.

Histogram of the order of 5′ (yellow) and 3′ (green) intron breakpoints for gene fusions in A. MCF-7 and B. UHR and HBR combined. Each breakpoint is inferred to occur in the intron (x-axis) following the exon that is fused. The y-axis shows the count of breakpoints inferred to occur at each numbered intron. C. Boxplot of the distribution of simulated gene fusion locations for each of the 23 genes in which a fusion was observed. A magenta star marks the location of the observed fusion, relative to the 5′ exon. The 23 fusions correspond to the gene fusions in Table 2, except that the alternatively spliced ESR1-C6orf97 and ADAMTS19-SLC27A6 fusions are each merged into a single data point.

Figure 6.

Screening of fusion assays in cancer cell lines reveals recurring fusions.

Heat map of the expression of selected gene fusions (rows) in 20 samples, including 18 cancer cell lines (columns). A lower cycle threshold (CT) indicates a higher level of expression and is highlighted in blue. A high CT (maximum 40, yellow) indicates no expression. PPIA is used as the positive control and a non-template control (NTC) sample as the negative control.
