Transcriptome Profiling of the Cancer, Adjacent Non-Tumor and Distant Normal Tissues from a Colorectal Cancer Patient by Deep Sequencing

Colorectal cancer (CRC) is one of the most commonly diagnosed cancers in the world. A genome-wide screening of transcriptome dysregulation between cancer and normal tissue would provide insight into the molecular basis of CRC initiation and progression. Compared with microarray technology, which is commonly used to identify transcriptional changes, the recently developed RNA-seq technique has the ability to detect other abnormal regulations in the cancer transcriptome, such as alternative splicing, novel transcripts or gene fusion. In this study, we performed high-throughput transcriptome sequencing at ∼50× coverage on CRC, adjacent non-tumor and distant normal tissue. The results revealed cancer-specific, differentially expressed genes and differential alternative splicing, suggesting that the extracellular matrix and metabolic pathways are activated and the genes related to cell homeostasis are suppressed in CRC. In addition, one tumor-restricted gene fusion, PRTEN-NOTCH2, was also detected and experimentally confirmed. This study reveals some common features in tumor invasion and provides a comprehensive survey of the CRC transcriptome, which provides better insight into the complexity of regulatory changes during tumorigenesis.


Introduction
Colorectal cancer (CRC) is one of the most commonly diagnosed cancers with over one million new cases worldwide in every year [1]. Metastatic CRC is usually incurable; as a result, CRC is the leading cause of cancer-related deaths [1,2]. CRC arises from adenomatous polyps and develops into locally invasive and subsequently metastatic cancer. The progression of CRC is a multistep process and can be categorized into four stages (Dukes staging system) based on the degree of tumor invasion [3,4]. In previous studies, several molecular mechanisms, such as genomic instability [5,6,7], loss of DNA repair genes [8,9] and aberrant epigenetic modifications [10,11] (see review in [12]), were shown to contribute to the development of CRC. Additionally, an unbiased approach of high-throughput screening of the expression changes between CRC and normal tissue revealed multiple diagnostic and prognostic biomarkers [13,14,15]. However, the comprehensive understanding of the progression of CRC and the proper prognosis are still challenging task due to the genetic heterogeneity of CRC and complex genomic alterations found with this type of cancer [12,16].
Prior studies of genomic alterations have revealed that somatic changes, including point mutations, DNA rearrangements and copy number variations (reviewed in [12]), can result in mutations that drive the development of CRC. As a consequence of changes in the cancer genome, the reprogramming of the transcriptome leads to abnormal cellular behavior and thus directly contributes to cancer progression [17,18]. Studying the cancer transcriptome not only enables us to fill in the gap between driver mutations and cancer cell behavior, but also allows us to identify additional candidate cancer-related mutations and the molecular basis of gene regulation [17]. The recent development of massively parallel sequencing (RNA-seq) provides a powerful approach to profile the transcriptome with greater efficiency and higher resolution [19]. The advantage of RNA-seq is that this technique makes feasible the study of the cancer transcriptome complexity, including alternative splicing, isoform usage, gene fusions and novel transcripts (reviewed in [20,21]). Despite the prevalence of using RNA-seq to study various cancer transcriptomes [22,23,24,25], the deep annotation of CRC gene expression profiling has not been performed.
In this study, we aimed to thoroughly annotate the transcriptomes of CRC tissue, adjacent non-tumor tissue and distant normal tissue from a single patient by RNA-seq. First, we found several cancer-specific dysregulated genes and alternative splicing. Second, following Gene Ontology (GO) and pathway analysis of the dysregulated genes and isoforms, we identified a potential candidate pathway and a functional class of genes that are relevant to CRC progression, which has not been reported previously. Third, we detected a novel gene fusion event specifically in CRC tissue and experimentally confirmed the fusion product. Finally, to validate our sequencing results, quantitative real-time PCR (qPCR) was used to confirm the gene expression difference between CRC and normal tissue.

Characterization of sequencing and mapping
Three samples -CRC tissue (stage III), adjacent non-tumor tissue and distant normal tissue -were collected from a 57-yearold female patient. The clinicopathological information of the patient is shown in Fig. S1. All three samples were subjected to massively parallel paired-end cDNA sequencing. In total, we obtained 36.5 million, 33.1 million and 29.9 million read pairs from the CRC, adjacent non-tumor and distant normal tissue, respectively. We used TopHat to align the reads to the UCSC (the University of California Santa) reference human genome Hg19. The uniquely aligned reads for the three samples ranged from 20.8 million to 25.9 million pairs. The proportion of reads that mapped to the Ensembl reference genes ranged from 75% to 86% for the three samples. The average coverage of our sequencing depth was approximately 50 times of human transcriptome (approximately 113 millon bp, based on the total length of the uniquely annotated exon region in the Ensembl database). In addition, only ,1% reads were mapped to rRNA, indicating that our libraries are properly constructed and faithfully represent the expression of RNA with ployA tails. The details of the mapping results are listed in Table 1.

Analysis of differentially expressed genes
To measure the gene expression and to identify the differentially expressed genes (DEGs) among the samples, we used the method of Cuffdiff [26] to estimate the gene expression and to identify significantly dysregulated genes. The normalized expression level of each gene was measured by Fragments Per Kilobase of exon per Million fragments mapped (FPKM). By requiring that the FPKM was greater than one, we detected 14854-15168 expressed genes in each sample, which included the majority of the annotated human reference genes (See Table S1 for details). We further analyzed the correlation of the gene expression among the samples. The global profiles of gene expression were generally highly correlated with the Pearson correlation coefficient, ranging from 0.90 to 0.94 (Fig. 1A). In addition, the clustering analysis indicates that the CRC transcriptome is distinguished from those of the adjacent non-tumor tissue and distant normal tissue (Fig. 1B).
We detected 1660, 1528 and 941 significant DEGs between the CRC and adjacent tissue, the CRC and normal tissue and the adjacent and normal tissue, respectively (the complete lists of the DEGs are summarized in Table S1). The overlapping of the DEGs among the three samples is shown as a Venn diagram in Fig. 1C. It is noteworthy that CRC yields more dysregulated genes (1660 genes in CRC vs. adjacent, 1528 genes in CRC vs. normal) than other two tissues, which are 1.5-fold more abundant than those found in other tissues (941 genes in normal vs. adjacent), indicating the cancer-specific reprogramming of the CRC transcriptome, as shown in the ''volcano plot'' of the gene expression profiles (Fig. 1D). When comparing the direction of the DEGs, the number of up-and down-regulated genes identified between CRC and the other two samples was nearly equal. In contrast, a slight increase in the down-regulated genes was observed in adjacent non-tumor tissue when compared to the adjacent cancer and normal tissue. The MA-plot of the gene expression profiles (Fig. S2) shows that the significant number of dysregulated genes is not biased toward highly expressed genes.
In previous studies, several key genes relevant to CRC have been identified. To determine whether our findings were in agreement with reported results, we systematically compared the changes in the expression of specific CRC-related genes with those identified in other studies. We found that 15-prostaglandin dehydrogenase (15-PGDH), a rate-limiting enzyme that catalyzes the degradation of prostaglandin [27], is significantly downregulated in CRC in cancer tissues compared to normal tissues. The activation of COX-2 and the loss of 15-PGDH are common oncogenic events that are observed in ,80% of CRC cases [28]. In addition, we found another tumor suppressor, TGFBR2 [29], which was down-regulated in both CRC and cancer-adjacent tissues. Because the inactivation of TGFBR2 is coordinated with the transition from adenoma to carcinoma, the progressive inactivation of TGFBR2 in cancer-adjacent and tumor tissues is expected. We also detected other genes that were dysregulated in CRC, including APC [30], MYH [31], CD133, IDH1 and MINT2 [10]. In contrast, several known driver factors that are frequently mutated in CRC, including MINT3 [32,33], MSH2 [34] and MSH6 [9], showed no change in expression in this study, suggesting that the genetic heterogeneity of CRC or the mutated products might be deleterious even if the expression level is unaffected.

Functional enrichment analysis of differentially expressed genes
To better understand the function of DEGs, we conducted an enrichment analysis of Gene Ontology for the dysregulated genes. To identify the cancer-specific functional categories, we first performed parallel enrichment tests for significantly up-and down-regulated genes that were detected by pair-wised compar- isons in the CRC, adjacent non-tumor and normal tissue using online tools from DAVID [35]. The GO categories that were significantly enriched in the dysregulated genes from the comparison of CRC vs. cancer adjacent non-tumor tissue and CRC vs. normal tissue, but not adjacent non-tumor vs. normal tissue, were selected. In total, the up-and down-regulated genes in CRC were categorized into 47 cancer-specific functional categories (Fig. 2). Interestingly, although we identified equal numbers of up-and down-regulated genes in CRC, we observed an excess of significant GO categories for CRC up-regulated genes, suggesting that the up-regulation of cancer-specific genes is functionally more important for cancer progression. For example, the significant GO terms for up-regulated genes, which include ''cell migration'', ''cell motility'' and ''extracellular matrix binding'', are relevant to cancer invasion [15,36,37]. In addition, the genes related to metabolic changes, including ''collagen metabolic process'', ''multicellular organismal macromolecule metabolic process'' and ''multicellular organismal catabolic process'', reflect the alteration of tumor metabolism [38,39] and are also overrepresented in CRC. On the contrary, genes that are downregulated in CRC are enriched in several functional processes related to homeostasis.
A more informative analysis of functional annotation can be achieved by studying the enrichment of differentially expressed genes in a particular pathway. We used DAVID [35] to analyze which KEGG pathway was enriched with CRC-specific dysregulated genes. The pathways enriched with DEGs are listed in Table 2. The extracellular matrix (ECM) receptor interaction pathway was commonly affected in all the pair-wise comparisons, and such gene regulation alterations in the ECM pathway were much more severe in the CRC tissue. In addition, the focal adhesion pathway was enriched in the DEGs identified from the CRC tissue.
To experimentally confirm the differentially expressed genes identified by RNA-seq, the expression levels of selected genes were validated in each sample by quantitative real-time PCR (qRT-PCR). We chose five candidate genes (COL1A1, COL3A1, FN1, SPP1, and ITGB5) from the ECM pathway (according to the gene expression level and fold change between the CRC and normal tissue) that were differentially expressed by Cuffdiff (Table S2). We used GAPDH as an endogenous control in these reactions. The qRT-PCR results confirmed that all of these candidate genes showed nearly identical changes in gene expression to those detected via the RNA-seq technique, as shown in Fig. 3.  . The functional enrichment of cancer-specific dysregulated genes identified in the Gene Ontology analysis. We only chose GO categories that enriched cancer-related dysregulated genes but did not enrich dysregulated genes that were identified when comparing normal tissue with adjacent non-tumor tissue. The cancer-specific dysregulated genes were categorized as significantly up-or down-regulated genes in cancer tissue. The level of significance is indicated by different colors. The ''s1'', ''s2'' and ''s3'' indicators denote ''cancer'', ''adjacent non-tumor'' and ''distant normal'' tissues, respectively. doi:10.1371/journal.pone.0041001.g002 To examine whether these genes were always up-regulated in the colorectal cancer, we performed the qRT-PCR to test the expression changes for the five genes between the paired cancer and normal tissue in ten additional patients. The result of one paired samples from one patient was excluded due to large variation within technical replicates. The result of the remained patients showed that, except for ITGB5, the other three genes COL1A1, FN1and SPP1were up-regulated in six cancer samples, and the COL3A1 was up-regulated in four cancer samples (Table S3), suggesting the ECM-pathway genes are usually upregulated in colorectal cancer. In addition, the cancer samples of five patients can be clustered together according to the expression levels of these five genes (Fig. S3).

Analysis of alternative splicing and differential usage of isoforms
One gene locus can express multiple isoforms by alternative splicing (AS). The transcript diversity leads to plastic transcriptional networks in cancer, which are important to generate the unusual properties of cancer cells [40,41]. Of the numerous molecular mechanisms that can generate AS isoforms, exon skipping to truncate the functional domain is the most common way to generate protein products with alternative functions in mammals [41]. We thus performed genome-wide screening to identify the cancer-restricted exon skipping events using software MISO (the Mixture of Isoforms) [42]. In total, we detected 14072, 14537 and 13865 exon skipping events in the CRC, adjacent nontumor tissue and normal tissue, respectively. We next compared the differential exon skipping (DES) events (Table S4) among samples, as shown in Fig. 4A. We found that: i) only a small proportion of DES events was shared in three way comparison, suggesting that a considerable proportion of genes are under cancer-specific regulation by alternative splicing; and ii) the number of DESs between normal and adjacent tissue is less than the number of DES events between the CRC and adjacent tissue or the CRC and normal tissue, indicating the enhanced use of differential isoforms in cancer. Because ES events are likely to change protein function by affecting the functional domain, the number of DES events in CRC tissue is roughly twice that of noncancer tissue, indicating that the splicing pathway can be significantly activated in CRC to generate diverse functional products. We checked the expression of the slicing factors derived from NCBI and SpliceAid 2 (http://www.introni.it/spliceaid. html), and found that four splicing factors, including RBFOX1, SPRK1, MBNL1 and SRRM2, were significantly dysregulated between the cancer tissue and non-cancer tissue (Table S1), suggesting that the anomalous splicing activity in cancer tissue might be related with the dysregulation of the splicing factors.
While the DEGs have already been shown to be significant in CRC, it would be interesting to determine the degree of overlap between the DEGs and genes with DES events. As shown in Fig. 4B, only three genes (3/752, ,0.4%) were simultaneously affected by changes in transcriptional regulation and posttranscriptional regulation (i.e., alternative splicing).
To identify highly reliable cancer-associated genes with DES events, we filtered the DES events by a series steps (Material and Methods) from all DES events (Table S4) and obtained 20 reliable DES events from 14 cancer-associated genes (Table 3). Six genes, including ADD3, CTNND1, EPB41L3, F3, MUC4 and PDGFA, showed cancer-tissue-specific DES events. As shown in Fig. 5. The ratio of junction-reads number for the exon inclusion versus the exon exclusion was obviously lower in the cancer tissue than that in the other two tissues. The reads mapping of other five genes were shown in Fig. S4-S8 respectively.
The differential splicing of ADD3 has been found in the nonsmall cell lung cancer [43] and the murine breast tumor [44]. Interestingly, ADD3 showed a cassette exon inclusion in these two studies, but it showed a cassette exon exclusion in the same location in our study. In another study of human embryonic stem cells (hESCs), the cassette exon exclusion has also been found in hESCs relative to the derived cardiac progenitors [45]. Even in the previous lung cancer study, there was also heterogeneous evidence for alternative splicing patterns of ADD3 (four of 18 lung-cancer patients showed a cassette exon exclusion for ADD3) [43]. Therefore, further studies on ADD3 should be done to understand the relationship of its alternative splicing with cancer.

Bioinformatics prediction of gene fusion events
We used two algorithms, deFuse and TopHat-Fusion, to detect gene fusion based on the pair-ends reads in different samples. Although various results were generated by deFuse and TopHat-Fusion (Table S5), a fusion event between PTGFRN and NOTCH2 was the only cancer-specific fusion event identified by both algorithms. As shown in Fig. 6A, PTGFRN and NOTCH2 are separated on chromosome 1 by 3 million bps, and wild type forms are transcribed from opposite directions. In our cancer sample, we detected that the first intron of PTGFRN is fused with the 39 junction of the 17th exon of NOTCH2 to generate a chimeric PTGFRN-NOTCH2 transcript. It is worth noting that a partial intronic region of PTGFRN is present in mature mRNA due to this fusion event (Fig. 6B). We thus separately designed a pair of primers that coordinate with the first intron of PTGFRN and the exon region of NOTCH2 to confirm this fusion in normal, adjacent non-tumor and cancer tissue by RT-PCR. The results indicated that this fusion event is cancer-restricted (Fig. 6C), which is consistent with the conclusions from our RNA-seq analysis. In addition, we examined the PTGFRN-NOTCH2 gene fusion in additional ten samples by RT-PCR, but none of them showed the gene fusion, suggesting that the PTGFRN-NOTCH2 might be a rare gene fusion in colorectal cancer.
Using strands-specific reverse-transcription and PCR, we found that the fusion gene transcribed with the promoter of NOTCH2. We predicted the ORF of fusion gene by using the start codon of NOTCH2 (Fig. 6D). This predicted protein corresponds to a 934aa peptide sequence with the first 917aa from NOTCH2 (these sequences were listed in Supplementary S1). We annotated the protein sequence using CD-Search tools in NCBI (http://www. ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi, database: CDD), and found some EGF domains in the peptide regions from NOTCH2. However, some essential domains in NOTCH2, such as NOTCH domain and Ankyrin repeats, were lost in the fusion protein (Fig. 6D). Therefore, we inferred that the gene fusion PTGFRN-NOTCH2 in this study appeared to be more like a loss-of-function mutations, consistent with those recently described for myeloid leukemia [46], head and neck squamous cell carcinoma [47,48]. Nevertheless, the overall expression of wild-type PTGFRN, NOTCH2 and its targets (PTCRA, HES1, HES5) are less affected in cancer (Table S6), indicating that 1) the PTGFRN-NOTCH2

Discussion
Using RNA-seq technology, we profiled the whole transcriptomes from CRC, adjacent non-tumor and normal tissue with extremely thoroughly. In total, approximately 50-70 million reads were generated per sample, which enabled us to quantify the gene expression abundance at a wide range [49]. The number of expressed genes (FPKM .0) detected in our study is approximately 67% of the total UCSC reference genes per sample, representing the majority of the transcriptome.
Alternative regulation of gene expression can be achieved by transcriptional and post-transcriptional regulation. The first class of dysregulation of CRC at the transcriptional level has been well studied using microarray technology [14,15,50]. Quantifying the second class of regulatory change remains challenging despite the invention of the exon array [51]. RNA-seq technology enables the simultaneous study of these two different mechanisms [19,26,52,53]. In our study, we investigated transcriptional dysregulation by analyzing the DEGs. Then, we used pair-end cDNA sequencing to more efficiently identify the alternative : s1, s2 and s3 represented the normal, adjacent non-tumor and cancer tissues, respectively; # : Y, percentage spliced in, denotes the fraction of mRNAs that represent the inclusion isoform [67]; *: The ''diff'' is provided by the MISO, and indicated the degree of splicing difference between samples. It was in [21,1]. The positive ''diff'' value means that the exon was skipped less in the sample1 than that in the sample2, and the negative values means the exon skipped less in the sample2; $ : The ''bayes factor'' is provided by MISO, indicating the significance of the splicing difference. It was in [0, +'), and it was greater, then the difference was more significant. doi:10.1371/journal.pone.0041001.t003 splicing. Moreover, by employing the MISO algorithm, we were able to measure the relative expression level of different isoforms produced by the exon-skipping events, which are quantitative measurements of alternative splicing. Interestingly, the genes affected by these two different regulatory mechanisms are largely independent (Fig. 4B), suggesting versatile ways to reprogram the cancer transcriptome.
The local invasion and distant metastasis of cancer has been considered a multistep processes composed of the regulatory changing of intracellular circuitry and the complex interaction between cancer cells and their microenvironment [36,54,55]. During invasion and metastasis, frequent remodeling of the extracellular matrix enables cancer cells to disseminate from primary tumors and invade normal tissue. In our study, we found that many genes related to extracellular matrix (ECM) receptor interactions are highly dysregulated in a cancer-restricted manner. The ECM is composed of several types of macromolecules, including collagen-type proteins, laminins, tenascin and other adhesion molecules [55]. All of the collagen-type genes, including type I-IX collagen, are up-regulated 10-to 1000-fold in cancer tissue (Table S7). Although there is some concordance between our observations and previous studies on the up-regulation of collagen mRNA in colorectal cancer tissue [56], the pervasive induction of collagen mRNAs is unique to our study. These findings suggest that the reprogramming of the collagen protein family network during colon cancer development can be much more complex than previously thought. In addition, we also noted that members of the matrix metalloproteinase (MMP) family, which degrade ECM structures [55,57], are also significantly induced in cancer tissues, consistent with a previous report [58]. The fold change in the expression of the MMPs ranged from 10fold (MMP1, MMP3 and MMP14) to 554-fold (MMP7). Meanwhile, other cell-cell adhesion-related molecules, such as laminins (LAMA4, LAMA5, LAMB1, LAMB2 and LAMC2) and integrins (ITGA5, ITGA5, ITGB5, ITGA11 and ITGBL1), are elevated in cancer tissues. We also detected the up-regulation of vascular endothelial growth factor (VEGF), suggesting that the ''angiogenesis switch'' is activated in cancer tissue. Taken together, the global up-regulation of the ECM pathway and the angiogenic growth factor indicates that CRC progression leads to massive ECM remodeling and the expansion of new vessel networks. Moreover, previous studies have shown that genes in the ECM pathway are under intensive epigenetic modification [59] and thus may be novel prognostic biomarkers; thus, our study provides greater insight into using expression changes in ECM pathway members as candidate biomarkers.
Gene fusion, which often results from a genomic aberration, has been shown to be the key mechanism for generating chimeric ''oncogenes'' that initiate tumorigenesis or contribute to tumor progression (reviewed in [60]). Using the RNA-seq technique, the expressed gene fusion transcript that is more likely to produce a functional product can be detected [23,61]. Given that common gene fusion is rare in CRC [5], identifying case-specific gene fusion can help to understand the complexity of the molecular basis of CRC development. In this study, we detected a cancer-restricted gene fusion between PTGFRN and NOTCH2 in CRC. In addition, the gene fusions between the immunoglobulin lambda variables and IGLL5 were detected in the filtering result of TopHat-Fusion (Table S5), which might represent immune rearrangements in tumor-associated B cells. Previous studies suggested that the consequence of gene fusion can be i) an alteration of gene expression [62]; or ii) the generation of a truncated or chimeric protein with a different function [63]. Because the PTGFRN-NOTCH2 transcript only includes a small portion of PTGFRN and the expression of PTGFRN and NOTCH2 are not down-regulated in CRC, we reason that the original functions of these two genes are not affected by this fusion event, and therefore, the gain of function of this fusion construct will be particularly interesting for future study. Given that the majority of the fusion gene is composed of NOTCH2, the function of this fusion product could be more related to that for NOTCH2. NOTCH2 is a homolog of NOTCH1 and plays a role in a variety of developmental processes by controlling cell fate decisions. NOTCH2 expression has been shown to be a prognostic predictor and is related to the tumor differentiation status in CRC [64,65]. In addition, the gain of function of truncated NOTCH2 with nonsense mutations causes an autosomal dominant skeletal disorder [66]. Therefore, NOTCH2 may play an important role in CRC development, and the PTGFRN-NOTCH2 gene fusion could introduce dominant negative effects on the normal development program.

Sample information
Written informed consent from the patients was obtained, and this series of studies was reviewed and approved by Institutional Ethics Committees of Fujian Provincial Hospital (Fuzhou, China). Three samples used in RNA-Seq, including distant normal colonic mucosa, adjacent colonic mucosa and cancer, were collected from one Chinese patient who was diagnosed with stage III colon adenocarcinoma. The distance between adjacent non-tumor and cancer tissue boundary is about 1cm, while that of distant normal tissue and cancer tissue is about 10 cm. Fig. S1 provides the micrograph of the cancer sample used in our study. Ten paired normal and cancer samples used in additional validation were obtained from ten patients with stage III colon adenocarcinoma.

Library preparation
Total RNA was extracted from normal, adjacent non-tumor and cancerous colon tissues with TRIzol according to the manufacturer's protocol (Invitrogen). For mRNA-seq sample preparation, the Illumina standard kit was used according to the TruSeq RNA SamplePrep Guide (Illumina). Briefly, 10 mg of total RNA from each sample was used for polyA mRNA selection using poly T oligo-conjugated magnetic beads by two rounds of purification, followed by thermal mRNA fragmentation. The fragmented mRNA was subjected to cDNA synthesis using reverse transcriptase (SuperScript II) and random primers. The cDNA was further converted into double-stranded cDNA, and after end repair (Klenow fragment, T4 polynucleotide kinase, T4 polymerase and 3-'A' add process [Klenow exo-fragment]), the product was ligated to Illumina Truseq adaptors. Size selection was performed using a 2% agarose gel, generating 380-bp cDNA libraries. Finally, the libraries were enriched using 15 cycles of PCR and purified with the QIAquick PCR purification kit (Qiagen). The enriched libraries were diluted with elution buffer to a final concentration of 10 nM.

Sequencing and primary analysis
Libraries from normal tissue, adjacent non-tumor tissue and cancerous colon tissue were analyzed at a concentration of 11 pM on a single Genome Analyzer IIx (GAIIx) lane using 115-bp sequencing. Raw RNA-seq data were filtered by Fastx-tools (http://hannonlab.cshl.edu/fastx_toolkit/) according to the following criteria: 1) reads containing sequencing adaptors were removed; 2) nucleotides with a quality score lower than 20 were trimmed from the end of the sequence; 3) reads shorter than 50 were discarded; and 4) artificial reads were removed. After the filtering pipeline, a total of 21.5G bp of cleaned, paired-end reads were produced. The raw sequence data have been submitted to the NCBI Short Read Archive with accession number SRP009386.

RNA-seq reads mapping
The clean reads were then aligned with the UCSC H. sapiens reference genome (build hg19) using TopHat v1.3.1 [53], which initially removes a portion of the reads based on quality information accompanying each read and then maps the reads to the reference genome. The pre-built H. sapiens UCSC hg19 index was downloaded from the TopHat homepage and used as the reference genome. TopHat allows multiple alignments per read (up to 20 by default) and a maximum of two mismatches when mapping the reads to the reference. TopHat builds a database of potential splice junctions and confirms these by comparing the previously unmapped reads against the database of putative junctions. The default parameters for the TopHat method were used.

Transcript abundance estimation
The aligned read files were processed by Cufflinks v1.0.3 [26], which uses the normalized RNA-seq fragment counts to measure the relative abundances of the transcripts. The unit of measurement is Fragments Per Kilobase of exon per Million fragments mapped (FPKM). Confidence intervals for FPKM estimates were calculated using a Bayesian inference method [67]. The reference GTF annotation file used in Cufflinks was downloaded from the Ensembl database (Homo_sapiens.GRCh37.63.gtf [68]). The transcript abundance data has been submitted to the GEO database with accession ID GSE33782.

Differentially expressed gene testing
The downloaded Ensembl GTF file was passed to Cuffdiff along with the original alignment (SAM) files produced by TopHat. Cuffdiff re-estimates the abundance of the transcripts listed in the GTF file using alignments from the SAM file and concurrently tests for differential expression. Only the comparisons with ''q_value'' less than 0.01 and test status marked as ''OK'' in the Cuffidff output were regarded as showing differential expression.

Detection of differential exon skipping events using MISO
The Mixture of Isoforms (MISO) analysis [42] was used to detect differentially regulated exons across samples. The MISO analysis was performed according to the tool's given workflow using paired-end reads (http://genes.mit.edu/burgelab/miso/ docs/). The reads alignment files (SAM) produced by TopHat and the pre-build human genome (Hg19) alternative events downloaded from the MISO reference manual page (http:// genes.mit.edu/burgelab/miso/docs/#gff-event-annotation) were used as the input.
Filtering for highly reliable cancer-associated differential exon skipping (DES) events To identify highly reliable cancer-associated DES events, we filtered the DES events by the flowing steps: 1) use the stringent cuff-offs to filter the result of MISO (the absolute value of diff .0.3 and bayes factor .1000, the default cut-off of MISO were 0.2 and 10); 2) remove the DES events occurred only in adjacent nontumor tissue vs. normal tissue but not in CRC tissue vs. adjacent non-tumor and CRC tissue vs. normal tissue to ensure the events associated with the CRC tissue; 3) keep the genes that are overlapped with the cancer-associated gene set, which were collected from the NCBI gene database (searched by ''oncogene'' and ''tumor suppressor gene'') and the Bushman Lab web (http:// microb230.med.upenn.edu/protocols/cancergenes.html).

Mapped reads visualization
The mapping results were visualized using the Integrative Genomics Viewer (IGV) available at http://www.broadinstitute. org/igv/. Views of other individual genes were generated by uploading coverage.wig files to the UCSC Genome browser.

Functional enrichment analysis of differentially expressed genes
The Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.7 is a set of web-based functional annotation tools [35]. The unique lists of differentially expressed genes and all the expressed genes (FPKM .0 in any sample) were submitted to the web interface as the gene list and background, respectively. The cut-off of the False Discovery Rate (FDR) was set at 5%, and only the results from the GO FAT and KEGG pathways were selected as functional annotation categories for this analysis.

Candidate gene fusion identification
All the filtered RNA-seq reads were mapped to the reference transcript sequences that were downloaded from the Ensembl database (Homo_sapiens.GRCh37.63.cdna.all.fa) using TopHat. The read pairs mapping to the same transcripts were removed, and the ends of remaining reads were truncated to maintain the 75-bp length using in-house Perl scripts. These fixed-length reads were passed to two software packages, deFuse (deFuse-0.4.2) [61] and TopHat-Fusion (TopHatFusion-0.1.0) [69], to find the candidate gene fusions. The bowtie-index used in the TopHat-Fusion was downloaded from the TopHat homepage (H. sapiens UCSC hg19). The parameters of the TopHat-Fusion used were obtained from the ''Getting Started'' (http://tophat-fusion. sourceforge.net/tutorial.html) tutorial. The deFuse parameters were the default settings, as described in the deFuse manual.

Candidate gene fusion filtering
The deFuse results were filtered according to McPherson et al. [61] and Steidl et al. [70]. There were 82 candidate gene fusions remaining after the filtering pipeline. The TopHat-Fusion results were parsed by in-house Perl scripts and filtered according to the following pipeline: 1) the span reads were greater than eight reads; 2) the ratio of the against reads vs. the span reads was less than 0.5; and 3) gene fusions involving ribosomal proteins or small nuclear ribosomal proteins were excluded. There were 11 filtered candidate fusions remaining after parsing. The filtering candidates simultaneously detected by both deFuse and TopHat-Fusion were regarded as reliable candidate gene fusions. After this filtering pipeline, one reliable candidate, PTGFRN-NOTCH2, was obtained and validated.

Differentially expressed gene validation
The differentially expressed genes were validated by Real-Time Quantitative Polymerase Chain Reaction (RT-qPCR) using a LightCyclerH 480 Instrument II (Roche). The PCR volume included 10 ml sample, 5 ml 26 SYBR Green Master Mix (TOYOBO), 1 ml cDNA template and 1 pmol/ml of each oligonucleotide. The RT-qPCR thermal profile was obtained using the following procedure: 95uC for 1 min, 40 cycles of 95uC for 10 sec, 60uC for 30 sec and 72uC for 10 sec, followed by 72uC for 5 min. The program was set to reveal the melting curve of each amplicon from 60uC to 95uC and obtain a read every 0.5uC. The primer sequences are listed in Table S2. All the RT-qPCR reactions were performed in triplicate to capture intra-assay variability.
The expression levels of each target gene in the tested experimental conditions (adjacent non-tumor and cancerous colon tissue) were compared to the control condition (normal colon tissue) according to Cook et al. [71]. The data were normalized using GAPDH, which had previously been identified as the best reference gene under different experimental conditions [72]. In the present analysis, GAPDH was confirmed to be stable and always showed variability less than 61 cycle.

Gene fusion validation
To detect fusion transcripts, we design the forward primer targeting the 59 partner gene and reverse primer targeting the 39 partner. Primer pairs (Table S2, NOTCH2 and PTGFRN) for the coding exons of the fusion genes were generated using Primer 5 software (PREMIER Biosoft International, Palo Alto, Calif.), and the PCR volume used comprised 10 ml sample, 1 ml 106 PCR buffer, 1 ml cDNA template, 0.2 ml dNTP, 0.2 ml Taq Enzyme (Genscript), and 0.2 pmol/ml each oligonucleotide. PCR was performed using the following procedure: 95uC for 1 min, 40 cycles of 95uC for 15 sec, 55uC for 30 sec and 72uC for 15 sec, followed by 72uC for 5 min. We confirmed the presence of the fusion gene in cancerous colon tissue. GAPDH was used as the loading control. The PCR products of the fusion gene were cloned in the pGEMH-T Easy Vector (Promega) and then sequenced with the T7 primer using a 3730 DNA Analyzer (ABI).

Detecting the transcription direction of gene fusion
Total RNA (2 mg) was reverse transcribed into single-stranded cDNAs using SuperScript III reverse transcriptase (Invitrogen) and gene-special primer (Table S2, NOTCH2 and PTGFRN) in 20 ml reaction at 42uC for 60 min, 70uC for 15 min, 4uC for 5 min, respectively. 2 ml of cDNA was used for a subsequent 20 ml PCR amplification. To detect fusion transcripts, we design another primer pairs (Table S2, NOTCH2-nest and PTGFRN-nest) for the coding exons of the fusion genes were generated using Primer5 software. PCR was performed using the following procedure: 95uC for 1 min, 40 cycles of 95uC for 15 sec, 50uC for 20 sec and 72uC for 15 sec, followed by 72uC for 5 min. We confirmed the presence of the fusion gene in single-stranded cDNAs which were reverse-transcribed by PTGFRN-nest primer.

Author Contributions
Conceived and designed the experiments: YW. Performed the experiments: XW FW RH. Analyzed the data: FX GL. Contributed reagents/ materials/analysis tools: YH. Wrote the paper: YW MT PC.