A Global View of Cancer-Specific Transcript Variants by Subtractive Transcriptome-Wide Analysis

Background
Alternative pre-mRNA splicing (AS) plays a central role in generating complex proteomes and influences development and disease. However, the regulation and etiology of AS in human tumorigenesis are not well understood.

Methodology/Principal Findings
A Basic Local Alignment Search Tool (BLAST) database was constructed for the expressed sequence tags (ESTs) from all available databases of human cancer and normal tissues. An insertion or deletion in an EST/EST alignment was used to identify alternatively spliced transcripts. Alignment of the ESTs with the genomic sequence was then used to confirm AS. Alternatively spliced transcripts in each tissue were subtractively cross-screened to obtain tissue-specific variants. In this way, we systematically identified and characterized cancer/tissue-specific, alternatively spliced variants across the human genome. We identified 15,093 cancer-specific variants of 9,989 genes from 27 types of human cancers and 14,376 normal tissue-specific variants of 7,240 genes from 35 normal tissues, which cover the main types of human tumors and normal tissues. Approximately 70% of these transcripts are novel. These data were integrated into a database, HCSAS (http://202.114.72.39/database/human.html, pass:68756253). Moreover, we observed that the cancer-specific AS of both oncogenes and tumor suppressor genes is associated with specific cancer types. Cancer shows a preference in the selection of alternative splice sites and in the utilization of alternative splicing types.

Conclusions/Significance
These features of human cancer, together with the discovery of large numbers of novel splice forms for cancer-associated genes, suggest an important and global role of cancer-specific AS during human tumorigenesis. We advise the use of cancer-specific alternative splicing as a potential source of new diagnostic, prognostic, predictive, and therapeutic tools for human cancer. The global view of cancer-specific AS is not only useful for exploring the complexity of the cancer transcriptome but also broadens the scope of clinical research.


Introduction
It remains unknown how intron removal and exon rearrangement are precisely regulated to produce correct proteomes in a cell type- or developmental stage-specific manner. Alternative splicing, the process by which the exons of primary transcripts can be spliced into different arrangements to produce structurally and functionally distinct mRNA and protein variants, is the most widely used mechanism to enhance the protein diversity of higher eukaryotic organisms. It has been estimated that 35%-94% of all human genes undergo alternative splicing [1][2][3][4][5][6][7], suggesting that this mechanism has a major role in generating protein diversity. As sequence data continue to be generated at an ever-increasing rate, the need for mining these data and constructing a repository for transcriptome information continues to grow as well.
In many pathological conditions, aberrantly spliced pre-mRNAs are generated because they escape the quality control mechanisms within cells (e.g., the nonsense-mediated mRNA decay pathway) and are, therefore, translated into aberrant proteins involved in human diseases, including cancer [8][9][10][11]. It is estimated that approximately 60% of disease mutations in the human genome are splicing mutations [12,13]. The analysis of cancer-specific alternative splicing is therefore a promising step forward and a potential source of new clinical diagnostic, prognostic, and therapeutic strategies. Evidence is accumulating that supports a connection between tumorigenesis and alternative splicing [14][15][16][17][18]. Using bioinformatic approaches, Xu and Lee discovered cancer-specific splice variants in 316 genes [19]. We previously identified testis-/testis cancer-specific splice variants using bioinformatic and experimental approaches [20].
Despite the growing interest in the impact of alternative splicing in various aspects of the biological processes, our understanding of alternative splicing is still scattered, and its general regulatory mechanisms, especially in tumorigenesis, are not well known [21,22]. However, it is believed that cancer-specific splice variants could be involved in the etiopathogeny of many diseases and some might serve as diagnostic or prognostic markers. Moreover, the direct targeting of protein is probably an advantageous way of correcting cancer-associated splicing alterations. For example, the cancer-restricted splice variant protein could be used as the target for specific antibodies conjugated to tumor cell toxins for cancer treatments. The etiopathogeny concerning the cancer-specific AS and all related applications need to be explored further.
To advance our understanding of the biological significance of alternative splicing in human cancers, it is essential to systematically identify cancer-specific splicing events at the transcriptome level. In the present study, we performed a genome-wide analysis of alternative splicing in human cancer and normal tissues using an intersection/subtractive model consisting of the following steps: 1) identifying insertions or deletions in the alignments of expressed sequence tags (ESTs) to detect alternatively spliced transcripts based on a previously described method [2], 2) aligning ESTs to the genome to confirm the transcripts, and 3) obtaining the tissue-specific, alternatively spliced variants by subtractively cross-screening the alternatively spliced transcripts in each tissue. Our results reveal distinctive patterns of cancer-specific alternative splicing and identify a large number of cancer- and tissue-specific splicing isoforms. Together they provide a large-scale, global view of human cancer-specific alternative splicing and a potential source of new clinical diagnostic, prognostic, and therapeutic strategies for human cancer.

Data sources and filtration
Human EST data for both cancerous and normal tissues were drawn from the Cancer Genome Anatomy Project (CGAP) (http://cgap.nci.nih.gov/Tissues/LibraryFinder). The CGAP collects EST libraries from all over the world and provides good tissue information. All available EST libraries for both human cancer and normal tissues were downloaded from the CGAP libraries, Mammalian Gene Collection libraries, and Open Reading Frame EST Sequencing libraries. We sought to avoid mixing multiple tissues, so libraries labeled 'pooled' were excluded because pooling affects tissue classification. For normal tissue, ESTs were classified in accordance with the developmental stage information, and libraries without this information were not used. All EST and library data used for the different tissues are listed in Tables 1 and 2. All collected data were then processed in three steps: repeat sequence masking to remove simple repeats in the dataset (program, RepeatMasker; repeat database, Repbase; girinst server: www.girinst.org), vector and contamination masking to clean the vector sequences (program, cross_match; vector database, UniVec_Core; National Center for Biotechnology Information ftp server: ftp://ftp.ncbi.nih.gov/), and a final cleaning of short and low-quality sequences (program, SeqClean from the EGassembler server: http://egassembler.hgc.jp). Alu repeats were included, and the filtered ESTs were used in the subsequent analysis.

Computational procedures to identify cancer/tissue-specific alternative splicing
A basic local alignment search tool (BLAST) database was constructed for the ESTs of each tissue. Alternative splicing was analyzed based on a previous method [2]. Transcripts specific to tissue T were identified based on an intersection/subtractive model:

$$TS = T - (T \cap O)$$
where $TS$ is the set of alternatively spliced transcripts specific to tissue T, $T$ is the set of all transcripts in tissue T, and $O$ is the set of all transcripts in the other tissues ($\cap$ denotes intersection).
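The intersection/subtractive model, read here as TS = T − (T ∩ O), can be illustrated with plain set operations. The following is a minimal sketch rather than the pipeline used in this study: it assumes each transcript has already been reduced to a comparable isoform identifier, and the function name `tissue_specific` and the identifiers in the example are hypothetical.

```python
from typing import Dict, Set


def tissue_specific(transcripts_by_tissue: Dict[str, Set[str]], tissue: str) -> Set[str]:
    """Return TS = T - (T & O) for the given tissue.

    T is the transcript set of `tissue`; O is the union of the transcript
    sets of every other tissue. Transcripts are assumed (for illustration)
    to be hashable isoform identifiers.
    """
    t = transcripts_by_tissue[tissue]
    # O: all transcripts observed in any other tissue.
    others: Set[str] = set()
    for name, transcripts in transcripts_by_tissue.items():
        if name != tissue:
            others |= transcripts
    # Subtract the shared transcripts from T.
    return t - (t & others)


# Hypothetical toy data: isoform identifiers per tissue.
example = {
    "tissue_A": {"gene1:iso1", "gene1:iso2", "gene2:iso1"},
    "tissue_B": {"gene1:iso1", "gene3:iso1"},
    "tissue_C": {"gene2:iso1"},
}
specific_to_a = tissue_specific(example, "tissue_A")
```

In this toy example only `gene1:iso2` remains for `tissue_A`, because the other two isoforms also occur elsewhere. In practice, deciding whether two ESTs represent the same isoform requires alignment rather than string equality.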
Briefly, the three steps were as follows: (1) Tissue T's EST dataset was BLASTed against itself. The E-value threshold was set to 1e-30. Gaps (insertions or deletions) in the ESTs were identified after alignment.

Parameters to identify EST/genomic sequence alignments, chromosome mapping, and splice site analysis
To decrease errors in EST alignments and determine the chromosomal locus of each gene, we localized ESTs to genomic sequences using the BLAST-like alignment tool (BLAT; http://genome.ucsc.edu). We used the default parameters and selected the best-scoring results. The exon positions on the chromosome were recorded for each transcript and used to determine splice sites and gene structure. Splice sites for both 5′ and 3′ exon/intron boundaries were aligned online via http://weblogo.berkeley.edu/logo.cgi. We allowed an error of 10 bp in the exon/intron boundary. Based on comparisons of EST/genomic alignments, two possible errors were checked: (i) candidate ESTs of the same gene that were not on the same chromosome and (ii) candidate ESTs of the same gene that were not at the same locus on the chromosome. These errors arose mainly from EST sequencing errors, pseudogenes, and multiple-copy genes. Both cases were excluded as false positives from the final database.
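The two false-positive checks and the 10 bp boundary tolerance described above can be sketched as follows. This is an illustrative reading, not the authors' code: the `Hit` record, the function names, and the choice to define "same locus" as overlapping genomic intervals are all assumptions introduced here.

```python
from dataclasses import dataclass

BOUNDARY_TOLERANCE_BP = 10  # tolerance stated in the text for exon/intron boundaries


@dataclass(frozen=True)
class Hit:
    """Best genomic alignment of one EST (hypothetical record layout)."""
    est_id: str
    chrom: str
    start: int  # genomic start, inclusive
    end: int    # genomic end, inclusive


def classify_pair(reference: Hit, candidate: Hit) -> str:
    """Compare a candidate EST with a reference EST assigned to the same gene."""
    if reference.chrom != candidate.chrom:
        return "different_chromosome"  # error case (i)
    # Assumption: "same locus" means the two genomic intervals overlap.
    if candidate.end < reference.start or candidate.start > reference.end:
        return "different_locus"       # error case (ii)
    return "ok"


def same_boundary(pos_a: int, pos_b: int, tolerance: int = BOUNDARY_TOLERANCE_BP) -> bool:
    """Treat two exon/intron boundary positions as identical within the tolerance."""
    return abs(pos_a - pos_b) <= tolerance


ref = Hit("est_ref", "chr3", 1000, 9000)
print(classify_pair(ref, Hit("est_1", "chr3", 5000, 12000)))
print(classify_pair(ref, Hit("est_2", "chr7", 1000, 9000)))
print(classify_pair(ref, Hit("est_3", "chr3", 20000, 25000)))
```

Candidates classified as anything other than `ok` would be dropped as false positives under this reading.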

Function classification of alternative splicing
Each alternatively spliced EST was BLASTed against the RefSeq mRNA database (E-value threshold 1e-30) to identify the corresponding genes. Using PANTHER (http://www.pantherdb.org/tools/genexAnalysis.jsp), these genes were clustered by gene ontology (GO) process. We also searched the Entrez Gene Database to correct our results.
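As an illustration of the E-value filtering step, the sketch below keeps the best-scoring RefSeq hit per EST from BLAST tabular output. It assumes the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score); the text does not state which output format was used, and the function name and sequence identifiers are hypothetical.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-30  # threshold stated in the text


def best_hits(tabular_text: str, cutoff: float = E_VALUE_CUTOFF) -> dict:
    """Map each query EST to its highest-bit-score subject with E-value <= cutoff."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > cutoff:
            continue  # hit is not significant enough
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


# Hypothetical hits for two ESTs.
rows = [
    ["est_1", "mrna_A", "99.0", "300", "3", "0", "1", "300", "1", "300", "1e-80", "500"],
    ["est_1", "mrna_B", "97.0", "250", "7", "0", "1", "250", "1", "250", "1e-45", "310"],
    ["est_2", "mrna_C", "90.0", "80", "8", "0", "1", "80", "1", "80", "2e-10", "70"],
]
sample = "\n".join("\t".join(r) for r in rows) + "\n"
gene_for_est = best_hits(sample)
```

Here `est_1` is assigned to `mrna_A`, while `est_2` is left unassigned because its only hit exceeds the cutoff.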

Alternative splicing database construction
We entered all prediction results into a local alternative splicing database. This database was constructed with MySQL and programmed in Perl and CGI. All information, such as gene ID, gene structure, EST accession, mRNA accession, gene information, and exon location on the chromosome, was collected in the database.

(Tables 1 and 2). Through computationally subtractive analysis, we detected 15,093 cancer-specific transcripts in 9,989 genes from the 27 types of cancer, and 14,376 normal tissue-specific transcripts in 7,240 genes from the 35 tissues (Tables 3 and 4), which cover the main types of human tumors and tissues. The number of cancer-specific transcripts detected per gene ranged from 1 to 1.69, with an average of 1.51, whereas the number of normal tissue-specific transcripts per gene ranged from 1 to 6, with an average of 1.99 (Tables 3 and 4), indicating fewer cancer-specific alternative splicing events in cancer than tissue-specific events in normal tissues.
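The reported averages are consistent with dividing the total number of specific transcripts by the total number of genes. The short check below assumes that this is how the averages were derived, which the text does not state explicitly.

```python
# Totals reported in the text.
cancer_transcripts, cancer_genes = 15093, 9989
normal_transcripts, normal_genes = 14376, 7240

# Assumed definition: average = total specific transcripts / total genes.
cancer_avg = cancer_transcripts / cancer_genes
normal_avg = normal_transcripts / normal_genes

print(f"cancer-specific transcripts per gene: {cancer_avg:.2f}")
print(f"normal tissue-specific transcripts per gene: {normal_avg:.2f}")
```

Rounded to two decimals, these ratios give 1.51 and 1.99, matching the averages quoted above.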
To facilitate future studies and referencing of alternatively spliced genes in both human cancer and normal tissues, we constructed a human cancer- and normal tissue-specific alternative splicing database (HCSAS) based on our analysis. It is divided into two parts: cancer-specific (15,093 transcripts) and normal tissue-specific (14,376 transcripts) alternative splicing. Of these cancer- or tissue-specific AS variants, approximately 70% are novel isoforms. For example, in brain cancer, alternative splicing of the aminoacylase-1 gene (ACY1), a peptidase M20 family member, deletes a domain and produces a brain cancer-specific transcript (Figure 1a), and alternative splicing of the SRP19 gene produces a breast cancer-specific transcript through an alternative deletion of exon 3 (Figure 1b). Similarly, in liver cancer, lung cancer, and prostate cancer, cancer-specific isoforms were detected in our subtractive screening (Figure 1c-e).
Furthermore, we systematically identified cancer-specific transcripts in both oncogenes and tumor suppressors. Thirty-nine oncogene isoforms and 38 tumor suppressor gene isoforms with cancer-specific AS events were detected (Table 5). For example, we identified a lung cancer-specific transcript in the oncogene RAF1 with a deletion of the Raf-like Ras-binding domain, a uterus cancer-specific transcript in the oncogene FOS (Figure 2a), a retinoblastoma-specific transcript in the tumor suppressor GLTSCR2, and a skin cancer-specific transcript in the tumor suppressor EMP3 (Figure 2b).
The HCSAS database presents a global overview of cancer-specific alternative splicing in humans and is essential for understanding tumorigenesis at a systematic level. The main information in this database includes the specific alternative splicing in both cancer and normal tissues, gene ID, gene structure, splicing sites, chromosome localization, DNA and protein sequences linked with the NCBI website, and GO process, function, and subcellular localization. An example page set shows the details of an adrenal cancer gene, FDPS (Figure 3). The HCSAS database can be accessed at http://202.114.72.39/database/human.html.

Biased utilization of alternative splicing types in cancer
An examination of cancer-specific alternative splicing revealed a biased distribution of alternative splicing types in cancer. Both the alternative 3′ splice site and 5′ splice site were used more often in cancer; however, a lower proportion of intron retention and cassette alternative exons occurred in cancer tissues compared to normal tissues (Figure 4b). Moreover, alternative splicing types differ between different kinds of cancer (Figure 4a). For example, in liver cancer, breast cancer, and prostate cancer, intron retention decreased and cassette alternative exons increased significantly, whereas in uterus cancer and skin cancer, cassette alternative exons markedly decreased.

Preference in the selection of alternative splice sites in cancer
To explore the preference/diversification of alternative splice sites in cancer, we analyzed all splice sites in the 27 types of cancer and 35 normal tissues by comparing each EST with its genomic sequence and mapping it onto the chromosome. We detected five basic donor-acceptor splice sites: GT-AG, CT-AC, GC-AG, GG-AG, and GT-GG, of which GT-AG is the most dominant. The others were classified as rare splice sites. We found that cancer uses rare splice sites and GT-AG more frequently, but CT-AC less frequently, compared to normal tissues (Figure 5a, b). Moreover, the selection of splice sites differs between different kinds of cancer (Figure 5c). For example, CT-AC sites are seldom used in breast cancer, liver cancer, lung cancer, and prostate cancer; in liver cancer, the 5′ sites of rare splicing are almost all AA.
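Donor-acceptor pairs of this kind can be read directly from the first and last two nucleotides of each intron inferred from an EST/genome alignment. The sketch below tallies them; it assumes that "rare" means any pair outside the five listed above, an interpretation the text leaves open, and the function names and example sequences are hypothetical.

```python
from collections import Counter

# The five donor-acceptor pairs named in the text.
BASIC_SITES = {"GT-AG", "CT-AC", "GC-AG", "GG-AG", "GT-GG"}


def splice_site_pair(intron: str) -> str:
    """Return the donor-acceptor dinucleotide pair of an intron, e.g. 'GT-AG'."""
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to hold both dinucleotides")
    return f"{seq[:2]}-{seq[-2:]}"


def tally_sites(introns) -> Counter:
    """Count introns per basic pair; pairs outside BASIC_SITES are pooled as 'rare'."""
    counts = Counter()
    for intron in introns:
        pair = splice_site_pair(intron)
        counts[pair if pair in BASIC_SITES else "rare"] += 1
    return counts


# Hypothetical intron sequences.
site_counts = tally_sites(["gtaagtcccag", "CTTTAC", "AATTTT"])
```

For the three example introns this yields one GT-AG, one CT-AC, and one rare site (AA-TT). Comparing such tallies between cancer and normal libraries gives the kind of frequency comparison summarized in Figure 5.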

Association of cancer-specific alternative splicing of both oncogenes and tumor suppressor genes with cancer
Because both oncogenes and tumor suppressors are thought to be vital factors in tumorigenesis, we sought to identify their cancer-specific variants and possible involvement in cancer. We observed that oncogenes with cancer-specific AS are more often present in ovary cancer (6 oncogenes) and muscle cancer (5 oncogenes), whereas tumor suppressor genes with cancer-specific AS are more frequent in germ cell cancer (6), skin cancer (5), and primitive neuroectodermal cancer (5) (Figure 6). Some oncogenes and tumor suppressors with cancer-specific alternative splicing, such as EWSR1, CDKN1A, and GLTSCR2, are present in multiple types of cancer. Moreover, neither oncogenes nor tumor suppressors with cancer-specific AS were detected in brain cancer, prostate cancer, adrenal cancer, or lymphoma. This distribution bias implies that the cancer-specific alternative splicing of both oncogenes and tumor suppressor genes is associated with specific cancer types.

Biological relevance of the cancer-specific transcripts in the diversification of protein functions
The cancer-specific transcripts were classified based on gene function by searching the RefSeq database and GO. We classified 15,093 cancer-specific transcripts from 9,989 genes into 15 function groups. Protein metabolism and modification, and nucleic acid metabolism are the most prevalent functional processes in cancer. However, the function groups of these cancer-specific transcripts differ in different cancers. For example, the least common process in breast cancer is pre-mRNA processing, whereas the function groups of cell communication and lipid, fatty acid, and steroid metabolism are seldom found in prostate cancer (Figure 7).

Discussion
The complexity of the transcriptome has been underestimated. In this paper, we described the transcriptome-wide identification and characterization of cancer-specific, alternatively spliced variants in human cancer, based on a global view developed by subtractive transcriptome-wide analysis. Using an intersection/subtractive model, we developed an analysis method for precisely screening cancer-specific alternative splicing. The EST sequences were aligned first, compared with their genomic sequences, and then mapped onto chromosomes. These procedures eliminated many problems caused by EST errors, pseudogenes, and multiple-copy/repeat genes when data came from diverse EST databases. Finally, the alternatively spliced transcripts were subjected to subtractive screening of each tissue against all other tissues, which yielded the cancer-specific transcripts. We identified a large number of cancer-/normal tissue-specific transcripts. This is a rich resource for research and the development of new diagnostic, prognostic, predictive, and therapeutic tools against human cancer. Furthermore, these resources are integrated into an available database. The HCSAS database presents a global overview of cancer-specific alternative splicing in humans and is essential for understanding tumorigenesis at a systematic level.
There are two main approaches for the global analysis of alternative splicing. First, based on the availability of sequenced genomes and large databases of sequenced transcripts (ESTs and cDNAs), alternative splicing events may be searched through reciprocal transcript alignments and alignments to genomic sequences. Several analyses of this kind have been reported [6,23-29]. Second, because the first approach is limited mainly by EST coverage bias, a microarray-based technology has been developed to search for alternative splicing events [3,30-36]. Large sets of oligonucleotide probes may be designed specifically for individual exons and/or splice junction sequences, which allows the identification of new AS events. Here we have further developed a systematic method to search for cancer- or tissue-specific AS events based on intersection/subtractive screening of transcriptomes, which is especially useful for identifying cancer/tissue-specific variants. Using this method, large numbers of cancer-specific isoforms were identified for the main human cancers. Nevertheless, the cancer/tissue specificity of these transcripts needs to be confirmed further. RT-PCR technology and/or microarrays may be useful screening tools for this analysis.
Based on the transcriptome-wide analysis, we observed special patterns of cancer-specific alternative splicing. 1) Fewer cancer-specific AS events occur in cancer compared to normal tissues. 2) Cancer shows a distribution bias for alternative splicing types. 3) Cancer uses rare splice sites and GT-AG more frequently, but CT-AC less frequently, compared to normal tissues. 4) The selection of splice sites differs between different kinds of cancer. 5) The cancer-specific alternative splicing of both oncogenes and tumor suppressor genes is associated with the specific cancer type. Finally, the functional groups of these cancer-specific transcripts differ between cancers, indicating that individual cancers favor particular combinations of pathways when using AS in tumorigenesis. These special features of human cancers indicate that 1) the cellular splicing machinery is changed during the transformation from normal to cancerous, 2) alternative splicing plays an important role during tumorigenesis, and 3) individual cancers have unique regulatory combinations at the alternative splicing level, which further supports the prediction that approximately 60% of disease mutations in the human genome are splicing mutations [12,13]. Our data include the discovery of many novel splice forms of cancer-associated genes and alternative splicing patterns in cancer, and they suggest a significant new direction for human cancer research. We strongly advise the use of cancer-specific alternative splicing as a potential source of new diagnostic, prognostic, predictive, and therapeutic tools against human cancer. The global view of cancer-specific AS is not only useful for exploring the complexity of the cancer transcriptome, but it also broadens the scope of clinical research.

Figure 7. Biological processes of alternatively spliced transcripts specific to cancer. The five cancer types are brain, breast, liver, lung, and prostate cancer. The numbers indicate the percentages for each process in the cancer. The GO process classification is based on PANTHER (http://www.pantherdb.org/tools/genexAnalysis.jsp). doi:10.1371/journal.pone.0004732.g007