Colorectal cancer stages transcriptome analysis

Colorectal cancer (CRC) is the third most common cancer and the second leading cause of cancer-related deaths in the United States. The purpose of this study was to evaluate the gene expression differences in different stages of CRC. Gene expression data on 433 CRC patient samples were obtained from The Cancer Genome Atlas (TCGA). Gene expression differences were evaluated across CRC stages using linear regression. Genes with p≤0.001 in expression differences were evaluated further in principal component analysis and genes with p≤0.0001 were evaluated further in gene set enrichment analysis. A total of 377 patients with gene expression data in 20,532 genes were included in the final analysis. The numbers of patients in stage I through IV were 59, 147, 116 and 55, respectively. NEK4 gene, which encodes for NIMA related kinase 4, was differentially expressed across the four stages of CRC. The stage I patients had the highest expression of NEK4 genes, while the stage IV patients had the lowest expressions (p = 9*10−6). Ten other genes (RNF34, HIST3H2BB, NUDT6, LRCh4, GLB1L, HIST2H4A, TMEM79, AMIGO2, C20orf135 and SPSB3) had p value of 0.0001 in the differential expression analysis. Principal component analysis indicated that the patients from the 4 clinical stages do not appear to have distinct gene expression pattern. Network-based and pathway-based gene set enrichment analyses showed that these 11 genes map to multiple pathways such as meiotic synapsis and packaging of telomere ends, etc. Ten of these 11 genes were linked to Gene Ontology terms such as nucleosome, DNA packaging complex and protein-DNA interactions. The protein complex-based gene set analysis showed that four genes were involved in H2AX complex II. This study identified a small number of genes that might be associated with clinical stages of CRC. Our analysis was not able to find a molecular basis for the current clinical staging for CRC based on the gene expression patterns.


Introduction
Colorectal cancer (CRC) is the third most common cancer and the second leading cause of cancer-related deaths in the United States [1]. Among the five subtypes of CRC (adenocarcinomas, carcinoid tumors, gastrointestinal stromal tumors, lymphomas and sarcomas), adenocarcinomas are the most common (95% of all CRCs). Currently the staging of CRC, referred to as clinical staging, is based on results of physical exams, biopsies, and imaging tests (CT or MRI scan, X-rays, PET scan, etc.). The criteria of staging are based on: 1) how far the cancer has grown into the wall of the intestine; 2) whether it has reached nearby structures; and 3) whether it has spread to the nearby lymph nodes or to distant organs. The results of surgery can be combined with clinical staging to determine the pathologic stages. The most often used CRC staging system is the AJCC cancer staging manual developed by American Joint Committee on Cancer (AJCC), based on conditions of primary tumor (T), regional lymph nodes (N) and distant metastasis (M) [2]. The earliest stage cancers are called stage 0, then range from stage I through IV, with additional sub-stages identified with the letters A, B and C [3].
Several genes, such as WNT, WAPK/PI3K, TGF-β, TP, have been associated with CRC. For instance, mutations in adenomatous polyposis col (APC) gene, a tumor suppressor gene, were found to be responsible for familial adenomatous polyposis and then further developed to CRC [4]. MisMatch Repair system genes such as MLH1 and MSH2 gene were found to be associated with Lynch syndrome, the most frequent form of hereditary CRC [5,6]. Further, a 12-gene recurrence score assay has been developed as a prognostic factor in stage II-III colon or rectal carcinoma [7][8][9]. Even though many genes have been associated with an increased risk of CRC, the genetic differences across different stages of CRC have not been clearly identified. So far, only one study had assessed the gene expression levels of three candidate genes (MMP9, MMP28 and TIMP1) across CRC stages and found no statistically significant differences based on the stage of CRC [10]. There have been no studies in the literature comparing the gene expression levels in the entire transcriptome across CRC stages. The purpose of this study is to explore transcriptome-wide gene expression differences across different stages of CRC followed by gene ontology, gene set network analysis approaches based on the publicly available RNAseq dataset in The Cancer Genome Atlas (TCGA) [11].

Data acquisition
The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov/) is a joint effort between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) to facilitate the sharing of data and speed up cancer research [11,12]. The Eli and Edythe L. Broad (Broad) Institute of MIT and Harvard is a joint venture between both institutions and several area hospitals (https://www.broadinstitute.org/about-us). Their "FireHose" project ingests, aggregates, standardizes, and processes TCGA data via automated pipelines in an attempt to accelerate analysis and discoveries (https://confluence.broadinstitute.org/ display/GDAC/Rationale).
The Broad Institute has established pipelines for processing each TCGA dataset and the outputs from each stage of the pipeline are made available as a versioned set. Illumina HiSeq expression data was processed by Broad Institute to output both reads per kilobase per million mapped reads (RPKM) expression values [13] and RNA-seq by Expectation-Maximization (RSEM) values [14] normalized to "upper quartile count at 1000". TCGA clinical data and expression data were manually downloaded from the Broad Institute (TCGA data version 2016_01_28) via the firebrowse.org website.

Data merging
Using Python 2.7.10 and version 0.19.0 of the Pandas module, the expression data from the Broad Institute was read into a Pandas dataframe, transposed, and re-saved. The clinical data were also transposed in the same manner. Additionally, in order to cut down on the size of the data and number of components of interest, only a subset of the columns from the clinical data were kept for the analysis. These included common demographic data such as patient gender, race, ethnicity, and age; clinical data such as cancer stage, associated International Classification of Diseases (ICD) 10 codes, presence of polyps, whether analysis had been done for common mutations such as KRAS and BRAF; and finally, approximately 85 different aliquot identifiers from the TCGA dataset itself.
Matching of clinical data with expression data was performed using TCGA's "hybridization REF" identifier from the expression data and searching against the aliquot identifiers present in the clinical data. Eventually, 377 patients with gene expression data from 20,532 genes were included in the final analysis.

Differential expression analysis
Gene expression differences were evaluated across the disease stages using linear regression. The standard deviation of the gene expression level for each gene was computed. The genes with standard deviation of zero, which indicates no change in the gene expression, were removed from further analysis. To select top genes that are differentially expressed across cancer stages, a linear regression model was performed for each gene to test the trend in gene expression with increasing cancer stages. The analyses adjusted for age, gender and race/ethnicity of the patients. Genes with p 0.0001 were considered suggestive and the expression level by cancer stages were presented for these genes. Analyses were performed using R version 3.3.1 and SAS 9.4 (Cary, NC).

Principal component analysis
In order to identify gene expression pattern of the selected CRC samples across different stages, all the genes with p 0.001 in the linear model analysis were included in the principal component analysis using SAS. Ten principal components (PCs) were identified and the first two PCs were plotted according to the staging status of the CRC patients.

Gene annotation and gene set enrichment analysis
Genes with expression difference of p 0.0001 were evaluated further in gene annotation using DAVID [15]. Then the gene IDs and official gene names were used for further analysis. ConsensusPathDB tool [16,17] was then used to perform network-based and pathway-based analyses on these top genes. ConsensusPathDB consists of a comprehensive collection of human, mouse and yeast molecular interaction data integrated from 32 different public repositories and a web interface with a set of computational methods and visualization tools to explore these data (http://consensuspathdb.org). This tool applies computational methods for statistical over-representation and enrichment analysis and reports network modules, pathways and functional information that are significantly enriched by any given gene list. Consen-susPathDB provides 4 types of predefined annotation gene sets: neighborhood-based entity sets (NESTs) which includes protein-protein interactions, biochemical interactions, gene regulatory and genetic interactions, protein complexes, pathways (including metabolic, signaling and gene regulatory pathways) and GO terms [16]. For computing the significance of the enrichment of the annotation sets with respect to user-input gene list, this tool applies Wilcoxon's matched-pairs signed-rank test.

Demographics
The TCGA database contains clinical information for 629 patients but only 396 unique patients have both gene expression data and clinical data. The numbers of patients with CRC in stage I through IV were 59, 147, 116 and 55 respectively and 19 patients did not have stage information and there were no patients in the stage 0. The mean age of these patients was 64 ± 12 years. Further, 46.4% were women, 69.2% were white, 16.2% were Black/African American, 14.6% were Asian, American Indian/Alaska Native and of unspecified race, and 1.1% were Hispanics. From a clinical standpoint, 76.7% had colon cancer and 23.3% had rectal cancer. The demographic and relevant clinical information of these patients stratified by CRC stage are summarized in Table 1. The final analysis included 377 patients with clinical data including staging information and gene expression in 20,532 genes.

Linear model for gene expression
Eleven genes had p 0.0001 in the differential gene expression analysis according to the clinical staging. NEK4 gene, which encodes for NIMA related kinase 4, was differentially expressed across the four stages of CRC. The samples from the stage I patients had the highest expression of NEK4 genes, while the stage IV had the lowest expressions (p = 4.50 Ã 10 −6 ) ( Table 2, Fig 1).
Ten other genes had p value of 0.0001 in the unadjusted differential expression analysis including two with decreasing gene expression levels in more advanced CRC stages (RNF34 and NUDT6) and eight with increasing gene expression levels in more advanced CRC stages (LRCH4, HIST3H2BB, SPSB3, HIST2H4A, TMEM79, AMIGO2, GLB1L and C20orf135) (Table 2, Fig 1).

Gene annotation and network-based analysis
Network analysis showed that the top eleven genes map to multiple pathways such as meiotic synapsis and packaging of telomere ends, etc. (S1 Table). Ten of these 11 genes were linked to Gene Ontology (GO) terms such as nucleosome, DNA packaging complex and protein-DNA interactions (S2 Table). The protein complex-based gene set analysis showed that four genes were involved in H2AX complex II with q value of 5.72 Ã 10-5 (S3 Table). The enriched neighborhood based sets analysis of these 11 genes (S1 Fig) identified CDC like kinase 2 be connected with most genes (386 genes) in the neighborhood. RNF4 and RNF8 genes, in the same family as one of the top genes (RNF34), were also well-connected with multiple genes in pathways. Finally, the induced network module analysis identified several genes with gene protein interaction: HIST2H4A, HIST3H2BB, LRCH4 and NUDT6 (S2 Fig).

Discussion
Using publically available data from TCGA, this study explored the gene expression differences across four stages of CRC. We found that eleven genes showed suggestive level of evidence for differential expression in a linear fashion. These genes map to multiple pathways and were linked to GO terms. Further, several few genes were enriched in protein complexes. However, a principal component analysis was not able to identify a molecular basis for the current CRC staging process. This might be due to the following: 1) due to the limitation of publically available data, our study was not able to compare the gene expression data from different CRC stages with a normal control; 2) the CRC staging system currently uses the size of lesion for staging, not molecular basis; and 3) the principal component analysis was able to cover only~30% of the variance in the gene expression data. Such analysis has not been done previously in the literature. Among the genes with suggestive level of significance, only a few had possible link with cancer in the literature. The gene with the strongest p value for differential expression by stage is NEK4 gene, which encodes NIMA related kinase 4, a serine/threonine protein kinase required for normal entry info replicative senescence. In cell culture, suppression of NEK4 doubled the number of replications needed to reach senescence, reduced cellular reactions to doublestranded DNA damage in both recruitment of repair proteins and arresting of further cell divisions, and also reduced activity of the p53 tumor suppressor protein [18]. Our study suggested that the CRC patients in the higher stages have lower NEK4 gene expression compared to lower stages, this is consistent with the direction shown in tissue culture [18] that lower expression was associated with worse diagnosis.
RNF34 gene, which encodes ring finger protein 34, was first known and characterized as hRFI (human ring finger homologous to inhibitor of apoptosis protein type) in 2005, was shown to have anti-apoptotic properties [19], and later was shown to also play a role in regulation of p53 via ubiquitination and subsequent proteasomal degradation [20]. Overexpression of this gene was shown to confer the resistance to 5-fluorouracil-induced apoptosis in colorectal cancer cells via activation of NF-kappaB and upregulation of BCL-2 and BCL-XL [21]. In our study, RNF34 had lower expression in those in the more advanced clinical stages of CRC patients. This seems indicate that more advanced CRC patients may be more sensitive to 5-fluorouacil treatment compared to patients in earlier stages, but this is outside the scope of our study. However, it is worth noting that 5-fluorouacil is currently recommended as one of the adjuvant chemotherapy agents for stage III and high-risk stage II colon cancer patients [22].
HIST3H2BB and HIST2H4A, both encoding histone proteins, were also among the top differentially expressed genes, increasing in expression with increasing cancer stages. Eukaryotic DNA that is not currently being replicated is stored in a wrapped and coiled form around four pairs of histone proteins that provide support for the coiled DNA. Histones are also sensitive to post-translational modification, such as acetylation and deacetylation, which the cells use to help regulate transcription [23]. A direct link to the role of increased histone protein expression isn't clear, perhaps further examination of co-expression levels of histone acetyltransferases and deacetylases would suggest a link.
Members of the NUDT6 gene family exhibit behaviors that include controlling the level of cellular metabolites and signaling compounds as well as degrading "potentially mutagenic" oxidized nucleotides" [24]. The trend of downregulation of this gene across cancers stages would indeed contribute to the ability of cancer cells to continue to grow, divide, and evade normal cellular precautions.
LRCH4 gene encodes leucine rich repeats and calponin homology domain containing 4, which is a protein that contains leucine-rich repeats at its amino terminus and that is known to be involved in ligand binding. AMIGO2, which encodes adhesion molecule with Ig like domain 1, is a leucine-rich repeat family member. AMIGO2 mRNA was found to be differentially expressed in near half of cancer vs. normal tissue from gastric adenocarcinoma patients [25]. In an antisense study, it was found that the inhibition of AMIGO2 expression negatively impact tumor growth and altered chromosomal stability [25].
Our study has some limitations: 1). TCGA CRC data only included data on samples from cancer patients, therefore the only analysis we could perform was within cancer samples and using controls from a different source would bring too much confounding. 2) The data from TCGA had many field with missing information, such as medication information, which may be altering gene expression in some of the genes or loci of interest. Therefore, no meaningful analysis can be performed with the medication data.
In conclusion, our study identified several genes that might be associated with clinical stages of CRC. Our analysis also suggests that the current clinical staging might not have molecular basis according to the gene expression patterns.