Large-Scale Analysis of Gene Expression Data Reveals a Novel Gene Expression Signature Associated with Colorectal Cancer Distant Recurrence

Colorectal cancer (CRC) is the fourth-ranked cause of cancer-related deaths worldwide. Despite recent advances in CRC management, distant recurrence (DR) remains the major cause of mortality in patients with preoperative chemotherapy and radiotherapy, underscoring a need to precisely identify novel gene signatures for predicting the risk of systemic relapse. Herein, we integrated two independent CRC gene expression datasets: the GSE71222 dataset, including 26 patients who developed DR and 126 patients who did not develop DR, and the GSE21510 dataset, including 23 patients who developed DR and 76 patients who did not develop DR. Our data revealed 37 common upregulated genes (fold change (FC) ≥ 1.5, P < 0.05) and three common downregulated genes (FC ≤ 1.5, P < 0.05) between DR and non-recurrent patients from the two datasets. We subsequently validated the upregulated gene panel in the Cancer Genome Atlas CRC datasets (379 patients), which identified a five-gene signature (S100A2, VIP, HOXC6, DACT1, KIF26B) associated with poor overall survival (OS, log-rank test P-value: 1.19 × 10−4) and poor disease-free survival (DFS, log-rank test P-value: 0.002). In a Cox proportional hazards multiple regression model, the five-gene signature and tumor stage retained their significance as independent prognostic factors for CRC DFS and OS. Therefore, our data identified a novel DR gene expression signature associated with worse prognosis in CRC.


Introduction
Colorectal cancer (CRC) is one of the most prevalent types of cancers and is currently ranked as the fourth leading cause of cancer-related deaths globally, and the third leading cause of death in the United States in both men and women [1,2]. The 5-year survival rate for CRC patients with a localized tumor is approximately 90%, which declines to 70% for patients with regional disease, and to 12% for patients with metastatic disease [2]. Multiple molecular alterations occur during CRC development and progression. Therefore, the identification of clinical and pathological parameters that can accurately predict the prognosis of patients with CRC has been a daunting task. Some of the factors to consider for predicting the risk of systemic relapse include the differentiation status of the tumor, depth of tumor invasion, and vascular and perineural invasion [3,4]. Over the past several years, numerous molecular signatures have been identified for CRC prognosis [5][6][7]. However, one major problem with many of the established molecular signatures for CRC relapse is the lack of validation across different groups and platforms. Therefore, large-scale analysis of multiple gene expression datasets might lead to the identification of more representative gene expression signatures associated with CRC relapse. Herein, we integrated three independent CRC gene expression datasets retrospectively, which led to the identification of a novel five-gene signature associated with CRC systemic relapse.

Patient information and data analysis
The current study was conducted on three different CRC cohorts: (1) the National Center for Biotechnology Information Gene Expression Omnibus (GEO) GSE71222 dataset, which included 26 patients who developed distant recurrence (DR) and 126 patients who did not develop DR; (2) the GSE21510 dataset, which included 23 patients who developed DR and 76 patients who did not develop DR; and (3) The Cancer Genome Atlas (TCGA) CRC dataset, which included a total of 379 CRC patients. Interrogation of the TCGA dataset was conducted as previously described [8][9][10]. The relationship of gene expression patterns with patient survival in the TCGA database was queried using the cBioportal database with the formula GENE: EXP > 0, where GENE represents a query gene. The clinical characteristics for the TCGA dataset are shown in Table 1. The clinical characteristics for the GSE71222 and GSE21510 datasets have been described previously [11,12].

Microarray data analysis
The GSE71222 and GSE21510 raw gene expression datasets were retrieved from the GEO and were imported into GeneSpring 13.0 software (Agilent Technologies, Palo Alto, CA, USA). Raw data were subsequently normalized using the percentile shift, and a 1.5 fold-change (FC) cutoff and P < 0.05 were used to determine significantly changed transcripts between groups [13].

Statistical analysis
Kaplan-Meier survival curve comparison was conducted using the log-rank test, and a P-value of 0.05 was considered statistically significant. The Cox proportional hazards multiple regression model was used to identify the independent prognostic factors and to correct the effect of potential confounding variables, such as gender (male vs female), age (> 65y vs < 65y), tumor stage (stage 3/4 vs stage 1/2), and of cancer type (colon adenocarcinoma vs rectal adenocarcinoma vs mucinous adenocarcinoma of the colon and rectum) on OS and DFS using MedCalc 16.8.4 (MedCalc, Mariakerke, Belgium). Pathway analyses were conducted using DAVID functional annotation and clustering bioinformatics tool, as described in our previous reports [14,15]. Statistical analyses and graphing were performed using Graphpad Prism 6.0 software (Graphpad Software, San Diego, CA, USA).

Generation of a gene expression panel associated with risk of DR
To devise a gene expression panel associated with CRC DR with high confidence, we analyzed two independent CRC gene expression datasets (GSE71222 and GSE21510) and identified the genes associated with patient recurrence. Analysis of the GSE71222 and GSE21510 datasets revealed 180 (1.5 FC, P < 0.05) and 317 (1.5 FC, P < 0.05) differentially expressed transcripts between DR and non-metastatic tumors, respectively (Fig 1a and 1b). To identify DR-related genes with high confidence, we crossed the differentially expressed genes from the two datasets that revealed 44 common upregulated transcripts, comprising 37 genes (Fig 1c, Table 2), and three common downregulated genes ( Table 2). Pathway analysis performed on the common upregulated genes revealed enrichment in several cellular pathways, including cell motion and regulation of cell differentiation (Fig 1d).

Validation of the DR-associated gene panel in the TCGA CRC dataset
We subsequently focused on the potential role of the upregulated genes in CRC recurrence. Therefore, each of the 37 upregulated genes was further validated using the TCGA CRC dataset to determine their relationship to overall survival (OS) and disease-free survival (DFS). S100A2, VIP, HOXC6, DACT1, and KIF26B were significantly associated with OS (P 0.01) and DFS (P 0.05), while LAMC2, NOV, and AMIGO2 were only associated with DFS (P 0.05). We subsequently focused on the five-gene panel that was associated with OS and DFS. The OncoPrint for this gene panel in the TCGA CRC dataset with the proportion of patients overexpressing each gene is presented in Fig 2a. Interestingly, the combination of this five-gene panel revealed a higher prognostic value, in which patients overexpressing at least one of the five genes showed a worse OS (log-rank test P-value: Fig 2b) and worse DFS (log-rank test P-value: 0.002, Fig 2c) than those with lower expression of these genes. Data from the univariate analysis were subsequently put into the Cox proportional hazards multiple regression model to identify the independent factors for prognosis. The results showed that expression of the five-gene panel and tumor   stage retained their significance as independent prognostic factors for CRC DFS and OS (p = 0.0023 and 0.0001 for DFS and p = 0.0086 and <0.0001 for OS, respectively), while age at diagnosis only correlated with OS, p = 0.0004 (Table 3). Network analysis of this fivegene signature revealed multiple network interactions in CRC, such as between VIP and GNG11, GNB3, GNG12, GNB2, GNG5, GNAS, GNG2, GNB4, GNG4, GNG10, and GNB1; between DACT1 and ARRB1, DVL1, CSNK2B, CSNK2A1, and CSNK2A2; and between S100A2 and TP53 (Fig 2d).

Discussion
In the current study, we retrospectively derived and validated a gene expression signature associated with the risk of systemic relapse in patients with CRC. Analysis of the GSE71222 and GSE21510 datasets identified 37 upregulated and three downregulated genes associated with DR in CRC. Interestingly, several of the identified genes (LAMC2, LPL, SERPINB5, TCN1, VIP, MSX2, PRUNE2, KRT6B, TESC, EPHA4, GPR155, KIF26B, C3ORF70, and PID1) were also found to be differentially expressed in our previous global mRNA expression profiling of CRC compared to adjacent normal mucosa, suggesting a plausible role of these genes in driving CRC in addition to DR [16]. Concordant with our data, Takahashi and colleagues [11] reported a worse prognosis in CRC patients overexpressing Traf2-and Nckinteracting kinase (TNIK). Higher expression of MSX2 was found to be associated with metastasis in different types of human cancers [17]. PROM1, also known as CD133, was among the 37 upregulated genes in both datasets. Interestingly, PROM1 has previously been reported as a cancer stem cell marker in CRC [18,19]. Similarly, two of the identified genes in the current study (SLC14A1 and KIF26B) were identified in an intestinal stem cell signature previously reported to be associated with poor clinical outcome in CRC [20]. Therefore, it is possible that patients with an enriched CSC phenotype are more likely to develop DR. We subsequently validated this gene signature in the TCGA CRC dataset, which includes 379 patients. Our analysis narrowed down the CRC recurrence signature to five genes (S100A2, VIP, HOXC6, DACT1, and KIF26B) whose expression was associated with poor OS (log-rank test P-value: 1.19 × 10 −4 ) and DFS (log-rank test P-value: 0.002), which was further confirmed in a multivariate analysis. Therefore, we here present a novel gene expression signature for predicting the risk of systemic relapse in CRC. Concordant with our data, overexpression of S100A2 has been associated with poor clinical outcome in colorectal [21] and oral [22] cancers. The HOXC6 gene is frequently upregulated in prostate cancer, although no association with patient relapse was observed [23]. DACT1 was recently shown to promote CRC tumorigenicity and invasion via stabilization of β-catenin [24]. Concordantly, overexpression of DACT1 was observed during the transition of ductal carcinoma in situ to invasive ductal carcinoma in breast cancer [25].

Conclusion
Herein, we integrated multiple gene expression datasets and devised a novel five-gene signature as an independent predictor of CRC DR. This signature adds to the current prognostic value of tumor staging. Before this five-gene-signature can be utilized in the clinic; however, additional validations are required