Development and Validation of a Prognostic Gene-Expression Signature for Lung Adenocarcinoma

Although several prognostic signatures have been developed in lung cancer, their application in clinical practice has been limited because they have not been validated in multiple independent data sets. Moreover, the lack of common genes between the signatures makes it difficult to know what biological process may be reflected or measured by the signature. By using classical data exploration approach with gene expression data from patients with lung adenocarcinoma (n = 186), we uncovered two distinct subgroups of lung adenocarcinoma and identified prognostic 193-gene gene expression signature associated with two subgroups. The signature was validated in 4 independent lung adenocarcinoma cohorts, including 556 patients. In multivariate analysis, the signature was an independent predictor of overall survival (hazard ratio, 2.4; 95% confidence interval, 1.2 to 4.8; p = 0.01). An integrated analysis of the signature revealed that E2F1 plays key roles in regulating genes in the signature. Subset analysis demonstrated that the gene signature could identify high-risk patients in early stage (stage I disease), and patients who would have benefit of adjuvant chemotherapy. Thus, our study provided evidence for molecular basis of clinically relevant two distinct two subtypes of lung adenocarcinoma.


Introduction
Lung cancer is one of the most common cancers worldwide, accounting for an estimated 226,160 new cases and 160,340 deaths in 2012 in the United States alone [1]. The vast majority of lung cancers are non-small cell lung cancers (NSCLCs), of which adenocarcinoma is the most common histology (approximately 50% of all NSCLCs) [2].
The American Joint Committee on Cancer (AJCC) staging system is currently used to guide treatment decisions and is the best predictor of prognosis for patients with NSCLC. Although surgical resection is potentially curative and the most effective treatment for patients with early-stage NSCLC, 35% to 50% of patients with AJCC-defined stage I disease will experience a recurrence within 5 years [3][4][5]. This indicates that NSCLC is a very heterogeneous cancer even in the earliest stage, and this underlying heterogeneity is not well-reflected in the current staging system. Small fraction of NSCLC patients have an underlying EGFR mutations or EML4-ALK fusion which are associated with relatively high response rates to targeted molecular therapies [6][7][8]. However, for the majority of adenocarcinoma patients, we do not yet have any validated biomarkers to predict overall outcome or to guide treatment selection. Thus, to improve patient care and management, it is important to further characterize molecular subgroups significantly associated with this differential response to standard treatment and to develop models to predict those who would receive greatest benefit from these treatments.
Recent advances in technology allow unbiased genome-wide screening of potential markers or gene-expression signatures that might reflect prognosis. This approach has shown potential success in identifying prognostic and predictive markers in breast cancer [9]. Similar approaches have been applied to NSCLC and prognostic or predictive molecular signatures that may be clinically useful have been found [10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26][27][28][29]. However, the majority of these studies are limited by a lack of validation with large and multiple independent cohorts, or lack of a statistical test for the robustness of the predictive models and their contribution as new markers in prediction improvement [30]. In the current study, we applied a genome-wide survey of gene-expression data to distinguish subgroups of lung adenocarcinoma with distinct biological characteristics associated with prognosis and then identify a gene-expression signature that best reflects the biological and clinical characteristics of each subgroup. We further tested the robustness of our new prognostic gene-expression signature using several statistical approaches and multiple independent cohorts. Finally, we performed pathway analysis to study the biological differences that characterize each group.

Patients and Gene Expression Data
All clinical and gene expression data were collected previously and are available from public databases. Gene expression and clinical data from the National Cancer Institute (NCI) Director's Challenge Consortium were obtained from the caArray database at the NCI (https://caarraydb.nci.nih.gov/caarray; experiment ID, jacob-00182). This data set consisted of 4 different patient cohorts, including Toronto/Canada (TC, n = 82), Memorial Sloan-Kettering Cancer Center (MSKCC, n = 104), H. Lee Moffit Cancer Center (HLM, n = 79), and University of Michigan Cancer Center (UM, n = 177) [18]. For exploration and the discovery of a potential prognostic gene-expression signature and validation of the signature, patients were divided into 2 groups. Patients from the TC and MSKCC cohorts were combined for discovery of the signature (TM cohort, n = 186). Patients from the HLM and UM cohorts were used as the first validation set (HM cohort, n = 256). Gene-expression and clinical data from Massachusetts General Hospital (MGH cohort, n = 125) were obtained from the public website of the Broad Institute (http://www. broadinstitute.org/mpr/lung) [11] and used as a second validation set. The data from the Duke Institute for Genome Sciences and Policy (Duke cohort, n = 58) were obtained from the public website of Duke University (http://data.cgt.duke.edu/oncogene.php) [22] and used as a third validation set. The data from Aichi Cancer Center (ACC cohort, n = 117) were obtained from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo, accession number GSE13213) [21] and used as the fourth validation set.
Although overall survival (OS) and recurrence free survival (RFS) were available for the NCI Director's Challenge cohorts (TM and HM), only OS data were available for remaining cohorts (MGH, Duke, and ACC). Adjuvant chemotherapy data were available only for the TM, HM, and ACC cohorts. Of the 442 patients in TM and HM cohorts, 89 (39 in AJCC stage I, 27 in stage II, 22 in stage III, and 1 with unknown stages) received standard adjuvant chemotherapy. The remaining patients did not receive chemotherapy (n = 233) or treatment data were not available (n = 120). No patient in the ACC cohort received adjuvant chemotherapy. RFS was defined in a previous study as the time from surgery to the first confirmed relapse and was censored when a patient died or was alive without recurrence at last contact. Table 1 shows the pathological and clinical characteristics of the patients in all 5 cohorts. All patients had undergone surgical resection as their primary treatment.

Statistical Analysis of Microarray Data
Biometric Research Branch (BRB)-ArrayTools were used for statistical analysis of the gene-expression data [31], and all other statistical analyses were performed in the R language environment (http://www.r-project.org). Except for data from the ACC cohort, all gene-expression data were generated using the Affymetrix (Santa Clara, CA) platform (U95A for the MGH cohort, U133A for the TM and HM cohorts, and U133 plus 2.0 for the Duke cohorts). Raw data from the Affymetrix platform were downloaded from public databases and normalized using a robust multiarray averaging method [32]. Data from the ACC cohort were generated using the Agilent whole-genome microarray platform, and pre-normalized data were downloaded and used for analysis.
We identified genes that were differentially expressed between the 2 classes using a random-variance t-test. Differences in gene expression between the 2 classes were considered statistically significant if their p value was less than 0.001. Cluster analysis was performed with Cluster and Treeview [33]. To predict the class of the independent patient cohort, we adopted a previously developed model [34][35][36]. Briefly, gene-expression data in the training set (the TM cohort) were combined to form a series of classifiers according to the compound covariate predictor (CCP) algorithm as described in previous publications [37] and the robustness of the classifier was estimated by the misclassification rate determined during leave-one-out cross-validation (LOOCV) of the training set. When applied to the independent validation sets, prognostic significance was estimated by evaluating the differences between Kaplan-Meier plots and log-rank tests between the 2 predicted subgroups of patients. After LOOCV, the sensitivity and specificity of the prediction models were estimated by the fraction of samples correctly predicted.
Multivariate Cox proportional hazard regression analysis was used to evaluate independent prognostic factors associated with survival, and we used gene signature, tumor stage, and pathologic characteristics as covariates. For each clinical variable, Harrell's concordance index (c-index) was calculated as a measure of predictive accuracy [38]. Interpretation of the c-index is similar to that of the area under a receiver operating characteristic curve. The higher the c-index, the more informative the variable is about a patient's outcome. The c-index analysis was carried out using the Harrell Miscellaneous (HMISC) package in the R language environment. The confidence interval (CI) of the c-index was estimated using 1000 bootstrap resamplings. A p value of less than 0.05 was considered statistically significant, and all tests were 2tailed.

Gene Network Analysis
IngenuityTM Pathways Analysis (IPA, Ingenuity SystemsH) was used for gene network analysis. Gene network analysis was carried out by using a global molecular network developed from information contained in the Ingenuity knowledge Base. Out of 470 gene features, 468 were mapped to the Ingenuity Knowledge Base. Identified gene networks were ranked according to scores provided by IPA. The score is the likelihood of a set of genes being found in the networks due to random chance. For example, a score of 3 indicates that there is a 1/1000 chance that the focus genes are in a network due to random chance.

Discovery, Development, and Validation of a Prognostic Gene Expression Signature
To find potential prognostic subgroups of lung adenocarcinoma with distinct biological characteristics, we collected gene expression data from previous studies and divide them into 5 independent cohorts (one exploration cohort and 4 validation cohorts) ( Table 1). Hierarchical clustering analysis of the gene expression data from the exploration data set (TM cohort, n = 186) revealed 2 distinct subgroups (clusters) of lung adenocarcinoma (Fig. 1A). Subsequent analysis of the clinical data showed a significant difference in clinical outcomes between the 2 subgroups. The OS rates of patients in cluster C1 were significantly lower than those of patients in cluster C2 (3- We next sought to identify a limited number of genes whose expression was tightly associated with the 2 subgroups. By applying a stringent threshold cutoff (p,0.001 and at least a 2fold difference between subgroups), we identified 193 gene features differentially expressed between 2 subgroups ( Fig. S1 and Table  S1). Of note, the expression of many genes involved in cell proliferation and cell cycle regulation, such as CCNB1, TOP2A, AURKA, CDC2, and FOXM1, was significantly higher (p,0.001, by t-test) in patients in the poor-prognosis subgroup (C1), indicating that tumors in the C1 subgroup had higher cell proliferation rates. Thus, we renamed the 2 clusters C1 and C2 as cluster F (for ''fast-growing tumors'') and cluster S (for ''slowgrowing tumors''), respectively.

Independent Validation of the Identified Expression Signature
With a gene expression signature (193 genes) that accurately reflected prognosis in TM cohort, we next sought to validate the association of the gene signature with prognosis in 4 independent patient cohorts (HM, MGM, Duke, and ACC cohort). For this validation, previously established data training and prediction methods [34][35][36] were applied to gene expression data from the HM cohort (n = 256; Fig. 2A). When lung adenocarcinoma patients in the HM cohort were stratified according to the prognostic gene expression signature, Kaplan-Meier plots showed significant differences in OS (p = 9.4610 24 by log-rank test) between the 2 subgroups of patients that were predicted by the CCP (Fig. 2B). The specificity and sensitivity for correctly predicting subgroup F during LOOCV were 0.881 and 0.975, respectively.
To assess the robustness of our gene-expression signature, we applied our prediction method to 2 additional independent validation cohorts (MGH cohort, n = 125; Duke cohort, n = 58). Consistent with the results from the HM cohort, the expression signature successfully discriminated patients with poor prognosis (subgroup F) from those with a better prognosis (subgroup S; Fig. 2C and 2D). In addition, we further tested the robustness of the signature using another independent cohort with a different ethnic background, that is, the 117 Japanese patients with lung adenocarcinoma from the ACC cohort [21]. When patients in the ACC cohort were stratified according to their gene expression signatures, Kaplan-Meier plots showed significant differences in OS (p = 8.1610 24 by log-rank test) between the 2 predicted subgroups (Fig. 2E). Taken together, these results demonstrated the robustness of the gene signature for identifying patients at high risk for disease recurrence and poorer survival.

Significant Association of the Gene Signature with Clinical Variables
To evaluate the prognostic value of the gene expression signature in combination with other clinical variables, including patient age at diagnosis, disease stage by AJCC criteria, smoking status, sex, and mutation status of certain oncogenes and tumor suppressor genes (i.e., KRAS, EGFR, and TP53), univariate and multivariate Cox proportional hazards regression analyses were performed in the ACC cohort. All patients in this cohort received uniform treatment (curative resection without adjuvant chemotherapy) thus minimizing confounding factors associated with different treatments. In the univariate analysis, both disease stage and the gene-expression signature were significantly associated with OS (p = 2.17610 24 and p = 0.001, respectively). In the multivariate analysis, disease stage and gene expression signature maintained their significance (p = 0.002 and p = 0.01, respectively; Table 2).
In addition to performing multivariate analysis, we assessed our new prognostic signature's potential using the ''drop in concordance index'' approach [30,39]. Briefly, we generated prediction models using all clinical variables used in the multivariate analysis. While the best model was constructed using all of the variables, test models each lacking 1 variable were generated and compared with the best model. In each comparison, the predictive value of each variable was weighted by measuring the decreased value of the cindex in each test model. Omission of the gene signature in the prediction model caused the largest decrease in the c-index value ( Table S2), suggesting that the signature not only retains its prognostic relevance over the classical pathological prognostic features but also significantly improves the prediction accuracy.
The independence of the new prognostic gene expression signature over the current staging system was further supported by analysis of pooled data from all 4 validation cohorts (n = 556). As expected, the OS of subgroup F was significantly worse than that of subgroup S (p = 3.0610 28 by log-rank test) when all patients were included in the analysis (Fig. S2B). In subset analysis, the gene-expression signature successfully identified poorer survival for both stage I (p = 0.006 by log-rank test) and stage II patients (p = 0.03 by log-rank test; Fig. S2C and S2D). Taken together, these findings strongly demonstrate that our new prognostic geneexpression signature is independent from the current staging system.

Association of the Gene Signature with Potential Benefit from Adjuvant Chemotherapy
Of the 442 patients from TM and HM cohorts, adjuvant chemotherapy data were available for 322 patients. Thus, we next sought to determine whether the new gene expression signature could predict a potential benefit from adjuvant chemotherapy. To examine the association of the gene signature with response to adjuvant chemotherapy, we performed subset analysis with patients in AJCC stage III, a stage for which the benefit of adjuvant chemotherapy has been previously demonstrated [40][41][42]. Patients with stage III disease (n = 49) were subdivided into 2 subgroups (F or S), and the difference in OS was independently assessed. Adjuvant chemotherapy significantly affected OS in patients in subgroup F (3- Fig. 3C). When a Cox regression model was applied, the interaction of subgroups with adjuvant chemotherapy reached a significance level of 0.03. Consistent with the Kaplan-Meier plot and log-rank test, the estimated HR for death for adjuvant chemotherapy in subgroup F was 0.44 (95% CI, 0.2 to 0.95; p = 0.036), while the HR for death for adjuvant chemotherapy in subgroup S was 1.96 (95% CI, 0.56 to 6.88; p = 0.29). This suggests a benefit of adjuvant therapy only in the F subgroup and potential harm associated with adjuvant treatment in the S subgroup. A similar trend was observed in the Stage II patients, although it did not reach statistical significance (p = 0.22) (Fig. S3). In the Stage I patients, there was an overall trend towards worse outcome with adjuvant chemotherapy (Fig.  S3).

Biological Insights from the Conserved Prognostic Gene-Expression Signature
To elucidate the biological characteristics of the subgroup with poor prognosis (subgroup F), we attempted to identify genes whose expression differed between the ''F'' and ''S'' subgroups across all data sets. We excluded gene-expression data from the MGH cohort in this analysis to maximize the compatibility of the data sets, since the MGH data were generated using an old microarray platform (U95A) with a limited number of gene probes. We applied a stringent cut-off (p,0.001) to avoid inclusion of potential false-positive genes. When they were all compared together, 470 genes were shared by all 4 cohorts (Fig. 4A).
We next performed pathway analysis on the 470 genes using the Ingenuity Pathway Analysis tool that is a controlled vocabularybased pathway tool. This analysis revealed a series of putative networks. Functional connectivity of the top network revealed a strong over-representation of the E2F1 pathway in patients in the F subgroup (Fig. S4), suggesting that its activation may be  a key genetic determinant associated with the poorer survival of lung adenocarcinoma patients in this subgroup. Expression of EZH2, which is frequently overexpressed in many cancers [43], was also significantly higher in subgroup F, indicating the importance of the E2F1-EZH2 network in the progression of lung adenocarcinoma. TP53 was overrepresented in another network (Fig. S5). Interestingly, many genes negatively regulated by TP53 were overexpressed in the TP53 networks. For example, previous studies have demonstrated that expression of PRC1 and BUB1 are directly suppressed by TP53 [44,45], but their expression is significantly upregulated in subgroup F, suggesting that the biological activity of TP53 may be substantially lost in this subgroup.

Discussion
By analyzing gene-expression data from lung adenocarcinoma tissues, we identified a limited number of genes (193 genes) whose expression is significantly associated with prognosis. The robustness of this gene-expression signature was validated in 4 independent cohorts with a total of 556 patients. Since current staging systems and biomarkers are limited in their ability to assess risk of recurrence and benefit from adjuvant chemotherapy in lung adenocarcinoma, our new gene-expression signature may represent a tool that could help further refine treatment decisions based on the tumors' molecular profiles.
For development and validation of a robust, prognostic gene expression signature, we applied 2 independent but complementary methods. Unsupervised hierarchical clustering was first applied to identify subgroups with significant differences in biological characteristics as well as prognosis. In the second approach, supervised prediction models were applied to validate the association of the signature with clinical outcomes in 4 independent patient cohorts. The robustness of the 193-gene signature was supported by the high sensitivity (.0.9) and specificity (.0.8) values seen during training of the prediction models within the discovery cohort and a significant association between the predicted outcome and patient prognosis in 4 test cohorts. In addition to its robustness, the prognostic gene signature's independence as a prognostic marker was supported by the results of vigorous tests using various approaches. First, the signature could identify high-risk patients among those with early stage adenocarcinoma (stage I and II). Second, in multivariate analysis, the signature was one of the most significant predictive factors for OS. Third, the signature was the most significant contributor to the predicted OS in models using the drop-in cindex approach. Taken together, these results strongly support that the 2 subgroups of lung adenocarcinoma predicted here are novel prognostic clinical subgroups that are not recognized by the current staging system. Subset analysis of patients with available chemotherapy data strongly suggested that the 193-gene signature can predict which patients will benefit from adjuvant chemotherapy. In patients with stage III disease, adjuvant chemotherapy was significantly associated with improved outcome for patients in subgroup F (HR, 0.44; 95% CI, 0.2 to 0.95; p = 0.036), whereas its benefit was not statistically significant for patients in subgroup S (HR, 1.96; 95% CI, 0.56 to 6.88; p = 0.29). Thus, our newly identified gene signature showed both a prognostic and predictive association.
Interestingly, our prognostic gene expression signature lacks overlapped genes with previously identified prognostic gene expression signatures. For example, of 193 genes, only one gene is common with the prognostic signature discovered in Japanese patients [21]. Likewise, no or only few genes were shared with other signatures such as EGFR-mutation signature [29], stage I specific prognostic signature [27], and ALK-associated gene expression signature [28]. Moreover, when different signatures were compared all together in multiple-comparison manner, only few genes were shared among the signatures. Our finding is consistent with previous study in breast cancer showing absence of gene overlap although concordance of predicted outcome is very high [46]. Overexpression of EZH2, a methyltransferase that catalyzes H3 trimethylation on lysine 27 and is essential for stem cell selfrenewal [47], in subgroup F is in good agreement with previous studies. Its altered expression has been linked to the aggressive progression of many cancers through its activation of angiogenesis and maintenance of the tumor-initiating cell (or cancer stem cell) population [48]. EZH2 is a newly identified downstream target of E2F1 [49], which is a major downstream effector of the RB tumor suppressor and has a pivotal role in controlling cell cycle progression [50]. Expression of E2F1's well-known downstream genes whose expression is significantly different between subgroups F and S. a univariate test (2-sample t-test) with multivariate permutation test (10,000 random permutations) was applied. In each comparison, we applied a cut-off P-value of less than 0.001 to retain genes whose expression was significantly different between the 2 groups of tissues examined. target genes was significantly upregulated in subgroup F (Fig. S4), indicating that E2F1 was highly activated in subgroup F and that E2F1-mediated regulation of EZH2 may be a key genetic event associated with poor prognosis in lung adenocarcinoma.
Expression of TYMS (thymidylate synthase) was also higher in subgroup F, which is in good agreement with previous studies showing that higher expression of TYMS is significantly associated with poorer prognosis in lung adenocarcinoma [51,52]. Pemetrexed, a potent inhibitor of TYMS [53], has emerged as one of the most active agents for the treatment of patients with advanced NSCLC. Previous studies have demonstrated that higher TYMS expression is associated with a lower chemotherapeutic effect of pemetrexed in patients with a variety of solid tumors [54][55][56] and forced overexpression of TYMS in NSCLC cells reduced sensitivity to pemetrexed [57]. Since expression of TYMS is significantly higher in subgroup F, our data suggest that pemetrexed may show limited antitumor activity for patients in this subgroup. By contrast, patients in subgroup S may benefit from pemetrexed because they have lower expression of TYMS. Thus, the 2 newly identified subgroups of lung adenocarcinoma not only well reflect previously recognized clinical characteristics of lung adenocarcinoma but may also provide guidance for treatment regimens.
In a recent evaluation of all prognostic gene expression signatures for lung cancer [39,58], 2 important criteria were suggested for a new prognostic signature to be accepted by the medical community. First, the new signature should be rigorously tested for statistical validation and reproducibility in large multiple-patient cohorts. Second, the new signature should show good predictive power over and above current risk factors. Our prognostic signature fulfills these 2 suggested criteria, as evidenced by validation of the signature in 4 independent cohorts (a total of 556 patients), independence from the current staging system, improvement of predictive power when included in the prediction model, and identification of high risk-patients with very early-stage disease. Although interesting, our analysis has some limitations because we only used mRNA expression level of genes that is not always correlated with their biological activity. Thus, other approaches better reflecting biological activity like proteomics should be used for finding better functional markers in future study.
In conclusion, using gene-expression data from multiple cohorts, we identified 2 new prognostic subgroups of lung adenocarcinoma that show significant differences in patient survival. The 193-gene signature can identify patients with a high risk of recurrence, as well as patients who would have benefited from adjuvant chemotherapy. This study clearly demonstrated that our gene-expression signature reflects the molecular characteristics of different subgroups of lung adenocarcinoma and provides an opportunity to rationally design future clinical trials so that patients who might benefit from adjuvant chemotherapy can be identified. Our results, if confirmed in prospective studies, may improve patient care by providing more practical guidance for treatment. Figure S1 Genes differentially expressed between cluster C1 (F) and cluster C2 (S) in TM cohort (n = 186). Genes were selected by univariate test (2-sample t-test) with multivariate permutation test and stringent cut-off (P,0.001 and .2-fold difference) was applied to retain genes whose expression is significantly different between the 2 groups of tissues examined (193 genes). The data are presented in matrix format, where rows represent individual gene and columns represent each tissue. Each cell in the matrix represents the expression level of a gene feature in an individual tissue. The red and green color in cells reflect relative high and low expression levels respectively as indicated in the scale bar (log2 transformed scale).