Hybrid Models Identified a 12-Gene Signature for Lung Cancer Prognosis and Chemoresponse Prediction

Background Lung cancer remains the leading cause of cancer-related deaths worldwide. The recurrence rate ranges from 35–50% among early stage non-small cell lung cancer patients. To date, there is no fully-validated and clinically applied prognostic gene signature for personalized treatment. Methodology/Principal Findings From genome-wide mRNA expression profiles generated on 256 lung adenocarcinoma patients, a 12-gene signature was identified using combinatorial gene selection methods, and a risk score algorithm was developed with Naïve Bayes. The 12-gene model generates significant patient stratification in the training cohort HLM & UM (n = 256; log-rank P = 6.96e-7) and two independent validation sets, MSK (n = 104; log-rank P = 9.88e-4) and DFCI (n = 82; log-rank P = 2.57e-4), using Kaplan-Meier analyses. This gene signature also stratifies stage I and IB lung adenocarcinoma patients into two distinct survival groups (log-rank P<0.04). The 12-gene risk score is more significant (hazard ratio = 4.19, 95% CI: [2.08, 8.46]) than other commonly used clinical factors except tumor stage (III vs. I) in multivariate Cox analyses. The 12-gene model is more accurate than previously published lung cancer gene signatures on the same datasets. Furthermore, this signature accurately predicts chemoresistance/chemosensitivity to Cisplatin, Carboplatin, Paclitaxel, Etoposide, Erlotinib, and Gefitinib in NCI-60 cancer cell lines (P<0.017). The identified 12 genes exhibit curated interactions with major lung cancer signaling hallmarks in functional pathway analysis. The expression patterns of the signature genes have been confirmed in RT-PCR analyses of independent tumor samples. Conclusions/Significance The results demonstrate the clinical utility of the identified gene signature in prognostic categorization. With this 12-gene risk score algorithm, early stage patients at high risk for tumor recurrence could be identified for adjuvant chemotherapy; whereas stage I and II patients at low risk could be spared the toxic side effects of chemotherapeutic drugs.


Introduction
Lung cancer is the leading cause of cancer-related deaths in industrialized countries [1]. Local and distant recurrence is the major cause of treatment failure (i.e. deaths) in lung cancer. Currently, surgery is the foremost treatment option for patients with stage I non-small cell lung cancer (NSCLC). However, 35-50% of stage I NSCLC patients will relapse within 5 years [2,3]. It remains a critical challenge to determine the risk for recurrence in early-stage cancer patients. Patients at high risk for recurrence might benefit from adjuvant chemotherapy, whereas those with a low risk for tumor recurrence might be spared of the side effects of chemotherapy. Following this, another critical issue in clinics is to determine an individual patient's predisposition to a specific anticancer drug. The emerging use of biomarkers may enable physicians to make treatment decisions based on the specific characteristics of individual patients and their tumor, instead merely of on population statistics [4].
The advances in microarray technologies yield promise in the molecular prediction of individual clinical outcome. Such success is manifested by the commercial gene tests for breast cancer, Oncotype DX [5] and MammaPrint [6,7]. Nevertheless, the high dimensionality of the data has complicated major diagnostic and prognostic breakthroughs [8] and puts a premium on innovative data mining methods. In current biomarker identification studies, genes are ranked according to their association with the clinical outcome, and the top ranked genes are included in the classifier. It has been noted that individual biomarkers showing strong association with the outcome are not necessarily good classifiers [9][10][11]. Furthermore, each individual gene selection algorithm has different strengths and limitations. A hybrid model combining multiple gene selection methods could better identify novel biomarkers from high-throughput data for clinical utility.
There have been a few studies on lung cancer prognosis by transcriptional profiling [12][13][14][15][16][17][18][19]. To date, there is no fully-validated and clinically applied model for predicting lung cancer recurrence [20]. This study presents a combinatorial gene selection system for the identification of a 12-gene lung cancer prognostic signature. This 12-gene signature is more accurate compared with previously published signatures in a multiinstitutional study of lung adenocarcinoma (n = 442) [19]. This 12-gene signature could identify stage I and stage II patients who might benefit from adjuvant chemotherapy and who could be spared of it. Quantitative RT-PCR analyses of independent NSCLC tissue samples confirmed the gene expression patterns of the identified biomarkers in terms of tumor characteristics. A functional pathway analysis then revealed that the signature genes had interactions with well established cancer hallmarks, indicating the important roles of the signature genes in tumor initiation and progression. Furthermore, the 12-gene signature accurately predicted chemoresistance and chemosensitivity to Cisplatin, Carboplatin, Paclitaxel (Taxol), Etoposide, Gefitinib and Erlotinib in a panel of 60 cancer cell lines (NCI-60).

Prognostic model system
In the post-genomic era, hybrid models that take advantage of different algorithms in different stages of gene selection are needed for biomarker discovery and disease classification. In this study, we combined statistical methods and machine learning algorithms to identify prognostic biomarkers of lung adenocarcinoma. The surgical resections collected from the University of Michigan Cancer Center (UM) and Moffitt Cancer Center (HLM) form the training set (n = 256), whereas the samples obtained from Memorial Sloan-Kettering Cancer Center (MSK, n = 104) and the Dana-Farber Cancer Institute (DFCI, n = 82) constitute two independent validation sets. The clinical characteristics of the patient cohorts were described in the previous publication [19].
The prognostic study includes three phases ( Fig. 1): 1) identification of a small set of signature genes by combining Significance Analysis of Microarrays (SAM) [21], differentvariance t-tests, and Relief algorithm from genome-scale transcriptional profiles of the training cohort (UM & HLM), 2) construction of a classifier using Naive Bayes algorithm to predict overall survival in lung cancer patients, and 3) validation of the gene expressionbased prognostic model in two independent patient cohorts (MSK and DFCI). Independent test sets were used in the model validation and evaluation of the identified gene signature over previously published lung cancer prognostic signatures.

Identification of a 12-gene prognostic signature
A combinatorial scheme with multiple gene selection methods was adopted in the process of identifying a lung cancer prognostic gene signature. The first step selected candidate genes from 22,283 probes quantified on the training cohort (n = 256). A combination of t-tests and SAM was then used to select genes with expression levels significantly different between low-risk (patient who survived longer than 5 years) and high-risk (those who died within 5 years following surgery) groups with a predefined false discovery rate. Twenty-seven censored cases with follow-up time less than 5 years were removed from this analysis due to the uncertainty of patient post-operative status. Specifically, a different-variance t-test selected 718 genes with significant differential expression (P,0.01) between the two prognosis groups. To control false discovery rate (FDR), SAM was used to select 1,431 genes that significantly differentiated the two prognostic groups at a FDR of 25% (delta = 0.46). There were 583 genes selected by both t-tests and SAM, and these were considered for the next stage of the analysis.
In order to refine the gene set into a more feasible size for clinical application, Relief algorithm implemented in WEKA 3.4 was used to rank each of these 583 genes in terms of the power to separate low-risk and high-risk groups. This ranked list was used in a step-wise forward selection to identify a gene subset with the highest prognostication accuracy. Specifically, starting from the top ranked gene, one gene was added at each step to the gene set, until the classification accuracy could not be improved by adding one more gene. This gene set was used to classify good-prognosis and poor-prognosis groups with Naïve Bayes algorithm. This process stopped when the addition of a new gene did not increase the classification accuracy as evaluated in a 10-fold cross validation. As a result, a 12-gene signature ( Table 1) was identified which could provide the best prediction for overall survival.
Survival prediction using 12-gene prognostic model Using mRNA expression profiles of the identified 12 genes as predictors, a prognostic classifier was constructed to stratify patients into low-risk (5-year survival) and high-risk (non-5-year survival) groups. The Naïve Bayes classifier implemented in WEKA 3.4 was used in the classification on UM & HLM training samples (low-risk n = 104; high-risk n = 125). Twenty-seven censored cases without sufficient follow-up information were removed in the model construction. Priors estimated by the model are 0.45 for low-risk class and 0.55 for high-risk class. Other parameters of the trained Naïve Bayes model, including the mean and standard deviation for each of the 12 genes in both low-and high-risk groups, are listed in Table 2.
The Naïve Bayes classifier computes the posterior probability of death within 5 years after surgery in each patient. This posterior probability represents the risk for tumor recurrence in patients, since recurrence is the major cause of treatment failure (i.e. death) in lung cancer. Based on the posterior probability, a patient is classified into the high-risk group if the value is greater than 0.5; or into the low-risk group otherwise. The training model was evaluated in a 10-fold cross validation. Without parameter reestimation, this model was then used to predict posterior probability representing the risk for tumor recurrence in each patient in two test sets (MSK and DFCI), as well as the censored cases left out of the model construction. The distribution of the posterior probability of 442 patients in this study was illustrated in Fig. 2A. After obtaining the predicted outcomes, Kaplan-Meier (KM) analysis was carried out to estimate the average survival probability at the 5-year mark following surgery. Results show that high-risk posteriors from the prognostic model are strongly associated with the 5-year survival probabilities (Fig. 2B). Patients with a high probability of tumor recurrence tend to be more likely to have treatment failure after surgery. This indicates that the high-risk posterior probability computed by the model is a good prognostic factor of lung cancer survival. The wide 95% confidence interval at posteriors ranging from 0.35 to 0.6 ( Fig. 2B) might be due to the small sample size in this distribution ( Fig. 2A). Furthermore, a posterior of 0.5 means that the chance of tumor recurrence is random, which also leads to a looser confidence interval.
Using the prognostic categorization scheme described above, the 12-gene signature separated patients into high-and low-risk groups with significantly distinct (log-rank P = 6.96e-7) postoperative survival on the training cohort in Kaplan-Meier analysis (Fig. 3A). This scheme generated significant patient stratification on independent validation sets MSK (log-rank P = 9.88e-4; Fig. 3B) and DFCI (log-rank P = 2.57e-4; Fig. 3C).
The 3-year post-operative survival rate for low-risk groups is 79-94% in the studied cohorts, representing a significantly better prognosis compared with the corresponding high-risk groups for which the 3-year survival ranges from 48% to 63%. When 3-year survival was used to define high-and low-risk groups (high-risk: death within 3-y; low-risk: alive after 3-y), the 12-gene risk algorithm achieved a sensitivity (correctly predicted high-risk patients) of 73.65% in the training set, 86.96% in MSK, and 68.18% in DFCI, and a specificity (correctly predicted low-risk patients) of 59.21% in the training set, 57.75% in MSK, and 76.36% in DFCI (Table S9). The sensitivity and specificity of the 12-gene signature in predicting 5-year survival is also provided in Table S9.
In current practice, treatment for patients diagnosed with NSCLC is based on AJCC tumor stage. Surgical resection is the major treatment option for stage I NSCLC patients. However, about 35-50% of stage I NSCLC patients will develop and die from tumor recurrence within the five years following surgery [2,3]. On the other hand, stage IB patients who received surgical resection followed by adjuvant chemotherapy showed improved survival rate [22]. Thus, we sought to explore whether the 12-gene signature could identify specific high-risk patients with stage I tumors. Results show that the 12-gene prognostic signature could reliably identify high-risk patients with stage I tumors on both the training cohort (results not shown) and two validation cohorts (logrank P = 0.04; Fig. 3D and 3E). The prognostic model also separated high-and low-risk groups (log-rank P = 4.73e-3) within stage IB patients in the combined test sets (Fig. 3F). The 12-gene signature did not generate significant prognostic stratification on the stage IA patients (results not shown). These results demonstrate that the identified 12-gene signature is independent of and provides more refined prognosis than the current AJCC staging system. Using this model, stage I NSCLC patients could be advised to receive adjuvant chemotherapy according to the expression profiles of the 12 gene markers.
Treatment selection for stage I and II NSCLC patients based on the 12-gene signature In order to assess whether the 12-gene signature could be used for treatment selection for stage I and II non-small cell lung cancers, patients who did not receive chemotherapy were selected for further analysis. The prognostic model separated high-and low-risk stage I patients without chemotherapy in the training (UM & HLM; log-rank P = 0.04; Fig. 4A) and test cohorts (MSK & DFCI; log-rank P = 0.02; Fig. 4B). Similarly, the model differentiated high-and low-risk stage II patients without chemotherapy in the training (log-rank P = 0.06; Fig. 4C) and test cohorts (log-rank P = 0.03; Fig. 4D) in Kaplan-Meier analyses. The results indicate that this gene expression-defined prognostic model could reliably select patients with early stage NSCLC for adjuvant chemotherapy. Meanwhile, it could also spare some stage I and II NSCLC patients from chemotherapy based on the expression patterns of the identified gene markers in the tumors.

Prognosis evaluation of the 12-gene signature with clinical covariates
To confirm the prognostic power of the identified 12-gene signature, the expression-defined prognostic model was evaluated with commonly used prognostic factors of lung cancer, including gender, age, and tumor stage on the combined testing cohorts (DFCI and MSK). The posterior probability of high-risk, termed as 12-gene risk score, was used as a covariate in the multivariate Cox analysis (Table 3). Without the 12-gene risk score, tumor stage was the only factor significantly (P,0.00006) associated with elevated risk of lung cancer death. When the 12-gene risk score was added to the multivariate Cox model, the 12-gene risk score  (Table 3). Similarly, a comprehensive evaluation was carried out with all available clinical covariates and demographic data in the dataset, including smoking history, race, and tumor differentiation (Table 4). In this comprehensive evaluation, the 12-gene risk score remained a highly significant prognostic factor with a hazard ratio of 4.19 (95% CI: [2.08, 8.46]). In both multivariate analyses, the hazard ratios of the 12gene risk score algorithm were higher than other clinical covariates except tumor stage (III vs. I), while there is no significant difference between the hazard ratio of the 12-gene signature and tumor stage. These results demonstrate that the 12-gene signature is a more accurate prognostic factor than some commonly used clinical parameters.

Comparison with other lung cancer gene signatures
In the study by Shedden et al. [19], prognostic classifiers were constructed with gene expression signatures alone or gene expression signatures combined with clinical covariates. Among twelve gene signatures identified in their study (Table S7), the best signature was reported as ''method A'' (referred to as ''Shedden A'' in this study), which contains about 9,591 genes/probes. In order to compare the predictive performance of our prognostic model with their best model, the estimated hazard ratio and the concordance probability estimate (CPE) of the models were evaluated. Hazard ratios greater than 1 indicate that patients with high predicted risk scores have poor clinical outcome. The model has strong predictive power if the CPE value is close to 1; CPE value close to 0.5 indicates that the model has poor predictive power (comparable to random prediction). Results show that the proposed 12-gene signature has the highest hazard ratio and CPE in both test sets when compared to the gene signatures from Shedden et al. [19] (Fig. 5).
To evaluate the 12-gene signature with previously published 14 lung cancer signatures [12,13,[16][17][18][19][22][23][24][25] (Table S6), Gene Set Enrichment Analysis (GSEA) was used to assess the enrichment of these signatures on 5-year survival. The normalized enrichment score (NES) and its corresponding false discovery rate (FDR) associated with each gene signature were evaluated on all 442 samples used in this study. In general, a gene set with high NES and low FDR is desired, as it indicates that the gene set expresses diversely with respect to the clinical outcome and the finding is unlikely to be by chance. In comparison to 14 other published gene signatures, the 12-gene signature exhibits high enrichment in patient group whose survival is longer than 5-year with significantly low FDR (absolute NES = 1.5; FDR,0.10) (Fig. S1). In this analysis, the most enriched signature with the lowest FDR was SHEDDEN_MH of 244 genes (absolute NES = 2.00; FDR,0.002). Overall, among the 15 gene sets studied, the 12gene signature is one of the best lung cancer signatures evaluated with GSEA.

RT-PCR Validation of gene expression patterns
In order to further confirm the expression patterns of the identified 12 genes, RT-PCR microfluidic low density arrays were used to analyze independent NSCLC tumor samples. First, the 12-gene expression patterns obtained from both microarray and RT-PCR were compared in terms of lymph node metastasis (Fig. 6A). On the RT-PCR data normalized with POLR2A, gene expression fold changes of the 12 genes in lymph node positive (LN+) versus lymph node negative (LN2) samples were compared with those in microarray data from Shedden et al [19]. The results show that the mRNA expression patterns of the 12-gene signature measured in both platforms are concordant in terms of lymph node metastasis. Then, to confirm the gene expression patterns in terms of overall survival, fold changes between low-risk (alive after 3 years) and high-risk (death within 3 years) groups were also compared (Fig. 6C). Three-year survival was used to keep the balance of the high-risk (n = 11) and low-risk (n = 12) groups in the RT-PCR cohort. All gene markers exhibited consistent expression patterns in over survival except XPO1 in both platforms.

Prediction of chemoresponse in NCI-60 cell lines
After substantiating the clinical relevance of the 12-gene signature in predicting lung adenocarcinoma overall survival, we sought to explore whether the signature can predict chemoresponse to anti-lung cancer agents, including Cisplatin, Carboplatin, Paclitaxel, Etoposide, Erlotinib, and Gefitinib. Here, the NCI-60 cell lines, regardless of tissue origin, were used in the study. For each drug, cancer cell lines that are either sensitive or resistant to the drug were included to build a chemoresponse classifier based on the 12-gene expression profiles in the cell lines. The performance of the classifiers was evaluated with leave-oneout cross validation (Table 5). Statistical significance of the classification was evaluated by comparing the overall accuracy of the 12-gene signature with that of 1000 random signatures of the same size using the same algorithm. The overall prediction accuracy of chemoresponse was 81% (P,0.004) for Paclitaxel (Taxol), 78% (P,0.001) for Carboplatin, 80% (P,0.005) for Cisplatin, 73% (P,0.017) for Etoposide, 79% (P,0.001) for Erlotinib, and 94% (P,0.001) for Gefitinib. These results demonstrate that the 12-gene signature accurately predicted sensitivity and resistance to common lung cancer chemotherapeutic agents in cancer cell lines.
The differential expression in sensitive and resistant lung cancer cell lines was analyzed for each signature gene. The drug responses of the lung cancer cell lines in the NCI-60 panel were provided in Table S8. Among the signature genes, the over-expression of STK24 was linked to chemoresistance to all the studied drugs except Gefitinib in the lung cancer cell lines; whereas the overexpression of FAM14A was associated with chemosensitivity to all the studied drugs except Gefitinib in lung cancer cell lines. The under-expression of STK24 was associated with resistance to Gefitinib (P,0.05). The under-expression of CCDC99 was observed in resistance to Paclitaxel (P,0.05). The over-expression of DLC1 was associated with chemoresistance to Erlotinib (P,0.05), Paclitaxel, and Cisplatin; whereas its under-expression was associated with chemoresistance to Etoposide and Carboplatin (not statistically significant) (Fig. 7).
EGFR mutation is a well known factor in drug response to Gefitinib and Erlotinib. In the NCI-60 cell lines, EGFR mutation was detected only in the SK-MEL-28 melanoma and RPMI-8226 myeloma lines, but not in any lung cancer cell lines [26]. We analyzed the raw expression levels of EGFR probe sets.
Specifically, a fold change of 1.76 over-expression of EGFR (210984_x_at) was observed in Erlotinib resistant vs. sensitive lung cancer cell lines (P,0.05), whereas no significant differential expression of EGFR was observed in other studies drugs (results not shown). In the overall patient cohorts (n = 442) from Shedden et al [19], EGFR expression was not significantly associated with lung cancer overall survival in univariate Cox modeling.

Functional pathway analysis
Having established the clinical relevance of the 12-gene prognostic signature, we sought to explore the functional involvement of this gene set in lung tumorigenesis and tumor progression. Two functional pathway analysis tools, Ingenuity Pathway Analysis (IPA) and Pathway Studio 7.0, were used to obtain curated molecular interactions related to the 12 genes. Results from IPA show that the signature genes interact with major cancer signaling pathways, such as TNF and AKT (Fig. 8A). Pathway Studio 7.0 was used to find interactions among the 12 genes and 13 major lung cancer hallmarks (EGF, EGFR, KRAS, MET, RB1, TP53, E2F1, E2F2, E2F3, E2F4, E2F5, AKT1, and TNF) reported in the literature. Pathway Studio revealed various types of interactions ranging from regulation to protein modification among the 12 genes and eight out of 13 cancer hallmarks (Fig. 8B). The functional pathway analysis suggests that the 12 signature genes are involved in lung cancer oncogenesis and tumor progression.

Discussion
Lung cancer remains the leading cause of death worldwide. It is important to identify clinically relevant prognostic biomarkers to develop personalized treatment. More importantly, the discovered biomarkers may reveal fundamental molecular mechanisms of this deadly disease, and enhance our understanding of why patients with certain tumor molecular characteristics have a poor clinical outcome and how their outcome could be improved.
This study presents a hybrid model system for the identification of a 12-gene signature for lung cancer prognosis and chemoresponse prediction. The 12-gene signature accurately quantifies survival in patients with resectable lung adenocarcinoma, and provides significant prognostic categorization within stage I and IB patients, respectively. This signature reliably identified high-risk patients within stage I and II who did not receive chemotherapy.
The gene expression-defined risk score is a more accurate prognostic factor than commonly used clinical parameters. This prognostic signature also predicts chemoresistance and chemosensitivity to several major anti-lung cancer drugs in NCI-60 cancer cell lines. Together, the results indicate that the 12-gene signature could be used to select early stage lung adenocarcinoma patients at high risk for tumor recurrence for adjuvant chemotherapy. Meanwhile, it may spare stage I and II low-risk patients from unnecessary chemotherapy. Furthermore, the 12-gene signature has the potential to be used to inform physicians which anticancer  drugs should be used in treating a particular patient. The expression patterns of the 12-gene signature were confirmed in RT-PCR. Curated interactions between the signature genes and major cancer signaling hallmarks revealed in the functional pathway analysis provides further evidence that the 12-gene signature might be involved in lung cancer oncogenesis and tumor progression.
In the post-genomic era, innovative computational models are needed to identify clinically important disease markers. Given the current scale of high throughput data, a combinatorial gene selection scheme is needed at different stages of gene filtering. The choice to use a different feature selection technique depends on an evaluation with an independent classifier. If the classification performance cannot be further improved with the current algorithm, a different algorithm will be used to reduce the feature space. In this study, SAM and t-tests was used to identify candidate genes showing differential expression between two prognostic groups in the training set. SAM method is very similar to t-test. We used t-tests (P,0.01) to select genes with certain level of differential expression between two prognostic groups, and used SAM to control for false discovery rate (FDR,25%). The results from SAM and t-test are not exactly the same, because the SAM method adds a constant (s) in the denominator to ensure that genes with a very small variance in the samples and a small differential expression are not selected as significant markers. When s = 0, SAM is exactly the same as t-test [21]. In our study, genes that met both criteria (P,0.01 in t-tests and FDR,25% in SAM) were included for further analysis. From this candidate gene pool, Relief was then used to rank the importance of these genes in terms of prognostic classification for the selection of the final gene signature. This hybrid system was able to identify a small set of genes that are more accurate than previously published lung cancer gene signatures on the same datasets. We have experimented to change the threshold in SAM statistics. As a result, there were 87 genes with a FDR,10% and no genes were selected with a FDR,1% from the training set. The 87 genes were not able to generate significant stratification in all three patient cohorts. These results indicate that using SAM method alone is not sufficient to identify the most accurate prognostic gene signature. The hybrid models combining pooled variance t-tests and Relief algorithm also identified a 15-gene signature (Table S1). Among the two gene signatures, 16 genes (Table S2) share common biological functions (Table S5). The performance of these gene signatures (fitted as covariates in Cox model) is comparable to that of the 12-gene signature (Tables S3, S4 and S9; Fig. S2). The 15-gene signature was also validated with RT-PCR analysis of independent NSCLC tumor samples (Fig. S3). Overall, the hybrid models presented in this study are  . Gene expression fold change in high-risk (death within 3-y) vs. low-risk (alive after 3-y) groups was also compared on patients with follow-up information (C). The RT-PCR data was normalized with POLR2A in a sample-wise manner. DNA microarray data were obtained from Shedden et al [19]. An asterisk (*) was put above a bar if that gene showed significant differential expression in t-tests (P,0.05). doi:10.1371/journal.pone.0012222.g006 Table 5. Prediction accuracy of chemoresponse in NCI-60 cell lines using the 12-gene signature. efficient and robust, and could be used in biomarker discovery in general.

Microarray profiles and patient samples
Gene expressions profiles analyzed in this study include 22,283 probes quantified with Affymetrix HG-U133A on 442 lung adenocarcinoma samples from Shedden et al [19]. This study cohort contains four data sets (University of Michigan, H. Lee Moffitt Cancer Center, Memorial Sloan-Kettering Cancer Center, and Dana-Farber Cancer Institute) contributed by six institutions. Tumors were collected by surgical resection from patients who have provided consent and protocols were approved by the Institutional Review Boards (IRB-Med) of the respective institutions. None of the patients received preoperative chemotherapy or radiation and least two years of follow-up information was available. Regions containing a minimum of 60% tumor cellularity were required for macrodissection, and in most instances tumor cellularity of at least 70-90% was identified for inclusion in the sample for RNA isolation. The raw microarray data are available from caArray website (https://array.nci.nih.gov/caarray/project/ details.action?project.id = 182).
A total of 91 NSCLC specimens to be used in RT-PCR analysis were obtained from West Virginia University Tissue Bank and the Cooperative Human Tissue Network (CHTN) (Ohio State University Tissue Bank, Columbus, OH). Tumor tissues were collected in surgical resections and were snap-frozen and stored at 280uC until used for RNA extraction. This study was approved with an IRB exemption from West Virginia University.

RNA extraction, and quality and concentration assessments
Total RNA was extracted from snap-frozen tumor tissues using a RNeasy Fibrous Tissue Mini Kit according the manufacturer's protocol (Qiagen, USA). Total RNA was eluted in 30 ml of RNasefree water and stored at 280uC. The quality and integrity of the RNA was evaluated by visualizing the gel image and electropherogram for each sample using the 2100 Bioanalyzer (Agilent Technologies, CA). RNA assessed as having good quality from 74 tumor samples was included for further analysis. The RNA concentration of each sample was determined using a Nanodrop-1000 Spectrophotometer (NanoDrop Tech, Germany). Total RNA samples were analyzed on an Agilent 2100 Bioanalyzer RNA 6000 Nano LabChip.

Generation of complementary DNA (cDNA)
The reverse transcriptase polymerase chain reaction was used to convert the high-quality single-stranded RNA samples to doublestranded cDNA, using an Applied Biosystems GeneAmpH PCR 9600 machine (Foster City, CA). For standardization across all samples, one microgram of RNA was used to generate cDNA.

Real-time RT-PCR low-density arrays
Real-time RT-PCR assays of an independent patient cohort of NSCLC tumor samples were used to confirm the expression levels of the identified signature genes in microarray platform. The identified signature genes and three housekeeping genes were included in the experiment. The three housekeeping genes, 18S, UBC, and POLR2A, were selected due to their confirmed constant mRNA expressions across samples [27].
We analyzed 74 tumor samples with high quality RNA using TaqMan microfluidic low-density array (LDA) plates on an ABI 7900HT Fast RT-PCR instrument (Applied Biosystems). The report was generated by the SDS2.3 software (Applied Biosystems). In the report, the number of cycles required to reach threshold fluorescence (Ct) and DC T for each sample relative to the control gene defines the expression pattern for a gene. The gene expression data were further analyzed using the 2 {DDCT method [28].

SAM statistics
Significance analysis of microarray (SAM) implemented in MultiExperiment Viewer (MeV; downloaded from http://www. tm4.org/mev.html) was used to identify statistically significant genes. The relative difference of the expression levels of each gene in two prognostic groups is computed and ranked (gene specific ttests). Null distribution of the relative difference of each gene is generated by random permutations of patients' prognostic group labels. The expected relative difference of each gene is calculated as the mean of the relative differences of this gene in the null In the graph, differential expression with statistical significance (P,0.05, t-tests) is marked by a red asterisk. doi:10.1371/journal.pone.0012222.g007 distribution. Genes with observed relative difference that varies from the expected relative difference by a certain threshold (delta) were identified as significant genes. For a specific delta, SAM provides a false discovery rate (FDR) based on the permutation analysis of expression data [21].

Relief algorithm
Based on the genes selected with different variance t-tests and SAM statistics, Relief was used to rank these genes with WEKA 3.4. Relief evaluates the importance of a variable by repeatedly sampling an instance and checking the value of the given variable for the nearest instance from the same and different classes. The values of the attributes of the nearest neighbors are compared to the sampled instance and used to update the relevance scores for each attribute. As approximated in following equation, Relief computes the weight of attribute A as: Relief assigns more weight to those attributes that have the same value for instances from the same class and differentiate between instances from different classes [29,30].

Naïve Bayes classifier
Naïve Bayes algorithm is based on Bayes theorem with the assumption that attributes are conditionally independent given the target class. A new sample with attribute values ,a 1 , a 2 ,…, a i . would be classified into the most probable class based on posterior probability and computed according to the Bayes theorem [31]. Specifically, the new sample would be classified into the class with the highest posterior probability, based on the following equation: C predicted~a rg max cj[C P(a 1 ,a 2 ,:::,a i Dc j )P(c j ) where C is the set containing all the classes for the problem and c j is a specific class.
Based on the conditional independence assumption, it holds true for the situation that given a class of instances, the probability of observing the conjunction of attributes a 1 , a 2 ,…, a i would be the product of the probability of the individual attributes: P(a 1 ,a 2 ,:::,a i Dc j )~P i P(a i Dc j ) [31]. Therefore, a simpler form of the above equation used in Naïve Bayes classifier is as follows: In this study, the Naïve Bayes algorithm was used to classify 5year survival status in the training data with WEKA 3.4 [32]. Since gene expression is an attribute of continuous values, the probability density function for normal distribution was estimated to compute the probability of the gene in each class. Thus, the posterior probability for a new sample could be computed by: where P(c j ) is the prior for class j, m ij and s ij are estimated mean and standard deviation for the i th gene in class j.

Functional Pathway Analysis
Ingenuity Pathway Analysis (Ingenuity Systems, Redwood City, CA), and Pathway Studio 7.0 (Ariadne Genomics, Rockville, MD) were used to derive curated molecular interactions, including both physical and functional interactions, and pathway relevance. The databases and software toolsets weigh and integrate information from numerous sources, including experimental repositories and text collections from published literature.
Gene Set Enrichment Analysis (GSEA.) GSEA (http://broad.harvard.edu/gsea/) was used to evaluate the enrichment of gene sets in prognostic groups defined by 5-year survival. All measured genes are ranked according to the differential expression between the two prognostic groups. An Enrichment Score (ES) for each gene set is computed by going through the ranked list from the top. ES is increased if the encountered gene is a member of the gene set or decreased if otherwise.
In this study, multiple gene sets were evaluated with the GSEA. In Multiple Hypothesis Testing, prognostic group labels are randomly permuted. A normalized enrichment score (NES) for each gene set is generated by averaging enrichment scores from all permutations. Statistical significance of the corresponding NES is indicated by false discovery rate (FDR) in the permutation analysis [33].

Transcriptional Profiles in NCI-60 Cell Panel
Genome-wide mRNA expression profiles in NCI-60 cell lines [34] were retrieved with CellMiner (http://discover.nci.nih.gov/ cellminer). The data were generated on Affymetrix U133A and normalized using the GCRMA method [35]. The signature genes were identified from the data file with gene symbols or UniGene Cluster IDs (for unknown genes).

Drug activity profiles in NCI-60
The drug activity data in NCI-60 were retrieved from Developmental Therapeutic Program at NCI/NIH through DTP Data Search (http://dtp.nci.nih.gov/dtpstandard/dwindex/ index.jsp). The latest screening results for each studied drug were used in the analysis. Growth inhibition was assessed from the changes in total cellular protein after 48 hours of drug treatment using a sulphorhodamine B assay. Drug activities (log 10 GI 50 ) were recorded across the 60 human cancer cell lines. GI 50 is the concentration required to inhibit cell growth by 50% compared with untreated controls. The activity profile of an agent consists of 60 such activity values, one for each cell line.

Defining Drug Sensitivity and Resistance
Drug activity data for Cisplatin, Carboplatin, Paclitaxel, and Etoposide was processed to define drug resistance and sensitivity in the NCI-60 lines as described before [36,37]. Specifically, for each drug, log 10 (GI 50 ) values were normalized across the 60 cell lines. Cell lines with log 10 (GI 50 ) at least 0.5 SDs above the mean were defined as resistant to the drug. Those with log 10 (GI 50 ) at least 0.5 SDs below the mean were defined as sensitive to the drug. The remaining cell lines with log 10 (GI 50 ) within 0.5 SDs were defined as intermediate in the range of drug responses.

Classification of chemosensitivity/resistance
The mRNA expression profiles of the 12-gene lung cancer signature were used to predict chemosensitivity/resistance in the cancer cell lines. For each drug, only sensitive and resistant cell lines were included in the analysis, while those with intermediate response were excluded from classification. A k-nearest neighbor method was used to classify chemoresponse to Paclitaxel. Neural network was used to classify drug response to Carboplatin, Gefitinib, and Erlotinib. Boosting trees (AdaBoost) was used to classify response to Etoposide. An ensemble learning method (Decorate) was used in classifying chemoresponse to Cisplatin. The classification results were evaluated with a leave-one-out cross validation. These algorithms were implemented in WEKA 3.4 [32].
Differential expression analysis in resistant and sensitive lung cancer cell lines Where Resistant_Mean is the mean expression of the group of resistant cell lines and Sensitive_Mean is the mean expression of the group of sensitive cell lines. In this study, a value of 1.5 (1.5 for over-expressed and 0.67 for under-expressed) is the threshold used in deciding if a gene is expressing differently. Statistical significance of the fold change is computed using two-tail, unequal variance two-sample t-tests. It's considered statistically significant if a p-value is #0.05.

Statistical Analysis
In Kaplan-Meier analysis, log-rank tests were used to assess the difference in probability of survival of different prognostic groups. Hazard ratio and concordance probability estimate (CPE) were used in the evaluation of different molecular prognostic signatures. If the model gives hazard ratio greater than 1, it means that patient samples predicted as high risk are more likely to have poor outcome. CPE values close to 1 represents high concurrence and good predictive power; CPE values close to 0.5 represents low concurrence and poor predictive power. All the analyses were performed with packages in R unless otherwise specified.

Supporting Information
Table S1 A 15-gene lung cancer prognostic signature. This gene signature was identified using pooled-variance t-tests and RELIEF algorithm. The expression of the 15 genes were used as covariates in Cox model and median risk score from training set was used as the cutoff point.