Gene Expression Signature of Cigarette Smoking and Its Role in Lung Adenocarcinoma Development and Survival

Background
Tobacco smoking is responsible for over 90% of lung cancer cases, and yet the precise molecular alterations induced by smoking in lung that develop into cancer and impact survival have remained obscure.

Methodology/Principal Findings
We performed gene expression analysis using HG-U133A Affymetrix chips on 135 fresh frozen tissue samples of adenocarcinoma and paired noninvolved lung tissue from current, former and never smokers, with biochemically validated smoking information. ANOVA analysis adjusted for potential confounders, multiple testing procedure, Gene Set Enrichment Analysis, and GO-functional classification were conducted for gene selection. Results were confirmed in independent adenocarcinoma and non-tumor tissues from two studies. We identified a gene expression signature characteristic of smoking that includes cell cycle genes, particularly those involved in the mitotic spindle formation (e.g., NEK2, TTK, PRC1). Expression of these genes strongly differentiated both smokers from non-smokers in lung tumors and early stage tumor tissue from non-tumor tissue (p<0.001 and fold-change >1.5, for each comparison), consistent with an important role for this pathway in lung carcinogenesis induced by smoking. These changes persisted many years after smoking cessation. NEK2 (p<0.001) and TTK (p = 0.002) expression in the noninvolved lung tissue was also associated with a 3-fold increased risk of mortality from lung adenocarcinoma in smokers.

Conclusions/Significance
Our work provides insight into the smoking-related mechanisms of lung neoplasia, and shows that the very mitotic genes known to be involved in cancer development are induced by smoking and affect survival. These genes are candidate targets for chemoprevention and treatment of lung cancer in smokers.


Introduction
Lung cancer is the leading cause of cancer death worldwide. Cigarette smoking is responsible for about 90% of lung cancers and decreases survival, [1] and yet the precise molecular alterations induced by smoking in lung that develop into cancer and impact survival have remained obscure. Using Affymetrix HG-U133A microarrays on 135 fresh frozen adenocarcinoma and paired non-tumor tissue samples from current, former and never smokers from the Environment And Genetics in Lung cancer Etiology (EAGLE) study (http://dceg.cancer.gov/eagle), we sought to identify the genes that are altered by smoking in lung, and those, within the smoking signature, that have a role in lung carcinogenesis and outcome from lung cancer. We chose adenocarcinoma, the predominant histological subtype of lung cancer, because it occurs in subjects with no history of smoking as well as in smokers, providing a range of exposures ideal for the study of smoking-induced carcinogenesis. Specifically, in early stage adenocarcinoma tissue we compared gene expression from current (C) and never (N) smokers and identified the major genes using stringent criteria for gene selection (p<0.001 and fold change >1.5), the Benjamini-Hochberg procedure [2] to calculate the False Discovery Rate (FDR), and Gene Ontology (GO) [3] to classify the gene functional categories. We then verified whether the comparison between former (F) and never (N) smokers identified similar genes. We performed Gene Set Enrichment Analysis (GSEA) [4] to identify common gene patterns where the single-gene analysis revealed only a few overlapping genes. We further explored whether the genes that differentiated lung tumors of smokers from never smokers (C/N and F/N) also differentiated early stage tumor tissue (T) from paired non-tumor (NT) tissue to confirm the role of these genes in smoking-related lung carcinogenesis. We finally explored the impact of the smoking signature on survival from lung cancer in smokers.
We validated C/N genes by real-time PCR in 68 samples used for the present microarray analysis, and confirmed them in 40 independent samples from EAGLE and a Mayo Clinic study of lung cancer.
Ospedale San Giuseppe, Milano. The healthy controls in EAGLE were randomly selected from the same residential area as the lung cancer cases. After description of the EAGLE study by the study personnel, and discussion with potential participants, written informed consent was obtained under a protocol approved by the Institutional Review Board of each participating hospital and by the National Cancer Institute (Bethesda, MD). Subjects in this gene expression study, 44-79 years old, had histologically confirmed primary adenocarcinoma of the lung, stages I-IV, and provided detailed smoking and medical history information.

Study population and sample collection
Overall, 180 adenocarcinoma and non-tumor tissue samples were selected for the analyses, including duplicate or triplicate samples from 14 subjects for quality control. Samples had been snap-frozen in liquid nitrogen within 20 minutes of surgical resection. A single pathologist confirmed the hospital-based diagnosis of adenocarcinoma, estimated the presence of malignant cells in each sample based on H&E-stained fresh frozen sections, and classified the samples as Tumor (T) and Non-Tumor (NT). From the original 180 samples, 148 provided sufficient quantity of high-quality RNA for microarray analyses; 13 additional samples were excluded because of technical problems. Normalization was conducted on the remaining 135 microarrays; the corresponding CEL files and information conforming to the MIAME guidelines are publicly available on the GEO database (accession number GSE10072). After normalization, 13 samples were excluded because of a low percentage of tumor cells in the tumor tissues. This report is based on 122 samples, of which 15 duplicates/triplicates were averaged, resulting in 107 final expression values from 58 tumor and 49 non-tumor tissues from 20 never smokers, 26 former smokers, and 28 current smokers. Quality assurance and distribution of cell types across smoking groups are described in Appendix S1A, S1B, and S1C.
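The sample accounting above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the variable names are illustrative and not taken from the study, while every number comes from the paragraph above.

```python
# Sample counts as reported in the text; variable names are illustrative.
selected = 180               # samples originally selected
with_good_rna = 148          # sufficient quantity of high-quality RNA
technical_failures = 13      # excluded for technical problems
low_tumor_content = 13       # excluded after normalization
duplicates_averaged = 15     # duplicates/triplicates folded into averages

normalized = with_good_rna - technical_failures    # arrays that were normalized
analyzed = normalized - low_tumor_content          # arrays this report is based on
final_values = analyzed - duplicates_averaged      # expression values after averaging

tumor, non_tumor = 58, 49
never, former, current = 20, 26, 28
subjects = never + former + current

print(normalized, analyzed, final_values, subjects)
```

The totals agree with the text: 135 normalized arrays, 122 analyzed samples, 107 final expression values (58 tumor plus 49 non-tumor), and 74 subjects.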

Statistical analysis
All statistical analyses were performed using the R programming language. Gene expression data were processed and normalized using the Bioconductor affy package, based on the Robust Multichip Average (RMA) method [5] for single-channel Affymetrix chips. All 22,283 probe sets, summarized by the RMA measure, were used in class comparison analyses.
Average linkage hierarchical clustering of samples was based on one minus Pearson correlation as the dissimilarity metric.
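As an illustration of the dissimilarity metric, the sketch below computes one minus the Pearson correlation between expression profiles and the average-linkage distance between two clusters. It is written in plain Python rather than R, uses toy data, and the function names are hypothetical; it shows the two quantities that drive the clustering, not the full agglomerative procedure used in the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def dissimilarity_matrix(samples):
    """Symmetric matrix of 1 - Pearson correlation between samples."""
    n = len(samples)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = 1.0 - pearson(samples[i], samples[j])
    return d

def average_linkage(d, cluster_a, cluster_b):
    """Mean pairwise dissimilarity between two clusters of sample indices."""
    total = sum(d[i][j] for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

# Toy profiles: samples 0 and 1 are perfectly correlated, sample 2 is reversed.
toy = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
d = dissimilarity_matrix(toy)
```

With this metric, perfectly correlated samples have dissimilarity 0 and perfectly anti-correlated samples have dissimilarity 2, regardless of overall expression level.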
An ANOVA analysis adjusting for sex was used to test whether genes were differentially expressed between smoking groups (C/N and F/N), between tumor tissue and non-tumor tissue (T/NT), or by pack years of cigarette smoking. Further analyses adjusted by tumor grade or excluding 6 subjects with emphysema or chronic bronchitis or 3 subjects who received chemotherapy prior to the study were conducted, with essentially unaltered results. For analyses including paired tissues (T/NT tissue samples from the same subjects), a linear mixed effects model was used to account for intra-person correlation.
To limit false positive findings, genes were considered statistically significant if their p-values were less than the stringent threshold of 0.001. Under the null hypothesis of no difference in expression profiles, and considering the analysis of 22,283 probes, we expect that by chance the average number of false positive findings will be ≤23 (22,283 × 0.001 ≈ 22.3). We used the Benjamini-Hochberg [2] procedure to calculate the False Discovery Rate (FDR). We further restricted significant genes to those which showed at least a 1.5-fold ratio of geometric means of expression between two groups. The gene selection criteria of p<0.001 (two-sided) and fold change >1.5 are referred to as "stringent criteria".
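The selection quantities described above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the study's R code: the function names are hypothetical, and the fold-change helper assumes group means are on the log2 scale (as RMA values are), so that a difference of means corresponds to a ratio of geometric means.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of null tests with p < alpha when all nulls are true."""
    return n_tests * alpha

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def fold_change_from_log2(mean_log2_a, mean_log2_b):
    """Ratio of geometric means, given group means on the log2 scale."""
    return 2.0 ** (mean_log2_a - mean_log2_b)

n_expected = expected_false_positives(22283, 0.001)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A probe would then pass the stringent criteria when its two-sided p-value is below 0.001 and the fold change between groups, in either direction, exceeds 1.5.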
The Cox Proportional Hazards model [6] was used to estimate the effect of gene expression changes in C/N on survival from lung cancer in smokers. Of the 74 subjects included in this study (all stages), 34 (22 smokers) were alive, and 40 (32 smokers) were deceased as of May 2007. Among the deceased subjects, 36 died of lung cancer. The remaining 4 (2 smokers) died of other cancers and were censored at time of death in the analysis. The time from lung cancer to death or date of last follow-up was between 28 days and 5.0 years for the deceased subjects, and between 3.7 and 5.7 years for the subjects alive in May 2007. The relative risk of gene expression was defined as the hazard ratio associated with a one standard deviation change in expression. Analyses were adjusted for stage, sex, and smoking. Age was similarly distributed across the groups and was not adjusted for in the analysis.
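The definition of relative risk as a hazard ratio per standard deviation can be made concrete as follows. In this sketch, `beta_per_unit` stands for a Cox regression coefficient per unit of expression; that name, the helper functions, and the toy values are hypothetical and chosen only to show the rescaling.

```python
from math import exp, sqrt

def sample_sd(values):
    """Sample standard deviation (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def hazard_ratio_per_sd(beta_per_unit, expression_values):
    """Hazard ratio for a one-SD increase in expression.

    Under the Cox model the log hazard is linear in expression, so a
    one-SD increase multiplies the hazard by exp(beta * SD).
    """
    return exp(beta_per_unit * sample_sd(expression_values))

# Toy expression values on the log2 scale (not study data).
toy_expression = [1.0, 2.0, 3.0, 4.0, 5.0]
hr = hazard_ratio_per_sd(0.5, toy_expression)
```

Equivalently, standardizing each gene's expression to unit variance before fitting makes the fitted coefficient itself the log hazard ratio per standard deviation, which puts genes with different expression ranges on a common scale.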

Analysis of total plasma cotinine concentration by gas chromatography/mass spectrometry
We verified the self-reported current smoking status by measuring plasma cotinine levels. The total cotinine (free plus cotinine N-glucuronide) concentration in plasma was quantified by GC/MS analysis using a method similar to that used for urinary cotinine, [7] with the addition of a solid phase extraction step carried out on an MCX column (Waters Corporation, Milford, MA).
One individual who reported to have quit smoking 2.6 years before the study had high cotinine levels (135 ng/ml) and was reclassified as a current smoker.

Gene Set Enrichment Analysis
Gene Set Enrichment Analysis (GSEA) [4] was used to compare expression in groups of genes (gene-sets), between different tissues or between different comparison groups within the same tissue. GSEA analysis reveals a pattern of common gene-sets even when single-gene analysis reveals very few overlapping genes between groups. We modified the standard GSEA method by substituting an ANOVA test for the standard two-sample t-test to adjust for sex. Furthermore, we changed the permutation test for calculating the p-values by permuting residuals and using as weights the observed ANOVA coefficients divided by the standard error values. Up- and downregulated genes were included in different gene-sets for the analyses.
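For orientation, the core of the standard GSEA statistic is a weighted running sum over a ranked gene list. The sketch below implements that standard enrichment score in Python with toy gene names; it does not reproduce the modifications described above (the sex-adjusted ANOVA statistic and the residual-permutation p-values), and the function name and arguments are illustrative.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Standard GSEA running-sum enrichment score.

    ranked_genes: gene names sorted by decreasing ranking statistic.
    scores: dict mapping each gene to its ranking statistic.
    gene_set: set of genes whose enrichment is being tested.
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some but not all ranked genes")
    hit_total = sum(abs(scores[g]) ** weight
                    for g, hit in zip(ranked_genes, in_set) if hit)
    if hit_total == 0:
        raise ValueError("all gene-set statistics are zero")
    running = 0.0
    best = 0.0
    for g, hit in zip(ranked_genes, in_set):
        if hit:
            running += abs(scores[g]) ** weight / hit_total  # step up on a hit
        else:
            running -= 1.0 / n_miss                          # step down on a miss
        if abs(running) > abs(best):
            best = running                                   # largest deviation from zero
    return best

# Toy ranking (hypothetical gene names and statistics).
toy_scores = {"g1": 4.0, "g2": 3.0, "g3": 2.0, "g4": 1.0}
toy_ranked = ["g1", "g2", "g3", "g4"]
es_top = enrichment_score(toy_ranked, toy_scores, {"g1", "g2"})
es_bottom = enrichment_score(toy_ranked, toy_scores, {"g3", "g4"})
```

A gene-set concentrated at the top of the ranking gives a score near +1 and one concentrated at the bottom gives a score near -1; significance is then assessed by comparing the observed score with scores from permuted data.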

Molecular function classification of smoking-altered genes
Gene Ontology was used to assign the genes to functional categories. [3] GoMiner [8] was utilized to rank-order the GO categories for the genes identified in the smoking comparisons.

Quantitative PCR validation and confirmation in independent samples
We used quantitative real-time PCR (QRTPCR) to confirm the differential expression of 19 C/N selected genes (20 probes), including 14 genes from T and 5 from NT analyses. Primer and probe sets for the selected genes as well as control probes for GUSB and S18 (ABI) were run on 7500 Taqman under the manufacturer's standard protocol. Ct values were normalized based on GUSB expression.
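Normalizing Ct values to a reference gene is commonly done with the delta-Ct calculation. The sketch below assumes the widely used 2^(-ΔΔCt) approach with approximately 100% amplification efficiency; the text states only that Ct values were normalized to GUSB, so the exact formula, the function names, and the toy Ct values here are assumptions for illustration.

```python
def delta_ct(ct_target, ct_reference):
    """Ct of the target gene minus Ct of the reference gene (here GUSB)."""
    return ct_target - ct_reference

def relative_expression(ct_target, ct_reference):
    """Expression relative to the reference, assuming doubling per cycle."""
    return 2.0 ** -delta_ct(ct_target, ct_reference)

def fold_change(group_a, group_b):
    """2^(-ddCt) fold change of group_a over group_b.

    Each group is a list of (ct_target, ct_reference) pairs, one per sample.
    """
    mean_a = sum(delta_ct(t, r) for t, r in group_a) / len(group_a)
    mean_b = sum(delta_ct(t, r) for t, r in group_b) / len(group_b)
    return 2.0 ** -(mean_a - mean_b)

# Toy Ct values (not study data): current vs never smokers for one gene.
current_smokers = [(24.0, 26.0), (25.0, 27.0)]
never_smokers = [(26.0, 26.0), (27.0, 27.0)]
fc = fold_change(current_smokers, never_smokers)
```

Because a lower Ct means more starting template, a target that crosses threshold two cycles earlier than GUSB in one group and at the same cycle in the other corresponds to a four-fold higher normalized expression.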
Validation assays were performed in 68 samples used in the original microarray analyses, including 43 T (27 C and 16 N smokers), and 25 NT (18 C and 7 N smokers).
Confirmation assays were performed in 40 independent samples, including 19 T (12 C and 7 N smokers) and 21 NT samples (12 C and 9 N smokers). These samples were collected in EAGLE (10 T samples from 7 C and 3 N smokers, and 12 NT samples from 7 C and 5 N smokers; these samples were not used for the microarray analyses), and from the Mayo Clinic, Rochester, MN (9 T and 9 NT paired samples from 5 C and 4 N smokers).

The molecular signature of cigarette smoking in lung adenocarcinoma
To investigate the molecular changes associated with smoking in the tumor tissue, we compared gene expression changes between current and never (C/N) smokers (Table 1). To avoid potential alteration of gene expression due to advanced tumor status, we limited smoking comparisons in tumor tissue to the early stages (stages I and II). Unless specified differently, ''T'' samples represent early stage adenocarcinomas. Results from the advanced tumor stage tissues are reported for completeness in Appendix S2C.
The GoMiner results (Appendix S2D) confirmed that the mitosis genes (12 altered genes among the 127 mitotic genes on the HG-U133A chip, p<0.001), and more generally those involved in cell cycle, were the most commonly altered in the tumor tissue (Table 2).

Lung cancer gene expression is similar in current and former smokers
To verify whether the C/N smoking signature in the tumor was present also in former smokers, we compared the C/N and F/N signatures in T and found 26 probes (22 down- and 4 upregulated, representing 21 genes) that differentiated both C/N and F/N using stringent selection criteria (Appendix S2E). Some of these genes, e.g., STOM, SSX2IP, TRPC6, APLP2 (2 probes), and DHRS7, exhibited a persistent alteration even in subjects (n = 6) who quit smoking more than 20 years before the study. The GSEA analysis showed that among the 64 up- and 98 downregulated probes found in the C/N comparison in T, 58 and 90 probes, representing 50 up- and 73 down-regulated genes, were also up- and down-regulated, respectively, in the F/N smoking comparison (p<0.001, Fig. 1, and Appendix S2F, S2G). All cell cycle genes that differentiated C/N were also altered in F/N, although less prominently (Table 2), indicating that alterations of these genes persist following smoking cessation. Importantly, the mitosis/cell cycle genes identified in C/N and F/N also differentiated the early stage tumor from the non-tumor tissue samples (T/NT, paired analysis) (Table 2), while pack years of cigarette smoking, a composite index of intensity and duration that does not consider the time when smoking occurred, were not associated with gene expression in either T or NT.

Smoking signature in the noninvolved lung tissue
The C/N comparison in NT revealed 28 up- and 75 downregulated probes, representing 25 up- and 73 down-regulated genes with the stringent selection criteria (Table 1, and Appendix S3A, S3B). As expected, the CYP1B1 gene, known to be induced by smoking, [9,10] was strongly up-regulated. The GoMiner results showed that the most smoking-altered genes were involved in cellular defense response (5 of 90 cellular defense genes on the chip, p<0.001), and more generally in immune response (Appendix S3C). MACF1, UBE2I, and CBX7 (p<0.001), and C16orf30 (p = 0.001) were shared between T and NT C/N comparisons. C16orf30 and UBE2I, both on chromosome 16p13.3, are located within 246 kb of each other, but they do not appear to share specific transcriptional regulation mechanisms (Appendix S4A). The GSEA analysis revealed some similarities between T and NT in the overall pattern of smoking-induced alteration (p = 0.08 and 0.04, for up- and down-regulated genes, respectively, Appendix S4B, S4C, and S4D). Notably, NEK2 and TTK were among those similarly altered in both T and NT in the GSEA analysis. In contrast, the F/N comparison in NT showed no statistically significant genes (Table 1), and was not further explored.

Smoking-associated gene expression signature and survival from lung cancer
We studied the overall gene expression signature of smoking in T and NT (98+64 C/N in T, 75+28 C/N in NT, minus 3 overlapping probes between T and NT, for a total of 262 probe-sets representing 230 genes) in relation to survival from adenocarcinoma in smokers (n = 54, Appendix S5A). Since only 262 probe-sets were included in this analysis, we used a less stringent criterion of p<0.01 for gene selection (Table 3). Altered expression in NT of genes involved in the mitotic spindle formation, e.g., NEK2 (p<0.001) and TTK (p = 0.001), was associated with a 3-fold increased mortality risk (Table 3, analysis adjusted for stage, sex, and smoking).

Validation and confirmation of gene expression smoking signature
We selected 19 genes (20 probes) for validation by QRTPCR, including 14 genes for T and 5 for NT tissue, based on fold change (>2) and cancer relevance. Validation was based on 68 samples, including 43 T and 25 NT, also used for the microarray analysis. All 19 genes were upregulated in the C/N comparison in these samples (Table 4).
Table 2. Cell cycle genes differentiating current from never smokers (C/N) in the early stage tumor (T) tissue samples, and corresponding values in the former/never smoker (F/N) and in the smokers' paired tumor/non-tumor tissue (T/NT) comparisons.
Confirmation was based on 40 independent samples (19 T and 21 NT) from EAGLE (samples not used for microarray analysis) and the Mayo Clinic, Rochester, MN. All the 14 genes in T and 4 of 5 genes in NT were up-regulated by smoking also in the independent samples (Table 4).

Discussion
In a population-based study with fresh frozen tissue samples of adenocarcinoma and noninvolved lung tissue (mostly paired samples), we identified a smoking signature that persists years after smoking cessation and is related to lung cancer development and survival.
Aneuploidy and chromosome instability are two of the most common abnormalities in cancer cells that arise through unequal segregation of chromosomes between daughter cells during mitosis. Thus, mitotic alterations are highly relevant for carcinogenesis. We found that smoking induces deregulation of this very mitotic process proceeding from lung tissue changes through cancer development to cancer death or survival. In fact, the smoking signature we identified comprises genes that regulate the mitotic spindle formation. These genes, such as NEK2 [11,12] and CENPF [11] (both on 1q32-q41), TPX2 [13,14] and STK6 (or AURKA) [15] (related to the Aurora-A activation pathway important in tumor progression [16]), TTK (linked to cell mitosis through EGFR, [17] a critical drug target for lung adenocarcinoma [18]), and BIRC5 (Survivin), [19] have all been found overexpressed in smoking-related tumors. While previous studies have proposed these genes as targets for therapeutic interventions, [16,18-21] our work suggests that they may be targets for chemoprevention in smokers as well. In fact, they were strongly induced by smoking in the early stage tumor tissue and some, e.g., NEK2 and TTK, were also associated with increased mortality risk. The latter finding was most evident in non-tumor tissue, likely reflecting the widely recognized field-cancerization effect by smoking, [22] while in the tumor tissue, smoking-related genes' effects on survival may be masked by extensive molecular alterations occurring during tumorigenesis.
In the non-tumor tissue, current smoking strongly altered immune response genes, consistent with the defense mechanisms of the lung tissue against the acute toxic effects of smoking. Among the genes most strongly down-regulated in NT was CX3CR1, located on chromosome 3p21.3, an area known to be often deleted in lung cancer, [23] particularly in smokers. [24] Current knowledge of gene expression altered by cigarette smoking is based on bronchoscopy-obtained airway epithelial cells or macrophages [9,25-27] or peripheral leukocytes [10] from healthy smokers rather than directly on lung tissue. The few studies with lung tissue samples are very small [28] or used RNA amplification [29] or RNA pooling [30] methods. Our results are consistent with some previous findings, such as smoking-related alteration of CYP1B1 [9,10] or of the mitotic pathway in cancer survival. [29] However, earlier studies were often limited by small sample size, or lacked information on potential confounders, or lacked paired tumor and non-tumor lung tissue samples needed to distinguish gene changes involved in lung carcinogenesis from those representing a transient smoking effect. We overcame these pitfalls with a relatively large sample size of fresh tumor and non-tumor lung tissues, detailed covariate information (e.g., sex, age, stage, previous lung diseases or chemotherapy), biochemical validation of the smoking status, and confirmation of the main findings in independent tissue samples.
In conclusion, our study provides clues on how cigarette smoking affects lung cancer development and survival. Functional assays to confirm these findings are warranted. If confirmed, these genes could become important targets for chemoprevention and treatment for lung cancer in smokers.

Supporting Information
Appendix S1 Quality Assurance.
1A Description of analysis of sample quality assurance
1B Samples' description
1C Surfactant genes in Tumor (T) and Non-Tumor (NT) lung tissues by smoking
Appendix S4 Comparison between Tumor (T) and Non-Tumor (NT) lung tissue for the genes whose expression significantly