Phase I Metabolic Genes and Risk of Lung Cancer: Multiple Polymorphisms and mRNA Expression

Polymorphisms in genes coding for enzymes that activate tobacco lung carcinogens may generate inter-individual differences in lung cancer risk. Previous studies had limited sample sizes, poor exposure characterization, and only a few single nucleotide polymorphisms (SNPs) tested in candidate genes. We analyzed 25 SNPs (some previously untested) from six phase I metabolic genes, including cytochrome P450s, microsomal epoxide hydrolase, and myeloperoxidase, in 2101 primary lung cancer cases and 2120 population controls from the Environment And Genetics in Lung cancer Etiology (EAGLE) study. We evaluated the main genotype effects and genotype-smoking interactions in lung cancer risk overall and in the major histology subtypes. We tested the combined effect of multiple SNPs on lung cancer risk and on gene expression. Findings were prioritized based on significance thresholds and consistency across different analyses, and accounted for multiple testing and prior knowledge. Two haplotypes in EPHX1 were significantly associated with lung cancer risk in the overall population. In addition, CYP1B1 and CYP2A6 polymorphisms were inversely associated with adenocarcinoma and squamous cell carcinoma risk, respectively. Moreover, the association between CYP1A1 rs2606345 genotype and lung cancer was significantly modified by intensity of cigarette smoking, suggesting an underlying dose-response mechanism. Finally, an increasing number of variants in the CYP1A1/A2 genes was associated with significant protection in never smokers and with risk in ever smokers. Results were supported by differential gene expression in non-tumor lung tissue samples, with down-regulation of CYP1A1 in never smokers and up-regulation in smokers associated with the CYP1A1/A2 SNPs. The significant haplotype associations emphasize that the effect of multiple SNPs may be important despite null single-SNP associations, and that it warrants consideration in genome-wide association studies (GWAS). Our findings emphasize the necessity of post-GWAS fine mapping and SNP functional assessment to further elucidate cancer risk associations.


Introduction
Lung cancer is the second most common malignancy and has the highest cancer mortality rate worldwide, with an estimated 161,840 individuals expected to succumb to the disease in 2008 in the US [1]. Tobacco smoking is the dominant causal factor for lung cancer; however, fewer than 20% of cigarette smokers develop the disease [2], suggesting that inherited genetic factors may also be important risk determinants. Genetic variation at tobacco carcinogen metabolizing enzymes may lead to interindividual differences in the level of internal carcinogenic dose and to differential risk for individuals with similar exposures [3]. For this reason, genes that encode enzymes activating harmful chemicals are suitable candidates for lung cancer susceptibility studies and have been intensively studied [4]. Nevertheless, the available published data generally offer inconsistent results [5], due to population heterogeneity, low sample size, poor characterization of the exposure, and a few polymorphisms tested with low power to address the presence of their joint effects.
Here we addressed these issues in the analysis of candidate genes in phase I metabolism and lung cancer susceptibility, taking advantage of a large sample size and detailed epidemiological and clinical information of the Environment And Genetics in Lung cancer Etiology (EAGLE) study [6]. Furthermore, we integrated results on polymorphisms with data on expression from the same genes and the same subjects, for the first time in the context of a population study of phase I metabolic genes and lung cancer.
We explored the role of 25 single nucleotide polymorphisms (SNPs) covering important genes involved in the activation of carcinogens from cigarette smoking: cytochrome P450s (CYP1B1, CYP1A1, CYP1A2, and CYP2A6), microsomal epoxide hydrolase (EPHX1), and myeloperoxidase (MPO). We also included SNPs not previously analyzed, thus providing wide loci coverage in areas previously understudied.

Candidate genes
Many of the chemical carcinogens in tobacco smoke are members of the polycyclic aromatic hydrocarbon (PAH) family [7]. Cytochrome P450 enzymes activate PAHs [8] to epoxide intermediates, which are converted by epoxide hydrolase to the carcinogenic diol-epoxides that interact with DNA or proteins to form adducts. In human lung, for example, benzo[a]pyrene (B[a]P), a major carcinogenic constituent in tobacco smoke, is first metabolically activated by cytochrome P450 1A1 (CYP1A1) and cytochrome P450 1B1 (CYP1B1) to form B[a]P-7,8-dihydroepoxide, which is further hydrolyzed by microsomal epoxide hydrolase (EPHX1) to (F)-benzo[a]pyrene-trans-7,8-dihydrodiol. This compound is further metabolized by CYP1B1 to form benzo[a]pyrene-7,8-dihydrodiol-9,10-epoxide [9], the most mutagenic and carcinogenic metabolite. CYP1A1 and CYP1B1 are overexpressed in a wide range of human cancers, including breast, colon, lung, brain and testicular cancer [10,11]. Tobacco smoking can induce CYP1A1 and CYP1B1 proteins to levels up to 10-fold higher, particularly in subjects (about 10% of the general population) who are more sensitive to enzyme induction [12]. Polymorphisms in CYP1A1 (chr15q24.1) are the most frequently studied in relation to lung cancer [13][14][15][16][17], but results are limited to only a few SNPs (rs4646903, rs1048943, and rs1799814) that are more frequent in Asian than in Caucasian populations. Functional studies for these SNPs have predicted an increased catalytic activity and higher levels of hydrophobic DNA adducts [18]. In close proximity to, and in strong linkage disequilibrium with, CYP1A1 is the cytochrome P450 1A2 (CYP1A2) gene, characterized by a similar activity [19]. Our study included 8 SNPs from the CYP1A1/A2 region not previously studied in case-control studies of lung cancer, and some of these SNPs were not included in the platforms used for recent genome-wide association studies (GWAS) [20][21][22][23][24].
The CYP1B1 gene is located on chr2p22.2 and characterized by at least 178 SNPs (ncbi.nlm.nih.gov/dbSNP), including 4 common SNPs that encode amino acid substitutions at codons 48, 119, 432, and 435. These four common amino acid variants alter catalytic activity depending on the substrate, e.g., increase for estradiol hydroxylation [25] and decrease for B[a]P epoxidation and phenylimidazopyridine metabolism [26]. Relatively few studies have reported on CYP1B1 polymorphisms and lung cancer susceptibility, with inconsistent results [27][28][29][30]. We selected 7 SNPs in the CYP1B1 gene, 6 of which had not previously been studied in association with lung cancer.
Microsomal epoxide hydrolase (EPHX1 gene, chr1q42.12) plays a dual role in the metabolism of PAHs and other environmental pollutants: detoxification or bioactivation, depending on the substrate. It hydrolyzes reactive compounds such as arene, alkene, and aliphatic epoxides, which are generated by cytochrome P450 and other phase I enzymes, to the corresponding dihydrodiols through the trans addition of water [31]. On the other hand, less reactive dihydrodiols from PAHs can be substrates for further transformation into dihydrodiol-epoxides such as the carcinogen benzo[a]pyrene-7,8-diol-9,10 epoxide [32,33]. EPHX1 appears to be expressed in all tissues, but the highest concentrations have been found in the liver, gonads, kidneys, lungs, and bronchial epithelial cells [34]. According to the NCBI's dbSNP database, 119 SNPs have been identified in the EPHX1 gene region, 20 of which are part of the HapMap database. Functional expression studies are available on a limited number of these polymorphisms and showed effects on hydrolase activity in both directions [35][36][37]. Few studies have investigated the association between coding EPHX1 polymorphisms and lung cancer susceptibility, with disparate findings mainly limited to the two non-synonymous SNPs rs1051740 and rs2234922, as reported by Kiyohara et al. in their review [38] and in more recent studies [39,40]. We included 8 SNPs from the EPHX1 gene, 7 of which had not previously been studied in association with lung cancer.
The human cytochrome P450 2A6 (CYP2A6) is responsible for the metabolism of different exogenous compounds including nitrosamines, aflatoxin B1, and other xenobiotic substrates [41]. In addition, CYP2A6 catalyzes nicotine C-oxidation to cotinine, and the subsequent hydroxylation of cotinine to 3-OHcotinine [42]. Several genetic polymorphisms including point mutations and deletions have been reported and studied in association with lung cancer, with conflicting results in populations from different ethnicities [43][44][45]. In particular, the polymorphism CYP2A6 rs1801272 selected for this study, which causes an amino acid change from Leu to His, has been a matter of dispute: studies found a protective association with lung cancer and amount of cigarette smoke [46], which has not been consistently replicated.
Myeloperoxidase (MPO gene, chr17q22) is a lysosomal enzyme present in high concentrations in human lung due to recruitment of neutrophils [47]; it activates B[a]P [48] as well as aromatic amines [49] in tobacco smoke and generates carcinogen-free radicals [50]. A single base substitution, −463G>A, in the promoter region of MPO reduces transcription activity and DNA adduct levels in bronchoalveolar lavages of smokers [51]. These mechanisms have supported protective effects of the MPO −463A allele against lung cancer [52]. However, this possible inverse association with lung cancer risk has remained controversial [53]. Therefore, further study of the effects of this MPO polymorphism on lung cancer is warranted, and we included this SNP in our selection.
A precise characterization of the smoking exposure is essential to successfully identify molecular mechanisms involved in tobacco-related lung carcinogenesis. The EAGLE study provides detailed characterization of tobacco smoking including quantitative information on total exposure and daily intake of cigarette smoking. Using this information, we evaluated genotype-smoking interactions by likelihood ratio test, and compared the contributions of total exposure (pack-years) and intensity (cigarettes per day) of smoking using the linear-exponential model for smoking excess odds ratio (EOR) [54]. This model takes into account the correlation between the two smoking variables by describing the EOR per pack-year in terms of delivery rate of exposure. Our analyses also included stratified groups based on major lung cancer histology subtypes. Furthermore, we tested whether the overall lung cancer risk was determined by the combined action of multiple SNPs within the same gene, despite possible null effects in single SNP associations. We analyzed multiple SNPs jointly and performed gene haplotype analysis. The information on gene expression was limited to a subgroup of 44 subjects with adenocarcinoma, but can help clarify biological mechanisms behind the measured associations of lung cancer with polymorphisms in phase I metabolic genes. We prioritized our findings based on a low p-value threshold (p-value ≤ 0.01) and consistency across different analyses. In order to address concerns related to multiple testing and a priori knowledge considerations, we computed the False Positive Report Probability (FPRP) [55].
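The FPRP mentioned above is cited from [55] without a formula in this text. As a minimal sketch, assuming the commonly stated definition FPRP = α(1 − π) / (α(1 − π) + (1 − β)π), where α is the observed p-value, 1 − β is the statistical power and π is the prior probability of a true association, the calculation could look as follows. The function name and the power value used in the example are illustrative assumptions, not values taken from the study.

```python
def fprp(p_value, power, prior):
    """False Positive Report Probability under the assumed definition.

    p_value : observed significance level (alpha)
    power   : assumed statistical power (1 - beta) for the effect of interest
    prior   : assumed prior probability that the association is real
    """
    if not (0 < prior < 1):
        raise ValueError("prior must lie strictly between 0 and 1")
    false_pos = p_value * (1 - prior)   # chance of a significant null finding
    true_pos = power * prior            # chance of a significant true finding
    return false_pos / (false_pos + true_pos)


# Hypothetical inputs: p = 0.003, power = 0.8, prior = 0.10
example = fprp(0.003, 0.8, 0.10)
```

A lower prior raises the FPRP for the same p-value, which is why findings are interpreted against a stated range of prior probabilities.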

Gene polymorphism and population characteristics
The 25 SNPs selected from phase I metabolic genes are presented in Table 1. The gene coverage is described in Supplemental Figure S1. All analyses were restricted to subjects with at least a 90% genotype call rate (i.e. 34 subjects were excluded). All 25 SNPs passed the test for Hardy-Weinberg equilibrium genotype proportions among the 2041 controls, with a p-value of 0.05 as the threshold. Table 2 shows the frequency distributions and lung cancer association estimates for the main covariates among the 4016 subjects included in the study. Age, sex and residential area were unrelated to case status, since frequency matching on these factors was part of the design. As expected, all smoking-related variables were associated with lung cancer, with risk increasing with increasing smoking exposure. Recent former smokers (up to 5 years) showed a higher risk for lung cancer compared to current smokers. This is likely an artifact: people typically quit smoking because of pre-clinical symptoms of lung cancer, so the excess does not reflect an increased risk in those who quit smoking [56]. In the analyses of genetic association we added the covariate "years since quit smoking" to the model, to adjust both for this reverse causation and for the attenuation of the risk over time. Table 3 reports results with p trend ≤ 0.05 for the main effect associations between each SNP and lung cancer risk overall and by histology. The complete list of results is reported in Supplemental Table S1.
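The Hardy-Weinberg check among controls can be illustrated with a short sketch. The text does not say which test was used, so the example below assumes a one-degree-of-freedom chi-square goodness-of-fit test on genotype counts; the function name and the counts in the example are hypothetical.

```python
import math


def hwe_chi2_test(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg proportions.

    Takes counts of the three genotypes for a biallelic SNP and returns
    (chi-square statistic, p-value). Assumes both alleles are observed.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical control counts; a SNP "passes" here when p_value >= 0.05
chi2_stat, p_val = hwe_chi2_test(1200, 720, 121)
passes = p_val >= 0.05
```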

SNP and lung cancer risk overall and by histology
In adenocarcinoma cases only (test for heterogeneity by histology: p heterog = 0.066), the minor allele of CYP1B1 rs10175368 was significantly protective for lung cancer (OR = 0.80, 95%CI = 0.69-0.93, p trend = 0.003) and a similar protective effect was nominally significant (i.e. p-value ≤ 0.05) for the CYP1B1 rs9341266 polymorphism. The cumulative number of variants in CYP1B1 rs9341266 and CYP1B1 rs10175368 also conferred a significant protection for lung cancer in adenocarcinoma cases only (OR = 0.83, 95%CI = 0.74-0.94, p trend = 0.002; test for heterogeneity by histology: p heterog = 0.058), in concordance with the two results from the single SNP analyses.
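As a worked check of how the reported odds ratios, confidence intervals and p-values relate, the sketch below back-calculates an approximate two-sided Wald p-value from a published OR and its 95% CI. It assumes the interval is symmetric on the log scale; because the inputs are rounded, the result only approximates the reported p trend. The function name is illustrative.

```python
import math


def wald_from_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Approximate SE, z and two-sided p-value from an OR and its 95% CI."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail
    return se, z, p_value


# CYP1B1 rs10175368 in adenocarcinoma: OR = 0.80, 95%CI = 0.69-0.93
_, _, p_single = wald_from_ci(0.80, 0.69, 0.93)
# Cumulative variant count: OR = 0.83, 95%CI = 0.74-0.94
_, _, p_joint = wald_from_ci(0.83, 0.74, 0.94)
```

Both back-calculated values fall close to the reported p trend values of 0.003 and 0.002.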

Genotype-smoking interaction
We repeated the analyses within subgroups defined by smoking status (never and ever smokers) in all cases and controls and, separately, in adenocarcinoma cases only and all controls (see Table 4 for the single SNP analysis, and Supplemental Table S2 for the joint SNP analysis). The other histology groups included too few never smokers to perform a meaningful analysis.
Three SNPs in the chr15q24.1 region (CYP1A1/A2) showed a protective effect for lung cancer among never smokers but a tendency towards increased risk of lung cancer in ever smokers, with a significant genotype-smoking interaction for CYP1A1 rs2606345 (p interact = 0.005) and a nominally significant genotype-smoking interaction for the two SNPs in CYP1A2.
We further explored the significant genotype-smoking interaction in CYP1A1 rs2606345 by means of the linear-exponential model for smoking excess odds ratio [54], and evaluated whether the variation in smoking risk by genotype resulted from the interaction with smoking intensity or with total pack-years, and whether this interaction was present among other categories of smokers such as current or former smokers. Results are shown in Figure 1. The EOR per pack-year in current smokers compared to never smokers (Figure 1A and Figure 1B) increased with increasing number of cigarettes per day, reaching a plateau for subjects homozygous for the CYP1A1 rs2606345 major allele (Figure 1A) and, in contrast, increasing exponentially for subjects heterozygous or homozygous for the minor allele (Figure 1B). The same analysis of EOR per pack-year in former smokers versus never smokers (Figure 1C and Figure 1D) similarly showed that the increase in EOR with cigarettes per day was lower in homozygous major allele carriers (Figure 1C) than in heterozygous or homozygous minor allele carriers (Figure 1D), but here the EOR per pack-year reached a plateau in both groups of subjects. Panel E in Figure 1 reports the estimated deviances and p-values for the genotype-smoking interaction among current and former smokers for the model including both interaction terms (genotype by pack-years and genotype by cigarettes per day), and for intermediate models including only one of the two interaction terms. The overall genotype-smoking interaction was stronger among current smokers (p interact = 0.009) than among former smokers (p interact = 0.124). Among current smokers, the removal of pack-years from the model did not degrade fit relative to the full model (p = 0.209), whereas the removal of cigarettes per day did degrade fit (p = 0.022), suggesting that the genotype interaction effects resulted from cigarettes per day and not pack-years.
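The shape of the curves described above can be illustrated with a small sketch. The text cites the linear-exponential model [54] without writing it out, so the form used here is an assumption: OR = 1 + β · d · exp(φ1 · ln n + φ2 · (ln n)²), with d in pack-years and n in cigarettes per day. All parameter values below are hypothetical and chosen only to contrast a levelling-off pattern with a steadily increasing one; they are not estimates from the study.

```python
import math


def eor_per_pack_year(cigs_per_day, beta, phi1, phi2):
    """Assumed EOR per pack-year at a given smoking intensity."""
    log_n = math.log(cigs_per_day)
    return beta * math.exp(phi1 * log_n + phi2 * log_n ** 2)


def odds_ratio(pack_years, cigs_per_day, beta, phi1, phi2):
    """Assumed linear-exponential odds ratio relative to never smokers."""
    return 1.0 + pack_years * eor_per_pack_year(cigs_per_day, beta, phi1, phi2)


intensities = [5, 10, 20, 40]

# Hypothetical parameters: a negative phi2 makes the EOR per pack-year
# flatten (and eventually decline) at high intensity.
levelling = [eor_per_pack_year(n, 0.02, 1.2, -0.2) for n in intensities]

# Hypothetical parameters: with phi2 = 0 the EOR per pack-year keeps rising.
rising = [eor_per_pack_year(n, 0.02, 0.5, 0.0) for n in intensities]
```

Comparing fitted curves of this kind across genotype groups is one way to see whether the genotype effect tracks intensity rather than cumulative dose.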
In the joint analysis of multiple SNPs stratified by smoking (Supplemental Table S2), the cumulative number of variants of all 8 SNPs from CYP1A1 and CYP1A2 in the chr15q24.1 region conferred a significant overall risk for lung cancer in ever smokers (OR = 1.03, 95%CI = 1.00-1.07, p trend = 0.040) and a borderline protective effect in never smokers (OR = 0.91, 95%CI = 0.84-0.99, p trend = 0.055). The smoking-genotype interaction was highly significant (p interact = 0.006).
In addition, the minor allele of CYP2A6 rs1801272 showed a significant protective effect in ever smokers, increased risk in never smokers, and a nominally significant genotype-smoking interaction.

Linkage disequilibrium and haplotype analysis
For genes represented by two or more SNPs, we computed linkage disequilibrium (LD) among controls and haplotype association with lung cancer. The complete results are reported in the Supplemental Text S1 and Figure S2.
Interestingly, the haplotype analysis for the 8 SNPs in EPHX1 (which were in low LD: r² ≤ 0.1 for most SNP pairs, r² = 0.43 for EPHX1 rs2234922 and EPHX1 rs1051741) revealed two haplotypes significantly associated with lung cancer in the overall population: carriers of the TGGCACTC haplotype had a higher risk than non-carriers (freq = 0.01, p-value = 0.010) and carriers of the CGGCGCCT haplotype had a lower risk than non-carriers (freq = 0.01, p-value = 0.015). In addition, we found similar results in the analysis restricted to adenocarcinoma cases only: TGGCACTC (p-value = 0.008) and CGGCGCCT (p-value = 0.023). Since the 8 SNPs were in low LD, we also performed a three-marker moving window haplotype analysis and found no significant associations between lung cancer and haplotype combinations of three SNPs (see Supplemental Table S3). However, we identified a borderline significant protective association (freq = 0.03, p-value = 0.059) with a three-locus haplotype with a C, G, and T in locus 1, 2 and 8, respectively, which was also contained in the 8-SNP haplotype.
For the 8 SNPs in the chr15q24.1 region, we found two regions of LD, one of modest strength surrounding CYP1A1, and a second region 3′ of CYP1A2 (see Supplemental Figure S2), concordant with the results from HapMap. Haplotype analyses were computed separately for these two LD regions; the GTAAA haplotype (freq = 0.07) and the CGGGG haplotype (freq = 0.03) were nominally significantly associated with lung cancer risk in never and ever smokers, respectively.
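For readers less familiar with the r² values quoted in this section, the following sketch shows the standard pairwise definition, r² = D² / (pA(1 − pA) pB(1 − pB)) with D = pAB − pA · pB. It assumes haplotype frequencies are already known; in a study with unphased genotypes these would first have to be estimated, a step not shown here. The function name and example frequencies are hypothetical.

```python
def ld_r2(p_ab, p_a, p_b):
    """Pairwise LD r-squared from haplotype and allele frequencies.

    p_ab : frequency of the haplotype carrying allele A and allele B
    p_a  : frequency of allele A at the first SNP
    p_b  : frequency of allele B at the second SNP
    """
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom


# Hypothetical frequencies: two alleles that always travel together
perfect = ld_r2(0.3, 0.3, 0.3)
# Hypothetical frequencies: independent alleles (p_ab = p_a * p_b)
independent = ld_r2(0.06, 0.2, 0.3)
```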

Association between genotype and gene expression
The complete results for the correlation between genotype and gene expression data are reported in Supplemental Table S4. We found that the 8 polymorphisms in the 15q24 chromosomal region had a significant down-regulating effect on mRNA expression for

Discussion
In this large population-based case-control study of lung cancer we have observed that EPHX1, CYP1A1, CYP1B1 and CYP2A6 genes may play a role in lung cancer susceptibility. Two haplotypes in EPHX1 compared to all other haplotypes were significantly associated with lung cancer in the overall population and in adenocarcinoma cases only: TGGCACTC as a risk factor and CGGCGCCT as a protective factor. In addition, we identified a borderline significant protective association with a three-locus haplotype which was also contained in the 8-SNP haplotype and was present in approximately 3% of the population. These findings suggest that more than a hundred people in our study carried a three-variant haplotype resulting in a decreased lung cancer risk. The protective effect was even stronger for the smaller number of subjects (1%) who carried a combination of these three SNPs and the remaining 5 SNPs in the 8-locus haplotype. Since the significant associations with lung cancer were based on relatively rare haplotypes, replication will be needed in order to validate this finding. None of the 8 SNPs was significantly associated with lung cancer in the overall population when analyzed separately. This result, if confirmed, demonstrates that the effect of multiple SNPs on lung cancer may be important even if most individual SNPs do not show significant association. This may explain why previously published results, which are based on a limited number of EPHX1 polymorphisms, were inconsistent. In particular, EPHX1 rs2234922 has been previously associated both with risk [39] and with protection [57] for lung cancer. This SNP was not associated with lung cancer in our data. Nevertheless, it was one of the three SNPs that differentiate the two significant haplotypes reported here. The other two SNPs were EPHX1 rs1051741, in medium LD with EPHX1 rs2234922, and EPHX1 rs2292568, nominally significantly associated with risk of lung adenocarcinoma in our data (see Table 3). We did not find a significant association between EPHX1 polymorphisms and gene expression. Measurements of epoxide hydrolase activity in lung cancer patients carrying these haplotypes will be needed in order to understand the biological mechanism that underlies this finding.
Figure 1. Estimates of the smoking excess odds ratio by CYP1A1/rs2606345 status. Estimates of the linear slope parameter (EOR per pack-year) and its 95 percent confidence interval within categories of smoking intensity (square symbol) and fitted linear-exponential odds ratio for continuous pack-years and cigarettes per day (solid line) for CYP1A1 rs2606345. The Figure shows results for T/T genotype in panels A and C, and for T/G+G/G genotypes in panels B and D, among current smokers (700 T/T+997 T/G+G/G) (panels A and B) and former smokers (640 T/T+855 T/G+G/G) (panels C and D). The table in panel E reports the estimated deviances and p-values for the genotype-smoking interaction among current and former smokers for the model including both interaction terms between the genotype and pack-years and between the genotype and cigarettes per day, and for intermediate models including either the interaction term between genotype and pack-years, or the interaction term between genotype and cigarettes per day. The significant increase in deviance in current smokers is mainly due to the interaction term of the genotype with cigarettes per day and not with pack-years; the removal of pack-years from the model did not degrade fit relative to the full model (p = 0.209), whereas the removal of cigarettes per day did degrade fit (p = 0.022). doi:10.1371/journal.pone.0005652.g001
A group of SNPs from two LD regions in the chr15q24.1 region (CYP1A1 and CYP1A2) showed a protective effect on lung cancer risk among never smokers and a suggestive risk of lung cancer in ever smokers, with a significant genotype-smoking interaction for CYP1A1 rs2606345 and a nominally significant interaction for the two SNPs in CYP1A2. This result was confirmed by the multiple SNP analysis stratified by smoking. The cumulative number of variants from CYP1A1 and CYP1A2 was in fact associated with a significant risk for lung cancer in ever smokers and a protective effect in never smokers, with a highly significant smoking-genotype interaction. Interestingly, Wang et al. [58] recently reported an analogous inverse association between CYP1A1 rs2606345 and levels of DNA adducts: the variant allele was associated with high level of DNA adducts among women with high PAH exposure and with low level of DNA adducts among women with low PAH exposure. Further, using the linear-exponential model for smoking EOR we found that the difference in smoking effects between the wild type and the variant resulted from the effects of cigarettes per day and not pack-years. This finding suggests that a dose-response mechanism and a saturation effect might underlie the smoking-mediated association between CYP1A1 and lung cancer risk. The gene expression analysis supported this finding. In fact, the lower expression of CYP1A1 among never smokers and higher expression among current smokers in association with the SNPs at chr15q24.1 was consistent with the observed protective effect for lung cancer among never smokers and risk among smokers in association with variants in CYP1A1/A2.
Our data also showed that the minor allele of CYP1B1 rs10175368 was significantly protective for adenocarcinoma of the lung (OR = 0.80, 95%CI = 0.69-0.93) and a similar protective effect was observed for the minor allele of CYP1B1 rs9341266 (r² = 0.30), as well as for the cumulative sum of the two minor alleles. In addition, according to the HapMap database, CYP1B1 rs10175368 is in LD with 4 other SNPs in the same chromosomal region (rs2551188, rs4646430, rs4646429, and rs10175338, see Supplemental Figure S1B). These 4 SNPs are likely to be characterized by the same protective association. Previous results on CYP1B1 polymorphisms and lung cancer have been limited to the four non-synonymous SNPs rs10012, rs1056827, rs1056836 and rs1800440 [27][28][29][30,59,60]. None of the reported positive findings have been consistently replicated, except for rs10012, associated with lung cancer risk in two independent studies [28,30]. Our data on rs1800440 did not show any significant association with lung cancer. The three other non-synonymous SNPs were not evaluated in the current study. However, our SNPs were selected with an attempt to cover other regions of the gene. According to our data, variants other than those in the coding region could alter lung cancer risk. Polymorphisms in CYP1B1 have been associated with decreased PAH metabolism [26]. The significant protective effect of the CYP1B1 rs10175368 variant allele could be due to a lower level of smoking carcinogens in subjects carrying the variant allele. We did not find a significant effect on CYP1B1 gene expression for the two SNPs in CYP1B1 associated with a protection for adenocarcinoma. However, when we considered all seven polymorphisms in CYP1B1 together and studied their effect on gene expression, we found a significant increase in CYP1B1 gene expression among current smokers. The CYP1B1 gene is known to be highly expressed in lung tissues of lung cancer patients. Our result supports previous findings of CYP1B1 gene over-expression among current smokers [61] and suggests a possible involvement of CYP1B1 polymorphisms as a mechanism for differential expression.
The CYP2A6 rs1801272 polymorphism, which results in an amino acid change from Leu to His, was significantly associated with a decreased risk for squamous cell carcinoma, a strictly smoking-related malignancy. Interestingly, the same SNP was associated with a decrease in cigarettes per day in controls, confirming a previously hypothesized role of this gene in tobacco smoking addiction [46]. Our report provides the first confirmation of this finding in a population-based sample. In addition, the A allele of CYP2A6 rs1801272 showed a significant protective effect in ever smokers but no effect in never smokers, with a nominally significant genotype-smoking interaction due to the effect of cigarettes per day and not pack-years. The CYP2A6 gene is characterized by multiple polymorphisms and genomic repetitive elements in the regulatory regions, which make complete coverage of the gene extremely challenging. Moreover, most variants are very rare in the general population and would not be identifiable even in a sample as large as ours. We genotyped CYP2A6 rs1801272 (also known as CYP2A6*2) because this SNP is relatively common (4% in our population), has been well characterized in previous functional studies [46], and showed controversial associations with cancer and smoking dependence [43][44][45][46]. Our findings of an association with both lung cancer risk and tobacco addiction warrant further investigation based on a more complete coverage of this gene.
The size of our population provides unusual power for confirming previously reported associations. Our data do not support proposed associations between lung cancer and EPHX1 rs2234922, CYP1B1 rs1800440, and MPO rs2333227. The confidence in our significant results was supported by the low FPRP values (see Table 5) observed for prior probabilities of 0.10 or more given the strong prior probabilities of the selected phase I genes being involved in lung cancer risk.
At the time that this project was initiated, there was less genotype data available with which to select SNPs to cover haplotype blocks. Nevertheless, based on the existing SNP500Cancer and comparative assessment of HapMap data, we selected SNPs that represented tagSNPs in the Caucasian population. Although the coverage is inevitably incomplete, we substantially improved the coverage of the selected genes in comparison with previous studies (see Supplemental Figure S1).
Strengths of our study include a population-based design, large sample size with adequate power to detect main gene effect and gene-smoking interaction effect, integrative analysis with gene expression data, and a systematic approach in evaluating the joint effects of multiple SNPs.
Our results are particularly timely in relation to recent GWAS. For example, the significant association between haplotypes in EPHX1 and lung cancer risk emphasizes that the effect of multiple SNPs may be important despite null associations in single SNP analyses, and should be taken into consideration in GWAS. Similarly, although further study is necessary to confirm the qualitative interaction between smoking and genotype in relation to lung cancer susceptibility for the CYP1A1 rs2606345, this finding is particularly interesting, considering that this SNP is not included in the HapMap database or in the common platforms used for GWAS, although it is in relatively strong LD with other SNPs in these platforms. This highlights the necessity of fine mapping after GWAS to further elucidate associations with lung cancer risk and tobacco smoking addiction. In conclusion, this study emphasizes the importance of ample coverage of genes in the analysis of genetic susceptibility of cancer, integration with corresponding gene function in the target tissue, and rigorous study design and analytical approach.

Study population and data collection
A detailed description of the EAGLE study has been recently published [6]. Briefly, the study includes 2101 incident lung cancer cases and 2120 population controls enrolled in the period April 2002-June 2005 in 216 municipalities from the Lombardy region (Italy). Cases were subjects with primary cancer of trachea, bronchus and lung, first diagnosed between April 22, 2002 and February 28, 2005, and admitted to 13 hospitals of the study area. Controls were randomly sampled from population databases, frequency matched to cases by area of residence (5 classes), gender, and age (5-year categories), and contacted through the family physician. All enrolled subjects were Caucasian. Subjects were 35-79 years of age at diagnosis (cases) or at sampling/enrollment for interview (controls). The study participation rates were 86.6% among cases and 72.4% among controls. After signing an Institutional Review Board-approved informed consent form, subjects underwent a computer-assisted personal interview (CAPI) and filled in a self-administered questionnaire. Biospecimens (blood or buccal rinse from all participants and pathological samples from cases) were collected. Epidemiological information on the 4016 EAGLE subjects with available genotype data and analyzed in this study is described in Table 2.

SNP selection and genotyping
At the start of the study, SNP assays were selected from those available at the Core Genotyping Facility (CGF) of the Division of Cancer Epidemiology and Genetics (National Cancer Institute), using our own assessment of linkage disequilibrium between the SNPs from HapMap and previous evidence from the literature. The 25 SNPs selected from phase I metabolism genes are presented in Table 1. The gene coverage for EPHX1, CYP1B1, and CYP1A1/A2 based on the present version of the HapMap database is described in Supplemental Figure S1. For CYP2A6 and MPO genes, we selected only two SNPs whose association with lung cancer has been debated in previous studies [43][44][45][52][53]. Genotyping of the 25 SNPs was done at the CGF with the TaqMan® assay, described at the National Cancer Institute SNP500Cancer website (http://snp500cancer.nci.nih.gov). Genotyping was performed on 4050 EAGLE subjects (those with sufficient DNA samples). Duplicate quality-control samples (2% of the total) showed 100% agreement for all 25 assays.

Gene expression data
In addition to genotype information, we analyzed mRNA gene expression data from an Affymetrix HG-U133A microarray using fresh tissue samples from a subgroup of adenocarcinoma cases. The original microarray study has been described elsewhere [61]. Here, we analyzed the gene expression data from non-tumor samples of 44 subjects in relation to genotype data from the same subjects, as described in the Statistical analysis section.

Statistical analysis
A. Main effect of genotype. The main effect of the variant genotypes on the risk of lung cancer was estimated by odds ratios and their 95% confidence intervals using unconditional logistic regression analysis. Homozygosity for the more frequent allele among controls was defined as the reference group. We tested for significance using two-sided Wald tests. The trend test for the effect of each SNP was conducted by including the SNP variable as continuous on the logit scale in the model, and the categorical analysis was performed by treating the SNP variable as a three-level categorical variable. Age, sex, geographical location, cumulative smoking dose (pack-years), smoking intensity (cigarettes per day), and quitting smoking (years since quit) were selected as covariates. We performed stratified analyses by smoking status (never/ever) of cases and controls and polytomous logistic regression by the major histology types (adenocarcinoma, squamous cell carcinoma, and small cell carcinoma) of cases. In the analysis by histology, we defined the standard Wald chi-square test statistic using the coefficient estimates derived from a polytomous logistic regression (where the response variable was coded on four levels: controls, adenocarcinoma cases, squamous cell carcinoma cases, and small cell carcinoma cases) and the covariance matrix of the coefficients.
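As an illustration of how an odds ratio, its 95% confidence interval, and a two-sided Wald p-value follow from a single logistic regression coefficient, a minimal Python sketch is given here. The function name `wald_summary` and the numerical inputs are hypothetical and are not taken from the EAGLE analysis; the sketch assumes the coefficient and its standard error have already been estimated.

```python
import math
from statistics import NormalDist


def wald_summary(coef, se, level=0.95):
    """Odds ratio, Wald confidence interval and two-sided Wald p-value
    for one logistic regression coefficient (log-odds scale)."""
    if se <= 0:
        raise ValueError("standard error must be positive")
    # Normal quantile for the requested coverage (about 1.96 for 95%).
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    z = coef / se
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return {
        "or": math.exp(coef),
        "ci": (math.exp(coef - z_crit * se), math.exp(coef + z_crit * se)),
        "p": p_value,
    }


# Hypothetical coefficient and standard error, for illustration only.
result = wald_summary(coef=math.log(1.5), se=0.2)
```

The interval is computed on the log-odds scale and then exponentiated, which is why it is not symmetric around the odds ratio.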
B. Genotype-smoking interaction. For polymorphisms showing a genotype-smoking interaction in the association with lung cancer, we fitted a model for the excess odds ratio (EOR) of smoking [54] in order to separate the contributions of total exposure and intensity to the interaction with the polymorphism. Specifically, we fitted the following 3-parameter linear-exponential model, which describes the OR in terms of continuous pack-years ($d$) and continuous cigarettes per day ($n$): $OR(d,n) = 1 + \beta \times d \times g(n)$, where $\beta$ is the EOR at $g(n) = 1$, i.e., the EOR/pack-year, and $g(\cdot)$ is a function that describes the influence of changing cigarettes per day on the strength of the association between lung cancer and pack-years. Based on an empirical evaluation, we used a two-parameter form for $g(\cdot)$, where $g(n) = \exp\{\phi_1 \ln(n) + \phi_2 \ln(n)^2\}$. The component $\beta \times g(n)$ describes the EOR per pack-year and its variation with cigarettes per day, and thus the influence of the delivery rate, i.e., increasing cigarettes per day and decreasing duration of exposure. We expanded this model to incorporate genotype ($s$, where $s = 1$ and $s = 0$ denote the variant and wild-type forms, respectively), using $OR(s,d,n) = \exp(\alpha s) \times [1 + \beta_s \times d \times g_s(n)]$, where the subscripts denote separate parameters for each genotype. We fitted the model to data on never and current smokers (including subjects who quit smoking less than two years before the study) and on never and former smokers (subjects who quit smoking more than two years before the study), and used likelihood ratio tests to compare homogeneity of the effects of pack-years, i.e., $\beta_1 = \beta_0$, and/or smoking intensity, i.e., $\phi_{s=1,1} = \phi_{s=0,1}$ and $\phi_{s=1,2} = \phi_{s=0,2}$.
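To make the linear-exponential form concrete, the following Python sketch evaluates the odds ratio for a given number of pack-years, cigarettes per day, and genotype. The function and argument names (`intensity_modifier`, `odds_ratio`, `beta`, `phi1`, `phi2`, `alpha`) and all parameter values are our own illustrative assumptions; the sketch only evaluates the model and does not reproduce the fitting procedure or any estimate from this study.

```python
import math


def intensity_modifier(n, phi1, phi2):
    """g(n) = exp(phi1 * ln(n) + phi2 * ln(n)**2) for n cigarettes per day."""
    if n <= 0:
        raise ValueError("cigarettes per day must be positive")
    log_n = math.log(n)
    return math.exp(phi1 * log_n + phi2 * log_n ** 2)


def odds_ratio(d, n, beta, phi1, phi2, alpha=0.0, s=0):
    """OR(s, d, n) = exp(alpha * s) * [1 + beta * d * g(n)].

    d: pack-years, n: cigarettes per day, s: genotype (1 variant, 0 wild type).
    With d == 0 (never smokers) the smoking term vanishes.
    """
    genotype_factor = math.exp(alpha * s)
    if d == 0:
        return genotype_factor
    return genotype_factor * (1.0 + beta * d * intensity_modifier(n, phi1, phi2))


# Hypothetical genotype-specific parameters, for illustration only.
params = {
    0: {"beta": 0.02, "phi1": 0.5, "phi2": -0.1},
    1: {"beta": 0.03, "phi1": 0.4, "phi2": -0.1},
}
or_wild_type = odds_ratio(d=30, n=20, s=0, **params[0])
or_variant = odds_ratio(d=30, n=20, s=1, alpha=0.1, **params[1])
```

At one cigarette per day the modifier equals 1, so the odds ratio reduces to 1 plus the EOR per pack-year times pack-years, which is a quick sanity check on any implementation.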
C. Joint SNPs. We analyzed multiple SNPs jointly to test whether the overall lung cancer risk was determined by the combined action of multiple SNPs within the same gene and/or of multiple genes within the same pathway, even if each SNP may have had only a modest effect size individually.
c1. Under the assumption that the effect on lung cancer of each SNP was cumulative, we implemented the following model: $\mathrm{logit}\,P(\text{lung cancer}) = \alpha + \beta \sum_{k=1}^{n} SNP_k$ (1), where $k = 1, \ldots, n$ indexes a collection of SNPs belonging to the same gene or a collection of SNPs belonging to genes in the same pathway (e.g., phase I, $n = 25$, i.e., all SNPs were grouped together). $SNP_k = 0$ for homozygotes for the most common allele, $SNP_k = 1$ for heterozygotes, and $SNP_k = 2$ for homozygotes for the minor allele. $\beta$ is the regression coefficient for the cumulative number of variants $\sum_{k=1}^{n} SNP_k$. We estimated the overall risk of lung cancer associated with each selected group of $n$ SNPs by computing $OR = \exp(\beta)$ in the overall population, in never smokers, and in ever smokers separately. We estimated smoking-genotype interaction using the likelihood ratio test. Note that in this model we neither assume nor infer a risk direction for each minor allele. This approach will be powerful if the minor alleles of all SNPs have effects in the same direction, but power may be lost if the minor alleles of some SNPs affect lung cancer risk in opposite directions and their contributions to the overall risk cancel each other out.
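The cumulative model can be sketched in a few lines of Python: each subject's genotypes are summed into a single variant count, and a logistic regression on that count yields one odds ratio per additional variant. The helper names (`variant_count`, `fit_logistic`) and the toy data are hypothetical, and the sketch omits the covariates used in the actual analysis; it fits the two-parameter model by Newton-Raphson using only the standard library.

```python
import math


def variant_count(genotypes):
    """Cumulative number of minor alleles (0, 1 or 2 per SNP) in a SNP group."""
    if any(g not in (0, 1, 2) for g in genotypes):
        raise ValueError("genotypes must be coded 0, 1 or 2")
    return sum(genotypes)


def fit_logistic(x, y, iterations=50, tol=1e-10):
    """Newton-Raphson fit of logit P(y = 1) = a + b * x; returns (a, b)."""
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0  # score vector and information matrix
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0:
            raise ValueError("x must take at least two distinct values")
        # Solve the 2x2 Newton system (information matrix) * step = score.
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return a, b


# Hypothetical subjects: (genotypes at two SNPs, case status), illustration only.
subjects = (
    [([1, 0], 1)] * 30 + [([1, 0], 0)] * 10
    + [([0, 0], 1)] * 20 + [([0, 0], 0)] * 40
)
counts = [variant_count(g) for g, _ in subjects]
status = [y for _, y in subjects]
intercept, beta = fit_logistic(counts, status)
odds_ratio_per_variant = math.exp(beta)
```

Because a single coefficient is shared by all SNPs in the group, opposite-direction effects within the group pull the estimate toward the null, which is the loss of power noted above.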
c2. For all genes represented in our data by two or more SNPs, we computed paired linkage disequilibrium (LD) using the Haploview software and carried out haplotype analysis using the haplo.stats R-package.
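For readers unfamiliar with pairwise LD measures, the sketch below computes D, D' and r² for two biallelic SNPs from haplotype and allele frequencies. This is a simplified illustration and an assumption on our part: it takes the haplotype frequency as known, whereas dedicated software such as Haploview estimates it from unphased genotypes. The function name `pairwise_ld` and the input frequencies are hypothetical.

```python
def pairwise_ld(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A and allele B,
    p_a, p_b: frequencies of allele A and allele B.
    """
    for p in (p_a, p_b):
        if not 0.0 < p < 1.0:
            raise ValueError("both SNPs must be polymorphic")
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    # The maximum attainable |D| depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = abs(d) / d_max
    return d, d_prime, r2


# Hypothetical frequencies, for illustration only.
d, d_prime, r2 = pairwise_ld(p_ab=0.30, p_a=0.40, p_b=0.50)
```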
D. Gene Expression. We evaluated, to the extent possible, the effect of the polymorphisms $SNP_k^G$ from a given gene $G$ on the expression of the same gene $G$, and specifically the effect related to lung cancer. We first estimated the overall effect of each group of SNPs ($SNP_k^G$) on lung cancer according to the additive model $\mathrm{logit}\,P(\text{lung cancer}) = \alpha + \sum_{k=1}^{n} \beta_k SNP_k^G$ (2), where $\beta_k$ are the $n$ regression coefficients for the $n$ SNPs in $G$. Second, we used the $\beta_k$ estimated from equation (2) to compute the overall effect of each group of polymorphisms $SNP_k^G$ on the change of gene expression of $G$ ($Exp_G$) by solving a logistic regression, equation (3), that relates $Exp_G$ to the SNP regression score $\sum_{k=1}^{n} \beta_k SNP_k^G$ through a coefficient $\delta$. According to equation (3), $\delta > 0$ indicates an increase and $\delta < 0$ a decrease in the expression of gene $G$ due to the overall effect of the polymorphisms $SNP_k^G$ on lung cancer. In essence, we used the SNP regression score for lung cancer and verified whether it was positively or negatively associated with gene expression in non-tumor tissue samples from a subgroup of cases. Note that because no fresh frozen lung tissue samples can be collected from healthy people, we lack gene expression data from healthy controls and cannot directly measure the association between gene expression and lung cancer risk. By combining equations (2) and (3), we are instead able to obtain such information. The described gene expression analysis was performed overall and, separately, among never smokers, former smokers, and current smokers.
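The two-step logic can be illustrated with a short Python sketch: a per-subject SNP regression score is built from lung cancer coefficients, and the sign of its association with expression is then inspected. Note the substitution: for simplicity the sketch uses an ordinary least-squares slope as a stand-in for the second-step regression described above, so it shows only the direction check, not the authors' model. All names (`snp_score`, `ols_slope`) and values are hypothetical.

```python
def snp_score(genotypes, betas):
    """Weighted sum of genotype codes using lung cancer regression coefficients."""
    if len(genotypes) != len(betas):
        raise ValueError("one coefficient is needed per SNP")
    return sum(b * g for b, g in zip(betas, genotypes))


def ols_slope(x, y):
    """Least-squares slope of y on x (stand-in for the second-step coefficient)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("scores must vary across subjects")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical coefficients, genotypes and expression values, illustration only.
betas = [0.2, -0.1, 0.3]
genotype_rows = [[0, 0, 0], [1, 0, 1], [2, 1, 0], [0, 2, 2], [1, 1, 1]]
scores = [snp_score(row, betas) for row in genotype_rows]
expression = [7.1, 7.6, 7.4, 7.5, 7.5]
slope = ols_slope(scores, expression)
direction = "up-regulation" if slope > 0 else "down-regulation"
```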
E. Multiple testing and a priori knowledge considerations. We considered significant those results with a p-value less than or equal to 0.01. This choice was a compromise between a more stringent Bonferroni-corrected p-value and the loss in power from setting the threshold for significance too low. In addition, we referred to results with a p-value between 0.01 and 0.05 as nominally significant, and considered them notable when consistent across different analyses. Given the number of tested hypotheses (25 tests corresponding to the 25 SNPs for the single SNP analysis and 5 tests when SNPs were grouped by genes), we took multiple testing into account. Our approach to multiple testing was informed by the selection strategy for the phase I genes. Of note, each of the genes included has substantial mechanistic data and at least some population data that support an association with lung cancer, as we have described in the introduction. We recognize that quantifying this a priori knowledge for each SNP is challenging, because of the heterogeneity of results in the literature and because most results actually refer to genes and not to our specific SNPs. In order to incorporate the effect of both multiple testing and a priori knowledge considerations, we computed the False Positive Report Probability (FPRP) [55] to characterize the noteworthiness of all the significant and nominally significant results from single SNP analyses for a range of prior probabilities. The statistical power to detect the measured OR given a type I error rate of 0.05 was computed by means of the QUANTO software (http://hydra.usc.edu/gxe).
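As a worked illustration of the FPRP, the sketch below uses the standard expression FPRP = α(1 − π) / [α(1 − π) + (1 − β)π], where α is taken as the observed p-value, 1 − β is the statistical power, and π is the prior probability of a true association. The function name `fprp` and the input values are hypothetical and do not correspond to any result of this study.

```python
def fprp(p_value, power, prior):
    """False Positive Report Probability for one reported association.

    p_value: observed p-value (used as alpha), power: 1 - beta for the
    measured OR, prior: prior probability that the association is real.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if not (0.0 <= p_value <= 1.0 and 0.0 < power <= 1.0):
        raise ValueError("p_value must be in [0, 1] and power in (0, 1]")
    false_positive = p_value * (1.0 - prior)
    true_positive = power * prior
    return false_positive / (false_positive + true_positive)


# Hypothetical p-value and power evaluated over a range of priors.
priors = [0.25, 0.1, 0.01, 0.001]
fprp_by_prior = {prior: fprp(p_value=0.01, power=0.8, prior=prior) for prior in priors}
```

Evaluating the same finding over several priors shows how quickly a nominally significant result loses credibility as the prior probability of a true association decreases.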