Genome-Wide Interaction Analyses between Genetic Variants and Alcohol Consumption and Smoking for Risk of Colorectal Cancer

Genome-wide association studies (GWAS) have identified many genetic susceptibility loci for colorectal cancer (CRC). However, variants in these loci explain only a small proportion of familial aggregation, and there are likely additional variants that are associated with CRC susceptibility. Genome-wide studies of gene-environment interactions may identify variants that are not detected in GWAS of marginal gene effects. To study this, we conducted a genome-wide analysis for interaction between genetic variants and alcohol consumption and cigarette smoking using data from the Colon Cancer Family Registry (CCFR) and the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO). Interactions were tested using logistic regression. We identified an interaction for CRC risk between alcohol consumption and variants in the 9q22.32/HIATL1 region (Pinteraction = 1.76×10−8; permuted p-value = 3.51×10−8). Compared with non-/occasional drinking, light-to-moderate alcohol consumption was associated with a lower risk of colorectal cancer among individuals with the rs9409565 CT genotype (OR, 0.82 [95% CI, 0.74–0.91]; P = 2.1×10−4) and the TT genotype (OR, 0.62 [95% CI, 0.51–0.75]; P = 1.3×10−6), but was not associated with risk among those with the CC genotype (P = 0.059). No genome-wide statistically significant interactions were observed for smoking. If replicated, our suggestive finding of a genome-wide significant interaction between genetic variants and alcohol consumption might contribute to understanding colorectal cancer etiology and identifying subpopulations with differential susceptibility to the effect of alcohol on CRC risk.


Introduction
Colorectal cancer (CRC) is the third-most common cancer in men and the second most common cancer in women worldwide [1]. Both environmental and genetic factors are involved in the development of CRC [2][3][4][5][6][7]. Since 2007, genome-wide association studies (GWAS) have identified about 50 loci associated with CRC risk [8][9][10][11]. However, only a small portion of the familial aggregation of CRC is explained by these identified genetic loci, and additional variants associated with CRC susceptibility are more likely to be identified through analyses of interactions between genes and environmental risk factors [12,13]. Single nucleotide polymorphisms (SNP) that impact only a subgroup of the population or have opposite effects in different subgroups are likely to produce weak main effects that cannot be easily detected by marginal association testing of the SNPs. However, these variants may be identified by testing for interactions between SNP and environmental risk factors (genome-wide interaction analysis) [14,15]. These findings may provide etiologic insight into CRC and identify potentially susceptible subpopulations [14,15]. There is compelling evidence from epidemiologic studies that alcohol consumption and cigarette smoking are associated with risk of CRC [16][17][18][19][20][21][22][23][24][25]. Both alcohol consumption and cigarette smoking influence disease risk through pathways involving multiple gene products and regulatory elements, providing potential for biological interactions [26][27][28]. Accordingly, alcohol consumption and smoking are important lifestyle factors to study interactions with genetic variants. In this study, we performed a genome-wide interaction analysis using the large datasets from the Colon Cancer Family Registry (CCFR) and the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO) [3] to identify SNPs that modify the effects of alcohol and smoking on CRC risk.

Results
In this study, we included 14 studies from the Colon Cancer Family Registry (CCFR) and the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO) as described previously [3,29,30] and in the S1 Text and S1 and S2 Tables. Basic characteristics of the participants, stratified by study center, are described in S1 and S2 Tables. We were able to harmonize measures of alcohol consumption across 8,058 cases and 8,765 controls and measures of smoking across up to 11,219 cases and 11,382 controls. As seen for other common diseases, such as cardiovascular diseases, the association of alcohol consumption with CRC risk depends on the level of alcohol consumed. Heavy alcohol intake (>2 standard drinks per day) has been shown to be associated with increased risk of CRC [16,17,31], while light-to-moderate drinking (<2 standard drinks per day) may have little effect [18,19] or reduce risk of CRC [16,20,21,22] compared to non-drinkers. Consistent with these previous publications [16][17][18][19][20][21][22][31], we observed an inverse association with CRC risk for light-to-moderate drinkers (OR = 0.91, P = 0.006, Fig 1A) but a positive association for heavy drinkers (OR = 1.22, P = 0.0004, Fig 1B) compared with non-/occasional drinkers. Modeling alcohol using this categorical approach fitted the association between alcohol intake and CRC risk better than the continuous variable based on the Akaike Information Criterion (AIC): the AIC was 12.42 smaller for the model including the two categorical variables than for the model including the continuous variable (AIC = 23123.72 for continuous alcohol and AIC = 23111.30 for categorical alcohol) [32]. Given the opposite effects of light-to-moderate drinking and heavy drinking, it is critical that analyses further investigating the impact of alcohol on CRC, such as interaction analyses, consider light-to-moderate and heavy drinking separately.
Ever smoking and pack-years of cigarette smoking were positively associated with CRC risk (OR = 1.18 for ever vs. never smokers, P = 8.9×10−9; OR = 1.11 per 10 pack-years increase, P = 7.1×10−13; Fig 2A and 2B). None of the smoking and alcohol variables showed evidence of heterogeneous associations across studies (P heterogeneity >0.16).
Using conventional logistic regression including multiplicative interaction terms, we identified genome-wide significant interactions (at P<5×10−8) between 11 SNPs at the 9q22.32/HIATL1 (Hippocampus Abundant Transcript-Like 1) locus and light-to-moderate drinking, with no evidence of heterogeneity across studies (P heterogeneity >0.5 for any of the 11 SNPs) (S3 Table, Fig 3). All 11 SNPs were common variants with minor allele frequency (MAF) between 0.31 and 0.34 and were genotyped or imputed with high accuracy (imputation r² >0.98, S3 Table). The most significant SNP was rs9409565 with P interaction = 1.76×10−8 and permuted p-value = 3.51×10−8 (Table 1, Fig 4C). The genetic variant was located in an intergenic region (28kb downstream of HIATL1 and 70kb downstream of FBP2, Fig 3). All the other 10 genome-wide significant SNPs were in strong linkage disequilibrium (LD) with rs9409565 (LD r² >0.8, S3 Table, Fig 3) and some of them were located within the gene HIATL1. The observed interaction for rs9409565 was similar in men and women and by cancer site (colon vs rectum) (Fig 4A and 4B, S4 Table). We did not observe any genome-wide significant interaction between any SNP and heavy drinking. No inflation was observed in the genome-wide SNP × alcohol interaction analysis (the inflation factor λ = 0.99 and 1.00 for light-to-moderate drinkers and heavy drinkers, respectively). To evaluate the potential confounding [33] by other lifestyle and environmental risk factors of the interaction between rs9409565 and light-to-moderate alcohol consumption in relation to CRC risk, we adjusted for smoking status (ever vs. never) and BMI (the two variables with the highest correlations with alcohol consumption in our data, r = 0.15 and 0.13), as well as exercise and fruit and vegetable consumption, in the conventional case-control logistic regression model. Our results did not change (multivariate-adjusted interaction p-value = 4.34×10−8, Table).
The association between alcohol intake and CRC was also not heterogeneous across studies within each genotype stratum (p-heterogeneity > 0.73; S1 Fig).
We also estimated absolute risks of CRC based on Surveillance, Epidemiology, and End Results (SEER) age-adjusted incidence rates. Compared with non-/occasional drinking, light-to-moderate drinking was associated with 14.0 fewer CRC cases per 100,000 individuals carrying the rs9409565-CT genotype per year and 35.5 fewer CRC cases per 100,000 individuals carrying the rs9409565-TT genotype per year. Using the Cocktail method, a two-step method that may improve power, we did not observe any genome-wide significant SNP×alcohol interactions. Further, we did not observe any genome-wide significant interactions for SNP×smoking (smoking history and pack-years of smoking) using logistic regression or the Cocktail method.

Gene expression analyses
The SNP rs9409565, which showed a significant interaction with alcohol, is located in an intergenic region between HIATL1 and FBP2. As there is a recombination hotspot lying between rs9409565 and FBP2 (Fig 3), we focused the gene expression analysis on HIATL1, which is expressed in normal colon and rectal tissue [34,35]. Furthermore, based on our gene expression data for 35 colorectal cancer cases (S2 Text), the expression level of the HIATL1 gene was significantly higher in tumor tissues than in adjacent normal tissues (paired Student's t test, P<7.2×10−5, S2 Fig). This finding is consistent with a previous study [36], which is included in the UCSC Cancer Genomics Browser [37][38][39] and shows that human colon tumors (n = 100) significantly over-expressed HIATL1 compared to normal colon tissues (n = 5) [36] (Fisher exact test: P = 0.03). Similarly, we were able to reproduce this observation in 50 independent paired colorectal adenocarcinoma and adjacent normal samples from The Cancer Genome Atlas (TCGA) (paired Student's t test, P = 0.02, S2 Fig). Furthermore, we observed that [...] sun-exposed skin) with p values ranging from 7×10−138 to 4×10−6 (S8 Table). In contrast, evaluation of eQTL in both normal (GTEx) and cancer colorectal tissue from TCGA for the rs9409565 locus (r² ≥ 0.2 in Phase 3 1000 Genomes EUR data) did not show any significant eQTL. The inability to detect an eQTL is likely because the enhancer tagged by the locus is active in some but not all cancer cell lines and the current reference cancer transcriptome data may not be large enough or molecularly representative of our study population (S5 Fig). Furthermore, we investigated whether any of the tagging SNPs are located in variant enhancer loci (VEL) reported by Akhtar-Zaidi et al. [43] using ChIP-seq (H3K27ac) enhancer signals.
We observed that four of the variants (rs28406858, rs7042481, rs7858082, and rs9409510) in LD with rs9409565 (LD r² ≥0.6) were positioned within three gained cancer-specific VEL (S6 Fig).

Discussion
We identified a suggestive interaction between variants at 9q22.32/HIATL1 and light-to-moderate alcohol consumption in relation to CRC risk. This is the first genome-wide significant GxE interaction reported for alcohol intake and risk of CRC and warrants replication in independent studies. Evidence for overlap between the discovered 9q22.32/HIATL1 region and VEL, as well as gene expression results, supports the relevance of the 9q22.32/HIATL1 region for CRC risk. Gene expression analyses indicated that a) SNPs identified in our study impact HIATL1 expression, b) HIATL1 is involved in signaling pathways related to CRC and its expression differs between normal and tumor colorectal tissue, and c) HIATL1 expression in colon tissue differs by alcohol consumption. The most significant variant, rs9409565, is correlated with 142 variants (LD r² ≥0.5 in Phase 3 1000 Genomes European populations), which span intronic regions and approximately 50kb downstream and 75kb upstream of HIATL1. Nine of these variants (including rs9409550, rs4744345, rs9409546, rs9409778, and rs639276, all with interaction P<5×10−8) fall within a transcriptionally active region in normal colon, rectal and duodenal mucosa [44] as defined by epigenetic signals [45]. Furthermore, these variants fall in a region of enriched enhancer signal, although we note that currently available ChIP-seq data are not able to identify a putative transcription factor binding site at any of the tagged SNPs (S6 Fig). In support of our finding that HIATL1 expression is higher in tumor than in adjacent normal colorectal tissue, ChIP-seq (H3K27ac) enhancer signals suggest that this locus implicates a gained enhancer that is present in colorectal tumors and absent in normal crypt cells (S6 Fig). In summary, multiple data points suggest that the genetic variants we identified to interact with alcohol on CRC risk are located in regulatory regions impacting the expression of HIATL1 and that HIATL1 expression varies by alcohol consumption.
HIATL1 is a member of the solute carrier (SLC) group of membrane transport proteins, which enable the directed movement of substances (such as peptides, amino acids, proteins, metals, and neurotransmitters) into or out of cells and play an important role in a variety of cellular functions [46,47]. Although the detailed function of HIATL1 remains elusive, this gene was found to be expressed in a large range of animal species and is highly evolutionarily conserved [48], suggesting a potentially important functional role. Transporter proteins are commonly upregulated in many cancers [49,50] and take part in nutrient signaling to the mTOR pathway [51], which is an important signaling pathway in apoptosis and cancer [52][53][54]. Alcohol may modify the effects of HIATL1 on CRC risk through its influence on the gene expression of HIATL1. Nonetheless, the precise mechanism(s) of the interaction between alcohol and HIATL1 on CRC risk remain unclear and further studies are needed. Our Cocktail method for detecting G×E interactions did not identify the statistical interaction detected by the conventional logistic regression analysis because rs9409565 did not show strong statistical evidence for association with CRC risk in the marginal association analyses (P = 0.54, OR = 1.014) or with alcohol consumption (P = 0.22). Accordingly, this SNP was ranked low in step 1 of the Cocktail method, resulting in a very stringent alpha-threshold for the interaction term in step 2. Although the conventional logistic regression analysis tends to be less powerful overall for genome-wide interaction analysis than the Cocktail method [14,55], it has greater power to detect an interaction if the marginal association of the SNP with disease or the correlation of the SNP with the environmental factor is weak, as was the case for the observed interaction.
In addition, the absence of an association between rs9409565 and alcohol consumption excluded the possibility that the observed interaction was due to dependence between them [56]. We also explored the effect of rs9409565 and alcohol using other potentially more powerful single-step approaches and observed a similar interaction effect in the Empirical Bayesian analysis [57] and a weaker interaction effect in the case-only analysis [58], which may be explained by the non-significant differential effect of alcohol on CRC in individuals carrying the CC genotype (S6 Table).
To investigate whether genome-wide interaction analysis may help identify variants that would otherwise be missed, we looked up the marginal association of rs9409565 in the largest GWAS [59], which is about twice as large as our study and showed an OR for rs9409565 of 0.975 (95% CI 0.946-1.007, p-value 0.127). Accordingly, the variant by itself showed only weak evidence for association with CRC. This may not be surprising given that the sample size required to identify a GxE interaction is estimated to be at least four times larger than that required to identify main effects [60]. Our study has several strengths, including the large sample size, environmental exposure assessment in well-characterized populations, and standardized harmonization of environmental data across studies. Further, there is no evidence of heterogeneity across studies for our findings, indicating that our results are not dominated by one or a few studies and, indeed, represent evidence across all studies. There are also some limitations. Because amassing sufficient study power for genome-wide interaction analysis is a challenge, we combined all studies in the analysis to gain the greatest power [61] instead of dividing studies into discovery and replication sets. Although we do not have a replication set, the consistency of our findings across all studies and the independent evidence from different types of gene expression data and bioinformatics analyses support a novel interaction for CRC risk between alcohol intake and variants in the 9q22.32/HIATL1 region. Our analyses focused on current alcohol consumption, rather than lifetime alcohol use, which may cause misclassification of a certain portion of alcohol users. Both differential and non-differential misclassification of alcohol consumption levels tend to lead to underestimation of interaction parameters (e.g. leading to a non-significant interaction term between SNP and alcohol intake) [62]; accordingly, we may have missed some true interactions.
However, it is unlikely that this led to false positives for the interactions observed. Because there is no strong evidence that the type of alcohol (usually defined as wine, beer and hard liquor) has a differential impact on CRC [63], we have not investigated interactions between genetic variants and type of alcohol. As we performed genome-wide interaction testing for two environmental risk factors (smoking and alcohol consumption), additional adjustment for multiple comparisons may be needed. However, we note that the observed interaction at 9q22.32/HIATL1 would remain borderline significant (alpha threshold = 5×10−8/2 = 2.5×10−8). The small number of heavy drinkers, particularly among women, impeded the reliable estimation of interaction parameters and limited our power to identify significant interactions between SNPs and heavy drinking. We focused the gene expression analysis on HIATL1 because rs9409565 is located in an intergenic region between HIATL1 and FBP2 and there is a recombination hotspot lying between rs9409565 and FBP2. When we expanded the gene expression analyses to all genes within 500kb upstream or downstream of rs9409565 in the 35 pairs of colorectal tumor-normal tissue samples (S2 Text), we observed no significant results after false discovery rate (FDR) correction. The most significant results were for MIRLET7F, which had a p value of 0.001 for testing differential gene expression across various levels of lifetime alcohol consumption in normal tissues, and PTPDC1, which had a p value of 0.002 for testing differential gene expression across various levels of alcohol consumption at reference time. Further studies are needed to confirm our findings.
Alcohol has a particularly detrimental effect on several cancers, possibly including CRC, in Asian subpopulations with genetically determined alcohol sensitivity [64][65][66]. However, as we focused our analysis on populations of European descent and did not observe significant differences in the alcohol-CRC association between studies (phet = 0.16-0.76), we do not expect major underlying differences in the effect of alcohol in our study populations.
We did not perform stratified analyses by anatomical site for our genome-wide GxE interaction analysis because the associations of CRC with alcohol consumption (S7 Table) and smoking [23] did not vary according to anatomical site within the large bowel. Although we did observe potential interactions for alcohol consumption, we did not observe statistical evidence for genome-wide SNP x smoking interactions. This may be because smoking has a weaker association with CRC than alcohol intake [24,26,67], so we may have been underpowered even with more than 10,000 cases and 10,000 controls. We also may not have properly captured the most relevant smoking variables, such as duration of smoking or time since quitting smoking. The association between smoking and CRC risk is strongest for tumors that display certain molecular features, such as microsatellite instability (MSI)-high and CpG island methylator phenotype (CIMP)-positive [68,69]. Because of the lack of MSI or CIMP data in several studies, we could not perform stratified analyses by tumor characteristics for smoking-related analyses.
We note that it would be too early to make any recommendation on alcohol intake from our findings, even after independent replication, given that such recommendations need to be considered in the context of the effect of alcohol on all diseases. Furthermore, it will be important to investigate the interactions between alcohol and genetic variants in larger studies to comprehensively evaluate the full impact of genetic variation on the effect of alcohol on colorectal cancer risk.
In summary, we identified a tentative novel interaction for CRC risk between alcohol intake and variants at 9q22.32/HIATL1. Further replication and functional studies are required to confirm our findings and understand the biologic implications of the interaction. This, in turn, could provide further insight into CRC etiology and may identify potentially susceptible subpopulations.

Ethics statement
The overall project was reviewed and approved by the

Study population
We included 14 study centers from the CCFR and GECCO as described in the S1 Text and S1 and S2 Tables. All colorectal cancer cases were defined as colorectal adenocarcinoma and confirmed by medical records, pathologic reports, or death certificates. We included advanced colorectal adenoma, a well-defined colorectal cancer precursor [70,71], from two studies (S1 Text). Advanced adenoma was defined as an adenoma 1 cm or larger in diameter and/or with tubulovillous, villous, or high-grade dysplasia/carcinoma-in-situ histology. Colorectal adenoma cases were confirmed by medical records, histopathology, or pathologic reports. Controls for adenoma cases had a clean sigmoidoscopic or colonoscopic examination. All participants provided informed consent and studies were approved by their respective Institutional Review Boards.

Genotyping, quality assurance/quality control and imputation
Average sample and SNP call rates, and concordance rates for blinded duplicates have been previously published [3]. In brief, genotyped SNPs were excluded based on call rate (<98%), lack of Hardy-Weinberg Equilibrium in controls (HWE, p < 1×10−4), and low minor allele frequency (MAF <0.05). We imputed the autosomal SNPs of all studies to the Northern Europeans from Utah (CEU population) in HapMap II. SNPs were restricted based on per-study minor allele count >5 and imputation accuracy (R² >0.3). After imputation and quality-control (QC) exclusion, approximately 2.7M SNPs were used in analysis.
All analyses were restricted to individuals of European ancestry, defined as samples clustering with the Utah residents with Northern and Western European ancestry from the CEPH collection population in principal component analysis [72], including the HapMap II populations as reference.
Alcohol consumption and smoking information
All information on basic demographics and environmental risk factors was collected through interviews or through self-administered questionnaires. Data for all studies were centrally harmonized at the data coordinating center. We used the risk-factor information at the reference time, which varied across studies (S1 Text). A multi-step data-harmonization procedure, which is described in detail in Hutter et al. [29], was applied to reconcile differences in individual study questionnaires. We converted consumption of alcoholic beverages into grams of alcohol per day (g/day) by summing the alcohol content of each beverage consumed per day. To test whether the categorical or the continuous variable fitted the association between alcohol intake and CRC risk better, we used the Akaike Information Criterion (AIC) to compare both models. With our sample size, a model with an AIC that is 6 points smaller than that of the other model is considered a better-fitting model [32]. According to this analysis and consistent with previously described risk profiles [16,17,19,20,21,22,73], we grouped study participants as non-/occasional drinkers (drinking <1 g/day), light-to-moderate drinkers (drinking 1-28 g/day), and heavy drinkers (drinking >28 g/day; one standard drink is approximately equal to 14 grams of alcohol). We coded these categories using indicator variables for the genome-wide interaction analysis. Smoking history was defined as never- and ever-smoking; pack-years of smoking was calculated by multiplying the average number of packs of cigarettes smoked per day by smoking duration (years). Smoking history (ever vs. never smoking) and pack-years of smoking (treated as a continuous variable) were used separately in the genome-wide interaction analysis.
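The exposure coding described above can be illustrated with a minimal Python sketch. The cut-points (<1, 1-28, and >28 g/day) and the pack-years definition come from the text; the function names and category labels are hypothetical and chosen only for illustration, and this is not the harmonization code used by the coordinating center.

```python
from typing import Tuple

# Cut-points from the text: <1 g/day, 1-28 g/day, >28 g/day.
LIGHT_MIN_G = 1.0   # lower bound of light-to-moderate drinking
HEAVY_MIN_G = 28.0  # intake above this value counts as heavy (about 2 drinks of 14 g)


def alcohol_category(grams_per_day: float) -> str:
    """Map daily alcohol intake (g/day) to one of three categories."""
    if grams_per_day < 0:
        raise ValueError("alcohol intake cannot be negative")
    if grams_per_day < LIGHT_MIN_G:
        return "non_occasional"
    if grams_per_day <= HEAVY_MIN_G:
        return "light_moderate"
    return "heavy"


def alcohol_indicators(grams_per_day: float) -> Tuple[int, int]:
    """Return (light_moderate, heavy) indicator variables.

    Non-/occasional drinkers are the reference group and are coded (0, 0).
    """
    category = alcohol_category(grams_per_day)
    return int(category == "light_moderate"), int(category == "heavy")


def pack_years(packs_per_day: float, years_smoked: float) -> float:
    """Pack-years = average packs smoked per day x smoking duration in years."""
    return packs_per_day * years_smoked
```

For example, a participant reporting 14 g/day (about one standard drink) is coded as light-to-moderate, with indicators (1, 0), and someone who smoked 1.5 packs per day for 20 years has 30 pack-years.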

Statistical analysis
Statistical analyses of all data were conducted centrally at the GECCO coordinating center on individual-level data to ensure a consistent analytical approach. Unless otherwise indicated, we adjusted for age at the reference time, sex (when appropriate), center (when appropriate), and the first three principal components from EIGENSTRAT to account for potential population substructure. The alcohol and smoking variables were coded as described above. Each directly genotyped SNP was coded as 0, 1, or 2 copies of the variant allele. For imputed SNPs, we used the expected number of copies of the variant allele (the "dosage"), which has been shown to give unbiased test statistics [74]. Genotypes were treated as continuous variables (i.e. log-additive effects). Each study was analyzed separately using logistic regression models and study-specific results were combined using fixed-effects meta-analysis methods to obtain summary odds ratios (ORs) and 95% confidence intervals (CIs) across studies. We calculated the heterogeneity p-values using Woolf's test [75]. Quantile-quantile (Q-Q) plots were assessed to determine whether the distribution of the p-values was consistent with the null distribution (except for the extreme tail). Subjects with missing data for SNPs or environmental factors were excluded from the relevant analyses. Considering the potential male-female difference in alcohol metabolism [76,77] and the different levels of alcohol consumption between sexes, we conducted the genome-wide interaction analysis for alcohol separately for men and women and used fixed-effects meta-analysis to combine their results. All analyses were conducted using the R software (Version 3.0.1).
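As an illustration of how study-specific estimates can be pooled, the following Python sketch combines log odds ratios with inverse-variance weights and computes a heterogeneity statistic. The text states only that fixed-effects meta-analysis was used, so inverse-variance weighting is an assumption here, and the function name and input values are hypothetical. The original analyses were run in R; this is a sketch, not the study code.

```python
import math
from typing import Dict, Sequence


def fixed_effects_meta(log_ors: Sequence[float], ses: Sequence[float]) -> Dict[str, float]:
    """Pool study-specific log odds ratios (assumed inverse-variance weights)."""
    if len(log_ors) != len(ses) or not log_ors:
        raise ValueError("need one standard error per log odds ratio")
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    # Heterogeneity statistic: weighted squared deviations from the pooled
    # estimate, referred to a chi-square distribution with k - 1 degrees
    # of freedom.
    q_stat = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_ors))
    return {
        "or": math.exp(pooled),
        "ci_low": math.exp(pooled - 1.96 * pooled_se),
        "ci_high": math.exp(pooled + 1.96 * pooled_se),
        "log_or": pooled,
        "se": pooled_se,
        "q": q_stat,
        "df": len(log_ors) - 1,
    }


# Hypothetical input: two studies with equal precision.
example = fixed_effects_meta([0.0, 0.2], [0.1, 0.1])
```

With two equally precise studies, the pooled log OR is simply the average of the two estimates, which makes the toy example easy to check by hand.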
Two statistical methods that leverage interactions between SNPs and environmental factors (G×E interactions) were used to detect potential disease-associated loci. First, we used conventional case-control logistic regression analysis including G×E interaction term(s). As the alcohol consumption variable has three categories, there are two interaction terms in the statistical models. Based on an increasing number of publications [78][79][80][81][82][83] providing a detailed discussion of the appropriate genome-wide significance threshold, which all arrive at similar values in the range of 5×10−7 to 5×10−8 for European populations, we decided to use an alpha level of 5×10−8 as the genome-wide significance threshold, assuming about 1 million independent tests across the genome (0.05/1,000,000 = 5×10−8). For significant results, we used a permutation approach to determine the empirical p-value. We defined the number of permutations needed as 1/p-value (i.e., for a p-value of 5×10−8, 1/(5×10−8) = 20,000,000). We permuted the case-control status 1/p-value times and calculated the p-value for the interaction from each meta-analysis to obtain the permuted p-value.
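The arithmetic behind the significance threshold and the permutation count can be written out in a few lines of Python. The threshold and the 1/p rule are taken from the text. The text does not spell out how the permuted p-value is computed, so the empirical estimator below (the fraction of permuted p-values at least as small as the observed one) is an assumption, and all function names are hypothetical.

```python
from typing import Sequence


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level for n_tests independent tests."""
    return alpha / n_tests


def permutations_needed(p_value: float) -> int:
    """Number of permutations defined in the text as 1 / p-value."""
    return round(1.0 / p_value)


def empirical_p_value(observed_p: float, permuted_ps: Sequence[float]) -> float:
    """Assumed estimator: share of permutations with p <= observed p."""
    if not permuted_ps:
        raise ValueError("at least one permutation is required")
    hits = sum(1 for p in permuted_ps if p <= observed_p)
    return hits / len(permuted_ps)


genome_wide_alpha = bonferroni_threshold(0.05, 1_000_000)  # about 5e-8
n_permutations = permutations_needed(5e-8)                 # 20,000,000
```

The same threshold function gives the stricter level of 2.5×10−8 when the genome-wide alpha is divided across the two exposures studied (alcohol and smoking).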
Second, we used our recently developed Cocktail method [55]. In brief, this method consists of two steps: a screening step to prioritize SNPs and a testing step for GxE interaction. For the screening step, we ranked and prioritized variants through a genome-wide screen of each of the 2.7M SNPs (referred to as "G") by the maximum of two test statistics: that from marginal association testing of G on disease risk [84] and that from correlation testing between G and exposure (E) in cases and controls combined [85]. Based on the ranks of these SNPs from screening, we used a weighted hypothesis framework to partition SNPs into ordered groups and assigned each group an alpha-level cut-off, with higher ranked groups from the screening stage having less stringent alpha-level cut-offs for interaction [86,87]. The second step of the Cocktail method is the testing step. We used either case-control (CC) or case-only (CO) logistic regression to calculate a p-value for the interaction. If a G was ranked based on its low marginal association p-value in the screening tests, we used the CO test; if it was ranked because of a low correlation screening p-value, we used the CC test. We compared the test-step p-value to the alpha-level cut-off for each SNP in a given group.
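The two steps can be sketched in Python as follows. The ranking by the larger of the two screening statistics and the choice between CO and CC tests follow the description above. The specific weighting scheme (an initial group of 5 SNPs, group sizes that double, and a group-level alpha that halves with each group) is an assumption made for illustration, since the text does not state the group sizes or alpha allocation; the function names and example statistics are hypothetical.

```python
from typing import Dict, List, Tuple


def screen_and_rank(stats: Dict[str, Tuple[float, float]]) -> List[Tuple[str, str]]:
    """Rank SNPs by the larger of two screening statistics.

    stats maps a SNP id to (marginal statistic, G-E correlation statistic),
    both on a scale where larger means stronger evidence. Each SNP is paired
    with the interaction test to use in step 2: case-only ("CO") if it was
    prioritized by the marginal test, case-control ("CC") otherwise.
    """
    ranked = sorted(stats, key=lambda snp: max(stats[snp]), reverse=True)
    plan = []
    for snp in ranked:
        marginal, correlation = stats[snp]
        plan.append((snp, "CO" if marginal >= correlation else "CC"))
    return plan


def weighted_alpha_thresholds(n_snps: int, alpha: float = 0.05,
                              initial_group_size: int = 5) -> List[float]:
    """Per-SNP alpha cut-offs by rank under the assumed weighting scheme.

    Group k (k = 1, 2, ...) holds initial_group_size * 2**(k - 1) SNPs and
    shares a total alpha of alpha / 2**k, so top-ranked SNPs face a less
    stringent cut-off and the overall error rate stays at or below alpha.
    """
    thresholds: List[float] = []
    group, size = 1, initial_group_size
    while len(thresholds) < n_snps:
        per_snp_alpha = alpha / (2 ** group) / size
        n_in_group = min(size, n_snps - len(thresholds))
        thresholds.extend([per_snp_alpha] * n_in_group)
        group += 1
        size *= 2
    return thresholds


# Hypothetical screening statistics for three SNPs.
plan = screen_and_rank({"snpA": (1.0, 4.0), "snpB": (5.0, 0.5), "snpC": (2.0, 2.5)})
cutoffs = weighted_alpha_thresholds(len(plan))
```

Under this kind of scheme, a SNP with weak marginal and weak correlation statistics lands in a late group with a very small cut-off, which is consistent with the behavior reported for rs9409565.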
We calculated absolute risks for each genotype of the SNP showing significant G×E interaction. Briefly, based upon the Surveillance, Epidemiology, and End Results (SEER) age-adjusted colorectal cancer incidence rate (denoted by "I") between 1982 and 2011 among the White population of 42.9 per 100,000 men and women per year, we estimated the reference incidence rate of colorectal cancer (denoted by "I_{reference}") using the following formula:
I_{reference} = I / [P(AA, non-E) + OR_{Aa, non-E}×P(Aa, non-E) + OR_{aa, non-E}×P(aa, non-E) + OR_{AA, E}×P(AA, E) + OR_{Aa, E}×P(Aa, E) + OR_{aa, E}×P(aa, E)],
where P(genotype, E (or non-E)) is the prevalence of light-to-moderate drinking (or non/occasional drinking) in each corresponding genotype category among controls (non-cases), and (AA, non-E) is the reference category with an OR of 1. Based on this reference incidence rate of colorectal cancer (i.e., I_{reference}), we further calculated absolute colorectal cancer incidence rates within each subgroup defined by genotype of the SNP and by light-to-moderate drinking or non/occasional drinking by multiplying I_{reference} by each corresponding OR. Bootstrap methods were used to calculate the 95% CI of absolute risk estimates [88].
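A short Python sketch of this calculation follows. The overall incidence of 42.9 per 100,000 per year is from the text; the odds ratios and prevalences in the example are made-up placeholder values, not study results, and the function names are hypothetical. The sketch relies on the reading that the six subgroup prevalences sum to 1, so that the prevalence-weighted average of the subgroup rates reproduces the overall rate.

```python
from typing import Dict, Tuple

Stratum = Tuple[str, bool]  # (genotype, exposed to light-to-moderate drinking)


def reference_incidence(overall_rate: float,
                        odds_ratios: Dict[Stratum, float],
                        prevalences: Dict[Stratum, float]) -> float:
    """I_reference = I / sum over strata of OR x prevalence."""
    denominator = sum(odds_ratios[s] * prevalences[s] for s in prevalences)
    return overall_rate / denominator


def absolute_rates(overall_rate: float,
                   odds_ratios: Dict[Stratum, float],
                   prevalences: Dict[Stratum, float]) -> Dict[Stratum, float]:
    """Absolute incidence rate per stratum: I_reference x OR."""
    i_ref = reference_incidence(overall_rate, odds_ratios, prevalences)
    return {s: i_ref * odds_ratios[s] for s in odds_ratios}


# Placeholder inputs for illustration only (not estimates from the study).
SEER_RATE = 42.9  # per 100,000 per year, from the text
toy_ors = {("AA", False): 1.0, ("Aa", False): 1.1, ("aa", False): 1.2,
           ("AA", True): 1.0, ("Aa", True): 0.9, ("aa", True): 0.7}
toy_prev = {("AA", False): 0.20, ("Aa", False): 0.20, ("aa", False): 0.05,
            ("AA", True): 0.25, ("Aa", True): 0.25, ("aa", True): 0.05}
toy_rates = absolute_rates(SEER_RATE, toy_ors, toy_prev)
```

A useful sanity check is that the prevalence-weighted sum of the subgroup rates returns the overall SEER rate; differences between exposed and unexposed rates within a genotype then give statements of the form "fewer cases per 100,000 per year".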

Expression analyses
We used different types of gene expression data to examine putative expression of genes identified in our genome-wide interaction analysis, and to determine the biological plausibility that the variants identified might impact CRC risk. First, we searched the Genotype-Tissue Expression project (GTEx) portal (http://www.broadinstitute.org/gtex/searchGenes) [34] and the Human Protein Atlas (http://www.proteinatlas.org) [35] to establish whether the implicated genes and corresponding proteins are expressed in human colon/rectal tissues. Second, we used several eQTL databases, including the Browser at University of Chicago (http://eqtl.uchicago.edu/Home.html), the Genevar (GENe Expression VARiation) at the Wellcome Trust Sanger Institute (http://www.sanger.ac.uk/resources/software/genevar) [42], HaploReg (http://www.broadinstitute.org/mammals/haploreg/haploreg.php) (PMID: 22064851), and the GTEx Portal Version 4 (http://gtexportal.org/home/) (PMID: 26484569), to investigate whether any of the implicated SNPs may impact the expression of the nearby genes. A cis-eQTL analysis was also performed in TCGA COAD data in 356 Caucasian samples that have demographic and clinical data for 15,008 genes (S1 Text). Third, we analyzed expression data for the implicated genes from 35 pairs of colorectal tumor-normal tissue samples included in the ColoCare Cohort (S2 Text) as well as expression data from the Cancer Genome Atlas (TCGA; http://cancergenome.nih.gov) in 50 pairs of colorectal adenocarcinoma-normal tissue samples. We searched the UCSC Cancer Genomics Browser (https://genome-cancer.ucsc.edu) [37][38][39] to examine whether the implicated genes showed evidence of differential expression in colorectal tumor tissue and normal tissue.
Last, we used the publicly available data in the Gene Expression Omnibus site (http://www.ncbi.nlm.nih.gov/geo/) [89,90] and the gene expression data from normal colon (n = 33) and tumor (n = 28) tissue in the ColoCare Cohort (S2 Text) to investigate whether the expression of implicated genes is correlated with alcohol/smoking history.

Bioinformatics analysis
We explored potential functional annotations for the SNPs that showed evidence for interactions with either smoking or alcohol in our genome-wide interaction analyses. As detailed in S1 Text, we queried multiple bioinformatics databases using the UCSC genome browser (http://genome.ucsc.edu), HaploReg (http://www.broadinstitute.org/mammals/haploreg/haploreg.php), and literature review of published enhancer signatures of colon cancer.

Table. Interaction between rs9409565 and alcohol consumption for CRC risk based on one reference group and stratified by genotype (last two rows) and by alcohol consumption (last column). (DOCX)

S6 Table. Interactions between rs9409565 and alcohol consumption for CRC risk using Empirical Bayesian (EB) interaction analysis, case-control (CC) logistic regression and case-only (CO) interaction analysis.

TCGA (c,d). The 4 probes for HIATL1 all showed that HIATL1 expression was significantly higher in tumor tissue than in normal tissue (Paired t test, P = 4.4×10−9 to 7.2×10−5); the results from two probes that uniquely match HIATL1 transcript were shown in a (P = 7.

Acknowledgments
We thank Dr. Wei Sun for providing eQTL results on TCGA data. We thank Dr. Peter Scacheri for his guidance on the variant enhancer loci analysis. ASTERISK: We are very grateful to Dr. Bruno Buecher without whom this project would not have existed. We also thank all those who agreed to participate in this study, including the patients and the healthy control persons, as well as all the physicians, technicians and students.
GECCO: The authors would like to thank all those at the GECCO Coordinating Center for helping bring together the data and people that made this project possible. The authors also acknowledge COMPASS ( Kopp, Wen Shao, and staff, SAIC-Frederick. Most importantly, we acknowledge the study participants for their contributions to making this study possible. The statements contained herein are solely those of the authors and do not represent or imply concurrence or endorsement by NCI.
PMH: The authors would like to thank the study participants and staff of the Hormones and Colon Cancer study.
WHI: The authors thank the WHI investigators and staff for their dedication, and the study participants for making the program possible. A full listing of WHI investigators can be found at: http://www.whi.org/researchers/Documents%20%20Write%20a%20Paper/WHI%20Investigator%20Short%20List.pdf

Author Contributions
Conceptualization: JG CMH SAB LH UP.