Multiple Analytical Approaches Reveal Distinct Gene-Environment Interactions in Smokers and Non Smokers in Lung Cancer

Complex diseases such as cancer result from interactions of multiple genetic and environmental factors. Studying these factors singly cannot explain the underlying pathogenetic mechanism of the disease. A multi-analytical approach, including logistic regression (LR), classification and regression tree (CART) and multifactor dimensionality reduction (MDR), was applied in 188 lung cancer cases and 290 controls to explore high order interactions among xenobiotic metabolizing genes and environmental risk factors. Smoking was identified as the predominant risk factor by all three analytical approaches. Individually, CYP1A1*2A polymorphism was significantly associated with increased lung cancer risk (OR = 1.69; 95% CI = 1.11–2.59, p = 0.01), whereas EPHX1 Tyr113His and SULT1A1 Arg213His conferred reduced risk (OR = 0.40; 95% CI = 0.25–0.65, p<0.001 and OR = 0.51; 95% CI = 0.33–0.78, p = 0.002 respectively). In smokers, EPHX1 Tyr113His and SULT1A1 Arg213His polymorphisms reduced the risk of lung cancer, whereas CYP1A1*2A, CYP1A1*2C and GSTP1 Ile105Val imparted increased risk in non-smokers only. While exploring non-linear interactions through CART analysis, smokers carrying the combination of EPHX1 113TC (Tyr/His), SULT1A1 213GG (Arg/Arg) or AA (His/His) and GSTM1 null genotypes showed the highest risk for lung cancer (OR = 3.73; 95% CI = 1.33–10.55, p = 0.006), whereas the combined effect of CYP1A1*2A 6235CC or TC, SULT1A1 213GG (Arg/Arg) and betel quid chewing showed maximum risk in non-smokers (OR = 2.93; 95% CI = 1.15–7.51, p = 0.01). MDR analysis identified two distinct predictor models for the risk of lung cancer in smokers (tobacco chewing, EPHX1 Tyr113His, and SULT1A1 Arg213His) and non-smokers (CYP1A1*2A, GSTP1 Ile105Val and SULT1A1 Arg213His) with testing balanced accuracy (TBA) of 0.6436 and 0.6677 respectively.
Interaction entropy interpretations of MDR results showed non-additive interactions of tobacco chewing with SULT1A1 Arg213His and EPHX1 Tyr113His in smokers and of SULT1A1 Arg213His with GSTP1 Ile105Val and CYP1A1*2C in non-smokers. These results identified distinct gene-gene and gene-environment interactions in smokers and non-smokers, which confirms the importance of multifactorial interaction in risk assessment of lung cancer.


Introduction
Lung cancer is the most commonly diagnosed cancer and the leading cause of cancer death globally [1]. In India it constitutes 6.2% of all cancers, with approximately 58,000 incident cases reported in 2008, and is the most frequent cancer in males [2]. The north-eastern (NE) part of India is showing a steady rise in cancer incidence, and lung cancer is among the ten leading sites, with the highest age-adjusted incidence rate (AAR) in Mizoram state (24.5 in males and 26.3 in females). Aizawl district alone shows an AAR of 36.0 in males and 38.7 in females, which is almost three to ten times higher than Delhi [3]. Incidence of lung cancer is also high among males in Silchar and Imphal districts. High incidence rates suggest a role of both genetic and environmental factors such as smoking, tobacco use and dietary carcinogen consumption.
Individuals possessing a modified ability to metabolize carcinogens such as polycyclic aromatic hydrocarbons (PAH), which are ubiquitous environmental, dietary, and tobacco carcinogens, are at increased risk of developing cancer. Thus genetic variants in xenobiotic metabolizing genes can influence their clearance from circulation and determine response to such carcinogens. The phase I xenobiotic metabolizing enzymes like cytochrome P-450s (CYPs), alcohol dehydrogenase (ALDH) and epoxide hydrolase (EPHX) usually activate the procarcinogens through oxidation and dehydrogenation, thereby converting them into reactive metabolites. Phase II metabolic enzymes such as glutathione S-transferases (GST), sulfotransferase (SULT) and N-acetyltransferase (NAT) generally result in inactivation or detoxification of these reactive metabolites. Equilibrium between expression and activity levels of these xenobiotic-metabolizing enzymes of both phase I and II determines the relative level of detoxification of carcinogens. However, these pathways are also known to activate toxic and carcinogenic chemicals to electrophilic forms that react irreversibly with macromolecules such as proteins and nucleic acids, leading to carcinogenesis.
Single nucleotide polymorphisms (SNPs) in xenobiotic metabolizing genes have been studied extensively in relation to the risk of lung cancer. A majority of these molecular epidemiological studies consider only the main effects of these SNPs, and their observed strength of associations could be challenged by penetrance of the genetic variant. Furthermore, a single locus cannot account for genetic susceptibility in a complex disease such as cancer, which involves multiple genetic variations and gene-environment interactions. Current evidence suggests that high order interactions in a multigenic approach allow more precise delineation of risk groups [4,5].
In the present study, two data mining approaches, CART and MDR, were applied along with LR to detect high order gene-gene and gene-environment interactions. Both CART and MDR are model-free, non-parametric methods of estimating non-linear interactions with low false-positive rates even on relatively small sample sizes. Model validation through permutation testing and false positive report probabilities was also done to overcome inaccurate estimation. Interaction entropy graphs were constructed to interpret combination effects identified by MDR. To further analyze possible effects of the EPHX1 and CYP1A1 SNPs, we estimated their haplotype frequencies and the risk imparted towards lung cancer.

Study subjects
This study consisted of 188 histopathologically diagnosed lung cancer cases registered at Dr. Bhubaneswar Borooah Cancer Institute, Guwahati, Civil Hospital, Aizawl, and Sir Thutob Namgyal Memorial Hospital, Gangtok, the collaborating centers in north east India. Incident cases presenting during the period of December 2006 to 2009 who were willing to participate in the study were included. A total of 290 voluntary, age (±5 years) and sex matched individuals were selected from the unrelated attendants who accompanied cancer patients. This provided a readily available and cooperative source of controls from the same socio-economic background as the cases, reducing confounding biases. As our collaborating centers were public hospitals, a large majority of subjects belonged to a lower to middle socio-economic background. Demographic data and characteristics such as age, sex, smoking habit, and usage of tobacco, betel quid and alcohol were obtained from subjects using a standard questionnaire common to all the centers, in an in-person interview by a trained data collector. A majority of cases and controls were literate with full primary schooling, and some had studied up to the college level. The occupational history of the study participants revealed that most of them were farm laborers or engaged in petty jobs, and the nature of such jobs did not expose them to any occupational hazards. Subjects were asked about any history of past or present illness and any medication being taken at the time of enrolment. Only patients with lung as their primary site of cancer were included. Any subject with a history of familial malignancy or pulmonary infectious disease was excluded from both cases and controls. Final controls were included on the basis of no history of any obvious disease and no medication at the time of recruitment.
All subjects provided written informed consent for participation in this research, which was done under a protocol approved by the institutional ethics committee of Regional Medical Research Centre, North East Region (Indian Council of Medical Research). Smokers, chewers and drinkers were classified into two categories, ever and never. For smoking, an individual who had never smoked, or who had smoked fewer than 100 cigarettes in their lifetime and was not smoking at the time of reporting, was considered a never smoker (non-smoker). The ever smoker (smoker) category included current smokers and those who had quit within <1 year of reporting [6]. Five ml of blood was collected in EDTA vials and stored at −70°C until processed.

Genotyping
Genomic DNA was isolated using Qiagen Blood DNA Isolation kit (Qiagen GmbH, Germany) and stored at −30°C till further analysis. Details of the SNPs selected for the study are summarized in Table S1. The deletion variants in GSTM1 and GSTT1 were determined by a multiplex polymerase chain reaction protocol, and SNPs in CYP1A1, EPHX1, GSTP1 and SULT1A1 were determined by polymerase chain reaction-restriction fragment length polymorphism assays as previously described [7][8][9][10][11][12]. Ten percent of the randomly selected cases and controls were genotyped twice for each SNP, and no discrepancies were observed.

Statistical Analysis
Cases were individually matched with controls on the basis of age (±5 years), sex and ethnicity, in a ratio of approximately 1:1.5. Differences in the distribution of demographic characteristics and genotype frequencies between cases and controls were evaluated using the Chi-square (χ²) and Fisher's exact tests wherever appropriate. Hardy-Weinberg equilibrium (HWE) was assessed using the χ² test. Estimates of cancer risk imparted by genotypes and other covariates such as tobacco smoking, tobacco chewing, betel quid chewing and alcohol consumption were determined by deriving the odds ratio (OR) and its corresponding 95% confidence interval (95% CI) using multivariable conditional logistic regression. For all tests a two-sided p<0.05 was considered statistically significant. The data analysis was performed with the Intercooled Stata 8.0 statistical software package (StataCorp, College Station, TX).
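As a rough illustration of two of the quantities named above, the following Python sketch computes a crude odds ratio with a log-based (Woolf) 95% confidence interval from a 2x2 table, and a one-degree-of-freedom χ² test for HWE from genotype counts. It is not the conditional logistic regression fitted in Stata for this study; the function names and the example counts are hypothetical and chosen only for illustration.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Crude (unmatched, unadjusted) OR with a Woolf log-based CI.

    a: exposed cases, b: unexposed cases,
    c: exposed controls, d: unexposed controls (all assumed > 0).
    """
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se_log)
    upper = math.exp(math.log(or_) + z * se_log)
    return or_, lower, upper

def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test (1 df) for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    return chi2, p_value

# Hypothetical counts, for illustration only
print(odds_ratio_ci(20, 10, 10, 20))
print(hwe_chi2(25, 50, 25))
```

A matched design such as the one described here would normally call for a conditional likelihood, so the crude OR above should be read only as a first approximation.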

Haplotype Analysis
Haplotypes were constructed from the unphased diploid genotype data using the Expectation Maximization-based algorithm. Individual haplotypes and their estimated population frequencies were inferred and estimates of linkage disequilibrium (D') between SNPs were calculated using Haploview software ver.4.1.
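The linkage disequilibrium measure D' mentioned above can be sketched in Python as follows. The sketch assumes that haplotype and allele frequencies have already been estimated (for example by an EM procedure, which is not reproduced here); the function name and the example frequencies are hypothetical and do not reproduce Haploview's implementation.

```python
def d_prime(p_ab, p_a, p_b):
    """Absolute normalized linkage disequilibrium |D'| for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a, p_b: frequencies of allele A and allele B (each strictly between 0 and 1)
    """
    d = p_ab - p_a * p_b
    if d == 0:
        return 0.0
    if d > 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max

# Hypothetical frequencies: complete LD, then independence
print(d_prime(0.5, 0.5, 0.5))
print(d_prime(0.25, 0.5, 0.5))
```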

Identification of High Order Interactions
High order interactions were determined using CART, MDR and interaction entropy graphs.
CART. A binary recursive partitioning method was used to produce a decision tree that identified specific combinations of contributing factors associated with lung cancer risk using the commercially available CART software (version 6.6, Salford Systems) [13]. Tree splitting was continued until terminal nodes reached a pre-specified minimum size of 10 subjects. The optimal tree was selected using the one standard error (1-SE) rule and 10-fold cross validation. Subgroups of individuals with differential risk patterns were identified in the different order of nodes, indicating the presence of gene-gene and gene-environment interactions. Fisher's exact test was used to calculate relative risk in each terminal node of the tree.
MDR. The MDR software was developed by Ritchie et al. in 2001 [4] and reviewed by Moore et al. [14]. Genotype and environmental factors were pooled into high and low risk groups, effectively reducing the multifactor prediction from n dimensions to one dimension using MDR software (version 2.0 beta) (http://www.epistasis.org). We applied the Tuned ReliefF (TuRF) filter algorithm to remove noisy SNPs and avoid overfitting of data. Best models for each locus were selected by repeating the analysis for up to 10 seeds and applying 10-fold cross validation each time. Statistical significance of the best models selected for each locus was determined using 1000-fold permutation testing. The p-values thus obtained for TBA and cross validation consistency (CVC) were considered statistically significant at the 0.05 level.
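To make the idea of recursive partitioning more concrete, the sketch below searches for the single best binary split of a data set of categorical risk factors. It assumes Gini impurity as the splitting criterion, which is a common choice but is not stated in the text for the software used; the function names and the toy data are hypothetical. The minimum node size of 10 mirrors the value quoted above, and pruning by the 1-SE rule and cross validation are not shown.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, min_node=10):
    """Best binary split of the form 'feature == level' versus the rest.

    rows: list of dicts mapping factor name -> categorical value
    Returns (feature, level, impurity_decrease); (None, None, 0.0) if no
    admissible split improves on the parent node.
    """
    n = len(labels)
    parent = gini(labels)
    best = (None, None, 0.0)
    for feature in rows[0]:
        for level in set(r[feature] for r in rows):
            left = [y for r, y in zip(rows, labels) if r[feature] == level]
            right = [y for r, y in zip(rows, labels) if r[feature] != level]
            if len(left) < min_node or len(right) < min_node:
                continue  # respect the minimum terminal node size
            gain = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
            if gain > best[2]:
                best = (feature, level, gain)
    return best

# Toy data (hypothetical): smoking separates cases from controls, GSTM1 does not
rows = [{"smoking": "yes", "GSTM1": "null" if i % 2 else "present"} for i in range(20)]
rows += [{"smoking": "no", "GSTM1": "null" if i % 2 else "present"} for i in range(20)]
labels = [1] * 20 + [0] * 20
print(best_split(rows, labels)[0])
```

Applying the same search recursively to each child node would grow the full tree.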
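The pooling step at the core of MDR can be sketched in a few lines of Python. This minimal version labels each multifactor cell as high risk when its case:control ratio reaches the overall ratio in the data, then reports the balanced accuracy on the same data. It omits the TuRF filter, cross validation and permutation testing described above, so it yields a training rather than a testing balanced accuracy; the function name and toy data are hypothetical.

```python
from collections import defaultdict

def mdr_model(rows, labels, factors):
    """Pool multifactor cells into high/low risk and score the pooled classifier.

    rows: list of dicts mapping factor name -> categorical value
    labels: list of 0 (control) / 1 (case)
    factors: names of the factors forming the candidate model
    Returns (set of high-risk cells, balanced accuracy on the same data).
    """
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    threshold = n_case / n_ctrl               # overall case:control ratio
    counts = defaultdict(lambda: [0, 0])      # cell -> [controls, cases]
    for row, y in zip(rows, labels):
        counts[tuple(row[f] for f in factors)][y] += 1
    high_risk = {cell for cell, (ctrl, case) in counts.items()
                 if case >= threshold * ctrl}
    tp = sum(case for cell, (ctrl, case) in counts.items() if cell in high_risk)
    tn = sum(ctrl for cell, (ctrl, case) in counts.items() if cell not in high_risk)
    balanced_accuracy = 0.5 * (tp / n_case + tn / n_ctrl)
    return high_risk, balanced_accuracy

# Toy data (hypothetical): disease depends on the XOR of two loci,
# so neither locus has a main effect on its own
rows = [{"snp1": a, "snp2": b} for a in (0, 1) for b in (0, 1) for _ in range(10)]
labels = [1 if r["snp1"] != r["snp2"] else 0 for r in rows]
print(mdr_model(rows, labels, ["snp1", "snp2"]))
print(mdr_model(rows, labels, ["snp1"]))
```

The toy example shows why such pooling can detect interactions in the absence of main effects: the two-locus model classifies perfectly while the one-locus model does no better than chance.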

False Positive Report Probability (FPRP)
Reports of gene-environment interaction studies are often challenged by false positive discoveries, especially when results are generated by multiple comparisons. To estimate the FPRP and to evaluate the robustness of the findings from MDR analysis, we used the Bayesian approach described by Wacholder et al. [15]. The method requires a prior probability that the association between the genetic variant and the disease is real. As prior probability can be a subjective measure and can be influenced by several factors, studies usually report a wide range. Considering the poor epidemiological data from the study population and the inconsistent association of the SNPs with lung cancer risk, we set a fairly wide range of prior probabilities (10⁻⁶ to 10⁻¹), with statistical power estimated for detecting an OR of 1.5 and 2.0 and an α level equal to the observed p-value. The FPRP cutoff point was stringently kept at 0.2.
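The FPRP calculation can be sketched as below, using the standard expression FPRP = α(1 − π) / (α(1 − π) + (1 − β)π), where π is the prior probability of a true association, α is the significance level (here the observed p-value) and 1 − β is the statistical power. The power is taken as an input rather than derived from an OR, and the numeric values in the example are hypothetical, not results of this study.

```python
def fprp(alpha, power, prior):
    """False positive report probability.

    alpha: significance level (set to the observed p-value)
    power: statistical power (1 - beta) to detect the assumed odds ratio
    prior: prior probability that the association is real
    """
    false_pos = alpha * (1 - prior)
    true_pos = power * prior
    return false_pos / (false_pos + true_pos)

# Hypothetical p-value and power, scanned over the range of priors used in the text
for exponent in range(1, 7):
    prior = 10.0 ** -exponent
    value = fprp(alpha=0.0001, power=0.9, prior=prior)
    print(f"prior = 1e-{exponent}: FPRP = {value:.3f}, noteworthy = {value < 0.2}")
```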

Interaction entropy graphs
Interaction graphs were built to visualize and interpret the results obtained from MDR using Orange machine learning software package [16]. Interaction graphs use entropy estimates as described by Jakulin et al. [17] for determining the gain in information about a class variable (e.g. case-control status) from merging two variables together over that provided by the variables independently. This measure of entropy is useful for building interaction graphs that facilitate the interpretation of the relationship between variables. Interaction graphs are comprised of a node for each variable with pairwise connections between them. The percentage of entropy removed (i.e. information gain) by each variable is visualized for each node. The percentage of entropy removed for each pairwise Cartesian product of variables was visualized for each connection. Thus, the independent main effects of each SNP can be compared to the interaction effect. Positive entropy (plotted in green) indicates non-linear interaction while negative entropy (plotted in red) indicates redundancy. Entropy value equal to zero indicates independence or a mixture of synergy and redundancy.
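The entropy measures behind such graphs can be illustrated with a short Python sketch. It estimates the information each attribute provides about case-control status, I(A;C), and the interaction information IG(A;B;C) = I(A,B;C) − I(A;C) − I(B;C), which is positive for synergy and negative for redundancy. The function names and toy data are hypothetical; plug-in frequency estimates are used, and this is not the Orange implementation.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_info(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def interaction_gain(a, b, c):
    """IG(A;B;C): information about C gained by joining A and B,
    beyond what A and B provide independently."""
    ab = list(zip(a, b))
    return mutual_info(ab, c) - mutual_info(a, c) - mutual_info(b, c)

def percent_entropy_removed(x, c):
    """Main effect of attribute X as a percentage of the class entropy H(C)."""
    return 100.0 * mutual_info(x, c) / entropy(c)

# Toy example (hypothetical): class is the XOR of two attributes (pure synergy)
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
c = [0, 1, 1, 0]
print(percent_entropy_removed(a, c), interaction_gain(a, b, c))
```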

Characteristics of study subjects
The distribution of gender and ethnicity was similar for cases and controls. The frequency distribution of males and females was 77.1% and 22.9% in cases and 76.2% and 23.8% in controls respectively. Mean age of cases and controls was 60.41 ± 10.58 (range 30-82 yrs) and 57.19 ± 10.75 (range 32-85 yrs) respectively. The distribution of all SNPs in controls was in agreement with HWE (p>0.05); however, genotypes of the EPHX1 Tyr113His and SULT1A1 Arg213His polymorphisms in cases did not follow HWE (p<0.001 and p = 0.004 respectively).

Association of genetic and environmental factors with lung cancer risk by LR analysis
The distribution and main effects of genetic and environmental factors are summarized in Table 1. Risk habits such as smoking, tobacco chewing and betel quid chewing were predominant among cases. However, only smoking and betel quid chewing were significantly associated with increased risk for lung cancer (OR = 3.06; 95% CI = 1.94–4.83; p<0.001 and OR = 1.86; 95% CI = 1.21–2.84; p = 0.004 respectively). Genotype distributions of the CYP1A1*2A, EPHX1 Tyr113His, SULT1A1 Arg213His and GSTT1 null polymorphisms were significantly different in cases from controls (p = 0.014, p<0.001, p = 0.01 and p = 0.04 respectively). Main effects of genotypes in lung cancer susceptibility were evaluated using conditional multivariable logistic regression. The heterozygous genotype in CYP1A1*2A was associated with increased risk (OR = 1.69; 95% CI = 1.11–2.59; p = 0.01) whereas heterozygous genotypes in EPHX1 Tyr113His and SULT1A1 Arg213His imparted reduced risk towards lung cancer (OR = 0.40; 95% CI = 0.25–0.65, p<0.001 and OR = 0.51; 95% CI = 0.33–0.78, p = 0.002 respectively). CYP1A1*2A and EPHX1 His139Arg polymorphisms were associated with increased risk of lung cancer in the dominant genetic model, whereas EPHX1 Tyr113His and SULT1A1 Arg213His imparted reduced risk in the recessive genetic model (Table S2). Table 2 summarizes the associations between the frequency distributions of the haplotypes in the CYP1A1 and EPHX1 genes and the risk of lung cancer. The odds ratios were calculated using the most common haplotype as the reference group. In CYP1A1, the "TA" haplotype was the most frequent among both cases and controls and showed significant association. Only the CYP1A1-CG haplotype imparted increased risk of lung cancer (OR = 1.49; 95% CI = 1.00–2.21, p = 0.04). In EPHX1, the "TA" haplotype was the most common, with frequencies of 44.79% and 45.04% in cases and controls respectively. No haplotype was found to be significantly associated with lung cancer risk.

Risk associated with SNPs stratified by smoking
Since smoking is a well established risk factor for lung cancer and was the strongest independent risk factor in LR, we further stratified the data by smoking status. Distribution and risk associated with genetic factors after stratification is shown in
Figure 1 shows the selected CART model constructed on all investigated genetic variants and environmental risk factors. The final tree contained eight terminal nodes. The first split of the root node was on smoking habit, indicating that smoking is the strongest risk factor for lung cancer. Among smokers, the subsequent splits showed interactions between EPHX1 Tyr113His, SULT1A1 Arg213His and GSTM1. In non-smokers the first split was on CYP1A1*2A status, which was in concordance with the LR analysis, where CYP1A1*2A showed strong association with risk only in non-smokers. Further interactions were predicted by SULT1A1 Arg213His polymorphism and betel quid status. Terminal node 7, which comprised the lowest percentage of cases among non-smokers, was taken as the reference to calculate ORs for the other terminal nodes. Among smokers, maximum risk was observed for terminal node 1, consisting of EPHX1 113TT (Tyr/Tyr) or 113CC (His/His) genotypes (OR = 4.38; 95% CI = 2.12–9.15), and for terminal node 2, with the combination of EPHX1 113TC (Tyr/His), SULT1A1 213GG (Arg/Arg) or AA (His/His) and GSTM1 null genotypes (OR = 3.73; 95% CI = 1.33–10.55, p = 0.006). In non-smokers, high risk was seen for terminal node 5, comprising CYP1A1*2A 6235CC or TC, SULT1A1 213GG (Arg/Arg) and betel quid chewing (OR = 2.93; 95% CI = 1.15–7.51, p = 0.01). In parallel, CART analysis on separate data sets of smokers and non-smokers was also performed. However, we did not detect any high-order interaction in these analyses (data not shown).

MDR Analysis
MDR analysis was applied to further explore gene-gene and gene-environment interactions. Best predictive models up to 4 orders of interaction, along with their CVC and TBA, are summarized in Table 4. The analysis was run separately for the total data set and for data sets stratified on smoking status. For total data

False positive report probability (FPRP)
Table 5 shows the FPRPs for the 3 best models obtained from MDR analysis. The 4-loci predictor model on the total data set and the 3-loci model in smokers showed excellent reliability even when assuming very low prior probabilities (from 10⁻³ to 10⁻⁶) for detecting ORs of 1.5 and 2.0. However, the best model selected in the non-smoker category showed true association only at a high prior probability of 10⁻¹ for detecting OR = 1.5 and down to 10⁻² for detecting OR = 2.0.

Interaction entropy graphs
After identifying the high-risk combinations using the MDR approach, the interaction entropy algorithm was applied to interpret the relationships between the variables. Graphs were constructed on MDR results obtained from analysis of the total data set (Figure S1) and of the data set stratified by smoking (Figure 2). In smokers, EPHX1 Tyr113His had a large independent effect (4.64%) and a non-additive interaction with tobacco chewing (entropy 1.79%). Considerable entropy was associated with SULT1A1 Arg213His (1.88%), and its interaction with tobacco chewing further removed 1.49% of entropy from the case-control group. However, we did not detect any non-linear interaction between the two SNPs in the model. We found small percentages of the entropy in case-control status explained by alcohol consumption (0.56%) and tobacco chewing (0.70%) independently, but a large percentage of entropy explained by the interaction between these two environmental factors (2.47%). In non-smokers, CYP1A1*2A showed the strongest main effect, with entropy removal of 4.7%. GSTP1 Ile105Val too had a strong independent effect (entropy removal = 3.28%), and its interaction with SULT1A1 Arg213His further removed 3.02% of entropy. A strong synergistic interaction was observed between

Discussion
The present study used multiple analytical methods to first assess associations and then explore possible interactions of xenobiotic metabolizing genes with environmental factors in lung cancer risk. The applied data mining approaches have the ability to search for and identify interactions regardless of the significance of the main effects. The most significant finding of this study is the consistent identification of gene-gene and gene-environment interactions by all three statistical approaches.
Smoking is the primary etiological factor in lung cancer. The same was reflected in the present study, as smoking showed strong association in LR, was the best one-factor model in MDR and formed the first split in CART. Interaction of EPHX1 Tyr113His and SULT1A1 Arg213His was consistently identified in smokers. Both EPHX1 Tyr113His and SULT1A1 Arg213His conferred reduced risk in the smoker subset in LR. The two polymorphisms along with EPHX1 His139Arg formed the best predictor model in MDR analysis in smokers and also formed subsequent splits within smokers in CART. The EPHX1 enzyme catabolizes epoxides from PAH into dihydrodiols, which involves generation of more reactive carcinogenic metabolites. Substitution of a variant His allele at codon 113 (EPHX1 Tyr113His) decreases the activity of this enzyme [18], thereby reducing the risk of cancer. Studies on lung cancer suggest a protective effect for His113 (slow type) as compared to Tyr113 (fast type), which imparts increased lung cancer risk [19][20][21]. The variant allele has also been suggested to decrease the risk of ovarian cancer [22]. We have earlier reported similar results from the same population in esophageal cancer, showing the His113 allele to be associated with a significantly reduced risk in smokers [23]. Reflecting the same, in CART analysis terminal node 1 imparts over 4-fold higher risk to smokers, possibly due to a high proportion of the wild Tyr113 homozygous genotype. The sulphonation reaction of SULT1A1 is a detoxification reaction; however, it also involves bioactivation of certain procarcinogens, including heterocyclic amines and PAHs, to form carcinogen-DNA adducts [24,25]. In vitro model studies suggest that substitution of histidine at position 213 in the amino acid sequence is associated with decreased substrate affinity and a lower level of protein [26], which might protect against chemical carcinogenesis of PAHs in lung cancer [27].
Results on the association of SULT1A1 Arg213His with risk of cancer are inconsistent, ranging from null association with risk of colorectal cancer [28] and prostate cancer [29] to an increase in risk of breast cancer associated with the His213 allele [30]. Another study on colorectal cancer showed a significantly reduced risk for individuals carrying the His213 allele [31]. A meta-analysis by Kotnis et al. [32] showed a significant protective effect of the polymorphism in seven studies of genitourinary cancers.
Among non-smokers, CYP1A1*2A and GSTP1 Ile105Val were the most important polymorphisms identified for lung cancer development. The variant allele of both polymorphisms conferred significant risk in the non-smoking subgroup in LR analysis. Similarly, the MDR 3-loci model of CYP1A1*2A, GSTP1 Ile105Val and SULT1A1 Arg213His polymorphisms was the best predictor of risk in non-smokers. The CYP1A1 6235T>C MspI (CYP1A1*2A) polymorphism is associated with higher enzymatic activity towards benzopyrene [33,34]. Investigations on the association between CYP1A1 polymorphisms and lung cancer have yielded equivocal results [35,36]. Similar to our findings, a study by Taioli et al. [37] reported association of the CYP1A1*2A variant allele with lung cancer; however, after stratification by smoking the association remained confined to non-smokers only. Further, in a pooled analysis of 11 studies on the CYP1A1*2C polymorphism in lung cancer, Le Marchand et al. [38] found it to be associated with risk in non-smokers, a finding which corroborates our results. Another study by Jose et al. [39] on lung cancer found no association of any CYP1A1 polymorphism with smokers. Similar results were reported in colorectal cancer, where heterozygous and variant genotypes of both CYP1A1*2A and CYP1A1*2C conferred risk in combination with NAT2 only among non-smokers [40]. An in vitro cDNA expression study suggests that GSTP1 with the 105Val variant results in a protein with reduced enzyme activity [41]; however, it is reported to play an unlikely role in smoking-related cancers [42]. A similar observation has been reported in breast cancer [10].
Probably the precise role of GSTP1 in carcinogenesis is determined by the kind of xenobiotic involved, owing to its substrate specificity and affinity [43]. Conforming to its exploratory nature, CART analysis identified two more risk factors, GSTM1 null genotype in smokers and betel quid chewing in non-smokers. The results are quite plausible because both hold functional and biological significance. High risk for smoking-related lung cancer has been reported in individuals deficient in GSTM1 [44][45][46]. Smokers with the GSTM1 enzyme have approximately one-third of the risk for lung carcinoma of smokers without the enzyme [47]. There are numerous reports of association between GSTM1 null genotype and smoking in various cancers including esophageal [48], bladder [49], colorectal [50] and oral [51]. A recent study by Wen et al. [52] showed that betel quid chewing increases lung cancer risk in non-smokers, with smoking habit further enhancing the risk. Betel quid chewing is a unique and widespread habit in the north-eastern (NE) region of India. Betel quid is a chewing mixture of whole betel/areca nut wrapped with betel leaves spread with white lime, with frequent addition of tobacco. It is known to contain phenolic compounds and alkaloids; in addition, nitrosamines are formed from an in vivo reaction of betel arecoline, nitrite and thiocyanate, all of which act as carcinogens [53]. Studies have reported association between betel quid chewing and cancer risk. Significant association of betel quid chewing with risk of oral, stomach [54], esophageal [23] and breast cancer [55] has been reported from the study population. It would be reasonable to assume that the association of betel quid chewing with lung cancer is a result of a complex combination of direct and indirect action of tobacco carcinogens contained in it.
Table 5. False positive report probability and odds ratio for best model of MDR analysis.
A post-hoc analysis through entropy graph was done to visualize and interpret interaction models identified by MDR. The previously documented main effects of EPHX1 Tyr113His and SULT1A1 Arg213His in smokers and CYP1A1*2A and GSTP1 Ile105Val in non-smokers were evident. Further, synergistic interactions of SULT1A1 Arg213His with GSTP1 Ile105Val and with CYP1A1*2C were observed in non-smokers.
As haplotypes are more efficient and informative than separate markers, haplotype association analysis was carried out in the CYP1A1 and EPHX1 genes. The CG haplotype in CYP1A1 was significantly associated with risk of lung cancer. Noteworthy were the results in EPHX1, where the frequency of haplotypes among cases was strikingly similar to that in a report published on esophageal cancer from north India [56].
Although both MDR and CART validated the LR results, they differed in identifying some unique interactions, reflecting the different methods followed by each program. Both approaches provide a clear advantage over traditional LR by identifying non-linear interactions among discrete genetic and environmental attributes. Significant findings of the study are summarized in Figure S2. It would be safe to assume a definite association of the commonly recognized factors with lung cancer that might have implications for future studies. The role of the CYP1A1*2A polymorphism is evident only among non-smokers in all three methods. LR and CART analyses even showed a gene-dosage effect, with lung cancer risk increasing with the number of variant alleles in the CYP1A1*2A polymorphism. As aforesaid, this finding provides support to previously published reports [37][38][39][40]. MDR and CART analysis show epistasis between the EPHX1 Tyr113His and SULT1A1 Arg213His polymorphisms exclusively among smokers. Their combined models confer risk of lung cancer; however, individually both act as protective factors in smokers only. These factors hold their importance as the SNPs are functionally and biologically relevant and have been implicated in the carcinogenesis process in previous studies on various cancers. A major challenge for the identification of true genetic and interactive effects in a multi-factorial study is the simultaneous testing of several hypotheses. The three methods of analysis used in this study address the same research hypotheses but differ in terms of their statistical methodologies and analytical approaches. P-value adjustment for multiple testing was performed through Šidák correction in the LR model, with the adjusted significance level given by $1 - (1 - \alpha)^{1/n}$, where n = 4 in both total and stratified analyses. Multiple testing in data mining approaches such as CART and MDR sometimes compromises comparative power.
When numerous null hypotheses are being tested, yielding higher order interacting combinations, drawing inference from a single erroneous rejection is not an appropriate strategy; rather, the proportion of erroneous rejections needs to be controlled. This is achieved by estimation of FPRP. These approaches utilize internal cross-validation and permutation testing of p-values, reducing the chances of making type I errors. Both MDR and CART apply cross validation of data before selecting the best model; however, MDR also uses 1000-fold permutation testing to validate its results, minimizing the proportion of false positives due to multiple testing. Cross validation (5-10 fold), which divides the whole data set into different training and testing sets, prevents over-fitting and artificial accuracy improvement. The permutation test is considered the gold standard for accurate multiple testing correction. Controlling the false discovery rate (FDR) is a more realistic approach to the concerns raised by multiple hypothesis testing. This is because FDR is the proportion of incorrect rejections among all rejections. Likewise, the best models derived from MDR on the total data set and the smokers set in this study showed good reliability, as associations remained robust even at low prior probabilities in FPRP testing. CART analysis was able to define genetic associations with fairly good measures. Correct classification of cases and controls in the test data set was approximately 63% for both.
There might be some limitations to this study. The sample size of our study was relatively small; however, based on the evidence (OR) provided by our research group on the association between GSTs and lung cancer [57], the minimum sample size determined was 176 at a 5% level of significance and 90% power. Polymorphisms of EPHX1 Tyr113His and SULT1A1 Arg213His in cases showed deviation from HWE. After ruling out false positive associations and genotyping errors, population stratification could perhaps be a reason for this deviation. However, the cases were incident, and thus the data do not show report or recall bias. Also, case-control matching was done with reference to age, gender, and ethnicity, thereby controlling for any confounding effect of these variables.
In conclusion, this study highlights that better predictors of lung cancer risk can be obtained through polygenic approaches and by exploring gene-environment interactions. The study identified distinct patterns of interaction in smoking and non-smoking subgroups. However, the results presented should be treated with caution since this is the first epidemiological evidence identifying the complex relationship between genetic polymorphisms and cancer susceptibility in the studied population. Further studies with large samples in independent populations are warranted to validate the findings of this study.

Figure S1 Interaction entropy graphs (for total data set). The interaction model describes the percentage of the entropy (information gain) removed by each variable (main effect: represented by nodes) and by each pairwise combination of attributes (interaction effect: represented by connections). Attributes are selected on the basis of MDR results obtained in case of total data set. Labels: Smk: smoking, SULT: SULT1A1 Arg213His, Ex3: EPHX1 Tyr113His (EH3), Ex4: EPHX1 His139Arg, SULT. (TIFF)
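The Šidák adjustment referred to above reduces to a one-line calculation. The sketch below assumes a family-wise α of 0.05, matching the significance threshold stated in the statistical analysis, with n = 4 comparisons as given in the text; the function name is hypothetical.

```python
def sidak_alpha(alpha, n):
    """Per-comparison significance level under the Sidak correction:
    1 - (1 - alpha)^(1/n) for n independent comparisons."""
    return 1 - (1 - alpha) ** (1 / n)

# Assumed family-wise alpha of 0.05 with n = 4 comparisons
print(round(sidak_alpha(0.05, 4), 5))  # 0.01274
```

Under these assumptions an individual p-value would need to fall below roughly 0.0127 to remain significant after correction.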