Integrative Genomic Analysis Reveals Extended Germline Homozygosity with Lung Cancer Risk in the PLCO Cohort

Susceptibility to common cancers is multigenic, resulting from low-to-high penetrance predisposition factors and environmental exposure. Genomic studies suggest germline homozygosity as a novel low-penetrance factor contributing to common cancers. We hypothesized that long homozygous regions (tracts of homozygosity [TOH]) harbor tobacco-dependent and tobacco-independent lung-cancer predisposition (or protection) genes. We performed an in silico genome-wide SNP-array-based analysis of lung-cancer patients of European ancestry from the PLCO screening-trial cohort to identify TOH regions amongst 788 cancer cases and 830 ancestry-matched controls. Association analyses were then performed between the presence of lung cancer and common TOHs (cTOHs; operationally defined as 10 or more subjects sharing ≥100 identical homozygous calls), aTOHs (allelically matched groups within a cTOH), demographics and tobacco exposure. Finally, significant c/aTOHs were integrated with transcriptome data to functionally map lung-cancer risk genes. After controlling for demographics and smoking, we identified 7 cTOHs and 5 aTOHs associated with lung cancer (adjusted p<0.01). Three cTOHs were over-represented in cases over controls (OR = 1.75–2.06, p = 0.007–0.001), whereas 4 were under-represented (OR = 0.28–0.69, p = 0.006–0.001). An interaction between smoking status and cTOH3/aTOH2 (2p16.3–2p16.1) was observed (adjusted p<0.03). The remaining significant aTOHs have ORs of 0.23–0.50 (p = 0.004–0.006) and 2.95–3.97 (p = 0.008–0.001). Integrating significant cTOH/aTOHs with publicly available lung-cancer transcriptome datasets, followed by filtering based on lung cancer and its relevant pathways, revealed 9 putative predisposing genes (p<0.0001). In conclusion, differentially distributed cTOH/aTOH genomic variants between cases and controls harbor sets of plausible differentially expressed genes accounting for the complexity of lung-cancer predisposition.


Introduction
There are two main histologic groupings in lung cancer: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). The latter includes adenocarcinoma (AC) and squamous cell carcinoma (SCC), along with less common subtypes. It has been widely accepted that an average of 5-10% of all malignancies are caused by high-penetrance predisposition genes [1][2][3]. For example, there are 10 high-penetrance genes, including BRCA1/2 and PTEN, accounting for ~10% of all breast cancers [3]. While aerodigestive tract cancers are believed to be a rare part of the neoplastic spectrum of BRCA2, no other high-penetrance lung-cancer-predisposition gene has been identified, and until recently, lung cancer has been attributed almost entirely to environmental exposure, chiefly tobacco. In the last few years, however, it has become obvious that a greater, but variable, proportion of all malignancies have a genomic component conferring weaker predisposition (low penetrance). Eg, a genome-wide association study (GWAS) demonstrated specific single nucleotide polymorphisms (SNPs) associated with risk of AC in smokers and never-smokers [4]. To date, genomic loci associated with NSCLC, especially AC, have been identified in 15q25, 5p15, and 6p21 [5][6][7][8][9][10]. Analysis of the effect of smoking on lung-cancer risk showed that smoking does not entirely explain the risk of developing lung cancers and that residual genomic factors interacting with smoking are likely [4]. Genomic variants, such as the associated SNPs, cannot fully explain the heterogeneity associated with the histologic subtypes either [11,12]. The evidence to date suggests the need to find other types of genomic variation that can explain the relatively large remaining risk associated with lung carcinomas.
In animal husbandry and animal-model experimentation, inbreeding, which increases the number of homozygous loci, is well recognized to result in increased incidence of various disorders, including increased tumor incidence [13]. In humans, germline homozygosity as a genomic factor associated with disease risk is a relatively recent concept. Eg, germline homozygosity, a type of genomic variation, has been shown to be associated with an increased risk of human cervical cancer. Identification of homozygous loci as risk factors may help target heightened cervical screening for high-risk women [14][15][16][17][18]. Relatedly, a relatively recent study using genome-wide microsatellite genotyping uncovered a significantly higher frequency of germline homozygosity in a series of unrelated white individuals with invasive breast carcinomas, prostate carcinomas and head and neck squamous cell carcinomas [19]. This association was validated in a study of AC cases and matched controls that were genotyped with denser SNP-based arrays (Illumina HumanHap550v3_B array), thus supporting the high likelihood of identifying homozygous genotypes that are associated with a broad variety of common solid tumors [19]. In that study, both microsatellite- and SNP-based analyses showed specific loci of homozygosity shared by all three tumor types studied. In addition, there were highly homozygous loci specific to each of the tumor types. Independently, Bacolod and colleagues [20] found that long tracts of homozygosity (TOH), operationally defined as spanning at least 4 Mb, were over-represented in colorectal cancer patients over controls.
Here, we hypothesized that germline regional-homozygosity involving specific chromosomal loci is a novel genomic factor contributing to low-to-moderate penetrance predisposition to (or protection from) lung cancer. Instead of identifying single genes, our hypothesis takes into account subsets of genes within these regions, which are differentially expressed to lend complex predisposition to lung cancer. We sought to address this hypothesis by systematically integrating data from differentially represented TOH regions with genome-wide expression data to localize regional lung-cancer predisposition loci.

Acquisition of Genotype Data from dbGAP
Genotypes were obtained from the Prostate, Lung, Colorectal and Ovarian cancer screening trial (PLCO), where the lung cohort was prospectively screened with chest X-rays [21]. Subjects all self-identified as white, and comprise ancestry-matched cases and controls [21] based on principal-component analyses using both SNPs unlinked to lung cancer and their ancestry-informed SNPs, as described by Patterson et al [22]. Consistently, the CEPH (Centre d'Etude du Polymorphisme Humain) from Utah (CEU) HapMap controls cluster with this population, reconfirming northern and western European origin [23].
We followed the standard quality control (QC) procedure used in the original study [4]. Samples were screened and selected only if they had a minimum 95% successful genotype call rate. SNPs with minor allele frequencies (MAF) <5%, departures from Hardy-Weinberg equilibrium (at p<0.01) or ≥5% missingness per SNP were excluded from further analyses. After QC filtering, we had 1618 subjects (788 cases and 830 ancestry-matched controls) with mean age categories of 1.63 (5 categories defined in Table S1), comprising 967 males and 651 females, including 156 nonsmokers, 703 previous smokers and 759 current smokers (Table S1); and an average of 526,826 (514,355 autosomal) SNPs (93.8%) per subject. Table S1 shows the association analysis based on a logistic model with age, gender and smoking status (never smoked, previous smoking and current smoking) as covariates after excluding the potential genetic effects. It is important to note that the proportion of current smokers was about half the rate of active smokers in the US general population. It was noted that compliance was the lowest in the current smokers whereas the previous smokers were the most compliant.
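The sample- and SNP-level filters described above can be illustrated with a minimal sketch. It is not the pipeline used in the study: genotypes are assumed to be coded as allele counts (0, 1, 2) with None for a missing call, the function names are hypothetical, and the Hardy-Weinberg test is omitted.

```python
def sample_call_rate(genotypes):
    """Fraction of SNPs with a successful call for one subject."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def snp_call_stats(genotypes):
    """Return (missing_rate, maf) for one SNP across subjects.

    genotypes: list of 0/1/2 allele counts, None = missing (assumed coding).
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return missing_rate, 0.0
    freq = sum(called) / (2 * len(called))  # frequency of the counted allele
    return missing_rate, min(freq, 1 - freq)


def passes_qc(genotypes, min_maf=0.05, max_missing=0.05):
    """Keep a SNP with MAF >= 5% and missingness < 5% (thresholds from the text)."""
    missing_rate, maf = snp_call_stats(genotypes)
    return maf >= min_maf and missing_rate < max_missing
```

A subject would be retained when `sample_call_rate(...) >= 0.95`, and a SNP when `passes_qc(...)` is true and it also passes a Hardy-Weinberg test, which is not shown here.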

Quantifying Tracts of Homozygosity and Comparing Frequencies in Cancer Cases and Controls
Identifying tracts of homozygosity (TOH) and common TOH (cTOH) regions. We extended the Runs of Homozygosity module in the GoldenHelix software [24] to identify TOHs, using in-house software (Zhang et al, unpublished). Next, data from all subjects were examined to determine whether a minimum number of individuals share a TOH call at a given position. To identify statistical differences between TOHs within a case-control design, we retained only those TOHs in which 10 or more subjects share at least 100 identical homozygous calls, which we operationally define as a common TOH (cTOH). There are 333,861 SNPs with 10 or more TOH calls across the entire series, representing 65% of the original pool of SNPs.
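The two steps above (calling runs per subject, then keeping positions shared by enough subjects) can be sketched as follows. This is an illustrative simplification, not the GoldenHelix module or the in-house extension: genotypes are assumed to be coded 0/1/2 with 1 as the heterozygous call, any heterozygous or missing call ends a run, and the function names are hypothetical.

```python
def find_runs(genotypes, min_snps=100):
    """Return [start, end) index pairs of runs of consecutive homozygous calls."""
    runs, start = [], None
    for i, g in enumerate(genotypes):
        homozygous = g in (0, 2)  # 1 (heterozygous) or None (missing) breaks a run
        if homozygous and start is None:
            start = i
        elif not homozygous and start is not None:
            if i - start >= min_snps:
                runs.append((start, i))
            start = None
    if start is not None and len(genotypes) - start >= min_snps:
        runs.append((start, len(genotypes)))
    return runs


def common_positions(subjects, min_snps=100, min_subjects=10):
    """SNP indices covered by a run in at least min_subjects subjects."""
    counts = [0] * len(subjects[0])
    for genotypes in subjects:
        for start, end in find_runs(genotypes, min_snps):
            for i in range(start, end):
                counts[i] += 1
    return [i for i, c in enumerate(counts) if c >= min_subjects]
```

Contiguous stretches of the returned indices would correspond to candidate cTOH regions. Real implementations typically tolerate a small number of heterozygous or missing calls within a run, which this sketch does not.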
Detection of cTOHs associated with lung cancer. We then tested for association between cTOHs and lung-cancer cases. By considering each cTOH as a genomic variant, a genome-wide case-control analysis was conducted for each cTOH, where a cTOH was viewed as a binary variable based on its presence or absence. Using each TOH (containing multiple SNPs that are in linkage disequilibrium) as a variable considerably reduces the number of tests to be performed and boosts the power of the association analysis. A traditional single-SNP GWAS would require at least 610,000 tests (up to 3 million if more SNPs are used). A logistic model was fitted for each cTOH by considering disease status as the outcome and the cTOH as the predictor. Other covariates included in the model were age, sex and smoking status. P-values were obtained by Wald tests, and ORs (95% CI) were calculated from the coefficient estimates of the fitted logistic model. To detect interactions between cTOH and smoking status, and between cTOH and age, a logistic model with two extra interaction terms was fitted for each cTOH. The P-value of interaction was obtained by F-test. To minimize the chance of false-positive findings, cTOHs were considered statistically significant if p<0.01 [24]. Furthermore, the q-value approach [25], which is based on the concept of the false discovery rate, was used as an exploratory guide to which of the called variants can be investigated further.
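To make the Wald test and OR calculation concrete, the sketch below covers the simplest case only: a single binary predictor with no covariates, where the logistic-regression OR reduces to the cross-product ratio of the 2x2 table and the standard error of the log OR is sqrt(1/a + 1/b + 1/c + 1/d). The models in this study additionally adjusted for age, sex and smoking status, which this sketch does not do; the function name is hypothetical.

```python
import math


def odds_ratio_wald(a, b, c, d):
    """Unadjusted OR, 95% CI and two-sided Wald p-value from a 2x2 table.

    a: cases with the cTOH,    b: cases without it,
    c: controls with the cTOH, d: controls without it (all counts > 0).
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = log_or / se                        # Wald statistic
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail probability
    ci = (math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se))
    return math.exp(log_or), ci, p
```

A call such as `odds_ratio_wald(20, 80, 10, 90)` (invented counts) returns the OR, its 95% confidence interval and the p-value; a cTOH would be flagged under the criterion above when p<0.01.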
Investigating allelically-matched groupings within a cTOH (aTOH). As noted above, a cTOH is operationally defined by a minimum number of homozygous loci and a minimum number of subjects sharing the cTOH, but not by qualitative matching of nucleotides. Within each cTOH, TOH segments were therefore compared pair-wise, and an allelic match was declared if at least 0.95 of jointly non-missing, jointly homozygous sites were identical. These allelically matching groups of TOHs within a cTOH are termed 'allelic' TOHs (aTOHs). The characterization and scanning of these aTOHs was performed using our customized software cag-TOH (unpublished software), similar to the allelic-matching procedure in PLINK [26].
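The pair-wise matching rule can be expressed compactly. The sketch below is a hedged illustration of the 0.95 criterion stated above, not the cag-TOH or PLINK implementation; the two-letter genotype encoding and the function name are assumptions made for the example.

```python
def allelic_match(seg_a, seg_b, threshold=0.95):
    """True if at least `threshold` of jointly non-missing, jointly
    homozygous sites carry the same genotype in both segments.

    Segments are equal-length lists of genotype strings such as "AA" or
    "AG", with None for a missing call (assumed encoding).
    """
    def homozygous(g):
        return g is not None and len(g) == 2 and g[0] == g[1]

    # Sites that are called and homozygous in both segments
    joint = [(x, y) for x, y in zip(seg_a, seg_b) if homozygous(x) and homozygous(y)]
    if not joint:
        return False
    identical = sum(1 for x, y in joint if x == y)
    return identical / len(joint) >= threshold
```

Segments that match one another under this rule would then be grouped into an aTOH; the grouping step itself is not shown.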
Detection of aTOHs associated with lung cancer cases. The aTOH as a genomic variant was then used for association analysis within a case-control framework. To retain the power of the statistical analysis, we focused only on the aTOHs that are present in at least 5 cases and 5 controls. For each aTOH, we applied a logistic model with disease status as the outcome and aTOH as a predictor, with age, sex and smoking status as covariates. Similar to the cTOHs above, aTOHs with p<0.01 by Wald test were declared significantly associated with lung cancer. We also applied the q-value approach [25].

Integrating Genetic Information from Significant c/aTOH Regions with Publicly Available Expression Array Dataset
Data were obtained from a publicly available [27] gene-expression dataset of 107 fresh frozen tissue samples of AC (58 tumor and 49 non-tumor tissues from 20 never smokers, 26 former smokers, and 28 current smokers) downloaded from the Gene Expression Omnibus (GSE10072), from the Environment And Genetics in Lung cancer Etiology (EAGLE) study (http://dceg.cancer.gov/eagle). The criteria used to select this particular array dataset provide not only minimal bias but also physiologically relevant data. We followed the universal standard that specific selection criteria and QC measures are in place before using publicly available datasets (e.g. expression array) for cross-platform integration purposes. Therefore, we ensured that the lung cancers in the expression array datasets belong to patients who are similar to those patients who were genotyped and subjected to TOH analysis. For example, patients utilized in both expression array and TOH analysis represent two different subsets of one much larger study cohort. This by itself is a major strength of this cross-platform integration process because patients in the two datasets were subjected to the same inclusion/selection criteria; these individuals have been exposed to similar environmental or treatment conditions; most importantly, the ancestral background of the "expression array dataset" patients was similar to that of those who were genotyped for TOH analysis; and the patients are of the same age range, i.e. 55-60 yrs. After QC, we normalized the expression profiles of the samples using the Robust Multichip Average (RMA) method, similar to how the same expression array data were originally processed [28]. The raw probes were mapped to their corresponding genes, and multiple probes corresponding to the same gene were averaged. The significant cTOH regions were first extended 250 kb in each direction, and genes within these regions were identified (259 genes). 
The number of genes included in the region increases linearly as the flanking regions are extended, but is also dependent on the region being interrogated (i.e., whether it is a gene-rich or gene-poor region). If the extension had returned >1000 genes (which we did not observe in our analyses here), we would have simply used LD to capture the block of cTOH or aTOH. Of the 259 cTOH genes, 153 were represented on the expression array. Subsequently, we evaluated on an a priori basis differences in expression profiles of these 153 genes using individual univariate logistic regression with Bonferroni correction applied for statistical significance calculations (data not shown). Expression profiles of the genes that were significant in univariate analysis (p<0.01) and within the +/-250 kb region of a c/aTOH were subjected to unsupervised hierarchical clustering [29] using Matlab®.
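The region-extension step can be sketched as a simple interval-overlap query. This is a minimal illustration under stated assumptions: regions and genes are given as base-pair coordinates on named chromosomes, a gene is counted if it overlaps the extended region at all, and the function name and the gene records in any example call are hypothetical rather than taken from the study.

```python
def genes_near_region(region, genes, flank=250_000):
    """Names of genes overlapping a region extended by `flank` bp on each side.

    region: (chrom, start, end)
    genes:  iterable of (name, chrom, start, end)
    """
    chrom, start, end = region
    lo, hi = max(0, start - flank), end + flank  # extended window
    return [
        name
        for name, g_chrom, g_start, g_end in genes
        if g_chrom == chrom and g_start <= hi and g_end >= lo
    ]
```

Applying this to each significant cTOH or aTOH and taking the union of the returned names would give the candidate gene list that is then intersected with the genes represented on the expression array.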

Prioritization of Candidate Genes
After integrating significant c/aTOH regions with the expression array dataset, we determined the risk associated with differential expression of genes within c/aTOHs, stratified by smoking status. Genes that showed differential expression profiles significant at p<0.0001 in the ever- and never-smoking strata were then subjected to a text-mining approach to help filter relevant information generated from genomic, transcriptomic, and proteomic investigations available in the PubMed literature database. This information was then used to identify relationship networks between the genes, their transcripts, their proteins and other lung cancer-relevant biological processes or pathways [30][31][32].

Identification of Specific Common Tracts of Homozygosity (cTOH) in Individuals with Lung Cancer in the PLCO Cohort
To address our central hypothesis that specific germline TOHs are either over- or under-represented in lung-cancer cases over ancestry-matched controls, we initially screened for TOH regions in the PLCO dataset (schema in Figure 1). We found a total of 91,460 TOHs across all samples, with 44,725 TOHs in cases and 46,735 TOHs in controls. The average length of TOHs was 886 kb (median = 677.4 kb, 1st quartile = 484.8 kb, 3rd quartile = 956.3 kb) and the average number of SNPs within each TOH was 141.4 (median = 121, 1st quartile = 108, 3rd quartile = 145). A total of 890 cTOHs were identified across the genome, ranging in size 141.6-3421 kb (mean = 2144 kb, SD = 3115.6 kb, median = 1064 kb, 1st quartile = 623.9 kb, 3rd quartile = 2144 kb) and SNP-count of 100-413 (mean = 375, SD = 418, median = 215).
By considering each cTOH as a genomic variant, we performed a case-control analysis adjusting for the effects of age, sex and smoking status. Seven cTOH regions were found to be significantly differentially represented between lung-cancer (LC) cases and controls based on p<0.01 (  Figure 2 A), cTOH3 is approximately 2-fold (OR = 1.8) over-represented in non-smoking cases over non-smoking controls, whereas cTOH3 is significantly under-represented in ever-smoking cases over ever-smoking controls [OR 0.78 (previous smokers) and 0.34 (current smokers), respectively, p = 0.009-0.026] (Table S3 B).

Identification of Allelically-Matching Groups (aTOH) within cTOHs in Lung-Cancer Cases and Controls
The aTOHs may provide genetic background or ancestry-related information, and hence a biologically meaningful association with the lung-cancer phenotype. The number of aTOHs in each cTOH ranges from 1 to 111. We conducted a case-control analysis, independent of the cTOHs identified, adjusting for the effects of age, sex and smoking status on the lung-cancer phenotype. In this manner, we identified 5 aTOHs (within 2p16.3-2p16.1, 3p25.3, 5q11.2-12.1, 7q21.11 and 13q31.1-31.3) that are significantly differentially represented between cases and controls (based on p<0.01; Table 2). Notably, only aTOH1 with OR of 0.5 (Table 2) Table 2).

Functional Genomic Validation by Integration of Significant cTOH and aTOH Regions with Global Transcriptome Datasets
We next looked for biologically plausible genes, i.e., one or a subset of all genes, located within and in proximity (+/-250 kb) to significant c/aTOHs, that may be germane to lung cancer risk. To fine-map the TOHs containing lung-cancer-related genes and to functionally validate our genomic data, we integrated our significant TOH regions with gene expression data derived from lung cancer patients in the EAGLE study [27] (Figure 1). This dataset was derived from a population of European ancestry (selection criteria described in the Methods section) and also served as our functional validation series. We were able to filter the genes within the significant c/aTOH regions down to 46 genes based on differential expression in univariate analysis alone (Figures 1 and 4). With further risk analyses and integration with known organ-specific function and signaling pathway roles, we ended up with a final shortlist of 9 most-plausible lung cancer-risk genes and one candidate genomic region (p<0.0001; Table 3 and Figure 1; see Discussion). We particularly examined the association between the TOHs harboring these 9 genes and smoking status. Relatedly, the 9 differentially expressed genes within the 6 cTOH/aTOHs are germane in ever-smokers, compared to 3 that are germane in both ever- and never-smokers (p<0.0001; Table 3). One important exception is SPTBN1 and RTN4 within cTOH3/aTOH1 (2p16.3-16.1), where over-expression occurs almost exclusively in controls relative to lung-cancer cases, irrespective of smoking status (OR = 0.000 and 0.08, p<0.0001; Table 3, Figures 2A and 4). ACYP2 (OR = 0.08, p<0.0001), also within this TOH, is under-expressed in ever-smokers, associated with decreased lung-cancer risk, but its differential expression is not germane in never-smokers (Table 3, Figures 2A and 4). Overall, unique differential expression signatures were observed for gene groups within a/cTOHs, as shown in Table 3 and Figure 4. Analysis of expression profiles of genes in other aTOHs, eg, CD36 in aTOH4 (7q21.11), showed under-expression in cases amongst ever-smokers (p<0.0001; Table 3 and Figure 2B).
Figure 1. Study schema for the identification and functional genomic validation of significant cTOH and aTOH regions that are over- or under-represented in lung cancer cases. The schema represents the framework used to identify and subsequently integrate significant cTOHs and aTOHs (from the PLCO lung cancer screening trial) with global transcriptome datasets comparing lung cancers to normal lungs (from the EAGLE lung cancer screening trial). Multiple differentially expressed genes within the cTOHs and aTOHs had their candidacy prioritized initially based on statistical significance followed by biological plausibility (eg, relevant mouse models, reported to be somatically altered in sporadic lung cancers, relevant signaling pathways, etc) to finally obtain 9 "most plausible" candidate genes and one candidate genomic region. The latter is so designated because it was independently derived (by this current study) and subsequently found to overlap with the region previously identified in 3 previous studies as associated with lung cancer risk. doi:10.1371/journal.pone.0031975.g001
Expression profiles of the genes located in other significant cTOHs, cTOH1, 2, 5 and 7 (1p13.2, 1p12, 5p15.31 and 9p22.3, respectively; Table 1), were analyzed. OLFML3 (1p12; Figure 3A) was under-expressed in ever-smoking cases compared to never-smoking cases, consistent with a reduced risk as portrayed by ORs<1 (Table 3 and Figure 4). In contrast, WDR3 (on 1p12; Figure 3B) showed significant relative over-expression irrespective of smoking status, consistent with the TOH-relevant OR>1 (Table 3 and Figure 4). FASTKD3 (on 5p15.31; Figure 3C) showed significant relative over-expression in ever-smoking lung-cancer cases compared to never-smoking cases, consistent with the TOH-relevant OR>1 (Table 3 and Figure 4). PSIP1 (on 9p22.3; Figure 3D) was significantly under-expressed in both ever- and never-smoking cases, OR<1 (Table 3 and Figure 4). In general, we observed unique and similar expression signatures for specific gene sets (Table 3, Figure 4). For example, we observed net under-expression of a gene set within cTOH3 in lung cancer cases who are smokers (OR<1) [Table 3, Figure 4].

Discussion
Identifying risk factors, whether genetic or environmental, for malignancies, including lung cancer, is a starting point for early diagnosis and for tailoring heightened surveillance and prevention. The common variant-common cancer hypothesis prevalent in the last decade led to GWAS yielding common SNPs within 15q25, 5p15, and 6p21 associated with lung cancer [5][6][7][8][9][10], accounting for ~3% of all lung cancers. Based on the working hypothesis that other genomic factors predisposing to or lowering lung cancer risk must exist, we performed a genome-wide case-control analysis for long TOHs, each of which harbors one to several lung-cancer-predisposing or protective loci (most likely of low to moderate penetrance). We identified 7 cTOHs and 5 aTOHs that are significantly over- or under-represented in lung cancer cases versus controls, after adjusting for age, gender and smoking status. Interestingly, we found specific cTOH/aTOHs associated with cases over controls independent of these covariates, with others dependent on smoking status. Importantly, our identified significant cTOH and aTOH regions have been functionally validated by integrating differential expression of specific genes residing in these critical intervals, genes previously shown to play at least a somatic role in sporadic human lung carcinomas or in murine models, and/or to participate in neoplasia-associated signaling pathways (Table S4). We believe that agnostically searching for cTOHs and aTOHs and then integrating them with expression data are powerful methods for finding, and at the same time functionally genomically validating, new lung cancer-risk regions and genes. Three previous lung cancer GWAS studies have identified the 5p15 region to be associated with lung cancer cases [4][5][6][7][8][9][10]. cTOH5 lies within 5p15.31 (our "candidate genomic region" after integrative analysis) and is 11-fold over-represented in ever-smoking lung cancer cases and 3-fold in never-smoking lung cancer cases. 
This serves as a strong positive control. We have also identified a new candidate gene, FASTKD3, beyond those previously postulated, by integration of expression with significant TOH in this region (Table S4).
We found only one TOH region where a significant aTOH lies within its parent cTOH: aTOH1 (2p16.3-16.1) and its parent cTOH3, whose presence appears to confer a protective effect against lung cancer in ever-smoking cases (OR<0.7, ie, over-represented in controls versus cases; Tables 1 and 2). Differential expression of a group of genes in this region seems to be equally protective against lung cancer irrespective of smoking status or history (Table 3, Figure 4). Eg, SPTBN1 codes for a beta-spectrin which plays a role in decreasing cell surface recruitment of CD45 and CD3, and abrogating T-cell function [33]. Accordingly, increased SPTBN1 expression (and over-representation of aTOH1/cTOH3 in controls over cases) could plausibly protect against lung cancer by increasing immune surveillance, given that smoking is known to suppress the CD4/CD8 T cell ratio [34]. While there is plausible existing evidence that under-expression of genes within cTOH3/aTOH1 (Table 3, Table S4, Figure 4) would be protective through various mechanisms [35], we do not know what undiscovered mechanisms result in further mitigation of smoking-associated lung-cancer risk. Unlike the other genes in aTOH1/cTOH3, MTIF2 over-expression is associated with its TOH differentially associated with cases and controls. MTIF2 is a mitochondrial translation-initiation factor that partners with RNaseL. In vitro over-expression of MTIF2 stabilizes mitochondrial RNA, inhibits apoptosis induced by interferon-alpha and partially reverses alpha-interferon-induced cell growth inhibition [36]. Thus, this mechanism lends plausibility to the high OR in lung cancer cases over controls and in ever-smoking cases over never-smoking cases (OR = 18.75 vs 8.25; Table 3, Table S4, Figure 4). Amongst the other "most plausible" risk genes worthy of mention are CD36 (aTOH4) and PSIP1 (cTOH7), both of which have plausible roles in lung function development, and genetic alterations (especially in the oncogene PSIP1) may well lead to NSCLC development (detailed in Table S4). Major vault protein (MVP), implicated in the regulation of cellular signaling cascades and multidrug resistance, has been shown to interact with the IFN-gamma-regulated gene CD36 in the H65 lung cancer cell model [37]. Hence, CD36 may be important in the development of lung cancer. OLFML3 (cTOH1) and WDR3 (cTOH2) have been implicated in apoptosis and cell cycle regulation in cancer cells (Table S4). The two genes may be important in the regulation of lung carcinoma cell proliferation.
Figure 3. Lung cancer-associated cTOH1, cTOH2, cTOH5 and cTOH7 regions. Single SNP association analysis was performed (independently of TOH analysis), after which the SNP association was compared to significant TOHs obtained with TOH analysis. The significant lung cancer-associated single SNPs and TOHs, namely cTOH1, cTOH2, cTOH5, and cTOH7, and their respective 95% CI are shown. The significant lung cancer association of aTOHs and SNPs in the region (top panel) and corresponding risk as odds ratios (lower panel) are shown in panels A-D. Below the lower panels are candidate genes which were prioritized after testing for association between lung cancer and differential expression of each of the genes within each significant TOH (+/-250 kb), stratified by smoking status (at p<0.0001; see Methods section). doi:10.1371/journal.pone.0031975.g003
In this study, differences in expression profiles displayed by genes within significant TOH regions can be interpreted as due to intra-c/aTOH or inter-c/aTOH composition and cross-talk, together with environmental influences (Table 3, Figures 2 and 3). These genes likely work in conjunction with each other to either magnify or suppress the risk of lung cancer, more so in the presence or absence of tobacco (Tables 1, 2 and 3). A gene-set effect (intra-gene-gene signature) within a TOH is most likely controlled by the direction of the OR and whether the genes are over- or under-expressed. Eg, cTOH3 with OR<0.7 will most likely have a gene set that in combination relays a net protective effect in lung cancer cases who are exposed to tobacco (Tables 1, 2 and 3). Systematically integrating differential expression of specific genes residing in critical intervals such as tracts of homozygosity has revealed new candidate lung cancer-risk genes, as well as genes previously shown to play a somatic role in sporadic human lung carcinomas or in murine models, and/or to participate in neoplasia-associated signaling pathways. This regional approach of systems integration with identification of regional subsets of genes can complement classical analyses which only consider single genes represented by GWAS-associated SNP risk.

Supporting Information
Table S1 Effects of age, gender and smoking status on lung cancer risk. The table shows the effects of age, gender and smoking status on lung cancer in a PLCO cohort. A logistic regression model was used to obtain an adjusted odds ratio with a 95% confidence interval. (DOC)
Table S2 Covariate-adjusted significant (p<0.01) association of specific common tracts of homozygosity (cTOHs) with lung cancer. This table shows the effect of a cTOH region on the risk of lung cancer after adjusting for age, sex and smoking status in a logistic model. The adjusted odds ratio (OR), with its 95% confidence interval, of the cTOH region associated with lung cancer is also shown. (DOC)

Table S3
A. Interaction coefficient estimates of cTOH3 and age, and of cTOH3 and smoking status on lung cancer. The result in this table is from an interaction analysis between smoking status and cTOH3. The coefficient estimate, its estimated standard error and the p-value (significant at p<0.05) from the logistic model are shown.
B. Risk associated with interaction between cTOH3 and smoking status on lung cancer. This table shows the risk associated with interaction between smoking status and cTOH3. The odds ratio was calculated using male, age ≤59, nonsmoking and absence of cTOH3 as the reference. (DOC)
Table S4 Summary of candidate genes putatively involved in lung cancer predisposition. This is a summary of the final candidate gene list within specific significant TOH (i.e. cTOH or aTOH) regions. The two columns on the right of the candidate genes show previous experiments and animal models. This validates the putative roles of the selected candidate genes in lung cancer. (DOC)
Table 3. Risk associated with differential expression of genes within aTOHs and cTOHs stratified by smoking status.