Pathway Analysis for Genome-Wide Association Study of Lung Cancer in Han Chinese Population

Genome-wide association studies (GWAS) have identified a number of genetic variants associated with lung cancer risk. However, these loci explain only a small fraction of lung cancer hereditability and other variants with weak effect may be lost in the GWAS approach due to the stringent significance level after multiple comparison correction. In this study, in order to identify important pathways involving the lung carcinogenesis, we performed a two-stage pathway analysis in GWAS of lung cancer in Han Chinese using gene set enrichment analysis (GSEA) method. Predefined pathways by BioCarta and KEGG databases were systematically evaluated on Nanjing study (Discovery stage: 1,473 cases and 1,962 controls) and the suggestive pathways were further to be validated in Beijing study (Replication stage: 858 cases and 1,115 controls). We found that four pathways (achPathway, metPathway, At1rPathway and rac1Pathway) were consistently significant in both studies and the P values for combined dataset were 0.012, 0.010, 0.022 and 0.005 respectively. These results were stable after sensitivity analysis based on gene definition and gene overlaps between pathways. These findings may provide new insights into the etiology of lung cancer.


Introduction
Lung cancer is one of the most frequently diagnosed cancers and the leading causes of cancer death globally [1]. In China, the incidence and mortality rates of lung cancer have been increasing rapidly in the last three decades, primarily because of tobacco consumption [2]. However, genetic factors also play an important role in lung carcinogenesis. Over the past several years, genomewide association studies (GWAS) have identified more than 10 loci associated with lung cancer risk with a modest effect for each single nucleotide polymorphism (SNP) [3,4,5,6,7,8]. However, these variants accounted for only a small fraction of hereditability of lung cancer [9,10,11].
Given that gene-gene interactions may contribute to complex diseases, it has been suggested that combining the multiple variants with small effect together based on biological pathways using the GWAS data may tend to detect the joint effects of multiple genes and to highlight the specific pathway aggregated in a certain disease [12]. A large proportion of disease susceptibility genes may be functionally related and/or interact with each other in biological pathways and only a small number of biological pathways may mainly contribute to the etiology of complex disease [13]. Thus, pathway-based approaches have been applied to the GWAS of several complex diseases, and some novel diseasesusceptibility pathways have been revealed [14,15,16,17,18,19,20,21,22,23]. Recently, Chung et al. (2012) [24] evaluated pathways associated with lung cancer risk in subjects collected by American Cancer Society across all U.S. states using a two-stage random forest-based pathway analysis method based on KEGG database (URL: http://www.genome. jp/kegg/pathway.html/), and identified 4 pathways associated with lung cancer including p53 signaling pathway. Meanwhile, Fehringer et al. (2012) [25] performed pathway analysis on lung cancer risk in subjects collected from Central Europe, Toronto, Germany and Texas using four different methods based on Gene Ontology (GO) database (URL: http://www.geneontology.org/), and found that the acetylcholine receptor activity pathway was significantly associated with lung cancer risk using two different approaches. However, none of pathway analyses of lung cancer GWAS are reported in populations of non-European ancestry to date.
Several methods have been proposed for pathway analysis [26], and one of the commonly used method is gene set enrichment analysis (GSEA) [16]. Briefly, three steps are used for pathway analysis in GSEA. First, individual-SNP association analysis is conducted to determine the effect for each SNP. Second, the representative SNP with the lowest P value is mapped to each gene, and all genes are assigned to predefined biological pathways. Finally, all genes are ranked by their significance, and then are to be evaluated whether a particular group of genes is enriched at the top of the ranked list by chance. As a result, a cluster of biological related SNPs which appeared in the top list may be potentially associated with disease as integration.
In a large-scale GWAS of lung cancer in Han Chinese population, we have already validated suggestive SNPs with a P value #1.0610 24 in independent populations and found five new lung cancer risk-related loci with effect size (odds ratio) ranging from 0.84 to 1.35 at a genome-wide significance level [3,4]. To further deeply understand the genetics mechanism of lung cancer and identify the crucial pathway in lung carcinogens, we currently performed a two-stage pathway analysis using GSEA method based on our existing GWAS data in Han Chinese population. In stage 1, we screened all available pathways in Nanjing study using 1,473 cases and 1,962 controls. In stage 2, the pathways with P values #0.05 and FDR #0.50 were validated in Beijing study using 858 cases and 1,115 controls.

Ethics Statement
This collaborative study was approved by the institutional review boards of China Medical University, Tongji Medical College, Fudan University, Nanjing Medical University and Guangzhou Medical College with written informed consent from all participants [27,28,29,30].

Study Participants
The study subjects were from an ongoing two-center GWAS of lung cancer in China, including Nanjing study and Beijing study. The details of population have been described elsewhere [3]. Briefly, there were 1,473 cases and 1,962 controls in Nanjing study, 858 cases and 1,115 controls in Beijing study after quality control. All lung cancer cases were histopathologically or cytologically confirmed by at least two local pathologists. All controls were selected from those receiving routine physical examinations in local hospitals or those participating in the community screening of noncommunicable diseases and frequency-matched for age, gender and geographic regions to each set of the lung cancer cases.

Genotyping and Quality Control
Genotyping was performed using an Affymetrix Genome-Wide Human SNP Array 6.0. A systematic quality control procedure was applied for both SNPs and individuals before pathway analysis [3]. Briefly, SNPs were excluded if they did not map on autosomal chromosomes, had a call rate ,95%, minor allele frequency (MAF) ,0.05, P,1610 25 for Hardy-Weinberg equilibrium (HWE) in combined samples of two studies or P,1610 24 for HWE in either the Nanjing or Beijing study samples. We removed samples with call rate ,95%, ambiguous gender, familial relationships, extreme heterozygosity rate and outliers. Finally, a total of 2,331 cases and 3,077 controls (Nanjing study: 1,473 cases and 1,962 controls; Beijing study: 858 cases and 1,115 controls) with 570,373 SNPs were remained in subsequent pathway analysis.

Pathway Data Construction
We collected pathways from two public resources: KEGG and BioCarta database (URL: http://www.biocarta.com/). Pathways containing genes from 10 to 200 were included in this study. This gene number range was considered appropriate to reduce the multiple-comparison issue and to avoid testing overly narrow or broad functional gene categories [22]. Pathway overlap was defined as the percentage of shared genes to total ones of two pathways [14].

Statistical Analysis
Logistic regression model with adjustment for age, gender, packyear of smoking and the first four principal components derived from EIGENSTRAT 3.0 [31] was used to evaluate the association significance of each SNP using GLM package executed in R software (version 2.14.0; The R Foundation for Statistical Computing). SNPs were assigned to a gene if they located within 50 kb downstream or upstream of the gene. The significance of each gene was derived from the representative SNP. All genes were assigned to pathways. Then the association between lung cancer risk and each pathway was evaluated by GenGen software [16] using the weighted Kolmogorov-Smirnov-like running sum statistic (denoted by enrichment score, ES), which reflected the over-representation of a cluster of genes within this pathway at the top of the entire ranked list of genes in the genome. We randomly shuffled the case-control status for 1,000 times, and repeated these above steps to get the permuted pathway association results. Thus, the normalized ES after adjusted for different sizes of genes, could be acquired via the permutation procedure. Meanwhile, P value of each pathway and the false discovery rate (FDR) to keep the proportion of expected false positive findings were derived. The significance level of pathways analysis was set to be P#0.05 and FDR #0.5.

Genetic Information and Prior Biological Information used in Pathway Analysis
Of the total 570,373 genotyped SNPs, 340,060 SNPs were mapped into 17,225 genes within 50 kb upstream or downstream. Among them, 135,160 SNPs were ultimately assigned to 3,514 genes within the pre-defined pathways (41,560 SNPs to 1,003 genes in BioCarta pathways and 120,864 SNPs to 3,134 genes in KEGG pathways). All genes were assigned to 368 pathways that contained 10 to 200 genes (191 pathways used in BioCarta database, 177 pathways used in KEGG database) in following pathway analysis.
For the achPathway, 9 of 16 ones had SNPs with P values ,0.05; similarly, 15, 14 and 18 significant genes were observed among 28 genes in the At1rPathway, 32 genes in the metPathway and 23 genes in the rac1Pathway, respectively. The representative SNPs with lung cancer risk at P values ,0.01 for each gene in these 4 pathways shown in Table 2.

Sensitivity Analysis of Pathway-based Association
In order to evaluate the influence of SNP-to-gene mapping strategy on pathway analysis, we further mapped SNPs to genes within 20 kb upstream or downstream. Twenty-three pathways were significant in Nanjing study, and 7 of them were replicated in Beijing study. After combining two studies together, three pathways (achPathway, metPathway and rac1Pathway) were still significant (0.019, 0.005 and 0.004 respectively) ( Table S2 in File S1). These three pathways were also identified in the initial analysis approach. In addition, At1rPathway that was identified in the initial analysis was also significant for combined dataset (P = 0.015) though the P value was 0.086 in Nanjing study. For the significant pathways observed in Nanjing study, 19 pathways were detected in both approaches using '50 kb' rule and '20 kb' rule and the concordance rate was 73.08% (19/26) between two approaches.
Considering the overlap of genes between pathways, we calculated the pair-wise overlaps between 4 identified pathways and found the overlap rates ranged from 6.25% to 25.00% ( Table  S3 in File S1). Since only significant genes in pathways contributed to the final results, we re-calculated the overlaps between these pathways using their significant genes, and the overlap rates were from 4.17% to 18.75% ( Table S3 in File S1). Furthermore, four genes (PAK1, PIK3R1, PTK2, and PTK2B) that contributed to more than 2 pathways ( Table S4 in File S1), were further removed from pathways for sensitivity analysis. As a result, the four identified pathways (achPathway, metPathway, At1rPathway and rac1Pathway) were still significant ( Table S5 in File S1) in combined dataset (P = 0.033, 0.040, 0.039 and 0.012, respectively).

Discussion
GWAS have successfully identified a number of loci associated with diseases/traits, which have greatly improved our understanding on genetic mechanism of these phenotypes. However, as the stringent quality control procedure and strict correction on multiple comparisons are used in GWAS, individual SNPs that are truly associated with phenotypes with modest effect may have been lost. Therefore, a pathway-based approach that evaluates the cumulative contribution of the function-related genes may provide new insights into the biology of a certain disease utilizing GWAS data. GSEA has two major advantages compared with other methods [16,26]. First, it performs two-step permutation-based correction procedure which effectively adjusts for different sizes of genes and preserves correlations of SNPs in the same gene. Second, covariates such as age, gender or population stratification in GWAS can be adjusted in GSEA. Thus, in the current study, we used GSEA and identified four pathways (achPathway, metPathway, At1rPathway and rac1Pathway) that may play an important role in the development of lung cancer in Han Chinese population. These findings were stable after sensitivity analysis when considering the SNP-to-gene mapping approach and gene overlapping between pathways. The achPathway (Role of nicotinic acetylcholine receptors in the regulation of apoptosis) was identified to be the top pathways associated with lung cancer risk in this study. Nicotinic acetylcholine receptors (nAchRs) are essential for neuromuscular signaling and have also been found on non-neuronal cells, such as bronchial epithelial cells and lung cancer cell lines [32,33,34]. Nicotine and its derived carcinogenic nitrosamines may play an important role in the pathogenesis of lung cancer through the binding to nAChRs expressed in lung epithelial cells, which mainly result from the resistance of cancer cells to apoptosis [35]. Maneckjee et al. (1994) showed that low concentrations of nicotine could block the induction of apoptosis in lung cancer cells [33]. In addition to conferring resistance against apoptosis, several studies have shown that nAChRs can induce cell proliferation as well as angiogenesis [36,37], both of which are involved in the genesis of cancer.
Importantly, several GWAS based on Caucasian populations have consistently identified 15q25 as lung cancer susceptibility region [5,6,7], which contains the nicotinic acetylcholine receptor subunit gene cluster, harboring CHRNA5, CHRNA3 and CHRNA4 genes. TERT is included in the achPathway and its representative SNP rs2736100 has been identified as a lung cancer susceptibility locus in different ethnic populations [3,8], especially in Asian populations [38,39,40].
In the At1rPathway (Angiotensin II mediated activation of JNK Pathway via Pyk2 dependent signaling), Clereet. al (2010) proposed that angiotensin II Type 2 receptor (AT2R) would promote tumor development, including both malignant cell proliferation and tumor angiogenesis [41]. Over-expression of angiotensin II type 2 receptor gene induces cell death in lung adenocarcinoma cells [42,43]. The aberrant activated JNK pathway can cause pathological cell death and different diseases including cancer [44], while mutations in the JNK pathway can also be involved in cancer development [45].
For the metPathway (Signaling of Hepatocyte Growth Factor Receptor), the hepatocyte growth factor receptor, also called c-Met, is activated by hepatocyte growth factor (HGF). Aberrant c-Met signaling plays significant roles in the pathogenesis and biology of human cancers [46]. Meanwhile, Mutated and overexpressed forms of c-Met are associated with oncogenesis and metastasis, making c-Met a potential therapeutic target for cancer drugs [47,48]. Interestingly, c-Met and epidermal growth factor receptor (EGFR)inhibitors can synergistically inhibit cell proliferation and promote apoptosis in lung cancer [49]. Yano et al. (2008) proposed that HGF-mediated MET activation can induce gefitinib resistance in lung adenocarcinoma with EGFR-activating mutations [50].
For the rac1Pathway (Rac 1 cell motility signaling pathway), Rac-1 is a small GTP-binding protein in the Rho family that regulates cell motility and proliferation in response to extracellular signals [51]. Meanwhile, Rac-1 can function as oncogenes in fibroblasts when over expressed [52]. Rab5 can induce the activation of Rac through several mechanisms, Rab5-regulated trafficking of Rac is involved in cell motility, which may also influence cell migration during morphogenesis and cancer metastasis [53,54]. Meanwhile, Rac activation by the IRSp53/ Eps8 complex plays an important role in the metastatic behavior of the malignant tumor cell [55].
In addition, we also checked the reported pathways by Chung et al. (2012) and Fehringer et al. (2012) in our study. Interestingly, the strongest association reported by Fehringer et al. (2012) was the acetylcholine receptor activity pathway while the achPathway was identified in the current study. Some studies have demon-strated that the activation of nicotinic acetylcholine receptors can alter apoptotic signaling as well as stimulate proliferation, both of which play an important role in lung carcinogenesis [56,57,58]. The results consistently support the importance of the 2 pathways in the development of lung cancer, which is biologically plausible. However, we did not found significant signals for other reported pathways. This may be explained by ethnic heterogeneity, different pathway analysis methods or definition of pathways from different databases.
This study has several strengths. First, we performed a two-stage pathway analysis in two independent populations, which may reduce the false-positive findings and improve the credibility of the results. Second, the four identified pathways were still stable after sensitivity analysis when considering SNP-to-gene mapping approaches and gene overlapping between pathways. This point as well as the important biology of the identified pathways in lung carcinogenesis has increased our confidence that our findings may be true other than just by chance. Nevertheless, several limitations are also need to be addressed in this study. First, incomplete annotation of human genome may reduce the study power for the pathway analysis, since many genes of unknown function cannot be assigned to known pathways and intergenic SNPs physically faraway from genes were not included yet. Thus, further studies based on improved genome-annotation database may provide additional understandings in genetics of lung cancer. Second, different pathway databases have different guidelines for pathway construction. Thus, the gene content of pathways representing the same biological process may vary substantially between different databases [14]. We only focused on KEGG and BioCarta databases that may have restricted our analysis due to inherent definition of the pathway, though these two databases have been commonly used in pathway analysis [19,22,24]. Finally, it's better to perform gene-smoking interaction analysis or gene-gene interaction analysis to discuss further on the involvement of these four pathways in smoking-related carcinogenesis. We will investigate them in our future studies.
In summary, this study conducted a two-stage pathway analysis in GWAS of lung cancer in Han Chinese using GSEA method, and identified four pathways (achPathway, metPathway, At1r-Pathway and rac1Pathway) associated with lung cancer risk. These findings may be an important supplement for GWAS and provide new insights into biology of lung cancer.

Supporting Information
File S1 Table S1. The rank of pathways based on combined dataset of Nanjing and Beijing studies. Table S2. Sensitivity analysis of pathway analysis for genes defined by SNPs within 20 kb upstream or downstream. Table S3. Gene overlaps between 4 indentified pathways for all genes defined by BioCarta database or genes with significant representative SNPs (P,0.05) a . (a) The bottom-left of the symmetric matrix is the number of overlap genes between pair-wise pathways and their total gene number. The top-right part is the overlap rate between pair-wise pathways (%). Table S4. Genes with significant representative SNPs (P#0.01) contributed to multiple pathways. (a) Derived from logistic regression model with adjustment for age, gender, packyear of smoking and principal components in combined dataset of Nanjing and Beijing studies. Table S5. The results of sensitivity analysis for 4 identified pathway after removing significant overlapping genes (PAK1, PIK3R1, PTK2 and PTK2B). (DOCX)