Insights into Pancreatic Cancer Etiology from Pathway Analysis of Genome-Wide Association Study Data

Background Pancreatic cancer is the fourth leading cause of cancer death in the U.S. and the etiology of this highly lethal disease has not been well defined. To identify genetic susceptibility factors for pancreatic cancer, we conducted pathway analysis of genome-wide association study (GWAS) data in 3,141 pancreatic cancer patients and 3,367 controls with European ancestry. Methods Using the gene set ridge regression in association studies (GRASS) method, we analyzed 197 pathways identified from the Kyoto Encyclopedia of Genes and Genomes database. We used the logistic kernel machine (LKM) test to identify major contributing genes to each pathway. We conducted functional enrichment analysis of the most significant genes (P<0.01) using the Database for Annotation, Visualization, and Integrated Discovery (DAVID). Results Two pathways were significantly associated with risk of pancreatic cancer after adjusting for multiple comparisons (P<0.00025) and in replication testing: neuroactive ligand-receptor interaction, (Ps<0.00002), and the olfactory transduction pathway (P = 0.0001). LKM test identified four genes that were significantly associated with risk of pancreatic cancer after Bonferroni correction (P<1×10−5): ABO, HNF1A, OR13C4, and SHH. Functional enrichment analysis using DAVID consistently found the G protein-coupled receptor signaling pathway (including both neuroactive ligand-receptor interaction and olfactory transduction pathways) to be the most significant pathway for pancreatic cancer risk in this study population. Conclusion These novel findings provide new perspectives on genetic susceptibility to and molecular mechanisms of pancreatic cancer.


Introduction
Pancreatic cancer is the fourth leading cause of cancer-related death in the United States, accounting for more than 37,660 deaths per year [1]. Because no effective screening test exists for pancreatic cancer, it is important to identify genetic factors that contribute to the development of this cancer. Recent genomewide association studies (GWAS) and post-GWAS analyses have identified chromosome regions containing the ABO, NR5A2, and CLPTM1L-TERT genes [2,3], as well as the HNF1A gene [4], as susceptibility loci for pancreatic cancer. However, single-marker association tests have limited power to identify genes that are genuinely associated with disease status but may not reach a stringent genome-wide significance threshold in GWAS. Thus, many important disease genes may still remain unidentified with this approach. Furthermore, cancer development typically involves dysfunction of multiple functionally related genes acting concordantly in a network or pathways [5]. Thus, pathway analysis of GWAS data, which jointly considers multiple variants in interacting genes and multiple genes in a biological pathway, as a complementary approach to single-marker association tests [6], may have the potential to reveal the polygenic basis of disease susceptibility. Pathway-based GWAS analyses have provided novel insights into the etiology of cancers, such as colon cancer [7], lung cancer [8], and melanoma [9], and other complex diseases, including schizophrenia [10], bipolar disorder [11], and rheumatoid arthritis [12]. A recent study analyzed the GWAS data focusing on 23 selected pathways or groups of genes and identified the pancreas development pathway genes as susceptibility factors for pancreatic cancer [13]. While this data supports the candidate pathway analysis as a useful approach in genetic association study, it is limited by the number of pathways/genes examined, suggesting that a more comprehensive agnostic analysis of all known pathways may have the potential to uncover novel genes that were previously not considered in pancreatic cancer.
Gene set ridge regression in association studies (GRASS) is one of the newly developed pathway-based approaches [7]. In GRASS, principal components analysis (PCA) is used to capture the genetic variation within a gene to reduce the dimensionality of the single-nucleotide polymorphism (SNP) data and regularized logistic regression is performed to assess the association of pathways with disease. In this study, we first used GRASS on GWAS data to assess the association of pathways with pancreatic cancer. Then, we applied the logistic kernel machine (LKM) method to screen the major contributing genes to each pathway [14]. Finally, we conducted functional enrichment analysis of the most significant genes using the Database for Annotation, Visualization, and Integrated Discovery (DAVID) method [15,16]. In this study, the first comprehensive analysis of GWAS data in pancreatic cancer using an agnostic approach, we have identified novel pathways and genes that are significantly associated with the disease risk. These findings may open new avenues of research on the molecular mechanism and etiology of pancreatic cancer.

Study Population and Data Source
The study population included a total of 7,019 individuals: 1,871 cases and 2,026 from PanScan1 including 12 nested casecontrol studies and one hospital-based case control study and 1,528 cases and 1,594 controls from PanScan2 including 6 case-control studies on pancreatic cancer [2,3]. Cases were defined as primary adenocarcinoma of the exocrine pancreas. Controls, which were free of pancreatic cancer at the time of recruitment, were matched to cases according to birth year, sex, and self-reported race/ethnicity. GWAS had been performed at the National Cancer Institute's Core Genotyping Facility using the HumanHap550, HumanHap550-Duo, and Human 610-Quad arrays (all from Illumina, San Diego, CA) [2,3]. The original GWAS data were downloaded from the Database of Genotypes and Phenotypes (dbGaP) [17]. On the basis of International HapMap Project genotype data (phase 3 release #3, NCBI build 36, dbSNP b126, 2010-05-28) for three populations (CEU, JPT/CHB, and YRI) [18] and minor allele frequency (MAF) .5%, we selected 10,155 SNPs with r 2 ,0.004 to use in population structure analysis [19]. A total of 6,508 individuals (3,141 cases and 3,367 controls) with European ancestry (i.e., 0.75-1 similarity to CEU) were selected from the starting study population of 7,019 individuals in the current pathway analysis.

Quality Control
The original GWAS data passed quality control procedures before posted on dbGaP. We pruned the genotype data by further excluding 13,822 SNPs with call rate ,98%, 45,653 SNPs with MAF ,5%, and 38,857 SNPs deviating from Hardy-Weinberg equilibrium (P,0.001), as well as SNPs in the gene desert regions, resulting in 82,881 SNPs in the final analysis from a starting number of 468,111 SNPs.
In order to evaluate the impact of population structure, we made the quantile-quantile (Q-Q) plot and calculated the inflation factor (l) in individuals with European ancestry only. The inflation factor was calculated according to method by de Bakker et al. [20], adjusted for a sample size of 1,000 cases and 1,000 controls using the formula: Where, n case and n control are actual number used to calculate l observed ; 1,000 is the sample size to be corrected. Q-Q plot shows little inflation of test statistics compared with expected distribution (l = 1.03), excluding the possibility of potential population structure between cases and controls.

Pathways and Genes
A total of 214 human biological pathways are listed in Kyoto Encyclopedia of Genes and Genomes (KEGG) [21]. After excluding pathways with ,10 or .500 genes, we analyzed 197 pathways using the GRASS approach [22]. We identified 19,058 Reference Sequence (RefSeq) genes in the GWAS data from the human genome 18 (hg18) database using the University of California Santa Cruz (UCSC) Table Browser data retrieval tool [23]. We tested 5,127 genes in the 197 pathways for association with pancreatic cancer. For each gene region, we included SNPs within 20 kb upstream or downstream of the gene in this study.

Statistical Methods
We used GRASS to test the association of each pathway with pancreatic cancer. Genotype data were coded in an additive model using PLINK version 1.07 [24] with 0 for homozygote common allele, 1 for heterozygote, and 2 for homozygote mutant allele. The GRASS tests the null hypothesis that none of the SNPs in a given pathway was associated with the disease [6]. To avoid undue influence of varying gene and pathway sizes, the GRASS uses normalized gene-level statistics and sample (subject) permu-tations. The details of GRASS have been previously described [7]. Briefly, the method consists of three steps. First, PCA is used to summarize SNPs in each gene as uncorrelated (orthogonal) linear combinations of the original SNPs, called eigenSNPs, accounting for $95% of the genetic variation. The number of resulting eigenSNPs is usually much smaller than that of the original genotyped SNPs and serves as predictors in the regularized logistic regression model. A penalized likelihood function is used to estimate the regression coefficients of the eigenSNPs. Second, a standardized gene-level statistic is calculated according to the regression coefficients of the eigenSNPs. The statistic, analogous to z-statistic, is defined as is the square root of summation of squared regression coefficient for each eigenSNP estimated under the optimal tuning parameter l;m m g andŝ s g , estimated from permutations, are mean and standard deviation ofb b lg under the null hypothesis that gene g is not associated with the disease. Thus, each gene, regardless of its size, contributes equally to the gene set association statistic, as described below. The third step involves calculating gene set (pathway) association statistic (T l ) and p value. T l is the square root of summation of squared standardized gene-level statistics; P value is estimated using is computed from the permutated data and B is the number of permutations. Because of the large number of genes and pathways analyzed, we applied the Bonferroni correction to adjust for multiple comparisons. The significance threshold was P,0.00025 (0.05/197). Due to the intensive computation entailed by GRASS, we adopted a two-stage permutation test procedure, similar to that implemented in PLINK [24]: we first conducted 5,000 permutations to each gene set in this study, and for those gene sets with pvalue less than 0.00025, we increased the number of permutations to 50,000.
We applied the LKM test to assess the association of each gene with pancreatic cancer as previously described [14]. Briefly, this method comprises two steps: forming SNP sets for each gene and testing the association of SNP sets with disease status. The gene database, gene region definition, and genotype coding used here were the same as those in GRASS. The LKM model integrates a regular logistic model with a semi-definite kernel function (a linear kernel was used here) that is specifically designed for genetic data. The variance-components score test of Zhang and Lin [25] was used to test gene-disease association. In this analysis, we tested the associations of 5,127 genes (in the 197 pathways) with pancreatic cancer after adjusting for age (in 10-year groups), sex, study (categorical), and five principal components (quantitative) capturing population structure obtained from a PCA analysis using EIGENSTRAT [26]. P values from the KLM analysis were adjusted for multiple comparisons using the Bonferroni correction. The significance threshold was P,9.75610 26 (0.05/5,127).
Finally, as a complementary approach to the GRASS pathway analysis, we investigated the functional enrichment of the most significant genes in gene-based association tests (P#0.01 in LKM) using the web-accessible bioinformatics tool DAVID [15,16]. The DAVID consists of an integrated biological knowledgebase and analytic tools aimed at systematically extracting biological meaning and over-represented biological functions from large gene or protein lists based on the hypergeometric (Fisher's exact)  test. We used the KEGG, GO and InterPro [27] databases to define the gene sets. In addition, DAVID groups functionally similar gene sets into clusters to reduce the redundant nature of gene functional annotation systems, e.g., the hierarchically organized GO. As a replication effort, we analyzed the data from PanScan1 (1,796 cases and 1,880 controls) and PanScan2 (1,345 cases and 1,487 controls) separately. We also randomly split the entire dataset into two groups and conducted separate analysis in each group. We performed meta-analysis of the P values from individual cohort/group using the Stouffer's z-score method, which has been demonstrated to be efficient in meta-analysis of GWAS [28]. The test statistic for combining p-values from two individual cohorts for a given pathway is computed as z~½W {1 (1{p 1 )zW {1 (1{p 2 )= ffiffi ffi 2 p , where W {1 is inverse of the standard normal cumulative function. The overall meta-analysis P value is calculated as 1{W(z). Finally, we applied the GRASS method to test the two largest significant pathways (as detailed in Results) using the Wellcome Trust Case Control Consortium (WTCCC) GWAS data [29] to empirically evaluate the impact of pathway size and demonstrate the specificity of our results.

Pathways Associated with Pancreatic Cancer
We analyzed 197 pathways with 5,127 genes using the GRASS approach and found that six pathways were significantly associated with pancreatic cancer after the Bonferroni correction (P,2.5610 24 ) ( Table 1). Three pathways were significant at P values ,0.0001: neuroactive ligand-receptor interaction, longterm depression, and maturity onset diabetes of the young (MODY) pathways. Three pathways had a less significant P value of $0.0001 but ,0.00025: the olfactory transduction, Fc epsilon RI signaling, and the vascular smooth muscle contraction pathways. In addition to the above six pathways, the glycerophospholipid metabolism, pancreatic secretion and vascular endothe- lial growth factor (VEGF) signaling pathways were associated with pancreatic cancer at P = 0.0004 (Table S1, available online).

Pathway Replication Results
Two of the six significant pathways, i.e. the olfactory transduction pathway and the neuroactive ligand-receptor interaction pathway showed consistent small P values across the PanScan1 and PanScan2 cohorts, though not both P values were significant after multiple testing corrections likely due to much smaller sample size in each individual cohort and the resulting lower statistical power ( Table 2). The meta-analysis P values for these two pathways (1610 25 and ,1610 25 , respectively) were significant after the Bonferroni correction. The MODY pathway remained significant in PanScan1 (P = 0.00006) but not in PanScan2 (P = 0.91), and the meta-analysis P value was 0.0589. All three remaining significant pathways in the combined GRASS analysis had P values .0.1 in the PanScan1 and PanScan2 cohort ( Table 2). When we randomly split the dataset into two groups, all six pathways had a P value ,0.05 in one of the two groups but not in both; and none of the meta-analysis P values was significant after adjusting for multiple comparisons (Table S2). To investigate if the significance of the olfactory transduction pathway (353 genes and 1,122 eigenSNPs) and the neuroactive ligand receptor interaction pathway (263 genes and 1,374 eigenSNPs) was simply due to their large size, we tested these pathways by applying the GRASS to the WTCCC GWAS data and obtained a P value of 0.5652 and 0.2332 for bipolar disorder and 0.246 and 0.0062 for Crohn's disease, respectively, (each disease had 2,000 cases and 3,000 controls). These results, along with the consistent small P values across PanScan1 and PanScan2, indicate that the significant P value of these two pathways in the GRASS analysis is unlikely due to the pathway size. In addition, to investigate if the pathway results were mainly driven by GWAS top hits reported in PanScan1 and PanScan 2, we removed the gene NR5A2 from the MODY pathway and re-performed GRASS analysis with 50,000 permutations. The P value for the combined dataset, PanScan1 and PanScan2 subset was 0.4610 24 , 0.6610 24 , and 0.88, respectively. The respective P values were 0.6610 24 , 0.6610 24 and 0.91 from the analysis including the NR5A2 gene, suggesting that our pathway analysis unraveled signals independent from Activation of phospholipase C activity by G-protein coupled receptor protein signaling pathway coupled to IP3 second messenger GO:0007200 4 9.77610 23 a DAVID analysis was based on GO, InterPro, and KEGG database. b The group enrichment score is the geometric mean (in -log scale) of member gene sets' P values in a corresponding annotation cluster. c The P value is based on a modified Fisher's exact test in the DAVID system, referring to one-tail Fisher's exact probability value used for gene-enrichment analysis. doi:10.1371/journal.pone.0046887.t004 those by single-SNP analysis. Note that other GWAS top hits such as ABO and TERT1 were not included in any of the 197 pathways.

Major Contributing Genes to Pathways
Applying the LKM method, we identified 365 genes with nominal significance (P,0.05) and 118 genes with P,0.01 for the 197 pathways (Table S3, available online). The major contributing genes to each of the six significant pathways identified by GRASS are listed in Table 1. The major genes contributing to the 197 pathways and to the six significant pathways are presented in Figure 1 and 2, respectively. After we adjusted for multiple comparisons, four genes remained significant (P,9.75610 26 ): the ABO, HNF1A, OR13C4, and SHH genes (Table 3). In addition to these four genes, ABL1, MYC, HNF4G, NR5A2 (a GWAS top hit) and ADPGK had P values ,0.0001.

Functional Enrichment Analysis of Significant Genes
Finally, we conducted functional enrichment analysis, using DAVID, on the set of 118 genes with P,0.01 from LKM analysis. Forty-four clusters were identified on the basis of the KEGG, GO and InterPro categories. The clusters of genes with P,0.01 from DAVID are listed in Table 4 (See Table S4 for detailed list of genes in each cluster). The superfamily of rhodopsin-like G protein-coupled receptors (GPCRs), perception of smell and olfactory transduction was the most significant group of genes on the basis of the InterPro (P = 1.61610 -13 ), GO (P = 1.30610 -7 ) and KEGG (2.38610 23 ) databases, respectively, echoing our findings from GRASS. Genes maintaining the homeostasis process were also over-represented in pancreatic cancer ( Table 4). The biologic relationship map for the top 81 genes (P,0.05 in LKM) of the six significant pathways is shown in Figure 3, which was created with Ingenuity Pathway and Analysis software [30].

Discussion
In this GWAS pathway analysis, we identified two novel pathways, i.e. the neuroactive ligand receptor interaction and olfactory transduction pathways that are significantly associated with pancreatic cancer risk after adjusting for multiple comparisons and in replication testing. These findings were also supported by functional enrichment analysis. We also identified four genes that are significantly associated with pancreatic cancer risk, including three previously implicated genes ABO, HNF1A, and SHH [2][3][4] as well as a novel gene OR13C4. These findings provide new provocative insights into the polygenic basis of pancreatic cancer susceptibility and etiology.
The GPCR protein superfamily of transmembrane receptors accounts for ,4% of the whole human genome and .50% of modern therapeutic targets [31]. Genes of the neuroactive ligandreceptor interaction and olfactory transduction pathways are major components of the GPCRs (Table S4). The neuroactive ligand-receptor interaction pathway remained significant after adjusting for multiple testing in PanScan1 (P = 0.0006) but not in PanScan2 (P = 0.002). However, the P value from meta-analysis was highly significant (P,1610 25 ). The contributing genes to this pathway, e.g. CCKBR, CHRM5, EDNRA, LPAR1, SSTR2/3, and SCTR, have diverse functions in regulating the endocrine and exocrine functions of the pancreas, which are highly relevant to pancreatic cancer [32,33,34,35].
Humans have .700 olfactory receptor (OR) genes (of which $50% are functional) [36]. Genetic variants of the OR genes and dysfunctions of OR signaling have previously been associated with schizophrenia [37], fetal hemoglobin in sickle-cell anemia [38], and proliferation of prostate cancer cells [39]. Although the links between olfactory transduction and pancreatic cancer remain to be elucidated, a previous sequencing analysis of human pancreatic tumors did find many somatic mutations of the OR genes, including seven genes identified in the current analysis: OR13C3, OR13C5, OR10P1, OR1J2, OR4A16, OR51F2, and OR5D13 [40]. Expression of at least two OR genes has been reported in human pancreas tissues [41]. The top two contributing genes to the olfactory transduction pathway, OR13C4 and OR13C3, ranked as the third and fifth most significant genes among the 5,127 genes analyzed in this study. In the replication study, the olfactory transduction pathway remained as one of the top pathways with consistent small P values in PanScan1 and PanScan2 cohort with a significant meta-analysis P value after adjusting for multiple testing. On the other hand, we did not find any association of this pathway with bipolar or Crohn's disease in the WTCCC GWAS data analysis. All these data suggest that the association of olfactory transduction pathway and pancreatic cancer are unlikely to be due to chance. Further replication of this association in other dataset and functional studies of the biological and molecular links between olfactory transduction signaling and pancreatic cancer are warranted. GPCRs are the first gate through which outside signals are transmitted into the cell. High activity of GPCRs may contribute to transduction of outside detrimental signals, such as insulin, glucose, or carcinogens, into a cell and induce a cascade of responses related to carcinogenesis.
In addition to these two pathways, four additional pathways also passed the Bonferroni correction for multiple comparisons, i.e. the MODY, Fc epsilon RI, long-term depression and vascular smooth muscle contraction pathways. However, the MODY pathway was highly significant in PanScan1 but was not significant in PanScan2. The diminished or weaker gene-risk association in PanScan2 has previously been observed for other genes [3,4]. This difference may be related to the fact that PanScan1 was pooled from 12 cohort studies and one case control study while PanScan2 was drawn from eight case control studies. Because of the rapid fatality of pancreatic cancer, case control study may subject to a survival bias if the testing genes are associated with survival. Although meta-analysis did not show a significant P value, this pathway has been identified as the most significant pathway in association with pancreatic cancer in a separate analysis of the PanScan data using two different statistical methods [13]. The MODY genes are an important part of the transcriptional network that regulates pancreas development and differentiation in early life and maintains pancreatic homeostasis in adulthood [42,43,44]. Three MODY genes (HNF1A, HNF4G, and NR5A2) were among the top 10 genes with P values ,0.0001 in LKM analysis. Notably, another two of the top 10 genes, SHH and MYC, are also known to play an essential role in pancreas development [45]. Genes involved in organ development and differentiation may contribute to the ability of tumor cells to proliferate and survive, as well as alter cell plasticity, thus reprogramming cells to a state that can give rise to a tumor. MODY genes may also contribute to pancreatic cancer by modifying the risk of diabetes [46] and obesity [47,48], or by regulating epithelial cell growth and differentiation, lipid metabolism [49], protein fucosylation [50], and inflammation [51].
Fc epsilon RI is a high affinity receptor for IgE, and mast cell activation mediated by Fc epsilon RI is a key event in the allergic inflammatory response. Increasing evidence indicates that inflammation around tumor, including infiltration by mast cells, facilities tumor growth and angiogenesis in pancreatic cancer [52,53]. However, this pathway along with the long-term depression and vascular smooth muscle contraction pathways did not have consistent results in the replication studies. Thus, these data need to be treated with caution.
Compared to findings from the previously reported candidate pathway/gene analysis [13], our findings on pathways that were included in both studies were quite consistent, i.e. a positive finding on the pancreas development (aka MODY) pathway/ genes and the negative findings on the DNA repair, apoptosis, insulin signaling, wnt, notch and hedgehog pathways/genes. This is by far the largest study in pancreatic cancer with the most comprehensive analysis of all biological pathways identified from KEGG using an agnostic approach. Using PCA in GRASS greatly reduced the dimensionality of the GWAS data and increased the probability of singling out useful information. Using the LKM method overcame the influences of positive and negative effects of SNPs and enabled us to identify new genes in addition to replicating the gene regions discovered by previous marginalassociation studies [54]. By performing the GRASS analysis in two independent cohorts, we have shown consistent findings on some of the significant pathways. Further replication of these findings in future additional pancreatic cancer GWAS data is warranted. Overall, the pathway analysis approach with intensive control for false positive findings has a great potential to uncover gene traits that are associated with disease without a priori. Correct use of this tool may open up new avenues of research on the molecular mechanisms of pancreatic cancer and potential targets for the prevention and treatment of this disease.