Gene-set meta-analysis of lung cancer identifies pathway related to systemic lupus erythematosus

Introduction Gene-set analysis (GSA) is an approach using the results of single-marker genome-wide association studies when investigating pathways as a whole with respect to the genetic basis of a disease. Methods We performed a meta-analysis of seven GSAs for lung cancer, applying the method META-GSA. Overall, the information taken from 11,365 cases and 22,505 controls from within the TRICL/ILCCO consortia was used to investigate a total of 234 pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. Results META-GSA reveals the systemic lupus erythematosus KEGG pathway hsa05322, driven by the gene region 6p21-22, as also implicated in lung cancer (p = 0.0306). This gene region is known to be associated with squamous cell lung carcinoma. The most important genes driving the significance of this pathway belong to the genomic areas HIST1-H4L, -1BN, -2BN, -H2AK, -H4K and C2/C4A/C4B. Within these areas, the markers most significantly associated with LC are rs13194781 (located within HIST12BN) and rs1270942 (located between C2 and C4A). Conclusions We have discovered a pathway currently marked as specific to systemic lupus erythematosus as being significantly implicated in lung cancer. The gene region 6p21-22 in this pathway appears to be more extensively associated with lung cancer than previously assumed. Given wide-stretched linkage disequilibrium to the area APOM/BAG6/MSH5, there is currently simply not enough information or evidence to conclude whether the potential pleiotropy of lung cancer and systemic lupus erythematosus is spurious, biological, or mediated. Further research into this pathway and gene region will be necessary.


Introduction
Since the beginning of the 20 th century, lung cancer (LC) occurrence has been increasing rapidly and has become the most common cancer in males. It is the main cause of cancer-related death worldwide [1] and tobacco smoke is its major risk factor. The risk of developing LC in current smokers is 7.6 to 9.3 times higher compared to that of never smokers [2]. However, around every fourth LC case is not attributable to smoking [3]. A five-fold increased risk of developing early-onset LC in the presence of a family history of early-onset LC in any firstdegree relatives has also been observed [4,5]. This and other evidence has led to the general acceptance that a genetic component in early-onset LC development exists. However, an increased risk of developing LC has also been observed in patients with other disease, such as COPD, pneumonia, tuberculosis, or the autoimmune disorder systemic lupus erythematosus (SLE) [6,7]. In the case of patients with SLE, an increased relative risk (RR) of developing LC was observed as being 1.68 (95%-CI: 1-33-2.13) [6]. In spite of multiform clinical manifestations and outcomes, it is generally accepted that genetics plays a role in SLE [8]. In light of the results of this investigation, we will discuss a shared genetic susceptibility as a possible connection between SLE and LC.
The usual approach to identify such molecular mechanisms with GWAS is primarily to investigate single-marker-association and then allocate these markers to genes and finally the genes to pathways. Doing so, either the marginal effect of a single marker and/or the sample size needs to be large, because a low genome-wide level of significance of 1 x 10 −7 or smaller is needed owing to multiple testing. Gene-set analysis (GSA) strategies were proposed as complementary approaches in the investigation of the genetic basis of a disease using GWAS results [24][25][26], by seeking to identify sets of genes (GS) with sufficient enrichment of marker-specific significance for an association with a phenotype.
GSA approaches provide no effect estimates of the association, but only p-values (p GS ). To pool the p GS -values of several GSAs, it is important to take into account the concordance across studies of all single-marker-association point estimates related to every gene in a considered gene set [27]. However, one only needs to correct for multiple testing using the lower number of GSs being investigated instead of the larger number of genotyped markers. Once a GS has been found to be significantly associated, a search may be conducted for the genes that drive its significance and for the hosted markers which are concordant across studies based on their observed associations.
Here we aimed to identify pathways taken from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [28] as being associated with LC. KEGG provides a collection of manually drawn pathway maps representing an up-to-date knowledge on the molecular interaction and reaction networks. This includes pathways for metabolisms (e.g. nicotinate and nicotinamide metabolism), for genetic information processing (e.g. DNA repair), for environmental information processing (e.g. Wnt signaling), for cellular processes (e.g. cell cycle), for organismal systems (e.g. circadian rhythm) and last but not least for human diseases (e.g. LC or SLE) [29]. We refrained from restricting the KEEG collection, because pathways that are potentially involved in the etiology of LC (examples are given above in brackets) are contained in every upper mentioned category.
Our subsequent goal was to determine the driving genes in the pathways identified in the first step. To this end, we combined the results of seven LC GWASs from the Transdisciplinary Research in Cancer of the Lung / International Lung Cancer Consortium (TRICL / ILCCO) in a meta-analysis.

Description of studies
The meta-analysis was based on summary data from seven previously reported LC GWASs form TRICL / ILCCO (Fig 1). We included 11,365 LC cases and 22,505 controls of European descent in the analysis. An overview as well as study name abbreviations are given in Table 1. Details and references are provided Supplement S1 File.

Strategy and methods
In the original GWASs, a log-additive mode of inheritance was fitted for each marker, adjusting for age, sex, smoking status, study center (if applicable), and the first three principal components to account for hidden genomic structure. The results of marker-by-marker association testing were used as input information for the GSAs.
For this meta-analysis, we set up a two-phase seamless design consisting of a screening phase and a replication phase. In the screening phase, the results of MDACC, TORONTO, GLC, and CE were combined, because GSA of these studies was performed for 234 KEGG pathways previously [30,31]. In the replication phase, the results of the remaining studies NCI, deCODE, and HARVARD were combined to investigate only those pathways whose findings in the screening phase proved promising. If necessary, GSA was performed using the program ALIGATOR [32]. The method META-GSA [27] was performed to pool GSA results (p-values p GS,s ) at each stage. The aim of META-GSA is to increase statistical evidence by pooling the p-values p GS,s of GSAs, taking also into account the concordance of the signs of singlemarker-association point estimates and related p-values of all markers (p m,s ) assigned to genes contained in the GS [27]. The core element of this approach is a directed p-value (PDR), combining significance and direction of single markers and LD to other markers. Necessary estimates of LD were based on the genotype data of GLC, with imputation of missing markers based on the 1000-Genome Project [33], the 1000-GenomePilot 1-Panel or the HapMap3-Panel as available using the SNAP online tool [34].
The SNP-to-gene annotation (StG) for humans of the ENSEMBL database [35] was used. Markers with LD of at least r 2 !0.8 to any marker inside a gene were additionally assigned to that gene [36]. All genes were then annotated to 234 gene sets from the KEGG database (geneto-pathway annotation (GtP)).
Both phases can be considered as the first and the second stage of a seamless, adaptive study with interim selection of gene sets ("drop-loser design" [37]). The investigation of every KEGG pathway with a pooled p scr. < β 1 = 1/234 in the screening phase was stopped early for futility. The significance, combining screening and replication phase, was assessed according  to the "method based on the sum of p-values" (MSP) [37,38]. The p-value was then calculated by the equation p GS ¼ b 1 ðp scr: þ p rep: Þ À 0:5b 2 1 . This p GS needs to be corrected for multiple testing by taking into account the total number of 234 pathways. Due to pathway overlap we estimated the number of independent tests t eff according to the lowest slope method (LSM) [39] considering all p scr. -values of the screening phase. Applying a Bonferroni-like correction then yields the final p-value p GS,corr. = min(1,t eff Á p GS ). Furthermore, META-GSA was also applied to all seven studies and all pathways surviving the screening phase to take into account the concordance of single-marker-association point estimates across all considered studies at the same time.
The next step was to identify the main genes driving the significance of gene sets (denoted as p GS -driving genes). Thus we contrasted the mean of PDRs across studies for each gene (PDR g as a measure of concordance) with pooled p-values regarding the gene-level statistics (p gene as measure of significance, calculated according to Fisher's χ 2 -method). To judge these findings adequately, we also calculated PDR g for the known LC-related genes CLTM1L, TERT, CHRNB4, CHRNA3, CHRNA5, MSH5, BAG6, RAD52 and CDKN2B. Within these genes we looked markers with a large mean of PDRs across studies (PDR m ).
Finally, we performed a sub-group meta-analysis for the one identified KEGG pathway according to histological subtype (AdenoLC, SqCLC, SCLC and LCLC), sex, age (older or younger than 50 years), and smoking behavior (current, former, ever and never smokers). During this investigation the region 6p21-22 became of interest. Respective correlation of marker genotypes and gene expression (eQTL) was previously measured in non-neoplastic pulmonary parenchymal samples taken some distance from the primary tumor in LC patients [40]. We used the estimated correlation between every SNP located between 31.6MB and 32.2 MB (all within 6p21-22) and the expression of the genes APOM, BAG6, MSH5 (reported as relevant in LC), C2, C4B, SKIV2L, STK19 (closely located to genes driving the significance in this META-GSA application) and TNXB (reported as relevant for SLE), in total 5,572 estimated correlations. Estimating t eff = 5309 independent tests (by LSM) yields a global threshold for significance of 1x10 -7 .

Association of pathways: Screening and replication phase
Only three of the 234 pathways investigated revealed a p-value lower than the futility threshold and were selected for the replication phase: hsa05322: systemic lupus erythematosus (SLE), hsa00790: folate biosynthesis and hsa04940: type I diabetes mellitus (Table 2). Only for the SLE pathway we were able to achieve a low p-value when combining screening and replication phase and correcting for multiple testing (p GS,corr = 0.0615). Combining all seven studies in a single META-GSA, in order to take the concordance of single-marker-association point estimates of all studies into account adequately, yielded a p GS -value of 0.0306 for this SLE pathway. This indicates sufficient enrichment and satisfactory concordance of marker-specific significance for an association with LC.

Genes driving significance
Four genes of the SLE pathway (HIST1-H4L,-1BN, -H2AK, -H4K) and their close neighbor HIST1H2BN strike out by concordance of marker-specific association (absðPDR g Þ $ 0:8) across studies and a gene-level p gene -value lower than 0.01 (Table 3). All five genes belong to the histone cluster 1 and are closely located within 41 kb of each other on 6p22.1. Weaker concordance was observed for further two less significant genes (p gene -value < 0.05): C4A (PDR g = -0.41) and C2 (PDR g = 0.33).

Markers driving significance
The markers rs13194781, rs1270942 and rs389884 are those with the largest PDR m -values (all >0.7) and the strongest associations with LC (in terms of OR). For rs13194781, which is located within HIST1H2BN (ENSEMBL definition), an OR of 1.23 (p = 0.0032) was estimated. The markers rs1270942 and rs389884 are perfect proxies for each other according to the 1000-Genome Pilot 1-panel [33]. They are closely located upstream of C2 and downstream of C4A, respectively. There is no LD with the first marker rs13194781 (Table 4).

Subgroup meta-analysis
We revealed more evidence for an association of the SLE pathway with AdenoLC (p GS = 0.0030) than for any other histotype. We also found the association to be significant in women (p GS = 0.0112) but not in men (p GS = 0.1453) and in older cases (p GS = 0.0002) but not in younger (p GS = 0.0588). No significant association was observed when stratifying according to smoking behavior (Table 5). Significance within the considered subgroups is driven by same p GS -driving genes of the region 6p22.1-22.2 as in the total sample (C2 and the genes of the histone 1 cluster). Also, most of the more moderate concordant genes that drive significance of hsa05322 in at least one of the considered subgroups are histone-coding genes.

SNP ⨯ eQTL correlation
Both aforementioned SNPs belonging to C2/C4A, rs1270942 and rs389884, are significant correlated with the expression of the gene APOM (p<10 −13 ), which is located about 500 kb away (Fig 2). However, the expression pattern is this region is puzzling, since other markers within C2 (rs537160, rs622871, rs630379) are also correlated with the gene expression in non-neoplastic samples of LC patients of the neighboring gene C4B (not part of the investigated KEGG pathway, although related to SLE). It is also remarkably that the correlation of SNPs belonging to C2/C4A with the expression of C2 is less significant (p~10 −3 ) than with the expression of SKIV2L (p~10 −5 ), which is not related to SLE.

Discussion
We could demonstrate an accumulation of genomic association with LC in the KEGG pathway hsa05322, which comprises genes related to SLE. This suggests some cross-phenotype (CP) association with LC and SLE. The significance was higher in the subgroup of AdenoLC patients than within other histological subtypes and in women compared to men. This fits our expectations in view of women, who predominantly develop AdenoLC, are more often affected with SLE than men [41], who predominantly develop smoking-related SqCLC [1,42]. All p GS -driving genes identified in this meta-analysis are located within or next to the major histocompatibility complex (MHC) on chromosome 6p21-22 (Fig 2), albeit in two separate areas, about 3000 kb apart. The first area comprises the genes of histone cluster I :  HIST1-H4L, -1BN, -2BN, -H2AK, -H4K (the strongest associated marker is rs13194781; OR = 1.23, p = 0.0032). It is well known that a variety of histone related modifications are either related to cancer or to SLE, or to both [8,43]. They play a role e.g. in DNA repair, cell cycle or gene expression [8,44], which by themselves are associated to LC or SLE, respectively [23,45]. Interestingly enough, we detected associations to LC of the DNA signature of histone coding genes, rather than with respect to some kind of epigenetic outcome. The second area comprises the genes C2, C4A, and C4B (the strongest associated markers are rs1270942 and rs389884; OR = 1.27, p = 0.009). It is well established, that reduced gene expression of C2 and C4A can predispose to SLE [46]. This two genes, and perhaps also C4B, are involved in the clearance of apoptotic bodies [8]. This is in turn crucially important for controlling inflammation, which plays a role in the development of LC [3].
However, the identification of disease-relevant genes in the MHC region (6p21-6p22) and far beyond is complicated owing to the strong and extensive LD across both common and rare haplotypes [47]. Hence any observed CP association will probably tag plenty of genes. An association of the gene area APOM/BAG6/MSH5 in the MHC region with LC has previously been reported, which is strongest for SqCLC and AdenoLC [9,13]. The strongest associations with SqCLC in this area was previously reported for the markers rs3117582 (located within BAG6 and APOM; OR = 1.3, p = 4.5×10 −10 ), which was found associated also with SLE (OR = 2.2, p = 4.2×10 -21 ) [48]. This marker is about 220 kB apart but in strong LD with the newly identified markers rs1270942 and rs389884 (located close to C2; Table 4 and Fig 2). More important, a highly significant correlation between markers of the area C2/C4A/C4B with the expression of the gene APOM in non-neoplastic samples taken from LC patients was also recently reported [40] (Fig 2). APOM is involved in lipid transport and is linked with high-density lipoprotein cholesterol in the pathogenesis of emphysema, which is on the other hand considered as associated with LC [49,50]. But other explanations of the observed associations have been given, too; for instant a connection to embryonic lethality with defects in the development of the lung (related to the function of BAG6) or deficits in mismatch excision repair (related to the function of MSH5) [13]. Moreover, the association of MSH5 with SLE was reported as not shared with other autoimmune/inflammatory diseases [51]. Apart from all this, some remarks about the applied method need to be made. The whole approach is an intensive investigation of p-values, which-in the context of this project-are indicators of evidence for or against the rejection of a null-hypothesis of no genetic association. We used the program ALIGATOR to perform GSA, which circumvents bias due to uneven counts of markers per gene as well as genes per gene set [32]. Choosing another algorithm would probably lead to different results [31]. In addition, a p-value can be used to justify the existences of an association; however it is not solely determined by the strength of the observed effect, but also by factors like sample size, the used statistical model and the applied test procedure. Hence we can present significance of our findings but are unable to estimate the part of LC risk that can be attributed to the identified genes or gene sets.

Conclusion
We were able to identify CP risk factors by first pooling results of gene set analyses and looking afterwards for those genes driving the significance of discovered gene sets. In doing so, we have discovered a pathway that is currently marked as specific to SLE as being significantly implicated in LC. The gene region 6p21-22 in this pathway appears to be more extensively associated with lung cancer than previously assumed. Given wide-stretched linkage Lung cancer and systemic lupus erythematosus disequilibrium to the area APOM/BAG6/MSH5, there is currently simply not enough information or evidence to conclude whether the potential pleiotropy of LC and SLE is spurious, biological, or mediated. Further research into this pathway and gene region will be necessary.