Genetics of Sputum Gene Expression in Chronic Obstructive Pulmonary Disease

Previous expression quantitative trait loci (eQTL) studies have performed genetic association studies for gene expression, but most of these studies examined lymphoblastoid cell lines from non-diseased individuals. We examined the genetics of gene expression in a relevant disease tissue from chronic obstructive pulmonary disease (COPD) patients to identify functional effects of known susceptibility genes and to find novel disease genes. By combining gene expression profiling on induced sputum samples from 131 COPD cases from the ECLIPSE Study with genomewide single nucleotide polymorphism (SNP) data, we found 4315 significant cis-eQTL SNP-probe set associations (3309 unique SNPs). The 3309 SNPs were tested for association with COPD in a genomewide association study (GWAS) dataset, which included 2940 COPD cases and 1380 controls. Adjusting for 3309 tests (p<1.5e-5), the two SNPs which were significantly associated with COPD were located in two separate genes in a known COPD locus on chromosome 15: CHRNA5 and IREB2. Detailed analysis of chromosome 15 demonstrated additional eQTLs for IREB2 mapping to that gene. eQTL SNPs for CHRNA5 mapped to multiple linkage disequilibrium (LD) bins. The eQTLs for IREB2 and CHRNA5 were not in LD. Seventy-four additional eQTL SNPs were associated with COPD at p<0.01. These were genotyped in two COPD populations, finding replicated associations with a SNP in PSORS1C1, in the HLA-C region on chromosome 6. Integrative analysis of GWAS and gene expression data from relevant tissue from diseased subjects has located potential functional variants in two known COPD genes and has identified a novel COPD susceptibility locus.


Introduction
Gene expression levels in humans are highly heritable [1,2]. Multiple published studies have examined the associations between single nucleotide polymorphism (SNP) variation and microarray gene expression measurements to identify expression Quantitative Trait Loci (eQTLs ), single nucleotide polymorphisms (SNPs) that influence gene expression [3][4][5]. However, most of the published studies have examined gene expression in lymphoblastoid cell lines (LCL) from unphenotyped individuals [3,4], though a recent paper has described eQTLs in peripheral blood CD4+ lymphocytes of patients with asthma [6]. Integrative genomic analyses can provide functional information regarding significant SNPs found through genomewide association studies (GWAS) or identify the key genes within a locus identified through GWAS. For example, genome-wide expression profiling in LCL from children with asthma [5] was used to localize ORMDL3 (ORM1-like 3 (S. cerevisiae) [MIM 610075]) as the likely gene for childhood asthma in the multi-gene chromosome 17q21 locus found through GWAS [7]. However, this study did not determine whether the eQTLs identified were relevant in primary human tissues in asthma. Integrative genomics studies can also be used to implicate novel genes for complex traits, such as the association between MMP20 (matrix metallopeptidase 20 [MIM 604629]) and age related decline in kidney function [8].
Chronic obstructive pulmonary disease (COPD [MIM 606963]), which includes emphysema and chronic bronchitis, is a complex disease with genetic and environmental influences [9]. COPD is a major source of morbidity and mortality in the U.S. and worldwide [10]. Previous GWAS have identified three susceptibility loci for COPD, including HHIP (hedgehog interacting protein [ [11][12][13]. Cough and phlegm production is common among COPD patients, and sputum samples may provide a non-invasive window into pathobiologic processes in the lungs of COPD patients. Therefore, we integrated GWAS data with microarray gene expression profiles from induced sputum samples from well-characterized COPD subjects participating in the Evaluation of COPD Longitudinally to Identify Predictive Surrogate End-points (ECLIPSE) Study [14]. We addressed two hypotheses: (1) eQTL analysis will improve understanding of previously known COPD susceptibility loci, such as chromosome 15q25; and (2) eQTL SNPs can be used to identify novel COPD susceptibility genes. Limiting the search to functional eQTL SNPs can reduce the multiple testing burden found in traditional GWAS. Although eQTL studies have now been performed in several human tissues besides blood, our study represents one of the first integrative genomics analyses performed in affected patients in order to gain insights into a common disease.

Ethics Statement
Study subjects provided written informed consent, and all studies were approved by the Institutional Review Boards at Partners Healthcare and all participating centers.

ECLIPSE Study
ECLIPSE was a three year observational study conducted at 46 centers in 12 countries [14]. ECLIPSE recruited 2083 COPD subjects ages 40-75 with a smoking history of at least 10 packyears (cigarettes smoked per day multiplied by years smoked, divided by 20 to convert to packs), 332 control smokers with at least 10 pack-years smoking history and normal lung function, and 237 non-smoking controls [15]. COPD was defined by GOLD stage 2 or greater (FEV 1 /FVC,0.7 with FEV 1 ,80% predicted) [10]. Genome-wide SNP genotyping was performed on all ECLIPSE subjects using the Illumina HumanHap550 BeadChip. GWAS analysis included 1736 cases COPD cases and 175 controls [11]. Sputum induction was performed on a subset of COPD cases at 14 sites, using a standard protocol [16]. RNA was extracted from sputum cell pellets using TRIzol and amplified with the Nugen Ovation RNA Amplification kit. Gene expression profiling was performed on RNA extracted from sputum samples of 145 COPD cases (all ex-smokers) using the Affymetrix Human U133 Plus2 array [17]. MIAME-compliant array data are available in the Gene Expression Omnibus database (http://www.ncbi.nlm. nih.gov/geo), accession GSE22148. Only Caucasian subjects were included in this analysis.

Other GWAS Populations
Subjects from two additional COPD case-control studies were merged with the ECLIPSE subjects in the combined GWAS analysis and the GWAS meta-analysis [11]. COPD cases and control smokers were Caucasians recruited in Bergen, Norway [18,19]. Cases were defined by GOLD stage 2 or greater COPD; smoking controls had normal lung function. Both cases and controls had smoking history of at least 2.5 pack-years. GWAS included 838 cases and 791 controls, genotyped using the Illumina HumanHap550 BeadChip [12].
The National Emphysema Treatment Trial (NETT) cases have FEV 1 #45% predicted and emphysema on chest CT scan [20,21]. Thus, NETT cases have COPD severity of GOLD Stage 3 or greater. All NETT Genetics Ancillary Study subjects are former smokers; only white subjects are included in this analysis. The Normative Aging Study (NAS) is a cohort study of initially healthy men followed by the Boston VA [22]. To define a control group for comparison to NETT cases, we selected Caucasian subjects meeting the following criteria: FEV 1 .80% predicted, FEV 1 / FVC.90% predicted, at least 10 pack-years of smoking, and an adequate DNA sample [23]. Genomewide SNP genotyping has been performed in the NETT-NAS study (366 cases, 414 controls) using the Illumina 610-Quad BeadChip [11].

Replication Populations
The International COPD Genetics Network (ICGN) was a family-based study of COPD at ten centers in North America and Europe [18,24]. Probands were ages 45-65 with post-bronchodilator FEV 1 ,60% predicted, FEV 1 /VC,90% predicted, a smoking history of at least 5 pack-years, and at least one sibling with $5 pack-year smoking history. Genotyping was performed on Caucasian subjects only ( Table 1).
The Genetic Epidemiology of COPD Study (COPDGene) enrolled COPD cases and control smokers at 21 clinical centers throughout the United States [25]. Subjects are 45-80 years old and have a smoking history of at least 10 pack-years. This analysis included the first 994 non-Hispanic white case and control subjects enrolled in COPDGene (Table 1). In these samples, a set of 75 ancestry informative markers has been previously genotyped and did not show evidence of population stratification [11].
SNP genotyping in ICGN and COPDGene SNPs was done using the iPLEX Gold assay on the Sequenom (San Diego, CA) MassARRAY system [26] or the TaqMan 59 exonuclease assay (Applied Biosystems, Foster City, CA) [27].

Statistical Analysis
A total of 145 COPD subjects had sputum samples with gene expression data available; two arrays failed quality control. Of the remaining 143 subjects, 131 had corresponding genomewide SNP data and phenotype data. The Affymetrix HG-U133 Plus 2 array contains 54,675 probe sets. After filtering out 17,420 probe sets which were not annotated with a specific gene symbol in the hgu133plus2.db R/Bioconductor database or which mapped to the X or Y chromosomes, 37,255 probe sets remained. Microarray preprocessing used the robust multiarray average method and quantile normalization [28], implemented in Bioconductor. QC of microarrays was performed using the Bioconductor package affyQCReport; QC results are available in the Data S1 and in Figure S1. QC of genomewide SNP data in ECLIPSE has been reported [11]. SNPs with minor allele frequency ,0.05 in the 131 ECLIPSE cases were additionally excluded.
In the integrative analysis, each expression probe set was mapped to its corresponding gene and all genotyped SNPs were identified within 50 kb of the transcription start site (TSS). General linear models were used to detect cis-acting associations between probe set expression levels and SNP genotypes, adjusted for age, gender, and the first six genetic ancestry principal components derived from the genotype data on all ECLIPSE COPD cases [29]. False discovery rate adjusted p-value,0.05 defined statistical significance. eQTL analysis utilized the GGTools Bioconductor package [30].
Each significant cis-eQTL SNP was then tested for association with COPD in the combined GWAS dataset from ECLIPSE, Norway, and NETT-NAS [11]. The published combined GWAS analysis was a mega-analysis of individual-level genotype data, using logistic regression, adjusted for age, pack-years of smoking and principal components for genetic ancestry. In the published meta-analysis, stratified logistic regression was performed within each case-control study and results were combined using Z-scores for weighting by the inverse variance. SNPs associated with COPD at p,0.01 in either the combined GWAS analysis (megaanalysis) or the GWAS meta-analysis were genotyped for replication in ICGN and COPDGene. In the COPDGene study, case-control data were analyzed with linear regression models, adjusted for age, sex, and pack-years of smoking, using PLINK version 1.0.7 [31]. Family-based ICGN data were analyzed in PBAT version 3.6.1, adjusted for age, sex, and pack-years of smoking [32].
We also tested for eQTL SNPs influencing the expression of genes in previously identified COPD loci. On chromosome 15q25, we defined a region starting 50 kb centromeric from IREB2 extending 50 kb telomeric from CHRNB4 (approx. 300 kb total) and tested all genotyped SNPs within this region for association with expression levels of probe sets for six genes: IREB2, AGPHD1, PSMA4, CHRNA5, CHRNA3, and CHRNB4. For the other two COPD loci, we expanded the cis-eQTL analysis to all SNPs with 200 kb of the TSS of the genes HHIP and FAM13A.

Sputum eQTL Analysis
Characteristics of the 131 ECLIPSE COPD subjects in the eQTL analysis are shown in Table 1. On average, COPD subjects had a heavy smoking history and severely impaired lung function, similar to the full set of ECLIPSE GWAS cases [11]. The data analysis is outlined in Figure 1. Combining the gene expression data with genomewide SNP data and limiting analysis to potential cis-acting SNPs (within 50 kb of TSS) yielded 562,787 SNP-probe set association tests. Of these, 4315 SNP-probe set associations were significant at FDR-adjusted p,0.05 (corresponding to unadjusted p = 3.8e-4), representing 3309 unique SNPs and 1399 unique probe sets, covering 1086 genes (Table S1).

Sputum eQTLs associated with COPD
We queried the 3309 significant cis-eQTL SNPs in the combined GWAS dataset including ECLIPSE, Norway, and NETT-NAS subjects [11]. Using a strict Bonferroni correction, there were two cis-eQTL SNPs significantly associated with COPD at p,0.05/ 3309 = 1.5e-5 (Table 2). These two SNPs on chromosome 15q25 are located in CHRNA5 and IREB2, genes with known COPD associations. [12,36] At a nominal threshold of p,0.01, there were 64 cis-eQTL SNPs associated with COPD (Table S2). There were 56 eQTL SNPs associated with COPD at p,0.01 in the metaanalysis of the ECLIPSE, Norway, NETT-NAS GWAS studies (Table S3), as opposed to the combined analysis of individual-level genotype data. Merging the 64 SNPs from the combined GWAS analysis and the 56 SNPs from the GWAS meta-analysis left 76 unique SNPs, which were brought to replication analysis.

Replication Studies
Characteristics of the ICGN and COPDGene subjects in the replication analysis are reported in Table 1. The two SNPs in Table 2 were analyzed in previous reports [12,36] and were not retested. Of the remaining 74 SNPs, 69 were successfully genotyped in ICGN and COPDGene. Screening in the larger ICGN study found 8 SNPs with p,0.1 (Table 3). Of these, only one had p,0.1 in COPDGene. SNP rs1265098 was significantly associated with COPD in ICGN and had a trend for significance in COPDGene. The effect direction for rs1265098 was consistent in ICGN, COPDGene, and the combined GWAS; the minor allele was associated with increased COPD risk in all three studies. SNP rs1265098 maps to the gene PSORS1C1 (psoriasis susceptibility 1 candidate 1 [MIM 613525]) on chromosome 6, yet is associated with transcript levels of the neighboring gene PSORS1C3 (p = 8.2e-05, FDR-adjusted p = 0.016) (Figure 2).

Sputum eQTLs in COPD Candidate Loci
Previous GWAS have identified three loci associated with COPD susceptibility: HHIP on chromosome 4q31 [12,13], FAM13A on chromosome 4q22 [11], and a region on chromosome 15q25 encompassing candidate genes CHRNA5, CHRNA3 and IREB2, among others [12,36]. On chromosome 15q25, cis-eQTL associations for IREB2 mapped to that gene (Figure 3a). Genetic regulation of CHRNA5 was more complex. Previous studies have demonstrated cis-acting effects of multiple SNPs on CHRNA5 expression. Saccone et al. defined 4 LD bins surrounding CHRNA5 with varying associations with cigarette smoking, lung cancer, and COPD [37]. Bins 1-3 were represented in our dataset, tagged by SNPs rs1051730, rs938682, and rs6495306, respectively (Table 4). SNPs in bins 1 and 3 were associated with CHRNA5 expression in sputum (Figure 2), as has been demonstrated in brain [38] and lung tissue [39]. SNPs in bin 2 were not eQTLs for CHRNA5. We added additional SNPs to these bins, based on strong LD with tag SNPs in the larger ECLIPSE GWAS dataset. We also identified 3 sets of SNPs (3a, b, c in Table 4) with cis-eQTL associations for CHRNA5 and moderate LD with SNPs in bin 3 (r 2 0.57-0.76). SNPs in bins 1 and 2, but not bin 3, showed evidence of association with COPD in the combined GWAS dataset, though they were not genomewide significant (bin 1: rs1051730, p = 2.8e-6; bin 2: rs938682, p = 5.6e-5; bin 3: rs6495306, p = 0.2). These results suggest that the COPD-associated SNP rs1051730 (bin 1) may influence phenotype by its effect on gene expression, while COPD-associated SNPs in bin 2 (tagged by rs938682) may exert their effect through other mechanisms. SNPs in bin 3, although eQTLs, were not associated with COPD risk.
SNPs in IREB2 were both cis-eQTLs for that gene (Table 4, Figure 2) and were associated with COPD in the combined GWAS (rs13180, p = 5.0e-7). Even though some of the significant eQTL SNPs for CHRNA5 mapped to IREB2 (bin 3a), SNPs in all 3 bins were not in LD with the IREB2 eQTL SNPs (Figure 3b). No SNPs were significantly associated with AGPHD1, PSMA4, or CHRNB4 gene expression. For the other two COPD GWAS loci, HHIP and FAM13A, we found no significant cis-eQTL SNPs

Discussion
In a cohort of well-characterized COPD subjects, we integrated genomewide SNP and gene expression data derived from induced sputum, a biologically-relevant tissue in COPD, to identify a set of eQTL SNPs affecting gene expression levels. The SNPs were then tested for association with the clinical phenotype of COPD; gene expression was not tested for association with disease status in this set of COPD cases only. Using the eQTL results, we implicated two distinct COPD susceptibility genes in a previously identified region of chromosome 15q25. Additionally, we provide evidence for a potential novel COPD susceptibility locus in the HLA region on chromosome 6.
The initial GWAS in COPD found significant associations on chromosome 15q25, with SNPs in the genes CHRNA3 and CHRNA5, encoding two subunits of the nicotinic acetylcholine receptor [12]. This region has also been associated with lung cancer, peripheral arterial disease, and smoking behavior [37,[40][41][42][43], so it is not clear whether these genes have a direct effect on COPD susceptibility, or their effects are at least partially influenced through cigarette smoking, the major environmental risk factor for COPD [44,45]. In terms of genetic regulation of expression of the chromosome 15q25 genes, we found similar eQTL associations with CHRNA5 expression in induced sputum as has been found in brain [38] and lung tissue [39]. We found additional sputum eQTL SNPs for CHRNA5 in moderate LD with previously defined eQTLs. The previous papers on brain and lung tissue gene expression did not report testing IREB2, a gene previously associated with COPD [11,36]. The specific IREB2 SNPs associated in GWAS (rs13180) [11] and in a candidate gene analysis of differentially expressed genes (rs2656069) [36] were in only moderate LD (r 2 = 0.44) with each other, implying independent effects on IREB2 expression. The IREB2 and CHRNA5 eQTL SNPs were not in LD with each other, suggesting the presence of at least two COPD susceptibility genes on chromosome 15q25. Previous studies have similarly used eQTL analyses to add functional information about genes identified through GWAS, including studies of asthma [7], celiac disease [46], and Crohn's disease [47]. However, these prior studies have examined gene expression in blood cells, and not primary disease tissues.
However, we did not finding significant cis-eQTL SNPs for two other known COPD loci, HHIP and FAM13A. The associated SNPs found through GWAS may exert their effects on phenotype via other mechanisms besides influencing gene expression. Alternatively, the GWAS SNPs may actually be eQTLs acting in other tissues besides sputum, such as alveolar or bronchial epithelial cells, which were not assessed in our study.
Besides improving understanding of the COPD susceptibility locus on chromosome 15q25, we identified a potential novel COPD locus on chromosome 6. The SNP maps to gene PSORS1C1, but it is associated with expression levels of the neighboring gene PSORS1C3. Variants in PSORS1C3 have been reported to be associated with psoriasis [48], an immune-mediated skin disease. PSORS1C3 is located in the major histocompatability (MHC) region, and subsequent papers have found that the associations with psoriasis may be due to variants in HLA-C (MIM 142840) [49,50]. Interestingly, one study has reported an epidemiologic association between psoriasis and COPD [51], and cigarette smoking is a risk factor for psoriasis as well [52]. Although there are no reports of HLA-C associations with COPD, alleles of other MHC class I genes, HLA-A and HLA-B, have been associated with COPD [53,54]. The locus encompassing PSORS1C1/3 and HLA-C will require additional replication studies and functional validation to confirm its role in COPD susceptibility.
Prior studies have also used eQTL analyses to identify novel genes for complex traits, including age related decline in kidney function [8] and body mass index [55]. In contrast to our study, these papers first found gene transcripts correlated with the phenotype, then tested SNPs in/near these genes for association with expression levels. We performed the cis-eQTL analysis as the initial step, then tested the eQTL SNPs for phenotype association. This limits multiple testing compared to a GWAS, enriching for eQTL SNPs which may be more likely to be associated with disease [56].
This study has several limitations. The sample size of 131 subjects, though adequate for gene expression analyses, may be underpowered to detect all potential eQTL associations. Therefore, we limited the cis-acting analysis to SNPs within 50 kb from the gene, to limit the multiple testing burden. Based Table 3. Genetic association analysis of sputum expression quantitative trait locus (eQTL) single nucleotide polymorphisms (SNPs) with COPD susceptibility.   Table 4). d) rs6495306 -CHRNA5 (206533_at), p = 9.9e-6; LD bin 3 (see Table 4 [57]. Previous papers have used a 50 kb limit to define cis-acting eQTLs [6]. Using this method, we were able to replicate published eQTL associations from other tissues and were able to identify a set of significant eQTL SNPs to carry forward for COPD association studies. However, our method would be unable to detect cis-eQTLs located .50 kb from the TSS, such as a SNP in an upstream enhancer or in the 39 UTR of a large gene. Due to the sample size, we limited our investigation to cis-acting eQTL SNPs, as a full search for trans-acting regulatory SNPs greatly increases the number of tests performed. The literature suggests that sample sizes under 200 subjects may be inadequate to find true trans-eQTLs [58].  Table 4. b) Linkage disequilibrium r 2 values between SNPs in the chromosome 15q25 COPD locus (listed in  Several groups have compared eQTLs in different tissues from the same individual, finding both overlapping and tissue-specific eQTLs [59][60][61]. Multiple tissues are known to be important in COPD biology, including large and small airways, lung parenchyma and immune cells. By only surveying sputum, we may have missed significant eQTLs for COPD genes that are expressed in other tissues. Multiple cell types may be present in sputum, yet neutrophils have been shown to be the predominant cell type in the sputum samples from COPD subjects in ECLIPSE [16]. Despite these limitations, sputum is a clinically important tissue in COPD and is more accessible for genomic and biomarkers studies than lung tissue. Studying diseased individuals may be advantageous to identify eQTL SNPs for potential disease genes, which may only be expressed, or may be expressed at higher levels, in patients compared to healthy controls.
In conclusion, we combined genomewide SNP genotyping with genomewide expression profiling from a relevant tissue in wellcharacterized subjects with a common chronic disease. Using this strategy, we were able to gain insights into the functional role of SNPs previously associated through GWAS, as well as identify a potential novel disease susceptibility gene which would have been missed using standard GWAS analysis. Previous eQTL studies have provided important information about genetic control of human gene expression. Integrative genomics studies in relevant tissue from well-phenotyped individuals, as we have performed, will be required to apply this knowledge to human disease.