Parkinson's disease-associated genetic variation is linked to quantitative expression of inflammatory genes

Genome-wide association studies (GWAS) have linked dozens of single nucleotide polymorphisms (SNPs) with Parkinson’s disease (PD) risk. Ascertaining the functional and eventual causal mechanisms underlying these relationships has proven difficult. The majority of risk SNPs, and nearby SNPs in linkage disequilibrium (LD), are found in intergenic or intronic regions and confer risk through allele-dependent expression of multiple unknown target genes. Combining GWAS results with publicly available GTEx data, generated through eQTL (expression quantitative trait loci) identification studies, enables a direct association of SNPs to gene expression levels and aids in narrowing the large population of potential genetic targets for hypothesis-driven experimental cell biology. Separately, overlapping of SNPs with putative enhancer segmentations can strengthen target filtering. We report here the results of analyzing 7,607 PD risk SNPs along with an additional 23,759 high linkage disequilibrium-associated variants paired with eQTL gene expression. We found that enrichment analysis on the set of genes following target filtering pointed to a single large LD block at 6p21 that contained multiple HLA-MHC-II genes. These MHC-II genes remain associated with PD when the genes were filtered for correlation between GWAS significance and eQTL levels, strongly indicating a direct effect on PD etiology.


Introduction
Parkinson's disease [MIM 168600] is an age-associated, incurable neurodegenerative disorder with a cumulative lifetime risk of close to 10% [1]. Most cases are genetically complex and sporadic, though twin studies indicate heritability is around 34-60% [2]. In addition to 7 autosomal and recessive monogenic causal genes, at least 24 loci have been associated by GWAS with an increased risk of developing Parkinson's [3,4]. Depending on the method of association, these loci link Parkinson's disease etiology with a set of between 24 and 800 genes having specific activity in a multitude of tissues.
Recently, some progress has been made in experimentally linking specific risk polymorphisms with allele-specific regulatory activity and corresponding neighboring gene activity [5]. However, even in this best-studied case, the exact impact of altered expression to the neighboring gene, a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 SNCA [MIM 163890], on PD remains unclear. For the majority of risk loci, the specific disrupted-regulatory element and related gene expression changes are unknown, preventing experimental manipulation of the sort seen in [5].
Previously, we examined the nonrandom distribution of PD risk SNPs overlapping tissuespecific putative regulatory elements (REs) [6]. That analysis indicated the tissue or tissues in which an allele-specific effect was most likely to be causal. Surprisingly, most RE enrichment was not seen in brain tissue, indicating that these loci may confer a predisposition to PD through biological effects that are remote from the eventual symptomatic tissues. One drawback of this method, however, is that it did not link risk to specific gene sets but only to tissuespecific active regions generally. Furthermore, the specifics of the enrichment excluded several major loci from analysis. In the present study, we examine different sets of genes potentially associated with PD risk loci in order to identify common functional pathways. In order to find experimental genetic targets, we linked genes to nearby super-enhancers by SNPs in a comprehensive genome-wide screen.

PD risk variants
Genetic polymorphisms, which have been associated directly or by imputation with PD, were obtained from pdgene.org (p. value <0.0001), PheGenI, and GWAS catalog on 7/2016 providing 7,607 significant risk index sites. Finding further proxy variants in linkage disequilibrium (LD), r 2 > 0.8 in using rAggr added an additional 23,759 variants (the majority of which were single nucleotide polymorphisms (SNPs). For convenience, we referred to all as risk SNPs. Offspring LD-SNPs were risk-annotated according to the most significant parent GWAS p-value when multiple linked SNPs were present.

Analysis
Significant eQTLs were downloaded from GTEx July 2016, version 6. Intersection with risk SNPs was determined using custom Perl scripts to generate matched gene/SNP eQTLs according to rsID. Tissue source for eQTL values were annotated. Cleaned txt files were then analyzed using the statistical package, R. Correlation between GWAS significance and eQTL expression was calculated and subset accordingly. Custom Perl scripts were used to find overlap between dbSUPER defined enhancer regions and risk SNPs. Gene set enrichment analysis on egenes associated with risk SNPs was performed using a variety of GSEA software, including DAVID, Metacore from Thomson Reuters, Gorilla, PANTHER, which all gave broadly similar results. Query sets were compared against a background of all significant egenes in GTEx when possible during GSEA. TF binding motif-disruption was found via the R package MotifBreakR, for Factorbook motifs, searched against the dbSNP build 142 and dbSNP build 144. Directional associations for pdgene.org risk SNPs and eQTLs at the HLA locus were oriented according to the GTEx measured minor allele frequency, correlated according to gene by GWAS OR vs. eQTL beta value, and averaged across tissues.

Results
We obtained 7,607 PD risk SNPs along with an additional 23,759 surrogate polymorphisms at r 2 > 0.8 occupying at least 26 loci (Fig 1A-1C, Methods). It is obvious that the resulting 31,366 SNPs needed to be reduced to a more manageable number for hypotheses generation, as well as to associate risk with specific genes and regulatory elements.
We first reduced the set of potential PD risk SNPs by removing those which were not associated with known gene expression changes. The most direct way to link SNPs with gene expression is through variant-RNA expression association known as expression quantitative trait loci (eQTL) screens [7]. This is achieved by associating significant risk SNPs plus surrogates to genes by searching across GTEx-significant eQTLs and combining the results from each of 53 tissues; this generated a total of 795 genes ( Fig 1D). Some risk loci queries (chromosomes 8,18,19) resulted in no eQTLs in the GTEx database, while other loci show association with a large number of genes. Large signals were seen at several regions, including chromosome 6 and 17. The largest number of eQTL genes (egenes) as well as the strongest change in expression were seen at chromosome 17q21. This locus, as well as that at chromosome 6p21, is located in a region with a great many polymorphisms in high LD.
We compared the set of 795 SNP/gene associations based on GTEx eQTLs with other associations based on the nearest transcription start site (TSS); on all TSS located within 1Mb at the gene locus given by the database (typically but not always the closest gene body). We used gene set enrichment to compare the sets produced by the different methods. In order to examine whether any functional pathways were over-represented in these gene sets, we used   pathway enrichment and found that nearly all gene sets showed a similar functional enrichment of antigen-related processes (data not shown). Upon further examination, it became obvious that the 6p21.32 locus which contain a large linkage disequilibrium (LD) block that includes many of the HLA-MHC class I and class II protein coding genes-overwhelmed other enrichment signals. Although some non-HLA-related processes were enriched, in nearly every PD risk-SNP/gene set we examined, antigen presentation pathways predominated. Although, neuorinflammation generally, and the HLA MHC genes specifically, have been related to Parkinson's disease [8][9][10][11], it is still possible that the functional enrichment is spurious. The HLA locus is a highly polymorphic site having a large number of variants in high LD and containing genes with related function, any PD risk variants within will be indirectly associated with HLA genes due to linkage with surrogate polymorphisms. In order to test whether the measured eQTL associations, including those at 6p21, directly impact PD risk we measured and found strong correlation between associated eQTL expression changes and PD GWAS significance of variants for individual genes (Fig 2). Filtering by these correlations reduced the PD risk associated genes to 189. Again, GSEA implicated MHC II genes, and to a lesser degree, MHC I genes.
Because most risk variants lie in noncoding DNA regions, we have reasonable confidence that variants affect phenotype by altering gene transcription, most likely by disrupting regulatory elements (REs). A particularly active subset of REs is situated in super-enhancers (SEs). SEs are large clusters of enhancer elements which show high association for Mediator occupancy and histone H3K27Ac modifications, and they show more tissue specific activity than enhancers in general [12][13][14]. As we and others have reported, we found that candidate risk SNPs often overlap SEs, and may especially affect disease predisposition. By further sub-classification of the risk SNPs according to overlap with dbSUPER defined superenhancers, we reduced the number of risk variants to 3685 and of risk genes to 100, with chromosome 6 and chromosome 17 having 1063 and 1322 variants, respectively. Finally, we have previously shown that locating putative transcription factor (TF) binding motif disruption is an effective way to identify possible functional risk SNPs [15]. We filtered the SE SNP set for TF motif disruption and found that the associated gene sets changed little, falling to 95 genes (S3 File), and indicating that the previous subsetting by super-enhancer overlap had enriched for TF binding motifs. Correspondingly the significant GSEA signals related to the locus at chromosome 6p21.32. The most significant gene ontology categories were by Panther, "MHC class II receptor activity" (p = 5.8e-4), MetaCore from Thomson Reuters: "peptide antigen binding" (p = 4.906e-112), and David, "interferon-gamma-mediated signaling pathway" (p = 1.97E-7) [16,17].

Discussion
Pdgene currently curates the most comprehensive GWAS of PD risk, having imputed association scores for 7.9 million variants [18]. To maximize the possible network of associated risk genes we used a low (0.001) p-value to obtain 7267 SNPs from pdgene.org and added to that risk SNPs from the NIH GWAS Catalog and NCBI PheGenI databases. Associating risk SNPs and LD SNPs to genes from the GTEx eQTL data gave a total of 795 genes (527 protein coding genes) across all 53 tissues. The median number of significant egenes per variant was 4 genes per tissue (max 18). Brain specific eQTLs consisted of 175 genes. This much smaller gene set may better predict influential genes, but non-brain tissues may have also a role in PD development and progression [6]. The smaller subset of brain-tissue egenes included 20 genes at the HLA locus (chromsome 6) and 28 genes at the MAPT locus (chromosome 17). This agrees with results from a similar study that examined 67 PD SNPs in a small number of PD and control brain cortical samples and found cis eQTLs at the same two regions [19].
GTEx gene expression data can be oriented by allele frequency and it is worth noting that some signals are directional in this presentation, indicating that the variants which show a positive expression changes are primarily the major or minor population alleles, and that LD variants all affect the same gene or genes. For instance, the signal on chromosome 12 for LRRK2 [MIM 609007] is primarily in the positive direction, indicating all significant nearby alleles are the low-frequency, minor alleles (in this case the minor alleles are the risk alleles) and these correlate with higher LRRK2 expression relative to the major allele. This is made more obvious by multiplying the GWAS-log(p. value) by eQTL expression (S1 Fig). The minor risk alleles are associated with increased LRRK2, which may indicate that a slightly elevated LRRK2 level leads to a small but cumulative relative PD risk. This is consistent with findings that overexpression of wild-type or mutant LRRK2 induces neural toxicity and elevated protein levels or activity that possibly contribute to Parkinson's disease [20][21][22]. In other regions, the signal is more complex, and in these cases, a single allele may be associated with an increase in one gene transcript but a decrease in another.
Because eQTL studies produce a quantitative association between gene expression and multiple polymorphisms it may be possible to prioritize SNPs by comparing the expression level changes. A large change in gene expression may be more tractable experimentally and so can be used to prioritize targets. A corollary and proof of principle is that SNPs which alter the same genes by different amounts should show a corresponding difference in GWAS significance. Indeed, we see more positive correlation and less negative correlation than expected by chance in pairwise SNP comparisons of the difference in eQTL effect sizes and GWAS measured significance (S2 Fig). This principle is noisier (due to fewer values) but more direct by searching for genes that show a positive correlation between GWAS significance and eQTL expression. GWAS significance has been previously positively correlated with eQTL effect size [23]. We used this method to reduce the PD-associated genes from 795 to 189. However, for many genes the number of SNPs with relevant measurement is low and so correlation is inaccurate. Furthermore, we only consider significant p-values and significant eQTLs further increasing the accuracy of correlation. For these reasons, we chose a correlation cutoff of 0.4 to increase the sensitivity. We believe this set of genes includes those for which there is the greatest evidence for involvement with PD etiology, although some important genes, including SNCA, were removed.
We hypothesize that some GWAS signals are due to complex haplotypes containing multiple linked SNPs, which confer risk through synergistic disruption of multiple TF binding sites simultaneously. To maximize the identification of these cases and to identify specific regulatory elements we sub-classified the set of risk SNPs by overlap with super-enhancers, the large enhancer regions showing unusually high active histone signals. In this way, we reduced the number of risk gene candidates to 100. After further filtering SNPs for those that disrupt TFbinding protein motifs, the median number of egenes per SNP per tissue was 2 (max 14). This set of 95 genes is functionally enriched for antigen presentation processes.
In general, gene set enrichment for functionally related genes following our filtering of PD risk associations, pointed to fewer underlying processes than expected. For instance, the set of genes curated by the Parkinson's Disease Gene Ontology Annotation Initiative displays broader biological functions [24]. This disparity may reflect inadequate power or unsuccessful subgrouping in the underlying GWAS studies. The functional enrichment seen here is due to a single locus containing multiple HLA genes. However, one must be careful over-interpreting GSEA for locations where closely spaced genes share similar functions. GSEA typically assumes multiple independent random variables. However, this assumption is not valid when a single variant can affect multiple related genes such as at the HLA region, which has previously shown strong correlation in MHC gene expression [25]. Furthermore, the density of variants near the HLA genes is high, making it more likely that noise alone will provide signals in these regions. Never-the-less, the strong GSEA for MHC-II genes (and significant enrichment for MHC-I genes) does indicate that variants in these regions can strongly affect multiple genes in a single pathway. Furthermore, in a previous study that linked genes to risk SNPs by proximity rather than by eQTLs, an enrichment for immunological gene categories remained even when the HLA locus was treated as a single signal [26].
The 6p21.32 risk locus contains approximately 3300 risk SNPs in LD (r 2 >0.8) and spans close to 2Mb and 121 genes. Cis eQTL analyses indicate that allele specific expression may be altered in up to 60 genes in this region, providing associated allele-specific differences in many traits and disorders. Indeed, the GWAS catalog links this locus with some 175 traits, approximately 10% of studied parameters [4]. Removing associations with genes that show lower PD correlation coefficients than overlapping variants associated with an eQTL in any tissue with dbSUPER super-enhancers, reduced the number of variants to 1063 with associated expression changes in 15 genes. We analyzed these with MotifbreakR [15] to find variants which alter putative TF-binding motifs; 595 risk polymorphisms strongly disrupted 79 Factorbook binding motifs (chromosome 6 summarized as a tract in Fig 3 and disrupted TFs listed in S1 File). We realize that this shorter list of variant/gene associations is by no means comprehensive, but it does represent the strongest indication of PD etiology.
A high GWAS significance may indicate one of three things. Firstly, the expression of a very central protein can be affected, perhaps only slightly. SNCA is a likely candidate for this type of risk. Secondly, a peripheral process may be affected, but to a very high degree. The large eQTL signals at the HLA and MAPT loci point to such relationships. Lastly, multiple peripheral processes may be affected simultaneously. This is particularly likely in large LD blocks where multiple correlated variants can have distinct effects. Again, both the MAPT and HLA may fall in this category.
We expect that multiple variants near the MHC-II genes and which overlap one of several super-enhancer regions (Fig 3), independently affect tissue-specific MHC-II gene expression levels and likely act to synergistically alter adaptive immunity-related processes. By comparing  [3]. A total of 595 PD risk SNPs that overlap dbSUPER defined super-enhancers, show eQTLs, and disrupt a TF binding motif (labeled disrupt eQTL SNPs, S1 File) are shown. the associated eQTL signals across tissues we found that PD risk alleles at this locus are associated with increased expression of 7 HLA-genes (HLA-B, HLA-C, HLA-DQA1, HLA-DQB1, HLA-DQB1-AS1, HLA-DRB1, HLA-DRB5) and decreased expression of 4 genes (HLA-DOB, HLA-DQA2, HLA-DQB2, HLA-DRB6). MHC-II genes are expressed by antigen presenting cells, including microglia. The number of MHC-II-expressing cells increases with neuroinflammation and PD patients show more activated microglia then controls [27][28][29]. Activated microglia help clear debris, both foreign and native (including dying neurons), and produce pro-inflammatory factors. In addition, it was shown that microglial MHC-II plays a central role in the activation of both the innate and adaptive immune responses to alpha-synuclein expression, a hallmark in PD progression [30]. Altered antigen presentation pathways associated with risk alleles may contribute to prolonged neuro-inflammation or otherwise increase the loss of dopaminergic neurons in PD and as such be related to the microbiome in the gut [31]. The large number of variants in this region may relate to a highly complex regulation coupled with selective pressure on these genes and imply an importance of the MHC-II processes for many disorders including Parkinson's disease. However, we cannot rule out a significant role for MHC-I pathways in PD. Interestingly the density of overall GWAS catalog risk SNPs aligns well with super-enhancer-located PD SNPs (Fig 3) suggesting that a relatively small number of SNPs may have diverse and complex effects on a variety of disorders and that examining a single index SNP is not a good model for causality.