Connecting the Dots: Potential of Data Integration to Identify Regulatory SNPs in Late-Onset Alzheimer's Disease GWAS Findings

Late-onset Alzheimer's disease (LOAD) is a multifactorial disorder with over twenty loci associated with disease risk. Given the number of genome-wide significant variants that fall outside of coding regions, it is possible that some of these variants alter some function of gene expression rather than tagging coding variants that alter protein structure and/or function. RegulomeDB is a database that annotates regulatory functions of genetic variants. In this study, we utilized RegulomeDB to investigate potential regulatory functions of lead single nucleotide polymorphisms (SNPs) identified in five genome-wide association studies (GWAS) of risk and age-at onset (AAO) of LOAD, as well as SNPs in LD (r2≥0.80) with the lead GWAS SNPs. Of a total 614 SNPs examined, 394 returned RegulomeDB scores of 1–6. Of those 394 variants, 34 showed strong evidence of regulatory function (RegulomeDB score <3), and only 3 of them were genome-wide significant SNPs (ZCWPW1/rs1476679, CLU/rs1532278 and ABCA7/rs3764650). This study further supports the assumption that some of the non-coding GWAS SNPs are true associations rather than tagged associations and demonstrates the application of RegulomeDB to GWAS data.


Introduction
Over 1200 genome-wide association studies (GWAS) have been published since 2005 [1].While some of these studies have been crucial for determining genes responsible for disease phenotypes, including determination of genes involved in inflammatory bowel disease and age-related macular degeneration, the majority of variants identified show modest effect size at best.Furthermore, 88% of significant variants are located in either intronic or intergenic regions that do not encode proteins, suggesting their association with disease may occur for reasons other than changes in protein structure and/or function [2].
Given these findings, researchers recently have begun to deliberate implications of these non-coding variants.One such consideration is the possibility that, splice site variants and promoters aside, introns and intergenic regions are not ''junk DNA'' as previously believed, but possess regulatory properties which modify gene expression.Indeed, only 2% of the human genome encodes proteins, the remaining 98% is not ''functional'' in the sense that it does not encode proteins.Rather, the bulk of the genome is comprised of repeat regions, introns, and transposons [3].Multiple molecular techniques have been employed to determine chromatin structure, methylation, and protein motifs and binding to assess the effect of non-coding variants on transcription [4].RegulomeDB is a database developed to capture these data, and subsequently, assess the likelihood that a particular variant affects transcription factor binding.The advent of such databases is advantageous for studying gene associations of complex diseases [2].
Late-onset Alzheimer's disease (LOAD) is one such disease that may be better understood by examining the regulatory function of associated SNPs.Thus far, genome-wide association studies (GWAS) of LOAD have identified over 20 significantly associated risk loci [5][6][7].In addition, several suggestive loci for risk and ageat-onset (AAO) of AD have also been implicated [8], [9].Of these loci, only one, APOE, shows a strong effect size, which substantially increases risk for individuals homozygous for the APOE*4 allele especially after age 75 [10], [11].The remaining loci have only weak to modest effect sizes.In this study, we have demonstrated the utility of two publicly available bioinformatics tools, Broad Institute's SNP Annotation and Proxy search (SNAP) tool (http:// www.broadinstitute.org/mpg/snap/)[12] and RegulomeDB (http://regulomedb.org)[2], to investigate potential regulatory functions of recently identified, non-APOE variants (index and proxy SNPs) for known and suggestive loci associated with risk and AAO of LOAD.

Linkage Disequilibrium
Following SNP selection, we utilized the SNAP web portal [accessed 4 September 2013] [12] to identify SNPs in linkage disequilibrium (LD) (r 2 $0.80) with our SNPs of interest.SNAP allows users to find proxy SNPs based upon LD determined using the CEU populations from the International HapMap (v3) or 1000 Genomes Pilot 1 projects.SNAP searches were not limited by array and the identified SNPs could include the queried SNPs as proxies for themselves.At r 2 $0.80, the SNAP portal found 570 SNPs in LD with the 44 GWAS SNPs.SNAP proxy searches were repeated with r 2 thresholds of 0.90 and 1.0 to better assess associations among related SNPs.These higher thresholds yielded a total of 472 and 191 identified SNPs, respectively.As expected, the number of identified SNPs in LD with the 44 published SNPs decreased as the r 2 threshold increased.Table 1 summarizes the total number of SNPs in LD at all three thresholds for both HapMap3 and 1000 Genomes searches.All published SNPs and their respective proxy SNPs for each r 2 threshold are listed in Table S1.

RegulomeDB
RegulomeDB is a database providing functional annotation of SNPs as determined by data from the ENCODE Project Consortium (2012), NCBI Sequence Read Archive, and other sources totaling 962 data sets.It is free and publicly accessible (http://www.regulomedb.org)and has a straight-forward interface.With almost 60 million annotations, this tool will be invaluable for future examination of gene expression and disease traits.Variants can be classified into one of four RegulomeDB categories with scores ranging from 1 to 6 indicating putative functions.Scores and corresponding functional evidence are listed in Table 2.All reported SNPs and SNPs in LD (using the r 2 $0.80 list) were examined for potential regulatory functions using RegulomeDB (http://regulomedb.org,accessed [4 September 2013]) [2].
Of the 30 SNPs that were in LD with reported genome-wide significant variants and had high evidence of regulatory function, 10 were located in the MS4A region, including the SNP with the most evidence for regulatory function, rs667897 (RegulomeDB score = 1b).RegulomeDB cites rs667897 affects binding of 21 different proteins including BRCA1, SMARCC2, FOXA1, JUN, and POLR2A and falls within both TCF11:MafG and NFE2L2 binding motifs.Six other SNPs in the MS4A region, including 5 intergenic (rs1303615, rs617135, rs11230180, rs2123314, and rs655231) and 1 intronic (MS4A4E/rs2081547) SNPs, had RegulomeDB scores of 1f, and similar to the top hit rs667897, all are eQTLs for MS4A4A as evidenced by work in monocytes.Some of the protein binding affected by these SNPs include CEBPB, JUN, JUND, POLR2A, and SMARCC2.These are the same proteins that are also affected by top MS4A region hit, rs667897, however, motifs containing these variants have yet to be determined.Three more SNPs in the MS4A region, rs636317, rs636341, and rs7933202 were likely to affect binding according to RegulomeDB (score = 2a, 2a, and 2b, respectively).

Discussion
As the list of associated LOAD risk loci continues to grow, it becomes increasingly important to decipher the biological underpinnings of these associations.If we accept that these associations are real, we must endeavor to explain them.Since many of these risk variants are non-coding, one logical explanation for their association is an effect on gene expression.The ENCODE project has provided invaluable contributions to this area of research with a wealth of data that is publicly available for interpretation and expansion.These data are ideal for generating hypotheses and furthering our understanding of gene expression and epistasis.Here we have used two publicly available bioinformatics tools, SNAP tool and RegulomeDB, to investigate potential regulatory functions of non-APOE SNPs implicated with risk and AAO of LOAD.
None of the ten MS4A region SNPs with a score of ,3 are reported GWAS SNPs, indicating the difficulty of differentiating between a true signal and a tag signal in association studies, as well as highlighting the complexity of interactions between genetic variants and disease risk.Of the remaining 24 putative regulatory variants representing 12 loci other than the MS4A region, we observe some thought-provoking outcomes.For example, synonymous variant SLC39A13/rs2293576 (in LD with CELF1/ rs10838725, r 2 $0.8) is unique in this dataset because it is the only SNP with regulatory evidence that resides in an exon, reminding us that regulatory elements can be found within coding sequences as well as in intergenic regions and introns.Furthermore, eight putative regulatory variants located in six different genes (including the synonymous SLC39A13/rs2293576 variant) are in LD with the reported CELF1/rs10838725 SNP, and all are  eQTLs for C1QTNF4.These results suggest future work should examine C1QTNF4 (aka CTRP4) as a potential player in LOAD risk in addition to currently implicated CELF1 gene.C1QTNF4 is an inflammatory cytokine capable of activating both Stat3/IL6 and NF-kB pathways, as shown in cancer cells [13].The implication of the inflammatory pathway in AD pathogenesis and the inverse association between AD and cancer may explain in part the observed relationship between these SNPs and their effect on C1QTNF4 expression [14], [15].
According to RegulomeDB, the binding of the IKAROS family zinc finger 1 (Ikaros) transcription factor, IKZF1, is affected by ABCA7/rs4147911 (score = 2b).It is worth noting that the expression of another LOAD risk gene, INPP5D, is regulated by the Ikaros transcription factor family in B cells [16], suggesting a potential functional link between ABCA7 and INPP5D.Similarly, RegulomeDB findings suggest other proteins whose binding seems to be affected by variants at different LOAD loci (Table 3).Another position of interest is intron 5 of PTK2B.Two variants in this intron had RegulomeDB scores less than 3 (rs17057043, score = 1f and rs73223431, score = 2b), suggesting that intron 5 of PTK2B may play an important role in affecting the binding of regulatory proteins and consequently the risk of LOAD.
Variants in reported suggestive novel loci for AAO of AD, ZNF592/ALPK3/SLC28A1, HRK/RNFT2, and ADAMTS9, are also of functional importance as reflected by RegulomeDB scores of 1f, 2b, and 2b, respectively.Both rs12917429 and rs12909280 in the SLC28A1 region are eQTLs for neuromedin B (NMB), with the latter SNP suggested to affect binding of RFX3.According to GeneCards [17] NMB is a ligand that binds to bombesin receptors to instigate smooth muscle contractions.The bombesin peptides and receptors have been implicated in a variety of cellular processes and are frequently overexpressed in cancer cells [18], [19].RFX3 has been shown to be responsible for proper Corpus Callosum development in mice [20].RFX3 also affects expression of glucokinase and subsequently affects differentiation and function of beta cells [21].Two other SNPs with RegulomeDB scores of 1f, ZCWPW1/rs1476679 and rs655231 (MS4A region), show indications for affecting binding of RFX3 in K562 (chronic myelogenous leukemia, CML) cells.Given the proposed link between insulin resistance and AD as a result of insulin degrading enzyme (IDE), RFX3 may be an interesting transcription factor to examine in the context of LOAD pathogenesis [22].
Although RegulomeDB is an extensive database for the annotation of variants' effects on gene expression, it provides information for only selected DNA binding elements in certain cell types.A total of 220 variants of the 614 we examined returned scores of ''No data,'' meaning we cannot argue against their involvement in gene expression as related to LOAD pathogenesis.Along the same lines, some loci have a markedly higher number of SNPs that have been tested for expression effects than others.Thus, we make no assumptions that the mere number of putative regulatory variants for a given locus is indicative of the magnitude of that locus' role in risk and disease process.Moreover, the primary focus of our study was RegulomeDB and prediction of regulatory effects on gene expression based on the data included in that database.Therefore some other regulatory mechanisms, such as regulation of RNA splicing, or prediction of changes in protein structure and/or function were not covered as part of this study.
In conclusion, these results highlight a number of important considerations for the interpretation of future GWAS data including the necessity of carefully examining LD structure for SNPs showing association with disease risk.Additionally, careful attention should be paid to the regulatory function of associated SNPs and those in LD with such SNPs to clarify correlation between disease and associated variants and better understand complex disease mechanisms.These factors will be critical for elucidating genetic mechanisms that are truly causal for LOAD.Although the cellular pathology of the disease appears to be more widely agreed upon, the molecular basis is still elusive, requiring resolution before pathogenesis is completely comprehended.Identification of potential therapeutic targets can be expedited with a more extensive molecular understanding of the disease process.Given the replication of association of loci with unclear or unknown functions with LOAD risk, future studies should aim to determine their functions both in normal and disease states to identify their roles in disease pathogenesis.

Table 1 .
Number of SNPs in linkage disequilibrium for all published GWAS SNPs for HapMap3 and 1000 Genomes populations at tested r 2 thresholds.

Table 4 .
Linkage disequilibrium for published GWAS SNPs with functional proxies (RegulomeDB score ,3) according to SNAP search.