3 Jul 2014: The PLOS Genetics Staff (2014) Correction: Integrative Modeling of eQTLs and Cis-Regulatory Elements Suggests Mechanisms Underlying Cell Type Specificity of eQTLs. PLoS Genet 10(7): e1004546. doi: 10.1371/journal.pgen.1004546 View correction
Genetic variants in cis-regulatory elements or trans-acting regulators frequently influence the quantity and spatiotemporal distribution of gene transcription. Recent interest in expression quantitative trait locus (eQTL) mapping has paralleled the adoption of genome-wide association studies (GWAS) for the analysis of complex traits and disease in humans. Under the hypothesis that many GWAS associations tag non-coding SNPs with small effects, and that these SNPs exert phenotypic control by modifying gene expression, it has become common to interpret GWAS associations using eQTL data. To fully exploit the mechanistic interpretability of eQTL-GWAS comparisons, an improved understanding of the genetic architecture and causal mechanisms of cell type specificity of eQTLs is required. We address this need by performing an eQTL analysis in three parts: first we identified eQTLs from eleven studies on seven cell types; then we integrated eQTL data with cis-regulatory element (CRE) data from the ENCODE project; finally we built a set of classifiers to predict the cell type specificity of eQTLs. The cell type specificity of eQTLs is associated with eQTL SNP overlap with hundreds of cell type specific CRE classes, including enhancer, promoter, and repressive chromatin marks, regions of open chromatin, and many classes of DNA binding proteins. These associations provide insight into the molecular mechanisms generating the cell type specificity of eQTLs and the mode of regulation of corresponding eQTLs. Using a random forest classifier with cell specific CRE-SNP overlap as features, we demonstrate the feasibility of predicting the cell type specificity of eQTLs. We then demonstrate that CREs from a trait-associated cell type can be used to annotate GWAS associations in the absence of eQTL data for that cell type. We anticipate that such integrative, predictive modeling of cell specificity will improve our ability to understand the mechanistic basis of human complex phenotypic variation.
When interpreting genome-wide association studies showing that specific genetic variants are associated with disease risk, scientists look for a link between the genetic variant and a biological mechanism behind that disease. One functional mechanism is that the genetic variant may influence gene transcription via a co-localized genomic regulatory element, such as a transcription factor binding site within an open chromatin region. Often this type of regulation occurs in some cell types but not others. In this study, we look across eleven gene expression studies with seven cell types and consider how genetic transcription regulators, or eQTLs, replicate within and between cell types. We identify pervasive allelic heterogeneity, or transcriptional control of a single gene by multiple, independent eQTLs. We integrate extensive data on cell type specific regulatory elements from ENCODE to identify general methods of transcription regulation through enrichment of eQTLs within regulatory elements. We also build a classifier to predict eQTL replication across cell types. The results in this paper present a path to an integrative, predictive approach to improve our ability to understand the mechanistic basis of human phenotypic variation.
Citation:Brown CD, Mangravite LM, Engelhardt BE (2013) Integrative Modeling of eQTLs and Cis-Regulatory Elements Suggests Mechanisms Underlying Cell Type Specificity of eQTLs. PLoS Genet 9(8): e1003649. doi:10.1371/journal.pgen.1003649
Editor: Greg Gibson, Georgia Institute of Technology, United States of America
Received: October 4, 2012; Accepted: June 4, 2013; Published: August 1, 2013
Copyright: © 2013 Brown et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding:BEE was funded through NIH grant HG006265. Generation of the Cholesterol and Pharmacogenetics (CAP) data and LMM were funded by a grant from the National Institutes of Health, U01 HL69757. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
The precise spatial and temporal control of gene transcription is critical for biological processes, as evidenced by the causal role of gene expression perturbation in many human diseases –. Gene expression is controlled by regulatory proteins, RNAs, and the cell type specific cis-regulatory elements with which they interact. Genetic variation within cis-regulatory elements (CREs, e.g., transcription promoters, enhancers, or insulators) can affect gene expression in a cell type specific manner. An extensive body of work, performed in a variety of cell types in both humans and model organisms, has demonstrated that genetic variants that impact gene expression, or expression quantitative trait loci (eQTLs), are common and exist in both cis (local) and trans (over long genetic distances) –. Over of genotype-phenotype associations found in genome-wide association studies (GWAS) are with non-coding single nucleotide polymorphisms (SNPs), making their mechanistic interpretation more challenging. It is possible that these associated SNPs tag causal coding SNPs; however, numerous compelling lines of evidence , – demonstrate that regulatory SNPs have causal roles in many complex human phenotypes. This is further supported by the finding that GWAS associations are enriched within DNase I hypersensitive (DHS) sites  and eQTL SNPs , , and by several elegant GWAS follow up studies that have mechanistically tied disease associations with SNPs that cause the misregulation of gene expression , .
Although eQTLs are increasingly used to provide mechanistic interpretations for human disease associations, the cell type specificity of eQTLs presents a problem. Because the cell type from which a given physiological phenotype arises may not be known, and because eQTL data exist for a limited number of cell types, it is critical to quantify and understand the mechanisms generating cell type specific eQTLs. For example, if a GWAS identifies a set of SNPs associated with risk of type II diabetes, the researcher must choose a target cell type to develop a mechanistic model of the molecular phenotype that causes the gross physiological change. One can imagine that the relevant cell type might be adipose tissue, liver, pancreas, or another hormone-regulating tissue. Furthermore, if the GWAS SNP produces a molecular phenotype (i.e., is an eQTL) in lymphoblastoid cell lines (LCLs), it is not necessarily the case that the SNP will generate a similar molecular phenotype in the cell type of interest. Furthermore, there are many examples of cell types with particular relevance to common diseases, for example dopaminergic neurons and Parkinson's disease, that lack comprehensive eQTL data or catalogs of CREs. The utility of eQTLs for complex trait interpretation will therefore be improved by a more thorough annotation of their cell type specificity.
While several studies have quantified the reproducibility of eQTLs within or between cell types derived from the same or different individuals – the determinants of eQTL cell specificity are still largely unknown. We address this need in this study by analyzing cell specific eQTLs collected from eleven studies performed in seven different cell types and by integrating these data with cell specific CRE data to mechanistically interpret cell specific eQTLs. We used Bayesian regression models to identify all cis-linked SNPs that are independently associated with each gene expression trait in each study. In an effort to identify the functional determinants of eQTL cell specificity, we quantified the associations between eQTL SNPs and CRE data sets, many of which were derived from the cell types used in eQTL discovery and are known to function in a cell type specific manner (e.g., transcription factor binding sites (TFBSs), DHS sites). We further considered the relationship between eQTL SNP-CRE overlap and the cell type specificity of eQTL replication. Lastly, we built a series of classifiers to predict the cell type specificity of eQTLs in the absence of additional gene expression data and to predict the function of GWAS SNPs with phenotypes relevant to cell types lacking eQTL data. We believe these predictive models will facilitate more substantial mechanistic analyses of individual SNPs by enabling the integration of disease genetics and regulatory SNPs with the thousands of genomic data sets that have been produced by projects like ENCODE , .
A uniform analysis of cis-eQTLs across seven cell types
In an effort to comprehensively characterize eQTL reproducibility within and between different cell types, we gathered publicly available data sets that included both gene expression and genotype data. This collection included eleven studies from seven unique cell types (Table 1) , , –. To ensure the data from each eQTL study were comparable, we uniformly processed all raw data by developing a standardized analysis pipeline that was designed to marginalize the effect of study design differences on the identified eQTLs (see Methods). Genotype data, regardless of array type, were subjected to standard quality control filters. Missing and unobserved genotypes were subsequently imputed to the SNPs in the HapMap phase 2 CEPH panel ( SNPs) using BIMBAM , . Each gene expression array was uniformly re-annotated; probe sequences were aligned to the human reference genome (hg18) and to RefSeq gene models. Within each array platform, multiple probes mapping to a single gene were clustered as in previous work . Only uniquely aligned probes that did not overlap known, common polymorphisms were included in our analysis.
We chose to control for the confounding effects of both known covariates and unknown factors by removing the effects of principal components (PCs; Figure S1, Table S1) , . Given that a diverse set of demographic (e.g., age, sex), environmental (e.g., BMI, drug use), and technical (e.g., post-mortem interval, array batch, ozone levels, identity of the technician who handled the arrays) variables are known to be associated with gene expression measurements and to confound eQTL ascertainment , ,  we felt it was critical to control for these effects in the most consistent way possible prior to eQTL mapping. Across the diverse set of studies examined here, the covariate annotation ranges from non-existent to detailed. To address this non-uniformity, we analyzed each data set with the same approach, irrespective of covariate annotation. Multiple independent studies demonstrate the effectiveness of controlling for latent variables with respect to eQTL ascertainment; indeed, controlling for PCs substantially increases power to detect cis-eQTLs within these studies . Importantly, it has also been demonstrated that each of these eQTL discoveries is also more likely to replicate across studies .
We projected residual expression variation to the quantiles of a standard normal distribution to control for outliers, and we used these projected values as the quantitative traits for association mapping, which was performed in each study set using the same HapMap phase 2 CEPH SNP panel. We evaluated evidence for gene expression-genotype associations in terms of Bayes factors (BFs) using BIMBAM , , as BFs have been shown to be more robust to SNPs with small minor allele frequencies (MAF) than p-values , .
We identified, for each gene expression trait, the most highly associated SNP within each local linkage disequilibrium (LD) block. We tested the independence of each SNP by multivariate regression (Figure S2) and took the union of the independently associated SNPs for each gene. We refer to, for example, the first and second most significant, independently associated SNPs as primary and secondary SNPs, respectively, and we refer to the set of primary SNPs as first tier, or tier 1, extending in the straightforward way through tier 4. We do not consider tiers beyond the fourth tier because of lack of statistical power. For each study, and within each tier, we independently estimated false discovery rates (FDRs) by permutation. Although we computed a BF for every SNP-gene pair, we limit our subsequent analysis to cis-linked SNPs, or SNPs within 1 Mb of the transcription start site (TSS) or transcription end site (TES) of a gene. While we have standardized analysis and reporting across studies, we have not considered the scope of differences in eQTL discovery based on alternative data analysis pipelines.
Across these studies, we observe between 585 and 5433 genes with eQTLs (), corresponding to thresholds between and (Figures 1A–C, Table 2, Table S2). As expected, studies with larger sample sizes and replicate gene expression measurements identified more eQTLs at a given FDR threshold (Figure 1D; and , respectively, by multivariate ANOVA). Indeed, across the eleven studies analyzed, of the variance in the proportion of genes with eQTLs can be explained by sample size and expression replication. The per study effect size distribution is also consistent with the expectation that larger studies identify eQTLs with smaller effect sizes (Figures S3, S4). We expect that future eQTL studies with larger sample sizes (even from previously examined cell types) will identify additional eQTLs with smaller effects. We find that, despite study heterogeneity, the relationship between BF and FDR is quite uniform across studies (Figure 1A). As demonstrated in previous studies , , eQTL SNPs are highly enriched at the transcription start site (TSS) of the associated gene (Figure 1F).
Studies are labeled by their acronym from Table 1. (A) Plot of (y-axis) as a function of (x-axis), for each study as a separate line of a diferent color, as indicated in panels B, D, and E. Dashed line represents . (B) Plot of eQTL counts as function of , for all studies. (C) eQTL count (x-axis) by tier, for tiers 1–4 (light blue, dark blue, light green, and dark green, respectively), with separate bars for each study (y-axis). (D) Fraction of genes with a significant eQTL SNP (y-axis; thresholded at ), as function of sample size (x-axis). Each study is plotted in a distinct color, as indicated with labels. Studies with replicate expression measurements are depicted as triangles, studies without as circles. (E) Fraction of genes with a significant eQTL that have more than one independently associated SNP (y-axis; thresholded at ), as a function of sample size (x-axis). Each study is plotted in a distinct color. Studies with replicate expression measurements are depicted as triangles, studies without as circles. (F) Histogram of eQTL counts by tier (y-axis; colors as in panel C), summed across studies, as a function of their distance to the gene transcription start and end sites (x-axis; gene split into 10 bins). P (grey) line depicts the counts of first tier eQTL SNPs from a permutation, to illustrate the background distribution of tested SNPs.
Across all eleven studies, ( of ) of eQTL associated genes are independently associated with at least two SNPs in at least one study (; Figures 1C and 1E, Figures S3, S4). Within each study, the fraction of eQTL-associated genes with two or more independently associated SNPs ranges from (). Our search for allelic heterogeneity, or multiple SNPs in the same locus that are independently associated with a single trait, appears to be power-limited and our estimates of its frequency should be taken as a lower bound; larger sample sizes will identify additional heterogeneity (Figure 1E), as the relationship appears almost identical to the linear relationship between genes with eQTLs and sample size (Figure 1D). As with tier 1 eQTL discovery, sample size and replicate expression measurements are significantly associated with the fraction of genes with an associated eQTL SNP exhibiting allelic heterogeneity (AH; and , respectively, by multivariate ANOVA). Tier 2 eQTL SNPs reside significantly further from the associated gene TSS than tier 1 eQTL SNPs (Figure 1D). For example, in the CAP_LCL eQTL data set, the median absolute distances between the TSS and tier 1 and tier 2 eQTL SNPs are and kb, respectively (Wilcoxon signed rank test ).
Cis-eQTL replication within and between cell types
We next investigated the cell type specificity of eQTLs, comparing eQTLs both within and between cell types. Cell type specific eQTLs are defined here as eQTL SNPs that replicate across studies of the same cell type but fail to replicate across studies of different cell types. Given the broad array of technical and biological factors associated with the reproducibility of eQTLs , , , , our analysis of eQTL replication focused on three specific comparison trios:
- CAP_LCL versus Stranger_LCL and Merck_liver
- UChicago_liver versus Merck_liver and Stranger_LCL
- Harvard_cerebellum versus Myers_brain and Stranger_LCL.
Each trio of comparisons enabled the simultaneous quantification of within and between cell type eQTL reproducibility. Each of the six studies above used different expression platforms and were composed entirely of independent samples. Although post hoc comparisons between heterogeneous studies will have limitations, we found there to be substantial scientific merit to using the full breadth of data available while being completely forthcoming about both our comparison methods and those limitations. We note that, despite this heterogeneity, the conclusions highlighted below are largely independent of the particular discovery cohort, replication cohort, or cell type (Figure 2, Figures S5, S6). These specific trios were chosen for comparative analysis based on the following criteria: (i) two or more studies in our analysis included only these three cell types; (ii) of the studies that included these three cell types, we chose those with the largest sample size, and (iii) LCLs and liver are valuable in this comparative context because of the substantial amount of ENCODE data available for GM12878 and HepG2 cells. We note that the Myers_brain study includes samples from several different brain cell types, a minority of which were cerebellum, implying that the cell type matching in comparison 3 above is inexact.
(A, B, C) eQTL replication frequency (y-axis) as a function of discovery significance (x-axis: ). SNPs were grouped into 30 equally spaced bins by BF. (D, E, F) eQTL replication frequency (y-axis; thresholded at ) as a function of SNP position () (x-axis). Cis-eQTL SNPs within 250 kb of the TSS were grouped into equally spaced bins. (A, D) Replication frequencies for CAP_LCL eQTLs in Stranger_LCLs (blue) and Merck_liver (red). (B, E) Replication frequencies for UChicago_liver eQTLs in Merck_liver (blue) and Stranger_LCL (red). (C, F) Replication frequencies for Myers_brain eQTLs in Harvard_cerebellum (blue) and Stranger_LCL (red). In all panels, bold lines depict percentage of SNP-gene pairs with per bin, and ribbons depict confidence interval.
Consistent with previous observations, , , cis-eQTLs are more likely to replicate across studies within the same cell type than they are to replicate between different cell types (e.g., in CAP_LCL: McNemar's test ). Beyond the trios listed above (Figures S5, S6), replication frequencies vary broadly. Two variables have large effects on replication: sample size for the replication cohort (which is well correlated with statistical power), and genetic independence of the samples (i.e., whether the two cell types were derived from the same or different individuals).
Within a given comparison, eQTL replication frequency is associated with a number of factors. For example, within and between cell type replication of CAP_LCL eQTLs is positively associated with discovery significance (within: , between: , quantified by multivariate logistic regression, Equation (3)) and negatively associated with absolute distance to the TSS (Figure S7; within: , between: ) and with eQTL tier (within: , between: ), while differences in allele frequency across studies does not have a major effect (Figure S8). We found that as the level of discovery significance increases, the likelihood that the eQTL replicates in both matched and unmatched cell types also increases, implying that cell type specific eQTLs tend to have smaller effects (Figure S9). After controlling for discovery significance, effect size is not significantly associated with replication frequency. Similar to previous reports (see Figure S6 from ), alternative post hoc replication metrics (e.g., correlation of effect sizes) produce qualitatively similar results. To assess the effects of model parameters and post hoc comparison thresholds, we applied a bivariate Bayesian regression model to a subset of our studies (Figure S10; see Methods). The results of these more formal bivariate analyses are qualitatively similar to those obtained from post hoc comparisons: the fraction of cell type specific cis-eQTLs decreases with increasing discovery significance and cell specific eQTL SNPs reside further from the TSS.
eQTL SNP tier is significantly associated with eQTL replication frequencies; tier 1 eQTL SNPs are more reproducible than additional independently associated SNPs (Figure S11; e.g., CAP_LCLs: Fisher's exact test ). Additionally, first tier eQTL SNPs are significantly less likely to be cell type specific, relative to additional independently associated SNPs (e.g., CAP_LCLs: Fisher's exact test ). Therefore, for any given gene, the first tier eQTL SNP is more likely to be TSS-proximal, of large effect, and observed in additional cell types, as compared to additional independent eQTL SNPs, which are more likely to be specific to the discovery cell type, have smaller effect sizes, and reside further from the TSS.
eQTL SNPs are associated with many classes of cis-regulatory elements
We next sought to investigate the biological characteristics associated with the reproducibility and cell specificity of eQTLs. To do this, we quantified the overlap between cis-eQTL SNPs and genomic features associated with functional cis-regulatory elements (CREs), including DHS sites, chromatin marks, and binding sites for transcription factors and other DNA associated regulatory proteins (see Table S3 for full list of data sets). We categorized regions of open or activating chromatin, and regions of transcription factor or DNA protein binding as activating CREs, and regions of repetitive, repressive, or heterochromatic chromatin domains as repressive CREs, to draw a contrast between genomic regions where transcription factor binding is frequent and regions where it is discouraged or unlikely. We focused analyses of LCL eQTL SNPs on CRE data sets produced in LCLs (primarily GM12878) and analyses of liver eQTLs on CRE data sets produced in HepG2 cells, a well-characterized, if imperfect, proxy for hepatocyte biology. We note that the quantification of eQTL SNP-CRE overlap enrichments is inherently conservative, given that the boundaries of most genomically defined CRE types are imprecise and that eQTL SNPs are typically tag SNPs, rather than the exact causal variants.
We quantified the enrichment or depletion of eQTL SNPs, relative to the full set of CEU HapMap phase 2 SNPs tested for eQTL associations, within each class of CRE by multivariate logistic regression, controlling for the SNP to TSS distance and the expression level of the associated gene (see Methods and Equation (2)). Consistent with the hypothesis that many eQTL SNPs exert their effect by modifying the biochemical function of CREs, cis-eQTLs are known to be enriched for overlaps with several classes of CREs, including DHS sites (Figure 3A and , ). Moreover, cis-eQTLs have been shown to be depleted within regions in which a CTCF binding site lies between the eQTL SNP and the target gene TSS (Figure 4G and , ).
(A–I) CAP_LCL eQTL SNP () overlap with predicted cis-regulatory elements. Each row of panels depicts overlap with distinct CRE data sets: (A–C) DNAse hypersensitive sites, (D–F) p300 binding sites, (G–I) chromHMM predicted active promoters. In each panel, SNPs are grouped into 25 equally spaced bins within the 50 kb upstream and downstream of the TSS and TES, and 10 bins between the TSS and TES. Each bin is plotted along the x-axis. Bold lines depict the percentage, per bin, of SNPs overlapping the CRE class, ribbons depict confidence interval. Each column of panels depicts a distinct SNP set contrast. (A,D,G) Observed eQTL SNPs (blue) and randomly drawn cis-linked SNPs at expressed genes (red). (B,E,H) eQTL SNPs that replicate in Stranger_LCL () (green) and SNPs that fail to replicate (purple). (C,F,I) CAP_LCL eQTL SNP overlap with CREs derived from the LCL line GM12878 (orange) and HepG2 cells (brown).
(A–I) CAP_LCL eQTL SNP () overlap with predicted cis-regulatory elements. (A–C) eQTL SNP overlap with chromHMM predicted heterochromatin, (D–F) eQTL SNP overlap with chromHMM predicted repressive chromatin, (G–I) eQTL SNP-TSS pairs with an intervening CTCF binding site. In each panel, SNPs are grouped into 25 equally spaced bins within the 50 kb upstream and downstream of the TSS and TES, and 10 bins between the TSS and TES. Each bin is plotted along the x-axis. Bold lines depict bin percentage, ribbons depict confidence interval. Each column of panels depicts a distinct SNP set contrast. (A,D,G) Observed eQTL SNPs (blue) and randomly drawn cis-linked SNPs at expressed genes(red). (B,E,H) eQTL SNPs that replicate in Stranger_LCL () (green) and SNPs that fail to replicate (purple). (C,F,I) CAP_LCL eQTL SNP overlap with CREs derived from the LCL line GM12878 (orange) and HepG2 cells (brown).
We further extend these observations by demonstrating that LCL eQTL SNPs are associated (, quantified by Equation (2)) with LCL derived CRE data sets, liver eQTL SNPs are associated with HepG2 derived CRE data sets, and cerebellum eQTL SNPs are associated with cerebellum derived CRE data set (Figures S12, S13). Almost universally, eQTL SNPs are enriched within regions of activating CREs (Figure 3; Tables S4, S5, S6) and depleted within repressive CREs (Figure 4, Figures S12, S13, S14, Tables S4, S5, S6). LCL eQTL SNPs are enriched within activating CREs while being depleted within repressive CREs (Fisher's exact test ). Liver eQTL SNPs display a similar enrichment within activating CREs and depletion within repressive CREs (Fisher's exact test ).
The pattern of eQTL SNP-CRE enrichment displays significant spatial structure and is typically consistent with the known biology of the CRE (Figures 3–4, Figure S14). For example, eQTL-CRE enrichment peaks immediately adjacent to the TSS for several classes of activating CREs, including H3K4me3 and H2A.Z. Alternatively, eQTL enrichment increases throughout the gene body within H3K36me3 domains, and is more uniformly distributed within H3K4me1 domains. In contrast, we find that eQTL SNP overlap with heterochromatin, repressive chromatin, or repetitive regions is typically most highly depleted through the gene body (e.g., Figures 4A and D). Similarly, we find that intervening CTCF sites are most depleted immediately upstream of the gene TSS, but the decay of this depletion is intriguingly asymmetrical about the TSS (Figure 4G).
Tiers 2–4 eQTL SNPs are themselves also associated with numerous CRE classes. For example, primary and secondary CAP_LCL eQTL SNPs are associated with and LCL CRE classes, respectively (Table S7; multivariate logistic regression (Equation (4)), controlling for distance to TSS and gene expression levels, ). Interestingly, CTCF binding sites are significantly enriched between primary and secondary eQTL SNPs (Figure S15; multivariate logistic regression (Equation (5)), controlling for inter-SNP distance, SNP-TSS distance, and the presence of intervening TSSs and recombination hotspots, ). Independently associated primary and secondary eQTL SNPs separated by less than kb are more than twice as likely to have a CTCF binding site as similarly spaced background cis-SNPs ( versus ). CTCF binding sites are enriched between alternative promoters in human  and Drosophila melanogaster , supporting the hypothesis that CTCF binding sites frequently demarcate independent CREs for the same gene. These observations, combined with the replication rates of eQTL SNPs in tiers 2–4 (Figure S11), suggest that eQTL SNPs independently associated with the same gene frequently tag SNPs affecting the biochemical function of distinct CREs that, in turn, independently regulate transcription, rather than SNPs tagging the same causal regulatory variant.
eQTL-regulatory element overlap is frequently cell type specific
Previous investigations have suggested several plausible mechanisms underlying the cell type specificity of eQTLs , , . For example, given the known cell type specificity of regulatory protein binding sites and local chromatin environment , , , it is plausible that a SNP that disrupts a TFBS would have different downstream effects if it were found within a region of open, activating chromatin as opposed to a region of repressive chromatin. Although the current resolution of CRE and tag eQTL SNP data sets make this hypothesis difficult to test directly for individual SNPs, we sought to quantify the frequency, in aggregate, with which eQTL SNPs overlap CREs that differ between cell types using our identified eQTLs and cell specific CRE data from the ENCODE project.
We assessed the cell type specificity of eQTL-CRE overlap by quantifying the fraction of eQTL SNPs overlapping a CRE derived from the same cell type, relative to the fraction overlapping a CRE derived from a second cell type. When the frequency of overlap between an eQTL SNP set and a matched and unmatched cell type CRE is significantly different (McNemar's test ), we consider the overlap to be cell specific, and the eQTL SNPs to be differentially represented in that CRE type. LCL and liver eQTL SNPs were tested for overlap with CRE data sets available from both LCLs and HepG2 cells, while cerebellum eQTLs were tested for overlap with a single pair of cerebellum and LCL CRE data sets. We find that eQTL SNP-CRE overlap is frequently cell specific (see Figures 3–5 for examples, Tables S4, S5, S6 for full results). The observed specificity of eQTL-CRE overlap is not dependent on analysis thresholds and is recapitulated in bivariate analyses of eQTL data sets (Figure S10).
Percentage (dots) and confidence interval (lines) of (A) CAP_LCL, (B) UChicago_liver, and (C) Harvard_cerebellum eQTL SNPs overlapping DHS sites (y-axis) derived from the LCL cell line GM12878 (red), the HepG2 cell line (blue), and the cerebellum (green).
LCL and liver eQTL SNPs are differentially represented (McNemar's test ) in and CRE data types, respectively; moreover, cerebellum eQTLs are over-represented in cerebellum derived DHS sites relative to LCL DHS sites (Table S6). For example, and of CAP_LCL eQTL SNPs overlap LCL and HepG2 derived chromHMM strong enhancers, respectively (McNemar's test ). Notably, the canonical biochemical function of the CRE class is predictive of the pattern of cell type specific eQTL-CRE overlap. eQTL SNPs are more likely to overlap activating CREs and less likely to overlap repressive CREs derived from the same cell type as the eQTL discovery data (repeated measures logistic regression ; Tables S4, S5, S6). Consistent with previous observations , , , the proportion of eQTL SNP - TSS pairs with intervening insulators is remarkably consistent across cell types, suggesting that CTCF binding sites do not substantially affect cell-specific eQTL function (Figure 4I, Figure S16, Tables S4, S5, S6).
Genetic architecture of cell type specific eQTLs
We next examined the hypothesis that the CRE landscape is a major determinant of eQTL specificity, and found that eQTL SNPs that overlap cell type specific CREs are more likely to be cell type specific than are eQTL SNPs that overlap shared CREs (Fisher's exact test ) for , , and CRE data sets in LCLs, liver, and cerebellum, respectively (Tables S4, S5, S6). For example, LCL eQTLs are significantly more likely to be cell type specific (i.e., replicate in an independent cohort of LCLs but fail to replicate in the liver) when they overlap an LCL-derived p300 binding site, but do not overlap a HepG2-derived p300 binding site (Fisher's exact test ). To illustrate a specific example, we examined the pattern of cell specific eQTL SNP-CRE overlap at the SORT1 locus, a well characterized myocardial infarction risk locus (Figure 6). Consistent with previous observations , we find a liver specific eQTL association approximately kb downstream of the SORT1 gene, which overlaps a cluster of predicted enhancers that are present in HepG2 cells but not LCLs.
Associations between (A) UChicago_liver and (B) CAP_LCL SORT1 expression and cis-linked SNPs (left y-axis; ), plotted as points by SNP genomic coordinates (x-axis). Blue line overlaying the manhattan plot is the estimate of the local recombination rate (right y-axis; cM/Mb). Points are colored by level of LD (see legend below) with the reference SNP (purple diamond). Below each manhattan plot are boxes depicting the location of chromHMM predicted promoters (red), enhancers (orange), and insulators (blue). Below CRE predictions are RefSeq gene models.
Prediction of eQTL replication across cell types
Given the association between cell type specific eQTLs and cell type specific cis-regulatory elements, we sought to test our ability to use CRE data in conjunction with genomic location information to predict the cell type specificity of eQTLs. We trained a random forest classifier on a large set of SNP features, including SNP position, effect size, cell type specific CRE overlap, and non-cell type specific genomic elements (CRE features listed in Table S3) to predict whether each eQTL SNP association would replicate in a second study or not (i.e., binary class). The classifier accurately predicts within cell type eQTL replication, between cell type eQTL replication, and cell type specific eQTL replication. We validated the classifier with -fold cross validation and demonstrated that its accuracy is dependent upon the inclusion of cell specific CRE data (Figure 7, Figure S17, Table 3). For example, the area under the ROC curves (AUC) for within LCL replication, between LCL and liver replication, and LCL specific replication were , , and , respectively. We believe that the predictive performance of the random forest on this problem reflects the fact that random forests are capable of capturing interactions among the features. Given a broad collection of chromatin state and regulatory factor binding site data sets, such as is available for a large number of cell types in the ENCODE project database, it is now possible to predict whether a given eQTL association exists in a different, specific cell type, in the absence of eQTL data from the second cell type.
ROC curves depicting the performance of a random forest classifier to predict within cell type reproducibility (red), between cell type reproducibility (blue), and within cell type specific reproducibility (green). Predictions plotted separately for (A) LCL/LCL/Liver, (B)Liver/Liver/LCL, and (C) Brain/Brain/LCL. The classifier was trained on a diverse collection of CREs (see Methods and Supplement for complete data set description). True positive rates (y-axis) and false positive rates (x-axis) were quantified by ten fold cross validation.
We further quantified the contribution of each feature to prediction accuracy (see Methods). Across all training sets, eQTL discovery significance, SNP to TSS distance, and discovery cell type gene expression level contribute substantially to prediction accuracy (Table S8). Consistent with the relative cell type specificity of CTCF binding sites and chromatin marks discussed above, CREs vary considerably in the degree to which they are useful in predicting within or between cell type replication. Cell type specific differences in activating and repressive chromatin states overlapping eQTLs contribute substantially to the accuracy of predictions of between cell type eQTL replication and cell type specific replication. In contrast, CTCF binding site data are less informative for predicting cell specific eQTL replication, as might be expected considering its function appears less cell type specific.
Integration of CREs and eQTLs for GWAS interpretation
While the use of eQTL data to aid in the interpretation of GWAS results has proven extremely useful , , , , these data are most useful when they are derived from the cell type that is most relevant to the disease of interest. Because eQTL data exist from a limited number of cell types, they often have limited utility for annotating the cell type of interest. We investigated whether cell specific CRE overlap would be useful to predict whether disease associated SNPs function as eQTLs in cell types lacking eQTL data. We extracted from the NHGRI GWAS catalog  SNPs associated with diseases arising from breast, kidney, or lung tissue. We trained a random forest classifier to differentiate eQTL SNPs from non-eQTL SNPs in liver and LCLs using as features HepG2 and LCL CRE-SNP overlap (see Methods). We then applied this classifier to each GWAS SNP set, using CRE data from HMEC, NHEK, and NHLF cells to model breast, kidney, and lung function, respectively, to see if there was an enrichment for predicted eQTL SNPs when the CRE data represented the cell type of interest. These predictions were compared with predictions using LCL and HepG2 CREs as features (Figure S18). In each case, CREs from the disease-relevant cell type are more likely to predict that GWAS SNPs function as eQTLs (Wilcoxon signed rank tests: breast , kidney , lung ). Given the hypothesis that a substantial number of GWAS SNPs function by modifying gene expression, these results support the hypothesis that integrated eQTL-CRE modeling can aid in the annotation of GWAS results arising from a wide variety of cell types.
The integrative analyses presented here provide new insights into the patterns of cis-eQTL replication within and between cell types, while controlling for biological and technical variation. Several notable results emerge from our analyses. We demonstrate that eQTLs are more likely to overlap activating CREs and less likely to overlap repressive CREs when they are ascertained from the same cell type versus different cell types. Cis-eQTL SNPs overlapping most classes of activating cis-regulatory elements are significantly more likely to replicate in independent studies. Conversely, eQTL SNPs that overlap repetitive or repressive chromatin states and eQTL SNP-gene pairs that are intersected by insulators are significantly less reproducible. Cis-eQTL SNP-CRE overlap is also significantly more predictive of eQTL reproducibility when the CRE data are derived from the same cell type as the gene expression data. Furthermore, eQTL SNPs that overlap cell type specific CREs are significantly enriched for cell type specific eQTLs, suggesting specific regulatory mechanisms for those cell type specific eQTL associations. The observed relationship between eQTL SNP reproducibility and CRE overlap led us to test the hypothesis that SNP-CRE overlap could be used to predict the cell type specificity of eQTLs in the absence of additional gene expression or genotype data. While we see room for substantial improvement, we believe that the successful validation of this hypothesis with a random forest classifier will enable improved interpretation of genome-wide association study results.
After a GWAS is performed, it is now common practice to search eQTL databases to determine whether SNPs of interest are eQTLs for the cell type relevant to the study phenotype. When the SNPs are not known eQTLs in the cell type of interest, typically the line of reasoning is dropped; however, it is possible that the specific cell type was not tested, that the relevant SNP-gene pair was not interrogated, or that the sample size was too small for a cell type specific eQTL study to substantiate the SNP as an eQTL. Instead, if the researcher finds that the SNP is an eQTL in an alternative cell type, our classifier can be applied to determine the likelihood of the SNP being an eQTL in the cell type of interest when there are CRE data available for the relevant (or related) cell type. Furthermore, the genomic location can be scrutinized relative to known CREs to identify specific CRE types that may explain the mechanism by which gene transcription and downstream phenotype are regulated. Conversely, given those same GWAS hits, using these predictions one might be able to identify the physiologically relevant cell type based on overlap with (predicted or known) cell specific eQTLs.
Genotype data were downloaded from public databases or individual investigators as summarized in Table 1. Genotype and quality control filtering was performed with plink . Individuals with a call rate of less than were removed. SNPs with a call rate of less than were classified as missing and later imputed. SNPs deviating from HWE were removed ().
Merck_liver study genotypes were previously imputed by MACH ; we extracted from the full set of SNPs only those that were unimputed (and we removed all of the imputed genotypes from individuals in the non-imputed SNPs) to represent the original genotyping data. This set we filtered using identical criteria as above.
The HapMap phase 2 individuals fully sequenced genotypes were downloaded from the Impute2 website ; we matched the genotypes to individuals by comparing individual SNPs to genotypes from the indexed individuals. We filtered these genotypes as above. We removed ungenotyped individuals in the Harvard study, leaving us with individuals with cerebellum tissue data, individuals with prefrontal cortex tissue data, and individuals with visual cortex tissue data.
Genotypes were imputed using BIMBAM . We imputed genotypes up to the HapMap phase 2 CEPH SNP set. BIMBAM removes SNPs with a minor allele frequency (MAF) less than or missing SNPs by default. For the studies with Caucasian only participants, we used only the unrelated CEPH individuals for imputation. For the UChicago_liver study, which has African American individuals (of subjects total), we used both the CEPH and the unrelated YRI individuals as a reference set. For the Stranger_LCL study on a subset of HapMap phase 2 individuals, we used the CEPH, YRI, and the unrelated JPT and CHB individuals' genotypes and did not impute. Mean values for the imputed genotypes were used for association and other downstream analyses .
Gene expression preparation, normalization, and processing
For each expression array platform, probe sequences were aligned to the human reference genome (hg18) and the RefSeq transcript set. Probes with only one genomic alignment with identity to the reference genome, over the full length of the probe, were considered to be uniquely aligned. Probes that failed to align to the genome but did have at least one alignment with at least identity to a RefSeq transcript were further considered to be adequately aligned. All other probes were removed from further analyses. Based on genomic alignment coordinates and RefSeq gene annotations, each aligned probe was assigned to a RefSeq gene. We further searched the genomic locations of each probe alignment for the presence of common polymorphisms, as defined by dbSNP131 and the One Thousand Genomes Project (8/4/2010 release ).
Where appropriate, we defined a lower expression level boundary, above which we considered a gene to be expressed. Genes falling below the expression threshold were removed from further analyses. Low expression thresholds were defined either on the basis of negative control probes, exogenous RNA spike in probes, or the observed relationship between probe mean expression and variance.
Gene expression data from each study were prepared independently as follows. Poorly extracted, non-uniform, outlier, or other flagged features were treated as missing data. Where appropriate, background signal intensities were subtracted. Negative adjusted intensities were set to one half the minimum positive value on the array. Background corrected intensities were transformed. Missing data were imputed using the -nearest neighbors algorithm (), as implemented in the R package impute . Each array was quantile transformed to the overall average empirical distribution across all arrays. Across all arrays within a study, each probe's expression values were transformed to the quantiles of the standard normal distribution (or quantile normalized). Transformation to standard normal avoids potential problems due to outliers or other deviations from normality in later association tests . We controlled for known and unknown sources of non-genetic variation by correcting these data using their principle components (PCs), identified using the R function pca from the R package pcaMethods , . For each matrix of gene expression data, we computed the percentage of variance explained (PVE) for each PC, which is a monotone decreasing function, and controlled for PCs until the difference in the PVE by the subsequent PC was (Table S1). We jointly controlled for the PCs in the gene expression data by taking the residuals from a linear model with these PCs as covariates. For one data set with high quality covariate annotation (UChicago_liver), we have have quantified the correlation between each PC and available covariates (Figure S1), demonstrating that the PCs extracted are capable of capturing heterogeneity arising from such confounding variables. Finally, we quantile normalized these residuals within each probe and used these normalized data in our subsequent analyses.
For genes with multiple probes on a single array, we used the R package mclust  to cluster the probes samples matrix of expression levels. We allowed up to clusters per gene. Within each probe cluster, we used the per individual, PC corrected mean of the different probes as a proxy for the gene expression level for that collection of probes. Each probe cluster was modeled downstream independently, under the assumption that uncorrelated probe sets represent either independent transcript isoforms or poorly performing probes.
Two studies included non-European individuals: Stranger_LCL included individuals with European, Yoruban, and East Asian ancestry, and UChicago_liver included individuals with European ancestry and African Americans. To minimize the effect that differences in allele frequencies between populations may have on false positive discoveries and false negative replication, we analyzed these data with a modified protocol. Within a multi-population study, the distribution of gene expression measurements from each array was normalized to the average distribution of expression measurements across all arrays. For each population separately, each probe was then quantile normalized and measurements across populations were pooled . When assessing replication across studies, we require that the SNP has a MAF in both studies. Therefore alleles that are polymorphic in one population but fixed in another will not result in false negative replication. We note that replication frequencies are similar between mixed population and single population studies (Figures S5, S6). Furthermore, cross study replication is not substantially affected by differences in allele frequency across studies (Figure S8).
We used Bayesian regression, as implemented in BIMBAM ,  to quantify the association between each SNP and residual gene expression data for each gene across each sample from each study. We used default parameters, which average over different plausible effect sizes for additive and dominant models. We used the mean imputed genotype for all studies except Stranger_LCLs, in which we used assayed genotypes for each individual because the individuals did not require imputation.
Multiple cell type eQTL mapping
We used multi-trait Bayesian linear regression models to jointly test for eQTLs with specific models of differential association in paired samples , . We computed the Bayes factors for five models as compared with the null model (no eQTL in either cell type):
- stable model, where the eQTL is associated with gene expression in both cell types,
- A not B model, where the eQTL is associated with gene expression in one of the two cell types,
- B not A model, where the eQTL is associated with gene expression in one of the two cell types,
- A and B model, where the eQTL is associated with gene expression differently in the two cell types, and
- A opposite B model, where the eQTL is associated with gene expression in both cell types, but the direction of the effect is opposite.
To combine these individual Bayes factors, we calculated an integrated Bayes factor (iBF) summarizing evidence for a differential eQTL as:(1)eQTLs with (i.e., there was more evidence supporting a differential eQTL than a stable eQTL) were further categorized by the specific model with the largest BF, although we note that the power to identify each of the models is different in this construction. Further analysis was conducted only on eQTLs with .
We identified the SNP with the largest for each gene, and also identified the cis-SNP with the largest in each LD block around a gene. We defined LD blocks using the HapMap recombination rates  in which each SNP interval with cM/Mb defines the boundaries of an LD block. For each of these associations, we note the chromosome and location of the gene and SNP, major and minor allele of the SNP, the LD block index, the MAF of the SNP in the sample population, the value of the fit of the linear model between the imputed values for the SNP and the rounded values of the SNP, the magnitude and direction of the association () and the value of the fit of the linear model, the number of exons and the average length of the gene exons, number of probes corresponding to the probe cluster, mean gene expression value for those probes, and the for this association. We determined the maximum MAF for all SNPs overlapping expression array probe alignment coordinates using One Thousand Genomes Project data and dbSNP131. We note that non-replication within cell types is not driven by differences of MAF between the studies (Figure S8 is a cross-study eQTL MAF comparison). For all downstream comparisons between studies, we considered only expressed gene-SNP pairs in common between the two studies.
Evaluating FDR by permutation
To evaluate the FDR for each study, we permuted the sample indices on the gene expression data identically across genes within each study, and ran association mapping on these permuted data. Then, for each cutoff , we conservatively computed FDR by the number of associations identified in the original data at that cutoff divided by the number of associations identified in the permuted data at that cutoff (Table S9). We performed a single permutation for each study because of the resources required to run a single complete permuted association test. Unless otherwise noted, eQTL results in the text refer to associations significant at .
Multivariate analysis for allelic heterogeneity
Given the computational requirements of performing conditional QTL scans in eleven studies, we implemented a two step approach to identify independently associated SNPs at each gene. For each gene probe cluster, we identified the most highly associated SNP in each LD block within Mb of the gene's transcription start site (TSS) or transcription end site (TES). We subsequently recomputed the BF of each SNP association by Bayesian multivariate regression to quantify the conditional independence of the effects of the associated SNPs . Finally we took the union of the identified SNPs from all probe clusters for a single gene; when this set had more than one SNP below the appropriate cutoff, the gene is said to exhibit allelic heterogeneity.
We compared this approach to forward stepwise regression for a subset of CAP_LCL genes with significant allelic heterogeneity ( genes; ) , . We implemented, in R, a custom forward stepwise regression with all SNPs within 1 Mb of a gene's TSS and TES to identify the set of independently associated SNPs. In particular, starting from the model with a gene probe cluster as the response variable and no covariates, we included a SNP in the model when it most improved the Bayesian Information Criterion (BIC) of the fitted model. The BIC is a criterion for model selection that takes into account both the likelihood of the model and the number of parameters (here, eQTL SNPs), so as to avoid excessive parameters that may result in overfitting. We continued to identify a single SNP that most improves the BIC and included that SNP in the model until a SNP resulted in a non-improvement of the BIC, when we stopped. As in the LD-based method, we took the union of the identified SNPs from all probe clusters for a single gene. Although forward selection is not exhaustive, it is tractable, and these results represent a more complete (but still conservative) estimate of the allelic heterogeneity for a particular gene relative to our method of identification (for full results comparison, see Table S10). In particular, we note that this method is an approximation in the presence of interactions , but we consider this case to be beyond the scope of this analysis. Our LD-based allelic heterogeneity ascertainment generally appears to underestimate the number of independently associated SNPs per gene, however this may be a result of the thresholds for SNP inclusion between the two methods being different (Figure S19, Table S10).
Analysis of Gene Ontology annotation enrichment
We performed Gene Ontology (GO)  enrichment analyses using DAVID software . We considered GO Biological Process ontology, Molecular Function ontology, and KEGG pathway  enrichment terms with (Table S11). We considered CAP_LCL, UChicago_liver, and Harvard_cerebellum as eQTL discovery data sets, and looked for enrichment among genes with multiple independent eQTLs in these data, and also looked for enrichment among genes that did and did not replicate within cell type and between cell type studies as above.
A SNP-gene association was considered ‘replicated’ if the in the target study. Only SNP-gene pairs that were tested in a given replication cohort were considered when calculating the replication frequency in that study. In other words, if a SNP failed quality control or if a gene was not represented on a particular gene expression microarray platform, the SNP-gene pair was not considered when calculating replication. Similar to previous observations (see Figure S10H in ) we do not observe a substantial effect of cross-study allele frequency differences on eQTL replication (Figure S8).
Comparison between eQTLs and functional genomic data sets
When available, CRE data from the CEU HapMap LCL line GM12878, the hepatocellular carcinoma HepG2 cell line, and primary cerebellum tissue were used to represent LCLs, liver, and cerebellum respectively. These data sets (fully listed in Table S3) were all downloaded directly from the ENCODE data coordination center at UCSC. Cell type independent data sets were also used, including CpG islands , GERP evolutionary constrained elements , clustered DNAse hypersensitive sites, and clustered transcription factor binding sites. A second chromatin structure segmentation (Segway ) data set, derived from the integration of K562 data sets, was also included in the regulatory element feature set. When available, precomputed ‘peak calls’ were used. If more than one replicate (i.e., peak calls generated in the same lab using the same antibody), was available, peak calls were merged by taking the union of all elements.
Given that a putative eQTL SNP does not necessarily represent the causal genetic variant, but rather is a ‘tag’ for the causal variant in substantial linkage disequilibrium, we classified a SNP as overlapping a given genomic element if it was either contained within the element or found within 500 bp of the element boundary. CTCF sites were classified as eQTL interrupting if the midpoint of the CTCF binding site was between the eQTL SNP and the TSS of the associated gene.
To test if eQTL SNPs are enriched within each class of putative cis-regulatory element, we first modeled the background distribution of cis-linked SNPs as follows. The background set of SNPs was selected to match the set of SNPs that were tested for eQTL association. All CEU HapMap phase 2 SNPs that lie within 1 Mb of a RefSeq gene model TSS or TES, or lie within a RefSeq gene model were included. Each such SNP that was cis-linked to more than one RefSeq gene model was randomly assigned to one such gene. As with the eQTL SNP set, the background SNP set was tested for overlap with any element in each genomic feature class.
For each eQTL or background SNP (), we modeled the probability of cis-regulatory element overlap () by logistic regression:(2)In this equation, we controlled for the effects of SNP position () and the expression level of the associated gene (), in order to quantify the difference in overlap frequency between observed eQTL SNPs and those drawn from the background SNP distribution (denoted by the indicator variable ). This and other logistic regression models were run using the R glm function. To assess the significance of the effect of each covariate, we computed a Z-score and, from that, a two-tailed p-value by comparison to the normal distribution.
Similarly, for each eQTL SNP (), we model the effect of cis-regulatory element overlap () on the probability of within cell type replication () by logistic regression:(3)In this equation, we controlled for the effects of SNP position and the significance of the eQTL association (, as assessed by Bayesian multivariate regression) and the tier of the associated SNP (), in order to quantify the enrichment of within cell type replicating eQTLs in cis-regulatory elements.
We tested for an enrichment of second tier SNPs within CRE data sets, relative to the expectation from the background distribution of cis-linked SNPs, by modeling, for each SNP (), the probability of cis-regulatory element overlap () by logistic regression:(4)where we quantify the difference in overlap frequency between background SNPs, first tier SNPs, and second tier SNPs with the indicator variable . Enrichment of second tier SNPs, relative to background, was quantified by testing for the significance of the difference between the estimates.
To test for an enrichment of CTCF sites between SNPs independently associated with the same gene expression trait (i.e., SNPs tagging the allelic heterogeneity at a given locus), we used the background distribution of cis-linked SNPs as defined above; however, for each RefSeq gene, two cis-linked SNPs were selected at random. This collection of background SNP pairs was contrasted with the first and second tier eQTLs at all genes for which the secondary eQTL was significant at a threshold. For each SNP pair (), we modeled the probability of the presence of an intervening insulator () by logistic regression:(5)In this equation, we controlled for the inter-SNP distance (), SNP pair position relative to the gene TSS (, in which indexes each SNP in the pair), the presence of an intervening recombination hotspot (), and the presence of an intervening TSS (), in order to quantify the difference in insulator frequency between observed eQTL SNP pairs and those drawn from the background SNP distribution (denoted by the indicator variable ). We note that because the majority of SNP pairs that are independently associated with a gene expression trait span a recombination hotspot, we have attempted to control for any bias this may introduce by including the presence of such a hotspot in the model.
Predicting eQTLs with random forests
We built a classifier to predict within cell type, between cell type, and cell type specific replication using random forests . A random forest is an ensemble classifier that uses the mode of predictions from a large number of decision trees built by bootstrapping the training set. For each pair of comparisons, we can train a random forest classifier to predict the binary replication outcome given a set of features about the genomic location of the eQTL and the CREs for the corresponding cell types (see Table S3 for complete set). We used the random forest classifier and computed variable importance using the R randomForest package . We performed -fold cross validation to compute generalization error. We used the R package ROCR  to build the ROC curves and compute the AUC. We also computed accuracy as .
To predict the function of GWAS SNPs from cell types lacking eQTL data, we built a random forest classifier trained on cell specific CRE data. We first trained the classifier to predict whether a SNP is an eQTL or not an eQTL in LCLs and liver by selecting the most highly associated SNP for each gene () in these two cell types and selecting a background set of SNPs that were not associated with gene expression in any study (). We selected a reduced set of CRE data (chromHMM segments, H2A.Z, EZH2, DHS, CTCF) that were available for LCLs, HepG2, NHEK, NHLF, and HMEC cells. We initially trained the classifier using as features the CRE data from LCLs and HepG2 to predict whether a SNP is an eQTL in LCLs and liver. After training, the model was applied to sets of GWAS SNPs that were associated with phenotypes with relevance to breast, kidney, and lung (hand annotated from the NHGRI GWAS catalog  downloaded on January 10, 2013). In particular, the classifier was used to predict the probability that each GWAS SNP was an eQTL or not, using as features CRE data from the matched cell type (HMEC for breast, NHEK for kidney, NHLF for lung) or from an unmatched cell type (LCL or HepG2). Distributions of eQTL probabilities between matched and unmatched cell types were compared by Wilcoxon's signed rank test.
Additional statistical analyses
tests of unpaired categorical data were quantified by Fisher's exact test. tests of paired categorical data were quantified by McNemar's test. Tests of paired interval data were quantified by Wilcoxon's signed rank test.
Sample principal components (PCs) capture many known study covariates in UChicago_liver study. Heat map indicating the absolute Pearson's correlation between sample covariates (x-axis) and the first twenty PCs (y-axis). Age, for example, is captured well by the PC and smoking status is captured best by the PC. Sample ID appears correlated with the PC because the IDs were ordered by when the sample was processed, which is well correlated with batch and other known gene expression confounders.
Identification of allelic heterogeneity. Univariate (x-axis) versus multivariate (y-axis) as a function of linkage disequilibrium (; color as indicated in scale bar) between the primary and secondary SNP. All tier two SNPs are plotted for each study independently, in each panel, as labeled at top. Note SNPs in higher LD with the primary SNP (‘bluer’ points) tend to produce greater drops from the univariate .
Distribution of eQTL effect sizes by study and SNP tier. Violin plots of the distribution of effect sizes for significant () eQTLs. Data are plotted separately for each study, as indicated by plot color and sample labels at left. SNPs from each tier are plotted separately in each facet, as labeled at top. Note studies are ordered on the y-axis by sample size.
eQTL effect size and significance by tier. eQTL significance (; y-axis) as a function of SNP effect size (x-axis) and SNP tier. Data from each study and tier are plotted in independent facets, as labeled at top and right, respectively. Each eQTL SNP set was independently thresholded at by permutation.
eQTL replication frequencies, by observation significance, across all cohorts. eQTL replication (y-axis), as a function of discovery significance (x-axis: ). Each column of facets depicts the set of eQTL SNPs discovered in each of 11 studies, as labeled at top. Within each column, replication frequencies are plotted separately for each replication study set, in rows and in a different color, as labeled at the right. SNPs are binned along the x-axis into 30 equally spaced intervals. Per bin replication frequencies are plotted as bold lines, 95% confidence intervals are plotted as ribbons.
eQTL replication frequencies, by SNP position, across all cohorts. eQTL replication (y-axis), as a function of SNP position (x-axis; ). Each column of facets depicts the set of eQTL SNPs discovered in each of 11 studies, as labeled at top. Within each column, replication frequencies are plotted separately for each replication study set, in rows and in a different color, as labeled at right. SNPs are binned along the x-axis into 30 equally spaced intervals. Per bin replication frequencies are plotted as bold lines, confidence intervals are plotted as ribbons.
eQTL replication by SNP to TSS distance, conditioned on significance. Within cell type eQTL replication (y-axis), as a function of absolute SNP to TSS distance (x-axis). Replication frequencies are displayed separately for eQTLs with (red), (blue), and (green). SNPs have been binned along the x-axis into 15 equally spaced intervals. eQTL SNP-CRE overlaps per bin are plotted as bold lines, confidence intervals are plotted as ribbons. Each facet depicts a different study set comparison (labeled at top).
Minor allele frequency differences and replication. Minor allele frequencies in the two different populations of replicating (left column) and non-replicating (right column) eQTLs. The size of the point corresponds to the of the eQTL in CAP_LCL; the color represents the in the replication study, where blue is higher and grey (in the right-hand column) is . Top row: MAFs compared between eQTLs discovered in CAP_LCL (x-axis) that replicate (left) or fail to replicate (right) in HM2_LCL (y-axis). Bottom row: MAFs compared between eQTLs discovered in CAP_LCL (x-axis) that replicate in HM2_LCL and that replicate (left) or fail to replicate (right) in Merck_liver.
Extended eQTL replication plots. eQTL replication (y-axis), as a function of discovery significance (x-axis; left facet set) or absolute SNP to TSS distance (x-axis; right facet set; thresholded at ), as indicated at the bottom. Replication frequencies are displayed separately for within cell type (red), between cell type (blue), within but not between cell type (i.e., cell specific; green), and within and between replication (purple). SNPs have been binned along the x-axis in 30 equally spaced intervals. eQTL SNP-CRE overlaps per bin are plotted as bold lines, confidence intervals are plotted as ribbons. Each column of facets depicts a different study set comparison, as labeled at the top, for LCLs (left column), liver (middle column), and brain (right column).
Bivariate Bayesian regression recapitulates results of post hoc comparisons. eQTL SNPs were classified by their best fitting model (denoted at left and by color). Results from bivariate cis-eQTL mapping are on top, results from post hoc models are on bottom. (A) Overlap of SNPs in each class with cerebellum DHS sites. Points denote overlap percentage, lines denote CI. (B) Distribution of (x-axis) for eQTL SNPs in each model class. Box plots denote median, inter-quartile range, and CI. (C) Distribution of absolute distances (x-axis; scale) between each SNP and its associated gene's TSS.
eQTL replication conditional on SNP tier. eQTL replication (y-axis), as a function of discovery significance (x-axis; ; left set of panels) or absolute SNP to TSS distance (x-axis, right set of panels). Replication frequencies are displayed separately for tier 1 (red) and tiers 2–4 (blue). SNPs have been binned along the x-axis into 30 equally spaced intervals. eQTL SNP-CRE overlaps per bin are plotted as bold lines, confidence intervals are plotted as ribbons. Each panel depicts a different study set comparison, as labeled at the top of each panel. Within cell type and between cell type replication frequencies are plotted along the left and right columns, respectively, for LCLs (top row), liver (middle row), and brain (right row).
Overlap of full data set of ENCODE LCL CREs with CAP_LCL eQTLs. CAP_LCL eQTL SNP-CRE overlap (y-axis; tier means as points, confidence interval as lines), with SNPs from each tier plotted in separate panels, as indicated at right. Each CRE class is plotted in a separate panel, as labeled at top. As applicable, within each panel, overlaps are plotted separately for LCL CREs (red circles) and HepG2 CREs (blue triangles). In cases where multiple CRE data sets exist for the same CRE class (e.g., H3K27me3 marks), overlaps were calculated from each data set independently and over plotted, with a jitter.
Overlap of full data set of ENCODE HepG2 CREs with UChicago_liver eQTLs. UChicago_liver eQTL SNP-CRE overlap (y-axis; tier means as points, confidence interval as lines), with SNPs from each tier plotted in separate panels, as indicated at right. Each CRE class is plotted in a separate panel, as labeled at top. As applicable, within each panel, overlaps are plotted separately for HepGe CREs (red circles) and LCL CREs (blue triangles). In cases where multiple CRE data sets exist for the same CRE class (e.g., H3K27me3 marks), overlaps were calculated from each data set independently and over plotted, with a jitter.
CAP_LCL eQTL associations with 142 LCL derived CRE data sets. CAP_LCL eQTL SNP () overlap with predicted cis-regulatory elements. Each panel depicts overlap with distinct CRE data sets, as labeled at top. In each panel, SNPs are grouped into 25 equally spaced bins within the 50 kb upstream and downstream of the TSS and TES, and 10 bins between the TSS and TES. Each bin is plotted along the x-axis. Bold lines depict the percentage, per bin, of SNPs overlapping the CRE class, ribbons depict confidence interval. Observed eQTL SNPs are plotted in blue and randomly drawn cis-linked SNPs at expressed genes in red.
CTCF binding sites are enriched between SNPs independently associated with the same gene expression trait. Percentage of primary and secondary CAP_LCL (red) and Stranger_LCL (blue) LCL eQTL SNP pairs that have an intervening CRE (y-axis; SNP pairs were binned by the distance between them, bold lines depict bin frequency, ribbons depict confidence interval) as a function of the absolute distance between the SNPs (x-axis). Randomly drawn cis-linked SNPs are displayed in green. Each panel depicts the analysis of a different CRE data set, as labeled at top, including CTCF, SMC3, and Rad21, which have each been shown to mark enhancer blocking insulators , DHS sites, which promiscuously mark insulators and other CRE classes, chromHMM defined heterochromatin regions, and p300 sites.
Overlap of ENCODE data sets across cell types. Overlap between CRE data sets (y-axis), as a function of absolute CRE to TSS distance (x-axis). Data are plotted separately for comparisons between two different LCL lines (red; e.g., between GM12878 and GM060990) and between LCLs (GM12878) and HepG2 cells (blue). CREs are binned into 30 equally spaced intervals along the x-axis. Per bin CRE overlaps are plotted as bold lines, confidence intervals are plotted as ribbons. Data are plotted separately for each of six different CRE types, as labeled at the top of each panel. Note the striking difference between the cell specificity of CTCF binding sites and each other CRE class.
CRE data improve accuracy of cell specific eQTL prediction. ROC curves depicting the performance of random forest classifiers to predict within cell type specific reproducibility, either trained with CRE data (red) or without CRE data (blue). Facets depict predictions for LCL, liver, and brain eQTL SNPs (labeled at top). True positive rates (y-axis) and false positive rates (x-axis) were quantified by tenfold cross validation. AUCs from models with and without CRE training, respectively, were 0.67 and 0.61 (LCL), 0.71 and 0.57 (liver), and 0.68 and 0.57 (brain).
Regulatory element overlap predicts GWAS SNP function. Boxplot of the distribution of random forest classifier predictions. A classifier was trained to discriminate LCL and liver eQTL SNPs from non-eQTL SNPs on the basis of SNP overlap with CREs from LCLs and HepG2 cells. The trained classifier was then applied to SNPs associated with phenotypes of relevance to breast, kidney, and lung function (facets labeled at top). The probability that each SNP is an eQTL (y-axis, as box plot) was calculated using CREs from matched and unmatched cell types (listed on the x-axis and color coded). Asterisks denote significant differences in probability distributions (Wilcoxon signed ranks test, ).
Quantification of allelic heterogeneity by forward stepwise regression and an LD-based method. Forward stepwise regression was applied to genes with allelic heterogeneity as identified by the LD-based method described in the text. For each gene, the resulting pair of models are contrasted as follows. At left, multivariate of the FSR model (y-axis) is plotted as function of the multivariate of the LD based model (x-axis). At right, the Bayesian information criterion (BIC) of the FSR model (y-axis) is plotted as function of the BIC of the LD based model (x-axis). Circle size and color depicts the number of SNPs identified by the FSR and LD-based models, respectively (blue = 2 SNPs, green = 3 SNPs, red = 4 SNPs). We note that, in one gene, NINJ1, FSR found fewer independent eQTLs than our LD-based method; interestingly, the tertiary eQTL SNP for this gene had a lower univariate than the conditional ( versus ), implying a possible SNP-SNP interaction. For genes () the two methods found the same numbers of independent eQTL SNPs, and for genes () FSR identified additional eQTL SNPs (Table S9).
Numbers of PCs removed from each study gene expression data. Columns include the study name, the cell type, the number of PCs removed (PCs), and the percentage variance, which refers to the percentage variance explained by the last PC removed.
eQTL SNP replication table. Summary of eQTL findings and replication across all included studies. eQTL SNPs were included in the table if, from a given study, they were the most highly associated cis-linked SNP within an LD block and if their . eQTL SNPs are listed in separate rows and associated data are in columns, as labeled in the first row and denote the following:
- REF STUDY. eQTL discovery study.
- GENE. RefSeq gene identifier.
- CHR. Chromosome of the eQTL SNP and associated gene expression trait.
- STRAND. Annotated strand of the associated transcript.
- TSS. Genomic coordinate of the transcription start site.
- TES. Genomic coordinate of the transcription end site.
- RSID. Identity of the eQTL SNP.
- HS INDEX. Identity of the LD block containing the SNP.
- RSCOORD. Genomic coordinate of the SNP.
- TIER. Tier of the SNP with respect the associated gene.
- UBF. Univariate of the eQTL SNP-gene expression trait association.
- MBF. Multivariate of the eQTL SNP-gene expression trait association, controlling for all SNPs in lower tiers.
- BETA. Estimate of the effect size per minor allele of the SNP.
- GEX. Mean gene expression level across all samples in the discovery study.
- PROBE MAF. Maximum minor allele frequency of SNPs overlapping the coordinates of the gene expression probe.
- NALN. Number of high quality alignments between the gene expression probe and hg18.
- STL, MBR, CLI, CPL, HVC, GCT, GCF, HCE, HPC, MLI, GCL. Each subsequent column denotes the measure of univariate replication between the SNP and gene expression trait in the study set indicated by the TLA, where:
CRE data set details. File name, element summary type, cell type, experiment type, experiment class, experiment name, control type, file format, and two indicators denoting whether the data set was tested for overlap with eQTL SNPs or for intersection between the eQTL SNP and gene TSS (i.e., is an insulator), or both.
CAP_LCL eQTL SNP enrichment/depletion within CREs: (a) relative to background, (b) replicating versus non-replicating, (c) same versus different cell types. For Tables S4, S5, and S6, eQTL SNPs () were tested for overlap with CRE data sets, each listed in a separate row. Columns are labeled in the first row and denote the following:
- CRE type. Name of the tested CRE, including the cell type of origin, as applicable.
- Cell A Overlap. Fraction of eQTL SNPs that overlap a CRE of this class, where the CRE was derived from cell type A (the cell type indicated in column 1).
- Cell B Overlap. Fraction of eQTL SNPs that overlap a CRE of this class, where the CRE was derived from cell type B (LCLs for liver and cerebellum eQTLs, HepG2 cells for LCL eQTLs).
- AB Overlap Test McNemar PV. McNemar's test p-value run on the table where denotes the number of SNPs overlapping a CRE from cell type A, denotes the number of SNPs not overlapping a CRE from cell type A, denotes the number of SNPs overlapping a CRE from cell type B, denotes the number of SNPs not overlapping a CRE from cell type B.
- A Overlap Test Beta. from Eqn. (2), for CREs derived from cell type A.
- A Overlap Test SE. Standard error of the estimate of from Eqn. (2).
- A Overlap Test Z. Z-score of the significance of from Eqn. (2).
- A Overlap Test PV. P-value of the significance of from Eqn. (2).
- A Overlap Test AIC. Akaike information criterion of Eqn. (2).
- B Overlap Test Beta. from Eqn. (2), for CREs derived from cell type B.
- B Overlap Test SE. Standard error of the estimate of from Eqn. (2).
- B Overlap Test Z. Z-score of the significance of from Eqn. (2).
- B Overlap Test PV. P-value of the significance of from Eqn. (2).
- B Overlap Test AIC. Akaike information criterion of Eqn. (2).
- A Overlap Rep OR. Odds ratio derived from a Fisher's exact test performed on the table where denotes the number of within cell type replicating SNPs overlapping a CRE from cell type A, denotes the number of within cell type replicating SNPs Eqn. not overlapping a CRE from cell type A, denotes the number of non-replicating SNPs overlapping a CRE from cell type A, denotes the number of non-replicating SNPs not overlapping a CRE from cell type A.
- A Overlap Rep PV. P-value of the Fisher's exact test above.
- A Overlap Rep Beta. from Eqn. (3), for CREs derived from cell type A.
- A Overlap Rep SE. Standard error of the estimate of from Eqn. (3).
- A Overlap Rep Z. Z-score of the significance of from Eqn. (3).
- A Overlap Rep PV. P-value of the significance of from Eqn. (3).
- A Overlap Rep AIC. Akaike information criterion of Eqn.(3).
- B Overlap Rep OR. Odds ratio derived from a Fisher's exact test performed on the table where denotes the number of within cell type replicating SNPs overlapping a CRE from cell type B, denotes the number of within cell type replicating SNPs not overlapping a CRE from cell type B, denotes the number of non-replicating SNPs overlapping a CRE from cell type B, denotes the number of non-replicating SNPs not overlapping a CRE from cell type B.
- B Overlap Rep PV. P-value of the Fisher's exact test above.
- B Overlap Rep Beta. from Eqn. (3) for CREs derived from cell type B.
- B Overlap Rep SE. Standard error of the estimate of from Eqn. (3).
- B Overlap Rep Z. Z-score of the significance of from Eqn. (3).
- B Overlap Rep PV. P-value of the significance of from Eqn. (3).
- B Overlap Rep AIC. Akaike information criterion of Eqn. (3).
- AB Overlap Rep pos OR. Odds ratio derived from a Fisher's exact test performed on the table where denotes the number of SNPs that replicate in cell type A but not B that overlap a CRE found in cell type A but not B, denotes the number of SNPs that replicate in cell type A but not B that overlap a CRE found in cell type A and B, denotes the number of SNPs that replicate in cell type A and B that overlap a CRE found in cell type A but not B, denotes the number of SNPs that replicate in cell type A and B that overlap a CRE found in cell type A and B.
- AB Overlap Rep pos PV. P-value of the Fisher's exact test above.
- AB Overlap Rep neg OR. Odds ratio derived from a Fisher's exact test performed on the table where denotes the number of SNPs that replicate in cell type A but not B that overlap a CRE found in cell type B but not A, denotes the number of SNPs that replicate in cell type A but not B that overlap a CRE found in cell type A and B, denotes the number of SNPs that replicate in cell type A and B that overlap a CRE found in cell type B but not A, denotes the number of SNPs that replicate in cell type A and B that overlap a CRE found in cell type A and B.
- AB Overlap Rep neg PV. P-value of the Fisher's exact test above
UChicago_liver eQTL SNP enrichment/depletion within CREs: (a) relative to background, (b) replicating versus non-replicating, (c) same versus different cell types. Details are the same as Table S4.
Harvard_cerebellum eQTL SNP enrichment/depletion within CREs: (a) relative to background, (b) replicating versus non-replicating, (c) same versus different cell types. Details are the same as Table S4.
AH model comparison table. Details of tests for independent enrichment of primary and secondary tier eQTL SNPs within CREs. CRE data sets are listed in separate rows, columns are labeled with a header in the first row as:
- 1 Overlap Test Beta. Primary tier , relative to background SNPs, from Eqn. (4).
- 1 Overlap Test SE. Standard error of the estimate of the primary tier from Eqn. (4).
- 1 Overlap Test Z. Z-score of the significance of the primary tier from Eqn. (4).
- 1 Overlap Test PV. P-value of the significance of the primary tier from Eqn. (4).
- 2 Overlap Test Beta. Second tier , relative to background SNPs, from Eqn. (4).
- 2 Overlap Test SE. Standard error of the estimate of the second tier from Eqn. (4).
- 2 Overlap Test Z. Z-score of the significance of the second tier from Eqn. (4).
- 2 Overlap Test PV. P-value of the significance of the second tier from Eqn. (4)
Random forest classifier feature importance. Measures of feature importance for each variable for each classification task. Columns are:
- classification task,
- variable name,
- variable importance. For each tree, the prediction accuracy on the out- of-bag portion of the data is recorded. Then the same is done after permuting each predictor variable. The difference between the two accuracies are then averaged over all trees, and normalized by the standard error.
The cutoffs and total numbers of eQTLs at for each tier of AH in each study. FDR cutoff was assessed by permutation.
Comparison of LD-based method and forward stepwise regression on subset of CAP_LCL genes with AH. Columns are: Gene names (note that these represent only single gene probe clusters, and not combined results across probes); the number of eQTLs for the LD-based method, the BIC score for the LD-based method, the multivariate for the LD-based method, the set of RSIDs for the LD-based method, the number of eQTLs for FSR, the BIC score for FSR, the multivariate for FSR, the set of RSIDs for FSR.
Gene Ontology enrichment results. First two columns denote the eQTL discovery study and cell type; second two columns the eQTL comparison study and cell type. The Genes column denotes the type of enrichment analysis performed, where multiple eQTLs indicates the enrichment of terms in genes with allelic heterogeneity with a background of genes with one or more eQTLs, eQTLs indicates the enrichment of terms in genes with eQTLs, with a background of all of the tested genes in the study, NR indicates the enrichment of terms in genes with eQTLs that do not reproduce across cell types, with the background of all eQTLs identified in the discovery data. The Type column indicates what term is enriched, where KEGG indicates a KEGG pathway, MF denotes a molecular function in GO, and BP indicates biological process. Any of the comparisons not mentioned (for the three types or the different study comparisons discussed) did not come up with any statistically significant enrichment ().
The authors would like to thank Manolis Dermitzakis for access to the GenCode study data, and Greg Crawford for pre-release access to DNAse I hypersensitive site data from cerebellum. The authors would also like to thank Greg Cooper, Greg Crawford, Ron Hause, and Matthew Stephens for helpful discussions, the ENCODE project, and all of the studies included here who made their gene expression, genotype, and CRE data public. The gene expression and genotype data designated by Harvard* were generated by Merck Research Laboratories in collaboration with the Harvard Brain Tissue Resource Center and was obtained through the Synapse data repository (data set id: syn4505 at https://synapse.sagebase.org/#Synapse:syn4505).
Conceived and designed the experiments: CDB BEE. Performed the experiments: CDB BEE. Analyzed the data: CDB BEE. Contributed reagents/materials/analysis tools: LMM. Wrote the paper: CDB BEE LMM.
- 1. Cookson W, Liang L, Abecasis G, Moffatt M, Lathrop M (2009) Mapping complex disease traits with global gene expression. Nature Reviews Genetics 10: 184–94. doi: 10.1038/nrg2537
- 2. Emilsson V, Thorleifsson G, Zhang B, Leonardson AS, Zink F, et al. (2008) Genetics of gene expression and its effect on disease. Nature 452: 423–8. doi: 10.1038/nature06758
- 3. Gilad Y, Rifkin SA, Pritchard JK (2008) Revealing the architecture of gene regulation: the promise of eQTL studies. Trends in genetics 24: 408–415. doi: 10.1016/j.tig.2008.06.001
- 4. Brem RB, Yvert G, Clinton R, Kruglyak L (2002) Genetic dissection of transcriptional regulation in budding yeast. Science 296: 752–5. doi: 10.1126/science.1069516
- 5. Schadt EE, Monks SA, Drake TA, Lusis AJ, Che N, et al. (2003) Genetics of gene expression surveyed in maize, mouse and man. Nature 422: 297–302. doi: 10.1038/nature01434
- 6. Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, et al. (2004) Genetic analysis of genomewide variation in human gene expression. Nature 430: 743–7. doi: 10.1038/nature02797
- 7. De Gobbi M, Viprakasit V, Hughes JR, Fisher C, Buckle VJ, et al. (2006) A regulatory SNP causes a human genetic disease by creating a new transcriptional promoter. Science 312: 1215–7. doi: 10.1126/science.1126431
- 8. Small KS, Hedman AK, Grundberg E, Nica AC, Thorleifsson G, et al. (2011) Identification of an imprinted master trans regulator at the KLF14 locus related to multiple metabolic phenotypes. Nature Genetics 43: 561–4. doi: 10.1038/ng.833
- 9. Göring HHH, Curran JE, Johnson MP, Dyer TD, Charlesworth J, et al. (2007) Discovery of expression QTLs using large-scale transcriptional profiling in human lymphocytes. Nature Genetics 39: 1208–16. doi: 10.1038/ng2119
- 10. Moffatt MF, Kabesch M, Liang L, Dixon AL, Strachan D, et al. (2007) Genetic variants regulating ORMDL3 expression contribute to the risk of childhood asthma. Nature 448: 470–3. doi: 10.1038/nature06014
- 11. Emison ES, Garcia-Barcelo M, Grice EA, Lantieri F, Amiel J, et al. (2010) Differential contributions of rare and common, coding and noncoding Ret mutations to multifactorial Hirschsprung disease liability. American journal of human genetics 87: 60–74. doi: 10.1016/j.ajhg.2010.06.007
- 12. Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, et al. (2012) Systematic Localization of Common Disease-Associated Variation in Regulatory DNA. Science 337: 1190–1195. doi: 10.1126/science.1222794
- 13. Nicolae DL, Gamazon E, Zhang W, Duan S, Dolan ME, et al. (2010) Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genetics 6: e1000888. doi: 10.1371/journal.pgen.1000888
- 14. Fraser HB, Xie X (2009) Common polymorphic transcript variation in human disease. Genome Research 19: 567–75. doi: 10.1101/gr.083477.108
- 15. Musunuru K, Strong A, Frank-Kamenetsky M, Lee NE, Ahfeldt T, et al. (2010) From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus. Nature 466: 714–9. doi: 10.1038/nature09266
- 16. Harismendy O, Notani D, Song X, Rahim NG, Tanasa B, et al. (2011) 9p21 DNA variants associated with coronary artery disease impair interferon-γ signalling response. Nature 470: 264–8. doi: 10.1038/nature09753
- 17. Dimas AS, Deutsch S, Stranger BE, Montgomery SB, Borel C, et al. (2009) Common regulatory variation impacts gene expression in a cell type-dependent manner. Science 325: 1246–50. doi: 10.1126/science.1174148
- 18. Fairfax BP, Makino S, Radhakrishnan J, Plant K, Leslie S, et al. (2012) Genetics of gene expression in primary immune cells identifies cell type-specific master regulators and roles of HLA alleles. Nature Genetics 44: 502–10. doi: 10.1038/ng.2205
- 19. Powell JE, Henders AK, McRae AF, Wright MJ, Martin NG, et al. (2012) Genetic control of gene expression in whole blood and lymphoblastoid cell lines is largely independent. 22: 456–66. doi: 10.1101/gr.126540.111
- 20. Huang DW, Sherman BT, Zheng X, Yang J, Imamichi T, et al. (2009) Extracting biological meaning from large gene lists with DAVID. Current protocols in bioinformatics Chapter 13: Unit 13.11. doi: 10.1002/0471250953.bi1311s27
- 21. van Nas A, Ingram-Drake L, Sinsheimer JS, Wang SS, Schadt EE, et al. (2010) Expression quantitative trait loci: replication, tissue- and sex-specificity in mice. Genetics 185: 1059–68. doi: 10.1534/genetics.110.116087
- 22. Nica AC, Parts L, Glass D, Nisbet J, Barrett A, et al. (2011) The architecture of gene regulatory variation across multiple human tissues: the MuTHER study. PLoS Genetics 7: e1002003. doi: 10.1371/journal.pgen.1002003
- 23. Ding J, Gudjonsson JE, Liang L, Stuart PE, Li Y, et al. (2010) Gene expression in skin and lymphoblastoid cells: Refined statistical method reveals extensive overlap in cis-eQTL signals. American journal of human genetics 87: 779–89. doi: 10.1016/j.ajhg.2010.10.024
- 24. Fu J, Wolfs MGM, Deelen P, Westra HJ, Fehrmann RSN, et al. (2012) Unraveling the regulatory mechanisms underlying tissue-dependent genetic variation of gene expression. PLoS Genetics 8: e1002431. doi: 10.1371/journal.pgen.1002431
- 25. Gerrits A, Li Y, Tesson BM, Bystrykh LV, Weersing E, et al. (2009) Expression quantitative trait loci are highly sensitive to cellular differentiation state. PLoS Genetics 5: e1000692. doi: 10.1371/journal.pgen.1000692
- 26. Innocenti F, Cooper GM, Stanaway IB, Gamazon ER, Smith JD, et al. (2011) Identification, replication, and functional fine-mapping of expression quantitative trait Loci in primary human liver tissue. PLoS Genetics 7: e1002078. doi: 10.1371/journal.pgen.1002078
- 27. Heap GA, Trynka G, Jansen RC, Bruinenberg M, Swertz MA, et al. (2009) Complex nature of SNP genotype effects on gene expression in primary human leucocytes. BMC medical genomics 2: 1. doi: 10.1186/1755-8794-2-1
- 28. Flutre T, Wen X, Pritchard J, Stephens M (2013) A Statistical Framework for Joint eQTL Analysis in Multiple Tissues. PLoS Genetics 9: e1003486. doi: 10.1371/journal.pgen.1003486
- 29. Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, et al. (2012) The accessible chromatin landscape of the human genome. Nature 489: 75–82. doi: 10.1038/nature11232
- 30. Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, et al. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489: 57–74.
- 31. Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, et al. (2007) Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315: 848–853. doi: 10.1126/science.1136678
- 32. Myers AJ, Gibbs JR, Webster Ja, Rohrer K, Zhao A, et al. (2007) A survey of genetic human cortical gene expression. Nature Genetics 39: 1494–9. doi: 10.1038/ng.2007.16
- 33. Schadt EE, Molony C, Chudin E, Hao K, Yang X, et al. (2008) Mapping the genetic architecture of gene expression in human liver. PLoS Biology 6: e107. doi: 10.1371/journal.pbio.0060107
- 34. Servin B, Stephens M (2007) Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genetics 3: e114. doi: 10.1371/journal.pgen.0030114.eor
- 35. Scheet P, Stephens M (2006) A fast and exible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. American journal of human genetics 78: 629–44. doi: 10.1086/502802
- 36. Leek JT, Storey JD (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics 3: 1724–35. doi: 10.1371/journal.pgen.0030161.eor
- 37. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, et al. (2010) Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464: 768–772. doi: 10.1038/nature08872
- 38. Guan Y, Stephens M (2008) Practical issues in imputation-based association mapping. PLoS Genetics 4: e1000279. doi: 10.1371/journal.pgen.1000279
- 39. Stephens M, Balding DJ (2009) Bayesian statistical methods for genetic association studies. Nature Reviews Genetics 10: 681–90. doi: 10.1038/nrg2615
- 40. Veyrieras JB, Kudaravalli S, Kim SY, Dermitzakis ET, Gilad Y, et al. (2008) High-resolution mapping of expression-QTLs yields insight into human gene regulation. PLoS Genetics 4: e1000214. doi: 10.1371/journal.pgen.1000214
- 41. Veyrieras JB, Gaffney DJ, Pickrell JK, Gilad Y, Stephens M, et al. (2012) Exon-specific QTLs skew the inferred distribution of expression QTLs detected using gene expression array data. PloS One 7: e30629. doi: 10.1371/journal.pone.0030629
- 42. Degner JF, Pai AA, Pique-Regi R, Veyrieras JB, Gaffney DJ, et al. (2012) DNaseI sensitivity QTLs are a major determinant of human expression variation. Nature 482: 390–4. doi: 10.1038/nature10808
- 43. Gaffney DJ, Veyrieras JB, Degner JF, Pique-Regi R, Pai AA, et al. (2012) Dissecting the regulatory architecture of gene expression QTLs. Genome biology 13: R7. doi: 10.1186/gb-2012-13-1-r7
- 44. Kim TH, Abdullaev ZK, Smith AD, Ching KA, Loukinov DI, et al. (2007) Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell 128: 1231–45. doi: 10.1016/j.cell.2006.12.048
- 45. Nègre N, Brown CD, Shah PK, Kheradpour P, Morrison CA, et al. (2010) A comprehensive map of insulator elements for the Drosophila genome. PLoS Genetics 6: e1000814. doi: 10.1371/journal.pgen.1000814
- 46. Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, et al. (2011) Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473: 43–9. doi: 10.1038/nature09906
- 47. Cooper SJ, Trinklein ND, Nguyen L, Myers RM (2007) Serum response factor binding sites differ in three human cell types. Genome Research 17: 136–44. doi: 10.1101/gr.5875007
- 48. Chen H, Tian Y, Shu W, Bo X, Wang S (2012) Comprehensive identification and annotation of cell type-specific and ubiquitous CTCF-binding sites in the human genome. PloS One 7: e41374. doi: 10.1371/journal.pone.0041374
- 49. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences of the United States of America 106: 9362–7. doi: 10.1073/pnas.0903103106
- 50. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. American journal of human genetics 81: 559–75. doi: 10.1086/519795
- 51. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genetic epidemiology 34: 816–34. doi: 10.1002/gepi.20533
- 52. Howie B, Marchini J, Stephens M (2011) Genotype Imputation with Thousands of Genomes. G3: Genes, Genomes, Genetics 1: 457–470. doi: 10.1534/g3.111.001198
- 53. 1 kg (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–73. doi: 10.1038/nature09991
- 54. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, et al. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17: 520–525. doi: 10.1093/bioinformatics/17.6.520
- 55. Barber MJ, Mangravite LM, Hyde CL, Chasman DI, Smith JD, et al. (2010) Genome-wide association of lipid-lowering response to statins in combined study populations. PloS One 5: e9763. doi: 10.1371/journal.pone.0009763
- 56. Fraley C, Raftery A, Wehrens R (2005) Incremental Model-Based Clustering for Large Datasets With Small Clusters. Journal of Computational and Graphical Statistics 14: 529–546. doi: 10.1198/106186005x59603
- 57. Kudaravalli S, Veyrieras JB, Stranger BE, Dermitzakis ET, Pritchard JK (2009) Gene expression levels are a target of recent natural selection in the human genome. Molecular biology and evolution 26: 649–58. doi: 10.1093/molbev/msn289
- 58. Maranville JC, Luca F, Richards AL, Wen X, Witonsky DB, et al. (2011) Interactions between glucocorticoid treatment and cis-regulatory polymorphisms contribute to cellular response phenotypes. PLoS Genetics 7: e1002162. doi: 10.1371/journal.pgen.1002162
- 59. Petretto E, Bottolo L, Langley SR, Heinig M, McDermott-Roe C, et al. (2010) New insights into the genetic control of gene expression using a Bayesian multi-tissue approach. PLoS computational biology 6: e1000737. doi: 10.1371/journal.pcbi.1000737
- 60. Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, et al. (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851–61.
- 61. Wen X, Stephens M (2011) Bayesian Methods for Genetic Association Analysis with Heterogeneous Subgroups: from Meta-Analyses to Gene-Environment Interactions. arXiv:1111.1210v2.
- 62. Wang S, Yehya N, Schadt EE, Wang H, Drake Ta, et al. (2006) Genetic and genomic analysis of a fat mass trait with complex inheritance reveals marked sex specificity. PLoS Genetics 2: e15. doi: 10.1371/journal.pgen.0020015
- 63. Stranger BE, Montgomery SB, Dimas AS, Parts L, Stegle O, et al. (2012) Patterns of Cis Regulatory Variation in Diverse Human Populations. PLoS Genetics 8: e1002639. doi: 10.1371/journal.pgen.1002639
- 64. Storey JD, Akey JM, Kruglyak L (2005) Multiple locus linkage analysis of genomewide expression in yeast. PLoS Biology 3: e267. doi: 10.1371/journal.pbio.0030267
- 65. Ashburner M, Ball C, Blake J (2000) Gene Ontology: tool for the unification of biology. Nature Genetics 25: 25–29.
- 66. Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M (2012) KEGG for integration and interpretation of large-scale molecular data sets. Nucleic acids research 40: D109–14. doi: 10.1093/nar/gkr988
- 67. Gardiner-Garden M, Frommer M (1987) CpG Islands in vertebrate genomes. Journal of Molecular Biology 196: 261–282. doi: 10.1016/0022-2836(87)90689-9
- 68. Cooper GM, Stone EA, Asimenos G, Green ED, Batzoglou S, et al. (2005) Distribution and intensity of constraint in mammalian genomic sequence. Genome Research 15: 901–13. doi: 10.1101/gr.3577405
- 69. Hoffman MM, Buske OJ, Wang J, Weng Z, Bilmes JA, et al. (2012) Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature methods 9: 473–476. doi: 10.1038/nmeth.1937
- 70. Breiman L (2001) Random Forests. Machine Learning 45: 5–32. doi: 10.1023/a:1010933404324
- 71. Liaw A, Wiener M (2002) Classification and Regression by randomForest. R news 2: 18–22.
- 72. Sing T, Sander O, Beerenwinkel N, Lengauer T (2005) ROCR: visualizing classifier performance in R. Bioinformatics 21: 3940–1. doi: 10.1093/bioinformatics/bti623
- 73. Wendt KS, Yoshida K, Itoh T, Bando M, Koch B, et al. (2008) Cohesin mediates transcriptional insulation by CCCTC-binding factor. Nature 451: 796–801. doi: 10.1038/nature06634
- 74. Simon JA, Lin F, Hulley SB, Blanche PJ, Waters D, et al. (2006) Phenotypic predictors of response to simvastatin therapy among African-Americans and Caucasians: the Cholesterol and Pharmacogenetics (CAP) Study. The American journal of cardiology 97: 843–50. doi: 10.1016/j.amjcard.2005.09.134