Conceived and designed the experiments: ETD. Analyzed the data: JBV SK SYK YG MS JKP. Contributed reagents/materials/analysis tools: JBV YG MS JKP. Wrote the paper: JBV YG MS JKP.
The authors have declared that no competing interests exist.
Recent studies of the HapMap lymphoblastoid cell lines have identified large numbers of quantitative trait loci for gene expression (eQTLs). Reanalyzing these data using a novel Bayesian hierarchical model, we were able to create a surprisingly high-resolution map of the typical locations of sites that affect mRNA levels in
Individual phenotypes within natural populations generally exhibit a large diversity resulting from a complex interplay of genes and environmental factors. Since the advent of molecular markers in the 1980s, quantitative genetics has made a significant step toward unraveling the genetic bases of such complex traits, in particular by developing sophisticated tools to map the genomic locations of genes that affect complex traits. These regions are known as quantitative trait loci (QTLs). More recently, these tools have been extended to the study of gene expression phenotypes on a massive scale. In this paper, we used a previously published dataset consisting of expression measurements of 11,446 genes in human cell lines derived from 210 unrelated human individuals that have been genetically characterized by the International HapMap Project. Our article develops and applies a framework for determining the genetic factors that impact gene regulation. We show that these factors cluster strongly near to the gene start and gene end and are enriched within the transcribed region. Our approach suggests a general framework for studying the genetic factors that affect variation in gene expression.
Genetic variation that affects gene regulation plays an important role in the genetics of disease and adaptive evolution
To address this gap, recent experimental and computational approaches have made progress on identifying elements that may be functional, for example through experimental methods that identify transcription factor binding sites
As a complementary approach, genome-wide studies of gene expression are now starting to provide information on genetic variation that impacts gene expression levels
In this study, we applied a new Bayesian framework to identify and fine map human lymphoblast eQTLs on a genome-wide scale. In effect, we treat the SNP data as a tool for assaying the functional impact of individual nucleotide changes on gene regulation. Our analysis focuses on the impact of common SNPs on gene expression levels. By using naturally occurring variation, we test the effects of several million variable sites in a single data set. Our results provide a detailed characterization of the types of SNPs that affect gene expression in lymphoblast cell lines.
We analyzed gene expression measurements from lymphoblastoid cell lines representing 210 unrelated individuals studied by the International HapMap Project
After remapping the Illumina probes onto human mRNA sequences from RefSeq, we created a cleaned set of expression data for 12,227 distinct autosomal genes that had a unique RNA sequence in RefSeq (see
We then set out to identify SNPs that affect measured mRNA levels in
Although the HapMap samples represent four different populations, originating from Africa, Europe and East Asia, our main analyses pooled the data into a single sample. To avoid false positives due to population-level expression differences
For each of the 11,446 genes, we tested for putative
We also observed that, in many cases, the SNPs most strongly associated with mRNA levels for a particular gene lie in a restricted region, allowing relatively precise localisation of eQTLs.
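The per-SNP association test used in this part of the analysis is a standard linear regression of expression on genotype. The following is a minimal sketch of such a test, not the code used in the study. It assumes an additive genotype coding (0, 1 or 2 copies of one allele), and the function name is hypothetical.

```python
import math

def regress_expression_on_genotype(genotypes, expression):
    """Least-squares fit of expression ~ intercept + slope * genotype.

    genotypes: allele counts per individual (assumed additive coding 0/1/2).
    expression: normalized expression values, one per individual.
    Returns (slope, intercept, t_statistic).
    """
    n = len(genotypes)
    if n != len(expression) or n < 3:
        raise ValueError("need matched vectors with at least 3 individuals")
    g_mean = sum(genotypes) / n
    e_mean = sum(expression) / n
    sxx = sum((g - g_mean) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    sxy = sum((g - g_mean) * (e - e_mean) for g, e in zip(genotypes, expression))
    slope = sxy / sxx
    intercept = e_mean - slope * g_mean
    # Residual sum of squares and standard error of the slope
    rss = sum((e - intercept - slope * g) ** 2 for g, e in zip(genotypes, expression))
    se = math.sqrt(rss / (n - 2) / sxx)
    t_stat = slope / se if se > 0 else math.inf
    return slope, intercept, t_stat

# Toy data, invented for illustration only
slope, intercept, t_stat = regress_expression_on_genotype(
    [0, 1, 2, 0, 1, 2], [1.0, 2.1, 2.9, 1.1, 1.9, 3.0]
)
```

A p-value would follow from comparing the t statistic to a t distribution with n − 2 degrees of freedom; that step is omitted here to keep the sketch free of external dependencies.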
The plots show examples of eQTLs for three genes:
Encouraged by the potential for these data to localise eQTLs, we next examined the distribution of the physical location of putative eQTLs within the
Each plot shows, for genes with an eQTL, the distribution of locations of the most significant SNP. The x-axis of each plot divides a typical
Finally, for all three gene sizes, the highest density of eQTLs is around the TSS and immediately upstream of the TES, as reported previously in yeast
While
Consequently, we next developed a Bayesian hierarchical modeling approach that solves many of these problems (see the
To implement our hierarchical approach, we switched to using Bayesian regression to test for association between SNPs and gene expression
The hierarchical model shares information across all genes about the distribution of signals and this in turn allows better weighting of which SNPs in individual genes are most likely to be eQTNs. For example, consider a hypothetical gene in which two SNPs that are associated with expression are in perfect LD (
Of course, some degree of complication is added by the fact that current HapMap data do not yet contain all SNPs. Therefore, the sites that we infer to be “eQTNs” in this study surely include many SNPs that are tags of nearby functional SNPs that are not in HapMap. This effect will systematically reduce our estimates of the importance of any particular factor in predicting eQTNs. In the case of factors relating to physical location (such as distance from the TSS) simulations show that this has a modest impact on spreading out the signal peaks that we observe, and that the overall distribution of signals is still estimated very well (see
We first set out to get a more refined view of the distribution of eQTNs across the
We also considered models with pairs of anchor points (e.g., the TSS and the TES). In those models, each SNP belonged to two bins, each corresponding to the distance from one anchor point. This model treats the probability that a SNP is an eQTN as the sum of an effect due to the first anchor plus an effect due to the second anchor. Recall that our gene set includes only genes with a single annotated transcript, so that this analysis does not incorporate alternative transcription start or end sites.
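The two-anchor bookkeeping can be sketched as follows. This is an illustration only: the 1 kb bin width, the function names, the per-bin effect values and the softmax normalization across the SNPs of one gene are all assumptions made for the example, and the actual parameterization is the one given in the Supplementary Methods.

```python
import math

BIN_WIDTH = 1000  # hypothetical bin width in bp

def signed_bin(snp_pos, anchor_pos, strand):
    """Bin index of a SNP relative to an anchor; negative bins are upstream."""
    offset = (snp_pos - anchor_pos) * (1 if strand == "+" else -1)
    return offset // BIN_WIDTH  # floor division, so -1 bp falls in bin -1

def snp_prior(snp_positions, tss, tes, strand, tss_effect, tes_effect):
    """Prior weight per SNP: TSS-bin effect plus TES-bin effect, normalized.

    tss_effect / tes_effect map a bin index to a (hypothetical) effect;
    bins that are absent contribute 0.
    """
    scores = []
    for pos in snp_positions:
        score = tss_effect.get(signed_bin(pos, tss, strand), 0.0)
        score += tes_effect.get(signed_bin(pos, tes, strand), 0.0)
        scores.append(score)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # assumed softmax link
    total = sum(weights)
    return [w / total for w in weights]

# Toy gene on the + strand with invented effects
priors = snp_prior(
    [9500, 10500, 19500, 50000], tss=10000, tes=20000, strand="+",
    tss_effect={-1: 1.0, 0: 2.0}, tes_effect={-1: 2.0},
)
```

In this toy example the SNP just downstream of the TSS and the SNP just upstream of the TES receive equal prior weight, while the distant SNP receives the least.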
Model | Log Likelihood Diff. | AIC Difference |
TSS+CDSE | −11.9 | −11.9 |
TSS+Probe | −14.8 | −14.8 |
TSS+TXMID | −58.5 | −58.5 |
TSS+CDSMID | −63.5 | −63.5 |
TSS+CDSS | −94.9 | −94.9 |
TES | −330.7 | −229.7 |
The table compares the performance of seven different hierarchical models of eQTN locations. In each model we used either a single “anchor” point to predict the locations of eQTNs (e.g., the location of the TSS) or two anchor points (e.g., the TSS and TES locations). The “TSS+CDSE” model uses the TSS and the coding sequence end locations as anchors; similarly “probe” refers to the location of the probe and “TXMID” is the midpoint of the transcript. The second and third columns compare the model listed on that line against the best model (TSS+TES), in terms of the difference in log likelihood (column 2) and the difference in Akaike Information Criterion (AIC, column 3). AIC penalizes the two-anchor models for 51 additional parameters compared to the one-anchor models.
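As a reminder of how AIC trades off fit against model size, the sketch below uses the standard definition AIC = 2k − 2 ln L. The log likelihoods and parameter counts are hypothetical and are not taken from the table, whose sign and scaling conventions may differ from this generic form.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion; lower values indicate a preferred model."""
    return 2 * n_params - 2 * log_likelihood

def aic_difference(loglik_a, k_a, loglik_b, k_b):
    """AIC of model A minus AIC of model B (positive means B is preferred)."""
    return aic(loglik_a, k_a) - aic(loglik_b, k_b)

# Hypothetical example: a one-anchor model with k parameters versus a
# two-anchor model with k + 51 parameters and a higher log likelihood.
k = 51
delta = aic_difference(-1000.0, k, -900.0, k + 51)
```

Here the two-anchor model gains 100 log-likelihood units at a cost of 51 extra parameters, so its AIC is lower and `delta` is positive.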
In summary, the results provide compelling support for a model including both the TSS and TES over all other models (
We next replotted the locations of eQTNs, using the posterior probabilities estimated by the hierarchical model (
The three left-hand panels plot the estimated fractions of SNPs in each bin that are eQTNs, using the posterior expected numbers of eQTNs in each bin from the hierarchical model. The right-hand panels plot the corresponding cumulative distributions of detected eQTNs, as a function of position across the
Another view of the hierarchical model results is shown in the cumulative plots in
We next investigated the peaks of signal near the TSS and TES in more detail, using a finer bin partition close to the TSS and TES (see
The left- and right-hand columns show data for 5 kb on either side of the TSS and TES, respectively (averaging across all gene sizes). Locations inside genes are colored green and outside genes are black. A. Posterior expected fractions of SNPs in each bin that are eQTNs, as estimated by the hierarchical model (see
We also observed that the TSS and TES peaks both correspond with two parts of the typical gene structure that, averaging across all 11,446 genes, tend to be highly conserved across the mammalian phylogeny (
Similarly, the TSS peak also matches up closely with the peak binding densities of a collection of transcription factors that are involved in transcription initiation (reported previously by the ENCODE group, based on ChIP-chip data collected for a set of regions spanning ∼1% of the genome
We next used our hierarchical model to examine the impact of various types of functional annotation on the probability that a SNP is an eQTN. We first classified SNPs that lie inside genes into categories based on the exon/intron structure (e.g., first, coding and last exons; first, internal, and last introns;
The main result of this first analysis is that internal introns have a deficit of eQTNs, compared to coding exons, as well as first and last exons and introns (
The plot shows the odds ratios for the probability that a SNP in a particular part of the gene (e.g., coding exon) is inferred to be an eQTN, relative to that probability for a SNP in an “internal” intron (i.e., an intron within the coding sequence). The odds ratios are estimated using the hierarchical model with internal introns fixed at a value of 1, and control for SNP position using the TSS+TES model. The vertical bars show 95% confidence intervals.
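For readers less familiar with odds ratios, the quantity plotted can be illustrated with a short sketch. In the paper the odds ratios are estimated within the hierarchical model; the function below only shows the definition, and the probabilities passed to it are invented.

```python
def odds(p):
    """Convert a probability into odds p / (1 - p)."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return p / (1.0 - p)

def odds_ratio(p_category, p_reference):
    """Odds of being an eQTN in a category relative to a reference category."""
    return odds(p_category) / odds(p_reference)

# Hypothetical per-SNP probabilities: the reference class (internal introns
# in the plot) has an odds ratio of exactly 1 by construction.
ratio = odds_ratio(0.5, 0.2)
```

An odds ratio above 1 means that SNPs in the category are more likely to be inferred as eQTNs than SNPs in the reference category.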
We then considered the impact of a variety of other types of SNP annotation (see
Finally, based on ENCODE results showing that the promoter regions of genes with CpG islands tend to have more accessible chromatin and greater occupancy by transcription factors
Cells use a variety of mechanisms at the transcriptional and translational levels to regulate gene expression. Transcription initiation is controlled by the interaction between transcription factors and cofactors with a set of
Consistent with the importance of transcription initiation, we found a strong peak of eQTNs near the TSS, with 33% of eQTNs lying within 10 kb of the TSS. Many of these eQTNs are likely to be polymorphisms that change the binding strength of transcription factor binding sites, thereby affecting the rate of transcription
In addition to the peak of eQTN signals near the TSS, we were intrigued to find a second, similarly strong peak near the TES, as seen previously in yeast
An alternative explanation for the overrepresentation of eQTNs in exons is that in some cases these may cause alternative splicing of the exon containing the expression probe, thereby changing measured expression levels. In particular, SNPs in the last exon might sometimes affect the location of the TES
Our results also imply that surprisingly few eQTNs with large effects lie far upstream of the TSS (or downstream of the TES): for example, just 5% of the eQTNs that we detected were more than 20 kb upstream of the TSS. These results are consistent with data showing that most transcription factor binding sites are near the TSS
In summary, our results show that eQTL studies provide a remarkably high-resolution tool for identifying variants that affect gene expression. A major strength of the eQTL approach is that, unlike other experimental techniques that are more targeted, the eQTL approach is agnostic about the mechanism of action of the functional variants, provided only that they are encoded in the DNA sequence (as opposed to epigenetic factors, for example). Hence, eQTL studies can provide a relatively unbiased view of the importance of different types of regulatory mechanisms. Moreover, as the cost of genome sequencing drops, it will soon be possible to conduct these analyses with nearly complete ascertainment of variation, potentially providing this approach with the resolution to study the sequence level determinants of gene expression. We anticipate that eQTL mapping will make an essential contribution to our understanding of human gene regulation.
We analyzed genotype and expression data from 210 unrelated individuals studied by the International HapMap project
We used gene expression levels that were measured previously in lymphoblastoid cell lines from all 210 unrelated individuals, using Illumina's human whole-genome expression array (WG-6 version 1)
Since mean expression levels at many loci differ between the HapMap populations
We used BLAT
Of these 12,227 genes, 85% contained exactly one probe. For the genes with multiple probes, we analyzed only a single probe, selecting the probe nearest to the 5′ end of the gene. We selected this probe because overall the probes are strongly biased towards the 3′ end of the gene, and we wanted to reduce this bias as far as possible. Then, we removed 634 genes for which there was at least one HapMap SNP inside the probe since it is known that such SNPs can impact the measured expression level
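The probe filtering steps described above can be sketched as follows. This is a hypothetical illustration: probes are represented as (start, end) coordinate pairs, distance to the 5′ end is measured from the probe midpoint, and both function names are invented for the example.

```python
def pick_probe(probes, tx_start, tx_end, strand):
    """Choose the probe whose midpoint lies nearest the 5' end of the gene.

    probes: list of (start, end) genomic coordinates, start <= end.
    strand: "+" or "-"; on the minus strand the 5' end is tx_end.
    """
    if not probes:
        raise ValueError("gene has no probes")
    five_prime = tx_start if strand == "+" else tx_end
    return min(probes, key=lambda p: abs((p[0] + p[1]) / 2 - five_prime))

def probe_contains_snp(probe, snp_positions):
    """True if any known SNP position falls inside the probe interval."""
    start, end = probe
    return any(start <= pos <= end for pos in snp_positions)

# Toy gene spanning 0..1000 with two probes
chosen = pick_probe([(100, 150), (900, 950)], 0, 1000, "+")
flagged = probe_contains_snp(chosen, [120])
```

Genes whose chosen probe is flagged in this way would be set aside, in the spirit of the 634 genes removed here.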
Gene structure annotation was obtained from the RefSeq gene table
First (non-coding) exon. If the gene has at least 2 exons, this is the part of the first exon that is not located inside the CDS. If the gene has only one exon, we do not consider it to have a
First intron. If the gene has at least 2 exons, this is the intron following the first exon, provided that it is not located inside the CDS. Otherwise there is no first intron.
Noncoding exon. This is any part of an exon located outside the CDS region and excluding the first and last exons.
External intron. This is an intron located outside the CDS region and excluding the first and the last introns.
Coding exon. This is any part of an exon located inside the CDS region. Note that exons containing the translation start or stop generally contain both coding exon and noncoding (or first/last) exon. Coding SNPs were further subdivided into synonymous and nonsynonymous, according to their annotation in dbSNP.
Internal intron. This is an intron located inside the CDS region.
Last intron. If the gene has at least 2 exons, this is the intron preceding the last exon, provided that it is not located inside the CDS. Otherwise there is no last intron.
Last (noncoding) exon. If the gene has at least 2 exons, this is the part of the last exon that is not located inside the CDS. Otherwise there is no last exon.
We also included annotations that indicate whether a SNP is in the following special categories: SNP is in a (1) CpG island; (2) conserved noncoding region; (3) predicted
Finally, note that in our analysis design, each SNP is tested for association with every gene that is within 500 kb. This means that typical SNPs contribute data to multiple genes. Our analysis treats these multiple tests as independent, which is likely a good approximation since we identified only five SNPs that are eQTLs for more than one gene in
The data consist of SNP genotypes and gene expression measurements for
Next, let
In the first part of the paper we used standard linear regression to test the gene expression data at each gene for association with SNPs in the
We used the following procedure to generate the results plotted in
Prior to reporting the data, we also applied a correction for the possibility of spurious signals due to ungenotyped SNPs in the expression array probe. We used the 634 genes with a known HapMap SNP inside the probe to create a profile of the abundance of spurious signals as a function of distance from the probe. This profile was used to adjust the observed number of signals,
To display the distribution of signals in
We present here an overview of the hierarchical model. Complete details on the models are provided in the Supplementary Methods section (
The hierarchical model applies the Bayesian regression framework of Servin and Stephens
Let
We describe first the basic version of our hierarchical model. All the results presented in this paper additionally include a correction for the possibility that genes might show signals due to undetected SNPs in the probe. We describe that extension later in the
Our basic model assumes that there are two mutually exclusive categories of genes. With probability Π0 there is no eQTN in the
Given that there is a single eQTN in gene
A key feature of the hierarchical model is that the probability that SNP
As detailed in the Supplementary Methods (
Substituting the above expressions for
Since undetected SNPs in the probe sequence sometimes generate eQTLs, the results that we report include a modification to account for this effect. We used the 634 genes that have a known SNP in the probe region as training data to help parameterize the model. We assume that these represent ∼1/3 of all probes with common SNPs
Suppose that with probability
To maximize
Once the likelihood has been maximized, we can compute the posterior probability of a given SNP
To compute the average sequence conservation as a function of position for
We obtained results on transcription factor binding density using ChIP-chip data collected by the ENCODE project (4). We used data for eight transcription factors that showed large numbers of binding fragments at a 1% false discovery rate in the ENCODE study. The left-hand panel of
The methods reported here are implemented in the package
About 60% of the eQTNs are shared between at least two populations. Venn diagram of the set of eQTNs detected separately in each population. To generate the diagram, we admitted a SNP to the analysis (as an eQTL) if either the p-value in the combined sample (pooling the 3 populations) is lower than 7×10⁻⁶ or the p-value in a single population is lower than the p-value cutoff corresponding to a gene FDR of 5% within each population. We then considered two populations to share an eQTL if any single population has a p-value <1×10⁻². Finally, for each gene having at least one such eQTL, we defined the eQTN as the SNP with the largest number of shared populations (sharing weight between multiple SNPs if there is a tie).
Expression QTNs in the combined Japanese plus Chinese analysis panel (ASN) show similar patterns to those in the full data. The left panel (p-value method) was prepared in the same way as
Expression QTNs in the European-derived sample (CEU) show similar patterns to those in the full data. The left panel (p-value method) was prepared in the same way as
Expression QTNs in the Nigerian sample (YRI) show similar patterns to those in the full data. The left panel (p-value method) was prepared in the same way as
Illustration of the ability of the HM to accurately estimate the distribution of eQTNs when all the actual eQTNs are genotyped. This figure is based on a simulated dataset assuming that for all genes the actual eQTN is genotyped (see
50% of the most significant SNPs lie within 7.5 kb of the actual eQTNs. Both panels are based on the results from the p-value method applied to a simulated dataset (see
No obvious impact of the eQTN location on the mapping precision. Cumulative plot of the distance between the most significant SNPs and the actual eQTNs according to the eQTN location (upstream of the TSS, downstream of the TSS, within an exon, and within an intron). This plot was generated by averaging results from the p-value method applied to 10 simulated datasets (see
Impact of the local recombination rate on the eQTN mapping precision. Boxplot of the physical distance between the tag SNP and the actual eQTN as a function of the average recombination rate (cM/Mb) around the actual eQTN in a simulated dataset assuming that none of the eQTNs are genotyped (see
There is a deficit of most-significant SNPs in internal introns, and an enrichment of such SNPs in last exons (p-value method). This figure is based on the subset of 295 genes for which there is a unique most significant SNP (and for which the smallest p-value is <7×10⁻⁶) that falls into the gene transcript region. For the five panels, the blue arrows represent the observed number of most significant SNPs in the five gene functional elements for which at least 5 most significant SNPs have been found. Here these counts have been corrected for putative spurious signal due to an unobserved SNP inside the probe (leading to the removal of ∼46 genes). Under the null hypothesis that these most significant SNPs are randomly distributed into the eight possible gene functional elements, we carried out a simple Monte-Carlo procedure where for each of the 295 genes we picked at random a SNP inside the gene transcript region to be the most significant SNP (and weighted it by the probability that the gene has genuine signal according to the location of the observed most significant SNP with respect to the probe (see
When distance is measured from the TSS (or TES) only, the TES (or TSS) peak is hidden due to the great variability in gene lengths. The plots show the fraction of SNPs with eQTN signals as a function of position in the
Illustration of the ability of the HM to accurately estimate the distribution of eQTNs even when only 30% of the actual eQTNs are genotyped. These plots are based on a simulated dataset assuming that across all genes only 30% of the true eQTNs are genotyped (see
Simulated dataset with eQTNs symmetrically distributed around the TSS. The three left panels plot the true (simulated) probability to be the actual eQTN according to the gene size category. The three right panels plot the probability to be the most significant SNP (i.e., the SNP with the smallest p-value inside the
Numbers of SNPs inside each of the 9 mutually exclusive gene-related annotations as a function of position within the gene. SNPs inside coding exons are classified into synonymous and non-synonymous SNPs. Notice that ∼84% of genic SNPs occur inside internal introns.
Fine-scale structure of eQTN peaks near the TSS and TES, and comparison to four types of functional annotation. The left- and right-hand columns show data for 5 kb on either side of the TSS and TES, respectively (averaging across all gene sizes). Locations inside genes are colored green and outside genes are black. A. Posterior expected fractions of SNPs in each bin that are eQTNs, as estimated by the hierarchical model (see
Genes with CpG islands spanning the TSS are expressed at higher average levels and are more likely to contain eQTLs than genes without a CpG island at the TSS. Results for genes with a CpG island spanning the TSS (ON) are displayed in red, while results for genes without a CpG island spanning the TSS (OFF) are displayed in black. These results were obtained by computing the posterior probabilities from the hierarchical model separately for the two gene categories. A. Estimated probability for each gene category to have an eQTN anywhere in the
Schematic explanations of our gene structure annotation. The plot shows three pairs of hypothetical genes consisting of, respectively, 1, 2 and 6 exons. In each pair, the upper version of the gene shows the exon/intron structure (from RefSeq) and the translation start and stop sites (vertical red lines). The lower version of the gene shows how we annotate the gene structure (see color code at right of figure). A verbal explanation is also provided in the main text.
Locations of the most significant eQTL SNPs for small, medium, and large genes using a p-value cutoff of A) 1×10⁻² and B) 1×10⁻⁴. For A and B, the three panels were prepared in the same way as
Locations of the most significant eQTL SNPs for small, medium, and large genes using a p-value cutoff of A) 1×10⁻⁶ and B) 1×10⁻⁸. For A and B, the three panels were prepared in the same way as
Distribution of most significant eQTL SNPs around probes. The black bars indicate the numbers of spurious eQTL signals as a function of distance from the probes, among the 634 genes with a known SNP in the probe. The sum of the red+green bars gives the numbers of most significant eQTL SNPs among the remaining 11,446 genes; the red component is our estimate of the fraction that is spurious. (See section ‘Spurious Signal’ in
Table of descriptive statistics for each of the 9 mutually exclusive gene structure annotations for the 11,446 genes of our data set. The “Exp nber” and “Fraction” columns of the table are based on the posterior probabilities to be a genuine eQTN from the hierarchical model: left side for TSS-only+annotation model and right side for TSS+TES+annotation model.
Table of descriptive statistics for each of the 8 mutually exclusive gene structure annotations for the 11,446 genes of our data set.
Table of descriptive statistics for each of the 5 functional annotations for the 11,446 genes of our data set.
Supplementary methods.
We thank Abraham Palmer, Marcelo Nobrega, Kevin Bullaughey, Graham Coop, and other members of the Pritchard, Przeworski and Stephens groups for discussions and comments, and the anonymous reviewers for extensive comments on the manuscript.