Conceived and designed the experiments: JBV JKP MS YG. Analyzed the data: JBV JKP JP DJG. Wrote the paper: JBV JKP MS.
JBV worked as a consultant on this project via his self-employed company Biominglabs and has nothing to declare regarding any patents, products in development or marketed products etc in connection with the study. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials.
Mapping of expression quantitative trait loci (eQTLs) is an important technique for studying how genetic variation affects gene regulation in natural populations. In a previous study using Illumina expression data from human lymphoblastoid cell lines, we reported that cis-eQTLs are especially enriched around transcription start sites (TSSs) and immediately upstream of transcription end sites (TESs). In this paper, we revisit the distribution of eQTLs using additional data from Affymetrix exon arrays and from RNA sequencing. We confirm that most eQTLs lie close to the target genes; that transcribed regions are generally enriched for eQTLs; that eQTLs are more abundant in exons than introns; and that the peak density of eQTLs occurs at the TSS. However, we find that the intriguing TES peak is greatly reduced or absent in the Affymetrix and RNA-seq data. Instead our data suggest that the TES peak observed in the Illumina data is mainly due to exon-specific QTLs that affect 3′ untranslated regions, where most of the Illumina probes are positioned. Nonetheless, we do observe an overall enrichment of eQTLs in exons versus introns in all three data sets, consistent with an important role for exonic sequences in gene regulation.
Polymorphisms that impact gene regulation play an important role in disease genetics and adaptive evolution
In previous work, we developed a Bayesian hierarchical method for studying the distribution of eQTLs with respect to their target genes, and for identifying biological annotations that can predict the locations of causal sites
Here we revisit the TES peak to understand better the mechanism that generates this intriguing signal. Using expression data for the same samples from independent experiments and different technologies, our new analysis suggests that exon-specific effects are in fact responsible for most, if not all, of the 3′ UTR peak that we saw previously. However, we find that our previous result showing an overall eQTL enrichment in exons compared to introns is supported by all three data sets.
For this analysis we used data from the 210 unrelated HapMap samples in the original HapMap Phase I/II cell lines
We analyzed expression measurements obtained using three distinct technologies:
Illumina gene array data from a total of 210 CEU, CHB, JPT and YRI samples
Affymetrix exon array data from 117 CEU and YRI samples
RNA sequence data from 102 CEU and YRI samples
Note that the 117 individuals in the Affymetrix data set and the 102 individuals in the RNA-seq data set are both subsets of the 210 individuals in the Illumina data set (and in both cases include the majority of the CEU and YRI samples). The original RNA-seq data sets included a few individuals that were not in either the Illumina data or Phase I/II HapMap; to simplify the genotype imputation pipeline, these individuals were excluded from the analysis.
To avoid the impact of spurious associations caused by SNPs falling within the probes of the two array data sets (Illumina and Affymetrix), we systematically removed all probes containing at least one SNP. We also removed all probes impacted by short insertions/deletions or copy number variations (CNV) based on the genomic coordinates of these structural variants as provided by
For eQTL mapping, we used standard linear regression to test every SNP within the transcript or 100 kb from either end of the transcribed region for association with gene expression.
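The cis-scan described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names (`cis_window`, `regress`, `scan_gene`), the genotype coding as 0/1/2 dosages, and the input layout are all assumptions made for the example. It returns the regression slope and t statistic (n − 2 degrees of freedom) for each SNP in the window; turning the t statistic into a p-value would require a t-distribution CDF, which is left out here.

```python
import math

def cis_window(tss, tes, flank=100_000):
    """Candidate region: the transcribed region plus `flank` bp on each side."""
    left, right = min(tss, tes), max(tss, tes)
    return left - flank, right + flank

def regress(genotypes, expression):
    """Simple linear regression of expression on genotype dosage (assumed 0/1/2).

    Returns (slope, t_statistic); the t statistic has n - 2 degrees of freedom.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        return 0.0, 0.0  # monomorphic SNP: no association can be tested
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, expression))
    slope = sxy / sxx
    rss = sum((y - mean_y - slope * (g - mean_g)) ** 2
              for g, y in zip(genotypes, expression))
    if rss == 0:
        return slope, math.copysign(math.inf, slope)  # perfect fit
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / std_err

def scan_gene(snps, expression, tss, tes, flank=100_000):
    """Test every SNP inside the cis window.

    `snps` maps a SNP name to (position, genotype list); this layout is hypothetical.
    """
    start, end = cis_window(tss, tes, flank)
    return {name: regress(geno, expression)
            for name, (pos, geno) in snps.items()
            if start <= pos <= end}
```

In a full analysis, the most significant SNP per gene would then be taken as the one with the largest absolute t statistic among those returned by `scan_gene`.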
The left panels of
The left-hand column plots the distribution of locations of most significant SNPs for each technology; the red arrows indicate the location of the TES peak observed in the Illumina data. SNPs outside genes are assigned to bins based on their physical distance from the TSS (for upstream SNPs), or TES (downstream SNPs). SNPs inside genes are assigned to bins based on their fractional location within the gene. The plotted gene size is the average gene length in the data. To provide a formal comparison among different models, the right-hand column displays the difference in Akaike Information Criterion (AIC) values between different parameterizations of our Bayesian hierarchical model (see
Overall, the Affymetrix probes are spread roughly evenly across exons while the Illumina probes are 3′ biased. By analyzing only those Affymetrix probes that are in the same exons as Illumina probes, we create an apparent 3′ signal peak. For the sake of comparison, the grey line represents the original distribution as plotted in
To assess the evidence for a TES peak more quantitatively, we computed the AIC (Akaike Information Criterion) for models with and without a special TES effect (
One major difference between the Illumina data and the other two data sets is that a large fraction of the Illumina probes are positioned in the last exon (85% for Illumina compared to 21% for Affymetrix). To assess whether the Illumina probe placement might have helped to create the peak of signals at the TES, we filtered the Affymetrix data to include only those Affymetrix probes that are in the same exon as an Illumina probe (and hence the filtered data set includes mainly probes in the last exons of genes). When we did this, we observed that indeed the filtered Affymetrix data set showed a much stronger peak of eQTLs in the last exon (
One plausible explanation is that ungenotyped SNPs in Illumina probes generate spurious eQTL signals that are often detected through nearby SNPs; such an effect could produce a spurious 3′ peak of eQTLs. However, this does not appear to be the case. The original analysis of Veyrieras
Recent work has shown that there are many SNPs that impact the expression levels of individual exons, while not necessarily affecting the overall expression levels of genes
To evaluate this hypothesis, we computed exon-specific expression levels in each individual using both the Affymetrix and the RNA-seq data sets, controlling for the overall expression level of the gene in that individual to remove the impact of gene-level eQTLs (see Materials and
We determined the most significant SNP for each Illumina eQTL, and then tested every such SNP for association at the gene and exon levels using the Affymetrix and RNA-seq data. Here we show QQ-plots for these Illumina eQTNs in the exon-level analysis (left) and the gene-level analysis (right), using the Affymetrix exon array data (top) and RNA-seq data (bottom). The color codes correspond to 5 exclusive categories of the Illumina eQTNs with respect to the target gene: intergenic, exonic (first, internal and last) or intronic (intron). Note that last-exon Illumina eQTNs tend to replicate well at the exon level, but poorly at the gene level, suggesting that these are frequently exon-QTLs but infrequently gene-QTLs.
As an illustration,
For each panel, we display quantile-normalized expression levels. Data for each genotype at SNP rs8984 are represented with the same color code (orange, grey and green) for all the panels. The top panel plots the mean exon expression levels along the gene as measured by the Affymetrix probes and provides on top of each exon the p-value for the association between the exon expression levels and the SNP genotypes. The blue vertical bar indicates the position of the single Illumina probe. The middle panel is a schematic representation of the gene: exons are plotted as black/green rectangles where the green color indicates coding regions. The position of SNP rs8984 is indicated by a red arrow. The bottom panel provides the box plots corresponding to each analysis: from left to right, specific Affymetrix last exon expression levels (p-value = 3×10⁻¹¹), Affymetrix gene expression levels (p-value = 0.04) and Illumina gene expression levels (p-value = 3×10⁻²⁷).
In our previous work, we also reported a 2-fold enrichment of eQTNs within internal exons compared to introns
TSS: the model accounts only for distance from the TSS;
TSS, intragenic: same as previously plus an annotation for SNP being within the transcribed region of the target gene;
TSS, intron, exon: as previously but the intragenic annotation is split into exclusive intron and exon categories;
TSS, intron, exon, last exon: as previously but the exon category is split into exclusive exon (except the last) and last exon categories.
For each of the three data sets, we ran each model separately and then selected the best model based on the Akaike Information Criterion (AIC).
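The model-selection step can be illustrated with a short sketch. It uses the standard definition AIC = 2k − 2 ln L, where k is the number of free parameters and L the maximized likelihood, and picks the model with the lowest value. The function names and the dictionary layout are assumptions made for this example, and no log-likelihoods or parameter counts from the actual fits are implied.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def best_model(fits):
    """Select the model with the lowest AIC.

    `fits` maps a model name to (maximized log-likelihood, number of parameters);
    this layout is hypothetical.
    Returns (best model name, dict of AIC differences relative to the best model).
    """
    scores = {name: aic(loglik, k) for name, (loglik, k) in fits.items()}
    best = min(scores, key=scores.get)
    deltas = {name: score - scores[best] for name, score in scores.items()}
    return best, deltas
```

The AIC differences returned alongside the winner show how much support is lost by moving to each competing parameterization, which is the quantity typically plotted when comparing nested annotation models.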
Model | Annotation | Illumina odds ratio [95% CI] | Affymetrix odds ratio [95% CI] | RNA-seq odds ratio [95% CI] |
1 | intragenic/intergenic | 7.51 [6.70, 8.43] | 4.25 [3.29, 5.53] | 9.12 [6.36, 13.41] |
2 | exon/intron | 12.13 [10.78, 13.60] | 11.13 [8.54, 14.33] | 6.68 [4.37, 9.89] |
3 | exon (except last)/intron | 5.95 [5.05, 6.96] | 8.67 [6.42, 11.49] | 7.29 [4.69, 10.94] |
3 | last exon/intron | 28.66 [24.71, 33.13] | 21.46 [13.69, 31.94] | 4.03 [0.86, 10.50] |
The table displays the odds ratio estimates together with their corresponding 95% confidence intervals, as estimated by the empirical Bayesian model (see
In this analysis we have shown that the sharp peak of eQTNs previously observed at the TES in Illumina eQTL data
Unlike the TES signal, however, we find that the previously reported enrichment of eQTNs within exons compared to introns
For all three data sets, we found that there is an enrichment of eQTNs within transcribed regions, controlling for distance from the TSS. That is, a SNP at a distance
In summary, we have shown that the TES peak of eQTLs that was observed previously was most likely driven by QTLs affecting the last exon only. In addition, we have confirmed the enrichment of eQTNs within exons compared to introns; and in transcribed regions compared to downstream intergenic regions, controlling for distance.
For this project we used data from 210 unrelated individuals studied in Phases I and II of the HapMap Project (i.e., all the Chinese and Japanese individuals plus the parents from the Yoruba and CEU trios). The genotype estimates were based on a combination of the 1000 Genomes and HapMap data. These genotypes should include most common variants in the non-repetitive fraction of the genome. For all HapMap SNPs we used the HapMap genotype calls from release 24 of HapMap Phase II
All the expression datasets were preprocessed using the same gene models, based on the hg18 Ensembl gene annotation track downloaded from the UCSC web site on 12/31/2009.
We used data from the 210 unrelated individuals in Stranger
Then we removed “non-expressed” probes by visual inspection of a median versus median absolute deviation (MAD) scatter plot augmented with the fraction of expressed genes at a given MAD value as derived from the RNA-seq datasets. From this visualization it is clear that there are two populations of probes, one “expressed” population with moderate to high MAD values and a “non-expressed” population of probes with low MAD values. This analysis indicated that probes with low MAD are generally non-expressed. We thus removed 8,214 “non-expressed” probes from the original set which yields a core set of 10,264 probes.
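The per-probe summary behind the median versus MAD plot can be sketched as follows. The cutoff in the study was chosen by visual inspection, so the `mad_cutoff` argument here is a purely hypothetical stand-in for that judgment; the function names and the probe dictionary layout are likewise assumptions.

```python
from statistics import median

def median_and_mad(values):
    """Return (median, median absolute deviation) of one probe across samples."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return med, mad

def split_by_mad(probes, mad_cutoff):
    """Partition probes into (expressed, non_expressed) name lists.

    `probes` maps a probe name to its expression values across samples.
    `mad_cutoff` is a hypothetical threshold; the paper set the boundary by eye.
    """
    expressed, non_expressed = [], []
    for name, values in probes.items():
        _, mad = median_and_mad(values)
        (expressed if mad >= mad_cutoff else non_expressed).append(name)
    return expressed, non_expressed
```

Plotting the `(median, mad)` pairs for all probes would reproduce the kind of scatter plot described above, in which the low-MAD population corresponds to probes that show essentially no variation across individuals.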
Since expression measurements are susceptible to large technical variability and since we are looking only at cis-eQTLs, we performed a principal component-based adjustment of the expression dataset similar to that previously described by our group
We downloaded the raw Affymetrix Human Exon 1.0 ST array CEL files published by Huang
We performed GC-bin background correction of each array followed by a quantile normalization on natural scale within each population (CEU and YRI) using an in-house implementation. We then removed “non-expressed” probes defined as probes with a median normalized intensity level below
To compute entire-gene expression levels we used a simple summarization approach based on the median polish procedure
We obtained published RNA-seq data from 60 CEU individuals and 75 YRI individuals
From the original 241,639 exons we removed 118,548 exons for which the median counts within both populations were 0. Using this core set of 123,091 exon level expression measurements we performed a PCA-based adjustment but this time separately within each population (since sequencing was performed in two distinct environments each one may have its own hidden factor structure). Thus, including sex as a covariate, we regressed out 12 specific PCs for the YRI dataset and 16 other specific PCs for the CEU dataset.
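The "regressing out" step can be sketched as a least-squares projection. The example below assumes the principal component scores have already been computed for each population (for instance by a singular value decomposition of the centered expression matrix, which is not shown) and removes from one expression vector everything explained by an intercept, sex, and the chosen PCs. The function names are hypothetical, and the Gram-Schmidt approach is one simple way to do this, not necessarily the one used in the study.

```python
import math

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def regress_out(expression, covariates):
    """Return residuals of `expression` after removing an intercept and all covariates.

    `covariates` is a list of per-sample vectors (e.g. sex plus precomputed PC scores).
    The covariates are orthonormalized by Gram-Schmidt, then their projections are
    subtracted, which is equivalent to taking ordinary least-squares residuals.
    """
    n = len(expression)
    basis = []
    for cov in [[1.0] * n] + [list(c) for c in covariates]:
        v = cov
        for b in basis:
            coef = dot(v, b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-12:  # skip covariates that are linearly dependent on earlier ones
            basis.append([vi / norm for vi in v])
    resid = list(expression)
    for b in basis:
        coef = dot(resid, b)
        resid = [ri - coef * bi for ri, bi in zip(resid, b)]
    return resid
```

Applied to every exon within a population, with sex and that population's PCs (12 for YRI, 16 for CEU in the text above) as covariates, this yields the adjusted expression values used for mapping.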
For subsequent analyses, we used only the 43 CEU and 59 YRI samples that were included within our genotype dataset. As for the expression arrays, gene expression levels were computed by applying a median polish procedure to the sample × exon expression level matrix, thus removing exon-specific effects. Similarly, exon-specific expression levels were derived from the residuals of the gene-level median polish procedure
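A minimal sketch of Tukey's median polish on a sample × exon matrix is given below, to make the decomposition concrete. Each entry is split into an overall level, a sample (row) effect, an exon (column) effect, and a residual. Under the reading used here, the gene-level expression of a sample is the overall level plus its row effect, and the residuals play the role of exon-specific expression. The function name, iteration limit, and tolerance are assumptions for illustration rather than the study's settings.

```python
from statistics import median

def median_polish(matrix, n_iter=10, tol=1e-9):
    """Tukey median polish with rows = samples and columns = exons.

    Returns (overall, row_effects, col_effects, residuals) such that
    matrix[i][j] = overall + row_effects[i] + col_effects[j] + residuals[i][j].
    """
    resid = [list(row) for row in matrix]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        change = 0.0
        # Row sweep: move each row median into the row effect.
        for i in range(n_rows):
            m = median(resid[i])
            resid[i] = [v - m for v in resid[i]]
            row_eff[i] += m
            change += abs(m)
        m = median(col_eff)
        col_eff = [c - m for c in col_eff]
        overall += m
        # Column sweep: move each column median into the column effect.
        for j in range(n_cols):
            m = median([resid[i][j] for i in range(n_rows)])
            for i in range(n_rows):
                resid[i][j] -= m
            col_eff[j] += m
            change += abs(m)
        m = median(row_eff)
        row_eff = [r - m for r in row_eff]
        overall += m
        if change < tol:  # no median moved: the fit has converged
            break
    return overall, row_eff, col_eff, resid
```

Because medians rather than means are swept out, a single aberrant exon in one sample ends up in the residuals instead of distorting that sample's gene-level estimate.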
For the technical details of the statistical analyses we invite the reader to refer to our previous article
for eQTN mapping we restricted the cis-candidate region to 100 kb around both gene ends (instead of 500 kb),
for sQTN mapping we used a 10 kb window around both exon ends, since it has been previously shown that sQTNs are mainly concentrated near the spliced exon
The exon-level FDR for the sQTN analysis has been obtained in the same way as the gene-level FDR described in
We thank members of the Gilad, Pritchard, Przeworski and Stephens labs for helpful discussions and the two anonymous reviewers for their helpful comments.