Conceived and designed the experiments: OS LP RD JW. Performed the experiments: OS LP. Analyzed the data: OS LP. Wrote the paper: OS LP RD JW. Developed methodology: OS LP RD JW.

The authors have declared that no competing interests exist.

Gene expression measurements are influenced by a wide range of factors, such as the state of the cell, experimental conditions and variants in the sequence of regulatory regions. To understand the effect of a variable of interest, such as the genotype of a locus, it is important to account for variation that is due to confounding causes. Here, we present VBQTL, a probabilistic approach for mapping expression quantitative trait loci (eQTLs) that jointly models contributions from genotype as well as known and hidden confounding factors. VBQTL is implemented within an efficient and flexible inference framework, making it fast and tractable on large-scale problems. We compare the performance of VBQTL with alternative methods for dealing with confounding variability on eQTL mapping datasets from simulations, yeast, mouse, and human. Bayesian complexity control and joint modelling yield more precise estimates of the contributions of different confounding factors, which in turn reveals additional associations to measured transcript levels compared to alternative approaches. We present a threefold larger collection of

Gene expression is a complex phenotype. The measured expression level in an experiment can be affected by a wide range of factors—state of the cell, experimental conditions, variants in the sequence of regulatory regions, and others. To understand genotype-to-phenotype relationships, we need to be able to distinguish the variation that is due to the genetic state from all the confounding causes. We present VBQTL, a probabilistic method for dissecting gene expression variation by jointly modelling the underlying global causes of variability and the genetic effect. Our method is implemented in a flexible framework that allows for quick model adaptation and comparison with alternative models. The probabilistic approach yields more accurate estimates of the contributions from different sources of variation. Applying VBQTL, we find that common genetic variation controlling gene expression levels in human is more abundant than previously shown, which has implications for a wide range of studies relating genotype to phenotype.

DNA microarray technologies allow for quantification of expression levels of thousands of loci in the genome. These measurements enable exploring how a variable, such as clinical phenotype, tissue type, or genetic background, affects the transcriptional state of the sample. Recently, gene expression levels have been studied as quantitative genetic traits, investigating the effect of genotype as the primary variable. Studies have found and characterised large numbers of expression quantitative trait loci (eQTLs)

An important issue in such studies is additional variation in expression data that is not due to the genetic state, as illustrated in

The

In practice it is not possible to measure or even be aware of all potential sources of variation, but nevertheless it is important to account for them. Unobserved,

The challenge in modelling several confounding sources of variation (

The key for correctly attributing expression variability is controlling the complexity of the statistical models for each source of variation. For example, the number of genotypes considered in an association scan can be enormous, and not all of them affect the expression level of every probe. Threshold values, obtained from likelihood ratio statistics or empirical p-value distributions, can be used to determine the significance of individual associations, thereby avoiding overfitting by controlling the model complexity

In this work we present VBQTL (Variational Bayesian QTL mapper), a joint Bayesian framework for gene expression variability that accounts for the signal from genotype, known factors, and hidden factors. VBQTL is implemented within a general framework that provides commonly used models for sources of phenotypic variation, which can be combined as needed. While previous attempts have been specific to a narrow set of underlying sources, our approach is flexible and can be adapted to a particular study design. The probabilistic treatment allows uncertainty to be propagated between models, and yields a posterior distribution over model parameters. Complexity control is tackled at the level of individual models, where parameters are regularised in a Bayesian manner.

We compare the performance of VBQTL with existing approaches for detecting expression QTLs. A simulation experiment contrasts VBQTL with common approaches that use non-Bayesian techniques for distinguishing global hidden factor effects from genetic effects. This study highlights differences in the methodology used to control model complexity, with implications for eQTL detection power. It also demonstrates both the necessity and the difficulty of accounting for variability that confounds the genetic signal. Results on datasets from a human outbred population and crosses of inbred yeast and mouse strains show that VBQTL identifies more significant associations than alternative methods. Finally, we apply VBQTL to perform a whole-genome eQTL scan on the HapMap phase 2 expression and genotype data, demonstrating the scalability of our framework to large numbers of samples and probes. We find three times more

Here, we present VBQTL, a configuration of a general framework for modelling diverse sources of gene expression variability. The model underlying this framework assumes that gene expression levels are influenced by additive effects from independent sources, e.g. in the case of VBQTL these are contributions from genotype, known factors, and hidden factors (

We perform Bayesian inference in the joint model, which is appealing for several reasons. First, it allows possible dependencies between the different sources of variation to be captured. The effects of the genotype, known and hidden factors are learned jointly, taking other parts of the model into account. Propagation of uncertainty leads to more accurate parameter estimates

In the following, we present the mathematical model of VBQTL, and an outline of the inference procedure. We then describe alternative non-Bayesian models for expression QTL studies used in the experiments. An in-depth treatment of the framework including full details about the parameter estimation is provided in

The observed gene expression matrix

To reduce the computational cost, inference in the association model is approximated, only considering a single most relevant SNP-regulator per gene, with the other

Parameter inference in VBQTL is implemented using variational Bayesian learning

The current belief of the residual dataset for a particular active model is calculated, taking the predictions from all other models and the estimated noise precision into account (

The parameters of the active

The distribution of the model contribution

The same procedure is in turn applied to the remaining models in the schedule (

This iterative procedure, performing updates of local parameter distributions in turn, can be interpreted as a message passing algorithm, where sufficient statistics of parameter and data distributions are propagated across the graphical model

The initial values of parameters are determined from maximum likelihood solutions. A random initialisation via sampling from the prior is possible as well; we have not explored the implications of this alternative here. Details on inference and the individual parameter update equations are given in

In experiments, we compare two alternative inference schedules of VBQTL. In iterative VBQTL (iVBQTL), the model parameters are learned using several iterations through all model components, first updating the genetic model, then known and hidden factors (

In cases where neither known nor hidden factors are correlated with the genetic state, their effect can be learned independently without running the risk of explaining away meaningful association signal. This motivates fast VBQTL (fVBQTL), which performs a single update iteration of the full model, first inferring the contribution from the known and hidden factors, and then from the genetic state. This simpler schedule can save significant computation time, since the factor effects can be precalculated, and only a single iteration of the computationally more expensive genetic association model is needed. In cases where the genetic state is approximately orthogonal to the known and hidden factors, this cheaper approximation performs as well as iVBQTL for finding genetic associations (
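The difference between the two schedules can be illustrated with a deliberately simplified point-estimate analogue. The sketch below is not the variational Bayesian inference used by VBQTL: it replaces posterior updates with a least-squares fit (known factors), a rank-k SVD (hidden factors), and a single best-SNP regression per gene (genetic model), and it has no ARD or noise model. All function names and the matrix layout (samples in rows) are assumptions made for illustration.

```python
import numpy as np

def fit_known(R, F):
    """Least-squares effect of known factors F (samples x C) on R (samples x genes)."""
    B, *_ = np.linalg.lstsq(F, R, rcond=None)
    return F @ B

def fit_hidden(R, k):
    """Rank-k approximation of R via SVD, a stand-in for a hidden factor model."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def fit_genetic(R, S):
    """Per gene, regress on the single SNP (column of S) with the largest projection."""
    Sc = S - S.mean(axis=0)                 # centred genotypes, samples x snps
    norms = np.linalg.norm(Sc, axis=0)
    norms[norms == 0] = 1.0                 # guard against monomorphic SNPs
    Sn = Sc / norms                         # unit-norm SNP columns
    proj = Sn.T @ R                         # snps x genes regression coefficients
    best = np.abs(proj).argmax(axis=0)      # most relevant SNP per gene
    genes = np.arange(R.shape[1])
    return Sn[:, best] * proj[best, genes]  # samples x genes prediction

def run_schedule(Y, S, F, k, order, n_iter):
    """Update each model in turn on the residual left by all other models."""
    fits = {"genetic": lambda R: fit_genetic(R, S),
            "known": lambda R: fit_known(R, F),
            "hidden": lambda R: fit_hidden(R, k)}
    pred = {name: np.zeros_like(Y) for name in order}
    for _ in range(n_iter):
        for name in order:
            others = sum(pred[m] for m in order if m != name)
            pred[name] = fits[name](Y - others)
    return pred

# iVBQTL-like schedule: several sweeps, genetic model updated first.
# pred_i = run_schedule(Y, S, F, k=5, order=("genetic", "known", "hidden"), n_iter=10)
# fVBQTL-like schedule: one sweep, factors first, genetic model last.
# pred_f = run_schedule(Y, S, F, k=5, order=("known", "hidden", "genetic"), n_iter=1)
```

In the single-sweep ordering the factor contributions do not depend on the genotypes, which is why they can be precomputed once; the iterative ordering lets the genetic fit claim its share of variance before the factor models see the residual.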

We compared VBQTL with previous methods that account for confounding variance in the context of expression QTL mapping. Like VBQTL, they model known and hidden factors in the expression levels. The alternative methods differ in the hidden factor model used, and these models in turn differ in the complexity control approach employed, as highlighted below. The alternative models are therefore named after the hidden factor estimation method.

For a quantitative evaluation of the performance of each method, we considered the resulting residuals of the estimated effects from known and hidden factors. To detect eQTLs we applied standard statistical tests employing a linear model on the SNP genotype on these residual datasets (
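As a rough illustration of this residual-based evaluation, the sketch below removes the top k principal components from an expression matrix and then computes the Pearson correlation between every SNP and every residual expression profile. It is a minimal PCA-style example under assumed conventions (samples in rows, hypothetical function names), not the exact procedure of any of the compared methods.

```python
import numpy as np

def remove_top_pcs(Y, k):
    """Residuals of Y (samples x genes) after removing its top-k principal components."""
    Yc = Y - Y.mean(axis=0)                          # centre each gene
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    hidden = (U[:, :k] * s[:k]) @ Vt[:k]              # rank-k reconstruction
    var_explained = (s[:k] ** 2).sum() / (s ** 2).sum()
    return Yc - hidden, var_explained

def correlation_scan(R, S):
    """Pearson correlation of each SNP (column of S) with each gene (column of R).

    Columns with zero variance would produce NaN and should be filtered first.
    """
    Rz = (R - R.mean(axis=0)) / R.std(axis=0)
    Sz = (S - S.mean(axis=0)) / S.std(axis=0)
    return Sz.T @ Rz / R.shape[0]                    # snps x genes
```

The fraction of variance explained returned by `remove_top_pcs` is the quantity that grows towards one as k approaches the number of samples, which is where overfitting sets in.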

While VBQTL shares basic assumptions with these alternatives, there are a number of differences. First, it is a probabilistic model that operates with uncertainties in the parameter estimates as explained above. Second, the hidden factor model allows for non-orthogonal components, and provides probabilistic complexity control based on ARD. Third, the iVBQTL schedule takes the genetic signal into account when estimating the hidden factor effect. Finally, the VBQTL model estimates a global gene-specific noise level, while the non-Bayesian models either estimate noise levels implicitly (SVA) or assume noise-free observations (PCA, PCAsig).

We employed a simulated dataset to highlight the differences between alternative approaches to account for global factors in eQTL finding. Our synthetic expression data combines linear effects from genetic associations (eQTLs), known, hidden, and genetic global factors, and gene-specific noise (
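A dataset of this additive form can be generated along the following lines. This is only a sketch: the sample sizes, effect-size distributions, binary genotype coding, and the way transcription factor genes pass their genetic signal downstream are all assumptions chosen for illustration, not the settings used in the study.

```python
import numpy as np

def simulate(n_samples=80, n_snps=100, n_genes=400, n_known=3,
             n_hidden=7, n_tf=3, n_eqtl=40, seed=0):
    """Simulate expression as a sum of linear effects plus gene-specific noise."""
    rng = np.random.default_rng(seed)
    S = rng.integers(0, 2, size=(n_samples, n_snps)).astype(float)  # genotypes
    F = rng.normal(size=(n_samples, n_known))                       # known factors
    X = rng.normal(size=(n_samples, n_hidden))                      # hidden factors

    # Direct eQTLs: each of the first n_eqtl genes is driven by one random SNP.
    theta = np.zeros((n_snps, n_genes))
    snp_idx = rng.integers(0, n_snps, size=n_eqtl)
    theta[snp_idx, np.arange(n_eqtl)] = rng.normal(size=n_eqtl)
    genetic = S @ theta

    known = F @ rng.normal(size=(n_known, n_genes))
    hidden = X @ rng.normal(size=(n_hidden, n_genes))
    noise = rng.normal(size=(n_samples, n_genes)) * rng.uniform(0.5, 1.5, size=n_genes)

    # Genetic global factors: the first n_tf genes act as transcription factors
    # whose (genetically driven) expression affects all other genes.
    tf_expr = genetic[:, :n_tf] + noise[:, :n_tf]
    tf_weights = rng.normal(size=(n_tf, n_genes))
    tf_weights[:, :n_tf] = 0.0
    downstream = tf_expr @ tf_weights

    Y = genetic + known + hidden + downstream + noise
    return {"Y": Y, "S": S, "F": F, "X": X, "theta": theta, "genetic": genetic,
            "known": known, "hidden": hidden, "downstream": downstream, "noise": noise}
```

Because every component is returned alongside the observed matrix, the error of a method in recovering, say, the hidden factor contribution can be measured directly against the simulated truth.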

We assessed the ability of the considered methods to recover the simulated confounding variability. For those approaches that do infer hidden factor effects, we varied the corresponding complexity control parameters to investigate the influence on performance. For methods that take the number of components in the hidden factor model as a parameter (PCA, VBQTL), performance for one to 50 hidden factors was compared. For significance-testing based methods, we considered different significance cutoffs

iVBQTL correctly captured the non-genetic global factor effects (

Complexity control settings determined the performance of capturing the simulated global effects on expression levels. PCA was most accurate when the number of hidden factors was set to 10, since seven hidden factors and three transcription factors were simulated. For larger numbers of components, PCA overfitted and started explaining away genetic signal, which increased the error. For a small number of components, transcription factor effects were explained away first, which increased the error in estimating the hidden factors alone. However, the estimates of the total global effects improved. PCAsig and SVA found 6 and 7 significant hidden factors for the wide range of significance cutoffs,

We determined the sensitivity and specificity of the considered methods for detecting the immediate and downstream simulated genetic associations. The significance of an eQTL was tested using a two-sided t test on the correlation coefficient with a
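For reference, the usual form of a two-sided t test on a correlation coefficient with Bonferroni correction can be written as follows. The symbols are notation introduced here for illustration: r is the Pearson correlation between a probe's expression level and a SNP genotype, N the number of samples, M the number of SNPs tested per probe, and α the target per-probe false positive rate; the specific thresholds are those stated in the text and the Supplementary Methods.

```latex
t = r \sqrt{\frac{N-2}{1-r^{2}}}, \qquad
p = 2 \Pr\left( T_{N-2} \ge |t| \right), \qquad
\text{call an eQTL if } p < \frac{\alpha}{M},
```

where T_{N-2} follows a Student t distribution with N-2 degrees of freedom under the null hypothesis of no association.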

The accuracy of the hidden factor effect estimation mirrored the immediate eQTL finding sensitivity (

All methods except iVBQTL and the standard method explained away simulated

Taken together these results suggest that it is important to account for the confounding sources of variation in expression levels, while keeping the signal of the genetic state. Correct complexity control is required to avoid over- and underfitting in order to achieve optimal sensitivity for detecting true genetic associations.

Next, we compared the same methods for expression QTL finding on yeast

We applied the considered methods on the genotype and expression data from 90 individuals of the CEU (CEPH from Utah) HapMap phase 2 samples

The standard method found the fewest gene probes with a

Significance-testing based methods (PCAsig, SVA) identified the same number of factors for a wide range of cutoff values (

PCA, the simplest method for accounting for hidden factors, found additional associations when up to 30 principal components were used, but substantially fewer for 60 components. This is expected, since there are no more than 90 degrees of freedom in this dataset, and 60 principal components accounted for over

The significance-testing based methods, SVA and PCAsig both found additional associations compared to the standard method. It is remarkable that both found a constant number of significant hidden factors for the wide range

fVBQTL and iVBQTL found more probes with an association (55 and 54) than all other methods, representing an almost threefold increase in the number of genes with a

It is important to note that the model performance depends on two aspects. First, the model complexity control, regulating the amount of variance explained, is important to ensure that genetic signal is not attributed to hidden factors. The overfitting of PCA with a large number of components is an example of such an effect. Second, while alternative hidden factor models explained similar amounts of variance, their performance differed due to the underlying model. For example, PCA and fVBQTL both explained about 70% of variance in the observed expression levels (

Next, we applied the methods to two datasets of inbred strain crosses. The yeast expression dataset _{2} mouse lines, and genotypes at

The relative performance of different methods was similar to their ability to detect

All methods found additional

The effect of pivotal loci has been observed before, and interpreted in different ways

Previous methods do not provide consistent ways of dealing with this issue. The SVA authors also suggest removing the effect of the primary variable first. However, the authors do not consider accounting for the genetic effect in their application to the same yeast dataset

Motivated by the results of the initial study of a single human chromosome, we applied fVBQTL, learning 30 hidden factors, to the 10,000 most variable expression probes of the HapMap 2 dataset. We searched for

On the CEU population, we found 1051 genes with a VBeQTL at false discovery rate (FDR) of

We repeated this genome-wide experiment on pooled populations. Due to the increased sample size, it was possible to detect additional associations. We found 2696 genes with a VBeQTL compared to 1045 genes with a standard eQTL at the 0.1% FPR (

Exploratory results indicate additional power to find

It is important to demonstrate that the additional associations found after removing the learned non-genetic factors are biologically meaningful. We provide evidence that the additional associations found in HapMap phase 2 data are real in three ways.

First, we investigated how many of the genes with a VBeQTL in each of the three populations individually were replicated using the standard method on a pooled data set containing all populations. Note that this will only validate weak associations that occur in multiple populations – we would not expect weak population-specific associations to be replicated in the pooled data set. However, we expect many of the associations to be replicated in multiple populations

Second, we evaluated to what extent the additional genes with a VBeQTL in a single population were replicated in other populations. For instance,

Finally, we validated that the locations of the novel associations are distributed similarly to the original ones. We analysed the distribution of the position of additional

The hidden factor models hypothesise a set of unobserved non-genetic factors that influence the measured gene expression levels. To gain insights into their interpretation we considered correlations to known effects such as gender, population or environment, and the sets of genes most influenced.

We applied fVBQTL to expression data from individuals of all three HapMap populations, and tested for correlation between the inferred hidden factors and the population and gender indicator variables. The resulting correlation coefficients (
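A correlation table of this kind can be computed in a few lines. The sketch below assumes the inferred factors and the indicator variables are available as sample-by-variable matrices; the function name is hypothetical, and the 0.6 cutoff mirrors the highlighting threshold used in the corresponding supplementary table.

```python
import numpy as np

def factor_indicator_correlation(X, Z):
    """Pearson correlation between hidden factors X (samples x K)
    and indicator variables Z (samples x D); returns a K x D matrix."""
    k = X.shape[1]
    C = np.corrcoef(X, Z, rowvar=False)   # (K + D) x (K + D) correlation matrix
    return C[:k, k:]                      # keep only the factor-vs-indicator block

# Example: flag factor/indicator pairs with strong correlation.
# C = factor_indicator_correlation(X, Z)
# strong = np.abs(C) > 0.6
```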

A recent study in yeast looked for changes in eQTLs when segregating strains were grown in different media

The global factors identified can be further analysed for biological signals, looking for GO term over-representation in the genes that they affect. We used the ordered GO profiling method

We have presented VBQTL, a probabilistic model to dissect gene expression variation in the context of genetic association studies. The model is implemented in a Bayesian inference framework that allows uncertainty to be propagated between different parts of the model, and yields posterior distributions over parameter estimates for more sensitive analysis. In comparative eQTL mapping experiments, VBQTL outperformed alternative methods for eQTL finding on simulated and real data. In the most striking example, VBQTL found up to three times more eQTLs than a standard method, and 45% more compared to the best alternative in the HapMap 2 expression dataset.

Our approach advances the methodology for understanding phenotypic variation. The implementation of a flexible framework allows models for explaining the observed variability to be straightforwardly combined. Notably, non-Bayesian models can also be included, as we demonstrated with PCA, SVA, and linear regression models. VBQTL controls the model complexity at the level of all individual components of expression variability, thereby preventing over- and underfitting. Our experimental results on simulated and real data showed how explaining away too much variability removes some signal of interest from the data, and how failing to account for all sources of confounding variation decreases power to detect the relevant signal. When the variable of interest is correlated with many gene expression levels, its effect can be falsely explained away by the hidden factor model. We showed that in such settings the choice of an iterative schedule helps to ensure that variability is explained by the appropriate part of the model. There can be no silver bullet solution that provides perfect results in any scenario with no supervision. Instead, modelling assumptions must be made explicit, and incorporated in the analysis, as is elegantly done in the Bayesian setting.

VBQTL and other methods that account for hidden factors all found additional expression QTLs in the datasets studied compared to the standard method. It is remarkable that, with only 270 samples, and looking in one tissue type, we can find significant genetic associations to

In conclusion, we believe that VBQTL provides a principled and accurate way to study gene expression and other high-dimensional data. Increasingly complex models combining genetic and other effects can explain significantly more of the variance in observed phenotypes, as suggested by this study and others. Our general framework provides the flexibility to facilitate these richer models; for example, we have already started exploring interaction effects as an additional model of the framework. It will be interesting to see how these approaches can contribute to our understanding of human disease genetics, potentially involving intermediate phenotypes such as gene expression and other factors.

The software used in this study is freely available online at

Supplementary methods.

(0.23 MB PDF)

Supplementary results.

(0.86 MB PDF)

Sensitivity of recovering simulated eQTLs for alternative eQTL models. (a–b) Using a standard model for expression values, performing 2-tailed t tests on the statistic based on correlation coefficient between expression level and genotype. (c–d) Similar test for ranks of expression values. (e–f) Permutation test with 1000 permutations and 0.1% FPR. Bonferroni correction to 0.1% false positive rate was used for (a–d) to correct for multiple testing as detailed in

(0.30 MB PDF)

Sensitivity of recovering human eQTLs for alternative eQTL models. (a–b) Using a standard nested model for expression values, performing chi-squared tests with one degree of freedom on the log likelihood ratio for adding the genetic association term to the model. (c–d) Using a standard nested model for ranks of expression values, performing t tests with N-2 degrees of freedom as described in Supplementary Methods. Bonferroni correction to 1% false positive rate was used for both methods to correct for multiple testing as detailed in

(0.23 MB PDF)

Sensitivity of recovering yeast eQTLs for alternative eQTL models. (a–b) Using a standard model for expression values, performing 2-tailed t tests on the statistic based on correlation coefficient between expression level and genotype. (c–d) Similar test for ranks of expression values. Bonferroni correction to 0.1% false positive rate was used for both methods to correct for multiple testing as detailed in

(0.26 MB PDF)

Sensitivity of recovering mouse eQTLs for alternative eQTL models. (a–b) Using a standard model for expression values, performing 2-tailed t tests on the statistic based on correlation coefficient between expression level and genotype. (c–d) Similar test for ranks of expression values. Bonferroni correction to 0.1% false positive rate was used for both methods to correct for multiple testing as detailed in

(0.25 MB PDF)

Number of probes with a

(0.02 MB PDF)

Magnitude and fraction of overlap between probes with a Standard of fVBQTL

(0.02 MB PDF)

Overlap of VBQTLs in one population (2.) with standard eQTLs found when pooling the other two populations (3.). Overlaps are given both for all QTLs (2. & 3.) and only for additional ones (2. - 1. & 3. - 1.) compared to standard eQTLs in the population. Per-probe eQTL FPR = 0.1%, Bonferroni corrected for testing multiple SNPs per probe, 2-tailed t test.

(0.01 MB PDF)

Pearson correlation coefficient between top 6 factors learned on the pooled HapMap data, and 4 indicator variables relating to the background of the individual. Correlations with absolute value above 0.6 are highlighted.

(0.01 MB PDF)

Summary statistics for method performances on the human chromosome 19 dataset presented in the main text. The parameters for different methods are varied by the number of allowed factors K (PCA, VBQTL) or by the significance cutoff α (PCAsig, SVA). Hidden factor summary is given by the number of factors found and the variance explained by the hidden factor effects. The number of probes with a

(0.02 MB PDF)

Summary statistics for method performances on the yeast dataset presented in the main text. The parameters for different methods are varied by the number of allowed factors K (PCA, VBQTL) or by the significance cutoff α (PCAsig, SVA). Hidden factor summary is given by the number of factors found and the variance explained by the hidden factor effects. The number of probes with a

(0.02 MB PDF)

Summary statistics for method performances on the mouse dataset presented in the main text. The parameters for different methods are varied by the number of allowed factors K (PCA, VBQTL) or by the significance cutoff α (PCAsig, SVA). Hidden factor summary is given by the number of factors found and the variance explained by the hidden factor effects. The number of probes with a

(0.02 MB PDF)

The authors would like to thank Manolis Dermitzakis for helpful discussions and feedback, Rachel Brem, Eric Schadt and Barbara Stranger for access to gene expression and genotype data, Claude Beasley for assistance, Richard Bourgon for advice, and Alexandra Nica and members of the Durbin group for comments on the manuscript.