A Statistical Framework for Joint eQTL Analysis in Multiple Tissues

Mapping expression Quantitative Trait Loci (eQTLs) represents a powerful and widely adopted approach to identifying putative regulatory variants and linking them to specific genes. Up to now eQTL studies have been conducted in a relatively narrow range of tissues or cell types. However, understanding the biology of organismal phenotypes will involve understanding regulation in multiple tissues, and ongoing studies are collecting eQTL data in dozens of cell types. Here we present a statistical framework for powerfully detecting eQTLs in multiple tissues or cell types (or, more generally, multiple subgroups). The framework explicitly models the potential for each eQTL to be active in some tissues and inactive in others. By modeling the sharing of active eQTLs among tissues, this framework increases power to detect eQTLs that are present in more than one tissue compared with “tissue-by-tissue” analyses that examine each tissue separately. Conversely, by modeling the inactivity of eQTLs in some tissues, the framework allows the proportion of eQTLs shared across different tissues to be formally estimated as parameters of a model, addressing the difficulties of accounting for incomplete power when comparing overlaps of eQTLs identified by tissue-by-tissue analyses. Applying our framework to re-analyze data from transformed B cells, T cells, and fibroblasts, we find that it substantially increases power compared with tissue-by-tissue analysis, identifying 63% more genes with eQTLs (at FDR = 0.05). Further, the results suggest that, in contrast to previous analyses of the same data, the majority of eQTLs detectable in these data are shared among all three tissues.

1 Computational algorithm for fitting hierarchical model For the hierarchical model described in the main text, our primary interest is making inference on the parameter set Θ = (π 0 , η, λ). Here, we give details of an algorithm for inferring Θ, via maximum likelihood estimation based on the EM algorithm.

Notations
Throughout this section, we adopt the following additional notations. For gene k, we use a latent binary indicator z k to denote if there is any eQTL in its cis-region for any tissue type, in particular, We use a latent random indicator m k -vector s k to denote the true eQTL SNP conditional on z k = 1 and let s kp denote the p-th entry of s k . The "one cis eQTL per gene" assumption restricts s k can have at most one entry equaling 1 (with the remaining entries being 0). By this definition, and we also make the simplifying assumption that Furthermore, for gene k and SNP p, we index all configurations and use a (2 S − 1)-dimension latent indicator vector c kp to denote the actual configuration for the gene-SNP pair. In case the SNP is not an eQTL, Pr(c kp = 0|s kp = 0) = 1.
Otherwise, we assume the jth configuration is active with prior probability Joining the column vectors c kp for all m k SNPs, we obtain a latent (2 S − 1) × m k random matrix C k . Finally, we use the latent L-vector w kp indicate the actual prior effect size for active tissue types for the pair of gene k and SNP p. The m-th entry of the indicator is denoted by w kpm , for which we assume prior probability Pr(w kp = 0|s kp = 0) = 1, and Pr(w kpm = 1|s kp = 1) = λ m .
Joining the column vectors w kp for all m k SNPs, we obtain a latent L × m k random matrix W k

Maximum Likelihood Inference based on EM algorithm
In the maximum likelihood framework, we treat latent variables z k , s k , c k and w k , k = 1, . . . , G as missing data and apply the EM algorithm. For a total number of G genes, let z = (z 1 , . . . , z G ), S = (s 1 , . . . , s G ), C = (C 1 , . . . , C G ) and W = (W 1 , . . . W G ) denote the complete collection of latent variables. Let Y = (Y 1 , . . . Y G ) and G = (G 1 , . . . , G G ) denote the complete set of observed data. Based on the hierarchical model described in previous section, we can write out the complete data log-likelihood as follows, In (8), p 0 k denotes the likelihood of the null model for gene k, i.e., and is the Bayes Factor (pre-)computed for a fully specified alternative model. The EM algorithm searches for maximum likelihood estimate of Θ, by iteratively performing an expectation (E) step and a maximization (M) step.
In the E-step, for the t-th iteration, we evaluate the expectation of complete data log-likelihood (8) conditional on current estimate of parameter Θ (t) , G and Y . The computation is straightforward, for example, similarly, where and In the M-step, we find a new set of estimates, Θ (n+1) , by maximizing the conditional expectation E log p(Y , z, S, C, W |G, Θ)|Y , G, Θ (t) . In this case, the simultaneous maximization can be performed analytically. In particular, and Typically, we initiate the EM algorithm by setting Θ (0) to some random values and running iterations until some pre-defined convergence threshold is met (In practice, we monitor the increase of the the log-likelihood function between successive iterations, and stop the iterations as the increment becomes sufficiently small.).
We construct profile likelihood confidence intervals for estimated parameters. For example, a (1−α)% profile likelihood confidence set for π 0 is built as whereπ 0 ,η,λ are MLEs obtained from the EM algorithm.
2 Supplements for the simulations 2.1 Simulate eQTL data via the proportion of variance explained For a given gene-SNP pair at a time, we simulate data in S tissues according to a particular configuration. In a given tissue s ∈ {1, . . . , S} for which the SNP is an eQTL (β s = 0), let's define the proportion of variance in phenotype explained by the genotype: When working with standardized effect sizes b s = β s /σ s : As stated elsewhere ( [1]), we approximate the expectation of the PVE via a ratio of expectations, noted h: We assume that the genotypes are drawn from a Binomial distribution with parameters 2 and f , the minor allele frequency, so that E[V (X)] = 2f (1 − f ). Moreover, as we assume b s |b ∼ N (b, φ 2 ) and b ∼ N (0, ω 2 ), the marginal effect size is b s ∼ N (0, φ 2 + ω 2 ). We can hence approximate b 2 s by its variance. Therefore: By fixing h (e.g. 20%) as well as the minor allele frequency (e.g. 30%), we obtain: Now if we fix the heterogeneity in effect sizes (e.g. φ 2 /(φ 2 + ω 2 ) = 20%), we can deduce φ 2 and then ω 2 . We can hence drawb and then each b s |b.
Once we have them, it is straightforward to simulate the phenotype of the i th individual in the s th tissue: y is = b s σ s g i + N (0, σ 2 s ) with σ s being fixed at 1 for instance.

Implement the ANOVA/LR model in R
For each gene-SNP pair, the expression levels from all N individuals in all S tissues are recorded into a vector y of length N × S. The genotypes are appropriately repeated S times into a vector xg, and the tissue indicators are appropriately recorded into a vector xs. We can then use the ANOVA/LR model to test if there is an effect of the genotype with the following commands:

Calculate the empirical FDR from simulated eQTL data
For a simulated data set of G gene-SNP pairs, let z g be the test statistic of a given pair with g = 1, . . . , G. For the tissue-by-tissue method, we take as test statistic the minimum p-value across tissues. For the Bayesian method, the test statistic is the Bayes Factor. For the ANOVA/LR, the test statistic is the pvalue from the F test comparing the null model (tissue indicator only) with the unconstrained alternative (tissue indicator, genotype and their interaction).
All gene-SNP pairs can be classified as in the following table ( [2]): Called eQTL Not called Total True null As we simulate data, we know which pairs are true eQTLs. By fixing the empirical false discovery rate (F DR e ) at 5%, we can find the corresponding cutoff c on the test statistics, and from there calculate the true positive rate (TPR) at this cutoff: T foreach gene-SNP pair g ← 1 to G do c ← z (g) s ←number of called eQTLs at this cutoff c f ←number of false positives among them f dr ← f /s if f dr ≥ 5% then t ←number of true positives among the called eQTLs tpr ← t/s exit Between different methods, the empirical FDRs will always be 5% (or slightly higher) but the TPRs and FPRs will be different, which allows us to compare the performance of the methods. The grid noted A in the Methods section of the main text corresponds to an expected (average) effect size, but even one grid point would allow for a range of actual effect sizes (normally distributed, with the specified variance); thus even large values of A allow for some effects that are very close to 0. Multiple grid points are helpful to allow for a longer-tailed distribution of effects than a single normal. The most important thing is that the grid of A values is broad enough to covers the range of effect sizes detectable in the data -results are relatively insensitive to the inclusion of a few grid points that are bigger or smaller than can be detected, particularly in the hierarchical model where grid weights are estimated from the data -so maybe best to err on the side of a grid spanning too large a range. The grid we used reflects this, and was chosen to span (more than) the full range of effect sizes we think are plausibly detectable; as a result it would probably be unnecessary to add more smaller values even in larger studies.
To illustrate, we simulated PVE values using our grid (with equal weight on each grid point), for a SNP with frequency 20% (see R code below). The median simulated PVE was 0.017. The 5 th and 95 th percentiles were [8.7 × 10 −5 , 0.54], which we view as spanning a range of impossible-to-detect to unrealistically large effects (see also supplementary figure S2). This range means that, for a SNP at frequency 0.2, equal weight on this grid implies that 90% of eQTLS will explain between 8.7 × 10 −5 and 0.54 of the total variance in expression.

Hierarchical model fed with Bayes Factors from residuals
First we computed the Bayes Factors for each combination of grid values and configurations, one gene-SNP pair at a time. Second, for each gene, we regressed out the effect of its best SNP, and we recomputed the Bayes Factors for the remaining SNPs using the residuals as phenotypes. Third, we launched the hierarchical model with only the Bayes Factors obtained from the residuals.
If the "at most one eQTL per gene" assumption is reasonable for this data set, we would expect the estimated π 0 to be very high (meaning that the vast majority of genes have no eQTL), the lowest grid value to have the highest probability (meaning that the effect sizes are very small), and the credible intervals for the configurations to be very large (corresponding to high uncertainty).
This is indeed what we observe:

Running times
After our preprocessing, the data set from Dimas et al. totalizes 5012 genes and 418,477 SNPs. Focusing on the cis region ± 1 Mb around the transcription start site of each gene yields 1,436,869 gene-SNP pairs to test for association.
The first step of our approach is to compute the Bayes Factors for all configurations and grid values. As each gene-SNP pair can be analyzed in parallel, we splitted the genes in 101 batches (i.e. ≈ 50 genes per batch). Our first program, eqtlbma, spent a total of 36 minutes to calculate all Bayes Factors. The second step then uses all Bayes Factors when fitting the hierarchical model via an EM algorithm. On this data set, our second program, hm, ran in 2 hours.
The eqtlbma program can also perform permutations per gene. Each job ran in less than 8 hours when BF BMA is the test statistic, and in less than 6 hours when BF BMAlite is the test statistic. In this case (3 tissues), BF BMA is averaged over 7 configurations whereas BF BMAlite is averaged over 4 configurations. This small difference between the number of configurations over which each test statistic is averaged over explains why the running time is only 25% less with BF BMAlite compare to BF BMA , but the gain will increase with the number of tissues.
Both programs are written in C++. The eqtlbma measures running times using the CLOCKS_PER_SEC macro from the C library. For the hm program, we measured running time as the duration between the time at which the execution started and at which it ended.