MRLocus: Identifying causal genes mediating a trait through Bayesian estimation of allelic heterogeneity

Expression quantitative trait loci (eQTL) studies are used to understand the regulatory function of non-coding genome-wide association study (GWAS) risk loci, but colocalization alone does not demonstrate a causal relationship of gene expression affecting a trait. Evidence for mediation, that perturbation of gene expression in a given tissue or developmental context will induce a change in the downstream GWAS trait, can be provided by two-sample Mendelian Randomization (MR). Here, we introduce a new statistical method, MRLocus, for Bayesian estimation of the gene-to-trait effect from eQTL and GWAS summary data for loci with evidence of allelic heterogeneity, that is, containing multiple causal variants. MRLocus makes use of a colocalization step applied to each nearly-LD-independent eQTL, followed by an MR analysis step across eQTLs. Additionally, our method involves estimation of the extent of allelic heterogeneity through a dispersion parameter, indicating variable mediation effects from each individual eQTL on the downstream trait. Our method is evaluated against other state-of-the-art methods for estimation of the gene-to-trait mediation effect, using an existing simulation framework. In simulation, MRLocus often has the highest accuracy among competing methods, and in each case provides more accurate estimation of uncertainty as assessed through interval coverage. MRLocus is then applied to five candidate causal genes for mediation of particular GWAS traits, where gene-to-trait effects are concordant with those previously reported. We find that MRLocus’s estimation of the causal effect across eQTLs within a locus provides useful information for determining how perturbation of gene expression or individual regulatory elements will affect downstream traits. The MRLocus method is implemented as an R package available at https://mikelove.github.io/mrlocus.


MRLocus statistical model
MRLocus proceeds in two separate hierarchical models, which are encoded in the Stan programming language and with posterior inference performed using the Stan and RStan software packages [Carpenter et al., 2017, Stan Development Team, 2020. In section 1.1 we define the model for the colocalization step, and in section 1.2 we define the model for the slope fitting step.

Input data
The first step performs colocalization of eQTL (A) and GWAS (B) signals across a number of nearly-LD-independent signal clusters j ∈ 1, . . . , J, using summary statistics from both studies: β A i,j and se( β A i,j ), β B i,j and se( β B i,j ) for SNP i ∈ 1, . . . , n j in cluster j for study A and B, and the respective LD matrices for each cluster j: Σ A j and Σ B j . Near LD independence is obtained by clumping with PLINK [Purcell et al., 2007] on the putative mediator A with r 2 threshold of 0.1 and then trimming "signal clusters" (the term we use for PLINK clumps after they are created), in which the pairwise r 2 of the index eSNP is greater than 0.05. Trimming is performed with the trimClusters function in MRLocus, which removes signal clusters with r 2 > 0.05 to the first cluster, then the second cluster, and so on until no pairs remain with r 2 > 0.05. The clusters and the individual SNPs per cluster must be matched across study. The estimated coefficients β X i,j for X ∈ {A, B} refer to either the estimated coefficients from a linear model of a continuous trait y on genotype dosages {0, 1, 2}, or the estimated log odds from a logistic regression of a binary trait y modeled on genotype dosages.
While it would be preferable to use allelic fold change (aFC) [Mohammadi et al., 2017] or ACME effect sizes [Palowitch et al., 2018] for the eQTL (A) study in MRLocus modeling, in practice we typically are provided with publicly available estimated coefficients representing inverse normal transformed (INT) or log 2 transformed expression values regressed on genotype dosages. For eQTL coefficients derived from INT expression data, the mediation effect estimated by MRLocus represents the effect on the trait from modifying gene expression by 1 SD, while for eQTL coefficients derived from log 2 transformed expression data, the mediation effect represents the effect on the trait from doubling gene expression.
The eQTL and GWAS studies are referred to as "A" and "B" in the formula and code below for generalization, for example, the eQTL study could be replaced with a pQTL (protein quantitative trait loci) study. "A" therefore refers to a study of a trait (or "exposure") that is believed to be causally upstream of the trait examined in study "B" (or "outcome").

Collapsing and allele flipping
MRLocus contains two convenience functions, collapseHighCorSNPs and flipAllelesAndGather, which are described briefly. The first function uses hierarchical clustering based on the LD matrix (the user must pick which to use if two are available), in order to collapse SNPs into groups using complete linkage, and thresholding the resulting dendrogram at 0.95 correlation. The SNP with the highest absolute z-score within a collapsed set is chosen as the representative SNP.
flipAllelesAndGather performs a number of allele flipping steps for assisting statistical modeling and visualization. The alleles are flipped such that the index SNP (as defined by its absolute value of z-score) for study A has a positive estimated coefficient. This is to simplify the interpretations of the plots -such that we are always describing the effects on downstream traits for expression increasing alleles. Additionally, we flip alleles such that SNPs with positive correlation of genotypes in either Σ A j or Σ B j (the user must pick which to use) are kept the same, while SNPs with negative correlation of genotypes have their alleles flipped. Allele flipping involves both keeping track of the reference and effect allele, as well as changing the sign of the estimated coefficient. This function also performs checks such that the A and B study agree in terms of the effect and reference allele. If two LD matrices are provided, one is prioritized for generating positive correlations of genotypes, while the other has its alleles flipped for consistency.

Scaling
In practice, before supplying β A i,j and β B i,j and the associated standard errors to the colocalization hierarchical model, the values are scaled such that the index SNP (as defined by its absolute value of z-score) for study A has estimated coefficient of ±1 for both study A and B. If two or more SNPs have the same z-score, the first is chosen. This simplifies the Stan code and improves model fit, as the two studies are then at comparable scale. The scaling is reversed after the Stan model is fit. For the user-input estimated coefficients and standard errors for study X ∈ {A, B} and cluster j, β X,input i,j and se( β X,input i,j ), in the first step we create scaled estimated coefficients: Note that equations (1-2) refer specifically to study A, while equations (3-5) refer to steps that are repeated for X = A and X = B. Again, in words, i * j is the index of the first occurrence of the maximal value of absolute value of z-score in study A and cluster j.

Colocalization
Colocalization refers to the task of determining if the same signal in eQTL and GWAS summary statistics arise from the same causal variant(s), considering the correlation of genotypes in a locus (the LD matrix). Here we perform colocalization using a generative model for the estimated coefficients, where the true coefficients will be modeled and their posterior distribution used for inference. In the following equations, β X i,j and se( β X i,j ) refer to these scaled estimates as described in the previous section, but in the following section 1.2 on slope fitting, the values are transformed back to the original scale by multiplying by 1/s X j for X ∈ {A, B} respectively. We use a statistical model for the summary statistics motivated by the eCAVIAR model [Hormozdiari et al., 2016]. In eCAVIAR, the summary statistic z-scores in a vector S are modeled as a multivariate normal distribution with mean vector ΣΛ and covariance matrix Σ (the LD matrix), where Λ is a vector giving the true standardized effect sizes. In a locus with a single causal SNP producing the observed enrichment of signal, and if the causal SNP is in the set modeled by eCAVIAR, Λ would consist of a vector with all 0 values except for one SNP with non-zero value. eCAVIAR then models Λ with a multivariate normal prior distribution centered on 0 with covariance matrix based on a vector of 0's and 1's giving the true causal status of the SNPs in the locus and a preset scale parameter determined from previous studies.
Here MRLocus diverges from eCAVIAR in two ways. First, we will use a generative model for the elements in the vector β X ,j conditional on the true effect sizes β X ,j within nearly-LD-independent signal cluster j (that is, MRLocus models effect sizes rather than z-scores). Second, we will use a univariate distribution for each β X i,j instead of a multivariate distribution for the entire vector. This second difference is motivated by practical concerns of model fitting and specification; the univariate modeling provided more efficient model fitting with Stan (higher effective sample sizes and R-hat values near 1), and allowed for more flexible choice of priors, as described below.
Instead of a multivariate normal prior on the estimated coefficients, MRLocus uses a horseshoe prior [Carvalho et al., 2009[Carvalho et al., , 2010, a type of hierarchical shrinkage prior that provides sparsity in the posterior estimates [Piironen and Vehtari, 2017]. In the following, we use the notation of Carvalho et al. [2009], where λ i provides the local shrinkage parameters and τ provides the global shrinkage parameter.
The following hierarchical model is fit separately across nearly-LD-independent signal clusters j ∈ 1, . . . , J, and so in the following equations, the subscript for j is omitted for clarity. In all equations below but the last, i ∈ 1, . . . , n j . Here, for consistency with Stan code, the normal distribution is written as N (µ, σ) where the second element σ provides the standard deviation instead of the variance.
τ ∼ Cauchy(0, 1) Of note, β A i and β B i share a prior involving λ i , such that evidence from study A and B supporting a SNP i † that is causal both for eQTL and GWAS signal (β A i † = 0, β B i † = 0), will be reflected in larger posterior draws for λ i † compared to other λ i for i = i † .
The posterior mean for β A i,j and β B i,j are the parameters of interest from this step of model fitting, and passed along to the next step after scaling back to the original scale by 1/s A j and 1/s B j , respectively, as described earlier. We also considered making use of the posterior standard deviation for these two parameters, but found MRLocus gave better performance in terms of accuracy and stability in the Stan fitting procedure if only the posterior mean was kept from the colocalization step. The use of the horseshoe prior in this colocalization step is distinct from other uses of the horseshoe prior in Bayesian Mendelian Randomization methods, Berzuini et al. [2018] and Uche-Ikonne et al. [2019], where it is used as a prior for the pleiotropic effects (effects not mediated by the exposure) or on the mediation slope, respectively.

Colocalization with eCAVIAR (optional alternative)
Alternatively, MRLocus can accept colocalization results from eCAVIAR as input to the slope fitting step, as explored in the main text. eCAVIAR is run with default options on each nearly-LDindependent signal cluster separately, supplying the LD matrix, and z-scores for study A and B, and -c 1, i.e. setting the maximum number of causal SNPs to 1. The following section describes how the output from eCAVIAR colocalization can be used alternatively to the MRLocus colocalization output.

Slope fitting step
The second step of MRLocus is to estimate the gene-to-trait effect α (the slope in a Mendelian Randomization analysis) using the posterior mean values from the colocalization step. For each nearly-LD-independent signal cluster j, MRLocus extracts one SNP, based on the largest posterior mean value for β A i,j (prioritizing the putative mediator A for SNP selection). MRLocus also has the non-default option to perform model-based clustering using an EM algorithm to determine how many SNPs per nearly-LD-independent signal cluster to pass to the slope fitting step [Scrucca et al., 2016], but this option was not extensively evaluated here. For eCAVIAR colocalization, the SNP with the largest colocalization posterior probability (CLPP) is selected from each signal cluster, among those SNPs which are valid instruments, i.e. with an absolute value of z-score as large as the value used for PLINK clumping [Purcell et al., 2007]. eCAVIAR alleles are flipped such that the alternate allele is the one corresponding to an increase in the measured phenotype in A (e.g. gene expression). Another round of r 2 trimming is performed, as in section 1.1.1, so that all chosen SNPs for slope fitting have pairwise r 2 < 0.05.
By default, J SNPs are passed from the colocalization step to the slope fitting step, which should represent the candidate causal SNPs from each nearly-LD-independent signal cluster. The posterior mean for β A i,j and β B i,j for these SNPs is then provided to the slope fitting step as variables β A j and β B j . These two variables are modeled as normally distributed variables centered on true values β A j and β B j with standard deviation according to the original standard errors se( β A i,j ) and se( β B i,j ). We found this incorporation of the original standard errors into the slope fitting procedure helped to accurately estimate the uncertainty on the slope (the gene-to-trait effect).
The primary two parameters of interest are the slope α, i.e. the predominant gene-to-trait effect demonstrated by the nearly-LD-independent signal clusters in the locus, and σ, the dispersion of gene-to-trait effects from individual signal clusters around the predominant slope. The hierarchical model for the slope fitting step is given by: where the first four equations are defined for j ∈ 1, . . . , J, and where HN specifies the Half-Normal distribution. In words, β B j is assumed to follow a normal distribution centered on the predicted value from a line with slope α and no intercept, and with dispersion σ around the fitted line. We are both interested in the posterior mean of α and σ as well as our uncertainty regarding the point estimates. In particular we focus on a quantile-based credible interval for α, which is used for determining our confidence in a gene being a causal mediator for a trait. Large values of σ relative to α times typical mediator perturbation sizes (β A j ) reflect significant heterogeneity of the gene-to-trait effect.

Choice of hyperparameters
The hyperparameter for the prior for β A j is SD β , which is set to 2 times the largest absolute value of β A j . The hyperparameters for the prior for α are µ α and SD α , and are set based on a simple un-weighted linear model of β B j on β A j without an intercept. µ α is set to the estimated slope coefficient, and SD α is set to 2 times the absolute value of this estimated slope coefficient.
The hyperparameter for the prior for σ is SD σ , which is set to max(2 sd( β B j ), max j (| β B j |)), where sd(.) denotes the sample standard deviation. That is, the larger value between (i) 2 times the standard deviation of the coefficients for B and (ii) the largest absolute value of coefficient for B. When the coefficients for B are widely dispersed, then taking two times the SD results in a wide prior for σ, and when the coefficients are not widely dispersed, then taking the absolute value of the largest coefficient ensures the prior width for σ does not become too small.
Prior predictive checks [Gabry et al., 2019], in which data is generated from the estimated prior using the data generating mechanism, are provided in the MRLocus package and demonstrated in the vignette.

Loci with only one signal cluster (J = 1)
It is not recommended to use MRLocus with only one signal cluster.

Variation in α in simulations
We note that the value for true slope parameter α (x-axis in the simulation assessment plots) varies across iterations in the simulation for fixed h2med and h2g. The reason for this variation is that the α provided by the twas sim simulation framework corresponds to the gene-to-trait mediated effect, when eQTL and GWAS effect sizes are standardized using the SD of gene expression and trait in the eQTL and GWAS samples (as would occur in real data analysis). The h2g and h2med in the twas sim simulator describe the population-level expression heritability and mediated trait heritability. As the eQTL and GWAS sample sizes grow (e.g. already seen in the eQTL N = 1, 000 simulation), the variation is reduced and we have α values converging to h2med h2g .