Integration of expression QTLs with fine mapping via SuSiE

Genome-wide association studies (GWASs) have achieved remarkable success in associating thousands of genetic variants with complex traits. However, the presence of linkage disequilibrium (LD) makes it challenging to identify the causal variants. To address this critical gap from association to causation, many fine-mapping methods have been proposed to assign well-calibrated probabilities of causality to candidate variants, taking into account the underlying LD pattern. In this manuscript, we introduce a statistical framework that incorporates expression quantitative trait locus (eQTL) information to fine-mapping, built on the sum of single-effects (SuSiE) regression model. Our new method, SuSiE2, connects two SuSiE models, one for eQTL analysis and one for genetic fine-mapping. This is achieved by first computing the posterior inclusion probabilities (PIPs) from an eQTL-based SuSiE model with the expression level of the candidate gene as the phenotype. These calculated PIPs are then utilized as prior inclusion probabilities for risk variants in another SuSiE model for the trait of interest. By prioritizing functional variants within the candidate region using eQTL information, SuSiE2 improves SuSiE by increasing the detection rate of causal SNPs and reducing the average size of credible sets. We compared the performance of SuSiE2 with other multi-trait fine-mapping methods with respect to power, coverage, and precision through simulations and applications to the GWAS results of Alzheimer’s disease (AD) and body mass index (BMI). Our results demonstrate the better performance of SuSiE2, both when the in-sample linkage disequilibrium (LD) matrix and an external reference panel is used in inference.

We first state two properties given by the SuSiE manuscript [1]: Property 1: Bayes Factor (BF) for simple linear regression (SuSiE Appendix A. 1.).Suppose a Bayesian simple linear regression model: where y is an n-vector of response data, x is an n-vector of an explanatory variable, both x and y are centered to have zero mean.Define b := (x T x) −1 x T y is the least-squared estimate of b, s 2 := σ 2 /(x T x) is the variance of b, and z := b/s is the corresponding z score.Then the Bayes Factor (BF) for comparing this model with the null model (b = 0) is: Property 2: The single effect regression (SER) model (SuSiE Appendix A.2.).The SER model is described as: Here, y is the n-vector of the centered response variable, X = (x 1 , ..., x p ) is a scaled matrix containing n observations of p explanatory variables, b is the p-vector of regression coefficients which can be decomposed as the product of a scalar λ and indicator variable c = (c 1 , ..., c p ) T ∈ {0, 1} p , π = (π 1 , ..., π p ) T gives the prior probability that each variable is the effect variable, σ 2 and σ 2 0 are the hyperparameters for the residual variance and prior variance of the non-zero effect.
Under the SER model (S.2), there exists only one non-zero element in the coefficient vector b, determined by the indicator variable c.With fixed hyperparameters σ 2 and σ 2 0 , the posterior distribution of b = λc can be computed as: where α = (α 1 , ..., α p ) T is the vector of PIPs, which can be computed with Bayes factors: .

(S.5)
Suppose µ 1 = (µ 11 , ..., µ 1p ) T , σ 2 1 = (σ 2 11 , ..., σ 2 1p ) T , we can write the SER model as: Posterior inclusion probabilities for the single eQTL regression model: In this section, we demonstrate that under the assumption that the distribution of phenotype given the gene expression level and genotype data is equivalent to the distribution of phenotype only given genotype data, the PIPs estimated by SuSiE 2 are exactly the posterior probabilities for variants to be causal given phenotype and gene expression level data.We first consider the following single eQTL regression (SEQR) model: where y is the n-vector of the centered phenotype, g is the n-vector of the centered gene expression level, X is the scaled genotype matrix of p SNPs, β is a fixed coefficient, b is defined in the same way as in the SER model.The SEQR model can be considered as a two-layer model, where the second layer (S.8) can be considered as a SER model with the gene expression level g as the response variable.Assume a non-informative prior for each SNP to be causal in model (S.8), i.e., π = (1/p, ..., 1/p) T , then based on (S.5) in Property 2, the SuSiE 2 prior for SNP j to be causal is: .
(S.9) Suppose phenotype y is based on cohort one with n 1 samples, and gene expression level g is based on cohort two with n 2 samples.The genotype matrices of these two cohorts for p SNPs are X 1 = (x 11 , ..., x 1p ) and X 2 = (x 21 , ..., x 2p ), respectively.Denote the parameters as ϕ = (σ 2 1 , σ 2 2 , σ 2 20 , β), we can formulate the required assumption introduced in the first paragraph of this subsection as: that is, the distribution of phenotype given the gene expression level and genotype data is equivalent to the distribution of phenotype given genotype data of cohort one.
Under this assumption, we can derive that: It is worth mentioning that when the two cohorts are independent, which is usually the case in real analyses, this assumption is definitely valid.However, even when y and g are from the same cohort, we can still derive (S.11) based on this assumption.
In the SEQR model, the phenotype is effected by genotype only through the gene expression level, thus the causal SNP of gene expression level is also the only SNP that affects the phenotype.We can rewrite the SEQR model as the following two SER models between y, g and genotype matrices X 1 and X 2 , respectively: With the decomposition (S.11), under a non-informative prior P r(c j = 1) = 1 p , the posterior probability for SNP j to be causal for phenotype y is: .
(S.14) Suppose c j = 1, then SNP j is the only variant with non-zero coefficient for g and y.
(S.17) Therefore, with (S.16) and (S.17), we have P r(c j = 1|y, g, X 1 , X 2 , ϕ) where P IP e j is the SuSiE 2 prior for SNP j to be causal, as defined in (S.9).
but fixed the minimum absolute correlation for SNPs from the same credible set at 0.5.
flashfm: We ran flashfm using the FLASHFMwithFINEMAP function from R package flashfm (version 0.0.0.9000) [6].This function internally applied FINEMAP (we used FINEMAP version v1.4.2) [7] to perform multi-trait fine-mapping.The inputs to FLASHFMwithFINEMAP are a list of summary statistics, a SNP correlation matrix, a vector of reference allele frequency, a vector of trait means, and a vector of sample sizes for traits.All other settings were kept at their defaults.Since both the trait of interests and the genotype matrix were standardized, we set the trait means to zero, and replaced the MAFs with a constant value as these MAFs were used to obtain the covariance matrix from the LD matrix.Since at most 5 traits can be used with the FLASHFMwithFINEMAP function, we can perform flashfm on simulation scenario (a) and (b).