Integrating molecular QTL data into genome-wide genetic association analysis: Probabilistic assessment of enrichment and colocalization

We propose a novel statistical framework for integrating the results from molecular quantitative trait loci (QTL) mapping into genome-wide genetic association analysis of complex traits, with the primary objectives of quantitatively assessing the enrichment of the molecular QTLs in complex trait-associated genetic variants and the colocalization of the two types of association signals. We introduce a natural Bayesian hierarchical model that treats the latent association status of molecular QTLs as SNP-level annotations for candidate SNPs of complex traits. We detail a computational procedure to seamlessly perform enrichment, fine-mapping and colocalization analyses, a distinct feature compared to the existing colocalization analysis procedures in the literature. The proposed approach is computationally efficient and requires only summary-level statistics. We evaluate and demonstrate the proposed computational approach through extensive simulation studies and analyses of blood lipid data and the whole blood eQTL data from the GTEx project. In addition, a useful utility of our proposed method enables the computation of expected colocalization signals using simple characteristics of the association data. Using this utility, we further illustrate the importance of enrichment analysis for the ability to discover colocalized signals, as well as the potential limitations of currently available molecular QTL data. The software pipeline that implements the proposed computational procedures, enloc, is freely available at https://github.com/xqwen/integrative.

Supplementary Text S1 for "Integrating Molecular QTL Data into Genome-wide Genetic Association Analysis: Probabilistic Assessment of Enrichment and Colocalization"

S.1 Outline of Multiple Imputation Procedure

In this section, we outline the multiple imputation procedure used in the estimation of $\alpha_1$.
We first analyze the GTEx whole blood cis-eQTL data using the adaptive DAP algorithm and obtain the joint posterior distribution $\Pr(d \mid Y_{qtl}, G_{qtl})$ for each gene. Subsequently, we create $m = 25$ imputed annotations by independent sampling from the corresponding posterior distribution of each gene. We then perform the enrichment estimation for each imputed data set using the EM-DAP1 algorithm implemented in the software package TORUS. In the end, we obtain a point estimate of $\alpha_1$, namely $\hat\alpha_1^{(i)}$, and its standard error, $\sigma^{(i)}$, for each imputed data set $i$.
The multiple imputation procedure combines the estimates from the individual imputed data sets in the following way. First, the overall point estimate $\hat\alpha_1$ is simply given by
$$ \hat\alpha_1 = \frac{1}{m} \sum_{i=1}^{m} \hat\alpha_1^{(i)}. $$
The variance of the point estimate is then computed by
$$ \mathrm{Var}(\hat\alpha_1) = \bar{W} + \left(1 + \frac{1}{m}\right) B, $$
where
$$ \bar{W} = \frac{1}{m} \sum_{i=1}^{m} \left(\sigma^{(i)}\right)^2 $$
and
$$ B = \frac{1}{m-1} \sum_{i=1}^{m} \left(\hat\alpha_1^{(i)} - \hat\alpha_1\right)^2. $$
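The combining step follows the standard multiple-imputation (Rubin) rules and can be sketched in a few lines of Python. This is an illustrative sketch: the function and variable names are hypothetical, and the point estimates and standard errors would come from the per-imputation TORUS runs.

```python
import numpy as np

def combine_mi_estimates(alpha_hats, std_errs):
    """Combine point estimates and standard errors from m imputed
    data sets using the standard multiple-imputation combining rules."""
    alpha_hats = np.asarray(alpha_hats, dtype=float)
    std_errs = np.asarray(std_errs, dtype=float)
    m = len(alpha_hats)
    alpha_bar = alpha_hats.mean()                        # overall point estimate
    w_bar = np.mean(std_errs ** 2)                       # within-imputation variance
    b = np.sum((alpha_hats - alpha_bar) ** 2) / (m - 1)  # between-imputation variance
    total_var = w_bar + (1 + 1 / m) * b                  # total variance
    return alpha_bar, np.sqrt(total_var)
```

Note that the between-imputation component $B$ captures the uncertainty in the latent annotation itself, which a single imputed data set would miss.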

S.2 Adaptive Shrinkage in Enrichment Estimation
As discussed in the main text, the lack of strongly colocalized signals leads to unstable enrichment estimates. Here we propose a data-driven empirical approach to remedy this issue. The general statistical idea is to trade variance of the enrichment estimate for bias, so as to achieve an overall accurate prior estimate for the downstream fine-mapping and colocalization analyses.
In the EM algorithm that we previously detailed [2,3], we showed that the M-step is equivalent to fitting a logistic regression model which, in our context, regresses the expected association status estimated in the E-step on the imputed eQTL annotation for each SNP. To shrink the estimate, we impose an $l_2$ penalty with a constant shrinkage parameter $\lambda$ when fitting the logistic regression in each M-step, which is equivalent to assuming a normal prior $N(0, 1/\lambda)$ on $\alpha_1$.
We determine the shrinkage parameter in a data-adaptive way. Intuitively, stronger shrinkage is desired if the data indicate more severe sparsity of overlapping association signals. If both $\gamma$ and $d$ were observed, we could construct a $2 \times 2$ contingency table, from which $\alpha_1$ and its variance could be estimated in a standard way.
In particular, the variance of the log-odds ratio estimate can be used as a measure for our purpose: the variance is small if all 4 cells have sufficient counts, whereas if the count of any one cell is small, the overall variance of the log-odds ratio estimate is large. For each imputed data set, $d$ is indeed observed, but $\gamma$ remains latent. We therefore estimate the cell counts of the hypothetical $2 \times 2$ contingency table by computing the PIP of each SNP in the GWAS using the DAP-1 algorithm with $\alpha_1 = 0$. More specifically, let $\eta_i$ indicate whether SNP $i$ is a molecular QTN in the imputed data set, and let $p_i$ denote the resulting PIP from the DAP-1 algorithm. Assuming there are $p$ candidate SNPs in total, we estimate the 4 cell counts by
$$ \hat n_{11} = \sum_{i=1}^{p} \eta_i\, p_i, \qquad \hat n_{10} = \sum_{i=1}^{p} \eta_i\, (1 - p_i), \qquad \hat n_{01} = \sum_{i=1}^{p} (1 - \eta_i)\, p_i, \qquad \hat n_{00} = \sum_{i=1}^{p} (1 - \eta_i)(1 - p_i). $$
Finally, we set the shrinkage parameter $\lambda$ by
$$ \lambda = \frac{1}{\hat n_{00}} + \frac{1}{\hat n_{01}} + \frac{1}{\hat n_{10}} + \frac{1}{\hat n_{11}}, $$
which is the variance of the log-odds ratio estimate when the true cell counts are indeed observed (instead of estimated). This shrinkage parameter has a natural interpretation and quantitatively reflects the imbalance of the unobserved $2 \times 2$ contingency table.
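As a concrete illustration, the estimated cell counts and the implied shrinkage parameter can be computed as follows. This is a sketch with hypothetical names: `eta` holds the imputed binary QTN annotations, `pip` holds the DAP-1 PIPs computed with $\alpha_1 = 0$, and $\lambda$ is taken as the usual (Woolf) variance of the log-odds ratio evaluated at the estimated cell counts.

```python
import numpy as np

def shrinkage_parameter(eta, pip):
    """Estimate the 2x2 table of (QTN status) x (GWAS association status)
    and return lambda = the variance of the log-odds ratio implied by
    the estimated cell counts."""
    eta = np.asarray(eta, dtype=float)   # imputed QTN indicators (0/1)
    pip = np.asarray(pip, dtype=float)   # GWAS PIPs from DAP-1 with alpha_1 = 0
    n11 = np.sum(eta * pip)              # QTN and GWAS-associated
    n10 = np.sum(eta * (1 - pip))        # QTN, not GWAS-associated
    n01 = np.sum((1 - eta) * pip)        # non-QTN, GWAS-associated
    n00 = np.sum((1 - eta) * (1 - pip))  # non-QTN, not GWAS-associated
    return 1 / n11 + 1 / n10 + 1 / n01 + 1 / n00
```

A sparse overlap between QTNs and GWAS-associated SNPs makes $\hat n_{11}$ small, inflating $\lambda$ and hence the strength of shrinkage, as intended.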

S.3 Derivation of SNP-level Colocalization Probability
The SNP-level colocalization probability can be computed by noting the relationship between the posterior quantities $\Pr(\gamma_i = 1 \mid d_i = 1, \cdot)$ and $\Pr(\gamma_i = 1 \mid d_i = 0, \cdot)$ and their relationship to the marginal PIP in the GWAS data, i.e.,
$$ \Pr(\gamma_i = 1 \mid \cdot) = \Pr(\gamma_i = 1 \mid d_i = 1, \cdot)\,\Pr(d_i = 1 \mid \cdot) + \Pr(\gamma_i = 1 \mid d_i = 0, \cdot)\,\Pr(d_i = 0 \mid \cdot), $$
and
$$ \frac{\Pr(\gamma_i = 1 \mid d_i = 1, \cdot)}{\Pr(\gamma_i = 0 \mid d_i = 1, \cdot)} = e^{\alpha_1} \cdot \frac{\Pr(\gamma_i = 1 \mid d_i = 0, \cdot)}{\Pr(\gamma_i = 0 \mid d_i = 0, \cdot)}. $$
Solving this system yields the following expression for the SNP-level colocalization probability,
$$ \Pr(\gamma_i = 1, d_i = 1 \mid \cdot) = \Pr(d_i = 1 \mid \cdot) \cdot \Pr(\gamma_i = 1 \mid d_i = 1, \cdot). $$
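To make the computation concrete, the sketch below solves numerically for $q_1 = \Pr(\gamma_i = 1 \mid d_i = 1, \cdot)$ under two working relations adopted for this illustration: the marginal PIP decomposes as a mixture over the latent annotation, and the posterior odds under $d_i = 1$ versus $d_i = 0$ differ by the enrichment factor $e^{\alpha_1}$. Function and argument names are hypothetical, and the root-finding by bisection is one of several equally valid choices.

```python
import math

def snp_coloc_prob(pip, delta, alpha1, tol=1e-12):
    """Solve for q1 = Pr(gamma=1 | d=1, .) from
         pip = delta*q1 + (1-delta)*q0,
         odds(q1) = exp(alpha1) * odds(q0),
    then return the SNP-level colocalization probability delta*q1."""
    if pip <= 0 or delta <= 0:
        return 0.0
    r = math.exp(alpha1)

    def mixture(q1):
        # q0 implied by the posterior-odds relation; monotone in q1
        q0 = q1 / (r * (1 - q1) + q1)
        return delta * q1 + (1 - delta) * q0

    lo, hi = 0.0, 1.0            # mixture() increases from 0 to 1 on [0, 1]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture(mid) < pip:
            lo = mid
        else:
            hi = mid
    q1 = 0.5 * (lo + hi)
    return delta * q1
```

When $\alpha_1 = 0$ the two conditional posteriors coincide, so the colocalization probability reduces to the product of the marginal PIP and the QTN probability; positive enrichment pulls it above that baseline.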

S.4 Connections to coloc Model
Here we compare the statistical approach coloc to our proposed method. Specifically, we show that coloc can be viewed as a rough approximation to, and a special case of, our general integrative analysis approach.
The method coloc requires pre-defining three SNP-level prior probabilities $p_1$, $p_2$ and $p_{12}$. Using the notations of this paper, these three quantities can be formulated as
$$ p_1 = \Pr(\gamma_i = 1, d_i = 0) = \left[1 - \Pr(d_i = 1)\right] \cdot \frac{e^{\alpha_0}}{1 + e^{\alpha_0}}, $$
$$ p_2 = \Pr(\gamma_i = 0, d_i = 1) = \Pr(d_i = 1) \cdot \frac{1}{1 + e^{\alpha_0 + \alpha_1}}, $$
$$ p_{12} = \Pr(\gamma_i = 1, d_i = 1) = \Pr(d_i = 1) \cdot \frac{e^{\alpha_0 + \alpha_1}}{1 + e^{\alpha_0 + \alpha_1}}. $$
Additionally, the prior probability $\Pr(\gamma_i = 0, d_i = 0) = 1 - p_1 - p_2 - p_{12}$ can be trivially computed and is denoted by $p_0$. In comparison, we explicitly estimate $\alpha_0$ and $\alpha_1$ from the GWAS data. Although we do not directly utilize the prior probability of a QTN, $\Pr(d_i = 1)$ (rather, the inferred posterior probability $\Pr(d_i = 1 \mid Y_{qtl}, G_{qtl})$ is applied throughout our inference procedure), this quantity can be straightforwardly estimated from the eQTL data, $(Y_{qtl}, G_{qtl})$, using the EM-DAP1 algorithm [4].
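For concreteness, the mapping from the enrichment parameters to coloc's SNP-level priors can be written as a short function. This sketch assumes the logistic enrichment prior $\Pr(\gamma_i = 1 \mid d_i) = e^{\alpha_0 + \alpha_1 d_i} / (1 + e^{\alpha_0 + \alpha_1 d_i})$; the argument `pr_d`, denoting $\Pr(d_i = 1)$, and the function name are illustrative.

```python
import math

def coloc_priors(alpha0, alpha1, pr_d):
    """Translate the logistic enrichment prior into coloc-style
    SNP-level priors (p0, p1, p2, p12)."""
    g0 = 1 / (1 + math.exp(-alpha0))             # Pr(gamma=1 | d=0)
    g1 = 1 / (1 + math.exp(-(alpha0 + alpha1)))  # Pr(gamma=1 | d=1)
    p1 = (1 - pr_d) * g0       # GWAS hit only
    p2 = pr_d * (1 - g1)       # QTN only
    p12 = pr_d * g1            # colocalized
    p0 = 1 - p1 - p2 - p12     # neither
    return p0, p1, p2, p12
```

The four quantities sum to one by construction, and $p_0$ simplifies to $[1 - \Pr(d_i = 1)] / (1 + e^{\alpha_0})$.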
Similar to our RCP quantification, the coloc method considers the existence of a colocalized GWAS and molecular QTL signal within an LD block. Importantly, the coloc model makes the explicit assumption that there is at most a single GWAS hit and/or a single QTN within the locus of interest, which enables a highly efficient approximate computation of the RCP. Here we show that, given this simplifying assumption and the pre-specified priors $p_1$, $p_2$ and $p_{12}$, coloc yields an identical RCP result to the proposed method.
Suppose that there are $m$ SNPs in the LD block of interest, and let the binary $m$-vectors $\gamma_l$ and $d_l$ denote their association status with respect to the complex trait and the molecular phenotype, respectively. Because the GWAS and molecular QTL data are obtained from non-overlapping samples, it follows that
$$ \Pr(\gamma_l, d_l \mid y, G, Y_{qtl}, G_{qtl}) \propto \Pr(\gamma_l, d_l) \cdot \frac{\Pr(y \mid G, \gamma_l)}{\Pr(y \mid G, \mathbf{0})} \cdot \frac{\Pr(Y_{qtl} \mid G_{qtl}, d_l)}{\Pr(Y_{qtl} \mid G_{qtl}, \mathbf{0})}. $$
Assuming that SNP $i$ is the colocalized association signal, i.e., $\gamma_l = d_l = \mathbf{1}_i$, it follows from our proposed model that
$$ \Pr(\gamma_l = \mathbf{1}_i, d_l = \mathbf{1}_i \mid y, G, Y_{qtl}, G_{qtl}) \propto p_{12}\, p_0^{\,m-1}\, \mathrm{BF}^{gwas}_i\, \mathrm{BF}^{qtl}_i, \quad (10) $$
where the marginal likelihood ratios $\mathrm{BF}^{gwas}_i$ and $\mathrm{BF}^{qtl}_i$ are approximated by the Bayes factors of single-SNP association models for the complex trait and the molecular phenotype, respectively.
Under the constraint imposed by the simplifying assumption, all possible configurations of $(\gamma_l, d_l)$ can be enumerated, and their corresponding posterior probabilities can be computed similarly to (10).
For example,
$$ \Pr(\gamma_l = \mathbf{1}_i, d_l = \mathbf{0} \mid y, G, Y_{qtl}, G_{qtl}) \propto p_1\, p_0^{\,m-1}\, \mathrm{BF}^{gwas}_i, $$
and
$$ \Pr(\gamma_l = \mathbf{0}, d_l = \mathbf{1}_j \mid y, G, Y_{qtl}, G_{qtl}) \propto p_2\, p_0^{\,m-1}\, \mathrm{BF}^{qtl}_j. $$
In the end, coloc computes the RCP for the locus of interest by
$$ \mathrm{RCP} = \frac{1}{C}\, p_{12}\, p_0^{\,m-1} \sum_{i=1}^{m} \mathrm{BF}^{gwas}_i\, \mathrm{BF}^{qtl}_i, \quad (14) $$
where $C$ denotes the normalizing constant and is computed by
$$ C = \sum_{(\gamma_l, d_l) \in \Omega} \Pr(\gamma_l, d_l) \cdot \frac{\Pr(y \mid G, \gamma_l)}{\Pr(y \mid G, \mathbf{0})} \cdot \frac{\Pr(Y_{qtl} \mid G_{qtl}, d_l)}{\Pr(Y_{qtl} \mid G_{qtl}, \mathbf{0})}, $$
and the set $\Omega$ denotes the eligible $(\gamma_l, d_l)$ configurations under the constraint. Additionally, coloc approximates $p_0 \approx 1$, and expression (14) becomes identical to what is used in [5].
In summary, we have shown that the coloc approach can be derived as a special case of our proposed general framework under two added assumptions: at most one GWAS signal and at most one QTN reside within the locus of interest, and the SNP-level priors $p_1$, $p_2$ and $p_{12}$ are pre-specified rather than estimated from the data. In our simulated data, coloc annotates a much smaller set of eQTLs. We find that this results in inferior performance in ranking potential colocalized signals, as indicated by the ROC curve (Fig. 1). In particular, at a given false positive rate threshold, many fewer colocalized signals are identified compared to the proposed approach. In the application of hypothesis testing, we find that the type I errors are inflated, in some cases severely, indicating that the posterior colocalization probabilities are poorly calibrated. For example, at the 5% and 10% FDR levels, the realized FDRs are 14% and 20%, respectively.

S.5 Multi-SNP Analysis with Adaptive DAP Algorithm Using Summary-level Statistics

For practical and/or privacy considerations, many GWAS data sets are made available with only summary-level statistics, typically in the form of single-SNP testing p-values or z-scores. In this section, we discuss the analytic strategy for performing the proposed analysis using only summary-level statistics from GWAS. Many authors have demonstrated that the SNP-level PIP in GWAS, $\Pr(\gamma_i = 1 \mid y, G, Y_{qtl}, G_{qtl}, \hat\alpha)$, can be approximated from the summary-level statistics obtained from single-SNP association testing [6,7].
Here we show that their results extend to the application of the adaptive DAP algorithm. When individual-level genotype data are unavailable, the main computational difficulty in applying the adaptive DAP algorithm lies in calculating the marginal likelihood, i.e., the Bayes factor, for any given value of $\gamma$. There are primarily two types of approaches to circumvent this difficulty. The first type attempts to derive the likelihood by treating the summary-level statistics as the sole observed data: the results by Chen et al. [8] (Equation (3)) and Zhu and Stephens [7] (Equation (2.10)) provide suitable approximations that are accurate and easy to compute. The second type of approach is based on the fact that, given the summary-level statistics, namely the z-scores from the single-SNP analysis, and the exact matrix $G'G$, the exact likelihood function can be computed for any given $\gamma$, just as when the individual-level data $(y, G)$ are provided. The $p \times p$ matrix $G'G$ characterizes the correlations between the genotypes of the candidate SNPs. When the full matrix from the study samples is not available, we can estimate it based on the LD patterns observed in an appropriate population genotype panel.
Our current implementation of the adaptive DAP algorithm takes the second approach and requires the single-SNP testing z-values and either an exact or estimated $G'G$ matrix. (We also plan to implement the approximation by [8].)
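To illustrate the second type of approach, the sketch below computes an approximate log Bayes factor for a candidate causal configuration directly from z-scores and an LD (correlation) matrix, using the generic summary-statistic model $z \sim N(0, R)$ under the null and $z \sim N(0, R + R\,\mathrm{diag}(w)\,R)$ under the alternative, where $w$ places a prior effect-size variance (on the z-score scale) on the causal SNPs. This is a generic construction in the spirit of the summary-statistic likelihoods cited above, not the exact formulas implemented in the DAP software; all names are illustrative.

```python
import numpy as np

def log_bf_from_summary(z, R, causal_idx, w=25.0):
    """Approximate log Bayes factor for the causal configuration
    `causal_idx`, using only single-SNP z-scores and the LD matrix R."""
    z = np.asarray(z, dtype=float)
    R = np.asarray(R, dtype=float)
    W = np.zeros_like(R)
    W[causal_idx, causal_idx] = w    # prior variance on the causal effects
    cov_alt = R + R @ W @ R          # marginal covariance of z under H1

    def mvn_logpdf(x, cov):
        # log density of a zero-mean multivariate normal
        sign, logdet = np.linalg.slogdet(cov)
        quad = x @ np.linalg.solve(cov, x)
        return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + quad)

    return mvn_logpdf(z, cov_alt) - mvn_logpdf(z, R)
```

Replacing the exact $G'G$-derived correlations in `R` with panel-based LD estimates leaves the computation unchanged, which is what makes the summary-level workflow practical.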

S.6 Caveat of Mean Imputation for Enrichment Analysis
An often-applied imputation strategy is to replace the missing values by their corresponding expectations, which is commonly known as "mean imputation". Within our proposed modeling framework, a mean imputation strategy would simply treat the PIPs from the molecular QTL analysis as the observed annotations. The general drawbacks of mean imputation have been thoroughly discussed in the statistical literature. Here we focus on a particular observation from the simulation studies: the mean imputation consistently overestimates $\alpha_1$, sometimes severely. Among the many contributing factors to this phenomenon, we illustrate one, which we term the "scaling factor", to explain the observed overestimation.
When the true annotation $d$ is observed, fitting the logistic regression model
$$ \log \frac{\Pr(\gamma_i = 1)}{\Pr(\gamma_i = 0)} = \alpha_0 + \alpha_1 d_i $$
leads to an unbiased estimator of $\alpha_1$. The mean imputation replaces each $d_i$ value by its expectation $\delta_i$ from the QTL analysis. Because the $\delta_i$'s are valid probabilities, they are always shrunk towards 0.5 relative to the true binary values of $d_i$. This consequently narrows the scale of the covariate in the logistic model, which requires the regression coefficient $\alpha_1$ to be "scaled up", i.e., overestimated, accordingly.
Consider an extreme example where all 0's and 1's are replaced by 0.1's and 0.9's, respectively, and suppose $\alpha_1$ is estimated to be positive when $d$ is observed. With the same input $\gamma$ values, the ratio of the $\alpha_1$ estimates in the re-scaled vs. the original regression model is exactly 1.25, because the imputed covariate is the linear transformation $0.1 + 0.8\,d$ of the original; the difference between the two estimates becomes more noticeable when the true $\alpha_1$ is relatively large.
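The exact factor of 1.25 follows from the linear covariate transformation and can be verified numerically. The sketch below fits the two-parameter logistic model by Newton's method on simulated data; all names and the simulation settings are illustrative.

```python
import numpy as np

def fit_logistic(d, gamma, n_iter=50):
    """Fit logit Pr(gamma=1) = a0 + a1*d by Newton's method; return a1."""
    X = np.column_stack([np.ones_like(d), d])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (gamma - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta[1]

rng = np.random.default_rng(1)
d = rng.integers(0, 2, size=2000).astype(float)
# simulate gamma under alpha_0 = -1, alpha_1 = 1
p_true = 1 / (1 + np.exp(-(-1 + 1 * d)))
gamma = (rng.random(2000) < p_true).astype(float)

a1_binary = fit_logistic(d, gamma)              # covariate observed as 0/1
a1_shrunk = fit_logistic(0.1 + 0.8 * d, gamma)  # covariate "imputed" as 0.1/0.9
# ratio is exactly 1.25: 0.1 + 0.8*d is a linear reparameterization,
# so the MLE rescales by 1/0.8
```

Because the shrunk covariate spans the same model family up to a linear reparameterization, the maximized likelihoods coincide and only the coefficient rescales.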
The scaling factor, at least in part, explains the inflation of the $\alpha_1$ estimate. It should also be noted that the degree of the overestimation is related to the overall accuracy of the mean imputation: if the imputation is extremely accurate, i.e., the $\delta_i$'s are very close to their true binary values, the effect of the scaling factor is negligible.