Abstract
Multivariate Mendelian randomization (MVMR) is a statistical technique that uses sets of genetic instruments to estimate the direct causal effects of multiple exposures on an outcome of interest. At genomic loci with pleiotropic gene regulatory effects, that is, loci where the same genetic variants are associated to multiple nearby genes, MVMR can potentially be used to predict candidate causal genes. However, consensus in the field dictates that the genetic instruments in MVMR must be independent (not in linkage disequilibrium), which is usually not possible when considering a group of candidate genes from the same locus. Here we used causal inference theory to show that MVMR with correlated instruments satisfies the instrumental set condition. This is a classical result by Brito and Pearl (2002) for structural equation models that guarantees the identifiability of individual causal effects in situations where multiple exposures collectively, but not individually, separate a set of instrumental variables from an outcome variable. Extensive simulations confirmed the validity and usefulness of these theoretical results. Importantly, the causal effect estimates remained unbiased and their variance small even when instruments are highly correlated, while bias introduced by horizontal pleiotropy or LD matrix sampling error was comparable to standard MR. We applied MVMR with correlated instrumental variable sets at genome-wide significant loci for coronary artery disease (CAD) risk using expression Quantitative Trait Loci (eQTL) data from seven vascular and metabolic tissues in the STARNET study. Our method predicts causal genes at twelve loci, each associated with multiple colocated genes in multiple tissues. We confirm causal roles for PHACTR1 and ADAMTS7 in arterial tissues, among others. 
However, the extensive degree of regulatory pleiotropy across tissues and the limited number of causal variants in each locus still require that MVMR is run on a tissue-by-tissue basis, and testing all gene-tissue pairs with cis-eQTL associations at a given locus in a single model to predict causal gene-tissue combinations remains infeasible. Our results show that within tissues, MVMR with dependent, as opposed to independent, sets of instrumental variables significantly expands the scope for predicting causal genes in disease risk loci with pleiotropic regulatory effects. However, considering risk loci with regulatory pleiotropy that also spans across tissues remains an unsolved problem.
Author summary
Although genome-wide association studies have mapped thousands of genetic variants that explain the heritable nature of many complex traits and diseases, the causal genes and mechanisms underlying these associations are often unclear. This is partly due to the widespread presence of “regulatory pleiotropy”, a phenomenon where the same genetic variants affect gene expression of multiple genes in the same genomic locus across multiple tissues. Mendelian randomization is a statistical method that uses genetic variants as instrumental variables to estimate causal effects of exposures on outcomes. Here we have extended this technique to the situation where multiple exposures can have a simultaneous effect on an outcome, and no independent instrumental variables are available for each exposure. When applied to a dataset of genetic and gene expression variation in seven vascular and metabolic tissues of 600 individuals undergoing heart surgery, our method identified candidate causal genes and tissues for coronary artery disease risk at genomic positions where regulatory pleiotropy and the extensive correlations between genetic variants made the application of existing Mendelian randomization methods infeasible. Further support for the validity of our method to identify causal genes using sets of correlated instrumental variables was provided by extensive simulations and theoretical results.
Citation: Khan M, Ludl A-A, Bankier S, Björkegren JLM, Michoel T (2024) Prediction of causal genes at GWAS loci with pleiotropic gene regulatory effects using sets of correlated instrumental variables. PLoS Genet 20(11): e1011473. https://doi.org/10.1371/journal.pgen.1011473
Editor: Xiang Zhou, University of Michigan, UNITED STATES OF AMERICA
Received: January 12, 2024; Accepted: October 28, 2024; Published: November 11, 2024
Copyright: © 2024 Khan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All source code related to the paper is available at https://github.com/mariyam-khan/Causal_genes_GWAS_loci_CAD. Results reported in this paper used version 1.0.0 of the code, which has been archived at https://doi.org/10.5281/zenodo.10091331. Supporting data is available at https://dataverse.no/dataset.xhtml?persistentId=doi:10.18710/VM0WKQ.
Funding: T.M. acknowledges support from the Research Council of Norway (project number 312045). J.L.M.B. acknowledges support from the Swedish Research Council (2018-02529 and 2022-00734), the Swedish Heart Lung Foundation (2017-0265 and 2020-0207), the Leducq Foundation AteroGen (22CVD04) and PlaqOmics (18CVD02) consortia; the National Institute of Health-National Heart Lung Blood Institute (NIH/NHLBI, R01HL164577; R01HL148167; R01HL148239, R01HL166428, and R01HL168174), American Heart Association Transformational Project Award 19TPA34910021, and from the CMD AMP fNIH program. T.M. and J.L.M.B. acknowledge the European Union’s Horizon Europe (European Innovation Council) programme (grant agreement number 101115381). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Mendelian randomization (MR) is a statistical technique that uses genetic instruments to estimate causal effects between complex traits and diseases [1–3]. It is based on the fact that genotypes are independently assorted and randomly distributed in a population by Mendel’s laws, and not affected by environmental or genetic confounders that affect both traits (commonly called the “exposure” and the “outcome” trait for the putative causal and affected trait, respectively). If it can be assumed that a genetic locus affects the outcome only through the exposure, then the causal effect of the exposure on the outcome can be derived from their relative associations to the genetic locus, which acts as an instrumental variable [4].
Traditionally MR is used to study phenotypic traits where genetic effect sizes are small and horizontal pleiotropy can never be fully excluded, that is, alternative causal paths may exist between the genetic instruments and the outcome trait independent of the exposure trait. Hence much of the methodological development in MR has sought to address these limitations [5, 6].
More recently, MR has been applied to estimate causal effects of molecular traits such as mRNA or protein abundances on phenotypic traits [7–10]. Using molecular exposure traits in MR offers potential advantages due to genetic effects on molecular abundances being much larger than those for phenotypic traits, with greater than two-fold allelic effect sizes [11] on gene expression not being uncommon [12]. Moreover, fundamental molecular biological principles reduce the risk of horizontal pleiotropy. Indeed, most genetic variants associated to mRNA and protein abundances (and phenotypic traits) are located in non-coding regions of the genome and there are no known mechanisms by which such variants could affect outcome traits other than through affecting transcription and translation of nearby genes, typically considered to be at most a distance of 1 Mbp away [13]. Hence, by limiting the selection of genetic instruments to those in the genomic vicinity of the gene whose mRNA or protein product is used as the exposure in an MR analysis, the risk of these variants affecting the outcome through alternative pathways not including the exposure gene or protein is reduced.
However, analyses of large transcriptomic datasets have shown that genetic variants with “regulatory pleiotropy”, defined as variants associated with more than one gene in the same locus [12], are common. For instance, it has been found that around 10% of non-redundant cis-expression quantitative trait loci (cis-eQTLs; single nucleotide polymorphisms (SNPs) associated with the expression of nearby genes) exhibit regulatory pleiotropy [14]. This covered cases where a true cis-regulatory DNA region was shared between genes as well as cases where genes had distinct regulatory elements that were in high linkage disequilibrium (LD) with one another. More recently, analyses by the GTEx Consortium of RNA abundances across 49 human tissues showed that a median of 57% of variants per tissue are associated with more than one gene in the same locus, typically co-occurring across tissues [12].
When genes or proteins are tested as exposures one-by-one using MR, as in most studies to date, regulatory pleiotropy could clearly lead to horizontal pleiotropy, as one cannot know a priori through which of the associated genes a genetic instrument exerts its influence on the outcome. Hence, to account for regulatory pleiotropy, all MR analyses using molecular abundances must employ multivariable MR (MVMR) methods, where all gene products in a genomic locus of interest are treated simultaneously as a set of potential causes for an outcome of interest.
As with single-exposure MR, MVMR was first developed for phenotypic exposure traits [15–18]. MVMR requires a set of instrumental variable SNPs, at least as many as the number of exposures, that are associated with the exposure and outcome variables and do not affect the outcome other than through the set of exposure variables. MVMR can then estimate the direct causal effect of each exposure on the outcome using a two-stage least squares method in the single-sample, individual-level data setting, or a “regression of regression coefficients” method in the two-sample summary data setting [15–17]. Recently, MVMR was applied to transcriptomic and genome-wide association study (GWAS) data in the two-sample setting, in a first attempt to overcome the challenge of regulatory pleiotropy when analyzing mRNA abundances to identify causal disease genes, in a procedure called transcriptome-wide MR (TWMR) [19].
However, the precise conditions to ensure validity of MVMR remain somewhat ambiguous, in particular with regard to the presence of linkage disequilibrium (correlations) between the SNPs in the set of instrumental variables. For instance, Burgess et al. [16] state that estimates from the two-stage least squares method in the individual-level data setting are valid even if the genetic variants are in LD [15], but require uncorrelated instruments in their proof of equivalence with the regression method in the summary data setting. McDaid et al. [20] show that the two-stage least squares estimates can be obtained from univariate summary data even if the genetic variants are in LD, allowing for two-sample MVMR analysis, albeit at the cost of the estimates no longer being expressed as a “regression of regression coefficients”. Nevertheless, despite their formula for the causal effect estimates being valid for correlated instruments, McDaid et al. [20] still follow common practice in MVMR and select only instruments spread far apart on the genome to eliminate any LD between them. Similarly, Porcu et al. [19], whose TWMR method is based on the analytic results of McDaid et al. [20], only use variants with low mutual LD (r² ≤ 0.1) in their analysis of loci with regulatory pleiotropy. Moreover, all simulations reported in the literature to test MVMR methods in a controlled situation use independently sampled instruments [15–17, 19].
Here we clarify and strengthen the theoretical underpinnings of MVMR with correlated instrumental variables, supported by realistic simulations, to identify causal genes at genomic loci with pleiotropic gene regulatory effects. If we restrict to genetic instruments in the same locus to exclude horizontal pleiotropy, then inclusion of correlated instruments becomes a necessity.
The basis of our approach is Wright’s method of path coefficients [21, 22] and its generalization into the graph structural causal inference theory by Pearl and colleagues [23]. Accordingly, we distinguish between identification and estimation of causal effects. Causal identification refers to the question whether causal effects can be expressed in terms of the true (but unknown) joint distribution of the variables in a causal graph. Causal estimation refers to the problem of estimating the identified causal effects from a finite number of observational samples from the joint distribution.
For the identification problem, we show that the MVMR causal graph with correlated instruments belongs to a class of causal graphs previously studied in the AI literature [24]. Specifically, we show that any set of non-redundant causal variants of the same size as the set of exposure traits, regardless of their mutual LD levels and regardless of any interactions between the exposures, satisfies Brito and Pearl’s [24] “instrumental set” condition, allowing identification of the causal effects of each exposure on the outcome. In particular, the causal effects can be expressed as the coefficients of the linear regression of the instrument-outcome covariances on the instrument-exposure covariances, thereby showing correctness of the “regression of regression coefficients” method in all settings.
For the estimation problem, we expand the instrument set to the overcomplete set of all genetic variants in the locus of interest. We assume that this set contains at least as many causal variants as the number of exposures, and analyze a class of methods called the Generalized Method of Moments (GMM) [25]. As shown by Hansen [25], every previously suggested instrumental variables estimator, in linear or nonlinear models, with cross-section, time series, or panel data, as well as system estimation of linear equations can be cast as a GMM estimator. GMM is therefore viewed as a unifying framework for estimation in causal inference problems.
We show that for the exactly determined MVMR problem with correlated instruments (equal number of instruments and exposures), replacing the true covariances by their finite-sample estimates is the optimal GMM estimator. For the overdetermined system, a weighted version that corresponds exactly to the two-stage least squares solution is optimal. In particular, the solution of McDaid et al. [20] of the two-stage least squares problem in terms of summary statistics is shown to be the finite-sample GMM estimator of the “regression of regression coefficients” solution to the causal identification problem.
To support our theoretical results, we conducted extensive simulations with binomially distributed, correlated instruments mimicking real LD from the human genome. We further applied our MVMR method with correlated instrumental variable sets to transcriptomics data from seven vascular and metabolic tissues in 600 coronary artery disease (CAD) cases undergoing surgery from the STARNET study [26] to predict causal genes and tissues at 19 genome-wide significant CAD risk loci with regulatory pleiotropy.
Materials and methods
The method of path coefficients
Causal effect identification consists of expressing the effect of an intervention of one variable on the marginal distribution of one or more other variables using knowledge of the underlying causal diagram and joint distribution of all variables under consideration [23]. A causal diagram is a directed acyclic graph (DAG) of directed edges between variables where a causal relationship is known or hypothesized, augmented with bidirected edges between variables known or hypothesized to be affected by unmodelled confounding factors (Fig 1); bidirected edges can be represented equivalently by two directed edges originating from a common latent factor.
Fig 1. (A) The standard instrumental variable graph is only suitable for modelling the causal effect of gene X on outcome trait Y if X is the only cis-eGene in the locus of the genome-wide significant variant E for Y. (B) A causal graph for MVMR with regulatory pleiotropy, where two or more cis-eGenes Xi are found in a genome-wide significant locus for Y; the causal effects of each Xi on Y can be identified if there exist at least the same number of causal variants as the number of cis-eGenes in the GWAS locus, even when they are in LD with each other, because the variants form an instrumental set. (C) If the number of causal variants is smaller than the number of cis-eGenes and other variants in the locus are merely associated by LD, the causal effects of Xi on Y are non-identifiable. (D,E) When estimating causal effects from finite samples, we use an overdetermined set of genetic variants as instruments, without knowing the number or identity of the causal variants. (F) If the (unknown) number of causal variants in a locus is smaller than the number of cis-eGenes, causal effects cannot be estimated, a condition that can be tested by analyzing the rank or determinant of the covariance matrix between the variants and cis-eGenes. Solid arrows represent direct causal effects and dashed (bi)directed arrows represent non-zero covariances due to unobserved (U) confounders.
A special class of causal models are those where the causal diagram is assumed to define a linear structural equation model (SEM) [27], that is, a system of linear equations among a set of variables {X1, …, Xp} of the form
$$X_i = \sum_{j \neq i} c_{ij} X_j + U_i, \qquad i = 1, \dots, p.$$
The Ui are error or disturbance terms representing unobserved background factors which are assumed to have mean zero and a variance/covariance matrix [Ψij] = Cov(Ui, Uj). It is often assumed that the error terms are normally distributed, but, as already pointed out by Wright [22], this is not necessary—the results below hold for any error distribution. Every non-zero coefficient cij corresponds to a directed edge Xj → Xi in the causal diagram, that is, cij is the direct causal effect of Xj on Xi, and every non-zero element Ψij corresponds to a bidirected edge Xi ↔ Xj. The SEM structure implies that the variables {X1, …, Xp} have mean zero and covariance matrix
$$\Sigma = (I - C)^{-1}\, \Psi\, (I - C^T)^{-1}$$
where C is the matrix of coefficients cij, CT is its transpose, and I is the identity matrix. Causal identification in linear SEMs consists of solving for the non-zero cij (and Ψij) in terms of the covariance matrix Σ.
The method of path coefficients is a result derived by Sewall Wright [21, 22, 27], which expresses the covariance between any pair of variables in a linear SEM as a polynomial in the parameters (causal coefficients and error covariances) of the model:
$$\mathrm{Cov}(X, Y) = \sum_{p} T(p) \qquad (1)$$
where the summation ranges over unblocked paths p connecting X and Y, and T(p) is the product of the parameters (cij or Ψij) of the edges along p. A path in a graph is a sequence of edges such that each pair of consecutive edges share a common node and each node appears only once along the path. A path is unblocked if it does not contain a collider, that is, a pair of consecutive edges pointing at the common node. Two nodes are called d-separated if there are no unblocked paths between them. Eq (1) is valid for standardized variables that are normalized to have zero mean and unit standard deviation; if this is not the case, a modified rule can be applied where each T(p) is multiplied by the variance of the variable that acts as the “root” for path p [27].
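As a small numerical illustration of Wright's rule, the sketch below builds the covariance matrix implied by a linear SEM for the standard instrumental variable graph (Fig 1A) and compares two of its entries with the corresponding path sums. It relies on the standard identity Σ = (I − C)⁻¹ Ψ (I − Cᵀ)⁻¹ for a linear SEM X = CX + U. The coefficient values and the helper name `implied_covariance` are assumptions chosen for illustration and do not come from the paper.

```python
import numpy as np

def implied_covariance(C, Psi):
    """Covariance implied by the linear SEM X = C X + U with Cov(U) = Psi."""
    p = C.shape[0]
    B = np.linalg.inv(np.eye(p) - C)
    return B @ Psi @ B.T

# Variable order: E, X, Y. All numerical values are hypothetical.
a, c, psi_xy = 0.5, 0.3, 0.2
C = np.zeros((3, 3))
C[1, 0] = a                      # directed edge E -> X
C[2, 1] = c                      # directed edge X -> Y

# Error (co)variances chosen so that E, X and Y all have unit variance.
Psi = np.zeros((3, 3))
Psi[0, 0] = 1.0
Psi[1, 1] = 1.0 - a**2
Psi[1, 2] = Psi[2, 1] = psi_xy   # bidirected edge X <-> Y
Psi[2, 2] = 1.0 - c**2 - 2 * c * psi_xy

Sigma = implied_covariance(C, Psi)

# E -> X -> Y is the only unblocked path between E and Y;
# E -> X <-> Y is blocked because X is a collider on it.
cov_EY_paths = a * c
# X and Y are connected by two unblocked paths: X -> Y and X <-> Y.
cov_XY_paths = c + psi_xy

print(Sigma[0, 2], cov_EY_paths)
print(Sigma[1, 2], cov_XY_paths)
```

In this toy graph the ratio Cov(E, Y) / Cov(E, X) returns c, whereas Cov(X, Y) is contaminated by the confounding term, which is the usual motivation for using an instrument.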
Instrumental sets
Fix a variable Y in a linear SEM, and assume we have the equation
$$Y = c_1 X_1 + \dots + c_K X_K + U_Y$$
in our SEM. Brito and Pearl [24] derived a sufficient graphical condition for identifying the parameters c1, …, cK by application of Wright’s rule (1), of which we present a simplified version. The set of variables {E1, …, EK} is said to be an “instrumental set” relative to {X1, …, XK} if there exist pairs (E1, p1), …, (EK, pK) such that for each i the following three conditions hold.
- Ei is a non-descendant of Y and pi is an unblocked path between Ei and Y including edge Xi → Y.
- Ei is d-separated from Y in $\bar{G}$, the graph obtained after removing the edges X1 → Y, …, XK → Y from the original graph $G$.
- For j > i, the variable Ej does not appear in path pi, and if paths pi and pj have a common variable V, then both pi[V ∼ Y] and pj[Ej ∼ V] point to V, where pi[V ∼ Y] and pj[Ej ∼ V] denote truncations of pi and pj from V and until V, respectively.
If {E1, …, EK} is an instrumental set relative to {X1, …, XK}, then the parameters of the edges X1 → Y, …, XK → Y are identifiable and can be computed by a set of linear equations [24]. For concrete examples verifying the instrumental set conditions, see S1 File.
Causal effect identification in MVMR with correlated instruments
We assume genes X1, …, XK are colocated in the same genomic locus and have a potential causal effect on a phenotypic trait Y associated to the locus. We assume there exist K causal variants in the locus such that each gene has at least one causal variant (causal variants may be shared between genes). The variants are correlated by LD, the exposures Xi can be connected by causal relations or unmodelled latent factors, and likewise there can be latent factors affecting the exposures Xi and outcome Y. Randomization of genotypes ensures that there are no unmodelled correlations between the Ej and Xi. An example causal diagram for K = 2 is shown in Fig 1B.
We now show that {E1, …, EK} is an instrumental set relative to {X1, …, XK}. By relabelling, we may assume that there exists an edge Ei → Xi in the causal diagram. Define the paths pi = {Ei → Xi, Xi → Y}. Since Ei is d-separated from Y in the diagram where all edges Xj → Y are removed, it follows that the set {(E1, p1), …, (EK, pK)} satisfies the three instrumental set conditions (see above) and the causal parameters ci are identifiable. We apply Wright’s rule (Eq (1)), following the general proof of Brito and Pearl [24]. By Eq (1),
$$\mathrm{Cov}(E_i, Y) = \sum_{p} T(p) \qquad (2)$$
where the sum is over unblocked paths p. It is clear that any path that starts in Ei, ends in Y, and includes a bidirected edge Xj ↔ Y must be blocked by a collider V → Xj ↔ Y, where V can be either a variant or another gene. Hence no such paths enter the sum in Eq (2). In other words, all unblocked paths must end with an edge Xj → Y contributing a factor cj to T(p). Collecting all paths by their final edge, the sum can be decomposed as
$$\mathrm{Cov}(E_i, Y) = \sum_{j=1}^{K} c_j \sum_{p'} T(p')$$
where the inner sum now extends over the subset of paths p′ from Ei to Xj obtained by truncating the unblocked paths from Ei to Y which have Xj as their penultimate node. Since the truncation of an unblocked path is also necessarily unblocked, each p′ is an unblocked path from Ei to Xj. Vice versa, every unblocked path from Ei to Xj can be extended to an unblocked path from Ei to Y by adding the edge Xj → Y. Hence, the sum over p′ is the sum over all unblocked paths from Ei to Xj. By another application of Wright’s rule (1),
$$\sum_{p'} T(p') = \mathrm{Cov}(E_i, X_j)$$
and hence
$$\mathrm{Cov}(E_i, Y) = \sum_{j=1}^{K} \mathrm{Cov}(E_i, X_j)\, c_j$$
In matrix-vector notation, we obtain:
$$\Sigma_{EY} = \Sigma_{EX}\, c$$
where c = (c1, …, cK)T is the vector of causal effects, and ΣEX and ΣEY are the matrix and vector of covariances $[\Sigma_{EX}]_{ij} = \mathrm{Cov}(E_i, X_j)$ and $[\Sigma_{EY}]_{i} = \mathrm{Cov}(E_i, Y)$, respectively.
Since ΣEX is not a symmetric matrix, we cannot assume that an inverse exists, but we can always write the solution to the linear set of equations using a generalized left inverse:
$$c = \left( \Sigma_{EX}^T \Sigma_{EX} \right)^{-1} \Sigma_{EX}^T\, \Sigma_{EY} \qquad (3)$$
that is, find the least squares solution. These equations are valid for standardized variables.
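The identification result can be checked numerically at the population level. The sketch below is an illustration under assumed parameter values, not the authors' code: it builds the covariance matrix implied by a linear SEM for the K = 2 graph of Fig 1B, with two instruments in strong LD, an interaction X1 → X2 between the exposures and latent confounding of exposures and outcome, and then recovers the causal effects as the least squares solution of Σ_EX c = Σ_EY. It also illustrates the situation of Fig 1C, where a single causal variant leaves Σ_EX rank deficient. All numerical values are hypothetical.

```python
import numpy as np

rho = 0.9                          # LD (correlation) between E1 and E2
A = np.array([[0.30, 0.10],        # A[i, j]: direct effect of E_i on X_j
              [0.10, 0.25]])
b = 0.2                            # exposures may interact: X1 -> X2
c_true = np.array([0.3, -0.2])     # direct effects of X1 and X2 on Y

# Variable order: E1, E2, X1, X2, Y; C[i, j] is the effect of variable j on i.
p = 5
C = np.zeros((p, p))
C[2, 0], C[2, 1] = A[0, 0], A[1, 0]    # E1, E2 -> X1
C[3, 0], C[3, 1] = A[0, 1], A[1, 1]    # E1, E2 -> X2
C[3, 2] = b                            # X1 -> X2
C[4, 2], C[4, 3] = c_true              # X1, X2 -> Y

Psi = np.eye(p)
Psi[0, 1] = Psi[1, 0] = rho            # E1 <-> E2 (LD)
Psi[2, 3] = Psi[3, 2] = 0.2            # X1 <-> X2
Psi[2, 4] = Psi[4, 2] = 0.3            # X1 <-> Y (confounding)
Psi[3, 4] = Psi[4, 3] = 0.3            # X2 <-> Y (confounding)

B = np.linalg.inv(np.eye(p) - C)
Sigma = B @ Psi @ B.T                  # implied covariance of all variables

Sigma_EX = Sigma[:2, 2:4]              # instrument-exposure covariances
Sigma_EY = Sigma[:2, 4]                # instrument-outcome covariances

# Least squares solution of Sigma_EX c = Sigma_EY.
c_iv, *_ = np.linalg.lstsq(Sigma_EX, Sigma_EY, rcond=None)
rank_full = np.linalg.matrix_rank(Sigma_EX, tol=1e-10)

# Fig 1C situation: one causal variant, the second associated through LD only.
Sigma_E = Psi[:2, :2]
A_one_causal = np.array([[0.30, 0.10],
                         [0.00, 0.00]])
rank_deficient = np.linalg.matrix_rank(Sigma_E @ A_one_causal, tol=1e-10)

print(c_iv, rank_full, rank_deficient)
```

Because the instruments are uncorrelated with the error term of Y, the relation Σ_EY = Σ_EX c holds exactly in this sketch despite the LD of 0.9, and the rank of Σ_EX distinguishes the identifiable from the non-identifiable case.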
The generalized method of moments for causal effect estimation in MVMR with correlated instruments
To estimate the causal effects from observational data, we consider multiple finite-sample approximations to the above equations. For finite-sample estimation, the number of instruments can be greater than the number of exposures, that is, we assume we have N observations of the random variables {E1, …, EL}, {X1, …, XK}, and Y, with L ≥ K. We use lower-case letters to denote sample values, e.g. eil denotes the value of El in observation i. We use the notation ei⋅, e⋅l, and e⋅⋅ to represent the vector of sample values for all L instruments in observation i, the vector of all N observations of instrument l, and the N × L matrix of all observations for all instruments, respectively, and similarly for xi⋅, x⋅k, and x⋅⋅. We assume the observations for all variables are standardized to mean zero and unit standard deviation.
The generalized method of moments (GMM) [25] is based on moment functions that depend on observable random variables and unknown parameters, and that have zero expectation in the population when evaluated at the true parameters:
$$\mathbb{E}\left[ g(w_i, c) \right] = 0 \qquad (4)$$
where g denotes the moment function, wi is a vector of samples of random variables in observation i, and c is a vector of unknown parameters. The moment function g can be linear or nonlinear.
Assuming a linear MVMR SEM as before, the natural moment functions are of the form
$$g_l(w, c) = E_l \left( Y - X^T c \right), \qquad l = 1, \dots, L \qquad (5)$$
that is, we consider one moment equation per instrument. From the structural equation for Y in the assumed SEM (see above and Fig 1), it follows that Y − XTc = UY, the residual error on Y, and since all instruments El are assumed independent of UY (no bidirected arrows between El and Y), that is, $\mathbb{E}[E_l U_Y] = 0$, it follows immediately that the moment conditions are satisfied:
$$\mathbb{E}\left[ E_l \left( Y - X^T c \right) \right] = \mathbb{E}\left[ E_l U_Y \right] = 0$$
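Replacing the expectation with the empirical mean over the observations results in L equations in K unknowns (c1, …, cK) which can be written in matrix-vector notation as
$$\frac{1}{N}\, e_{\cdot\cdot}^T \left( y - x_{\cdot\cdot}\, c \right) = 0 \qquad (6)$$
where y is the N-vector of outcome observations. If the number of instruments equals the number of unknowns, L = K, this system can be solved exactly, and by the standardization of all variables, the solution reduces to the least-squares estimator $\hat{c}_{LS}$,
$$\hat{c}_{LS} = \left( \hat{\Sigma}_{EX}^T \hat{\Sigma}_{EX} \right)^{-1} \hat{\Sigma}_{EX}^T\, \hat{\Sigma}_{EY}, \qquad \hat{\Sigma}_{EX} = \tfrac{1}{N}\, e_{\cdot\cdot}^T x_{\cdot\cdot}, \quad \hat{\Sigma}_{EY} = \tfrac{1}{N}\, e_{\cdot\cdot}^T y \qquad (7)$$
which is also obtained by replacing the true covariances in the instrumental variable formula (3) by their empirical estimates.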
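If the model is overdetermined, L > K, then in general there is no unique solution to (6), because not all L sample moments will hold exactly. While the least-squares estimator provides one approximate solution to the overdetermined system, Hansen [25] proposed an alternative solution, based on bringing the sample moments as close to zero as possible by minimizing the quadratic form
$$Q_N(c) = \left[ \frac{1}{N} \sum_{i=1}^{N} g(w_i, c) \right]^T \Delta \left[ \frac{1}{N} \sum_{i=1}^{N} g(w_i, c) \right] \qquad (8)$$
with respect to the parameters c, where Δ is a positive definite L × L weighting matrix, and the quantity between the large square brackets is understood to be the L-vector of moment functions. It can be shown that this yields a consistent estimator $\hat{c}$ of c, under certain regularity conditions. Again using the standardization of all variables, the cost function can be written in terms of the empirical covariances $\hat{\Sigma}_{EX}$ and $\hat{\Sigma}_{EY}$ as
$$Q_N(c) = \left( \hat{\Sigma}_{EY} - \hat{\Sigma}_{EX}\, c \right)^T \Delta \left( \hat{\Sigma}_{EY} - \hat{\Sigma}_{EX}\, c \right)$$
with solution
$$\hat{c} = \left( \hat{\Sigma}_{EX}^T\, \Delta\, \hat{\Sigma}_{EX} \right)^{-1} \hat{\Sigma}_{EX}^T\, \Delta\, \hat{\Sigma}_{EY} \qquad (9)$$
When Δ = I, the identity matrix, we again obtain the least-squares estimator, $\hat{c}_{LS}$.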
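While all choices of Δ lead to a consistent estimator, in finite samples different Δ-matrices will lead to different point estimates. Hansen [28] showed that the GMM estimator with the smallest asymptotic variance is obtained by taking Δ equal to the inverse of the variance-covariance matrix of the moment functions (5). If the errors are homoskedastic, $\mathbb{E}[U_Y^2 \mid E_1, \dots, E_L] = \sigma^2$, we have
$$\Delta^{-1} = \sigma^2\, \Sigma_E$$
where ΣE is the L × L covariance (LD) matrix of the instruments, and obtain the estimator
$$\hat{c}_{GMM} = \left( \hat{\Sigma}_{EX}^T\, \hat{\Sigma}_E^{-1}\, \hat{\Sigma}_{EX} \right)^{-1} \hat{\Sigma}_{EX}^T\, \hat{\Sigma}_E^{-1}\, \hat{\Sigma}_{EY} \qquad (10)$$
in which the constant σ2 cancels and hats denote empirical covariance estimates. We will refer to this estimator as the GMM estimator. Note that this estimator is equivalent to the two-stage least squares estimator.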
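The estimators of this section can be compared on simulated data. The sketch below is illustrative only and rests on several assumptions that are not taken from the paper: instruments are Gaussian stand-ins for genotypes with an AR(1)-type LD matrix, two of the six variants are causal, a single latent confounder acts on both exposures and the outcome, and variables are centred but not scaled so that the estimates stay on the scale of the assumed true effects. It computes the weighted least squares of the instrument-outcome covariances on the instrument-exposure covariances for Δ = I and for Δ equal to the inverse instrument covariance, and compares the latter with an explicit two-stage least squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, K = 20000, 6, 2
c_true = np.array([0.3, -0.2])          # hypothetical direct causal effects

# Correlated instruments: Gaussian stand-ins for genotypes with AR(1)-type LD.
idx = np.arange(L)
ld = 0.8 ** np.abs(idx[:, None] - idx[None, :])
e = rng.multivariate_normal(np.zeros(L), ld, size=N)

# Two causal variants (0 and 3); the others are associated through LD only.
A = np.zeros((L, K))
A[0, 0] = 0.5
A[3, 1] = 0.5
u = rng.normal(size=N)                  # unobserved confounder of X and Y
x = e @ A + u[:, None] + rng.normal(size=(N, K))
y = x @ c_true + u + rng.normal(size=N)

# Centre only (no scaling), so estimates stay on the scale of c_true.
e, x, y = (m - m.mean(axis=0) for m in (e, x, y))
S_EX, S_EY, S_E = e.T @ x / N, e.T @ y / N, e.T @ e / N

def gmm(S_EX, S_EY, Delta):
    """Minimise (S_EY - S_EX c)^T Delta (S_EY - S_EX c) over c."""
    M = S_EX.T @ Delta
    return np.linalg.solve(M @ S_EX, M @ S_EY)

c_ls = gmm(S_EX, S_EY, np.eye(L))              # Delta = identity
c_gmm = gmm(S_EX, S_EY, np.linalg.inv(S_E))    # Delta = inverse instrument covariance

# Explicit two-stage least squares and naive regression of y on x, for comparison.
x_hat = e @ np.linalg.lstsq(e, x, rcond=None)[0]
c_2sls = np.linalg.lstsq(x_hat, y, rcond=None)[0]
c_ols = np.linalg.lstsq(x, y, rcond=None)[0]

print("LS  :", c_ls)
print("GMM :", c_gmm)
print("2SLS:", c_2sls)
print("OLS :", c_ols)
```

The function `gmm` only needs the three covariance summaries, which is why the same formula can be evaluated from summary statistics; the naive regression is included to show the effect of the confounder when no instruments are used.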
Simulations with randomly generated instrument correlations
We simulated data from idealized models corresponding to the causal diagrams in Fig 1 with binomially distributed instruments and normally distributed exposure and outcome variables.
For the diagrams with a single exposure (Fig 1A and 1D), we set the true causal effect cX = 0.3. In the case of one instrument (Fig 1A) we varied the instrument strength a between 0.01 (weak instrument), 0.3 (mediocre instrument), and 0.8 (strong instrument). In the case of multiple instruments (Fig 1D), we simulated ten instruments with randomly sampled instrument strengths, either all strong (effect sizes between 0.1 − 0.3) or all weak (effect sizes between 0.001 − 0.03).
For the diagrams with multiple exposures and correlated instruments (Fig 1B, 1C, 1E and 1F), we set the true causal effects and
in all simulations.
To test the effect of instrument correlation we simulated data with two instruments (Fig 1B) where the pairwise correlation of the instruments varied between 0.01, 0.3, 0.7, 0.9 and 0.95. The matrix A = [aij] of instrument effect sizes, where aij is the direct causal effect of Ei on Xj, was generated randomly such that its determinant was greater than 0.05 and there were no weak instruments (aij between 0.1 − 0.3).
To test the effect of weak instruments, we simulated data with two instruments (Fig 1B) with randomly generated non-zero pairwise correlation of the instruments and randomly generated matrix of instrument effect sizes A with determinant either less than 0.001 and effect size of each instrument less than 0.001 or determinant greater than 0.05 and effect size of each instrument between 0.1 − 0.3.
To test robustness of the causal effect estimates, we simulated data with two instruments and two exposures (Fig 1B), with randomly generated non-zero pairwise correlation between the instruments, randomly generated matrix of instrument effect sizes A with determinant greater than 0.05, and true causal effects of the exposures on the outcome equal to and
, but estimated causal effects for each exposure separately under the (false) hypothesis that the data was generated from Fig 1A.
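Why the single-exposure analysis is biased in this setting can be seen at the population level. The sketch below is an illustration with assumed values, not the simulation code of the paper: it constructs the instrument-exposure and instrument-outcome covariances for two correlated instruments and two exposures, and compares the joint solution with single-instrument ratio estimates that ignore the second exposure. The pairing of instrument Ek with exposure Xk, and all numerical values, are assumptions.

```python
import numpy as np

rho = 0.7                                  # hypothetical LD between E1 and E2
Sigma_E = np.array([[1.0, rho],
                    [rho, 1.0]])
A = np.array([[0.30, 0.10],                # A[i, j]: direct effect of E_i on X_j
              [0.15, 0.25]])
c_true = np.array([0.3, -0.2])             # hypothetical direct effects on Y

Sigma_EX = Sigma_E @ A                     # Cov(E_i, X_j) for X = A^T E + U_X
Sigma_EY = Sigma_EX @ c_true               # Cov(E_i, Y) for Y = c^T X + U_Y

# Joint (multivariable) solution of Sigma_EX c = Sigma_EY.
c_mvmr = np.linalg.solve(Sigma_EX, Sigma_EY)

# Single-exposure ratio estimates, one instrument per exposure (Fig 1A model).
c_single = np.array([Sigma_EY[k] / Sigma_EX[k, k] for k in range(2)])

# The bias is the effect of the ignored exposure, scaled by how strongly
# the instrument is associated with it relative to the analysed exposure.
bias_first = c_true[1] * Sigma_EX[0, 1] / Sigma_EX[0, 0]

print(c_mvmr, c_single, bias_first)
```

With these values the ratio estimate for the second exposure even has the opposite sign of its assumed true effect, because both instruments are also associated with the first exposure.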
Simulations with instrument correlations derived from real LD matrices
For more realistic simulations, we simulated data with binomially distributed genotypes and normally distributed exposure and outcome variables in such a way that the summary statistics from the simulated data match summary statistics at real GWAS loci for coronary artery disease (CAD). We used eQTL summary statistics from GTEx [12] and CAD GWAS summary statistics from the CARDIoGRAMplusC4D meta-analysis [29].
For a given locus of interest with cis-eQTLs E1, …, EL for genes X1, …, XK (L ≥ K), with LD matrix ΣE (size L × L), matrix of eQTL summary statistics A (size L × K), and vector of CAD GWAS summary statistics ΣEY (size L × 1), we simulated data as follows.
To sample binomially distributed genetic instruments according to a predefined LD matrix, we made the simplifying assumption that the correlation structure between the simulated instruments can be described by a Markov chain. We sampled the first SNP’s genotype from a binomial distribution with probability of success equal to its real minor allele frequency (MAF). For every other SNP, we sampled a genotype from a binomial distribution with probability of success conditional on the genotype of the previous SNP in the sequence, such that both the MAF and pairwise LD between successive SNPs match the real data from the simulated locus. Further details are given in S1 File.
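One way such a Markov chain could be constructed is sketched below. This is not the procedure of S1 File, whose details are not reproduced here, but a simplified variant under an additional assumption: each genotype is the sum of two independent haplotypes, and the chain runs along each haplotype with transition probabilities chosen so that the allele frequencies and the correlation between successive SNPs match prescribed targets. The function name `sample_genotypes` and its arguments are hypothetical.

```python
import numpy as np

def sample_genotypes(maf, r_adjacent, n, rng):
    """Sample n genotypes (0/1/2) for len(maf) SNPs from a first-order Markov chain.

    maf[l]        : target allele frequency of SNP l
    r_adjacent[l] : target correlation between SNP l and SNP l + 1
    """
    maf = np.asarray(maf, dtype=float)
    L = len(maf)
    hap = np.empty((2, n, L), dtype=int)       # two haplotypes per individual
    hap[:, :, 0] = rng.random((2, n)) < maf[0]
    for l in range(1, L):
        p_prev, p = maf[l - 1], maf[l]
        # Covariance between successive alleles implied by the target correlation.
        d = r_adjacent[l - 1] * np.sqrt(p_prev * (1 - p_prev) * p * (1 - p))
        p1 = p + d / p_prev                    # P(allele_l = 1 | allele_{l-1} = 1)
        p0 = p - d / (1 - p_prev)              # P(allele_l = 1 | allele_{l-1} = 0)
        if not (0 <= p0 <= 1 and 0 <= p1 <= 1):
            raise ValueError("target LD is incompatible with the allele frequencies")
        cond = np.where(hap[:, :, l - 1] == 1, p1, p0)
        hap[:, :, l] = rng.random((2, n)) < cond
    return hap.sum(axis=0)                     # genotype = sum of both haplotypes

# Usage with hypothetical allele frequencies and adjacent-SNP correlations.
g = sample_genotypes([0.3, 0.4, 0.2], [0.6, 0.5], 50000, np.random.default_rng(1))
print(g.mean(axis=0) / 2)
print(np.corrcoef(g, rowvar=False))
```

A consequence of the first-order Markov assumption is that the correlation between non-adjacent SNPs is the product of the correlations along the chain, so only the LD between successive SNPs can be matched exactly; this is the sense in which the assumption is simplifying.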
We assumed that a randomly selected subset of at least K eQTLs were causal and all others were associated with the exposures only through LD. For the causal eQTLs, we sampled the vector a of instrument effect sizes randomly between 0.1 and 0.3,
with σ2 = 1.
Finally, we set the causal effect c of X on the outcome Y equal to the value obtained from eq. (3) with the real ΣEY, and sampled outcome data from
with σ2 = 1.
We performed simulations to analyze the influence of LD matrix accuracy, weak instrument bias, standard error calculation method, and horizontal pleiotropy.
For the simulations on the accuracy of the LD matrix, we generated datasets for the SLC22A3-LPA-PLG locus (centered on chr 6: 161089307) in liver as exposures. Causal effects of the genes on the outcome variable Y were calculated to be cSLC22A3 = 0.15, cLPA = −0.05 and cPLG = −0.27 when seven instruments were used, and the same values were also used in the other simulations. We compared the situations where L = 3 (LD threshold of 0.25), L = 4 (LD threshold of 0.3), L = 5 (LD threshold of 0.4), L = 6 (LD threshold of 0.5), and L = 7 instruments (LD threshold of 0.8) were used. We performed one-sample simulations where the sample size varied between 500 − 30000. For the instrument effect sizes, we assumed that three eQTLs were causal for all genes (their effect sizes sampled from a uniform distribution within the range 0.1 − 0.3), while additional eQTLs were assumed to be associated with the exposures only through LD.
For the simulations on weak instrument bias, we generated datasets within the same locus and setup as the LD accuracy simulations, except that here we kept L = 8 eQTLs (LD threshold of 0.96) and varied the sample size between 500 − 10000. For the instrument effect sizes, we again assumed that only three eQTLs were causal for all genes, their effect sizes sampled from a uniform distribution within the range 0.1 − 0.3 for strong instruments and within the range 0.001 − 0.01 for weak instruments. We confirmed that these instruments were indeed weak or strong by computing conditional F-statistics with the MendelianRandomization package [30] as well as the MVMR package [31] (S3 Fig).
For the simulations comparing the standard errors computed approximately using summary-level data and exactly using individual-level data, we generated datasets within the same locus and setup as the previous simulations. We kept L = 7 eQTLs (LD threshold of 0.95) and varied the sample size between 500 − 10000. For the instrument effect sizes, we again assumed that only three eQTLs were causal for all genes. Details of the standard error calculation using summary-level and individual-level data are given in S1 File.
To simulate the effect of horizontal pleiotropy, we generated datasets for the SLC22A3-LPA-PLG locus with causal effects of the genes on the outcome variable Y set to cLPA = −0.05 and cPLG = −0.27 and cSLC22A3 ∈ [0.0, 0.4]. We used eight instruments where only three eQTLs were causal for all genes, their effect sizes sampled from a uniform distribution within the range 0.1 − 0.3. We estimated the effects in a misspecified estimation model where we performed MVMR only using LPA-PLG genes, assuming horizontal pleiotropy through one unknown/unmeasured gene SLC22A3 in the same locus.
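The bias mechanism in the misspecified model can be illustrated at the population level, without sampling noise. The sketch below assumes a linear SEM in which cov(E, X) = Σ_EE A and cov(E, Y) = Σ_EE A c; the LD matrix and instrument effect matrix are hypothetical toy values, and only the causal effects of LPA and PLG are taken from the text.

```python
import numpy as np

# Hypothetical toy values; only the causal effects of LPA and PLG come from the text.
LD = np.array([[1.0, 0.8, 0.6],
               [0.8, 1.0, 0.7],
               [0.6, 0.7, 1.0]])
EFFECTS = np.array([[0.30, 0.10, 0.20],   # columns: SLC22A3, LPA, PLG
                    [0.10, 0.25, 0.15],
                    [0.20, 0.15, 0.30]])

def estimates(c_hidden, c_lpa=-0.05, c_plg=-0.27):
    """Return population-level (full-model, misspecified-model) solutions."""
    causal = np.array([c_hidden, c_lpa, c_plg])
    sigma_ex = LD @ EFFECTS                  # cov(E, X) implied by the SEM
    sigma_ey = sigma_ex @ causal             # cov(E, Y) implied by the SEM
    full = np.linalg.solve(sigma_ex, sigma_ey)
    # Misspecified model: SLC22A3 is unmeasured, so its column is dropped
    # and the remaining system is solved by least squares.
    reduced = np.linalg.lstsq(sigma_ex[:, 1:], sigma_ey, rcond=None)[0]
    return full, reduced
```

With `c_hidden = 0` the reduced model returns the true effects of LPA and PLG; for any non-zero hidden effect, the part of cov(E, Y) that flows through SLC22A3 is absorbed by the two measured genes.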
To simulate the effect of potential bias introduced by two-sample MR, we generated independently sampled eQTL and GWAS datasets for the MRAS-ESYT3 locus in the Aorta tissue of STARNET, which share L = 5 eQTLs in their locus (centred on chr 3:138121920), with causal effects of the genes on the outcome variable Y set to cMRAS = 0.208 and cESYT3 = −0.294. We simulated eQTL sample sizes between 300 − 4000 and GWAS sample sizes between 10000 − 140000.
Comparison to other MVMR methods
We compared the least squares (LS) (Eq (7)) and GMM (Eq (10)) estimators to seven other multivariable MR methods: Transcriptome-Wide MR (TWMR) [19], the mv_multiple estimator from the TwoSampleMR package [32], the mr_mvivw estimator from the MendelianRandomization package [30], the MVMR-Robust, MVMR-Median, and MVMR-Lasso estimators [33], and the MVMR-cML estimator [34].
The least squares and GMM estimators can also be applied in the univariate MR setting. In this setting we performed a comparison to all available methods in the MR-Base package, namely Inverse Variance Weighted (IVW), Weighted Median (WM), Maximum Likelihood (MaxLik), and MR-Egger (Egger). We also included TWMR in the comparison.
In the course of our analysis we identified the presence of a “regularization” term in the TWMR source code that was hard-coded to a specific value, presumably related to the sample size of the use-case from their paper [10]. More precisely, for the weight matrix used in the GMM estimator, H = ((XTE) ⋅ (ETE)−1 ⋅ (ETX))−1, they used shrinkage of the form
Here α is
where N = 3781 and I is the identity matrix. Results reported here used the published version of TWMR including this hard-coded term.
Prediction of causal genes and tissues at GWAS loci for coronary artery disease
We used eQTL summary statistics from the Stockholm-Tartu Atherosclerosis Reverse Network Engineering Task (STARNET) study, a genetics-of-gene expression study of tissue samples from blood (n = 481), atherosclerotic-lesion-free internal mammary artery (MAM, n = 552), atherosclerotic aortic root (AOR, n = 539), subcutaneous fat (SF, n = 573), visceral abdominal fat (VAF, n = 531), skeletal muscle (SKLM, n = 534), and liver (LIV, n = 576) obtained during open thorax surgery of 600 coronary artery disease (CAD) patients [26]. eQTL summary statistics from matching tissues in GTEx [12] were used for replication analyses. Significant cis-eQTL associations were defined as in the original STARNET study (FDR <5% across all tested SNP-gene combinations where the SNP is within 1Mb up or downstream of the center of the gene).
We used coronary artery disease (CAD) as the outcome trait and used summary statistics from the CARDIoGRAMplusC4D genome-wide association meta-analysis (GWAMA) study [29]. Summary statistics specific to the European population were extracted using the TwoSampleMR package [32] (study ID ebi-a-GCST003116, n = 141,217).
To define locus boundaries and candidate genes of interest for MVMR, we considered genome-wide significant SNPs, all their linked eQTLs within a 0.5Mb radius, and all the genes they are associated with in one or more tissues (“eGenes”). Specifically, we first created tissue-specific outcome data with GWAS summary statistics for all SNPs with non-zero (FDR <5%) eQTL effect in the STARNET summary statistics for the tissue of interest. We filtered these eQTLs by their GWAS p-values, retaining only those with p-values below 5 × 10⁻⁸ in the GWAS study. We then iterated through these eQTLs, sorted by GWAS p-value, such that for the first iteration, the first eQTL would be the first lead SNP. We collected all eQTLs within a 0.5Mb radius and with non-zero LD with the lead SNP, and then removed the lead SNP as well as its collection of linked eQTLs from the list, repeating the procedure with the next lead SNP. For each lead SNP and its collection of linked eQTLs, we first removed SNPs in perfect LD with the lead SNP and then treated the remaining eQTLs as instrumental variables in one causal diagram containing all genes associated with these eQTLs either in a specific tissue, or across all tissues, depending on the application, to complete the diagram. This procedure ensures that for each locus of interest, we obtain a closed causal diagram, where “closed” means that no more genes (exposures) can be added to the diagram that have shared cis-eQTL associations with any eQTLs (instruments) either already in the diagram or linked to eQTLs in the diagram.
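The iteration over lead SNPs can be sketched as follows. This is a simplified illustration under stated assumptions: the input structures (`snps`, `r2`, `genes_of`) and the function name `build_loci` are hypothetical, missing LD entries are treated as zero LD, and the sketch does not implement the final closure step that adds further genes sharing eQTLs with the diagram.

```python
def build_loci(snps, r2, genes_of, p_max=5e-8, radius=500_000):
    """Greedy locus definition around GWAS lead SNPs.

    snps:     dict SNP id -> (chromosome, position, GWAS p-value), eQTLs only
    r2:       dict frozenset({a, b}) -> LD r^2 (absent pairs count as zero LD)
    genes_of: dict SNP id -> set of eGenes associated with that SNP
    """
    remaining = sorted((s for s in snps if snps[s][2] < p_max),
                       key=lambda s: snps[s][2])
    loci = []
    while remaining:
        lead = remaining[0]
        chrom, pos, _ = snps[lead]
        # eQTLs within the radius and with non-zero LD with the lead SNP
        linked = [s for s in remaining[1:]
                  if snps[s][0] == chrom
                  and abs(snps[s][1] - pos) <= radius
                  and r2.get(frozenset((lead, s)), 0.0) > 0.0]
        taken = {lead, *linked}
        remaining = [s for s in remaining if s not in taken]
        # Drop SNPs in perfect LD with the lead SNP; keep the rest as instruments
        instruments = [lead] + [s for s in linked
                                if r2[frozenset((lead, s))] < 1.0]
        exposures = set().union(*(genes_of[s] for s in instruments))
        loci.append({"lead": lead, "instruments": instruments,
                     "exposures": exposures})
    return loci
```

Each returned entry corresponds to one causal diagram, with the instruments as E variables and the eGenes as candidate exposures.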
We computed conditional F-statistics using the MendelianRandomization [30] and MVMR [31] packages to verify instrument strength.
In replication analyses using GTEx, the same procedure was implemented independently for the same GWAS loci in matching tissues, using all cis-eQTL associations in GTEx (which may differ from the cis-eQTL associations in STARNET in the same locus). Hence, replications in GTEx represent the best possible causal effect estimation in GTEx and are not biased by the exposure gene sets used in STARNET.
The GWAS study performed logistic regression of the dichotomous outcome trait on SNP genotype, so the β-value for a SNP is the increase in log-odds of the outcome. Hence in our models, the predicted causal effect of a gene X on the outcome Y (CAD) is to be interpreted as the effect of X on the log-odds of Y.
We computed summary-based standard errors using the sample size from the GWAS study (see S1 File for details) and obtained p-values from its asymptotic normal distribution. We confirmed using simulations that summary-based and individual-level based standard errors were comparable (S6 Fig). We further confirmed standard errors using the “mr_mvivw” method of the “MendelianRandomization” package (S6 Fig). We note that to the best of our knowledge, the only standard error estimator that takes into account variability due to the different sample sizes in the two-sample setting [35] requires individual-level data and can thus not be computed in the two-sample summary-data setting.
Results
Causal effect identification and estimation in MVMR with correlated instruments
To analyze the problem of MVMR with correlated instruments, we consider causal diagrams where one or more genes Xi are colocated in the same genomic locus and have a potential causal effect on a phenotypic trait Y associated to the same locus. We assume that there exist at least as many causal variants in the locus as there are potential causal genes, such that each gene has at least one causal variant (causal variants may be shared between genes).
If there is only one gene in the locus, we have the classical instrumental variable graph that underpins standard (univariate) MR (Fig 1A). Application of the method of path coefficients (see Methods) shows immediately that the causal effect c of the exposure gene X on the outcome Y is given by the usual instrumental variable formula c = σEY/σEX, where E is a causal variant (or a variant in LD with a causal one) for X, and σEX and σEY are the covariances between E and X, and E and Y, respectively.
In the case of regulatory pleiotropy, there will be multiple putative causal genes in the same locus, and we obtain an MVMR causal graph with correlated instruments (Fig 1B). Using analytical results derived by Brito and Pearl [24], it can be shown that the variants {E1, …, EK} in this graph form an instrumental set relative to the exposures (genes) {X1, …, XK} (see Methods), which permits identification of the direct causal effects c = (c1, …, cK)T associated to the edges X1 → Y, …, XK → Y in the underlying linear structural equation model (SEM). Application of the method of path coefficients shows that the vector c of causal effects satisfies the system of equations (see Methods for details)
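$$\Sigma_{EX}\, c = \Sigma_{EY} \tag{11}$$
with solution
$$c = \Sigma_{EX}^{-1}\, \Sigma_{EY} \tag{12}$$
where $\Sigma_{EX}$ and $\Sigma_{EY}$ are the matrix and vector of covariances $[\Sigma_{EX}]_{ij} = \mathrm{cov}(E_i, X_j)$ and $[\Sigma_{EY}]_{i} = \mathrm{cov}(E_i, Y)$, respectively. We note the similarity between Eqs (11) and (12) and the standard instrumental variable solution for univariate MR.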
Eqs (11) and (12) describe the causal effects as the coefficients in a linear relation between univariate covariances, and this relation is exact in the infinite sample limit, regardless of LD levels between the instruments. Furthermore, Eqs (11) and (12) remain valid if the number of genetic variants is greater than the number of candidate genes, regardless of whether the additional variants are causal or not (Fig 1D and 1E); the linear system is then merely overdetermined. However, if the number of true causal variants is less than the number of exposures (Fig 1C and 1F), adding genetic instruments that are only in LD with the causal variants does not help. In this case, the variants violate the third instrumental set condition (see Methods), and Eq (11) is easily seen to be underdetermined.
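Both statements can be checked numerically at the population level. The sketch below assumes a linear SEM in which cov(E, X) = Σ_EE A and cov(E, Y) = Σ_EE A c, with A the matrix of instrument effects; the LD value of 0.9 and the entries of A are hypothetical, and the causal effects 0.2 and 0.6 are the values used in the simple simulations.

```python
import numpy as np

def solve_causal_effects(sigma_ex, sigma_ey):
    """Solve sigma_ex @ c = sigma_ey for an exactly determined system."""
    return np.linalg.solve(sigma_ex, sigma_ey)

ld = np.array([[1.0, 0.9], [0.9, 1.0]])        # two instruments in high LD
effects = np.array([[0.3, 0.1], [0.1, 0.2]])   # both variants are causal
c_true = np.array([0.2, 0.6])

sigma_ex = ld @ effects                         # cov(E, X) implied by the SEM
sigma_ey = sigma_ex @ c_true                    # cov(E, Y) implied by the SEM
c_recovered = solve_causal_effects(sigma_ex, sigma_ey)

# Only the first variant is causal; the second is merely in LD with it.
effects_one_causal = np.array([[0.3, 0.1], [0.0, 0.0]])
rank_one_causal = np.linalg.matrix_rank(ld @ effects_one_causal)
```

Here `c_recovered` equals `c_true` despite the high LD, whereas with a single causal variant the covariance matrix has rank one for two exposures, so the system has no unique solution.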
The covariances in Eqs (11) and (12) are the true covariances of the random variables Ei, Xj, and Y. To estimate the causal effects from observational data, we need finite-sample approximations to these equations. The most straightforward estimates are obtained by replacing the true covariances by their empirical estimates. We refer to this solution as $\hat{c}_{LS}$, the least squares (LS) estimator, that is,
$$\hat{c}_{LS} = \left(\hat{\Sigma}_{EX}^{T}\, \hat{\Sigma}_{EX}\right)^{-1} \hat{\Sigma}_{EX}^{T}\, \hat{\Sigma}_{EY}.$$
Since the empirical covariances can be obtained from univariate regressions, the least-squares estimator is also called the “regression of regression coefficients” method.
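A minimal sketch of the least-squares estimator on individual-level data is given below. The data-generating function `simulate` is a hypothetical illustration with normally distributed instruments and a shared unobserved confounder; it is not the simulation code used for the figures.

```python
import numpy as np

def ls_estimator(E, X, y):
    """Least-squares MVMR estimate from individual-level data.

    E: (n, L) instruments, X: (n, K) exposures, y: (n,) outcome.
    """
    n = E.shape[0]
    Ec, Xc, yc = E - E.mean(0), X - X.mean(0), y - y.mean()
    sigma_ex = Ec.T @ Xc / n      # empirical cov(E_i, X_j)
    sigma_ey = Ec.T @ yc / n      # empirical cov(E_i, Y)
    # Regress the instrument-outcome covariances on the
    # instrument-exposure covariances.
    return np.linalg.lstsq(sigma_ex, sigma_ey, rcond=None)[0]

def simulate(n, ld, effects, causal, rng):
    """Toy linear SEM with an unobserved confounder u of exposures and outcome."""
    E = rng.multivariate_normal(np.zeros(len(ld)), ld, size=n)
    u = rng.normal(size=n)
    X = E @ effects + u[:, None] + rng.normal(size=(n, effects.shape[1]))
    y = X @ causal + u + rng.normal(size=n)
    return E, X, y
```

When the instruments are standardized, the empirical covariances coincide with univariate regression coefficients, which is the sense in which the estimator regresses regression coefficients on each other.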
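The least-squares estimator belongs to a more general family of estimators obtained by the generalized method of moments (GMM; see Methods for details), which can be written as
$$\hat{c}_{\Delta} = \left(\hat{\Sigma}_{EX}^{T}\, \Delta\, \hat{\Sigma}_{EX}\right)^{-1} \hat{\Sigma}_{EX}^{T}\, \Delta\, \hat{\Sigma}_{EY},$$
where Δ is any square invertible matrix with dimensions equal to the number of instrumental variables. If Δ is the identity matrix (or if $\hat{\Sigma}_{EX}$ is invertible), we recover the least-squares estimator. The estimator $\hat{c}_{\Delta}$ is consistent for all choices of Δ. With homoskedastic errors, the estimator with theoretically minimal variance is obtained by taking $\Delta = \hat{\Sigma}_{EE}^{-1}$, the inverse of the empirical covariance matrix of the instruments, that is, the inverse of the LD matrix of the variants (see Methods for details). We will refer to this estimator as the GMM estimator $\hat{c}_{GMM}$:
$$\hat{c}_{GMM} = \left(\hat{\Sigma}_{EX}^{T}\, \hat{\Sigma}_{EE}^{-1}\, \hat{\Sigma}_{EX}\right)^{-1} \hat{\Sigma}_{EX}^{T}\, \hat{\Sigma}_{EE}^{-1}\, \hat{\Sigma}_{EY}.$$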
We observe that the GMM estimator is identical to the two-stage least squares estimator of McDaid et al. [20] and Porcu et al. [10].
Therefore, both the least squares, regression of regression coefficients estimator and the two-stage least squares estimator are consistent estimators for the MVMR instrumental variable formula (12), and both belong to a more general family of GMM estimators.
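The relation between the GMM family and two-stage least squares can be verified numerically. The sketch below is an illustration on centred individual-level data; the function names and the `weight` argument are hypothetical.

```python
import numpy as np

def gmm_estimator(E, X, y, weight=None):
    """GMM estimate (S_EX' W S_EX)^{-1} S_EX' W S_EY.

    W defaults to the inverse empirical covariance (LD) matrix of the instruments.
    """
    n = E.shape[0]
    Ec, Xc, yc = E - E.mean(0), X - X.mean(0), y - y.mean()
    s_ex, s_ey, s_ee = Ec.T @ Xc / n, Ec.T @ yc / n, Ec.T @ Ec / n
    W = np.linalg.inv(s_ee) if weight is None else weight
    return np.linalg.solve(s_ex.T @ W @ s_ex, s_ex.T @ W @ s_ey)

def two_stage_least_squares(E, X, y):
    """Stage 1: regress X on E. Stage 2: regress y on the fitted exposures."""
    Ec, Xc, yc = E - E.mean(0), X - X.mean(0), y - y.mean()
    X_hat = Ec @ np.linalg.lstsq(Ec, Xc, rcond=None)[0]
    return np.linalg.lstsq(X_hat, yc, rcond=None)[0]
```

With the default weight, `gmm_estimator` and `two_stage_least_squares` agree up to floating-point error on any dataset; with the identity as weight, `gmm_estimator` reduces to the least-squares estimator; and in an exactly determined system the choice of weight has no effect.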
Causal effect estimation from simulated data
To test the accuracy of the least-squares (LS) and GMM estimators we performed simulations under a wide range of scenarios. In a first set of experiments, we considered exactly determined systems with normally distributed variables, where the LS and GMM estimators are identical (see Methods). As predicted by theory, the LS/GMM estimator is consistent with variance decreasing with increasing sample size in all simulations (Fig 2). In the literature, no simulations of MVMR with correlated instruments have been reported. We observed, perhaps surprisingly, very limited effect of increased instrument correlation on the variance of the causal effect estimates for pairwise correlations between 0.1 and 0.7, and even at correlation 0.9, the variance is only twice as large as at correlation 0.1 (Fig 2B and 2C).
(A) Causal diagram for the simulation of two exposures X1 and X2 for an outcome Y, with two shared instruments E1 and E2 with pairwise covariance α and matrix of instrument effect sizes A = [aij]. (B) Difference between estimated and true causal effect of X1 (true effect size c = 0.2) on Y across 1,000 independently simulated datasets for a range of sample sizes. (C,D) Distribution of the difference between estimated and true causal effects for X1 (left; true effect size c = 0.2) and X2 (right; true effect size c = 0.6) across 1,000 independently simulated datasets and a range of sample sizes comparing strong (0.1 − 0.3) vs weak (0.001 − 0.01) instruments. (E) False causal diagram used for inference where one causal exposure is missing. (F-G) Difference between estimated and true causal effect of LPA (left, true effect size cLPA = −0.05) and PLG (right, true effect size cPLG = −0.27) on Y across 1,000 independently simulated datasets with sample size 2000 across a range of effect sizes of the hidden gene SLC22A3 mediating horizontal pleiotropic effects of the instruments on Y, showing both the estimates with the correct model (green) and the estimates with the hidden exposure model (orange). See Methods for simulation details.
Principal component analysis (PCA) has been suggested as a method to prune correlated SNPs in cis-MR studies by using a reduced set of uncorrelated linear combinations of SNPs as instruments instead [36]. We used the multivariable principal component generalized method of moments (MVMR PCA, function mr_mvpcgmm from the MendelianRandomization package) [37] in our simple simulation and found that at an instrument correlation of 0.5 or more, the MVMR PCA method reported too few (that is, only one) effective degrees of freedom for causal effect estimation, despite the causal effects still being estimable.
We further tested the effect of weak instruments, that is, instruments with weak effects on the exposures (see Methods for details). As expected, weak instruments increase the variance in the causal effect estimates, more strongly at small sample sizes (n = 500 − 1,000) (Fig 2D and 2E). In general, the increase in variance due to weak instruments is not greater in the MVMR setting than in standard, univariate MR (S1B and S1D Fig).
In our next set of experiments, we used more realistic model parameters taken from real LD correlation matrices and eQTL and GWAS summary statistics from real loci in the human genome (see Methods for details). First we tested the effect of horizontal pleiotropy. Using real summary statistics from the LPA-PLG-SLC22A3 locus, we simulated data from a model where all three genes (exposures) share causal instruments and are causal for the outcome Y, but we assumed that data were available only for LPA and PLG (Fig 2F, see Methods for details). In other words, we assumed that the instruments have horizontal pleiotropic effects through the unmeasured gene SLC22A3. As expected, horizontal pleiotropy introduces bias in the estimated causal effects, although the bias remains modest (within the standard deviation of the estimates obtained from the complete model) up to a simulated effect size of around 0.15 of SLC22A3 on the outcome (Fig 2G and 2H).
Next, we tested the effect of LD matrix bias, where the LD matrices in the eQTL and GWAS population differ from the reference LD matrix (e.g. from 1000 Genomes Project). For this purpose, we compared GMM estimates from data generated from the reference LD matrix and data generated from a biased LD matrix sampled from a Wishart distribution centred around the reference LD matrix (see Methods for details), using the reference LD matrix ΣEE in both GMM estimates (cf. Eq (10)). With a modest degree of LD matrix bias (Wishart distribution with 50 degrees of freedom), the bias in the causal estimates remained small (Fig 3B and 3C).
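A biased LD matrix of this kind can be drawn with a few lines of code. The sketch below samples from a Wishart distribution whose mean is the reference LD matrix; rescaling the draw to unit diagonal, so that it is again a correlation matrix, is an assumption of this illustration and not necessarily the procedure described in Methods. The function name `perturbed_ld` is hypothetical.

```python
import numpy as np

def perturbed_ld(ld, dof, rng):
    """Draw a correlation matrix from a Wishart distribution centred on ld.

    Smaller dof gives a stronger perturbation; dof must be at least the
    number of SNPs for the draw to be positive definite.
    """
    draws = rng.multivariate_normal(np.zeros(len(ld)), ld, size=dof)
    cov = draws.T @ draws / dof            # Wishart(ld, dof) / dof has mean ld
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)            # rescale to unit diagonal
```

For example, `perturbed_ld(ld, 50, np.random.default_rng(0))` gives a matrix comparable in spirit to the 50 degrees of freedom setting, which can then be used to generate data while the reference matrix is used in the estimator.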
The first and second row show a causal diagram for the simulation of overdetermined systems where the number of causal variants was equal to the number of exposures (first row, three exposures; second row, two exposures) and other variants were only correlated with the exposures and outcome by LD. (A, B, C) Comparison between the GMM estimator using an unbiased (B) and biased (C) LD matrix for the causal effect of simulated SLC22A3, for different LD pruning thresholds (corresponding to models with 3, 4, 5, 6 and 7 instruments). (D, E, F) Comparison between the least-squares, GMM, TWMR, MV multiple (TwoSampleMR), and MVIVW (Mendelian Randomization package) estimators using independently sampled datasets of different sizes for exposures and outcomes to emulate a two-sample MVMR setting. (G, H, I) Difference between estimated and true causal effect of SLC22A3 (left, true effect size cSLC22A3 = 0.15), LPA (center, true effect size cLPA = −0.05) and PLG (right, true effect size cPLG = −0.27) on Y showing comparison between the least-squares, GMM, MVMR Robust, MVMR Median, MVMR Lasso and MVMR cML estimators using simulated datasets from causal diagram (A). All distributions plots show the difference between estimated and true causal effects across 1,000 independently sampled datasets. See Methods for simulation details.
Next, we compared the LS and GMM causal effect estimators in over-determined models (more instruments than exposures). While both estimators are guaranteed to be consistent, the GMM estimator is the estimator with theoretically lowest variance (see Methods). However, across thousands of simulations with multiple sample sizes and for models of varying degree of over-determination, we did not observe any differences in variance between the LS and GMM estimators (Fig 3B and 3C).
We also simulated a two-sample setting where eQTL and GWAS LD matrices were sampled independently from two different Wishart distributions centred around the reference LD matrix, with different variance parameters to reflect the difference in sample size between eQTL and GWAS studies, and generated eQTL and GWAS summary statistics independently, with sample sizes representative of real eQTL and GWAS studies (see Methods for details). In the two-sample setting, even the effects estimated using the true LD matrix from the GWAS population are biased due to the LD matrix sampling variation between the eQTL and GWAS studies (Fig 3E). Unsurprisingly, the bias increased further when the reference LD matrix was used instead of the GWAS study LD matrix in the estimation (Fig 3F). In this setting, all estimators were biased until an eQTL and GWAS sample size combination of 4,000/140,000 was reached, which corresponds to the upper end of currently available sample sizes (Fig 3F); TWMR remained biased at all sample sizes, due to a hard-coded regularization factor (see Methods for details). We note that the causal effect bias in two-sample settings is not specific to MVMR or the use of correlated instruments, but can already be observed in univariate MR with a single instrument (S5 Fig).
Prediction of causal genes at GWAS loci for coronary artery disease
To test our method on real-world data, we predicted causal genes at 36 genome-wide significant loci for coronary artery disease (CAD) from the CARDIoGRAMplusC4D meta-analysis study [38] using eQTL data from seven vascular and metabolic tissues from the STARNET study [26] (see Methods for details). Results are presented here for the least squares estimator; the GMM estimates are predominantly consistent and are provided in the Supporting Data.
First, to get an understanding for the range of causal effect sizes one can expect for true causal genes, we identified 17 genome-wide significant CAD loci which had an effect on expression of a single candidate gene in a single tissue, that is, where no cis-regulatory pleiotropy is present and we can plausibly hypothesize that the true causal gene is known (Fig 4A). These genes included well characterized genes such as PCSK9, a risk gene for low-density lipoprotein cholesterol levels and CAD [39], whose visceral adipose (VAF)-specific association to CAD risk SNPs in STARNET was previously confirmed in independent data [26], and GUCY1A3, whose expression correlates with risk of atherosclerosis [40]. The absolute predicted causal effects on the standardized scale for these 17 genes ranged from 0.099 (PCSK9 in VAF) to 0.34 (GSTM3 in mammary artery (MAM)). We therefore considered an absolute causal effect size greater than or equal to 0.1 to be an appropriate threshold to define causal genes.
(A) Causal effects from univariate MR of genes at 17 CAD GWAS loci with effects on a single gene in a single tissue. (B) Causal effects from tissue-specific, univariate MR of genes at seven CAD GWAS loci with cis-regulatory effects on a single gene in multiple tissues. (C) Tissue color legend for all panels. (D) Causal effects from tissue-specific MVMR with correlated instrumental variable sets of genes at 12 CAD GWAS loci with cis-regulatory effects on multiple genes in multiple tissues. (E) Scatter plot of tissue-specific causal effect estimates from MVMR vs. univariate MR. (F) Scatter plot of tissue-specific causal effect estimates from MVMR vs. CAD regulatory trait concordance (RTC) scores. (G) Scatter plot of tissue-specific causal effect estimates from MVMR in STARNET vs. GTEx.
We confirmed the validity of this threshold using seven genome-wide significant CAD loci which had an effect on expression of a single candidate gene in at least two tissues in the STARNET data, that is, loci where we can plausibly hypothesize that the true causal gene, but not the tissue, is known. We used standard, univariate MR on a tissue-by-tissue basis and found that in all seven cases, the candidate gene had a predicted absolute causal effect size greater than 0.1 in at least one tissue (Fig 4B). These genes included CDKN2B, located in one of the most robust genetic markers for type 2 diabetes, CAD, and myocardial infarction [41]. GWAS SNPs in the CDKN2B locus were associated with expression of CDKN2B in aorta (AOR), subcutaneous fat (SF), and blood, with predicted causal effects c = −0.75, −0.55, and 0.5, respectively.
We further computed p-values based on the standard errors of the estimated causal effects. As these p-values are based on one-sample estimates for the standard error and may not reflect variability in two-sample summary-data settings (see Methods for details), we conservatively set a Bonferroni threshold of p < 0.05/150 ≈ 3 × 10⁻⁴, which excludes only five gene-tissue combinations with estimated causal effect |c| ≥ 0.1.
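The two criteria can be combined in a short helper. This is a sketch under the assumption that a gene-tissue combination is called causal when it passes both the effect size threshold and the Bonferroni-corrected p-value threshold; the function names and the example standard errors are hypothetical.

```python
import math

def two_sided_p_value(estimate, standard_error):
    """Two-sided p-value from the asymptotic normal distribution of estimate / SE."""
    z = abs(estimate / standard_error)
    return math.erfc(z / math.sqrt(2))

def is_causal(estimate, standard_error, n_tests=150, alpha=0.05, min_effect=0.1):
    """Apply the effect-size threshold and the Bonferroni-corrected p-value threshold."""
    p = two_sided_p_value(estimate, standard_error)
    return abs(estimate) >= min_effect and p < alpha / n_tests
```

For instance, an estimate of 0.19 with a standard error of 0.03 passes both thresholds, whereas the same estimate with a standard error of 0.1 fails the p-value threshold.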
Having thus found a reasonable threshold to define causal genes for CAD, we analyzed 12 genome-wide significant CAD GWAS loci which had an effect on expression of multiple nearby genes in one or more tissues. Similar to Porcu et al. [10], we applied MVMR on a tissue-by-tissue basis using all genes with cis-eQTL associations in a locus in a given tissue as potential exposures in a causal diagram (cf. Fig 1B and 1E). We note that at only two of the 12 loci, it was possible to reduce the SNPs to a set of independent instruments using the LD threshold (r2 ≤ 0.1) used by Porcu et al. [10], while still retaining at least as many instruments as exposures. Hence an approach using correlated instrument sets is essential.
In total, 35 candidate causal genes with cis-eQTL associations were located in the 12 CAD GWAS loci with pleiotropic gene regulatory effects, of which 24 had a predicted absolute causal effect size greater than 0.1 in at least one tissue (Fig 4D). To analyze these causal genes more systematically, we compared the predicted causal effects from MVMR against the predicted causal effects from standard MR (Fig 4E) and regulatory trait concordance (RTC) scores [42] computed by Franzén et al. [26] (Fig 4F), two univariate methods that ignore the pleiotropic gene regulatory effects. We also conducted a replication analysis for loci with cis-associations in matching tissues in both STARNET and GTEx (Fig 4G, see Methods for details).
We observed a general trend where predicted causal genes using MVMR (predicted causal effect |c| ≥ 0.1) were also predicted to be causal (using the same threshold) using univariate MR, had high RTC values, and had concordant effect sizes between STARNET and GTEx (Fig 4E–4G). However, several genes with high univariate MR effects and high RTC values were not causal according to MVMR, suggesting that MVMR can indeed correct for likely false predictions from MR or colocalization analyses due to regulatory pleiotropy.
A clear example of a CAD GWAS locus where MVMR and MR give contrasting results is a locus centred around chr 6:12,901,441. In STARNET, the locus has cis-associations with the expression of PHACTR1, GFOD1, TBC1D7, RP1-257A7.4, and RP1-257A7.5 in MAM (Fig 5A and 5B). In univariate MR analyses these genes have predicted causal effects ranging from 0.15 for PHACTR1 to 0.31 for GFOD1, and a previous integrative genomics analysis lists all of them as candidate causal genes for this locus [43]. After pruning nearly identical SNPs (LD r2 ≥ 0.95), 12 instruments (with mutual LD ranging from r2 = 0.17 to r2 = 0.86) were available for conducting MVMR, resulting in a predicted causal effect of 0.19 for PHACTR1, with all other effects below the threshold of 0.1 in absolute value, suggesting that PHACTR1 is the only causal gene in arterial tissue at this locus.
Each row shows the predicted causal effects from tissue-specific MVMR of genes with cis-eQTL associations in a CAD GWAS locus of interest across seven vascular and metabolic tissues from the STARNET study (left) and the genetic architecture and cis-regulatory pleiotropy of the locus (right). The heatmaps show the causal effects on the liability on a standardized scale. Only tissues with non-zero cis-eQTL effects in the STARNET study are included. Values identically zero indicate gene-tissue combinations not included as exposures in the model (no cis-eQTL association in that tissue). The LocusZoom plots show for each SNP in the locus the GWAS −log10(p-value) (y-axis) and associated cis-eQTL genes (symbol, see legends) in a selected tissue. (A, B) Locus centred on chr 6:12,901,441 with LocusZoom plot for MAM. (C, D) Locus centred on chr 15:79,141,784 with LocusZoom plot for AOR. (E, F) Locus centred on chr 2:85,809,989 with LocusZoom plot for AOR.
Another example is the locus centred around chr 15:79,141,784, where functional studies support a causal and proatherogenic role for ADAMTS7 [44]. In STARNET, the locus has cis-associations with the expression of ADAMTS7 in AOR, MAM, and VAF (Fig 5C and 5D). In all three tissues, the locus also has cis-associations with the expression of CTSH. After pruning nearly identical SNPs (LD r2 ≥ 0.86), 9, 25, and 10 instruments (with mutual LD ranging from r2 = 0.03 to r2 = 0.94) were available for conducting MVMR in AOR, MAM, and VAF, respectively. In the arterial tissues, MVMR confirms that ADAMTS7 is the most likely causal gene (predicted causal effects 0.18 and 0.14 vs. 0.025 and −0.058 for CTSH in AOR and MAM, respectively; Fig 5C). Replication in GTEx AOR tissue confirmed these results (predicted causal effects 0.18 and 0.009 for ADAMTS7 and CTSH, respectively). By contrast, in VAF, both genes are predicted causal with effects in the opposite direction (predicted causal effects −0.18 and 0.25 for ADAMTS7 and CTSH, respectively), the latter due to opposite eQTL effects of the same variants in VAF vs. AOR and MAM. No eQTL associations with either gene were available in GTEx in adipose tissue to confirm these results. The overall picture is further complicated by the fact that the locus also has cis-associations with the expression of CTSH in blood, SKLM, and SF, and with PSM4 in SKLM. The absence of associations with ADAMTS7 in these tissues implies that the association between the locus and CAD is automatically attributed to the other genes (Fig 5C). Given the overwhelming functional evidence for ADAMTS7 [44], these results probably reaffirm the importance of having all true exposures included in the model (cf. Fig 2F–2H).
As a final example we consider the locus centred around chr 2:85,809,989. Previous analyses found a candidate causal SNP in this locus located in the promoter region of MAT2A, but since the SNP was associated with expression of multiple genes in this locus, a causal gene could not be predicted [45]. In STARNET, the locus has cis-associations with the expression of MAT2A in AOR and MAM. In both tissues, the locus also has cis-associations with the expression of GGCX, and both genes have previously been listed as candidate causal genes in arterial tissues [43]. After pruning nearly identical SNPs (LD r2 ≥ 0.95), three instruments (with mutual LD ranging from r2 = 0.69 to r2 = 0.89) were available for conducting MVMR. In both AOR and MAM, MVMR suggests that MAT2A, and not GGCX, is the causal gene (predicted causal effects −0.23 and −0.33 vs. 0.02 and 0.16 for GGCX in AOR and MAM, respectively; Fig 5E and 5F). The overall picture is again less clear, since the locus also has cis-associations with the expression of GGCX and VAMP8 in liver, and with GGCX alone in SF and VAF. Interestingly, in liver, neither gene is predicted causal (predicted causal effects −0.08 and −0.06; Fig 5E), while in the adipose tissues, the entire association between the locus and CAD is automatically attributed to GGCX (predicted causal effects −0.14 in both tissues; Fig 5E).
The preceding examples suggest that in fact two distinct forms of regulatory pleiotropy exist. A GWAS locus can be associated (in cis) to expression levels of multiple genes in the same locus in a tissue of interest, but can also be associated to one or more genes in multiple tissues. We attempted to conduct MVMR using all potential exposures (across genes and tissues) in a single model, in order to simultaneously predict the causal gene(s) and tissue(s) at loci of interest. However, conclusive results were often elusive, as both the determinant of the LD-matrix and the instrument strength matrix were very small, indicating that the number of causal variants in most loci is less than the often large number of exposures that must be accounted for in such a multi-tissue analysis.
An example of how multi-tissue MVMR can potentially differ from tissue-specific MVMR is given by the locus centred on chr 19:11,061,315. In STARNET, the locus has cis-associations with the expression of CARM1 in SF and SKLM, and with RGL3 and SMARCA4 in liver. In tissue-specific analyses, CARM1 was causal in SF (c = 0.19) and SKLM (c = 0.18) (univariate MR), while RGL3 (c = −0.19) and SMARCA4 (c = −0.2) were both predicted to be causal in liver (MVMR). Notably, the lead SNP rs35140030 was shared between CARM1 in SF and SKLM, and another SNP in the locus, rs113718993, was shared between CARM1 in SKLM and SMARCA4 in liver. Since these SNPs were shared between tissues, and two additional instruments were available after pruning nearly identical SNPs (LD r2 ≥ 0.95), we conducted an additional MVMR analysis where exposures were gene-tissue pairs. This analysis predicted that (CARM1, SKLM) (c = 0.18) and (RGL3, liver) (c = −0.19) were the causal gene-tissue pairs, while (CARM1, SF) (c = −0.02) and (SMARCA4, liver) (c = −0.01) were not causal.
Discussion
In this paper we studied whether multivariate Mendelian randomization (MVMR) can be used to identify causal genes at GWAS loci with pleiotropic gene regulatory effects, that is, GWAS loci associated with gene expression of more than one candidate gene in the genomic locus of interest. MVMR requires at least as many genetic instruments as the number of candidate genes included in the model, and the consensus in the field has been that these instruments must be independent. However, due to high levels of linkage disequilibrium between genetic variants located in the same genomic locus, it is usually impossible to identify such independent instrument sets.
Using the method of path coefficients we showed that the MVMR causal diagram with correlated instruments satisfies the instrumental set condition, a classical result by Brito and Pearl [24] that guarantees the identifiability of the direct causal effects of all exposures in the model on the outcome variable of interest. Moreover, the effects solve a regression of (univariate) regression coefficients, and the only requirement is that the set of instruments consists of (or tags) at least as many causal variants as the number of candidate genes included in the model. We further showed that the standard two-stage least squares estimator of the causal effects in MVMR is part of a generalized method of moments (GMM) family of finite-sample estimators for the theoretical identification equation, as is a least squares estimator that simply replaces the theoretical univariate regression coefficients by their finite sample estimates, results which again do not depend on the presence or absence of instrument correlations.
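As a minimal sketch of the two estimators mentioned above, the following simulation regresses univariate instrument-outcome coefficients on univariate instrument-exposure coefficients and compares the result with two-stage least squares. It is not the implementation used in this study: the function names, the Gaussian (rather than genotype) instruments, the equicorrelated instrument matrix, and all effect sizes are assumptions chosen for illustration.

```python
import numpy as np

def simulate(n, rho, c, seed=0):
    """Simulate 3 correlated instruments, 2 confounded exposures, 1 outcome."""
    rng = np.random.default_rng(seed)
    k = 3
    ld = np.full((k, k), rho) + (1.0 - rho) * np.eye(k)  # assumed LD matrix
    E = rng.multivariate_normal(np.zeros(k), ld, size=n)
    A = np.array([[0.6, 0.0], [0.0, 0.6], [0.3, 0.3]])   # instrument -> exposure
    U = rng.normal(size=n)                                # unmeasured confounder
    X = E @ A + U[:, None] + rng.normal(size=(n, 2))
    Y = X @ c + U + rng.normal(size=n)
    return E, X, Y

def least_squares_mvmr(E, X, Y):
    """Regress univariate instrument-outcome coefficients on
    univariate instrument-exposure coefficients."""
    E, X, Y = E - E.mean(axis=0), X - X.mean(axis=0), Y - Y.mean()
    var_e = (E ** 2).mean(axis=0)
    b_ex = (E.T @ X) / len(Y) / var_e[:, None]  # k x p coefficient matrix
    b_ey = (E.T @ Y) / len(Y) / var_e           # k coefficients
    return np.linalg.lstsq(b_ex, b_ey, rcond=None)[0]

def two_stage_least_squares(E, X, Y):
    """Stage 1: project exposures on instruments. Stage 2: regress outcome."""
    E, X, Y = E - E.mean(axis=0), X - X.mean(axis=0), Y - Y.mean()
    X_hat = E @ np.linalg.lstsq(E, X, rcond=None)[0]
    return np.linalg.lstsq(X_hat, Y, rcond=None)[0]

c_true = np.array([0.2, 0.6])
E, X, Y = simulate(n=200_000, rho=0.7, c=c_true)
c_ls = least_squares_mvmr(E, X, Y)
c_2sls = two_stage_least_squares(E, X, Y)
print("least squares:", c_ls)
print("2SLS:         ", c_2sls)
```

Under these assumptions both estimates should lie close to the simulated effects despite the instrument correlation of 0.7 and the unmeasured confounder, because only the covariances between instruments and the other variables enter the estimating equations.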
Extensive simulations confirmed the validity and usefulness of these theoretical results. Perhaps most surprising was the finding that the variance of the causal effect estimates remained small, even at high correlation values between the instruments. Moreover, we saw no benefit in using the computationally more expensive two-stage least squares estimator instead of the simpler least squares estimator: both were unbiased and had identical variance in all our simulations. As expected, horizontal pleiotropy or misspecified LD matrices in the two-sample setting introduce bias in the estimated causal effects, as in all applications of MR, but using conservative simulation parameters, we observed convergence to the correct estimates at sample sizes at the upper end of what is currently available. Of note, in the one-sample setting, that is, when exposures and outcomes are simulated from the same LD structure, misspecified LD matrices (randomly sampled from a distribution centred around the true LD matrix) did not lead to noticeable bias or increased variance.
We applied our method to predict causal genes for CAD using eQTL data from seven vascular and metabolic tissues obtained from 600 coronary artery bypass grafting surgery patients in the STARNET study. While predicting CAD risk using expression data from CAD cases can be criticized, we have shown previously that more gene expression traits are associated with CAD risk loci in STARNET cases than in comparable disease-unspecific samples (e.g., GTEx) [26], that eQTLs inferred from CAD cases explain a large proportion of heritability of CAD risk [46, 47], and that variation in eQTL-associated genes and gene networks correlates with the extent of atherosclerosis in CAD cases [47, 48], suggesting that variation in gene expression and disease severity in CAD cases can indeed reflect underlying causal effects on CAD risk. This was further supported by the current finding that causal effects estimated from STARNET and GTEx samples for the same genes and tissues correlated well.
Our analysis of the STARNET data illustrated the importance of being able to apply MVMR with correlated instrument sets. Out of 36 genome-wide significant loci with cis-eQTL associations in STARNET, only 17 were associated with a single gene in a single tissue, 7 with a single gene in multiple tissues, and 12 with multiple genes in multiple tissues. A deeper look into the predictions at some of these loci with widespread regulatory effects showed both the strengths and limitations of applying MVMR in this context.
A relatively clear case was provided by the PHACTR1 locus, where PHACTR1 was predicted as the only causal gene in arterial tissue, even though GWAS variants in this locus had cis-eQTL associations to an additional four genes in the same tissue. The lead GWAS SNP in this locus is located in an intron of PHACTR1, and PHACTR1 was one of two top candidate causal genes (together with CDKN2B) for CAD in an integrative genomics analysis that included STARNET data as well as orthogonal evidence [43]. Functional evidence also supports a causal role for PHACTR1 [49], although recent results suggest effects of this locus on a distal gene EDN1 may also play a role, and controversy about the true causal gene remains [50].
The ADAMTS7 locus provided another important test case. Functional evidence strongly supports a causal role for ADAMTS7 in CAD [44], and in the arterial tissues where the GWAS SNPs in this locus were associated with ADAMTS7 expression, our method correctly predicted ADAMTS7 as the causal gene, and not CTSH, another gene in this locus with cis-eQTL associations in the same tissues. However, in VAF, both genes were predicted causal, and when the method is applied to tissues where there are no cis-eQTL associations with ADAMTS7, other genes are predicted as causal genes. This latter result is consistent with what is seen in simulations: if true exposures are missing from the model, the effect of the missing exposure is distributed over the available exposures in a manner that ensures consistency of the univariate instrument-exposure and instrument-outcome covariances.
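The redistribution of a missing exposure's effect can be sketched at the population level, without sampling noise. The example below is hypothetical: the LD matrix, the effect sizes, and the labels "gene 1" and "gene 2" are illustrative assumptions and do not correspond to the STARNET estimates.

```python
import numpy as np

# Assumed toy locus: three instruments with pairwise correlation 0.8.
rho = 0.8
ld = np.full((3, 3), rho) + (1.0 - rho) * np.eye(3)

# Direct variant effects on two genes; only gene 1 affects the outcome.
direct_effects = np.array([[0.6, 0.0], [0.0, 0.6], [0.3, 0.3]])
c_true = np.array([0.3, 0.0])

# Population univariate coefficients for standardized instruments.
b_ex = ld @ direct_effects   # instrument-exposure coefficients
b_ey = b_ex @ c_true         # instrument-outcome coefficients

# Full model: both genes included as exposures.
c_full = np.linalg.lstsq(b_ex, b_ey, rcond=None)[0]

# Misspecified model: the truly causal gene 1 is missing, only gene 2 is used.
c_missing = np.linalg.lstsq(b_ex[:, [1]], b_ey, rcond=None)[0]

print("both genes in the model:", c_full)
print("gene 2 only:            ", c_missing)
```

With both genes in the model the simulated effects are recovered, whereas the model that omits the causal gene assigns a clearly non-zero effect to the non-causal gene, because its instrument-exposure coefficients are the only ones available to explain the instrument-outcome coefficients.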
Throughout this paper, we have defined regulatory pleiotropy as the situation where the same genetic variants are associated to multiple genes in the same locus resulting in multiple paths by which the variants can affect downstream complex traits. Thus, accounting for (measured) regulatory pleiotropy using MVMR is different from methods that account for unmeasured horizontal pleiotropy. However, the fact that regulatory pleiotropy can extend over multiple cell types and tissues and that data from some potentially relevant cell types or tissues will almost always be missing, implies that the two concepts are interrelated: unmeasured regulatory pleiotropy likely is an important source of horizontal pleiotropy.
If data from multiple tissues are available, our results imply that, strictly speaking, MVMR must be run with all gene-tissue pairs with cis-eQTL associations in a given GWAS locus included as exposures in a single, large causal model in order to limit the risk of unmeasured pleiotropy. However, even though instrument sets can be highly correlated and the STARNET data contain only seven tissues, it was almost never possible to run such large models, as the requirement on the number of causal variants included in or tagged by the instrument set still stands. To overcome this limitation, we see two ways forward.
First, we should not consider MVMR to be a hypothesis-free method that can be run across all available tissue eQTL data to discover causal genes de novo. Instead, prior knowledge of the relevant tissue and plausible candidate genes must be used to limit both the size of the model and the risk that true causal genes and gene-tissue combinations are missing from the model. For instance, integrative genomics analysis pipelines such as the ones developed by Brænne et al. [45] and Hao et al. [43] combine multiple eQTL and GWAS datasets with functional genome annotation and literature search to compile ranked lists of the most likely causal genes and tissues for an outcome of interest such as CAD. Using MVMR as an additional step in such pipelines to resolve causal effects at loci with eQTL effects on prioritized genes in prioritized tissues can provide additional causal evidence, as illustrated by the PHACTR1 and ADAMTS7 examples.
Another, more challenging solution would be to expand instrumental variable sets to the required size for models with a large number of exposures by including variants from outside the locus of interest (trans-acting variants) in the instrumental variable set. However, such variants typically have small effects and are by definition indirect, such that care must be taken to exclude horizontal pleiotropic effects from upstream regulatory factors. Most likely, progress in this direction will require whole-network causal modelling in accordance with the omnigenic model [51].
We focused our analysis on genome-wide significant GWAS loci because the use of eQTL information to identify candidate genes is an important part of modern GWAS protocols [52] and because candidate genes at many GWAS loci have been analyzed using orthogonal data, providing some form of validation for our approach. Needless to say, our approach also could be applied to loci passing less stringent thresholds. More generally, the method could be integrated in existing methods to account for linkage disequilibrium and potential pleiotropic effects in transcriptome-wide association studies using statistical fine-mapping approaches [53–55]. These methods typically use sparsity-inducing constraints to predict outcome traits from a minimal set of most informative genes. Using results from a causal model such as considered here, either directly in an outcome prediction model, or indirectly as part of the sparsity constraints, could increase the probability that the minimal predictive gene set consists of truly causal genes.
In summary, through theory, simulation, and application on real-world data, we have shown that MVMR with correlated instrumental variable sets significantly expands the scope for predicting causal genes at GWAS loci with pleiotropic regulatory effects, but important challenges remain to account completely for the extensive degree of regulatory pleiotropy across multiple tissues.
Supporting information
S1 File. Supplementary Methods.
Contains 15 pages of supplementary methods.
https://doi.org/10.1371/journal.pgen.1011473.s001
(PDF)
S1 Fig. Causal effect estimation on simulated data of one dimensional systems.
(A, C) Causal diagrams for the simulation of one exposure X for an outcome Y, influenced by one instrument E with variable instrument strengths a (A), or influenced by n ≥ 2 instruments E1, …, En with instrument strengths ai (C). (B, D) Distribution of estimated causal effects for X (true effect size 0.3), showing distributions across 1,000 independently simulated datasets across a range of sample sizes under different simulation scenarios with varying instrument strengths. (G) Distribution of estimated causal effects for X (true effect size 0.3) assuming the false diagram in (F) for inference when the true diagram that generated the data is in (E).
https://doi.org/10.1371/journal.pgen.1011473.s002
(PNG)
S2 Fig. Causal effect estimation on simulated data for underdetermined and overdetermined systems.
(A) Causal diagram for the simulation of two exposures X1 and X2 for an outcome Y, influenced by three instruments E1, E2, E3 where the number of causal variants is smaller than the number of cis-eGenes and other variants in the locus are merely associated by LD. (B, C) Distribution of estimated causal effects for X1 (B, true effect size 0.2) and X2 (C, true effect size 0.6) in the graph from (A). (D) Causal diagram for the simulation of one exposure X for an outcome Y, with n ≥ 2 shared instruments E1, …, En where the number of causal variants is greater than the number of cis-eGenes. (E) Distribution of estimated causal effects for X (true effect size 0.3) from the graph in (D). Panels B, C, and E show distributions across 1,000 independently simulated datasets across a range of sample sizes.
https://doi.org/10.1371/journal.pgen.1011473.s003
(PNG)
S3 Fig. Conditional F-statistic and causal effect estimation on simulated data.
The distribution of the conditional F-statistic and causal effect estimates are shown using two effect size ranges for weak and strong instruments: 0.001 − 0.01 (weak) and 0.1 − 0.3 (strong) in subplots (A, B, C, D), and 0.001 − 0.03 (weak) and 0.8 − 1.5 (strong) in subplots (E, F, G, H). The results are based on 2,000 simulations of over-determined systems for ADAMTS7 (true effect size 0.15) where subplots (A, E) show the distribution of the conditional F-statistic using the Mendelian Randomization package and subplots (B, F) use the MVMR package. Subplots (C, G) display the distribution of causal effect estimates for the GMM estimator, while subplots (D, H) show the distribution of the causal effect estimates for the mvivw estimator (using the Mendelian Randomization package).
https://doi.org/10.1371/journal.pgen.1011473.s004
(PNG)
S4 Fig. Standard error estimation and causal effect estimation comparison on simulated data of overdetermined systems.
(A) Distribution of estimated standard errors for SLC22A3, using the individual level data and exact form compared to using approximate form with summary level data. (B) Distribution of estimated causal effects for SLC22A3 (true effect size 0.15) for the estimator GMM, showing distributions across 2,000 independently simulated datasets across a range of sample sizes using discrete instruments with randomly generated covariances with real LD values from the SLC22A3-LPA-PLG locus.
https://doi.org/10.1371/journal.pgen.1011473.s005
(PNG)
S5 Fig. Bias in univariate two-sample MR.
Difference between estimated and true causal effect of X (true effect size c = 0.8) on Y across 20,000 independently simulated datasets for a range of sample sizes, comparing the bias of two-sample MR vs one-sample MR in a simple ratio estimate with one instrument E. (A) Cov(E, X) and Cov(E, Y) both estimated from Sample 1. (B) Cov(E, X) estimated from Sample 1 and Cov(E, Y) estimated from Sample 2. (C) Cov(E, X) estimated from Sample 2 and Cov(E, Y) estimated from Sample 1. (D) Cov(E, X) and Cov(E, Y) both estimated from Sample 2. For simplicity, the sizes of Sample 1 and Sample 2 were kept the same.
https://doi.org/10.1371/journal.pgen.1011473.s006
(PNG)
S6 Fig. Standard error verification for STARNET.
Tissue-wise causal estimates from the Least squares estimator (red) and MVIVW estimator from the Mendelian Randomization R package, with standard errors (blue).
https://doi.org/10.1371/journal.pgen.1011473.s007
(PNG)
S7 Fig. Estimated causal effects and their p-values.
Tissue-wise causal estimates from the MVIVW estimator from the Mendelian Randomization R package, with their corresponding −log10(p-value). One outlier (CDKN2B) with estimated effect size −1.0 and p-value < 10^−80 not shown.
https://doi.org/10.1371/journal.pgen.1011473.s008
(PNG)
S8 Fig. Mean explained variance of the first principal component as a function of linkage disequilibrium (LD) r (correlation coefficient).
The red dashed line represents the theoretical expectation for the explained variance, calculated as (1 + |r|)/2, which derives from the eigenvalues (1 + r and 1 − r) of the covariance matrix for two standardized variables with correlation r. The blue bars show the empirical mean explained variance of PC1 (Principal Component 1) obtained from simulations with a sample size of 2000 and 2000 repetitions.
https://doi.org/10.1371/journal.pgen.1011473.s009
(PNG)
S9 Fig. Comparison of Type 1 error rate and power rate.
We estimate here the Type 1 error rate for the estimators Least Squares (A) and GMM (C) and Power rate for the estimators Least Squares (B) and GMM (D) from 2,000 independently simulated datasets for a fixed sample size of 2000, using discrete instruments with randomly generated covariances with real LD values from the locus on Chromosome 15:79124475 shared by genes ADAMTS7 and CTSH in the MAM tissue.
https://doi.org/10.1371/journal.pgen.1011473.s010
(PNG)
References
- 1. Davey Smith G, Ebrahim S. ‘Mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease? International Journal of Epidemiology. 2003;32(1):1–22.
- 2. Evans DM, Davey Smith G. Mendelian Randomization: New Applications in the Coming Age of Hypothesis-Free Causality. Annual Review of Genomics and Human Genetics. 2015;16(1):327–350. pmid:25939054
- 3. Sanderson E, Glymour MM, Holmes MV, Kang H, Morrison J, Munafò MR, et al. Mendelian randomization. Nature Reviews Methods Primers. 2022;2(1):6. pmid:37325194
- 4. Didelez V, Sheehan N. Mendelian randomization as an instrumental variable approach to causal inference. Statistical Methods in Medical Research. 2007;16(4):309–330. pmid:17715159
- 5. Hemani G, Bowden J, Davey Smith G. Evaluating the potential role of pleiotropy in Mendelian randomization studies. Human Molecular Genetics. 2018;27(R2):R195–R208. pmid:29771313
- 6. Verbanck M, Chen CY, Neale B, Do R. Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nature Genetics. 2018;50(5):693–698. pmid:29686387
- 7. Zhu Z, Zhang F, Hu H, Bakshi A, Robinson MR, Powell JE, et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nature Genetics. 2016;48(5):481–487. pmid:27019110
- 8. Bretherick AD, Canela-Xandri O, Joshi PK, Clark DW, Rawlik K, Boutin TS, et al. Linking protein to phenotype with Mendelian Randomization detects 38 proteins with causal roles in human diseases and traits. PLOS Genetics. 2020;16(7):e1008785. pmid:32628676
- 9. Reay WR, Cairns MJ. Advancing the use of genome-wide association studies for drug repurposing. Nature Reviews Genetics. 2021;22(10):658–671. pmid:34302145
- 10. Porcu E, Sjaarda J, Lepik K, Carmeli C, Darrous L, Sulc J, et al. Causal Inference Methods to Integrate Omics and Complex Traits. Cold Spring Harbor Perspectives in Medicine. 2021;11(5):a040493. pmid:32816877
- 11. Mohammadi P, Castel SE, Brown AA, Lappalainen T. Quantifying the regulatory effect size of cis-acting genetic variation using allelic fold change. Genome Research. 2017;27(11):1872–1884. pmid:29021289
- 12. The GTEx Consortium, Aguet F, Anand S, Ardlie KG, Gabriel S, Getz GA, et al. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369(6509):1318–1330.
- 13. Fauman EB, Hyde C. An optimal variant to gene distance window derived from an empirical definition of cis and trans protein QTLs. BMC bioinformatics. 2022;23(1):1–11. pmid:35527238
- 14. Tong P, Monahan J, Prendergast JG. Shared regulatory sites are abundant in the human genome and shed light on genome evolution and disease pleiotropy. PLoS genetics. 2017;13(3):e1006673. pmid:28282383
- 15. Burgess S, Thompson SG. Multivariable Mendelian Randomization: The Use of Pleiotropic Genetic Variants to Estimate Causal Effects. American Journal of Epidemiology. 2015;181(4):251–260. pmid:25632051
- 16. Burgess S, Dudbridge F, Thompson SG. Re: “Multivariable Mendelian Randomization: The Use of Pleiotropic Genetic Variants to Estimate Causal Effects”. American Journal of Epidemiology. 2015;181(4):290–291. pmid:25660081
- 17. Sanderson E, Davey Smith G, Windmeijer F, Bowden J. An examination of multivariable Mendelian randomization in the single-sample and two-sample summary data settings. International Journal of Epidemiology. 2019;48(3):713–727. pmid:30535378
- 18. Sanderson E. Multivariable Mendelian Randomization and Mediation. Cold Spring Harbor Perspectives in Medicine. 2021;11(2):a038984. pmid:32341063
- 19. Porcu E, Rüeger S, Lepik K, eQTLGen Consortium, Agbessi M, Ahsan H, et al. Mendelian randomization integrating GWAS and eQTL data reveals genetic determinants of complex and clinical traits. Nature Communications. 2019;10(1):3300. pmid:31341166
- 20. McDaid AF, Joshi PK, Porcu E, Komljenovic A, Li H, Sorrentino V, et al. Bayesian association scan reveals loci associated with human lifespan and linked biomarkers. Nature Communications. 2017;8(1):15842. pmid:28748955
- 21. Wright S. Correlation and Causation. Journal of Agricultural Research. 1921;20:557–585.
- 22. Wright S. The method of path coefficients. The annals of mathematical statistics. 1934;5(3):161–215.
- 23. Pearl J. Causality: Models, Reasoning and Inference. 2nd ed. USA: Cambridge University Press; 2009.
- 24. Brito C, Pearl J. Generalized instrumental variables. In: Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.; 2002. p. 85–93.
- 25. Hansen LP. Large Sample Properties of Generalized Method of Moments Estimators. Econometrica. 1982;50(4):1029–1054.
- 26. Franzén O, Ermel R, Cohain A, Akers N, Di Narzo A, Talukdar H, et al. Cardiometabolic risk loci share downstream cis and trans genes across tissues and diseases. Science. 2016;. pmid:27540175
- 27. Pearl J. Linear Models: A Useful “Microscope” for Causal Analysis. Journal of Causal Inference. 2013;1(1):155–170.
- 28. Hansen LP. Large Sample Properties of Generalized Method of Moments Estimators. Econometrica. 1982;50(4):1029–1054.
- 29. Nikpay M, Goel A, Won HH, et al. A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nature Genetics. 2015;47(10):1121–1130.
- 30. Yavorska OO, Burgess S. MendelianRandomization: an R package for performing Mendelian randomization analyses using summarized data. International Journal of Epidemiology. 2017;46(6).
- 31. Sanderson E, et al. Testing and correcting for weak and pleiotropic instruments in two-sample multivariable Mendelian randomization. Statistics in Medicine. 2021.
- 32. Hemani G, Zheng J, Elsworth B, Wade K, Baird D, Haberland V, et al. The MR-Base platform supports systematic causal inference across the human phenome. eLife. 2018;7:e34408. pmid:29846171
- 33. Grant AJ, Burgess S. Pleiotropy robust methods for multivariable Mendelian randomization. Statistics in Medicine. 2021.
- 34. Lin Z, et al. Robust multivariable Mendelian randomization based on constrained maximum likelihood. American Journal of Human Genetics. 2023.
- 35. Pacini D, Windmeijer F. Robust inference for the Two-Sample 2SLS estimator. Economics Letters. 2016;146:50–54. pmid:27667880
- 36. Gkatzionis A, Burgess S, Newcombe PJ. Statistical methods for cis-Mendelian randomization with two-sample summary-level data. Genetic Epidemiology. 2023;47(1):3–25. pmid:36273411
- 37. Burgess S, et al. Mendelian randomization with fine-mapped genetic data: Choosing from large numbers of correlated instrumental variables. Genetic Epidemiology. 2017. pmid:28944551
- 38. Nikpay M, et al. A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nature Genetics. 2015;47(10):1121–1130.
- 39. Leander K, Mälarstig A, van’t Hooft FM, Hyde C, Hellénius ML, Troutt JS, et al. Circulating proprotein convertase subtilisin/kexin type 9 (PCSK9) predicts future risk of cardiovascular events independently of established risk factors. Circulation. 2016;133(13):1230–1239. pmid:26896437
- 40. Kessler T, Wobst J, Wolf B, Eckhold J, Vilne B, Hollstein R, et al. Functional characterization of the GUCY1A3 coronary artery disease risk locus. Circulation. 2017;136(5):476–489. pmid:28487391
- 41. Hannou SA, Wouters K, Paumelle R, Staels B. Functional genomics of the CDKN2A/B locus in cardiovascular and metabolic disease: what have we learned from GWASs? Trends in Endocrinology & Metabolism. 2015;26(4):176–184. pmid:25744911
- 42. Nica AC, Montgomery SB, Dimas AS, Stranger BE, Beazley C, Barroso I, et al. Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations. PLoS genetics. 2010;6(4):e1000895. pmid:20369022
- 43. Hao K, Ermel R, Sukhavasi K, Cheng H, Ma L, Li L, et al. Integrative prioritization of causal genes for coronary artery disease. Circulation: Genomic and Precision Medicine. 2022;15(1):e003365. pmid:34961328
- 44. Mizoguchi T, MacDonald BT, Bhandary B, Popp NR, Laprise D, Arduini A, et al. Coronary disease association with ADAMTS7 is due to protease activity. Circulation research. 2021;129(4):458–470. pmid:34176299
- 45. Brænne I, Civelek M, Vilne B, Di Narzo A, Johnson AD, Zhao Y, et al. Prediction of causal candidate genes in coronary artery disease loci. Arteriosclerosis, thrombosis, and vascular biology. 2015;35(10):2207–2217. pmid:26293461
- 46. Zeng L, Talukdar HA, Koplev S, Giannarelli C, Ivert T, Gan LM, et al. Contribution of gene regulatory networks to heritability of coronary artery disease. Journal of the American College of Cardiology. 2019;73(23):2946–2957. pmid:31196451
- 47. Koplev S, Seldin M, Sukhavasi K, Ermel R, Pang S, Zeng L, et al. A mechanistic framework for cardiometabolic and coronary artery diseases. Nature Cardiovascular Research. 2022;1(1):85–100. pmid:36276926
- 48. Talukdar HA, Asl HF, Jain RK, Ermel R, Ruusalepp A, Franzén O, et al. Cross-tissue regulatory gene networks in coronary artery disease. Cell Systems. 2016;2(3):196–208. pmid:27135365
- 49. Wang X, Musunuru K. Confirmation of causal rs9349379-PHACTR1 expression quantitative trait locus in human-induced pluripotent stem cell endothelial cells. Circulation: Genomic and Precision Medicine. 2018;11(10):e002327. pmid:30354304
- 50. Gupta RM. Causal Gene Confusion: The Complicated EDN1/PHACTR1 Locus for Coronary Artery Disease; 2022.
- 51. Liu X, Li YI, Pritchard JK. Trans effects on gene expression can drive omnigenic inheritance. Cell. 2019;177(4):1022–1034. pmid:31051098
- 52. Uffelmann E, Huang QQ, Munung NS, De Vries J, Okada Y, Martin AR, et al. Genome-wide association studies. Nature Reviews Methods Primers. 2021;1(1):59.
- 53. Mancuso N, Freund MK, Johnson R, Shi H, Kichaev G, Gusev A, et al. Probabilistic fine-mapping of transcriptome-wide association studies. Nature genetics. 2019;51(4):675–682. pmid:30926970
- 54. Zhao S, Crouse W, Qian S, Luo K, Stephens M, He X. Adjusting for genetic confounders in transcriptome-wide association studies improves discovery of risk genes of complex traits. Nature Genetics. 2024; p. 1–12. pmid:38279041
- 55. Liu L, Yan R, Guo P, Ji J, Gong W, Xue F, et al. Conditional transcriptome-wide association study for fine-mapping candidate causal genes. Nature Genetics. 2024; p. 1–9. pmid:38279040