Model checking via testing for direct effects in Mendelian Randomization and transcriptome-wide association studies

It is of great interest and potential to discover causal relationships between pairs of exposures and outcomes using genetic variants as instrumental variables (IVs) to deal with hidden confounding in observational studies. Two most popular approaches are Mendelian randomization (MR), which usually use independent genetic variants/SNPs across the genome, and transcriptome-wide association studies (TWAS) (or their generalizations) using cis-SNPs local to a gene (or some genome-wide and likely dependent SNPs), as IVs. In spite of their many promising applications, both approaches face a major challenge: the validity of their causal conclusions depends on three critical assumptions on valid IVs, and more generally on other modeling assumptions, which however may not hold in practice. The most likely as well as challenging situation is due to the wide-spread horizontal pleiotropy, leading to two of the three IV assumptions being violated and thus to biased statistical inference. More generally, we’d like to conduct a goodness-of-fit (GOF) test to check the model being used. Although some methods have been proposed as being robust to various degrees to the violation of some modeling assumptions, they often give different and even conflicting results due to their own modeling assumptions and possibly lower statistical efficiency, imposing difficulties to the practitioner in choosing and interpreting varying results across different methods. Hence, it would help to directly test whether any assumption is violated or not. In particular, there is a lack of such tests for TWAS. We propose a new and general GOF test, called TEDE (TEsting Direct Effects), applicable to both correlated and independent SNPs/IVs (as commonly used in TWAS and MR respectively). Through simulation studies and real data examples, we demonstrate high statistical power and advantages of our new method, while confirming the frequent violation of modeling (including valid IV) assumptions in practice and thus the importance of model checking by applying such a test in MR/TWAS analysis.


Introduction
It is of great interest in estimating and testing the causal effect of a risk factor/exposure X on an outcome Y. However, for observational data, due to the presence of unmeasured confounders, say U, it is difficult to tell whether an observed association between X and Y really indicates a causal relationship. Mendelian randomization (MR) has been applied as a popular and powerful approach to addressing this problem for causal inference, using genome-wide significant and (nearly) independent genetic variants, typically single-nucleotide polymorphisms (SNPs), as instrumental variables (IVs). Various versions of MR have been proposed, most of which are convenient to implement since they only require the use of GWAS summary data. Recently, MR has been widely applied to obtain substantial findings, and one example is [1], which found significant evidence for causal relationships between many traits of interest by utilizing the large-scale UK Biobank data [2,3]. A common workflow of MR (or TWAS) analysis is illustrated in Fig A in S1 Text. However, as expected, the validity of any MR analysis critically depends on its modeling assumptions, in particular including three key assumptions on valid IVs; an IV has to satisfy the following three conditions to be valid, as depicted in Fig 1A: confounding. It is well known that, the wide-spread horizontal pleiotropy [4][5][6], i.e. when an SNP is associated with multiple traits (e.g. X and Y here) through different pathways, will likely result in the violation of one or both of the last two IV assumptions, leading to invalid IVs and thus possibly incorrect causal conclusions. For instance, horizontal pleiotropy can occur in the form of having a pathway from the IV directly to Y, which corresponds to uncorrelated pleiotropy violating the second IV assumption. It can also happen when there is an additional pathway from the IV to U then to Y, corresponding to correlated pleiotropy [7] and violating the third IV assumption. In this paper, we use horizontal pleiotropy to refer to both cases; see Fig  1B. We also note that violation of the third IV assumption may not always be due to horizontal (correlated) pleiotropy; for example, the confounding is due to population stratification (i.e. when the effect direction is from U to IVs, instead of from IVs to U). As shown in the S1 Text, this may also imply some "direct effects" of the IV on Y, not mediated through X, for which our proposed methods appear to be applicable, though this is beyond the scope of this paper and needs further investigation.
In order to reduce the negative impact of invalid IVs, some MR methods have been developed to be either robust to or account for horizontal pleiotropy (with GWAS summary data) [6][7][8][9][10][11][12][13][14][15][16][17][18]. However, simulation studies have shown that it is unlikely that any method can completely solve the problem in all scenarios while possibly imposing its own other modeling assumptions [18,19]. For example, the constrained instrumental variable (CIV) methods [16] use a framework where each SNP's effect on the outcome has to either go through the exposure or a pleiotropic phenotype that is observed, while in reality there can be many other unknown pathways. In addition, due to their own modeling assumptions and often much lower estimation efficiency, they may give quite different results, casting doubt on which results are valid.
Hence, it is important to detect whether any modeling assumptions, including the IV assumptions, are violated before accepting MR results that may be problematic. For MR, one can apply Cochran's Q or Rucker's Q' statistic for model checking, and MR-Egger to test for directional pleiotropy using its intercept term [20]. [21] proposed a new method called global and individual tests for direct effects (GLIDE), which has been shown to have higher power than MR-Egger, but it has not been demonstrated to outperform Cochran's Q or Rucker's Q' statistic [20]. Also, GLIDE seems to require individual-level GWAS data for the outcome, which are often unavailable. Most importantly, all of these methods can only be applied to independent IVs, excluding their use with correlated IVs as in transcriptome-wide association studies (TWAS).
TWAS has been proposed recently to examine the relationship between a gene's (genetically regulated) expression and an outcome [22,23]. As in MR, if the three IV and other modeling assumptions hold, such a detected association implies a causal relationship. TWAS is a twostage least squares (2SLS) regression approach in the framework of IV regression; some correlated cis-SNPs near a gene are used as IVs to impute or predict the gene's expression level. As MR, an advantage of TWAS is that it can be conducted with GWAS summary data. Here we refer to TWAS in a general sense as an extension to MR that may take any other trait as the exposure while using some correlated/dependent (cis or whole genome-wide) SNPs as IVs. Some authors have found that TWAS can gain power over MR by more effectively using correlated SNPs, instead of independent ones, as IVs [24]. Nevertheless, cautions have to be taken when interpreting TWAS results because TWAS, as MR, may suffer from using invalid IVs [25][26][27]. The standard/default TWAS only models the relationship between an outcome trait and a gene's genetically predicted expression, which means that possible horizontal pleiotropy is not considered and thus can impact the result. To handle pleiotropy, [28] proposed a method called LD-aware (LDA) MR-Egger, which models direct/pleiotropic effects as random effects as in MR-Egger, but differs from the latter by modeling the joint effects, not marginal effects, of the SNPs/IVs. This approach can handle certain situations much better than the standard TWAS, but it may still have problems with invalid IVs, especially when the InSIDE (Instrumental Strength Independent of Direct Effect) assumption is violated; in the presence of correlated pleiotropy violating the third IV assumption, the InSIDE assumption will be violated. This means even when relatively robust methods like MR-Egger are used, it is still important to test for pleiotropic effects, or more generally, for the goodness-of-fit (GOF) of the model being used, which will influence the validity of the results. Like MR-Egger, LDA MR-Egger itself can be used to test whether there is directional (uncorrelated) pleiotropy by testing for a non-zero intercept term, but its power is quite low and it cannot handle correlated pleiotropy as to be demonstrated. Hence, to ensure valid conclusions from TWAS, model checking, including testing for the presence of (uncorrelated and/or correlated) pleiotropic effects for the violation of the second and third valid IV assumptions, is much needed in TWAS.
Several methods, including colocalization tests, have been proposed to test for pleiotropy, but they may not be able to distinguish horizontal pleiotropy from vertical pleiotropy. For instance, SMR (with the HEIDI test) of [29] can integrate summary level GWAS and eQTL data to find genetic variants with pleiotropic effects on both the GWAS trait and gene expression, but it cannot distinguish whether a variant is affecting the GWAS trait and gene expression through two different pathways (i.e., horizontal pleiotropy), or instead, it effects the GWAS trait through the gene as a mediator (i.e., vertical pleiotropy). Since only horizontal pleiotropy is problematic to the IV assumptions, it is important to test on horizontal pleiotropy directly, which is the goal here.
In consideration of the above limitations and challenges, there is a strong need for a LDA (LD-aware) method that can test for horizontal pleiotropy with higher power using possibly correlated IVs, which can be either some local or genome-wide SNPs. [30] proposed a likelihood-based method that aims to test and control for horizontal pleiotropy. Their models are more general than the LDA MR-Egger approach, but their method applies the burden test based on the possibly over-simplifying assumption that the horizontal pleiotropic effects of the IVs/SNPs are all equal, which can be violated, leading to power loss. To address the above issues, we propose a general GOF test, called TEDE (TEsting Direct Effects), for model checking in MR and TWAS; it can detect violations of modeling assumptions, including the IV  assumptions, through testing for the presence of direct effects of the IVs on the outcome, which, as depicted in Fig 1B and Fig B in S1 Text, can be due to the violation of the second or third IV assumption, to population structure, or to other reasons to be discussed later. We propose two versions, TEDE-Sc (TEDE-Score) by applying the score test, and TEDE-aSPU by an adaptive test called the aSPU test that is particularly powerful for a large number of SNPs/IVs [31]. The new tests can be seamlessly applied to both MR and TWAS. We also propose a way to control type I error rates better by taking into consideration of the variance of an estimated SNP-exposure association. Through simulation studies we show that our new methods are able to handle both uncorrelated and correlated IVs (for MR and TWAS respectively), and more importantly, have higher power than Cochran's Q-statistic (for MR) and LDA MR-Egger (for TWAS) while satisfactorily controlling type I errors. We apply the methods to a large GWAS dataset of schizophrenia (SCZ) [32] and the imputed UK Biobank data of various traits [2,3] to further demonstrate how different methods perform in the context of MR. We also apply the methods to the ADNI data [33], the IGAP stage 1 AD data [34] and the lipid data [35,36] to show the advantages of our new methods in TWAS, while confirming the commonality of the violation of modeling/IV assumptions perhaps due to the wide-spread horizontal pleiotropy in reality.

Existing model checking methods in MR
In this section, we give a brief review of some representative GOF tests in MR with GWAS summary data. We only include the most popular Cochran's Q test and MR-Egger, because Rucker's Q' test performs similarly to Cochran's Q while GLIDE requires individual-level data for the outcome [20,21], which is often unavailable in practice. As usual, we only consider linear regression models throughout this paper. Although logistic regression models are often considered for binary outcomes, their use in instrumental variable regression for causal inference is challenging and complicated; on the other hand, due to small effects of SNPs/IVs, linear models can approximate logistic models well [37] and thus have been widely used in MR and TWAS.
Suppose we have p independent SNPs. o � j is the total effect of SNP j on X, while g � j is the total effect of SNP j on Y based on marginal models (i.e. X~SNP j, Y~SNP j). Denote g � j =o � j by b � j , the true effect of X to Y by β and the direct effect of the SNP to Y by α j . For instrumental variable analysis, as shown in Fig 1A, valid IV assumptions require that there is no direct effect from the IV (SNP j) to the confounders or to the outcome. If there is an effect from SNP j to the outcome not mediated through X, we say that there is horizontal pleiotropy, and its presence leads to the violation of the last two IV assumptions and thus biased MR analysis. Fig 1B  shows the notations of different effects in the presence of horizontal pleiotropy. Here α j is the total direct effect of the SNP to Y, not mediated through X, which consists of two parts: the direct effect ν j of the SNP on Y that does not go through U, and the effect zδ j from the path going through U. Hence, we can combine the violation of valid IV assumptions (2) and (3) into one condition, α j 6 ¼0, and we only need to test whether α j = 0. Based on Fig 1B, Fig 1B. No horizontal pleiotropy means that each α j is 0 and each g � j =o � j is equal to β, and thus we can test H 0 : which only needs the marginal summary statistics: It is noted that the problem we consider is more general than horizontal pleiotropy (that is the focus here): the presence of the direct effects α j 6 ¼0 can arise due to the violation of other modeling assumptions. For example, as shown in Fig B in S1 Text, population stratification can also cause the violation of the IV assumptions manifested as the presence of some direct effects α j .
Cochran's Q test. As a general GOF test, Cochran's Q test uses the test statistic If the null hypothesis is true (i.e. if the model fits the data well, including that there is no pleiotropy), Q should follow w 2 pÀ 1 [20]. MR-Egger. Another way to test horizontal pleiotropy is to apply the MR-Egger method [8] and examine the intercept term. The model iŝ where ε~N(0,S) and l = (1. . .1) 0 , α Egger and β Egger are two parameters. Here S is usually assumed to be a diagonal matrix with diagonal elements being seðĝ � j Þ 2 0 s. To test (directional) pleiotropy, we simply test whether α Egger = 0. Note that α Egger models the average direct effect of the SNPs to Y, and thus testing α Egger = 0 is actually testing whether there is directional pleiotropy (i.e. whether the mean of the direct effects is nonzero). Also note that for MR-Egger, the coding of some of the SNPs may be flipped before model-fitting to ensure thatô � j s are all positive.

Model checking in TWAS
In this section, we introduce some LDA methods for model checking, including testing for horizontal pleiotropy, applicable to both TWAS and MR. TWAS examines the effect of a gene's expression (X) on the outcome (Y) using the gene's cis-SNPs as IVs in two stages. In the first stage, X is regressed on and thus predicted by the SNPs. In the second stage, Y is regressed on the predicted X with the regression coefficient as the key parameter of interest, measuring the causal effect of X on Y. This procedure works well when its modeling assumptions hold, including that the valid IV assumptions are not violated. However, if there is horizontal pleiotropy or population structure (as depicted in Fig 1B and Fig B in S1 Text), some of the SNPs may have direct effects on Y that are not mediated through X, under which the above standard TWAS (i.e. only regressing Y on the predicted X) may lead to biased inference on the causal relationship between X and Y. Hence, it is equally important to check the TWAS model via testing for direct effects of the IVs. Note that, even though both TWAS and MR appear similar as two-stage least squares IV regression in inferring the causal effect of X on Y using some SNPs as IVs, typical MR methods use independent SNPs as IVs while TWAS uses a gene's local/cis SNPs that are usually correlated (i.e. in linkage disequilibrium, LD), explaining why LDA methods are needed for the latter. Here we use TWAS to also represent MR extensions with correlated SNPs (either locally or across a whole genome) as IVs and with any trait as the exposure. In TWAS we model the joint effects ω j (based on the joint model X~multiple SNPs), instead of the marginal effects o � j (on the marginal models X~one SNP). This is crucial because ω j is usually quite different from o � j when the SNPs are correlated, and ω j captures the effect of each SNP on X after adjusting for other SNPs.
Using notations similar to the previous section, suppose we have p SNPs, which can be correlated in our new settings, and their direct effects on Y are denoted by α j 's. ω j is the effect of SNP j on X, γ j is that of SNP j on Y based on the joint models (i.e. X~SNPs, Y~SNPs). We can test H 0 : In practice, we are often provided with only GWAS marginal summary statistics: Using these with a reference panel, we can get estimatesô ¼ ðô j Þ andĝ ¼ ðĝ j Þ, their standard errors seðô j Þ and seðĝ j Þ, as well as CovðĝÞ, CorðĝÞ and CovðôÞ [38,39].
LDA MR-Egger. [28] proposed a method called LDA MR-Egger that applies the idea of Egger regression to TWAS while accounting for the LD structure of the SNPs. The model iŝ where ε~N(0,S) and l = (1. . .1) 0 , α TE and β TE are two parameters. Usually S is simply estimated as CovðĝÞ. For our current problem, since we are interested in testing H 0 : γ 1 /ω 1 = . . . = γ p /ω p , we can also test α TE = 0. We take a transformation such as the error term follows a standard normal distribution N(0,I): where S À 1 2 εeN 0; I ð Þ. As a result, we can estimate α and β bŷ This is the same as the LDA MR-Egger estimate in [28]. The covariance matrix is Thus we can get seðâ TE Þ and seðb TE Þ. We useâ TE =seðâ TE Þ to test (directional) horizontal pleiotropy andb TE =seðb TE Þ to test the effect of X on Y. Note that to be consistent with MR-Egger, we flip the coding of some SNPs to make sure thatô j 0 s are positive before estimatingâ TE andb TE . New method: Its framework. Our new method was motivated as a general GOF test for the original/standard TWAS. Suppose X i and Y i are the exposure and outcome respectively, and G i,j is the jth SNP/IV for subject i; all have been centered at 0. In Stage 1 of TWAS, we fit a linear model yielding estimatesô j 0 s and imputed/predicted gene expression/exposureX i ¼ P p j¼1ôj G i;j . Then in Stage 2 we fit and thus estimating the causal effect β of X on Y, the parameter of interest in TWAS. Note that we do not include an intercept term in either model since the variables have already been centered at 0. This procedure only looks at the part of X and Y that is explained by the SNPs, which is able to account for hidden confounding when the valid IV assumptions are met, and thus gives a good estimate of β in the presence of hidden confounding (i.e. U is present but not explicitly included in the models). However, as depicted in Fig 1B and Fig B in S1 Text, due to the violation of the second or third IV assumption, or to population structure, there will be non-zero direct effects α j 's on Y, leading to a mis-specified model being used in TWAS Stage 2. Accordingly, we propose a general GOF test for TWAS by testing whether the direct effects α j 's are all 0 in an expanded model Then we test H 0 : α 1 = . . . = α p = 0. If H 0 is rejected, then the TWAS Stage 2 model is incorrect (in assuming no direct effects), possibly due to the violation of some modeling assumptions, including the presence of some invalid IVs (e.g. due to horizontal pleiotropy) or of population structure. Furthermore, other violations of modeling assumptions can lead to rejecting H 0 , e.g. due to an incorrect model or bad estimation in Stage 1: ifô j and ω j are quite different, even if other TWAS modeling/IV assumptions hold, it may still lead to a nonzero "direct effect" α j , and thus the rejection of H 0 .
It is noted that Model (1) is over-specified for parameter estimation, but testing on H 0 is possible, including parameter estimation under H 0 . Our proposed testing framework is related to the Sargan test for over-identifying restrictions in IV regression [40,41], and can be regarded as an extension to (2-sample) TWAS with GWAS summary data.
New method: TEDE. To test H 0 : α 1 = . . . = α p = 0, we can use the score test or another test. We assume that ε i 's follow an i.i.d normal with mean 0 and variance s 2 Y . Note that this model is different from the true model since it uses b P p j¼1ôj G i;j ; rather than b P p j¼1 o j G i;j , which may have potential problems ifô j 0 s are inaccurate and β is nonzero. We will demonstrate this further in our simulation studies. Denote the parameter vector by θ = (α 1 ,. . .,α p ,β) 0 and the score vector by U(θ) = (U 1 (θ),. . .,U p+1 (θ)). The log-likelihood is We also have To apply the score test, we need to estimateθ 0 , which is the MLE of θ under H 0 : To estimate the covariance matrix of Uðθ 0 Þ, we need to calculate Hence, by denoting G = (G 1 ,. . .,G p ), G j = (G 1,j ,. . .,G n,j ) 0 , we can obtain where C is a diagonal matrix with the same diagonal elements as Iðθ 0 Þ 0 s.Z can be regarded as U-scores standardized by their standard errors.
To test whether all of the scores inZ are 0, which tells whether H 0 is true, we can apply the SPU tests and aSPU test [31]. First, we denoteZ ¼ ðz ð1Þ ; . . . ; z ðpþ1Þ Þ 0 and define where γ is usually chosen from {1, 2,. . ., 8, 1}. We sampleZ b (b = 1,2,. . .,B) from the null distribution MVNð0;ΣÞ, and the p-value for the SPU test is The general idea is to simply look at whether the sum of powered scores is too extreme, since under H 0 , the scores should have mean 0 and their powered sum should not be too large.
If we look at a set of different γ's, denoted by Γ = {γ 1 , γ 2 � � �γ r }, each of them yields a different pvalue P SPUðg t ;ZÞ . To combine these results, we define the aSPU test statistic as aSPUðZÞ ¼ min t¼1;...;r ðP SPUðg t ;ZÞ Þ. For each power index γ t , we calculate The p-value of the aSPU test is calculated as P aSPUðZÞ ¼ P B b¼1 IðaSPUðZ b Þ < aSPUðZÞÞ=B. We call this approach TEDE-aSPU, which applies the aSPU test to our problem of testing invalid IVs, The SPU(1) and SPU(2) tests, corresponding to the burden test and a variance component/ kernel test respectively, have been shown to have higher power when the signals are dense (i.e. more α j 's are nonzero), whereas SPU(8) and SPU(1) usually works better when the signals are sparse (i.e. few α j 's are nonzero) [31]. The aSPU test is able to combine their strengths and thus perform well in various scenarios. Since TEDE-Sc and SPU(2) both look at the second order of the scores (one withZ 0ΣÀ 1Z , the other with P j z ðjÞ 2 ), we expect TEDE-Sc to also perform better when the invalid IVs are relatively dense. When the invalid IVs are sparse or highdimensional, we expect that TEDE-aSPU to be more powerful.
Our method does not require the InSIDE (Instrument Strength Independent of Direct Effect) assumption to hold; we will demonstrate through simulations that our method performed well even when InSIDE was violated. It is also worth noting that our previous model and results are based on the assumption thatô j 0 s are fixed values. In reality, what follows i.i.d. normal with mean 0 and s 2 Y under the null should be As a result, by ignoring the estimation error and variability ofô j 0 s, we may inaccurately estimate β and α j 's, which may lead to inflated type I errors (if, as default in this paper, one does not really care about the estimation errors of ω j 's but only the functional form of the specified model; otherwise it would be power, instead of type I error). This may happen for TWAS since usually the sample size in the first stage is not large enough to ensure the accuracy ofô j 0 s. To mitigate this issue, we can incorporate the variance ofô j 0 s by replacing G 0 G=s 2 Y with G 0 G=s 2 Y þ b 2 G 0 GCovðωÞG 0 G=s 4 Y when calculating CovðUðθ 0 ÞjH 0 Þ, since the first p elements of Uðθ 0 Þ under H 0 can be written as ðG 0 Y ÀbG 0 GωÞ=s 2 Y . We do not need to worry about the other elements in S because U pþ1 ðθ 0 Þ ¼ 0 and those elements will not have any effect on TEDE-Sc or TEDE-aSPU. We call the approach with the modified covariance estimate TEDE-aSPU2, which is expected to better control type I errors. We can also use the modified covariance estimate in TEDE-Sc, and we call this approach TEDE-Sc2. Also note that when the significance threshold is too small (e.g. <5e-8), using the original version of aSPU with summary statistics may take too much time. Alternatively, we can apply the aSPU test based on either its asymptotics [42] or on importance sampling [43], which has been shown to perform well when p is large and small respectively.
A connection between TEDE-aSPU and TWAS-aSPU. A more powerful association test in TWAS has been proposed in [44]. Their model is where the link function g() is the identity function for quantitative trait Y i . For the linear case with variables already centered at 0, the model becomes Y i ¼ P p j¼1 φ j G i;j þ ε i . To assess possible association between the trait and the SNPs, one applies the aSPU test to the null hypothesis H 0 : φ 1 = . . . = φ p = 0. Comparing this model to our working model (1), we can decompose each φ j ¼ a j þ bô j . It is clear why this test was shown to be more powerful than TWAS: it tests not only on the causal effect β of X on Y (as does TWAS), but also on direct effects of the SNPs. In other words, the association test consists of two components: one is for causal effect of X on Y as in TWAS, and another on the mis-specified TWAS model (e.g. due to invalid IVs) as aimed by TEDE proposed here.
Applying LDA methods to MR. For MR analysis with independent SNPs, we can directly apply the LDA methods, including TEDE, for model checking or GOF testing because eventually they are all testing H 0 : α 1 = . . . = α p = 0. As long as we know the MAF of each SNP (either from a GWAS summary dataset or a reference panel), we can calculate the variance for each SNP, leading to a diagonal LD covariance matrix. If MAF information is already provided in the GWAS summary data, we do not even need to use a reference panel.

Simulations
Independent SNPs: Testing horizontal pleiotropy in MR. We generate genotype data of independent SNPs G = (G ij ) n×p using a multivariate binomial distribution, assuming Cov(G ij , G ik ) = 0 (j6 ¼k). We also assume each SNP has MAF f = 0.3 and simulate two traits X and Y using models similar to those in [19]: Here � i , e i and ε i each follow an i.i.d standard normal distribution, U i is a confounder. δ j , ω j , υ j are the direct effects of SNP j on U, X and Y respectively. β, the causal effect of X on Y, is determined so that the proportion of variability in Y explained by X is about h 2 X!Y . We generate ω j 's from a normal distribution with mean zero and standard deviation 0.15 first, and subsequently select those with ω j >0.08 to avoid weak IVs, ensuring that the first valid IV assumption holds (i.e. an IV is associated with X). Then we shrink ω j by a constant so that the proportion of gene variance explained by SNPs is about 20%. We randomly choose some of the SNPs to be invalid IVs with horizontal pleiotropy, and we denote the proportion as % invalid. We set υ j 's as zero for valid IVs and as nonzero for invalid IVs. We consider the following different scenarios with a proportion (e.g. 0, 10%, 30%, 50%) of the IVs being invalid: (S1) For invalid IVs, υ j~N (0, 0.075). δ j = 0 for every IV. This means balanced pleiotropy. Here α j = υ j .
(S2) For invalid IVs, υ j~N (0.1�sign(ω j ), 0.025). δ j = 0 for every IV. This suggests directional pleiotropy. The direction of the direct effects is the same as that of G to X. Here α j = υ j .
Note that (S1)(S2)(S3) actually become the same scenario when the proportion of invalid IVs is 0. In order to save space in the tables, we do not specify a different scenario for zero invalid IV.
When there are invalid IVs, we also shrink υ j 's so that the proportion of Y's variance explained by P p j¼1 u j G j is about 0.3%. Note that choosing a higher proportion (e.g. 1%) will make the power of most tests much higher (e.g. very close to 1). Since we are looking at the two-sample setting, we generate one dataset with n 1 subjects and another dataset with n 2 subjects to obtain summary statistics for X and Y respectively. Then we apply different methods to the summary statistics and calculate their rejection rates based on 1000 simulations.
As shown in Table 1, when p = 30, all methods are able to control type I error rates with the default nominal significance level 0.05. All methods except MR-Egger have similar performance in both scenarios 1 and 2. MR-Egger has limited power even in the presence of directional pleiotropy, though its power increases as the proportion of invalid IVs goes up. TEDE-Sc's power is higher than Cochran's Q's in all scenarios. TEDE-aSPU has higher power than TEDE-Sc when the proportion of invalid IVs is small, showing its advantage when dealing with sparse invalid IVs. In scenario 3, where the InSIDE assumption is violated in addition to directional pleiotropy, all methods have higher power, and the power goes up significantly as the proportion of invalid IVs increases. This power increase is different from what we have seen for scenarios 1 and 2 because in the first two scenarios, each invalid IV's direct effect on Y (that does not go through X) is α j = υ j , but in the third scenario, it is α j = υ j +δ j . When the proportion of invalid IVs goes up, υ j 's tend to be smaller since we control the proportion of Y's variance explained by P p j¼1 u j G j , but δ j 's are not scaled, which means the total direct effect is much stronger with more invalid IVs in scenario 3, but not in scenarios 1 and 2. Recall that, when there is no invalid IV, scenarios 1-3 all become the same (no invalid IVs; InSIDE not violated). This is why the type I error rates do not depend on different scenarios. MR-Egger depends on the InSIDE assumption and is usually expected to have problems with scenario 3, but that cannot be reflected in the type I errors here, since under the null hypothesis with no invalid IVs, InSIDE is always satisfied.
We further investigate the performance of each method with p = 100. As Table 2 shows, the power patterns in scenario 2 are similar to what we have in Table 1. We also have similar conclusions for scenario 3, the results of which are included Table A in S1 Text. TEDE-Sc seems to work better than Cochran's Q; TEDE-aSPU is more powerful than TEDE-Sc when invalid IVs are sparse or the number of IVs is large, and MR-Egger always is low powered. Nevertheless, when β>0, Cochran's Q, MR-Egger, TEDE-Sc and TEDE-aSPU have slightly inflated type I errors, and the inflation increases as β increases. This can be explained by looking at model If the sample size for estimatingô j 0 s is not large enough and β is nonzero, our estimate of α j can be off, which may lead to inflated type I errors. We need a sufficient sample size to ensure o j Àô j is small enough, especially when p is large. Cochran's Q and MR-Egger may have similar issues even though their models are different. For instance, the direct effect in MR-Egger under the null is actually b Egger ðo j Àô � j Þ, which means ifô � j s are inaccurate and β Egger is nonzero, the average of b Egger ðo j Àô � j Þ 0 s may sometimes be nonzero and lead to inflated type I errors. As shown in Table 2, once we increase n 1 to 50000, all methods are able to control type I errors. While usually the GWAS summary  Table 2. Correlated SNPs: Testing horizontal pleiotropy in TWAS. Now we generate genotype data of correlated SNPs G = (G ij ) n×p . Following [28], we assume that the LD structure is AR(ρ) with Cov(G ij , G ik ) = ρ |j−k| . We also assume each SNP has MAF f = 0.3. The rest is the same as what we did in the previous subsection for MR. Since we are looking at correlated variants, we only apply the LDA methods and examine their rejection rates based on 1000 simulations. Since n 1 is usually relatively small in TWAS, we use n 1 = 2000, n 2 = 4000. Here we do not consider Cochran's Q since it requires independent variants, and we replace MR-Egger with LDA MR-Egger. Furthermore, we include the recently developed PMR-Egger approach [30] with its default setting, which can also use summary statistics of correlated SNPs to test horizontal pleiotropy.
As shown in Tables 3 and 4, when p = 30, most methods are able to control type I errors, while TEDE-Sc and TEDE-aSPU have much higher power than LDA MR-Egger in all scenarios. As expected, TEDE-Sc2 and TEDE-aSPU2 tend to be slightly more conservative than TEDE-Sc and TEDE-aSPU respectively. PMR-Egger has better performance than LDA MR-Egger when used to test horizontal pleiotropy in most cases, though its power is usually lower than that of TEDE. We also have similar findings for p = 100, for which more details are provided in Tables B and C in S1 Text. Besides, we have observed some other interesting phenomena. For example, when the correlation between adjacent SNPs is relatively high, most methods seem to be more conservative in terms of smaller type I errors, and TEDE-aSPU has higher power than TEDE-Sc regardless of the proportion of invalid IVs, supporting its higher power for high-dimensional data. Further discussion on these is included in the S1 Text as well.

Real data applications
Testing direct effects in MR and TWAS for SCZ and other complex traits. [1] found some strong evidence for causal relationships between genetic liability to SCZ (schizophrenia) and many complex traits by constructing polygenic risk scores (PRS) for association analyses. Further investigations were done with a two-sample MR analysis to back up the conclusions. We apply different methods to test for direct effects in a similar context to see whether there are noticeable invalid IVs that may cast doubts on the conclusions of the MR analysis of SCZ and the complex traits. For SCZ, we use a GWAS summary dataset based on 150K subjects from [32]. As for other complex traits, we choose eight of the traits included in [1]'s MR analysis. For these traits (listed in Table 5), we use the GWAS results based on the imputed UK Biobank data [2,3] with up to 362K subjects. This is a two-sample problem since the subjects do not overlap. We check out the SNPs whose minor allele frequencies are greater than 0.1 and whose p-values for marginal associations with SCZ are smaller than 5e-8. Then we select those that are also present in the 1000 Genomes Project Data (phase 3; 503 subjects with European ancestry) from [45]. We prune the SNPs based on the LD information estimated from the 1000 Genomes data to get independent SNPs (r 2 <0.001), resulting in 39 IVs selected for MR analysis. We apply both the non-LDA tests and the LDA tests to test for direct effects with Y being each of the complex traits of interest.
As shown in Table 5, with 39 independent IVs, most of the tests have highly significant results for most of the analyzed outcomes. MR-Egger does not give any significant p-values under level 5e-3, which is consistent with its low power shown in our simulation studies, especially when the pleiotropy is not directional. TEDE-Sc's p-values are usually smaller than those of Cochran's Q, which is also consistent with our previous finding that TEDE-Sc tends to have higher power than Cochran's Q. TEDE-Sc2 and TEDE-aSPU2 are very close to TEDE-Sc and TEDE-aSPU, probably becauseô j 0 s have very small standard deviations given the large sample size for SCZ. For this problem, TWAS with GWAS summary statistics can also be applied to examine the relationship between SCZ and other traits, which may be more powerful by including more and correlated SNPs as IVs. As in MR, we need to test for direct effects as a way to check whether the TWAS model is appropriate. We use the LDA methods to test direct effects for each exposure-outcome pair using correlated SNPs across different chromosomes. This time we select significant SNPs based on r 2 <0.025 instead of r 2 <0.001, resulting in 140 SNPs. As shown in Table 5, TEDE-Sc and TEDE-aSPU have highly significant results, while LDA MR-Egger does not. For self-reported depression, the p-value is much more significant than before after including correlated SNPs, which may confirm the potential downside of adding more SNPs as IVs. Including more correlated IVs may yield higher power, but at the same time it will increase the chance of having invalid IVs. Testing direct effects in TWAS of AD. TWAS can be very useful in identifying genes whose expression contributes to complex traits like Alzheimer's disease (AD). By making use of correlated SNPs, instead of only independent SNPs, TWAS (as a more general MR approach) can be more powerful than MR as shown in certain scenarios [24]. However, similar to what MR faces, TWAS may gave incorrect results due to the violation of the valid IV assumptions, including the assumption of no horizontal pleiotropy. We use the ADNI data [33], the IGAP stage 1 data [34] and the reference gene expression weights from [23], which we call the weight data, to test whether direct effects of IVs exist in TWAS analysis of each gene and AD.
The weight data contains 6007 genes' expression in the whole blood. For each gene's expression, the weight data provides the SNPs selected by the elastic net regression and their joint effect sizes on the gene's expression, which we use as ourô j 0 s. These were pre-computed based on 369 samples. The IGAP GWAS summary statistics data (with a sample size of 54162) are used along with the ADNI individual-level data (sample size 712) to calculateĝ j 0 s and their covariance matrix. We match the SNPs across different datasets and prune them to make sure none of the pairwise correlations is larger than 0.9 (in absolute values). For convenience and to be consistent with the previous sections, we define each locus as the SNPs selected for each gene expression. We exclude those loci with less than 5 SNPs, resulting in 3611 loci and 49225 SNPs remaining. Next, we apply the various tests using GWAS summary statistics to test for direct effects for each locus. Since the number of loci is large and the significance threshold is very small, we choose to use the asymptotics-based TEDE-aSPU to save computation time. TEDE-Sc2, TEDE-aSPU2 and PMR-Egger cannot be applied since the variance ofô j is not provided in the weight data. As Fig 2A shows, many loci have been detected to have direct effects, suggesting that many SNPs may affect AD through pathways other than the corresponding gene. TEDE-Sc and TEDE-aSPU have found many more significant loci than LDA MR-Egger. TEDE-Sc is able to detect more loci with horizontal pleiotropy than TEDE-aSPU, probably suggesting that the proportion of invalid IVs is usually relatively high. Meanwhile, some loci only appear to be significant according to TEDE-aSPU, showing the complementary role of the two versions of the TEDE test.
Compared to the total number of loci, the proportion of loci with detected direct effects may seem small. However, it makes sense because most of the SNPs and genes cannot be detected to be associated with AD. In such a case, we do expect that direct effects of SNP to AD are not detectable or even do not exist for most loci. However, we still need to be careful when we have significant TWAS results. After applying TWAS and LDA MR-Egger to test the association between each gene and AD, we list the significant loci in Table 6 along with the pvalues of testing direct effects. TEDE-Sc has detected direct effects in three of the seven loci, covering the two found by TEDE-aSPU and the two found by LDA MR-Egger. Also, if we examine the 21 loci with at least one SNP's marginal p-value smaller than 5e-6 for its association with AD, 9.5%, 43% and 48% of them are found to have direct effects by LDA MR-Egger, TEDE-Sc and TEDE-aSPU respectively under the same significance threshold 1.38e-5. These results demonstrate the power advantage of TEDE-Sc and TEDE-aSPU over LDA MR-Egger, as well as the need to test for direct effects for model checking in TWAS, especially when we obtain significant associations.

Testing direct effects in TWAS of SCZ and LDL
To further examine the existence of horizontal pleiotropy in other scenarios, we apply TWAS for SCZ with the SCZ data used in the previous section and for LDL (low-density lipoprotein cholesterol) with the 2010 and 2013 lipid data [35,36] separately. As shown in Fig 2B-2D, after similar analyses, we are able to detect many significant loci with direct effects, especially for LDL: about 20% of the loci are significant. This is consistent with our previous explanation since more SNPs are associated with LDL than SCZ and AD. We also find that using the 2013 lipid data gives more significant results than using the 2010 lipid data, probably because of the sample size difference (about 189K vs. 100K). These results further demonstrate the possibility of widespread horizontal pleiotropy and the need for model checking in TWAS.

Discussion
We have presented a novel method with two versions (TEDE-Sc and TEDE-aSPU) that can be applied to test for direct effects as a general GOF test for model checking in MR and TWAS. For MR with only independent IVs across different loci, our simulations show that TEDE-Sc is more powerful than the widely used Cochran's Q statistic in most cases. TEDE-aSPU performs better than TEDE-Sc when the proportion of invalid IVs is small and/or the number of the IVs being used is large. MR-Egger has quite limited power when compared to other methods even in the presence of strong directional pleiotropy. We have noticed that when the number of IVs is large (e.g.~100) and the sample size for the exposure is not large enough, the tests with higher power may have slightly inflated type I errors (for detecting horizontal pleiotropy). Our alternative versions of the new method, TEDE-Sc2 and TEDE-aSPU2, are able to control type I errors better by taking into account the variability of estimating the effects of the SNPs/IVs on the exposure; however, these two versions require the summary level data to contain the standard errors of the estimated effects of SNPs on the exposure X, (i.e. seðô � j Þ 0 s or seðô j Þ 0 s). After applying different methods to test for direct effects in an MR analysis of SCZ and some complex traits, almost all of the results from Cochran's Q, TEDE-Sc and TEDE-aSPU turned out to be significant, indicating that the conclusions from the MR analysis may be problematic given the strong evidence of wide-spread direct effects. Meanwhile, MR-Egger did not reject the null hypothesis, presumably due to its low power or the possibility that the pleiotropy is not directional or uncorrelated (i.e. when the InSIDE assumption does not hold), confirming the potential issue of using MR-Egger (or its modification like LDA MR-Egger) for model checking in MR (or TWAS) in spite of its wide use in practice [20]. For TWAS, which usually includes correlated SNPs associated with a gene's expression level, TEDE-Sc and TEDE-aSPU are able to make use of the LD information and control type I errors like LDA MR-Egger. Nevertheless, similar to MR-Egger, LDA MR-Egger is fairly low powered when used to test for horizontal pleiotropy in our simulations. Again, TEDE-Sc seems to work better when the proportion of invalid IVs is relatively high, while TEDE-aSPU can handle the sparse or high-dimensional situation better. When different LDA methods were applied to test for horizontal pleiotropy in a TWAS analysis of AD, TEDE-Sc identified many significant loci, while TEDE-aSPU found much fewer, which might suggest that in many of these loci there were a relatively high proportion of SNPs with horizontal pleiotropy. On the other hand, TEDE-aSPU managed to find some significant loci that were not detected by TEDE-Sc, showing their complementary roles. In another real data application, our new method found around 20% of the loci with horizontal pleiotropy in TWAS of LDL using a large-scale GWAS lipid dataset, demonstrating substantial power advantages of our new tests over LDA MR-Egger and more importantly, highlighting the need to test for horizontal pleiotropy as a way to check the TWAS modeling assumptions (or apply other robust TWAS methods) in practice.
In practice, if it is reasonable to assume sparse direct effects, especially with a relatively large number of the SNPs/IVs to be tested, we'd recommend TEDE-aSPU2 for its expected higher power; otherwise, we recommend TEDE-Sc2 for its generally better performance as shown in our data examples (Fig 2) often with relatively small numbers of the SNPs/IVs being used. Alternatively, one may apply both TEDE-Sc2 and TEDE-aSPU2 to detect direct effects, given their different power advantages in different scenarios while controlling type I errors well. However, in certain situations where the summary level data of the exposure only have joint SNP-to-exposure effect size estimates (ô j 0 s) but without their standard errors, the above two cannot be applied and, instead, TEDE-Sc and TEDE-aSPU can be applied. We would also like to point out that our methods run relatively fast with a reasonable number of IVs: Table D in S1 Text contains more information on their computing time.
In the future, we can extend our new method to other MR or TWAS applications, such as MV-TWAS [46], which is a more robust version of TWAS (or MR) by including multiple genes (or other traits) as multiple exposures in the same model. In this scenario, we can test whether there are direct effects of the SNPs on the outcome through pathways other than through any of the multiple exposures included in the model; lack of evidence in such a test would lend support for the goodness-of-fit of the MV-TWAS model, and thus support for its conclusions. Besides, as suggested by a reviewer, identifying then removing invalid IVs in an MR or TWAS analysis may lead to better results [18]. Our current method aims for global testing (on whether there is any direct effect by any IV), rather than identifying which IVs are invalid. The latter task may appear straightforward to implement based on our framework, but it will be more challenging because of the difficulty in accurately estimating each direct effect in the over-specified model (1) (in which the parameters are non-identifiable if all SNPs/IVs used in imputing X are included for their possible direct effects on Y). On the other hand, because our proposed method is based on the simplified and identifiable model under the null hypothesis, it is possible to conduct the proposed GOF testing. Nevertheless, it is possible to develop a sequential testing procedure to filter out invalid IVs one at a time as shown in a different approach [18], or by other penalized regression and variable selection methods [12,41], under additional relatively mild assumptions on the distribution of invalid IVs such as their sparsity; this is worth further investigation.
Supporting information S1 Text. Additional Explanations of Tables 3 and 4. Additional Explanation to that Population Structure Can Lead to Some Direct Effects of SNPs/IVs. Fig A. Flowchart for common MR/TWAS analysis with summary level data. Fig B. Two different scenarios of having direct effects or "correlated pleiotropy": (A) G affects U. (B) U affects G. Table A. Rejection rates (type I error when there is 0 invalid IV; power otherwise) for testing horizontal pleiotropy. Independent variants. 1000 iterations. p = 100, n 1 = 10000, n 2 = 10000. Table B. Rejection rates (type I error when there is 0 invalid IV; power otherwise) for testing valid IV assumptions. 1000 iterations. p = 100. Low LD (ρ = 0.3). Table C. Rejection rates (type I error when there is 0 invalid IV; power otherwise) for testing valid IV assumptions. 1000 iterations. p = 100. High LD (ρ = 0.7). Table D. Computation time of TEDE (seconds) averaged over 5 runs. TEDE-Sc gives the results for both TEDE-Sc and TEDE-Sc2, and TEDE-aSPU gives the results for both TEDE-aSPU and TEDE-aSPU2.