Robust methods in Mendelian randomization via penalization of heterogeneous causal estimates

Methods have been developed for Mendelian randomization that can obtain consistent causal estimates under weaker assumptions than the standard instrumental variable assumptions. The median-based estimator and MR-Egger are examples of such methods. However, these methods can be sensitive to genetic variants with heterogeneous causal estimates. Such heterogeneity may arise from over-dispersion in the causal estimates, or specific variants with outlying causal estimates. In this paper, we develop three extensions to robust methods for Mendelian randomization with summarized data: 1) robust regression (MM-estimation); 2) penalized weights; and 3) Lasso penalization. Methods using these approaches are considered in two applied examples: one where there is evidence of over-dispersion in the causal estimates (the causal effect of body mass index on schizophrenia risk), and the other containing outliers (the causal effect of low-density lipoprotein cholesterol on Alzheimer’s disease risk). Through an extensive simulation study, we demonstrate that robust regression applied to the inverse-variance weighted method with penalized weights is a worthwhile additional sensitivity analysis for Mendelian randomization to provide robustness to variants with outlying causal estimates. The results from the applied examples and simulation study highlight the importance of using methods that make different assumptions to assess the robustness of findings from Mendelian randomization investigations with multiple genetic variants.


Introduction
Mendelian randomization uses genetic variants as instrumental variables to estimate the causal effect of a risk factor on an outcome using observational data [1,2]. The genetic variants must satisfy the following criteria (illustrated in Fig 1) to be a valid instrumental variable (IV): • IV1: the variant is associated with the exposure X, PLOS  associations with the risk factor [14]. If this assumption is invalid, then the type I error rate of the Q-statistic will be inflated. Bowden et al. [14] have accounted for possible violations in the NOME assumption by using adapted second order weights to calculate the Q-statistic. We here propose three further ways of downweighting or excluding variants with heterogeneous causal estimates that could be considered as part of a sensitivity analysis in a Mendelian randomization study. The first two of these extensions can be used as modifications to either the IVW or the MR-Egger method. These extensions have been influenced by the literature on robust statistics [15], and recent developments in robust methods for Mendelian randomization.
First, we outline the parametric assumptions made throughout the paper and discuss the estimation of the causal effect in a Mendelian randomization study. We then introduce three robust approaches: robust regression (MM-estimation), penalized weights, and Lasso penalization. We apply these approaches to published data on body mass index (BMI) and schizophrenia risk, and on low-density lipoprotein cholesterol (LDL-C) and Alzheimer's disease (AD) risk. Next, we perform a simulation study under realistic settings to compare bias and coverage properties of the robust methods when some of the genetic variants are invalid IVs. Finally, we discuss the results of the paper and its implications to applied Mendelian randomization research. Software code for implementing all of the methods used in this paper, including extracting the genetic association estimates for the applied examples, is provided in S1 Appendix. The methods (excluding Lasso penalization) can also be applied using the R package Men-delianRandomization [16].

Parametric assumptions
Throughout the paper, we assume linearity and no effect modification of the causal effect θ of the risk factor on the outcome, and the associations of the genetic variants G j (j = 1, . . ., J) with the risk factor and with the outcome. These assumptions are not necessary to estimate a causal effect, but they ensure that all valid IVs estimate the same causal parameter. Under these assumptions, the association b Y j between the variant G j and the outcome can be decomposed into an indirect effect via the risk factor and a direct (pleiotropic) effect α j (illustrated in Fig 2): We also assume that the outcome is a continuous variable. If the outcome is binary, then the methods can be applied to the log odds ratios obtained from logistic regression of each genetic variant on the outcome. The linearity assumption must now hold for the logit-transformed probability of the outcome. Difficulties with interpreting the causal estimate of an odds ratio with a binary outcome and a logistic-linear model have been widely discussed [17], with evidence to suggest that the causal estimates tend to be unbiased under the null [18].

Estimating the causal effect
The causal effect θ can be estimated using the genetic associations with the risk factor (b X j ) and with the outcome (b Y j ). The ratio estimate of the causal effect for variant j is given by: The J ratio estimates can be combined to provide an overall causal estimate by fitting weighted linear regression of the associations of the variants with the outcome on the associations of the variants with the exposure, with the intercept set to zero and se ðb Y j Þ À 2 as weights: The estimate obtained from Eq (3) is equivalent to the estimate from the IVW method [5]. Under a fixed-effects model, we set the residual standard error (ψ) to be equal to one by dividing the standard error of the causal estimate by the estimated residual standard error.
To account for heterogeneity (overdispersion) in the causal estimates, the residual standard error can be greater than one under a random-effects model. The causal estimate from the fixed and multiplicative random-effects models will be the same, but the standard error of the causal effect will be larger from the multiplicative random-effects model if there is heterogeneity between the causal estimates.
A genetic variant is pleiotropic if it has a direct effect on the outcome that is not via the risk factor (α j 6 ¼ 0). The IVW method under a fixed or multiplicative random-effects model will produce a consistent causal estimate when there is no pleiotropy (α j = 0 for all variants), or when the average pleiotropic effect is zero (referred to as balanced pleiotropy) and the pleiotropic effects are distributed independently of the associations of the genetic variants with the risk factor (known as the InSIDE assumption-Instrument strength independent of the Direct Effect) [9,19]. If an intercept term in Eq (3) is estimated, then this is the MR-Egger method, and the causal estimate will be consistent in the presence of directional pleiotropy (the average pleiotropic effect differs from zero) if the InSIDE assumption is satisfied [9]: appear as an outlier relative to the valid IVs as the direct effect of the pleiotropic variant will result in the vertical displacement ofb Y j from the causal effect (Eq (1)). Robust methods that downweight the contribution of variants with heterogeneous ratio estimates should reduce the impact that variants with outlying or over-dispersed estimates have on the causal estimate. For example, the simple median estimator is the median of the J ratio estimates θ j (j = 1, . . ., J), and will produce consistent causal estimates if at least 50% of the genetic variants are valid IVs [8].
Typically, applied Mendelian randomization analyses will use one variant from each gene region. Under Mendel's second law, these variants should be independently distributed due to their physical separation. The methods discussed in this paper will therefore assume that the variants are uncorrelated.
Robust regression (MM-estimation). The breakdown point is a measure of the robustness of an estimator to contaminations (such as outliers) in the dataset [15]. Ordinary least squares (OLS) has a breakdown point of 0% as all of the observations have equal weight and just one outlying observation can heavily influence the estimator, resulting in an arbitrarily large or small estimate. Robust regression methods, such as MM-estimation, have been proposed where the breakdown point is greater than 0% [15].
In this paper, we use an MM-estimation approach proposed by Koller and Stahel [20] as it retains the high asymptotic efficiency of the M-estimator ('maximum likelihood type'), whilst utilising the S-estimator ('scale-type estimate') to provide robustness against outliers and leverage points. Under this method, a S-estimate is fitted to minimize the M-estimate of scale, which has the desired high breakdown point but may lack efficiency. The estimates for the scale and regression parameters obtained in this stage are then used to fit an M-estimator with high efficiency, where the scale estimate is held constant to retain the high-breakdown point [20].
Additional robustness in MM-estimation may be achieved by using Tukey's bisquare objective function in the estimation procedure with its weighting function: where r j are the standardized residuals, and w(r j ) are used in the objective function of the iteratively reweighted least squares algorithm to obtain the MM-estimates. The recommended values for the tuning parameter c maintain a high breakdown point in the S-estimation step (c = 1.548) and provide efficiency in the M-estimation step (c = 4.685). In MM-estimation with Tukey's bisquare objective function, the weight of an observation decreases as r j tends away from zero, and when |r j | � c the observation will have zero weight. Throughout the paper, we will refer to this approach as robust regression. It is the default implementation of robust regression for the lmrob command in the R package robustbase [21]. Since the lmrob command allows the user to specify a vector of weights to be used in conjunction with Tukey's weighting function, robust regression can be used instead of 'ordinary regression' (weighted least squares) for the IVW and MR-Egger methods.
Penalized weights. We assume that the NOME assumption is satisfied, and propose an approach for downweighting genetic variants with heterogeneous ratio estimates in the IVW model using Cochran's Q statistic: which has an approximate w 2 JÀ 1 distribution under the null hypothesis that all J genetic variants satisfy the IV assumptions, with the J components Q j (j = 1, . . ., J) having approximate w 2 1 distributions [13]. Since penalized weights would normally be considered when pleiotropy is suspected, the simple (unweighted) median estimate is used for the value ofŷ in Eq (5) rather than the IVW estimate.
To ensure that the weights ( se ðb Y j Þ À 2 ) for the majority of the variants remain the same, we use a penalization for the IVW method based on the one-sided upper tail probability (denoted q j ) of Q j on a w 2 1 distribution by multiplying the weights by min(1, 100q j ). A similar downweighting factor, min(1, 20q j ), was used for the penalized-median estimator in the paper by Bowden et al. [8]. Initially we used min(1, 20q j ) but found that too many variants were being penalized, resulting in over-precise estimates that had poor coverage of the true causal effect. By multiplying the weights by min(1, 100q j ), the outlying variants should be severely penalized, without downweighting too many genetic variants that are valid IVs.
For the MR-Egger method, we consider the modified Q' statistic [22]: whereŷ 0 andŷ 1 are taken from the MR-Egger model. If the MR-Egger model is correct, the Q' statistic in Eq (6) should follow an approximate w 2 JÀ 2 distribution [23]. The penalized weights described in this Section can also be applied to robust regression for the IVW and MR-Egger methods, subsequently referred to as the robust and penalized approach (or robust regression with penalized weights).
Lasso penalization. The application of Lasso regression in IV analyses has already been considered in the literature [24][25][26]. The penalty term in Lasso regression shrinks the regression coefficients towards zero, and forces some coefficients to be zero [27]. The sparsity property (shrinking some coefficients to zero) of Lasso regression has been used to identify and remove invalid IVs. The IV methods that use Lasso regression have only been considered with respect to individual level data.
We take the 'post-lasso' method proposed by Windmeijer et al. [25] for individual level data and adapt this method to be used with summary level data. First, we consider the objective function for the MR-Egger model that is minimized when fitting the regression model of Eq (4): To better model the pleiotropic effects α j in Eq (1), we propose replacing θ 0 with a separate intercept coefficient for each genetic variant y 0 j , and adding a Lasso-penalty term for the y 0 j parameters: If y 0 j shrinks to zero in Eq (7), the genetic variant is treated as a valid IV. We take the genetic variants with a zero intercept term y 0 j , and perform the IVW method using these variants only to estimate the causal effect θ. The degree of shrinkage in Eq (7) is determined by the value of the tuning parameter λ. If λ = 1, then all of the genetic variants are assumed to be valid instruments as y 0 j is forced to be zero for all J variants, and the IVW method is performed using the full set of genetic variants. If λ = 0, then all of the variants can be pleiotropic, and the parameters in Eq (7) are not identified.
To determine the value of λ, two rules were considered: 1) a heterogeneity stopping rule; and 2) a cross-validation rule. The heterogeneity stopping rule is influenced by the method used by Windmeijer et al. [25] and Cochran's Q statistic. For the heterogeneity stopping rule, we fit the Lasso penalization model (Eq (7)) over a range of values for λ, starting with a value close to zero, and then increasing λ in small increments. We stop at λ = λ n when the residual standard error from the IVW model, based on the variants determined to be valid from λ = λ n+1 , is greater than 1, and the increase in the residual standard error from λ n to λ n+1 is greater than w 2 1 ð0:95Þ=J inc , where w 2 1 ð0:95Þ is the upper 95th percentile of a chi-squared distribution on 1 degree of freedom, and J inc is the number of genetic variants included in the IVW model when λ = λ n+1 .
As an alternative to the heterogeneity stopping rule, we use the optL1 command in the R package penalized [28]. optL1 compares the predictive ability of the Lasso regression model for different values of λ through leave-one-out cross-validation. The optimal value of λ is then determined by maximizing the cross-validated likelihood function.

Summary
In this Section, we have introduced three robust approaches that can be used in a Mendelian randomization study as part of the sensitivity analysis. The approaches use summary level data that either downweight or remove genetic variants that have heterogeneous causal ratio estimates. In the next Section, we apply these approaches to published summary data to investigate the causal effect of body mass index on schizophrenia risk, and the causal effect of lowdensity lipoprotein cholesterol on Alzheimer's disease risk.

Applied examples
To illustrate the performance of the proposed extensions, we considered two applied examples: one where there was evidence of over-dispersion in the ratio estimates (the causal effect of BMI on schizophrenia risk); and another that contained outliers (the causal effect of LDL-C on AD risk). Using summary data (beta-coefficients and standard errors) from PhenoScanner [29], we considered the IVW method with: 1) the full set of genetic variants; 2) robust regression; 3) penalized weights; and 4) robust regression and penalized weights. Lasso penalization with the heterogeneity stopping and cross-validation rules, the simple median, the weighted median, and the MR-Egger methods were also considered. Under the heterogeneity stopping rule, the Lasso penalization model was applied to λ = 0.1, 0.2, . . ., 4.9, 5.0, 5.2, 5.4, . . ., 9.8, 10.0. Multiplicative random-effects models were used in all analyses.

Causal effect of body mass index on schizophrenia risk
Although individuals with schizophrenia tend to be overweight [30], it is generally believed that this is due to the effect of anti-psychotic medication on body composition (reverse causation) rather than any causal effect of BMI on schizophrenia risk [31]. For this Mendelian randomization analysis, we used the 97 genetic variants reported by the Genetic Investigation of Anthropometric Traits (GIANT) consortium that were associated with BMI in 339,224 European-descent individuals at a genome-wide level of significance (p-value < 5 × 10 −8 ) [32]. Variants were clumped at a correlation threshold of r 2 > 0.1, and all 97 variants are separated by at least 500 kilobases. The genetic associations with schizophrenia were obtained from the Psychiatric Genomics Consortium (PGC) based on 35,476 cases and 46,839 controls mostly of European descent [33]. The summarized data used in this paper were recently applied in a Mendelian randomization study investigating the causal effect of BMI on psychiatric disorders, including schizophrenia risk [34].

Causal effect of low-density lipoprotein cholesterol on Alzheimer's disease risk
Epidemiological studies have provided evidence of an association between LDL-C and increased risk of AD [35,36]. However, there is also evidence to suggest that patients with AD have altered lipid metabolism (reverse causation) [37]. In this Mendelian randomization analysis, we used the 75 genetic variants previously demonstrated to be associated with LDL-C at a genome-wide level of significance by the Global Lipids Genetics Consortium (GLGC) [38]. The point estimates for the genetic associations with LDL-C were taken from the linear regression in up to 188,578 participants from GLGC [39]. The majority of variants are separated by at least 1 megabase. A second variant from a gene region was only selected if it was independently associated with LDL-C and in low linkage disequilibrium with the lead variant (r 2 < 0.05). A recent Mendelian randomization study used summarized data from GLGC to investigate the causal association between low LDL-C levels and AD risk using data on 380 variants. Our analysis is based on a smaller set of genetic variants compared to Benn et al. [40] as we excluded variants that were associated with LDL-C and high-density lipoprotein and/or triglycerides. The genetic associations with AD were obtained from the International Genomics of Alzheimer's Project (IGAP) based on 17,008 cases and 37,154 controls of European-descent [41].

Results
The estimated genetic associations with 95% confidence intervals for the two examples are displayed in Fig 3. The plots demonstrate the overdispersion in the ratio estimates for BMI and schizophrenia; and two outliers in the LDL-C and AD example. The outlying variants (rs6859 and rs7254892) for LDL-C and AD are located near to the APOE locus and are associated with AD risk with odds ratios of 1.40 (95% CI: 1.35, 1.44) and 1.28 (95% CI: 1.15, 1.44) respectively [41]. Studentized residuals from the IVW analysis for these variants are 16.5 and −0.95 (all other variants had absolute Studentized residual less than 2), and Cook's distances are 2.51 and 0.11 respectively (all other Cook's distances were less than 0.06).
Estimates and 95% confidence intervals from the Mendelian randomization analyses are provided in Table 1. All of the estimates for BMI and schizophrenia suggest a null causal effect (as also observed in the Mendelian randomization study by Hartwig et al. [34]), although there is wide variation in the standard errors. The use of penalized weights and robust regression in the IVW method improved the precision of the estimates. There was little difference in the point estimates or standard errors obtained from the IVW method with penalized weights, and from the IVW method with robust regression and penalized weights. With exception of the IVW and MR-Egger methods, the median estimates were the least precise.

Table 1. Estimates (standard errors) and 95% confidence intervals of the causal effect of body mass index on schizophrenia risk (log odds ratio for schizophrenia per 1 standard deviation increase in body mass index) and low-density lipoprotein cholesterol on Alzheimer's disease risk (log odds ratio for Alzheimer's per 1 standard deviation increase in low-density lipoprotein cholesterol) from the IVW method with: 1) the full set of genetic variants (IVW); 2) robust regression; 3) penalized weights; and 4) robust regression and penalized weights.
Results from Lasso penalization with the heterogeneity stopping rule and cross-validation, simple median, weighted median and MR-Egger methods are also presented. The estimates from the IVW and MR-Egger methods suggested a positive causal effect of LDL-C on AD risk. This effect was attenuated to the null for the other robust methods. Compared to the robust methods that reported a null causal effect of LDL-C on AD risk, the simple and weighted median estimates had larger standard errors. The estimates from the IVW and MR-Egger methods from Benn et al. [40] indicated that lower LDL-C levels may be beneficial Robust methods in Mendelian randomization via penalization of heterogeneous causal estimates in reducing AD risk, whereas their estimate from the weighted median method suggested a null effect. Since the genetic variants in the APOE gene region tend to be highly pleiotropic [10], it is likely that the positive effects obtained from the IVW models in our analysis and in the paper by Benn et al. [40] are driven by these pleiotropic variants, rather than there being a true causal effect of LDL-C on AD risk.
The λ values for the heterogeneity stopping rule (λ = 3.4 based on 72 genetic variants) and cross-validation (λ = 4.00 based on 73 genetic variants) for Lasso penalization were similar ( Fig 5). However, the estimate based on 72 genetic variants was much closer to the null, The consistency of the results from the robust methods for the BMI and schizophrenia example strengthened the evidence from the primary IVW analysis, providing similar point estimates but with narrower confidence intervals. The LDL-C and AD example highlighted the possibility that only using the IVW method may provide conclusions that are not representative of the majority of the data. Whilst in practice the outlying rs6859 variant could have been identified and removed from the dataset prior to the analysis, the robust approaches identified this outlying variant in an automated manner.

Approaches applied to the simulated data
We applied the approaches introduced in this paper to simulated datasets, including the IVW method with: 1) all the J genetic variants (standard IVW method); 2) robust regression; 3) penalized weights; and 4) robust regression and penalized weights. The Lasso penalization method with the heterogeneity stopping rule was also considered. The bias and coverage properties of the estimates from these robust methods were compared to those from the simple (unweighted) median, weighted median, and MR-Egger methods. Standard errors for the simple and weighted median estimates were obtained through bootstrapping [8]. Robust regression, penalized weights, and robust regression and penalized weights were also applied to the MR-Egger model. The Lasso penalization method was applied to λ = 0.1, 0.2, . . ., 4.9, 5.0, 5.2, 5.4, . . ., 9.8, 10.0 under the heterogeneity stopping rule.
To allow for direct comparisons with the MR-Egger method, and to assess the performance of the methods when the IV assumptions were violated, the simulations followed a similar structure to the simulation study performed in the paper by Bowden et al. [8]. The data generating model used in the simulation study is outlined below.

Data generating model
The simulation study generated data in accordance to Fig 6 for participants indexed by i = 1, . . ., N, and genetic variants indexed with j = 1, . . ., J: where α j represents the direct effect of the genetic variant G j on the outcome, ϕ j represents the effect of the genetic variant on the confounder U of the risk factor X and outcome Y association, b X j represents the genetic effect of G j on X, and θ is the causal effect of X on Y. The error terms � Ui , � Xi , and � Yi were drawn independently from standard normal distributions.
The performance of the robust methods was investigated under a two-sample Mendelian randomization setting with N = 10, 000 individuals and J = 15 genetic variants. Data were generated for 2N participants, and the associations of the variants with the risk factor were estimated in the first N participants, and associations with the outcome in the second N participants. Only the summary level data (beta-coefficients and standard errors) were used in the analyses. A one-sample setting was also considered where an additional N participants were simulated and all of the genetic associations were estimated from the same N participants.
If a genetic variant is associated with a confounder of the risk factor-outcome association, then this will affect the variant's association with both the risk factor and the outcome, leading to the violation of the InSIDE assumption. Using this observation, data were simulated to consider the following four scenarios: • Scenario 1-No pleiotropy, InSIDE automatically satisfied: α j and ϕ j were set to zero for all j.
• Scenario 2-Balanced pleiotropy, InSIDE satisfied: α j � U[0.05, 0.15] for invalid variants, with each α j having a 0.5 probability of being multiplied by -1. ϕ j was set to zero for all j.
The genetic variants G j were coded to correspond to a single nucleotide polymorphism with minor allele frequency 0.3. If a genetic variant was a valid IV then α j and ϕ j were set to zero in all four scenarios. In Scenarios 2 to 4, the number of invalid IVs was set to 1, 3 and 6.
The causal effect of the risk factor on the outcome was either θ = 0 (null causal effect) or θ = 0.3 (positive causal effect). The effects of the genetic variants on the risk factor (b X j ) were drawn from a uniform distribution between 0.06 and 0.13. 10 000 simulated datasets were generated for each combination of parameters (24 different combinations in total).

Results
The mean proportion of variance in the risk factor explained by the genetic variants (R 2 statistic), mean F statistic, and mean I 2 statistic are contained in Table A in the S2 Appendix for scenarios 1-4 for the null and positive causal effects by the number of invalid instruments. The mean R 2 values were greater than 3% for all of the scenarios, and the minimum mean F-statistic was 20.8. The I 2 statistic ranged from 39.1% to 80.9%. Since violations in the no measurement error (NOME) assumption of the genetic associations with the risk factor can lead to attenuation towards the null for the MR-Egger estimates, and this attenuation is approximately equal to the I 2 statistic, we expected the MR-Egger estimates for the positive causal effect to be severely attenuated towards the null [42].
The number of robust regression models that did not report a standard error (maximum of 2.6% across all of the scenarios considered) are given in Table B in the S2 Appendix. Apart from the calculation of the mean standard error, the robust regression models that did not report a standard error were included in the results, and the power calculations treated the standard error as infinite.
When all of the genetic variants were valid IVs ( Table 2), all of the methods produced unbiased estimates of the null causal effect and the Type I error rates were close to the nominal level of 5%. Apart from the simple median method, there was attenuation towards the null with a positive causal effect for all methods, and as expected, this was particularly evident for the MR-Egger method (also observed for Scenarios 2 and 3). Violation of the NOME assumption can lead to inflation of the intercept term in the MR-Egger method [42], and this was true for the simulation study where the power to detect the intercept term for Scenarios 1 and 2 was greater than 5% (Table C in the S2 Appendix). Only 7.5% of the MR-Egger models Table 2. Mean estimate (mean standard error), standard deviation, coverage of the 95% confidence interval (%), and power at the 5% significance level (%) of the estimates from the IVW model with: 1) the J genetic variants (IVW); 2) robust regression; 3) penalized weights; and 4) robust regression and penalized weights for Scenario 1 with a null (θ = 0) or positive (θ = 0.3) causal effect. Results from Lasso penalization with the heterogeneity stopping rule, simple (unweighted) median, weighted median and MR-Egger methods are also provided.

Null causal effect (θ = 0)
Positive causal effect (θ = 0. detected a positive causal effect, and apart from the median estimators, all of the robust methods had approximately 95% power to detect the positive causal effect. Although the mean estimates in Scenario 2 (Tables 3 and 4) were similar to those in Scenario 1, there were clear differences in the precision of the estimates for the null and positive causal effects, with most of the methods reporting larger mean standard errors under Scenario 2. The mean standard error increased as the number of invalid instruments increased for all methods. The IVW model with penalized weights had the most precise estimates, but suffered from inflated Type I error rates and poor coverage. The simple and weighted median estimators performed just as well, if not better, than the other robust methods for Scenario 2.

Table 3. Mean estimate (mean standard error), standard deviation, coverage of the 95% confidence interval (%), and power at the 5% significance level (%) of the estimates from the IVW model with: 1) the J genetic variants (IVW); 2) robust regression; 3) penalized weights; and 4) robust regression and penalized weights (R and P) for Scenarios 2-4 with a null causal effect (θ = 0) by the number of invalid IVs.
Results from Lasso penalization with the heterogeneity stopping rule, simple median, weighted median and MR-Egger methods are also provided. In Scenario 3 (directional pleiotropy, InSIDE satisfied), the IVW method produced biased causal estimates with inflated Type I error rates, and the degree of bias increased with the number of invalid IVs. With one invalid instrument, estimates from the robust methods were only slightly biased and Type I error rates were fairly well controlled. As the number of instruments increased, bias in the estimates for the robust methods also increased, although the magnitude of bias was smaller than the IVW method, and Type I error inflation was less severe. Robust regression with penalized weights performed reasonably well when there was 1 or 3 invalid instruments. Although the median methods give unbiased estimates asymptotically (that is, as the number of participants increases), when pleiotropic effects are directional there is some bias with a finite sample.

Table 4. Mean estimate (mean standard error), standard deviation, coverage of the 95% confidence interval (%), and power at the 5% significance level (%) of the estimates from the IVW model with: 1) the J genetic variants (IVW); 2) robust regression; 3) penalized weights; and 4) robust regression and penalized weights (R and P) for Scenarios 2-4 with a positive causal effect (θ = 0.3) by the number of invalid IVs.
Results from the Lasso penalization method with the heterogeneity stopping rule, simple median, weighted median and MR-Egger methods are also provided. In Scenario 4 (directional pleiotropy, InSIDE violated), all of the robust methods produced biased estimates. When there were only one invalid instrument, the magnitude of bias from the robust methods was less severe than the IVW method, and this was particularly true for robust regression with penalized weights. As the number of invalid IVs increased, the performance of the robust methods worsened, and there was little advantage in applying the robust methods compared to the median estimator in Scenario 4 when 6 of the 15 genetic variants were invalid IVs. In this scenario, bias is greater for the weighted median than for the simple median method as the invalid genetic variants are on average more strongly associated with the risk factor than the valid ones. This is because invalid variants are associated with the risk factor directly and via their effect on the confounder. In practical applications, invalid genetic variants will not necessarily be more strongly associated with the risk factor than valid ones, and so the simple median will not necessarily perform better than the weighted median method.

invalid IV 3 invalid IVs 6 invalid IVs
While results were fairly similar for most of the methods, results from the MR-Egger method were often quite different. This is because the other methods are fairly similar in their assumptions (that most genetic variants are valid IVs) and their mode of operation (variants with causal estimates that differ from the consensus are penalized or downweighted). This highlights the importance in an applied analysis of performing a range of methods that make different assumptions, rather than multiple methods that make similar assumptions [43].
Results from applying robust regression and penalized weights to the MR-Egger method are provided in Table D in the S2 Appendix. Although we had hoped that the combination of the MR-Egger method and approaches to reduce the influence of outlying variants would be synergistic in improving robustness, findings were disappointing, and all of the models were affected by the violation of the NOME assumption. A reason for this is the flexibility of the method: in allowing the intercept to differ from zero and allowing outliers that deviate from the regression model, the method permits the IV assumptions to be violated in quite a broad way. In a substantial number of cases, the method identified the wrong variants as invalid, finding an incorrect configuration of valid and invalid variants that appeared to fit the data better.
Finally, results from the one-sample setting are provided in Tables E and F in the S2 Appendix. Bias in the direction of the observational association was observed for all methods. As with the two-sample setting, the median estimators and robust regression with penalized weights produced the least biased estimates, and the IVW with penalized weights was the most precise.

Increased number of genetic variants
Since many of the methods described in this paper are based on asymptotic theory, it was anticipated that there would be an improvement in the performance of the methods when the data were generated with a larger number of genetic variants. We therefore repeated the simulation study for Scenarios 2-4 for 1 000 simulated datasets with the number of genetic variants increased from 15 to 100, and the number of invalid IVs increased from 1, 3 and 6 to 5, 15 and 30. The bounds of the uniform distribution used to generate the genetic associations with the risk factor (b X j ) were multiplied by ffi ffi ffi ffi ffi 15 100 q to ensure the average R 2 values were comparable with the original simulation study. The IVW model with: 1) the full set of genetic variants; 2) robust regression; 3) penalized weights; and 4) robust regression and penalized weights were all applied to the dataset. The Lasso penalization method with the heterogeneity stopping rule was also considered.
Results. The mean R 2 statistic, F-statistic, and I 2 statistic are contained in Table G in the S2 Appendix for Scenarios 2-4 for the null and positive causal effect by the number of invalid IVs. The mean R 2 values for the 100 genetic variants were slightly higher than the values reported in the original simulation study ( Table A in the S2 Appendix). For all of the scenarios considered, there was a significant reduction in the mean F-statistic and I 2 statistic, and we therefore expected the estimates to be affected by weak instrument bias.
Results from the simulation study for the IVW model with: 1) the J genetic variants (IVW); 2) robust regression; 3) penalized weights; and 4) robust regression and penalized weights, and the Lasso penalization method with the heterogeneity stopping rule are provided in Table 5.
The reduction in the strength of the IVs led to weak instrument bias, and there was severe attenuation towards the null for the positive causal effect (Table 5). For the null causal effect, there was little difference in the performance of the robust methods with the increased number of genetic variants. In fact, the methods performed worst under Scenario 4 when 100 variants were included in the data generating model rather than 15 (Table 5). Due to the attenuation of the positive causal effect when the number of variants was increased to 100, it was difficult to compare the results to the original simulations. Nevertheless, there was no evidence to suggest that the performances of the robust methods improved when the number of genetic variants was increased.

Discussion
In this paper, we have introduced three robust approaches for Mendelian randomization with summary level data that downweight the influence of heterogeneous causal estimates. The applied examples considered in this paper illustrate the importance of using a variety of methods in a Mendelian randomization analysis. The results from the robust methods support a null causal effect of BMI on schizophrenia risk. While the IVW and MR-Egger methods produced positive estimates that were strongly influenced by pleiotropic variants in the APOE gene region, the proposed methods were able to give null estimates that were unaffected by these outlying variants.
We also performed a simulation study to compare the robust approaches to the IVW, simple median, weighted median, and MR-Egger methods. The simulation study highlighted the sensitivity of the IVW method to violations in the IV assumptions, and the requirement for robust methods to be considered in the sensitivity analysis of a Mendelian randomization study. The simulations also demonstrated the impact of violating the NOME assumption on the estimates from the MR-Egger methods. Since it was not feasible to adjust for the violation of the NOME assumption through the SIMEX method [42] in the simulation study for computational reasons, it was difficult to compare the performance of the robust methods to MR-Egger.
Robust regression with penalized weights consistently produced the least biased estimates in the simulation study. Although the power and bias of this approach was significantly better than the standard IVW method when the IV assumptions were violated, it suffered from poor coverage and increased Type I error rates, particularly when there was a high proportion of invalid instruments. When there was only one invalid instrument, robust regression with penalized weights produced more precise estimates than the median estimator. However, as the number of invalid instruments increased there was little advantage of using robust regression with penalized weights compared to the median estimator.

Interpretation of heterogeneity among the causal ratio estimates
Throughout this paper, we have assumed that heterogeneity of the causal ratio estimates is indicative of violations in the IV assumptions, particularly the presence of pleiotropic effects. Table 5. Results from the simulation study when 100 genetic variants were simulated for 1 000 datasets. Mean estimate (mean standard error), standard deviation, coverage of the 95% confidence interval (%), and power at the 5% significance level (%) of the estimates from the IVW model with: 1) the J genetic variants (IVW); 2) robust regression; 3) penalized weights; and 4) robust regression and penalized weights (R and P) for Scenarios 2-4 with a null causal effect (θ = 0) and positive causal effect (θ = 0.3) by the number of invalid instrumental variables. Results from the Lasso penalization method with the heterogeneity stopping rule are also presented. However, heterogeneity among the causal ratio estimates may arise for a number of reasons [44]. For example, there may be multiple mechanisms of intervention on a complex risk factor, each of which has an associated causal effect. For a two-sample Mendelian randomization analysis, there may be heterogeneity among the causal ratio estimates due to substantial differences in the study populations used to estimate the genetic associations with the risk factor and outcome. The robust approaches considered in this paper penalize genetic variants with heterogeneous causal ratio estimates regardless of how this heterogeneity has materialised. As such, these methods should only be employed if it is suspected that the IV assumptions have been violated, and other possible reasons for heterogeneity among the causal ratio estimates explored.

Issues with penalizing genetic variants
The simulation study has highlighted some of the disadvantages of excluding or downweighting genetic variants from Mendelian randomization analyses. Excluding genetic variants with heterogeneous causal estimates will generally reduce the standard error of the estimate. However, too much penalization can potentially result in artificial overconfidence in the precision of the causal estimate, leading to poor coverage of the true causal effect and increased Type I error rates, as seen for Lasso penalization. If the excluded genetic variants are truly invalid IVs then removing them from the analysis will reduce bias and improve the precision of the causal estimate. However, outlying or heterogeneous causal ratio estimates may be valid IVs, and so removing them from the analysis would be inappropriate. On balance, it may be more appropriate to consider approaches that reduce the contribution that heterogeneous ratio estimates have on the causal estimate, such as the median estimator or robust regression, rather than excluding them from the analysis. If a large number of variants are identified as outliers, then researchers should consider reporting that the Mendelian randomization analysis is inconclusive, rather than reporting a causal estimate.

Implication for Mendelian randomization studies
The purpose of this paper was not to promote one robust method for Mendelian randomization over another, but to emphasize the need for multiple sensitivity analyses that make different sets of assumptions. Although we acknowledge that none of the proposed methods performed significantly better than the median estimator, the extensions proposed in this paper should provide additional confidence in the findings from a conventional Mendelian randomization analysis, particularly when the causal estimates are consistent. Genetic variants that are downweighted or excluded from the analysis by the robust methods should be examined for pleiotropy to determine whether they should be removed from the dataset. The methods proposed here are likely to be useful for Mendelian randomization analyses performed for large numbers of risk factors in an automated manner, such as for -omics risk factors measured on a high-throughput platform. These methods can help a researcher rapidly triage whether a positive causal estimate from the standard IVW method is evidenced just by a small number of variants (as in the LDL-cholesterol and Alzheimer's disease example), or by the majority of variants. The methods introduced in this paper, particularly robust regression with penalized weights, may be more suited to certain scenarios than the median estimator. In the applied example for LDL-C and AD risk, there were two variants that appeared to be clear outliers. The median estimator and robust regression with penalized weights both suggested that there was a null causal effect of LDL-C on AD risk, but the estimates from the median estimator were less precise. This observation of robust regression producing more precise estimates was also observed in the simulation study when there was one invalid IV. Robust regression with penalized weights may be a useful addition to sensitivity analyses in Mendelian randomization when there are a small proportion of variants with heterogeneous causal estimates.

Limitations
We found that the Lasso penalization method may be more appropriate in an applied setting, where the estimates can be reported over a range of values of the tuning parameter. The practicality of applying Lasso penalization to the simulation study was more restrictive, and required an automated approach to selecting the tuning parameter.
Whilst we appreciate the limitation of only considering methods with uncorrelated genetic variants, we argue that robust methods should be used when the IV assumptions are in doubt, and therefore using one genetic variant from each gene region is a sensible (although conservative) approach for robust methods in an applied Mendelian randomization analysis. This is because including multiple variants from a single region may mean that region receives a disproportionate weight in the analysis, and so the validity of the analysis would be overly dependent on the validity of these variants. This could be problematic as correlated variants are likely to all be valid or all be invalid, particularly if they are all in the same gene region. If an analyst does want to include correlated variants in an analysis, this can be done by first calculating the appropriate weighting matrix based on the inverse-variance weights and the correlations between variants, and multiplying the genetic associations by the Cholesky decomposition of this matrix, as described previously [45]. Software code to do this is provided in S1 Appendix. However, we caution that no allowance is made that correlated variants are likely to all be valid or all invalid simultaneously, as the methods treat all association estimates as separate datapoints.
The violation of the NOME assumption limited the utility of the simulation study as the estimates from MR-Egger could not be compared to the robust methods. Given that MR-Egger is frequently used as part of a sensitivity analysis in Mendelian randomization studies, this could be viewed as a weakness of the simulation study.
The main simulation study was also limited by the number of genetic variants considered in the data generating model. Since GWASs are now being performed on large study populations, and estimates of genetic associations are publicly available from large consortia, only considering 15 variants in the simulation study may have been conservative. We tried to rectify this limitation by re-performing the simulation study with 100 genetic variants (but keeping the overall R 2 statistic similar) and found that there was significant attenuation towards the null due to weak instrument bias. We had thought that the performances of some of the robust methods would have improved by increasing the number of genetic variants as the methods are based on asymptotic theory. However, we did not find any significant improvements in the methods, and in some cases, the performance of the models worsened with the increased number of genetic variants.

Conclusion
This paper has highlighted the difficulty in robust causal inference when genetic variants in a Mendelian randomization analysis violate the IV assumptions. The extensions proposed in this paper are by no means perfect; even when a small proportion of the variants were invalid IVs, all methods had inflated Type I error rates in at least one scenario. Nevertheless, the Type I error rate for the proposed extensions was substantially better than the IVW method and MR-Egger when the InSIDE assumption was violated. This paper has demonstrated the benefits of using multiple robust methods as part of a sensitivity analysis. We suggest that the IVW method using robust regression with penalized weights may be a worthwhile additional sensitivity analysis to be performed in a Mendelian randomization analysis in addition to previously proposed methods.
Supporting information S1 Appendix. Software code. R code for performing the approaches outlined in the paper, and extracting genetic association estimates. (PDF) S2 Appendix. Supplementary tables from the simulation study. Additional results from the simulation study. (PDF)