Abstract
Mendelian randomization (MR) is a popular statistical technique that uses genetic variants to explore causal relationships in observational epidemiology. Summary-level MR, the most common form, relies on published GWAS summary statistics to estimate causal effects between exposures and outcomes. However, empirical analyses tend to ignore issues relating to Winner’s Curse of instrument effects, weak instrument bias and sample overlap. Our simulations and empirical analyses using the UK Biobank indicate that such mechanisms can induce substantial bias in routine MR approaches. We propose MR Simulated Sample Splitting (MR-SimSS), a novel method that corrects this bias requiring no additional data beyond GWAS summary statistics for the exposure and outcome of interest. It operates by simulating statistically independent sets of summary statistics, analogous to what would be produced by splitting the individual-level data into independent subsets, which can then be plugged into existing two-sample MR methods. With sufficient instrument variants, MR-SimSS is robust to a range of sample overlap scenarios, providing a practical and modular solution to Winner’s Curse and weak instrument bias.
Author summary
A central challenge in epidemiology is determining whether an observed association reflects a true cause-and-effect relationship. Mendelian randomization (MR) addresses this by using genetic variants as natural experiments to test whether a particular trait or exposure genuinely influences disease risk. However, when the same genetic data are used both to select and to estimate genetic instruments, MR results can become biased due to a phenomenon known as the Winner’s Curse. This problem, along with weak instruments and sample overlap between datasets, can distort causal estimates even in large studies. We introduce MR Simulated Sample Splitting (MR-SimSS), a new framework that overcomes these issues using only publicly available genome-wide association study (GWAS) summary statistics. MR-SimSS works by statistically simulating independent subsets of the data, without requiring access to individual-level information, allowing existing MR methods to be applied without bias. Through extensive simulations and analyses using UK Biobank data, we show that MR-SimSS provides more accurate and reliable causal estimates, offering a practical tool for robust causal inference in modern genetic epidemiology.
Citation: Forde A, Hemani G, Ferguson J (2026) Simulated sample splitting approach to address biases due to instrument selection and participant overlap in two-sample Mendelian Randomization studies. PLoS Genet 22(5): e1011949. https://doi.org/10.1371/journal.pgen.1011949
Editor: Xiang Zhou, Yale University, UNITED STATES OF AMERICA
Received: October 31, 2025; Accepted: April 26, 2026; Published: May 8, 2026
Copyright: © 2026 Forde et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The code used for both the simulation study and the real-data analysis in the manuscript is available at https://github.com/amandaforde/mrsimss-paper. Code to implement MR-SimSS is available in the ‘mr.simss’ R package (https://github.com/amandaforde/mr.simss). The real-data analysis has been conducted using the UK Biobank Resource under Application Number 23739 (https://www.ukbiobank.ac.uk/enable-your-research/approved-research/exploring-the-shared-genetic-aetiology-between-schizophrenia-and-cognition).
Funding: AF is funded by Science Foundation Ireland (https://www.sfi.ie/) under award 18/CRT/6214. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Mendelian randomization (MR) is a statistical framework that uses genetic variation to assess whether a modifiable exposure causally influences an outcome of interest, and to estimate the magnitude of this effect [1]. Since its advent, the use of MR in epidemiological research has grown exponentially, largely due to the limitations of traditional observational studies, such as unmeasured confounding and reverse causation [2]. MR capitalises on the principle that genetic variants are fixed at conception and randomly inherited, rendering MR analyses approximately analogous to ‘naturally occurring randomized trials’ [3]. It relies on instrumental variable (IV) estimation, where genetic variants serve as instruments. For a variant to be a valid IV, it must be associated with the exposure, independent of confounders and influence the outcome only through the exposure pathway [4].
With the widespread availability of large-scale GWAS summary data, most MR analyses now use a summary-level design. This involves performing the MR analysis using publicly available estimates of variant-exposure and variant-outcome associations, along with their standard errors, typically derived from two non-overlapping, or partially overlapping, GWAS datasets [5]. Provided certain conditions hold and a valid genetic instrument is used, a consistent estimate of the causal effect can be obtained by dividing the variant-outcome association by the variant-exposure association [6]. To increase statistical power, ratio estimates across multiple genetic variants are often aggregated using meta-analytic methods such as the inverse variance weighted (IVW) estimator [7]. However, IVW is known to be sensitive to weak instruments and violations of the IV assumptions [8]. To address these issues, alternative summary-level methods, such as MR-Egger [9], MR weighted median [6] and MR-RAPS [10], have been proposed.
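The ratio and IVW estimators described above can be sketched in a few lines. This is a minimal illustration, assuming the common first-order IVW weights (squared variant-exposure estimate divided by the squared variant-outcome standard error); the function names and the toy numbers are hypothetical and are not taken from the paper or the 'mr.simss' package.

```python
import math


def wald_ratios(beta_x, beta_y):
    """Per-variant ratio estimates: variant-outcome / variant-exposure association."""
    return [by / bx for bx, by in zip(beta_x, beta_y)]


def ivw(beta_x, beta_y, se_y):
    """Fixed-effect IVW estimate with first-order weights bx^2 / se_y^2 (an assumption)."""
    weights = [bx ** 2 / s ** 2 for bx, s in zip(beta_x, se_y)]
    ratios = wald_ratios(beta_x, beta_y)
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    return estimate, math.sqrt(1.0 / total)


# Toy summary statistics for three hypothetical instruments.
beta_x = [0.10, 0.08, 0.12]
beta_y = [0.030, 0.026, 0.034]
se_y = [0.010, 0.010, 0.010]

estimate, se = ivw(beta_x, beta_y, se_y)
print(f"IVW estimate = {estimate:.4f}, SE = {se:.4f}")
```

Because the weights depend on the estimated variant-exposure associations, any inflation of those estimates feeds directly into both the ratios and the weights, which is why instrument selection matters for this estimator.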
A key focus of this work is the mitigation of Winner’s Curse bias in summary-level MR. To satisfy the IV relevance condition, genetic variants are usually selected by applying a genome-wide significance threshold, e.g., p-value < 5 × 10⁻⁸, to exposure GWAS summary data. However, using the same dataset for both instrument selection and estimation introduces Winner’s Curse, where variant-exposure associations of selected variants are overestimated due to selection bias [11]. Provided that the variant-outcome associations are estimated independently, this results in downward bias in the MR causal effect estimate, pulling it toward the null. To date, the only widely adopted solution to this problem has been the use of a third, independent dataset for instrument selection in a so-called ‘three-sample’ design [10]. While this avoids the overlap that causes Winner’s Curse, it is often impractical, requiring access to three large, non-overlapping GWAS datasets of similar ancestry, which is a challenge for many traits, particularly in understudied populations. Alternatively, a single dataset could be split into separate parts, but this sacrifices statistical power and efficiency, and is only an option if individual-level data are available.
Weak instruments pose a further source of bias. A genetic variant is considered a weak instrument if it has a weak statistical association with the exposure relative to the sample size, resulting in finite-sample bias in the causal effect estimate [12]. The magnitude and direction of this bias depend on the degree of sample overlap between the exposure and outcome GWASs. In a complete overlap (one-sample) setting, causal estimates are biased toward the observational association, which may increase false positive findings. In contrast, using two non-overlapping samples biases results toward the null. While most summary-level MR methods have been designed under the assumption of complete independence between exposure and outcome GWAS samples, substantial participant overlap is common in practice as the largest outcome and exposure GWASs often come from the same consortia [13]. Restricting analyses to non-overlapping GWASs again leads to inefficient data use.
To address these challenges, we propose a novel summary-level MR method, MR Simulated Sample Splitting (MR-SimSS). MR-SimSS extends the benefits of the ‘three-sample’ design to single-sample settings, by imitating the process of splitting an individual-level dataset into three parts: one for instrument selection and two for independent estimation of the variant-exposure and variant-outcome associations. Importantly, this is achieved using only GWAS summary statistics. MR-SimSS leverages asymptotic conditional distributions to repeatedly simulate association estimates in each of the three parts, or data-subsets, conditional on the known full dataset estimates. On each iteration, instrument variants are selected based on the simulated variant-exposure associations for the first subsample, and a two-sample MR method, such as IVW or MR-RAPS, is then applied to the exposure and outcome associations that are simulated for that set of instruments in the second and third subsets. MR estimates from each iteration are averaged to reduce variance. This approach allows use of the full dataset while avoiding biases introduced by sample overlap and Winner’s Curse.
We assessed MR-SimSS under varying levels of sample overlap and exposure-outcome correlation, using pairs of simulated exposure and outcome GWAS summary statistics. Several existing MR methods were included for comparison. Same-trait analyses were also conducted using UK Biobank [14] body mass index (BMI) data, where the true causal effect is known to be 1, providing a useful benchmark. All methods were evaluated across three key metrics: average bias, root mean square error (RMSE) and empirical coverage probability of 95% confidence intervals. These evaluations showed that many MR methods are adversely affected by biases arising due to sample overlap, be it using the same samples for instrument selection and estimating variant-exposure associations or using overlapping datasets to estimate both variant-exposure and variant-outcome associations. MR-SimSS illustrated a strong ability to mitigate these biases, especially those arising from Winner’s Curse. However, the simulated sample splitting process can increase susceptibility to weak instrument bias. To address this, we recommend incorporating a robust two-sample method, such as MR-RAPS [10], within MR-SimSS. Our results show that the 3-split version of MR-SimSS, when combined with MR-RAPS, consistently outperforms all other evaluated methods across key performance metrics. Finally, we demonstrate the practical utility of MR-SimSS in a real-world example assessing the causal effect of BMI on risk of Type 2 diabetes (T2D), highlighting the method’s value in applied epidemiological research.
Methods
Overview of MR-SimSS
MR-SimSS reconstructs the statistical conditions of a three-sample Mendelian Randomization (MR) design using only GWAS summary statistics for variant-exposure ($\hat{\beta}_X$) and variant-outcome ($\hat{\beta}_Y$) associations, even when these originate from overlapping samples. This enables unbiased causal inference in the presence of Winner’s Curse and weak instrument bias. The method operates under the assumption that marginal variant effect estimates follow an asymptotic multivariate normal distribution. Based on this, we have derived an analytical expression (see Equation (1)) that allows simulation of marginally independent variant effect estimates from hypothetical non-overlapping subsamples, without requiring access to individual-level genotypes or phenotypes.
The core procedure comprises three steps. First, MR-SimSS defines a hypothetical partition of the original dataset, allocating a fraction $\pi$ to a synthetic discovery subset. Conditional on the observed summary statistics, variant-exposure and variant-outcome association estimates ($\hat{\beta}_{X,\pi}$, $\hat{\beta}_{Y,\pi}$) are simulated for this subset and used to select instruments based on a genome-wide significance threshold.
Second, to mitigate Winner’s Curse, MR-SimSS generates variant-exposure and variant-outcome association estimates ($\hat{\beta}_{X,1-\pi}$, $\hat{\beta}_{Y,1-\pi}$) from the remaining pseudo-subsample. These are derived according to the asymptotic linearity of maximum likelihood estimators under data partitioning (see Equation (2)), and are marginally independent of the selection step. As a result, they can be supplied directly to any standard two-sample MR estimator, such as IVW or MR-RAPS, without inducing selection bias.
However, in the presence of sample overlap between exposure and outcome GWASs, a residual correlation between $\hat{\beta}_{X,1-\pi}$ and $\hat{\beta}_{Y,1-\pi}$ may remain, reintroducing weak instrument bias in the direction of the confounded observational association. To address this, MR-SimSS incorporates a three-split extension, where the $(1-\pi)$ data fraction is further subdivided into two non-overlapping fractions. Variant-exposure association estimates are generated for one fraction, while independent variant-outcome association estimates are generated for the other fraction, using a similar conditional Gaussian framework, this time conditioning on the summary statistics simulated in the first step (see Equation (S24)). This removes residual covariance and restores the assumption of independence between numerator and denominator that is made by many two-sample MR estimators, like IVW and MR-RAPS. This step is especially important when exposure and outcome GWASs exhibit substantial sample overlap, ensuring that any remaining bias, if not corrected by the chosen MR method, is toward the null.
To improve stability, the entire procedure is repeated over multiple random simulated splits, and causal effect estimates are averaged across iterations. MR-SimSS is compatible with both continuous and binary traits and generalizes to a wide range of GWAS settings, including full or partial sample overlap. Importantly, it permits valid use of robust MR methods, such as MR-RAPS, in settings where standard assumptions of independence are violated. Unbiased and efficient estimation of causal effects using commonly available summary-level data is therefore enabled as MR-SimSS reconstructs the independence structure of a three-sample MR design via conditional simulation.
Technical details
For each genetic variant $j$, we assume availability of variant-exposure and variant-outcome association estimates together with their respective standard errors, i.e., ($\hat{\beta}_{X_j}$, $\sigma_{X_j}$) from an exposure GWAS with sample size $n_X$ and ($\hat{\beta}_{Y_j}$, $\sigma_{Y_j}$) from an outcome GWAS with sample size $n_Y$. Summary statistics are assumed to arise from linear or logistic regression models applied to standardized genotypes, following standard GWAS practice. We allow for possible sample overlap (of $n_{\text{overlap}}$ individuals) between the exposure and outcome studies, and assume that linkage disequilibrium (LD) pruning has already been applied, resulting in a set of uncorrelated variants with summary statistics available in both GWASs.
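If individual-level data were available, Winner’s Curse could be eliminated by randomly splitting the data into two fractions, $\pi$ and $1-\pi$, conducting GWASs on each, and then selecting instruments in one part and estimating causal effects in the other. Since only summary-level data are available, we simulate this splitting process by drawing from our derived asymptotic conditional distribution of the GWAS estimates in the $\pi$-fraction, conditional on the full-sample GWAS statistics. Therefore, for each variant $j$, association estimates $\hat{\beta}_{X_j,\pi}$ and $\hat{\beta}_{Y_j,\pi}$ are simulated according to:
$$
\begin{pmatrix} \hat{\beta}_{X_j,\pi} \\ \hat{\beta}_{Y_j,\pi} \end{pmatrix} \Bigg| \begin{pmatrix} \hat{\beta}_{X_j} \\ \hat{\beta}_{Y_j} \end{pmatrix} \sim N\left( \begin{pmatrix} \hat{\beta}_{X_j} \\ \hat{\beta}_{Y_j} \end{pmatrix}, \; \frac{1-\pi}{\pi} \begin{pmatrix} \sigma_{X_j}^2 & \lambda \sigma_{X_j} \sigma_{Y_j} \\ \lambda \sigma_{X_j} \sigma_{Y_j} & \sigma_{Y_j}^2 \end{pmatrix} \right) \tag{1}
$$
where $\lambda = n_{\text{overlap}} \rho / \sqrt{n_X n_Y}$ and $\rho$ denotes the correlation between the exposure and outcome, potentially non-zero due to confounding. A proof that this is the correct conditional distribution to use for the simulation is given in S1 Text. In practice, $n_{\text{overlap}}$ and $\rho$ may not be known, and therefore we propose a data-driven estimator of the parameter $\lambda$, as detailed in S1 Text. Unconditional standard errors in the $\pi$-fraction are approximated by $\sigma_{X_j}/\sqrt{\pi}$ and $\sigma_{Y_j}/\sqrt{\pi}$. Instruments are selected using z-statistics $z_{j,\pi} = \sqrt{\pi}\,\hat{\beta}_{X_j,\pi}/\sigma_{X_j}$ and a genome-wide significance threshold of 5 × 10⁻⁸.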
To ensure independence between instrument selection and estimation, association estimates in the remaining $(1-\pi)$-fraction can be reconstructed as follows, using asymptotic linearity of maximum likelihood estimates:
$$
\hat{\beta}_{X_j,1-\pi} = \frac{\hat{\beta}_{X_j} - \pi \hat{\beta}_{X_j,\pi}}{1-\pi}, \qquad \hat{\beta}_{Y_j,1-\pi} = \frac{\hat{\beta}_{Y_j} - \pi \hat{\beta}_{Y_j,\pi}}{1-\pi} \tag{2}
$$
with corresponding standard errors scaled by $1/\sqrt{1-\pi}$. In the simplest (2-split) implementation of MR-SimSS, the selected genetic variants and their corresponding association estimates in the $(1-\pi)$-fraction are inputted into a summary-level MR method, such as IVW. However, this approach may still incur weak instrument bias in the direction of confounding if the exposure and outcome GWASs overlap.
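The 2-split step for a single variant can be sketched as follows. This is a minimal illustration rather than the 'mr.simss' implementation: it assumes the conditional distribution is bivariate normal, centred at the full-sample estimates, with covariance equal to $(1-\pi)/\pi$ times the full-sample covariance, which is the form implied by the asymptotic normality assumption and should be checked against Equation (1) and S1 Text. All function names and the toy values of the estimates, $\lambda$ and $\pi$ are hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulate_pi_fraction(bx, by, sx, sy, lam, pi, rng):
    """Draw pi-fraction estimates conditional on full-sample estimates.

    Assumed form: mean (bx, by), covariance
    (1 - pi) / pi * [[sx^2, lam*sx*sy], [lam*sx*sy, sy^2]].
    """
    scale = math.sqrt((1.0 - pi) / pi)
    e1 = rng.gauss(0.0, 1.0)
    e2 = rng.gauss(0.0, 1.0)
    bx_pi = bx + scale * sx * e1
    # Second coordinate built from e1 and e2 so that its correlation with the first is lam.
    by_pi = by + scale * sy * (lam * e1 + math.sqrt(1.0 - lam ** 2) * e2)
    return bx_pi, by_pi


def remaining_fraction(full, part, pi):
    """Estimate in the (1 - pi)-fraction implied by the full and pi-fraction estimates."""
    return (full - pi * part) / (1.0 - pi)


rng = random.Random(2026)
bx, by, sx, sy = 0.05, 0.015, 0.005, 0.005   # toy full-sample summary statistics
lam, pi = 0.3, 0.5                           # toy correlation and split fraction
z_crit = NormalDist().inv_cdf(1.0 - 5e-8 / 2.0)  # two-sided threshold for p < 5e-8

bx_pi, by_pi = simulate_pi_fraction(bx, by, sx, sy, lam, pi, rng)
z_pi = bx_pi / (sx / math.sqrt(pi))          # selection z-statistic in the pi-fraction
is_instrument = abs(z_pi) > z_crit

bx_rest = remaining_fraction(bx, bx_pi, pi)  # estimates passed to the MR method
by_rest = remaining_fraction(by, by_pi, pi)
se_x_rest = sx / math.sqrt(1.0 - pi)
se_y_rest = sy / math.sqrt(1.0 - pi)
print(is_instrument, bx_rest, by_rest, se_x_rest, se_y_rest)
```

In a full run this would be applied to every variant on every iteration, with only the variants flagged as instruments passed, with their $(1-\pi)$-fraction estimates, to the chosen MR method.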
To address this, we extend the approach to a 3-split framework. In this version, the $(1-\pi)$-fraction is further conceptually split into sub-fractions of relative sizes $\pi_2$ and $1-\pi_2$, with simulated exposure estimates, $\hat{\beta}_{X_j,\pi_2}$, derived from the $\pi_2$-subset and outcome estimates, $\hat{\beta}_{Y_j,1-\pi_2}$, from the remaining ($1-\pi_2$)-subset. These are simulated using a second conditional distribution, analogous to the form above (see Equation (S24) in S1 Text). For each variant $j$, this yields independent association estimates and associated standard errors, which are then used as input into the two-sample MR method being used with MR-SimSS at each iteration. Again, standard errors are scaled appropriately, by $1/\sqrt{(1-\pi)\pi_2}$ and $1/\sqrt{(1-\pi)(1-\pi_2)}$ relative to the full-sample standard errors. This simulated sample splitting procedure is repeated multiple times to reduce variability, and the final causal effect estimate, $\hat{\beta}$, is computed by averaging across iterations $k = 1, \dots, N_{\text{iter}}$:
$$
\hat{\beta} = \frac{1}{N_{\text{iter}}} \sum_{k=1}^{N_{\text{iter}}} \hat{\beta}_k
$$
in which $\hat{\beta}_k$ is the causal effect estimate supplied by the summary-level MR method of choice on the $k$th iteration. To quantify uncertainty, we derive the standard error of the average estimate using the following decomposition:
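$$
\text{se}(\hat{\beta})^2 = \frac{1}{N_{\text{iter}}} \sum_{k=1}^{N_{\text{iter}}} \text{se}(\hat{\beta}_k)^2 - \frac{1}{N_{\text{iter}}} \sum_{k=1}^{N_{\text{iter}}} (\hat{\beta}_k - \hat{\beta})^2
$$
The first term captures the average estimation variance across the $N_{\text{iter}}$ iterations, while the second adjusts for between-iteration variation, analogous to a Monte Carlo error correction. This formulation follows from the law of total variance, $\text{Var}(\hat{\beta}_k) = E[\text{Var}(\hat{\beta}_k \mid D)] + \text{Var}(E[\hat{\beta}_k \mid D])$, where $D$ denotes the observed full-sample summary statistics, and is shown in more detail in S1 Text. We refer to this method as MR-SimSS (Mendelian Randomization via Simulated Sample Splitting). The default implementation sets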
to ensure stable convergence of the mean and variance estimates, and fixes
, following empirical guidance from Sadreev et al. [15], who found equal splits to offer the most flexibility when designing two-sample MR with sample splitting.
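The aggregation across iterations can be sketched as below. This is an illustration under a stated assumption: the between-iteration variance is subtracted from the average squared standard error, as the law of total variance suggests, and the exact form should be checked against S1 Text. The function name, the guard against a negative variance and the toy inputs are hypothetical additions.

```python
import math


def combine_iterations(estimates, std_errors):
    """Average per-iteration estimates and combine their standard errors.

    Assumed decomposition: mean(se_k^2) - mean((b_k - mean(b))^2).
    """
    n_iter = len(estimates)
    mean_estimate = sum(estimates) / n_iter
    within = sum(s ** 2 for s in std_errors) / n_iter
    between = sum((b - mean_estimate) ** 2 for b in estimates) / n_iter
    # Guard (not from the source): clip at zero so the square root is defined.
    variance = max(within - between, 0.0)
    return mean_estimate, math.sqrt(variance)


# Toy per-iteration output from a two-sample MR method.
iteration_estimates = [0.28, 0.30, 0.32]
iteration_ses = [0.05, 0.05, 0.05]

combined_estimate, combined_se = combine_iterations(iteration_estimates, iteration_ses)
print(f"estimate = {combined_estimate:.3f}, SE = {combined_se:.4f}")
```

With only three toy iterations the between-iteration term is noisy; in practice the number of iterations would be large enough for both averages to stabilise.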
Because the 3-split procedure may increase weak instrument bias towards the null due to reduced effective sample sizes, we also consider MR-SimSS in combination with MR-RAPS [10], which is designed to account for weak instruments when using independent samples. Our framework facilitates the application of MR-RAPS even when the underlying GWASs are partially overlapping, by producing independent summary statistics through simulation.
While MR-SimSS is designed to improve causal inference accuracy, it can be computationally demanding when applied to large GWAS datasets due to the need to repeatedly simulate and evaluate many variants. To mitigate this, we introduce a deterministic variant pre-selection strategy that retains computational efficiency while preserving instrument coverage with high probability. For each variant $j$, we compute the probability that it will be selected in any iteration based on the simulated z-statistic $z_{j,\pi}$. The probability of variant $j$ passing the significance threshold $\alpha$, for the standard normal cumulative distribution function $\Phi$ and chosen $\pi$, is given by:
$$
P(|z_{j,\pi}| > c \mid z_j) = \Phi\left( \frac{\sqrt{\pi}\, z_j - c}{\sqrt{1-\pi}} \right) + \Phi\left( \frac{-\sqrt{\pi}\, z_j - c}{\sqrt{1-\pi}} \right)
$$
in which $z_j = \hat{\beta}_{X_j}/\sigma_{X_j}$ is the full-sample z-statistic and $c = \Phi^{-1}(1 - \alpha/2)$ is the corresponding critical value. To construct a reduced variant subset, we rank variants by their selection probabilities and compute the cumulative sum of these probabilities. We retain the smallest set of variants whose cumulative inclusion probability exceeds 0.95. This guarantees that the reduced and full variant sets will produce identical instruments on any iteration with at least 95% probability. Equivalently, the expected difference in the number of instruments selected by the full and reduced procedures is bounded by 5%. This subsetting procedure provides a principled and efficient means to scale MR-SimSS to large-scale genomic datasets. Note that a more complete description of the strategy is available in S1 Text.
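The pre-selection idea can be sketched as follows. Two assumptions are made here and are not confirmed by the main text: that, conditional on the full-sample z-statistic $z_j$, the simulated z-statistic is normal with mean $\sqrt{\pi} z_j$ and variance $1-\pi$, and that the 0.95 cut-off is read as a share of the total selection probability. The function names and toy z-scores are hypothetical; S1 Text holds the authoritative description.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def selection_probability(z_full, pi, alpha=5e-8):
    """P(|z_pi| > c) given the full-sample z, assuming z_pi ~ N(sqrt(pi) * z_full, 1 - pi)."""
    c = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    mean = math.sqrt(pi) * z_full
    sd = math.sqrt(1.0 - pi)
    upper_tail = STD_NORMAL.cdf((mean - c) / sd)   # P(z_pi > c)
    lower_tail = STD_NORMAL.cdf((-c - mean) / sd)  # P(z_pi < -c)
    return upper_tail + lower_tail


def preselect(z_scores, pi, coverage=0.95, alpha=5e-8):
    """Keep the highest-probability variants until `coverage` of the total is reached."""
    probs = [selection_probability(z, pi, alpha) for z in z_scores]
    order = sorted(range(len(z_scores)), key=lambda j: probs[j], reverse=True)
    target = coverage * sum(probs)
    kept, running = [], 0.0
    for j in order:
        if running >= target:
            break
        kept.append(j)
        running += probs[j]
    return kept, probs


toy_z = [20.0, 0.5, -15.0, 1.0, 8.0]  # hypothetical full-sample z-statistics
kept, probs = preselect(toy_z, pi=0.5)
print(sorted(kept))
```

Variants with full-sample z-statistics near zero have a negligible chance of ever crossing the threshold in the $\pi$-fraction, so dropping them before the iterations begin saves most of the simulation cost.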
Simulation study
We conducted simulations to assess the performance of MR-SimSS in reducing Winner’s Curse bias in MR and to compare it against established MR methods. A factorial design was implemented, varying the following parameters:
- Exposure heritability:
- Proportion of causal SNPs:
- Sample overlap:
- Exposure-outcome correlation:
Each scenario assumed equal exposure and outcome GWAS sample sizes, $n_X = n_Y$ = 200,000, and a fixed true causal effect $\beta$. For each replicate, we simulated GWAS summary statistics for N = 1,000,000 independent genetic variants. True variant-exposure effects $\beta_{X_j}$ were sampled such that the specified proportion of variants had non-zero effects drawn from a normal distribution, calibrated to yield the specified exposure heritability. True variant-outcome effects were defined as $\beta_{Y_j} = \beta \beta_{X_j}$.
Estimated summary statistics were drawn from a bivariate normal distribution:
with variant minor allele frequencies simulated uniformly over [0.01, 0.5], an expression that is justified as an appropriate asymptotic distribution in S1 Text, when both exposure and outcome have variance 1. Standard errors were assumed to be known. For each of the 80 parameter combinations, we simulated 100 independent datasets. To assess robustness to sample size, the simulations were repeated for sample sizes of 50,000 and 500,000. In addition, we conducted simulations under the null hypothesis ($\beta = 0$) with complete sample overlap, as well as simulations examining different pleiotropic scenarios.
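Each dataset was analysed using MR-SimSS (2-split and 3-split) with IVW and MR-RAPS, as well as standard MR methods such as IVW and MR-RAPS using genome-wide significant instruments (p < 5 × 10⁻⁸). Each of these methods was evaluated using:
$$
\text{Bias} = \frac{1}{K} \sum_{k=1}^{K} (\hat{\beta}_k - \beta), \qquad \text{RMSE} = \sqrt{\frac{1}{K} \sum_{k=1}^{K} (\hat{\beta}_k - \beta)^2}, \qquad \text{Coverage} = \frac{1}{K} \sum_{k=1}^{K} I\left( |\hat{\beta}_k - \beta| \le 1.96\, \text{se}_k \right)
$$
Here, $\beta$ is the true causal effect, $\hat{\beta}_k$ and $\text{se}_k$ are the estimate and standard error in replicate $k$ of $K$, and $I(\cdot)$ is the indicator function. We additionally report the average causal effect estimate, the average standard error and the average absolute bias (absolute estimated bias averaged over simulation settings).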
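The three evaluation metrics can be computed as in the sketch below, which uses the usual definitions of bias, RMSE and coverage of a 95% normal-theory confidence interval. The function name and the toy replicate values (including the toy true effect) are hypothetical and serve only to show the calculation.

```python
import math


def performance(estimates, std_errors, true_effect, z=1.96):
    """Return (bias, RMSE, coverage) over simulation replicates."""
    n_rep = len(estimates)
    errors = [b - true_effect for b in estimates]
    bias = sum(errors) / n_rep
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n_rep)
    # A replicate is covered when the true effect lies within estimate +/- z * se.
    covered = sum(1 for e, s in zip(errors, std_errors) if abs(e) <= z * s)
    return bias, rmse, covered / n_rep


# Toy replicates: four estimates with equal standard errors.
toy_estimates = [0.29, 0.31, 0.35, 0.30]
toy_ses = [0.01, 0.01, 0.01, 0.01]

bias, rmse, coverage = performance(toy_estimates, toy_ses, true_effect=0.3)
print(f"bias = {bias:.4f}, RMSE = {rmse:.4f}, coverage = {coverage:.2f}")
```

Reporting all three together matters because a method can have small bias yet poor coverage if its standard errors are miscalibrated.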
Real data processing
For the empirical same-trait BMI-BMI analyses, the large-scale UK Biobank [14] BMI dataset was randomly split in half 10 times to generate 10 pairs of non-overlapping samples. In each of the 20 resulting subsets, quality control and GWAS were performed using PLINK 2.0 [16], following the same procedures as outlined in Forde et al. [17]. This yielded 10 pairs of independent GWAS summary statistics datasets. A set of approximately independent variants was obtained via LD pruning using the PLINK 2.0 [16] command ‘--indep-pairwise 50 5 0.5’. Instrument variants were selected based on the conventional genome-wide significance threshold of p < 5 × 10⁻⁸.
Results
Simulation study
We conducted a comprehensive simulation study to evaluate the finite-sample performance of MR-SimSS relative to existing summary-level MR methods. The simulations assumed equal exposure and outcome GWAS sample sizes of 200,000, and explored 80 distinct configurations defined by varying the proportion of causal variants, heritability of the exposure, fraction of sample overlap, and exposure-outcome correlation. Throughout, it was assumed that overlap and correlation parameters were unknown to the analyst. For each simulation, both the two-split and three-split variants of MR-SimSS, each paired with IVW and with MR-RAPS, were used to estimate the causal effect of the exposure on the outcome. The implementation of MR-SimSS also incorporated estimation of the parameter λ, representing the correlation between variant-exposure and variant-outcome association estimates (see Methods).
Table 1 summarizes mean performance metrics - bias, absolute bias, root mean squared error (RMSE) and 95% coverage probability - averaged over heritability and proportion of true effects, for non-overlapping and fully overlapping samples with an exposure-outcome correlation of 0.5. Figs 1 and 2 present boxplots of estimated causal effects for the non-overlapping and fully overlapping scenarios, respectively, stratified by exposure-outcome correlation, heritability and proportion of true effects. Under non-zero sample overlap, SimSS-3-RAPS, the three-split implementation incorporating MR-RAPS, consistently exhibited superior performance. In simulations with complete sample overlap and an exposure-outcome correlation of 0.5, SimSS-3-RAPS achieved minimal bias (-0.0001), the lowest RMSE (0.0068), and the highest empirical coverage (93.5%). When exposure and outcome samples were independent, SimSS-2-RAPS outperformed competing methods, attaining the lowest average bias and RMSE (-0.0003 and 0.0076, respectively) and high empirical coverage (96%; Table 1). These performance patterns were consistent across all values of the exposure-outcome correlation (Figs 1-2).
Fig 1. Estimated causal effect for each method and simulation setting with sample sizes of 200,000 and zero overlap, averaged over 100 simulated pairs of exposure and outcome GWAS summary statistics for each setting. Methods are abbreviated as: SimSS-2-IVW = 2-split version of MR-SimSS using IVW, SimSS-2-RAPS = 2-split version of MR-SimSS using MR-RAPS, SimSS-3-IVW = 3-split version of MR-SimSS using IVW, SimSS-3-RAPS = 3-split version of MR-SimSS using MR-RAPS, IVW = Inverse variance weighted method and RAPS = Robust Adjusted Profile Score of Zhao et al. [10].
Fig 2. Estimated causal effect for each method and simulation setting with sample sizes of 200,000 and full overlap, averaged over 100 simulated pairs of exposure and outcome GWAS summary statistics for each setting. Methods are abbreviated as: SimSS-2-IVW = 2-split version of MR-SimSS using IVW, SimSS-2-RAPS = 2-split version of MR-SimSS using MR-RAPS, SimSS-3-IVW = 3-split version of MR-SimSS using IVW, SimSS-3-RAPS = 3-split version of MR-SimSS using MR-RAPS, IVW = Inverse variance weighted method and RAPS = Robust Adjusted Profile Score of Zhao et al. [10].
Although MR-SimSS corrects for Winner’s Curse, the simulated sample splitting procedure reduces the effective sample size used for estimation, increasing susceptibility to weak instrument bias when paired with IVW. This effect was evident for SimSS-3-IVW, which uses simulated non-overlapping splits for the generation of variant-exposure and variant-outcome association estimates. This MR-SimSS variant exhibited consistent downward bias regardless of sample overlap, with causal estimates remaining below 0.3 in all simulation settings (Figs 1-2). In contrast, SimSS-2-IVW shows overlap-dependent bias, reflecting incomplete independence between the simulated subsamples used for estimation.
Incorporating MR-RAPS within MR-SimSS mitigates this limitation. Under independent or weakly overlapping samples, both SimSS-2-RAPS and SimSS-3-RAPS yielded nearly unbiased estimates (Fig 1, Fig B in S1 Text), consistent with the theoretical properties of MR-RAPS under independence of variant-exposure and variant-outcome associations [10]. However, under high sample overlap and exposure-outcome correlation (0.5), SimSS-2-RAPS exhibited inflated bias (Fig 2), whereas SimSS-3-RAPS remained unbiased due to the enforced independence of simulated exposure and outcome associations used in the estimation step. In the extreme case of complete sample overlap and strong exposure-outcome correlation, SimSS-3-RAPS was the only method to avoid upward bias toward the observational association, aside from the negatively biased SimSS-3-IVW. Across all MR-SimSS variants, standard errors were modestly larger than those of competing methods, reflecting variance inflation from simulated splitting. In contrast, the standard IVW and MR-RAPS estimators exhibit substantial bias across multiple simulation settings, reflecting susceptibility to both Winner’s Curse and bias arising from sample overlap (Figs 1-2).
Fig 1 and Table B in S1 Text report performance under strictly non-overlapping samples for all other exposure-outcome correlation values. In these settings, both SimSS-2-RAPS and SimSS-3-RAPS achieved near-zero bias (<0.0005), low RMSE, and empirical coverage between 94% and 97%. SimSS-2-RAPS achieved slightly lower RMSE due to the absence of unnecessary variance inflation from an additional split. Conversely, standard IVW and MR-RAPS were substantially biased in scenarios with low proportions of causal variants. In the opposite extreme of complete sample overlap (Fig 2, Table C in S1 Text), SimSS-3-RAPS again achieved the most favourable balance of bias, RMSE, and coverage, with respect to all values of exposure-outcome correlation. Performance of all methods under intermediate overlap fractions (25%, 50%, 75%) is shown in Figs B-D and Tables D-F in S1 Text, in which consistent trends favouring SimSS-3-RAPS are evident.
Fig E and Table G in S1 Text report results from simulations conducted under the null hypothesis of no causal effect with fully overlapping samples. These results highlight the impact of biases due to instrument selection and participant overlap when two-sample MR methods are naively implemented, with IVW and MR-RAPS exhibiting poor empirical coverage and markedly elevated false positive rates. In contrast, both SimSS-3-IVW and SimSS-3-RAPS achieved approximately 95% coverage, indicating preservation of Type I error at the nominal 5% level under this setting.
To assess sensitivity to sample size, method performance was evaluated under alternative sample sizes of 500,000 and 50,000. With larger samples (500,000), performance improved across all methods due to increased instrument strength. Nevertheless, SimSS-3-RAPS remained the most accurate estimator, achieving minimal average absolute bias (0.0021), an RMSE of 0.0029, and empirical coverage of 94.1%. In contrast, for smaller sample sizes of 50,000, MR-SimSS methods exhibited sensitivity to instrument sparsity. Although SimSS-3-RAPS remained the least biased estimator (average absolute bias = 0.0012), it exhibited substantially increased variability, with a standard error of approximately 58.29 and an RMSE more than tenfold those of IVW and MR-RAPS (Table F in S1 Text). This instability is further illustrated in Fig H in S1 Text, where causal effect estimates span a wide range (-1–1), reflecting the limited number of genome-wide significant variants in these settings. For example, under 30% heritability and a 1% true effect proportion, only approximately 10 variants exceeded genome-wide significance, yielding an average of around three instruments per iteration, and thus, severely impairing causal effect estimation with MR-SimSS.
To investigate whether the performance of SimSS-3-RAPS could be improved under such low-power conditions, we examined the effect of relaxing the instrument selection threshold. As shown in Fig K and Table J in S1 Text, increasing the significance threshold from 5 × 10⁻⁸ to 5 × 10⁻⁴ substantially reduced RMSE, from 0.923 to 0.032, while maintaining low average absolute bias (0.0258). These findings indicate that relaxing the selection threshold within MR-SimSS can improve estimator stability when few variants reach genome-wide significance in the first sample split (p < 5 × 10⁻⁸). However, because such adjustments introduce weak instrument bias, it is essential that robust MR methods, such as MR-RAPS, are used within the MR-SimSS framework to ensure valid inference. Note that additional results from simulations examining various forms of pleiotropy are provided in Tables K-P and Figs L-O in S1 Text. These results further demonstrate the ability of MR-SimSS to avoid Winner’s Curse and overlap-induced bias, while retaining the properties of the embedded two-sample MR method.
Same-trait empirical analysis
To assess the empirical performance of MR-SimSS, we conducted same-trait MR analyses, estimating the causal effect of BMI on itself, using independent GWASs from the UK Biobank [14]. These analyses provide a realistic validation setting in which the true causal effect is known to be 1. Ten pairs of non-overlapping samples of ~166,000 individuals were used to generate BMI GWAS summary statistics for ~1.6 million LD-pruned variants using PLINK 2.0 [16]. Same-trait MR analyses were performed bidirectionally for each pair of summary statistics, yielding 20 causal effect estimates per evaluated method. All MR-SimSS variants and conventional methods, including MR-Egger and weighted median [6], were assessed using bias, RMSE and 95% coverage probability. Additional methods evaluated here included debiased IVW (dIVW), a bias-corrected estimator that uses all variants and does not rely on SNP selection [18], and MRlap, a likelihood-based method designed to correct for both Winner’s Curse and sample overlap bias [19].
Table 2 and Fig 3 summarize the results of the repeated BMI-BMI analyses using pruned instruments. Classical IVW, MR-RAPS and weighted median estimators exhibited substantial downward bias, with mean causal estimates of 0.838, 0.865, and 0.807, respectively. All three methods yielded 0% empirical coverage, confirming susceptibility to Winner’s Curse. MR-Egger was less biased with an average causal effect estimate of 0.989, but its coverage was moderate (0.65) and it exhibited large estimate variance, reflecting instability. MRlap also produced estimates close to the true causal effect (1.028), with high empirical coverage (0.9), although it showed upward bias and increased variability, suggesting potential violation of its model assumptions, for example the assumed spike-and-slab genetic architecture. While dIVW achieved the lowest bias (-0.0026) and RMSE, its poor coverage (0.25) indicates miscalibrated standard errors.
Boxplots of the estimated causal effects for each method resulting from the 20 same-trait BMI-BMI analyses. Methods are abbreviated as: SimSS-2-IVW = 2-split version of MR-SimSS using IVW, SimSS-2-RAPS = 2-split version of MR-SimSS using MR-RAPS, SimSS-3-IVW = 3-split version of MR-SimSS using IVW, SimSS-3-RAPS = 3-split version of MR-SimSS using MR-RAPS, IVW = Inverse variance weighted method, RAPS = Robust Adjusted Profile Score of Zhao et al. [10], Egger = Egger regression of Bowden et al. [9], Weighted median = Weighted median approach of Bowden et al. [6], dIVW = debiased IVW method of Ye et al. [18] and MRlap = MRlap method of Mounier & Kutalik [19]. The black horizontal line represents the true causal effect of 1.
In contrast, SimSS-2-RAPS and SimSS-3-RAPS provided a favourable balance between bias, precision and calibration. Both approaches yielded near-unbiased estimates (bias = -0.0073 and -0.0070, respectively), with substantially lower RMSE than classical estimators and high empirical coverage (85–95%). SimSS-2-IVW and SimSS-3-IVW eliminated Winner’s Curse-induced attenuation but suffered from downward weak instrument bias, with mean biases of -0.0247 and -0.0503, respectively, and reduced coverage. Overall, these empirical results reinforce the simulation findings: MR-SimSS, particularly when implemented with MR-RAPS, can offer a robust correction for selection-induced bias, with the ability to deliver accurate and well-calibrated causal estimates even in real-world data.
Effect of body mass index on type 2 diabetes
The four MR-SimSS variants, together with other MR methods, were also used to estimate the causal effect of BMI on type 2 diabetes (T2D), under varying degrees of sample overlap. First, two independent samples of ~166,000 individuals with outcome information were randomly selected from the entire UKBB [14] T2D data set and used to generate two sets of outcome GWAS summary statistics (T2D-A and T2D-B) with PLINK 2.0 [16]. For the exposure, BMI, 5 different sets of GWAS summary statistics were generated using similarly-sized sets of individuals, all with different percentages of sample overlap with the outcome data sets (0%, 25%, 50%, 75%, 100%).
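One way to construct exposure samples with a prescribed overlap is sketched below. This is an illustrative assumption about how such a design could be set up, using small integer IDs in place of UK Biobank participants; it is not a description of the exact sampling procedure used here, and `exposure_sample` is a hypothetical helper.

```python
import random


def exposure_sample(outcome_ids, other_ids, n, overlap, rng):
    """Draw n exposure IDs, a fraction `overlap` of which are shared with the outcome sample."""
    n_shared = round(n * overlap)
    shared = rng.sample(outcome_ids, n_shared)   # individuals also in the outcome GWAS
    fresh = rng.sample(other_ids, n - n_shared)  # individuals outside the outcome GWAS
    return shared + fresh


rng = random.Random(0)
ids = list(range(1000))          # toy cohort
rng.shuffle(ids)
outcome_ids = ids[:200]          # toy outcome sample
other_ids = ids[200:]            # everyone not in the outcome sample

samples = {
    ov: exposure_sample(outcome_ids, other_ids, 200, ov, rng)
    for ov in (0.0, 0.25, 0.5, 0.75, 1.0)
}

for ov, s in samples.items():
    print(ov, len(set(s) & set(outcome_ids)))
```

Keeping the exposure sample size fixed across overlap settings, as in this sketch, means that differences between settings reflect the overlap itself rather than a change in precision.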
For each overlap setting, results of the corresponding BMI-T2D analyses were pooled to provide average estimated causal effects. These results are summarized in Fig 4 and Table 3. In line with previous MR studies [20], higher BMI was confirmed to be a causal risk factor for T2D. All methods yielded statistically significant causal effect estimates, with all 95% confidence intervals lying above 1. The traditional IVW approach yielded estimated effects ranging from 1.178, for non-overlapping samples, to 1.243, for fully overlapping samples. In contrast, the range of causal effect estimates provided by SimSS-3-RAPS was ~40% smaller, from 1.257 to 1.296. The standard deviation of the SimSS-3-RAPS averaged estimates (0.014) was less than half that of IVW (0.0282), suggesting that SimSS-3-RAPS can provide more consistent effect estimates across different degrees of sample overlap between exposure and outcome data sets. The difference between SimSS-3-RAPS and IVW estimates was greatest in the zero overlap setting, with our results suggesting that IVW underestimated the causal effect by ~8% due to downward bias caused by both Winner’s Curse and weak instruments.
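The range comparison above follows directly from the reported end points, as the short check below shows. It uses only the figures quoted in the text; the variable names are illustrative.

```python
# End points of the averaged causal effect estimates across overlap settings.
ivw_low, ivw_high = 1.178, 1.243
simss_low, simss_high = 1.257, 1.296

ivw_range = ivw_high - ivw_low        # width of the IVW range
simss_range = simss_high - simss_low  # width of the SimSS-3-RAPS range

# Relative reduction in range width for SimSS-3-RAPS compared with IVW.
relative_reduction = 1 - simss_range / ivw_range

# Ratio of the reported standard deviations of the averaged estimates.
sd_ratio = 0.014 / 0.0282

print(round(ivw_range, 3), round(simss_range, 3))
print(round(relative_reduction, 2), round(sd_ratio, 2))
```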
Average estimated causal effects for each method resulting from BMI-T2D analyses with various percentages of sample overlap. Error bars reflect confidence intervals computed as: Methods are abbreviated as: SimSS-2-IVW = 2-split version of MR-SimSS using IVW, SimSS-2-RAPS = 2-split version of MR-SimSS using MR-RAPS, SimSS-3-IVW = 3-split version of MR-SimSS using IVW, SimSS-3-RAPS = 3-split version of MR-SimSS using MR-RAPS, IVW = Inverse variance weighted method, and RAPS = Robust Adjusted Profile Score of Zhao et al. [10].
Discussion
We introduce MR Simulated Sample Splitting (MR-SimSS), a novel summary-level MR method designed to correct Winner’s Curse bias. Winner’s Curse arises when the same GWAS dataset is used for both instrument selection and variant-exposure association estimation, often producing deflated causal effect estimates. In recent years, summary-level MR has become the dominant MR approach, driven by its ease of use and the broad availability of complete summary data. This has spurred the development of numerous summary-level MR methods aimed at accurate causal effect estimation [21]. However, methods explicitly addressing Winner’s Curse bias remain underdeveloped. While using an independent sample for instrument selection is a commonly accepted solution [10], we consider such an approach suboptimal due to reduced statistical power from dataset partitioning and potential heterogeneity introduced by dissimilar populations. Accordingly, MR-SimSS requires no such independent dataset to be available.
MR-SimSS addresses Winner’s Curse by employing asymptotic conditional distributions to emulate repetitive splitting of a large individual-level dataset into distinct fractions. Instrument selection is based on association estimates from one fraction, while the remaining fraction is used for estimation. At each iteration, causal effect estimates are obtained via the integration of a summary-level MR method. We investigated the use of both the IVW [7] and MR-RAPS [10] methods within the context of our simulated sample splitting procedure. Furthermore, the MR-SimSS framework accommodates both partial and full sample overlap between exposure and outcome GWASs, a frequent scenario in large biobank datasets such as UK Biobank. We note that similar approaches to controlling for sample overlap have been previously proposed within the GRAPPLE framework [22].
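The core idea can be illustrated with a deliberately simplified Python sketch of a 2-split procedure combined with IVW. It assumes non-overlapping exposure and outcome samples, splits only the exposure associations, uses the full outcome associations for estimation, and averages the per-iteration estimates; these are simplifying assumptions made for illustration, and the overlap-aware joint distribution used by MR-SimSS is not reproduced. All function names are hypothetical. The split itself rests on a standard large-sample result: if a fraction `pi` of the sample yields estimate b1 and the remainder yields b2, then beta_hat ≈ pi·b1 + (1 − pi)·b2, and conditional on beta_hat, b1 is approximately normal with mean beta_hat and variance se²(1 − pi)/pi.

```python
import math
import random


def simulate_split(beta_hat, se, pi, rng):
    """Simulate (b1, se1, b2, se2) for one variant, conditional on the full-sample estimate."""
    b1 = rng.gauss(beta_hat, se * math.sqrt((1 - pi) / pi))
    b2 = (beta_hat - pi * b1) / (1 - pi)  # so that pi * b1 + (1 - pi) * b2 = beta_hat
    return b1, se / math.sqrt(pi), b2, se / math.sqrt(1 - pi)


def ivw(beta_x, beta_y, se_y):
    """Fixed-effect IVW estimate (weighted regression through the origin)."""
    num = sum(bx * by / sy ** 2 for bx, by, sy in zip(beta_x, beta_y, se_y))
    den = sum(bx ** 2 / sy ** 2 for bx, sy in zip(beta_x, se_y))
    return num / den


def simss_2split_ivw(bx, sx, by, sy, pi=0.5, z_cut=5.45, n_iter=1000, seed=0):
    """Average IVW estimate over repeated simulated splits of the exposure data."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_iter):
        keep_x, keep_y, keep_sy = [], [], []
        for j in range(len(bx)):
            b1, s1, b2, _ = simulate_split(bx[j], sx[j], pi, rng)
            if abs(b1 / s1) > z_cut:   # select on the first simulated fraction
                keep_x.append(b2)      # estimate with the second, independent fraction
                keep_y.append(by[j])
                keep_sy.append(sy[j])
        if keep_x:
            estimates.append(ivw(keep_x, keep_y, keep_sy))
    return sum(estimates) / len(estimates) if estimates else float("nan")


# Toy data with a true causal effect of 0.5 (illustrative values only).
bx = [0.10, 0.12, 0.08, 0.15]
sx = [0.005] * 4
by = [0.5 * b for b in bx]
sy = [0.005] * 4

estimate = simss_2split_ivw(bx, sx, by, sy, n_iter=200)
print(round(estimate, 3))
```

Because selection uses b1 and estimation uses b2, which are independent given the construction above, the selected variant-exposure associations are not inflated by the selection step. Swapping `ivw` for a weak-instrument-robust estimator mirrors the modular design described in the text.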
In a factorial simulation study, varying sample overlap and exposure-outcome correlation, SimSS-3-RAPS (the 3-split version of MR-SimSS integrating MR-RAPS) consistently demonstrated superior performance, yielding unbiased causal estimates across all scenarios. It achieved the highest empirical coverage, minimal average bias and lowest RMSE, illustrating its capacity to overcome both Winner’s Curse and weak instrument bias. As shown in Fig 2, SimSS-3-RAPS remains unbiased irrespective of exposure-outcome correlation, even in the presence of fully overlapping samples. When independent GWASs are available, SimSS-2-RAPS performs comparably well. Although SimSS-2-IVW and SimSS-3-IVW are susceptible to weak instrument bias, both outperform naïve IVW, underscoring MR-SimSS’s utility even when paired with simpler estimators. Our simulations reaffirm the vulnerability of standard methods, such as IVW and MR-RAPS, to Winner’s Curse. For example, Table 1 and Table B in S1 Text show coverage below 0.51 for these methods under zero-overlap conditions when selection and estimation utilize the same exposure data.
Our simulation results with smaller sample sizes, e.g., 50,000, underscore the importance of sufficient instrument strength and quantity for stable estimation using MR-SimSS. When the average number of instruments per iteration is low (fewer than ~20), the method exhibits increased variability and reduced precision. However, we find that performance improves substantially when the instrument selection threshold is relaxed. Specifically, our findings demonstrate that in low-power settings, employing a less stringent selection threshold offers a practical strategy to recover estimator stability within the MR-SimSS framework (Fig K in S1 Text), provided that weak instrument bias is appropriately mitigated through the use of robust MR estimators. However, this logic only extends so far. For instance, we would not recommend applying MR-SimSS blindly if there are no genome-wide significant exposure-associated variants. MR-SimSS with a reduced selection threshold is likely to have extremely low power in this setting, even under large exposure-outcome causal effects, and possibly inflated type I error.
Our simulations also demonstrate that the three-split version of MR-SimSS, when paired with MR-RAPS, has roughly 95% coverage, and consequently a 5% Type I error, under the null hypothesis of no causal effect (Fig E in S1 Text). While it is true that in zero overlap scenarios, classical MR methods, such as IVW, can demonstrate superior power over MR-SimSS variants when the true causal effect is small, these methods are highly susceptible to bias when sample overlap exists, resulting in greatly inflated Type I errors (Figs P-Q in S1 Text). This Type I error inflation is also seen for methods that are robust to weak instrument bias under no sample overlap, such as MR-RAPS, as we show in our simulations. In contrast, our study shows that SimSS-3-RAPS has the ability to retain a 5% Type I error, irrespective of the degree of sample overlap. Given what has been referred to as a ‘credibility crisis’ in Mendelian Randomization [23], it is essential that MR methods preserve Type I error under general conditions, and thus, the MR-SimSS framework is an important development in this regard.
Same-trait empirical analyses (e.g., BMI-BMI) corroborate the simulation findings, particularly under independent sample conditions (Fig 3). Across 20 BMI-BMI analyses, IVW, MR-RAPS, and the weighted median method displayed pronounced Winner’s Curse bias, with coverage probabilities of zero and average bias ranging from -0.2 to -0.13. In contrast, SimSS-2-RAPS and SimSS-3-RAPS yielded near-unbiased estimates (absolute bias < 0.01) with high coverage, demonstrating robustness to both Winner’s Curse and weak instrument bias.
In BMI-T2D analyses, MR-SimSS produced more stable and consistent causal effect estimates across varying degrees of sample overlap than conventional approaches. In particular, SimSS-3-RAPS yielded a narrower range of estimates with lower variability than standard IVW, suggesting improved robustness to both Winner’s Curse and weak instrument bias. For comparative purposes, the debiased IVW (dIVW) estimator [18] was applied in the zero-overlap setting. The dIVW method explicitly corrects for measurement error in variant-exposure associations and does not rely on screening variants, thereby avoiding selection bias due to Winner’s Curse. Therefore, dIVW estimates obtained under sample independence should theoretically be similar to those provided by SimSS-3-RAPS in all overlap settings. The resulting averaged dIVW estimate lay within the range of estimates produced by SimSS-3-RAPS (Table 3), providing further support that MR-SimSS effectively mitigates both selection- and overlap-induced bias when applied to real data. We note that this empirical study has focussed primarily on alleviating bias arising from Winner’s Curse and sample overlap and thus, the potential impact of other sources of bias, such as correlated pleiotropy, remains an avenue for further exploration.
Admittedly, the current formulation of MR-SimSS has certain limitations. Throughout our investigation, the two splitting fractions were both fixed at 0.5. Determining universally optimal values for these fractions is inherently challenging as they likely depend on the underlying genetic architectures of both the exposure and outcome traits, as well as the sample sizes of the source GWAS datasets. Our choice of 0.5 was partly informed by Sadreev et al. [15]. A higher value of the splitting fraction increases the number of instruments per iteration; thus, a value of 0.5 balances the fractions used for instrument generation and for estimation of associations in the 3-split setting. With both splitting fractions set to 0.5, only 25% of the total sample informs each variant-exposure and variant-outcome association estimate, increasing variance in the resulting causal effect estimates. This variance inflation can be largely mitigated by using a sufficiently large number of iterations, e.g., 1000, ensuring stability and precision in the final estimate. Adaptive tuning of these parameters merits future exploration.
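The 25% figure and its consequence for precision can be written out as a short worked calculation. The symbols f_1 (fraction used for instrument selection) and f_2 (fraction of the remainder used for variant-exposure estimation) are placeholders introduced here for illustration and are not necessarily the notation used elsewhere in the paper; the calculation also assumes that standard errors scale as one over the square root of the sample size.

```latex
\begin{align*}
  \text{exposure estimation fraction} &= (1 - f_1)\, f_2 = 0.5 \times 0.5 = 0.25, \\
  \text{outcome estimation fraction}  &= (1 - f_1)(1 - f_2) = 0.5 \times 0.5 = 0.25, \\
  \frac{\mathrm{se}_{\text{split}}}{\mathrm{se}_{\text{full}}}
    &\approx \frac{1}{\sqrt{0.25}} = 2 .
\end{align*}
```

Under these assumptions, each simulated association estimate has roughly twice the standard error of its full-sample counterpart, which is why a large number of iterations is needed to stabilize the final estimate.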
An additional practical consideration when applying MR-SimSS concerns the handling of linkage disequilibrium (LD) among genetic variants. In its current form, we recommend implementing MR-SimSS with an LD-pruned, rather than LD-clumped, set of variants, as pruning avoids additional selection bias introduced by retaining only the most strongly exposure-associated variant within each LD region and ensures that the standard measurement-error model for the selected variant-exposure associations remains appropriate. To illustrate this, we performed a same-trait BMI-BMI analysis using both pruned and clumped variant sets with comparable numbers of genome-wide significant variants (Fig R, Table Q in S1 Text). The MR-SimSS approaches incorporating MR-RAPS yielded unbiased causal effect estimates when applied to pruned variants, but not when applied to clumped variants. This observation is consistent with the additional selection bias induced by clumping, which preferentially selects variants with the smallest p-values within each genomic region and thus, biases estimates toward the null. In principle, this selection bias could be mitigated by explicitly modelling correlation between variant-exposure associations, for example by injecting correlated noise according to the underlying LD structure. Exploring such extensions, which would allow MR-SimSS to completely remove Winner’s Curse bias from clumped datasets, forms an interesting direction for future methodological work.
Another avenue for future research is the integration of MR-SimSS with alternative summary-level MR methods. Because MR-SimSS inherits the properties of the embedded two-sample MR method, optimal performance is expected when it is paired with methods that are robust to weak instruments and pleiotropy, whereas more vulnerable methods may yield less reliable estimates. This is illustrated by SimSS-3-RAPS, whose strong resistance to weak instrument bias is derived directly from MR-RAPS. Accordingly, combining MR-SimSS with pleiotropy-robust methods is expected to be appropriate for analyses affected by directional or horizontal pleiotropy. To explore this, we conducted an auxiliary simulation study incorporating pleiotropy, in which MR-SimSS was combined with the MR weighted median estimator [6] (SimSS-2-Med and SimSS-3-Med). Although this implementation did clearly exhibit resistance to pleiotropy-induced bias, its estimates remained biased downwards due to the weighted median’s susceptibility to weak instrument bias (Figs N-O, Tables M-P in S1 Text). Conceptually, MR-SimSS should therefore be viewed not as a standalone MR estimator, but as a general framework for mitigating bias due to Winner’s Curse and sample overlap in summary-level MR. It complements existing MR methods, enabling researchers to utilize the largest available GWAS datasets, even with overlapping samples, without requiring a third, independent dataset.
Due to the lack of accessible software, we were unable to include the recently proposed rerandomized IVW (RIVW) estimator [24] in our method evaluations. The RIVW estimator may be regarded as conceptually similar to MR-SimSS, as both approaches were designed to facilitate independent instrument selection and unbiased estimation of variant-exposure associations using a single exposure dataset. By introducing pseudo variant-exposure associations into the selection step and then using Rao-Blackwellization to produce a consistent estimator for the causal effect, RIVW successfully breaks the Winner’s Curse in the classical two-sample IVW estimator [24]. For settings with non-overlapping exposure and outcome samples, this Rao-Blackwellization result could be viewed as the theoretical expectation of what would be obtained if MR-SimSS were applied with an IVW estimate adapted to handle weak instrument bias. As RIVW is not simulation-based, it is likely to be more computationally efficient than MR-SimSS. However, it requires that the exposure and outcome GWASs have been performed with non-overlapping samples and is vulnerable to unbalanced pleiotropy. In contrast, we have demonstrated here how MR-SimSS can mitigate Winner’s Curse bias irrespective of sample overlap and can also be used with existing MR methods that are resistant to weak instrument bias and certain types of pleiotropy.
In conclusion, MR-SimSS provides a principled and practical solution to two pervasive issues in MR analysis, Winner’s Curse and weak instrument bias, when only GWAS summary statistics are accessible. By enabling the use of maximal GWAS data, MR-SimSS substantially improves the reliability of causal effect estimates, thus offering a valuable contribution to the MR methodological toolkit.
Supporting information
S1 Text. Text supplement.
This file contains supplementary figures and tables, as well as important derivations.
https://doi.org/10.1371/journal.pgen.1011949.s001
(PDF)
References
- 1. Smith GD, Ebrahim S. “Mendelian randomization”: can genetic epidemiology contribute to understanding environmental determinants of disease? Int J Epidemiol. 2003;32(1):1–22. pmid:12689998
- 2. Smith GD, Ebrahim S. Data dredging, bias, or confounding: they can all get you into the BMJ and the Friday papers. BMJ. 2002;325(7378):1437.
- 3. Swanson SA, Tiemeier H, Ikram MA, Hernán MA. Nature as a Trialist?: Deconstructing the Analogy Between Mendelian Randomization and Randomized Trials. Epidemiology. 2017;28(5):653–9. pmid:28590373
- 4. Sanderson E, Glymour MM, Holmes MV, Kang H, Morrison J, Munafò MR, et al. Mendelian randomization. Nat Rev Methods Primers. 2022;2:6. pmid:37325194
- 5. Burgess S, Scott RA, Timpson NJ, Davey Smith G, Thompson SG, EPIC-InterAct Consortium. Using published data in Mendelian randomization: a blueprint for efficient identification of causal risk factors. Eur J Epidemiol. 2015;30(7):543–52. pmid:25773750
- 6. Bowden J, Davey Smith G, Haycock PC, Burgess S. Consistent Estimation in Mendelian Randomization with Some Invalid Instruments Using a Weighted Median Estimator. Genet Epidemiol. 2016;40(4):304–14. pmid:27061298
- 7. Burgess S, Dudbridge F, Thompson SG. Combining information on multiple instrumental variables in Mendelian randomization: comparison of allele score and summarized data methods. Stat Med. 2016;35(11):1880–906. pmid:26661904
- 8. Bowden J, Del Greco M F, Minelli C, Davey Smith G, Sheehan NA, Thompson JR. Assessing the suitability of summary data for two-sample Mendelian randomization analyses using MR-Egger regression: the role of the I² statistic. Int J Epidemiol. 2016;45(6):1961–74. pmid:27616674
- 9. Bowden J, Davey Smith G, Burgess S. Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int J Epidemiol. 2015;44(2):512–25. pmid:26050253
- 10. Zhao Q, Wang J, Hemani G, Bowden J, Small DS. Statistical inference in two-sample summary-data Mendelian randomization using robust adjusted profile score. Ann Stat. 2020;48(3):1742–69.
- 11. Jiang T, Gill D, Butterworth AS, Burgess S. An empirical investigation into the impact of winner’s curse on estimates from Mendelian randomization. Int J Epidemiol. 2023;52(4):1209–19. pmid:36573802
- 12. Burgess S, Thompson SG. Bias in causal estimates from Mendelian randomization studies with weak instruments. Stat Med. 2011;30(11):1312–23. pmid:21432888
- 13. Burgess S, Davies NM, Thompson SG. Bias due to participant overlap in two-sample Mendelian randomization. Genet Epidemiol. 2016;40(7):597–608. pmid:27625185
- 14. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–9. pmid:30305743
- 15. Sadreev II, Lepik K, Richmond RC, Palmer TM, Davey Smith G, Holmes MV, et al. Navigating sample overlap, winner’s curse and weak instrument bias in Mendelian randomization studies using the UK Biobank. medRxiv. 2021.
- 16. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75. pmid:17701901
- 17. Forde A, Hemani G, Ferguson J. Review and further developments in statistical corrections for Winner’s Curse in genetic association studies. PLoS Genet. 2023;19(9):e1010546. pmid:37721937
- 18. Ye T, Shao J, Kang H. Debiased inverse-variance weighted estimator in two-sample summary-data Mendelian randomization. Ann Stat. 2021;49(4).
- 19. Mounier N, Kutalik Z. Bias correction for inverse variance weighting Mendelian randomization. Genet Epidemiol. 2023;47(4):314–31. pmid:37036286
- 20. Corbin LJ, Richmond RC, Wade KH, Burgess S, Bowden J, Smith GD, et al. BMI as a Modifiable Risk Factor for Type 2 Diabetes: Refining and Understanding Causal Estimates Using Mendelian Randomization. Diabetes. 2016;65(10):3002–7. pmid:27402723
- 21. Boehm FJ, Zhou X. Statistical methods for Mendelian randomization in genome-wide association studies: A review. Comput Struct Biotechnol J. 2022;20:2338–51. pmid:35615025
- 22. Wang J, Zhao Q, Bowden J, Hemani G, Davey Smith G, Small DS, et al. Causal inference for heritable phenotypic risk factors using heterogeneous genetic instruments. PLoS Genet. 2021;17(6):e1009575. pmid:34157017
- 23. Burgess S, Woolf B, Mason AM, Ala-Korpela M, Gill D. Addressing the credibility crisis in Mendelian randomization. BMC Med. 2024;22(1):374. pmid:39256834
- 24. Ma X, Wang J, Wu C. Breaking the winner’s curse in Mendelian randomization: rerandomized inverse variance weighted estimator. Ann Stat. 2023;51(1):211–32.