Approaches to detect genetic effects that differ between two strata in genome-wide meta-analyses: Recommendations based on a systematic evaluation

Genome-wide association meta-analyses (GWAMAs) conducted separately by two strata have identified differences in genetic effects between strata, such as sex-differences for body fat distribution. However, several approaches exist to identify such differences, and it is unclear which approach to use. Assuming the availability of stratified GWAMA results, we compare various approaches to identify between-strata differences in genetic effects. We evaluate type I error and power via simulations and analytical comparisons for different scenarios of strata designs and for different types of between-strata differences. For strata of equal size, we find that the genome-wide test for difference without any filtering is the best approach to detect stratum-specific genetic effects with opposite directions, while filtering for overall association followed by the difference test is best to identify effects that are predominant in one stratum. When there is no a priori hypothesis on the type of difference, a combination of both approaches can be recommended. Some approaches violate type I error control when filtering and difference testing are conducted in the same data set. For strata of unequal size, the best approach depends on whether the genetic effect is predominant in the larger or in the smaller stratum. Based on real data from GIANT (>175,000 individuals), we exemplify the impact of the approaches on the detection of sex-differences for body fat distribution (identifying up to 10 loci). Our recommendations provide tangible guidelines for future GWAMAs that aim at identifying between-strata differences. A better understanding of such effects will help pinpoint the underlying mechanisms.


Introduction
impact of the different approaches on the identification of sexually dimorphic variants for body fat distribution using real data from the GIANT consortium [12].

Materials and methods

Notation and models
We consider $K$ studies and 2 strata, with a total sample size of $n = n_1 + n_2$, where $n_i = \sum_k n_{ik}$, $k = 1, \ldots, K$, $i = 1, 2$ indexes the strata, and $f = n_2/n_1$. For an individual $j$, $Y_{ik}^{(j)}$ denotes a continuous phenotype value and $G_{ik}^{(j)} \in \{0, 1, 2\}$ the number of alleles for a genetic variant (omitting the indexing of the millions of variants analyzed). A stratified GWAMA involves two steps: the study- and stratum-specific GWAS conducted by the study analyst, and the stratum-specific meta-analysis.
For stratified GWAS, the linear regression model computed per stratum for each of the genetic variants (omitting further covariates) can be written as $Y_{ik}^{(j)} = \alpha_{ik} + \beta_{ik} G_{ik}^{(j)} + E_{ik}^{(j)}$, with $\alpha_{ik}$ denoting the intercept, $\beta_{ik}$ the genetic effect and $E_{ik}^{(j)} \sim N(0, \sigma^2_{E_{ik}})$. We assume that phenotypes have been normalized to have zero mean and unit variance in each study and stratum (i.e., $\sigma^2_{Y_{ik}} = \sigma^2_Y = 1$). We also assume similar minor allele frequencies across studies and strata ($\mathrm{MAF}_{ik} \approx \mathrm{MAF}$) and thus similar genotype variances, $\sigma^2_{G_{ik}} = \sigma^2_G = 2 \cdot \mathrm{MAF} \cdot (1 - \mathrm{MAF})$, i.e. that the SNP is not associated with the stratum variable and is homogeneous across studies (S1 Note). Our notation here assumes that the studies include only unrelated individuals (see below for the extension to related individuals).
For the stratum-specific meta-analysis per variant, pooled genetic effect estimates, $b_i$, and standard errors, $se_i$, are computed via an inverse-variance weighted meta-analysis by stratum [16], assuming equal genetic effects across studies, since we focus on identification rather than quantification of genetic effects [17]. Under the assumption that the studies are from similar populations with similar genotypic variance, the inverse-variance weighted meta-analyzed $b_i$ and $se_i$ are approximately identical to estimates derived from one single large mega-study [18,19].
Stratified GWAMA approaches to identify GxS

Our stratified GWAMA approaches are based on four statistical tests. The statistical test to identify GxS is the difference test, $Z_{Diff} = (b_1 - b_2)/\sqrt{se_1^2 + se_2^2}$. This assumes no relatedness of subjects across strata and thus no correlation of $b_1$ with $b_2$. Under the assumption of unrelated individuals across strata and no latent covariate interacting with a dichotomous factor S, the difference test is equivalent to testing interaction of the genetic effect with S. We consider three further tests that are utilized to filter genetic variants prior to the difference testing: filtering on (i) overall association, $Z_{Overall} = (b_1/se_1^2 + b_2/se_2^2)/\sqrt{1/se_1^2 + 1/se_2^2}$, (ii) stratified association, $Z_1 = b_1/se_1$ or $Z_2 = b_2/se_2$, or (iii) the alternative joint association. All test statistics can be computed based on stratified GWAMA results, i.e. the stratum-specific pooled genetic effects, $b_i$, and corresponding standard errors, $se_i$, $i = 1, 2$ (see S1 Table for a detailed description of the four tests).
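As an illustration, the test statistics can be computed directly from the stratum-specific summary statistics. The following is a minimal sketch, not the authors' implementation. The function names and example values are hypothetical. The formula for the alternative joint test is not reproduced in this section, so the 2-degree-of-freedom chi-square form $Z_1^2 + Z_2^2$ used below is an assumption.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def two_sided_p(z):
    """Two-sided p-value of a standard-normal test statistic."""
    return 2.0 * _STD_NORMAL.cdf(-abs(z))

def z_diff(b1, se1, b2, se2):
    """Difference test, assuming uncorrelated stratum-specific estimates."""
    return (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)

def z_overall(b1, se1, b2, se2):
    """Overall association: inverse-variance weighted combination of both strata."""
    w1, w2 = 1.0 / se1 ** 2, 1.0 / se2 ** 2
    return (b1 * w1 + b2 * w2) / math.sqrt(w1 + w2)

def z_stratified(b, se):
    """Stratified association within a single stratum."""
    return b / se

def joint_chi2(b1, se1, b2, se2):
    """ASSUMPTION: joint test taken as Z1^2 + Z2^2 (chi-square, 2 df)."""
    return z_stratified(b1, se1) ** 2 + z_stratified(b2, se2) ** 2

def joint_p(chi2):
    """The survival function of a chi-square with 2 df is exp(-x/2)."""
    return math.exp(-chi2 / 2.0)

# Hypothetical example: equally sized effects in opposite directions.
b1, se1, b2, se2 = 0.03, 0.005, -0.03, 0.005
print("Z_Diff    =", z_diff(b1, se1, b2, se2))     # large in absolute value
print("Z_Overall =", z_overall(b1, se1, b2, se2))  # opposite effects cancel
print("P_Joint   =", joint_p(joint_chi2(b1, se1, b2, se2)))
```

In this hypothetical example the opposite effects cancel in the overall statistic while the difference statistic is large, which is the situation in which filtering on overall association would discard the variant.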
The tests are used to generate various approaches to identify GxS. In the approach without filtering, the difference test is applied genome-wide, [Diff_αDiff], and GxS can readily be identified using a genome-wide significance level of $\alpha_{Diff} = 5 \times 10^{-8}$ (= 0.05/1,000,000) [20]. In the approaches with filtering, one of the three filtering tests is applied genome-wide, variants with a P-value below a filtering threshold, $\alpha_{Filter}$, are selected, and lead variants are extracted (variant with lowest P-value within +/-500,000 base positions). When M lead variants are selected by the filtering, these variants have an established association with the phenotype, but GxS has yet to be ascertained by the difference test using a Bonferroni-corrected significance level, $\alpha_{Diff} = 0.05/M$ (S1 Methods). The stricter the filtering threshold, the smaller M, thus decreasing the multiple testing burden; however, a stricter filtering threshold might miss true GxS signals. We will thus consider varying filtering thresholds.
When overall, stratified or joint association filtering is applied to the same stratified GWAMA results as the difference test (one-stage approach), the approaches to detect GxS are denoted here as [Overall_αFilter → Diff_αDiff], [Strat_αFilter → Diff_αDiff] or [Joint_αFilter → Diff_αDiff], respectively. When the filtering test is statistically dependent on the difference test, the tests have to be applied to two independent sets of stratified GWAMA results (two-stage approach) to achieve appropriate type I error control. To obtain two independent stratified GWAMA results, the available GWAS studies are to be separated into a first and a second stage and meta-analyzed by stage. Then the filtering is to be conducted in the first stage meta-analysis and the difference test in the second stage meta-analysis. We denote the respective two-stage approaches to detect GxS as [Overall_αFilter] → [Diff_αDiff], [Strat_αFilter] → [Diff_αDiff] or [Joint_αFilter] → [Diff_αDiff].
We explore the three filtering tests followed by the difference test both as one-stage and as two-stage approaches. Together with the difference test without filtering, this yields a total of seven approaches to detect GxS in our systematic evaluation (an overview of approaches is shown in S1 Fig).
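The one-stage approach that filters on overall association and then applies the difference test can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' pipeline: the input format (a list of dictionaries), the function names and the greedy distance-based extraction of lead variants are all hypothetical choices made for this example.

```python
import math
from statistics import NormalDist

def p_two_sided(z):
    """Two-sided p-value of a standard-normal test statistic."""
    return 2.0 * NormalDist().cdf(-abs(z))

def overall_then_diff(variants, alpha_filter=1e-5, window=500_000, alpha=0.05):
    """variants: list of dicts with keys chrom, pos, b1, se1, b2, se2.

    Returns (significant_lead_variants, M).
    """
    scored = []
    for v in variants:
        w1, w2 = 1.0 / v["se1"] ** 2, 1.0 / v["se2"] ** 2
        z_overall = (v["b1"] * w1 + v["b2"] * w2) / math.sqrt(w1 + w2)
        z_diff = (v["b1"] - v["b2"]) / math.sqrt(v["se1"] ** 2 + v["se2"] ** 2)
        scored.append({**v,
                       "p_overall": p_two_sided(z_overall),
                       "p_diff": p_two_sided(z_diff)})

    # Step 1: keep variants passing the filtering threshold, best first.
    passed = sorted((s for s in scored if s["p_overall"] < alpha_filter),
                    key=lambda s: s["p_overall"])

    # Step 2 (ASSUMPTION: greedy clumping by distance): a variant becomes a
    # lead variant if no stronger lead lies within +/- window on its chromosome.
    leads = []
    for s in passed:
        if all(s["chrom"] != l["chrom"] or abs(s["pos"] - l["pos"]) > window
               for l in leads):
            leads.append(s)

    # Step 3: difference test at the Bonferroni-corrected level 0.05 / M.
    m = len(leads)
    hits = [l for l in leads if l["p_diff"] < alpha / m] if m else []
    return hits, m

# Hypothetical input: two nearby variants at one locus, one locus without
# difference, and one variant without any association.
example = [
    {"chrom": 1, "pos": 1_000_000, "b1": 0.05, "se1": 0.005, "b2": 0.0, "se2": 0.005},
    {"chrom": 1, "pos": 1_200_000, "b1": 0.04, "se1": 0.005, "b2": 0.0, "se2": 0.005},
    {"chrom": 2, "pos": 5_000_000, "b1": 0.03, "se1": 0.005, "b2": 0.03, "se2": 0.005},
    {"chrom": 3, "pos": 7_000_000, "b1": 0.001, "se1": 0.005, "b2": 0.001, "se2": 0.005},
]
hits, m = overall_then_diff(example)
print(m, [h["pos"] for h in hits])
```

A two-stage variant of this sketch would compute `p_overall` from the first-stage results and `p_diff` from the second-stage results for the same lead variants.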

Extending to related individuals
When a study includes related individuals, this can be accounted for within each stratified GWAS model and thus within each stratum by extending to mixed models [21]. Relatedness across studies within the same stratum can be handled via generalized meta-analysis [22]. Including related individuals across strata yields correlated stratum-specific (variant-specific) estimates ($b_1$ and $b_2$). This correlation can be estimated by the Pearson correlation coefficient of all $b_1$ and $b_2$ estimates across all genetic variants, denoted by $r$.
The test statistic for the difference test can then be extended to $Z_{Diff} = (b_1 - b_2)/\sqrt{se_1^2 + se_2^2 - 2 \cdot \mathrm{Cov}(b_1, b_2)}$, where $\mathrm{Cov}(b_1, b_2)$ can be estimated by $r \cdot se_1 \cdot se_2$. In the case of related individuals ($r > 0$), this correction yields more extreme test statistics compared to when relatedness is ignored and will thus have better power to detect a difference. When the individuals are unrelated ($r$ close to 0), this formula is the same as the one given in the previous section. There is no disadvantage of using this extended formula in all circumstances. For the filtering with the overall or the joint test, the inclusion of related subjects between strata without accounting for them will yield underestimated standard errors and therefore deflated P-values and larger type I error. Thus, when filtering at a threshold of, e.g., $5 \times 10^{-5}$, the filtering will be less stringent, allowing more variants to pass. The filtering by the stratified test is unaffected by the inclusion of such related subjects.
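The correlation-adjusted difference test can be written in a few lines. The sketch below assumes that $r$ is estimated as the Pearson correlation of the genome-wide $b_1$ and $b_2$ estimates, as described above; the function names and example numbers are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equally long sequences of effect estimates."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def z_diff_correlated(b1, se1, b2, se2, r=0.0):
    """Difference test with Cov(b1, b2) estimated as r * se1 * se2."""
    cov = r * se1 * se2
    return (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2 - 2.0 * cov)

# Hypothetical example: the same estimates with and without the correction.
z_plain = z_diff_correlated(0.03, 0.005, 0.0, 0.005, r=0.0)
z_adj = z_diff_correlated(0.03, 0.005, 0.0, 0.005, r=0.05)
print(z_plain, z_adj)  # the adjusted statistic is slightly more extreme
```

With `r=0.0` the function reduces to the unadjusted difference test, so it can be used in all circumstances.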

Simulation-based evaluation of type I error
Based on simulated data, we estimate the type I error rate of each of the seven stratified GWAMA approaches to detect GxS. We simulate one large data set of 200,000 unrelated individuals under the null hypothesis of no GxS, i.e. no difference between stratum-specific effects ($H_0$: $\beta_1 = \beta_2 = \beta$). We implement several scenarios of varying values for $\beta$, varying MAF (0.05, 0.30) and varying strata designs (balanced design, $f = 1$; unbalanced designs, $f = 1/3$ and $f = 3$) (details in S2 Methods).
For each scenario and approach, we estimate the type I error rate for a 5% significance level as the proportion of nominally significant variants ($P_{Diff} < 0.05$) relative to the number of conducted difference tests (# significant variants/1,000,000 without filtering, # significant variants/M with filtering).
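The type I error estimate described above can be illustrated with a much smaller toy simulation. Note the simplifications, which are assumptions of this sketch and not the study design of S2 Methods: instead of simulating genotypes and phenotypes, stratum-specific Z statistics are drawn directly from a standard normal distribution (null hypothesis with $\beta = 0$, balanced strata), the filter threshold is a lenient 0.05, and no lead-variant extraction is performed.

```python
import math
import random
from statistics import NormalDist

def simulate_type1(n_variants=200_000, alpha_filter=0.05, seed=1):
    """Proportion of P_Diff < 0.05 among tested variants, per filtering rule."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)                  # difference test at 5%
    z_filter = NormalDist().inv_cdf(1 - alpha_filter / 2)
    counts = {"none": [0, 0], "overall": [0, 0], "strat": [0, 0]}  # [tested, hits]
    for _ in range(n_variants):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)   # null, equal standard errors
        z_diff = (z1 - z2) / math.sqrt(2)
        z_overall = (z1 + z2) / math.sqrt(2)
        significant = abs(z_diff) > z_crit
        selected = {
            "none": True,                            # no filtering
            "overall": abs(z_overall) > z_filter,    # one-stage overall filter
            "strat": abs(z1) > z_filter,             # one-stage stratified filter
        }
        for name, keep in selected.items():
            if keep:
                counts[name][0] += 1
                counts[name][1] += significant
    return {name: hits / tested for name, (tested, hits) in counts.items()}

rates = simulate_type1()
for name, rate in rates.items():
    print(f"{name:8s} type I error = {rate:.3f}")
```

In this toy setting the unfiltered test and the overall filter stay close to the nominal 5%, whereas filtering on the stratified statistic in the same data yields a clearly inflated rate.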

Analytical computation of power
For all stratified GWAMA approaches to detect GxS that have valid type I error, we compute power analytically. As a measure of the stratum-specific genetic effect, we introduce $R_i$. For each stratum, $R_i^2$ represents the phenotypic variance explained by the variant; the sign of $R_i$ denotes the direction of the effect, as does that of $b_i$. The effects can be opposite in the two strata (qualitative GxS), zero in one stratum and non-zero in the other (pure GxS), or directionally consistent with different magnitude (quantitative GxS). Analytical power formulae are provided in the S3 Methods for the difference test, $Power_{Diff}$, and for each of the filtering tests, $Power_{Filter}$, for varying $n_1$, $f$, $R_1$, $R_2$, and $\alpha$, where $n_1$ reflects the sample size of the stratum with the larger absolute effect ($|R_2| \le |R_1|$); $\alpha$ is the filtering threshold, $\alpha_{Filter}$, for the filtering tests or the significance level, $\alpha_{Diff}$, for the difference test. When the filtering test and difference test are independent or when they are applied to two different stages of meta-analysis results, the power of the approach can be derived as the product of the power of the filtering test and the power of the difference test, i.e. $Power_{Approach} = Power_{Filter} \cdot Power_{Diff}$ (S3 Methods). Of note, $n_1$ and $n_2$ reflect the stratum-specific effective number of subjects, which for related subjects can be computed by the method proposed for correlated SNPs [23].
We compute the power of each approach for various realistic scenarios varying strata design and varying genetic effect sizes, motivated by the GIANT data with n = 200,000. We also varied the types of GxS (qualitative, pure, quantitative) and the filtering threshold (details in S4 Methods).
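The product rule for power can be sketched numerically. The formulae of S3 Methods are not reproduced in this section, so the sketch below rests on a stated approximation: with unit-variance phenotypes and small $R_i^2$, $se_i^2 \approx 1/(n_i \sigma_G^2)$, so that the expected difference statistic is $(R_1 - R_2)/\sqrt{1/n_1 + 1/n_2}$ and the expected overall statistic is $(n_1 R_1 + n_2 R_2)/\sqrt{n_1 + n_2}$. The function names and the number of lead variants `m` are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist()

def power_two_sided(mu, alpha):
    """Power of a two-sided Z test whose statistic is N(mu, 1) under the alternative."""
    z_crit = -PHI.inv_cdf(alpha / 2.0)
    return PHI.cdf(mu - z_crit) + PHI.cdf(-mu - z_crit)

def mean_z_diff(n1, n2, r1, r2):
    """Approximate expectation of Z_Diff (assumes se_i^2 ~ 1 / (n_i * var_G))."""
    return (r1 - r2) / math.sqrt(1.0 / n1 + 1.0 / n2)

def mean_z_overall(n1, n2, r1, r2):
    """Approximate expectation of Z_Overall under the same assumption."""
    return (n1 * r1 + n2 * r2) / math.sqrt(n1 + n2)

def power_overall_then_diff(n1, n2, r1, r2, alpha_filter, m):
    """Power_Approach = Power_Filter * Power_Diff, difference test at 0.05 / m."""
    p_filter = power_two_sided(mean_z_overall(n1, n2, r1, r2), alpha_filter)
    p_diff = power_two_sided(mean_z_diff(n1, n2, r1, r2), 0.05 / m)
    return p_filter * p_diff

# Pure GxS with R_1^2 = 0.058% and R_2 = 0 in a balanced design.
n1 = n2 = 100_000
r1, r2 = math.sqrt(0.00058), 0.0
p_unfiltered = power_two_sided(mean_z_diff(n1, n2, r1, r2), 5e-8)
# m = 100 lead variants is a hypothetical value chosen for illustration only.
p_filtered = power_overall_then_diff(n1, n2, r1, r2, alpha_filter=1e-5, m=100)
print(f"[Diff_5e-8]: {p_unfiltered:.3f}, overall filter then Diff: {p_filtered:.3f}")
```

Under these assumptions the unfiltered difference test has a power of roughly 47% in this scenario. The filtered power depends on the assumed `m` and should not be read as a reproduction of the reported values.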

Genetic Investigation of Anthropometric Traits (GIANT) consortium
To exemplify the impact of the different stratified GWAMA approaches to detect GxS, we utilize the sex-stratified GWAMA results for WHRadjBMI from GIANT [12]. These data comprise up to 77,000 men and 98,000 women from two independent stages, which arise because the GWAS data were collected in two waves. The GWAMA results contain pooled sex-specific genetic effect estimates and standard errors for approximately 2.8 million variants. We apply each of the stratified GWAMA approaches with valid type I error to detect variants with between-sex difference, using the two stages for the two-stage approaches and meta-analyzing the stage-specific estimates by sex to generate one set of meta-analyzed results for each sex for the one-stage approaches. Given that the majority of the 2.8 million variants can be considered as not associated, we evaluate the empirical type I error by QQ plots and by calculating genomic control inflation factors ($\lambda_{GC}$) [24]. We also derive the number of sexually dimorphic loci for WHRadjBMI based on each of the approaches, to illustrate their practical impact.
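A genomic control inflation factor can be computed from a set of difference-test statistics as sketched below. This uses a common definition, the median of the observed chi-square statistics divided by the median of a chi-square distribution with 1 degree of freedom; whether exactly this variant was used here is an assumption, and the function name is hypothetical.

```python
import random
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def lambda_gc(z_scores):
    """Median of squared Z statistics relative to the null expectation."""
    return median(z * z for z in z_scores) / CHI2_1DF_MEDIAN

# Hypothetical check on simulated null statistics: lambda should be near 1.
rng = random.Random(7)
null_z = [rng.gauss(0, 1) for _ in range(100_000)]
print(lambda_gc(null_z))
```

Values well above 1 indicate that the tested statistics are more extreme than expected under the null hypothesis.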

Overview
We examine type I error and power of stratified GWAMA approaches to detect GxS. We aim at identifying the best approach, which is the approach that maintains type I error and exhibits the best power, for various strata designs (balanced, unbalanced) with and without an a priori hypothesis of a given type of GxS (qualitative, pure, quantitative). We exemplify the impact of each approach on the identification of sexually dimorphic genetic variants for WHRadjBMI based on the sex-stratified GWAMA results from GIANT. A summary of the workflow is shown in Fig 1.

Type I error of the stratified GWAMA approaches to detect GxS

In order to derive the empirical type I error for the seven stratified GWAMA approaches to detect GxS, we simulate genetic association data first under a balanced strata design with $n_1$ = 100,000, $n_2$ = 100,000. Our results show that the difference test without filtering, [Diff_αDiff], when judged at a significance level of 0.05, keeps type I error control (4.97% to 5.02%, Table 1). Among the three one-stage approaches with filtering, we observe a valid type I error for the overall association filtering, [Overall_αFilter → Diff_αDiff], but a severe violation of type I error control for the stratified test filtering and the alternative joint test filtering, [Strat_αFilter → Diff_αDiff] and [Joint_αFilter → Diff_αDiff] (type I error from 8.83% to 49.9%). When applied in two stages, all three filtering approaches, [Overall_αFilter] → [Diff_αDiff], [Strat_αFilter] → [Diff_αDiff], and [Joint_αFilter] → [Diff_αDiff], keep type I error control as expected (type I error from 4.90% to 5.18%, Table 1). These results are supported by QQ plots depicting the observed distribution of the difference test P-values versus the expected, which show that the observed P-values match the expected very well across the full range of values (S2 Fig and S3 Fig). For unbalanced strata designs, we observe similar results (S4 Fig and S5 Fig).
Our results demonstrate that the stratified and the alternative joint tests are statistically not independent of the difference test, while the overall association is independent.
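The observed pattern agrees with a short covariance argument. The derivation below is a sketch under explicit assumptions: $b_1$ and $b_2$ are independent and normally distributed, and $se_i^2$ is treated as the known variance of $b_i$. Under joint normality, zero covariance implies independence.

```latex
\begin{align*}
\mathrm{Cov}\!\left(\frac{b_1}{se_1^2} + \frac{b_2}{se_2^2},\; b_1 - b_2\right)
  &= \frac{\mathrm{Var}(b_1)}{se_1^2} - \frac{\mathrm{Var}(b_2)}{se_2^2}
   = 1 - 1 = 0, \\
\mathrm{Cov}\!\left(b_1,\; b_1 - b_2\right)
  &= \mathrm{Var}(b_1) = se_1^2 \neq 0 .
\end{align*}
```

The first line shows that the numerators of $Z_{Overall}$ and $Z_{Diff}$ are uncorrelated for any ratio of $se_1$ to $se_2$, i.e. also for unbalanced strata. The second line shows that the numerator of the stratified statistic $Z_1$ is correlated with that of $Z_{Diff}$, so that selecting variants on $Z_1$ changes the null distribution of $Z_{Diff}$ in the same data.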
In summary, the difference test without filtering and the overall association filtering prior to the difference testing are the only valid one-stage approaches, and all two-stage approaches are valid, yielding five valid approaches altogether: [Diff_αDiff], [Overall_αFilter → Diff_αDiff], [Overall_αFilter] → [Diff_αDiff], [Strat_αFilter] → [Diff_αDiff] and [Joint_αFilter] → [Diff_αDiff]. Since the one-stage approach [Overall_αFilter → Diff_αDiff] is valid and the two-stage approach [Overall_αFilter] → [Diff_αDiff] is not expected to be more powerful, we will ignore the latter and focus in the following on the remaining four valid approaches.
Power of stratified GWAMA approaches to detect GxS under a balanced strata design

Next, we compare the power of the selected four approaches to detect GxS, [Diff_αDiff], [Overall_αFilter → Diff_αDiff], [Strat_αFilter] → [Diff_αDiff] and [Joint_αFilter] → [Diff_αDiff], for a balanced strata design ($n_1$ = 100,000, $n_2$ = 100,000) for varying stratum-specific effect sizes, $R_1$ and $R_2$, and varying types and sizes of GxS. We assume a genome-wide significance level for the difference test without filtering ($\alpha_{Diff} = 5 \times 10^{-8}$) and, for the approaches with filtering, a filtering threshold of $\alpha_{Filter} = 1 \times 10^{-5}$ with a Bonferroni-corrected significance level at 0.05/M for the subsequent difference test (M being the number of filtered variants).
Our computations show that the power of an approach largely depends on the type of GxS (Fig 2A-2C): when the effects point in opposite directions (qualitative GxS), the difference test, [Diff_5e-8], and the two-stage approaches [Strat_1e-5] → [Diff_αDiff] and [Joint_1e-5] → [Diff_αDiff] perform substantially better than the overall association filtering approach, [Overall_1e-5 → Diff_αDiff]. In contrast, when the effects point in the same direction (quantitative GxS) or are only pronounced in one stratum (pure GxS), the overall association filtering approach shows much better power than the other three approaches. For the example of the medium sized $R_1$ (Fig 2B), the power to detect a pure GxS is 80.8% for [Overall_1e-5 → Diff_αDiff], and only 47.4%, 61.1%, or 55.2%, for [Diff_5e-8], [Strat_1e-5] → [Diff_αDiff] or [Joint_1e-5] → [Diff_αDiff], respectively. When the predominant effect (stratum 1) is large enough (Fig 2C), a quantitative or pure GxS can also be detected by the approaches [Diff_5e-8], [Strat_1e-5] → [Diff_αDiff] or [Joint_1e-5] → [Diff_αDiff], but with less power compared to [Overall_1e-5 → Diff_αDiff] (18.9%, 56.2%, and 51.% compared to 85.3% for a quantitative GxS with $R_1^2 = 0.167\%$, $R_2^2 = 0.042\%$). In all scenarios, [Overall_1e-5 → Diff_αDiff] is the best approach to detect pure GxS (with 0.7%, 80.8%, and >99.9% power for the three effect sizes of $R_1$ depicted in Fig 2A-2C, respectively) and [Diff_5e-8] is the best approach to detect a qualitative GxS with equal effects pointing in opposite directions (43.7%, >99%, and >99% power for the three effect sizes of $R_1$ depicted in Fig 2A-2C, respectively).
Fig 1. [...] application and 4) Recommendation. *The valid two-stage approach [Overall_αFilter] → [Diff_αDiff] was omitted from power computations due to expectedly lower power of the approach compared to the valid one-stage approach [Overall_αFilter → Diff_αDiff] that makes use of the full available sample size for both filtering and difference testing. https://doi.org/10.1371/journal.pone.0181038.g001

Table 1. Simulation-based type I error for the seven stratified GWAMA approaches to detect GxS. Shown is the type I error at a 5% significance level derived from simulated data as the proportion of variants with nominally significant difference test ($P_{Diff} < 0.05$) relative to the number of variants tested for difference (1,000,000 in the difference test without filtering, number of filtered variants in the approaches with filtering). The simulation results are based on a balanced strata design ($n_1$ = 100,000, $n_2$ = 100,000; split in half for two-stage approaches), variants with MAF = 0.05 or 0.30, and phenotypes simulated under the null hypothesis of no GxS, i.e. no difference between stratum-specific effects ($H_0$: $\beta_1 = \beta_2 = \beta$). We present the results for $\beta = 0$ and $\beta \neq 0$. For the second setting, we set $\beta$ as the minimum effect size detectable at 80% power for the given MAF and the given sample size for the difference test ($n$ = 200,000 for one-stage approaches, $\beta$ = 0.029, 0.014 for MAF = 0.05, MAF = 0.30, respectively; $n_{Stage}$ = 100,000 for the two-stage approaches, $\beta$ = 0.041, 0.019 for MAF = 0.05, MAF = 0.30, respectively). Marked in bold are violated type I error rates.

Among the two-stage approaches, [Joint_1e-5] → [Diff_αDiff] has more power compared to [Strat_1e-5] → [Diff_αDiff] for qualitative GxS and similar power for pure and quantitative GxS. However, both two-stage approaches are outperformed in all scenarios by one of the one-stage approaches, [Diff_5e-8] or [Overall_1e-5 → Diff_αDiff] (Fig 2A-2C).
In summary, the difference test without any filtering, [Diff_5e-8], and the approach filtering for overall association followed by the difference test, [Overall_1e-5 → Diff_αDiff], are the two best approaches to detect qualitative or pure/quantitative GxS, respectively.

Influence of varying filtering thresholds
We next investigate how the filtering threshold impacts the power of the approaches to identify GxS. We thus compute the power analytically for the various approaches under the same scenarios as before (balanced strata design, $n_1$ = 100,000, $n_2$ = 100,000), but now vary the filtering threshold, $\alpha_{Filter}$, from 0.05 to $5 \times 10^{-8}$, again with a Bonferroni-corrected α-level for the consecutive difference test, 0.05/M, with M being the number of filtered variants.
We observe the following: (1) For qualitative GxS (Fig 3A), the difference test without filtering, [Diff_5e-8], shows better power than any filtering approach, irrespective of $\alpha_{Filter}$. (2) For pure GxS (Fig 3B), the overall association filtering, [Overall_αFilter → Diff_αDiff], has the best power, irrespective of $\alpha_{Filter}$; the power of this approach is the highest for $\alpha_{Filter}$ of 0.05 to $1 \times 10^{-4}$, but then decreases with decreasing $\alpha_{Filter}$ down to a level that coincides with the power of the difference test without filtering (power = 46.7% and 47.4%, for approaches [Overall_5e-8 → Diff_αDiff] and [Diff_5e-8], respectively). The two-stage approaches [Joint_αFilter] → [Diff_αDiff] and [Strat_αFilter] → [Diff_αDiff] show a maximum power at $\alpha_{Filter} = 1 \times 10^{-5}$. (3) For quantitative GxS (Fig 3C), the overall association filtering, [Overall_αFilter → Diff_αDiff], again has the highest power of all approaches, irrespective of $\alpha_{Filter}$. The power of all three filtering approaches increases with decreasing filtering threshold.
Altogether, while [Diff_5e-8] outperforms all filtering approaches for qualitative GxS, [Overall_αFilter → Diff_αDiff] is most powerful for pure/quantitative GxS. This approach can benefit from less stringent filtering (i.e., larger $\alpha_{Filter}$, larger M) to detect pure GxS, but from more stringent filtering (i.e., smaller $\alpha_{Filter}$, smaller M) to detect quantitative GxS, requiring a compromise to serve both.

Influence of unbalanced strata designs
We next investigate how an unbalanced strata design impacts the power of the approaches to identify GxS. We compute the power analytically for the same scenarios and approaches as previously, but now we model unbalanced strata designs by varying the proportion of the stratum sample sizes (with a total $n$ = 200,000, as before). Denoting $f = n_2/n_1$, with stratum 1 defining the stratum with the larger effect, $f = 0.05$ indicates that stratum 1 (with the larger effect) is 20 times larger than stratum 2, whereas $f = 20$ indicates that stratum 1 (with the larger effect) is very small, with only a 20th of the sample size of stratum 2.
As expected from theory, we find that, for all three types of GxS (Fig 4A-4C), the power of [Diff_5e-8] is symmetric around and maximal at $f = 1$. This indicates that the difference test without filtering is most efficient if the two strata are balanced in size, and that its power does not depend on whether the larger effect is in the larger or in the smaller stratum.

Fig 2. [...] a fixed genetic effect in stratum 1, $R_1^2$, that is (A) small ($R_1^2 = 0.014\%$), (B) medium ($R_1^2 = 0.058\%$), or (C) large ($R_1^2 = 0.167\%$). The effect sizes for $R_1^2$ are chosen as those observed for WHRadjBMI near STAB1, PPARG or LYPLAL1, respectively. The modeled GxS are visualized on the left side (red bar: $R_1^2$, blue arrows: varying $R_2^2$). For the difference test without filtering, we assume a significance level of $5 \times 10^{-8}$; for approaches with filtering, the filtering threshold is $1 \times 10^{-5}$ and the significance level applied for the consecutive difference test is $\alpha_{Diff} = 0.05/M$, with M being the number of filtered lead variants (see Methods). https://doi.org/10.1371/journal.pone.0181038.g002

For a qualitative GxS (Fig 4A), [Diff_5e-8] shows the best power for moderately unbalanced strata designs ($0.2 < f < 5$), whereas [Overall_1e-5 → Diff_αDiff] shows best power for more extremely unbalanced strata designs ($f < 0.2$ or $f > 5$). Here, power curves for all approaches are symmetric around $f = 1$, because absolute genetic effects are the same across strata. However, the symmetry of the filtered approaches disappears when varying $R_2$ (S6-S8 Figs).
For a pure GxS (Fig 4B) with the effect in the larger stratum ($f < 1$), the filtering approaches [Overall_1e-5 → Diff_αDiff], [Strat_1e-5] → [Diff_αDiff], and [Joint_1e-5] → [Diff_αDiff] have larger power than the difference test alone, with a maximum power at $f \approx 0.66$ (i.e. 'effect' stratum 1 is 1.5 times larger than the 'no effect' stratum 2). The best approach here is the overall filtering approach, [Overall_1e-5 → Diff_αDiff]. When the effect is in the smaller stratum ($f > 1$), the difference test without filtering, [Diff_5e-8], can provide a power gain over the filtering approaches: for the presented scenario, the power of [Diff_5e-8] surpasses the power of the filtering approaches at $f \approx 1.5$ ('no effect' stratum 2 is 1.5 times larger than the 'effect' stratum 1). Generally, when using the filtering approaches, it is easier to identify pure GxS with the effect in the larger stratum ($f < 1$) than with the effect in the smaller stratum ($f > 1$), while the difference test alone does not depend on whether the effect is in the smaller or the larger stratum.
For quantitative GxS (Fig 4C) and for the presented scenario, the power of all approaches is symmetric around and maximal at $f = 1$. Irrespective of $f$, [Overall_1e-5 → Diff_αDiff] displays the best power to identify quantitative GxS compared to all other considered approaches (Fig 4C, S6-S8 Figs).
Altogether, [Overall_1e-5 → Diff_αDiff] is the most powerful approach to detect pure/quantitative GxS, for all strata designs. It also has the best power to detect effects pointing in opposite directions (qualitative difference) when the strata are extremely unbalanced. [Diff_5e-8] is the most powerful approach to detect qualitative GxS when the sample sizes of the strata do not differ too extremely.
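The symmetry of the unfiltered difference test around $f = 1$ can be checked numerically. The sketch below assumes, as an approximation that is not taken from S3 Methods, that the expected difference statistic is $(R_1 - R_2)/\sqrt{1/n_1 + 1/n_2}$; with a fixed total $n$, this depends on $f$ only through $f/(1+f)^2$, which is unchanged when $f$ is replaced by $1/f$. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def power_diff_by_f(f, r1, r2, n_total=200_000, alpha=5e-8):
    """Approximate power of the unfiltered difference test for f = n2 / n1."""
    n1 = n_total / (1.0 + f)
    n2 = n_total - n1
    mu = (r1 - r2) / math.sqrt(1.0 / n1 + 1.0 / n2)
    z_crit = -NormalDist().inv_cdf(alpha / 2.0)
    return NormalDist().cdf(mu - z_crit) + NormalDist().cdf(-mu - z_crit)

# Pure GxS with R_1^2 = 0.058% in stratum 1 and no effect in stratum 2.
r1, r2 = math.sqrt(0.00058), 0.0
for f in (0.05, 0.2, 1.0, 5.0, 20.0):
    print(f"f = {f:5.2f}  power = {power_diff_by_f(f, r1, r2):.3f}")
```

The printed values for $f$ and $1/f$ coincide, and the largest value is obtained at $f = 1$.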

Application to real sex-stratified GWAMA results for waist-hip ratio
We exemplify the impact of our approaches on the number of identifiable sexually dimorphic loci for WHRadjBMI, based on our sex-stratified GWAMA results from the GIANT consortium (up to 77,000 men and 98,000 women) [12].
First, in order to derive empirical type I error based on the real GIANT data, we apply all seven approaches to the real sex-stratified meta-analyzed GWAMA results and evaluate genomic control inflation factors and QQ plots. For [Diff_5e-8] and [Overall_αFilter → Diff_αDiff], we observe no inflation of difference P-values ($\lambda_{GC}$ = 1.02 and $\lambda_{GC}$ = 1.06, respectively, S9 Fig). However, we observe inflated difference P-values for the one-stage approaches [Strat_αFilter → Diff_αDiff] and [Joint_αFilter → Diff_αDiff] ($\lambda_{GC}$ = 6.67 and $\lambda_{GC}$ = 6.39, respectively, S9 Fig). This is in line with the statistical theory and our results from simulated data, which support the notion that the stratified and the alternative joint tests depend on the difference test, while the overall test appears to be independent.
Fig 3. [...] (A) qualitative GxS with small stratum-specific effects ($R_1^2 = 0.014\%$; $R_2^2 = 0.014\%$ in the opposite direction), (B) pure GxS with medium sized stratum 1 effect ($R_1^2 = 0.058\%$; $R_2^2 = 0\%$), and (C) quantitative GxS with large stratum 1 and smaller stratum 2 effect ($R_1^2 = 0.167\%$; $R_2^2 = 0.042\%$ in the same direction). The effect sizes for stratum 1 are chosen as those observed for the WHRadjBMI loci around STAB1, PPARG, or LYPLAL1. The power of [Diff_5e-8] is constant due to the lack of any filtering. https://doi.org/10.1371/journal.pone.0181038.g003

Next, in order to evaluate the detectability of sexually dimorphic loci for WHRadjBMI, we apply the four approaches [Diff_5e-8], [Overall_1e-5 → Diff_αDiff], [Strat_1e-5] → [Diff_αDiff], and [Joint_1e-5] → [Diff_αDiff] to the real sex-stratified meta-analyzed GWAMA results. When accounting for the subtle relatedness ($r$ = 0.05 for one-stage, $r$ = 0.03 for stage 2 of the two-stage approaches), we identify a total of 10 independent loci with significant sex-difference across all considered approaches (Table 2, S2 Table and S3 Table). The 10 loci include one qualitative, six pure and three quantitative sex-differences. Consistent with our power computations, the one-stage approaches identify more loci than the two-stage approaches; the qualitative difference is only detected by [Diff_5e-8]; and [Overall_1e-5 → Diff_αDiff] identifies all pure and quantitative differences (nine loci). The overall association test filtering at an $\alpha_{Filter}$ of $5 \times 10^{-8}$, [Overall_5e-8 → Diff_αDiff], identifies 8 of these and one additional; this approach is, in fact, the approach used by Locke and colleagues for BMI [25] and by Shungin and colleagues for WHRadjBMI [5], when only the genome-wide significant loci with main genetic effect are tested for interaction. When applying the two approaches [Diff_5e-8] and [Overall_1e-5 → Diff_αDiff] jointly, 10 sexually dimorphic loci for WHRadjBMI are detected in the GIANT data. When ignoring the subtle relatedness, we detect only 9 loci with significant sex-difference across all considered approaches (the missed locus has a sex-difference P-value of $9.1 \times 10^{-8}$ when relatedness is ignored, instead of $4.1 \times 10^{-8}$ when it is accounted for).
In summary, our application of the approaches to real sex-stratified GWAMA data for WHRadjBMI corroborates our simulation-based and analytical evaluations of type I error rates and power.

Discussion

Recommendations
Based on our evaluations of type I error and power, we found that two of the approaches to search for genetic effects with GxS based on stratified GWAMA data keep type I error control and are the most powerful: the genome-wide difference test and the overall association filtering prior to the difference test. Which of these two performed better depended on the type of GxS (qualitative, pure or quantitative) and on the strata design (balanced, unbalanced). We thus provide a recommendation of the best approach depending on the type of strata-difference and strata design (Fig 5). Generally, for any stratified GWAMA project that aims at detecting genetic variants with GxS without any hypothesis on the specific type of GxS and irrespective of the strata design, we recommend performing two approaches in parallel: (i) a genome-wide screen for difference testing at an α-level of $5 \times 10^{-8}$, and (ii) an approach that filters for overall association at $P_{Overall} < 10^{-5}$ and then tests this subset of genetic variants for difference at a Bonferroni-corrected α-level.
This recommendation is based on several findings of our comparisons: (1) The difference test without filtering, [Diff_5e-8], has the best power to detect qualitative GxS in most scenarios, and the overall association test filtering prior to the difference test, [Overall_αFilter → Diff_αDiff], has the best power for pure/quantitative GxS with few exceptions. (2) The approaches filtering for stratified or alternative joint association prior to difference testing in the same set of GWAMA results (one-stage approaches [Strat_αFilter → Diff_αDiff] and [Joint_αFilter → Diff_αDiff]) violate type I error control. Since stratified or (alternative) joint association tests are commonly applied for filtering in the GWAMA literature [8,12,26,27], it is important to note that testing the selected variants for GxS necessitates an independent set of GWAMA results (two-stage approaches [Strat_αFilter] → [Diff_αDiff] and [Joint_αFilter] → [Diff_αDiff]). (3) The two-stage approaches were outperformed by at least one one-stage approach. The reason is that splitting the data into two artificial stages uses the total sample size neither for the filtering nor for the difference test, which is in line with previous work [28]. (4) We found that the filtering threshold of the overall association test had a substantial impact on power. Since less stringent filtering (larger α_Filter, more variants selected) yielded larger power to detect pure GxS, while more stringent filtering (smaller α_Filter, fewer variants selected) yielded larger power for quantitative GxS, a compromise is required. We found a filtering threshold of P_Overall < 10⁻⁵ to work well in most scenarios.
Table 2. Application to real sex-stratified GWAMA data for WHRadjBMI from the GIANT GENDER project. Shown are the 10 identified loci with GxSex by each approach ('x' indicating that the locus was identified by the respective approach) at a Bonferroni-corrected significance level, based on the GIANT data for WHRadjBMI (up to 77,000 men and 98,000 women) [12]. Detailed association results are provided in S2 Table for the one-stage approaches and in S3 Table for the two-stage approaches. Column groups: one-stage approaches; two-stage approaches. https://doi.org/10.1371/journal.pone.0181038.t002
Several aspects are interesting to note: No matter what filtering approach is used, it is easier to identify GxS with the larger effect in the larger stratum than GxS with the larger effect in the smaller stratum. For example, a filtering approach can be expected to detect more variants with the predominant effect in non-smokers rather than in smokers (assuming more non-smokers in the data and the same number of genetic effects specific to non-smokers as to smokers). The genome-wide difference screen has no preference towards an effect in the larger or in the smaller stratum, but it loses power with increasing imbalance of strata.
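The loss of power of the difference test under strata imbalance can be illustrated with a short derivation. This is a sketch under simplifying assumptions that are ours rather than a result taken from the analyses: equal phenotypic variance and allele frequency in both strata, so that se_i² ≈ c/n_i for a common constant c, and a fixed total sample size N = n_1 + n_2 with f = n_2/n_1.

```latex
\begin{align*}
n_1 &= \frac{N}{1+f}, \qquad n_2 = \frac{fN}{1+f},\\
se_{\mathrm{Diff}}^2 &= se_1^2 + se_2^2
  \approx c\left(\frac{1}{n_1} + \frac{1}{n_2}\right)
  = \frac{c\,(1+f)^2}{f\,N},\\
\frac{(1+f)^2}{f} &\ge 4 \quad \text{with equality if and only if } f = 1 .
\end{align*}
```

Under these assumptions the variance of the estimated difference is smallest for balanced strata (f = 1) and grows as f moves away from 1; for example, f = 1/3 gives (1+f)²/f = 16/3 ≈ 5.3 instead of 4. The expression is unchanged when f is replaced by 1/f, which is consistent with the difference test having no preference for the larger or the smaller stratum.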
It should also be noted that the commonly used GWAMA approach to screen for genome-wide significant overall association (P_Overall < 5 × 10⁻⁸) and to subsequently test identified lead variants for difference between strata performs well to find pure and quantitative GxS. The question remains whether many genetic effects with opposite effect directions in the two strata (qualitative GxS) exist that are missed by such an approach, or whether such effects are biologically plausible. Without the use of approaches with sufficient power to detect qualitative GxS, this question will remain unanswered.

Strengths and limitations
We focused here on approaches to identify GxS that are directly applicable to stratified GWAMA results. We assumed a dichotomous stratification variable and a continuous outcome that follows identical normal distributions in the two strata. To our knowledge, this is the first study to evaluate such approaches systematically and to provide recommendations on how to design a GWAMA that aims to identify genetic variants with between-strata differences.
Although our work is comprehensive and covers numerous approaches and scenarios, there are still scenarios that we have not considered in order to stay focused. These include binary outcomes, different phenotypic variances between strata, between-study differences of the genetic effect, and a measurement error in the phenotype that differs between studies and strata. Still, our results can readily be translated into stratified GWAMAs of binary outcomes using logistic regression, and different phenotypic variances can be implemented by extending the power formulae. Between-study heterogeneity of genetic effects and measurement error issues will be important extensions, but are widely ignored in the current GWAMA approaches to locus identification.
The stratified GWAMA approaches can be translated into interaction GWAMA approaches where the interaction term is fitted per study and meta-analyzed (S4 Table). In particular, the difference test from the stratified approach is equivalent to a test of the interaction term when there is no interaction between covariates and the strata S and when the trait variance is the same in the two strata. Our results and recommendations based on stratified GWAMAs can be transferred to interaction modeling and suggest a parallel approach: testing the interaction genome-wide, and filtering for overall association (main effect) prior to testing the interaction effect. The analogy suggests that joint test filtering (testing jointly the main and the interaction effect) followed by testing the selected variants for interaction in the same set of GWAMA results violates type I error control. The stratified GWAMA framework has some important advantages and disadvantages compared to an interaction GWAMA framework (see S2 Note for a detailed discussion of pros and cons of the stratified and the interaction GWAMA frameworks). The focus on stratified GWAMA results here was motivated by the much easier logistics of computing stratum-specific GWAS and meta-analyzing stratum-specific genetic effects compared to interaction modeling in GWAS and the respective meta-analysis.
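The stated equivalence between the difference test and the interaction test can be made explicit with a short worked equation. The model below is an assumed minimal interaction model without further covariates and with equal residual variance in both strata; only the coding S = 0 for stratum 1 and S = 1 for stratum 2 and the symbols b_G, b_GxS, b_1 and b_2 follow the notation used in this article, while b_S and β_0 are introduced here for illustration.

```latex
\begin{align*}
Y &= \beta_0 + b_G\,G + b_S\,S + b_{GxS}\,(G \cdot S) + \varepsilon,\\
S = 0 \text{ (stratum 1)}:\quad Y &= \beta_0 + b_G\,G + \varepsilon
  &&\Rightarrow\; b_1 = b_G,\\
S = 1 \text{ (stratum 2)}:\quad Y &= (\beta_0 + b_S) + (b_G + b_{GxS})\,G + \varepsilon
  &&\Rightarrow\; b_2 = b_G + b_{GxS},\\
b_{GxS} &= b_2 - b_1, \qquad H_0\colon b_{GxS} = 0 \iff H_0\colon b_1 = b_2 .
\end{align*}
```

Under these assumptions, testing the interaction term for zero is the same hypothesis as testing the two stratum-specific effects for equality.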
We did not consider GxE interaction methods that are based on linearly increasing phenotypic variance [29], meta-regression of summary statistics [30], or on case-only [31], data adaptive [32,33], empirical Bayes [34] or other 2-step methods that involve filtering on gene-exposure or gene-strata association [35-38]. The latter rely on the assumption of independence of implemented steps [39], and any methods that involve case-only designs or empirical Bayes methods are limited to binary disease outcome. The reason for not extending our considerations to the noted methods was that they either involve data sets that are not directly available from the stratified GWAMA or that they were only described for single genome-wide interaction studies rather than meta-analysis settings or for binary outcomes. Their implementation in, and transferability to, a stratified GWAMA setting with continuous outcome may be limited and is as yet unclear.
Finally, by reducing the filtered variants to the independent lead variants (e.g. the variant with the smallest filtering test P-value within ±500 kb), we might miss the correlated variant close by that is the truly interesting variant with between-stratum difference (i.e. the variant with the smallest difference test P-value). We could extend to approaches that select all variants meeting the filtering threshold without restricting to lead variants in this step, subsequently test all selected variants for difference, and use alternative approaches for multiple testing that can handle correlated variants, such as a Bonferroni correction based on the effective number of independent variants [23,40] or a false-discovery rate approach [41].
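The two multiple-testing alternatives mentioned above can be sketched as follows. This is an illustration only: the false-discovery rate procedure shown is the Benjamini-Hochberg step-up procedure, which is one common choice and not necessarily the method of reference [41], and the effective number of independent variants `m_eff` is assumed to be supplied by the user rather than estimated here. The standard Benjamini-Hochberg procedure controls the false-discovery rate under independence or positive dependence of the tests.

```python
def bonferroni(p_values, alpha=0.05, m_eff=None):
    """Indices of P-values passing a Bonferroni threshold.
    m_eff: effective number of independent tests (assumed given); defaults to len(p_values)."""
    if not p_values:
        return []
    m = m_eff if m_eff is not None else len(p_values)
    return [i for i, p in enumerate(p_values) if p < alpha / m]

def benjamini_hochberg(p_values, q=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # keep the largest rank whose P-value lies below its step-up threshold
        if p_values[i] <= q * rank / m:
            k_max = rank
    return sorted(order[:k_max])

# Hypothetical difference P-values of ten selected variants
p_diff = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bonf_hits = bonferroni(p_diff)        # threshold 0.05 / 10
fdr_hits = benjamini_hochberg(p_diff) # step-up thresholds 0.005, 0.010, ...
```

With correlated variants, passing a smaller `m_eff` than the raw number of selected variants relaxes the Bonferroni threshold, which is the idea behind the effective-number correction.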

Summary and conclusion
In summary, we recommend the genome-wide difference test without filtering, [Diff_5e-8], to search for genetic effects that point in opposite directions in the two strata, and the overall association test filtering prior to a difference test, [Overall_αFilter → Diff_αDiff], to detect genetic effects that are present only, or more pronounced, in one stratum. When there is no hypothesis on the type of GxS to be identified, we recommend applying these two approaches in parallel. For the overall association test filtering, a filtering threshold of 1 × 10⁻⁵ appears to be reasonable, while a filtering threshold of 5 × 10⁻⁸ is equivalent to the common GWAMA approach to identify variants with genome-wide significance and test only these variants for GxS.
Our results provide guidelines for current and future GWAMAs that aim at the identification of genetic effects with GxS. By these clear recommendations, researchers will be more motivated to search for GxS and by enhancing our searches with the most powerful approaches we will be able to unravel GxS for complex disease. Ultimately, our knowledge of genetic effects that show differential effects between strata will help our understanding of how a variant exerts its effect on the disease outcome under study.

Supporting information
S1 Table. Statistical tests for stratified GWAMA approaches to identify GxS. Stated are the tests that can be applied based on the meta-analyzed stratum-specific genetic effect estimates, b_i, and standard errors, se_i (i = 1,2), the respective null hypotheses, test statistics, nomenclature for P-values and the usage.
S2 Table. Ten loci with significant sex-difference for WHRadjBMI identified by the one-stage approaches. The table shows the lead variants identified by the two one-stage approaches [Diff_5e-8] and [Overall_1e-5 → Diff_αDiff] that were applied to the sex-stratified GWAMA results for WHRadjBMI (up to 77,000 men and 98,000 women) from the GIANT consortium [12]. Significant sex-difference P-values (corrected for correlation between strata using r = 0.05 as estimated from the GIANT data on 77,000 men and 98,000 women) are marked in bold. Sex-difference P-values (uncorrected for correlation) were added to the table for comparison. (DOCX)
S3 Table. Four loci with significant sex-difference for WHRadjBMI identified by the two-stage approaches. The table shows the lead variants identified by the two-stage approaches [Strat_1e-5] → [Diff_αDiff] and [Joint_1e-5] → [Diff_αDiff] that were applied to the sex-stratified GWAMA results for WHRadjBMI (up to 35,000 and 42,000 men, and up to 43,000 and 55,000 women, in the two stages respectively; from the GIANT consortium [12]). Significant stage 2 sex-difference P-values (corrected for correlation between strata using r = 0.03 as estimated from the stage 2 GIANT data on 42,000 men and 55,000 women) are marked in bold. Stage 2 sex-difference P-values (uncorrected for correlation) were added to the table for comparison. (DOCX)
S4 Table. Statistical tests for interaction GWAMA approaches to identify GxS. Instead of applying two stratified linear regression models per study and meta-analyzing stratified genetic estimates, an interaction GWAMA framework involves one interaction model per study, where S codes strata membership (i.e. S = 0 for stratum 1, S = 1 for stratum 2). Meta-analyzed genetic main effects (b_G) and gene-strata interaction effects (b_GxS) with corresponding standard errors (se_G and se_GxS) are obtained from study-specific genetic main effects (b_G,ik) or gene-strata interaction effects (b_GxS,ik), respectively. Stated are the tests that can be applied based on the interaction GWAMA framework, the respective null hypotheses, test statistics, nomenclature for P-values and the usage.
[...] can either be conducted as one-stage or as two-stage approaches. For the one-stage approaches, the filtering and the difference test are applied to one large stratified GWAMA result of total sample size N (blue). For the two-stage approaches, the filtering and the difference test are applied consecutively to two independent stratified GWAMA results of size N/2 (purple and orange).
[...] Shown are simulated difference P-values for the two-stage approaches [Overall_αFilter] → [Diff_αDiff], [Strat_αFilter] → [Diff_αDiff] and [Joint_αFilter] → [Diff_αDiff]. We here assume two unbalanced sized strata (33, [...]
[...] stratum sample sizes, f = n_2/n_1, with stratum 1 being the one with the larger effect). Effect size in stratum 1 is fixed to R²_1 = 0.014%, as observed for the small WHRadjBMI effect at STAB1. The effect in stratum 2 is fixed to A. 0.014%, into opposite direction (qualitative GxS; same as main Fig 3A), B. R²_2 = 0.007%, into opposite direction (qualitative GxS), C. R²_2 = 0.003%, into opposite direction (qualitative GxS), D. R²_2 = 0% (pure GxS), E. R²_2 = 0.003%, into consistent direction (quantitative GxS), and F. R²_2 = 0.007%, into consistent direction (quantitative GxS). (TIF)
S7 Fig. Power of the stratified GWAMA approaches to identify GxS assuming unbalanced strata design and a medium effect in stratum 1. Shown is the power to detect GxS for the same approaches and designs as in Fig 3 (unbalanced strata designs with varying proportion of stratum sample sizes, f = n_2/n_1, with stratum 1 being the one with the larger effect). Effect size in stratum 1 is fixed to R²_1 = 0.058%, as observed for the medium WHRadjBMI effect at PPARG. The effect in stratum 2 is fixed to A. 0.058%, into opposite direction (qualitative GxS), B. R²_2 = 0.029%, into opposite direction (qualitative GxS), C. R²_2 = 0.014%, into opposite direction (qualitative GxS), D. R²_2 = 0% (pure GxS; same as main Fig 3B), E. R²_2 = 0.014%, into consistent direction (quantitative GxS), and F. [...]
[...] stratum sample sizes, f = n_2/n_1, with stratum 1 being the one with the larger effect). Effect size in stratum 1 is fixed to R²_1 = 0.167%, as observed for the large WHRadjBMI effect at LYPLAL1. The effect in stratum 2 is fixed to A. 0.167%, into opposite direction (qualitative GxS), B. R²_2 = 0.084%, into opposite direction (qualitative GxS), C. R²_2 = 0.042%, into opposite direction (qualitative GxS), D. R²_2 = 0% (pure GxS), E. R²_2 = 0.042%, into consistent direction (quantitative GxS; same as main Fig 3C), and F. R²_2 = 0.084%, into consistent direction (quantitative GxS). (TIF)
S9 Fig. QQ plots showing the results of the application of one-stage approaches to sex-stratified GWAMA from the GIANT consortium. The QQ plot contrasts observed and expected difference P-values for the considered one-stage approaches [Diff_αDiff], [Overall_0.05 → Diff_αDiff], [Strat_0.05 → Diff_αDiff] and [Joint_0.05 → Diff_αDiff] obtained from an application of approaches to real sex-stratified GWAMA data for WHRadjBMI from the GIANT consortium. (TIF)