Gene-Gene and Gene-Environment Interactions in Meta-Analysis of Genetic Association Studies

Extensive genetic studies have identified a large number of causal genetic variations in many human phenotypes; however, these could not completely explain heritability in complex diseases. Some researchers have proposed that the “missing heritability” may be attributable to gene–gene and gene–environment interactions. Because there are billions of potential interaction combinations, the statistical power of a single study is often ineffective in detecting these interactions. Meta-analysis is a common method of increasing detection power; however, accessing individual data could be difficult. This study presents a simple method that employs aggregated summary values from a “case” group to detect these specific interactions that based on rare disease and independence assumptions. However, these assumptions, particularly the rare disease assumption, may be violated in real situations; therefore, this study further investigated the robustness of our proposed method when it violates the assumptions. In conclusion, we observed that the rare disease assumption is relatively nonessential, whereas the independence assumption is an essential component. Because single nucleotide polymorphisms (SNPs) are often unrelated to environmental factors and SNPs on other chromosomes, researchers should use this method to investigate gene–gene and gene–environment interactions when they are unable to obtain detailed individual patient data.


Introduction
Extensive genetic studies have identified a large number of causal genetic variations in many phenotypes; however, these could not completely explain the phenomenon of heritability in complex phenotypes [1]. Previous studies have suggested that the "missing heritability" may have been masked by gene-gene and gene-environment interactions, and therefore, their detection is very important. However, researchers have to carefully assess significance levels to reduce false discovery rates when determining the effect of interactions among multiple variables [2]; therefore, performing a single study is often ineffective under the correction of multiple comparisons [3,4].
Meta-analysis is a commonly used method for increasing detection power, and a subgroup analysis also could be used to detect the effects of interactions. Meta-analysis using individual patient data is considered the gold standard for investigating the moderator effect of participant-type variables [5,6]; however, access to detailed individual data could often be difficult. Meta-analyses using aggregate data have been more frequently employed because it maximizes the number of studies, patients, and events [7,8]. However, these methods are relatively difficult to apply in the meta-analysis of genetic association studies. It was difficult for researchers to obtain the population aggregated summary values of case-control studies; they often could only access the aggregated summary values in "cases" and "controls." Unfortunately, most genetic association studies are designed as case-control investigations. Therefore, a method for detecting gene-gene and gene-environment interactions in a meta-analysis of a case-control study was imperative.
We hereby propose a simple method for detecting the effects of interactions in a meta-analysis of a case-control study. We have applied this method to an earlier study [9]. However, this was based on two assumptions (rare disease and independence), which may be violated in real data. The rare disease assumption is more frequently violated, and some researchers have debated on the extent of prevalence that should be established to classify the disease as "rare." There is currently no evidence that could confirm the robustness of this method when the assumptions were violated. Therefore, this study aimed to test the 95% confidence interval coverage rate, power, and robustness of this method, and compare individual patient data analysis using simulation methods.

Derivation of Formulas
Most genetic association studies utilize a case-control design, in which the association between the aggregated summary values of the factor and odd ratios was based on multiple factors. To better understand this principle, we hereby describe an example. In this example, the moderator can be not only factors of environment but also gene. That is a moderator implies any kind of covariates. This did not impact the results of derivation.
When the independent variable is an allele encoded with values of "minor allele" or "major allele" and the moderator is gender-encoded with values of "male" or "female," the variables E 1 , E 2 , E 3 , and E 4 in the population are the minor allele frequencies among the case women, case men, control women, and control men, respectively; these were based on seven population parameters, and their relationships are presented in the S1 Text. The odds ratio of exposure on disease outcome in women and men are as follows: Odds ratio ðORÞ in women ðOR women Þ : Based on these definitions, when a researcher conducts a case-control study, the expectation of a simple combined OR is affected by the proportion of males in the case group (k 1 ) and control group (k 2 ): Expectation of simple combined OR ðOR combine Þ : The present study established the following two setting assumptions: (1) rare disease and (2) independence (there was no association between the factor of interest and the major independent variable), and E 3 and E 4 were similar to the proportion of individuals with exposure in the whole population; p 5 was denoted the minor allele frequency in populations (S1 Text). Therefore, the OR combine could be simplified as follows (E 3 = E 4 = p 5, please refer the S2 Text): When moderator effects are present (OR women 6 ¼ OR men ), the proportion of males in the case group (k 1 ) is the only factor that could affect the OR combine . Researchers often perform a meta-regression to describe the association between the proportion of males and OR combine .
A typical single moderator equation of meta-regression (fixed-effect model) is shown in Eq 2.1-5 [The y i is logarithmic empirical combined OR from each study [log(OR combine )], and we denote η i as the residuals representing the unexplained errors of the reported y i ] as follows: Where m i is an unknown vector witch let Eq 2.1-5 holds. An appropriate m i is calculated using Eq 2.1-6. When Eq 2.1-6 is used to access m i , b 0 is considered to be the log(OR women ), and b 1 is considered the logarithmic moderator effect of gender [log(OR men ) − log(OR women )] (The details of the derivation was shown in S3 Text).
Where k 1i is the summary value of case group in each study; E 1 , E 2 are the minor allele frequencies of the respective case women and case men in each study. However, it was impossible to assess m i because E 1 and E 2 were population parameters and most paper didn't provide them. Fortunately, m i is equal to k 1i when null hypothesis (null moderator effect) is satisfied (the details of theoretical proof was shown in the S4 Text). Therefore, we could use k 1i to replace m i and create a new equation of meta-regression. The new equation of meta-regression is as follows: Where, the y i , k 1i , η i are logarithmic empirical combined OR [log(OR combine )], the proportion of moderator in the "case" group, residuals representing the unexplained errors of the reported y i from each study, respectively. Following above setting, the b 0 and b 1 are log(OR people without moderator ), moderator effect, respectively. In this method, the coefficient of b 1 can be represented the interaction between focus SNP and the moderator, such as gene-gene and geneenvironment interactions. This could be employed to detect gene-gene interactions when k 1i is the minor allele frequency of another SNP and detect gene-environment interactions when k 1i is the proportion of environment exposure in the "case group." Following Eq 2.1-7, the summary value of the case group (k 1i ) could be employed to build the meta-regression model, and the b 0 , b 1 are log(OR people without moderator ), moderator effect, respectively. The detailed calculated method of above coefficients and their variance were shown in S5 Text. In addition, S6 Text could help readers to understand the accuracy of Eq 2.1-7 when we violate the assumptions.
This could be employed to detect gene-gene interactions when k 1i is the minor allele frequency of another SNP and detect gene-environment interactions when k 1i is the proportion of environment exposure in the "case group." Individual patient data regression analysis is the gold standard in analyzing pooled data [6]. However, accessing the detailed trial results could be extremely difficult [7,8].
It is worth noting that x 2 can the genetic factor or environmental factor. The first step in generating simulation data is to set the parameters of the population. We assume that the moderator effect of the specific moderator is a fixed effect, and the association between SNP, the status of moderators, and the disease outcome in each study population is equal to following equation: In this equation, β 0 is the logit-transformation prevalence of the outcome disease in people with homozygous major and without the moderators in study population. β 1 is the log-transformation odds ratio of allele effect in people without moderators, β 2 is the log-transformation OR of moderators on disease in people with homozygous major, and β 3 is the logtransformation moderator effect. Following this model, we could set β 0 , β 1 , β 2 , and β 3 to calculate P 1 , P 2 , P 3 , P 4 , P 5 , and P 6 . In our simulation, we set the mean of the minor allele frequency with 50% ( p), and we denote F st as the frequency difference between different studies. The minor allele frequency (π i ) in each study will be randomly generated from a beta distribution , according to the Blading-Nichols model [10]. Under the Hardy-Weinberg equilibrium assumption, the frequency of homozygous major [p(x 1 = 0) i , q 0i ], heterozygous [p(x 1 = 1) i , q 1i ], and homozygous minor [p(x 1 = 2) i , q 2i ] in each study were (1 − π i ) 2 , 2π i (1 − π i ), and π i Table 1 summarizes the simulation conditions employed in the present study. There were five models (Basic, Minor violation of rare disease assumption, Serious violation of rare disease assumption, Minor violation of independence assumption, and Serious violation of independence assumption) in our simulation. We set the rare disease prevalence (10 −5 ) in the Basic model; therefore, β 0 , the logit-transformation disease prevalence, is log[10 −5 /(1 − 10 −5 )]. Moreover, we set the odds ratios of allele effect and moderator effect as 1.5 and 2.0, respectively; therefore, β 1 and β 2 are log(1.5) and log(2.0), respectively. P 7 , P 8 , and P 9 are the same in the Basic model and were set at 50%. Based on the Basic model, we set two kinds of models that violated the rare disease or independence assumptions, and there are two levels in each situation. The model Minor violation of rare disease assumption replaced β 0 with log[10 −2 /(1 − 10 −2 )], and the model Serious violation of rare disease assumption replaced β 0 with log[10 −1 /(1 − 10 −1 )]. The model Minor violation of independence assumption replaced P 7 , P 8 , and P 9 with 0.4, 0.5, and 0.6, respectively, and the model Serious violation of rare disease assumption replaced P 7 , P 8 , and P 9 with 0.3, 0.5, and 0.7, respectively.
To conduct a meta-analysis of a genetic association study, we used the data from our past study [9] (S1 Table). In this data, the moderator (x 2 ) was encoded with values the following values: people without moderator (x 2 = 0) and with moderator (x 2 = 1). There were 69 case-control studies that contained information regarding gender distribution as well as 14,692 cases and 13,414 controls. The genotype of each individual, which was encoded by values of 0, 1, or 2, was randomly generated from a multinomial distribution [p = G 1i , G 2i , G 3i , and G 4i , respectively]. G 1i , G 2i , G 3i , and G 4i were the vector of genotype frequencies in cases without moderator [p( respectively. G 1i , G 2i , G 3i , and G 4i were calculated based on P 1 , P 2 , P 3 , P 4 , P 5 , P 6 , P 7 , P 8 , P 9 , q 0i , q 1i , and q 2i , respectively (S7 Text).
In the following analysis, we used our method to analyze the moderator effect using the summary data (Eq 2.1-7), and the summary odds ratio of each study was based on the additive model. The meta-regression used "metafor" packages [11] and the fixed-effect model was set to estimate the moderator effect. Moreover, the raw data were analyzed using individual patient data regression analysis. Individual patient data regression analysis was used for the hierarchical generalized linear model. Data in each condition were from 10,000 simulations.
Basic The primary outcome was the 95% confidence interval coverage rate of the moderator effect (β 3 ). The confidence interval coverage rate was the proportion of the 95% confidence interval, including the real parameter. The appropriate confidence interval coverage was 95%. In addition, type 1 errors were assessed in the null moderator effect model (β 3 = 0). The secondary outcome was the power of moderator effect assessment because the nonsignificant result may often be ignored. In addition, researchers often reports the results of stratified analysis when the moderator effect was significance. We also presented the 95% confidence interval coverage rate of the allele effect in people without a moderator (β 1 ) and people with a moderator (β 1 + β 3 ).

Simulations under assumptions
Tables 2 and 3 present the results of the simulation. The Basic model is the simulation under the rare disease and independence assumptions. The 95% confidence interval coverage rates of our method were similar to the results of the individual patient data regression analysis regardless of condition and were close to 95%. Moreover, the false positive rates at a p = 0.05 significance threshold did not significantly differ from that observed using 5% in the null moderator effect model (β 3 = 0). However, the power of our method was lower than the individual patient data regression analysis, indicating that the individual patient data regression analysis was more accurate. Fst is the difference in allele frequencies among various studies. Fst = 0, 0.01, and 0.1 indicated no differences, small difference, and large difference in allele frequency between the population and a specific ethnic group, respectively. The higher Fst may reduce the power of the analysis, although this may not impact the stability of the 95% confidence interval coverage rates. Figs 1 and 2 show the 95% confidence interval coverage rate of the allele effect in people without a moderator (β 1 ) and people with a moderator (β 1 + β 3 ). The 95% confidence interval coverage rates of our method were close to 95% in any condition under the rare disease and independence assumptions. The individual patient data regression analysis was also robust in this situation.

Simulations with violations of assumptions
Two models, Minor violation of rare disease assumption and Serious violation of rare disease assumption, tested the robustness when the outcome disease is not a rare disease, and we set the 1% and 10% disease prevalence rates in people with homozygous major without moderator, respectively. In the null moderator effect model analysis, the false positive rate of our method did not significantly differ from the 5% in any model and Fst. However, we observed that the 95% confidence interval coverage rates of our method were lower in the higher moderator effect model (β 3 = 0.25-1.0). The extent of reduction was impacted by disease prevalence and moderator effect; the simulation with higher disease prevalence and moderator effect showed lower 95% confidence interval coverage rates. Moreover, the analytical power was reduced because of higher disease prevalence. The individual patient data regression analysis remained robust regardless of the condition. The 95% confidence interval coverage rate of the allele effect in people without a moderator and people with a moderator was also lower in the Serious violation of rare disease assumption model, and the extent of reduction was impacted by the moderator effect.
We tested the robustness of our method when the situation violated the independence assumption. We set the small difference (0.1) and large difference (0.2) between P 7, P 8 , and P 9 , which indicated a small and strong association between SNP and moderator. We observed that  Confidence interval coverage rate of the allele effect in people without a moderator (β 1 ) and people with a moderator (β 1 + β 3 ) using our proposed method. The model names, "Basic," "Minor rare," "Serious rare," "Minor independence," and "Serious independence" indicate the models, "Basic," "Minor violation of rare disease assumption," "Serious violation of rare disease assumption," "Minor violation of independence assumption," and "Serious violation of independence assumption," respectively. F st is the parameter of frequency difference among various studies. The X-axis represents the confidence interval of the moderator effect (β 3 ); the Y-axis represents the 95% confidence interval coverage rate. The red bar represents the 95% confidence interval coverage rate of the allele effect in people without a moderator (β 1 ); the blue bar represents the 95% confidence interval coverage rate of the allele effect in people with a moderator (β 1 + β 3 ).  Confidence interval coverage rate of the allele effect in people without a moderator (β 1 ) and people with a moderator (β 1 + β 3 ) in individual patient data regression analysis. The model names, "Basic," "Minor rare," "Serious rare," "Minor independence," and "Serious independence" indicate the models, "Basic," "Minor violation of rare disease assumption," "Serious violation of rare disease assumption," "Minor violation of independence assumption," and "Serious violation of independence assumption," respectively. F st is the parameter of frequency difference among various studies. The X-axis represents the confidence interval of the moderator effect (β 3 ); the Y-axis represents the 95% confidence interval coverage rate. The red bar represents the 95% confidence interval coverage rate of the allele effect in people without a moderator (β 1 ); the blue bar represents the 95% confidence interval coverage rate of the allele effect in people with a moderator (β 1 + β 3 ).
the 95% confidence interval coverage rates of our method were lower in the model with violation of independence assumptions, and the extent of reduction was impacted by the strength of association between SNP and moderator. Moreover, the false positive rates of our method were significantly different from 5%. Therefore, the power analysis in this scenario was insignificant. The association between SNP and moderator did not impact the robustness of individual patient data regression analysis. Its 95% confidence interval coverage rates remained close to 95%, and it had appropriate false positive rates and high powers in any condition. Similar to the moderator effect, the results of allele effect in people without a moderator and people with a moderator showed that our method was not robust in the model with the violation of independence assumptions. The individual patient data regression analysis was also robust in any situation.

Discussion
This work is trying to propose a new method for meta-analysis when researchers were unable to obtain the raw data of each individual sample. It is difficult for accessing the detailed individual data [5,6]. Meta-analyses using aggregate data have been more frequently employed because it maximizes the number of studies, patients, and events [7,8]. However, there is no suitable methods for case-control studies but most genetic association studies are designed as case-control investigations. We believe this approach is an alternative to investigate more information of gene-gene and gene-environment interactions. Previous papers generally present the stratified results of minority participant types such as smoking status, and researchers utilize such information to assess their moderator effects [12]. However, most participant-type variables, such as other SNPs and gender, are presented as average summary values. Several meta-analyses of case-control studies consider that the absence of a control for various participant types was an important limitation and the exposure to different environmental factors could be difficult to completely assess [13][14][15][16]. The use of the meta-regression model using summary values has been employed for years. Some previous studies have used the summary values of the case group to determine the source of heterogeneity [17,18], whereas others have used the summary value of the control group [19]. One study even used the summary values of both the case and control groups [20]. However, these studies did not describe their bases for their selection of the summary value of a specific study group. Moreover, they often did not explain the biological significance of their analysis. The present study evaluated the biological significance of using the summary value of the case group in assessing their moderator effects, particularly when individual patient data could not be collected.
Individual patient data analysis had the higher confidence interval coverage rate and power, and this result was similar to that of previous simulation studies on meta-analyses of RCT [6]. However, accessing the detailed trial results can be difficult [7,8]. The standard error of individual patient data analysis was smaller than the standard error of our method, implying that the estimates of individual patient data analysis were more accurate. Therefore, we recommend that researchers contact the authors of included reports to obtain more detailed data and use our method as a last resort when they are unable to obtain sufficient information.
The independence assumption is important because the relationship between summary values and odds ratios does not follow a linear correlation when it occurs as a Simpson's paradox. The independence assumption could avoid the Simpson's paradox to determine whether the robustness of our method was insufficient when the situation violated the independence assumption. The rare disease assumption was relatively unimportant because the association between the summary values and odds ratios continued to follow a linear correlation. Therefore, the false positive rate did not increase when the situation violated the rare disease assumption. However, with the increase in disease prevalence, the effect of the summary value from the "case" and "control" on odds ratio changed. When the actual disease prevalence approached 0%, the summary value from the "case" was the only factor that influenced the estimator of the combined odds ratio. When the true disease prevalence approached 100%, the summary value from the "control" was the only factor that influenced the estimator of the combined odds ratio. In fact, the impact of the summary value from the "case" and "control" were based on the actual disease prevalence. However, diseases with >50% prevalence rates may not be present; therefore, we considered that the impact of the summary value from the "case" was always larger than the summary value from the "control." Because researchers often could not obtain actual disease prevalence rates, we considered that detecting the interactions using the summary value from a "case" was a suitable selection. In fact, the results of the meta-regression using the summary value from the case and control groups were similar because most of the studies had similar proportions of moderators in the case or control groups (e.g., matched studies). However, using the summary value of a case group was apparently a better selection because the impact weight of the summary value from the "case" was higher than that of the summary value from the "control" unless the real disease prevalence was >50%.
In conclusion, we considered that building the meta-regression using the summary value from a case group may be an effective approach when the information from every individual patient in insufficient. Furthermore, this approach is extremely easy to use and could assist in defining the biological significance. Several software programs can conduct meta-regression analysis such as R and STATA, and researchers can use these to investigate the interaction between the factor of interest, such as other SNPs or environment factor, and topic SNP. On the other hand, the rare disease assumption is relatively unimportant. However, when the actual disease prevalence is >10%, the estimators of meta-regression could be distorted, although the significant interactions may still possibly remain true. The independence assumption is important. The detection method for this interaction may largely deviate from the real situation, particularly when this violates the independence assumption. However, SNPs are often unrelated to environmental factors and SNPs of other chromosomes. Therefore, these results indicate that this method is useful in genetic studies. The meta-analysis of genetic association studies could also be effectively used in detecting gene-gene and gene-environment interactions, which may be accountable for the "missing heritability." Supporting Information S1 Text. The relationship between population parameters and the minor allele frequencies. S7 Text. The relationship between G 1 , G 2 , G 3 , G 4 and P 1 , P 2 , P 3 , P 4 , P 5 , P 6 , P 7 , P 8 , P 9 , q 0 , q 1 , q 2 . (DOCX) S1 Table. Detailed data in the real dataset. (DOCX)