
Conceived and designed the experiments: ZW HZ. Analyzed the data: ZW. Contributed reagents/materials/analysis tools: ZW HZ. Wrote the paper: ZW HZ.

The authors have declared that no competing interests exist.

Genome-wide association studies (GWAS) aim to identify genetic variants related to diseases by examining the associations between phenotypes and hundreds of thousands of genotyped markers. Because many genes are potentially involved in common diseases and a large number of markers are analyzed, it is crucial to devise an effective strategy to identify truly associated variants that have individual and/or interactive effects, while controlling false positives at the desired level. Although a number of model selection methods have been proposed in the literature, including marginal search, exhaustive search, and forward search, their relative performance has only been evaluated through limited simulations due to the lack of an analytical approach to calculating the power of these methods. This article develops a novel statistical approach for power calculation, derives accurate formulas for the power of different model selection strategies, and then uses the formulas to evaluate and compare these strategies in genetic model spaces. In contrast to previous studies, our theoretical framework allows for random genotypes, correlations among test statistics, and a false-positive control based on GWAS practice. After the accuracy of our analytical results is validated through simulations, they are utilized to systematically evaluate and compare the performance of these strategies in a wide class of genetic models. For a specific genetic model, our results clearly reveal how different factors, such as effect size, allele frequency, and interaction, jointly affect the statistical power of each strategy. An example is provided for the application of our approach to empirical research. The statistical approach used in our derivations is general and can be employed to address the model selection problems in other random predictor settings. We have developed an R package

Almost all published genome-wide association studies are based on single-marker analysis. Intuitively, joint consideration of multiple markers should be more informative when multiple genes and their interactions are involved in disease etiology. For example, an exhaustive search among models involving multiple markers and their interactions can identify certain gene–gene interactions that will be missed by single-marker analysis. However, an exhaustive search is difficult, or even impossible, to perform because of the computational requirements. Moreover, searching more models does not necessarily increase statistical power, because there may be an increased chance of finding false positive results when more models are explored. For power comparisons of different model selection methods, the published studies have relied on limited simulations due to the highly computationally intensive nature of such simulation studies. To enable researchers to compare different model search strategies without resorting to extensive simulations, we develop a novel analytical approach to evaluating the statistical power of these methods. Our results offer insights into how different parameters in a genetic model affect the statistical power of a given model selection strategy. We developed an R package to implement our results. This package can be used by researchers to compare and select an effective approach to detecting SNPs.

In genome-wide association studies (GWAS), hundreds of thousands of markers are genotyped to identify genetic variations associated with complex phenotypes of interest. The detection of truly associated markers can be framed as a model selection problem: a group of statistical models is considered to assess how well each model predicts the phenotype, and the selected models are expected to include all or some of the truly associated genetic markers and few, if any, markers not associated with the phenotype. In the literature, three model selection procedures have been advocated: marginal search, exhaustive search, and forward search.

Marginal search analyzes markers individually and is the simplest and computationally least expensive among these three search methods. Under certain assumptions, such as no interactions among covariates (or markers in the GWAS context), Fan and Lv

In contrast to marginal search, exhaustive search and forward search simultaneously consider multiple markers in the model. Exhaustive search examines all possible models within a given model dimension, and forward search identifies markers in a stepwise fashion. As they consider interactions, they may gain statistical power compared to marginal search ^{11} candidate models. This requires significant computational resources, especially when permutations are needed to establish overall significance levels, e.g. for the purpose of appropriately accounting for dependencies among markers. Because of this computational burden, it is difficult or even impossible to assess the power of exhaustive search through simulation studies.
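The size of the model space explored by each strategy can be illustrated with a quick count. The sketch below assumes a hypothetical panel of 500,000 markers (a value chosen for illustration only) and counts the single-marker models and the unordered marker pairs; the function name `n_models` is likewise an illustrative choice.

```python
import math


def n_models(n_markers: int, dimension: int) -> int:
    """Count the distinct models that use exactly `dimension` markers."""
    return math.comb(n_markers, dimension)


N_MARKERS = 500_000  # hypothetical panel size, for illustration only

n_single = n_models(N_MARKERS, 1)  # models visited by a marginal search
n_pairs = n_models(N_MARKERS, 2)   # models visited by a two-marker exhaustive search

print(f"single-marker models: {n_single:,}")
print(f"two-marker models:    {n_pairs:,}")
print(f"pairs per marker:     {n_pairs / n_single:,.1f}")
```

The pair count grows quadratically in the number of markers, which is why permutation-based significance assessment becomes so costly for exhaustive search.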

Based on limited simulations and real data analysis, conflicting results exist in the literature on the relative merit of exhaustive search and forward search. Because exhaustive search considers many more models, it may increase the probability that the truly associated markers do not rise to the top as more models involving unrelated markers may outperform the true models simply due to chance. Forward search explores a smaller model space, allowing a less stringent threshold for significance. However, forward search may miss the markers that have a strong interaction effect but weak marginal effect. Through limited simulation studies, Marchini and colleagues

It is clear that the optimal model selection strategy depends on the underlying genetic model, which is unknown to researchers. In the most extreme case, if the underlying genetic model has no marginal association, an exhaustive search is the only way to find influential genes. On the other hand, for a model with purely additive genetic effects, marginal or forward search will be the most effective. For the cases between these two extremes, the optimal model selection strategy should achieve a delicate balance between computational efficiency, statistical power, and a low false positive rate. Without knowledge of the underlying model, it is necessary to evaluate the different methods by thoroughly comparing them across a large genetic model space, which neither computationally intensive simulations nor limited real data analysis can fully explore.

In this article, we derive the analytical results for statistical power of marginal search, exhaustive search, and forward search. These formulas can significantly reduce the computational burden in power estimation. To implement the formulas, we developed an R package

The rest of this article is organized as follows: in the

A genetic model relates phenotype to genotypes, and this relationship can be rather complex. In general, statistical power depends on the effects of risk alleles, allele frequencies in the population, epistasis, as well as environmental risk factors and their interactions with genetic factors. We focus on a model commonly used in the literature, which offers valuable insights into the relative performance of model selection methods.

Assume that genotype data are available from _{i}_{1}, …, _{ip}_{j}_{j}_{j}_{j}_{j}

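We focus on the scenario that two of these SNPs, indexed by 1 and 2, are truly associated with a quantitative outcome $Y_i$ through the model

$$Y_i = b_0 + b_1 X_{i1} + b_2 X_{i2} + b_3 X_{i1} X_{i2} + \epsilon_i, \qquad (2)$$

where the random error $\epsilon_i \sim N(0, \sigma^2)$ is independent of the genotypes. The interaction term represents the epistatic effect, and its coefficient $b_3$ measures the direction and magnitude of this effect.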
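To make the data-generating mechanism concrete, the following is a minimal simulation sketch of model (2). It rests on several assumptions that are not stated in the text above: genotypes follow Hardy-Weinberg proportions, are coded 1, 0, −1, and are independent across SNPs; all SNPs share one allele frequency `q`; and the sample size, number of SNPs, and function names are hypothetical. The effect sizes and error variance in the usage line are those of the first simulation set-up described later in the article.

```python
import random


def draw_genotype(q, rng):
    """Draw one genotype under Hardy-Weinberg equilibrium.

    Assumed coding: 1, 0, -1 for the three genotypes, where q is the
    frequency of the allele whose homozygote is coded -1.
    """
    p = 1.0 - q
    u = rng.random()
    if u < p * p:
        return 1
    if u < p * p + 2.0 * p * q:
        return 0
    return -1


def simulate(n, n_snps, q, b, sigma2, seed=0):
    """Simulate genotypes X (n rows, n_snps columns) and phenotypes y.

    Columns 0 and 1 play the role of the true SNPs 1 and 2;
    b = (b0, b1, b2, b3) holds the intercept, the two main effects,
    and the epistatic effect; sigma2 is the error variance.
    """
    rng = random.Random(seed)
    b0, b1, b2, b3 = b
    sd = sigma2 ** 0.5
    X = [[draw_genotype(q, rng) for _ in range(n_snps)] for _ in range(n)]
    y = [
        b0 + b1 * row[0] + b2 * row[1] + b3 * row[0] * row[1] + rng.gauss(0.0, sd)
        for row in X
    ]
    return X, y


# Hypothetical sample size, SNP count, and allele frequency.
X, y = simulate(n=500, n_snps=10, q=0.3, b=(0.0, 0.1, 0.1, 2.4), sigma2=3.0)
print(len(X), len(X[0]), len(y))
```

Only the first two columns enter the phenotype, so the remaining columns act as unassociated markers for any search procedure applied to the simulated data.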

Based on the observed data, we fit the following models using Ordinary Least Squares (OLS) involving one or two SNPs:

The subscripts in the above models index the SNP(s) included in these models. Based on models (3) and (4), three model selection methods seek candidate markers according to the corresponding test statistics. In marginal search, we fit simple linear model (3) and compare the _{j}_{jk}_{j}_{j}_{k}
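The three search procedures can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: it assumes that model (3) regresses the phenotype on an intercept and one SNP, that model (4) adds a second SNP and the product interaction, and that candidate models of the same size are ranked by their residual sum of squares (which orders them the same way as the usual t or F statistic against the intercept-only model). All function names are hypothetical.

```python
import random
from itertools import combinations


def rss(columns, y):
    """Residual sum of squares of the OLS fit of y on an intercept plus columns."""
    basis = []
    for col in [[1.0] * len(y)] + columns:
        v = list(col)
        for u in basis:  # Gram-Schmidt: remove the part explained by earlier columns
            c = sum(a * b for a, b in zip(v, u)) / sum(a * a for a in u)
            v = [a - c * b for a, b in zip(v, u)]
        if sum(a * a for a in v) > 1e-9:  # skip columns that add nothing new
            basis.append(v)
    r = list(y)
    for u in basis:
        c = sum(a * b for a, b in zip(r, u)) / sum(a * a for a in u)
        r = [a - c * b for a, b in zip(r, u)]
    return sum(a * a for a in r)


def column(X, j):
    return [row[j] for row in X]


def pair_rss(X, y, j, k):
    """RSS of the two-SNP model with interaction (assumed form of model (4))."""
    xj, xk = column(X, j), column(X, k)
    return rss([xj, xk, [a * b for a, b in zip(xj, xk)]], y)


def marginal_search(X, y, top):
    """Rank single-SNP models (assumed form of model (3)); keep the best `top`."""
    return sorted(range(len(X[0])), key=lambda j: rss([column(X, j)], y))[:top]


def exhaustive_search(X, y, top):
    """Rank every SNP pair by the fit of the two-SNP model; keep the best `top`."""
    pairs = combinations(range(len(X[0])), 2)
    return sorted(pairs, key=lambda jk: pair_rss(X, y, *jk))[:top]


def forward_search(X, y, top):
    """Pick the best single SNP, then rank partners conditional on it."""
    first = marginal_search(X, y, 1)[0]
    partners = [k for k in range(len(X[0])) if k != first]
    ranked = sorted(partners, key=lambda k: pair_rss(X, y, first, k))
    return [tuple(sorted((first, k))) for k in ranked[:top]]


# Toy data: SNPs 0 and 1 are associated; genotypes drawn uniformly for brevity.
rng = random.Random(1)
X = [[rng.choice((-1, 0, 1)) for _ in range(6)] for _ in range(200)]
y = [r[0] + r[1] + 2.0 * r[0] * r[1] + rng.gauss(0.0, 1.0) for r in X]
print(marginal_search(X, y, 2), exhaustive_search(X, y, 1), forward_search(X, y, 1))
```

Exhaustive search fits one model per SNP pair, whereas forward search fits one model per SNP in each of its two steps, which is the computational trade-off discussed in the Introduction.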

Two criteria are adopted to decide if the chosen models are correct. On one hand, we could be rather stringent and call a model correct only if it matches the true underlying genetic model. This is consistent with the concept of “joint significance” in Storey et al.

(A) the probability of identifying exactly the true model (in marginal search, it is the probability of detecting both true SNPs);

(B) the probability of detecting at least one of the true SNPs.

Under power definition (A), the null model is any model other than the true genetic model; under power definition (B), the null model is any model containing neither true SNP.
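The two power definitions can be written as simple checks on the set of selected models, with empirical power being the fraction of simulated data sets for which a check succeeds. The sketch below is illustrative; the function names and the representation of a model as a set of SNP indices are assumptions.

```python
TRUE_SNPS = frozenset({1, 2})  # the two truly associated SNPs, indexed as in the text


def hits_a(selected_models, marginal=False):
    """Definition (A): the exact true model is selected.

    For marginal search (single-SNP models), (A) means that both true
    SNPs appear among the selected SNPs.
    """
    models = [frozenset(m) for m in selected_models]
    if marginal:
        return TRUE_SNPS <= frozenset().union(*models)
    return TRUE_SNPS in models


def hits_b(selected_models):
    """Definition (B): at least one true SNP appears in a selected model."""
    return any(TRUE_SNPS & frozenset(m) for m in selected_models)


def empirical_power(flags):
    """Fraction of simulation replicates in which the criterion was met."""
    return sum(flags) / len(flags)


# Three hypothetical replicates of a two-SNP search, each keeping two models.
replicates = [[{1, 2}, {3, 4}], [{1, 7}, {5, 6}], [{8, 9}, {3, 4}]]
print(empirical_power([hits_a(r) for r in replicates]))
print(empirical_power([hits_b(r) for r in replicates]))
```

Because every selection that satisfies (A) also satisfies (B), power under (B) can never fall below power under (A) for the same procedure and threshold.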

We evaluated the accuracy of the asymptotic results derived in the

In the first set-up for model (2), we considered _{1} = _{2} = 0.1, _{3} = 2.4, allele frequency of each SNP _{j}^{2} = 3. _{3} = 1.4. For this set-up, _{3} represent large and small interaction terms with which the simulation generated a broad spectrum of power values. In both set-ups, the analytical power is very close to the empirical power based on simulations.

| Category | Strategy | Source | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | Marginal search | simulation | 0.268 | 0.556 | 0.683 | 0.754 | 0.790 | 0.851 |
| | Marginal search | calculation | 0.279 | 0.552 | 0.673 | 0.738 | 0.781 | 0.836 |
| | Exhaustive search | simulation | 0.987 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 |
| | Exhaustive search | calculation | 0.978 | 0.995 | 0.998 | 0.998 | 0.998 | 1.000 |
| | Forward search | simulation | 0.780 | 0.788 | 0.789 | 0.789 | 0.789 | 0.789 |
| | Forward search | calculation | 0.795 | 0.800 | 0.800 | 0.800 | 0.801 | 0.801 |
| | Marginal search | simulation | 0.790 | 0.950 | 0.980 | 0.985 | 0.993 | 0.995 |
| | Marginal search | calculation | 0.806 | 0.958 | 0.982 | 0.990 | 0.993 | 0.997 |
| | Exhaustive search | simulation | 0.993 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 |
| | Exhaustive search | calculation | 0.985 | 0.999 | 0.999 | 1.000 | 1.000 | 1.000 |
| | Forward search | simulation | 0.843 | 0.910 | 0.944 | 0.961 | 0.974 | 0.986 |
| | Forward search | calculation | 0.828 | 0.906 | 0.938 | 0.952 | 0.966 | 0.983 |

| Category | Strategy | Source | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | Marginal search | simulation | 0.055 | 0.180 | 0.291 | 0.359 | 0.425 | 0.512 |
| | Marginal search | calculation | 0.053 | 0.179 | 0.289 | 0.355 | 0.424 | 0.515 |
| | Exhaustive search | simulation | 0.394 | 0.586 | 0.667 | 0.706 | 0.715 | 0.753 |
| | Exhaustive search | calculation | 0.399 | 0.567 | 0.638 | 0.681 | 0.707 | 0.728 |
| | Forward search | simulation | 0.242 | 0.308 | 0.331 | 0.340 | 0.343 | 0.348 |
| | Forward search | calculation | 0.238 | 0.308 | 0.331 | 0.342 | 0.349 | 0.354 |
| | Marginal search | simulation | 0.394 | 0.695 | 0.802 | 0.869 | 0.899 | 0.935 |
| | Marginal search | calculation | 0.406 | 0.698 | 0.807 | 0.862 | 0.894 | 0.932 |
| | Exhaustive search | simulation | 0.533 | 0.757 | 0.823 | 0.850 | 0.880 | 0.910 |
| | Exhaustive search | calculation | 0.569 | 0.738 | 0.809 | 0.848 | 0.874 | 0.906 |
| | Forward search | simulation | 0.422 | 0.561 | 0.654 | 0.731 | 0.769 | 0.841 |
| | Forward search | calculation | 0.433 | 0.554 | 0.647 | 0.711 | 0.758 | 0.821 |

We chose these two set-ups in which the power was reasonably large to approximate most practical settings. The chosen value of

The simulation results shown in ^{2} = 3 and varied the values of _{1} = _{2} as well as that of _{3} from −1 to 1 by a step size of 0.1. To simplify the discussion, we assumed all SNPs had the same allele frequency of _{j}

_{1}+_{3}(_{2}−_{2}) = 0, depends on the additive genetic effect _{1}, epistatic effect _{3}, and the allele frequency _{2} of SNP 2. The non-detectable region for SNP 2 is analogous by symmetry. In exhaustive search, no such region exists, as indicated by formula (12). Exhaustive search can therefore better identify signals that counterbalance one another.
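The cancellation behind the non-detectable region can be checked numerically. The sketch below assumes independent SNPs with Hardy-Weinberg genotype proportions coded 1, 0, −1, so that the mean genotype of SNP 2 equals the difference between its two allele frequencies; under that assumption the covariance between SNP 1 and the phenotype is proportional to the main effect of SNP 1 plus the epistatic effect times that mean. The names `b1`, `b2`, `b3`, `q1`, `q2` and both functions are illustrative.

```python
from fractions import Fraction
from itertools import product


def genotype_dist(q):
    """Hardy-Weinberg genotype probabilities under the assumed 1/0/-1 coding."""
    p = 1 - q
    return {1: p * p, 0: 2 * p * q, -1: q * q}


def marginal_covariance(b1, b2, b3, q1, q2):
    """Exact Cov(X1, Y) when Y = b1*X1 + b2*X2 + b3*X1*X2 + noise.

    The intercept and the independent noise term do not affect the
    covariance, so they are omitted. X1 and X2 are assumed independent.
    """
    d1, d2 = genotype_dist(q1), genotype_dist(q2)
    e_x = e_y = e_xy = 0
    for (x1, w1), (x2, w2) in product(d1.items(), d2.items()):
        w = w1 * w2
        y = b1 * x1 + b2 * x2 + b3 * x1 * x2
        e_x += w * x1
        e_y += w * y
        e_xy += w * x1 * y
    return e_xy - e_x * e_y


q = Fraction(3, 10)  # hypothetical allele frequency; mean genotype is 2/5
# Main effect and epistatic effect cancel: -2/5 + 1 * 2/5 = 0.
print(marginal_covariance(Fraction(-2, 5), 1, 1, q, q))
# Same magnitudes, no cancellation.
print(marginal_covariance(Fraction(2, 5), 1, 1, q, q))
```

Exact rational arithmetic is used so that the cancellation shows up as an exact zero rather than a small floating-point residue.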

The results of power for the three model selection methods: marginal search in the left column, exhaustive search in the middle column and forward search in the right column. Two definitions of power (A) for detecting the true model or both true SNPs in marginal search in row 1, and (B) for detecting either true SNP in row 2 are considered. We consider genetic models with the main effects _{1} = _{2} varying from −1 to 1 and the epistatic effect _{3} varying from −1 to 1. The allele frequency _{j}

In order to better visualize the difference of model selection methods, we show the power differences between different methods. The left, middle, and right columns of ^{2}), all model selection procedures have low power and tend to fail to pick up the true SNPs. Second, in the edge areas where the signals are strong, all model selection procedures have similarly good power. The light colored areas represent these two special situations in which there is little difference in power among model selection methods.

The power differences between marginal search and exhaustive search in the left column, between marginal search and forward search in the middle column, and between forward search and exhaustive search in the right column. Green areas indicate positive values of difference, and red areas indicate negative values of difference. We consider genetic models with the main effects _{1} = _{2} varying from −1 to 1 and the epistatic effect _{3} varying from −1 to 1. The allele frequency _{j}

The power differences between marginal search and exhaustive search in the left column, between marginal search and forward search in the middle column, and between forward search and exhaustive search in the right column. Green areas indicate positive values of difference, and red areas indicate negative values of difference. We consider genetic models with the main effects _{1} = _{2} varying from −1 to 1 and the epistatic effect _{3} varying from −1 to 1. The allele frequency _{j}

To compare marginal search and exhaustive search, the left columns of _{3} is large or _{1}+_{3}(_{2}−_{2}) is small. Such advantage is more pronounced under power definition (A) than under power definition (B). Marginal search performs better in the green areas where _{3} is small and _{1} and _{2} are both moderate. There are two reasons for the better performance of marginal search. First, with a small interaction term _{3} in these green areas, marginal search well detects the signals when the two-marker genetic effects are projected onto a marginal space through the simple regression of form (3). At the same time, with moderate _{1} and _{2}, the power for these two methods is not close to 0 or 1, so that they are distinguishable. Second, marginal search considers fewer models so that the desired models are more likely to be found from the models with the best fit.

Under different power definitions, the performance of forward search relative to that of marginal search can change. Capable of including interaction terms, forward search has an advantage over marginal search in finding the full correct model under power definition (A), as shown by the red areas in the middle column of

As shown in the right column of _{1}+_{3}(_{2}−_{2})≈0. Under power definition (B), forward search is more powerful than exhaustive search when

As reflected by the change of red/green areas between the first and the second rows in both

We also explored additional model set-ups in _{j}^{2} = 3. The values of the genetic effects _{1} = _{2} and _{3} varied from −2 to 2 by a step size of 0.2. When _{j}_{1} = _{2} = 0 and _{3} = 0. In general, the patterns are similar to those shown in

In the following we provide an example to show how to apply our approach to calculating and comparing the power of model selection methods in empirical analysis. Because there are no consistently replicated interaction effects from real studies, we constructed hypothetical interaction models based on real data so that the marginal associations between traits and markers were matched, while allowing the interaction term to vary. Specifically, we calculated power based on a set of genetic models derived from a genome-wide association study of adult height by Weedon et al. _{3}), we estimated the parameters _{1}, _{2}, and ^{2} using model (2) so that the marginal effects matched the observed values. The _{1} = 0.77 and _{2} = 0.48, respectively.

_{3}. For the detection of both SNPs, graphs A (_{3} is large, exhaustive search (red dashed curve) has significant advantage over forward search (green dotted curve), which is better than marginal search (black solid curve). If _{3}_{3}. The relative performance of exhaustive search strongly depends on the magnitude of epistasis. Comparing graphs B (

Power comparisons of three model selection procedures over a sequence of epistatic effect _{3}: marginal search by black solid curve, exhaustive search by red dashed curve, and forward search by green dotted curve. We assume the true SNPs to be rs11107116 and rs10906982, which influence adult height with their marginal effects set to be the same as those observed in Weedon et al. 2008. Graphs A with

With _{3}>0.3 or 0.6, respectively. We studied the statistical significance of the interaction terms with simulated data (1,000 runs) when _{3} equals each of these two cutoffs. When _{3} = 0.3, only 11.4% of the simulations yielded Bonferroni-adjusted p-values (adjusted for the number of all possible pairs of the 20 identified loci) that were significant at the 0.05 level. Therefore, a small epistatic effect, which rarely reaches significance in the observed data, can still make an exhaustive search more powerful than a marginal search under power definition (A). Under power definition (B), when _{3} = 0.6, 87.3% of the Bonferroni-adjusted p-values were significant. That is, for exhaustive search to be more powerful than marginal search in finding either SNP, the true epistatic effect needs to be large enough that the interaction is usually detected as statistically significant.
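The Bonferroni adjustment used here is easy to reproduce: with 20 loci there are 190 unordered pairs, so each interaction p-value is multiplied by 190 (capped at 1) before being compared with 0.05. The helper name below is an illustrative choice, and the example p-values are hypothetical.

```python
import math

N_LOCI = 20
ALPHA = 0.05

n_pairs = math.comb(N_LOCI, 2)        # number of pairwise interaction tests
per_test_threshold = ALPHA / n_pairs  # equivalent raw p-value cutoff


def bonferroni_adjust(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


print(n_pairs, per_test_threshold)
for raw_p in (0.0001, 0.001, 0.01):  # hypothetical raw p-values
    adjusted = bonferroni_adjust(raw_p, n_pairs)
    print(raw_p, adjusted, adjusted < ALPHA)
```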

This example demonstrates that the value of the interaction term and the number of false discoveries affect the relative performance of model selection methods, which can be one of the reasons for the conflicting results about the power of model selection methods in the existing literature

In this article, we have derived rigorous analytical results for the statistical power of three common model selection methods, and applied these results to compare the methods' performance for GWAS data. These results not only make the computationally expensive simulations unnecessary, but also systematically reveal how different genetic model parameters affect the power.

The comparison results among the three model selection methods illustrate the trade-off between searching the full model space and a reduced space. At one extreme, exhaustive search explores the full 2-dimensional space covering all possible epistatic effects, but it may reduce the probability that the true model(s) ranks among the top models because many more models are considered. At the other extreme, marginal search casts the true 2-dimensional model onto a 1-dimensional space without considering epistasis at all. However, we have a better chance to find more true positives when the marginal association is retained in the 1-dimensional space, because fewer models are examined and the false positive control appears comparatively liberal. Between these two extremes, forward search first considers marginal projection, and then partially searches the 2-dimensional space via residual projection given the predictor chosen in the first step. Thus, forward search has the partial benefit of joint analysis, which considers epistatic effects conditionally. The stringency of its false positive control lies between those of exhaustive search and marginal search.

The relative performance of these model selection methods also depends on the definition of power. Based on definition (A), exhaustive search performs the best in finding the true underlying genetic model in most of the model space considered. Under power definition (B), marginal search is a good choice: it is not much worse than exhaustive search for a large proportion of the model space, and it is always better than the classic forward search through which only one SNP is picked up in the first step. For most geneticists, finding at least one of the truly associated SNPs under power definition (B) is a primary concern, especially in the first stage of GWAS. Because we do not have prior information about the true genetic model in the beginning, marginal search, which is easy to compute, is a good start in the first stage of GWAS to find one or some of the main genetic effects. In the later stage(s), if the promising SNP candidates are limited, exhaustive search can be applied with less demanding computation, especially when epistasis among loci is of interest. Our conclusions based on the analytical studies justify this multi-stage strategy in GWAS.

Our power calculation for model selection strategies is different from a traditional power calculation for multiple regression models

Our analytical approach offers insights into model selection methods beyond those available from simulations and limited real data analysis. Furthermore, our approach addresses a critical limitation of prior studies

To compare the power of model selection methods, our approach explicitly considers the correlation structures among the test statistics for the null and alternative hypotheses, which achieves more accurate assessment of model selection methods than Bonferroni-corrected type I error control that is commonly used in the literature

To obtain the significance threshold, we control the number of false discoveries at

The model selection problem is also a large-scale simultaneous hypothesis testing problem. A widely applied significance control criterion in this scenario is the false discovery rate (FDR)

Through the simulations in the

We have assumed that the markers are independent in this paper. There may be linkage disequilibrium (LD) among SNPs. However, LD in general is weak among tagging SNPs

In reality, the underlying true model could be more complicated than model (2), with more associated SNPs and interactions. Our analytical power calculations can be extended through approaches similar to the one developed here. Although the genetic models studied are simple, our results provide insights into the relative performance of different model selection procedures.

To calculate the power of model selection procedures shown in the _{i}_{i}_{1}, …, _{is}_{i}_{1}, …, _{s}_{j}_{ij}_{i}_{jk}_{ij}_{ik}

We extend the above result in two ways to suit our needs of deriving the distribution of test statistics that are examined in model selection procedures (the proofs are given in

Secondly, if

^{2})/2^{2}/^{2}), if

With the results above, we derive the relevant distributions of _{12} is the _{i}_{ij}_{j}_{lk}

To calculate the power of marginal search, we need to obtain the distributions of the involved test statistics. We first derive the _{2} is gotten by symmetry between indices 1 and 2.

Based on the asymptotic mean of _{1} derived above, we can quantify the influence of genetic parameters of SNP 2 and epistasis on the power of marginal search to pick up SNP 1. As for some genetically interesting observations, when there is no epistatic effect (i.e. _{3} = 0), we have _{1} and thus the power of marginal search to find _{1} are decreasing functions of the main effect of _{2}, the minor allele frequency (MAF) of _{2}, and the random error variance ^{2}, with the decreasing rate specifically given by _{3}≠0) but _{1} = 0, _{1}(_{2}. The influence of _{2} depends on the allele frequencies _{1} and _{1}. On the other hand, if _{1}≠0, it is possible that _{1}+_{3}(_{2}−_{2}) = 0 when the main effect _{1} and interaction effects _{3} have opposite directions (assuming _{2} is the MAF). With such epistatic pattern, marginal detection surely fails to detect the true genetic variants no matter how strong the true genetic effects are.

Now we derive the joint distribution of _{1} and _{2}. Since _{1} and _{2} in the underlying true model (2), _{1} and _{2} are correlated even when _{1} and _{2} are independent and do not interact, i.e. _{3} = 0. The correlation between _{1} and _{2} can be substantial in certain genetic models. The asymptotic joint distribution of (_{1}, _{2})′ is_{1,2} = _{1}, _{2}). The covariance τ_{1,2} is gotten based on the result in (6), and its formula (as a constant of

Let _{j}_{3} as an example is provided in _{j}_{1} and _{2} according to the result in (6). Under the assumption of fixed design matrix,

Based on the above results for the distributions of _{1}| and |_{2}| are greater than the _{j}_{(r)} is the _{j}_{1}, _{2}) is the joint probability density function (PDF) of (_{1}, _{2})′ given in (9). Let Φ(·) be the cumulative distribution function (CDF) of _{1}| or |_{2}| is larger than the random cutoff point: _{1}|∨|_{2}|≥|_{(r)}), where |_{1}|∨|_{2}| = max{|_{1}|, |_{2}|}.
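The structure of this calculation, comparing the statistics of the two true SNPs with a random cutoff given by an order statistic of the null statistics, can be mimicked with a small Monte Carlo sketch. It is not the article's formula: it assumes that the two true-SNP statistics are bivariate normal with unit variances and user-supplied means and correlation, that the null statistics are independent standard normals, and that a true SNP is detected when its absolute statistic is at least the `top_r`-th largest absolute null statistic. All names and numerical inputs are hypothetical.

```python
import heapq
import math
import random


def marginal_power_mc(mu1, mu2, rho, n_null, top_r, n_rep=2000, seed=0):
    """Monte Carlo estimate of marginal-search power under (A) and (B).

    Illustrative assumptions: (T1, T2) is bivariate normal with means
    (mu1, mu2), unit variances, and correlation rho; the n_null null
    statistics are independent N(0, 1); requires n_null >= top_r.
    """
    rng = random.Random(seed)
    hit_a = hit_b = 0
    for _ in range(n_rep):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        t1 = mu1 + z1
        t2 = mu2 + rho * z1 + math.sqrt(1.0 - rho * rho) * z2
        nulls = (abs(rng.gauss(0.0, 1.0)) for _ in range(n_null))
        cutoff = heapq.nlargest(top_r, nulls)[-1]  # top_r-th largest null |T|
        smaller, larger = sorted((abs(t1), abs(t2)))
        hit_a += smaller >= cutoff  # (A): both true SNPs pass the cutoff
        hit_b += larger >= cutoff   # (B): max(|T1|, |T2|) passes the cutoff
    return hit_a / n_rep, hit_b / n_rep


power_a, power_b = marginal_power_mc(mu1=3.0, mu2=3.5, rho=0.2, n_null=1000, top_r=5)
print(power_a, power_b)
```

An analytical formula replaces the inner sampling of null statistics with the distribution of the order statistic, which is what removes the need for large simulations when the number of markers is in the hundreds of thousands.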

The distributions of the relevant test statistics are derived first for calculating the power of exhaustive search. We first get the joint distribution of the test statistics involving true SNPs 1 and 2: _{1}, _{2}, and _{12}. Define

The formula of _{1}(_{12,i} = _{12}, _{i}

We then derive the _{j}_{k}_{jk}_{34} as an example, the detailed proof is given in

Based on the traditional power calculation for regression models, the null model is the incorrect model with neither SNP associated with the phenotype. When the design matrix is fixed, the null distribution of _{jk}

In order to calculate the power of model selection methods, we need to address the correlation structures among involved statistics. The statistics are correlated when two epistatic models in form (4) share a common SNP. Also, _{1} and those involving _{2} are correlated because the true underlying model includes both SNPs. Consequently, the elements in the set {_{12}, _{ij}_{1}(_{13} as an example is shown in _{ij}_{ik}_{i}_{j}_{|i} and _{k}_{|i} to be independent. Furthermore, with the result (14) we can use the joint distribution (11) to capture the correlation between _{12} and _{ij}

Based on the asymptotic distribution in (7), we have^{2}/_{j}_{|i})→_{j}_{|i})→_{3|1}. The formulas of _{i}_{j}^{2}} (e.g. when allele frequencies _{i}_{j}_{1}, _{2}, _{3})′ and random error variance ^{2} are not too large). When _{j}_{|i} when

With the distribution of test statistics derived above, we first calculate the probability of exhaustive search to identify the exact true model. Under power definition (A), the test statistic _{12} for the exact true model corresponds to the “alternative” distribution, whereas the _{1}≡{_{ij}_{2}≡{_{jk}_{S}_{,[R]} denote the _{12},_{1},_{2}) is the PDF of (11), _{2}, _{1i}(•) is the CDF of distribution (15) for _{2}(•) is the CDF of distribution (13). The test statistics within the sets ^{*}≡{_{j}_{|1}, _{j}_{|2}, _{2}≡{_{jk}

According to the power definition (B), the probability of exhaustive search to detect at least one of the associated SNPs is_{2(N−R+1)}(•) is the PDF of the (_{2}(•) and _{2}(•) are the CDF and PDF of the distribution of (13) respectively.

If _{12},_{1},_{2}), we can approximately replace the integrand _{A}

For forward search, first we derive the distributions of test statistics, which will be used to calculate the corresponding statistical power. Here we need to handle the comparison between two models: the model with SNPs 1 and _{1|j} be the _{1}+_{3}(_{2}−_{2})≠0, following the asymptotic result in (5), we can derive_{j}_{1}+_{3}(_{2}−_{2}) = 0, _{1|j},_{2|j}) can also be calculated. As an example the formula of _{1|3},_{2|3}) is given in

Moreover, the statistics (_{1}_{2}_{1|j},_{2|j})′ involving true SNPs have a multivariate normal distribution:

Through result (6), we have proved that _{j}_{1|j} are asymptotically independent (refer to _{k}_{|j} has the asymptotic distribution_{k}_{|j} in the traditional model comparison

In the forward search procedure, we first apply marginal search to find the most significant SNP among models (3). Based on the selected SNP, we then fit models (4) in the second step to find the SNPs that have strong joint effects, while controlling for ^{*}_{i}_{ = 1,2}{|_{i}_{12},_{1},_{2}) is the PDF of (_{12},_{1},_{2})′ given in (11), ^{*}^{*}_{1},_{2})′ of random vector (_{1},_{2})′, so it is easy to implement the power calculation with Monte Carlo integration.

Note that ^{*}^{*}

When

Under power definition (B), the power of forward model selection method is the sum of _{A}_{B}_{1},_{2}) is the PDF of joint distribution of (_{1},_{2})′ given in (9). Defining

For each _{i}_{|k} and _{k}^{*}_{i}_{|j}, _{1}_{2}_{1|j},_{2|j}) is the PDF of (_{1}_{2}_{1|j},_{2|j})′ given in (17),_{k}_{|j}, 3≤

To demonstrate how to evaluate the power of model selection methods in the empirical analysis, we have applied our approach in a real study example. In this example, the simple regression model on _{1}, _{2}. To estimate the variance of random error, note that_{3} and the corresponding estimators

Supplementary Note for proofs and arguments, distributions of test statistics, extended comparisons of power for model selection methods, and formulas for distribution parameters of test statistics.

(0.91 MB PDF)

We are grateful to Yale University Biomedical High Performance Computing Center for computation support. We thank Dr. Joshua Sampson and Dr. Yedan Zhang for their comments on the paper.