Conceived and designed the experiments: BH EE. Performed the experiments: BH HMK. Analyzed the data: BH HMK. Wrote the paper: BH HMK EE.
The authors have declared that no competing interests exist.
With the development of high-throughput sequencing and genotyping technologies, the number of markers collected in genetic association studies is growing rapidly, increasing the importance of methods for correcting for multiple hypothesis testing. The permutation test is widely considered the gold standard for accurate multiple testing correction, but it is often computationally impractical for these large datasets. Recently, several studies proposed efficient alternative approaches to the permutation test based on the multivariate normal distribution (MVN). However, they cannot accurately correct for multiple testing in genome-wide association studies for two reasons. First, these methods require partitioning of the genome into many disjoint blocks and ignore all correlations between markers from different blocks. Second, the true null distribution of the test statistic often fails to follow the asymptotic distribution at the tails of the distribution. We propose an accurate and efficient method for multiple testing correction in genome-wide association studies—SLIDE. Our method accounts for all correlation within a sliding window and corrects for the departure of the true null distribution of the statistic from the asymptotic distribution. In simulations using the Wellcome Trust Case Control Consortium data, the error rate of SLIDE's corrected
In genome-wide association studies, it is important to account for the fact that a large number of genetic variants are tested in order to adequately control for false positives. The simplest way to correct for multiple hypothesis testing is the Bonferroni correction, which multiplies the
Association studies have emerged as a powerful tool for discovering the genetic basis of human diseases
There are two common versions of the multiple testing correction problem: per-marker threshold estimation and p-value correction. In a typical study which collects
While the Bonferroni (or Šidák) correction provides the simplest way to correct for multiple testing by assuming independence between markers, permutation testing is widely considered the gold standard for accurately correcting for multiple testing
In this paper, we correct for multiple testing using the framework of the multivariate normal distribution (MVN). For many widely used statistical tests, the statistics over multiple markers asymptotically follow a MVN
(A) Correlations between 10 markers are depicted. (B) Correlations taken into account by a block-wise strategy with a block size of 5. The ignored correlations are shown as black. (C) Correlations taken into account by a sliding-window approach with a window size of 5. The ignored correlations are shown as black.
We propose a method for multiple testing correction called SLIDE (a
Second, SLIDE takes into account the phenomenon that the true null distribution of a test statistic often fails to follow the asymptotic distribution at the tails of the distribution. It is well known that if the sample size is small, the true distribution and the asymptotic distribution show a discrepancy
With these two advances, SLIDE is as accurate as the permutation test. In our simulation using the WTCCC dataset
The MVN framework for multiple testing correction is very general, allowing it to be applied to many different contexts such as quantitative trait mapping or multiple disease models
In addition to multiple testing correction, we extend the MVN framework to solve the problem of estimating the statistical power of an association study with correlated markers. There are two traditional approaches to this problem: a simulation approach constructing case/control panels from the reference dataset
The power estimation problem can be solved within the MVN framework because the test statistic under the alternative hypothesis follows a MVN centered at the non-centrality parameters (NCP). The vector of the NCPs turns out to be approximately proportional to the vector of correlation coefficients (
Seaman and Müller-Myhsok
Another approach for multiple testing correction is to estimate the effective number of tests from eigenvalues of the correlation matrix
Connecting the multiple testing correction and power estimation problems leads to the insight that the per-marker threshold estimated from the reference dataset for estimating power can be used as a precomputed approximation to the true per-marker threshold for the collected samples. In simulations using the WTCCC control data, we show that the per-marker threshold estimated from the HapMap CEU population data approximately controls the false positive rate.
Our methods SLIP and SLIDE require only summary statistics such as the correlation between markers within the window size, allele frequencies, and the number of individuals. Therefore unlike the permutation test, our method can still be applied even if the actual genotype data is not accessible. Our methods are available at
For many widely used statistical tests, the vector of statistics over multiple markers asymptotically follows a MVN
Assume we permute
Let
Let
The area outside the rectangle is the critical region. (A) Under the null hypothesis, the MVN is centered at zero. The outside-rectangle probability is the corrected p-value (or the significance level). (B) Under the alternative hypothesis, the MVN is shifted by the non-centrality parameter. The outside-rectangle probability is power.
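The outside-rectangle probability under the null can be approximated by sampling. The following is a minimal sketch rather than the authors' implementation: it assumes a toy correlation structure in which the correlation between markers i and j is r**|i-j|, and the function names `sample_ar1` and `corrected_p_value` are hypothetical.

```python
import random
from statistics import NormalDist


def sample_ar1(n_markers, r, rng):
    """Draw one vector of null z-statistics with correlation r**|i-j| (toy model)."""
    z = [rng.gauss(0.0, 1.0)]
    for _ in range(n_markers - 1):
        z.append(r * z[-1] + (1.0 - r * r) ** 0.5 * rng.gauss(0.0, 1.0))
    return z


def corrected_p_value(pointwise_p, n_markers, r, n_samples, rng):
    """Estimate the null probability that any |z_i| reaches the per-marker threshold."""
    # Two-sided per-marker threshold: the half-width of the rectangle.
    t = NormalDist().inv_cdf(1.0 - pointwise_p / 2.0)
    hits = 0
    for _ in range(n_samples):
        if max(abs(v) for v in sample_ar1(n_markers, r, rng)) >= t:
            hits += 1
    return hits / n_samples


rng = random.Random(0)
p, m = 0.01, 20
bonferroni = min(1.0, m * p)          # assumes independence, first-order bound
sidak = 1.0 - (1.0 - p) ** m          # exact under independence
independent = corrected_p_value(p, m, 0.0, 20000, rng)
correlated = corrected_p_value(p, m, 0.9, 20000, rng)
print(bonferroni, sidak, independent, correlated)
```

With uncorrelated markers the estimate should track the Šidák value up to sampling noise, while strong correlation yields a smaller corrected p-value than either independence-based correction.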
If the asymptotic MVN closely approximates the true distribution of the statistic, then Formula (3) will provide an accurate multiple testing correction; this has been shown to be true for small regions such as those tested in candidate gene studies
However, we observe that this discrepancy can appear in genome-wide association studies, in spite of the large sample size, because of the extremely small per-marker threshold (or pointwise p-value) caused by the large number of tests. At its extreme tails, the asymptotic distribution is typically thicker than the true distribution.
This phenomenon can be illustrated with a single-SNP experiment using the
Given a
One may argue that this phenomenon is not important because it mostly occurs at rare SNPs (MAF≤5%) where current studies already have low power to detect associations. However, an incorrect approximation of the distributions at some SNPs affects the corrected p-values of all SNPs. This is because the corrected p-value depends on the distributions of the statistics at all of the SNPs, as it is defined as the probability of observing significant results at any marker. For example, suppose we approximate 10 independent normal distributions at 10 independent SNPs. Assume that we correctly approximate 9 distributions, but for one distribution we think that the tails are thicker than the true distribution by a factor of 100. For any given pointwise p-value
One can avoid this type of error in corrected p-values by using a method not dependent on the asymptotic approximation, such as the permutation test, or by eliminating rare SNPs in the analysis. It may be sensible to remove rare SNPs with a few or tens of minor allele counts, if the power is very low or if the SNPs are error-prone in their calling. However,
SLIDE corrects for multiple testing by using a sliding-window approach to approximate the MVN and then scaling the MVN to approximate the true distribution of the statistic. There are two underlying intuitions. First, a sliding window approach takes into account most of the correlations in the data due to the local LD structure. Second, even though the asymptotic MVN shows a departure from the true distribution at the tail, the scaled MVN will closely approximate the true distribution because the covariance between the statistics is identical in both the true distribution and the MVN. (The covariance derivation does not involve the central limit theorem.)
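The sliding-window idea can be illustrated with a small sampling sketch. It assumes that each statistic is drawn from its conditional normal distribution given only the previous w statistics, and that the correlation matrix within a window is nonsingular. The helper names (`solve`, `conditional_params`, `sample_statistics`) and the toy correlation function are hypothetical and are not taken from the SLIDE software.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def conditional_params(corr, i, w):
    """Regression weights and residual variance of marker i given the previous w markers."""
    prev = list(range(max(0, i - w), i))
    if not prev:
        return prev, [], 1.0
    s11 = [[corr(j, k) for k in prev] for j in prev]   # correlations inside the window
    s12 = [corr(j, i) for j in prev]                   # correlations with the new marker
    weights = solve(s11, s12)
    var = 1.0 - sum(a * s for a, s in zip(weights, s12))
    return prev, weights, var


def sample_statistics(corr, n_markers, w, rng):
    """Draw one vector of null statistics, conditioning only within the window."""
    z = []
    for i in range(n_markers):
        prev, weights, var = conditional_params(corr, i, w)
        mean = sum(a * z[j] for a, j in zip(weights, prev))
        z.append(rng.gauss(mean, max(var, 0.0) ** 0.5))
    return z


def toy_corr(j, k):
    """Hypothetical local-LD pattern: correlation decays as 0.5**|j-k|."""
    return 0.5 ** abs(j - k)


prev, weights, var = conditional_params(toy_corr, 5, 3)
z = sample_statistics(toy_corr, 50, 3, random.Random(0))
```

Correlations between markers farther apart than w never enter the computation, which is the approximation the local LD assumption is meant to justify.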
Under the local LD assumption, the statistics at distant markers are uncorrelated. Thus, given a window size
The probability density function of the asymptotic bivariate MVN is depicted as a grid. The probability mass function of the true distribution is depicted as a histogram. (A) The asymptotic distribution often shows a discrepancy from the true distribution. (The discrepancy is exaggerated in this figure.) (B) After scaling down the asymptotic distribution, the discrepancy is removed.
The level of discrepancy between the asymptotic and true distributions is large at the tails of the distribution compared to the center. Thus, in order to scale the asymptotic distribution to fit to the true distribution, we cannot multiply the entire distribution by a single scaling factor, but must instead compute the scaling factor for each different threshold.
Given a
Note that, for unbalanced case/control studies, the level of discrepancy is not symmetric at the upper and lower tails of the normal distribution. Thus, we should compute the scaling factor for each tail of the normal distribution separately.
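To make the per-threshold, per-tail scaling concrete, the sketch below computes, for a single SNP, the ratio of an exact tail probability to the asymptotic normal tail probability. It is an illustration under stated assumptions, not the procedure from the paper: it assumes an allelic test on haplotype counts whose exact null distribution is hypergeometric, and the function names `allelic_z` and `scaling_factors` are hypothetical.

```python
from math import comb
from statistics import NormalDist


def allelic_z(x, k, n_case, n_ctrl):
    """z-statistic when x of the k minor alleles fall among the case haplotypes."""
    p = k / (n_case + n_ctrl)
    se = (p * (1.0 - p) * (1.0 / n_case + 1.0 / n_ctrl)) ** 0.5
    return (x / n_case - (k - x) / n_ctrl) / se


def scaling_factors(t, k, n_case, n_ctrl):
    """Return (upper, lower): exact tail probability divided by the normal tail at t."""
    total = comb(n_case + n_ctrl, n_case)
    upper = lower = 0.0
    # Enumerate every possible split of the k minor alleles between cases and controls.
    for x in range(max(0, k - n_ctrl), min(k, n_case) + 1):
        prob = comb(k, x) * comb(n_case + n_ctrl - k, n_case - x) / total
        z = allelic_z(x, k, n_case, n_ctrl)
        if z >= t:
            upper += prob
        if z <= -t:
            lower += prob
    asymptotic = NormalDist().cdf(-t)   # one-sided normal tail beyond t
    return upper / asymptotic, lower / asymptotic


balanced = scaling_factors(3.0, 10, 2000, 2000)
unbalanced = scaling_factors(3.0, 10, 1000, 3000)
print(balanced, unbalanced)
```

In a balanced design the two factors coincide by symmetry, whereas an unbalanced design generally gives different values for the two tails, which is why each tail needs its own factor.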
A discussion of association study power depends on many arbitrary assumptions. Though our framework can be extended to other assumptions, in this paper, we adopt those used in De Bakker
For complex diseases, assumption (1) can still be applied if each causal SNP marginally contributes to the risk. Assumptions (4) and (5) can lead to an overestimation of power, especially if the markers are chosen using the reference dataset
Finally, we assume that the investigator has determined the number of individuals in the study and the significance threshold.
We extend the MVN framework to the power estimation problem. Consider a study design which defines markers and plans to collect
The case/control study can be thought of as a procedure which draws
If the marker and the causal SNP are distinct (a condition called
Collecting cases (or controls) is equivalent to drawing
In practice, the approximation in Formula (6) usually leads to an accurate power estimate. However, if the relative risk is very large, Formula (5) can be computed exactly and used as follows. By Formula (4), we can calculate
Let
Power depends on the per-marker threshold
Given
Our method SLIP estimates the power of a study design using the MVN framework. First, like SLIDE, SLIP estimates the per-marker threshold in Formula (8) using a sliding window approach. Then SLIP samples causal SNPs, approximates the alternative MVN to estimate the per-causal-SNP power, and averages per-causal-SNP powers over sampled causal SNPs.
Since power is typically larger (e.g. 80%) than a p-value (e.g. .01), a small error in the per-marker threshold barely affects the estimate. Thus, the error caused by using the asymptotic approximation is negligible. Also, given a causal SNP, we can assume that nearby markers (e.g. those within ±1 Mb) can capture most of the statistical power due to local LD. Thus, we can set a window size and only use the markers within that window to estimate the alternative MVN, which will be a
The computation becomes very efficient if we use approximation (6). Since approximation (6) states that the covariance is the same for the null and alternative MVNs, we can re-use the null MVN constructed for estimating the per-marker threshold, by shifting it by the NCP to get the alternative MVN. If we re-use the random samples this way, the constructed random samples will not be completely random, as they depend on each other. However, we observe that the inaccuracy caused by this dependency is negligible if we generate a large number of samples for the null MVN. If we re-use the samples, then with almost no additional computational cost, SLIP can generate power estimates for multiple relative risks or study sample sizes, since these only change the NCP.
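The sample re-use described above can be sketched as follows. This is a toy illustration under stated assumptions, not the SLIP implementation: the null samples come from a hypothetical correlation pattern r**|i-j|, the NCP vector is taken to be proportional to each marker's correlation with an assumed causal marker, and all function names are hypothetical.

```python
import random


def null_samples(n_markers, r, n_samples, rng):
    """Null MVN draws with correlation r**|i-j| between markers (toy LD model)."""
    out = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0)]
        for _ in range(n_markers - 1):
            z.append(r * z[-1] + (1.0 - r * r) ** 0.5 * rng.gauss(0.0, 1.0))
        out.append(z)
    return out


def per_marker_threshold(samples, alpha):
    """|z| cutoff exceeded somewhere by roughly a fraction alpha of the null draws."""
    maxima = sorted(max(abs(v) for v in z) for z in samples)
    return maxima[int((1.0 - alpha) * len(maxima))]


def power(samples, threshold, ncp):
    """Shift the same null draws by the NCP vector and count rejections."""
    hits = sum(
        1 for z in samples
        if max(abs(v + d) for v, d in zip(z, ncp)) >= threshold
    )
    return hits / len(samples)


rng = random.Random(1)
n_markers, r, causal = 10, 0.8, 4
samples = null_samples(n_markers, r, 5000, rng)     # generated once
threshold = per_marker_threshold(samples, 0.05)

powers = []
for ncp_at_causal in (2.0, 3.0, 4.0):               # e.g. different relative risks
    # Assumed NCP vector: proportional to the correlation with the causal marker.
    ncp = [ncp_at_causal * r ** abs(i - causal) for i in range(n_markers)]
    powers.append(power(samples, threshold, ncp))
print(threshold, powers)
```

Only the NCP vector changes between the three power estimates; the null draws and the threshold are computed a single time.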
Multiple testing correction is generally performed using the collected data and not the reference data. Recall that the difference between the per-marker threshold for multiple testing correction (
We downloaded the HapMap genotype data (release 23a, NCBI build 36) from the HapMap project web site
The URL for methods presented herein is as follows:
In order to compare how accurately and efficiently different methods correct multiple testing, we simulate a study using the WTCCC data
First, we perform 10 M permutations to correct ten different pointwise p-values from 10^-4 to 10^-7, whose corrected p-values are from .04 to .0004. We will consider the corrected p-values by the permutation test as the gold standard, and call them
We use SLIDE, DSA, mvtnorm, RAT, and Keffective to correct p-values. DSA and mvtnorm are MVN-based methods using the block-wise strategy. We use a constant block size (window size) of 100 markers for all methods. Since RAT defines the window size in terms of physical distance, we use 600 kb, the average distance of 100 markers in the dataset. We use the -X -e2 option for RAT for an exact computation of the importance sampling procedure as suggested by Kimmel and Shamir
We use the WTCCC T2D case/control chromosome 22 data. Approximated time is for correcting 10 p-values with respect to 500 K SNPs assuming 100 K permutations. The dashed lines denote the interval where an accurate method's estimate will be found more than 95% of the time.
DSA is conservative with an average error of 19%. This is equivalent to reducing the error by only about two thirds relative to the Bonferroni correction. The reasons for the errors include the block-wise strategy ignoring inter-block correlations, and not correcting for the error caused by using the asymptotic approximation. In addition to these errors, mvtnorm suffers from an anti-conservative bias which grows as the p-value becomes more significant. This is because the p-value in each block is too small for mvtnorm to accurately estimate. Our simulation shows that this anti-conservative bias increases with the number of sampling iterations (data not shown).
Keffective is more accurate and faster than DSA and mvtnorm. The average error of Keffective is 10.6%. Note that Keffective is optimized to provide an efficient approximation for the effective number of tests within ∼10% of error. Thus, Keffective is achieving its goal.
Both RAT and SLIDE show accurate estimates with the same average error of 0.8%. Thus, the error rate of SLIDE's corrected p-values is more than 10 times smaller than the error rate of Keffective's corrected p-values, more than 20 times smaller than the error rate of DSA's corrected p-values, and 80 times smaller than the error rate of the Bonferroni-corrected p-values.
We now explore how each source of error in MVN-based methods – the block-wise strategy and the use of the asymptotic approximation without correction – affects the error rate. We remove 1,048 rare SNPs (MAF<.05) and perform multiple testing correction with respect to the remaining 4,515 common SNPs. When considering only common SNPs, the error caused by using the asymptotic approximation will be much smaller (See
Procedure | # Permutations | Permutation | SLIDE | DSA | Mvtnorm | RAT | Keffective |
Correcting 1 p-value | 10 K | 16 d | 0.6 h | 1.4 h | 0.7 h | 7 h | 19 h |
Correcting 10 p-values | 10 K | 16 d | 0.6 h | 14 h | 7 h | 70 h | 19 h |
Correcting 1 p-value | 100 K | 160 d | 6 h | 14 h | 7 h | 72 h | 19 h |
Correcting 10 p-values | 100 K | 160 d | 6 h | 140 h | 70 h | 30 d | 19 h |
Correcting 1 p-value | 1 M | 4 years | 3 d | 6 d | 3 d | 30 d | 19 h |
Correcting 10 p-values | 1 M | 4 years | 3 d | 60 d | 30 d | 300 d | 19 h |
Often anti-conservative.
All values are extrapolated from the chromosome 22 results.
In many settings, SLIDE is 500 times faster than the permutation test and considerably faster than the other methods. The running time of SLIDE, Keffective, DSA, and mvtnorm is approximately independent of the study sample size, whereas the time of the permutation test is linearly dependent on it. Thus, the efficiency gain of these methods relative to the permutation test will increase as the study size increases. We summarize the accuracy and efficiency of the tested methods in
We use the WTCCC T2D case/control chromosome 22 data. The vertical axis is the average error in corrected p-values relative to the Bonferroni correction. The horizontal axis is the approximated time for correcting 10 genome-wide p-values for 500 K SNPs assuming 100 K permutations.
Here we describe a few details of our running time measurements. We used our own C implementation for the permutation test. However, we expect that the measured time will be similar to that for commonly used software such as PLINK
Using the same WTCCC chromosome 22 dataset, we perform an additional experiment for the unphased genotype data using the trend test, assuming unbalanced case/controls. We find SLIDE achieves similar accuracy (See Text S4 and
In this experiment, we assume that a single threshold is being estimated to decide which findings to follow up, instead of correcting each pointwise p-value. We estimate the per-marker threshold corresponding to a significance threshold of .05. We use the 2.7 million polymorphic SNPs in the HapMap CEU data over the whole genome, instead of a single chromosome.
We generate a simulated dataset using the phased haplotype data of 60 HapMap CEU parental individuals. Specifically, we create a new haplotype by randomly shuffling the 120 chromosomes so that the average length of a haplotype segment is approximately 1 Mb. We mutate (flip) each SNP with probability 10^-5. We create 2,000 cases and 2,000 controls by randomly pairing 8,000 such haplotypes. Although this model is arbitrary, it suffices to compare different methods. The results of the relative comparison between methods do not greatly vary using different parameters, such as a different average haplotype segment length (data not shown).
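One possible reading of this simulation procedure is sketched below. The segment-switching rule is an assumption made for illustration: the source haplotype is resampled between adjacent markers with probability 1 - exp(-gap / mean segment length), which gives segments of roughly the requested average length. The function names and the tiny reference panel are hypothetical.

```python
import math
import random


def mosaic_haplotype(ref_haps, positions, mean_segment_bp, flip_prob, rng):
    """Build one haplotype as a mosaic of reference haplotypes with rare allele flips."""
    src = rng.randrange(len(ref_haps))
    hap = []
    for i, pos in enumerate(positions):
        if i > 0:
            gap = pos - positions[i - 1]
            # Assumed switching model: longer gaps make a change of source more likely.
            if rng.random() < 1.0 - math.exp(-gap / mean_segment_bp):
                src = rng.randrange(len(ref_haps))
        allele = ref_haps[src][i]
        if rng.random() < flip_prob:
            allele = 1 - allele            # mutate (flip) the 0/1 allele
        hap.append(allele)
    return hap


def simulate_individuals(ref_haps, positions, n_individuals, rng,
                         mean_segment_bp=1_000_000, flip_prob=1e-5):
    """Create 2 * n_individuals mosaic haplotypes and pair them at random."""
    haps = [
        mosaic_haplotype(ref_haps, positions, mean_segment_bp, flip_prob, rng)
        for _ in range(2 * n_individuals)
    ]
    rng.shuffle(haps)
    return [(haps[2 * i], haps[2 * i + 1]) for i in range(n_individuals)]


# Tiny hypothetical reference panel: 4 haplotypes over 6 markers (0/1 alleles).
reference = [
    [0, 1, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 1],
]
marker_positions = [0, 200_000, 700_000, 1_500_000, 1_600_000, 3_000_000]
people = simulate_individuals(reference, marker_positions, 8, random.Random(2))
```

Cases and controls would then be formed by splitting the simulated individuals into two groups of the desired sizes.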
We compare the permutation test, Keffective, and SLIDE. RAT is not efficient for this setting because it is optimized for very significant p-values, much smaller than .05. We expect that the results of DSA or mvtnorm will be similar to or worse than those of Keffective, as in the previous experiment.
We perform 10 K permutations for this experiment. We run SLIDE with 10 K samplings and window size 100. We run Keffective with window sizes 100 and 10.
A dataset of 2,000 cases and 2,000 controls is generated from the HapMap CEU data. Using each method, we estimate the per-marker threshold corresponding to a significance level of .05. The effective number of tests is simply .05 divided by the per-marker threshold. The dashed lines denote the interval where an accurate method's estimate will be found more than 95% of the time.
SLIDE estimates the effective number as 1,038,888 (2.8% error), which is within the 95% interval. This small anti-conservative error is due only to stochastic error and not to an inherent bias, since the estimate becomes highly accurate, 1,068,445 (0.03% error), if we increase the number of samples to 100 K.
Keffective estimates the effective number as 1,409,811 (32% error) with window size 10 and as 1,252,986 (17% error) with window size 100. Unlike the previous experiment, for this higher-density marker dataset, Keffective no longer keeps the error within 10%. We do not expect that a larger window size will increase the accuracy of Keffective, because the error does not seem to be due to the missing long range correlations, since SLIDE is accurate with the same window size of 100.
The running time is 260 hours for permutation, 10 hours for SLIDE, 10 hours for Keffective with window size 10, and 90 hours for Keffective with window size 100.
Since SLIDE takes into account only correlations within the window size, here we investigate the effect of window size on performance. A reasonable choice for the window size will be the number of markers whose average distance is the average or maximum LD distance in the data. For our experiments, we use the WTCCC T2D case/control chromosome 22 dataset. A large number (10 M) of permutations allows us to find that a pointwise p-value of 1.53×10^-5 corresponds to the corrected p-value .05. We correct this pointwise p-value using SLIDE with various window sizes, and see if the corrected p-values are close to .05.
Using the WTCCC T2D case/control chromosome 22 data, we plot the ratios between the corrected p-value and the permutation p-value for varying window sizes for SLIDE. We use the pointwise p-value corresponding to the permutation p-value .05. The window size zero denotes the Bonferroni correction. The dashed lines denote the interval where an accurate method's estimate will be found more than 95% of the time.
We now examine whether the per-marker threshold estimated from the reference dataset can approximate the true per-marker threshold for a study which may have a different sample correlation structure from the reference dataset. The marker set we use is the SNPs in the Affymetrix 500 K chip over the whole genome.
First, we apply SLIDE to the HapMap data using window size 100, to obtain the per-marker threshold 2.19×10^-7 corresponding to the significance threshold .05. Then, we permute the WTCCC data to estimate the false positive rate given this per-marker threshold. We use the WTCCC 1958 British birth cohort control data, which consists of 1,504 individuals. We randomly permute the dataset 100 K times. We estimate the false positive rate, as the proportion of permutations showing significance given the per-marker threshold, to be .0508. Thus, in this experiment, the per-marker threshold estimated from the reference data controls the false positive rate with only 1.6% relative error. This result shows that, even if the reference population and the target population are slightly different (one from Utah, U.S.A., and the other from Great Britain), the per-marker threshold estimated from the reference data is a reasonable approximation.
We compare four different methods for estimating genome-wide power: standard simulation, null/alternative panel construction, best-tag Bonferroni, and SLIP. We assume a multiplicative disease model with a relative risk of 1.2 and a disease prevalence of .01, and a significance threshold of .05. We use the CEU population data in the HapMap as the reference dataset. We use the genome-wide markers in the Affymetrix 500 K chip and assume a uniform distribution of causal SNPs over all common SNPs (MAF≥.05) in the HapMap.
We first perform the standard simulation, which we will consider as the gold standard. We construct a number of genome-wide ‘alternative’ panels from the HapMap data by randomly assigning a causal SNP for each panel. We permute each panel 1,000 times to estimate the panel-specific per-marker threshold. The power is estimated as the proportion of panels showing significance given its per-marker threshold. Conneely and Boehnke
Another panel construction-based approach is the null/alternative panel construction method. Instead of permuting each of alternative panels, this method constructs another set of ‘null’ panels under the null hypothesis. The null panel gives us a ‘global’ per-marker threshold that can be applied to all alternative panels. Since this method is as accurate as the standard simulation but is more efficient, it is widely used
We apply SLIP and re-use the samples for the null MVN for estimating the alternative MVNs. Lastly, we apply the analytical best-tag Bonferroni method
For the standard simulation, we use 10 K alternative panels. For the null/alternative panel construction method, we use 10 K alternative panels and 10 K null panels. For SLIP, we use 10 K sampling points. For the best-tag Bonferroni method, we use 10 K samples for causal SNPs. For SLIP, we use a window size of 100 markers. For all other methods, we use a window size of 1 Mb.
We use the HapMap CEU reference data. We assume a multiplicative disease model with relative risk 1.2, disease prevalence .01, and a uniform distribution of causal SNPs over common SNPs (MAF≥.05). We use the significance threshold of .05.
Procedure | #cases/controls | Best-tag-Bonf. | SLIP | Null/altern. | Std. simul. |
Estimating power | 1,000/1,000 | 0.1 h | 0.6 h | 36 h | 10 d |
Estimating power | 5,000/5,000 | 0.1 h | 0.6 h | 8 d | 50 d |
Estimating power for 5 different relative risks | 1,000/1,000 | 0.1 h | 0.6 h | 8 d | 50 d |
Estimating power for 5 different relative risks | 5,000/5,000 | 0.1 h | 0.6 h | 40 d | 250 d |
Inaccurate (average error is not within 1%).
SLIDE and SLIP provide efficient and accurate multiple testing correction and power estimation in the MVN framework. SLIDE shows a near identical accuracy to the permutation test by using a sliding-window approach to account for local correlations, and by correcting for the error caused by using the asymptotic approximation. SLIDE can be applied to datasets of millions of markers with many rare SNPs, while other MVN-based methods become inaccurate as more rare SNPs are included. To the best of our knowledge, SLIP is the first MVN-based power estimation method.
Throughout this paper, we considered the classical multiple testing correction controlling family-wise error rate (FWER)
We considered the permutation test as the gold standard for multiple testing correction. The permutation test can be performed in two different ways: at each permutation, we can either assess the maximum statistics among the markers (max-T permutation), or assess the minimum pointwise p-value among the markers by performing another permutation for each marker (min-P permutation)
In Text S5 and
Recently, a different view of multiple testing correction has been introduced
In our experiments, we used a constant block size for the block-wise strategy. In practice, it will be more reasonable to split the region according to the LD blocks. However, this is not always possible because LD blocks are often ambiguous and some blocks can be larger than the maximum block size of the method. For example, if we collect 10 million SNPs, a block size of 1,000 is required to cover 300 kb LD. However, the maximum block size of mvtnorm that allows an accurate estimate is currently 300
Recently, a method called PRESTO
We considered the pairwise correlation between SNPs. There can also be so-called higher-order correlations, such as the correlation between a haplotype and a SNP. For example, even though three SNPs are pairwise independent, the combination of the first two SNPs can be a perfect proxy for the third SNP. However, the multivariate central limit theorem proves that the joint distribution of the test statistics is fully characterized by the matrix of the pairwise correlations. Thus, the effect of the other correlation terms on the joint distribution is asymptotically negligible. Nevertheless, our method is not limited to the SNP test. If our method is applied to the weighted haplotype test
In summary, SLIP and SLIDE are two useful methods for genome-wide association studies which provide accurate power estimation at the design step and accurate multiple testing correction at the analysis step. The software is available as a resource for the research community.
Ratios between the corrected p-values and permutation p-values after rare SNPs are removed. We use chromosome 22 of the WTCCC Type 2 diabetes cases/controls data. Multiple testing is corrected with respect to the 4,515 common SNPs (MAF≥.05).
(0.01 MB PDF)
Ratios between the corrected p-values and permutation p-values for genotype data. We simulate an unphased genotype dataset using the chromosome 22 data of the WTCCC Type 2 diabetes cases/controls data, assuming an unbalanced study of 2,934 controls and 1,928 cases.
(0.01 MB PDF)
Inaccurate multiple testing correction caused by the use of an allelic test for unphased genotype data. We generate a simulated unphased genotype dataset of 120 cases and 120 controls from the HapMap CEU population chromosome 22 data. Then we plot the ratios between the corrected p-values by two different permutations: permutation test using the allelic test statistic, and permutation test using the genotypic test statistic. Quality control is performed by the standard χ² test for HWE.
(0.01 MB PDF)
Rapid and accurate multiple testing correction and power estimation for millions of correlated markers.
(0.14 MB PDF)
We thank Noah Zaitlen for phasing the genotype data and Sean O'Rourke for valuable comments. We are grateful to Alan Genz, Gad Kimmel, Valentina Moskvina and Karl Michael Schmidt for helpful discussions regarding mvtnorm, RAT, and Keffective.