Rapid and Accurate Multiple Testing Correction and Power Estimation for Millions of Correlated Markers

With the development of high-throughput sequencing and genotyping technologies, the number of markers collected in genetic association studies is growing rapidly, increasing the importance of methods for correcting for multiple hypothesis testing. The permutation test is widely considered the gold standard for accurate multiple testing correction, but it is often computationally impractical for these large datasets. Recently, several studies proposed efficient alternative approaches to the permutation test based on the multivariate normal distribution (MVN). However, they cannot accurately correct for multiple testing in genome-wide association studies for two reasons. First, these methods require partitioning of the genome into many disjoint blocks and ignore all correlations between markers from different blocks. Second, the true null distribution of the test statistic often fails to follow the asymptotic distribution at the tails of the distribution. We propose an accurate and efficient method for multiple testing correction in genome-wide association studies—SLIDE. Our method accounts for all correlation within a sliding window and corrects for the departure of the true null distribution of the statistic from the asymptotic distribution. In simulations using the Wellcome Trust Case Control Consortium data, the error rate of SLIDE's corrected p-values is more than 20 times smaller than the error rate of the previous MVN-based methods' corrected p-values, while SLIDE is orders of magnitude faster than the permutation test and other competing methods. We also extend the MVN framework to the problem of estimating the statistical power of an association study with correlated markers and propose an efficient and accurate power estimation method SLIP. SLIP and SLIDE are available at http://slide.cs.ucla.edu.

of the permutation. Letp + h andp − h be the observed frequency of the haplotype h in the permuted cases and controls respectively. Let q ih be the conditional probability observing the minor allele at marker i given the haplotype h. q ih can be estimated from the reference dataset, and is invariant between cases and controls under the null. Then, the test statistic at ungenotyped marker i is ∼ N (0, 1) .
S i differs from the statistic of Zaitlen et al. [1] or Nicolae [2] by a factor of 2N −1

2N
due to the withoutreplacement sampling of the permutation test.
Now we consider the covariance between statistics S i at ungenotyped marker i and S j at ungenotyped marker j. The covariance between a genotyped marker and a ungenotyped marker can be straightforwardly derived, because a genotyped marker can always be considered as a ungenotyped marker with a nearby haplotype of size 1 which is the marker itself. Let h be the one of k nearby haplotypes defined for marker i and h be the one of k nearby haplotypes defined for marker j. Let p hh be the frequency of the overall sample chromosomes which have both haplotype h and haplotype h . Letp + hh andp − hh be the observed frequencies in the cases and controls after random permutation. Similarly to the derivation in the main text, Test for imputed genotypes based on posterior probabilities A different type of imputation statistic is the test for imputed genotypes based on the posterior probabilities. Since many imputation softwares [3] provide the posterior probabilities for genotypes, we will consider the genotypic test instead of the allelic test. We derive the covariance by using the score statistic. Unambiguous genotypes If the genotypes are unambiguous, the standard score test can be used as in Seaman and Müller-Myhsok [4]. Assume N/2 cases and N/2 controls. Let Y i be the binary trait of individual i. Let Z i ∈ {0, 1, 2} M be the vector of M marker genotypes of individual i. Let α be the intercept term, and let β be the vector of coefficients describing the effect of the genotypes on the trait. Under the logistic regression model such that The score and the information matrix are Letα 0 be the maximum likelihood estimate of α under the null hypothesis of β = β 0 = 0, which is zero for this balanced case/control study. The score and the information matrix for the genetic markers are obtained by correcting for the intercept term and apply θ 0 = (α 0 , β 0 ) = (0, 0), The score test statistic U T β I −1 β U β asymptotically follows a χ 2 df=M distribution. The test statistic at marker ∼ N (0, 1) is equivalent to the Cochran-Armitage trend test statistic. The covariance between two statistics S j and S k is , which is the Pearson correlation coefficient r.
Ambiguous genotypes (imputation) If the genotypes are ambiguous and only the posterior probabilities are given, we should take into account the uncertainty in the score statistic as in Marchini et al. [3] and Schaid et al. [5]. Let E p [·] be the expectation over the posterior distribution.
As described in Louis [6], the score and the information matrix with missing data are where U and I are the score and information when we assume the data are complete. The first term, , is the variance of assuming a complete dataset, and the second term is the penalty of having uncertainty in the data.
Under the null, for this balanced case/control study,α 0 = 0 andỸ i = 1 2 . Thus, follows the standard normal distribution. The covariance between S j and S k is given by For this balanced case/control study, since Y i − 1 2 2 = 1 4 , the formula simplifies to Thus, the covariance is computable from the posterior probabilities.
However, if the study is unbalanced, computing covariance requires the knowledge of the joint posterior probabilities between markers (E p Z i(j) Z i(k) ). To the best of our knowledge, no current imputation software provides the joint distribution. We assume that an algorithm can be developed to provide the joint distribution for a given window size W , however this will be beyond the scope of this paper.

S2 Details of sliding-window procedure in SLIDE
Then by the chain rule, Using this equation, the following procedure samples a single M -dimensional random sample vector, based on the standard formula for conditional mean and variance in the MVN.
1. SampleŜ 1 from N (0, 1) 4. Let Σ si,si be a sub-matrix of Σ from index s to index i in both rows and columns, and let v si,(i+1) be the sub-column vector of Σ with row index s to i, and column index i + 1.
6. i ← i + 1 and repeat from step 3 while i < M We repeat this procedure to draw R sample vectors. LetŜ j i be the sample statistic for marker i in the j'th sample vector. The set of statistics {Ŝ j i } approximates the MVN. At step 5, we can reduce the computational overhead of repeatedly drawing random sample vectors by storing v T si,(i+) V −1 si,si and σ 2 for each marker.

S3 Conditional probabilities in cases and controls
Let c denote the causal SNP, A denote the marker, and + denote the diseased status. We show that the conditional probability, P A|c , does not depend on the disease status and thus is invariant among case, control, and overall populations.
The result goes along with our intuition; since the causal SNP is the only factor which causes the difference in haplotype distribution between cases and controls, once the causal SNP is conditioned, there is no more difference in haplotype distribution between cases and controls.

S4 P-value correction in Chromosome 22 of WTCCC data: genotype data and unbalanced study
In the main text of our article, we perform the p-value correction simulation using the chromosome 22 of the WTCCC dataset. We perform the same experiment, but instead of using the phased haplotype data assuming a balanced case/control study, we use the unphased genotype data assuming a unbalanced case/control study. We split the data into 2,934 controls and 1,928 cases. We correct ten different pointwise p-values using the permutation test, the Bonferroni correction, Keffective, and SLIDE. All other settings are the same as the simulation in the main text. Figure S2 shows that the average error rate of SLIDE is only 2%, while the error rate of the Bonferroni correction and Keffective are 67% and 16% respectively. Unless, Sasieni [7] showed that the allelic test may be biased.
Specifically, if the number of genotypes in cases and controls at a SNP are given as follows, # of minor allele 0 1 2 the allelic test statistic is and the genotypic test statistic is While the numerators of the two statistics are identical, the variances (denominators) differ. Devlin and Roeder [8] showed that Thus, if the SNP does not follow HWP (n 2 1 = 4n 0 n 2 ), the variance of S G is 1 while the variance of S H is ρ, which can be greater than or less than 1. This non-unit variance is caused by the fact that the cell count in the allelic 2 × 2 contingency table does not follow the hypergeometric distribution. For this reason, Sasieni [7] recommends against the use of allelic test for unphased genotype data.
For estimating pointwise p-value, however, widely used software such as PLINK [9] allows the use of allelic test for genotype data, because the use of permutation test or exact test provides a kind of protection for this deviation. Since the permutation test is only concerned about relative comparisons between statistics, as long as all statistics are scaled with a constant factor (ρ), the result is the same.
Thus, the allelic test statistic may be safely used for unphased genotype data for estimating pointwise p-values.
Nevertheless, for estimating corrected p-values, the permutation test does not provide this kind of protection. When estimating pointwise p-value, we consider each SNP separately. For each single SNP, S H = ρS G . That is, the relative order between the statistics obtained by multiple permutations is the same for the genotypic test and the allelic test. On the other hand, for estimating corrected p-values, we consider each SNP simultaneously. Permutation can be thought of as a procedure selecting the maximum statistic max(S H ) or max(S G ) among all markers at each permutation. Every SNP typically has a different ρ, because n 0 , n 1 , and n 2 vary between SNPs. Therefore, max(S H ) and max(S G ) do not have such a relationship as S H = ρS G . That is, the relative order between the statistics obtained by multiple permutations may not be the same for the genotypic test and the allelic test. In the perspective of the MVN, while S G follows the standard MVN whose variances are all 1 for every coordinate, S H follows a "squeezed" MVN whose variances vary by each coordinate. Thus, the outside-rectangle probability of the MVN, which is considered an approximation of the corrected p-value, will be different from what we expect for this squeezed MVN. In other words, it may be difficult to accurately correct for multiple testing if we use the allelic test for unphased genotype data, even if we use the permutation test.
In fact, this error occurs because we are using the max-T style permutation which selects maximum statistic at each permutation. The goal is to identify the most significant result among all markers at each permutation. We can identify the most significant marker by performing min-P style permutation which involves performing an additional permutation to estimate each pointwise p-value. Min-P permutation is able to identify the most significant pointwise p-value even if the distribution of statistic is not identical among the markers. The use of an allelic test for genotype data makes the distribution of statistic not identical between markers by changing the variance of the normal distribution. Thus, the maximum statistic (under max-T) sometimes does not coincide with the most significant pointwise p-value. Nevertheless, the use of min-P permutation requires much more computational time than max-T permutation, which is impractical for large datasets.

Simulation
We first use the unphased genotype data of the WTCCC T2D case/control chromosome 22 dataset (∼ 5K SNPs over 4,862 individuals). We perform the permutation test using two different test statistics: allelic test statistic and genotypic test statistic. We do not observe a large difference in corrected p-values between the two approaches. This is possibly because the deviation from the HWE is not very large, with an average |F ST | of .012.
We then construct a smaller dataset. We use 5,261 SNPs in the chromosome 22, but instead of using the WTCCC data, we construct a simulated dataset from the 60 parental individuals in the HapMap CEU population data. We copy the genotypes of individuals two times each to get 120 cases and 120 controls. This duplication procedure simulates Hardy-Weinberg disequilibrium by violating the random mating assumption. The average |F ST | of this dataset is as high as .103.
Then we simulate two different levels of quality control (QC). We perform χ 2 test for HWE for each SNP. In one experiment, we exclude 272 SNPs with HWE p < .0001 leaving 4,989 SNPs. In another experiment, we exclude 1,479 SNPs with HWE p < .05 leaving 3,782 SNPs. The average |F ST | of the two datasets are still high as .090 and .059. This is because passing the HWE test does not always mean that the SNP follows HWP.
Then we perform the permutation test using the genotypic test and the permutation test using the allelic test. We consider the p-values obtained by the permutation test using the genotypic test as the gold standard and plot the ratio between the gold standard and the p-values obtained by the permutation test using the allelic test, for each dataset of no QC, QC p<.0001, and QC p<.05. Figure S3 shows that using the allelic test for the unphased genotype data can result in inaccurate corrected p-values even closer to the Bonferroni correction than to the gold standard. After QC, the error is reduced, but a considerable amount of error still remains.
In this experiment, the use of allelic test results in conservative estimates. This can happen if, for example, the genotype calling error rate is much higher for the heterozygous allele than for the homozygous allele. However, the direction of the error can also be anti-conservative.
Although our simulation is unrealistic, it shows that the use of allelic test for unphased genotype data can result in inaccurate multiple testing correction, even after QC excluding SNPs which do not follow HWE. Our experiment using the WTCCC data shows that a large study with thousands of individuals is not much affected by this error, possibly because SNPs approximately follow HWP, but a study with hundreds of subjects can be affected. A simple solution to avoid this error is, given unphased genotype data, to use a genotypic test to obtain both pointwise p-values and corrected p-values and an allelic test to obtain pointwise p-values but not the corrected p-values until the data is phased. Another solution is to perform a min-P style permutation, but this will take a large amount of computational time.