Multi-locus Test Conditional on Confirmed Effects Leads to Increased Power in Genome-wide Association Studies

Complex diseases or phenotypes may involve multiple genetic variants and interactions between genetic, environmental and other factors. Current genome-wide association studies (GWAS) mostly used single-locus analysis and had identified genetic effects with multiple confirmations. Such confirmed single-nucleotide polymorphism (SNP) effects were likely to be true genetic effects and ignoring this information in testing new effects of the same phenotype results in decreased statistical power due to increased residual variance that has a component of the omitted effects. In this study, a multi-locus association test (MLT) was proposed for GWAS analysis conditional on SNPs with confirmed effects to improve statistical power. Analytical formulae for statistical power were derived and were verified by simulation for MLT accounting for confirmed SNPs and for single-locus test (SLT) without accounting for confirmed SNPs. Statistical power of the two methods was compared by case studies with simulated and the Framingham Heart Study (FHS) GWAS data. Results showed that the MLT method had increased statistical power over SLT. In the GWAS case study on four cholesterol phenotypes and serum metabolites, the MLT method improved statistical power by 5% to 38% depending on the number and effect sizes of the conditional SNPs. For the analysis of HDL cholesterol (HDL-C) and total cholesterol (TC) of the FHS data, the MLT method conditional on confirmed SNPs from GWAS catalog and NCBI had considerably more significant results than SLT.


Introduction
Genome-wide association studies (GWAS) have identified genetic variants associated with a number of complex diseases or phenotypes [1], [2], [3] and some of these variants had confirmations from several studies [1]. Published GWAS studies typically used a single-locus test (SLT), in which each variant is tested individually for association with a specific phenotype. Single-locus analysis may not be the best approach in the presence of confirmed SNP effects, because confirmed effects become a component of random residuals and decrease statistical power for detecting new effects if those true effects are omitted in the analysis. Controlling genetic backgrounds in single-marker tests was well explored in the area of QTL mapping including Zeng's CIM [4] and Jansen's multiple regression method [5]. Roeder et al [6] proposed to use linkage genome scan results in GWAS to achieve higher power. In a recent meta-analysis on smoking behavior, a novel SNP was identified with genome-wide significance related to smoking conditional on a known SNP [7]. Conditional analysis was also used in other meta-analysis in GWAS [8]. In this study, we propose a multi-locus test (MLT) that tests each candidate SNP conditional on confirmed SNPs for GWAS analysis to increase the statistical power for detecting new SNP effects, and we demonstrate the MLT method had increased statistical power relative to SLT using analytical formulae derived in this study and using simulation, case studies, and the Framingham Heart Study (FHS) data.

Predicted Statistical Power of MLT and SLT
The multiple linear regression model for the MLT method can be expressed as: where m = the population mean of the phenotypic values, Z i = 16p vector of the p covariates for subject i (i = 1, …, N) to account for environment, population stratification, and other factors; b Z = p61 vector of the partial regression coefficients of the covariates Z i ; G i = (G i,1 , G i,2 , …, G i,s21 ) = 16(S21) vector of the S21 SNP genotypes that were confirmed to be associated with the phenotype for subject i, with G i,j taking values of 0, 1 or 2 according to the number of copies of the minor allele for subject i at SNP j; b G = (S21)61 vector of partial regression coefficients of S21 conditional SNPs; G i,s = the genotype value of the candidate SNP; b s = the partial regression coefficient of the candidate SNP, and e i = random residual that is assumed to follow N(0, s 2 ) normal distribution. The residual variance of Equation 1 is A standard t-test can be used for testing the significance of the candidate SNP based on testing the following hypotheses, H 0 : b s = 0, where b s is the regression coefficient of the candidate SNP and is the Mth element of b = (m, b Z , b G , b s ) and M = 1+p+S. Onesided statistical power was derived in this study for simplicity (twosided t-test statistical power can be obtained similarly but is not considered here). Using a multiple linear regression framework, the power of the one-sided t-test can be formulated as: where t a denotes the a% quantile value for t distribution and f denotes the non-central t-distribution with N2M degrees of freedom and non-centrality parameter: whereŜ S MM = the element at the Mth row and Mth column of variance-covariance matrixŜ S, andŜ S is a M6M variancecovariance matrix ofb b s and can be estimated as: where X is the design matrix in Equation 1.
For SLT, the statistical model is the same as Equation 1 except that the residual term is now a summation of confirmed effects and random residuals. The residual variance for the SLT model is no longer s 2 and has the following mathematical expression: where var(G i ) is calculated based on the G i values defined in Equation 1. Equation 6 shows that the residual variance of the SLT model is inflated over the MLT model of Equation 1 due to the fact that confirmed SNP effects are now in the residual term of Equation 6. Using this inflated estimated residual variance, the variance ofb b s can be estimated by: where c is the element at the (p+2)th row and (p+2)th column of matrix X 'X ð Þ {1 and X is the design matrix for the regression model of SLT. Therefore, the t-test statistic for SLT does not have a t-distribution but a t-distribution divided by a constant t~ffi ffiffiffiffiffiffiffiffiffi ffi cs 2 e î S S MM s . Similar to Equations 3, the power of the one-sided t-test can be formulated as: where t a denotes the a% quantile value for t-distribution with N2p22 degrees of freedom and f denotes the non-central tdistribution with N2p22 degrees of freedom and non-central parameter l.

Evaluation of Statistical Power of MLT and SLT Using Simulation and Real Data
The analytical formulae for statistical power for MLT accounting for confirmed effects and for SLT without accounting for confirmed effects (Equation 3 and Equation 8) were validated by simulated data of 2000 subjects for various effect sizes of the candidate SNP and confirmed SNPs with 10,000 repeats. The phenotypic values were simulated by the summation of a population mean, three additive SNP effects and a random error which followed a standard normal distribution. The three SNPs were simulated under Hardy-Weinberg equilibrium and linkage equilibrium with allele frequencies, 0.3, 0.4 and 0.2. The first two SNPs were assumed to have confirmed effects and the last SNP was assumed to be the candidate SNP. The candidate SNP was tested by the MLT and SLT methods in each simulation. Empirical power was calculated as the proportion of significant results from all 10,000 simulation results. We fixed the effects of the two confirmed SNPs as 0.3 and 0.2 standard deviation of residuals (SD) and varied the effect of the candidate SNP from 0.04 to 0.2 SD. Simulated statistical power were nearly identical to the predicted power for MLT and SLT based on different candidate SNP effect sizes (Table S1) and on different population sizes (Table S2). With this knowledge of the power formulae being correct, predicted statistical power for MLT and SLT were calculated for various effect sizes of the confirmed SNPs (Table 1). We further evaluated predicted statistical power using reported effect sizes for some confirmed SNP effects. We collected all reported SNP effects for HDL cholesterol (HDL-C), LDL cholesterol (LDL-C), triglycerides (TG), total cholesterol (TC), and serum metabolites (SM) from the GWAS catalog [1]. The effect sizes and risk allele frequencies of those SNPs were extracted and utilized for the power calculations. After filtering out SNPs in high linkage disequilibrium (LD) by only keeping one SNP with the largest effect size in each high LD region, the final selection of confirmed SNP markers included 22 relatively independent SNPs with effect sizes of 0.07-0.24 SD for HDL-C, 24 SNPs with effect sizes of 0.07-0.35 SD for LDL-C, 13 SNPs with effect sizes of 0.06-0.42 SD for TG, and 19 SNPs with effect sizes of 0.06-0.24 SD for TC. For SM, we extracted five SNPs which explained 5.6 to 36.3 percent of the total phenotypic variation [1], [2]. Conditional on those known significant SNP effects, statistical power of MLT increased over that of SLT by about 4-5% for HDL-C, LDL-C, TG and TC. The pattern of the heatmap for statistical power was similar for these four traits and the heatmap for HDL-C is shown in Figure 1A. Largest improvements were in the region where candidate SNPs had small effect sizes and large allele frequencies or candidate SNPs had medium effect sizes and relatively low allele frequencies (0.1-0.2). The increase in statistical power of MLT over SLT was much larger for SM, varying from 10% to 30%, because of the large effect sizes of the known SNPs ( Figure 1B).
For GWAS analysis using real data, true statistical power is not observable but MLT is expected to have more significant results than SLT. To compare observed statistical significance of MLT and SLT, we used the FHS GWAS data (version 2) that had 6575 individuals with SNP genotypes of the 500k SNP panel from dbGAP [9]. Of the 6575 individuals, 6078 individuals had observations on HDL-C and 6431 individuals had observations on TC. From the 500k SNP panel, 432,096 SNP markers with known locations and minor allele frequencies 0.01 or greater were selected and tested. The original cholesterol measures deviated from normality and had outliers. The Box-Cox transformation analysis [10] implemented by the R package [11] showed that the log-transformation was approximately the best transformation to achieve normality for HDL-C and TC. Age, age-squared, cholesterol treatment, blood sugar, body mass index, smoking status, number of cigars smoked, alcohol consumption and sex were adjusted for transformed HDL-C. Blood sugar, body mass index, smoking status, and sex were adjusted for transformed TC. The testing of SNP effects used the generalized least squares version [12] of epiSNP [13]. From the GWAS catalog [1] and NCBI (http://www.ncbi.nlm.nih.gov), we selected six SNP markers ( Table 2) with multiple confirmations. These six SNP markers were independent of each other because pairwise correlations measured by R-square among these SNPs were nearly zero. Results showed that MLT had more significant results than SLT for both HDL-C and TC ( Table 2). The first two markers in Table 2 had the largest improvement in observed significance (reduced P-value), while the remaining markers only had minor improvement.
In our analysis, we did not impute genotypic data to fill in missing genotypes so that the MLT test had smaller sample size than SLT. The observed significance should have been larger than observed in Table 2 if the missing genotypes were filled by genotype imputing using software like MACH [14] and BEAGLE [15]. Although improvement in statistical significance could be small in some cases, such improvement could be easily achieved using GWAS analysis software like PLINK [16] and epiSNP [13] that provide options to incorporate covariates. Due to incorporating confirmed effects, MLT has smaller degrees of freedom for residuals and can be less robust than SLT. Fortunately, sample size in typical GWAS studies (thousands) is large enough for incorporating relatively small number of confirmed effects (tens). With more and more confirmed effects reported in the literature, MLT is useful for identifying new genetic variants with smaller effects.