Novel Genetic Analysis for Case-Control Genome-Wide Association Studies: Quantification of Power and Genomic Prediction Accuracy

Genome-wide association studies (GWAS) are routinely conducted for both quantitative and binary (disease) traits. We present two analytical tools for use in the experimental design of GWAS. Firstly, we present power calculations quantifying power in a unified framework for a range of scenarios. In this context we consider the utility of quantitative scores (e.g. endophenotypes) that may be available on cases only or both cases and controls. Secondly, we consider, the accuracy of prediction of genetic risk from genome-wide SNPs and derive an expression for genomic prediction accuracy using a liability threshold model for disease traits in a case-control design. The expected values based on our derived equations for both power and prediction accuracy agree well with observed estimates from simulations.


Introduction
In the last five years, GWAS have been published for both quantitative traits (such as height [1], or blood markers [2]) and disease [3]. In order to assess the relative potential for success of these studies Yang et al. [4] provided an analytical method for comparison of power. For example, this method has been used to quantify that a sample of ,50,000 schizophrenia cases and 50,000 controls is needed to afford the same power as the largest published GWAS of height (a total sample size of 180,000) [5].
Use of quantitative endophenotypes rather than binary traits has been proposed as a strategy to increase power in neuropsychiatric disorders [6]. Endophenotypes are measurable quantitative scores that are assumed to be associated with a continuous liability that underlies observed disease status, in which case the quantitative score may be more informative and powerful compared to binary responses. Of course, the true underlying liability would be the most informative although it is not observable. Recently, van der Sluis et al. [7] suggested a better use of phenotypic information in GWAS of psychiatric disorders measured in population cohorts. Rather than using binary responses of affected/non-affected they considered the use of continuous scores from diagnostic instruments. They showed that binary responses based on clinical cut-off criteria decreased power dramatically compared to the use of sum scores of item responses from the diagnostic instrument. The authors recommended that continuous quantitative responses such as sum scores of item responses should be used in psychiatry disorder GWAS, where possible. The study by van der Sluis et al. [7] compared scenarios by simulation and was based on population samples. Here, we provide an analytical method to calculate power in different scenarios with both population and case control samples.
Another potential use of data collected in GWAS is the prediction of genetic risk. Genomic-enabled prediction is a potentially powerful tool to identify individuals at higher risk of disease [3,8]. Undoubtedly, prediction accuracy plays a crucial role in a successful clinical application for genetic risk prediction of disease, and several studies have evaluated the predictive ability [9,10,11]. Daetwyler et al. [12] derived a theoretical accuracy for predicting genetic risk from genome-wide SNPs, based on least squares methodology. Many studies have used their formula, which works well for quantitative traits. However, in simulation studies their formula for case-control traits underpredicted the true accuracy (Table 4 of Daetwyler et al. [12]).
In this study, we address two issues relevant for the design casecontrol GWAS, power and genomic prediction accuracy. First, we derive analytically, in a unified framework, the power of GWAS when using population or ascertained case-control samples with binary as well as quantitative responses. Secondly, we derive genomic prediction accuracy based on the 0,1 observed scale, and transform it to the liability scale using a liability threshold model for disease traits in population [13] and in case-control samples [14]. The expected values based on our derived equations and the average of observed estimates from simulation agree well.

Power
Given a specified critical value for significance, power of a given association study design can be derived from the non-centrality parameter (NCP, l) of a x 2 test of association. Following methods of Yang et al. [4] we derive the NCP for five different experimental designs:, i.e. quantitative responses in population (QT_POP) (1), binary responses in population (BT_POP) (2), binary responses in ascertained case-control samples (BT_CC) (3), quantitative responses in ascertained case-control samples (QT_CC) (4) and samples of both ascertained cases and controls in which quantitative responses are available in the cases only (QB_CC) (5). The derived NCP for BT_CC, QT_CC and QB_CC are novel. Following Yang et al. [4], 1) NCP for quantitative responses in population samples, l QT POP where N is the total number of sample, h 2 is the proportion of variance explained by a single genetic marker or set of markers, i.e. multi locus association tests [15,16].

1) NCP for binary traits in population samples
where h 2 o is the proportion of variance explained by a genetic marker or set of markers on the observed scale, and h 2 0 &z 2 h 2 [13], where z is the height of the normal curve truncating the proportion K, where K is the proportion of the population that are cases. 1) NCP for binary responses in ascertained case-control samples, l BT CC where h 2 occ &z 2 h 2 P(1{P)=½K(1{K) 2 h 2 =s 2 gcc , [14,17], with P the proportion of cases in the case control sample and s 2 gcc the genetic variance in the case-control sample inflated relative to the population sample as a result of the ascertainment process [12], such cases are over-represented compared to the population sample. When h 2 and h 2 occ is small, (3) can be approximated and simplified as l BT CC &N : z 2 h 2 P(1{P)=½K(1{K) 2 , which agrees with the derivation based on the relative risk and multiplicative model by Yang et al. [4].
2) NCP for quantitative responses in ascertained case-control samples, where h 2 cc &s 2 lcc h4=s 2 gcc , where s 2 lcc is the variance of disease liability [18]. This equality is derived from quantitative genetic theory [18] in the following way. Firstly, where i is the mean liability in cases and t is the threshold on the normal distribution which truncates the proportion of disease prevalence K, and from Daetwyler et al. [12] s 2 gcc~h 2 ½1{h 2 i½(P{K)=(1{K)½if(P{K)=(1{K)g{t: ð6Þ In a similar manner, the inflated variance due to non-genetic effects is, 3)

4)
2) The covariance between disease liability and genetic values in an ascertained case-control sample is cov(l cc ,g cc )~cov(g cc ze cc ,g cc )~s 2 gcc zcov(g cc ,e cc ) where cov(g cc ,e cc )~(s 2 lcc {s 2 gcc {s 2 ecc )=2.
Therefore, from (5), (6) and (7), The regression coefficient of l cc on g cc is b(l cc :g cc )~cov(l cc ,g cc )=s 2 gcc~h 2 s 2 lcc =s 2 gcc Finally, the proportion of variance attributable to the SNPs or set of SNPs of a quantitative response in an ascertained casecontrol sample can be obtained as the squared regression coefficient multiplied by the genetic variance in the case-control sample and scaled by the variance of disease liability in the case control sample, i.e.   When underlying continuous quantitative responses are available only for cases in the ascertained case-control sample, i.e. the recorded values follow a mixture distribution of zero for controls and truncated normal distribution for cases. An example may be a GWAS of major depressive disorder in which cases are recorded for a quantitative severity score, whereas controls have not been scored. In this situations, where h 2 mcc &½b(y cc :g cc ) 2 s 2 gcc =s 2 y cc , which is explained as follows. The variance of the mixed zero and truncated normal values in an ascertained case-control sample is, where i and t are the same as defined above. There is an assumption here is that the quantitative trait is the phenotypic liability.

and,
cov(g cc ,e cc Dcase)~½var(l cc Dcase){var(g cc Dcase){var(e cc Dcase)=2: From the equations above, regression coefficient of y cc on g cc can be derived analytically as, b(y cc :g cc )~cov(y cc ,g cc )=s 2 gcc : Therefore, the proportion variance attributable to the variance in the SNPs from mixed zero and quantitative response in an ascertained case-control sample (h 2 mcc ) can be expressed as above under (8). The power for this mixed 0 and truncated normal responses is very similar to that for BT_CC (results not shown).

Genomic prediction accuracy
Normal quantitative traits. For a quantitative trait, b j is the random allelic substitution effect of the jth single nucleotide polymorphism (SNP). Following Daetwyler et al. [12], prediction error variance for the jth SNP effect is whereb b j is the estimate of the true regression b j of the phenotype on the jth SNP genotype, x ij = 0, 1 or 2 for the ith individual, N is the number of individual records and s 2 is the residual variance. Assuming a phenotypic variance of one, the genetic variance (var(g)) explained by the set of M SNPs is h 2 . Following Daetwyler et al. [12], the estimated genetic variance explained by the M SNPs in the predictor (var(ĝ g)) is a function of the h 2 , M, the number of records (N) and the residual variance (s 2 ) as The squared correlation coefficient between the true and estimated genetic value is the ratio of the true genetic variance over the estimated genetic variance [12] as r 2 g,ĝ g~v ar(g)=var(ĝ g)~h where the residual variance would be approximated as s 2 = 1 (phenotypic variance) as in Daetwyler et al. [12]. With t defined as the ratio of the number of samples (N) over the number of SNPs (M), the accuracy can also be written as [12] r 2 g,ĝ g~h 2 h 2 z1=t : Disease traits in population sample. In binary disease traits, with s 2 approximated as s 2 = K(1-K) (i.e. binomial phenotypic variance for a disease with population prevalence of K), the prediction error variance for the jth SNP effect can be written as

5)
where b is allele substitution effect on the 0, 1 observed scale andb b is the estimated b from regression of the 0,1 discrete phenotypes on SNP coefficients. The estimated genetic variance on the observed scale of the SNP predictor (var(û u)) is a function of the genetic variance on the observed scale (var(u) or s 2 u ), the number of SNPs, the number of records and the residual variance as The squared correlation coefficient between the true and estimated genetic values is : Because genetic variance as a proportion of phenotypic variance on the observed scale can be transformed from that on the liability scale as h 2 o %s 2 u K(1{K)~h 2 z 2 K(1{K) [13], prediction accuracy can be re-expressed as Equation (10) here is the same as equation (6) in Daetwyler et al. [12].
Disease traits in ascertained case-control study. Ascertainment in case-control samples often results in over-representation of cases compared to the case prevalence in the population. The variance of the explanatory variable is inflated by a factor of f~s 2 gcc =s 2 g [12,17]. The term, s 2 gcc , is the inflated genetic variance due to ascertainment in case-control sample [12]. Therefore, the inflated explanatory variable for the jth SNP can be written as var(x ij ) Ã~v ar(x ij ) : f j . Then, the prediction error variance for the jth SNP effect can be expressed as where b is allele substitution effect on the 0, 1 observed scale andb b is estimated b from regression of the 0,1 discrete phenotypes on SNP coefficients in the case-control sample. The estimated genetic variance on the observed scale in a case-control design can be derived as where var(x ij b j ) is the genetic variance on the observed scale due to the jth SNP effect transformed to the liability scale [14,17] var(x ij b j )& z P(1{P) K(1{K) With a sufficient number of causal SNPs (., 20), the residual variance is approximated as s 2 = P(1-P) (i.e. the binomial phenotypic variance in a case-control sample where the proportion of cases is P), and the value for f is close to 1 (i.e. a small fraction of genetic variance has a negligible inflation). Therefore, the genetic variance in a case-control sample is and, the estimated genetic variance in a case-control sample is approximately The squared correlation coefficient between the true and estimated genetic values is Equation (11) differs from equation (9) of Daetwyler et al. [12], i.e. For binary traits, area under the receiver-operator characteristic curve (AUC) is a useful statistic for the genomic prediction accuracy [19,20]. A relationship between the correlation coefficient and AUC has been shown in previous studies [11,20].

Power
In order to check the analytically derived equations of NCP for BT_POP, BT_CC, QT_POP and QT_CC, we carried out a simulation study. Individual genetic values (g) were simulated from an additive multilocus model of M = 100 independent SNPs with equal allele effects and allele frequency of 0.5. Residual values (e) were independently generated from a random normal distribution with a mean of zero and variance of s 2 e . The value of s 2 e was set relative to s 2 g so that the desired proportion of variance explained by the markers, h 2 was obtained. We simulated h 2 of 0.01, 0.05 and 0.1 so that each SNP explained 0.0001, 0.0005 and 0.001 of the phenotypic variance (Table 1). Liability phenotypes for each individual were simulated as y = g + e. Affected individuals were those with liability phenotype that exceeded a threshold determined by population prevalence. The numbers of cases and controls in the sample was 2000. The values for population prevalence were varied as K = 0.1, 0.01 or 0.001. The proportion of cases was P = K in simulations of population sample and P = 0.5 in simulations of case-control sample where cases were oversampled by a factor (1-K)/K. In population or case-control sample, we used both binary (BT_POP or BT_CC) and quantitative responses (QT_POP or QT_CC). We conducted 100 replicates for each simulation scenario, therefore 10000 association tests were carried out. Power was calculated as the proportion of the 10000 association tests in which the association p-value less than 0.05 and was compared to power calculated from the NCP using a function in R package, i.e. power = 1-pchisq (T, 1, ncp = NCP) where T is the normal distribution threshold corresponding to the significance level 0.05.

Genomic prediction accuracy
Simulations were carried out to verify the validity of equations (10) and (11). In a simulation study, individual genetic values (g) were simulated from an additive multilocus model with equal allele effects (allele frequency of ,0.5) and residual values (e) independently generated from a random normal distribution with a mean of zero and variance of s 2 e . The value of s 2 e was set relative to s 2 g so that the desired proportion of variance explained by the markers, h 2 was obtained. Liability phenotypes for each individual were simulated as y = g + e. Affected individuals were those with liability phenotype that exceeded a threshold determined by population prevalence.

Power
The power of association tests observed in simulation and expected from theory agreed well under a range of scenarios (Table 1). Whether using lower or higher values of disease prevalence K, there was an excellent agreement between the observed and expected power with a small empirical standard error. When using a higher variance explained by each locus (h 2 ), although the empirical standard error increased slightly, the observed value also agreed well with the expected value (Exp and Obs in Table 1).
In Figure 1, values for the power based on NCP derived from equations (1),(4) were plotted against variance explained by SNPs (i.e., h 2 ). Generally, the power increases when the variance explained by SNPs increases, and when the ascertained casecontrol design is used. For BT_POP, the power decreases as K decreases, reflecting the smaller number of cases in a given population sample. For QT_POP, the power is, of course, constant across a-c in Figure 1. When using an ascertained sample (BT_CC or QT_CC), the power increases as the value for K decreases, which reflects the greater over-sampling of cases with lower K for the same sample size and hence the difference in mean liability between cases and controls increases. There is a moderate difference between BT_CC and QT_CC when using population prevalence K = 0.1 (a in Figure 1). The difference between BT_CC and QT_CC becomes smaller with lower values for K (b and c in Figure 1).

Genomic prediction accuracy
The expected accuracies predicted from equation (11) agreed well with the observed average of estimates from simulation for all simulation scenarios for both population and ascertained casecontrol samples. In Table 2, disease prevalence K varies, in Table 3 proportion of variance explained by SNPs h 2 varies and in Table 4, values for t = N/M vary. For comparison, we list also the predicted accuracies for case-control samples provided in Daetwyler et al. [12]. As shown in their Table 4, their formula underestimates prediction accuracy particularly when disease prevalences are low ( Table 2) and h 2 are high (Table 3). We also tested the prediction accuracy with allele effects sampled from a normal or an exponential distribution. The results from these alternative distributions of allele effects were not much different from the main results (results not shown). This agrees with Daetwyler et al. [12] in that the derived prediction accuracy is robust to distributional assumption for allele effects.

Discussion
Firstly, we provide analytical derivations in a unified framework to quantify the power of GWAS when using population or ascertained case-control samples with binary responses or quantitative responses. The derived equations were validated in a simulation study, showing that expected values from the equations and observed values from simulations agreed well. Secondly, following Daetwyler et al. [12], we derive an expression genomic prediction accuracy based on the 0,1 observed scale, and transformed it to that on the liability scale using a liability threshold model for disease traits in population [13] and in casecontrol samples [14]. Compared with Daetwyler et al. [12], our derivation agrees for population samples, but is more accurate for case-control samples.
The Genetic Power Calculator [21] is commonly used for calculation of power is genetic association studies. The calculator is based on theoretical derivation [22,23] of a single locus model with required parameters of allele frequency and its effect size (e.g. relative risk or odds ratio in binary responses). However, our derivations and application did not require those parameters (see equation (3), (4) and (8) and Appendix S1 and S2 for application) because our derivations are based on variance explained by a locus, and many combinations of allele frequency and effect size can generate the same variance explained. Our framework easily accommodates power of association of multiple loci because we use a single parameter for the total variance that is generated by any number of loci. Applications of multiple loci association GWAS have been published recently [24,25]. In practice, the power to detect causal variants may not exactly agree with our analytical derivations because of unknown parameters such as linkage disequilibrium among variants and distribution of effect size that alter the effective number of tests. We recommend that such unknown parameters should be carefully considered in applying power calculation.
Recently, Dudbridge [11] proposed a comprehensive study about power and predictive accuracy of polygenic scores. Our equation (11) and Dudbridge's equation (13) [11] are analogous to each other. However, Dudbridge used his equation (13) with a heuristic justification from simulations. We analytically derived equation (11) based on a liability threshold model and gave a reasonable explanation why f j is approximated as 1.
Lastly, van der Sluis et al. [7] quantified by simulation the power lost in genetic association analyses of population samples measured for quantitative endophenotypes but analysed with a dichotomous case-control score. Our analytical derivations for such scenarios allow easy generalization of their results to the design of new studies.