Accuracy of Predicting the Genetic Risk of Disease Using a Genome-Wide Approach

Background: The prediction of the genetic disease risk of an individual is a powerful public health tool. While predicting risk has been successful in diseases which follow simple Mendelian inheritance, it has proven challenging in complex diseases for which a large number of loci contribute to the genetic variance. The large numbers of single nucleotide polymorphisms now available provide new opportunities for predicting genetic risk of complex diseases with high accuracy.
Methodology/Principal Findings: We have derived simple deterministic formulae to predict the accuracy of predicted genetic risk from population or case control studies using a genome-wide approach and assuming a dichotomous disease phenotype with an underlying continuous liability. We show that the prediction equations are special cases of the more general problem of predicting the accuracy of estimates of genetic values of a continuous phenotype. Our predictive equations are responsive to all parameters that affect accuracy and they are independent of allele frequency and effect distributions. Deterministic prediction errors when tested by simulation were generally small. The common link among the expressions for accuracy is that they are best summarized as the product of the ratio of number of phenotypic records per number of risk loci and the observed heritability.
Conclusions/Significance: This study advances the understanding of the relative power of case control and population studies of disease. The predictions represent an upper bound of accuracy which may be achievable with improved effect estimation methods. The formulae derived will help researchers determine an appropriate sample size to attain a certain accuracy when predicting genetic risk.


Introduction
Genetic risk of disease is an important component of overall risk of disease in addition to environmental, socio-economic, and behavioral risk factors. Therefore, predicting the genetic risk of disease for an individual is a powerful tool in taking preventative measures against the onset of the disease. Such predictions from genetic testing are relatively straightforward when a disease is caused by one or few genes. However, when a disease is of complex inheritance, the genetic risk of the disease may be associated with many loci, each explaining only a small portion of the genetic variance [1,2]. In this case, the prediction of genetic risk of disease of a particular individual becomes more challenging. Currently, prediction of risk for complex diseases is based mainly on pedigree analysis but this approach yields predictions of risk that are of low precision; for example predictions would be identical for full siblings without offspring, yet the genetic variation among them accounts for half or more of the genetic variance [3,4].
The identification of very large numbers of single nucleotide polymorphisms (SNP) has enabled the use of genome-wide association studies (GWA) to detect alleles that are associated with risk for complex diseases [5], such as Type II Diabetes and Crohn's disease [6]. In tandem with this substantive increase of SNP data, several methods for quantifying and/or predicting genetic risk of disease from multiple genes have been put forward [7,8]. Wray et al. [9] extended these methods by using a GWA approach to estimate the individual genetic risk of disease. Unlike the risk estimates obtained using only pedigree, the estimates resulting from such a GWA approach are more precise because they allow differentiation among full siblings. In addition, no pedigree or family history is needed either for estimating risk in one genotyped sample from the population or for predicting risk in a fresh sample. Similar genome-wide methodology has been proposed in animal and plant breeding to estimate additive genetic values for quantitative traits [10,11]. One critical difference between the two genome-wide approaches is that Wray et al. [9] set a significance threshold for the loci selected for disease prediction, whereas Meuwissen et al. [10] use all loci regardless of whether or not they affect the trait considered. The approach of Meuwissen et al. [10] therefore attempts to achieve the maximum precision in estimating the complete genetic value for a given dataset by including loci whose effects may be too small to achieve statistical significance, and thus reduces the overestimation of allele effects [12].
Wray et al. [9] computed the precision of the individual genetic risk estimates by simulation. While simulation studies are useful in getting initial results on the number of phenotypic records needed to achieve a desired level of accuracy, they are computer intensive and time consuming with large numbers of markers. Most importantly, they do not provide a deep insight on how all variables that affect accuracy interact. Therefore, it is desirable to develop deterministic equations that are responsive to all variables that influence accuracy.
Here we present simple expressions for the genome-wide accuracy of prediction of genetic disease risk. We derive general expressions for continuous traits and the necessary extensions for dichotomous disease traits with data obtained either from population studies or case control studies. The predictions are tested by computer simulation under a variety of parameters influencing accuracy, such as disease prevalence, heritability, and the distributions of allele effects and frequencies.

Derivation of Equations
The predicted accuracy that is derived below represents the upper bound that can be achieved when estimating effects in one population sample and then predicting individual genetic risk in another sample from the same population. Throughout this article the accuracy of predicted genetic risk ($r_{g\hat{g}}$) is defined as the correlation between true and predicted genetic values. One advantage of using $r_{g\hat{g}}$ is that the factors influencing it can be clearly derived using the principles of population genetics, as we show below. We will first derive equations that are predictive of $r_{g\hat{g}}$ for a genome-wide approach with a continuous phenotype, such as height, assuming a population study where individuals are sampled at random. These will then be adapted to predict disease risk for a dichotomous phenotype ('affected' or 'unaffected') with an underlying continuous liability. The equations are then further adapted to the situation of case control data.

Continuous phenotype
We will assume that there are $n_G$ potential loci affecting a trait which are independent, biallelic and acting additively, where $n_G$ may be large. These loci may be candidate genes or genetic markers of which a significant proportion may have zero effects. For locus $j$, $j = 1 \ldots n_G$, let a randomly chosen reference allele for that locus have frequency $p_j$ and true allelic substitution effect $b_j$. We shall assume without loss of generality that the distribution of allele frequencies $p_j$ is symmetric about $p = 1/2$, and likewise that allelic effects $b_j$ are symmetric about $b = 0$. No further distributional assumptions will be made here on $p_j$ and $b_j$, so, for example, many of the segregating alleles may have negligible or zero effect. No assumptions are made concerning the covariance between $p_j$ and $b_j$ in the populations sampled. We intend to derive the accuracy of the prediction of the additive genetic value ($r_{g\hat{g}}$) of an individual that can be achieved after the measurement of $n_P$ phenotypes.
An estimate of the effect of each allele may be obtained by regression of the phenotypic records on the genotypes one locus at a time because the loci are independently segregating. Assume the population variance of the phenotypes is 1. The estimated allele substitution effect will be $\hat{b}_j$, with expectation $E[\hat{b}_j] = b_j$, and is obtained by regressing the phenotypes on the observed number of reference alleles in the genotype, denoted $x_{ij}$ for individual $i$ and locus $j$ (i.e. $x_{ij}$ = 0, 1, or 2). The sampling variance of the allele estimate is $\mathrm{var}(\hat{b}_j - b_j) = \sigma_e^2 / S_{xx,j}$, where $\sigma_e^2$ is the residual variance after regression on $x_{ij}$ and $S_{xx,j} = n_P\,\mathrm{var}(x_{ij})$ is the adjusted sum of squares for $x_{ij}$. Although not assumed here, when the population is in Hardy-Weinberg equilibrium $S_{xx,j}$ is given by $2 n_P p_j (1 - p_j)$. For the present, we shall conservatively take $\sigma_e^2 = 1$, which underestimates the accuracy of the prediction.
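Our aim is to predict the accuracy in a new population sample, so we apply the original estimates to a new sample of the same population. Values referring to the second sample will be 'dashed', hence individual $i$ from the second sample has $x'_{ij}$ alleles at locus $j$. The additive genetic value of $i$ is given by $g_i = \sum_j x'_{ij} b_j$, with estimate $\hat{g}_i = \sum_j x'_{ij} \hat{b}_j$. Then $r_{g\hat{g}}^2 = \mathrm{cov}(g_i, \hat{g}_i)^2 / [\mathrm{var}(g_i)\,\mathrm{var}(\hat{g}_i)]$. Noting that $\hat{g}_i$ can be re-written as $\hat{g}_i = g_i + \sum_j x'_{ij}(\hat{b}_j - b_j)$, and that the estimation errors are independent of the genotypes in the second sample, $\mathrm{var}(g_i) = \mathrm{cov}(g_i, \hat{g}_i) = h_o^2$ and $\mathrm{var}(\hat{g}_i) = h_o^2 + n_G/n_P$. Writing $\lambda = n_P/n_G$, this gives
$$r_{g\hat{g}} = \sqrt{\frac{\lambda h_o^2}{\lambda h_o^2 + 1}} \qquad (1)$$
Therefore accuracy is seen to be a function of the product of the observed heritability $h_o^2$ and the ratio of the number of phenotypes recorded to the number of loci involved, $\lambda$. A second order correction to relax the assumption $\sigma_e^2 = 1$ is given in Appendix S1, where it is shown to result in an upward correction to $r_{g\hat{g}}$ of fractional magnitude $\approx \tfrac{1}{2} r_{g\hat{g}}^4 \lambda^{-1}$.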
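As a quick numerical check of the continuous-phenotype result, the sketch below evaluates the accuracy as $\sqrt{\lambda h_o^2 / (\lambda h_o^2 + 1)}$ with $\lambda = n_P/n_G$, the form that reproduces the worked values quoted elsewhere in the text. The function name `predicted_accuracy` and its argument names are illustrative choices rather than anything defined in the paper, and the conservative assumption of a residual variance of 1 is retained.

```python
import math


def predicted_accuracy(n_p: int, n_g: int, h2_obs: float) -> float:
    """Accuracy r = sqrt(lam * h2 / (lam * h2 + 1)) with lam = n_p / n_g.

    Hypothetical helper: assumes the conservative residual variance of 1.
    """
    if n_p <= 0 or n_g <= 0:
        raise ValueError("n_p and n_g must be positive")
    if not 0.0 <= h2_obs <= 1.0:
        raise ValueError("h2_obs must lie in [0, 1]")
    lam = n_p / n_g              # phenotypic records per locus
    signal = lam * h2_obs        # the product that drives accuracy
    return math.sqrt(signal / (signal + 1.0))


# lam = 1 with an observed heritability of 0.39 gives about 0.53
print(round(predicted_accuracy(1000, 1000, 0.39), 2))
```

Because only the product $\lambda h_o^2$ enters, halving the heritability while doubling the number of phenotypes per locus leaves the returned value unchanged.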
We shall now derive the accuracy of predicting individual genetic risk to disease ($r_{g\hat{g}}$) in a random population sample by considering disease prevalence in a liability model [9]. For a disease with prevalence $q$, phenotypes are defined as $s_i = 0$ for unaffected and $s_i = 1$ for affected, so $E[s_i] = q$ and $\mathrm{var}(s_i) = q(1-q)$. Individuals with the highest liability are affected by the disease. Let liability be $y_i$, scaled so $E[y_i] = 0$ and $\mathrm{var}(y_i) = 1$, and let $b_j$ be the regression of liability on the number of reference alleles at locus $j$. The linear predictor of $s_i$ on $y_i$ is given by $s_i = q + q i_q y_i$ [13], where $i_q$ equals the mean liability of affected individuals, which we will term the selection intensity [3] corresponding to the prevalence of the disease in the population. Let the slope of the regression of $s_i$ on $x_{ij}$ be $\hat{\pi}_j$; then $E[\hat{\pi}_j] = q i_q b_j$, with sampling variance, estimated conservatively using the phenotypic variance $q(1-q)$, of $q(1-q)/S_{xx,j}$. The coefficients $\hat{\pi}_j$ may be rescaled to give estimates $\hat{b}_j = \hat{\pi}_j / (q i_q)$, with sampling variance $(1-q)/(q i_q^2 S_{xx,j})$. Repeating the argument outlined above for a continuous phenotype gives $\mathrm{var}(g_i) = \mathrm{cov}(g_i, \hat{g}_i) = h_l^2$ and $\mathrm{var}(\hat{g}_i) = h_l^2 + n_G (1-q) / (n_P q i_q^2)$, where $h_l^2$ is the heritability on the liability scale. Simplifying terms results in:
$$r_{g\hat{g}} = \sqrt{\frac{\lambda h_l^2}{\lambda h_l^2 + (1-q)/(q i_q^2)}}$$
Robertson and Lerner [14] show that the relationship between additive heritability on the observed scale and the heritability on the liability scale satisfies
$$h_o^2 = h_l^2 \, \frac{q i_q^2}{1-q} \qquad (5)$$
Substitution then results in the form of Equation (1), with $h_l^2$ being replaced by $h_o^2$:
$$r_{g\hat{g}} = \sqrt{\frac{\lambda h_o^2}{\lambda h_o^2 + 1}} \qquad (6)$$
Therefore the dichotomous phenotype study of disease results in an identical formula for $r_{g\hat{g}}$ as the continuous phenotype, provided the heritability used is that for the observed dichotomous scale.
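The conversion between the liability and observed scales can be made concrete with a short sketch. It assumes the standard threshold-model form $h_o^2 = h_l^2 z^2 / [q(1-q)]$, where $z$ is the standard normal density at the liability threshold; since the mean liability of affected individuals is $z/q$, this is the same relationship expressed through the selection intensity. The function names are hypothetical, and the accuracy formula is the population-study form with $\lambda$ phenotypes per locus.

```python
import math
from statistics import NormalDist


def observed_heritability(h2_liab: float, q: float) -> float:
    """Convert liability-scale heritability to the observed 0/1 scale."""
    if not 0.0 < q < 1.0:
        raise ValueError("prevalence q must lie strictly between 0 and 1")
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - q)   # liability above which individuals are affected
    z = std_normal.pdf(threshold)             # normal density at the threshold
    return h2_liab * z * z / (q * (1.0 - q))


def accuracy_population_study(lam: float, h2_liab: float, q: float) -> float:
    """Accuracy for a random population sample of a dichotomous trait."""
    signal = lam * observed_heritability(h2_liab, q)
    return math.sqrt(signal / (signal + 1.0))


# Same liability heritability, different prevalences
for q in (0.01, 0.1, 0.5):
    print(q, round(observed_heritability(0.5, q), 3),
          round(accuracy_population_study(1.0, 0.5, q), 3))
```

For a fixed liability heritability the observed heritability, and with it the accuracy, rises as prevalence approaches 1/2.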

Case Control Disease Study
The formulae will now be extended to derive the accuracy $r_{g\hat{g}}$ of a genetic risk prediction when applying a case control design to a dichotomous phenotype. The need for modification of the equations for a case control design comes from the selection of individuals from within the population to achieve a prevalence of $w$ within the sample of cases and controls, where typically $w = 1/2$ with equal numbers of cases and controls. Parameter values post-selection will be 'starred'. It is assumed in the following, without loss of generality, that cases are less common than controls in the population, so $q \le w \le 1/2$. Two parameters in particular need to be re-estimated because of the selection practiced: (i) $S^*_{xx,j}$ [...] Briefly, assuming no covariance between $p_j$ and $b_j$, $S^*_{xx,j}$ is $n_P\,\mathrm{var}^*(x_{ij})$, and so [...] since $n_G$ and $E[b_j^2]$ over loci are unaffected by the sampling of cases and [...] Appendix S2 shows that, using Normal theory, $\mathrm{var}^*(g_i) = \mathrm{var}(g_i)\,[1 - h_l^2\,\bar{i}(\bar{i} - x)]$ [...] Approximating $\sigma_e^2 = 0.25$ for a binomial trait with probability 1/2, appropriate for equal numbers of cases and controls, gives [...] and substituting $\lambda$ results in [...] Changing the heritability from the liability scale for a population sample to the observed scale for a population sample using Equation (5) produces [...] Finally, substituting $q(1-q)\,[1 - h_l^2\,\bar{i}(\bar{i} - x)]$ [...] Thus the form of $r_{g\hat{g}}$ for a case control study shows equivalence to the $r_{g\hat{g}}$ of continuous and dichotomous phenotypes, provided heritability is on the observed scale and the appropriate changes are made in $c$ to account for the selection of cases and controls. The value of $c$ is 1 in population studies (Equation (6)), where $w = q$ (and, hence, $\bar{i} = 0$). When $q < w < 1/2$, $c < 1$ and there is an increase in $r_{g\hat{g}}$ compared to a population study with the same $\lambda$.

Simulations
Stochastic computer simulations were used to test the deterministic predictions of $r_{g\hat{g}}$ for a number of parameters affecting the continuous and dichotomous phenotypes. We describe the full simulation method for the continuous trait and then state additional steps that were needed for the dichotomous phenotypes (random population sample and case control). In all scenarios (i) individuals were unrelated; (ii) loci were independent; (iii) all genetic action was additive; (iv) for simplicity, loci were assumed to be in Hardy-Weinberg equilibrium; and (v) each scenario was replicated 100 times, except for case control scenarios with $\lambda = 0.02$ where 500 replicates were run. Furthermore, for initial simulations (vi) allele frequencies were sampled from a uniform distribution corresponding to a common-disease-common-variant hypothesis (CDCV) [15]; and (vii) allele effects were drawn from a reflected exponential distribution which was made symmetric about $x = 0$. Items (vi) and (vii) were modified as described below.
For the continuous phenotypes, the phenotypic variance was 1. True additive genetic values for $n_P$ individuals were calculated by assigning values of $(1 - p_j) b_j$ and $-p_j b_j$ to the minor and major alleles, respectively, at each of the $n_G$ simulated loci, and summing over loci. The value of $n_G$ used in most scenarios was 1000 and $n_P$ varied accordingly, depending on $\lambda$. Two exceptions were $\lambda = 0.02$, where $n_G = 20,000$, and the scenarios in which $\lambda$ was kept constant with $n_G = 100$. The scale factor of the exponential distribution was chosen to obtain the required additive heritability ($h_o^2$). Phenotypic records were simulated by adding to the true genetic values independent environmental terms drawn from a Normal distribution with mean zero and variance $1 - h_o^2$. Allele substitution effects $\hat{b}_j$ were estimated by regression of the $n_P$ phenotypic records on genotypes one locus at a time. A second sample of individuals was then simulated with genotypes based on the same allele frequencies and effects as the original population. The estimated additive genetic values were then computed according to the model $\hat{g}_i = \sum_j x'_{ij} \hat{b}_j$, as described above. Finally, $r_{g\hat{g}}$ was calculated as the correlation between true and estimated additive genetic values. Bias was also assessed by the slope of the regression of $g_i$ on $\hat{g}_i$.
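The simulation steps for the continuous phenotype can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the authors' code: the function names, the seed, and the restriction of allele frequencies to the interval 0.05 to 0.95 (to avoid monomorphic loci in small samples) are assumptions, and the sample sizes are far smaller than those used in the paper so that a single replicate runs quickly.

```python
import math
import random


def correlation(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def simulate_accuracy(n_g=300, n_p=600, h2=0.5, seed=1):
    """One replicate: estimate effects in one sample, validate in a second."""
    rng = random.Random(seed)
    # Uniform allele frequencies (bounds are an assumption of this sketch).
    p = [rng.uniform(0.05, 0.95) for _ in range(n_g)]
    # Reflected exponential effects, symmetric about zero.
    b = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(n_g)]
    # Rescale so the additive variance sum(2p(1-p)b^2) equals h2.
    var_g = sum(2 * pj * (1 - pj) * bj * bj for pj, bj in zip(p, b))
    scale = math.sqrt(h2 / var_g)
    b = [bj * scale for bj in b]

    def genotypes(n):
        # Hardy-Weinberg genotypes coded as reference-allele counts 0, 1, 2.
        return [[(rng.random() < pj) + (rng.random() < pj) for pj in p]
                for _ in range(n)]

    def genetic_values(x):
        return [sum((xij - 2 * pj) * bj for xij, pj, bj in zip(row, p, b))
                for row in x]

    # Estimation sample: phenotype = genetic value + environmental noise.
    x = genotypes(n_p)
    sd_e = math.sqrt(1.0 - h2)
    y = [gi + rng.gauss(0.0, sd_e) for gi in genetic_values(x)]
    y_bar = sum(y) / n_p

    # Least-squares regression of phenotype on allele count, one locus at a time.
    b_hat = []
    for j in range(n_g):
        col = [row[j] for row in x]
        x_bar = sum(col) / n_p
        sxx = sum((v - x_bar) ** 2 for v in col)
        sxy = sum((v - x_bar) * (yi - y_bar) for v, yi in zip(col, y))
        b_hat.append(sxy / sxx if sxx > 0 else 0.0)

    # Independent validation sample from the same population.
    x_new = genotypes(n_p)
    g_true = genetic_values(x_new)
    g_pred = [sum(xij * bh for xij, bh in zip(row, b_hat)) for row in x_new]
    return correlation(g_true, g_pred)


print(round(simulate_accuracy(), 3))
```

With these defaults $\lambda h_o^2 = 1$, so a single replicate should land in the neighbourhood of the deterministic value of about 0.71, with replicate-to-replicate scatter that shrinks as the sample sizes grow.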
The continuous phenotype case was tested for robustness to different distributions of allele frequency and effects, and their correlation. The allele frequencies were also drawn from a beta (U-shape) distribution, consistent with a neutral allele model [16], with parameters alpha = 0.3 and theta = 0.3. Allele effects were also sampled from a normal distribution with mean zero. The effect of having a percentage of loci with zero effects was investigated by setting a proportion of the effects to zero while keeping the overall genetic variance constant. In all cases, the scale factor for the distribution of allele effects was modified to maintain the desired $h_o^2$.
Further testing of the predictions was done by introducing a correlation between the heterozygosity at a locus and the squared magnitude of the allele substitution effect at a locus. This was done for a uniform distribution of allele frequencies and the reflected exponential distribution of allele effects. This was achieved empirically: if the randomly drawn frequency had heterozygosity greater than the median (i.e. $2p(1-p) > 0.375$) then the magnitude of the allele effect was drawn to be less than the median of the distribution of the magnitudes.
The simulation of a random population sample for the dichotomous disease phenotype followed the same structure as above but contained the additional step of treating the underlying continuous phenotype distribution as a liability for the disease with heritability $h_l^2$ on the liability scale [14]. Therefore, with prevalence $q$, the fraction $q$ of the population with the greatest liability were considered to be affected. Allele effects were then estimated from the dichotomous phenotype, and the accuracy, $r_{g\hat{g}}$, was calculated as the correlation between the true and estimated genetic liability for the disease in an independent population sample.
Case control studies were simulated with an equal number of cases and controls (i.e. $w = 1/2$). A dichotomous disease phenotype with sample size $n_P$ was simulated by including an additional selection step which expanded the population size to $n_P (2q)^{-1}$. The liabilities were constructed as for the population study of a dichotomous disease; the $n_P/2$ individuals with the greatest phenotypic liability were considered to be affected cases, and a further $n_P/2$ were randomly chosen from those remaining as control phenotypes. Allele effects were estimated as for the population studies, and the accuracy was estimated from a randomly drawn independent population sample of size $n_P$.
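The case control selection step can be illustrated with a small sketch. It assumes the expanded population holds $n_P/(2q)$ individuals, so that the top fraction $q$ by liability supplies exactly $n_P/2$ cases; liabilities are drawn here directly from a standard normal purely for illustration, whereas in the simulations they would be built from genotypes and environmental terms. The function name and the seed are hypothetical.

```python
import random


def sample_case_control(liability, n_p, rng):
    """Return (case_idx, control_idx) for a balanced case control sample.

    Cases are the n_p/2 individuals with the greatest liability; controls
    are n_p/2 individuals drawn at random from everyone else.
    """
    half = n_p // 2
    ranked = sorted(range(len(liability)), key=liability.__getitem__, reverse=True)
    cases = ranked[:half]
    controls = rng.sample(ranked[half:], half)
    return cases, controls


rng = random.Random(7)
n_p, q = 200, 0.1
pop_size = round(n_p / (2 * q))          # expanded population of 1000
liability = [rng.gauss(0.0, 1.0) for _ in range(pop_size)]
cases, controls = sample_case_control(liability, n_p, rng)
print(len(cases), len(controls), pop_size)
```

The within-sample prevalence is then $w = 1/2$ regardless of the population prevalence $q$, which is what makes the selection more intense for rarer diseases.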

Population-wide studies of continuous phenotypes
When allele effects were drawn from an exponential distribution and frequencies were from the uniform, the deterministic formula for $r_{g\hat{g}}$ was found to predict the simulated data reliably across the wide range of parameters used (Table 1). The prediction errors across all parameters studied were in the range of -1.3 to 4.0% (Table 1).
The close agreement between the predicted and achieved accuracies is also seen in Table 2 and was maintained when: (i) allele frequencies were drawn from a beta-distribution (% error -0.9 to 0.7); (ii) allele effects were drawn from a normal distribution (% error -0.8 to 5.0); (iii) exponential allele effects were mixed with varying proportions of alleles with no effects, ranging from 0 to 95% (% error 0.1 to 26.6, Table 3); (iv) values of $\lambda$ ranging from 0.02 to 5 were investigated (% error -20.0 to 4.0, Table 1); and (v) the genetic architecture was varied by keeping $\lambda$ constant and changing $n_G$ ($n_G = 100$, % error 0.1 to 7.6; and $n_G = 1000$, % error -0.5 to 0.0). It should be noted that the large percentage errors seen when $\lambda = 0.02$ are due to low $r_{g\hat{g}}$, where the absolute difference between the expected and simulated $r_{g\hat{g}}$ was still less than 0.02. The introduced correlation between heterozygosity and squared substitution effect was tested with $\lambda = 1$ and $n_G = 1000$ using the empirical procedure described in the Materials and Methods. With an achieved correlation of -0.36 and an observed $h_o^2 = 0.39$, the predicted accuracy from Equation (1) was 0.53, with an error of 1.1% when compared to simulation. In conclusion, it is clear that the deterministic $r_{g\hat{g}}$ is robust to wide distributional assumptions on the joint distribution of frequency and effect of allele substitution, as predicted from the derivation.
Therefore the predictions of genome-wide accuracy shown in Figure 1 based on Equation (1) for different values of observed $h^2$ and $\lambda$ have wide applicability. For all $\lambda$, the accuracy was most sensitive to $h^2$ when $h^2$ was low and this sensitivity was potentiated by higher numbers of phenotypes per genotype tested. The accuracies are functions of $\lambda h^2$, so the required $\lambda$ to achieve a given accuracy is proportional to $1/h^2$. Thus, the numbers of phenotypes per genotype need to be twice as high for half the heritability. To obtain accuracies of 0.71, corresponding to predicting half the genetic variance, $\lambda = 1/h^2$, and therefore $\lambda$ must be $\ge 1$ because $h^2 \le 1$.
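For study design it can be convenient to invert the accuracy expression. Assuming the form $r^2 = \lambda h^2 / (\lambda h^2 + 1)$, rearranging gives $\lambda = r^2 / [(1 - r^2) h^2]$, and the number of phenotypes follows as $\lambda n_G$. The sketch below uses hypothetical function names and returns an unrounded number of phenotypes, leaving any rounding up to the user.

```python
import math


def required_lambda(target_r: float, h2_obs: float) -> float:
    """Phenotypes per locus needed to reach accuracy target_r."""
    if not 0.0 < target_r < 1.0:
        raise ValueError("target_r must lie strictly between 0 and 1")
    if not 0.0 < h2_obs <= 1.0:
        raise ValueError("h2_obs must lie in (0, 1]")
    r2 = target_r ** 2
    return r2 / ((1.0 - r2) * h2_obs)


def required_phenotypes(target_r: float, h2_obs: float, n_g: int) -> float:
    """Unrounded number of phenotypic records for n_g loci."""
    return required_lambda(target_r, h2_obs) * n_g


# Halving the heritability doubles the phenotypes needed per locus
for h2 in (0.4, 0.2, 0.1):
    print(h2, round(required_lambda(math.sqrt(0.5), h2), 2))
```

Targeting an accuracy of $\sqrt{0.5} \approx 0.71$ recovers $\lambda = 1/h^2$, the relationship stated in the text.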

Population-wide studies on dichotomous disease phenotypes
The form of the predicted accuracy ($r_{g\hat{g}}$) is very similar to that for a quantitative trait. Again the prediction of $r_{g\hat{g}}$ was very good (% error -14.1 to 1.6; see Table 1). The validity of the prediction resulting from Equation (6) was robust to varying disease prevalence over the range of 0.01 to 0.5 (% error -1.9 to 1.4, Table 4). The form of the prediction in Equation (6) is a function of $\lambda$ and the observed additive heritability on a (0,1) scale, but this can be achieved with varied combinations of disease prevalence and underlying heritability of liability. This is shown in Table 5, which also demonstrates that, as predicted from Equation (6), $r_{g\hat{g}}$ is a function of only $h_o^2$, as accuracy remains constant with varied disease prevalence and $h_l^2$. The predicted $r_{g\hat{g}}$ of population studies of continuous phenotypes and dichotomous disease phenotypes with an underlying continuous liability follow the same functional form as seen in Equation (6). Therefore, Figure 1 can be used to derive predicted $r_{g\hat{g}}$ for dichotomous phenotypes as well as continuous phenotypes. However, note that in the liability model, even if liability was fully determined genetically, the additive heritability on the observed scale will never exceed 0.64 (i.e. $4h(0)^2$, where $h(x)$ is the standardized normal density function), with the remaining genetic variation appearing non-additive. The corresponding maximum $r_{g\hat{g}}$ achievable will be reduced and this will be most serious for low $\lambda$. Even with the most favorable circumstances of $q = 1/2$ and liability $h_l^2 = 1$, the accuracy will never exceed 0.71 if $\lambda < 1.56$, and it should be expected that $\lambda$ needs to be much greater than this to explain half the genetic variance. This circumstance should not be expected to change when using other disease models than the liability, since the loss of $r_{g\hat{g}}$ arises from the loss of quantitative information when moving from a continuous genetic value (however defined) to the categorical observation of affected or not.
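The ceiling on observed heritability can be checked numerically. The sketch below assumes the standard threshold-model relationship $h_o^2 = h_l^2 z^2 / [q(1-q)]$, with $z$ the standard normal density at the liability threshold, and uses hypothetical function names. It also inverts the relationship to ask which liability heritability would be needed for a given observed heritability, returning `None` when no value up to 1 suffices.

```python
from statistics import NormalDist


def max_observed_heritability(q: float) -> float:
    """Observed-scale heritability when liability is fully heritable."""
    std_normal = NormalDist()
    z = std_normal.pdf(std_normal.inv_cdf(1.0 - q))
    return z * z / (q * (1.0 - q))


def liability_heritability_for(h2_obs: float, q: float):
    """Liability heritability giving h2_obs at prevalence q, or None if impossible."""
    h2_liab = h2_obs / max_observed_heritability(q)
    return h2_liab if h2_liab <= 1.0 else None


ceiling = max_observed_heritability(0.5)
print(round(ceiling, 4))                      # 4 * density(0)^2, about 0.64
print(round(1.0 / ceiling, 2))                # smallest lam giving accuracy 0.71
print(liability_heritability_for(0.5, 0.1))   # not attainable at this prevalence
```

With the unrounded ceiling of about 0.637 the minimum $\lambda$ comes out near 1.57; the value of 1.56 quoted in the text corresponds to using the rounded figure of 0.64.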

Case control studies of dichotomous disease phenotypes
The prediction formula for accuracy of case control studies ($r_{g\hat{g}}$) is not a simple function of $\lambda$ and the observed $h_o^2$, but also depends on both the heritability on the liability scale and the disease prevalence, as seen from Equation (8). Therefore, comparisons require consideration of how $c$ in Equation (9) varies. The simulations assumed $w = 1/2$, with equal numbers of cases and controls. As seen in Table 1, the predictions are generally good (% error -20.0 to 3.5), with the large deviations again due to low $\lambda$; however, there is a trend towards the underestimation of $r_{g\hat{g}}$ as prevalence becomes low (Table 4). The value of $r_{g\hat{g}}$ for case control studies is best illustrated by comparison with population studies of dichotomous disease traits. Figure 2 integrates this information and shows the relationship of prevalence and observed heritability in population and case control studies. Values of $r_{g\hat{g}}$ below the narrowly dashed line derived from Equation (5) are not possible under the liability model; for example, an observed additive heritability of 0.5 and a prevalence of 0.1 could not exist in the same dataset. Each contour represents a level of constant $r_{g\hat{g}}$, where the dashed lines represent a population study and the solid lines denote a case control design with $w = 1/2$. As described above, the contours are vertical for population studies because, given $h_o^2$, the accuracy is independent of $q$, but for case control studies they move towards lower $h_o^2$ as prevalence decreases.
Several clear conclusions on case control studies can be drawn: (i) the overall trend of $r_{g\hat{g}}$ increasing with more phenotypes per genotype holds true for case control studies (Table 1); (ii) population studies and case control studies are equivalent when the prevalence is 0.5 (Figure 2); (iii) when the prevalence is below 0.5, a case control study is more accurate than a population study with the same number of individuals genotyped (Figure 2); (iv) for a constant $h_l^2$, $r_{g\hat{g}}$ increases as the disease prevalence increases in population studies, since this increases $h_o^2$, but in case control studies $r_{g\hat{g}}$ increases as the disease prevalence decreases because of the more intense selection induced by the less prevalent disease (Table 4).

Discussion
We have derived simple deterministic predictions of $r_{g\hat{g}}$ in continuous and dichotomous phenotypes using either a population or a case control study and we have shown them to be appropriately responsive to changes in disease prevalence, heritability, and the number of phenotypic records per number of risk loci to be estimated. In addition, the equations have proven robust to changes in allele effect distributions, including different fractions of loci with zero effect and differing allele frequency distributions. Population studies are also robust to covariances between the magnitude of allele effects and heterozygosity, although, in principle, this robustness does not hold for case control studies. This advance in understanding has been used to summarize the influence of critical parameters such as heritability and numbers of phenotypes and risk loci on accuracy of prediction, and also to show the degree to which case control designs can add power to studies.
Table 5. Simulated accuracy of a population study for a dichotomous phenotype as prevalence and $h^2$ ...
The approach taken here has been to assume the potential loci affecting the trait are known, and this has an impact that is double edged. First, it allows for a clear quantification of the limitations imposed on $r_{g\hat{g}}$ by the number of phenotypes obtained, irrespective of marker densities. The information gained by doing so is of equal importance to knowing the number of markers needed for a certain $r_{g\hat{g}}$ but seems to have received less attention recently. Second, it implies that the predicted $r_{g\hat{g}}$ are upper bounds for the data obtained, since some loss of $r_{g\hat{g}}$ will occur through the use of markers which are potentially in imperfect linkage disequilibrium (LD) with loci with effect [17], and the inclusion of candidate loci that may have no effect within the population.
The impact of including these loci with no true effect may be explained by two applications of our formulae. The first application assumes the loci affecting a disease trait are known and thus $r_{g\hat{g}}$ demonstrates an upper bound on the accuracy; for example, consider $n_G = 1000$ loci with effects greater than 0, $n_P = 10,000$ phenotypes and $h_o^2 = 0.1$; then the predicted accuracy is obtained with $\lambda = 10$, and will be 0.71. Now consider the case in which those 1000 loci are contained within a set of $n_G = 100,000$ marker loci, with 99% having zero effect, so that now the accuracy is obtained with $\lambda = 0.1$; our predictive equations remain valid and predict an accuracy of 0.10. From these applications of our formulae it is clear that the approach of estimating loci effects one at a time will inevitably result in low accuracies, and further, adding more marker loci with zero effects while using the same approach will reduce the expected accuracy. The low accuracies predicted accord with the empirical findings from large scale studies of human data that have recently been reported [18]. It is clear that alternative approaches to prediction will be needed to bridge the gap and raise accuracies towards the potential placed by the phenotype collection.
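The dilution effect in the two applications above can be reproduced numerically. The sketch assumes the accuracy takes the form $\sqrt{\lambda h_o^2 / (\lambda h_o^2 + 1)}$, with $\lambda$ computed from the total number of loci whose effects are estimated, whether or not they have a true effect. The function name and the intermediate case of 9,000 null markers are illustrative additions.

```python
import math


def accuracy_with_null_markers(n_p: int, n_causal: int, n_null: int,
                               h2_obs: float) -> float:
    """Accuracy when effects are estimated for causal and null loci alike."""
    lam = n_p / (n_causal + n_null)    # phenotypes per estimated locus
    signal = lam * h2_obs
    return math.sqrt(signal / (signal + 1.0))


# 10,000 phenotypes, 1,000 causal loci, observed heritability 0.1
for n_null in (0, 9_000, 99_000):
    acc = accuracy_with_null_markers(10_000, 1_000, n_null, 0.1)
    print(n_null, round(acc, 2))
```

With no null markers the value is about 0.71, and with 99,000 null markers it falls to about 0.10, matching the two applications described in the text.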
Nevertheless, potential alternative approaches are available and evidence already exists that these approaches may significantly increase predictive accuracy. One approach is model selection: improvements in $r_{g\hat{g}}$ can be achieved by implementing model selection least squares procedures to identify a subset of SNP from which to predict effects [10,19], or by using more complex procedures to identify a subset to set to zero [20]. Some of these studies [10,19,20] also incorporate the use of prior information within Bayesian procedures and demonstrate significant increases in accuracy over least squares. Increasing the number of markers when using priors can increase accuracy because the size of the marker subset chosen stays the same due to the prior but the portion of the genetic variance captured by the marker subset increases [21]. However, the use of Bayesian approaches will demand reliable distributions for incorporation into models. Literature estimates informing priors on $n_G$ and the distributions of the effects will become more widely available as GWA studies become more powerful [1,22]. Full genome-wide methods [10,11], where genetic risk or additive genetic values are estimated in one step, using all loci simultaneously, particularly if they are correlated, might be expected to approach the upper bound of $r_{g\hat{g}}$ faster than methods which impose significance thresholds and, thus, do not capture all the genetic variation. From the results presented here it may be argued that priors on the numbers of loci positively contributing to the genetic variance will be more critical than those describing the distribution of gene effects.
In this paper we have used a liability model for disease instead of the commonly used log genetic risk model and the impact of doing so is likely to be small for large datasets. For a given $h_o^2$ and $q$, an underlying log-risk can be approximated well by a liability [9,23] and the distribution of effects on the log-risk scale will be transformed to a distribution on the liability scale, and the predictions developed here are not dependent on the distribution of effects. However, there is evidence that distinctions may be larger when $q$ is very close to zero or one [24].
A critical assumption of the genetic models studied was that the loci acted independently. In humans, most LD stretches for 10 to 30 kb, while some linkage disequilibrium blocks may be >100 kb [25]. The human genome contains 3.1 billion bases [26] and, assuming 2000 known loci contribute to the additive genetic variance, each genomic segment between them would be 1550 kb. This confirms that this model is viable in humans. One could apply our formulae by interpreting $n_G$ as the number of independent chromosome segments (i.e. haplotype blocks). The length and, thus, the number of these segments would depend on the amount of LD present in the genome. The number of such segments has been estimated directly from pair-wise LD between markers [27], and closely related measures, such as the number of independent tests on the genome, have been estimated using principal component analysis [28] and have been derived analytically for specific experimental designs [29]. When LD exists, either between markers and risk loci or between risk loci, the predictive efficiency of our equations will be reduced. Modeling the pattern of LD by extension of our formulae would thus be important when many loci are used, as with dense SNP marker maps, or when predicting additive genetic values in other species, such as some livestock populations where the extent of LD is large compared to humans [30,31].
An attraction of molecular predictors of genetic risk compared to pedigree predictors is the potential to apply the predictions more widely within populations and across populations. Obtaining sufficient accuracy within populations can be achieved by the quality and size of sampling, but there are additional factors in play when transfer across populations is being considered. For example, one benefit of genome-wide prediction is that individual allele effects are estimated with a precision that is related to the molecular variation observed at the locus, $\mathrm{var}(x_{ij})$, which determines the contribution of genetic variance when combined with the squared magnitude of effect. This benefit may break down when predictions are transferred across populations. As an illustration, consider a rare allele of large effect which will be relatively imprecisely estimated in the estimation sample, but because the contribution of the locus to total variance is small there is only a small impact upon the accuracy of further predictions within the same population. In a different population, such an allele may have a greater frequency and contribute a greater part of the genetic variance, and, consequently, the predictive accuracy will suffer. Specifically, the ability to transfer predictions will depend on $\mathrm{var}(x_{ij})$ in each of the two populations used for estimation and application, and this in turn depends on both the allele frequency ($p_j$) and the degree of admixture present in the population. Furthermore, an additional risk of transferability across populations is the presence of epistasis, which may differentially influence $b_j$.
Any directional selection present in the population is likely to introduce a covariance between the magnitude of allelic effect and heterozygosity, since selection promotes the movement of alleles of large effect quickly through intermediate frequencies, where they create large genetic variance, towards extreme frequencies. The predictions of $r_{g\hat{g}}$ developed make no assumption about this covariance, and hence are robust to such selection in the population prior to estimation in population studies. In contrast, the derivation for the case control study does assume independence of heterozygosity and magnitude (as described in Appendix S2). However, in the limited simulations carried out with such covariances in case control studies, the impact of breaking this assumption appeared small (results not shown).
Our derivations show that $r_{g\hat{g}}$ can be reduced to very similar forms for population and case-control studies of continuous and dichotomous phenotypes (cf. Equations (1), (6) and (9)). The common element affecting $r_{g\hat{g}}$ for all three equations is the term $\lambda h_o^2$, describing the joint effect of $\lambda$, the number of phenotypic records per locus associated with the trait, and the observed heritability. Increasing either of these improves $r_{g\hat{g}}$, but the study shows that the major determinant of the trade-off between these two factors is their product. For a population study $\lambda h_o^2$ is completely sufficient to determine accuracy, independent of prevalence ($q$) and heritability ($h_l^2$) of liability for a dichotomous trait, but for a case control study both $q$ and $h_l^2$ retain some influence on $r_{g\hat{g}}$ over and above their impact upon $h_o^2$. This is because, in a case control study, the term $c$ in Equation (9) is adjusting for the selection of the cases and controls, and the strength of selection will depend upon $q$, and its impact on genetic variance will depend on $h_l^2$.
The predictive equations are a good fit to the simulated values and we have demonstrated, by theory and simulation, that they are independent of allele frequency and effect distributions. The formulae have increased the understanding of the relative differences between predicting $r_{g\hat{g}}$ in a random sample of a population and in case control studies. The expressions for $r_{g\hat{g}}$ derived will help researchers design experiments of appropriate size to estimate genetic risk to disease.

Supporting Information
Appendix S1