Conceived and designed the experiments: CS DV PD. Performed the experiments: CS EH DV. Analyzed the data: CS EH DV PD. Contributed reagents/materials/analysis tools: CS. Wrote the paper: CS PD.
The authors have declared that no competing interests exist.
Genome-wide association studies (GWAS) have identified hundreds of associated loci across many common diseases. Most risk variants identified by GWAS will merely be tags for as-yet-unknown causal variants. It is therefore possible that identification of the causal variant, by fine mapping, will identify alleles with larger effects on genetic risk than those currently estimated from GWAS replication studies. We show that under plausible assumptions, whilst the majority of the per-allele relative risks (RR) estimated from GWAS data will be close to the true risk at the causal variant, some could be considerable underestimates. For example, for an estimated RR in the range 1.2–1.3, there is approximately a 38% chance that it exceeds 1.4 and a 10% chance that it is over 2. We show how these probabilities can vary depending on the true effects associated with low-frequency variants and on the minor allele frequency (MAF) of the most associated SNP. We investigate the consequences of the underestimation of effect sizes for predictions of an individual's disease risk and interpret our results for the design of fine mapping experiments. Although these effects mean that the amount of heritability explained by known GWAS loci is expected to be larger than current projections, this increase is likely to explain a relatively small amount of the so-called “missing” heritability.
Genome-wide association studies (GWAS) exploit the correlation in genetic diversity along chromosomes in order to detect effects on disease risk without having to type causal loci directly. The inevitable downside of this approach is that, when the correlation between the marker and the causal variant is imperfect, the risk associated with carrying the predisposing allele is diluted and its effect is underestimated. Using simulations, where we know the true risk at the causal locus, we quantify the extent of this underestimation. We show that, for loci which have a modest effect on disease risk and are common in the population, the risk estimated from the most associated SNP is very close to the truth approximately two thirds of the time. Although the extent of the underestimation depends on assumptions about the frequency and strength of the risk allele, we predict that fine mapping of GWAS loci will, in rare cases, identify causal variants with considerably higher risk. Using three common diseases as examples, we investigate the expected cumulative effects of underestimation at multiple loci on our ability to stratify individuals by disease risk and to explain disease heritability.
Genome-wide association studies (GWAS) have been extremely successful across many diseases in identifying loci harbouring genetic variants that affect disease susceptibility. Virtually all associated variants identified from GWAS to date have relatively small effects: each additional copy of the risk allele typically increases disease risk by 10%–30% (see for example
One of several important factors in the success of the GWAS design has been the pattern of linkage disequilibrium in human populations. The strong correlations between nearby SNPs mean that commercially available genotyping chips, which assay 300,000–1,000,000 SNPs, can capture much of the common variation in the human genome, particularly in Caucasian populations
While linkage disequilibrium is extremely helpful for GWAS discovery, the downside is that in most reported regions of association, the true causal variant or variants remain unknown. Therefore it is possible that many of the associated SNPs are only surrogates for the true causal variant(s). When it comes to quantifying the genetic effect, the genotype at the reported SNP acts as a noisy measurement of the genotype at the causal variant. This noise can dilute the apparent strength of the effect, and obscure the true relationship between genotype and phenotype. As we progress towards the identification of the causal variants, estimates of effect sizes for associated loci will thus tend to increase. In turn, the proportion of disease susceptibility explained by GWAS loci will also increase. Thus in addition to other plausible sources, such as secondary signals in GWAS loci, rare variants (<1% frequency), copy number polymorphisms, and epigenetic effects, some of the missing heritability is actually contained in loci already identified by GWAS, and is driven by common variation (>1% frequency).
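The dilution described above can be illustrated with a small calculation. The sketch below is not the simulation procedure used in this paper; it assumes a multiplicative risk model, Hardy–Weinberg equilibrium, population controls, and positive LD between the risk allele and the tag allele. The function name and the example frequencies are hypothetical.

```python
def tag_allelic_odds_ratio(p_causal, p_tag, r2, alpha):
    """Allelic odds ratio seen at a tag SNP when the causal allele has
    per-allele relative risk alpha (illustrative assumptions only)."""
    # Recover the LD coefficient D from r^2 (positive LD assumed).
    d = (r2 * p_causal * (1 - p_causal) * p_tag * (1 - p_tag)) ** 0.5
    # Population haplotype frequencies, indexed as (causal allele, tag allele).
    h11 = p_causal * p_tag + d
    h10 = p_causal * (1 - p_tag) - d
    h01 = (1 - p_causal) * p_tag - d
    h00 = (1 - p_causal) * (1 - p_tag) + d
    if min(h11, h10, h01, h00) < -1e-12:
        raise ValueError("r2 is not attainable for these allele frequencies")
    # In cases, haplotypes carrying the causal allele are enriched by alpha.
    norm = alpha * (h11 + h10) + h01 + h00
    tag_freq_cases = (alpha * h11 + h01) / norm
    tag_freq_controls = p_tag  # population controls
    odds_cases = tag_freq_cases / (1 - tag_freq_cases)
    odds_controls = tag_freq_controls / (1 - tag_freq_controls)
    return odds_cases / odds_controls


# Perfect tag: the estimate recovers the true effect.
perfect = tag_allelic_odds_ratio(0.3, 0.3, 1.0, 2.0)
# Rare causal allele (5%) with RR 3, tagged at r^2 = 0.1 by a 30% allele.
diluted = tag_allelic_odds_ratio(0.05, 0.3, 0.1, 3.0)
print(perfect, diluted)
```

Under these assumptions a true relative risk of 3 at a poorly tagged 5% allele appears as an odds ratio of about 1.3 at the tag, which is the kind of estimate GWAS typically report.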
In this paper we use an extensive simulation study to investigate, and quantify, this phenomenon. We show that estimates of the size of the genetic effect based on the best SNP from the GWAS genotyping chip can often closely approximate the effect size at the true causal SNP. In some cases the causal SNP has a large effect and is poorly tagged, leading to substantial underestimation of the true effect size. We investigate how much of the “missing” heritability could thus be hidden in reported GWAS loci, under several sets of assumptions about the nature of the effects at true causal SNPs. Our results also inform the design and value of fine mapping experiments in GWAS loci.
Patterns of linkage disequilibrium (LD) in human populations are complicated, and preclude analytical results, so we adopted a simulation approach (see
Reported genome-wide association studies differ in many particular details, including the choice of genotyping chip used and the sizes of the discovery and replication samples. Specific assumptions are necessary for any simulation study, and ours aim to capture the general features of many reported studies. Investigation of different simulation scenarios, including different genotyping chips and sample sizes, did not change the broad conclusions that follow (data not shown).
To begin, we compare the estimated effect size at the replicated hit SNP with the true effect size at the causal SNP in the simulation.
Histograms of estimated relative risks (RR), for three different true relative risks indicated by a vertical dashed line in each plot. Histograms include all simulations where the most associated (hit) SNP was significant in both the initial study and the replication study.
In
Scatter plots of the ratio of estimated to true relative risk against the correlation (r²) between the disease SNP and the hit SNP, for three different true relative risks. The horizontal dashed line indicates a ratio of 1. Points below the line are underestimated, and points above it are overestimated. (Note that in the case where the hit SNP is also the causal SNP the correlation is 1.)
Imperfect tagging and an ascertainment effect also explain the feature of the plots whereby the underestimation is much less for smaller true effect sizes. If the true effect is small and the true causal variant is not well-tagged on the genotyping chip, there will not be enough power for the GWAS and subsequent replication to reach significance
The results above describe the distribution of estimated effect sizes as a function of known true effect sizes and the frequency of the risk allele. In practice we are actually interested in the reverse question, namely what true effect sizes are plausible in the light of the effect size actually estimated from a GWAS and follow-up study? We will see that this requires assumptions about the true distribution of effect sizes. Indeed, writing RR for relative risk, and RAF (risk allele frequency) for the allele frequency at the risk allele, application of Bayes' theorem gives
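In generic form, the relation can be sketched as follows. The subscripts "true" (causal SNP) and "est" (hit SNP) are illustrative notation introduced here, not necessarily the notation of equation (1):

```latex
\[
P\left(\mathrm{RR}_{\text{true}}, \mathrm{RAF}_{\text{true}} \mid \mathrm{RR}_{\text{est}}, \mathrm{RAF}_{\text{est}}\right)
\;\propto\;
P\left(\mathrm{RR}_{\text{est}}, \mathrm{RAF}_{\text{est}} \mid \mathrm{RR}_{\text{true}}, \mathrm{RAF}_{\text{true}}\right)
\, P\left(\mathrm{RR}_{\text{true}}, \mathrm{RAF}_{\text{true}}\right)
\]
```

The first factor on the right is what a simulation study can estimate. The second is the prior distribution of true effect sizes and risk allele frequencies, which is not known and has to be assumed.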
We proceed by making two different sets of assumptions about these unknowns. In each case we assume that the distribution of risk allele frequencies is given by the empirical distribution of allele frequencies in the ENCODE regions. In effect this assumes that any SNP variant is,
Different sets of assumptions about true effect sizes and risk allele frequencies necessarily lead to different conclusions, and it is impossible to study all possibilities. A number of theoretical analyses
Under either set of assumptions, we can use our simulation study and Bayes' theorem (1) to estimate the conditional distribution of the true effect size and risk allele frequency (RAF) in the light of the observed data at the GWAS hit SNP.
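One way such a posterior could be assembled from simulation output is by prior-weighted counting. The sketch below is a simplified illustration, not the authors' code: it assumes the same number of simulations was run at each true RR, that `records` holds only simulations whose hit SNP was significant and replicated, and it ignores RAF. All names are hypothetical.

```python
from collections import defaultdict


def posterior_true_rr(records, prior_weight, est_window):
    """Posterior over true RR given an estimated RR inside est_window.

    records      : iterable of (true_rr, estimated_rr) pairs from simulations
    prior_weight : function giving the prior mass assumed for a true RR
    est_window   : (low, high) range of the observed estimate
    """
    low, high = est_window
    mass = defaultdict(float)
    for true_rr, est_rr in records:
        if low <= est_rr <= high:
            # Each retained simulation counts in proportion to the prior
            # probability of the true RR it was generated under.
            mass[true_rr] += prior_weight(true_rr)
    total = sum(mass.values())
    if total == 0:
        raise ValueError("no simulations fall inside the estimate window")
    return {rr: m / total for rr, m in sorted(mass.items())}


# Toy records: two true effect sizes, flat prior (hypothetical numbers).
toy = [(1.2, 1.25), (1.2, 1.22), (2.0, 1.28), (2.0, 1.90)]
print(posterior_true_rr(toy, lambda rr: 1.0, (1.2, 1.3)))
```

Replacing the flat prior with a different `prior_weight` changes the posterior without rerunning the simulations, which is why several sets of prior assumptions can share one simulation study.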
Histograms showing the posterior distribution on the true relative risk (RR) conditional on observing an estimated relative risk in the range 1.2–1.3 (vertical dashed lines). Left hand plots condition on the observed risk allele frequency (RAF) being between 20 and 50%, while the right hand plots condition on a RAF less than 20%. Results are shown using two different priors on RR and RAF: the blue histograms are the posterior distribution obtained using a
A common feature of the histograms in
Our observations are similar when the observed risk allele is the most common allele in the population (RAF>50%) and therefore the minor allele is protective (
One consequence of the potential underestimation of effect sizes from GWAS findings is that as we move to better identification of the actual causal variants, through fine mapping and/or functional studies of associated regions, our estimates of their effect sizes might well increase. Assuming a multiplicative model of risk across loci, these small expected changes could combine to increase the relative risk of disease in those individuals with highest genetic risk of disease.
To investigate this, we simulated genotypes at known associated loci in a population of individuals (assuming Hardy–Weinberg equilibrium and no linkage disequilibrium across loci) for each of breast cancer, type 2 diabetes and Crohn's disease, based on reported risk allele frequencies
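A minimal sketch of this kind of population simulation is given below. It assumes Hardy–Weinberg equilibrium, independent loci and a multiplicative model across loci, and it summarises the top group by its mean risk relative to the population mean, which is one possible reading of "increase in risk relative to the average". The loci listed are hypothetical, not the reported associations.

```python
import random


def top_group_relative_risk(loci, top_fraction, n_individuals=10000, seed=0):
    """Mean risk in the top `top_fraction` of individuals, relative to the
    population mean, for loci given as (risk allele frequency, per-allele RR)."""
    rng = random.Random(seed)
    risks = []
    for _ in range(n_individuals):
        risk = 1.0
        for raf, rr in loci:
            # Two independent allele draws per locus (Hardy-Weinberg).
            copies = (rng.random() < raf) + (rng.random() < raf)
            risk *= rr ** copies  # multiplicative within and across loci
        risks.append(risk)
    mean_risk = sum(risks) / len(risks)
    risks.sort(reverse=True)
    k = max(1, int(round(top_fraction * n_individuals)))
    return (sum(risks[:k]) / k) / mean_risk


# Hypothetical loci: (risk allele frequency, per-allele relative risk).
example_loci = [(0.3, 1.2), (0.5, 1.15), (0.1, 1.4)]
for fraction in (0.5, 0.1, 0.01):
    print(fraction, top_group_relative_risk(example_loci, fraction))
```

Rerunning the same function with larger per-allele relative risks, as would be drawn from a posterior on the true effect, shows how modest changes at individual loci compound in the highest-risk groups.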
The results of the three simulations are given in
Type II Diabetes
x% | 50 | 25 | 10 | 5 | 1 | 0.5 | 0.1 |
Unadjusted | 1.31 | 1.58 | 1.92 | 2.17 | 2.73 | 2.96 | 3.49 |
(95% CI) | (1.3–1.32) | (1.56–1.6) | (1.89–1.95) | (2.12–2.21) | (2.64–2.82) | (2.85–3.1) | (3.25–3.8) |
Conservative prior | 1.42 | 1.83 | 2.36 | 2.78 | 3.8 | 4.26 | 5.38 |
(95% CI) | (1.36–1.52) | (1.69–2.05) | (2.12–2.77) | (2.44–3.36) | (3.19–4.9) | (3.52–5.62) | (4.25–7.58) |
MAF-dependent prior | 1.54 | 2.12 | 3.05 | 3.97 | 6.7 | 8.16 | 12.43 |
(95% CI) | (1.41–1.74) | (1.81–2.7) | (2.34–4.74) | (2.76–6.99) | (3.75–15.08) | (4.21–20.64) | (5.26–43.06) |
Crohn's Disease
x% | 50 | 25 | 10 | 5 | 1 | 0.5 | 0.1 |
Unadjusted | 1.67 | 2.41 | 3.66 | 4.89 | 9.25 | 11.91 | 20.07 |
(95% CI) | (1.64–1.7) | (2.35–2.48) | (3.53–3.79) | (4.63–5.13) | (8.39–10.31) | (10.5–13.74) | (15.88–27.04) |
Conservative prior | 1.78 | 2.71 | 4.28 | 5.82 | 11.2 | 14.56 | 25.41 |
(95% CI) | (1.71–1.9) | (2.52–3) | (3.86–4.9) | (5.18–6.81) | (9.54–13.67) | (12.04–18.24) | (18.84–36.27) |
MAF-dependent prior | 2 | 3.36 | 6.01 | 8.87 | 19.77 | 27.19 | 54.11 |
(95% CI) | (1.82–2.27) | (2.83–4.26) | (4.56–8.69) | (6.29–14.13) | (12.32–38.88) | (15.91–58.84) | (26.59–150.26) |
Breast Cancer
x% | 50 | 25 | 10 | 5 | 1 | 0.5 | 0.1 |
Unadjusted | 1.2 | 1.38 | 1.58 | 1.71 | 2.02 | 2.14 | 2.4 |
(95% CI) | (1.19–1.22) | (1.36–1.4) | (1.55–1.6) | (1.68–1.74) | (1.96–2.05) | (2.05–2.19) | (2.27–2.55) |
Conservative prior | 1.25 | 1.46 | 1.71 | 1.89 | 2.29 | 2.46 | 2.82 |
(95% CI) | (1.2–1.33) | (1.37–1.62) | (1.57–1.99) | (1.71–2.27) | (2.02–2.92) | (2.13–3.21) | (2.39–3.9) |
MAF-dependent prior | 1.29 | 1.59 | 2.14 | 2.74 | 4.06 | 4.69 | 6.41 |
(95% CI) | (1.2–1.5) | (1.4–2.25) | (1.64–4.25) | (1.8–6.09) | (2.14–11.12) | (2.28–16) | (2.58–36.14) |
The median (and 95% confidence interval) of the increase in risk of individuals who are in the top x% of risk, relative to the average, for three common diseases. Values are estimated (using 100,000 simulations of a population of 10,000 individuals) from a set of replicated associations (see
The second and third simulations attempt to average over the possible outcomes of our future efforts to map causal mutations, to reveal the likely gains in our ability to stratify individuals on the basis of risk. These use the methodology above, under both prior distributions, to average over the posterior distribution of the allele frequency and effect size at the causal SNPs underlying reported GWAS loci for the three diseases. These adjusted estimates are also shown in
We have shown above that as we move to identification of the true causal variants underlying GWAS associations, through fine mapping and functional studies, their effect sizes will tend to increase, in a minority of cases substantially, compared to current estimates from GWAS. This will, in turn, increase the amount of heritability explained by these loci. We can use the approach developed here to try to quantify this effect.
We investigated this question in the context of the three diseases just described, namely breast cancer, type 2 diabetes, and Crohn's disease. For each disease we took the set of hit SNPs from published associated loci
The results are shown in
Cumulative density functions of the posterior distribution of estimated sibling recurrence risk ratio (estimated λS) in breast cancer (BC), Type 2 diabetes (T2D), and Crohn's disease (CD) under the conservative and
The correlation between alleles along the human genome has allowed GWAS to look for regions associated with disease without having to either genotype all known genetic variation or guess
GWAS associations will thus typically relate to a noisy measurement of the causal variant. One consequence of this is that the size of the genetic effect associated with GWAS loci may be underestimated. We quantified this through an extensive simulation study designed to mimic patterns of linkage disequilibrium in European Caucasian populations. We draw two broad conclusions from these analyses. Firstly, a significant proportion of estimated relative risks will be biased downwards because the hit SNP is a powerful, but imperfect, tag for the true causal variation. In most cases this effect will be relatively minor, but in some instances, the best associated SNP may actually be a poor predictor of a, putatively rarer, SNP with a much larger effect, in which case the effect size estimated from the GWAS finding will substantially underestimate the true effect size.
The exact proportion of reported associations which fall into these two categories depends on properties of the design of the study from which the SNP was identified, and on one's belief about how likely low frequency (>1%) variants of large effect are to cause common diseases. The statistical power afforded by any particular association strategy sets a lower limit on the size of effect that can be under-estimated because an imperfect tag of an allele with a small effect size will simply fail to achieve genome-wide significance. Other properties of GWAS strategy, such as sample ancestry and the number of markers typed, also change our interpretation of observed effect sizes because they influence the distribution of linkage disequilibrium between putative hit SNPs and causal variants.
Our findings show that at any particular locus, especially if the associated SNP has a low MAF, the true effect could be quite large. But we would not expect this to be widespread. Were many true effects this large it would be extremely surprising for so few of them to have been observed: although any one such causal SNP may not be well tagged on the genotyping chips used for GWAS, some of them will happen to be at least moderately well tagged, and their detection would lead to much larger estimates than have been seen from current studies. In the context of this study these early observations suggest that, of the two prior distributions we investigated, it is the conservative prior that may better reflect the true distribution of effect sizes attributed to low-frequency and common variants.
One way of viewing the posterior distribution on the true effects shown in
Here we have quantified the increased spread of genetic risk with genotypes just at known loci, and only considering a multiplicative disease model. But even in this restricted setting, there will be substantial differences in risk between high- and low-risk groups based on these genetic factors. For example the propensity of individuals in the top 0.1% of the population distribution of genetic risk of type 2 diabetes will be increased by a factor of 5–10, compared to the average. For breast cancer, in the analogous top-risk group this risk will be increased by a factor of 3–5 (on the basis of common variation). Importantly, with the growth of GWAS findings, both in terms of numbers of diseases and numbers of loci for particular diseases, more and more of the population will be in this most at risk category for at least one disease: assuming 100 independent diseases, nearly 10% of the population will be in the top 0.1% of risk of at least one disease. Knowing which individuals these are and what diseases they are most at risk of is therefore potentially useful information, both to the individual and at the population level. The issues involved in utilising such information in screening programmes (discussed for example in
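The figure of nearly 10% quoted above follows from treating membership of the top 0.1% as independent across the 100 diseases:

```latex
\[
P(\text{top } 0.1\% \text{ for at least one of 100 diseases})
= 1 - (1 - 0.001)^{100}
= 1 - 0.999^{100}
\approx 0.095
\]
```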
We have shown that some of the “missing” heritability for common disease actually resides in known GWAS loci and have estimated this deficit for three particular diseases. While rather more heritability is likely to be explained by known GWAS loci than has been reported, this effect alone falls well short of explaining all the missing heritability. Note, however, that there are other reasons why existing loci may explain more heritability than currently thought. Current calculations (by others, and above) focus on a single causal variant in each associated region: more variants within regions will explain more heritability. They also ignore possible non-multiplicative disease effects, and also ignore interactions between variants at different loci. Power to detect either is low
In order to model the signal of association generated by disease-causing mutations, we chose to simulate data exploiting empirical surveys of human diversity. For this purpose we used data from the 10 ENCODE regions
As the typical sample size of most GWAS is much larger than the number of CEU HapMap individuals, we simulated 100,000 chromosomes using the HAPGEN software package. These 100,000 haplotypes we call the
For SNPs greater than 1% in frequency in the ENCODE regions we performed two hypothetical GWAS by letting each of the two alleles be causal in turn. We denote the causal allele by
For analyses involving only simulated data, we sampled 2,000 cases and 2,000 controls from the reference panel to emulate a typical large GWAS. For the subsequent analyses of heritability and individual risk profiling for type 2 diabetes, breast cancer and Crohn's disease that studied particular reported associations, we simulated 5,000 cases and 5,000 controls to obtain results more comparable to the size of study from which the associations were ascertained. We simulated under a range of relative risks at 24 grid points from 1.05 to 6. In attempting to simulate the signal of disease at rare alleles (1% to 5%) in a GWAS of 5,000 cases and 5,000 controls, there were a small number of simulations in which there were insufficient haplotypes in our reference panel to generate the required number of genotypes at the causal SNP for large effect sizes. These simulations were discarded, but as the numbers were small (3% when RR = 4 and 11% when RR = 6) we do not believe this greatly affects the results presented below.
Following common practice, for each simulated case-control sample, we tested for association between genotype and case-control status using the Cochran–Armitage trend test
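For reference, a minimal sketch of the Cochran–Armitage trend test on genotype counts is shown below. It uses additive scores (0, 1, 2) and the standard one-degree-of-freedom chi-square approximation. The function name and the example counts are hypothetical, and this is not the software used in the study.

```python
import math


def trend_test(case_counts, control_counts, scores=(0, 1, 2)):
    """Cochran-Armitage trend test for genotype counts ordered by the number
    of copies of the test allele. Returns (chi-square statistic, p-value)."""
    r = sum(case_counts)      # number of cases
    s = sum(control_counts)   # number of controls
    n = r + s
    totals = [a + b for a, b in zip(case_counts, control_counts)]
    # Trend statistic: score-weighted difference between cases and controls.
    t = sum(w * (a * s - b * r)
            for w, a, b in zip(scores, case_counts, control_counts))
    # Variance of the trend statistic under the null hypothesis.
    var = sum(w * w * c * (n - c) for w, c in zip(scores, totals))
    var -= 2 * sum(scores[i] * scores[j] * totals[i] * totals[j]
                   for i in range(len(scores))
                   for j in range(i + 1, len(scores)))
    var *= r * s / n
    chi2 = t * t / var
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts for genotypes with 0, 1 and 2 copies of the allele.
print(trend_test((100, 200, 200), (200, 200, 100)))
```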
We simulated the replication experiment in three stages. First we simulated the frequency of the causal allele in cases and controls in the replication population. We then simulated the frequency of the hit SNP conditional on the frequency of the causal allele. Finally, we simulated the genotype counts for a sample of cases and controls in this replication population.
We motivated sampling of the frequency of the causal allele in controls in the replication population by thinking of the replication sample as an additional sample from the same population as the original GWAS sample. (Other assumptions are possible here, but seem unlikely to affect the main conclusions.) Specifically, we placed a uniform prior distribution on the unobserved population frequency and sampled a value,
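The sampling step can be sketched as follows. A uniform prior on an allele frequency is a Beta(1, 1) distribution, so after observing k risk alleles among n control chromosomes the posterior is Beta(k + 1, n - k + 1). Conditioning on the original control counts in this way, and deriving the case frequency from a multiplicative model with population controls, are assumptions of this sketch rather than details confirmed by the text. The function name is hypothetical.

```python
import random


def sample_replication_frequencies(risk_alleles, total_alleles, alpha, rng):
    """Draw a control allele frequency from its Beta posterior under a
    uniform prior, then derive the matching case frequency for a
    per-allele relative risk alpha (multiplicative model assumed)."""
    p_controls = rng.betavariate(risk_alleles + 1,
                                 total_alleles - risk_alleles + 1)
    # Risk alleles are over-represented in cases by a factor alpha.
    p_cases = alpha * p_controls / (alpha * p_controls + 1 - p_controls)
    return p_controls, p_cases


rng = random.Random(1)
# Hypothetical counts: 800 risk alleles among 4,000 control chromosomes.
print(sample_replication_frequencies(800, 4000, 1.5, rng))
```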
We estimate the effect size, or relative risk, α, at the hit SNP by maximum likelihood under the model described above by equation (2). For studies with population controls this can be achieved in practice by fitting a logistic regression model for case status
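A minimal sketch of such a fit is given below: a logistic regression of case status on the number of copies of the test allele, fitted by Newton–Raphson on grouped genotype counts. The exponentiated slope is a per-allele odds ratio, which approximates the relative risk when controls are drawn from the population. The function name and counts are hypothetical, and a statistical package would normally be used instead.

```python
import math


def per_allele_odds_ratio(case_counts, control_counts, iterations=25):
    """Maximum-likelihood per-allele odds ratio from the model
    logit P(case) = b0 + b1 * (copies of allele), fitted on counts for
    genotypes carrying 0, 1 and 2 copies."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0              # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0      # Fisher information matrix entries
        for g in range(3):
            n = case_counts[g] + control_counts[g]
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * g)))
            resid = case_counts[g] - n * p
            w = n * p * (1.0 - p)
            g0 += resid
            g1 += g * resid
            h00 += w
            h01 += g * w
            h11 += g * g * w
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return math.exp(b1)


# Hypothetical counts built so that the per-allele odds ratio is exactly 1.5.
print(per_allele_odds_ratio((400, 600, 225), (400, 400, 100)))
```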
We implement two different sets of prior assumption on the effect size and its relationship with minor allele frequency. Our first set of assumptions is that if α is the effect size at a causal variant, then log(α) is normally distributed with mean 0 and standard deviation 0.2, independent of RAF. We refer to this as the
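To give a sense of how restrictive this first prior is, the sketch below computes the prior probability that the effect size exceeds a given value. It assumes natural logarithms; the function name is hypothetical.

```python
import math


def prior_tail_probability(rr, sd=0.2):
    """P(alpha > rr) when log(alpha) is Normal with mean 0 and standard
    deviation sd (natural logarithm assumed)."""
    z = math.log(rr) / sd
    # Upper tail of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


for rr in (1.2, 1.4, 2.0):
    print(rr, prior_tail_probability(rr))
```

Under this prior a relative risk above 1.4 has a probability of a little under 5%, and a relative risk above 2 has a probability well below 0.1%, so large true effects are only supported when the data point strongly towards them.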
Our second set of assumptions, which we call the
A commonly used measure of heritability is based on considering the risk of disease to an individual
We then simulated 100,000 times from the posterior of true RR and RAF of each locus conditional upon the reported RR and RAF, using the simulation approach and the two different priors as described in the paper. For each set of simulations, for each disease, we recalculated λS at each locus and multiplied over loci, giving a sample from the posterior distribution of sibling risk that could be explained by the current set of reported loci if the causal loci were typed directly.
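The per-locus calculation and the product over loci can be sketched as follows. The closed form used here, λS = (1 + v / (2m²))² with m = 1 + p(γ - 1) and v = p(1 - p)(γ - 1)², is the standard result for a biallelic locus with risk allele frequency p and per-allele relative risk γ under a multiplicative model and Hardy–Weinberg equilibrium. It is an assumption of this sketch that the paper's calculation takes this form. The loci in the example are hypothetical.

```python
def locus_lambda_s(raf, rr):
    """Sibling recurrence risk ratio contributed by one biallelic locus
    (multiplicative model, Hardy-Weinberg equilibrium assumed)."""
    mean_factor = 1.0 + raf * (rr - 1.0)                 # mean per-allele risk factor
    var_factor = raf * (1.0 - raf) * (rr - 1.0) ** 2     # its variance
    return (1.0 + var_factor / (2.0 * mean_factor ** 2)) ** 2


def combined_lambda_s(loci):
    """Multiply per-locus values across independent loci."""
    total = 1.0
    for raf, rr in loci:
        total *= locus_lambda_s(raf, rr)
    return total


# Hypothetical loci: (risk allele frequency, per-allele relative risk).
print(combined_lambda_s([(0.3, 1.2), (0.5, 1.15), (0.1, 1.4)]))
```

Applying `combined_lambda_s` to each posterior draw of (RAF, RR) across loci gives one sample of the sibling risk explained, as described above.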
Average relative underestimation of effect size as a function of allele frequency and true effect size. Line plot shows the mean ratio of the effect size estimated from the most associated GWAS SNP to the true effect size at the causal locus. Lines are shown for 4 different risk allele frequency (RAF) bins.
Posterior distribution on true relative risk when the minor allele is protective. Histograms showing the posterior distribution on the true relative risk (RR) conditional on observing an estimated relative risk and risk allele frequency (RAF) at the hit SNP. Results are shown using two different priors on RR and MAF: the blue histograms are the posterior distribution obtained using a conservative prior, and the red histograms are the posterior distribution obtained using the MAF-dependent prior.
Posterior distribution on true relative risk for low estimated effect sizes. Histograms showing the posterior distribution on the true relative risk (RR) conditional on observing an estimated relative risk and risk allele frequency (RAF) at the hit SNP. Results are shown using two different priors on RR and MAF: the blue histograms are the posterior distribution obtained using a conservative prior, and the red histograms are the posterior distribution obtained using the MAF-dependent prior.
Priors on relative risks. Probability distributions of relative risk as a function of minor allele frequency for the MAF-dependent and conservative priors (the blue line). The MAF-dependent prior is pictured for five values of MAF
Empirical prior on risk allele frequency. Cumulative distribution of the frequency of SNPs within the ENCODE regions used for simulations. At each SNP, each allele is chosen in turn to be the risk allele, so the distribution is symmetric around a half.
ENCODE regions used in simulations. The build 35 coordinates of the regions of HapMap CEU data used by HapGen to simulate genome-wide association study data. When simulating haplotypes and testing for association a 500kb buffer window was included either side of the listed regions.
MAF-dependent prior. Table of the standard deviation of the prior distribution of the log relative risk (RR) as a function of risk allele frequency (RAF).
SNPs for Type 2 diabetes. (See main text for reference).
Replicated SNPs for Crohn's disease. (See main text for reference).
Replicated SNPs for breast cancer. (See main text for reference).
We thank Rory Bowden, Gil McVean, and Simon Myers for helpful discussion.