Underestimated Effect Sizes in GWAS: Fundamental Limitations of Single SNP Analysis for Dichotomous Phenotypes

Complex diseases are often highly heritable. However, for many complex traits only a small proportion of the heritability can be explained by observed genetic variants in traditional genome-wide association (GWA) studies. Moreover, for some of those traits few significant SNPs have been identified. Single SNP association methods test for association at a single SNP, ignoring the effect of other SNPs. We show using a simple multi-locus odds model of complex disease that moderate to large effect sizes of causal variants may be estimated as relatively small effect sizes in single SNP association testing. This underestimation effect is most severe for diseases influenced by numerous risk variants. We relate the underestimation effect to the concept of non-collapsibility found in the statistics literature. As described, continuous phenotypes generated with linear genetic models are not affected by this underestimation effect. Since many GWA studies apply single SNP analysis to dichotomous phenotypes, previously reported results potentially underestimate true effect sizes, thereby impeding identification of true effect SNPs. Therefore, when a multi-locus model of disease risk is assumed, a multi SNP analysis may be more appropriate.


A1 Simulation details
To simulate a distribution of estimated SNP-based allelic odds ratio OR e , we computed the odds ratio from 10,000 simulated 2x2 allelic cross tabulations at one risk SNP (risk/protective allele versus case/control). As effect sizes and risk allele frequencies were assumed to be equal for all risk SNPs, risk SNPs were interchangeable and hence shared the same distribution of OR e . The allelic cross tabulations were simulated by drawing from a categorical distribution with the average relative cell frequencies of the cross tabulations as parameters. Average relative frequencies were derived analytically from the model parameters in the disease model. As each SNP consists of two alleles, the sample size in each simulated cross tabulation was twice the number of subjects in the case-control study. All simulations were based on a study sample of 3500 subjects with a 1:1 case:control ratio.
The average relative cell frequencies in a 2x2 allelic cross tabulation at a particular SNP in a case-control study can be expressed as a sample dependent joint probability P sample (X a , D) of allele X a and disease status D (risk allele a : X a = 1, protective allele A : X a = 0; case: D = 1, control: D = 0). As only genotypes are observed P sample (X a , D) needs to be derived from a similar distribution P sample (X s , D) based on a 3x2 genotype cross tabulation (X s ∈ {0 = AA, 1 = Aa, 2 = aa} is the number of risk alleles in that particular SNP). Table  S1 shows the allelic joint probability as a function of genotype-based joint probabilities in a case-control sample. Note that heterozygosity contributes half to the a row and half to the A row in the allelic cross tabulation. We then derived P sample (X s , D). Because disease status D is known for all subjects in a case-control study, the joint distribution of the genotype cross table can be written as P sample (X s , D) = P (X s |D)P sample (D). P (X s |D) is the probability of genotype X s given disease status D. As P sample (D) is the percentage of cases in the case-control sample, the joint distribution is sample dependent. For simplicity, we have assumed case-control samples with a 1:1 case:control ratio. Note that the sample joint distribution [P sample (X s , D)] does not equal the joint distribution in the population [P (X s , D)], as the percentage of cases in a case-control study [P sample (D = 1)] typically does not equal the prevalence in the population [P (D = 1)].
Allelic cross tabulation D D a P sample (aa, D) + 1 2 P sample (Aa, D) P sample (aa, D) + 1 2 P sample (Aa, D) A P sample (AA, D) + 1 2 P sample (Aa, D) P sample (AA, D) + 1 2 P sample (Aa, D) Table S1. Allelic joint probabilities as a function of genotype-based joint probabilities in case-control sample.
Finally, we analytically derived the conditional probability of a genotype at a particular SNP given disease status as a function of the four disease model parameters: P (X s |D, β 0 , β 1 , p a , n a ) (for more details on the mathematical analysis we refer to section A3 below).
The simulation procedure described above allowed efficient simulation of SNP-based odds ratio OR e values for risk SNPs based on disease prevalence, true effect size, risk allele frequency, and total number of risk SNPs. By simulating distributions of OR e for a plausible range of model parameter values we were able to study the relationship between median OR e and true model odds ration OR t .
For a given disease prevalence, we set the intercept β 0 such that the probability of disease in the population matches the predefined disease prevalence. Let D be disease status, X a the number of risk alleles an individual caries, β 1 = log(OR t ) the effect size on log odds scale, p a the risk allele frequency, and n a the total number of effect alleles, i.e. twice the number of risk SNPs, and β 0 the intercept such that P (D = 1|β 0 , β 1 , p a , n a ) = p D . Given disease prevalence p D and assuming Hardy-Weinberg equilibrium, the intercept was computed by numerically approximating the minimum of the following squared error function.
A2 Relationship between marginal odds ratio OR e and conditional OR t for diseases with single effect SNP Figure S1 shows the relationship between median estimated odds ratio OR e and true conditional odds ratio OR t for diseases with a single risk SNP, different prevalences (A), and different minor allele frequencies (B). Ideally, the median estimated effect size of SNPs obtained from single SNP association testing should reflect the true effect size OR t of the disease generating model. As Figure S1 shows, for OR t < 3 the median OR e reflects OR t fairly well. For larger true odds ratios we found a non-linear relationship between OR e and OR t , depending on prevalence and risk allele frequency. This is due to the fact that the true conditional odds ratio OR t is based on a single allelic position, whereas the estimated SNP-based odds ratio OR e is based on two allelic positions. We refer to section A3.2 below for mathematical details.

A3.1 Deriving probability of SNP genotype given disease status
The joint distribution genotype-based cross tables per SNP used in the simulation ignore background risk, as they only involve a single risk SNP. However, the odds model is specified conditional on the total number of minor risk alleles a subject caries. It is therefore necessary to average over all possible genetic backgrounds to compute the probability of a genotype at SNP s given disease status. Combining this with Bayes rule we obtain the following analytical derivation. The final expression comprises a quotient of four terms (defined by square brackets). The first term in the numerator is the probability of disease given the genotype at a particular SNP. The second term in the numerator is the probability of the genotype at that particular SNP given the risk allele frequency. The two terms in the denominator are the probability of disease and non-disease in the population.

A3.2 Approximate equivalence of conditional allelic odds ratio and marginal SNPwise odds ratios under a single SNP model
To clarify the relationship between the estimated SNP-based odds ratio OR e and the true odds ratio OR t , we introduce the notion of allelic odds ratio. An allelic odds ratio OR a is the odds ratio of an extra risk allele versus no extra risk allele given a fixed background odds of disease.
For a (sub)population with no variation in background odds of disease, it is possible to show that the allelic odds ratio OR a at a particular allele equals the true odds ratio OR t .
Let X s ∈ {0, 1, 2} be the number of risk alleles at a particular SNP, and X bg the number of risk alleles in the remaining background SNPs. Then X a = X s + X bg is the total number of risk alleles. Note that We use this equality to demonstrate that the allelic odds ratio OR a for people with the same genetic background x bg equals the true odds ratio OR t = exp(β 1 ). The fifth equality below is obtained by multiplying both the numerator and the denominator by This equality demonstrates theoretically that we could estimate OR t by computing OR a at a particular SNP (i.e., aA versus AA or aa versus Aa) in a subset of the population carrying the same genetic background risk x bg for any x bg . Due to variation in background risk within a population equality OR t = OR a will generally not hold in a study sample. Only for diseases with a single effect SNP the background risk is equal for all subjects (i.e., no background risk) and hence the equality will hold.
Note that the definitions of OR a and OR t are based on a single allelic position, whereas GWA studies often report SNP-based allelic odds ratios (OR e ), based on two allelic positions. As OR a = OR t for diseases with a single risk SNP, Figure S1 shows the discrepancy between SNPbased estimate OR e and allele-based estimate OR a for different model parameters. Although OR e and OR a are not equal, their difference is small compared to the underestimation of effect size in complex diseases.

A3.3 Equivalence of marginal and conditional effect size under a linear regression model
Let y = β 0 + β 1 x a + be the value of continuous phenotype Y as a function of the number of risk alleles x a . The expected value of phenotype Y given genotype X s and background X bg is given by the linear model. Because the background is unknown when analyzing a particular risk SNP, it is necessary to average over all possible backgrounds by computing the expected value once again. Hence the expected value of phenotype Y given genotype x s is As before, n a is the total number of effect alleles and p a the risk allele frequency. The average SNP-based effect size B s is the difference in average phenotype of Aa genotypes (Y Aa ) and AA genotypes (Y AA ) at risk SNP s. Therefore the expected effect size obtained at SNP s is Hence single SNP association testing on continuous disease traits under a linear model can, on average, identify the true conditional effect sizes.

A3.4 Effect of underestimation on explained variance
In linear regression R 2 can be computed to assess the explained variance of a genetic model. For logistic regression, many pseudo−R 2 measures exist without agreement on the measure that should be applied. We adopted McKelvey-Zavoina's pseudo − R 2 . [1] Simulation studies have shown that, compared to other pseudo − R 2 statistics, it produces results that are most similar to the R 2 obtained when applying linear regression to an observed liability trait. [2] McKelvey-Zavoina's pseudo − R 2 mirrors the explained variance in linear regression by assuming a normally distributed latent liability trait Y * = β 0 + β 1 X a + with ∼ N (0, σ ). McKelvey-Zavoina's pseudo − R 2 is the percentage of variance in latent liability trait explained by the model. The total disease trait variance Y * consists of variance explained by the model and the error variance of the (approximated) probit model. Let Y * = β 0 + β 1 X a be the latent trait as predicted by the model. Then the explained variance is The variance of Y * can be written as The estimated explained variance is a function of coefficient β 1 , number of effect alleles n a , allele frequency p a , and standard error σ . The estimated effect size β 1 influences the variance of Y * and hence the estimated explained variance. In other words, underestimated effect sizes result in an underestimated explained variance. Although the choice of σ does not affect the estimation of the logistic coefficients, it does affect the estimated explained variance. In the probit model P (D = 1|X a = x a ) = f probit (y * = β 0 + β 1 x a ), which implies that a y * > 0 will result in disease. To ensure that the probit model is identifiable, it is customary to set the standard error σ to 1. The difference between the odds model and the probit model is the link function. That is, in the odds model P (D = 1|X a = x a ) = f logit (β 0 + β 1 x a ). This link implies a logistic error term instead of a normal error term. Because McKelvey-Zavoina's pseudo − R 2 is based on a normally distributed latent trait, (15/16)(π/ √ 3)f logit (y * ) ≈ f probit (y * ) is a good approximation of the probit model. It is conventional to use the standard logistic distribution for the error term, which has a standard error of π/ √ 3. However, a multiplication factor of (15/16)(π/ √ 3) provides the best approximation to the probit model on which McKelvey-Zavoina's pseudo − R 2 is based. [3]