A Model of Compound Heterozygous, Loss-of-Function Alleles Is Broadly Consistent with Observations from Complex-Disease GWAS Datasets

The genetic component of complex disease risk in humans remains largely unexplained. A corollary is that the allelic spectrum of genetic variants contributing to complex disease risk is unknown. Theoretical models that relate population genetic processes to the maintenance of genetic variation for quantitative traits may suggest profitable avenues for future experimental design. Here we use forward simulation to model a genomic region evolving under a balance between recurrent deleterious mutation and Gaussian stabilizing selection. We consider multiple genetic and demographic models, and several different methods for identifying genomic regions harboring variants associated with complex disease risk. We demonstrate that the model of gene action, relating genotype to phenotype, has a qualitative effect on several relevant aspects of the population genetic architecture of a complex trait. In particular, the genetic model impacts genetic variance component partitioning across the allele frequency spectrum and the power of statistical tests. Models with partial recessivity closely match the minor allele frequency distribution of significant hits from empirical genome-wide association studies without requiring homozygous effect sizes to be small. We highlight a particular gene-based model of incomplete recessivity that is appealing from first principles. Under that model, deleterious mutations in a genomic region partially fail to complement one another. This model of gene-based recessivity predicts the empirically observed inconsistency between twin and SNP based estimated of dominance heritability. Furthermore, this model predicts considerable levels of unexplained variance associated with intralocus epistasis. Our results suggest a need for improved statistical tools for region based genetic association and heritability estimation.

Introduction Risk for complex diseases in humans, such as diabetes and hypertension, is highly heritable yet the causal DNA sequence variants responsible for that risk remain largely unknown. Genomewide association studies (GWAS) have found many genetic markers associated with disease risk [1]. However, follow-up studies have shown that these markers explain only a small portion of the total heritability for most traits [2,3].
There are many hypotheses which attempt to explain the 'missing heritability' problem [2][3][4][5]. Genetic variance due to epistatic or gene-by-environment interactions is difficult to identify statistically because of, among other reasons, increased multiple hypothesis testing burden [6,7], and could artificially inflate estimates of broad-sense heritability [8]. Well-tagged intermediate frequency variants may not reach genome-wide significance in an association study if they have smaller effect sizes [9,10]. One appealing verbal hypothesis for this 'missing heritability' is that there are rare causal alleles of large effect that are difficult to detect [4,11,12]. These hypotheses are not mutually exclusive, and it is probable that a combination of models will be needed to explain all heritable disease risk [13].
The standard GWAS attempts to identify genetic polymorphisms that differ in frequency between cases and controls. A complementary approach is to estimate the heritability explained by genotyped (and imputed) markers (SNPs) under different population sampling schemes [14,15]. Stratifying markers by minor allele frequency (MAF) prior to performing SNP-based heritability estimation allows the partitioning of genetic variation across the allele frequency spectrum to be estimated [16], which is an important summary of the genetic architecture of a complex trait [16][17][18][19][20][21][22][23]. This approach has inferred a contribution of rare alleles to genetic variance in both human height and body mass index (BMI) [16], consistent with theoretical work showing that rare alleles will have large effect sizes if fitness effects and trait effects are correlated [18,[20][21][22][23][24][25]. Yet, simulations of causal loci harboring multiple rare variants with large additive effects predict an excess of low-frequency significant markers relative to empirical findings [4,26].
SNP-based heritability estimates have concluded that there is little missing heritability for height and BMI, and that the causal loci simply have effect sizes that are too small to reach genome-wide significance under current GWAS sample sizes [14,16]. Further, extensions to these methods decompose genetic variance into additive and dominance components and find that dominance variance is approximately one fifth of the additive genetic variance on average across seventy-nine complex traits [27]. When taken into account together with results from GWAS, these observations can be interpreted as evidence that the genetic architecture of human traits is best-explained by a model of small additive effects. However, a recent large twin study found a substantial contribution of dominance variance for fourteen out of eighteen traits [28]. The reason for this discrepancy in results remains unclear. One possibility is a statistical artifact; for example, twin studies may be prone to mistakenly infer non-additive effects when none exist. Another possibility, which we return to later, is that this apparently contradictory results are expected under a different model of gene action.
The design, analysis, and interpretation of GWAS are heavily influenced by the "standard model" of quantitative genetics [29]. This model assigns an effect size to a mutant allele, but formally makes no concrete statement regarding the molecular nature of the allele. Early applications of this model to the problem of human complex traits include Risch's work on the power to detect causal mutations [30,31] and Pritchard's work showing that rare alleles under purifying selection may contribute to heritable variation in complex traits [17]. When applied to molecular data, such as SNP genotypes in a GWAS, these models treat the SNPs themselves as the loci of interest. For example, influential power studies informing the design of GWAS assign effect sizes directly to SNPs and assume Risch's model of multiplicative epistasis [32]. Similarly, the single-marker logistic regression used as the primary analysis of GWAS data typically assumes an additive or recessive model at the level of individual SNPs [33]. Finally, recent methods designed to estimate the heritability of a trait explained by genotyped markers assigns additive and dominance effects directly to SNPs [14,16,27,34]. Naturally, the results of such analyses are interpreted in light of the assumed model of gene action.
A weakness of the multiplicative epistasis model [30,31] when applied to SNPs is that the concept of a gene, defined as a physical region where loss-of-function mutations have the same phenotype [35], is lost. Specifically, under the standard model, the genetic concept of a failure to complement is a property of SNPs and not "gene regions" (see [36] for a detailed discussion of this issue). We have recently introduced an alternative model of gene action, one in which risk mutations are unconditionally deleterious and fail to complement at the level of a "gene region" [36]. This model, influenced by the standard operational definition of a gene [35], gives rise to the sort of allelic heterogeneity typically observed for human Mendelian diseases [37], and to a distribution of GWAS "hit" minor allele frequencies [4,26] consistent with empirical results [36]. In this article, we explore this "gene-based" model under more complex demographic scenarios as well as its properties with respect to the estimation of variance components using SNP-based approaches [34] and twin studies. We also compare this model to the standard models of strictly additive co-dominant effects, and multiplicative epistasis with dominance.
We further explore the power of several association tests to detect a causal gene region under each genetic and demographic model. We find significant heterogeneity in the performance of burden tests [36,38,39] across models of the trait and demographic history. We find that population expansion reduces the power to detect causal gene-regions due to an increase in rare variation, in agreement with work by [22,23]. The behavior of the tests under different models provides us with insight as to the circumstances in which each test is best suited.
In total, our results show that modeling gene action is key to modeling GWAS, and thus plays an important role in both the design and interpretation of such studies. Further, the model of gene-based recessivity best explains the differences between estimates of additive and dominance variance components from SNP-based methods [27] and from twin studies [28] and is consistent with the distribution of frequencies of significant associations in GWAS [4,26]. Further, the genetic model plays a much more important role than the demographic model, which is expected based on previous work on additive models showing that the genetic load is approximately unaffected by changes in population size over time, [21,22]. Consistent with recent work by [23], we find that rapid population growth in the recent past increases the contribution of rare variants to total genetic variance. However, we show here that different models of gene action are qualitatively different with respect to the partitioning of genetic variance across the allele frequency spectrum. We also show that these conclusions hold under the more complex demographic models that have been proposed for human populations [21,40].

Results and Discussion
The models As in [36],we simulate a 100 kilobase region of human genome, contributing to a complex disease phenotype and fitness. The region evolves forward in time subject to neutral and deleterious mutation, recombination, selection, and drift. To perform genetic association and heritability estimation studies in silico, we need to impose a trait onto simulated individuals. In doing so, we introduce strong assumptions about the molecular underpinnings of a trait and its evolutionary context.
How does the molecular genetic basis of a trait under natural selection influence population genetic signatures in the genome? This question is very broad, and therefore it was necessary to restrict ourselves to a small subset of molecular and evolutionary scenarios. We analyzed a set of approaches to modeling a single gene region experiencing recurrent unconditionallydeleterious mutation contributing to a quantitative trait subject to Gaussian stabilizing selection. The expected fitness effect of a mutation is always deleterious because trait effects are sampled from an exponential distribution. Therefore, we do not allow for compensatory mutations that may occur in more general models of stabilizing selection. Specifically, we studied three different genetic models and two different demographic models, holding the fitness model as a constant. Parameters are briefly described in Table 1.
We implemented three disease-trait models of the phenotypic form P = G + E. G is the genetic component, and E ¼ Nð0; s 2 e Þ is the environmental noise expressed as a Gaussian random variable with mean 0 and variance s 2 e . In this context, s 2 e should be thought of as both the contribution from the environment and from the remaining genetic variance at loci in linkage equilibrium with the simulated 100kb region. The genetic models are named the additive codominant (AC) model, multiplicative recessive (Mult. recessive; MR) model and the genebased recessive (GBR) model. The MR model has a parameter, h, that controls the degree of recessivity; we call this model the complete MR (cMR) when h = 0 and the incomplete MR (iMR) when 0 h 1. Here, h = 1 corresponds to co-dominance, which is different from the typical formulation used when modeling the fitness effects of mutations directly. It is also important to note that here recessivity is being defined in terms of phenotypic effects; this may be unusual for those more accustomed to dealing directly with recessivity for fitness effects. An idealized relationship between dominance for fitness effects and trait effects of a mutation on an unaffected genetic background is shown in S15 Fig. The critical conceptual difference between recessive models is whether dominance is a property of a locus (nucleotide/SNP) in a gene or the gene overall. Mathematically, this amounts to whether one first determines diploid genotypes at sites (and then multiplies across sites to get a total genetic effect) or calculates a score for each haplotype (the maternal and paternal alleles). For completely co-dominant models, this distinction is irrelevant, however for a model with arbitrary dominance one needs to be more specific. As an example, imagine a compound heterozygote for two biallelic loci, i.e. genotype Ab/aB. In the case of traditional multiplicative recessivity the compound heterozygote is wild type for both loci and therefore wild-type over all; this implies that these loci are in different genes (or independent functional units of the same gene) because the mutations are complementary. However, in the case of gene-based recessivity [36], neither haplotype is wild-type and so the individual is not wildtype; the failure of mutant alleles to complement defines these loci as being in the same gene [35].
For a diploid with m i causative mutations on the i th haplotype, we may define the additive model as where c i,j is the effect size of the j th mutation on the i th haplotype. Each c i,j is sampled from an exponential distribution with mean of λ, to reflect unconditionally deleterious mutation. In other words, when a new mutation arises its effect c is drawn from an exponential distribution, and remains constant throughout its entire sojourn in the population. The GBR model is the geometric mean of the sum of effect sizes on each haplotype [36]. We sum the causal mutation effects on each allele (paternal and maternal) to obtain a haplotype score. We then take the square root of the product of the haplotype scores to determine the total genetic value of the diploid.
Finally, the MR model depends on the number of positions for which a diploid is heterozygous (m Aa ) or homozygous (m aa ) for causative mutations, Thus, h = 0 is a model of multiplicative epistasis with complete recessivity (cMR), and h = 1 closely approximates the additive model when effect sizes are small.
Here, phenotypes are subject to Gaussian stabilizing selection with an optimum at zero and standard deviation of σ s = 1 such that the fitness, w, of a diploid is proportional to a Gaussian function [41].
The AC and MR models draw no distinction between a "mutation" and a "gene" (as discussed in [36]). The GBR is also a recessive model, but recessivity is at the level of a haplotype (or allele) and is not an inherent property of individual mutations (see [36] for motivation of this model). Viewed in light of the traditional AC and MR models, the recessivity of a site in the GBR model is a function of the local genetic background on which it is found. Based on several qualitative comparisons we find that the GBR model is approximated by iMR models with 0.1 h 0.25. However, no specific iMR model seems to match well in all aspects. The demographic models are that of a constant sized population (no growth) and rapid population expansion (growth).
The use of the MR model is inspired by Risch's work [30,31], linking a classic evolutionary model of multiple loci interacting multiplicatively [42,43] to the the genetic epidemiological parameter relative risk. Risch and Merikangas [44] used this model to calculate the power to detect causal risk variants as a function of their frequency and effect size. Pritchard extended Risch's model to consider a trait explicitly as a product of the evolutionary process [17]. Pritchard's work demonstrated that the equilibrium frequency distribution suggested an important role for rare deleterious mutations when a trait evolves in a constant sized, randomly mating population with recurrent mutation and constant effect sizes. However, multiplicative epistasis is only one model of gene action. Exploring the effect of different genotype-to-phenotype models on the population and quantitative genetic properties of complex traits is the focus of the current work.

Additive and dominance genetic variance in the population
The amount of narrow sense heritability, h 2 = (V A )/(V P ), explained by variants across the frequency spectrum is directly related to the effect sizes of those variants [29]. Thus, this measure is an important predictor of statistical power of GWAS and should inform decisions about study design and analysis [45]. Empirically, SNP-based estimates of heritability have inferred negligible dominance variance underlying most quantitative traits [27]. We have a particular interest in the amount of additive variance, V A , that is due to rare alleles and how much of genetic variance, V G , is attributable to V A under different recessive models.
We follow the approach of [21], by calculating the cumulative percent of V G explained by the additive effects of variants less than or equal to frequency x, (V A;q x )/(V G ). The product of this ratio and broad-sense heritability is an estimate of the narrow-sense heritability, h 2 . This calculation is a population-wide equivalent to a SNP-based estimate of heritability in a population sample. In addition we calculate the same distribution for dominance effects (V D;q x )/(V G ) using the orthogonal model of [27]. Methods based on summing effect sizes [29] or the site frequency spectrum [21] would not apply to the GBR model, because the effect of a variant is not independent of other variants (e.g., there is intralocus epistasis). Therefore, we resort to a regression-based approach, where we regress the genotypes of the population onto the total genetic value as defined in our disease trait models (see Material and Methods). In the limit of Hardy-Weinberg and linkage equilibrium, the regression estimates are equivalent to standard quantitative genetic estimates [29] (S14 Fig). For consistency, we applied the regression approach to all models. Overall, these distributions are substantially different across genetic models, demographic scenarios and model parameters (Fig 1). Under the AC model, all of V G is explained by additive effects if all variants are included in the calculation; in Fig 1 the solid variance curves reach unity in the AC panel. Low frequency and rare variants (q < 0.01) explain a large portion of narrow sense heritability (26%-95%) even in models without rapid population expansion. Further, the variance explained at any given frequency threshold increases asymptotically to unity as a function of increasing λ (S4 Fig). While the total heritability of a trait in the population is generally insensitive to population size changes (S1 Fig, see also [21,22,46]), rapid population growth increases the fraction of additive genetic variation due to rare alleles (Fig 1).
Here, increasing λ corresponds to stronger selection against causative mutations, due to their increased average effect size. Recent work by Zuk et al. [24], takes a similar approach and relates the allele frequency distribution directly to design of studies for detecting the role of rare variants. However, our findings contrast with those of Zuk et al. [24] and agree with those of Lohmueller [22], in that we predict that population expansion will substantially increase the heritability, or portion of genetic variance, that is due to rare variants. Our results under the AC model agree with those of Simons et al. [21], in that we find that increasing strength of selection, increasing λ in our work, increases the contribution to heritability of rare variants. However, under the GBR model and the cMR model the distribution of genetic variance over risk allele frequency as function λ is non-monotonic (Fig 1 and S4 Fig).
For all recessive models, we find that total V A is less than V G (Fig 1). For the MR models, all additional genetic variation is explained by the dominance variance component; in  The cumulative additive and dominance genetic variance which can be explained by markers whose frequencies, q, are x. Each color represents a different value of λ: the mean effects size of a new deleterious mutation. Shown here are the gene-based (GBR), additive co-dominant (AC), incomplete multiplicative recessive (Mult. recessive (h = 0.25); iMR) and complete multiplicative recessive (Mult. recessive (h = 0);cMR) models. Solid lines show the additive variance alone and dotted lines show the combined additive and dominance variance. All data shown are for models where H 2 * 0.08. These particular results are robust to changes H 2 when V G is not changed, as is the case here. The additive and dominance genetic variance is estimated by the adjusted r 2 of the regression of all markers (and their corresponding dominance encoding) with q x onto total genotypic value (see methods for details); data are displayed as the mean of 250 simulation replicates. The vertical dotted and dashed lines correspond to the q = 0.001 and q = 0.05, respectively. The curves under no growth appear to be truncated with respect to rapid growth because the range of the x-axis differs between growth and no growth (minimum q = 1/2N). dotted variance curves reach unity in the MR panels. As expected, genetic variation under the MR model with partial recessivity (h = 0.25) is primarily additive [29,47], whereas V G under the cMR model (h = 0) is primarily due to dominance. The GBR model shows little dominance variance and is the only model considered here for which the total V G explained by V A +V D is less than the true V G for all λ. This can be clearly seen in Fig 1 where the dotted curves do not reach unity in the GBR panel. These observations concerning the GBR model are consistent with the finding of [27] that dominance effects of SNPs do not contribute significantly to the heritability for complex traits.
Under the GBR model, large trait values are usually due to compound heterozygote genotypes (e.g., Ab/aB, where A and B represent different sites in the same gene) [36]. Therefore, the recessivity is at the level of the gene region while the typical approach to estimating V A and V D assigns effect sizes and dominance to individual mutations. Thus, compound heterozygosity, which is commonly observed for Mendelian diseases (see [36] and references therein) would be interpreted as variation due to interactions (epistasis) between risk variants. Importantly, the GBR model assumes that such interactions should be local, occurring amongst causal mutations in the same locus. While the GBR model is reflective of the original definition of a gene in which recessive mutations fail to complement, we emphasize that this does not imply that mutations are necessarily exomic. The GBR model is of a general genomic region in which mutations act locally in cis to disrupt the function of that region with respect to a phenotype.
The increase in the number of rare alleles due to population growth is a well established theoretical and empirical result [48][49][50][51][52][53][54][55][56][57][58][59][60][61]. The exact relationship between rare alleles [4,17,26,62,63], and the demographic and/or selective scenarios from which they arose [21,22,64], and the genetic architecture of common complex diseases in humans is an active area of research. An important parameter dictating the relationships between demography, natural selection, and complex disease risk is the degree of correlation between a variants effect on the disease trait and its effect on fitness [18,[20][21][22]. In our simulations, we do not impose an explicit degree of correlation between the phenotypic and fitness effects of a variant. Rather, this correlation is context dependent, varying according to the current genetic burden of the population, the genetic background in which the variant is present and random environmental noise. However, if we re-parameterized our model in terms of [18], then we would have τ 0.5 (Gaussian function is greater than or equal to its quadratic approximation), which is consistent with recent attempts at estimating that parameter [20,65]. Our approach is reflective of weak selection acting directly on the complex disease phenotype, but the degree to which selection acts on genotype is an outcome of the model. While the recent demographic history has little effect on key mean values such as broad-sense heritability of a trait or population genetic burden (S1 and S3 Figs), the structure of the individual components in the population which add up to those mean values varies considerably. The specific predictions with respect to the composition of the populations varies drastically across different modeling approaches. It is therefore necessary to carefully consider the structure of a genetic model in a simulation study.
The conclusions reached here also hold when we consider more complex demographic scenarios relevant to human populations. Under the demographic model for European populations from [40], the additive and GBR models show the same behavior as in Fig 1 (S17 Fig). At all key time points where population size changes, V A = V G for the additive model, and the variance explained by rare mutations depends primarily on λ (S17 Fig). For the GBR model, V A < V G (as in Fig 1), and plateaus at the same ratio V A /V G for all time points except immediately after the bottleneck, which results in a short-lived increase in V A /V G that is undetectable by the time growth begins (S17 Fig). All recessive models (GBR, iMR and cMR) may show a transient increase in total V G after the bottleneck, depending on the value of λ (S18 Fig). However, the GBR and iMR models with h > 0.25 showed a return to constant population size levels by the final time point. The changes in V A and V G under recessive models is likely due to the transfer of non-additive variation into V A during a bottleneck, which has been studied thoroughly in the theoretical literature [66,67]. As in Fig 1, the genetic model, and not the demographic details, drive the relationship between mutation frequency and additive genetic variance. In agreement with existing literature, site based recessive models show complex dynamics during bottlenecks and population expansion (S18 and S19 Figs). However, with respect to load, the GBR model behaves more like a codominant model and is largely insensitive to changes in population size(S18 and S19 Figs). Thus, complex traits evolving under the GBR model are not expected to show large differences in load between extant human populations.

Estimating additive and dominance variance from population samples
The previous section shows that the relationship between genetic variance and allele frequency in the entire population strongly depends on the genetic model. Recent estimates of variance components from large population samples of unrelated individuals have inferred that dominance variance (V D ) is negligible for most traits [27]. However, a recent study of more than 10 4 Swedish twins and 18 traits obtained a contradictory result, inferring significant non-additive variance for most traits, which was interpreted as V D [68]. In this section, we show that this apparent inconsistency is expected under certain models of gene action.
We applied GREMLd, MAF-stratified GREMLd (MS-GREMLd), and MAF-stratified Haseman-Elston regression (see Methods for details). We found MS-GREMLd to be numerically unstable on our simulated data, and thus we present results for non-MS-stratified GREMLd. The numerical stability issues likely resulted from some combination of small number of SNPs per region (Oð1000Þ), low total V G in a region, or high variance in effect sizes across causal mutations [69]. Further, for large λ, where V G is primarily due to rare alleles (Fig 1), heritability in a sample may not reflect heritability in the entire population (S13 Fig). When GREMLd is performed on cMR model data after removing variants with MAF 0.01, as done in [27], the total heritability estimate (AD) is quite accurate until λ ! 0.25 where a downward bias is observed. As anticipated, GREMLd using unfiltered data yields results with a slight upward bias [70]. However, for the iMR (h = 0.25) model the filtered GREMLd estimates are only accurate for λ < 0.1 reflecting the preponderance of rare causal variants for larger values of λ. Unfiltered GREMLd estimates under the iMR (h = 0.25) model show a slight upward bias for small values of λ, but are otherwise accurate. This shows that GREMLd is performing as expected under the site-based model for which it is designed. The MS-HE regression results are generally consistent with the GREMLd results.
The GREMLd and MS-HE estimates are accurate under the GBR model when λ is small, because most heritability is additive in that case (Fig 1). However, under the GBR model, both filtered and unfiltered GREMLd heritability estimates show downward bias when λ is large (Fig 2). The MS-HE regression results reveal a similar pattern, which indicates that the downward bias for large values of λ is not strictly due to removal of rare variants in the filtered GREMLd analysis. Instead, the bias shown for large values of λ is likely due to the presence of substantial non-additive heritability, which is not captured by the dominance effects of SNPs.
In contrast to the variance component methods, our simulated large twin studies provide approximately unbiased estimates of total heritability for large values of λ, but were biased upward for small effect sizes under the AC and GBR models (Fig 2). The variance in twin-study estimates was quite large, possibly because only a single locus was simulated rather than the whole genome. Formally, twin studies estimate an additive and a non-additive component of variance and interpreting the non-additive component as epistatic or dominance variance is a matter of perspective. However, the GBR model is inspired by the definition of a gene as a physical region in which recessive mutations leading to the same phenotypic outcome fail to complement [35], consistent with the allelic heterogeneity observed for human Mendelian disorders (see [36] for further discussion). Thus, the model of recessivity at the level of the gene region is picked up as non-additive variance in twin studies, but missed by variance component methods (GREML and HE regression) because the dominance in the GBR model is due to Ab/aB (compound heterozygotes) genotypes rather than a/a genotypes (homozygotes for a specific loss of function variant) assumed by variance component methods. Thus the contradictory results of applying variance component methods [27] and analysis of large twin studies [68] in order to estimate V A and V D may be interpreted as evidence for a model of gene action such as the GBR, which may be viewed as either recessivity at the haplotype/gene level or intralocus epistasis at the level of causative mutations in a single gene region. Both interpretations are valid. The alternative explanation is that we must assert that one of the study designs is generating artifacts.

The genetic model affects the outcomes of GWAS
Both demography and the model of gene action affect the degree to which rare variants contribute to the genetic architecture of a trait (Fig 1). However, the different mappings of genotype to phenotype from model to model make it difficult to predict a priori the outcomes of GWAS under each model. Therefore, we sought to explicitly examine the performance of statistical methods for GWAS under each genetic and demographic model. We assessed the power of a single marker logistic regression to detect the gene region by calculating the proportion of model replicates in which at least one variant reached genome wide significance at α 10 −8 (Fig 3A). The basic logistic regression is equivalent to testing for association under the AC model. We simulated both a perfect "genotyping chip" (all markers with MAF ! 0.05) and complete re-sequencing including all markers (Fig 3B).
One of the most prominent feature of Fig  Across all genetic models, the single marker logistic regression has less power under population expansion (Fig 3A). The loss of power is attributable to a combination of rapid growth resulting in an excess of rare variants overall [48][49][50][51][52][53][54][55][56][57][58][59][60][61], and the increasing efficacy of selection against causal variants in growing populations [21]. While complete resequencing is more powerful than a gene-chip design, the relative power gained is modest under growth (Fig 3A). Region-based rare variant association tests behave similarly with respect to population growth ( Fig 3B).
There are important differences in the behavior of the examined statistical methods across genetic models. We focus first on the single marker tests (Fig 3A). For gene-chip strategies, power increases for "site-based" models as recessivity of risk variants increases (compare power for AC, iMR, and cMR models in Fig 3B). This increase in power is due to the wellknown fact that recessive risk mutations are shielded from selection when rare (due to being mostly present as heterozyogtes), thus reaching higher frequencies on average (S5 Fig), and that the single-marker test is most powerful when risk variants are common [32]. Further, for the complete multiplicative-recessive model (cMR), the majority of V G is due to common variants (Fig 1), explaining why resequencing does not increase power for this model (Fig 3A).
For single-marker tests, the GBR model predicts large gains in power under resequencing for intermediate λ (the mean trait-effect size of newly arising causal mutations), similar to the AC or iMR model. But, when λ is larger power may actually be less under the GBR model than under AC or iMR. For all models, causal mutations are more rare with increasing λ (S7 Fig). However, as a function of frequency, all V G may be attributed to V A or V D in the site-specific models whereas there is increasing intralocus epistasis in the GBR model as a function of λ (Fig 1). It is well-known that the single marker test has lower power when causal mutations have low frequencies, are poorly tagged by more common SNPs, or have small main effects [32,72].
Region-based rare variant association tests show many of the same patterns across genetic model and effect size distribution as single marker tests, but there are some interesting differences. The ESM test [36,73] is the most powerful method tested for the AC, iMR, and GBR models (Fig 3b), with the c-Alpha test as a close second in some cases. For those models, the power of naive SKAT, linear kernel SKAT and SKAT-O, is always lower than the ESM and c-Alpha tests. This is peculiar since the c-Alpha test statistic is the same as the linear kernel SKAT test. The major difference between SKAT and ESM/c-Alpha is in the evaluation of statistical significance. SKAT uses an analytical approach to determine p-values while the ESM/c-Alpha tests use an explicit permutation approach. This implies that using permutation based p-values results in greater power. Yet, under the cMR model the linear kernel SKAT is the most powerful, followed by c-alpha. The cMR model does not predict a significant burden of rare alleles and so the default beta weights of SKAT are not appropriate, and the linear kernel is superior. The ESM test does poorly on this model because there are not many marginally significant low-frequency markers. It is logical to think that these tests would all perform better if all variants were included. The massive heterogeneity in the performance of region-based rare variant tests across models strongly suggests that multiple methods should be used when prior knowledge of underlying parameters is not available. In agreement with [22,74], we predict that population growth reduces the power to associate variants in a causal gene region with disease status (Fig 3) when the disease also impacts evolutionary fitness. We have recently released software to apply the ESM test to case control data [73] in order to facilitate applying this test to real data.

The distribution of minor allele frequencies of GWAS hits
It was noted by [4,26], that an excess of rare significant hits, relative to empirical data, is predicted by AC models where large effect mutations contribute directly to fitness and the disease trait. We confirm that AC models are inconsistent with the empirical data (Fig 4), except when λ 0.01. The empirical data in Fig 4 represent a pooled data set with the same diseases and quality filters as in [26], but updated to include more recent data. The data are described in S1 Table, and can be visualized alone more clearly in S16 Fig. Close to half of the data comes from GWAS studies uploaded to the NHGRI database after 2011, yet the same qualitative pattern is observed. This contradicts the hypothesis that the initial observation of an excess of common significant hits relative to the prediction under an AC model was simply due to small sample sizes and low marker density in early GWAS previously analyzed in [4,26]. Yet the initial observation is in fact robust and the meta-pattern provides an appropriate point of comparison when considering the compatibility of explicit population-genetic models with existing GWAS data.
The GBR model predicts few rare significant hits and an approximately uniform distribution across the remainder of MAF domain (Fig 4),

even for intermediate and large values of λ.
For smaller values of λ, the GBR predicts an excess of common significant hits. The more uniform distribution of significant single markers seen under the GBR is consistent with the flatter distribution of genetic variance (Fig 1). If one considers trying to determine an approximate dominance coefficient in the GBR model, it would be found that there is a distribution of coefficients across sites. Yet, when simulating iMR model, we find that an intermediate degree of dominance, h = 0.25, results in distribution of significant hits which is similar to the GBR results (Fig 4).
Most of the models fail a KS test comparing the simulated and empirical distribution of significant hits(S21 Fig).The cMR (h = 0) model shows a visual excess of intermediate frequency variants (Fig 4), but this does not result in rejection under the KS test (S21 Fig) which is largely insensitive to deviations in the tails. According to the KS test, the remaining models(AC, GBR, iMR) perform best when there are fewer data points in the simulated data due to low GWAS power. This suggests that all models would be rejected with enough replicates. We note that there is no compelling reason to expect any specific value of λ to be a particularly good fit to the empirical data. The empirical data are composed of genome-wide data for multiple traits. We feel that the mutational parameters, λ and mutation rate to causal variants, are likely to vary across the genome and across traits. Thus, the empirical data reflect a mixture of different underlying models and ascertainment schemes. The reason we emphasize this feature of the data is to demonstrate that models with rare alleles of large effect do not necessarily imply a visual excess of rare significant GWAS hits.
In consideration of the rare allele of large effect hypothesis, [62] proposed a model where multiple rare alleles dominate disease risk and create synthetic associations with common SNPs. However, later it was shown that this particular model was inconsistent with GWAS theoretically and empirically [4,26,75]. Here, we have shown that there exist models in which rare alleles explain a substantial portion of heritability that are not inconsistent with findings from GWAS. We find that the MAF distribution of significant hits in a GWAS varies widely with choice of genetic model. In particular, we confirm the results of Wray et al. [26], that AC evolutionary models predict an excess of low frequency significant hits unless trait effect sizes are quite small. Also, the cMR model predicts an excess of intermediate and common significant hits. Utilizing a GBR model or an iMR model with h = 0.25−0.5, reconciles this inconsistency by simultaneously predicting the importance of rare alleles of large effect and the correct allele frequency distribution among statistically significant single markers. Horizontal violin plots depict the distribution of minor allele frequencies (MAF) of the most strongly associated single marker in a GWAS. Individual hits are plotted as translucent points and jittered to provide a sense of the total number and density of hits. Each panel contains simulated data pooled across model replicates for each value of λ with empirical data for comparison. Empirical data are described in Materials and Methods. In cases where more than one marker was tied for the lowest p-value, one was chosen at random. Shown here are the additive co-dominant (AC), gene-based (GBR), incomplete multiplicative recessive (Mult. recessive (h = 0.25); iMR) and complete multiplicative recessive (Mult. recessive (h = 0);cMR) models. All data shown are for models where H 2 * 0.08, because single marker test power was too low under H 2 * 0.04 to make informative density plots. To further increase the number simulated data points, we perfromed n = 1,250 replicates at each level for this figure. Simulated data were subjected to ascertainment sampling such that the MAF distribution of all markers on the simulated genotyping chip was uniform. Specific information regarding the empirical data can be obtained in S1 Table. doi:10.1371/journal.pgen.1006573.g004

Conclusion
Several empirical observations provide support for the presence of gene-based recessivity underlying variation for some complex traits in humans. The minor allele frequency distribution of significant GWAS hits is relatively flat [4,76], which our results show is consistent with either the presence of small additive effect loci or gene-/site-based partially-recessive loci with intermediate to large effects (Fig 4). Models with loci of large additive effects predict an excess of rare significant hits. Oppositely, models with complete site-based recessivity predict an excess of common significant hits for all simulated mutation effect size distributions.
SNP based estimates of dominance heritability are much lower than estimates of dominance from twins [27,68]. Of the models we explored, only the gene-based recessive model with intermediate to large effects is consistent with the difference between twin and SNP based estimates of dominance variance (Fig 2). Under a site-based recessive model of partial recessivity (e.g. h = 0.25), there should be no significant difference between estimates of dominance variance from SNP and twin studies, provided that the statistical assumptions are met for both approaches (Fig 2). These results are complementary to the work by Zuk et. al [24], who show that twin studies can over estimate heritability under a model with gene interactions. It now appears clear that the underlying genetic model does not have the same impact on SNP-based and family based study designs; an issue which should be further explored. Our findings also support a more thorough investigation into the importance of compound heterozygosity in the genetics of complex traits. However, it may be difficult to directly observe non-additive gene-level effects through analysis of individual SNP markers.
Additionally, the genetic model appears to be important in the design and analysis of association studies. While changes in population size do affect the relationship between effect size and mutation frequency [48][49][50][51][52][53][54][55][56][57][58][59][60][61] (Fig 1 and S5 Fig), different mappings of genotype to trait value do this in radically different ways for the same demographic history (Fig 1). From an empirical perspective, our findings suggest that re-sequencing in large samples is likely the best way forward in the face of the allelic heterogeneity imposed by the presence of rare alleles of large effect. Resequencing of candidate genes [77][78][79][80] and exomes [40,[81][82][83][84][85][86] in case-control panels have observed an abundance of rare variants associated with case status. Here we show that under a model of mutation-selection balance on the genic level, neither current single-marker nor popular multi-marker tests are especially powerful at detecting large genomic regions harboring multiple risk variants (Fig 3). However, we show that using permutations to derive p-values improves the power of SKAT [69] with a linear kernel (equivalent test statistic to c-Alpha [38]). Similarly, another permutation based test, the ESM test [73], has more robust power across demographic and genetic models (Fig 3).
Conceptually, cis-effects arise naturally from the original definition of a gene in which mutant recessive alleles fail to complement [35]. We show that cis-effects within a locus, represented by the GBR model, can have an important impact on the population level architecture of a complex trait. This conclusion is important for future simulation studies as well as the interpretation of empirical data. It is important to note that despite our use of the term "genebased" this model may apply to any functional genomic element in which there are multiple mutable sites affecting a trait in cis, not just to genes. From a theoretical perspective, our work motivates the development of a more generalized gene-based model to include arbitrary dominance and arbitrary locus size. Empirically, we find that the GBR model is broadly consistent with a variety of observations from the human statistical genetics literature. Thus, there is an evident need for improved region-based association tests and the development of genetic variance component methods for haplotypes.

Forward simulation
Using the fwdpp template library v0.2.8 [87], we implemented a forward in time individualbased simulation of a Wright-Fisher population with mutation under the infinitely many sites model [88], recombination, and selection occurring each generation. We simulated populations of size N = 2e4 individuals for a time of 8N generations with a neutral mutation rate of μ = 0.00125 per gamete per generation and a per diploid per generation recombination rate of r = 0.00125. Deleterious mutations occurred at a rate of μ d = 0.1μ per gamete per generation. These parameters correspond to θ = 4Nμ = ρ = 4Nr = 100 and thus our simulation approximates a 100Kb region of the human genome. For simulations with growth, we simulated an additional 500 generations of exponential growth from N i = 2e4 to N final = 1e6. This demographic model is much simpler than current models fit to empirical data [58]. However, this simple model allows us to more easily get a sense of the impact of population expansion [21,22]. 250 simulation trials were performed for each parameter/model combination unless specified otherwise.

Exploring the gene regions contribution to heritability
Broad-sense heritability can be calculated directly from our simulated data as H 2 ¼ V G V P . We explored broad-sense heritability as a function of mean causative effect size λ under each model; λ 2 {0.01, 0.025, 0.05, 0.1, 0.125, 0.25, 0.5}. We compare our simulation results to V G $ 4m d s 2 s for additive models and V G $ 2m d s 2 s for recessive models [71,89]. In our simulations, s

Determining the genetic load of the population
Genetic load is defined as the relative deviation in a populations fitness from the fitness optimum, L ¼ ðw max À " wÞ=ðw max Þ. We set the phenotypic optimum to be zero; P opt = 0. When determining fitness for the site based models, we subtract one from all phenotypes. This implies that w max ¼ e À P 2 opt 2s 2 s ¼ 1 and that load is a simple function of the phenotypes of the popu- 2s 2 s . We also used the mean number of mutations per individual, and the mean frequency and effect sizes of segregating risk variants as proxies for the genetic load [21,90]. Lastly, we calculated Burden Ratios (B r ) [91] as the ratio of load between an equilibrium and non-equilibrium population. We calculated B r using both the true load and the number of mutations per individual.

Additive and dominance genetic variance over allele frequency
We used an approach based on sequential (type-1) regression sums of squares to estimate the contribution of the additive and dominance effects of variants to the total genetic variation due to a locus. Given a genotype matrix (rows are individuals and columns are risk variants) of (0,1, or 2) copies of a risk allele (e.g. all mutations affecting phenotype), we sort the columns by decreasing risk mutation frequency. Then, within frequency classes, columns were sorted by decreasing effect sizes. For each variant a dominance component was also coded as 0, 2q, or 4q-2 according to the orthogonal model of [27], where q is the frequency of the variant in the population. We then used the R package biglm [92] to regress the individual genetic values (G in the previous section) onto this matrix. The variance explained by the additive and dominance effects of the m markers with q x is then approximately r 2 ¼ ð P m i¼1 SSS reg;i Þ=ðSS tot Þ. Averaging results across replicates, this procedure results in a Monte-Carlo estimate of the fraction of V G that is due to additive and dominance effects of variants with population frequency less than or equal to x is (V A;q x + V D;q x )/(V G;q 1 ) [21]. This fraction can be easily partitioned into strictly additive and dominance components.

Additive and dominance heritability in random population samples
We employed three different SNP-based approaches to estimating heritability from population samples: GREMLd, minor allele frequency stratified GREMLd (MS-GREMLd) [27], and MS-Haseman-Elston (HE) regression [93,94]. For comparison, we calculated the true total heritability in the sample as H 2 sample ¼ ðV G;sample Þ=ðV P;sample Þ. Unfortunately, due to the nature of our simulated data MS-GREMLd did not result in sufficiently reliable results. Under MS-GREMLd, many replicates resulted in numerical errors in GCTA. These problems were present at a rate of less than 1/100 replicates using non-MS GREMLd, but were increased by splitting the data into multiple GRMs.
Using raw individual phenotypic values as quantitative trait values, random samples from simulated populations (n = 6000) were converted to .bed format using PLINK 1.90a [95]. PLINK was also used to test for HWE (p < 1e−6) and filter on minor allele frequency. GCTA 1.24.4 [34] was used to make genetic relatedness matrices (GRM) for both additive and dominance components with the flags -autosome and -make-grm(-d).
For non-MS runs, we tested the effect of filtering on MAF by performing the analysis on unfiltered datasets and with markers with MAF < 0.01 removed. For MS estimates we stratified the additive and dominance GRM's into two bins MAF 0.01 and MAF > 0.01. GREMLd analysis was performed in GCTA with Fisher scoring, no variance component constraint and a max of 200 iterations. MS-HE regression was carried out by regressing the off diagonal elements of each GRM onto the cross product of the scaled and centered phenotypes in a multiple linear regression setting in R [96].

Twin studies
To simulate twin studies we sampled 2000 monozygotic (MZ) and 2000 dizygotic (DZ) twins pairs from the final generation of the simulations. Parents were sampled randomly without replacement. MZ twin pairs were formed by sampling a single gamete pair, one recombinant from each parent, and two environmental random deviates. DZ twin pairs were formed by sampling two gamete pairs, two recombinant gametes from each parent, and two environmental random deviates. Our simulated studies are ideal in that there are no correlated environmental effects, but potentially problematic due to low total heritability. We explored the use of structural equation modeling (SEM) using the package OpenMx [97], but chose to rely strictly on estimates of twin correlation obtained directly from the data. For monozygotic (MZ) twins, we used only a single child gamete pair with two unique environmental deviates. For dizygotic (DZ) twins we used two child gamete pairs, each with a unique environmental deviate. Broad sense heritability is the correlation between MZ twin pairs; H 2 = r MZ . Under a purely additive model, the DZ twin correlation should be half of the MZ twin correlation. Non-additive genetic components of phenotypic variance reduce the DZ twin correlation. If all non-additive heritability is due to dominance, then the dominance heritability can be calculated as twice the difference between the MZ twin correlation and two-times the DZ twin correlation: δ 2 = 2 Ã (r MZ − 2 Ã r DZ ). The additive heritability can then be calculated as the difference between the broad-sense and non-additive component: h 2 = H 2 − δ 2 = 4 Ã r DZ − r MZ [29].
These direct estimates of MZ and DZ twin correlations in our simulations are reliable as we have no measurement error, shared environmental effects, gene-by-environment effects, or gene-by-gene interactions. Additionally, we only simulate a single genomic region contributing H 2 * 0.04, which made use of SEM difficult numerically. This creates a limitation in that we can not discuss when a model with dominance is a better fit to the data than the additive only model. But, the benefit of using direct estimates is that we can clearly see what signals are present in the data. To further clarify the data visualization, we pooled our 512 twin-study replicates into groups of 8, creating 64 sets of MZ-DZ twin phenotypes. This did not have an effect on the central tendencies of our estimates, but it reduced the variance. The twin study error bars in Fig 2 are based on 64 sets of 64,000 individuals, which is larger than a typical twin study. However, one reason our results have high variance is because we only simulate a single locus, rather than a whole trait.

Case-control studies
Following [36], we sampled 3000 cases and 3000 controls from each simulated population. Cases were randomly sampled from the upper 15% of phenotypic values in the population, and controls were randomly sampled from within 0.5 standard deviations of the population mean(as in [36]). This is the liability scale model (see [29]). We define a "GWAS" to be a study including all markers with MAF !5% and a re-sequencing study to include all markers. In all cases we used a minor allele count logistic regression as the single marker test. For single marker tests, the p-value cut off for significance is p 1e − 08 which is common in current GWAS [62,98]. Power is determined by the percentage of simulation replicates in which at least one marker reaches genome wide significance.

Region-based tests of association due to rare alleles
We applied multiple region-based tests to our simulated data, ESM K [36], several variations of SKAT [39] and c-Alpha [38]. We used the R package from the SKAT authors to implement their test (http://cran.r-project.org/web/packages/SKAT/index.html). The remaining tests were implemented in a custom R package (see SOFTWARE AVAILABILITY below). For the ESM K and c-Alpha we performed up to 2e6 permutations of case-control labels to determine empirical p-values. Common variants (q ! 0.05) were removed prior to performing region-based rare variant association tests.

Distribution of significant GWAS hits
Following [4,26], we calculated the distribution of the minor allele frequency (MAF) of the most significant SNPs in a GWAS in empirical and simulated data. The empirical data was obtained from the NHGRI-EBI GWAS database (http://www.ebi.ac.uk/gwas/) on 02/05/2015. We considered the same diseases and applied the same filters as in Table 3 of [26]. Specific information regarding the empirical data can be obtained in S1 Table. In order to mimic ascertained SNP data, we sampled markers from our case/control panels according to their minor allele frequencies [99], as done in [36]. Additionally, we removed all markers with MAF <0.01 to reflect common quality controls used in GWAS. The simulated data were grouped by genetic model, demographic scenario, heritability level, and mutation effect distribution. We then plotted the minor allele frequency of the most significant marker with a single-marker score −log 10 (p)!8, for all replicates where significant markers were present.Finally, we performed a two-sample KS test in R between each group of simulated GWAS hit allele frequencies and the empirical data.

Human demography
We simulated a demographic model for Europeans based on [40] as described in [21]. For simplicity, we ignored migration between the European (EA) and African American (AA) populations. The model was implemented using the Python package fwdpy version 0.0.4, which uses fwdpp [87] version 0.5.1 as a C++ back-end. During the evolution of the EA population, we recorded the genetic variance in the population, V G , and the number of deleterious mutations per diploid (a measure of genetic load [21]) every 50 generations. In a separate set of simulations, we applied the regression method described above to calculate cumulative additive genetic variance as a function of allele frequency. Because the regressions are computationally demanding, we applied the method in the generation immediately before, and at the start of, any changes in population size.
These simulations were run with no neutral mutations, and the recombination rate and mutation rate to causative mutations were the same as in the simulations described above.
The Python scripts for these simulations and iPython/Jupyter notebooks used for generating figures are available online (see Software availability section below).

Software availability
Our simulation code and code for downstream analyses are freely available at The data plotted as mean over simulation replicates ±se. The data are calculated for the entire simulated population. B) The mean genetic value of a gamete, i.e. the average sum of mutational effect sizes on a gamete as a function of λ. Data are plotted as the mean across model replicates ± the standard error of the mean. In the case the gene-based recessive model, this value is also the expected value of the mean phenotype and is accurate within the sampling variance of the mean environmental variate and random pairing of gametes in diploid. Shown are the additive co-dominant (AC), gene-based (GBR) and complete multiplicative recessive Horizontal violin plots depict the distribution of minor allele frequencies (MAF) of the most strongly associated single marker in a GWAS. Individual hits are plotted as translucent points and jittered to provide a sense of the total number and density of hits. Each panel contains simulated data pooled across model replicates for each value of λ, with empirical data for comparison. The degree of dominance h was varied from 0.1 to 0.75; perfect co-dominance here is h = 1. Empirical data were downloaded from the NHGRI-EBI GWAS database (http:// www.ebi.ac.uk/gwas/) on 02/03/2015, diseases and inclusion criteria are as in [26]. In cases where more than one marker was tied for the lowest p-value, one was chosen at random. Simulated data were subjected to ascertainment sampling such that the MAF distribution of all markers on the simulated genotyping chip was uniform. Specific information regarding the empirical data can be obtained in S1 Table. (TIFF) S11 Fig. Skew of  (B) Cumulative percent of variance explained across the risk allele frequency, based on regression estimates and classical formula. 1000 unlinked markers were simulated with effects drawn from an exponential distribution with mean 0.1 and population frequencies drawn from the neutral Wright-Fisher allele frequency distribution. Sample data for 5000 individuals were then generated by sampling genotypes at each marker based on its allele frequency. We plot the regression estimate of variance explained by each marker against Fisher's classic result [9]: V G = 2pqα 2 . (TIFF) Histogram GWAS hits (n = 1208) obtained from the NHGRI-EBI GWAS database for disease discussed in [26]. Data are described in S1 Table. (TIFF) The left column of panels shows how V G changes over time under this model. The right column shows how the mean number of deleterious mutations per individual changes. The models shown are: a = additive, g = GBR, and m = multiplicative with varying degrees of dominance (h). The main difference is between additive models (a or m with h = 1.0) and recessive models (g or m with small h). The former models are largely insensitive to changes in N, while the recessive models show transient increases in V G and "load" immediately following a bottleneck (consistent with [21]). However, at the final time point representing the "modern European population", all mean V G is % 4μ for additive models and % 2μ for recessive models [71,89], and recessive models show larger loads as expected [21]. (TIFF) S19 Fig. The burden ratio over time. The burden ratio [91] is calculated as the ratio of genetic load between simulations with only ancient growth and those with an additional recent bottleneck and growth. Here load is calculated as the average deviation from optimum fitness due to (left) fixed mutations, (middle) segregating mutations and (right) all mutations. Because of the use of the Gaussian fitness function, the total load is not the sum of the fixed and segregating load. The models shown are: a = additive, g = GBR, and m = multiplicative with varying degrees of dominance (h). For large effect size models, under which there are relatively more mutations that experience strong selection, we see the characteristic drop in the burden ratio following the bottleneck and rebound following re-expansion [91]. and inclusion criteria are as in [26]. In cases where more than one marker was tied for the lowest p-value, one was chosen at random. Simulated data were subjected to ascertainment sampling such that the MAF distribution of all markers on the simulated genotyping chip was uniform. Specific information regarding the empirical data can be obtained in S1 Table. (TIFF)