An Empirical Bayes Mixture Model for Effect Size Distributions in Genome-Wide Association Studies

Characterizing the distribution of effects from genome-wide genotyping data is crucial for understanding important aspects of the genetic architecture of complex traits, such as number or proportion of non-null loci, average proportion of phenotypic variance explained per non-null effect, power for discovery, and polygenic risk prediction. To this end, previous work has used effect-size models based on various distributions, including the normal and normal mixture distributions, among others. In this paper we propose a scale mixture of two normals model for effect size distributions of genome-wide association study (GWAS) test statistics. Test statistics corresponding to null associations are modeled as random draws from a normal distribution with zero mean; test statistics corresponding to non-null associations are also modeled as normal with zero mean, but with larger variance. The model is fit via minimizing discrepancies between the parametric mixture model and resampling-based nonparametric estimates of replication effect sizes and variances. We describe in detail the implications of this model for estimation of the non-null proportion, the probability of replication in de novo samples, the local false discovery rate, and power for discovery of a specified proportion of phenotypic variance explained from additive effects of loci surpassing a given significance threshold. We also examine the crucial issue of the impact of linkage disequilibrium (LD) on effect sizes and parameter estimates, both analytically and in simulations. We apply this approach to meta-analysis test statistics from two large GWAS, one for Crohn’s disease (CD) and the other for schizophrenia (SZ). A scale mixture of two normals distribution provides an excellent fit to the SZ nonparametric replication effect size estimates. While capturing the general behavior of the data, this mixture model underestimates the tails of the CD effect size distribution. We discuss the implications of pervasive small but replicating effects in CD and SZ on genomic control and power. Finally, we conclude that, despite having very similar estimates of variance explained by genotyped SNPs, CD and SZ have a broadly dissimilar genetic architecture, due to differing mean effect size and proportion of non-null loci.

In this section we describe a simple multivariate, additive generative (causal) model for the genotypephenotype relationship. While this model is almost certainly a simplification of reality (e.g., ignoring dominance and epistasis), similar models form the basis for many discussions of effect size distributions in the literature (see, e.g., [1,2]). For the jth subject, j = 1, . . . , n, we have data {x x x j , y j }, where x x x j is the vector of mean-centered allele counts from N assayed bi-allelic loci and y j is a mean-centered quantitative response variable. We treat the dichotomous outcome case below. Let denote the n × N matrix of mean-centered allele counts. We assume a simple additive generative model where β β β is an N -dimensional vector of causal (per-allele) effects and is an n-dimensional vector containing additive environmental and measurement error effects. If the phenotype is continuous, then the observed If the phenotype is dichotomous, then y j = I y * j ≥0 , where I y * j ≥0 = 1 if y * j ≥ 0 and zero otherwise.
We make the following assumptions: (A1) Causal loci are a subset of tagged loci.
(A2) The x x x i are in Hardy-Weinberg Equilibrium (HWE).
Assumption (A1) is not necessary for the derivation but simplifies the description of the model. Here, (µ µ µ, Σ Σ Σ) denotes a multivariate normal distribution with mean µ µ µ and positive definite covariance matrix Σ Σ Σ. If the phenotype is continuous, we typically make the additional assumption that is multivariate normal. If the phenotype is dichotomous, assuming that the i are independent with the logistic distribution leads to a logistic regression model framework. Under Assumption (A2), the x ij are random draws from X i ∼ Bin(2, p i ) − 2p i , so that E{X i } = 0 and Var{X i } = 2p i (1 − p i ), where p i is the population proportion of 2 reference alleles for the ith SNP, and we neglect uncertainty arising from estimation of p i . We also assume that the marginal density g of the causal effects β i is symmetric around zero with finite first and second moments, so that E{β i } = 0 and σ 2 β ≡ E{β 2 i } = Var{β i }.

Regression Estimates
Here, for ease of presentation we focus the development on genome-wide analyses of quantitative traits. We assume the association of the ith SNP with the quantitative trait is assessed via univariate linear regression.
In the absence of covariates, the least squares estimates can be expressed as where The λ ii are estimates of the covariance between the ith and i th SNPs, with where ρ ii is the correlation between the ith and i th SNPs. Let ξ ii = E X { ξ ii }, the regression coefficient of the i th SNP on the ith SNP, so that The symbol " " denotes asymptotic equality as n → ∞. Thus, and hence the bias in estimation of β i from univariate regression estimates b i depends on the distribution of causal effects β i , as well as the distribution of ξ ii , for i = 1, . . . , N . In particular, non-null ("large") effects

Polygenic Risk Prediction
Posterior expectations E{δ i | Z i = z i }, i = 1, . . . , N , can be used to compute polygenic risk scores, defined below. Suppose the phenotype for the jth subject is dichotomous, Y j ∈ {0, 1}, where 0 indicates a control and 1 a patient. Let γ i = p 1i −p 0i , i = 1, . . . , N , where p 1i is the proportion of reference alleles in the patients and p 0i in the controls. If there are an equal number of patients and controls in the sample, p i = (p 0i +p 1i )/2, and hence Define the polygenic risk score for the jth subject as and SCZ papers [3,4]. If (a)-(c) are not valid, a random effects meta-analysis can be applied instead [5]. the algorithm is able to recover the ground truth when independent SNPs are to fit the algorithm.

Simulation 2
We also studied if the impact of linkage disequilibrium (LD) between SNPs, and if this could be the source of the observed non-zero small effect variance σ 1 , thus accounting for the observed non-zero slope around the origin. The z-scores of SNPs in a LD block were sampled from a multivariate normal distribution for each sub-study. The size of LD block, in terms of number of SNPs, were sampled from U (50, 100). We assume that the SNP in the middle of the LD block has the largest effect, and the effect decays both directions. To model this decay, we constructed a exponential decay with starting value=1, and stop value= 0.7. Then, the σ 2 value of the middle SNPs were multiplied by the decay parameter. The covariance matrix of the multivariate normal was then constructed by taking the outer product of the computed decaying σ 2 . estimates are very close to the ground truth with no LD, as in the first set of simulations. In the second row, the effects of LD are apparent: the "small" effects are now non-zero, which is apparent both from the non-zero slope of the posterior expectations and in the median estimate of σ 1 = 0.001. Also, the median estimate of the non-null proportion is much higher than the non-null proportion of the generative effects.
The fitting algorithm still captures the distribution of effects on the z-scores (as can be seen by the very close fits of the posterior means and variances in the first and second columns), but the interpretation of 5 these effects is different, i.e., the interpretation is in terms of the posterior expectation and variances of the (massively univariate) regression estimates b i rather than the causal effects β i , i = 1, . . . , N .