Estimating phenotypic polygenicity and causal effect size variance from GWAS summary statistics while accounting for inflation due to cryptic relatedness

Of signal interest in the genetics of traits are estimating the proportion, π1, of causally associated single nucleotide polymorphisms (SNPs), and their effect size variance, σ β , which are components of the mean heritabilities captured by the causal SNP. Here we present the first model, using detailed linkage disequilibrium structure, to estimate these quantities from genome-wide association studies (GWAS) summary statistics, assuming a Gaussian distribution of SNP effect sizes, β. We apply the model to three diverse phenotypes – schizophrenia, putamen volume, and educational attainment – and validate it with extensive simulations. We find that schizophrenia is highly polygenic, with ≃ 5 × 10 causal SNPs distributed with small effect size variance, σ β = 3.5 × 10 (in units where the phenotype variance is normalized to 1), requiring a GWAS study with more than 1/2-million samples in each arm for full discovery. In contrast, putamen volume involves only ≃ 3 × 10 causal SNPs, but with σ β = 1.2 × 10, indicating a much larger proportion of the causal SNPs that are strongly associated. Educational attainment has similar polygenicity to schizophrenia, but with effects that are substantially weaker, σ β = 5 × 10, leading to much lower heritability. Thus the model is able to describe the broad genetic architecture of phenotypes where both polygenicity and effect size variance range over several orders of magnitude, shows why only small proportions of heritability have been explained for discovered SNPs, and provides a roadmap for future GWAS discoveries.


INTRODUCTION
The genetic components of complex traits or diseases arise from hundreds to likely many thousands of single nucleotide polymorphisms (SNPs) (Visscher et al., 2012), most of which have weak effects.As sample sizes increase, more of the associated SNPs are identifiable (they reach genomewide significance), though power for discovery varies widely across phenotypes.Of particular interest are estimating the proportion of SNPs (polygenicity) involved in any particular phenotype; their effective strength of association (discoverability); the proportion of variation in susceptibility, or phenotypic variation, captured additively by all common causal SNPs (approximately, the narrow sense heritability), and the fraction of that captured by genomewide significant SNPs -all of which are active areas of research (Stahl et al., 2012;Yang et al., 2015;So et al., 2011;Speed et al., 2012;Lee et al., 2011;Yang et al., 2011a;Kumar et al., 2016;Palla and Dudbridge, 2015).However, the effects of population structure (Price et al., 2010), combined with high polygenicity and linkage disequilibrium (LD), leading to spurious degrees of SNP association, or inflation, considerably complicate matters, and are also areas of much focus (Yang et al., 2011c;Bulik-Sullivan et al., 2015;Kang et al., 2010).Yet, despite recent significant advances, it has been difficult to develop a mathematical model of polygenic architecture based on GWAS that can be used for power estimated across human phenotypes.
Here, in a unified approach explicitly taking into account LD, we present a model relying on genome-wide association studies (GWAS) summary statistics (z-scores for SNP associations with a phenotype (Pasaniuc and Price, 2016)) to estimate polygenicity (π 1 ) and discoverability (σ 2 β ), as well as any residual inflation of the z-scores arising from variance distortion induced by cryptic relatedness (σ 2 0 ), which remains a concern in large-scale studies (Price et al., 2010).We estimate π 1 , σ 2 β , and σ 2 0 , by postulating a z-score probability distribution function (pdf) that explicitly depends on them, and fitting it to the actual distribution of GWAS z-scores.
Estimates of polygenicity and discoverability allow one to estimate compound quantities, like narrow-sense heritability captured by the SNPs (Witte et al., 2014); to predict the power of larger-scale GWAS to discover genomewide significant loci; and to understand why some phenotypes have higher power for SNP discovery and proportion of heritability explained than other phenotypes.
In previous work (Holland et al., 2016) we presented a related model that treated the overall effects of LD on z-scores in an approximate way.Here we take the details of LD explicitly into consideration, resulting in a conceptually more basic model to predict the distribution of zscores.We apply the model to multiple phenotypes, in each case estimating the three model parameters and auxiliary quantities, including the overall inflation factor λ, (traditionally referred to as genomic control (Devlin and Roeder, 1999)) for pruned SNP sets, and narrow sense heritability, h 2 .We also perform extensive simulations on genotypes with realistic LD structure in order to validate the interpretation of the model parameters.

METHODS
The Model: Probability Distribution for z-scores Consistent with the work of others (Yang et al., 2011c), we assume the causal SNPs are distributed randomly throughout the genome (an assumption that can be relaxed when explicitly considering different SNP categories, but that in the main is consistent with the additive variation explained by a given part of the genome being proportional to the length of DNA (Yang et al., 2011b)), and that their β coefficients in the GWAS framework are distributed normally with variance σ 2 β : β ∼ N (0, σ 2 β ).
(1) (We use the symbol β to refer to a scalar or vector, with context indicating which.)Taking into account all SNPs (the remaining ones are all null by definition), this is equivalent to the two-component Gaussian mixture model where N (0, 0) is the Dirac delta function, so that considering all SNPs, the net variance is var(β) = π 1 σ 2 β .Ignoring LD, the association z-scores for causal SNPs can be decomposed into an effect δ and a residual term ǫ ∼ N (0, σ 2 0 ), assumed to be independent (Holland et al., 2016): where N is the sample size and H is the SNP's heterozygosity (frequency of the heterozygous genotype, H = 2p(1− p) where p is the frequency of either of the SNP's alleles), so that var(z) = var(δ) + var(ǫ) where Now consider the effects of LD on z-scores.Let β ef f be the true effective β-coefficient for a tag SNP arising due to LD with neighboring causal SNPs.It is given by the sum of neighboring causal SNP β-coefficients, each weighted by its correlation with the tag SNP: Then, from Eq. 3, the z-score for the tag SNP's association with the phenotype is given by: Thus, for example, if the SNP itself were not causal but were in LD with k known causal SNPs, where its LD with each of these was the same, given by some value r 2 (0 < r 2 ≤ 1), then σ 2 will be given by For this idealized case, the marginal distribution, or pdf, of z-scores for a set of such associated SNPs is where φ(•, µ, σ 2 ) is the normal distribution with mean µ and variance σ 2 , and L is shorthand for the LD structure of such SNPs -in this case, denoting LD given by r 2 with exactly k causals.If a proportion a of all tag SNPs are similarly associated with the phenotype while the remaining proportion are all null (not causal and not in LD with causal SNPs), then the marginal distribution for all SNP z-scores is the gaussian mixture dropping the parameters for convenience.
For real genotypes, however, the LD structure is far more complicated, and of course the causal SNPs are generally numerous and unknown.As in our previous work, we incorporate the model parameter π 1 for the fraction of all SNPs that are causal (Holland et al., 2016).Additionally, we calculate the actual LD structure for each SNP.That is, for each SNP we build a histogram of the numbers of other SNPs in LD with it for w equally-spaced r 2 -windows between 0.05 and 1; we use L again as a shorthand to represent all this.The value r 2 min = 0.05 was chosen as a lower-bound for LD above the noise threshold; we find that w ≃ 10 is sufficient for converged results.For any given SNP, the set of SNPs thus determined to be in LD with it constitute its LD block, with their number given by n (LD with self is always 1, so n is at least 1).The pdf for z-scores, given N, H, L and the three model parameters π 1 , σ β , σ 0 , will then be given by the sum of gaussians that are generalizations of Eq. 10 for different combinations of numbers of causals among the w LD windows, each gaussian scaled by the probability of the corresponding combination of causals among the LD windows, i.e., by the appropriate multinomial distribution term.
For w r 2 -windows, we must consider the possibilities where the tag SNP is in LD with all possible numbers of causal SNPs in each of these windows, or any combination thereof.There are thus w + 1 categories of SNPs: null SNPs (which r 2 -windows they are in is irrelevant), and causal SNPs, where it does matter which r 2 -windows they reside in.If window i has n i SNPs ( w i=1 n i = n), and the overall fraction of SNPs that are causal is π 1 , then the probability of having simultaneously k 0 null SNPs, k 1 causal SNPs in window 1, and so on through k w causal SNPs in window w, for a nominal total of K causals ( w i=1 k i = K and k 0 = n−K), is given by the multinomial distribution, which we denote M (k 0 , ..., k w ; n 0 , ..., n w ; π 1 ).For an LD block of n SNPs, the prior probability, p i , for a SNP to be causal and in window i is the product of the independent prior probabilities of a SNP being causal and being in window i: p i = π 1 n i /n.The prior probability of being null (regardless of r 2 -window) is simply p 0 = (1 − π 1 ).The probability of a given breakdown k 0 , ..., k w of the neighboring SNPs into the w +1 categories is then given by M (k 0 , ..., k w ; n 0 , ..., n w ; and the corresponding gaussian is For a SNP with heterozygosity H and LD structure L, the pdf for its z-score, given N and the model parameters, is then given by summing over all possible numbers of total causals in LD with the SNP, and all possible distributions of those causals among the w r 2 -windows: where K max in bounded above by n.Note again that L is shorthand for the linkage-disequilibrium structure of the SNP, giving the set {n i }, and hence, for a given π 1 , p i .Also there is the constraint w i=1 k i = K on the second summation, and, for all i, max(k i ) = max(K, n i ), though generally -see below -K max ≪ n i .The number of ways of dividing K causal SNPs amongst w LD windows is given by the binomial coefficient m a , where m = K + w − 1 and a = w − 1, so the number of terms in the second summation grows rapidly with K and w.However, because π 1 is small (often ≤10 −3 ), we find that the upper bound on the first summation over total number of potential causals K in the LD block for the SNP can be limited to K max < min(10, n), even for large blocks with n ≃ 10 3 .That is, Kmax K=0 k1,...,kw M (k 0 , ..., k w ; n 0 , ..., n w ; π 1 ) ≃ 1. (15) Still, the number of terms is large; e.g., for K = 8 and w = 5 there are 495 terms.We approximate the sums in Eq. 14 with the simpler expression involving only sums over terms where the causal SNPs all reside in the same r 2 -window, plus a null term.The probability that any K of the n SNPs in the block are causal while the remainder n − K are null is given by the binomial distribution, B(K, n; π 1 ): Multiplying this by n i /n approximates the probability of their being in the i-th r 2 -window.Multiplying these into the gaussian corresponding to K causals in window i, summing over both indices, and incorporating the null term, leads to the following approximation that is in good numerically agreement with Eq. 14:

Data Preparation
For real phenotypes, we calculated SNP minor allele frequency (MAF) and LD between SNPs using the 1000 Genomes phase 3 data set for 503 subjects/samples of European ancestry (Consortium et al., 2015(Consortium et al., , 2012;;Sveinbjornsson et al., 2016).For simulations, we used HapGen2 (Li and Stephens, 2003;Spencer et al., 2009;Su et al., 2011) to generate genotypes; we calculated SNP MAF and LD structure from 1000 simulated samples.We elected to use the same intersecting set of SNPs for real data and simulation.For HapGen2, we eliminated SNPs for which more than 99% of genotypes were identical; for 1000 Genomes, we eliminated SNPs for which the call rate (percentage of samples with useful data) was less than 90%.This left n snp =11,015,833 SNPs.Sequentially moving through each chromosome in contiguous blocks of 5,000 SNPs, for each SNP in the block we calculated its Pearson r 2 correlation coefficients with all SNPs in the central bock itself and with all SNPs in the pair of flanking blocks of size up to 50,000 each.For each SNP we calculated its total LD (TLD), given by the sum of LD r 2 's thresholded such that if r 2 < 0.05 we set that r 2 to zero (zeroing out the noise).For each SNP we also built a histogram giving the numbers of SNPs in w = 8 equally-spaced r 2 -windows covering the range 0.05 ≤ r 2 ≤ 1.These steps were carried out independently for both 1000 Genomes phase 3 and for HapGen2.
Employing a similar procedure, we also built binary (logical) LD matrices identifying all pairs of SNPs for which LD r 2 > 0.8, a liberal threshold for SNPs being "synonymous".
In applying the model to summary statistics, we restricted to SNPs for which TLD ≤ 600, MAF ≥ 0.005, and LD block size (defined by r 2 ≥ 0.05) ≤ 2000.
We analyzed summary statistics for participants with European ancestry for: (1) schizophrenia from the Psychiatric Genomics Consortium (PGC) (Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2014), with 35,476 cases and 46,839 controls (N ef f ≡ 4/(1/N cases +1/N controls ) = 76, 326) across 52 separate substudies, with imputation of SNPs using the 1000 Genomes Project reference panel (1000Genomes Project Consortium, 2010) for a total of approximately 5,369,285 genotyped and imputed SNPs passing the above restrictions (Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2014); (2) putamen volume, normalized by intracranial volume, using data from the Enhancing Neuro Imaging Genetics through Meta-Analysis (ENIGMA) consortium (Hibar et al., 2015), with 12,596 samples and a total of 4,196,831 SNPs;and (3) educational attainment, measured as the number of years of schooling completed, with 328,917 samples and a total of 5,361,110 SNPs, available at https://www.thessgac.org(Okbay et al., 2016).Examples of SNP histograms for schizophrenia are in Supporting Material Fig. 5.

Simulations
We generated genotypes for 10 5 unrelated simulated samples using HapGen2 (Su et al., 2011).For narrow-sense heritability h 2 equal to 0.1, 0.4, and 0.7, we considered polygenicity π 1 equal to 10 −5 , 10 −4 , 10 −3 , and 10 −2 .For each of these 12 combinations, we randomly selected n causal = π 1 ×n snp "causal" SNPs and assigned them β-values drawn from the standard normal distribution (i.e., independent of H), with all other SNPs having β = 0. We repeated this ten times, giving ten independent instantiations of random vectors of β's.Defining Y g = Gβ, where G is the genotype matrix and β here is the vector of coefficients over all SNPs, the total phenotype vector is constructed as Y =Y g +ǫ, where the residual random vector ǫ for each instantiation is drawn from a normal distribution such that h 2 = var(Y g )/var(Y ).For each of the instantiations this implicitly defines the "true" value σ 2 β .The regression slope, β, and the Pearson correlation coefficient, r, are assumed to be t-distributed.These quantities have the same t-value: , with corresponding p-value from Student's t cumulative distribution function (cdf) with N − 2 degrees of freedom: p = 2×tcdf(−|t|, N − 2).Since we are not here dealing with covariates, we calculated p from correlation, which is slightly faster than from estimating the regression coefficient.The t-value can be transformed to a z-value, giving the z-score for this p: z = −Φ −1 (p/2) × sign(r), where Φ is the normal cdf (z and t have same p-value).

Parameter Estimation
We randomly pruned SNPs using the threshold r 2 > 0.8 to identify "synonymous" SNPs, performing ten such iterations.That is, for each of ten iterations, we randomly selected a SNP (not necessarily the one with largest zscore) to represent each subset of synonymous SNPs.For schizophrenia, for example, pruning resulted in approximately 1.3 million SNPs in each iteration.
The postulated pdf for a SNP's z-score depends on the SNP's heterozygosity, H, and detailed LD structure, i.e., its LD histogram, L. Given the data -the set of z-scores for all SNPs, as well as their heterozygosities and LDstructures -and the H-and L-dependent pdf for z-scores, the objective is to find the model parameters that best predict the distribution of all z-scores.H ranges between 0.05 and 0.5, and the amplitudes of L will vary over a wide range.A useful one-dimensional proxy for L is TLD, which ranges from 1 to 600.Since the model pdf explicitly predicts z-score distributions for particular values of H and L, instead of taking all the SNPs at once, we bin the SNPs with respect to a grid of these quantities; for any given (H,TLD) bin there will be a range of z-scores whose distribution the model it intended to predict.We find that a 5 × 5-grid of equally spaced bins is adequate for converged results.In lieu of or in addition to TLD binning, one can bin SNPs with respect to their total LD block size (total number of SNPs in LD, ranging from 1 to 2,000).
To find the model parameters that best fit the data, for a given (H,TLD) bin we binned the selected SNPs zscores into equally-spaced bins of width dz=0.4 (between z min =−38 and z max =38, allowing for p-values near the numerical limit of 10 −315 ), and from Eq. 17 calculated the probability for z-scores to be in each of those z-score bins (the prior probability for "success" in each z-score bin).Then, knowing the actual numbers of z-scores (numbers of "successes") in each z-score bin, we calculated the multinomial probability, p m , for this outcome.The optimal model prameter values will be those that maximize the accrual of this probability over all (H,TLD) bins.We constructed a cost function by calculating, for a givem (H,TLD) bin, −ln(p m ) and averaging over prunings, and then accumulating this over all (H,TLD) bins.Model parameters minimizing the cost were obtained from Nelder-Mead multidimensional unconstrained nonlinear minimization of the cost function, using the Matlab function fminsearch().β ; and SNP association χ 2 -statistic inflation factor, σ2 0 .ĥ2 is the estimated narrow-sense chip heritability (reexpressed as h 2 l on the liability scale for schizophrenia assuming a prevalence of 1% in adult populations), and ncausal is the estimated number of causal SNPs.nsnp = 11, 015, 833 is the total number of SNPs, whose LD and MAF underlie the model; the GWAS z-scores are for subsets of these SNPs.Though the phenotypes are diverse (examples of a categorical mental disorder, a behavioural phenotype, and a cerebral subregional tissue volume), the model nevertheless provides good fits, even though estimated polygenicities differ by two orders of magnitude and discoverabilities differ by almost three orders of magnitude.Nsamp is the sample size, expressed as N ef f for schizophrenia -see text.Reading the plots: on the vertical axis, choose a p-value threshold (more extreme values are further from the origin), then the horizontal axis gives the proportion of SNPs exceeding that threshold (higher proportions are closer to the origin).

Posterior Effect Sizes
Model posterior effect sizes were calculated using numerical integration over the random variable δ: Here, since z|δ ∼ N (δ, σ 2 0 ), the posterior probability of z given δ is simply P (z) is shorthand for pdf(z|N, H, L; π 1 , σ β , σ 0 ), given by Eq. 17, and, also from Eq. 17, P (δ) is Similarly, which is used in power calculations.

GWAS Power
It is of interest to estimate the proportion of additive phenotypic variance arising from the n snp SNPs under study (the chip heritability (Witte et al., 2014)) that can be explained by SNPs that reach genome-wide significance, p≤5×10 −8 (i.e., for which |z|>z t =5.33) at a given sample size (Pe'er et al., 2008;McCarthy et al., 2008).For a SNP with genotype vector g (over N samples) and heterozygosity H, one has var(Y |g)=var(βg)=2β 2 H and δ= √ N Hβ.Using Eq.21, let C≡E(δ 2 |z, N )P (z, N ), emphasizing dependence on sample size, N .Then the proportion of chip heritability captured additively by genome-wide significant SNPs is The ratio in Eq. 22 should be accurate if the average effects of LD in the numerator and denominator cancel -which will always be true as the ratio approaches 1 for large N .Plotting S(N ; z t ) gives an indication of the power of future GWAS to capture chip heritability.
Quantile-Quantile Plots and Genomic Control One of the advantages of quantile-quantile (QQ) plots (also known as PP plots) is that on a logarithmic scale they emphasize behavior in the tails of a distribution, and provide a valuable visual aid in assessing the independent effects of polygenicity, strength of association, and cryptic relatedness -the roles played by the three model parametersas well as showing how well a model fits data.QQ plots for the model were constructed using Eq.17, replacing the normal pdf with the normal cdf, and replacing z with an equally-spaced vector z nom of length 10,000 covering a wide range of nominal |z| values (0 through 38).SNPs were divided into a 5×5 grid of H×TLD bins, and the cdf vector (with elements corresponding to the z-values in z nom ) accumulated for each such bin (using mean values Prop. of tagged var.

GWAS Power
Putamen Schizophrenia Education Current N Figure 2: Proportion of narrow-sense chip heritability captured by genome-wide significant SNPs as a function of sample size, N .Leftto-right plot order is determined by decreasing σ 2 β .For current sample sizes, the proportions are: putamen volume, 0.064; schizophrenia, 0.096; educational attainment, 0.109.
of H and TLD for SNPs in a given bin).
For a given set of samples and SNPs, the genomic control factor, λ, for the z-scores is defined as the median z 2 divided by the median for the null distribution, 0.455 (Devlin and Roeder, 1999).This can also be calculated from the QQ plot.In the plots we present here, the abscissa gives the -log 10 of the proportion, q, of SNPs whose z-scores exceed the two-tailed significance threshold p, transformed in the ordinate as -log 10 (p).The median is at q med = 0.5, or −log 10 (q med ) ≃ 0.3; the corresponding empirical and model p-value thresholds (p med ) for the z-scores -and equivalently for the z-scores-squared -can be read off from the plots.The genomic inflation factor is then given by λ = [Φ −1 (p med /2)] 2 /0.455.Note that the values of λ reported here are for pruned SNP sets; these values will be lower than for the total GWAS SNP sets.
Knowing the total number, n tot , of p-values involved in a QQ plot (number of GWAS z-scores from pruned SNPs), any point (q, p) (log-transformed) on the plot gives the number, n p = qn tot , of p-values that are as extreme as or more extreme than the chosen p-value.This can be thought of as n p "successes" out of n tot independent trials (thus ignoring LD) from a binomial distribution with prior probability q.To approximate the effects of LD, we estimate the number of independent SNPs as n tot /f where f ≃ 10.The 95% binomial confidence interval for q is calculated as the exact Clopper-Pearson 95% interval (Clopper and Pearson, 1934), which is similar to the normal approximation interval, q ± 1.96 q(1 − q)/n tot /f .

Narrow-sense Chip Heritability
Since we are treating the β coefficients as fixed effects in the simple linear regression GWAS formalism, with the phenotype vector standardized with mean zero and unit variance, the proportion of phenotypic variance explained by a particular causal SNP, q 2 =var(y|g), is given by q 2 = β 2 H.The proportion of phenotypic variance explained additively by all causal SNPs is, by definition, the narrow sense chip heritability, h 2 .Since E(β 2 )=σ 2 β and n causal = π 1 n snp , and taking the mean heterozygosity over causal SNPs to be approximately equal to the mean over all SNPs, H, the chip heritability can be estimated as For all-or-none traits like disease status, the estimated h 2 from Eq. 23 for an ascertained case-control study is on the observed scale and is a function of the prevalence in the adult population, K, and the proportion of cases in the study, P .The heritability on the underlying continuous liability scale (Falconer, 1965), h 2 l , is obtained by adjusting for ascertainment (multiplying by K(1 − K)/(P (1 − P )), the ratio of phenotypic variances in the population and in the study) and rescaling based on prevalence (Dempster and Lerner, 1950;Lee et al., 2011): where a is the height of the standard normal pdf at the truncation point z K defined such that the area under the curve in the region to the right of z K is K.

Phenotypes
Figure 1 shows QQ plots for the z-scores for schizophrenia, educational attainment, and putamen volume, along with model estimates.In all cases, the model fit (yellow) closely tracks the data (dark blue).Figure 6 in Supporting Material shows QQ subplots for a 5 × 5 grid of H × T LD ranges for schizophrenia.The estimated number of causal SNPs is given by the polygenicity, π 1 , times the total number of SNPs, n snp ; the latter is given by the total number of SNPs that went into building the LD structure, L in Eq. 17, i.e., the approximately 11 million SNPs selected from the 1000 Genomes Phase 3 reference panel, not the number of SNPs in the particular GWAS.Thus, for schizophrenia, π 1 = 5.0 × 10 −3 , so that ncausal = 5.5 × 10 4 , not all of which are in linkage equilibrium.Educational attainment has slightly greater polygenicity than schizophrenia, π 1 = 7.7 × 10 −3 .In contrast, for putamen volume π 1 = 2.6 × 10 −5 , so that ncausal = 285.
Note that for logistic linear regression coefficient β, the odds ratio for disease is OR = e β ; for a rare disease, this is approximately equal to the genotypic relative risk: GRR ≃ OR.Since E β 2 = σ 2 β , the mean relative risk E [GRR] ≃ 1 + σ 2 β /2.Thus, for schizophrenia, the mean relative risk is ≃ 1.0000175.
The narrow sense heritability from the ascertained casecontrol schizophrenia GWAS is estimated as h 2 =0.41 (with mean heterozygosity from the ∼11 million SNPs, H = 0.2165).Taking adult population prevalence of schizophrenia to be K=0.01 (Purcell et al., 2009;Whiteford et al., 2013), and given that there are 35,476 cases and 46,839 controls in the study, so that P =0.43, the heritability on the liability scale for schizophrenia from Eq. 24 is h 2 l =0.23; for K=0.005 (Kinney et al., 2009), h 2 l =0.20.For the quantitative endophenotype putamen volume, the heritability is estimated to be 7%, while for educational attainment the heritability is estimated to be 10%.
Figure 2 shows the sample size required to reach a given proportion of chip heritability for the phenotypes (assuming equal numbers of cases and controls for schizophrenia: N ef f = 4/(1/N cases + 1/N controls ), so that when N cases = N controls , N ef f = N cases +N controls = N , the total sample size).At current sample sizes, only 10%, 10%, and 7% of narrow-sense chip heritability is captured for schizophrenia, educational attainment, and putamen volume, respectively.And to capture the preponderance of chip heritability for schizophrenia, for example, a sample with approximately half a million each of cases and controls would be needed.
The estimated total inflation factor for the pruned data, λ, is almost exactly predicted by the model.E.g., for schizophrenia, λ = λ = 1.38, whereas for educational attainment the values are λ = 1.16 and λ = 1.17.Higher polygenicity, π 1 , mean strength of association, σ 2 β , and sample size, N , will all contribute to higher λ.Residual population structure in the form of cryptic relatedness will also contribute to genomic inflation.For schizophrenia, inflation from population structure is estimated to be σ2 0 = 1.179.In contrast, for educational attainment σ2 0 = 1.01, indicating essentially no residual inflation due to population structure.

Simulations
Table 1 shows the simulation results, comparing true and estimated values for the model parameters, heritability, and the number of causal SNPs.In supporting material, Figure 3 shows QQ plots for a randomly chosen β-vector and phenotype instantiation for each of the twelve (π 1 , h 2 ) scenarios.Most of the π1 estimated are in reasonable agreement with the true values, though for π 1 = 10 −5 they are larger by about a factor of two for h 2 equal to 0.4 and 0.7.The number of estimated causals are in correspondingly good agreement with the true values, ranging in increasing powers of 10 from 110 through 110,158.While the estimated polygenicities tend to be slight overestimates, the estimated discoverabilities, σ2 β 's, tend to be under-estimates.From Supporting Material Figure 3, the tails of the QQ plots for the true parameters (dashed dark blue curves), particularly for the larger π 1 's, deviate from the simulated data plots (solid dark blue curves), consistently over-estimating the proportion of SNPs with more extreme z-scores.The model fit, however, bends these curves down toward the data curves.Note that steeper tails have larger σ 2 β 's, and larger π 1 's lead to earlier departure from the null line.In all cases, σ 2 0 is close to 1, indicating no cryptic relatedness.Estimates of heritability, ĥ2 , show a tendency to decrease with increasing π 1 .

DISCUSSION
Building on our previous work and the work of others, here we present the first unified method based on GWAS summary statistics, incorporating detailed LD structure from a reference panel, for directly estimating phenotypic polygenicity, π 1 , "SNP discoverability" or strength of association (specifically, the variance of the underlying causal effects), σ 2 β , and residual inflation of the association statistics due to variance distortion induced by cryptic relatedness, σ 2 0 .We apply the model to three diverse phenotypes, one qualitative and two quantitative: schizophrenia, educational attainment, putamen volume.In each case, we estimate the polygenicity, discoverability, and residual inflation due to variance distortion; we also estimate the number of causal SNPs, n causal , and the SNP heritability, h 2 (for schizophrenia, we reexpress this as the proportion of population variance in disease liability, h 2 l , under a liability threshold model, adjusted for ascertainment).In addition, we estimate the proportion of SNP heritability captured by genome-wide significant SNPs at current sample sizes, and predict future sample sizes needed to explain the preponderance of SNP heritability.
We find that schizophrenia is highly polygenic, with π 1 = 5 × 10 −3 .This leads to an estimate of n causal ≃ 55, 000, which is in scale-agreement with a recent estimate that the number of causals is >20,000 (Loh et al., 2015).The SNP associations, however, are characterized by a narrow distribution, σ 2 β = 3.5×10 −5 , indicating that most associations are of week effect, i.e., have low discoverability.
For educational attainment (Rietveld et al., 2013;Okbay et al., 2016;Cesarini and Visscher, 2017), the polygenicity is somewhat greater, π 1 = 7.7 × 10 −3 , leading to an estimate of n causal ≃ 85, 000, which also is in scaleagreement with a recent estimate of the number of loci contributing to heritability of ≃ 70, 000 (Rietveld et al., 2013).The variance of the distribution for causal effect sizes is an order of magnitude smaller than for schizophrenia, σ 2 β = 5 × 10 −6 , indicating lower discoverability.In marked contrast is putamen volume, which has very low polygenicity: π 1 = 2.6 × 10 −5 , so that only 285 SNPs (out of ∼11 million) are estimated to be causal.However, these SNPs are characterized by high discoverability, twoorders of magnitude larger than for schizophrenia: The QQ plots (which are sample size dependent) reflect these differences in genetic architecture.For example, the early departure of the schizophrenia QQ plot from the null line indicates its high polygenicity, while the steep rise for putamen volume after its departure corresponds to its high SNP discoverability.
Despite the much stronger effects in putamen volume, the very high polygenicity for schizophrenia leads to its being more than three times as heritable.Our point estimate for liability-scale heritability of schizophrenia is h 2 l = 0.23 (assuming a population risk of 0.01), and that 10% of this (i.e., 2.3% of overall disease liability) is explainable based on common SNPs reaching genome-wide significance at the current sample size.This h 2 l estimate is in good agreement with a recent result, h 2 l = 0.27 (Loh et al., 2015;Golan et al., 2014), also calculated from the PGC2 data set but using raw genotype data for 472,178 markers for a subset of 22,177 schizophrenia cases and 27,629 controls of European ancestry; and with an earlier result of h 2 l = 0.23 from PGC1 raw genotype data for 915,354 markers for 9,087 schizophrenia cases and 12,171 controls (Lee et al., 2012;Yang et al., 2011a).Our estimate of 2.3% of overall variation on the liability scale for schizophrenia explainable by genome-wide significant loci is a little lower than the corresponding estimate of 3.4% based on risk profile scores (RPS) (Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2014).Nevertheless, these results show that current sample sizes need to increase substantially in order for RPSs to have predictive utility, as the vast majority of associated SNPs remain undiscovered.Our power estimates indicate that ∼500,000 cases and an equal number of controls would be needed to identify these SNPs (note thst there is a total of approximately 3 million cases in the US alone).The identified SNPs then need to be mapped to genes and their modality (e.g., regulatory or functional effects) determined, so that targeted therapeutics can be developed (Schubert et al., 2015).Greater power for discovery is achievable by using prior information involving SNP functional categories (Schork et al., 2013;Andreassen et al., 2013;Sveinbjornsson et al., 2016).However, it is not yet clear how significant a role genomics can play in psychiatric precision medicine (Breen et al., 2016).Noteworthy in this respect is that estimates of broad-sense heritability of schizophrenia from twin and family studies are in the range 0.6-0.8(Sullivan et al., 2003;Lichtenstein et al., 2009), considerably higher than the narrow-sense chip heritability estimates from GWAS.Additionally, schizophrenia is considered a spectrum disorder with multiple phenotypic dimensions and diverse clinical presentation (MacDonald and Schulz, 2009;Peralta and Cuesta, 2001); GWAS might therefore benefit from considering continuous phenotypes rather than dichotomous variables in such situations (Edwards et al., 2016).More specifically in the context of the present model, if a nominally categorical phenotype can be decomposed into more than one subcategory, there is potential for enhanced power for discovery.The heritability estimated in a binary case-control design would be an average over heritabilities for the case subcategories.If those heritabilities are similar, then, since the union of the subcategory polygenicities gives the total polygenicity over all cases, the σ 2 β for any subcategory will be larger by a factor equal to the ratio of overall polygenicity to the subcategory polygenicity, and the corresponding power curve (as in Figure 2) will shift to the left.
For educational attainment, we estimate SNP heritability h 2 = 0.10, in good agreement with the estimate of 11.5% given in (Okbay et al., 2016).As with schziophrenia, this is substantially less than the estimate of heritability from twin and family studies of ≃ 40% of the variance in educational attainment explained by genetic factors (Branigan et al., 2013;Rietveld et al., 2013).
For putamen volume, we estimate the SNP heritability h 2 = 0.07, in reasonable agreement with an earlier estimate of 0.1 for the same overall data set (Hibar et al., 2015;So et al., 2011).
To assess the validity of the model, we conduct extensive simulations over a wide range of polygenicities and heritabilities for simulated quantitative traits, using the full set of SNPs used in the phenotype analyses with realistic LD structure.The simulations in general validate the model: with the true number of causals ranging over three orders of magnitude, 10 2 -10 5 (while heritabilities range from 0.1 to 0.7), the estimated number of causals in each case is in reasonable agreement with the corresponding true value.Similarly, the true σ 2 β 's range over four orders of magnitued, and the estimated values are generally well within a factor of two of the corresponding true value.It should be noted that for all simulations, σ 2 0 is close to 1.0 (indicating no variance distortion, and hence no inflation due to cryptic relatedness, as expected in HapGen), though there is a trend toward larger values for higer heritability and polygenicity.Thus, the higher inflation found for schizophrenia is unlikely to be an artifact of the model.The simulation QQ model plots in general agree with the simulation QQ data plots, though there is an overestimation of the proportion of more extreme z-scores, particularly at very high polygenicities.This might be an artifact of using the computationally simpler but less accurate Eq. 17 instead of Eq. 14, which is currently a limitation in the implementation of the model.A Monte Carlo approach to calculating the pdf in Eq. 14 might lead to more accurate QQ model plots.

CONCLUSION
The SNP-level causal effects model we have presented is based on GWAS summary statistics and detailed LD structure, and assumes a Gaussian distribution of effect sizes at a fraction of SNPs randomly distributed across the autosomal genome.We have shown that it captures the broad genetic architecture of diverse complex traits, where polygenicities and the variance of the effect sizes range over orders of magnitude.In addition, the model provides a roadmap for discovery in future GWAS.The model was not designed to handle situations where the reversal of short sections of DNA underlies SNP association, as appears to be the case for some phenotypes, e.g, in chromosome 8p for neuroticism (Lo et al., 2016).Future extensions and refinements include modeling specific polygenicities and effect size variances for different SNP functional annotation categories (Schork et al., 2013;Andreassen et al., 2013;Sveinbjornsson et al., 2016), possible modified pdf for non-Gauassian distribution of effects at the tails of the z-score distributions, examining individual chromosomes and possible allele frequency dependencies in different phenotypes, and extension to pleiotropic analyses.Higher accuracy in characterizing causal alleles in turn will enable greater power for SNP discovery.Figure 3: Quantile-quantile plots for simulations.True polygenicity is specified for each row, and true heritability is specified for each column.QQ-plots for simulated data in dark blue, with 95% confidence interval in light blue; model prediction in yellow.The dashed blue curve is the QQ plot corresponding to the true parameters.λ ≡ λdata and λmodel are the overall nominal genomic control factors calculated from the plots.The three estimated model parameters are: polygenicity, π1 ; discoverability, σ2 β ; and SNP association χ 2 -statistic inflation factor, σ2 0 .ĥ2 is the estimated narrow-sense chip heritability, and ncausal is the estimated number of causal SNPs.The dotted black line is the expected plot under null.ĥ2 c is the same as ĥ2 but with H calculated from the known causal SNPs (instead of from all SNPs).Reading the plots: on the vertical axis, choose a p-value threshold (more extreme values are further from the origin), then the horizontal axis gives the proportion of SNPs exceeding that threshold (higher proportions are closer to the origin).

Figure 4 :Figure 5 :
Figure4: (A) Mean value of heterozygosity for given total LD (SNPs were binned based on TLD and the mean TLD for each bin plotted on the x-axis; the corresponding mean heterozygosity for SNPs in each bin was then plotted on the y-axis).(B) Mean value of total LD for given heterozygosity.Plots made for SNPs in the PGC2 schizophrenia GWAS; TLD and H calculated from 1000 Genomes phase 3 reference panel.

Figure 6 :
Figure6: QQ plots for schizophrenia, for a 5X5 grid of total LD X heterozygosity.n is the number of SNPs in each plot.H and T LD are the mean values in each plot.λdata and λmodel are the genomic control values calculated from the QQ plots for the data and the model, respectively.