Recalibrating the Genome-wide Association Null Hypothesis 1 Enables the Detection of Core Genes Underlying Complex 2 Traits

12 Traditional univariate genome-wide association studies generate false positives and negatives due to 13 diﬃculties distinguishing causal variants from “interactive” variants (i.e., variants correlated with causal 14 variants without directly inﬂuencing the trait). Recent eﬀorts have been directed at identifying gene or 15 pathway associations, but these can be computationally costly and hampered by strict model assumptions. 16 Here, we present gene-ε , a new approach for identifying statistical associations between sets of variants 17 and quantitative traits. Our key innovation is a recalibration of the genome-wide null hypothesis to treat 18 small-yet-nonzero associations emitted by interactive variants as non-causal. gene-ε eﬃciently identiﬁes 19 core genes under a variety of simulated genetic architectures, achieving greater than a 90% true positive 20 rate at 1% false positive rate for polygenic traits. Lastly, we apply gene-ε to summary statistics derived 21 from six quantitative traits using European-ancestry individuals in the UK Biobank, and identify genes 22 that are enriched in biological relevant pathways. 23


Introduction
Over the last decade, there has been considerable debate surrounding whether genome-wide singlenucleotide polymorphism (SNP) genotype data can offer insight into the genetic architecture of complex traits [1][2][3][4][5].In the traditional genome-wide association (GWA) framework, individual SNPs are tested independently for association with a trait of interest.While this approach has drawbacks [2,3,6], more recent approaches that combine SNPs within a region have gained power to detect biologically relevant genes and pathways enriched for correlations with complex traits [7][8][9][10][11][12][13][14].Reconciling these two observations is crucial for biomedical genomics.
In the traditional GWA model, each SNP is assumed to either (i ) directly influence (or perfectly tag a variant that directly influences) the trait of interest; or (ii ) have no affect on the trait at all (see Fig. 1a).Throughout this manuscript, for simplicity, we refer to SNPs under the former as "causal" and those under latter as "non-causal".These classifications are based on ordinary least squares (OLS) effect size estimates β j for each j-th SNP in a regression framework, where the null hypothesis assumes no association for the true effects of non-causal SNPs (H 0 : β j = 0).The traditional GWA model is agnostic to trait architecture, and is underpowered with a high false-positive rate for "polygenic" traits or traits which are generated by many mutations of small effect [5,[15][16][17].
Suppose that in truth each SNP in a GWA dataset instead belongs to one of three categories: (i ) causal or truly associated; (ii ) statistically associated with the trait but not directly influencing it, which we refer to as "interactive"; and (iii ) non-causal (Fig. 1b) [18].Equivalently, these three categories can be defined using the relationship between the true SNP effects β and the observed effect size estimates β: (i ) for causal SNPs, the true effects β = 0 and the observed effect size estimates β = 0; (ii ) for "interactive" SNPs, β = 0 or β ≈ 0, and β = 0; and (iii ) for non-causal SNPs, β = 0 and β = 0.More specifically, causal SNPs lie in core genes that directly influence the trait of interest.Interactive SNPs then follow one of two definitions.They are either non-causal SNPs that are statistically associated with the trait due to some varying degree of linkage disequilibrium (LD) with causal SNPs (β = 0); or they are SNPs that are not directly causal, but are located within a gene that covaries with a core gene (β ≈ 0) [19].
In either setting, interactive SNPs will emit intermediate statistical noise (in some cases, even appearing indistinguishable from causal SNPs), thereby confounding traditional GWA tests (Fig. 1b).Hereafter, we refer to this noise as "epsilon-genic effects" (denoted in shorthand as "ε-genic effects").There is a need for a computational framework that has the ability to identify mutations associated with a wide range of traits, regardless of whether heritability is sparsely or uniformly distributed across the genome.
Here, we develop a new and scalable quantitative approach for testing aggregated sets of SNP-level GWA summary statistics.In practice, our approach can be applied to any user-specified set of genomic regions, such as regulatory elements, intergenic regions, or gene sets.In this study, for simplicity, we refer to our method as a gene-level test (i.e., an annotated collection of SNPs within the boundary of a gene).
The key contribution for our approaches is that gene-level association tests should treat interactive SNPs with ε-genic effects as non-causal.Conceptually, this requires assessing whether a given SNP explains more than an "epsilon" proportion of narrow-sense heritability, h 2 .In this generalized model, we assume that true SNP-level effects for complex traits are jointly distributed as where Ω = H 1/2 ΣH 1/2 is a J ×J variance-covariance matrix, and H = diag(σ 2 1 , . . ., σ 2 J ) is a J ×J diagonal matrix with elements σ 2 j denoting the proportion of narrow-sense heritability h 2 contributed by the j-th SNP (with the constraint σ 2 j = h 2 ).Under our framework, we recalibrate the GWA null hypothesis to assume approximately no association for interactive and non-causal SNPs, H 0 : β j ≈ 0. This can be formally restated using the variance-component-based null hypothesis, H 0 : σ 2 j ≤ σ 2 ε , where σ 2 ε represents the maximum proportion-of-variance-explained (PVE) by an interactive or non-causal SNP (Fig. 1b).
Non-core genes are then defined as genes that only contain SNPs with ε-genic effects (i.e.0 ≤ σ 2 j ≤ σ 2 ε for every j-th SNP within that region).Core genes, on the other hand, are genes that contain at least one causal SNP (i.e.σ 2 j > σ 2 ε for at least one SNP j within that region).By modeling ε-genic effects (i.e., different values of σ 2 ε for interactive SNPs), our approach flexibly constructs an appropriate null hypothesis for a wide range of traits with genetic architectures that land anywhere on the polygenic spectrum (see Methods).
We refer to our gene-level association framework as "gene-ε" (pronounced "genie").gene-ε lowers false positive rates and increases power for identifying gene-level associations from GWA studies via two key conceptual insights.First, gene-ε regularizes observed (and inflated) GWA summary statistics so that SNP-level effect size estimates are positively correlated with the assumed generative model of complex traits.Second, it examines the distribution of regularized effect sizes to determine the ε-genic threshold σ 2 ε of interactive and non-causal SNPs.This makes for an improved hypothesis testing strategy for identifying core genes associated with complex traits.With detailed simulations, we assess the power of gene-ε to identify core genes under a variety of genetic architectures, and compare its performance against multiple competing approaches [7,10,12,14,20].We also apply gene-ε to the SNP-level summary statistics of six quantitative traits assayed in individuals of European ancestry from the UK Biobank [21].

Overview of gene-ε
The gene-ε framework requires two inputs: GWA SNP-level effect size estimates, and an empirical linkage disequilibrium (LD, or variance-covariance) matrix.The LD matrix can be estimated directly from genotype data, or from an ancestry-matched set of samples if genotype data are not available to the user.We use these inputs to both estimate gene-level contributions to narrow-sense heritability h 2 , and perform gene-level association tests.After preparing the input data, there are three steps implemented in gene-ε, which are detailed below (Fig. 2).
First, we shrink the GWA effect size estimators via regularized regression (Figs.2a,b; Eq. ( 5) in Methods).This shrinkage step reduces the inflation of effect sizes for interactive SNPs [22], and increases their correlation with the assumed generative model for the trait of interest (particularly for traits with high heritability; Supplementary Fig. 1).When assessing the performance of gene-ε in simulations, we considered different types of regularization for the effect size estimates: the Least Absolute Shrinkage And Selection Operator (gene-ε-LASSO) [23], the Elastic Net solution (gene-ε-EN) [24], and Ridge Regression (gene-ε-RR) [25].We also assessed our framework using the observed ordinary least squares (OLS) estimates without any shrinkage (gene-ε-OLS) to serve as motivation for having regularization as a step in the framework.
Second, we fit a K-mixture Gaussian model to the regularized effect sizes with the goal of classifying SNPs as causal, intermediate, or non-causal (Fig. 2c).Each successive Gaussian mixture component has distinctly smaller variances (σ 2 1 > • • • > σ 2 K ) with the K-th component fixed at σ 2 K = 0.In this study, we assume causal SNPs to appear in the first mixture component with the largest variance σ 2 1 , while non-causal SNPs appear in the last component σ 2 K .By definition, interactive SNPs with ε-genic effects then have PVEs that fall at or below the variance of the second component (i.e., σ 2 ε = σ 2 2 and H 0 : σ 2 j ≤ σ 2 2 for the j-th SNP).Indeed, while core genes may contain important variants that appear in intermediate mixture components, the biological definitions of these components may not be consistent across trait architectures.As a result, for results shown here, we take a conservative approach in our definition of causal SNPs within core genes.Fig. 1b illustrates this concept with a three-component mixture model (see also [18]).Intuitively, the intermediate mixture components may represent varying degrees of connectivity to core genes via LD or trans-interactions.Thus, gene-ε allows for flexibility in the number of Gaussians that specify the range of null and non-null SNP effects.For scalability, we estimate parameters of the K-mixture model using an expectation-maximization (EM) algorithm.
Third, we group the regularized GWA summary statistics according to gene boundaries (or userspecified regions) and compute a gene-level association statistic based on a commonly used quadratic form (Fig. 2d) [7,12,20].In expectation, these test statistics can be naturally interpreted as the contribution of each gene to the narrow-sense heritability.We use Imhof's method [26] to derive a P -value for assessing evidence in support of an association between a given gene and the trait of interest.Details for each of these steps can be found in Methods and Supplementary Notes.

Performance Comparisons in Simulation Studies
To assess the performance of gene-ε, we simulated complex traits under multiple genetic architectures using real genotype data on chromosome 1 from individuals of European ancestry in the UK Biobank (Methods).Following quality control procedures, our simulations included 36,518 SNPs (Supplementary Notes).Next, we used the NCBI's Reference Sequence (RefSeq) database in the UCSC Genome Browser [27] to annotate SNPs with the appropriate genes.Simulations were conducted using two different SNPto-gene assignments.In the first, we directly used the UCSC annotations which resulted in 1,408 genes to be used in the simulation study.In the second, we augmented the UCSC gene boundaries to include SNPs within ±50kb, which resulted in 1,916 genes in the simulation study.For both cases, we assumed a linear additive model for quantitative traits, while varying the following parameters: sample size (N = 5, 000 or 10,000); narrow-sense heritability (h 2 = 0.2 or 0.6); and the percentage of core genes (set to 1% or 10%).In each scenario, we considered traits being generated with and without additional population structure.In the latter setting, traits are simulated while also using the top ten principal components of the genotype matrix as covariates to create stratification.Regardless of the setting, GWA summary statistics were computed by fitting a single-SNP univariate linear model (via OLS) without any control for population structure.Comparisons were based on 100 different simulated runs for each parameter combination.
We compared the performance of gene-ε against that of five competing gene-level association or enrichment methods: SKAT [20], VEGAS [7], MAGMA [10], PEGASUS [12], and RSS [14] (Supplementary Notes).As previously noted, we also explored the performance of gene-ε while using various degrees of regularization on effect size estimates, with gene-ε-OLS being treated as a baseline.SKAT, VEGAS, and PEGASUS are frequentist approaches, in which SNP-level GWA P -values are drawn from a correlated chi-squared distribution with covariance estimated using an empirical LD matrix [28].MAGMA is also a frequentist approach in which gene-level p-values are derived from distributions of SNP-level effect sizes using an F -test [10].RSS is a Bayesian model-based enrichment method which places a likelihood on the observed SNP-level GWA effect sizes (using their standard errors and LD estimates), and assumes a spike-and-slab shrinkage prior on the true SNP effects [29].Conceptually, SKAT, MAGMA, VEGAS, and PEGASUS assume null models under the traditional GWA framework, while RSS and gene-ε allow for traits to have architectures with interactive SNPs.
For all methods, we assess the power and false discovery rates (FDR) for identifying core genes at a Bonferroni-corrected threshold (P = 0.05/1, 408 genes = 3.55 × 10 −5 and P = 0.05/1, 916 genes = 2.61×10 −5 , depending on if the ±50kb buffer was used) or median probability model (posterior enrichment probability > 0.5; see [30]) (Supplementary Tables 1-16).We also compare their ability to rank true positives over false positives via receiver operating characteristic (ROC) and precision-recall curves (Fig. 3 and Supplementary Figs.2-16).While we find gene-ε and RSS have the best tradeoff between true and false positive rates, RSS does not scale well for genome-wide analyses (Supplementary Table 17).In many settings, gene-ε has similar power to RSS (while maintaining a considerably lower FDR), and generally outperforms RSS in precision-versus-recall. gene-ε also stands out as the best approach in scenarios where the observed OLS summary statistics were produced without first controlling for confounding stratification effects in more heritable traits (i.e., h 2 = 0.6).Computationally, gene-ε gains speed by directly assessing evidence for rejecting the recalibrated GWA null hypothesis, whereas RSS must compute the posterior probability of being a core gene (which can suffer from convergence issues; Supplementary Notes).For context, an analysis of just 1,000 genes takes gene-ε an average of 140 seconds to run on a personal laptop, while RSS takes around 9,400 seconds to complete.
When using GWA summary statistics to identify genotype-phenotype associations, modeling the appropriate trait architecture is crucial.As expected, all methods we compared in this study have relatively more power for traits with high h 2 .However, our simulation studies confirm the expectation that the max utility for methods assuming the traditional GWA framework (i.e., SKAT, MAGMA, VEGAS, and PEGASUS) is limited to scenarios where heritability is low, phenotypic variance is dominated by just a few core genes with large effects, and summary statistics are not confounded by population structure (Supplementary Figs. 2, 3, 9, and 10).RSS, gene-ε-EN, and gene-ε-LASSO robustly outperform these methods for the other trait architectures (Fig. 3 and Supplementary Figs.4-8 and 11-16).One major reason for this result is that shrinkage and penalized regression methods appropriately correct for inflation in GWA summary statistics (Supplementary Fig. 1).For example, we find that the regularization used by gene-ε-EN and gene-ε-LASSO is able to recover effect size estimators that are almost perfectly correlated (r 2 > 0.9) with the true effect sizes used to simulate sparse architectures (e.g., simulations with 1% core genes).Regularization also allows gene-ε to preserve type 1 error when traits are generated under the null hypothesis.Importantly, our method is relatively conservative when GWA summary statistics are less precise and derived from studies with smaller sample sizes (e.g., N = 5,000; Supplementary Table 18).
Characterizing Genetic Architecture of Quantitative Traits in the UK Biobank We applied gene-ε to 1,070,306 genome-wide SNPs and six quantitative traits -height, body mass index (BMI), mean red blood cell volume (MCV), mean platelet volume (MPV), platelet count (PLC), waist-hip ratio (WHR) -assayed in 349,414 European-ancestry individuals in the UK Biobank (Supplementary Notes) [21].After quality control, we regressed the top ten principal components of the genotype data onto each trait to control for population structure, and then we derived OLS SNP-level effect sizes using the traditional GWA framework.For completeness, we then analyzed these GWA effect size estimates with the four different implementations of gene-ε.In the main text, we highlight results under the Elastic Net solution; detailed findings with the other gene-ε approaches can be found in Supplementary Notes.
Compared to body height, narrow-sense heritability estimates for BMI have been considered both high and low (estimates ranging from 25-60%; [31,33,34,36,37,39,[42][43][44][45]).Such inconsistency is likely due to difference in study design (e.g., twin, family, population-based studies), many of which have been known to produce different levels of bias [44].Here, our results suggest BMI to have a lower narrow-sense heritability than height, with a slightly different distribution of null and non-null SNP effects.Specifically, we found BMI to have 13% associated SNPs and 6% causal SNPs.
In general, we found our genetic architecture characterizations in the UK Biobank to reflect the same general themes we saw in the simulation study.Less aggressive shrinkage approaches (e.g., OLS and Ridge) are subject to misclassifications of causal, interactive, and non-causal SNPs.As result, these methods struggle to reproduce well-known narrow-sense heritability estimates from the literature, across all six traits.This once again highlights the need for computational frameworks that are able to appropriately correct for inflation in summary statistics.

gene-ε Identifies New Core Genes and Novel Genetic Associations
Next, we applied gene-ε to the summary statistics from the UK Biobank and generated genome-wide genelevel association P -values (panels a,b of Fig. 4 and Supplementary Figs.[17][18][19][20][21].As in the simulation study, we conducted two separate analyses using two different SNP-to-gene annotations: (i) we used the RefSeq database gene boundary definitions directly, or (b) we augmented the gene boundaries by adding SNPs within a ±50 kilobase (kb) buffer to account for possible regulatory elements.A total of 14,322 genes were analyzed when using the UCSC boundaries as defined, and a total of 17,680 genes were analyzed when including the 50kb buffer.The ultimate objective of gene-ε is to identify core genes, which we define as containing at least one causal SNP and achieving a gene-level association P -value below a Bonferroni-corrected significance threshold (in our two analyses, P = 0.05/14, 322 genes = 3.49 × 10 −6 and P = 0.05/17, 680 genes 2.83 × 10 −6 , respectively; Supplementary Tables 20-25).As a validation step, we compared gene-ε P -values to RSS posterior enrichment probabilities for each gene.We also used the gene set enrichment analysis tool Enrichr [46] to identify dbGaP categories with an overrepresentation of core genes reported by gene-ε (panels c,d of Fig. 4 and Supplementary Figs.[17][18][19][20][21].A comparison of gene-level associations and gene set enrichments between the different gene-ε approaches are also listed (Supplementary Tables 26-28).
Many of the candidate core genes we identified by applying gene-ε were not previously annotated as having trait-specific associations in either dbGaP or the GWAS catalog (Fig. 4); however, many of these same candidate core genes have been identified by past publications as related to the phenotype of interest (Table 1).Additionally, 45% of the core genes selected by gene-ε were also selected by RSS.For example, gene-ε reports C1orf150 as having a significant gene-level association with MPV (P = 1 × 10 −20 and RSS posterior enrichment probability of 1), which is known to be associated with germinal center signaling and the differentiation of mature B cells that mutually activate platelets [47][48][49].Importantly, nearly all of the core genes reported by gene-ε had evidence of overrepresentation in gene set categories that were at least related to the trait of interest.As expected, the top categories with Enrichr Q-values smaller than 0.05 for height and MPV were "Body Height" and "Platelet Count", respectively.Even for the less heritable MCV, the top significant gene sets included hematological categories such as "Transferrin", "Erythrocyte Indices", "Hematocrit", "Narcolepsy", and "Iron" -all of which have verified and clinically relevant connections to trait [50][51][52][53][54][55][56][57].
Lastly, gene-ε also identified genes with rare causal variants.For example, ZNF628 (which is not mapped to height in the GWAS catalog) was detected by gene-ε with a significant P -value of 1 × 10 −20 (and P = 4.58 × 10 −8 when the gene annotation included a 50kb buffer).Previous studies have shown a rare variant rs147110934 within this gene to significantly affect adult height [38].Rare and low-frequency variants are generally harder to detect under the traditional GWA framework.However, rare variants have been shown to be important for explaining the variation of complex traits [28,39,[58][59][60][61].With regularization and testing for ε-genic effects, gene-ε is able to distinguish between rare variants that are causal and SNPs with larger effect sizes due various types of correlations.This only enhances the power of gene-ε to identify potential novel core genes.

Discussion
During the past decade, it has been repeatedly observed that the traditional GWA framework struggles to accurately differentiate between truly associated (which we refer to as "causal" for simplicity) and interactive SNPs (which we define as SNPs that covary with causal SNPs but do not directly influence the trait of interest).As a result, the traditional GWA approach is prone to generating false positives, and detects variant-level associations spread widely across the genome rather than aggregated sets in disease-relevant pathways [4].While this observation has spurred to many interesting lines of inquiry -such as investigating the role of rare variants in generating complex traits [9,28,58,59], comparing the efficacy of tagging causal variants in different ancestries [62,63], and integrating GWA data with functional -omics data [64][65][66] -the focus of GWA studies and studies integrating GWA data with other -omics data is still largely based on the role of individual variants, acting independently.
Here, we challenge the view that signals in GWA datasets cannot reveal the architecture of complex traits by recalibrating the traditional GWA null hypothesis from H 0 : β j = 0 (i.e., the j-th SNP has zero statistical association with the trait of interest) to H 0 : β j ≈ 0. We accomplish this by testing for ε-genic effects: small-yet-nonzero effect sizes emitted by interactive SNPs.We use an empirical Bayesian approach to learn the distributions of ε-genic effects, and then we aggregate regularized SNPlevel association signals into a gene-level test statistic that represents the gene's contribution to the narrow-sense heritability of the trait of interest.Together, these two steps reduce false positives and increase power to identify the mutations, genes, and pathways that directly influence a trait's genetic architecture.By considering different levels of ε-genic effects (i.e., different values of σ 2 ε for interactive SNPs; Figs. 1 and 2), gene-ε offers the flexibility to construct an appropriate null hypothesis for a wide range of traits with genetic architectures that land anywhere on the polygenic spectrum.It is important to stress that while we repeatedly point to our improved ability distinguish "causal" variants in core genes, gene-ε is by no means a causal inference procedure.Instead, it is an association test which highlights genes in enriched pathways that are most likely to be associated with the trait of interest.
Through simulations, we showed the gene-ε framework outperforms other widely used gene-level association methods (particularly for highly heritable traits), while also maintaining scalability for genomewide analyses (Fig. 3, Supplementary Figs.2-16, and Supplementary Tables 1-18).Indeed, all the approaches we compared in this study showed improved performance when they used summary statistics derived from studies with larger sample sizes (i.e., simulations with N = 10, 000).This is because the quality of summary statistics also improves in these settings (via the asymptotic properties of OLS estimators).Nonetheless, our results suggest that applying gene-ε to summary statistics from previously published studies will increase the return made on investments in GWA studies over the last decade.
Like any aggregated SNP-set association method, gene-ε has its limitations.Perhaps the most obvious limitation is that annotations can bias the interpretation of results and lead to erroneous scientific conclusions (i.e., cause us to highlight the "wrong" gene [14,67,68]).We observed some instances of this during the UK Biobank analyses.For example, when studying MPV, CAPN10 only appeared to be a core gene after its UCSC annotated boundary was augmented by a ±50kb buffer window (P = 1.85×10 −1 and P = 1.17 × 10 −7 before and after the buffer was added, respectively; Supplementary Table 23).After further investigation, this result occurred because the augmented definition of CAPN10 included nearly all causal SNPs from the significant neighboring gene RNPEPL1 (P = 1 × 10 −20 and P = 2.07 × 10 −9 before and after the buffer window was added, respectively).While this shows the need for careful biological interpretation of the results, it also highlights the power of gene-ε to prioritize true genetic signal effectively.
There are several potential extensions for the gene-ε framework described here.First, in the current study, we only focused on applying gene-ε to quantitative traits (Fig. 4 and Supplementary Figs.[17][18][19][20][21]. Future studies extending this approach to binary traits (e.g.case-control studies) should explore controlling for additional confounders that can occur within these phenotypes, such as ascertainment [69][70][71].Second, we only focus on data consisting of common variants; however, it would be interesting to extend gene-ε for (i) rare variant association testing and (ii) studies that consider the combined effect between rare and common variants.A significant challenge, in either case, would be to adaptively adjust the strength of the regularization penalty on the observed OLS summary statistics for causal rare variants, so as to not misclassify them as interactive SNPs.Previous approaches with specific re-weighting functions for rare variants may help here [9,28,58] (Methods).A final related extension of gene-ε is to include information about standard errors when estimating ε-genic effects.In our analyses using the UK Biobank, some of the newly identified core genes contained SNPs that had large effect sizes but insignificant P -values in the original GWA analysis (after Bonferroni-correction; Supplementary Tables 20-25).While this could be attributed to the improved and recalibrated hypothesis test in gene-ε, it also motivates a regularization model that accounts for the standard error of effect size estimates from GWA studies [14,22,29].

Methods
Traditional Association Tests using Summary Statistics gene-ε requires two inputs: genome-wide association (GWA) marginal effect size estimates β, and an empirical linkage disequilibrium (LD) matrix Σ.We assumed the following generative linear model for complex traits where y denotes an N -dimensional vector of phenotypic states for a quantitative trait of interest measured in N individuals; X is an N × J matrix of genotypes, with J denoting the number of single nucleotide polymorphisms (SNPs) encoded as {0, 1, 2} copies of a reference allele at each locus; β is a J-dimensional vector containing the additive effect sizes for an additional copy of the reference allele at each locus on y; e is a normally distributed error term with mean zero and scaled variance τ 2 ; and I is an N × N identity matrix.For convenience, we assumed that the genotype matrix (column-wise) and trait of interest have been mean-centered and standardized.We also treat β as a fixed effect.A central step in GWA studies is to infer β for each SNP, given both genotypic and phenotypic measurements for each individual sample.
For every SNP j, gene-ε takes in the ordinary least squares (OLS) estimates based on Eq. ( 2) where x j is the j-th column of the genotype matrix X, and β j is the j-th entry of the vector β.
In traditional GWA studies, the null hypothesis for statistical tests assumes H 0 : β j = 0 for all j = 1, . . ., J SNPs.When both the trait and SNP measurements are standardized, it can be shown that the correlation between any two regression coefficients β j and β l (j = l) is proportional to the correlation between genotypic variants x j and x l .Therefore, under the null hypothesis, β is assumed to follow a multivariate normal distribution with mean vector zero and a correlation structure defined by the pairwise SNP-by-SNP LD matrix.For the applications considered here, the LD matrix is empirically estimated from external data (e.g., directly from GWA study data, or using an LD map from a population with similar genomic ancestry to that of the samples analyzed in the GWA study).

Regularization of GWA Effect Size Estimators
gene-ε uses regularization on the observed GWA summary statistics to reduce inflation of SNP-level effect size estimates and increase their correlation with the assumed generative model of complex traits.
For large sample size N , note that the asymptotic relationship between the observed GWA effect size estimates β and the true coefficient values β is where Σ jl = ρ(x j , x l ) denotes the correlation coefficient between SNPs x j and x l .The above mirrors a high-dimensional regression model with the misestimated OLS summary statistics as the response variables and the LD matrix as the design matrix.Theoretically, the resulting output coefficients from this model are the desired effect size estimators.Due to the multi-collinear structure of GWA data, we cannot reuse the ordinary least squares solution reliably [72].Thus, we derive the general regularization where, in addition to previous notation, the solution β is used to denote the regularized solution of the observed GWA effect sizes β; and • 1 and • 2 2 denote L 1 and L 2 penalties, respectively.The term α 10 distinguishes the type of regularization used, and can be chosen to induce various degrees of shrinkage on the effect size estimators.Specifically, α = 0 corresponds to the "Least Absolute Shrinkage and Selection Operator" or LASSO solution [23], α = 1 equates to Ridge Regression [25], while 0 < α < 1 results in the Elastic Net [24].The LASSO solution forces some inflated coefficients to be zero; while the Ridge shrinks the magnitudes of all coefficients but does not set any of them to be exactly zero.Intuitively, the LASSO will create a regularized set of effect sizes where causal SNPs have larger effects, interactive SNPs have small-yet-nonzero effects, and non-causal SNPs are set zero.It has been suggested that the L 1 -penalty can suffer from a lack of stability [73].Therefore, in the main text, we also highlighted gene-ε using the Elastic Net (with α = 0.5).The Elastic Net is a convex combination of the LASSO and Ridge penalties, but still results in distinguishable sets of causal, interactive, and non-causal SNPs.Results comparing each of the gene-ε regularization implementations are given in the main text (Fig. 3) and Supplementary Notes (Supplementary Figs.2-16 and Supplementary Tables 1-19 and 26-28).

Recalibration of the GWA Null Hypothesis
The gene-ε framework assumes that true SNP-level effects for complex traits are jointly distributed as where diagonal matrix with elements σ 2 j denoting the proportion of narrow-sense heritability h 2 contributed by the j-th SNP (with the constraint σ 2 j = h 2 ).The main innovation of gene-ε is to treat interactive SNPs with ε-genic effects as non-causal.This leads to the recalibrated GWA null hypothesis that assumes every SNP makes a negligible contribution to the genotypic variation of the trait of interest.Formally, we write H 0 : β j ≈ 0 and use Eq. ( 6) to derive the (limiting) joint null distribution as where ε ).We refer to σ 2 ε as the "ε-genic threshold".
Intuitively, σ 2 ε represents the maximum proportion of variance explained (PVE) by an interactive or noncausal SNP.To this end, we may equivalently state the recalibrated GWA null hypothesis as To infer the proportion of PVE contributed by each SNP, gene-ε takes an empirical Bayes approach by fitting a K-mixture of normal distributions over the regularized effect size estimates β j [18], such that where z j ∈ {1, . . ., K} is a latent variable representing the mixture membership for the j-th SNP.When summing over all components, Eq. ( 8) corresponds to the following marginal distribution where π k is a mixture weight representing the marginal (unconditional) probability that a randomly selected SNP belongs to the k-th component, with k π k = 1.The above mixture allows for distinct clusters of nonzero effects through K different variance components (σ 2 k , k = 1, . . ., K) [18].Here, we consider sequential fractions (π 1 , . . ., π K ) of SNPs to correspond to distinctly smaller effects (σ [18].Intuitively, causal SNPs will appear in the first set of fraction groups, while SNPs with ε-genic or non-causal effects will belong to the latter groups.Since our objective is to find the core genes underlying complex traits, results in the current study are based on considering an ε-genic threshold . This means that we consider SNP j to be causal if it is grouped in the largest fraction (i.e., the alternative H A : σ 2 j > σ 2 2 ), and interactive or non-causal otherwise (i.e., the null H 0 : σ 2 j ≤ σ 2 2 ).Given Eqs. ( 8) and ( 9), we fit the mixtures for all J SNPs by aiming to maximize the joint log-likelihood where Θ = (π 1 , . . ., π K , σ 2 1 , . . ., σ 2 K ) is the complete set of parameters for the mixture model.In the gene-ε framework, estimation is done using an expectation-maximization (EM) algorithm to maximize the joint distribution in Eq. (10).Here, we compute the probability that each SNP is a member of the K different components.The number of components K is chosen via a Bayesian Information Criterion (BIC).Details of this algorithm are given in the first section of the Supplementary Notes.

Hypothesis Testing for the Detection of Core Genes
We now formalize the statistical test for identifying core genes using the recalibrated GWA null hypothesis under the gene-ε framework.This test statistic is based on a quadratic form using GWA summary statistics, and is a commonly used approach for generating gene-level association statistics for complex traits.Let gene (or genomic region) g represent a known set of SNPs j ∈ J g ; for example, J g may include SNPs within the boundaries of g and/or within its corresponding regulatory region.Here, we conformably partition the regularized GWA effect size estimates β and define the gene-level test statistic where A is an arbitrary symmetric and positive semi-definite weight matrix, which we set to be the identity I for all analyses in the current study.Indeed, similar quadratic forms have been implemented to assess the enrichment of mutations at the gene level [7,12] and across general SNP-sets [9,20,28,58].
A key feature of the gene-ε framework is to test the statistics in Eq. ( 11) against a recalibrated gene-level association null hypothesis H 0 : Q g ≈ 0. Due to the asymptotic normality assumption for each SNP effect (Eq. ( 6)), Q g is theoretically assumed to follow a mixture of chi-square distributions, where |J g | denotes the cardinality of the set of SNPs J g and χ 2 1,j are standard chi-square random variables with one degree of freedom.Most importantly, (λ 1 , . . ., λ |Jg| ) are the eigenvalues of the matrix where ε Σ g is the variance-covariance structure of gene g with SNPs only emitting interactive and non-causal effects under the recalibrated GWA null hypothesis (Eq.( 7)).Several approximate and exact methods have been suggested to obtain P -values under a mixture of chi-square distributions.In this study, we use Imhof's method [26].
In the main text, we highlight some of the additional features of the gene-ε gene-level association test statistic.First, the expected enrichment for trait-associated mutations in a given gene is equal to the heritability explained by the SNPs contained in said gene.More formally, consider the expansion of Eq. ( 11) derived from the expectation of quadratic forms, where is the variance-covariance matrix of the SNPs within gene g, tr(•) denotes the matrix trace function, (Ω g • A) represents the Hadamard (i.e.element-wise) product between the two matrices, r and s are row and column indices, and h 2 g denotes the heritability contributed by gene g.When A = I (as in the current study), the gene-ε recalibrated hypothesis test for identifying core genes is based on the LD map between SNPs within the gene of interest, scaled by the individual SNP contributions to the narrow-sense heritability.Alternatively, one could choose to re-weight these contributions by specifying A otherwise [12,20,[74][75][76].For example, if SNP j has a small effect size but is known to be functionally associated with the trait of interest, then increasing A jj will reflect this knowledge.Specific weight functions have also been suggested for dealing with rarer variants [9,28,58].The variances of the gene-ε gene-level statistics are given by and can be used to derive measures of uncertainty regarding the strength of gene associations (e.g., building confidence intervals).

Simulation Studies
We used a simulation scheme to generate SNP-level summary statistics for GWA studies.First, we randomly select core genes and assume that complex traits (under various genetic architectures) are generated via a linear model where y is an N -dimensional vector containing all the phenotypes; C represents the set of causal SNPs contained within the core genes; x c is the genotype for the c-th causal SNP encoded as 0, 1, or 2 copies of a reference allele; β c is the additive effect size for the c-th SNP; W is an N × M matrix of covariates representing additional population structure (e.g., the top ten principal components from the genotype matrix) with corresponding fixed effects b; and e is an N -dimensional vector of environmental noise.The phenotypic variance is assumed V[y] = 1.The effect sizes of SNPs in core genes are randomly drawn from standard normal distributions and then rescaled so they explain a fixed proportion of the narrow-sense The covariate coefficients are also drawn from standard normal distributions and then rescaled such that . GWA summary statistics are then computed by fitting a single-SNP univariate linear model via ordinary least squares (OLS): β j = (x j x j ) −1 x j y for every SNP in the data j = 1, . . .J.These effect size estimates, along with an LD matrix Σ computed directly from the full N × J genotype matrix X, are given to gene-ε .We also retain standard errors and Pvalues for implementation of the competing methods (VEGAS, PEGASUS, RSS, SKAT, and MAGMA).
Given different model parameters, we simulate data mirroring a wide range of genetic architectures (Supplementary Notes).Effect size distributions (Omnigenic)

Figures and Tables
: Hypothesis Testing Genes -log 10 (P) d. Core Gene Association α = 0 : LASSO 0 < α < 1 : Elastic Net α = 1 : Ridge Regression ), where σ 2 j denotes the proportion of narrow-sense heritability h 2 contributed by the j-th SNP.(c) A unique feature of gene-ε is that it treats interactive SNPs as non-causal.gene-ε assumes a recalibrated (SNP-level) null hypothesis H 0 : σ 2 j ≤ σ 2 ε , where σ 2 ε is the ε-genic threshold and represents the maximum proportion-of-variance-explained (PVE) by an interactive or non-causal SNP.To infer σ 2 j , gene-ε fits a K-mixture of normal distributions over the regularized effect sizes with successively smaller variances (σ 2 1 > • • • > σ 2 K ; with σ 2 K = 0).In this study, we assume that causal SNPs will appear in the first set, while non-causal SNPs appear in the last set.By definition, the ε-genic threshold is then σ 2 ε = σ 2 2 .(d) Lastly, gene-ε computes gene-level association test statistics using quadratic forms, and estimates corresponding P -values using Imhof's method.For more details, see Methods.Body height has been estimated to have a narrow-sense heritability h 2 in the range of 0.45 to 0.80 [6,[31][32][33][34][35][36][37][38][39]; while, MPV has been estimated to have h 2 between 0.50 and 0.70 [33,34,77].Manhattan plots of gene-ε gene-level association P -values using Elastic Net regularized effect sizes for (a) body height and (b) MPV.The purple dashed line indicates a log-transformed Bonferroni-corrected significance threshold (P = 3.49 × 10 −6 correcting for 14,322 autosomal genes analyzed).We color code all significant genes identified by gene-ε in orange, and annotate genes overlapping with the database of Genotypes and Phenotypes (dbGaP).In (c) and (d), we conduct gene set enrichment analysis using Enrichr [46,78] to identify dbGaP categories enriched for significant gene-level associations reported by gene-ε.We highlight categories with Q-values (i.e., false discovery rates) less than 0.05 and annotate corresponding genes in the Manhattan plots in (a) and (b), respectively.For height, the only significant dbGAP category is "Body Height", with 9 of the genes identified by gene-ε appearing in this category.For MPV, the two significant dbGAP categories are "Platelet Count" and "Face" -the first of which is directly connected to trait [57,79,80].[86, 96-98] Table 1.Top three novel candidate core genes reported by gene-ε for the six quantitative traits studied in the UK Biobank (using imputed genotypes with gene boundaries defined by the NCBI's RefSeq database in the UCSC Genome Browser [27]).We call these novel candidate core genes because they are not listed as being associated with the trait of interest in either the GWAS catalog or dbGaP, and they have top posterior enrichment probabilities with the trait using RSS analysis.Each gene is annotated with past functional studies that link them to the trait of interest.We also report each gene's overall trait-specific significance rank (out of 14,322 autosomal genes analyzed for each trait), as well as their heritability estimates from gene-ε using Elastic Net to regularize GWA SNP-level effect size estimates.The traits are: height; body mass index (BMI); mean corpuscular volume (MCV); mean platelet volume (MPV); platelet count (PLC); and waist-hip ratio (WHR).*: Multiple genes were tied for this ranking. 489

gH 0 Figure 2 .
Figure 2. Schematic overview of gene-ε: our new gene-level association approach modeling ε-genic effects.(a) geneε takes SNP-level GWA marginal effect sizes (OLS estimates β) and a linkage disequilibrium (LD) matrix (Σ) as input.It is well-known that OLS effect size estimates are inflated due to LD (i.e., correlation structures) among genome-wide genotypes.(b) gene-ε first uses its inputs to derive regularized effect size estimators ( β) through shrinkage methods (LASSO, Elastic Net and Ridge Regression; we explore performance of each solution under a variety of simulated trait architectures, Supplementary Notes and Figures).Marginally, β j ∼ N (0, σ 2 j ), where σ 2 j denotes the proportion of narrow-sense heritability h 2 contributed by the j-th SNP.(c) A unique feature of gene-ε is that it treats interactive SNPs as non-causal.gene-ε assumes a recalibrated (SNP-level) null hypothesis H 0 : σ 2 j ≤ σ 2 ε , where σ 2 ε is the ε-genic threshold and represents the maximum proportion-of-variance-explained (PVE) by an interactive or non-causal SNP.To infer σ 2 j , gene-ε fits a K-mixture of normal distributions over the regularized effect sizes with successively smaller variances (σ 2 1 > • • • > σ 2 K ; with σ 2 K = 0).In this study, we assume that causal SNPs will appear in the first set, while non-causal SNPs appear in the last set.By definition, the ε-genic threshold is then σ 2 ε = σ 2 2 .(d) Lastly, gene-ε computes gene-level association test statistics using quadratic forms, and estimates corresponding P -values using Imhof's method.For more details, see Methods.

Figure 4 .
Figure 4. Gene-level association results from applying gene-ε to body height (panels a and c) and mean platelet volume (MPV; panels b and d), assayed in European-ancestry individuals in the UK Biobank.Body height has been estimated to have a narrow-sense heritability h 2 in the range of 0.45 to 0.80[6,[31][32][33][34][35][36][37][38][39]; while, MPV has been estimated to have h 2 between 0.50 and 0.70[33,34,77].Manhattan plots of gene-ε gene-level association P -values using Elastic Net regularized effect sizes for (a) body height and (b) MPV.The purple dashed line indicates a log-transformed Bonferroni-corrected significance threshold (P = 3.49 × 10 −6 correcting for 14,322 autosomal genes analyzed).We color code all significant genes identified by gene-ε in orange, and annotate genes overlapping with the database of Genotypes and Phenotypes (dbGaP).In (c) and (d), we conduct gene set enrichment analysis using Enrichr[46,78] to identify dbGaP categories enriched for significant gene-level associations reported by gene-ε.We highlight categories with Q-values (i.e., false discovery rates) less than 0.05 and annotate corresponding genes in the Manhattan plots in (a) and (b), respectively.For height, the only significant dbGAP category is "Body Height", with 9 of the genes identified by gene-ε appearing in this category.For MPV, the two significant dbGAP categories are "Platelet Count" and "Face" -the first of which is directly connected to trait[57,79,80].
[18]r our recalibrated GWA model, there are causal SNPs, non-causal SNPs, and interactive SNPs.We propose a multi-component framework (see also[18]), in which interactive SNPs can emit different levels of statistical signals based on (i ) different degrees of connectedness (e.g., through linkage disequilibrium), or (ii ) its regulated gene interacts with a true core gene.While causal SNPs are still more likely to emit large effect sizes than SNPs in the other categories, interactive SNPs can emit intermediate effect sizes.Here, our goal here is to classify interactive SNPs as non-causal.OLS effect size: β := ( β 1 , β 2 , ..., β J ) T Σ := [ρ jl ] J×J , ρ jl := LD between SNPs j & l Figure1.Illustration of null hypothesis assumptions for the distribution of GWA SNP-level effect sizes according to different views on underlying genetic architectures.The effect sizes of "non-causal" (pink), "intermediate" (red), and "causal" (blue) SNPs were drawn from normal distributions with successively larger variances.(a)Thetraditional GWA model of complex traits simply assumes SNPs are causal or non-causal.Under the corresponding null hypothesis, causal SNPs are more likely to emit large effect sizes while non-causal SNPs will have effect sizes of zero.When there are many causal variants, we refer to the traits as polygenic.(b)β|β − Σβ| 2 ,subject to (1 − α)||β|| 1 + α||β|| 2 2 ≤ t for some t Known as the transcription elongation factor of mitochondria (TEFM) which regulates transcription and can affect body height.