Leveraging effect size distributions to improve polygenic risk scores derived from summary statistics of genome-wide association studies

Genetic risk prediction is an important problem in human genetics, and accurate prediction can facilitate disease prevention and treatment. Calculating the polygenic risk score (PRS) has become widely used due to its simplicity and effectiveness: the standard method requires only summary statistics from genome-wide association studies (GWAS). Recently, several methods have been proposed to improve the standard PRS by utilizing external information, such as linkage disequilibrium (LD) and functional annotations. In this paper, we introduce EB-PRS, a novel method that leverages the distribution of effect sizes across all markers to improve prediction accuracy. Compared to most existing genetic risk prediction methods, our method requires neither parameter tuning nor external information. Real data applications to six diseases, including asthma, breast cancer, celiac disease, Crohn's disease, Parkinson's disease, and type 2 diabetes, show that EB-PRS achieved 307.1%, 42.8%, 25.5%, 3.1%, 74.3%, and 49.6% relative improvements in predictive r² over the standard PRS method with optimally tuned parameters. In addition, compared to LDpred, which makes use of LD information, EB-PRS achieved 37.9%, 33.6%, 8.6%, 36.2%, 40.6%, and 10.8% relative improvements. We note that our method is not the first to leverage effect size distributions. Here we justify our method by establishing its theoretically optimal property over existing methods in this class, and substantiate the theoretical result with extensive simulations. The R package EBPRS that implements our method is available on CRAN.


Introduction
The onset of common diseases results from the combined effects of genetic and environmental factors. The initial objective of genome-wide association studies (GWAS) is to identify genetic variants, such as single nucleotide polymorphisms (SNPs), that are associated with human diseases. In the past decade, GWAS have identified tens of thousands of genetic associations [1]. These identified associations have led to new insights into the etiologies of many diseases [2][3][4]. In addition to improving our understanding of the genetic basis of complex diseases, the association results provide an opportunity to develop genetic risk prediction models that are clinically useful. With accurate risk prediction methods, we can better advise individuals on appropriate prevention, screening, and diagnosis, and the estimated disease risk may also inform treatment. For example, individuals with a high predicted risk of cardiovascular disease or diabetes would be recommended healthy diets, fitness regimens, and more regular check-ups.
Calculating polygenic risk score (PRS) is a common genetic risk prediction approach because of its simplicity and computational efficiency. PRS sums the number of risk alleles among a set of SNPs, weighted by their effect sizes estimated from a certain cohort. We can regard PRS as the estimated genetic liability to a disorder or a trait [5]. The standard PRS method, namely the Pruning + Thresholding (P+T) method, selects SNPs after LD-clumping and p-value thresholding. The observed effect sizes of selected SNPs, which can be directly extracted from summary statistics, are used as weights. Hence, only GWAS summary statistics are needed to calculate PRS, which is attractive since summary statistics are more accessible than individual-level genotype data due to potential privacy and data sharing concerns. In the following, we only focus on the PRS methods utilizing GWAS summary statistics.
To further improve prediction accuracy, several methods have been proposed to utilize other information, such as LDpred (and LDpred-inf) that models the LD information extracted from a reference panel [6]; AnnoPred that leverages diverse types of genomic and epigenomic functional annotations [7]; and PleioPred and SMTpred that utilize pleiotropy relationship with other traits/diseases [8,9]. All of these methods need to borrow information from external panels or datasets.
In this article, we explore whether we can improve the standard PRS method without using information from external panels or datasets. We answer this question in the affirmative by proposing EB-PRS, a novel PRS approach based on Empirical Bayes theory that borrows information across markers to improve prediction accuracy. Instead of using a prespecified prior distribution as in standard Bayesian modeling, the Empirical Bayes approach estimates the prior distribution from the data. We first utilize the GWAS summary statistics to infer the overall distribution of effect sizes, and then leverage this distribution to improve the standard PRS method by minimizing the prediction error.
Compared to existing genetic risk prediction methods, our method has the following advantages: 1. Our method does not need parameter tuning, avoiding the use of training data to select the best-performing parameters.
2. Compared to other improved PRS methods such as LDpred and AnnoPred, we do not need external panels or datasets.
3. We note that we are not the first to propose utilizing effect size distributions for PRS calculation. Some methods in this category also require no tuning parameters or external input, such as those proposed in So and Sham (2017) [10] and Mak et al. (2016) [11]. We theoretically construct the optimal PRS in terms of minimizing the prediction error, so our method has a theoretical advantage over the existing methods in this category. Better empirical performance also supports our theoretical results.
We demonstrate the better performance of our method through both simulations and real data applications to six complex diseases including asthma (AS), breast cancer (BC), celiac disease (CEL), Crohn's disease (CD), Parkinson's disease (PD), and type 2 diabetes (T2D).

Simulation experiments
Simulation based on independent SNP assumption. We first performed simulations to investigate the performance of our method when the markers are independent. In these simulations, we fixed the number of independent SNPs to m = 10,000 and set the causal SNP proportion to 0.1. For each SNP, its allele frequency was simulated from a uniform distribution U(0.05, 0.95) and its standardized effect size was drawn from a point-normal mixture distribution, i.e., μ ∼ 0.9δ₀ + 0.1N(0, 0.001Nₑ). We further set the prevalence of the disease κ to 1%. To explore the relationship between the prediction performance of our method and the sample size of the training dataset, we varied the sample size from 2,000 to 8,000. We first set the control-to-case ratio (CCR) to 1 in the training datasets. In order to simulate under a setting consistent with real data, we also performed simulations with larger CCRs (ranging from 2 to 4).
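As an illustrative sketch of the independent-SNP setting above (variable names and the effective sample size value are our own choices, not taken from the paper's code), the point-normal effect size model and the resulting z-scores can be simulated as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

m = 10_000          # number of independent SNPs
prop_causal = 0.1   # causal SNP proportion
n_e = 4000          # effective sample size (illustrative value)

# Allele frequencies drawn uniformly on (0.05, 0.95), as in the simulation setting.
freq = rng.uniform(0.05, 0.95, size=m)

# Point-normal mixture: 90% of SNPs are null, 10% draw a standardized
# effect size from N(0, 0.001 * N_e).
is_causal = rng.random(m) < prop_causal
mu = np.where(is_causal, rng.normal(0.0, np.sqrt(0.001 * n_e), size=m), 0.0)

# Observed z-scores: z_i | mu_i ~ N(mu_i, 1).
z = rng.normal(mu, 1.0)
```

From such z-scores, the summary-statistics-based methods compared in this section can be evaluated without individual-level genotype data.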
Because SNPs were simulated independently, there is no need to consider LD among SNPs in this scenario. Therefore, we only compare our method with a simplified P+T method in which the pruning step is not carried out. The p-value threshold of the P+T method was varied among {1, 5e−01, 5e−02, 5e−03, 5e−04, 5e−05, 5e−06}. We simulated 100 controls and 100 cases as our testing dataset. For each individual in the simulated testing dataset, the PRSs generated from the EB-PRS, P+T, So et al.'s, and Mak et al.'s methods were calculated. We evaluate the prediction performance using both the squared correlation between the PRSs and the observed phenotypes (predictive r²) and the area under the receiver operating characteristic (ROC) curve (AUC). For the P+T method, we report the results with the best-performing parameters. We ran experiments in each setting 10 times and compared the average performances of the four methods; the results are summarized in the corresponding table. Population genetics data suggest that there are more SNPs with low minor allele frequencies (MAF) than with high MAF [12]. To mimic this more realistic situation, we also simulated genotype data with allele frequencies drawn from a Beta distribution rescaled to the interval (0.05, 0.95).
Simulations based on real genotypes. In order to evaluate the performance of our method for dependent SNPs (i.e., SNPs in LD), we conducted simulations based on individual-level genotype data accessed from the database of Genotypes and Phenotypes (dbGaP) [13,14] (study accession number phs000021). This schizophrenia study dataset included 2,729 samples and 729,454 SNPs. The CCR for the schizophrenia dataset is 1.2. We randomly selected 0.1% of the SNPs to have effects on disease and set the prior case proportion to 0.5. The log-ORs (β) of these causal SNPs were assumed to follow a normal distribution N(0, 0.04).
The phenotype of each individual was generated from the logistic model
logit P(yᵢ = 1 | xᵢ) = β₀ + Σ_{j∈C} β_j x_{ij},
where C is the set of causal SNPs and β₀ is the intercept determined by the case proportion. With this setting, the corresponding heritability on the observed scale is 49.2%.
Here we compare EB-PRS with six other methods: unadjusted PRS, P+T, LDpred-inf, LDpred, So et al.'s method, and Mak et al.'s method. We used genotype data of individuals with European ancestry from the 1000 Genomes Project as the reference panel for both LDpred-inf and LDpred. For LDpred, we set the proportion of causal SNPs from {1, 3e−01, 1e−01, 3e−02, 1e−02, 3e−03, 1e−03, 3e−04, 1e−04, 3e−05, 1e−05}. Five-fold cross-validation was used to evaluate the prediction performance of the different methods. For each training dataset, we calculated the summary statistics from the genotype data and used them to derive PRSs. The performance is measured using both the predictive r² and the AUC. For P+T and LDpred, we report the results with the best-performing parameters. Fig 3 shows the predictive r² of the seven methods under five-fold cross-validation; EB-PRS is the best among the seven approaches. The exact values of predictive r² and AUC are shown in Table 1. EB-PRS achieved 107%, 32%, 107%, 21%, 61%, and 383% relative improvements over the other six methods in the r² metric. The AUCs in the table also show the predictive superiority of EB-PRS. To help users budget computation, we provide a summary table of the computational time of our method, LDpred, and So et al.'s method for this simulation under five-fold and ten-fold cross-validation in Supplementary S6 Table. The simulations were run on an Intel Xeon processor at 2.50 GHz.

Applications to six real datasets
We applied EB-PRS and other PRS methods to six complex disease datasets, including asthma (AS) [13], breast cancer (BC) [15,16], celiac disease (CEL) [17], Crohn's disease (CD) [18,19], Parkinson's disease (PD) [20], and type 2 diabetes (T2D) [21,22], to illustrate the improved risk prediction performance in real data. We summarize the information of the training and testing sets in Table 2 and list the details below.
For AS, we trained the model using summary statistics from the GABRIEL Consortium, a meta-analysis of 23 studies including 10,365 persons with physician-diagnosed asthma and 16,110 unaffected persons. There were 535,060 SNPs in the released summary statistics. We tested the performance using samples from study phs000490 in dbGaP, with 515 cases and 875 controls [13]. For BC, we used summary statistics from the Genetic Associations and Mechanisms in Oncology (GAME-ON) study (n₀ = 41,335 and n₁ = 16,003) [15] covering 2,435,470 SNPs as the training data, and samples from the Cancer Genetic Markers of Susceptibility (CGEMS) study (n₀ = 70, n₁ = 966) [16] as the testing data. Individuals shared between the two studies were removed from the testing data. For CEL, we trained the model with 4,533 individuals with celiac disease and 10,750 controls from Dubois' study [17], in which 508,742 SNPs were genotyped. The testing data came from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) celiac disease study (1,716 cases and 530 controls) [23]. For CD, we trained the model using summary statistics from the International Inflammatory Bowel Disease Genetics Consortium (IIBDGC; n₀ = 15,056, n₁ = 6,333, and m = 871,743) [18]. Individuals from the WTCCC were removed from the meta-analysis and used as the testing dataset (n₀ = 2,891, n₁ = 1,689) [19]. For PD, we used the summary statistics released from Simon-Sanchez et al.'s study as the training data, which summarized the association information of 450,439 SNPs genotyped among 1,713 individuals of European ancestry with PD and 3,978 controls of the same ancestry [20]. We used the Wellcome Trust Case Control Consortium 2 (WTCCC2) data (n₀ = 2,574 and n₁ = 1,843) as the testing data. For T2D, we trained the model on summary statistics from the Diabetes Genetics Replication and Meta-analysis consortium (DIAGRAM; n₀ = 56,862, n₁ = 12,171) [21] and tested the model on samples from the Northwestern NUgene Project (n₀ = 517, n₁ = 662) [22]. There were 2,400,624 SNPs in the training data.

Table 1. Predictive r² and AUC of EB-PRS, unadjusted PRS, P+T, LDpred-inf, LDpred, So et al.'s method, and Mak et al.'s method on simulations based on observed genotypes using five-fold cross-validation. The simulations were based on individual-level genotype data accessed from the schizophrenia study (study accession number phs000021) in dbGaP. The dataset included 2,729 samples and 729,454 SNPs. The highest mean r² and AUCs are highlighted in boldface.

For these six diseases, EB-PRS outperformed all other methods, including the unadjusted PRS, P+T, LDpred-inf, LDpred, So et al.'s, and Mak et al.'s methods. Even without external information or datasets, the performance of our method was better than the best-performing LDpred for all diseases. Here, we build two logistic models: one fits the PRSs derived from one method only, while the other fits the PRSs derived from two methods. We then use the likelihood ratio test to compare the difference in prediction accuracy and report the corresponding p-value. EB-PRS attained significant improvements in prediction accuracy over P+T for AS (p-value = 4.5e−03), BC (p-value = 1.1e−02), CEL (p-value < 2.0e−16), CD (p-value = 1.5e−07), PD (p-value = 1.8e−02), and T2D (p-value = 4.4e−06). Detailed results are reported in Table 3, from which we can see the superiority of EB-PRS over the other six methods.

In general, unadjusted PRS and LDpred-inf (which require no parameter tuning) performed poorly on all six datasets. Compared to methods that need parameter tuning (i.e., P+T and LDpred), our method also achieved better performance on all six datasets. These results highlight the distinct advantage of our method: it achieves noteworthy performance without tuning any parameters or incorporating external information.

Discussion
In this article, we have considered predicting an individual's genetic risk for complex diseases using summary statistics, which may prove helpful for both disease prevention and clinical decision-making. The PRS approach is widely used in genetic risk prediction because of its simplicity and efficiency, and a number of PRS methods have been proposed to improve prediction accuracy. Almost all of these improved methods borrow information from external panels or datasets, such as LD information, annotations, or other GWAS datasets studying genetically correlated traits. In this paper we have proposed an improved PRS method that utilizes effect size distributions based on Empirical Bayes theory. No parameter tuning is needed, and no external panels or datasets are required as input. We have shown that our method outperformed the standard P+T method, as well as the recently proposed methods of So and Sham (2017) and Mak et al. (2016), in both simulations and real data analysis. Furthermore, compared to methods utilizing LD information from a reference panel, our method also achieved superior performance on all six diseases: AS, BC, CEL, CD, PD, and T2D. One reason for the good performance is that we leverage the global information of effect sizes; the other is that we theoretically construct the optimal PRS in terms of minimizing the prediction error.
Note that both So et al.'s and Mak et al.'s methods also use the effect size distribution to improve PRS. We show the theoretical superiority of our method over these two methods. So et al. used the square root of the explained liability as the effect size, which is not optimal in terms of minimizing the prediction error. Moreover, the effect size estimated from Tweedie's formula is already weighted by the local true discovery rate, so a further multiplication over-shrinks the effect size. In addition, their method requires a user-specified prevalence for each trait, which may sometimes be difficult to obtain.
One may ask why we use a parametric mixture model instead of the non-parametric Tweedie's formula for effect size estimation. First, our model uses a symmetric prior for the effect size, which makes the estimated magnitude invariant to the sign of z, i.e., to the choice of reference alleles. Second, the K-component mixture assumption for the effect sizes of associated SNPs is consistent with empirical observations from GWAS of many common diseases [24][25][26]. Third, although the Gaussian mixture model is parametric, it provides great flexibility and precision in modelling the underlying data: with an appropriate number of components, it can approximate an arbitrary distribution to any desired fidelity [27]. Although our method achieves good performance without tuning any parameters or utilizing external information, its performance may be further improved with external information, for example, LD information. Also, to increase prediction accuracy further, we may extend our method by combining other available datasets in the future, such as annotations or GWAS summary statistics of genetically correlated traits.
A basic assumption underlying all PRS methods is that the training and testing datasets are homogeneous and sampled from the same population. If they come from different populations, the allele frequencies of SNPs will differ, as may the causal variants and their effect sizes. In that case, we expect worse prediction performance [28,29]. In this article, we used simulation experiments to mimic divergence between the training and testing datasets, simulating both different allele frequency distributions and divergent effect sizes. Results summarized in S3 and S4 Tables showed that, although prediction performance is affected by the divergence of the two sets, our EB-PRS method still outperforms the others. However, our current analysis still lacks testing across diverse populations.

EB-PRS
Generally speaking, EB-PRS aims at minimizing the prediction error by leveraging the estimated distribution of effect sizes. Assuming that the SNPs are independent, we can show that the optimal PRS (in terms of achieving the best classification accuracy) is

S = Σ_{i=1}^{m} β_i x_i,

where m is the total number of genotyped SNPs, x_i is the genotypic value, and β_i is the log-odds ratio (OR) of the i-th SNP. The log-OR is a measure of the effect size defined by

β_i = log [ f_{i1}(1 − f_{i0}) / ( f_{i0}(1 − f_{i1}) ) ],

where f_{i0} and f_{i1} are the reference allele frequencies among controls and cases, respectively. If the SNP is not associated with the disease, then β_i = 0. In practice, the true effect sizes are usually unknown and need to be estimated from the data. In our method, we use the Empirical Bayes approach to estimate β, which is the minimizer of the Bayes risk under the distribution estimated from the data. The estimators can be derived directly from GWAS summary statistics.
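The two quantities in this paragraph can be sketched in a few lines. This is an illustrative implementation of the log-OR definition and of the score S = Σ β_i x_i, not the EBPRS package itself; the function names are ours:

```python
import numpy as np

def log_odds_ratio(f_controls, f_cases):
    """Log-OR of the reference allele: beta_i = log[f_i1(1-f_i0) / (f_i0(1-f_i1))]."""
    f0 = np.asarray(f_controls, dtype=float)
    f1 = np.asarray(f_cases, dtype=float)
    return np.log(f1 * (1 - f0)) - np.log(f0 * (1 - f1))

def prs(genotypes, beta):
    """PRS S = sum_i beta_i x_i for each individual (one row per individual)."""
    return np.asarray(genotypes, dtype=float) @ np.asarray(beta, dtype=float)

# A SNP with equal reference allele frequency in cases and controls has beta = 0.
beta = log_odds_ratio([0.3, 0.5], [0.3, 0.6])
scores = prs([[2, 1], [0, 0]], beta)
```

An individual carrying no reference alleles gets a score of zero, and only SNPs with non-zero log-OR contribute to the score.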
Notations and assumptions. Suppose that there are m SNPs genotyped in a GWAS. For each SNP, there are usually two different alleles. We denote one of them as the reference allele, and use the number of reference alleles to code the genotype of the SNP for each individual. For SNP i (i = 1, . . ., m), the genotypic value is denoted by x_i (x_i ∈ {0, 1, 2}). We use x to represent the vector of genotypic values across the m SNPs of an individual and y the disease status of the individual, where y = 1 if the individual has the disease and y = 0 otherwise. Under a multiplicative model and low prevalence, the genotypic value of each SNP follows a binomial distribution:

x_i | y = 0 ∼ Binomial(2, f_{i0}),  x_i | y = 1 ∼ Binomial(2, f_{i1}).

Optimal polygenic risk score. If all SNPs are assumed to be independent, based on Eq (5), we have

log [ P(y = 1 | x) / P(y = 0 | x) ] = log [ κ / (1 − κ) ] + 2 Σ_{i=1}^{m} log [ (1 − f_{i1}) / (1 − f_{i0}) ] + Σ_{i=1}^{m} β_i x_i,

where κ is the disease prevalence, and β_i is the true value of the log-OR for the i-th SNP.
Our objective is to find the decision rule minimizing the overall Bayes risk

R = E_x [ R(α(x) | x) ].

By the Bayes decision rule, we minimize the overall risk by selecting the action that minimizes the conditional risk R(α(x) | x) for every x:

R(α_l | x) = Σ_{y ∈ {0,1}} λ(α_l | y) P(y | x),

where l ∈ {0, 1}, α_0 denotes the action ŷ = 0, and α_1 denotes the action ŷ = 1. Here λ(·) is the zero-one loss function: λ(α_l | y) = 0 if l = y, and 1 otherwise. Thus, Eq (7) can be minimized by setting ŷ = 1 if P(y = 1 | x) > P(y = 0 | x), i.e., if

log [ P(y = 1 | x) / P(y = 0 | x) ] = c + β^T x > 0,

where c is a constant that does not depend on x and β is exactly the vector of log-ORs. That is, if we define a polygenic risk score S = β^T x = Σ_{i=1}^{m} β_i x_i, the optimal decision rule minimizing the prediction error is ŷ = 1 if S > −c, and ŷ = 0 otherwise.

Inference. The constructed optimal polygenic risk score S is a function of the parameters β_i (i = 1, . . ., m). To derive the PRS in practice, we can use GWAS summary statistics to estimate these parameters.
If we simply estimate β_i by plugging in the observed log-OR, the estimated effect sizes will tend to be inflated for SNPs with large estimated values. This phenomenon is commonly known as the "winner's curse". Here, we adopt the Empirical Bayes approach to address this selection bias, as it is more robust to the winner's curse [30,31]. Moreover, it is the minimizer of the Bayes risk under the effect size distribution estimated from the data.
In GWAS, we usually use the following log-OR test statistic to assess associations between SNPs and the disease:

z_i = β̂_i / se(β̂_i),

where se(β̂_i) is the standard error of the estimated log-OR, and N_e = 4n_0 n_1 / n is the effective sample size in the case-control study, with n_0 and n_1 the numbers of control and case subjects, respectively, and n = n_0 + n_1. Given the standardized effect size μ_i = β_i / se(β̂_i), the distribution of z-scores is Z_i | μ_i ∼ N(μ_i, 1). Recent GWAS results suggest that, among all disease-associated SNPs, there are many more SNPs with small effect sizes than with large effect sizes [32,33]. Therefore, we use the following spike-and-slab prior to model the effect sizes of all SNPs:

μ_i ∼ π_0 δ_0 + Σ_{j=1}^{K} π_j N(0, N_e σ_j²),

where π_0 is the proportion of non-associated SNPs (0 ≤ π_0 ≤ 1) and δ_0 is the distribution with point mass at zero. Here we use a K-component Gaussian mixture distribution as the slab, in which the proportion of SNPs in the j-th associated component is π_j (Σ_{j=0}^{K} π_j = 1) and the corresponding variance of the standardized effect sizes is N_e σ_j². We include N_e as a scaling coefficient in the variance of each component so that σ_j² is invariant to changes in sample size. With this prior specification, the posterior expected effect size of a SNP under each alternative hypothesis is

E(μ_i | z_i, H_1j) = [ N_e σ_j² / (1 + N_e σ_j²) ] z_i,

where H_1j: μ_i ≠ 0 is the alternative hypothesis that the SNP is an associated one within the j-th component (j = 1, . . ., K). The corresponding local true discovery rate is the probability that hypothesis H_1j is true given the z-value, which can be calculated by Bayes' rule as

ltdr_j(z_i) = π_j f_j(z_i) / [ π_0 φ(z_i) + Σ_{k=1}^{K} π_k f_k(z_i) ],

where φ(·) and f_j(·) are the probability density functions of N(0, 1) and N(0, 1 + N_e σ_j²), respectively.
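The posterior mean of the standardized effect size implied by this prior combines the ltdr weights with the component-wise posterior means. The following is our own sketch (not the EBPRS package), with illustrative parameter values in the demo call:

```python
import numpy as np

def _npdf(z, var):
    """Density of N(0, var) at z."""
    return np.exp(-z * z / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def posterior_mean_mu(z, pi, sigma2, n_e):
    """Posterior mean of the standardized effect size mu given z, under the
    spike-and-slab prior mu ~ pi[0]*delta_0 + sum_j pi[j]*N(0, n_e*sigma2[j-1]);
    marginally z mixes N(0, 1) with N(0, 1 + n_e*sigma2_j)."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    sigma2 = np.asarray(sigma2, dtype=float)
    marg_var = 1.0 + n_e * sigma2                    # marginal variance of z per slab component
    dens = np.vstack([pi[0] * _npdf(z, 1.0)] +
                     [pj * _npdf(z, v) for pj, v in zip(pi[1:], marg_var)])
    ltdr = dens[1:] / dens.sum(axis=0)               # ltdr_j(z) for j = 1..K
    shrink = (n_e * sigma2 / marg_var)[:, None]      # E(mu | z, H_1j) = shrink * z
    return (ltdr * shrink * z).sum(axis=0)

post = posterior_mean_mu([0.0, 4.0], pi=[0.9, 0.1], sigma2=[0.01], n_e=400)
```

A z-score of zero yields a zero posterior mean, while a large z-score is shrunk toward zero by both the ltdr weight and the normal-normal shrinkage factor.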
The posterior expectation of β_i is the optimal estimator for minimizing the Bayes risk. Therefore, we estimate β_i as

β̂_i^{EB} = se(β̂_i) Σ_{j=1}^{K} ltdr_j(z_i) E(μ_i | z_i, H_1j).

Here we adopt an EM algorithm to estimate the unknown parameters (π_0, π_1, . . ., π_K) and (σ_1, . . ., σ_K) in the above mixture model. In practice, the null proportion π_0 is always much larger than the proportions of the alternative components. To take advantage of this prior information, we first add a Dirichlet prior (α, 0, . . ., 0) to the proportions (π_0, π_1, . . ., π_K). We use the following strategy to infer the value of α. First, we set α to a reasonable value (m/20 in our default setting) to obtain a preliminary model. Then we generate parametric bootstrap samples based on the preliminary model and select the α with the minimal relative errors in terms of parameter estimation using the bootstrap samples [34].
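A minimal EM sketch for the mixture parameters follows. It omits the Dirichlet prior on the proportions and the bootstrap choice of α described above, and all names and the simulated check are our own, so it is a plain-EM approximation rather than the published procedure:

```python
import numpy as np

def _npdf(z, var):
    return np.exp(-z * z / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def em_spike_slab(z, n_e, k=3, n_iter=2000, seed=0):
    """EM estimation of (pi_0, ..., pi_K) and (sigma2_1, ..., sigma2_K) in the
    marginal z-score mixture pi_0 N(0, 1) + sum_j pi_j N(0, 1 + n_e*sigma2_j)."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z, dtype=float)
    pi = np.full(k + 1, 1.0 / (k + 1))
    sigma2 = rng.uniform(1.0, 10.0, size=k) / n_e    # random starting slab variances
    for _ in range(n_iter):
        marg_var = np.concatenate(([1.0], 1.0 + n_e * sigma2))
        dens = pi[:, None] * _npdf(z[None, :], marg_var[:, None])
        resp = dens / dens.sum(axis=0)               # E-step: responsibilities
        pi = resp.mean(axis=1)                       # M-step: mixture proportions
        v = (resp[1:] * z ** 2).sum(axis=1) / resp[1:].sum(axis=1)
        sigma2 = np.maximum(v - 1.0, 1e-8) / n_e     # M-step: slab variances
    return pi, sigma2

# Check on data simulated from a known two-component mixture
# (90% null, 10% with marginal z-variance 5).
rng = np.random.default_rng(1)
m_snps = 20_000
causal = rng.random(m_snps) < 0.1
z_sim = np.where(causal, rng.normal(0.0, np.sqrt(5.0), m_snps),
                 rng.normal(0.0, 1.0, m_snps))
pi_hat, sigma2_hat = em_spike_slab(z_sim, n_e=1000, k=1)
```

With enough SNPs, the estimated null proportion approaches the true value and the slab component picks up the heavier-tailed z-scores.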
Theoretically, a better fit to the underlying distribution can be obtained by increasing the number of components K, which can further improve the prediction performance. However, the model and computational complexity increase accordingly. If we keep increasing K beyond a certain level, an original component will be split into multiple similar components in the estimation, which does not benefit prediction. In practice, we found that K = 3 achieves both discriminability between components and good prediction performance. Therefore, we set K = 3 as the default in our method.
Next, we compare our method, EB-PRS, with six other methods: unadjusted PRS, P+T, LDpred-inf, LDpred, and the two methods proposed in So and Sham (2017) [10] and Mak et al. (2016) [11]. In the following, we briefly describe these methods. We note that the last two methods were also proposed to utilize effect size distributions for PRS calculation and need no tuning parameters or external input. We discuss their differences from our method later.

Unadjusted PRS
The unadjusted PRS is the summation of genotypic values weighted by the corresponding effect sizes estimated from marginal logistic regression, i.e.,

S_unadj = Σ_{i=1}^{m} β̂_i x_i,

where β̂_i is the estimated coefficient from the marginal logistic regression for SNP i. If no confounders are included in the regression, then β̂_i equals the observed log-OR.

P+T
The P+T method generates the PRS based on a subset of roughly independent SNPs obtained via informed LD-clumping [35] and p-value thresholding. The method first prunes SNPs according to a given threshold on the squared correlation coefficient (r²) between SNPs. It then filters the SNPs with a p-value cutoff from the marginal association tests. In this article, we set the LD-clumping r² threshold to 0.1 and varied the p-value cutoff among {1, 5e−01, 5e−02, 5e−03, 5e−04, 5e−05, 5e−06}. In the comparison experiments, we report the prediction results of the P+T method with the optimally tuned parameters.
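The thresholding step can be sketched as follows, assuming the SNPs have already been clumped; the function name and values are illustrative, not from any P+T implementation:

```python
import numpy as np

def threshold_prs(genotypes, beta_hat, pvals, p_cutoff):
    """P+T scoring after clumping: keep SNPs whose p-value is below the cutoff
    and sum genotypes weighted by their observed effect sizes."""
    keep = np.asarray(pvals) < p_cutoff
    return np.asarray(genotypes, float)[:, keep] @ np.asarray(beta_hat, float)[keep]

# Three SNPs, one individual; only the SNPs with p < 5e-2 contribute.
scores = threshold_prs([[2, 1, 0]], beta_hat=[0.5, -0.2, 0.1],
                       pvals=[1e-6, 0.3, 0.04], p_cutoff=5e-2)
```

In practice the cutoff is treated as a tuning parameter and the best-performing value is selected on a validation set, which is exactly the tuning step that EB-PRS avoids.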

LDpred-inf
The LDpred-inf method estimates the posterior mean effect sizes under an infinitesimal model as

E(β | β̂, D) ≈ ( D + (m / (n h²_g)) I )⁻¹ β̂.

Here D denotes the LD matrix, n is the sample size of the GWAS, h²_g is the heritability explained by the genotyped variants, I denotes the identity matrix, and β̂ is the vector of estimated effect sizes from marginal regression. The PRS is then

S = E(β | β̂, D)^T w,

where w represents the corresponding vector of standardized genotypes.
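Under the infinitesimal model, the posterior mean weights amount to a single linear solve. The following is an illustrative sketch with made-up values (it is not the LDpred software, and it assumes the shrinkage constant m/(n·h²_g) with h²_g the heritability):

```python
import numpy as np

def ldpred_inf_weights(beta_hat, ld, n, h2):
    """Posterior mean effect sizes under the infinitesimal model:
    solve (D + (m / (n * h2)) * I) w = beta_hat, with D the LD matrix."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    d = np.asarray(ld, dtype=float)
    m = beta_hat.shape[0]
    return np.linalg.solve(d + (m / (n * h2)) * np.eye(m), beta_hat)

# With no LD (D = I), every marginal effect is shrunk by the same
# factor 1 / (1 + m / (n * h2)).
w = ldpred_inf_weights([0.2, -0.1], np.eye(2), n=10_000, h2=0.5)
```

With correlated SNPs, the solve additionally redistributes effect size among SNPs in LD rather than merely rescaling each one.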

LDpred
LDpred is an extension of LDpred-inf that uses a point-normal prior instead of a normal prior to estimate causal effect sizes. To calculate the posterior expectation of the causal effects, Markov chain Monte Carlo (MCMC) is used, and the proportion of causal effects π_0 is a parameter that needs to be specified by users. Here, we set π_0 from {1, 3e−01, 1e−01, 3e−02, 1e−02, 3e−03, 1e−03, 3e−04, 1e−04, 3e−05, 1e−05}, which are the default values recommended in LDpred. In the following comparison experiments, we report the best prediction results with the optimally tuned parameters.

So et al.'s method

So et al. proposed to construct the PRS by calculating the effect size of each SNP in three steps. First, the standardized effect size μ is estimated non-parametrically using Tweedie's formula:

E(μ | z) = z + (d/dz) log f(z),

where f(z) is the estimated probability density function of the z-values. Then the explained liability of each SNP is assessed based on the estimated μ. The final effect size is the square root of the explained liability weighted by the corresponding local true discovery rate, i.e.,

β̂_So = ltdr · sqrt(explained liability).

Since we have theoretically shown that the true log-OR is the optimal effect size for constructing the PRS in terms of prediction accuracy, using the square root of the explained liability decreases the performance. In addition, Tweedie's formula evaluates the posterior expectation of μ without conditioning on the alternative hypothesis H_1, and we have

E(μ | z) = ltdr(z) · E(μ | z, H_1),

so the further weighting by ltdr over-shrinks the effect size estimate.

Mak et al.'s method
Mak et al. proposed to construct the PRS by directly weighting the observed effect size by the ltdr, i.e.,

β̂_Mak = ltdr(z) · β̂.

As we pointed out before, the observed effect size suffers from the winner's curse. The shrinkage estimator derived from Empirical Bayes theory is a better choice in terms of both overcoming the winner's curse and minimizing the Bayes risk.
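Both So et al.'s and Mak et al.'s estimators build on a density estimate of the observed z-scores. As an illustration of the Tweedie-formula step, E(μ | z) = z + d/dz log f(z), here is a sketch using a Gaussian kernel density estimate; the bandwidth, names, and null-data check are our own choices:

```python
import numpy as np

def tweedie(z_eval, z_all, bandwidth=0.3):
    """Tweedie's formula E(mu | z) = z + d/dz log f(z), with f(z) estimated
    by a Gaussian kernel density estimate over all observed z-scores."""
    z_eval = np.atleast_1d(np.asarray(z_eval, dtype=float))
    z_all = np.asarray(z_all, dtype=float)
    diff = (z_eval[:, None] - z_all[None, :]) / bandwidth
    kern = np.exp(-0.5 * diff ** 2)
    f = kern.sum(axis=1)                          # density up to a constant factor
    df = (-diff / bandwidth * kern).sum(axis=1)   # derivative of the same sum
    return z_eval + df / f                        # the constant cancels in d/dz log f

# Under a pure-null sample (mu = 0 for every SNP), an observed z = 2
# is shrunk most of the way back toward 0.
rng = np.random.default_rng(0)
z_null = rng.normal(0.0, 1.0, size=50_000)
est = tweedie([2.0], z_null)
```

Kernel smoothing slightly inflates the fitted density's variance, so the shrinkage is not exact, but the estimate lands far below the observed z-score, which is the behavior both methods rely on.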