Bayesian multiple logistic regression for meta-analyses of GWAS

Genetic variants in genome-wide association studies (GWAS) are tested for disease association mostly using simple regression, one variant at a time. Multiple regression can improve power by aggregating evidence from multiple nearby variants. It can also distinguish disease-coupled variants from variants merely correlated with a coupled variant. However, it requires individual genotype data, limiting its applicability when combining several GWAS. Moreover, multiple logistic regression to model binary phenotypes in case-control GWAS requires inefficient sampling schemes to integrate over the variant effect sizes. Our sparse Bayesian multiple LOgistic REgression (B-LORE) method overcomes these two drawbacks. We propose a quasi-Laplace approximation to analytically integrate over variant effect sizes. The resulting marginal likelihood functions of individual GWAS are approximated by multivariate normal distributions. Their means and covariance matrices serve as summary statistics for combining several GWAS. Additionally, B-LORE can integrate functional genomics tracks as priors for each variant’s causality. To test our method, we simulated synthetic phenotypes for real genotypes. B-LORE improved the prediction of loci harboring causal variants and the variant fine mapping. We also used B-LORE for a metanalysis of five small GWAS for coronary artery disease (CAD). We pre-selected the top 50 loci with SNPTEST / META, which included 11 loci discovered by a 14-fold larger meta-analysis (CARDIoGRAMplusC4D). While simple regression discovered only 3 of them with genome-wide significance, B-LORE discovered all of them with causal probability > 95%. Of the 12 other loci discovered by B-LORE, 3 are known from other CAD GWAS and 6 are associated with well-known CAD risk-related blood metabolic phenotypes. Software availability: https://github.com/soedinglab/b-lore.


Introduction
Common, noninfectious diseases are responsible for over 2 /3 of the deaths worldwide. These 2 diseases are usually polygenic, with many variants each contributing only a small fraction of 3 the disease risk. This made them very difficult to investigate using family studies. Genome 4 wide association studies (GWAS) have opened up a fundamentally new approach to explore and 5 understand the causality of disease development. The knowledge about the underlying biological 6 mechanisms is crucial to devise prophylactic and therapeutic treatments. 7 In the most common GWAS design, patients with diagnosed diseases ("cases") and healthy 8 people ("controls") are genotyped at several hundred thousand positions where single nucleotide 9 polymorphisms (SNPs) are relatively frequent in the population. The data is then statistically 10 analyzed to detect SNPs which have significant associations with the disease. Thousands of 11 GWAS with millions of patients have been conducted in the past decade [1] with which thousands 12 of genetic variants associated with many diseases and complex traits have been identified [2]. by being correlated to an actually causal SNP. This situation is the rule rather than the exception, 18 because SNPs are usually stronlgy correlated to dozens of nearby SNPs, up to distances of 100 19 kbp. This non-random correlation between nearby SNPs is referred to as linkage disequilibrium 20 (LD) and occurs due to the common descent of humankind from a relatively small ancestral 21 population. 22 The distinction between association (i.e. correlation) and coupling is therefore very important 23 for the prediction of causal SNPs. SNPs which are highly correlated with a causal SNP obtain 24 similarly significant p -values, making it difficult to decide which of these SNPs is really causal. A noncausal SNP in LD with multiple causal SNPs may exhibit more significant effect size for disease association than the actually causal SNPs, possibly leading to wrong interpretations of what genes are involved in the disease mechanism. 25 Fig. 1 illustrates the importance of this 26 distinction with another example, where 27 two causal SNPs are highly correlated 28 with a noncausal SNP between them. 29 The noncausal SNP may exhibit a more 30 significant p -value for disease associa-31 tion than the two causal SNPs, possibly 32 leading to wrong interpretations of what 33 genes are involved in the disease mech-34 anisms. 35 The low effect sizes of single SNPs 36 in complex polygenic diseases studied 37 in GWAS limits the power of simple re-38 gression to detect statistically significant 39 associations. Advances in GWAS have 40 focused on two aspects to improve the 41 power of detecting associations: (1) using multiple regression models for joint analysis of many 42 SNPs (multiple-SNP analyses, or polygenic modelling), and (2) using meta-analysis of multiple 43 independent studies to combine evidence from more and more patients. 44 Multiple regression models use many SNPs at a genetic region or locus as explanatory 45 variables. This improves the power of GWAS by distinguishing between correlation and coupling 46 as discussed above, as well as by aggregating evidence from many SNPs in the locus with 47 low effect size, each of which would not be detected by single-SNP tests. Bayesian multiple 48 regression models, particularly Bayesian variable selection regression (BVSR) [3,4], have been 49 shown to perform significantly better than simple regression methods, e.g. SNPTEST [5][6][7], 50 when individual-level genotype data are available. However, it is cumbersome to combine studies 51 using multiple regression analyses due to the requirement of individual genotype data and the 52 associated logistical, technical, and ethical restrictions for sharing genetic data from patients. 53 On the other hand, existing meta-analysis methods efficiently combine the single-SNP 54 summary statistics from many studies and increase the power for detecting associations by 55 collecting evidence from a larger pool of samples. The power gained from increased sample sizes 56 in meta-analyses usually outweighs the power of multiple regression models applied on a more 57 modest number of samples in an individual study. 58 Approaches that allow to combine multiple regression with meta analysis on many GWAS 59 should lead to more power for detecting associations. Cichonska et al. recently made a pioneering 60 contribution in this regard. Their tool metaCCA [8] performs canonical correlation analysis 61 (CCA) of multiple SNPs against multiple traits, based on standard SNPTEST summary statistics 62 and a genotype covariance matrix estimated from individual-level genotype data from the same or 63 a similar population. CCA is used to identify and quantify the linear association between the two 64 sets of variables, but model/variable selection is not included in the model. Therefore, metaCCA 65 requires a pre-selection of SNPs. The authors proposed to select a roughly uncorrelated set of 66 SNPs that together explain the maximum variance in a given genetic locus.

67
In this work, we present B-LORE, a Bayesian method using multiple logistic regression of 68 the case-control binary variable and a prior distribution for the effect size of SNPs modeled by a 69 two-component Gaussian mixture. The Gaussian mixture prior is motivated by BVSR [3], which 70 has been successfully applied to many GWAS [4].

71
The weights, means and variances of the two-component Gaussian mixture prior are the model 72 hyperparameters. B-LORE learns the hyperparameters from the data using the empirical Bayes 73 approach of maximizing the marginal likelihood, which is obtained from the total likelihood 74 2/19 by integrating out the unknown effect sizes of all SNPs. The total likelihood is the product over 75 the regularized likelihoods of each individual study (Eq. (14)). The regularized likelihood of 76 each study is approximated by a multivariate normal distribution, whose mean and covariance 77 matrix is obtained by L 2 -regularized multiple logistic regression. Thus the mean and covariance 78 matrices of each study serve as summary statistics that can be shared between research groups 79 for downstream meta-analysis. 80 Through simulations and application to real GWAS data, we show that B-LORE combines the 81 advantages of multiple logistic regression and meta-analysis, successfully incorporates functional 82 information of the SNPs, and outperforms state-of-the-art methods in the prediction of causal 83 loci and in finemapping of causal SNPs.

85
Model likelihood and priors 86 GWAS data consists of phenotypes φ n ∈ {0, 1} (healthy or diseased) and of genotypes w ni ∈ 87 {0, 1, 2}, where 0, 1, or 2 signify the number of minor alleles of patient n ∈ {1, . . . , N } at SNP 88 i ∈ {1, . . . , I}. The genotype is centered and normalized as where f i is the minor allele frequency of the i th SNP. Henceforth, we will denote the vector of 90 normalized genotypes for the n th sample as x n , and the N × I matrix of genotypes as X. We 91 model the effect strength, i.e. the log-odds ratio of diseased to healthy, by a linear function in 92 which each minor allele contributes independently with additive effect, The offset v 0 determines the odds ratio for a patient without any minor allele SNPs, and v i is the 94 effect size of the i th SNP. For notational convenience we define the (I + 1) th component of each 95 vector x n to be 1 in order to absorb the offset term into the scalar product. We can then write the 96 right-hand side in vector notation, v T x n . Abbreviating p n := p (φ n = 1 | x n , v), we note that the 97 above equation can be transformed to p n = 1/(1 + exp v T x n ), the standard model of logistic 98 regression. The likelihood for N patients with genotypes X = (x 1 , . . . , x N )) is therefore Usually the number of parameters p = I + 1 is much larger than the number of samples N 100 (p N). Hence, a standard logistic regression approach in which we maximize the likelihood 101 with respect to the effect sizes will lead to gross overtraining on the training data and poor 102 prediction performance on unseen test data.

103
When learning the model parameters using a maximum likelihood approach in the limit p N, 104 one common solution is to add a regularization term to the log likelihood that will push most of 105 the components of v to zero, or near zero (such as an L 1 regularizer −λ||v|| 1 or an L 2 regularizer 106 −λ||v|| 2 2 , respectively). From the viewpoint of Bayesian statistics, this approach can be motivated 107 by noting that it is equivalent to maximizing the posterior distribution p(v | X, λ) because 108 according to Bayes' theorem it is proportional to p(X | v) p(v). Maximizing the logarithm of the 109 posterior distribution is therefore equivalent to maximizing the log likelihood plus a regularization 110 term log p(v). Obviously, the L 1 and L 2 regularizers correspond to a Laplace and a normal prior 111 distribution, respectively. Because our prior expectation is that the overwhelming majority of 112 SNPs will have a negligible effect on disease risk, the prior p(v) should have most of its weight 113 narrowly distributed around zero. 114 We choose to model the prior probability of effect sizes with a more descriptive and realistic 115 two-component Gaussian mixture distribution, in which one component representing the effect 116 sizes of the non-causal SNPs is sharply peaked around v i = 0 and and the second much wider 117 3/19 component describes SNPs coupled to the phenotype: This prior is similar to various other two-component mixture priors (see [9] for an overview), e.g. 119 the one in BVSR [3], which used a delta function at v i = 0 for the non-causal SNPs and a Students 120 t-distribution for the causal ones. In our prior, θ = (β π , µ, σ) are the model hyperparameters 121 with β π defined below. The parameters π i , as in BVSR, control the sparsity of the model. We set 122 σ 2 bg to 0, corresponding to a delta function at v i = 0. This reduces our prior to a point-normal 123 distribution. The parameter σ 2 describes the variance of the effect size of the causal variants. We 124 assume that all causal SNPs will have the same variance, for which we assume a uniform prior, 125 p(σ 2 ) = const. 126 We further define z i ∈ {0, 1} as the hidden indicator variables defining the underlying causality 127 of the SNPs. Here, z i = 1 indicates that SNP i is causal and z i = 0 indicates that it is not. To 128 simplify notations, we define the vectors µ z and σ 2 z , whose i th components are µ z,i = z i µ and 129 σ 2 z,i = σ 2 bg + z i (σ 2 − σ 2 bg ) respectively. This allows us to reformulate Eq. (3) as: with the sum running over all 2 I possible causality configurations z ∈ {0, 1} I . Using we can write the prior on the effect sizes as, The above formulation also helps us to realize that π i = p (z i = 1 | θ), which gives the probability 133 of i th SNP being causal before observing the data.

134
In the simplest case the prior probability of all SNPs are same, i.e. π i = π = const. Here, we 135 consider a more flexible model in which the π i can depend on possibly informative local genomic 136 features or annotations of genetic variants by functional consequences. These could potentially 137 help causal inference for the SNPs because they bring independent sources of underlying biological 138 information about each SNP. Data tracks with informative features are becoming available for ever 139 more cell types, e.g. ENCODE data on histone modifications, chromatin accessibility enhancer, 140 promoter and coding region annotation [10], summary features from DeepSEA [11], etc. We 141 lump together these additional features into vectors ξ i (i ∈ {1, . . . , I}), each with K elements or 142 features. We model the dependency of π i on these functional features as, We also add a (K + 1) th "baseline" annotation of ξ i,0 = 1 for all SNPs whose corresponding 144 coefficient β π,0 can be interpreted as the prior odds for causality of any SNP. Only a few 145 of these K + 1 features contribute to determine the weights, and we enforce this sparsity by 146 introducing a Laplace hyperprior p (β π ) = f exp(−α| β π, f |) with α = 0.75, and f runs over 147 all the K + 1 feature tracks. The dependency of π i on β π (Eq. (6)) can be easily modified, and 148 similar dependencies can be introduced for the mean and variance of the effect size prior. configurations z for which SNP i is causal (i.e. z i = 1): Prediction of causality of each locus. Similarly, the probability for a locus to be coupled with 154 the disease phenotype is equal to the probability of the locus harboring at least one causally 155 associated SNP. This is equal to 1 minus the probability of not containing a single causal SNP: 156 Both tasks require computing for any causality configuration z, which in turn requires computing In contrast to the classical maximum likelihood approach, in which the parameters v are 159 optimized, the above method integrates out the parameters v. This is a crucial difference in 160 practice, because it eliminates the need to learn a large number p N of parameters and thereby 161 very effectively guards against overtraining. The only parameters that we still have to learn are 162 the hyperparameters θ = (β π , µ, σ). Also, by integrating out the parameters we avoid the errors 163 incurred by fixing them to noisy point estimates. We will explain how to solve this integral in the 164 next section. 166 To learn the hyperparameters θ = (β π , µ, σ), we maximize the marginal likelihood function 167 mL(θ)

Optimization of hyperparameters and quasi-Laplace approximation
This empirical Bayes approach [12] is usually 168 robust against overtraining when only few hyperparameters need to be learned from the data. 169 Inserting Eq. (10) into the marginal likelihood yields The integral on the right hand side does not have an exact solution. A common approach to 171 solve such integrals is to approximate the integrand with a multivariate Gaussian using Laplace's 172 method. The parameters of the Gaussian can be simply determined by finding the integrand's 173 mode using gradient-based optimization and setting the precision matrix to the Hessian at the 174 mode. Unfortunately, the mode depends on the causality configuration z and the hyperparameters 175 θ. That means we would need to determine mode and precision matrix for every z in the sum 176 and every time θ is changed, which is clearly infeasible. One might instead approximate only 177 the likelihood by a Gaussian. This is also problematic because the approximation will become 178 inaccurate as we move away from the mode of the likelihood, and unfortunately the region in v 179 space which contributes most to the integrand (around the mode of the integrand) can be quite 180 far from the mode of the likelihood. 181 We propose a novel approximation ("quasi-Laplace approximation") by splitting the integrand 182 into two factors, a regularized likelihood that closely approximates the integrand but does not 183

5/19
depend on θ or z and a correction term that depends on θ and z: Here,μ andσ are constants whose values will be estimated from the data. We approximate the 185 regularized likelihood by a multivariate Gaussian: Since the regularized likelihood depends neither on z nor on θ, we have to perform the gradient-187 based optimization only once to determine the summary statisticsṽ andΛ. The regularizer 188 N v |μ, diag σ 2 acts as an approximate, simple prior distribution. We learn it from the data 189 in an iterative way (usually two iterations suffice) by maximizing the marginal likelihood given a 190 first estimate ofμ andσ. The above approximation allows us to analytically solve the integral of 191 Eq. (11) (see the detailed derivation in S1 File).

192
The regularized likelihood is well approximated by a Gaussian when N 1 and the 193 number of cases and controls is similar. To see this, we note that the regularized log likelihood 194 (2)) plus a quadratic function with respect to v. The Hessians 196 of the concave functions must all have negative or zero diagonal elements and therefore their 197 sum will grow roughly proportionally with the number of patients N. In contrast to the second 198 derivatives, the third and higher partial derivatives will take both positive and negative signs. 199 If the number of diseased and control patients is roughly equal, p(φ n |x n , v) will mostly lie near 200 (1/N) n I(φ n = 1) ≈ 0.5, and therefore v T x n will be roughly as often positive as negative. 201 Therefore the third partial derivatives will tend to be close to zero and have no preferred signs. 202 The same is true of the higher derivatives. The magnitudes of the third and higher derivatives will 203 grow only as √ N because their signs fluctuate around 0 for all patients. The second derivatives 204 will increasingly dominate over the higher derivatives as N gets larger, and the log likelihood will 205 be increasingly better approximated by a quadratic function, or in other words, by the logarithm 206 of a multivariate Gaussian.

207
In summary, B-LORE works in two steps: 208 1. Calculate summary statistics for each cohort. This, in turn, requires two optimizations:

210
• Learnṽ andΛ from the data by gradient-based maximization of the logarithm of the 211 regularized likelihood obtained from Eqs. (13) and (2) and settingṽ to its mode and 212 Λ to the negative of the Hessian matrix at the mode (see S1 File for details). 213 2. Meta-analysis using summary statistics. In this step, B-LORE optimizes the hyperparam-214 eters θ by maximizing the marginal likelihood combining the summary statistics from 215 multiple studies (see below).

216
In our software, the first step can be run using the command summary and the second step can 217 be run using the command meta.

218
Factorization over loci 219 To speed up B-LORE analysis, we recommend to preselect loci with a faster method such as 220 SNPTEST [5][6][7] and to include SNPs from these preselected loci. Usually these candidate loci 221 will be in linkage equilibrium since LD is highly local. Therefore the covariance matrix X T X 222 is approximately block-diagonal, with each block corresponding to a locus. This allows us to 223 factorize the marginal likelihood in Eq. (11) as a product over all the loci (see S1 File). We 224 can therefore calculate the marginal likelihood for each locus independently, which makes the 225 evaluation of the sum over causality configurations and learning the hyperparameters from the 226 summary statisticsṽ andΛ quite efficient. 227

6/19
We note here that several other methods utilize this block-diagonal feature of LD matrix of 228 genotype. For example, BIMBAM [4] uses a factorization over loci to perform multiple regression 229 at each locus independently. However, it does not learn the hyperparameters from the data. Hence 230 it does not need to jointly analyze multiple loci and can compute summary statistics for each locus 231 separately. In contrast, B-LORE analyzes all loci jointly, which requires the summary statistics to 232 be computed for all loci jointly.
233 Meta-analysis of many studies 234 The likelihood for a single study is given in Eq. (2). We can combine multiple independent studies 235 simply by computing the total likelihood as the product of the likelihoods of each contributing 236 study s: The integrand in Eq. (11) will now have a product over multiple logistic functions. For each study, 238 we estimate the regularizer of the likelihood N v |μ s , diag σ 2 s and the summary statisticsṽ s 239 andΛ s . We apply the quasi-Laplace approximation for each study: We use this approximation to calculate the total marginal likelihood for all studies, and optimize 242 it to obtain estimates of the hyperparameters. We can then perform all the subsequent analyses 243 using the optimized hyperparameters. Unlike conventional meta-analysis methods which pool 244 aggregate allele count data of each individual SNP, the above method allows us to combine 245 information from multiple regression. For details of the derivations see S1 File. To illustrate B-LORE, we used the genotype from five German population cohorts: German 257 Myocardial Infarction Family Study (GerMIFS) I -V [13][14][15][16][17][18]. Details for quality control and 258 pre-processing of these datasets were described by Nikpay et al. [18]. Briefly, there were a total 259 of 6234 cases and 6848 controls with white European ancestry. Each cohort was imputed with 260 phased haplotypes from the 1000 Genomes Project. SNPs were filtered for MAF > 0.05 and 261 HWE p -value > 0.0001. To test the ability for using genome-wide functional genomics tracks to 262 help distinguish causal from merely correlated SNPs, we used DNase-seq data to measure DNA 263 accessibility, published by the ENCODE project [10]. We normalized the DNase-seq data for 264 112 human samples, as described previously by Sheffield et al. [19]. 265

7/19
Simulation framework 266 Simulation studies are popular to evaluate different methods of statistical analyses in GWAS, 267 because they are inexpensive and they give us access to the "ground truth". In our case, we needed 268 to know which SNPs or which loci are causal in the population. The inherent complexity of the 269 genotype data with strong linkage effects are very difficult to simulate realistically from haplotype 270 data. We therefore used real patient genotypes and real DNase-seq tracks in our semisynthetic 271 benchmarking test. We simulated the phenotypes using genomic features as described previously 272 by Kichaev et al. [20]. 273 We randomly selected 200 loci, each with 200 SNPs from the whole genome. Random 274 selection of a locus was done by selecting a random SNP from the whole genome and using the 275 chosen SNP plus the nearest 199 SNPs as the locus. While SNPs within a locus can have strong 276 LD, we made sure that all SNP pairs between different loci have LD r 2 < 0.8.

277
Each SNP had 112 functional genomics features, denoted by ξ i . For each simulation, we 278 randomly selected 3 features as significant. We then defined a baseline probability π 0 = 279 1/(1 + exp −β π,0 ). The enrichment induced by the k th feature can be defined as ψ k = π k /π 0 , 280 where π k = 1/(1 + exp −β π,0 − β π,k ) assuming that the corresponding feature is binary (i.e. 281 1 for enriched SNPs, and 0 for other SNPs). For each selected feature, we randomly chose ψ k 282 from a uniform distribution between 2 and 8, and calculated the corresponding β π,k . The β π,k 283 for the remaining 109 cell lines were set to zero. Next, we calculated π which gives the probability for each SNP to be causal. The prior probability of a locus to be 285 causal is equal to 1 minus the probability of not containing a single causal SNP, which can be 286 obtained as  Once we established the causal SNPs, we used a linear model to simulate continuous 293 phenotypes y n = i v i x in + ε j such that the causal SNPs aggregated to explain a fixed proportion 294 of the phenotypic variance h 2 g . This phenotypic variance was partitioned equally amongst all 295 the causal SNPs. The environmental contribution given by ε j , was assumed to be normally 296 distributed ε j ∼ N 0, 1 − h 2 g . In our simulations, we used a heritability of h 2 g = 0.25 as is used 297 typically in GWAS simulations [20]. 298 We obtained the binary phenotype for the 5 cohorts using the classical liability threshold 299 model [21,22]. The model assumes that the binary disease status results from an underlying 300 continuous disease liability that is normally distributed in the population. If the combined effects 301 of genetic and environmental influences push an individual's liability across a certain threshold 302 level, the individual is affected. In the population, the proportion of individuals with a liability 303 above the threshold is reflected in the disease prevalence. The observations on the risk scale and 304 the liabilities on the unobserved continuous scale can be related by a probit transformation [22]. 305 We used the continuous phenotype y n as the disease liability. Any individual with disease liability 306 exceeding a certain threshold T was assigned to be a case and a control otherwise. T is the 307 threshold of normal distribution truncating the proportion of K (disease prevalence). We used 308 K = 0.5 to obtain roughly equal number of cases and controls. 309 We repeated the simulations 50 times. For all simulations, we used the same genotype and 310 functional annotations. We resampled which functional features are significant, which loci and 311 which SNPs are causal and simulated new phenotype for every repetition of the simulation.

313
Prediction of causal loci 314 We tested two widely used software packages employing conventional methods for association 315 testing in GWAS: (1) SNPTEST [5][6][7] on each cohort, followed by meta-analysis using META [23] 316 (designated as SNPTEST / META henceforth). (2) Meta-analysis with BIMBAM [4] using the 317 -psd option to collect summary statistics on each cohort and -ssd to combine them. BIMBAM 318 can perform multiple regression and was earlier shown to outperform SNPTEST when given 319 access to individual genotype data. However, it falls back to simple single-SNP regression when 320 performing meta-analysis. 321 Furthermore, we tested the recent metaCCA software, which can perform multivariate meta-322 analysis from summary statistics. The method requires choosing the relevant SNPs beforehand 323 since it has no shrinkage in the model for variable selection [8]. We ranked the SNPs based on 324 p -values obtained from SNPTEST / META, and chose the 10 best SNPs successively such that 325 each chosen SNP had little correlation r 2 ≤ 0.8 with all previously chosen SNPs. In other 326 words, once a SNP is chosen, all SNPs which had correlation r 2 > 0.8 were removed from 327 the ranked list. The method also requires the matrix of LD correlation coefficients between the 328 selected SNPs at each locus for each study. We calculated the LD matrix for each study directly 329 from the individual level genotype using LDstore [24]. 330 It was shown previously that genome-wide annotations and data tracks such as DNA accessi-331 bility measurements from DNase-seq are enriched in causal SNPs [25,26] and can be used for 332 improving the finemapping of causal SNPs [20,27]. We therefore extended B-LORE, making 333 the mixture weight π i of the causal part of the effect size distribution for SNP i depend on these 334 functional genomics data tracks at the position of SNP i (see Eq. (6)). We ran B-LORE with and 335 without using the genome-wide functional genomics tracks and denoted the latter as B-LORE FG. 336 Fig. 2 summarizes the performance of different methods in predicting the causality of loci 337 using synthetic phenotypes. First, we compared the ranking of loci by SNPTEST/META and 338 B-LORE ( Fig. 2A). For SNPTEST/META, loci were scored by the − log 10 (p) value of their most 339 significant SNP. For B-LORE, loci were scored by the easily interpretable posterior probability of 340 the locus to be causal, i.e., to contain at least one causal SNP (Eq. (8)). The two scores agreed well 341 on those loci that are confidently predicted by SNPTEST/META: All loci with high − log 10 (p) 342 values (> 5) also got high posterior probabilities (> 0.95), and all loci with low − log 10 (p) values 343 (< 1) also get low posterior probabilities (< 0.05). However, a sizeable fraction of loci with high 344 B-LORE posterior probabilities (> 0.95) had low significance in SNPTEST/META (− log 10 (p) 345 values < 3) even though they were all causal. Loci with − log 10 (p) values between 1.0 and 4.0 346 were classified with higher accuracy by B-LORE. Overall, the noncausal and causal loci were 347 much better separated by B-LORE scores along the vertical axis than by SNPTEST/META along 348 the horizontal one.

349
For a more quantitative analysis of the prediction performance, we performed a precision-recall 350 analysis (Fig. 2B). For each of the prediction tools, the 50 × 200 scores were ranked, the true 351 positive (TP) and false positive (FP) predictions were counted up to different threshold scores, and 352 the precision (TP/(TP + FP)) was plotted as a function of recall (= sensitivity) (TP/(TP + FN)). 353 BIMBAM and SNPTEST / META perform meta-analyses on single-SNP summary statistics 354 and provide similar accuracy. Surprisingly, metaCCA failed to improve the classification in our 355 benchmarks, probably due to improper preselection of the relevant SNPs. B-LORE significantly 356 improved the ranking of loci over the next best tool. The results suggest that multiple regression 357 can extract more information from the data. B-LORE FG further improved the prediction of causal 358 loci, suggesting that our tool can efficiently incorporate information from functional genomics 359 tracks.

360
To demonstrate the advantage of distinguishing coupling from correlation, we chose 8 loci 361 from a strongly correlated genomic region of chr6, while the remaining 192 loci were allowed 362 to remain in linkage equilibrium. We also constructed synthetic phenotypes in 50 simulations 363 as before. This ensured that these 8 loci would be correlated among them, and would lead to 364 situations as described in Fig. 1: There would be some noncausal SNPs in a noncausal locus 365 which are correlated with causal SNPs in a nearby locus. B-LORE was much less affected as 366 compared to other methods in presence of such correlations (Fig. 2C). All other methods found 367 many non-causal loci to be highly significant because they were correlated with the coupled loci. 368 leading to many false positive predictions with high significance, seen by the dip at low recall 369 values.

370
The successful predictions of B-LORE might seem counter-intuitive at first because it neglects 371 the interlocus genetic correlation in order to factorize over loci for the hyperparameter optimization. 372 We explain in Fig. 2D  k along the x−axis and a causal SNP j in locus m along the y−axis. The true marginal likelihood 375 of these 2 SNPs shows their correlation (non-zero off-diagonal terms in the covariance matrix). 376 In the calculation of B-LORE the off-diagonal terms are assumed to be zero because they are in 377 different loci. Still, it learns the diagonal terms of the covariance matrix because it maximizes the 378 genome-wide marginal likelihood which includes information from all the loci. The likelihood 379 learned by B-LORE is a fairly good approximation of the true likelihood and is obviously better 380 than getting point estimates for each SNP. It can distinguish the two loci: v jm explains away 381 the effect of v ik and only the true causal locus m is picked. In contrast, methods that deal with 382 each locus separately will find associations for both m and k. Earlier attempts of distinguishing 383 correlation and coupling were confined to each locus independently, for example, in finemapping 384 of causal variants within a locus.

386
Prioritization of variants within the associated region or loci (popularly called finemapping) has 387 been an important focus of post-GWAS era to provide insight into disease mechanism. For a 388 recent review of finemapping, see Spain et al. [28]. Over the past few years, several methods 389 have been proposed for finemapping of causal variants, i.e. to pinpoint individual causal SNPs. 390 We compared B-LORE with 3 finemapping methods, particularly PAINTOR (v3.0) [20,29], 391 CAVIARBF (v0.1.4.1) [30,31] and FINEMAP (v1.1) [32]. These methods require only the 392 summary test statistics and a matrix of the pairwise correlation coefficients (r 2 ) of the variants in 393 each associated region. We used the meta-analysis results from SNPTEST / META as input to 394 these methods. To obtain r 2 of the meta study, we combined the study-specific LD matrices by 395 weighting them by sample size, i.e. r = j (r j N j )/ j N j , where r j and N j are the correlation 396 coefficient matrix and sample size for the j th study respectively. For all methods, we used the 397 default settings. Owing to computational limitations, we allowed a maximum of 2 causal SNPs 398 per locus in CAVIARBF, and 4 causal SNPs per locus in FINEMAP and B-LORE. We also 399 allowed B-LORE and PAINTOR to use the functional genomics tracks, and denoted them as 400 B-LORE FG and PAINTOR FG respectively.

401
B-LORE achieved superior accuracy over existing methodologies in identifying causal 402 variants (Fig. 3). For each of the 50 simulations, we ranked the 40000 scores from each of the 403 prediction tools. In Fig. 3A, we plotted the recall (TP/(TP + FN)) against the average number 404 of SNPs which were selected per locus ((TP + FN) /200) at different threshold scores. All 405 characteristic finemapping methods (i.e. PAINTOR, CAVIARBF and FINEMAP) performed 406 better than conventional GWAS methods (i.e. SNPTEST/META and BIMBAM) when selecting 407 less than 20 SNPs per locus on average. The recall of B-LORE was higher than all other methods 408 at practically relevant low selection thresholds. B-LORE improved the recall by more than 5% 409 from the next best tool when selecting 20 SNPs per locus.

410
Using ENCODE data for the SNPs significantly improved the performance of B-LORE and 411 PAINTOR. Both methods use the same logistic model (Eq. (6)) to describe the enrichment of 412 causality of a SNP from functional genomic features. Yet B-LORE FG provided better recall than 413 PAINTOR FG. B-LORE FG could identify more than 65% of the causal SNPs when selecting 414 only 20 SNPs per locus, as compared to 55% by PAINTOR FG at the given threshold. It should 415 be noted here that the gain in performance in simulation with functional genomics tracks does 416 not ascertain similar gains in real data, because the true enrichment strengths are yet unclear.

417
All finemapping methods got worse than conventional GWAS methods at higher selection 418 thresholds (see S3 Figure). Beyond a certain threshold, B-LORE predicts all SNPs with a causal 419 probability of zero when it can no longer distinguish between them. This happens because 420 learning hyperparameters from the data requires summing over causality configurations and 421 we restrict the sum to those configurations with non-negligible probability of being causal to 422 achieve the same in computationally reasonable time frame (see S1 File). Hence configurations 423 without enough evidence are removed from the calculation and the corresponding SNPs get zero 424 causal probability (Eq. (7)). Similarly, all other finemapping methods enforce sparsity in different 425 ways and hence show computational artefacts at high thresholds. However, performance at high 426 selection thresholds is not important for any practical purposes because the aim of finemapping 427 is to pick as few SNPs as possible for downstream analyses and experiments.

428
In classification of sparse binary data, as the one we are dealing with here, recall can hide 429 a lot of imprecision. However, improving precision presents more of a challenge. Hence, we 430 did a precision-recall analysis of the predictions by the different tools (Fig. 3B). PAINTOR, 431 CAVIARBF and FINEMAP had higher precision than conventional GWAS methods, but all 3 of 432 them performed similarly. B-LORE had higher precision than all other tools at all recall values. 433 Use of functional genomics tracks significantly improved the precision for B-LORE FG and 434 PAINTOR FG. Overall, B-LORE FG provided higher precision than PAINTOR FG. LD (correlation coefficient r 2 c > 0.9) with a true causal SNP (Fig. 3C). Approximately 10% of 439 the SNPs were correlated to a causal SNP averaged over all simulations. When recall was > 0.3, 440 the false positives from B-LORE had the highest fraction of correlated SNPs as compared to all 441 other methods. B-LORE not only predicted the true positives with high precision, but also the 442 false positives were strongly correlated to actually causal ones.

443
Application to coronary artery disease 444 As a proof of concept, we applied B-LORE to identify SNPs affecting risk of coronary artery 445 disease (CAD) from GWAS of five small cohorts (GerMIFS I -V). These cohorts were also 446 used in the largest meta study of CAD GWAS till date, organized by CARDIoGRAMplusC4D 447 Consortium [18], which found 58 significant loci harboring SNPs in genome-wide significant 448 association with CAD. The study leveraged the power from meta-analysis of ∼ 185,000 CAD 449 cases and controls, while the GerMIFS I-V cohorts have only ∼ 13,000 CAD cases and controls. 450 Comparison of ranking of 50 genetic loci using meta-analysis across 5 cohorts (GerMIFS I-V [13][14][15][16][17][18]) with a total of 6234 cases and 6848 controls from white European ancestry. We first used meta-analysis of genome-wide SNPTEST summary statistics on these 5 small GWAS to select the top 50 loci and then applied B-LORE on these loci using the genomic features obtained from ENCODE data and assuming a maximum of 4 causal SNPs per locus. On the x-axis of the scatter plot, we show the − log 10 (p) values obtained from META, and on the y-axis we show the probability of a locus being causal, obtained from B-LORE FG. The legend shows the classification of all the 50 CAD loci based on prior evidence of association in existing literature (see Application to coronary artery disease and S2 Table). This literature-based classification gives a reasonable "ground truth" of causal and non-causal loci, despite our incomplete knowledge about true underlying association in reality. The noncausal and causal loci are much better separated by B-LORE FG scores along the vertical axis than by SNPTEST/META along the horizontal one.
We performed a meta-analysis using SNPTEST summary statistics of the five cohorts, and 451 found 3 loci to be genome-wide significant, namely the 9p21 locus on chr9, PHACTR1 and 452 SLC22A3-LPAL2-LPA loci on chr6. From this meta-analysis, we ranked the SNPs according to 453 their p -values and selected all the nominally significant SNPs with p -values < 5 × 10 −5 . We 454 grouped these SNPs together based on their genomic positions. SNPs which were spatially close 455 (within ± 200 kb) were included in the same locus. We then used the top 50 groups, and defined 456 each locus by collecting the top 400 SNPs (ranked according to their p -values) within ± 200 kb 457 regions of each lead group. 458 We used these 50 loci for meta-analysis with B-LORE. In practice however, one can use as 459 many loci as desired and use more sophisticated definition of a locus, for example, by considering 460 LD blocks. We generated summary statistics for each of the 5 cohorts and combined them using 461 13/19 B-LORE meta-analysis. For the meta-analysis we used the 112 functional genomics tracks from 462 ENCODE data (see Datasets) and allowed a maximum of 4 causal SNPs per locus.

463
Since the ground truth is unknown for real data, we did an extensive blinded literature search 464 for all these 50 loci. For details about the genomic positions, exons in the region and prior evidence 465 of association with CAD please see S2 Table. We classified these 50 loci into 6 categories based 466 on this prior evidence: 467 • 8 loci harbor SNPs which were found to be statistically associated with CAD in the 468 CARDIoGRAMplusC4D study [18], the largest GWAS of CAD till date.

469
• 3 loci have significantly associated CARDIoGRAMplusC4D SNPs within ± 400 kb [18] 470 • 3 loci harbor SNPs which were found to be statistically associated with CAD in other 471 GWAS.

472
• 11 loci have evidence of statistical association with risk factors of cardiovascular diseases 473 (CVD), such as myocardial infarction, blood pressure, sudden cardiac arrest, heart failure, 474 high-density lipoprotein cholesterol levels, triglyceride levels, etc.

475
• 3 loci are associated with obesity and related traits.

476
• We could not find any statistical association with CAD or its risk pathway for the remaining 477 22 loci. 478 We compared the ranking of these loci using B-LORE and SNPTEST/META in Fig. 4, For 479 ranking, we used the same scores as already introduced for Fig. 2A. Unlike simulations, we 480 did not know the "ground truth" in this real data, but we used the literature classification as 481 qualitative indication for accuracy. Despite the modest sample size, B-LORE identified all the 11 482 CARDIoGRAMplusC4D loci (from the first 2 categories) present in the selected pool of loci 483 with Pr causal > 0.95. For comparison, SNPTEST / META could identify only 3 genome-wide 484 significant loci. Additionally, B-LORE predicted 12 other novel loci with Pr causal > 0.95. Three 485 of them are significant hits in other CAD GWAS, and 6 of them are significantly associated 486 with CVD risk-related phenotypes, such as blood pressure, high density lipoprotein cholesterol, 487 triglyceride levels, etc. B-LORE also predicted low probabilities for many loci with no prior 488 evidence of association despite being highly ranked by SNPTEST / META.

489
Three loci with no prior evidence of association with CAD were ranked with posterior 490 probabilities Pr causal > 0.95 by B-LORE. One of these loci is located in the AUTS2 gene region 491 in chr7, and another is located in chr10 overlapping with FGFR2 and ATE1 genes. The third one 492 is located in chr4q13.1, with no exons in the region and the nearest gene being LPHN3 ∼500kb 493 upstream. Due to lack of validation, it is unclear whether these loci are truely coupled to the 494 disease risk or if they are false positives. 495 We show the finemapping performance of B-LORE at two example loci -a known risk locus 496 near SMAD3 in chr15 (Fig. 5A) and a novel risk locus at 12q24.31 (Fig. 5B). At the SMAD3 497 locus, B-LORE picked up a single SNP (rs34941176) for explaining the association, while other 498 SNPs from the region showed negligible probability of being associated. In the novel locus, 499 there are three genes, DNAH10, ZNF664 and CCDC92, which are believed to be associated with 500 multiple CAD risk factors such as high-density lipoprotein cholesterol level, triglycerides levels 501 and waist-to-hip ratio [33][34][35] Here, B-LORE prioritized three SNPs (rs1187415, rs7961449 502 and rs6488913) in a region with strong LD. All SNPs in this region showed similarly significant 503 p -values. In both the above cases, B-LORE prioritized SNPs which are different than the ones 504 with lowest p -values in the region.

505
Indeed in most loci we find that B-LORE prioritizes a few SNPs, while SNPTEST / META 506 predict similar p -values for many SNPs (S4 Figure). Due to lacking validation it is unclear so far 507 whether the SNPs prioritized by B-LORE are actually causal. 508 In many loci that obtained a high probability to be coupled to disease risk, none of the 509 SNPs have significant probability of being causal (S4 Figure). This observation illustrates the 510 considerable advantage of a Bayesian method that can accumulate evidence for coupling over 511 many, sometimes weak, SNPs. In this way, highly confident predictions can result even though 512 no single SNP is considered significant by single-SNP analyses.  Figure 5. Representative examples of finemapping in CAD-associated loci. The top parts in A and B show the probability for each SNP to be causal as predicted by B-LORE (p (z i = 1 | φ, X, θ)).
Below we plot the − log 10 (p) values for each SNP obtained from SNPTEST / META. The four best SNPs predicted by B-LORE and SNPTEST / META are marked by special symbols and annotated in the legends. At the bottom, the genes and LD between the SNPs is shown. A: A known locus near SMAD3. (A SNP rs56062135 at 67.45Mb was found associated with CAD by the CARDIoGRAMplusC4D study [18]). The probability for finding at least one causal SNP in the locus is Pr causal = 0.999. B: A novel locus discovered by B-LORE, with Pr causal = 0.976.

514
Any method for GWAS meta-analysis that can predict substantially more risk loci at a given 515 precision or that can better distinguish the truly disease-coupled SNPs more accurately from 516 the merely correlated ones has enormous leverage. It can be applied to thousands of GWAS 517 involving millions of patients to help understand the origin of all common diseases in humans [2]. 518 Most GWAS have been analyzed using the simplest type of approach based on testing each 519 single variant in turn, which allows combining datasets using simple aggregate count summary 520 statistics. Previous work has shown that multiple regression and also Bayesian approaches have 521 the potential to yield better predictions [4], but their applicability was limited by the requirement 522 of individual genotype data, which precludes meta analyses.

523
Our software B-LORE for Bayesian multiple logistic regression can perform meta-analysis 524 using a novel type of summary statistics. B-LORE outputs easily interpretable probabilities 525 for each locus to harbor causal SNPs, as well as probabilities for each SNP to be coupled, not 526 only associated, with the phenotype. These probabilities contain all the information required for 527 further downstream analyses.

528
This study makes the following technical contributions: (1) It introduces a novel quasi-Laplace 529 approximation which makes the Bayesian treatment of the multiple logistic regression case 530 analytically tractable and yields an efficient software implementation obviating the need to use 531 MCMC sampling. (2) B-LORE learns the model hyperparameters from the GWAS data. One set 532 of hyperparameters describes the effect size distribution, another (optional) set describes how the 533 functional genomics tracks modify the prior probability of a SNP to be causal. (3) We show how 534 to calculate the marginal likelihood over many loci by factorizing the likelihood over the loci. (4) 535 The model can integrate genome-wide tracks from functional genomics and other sources. 536

15/19
B-LORE is similar to several successful GWAS analysis methods based on Bayesian multiple 537 regression. It models the effect size distribution of the causal SNPs using a single normal 538 distribution, whereas BVSR models it using a normal distribution whose precision (inverse 539 variance) is sampled by MCMC using a Gamma prior. This is essentially equivalent to assuming 540 a Student's-t prior for the causal component, because the latter is obtained as convolution of 541 a normal distribution with a Gamma distribution for the precision parameter. The Student's-t 542 distribution is more flexible, as its third parameter can be used to tune the heaviness of the 543 distribution's tails. However, this more flexible prior requires computationally expensive Markov 544 Chain Monte Carlo (MCMC) schemes for the integration over the effect sizes. Using a simpler 545 prior allowed us to find an analytical solution.

546
The analytical integration also allows B-LORE to learn the parameters of the prior distribution 547 from the GWAS data itself, whereas most existing Bayesian frameworks in GWAS, including 548 BVSR, work with a fixed prior distribution of effect sizes. Another method that is able to learn the 549 hyperparameters for the effect size distribution from the data is the Bayesian Sparse Linear Mixed 550 Model (BSLMM) [9], which, similar to B-LORE, uses a mixture of two normal distributions. 551 Any hyperparameter optimization method is limited by computational speed and the requirement 552 of individual level data. Our new quasi-Laplace approximation provides a solution to both 553 these problems. B-LORE thereby extends the scope of Bayesian multiple regression methods 554 to hundreds of loci over hundreds of studies, where the loci can be preselected with a lenient 555 p -value cutoff using a simpler and faster method such as SNPTEST [5][6][7].

556
In parallel work, Zhu and Stephens have proposed a clever method with similar aims as B-557 LORE but adapted to quantitative phenotypes using linear multiple regression, called Regression 558 with Summary Statistics (RSS) [36]. RSS uses summary statistics from simple regression methods 559 like SNPTEST [5][6][7]. Unlike RSS, our work uses logistic regression and is thus specifically 560 adapted to the case-control GWAS design, by far the most frequent type of GWAS.

561
B-LORE is robust and should perform well on any sufficiently large dataset. An important 562 caveat is that B-LORE needs a sufficient number of causal variants to be present in the GWAS 563 data it analyzes, because it needs to estimate the hyperparameters from them. If too few causal 564 loci are hidden in the data, the hyperparameter optimization could end up at unrealistic values, 565 e.g. by using both components of the Gaussian mixture to describe the non-causal SNPs. In such 566 a situation B-LORE would predict all SNPs as non-causal. We determined from simulations 567 that it is enough to have only 10 coupled SNPs for proper learning of hyperparameters (data not 568 shown). These SNPs could either be present in a single locus or distributed over many loci. This 569 requirement should be easily fulfilled in all practical cases because the number of validated SNPs 570 for most diseases investigated with GWAS are in the range of 20 − 100.

571
A central goal of GWAS is to learn what are the genetic factors that determine the risk 572 to acquire a noninfectuous disease or other, quantitative phenotypes. Yet despite ever larger 573 GWAS and advances in analysing the genotype-phenotype relationship, only relatively small 574 fractions of heritable variance can be explained for most diseases investigated [37]. One part 575 of this missing heritability might be due to the limitations of logistic regression models used 576 for risk prediction [38]. We hope to investigate this in the future by applying B-LORE to 577 predict the genetic component of disease risk from the genotype. We will also explore more 578 systematically how best to improve predictive performance by adding genome-wide experimental 579 and computational data tracks [11,20,39] that can inform us about the probability of each SNP to 580 be causal. 581

16/19
Supporting Information 582 S1 File. Additional Methods 583 S2 Table. Details of the 50 genetic loci used in this study, their literature classification and 584 significance score for association with the diseases. 585 S3 Figure. Extension of Fig. 2A showing the full range of selection thresholds 586 S4 Figure. Finemapping of 50 genetic loci using B-LORE meta-analysis from 5 small GWAS 587 (GerMIFS I-V) 588