SMetABF: A rapid algorithm for Bayesian GWAS meta-analysis with a large number of studies included

Bayesian methods are widely used in the GWAS meta-analysis. But the considerable consumption in both computing time and memory space poses great challenges for large-scale meta-analyses. In this research, we propose an algorithm named SMetABF to rapidly obtain the optimal ABF in the GWAS meta-analysis, where shotgun stochastic search (SSS) is introduced to improve the Bayesian GWAS meta-analysis framework, MetABF. Simulation studies confirm that SMetABF performs well in both speed and accuracy, compared to exhaustive methods and MCMC. SMetABF is applied to real GWAS datasets to find several essential loci related to Parkinson’s disease (PD) and the results support the underlying relationship between PD and other autoimmune disorders. Developed as an R package and a web tool, SMetABF will become a useful tool to integrate different studies and identify more variants associated with complex traits.


Introduction
Genome-wide association study (GWAS), a powerful tool to find out the associations between genetic variations and phenotypes, has received more and more attention in the field of statistical genetics and epidemiology [1]. Numerous variants, typically many common single nucleotide polymorphisms (SNPs), are identified linked with complex traits. However, since single variant's genetic effect on polygenic traits is relatively small, large sample sizes are often required to increase the statistical power [2]. Besides, due to the population stratification and other unobserved confounders, the estimated effect sizes in different studies are divided or even contradictory [3]. Therefore, it has become an increasingly essential challenge to make sufficient use of summary statistics derived from a wide range of studies and to attain pooled statistics through meta-analysis [4], especially when the requirement of data security and privacy makes individual-level data increasingly difficult to obtain [5,6].
Either the fixed-effect model (FEM) [7] or the random-effect model (REM) [8] is conventionally used to derive a pooled effect size, depending on the assumption on heterogeneity [9]. However, the p-value is dependent on the sample size and minor allele frequency (MAF) of the variant. Therefore, it is improper to use a single threshold [10]. Besides, the relationships between true effect sizes in different studies are hard to be considered in both FEM and REM [11]. On the contrary, it is easy to involve them into the model as a prior in the Bayesian framework. The Bayesian method is also prevalent for researchers for it is more intuitively explainable [12]. Recently, a promising method based on the Bayesian framework named MetABF has already been proposed [13]. With GWAS summary statistics, it could conveniently estimate the pooled associations between multiple traits and genetic variations in different associated models across studies. However, with the rapidly increasing data of GWAS, the method also confronts the challenge of exponential explosion in both time and space consumption. Since it requires traversing 2 n subsets represented by n-dimensional vectors to compute the optimal ABF, the considerable time and memory consumption required makes the computation almost impossible as the number of studies n increases.
In this article, we propose SMetABF, a method based on the Markov chain Monte Carlo (MCMC) method and its extension named shotgun stochastic search (SSS) [14] to speed the process of subset selection. SSS is proved to be superior in speed, accuracy, and stability through simulation. Based on SSS, we introduce SMetABF to obtain the maximum ABF in a large-scale meta-analysis quickly. SMetABF is implemented as an R package and the code is available at https://github.com/sjl-sjtu/GWAS_meta.

Asymptotic Bayes factor
Different from the traditional statistical framework based on the p-value for statistical inference, the Bayes factor (BF) is used in Bayesian statistical framework. BF is defined as the relative size of the likelihood to observe data under the null hypothesis (H 0 ) or the alternative one (H 1 ), where D stands for the data observed, β is the effect parameter we are interested in, γ is the parameter vector of confounders, and π(�) stands for the prior of β and γ. In general, BF > 1 means more inclined to accept H 1 , and on the contrary, 0 < BF < 1 means more inclined to accept H 0 . Since BF is difficult to calculate directly in many studies, an asymptotic Bayes factor (ABF) is proposed as an alternative [10]. If P(Y|β, γ) is replaced by the asymptotic distribution Pðb;γjb; γÞ, and the marginal prior for β instead of the joint prior π(β, γ) is considered, the probability of obtaining the parameter β under a certain hypothesis could replace the probability of observing data Y, written as For a study aimed to measure the association between several risk factors and specific outcomes, like GWAS, letb be the estimated size of the association. It is assumed to obey the normal distributionb where β is the true effect size of the variant, and SEb represents the estimated standard error. The true effect size β is also assumed to follow a normal distribution b � Nð0; s 2 Þ; where σ 2 represents the prior variance of the true effect size. When σ = 0, the distribution of β degenerates to a point, which means β = 0. In other words, the genetic variant has no effect on the outcome. Under H 0 : σ = 0, ABF can be calculated as Each study included in the meta-analysis provides an estimated effect size,b. When there are n studies in the meta-analysis, letβ be the estimated effect vector andβ follows a multivariate normal distribution Nðb; Vβ Þ, where β stands for the true effect vector, and Vβ represents the covariance matrix of the estimated standard errors, expressed as in which SE i is the standard error ofb i in the i-th study, and r i,j is the correlation between the i-th and j-th studies. For each study, the prior effect size is σ i , and the prior correlation coefficient between two studies is ρ i,j , then the prior matrix S is With the estimated effect vector and covariance matrix of the estimated standard errors, ABF for meta-analysis could be calculated as Similarly, H 0 is defined that S equals to the zero matrix.

Prior
The assumption on the heterogeneity among studies is critical in prior selection. Table 1 provides different models for prior under the assumption that σ and ρ remain the same for all studies included in the meta-analysis.
Since both the null model and the complete model can be regarded as special cases of the subset model, the subset model is adopted in the meta-analysis. But it also brings a tricky issue to determine the optimal subsets. It is preferred to get a higher ABF score in the meta-analysis, for it means that H 0 is harder to be accepted and the probability of the type II error decreases. In other words, it increases the statistical power, which equals one minus the probability of the type II error.

Model selection
2.3.1 Subset-exhaustive. The former study [13] performs model selection by traversing subsets to select the highest ABF, named the subset-exhaustive method (EXH). For a metaanalysis including n studies, there are in total 2 n different subsets. It requires taking all these subsets as a prior one by one to calculate ABF, and then find the optimal one. The time consumed will explode exponentially as the number of studies included increases. At the same time, the memory required to store all subsets (2 n n-dimensional 0-1 vectors) has also expanded dramatically. Therefore, it becomes quite essential to introduce a method to get higher ABF quickly.

MCMC.
A commonly used method to quickly find the best subset is the Markov chain Monte Carlo method (MCMC). Here MC 3 algorithm [15,16] is used to define the transition function and the Metropolis-Hastings algorithm is used for sampling. The whole process is carried out as Algorithm 1.
Randomly select a subset as the initial prior model x 0 , and calculate the ABF.
2. For the current model x t , define the neighborhood as a set constituted of subsets formed by adding or deleting an element from the current subset, as well as the current model itself.
The proposal distribution is defined by equalizing the sampling probability of all models in the neighborhood. In other words, the sampling probability of all models in the neighborhood remains equal, and the transition probability of all models outside the neighborhood is 0. Since each model has the same size of the neighborhood, the proposal distribution is symmetric.
3. Generate the alternative prior model y according to the transition probability, and then calculate the ABF. The discriminant function is defined as h ¼ minf1; ABFðyÞ represents the ABF value with subset x as the prior.
4. Generate a random number u that follows the uniform distribution U(0, 1). If u < h, accept y as a new step of x t+1 , and otherwise, x t+1 = x t .
5. Repeat steps 2-4 until the maximum number of iterations or stable distribution is reached.
The first half of the entire iterative sequence is used for the warm-up and the second for the final sampling.

Shotgun stochastic search.
Here we introduce an extension of MCMC for variable selection named shotgun stochastic search (SSS) [14]. It can be used to fast detect the optimal ABF following the procedures as below (see Algorithm 2): 1. Let Γ donate a set containing up to B optimal models. Randomly select the initial model x 0 , set Γ = {x 0 }, and calculate the score of the model S(x) = ABF(x).
2. For the current model x t , define models that add or delete or replace an element from the current subset to constitute the sets Γ + , Γ − , and Γ � , respectively, and then define the neighborhood models with the lowest scores.
4. Then take a sample from x + , x − , and x � , with the score S(x) as sampling weight, and let the sample be the new model x t+1 .
The former study [13] has provided R code for EXH. Here R functions for meta-analysis by MCMC and SSS are constructed.

The construction of simulated datasets
Several parameters are given to build the simulated datasets, including the incidence of the disease in the population (p), the frequency of the major allele of the studied variant (f, which equals to 1-MAF under the assumption that there are only two alleles in the SNP), the effect size (odds ratio, OR), and the sample size of both case and control groups (which is assumed to be the same, n). For example, suppose A is the risk allele while G is the non-risk allele. If the dominant model is applied, both AA and AG can be considered as equivalent risk genotypes while GG is non-risk. Suppose baseline effect is α, the increased effect on prevalence by risk genotype is θ, then where D represents the outcome (disease). Then we can get and Then α and θ can be calculated. According to the Bayes Theorem, the probability of risk and non-risk genotypes in the case and control groups can be calculated.
And then, the simulated genotypes in both the case and control groups could be randomly generated under binomial distribution. The estimated c OR can be calculated. The effect size is defined as β = ln OR, and similarly,b ¼ ln c OR. The standard error can be estimated from the contingency table, as SEb ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi Suppose there are N studies included in the meta-analysis. For each study, the true effect OR i obeys the normal distribution N(OR, SE 2 ). The sample size n i in each study is generated as a random integer in a given range, and p and f remain the same for all studies. Through the process above,b i and SEb i of each study can be estimated.

Results
The ABFs calculated under different true ORs are shown in Fig 1. The overall trends of the ABF obtained by EXH, MCMC, and SSS remain consistent, corresponding to the p-value obtained by the traditional method. When the true OR approaches 1, the p-value increases, while the ABF value decreases to 0. However, when the sample size of the study included is small (Fig 1A), the change of p-value will be unstable if true OR is near to 1, which will affect the analysis. Besides, the ABF calculated by SSS almost coincides with the optimal ABF curve obtained by EXH, which shows the validity of SMetABF. Figs 2 and 3 show the performance of each algorithm in accuracy and speed under different priors and iterations, respectively. Since in SSS, ABF is calculated under all models in the neighborhood in one iteration, much more models will be calculated by the SSS with the same number of iterations. Therefore, the number of iterations of SSS is set to be 100, 200, 500, 1000 and 2000; while that of MCMC is set to be 1000, 5000, 10000 and 20000. To compare the averaged ABF and time consumed, the algorithm is repeated 100 times under each condition. The

Meta-analysis on the variants related to PD and other autoimmune disorders
Here an application is performed to measure the risk variants associated with Parkinson's disease (PD), a common chronic neurodegenerative disease among the elderly population. Its

PLOS COMPUTATIONAL BIOLOGY
common clinical manifestations include tremors, slow movement, and disorders in balance and movement posture. PD has been reported to be associated with both genetic variations [17] and environmental factors like personal lifestyles such as smoking and drinking [18,19], but the detailed mechanism remains unclear. Recent studies discuss the potential relationship between PD and autoimmune disorders [20]. To explore the underlying relationships, we conduct a GWAS meta-analysis across PD and three common autoimmune disorders: inflammatory bowel disease, multiple sclerosis, and systemic sclerosis.
Through the websites accommodating GWAS datasets, including DistiLD [21] (http:// distild.jensenlab.org), Open Targets Tables 2 and 3 show detailed information about the studies included. A pure meta-analysis across 29 studies on PD is conducted firstly, and then all 59 studies are analyzed jointly to obtain a mixed association pattern. The effects of over 10 million variants are assessed through parallel computing. The Manhattan plots for both the pure pattern and the mixed pattern are shown in Fig 5. We can find PD is highly associated with several loci located within gene SCNA on chromosomes 4. A peak also appears on chromosome 17, around gene MAPT, KANSL1, and NSF. When other autoimmune disorders are included in the meta-analysis, the peaks appear on chromosomes 1 and 6 in the mixed pattern. Some significant variants can also be found on chromosome 4. Table 4 shows several SNPs identified in the analysis. Detailed results are available at https://figshare.com/articles/dataset/Table_S_zip/19179179.
The results supports the previous reports on several essential loci related to PD, such as SCNA, MAPT and KANSL1 [17]. Additionally, some degree of underlying relationships between PD and other autoimmune disorders are revealed by comparing the mixed pattern to PLOS COMPUTATIONAL BIOLOGY the pure pattern, for they have some shared risk variants. For example, variants on BTNL2 have high ABF in both the pure pattern (*10 17 ) and the mixed pattern (*10 127 ), and the subsets indicates that BTNL2 is associated with all four disorders. For the top 4 variants on BTNL2 in the mixed pattern, on average 66.7% studies on PD, 65.7% studies on inflammatory bowel disease, 100% studies on multiple sclerosis, and 62.5% studies on systemic sclerosis are included in the final subsets to calculate ABF. However, although variants on SCNA also have a high ABF value in both pure pattern and mixed pattern, we found most of the studies in the final subsets to calculate ABF are from those studies on PD and the ABF values remain similar (*10 135 ) in both patterns. In other words, SCNA has weaker associations with other autoimmune disorders. The same is also true for those variants on MAPT. These two loci may relate more to the diseases in nervous system instead of autoimmune disorders. On the contrary, variants on KANSL1, reported as a factor in the immune system [71], show associations with both PD and other autoimmune disorders. In breif, he results of the meta-analysis indicate the presence of potential biological pathways and functional interactions between PD and

Software
We implement all the algorithms as an R package named GWASmeta. Besides, to help researchers to use SMetABF to quickly find key SNPs, we develop a web tool based on R Shiny as well. The requirements of the file uploaded can be found in the website. Multiple variants can be analyzed at once. This tool is accessible at https://sunjianle-sjtu.shinyapps.io/analycode.

Discussion
Meta-analysis has been widely conducted on GWAS data to discover essential loci associated with some complex genetic diseases during recent years [73][74][75], satisfying the requirements of large sample size in GWAS. However, the traditional p-value method used in meta-analysis is facing increasing criticisms. For example, it is not proper to use a single threshold since pvalue is dependent on the MAF and sample size [10]. Moreover, the sophisticated relationships

PLOS COMPUTATIONAL BIOLOGY
between different studies are tricky to deal with in traditional methods. FEM relies on the assumption that all studies in the meta-analysis share a common true effect size. The true effect size is allowed to vary in different studies in REM, but the detail information is hard to be included in the model. And the test for heterogeneity to determine whether FEM or REM should be used is often regarded poor in power [11]. Instead, the structure among different studies can be easily integrated into the Bayesian model as a prior. The BF compares the relative size between P(H 0 |Y) and P (H 1 |Y), and therefore is a better alternative to the p-value. Based on the Bayesian framework, a useful statistical model named MetABF has been proposed, which could easily measure the associations between multiple phenotypes and variants at the same time using GWAS summary statistics but confronts challenges in computation.
In this article, we propose SMetABF, an improved tool to attain the optimal ABF in a largescale meta-analysis efficiently. Through simulation, we confirm that SSS is superior to MCMC in terms of speed, accuracy, and stability. To a certain extent, our improvements effectively overcome the calculation problems due to the increase in the number of studies included. We performe an application to PD and other autoimmune disorders, illustrating the effectiveness of SMetABF. With more research conducted on various traits among a larger population and the increasing accumulations of GWAS summary statistics, the large-scale multi-phenotypic meta-analyses will be possible through SMetABF. Another possible application is to analyze the effect size across different variants in one study, where σ i represents the prior variation of the i-th variant on the outcome, and ρ i,j stands for the linkages between different variants. Furthermore, since many traits related to some complex diseases are correlated, it is necessary to consider the effect of multiple loci on the outcome across a large number of studies simultaneously [76]. In this case, the prior correlation matrix S will transform to a three-dimensional array, which will bring more challenges in computation.
The method still confronts many challenges. The choice of prior parameters is an example. Sensitive analysis reveals that different values of σ and ρ will affect the ABF values but seem not to change the relative effect size between different variants. Besides, the considerable size of human genome still brings challenges in computation.
The pooled statistics derived through meta-analysis can be further used for other post-GWAS analysis, for example, to identify causal genes through statistical fine-mapping [77] or to infer the causal relationships between traits by Mendelian randomization [78]. GWAS summary statistics from different studies can be conveniently integrated to a powerful pooled statistic by SMetABF. We believe the method will benefit to the integration of previous studies and help to reveal the genetic mechanisms of complex diseases.