The authors have declared that no competing interests exist.
Conceived and designed the experiments: TP PM AV. Performed the experiments: TP. Analyzed the data: TP. Contributed reagents/materials/analysis tools: TP AV. Wrote the paper: TP PM AV.
High-dimensional datasets with large amounts of redundant information are nowadays available for analysis with methods such as Bayesian variable selection.
Yet, the computation presents a challenge and application to large-scale data is not routine. Here, we study aspects of the computation using the Metropolis-Hastings algorithm for variable selection: finite adaptation of the proposal distributions, multistep moves for changing the inclusion state of multiple variables in a single proposal, and multistep move size adaptation. We also experiment with a delayed rejection step for the multistep moves. Results on simulated and real data show an increase in sampling efficiency. We also demonstrate that, with application-specific proposals, the approach can overcome a specific mixing problem in real data with 3822 individuals and 1,051,811 single nucleotide polymorphisms and uncover a variant pair with a synergistic effect on the studied trait. Moreover, we illustrate multimodality in the real dataset related to a restrictive prior distribution on the genetic effect sizes and advocate a more flexible alternative.
Progress in high-throughput measurement technologies has allowed application specialists to gather extensive datasets, often with large amounts of information that is redundant for the scientific question being addressed. This is particularly true in (human) genetics, where it has become cost-effective to measure individual genetic variation at the scale of millions of polymorphic sites in the DNA. Numerous genome-wide association studies (GWAS) have been published during the last decade linking genetic variation to disease and other traits
However, such data analysis is not without problems. The primary association analyses in GWAS are mainly conducted by testing each polymorphic site, usually a single nucleotide polymorphism (SNP), for association independently and then correcting for multiple hypothesis testing. This simplification is computationally convenient, but does not acknowledge the hypothesized multifactorial genetic background of many common diseases and traits. Alternatives, which consider all of the genetic variants simultaneously, include penalized multivariate regression and variable selection methods (e.g.,
In this work, we focus on the computation of the Bayesian linear regression model with variable selection using Markov chain Monte Carlo (MCMC) methods. Variable selection is a natural fit for the main task in GWAS of searching for the genetic variants showing association with a phenotype of interest, and such models have recently been applied successfully to genetic datasets of various sizes, up to full GWAS scale
A general approach to variable selection in this framework is the Metropolis-Hastings algorithm (MH)
Here, we study the following ideas in formulating the proposal distribution
The motivation for adapting the proposal distributions stems from the
Multistep moves have been used in the GWAS setting by Guan and Stephens
We also experiment with a novel delayed rejection step, which reutilizes some of the computations leading to a rejected multistep proposal. In the delayed rejection algorithm, if the first proposal is rejected, another proposal may be made. Here, assuming a
An open source C++ implementation of the samplers presented here is available at
The model mapping from genotypes (values of the explanatory variables) to a phenotype (the target variable) is briefly introduced here. This is essentially the same as in our previous work
A linear regression model is used:
To facilitate variable selection, binary variables
The prior of the effect sizes,
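For illustration, the generative structure described above (a linear model with binary inclusion indicators selecting a sparse subset of SNPs) can be sketched as follows. This is a Python sketch with illustrative names and dimensions of our own choosing, not the study's data or its C++ implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 200, 50                                     # individuals, SNPs (illustrative sizes)
X = rng.integers(0, 3, size=(n, p)).astype(float)  # additive genotype coding 0/1/2

gamma = np.zeros(p, dtype=bool)                    # binary inclusion indicators
gamma[[3, 17]] = True                              # two "causal" SNPs, chosen arbitrarily
beta = np.zeros(p)
beta[gamma] = rng.normal(scale=0.5, size=gamma.sum())  # effect sizes of included SNPs

sigma = 1.0                                        # residual standard deviation
y = X @ (gamma * beta) + rng.normal(scale=sigma, size=n)
```

The posterior of interest is then over gamma (which SNPs are included), the effect sizes and the variance parameters.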
A brief overview of the Markov chain Monte Carlo algorithm used to sample from the posterior distribution of the parameters of the above model is given here, before focusing on the specifics of the sampling of
The linear model given
1. Sample
2. Sample
3. Sample
4. Sample
5. Sample
The last three steps are a factorized draw from
For posterior inference, the Rao-Blackwellization method of Guan and Stephens
Three algorithms will be described for the Metropolis-Hastings (MH) step, which is used to update
1. Single-step (SS) algorithm, which proposes a change to a single
2. Multistep (MS) algorithm, which proposes multiple changes to
3. Multistep algorithm with delayed rejection (MSDR).
The proposals are formed in two main steps: 1) a move size (number of changes) proposal and 2) a sequential proposal of the variables to update (add to or remove from the model). The proposal is then accepted or rejected according to the MH acceptance probability. The single-step algorithm always chooses a move size of one.
The parameters of the proposal distribution may be adapted during an initial phase of the sampling (giving a total of six different samplers: three adaptive and three non-adaptive). The parameters are then fixed before collecting posterior samples (finite adaptation). Non-adaptive algorithms employ a uniform distribution to generate the proposals (except that move size adaptation is allowed here for all multistep samplers to avoid trial-and-error in finding a good proposal distribution). Brief descriptions of the sampling and adaptation are given below. Details are given in
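The finite adaptation logic (tune during an initial phase, then freeze the proposal weights) can be sketched as follows. The adaptation statistic below (simple visit counts with a floor to keep every variable proposable) is a stand-in of our own; the paper's actual adaptation rule is not reproduced here:

```python
import numpy as np

class FiniteAdaptation:
    """Sketch: per-variable proposal weights tuned during an initial
    phase of n_adapt iterations and then frozen (finite adaptation)."""

    def __init__(self, p, n_adapt, floor=1e-3):
        self.counts = np.zeros(p)   # stand-in adaptation statistic
        self.n_adapt = n_adapt
        self.iter = 0
        self.floor = floor          # keeps all weights strictly positive
        self.frozen = None

    def update(self, proposed_var):
        self.iter += 1
        if self.frozen is None:
            self.counts[proposed_var] += 1
            if self.iter >= self.n_adapt:
                self.frozen = self.weights()   # end of adaptation: freeze

    def weights(self):
        if self.frozen is not None:
            return self.frozen                 # fixed after adaptation
        w = self.counts + self.floor
        return w / w.sum()
```

Freezing before collecting posterior samples is what makes the chain a valid (non-adaptive) Markov chain from that point on.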
The proposal distribution
Pasarica and Gelman
In order to derive the connection of the expected squared jump distance and lag one autocorrelation in the present context, the mean and variance of
Pasarica and Gelman
The objective function to maximize with regard to the parameter
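The expected squared jump distance can also be estimated empirically from the chain output; for binary inclusion vectors the squared Euclidean distance between successive states is simply the number of variables whose state changed. This is a generic sketch, not the analytic objective optimized in the paper:

```python
import numpy as np

def expected_squared_jump_distance(states):
    """Empirical ESJD: mean squared Euclidean distance between successive
    states. For binary inclusion vectors this is the mean number of
    variables changed per iteration (rejected moves contribute zero)."""
    states = np.asarray(states, dtype=float)
    diffs = states[1:] - states[:-1]
    return float((diffs ** 2).sum(axis=1).mean())
```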
Given the move size, additions and removals are proposed in a sequence with probability 0.5 (unless there are no variables to add or remove). Denoting the sequence of proposed changes using auxiliary variables
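The sequential construction of a multistep proposal can be sketched as follows. This is a simplified illustration (uniform selection within each direction, and a variable may in principle be revisited); the actual algorithm uses the adaptive proposal weights and the bookkeeping described above:

```python
import numpy as np

def propose_multistep(gamma, move_size, rng):
    """Sketch of the two-stage proposal: given a sampled move size, build
    a sequence of additions/removals, each direction chosen with
    probability 0.5 when both are possible."""
    gamma = gamma.copy()
    changes = []
    for _ in range(move_size):
        can_add = ~gamma
        can_rem = gamma
        if can_add.any() and can_rem.any():
            add = rng.random() < 0.5      # direction chosen with prob. 0.5
        else:
            add = can_add.any()           # only one direction is possible
        pool = np.flatnonzero(can_add if add else can_rem)
        j = rng.choice(pool)              # uniform here; adaptive in the paper
        gamma[j] = add
        changes.append((int(j), add))
    return gamma, changes
```

The sequence of changes (not just the end state) is what enters the MH acceptance probability, since the proposal density depends on the order in which the variables were chosen.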
Alternatively, the usual MCMC estimates of
The above sampling scheme is here cast into the form of Storvik
In this,
A. View of the full move as a Gibbs step followed by a Metropolis-Hastings (MH) step. B. Delayed rejection (DR): a second proposal may be made when the first proposal is rejected. Since the DR proposal is constructed here such that it is always accepted, there is no further branching after the second proposal.
Regarding
An alternative to introducing the sampling order to the acceptance probability would be to sum over the different orderings of
Here we illustrate the notation and behavior of the sampling algorithm using a concrete, albeit overly simplistic, example. Suppose the total number of SNPs in the data is 5 and let the current state of the algorithm be
Delayed rejection
An essential feature of delayed rejection is that the second proposal may depend on the first. Here, this is taken advantage of by reutilizing the computations performed for the first proposal. Note that the time complexity of computing the likelihood after the first proposal has been made is dominated by the updates to the Cholesky decomposition of the covariance matrix of the predictors (
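To illustrate why these updates are cheap, the following sketch extends the lower Cholesky factor when one variable is added to the model, in O(k^2) for k included variables instead of O(k^3) from scratch (removing a variable requires a rank-one downdate, not shown; the actual implementation is the authors' C++ code):

```python
import numpy as np

def chol_add_variable(L, new_col, new_diag):
    """Extend the lower Cholesky factor L of the included variables' Gram
    matrix when one variable is added: new_col holds the inner products of
    the new column with the included ones, new_diag its squared norm."""
    k = L.shape[0]
    if k == 0:
        return np.array([[np.sqrt(new_diag)]])
    # forward solve L l = new_col (a triangular solver would exploit structure)
    l = np.linalg.solve(L, new_col)
    d = np.sqrt(new_diag - l @ l)      # new diagonal entry
    out = np.zeros((k + 1, k + 1))
    out[:k, :k] = L
    out[k, :k] = l
    out[k, k] = d
    return out
```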
The acceptance probability of the second proposal preserving reversibility is given by:
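In its standard two-stage form from the delayed rejection literature, the second-stage acceptance probability can be written as follows (the elided expression is presumably a version of this, specialized here so that the second proposal is always accepted):

```latex
\alpha_2(x, y_1, y_2) = \min\left\{1,\;
\frac{\pi(y_2)\, q_1(y_2, y_1)\, q_2(y_2, y_1, x)\, \left[1 - \alpha_1(y_2, y_1)\right]}
     {\pi(x)\,   q_1(x, y_1)\,   q_2(x, y_1, y_2)\, \left[1 - \alpha_1(x, y_1)\right]}
\right\},
```

where $x$ is the current state, $y_1$ the first (rejected) proposal, $y_2$ the second proposal, $q_1$ and $q_2$ the first- and second-stage proposal densities, $\pi$ the target distribution and $\alpha_1$ the first-stage acceptance probability.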
Here we illustrate the delayed rejection part of the sampling algorithm by continuing from Example 1 and assuming that the suggested move from
The delayed rejection part of the algorithm proceeds by sampling the second (forward) proposal from the set of all models which can be reached from the initial state
Now, the two-step backward proposal is determined as follows: first, auxiliary variables related to the
After rejection of the first step in the backward proposal, the second step must change the state back to the original model
Two additional moves are introduced specifically for genetic data, where the variables can be ordered linearly (corresponding to their locations in the genome) and neighboring variables may have a block-like correlation structure (linkage disequilibrium), which may complicate the mixing of the Markov chain.
The first move type proceeds by selecting one variable in the model (
In our implementation, each of the additional move types is proposed with probability 0.15 and the main
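The locality-aware idea behind these moves can be illustrated with a hypothetical swap move: exchange one included variable for a nearby (possibly correlated) excluded one. This sketch is ours; the paper's two move types are more involved than this:

```python
import numpy as np

def propose_local_swap(gamma, rng, window=10):
    """Hypothetical locality-aware move: swap one included variable for
    an excluded one within +/- window positions along the genome."""
    included = np.flatnonzero(gamma)
    if included.size == 0:
        return gamma
    j = rng.choice(included)
    lo, hi = max(0, j - window), min(len(gamma), j + window + 1)
    candidates = [k for k in range(lo, hi) if not gamma[k]]
    if not candidates:
        return gamma
    k = rng.choice(candidates)
    prop = gamma.copy()
    prop[j], prop[k] = False, True        # swap keeps the model size fixed
    return prop
```

Such a move lets the chain slide an association signal along a linkage disequilibrium block without passing through a state where the signal is dropped entirely.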
The algorithms introduced above are compared to random scan versions of Kohn-Smith-Chan (KSC)
The human data were not collected primarily for this article and were analyzed here anonymously. The primary collection followed appropriate ethics guidelines.
A dataset of 3895 individuals with quality-controlled, measured or imputed genotypes at 1,051,811 single nucleotide polymorphisms (SNPs) is used to test the sampling algorithms. High-density (HDLC) and low-density lipoprotein cholesterol (LDLC, for 3822 individuals) phenotype data were available for analysis. Moreover, 20 simulated datasets were generated for four simulation configurations using the genotypes of the first chromosome (85,331 SNPs) for 2002 of the individuals and a linear model for the phenotype. The simulated data had either 30 or 100 SNPs randomly selected as causal with additive genetic effects, whose sizes were generated from a double exponential distribution. Normally distributed noise was added to the phenotypes to set the proportion of variance explained (
The efficiencies of the samplers were tested on the simulated datasets. The samplers are abbreviated as SS for the single-step sampler, MS for the multistep sampler, MSDR for the multistep sampler with delayed rejection, NK for Nott-Kohn and KSC for Kohn-Smith-Chan. The maximum move size in the multistep samplers is 20, and delayed rejection is restricted to moves of size 10 or less. The (finite) adaptivity of the SS, MS and MSDR samplers refers to the tuning of the proposal probabilities of which variables to add or remove. Non-adaptive samplers employ a discrete uniform distribution for this. All MS and MSDR samplers use move size proposal adaptation. The NK and KSC samplers were run with block sizes 1, 5 and 10. Three independent MCMC chains were run for 20,000,000 (KSC and NK) or 2,000,000 (others) iterations of the third step in the Computation section and thinned by taking every 100th (KSC and NK) or every 10th (others) sample. The KSC and NK algorithms were run for ten times longer as they have cheaper iterations and showed convergence problems with shorter runs. The first halves of all chains are discarded as burn-in. Prior parameters are given in
The effective sample size (
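A common way to estimate the effective sample size from a chain is to discount for autocorrelation; the sketch below truncates the autocorrelation sum at the first non-positive estimate (one common convention; the estimator used in the paper may differ in its truncation rule):

```python
import numpy as np

def effective_sample_size(x):
    """ESS sketch: n / (1 + 2 * sum of positive-lag autocorrelations),
    truncating at the first non-positive autocorrelation estimate."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = float(xc @ xc)
    s = 0.0
    for lag in range(1, n):
        rho = float(xc[:n - lag] @ xc[lag:]) / denom
        if rho <= 0.0:        # truncation rule
            break
        s += rho
    return n / (1.0 + 2.0 * s)
```

A nearly independent chain gives an ESS close to its length, while a sticky chain gives a much smaller value; efficiency comparisons then divide ESS by sampling time.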
Convergence was checked visually and by computing potential scale reduction factors
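The potential scale reduction factor compares between- and within-chain variances over independent chains; a sketch of the basic Gelman-Rubin form (modern variants additionally split each chain in half) is:

```python
import numpy as np

def potential_scale_reduction(chains):
    """Basic Gelman-Rubin R-hat for an (m chains x n iterations) array;
    values near 1 suggest the chains agree on the target distribution."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)            # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return float(np.sqrt(var_hat / W))
```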
The boxplots show ESS/time values normalized to the third sampler, where the ESSs are computed for the
The move size proposal distribution adaptation was validated by running the adaptive MS and MSDR samplers with fixed move size proposal distributions for six parameter configurations (giving mean move sizes from 2 to 7) for the 20 simulated datasets with
We further note that the multistep moves and delayed rejection do not necessarily increase the efficiency of moving between different model sizes (
Only the adaptive samplers proposed here were run on the HDLC and LDLC data, as the others would be expected to perform worse given the large increase in the number of variables relative to the simulations. Twelve independent chains of 8,000,000 iterations were run for each sampler and dataset and thinned by taking every tenth sample. Effective sample sizes and sampling times were computed as in the simulations. Here, results are presented as ESS/time rather than as relative efficiencies, as there is no additional variation due to multiple datasets (HDLC and LDLC results are shown separately). Prior parameters are given in
Convergence analysis did not indicate problems with the HDLC dataset. The inferences regarding posterior inclusion probabilities and the proportion of variance explained did not change from the previous analysis
The source of the problem is a pair of correlated (Pearson’s correlation 0.91) SNPs in the
Each subplot contains traces (including the burn-in period) from 12 chains, where each trace is composed of three lines (red for snp1, blue for snp2 and black for snp3), which may be in the upper state (
The SNP pair was missed in our previous analysis
The ESS/time values for comparing the algorithms on sampling efficiency are shown in
Boxes show the variation over the 12 independent MCMC chains for each sampler. ESSs are computed for the
Move size and rate statistics for the sampling algorithms are shown in
Dataset / Sampler    RJD    PJD    RJD/PJD    Move rate

HDLC
  adaptive MSDR      2.00   6.75   0.45       0.67       0.12
  adaptive MS        1.15   6.25   0.33       0.33       0.14
  adaptive SS        0.66   1.00   0.66       0.66       NA

LDLC
  adaptive MSDR      1.95   6.57   0.45       0.68       0.12
  adaptive MS        1.11   6.36   0.31       0.31       0.13
  adaptive SS        0.69   1.00   0.69       0.69       NA
Several aspects related to the use of the Metropolis-Hastings algorithm (MH) in Bayesian variable selection in the context of genome-wide association studies were studied here. Specifically, the focus was on the (finite) adaptation of the proposal distributions for additions and removals of variables, multistep proposals (batching of additions and removals) with move size adaptation and the use of a delayed rejection step in the multistep proposal. A more flexible prior formulation for the effect sizes and additional MH moves tailored to genetic data were also introduced.
The effect of the adaptation of the proposal distributions was studied on simulated datasets with 85,331 SNPs. The results suggest that the adaptation is beneficial with regard to sampling efficiency. This is not surprising, as similar ideas have been used previously in sampling from high-dimensional model spaces for variable selection
The expected jump distance optimization
Problems in the mixing of the samplers were found in the LDLC data. This was identified as being related to a pair of SNPs, which are required to be in the model together to make a notable contribution. The interpretation of the SNP pair is unclear to us (e.g., a haplotype tag or a false positive), but it is plausible that such combinations could be found in other datasets as well and that they are probably missed in single-SNP analyses. Multistep moves may help in finding such SNP pairs, but it is still improbable that one move would happen to propose the correct pair among all possible pairs. We introduced a specific MH move to alleviate the problem of finding such local SNP combinations. Together with the delayed rejection, which allows for some misspecification of the move size, this seemed to improve the mixing for the SNP pair markedly.
Moreover, the prior distribution of the effect sizes was changed to have more probability mass near the axes for the regression coefficients (through having SNP-specific
We acknowledge that comparisons of sampling efficiency may be sensitive to the implementation, the sampling parameters and the computing environment where the experiments are run. To control for this, all experiments here were run on a cluster computer whose nodes have almost identical configurations (most importantly, the same CPU model and software libraries for linear algebra; for HDLC, and similarly for LDLC, a single node was used to run all experiments), and the same sampling parameters were used for all algorithms (where applicable). Moreover, the third step in the Gibbs scheme, the variable inclusion update, was timed separately and used to compute the efficiencies. Thus, the time spent in the other steps, which may account for a significant portion of the total time (especially the Rao-Blackwellization), was excluded. All of the algorithms were implemented by the first author and most of the source code is shared between them. A set of unit tests (including checks for likelihood computations and sampling on small test data, among others) was used to increase confidence in the correctness of the implementation and is available with the source code.
The results may also be expected to vary with the specifics of the data (e.g., scale, number of significant associations, effect size distribution and correlation structure) as seen to some extent between the different simulation configurations. Our experiments were specifically in the context of genomewide association analysis, but many of the ideas are applicable to other types of highdimensional data. However, the sampling algorithms used here may need to be combined with other means of tackling potential multimodality for general use.
We thank Antti Jula, Markus Perola and Veikko Salomaa for access to the HDLC and LDLC datasets. We would also like to thank the anonymous reviewers for their contribution to improving the manuscript.