Applying diffusion-based Markov chain Monte Carlo

We examine the performance of a strategy for Markov chain Monte Carlo (MCMC) developed by simulating a discrete approximation to a stochastic differential equation (SDE). We refer to the approach as diffusion MCMC. A variety of motivations for the approach are reviewed in the context of Bayesian analysis. In particular, implementation of diffusion MCMC is very simple to set-up, even in the presence of nonlinear models and non-conjugate priors. Also, it requires comparatively little problem-specific tuning. We implement the algorithm and assess its performance for both a test case and a glaciological application. Our results demonstrate that in some settings, diffusion MCMC is a faster alternative to a general Metropolis-Hastings algorithm.


Introduction
The advent of Markov Chain Monte Carlo (MCMC) has led to major advances in the application of Bayesian analysis in complex problems. The idea is simply put: faced with a posterior distribution too complicated to compute or simulate from directly (i.e., we cannot readily obtain the normalizer or denominator appearing in Bayes' Theorem), one develops a Markov chain whose stationary distribution is known to coincide with the target posterior distribution. One then runs that chain, knowing that eventually realizations from the chain form an approximate dependent sample from the posterior. Those realizations are then used to estimate features of the posterior (i.e., posterior expectations of interesting quantities, predictive densities, etc.) [1][2][3].
For example, in some settings, nonlinearity and/or nonconjugacy of certain components of a large model render the standard Gibbs Sampler unusable. Metropolis-Hastings algorithms and Gibbs-Metropolis hybrids can be suggested, though these approaches can be taxing and may require substantial tuning.
In response to such difficulties, we explore diffusion based strategies for MCMC analysis. That is, one develops a diffusion (a solution, in the sense of Itô, to a stochastic differential equation) whose stationary distribution is the target posterior, see Chapter 5 of [4]. The key idea is certainly not new. Indeed, Langevin MCMC procedures are often suggested for PLOS  generating candidate states in Metropolis steps in MCMC. In this article, we suggest diffusion MCMC as a stand-alone algorithm. In part, our motivation for suggesting diffusion MCMC is its simplicity in terms of set-up. There are no probability calculations to perform, as in Gibbs' Sampling, nor any need for choosing and updating distributions for generating candidate states. Indeed, the approach is recommended as an "off-the-shelf" strategy that can be readily implemented. However, as indicated below, it is not a panacea. Further, issues such as burn-in, mixing, convergence rates, and output analysis remain challenging.
Consider a Bayesian analysis for an unknown quantity θ (after the introduction, we allow vector-valued unknowns) based on observational data Y, having conditional density g(y j θ). Let π(θ) denote our prior distribution for θ. We are to obtain the posterior distribution for θ based on the fixed observation Y = y, where C(y) is the normalizing constant. Consider a one-dimensional stochastic differential equation (SDE) where θ 0 is some fixed initial value and the drift b(Á) and diffusion σ(Á) > 0 are specified functions, such that Eq (2) admits a unique weak solution. The initial state θ 0 is a random variable with specified density p(θ, 0); and dW(t) represents white noise. Specifically, {W(t): t ! 0} is a standard Brownian motion process or a Wiener process. Consider the temporal evolution of the probability density function, p(θ, t), of a solution θ(t). Under regularity conditions requiring b and σ to be differentiable and satisfy a Lipschitz condition jbðyÞ À bðy 0 Þj þ jsðyÞ À sðy 0 Þj Kjy À y 0 j; for some constant K and for all θ, θ 0 , p(θ, t) is the solution to the ordinary partial differential equation, known as the Fokker-Planck or Kolmogorov Forward Equation, @p @t ¼ 0:50 @ 2 @y 2 ðs 2 pÞ À @ @y ðb pÞ; ð3Þ subject to the initial condition p(θ, 0) and the assumption that θ(0) and {W(t): t ! 0} are independent. Our interest is in stationary solutions, i.e., solutions that are functionally independent of time, so the partial derivative with respect to t is zero. Setting the right-hand side of Eq (3) equal to zero and integrating the result once w.r.t. θ, it suffices to find a stationary density p(θ) such that 0:50 @ @y for all values of θ in the parameter space. The general solution of Eq (4) is where the constant c is a normalizer. We let p(θ) be the target posterior Eq (1) for our Bayesian model and find appropriate functions b(θ) and σ(θ) such that p(Á) satisfies the Eq (4). That is, Having completed this step, we can simulate the diffusion and proceed as in MCMC. This is typically accomplished by forming a discrete-time approximation to Eq (2), that is, a Markov chain approximation to the continuous-time process. There may be many pairs (b(Á), σ 2 (Á)) that work for a fixed Bayesian model. It is important to note that the core of a diffusion MCMC (DMCMC) implementation has been completely described.
In this article we discretize the diffusion Eq (2) using the Euler scheme. This method has been extensively discussed in the literature, see for example [5] for a comprehensive overview and [6][7][8] for recent developments. The solution of the stochastic differential Eq (2) is approximated using a discrete time Markov chain {θ m } m ! 0 , where θ 0 = θ(0), h > 0 is the discretization step-size, and Z m+1 is a realization from a standard Gaussian distribution. Abusing notation, we use θ to denote both the continuous-time defined in Eq (2) and the discrete-time process from Eq (6). It is typical to extend {θ m } m ! 0 to a continuous time process via interpolation; however this step is not necessary for this paper. From a practical perspective, we are interested in the process {θ m } m ! 0 . Two critical questions arise: 1. Does the discrete stochastic process converge to a stationary, ergodic distribution?
2. If so, is that stationary distribution "close" enough to the target posterior distribution to justify the use of conventional output analysis to enable approximate Bayesian inference?
Unfortunately, there are situations where for any choice of the time step h > 0, the Markov chain described by Eq (6) will behave drastically different than the continuous time version Eq (2), see [9] for a discussion on this issue. Nevertheless, in many cases the ergodic properties of the discretized process {θ m } m ! 0 are similar to those of {θ(t)} t ! 0 . In particular, in [10] the author shows that under regularity conditions, the Euler discretization scheme does have a stationary measure which converges at an appropriate rate to the unique stationary measure of the continuous-time SDE. Furthermore, in [9] the authors provide conditions under which the continuous-time Langevin diffusion (defined below in Eq (7)) as well as the discretized version Eq (6) are geometrically ergodic. In [7], the authors study the asymptotic properties of time averages ð1=MÞ P M m¼1 Fðy m Þ, where F is a given function. This statistic is the natural estimate for the expected value E(F(θ)) = R F(θ)dp. They use Poisson equations to show that under mild regularity conditions, any stationary measure of the Euler-discretized process Eq (6) will be close to the unique stationary measure of the underlying SDE. Their Theorem 5.1 also shows that the time average estimator is of order O(h + 1/M).
To illustrate diffusion-MCMC and indicate its potential value and simplicity, examples are reviewed in the next section. Henceforth, we set σ 1, and then find the function b(Á); this approach often has the tag Langevin. That is, we restrict ourselves to diffusion processes of the form where p(Á) is the posterior distribution Eq (1). We note that application of Eq (6) yields the corresponding transition distribution as y mþ1 jy m $ N y m þ 0:5hr log pðy m Þ; h ð Þ: We emphasize again that our goal is to present the benefits of a procedure that has been present in the literature for over a decade. Due to its wild behavior even in some simple cases, it has received little attention, especially from practitioners and applied scientists. It has been proposed (see for example the MALA algorithm presented in [9]) that an additional Metropolis-Hastings step will correct the explosive behavior. The MALA algorithm has been further studied and extended in [11] and [12]. In both cases the improvement comes with an increase in computational complexity. We stress that our goal is to avoid a Metropolis-Hastings acceptreject step and this work is motivated by recent theoretical advances in this direction, see [7]. We explore the efficiency and applicability of DMCMC to high-dimensional problems arising in a Bayesian framework, without performing the Metropolis-Hastings correction step. When classical (or adaptive) MCMC fails (for example, due to computational time restrictions or inability to select good proposals), we show that diffusion MCMC is a viable alternative which requires little input from the user and can be computationally more efficient.

Motivating examples
The multivariate form of the diffusion Eq (2) is written as where {θ(t): t ! 0} is a q-dimensional stochastic process. The initial state is θ(0) and {W(t), t ! 0} is a q-dimensional vector whose elements are each independent standard Brownian motions. Except for the first one, the examples below were chosen to be suggestive of realistic problems for which other MCMC methods can be difficult. As in Eq (1), we use g(y j θ) to denote the likelihood function, where y is the observed data and π(θ) to denote the prior density.
One class of problems in which diffusion MCMC may be useful involve nonlinearity. For example, suppose the likelihood function g depends on θ via a "link" function k(Á), that is g = g(Á j k(θ)). Nonlinear structures may also arise in hierarchically specified priors. Nonlinearity may make both Gibbs sampling and Metropolis algorithms difficult. However, if the nonlinearity does not disable the required differentiation, diffusion MCMC may be comparatively simple. We remark that in such cases, the drift function b(Á) may be unruly. If necessary, selection of the diffusion coefficient σ(Á) may be used to control b(Á). However, for the balance of the article we restrict to Langevin diffusions (σ = 1). Example 1. Assume that for τ 2 known, Y | θ * N(θ, τ 2 ) and θ * N(μ, η 2 ). Of course, we know that θ | Y = y is normally distributed with easily computed mean and variance (these will appear below). Applying Eq (5) with the choice of σ(θ) = 1, yields Let α = 0.50(τ −2 y + η −2 μ) and β = 0.50(τ −2 + η −2 ). The solution to It follows that It can be shown that Returning to the original parameterization, we conclude that as t ! 1, which are the usual posterior mean and variance for this Bayesian model. If the initial condition θ(0) is normally distributed, then for each t, θ(t) is also normally distributed. Note that the convergence rate to the stationary distribution is exponentially fast. Example 2. Diffusion MCMC is useful in combining data from highly different likelihoods. Let θ = (θ 1 , . . ., θ K ) and assume that Y ij |θ i * g i (Á|θ i ) where i = 1, . . ., K and j = 1, . . ., r i and all Y ij are conditionally independent. For example, let g i be the Gaussian pdf with mean θ i and variance τ 2 , and the prior for θ i be a Cauchy distribution with median μ and scale parameter A. For a Langevin setting (i.e. σ = 1), the i th component of the drift coefficient r log p(θ) is @ @y i r log pðθÞ ¼ À Note that conjugacy plays no direct role in this approach, though the presence of the Cauchy distribution makes a Gibbs sampler infeasible. This example is further analyzed in the next Section.

Example 3. (Mixture Models)
Suppose Y 1 , . . ., Y n are conditionally independent and identically distributed given θ according to a finite mixture of m probability density functions g i (Á | θ). For example, assume that the conditional distribution of the data is where y = (y 1 , . . ., y n ) 0 , α i > 0, i = 1, . . ., m and ∑ i α i = 1. Diffusion MCMC is easily formulated if the derivatives of the g i with respect to θ are easily available, either via formal calculations or by using symbolic software, such as Mathematica. We note that similar steps can be used to treat mixture priors.
Example 4. (Hierarchical Models). In many models g and π are products of a variety of terms, e.g., for conditionally independent observations, g is a product; π is often represented as a product of hierarchical components. In such cases, we have that @ log ðgpÞ @y i ¼ @ log ðg ðiÞ p ðiÞ Þ @y i ; where the superscripts indicate that only those components of g and π that explicitly depend on θ i are involved in the calculation. This parallels the familiar step in computing full conditionals in setting up a Gibbs Sampler. Namely, for each i, one computes the distributions Suppose that the Bayesian model takes the form Y | θ 1 , . . . θ q * g(y | θ 1 , . . . θ q ) and pðy 1 ; . . . y q Þ ¼ p 1 ðy 1 j y 2 ; . . . y q Þp 2 ðy 2 j y 3 ; . . . y q Þ Á Á Á p q ðy q Þ: We adapt the notation in Eq (5) as follows: for a function f(θ 1 , . . . θ q ) define Hence, @ log ðgpÞ We note that Gibbs sampling is useful when the full conditionals Eq (13) are readily obtained and simulated. This typically arises when the full conditionals actually depend on a small subset of the parameters in the conditions. This is not necessary in diffusion MCMC.

Applications
To provide insight into diffusion MCMC (DMCMC), we present a standard test case and a real-data example. Our goal is to assess the performance of DMCMC, especially in comparison with the current state-of-the art adaptive MCMC approach, see [13,14]. The DMCMC methodology is compared to a multivariate adaptive Metropolis sampler (AM). For the AM algorithm, the proposal distribution at iteration m is given by ð1 À bÞNðx; ð2:38Þ 2 S m =qÞ þ bNð0; ð0: where S m is the current estimate of the covariance matrix of the target distribution and β is a small positive constant (we take β = 0.05). The AM algorithm is widely accepted as one of the best sampling algorithms, especially for complex target distributions where dependencies among parameters make it difficult to select proposal distributions. We refer the reader to [15] for several comparisons between MCMC algorithms. The scaling factor (2.38) 2 can also be "adapted"; in this case we refer to the procedure as adaptive scaling within adaptive MCMC. However, user input is not eliminated completely as there remain tuning parameters to be specified. For comparisons, we inspect trace-plots to assess convergence and compare algorithms via their averaged squared jumping distance This quantity is estimated by 1 M P M m¼1 ðX m À X mÀ 1 Þ 2 for both AM and DMCMC algorithms.
Comparatively large ASJD indicates the desirable property of fast mixing. We also add a computational constraint for our examples. We limit ourselves to relatively short runs of the Markov chains (AM and DMCMC). This can be very dangerous for classical MCMC since one will have difficulty assessing whether the chains have reached stationarity. Our examples will show that the diffusion approach quickly finds regions with high posterior probability and explores them thoroughly.

Synthetic example
Assume that Y i1 , . . ., Y ir i |θ i , γ are an iid sample from a Gaussian(θ i , V(γ)) distribution, where 1 i 1000 and 1 j r i . We specify the variance V(γ) to be where 0 < a < b < 1 are specified constants. The reason behind Eq (15) is twofold: (1) we require that all the parameters of the model be supported on the entire real line, hence a transformation is required for all variances, and (2) we aim for a Uniform(a, b) prior distribution for the variance V(γ). Certainly, other distributions (such as Gamma or Inverse Gamma) can be considered. Using an inverse transformation, this is equivalent to specifying the prior density for γ as We let the sample sizes r i vary between 5 and 500. For θ 1 , θ 2 , . . ., θ 500 we specify independent prior distributions, θ i |μ, A * Cauchy(μ, A), with density proportional to [1 + ((θ The parameter A is held fixed for this example, although it can be treated similarly to the data variance V(γ). For the hyperparameter μ we specify a Gaussian(0, 1) prior distribution. Using the independence assumption, the likelihood function is written as g y ðθÞ g y ðy 1 ; . . . ; y 1000 ; g; mÞ ( ) and the prior density is proportional to pðθÞ pðy 1 ; . . . ; y 1000 ; g; mÞ Selection of the time step. The selection of "good" time steps for DMDMC is challenging. First, time steps that are too large may result in explosive (transient) processes. In general, the user faces a conundrum: a very small time step typically results in chain whose dynamics are similar to those of the target continuous-time diffusion, but is slowly mixing.
In our experiments we observed that selecting h = O(1/q) results in a good performance of the DMCMC algorithm for this example. In Fig 1 we display the trace plots and autocorrelation functions for three parameters (results for all parameters where quite similar) from the DMCMC approach. Fig 2 shows trace plots for the same parameters resulting from two adaptive approaches as described above. In each case the chains were run for 20000 iterations and thinned by ten. It is evident from these figures that the adaptive approach has difficulty exploring the state space. The key issue is the dimension of the state space (1000+ in this example). The adaptive approach will not "learn" a one thousand dimensional covariance matrix properly. In Table 1 we summarize the target values and estimates (posterior means) for two parameters, θ 1 and θ 201 . Table 2 contains corresponding estimates of ASJD. We see that the estimated ASJD indicates that the mixing rates of the diffusion MCMC algorithm is much higher than the adaptive case.

Glacial dynamics
In [16], the authors present a hierarchical Bayesian analysis for inferring features of the dynamics of the Northeast Ice Stream in Greenland. For our purposes, we use a subset of their data and a simplified version of their model. Glaciers flow under the influence of gravity moderated by resistive forces at its base and sides. For a path roughly along the center of the glacier as it flows toward the sea, physical reasoning and simplifying approximations lead to a model for surface velocity of the stream as a function of ice thickness and the shape of the surface. Let x (m) denote a spatial location. The model for the surface velocity u(x) (ms −1 ) used here is where u bx is sliding velocity, s(x) (m) is ice-surface elevation, H(x) (m) is ice thickness, ρ = 911 kgm −3 is the density of ice, and g = 9.81 ms −2 is the gravity constant. Though the quantity A(x) depends on temperature, it is often treated as a constant flow parameter. In this article we model A using a Fourier expansion We assume the following quadratic model for the surface: where β 0 and β 1 are unknown parameters. The authors in [16] use a different, more complicated functional form for s. We found Eq (20) sufficient for our purposes. Let B(x) be the elevation of the base of the glacier so that HðxÞ ¼ sðxÞ À BðxÞ: The dataset consists of vectors S observed at fill-in spatial locations covering approximately 200 km of the ice stream, the observed surface topography; B, the observed basal topography; and U, surface velocities. Additional description of the data is given in [16].
Data models. Let θ represent the set of all parameters introduced in the modeling. We assume that S, B, U are conditionally independent given θ.
Surface Data. The data model for S is a conventional Gaussian measurement error model: where s is the vector of values of Eq (20) at the observation locations; s 2 S is an unknown measurement error variance; and I is the identity matrix.
Basal Data. It is argued (see [16]) that the basal data must be smoothed to be useful in Eq (18). Following their approach we partition the domain of the data into 2 10 = 1024 bins of equal length (189.5 m). All basal observations within each bin are averaged, leading to a data vector " B of length 1024. As in [16], we use a wavelet model with two sets of wavelets to provide where W is the 1024 × 32 matrix of discretized wavelet basis functions; C is the 32-dimensional vector of wavelet coefficients; s 2 B is an error variance; and diagfn À 1 i g is a 1024 × 1024 diagonal matrix with diagonal elements equal to n À 1 i where n i is the number of observations averaged in bin i (all n i are either one or two). We selected Daubechies wavelets, see [17] and [18] for discussion.
Velocity Model. We assume that where u is the vector of values given by Eq (18) at the observation locations; s 2 U is the unknown measurement error variance; and I is the identity matrix. Note that the sliding velocity is assumed to be a constant over the study range.
Priors for parameters. Error Variances. The measurement error variances s 2 S , s 2 B , and s 2 U were assigned independent, inverse gamma distributions with means and standard deviations (100, 10), (2500, 500), and (9, 3), respectively. Surface Model Parameters. The prior distributions for β 0 and β 1 were specified to be independent normal distributions with large variances: 10,000 for β 0 and 10 for β 1 . The means of these normal distributions were set to be equal to the least squares estimates of β 0 and β 1 derived from a traditional analysis fitting the model in Eq (20) to the surface observations. Basal Model Parameters. The prior used for the four coefficients of the smooth-signal wavelets is where μ s is the vector of conventional least squares estimates of Haar-wavelet coefficients. The prior for the remaining 28 coefficients is We set s 2 c ¼ 2000 and s 2 d ¼ 10000. Velocity Model Parameters. The prior for the sliding velocity is To develop reasonable priors for the Fourier coefficients in Eq (19), we first obtained the least squares fits to the surface and the basal models (20) and (22). These fitted models were substituted into the velocity model (23). We then fitted the result via least squares. As above, the least squares estimates of the Fourier coefficients were used as prior means for the corresponding parameters. These values were on the order of 10 −16 (which is consistent with the theoretical value of the parameter A) except for the frequency parameter ω which was estimated to be roughly 10 −5 . These parameters were all assumed to be independent, normal random variables with prior variances equal to 10.
Performance. Fig 3 shows trace plots for various parameters. We ran the algorithm for 100000 iterations and thinned it by fifty steps. The diffusion MCMC algorithm performs very well. It appears that it explores the state space properly and mixes very fast. In Fig 4 we show posterior means for the surface, velocity and basal processes. For comparison, we added the posterior means (dashed lines) for the three processes from a much longer adaptive MCMC run. This plot confirms that diffusion MCMC performs as expected, giving similar results to the adaptive MCMC approach with the added benefit of a much shorter computing time.

Conclusions
Simulation of a diffusion formulated to have stationary distribution coinciding with a target posterior distribution is a viable MCMC method. The approach is comparatively simple to implement since it requires no probability computations such as those needed in Gibbs' sampling nor any accept-reject steps as in Metropolis algorithms. These advantages can be  significant in a variety of settings including mixture likelihoods and/or priors, hierarchical models, nonconjugate priors, and nonlinear models.
The key problem that arises in diffusion MCMC is the approximation of the desired continuous time diffusion by a discrete time Markov chain. Our implementations use Euler discretizations. As reviewed in the Introduction, there are results in the literature providing sufficient conditions under which the discrete approximation has a stationary distribution that approximates that of the target, continuous-time diffusion. Though beyond our scope here, selection of the time-step h can be done adaptively, see [6] for some recent theoretical developments in this area.
We implemented diffusion MCMC for a familiar test problem and compared it to an adaptive MCMC procedure. We found that diffusion MCMC out-performed the "state-of-the-art" adaptive MCMC. Next, we implemented the diffusion MCMC approach in a complicated, nonlinear model involving glacial dynamics. Again, we found that our suggested approach performs well, mixing very fast.
In summary, we believe that diffusion MCMC is a valuable addition to the MCMC toolbox. By construction, the DMCMC algorithm has the ability to quickly find important regions of the target distribution, while a classical, even adaptive MCMC, may require longer exploration times (as seen in the glaciological example). It can be applied in great generality and with ease in some complicated contexts for which other MCMC methods are difficult or very time-consuming to implement. DMCMC does carry the baggage of temporal discretization and concern for the quality of the resulting approximation. Nevertheless, the potential power of diffusion MCMC justifies its application and further development.
Supporting information S1 Dataset. The data set used in this analysis are available in the file S1_Dataset.zip. We provide the surface, basal and velocity data used in this manuscript.