Turning Simulation into Estimation: Generalized Exchange Algorithms for Exponential Family Models

The Single Variable Exchange algorithm is based on a simple idea; any model that can be simulated can be estimated by producing draws from the posterior distribution. We build on this simple idea by framing the Exchange algorithm as a mixture of Metropolis transition kernels and propose strategies that automatically select the more efficient transition kernels. In this manner we achieve significant improvements in convergence rate and autocorrelation of the Markov chain without relying on more than being able to simulate from the model. Our focus will be on statistical models in the Exponential Family and use two simple models from educational measurement to illustrate the contribution.


Introduction
In Bayesian statistical modeling, researchers formalize their substantive theories in a statistical model π(x j θ) for the distribution of the data x and a prior distribution π(θ) for the parameter θ. Together, the statistical model and the prior distribution lead to the posterior distribution π(θ j x). The desired posterior distribution is often intractable, and simulating draws from such posterior distributions can be a difficult task. However, simulating data from the model is often simple: first generate a parameter value θ Ã from the prior distribution π(θ) and then generate data x Ã from the statistical model π(x j θ Ã ). From this composite sampling scheme we obtain draws from the joint distribution [1]: showing that the generated value θ Ã is also a draw from the posterior distribution π(θ Ã j x Ã ) / π(x Ã j θ Ã )π(θ Ã ) [2]. The posterior distribution π(θ Ã j x Ã ) is equal to the desired posterior distribution π(θ j x) only if the generated data x Ã are equal to the observed data x. Here we improve on an algorithm that uses the composite data sampling scheme to obtain draws from the posterior distribution π(θ j x) to be used in a wide range of settings. To simulate draws from the desired posterior π(θ j x), Murray, Ghahramani and MacKay [3] cleverly utilized the posteriors π(θ j x Ã ) as proposal distributions in an independence chain a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Metropolis algorithm [4]. This Metropolis algorithm operates by constructing a discrete-time Markov chain with state space O θ that has the desired posterior π(θ j x) as invariant distribution [5], and can be characterized as follows: where the state θ 0 of the chain at a time t is a draw from the invariant distribution π(θ j x), the proposed value θ Ã is a draw from the proposal distribution π(θ j x Ã ), with θ 0 and θ Ã 2 O θ . The proposed point θ Ã is accepted if U < ϕ, with U * Uniform(0, 1) and: 0ðy 0 ; y Ã j x; x Ã Þ ¼ min 1; pðy Ã j xÞpðy 0 j x Ã Þ pðy 0 j xÞpðy Ã j x Ã Þ ¼ min 1; pðx j y Ã Þpðx Ã j y 0 Þ pðx j y 0 Þpðx Ã j y Ã Þ : ð2Þ Note that the proposed value θ Ã is always accepted, i.e., ϕ(θ 0 , θ Ã j x, x Ã ) = 1, if the proposed setting π(x j θ Ã )π(x Ã j θ 0 ) is more likely than the current setting π(x j θ 0 )π(x Ã j θ Ã ), otherwise it is accepted with a probability proportional to the ratio of likelihoods in the proposed and current setting. This is known as the Single-Variable-Exchange (SVE) algorithm [3] and makes simulating draws from a posterior distribution as simple as simulating data. Although the SVE algorithm presents a practical and simple solution to sampling from intractable posterior distributions, its application and development has focused exclusively on statistical models π(x j θ) with computationally intractable normalizing constants [6]. In fact, the SVE algorithm was originally developed for statistical models in the Exponential Family [7][8][9]; i.e., models that can be written as: where t(x) is a (vector of) statistic(s) sufficient for θ and Z θ a normalizing constant. Observe that for models in the Exponential Family, the acceptance probability is of a particular simple form: 0ðy 0 ; y Ã j tðxÞ; tðx Ã ÞÞ ¼ min 1; exp f ðy Ã À y 0 Þ Á ðtðxÞ À tðx Ã ÞÞ g ð Þ; ð3Þ and does not involve the normalizing constant Z θ ; making the SVE algorithm a practical tool for Bayesian inference of models where Z θ is intractable, such as Exponential Random Graphs [10,11] and Markov Random Fields [12,13]. Despite the simplicity with which the SVE algorithm operates, especially for models in the Exponential Family (e.g., generalized linear models), its application to tractable statistical models π(x j θ) has been completely unexplored. Simple implementations do not guarantee efficient Markov chains, and in practice we often see that the SVE algorithm operates with low efficiency; requiring many thousands of iterations to obtain accurate estimates and wasting expensive computations on rejected proposals. This inefficiency results from generating data sets x Ã that are unlikely to occur under the current setting θ 0 (or x under θ Ã ); i.e. statistics t(x Ã ) that are far from t(x) in Eq (3). To this aim, several approaches have been proposed that replace the simple proposal generating mechanism with more elaborate schemes, using, for instance, random walks [3], parallel tempering [3], population Markov chain Monte Carlo methods [10,14], Langevine diffusions [15], or delayed rejection [16]. Although these approaches improve the statistical efficiency, they often fail to generalize the simple implementation of the original SVE algorithm.
Consequently, our goals are two-fold. Our primary goal is to show several developments that improve the statistical efficiency of the original SVE algorithm and result from reformulating it as an instance of what Tierney refers to as a mixture of Metropolis kernels [4,17] in Section 2.1. Our efforts focus on simultaneously sampling from the posterior distribution of a large number of latent variables (e.g., random effects in generalized linear mixed models or Bayesian hierarchical models) in Sections 3.2 and 3.3, and on sampling from highly concentrated posterior distributions in Sections 3.1, 3.4 and 3.5. The strategies that we present improve their efficiency as the sample size grows (driving the autocorrelation to zero), and allow the utilization of the cheap parallelism that is available in modern day computing [18]. A simple Exponential Family model serves to illustrate the development.
Our secondary goal is to introduce the SVE algorithm as a general purpose method that makes Bayesian inference simple, even for relatively complicated models. As the SVE algorithm does not require much more than the ability to generate data from the statistical model, we believe that it is a practical tool for applied researchers and also serves as a simple introduction to Markov chain Monte Carlo methods for students novel to the field. The extensions that we propose in this paper also fit these two purposes in that they are simple and intuitive extensions of the original SVE approach. To illustrate the original SVE approach and our proposed extensions, we have included annotated R [19] code as supporting information. Specifically, S1-S6 Scripts can be used to reproduce our results (i.e., Figs 1-6), and S7 Script contains the original SVE algorithm and our proposed algorithms in isolation.
Even though we will specifically focus on models in the Exponential Family, we note that our approach also applies to other models by replacing the sufficient statistic with an auxiliary statistic to relate generated data to a parameter. In general one often has a good idea how data and parameters are related, such that it is simple to find efficient auxiliary statistics, an idea that is regularly exploited in Approximate Bayesian Computation [20][21][22].
Clearly, the main drawback of our approach is the assumption that one is capable of simulating data from the model. That is, we assume that routines to sample (directly) from π(x j θ) and π(θ) are available. For most models efficient sampling routines are readily available in standard statistical software such as R [19], or can be constructed using general procedures [23,24]. Extensions of the SVE algorithm where data are sampled using a Markov chain have also been considered [25,26], and although not investigated here, we anticipate that our approach also extends in this direction. The more general notion is this: if the problem of simulating data is solved, the SVE algorithm turns data simulation into parameter estimation by producing draws from the posterior distribution.

A Mixture of Transition Kernels
The factorization in Eq (1) reveals that by first sampling a parameter value θ Ã * π(θ) and then data x Ã * π(x j θ Ã ), with probability (density) π(x Ã ) we have generated θ Ã from the proposal π (θ j x Ã ). With the acceptance probabilities ϕ defined as in Eq (2), we have that each generated proposal π(θ j x Ã ) corresponds to a unique Metropolis transition kernel, i.e., a transition probability distribution with density: for which the posterior π(θ j x) is the invariant distribution: pðy j y 0 ; x Ã ; xÞ pðy 0 j xÞ dy 0 : Since each Metropolis kernel in the SVE algorithm has the same invariant distribution, so does their mixture [4,17]: where the integration is over all possible generated datasets O x Ã. This formulation shows that the inefficiency of the SVE algorithm is caused by the generation of kernels that have a low probability of making a move, i.e., kernels corresponding to statistics t(x Ã ) that are far from the observed statistic t(x) (c.f. Eq (3)).
To illustrate, consider a simple example from educational measurement; the Rasch model [27]. The Rasch model describes the distribution of a pupil's item responses as a function of an ability parameter θ, and a difficulty parameter δ for each of k items: where X = 1 denotes a correct response and X = 0 an incorrect response. The test score x + = The distribution of transition kernels, i.e., pðx Ã þ Þ, for the original SVE algorithm in the Rasch model example with k = 20 items. In this example the average acceptance rate for sampling from the posterior π(θ j x + = 9) was approximately 37% (see S1 Script). ∑ i x i is sufficient for θ such that its posterior depends on the data only through the test score [28], and the mixture ranges over the k + 1 possible test scores O x Ã þ ¼ f0; 1; :::; kg: The mixing distribution (i.e., test score distribution) π(x + ) is shown in Fig 1 for a test consisting of k = 20 items, and confirms that it gives much weight to kernels corresponding to values of tðx Ã Þ ¼ x Ã þ that are far from the observed value t(x) = x + . For instance, for a pupil with test score t(x) = x + = 9, say, the probability to generate a kernel corresponding to values , is approximately 35% in this example (see S1 Script).

Oversampling for Single Parameter Updates
We wish to assign more weight to kernels with a high probability of accepting a move, while preserving the correct invariant distribution π(θ j x). A simple way to achieve this is to generate m ! 1 proposed points and then select the one that yielded a sufficient statistic t(x Ã ) most similar to t(x) (c.f. Eq (3)  direct sample was produced, the proposal distributions became increasingly more similar to the target distribution, thus increasing the overall probability of making a move.
In the application above, we have used functions t() of the observed x and simulated data x Ã to select a proposal distribution (i.e., a transition kernel). In practice, however, we might have more information available that informs the selection of good proposal distributions, and one can use functions f() that incorporate this information. In the next section we illustrate such a function that incorporates, for instance, covariates that are used in the statistical model. Observe that we do not use the current state of the parameter, θ 0 , or the proposed point, θ Ã , to select a proposal distribution. This ensures that the posterior distribution π(θ j x) remains the correct invariant distribution of the Markov chain. Fig 2 reveals that using the sufficient statistic we are able to select statistically more efficient proposals as m increases. This follows from inspecting the acceptance probability in Eq (3), and observing that the statistically more efficient proposals are those for which |t(x Ã ) − t(x)| is at a minimum, and furthermore that the minimum min m {|t(x Ã ) − t(x)|} over m proposals is non-increasing with m, i.e., It is important to note that the m proposals can be generated in parallel so that the oversampling of proposals need not increase the computational burden. However, only one of the m proposals is subsequently accepted by the Markov chain. As we will see next, all generated proposals can be put to good use when simultaneously sampling from more than one target distribution.

Matching for Multiple Parameter Updates
With the commonly assumed conditional independence of observations in hierarchical models, we have independent posterior distributions for each of n random effects (or latent variables) [28]: pðy p j x p Þ: Since proposals are also independently generated, it is convenient to consider the joint application of n independent SVE kernels: where the original SVE algorithm has due to independence. We wish to modify π(x Ã ) such that each component pðx Ã p Þ ¼ R pðx Ã Þ dx Ã np assigns more weight to kernels with a high probability of accepting a move. Similar to our oversampling procedure we can generate m ! 1 proposals and assign each of the generated proposals to a target distribution. Here, we choose the number of generated proposals to be equal to the number of target distributions, which implies that we simply rearrange the m = n generated proposals. We wish to rearrange the proposals such that each of the n kernels has a high probability of accepting the proposed point; i.e., match proposals to targets such that for each target distribution the proposal statistic t(x Ã ) is close to the observed statistic.
However, even for relatively small values of n it becomes expensive to search the n! possible arrangements for the statistically most efficient one, which suggests to use a simple rule to automatically choose an arrangement given a generated dataset x Ã .
We illustrate such a simple rule with sampling from the posterior distribution of n ability parameters in the Rasch model; pðy p j x pþ Þ: We order the posterior distributions based on the corresponding test scores; those corresponding to a low test score are placed first whereas those corresponding to a high test score are placed last. Next, we generate m = n proposal distributions which are ordered in the same way as the target distributions; those corresponding to a low generated test score are placed first and those corresponding to a high generated test score are placed last. It is clear that the first proposal is likely to be a good proposal for the first target distribution, the second proposal for the second target distribution, and so on. That this procedure improves the statistical efficiency is shown in Fig 3. The solid horizontal line in Fig 3 shows the average acceptance rate of the original SVE algorithm applied separately to each of the n latent variables, the efficiency of which is independent of n, and the points refer to the acceptance rate using our kernel matching procedure. Even with as little as n = 25 latent variables there is a substantial improvement to the statistical efficiency when the proposals are matched, with an average acceptance rate of 29% in the original SVE algorithm and 67% when the proposals are matched. Moreover, Fig 3  reveals that the statistical efficiency continues to improve as n increases. Fig 3 reveals that if t(x) is sufficient for θ, we have a good way to match proposals to targets and as n becomes sufficiently large, each kernel tends to make a move such that we sample approximately i.i.d. from each of the n posteriors. We note that the proposals can be generated and subsequently accepted in parallel. The only non-parallizable part of the procedure is in matching the proposals, although one can find clever ways to do this. Sorting the statistics t (x Ã ) (posterior distributions) is of an order of complexity that is often usually n log n but at most n 2 , which compares favorably to the order of complexity n! that is needed to find the statistically most efficient rearrangement.
Sampling-unit specific prior distributions π p (θ) are easily handled by incorporating the information encoded in the prior distributions, such as covariates, in matching the proposals. Since this information is also encoded in the mixing probabilities, it is available for matching proposals to target distributions. The only difference is that when a point drawn from π q (θ) is proposed to a posterior with prior density π p (θ), p 6 ¼ q, the prior distributions do not cancel from the expression in Eq (3), and we accept θ Ã with probability: min 1; exp f ðy Ã À y 0 Þ Á ðtðxÞ À tðx Ã ÞÞ g Â p p ðy Ã Þ p q ðy 0 Þ p p ðy 0 Þ p q ðy Ã Þ ! ; ensuring that π(θ j x) / π(x j θ)π p (θ) remains the invariant. For most prior distributions, the ratio of prior distributions is easily computed as many parts cancel in the expression.
To illustrate, consider a latent regression model in which each of n abilities θ is assigned a unique prior distribution: where y p constitutes a covariate vector for person p and β a vector of regression weights. Assuming that a point drawn from π q (θ) is proposed to a posterior with prior density π p (θ), p 6 ¼ q, we consequently accept θ Ã with probability: min 1; exp f ðy Ã À y 0 Þ Á ð½x pþ þ y T p b=s 2 À ½x Ã qþ þ y T q b=s 2 Þ g ; which also reveals that it is opportune to use the statistic x pþ þ y T p b=s 2 to match proposals to targets.

A Rejection Algorithm
When t(X Ã ) is a discrete random variable, a simple Rejection procedure is to generate proposals until an exact match t(x Ã ) = t(x) is produced [2]. The matching of kernels points to an extension of the Rejection algorithm for sampling from n ! 1 posteriors. Consider sampling from the posterior distribution of n ability parameters in the Rasch model using a common prior. There are at most k + 1 different posterior distributions to sample from; one for every possible test score. Let n x þ denote the number of observations for a test score x + . We generate a proposal corresponding to a test score x Ã þ ; if n x Ã þ > 0 we retain the proposed point and decrease n x Ã þ by one, otherwise we reject the proposed point. This procedure is repeated until n x þ ¼ 0 for each possible test score, after which the generated values can be assigned to the target distributions. Note that this simplifies to the original rejection algorithm when . The solid horizontal line shows the average number of proposals that need to be generated when the original rejection algorithm is applied separately to each of n latent variables, the efficiency of which is independent of n, and the points show the average number of proposals that need to be generated when the proposals are recycled. Most significant is the reduction of the computational expense when samples are required from increasingly larger numbers of target distributions. When n becomes sufficiently large, only n proposals need to be generated to sample once from each of the n target distributions.

Binning the Statistics
The rejection algorithm is unsuited for applications in which t(X) is a continuous random variable or a discrete random variable with many possible realizations. Even though repeated sampling does not guarantee an exact match, oversampling revealed that we do continue to produce better proposals. In general, a good proposal is one for which the statistic t(x Ã ) is "sufficiently close" to t(x); i.e., t(x Ã ) is in some range T a ¼ ðtðxÞ À a; tðxÞ þ aÞ, with a > 0. This suggests that we generate proposals until tðx Ã Þ 2 T a , with the value of a controlling the quality of our proposals.
To illustrate, consider a simple extension of our Rasch model known as the two-parameter logistic model. The two-parameter logistic model describes the distribution of a pupil's item responses as a function of the ability parameter θ and an item discrimination α i and difficulty δ i for each of k items: where the weighted test score t(x) = ∑ i α i x i is sufficient for θ. Since the discrimination parameters α i are real-valued (typically positive) we have that t(X) is a discrete random variable with 2 k possible realizations, one for every possible vector of item scores. We consider sampling from a posterior π(θ j t(x)) for a weighted test score t(x) % 19 (x + = 9) based on a k = 20 item test with discriminations α i that are sampled uniformly between 0 and 4. Fig 5 reveals that generating proposals until t(x Ã ) falls in a bin T a increases the quality of proposals as a function of a (using a 2 {1, 5, 3, 2}); improving the overall acceptance rate from approximately 17% for a = 1 (the original SVE algorithm) to approximately 74% for a = 5.
The idea of generating proposals until the statistic t(x Ã ) falls within a certain range T a is closely related to the idea behind the ABC-rejection algorithm, where one simply accepts a proposed point when tðx Ã Þ 2 T a [21,22]. When a is "sufficiently small" the proposed point θ Ã will be drawn from a posterior distribution π(θ j t(x Ã )) that is "close" to the target distribution π(θ j t(x)). For the SVE approach a need not be "sufficiently small" as the Metropolis kernel ensures that the correct posterior distribution π(θ j x) is the invariant distribution. It is clear that decreasing the value of a implies higher acceptance rates, but also that more samples are required to produce a value t(x Ã ) in T a , on average. However, when there are multiple target posterior distributions one could bin the observed statistics into m non-overlapping ranges, and apply recycling to the m bins to reduce the number of samples that need to be produced.

A Data Augmentation Procedure
Matching and oversampling use auxiliary-or sufficient statistics t(x Ã ) to choose more efficient proposals. Marsman, Maris, Bechger and Glas [29] generalized this approach by making clever use of the augmented variables that are often used to sample from π(x j θ). For example, logistic random variables are commonly used to sample item scores in the Rasch model: Although Gibbs samplers are frequently used to sample from the augmented posteriors π(z, θ j x), such procedures tend to converge slowly. The solution to this problem is to only use the augmented variables to generate proposals from a slightly different model that, when used in an independence chain (i.e., the SVE algorithm), does not suffer from this slow convergence.
Consider the distribution of w = {z 1 , . . ., z k , θ}; the joint distribution of the augmented variables and the parameter. We have that the conditional distribution pðw Ã kþ1 j w Ã nkþ1 Þ corresponds to a unique posterior distribution pðw Ã kþ1 ¼ y Ã j x Ã Þ, where x Ã is a function of w Ã defined through relations as Eq (4) (i.e., , and 0 otherwise). Note that this relation can also be used to define posteriors pðw Ã i j y Ã Þ for each of the k remaining conditional distributions pðw Ã i j w Ã ni Þ. The difference between the posterior pðw Ã kþ1 j x Ã Þ and pðw Ã i j y Ã Þ is that w k+1 = θ is used as the augmented variable in pðw Ã i j y Ã Þ and w i = z i is the proposed point, whereas w k+1 = θ is the proposed point in pðw Ã kþ1 j x Ã Þ and w i = z i is used as the augmented variable to generate x Ã i . This means that pðw Ã i j y Ã Þ is a posterior distribution that corresponds to a slightly different model that uses the same model components, but where one of the likelihood functions switched places with the prior distribution (see Ref. [29] for more details). What makes this approach interesting is that from generating a single data vector we obtain k + 1 proposal distributions and we can choose the statistically most efficient one. Fig 6 illustrates the approach with sampling from posterior distributions π(θ j x + ) in the Rasch model using pðw Ã j y Ã þ ¼ x þ Þ as the proposal distribution (right panel). Note that the procedure is statistically efficient and further improves when more observations become available; even though the posteriors become more concentrated. Also shown in Fig 6 is the original SVE approach (left panel), which becomes less efficient as the number of observations increases.
The procedure applies also to Logistic regression models, and, when the augmented variables have a non-logistic distribution, for instance a normal distribution, we obtain other Bernoulli regression models, such as probit regression [30]. Marsman et al. [29] used the procedure to estimate Ising network models using a full-data-information procedure, utilizing a latent variable expression of the Ising model, where the conditional distribution π(x j θ) was found to be a multidimensional two-parameter logistic model [31].
Exponential Family models are closed under conditioning, that is, π(x j θ, x 2 ω & O x ) is also in the Exponential Family. In this manner, we find that generating responses to two Rasch items corresponds to a three category Partial Credit Model [32] whenever and the procedure similarly extends to such situations. In principle, this procedure can be used to generate other models, such as multinomial logit models [33], and extends the framework of Marsman et al. [29] to Potts network models. Acceptance rates for the original SVE algorithm and SVE using pðw Ã j y Ã þ Þ as proposal. The average acceptance rate for sampling from posterior distributions π(θ j x + ) when using the original SVE algorithm (left panel) and when using the proposal distribution pðw Ã j y Ã þ ¼ x þ Þ (right panel). doi:10.1371/journal.pone.0169787.g006

Discussion
With the SVE algorithm a powerful yet simple idea was introduced; any model that can be simulated can be estimated. Based on a mixture of Metropolis kernels representation we have built upon the idea introduced with the original SVE algorithm and suggested several approaches that produce significant improvements to the convergence and autocorrelation of the Markov chain. To keep things simple, we have focused explicitly on statistical models π(x j θ) that are in the Exponential Family. However, the approaches that we have proposed in this paper are more generally applicable and simple to implement.