Allocation Variable-Based Probabilistic Algorithm to Deal with Label Switching Problem in Bayesian Mixture Models

  • Jia-Chiun Pan,

    Affiliation Department of Mathematics, National Chung Cheng University, Chiayi, Taiwan

  • Chih-Min Liu,

    Affiliation Department of Psychiatry, National Taiwan University Hospital and National Taiwan University College of Medicine, Taipei, Taiwan

  • Hai-Gwo Hwu,

    Affiliation Department of Psychiatry, National Taiwan University Hospital and National Taiwan University College of Medicine, Taipei, Taiwan

  • Guan-Hua Huang

    Affiliation Institute of Statistics, National Chiao Tung University, Hsinchu, Taiwan

The label switching problem arises from the nonidentifiability of the posterior distribution over permutations of component labels when a Bayesian approach is used to estimate parameters in mixture models. For the case where the number of components is fixed and known, we propose an allocation variable-based probabilistic (AVP) relabelling algorithm to deal with the label switching problem. We establish a model for the posterior distribution of the allocation variables under label switching. The AVP algorithm stochastically relabels the posterior samples according to the posterior probabilities of the established model. Several existing deterministic and probabilistic algorithms are compared with the AVP algorithm in simulation studies, and the success of the proposed approach is demonstrated in simulation studies and on a real dataset.


Finite mixture models provide a flexible way to model heterogeneous data, and have been applied to a wide variety of data in social, medical and physical science. Overviews of applications of finite mixture models can be found in Titterington et al. [1] and McLachlan and Peel [2].

The likelihood function of a finite mixture model is invariant to switching of the component labels. In recent decades, the development of Markov chain Monte Carlo (MCMC) methods [3] and progress in computer technology have made Bayesian analysis of finite mixture models popular. In the Bayesian setting, if the prior information does not distinguish the components of the mixture model, the resulting posterior distribution is invariant to all permutations of the component labels. Hence, ergodic averages over the MCMC samples from the posterior distribution are meaningless. This is termed the label switching problem [4, 5].

Many approaches have been proposed to deal with the label switching problem in Bayesian analysis. The most commonly used approach is to impose artificial ordering constraints on the model parameters (OC algorithm) [6, 7]. However, a poor choice of constrained parameters may not provide a satisfactory solution [4, 7]. Celeux et al. [8] and Stephens [5] proposed a decision-theoretic approach that minimizes a selected Monte Carlo risk. Stephens [5] (KL algorithm) suggested a particular loss function based on the Kullback-Leibler divergence to measure the similarity of posterior allocation probabilities. Grün and Leisch [9] developed a more flexible risk-based algorithm to deal with more practical situations in real-world applications. These algorithms, designed to minimize Monte Carlo risk, can be regarded as imposing a sophisticated constraint through a loss function.

Other relabelling approaches require more sophisticated algorithms. Papastamoulis and Iliopoulos [10] used equivalence class representatives (ECR algorithm) to reduce the symmetric posterior distribution to a nonsymmetric one, which can then be used to deal with the label switching problem. Yao and Lindsay [11] (HPD algorithm) used each MCMC sample as the starting point of an ascent algorithm, and labelled the sample according to the posterior mode to which the algorithm converged. Sperrin et al. [12] proposed probabilistic relabelling methods (SJW algorithm) that use a probabilistic learning mechanism to avoid “over-correct” relabelling. Rodriguez and Walker [13] proposed an iterative version of the ECR algorithm (the iterative version 2 of the ECR algorithm: ECR2 algorithm), which does not require a good pivot estimate from the start but improves it iteratively. In ECR2, the allocation probabilities need to be stored. They also developed a deterministic relabelling algorithm that uses the relationship between the observed data and the allocation variables to devise a K-means type of loss function (DBS algorithm).

In this paper, an allocation variable-based probabilistic relabelling approach (AVP algorithm) is proposed to find the labelling functions. The proposed algorithm is developed under the assumption that the posterior distributions of the allocation variables are independent. The AVP algorithm is compared with six existing relabelling algorithms (KL, ECR, HPD, SJW, ECR2 and DBS) in simulation studies. In the real data analysis, schizophrenia syndrome scale data fitted by a latent class model are used to demonstrate that labels can be identified well by the proposed algorithm.

The Label Switching Phenomena

Bayesian Analysis of Finite Mixture Models

A finite mixture model composed of K components is of the form $p(y \mid \theta) = \sum_{k=1}^{K} \eta_k f(y \mid \phi_k, \psi)$, where y is the random variable (vector) of response, ϕk is the component specific parameter of density f, ηk is the component weight with ηk > 0 and $\sum_{k=1}^{K} \eta_k = 1$, ψ is the parameter common to all components, and K is considered as fixed and known in this paper. Here we denote θk = (ηk, ϕk), and θ = (θ1, …, θK, ψ). The likelihood for θ is $L(\theta \mid y) = \prod_{i=1}^{n} p(y_i \mid \theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} \eta_k f(y_i \mid \phi_k, \psi)$, where y = (y1, …, yn) are independent observations from the mixture density p(⋅∣θ).

Data Augmentation

In Bayesian analysis of finite mixture models, one can take a missing data perspective to interpret how the data are formed [7]. This is done by augmenting the data with a latent class membership random variable (called the allocation variable in this paper) Ci, i = 1, …, n, where Ci indicates the class membership of observation yi. If Ci = k, the observation yi is regarded as drawn from the kth component density. We can then assume that yi given both Ci and θ has distribution $p(y_i \mid C_i = k, \theta) = f(y_i \mid \phi_k, \psi)$ and that p(Ci = k∣θ) = ηk. The data augmentation technique simplifies the expression of the likelihood and therefore facilitates MCMC simulation from the posterior distributions.

Under a Bayesian framework, we specify a prior distribution p(θ) for the parameters θ. The joint posterior distribution of θ and C is proportional to L(θ, C∣y) × p(θ), where C = (C1, …, Cn) and $L(\theta, C \mid y) = \prod_{i=1}^{n} \eta_{C_i} f(y_i \mid \phi_{C_i}, \psi)$. Each parameter is drawn from its full conditional distribution given the other parameters. The procedure for drawing posterior samples of each element of θ and C is as follows:

  1. Step 1: Update the component weights ηk, for k = 1, …, K;
  2. Step 2: Update the component specific parameter ϕk, for k = 1, …, K;
  3. Step 3: Update the common parameter ψ;
  4. Step 4: Update the allocation variable Ci, for i = 1, …, n.

Step 1 is usually completed by giving a Dirichlet prior distribution D(e1, …, eK) to (η1, …, ηK), where the ek’s are hyperparameters. Given the values of C, ϕ1, …, ϕK and ψ, the full conditional distribution of (η1, …, ηK) is D(e1+n1, …, eK+nK), where $n_k = \sum_{i=1}^{n} I(C_i = k)$. Given the values of C and η1, …, ηK, Step 2 and Step 3 are standard steps in MCMC simulation, and the way to simulate samples is model-dependent. Further blocking of θ may be necessary for convenient sampling in each block. Examples of simulating θ are illustrated in the sections on simulation studies and real data analysis. Given the values of θ, Step 4 is carried out by drawing Ci from a multinomial distribution with parameters πi1(θ), …, πiK(θ), where $$\pi_{ik}(\theta) = \frac{\eta_k f(y_i \mid \phi_k, \psi)}{\sum_{j=1}^{K} \eta_j f(y_i \mid \phi_j, \psi)}. \quad (1)$$

The allocation variable Ci can also be expressed as a set of binary random variables. Define a binary random vector (Si1, …, SiK), and let Sim = 1 if Ci = m and Sik = 0 for all k ≠ m. The allocation variable C then forms an n × K allocation variable matrix S = [Sik], 1 ≤ i ≤ n, 1 ≤ k ≤ K, that summarizes the allocation information in C.
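As an illustration of Steps 1 and 4, the following is a minimal sketch rather than the authors' implementation. It assumes a Poisson component density with no common parameter ψ, uses 0-based labels, and all function names (`draw_weights`, `allocation_probs`, `draw_allocations`, `to_matrix`) are hypothetical.

```python
import math
import random

def poisson_pmf(y, rate):
    """Poisson probability mass, used here as an example component density f."""
    return math.exp(-rate) * rate ** y / math.factorial(y)

def draw_weights(alloc, K, e, rng):
    """Step 1: draw (eta_1, ..., eta_K) from Dirichlet(e_1 + n_1, ..., e_K + n_K)."""
    counts = [sum(1 for c in alloc if c == k) for k in range(K)]
    # A Dirichlet draw is a vector of independent gamma draws, normalised to sum to 1.
    g = [rng.gammavariate(e[k] + counts[k], 1.0) for k in range(K)]
    total = sum(g)
    return [x / total for x in g]

def allocation_probs(y_i, eta, phi):
    """Allocation probabilities pi_ik, proportional to eta_k * f(y_i | phi_k)."""
    w = [eta[k] * poisson_pmf(y_i, phi[k]) for k in range(len(eta))]
    total = sum(w)
    return [x / total for x in w]

def draw_allocations(y, eta, phi, rng):
    """Step 4: draw each C_i from its multinomial full conditional."""
    K = len(eta)
    return [rng.choices(range(K), weights=allocation_probs(y_i, eta, phi))[0]
            for y_i in y]

def to_matrix(alloc, K):
    """Binary n x K allocation matrix S with S[i][k] = 1 exactly when C_i = k."""
    return [[1 if c == k else 0 for k in range(K)] for c in alloc]
```

Each row of the matrix returned by `to_matrix` contains a single 1, which is the binary representation (Si1, …, SiK) of Ci.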

The Label Switching Phenomenon

There are K! possible permutations of {1, …, K}. Let vq be the qth of these K! permutations. The permutation function vq maps the original index set {1, …, K} to {vq(1), …, vq(K)}. Define the qth permutation of the parameter θ by $v_q(\theta) = (\theta_{v_q(1)}, \ldots, \theta_{v_q(K)}, \psi)$ and of the allocation variable matrix S by $v_q(S) = [S_{i v_q(k)}]$, 1 ≤ i ≤ n, 1 ≤ k ≤ K. The label switching problem arises when the likelihood L(θ∣y) is permutation invariant, that is, L(θ∣y) = L(vq(θ)∣y) for all q = 1, …, K!. If the prior distribution of θ is also permutation invariant, the posterior distribution is invariant to any permutation of the parameters. Samples generated by MCMC are simulation outputs from the permutation invariant likelihood and priors with an unknown value of q; therefore, when the Markov chain is stationary, every MCMC sample is a sample from the permutation invariant posterior distribution. Statistics inferred from the marginal posterior distributions, such as credible intervals and posterior means, then become meaningless unless the inverse permutation function of every sample is discovered and used to relabel the MCMC outputs of θ.
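The permutation invariance of the likelihood can be checked numerically. The sketch below is illustrative only: it assumes a three-component Poisson mixture with made-up data and parameter values, and the helper names are hypothetical.

```python
import itertools
import math

def mixture_loglik(y, eta, phi):
    """Log-likelihood of a Poisson mixture with weights eta and rates phi."""
    total = 0.0
    for y_i in y:
        dens = sum(e * math.exp(-r) * r ** y_i / math.factorial(y_i)
                   for e, r in zip(eta, phi))
        total += math.log(dens)
    return total

def permute(values, v):
    """Apply permutation v: position k receives values[v[k]]."""
    return [values[v[k]] for k in range(len(v))]

# Illustrative data and parameter values (not taken from the paper).
y = [2, 4, 5, 7, 9, 11]
eta = [0.2, 0.3, 0.5]
phi = [3.0, 6.0, 10.0]

base = mixture_loglik(y, eta, phi)
# Relabel the components in all K! = 6 ways; weights and rates move together.
logliks = [mixture_loglik(y, permute(eta, v), permute(phi, v))
           for v in itertools.permutations(range(3))]
```

All six values in `logliks` agree with `base` up to floating-point rounding, because relabelling only reorders the terms of the inner sum.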

Although the label switching phenomenon causes difficulty in inference from the posterior distributions, it can also provide a useful convergence diagnostic for MCMC simulation [14]. A Markov chain that fails to visit all permutation states with approximately equal frequency can be viewed as a warning of nonstationarity. Hence, to ensure that a Markov chain reaches its stationary state, Frühwirth-Schnatter [15] proposed a dynamic switching procedure for Bayesian mixture models, called the permutation sampler, which forces the Markov chain to explore all possible permutation states quickly. In this sense, label switching is a desired property, and the posterior distribution of the parameters is a mixture of K! component densities. Frühwirth-Schnatter [15] termed samples that visit all permutation states with approximately equal frequency unconstrained samples. A formal proof given by Papastamoulis and Iliopoulos [16] states that the permutation sampler converges at least as fast as the unconstrained sampler. In the following, we adopt Frühwirth-Schnatter’s procedure and terminology.

Proposed Relabelling Method

The permutation function that has acted on θ and S is arbitrary and not observed. We treat the unobservable index of the permutation function as a latent random variable τ taking a value in {1, …, K!}, with $\Pr(\tau = k) = 1/K!$ for k = 1, …, K! [15]. Another random variable σ is the index of the inverse permutation function of τ, where θ = vτ(vσ(θ)) = vσ(vτ(θ)) and S = vτ(vσ(S)) = vσ(vτ(S)). If the value of τ is observed, the inverse permutation function vσ is known and can transfer θ and S back to one of the K! permuted posterior densities of the unconstrained samples.

In subsequent sections, the Markov chain is assumed to be stationary and ergodic. For MCMC samples {(θt, St):t = 1, …}, let τt be the latent random variable of the unobserved permutation function at time t, and let σt be the index of its corresponding inverse permutation function.

We propose an allocation variable-based probabilistic (AVP) relabelling algorithm to deal with the label switching problem. The AVP algorithm is developed under the assumption that the posterior distributions of the allocation random variables C1, …, Cn are independent. This independence assumption on the posterior distribution (C1, …, Cn)∣y usually does not hold because of the variability from the prior distribution p(θ). We impose it to obtain a tractable, practical solution to the label switching phenomenon in Bayesian mixture models. Similar simplifications have been made in other Bayesian techniques, such as variational Bayes approaches (see e.g., Corduneanu and Bishop, 2001 [17]; Bishop, 2006 [18]). In the rest of this section, we assume that the posterior distributions of C1, …, Cn are independent, and π0 = [π0,ik], 1 ≤ i ≤ n, 1 ≤ k ≤ K, denotes the parameters of the posterior distribution of S.

Each posterior sample S is the consequence of label switching with an unknown permutation. The model of S can be constructed from an unknown permutation random variable τ (or the relabelling random variable σ) and the parameters π0. We use a multinomial distribution to model the allocation variables (Si1, …, SiK), with Sik taking the value 0 or 1 for all k and $\sum_{k=1}^{K} S_{ik} = 1$. The probability mass function of (Si1, …, SiK) is then $\prod_{k=1}^{K} \pi_{0,ik}^{s_{ik}}$. Since the allocation variables are assumed to be independent, the posterior probability density at the realized sample point s given y and τ = q can be modelled by $p(s \mid y, \tau = q) = \prod_{i=1}^{n} \prod_{k=1}^{K} \pi_{0,i v_q(k)}^{s_{ik}}$. Let the probability Pr[τ = q∣y] be denoted by wq. Then the posterior probability density of S at s is $$p(s \mid y) = \sum_{q=1}^{K!} w_q \prod_{i=1}^{n} \prod_{k=1}^{K} \pi_{0,i v_q(k)}^{s_{ik}}. \quad (2)$$ The value of wq is the proportion of draws in the Markov chain for which τ takes the value q. When the Markov chain is stationary, the relative frequencies of the different values of τ will eventually be close, and hence the proportions of the different values of τ should be equal. This means that if T is sufficiently large, the chain will achieve $w_q = 1/K!$ for all q [15]. In the label switching problem, the relabelling random variable σ is of interest. We can rewrite Eq (2) in terms of σ as $$p(s \mid y) = \sum_{m=1}^{K!} w_m \prod_{i=1}^{n} \prod_{k=1}^{K} \pi_{0,ik}^{s_{i v_m(k)}}, \quad (3)$$ where vm is the inverse permutation function of vq such that $v_q(v_m(k)) = v_m(v_q(k)) = k$.

To estimate the parameters π0 in model (3), Eq (4) defines, for each pair of observations i and j, a function of {π0,i1, …, π0,jK} subject to a restriction on π0, and Eq (5) defines the corresponding moment function gT(i, j). Notice that the expectation of Eq (5) is 0 when Ci and Cj are independent for all i, j with i ≠ j. Then E(∑i≠j gT(i, j)) = 0 is a moment equation for π0. According to this equation, an objective function is defined in Eq (6). Notice that Eq (4), which depends on {π0,i1, …, π0,jK}, is invariant to label permutations, and so are Eqs (5) and (6). The minimizer of Eq (6) with respect to π0, $\hat{\pi}_0$, obtained through Newton’s method, is the Generalized Method of Moments (GMM) estimator. The GMM estimator has several large sample properties established in Hansen [19], including that $\hat{\pi}_0$ converges to π0 almost surely.

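To estimate the value of σ at each time point, let σt denote the random variable σ at time t, and let $s^t$ denote the realized allocation matrix at time t. The estimate of σt can be obtained through the following posterior probability: $$\Pr(\sigma_t = m \mid s^t, y) = \frac{w_m \prod_{i=1}^{n} \prod_{k=1}^{K} \pi_{0,ik}^{s^t_{i v_m(k)}}}{\sum_{l=1}^{K!} w_l \prod_{i=1}^{n} \prod_{k=1}^{K} \pi_{0,ik}^{s^t_{i v_l(k)}}}, \quad m = 1, \ldots, K!. \quad (7)$$ Based on these posterior probabilities, we adopt the following stochastic algorithm (termed the AVP algorithm) to estimate σt, for each t = 1, …, T.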

AVP Algorithm.

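  1. Step A: Numerically solve for the GMM estimator $\hat{\pi}_0$ from Eq (6).
  2. Step B: For t = 1, …, T, estimate $\Pr(\sigma_t = m \mid s^t, y)$ by substituting the GMM parameter estimates, the $\hat{\pi}_{0,ik}$’s, into Eq (7) to obtain $\widehat{\Pr}(\sigma_t = m \mid s^t, y)$, m = 1, …, K!.
  3. Step C: Randomly assign the relabelling permutation index at time t, $\hat{\sigma}_t$, to a value m in {1, …, K!}, with probability $\widehat{\Pr}(\sigma_t = m \mid s^t, y)$.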

The AVP algorithm estimates the index of the inverse permutation function. The estimated permutation function is then applied to θt to relabel the parameters. For the examples in the simulation studies and the real data application, the AVP algorithm produces satisfactory relabelled results.
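Steps B and C can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the estimate of π0 from Step A is taken as given (`pi_hat`, with all entries positive), the weights over permutations are taken as equal so that they cancel, labels are 0-based, and the function names are hypothetical.

```python
import itertools
import math
import random

def relabel_probs(s, pi_hat):
    """Step B: probability of each candidate relabelling permutation for one draw.

    s is the n x K binary allocation matrix of the draw and pi_hat is the n x K
    matrix of estimated allocation probabilities (assumed given, all > 0).
    Equal weights over the K! permutations cancel in the normalisation.
    """
    K = len(pi_hat[0])
    perms = list(itertools.permutations(range(K)))
    log_w = []
    for v in perms:
        # log of prod_i prod_k pi_hat[i][k] ** s[i][v[k]]
        log_w.append(sum(s_i[v[k]] * math.log(p_i[k])
                         for s_i, p_i in zip(s, pi_hat) for k in range(K)))
    top = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(l - top) for l in log_w]
    total = sum(w)
    return perms, [x / total for x in w]

def avp_relabel(s, theta, pi_hat, rng):
    """Step C: sample a permutation, then apply it to S and to the component parameters."""
    perms, probs = relabel_probs(s, pi_hat)
    v = rng.choices(perms, weights=probs)[0]
    s_new = [[s_i[v[k]] for k in range(len(v))] for s_i in s]
    theta_new = [theta[v[k]] for k in range(len(v))]
    return v, s_new, theta_new
```

Working on the log scale avoids underflow when n is large, since each unnormalised probability is a product of n terms. The loop over all K! permutations is what makes the cost grow quickly with K.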

Simulation Studies

In this section, we compare the proposed AVP algorithm with various relabelling algorithms. First, we compare AVP with the KL, ECR, SJW and HPD algorithms in Poisson mixture models with fixed and known component weights and K = 2. With known component weights, we can show analytically how these methods transform the posterior distributions. Second, we compare AVP with the more recent ECR2 and DBS solutions under normal mixture models with both known and unknown component weights. AVP and ECR2 are compared under univariate normal mixture models with K = 3, and AVP and DBS are compared under multivariate normal mixture models with K = 4. Except for the HPD and AVP algorithms, all the compared algorithms are available in the label.switching package [20] for R. Finally, the computation times of the relabelling algorithms in these simulation studies are summarized at the end of this section.

Poisson Mixture Models with Known Component Weights

Poisson mixture models are studied in this section, and five relabelling methods are compared: KL, ECR, HPD, SJW and AVP.

This simulation study generates data from a two-component Poisson mixture model whose probability density function is $$p(y_i \mid \eta, \phi) = \eta_1 f(y_i \mid \phi_1) + \eta_2 f(y_i \mid \phi_2), \quad (8)$$ where η = (η1, η2), ϕ = (ϕ1, ϕ2), and f(yi∣ϕk) is a Poisson distribution with parameter ϕk for the response yi. We simulate y = (y1, …, yn) under four scenarios: (1) η1 = η2 = 0.5, ϕ1 = 5, ϕ2 = 7 and n = 10; (2) η1 = η2 = 0.5, ϕ1 = 5, ϕ2 = 7 and n = 100; (3) η1 = 0.3, η2 = 0.7, ϕ1 = 5, ϕ2 = 5.5 and n = 10; and (4) η1 = 0.3, η2 = 0.7, ϕ1 = 5, ϕ2 = 5.5 and n = 100. In the following simulations, the component weights (i.e., η1 and η2) are treated as fixed and known, and only the parameters of the component densities (i.e., ϕ1 and ϕ2) are of interest and are estimated via MCMC simulation. We assume that the priors for ϕ1 and ϕ2 are i.i.d. from the gamma distribution Γ(1.2, 0.2) with mean 6, and use the Poisson-gamma model to obtain the posterior samples of ϕ. While generating the posterior samples of ϕ, the values of η are set to the true values under each scenario.

The Gibbs sampling scheme is adopted here to produce posterior samples $\{(\phi^{1}, S^{1}), \ldots, (\phi^{T}, S^{T})\}$, where the superscript indexes the MCMC iteration and the allocation variable matrix $S^{t}$ is an n × 2 matrix whose element $s^{t}_{ik}$ is a 0/1 variable: $s^{t}_{ik} = 1$ if the ith subject is attributed to the kth component in the tth MCMC iteration, and $s^{t}_{ik} = 0$ otherwise. This sampling scheme starts with an initial value $S^{0}$, and runs for t = 1, …, T as follows:

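  1. Step 1. Generate $\phi^{t}_{k}$ from $\Gamma\left(1.2 + \sum_{i=1}^{n} s^{t-1}_{ik} y_i,\; 0.2 + \sum_{i=1}^{n} s^{t-1}_{ik}\right)$ for k = 1, 2;
  2. Step 2. Generate $S^{t}$ with its element $s^{t}_{i1}$ from the Bernoulli distribution with probability $\eta_1 f(y_i \mid \phi^{t}_{1}) / \left[\eta_1 f(y_i \mid \phi^{t}_{1}) + \eta_2 f(y_i \mid \phi^{t}_{2})\right]$ and set $s^{t}_{i2} = 1 - s^{t}_{i1}$ for i = 1, …, n, where η1 and η2 are fixed values and are therefore independent of t;
  3. Step 3. Select the permutation (1, 2) or (2, 1) with equal probability 0.5. If (1, 2) is chosen, the labels of the components of $(\theta^{t}, S^{t})$ remain unchanged; otherwise, swap the labels 1 and 2 of the components in $(\theta^{t}, S^{t})$, where $\theta^{t} = (\theta^{t}_{1}, \theta^{t}_{2})$ and $\theta^{t}_{k} = (\eta_k, \phi^{t}_{k})$, k = 1, 2.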
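The three steps above can be sketched in a few lines. This is a hedged illustration rather than the authors' code: it assumes the Γ(1.2, 0.2) prior is parameterized by shape and rate (consistent with the stated mean of 6), it swaps the fixed weights together with the other component labels in Step 3, and the function and variable names are hypothetical.

```python
import math
import random

def pois(y, rate):
    """Poisson probability mass function."""
    return math.exp(-rate) * rate ** y / math.factorial(y)

def gibbs_poisson_mixture(y, eta, T, rng, a=1.2, b=0.2):
    """Return T unconstrained draws (phi, eta, S) for a two-component Poisson mixture."""
    eta = list(eta)
    # Arbitrary initial allocation matrix S^0.
    s = [[1, 0] if rng.random() < 0.5 else [0, 1] for _ in y]
    draws = []
    for _ in range(T):
        # Step 1: conjugate gamma update; gammavariate takes (shape, scale = 1 / rate).
        phi = [rng.gammavariate(a + sum(s_i[k] * y_i for s_i, y_i in zip(s, y)),
                                1.0 / (b + sum(s_i[k] for s_i in s)))
               for k in range(2)]
        # Step 2: Bernoulli draw for the first column; the second is its complement.
        s = []
        for y_i in y:
            p1 = eta[0] * pois(y_i, phi[0])
            p2 = eta[1] * pois(y_i, phi[1])
            first = 1 if rng.random() < p1 / (p1 + p2) else 0
            s.append([first, 1 - first])
        # Step 3: permutation sampler, swapping all component labels with probability 0.5.
        if rng.random() < 0.5:
            phi = phi[::-1]
            eta = eta[::-1]
            s = [row[::-1] for row in s]
        draws.append((phi, list(eta), [row[:] for row in s]))
    return draws
```

Because the weights travel with their labels in this sketch, sorting each draw by its stored weights recovers the original labelling when η1 ≠ η2.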

The permutation sampler applied in Step 3 serves different purposes in different scenarios. In Scenarios (1) and (2), where the η values are fixed at η1 = η2 = 0.5, the Markov chain can produce label switching, and the permutation sampler is applied to speed up the convergence of the MCMC and to obtain unconstrained samples [15]. In Scenarios (3) and (4), where the η values are fixed at η1 = 0.3 and η2 = 0.7, the likelihood in Eq (8) is not symmetric, and the usual Gibbs sampling without Step 3 does not produce label switching. The permutation sampler used here yields unconstrained posterior samples from the likelihood in Eq (9), which creates a “pseudo” label switching phenomenon. We can then apply various relabelling methods to the unconstrained samples of (ϕ1, ϕ2). The correctly labelled posterior samples of ϕ can be obtained by imposing an ordering constraint on η. Hence, we can compare the relabelled results of the algorithms with the correctly labelled posterior samples.

The Gibbs sampling scheme was run for 110,000 samples for each scenario. The first 10,000 samples were treated as the burn-in period, and the subsequent 100,000 samples were used for relabelling. Algorithms KL, ECR, HPD, SJW and AVP were applied to the unconstrained samples of each scenario.

Fig 1 shows the relabelled results under Scenario (1). Fig 1a shows a scatter plot of the unconstrained samples of ϕ, which is symmetric about the 45 degree line. This means that the permutation states were explored well because of the permutation sampler. Fig 1b–1f show scatter plots of the relabelled results from the five relabelling algorithms. Fig 1b and 1d show that KL and HPD assigned posterior samples of ϕ lying below the 45 degree line to the other side. The results in these figures are almost the same as those of ordering constraint relabelling with the restriction ϕ2 > ϕ1. Fig 1e shows that the results from the SJW algorithm are almost the same as those in Fig 1a, so SJW does not seem to relabel the unconstrained samples well. The performance of the ECR algorithm shown in Fig 1c is almost the same as that of our AVP algorithm in Fig 1f.

Fig 1. Plots (a)–(f) are scatter plots of posterior samples of (ϕ1, ϕ2) for Scenario (1) (n = 10, ϕ1 = 5, ϕ2 = 7, η1 = η2 = 0.5).

Plot (a) is the unconstrained samples. Plots (b)–(f) are the relabelled samples under various relabelling algorithms.
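For reference, the ordering constraint relabelling that KL and HPD resemble in this scenario can be sketched as below. This is a generic illustration with made-up draws, and the function name is hypothetical.

```python
def relabel_by_order(phi_draws):
    """Ordering constraint: within each draw, sort the components by increasing phi."""
    relabelled = []
    for phi in phi_draws:
        # order[k] is the original label placed in position k after sorting.
        order = sorted(range(len(phi)), key=lambda k: phi[k])
        relabelled.append(([phi[k] for k in order], order))
    return relabelled

# Two illustrative draws of (phi_1, phi_2); the first lies below the 45 degree line.
result = relabel_by_order([[7.1, 5.2], [4.9, 6.8]])
```

Every draw below the 45 degree line is reflected to the other side, which is why the relabelled scatter plot has no points with ϕ1 > ϕ2.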

To understand the effect of a larger sample, the sample size of Scenario (1) was increased from n = 10 to n = 100 (Scenario (2)). Fig 2 shows that the posterior samples are clearly more concentrated than those from n = 10. Conclusions from the comparisons of KL, HPD and SJW are consistent with those from n = 10. ECR (Fig 2c) and AVP (Fig 2f) give similar results, but ECR appears to have posterior samples spread more widely below the 45 degree line than AVP.

Fig 2. Plots (a)–(f) are scatter plots of posterior samples of (ϕ1, ϕ2) for Scenario (2) (n = 100, ϕ1 = 5, ϕ2 = 7, η1 = η2 = 0.5).

Plot (a) is the unconstrained samples. Plots (b)–(f) are the relabelled samples under various relabelling algorithms.

Fig 3 shows the results under Scenario (3). This scenario decreases the distance between ϕ1 and ϕ2, and allows the values of η to be unequal (η1 = 0.3 and η2 = 0.7). These settings emphasize the effect of unequal weights and of the reduced distance between the ϕ values. Notice that, in Scenarios (3) and (4), the η values are fixed at the true values η1 = 0.3 and η2 = 0.7. Therefore, the correctly labelled posterior distribution of ϕ can be obtained by restricting η1 < η2. Fig 3a presents a scatter plot of the correctly labelled posterior samples of ϕ. The relabelled samples from HPD (Fig 3d) are the same as those obtained by imposing the ordering constraint ϕ2 > ϕ1. The KL algorithm (Fig 3b) seems to move the relabelled sample points in the middle-left region to the opposite side of the 45 degree line. This phenomenon is not improved even if we use the correctly labelled posterior samples as initial points for the KL algorithm. Compared with the scatter plot of the correctly labelled posterior samples, AVP (Fig 3f) seems to give more similar results than ECR (Fig 3c) and SJW (Fig 3e) do.

Fig 3. Plots (a)–(f) are scatter plots of posterior samples of (ϕ1, ϕ2) for Scenario (3) (n = 10, ϕ1 = 5, ϕ2 = 5.5, η1 = 0.3 and η2 = 0.7).

Plot (a) is the posterior samples with correct labels. Plots (b)–(f) are the relabelled samples under various relabelling algorithms.

Because the correctly labelled posterior samples are known in this scenario, the marginal distributions of ϕ of the relabelled samples from all relabelling methods can be compared with the true marginal densities, which are shown in Fig 4. Fig 4a and 4b show the density plots of ϕ1 and ϕ2 for Scenario (3), respectively. The density plot of the AVP algorithm nearly coincides with that of the correctly labelled posterior samples.

Fig 4. The density plots of relabelling samples from various relabelling methods in Scenarios (3) and (4).

The black dashed line represents the density plot of the true posterior distributions. The grey, blue, purple, blue and red lines represent the density plots of KL, ECR, HPD, SJW and AVP, respectively. Plots (a) and (b) are the density plots of ϕ1 and ϕ2 for Scenario (3), respectively. Plots (c) and (d) are the density plots of ϕ1 and ϕ2 for Scenario (4), respectively.

Fig 5 shows the results under Scenario (4), which increases the sample size of Scenario (3) to n = 100. In Scenario (4), the results from HPD (Fig 5d) are similar to those from the ordering-constrained samples. The results of KL, SJW and AVP (Fig 5b, 5e and 5f, respectively) are similar to the correctly labelled posterior samples (Fig 5a). ECR (Fig 5c) seems to gather more sample points on the right side of the region. Fig 4c and 4d show the marginal density plots for Scenario (4). Except for HPD and ECR, the algorithms have density plots that coincide with that of the correctly labelled posterior samples.

Fig 5. Plots (a)–(f) are scatter plots of posterior samples of (ϕ1, ϕ2) for Scenario (4) (n = 100, ϕ1 = 5, ϕ2 = 5.5, η1 = 0.3 and η2 = 0.7).

Plot (a) is the posterior samples with correct labels. Plots (b)–(f) are the relabelled samples under various relabelling algorithms.

To reach a more reliable conclusion, simulated datasets are generated with 100 replications under Scenarios (1)–(4). Note that η is fixed in these scenarios. The averages and standard deviations of the posterior means over the 100 replications are shown in Table 1.

Table 1. The Performance of AVP, ECR, SJW, HPD and KL in Poisson Mixture Models with Fixed Component Weights under Scenarios (1)–(4).

For Scenarios (3) and (4) where η1 = 0.3 and η2 = 0.7, the correct labels of each replication can be obtained by applying the OC to the posterior samples of η. The averaged posterior means of the correctly labelled samples are slightly closer to those of the proposed AVP algorithm than to those of the other algorithms. For Scenario (3), the standard deviations of the posterior means of AVP are larger than those of OC, whereas under Scenario (4), AVP seems to relabel almost all samples back to their correct labels.

For Scenarios (1) and (2), where the simulating values of η are equal (η1 = η2 = 0.5), the correct labels are unknown. Instead of comparing with the unknown true posterior means, we can compare the similarity between the relabelling algorithms. Among the compared algorithms, ECR and AVP have similar results. The performances of OC on θ, KL and HPD are highly similar to one another, especially in Scenario (4).

Normal Mixture Models with Known and Unknown Component Weights

In this section, we apply AVP to the unconstrained posterior samples generated from both univariate and multivariate normal mixture models with a known number of components and with known and unknown weights. We compare AVP with ECR2 in the univariate cases and with DBS in the multivariate cases.

Univariate cases

For the univariate case, we simulate observation xi from the normal mixture model, that is, $$x_i \sim \sum_{k=1}^{K} \eta_k N(\mu_k, V_k) \quad (10)$$ for i = 1, …, n, where μk and Vk are the mean and the variance of the kth component density, respectively. We investigate the simulated model (4.1) studied in [10] with K = 3 and n = 160. Two scenarios are studied under this model. Scenario (5): η is known and fixed, and Scenario (6): η is unknown. The posterior samples of the parameters are generated via the Gibbs sampling scheme suggested by [11], whose prior distributions are specified in terms of the Dirichlet distribution D(⋅), the gamma distribution Γ(⋅), and the mean and the range R of the data. The permutation sampler is used in the Gibbs sampling scheme to obtain 100,000 unconstrained samples (after the burn-in period) of the parameters. Each scenario is repeated 100 times. The averages and the standard deviations of the posterior means over the replications are reported in Table 2.

In Scenario (5), η is fixed at the true values during the Gibbs sampling; hence, the correct labels can be obtained by applying an ordering constraint on η. The differences in averaged posterior means between AVP and ECR2 are small, no more than 0.11; however, the averaged posterior means of the correctly labelled samples are slightly closer to those of AVP than to those of ECR2 (upper part of Table 2). The standard deviations of the posterior means in Table 2 (upper part) show that AVP is more consistent (smaller standard deviations) and closer to the correctly labelled samples than ECR2.

Table 2. The Performances of Algorithms AVP and ECR2 for Univariate Normal Mixture Model under Scenarios (5) and (6).

For Scenario (6), where η is unknown, the correct labels cannot be obtained, and so the true posterior means are unknown. The results in Table 2 (lower part) show that the simulating parameter values are slightly closer to the averaged posterior means of ECR2 than to those of AVP. However, it is noteworthy that the true posterior means are not necessarily close to the simulating parameter values, because the former can be affected by the choice of prior distributions. The standard deviations of the posterior means show that AVP generally obtains more consistent estimates of the posterior means than ECR2. Putting an ordering constraint on η (OC) under this scenario can give unsatisfactory results, as indicated by its nonsensical estimates of the posterior means of μ1 and μ2.

Multivariate cases

To compare the performance of AVP and DBS in multivariate settings, we simulated data from multivariate normal mixture models. The posterior samples are generated according to [21]. The permutation sampler is adopted in the Gibbs sampling scheme to obtain 100,000 unconstrained samples (after the burn-in period) of the parameters. We study a bivariate normal mixture model with K = 4 and n = 200. The prior assumptions involve ζk = (ζk1, ζk2) with $\zeta_{kj} = \min_{1 \le i \le n}\{x_{ij}\} + kR_j/3$, where Rj is the range of (x1j, …, xnj), j = 1, 2; $W^{-1}(\cdot)$ is an inverse Wishart distribution, and the scale matrix is Ξ = diag(δ1, δ2), with prior distribution $\Gamma(2, 36^{-1})$ for δ1 and δ2.

Two scenarios are considered. Scenario (7): η is known and fixed, and Scenario (8): η is unknown. Each scenario is repeated 100 times and the results are averaged over these replications. The parameter values used to simulate data for Scenario (7) are shown in the first column of Table 3. Notice that this is a challenging case, since the true parameter values of one component are extremely close to those of another. The averaged posterior means in these scenarios are shown in Table 3. Compared with the results from correct labelling (OC), AVP performs better for the first two components and DBS performs better for the fourth component. The standard deviations of the posterior means over the 100 replications are shown in S1 Table.

Table 3. The Performances of Algorithms AVP and DBS for Multivariate Normal Mixture Model under Scenarios (7) and (8).

For Scenario (8) where component weights are unknown, we adopt the bivariate normal mixture model given in [10] for simulating data. In this setting, the averaged posterior means from AVP and DBS are equally close to the true simulating parameter values (lower part of Table 3). The standard deviations of posterior means from AVP seem slightly larger than those from DBS (lower part of S1 Table).

For each relabelling algorithm, we summarize its computing time for a relabelling procedure (averaged over 100 replications). Table 4 reports the computation times under scenarios with the same number of components (K) and sample size (n). The algorithms were run in R 3.1.3 on a personal desktop computer with an Intel Core 2 Quad 2.33 GHz CPU. Notice that, except for the HPD and AVP algorithms, all the algorithms were run using the label.switching package. The results show that the proposed AVP algorithm can have a long running time when K is large. This is because our probabilistic algorithm requires the computation of K! quantities to determine the relabelling permutation for each MCMC draw.
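A rough count illustrates this growth. The sketch below assumes a naive evaluation in which each of the K! candidate permutations requires one term per observation and component; the cost model and the function name are assumptions, not measurements from Table 4. The (K, n) pairs are those used in the simulation studies.

```python
import math

def avp_terms_per_draw(K, n):
    """Number of (permutation, observation, component) terms for one MCMC draw."""
    return math.factorial(K) * n * K

# (K, n) pairs taken from the Poisson, univariate and bivariate normal studies.
costs = {(K, n): avp_terms_per_draw(K, n) for K, n in [(2, 100), (3, 160), (4, 200)]}
```

Under this cost model, the per-draw work is 400 terms for K = 2, 2,880 for K = 3 and 19,200 for K = 4, and it is multiplied by the 100,000 retained draws.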

Real Data Analysis


A common application of mixture model analysis to polytomous response data is the regression extension of latent class analysis (RLCA) proposed by Huang and Bandeen-Roche [22]. The basic RLCA model postulates an underlying categorical latent variable with, say, K latent classes, and the measured items are assumed to be independent of one another within each component density. We define $Y_i = (Y_{i1}, \ldots, Y_{iM})^{T}$ to be a set of M polytomous response variables for the ith individual, i = 1, …, n. The mth variable, Yim, can take one of the values {1, …, Jm}, where Jm ≥ 2; the allocation variable, Ci, denotes the subpopulation to which the ith individual belongs, and takes a value in {1, …, K}. The distribution of Yi is expressed as the finite mixture density in Eq (11), where ymj = I(ym = j), that is, ymj = 1 if ym = j and 0 otherwise. The model includes two sets of covariates: predictors associated with the allocation variable Ci, and, for m = 1, …, M, covariates that directly influence the response variables. The class membership probabilities and the conditional response probabilities are often modelled with the generalized logit link function under the generalized linear model framework [23], as in Eqs (12) and (13), for i = 1, …, n; m = 1, …, M; j = 1, …, (Jm−1); k = 1, …, (K−1); k′ = 1, …, K.

To perform Bayesian analysis of the RLCA model, the βpk’s, γmjk’s and αlmj’s are assigned normal prior distributions with mean 0 and variance 1.52. These parameters are sampled by Gibbs sampling with an acceptance-rejection strategy [24]. The Gibbs sampling scheme for the hierarchical RLCA model follows Pan and Huang [25]. The move types are briefly described as follows:

  1. Step 1: For i = 1, …, n, generate Ci from its full conditional distribution (14), where Sij = I(Ci = j); equivalently, (Si1, …, SiK) can be sampled directly from a multinomial distribution.
  2. Step 2: Generate (β01, …, βP(K−1)) from its full conditional distribution.
  3. Step 3: Generate (γ111, …, γMJM K) from its full conditional distribution.
  4. Step 4: Generate (α111, …, αLMJM) from its full conditional distribution.
  5. Step 5: In addition to the four move types above, permutation sampling is adopted as the fifth move type. Select a permutation function vq for relabelling the current state. Define θk = (β0k, …, βPk, γ11k, …, γMJM k) for k = 1, …, K−1, and θK = (γ11K, …, γMJM K) for the reference class. Take the new state as vq(θ) = (θvq(1), …, θvq(K), ψ) and vq(S) = [Sivq(k)]i = 1, …, n, k = 1, …, K, where ψ = (α111, …, αLMJM) is the parameter common to all latent classes and is invariant to the permutation function vq. The new state has to be adjusted to the new reference class, in which the β coefficients are required to be 0’s.

Adopting permutation sampling forces the Markov chain to explore all permutation states quickly [15].
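The permutation move and the adjustment to a new reference class can be sketched as follows. The sketch assumes that the permutation is drawn uniformly from the K! possibilities, that the reference class is stored last with all β coefficients equal to 0, and that only the β coefficients and the allocation indicators S are tracked. The function names and the numerical values are hypothetical. Under a generalized logit link, subtracting the β vector of the new reference class from every class leaves the class weights unchanged, which is the adjustment used here.

```python
import math
import random


def class_weights(beta, x):
    """Class weights under a generalized logit link (intercept first in each row)."""
    exps = [math.exp(b[0] + sum(bp * xp for bp, xp in zip(b[1:], x))) for b in beta]
    total = sum(exps)
    return [e / total for e in exps]


def permute_state(beta, S, perm):
    """Relabel classes so that new class k takes the role of old class perm[k].

    beta : K coefficient lists; the last one (reference class) is all zeros
    S    : n x K allocation indicators, S[i][k] = 1 if subject i is in class k
    perm : a permutation of range(K)
    """
    K = len(beta)
    new_beta = [list(beta[perm[k]]) for k in range(K)]
    # Re-centre on the new reference class (last position) so its betas are 0.
    ref = new_beta[-1]
    new_beta = [[b - r for b, r in zip(row, ref)] for row in new_beta]
    new_S = [[row[perm[k]] for k in range(K)] for row in S]
    return new_beta, new_S


def random_permutation_move(beta, S, rng=random):
    """Draw one of the K! permutations uniformly at random and apply it."""
    perm = list(range(len(beta)))
    rng.shuffle(perm)
    new_beta, new_S = permute_state(beta, S, perm)
    return new_beta, new_S, perm


# Hypothetical state with K = 3 classes and n = 3 subjects.
beta = [[0.4, -0.2], [1.0, 0.5], [0.0, 0.0]]
S = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
new_beta, new_S = permute_state(beta, S, [2, 0, 1])
print(new_beta)
print(new_S)
```

Each row of the permuted S still contains a single 1, and the class weights computed from the re-centred coefficients are a permutation of the original weights.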


To illustrate the usefulness of the proposed relabelling method, we used data (see S1 File) from two projects: the Multidimensional Psychopathological Study on Schizophrenia (MPSS) project and the Study on Etiological Factors of Schizophrenia (SEFOS) project. The study designs are described in detail in Chang et al. [26]. Written informed consent was obtained from all participants after a complete description of the studies. These studies (MPSS and SEFOS) were approved by the institutional review boards of the 3 participating hospitals: National Taiwan University Hospital and the university-affiliated Taipei City Psychiatric Center and Taoyuan Psychiatric Center. Participants’ consent to the MPSS and SEFOS studies included consent to use their data for other research. The patients’ capacity for consent was assessed by their attending certified psychiatrists to rule out participants whose psychotic symptoms or mental state were severe enough to impair their capacity for consent. Psychiatric patients who were compulsorily hospitalized were not allowed to enter our studies. All informed consent was obtained from the patients themselves. Proxy consent was prohibited in our studies.

The datasets have been published [27], but were not previously available through any data repository. The data were anonymized prior to access for this study, and the participants were 18 to 65 years old. The inclusion/exclusion criteria were (i) meeting the DSM-IV diagnostic criteria for schizophrenia, (ii) no history of alcohol or drug abuse, (iii) no neurologic disease, (iv) no mental retardation, and (v) no medical illnesses that may significantly impair neurocognitive function.

Briefly, the MPSS and SEFOS projects recruited schizophrenia patients in the subsided disease state (N = 225) from three hospitals in Taiwan. The patients met the Diagnostic and Statistical Manual of Mental Disorders [28] criteria for schizophrenia. The schizophrenia symptoms used in this study were assessed by the Positive and Negative Syndrome Scale (PANSS) [29, 30]. The PANSS has 30 items (M = 30) in three subscales: positive (seven symptoms, P1–P7), negative (seven symptoms, N1–N7) and general psychopathology (sixteen symptoms, G1–G16). Each item was originally rated on a 7-point scale (1 = absent, 7 = extreme), but the 7-point scale was reduced to a binary scale (J1 = … = J30 = 2; no symptom versus having symptom) to ease the sparseness problem of the latent class model. The hierarchical RLCA is applied here to explore the underlying subtypes (classes) of schizophrenia based on the PANSS measurements, and to study the relationship between external covariates and the obtained patient subtypes. The external covariates used in this study include demographic variables and one neuropsychological variable. The demographic variables are gender, age at recruitment, onset age of psychotic symptoms, years of education, and occupation (having versus no occupation). The neuropsychological variable is the sensitivity index of the Continuous Performance Test (CPT) [31, 32]. The CPT score is transformed into a z-score by comparison with a control group matched for three demographic variables: age, gender and years of education [33]. This adjustment was made so that a higher z-score indicates better performance.

The hierarchical RLCA was applied to the 30 dichotomized PANSS items. Demographic variables and the z-standardized CPT score were the covariates associated with the underlying latent class through Eq (9). Gender and age were the covariates incorporated in the conditional probabilities in Eq (10). This analysis used the subsample of subjects without missing values (N = 160). The hierarchical RLCA model was fitted using the Gibbs sampling scheme.

Analysis Results

In this data analysis, we set K = 3. We ran the sampler for 210,000 iterations, with the first 10,000 discarded as the burn-in period. Only every 10th scan was stored to reduce the dependence between stored samples, leaving 20,000 samples for analysis.
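The sample count follows from simple arithmetic: (210,000 − 10,000) / 10 = 20,000. The short sketch below reproduces this bookkeeping; the convention that iterations are numbered from 1 and that the last scan of each block of 10 is the one stored is an assumption made for the example.

```python
total_iterations = 210_000
burn_in = 10_000
thin = 10

# Indices of the stored scans: discard the burn-in, then keep every 10th scan.
kept = [t for t in range(burn_in + 1, total_iterations + 1)
        if (t - burn_in) % thin == 0]

print(len(kept))           # number of samples recorded for analysis
print(kept[0], kept[-1])   # first and last stored iteration
```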

Fig 6a and 6b show the unconstrained samples and the samples relabelled by the AVP algorithm, respectively, in 3-dimensional scatter plots of the parameters γ211, γ212 and γ213. Because the schizophrenia syndrome scale data are fitted by a three-component latent class model, Fig 6a, with 20,000 samples, clearly shows 3! = 6 clusters in the unconstrained posterior samples, distinguished by 6 different colors. The AVP algorithm identifies one of the 3! sets of unconstrained posterior samples and relabels the other 5 sets to match it, as shown in Fig 6b. The trace plots of the parameters γ811, γ812 and γ813 are shown in Fig 6c. From these plots, we see that the distributions of the parameters are well separated.

Fig 6. Plot (a) is the 3-dimensional scatter plot of the unconstrained samples of (γ211, γ212, γ213).

The six colors represent the 3! sets of labels before relabelling. The samples relabelled by the AVP algorithm are shown in Plot (b). Plot (c) shows the trace plots of γ811, γ812 and γ813.

After applying the AVP algorithm, summaries of the posterior distributions are given in Tables 5 and 6. Table 5 gives the estimated relationships between subgroup membership and covariates. The odds ratios (ORs) are obtained by exponentiating the regression coefficients β. The 2.5% and 97.5% quantiles of the posterior samples of the β’s are exponentiated in the same way to obtain the 95% credible intervals (CIs) of the corresponding ORs. Compared with patients in class 3, the characteristics of the other two classes are as follows. Patients in class 1 tend to be younger at the onset of psychotic symptoms. Patients in class 2 are more likely to be male, to have more years of education and to have better ungraded CPT scores.
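The transformation from posterior samples of a coefficient to an OR with a 95% CI can be sketched as follows. The sketch assumes that the point estimate is the exponential of the posterior mean and that quantiles are computed by linear interpolation between order statistics; the function names and the simulated draws are hypothetical and are not taken from the analysis.

```python
import math
import random


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list, with 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def odds_ratio_summary(beta_samples):
    """Exponentiate the posterior mean and the 2.5% / 97.5% quantiles of beta."""
    draws = sorted(beta_samples)
    mean = sum(draws) / len(draws)
    return {
        "or": math.exp(mean),
        "ci_lower": math.exp(quantile(draws, 0.025)),
        "ci_upper": math.exp(quantile(draws, 0.975)),
    }


# Simulated draws standing in for relabelled posterior samples of one coefficient.
rng = random.Random(1)
beta_draws = [rng.gauss(0.5, 0.2) for _ in range(20_000)]
summary = odds_ratio_summary(beta_draws)
print(summary)
```

Because the exponential function is increasing, exponentiating the quantiles of β gives the corresponding quantiles of the OR, so the interval can be computed on either scale.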

Table 5. The relationship between underlying subgroups and covariates from hierarchical LCA.

Table 6 shows the direct associations between PANSS symptom items and covariates. The ORs are obtained by exponentiating the regression coefficients α. The same transformation is applied to the 2.5% and 97.5% quantiles of the posterior samples of the α’s to obtain the 95% CIs. Males are more likely than females to have the G12 (lack of judgement and insight) symptom. The older the patient, the higher the probability of having the G5 (mannerisms and posturing) and G6 (depression) symptoms, but the lower the probability of having the N4 (passive/apathetic social withdrawal) symptom.

Table 6. The association between the PANSS symptoms’ probability and covariates from hierarchical RLCA.


The proposed AVP algorithm has the following features. (i) AVP is a probabilistic approach, which helps prevent the over-corrected results that deterministic methods can produce. (ii) AVP seems to perform reasonably well under the limited settings of our simulation studies. (iii) The computation time of AVP depends on the dimension of the allocation variables S (i.e., the number of observations (n) and the number of components (K) in the mixture model), but not on the complexity of the density function of the mixture model. That is, even when data are drawn from a complicated mixture model, the computational cost of AVP is the same as for any other model with the same numbers of observations and components. (iv) AVP can have a long computation time when K is large, since a probabilistic algorithm requires the computation of K! quantities to determine the relabelling permutation for each MCMC draw.

Supporting Information

S1 Table. The Performances of the AVP and DBS Algorithms for Multivariate Normal Mixture Model under Scenarios (7) and (8).

This table summarizes the standard deviations of posterior means over 100 replications for the algorithms OC, AVP and DBS, where OC stands for ordering constraints on η.


S1 File. Raw data of the study sample.

This dataset contains 30 outcome variables and 6 explanatory variables. The variables are summarised as follows, with variable names shown in parentheses. The 30 outcome variables are seven positive symptoms (P1–P7), seven negative symptoms (N1–N7) and sixteen general psychopathology symptoms (G1–G16), each a binary response with 0 = no symptom and 1 = having symptom. The 6 explanatory variables are gender (Male_gender) with 0 = female and 1 = male, age at recruitment (Age), onset age of psychotic symptoms (Age_of_onset), years of education (Year_of_education), occupation (Having_occupation) with 0 = no occupation and 1 = having occupation, and CPT score (Ungraded_CPT).



The authors thank the National Center for High-performance Computing for computer time and facilities.

Author Contributions

Conceived and designed the experiments: JP GH. Analyzed the data: JP. Wrote the paper: JP GH. Conducted simulation studies: JP. Provided the clinical dataset and the ideas for discussion: CL HH.


  1. Titterington DM, Smith AF, Makov UE, et al. Statistical analysis of finite mixture distributions. vol. 7. Wiley New York; 1985.
  2. McLachlan G, Peel D. Finite mixture models. John Wiley & Sons; 2004.
  3. Hastings WK. Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika. 1970;57:97–109.
  4. Stephens M. Bayesian methods for mixtures of normal distributions. University of Oxford; 1997.
  5. Stephens M. Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2000;62(4):795–809.
  6. Diebolt J, Robert CP. Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society Series B (Methodological). 1994;p. 363–375.
  7. Richardson S, Green PJ. On Bayesian Analysis of Mixtures with an Unknown Number of Components. Journal of the Royal Statistical Society, Series B. 1997;59:731–792.
  8. Celeux G, Hurn M, Robert CP. Computational and Inferential Difficulties with Mixture Posterior Distributions. Journal of the American Statistical Association. 2000;95(451).
  9. Grün B, Leisch F. Dealing with label switching in mixture models under genuine multimodality. Journal of Multivariate Analysis. 2009;100(5):851–861.
  10. Papastamoulis P, Iliopoulos G. An Artificial Allocations Based Solution to the Label Switching Problem in Bayesian Analysis of Mixtures of Distributions. Journal of Computational and Graphical Statistics. 2010;19(2):313–331.
  11. Yao W, Lindsay BG. Bayesian Mixture Labeling by Highest Posterior Density. Journal of the American Statistical Association. 2009;104(486):758–767.
  12. Sperrin M, Jaki T, Wit E. Probabilistic relabelling strategies for the label switching problem in Bayesian mixture models. Statistics and Computing. 2009;p. 1–10.
  13. Rodriguez CE, Walker SG. Label switching in Bayesian mixture models: Deterministic relabeling strategies. Journal of Computational and Graphical Statistics. 2014;23(1):25–45.
  14. Jasra A, Holmes CC, Stephens DA. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science. 2005;20(1):50–67.
  15. Frühwirth-Schnatter S. Markov Chain Monte Carlo Estimation of Classical and Dynamic Switching and Mixture Models. Journal of the American Statistical Association. 2001;96(453).
  16. Papastamoulis P, Iliopoulos G. On the convergence rate of random permutation sampler and ECR algorithm in missing data models. Methodology and Computing in Applied Probability. 2013;15(2):293–304.
  17. Corduneanu A, Bishop CM. Variational Bayesian model selection for mixture distributions. In: Artificial intelligence and Statistics. vol. 2001. Morgan Kaufmann Waltham, MA; 2001. p. 27–34.
  18. Bishop CM. Pattern recognition and machine learning. Springer; 2006.
  19. Hansen LP. Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society. 1982;p. 1029–1054.
  20. Papastamoulis P. label.switching: An R Package for Dealing with the Label Switching Problem in MCMC Outputs. arXiv preprint arXiv:1503.02271. 2015.
  21. Dellaportas P, Papageorgiou I. Multivariate mixtures of normals with unknown number of components. Statistics and Computing. 2006;16(1):57–68.
  22. Huang GH, Bandeen-Roche K. Building an Identifiable Latent Variable Model with Covariate Effects on Underlying and Measured Variables. Psychometrika. 2004;69:5–32.
  23. McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. London: Chapman and Hall; 1989.
  24. Zeger SL, Karim MR. Generalized Linear Models with Random Effects; a Gibbs Sampling Approach. Journal of the American Statistical Association. 1991;86:79–86.
  25. Pan JC, Huang GH. Bayesian Inferences of Latent Class Models with an Unknown Number of Classes. Psychometrika. 2014;79(4):621–646. pmid:24327064
  26. Chang CJ, Chen WJ, Liu SK, Cheng JJ, Ou Yang WC, Chang HJ, et al. Morbidity Risk of Psychiatric Disorders Among the First Degree Relatives of Schizophrenic Patients in Taiwan. Schizophrenia Bulletin. 2002;28:379–392. pmid:12645671
  27. Huang GH, Tsai HH, Hwu HG, Chen CH, Liu CC, Hua MS, et al. Patient subgroups of schizophrenia based on the Positive and Negative Syndrome Scale: composition and transition between acute and subsided disease states. Comprehensive Psychiatry. 2011;52(5):469–478. pmid:21193177
  28. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders (DSM-IV). 4th ed. Washington, DC: American Psychiatric Press; 1994.
  29. Kay SR, Fiszbein A, Opler LA. The Positive and Negative Syndrome Scale (PANSS) for Schizophrenia. Schizophrenia Bulletin. 1987;13:261–276. pmid:3616518
  30. Cheng JJ, Ho H, Chang CJ, Lan SY, Hwu HG. Positive and Negative Syndrome Scale (PANSS): Establishment and Reliability Study of a Mandarin Chinese Language Version. Chinese Psychiatry. 1996;10:251–258.
  31. Rosvold HE, Mirsky AF, Sarason I, Bransome ED Jr, Beck LH. A Continuous Performance Test of Brain Damage. Journal of Consulting Psychology. 1956;20:343–350. pmid:13367264
  32. Chen WJ, Hsiao CK, Hsiao LL, Hwu HG. Performance of the Continuous Performance Test among community samples. Schizophrenia Bulletin. 1998;24(1):163–174. pmid:9502554
  33. Liu SK, Hsieh MH, Huang TJ, Liu CM, Liu CC, Hua MS, et al. Patterns and Clinical Correlates of Neuropsychologic Deficits in Patients with Schizophrenia. Journal of the Formosan Medical Association. 2006;105:978–991.