Allocation Variable-Based Probabilistic Algorithm to Deal with Label Switching Problem in Bayesian Mixture Models

The label switching problem occurs as a result of the nonidentifiability of posterior distribution over various permutations of component labels when using Bayesian approach to estimate parameters in mixture models. In the cases where the number of components is fixed and known, we propose a relabelling algorithm, an allocation variable-based (denoted by AVP) probabilistic relabelling approach, to deal with label switching problem. We establish a model for the posterior distribution of allocation variables with label switching phenomenon. The AVP algorithm stochastically relabel the posterior samples according to the posterior probabilities of the established model. Some existing deterministic and other probabilistic algorithms are compared with AVP algorithm in simulation studies, and the success of the proposed approach is demonstrated in simulation studies and a real dataset.


Introduction
Finite mixture models provide a flexible way to model heterogeneous data, and have been applied to a wide variety of data in social, medical and physical science. Overviews of applications of finite mixture models can be found in Titterington et al. [1] and McLachlan and Peel [2].
The likelihood function of the finite mixture model is invariant when switching component labels. In the last decades, the development of Markov chain Monte Carlo (MCMC) methods [3] and progress of computer technology facilitate the popularity of performing Bayesian analysis for finite mixture models. In the Bayesian setting, if the prior information does not distinguish the components of the mixture model, the resulting posterior distributions will be invariant to all permutations of component labels. Hence, the ergodic averages over the MCMC samples from the posterior distributions are meaningless. This is termed as the label switching problem [4,5].
Many approaches have been proposed to deal with the label switching problem in Bayesian analysis. The most commonly used approach is to impose some artificial ordering constraints on model parameters (OC algorithm) [6,7]. However, the poor choice for the constrained parameters may not provide a satisfactory solution [4,7]. Celeux et al. [8] and Stephens [5] proposed the decision theoretic approach that minimizes a selected Monte Carlo risk. Stephens [5] (KL algorithm) suggested a particular choice of loss function based on the Kullback-Leibler divergence to measure the similarity of posterior allocation probabilities. Grün and Leisch [9] developed a more flexible risk-based algorithm to deal with more practical situations in realworld applications. These algorithms designed to minimize Monte Carlo risk can be regarded as imposing a sophisticated constraint through a loss function.
Other relabelling approaches require more sophisticated algorithms. Papastamoulis and Iliopoulos [10] used equivalence classes representatives (ECR algorithm) to reduce symmetric posterior distribution to nonsymmetric ones, which can be used to deal with the label switching problem. Yao and Lindsay [11] (HPD algorithm) used each MCMC sample as the starting point in an ascending algorithm, and labeled the sample based on the posterior mode to which the algorithm converged. Sperrin et al. [12] who proposed the probabilistic relabelling methods (SJW algorithm) considered a probabilistic learning mechanism to avoid "over-correct" relabels. Rodriguez and Walker [13] proposed an iterative version of the ECR algorithm (the iterative version 2 of the ECR algorithm: ECR2 algorithm), which did not require a good pivot estimate from the start, but improved it via an iterative algorithm. In ECR2, the allocation probabilities needed to be stored. They also develop a deterministic relabelling algorithm that uses the relationship between the observed data and allocation variables to devise a K-means type of loss function (DBS algorithm).
In this paper, an allocation variable based probabilistic relabelling approach (AVP algorithm) is proposed to find the labelling functions. The proposed algorithm is developed under the assumption that the posterior distributions of allocation variables are independent. The AVP algorithm is compared with other six existing relabelling algorithms (KL, ECR, HPD, SJW, ECR2 and DBS) in simulation studies. In real data analysis, schizophrenia syndrome scale data fitted by latent class model is used to demonstrate that labels can be identified well by using the proposed algorithm.

The Label Switching Phenomena
Bayesian Analysis of Finite Mixture Models A finite mixture model composed of K components is of the form where y is the random variable (vector) of response, ϕ k is the component specific parameter of density f, η k is the component weight with η k > 0 and P K k¼1 Z k ¼ 1, ψ is the parameter common to all components, and K is considered as fixed and known in this paper. Here we denote θ k = (η k , ϕ k ), and θ = (θ 1 , . . ., θ K , ψ). The likelihood for θ is where y = (y 1 , . . ., y n ) are independent observations from a mixture density p(Ájθ).

Data Augmentation
In Bayesian analysis of finite mixture models, one can add missing data perspective into models to interpret the data formulation [7]. This is done by augmenting the data with latent class membership random variable (called allocation variable in this paper) C i , i = 1, . . ., n, where C i indicates the class membership of observation y i . If C i = k, the observation y i is regarded as drawn from the kth component density. Then, we can assume that data y i given both C i and θ has distribution y i jðC i ¼ k; θÞ $ f ðy i j k ; cÞ; and p(C i = kjθ) = η k . The use of data augmentation technique simplifies the expression of likelihood; therefore, facilitate the MCMC simulation for posterior distributions.
Under a Bayesian framework, we specify prior distribution p(θ) for parameters θ. The joint posterior distribution of θ and C are proportional to L(θ, Cjy) × p(θ), where C = (C 1 , . . ., C n ) and Lðθ; CjyÞ ¼ Q n i¼1 fZ C i f ðy i j C i ; cÞg. The drawing of one parameter is full conditional on the other parameters. The procedures to draw the posterior samples of each element of θ and C are listed as follows: Step 1: Update the component weights η k , for k = 1, . . ., K; Step 2: Update the component specific parameter ϕ k , for k = 1, . . ., K; Step 3: Update the common parameter ψ; Step 4: Update the allocation variable C i , for i = 1, . . ., n.
Step 1 is usually completed by giving a Dirichlet prior distribution D(e 1 , . . ., e K ) for (η 1 , . . ., η K ), where e k 's are the hyperparameters. Given on the values of C, ϕ 1 , . . ., ϕ K and ψ, the full conditional distribution of (η 1 , . . ., η K ) is D(e 1 +n 1 , . . ., e K +n K ), where n k ¼ P n i¼1 IfC i ¼ kg. Given the values of C and η 1 , . . ., η K , Step 2 and Step 3 are standard steps for MCMC simulation and the way to simulate samples is model-dependent. Further blocking of θ is possible necessary for convenient sampling in each block. Examples of simulating θ are illustrating in Sections simulation studies and real data analysis. Given the values of θ, the implementation of Step 4 is carried out by drawing C i from a multinomial distribution with parameters π i1 (θ), . . ., π iK (θ), where Allocation variable C i can be expressed as a set of binary random variables as well. Define a set of binary random vector (S i1 , . . ., S iK ), and let S im = 1 if C i = m and S ik = 0 for all k 6 ¼ m. The allocation variable C forms an n × K allocation variable matrix S = [S ik ] 1 i n, 1 k K that summaries the allocation informations of C. and of allocation variable matrix S by v q (S) = [S iv q (k) ] 1 i n, 1 k K . The label switching problem arises when likelihood L(θjy) is permutation invariant, L(θjy) = L(v q (θ)jy) for all q = 1, . . ., K!. If the prior distributions of θ are also permutation invariant, the posterior distribution will also be invariant to any permutation function on parameters. Samples generated from MCMC are the simulation outputs of the permutation invariant likelihood and priors with unknown value of q; therefore, when Markov chain is stationary, every sample in MCMC simulation is a sample from permutation invariant posterior distributions. Then the statistics, such as credible interval and posterior mean, inferred from the marginal posterior distributions become meaningless unless the inverse permutation function of every sample is discovered to relabel the MCMC outputs of θ.

The Label Switching Phenomenon
Although the label switching phenomenon causes difficulty in inferences of the posterior distributions, the phenomenon can help generate a useful convergence diagnostics of MCMC simulation jasra markov 2005. A Markov chain that fails to visit all permutation states with approximately equal frequency can be viewed as a warning message of nonstationarity. Hence, for ensuring a Markov chain to reach its stationary state, Frühwirth-Schnatter [15] proposed a dynamic switching procedure, called permutation sampler, for Bayesian mixture models to force the Markov chain quickly exploring all possible permutation states. This indicates that label switching phenomenon is a desired property. Therefore, the posterior distribution of parameters is a mixture of K!-component densities. Frühwirth-Schnatter [15] termed samples that visited all permutation states with approximately equal frequency as unconstrained samples. A formal proof given by Papastamoulis and Iliopoulos [16] states that the permutation sampler converges at least as fast as the unconstrained sampler. In the following, we adopt Frühwirth-Schnatter's procedure and inherit their terminologies.

Proposed Relabelling Method
The permutation function that has worked on θ and S is arbitrary and not observed. We treat the unobservable index of the permutation function as a latent random variable τ taking one value of {1, . . ., K!} and Pr t ¼ k ½ ¼ 1 K! for k = 1, . . ., K! fruhwirth-schnatter markov 2001. Another random variable σ is the index of the inverse permutation function of τ, where θ = v τ (v σ (θ)) = v σ (v τ (θ)) and S = v τ (v σ (S)) = v σ (v τ (S)). If the value of τ is observed, the inverse permutation function v σ is known and can transfer θ and S back to the one of the K! permuted posterior densities of the unconstrained samples.
In subsequent sections, the Markov chain is assumed to be stationary and ergodic. For MCMC samples {(θ t , S t ):t = 1, . . .}, let τ t be the latent random variable of the unobserved permutation function at time t, and let σ t be the index of its corresponding inverse permutation function.
We propose an allocation variable based probabilistic (AVP) relabelling algorithm to deal with label switching problem. The AVP algorithm can be regarded as being developed under the assumption where the posterior distributions of the allocation random variables C 1 , . . ., C n are independent. The independence assumption in the posterior distribution (C 1 , . . ., C n )jy usually does not hold because of the variability from prior distribution p(θ). We have imposed such an independence assumption to obtain a tractably practical solution to label switching phenomenon in Bayesian mixture models. Similar simplifications were assumed to other Bayesian techniques, such as variational Bayes approaches (see e.g., Corduneanu and Bishop, 2001 [17]; Bishop, 2006 [18]). In the rest of this section, we assume that the posterior distributions of C 1 , . . ., C n are independent, and π 0 = [π 0, ik ] 1 i n, 1 k K denotes the parameters of the posterior distribution of S.
Each posterior sample S is the consequence of label switching with an unknown permutation. The model of S can be constructed according to an unknown permutation random variable τ (or the relabelling random variable σ) and the parameters π 0 . We use multinomial distribution to model allocation variables (S i1 , . . ., S iK ) with S ik taking value on 0 or 1 for all k and P K k¼1 S ik ¼ 1. Then the probability mass function of (S i1 , . . ., S iK ) is 0;iv q ðkÞ . Since the allocation variables are assumed to be independent, the posterior probability density at realized sample point s given y and τ = q could be modeled by Let the probability Pr[τ = qjy] be denoted by w q . Then the posterior probability density of S at s is The value of w q is the proportion of the value q occurred in the random variable τ in the Markov chain. When the Markov chains is stationary, relative frequency of samples generated from different sample points of τ will be eventually close, and hence the proportion of the different values of τ should be equal. This means if T is sufficiently large, the chains will achieve In the label switching problem, relabelling random variable σ is of our interest. We can rewrite Eq (2) through random variable σ as where v m is the inverse permutation function of v q such that v m ðv q ðSÞÞ ¼ S. where . Notice that the expectation of Eq (5) is 0 when C i and C j are independent for all i, j and i 6 ¼ j. Then E(∑ i 6 ¼ j g T (i, j)) = 0 is a moment equation for π 0 . According to this equation, an object function is defined as Notice that Eq (4) depending on {π 0, i1 , . . .,π 0, jK } is invariant to different label permutations, and so do Eqs (5) and (6). The minimizer with respect to π 0 in Eq (6),p 0 , obtained through Newton's method is the Generalized Method of Moments (GMM) estimator. The GMM estimatorp 0 has been found to have several large sample properties in Hansen [19], including that p 0 approximates π 0 almost surely.
To estimate the value of σ at different time point, let σ t denote the random variable σ at time t. The estimation of σ t can be obtained through the following posterior probability: Based on these posterior probabilities, we adopt the following stochastic algorithm (termed AVP algorithm) to estimate σ t , for each t = 1, . . ., T. AVP Algorithm.
Step C: Randomly assign the relabelling permutation index at time t,ŝ t , to a value of {1, . . ., The AVP algorithm offers an approach that estimates the index of inverse permutation function. Then apply the estimate of permutation function vŝ t to θ t for relabelling parameters. For the examples in simulation studies and real data application, the AVP algorithm is able to have satisfactory relabelled results.

Simulation Studies
In this section, we compare the proposed AVP algorithm with various relabelling algorithms. First, we compare AVP with algorithms KL, ECR, SJW and HPD in poisson mixture models with fixed and known component weights and K = 2. With known component weights, we can then analytically show how these methods transform posterior distributions. Second, we compare AVP with more recent solutions ECR2 and DBS under normal mixture models with both known and unknown component weights. The comparison of AVP and ECR2 are studied under univariate normal mixture models with K = 3, and the comparison of AVP and DBS are studied under multivariate normal mixture models with K = 4. Except for the HPD and AVP algorithms, all the comparative algorithms are available to the label.switching package [20] of R software. Finally, the computation time of various relabelling algorithms for these simulation studies are summarized at the end of this section.

Poisson Mixture Models with Known Component Weights
Poisson mixture models are studies in this section, and five relabelling methods are compared, including KL, ECR, HPD, SJW and AVP. This simulation study generates data from a two-component poisson mixture model whose probability density function is where η = (η 1 , η 2 ), ϕ = (ϕ 1 , ϕ 2 ), and f(y i jϕ k ) is a poisson distribution with the parameter ϕ k for the response y i . Simulate y = (y 1 , . . .,y n ) under four scenarios: (1) η 1 = η 2 = 0.5, ϕ 1 = 5, ϕ 2 = 7 and n = 10; (2) η 1 = η 2 = 0.5, ϕ 1 = 5, ϕ 2 = 7 and n = 100; (3) η 1 = 0.3, η 2 = 0.7, ϕ 1 = 5, ϕ 2 = 5.5 and n = 10; and (4) η 1 = 0.3, η 2 = 0.7, ϕ 1 = 5, ϕ 2 = 5.5 and n = 100. In the following simulations, the component weights (i.e., η 1 and η 2 ) are treated as fixed and known values, and only the parameters in the component densities (i.e., ϕ 1 and ϕ 2 ) are of our interest and are estimated via MCMC simulation. Assume that priors for ϕ 1 and ϕ 2 are i.i.d. from the gamma distribution Γ(1.2, 0.2) with mean 6, and use the poisson-gamma model to obtain the posterior samples of ϕ. While generating the posterior samples of ϕ, set the values of η to be the true values under each scenario. The Gibbs sampling scheme is adopted here to produce posterior samples {(ϕ 1 , S 1 ), . . .,(ϕ T , S T )}, where the allocation variable matrix S t is an n × 2 matrix of which the element S t ik is a 0/1 variable. S t ik ¼ 1 if the ith subject is attributed to the kth component in the tth MCMC iteration, and S t ik ¼ 0 otherwise. This sampling scheme starts with an initial value S 0 , and runs for t = 1, . . ., T as follows: Step 2. Generate S t with its the element S t i1 from the Bernoulli distribution with probability where η 1 and η 2 are fixed values and are therefore independent of t; Step 3. Select the permutation sampler (1, 2) or (2, 1) with equal probability 0.5. If (1, 2) is chosen, the labels of components of (θ t , S t ) remain unchanged; else, alter the labels 1 and 2 of the components in The permutation sampler applied in Step 3 has different functions for different scenarios. In Scenarios (1) and (2) where η values are fixed at η 1 = η 2 = 0.5, the Markov chain can produce label switching, and the permutation sampler is applied here to enhance quick convergence of MCMC and to obtain the unconstrained samples fruhwirth-schnatter markov 2001. In Scenarios (3) and (4) where η values are fixed at η 1 = 0.3 and η 2 = 0.7, the likelihood Eq (8) is not symmetric, and the usual Gibbs sampling without adopting Step 3 does not produce label switching. The permutation sampler used here can make the unconstrained posterior samples from likelihood which creates a "pseudo" label switching phenomenon. Then, we can apply various relabelling methods to the unconstrained samples of (ϕ 1 , ϕ 2 ). The correctly labelled posterior samples of ϕ can be obtained by imposing an ordering constraint on η. Hence, we can compare the relabelled results of algorithms with the correctly labelled posterior samples. The Gibbs sampling scheme was run for 110,000 samples for each scenario. The first 10,000 samples were treated as the burn-in period, and the subsequent 100,000 samples were used for relabelling. Algorithms KL, ECR, HPD, SJW and AVP were applied to the unconstrained samples of each scenario. To understand the effects of large samples, the sample size of Scenario (1) was increased from n = 10 to n = 100 (Scenario (2)). Fig 2 shows that the posterior samples are apparently more concentrated than those from n = 10. Conclusions from comparisons of KL, HPD and SWJ are consistent with those from n = 10. ECR (Fig 2c) and AVP (Fig 2f) have similar results, but it seems that ECR has posterior samples spreading more widely below the 45 degree line than AVP. Fig 3 shows the results under Scenario (3). This scenario decreases the distance between ϕ 1 and ϕ 2 , and allows the values of η to be unequal (η 1 = 0.3 and η 2 = 0.7). These settings place emphasis on the effect of the unequal weights and the reduced distance of ϕ. Notice that, in Scenarios (3) and (4), η values are set to the fixed true values of η 1 = 0.3 and η 2 = 0.7. Therefore, the correctly labelled posterior distribution of ϕ can be obtained by restricting η 1 < η 2 . Fig 3a  presents a scatter plot of the correctly labelled posterior samples of ϕ. The relabelled samples from HPD (Fig 3d) is the same to those of imposing an ordinary constraint ϕ 2 ! ϕ 1 . The KL algorithm (Fig 3b) seems to move the relabelled sample points in the middle-left region to the opposite side symmetric to the 45 degree line. This phenomenon cannot be improved even if we use the correctly labelled posterior samples as initial points for the KL algorithm. Compared with the scatter plot of correctly relabelled posterior samples, AVP (Fig 3f) seems to generate the most similar results than ECR (Fig 3c) and SJW (Fig 3e) do. Because the correctly labelled posterior samples are known in this scenario, the marginal distributions of ϕ of the relabelled samples from all relabelling methods can be compared with the true marginal densities, which are shown in Fig 4. Fig 4a and 4b show the density plots of ϕ 1 and ϕ 2 for Scenario (3), respectively. The density plot of the AVP algorithm nearly coincides with that of the correctly labelled posterior samples. Fig 5 shows the results under Scenario (4), which increases the sample size of Scenario (3) to n = 100. In Scenario (4), the results from HPD (Fig 5d) are similar to those from the ordering constrainted samples. The performance of KL, SJW and AVP (Fig 5b, 5(e) and 5(f), respectively), is similar to that of the correctly labelled posterior samples (Fig 5a). ECR (Fig 5c) seems to gathers more sample points on the right side of the region. Fig 4c and 4(d) show the marginal density plots for Scenario (4). Except for HPD and ECR, other algorithms have density plots to coincide with that of correctly labelled posterior samples.
To produce a more reliable conclusion, simulated datasets are generated with 100 replications under Scenarios (1)- (4). Note that η are set to be fixed in these sencearios. The averages and standard deviations of posterior means over 100 replications are shown in Table 1.
For Scenarios (3) and (4) where η 1 = 0.3 and η 2 = 0.7, the correct labels of each replication can be obtained by applying the OC on the posterior samples of η. Averaged posterior means of correctly labelled samples are slightly closer to those of the proposed AVP algorithm than to those of the other algorithms. For Scenario (3), the standard deviations of posterior means of AVP is larger than those of OC; whereas, under Scenario (4), AVP seems to relabel almost all samples back their correct labels.
For Scenarios (1) and (2) where the simulating parameter of η are set to be equal (η 1 = η 2 = 0.5), the correct labels are unknown. Instead of comparing with the unknown true posterior means, we could compare the similarity between the relabelling algorithms. Among the compared algorithms, ECR and AVP have similar results. The performances of OC on θ, KL and HPD are highly similar to one another, especially in Scenario (4)

Normal Mixture Models with Known and Unknown Component Weights
In this section, we apply AVP to the unconstrained posterior samples generated from both univariate and multivariate normal mixture models with the number of components to be known and with known and unknown weights. We compare AVP with ECR2 in univariate cases and with DBS in the multivariate cases.  Univariate cases For the univariate case, we simulate observation x i from the normal mixture model, that is, for i = 1, . . ., n, where μ k and V k are the mean and the variance of the kth component density, respectively. We investigate the simulated model (4.1) studied in [10] with K = 3 and n = 160. Two scenarios are studied under this model. Scenario (5): η is known and fixed, and Scenario (6): η is unknown. The posterior samples of the parameters are generated via the Gibbs sampling scheme suggested by [11], where they assume that the prior distributions are ðZ 1 ; . . . ; Z K Þ $ Dð1; . . . ; 1Þ; m k $ Nð y; R 2 Þ; and ; V k $ Gð2; R 2 =200Þ; k ¼ 1; . . . ; K; where D(Á) is the Dirichlet distribution; Γ(Á) is the gamma distribution; y and R are the mean and the range of the data, respectively. Permutation sampler is used in the Gibbs sampling scheme to obtain the 100,000 unconstrained samples (after the burn-in period) of the parameters. Two scenarios are repeated for 100 times. The averages and the standard deviations of posterior means over replications are reported in Table 2. In Scenario (5), η is assumed to be fixed at true values during the Gibb sampling; hence, the correct labels can be obtained by applying an ordering constraint on η. The differences in averaged posterior means between AVP and ECR2 are small, which are no more than 0.11; however, averaged posterior means of correctly labelling samples are slightly closer to those of AVP than to those of ECR2 (upper part of Table 2). The standard deviations of the posterior means in Table 2 (upper part) show that AVP has better consistence (smaller standard deviations) and is closer to those of correctly labelled samples than ECR2 does.
For Scenario (6) where η is unknown, correct labels are unable to be obtained, leading to the true posterior means are unknown. The results in Table 2 (lower part) show that the simulating parameter values are slightly closer to averaged posterior means of ECR2 than to those of AVP. However, it is noteworthy that true posterior means may not necessarily be close to simulating parameter values because the former could be affected by the setting of prior distributions. The standard deviations of the posterior means show that AVP generally can obtain more consistent estimates in posterior means than ECR2. Putting an ordering constraint on η (OC) under this scenario could obtain unsatisfactory results, which is informed by its nonsensible estimates for posterior means of μ 1 and μ 2 .
Two scenarios are considered. Scenario (7): η is known and fixed, and Scenario (8): η is unknown. Two scenarios are repeated for 100 times and the results are averaged over these replications. Parameter values used to simulate data from Scenario (7) are shown in the first column of Table 3. Notice that this is a challenging case since the true parameter values for one component are extremely close to another. The averaged posterior means in these scenarios are shown in Table 3. As compared with the results from correct labelling (OC), we see that AVP has better performance in the first two component and DBS is better in the fourth component. The standard deviations of posterior means over 100 replications are shown in S1 Table. For Scenario (8) where component weights are unknown, we adopt the bivariate normal mixture model given in [10] for simulating data. In this setting, the averaged posterior means from AVP and DBS are equally close to the true simulating parameter values (lower part of  (5) and (6).
Scenario (5): η are known and fixed    Table 3). The standard deviations of posterior means from AVP seem slightly larger than those from DBS (lower part of S1 Table). For each relabelling algorithm, we summary its computing time for a relabelling procedure (averaging over 100 replications). Table 4 reports their computation times under scenarios with the same number of components (K) and sample size (n). Algorithms are run in R 3.1.3 using a personal desktop computer with Inter Core 2 Quad CPU 2.33 GHz. Notice that except the HPD and AVP algorithms, all the algorithms are performed by using label.switching package. Results show that the proposed AVP algorithm can have a long running time when K is large. This is because our probabilistic based algorithms requires the computation of K! quantities to determine the relabelling permutation per MCMC draw.
where y mj = I(y m = j) = 1 if y m = j; 0 otherwise. In addition, this model assumes Z k ðx i Þ¼ Pr ðC i ¼ Kjx i Þ and p mjk ðz im Þ ¼ Pr . . . ; x ip Þ T are predictors associated with the allocation variable C i , and z i ¼ ðz i1 ; . . . ; z iM Þ with z im ¼ ðz im1 . . . ; z imL Þ T for m = 1, . . ., M are covariates built to cause direct influence on response variables. The probabilities Z k ðx i Þ and p mjk ðz im Þ are often implemented assuming the generalized logit link function under the generalize linear model framework [23]: and log To perform Bayesian analysis on the RLCA model, prior distributions for β pk 's, γ mjk 's and α lmj 's are assumed normal prior distributions with mean 0 and variance 1.5 2 . Parameters β pk 's, γ mjk 's and α lmj 's are sampled in Gibbs sampling approach with acceptance-rejection strategy [24]. The Gibbs sampling scheme for the hierarchical RLCA model are according to Pan and Huang [25]. The following briefly describes the move types: Step 1: For i = 1, . . ., n, generate C i from pðS ik jb 01 ; . . . ; b PK ; g 111 ; . . . ; g MJ M K ; a 111 ; . . . ; a LMJ M Þ with S ij = I(C i = j), and (S i1 , . . ., S iK ) can be sampled directly from a multinomial distribution.

Data
To illustrate the usefulness of the proposed relabelling method, we used data (see S1 File) from two projects: the Multidimensional Psychopathological Study on Schizophrenia (MPSS) project and the Study on Etiological Factors of Schizophrenia (SEFOS) project. The details of study designs are described in detail in Chang et al. [26]. Written informed consent was obtained from all participants after complete description of the studies. These studies (MPSS and SEFOS) were approved by the institutional review boards of the 3 participating hospitals: National Taiwan University Hospital and the university affiliated Taipei City Psychiatric Center and Taoyuan Psychiatric Center. Participants' consent to the MPSS and SEFOS studies included consent to use their data for other researches. The capacity for consent of patients were assessed by their attending certified psychiatrists to rule out those participants whose psychotic symptoms or mentality were so severe that impair their capacity for consent. All the psychiatric patients who were compulsory hospitalized did not allow to enter our studies. All informed consents were obtained from patients themselves. Proxy consent was prohibited in our studies. The datasets had been published [27], but not available through any data repositories before. The data had been anonymized prior to access for this study and the age range of participants was from 18-65 years old. The inclusion/exclusion criteria were (i) meeting the DSM-IV diagnostic criteria of schizophrenia, (ii) no history of alcohol and drug abuse, (iii) no neurologic disease, (iv) no mental retardation, (v) no medical illnesses that may significantly impair neurocognitive function.
Briefly, MPSS and SEFOS projects recruited subsided schizophrenia patients (N = 225) from three hospitals in Taiwan. The patients are based on the Diagnostic and Statistical Manual of Mental Disorders [28] criteria for schizophrenia. Schizophrenia symptoms used in this study are assessed by the Positive and Negative Syndrome Scale (PANSS) [29,30]. The PANSS is composed of three subscales and has 30 items (M = 30) with positive (seven symptoms, P1-P7), negative (seven symptoms, N1-N7) and general psychopathology (sixteen symptoms, G1-G16). Each item was originally rated on a 7-point scale (1 = absent, 7 = extreme), but the 7-point scale was reduced to the binary scale (J 1 = . . . = J 30 = 2) (no symptom and having symptom) for easing the sparseness problem of the latent class model. The hierachical RLCA applied here is to explore the underlying subtypes (classes) of schizophrenia based on the PANSS measurement, and to study the relationship between external covariates and obtained patient subtypes. The external covariates used in this study include demographic variables and one neuropsychological variable. Demographic variables are gender, age at recruitment, onsetage of psychotic symptoms, years of education, and occupation (having versus no occupation). The neuropsychological variable is the sensitivity index of the Continuous Performance Test (CPT) [31,32]. The CPT score is transformed into z-score by comparing to a control group matched for three demographic variables: age, gender and education years [33]. This adjustment was made so that the higher z-score indicates better performance.
The hierarchical RLCA was applied to 30 dichotomized PANSS items. Demographic variables and the z-standardized CPT score were the covariates that were associated with the underlying latent class through Eq (9). Gender and age are identified as covariates incorporated in conditional probabilities in Eq (10). This analysis used the subsample of subjects that without missing values (N = 160). The hierarchical RLCA model was fitted through the Gibbs sampling scheme.

Analysis Results
In this data analysis, we set K = 3. We run for 210,000 samples with the first 10,000 samples being the burn-in period. Only every 10 scan is stored to keep independence, and 20,000 samples are recorded for analysis. Fig 6a and 6(b) show the unconstrained samples and the relabelled samples after applying the AVP algorithm, respectively, in 3-dimension scatter plots with the dimensions of parameters γ 211 , γ 212 and γ 213 . Because the schizophrenia syndrome scale data is fitted by a threecomponent latent class model , Fig 6a with 20,000 samples clearly shows the 3! = 6 clusters in After applying the AVP algorithm, the quantities of posterior distributions are summarized in Tables 5 and 6. Table 5 gives the estimation of relationship between subgroups memberships and covariates. The odds ratios (ORs) are the exponential transformation of β's from regression coefficients. The 2.5% and 97.5% quartiles of posterior samples of β's also take the same exponential transformation to obtain the 95% credible interval (CI) of the corresponding ORs. By comparing with the patients from class 3. The characteristics of the other two classes from this analysis are as follows. Patients in class 1 tend to be younger at onset age of psychotic symptoms. Patients in class 2 are more likely to be male, more years of education and better ungraded CPT. Table 6 contains the direct association between PANSS symptom items and covariates. The ORs are obtained by the exponential transformation of regression coefficients α's. The same exponential transformation is also applied to the 2.5% and 97.5% quantiles of the posterior samples of α's to obtain 95% CI. Males are more likely to have G12 (lack of judgement and insight) symptom than females. The older the age, the higher the probability of having G5 (mannerisms and posturing) symptom and G6 (depression) symptom, but the lower the probability of having N4 (passive/apathetic social withdrawal) symptom.

Conclusion
The proposed AVP algorithm has the following features. (i) AVP is attributed to probabilistic approach, which prevents over-corrected results compared with deterministic methods. (ii) AVP seems to perform reasonably well with the limiting settings in our simulation studies. (iii) The computation time of AVP depends on the dimension of allocation variables S (i.e., the number of observations (n) and the number of components (K) in the mixture model), but not on the complexity of the density function of mixture models. That is, even when data is drawn from a complicated mixture model, the computational cost for AVP holds the same as that from the models where have the same numbers of observations and components. (iv) AVP can have a long computation time when K is large, since a probabilistic based algorithm requires the computation of K! quantities to find the optimal permutation per MCMC draw.
Supporting Information S1  (7) and (8). This table summaries standard deviations of posterior means over 100 replications for algorithms OC, AVP and DBS, where OC stands for ordering constraints on η.
(PDF) Table 6. The association between the PANSS symptoms' probability and covariates from hierarchical RLCA.

Male Gender Age
Variable OR a CI b OR CI S1 File. Raw data of the study sample. This dataset contains 30 outcome variables and 6 explanatory variables. The variables are summarised as follows and variable names are shown parenthetically. The 30 outcome variables are seven positive symptoms (P1-P7), seven negative symptoms (N1-N7) and sixteen general psychopathology symptoms (G1-G16) with binary response with 0 = no symptom and 1 = having symptom. The 6 explanatory variables are gender (Male_gender) with 0 = female and 1 = male, age at recruitment (Age), onsetage of psychotic symptoms (Age_of_onset), years of education (Year_of_education), occupation (Having_occupation) with 0 = no occupation and 1 = having occupation and CPT score (Ungraded_CPT). (CSV)