Overfitting Bayesian Mixture Models with an Unknown Number of Components

This paper proposes solutions to three issues pertaining to the estimation of finite mixture models with an unknown number of components: the non-identifiability induced by overfitting the number of components, the mixing limitations of standard Markov Chain Monte Carlo (MCMC) sampling techniques, and the related label switching problem. An overfitting approach is used to estimate the number of components in a finite mixture model via a Zmix algorithm. Zmix provides a bridge between multidimensional samplers and test based estimation methods, whereby priors are chosen to encourage extra groups to have weights approaching zero. MCMC sampling is made possible by the implementation of prior parallel tempering, an extension of parallel tempering. Zmix can accurately estimate the number of components, posterior parameter estimates and allocation probabilities given a sufficiently large sample size. The results will reflect uncertainty in the final model and will report the range of possible candidate models and their respective estimated probabilities from a single run. Label switching is resolved with a computationally light-weight method, Zswitch, developed for overfitted mixtures by exploiting the intuitiveness of allocation-based relabelling algorithms and the precision of label-invariant loss functions. Four simulation studies are included to illustrate Zmix and Zswitch, as well as three case studies from the literature. All methods are available as part of the R package Zmix, which can currently be applied to univariate Gaussian mixture models.


Introduction
Finite mixture models naturally arise when homogeneous subgroups or clusters are thought to be present in a population, and can also be used as flexible parametric models for estimating complex or unknown distributions [1]. Whether latent subgroups are present or not, their flexible framework has the potential to help tackle many research problems. As such they are useful tools in many fields including but not limited to genetic and medical research [2][3][4], econometrics [5], and image and sound analysis, where mixtures are used to perform complex tasks such as object tracking and speaker identification [6,7]. Despite their popularity, model estimation can be difficult when the number of components is unknown [8].
The density of a K-component mixture with respect to some measure is given by Eq. (1), where π k and θ k denote the weight and associated emission parameters of component k, k = 1, . . . , K, with 0 < π k < 1 satisfying K k=1 π k = 1. This paper considers the situation where the emission densities f (y|θ k ) belong to a parametric family, i.e. θ k ∈ Θ ⊂ R d . While mixtures are most often used for clustering and classification, the methods presented here can also be used for density estimation to obtain a sparse representation of an unknown distribution.
MCMC methods are commonly used for Bayesian estimation of complex hierarchical models such as mixtures, and Gibbs samplers are a special case of these where all parameters are estimated from their full conditional distributions [9][10][11][12]. This would be a tedious endeavour for finite mixture models if not for the inclusion of a latent allocation variable, a Multinomial Z = {z 1 , . . . , z n }, where z i ∼ M (1; π 1 , . . . , π K ), so that y i |z i = f (y i |θ zi ) [13]. For each iteration t of a Gibbs sampler, the allocations Z (t) are estimated first, then the parameters are generated from their component-wise conditional distributions based on the clustering in Z (t) .
This paper addresses three important issues concerning mixture modelling when the number of components is unknown: (i) theoretical issues in estimating the number of components due to non-identifiability caused by overfitting, (ii) problems in applying standard Markov chain Monte Carlo (MCMC) sampling techniques, and (iii) the non-identifiability of the output of MCMC due to label switching. These issues are reviewed and then addressed in a coordinated manner, with the aim to develop a method for intuitive estimation of the number of components (also known as order estimation), resulting in a sparse yet representative posterior exhibiting clear separation between the estimated and unnecessary components.

Issue 1: Non-identifiability due to overfitting
Order estimation methods for finite mixtures can be loosely classified into two types of approaches: those which compare competing models (e.g. Bayes Factors [10,11]), and those which employ multidimensional samplers to directly estimate the distribution of K (e.g. Reversible Jump MCMC [30], the allocation sampler [42]). Overfitting, the act of including more components in a model than is supported by the data, is an integral part of both strategies as the former must fit at least one extra group to compare some criterion, whilst the latter implicitly explores an overfitted space to estimate K. Non-Bayesian methods for mixture estimation are particularly vulnerable to overfitting as it violates the regularity conditions required for maximum likelihood estimation and likelihood based goodness-of-fit criteria [1].
The difficulty with order estimation stems from the fact that overfitting induces a special type of non-identifiability in the posterior distribution of mixture models. Theoretically, any mixture distribution can be represented equally well by one with a larger number of groups, where some components have either merged together or have weights equal to zero [1,[14][15][16]43]. Developments in the Bayesian asymptotic theory of overfitted mixture models by [17] provide a theoretical basis to use overfitting for order estimation. [17] proved that quite generally, the posterior behaviour of overfitted mixtures depends on the chosen prior on the weights, and on the number of free parameters in the emission distributions (here "d"). Consider the prior on the weights P π , which we take to follow a Dirichlet distribution, P π = D(α 1 , ..., α K ). If min(α k , k ≤ K) > d/2, asymptotically two or more components in an overfitted mixture model will tend to merge with non-negligible weights. Conversely, if max(α k , k ≤ K) < d/2, the extra components are emptied at a rate of n −1/2 . Choosing a prior where max(α k , k ≤ K) < d/2 penalises the analysis more subtly than using the dimensions of the model or the number of parameters, placing mass on the sparsest configuration approximating the density in a uniquely Bayesian manner. In this context an exchangeable prior corresponds to choosing α 1 = α 2 = · · · = α K , which is done hereafter in this paper.
Overfitting is an appealing solution for order estimation as it requires little input from the investigator; it simply involves selecting a large number of components (greater than the anticipated number), and choosing a prior which encourages the extra components to have weights close to zero. This approach was recognised in [18], where Section 22.4 on mixtures with an unspecified number of components focuses almost entirely on the strategy of deliberately overfitting for the purpose of order estimation. They recommend a straightforward approach of counting the components whose posterior weights are larger than some threshold. In practice, [18] suggest choosing α = n 0 /K, where n 0 is the prior sample size of the components with a default "noninformative" value of n 0 = 1. However this leads to its own set of difficulties. First, extra components have a non-zero probability of being allocated observations, allowing MCMC samplers some freedom to explore the posterior surface. However, if the posterior weights of the extra, unwanted groups are not close enough to zero they become impossible to distinguish from the truly supported components. [18] note that components with small weights were sometimes found to have non-trivial posterior means. Second, some choices of K caused the posterior to contain several redundant, closely overlapping groups, indicating that some merging of extra components with the truth is allowed to occur under such a prior.
In this paper, we aim to place stronger bounds on α so that extra components with no support have posterior weights approaching zero, to the point where they are allocated no observations. When extra components can be said to have emptied, order estimation should be a simple case of reporting the number of alive (non-empty) components present in the posterior.
Issue 2: Obtaining a well mixed MCMC sample MCMC algorithms are prone to becoming trapped in regions of large posterior probability for high dimensional problems; they have a propensity for lack of mixing when the posterior contains multiple well seperated modes [15]. Some argue this hinders MCMC estimation since the samplers cannot explore all potentially important regions of a target space, clusters may be missed, and thus the MCMC cannot be assumed to have converged [8].
Parallel tempering is a popular method originating in physics which improves mixing in multimodal situations. The general idea is to simulate J replicas of the original distribution of interest, each produced under a different "temperature", and to sample from each of these allowing for information to flow between adjacent temperatures. The high temperature posteriors are increasingly flattened, providing less extreme surfaces which allow MCMC samplers to mix more freely, whereas the low temperature posteriors better reflect the precise distributions in a local region of the probability space, but have a strong risk of becoming trapped in local minima during sampling [19][20][21].
In essence, the higher temperature posteriors allow those of lower temperature to access a more complete set of regions in the posterior space. Tempering can be done in many ways, the most common approach being to raise the target distribution to a power T (where 0 ≤ T ≤ 1), which increasingly flattens the distribution as T → 0.
While tempering is usually performed directly on the likelihood or the posterior distribution, it is also readily adaptable to other situations as recently demonstrated in an application to Approximate Bayesian Computation [22].
In the Methods section entitled "Prior Parallel Tempering (PPT)", a parallel tempering algorithm is developed using α to directly control the degree of tempering as well as obtain the desired target distribution.
Issue 3: Untangling the label switching.
The third challenge is to retrieve the posterior estimates from the target chain of the MCMC. These are non-identifiable due to label switching, a phenomenon which occurs when exchangeable priors are placed on mixture parameters. Label switching results in a posterior which is invariant to permutations of the labelling of components [23]. In essence, the group names of two components 'switch' randomly during MCMC, resulting in the marginal posterior distributions of each parameter to be identical for all groups. Resolving the label switching can be a difficult task but its presence is proof of adequate mixing and is an important requirement to establish that an MCMC sampler has converged [24][25][26].
Excellent reviews of the label switching issue and a wide range of potential solutions can be found in [27] and [24]. An increasingly popular approach is to employ a relabelling scheme, such as that proposed by [23] and [28], where the posterior samples of the parameter of interest are clustered according to a k-means algorithm [27]. This method converges to local minima, so the results based on multiple starting points are compared to identify the optimal solution. This idea was extended by [8] who use the maximum a posteriori (MAP) estimate as the starting point of the clustering.
Another approach is to use label invariant loss functions, the idea being to identify some loss function based on a label invariant estimate and to select the permutation of the labelling which minimises this loss. For example if the allocations are computed, [29] propose a loss function based on the pairwise comparison of the allocations of each data point. To relabel the samples, the algorithm permutes the labelling to minimise this loss. However this can incur a high computational cost for mixtures with many components and rapidly become impractical [27].
Label switching in overfitted mixtures is particularly difficult to resolve as superfluous components may merge or overlap with other components, or may be empty, which negatively impacts on relabelling and clustering. The presence of many empty components is an additional level of complexity which is not generally accounted for by existing tools.
A new method for resolving the label switching problem is developed in the Methods section "Resolving the label switching with Zswitch" which aims to combine the MAP and relabelling approaches of [8] with the rich information available from the joint distribution of the allocations used by [29].

Motivation
While overfitting is an appealing tool for Bayesian order estimation, the number of non-empty components in the posterior of overfitted mixtures cannot currently be used to estimate of the true number of components. Extra components always have a non-zero probability of being allocated some observations, the number of which is determined by the prior on the weights. Setting this to be very close to zero is not possible with current estimation methods as such a prior creates a sparse posterior surface comprised of isolated modes separated by areas of near-zero probability, inhibiting mixing.
The goal of this paper is to produce a sparse, representative posterior configuration of finite mixtures with an unknown number of components. We develop an extension to parallel tempering to enable a Gibbs sampler to sample from a well mixed posterior, where the unsupported components contain no observations, to explore if this can be used for simple order estimation.

Models and notation
Given observations Y = {y 1 , . . . , y n }, component weights π = {π 1 , . . . , π K }, and component parameters θ = {θ 1 , . . . , θ K }, the full likelihood of a mixture model can be written as Eq. 2. Here, K is the number of components included in the model, where the true number of components in Y is K 0 and K 0 < K.

Prior Parallel Tempering (PPT)
We aim to use the prior on the weights to define different degrees of tempering, setting up an approximation to classical tempering which simultaneously models a wide range of possible posterior configurations.
J chains are included in the PPT algorithm, each indexed by j. In a Bayesian setting, each chain can be considered to have a different target posterior p j (ζ j |Y ). The ζ j denote the full set of unknown parameters, such as ζ j = {µ j , σ 2 j , π j )} for univariate Gaussian mixtures, and p(ζ j ) = p(µ j |σ 2 j )p(σ 2 j )p(π j ). The posterior parameters sampled at each iteration t are denoted ζ (t) j . For iteration t, the posterior of the j'th chain is indexed as p j (ζ (t) j |Y ), and p j (ζ When a proposal is made to swap the samples of a pair of adjacent chains at a given iteration, a Metropolis-Hastings update on the joint distribution must be made. Consider the proposal to swap chains j and j ′ at iteration t. The joint target of both chains can be written as f (ζ j ′ |Y ), and the goal of tempering is to preserve this target, only accepting moves with probability min(1, A). Omitting the iteration indicator (t) as all values are assumed to refer to the same iteration, the acceptance ratio formulation is as follows: In the case of PPT, the likelihood is the same in all chains, so p j (Y |ζ j ) = p j ′ (Y |ζ j ) and p j (Y |ζ j ′ ) = p j ′ (Y |ζ j ′ ). Expanding the ratio of posterior distributions reduces A to PLOS 5/26 the prior densities: Furthermore, as only the prior on the weights is allowed to change, A may be further simplified. Recalling the prior structure p(ζ) = p(µ)p(σ 2 )p(π), A can be written as Since p j (µ j |σ 2 j ) = p j ′ (µ j |σ 2 j ) and p j (σ j ) = p j ′ (σ j ), the final acceptance ratio is comprised of four densities defined by the prior on the weights only:

Sampling overfitted mixture models with PPT
We now set up Zmix, an MCMC sampling algorithm for mixture models which incorporates PPT into a collection of Gibbs samplers. A set of J parallel, independent samplers are set up, and as mentioned the degree of tempering is determined by the hyperparameter on the mixture weights, (α j , j ≤ J). The set of α j must be chosen to ensure a wide range of parallel chains and include values from well above d/2 to close to zero. As the overall goal is to sample from a posterior where extra components' posterior weights are very close to zero, the chain generated by the smallest value of α j in the PPT is referred to as the target chain of Zmix.
Choosing the candidate parameters (α j , j ≤ J) The choice of α j is arbitrary at this point; a wide range allows a broad spectrum of posterior configurations to be generated, but values too far apart result in undesirably large changes between tempered chains (and the acceptance ratio of the PPT algorithm is rarely satisfied). The smallest hyperparameter, α J , is set very close to zero to encourage unsupported components to have a negligible probability of being assigned observations. In practice, the success of the tempering is ensured by tracking the acceptance frequency of swaps between all chains to ensure an adequate acceptance rate. Values are chosen starting at α 1 = 30 to ensure total merging in all simulations and examples included this paper. Two sets of (α j , j ≤ J) are explored, a larger range in the early exploratory stage, followed by a refined set. Initially J = 30 chains are used to explore the posterior behaviour of overfitted mixtures under increasingly extreme conditions with values according to Eq. 9. Zmix Algorithm Recall that a Gibbs sampler [9] is based on drawing samples from the full conditional distributions of each unknown variable, and that PPT requires only the prior on the weights to differ between chains. Define π j as the set of K weights for chain j where j = 1, . . . , J. The prior on the weights is denoted p j (π j ) ∼ D(α j , . . . , α j ). The density of the distribution of π j given Similarly, the parameters of the components of chain j are indiced as θ j . Since the distribution of the parameters given the allocations and the data is the same across all chains, we write p(θ Before Zmix is implemented a choice of u must be made, which determines the probability a tempering move will be attempted at a given iteration. For clarity, note that each parameter is first indexed by tempering chain j, for example the weights in a chain are π j and the allocations Z j . More specificity is added by including another level when required, so that for example the k'th element of π j is denoted π j k .
MCMC sampling of the unknown parameters then proceeds as follows.
for each i = 1, ...., n and k = 1, ..., K, with n With probability u ∼ U (0, 1) : i. Draw j randomly from the set (1 : J − 1), selecting chains j and j ′ = j + 1 as candidates for tempering. ii. Accept the move with probability A, where and perform the tempering: iii. Return to Step 2(a).

Important quantities
The number of non-empty components at each iteration t (after a burn in period) for chain j is K The distinct values of K j0 are defined as K k0 for k 0 ∈ {1, . . . ,K j 0 } (whereK j0 is the maximum number of alive (non-empty) groups observed in chain j). The mode of the empirical distribution of K j0 isK j0 .
Choice of mixture distribution The asymptotic theory underpinning this paper can be applied to a wide range of mixture distributions, so a univariate Gaussian mixture model is adopted , in the conjugate form. This involves an Inverse Gamma prior on the variances σ 2 k ∼ G −1 (a, b), and a Gaussian prior on the means 2 , and τ = 1. This formulation is chosen to facilitate Gibbs sampling, particularly the choice of l = 1 n n i=1 y i which centres the prior for the means within the range of the observed sample. This speeds up convergence compared to choosing a value not within the range of the observations.

Resolving label switching with Zswitch
A relabelling algorithm is proposed here inspired by the methods of [8] and [29]. Unless otherwise indicated, all notation in this section relates to the target chain j = J of the tempering algorithm. For each iteration t, let K (t) 0 denote the number of non-empty groups. Then for For each value of k 0 , choose a reference set of allocations Z 0 and corresponding parameters θ 0 , permuting the labels so that the first k 0 groups are non-empty. Here, the reference is chosen as the MAP estimator of the target posterior, computed using only non-empty components.
For each iteration t ∈ T (k 0 ) let n (t) k be the number of observations assigned to component k (for k = 1, . . . , K), and let the vector k0 } be the labels of the non-empty components.
The joint distribution of the current and reference allocations is summarised by creating a k 0 × k 0 table M , where M (r,c) is the cell pertaining to row r and column c, the columns denote the reference labels, and the rows denote the elements of λ (t) . The value of M (r,c) is the number of observations assigned to the component labelled λ (t) r which are also in the reference group labelled c.
The table M is the key of Zswitch and is used to identify the subset of reference components which have a similar membership to each current component. The tuning parameter m, defined below, determines the sensitivity of the algorithm by designating the minimum proportion of the observations from each component which must belong to some reference group before it is considered a candidate for relabelling.
For each row r = 1, . . . , k 0 , let I r be the set of labels such that the proportion of observations shared by the current group and reference group exceeds a threshold m, that is I r = {c; M r,c /n (t) k=r > m}. Since I r is a set of labels, |I r | denotes the size of the set, and I r × I r * is the Cartesian product between I r and I r * . Letλ Updating the values of λ (t) relabels all associated allocations (Z (t) ) and parameters (θ (t) ), resolving the label switching.
If k0 r=1 |I r | > k 0 , there are multiple candidate labels for at least one component. The final choice is the permutation of the candidate labels which minimises the distance between the current and reference parameters under each possible relabelling scheme, as follows. Let S I be the set of permutations from I r to I r , the Cartesian product of the k 0 sets I 1 , . . . ,I k0 , S I = I 1 × · · · × I k0 . The final relabelling scheme v * is then identified as All parameters and allocations are then relabelled according to λ Zswitch algorithm Define the number of observations assigned to each group k = 1, . . . , K at each iteration as n Select reference Z 0 and θ 0 . Permute the labels of (Z 0 , θ 0 ) so that the first k 0 groups are non-empty.
iv. For r = 1, . . . , k 0 , start with an empty set I r = φ, and let A. If |I r | = 1, letλ i. Let S I be the set of permutations from I r to I r , found by computing the k 0 -fold Cartesian product S I = I 1 × · · · × I k0 . ii. Find the permutation v * for which: iii. Relabel Z (t) and θ (t) by setting λ (t) To ensure the success of Zswitch, density plots are created for each set of relabelled posterior parameter estimates, and we deem the Zswitch successful when these are all clearly unimodal.

Simulations and case studies
The results are presented according to the following evaluation strategy, which is designed to explore the impact of α j , and particularly the behaviour of the target posterior. Replicate simulation studies are included to explore the consistency of the observed behaviour, and case studies are also included for comparison with existing literature.
Simulations A set of simulations representing a range of univariate Gaussian mixtures is used to test and illustrate Zmix and Zswitch. Gaussian mixture models are considered, with unknown means and variances for all components. Four simulations illustrate the methods in this paper, denoted Sim 1-4. Fig 1 includes histograms of samples where n = 200 from each simulation as well as the density of the true underlying distribution. Sim 1 defines a well separated mixture of K 0 = 3 components. Sim 2 has the same number of groups but they are closer together, with two high peaks whose tails overlap with a central component with a larger variance. Sim 3 represents a scenario where the K 0 = 2 components are difficult distinguish; all parameters except the variances are equal, producing a unimodal density. Sim 4 contains K 0 = 3 components, where 99% of the observations are expected to represent only two components with close means, while a third, better separated component is only allocated 1% of the weight. Sim 4 describes a situation where true groups with small weights exist, to better understand how these can best be identified. Case Studies Three case studies are described to illustrate the results of Zmix and Zswitch in practice. The first case study is the Acidity dataset, which consists of the log acidity index for 155 lakes in the North-Eastern United States [16]. These have been previously analysed as a mixture of Gaussian distributions on the log scale by [16] who found evidence for two or three groups. [30] found evidence for three to five components with the same model with a Reversible Jump MCMC algorithm, while [18] overfit this dataset resulting in a posterior with two true components. The second case study involves the Enzyme dataset found in [31], which consists of measurements of enzymatic activity in blood for an enzyme involved in the metabolism of carcinogenic substances (velocity and substrate concentration), for a group of 245 unrelated individuals. [31] first analysed this data and found a mixture of 2 skewed components using MLE. [30] found evidence for 3 to 5 components using RJMCMC. [32] also modelled this data with 2 skewed Gaussian components.
Finally, Galaxy data [33] is considered since it has been investigated by many researchers with a wide range of results [34][35][36]. It is a small dataset of 82 measurements of galaxy speeds from 6 segments of the sky. It is of particular interest as different order estimation methods have suggested that the data contains anywhere from 3 to 9 components [11,18,35,[37][38][39]43]. [38] observed that extra components appears to be modelling underlying skewness present in the sample.
Evaluation strategy for Case Studies For each case study, run Zmix for 50, 000 iterations. Extract the last 20, 000 iterations of the target posterior. Subset by value of K (t) 0 , and apply Zswitch and post-processing to each subset. Compute and save posterior mean of all estimated parameters, including 95% credible intervals, and posterior allocation probabilities for each configuration considered.

Post-processing
For all but the replicate simulation studies, the same post-processing is performed on the target posterior of the results of Zmix.
The iterations of the target chain are split into subsets by the number of non-empty components present at each iteration, K (t) 0 , and the label switching is resolved for each of these according to Zswitch. Model quality statistics are particularly useful when multiple configurations are present as they allow further comparison of the candidate models, and the following statistics are computed for each subset processed by Zswitch.
For each considered configuration, identified by K K0 , the proportion of iterations represented is first computed to estimate the probability that the observations originate from a mixture with K k0 components, p(K k0 ). When K k0 = K 0 , the proportion of the observations whose predicted allocations corresponds to their true groupings is computed. The mean absolute error (MAE) and mean squared error (MSE) are calculated using the unswitched parameter estimates.
The remaining statistics are based on posterior predictive testing, which are found by resampling the posterior samples of the parameters in the target chain in order to predict 10, 000 datasets of the same size as the original data. Mean absolute errors (MAPE) and mean squared prediction errors (MSPE) are reported as an average over the replicates. Bayesian P-values estimating p(min(Y rep ) < min(Y )) and p(max(Y rep ) < max(Y )) are included, which we call P min and P max . Both are included as they can be useful in identifying a skewed fit. Predictive concordance is then computed, which can be interpreted as the average proportion of y i 's that are not outliers given the model (based on the suggestion that any y i that is in either 2.5% tail area of y rep i should be considered an outlier) [40] . An ideal fit should have a predictive concordance of around 95% [40].
A small set of plots is also created for each candidate configuration. Density plots of the posterior paramaters illustrate their distribution and the success of Zswitch at relabelling the output of Zmix. Also included is a plot of estimated allocation probabilities, and a plot of the density of 10,000 predicted datasets overlaid by a that of Y to allow for the overall fit to be explored and identification of areas of bad fit.

Results
The following results show that the number of alive components, the set of which is denoted K J 0 for the target of an overfitted mixture modelled with Zmix, provides a sparse estimate of the true number of components, K 0 . Given a large enough sample size and a well mixed MCMC sampler, there is little or no variation in K J 0 and the mode of this distribution is equal to K 0 . When the sample size is small relative to the complexity of the underlying mixture, K 0 encompasses a small range of likely configurations, which tend to include the true value as well as one or two more conservative estimates of the number of components. The estimated parameters and allocation probabilities corresponding to each model (or configuration) considered can be extracted directly from the target chain, and interpreted once label switching is resolved. plots pertaining to the other simulations can be found in the supplementary material (S1 Fig,S2 Fig, and S3 Fig). Results are shown for Sim 2, n=100 (left) and n=200 (right). Boxplots of the number of non-empty groups K j 0 for each chain j are included; each chain represents posterior samples from the Zmix sampler with the hyperparameter α j on the mixture weights, the value of which is included in red for each j.

Exploratory simulations
For the simulations in this paper, when α j > 5 all components merge so that none are empty. As α j approaches d/2 = 1, a slight decrease inK j 0 can be observed. As the threshold of d/2 is passed (at α j = 1), and α j decreases further, a steady drop in the number of non-empty groups is evident, which continues as α j approaches zero.
Once α j is close to zero (approximately 3 × 10 −8 here), the posterior distribution of K j 0 appears to reach an equilibrium, and remain constant for all subsequent chains up to and including the target. The posterior behaviour of the target K J 0 is well exemplified by Fig. 2, where the following can be observed. When the sample size is large enough (n = 200 here), K J 0 represents a single configuration in all t iterations, so that K . This is inferred to be equal to the true number of components, andK J 0 = K 0 . In the case where n = 100, the range of K J 0 includes a small subset of likely configurations, in this case one with the true number of components, and an alternate posterior configuration with one fewer group.
This behaviour is observed consistently across the four simulations except for Sim 4 with n = 100, where the range of K J 0 does not include the true value K 0 ; all iterations represent a posterior with only 2 groups. Since the true allocations are known here, it is noted that the component with π k = 0.01 is not represented in this realisation Sim 4, and thus could not be estimated.

1.(b)
Model fit and parameter estimates The candidate models found by Zmix defined by the values of K K0 for k 0 = (1, . . . , K J 0 ) are compared using the posterior parameter estimates obtained from the target posterior. A set of summary and model quality statistics computed for each configuration (or number of components) is included in Table 1.
For all simulations when n = 100, Zmix tends to place a higher probability on the configuration with fewer components, and when n = 200 a single configuration (or model) is represented in the posterior. The replicate study in the following section provides a comprehensive exploration of the distribution of K J 0 for each sample size and simulation. When n = 100, there is some ambiguity in the true number of components and a small subset of models is present in the results. For Sim 1, the model quality statistics in Table 1 exhibit a marked preference for the configuration with K k0 = 3. The statistics show a much smaller difference between competing configurations for the other, more complex simulations, particularly Sim 3.
These statistics do not provide a complete view of the fit of each configuration. For example, errors based on the estimated means invariably shrink slightly as more components are included. While large changes may be useful and appear to point towards the right number of components, we find visual evaluation tools to be quite illustrative and useful for decision making. We focus on the results of Sim 2 in the following paragraphs; all corresponding figures for the other simulations are available as supplementary material (diagnostic plots can be found in S4 Fig to S9 Fig, and parameter summaries of all candidate configurations can be found in S1 Table).
Exploring the results of Sim 2, from Table 1 it is known that when n=200 p(K k0 = 3) = 1, and the resulting clustering is found to be very accurate with 97% of observations correctly reclassified. Diagnostic plots for Sim 2 with n=200 can be found in Fig. 3. The predictive density plots show the three component model fits the data very well, and there is almost no uncertainty in the clustering of each observation. It is observed that the label switching has been resolved successfully, with all posterior densities exhibiting a single mode. Posterior parameter estimates and 95% credible intervals are given in Table 2. Parameters for n=200 are tightly estimated with only the variances slightly inflated, attributable to the modest sample size.
Two sets of plots describe the results of Zmix for Sim 2 with n=100, as two possible configurations were reported. Fig. 4 illustrates these for both candidates, given p(K k0 = 2) = 0.67, and p(K k0 = 3) = 0.33 from Table 1. For this simulation and sample size there is little difference evident in the overall fit of the model, and the predictive density plots are very close to Y for K k0 = 2 except for some skewness in the right tail. The same plots for K k0 = 3 show that this area of bad fit is resolved by adding an extra component. The allocation probabilities highlight the difference Looking at the posterior parameter estimates of all the simulations considered, which can be found in supplementary material S1 Table, parameter estimates when n=200 are very close to the true underlying values with some variances slightly inflated. As can be expected, estimated variances are generally larger for the results where n=100. Overall we find the clustering is very successful for these simulations when the correct number of components is estimated, ranging from 97% to 100% accuracy for all but Sim 3 (Table 1) . Sim 3 contains two components which overlap almost entirely, and here while the parameters are close to the truth, the posterior allocations are only correctly estimated for 70% to 77% of the observations. K k0 defines the number of non-empty groups in the configuration considered in that row, and is annotated by an asterisk when this is equal to the truth. The credible intervals which contain the true value of the parameter are identified with an asterisk.

Replicate simulation study
Recall that for the replicate simulation study, 20 realisations of each simulation were created and overfitted with K = 10 using Zmix and 25 chains. For each replicate,K J 0 is estimated as the most likely (most frequently observed) value of K J 0 . Table 3 shows the proportion of times eachK J 0 is estimated for each simulation.

Case Studies
In the analysis of the three case studies, Zmix results in posterior configurations with the same number of components as the smallest number found by previous literature.
Acidity Zmix finds that 2 components are best suited to model the Acidity data, with p(K k0 = 2) = 1. The target posterior does not contain any estimates from another configuration, indicating there is no ambiguity in this decision. The two components are well separated and there is little uncertainty in the posterior allocations, which are slightly less sharply defined in a small region of overlap between the two groups. Posterior predictive plots indicate the model represents the data well, with no need for any more components.
Enzyme Overfitting the Enzyme dataset with Zmix produces two possible alternate configurations with two or three components, in a similar manner to the simulation results observed where the sample size was too small. From Table 4 we obtain the probability that this data can be modelled by 2 components is p(K k0 = 3) = 0.90, and find that three components are less likely, with p(K k0 = 3) = 0.  Comparing the two candidate models with the model fit quantities in Table 4, the inclusion of three components decrease the MAPE and MSPE slightly, but have little impact on the remaining statistics. Concordance is observed to decrease with the addition of a third component. The posterior predictive density plots in Fig. 6 reveals little difference between the fit of the two models, but the plot of the allocation probabilities reveals the difference in the candidate configurations. It is clear that the 2 component posterior provides a much more certain fit with no uncertainty in the clustering of the data, whereas the 3 component model exhibits much less clarity in the posterior allocation probabilities.
This case study illustrates the importance of making a final choice based on the original goal of the analysis. Recall that the Enzyme dataset comprises measurements of enzymatic activity in blood for an enzyme involved in the metabolism of a carcinogenic substance. While the posterior may strongly favour 2 components, the fact that multiple configurations are included in K J 0 indicates there is some non-negligible probability that this is the true number of components. The added cluster describes a smaller component with a larger mean, suggesting that a small group of patients with a different distribution of enzymatic activity characterised by a larger mean may be present. If a higher level of activity is believed to relate to a higher risk of cancer, for example, then further analyses on a subset of individuals with potentially higher risk may be of interest and the less likely model may be reported.
Galaxy Surprisingly given the small sample size, analysis of this dataset results in a stable target consistently representing only 2 components with similar means, µ 1 = 21.33 (20.76, 21.9) andμ 2 = 19.47 (15.8, 22.69). One has a large weight of π 1 = 0.72(0.54, 0.86) and small varianceσ 2 1 = 3.69(2.31, 5.66), modelling the peak at the center of the range of Y , and the other is described by a smaller weight π 2 = 0.28(0.14, 0.46), but a very large variance ofσ 2 2 = 57.23 (30.41, 106.97). This second group models the outlying values of the dataset at both tails. The posterior predictive density plot reveals that this is a reasonable model for these data, resulting in similar predicted replicates.
Since the fitted mixture model places no restrictions on the variance of the underlying mixture, this configuration is possible, and it appears reasonable to conclude that these data could have originated from such a model. Given the physical origins of the data however, it may be warranted to impose some restrictions on other priors or on the variances. [41] use astronomically motivated priors to model this dataset, and find evidence for 7 components. In the sensitivity study conducted in [43], it is shown that while there appears to be evidence for anywhere from two to eight components, there is a very large probability assigned to two components when the variances is allowed to be large. In terms of Zmix, recall that in the simulation studies, the algorithm was able to identify a component with a very small weight of π k = 0.01 in 35% of replicates when n=100, and 70% of replicates when n = 200; it is frequently able to identify well separated univariate Gaussian components when these are represented by as few as two or three observations in a sample. From these observations, there appears to be some evidence that the Galaxy dataset may not originate from a Gaussian mixture. If this distribution is Gaussian, a larger number of observations is required for Zmix to estimate a more complex configuration, or some restrictions must be placed.

Discussion
The success of Zmix for order estimation is indicated by a close relationship between sample size and the underlying complexity of a mixture distribution to be overfitted. However, this is also true for all order estimation methods; a component must be adequately represented in the given sample before it can be estimated [1]. The algorithm is easy to implement and interpret, and requires only that a maximum number of components is specified and that this is larger than the expected upper bound of K 0 . It is based on the same basic format and conditional distributions as a standard Gibbs sampler on a single parametric model, with the addition of a range of prior hyperparameter values implemented in the PPT algorithm.
Obtaining a well-mixed MCMC sample for mixtures can be a difficult task in mixture modelling even when K 0 is known, and Zmix can also be used in such cases to ensure plentiful mixing.
Given a large enough sample size relative to the underlying complexity of a mixture, Zmix can provide an accurate estimate of the minimum number of components required to model the given data. When there is some uncertainty in the best configuration which fits the sample, Zmix produces a small range of candidate models. This commonly occurs when the sample size is small relative to the complexity of the mixture. Using the distribution of the number of non-empty components results in a strict subset of likely configurations smaller than that typically obtained by multidimensional samplers, as the chosen prior forbids components to be identical in the target posterior and prevents unnecessary groups from being allocated observations.
The tempering algorithm (PPT) which is incorporated in Zmix allows for an exchange of information between the many potential overfitted posteriors, from fully merged configurations with many identical components to the sparsest configuration, where components either differ by at least one parameter or are empty. If an overfitted model with a value of the common hyperparameter α very close to zero is fit directly with no tempering, the extreme posterior surface prevents the sampler from exploring this space, no mixing is present, and the results often lead to a single group. PPT allows a better exploration of the posterior distribution, even for small values of α. The number of alive components hovers within a small range, providing a small set of candidate models for further comparison. Model averaging has not been considered in this paper but could also be a useful way to interpret the target posterior when multiple configurations are present.
In considering the number and range of α j values which should be included in Zmix for each chain j = (1, . . . , J), the minimum α J should theoretically define a space where all extra groups are expected to have weights approaching zero. Since the goal of Zmix is to overfit K intentionally in order to create empty components, it makes sense to set the smallest value of α in relation to n as well as d, selecting α much smaller than 1/n. Indeed by doing so, one expects the posterior distribution of the number of non empty components to converge to a point mass on the true number of components.
Aside from modelling and order estimation, the Zswitch algorithm proposed is able to rapidly undo the label switching in the target posterior of Zmix. While designed specifically for dealing with the output of overfitted mixture models with empty components, it can be implemented as a stand-alone relabelling algorithm for Gaussian mixtures. It can also be applied with little modification to any mixture modelling situation where a latent allocation parameter is included; the set of parameters utilised in the second phase of Zswitch simply needs to be updated to match the desired distribution.
We present Zmix and Zswitch as part of an R package called Zmix, which is available on Github at github.com/zoevanhavre/Zmix. Zmix includes all methods and functions described in this paper for overfitting univariate Gaussian mixtures, with the intention of providing a straightforward Bayesian tool for modelling and order estimation of the most common type of mixtures.

21/26
This paper presents a comprehensive solution to estimating Gaussian mixtures with an unknown number of components, dealing with three general problems which inhibit accurate estimation. The issue of non-identifiability induced by overfitting is cast as an order estimation tool using recent theory on the effect of the prior on the weights of an overfitted Bayesian mixture model. MCMC mixing difficulties common to mixtures are greatly amplified by this prior, and this is resolved by Prior Parallel Tempering which ensures full posterior exploration by travelling through all possible configurations of the posterior. This is analogous to parallel tempering but uses a much simpler acceptance ratio formulation. Finally, Zswitch provides a straightforward and complete relabelling algorithm which is adaptable to a wide range of models, allowing the results of an MCMC sampler on mixture data to be interpreted with no extra modelling effort on the part of the analyst.

Supporting Information
The Supporting Information is available as a separate PDF at the following link: http://tinyurl.com/lpbneny.

S1 Fig
Sim 1, Boxplot of the number of non-empty groups for each chain For n = 100 and n = 200, the distribution of the number of alive (non-empty) groups in each chain of the tempering is plotted across all 50, 000 iterations minus a burn in of 5, 000. The value of the hyper-parameter of the weights α of each chain is included in red.

S2 Fig
Sim 3, Boxplot of the number of non-empty groups for each chain For n = 100 and n = 200, the distribution of the number of alive (non-empty) groups in each chain of the tempering is plotted across all 50, 000 iterations minus a burn in of 5, 000. The value of the hyper-parameter of the weights α of each chain is included in red.

S3 Fig
Sim 4, Boxplot of the number of non-empty groups for each chain For n = 100 and n = 200, the distribution of the number of alive (non-empty) groups in each chain of the tempering is plotted across all 50, 000 iterations minus a burn in of 5, 000. The value of the hyper-parameter of the weights α of each chain is included in red. '

S1 Table
Parameter summaries for each model estimated by Zmix, for each simulation. Parameter summaries are included for n = 100 and n = 200 for all non-empty components for Sim 1 to 4. 95% Bayesian credible intervals are included for all estimates.K 0 defines the number of non-empty groups in the configuration considered in that row, and is annotated by an asterisk when this is correct. The parameter estimates corresponding to this configuration which contain the true value are similarly identified with an asterisk.