
Overfitting Bayesian Mixture Models with an Unknown Number of Components

  • Zoé van Havre ,

    Contributed equally to this work with: Zoé van Havre, Nicole White, Judith Rousseau, Kerrie Mengersen

    zoe.vanhavre@qut.edu.au

    Affiliations School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia, CEREMADE, Université Paris Dauphine, Paris, France

  • Nicole White ,

    Contributed equally to this work with: Zoé van Havre, Nicole White, Judith Rousseau, Kerrie Mengersen

    Affiliation School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia

  • Judith Rousseau ,

    Contributed equally to this work with: Zoé van Havre, Nicole White, Judith Rousseau, Kerrie Mengersen

    Affiliation CEREMADE, Université Paris Dauphine, Paris, France

  • Kerrie Mengersen

    Contributed equally to this work with: Zoé van Havre, Nicole White, Judith Rousseau, Kerrie Mengersen

Affiliation School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia

Abstract

This paper proposes solutions to three issues pertaining to the estimation of finite mixture models with an unknown number of components: the non-identifiability induced by overfitting the number of components, the mixing limitations of standard Markov chain Monte Carlo (MCMC) sampling techniques, and the related label switching problem. An overfitting approach is used to estimate the number of components in a finite mixture model via the Zmix algorithm. Zmix provides a bridge between multidimensional samplers and test-based estimation methods, whereby priors are chosen to encourage extra groups to have weights approaching zero. MCMC sampling is made possible by the implementation of prior parallel tempering, an extension of parallel tempering. Zmix can accurately estimate the number of components, the posterior parameter estimates and the allocation probabilities given a sufficiently large sample size. The results reflect uncertainty in the final model and report the range of possible candidate models and their respective estimated probabilities from a single run. Label switching is resolved with a computationally lightweight method, Zswitch, developed for overfitted mixtures by exploiting the intuitiveness of allocation-based relabelling algorithms and the precision of label-invariant loss functions. Four simulation studies are included to illustrate Zmix and Zswitch, as well as three case studies from the literature. All methods are available as part of the R package Zmix, which can currently be applied to univariate Gaussian mixture models.

Introduction

Finite mixture models naturally arise when homogeneous subgroups or clusters are thought to be present in a population, and can also be used as flexible parametric models for estimating complex or unknown distributions [1]. Whether latent subgroups are present or not, their flexible framework has the potential to help tackle many research problems. As such, they are useful tools in many fields including, but not limited to, genetic and medical research [2–4], econometrics [5], and image and sound analysis, where mixtures are used to perform complex tasks such as object tracking and speaker identification [6, 7]. Despite their popularity, model estimation can be difficult when the number of components is unknown [8].

The density of a K-component mixture with respect to some measure is given by Eq (1), where πk and θk denote the weight and associated emission parameters of component k, k = 1, …, K, with 0 < πk < 1 satisfying ∑k=1,…,K πk = 1. This paper considers the situation where the emission densities f(y∣θk) belong to a parametric family, i.e. θk ∈ Θ ⊂ Rd. While mixtures are most often used for clustering and classification, the methods presented here can also be used for density estimation to obtain a sparse representation of an unknown distribution.

p(y∣π, θ) = ∑k=1,…,K πk f(y∣θk). (1)

MCMC methods are commonly used for Bayesian estimation of complex hierarchical models such as mixtures, and Gibbs samplers are a special case of these in which all parameters are drawn from their full conditional distributions [9–12]. This would be a tedious endeavour for finite mixture models if not for the inclusion of a latent allocation variable, a Multinomial Z = {z1, …, zn}, where zi ∼ 𝓜(1; π1, …, πK), so that yi∣zi ∼ f(yi∣θzi) [13]. For each iteration t of a Gibbs sampler, the allocations Z(t) are generated first, then the parameters are generated from their component-wise conditional distributions based on the clustering in Z(t).
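To make the role of the latent allocations concrete, the following R sketch performs the allocation step for a univariate Gaussian mixture; the function and variable names are illustrative, not the internals of the Zmix package.

```r
# Allocation step of one Gibbs iteration for a univariate Gaussian mixture.
# y: data vector; w, mu, sigma2: current draws of the K component weights,
# means, and variances.
sample_allocations <- function(y, w, mu, sigma2) {
  K <- length(w)
  # Unnormalised P(z_i = k | ...) = w_k * N(y_i; mu_k, sigma2_k)
  p <- sapply(seq_len(K), function(k) w[k] * dnorm(y, mu[k], sqrt(sigma2[k])))
  p <- p / rowSums(p)  # normalise across components for each observation
  apply(p, 1, function(pr) sample.int(K, size = 1, prob = pr))
}
```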

This paper addresses three important issues concerning mixture modelling when the number of components is unknown: (i) theoretical issues in estimating the number of components due to non-identifiability caused by overfitting, (ii) problems in applying standard Markov chain Monte Carlo (MCMC) sampling techniques, and (iii) the non-identifiability of the output of MCMC due to label switching. These issues are reviewed and then addressed in a coordinated manner, with the aim of developing a method for intuitive estimation of the number of components (also known as order estimation), resulting in a sparse yet representative posterior exhibiting clear separation between the estimated and unnecessary components.

Issue 1: Non-identifiability due to overfitting

Order estimation methods for finite mixtures can be loosely classified into two types of approaches: those which compare competing models (e.g. Bayes Factors [10, 11]), and those which employ multidimensional samplers to directly estimate the distribution of K (e.g. Reversible Jump MCMC [14], the allocation sampler [15]). Overfitting, the act of including more components in a model than are supported by the data, is an integral part of both strategies: the former must fit at least one model with extra groups in order to compare candidates by some criterion, whilst the latter implicitly explores an overfitted space to estimate K. Non-Bayesian methods for mixture estimation are particularly vulnerable to overfitting, as it violates the regularity conditions required for maximum likelihood estimation and likelihood-based goodness-of-fit criteria [1].

The difficulty with order estimation stems from the fact that overfitting induces a special type of non-identifiability in the posterior distribution of mixture models. Theoretically, any mixture distribution can be represented equally well by one with a larger number of groups, where some components have either merged together or have weights equal to zero [1, 16–19]. Developments in the Bayesian asymptotic theory of overfitted mixture models by [20] provide a theoretical basis for using overfitting for order estimation.

[20] proved that, quite generally, the posterior behaviour of overfitted mixtures depends on the chosen prior on the weights, and on the number of free parameters in the emission distributions (here “d”). Consider the prior on the weights Pπ, which we take to follow a Dirichlet distribution, Pπ = 𝓓(α1, …, αK). If min(αk, k ≤ K) > d/2, asymptotically two or more components in an overfitted mixture model will tend to merge with non-negligible weights. Conversely, if max(αk, k ≤ K) < d/2, the extra components are emptied at a rate of n−1/2. Choosing a prior where max(αk, k ≤ K) < d/2 penalises the analysis more subtly than using the dimensions of the model or the number of parameters, placing mass on the sparsest configuration approximating the density in a uniquely Bayesian manner. In this context an exchangeable prior corresponds to choosing α1 = α2 = … = αK, which is done hereafter in this paper.
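In symbols, the dichotomy established in [20] can be paraphrased as follows; this is a summary of the conclusion only, omitting the regularity conditions stated in that paper.

```latex
% pi ~ D(alpha_1, ..., alpha_K), d free parameters per emission distribution:
\max_{k \le K} \alpha_k < \tfrac{d}{2} \;\Longrightarrow\;
  \text{extra component weights vanish at rate } n^{-1/2};
\qquad
\min_{k \le K} \alpha_k > \tfrac{d}{2} \;\Longrightarrow\;
  \text{extra components merge, retaining non-negligible weight.}
```

For the univariate Gaussian mixtures considered later, each component has d = 2 free parameters (a mean and a variance), so the relevant threshold is d/2 = 1.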

Overfitting is an appealing solution for order estimation as it requires little input from the investigator; it simply involves selecting a large number of components (greater than the anticipated number) and choosing a prior which encourages the extra components to have weights close to zero. This approach was recognised in [21], where Section 22.4 on mixtures with an unspecified number of components focuses almost entirely on the strategy of deliberately overfitting for the purpose of order estimation. They recommend a straightforward approach of counting the components whose posterior weights are larger than some threshold. In practice, [21] suggest choosing α = n0/K, where n0 is the prior sample size of the components, with a default “noninformative” value of n0 = 1. However, this leads to its own set of difficulties. First, extra components have a non-zero probability of being allocated observations, allowing MCMC samplers some freedom to explore the posterior surface; but if the posterior weights of the extra, unwanted groups are not close enough to zero, they become impossible to distinguish from the truly supported components. [21] note that components with small weights were sometimes found to have non-trivial posterior means. Second, some choices of K caused the posterior to contain several redundant, closely overlapping groups, indicating that some merging of extra components with the truth is allowed to occur under such a prior.

In this paper, we aim to place stronger bounds on α so that extra components with no support have posterior weights approaching zero, to the point where they are allocated no observations. When extra components can be said to have emptied, order estimation should be a simple case of reporting the number of alive (non-empty) components present in the posterior.

Issue 2: Obtaining a well mixed MCMC sample

MCMC algorithms are prone to becoming trapped in regions of large posterior probability for high dimensional problems, and tend to mix poorly when the posterior contains multiple well separated modes [17]. Some argue this hinders MCMC estimation since the samplers cannot explore all potentially important regions of the target space, clusters may be missed, and thus the MCMC cannot be assumed to have converged [8].

Parallel tempering is a popular method originating in physics which improves mixing in multimodal situations. The general idea is to simulate J replicas of the original distribution of interest, each produced under a different “temperature”, and to sample from each of these, allowing information to flow between adjacent temperatures. The high temperature posteriors are increasingly flattened, providing less extreme surfaces which allow MCMC samplers to mix more freely, whereas the low temperature posteriors better reflect the precise distributions in a local region of the probability space, but have a strong risk of becoming trapped in local modes during sampling [22–24].

In essence, the higher temperature posteriors allow those of lower temperature to access a more complete set of regions in the posterior space. Tempering can be done in many ways, the most common approach being to raise the target distribution to a power T (where 0 ≤ T ≤ 1), which increasingly flattens the distribution as T → 0. While tempering is usually performed directly on the likelihood or the posterior distribution, it is also readily adaptable to other situations as recently demonstrated in an application to Approximate Bayesian Computation [25].

In the Methods section entitled “Prior Parallel Tempering (PPT)”, a parallel tempering algorithm is developed using α to directly control the degree of tempering as well as obtain the desired target distribution.

Issue 3: Untangling the label switching

The third challenge is to retrieve the posterior estimates from the target chain of the MCMC. These are non-identifiable due to label switching, a phenomenon which occurs when exchangeable priors are placed on mixture parameters. Label switching results in a posterior which is invariant to permutations of the labelling of components [26]. In essence, the group names of two components ‘switch’ randomly during MCMC, causing the marginal posterior distributions of each parameter to be identical for all groups. Resolving the label switching can be a difficult task, but its presence is evidence of adequate mixing and an important requirement for establishing that an MCMC sampler has converged [27–29].

Excellent reviews of the label switching issue and a wide range of potential solutions can be found in [30] and [27]. An increasingly popular approach is to employ a relabelling scheme, such as that proposed by [26] and [31], where the posterior samples of the parameter of interest are clustered according to a k-means algorithm [30]. This method converges to local minima, so the results based on multiple starting points are compared to identify the optimal solution. This idea was extended by [8] who use the maximum a posteriori (MAP) estimate as the starting point of the clustering.

Another approach is to use label-invariant loss functions, the idea being to identify some loss function based on a label-invariant estimate and to select the permutation of the labelling which minimises this loss. For example, if the allocations are computed, [32] propose a loss function based on the pairwise comparison of the allocations of each data point. To relabel the samples, the algorithm permutes the labelling to minimise this loss. However, this can incur a high computational cost for mixtures with many components and rapidly becomes impractical [30].

Label switching in overfitted mixtures is particularly difficult to resolve as superfluous components may merge or overlap with other components, or may be empty, which negatively impacts relabelling and clustering. The presence of many empty components is an additional level of complexity which is not generally accounted for by existing tools.

A new method for resolving the label switching problem is developed in the Methods section “Resolving the label switching with Zswitch” which aims to combine the MAP and relabelling approaches of [8] with the rich information available from the joint distribution of the allocations used by [32].

Motivation

While overfitting is an appealing tool for Bayesian order estimation, the number of non-empty components in the posterior of overfitted mixtures cannot currently be used to estimate the true number of components. Extra components always have a non-zero probability of being allocated some observations, the number of which is determined by the prior on the weights. Setting this to be very close to zero is not possible with current estimation methods, as such a prior creates a sparse posterior surface comprised of isolated modes separated by areas of near-zero probability, inhibiting mixing.

The goal of this paper is to produce a sparse, representative posterior configuration of finite mixtures with an unknown number of components. We develop an extension to parallel tempering which enables a Gibbs sampler to sample from a well mixed posterior in which the unsupported components contain no observations, and explore whether this can be used for simple order estimation.

Methods

Models and notation

Given observations Y = {y1, …, yn}, component weights π = {π1, …, πK}, and component parameters θ = {θ1, …, θK}, the full likelihood of a mixture model can be written as Eq (2). Here, K is the number of components included in the model, where the true number of components in Y is K0 and K0 < K.

p(Y∣π, θ) = ∏i=1,…,n ∑k=1,…,K πk f(yi∣θk). (2)

The allocations are modelled with a Multinomial variable Z = {z1, …, zn}, where zi ∼ 𝓜(1; π1, …, πK), so that yi∣zi ∼ f(yi∣θzi) [13].

A Dirichlet prior is placed on the mixture weights, {π1, …, πK} ∼ 𝓓(α1, …, αK). As this prior is always of the exchangeable form where α1 = α2 = … = αK, αk will be referred to as α from this point.

Prior Parallel Tempering (PPT)

We aim to use the prior on the weights to define different degrees of tempering, setting up an approximation to classical tempering which simultaneously models a wide range of possible posterior configurations.

J chains are included in the PPT algorithm, each indexed by j. In a Bayesian setting, each chain can be considered to have a different target posterior pj(ζj∣Y). Here ζj denotes the full set of unknown parameters of chain j, such as ζj = {πj, μj, σ²j, Zj} for univariate Gaussian mixtures. The parameters sampled at iteration t are denoted ζj(t), and the corresponding posterior of the j’th chain is written pj(ζj(t)∣Y).

When a proposal is made to swap the samples of a pair of adjacent chains at a given iteration, a Metropolis-Hastings update on the joint distribution must be made. Consider the proposal to swap chains j and j′ at iteration t. The joint target of both chains can be written as pj(ζj∣Y) pj′(ζj′∣Y), and the goal of tempering is to preserve this target, only accepting moves with probability min(1, A). The acceptance ratio A is the joint density of the chains given the move is accepted, divided by the current joint density.

Omitting the iteration indicator (t), as all values are assumed to refer to the same iteration, the acceptance ratio is

A = [pj(ζj′∣Y) pj′(ζj∣Y)] / [pj(ζj∣Y) pj′(ζj′∣Y)]. (3)

In the case of PPT, the likelihood is the same in all chains, so pj(Y∣ζj′) = pj′(Y∣ζj′) and pj′(Y∣ζj) = pj(Y∣ζj). Expanding each posterior into likelihood and prior therefore reduces A to the prior densities:

A = [pj(Y∣ζj′) pj(ζj′) pj′(Y∣ζj) pj′(ζj)] / [pj(Y∣ζj) pj(ζj) pj′(Y∣ζj′) pj′(ζj′)] (4)

  = [pj(ζj′) pj′(ζj)] / [pj(ζj) pj′(ζj′)]. (5)

Furthermore, as only the prior on the weights is allowed to change, A may be further simplified. Recalling the prior structure p(ζ) = p(μ)p(σ2)p(π), A can be written as

A = [pj(πj′) pj′(πj)] / [pj(πj) pj′(πj′)]. (6)

Since pj(π) = 𝓓(αj, …, αj) and pj′(π) = 𝓓(αj′, …, αj′), the final acceptance ratio is comprised of four densities defined by the prior on the weights only:

A = [𝓓(πj′; αj) 𝓓(πj; αj′)] / [𝓓(πj; αj) 𝓓(πj′; αj′)]. (7)
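Because Eq (7) involves only Dirichlet densities, the swap probability is cheap to evaluate. A minimal R sketch, assuming symmetric priors 𝓓(α, …, α) as used throughout the paper (function names are illustrative):

```r
# Log-density of a symmetric Dirichlet D(alpha, ..., alpha) at weight vector p
# (weights are assumed strictly positive).
ldirich <- function(p, alpha) {
  K <- length(p)
  lgamma(K * alpha) - K * lgamma(alpha) + (alpha - 1) * sum(log(p))
}

# Log acceptance ratio of Eq (7) for swapping the states of chains j and j'.
ppt_log_A <- function(pi_j, pi_jp, alpha_j, alpha_jp) {
  ldirich(pi_jp, alpha_j) + ldirich(pi_j, alpha_jp) -
    ldirich(pi_j, alpha_j) - ldirich(pi_jp, alpha_jp)
}
# The normalising constants cancel, so this equals
# (alpha_j - alpha_jp) * (sum(log(pi_jp)) - sum(log(pi_j))).
# The swap is accepted when log(runif(1)) < ppt_log_A(...).
```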

Sampling overfitted mixture models with PPT

We now set up Zmix, an MCMC sampling algorithm for mixture models which incorporates PPT into a collection of Gibbs samplers. A set of J parallel, independent samplers is set up, and, as mentioned, the degree of tempering is determined by the hyperparameter on the mixture weights, (αj, j ≤ J). The set of αj must be chosen to ensure a wide range of parallel chains and include values from well above d/2 to close to zero. As the overall goal is to sample from a posterior where the posterior weights of extra components are very close to zero, the chain generated by the smallest value of αj in the PPT is referred to as the target chain of Zmix.

Choosing the candidate parameters (αj, j ≤ J).

The choice of αj is arbitrary at this point; a wide range allows a broad spectrum of posterior configurations to be generated, but values too far apart result in undesirably large changes between tempered chains (and proposed swaps in the PPT algorithm are rarely accepted). The smallest hyperparameter, αJ, is set very close to zero to encourage unsupported components to have a negligible probability of being assigned observations. In practice, the success of the tempering is monitored by tracking the acceptance frequency of swaps between all chains to ensure an adequate acceptance rate. Values are chosen starting at α1 = 30 to ensure total merging in all simulations and examples included in this paper.

Two sets of (αj, j ≤ J) are explored: a larger range in the early exploratory stage, followed by a refined set. Initially, J = 30 chains are used to explore the posterior behaviour of overfitted mixtures under increasingly extreme conditions, with values according to Eq (8). Subsequently, this is reduced to J = 25 chains (Eq 9).
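The exact ladders of Eqs (8) and (9) are not reproduced above, but a ladder with the stated properties, starting at 30 (well above d/2 = 1, forcing full merging) and decaying towards zero, can be generated geometrically; the end point and spacing below are illustrative assumptions, not the paper's exact values:

```r
# Geometric ladder of J = 25 Dirichlet hyperparameters from 30 to near zero.
# End point 1e-10 and spacing are assumptions for illustration.
J <- 25
alpha_ladder <- exp(seq(log(30), log(1e-10), length.out = J))
round(alpha_ladder[c(1, 10, 25)], 12)  # top, middle, and target rungs
```

Geometric spacing keeps the ratio between adjacent hyperparameters roughly constant, which helps keep the PPT swap acceptance rates comparable along the ladder.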

Zmix Algorithm.

Recall that a Gibbs sampler [9] is based on drawing samples from the full conditional distributions of each unknown variable, and that PPT requires only the prior on the weights to differ between chains.

Define πj = {πj1, …, πjK} as the set of K weights for chain j, where j = 1, …, J. The prior on the weights is denoted pj(πj) = 𝓓(αj, …, αj). The density of the distribution of πj given the allocations Zj(t) at iteration t is written p(πj∣Zj(t)). Similarly, the parameters of the components of chain j are indexed as θj = {θj1, …, θjK}. Since the distribution of the parameters given the allocations and the data is the same across all chains, we write p(θ∣Zj(t), Y) for iteration t.

Before Zmix is implemented, a choice of u must be made, which determines the probability that a tempering move will be attempted at a given iteration. For clarity, note that each parameter is first indexed by tempering chain j; for example, the weights in a chain are πj and the allocations Zj. More specificity is added by including another level when required, so that, for example, the k’th element of πj is denoted πjk.

MCMC sampling of the unknown parameters then proceeds as follows.

  1. Initialise: Choose starting values Zj(0) and ζj(0) for the parameters in all chains.
  2. Step t: For each iteration t = 1, …,
    1. Gibbs sampling. For each chain j = 1, …, J,
      1. Generate the allocations Zj(t) from
        P(zi(t) = k∣πj(t−1), θj(t−1), yi) ∝ πjk(t−1) f(yi∣θjk(t−1)), (10)
        for each i = 1, …, n and k = 1, …, K.
      2. Generate πj(t) from
        p(πj∣Zj(t)) = 𝓓(αj + nj1(t), …, αj + njK(t)), (11)
        with njk(t) the number of observations allocated to component k in Zj(t) (see the sketch after this algorithm).
      3. Generate θj(t) from p(θ∣Zj(t), Y).
    2. Exchanging the chains.
      With probability u:
      1. Draw j randomly from the set (1:J − 1), selecting chains j and j′ = j + 1 as candidates for tempering.
      2. Accept the move with probability min(1, A), where
        A = [𝓓(πj′(t); αj) 𝓓(πj(t); αj′)] / [𝓓(πj(t); αj) 𝓓(πj′(t); αj′)], (12)
        and perform the tempering:
        1. A. Exchange πj(t) and πj′(t),
        2. B. Exchange θj(t) and θj′(t), and
        3. C. Exchange Zj(t) and Zj′(t).
      3. Return to Step 2(a).
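The weight update in step 2(a)ii is straightforward to implement; a minimal R sketch follows, with illustrative helper names (base R has no Dirichlet sampler, so one is built from Gamma draws):

```r
# Draw from D(a_1, ..., a_K) by normalising independent Gamma draws.
rdirichlet1 <- function(a) { g <- rgamma(length(a), shape = a); g / sum(g) }

# Full conditional update of the weights for chain j (Eq 11): the prior
# hyperparameter alpha_j is incremented by the component occupancy counts.
update_weights <- function(z, K, alpha_j) {
  n_jk <- tabulate(z, nbins = K)  # observations allocated to each component
  rdirichlet1(alpha_j + n_jk)
}
```

Note how an empty component k (njk = 0) keeps its tiny prior parameter αj, so its sampled weight remains close to zero in the target chain.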

Important quantities.

The number of non-empty components at each iteration t (after a burn-in period) for chain j is denoted Kj(t), and we set K(t) = KJ(t), the count for the target chain. The distinct values of K(t) are defined as 𝓚k0 for k0 = 1, …, K̄j (where K̄j is the maximum number of alive (non-empty) groups observed in chain j). The mode of the empirical distribution of K(t) is denoted K̂0.
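Both quantities reduce to simple summaries of the stored allocations; a minimal sketch, assuming Z is a matrix holding one chain's post-burn-in allocations (rows are iterations, columns are observations):

```r
# Number of alive (non-empty) components at each stored iteration.
alive_counts <- function(Z) apply(Z, 1, function(z) length(unique(z)))

# K0_hat: the mode of the empirical distribution of the alive counts.
K0_hat <- function(Z) {
  tab <- table(alive_counts(Z))
  as.integer(names(tab)[which.max(tab)])
}
```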

Choice of mixture distribution.

The asymptotic theory underpinning this paper can be applied to a wide range of mixture distributions; in this paper a univariate Gaussian mixture model is adopted, with f(⋅∣θk) = 𝓝(μk, σ²k). A hierarchical prior is used on θk = (μk, σ²k), in the conjugate form: an Inverse Gamma prior on the variances, σ²k ∼ 𝓘𝓖(a, β), and a Gaussian prior on the means, μk∣σ²k ∼ 𝓝(b, σ²k/τ).

Hyperparameters are set to a = 2.5 and τ = 1, with the prior mean b set equal to the mean of the observed sample. This formulation is chosen to facilitate Gibbs sampling, particularly the choice of b, which centres the prior for the means within the range of the observed sample. This speeds up convergence compared to choosing a value not within the range of the observations.

Resolving label switching with Zswitch

A relabelling algorithm is proposed here, inspired by the methods of [8] and [32]. Unless otherwise indicated, all notation in this section relates to the target chain j = J of the tempering algorithm. For each iteration t, let K(t) denote the number of non-empty groups. Then, for k0 = 1, …, K, let T(k0) denote the set of iterations for which K(t) = k0.

For each value of k0, choose a reference set of allocations Z0 and corresponding parameters θ0, permuting the labels so that the first k0 groups are non-empty. Here, the reference is chosen as the MAP estimator of the target posterior, computed using only non-empty components.

For each iteration t ∈ T(k0), let nk(t) be the number of observations assigned to component k (for k = 1, …, K), and let the vector a(t) = (a1, …, ak0) contain the labels of the non-empty components.

The joint distribution of the current and reference allocations is summarised by creating a k0 × k0 table M, where M(r, c) is the cell pertaining to row r and column c, the columns denote the reference labels, and the rows denote the elements of a(t). The value of M(r, c) is the number of observations assigned to the component labelled ar which are also in the reference group labelled c.

The table M is the key to Zswitch and is used to identify the subset of reference components which have a similar membership to each current component. The tuning parameter m, defined below, determines the sensitivity of the algorithm by designating the minimum proportion of the observations from each component which must belong to some reference group before it is considered a candidate for relabelling.

For each row r = 1, …, k0, let Ir be the set of labels c such that the proportion of observations shared by the current group ar and reference group c exceeds the threshold m, that is Ir = {c : M(r, c)/nar(t) > m}. Since Ir is a set of labels, ∣Ir∣ denotes the size of the set, and Ir × Ir′ is the Cartesian product of Ir and Ir′. Let ar* denote the updated, or resolved, label for ar.

If ∣Ir∣ = 1, then ar* is the single element of Ir. In addition, if ∑r ∣Ir∣ = k0, then every component has a unique candidate. Updating the values of ar relabels all associated allocations (Z(t)) and parameters (θ(t)), resolving the label switching.

If ∑r ∣Ir∣ > k0, there are multiple candidate labels for at least one component. The final choice is the permutation of the candidate labels which minimises the distance between the current and reference parameters under each possible relabelling scheme, as follows. Let SI be the set of permutations contained in the Cartesian product of the k0 sets I1, …, Ik0. The final relabelling scheme v* is then identified as v* = argmin v ∈ SI ∑r=1,…,k0 ∥θar(t) − θ0v(r)∥. All parameters and allocations are then relabelled according to v*.

Zswitch algorithm.

Define the number of observations assigned to each group k = 1, …, K at each iteration as nk(t) = #{i : zi(t) = k}.

For k0 = 1, …, K with T(k0) ≠ ∅:

  1. Select reference Z0 and θ0. Permute the labels of Z0 so that the first k0 groups are non-empty.
  2. Step t: For each iteration t ∈ T(k0),
    1. Phase one: Allocation-based relabelling.
      1. For k = 1, …, K, compute nk(t).
      2. Create a(t), the vector of component labels for which nk(t) > 0, for k = {1, …, K}.
      3. Construct M, a k0 × k0 table, and set
        M(r, c) = #{i : zi(t) = ar and z0i = c}. (13)
      4. For r = 1, …, k0, start with an empty set Ir = ∅, and let
        Ir = {c : M(r, c)/nar(t) > m}. (14)
        1. If ∣Ir∣ = 1, let ar* equal the single element of Ir.
      5. If ∑r ∣Ir∣ = k0, relabel Z(t) and θ(t) by setting ar to ar* and exit loop.
      6. If ∑r ∣Ir∣ > k0, proceed with Phase two.
    2. Phase two: Parameter-based relabelling
      1. Let SI be the set of permutations from Ir to Ir, found by computing the k0-fold Cartesian product I1 × ⋯ × Ik0.
      2. Find the permutation v* for which
        v* = argmin v ∈ SI ∑r=1,…,k0 ∥θar(t) − θ0v(r)∥. (15)
      3. Relabel Z(t) and θ(t) by setting ar to v*(r).

To ensure the success of Zswitch, density plots are created for each set of relabelled posterior parameter estimates, and we deem Zswitch successful when these are all clearly unimodal.
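Phase one is essentially a cross-tabulation of the current and reference allocations; the sketch below computes the candidate sets Ir for a single iteration. The names are illustrative, and the default for m is an assumption, not the paper's value:

```r
# z: current allocations at iteration t; z0: reference (MAP) allocations;
# m: minimum shared proportion for a reference label to become a candidate.
zswitch_candidates <- function(z, z0, m = 0.5) {
  a <- sort(unique(z))                     # labels of the non-empty components
  M <- table(factor(z, levels = a), factor(z0, levels = sort(unique(z0))))
  n_r <- rowSums(M)                        # observations per current component
  # I_r: reference labels sharing more than proportion m of component a_r.
  I <- lapply(seq_along(a), function(r) colnames(M)[M[r, ] / n_r[r] > m])
  names(I) <- a
  I   # all singletons: relabel directly; otherwise phase two is required
}
```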

Simulations and case studies

The results are presented according to the following evaluation strategy, which is designed to explore the impact of αj, and particularly the behaviour of the target posterior. Replicate simulation studies are included to explore the consistency of the observed behaviour, and case studies are also included for comparison with existing literature.

Simulations.

A set of simulations representing a range of univariate Gaussian mixtures is used to test and illustrate Zmix and Zswitch. Gaussian mixture models are considered, with unknown means and variances for all components. Four simulations illustrate the methods in this paper, denoted Sim 1–4. Fig 1 includes histograms of samples with n = 200 from each simulation, as well as the density of the true underlying distribution. Sim 1 defines a well separated mixture of K0 = 3 components. Sim 2 has the same number of groups but they are closer together, with two high peaks whose tails overlap with a central component with a larger variance. Sim 3 represents a scenario where the K0 = 2 components are difficult to distinguish; all parameters except the variances are equal, producing a unimodal density. Sim 4 contains K0 = 3 components, where 99% of the observations are expected to represent only two components with close means, while a third, better separated component is only allocated 1% of the weight. Sim 4 describes a situation where true groups with small weights exist, to better understand how these can best be identified.

Fig 1. Description of the four simulations considered in this paper.

Density plots of the mixture distributions are indicated by a dashed line, and histograms of a single realisation of each simulation (with n = 200) are included.

https://doi.org/10.1371/journal.pone.0131739.g001

The parameters of the simulations are as follows:

  1. Sim 1: K0 = 3 with μ = {15, 7, 1} and σ2 = {1, 1, 1}.
  2. Sim 2: K0 = 3 with μ = {−1, 10, 4} and σ2 = {0.5, 0.5, 3}.
  3. Sim 3: K0 = 2 with μ = {1, 1} and σ2 = {10, 1}.
  4. Sim 4: K0 = 3 with μ = {6, 10, 20} and σ2 = {1, 1, 0.5}.

Evaluation strategy.

  1. Exploratory simulations
    For each of the four simulations, generate samples of size n = 100 and n = 200. Fit Zmix with K = 10 to each simulation, for 50,000 iterations and 30 chains. Store the last 20,000 iterations for all chains (j = 1, …, J).
    1. Number of non-empty components:
Compute Kj(t), the number of alive (non-empty) components at each iteration, for each chain (j = 1, …, J).
    2. Model fit and parameter estimates:
      Resolve the label switching for the target chain (where j = J) using Zswitch and proceed with post-processing (described in detail further on). Compute posterior estimates of all estimated parameters, including 95% credible intervals. Compute the posterior allocation probabilities of each observation and each alive component.
  2. Replicate simulations
    1. Number of non-empty components
      For n = 100 and n = 200, produce 20 replicates of each simulation.
      Run Zmix for 20,000 iterations with K = 10 and 25 chains, saving the target chain (j = J) for each run after 5000 iterations.
    2. Compute K(t) for each iteration, storing the resulting vector. Compute the proportion of iterations represented by each configuration, and the mode K̂0 of the empirical distribution of K(t).

Case Studies.

Three case studies are described to illustrate the results of Zmix and Zswitch in practice. The first case study is the Acidity dataset, which consists of the log acidity index for 155 lakes in the North-Eastern United States [18]. These have been previously analysed as a mixture of Gaussian distributions on the log scale by [18], who found evidence for two or three groups. [14] found evidence for three to five components with the same model using a Reversible Jump MCMC algorithm, while [21] overfit this dataset, resulting in a posterior with two true components.

The second case study involves the Enzyme dataset found in [33], which consists of measurements of enzymatic activity in blood for an enzyme involved in the metabolism of carcinogenic substances (velocity and substrate concentration), for a group of 245 unrelated individuals. [33] first analysed this data and found a mixture of 2 skewed components using MLE. [14] found evidence for 3 to 5 components using RJMCMC. [34] also modelled this data with 2 skewed Gaussian components.

Finally, the Galaxy dataset [35] is considered, since it has been investigated by many researchers with a wide range of results [36–38]. It is a small dataset of 82 measurements of galaxy speeds from 6 segments of the sky. It is of particular interest as different order estimation methods have suggested that the data contain anywhere from 3 to 9 components [11, 19, 21, 37, 39–41]. [40] observed that extra components appear to be modelling underlying skewness present in the sample.

Evaluation strategy for Case Studies.

For each case study, run Zmix for 50,000 iterations. Extract the last 20,000 iterations of the target posterior. Subset by value of K(t), and apply Zswitch and post-processing to each subset. Compute and save the posterior mean of all estimated parameters, including 95% credible intervals, and posterior allocation probabilities for each configuration considered.

Post-processing

For all but the replicate simulation studies, the same post-processing is performed on the target posterior of the results of Zmix.

The iterations of the target chain are split into subsets by the number of non-empty components present at each iteration, K(t), and the label switching is resolved for each of these according to Zswitch. Model quality statistics are particularly useful when multiple configurations are present as they allow further comparison of the candidate models, and the following statistics are computed for each subset processed by Zswitch.

For each considered configuration, identified by 𝓚k0, the proportion of iterations represented is first computed to estimate the probability that the observations originate from a mixture with 𝓚k0 components, p(𝓚k0). When 𝓚k0 = K0, the proportion of the observations whose predicted allocations correspond to their true groupings is computed. The mean absolute error (MAE) and mean squared error (MSE) are calculated using the unswitched parameter estimates.

The remaining statistics are based on posterior predictive testing: the posterior samples of the parameters in the target chain are resampled in order to predict 10,000 datasets of the same size as the original data. Mean absolute prediction errors (MAPE) and mean squared prediction errors (MSPE) are reported as an average over the replicates. Bayesian P-values estimating p(min(Yrep) < min(Y)) and p(max(Yrep) < max(Y)) are included, which we call Pmin and Pmax. Both are included as they can be useful in identifying a skewed fit. Predictive concordance is then computed, which can be interpreted as the average proportion of yi’s that are not outliers given the model, based on the suggestion that any yi falling in either 2.5% tail area of its posterior predictive distribution should be considered an outlier [42]. An ideal fit should have a predictive concordance of around 95% [42].
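Given a matrix of posterior predictive replicates, these statistics are direct to compute; the sketch below is one reasonable reading of the definitions above, not necessarily the paper's exact implementation:

```r
# y: observed data; Yrep: matrix of posterior predictive datasets, one
# replicate per row.
ppc_stats <- function(y, Yrep) {
  Pmin <- mean(apply(Yrep, 1, min) < min(y))  # p(min(Yrep) < min(Y))
  Pmax <- mean(apply(Yrep, 1, max) < max(y))  # p(max(Yrep) < max(Y))
  q <- quantile(Yrep, c(0.025, 0.975))        # central 95% predictive band
  concordance <- mean(y >= q[1] & y <= q[2])  # ~0.95 indicates a good fit
  c(Pmin = Pmin, Pmax = Pmax, concordance = concordance)
}
```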

A small set of plots is also created for each candidate configuration. Density plots of the posterior parameters illustrate their distribution and the success of Zswitch at relabelling the output of Zmix. Also included is a plot of estimated allocation probabilities, and a plot of the density of 10,000 predicted datasets overlaid by that of Y, allowing the overall fit to be explored and areas of bad fit to be identified.

R Code.

Please refer to S1 File to obtain the R code to perform all analyses described in the methods.

Results

The following results show that the number of alive components in the target chain of an overfitted mixture modelled with Zmix, K(t), provides a sparse estimate of the true number of components, K0. Given a large enough sample size and a well mixed MCMC sampler, there is little or no variation in K(t), and the mode of this distribution is equal to K0. When the sample size is small relative to the complexity of the underlying mixture, K(t) encompasses a small range of likely configurations, which tend to include the true value as well as one or two more conservative estimates of the number of components. The estimated parameters and allocation probabilities corresponding to each model (or configuration) considered can be extracted directly from the target chain, and interpreted once label switching is resolved.

1. Exploratory simulations

1.(a) Exploring the distribution of Kj(t).

Boxplots illustrating the distribution of Kj(t) are presented in Fig 2 for each chain j = {1, …, J} of Zmix, where j = J corresponds to the target posterior. Fig 2 contains these results for Sim 2, while the plots pertaining to the other simulations can be found in the supplementary material (S1, S2, and S3 Figs).

Fig 2. Number of alive (non-empty) groups for each chain j for Sim 2.

Results are shown for Sim 2, n = 100 (left) and n = 200 (right). Boxplots of the number of non-empty groups for each chain j are included; each chain represents posterior samples from the Zmix sampler with the hyperparameter αj on the mixture weights, the value of which is included in red for each j.

https://doi.org/10.1371/journal.pone.0131739.g002

For the simulations in this paper, when αj > 5 all components merge so that none are empty. As αj approaches d/2 = 1, a slight decrease in Kj(t) can be observed. As the threshold of d/2 is passed (at αj = 1) and αj decreases further, a steady drop in the number of non-empty groups is evident, which continues as αj approaches zero.

Once αj is close to zero (approximately 3 × 10−8 here), the posterior distribution of Kj(t) appears to reach an equilibrium, and remains constant for all subsequent chains up to and including the target. The posterior behaviour of the target is well exemplified by Fig 2, where the following can be observed. When the sample size is large enough (n = 200 here), K(t) represents a single configuration in all iterations, so that K(t) = 3 for all t. This is inferred to be equal to the true number of components, K̂0 = K0. In the case where n = 100, the range of K(t) includes a small subset of likely configurations, in this case one with the true number of components, and an alternate posterior configuration with one fewer group.

This behaviour is observed consistently across the four simulations, except for Sim 4 with n = 100, where the range of K(t) does not include the true value K0; all iterations represent a posterior with only 2 groups. Since the true allocations are known here, it is noted that the component with πk = 0.01 is not represented in this realisation of Sim 4, and thus could not be estimated.

1.(b) Model fit and parameter estimates.

The candidate models found by Zmix, defined by the distinct values 𝓚k0 observed in the target chain, are compared using the posterior parameter estimates obtained from the target posterior. A set of summary and model quality statistics computed for each configuration (or number of components) is included in Table 1.

Table 1. Goodness-of-fit statistics for each simulation estimated by Zmix.

https://doi.org/10.1371/journal.pone.0131739.t001

For all simulations when n = 100, Zmix tends to place a higher probability on the configuration with fewer components, and when n = 200 a single configuration (or model) is represented in the posterior. The replicate study in the following section provides a comprehensive exploration of the distribution of for each sample size and simulation. When n = 100, there is some ambiguity in the true number of components and a small subset of models is present in the results. For Sim 1, the model quality statistics in Table 1 exhibit a marked preference for the configuration with 𝓚k0 = 3. The statistics show a much smaller difference between competing configurations for the other, more complex simulations, particularly Sim 3.

These statistics do not provide a complete view of the fit of each configuration. For example, errors based on the estimated means invariably shrink slightly as more components are included. While large changes may be useful and appear to point towards the right number of components, we find visual evaluation tools to be quite illustrative and useful for decision making. We focus on the results of Sim 2 in the following paragraphs; all corresponding figures for the other simulations are available as supplementary material. Diagnostic plots can be found for Sim 1 with n = 100 and n = 200 in S4 and S5 Figs, for Sim 3 in S6 and S7 Figs, and for Sim 4 in S8 and S9 Figs. Parameter summaries of all candidate configurations can be found in S1 Table.

Exploring the results of Sim 2, from Table 1 it is known that when n = 200, p(𝓚k0 = 3) = 1, and the resulting clustering is found to be very accurate, with 97% of observations correctly reclassified. Diagnostic plots for Sim 2 with n = 200 can be found in Fig 3. The predictive density plots show the three-component model fits the data very well, and there is almost no uncertainty in the clustering of each observation. It is observed that the label switching has been resolved successfully, with all posterior densities exhibiting a single mode. Posterior parameter estimates and 95% credible intervals are given in Table 2. Parameters for n = 200 are tightly estimated, with only the variances slightly inflated, attributable to the modest sample size.

Fig 3. Sim 2, n = 200, 𝓚k0 = 3.

Results of Zmix and Zswitch including, from top left to right: the posterior parameter densities of all parameters from estimated groups, the posterior probability of allocations for each observation for each component, and a density plot of the data over the densities of 10,000 predicted datasets of the same size from the posterior.

https://doi.org/10.1371/journal.pone.0131739.g003

Table 2. Estimated parameters of Sim 2 and 95% credible intervals for n = 100 and n = 200.

https://doi.org/10.1371/journal.pone.0131739.t002

Two sets of plots describe the results of Zmix for Sim 2 with n = 100, as two possible configurations were reported. Fig 4 illustrates these for both candidates, given p(𝓚k0 = 2) = 0.67 and p(𝓚k0 = 3) = 0.33 from Table 1. For this simulation and sample size there is little difference evident in the overall fit of the model, and the predictive density plots are very close to Y for 𝓚k0 = 2 except for some skewness in the right tail. The same plots for 𝓚k0 = 3 show that this area of bad fit is resolved by adding an extra component. The allocation probabilities highlight the difference between the two configurations. When only 2 groups are included, there is some uncertainty in the posterior allocations of observations which fall in a small region between the estimated components. This region forms the extra component included in the alternate configuration, and here the allocation probabilities are very high for all observations: 97% of the allocations are correctly predicted under this model.

Fig 4. Summary of the results for Sim 2 and n = 100.

The first two rows of plots refer to 𝓚k0 = 2, and the lower set refers to 𝓚k0 = 3. Results of Zmix and Zswitch are presented including, from upper left to lower right of each set: the posterior parameter densities of all parameters from estimated groups, the posterior probability of allocations for each observation for each component, and a density plot of the data over the densities of 10,000 predicted datasets of the same size from the posterior. A panel of plots is included for each candidate model found by Zmix.

https://doi.org/10.1371/journal.pone.0131739.g004

Looking at the posterior parameter estimates of all the simulations considered, which can be found in supplementary material S1 Table, parameter estimates when n = 200 are very close to the true underlying values with some variances slightly inflated. As can be expected, estimated variances are generally larger for the results where n = 100. Overall we find the clustering is very successful for these simulations when the correct number of components is estimated, ranging from 97% to 100% accuracy for all but Sim 3 (Table 1). Sim 3 contains two components which overlap almost entirely, and here while the parameters are close to the truth, the posterior allocations are only correctly estimated for 70% to 77% of the observations.

2. Replicate simulation study

Recall that for the replicate simulation study, 20 realisations of each simulation were created and overfitted with K = 10 using Zmix and 25 chains. For each replicate, K̂0 is estimated as the most likely (most frequently observed) value of K(t). Table 3 shows the proportion of times each value of K̂0 is estimated for each simulation.

Table 3. Summary of the results for 20 replicate simulation study.

https://doi.org/10.1371/journal.pone.0131739.t003

For most replicates of all simulations, when n = 100 the target posteriors include a small range of values of 𝓚k0, as in the exploratory simulations. For Sim 1 and 2, when n = 200 every replicate estimates three components consistently. Sim 3 and 4 are more complex mixtures, and the target of Zmix often encompasses a small range of one to three configurations.

For small sample sizes, K̂0 tends to underestimate the number of components. With a small increase in sample size, the probability of providing a correct estimate of K0 increases sharply. This is most evident for the simpler simulations included, but is also visible as a clear trend for Sim 3 and Sim 4.

Case Studies

In the analysis of the three case studies, Zmix results in posterior configurations with the same number of components as the smallest number found in the previous literature.

Acidity.

Zmix finds that 2 components are best suited to model the Acidity data, with p(𝓚k0 = 2) = 1. The target posterior does not contain any estimates from another configuration, indicating there is no ambiguity in this decision. Referring to Fig 5, the two components are well separated and there is little uncertainty in the posterior allocations, which are slightly less sharply defined in a small region of overlap between the two groups. Posterior predictive plots indicate the model represents the data well, with no need for any more components.

Fig 5. Overfitting the Acidity dataset.

Results of Zmix and Zswitch including from upper left to lower right: the posterior parameter densities of all parameters from estimated groups, the posterior probability of allocations for each observation for each component, and a density plot of the data over the densities of 10,000 predicted datasets of the same size from the posterior.

https://doi.org/10.1371/journal.pone.0131739.g005

The resulting mixture has tightly estimated posterior parameters for both components, with narrow 95% credible intervals for the weights, means, and variances.

Enzyme.

Overfitting the Enzyme dataset with Zmix produces two possible alternate configurations with two or three components, in a similar manner to the simulation results observed where the sample size was too small. From Table 4, the probability that this data can be modelled by 2 components is p(𝓚k0 = 2) = 0.90, and three components are less likely, with p(𝓚k0 = 3) = 0.10.

Posterior parameter estimates (weights, means, and variances, with 95% credible intervals) are obtained for both candidate configurations, 𝓚k0 = 2 and 𝓚k0 = 3.

Comparing the two candidate models with the model fit quantities in Table 4, the inclusion of three components decreases the MAPE and MSPE slightly, but has little impact on the remaining statistics. Concordance is observed to decrease with the addition of a third component. The posterior predictive density plots in Fig 6 reveal little difference between the fit of the two models, but the plot of the allocation probabilities reveals the difference between the candidate configurations. It is clear that the 2-component posterior provides a much more certain fit, with no uncertainty in the clustering of the data, whereas the 3-component model exhibits much less clarity in the posterior allocation probabilities.

Fig 6. Overfitting the Enzyme dataset.

Results of Zmix and Zswitch including, from upper left to lower right: the posterior parameter densities of all parameters from estimated groups, the posterior probability of allocations for each observation for each component, and a posterior predictive density plot of 10,000 replicates with the density of the data represented as a dashed line.

https://doi.org/10.1371/journal.pone.0131739.g006

This case study illustrates the importance of making a final choice based on the original goal of the analysis. Recall that the Enzyme dataset comprises measurements of enzymatic activity in blood for an enzyme involved in the metabolism of a carcinogenic substance. While the posterior may strongly favour 2 components, the fact that multiple configurations are included in the target posterior indicates there is some non-negligible probability that the larger number is the true number of components. The added cluster describes a smaller component with a larger mean, suggesting that a small group of patients with a different distribution of enzymatic activity, characterised by a larger mean, may be present. If a higher level of activity is believed to relate to a higher risk of cancer, for example, then further analyses on a subset of individuals with potentially higher risk may be of interest, and the less likely model may be reported.

Galaxy.

Surprisingly, given the small sample size, analysis of this dataset results in a stable target consistently representing only 2 components with similar means (see Fig 7). One has a large weight and small variance, modelling the peak at the centre of the range of Y, while the other is described by a smaller weight but a very large variance. This second group models the outlying values of the dataset at both tails. The posterior predictive density plot reveals that this is a reasonable model for these data, resulting in similar predicted replicates.

Fig 7. Overfitting the Galaxy dataset.

Results of Zmix and Zswitch including, from top left to right: the posterior parameter densities of all parameters from estimated groups, the posterior probability of allocations for each observation for each component, and a density plot of the data overlaid over the densities of 10,000 predicted datasets of the same size from the posterior.

https://doi.org/10.1371/journal.pone.0131739.g007

Since the fitted mixture model places no restrictions on the variance of the underlying mixture, this configuration is possible, and it appears reasonable to conclude that these data could have originated from such a model. Given the physical origins of the data, however, it may be warranted to impose some restrictions on other priors or on the variances. [43] use astronomically motivated priors to model this dataset, and find evidence for 7 components. In the sensitivity study conducted in [19], it is shown that while there appears to be evidence for anywhere from two to eight components, a very large probability is assigned to two components when the variance is allowed to be large. In terms of Zmix, recall that in the simulation studies the algorithm was able to identify a component with a very small weight of πk = 0.01 in 35% of replicates when n = 100, and 70% of replicates when n = 200; it is frequently able to identify well separated univariate Gaussian components when these are represented by as few as two or three observations in a sample. From these observations, there appears to be some evidence that the Galaxy dataset may not originate from a Gaussian mixture. If this distribution is Gaussian, a larger number of observations is required for Zmix to estimate a more complex configuration, or some restrictions must be placed on the priors.

Discussion

The success of Zmix for order estimation is governed by a close relationship between the sample size and the underlying complexity of the mixture distribution to be overfitted. However, this is true for all order estimation methods; a component must be adequately represented in the given sample before it can be estimated [1]. The algorithm is easy to implement and interpret, and requires only that a maximum number of components is specified and that this is larger than the expected upper bound of K0. It is based on the same basic format and conditional distributions as a standard Gibbs sampler on a single parametric model, with the addition of a range of prior hyperparameter values implemented in the PPT algorithm.

Obtaining a well-mixed MCMC sample can be a difficult task in mixture modelling even when K0 is known, and Zmix can also be used in such cases to ensure thorough mixing. To ensure all groups merge in at least some tempered chains, the largest α used would need to be large compared to the sample size. When the number of groups is overfitted this is less important, as the extra groups act as bridges between the supported modes, facilitating mixing.

Given a large enough sample size relative to the underlying complexity of a mixture, Zmix can provide an accurate estimate of the minimum number of components required to model the given data. When there is some uncertainty in the best configuration which fits the sample, Zmix produces a small range of candidate models. This commonly occurs when the sample size is small relative to the complexity of the mixture. Using the distribution of the number of non-empty components results in a strict subset of likely configurations smaller than that typically obtained by multidimensional samplers, as the chosen prior forbids components to be identical in the target posterior and prevents unnecessary groups from being allocated observations.

The method can underestimate the number of components present when the sample size is small, or when the observations represent many heavily overlapping groups. This is partially due to the hyperparameter linking the mean and variance of the Gaussian distribution of each component: the τ hyperparameter. When τ = 1, the prior for the mean is strongly linked to the estimated variance of that component; such a prior assumes that the variance of the mean μk of each component is the same as the variance of that component, σ²k. This choice may be too restrictive for certain applications, and lowering this value will prevent groups with small sample sizes from being assimilated into the tail of other groups. Overfitting with τ = 1 will cause the posterior to have a stronger preference for a model with fewer components and large variances over one with more groups characterised by distinct means and small variances. The value of τ can be adjusted easily in the R implementation of Zmix.
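For example, an analysis could be rerun with a smaller τ as below; the function and argument names are hypothetical, so consult the package documentation for the actual interface.

```r
# Hypothetical call: weaken the link between each component's mean and
# variance by lowering tau (names illustrative, not the package's API).
# fit <- Zmix_univariate(y, K = 10, iter = 20000, tau = 0.01)
```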

This behaviour is observed in the results of the Galaxy case study, where only two components are found by Zmix. It may be more reasonable to weaken the bounds on the variance of the means for this dataset to reflect our existing knowledge that the observations do come from many small, separate sections of space. Repeating the analysis with τ = 0.01 results in a posterior with a 100% probability of three non-empty components, placing the two small clusters in each tail in separate groups (results not shown).

The tempering algorithm (PPT) which is incorporated in Zmix allows for an exchange of information between the many potential overfitted posteriors, from fully merged configurations with many identical components to the sparsest configuration, where components either differ by at least one parameter or are empty. If an overfitted model with a value of the common hyperparameter α very close to zero is fit directly with no tempering, the extreme posterior surface prevents the sampler from exploring this space, no mixing is present, and the results often lead to a single group. PPT allows a better exploration of the posterior distribution, even for small values of α. The number of alive components hovers within a small range, providing a small set of candidate models for further comparison. Model averaging has not been considered in this paper but could also be a useful way to interpret the target posterior when multiple configurations are present.

In considering the number and range of αj values which should be included in Zmix for each chain j = (1, …, J), the minimum αj should theoretically define a space where all extra groups are expected to have weights approaching zero. Since the goal of Zmix is to overfit K intentionally in order to create empty components, it makes sense to set the smallest value of α in relation to n as well as d, selecting α much smaller than 1/n. Indeed, by doing so, one expects the posterior distribution of the number of non-empty components to converge to a point mass on the true number of components.

Aside from modelling and order estimation, the proposed Zswitch algorithm is able to rapidly undo the label switching in the target posterior of Zmix. It is at this stage designed specifically for dealing with the output of overfitted mixture models with empty components, but the method can be implemented in other applications as needed. It can be applied with little modification to any mixture modelling situation where a latent allocation parameter is included; the set of parameters utilised in the second phase of Zswitch simply needs to be updated to match the desired distribution. Please note that a rigorous comparison of the performance of Zswitch versus other relabelling methods has not been performed; this is planned for future work.

It is theoretically possible for Zswitch to result in a computational overload in practice if it attempts to compute large permutations of labels (for example, if 6 or more labels were to be permuted in the second step, Zswitch would need to compute 6! label permutations). This was, however, not observed in any of our experiments, and is unlikely to occur in practice; for 6! permutations to be needed, a mixture posterior would have to contain 6 components, all of which overlap heavily with each other. In the unlikely event this does occur, simply reducing the sensitivity of Zswitch slightly (by choosing a larger value of m) will prevent such an overload. One must also ensure that Zswitch is only applied to a posterior containing no identical (merged) components.

We present Zmix and Zswitch as part of an R package called Zmix, which is available on Github at github.com/zoevanhavre/Zmix. Zmix includes all methods and functions described in this paper for overfitting univariate Gaussian mixtures, with the intention of providing a straightforward Bayesian tool for modelling and order estimation of the most common type of mixtures.

This paper presents a comprehensive solution to estimating Gaussian mixtures with an unknown number of components, dealing with three general problems which inhibit accurate estimation. The issue of non-identifiability induced by overfitting is cast as an order estimation tool using recent theory on the effect of the prior on the weights of an overfitted Bayesian mixture model. MCMC mixing difficulties common to mixtures are greatly amplified by this prior, and this is resolved by Prior Parallel Tempering which ensures full posterior exploration by travelling through all possible configurations of the posterior. This is analogous to parallel tempering but uses a much simpler acceptance ratio formulation. Finally, Zswitch provides a straightforward and complete relabelling algorithm which is adaptable to a wide range of models, allowing the results of an MCMC sampler on mixture data to be interpreted with no extra modelling effort on the part of the analyst.

Supporting Information

S1 Fig. Sim 1: Boxplot of the number of non-empty groups for each chain.

For n = 100 and n = 200, the distribution of the number of alive (non-empty) groups in each chain of the tempering is plotted across all 50,000 iterations, after discarding a burn-in of 5,000. The value of the weight hyperparameter α for each chain is shown in red.

https://doi.org/10.1371/journal.pone.0131739.s001

(EPS)

S2 Fig. Sim 3: Boxplot of the number of non-empty groups for each chain.

For n = 100 and n = 200, the distribution of the number of alive (non-empty) groups in each chain of the tempering is plotted across all 50,000 iterations, after discarding a burn-in of 5,000. The value of the weight hyperparameter α for each chain is shown in red.

https://doi.org/10.1371/journal.pone.0131739.s002

(EPS)

S3 Fig. Sim 4: Boxplot of the number of non-empty groups for each chain.

For n = 100 and n = 200, the distribution of the number of alive (non-empty) groups in each chain of the tempering is plotted across all 50,000 iterations, after discarding a burn-in of 5,000. The value of the weight hyperparameter α for each chain is shown in red.

https://doi.org/10.1371/journal.pone.0131739.s003

(EPS)

S4 Fig. Sim 1 (n = 100): Results of Zmix and Zswitch.

From upper left to lower right: the posterior densities of all parameters from the estimated groups, the posterior probability of allocation of each observation to each component, and a density plot of the data overlaid on the densities of 10,000 datasets of the same size predicted from the posterior. A panel of plots is included for each candidate model found by Zmix.

https://doi.org/10.1371/journal.pone.0131739.s004

(EPS)

S5 Fig. Sim 1 (n = 200): Results of Zmix and Zswitch.

From upper left to lower right: the posterior densities of all parameters from the estimated groups, the posterior probability of allocation of each observation to each component, and a density plot of the data overlaid on the densities of 10,000 datasets of the same size predicted from the posterior.

https://doi.org/10.1371/journal.pone.0131739.s005

(EPS)

S6 Fig. Sim 3 (n = 100): Results of Zmix and Zswitch.

From upper left to lower right: the posterior densities of all parameters from the estimated groups, the posterior probability of allocation of each observation to each component, and a density plot of the data overlaid on the densities of 10,000 datasets of the same size predicted from the posterior. A panel of plots is included for each candidate model found by Zmix.

https://doi.org/10.1371/journal.pone.0131739.s006

(EPS)

S7 Fig. Sim 3 (n = 200): Results of Zmix and Zswitch.

From upper left to lower right: the posterior densities of all parameters from the estimated groups, the posterior probability of allocation of each observation to each component, and a density plot of the data overlaid on the densities of 10,000 datasets of the same size predicted from the posterior.

https://doi.org/10.1371/journal.pone.0131739.s007

(EPS)

S8 Fig. Sim 4 (n = 100): Results of Zmix and Zswitch.

From upper left to lower right: the posterior densities of all parameters from the estimated groups, the posterior probability of allocation of each observation to each component, and a density plot of the data overlaid on the densities of 10,000 datasets of the same size predicted from the posterior. A panel of plots is included for each candidate model found by Zmix.

https://doi.org/10.1371/journal.pone.0131739.s008

(EPS)

S9 Fig. Sim 4 (n = 200): Results of Zmix and Zswitch.

From upper left to lower right: the posterior densities of all parameters from the estimated groups, the posterior probability of allocation of each observation to each component, and a density plot of the data overlaid on the densities of 10,000 datasets of the same size predicted from the posterior.

https://doi.org/10.1371/journal.pone.0131739.s009

(EPS)

S1 Table. Parameter summaries for each model estimated by Zmix, for each simulation.

Parameter summaries are included for n = 100 and n = 200 for all non-empty components in Sims 1 to 4, with 95% Bayesian credible intervals for all estimates. Each row reports the number of non-empty groups in the configuration considered, annotated with an asterisk when this number is correct. The parameter estimates for this configuration whose credible intervals contain the true value are similarly identified with an asterisk.

https://doi.org/10.1371/journal.pone.0131739.s010

(PDF)

S1 File. R Code.

Script file (format.r) containing instructions for downloading and installing the Zmix package, obtaining the simulations and case studies, and reproducing the analyses performed in the paper.

https://doi.org/10.1371/journal.pone.0131739.s011

(R)

Acknowledgments

The authors wish to thank the Queensland University of Technology and the Université Paris Dauphine.

Author Contributions

Conceived and designed the experiments: ZvH NW JR KM. Performed the experiments: ZvH NW JR KM. Analyzed the data: ZvH NW JR KM. Contributed reagents/materials/analysis tools: ZvH NW JR KM. Wrote the paper: ZvH NW JR KM.

References

  1. Frühwirth-Schnatter S. Finite mixture and Markov switching models. 1st ed. Springer; 2006.
  2. Lewin A, Bochkina N, Richardson S. Fully Bayesian mixture model for differential gene expression: simulations and model checks. Statistical Applications in Genetics and Molecular Biology. 2007 Jan;6:Article 36. pmid:18171320
  3. Ferreira da Silva AR. A Dirichlet process mixture model for brain MRI tissue classification. Medical Image Analysis. 2007;11(2):169–182. pmid:17258932
  4. White N, Johnson H, Silburn P, Mellick G, Dissanayaka N, Mengersen K. Probabilistic subgroup identification using Bayesian finite mixture modelling: a case study in Parkinson's disease phenotype identification. Statistical Methods in Medical Research. 2010 Dec.
  5. Heckman JJ, Taber CR. Econometric mixture models and more general models for unobservables in duration analysis. Statistical Methods in Medical Research. 1994;3(3):279–299.
  6. Stauffer C, Grimson WEL. Adaptive background mixture models for real-time tracking. In: Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on. vol. 2. IEEE; 1999.
  7. Reynolds DA, Rose RC. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing. 1995;3(1):72–83.
  8. Marin JM, Mengersen K, Robert CP. Bayesian modelling and inference on mixtures of distributions. Handbook of Statistics. 2005;25:459–507.
  9. Gelfand AE, Smith AFM. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association. 1990;85(410):398–409.
  10. Chib S. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association. 1995 Dec;90(432):1313–1321.
  11. Carlin BP, Chib S. Bayesian model choice via Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Methodological). 1995;57(3):473–484.
  12. Robert C, Casella G. A short history of Markov chain Monte Carlo: subjective recollections from incomplete data. Statistical Science. 2011 Feb;26(1):102–115.
  13. Tanner MA, Wong WH. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association. 1987;82(398):528–540.
  14. Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82(4):711–732.
  15. Nobile A, Fearnside AT. Bayesian finite mixtures with an unknown number of components: the allocation sampler. Statistics and Computing. 2007;17(2):147–162.
  16. McLachlan G, Peel D. Finite mixture models. Wiley Series in Probability and Statistics; 2000.
  17. Celeux G, Hurn M, Robert CP. Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association. 2000;95(451):957–970.
  18. Crawford S. An application of the Laplace method to finite mixture distributions. Journal of the American Statistical Association. 1994;89(425):259–267.
  19. Nobile A. Bayesian finite mixtures: a note on prior specification and posterior computation. arXiv preprint arXiv:0711.0458. 2007.
  20. Rousseau J, Mengersen K. Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B. 2011;73(5):689–710.
  21. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian data analysis. 3rd ed. Chapman & Hall/CRC; 2013.
  22. Swendsen RH, Wang JS. Replica Monte Carlo simulation of spin-glasses. Physical Review Letters. 1986;57(21):2607. pmid:10033814
  23. Earl DJ, Deem MW. Parallel tempering: theory, applications, and new perspectives. Physical Chemistry Chemical Physics. 2005;7(23):3910–3916. pmid:19810318
  24. Altekar G, Dwarkadas S, Huelsenbeck JP, Ronquist F. Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics. 2004;20(3):407–415. pmid:14960467
  25. Baragatti M, Grimaud A, Pommeret D. Likelihood-free parallel tempering. Statistics and Computing. 2013;23(4):535–549.
  26. Celeux G. Bayesian inference for mixtures: the label-switching problem. In: Computational Statistics (COMPSTAT 1998); 1998. p. 227–232.
  27. Grün B, Leisch F. Dealing with label switching in mixture models under genuine multimodality. Journal of Multivariate Analysis. 2009 May;100(5):851–861.
  28. Yao W, Lindsay BG. Bayesian mixture labelling by highest posterior density. Journal of the American Statistical Association. 2009 Jun;104(486):758–767.
  29. Robert E. On Bayesian analysis of mixtures with an unknown number of components (discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology). 1997;59(4):731–792.
  30. Jasra A, Holmes CC, Stephens DA. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science. 2005 Feb;20(1):50–67.
  31. Stephens M. Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2000 Nov;62(4):795–809.
  32. Hurn M, Justel A, Robert CP. Estimating mixtures of regressions. Journal of Computational and Graphical Statistics. 2003 Mar;12(1):55–79.
  33. Bechtel YC, Bonaïti-Pellié C, Poisson N, Magnette J, Bechtel PR. A population and family study of N-acetyltransferase using caffeine urinary metabolites. Clinical Pharmacology and Therapeutics. 1993;54:134–141.
  34. Lin TI, Lee JC, Yen SY. Finite mixture modelling using the skew normal distribution. Statistica Sinica. 2007;17:909–927.
  35. Roeder K. Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. Journal of the American Statistical Association. 1990;85(411):617–624.
  36. Aitkin M. Likelihood and Bayesian analysis of mixtures. Statistical Modelling. 2001 Dec;1(4):287–304.
  37. Escobar M, West M. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association. 1995;90(430):577–588.
  38. Stephens M. Bayesian analysis of mixture models with an unknown number of components: an alternative to reversible jump methods. Annals of Statistics. 2000;28(1):40–74.
  39. Roeder K, Wasserman L. Practical Bayesian density estimation using mixtures of normals. Journal of the American Statistical Association. 1997;92(439):894–902.
  40. Richardson S, Green PJ. On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Methodological). 1997;59(4):731–792.
  41. Phillips DB, Smith AFM. Bayesian model comparison via jump diffusions. In: Markov chain Monte Carlo in practice. 1996. p. 215–239.
  42. Gelfand AE. Model determination using sampling-based methods. In: Gilks WR, Richardson S, Spiegelhalter DJ, editors. Markov chain Monte Carlo in practice. Boca Raton, FL: Chapman & Hall; 1996.
  43. Cameron E, Pettitt A. Recursive pathways to marginal likelihood estimation with prior-sensitivity analysis. Statistical Science. 2014;29(3):397–419.