Bayesian clustering with uncertain data

doi:10.1371/journal.pcbi.1012301

Fig 1.

Plate diagram of Gaussian Mixture Model.

As K → ∞, this becomes the Dirichlet Process Gaussian Mixture Model on which our model is based. In the plate diagram, each rectangle (plate) corresponds to a group of variables, and the text at the bottom of each plate shows how many copies of that plate are required for the full model. Circles indicate random variables, and are shaded if the random variable is observed. Variables without circles are hyperparameters of the model.
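The generative process that such a plate diagram encodes can be sketched for a finite, one-dimensional mixture. This is a minimal illustration and not the paper's specification: the function name, the symmetric Dirichlet(alpha/K) weights, the normal prior on the means, the fixed within-cluster standard deviation, and all default values are assumptions chosen for brevity.

```python
import random


def sample_finite_gmm(n_points, n_components, alpha=1.0,
                      prior_mean=0.0, prior_sd=5.0, cluster_sd=1.0, seed=0):
    """Draw one dataset from a finite 1-D Gaussian mixture (illustrative priors)."""
    rng = random.Random(seed)

    # Mixture weights ~ Dirichlet(alpha/K, ..., alpha/K), drawn as normalised gammas.
    gammas = [rng.gammavariate(alpha / n_components, 1.0)
              for _ in range(n_components)]
    total = sum(gammas)
    weights = [g / total for g in gammas]

    # One mean per component (the plate repeated K times).
    means = [rng.gauss(prior_mean, prior_sd) for _ in range(n_components)]

    # One latent label and one observation per data point (the plate repeated n times).
    labels = rng.choices(range(n_components), weights=weights, k=n_points)
    data = [rng.gauss(means[z], cluster_sd) for z in labels]
    return weights, means, labels, data


weights, means, labels, data = sample_finite_gmm(n_points=50, n_components=3)
```

With symmetric Dirichlet(alpha/K) weights, letting the number of components grow without bound gives the Dirichlet process mixture as the limiting case, which is why only the occupied components need to be tracked in practice.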


Fig 2.

Accuracy on simulated datasets.

Methods are split according to whether they assume perfectly observed data or allow for uncertainty. mclust and kmeans do not allow for uncertain data by default, and appear in the second column when used as the clustering engine for representative clustering. The left two columns show results where K is inferred. For methods that allow a known value of K to be supplied, results with known K are shown for comparison in the right-hand two columns. The top row of plots has the lowest noise N around the cluster mean, and the bottom row has the highest. Uncertainty, corresponding to the difference between the latent data and the observed data, increases along the x-axis. Clustering accuracy is measured by the Adjusted Rand Index (ARI) between the true clustering and the inferred clustering, averaged over 100 simulations. Higher values of ARI are better.
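For readers unfamiliar with the ARI, the following is a minimal sketch of the standard pair-counting formula. The function name is hypothetical and this is not the code used to produce the figure.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the labelings trivially agree

    # Pairs of items placed together in both, in the true, and in the inferred clustering.
    together_both = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    together_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    together_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())

    expected = together_true * together_pred / total_pairs  # agreement expected by chance
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (together_both - expected) / (maximum - expected)


# The score ignores how clusters are named: swapping labels leaves it at 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

An ARI of 1 means the two clusterings are identical up to relabelling, values near 0 mean agreement no better than chance, and negative values mean agreement worse than chance.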


Fig 3.

Clusters inferred by DPMUnc, DPMZeroUnc and mclust alongside the polygenic GWAS scores and the associated uncertainty.

The clusters of DPMZeroUnc and mclust have been coloured to best match the colours of DPMUnc. The dendrogram on the left is formed by applying complete-linkage hierarchical clustering to the posterior similarity matrix from DPMUnc.
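The dendrogram construction can be sketched as follows, assuming the posterior similarity matrix (PSM) is estimated as the fraction of sampled clusterings in which two items share a cluster, and that 1 - PSM is used as the distance. Both function names are hypothetical, the distance conversion is an assumption, and this is not DPMUnc's own code.

```python
def posterior_similarity_matrix(label_samples):
    """Fraction of sampled clusterings in which items i and j share a cluster."""
    n_items = len(label_samples[0])
    counts = [[0] * n_items for _ in range(n_items)]
    for labels in label_samples:
        for i in range(n_items):
            for j in range(n_items):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    n_samples = len(label_samples)
    return [[c / n_samples for c in row] for row in counts]


def complete_linkage_merges(psm):
    """Agglomerative clustering on 1 - PSM; returns (members_a, members_b, height) per merge."""
    clusters = [[i] for i in range(len(psm))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(1.0 - psm[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Four sampled clusterings of four items (made-up labels for illustration).
samples = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 2, 2]]
for left, right, height in complete_linkage_merges(posterior_similarity_matrix(samples)):
    print(left, right, height)
```

The merge heights are what a dendrogram plots on its distance axis: items that co-cluster in most posterior samples join near the bottom, and groups that rarely share a cluster join near the top.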


Table 1.

Sample counts in each gene expression dataset.

Counts for a subtype of a disease are in italics.


Fig 4.

Clustering of samples in 3 gene expression datasets (rows) according to 3 gene signatures (columns).

(a-c) Ferreira, (d-f) Chaussabel, (g-i) Lyons. In each panel, the left plot shows the observed data, and the right plot shows the fraction of individuals assigned to each cluster. The p value shown relates to the null hypothesis that cluster membership is independent of disease.
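The caption does not say which test produced the p value. As one possible way to test the null hypothesis that cluster membership is independent of disease, the sketch below runs a permutation test on the chi-squared statistic. The function names, the number of permutations, and the example labels are all assumptions for illustration.

```python
import random
from collections import Counter


def chi_squared_statistic(clusters, diseases):
    """Pearson chi-squared statistic for the cluster-by-disease contingency table."""
    n = len(clusters)
    observed = Counter(zip(clusters, diseases))
    cluster_totals = Counter(clusters)
    disease_totals = Counter(diseases)
    stat = 0.0
    for c, c_total in cluster_totals.items():
        for d, d_total in disease_totals.items():
            expected = c_total * d_total / n  # cell count expected under independence
            stat += (observed[(c, d)] - expected) ** 2 / expected
    return stat


def permutation_p_value(clusters, diseases, n_permutations=2000, seed=0):
    """Share of label shuffles whose statistic is at least as large as the observed one."""
    rng = random.Random(seed)
    observed_stat = chi_squared_statistic(clusters, diseases)
    shuffled = list(diseases)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # shuffling breaks any link between cluster and disease
        if chi_squared_statistic(clusters, shuffled) >= observed_stat - 1e-12:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up example: cluster membership lines up exactly with disease status.
clusters = [0] * 10 + [1] * 10
diseases = ["case"] * 10 + ["control"] * 10
p_value = permutation_p_value(clusters, diseases)
```

A small p value indicates that the observed association between clusters and disease would rarely arise if the two were independent.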


Fig 5.

Clustering of samples in 2 gene expression datasets (rows) using all 3 gene signatures.

In each panel, the left plot shows the observed data, and the right plot shows the fraction of individuals assigned to each cluster. The p value shown relates to the null hypothesis that cluster membership is independent of disease.
