Performance Evaluation of Missing-Value Imputation Clustering Based on a Multivariate Gaussian Mixture Model

Background It is challenging to deal with mixture models when missing values occur in clustering datasets. Methods and Results We propose a dynamic clustering algorithm based on a multivariate Gaussian mixture model that efficiently imputes missing values to generate a “pseudo-complete” dataset. Parameters from different clusters and missing values are estimated according to the maximum likelihood implemented with an expectation-maximization algorithm, and multivariate individuals are clustered with Bayesian posterior probability. A simulation showed that our proposed method has a fast convergence speed and it accurately estimates missing values. Our proposed algorithm was further validated with Fisher’s Iris dataset, the Yeast Cell-cycle Gene-expression dataset, and the CIFAR-10 images dataset. The results indicate that our algorithm offers highly accurate clustering, comparable to that using a complete dataset without missing values. Furthermore, our algorithm resulted in a lower misjudgment rate than both clustering algorithms with missing data deleted and with missing-value imputation by mean replacement. Conclusion We demonstrate that our missing-value imputation clustering algorithm is feasible and superior to both of these other clustering algorithms in certain situations.


Introduction
Clustering analysis, as a multivariate statistical method, refers to the process of classifying a set of observations into subsets, called clusters, such that observations in the same cluster are similar in certain respects [1][2][3]. Clustering is widely used in medical sciences, for instance when clustering diseases or gene-expression profiles. Clustering methods usually fall into two categories: hierarchical clustering methods [4], which are used for clustering datasets with small size [5,6]; and dynamic clustering methods, such as K-means [7,8] and self-organizing maps [9], which begin with an initial partitioning of the individuals and iteratively move individuals from one cluster to another until the criterion of convergence is met. With dynamic clustering, the number of clusters must be specified in advance [10]. Dynamic algorithms are mostly heuristically motivated, and they do not require an underlying statistical model. Nevertheless, selection of the "correct" number of clusters and of the best clustering method remains a topic for discussion. Model-based clustering methods [11][12][13] are a type of dynamic clustering based on the hypothesis that the whole dataset is a finite mixture of the same type of distribution with different sets of parameters, such as a finite mixture of a multivariate Gaussian distribution. Compared with "heuristic" algorithms, one obvious advantage to model-based clustering is that objective statistical criteria-such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) [14]-are used to determine the number of clusters.
Missing data present a problem for medical research, given that the data are often supplied retrospectively and from various sources [15]. In particular, missing data are a frequent occurrence in microarray experiments and in medical research. Missing values are especially common in largescale studies involving dozens of variables and hundreds of individuals. Indeed, many clustering methods require a full set of data. Individuals with missing values are either rejected or been estimated prior to the analysis. Consequently, several missing-value imputation methods have been developed [16][17][18][19][20], such as mean substitution, regression imputation, fuzzy c-means (FCM) clustering of incomplete data [21], and Gaussian mixture model-based missing-value imputation classification [22]. In this study, we propose a dynamic method for a model-based missing-value imputation clustering algorithm. With our proposed method, missing values are estimated iteratively until arithmetic convergence is reached during the process of clustering individuals. We used 12 simulated datasets. Each of these datasets had two versions: a complete version and a version where 10% of the individuals had at least one variable missing. We used these datasets to evaluate the clustering accuracy and the accuracy and precision of cluster-parameter estimators. We compared our proposed algorithm based on the missing-value imputation according to the maximum likelihood estimation (a "pseudo-complete" dataset) with a model-based clustering algorithm using the complete dataset and another two model-based clustering algorithms, one using a dataset with missing values deleted and the other imputing the missing values by mean replacement. In addition, we compared our algorithm with the FCM clustering method using real datasets-viz., Fisher's Iris dataset, the Yeast Cell-cycle Gene-expression dataset, and the CIFAR-10 images dataset.

Multivariate Gaussian mixture model
The dataset is arranged as an n × k matrix denoted by Y, where n is the number of individuals and k is the number of variables. Let y ij be the observed value of the i-th individual in the j-th variable, for i = 1,2,. . ., n and j = 1,2,. . ., k. Let Y i = (y i1 ,y i2 ,. . .,y ik ) T be the i-th column in matrix Y T -i.e., a k × 1 vector of the data for individual i under all variables. The value of y ij across all the k variables represents the expression level of the i-th individual. Under a finite multivariate Gaussian mixture model, each Y i is assumed to follow a k-dimensional Gaussian mixture distribution. Mathematically, the mixture distribution for c clusters is as follows: where f l ðY i Þ ¼ ð2pÞ Àk=2 jS l j À1=2 exp Àð1=2ÞðY i À m l ÞS À1 l ðY i À m l Þ T h i is the probability density function for the l-th k-dimensional Gaussian distribution with mean vector μ l = (μ l1 , μ l2 ,. . ., μ lk ) and variance-covariance matrix S l ¼ 2   6  6  6  6  6  6  4   3   7  7  7  7  7  7  5 , for l = 1,2,. . .,c. Moreover, p l with X c l¼1 p l ¼ 1, is the mixing proportion of cluster l. The mixing proportion p l is defined as the proportion of individuals that belong to the l-th cluster. The joint log-likelihood function for n independent individual vectors is defined as follows: Clustering algorithm for missing values Most reports [23][24][25] consider only complete datasets (without missing values) for Gaussian mixture clustering. Indeed, it is challenging to estimate missing values accurately. An urgent problem concerns how to provide the optimal number of clusters c and how to cluster n individuals into the c clusters precisely. Therefore, we propose a missing-value imputation algorithm as follows.
Suppose that there are r missing values in Y i . Thus, y i1 , y i2 ,. . ., y ir are missing. The Y i , μ l , and S l are then divided into : Suppose further that Y i is from the l-th k-dimensional Gaussian distribution and that Y i(2) is known. The conditional expectation of Y i (1) , which belongs to the l-th cluster, is derived as follows: We provide the initial value for μ l(1) , S l11 , and S l12 . Then, based on the criterion of the minimum mean-square deviations from Eq (3), the conditional expectation E l (Y i(1) |Y i(2) ) can be calculated, representing the best predicted function for Y i(1) [26]. Under the mixture distribution of the individual vector Y i , the conditional expectation for where E(Y i(1) |Y i(2) ) denotes the estimators for the missing values Y i(1) . Subsequently, a "pseudo-complete" dataset can be constructed.

Iterative clustering algorithm
The proposed model-based clustering algorithm for datasets with missing values assigns each individual to one of the c clusters with a certain probability. We define this probability as p li , which is that of the i-th individual belonging to the l-th cluster. An individual is assigned to the lth cluster if p li is greater than a certain pre-determined threshold. The probability can be calculated using the expectation-maximization (EM) algorithm [27] to achieve the maximum likelihood for the objective function derived with Eq (2). The EM algorithm begins with some initial parameters that are set in advance. Then, each parameter is iteratively updated in the algorithm until convergence to a minimizer is reached. An EM iteration includes the following steps: 1. Initialize the prior probabilities of the cluster assignment and the cluster parameters: The number of clusters c can also be treated as an unknown parameter and inferred by the BIC or AIC tests. The BIC is derived as follows: where q = c(k+1)(k+2)/2−1 is the number of independent parameters to be estimated in the model, Lð b yÞ is the likelihood of b y-i.e., the vector for the maximum likelihood estimation of the parameters-and n is the size of the dataset. The number of clusters c is determined by the maximum BIC value.

Simulation analysis
A simulation was designed to evaluate the feasibility and accuracy of the proposed missingvalues imputation algorithm. The individuals belonging to each cluster were known exactly and without subjective errors. Indeed, when analyzing real data, the misjudgment rate (MR) may reflect confounding errors in the experiment and the subjective errors. Without loss of generality, we simulated datasets with a two-dimensional Gaussian distribution and some missing values. All simulations were performed using statistical SAS (version 9.3; SAS/IML) software.
Design of the simulation. Suppose that there are 500 individuals derived from one of two (two-dimensional) Gaussian populations, denoted by MVN (μ 1 ,S 1 ) and MVN (μ 2 ,S 2 ), from which 10% of the individuals have at least one variable randomly removed. The entire dataset contains 1000 individuals. Each individual has two variables, and 100 individuals have at least one variable missing. The cluster's mean vectors are denoted by μ 1 and μ 2 , which were simulated at three different levels: A 1 : μ 1 = (0, 0), μ 2 = (2.5, 2.5), A 2 : μ 1 = (0, 0), μ 2 = (2.0, 2.0), and A 3 : μ 1 = (0, 0), μ 2 = (1.5, 1.5). The variance-covariance matrices are denoted by S 1 and S 2 (without loss of generality, supposing that S 1 = S 2 = S), and these were set at four levels: . The total number of treatment combinations (datasets) is therefore Twenty replicated simulations were conducted for each of the twelve scenarios. Our missing-value imputation cluster algorithm (denoted by M-3, generating a "pseudocomplete" dataset) was compared with three other cluster algorithms: a cluster algorithm using the complete dataset (denoted by M-1, i.e., the above simulation dataset without any missing values); a cluster algorithm using a dataset from which all individuals with missing variables are deleted (denoted by M-2, i.e., a cropped dataset); and a cluster algorithm using missingvalue imputation by mean replacement in which the missing values are replaced by the mean value of the whole dataset (denoted by M-4, another "pseudo-complete" dataset). Thus, we used these four algorithms to cluster simulated datasets. These algorithms are all based on the multivariate Gaussian mixture model, using the maximum likelihood estimation with the EM algorithm. As such, the clustering results could be fairly compared. We evaluated M-1, M-2, M-3, and M-4 using the following metrics: (1) the average convergence rate; (2) the accuracy and precision of their respective parameter estimates; and (3) the total misjudgment rate (MR), where MR is the ratio of all misjudged individuals to the total number of individuals in the 20 replicates and MR = 1 -accuracy. A chi-square test and multiple comparisons of Scheffé's Confidence Interval were used to test the differences in MR among the four algorithms.
Simulation results. The average parameter estimates include the mean vectors (± se), variance-covariance matrices (± se), the likelihood value of the maximum-likelihood function, and the total MR for the four clustering algorithms. The results are presented in Tables 1-3. Our proposed missing-value imputation clustering algorithm (M-3) outperformed the other two algorithms (M-2 and M-4) in terms of the convergence rate for most of the simulation datasets. Indeed, its superiority was most obvious when the mean vectors of two clusters were near each   Performance Evaluation of Missing-Value Imputation Clustering other and the variance-covariance of each cluster was more disperse, such as with datasets A 2 B 4 , A 3 B 3 , and A 3 B 4 . Moreover, the proposed algorithm resulted in the most accurate and precise parameter estimations, closely approximating the true values. Thus, our proposed missing-value imputation clustering algorithm correctly clusters individuals with missing variables. The total MR resulting from the proposed algorithm (M-3) demonstrates that its performance is superior to that of M-4 in all simulation trials, with significant statistical difference (p < 0.05), and superior to that of M-2 in simulation trials A 2 B 4 , A 3 B 3 , and A 3 B 4 , with significant statistical difference (p < 0.05). Moreover, the clustering accuracy of our  proposed algorithm was comparable to that of the complete dataset (M-1) (p > 0.05). In general, the more similar the mean vectors of two clusters or the more disperse their variance-covariance, the higher the MR of all clustering methods and the better the performance of our proposed algorithm compared to M-2 and M-4. Interestingly, however, our algorithm performed well in such cases. For example, in scenario A 3 B 4 , the speed of convergence with M-2 and M-4 was lower in both cases than for other scenarios and had a total MR of 48.32% and 56.00%, respectively, whereas the total MR with our proposed algorithm M-3 was 21.30%. In fact, this result was slightly better than M-1, which resulted in an MR of 28.00%.

Real data analysis
Iris dataset. Fisher's Iris dataset (Fisher, 1936) [12] is perhaps the best-known database for comparing clustering algorithms. The dataset contains three clusters with 50 individuals each, where each cluster is a species of the genus Iris: namely, Iris setosa, Iris versicolor, and Iris virginica. These three species are thus segregated. The dataset contains four variables: sepal length, sepal width, petal length, and petal width. Fisher's Iris dataset is available on the Internet (http://archive.ics.uci.edu/ml/datasets/Iris). We randomly removed one of the four variables in 10% of the individual observations of Iris data; the resulting dataset constituted a dataset with missing values. Table 4 lists the clustering results from five clustering algorithms: the above four clustering algorithms [the clustering algorithm with a complete dataset (M-1), the clustering algorithm in which individuals with missing variables are deleted (M-2), our proposed missing-value imputation algorithm (M-3), and the clustering algorithm for missingvalue imputation by mean replacement (M-4)] and the FCM clustering method applied to the dataset with data missing from 10% of the individuals. In the M-4 and FCM methods, the missing values are replaced by the mean of the whole dataset. The superiority of our proposed method is obvious insofar as M-3 resulted in the lowest MR of the five algorithms. Because of Gaussian mixture distribution of the dataset, our algorithm makes full use of the known data based on the Gaussian mixture model to estimate missing values, the accuracy of our method was higher than those of the FCM method and the M-2 and M-4 algorithms, and it closely approximated the accuracy obtained using the complete-dataset clustering algorithm (M-1). Moreover, the model-based algorithms (M-1, M-2, M-3, and M-4) resulted in the highest BIC values in the three clusters, validating the effectiveness of using the BIC to determine the number of clusters.
Yeast Cell-cycle dataset. The Yeast Cell-cycle dataset [28] shows the fluctuation in the expression levels of the whole genome (6220 genes) over 17 time points taken at 10-minute intervals, covering nearly two cell cycles. The entire raw dataset is available at http://genomewww.stanford.edu/cellcycle/. Among the effectively expressed genes, 416 genes were found to have consistent periodic changes in expression level. Yeung et al. picked out 384 genes over 15 time points that reached their peaks at five (C = 5) cell phases on the basis of the size of the buds, after which five non-intersecting clusters were classified [29]. These 384 gene expressions, a subset with a moderate sample size, are available at http://www.cs.washington.edu/ homes/kayee/model. All 384 genes were assigned to one of the five clusters by the original researchers [28,29]. Therefore, we can use this dataset to test the performance of the clustering algorithms introduced here and compare them with the performance of existing methods. We randomly removed one of the 15 variable values in 10% of the genes to create a dataset with missing values. Table 5 lists the clustering results from five clustering algorithms: the above four model-based clustering algorithms (M-1, M-2, M-3, and M-4) and the FCM clustering  method. Further validating the results from Fisher's Iris dataset, we found that our missingvalue imputation algorithm had the lowest MR. In addition, because our missing-value imputation based on the Gaussian mixture distribution is compatible with the Yeast Cell-cycle dataset, the accuracy of our method was significantly higher than that of the FCM method, and it closely approximated the accuracy from using the complete-dataset clustering algorithm (M-1). CIFAR-10 images dataset. The CIFAR-10 dataset [30] contains 60,000 32 × 32 color images (3072 variables) divided into ten classes, which represent different kinds of pictures. The classes are designed to be completely mutually exclusive. For example, neither automobile pictures nor truck pictures contain images of pickup trucks. Having a large sample size, each class contains 5000 training images and 1000 test images. The entire dataset was used to test the performance of test images according to training images in supervised classification, and it is available at http://www.cs.toronto.edu/~kriz/cifar.html. We used the training images to test the performance of our unsupervised clustering algorithm and compared it to existing clustering algorithms. We randomly removed 10 of the 3072 variable values in 10% of the color images to create a dataset with missing values. Table 6 lists the clustering results from five clustering algorithms: the above four modelbased clustering algorithms (M-1, M-2, M-3, and M-4) and the FCM clustering method. We found that our missing-value imputation algorithm had a lower total MR than algorithms M-2 and M-4. The accuracy of our method closely approximated the accuracy obtained using the clustering algorithm with complete data (M-1) but was lower than that of the FCM method. (It should be noted that most clusters of the CIFAR-10 data have deviants from the normal distribution.) Thus, the performance of our model-based algorithm does not offer clustering that is more accurate than the FCM method. Moreover, the speed of computations was much lower in algorithms M-1, M-2, M-3, and M-4.

Discussion
Missing values are unavoidable in experimental data and data from field investigations [31]. Many imputation methods [16][17][18][19][20] have been proposed, such as the mean-substitution method, the regression imputation method, the EM estimation method, and the FCM method.
Clustering is a widely used statistical method, especially in gene expression profiles. Most classical clustering methods, such as K-means clustering and hierarchical clustering, are unable to deal with missing values. Therefore, we developed a model-based clustering algorithm for imputing missing values. The proposed method utilizes information from existing data, and it is based on the Gaussian mixture model. With our imputation algorithm, missing values are estimated by calculating the conditional expectation, from which a "pseudo-complete" dataset can be generated and used for clustering. The performance of our proposed algorithm was tested and compared with other algorithms using both real and simulated data. We found that our proposed method outperformed other algorithms. In particular, our proposed method produced clustering results comparable to an algorithm that uses a complete dataset, and it outperformed an algorithm in which samples with missing data are removed and another algorithm in which missing values are imputed by mean replacement. The results are consistent with Hunt and Jorgensen's reports [32]. Even with datasets in which the boundaries of the clusters overlap, such as datasets A 3 B 3 , A 2 B 4 , and A 3 B 4 , our algorithm performed well.
Furthermore, we found that the proposed algorithm outperformed the FCM clustering method on Fisher's Iris dataset and on the Yeast Cell-cycle dataset. The FCM clustering method, a heuristic approach, has been verified as a successful missing-value clustering method [33]. FCM clusters data based on the characteristics of the samples. The criteria for FCM clustering are inconsistent, however, because the characteristics of dynamic algorithms lead to uncertainties about the samples, degrading FCM's ability to precisely cluster data [34][35][36]. Our study shows that the proposed method is more likely to result in a low MR when the dataset is exactly a multivariable Gaussian mixture distribution. That is, it is more accurate than other methods with both simulated data and real data, e.g., Fisher's Iris dataset and the Yeast Cellcycle dataset. One possible reason for this is that our method estimates missing values with a conditional expectation based on the Gaussian mixture distribution and generates a "pseudocomplete" dataset. In addition, our method combines dynamic clustering with discriminant analysis and uses the EM algorithm to achieve a maximum-likelihood function. It can thus better utilize the information from known data to discover the intrinsic and implicit properties of a dataset. Usually, model-based methods are better than heuristic approaches, even though the former are always more time consuming. Model-based methods depend on a probability model, which is usually proposed based on the experience and feasibility of the model. The convergence and accuracy of the EM algorithm have already been established [25]. The Gaussian mixture model is both convenient and robust. Moreover, a model-based method provides substantial information-e.g., the posterior probabilities of individuals and the parameters of each cluster-and this information can be exploited to more accurately impute missing values with iterative updates to the algorithm until convergence to a minimizer is reach.
Our method uses the Bayesian posterior probability to cluster individuals; individuals are assigned to the cluster having the highest posterior probability. It is more reasonable to assign individuals to a cluster having a high posterior probability, such as 0.8 or 0.9, and this improves the clustering accuracy. Whereas traditional dynamic clustering methods cluster individuals directly to a certain category, this process is difficult when the boundaries of the clusters overlap. Another advantage to a model-based approach is that statistical criteria, such as AIC and BIC [14], can be used to discover the number of clusters that best suit the data. We confirmed the effectiveness of such criteria with results indicating three clusters for Fisher's Iris dataset (a dataset with exactly three species of Iris) and five clusters for the Yeast Cell-cycle Gene-expression dataset. However, the reversible-jump Markov-chain Monte Carlo (MCMC) method was previously developed to infer the number of distributions in a mixture [37]. MCMC estimates the optimal number of clusters from a Bayesian perspective. Our future research shall compare MCMC with the BIC criteria used in this study. Finally, a prerequisite for raw data collected by experimenters is that they be normalized. Indeed, our method is more sensitive to data that have not been normalized. Therefore, either a log or some other form of data transformation should be applied in order to distribute the data in a normal fashion. Furthermore, because our proposal relies on the information from known individuals when imputing missing values, a large sample size is desirable for more accuracy when estimating missing values.

Conclusions
In this paper, we have presented a model-based missing-data imputation algorithm for clustering numerical data with missing values. We achieved a high convergence rate with our algorithm when it was tested on simulated and real datasets. In addition, the statistical power of our algorithm is high, and it can accurately estimate parameters and correctly cluster individuals with missing variables. Moreover, the results of our evaluation show a low MR, even when the mean vectors of two clusters are close to each other or when their variance-covariance is disperse. The algorithm was applied successfully to datasets (both simulated and real) having missing data. Its performance is superior to that of an algorithm in which samples with missing data were removed and another algorithm in which missing values were imputed by mean replacement. A comparative evaluation demonstrated that our proposed algorithm outperforms the FCM method when the data from each cluster fit a multivariate Gaussian distribution.