Adaptive kernel fuzzy clustering for missing data

Many machine learning procedures, including cluster analysis, are often affected by missing values. This work proposes and evaluates a Kernel Fuzzy C-Means clustering algorithm based on the kernelization of the metric with local adaptive distances (VKFCM-K-LP) under three strategies for dealing with missing data. The first strategy, called Whole Data Strategy (WDS), performs clustering only on the complete part of the dataset, i.e. it discards all instances with missing values. The second approach uses the Partial Distance Strategy (PDS), in which partial distances are computed over all available (observed) feature values and then re-scaled by the reciprocal of the proportion of observed values. The third technique, called Optimal Completion Strategy (OCS), computes missing values iteratively as auxiliary variables in the optimization of a suitable objective function. The clustering results were evaluated according to different metrics. The best performance of the clustering algorithm was achieved under the PDS and OCS strategies. Under the OCS approach, new datasets were derived and the missing values were estimated dynamically during the optimization process. Clustering under the OCS strategy also outperformed the clusters obtained by applying the VKFCM-K-LP algorithm to versions of the datasets in which missing values were previously imputed by the mean or the median of the observed values.


Introduction
The incessant increase in volume and variety of data requires advances in methodologies in order to understand, process and summarize data automatically. Cluster analysis is one of the main unsupervised techniques that are used to extract knowledge from data, due to its ability to aid in the process of understanding and visualizing data structures [1,2].
The main goal in clustering is to organize the data (observations, data items, images, pixels etc.) based on similarity (or dissimilarity) criteria such that observations belonging to the same group show high degrees of similarity, while observations in different groups show high degrees of dissimilarity [3,4]. Clustering methods are widely used in many areas of knowledge, such as taxonomy, data mining, image segmentation, pattern recognition, information retrieval, computer vision, and so forth [3,5]. Depending on the application considered, the groups obtained in clustering may present different characteristics. Thus, different clustering techniques have been proposed in the literature, with the most popular ones being based on hierarchies and partitions. In hierarchical clustering algorithms, structures are found such that they can be recursively divided into levels. The output is a nested sequence of partitions of the input data known as a dendrogram [6].
In partitioning clustering methods, a single partition of the dataset is obtained, generally based on the optimization of a suitable objective function [5]. These methods are more flexible than the hierarchical ones because they allow observations to change groups at each step of the algorithm, if that change leads to a better solution in terms of the variability of the resulting partition. Partitioning clustering methods can be divided into two main branches: hard (or crisp) and fuzzy (or soft). In hard clustering methods, the groups are naturally disjoint, that is, the dataset is partitioned into a predefined number of groups and overlapping is not allowed, which means that each instance may belong exactly to one cluster.
In real world applications, group boundaries are often difficult to define, as it is complex to find reasonable criteria that include some data objects in a cluster but exclude others. To address this problem, methods that allow more flexible criteria, such as fuzzy clustering algorithms, were proposed in the literature. In fuzzy clustering, an instance may belong simultaneously to all clusters with a certain membership degree [7,8]. Fuzzy clustering methods offer good capability to handle noisy/missing data, which is a common problem in different areas, including microarray data analysis [3,4,9-11].
The most important component of any clustering algorithm is the dissimilarity (or similarity) measure. Distances are important examples of dissimilarity measures, and the Euclidean distance is the most commonly used in the clustering literature. The Fuzzy C-Means (FCM) method [12] is one of the most popular clustering algorithms and is based on the Euclidean distance. Algorithms based on this distance achieve good results when applied to datasets in which groups are approximately hyperspherical and approximately linearly separable [13]. In the opposite situation (i.e. clusters with non-hyperspherical shapes and/or linearly non-separable patterns), these algorithms may perform poorly and find unrepresentative clusters.
The use of kernel functions allows an arbitrary nonlinear mapping ϕ from the original p-dimensional space of the dataset X ⊆ R^p to a higher-dimensional (possibly infinite-dimensional) space, called a feature space F. The purpose of this transformation is that, by moving to higher dimensions, it may be possible to obtain better defined and linearly separable groups [28]. The advantage and, at the same time, the main idea of methods based on kernel functions is that inner products in the feature space can be expressed as a Mercer kernel [14,29]. Two main approaches have guided the development of kernel-based algorithms: kernelization of the metric, in which the cluster prototypes are obtained in the original space and the distances between instances and cluster prototypes are computed by means of kernels; and clustering in feature space, in which cluster prototypes are obtained in the feature space [17].
Research studies have shown that clustering methods based on kernel functions perform better than traditional methods, as they are able to produce nonlinear differentiable hypersurfaces of separation between groups [5,17]. However, in most domains, especially if we are dealing with high-dimensional datasets, some variables may be irrelevant for the construction of the groups, and some among the relevant may be less important than others in relation to a specific group. Ferreira et al. [13] proposed a family of methods based on kernel functions with automatic weighting of variables. These methods were derived based on kernelized adaptive distances that change at each algorithm iteration and can be different for each group or common to all groups. In this context, the Kernel Fuzzy C-Means clustering under the kernelization of the metric approach with local adaptive distances was considered, assuming the constraint that the product of the weights of the variables on each cluster must be equal to one. In this work, we labeled this algorithm as VKFCM-K-LP.
Ferreira et al. [13] focused on developing methods that are able to better describe the group structures in data; however, they did not investigate the performance of the algorithms in the context of missing data. In real world applications, many inferential procedures have to deal with the problem of missing data. There are several reasons for this problem, including imperfect manual data entry procedures, incorrect measurements and equipment errors, among others [30].
In many areas, such as Industry and Medicine, it is common to find datasets with 50% or more missing values [31,32]. Extensive research has been devoted to the problem of missing data, because many statistical methods were originally developed for datasets with no missing values, and even a small amount of missing data can cause serious problems in analysis and decision making. This is enough to motivate the need to develop efficient mechanisms to deal with incomplete data [33].
The development of statistical methods to deal with incomplete data has been the subject of research for decades [34][35][36]. Green et al. [37] assessed two alternatives for dealing with missing values: Imputation, in which the missing values are estimated through the values observed in the dataset, of which the most popular techniques are Average Imputation or Median Imputation; and Exclusion, where observations that contain missing values are excluded from the dataset. Although simple, these alternatives can produce biased estimates through the reduction of the size of the dataset and by replacing these missing values with estimates [35]. A more effective approach can be to adapt traditional data analysis to deal with incomplete data.
Several approaches have been introduced in an attempt to extend clustering techniques to the presence of missing values. One of the first attempts was an approach based on probabilistic assumptions to handle missing data in pattern recognition [38]. The Expectation-Maximization (EM) algorithm was used to deal with incomplete data in clustering [39]. Several methods have been proposed to adapt the FCM method to deal with missing data [40]. Wagstaff [41] proposed the K-means method with Soft Constraints (KSC), and Poddar et al. [42] examined clustering of data with missing entries using non-convex fusion penalties.
Hathaway [43] proposed strategies to deal with missing values in cluster analysis using the FCM method. Li et al. [44,45] proposed the FCM clustering method based on nearest-neighbor observations and extended the FCM method by adding a variable weighting process to handle incomplete data, in which the weight of each attribute is seen as an additional variable to be optimized simultaneously in clustering. Recently, Li et al. [46] introduced a kernel method to cluster datasets with missing values in the scope of imputation of observations.
In this work, we adapted the VKFCM-K-LP clustering method [13,43] to deal with missing data. The first strategy, called Whole Data Strategy (WDS), performs clustering only on the complete part of the dataset, which means that instances containing any missing value are excluded from the analysis. The WDS can be applied as long as missing values do not exceed 25% of all values in the dataset. The second approach uses the Partial Distance Strategy (PDS), in which partial distances are computed over all available (observed) feature values and then re-scaled by the reciprocal of the proportion of observed values. The third technique, called Optimal Completion Strategy (OCS), computes missing values iteratively as auxiliary variables in the optimization of a suitable objective function.
In the evaluation of the VKFCM-K-LP method under the WDS, PDS and OCS approaches, we considered artificially generated datasets with 5%, 10%, 15% and 20% of missing values. The results of the analyses were quantified according to the following quality measures: the Corrected Rand index (CR), the F-measure (FM), the Overall Error Rate of Classification (OERC) and, for the OCS, a measure of consistency of the variables [47-50]. In addition, the clustering results under OCS were compared with the results of clustering after imputation of the missing values by the mean and the median.
The rest of the paper is structured as follows. In Section 2 the basic theory about kernels is briefly presented. Section 3 describes the conventional kernel fuzzy C-means (KFCM) algorithm under the kernelization of the metric approach. Section 4 presents the kernel-based fuzzy clustering with variable weighting via local adaptive distances under the kernelization of the metric approach (VKFCM-K-LP). Section 5 introduces the main approach to analyze missing data. New VKFCM-K-LP algorithms under the WDS, PDS and OCS schemes are proposed in Section 6. Section 7 proposes the experimental design. Section 8 contains the results of several numerical evaluations. Finally, Section 9 offers some concluding remarks.

Theoretical background
This section describes the basic theory of kernels. The main idea behind kernel-based methods is the use of an arbitrary nonlinear mapping ϕ from the original space of the input data to a space of higher (possibly infinite) dimension, called the feature space F. Let X = {x_1, x_2, . . ., x_n} be a non-empty set with x_i ∈ R^p for all i. A function K : X × X → R is a Mercer kernel if and only if K is symmetric, i.e. K(x_k, x_i) = K(x_i, x_k), and the following inequality holds [29]:

\sum_{r=1}^{n} \sum_{s=1}^{n} c_r c_s K(x_r, x_s) \ge 0,   (1)

where c_r ∈ R for all r = 1, . . ., n. Every Mercer kernel can be expressed as

K(x_i, x_k) = \langle \phi(x_i), \phi(x_k) \rangle,   (2)

in which ϕ : X → F performs a nonlinear mapping from the original space of X to the high-dimensional feature space F. One of the most relevant aspects in the application of kernel-based methods is the possibility of calculating Euclidean distances in F without having to explicitly specify the nonlinear mapping ϕ [51,52].
This can be done using the so-called distance kernel trick [52,53]:

\|\phi(x_i) - \phi(x_k)\|^2 = K(x_i, x_i) - 2K(x_i, x_k) + K(x_k, x_k),   (3)

where the calculation of distances in the feature space is a function only of the input vectors. Kernel functions [54] typically used are:

• Polynomial of degree d: K(x_i, x_k) = (\gamma \, x_i^{T} x_k + \theta)^d;
• Gaussian: K(x_i, x_k) = \exp(-\|x_i - x_k\|^2 / (2\sigma^2)),

where γ, θ, σ and d are kernel parameters. In the literature, kernel-based clustering methods can be divided into two main categories: kernelization of the metric [16,55] and clustering in feature space [56]. In this work, we consider only the kernelization of the metric approach. Under this approach, clustering methods seek prototypes in the original space of the input data, and the distance between a data point x_i and the prototype v_k of the k-th group is obtained by means of kernel functions:

\|\phi(x_i) - \phi(v_k)\|^2 = K(x_i, x_i) - 2K(x_i, v_k) + K(v_k, v_k).   (4)
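As a concrete illustration of Eq (3), the following Python sketch computes squared feature-space distances through kernel evaluations only (a minimal sketch; the function names are ours, not from the original implementation):

```python
import math

def gaussian_kernel(x, y, sigma2=1.0):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 * sigma2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma2))

def kernel_distance_sq(x, y, kernel):
    """Distance kernel trick: ||phi(x) - phi(y)||^2
    = K(x, x) - 2 K(x, y) + K(y, y), with no explicit mapping phi."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
```

For the Gaussian kernel, K(x, x) = 1, so the squared distance reduces to 2(1 − K(x, y)); this identity is used repeatedly in the derivations that follow.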

Kernel fuzzy C-means (KFCM)
Let O = {1, . . ., n} be a set of n observations indexed by i and described by p variables, and let P = {P_1, P_2, . . ., P_K} be a partition of O into K groups. The purpose of the Kernel fuzzy C-Means clustering method under kernelization of the metric is to minimize the objective function

J = \sum_{k=1}^{K} \sum_{i=1}^{n} u_{ki}^{m} \, \|\phi(x_i) - \phi(v_k)\|^2,   (5)

subject to u_{ki} ∈ [0, 1] and \sum_{k=1}^{K} u_{ki} = 1 for each i, where v_k ∈ R^p is the prototype of the k-th cluster, k = 1, . . ., K, u_{ki} is the fuzzy membership degree of observation i in the k-th cluster, i = 1, . . ., n, and m ∈ R_+ (m > 1) is a parameter that controls the fuzziness of the memberships. Here, U = [u_{ki}] ∈ R^{K×n} is the fuzzy partition matrix. Deriving prototypes for the clusters depends on the choice of the kernel function. When considering the Gaussian kernel, the most popular in the literature, we have K(x_i, x_i) = 1 for all i = 1, . . ., n. Thus, the objective function described in Eq (5) can be expressed, as in Graves et al. [57], by

J = 2 \sum_{k=1}^{K} \sum_{i=1}^{n} u_{ki}^{m} \, (1 - K(x_i, v_k)),   (6)

and therefore the equation of the cluster prototypes is defined, for k = 1, . . ., K, as

v_k^{(t+1)} = \frac{\sum_{i=1}^{n} u_{ki}^{m} K(x_i, v_k^{(t)}) \, x_i}{\sum_{i=1}^{n} u_{ki}^{m} K(x_i, v_k^{(t)})}.   (7)

When updating the fuzzy partition matrix U, the prototypes v_k are kept fixed and we need to find the fuzzy membership degrees u_{ki} (k = 1, . . ., K, i = 1, . . ., n). Using Lagrange multipliers for the optimization of the objective function J, subject to the restrictions in Eq (5), we obtain the solution [57]:

u_{ki} = \left[ \sum_{h=1}^{K} \left( \frac{1 - K(x_i, v_k)}{1 - K(x_i, v_h)} \right)^{1/(m-1)} \right]^{-1}.   (8)

Kernel-based fuzzy clustering with automatic variable weighting via local adaptive distances
Kernel-based clustering methods commonly found in the literature, such as the kernel Fuzzy C-Means [58], do not take into account the weights, or relevance, of each variable in the clustering process. However, for the majority of datasets, and especially when dealing with high-dimensional data, some variables may be irrelevant, and, among the relevant variables, some may be more important than others. Moreover, different groups can have different sets of relevant variables.
Motivated by this problem, Ferreira et al. [13] proposed a family of kernel-based fuzzy clustering methods with automatic weighting of variables, in which dissimilarity measures are obtained as sums of kernelized Euclidean distances between patterns and cluster prototypes, computed separately for each variable. The main idea supporting these methods is that the sum of kernel functions applied to each variable is also a kernel function. This reasoning enables the introduction of weights representing the relevance of each variable. The clustering method VKFCM-K-LP takes into account the weights, or relevance, of each variable for the construction of the clusters [13]. This clustering method is based on a kernelized local adaptive distance with the constraint that the product of the weights of the variables on each cluster must be equal to 1. The algorithm considers a separate weight vector for each cluster in order to parameterize its local distances. The closer the observations of a given cluster are to its prototype with respect to a given variable, the greater the importance of that variable to the cluster. The restrictions on the weight vector in the VKFCM-K-LP method are based on hard clustering via adaptive distances and on fuzzy quadratic distances [59,60].

Result 1 (Schölkopf and Smola [53]). If K_1 : X_1 × X_1 → R and K_2 : X_2 × X_2 → R are kernel functions, then their sum, K((x_1, x_2), (x'_1, x'_2)) = K_1(x_1, x'_1) + K_2(x_2, x'_2), is also a kernel function.

Under this result, if an instance is represented by a vector with p variables, we can partition it into up to p parts and consider up to p different kernel functions, one for each part. Formally, we have K(x_i, x_k) = \sum_{j=1}^{p} K_j(x_{ij}, x_{kj}), where the K_j : X_j × X_j → R are kernel functions and X_j is the space of the j-th variable, j = 1, . . ., p. Therefore, a distance based on kernelization of the metric between an instance x_i and the k-th prototype v_k with respect to the j-th variable [51,52] is defined by

\varphi_j^2(x_{ij}, v_{kj}) = K_j(x_{ij}, x_{ij}) - 2K_j(x_{ij}, v_{kj}) + K_j(v_{kj}, v_{kj}),   (9)

in which ϕ_j, j = 1, . . ., p, is the nonlinear mapping of the j-th component of x_i ∈ X, X ⊆ R^p, into the feature space F_j concerning the j-th variable. In Eq (9) it is possible to introduce weights representing the relevance of each variable. Let φ²(x_i, v_k) be a distance measure based on kernelization of the metric between an observation x_i and the prototype v_k of the k-th cluster. The local adaptive distance φ²(x_i, v_k), with the restriction that the product of the weights of the variables in each cluster is equal to 1 [61], is given by

\varphi^2(x_i, v_k) = \sum_{j=1}^{p} \lambda_{kj} \, \varphi_j^2(x_{ij}, v_{kj}),   (10)

where λ_k = (λ_{k1}, . . ., λ_{kp}) is the vector of weights for the k-th cluster. Given Eqs (9) and (10), we can define an objective function J that measures the fit between the clusters and their prototypes:

J = \sum_{k=1}^{K} \sum_{i=1}^{n} u_{ki}^{m} \, \varphi^2(x_i, v_k),   (11)

subject to the constraints given in Eq (5), where u_{ki} is the fuzzy membership degree of observation i in the k-th cluster, k = 1, . . ., K, i = 1, . . ., n, and v_k ∈ R^p is the prototype of the k-th cluster.
When considering the Gaussian kernel, the objective function described in Eq (11) can be rewritten as

J = 2 \sum_{k=1}^{K} \sum_{i=1}^{n} u_{ki}^{m} \sum_{j=1}^{p} \lambda_{kj} \, (1 - K_j(x_{ij}, v_{kj})).   (12)

While deriving the cluster prototypes, the fuzzy membership degrees and the weights of the variables are kept fixed. Therefore, the prototype of the k-th cluster is updated component-wise, for j = 1, . . ., p, as

v_{kj}^{(t+1)} = \frac{\sum_{i=1}^{n} u_{ki}^{m} K_j(x_{ij}, v_{kj}^{(t)}) \, x_{ij}}{\sum_{i=1}^{n} u_{ki}^{m} K_j(x_{ij}, v_{kj}^{(t)})},   (13)

in which t = 1, . . ., T, where T is the maximum number of iterations. The next step is to determine the weights of the variables. To do so, the fuzzy membership degrees u_{ki} and the cluster prototypes v_k are kept fixed. The weight vector λ_k = (λ_{k1}, . . ., λ_{kp}) that minimizes criterion J, under the restrictions λ_{kj} > 0 for all k, j and \prod_{j=1}^{p} \lambda_{kj} = 1 for all k, has components λ_{kj} (j = 1, . . ., p, k = 1, . . ., K) given by

\lambda_{kj} = \frac{\left[ \prod_{l=1}^{p} \sum_{i=1}^{n} u_{ki}^{m} \, (1 - K_l(x_{il}, v_{kl})) \right]^{1/p}}{\sum_{i=1}^{n} u_{ki}^{m} \, (1 - K_j(x_{ij}, v_{kj}))}.   (14)

While updating the fuzzy membership degrees, the prototypes of the clusters v_k and the weights of the variables are kept fixed. Therefore, the fuzzy membership degrees that minimize criterion J, subject to the constraints in Eq (5), are updated according to

u_{ki}^{(t+1)} = \left[ \sum_{h=1}^{K} \left( \frac{\varphi^2(x_i, v_k)}{\varphi^2(x_i, v_h)} \right)^{1/(m-1)} \right]^{-1},   (15)

where φ²(x_i, v_k) is defined in Eq (10). Algorithm 1 shows the steps of the VKFCM-K-LP method. The convergence properties of the method were demonstrated in the work of [13].
Algorithm 1: VKFCM-K-LP clustering method.
1: Randomly initialize the fuzzy membership degrees u_ki; uniformly initialize all weights as 1/p. Do t = 1.
2: Update prototype vector v_k according to Eq (13).
3: Update weight vector λ_k according to Eq (14).
4: Update fuzzy membership degree u_ki using Eq (15).
5: If |J^(t+1) − J^(t)| < ε or t > T, STOP; ELSE do t = t + 1 and go to step 2.
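To make the update cycle of Algorithm 1 concrete, the following Python sketch implements one pass of the three updates of Eqs (13)-(15), assuming a per-variable Gaussian kernel; the function names are ours, and the 1e-12 floor is a numerical guard not present in the original formulation:

```python
import math

def gk1(x, v, sigma2=1.0):
    """Per-variable Gaussian kernel on scalar components."""
    return math.exp(-((x - v) ** 2) / (2.0 * sigma2))

def update_prototypes(X, U, V, m=2.0):
    """Eq (13): component-wise kernel-weighted mean of the observations."""
    n, p = len(X), len(X[0])
    newV = []
    for k, v in enumerate(V):
        row = []
        for j in range(p):
            w = [U[k][i] ** m * gk1(X[i][j], v[j]) for i in range(n)]
            row.append(sum(w[i] * X[i][j] for i in range(n)) / sum(w))
        newV.append(row)
    return newV

def update_weights(X, U, V, m=2.0):
    """Eq (14): geometric mean of the per-variable dispersions divided by
    the dispersion of variable j, so each cluster's weights multiply to 1."""
    n, p = len(X), len(X[0])
    L = []
    for k, v in enumerate(V):
        disp = [max(sum(U[k][i] ** m * (1.0 - gk1(X[i][j], v[j]))
                        for i in range(n)), 1e-12) for j in range(p)]
        geo = math.prod(disp) ** (1.0 / p)
        L.append([geo / d for d in disp])
    return L

def update_memberships(X, V, L, m=2.0):
    """Eq (15): memberships from the weighted kernelized distances phi^2."""
    n, K = len(X), len(V)
    U = [[0.0] * n for _ in range(K)]
    for i in range(n):
        d = [max(sum(L[k][j] * (1.0 - gk1(X[i][j], V[k][j]))
                     for j in range(len(X[0]))), 1e-12) for k in range(K)]
        for k in range(K):
            U[k][i] = 1.0 / sum((d[k] / d[h]) ** (1.0 / (m - 1.0))
                                for h in range(K))
    return U
```

By construction, each column of U sums to 1 and each cluster's weights multiply to 1, mirroring the constraints of the VKFCM-K-LP formulation.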

Incomplete data analysis
Data quality is one of the most important factors affecting the results of statistical analyses. Problems during data collection or pre-processing can generate uncertain, incorrect or even absent values. Data analysis in the presence of missing data is a frequently discussed problem in many areas of science, because most analyses were originally designed for datasets without missing values. Although the causes of missing data are diverse, there are few missing data patterns resulting from the missing values in datasets. The missing data pattern describes which values are observed and which are absent from the dataset [35]. Generally, the most common missing data patterns are the multivariate, monotone, general and file-matching patterns [35]. In the multivariate pattern (Fig 1a), missing values occur in a group of attributes that are either completely observed or completely missing. The monotone pattern (Fig 1b) usually arises in longitudinal studies and has a ladder-like arrangement of values when organized in a data matrix. The file-matching pattern (Fig 1d) occurs when the data are obtained from several different sources and, consequently, the combined dataset has some fully observed attributes and some features that are never jointly observed.
In the general pattern (Fig 1c) the missing values are characterized by an arbitrary form in the dataset and can be observed in practice for example, in the omission of responses in a questionnaire or loss of data in pre-processing.
Although missing data patterns describe what values are missing from the dataset, missing data generation mechanisms provide information about the occurrence of these values. Missing data generation mechanisms refer to the relationship between the missing value and the attribute values of the variables in the dataset. Therefore, whereas a missing data pattern indicates what values in the dataset can be used for statistical analysis, mechanisms provide an indication of how the available values should be treated during data analysis to obtain the best results.
The first works dealing with missing data generation mechanisms were proposed by Rubin [34] and are still used today. These mechanisms are known as Missing Completely at Random (MCAR), Missing at Random (MAR) and Not Missing at Random (NMAR), and they describe the relationship between the analyzed variables and the occurrence of missing values in the data matrix [62,63]. In this work, we focus on strategies for dealing with missing data of the MCAR type [35]. Let X = {x_1, . . ., x_n} be a data matrix and, for each observation i, define the p-dimensional indicator vector M_i = (m_{i1}, . . ., m_{ip}), where m_{ij} = 1 if x_{ij} is missing and m_{ij} = 0 otherwise. The missing data generation mechanism is defined as the conditional probability of M given X, P(M|X, θ), where θ denotes the unknown parameters of a given probability distribution. Missing values are defined as MCAR if missingness does not depend on the data. Formally, this mechanism is defined as

P(M|X, θ) = P(M|θ).

From a practical perspective, missing data mechanisms operate as assumptions that dictate which techniques should be used to deal with these values [62].

Handling missing values
Traditionally, researchers use a wide variety of techniques to handle missing values. The best approach, however, would be to avoid having such values in the dataset in the first place, through better experimental design or repeated data collection. Nonetheless, investigating why these values are absent and taking corrective measures can be impracticable or impossible. Therefore, it is usually more feasible to adopt techniques that deal with the missing values present in the data matrix.
There are three common approaches in the literature to handle missing values [35]: • Elimination: This technique is best used when the percentage of missing values in the dataset is relatively small. The approach is to ignore the data items, or the attributes, that contain missing values. Data analysis is then performed on the set of available data, which is called Complete-Case Analysis (CCA). The main advantage of exclusion is that it produces a complete dataset, which in turn allows the use of standard data analysis techniques [62]. The disadvantage of this technique is that the sample size can be drastically reduced, especially for datasets that include a large proportion of missing data.
• Imputation: This approach, which is called Imputation of Missing Values (IMV), consists of replacing the missing values with estimated values that are generally derived from the available data. IMV techniques range from simple methods, such as replacing missing values with the Mean or the Median value, to more sophisticated ones that use Regression, Maximum Likelihood and other statistical methods [63]. The disadvantage of this approach is that the quality of the results of the data analysis can be affected by the imputation, since imputed values are treated as observed values. As an advantage, standard analysis techniques can be used since the missing values have been filled.
• Adaptation of data analysis methods to incomplete data: An effective approach is to adapt data analysis methods so that they can handle datasets that have missing values. These methods include estimating missing values during data analysis and distinguishing between observed and imputed values. The main advantage of the adaptation approach is that all observed data can be used for data analysis, avoiding the disadvantages of imputing the missing values.
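The two simplest IMV techniques mentioned above, mean and median imputation, can be sketched in a few lines (a minimal sketch; `None` marks a missing entry, and the function name is ours):

```python
import statistics

def impute(X, strategy="mean"):
    """Column-wise mean/median imputation: each None is replaced by the
    mean (or median) of the observed values of the same variable."""
    p = len(X[0])
    center = []
    for j in range(p):
        obs = [row[j] for row in X if row[j] is not None]
        center.append(statistics.mean(obs) if strategy == "mean"
                      else statistics.median(obs))
    return [[row[j] if row[j] is not None else center[j] for j in range(p)]
            for row in X]
```

These two imputations serve later in the paper as baselines against which the OCS results are compared.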

Adapting the VKFCM-K-LP algorithm to handle missing data
The VKFCM-K-LP clustering method [13] cannot be applied directly to datasets with missing values. As with most clustering methods, VKFCM-K-LP requires all values in the data matrix to be present, in order to calculate prototypes and distance measurements. Several methods have been proposed in the literature to deal with incomplete data, such as Hathaway et al. [43], who proposed three strategies to group incomplete data using the Fuzzy C-Means algorithm (FCM). In this Section, we use these three approaches to adapt the VKFCM-K-LP clustering algorithm to deal with incomplete data.

Whole Data Strategy (WDS)
This strategy consists of omitting the incomplete data items and applying the VKFCM-K-LP algorithm to the resulting complete data matrix [43]. The method is an example of CCA, since the missing values are not included in the calculation of the cluster prototypes, and it can be applied when the percentage of missing data is relatively small. It is generally suggested that WDS be considered only when missing values amount to less than 25% of all values in the dataset [43]. However, incomplete observations are not completely excluded from the analysis: at the end of the clustering process on the complete part of the dataset, the incomplete data are partitioned using a nearest-prototype scheme based on Partial Distances (PD) computed from each incomplete instance to each cluster prototype. The PD function sums the squared (kernelized) Euclidean distances over the available (i.e. non-missing) feature values and then rescales the sum by the reciprocal of the proportion of values used in the calculation. Algorithm 2 describes the steps for WDS. Algorithm 2: VKFCM-K-LP clustering method with the WDS strategy.
1: Randomly initialize the fuzzy membership degrees u_ki; uniformly initialize all weights as 1/p. Do t = 1.
2: Update prototype vector v_k according to Eq (13).
3: Update weight vector λ_k according to Eq (14).
4: Update fuzzy membership degree u_ki using Eq (15).
5: If |J^(t+1) − J^(t)| < ε or t > T, partition X_M according to Eq (17) and STOP; ELSE do t = t + 1 and go to step 2.
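The final WDS step, partitioning X_M by the nearest prototype under a partial distance, can be sketched as follows; for brevity this sketch uses a plain squared Euclidean partial distance in place of the kernelized adaptive one, with `None` marking missing entries (names are ours):

```python
def partial_sq_euclidean(x, v):
    """Squared Euclidean distance over observed components only (None marks
    a missing value), rescaled by p / I_i as in the PD function."""
    obs = [j for j in range(len(x)) if x[j] is not None]
    s = sum((x[j] - v[j]) ** 2 for j in obs)
    return (len(x) / len(obs)) * s

def assign_incomplete(X_missing, prototypes, dist=partial_sq_euclidean):
    """Nearest-prototype assignment for the incomplete instances."""
    return [min(range(len(prototypes)), key=lambda k: dist(x, prototypes[k]))
            for x in X_missing]
```

Swapping `partial_sq_euclidean` for the kernelized, locally weighted partial distance of the PDS section recovers the assignment rule actually used by the method.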

Partial Distance Strategy (PDS)
Dixon [64] recommends the partial distance strategy in cases where X_M is sufficiently large and WDS is not recommended. PDS consists of estimating the distance between two observations using the partial distance function. In VKFCM-K-LP, which uses a local adaptive kernel distance, the partial version is given by

\varphi_P^2(x_i, v_k) = \frac{p}{I_i} \sum_{j=1}^{p} \lambda_{kj} \, (1 - K_j(x_{ij}, v_{kj})) \, I_{ij},   (17)

where I_i = \sum_{j=1}^{p} I_{ij} for 1 ≤ i ≤ n. The indicator function I_{ij} is defined by

I_{ij} = 1 if x_{ij} ∈ X_obs, and I_{ij} = 0 if x_{ij} ∈ X_M,   (18)

where X_obs and X_M are defined in Section 5. Therefore, the objective function for this strategy is given by

J_P = \sum_{k=1}^{K} \sum_{i=1}^{n} u_{ki}^{m} \, \varphi_P^2(x_i, v_k),   (19)

where the distance in Eq (17) is called the Local Adaptive Partial Kernel distance, with the weight constraint given in Eq (10).
In the first iteration of the VKFCM-K-LP algorithm under PDS, prototypes and weights are updated using only the values in X_obs. Prototypes are given by

v_{kj}^{(t+1)} = \frac{\sum_{i=1}^{n} u_{ki}^{m} K(x_{ij}, v_{kj}^{(t)}) \, I_{ij} \, x_{ij}}{\sum_{i=1}^{n} u_{ki}^{m} K(x_{ij}, v_{kj}^{(t)}) \, I_{ij}},   (20)

where K(·) is the Gaussian kernel. The weights of the variables are obtained by minimizing the objective function given in Eq (19), which gives

\lambda_{kj} = \frac{\left[ \prod_{l=1}^{p} \sum_{i=1}^{n} u_{ki}^{m} \, (1 - K_l(x_{il}, v_{kl})) \, I_{il} \right]^{1/p}}{\sum_{i=1}^{n} u_{ki}^{m} \, (1 - K_j(x_{ij}, v_{kj})) \, I_{ij}},   (21)
for 1 ≤ k ≤ K and 1 ≤ j ≤ p. The scale factor p/I_i in Eq (17) has no effect on the calculation of the prototypes in Eq (20) [43], and consequently it does not affect the weight calculation in Eq (21). This scale factor also has no effect on u_ki, which is calculated using

u_{ki} = \left[ \sum_{h=1}^{K} \left( \frac{\varphi_P^2(x_i, v_k)}{\varphi_P^2(x_i, v_h)} \right)^{1/(m-1)} \right]^{-1},   (22)

because it appears in both the numerator and the denominator of the ratio, and it can therefore be omitted from the partial distance given in Eq (17).
The steps of the PDS version of VKFCM-K-LP are listed in Algorithm 3. Algorithm 3: VKFCM-K-LP clustering method with the PDS strategy.
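The Local Adaptive Partial Kernel distance of Eq (17), the core of PDS, can be sketched directly (per-variable Gaussian kernel assumed, `None` marking entries of X_M; names are ours):

```python
import math

def gk1(x, v, sigma2=1.0):
    """Per-variable Gaussian kernel on scalar components."""
    return math.exp(-((x - v) ** 2) / (2.0 * sigma2))

def partial_distance_sq(x, v, lam, sigma2=1.0):
    """Eq (17): sum lambda_j * (1 - K_j(x_j, v_j)) over observed components
    only, then rescale by p / I_i, the reciprocal of the proportion of
    observed values."""
    p = len(x)
    obs = [j for j in range(p) if x[j] is not None]
    if not obs:
        raise ValueError("instance has no observed values")
    s = sum(lam[j] * (1.0 - gk1(x[j], v[j], sigma2)) for j in obs)
    return (p / len(obs)) * s
```

The rescaling keeps distances of instances with different numbers of observed values on a comparable scale, which is what allows incomplete instances to compete fairly in the membership update.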

Optimal Completion Strategy (OCS)
The main idea of this strategy is to calculate the missing values in X_M iteratively, as auxiliary variables in the optimization of the objective function J_M [43], defined as

J_M = \sum_{k=1}^{K} \sum_{i=1}^{n} u_{ki}^{m} \sum_{j=1}^{p} \lambda_{kj} \, (1 - K_j(x_{ij}, v_{kj})),   (23)

in which prototype v_{kj} and weight λ_{kj} are defined according to Eqs (13) and (14). The missing values are then updated by minimizing J_M with respect to each x_{ij} ∈ X_M, keeping memberships, prototypes and weights fixed. For the Gaussian kernel, setting the partial derivative of J_M with respect to x_{ij} equal to zero yields, as described in [43],

x_{ij}^{(t+1)} = \frac{\sum_{k=1}^{K} u_{ki}^{m} \, \lambda_{kj} \, K_j(x_{ij}^{(t)}, v_{kj}^{(t)}) \, v_{kj}^{(t)}}{\sum_{k=1}^{K} u_{ki}^{m} \, \lambda_{kj} \, K_j(x_{ij}^{(t)}, v_{kj}^{(t)})},   (26)

where membership degree u_{ki} is defined as in Eq (15).
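A single OCS fixed-point step for one missing entry, following the stationarity condition of Eq (26), can be sketched as follows (Gaussian kernel assumed; function and parameter names are ours):

```python
import math

def gk1(x, v, sigma2=1.0):
    """Per-variable Gaussian kernel on scalar components."""
    return math.exp(-((x - v) ** 2) / (2.0 * sigma2))

def ocs_update(u_col, lam_j, v_j, x_old, m=2.0, sigma2=1.0):
    """One OCS fixed-point step for a single missing entry x_ij:
    x_ij <- sum_k u_ki^m * lam_kj * K(x_ij, v_kj) * v_kj /
            sum_k u_ki^m * lam_kj * K(x_ij, v_kj),
    evaluated at the current estimate x_old."""
    w = [u_col[k] ** m * lam_j[k] * gk1(x_old, v_j[k], sigma2)
         for k in range(len(v_j))]
    return sum(wk * vk for wk, vk in zip(w, v_j)) / sum(w)
```

The update is a kernel-weighted convex combination of the prototype components, so an imputed value always lands inside the range spanned by the prototypes for that variable.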

Experimental design
The performance of the VKFCM-K-LP method proposed by [13] had not been evaluated in the context of incomplete data. Thus, this work adapted VKFCM-K-LP using the three strategies defined by [43] to handle missing data. To evaluate the methods, we implemented a missing value generator in order to create reproducible datasets with absent values, on which the methods presented in this work can be evaluated. The implementation of the missing data generation mechanism and the graphical representations were performed with the aid of packages offered by R [65]. The main R packages used were ggplot2, VIM and naniar. The clustering methods were implemented in C. Experiments ran on an Intel Core (TM) I3-3217U CPU, clocking at 1.80GHz, with 4GB of RAM, using the Linux operating system. The code and data for reproducing the results reported here are available in the following repository: https://github.com/AnnyKerol/clustering_for_missing_data. Three external indices were used to compare clustering results: the Corrected Rand index (CR) [47], the F-measure [48] and the Overall Error Rate of Classification (OERC) [49]. The CR index takes values in the interval [−1, 1], in which 1 indicates perfect agreement between partitions, whereas values near 0 (or negative) correspond to cluster agreement found by chance [47]. The F-measure takes values in the [0, 1] interval, in which 1 indicates perfect agreement between partitions. The OERC measures the ability of a clustering algorithm to recover the original classes present in a dataset and takes values in the [0, 1] interval, with lower OERC values indicating better clustering results.
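The Corrected Rand index can be computed from the contingency table of the two partitions; a compact, self-contained sketch of the Hubert-Arabie formulation:

```python
from math import comb
from collections import Counter

def corrected_rand(labels_a, labels_b):
    """Corrected Rand index (Hubert & Arabie): pair-counting agreement
    between two partitions, corrected for chance; 1 means identical
    partitions, values near 0 indicate chance-level agreement."""
    n = len(labels_a)
    pairs = Counter(zip(labels_a, labels_b))
    rows, cols = Counter(labels_a), Counter(labels_b)
    index = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in rows.values())
    sum_b = sum(comb(c, 2) for c in cols.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2.0
    return (index - expected) / (max_index - expected)
```

Note that the index is invariant to label permutations: relabeling the clusters of either partition leaves the value unchanged, which is what makes it suitable for comparing a clustering against known classes.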
At the end of the clustering process of the VKFCM-K-LP method under the OCS approach, we obtain a complete dataset, which resulted in the best values of CR, OERC and F-measure. To verify whether the values imputed by OCS resemble each variable's distribution, we calculated a consistency measure [50], given in Eq (27), where k denotes the k-th cluster, 1 ≤ j ≤ p indexes the p variables to be analyzed, μ_p0 and s²_p0 are the mean and variance of the dataset with missing values, and μ_p1 and s²_p1 refer to the mean and variance of the dataset with imputed values. The better the clustering under the OCS approach, the closer the values given by Eq (27) are to zero, indicating that the imputed values are consistent with the original scales of the variables in the dataset with missing values.

Missing data generation
The missing value generator used in this study removes values from the complete dataset with a given probability, according to the MCAR mechanism. In the generation of missing values of the MCAR type [35], we assume independence in the joint distribution of (x_i, M); therefore, the probability that a value x_ij is observed is independent of the values in X or M. Consider a Bernoulli distribution with parameter θ, 0 ≤ θ ≤ 1, for the indicator variable M_i, where M_i = 1 indicates that the corresponding value is missing. If missingness is independent of X, then P(M_i = 1|x_i, θ) = θ. Since this constant does not depend on the values in X, the generated mechanism is of the MCAR type.
In computational terms, a complete dataset X is selected and subsequently modified to obtain an incomplete dataset, by randomly selecting a specified percentage of its components x_ij to be assigned as missing values. A value x_ij is taken as missing when the element m_ij of the sample generated for the indicator variable M equals one, i.e., m_ij = 1. In that case, x_ij is excluded from the complete dataset and designated as a missing value.
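The mechanism just described can be sketched in C as follows. This is a hypothetical helper for illustration, not the generator used in the paper; missing entries are encoded as NaN, and the indicator matrix m is returned so the removed positions remain known.

```c
#include <stdlib.h>
#include <math.h>

/* MCAR removal: each entry x[i][j] of a complete n x p dataset (row-major)
 * is marked missing when an independent Bernoulli(theta) draw m_ij = 1.
 * The draw does not depend on the data values, which is what makes the
 * mechanism MCAR. A fixed seed makes the generated dataset reproducible. */
void mcar_remove(double *x, int *m, int n, int p, double theta, unsigned seed)
{
    srand(seed);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j++) {
            int ij = i * p + j;
            /* Bernoulli(theta): uniform draw in [0, 1) compared to theta */
            m[ij] = ((double)rand() / ((double)RAND_MAX + 1.0)) < theta;
            if (m[ij]) x[ij] = NAN;   /* excluded from the complete dataset */
        }
}
```

With theta = 0.10, about 10% of the n × p entries are removed on average, matching the percentages used in the experiments.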

Results
This section presents an experimental evaluation of the kernel-based fuzzy clustering method with automatic weighting of the variables using local adaptive distances (VKFCM-K-LP) under the WDS, PDS and OCS approaches. In our experiments, datasets with 5%, 10%, 15% and 20% of missing values were artificially generated using the methodology described in Section 7.1, which means that the random variable M was sampled from Bernoulli distributions with parameter θ taken from {0.05, 0.10, 0.15, 0.20}. The clustering algorithms were executed 100 times for each dataset, following a Monte Carlo simulation scheme with random initialization. In each Monte Carlo iteration, the adjustment between clusters and prototypes is repeated until convergence, with a tolerance threshold of ε = 10^−10, or until a maximum number of iterations is reached, i.e., until t > T with T = 300. At the end of the 100 Monte Carlo replications, we select the best solution according to the objective function J.
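The restart-and-converge scheme above can be written generically. The fragment below is an illustrative skeleton, not the authors' implementation: `step` stands for one alternating-optimization sweep of the clustering algorithm and returns the current value of J; the stopping rule mirrors the text (change in J below ε = 10^−10, or T = 300 sweeps). The `demo_step` objective is a toy stand-in used only to exercise the loop.

```c
#include <math.h>
#include <float.h>

/* One update sweep from the current state; returns the objective J. */
typedef double (*step_fn)(void *state, unsigned restart);

/* Runs `reps` random restarts, iterating each until the change in J falls
 * below eps or tmax sweeps are reached, and keeps the best (lowest) J. */
double best_of_restarts(step_fn step, void *state, int reps,
                        double eps, int tmax)
{
    double best = DBL_MAX;
    for (int r = 0; r < reps; r++) {
        double prev = DBL_MAX, j = DBL_MAX;
        for (int t = 0; t < tmax; t++) {
            j = step(state, (unsigned)r);
            if (fabs(prev - j) < eps) break;   /* converged */
            prev = j;
        }
        if (j < best) best = j;                /* keep best solution by J */
    }
    return best;
}

/* Toy objective for demonstration: Newton's iteration for sqrt(restart + 1);
 * J converges quadratically, so the eps test fires well before tmax. */
double demo_step(void *state, unsigned restart)
{
    double *x = state;
    *x = 0.5 * (*x + (restart + 1.0) / *x);
    return *x;
}
```

In the actual experiments, `step` would update memberships, prototypes and variable weights in turn before evaluating J.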
In order to compare the models, we calculated CR, FM and OERC on their best solutions. The averages and standard deviations of these measures are also calculated across the 100 repetitions of each algorithm. The number of groups K was defined as equal to the known number of classes of each dataset. Parameter m was set as 2.0, following a previous study [13]. The terms 2s²_j, j = 1, . . ., p, of the Gaussian kernel functions were estimated as the average between the 0.1 and 0.9 quantiles of ‖x_ij − x_kj‖², for i ≠ k; i, k = 1, . . ., n [13,61].
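For a single variable j, the 2s²_j estimate described above can be sketched as follows. This is our own illustration of the stated rule; the quantile convention (a plain order statistic without interpolation) is an assumption and may differ from the original implementation.

```c
#include <stdlib.h>

static int cmp_dbl(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* Estimate the Gaussian-kernel hyper-parameter 2s_j^2 for one variable:
 * the average of the 0.1 and 0.9 quantiles of the squared differences
 * (x_ij - x_kj)^2 over all pairs i != k. `col` holds the n values of
 * variable j; pairs i < k suffice because the squared difference is
 * symmetric in (i, k). */
double two_s2(const double *col, int n)
{
    int npairs = n * (n - 1) / 2;
    double *d2 = malloc(npairs * sizeof *d2);
    int t = 0;
    for (int i = 0; i < n; i++)
        for (int k = i + 1; k < n; k++) {
            double d = col[i] - col[k];
            d2[t++] = d * d;
        }
    qsort(d2, npairs, sizeof *d2, cmp_dbl);
    /* order-statistic quantiles; interpolation choices may differ */
    double q10 = d2[(int)(0.1 * (npairs - 1))];
    double q90 = d2[(int)(0.9 * (npairs - 1))];
    free(d2);
    return 0.5 * (q10 + q90);
}
```

Averaging the two quantiles makes the bandwidth robust to both very small and very large pairwise distances.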
Additionally, we calculated the consistencies of the variables in the completed datasets when evaluating the VKFCM-K-LP method under the OCS approach, and we compared clustering under OCS with clustering after imputation of missing values by the Mean or the Median. To show the effectiveness of the VKFCM-K-LP clustering method under the WDS, PDS and OCS approaches, we used two datasets: the Iris Plant dataset [66] and the Thyroid Gland dataset [67], both obtained from the UCI Machine Learning Repository at the University of California, Irvine, United States [68]. These datasets were chosen because their groups have different structures; in particular, the Thyroid Gland dataset presents greater group overlap than the Iris Plant dataset. The performance of the methods on these datasets is described in the following sections.

Iris Plant dataset
The Iris Plant dataset [66] is well known and widely used in the area of pattern recognition. This set has three a priori classes (K = 3), each with 50 observations, for a total of 150 instances. The classes correspond to three species of Iris flowering plants: Iris setosa (Class 1), Iris virginica (Class 2) and Iris versicolor (Class 3). For each species, four variables were observed (p = 4), corresponding to flower measurements: Sepal Length (SL), Sepal Width (SW), Petal Length (PL) and Petal Width (PW). Fig 2a and 2b show the dispersion of the values of the variables for this dataset and the boxplots for each species. It is possible to observe an apparently linear relationship between variables PL and SL and between variables PW and SW for the versicolor and virginica classes. We also note that, considering the versicolor and virginica species, these variables are directly proportional, that is, an increase in the value of SL implies an increase in the value of PL, and the same is observed for SW and PW. In addition, the three species differ in relation to the variables, especially the setosa species, which is linearly separable from the other two.
The boxplots in Fig 2a show higher variability in the data of the virginica species for the SL and PL variables. Fig 2b, on the other hand, shows less variability when considering the SW variable. Fig 3a and 3b present the missing value patterns that were artificially generated for the Iris Plant dataset, distributed across its four variables. In each plot, the x axis represents the variables and the y axis represents the observations, with the black regions indicating missing values. The figures also show the number of missing values by variable for each missing percentage, with variable PL having the highest number of missing values for all analyzed percentages. In the datasets with 5%, 10% and 20% of missing values, the SW variable has the lowest missing amount. Observations belonging to Class 1 are in the 1–50 range, observations belonging to Class 2 are in the 51–100 range and, finally, the 101–150 range contains the observations belonging to Class 3.

Table 1 shows the CR, FM and OERC values corresponding to the best solutions obtained in the 100 Monte Carlo replications of the VKFCM-K-LP clustering algorithm with the WDS, PDS and OCS strategies. For all the missing value percentages studied, the CR and FM indices are close to 1, which indicates a good agreement between the a priori classes and the groups provided by the clustering methods. For 5% of missing values, the best performance was observed for the PDS method. However, when analyzing the data with 10%, 15% and 20% of missing values, the PDS method presented the worst performance. In general, increasing the percentage of missing values in the datasets degrades the performance of the algorithms, as expected. This behavior is also verified for the PDS approach when increasing the percentage from 5% to 10%, and for the WDS and OCS approaches when the percentage goes from 15% to 20%.
In order to investigate the predictive power of the VKFCM-K-LP algorithm under the three approaches for handling missing data, Table 2 shows the confusion matrices obtained for each method and for each percentage of missing values considered. The columns contain the original classes, and the rows contain the clusters provided by the clustering methods, which were identified as Cluster 1 (setosa), Cluster 2 (virginica) and Cluster 3 (versicolor).
The confusion matrices in Table 2 show that, for all clustering methods and all percentages of missing values considered, observations belonging to the setosa species in the Iris Plant dataset were properly grouped into Cluster 1. This is expected, as this species is separable from the other two, as shown in Fig 2a and 2b. It can also be noted that Clusters 2 and 3 showed higher numbers of incorrectly clustered observations, which is expected because, unlike Cluster 1, these groups are not linearly separable.

PLOS ONE
Tables 3-5 provide the weights of the variables in each cluster. In general, in the three approaches and for all percentages of missing values, variables PL and PW were the most relevant for the construction of the clusters. Variable PL obtained the greatest relevance in all groups, even though it has the largest number of missing values, as shown in Fig 3a-3d. However, the weights of the PL variable decrease as the percentage of missing values increases.

Fig 4 shows the performance of the OCS, PDS and WDS algorithms over the 100 Monte Carlo repetitions. The WDS strategy had the largest deviations in error rate when compared to the others. For the PDS approach, both increases and decreases in the average error rate were observed over the analyzed percentages. In the OCS strategy, the error rate increases from 10% of missing values onwards. This method presents a more predictable behavior, i.e., as the percentage of missing values increases, the error rate also increases. The OCS strategy showed the smallest deviations around the average error rate when compared with the WDS and PDS strategies.
Analyzing the measures of variable consistency from Table 6, considering the complete dataset obtained after clustering with the VKFCM-K-LP algorithm under the OCS strategy, we note that these measures are very close to zero. This shows a good quality of the imputation, i.e., the imputed values are consistent with the original scales of the variables.

Thyroid Gland dataset
In this section, we evaluate the three missing data approaches using the Thyroid Gland dataset [67]. This dataset has three a priori classes (K = 3): normal, hyper and hypo. Fig 5a and 5b present the dispersion and boxplot graphs for T3 plotted against TST and for TST versus TTS. Class 2 is more dispersed than the others, which is evidenced in the boxplots for the analyzed variables. Fig 5b shows a linear relationship between variables TST and TTS for classes 1 and 2. In addition, these classes have less variability when considering the TST variable.

Fig 6a-6d show the missing values distributed across the five variables in the Thyroid Gland dataset. Variable T3 presents a greater number of missing values for the 15% and 20% percentages. For all analyzed datasets, the DTSH variable has the smallest amount of missing values. Additionally, the missing values are well distributed among the variables. Observations in the 1–150 range represent class 1 (normal), the 151–175 interval corresponds to class 2 (hyper) and, finally, the 176–215 interval contains the class 3 (hypo) observations.

Table 7 shows the best results among the 100 repetitions of the VKFCM-K-LP algorithm under the three types of strategies for missing data. For 5% of missing values, the best performance was obtained by the WDS method, presenting a CR equal to 0.818 and an FM equal to 0.943, which means there was good agreement between the a priori classes and the clusters provided by the clustering algorithm. In this context, the OERC measure was equal to 5.5%. For the PDS and OCS strategies, increasing the number of missing values in the Thyroid Gland dataset degrades the quality of the clustering, as there was a decrease in the values of the studied measures. Considering all percentages of missing values jointly, the PDS strategy showed the best overall performance according to the quality measures analyzed.
To build the confusion matrices in Table 8, the clusters provided by the algorithm were identified as Cluster 1 (normal), Cluster 2 (hyper) and Cluster 3 (hypo). The confusion matrices show a great difficulty for the clustering algorithm in identifying Clusters 1 and 3 in all the methods analyzed. These clusters correspond to the normal and hypo classes, which in Fig 5a and 5b are more overlapped when compared to Class 2, which hinders the performance of the clustering method.

The PDS method presents lower error rates than the OCS strategy from 5% to 15% of missing values. The largest variations are observed in the WDS method for 5% and 15% of missing values. This method shows an increasing error rate between 5% and 10% and a decreasing error rate from 10% onwards.
The weights of the variables in each cluster, under the WDS, PDS and OCS approaches, listed in Tables 9-11, show that the TST and TSH variables were the most relevant to compose Cluster 1. For Cluster 2, the most important variables were TSH and STD and, for Cluster 3, the most relevant variables were TTS and TST. In addition, in the WDS strategy, the TTS and TSH variables became more relevant for the construction of Clusters 3 and 2, respectively, as the number of missing values increased. This behavior is also observed for the TST variable in Cluster 1 for the PDS and OCS strategies. In contrast, with the increase in the number of missing values in the DTSH variable, there was a decrease in its importance for the construction of Cluster 2 with the OCS strategy.
In order to assess the consistency of the variables in each cluster, as shown in Table 12, we used the datasets before and after clustering, with the missing values imputed using the OCS strategy. In this context, the consistencies obtained for the variables in the groups were close to zero, which indicates a good performance of the OCS method when imputing the missing values. Additionally, the greatest consistencies were found in Clusters 2 and 3 for all percentages of missing values evaluated.

Comparison between imputation methods
This section compares VKFCM-K-LP using the OCS method with Imputation via Mean and Median. Fig 8a and 8b show the accuracies obtained using the OCS method and by Imputation via Mean and Median for the Iris Plant and Thyroid Gland datasets, when the amount of missing values varies from 5% to 20%.

For the Imputation via Mean and Median values, missing values are filled using the mean or median of the observed values of each variable in the dataset, before applying the clustering algorithm.
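A minimal sketch of this baseline in C (our illustration, not the code used in the experiments), with missing entries encoded as NaN and assuming every column has at least one observed value:

```c
#include <stdlib.h>
#include <math.h>

static int cmp_dbl(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* Fill each NAN in column j of the n x p row-major matrix x with the mean
 * (use_median == 0) or the median (use_median == 1) of the observed values
 * of that column. Assumes every column has at least one observed value. */
void impute_column_stat(double *x, int n, int p, int use_median)
{
    double *obs = malloc(n * sizeof *obs);
    for (int j = 0; j < p; j++) {
        int nobs = 0;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double v = x[i * p + j];
            if (!isnan(v)) { obs[nobs++] = v; sum += v; }
        }
        double fill;
        if (use_median) {
            qsort(obs, nobs, sizeof *obs, cmp_dbl);
            fill = (nobs % 2) ? obs[nobs / 2]
                              : 0.5 * (obs[nobs / 2 - 1] + obs[nobs / 2]);
        } else {
            fill = sum / nobs;
        }
        for (int i = 0; i < n; i++)
            if (isnan(x[i * p + j])) x[i * p + j] = fill;
    }
    free(obs);
}
```

Every missing entry of a column receives the same constant, which is precisely the property criticized in the comparison that follows.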
For the Iris Plant dataset with 5% of missing values, accuracy was close to 0.90, which shows a good performance of the methods when imputing missing values. However, for the Thyroid Gland dataset, considering the same percentage of missing values, there are differences in accuracy, as shown in Fig 8a. This difference between the two datasets is expected, because the classes in the Thyroid Gland dataset are more overlapped than the classes in the Iris Plant dataset. In order to visualize and understand the data overlap, we applied Principal Component Analysis (PCA). Fig 9a and 9b show the resulting projections onto the first two components. In PCA, the components are orthogonal and sorted according to how much variance they explain, so it is possible to identify patterns and extract features [69]. Even after the projection, the classes of the Thyroid Gland dataset remain more overlapped, which is consistent with the lower accuracies observed for this dataset (Fig 8a). The consistencies of the variables obtained with imputation of the missing values via Mean and Median were higher than those obtained with the OCS strategy for the two analyzed datasets (Tables 6 and 12). This means that Mean and Median imputations depart more from the original scale of the variables in the two datasets than the values obtained by the OCS approach.
To show the dispersion of the new values imputed using the strategies mentioned above, variables T3 and TST were selected from the Thyroid Gland dataset with 5% and 15% of missing values (Fig 6b and 6c). It is thus important to graphically visualize the relationship of these imputed values with the ones in the dataset, as shown in Fig 10a-10f. The boxplots for the imputed values and the complete values (blue color) are also shown. The red and yellow colors represent imputed values for the T3 and TST variables, respectively. If values are imputed for both variables, they are colored black. Fig 10a-10c show that, with 5% of imputed missing values, most of the resulting points are close to the distribution of the complete data. The boxplots of the imputed values for the TST variable show higher similarity with the boxplots of the complete values, that is, the dispersion of the data before and after the imputation did not present significant discrepancies.
Regarding the T3 variable, the imputed values showed less variability than the complete observations. For datasets with 15% of imputed values (Fig 10d-10f), the values obtained by the OCS strategy showed better distribution in the groups compared to the values imputed with Mean and Median estimates.
In the imputation of missing values via Mean and Median, the imputed points concentrate on a single value, forming a straight line with zero slope. Thus, the set of values imputed through these methods has zero correlation between variables T3 and TST. Tabachnick et al. [70] argue that imputing missing values with central tendency measures, such as the mean, affects the correlation between the variables and leads to an underestimated variance.
Indeed, when analyzing the correlations of variables T3 and TST, we obtain ρ = −0.528 and ρ = −0.529 after imputation with the Mean and the Median, respectively, while the correlation for the original set (without missing values) is ρ = −0.536. The variability of the data is also impaired: the standard deviations (sd) for the T3 and TST variables in the complete dataset were sd = 13.145 and sd = 1.419, respectively, while for the set with 15% of imputed values the standard deviations are sd = 11.87 and sd = 1.35, which indicates variance underestimation.
Therefore, although imputations using Mean and Median values are easy to implement, the resulting clusterings are not satisfactory, since the correlation structure of the variables is modified and, consequently, the new values may not be related to their group of origin, as shown in Fig 10b, 10e and 10f. Finally, the VKFCM-K-LP method with the OCS strategy showed better performance in identifying the a priori classes, according to the accuracies observed in Fig 8a and 8b, and the set of values imputed using this strategy is closer to the set of observations from the original dataset shown in Fig 5a.

Conclusions
The problem of missing data is commonly discussed in several areas of science, as statistical techniques used for data analysis, such as clustering, were originally proposed for datasets without missing values. One alternative for facing this issue is to adapt clustering methods so that they can handle incomplete datasets. In this work, the VKFCM-K-LP clustering method was studied with three types of strategies to deal with missing data: WDS, PDS and OCS. In order to evaluate the clustering methods in the context of missing data, two benchmark datasets were used: Iris Plant and Thyroid Gland. From these datasets, new datasets with 5%, 10%, 15% and 20% of missing values were artificially generated. The results of the clustering algorithms were evaluated according to CR, FM and OERC. The results for the Iris Plant dataset were satisfactory, with CR and FM close to 1 and OERC close to zero for all analyzed methods and percentages of missing values, which shows a good performance of the VKFCM-K-LP method under the WDS, PDS and OCS approaches in identifying the a priori classes. For 5% of missing values, the best performance of the VKFCM-K-LP clustering algorithm was observed with the PDS strategy. However, the performance graph for the 100 repetitions of the algorithm shows that, for 10%, 15% and 20% of missing values, this method had the poorest performance. Additionally, the confusion matrices showed that observations belonging to Class 1 (setosa) in the Iris Plant dataset were properly grouped.
Regarding the weights of the variables in each group, variable PL was the most relevant, even with a higher percentage of missing values. The measures of consistency of the variables for the datasets obtained from the grouping with the VKFCM-K-LP algorithm, together with the OCS strategy, were close to zero, which showed a good clustering quality, that is, the values imputed using the OCS method were not discrepant in relation to the original scale of the variables.
In the generation of missing values for the Thyroid Gland dataset, variable T3 presented the greatest amount of missing values for the 15% and 20% percentages. The best quality measures for this dataset were observed for the PDS method. In addition, the methods showed an increasing average error rate in the performance graph over the 100 repetitions of the algorithm. The confusion matrices for the Thyroid Gland dataset showed an overlap between Classes 1 and 3 in all methods analyzed, which corresponded to a greater number of incorrectly grouped observations when compared with Class 2.
Variables TSH and DTSH obtained the highest weights in the construction of Cluster 2 in all analyzed cases. In contrast, variable T3 had little influence on the formation of the groups. The consistencies of the variables obtained for the OCS method in the Thyroid Gland dataset were close to zero, which means a good performance of the method in imputing the missing values.
When comparing the clustering results using the OCS, Mean and Median imputation methods, we found that the best accuracy was obtained by the OCS method for all considered percentages of missing values in both analyzed datasets. The results of the VKFCM-K-LP clustering using Mean and Median imputation were not satisfactory, because the set of imputed values affected the overall correlation of the variables in the dataset and distorted the variability of the data, which degraded the quality of the clusters.
In general, the VKFCM-K-LP clustering algorithm together with the missing data strategies WDS, PDS and OCS presented satisfactory results in the datasets with 5%, 10%, 15% and 20% of missing values. The best performances of the clustering method were observed when paired with the PDS and OCS strategies. Under the OCS approach, new complete datasets were derived and the missing values were estimated dynamically in the optimization process. The results of the clustering with the OCS strategy showed superior performance when compared to the results obtained by imputing with the mean and median of the observed values.