A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data

Similarity or distance measures are core components used by distance-based clustering algorithms to cluster similar data points into the same clusters, while dissimilar or distant data points are placed into different clusters. The performance of similarity measures is mostly addressed in two or three-dimensional spaces, beyond which, to the best of our knowledge, there is no empirical study that has revealed the behavior of similarity measures when dealing with high-dimensional datasets. To fill this gap, a technical framework is proposed in this study to analyze, compare and benchmark the influence of different similarity measures on the results of distance-based clustering algorithms. For reproducibility purposes, fifteen publicly available datasets were used for this study, and consequently, future distance measures can be evaluated and compared with the results of the measures discussed in this work. These datasets were classified as low and high-dimensional categories to study the performance of each measure against each category. This research should help the research community to identify suitable distance measures for datasets and also to facilitate a comparison and evaluation of the newly proposed similarity or distance measures with traditional ones.


Introduction
One of the biggest challenges of this decade concerns databases that hold a variety of data types. Variety is among the key notions in the emerging concept of big data, which is known by the 4 Vs: Volume, Velocity, Variety and Variability [1,2]. Databases currently hold a variety of data types, including interval-scaled variables (salary, height), binary variables (gender), categorical variables (religion: Jewish, Muslim, Christian, etc.) and mixed-type variables (multiple attributes with various types). Regardless of data type, the distance measure is a main component of distance-based clustering algorithms. Partitioning algorithms, such as k-means and k-medoids, and more recent soft clustering approaches, for instance fuzzy c-means [3] and rough clustering [4], depend mainly on distance measures to recognize clusters in a dataset.
In data mining, many techniques use distance measures to some extent. Clustering is a well-known technique for knowledge discovery in various scientific areas, such as medical image analysis [5][6][7], clustering gene expression data [8][9][10], investigating and analyzing air pollution data [11][12][13], power consumption analysis [14][15][16], and many more fields of study. Improving clustering performance has always been a target for researchers. Since similarity or dissimilarity (distance) measures are the core components of distance-based clustering algorithms, their efficiency directly influences the performance of those algorithms. These algorithms use similarity or distance measures to cluster similar data points into the same clusters, while dissimilar or distant data points are placed into different clusters. Examples of distance-based clustering algorithms include partitioning clustering algorithms, such as k-means and k-medoids, and hierarchical clustering [17].
Although various studies compare similarity/distance measures for clustering numerical data, this study differs from existing studies and related works in two respects. First, its aim is to investigate similarity/distance measures against low-dimensional and high-dimensional datasets and to analyse their behaviour in this context. Second, our datasets come from a variety of applications and domains, whereas other works are confined to a specific domain. In essence, the target of this research is to compare and benchmark similarity and distance measures for clustering continuous data and to examine their performance when they are applied to low and high-dimensional datasets. For the sake of reproducibility, fifteen publicly available datasets [18,19] were used for this study, so future distance measures can be evaluated and compared with the results of the traditional measures discussed here. These datasets are classified as low or high-dimensional, and each measure is studied against each category. Before studying similarity or dissimilarity measures, however, it needs to be established that they have a significant influence on clustering quality and are worth studying. Section 3 (methodology) shows that the similarity or distance measures have a significant influence on clustering results.
The key contributions of this paper are as follows:
• Twelve similarity measures frequently used for clustering continuous data from various fields are compiled in this study to be evaluated in a single framework. Most of these similarity measures have not been examined in domains other than the originally proposed one.
• A technical framework is proposed in this study to analyze, compare and benchmark the influence of different similarity measures on the result of distance-based clustering algorithms.
• Similarity measures are evaluated on a wide variety of publicly available datasets. Particularly, we evaluate and compare the performance of similarity measures for continuous data against datasets with low and high dimension.
The rest of the paper is organized as follows: section 2 gives a background on distance measures; section 3 explains the methodology of the study; section 4 presents the experimental results with a discussion; and section 5 summarizes the contributions of this study.

Background on Distance Measures for Continuous Data
The use of similarity measures is not limited to clustering; in fact, plenty of data mining algorithms use similarity measures to some extent. To reveal the influence of various distance measures on data mining, researchers have done experimental studies in various fields and have compared and evaluated the results generated by different distance measures. Although it is not practical to introduce a "best" similarity measure in general, a comparison study can shed light on the performance and behavior of measures. For instance, Boriah et al. conducted a comparison study on similarity measures for categorical data and evaluated them in the context of outlier detection [20]. It was concluded that the performance of an outlier detection algorithm is significantly affected by the similarity measure. In their research it was not possible to introduce a best performing similarity measure, but they analyzed and reported the situations in which a measure has poor or superior performance. In another research work, Fernando et al. [21] reviewed, compared and benchmarked binary-based similarity measures for categorical data. Using case studies, Deshpande et al. focused on data from a single knowledge area, biological data, and compared profile similarity measures for genetic interaction networks. They concluded that the Dot Product is consistently among the best measures under different conditions and genetic interaction datasets [22].
Similarly, in the context of clustering, studies have been done on the effects of similarity measures. In one study, Strehl and colleagues tried to recognize the impact of similarity measures on web clustering [23]. In another, six similarity measures were assessed, this time for trajectory clustering in outdoor surveillance scenes [24]. In chemical databases, Al Khalifa et al. [25] examined the performance of twelve coefficients for clustering, similarity searching and compound selection. From the results they concluded that no single coefficient is appropriate for all methodologies.
Despite these studies, no empirical analysis and comparison is available that investigates the behavior of similarity measures for clustering continuous data in low and high-dimensional datasets. Moreover, our datasets come from a variety of applications and domains, whereas those studies are limited to a specific domain. In this study, we gather known similarity/distance measures available for clustering continuous data, which are examined using various clustering algorithms and against 15 publicly available datasets. It is not possible to introduce a perfect similarity measure for all kinds of datasets, but in this paper we examine how similarity measures react to low and high-dimensional datasets. The similarity measures with the best results in each category are also introduced.
Before presenting the similarity measures for clustering continuous data, a definition of a clustering problem should be given. Assuming that the number of clusters required to be created is an input value k, the clustering problem is defined as follows [26]:

Definition 1
Given a dataset $D = \{v_1, v_2, \ldots, v_n\}$ of data vectors and an integer value $k$, the clustering problem is to define a mapping $f: D \rightarrow \{1, \ldots, k\}$ where each $v_i$ is assigned to one cluster $C_j$, $1 \leq j \leq k$. A cluster $C_j$ contains precisely those data vectors mapped to it; that is, $C_j = \{v_i \mid f(v_i) = C_j,\ 1 \leq i \leq n,\ \text{and } v_i \in D\}$.
In the following, similarity measures for clustering continuous data are discussed. Some of these similarity measures are frequently employed for clustering purposes while others have scarcely appeared in the literature.

Minkowski
The Minkowski family includes Euclidean distance and Manhattan distance, which are particular cases of the Minkowski distance [27][28][29]. The Minkowski distance is defined by $d_{min} = \left(\sum_{i=1}^{n} |x_i - y_i|^m\right)^{1/m}$, $m \geq 1$, where m is a positive real number and $x_i$ and $y_i$ are the ith components of two vectors x and y in n-dimensional space. The Minkowski distance performs well when the dataset clusters are compact or isolated; if the dataset does not fulfil this condition, the largest-scale attributes dominate the others [30,31]. Normalizing the continuous features is the solution to this problem [31].
A modified version of the Minkowski metric has been proposed to solve clustering obstacles. For example, Wilson and Martinez presented distance based on counts for nominal attributes and a modified Minkowski metric for continuous features [32].
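The relation between the Minkowski distance and its two best-known special cases can be sketched in a few lines. This is a minimal illustration, assuming the standard form of the Minkowski distance (the m-th root of the summed m-th powers of the absolute coordinate differences); the function names are chosen here for illustration only.

```python
def minkowski(x, y, m):
    """Minkowski distance of order m between equal-length sequences x and y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if m < 1:
        raise ValueError("m must be >= 1 for the result to be a metric")
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)


def manhattan(x, y):
    # Special case m = 1.
    return minkowski(x, y, 1)


def euclidean(x, y):
    # Special case m = 2.
    return minkowski(x, y, 2)


x, y = (0.0, 0.0), (3.0, 4.0)
print(manhattan(x, y))  # 7.0
print(euclidean(x, y))  # 5.0
```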

Manhattan distance
Manhattan distance is a special case of the Minkowski distance at m = 1. Like its parent, Manhattan is sensitive to outliers. When this distance measure is used in clustering algorithms, the shape of clusters is hyper-rectangular [33]. A study by Perlibakas demonstrated that a modified version of this distance measure is among the best distance measures for PCA-based face recognition [34]. This measure is defined as $d_{man} = \sum_{i=1}^{n} |x_i - y_i|$.

Euclidean distance
The most well-known distance used for numerical data is probably the Euclidean distance. This is a special case of the Minkowski distance when m = 2. Euclidean distance performs well when deployed to datasets that include compact or isolated clusters [30,31]. Although Euclidean distance is very common in clustering, it has a drawback: two data vectors that have no attribute values in common may have a smaller distance than another pair of data vectors that share attribute values [31,35,36]. Another problem with Euclidean distance, as a member of the Minkowski family, is that the largest-scaled feature dominates the others. Normalization of continuous features is a solution to this problem [31].
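The scale-domination problem and the effect of normalization can be illustrated with a small sketch. The records below are hypothetical, and z-score standardization is only one possible normalization, assumed here for illustration; the text does not prescribe a particular scheme.

```python
import math
import statistics


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def zscore_columns(rows):
    """Standardize each feature (column) to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stdevs = [statistics.pstdev(c) or 1.0 for c in columns]
    return [
        tuple((v - mu) / sd for v, mu, sd in zip(row, means, stdevs))
        for row in rows
    ]


# Hypothetical records: (salary in currency units, height in metres).
data = [(30000.0, 1.60), (30500.0, 1.95), (60000.0, 1.61)]

# On raw values the salary column decides almost everything:
# records 0 and 1 look close although their heights differ strongly.
raw_01 = euclidean(data[0], data[1])
raw_02 = euclidean(data[0], data[2])

scaled = zscore_columns(data)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])
print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

After standardization both features contribute on a comparable scale, so the large height difference between records 0 and 1 is no longer masked by the salary column.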

Average distance
Regarding the above-mentioned drawback of Euclidean distance, average distance is a modified version of the Euclidean distance to improve the results [27,35]. For two data points x, y in n-dimensional space, the average distance is defined as $d_{ave} = \left(\frac{1}{n}\sum_{i=1}^{n} (x_i - y_i)^2\right)^{1/2}$.

Weighted Euclidean distance
If the relative importance of each attribute is available, then the Weighted Euclidean distance, another modification of Euclidean distance, can be used [37]. This distance is $d_{we} = \left(\sum_{i=1}^{n} w_i (x_i - y_i)^2\right)^{1/2}$, where $w_i$ is the weight given to the ith component. This is the only distance measure not included in the comparison in this study, since calculating the weights is closely tied to the dataset and to the researcher's aim in the cluster analysis. For an example of this measure in use, the reader can refer to the work of Ji et al., who used it to propose a dynamic fuzzy cluster algorithm for time series [38].

Chord distance
Chord distance is one more Euclidean distance modification to overcome the previously mentioned Euclidean distance shortcomings. It can solve problems caused by the scale of measurements as well. Chord distance is defined as the length of the chord joining two normalized points within a hypersphere of radius one. This distance can be calculated from non-normalized data as well [27]. Chord distance is defined as $d_{chord} = \left(2 - 2\frac{\sum_{i=1}^{n} x_i y_i}{\|x\|_2 \|y\|_2}\right)^{1/2}$, where $\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$ is the Euclidean norm of x.
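A minimal sketch of the chord distance follows, assuming the commonly cited form, the square root of 2 - 2 (x · y) / (‖x‖ ‖y‖). It checks numerically that computing the distance from non-normalized vectors gives the same value as measuring the chord between the two vectors after projecting them onto the unit hypersphere. The function names are illustrative, and zero vectors are not handled.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))


def chord(x, y):
    """Chord distance computed directly from non-normalized vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    cos = dot / (l2_norm(x) * l2_norm(y))
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, 2.0 - 2.0 * cos))


def chord_via_unit_vectors(x, y):
    """Length of the chord joining x/||x|| and y/||y|| on the unit hypersphere."""
    nx, ny = l2_norm(x), l2_norm(y)
    return math.sqrt(sum((a / nx - b / ny) ** 2 for a, b in zip(x, y)))


x, y = (1.0, 2.0, 3.0), (2.0, 0.5, 4.0)
print(chord(x, y), chord_via_unit_vectors(x, y))
# Rescaling one vector leaves the chord distance unchanged.
print(chord(x, tuple(10 * b for b in y)))
```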

Mahalanobis distance
Mahalanobis distance is a data-driven measure, in contrast to Euclidean and Manhattan distances, which are independent of the dataset to which the two data points belong [20,33]. A regularized Mahalanobis distance can be used for extracting hyperellipsoidal clusters [30]. On the other hand, Mahalanobis distance can alleviate distortion caused by linear correlation among features by applying a whitening transformation to the data or by using the squared Mahalanobis distance [31]. Mahalanobis distance is defined by $d_{mah} = \sqrt{(x - y) S^{-1} (x - y)^T}$, where S is the covariance matrix of the dataset [27,39].
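The data-driven nature of the Mahalanobis distance can be illustrated with a small standard-library sketch. It assumes the usual form, the square root of (x - y) S^{-1} (x - y)^T, with S estimated as the sample covariance matrix; the dataset and helper names are hypothetical. Instead of inverting S explicitly, the code solves the linear system S z = (x - y).

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1) of a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [
            sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (m[i][n] - sum(m[i][j] * z[j] for j in range(i + 1, n))) / m[i][i]
    return z


def mahalanobis(x, y, cov):
    diff = [a - b for a, b in zip(x, y)]
    z = solve(cov, diff)  # z = S^{-1} (x - y)
    return math.sqrt(sum(d * v for d, v in zip(diff, z)))


# Hypothetical, strongly correlated 2-D data (second feature is about twice the first).
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
cov = covariance_matrix(data)
center = (3.0, 6.0)
along = (4.0, 8.0)    # moves along the correlation direction
across = (4.0, 5.0)   # moves against the correlation direction
print(mahalanobis(center, along, cov), mahalanobis(center, across, cov))
```

In Euclidean terms `along` is farther from `center` than `across`, but the Mahalanobis distance ranks them the other way round because `across` violates the correlation structure of the data.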

Cosine measure
The Cosine similarity measure is mostly used in document similarity [28,33] and is defined as $Cosine(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\|x\|_2 \|y\|_2}$, where $\|y\|_2$ is the Euclidean norm of vector $y = (y_1, y_2, \ldots, y_n)$, defined as $\|y\|_2 = \sqrt{\sum_{i=1}^{n} y_i^2}$. The Cosine measure is invariant to rotation but is variant to linear transformations. It is also independent of vector length [33].

Pearson correlation
Pearson correlation is widely used in clustering gene expression data [33,36,40]. This similarity measure calculates the similarity between the shapes of two gene expression patterns. The Pearson correlation is defined by $Pearson(x, y) = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i=1}^{n} (x_i - \mu_x)^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \mu_y)^2}}$, where $\mu_x$ and $\mu_y$ are the means for x and y respectively. The Pearson correlation has a disadvantage of being sensitive to outliers [33,40]. The similarity measures explained above are the most commonly used for clustering continuous data. Table 1 presents a summary of these with some highlights of each.
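The Cosine measure and the Pearson correlation are closely related: with the standard definitions assumed here, the Pearson correlation equals the Cosine measure of the mean-centred vectors. The sketch below, with hypothetical data and illustrative function names, checks this numerically and shows which transformations each measure ignores.

```python
import math


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)) * math.sqrt(
        sum((b - my) ** 2 for b in y)
    )
    return num / den


def centred(v):
    mu = sum(v) / len(v)
    return [a - mu for a in v]


x = [1.0, 2.0, 4.0, 3.0]
y = [10.0, 20.0, 45.0, 28.0]

# Pearson is the cosine of the mean-centred vectors.
print(pearson(x, y), cosine(centred(x), centred(y)))
# Pearson compares shapes: shifting and positively rescaling y changes nothing.
print(pearson(x, [3 * b + 100 for b in y]))
# Cosine ignores vector length, but a shift changes its value.
print(cosine(x, y), cosine(x, [3 * b for b in y]), cosine(x, [b + 100 for b in y]))
```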

Experimental design
This section explains the method and framework used in this study to evaluate the effect of similarity measures on clustering quality. The main objective of this research is to analyse the effect of different distance measures on the quality of clustering results. As illustrated in Fig 1, 15 datasets are used with 4 distance-based algorithms and a total of 12 distance measures. All the distance measures in Table 1 are examined except the Weighted Euclidean distance, which depends on the dataset and the aim of clustering. Fig 2 briefly explains the methodology of the study. For each dataset we examined all four distance-based algorithms, and the clustering quality of each algorithm was evaluated with each of the 12 distance measures, as demonstrated in Fig 1. This makes a total of 720 experiments (15 datasets × 4 algorithms × 12 measures) for analysing the effect of distance measures. Representing and comparing this large number of experiments is challenging and could not be done with ordinary charts and tables. Consequently, we developed a special illustration method using heat-mapped tables so that all the results can be read and understood quickly. This method is described in section 4.1.1.
Table 1. Summary of the distance measures with highlights of each.
Euclidean. Advantages: very common, easy to compute and works well with datasets with compact or isolated clusters [27,31].
Average. Advantages: better than Euclidean distance at handling outliers [35]. Disadvantages: variables contribute independently to the measure of distance; redundant values could dominate the similarity between data points [37].
Weighted Euclidean. Advantages: the weight matrix allows the effect of more important data points to be increased relative to less important ones [37]. Disadvantages: same as Average distance. Applications: fuzzy c-means algorithm [38].
Chord. Advantages: can work with unnormalized data [27]. Disadvantages: not invariant to linear transformation [33].
Mahalanobis. Advantages: a data-driven measure that can ease the distance distortion caused by a linear combination of attributes [35]. Disadvantages: can be expensive in terms of computation [33]. Applications: hyperellipsoidal clustering algorithm [30].
Cosine. Advantages: independent of vector length and invariant to rotation [33]. Disadvantages: not invariant to linear transformation [33].
Manhattan. Advantages: common and, like other Minkowski-driven distances, works well with datasets with compact or isolated clusters [27]. Disadvantages: sensitive to outliers [27,31]. Applications: k-means algorithm.

Rand Index
In this study, we used the Rand index (RI) to evaluate the clustering outcomes produced by the various distance measures. This section gives an overview of the measure and explains why it was chosen. The Rand index is frequently used for measuring clustering quality. It is a measure of agreement between two partitions of a set of objects: the first is produced by the clustering process and the other is defined by external criteria. Although there are other clustering validation measures, such as Sum of Squared Error, Entropy, Purity and Jaccard, the Rand index is probably the most used index for cluster validation [17,41,42]. Assuming $S = \{o_1, o_2, \ldots, o_n\}$ is a set of n elements and two partitions of S are given to compare, $C = \{c_1, c_2, \ldots, c_r\}$, a partition of S into r subsets, and $G = \{g_1, g_2, \ldots, g_s\}$, a partition of S into s subsets, the Rand index (R) is defined as $R = \frac{a + b}{a + b + c + d} = \frac{a + b}{\binom{n}{2}}$, where:
• a is the number of pairs of elements in S that are in the same set in C and in the same set in G.
• b is the number of pairs of elements in S that are in different sets in C and in different sets in G.
• c is the number of pairs of elements in S that are in the same set in C and in different sets in G.
• d is the number of pairs of elements in S that are in different sets in C and in the same set in G.
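The four pair counts listed above translate directly into code. The following is a minimal sketch that enumerates all pairs, which is quadratic in the number of elements and therefore only meant for illustration; partitions are represented as label lists, and the function name and example labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_c, labels_g):
    """Rand index between two partitions given as equal-length label lists.

    Requires at least two elements so that at least one pair exists.
    """
    if len(labels_c) != len(labels_g):
        raise ValueError("both partitions must label the same elements")
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_c)), 2):
        same_c = labels_c[i] == labels_c[j]
        same_g = labels_g[i] == labels_g[j]
        if same_c and same_g:
            a += 1  # same set in C, same set in G
        elif not same_c and not same_g:
            b += 1  # different sets in C, different sets in G
        elif same_c:
            c += 1  # same set in C, different sets in G
        else:
            d += 1  # different sets in C, same set in G
    return (a + b) / (a + b + c + d)


clustering = [0, 0, 1, 1, 2, 2]
ground_truth = [0, 0, 0, 1, 1, 1]
# Here a = 2, b = 8, c = 1, d = 4, so R = 10 / 15.
print(rand_index(clustering, ground_truth))
```

Because only co-membership of pairs is compared, the index does not depend on how the clusters are numbered.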
There is a modified version of the Rand index called the Adjusted Rand Index (ARI), proposed by Hubert and Arabie [42] as an improvement for known problems with the RI. These problems arise because the expected value of the RI of two random partitions does not take a constant value (zero, for example), and because the Rand statistic approaches its upper limit of unity as the number of clusters increases. However, since our datasets do not have these problems, and since the results generated using ARI followed the same pattern as the RI results, we used the Rand index in this study owing to its popularity in the clustering community for cluster validation. In this study we normalized the Rand index values for the experiments. The normalized values are between 0 and 1, and we used the following formula to obtain them: $Z_i = \frac{r_i - \min(r)}{\max(r) - \min(r)}$, where $r = (r_1, \ldots, r_n)$ is the array of Rand indexes produced by each similarity measure.
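A minimal sketch of rescaling an array of Rand indexes to the range 0 to 1, assuming min-max scaling, which is one common normalization consistent with the stated range. The input values are hypothetical, and the handling of a constant array is a choice made here for illustration.

```python
def min_max_normalize(r):
    """Rescale the values in r linearly so that min(r) -> 0 and max(r) -> 1."""
    lo, hi = min(r), max(r)
    if hi == lo:
        # All measures scored the same; return zeros to avoid division by zero.
        return [0.0 for _ in r]
    return [(v - lo) / (hi - lo) for v in r]


# Hypothetical Rand indexes produced by four similarity measures on one dataset.
rand_indexes = [0.62, 0.71, 0.55, 0.80]
print(min_max_normalize(rand_indexes))
```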

Analysis of variance (ANOVA) test
Before continuing, the main hypothesis of this study needs to be tested: "distance measure has a considerable influence on clustering results". To show that distance measures cause significant differences in clustering quality, we used the ANOVA test. For this purpose we consider the null hypothesis: "distance measures do not have a significant influence on clustering quality". If the ANOVA test yields a very small p-value, the observed differences would be very unlikely under the null hypothesis, and consequently we can reject it. ANOVA, developed by Ronald Fisher [43], analyzes the differences among group means. It is a statistical test of whether the means of several groups are equal, and it generalizes the t-test to more than two groups. Statistical significance is achieved when the p-value is less than the significance level [44]. The p-value is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true [45].
For the ANOVA test we considered a table with the structure shown in Table 2, which covers the RI results for all four algorithms, each distance/similarity measure and all datasets. The results of the test are reported in Tables 3-6. The small Prob values indicate that the differences between the means of the columns are significant. From this we conclude that the similarity measures have a significant impact on clustering quality. The rest of this study inspects how these similarity measures influence clustering quality.
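The quantity behind a one-way ANOVA can be sketched with the standard library alone. The function below computes the F statistic as the ratio of the between-group mean square to the within-group mean square, where each group stands for one column of RI values (one similarity measure). The RI values are hypothetical, and the sketch stops at the F statistic: turning it into a p-value requires the F distribution with the returned degrees of freedom, which statistical packages provide.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical RI values: one list per similarity measure, one value per dataset.
ri_by_measure = [
    [0.71, 0.68, 0.74, 0.70],
    [0.69, 0.72, 0.70, 0.73],
    [0.52, 0.49, 0.55, 0.50],
]
print(one_way_anova_f(ri_by_measure))
```

A large F value relative to the F distribution with (df_between, df_within) degrees of freedom corresponds to a small p-value, which is the condition for rejecting the null hypothesis.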

Experimental Results
References to all data employed in this work are available in the acknowledgment section. A diverse set of similarity measures for continuous data was studied on 15 low and high-dimensional continuous datasets [18,19,46-49] in order to clarify and compare the accuracy of each similarity measure under various dimensionality situations. Details of the datasets applied in this study are presented in Table 7.
The experiments were conducted using partitioning (k-means and k-medoids) and hierarchical algorithms, which are distance-based. As discussed in section 3.2, the Rand index served to evaluate and compare the results. The results for each of these algorithms are discussed later in this section.
The k-means and k-medoids algorithms were used in this experiment as partitioning algorithms, and the Rand index served for accuracy evaluation. Because the results of the k-means and k-medoids algorithms depend on the initial, randomly selected centers, and in some cases their accuracy might be affected by a local minimum trap, the experiment was repeated 100 times for each similarity measure, after which the maximum Rand index was considered for comparison.
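The repeat-and-keep-the-maximum protocol can be sketched as follows. This is an illustrative reimplementation, not the code used in the study: it assumes a basic Lloyd-style k-means with a pluggable distance function, updates centers by the arithmetic mean (which is the exact minimizer only for squared Euclidean distance and a simplification for other measures), and uses a tiny hypothetical dataset.

```python
import math
import random
from itertools import combinations


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def rand_index(labels_c, labels_g):
    # Fraction of pairs on which the two partitions agree (same/same or diff/diff).
    pairs = list(combinations(range(len(labels_c)), 2))
    agree = sum(
        (labels_c[i] == labels_c[j]) == (labels_g[i] == labels_g[j]) for i, j in pairs
    )
    return agree / len(pairs)


def kmeans(points, k, distance, rng, max_iter=100):
    """Lloyd-style k-means with randomly chosen initial centers."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: distance(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def best_rand_index(points, truth, k, distance, runs=100, seed=0):
    """Repeat k-means with different random initial centers; keep the maximum RI."""
    rng = random.Random(seed)
    return max(
        rand_index(kmeans(points, k, distance, rng), truth) for _ in range(runs)
    )


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
truth = [0, 0, 0, 1, 1, 1]
print(best_rand_index(points, truth, 2, euclidean, runs=10))
```

Passing a different `distance` function changes only the assignment step, which is the simplest way to compare measures within the same loop.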

Illustration technique
A summary of the normalized Rand index results is illustrated in color scale tables in Fig 3 and Fig 4. Since the aim of this study is to investigate and evaluate the accuracy of similarity measures for datasets of different dimensionality, the tables are organized by horizontally ascending dataset dimension. After the first column, which contains the names of the similarity measures, the rest of the table is divided into two batches of columns (low and high-dimensional) that show the normalized Rand indexes for low and high-dimensional datasets, respectively. The final column in the table is the 'overall average', which is used to identify the most accurate similarity measure in general. This illustration structure and approach is used for all four algorithms in this paper. On the other hand, for high-dimensional datasets, the Coefficient of Divergence is the most accurate with the highest Rand index values. Fig 4 provides the results for the k-medoids algorithm. Mean Character Difference is the most precise measure for low-dimensional datasets, while the Cosine measure yields better results in terms of accuracy for high-dimensional datasets. Overall, Mean Character Difference has high accuracy for most datasets. As a general result for the partitioning algorithms used in this study, average distance gives more accurate and reliable outcomes for both algorithms. It is the most accurate measure in the k-means algorithm and at the same time, with very little difference, it stands in second place after Mean Character Difference for the k-medoids algorithm.

Benchmarking similarity measures for partitioning algorithms
From another perspective, the similarity measures can be investigated to clarify which would lead k-means to converge faster. However, the k-means and k-medoids algorithms are not guaranteed to converge to the global optimum, because they can fall into a local minimum trap. For this reason we ran the algorithm 100 times to prevent bias caused by this weakness. Fig 5 shows two sample box charts, created from normalized data, that represent the normalized iteration count needed for convergence with each similarity measure. Results were collected after repeating the k-means algorithm 100 times for each similarity measure and dataset. Regarding the discussion on Rand index and iteration count, it is evident that the Average measure is not only accurate in most datasets and with both the k-means and k-medoids algorithms, but is also the second fastest similarity measure after Pearson in terms of convergence, making it a safe choice when clustering with the k-means or k-medoids algorithms.

Benchmarking similarity measures for hierarchical algorithms
In a previous section, the influence of different similarity measures on the k-means and k-medoids algorithms as partitioning algorithms was evaluated and compared. In this section, the results for the Single-link and Group Average algorithms, which are two hierarchical clustering algorithms, are discussed.
A review of the results and discussions on the k-means, k-medoids, Single-link and Group Average algorithms reveals that, considering the overall results, the Average measure is regularly among the most accurate measures for all four algorithms.
According to the heat map tables, Pearson correlation behaves differently from the other distance measures. It shows especially weak results with the centroid-based algorithms, k-means and k-medoids. Based on the results of this research, Pearson correlation generally does not work well for low-dimensional datasets, while it shows better results for high-dimensional datasets. Fig 11, which illustrates the overall average RI across all 4 algorithms and all 15 datasets, upholds the same conclusion. Fig 12, on the other hand, shows the average RI for the 4 algorithms separately. It can be inferred that the Average measure is more accurate than the other measures.
Furthermore, with the k-means algorithm, this similarity measure is the fastest after Pearson in terms of convergence.

Concluding Remarks
Selecting the right distance measure is one of the challenges encountered by professionals and researchers when attempting to deploy a distance-based clustering algorithm to a dataset. The variety of similarity measures can cause confusion and difficulties in choosing a suitable measure. Similarity measures may perform differently for datasets with diverse dimensionalities. The aim of this study was to clarify which similarity measures are more appropriate for low-dimensional and which perform better for high-dimensional datasets in the experiments. In this work, similarity measures for clustering numerical data in distance-based algorithms were compared and benchmarked using 15 datasets categorized as low and high-dimensional datasets. The accuracy of similarity measures in terms of the Rand index was studied, and the best similarity measures for each of the low and high-dimensional datasets were discussed for four well-known distance-based algorithms. Overall, the results indicate that Average Distance is among the most accurate measures for all clustering algorithms employed in this article. Moreover, this measure is one of the fastest in terms of convergence when k-means is the target clustering algorithm. Based on the results of this study, Pearson correlation is in general not recommended for low-dimensional datasets. It is also not well suited to centroid-based algorithms. However, this measure is recommended mainly for high-dimensional datasets and for use with hierarchical approaches.