A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data

  • Ali Seyed Shirkhorshidi,

    shirkhorshidi_ali@yahoo.co.uk

    Affiliation Department of Information Systems, Faculty of Computer Science and Information Technology, University of Malaya, 50603, Kuala Lumpur, Malaysia

  • Saeed Aghabozorgi,

    Affiliation IBM Analytics, Platform, Emerging Technologies, IBM Canada Ltd., Markham, Ontario L6F 1C7, Canada

  • Teh Ying Wah

    Affiliation Department of Information Systems, Faculty of Computer Science and Information Technology, University of Malaya, 50603, Kuala Lumpur, Malaysia

Abstract

Similarity or distance measures are core components used by distance-based clustering algorithms to cluster similar data points into the same clusters, while dissimilar or distant data points are placed into different clusters. The performance of similarity measures is mostly addressed in two or three-dimensional spaces, beyond which, to the best of our knowledge, there is no empirical study that has revealed the behavior of similarity measures when dealing with high-dimensional datasets. To fill this gap, a technical framework is proposed in this study to analyze, compare and benchmark the influence of different similarity measures on the results of distance-based clustering algorithms. For reproducibility purposes, fifteen publicly available datasets were used for this study, and consequently, future distance measures can be evaluated and compared with the results of the measures discussed in this work. These datasets were classified as low and high-dimensional categories to study the performance of each measure against each category. This research should help the research community to identify suitable distance measures for datasets and also to facilitate a comparison and evaluation of the newly proposed similarity or distance measures with traditional ones.

Introduction

One of the biggest challenges of this decade concerns databases that hold a variety of data types. Variety is among the key notions in the emerging concept of big data, which is known by the 4 Vs: Volume, Velocity, Variety and Variability [1,2]. Databases currently contain many data types, including interval-scaled variables (salary, height), binary variables (gender), categorical variables (religion: Jewish, Muslim, Christian, etc.) and mixed-type variables (multiple attributes with various types). Regardless of data type, the distance measure is a main component of distance-based clustering algorithms. Partitioning algorithms such as k-means and k-medoids, and more recently soft clustering approaches such as fuzzy c-means [3] and rough clustering [4], depend mainly on distance measures to recognize clusters in a dataset.

In data mining, many techniques use distance measures to some extent. Clustering is a well-known technique for knowledge discovery in various scientific areas, such as medical image analysis [5–7], clustering gene expression data [8–10], investigating and analyzing air pollution data [11–13], power consumption analysis [14–16], and many more fields of study. Improving clustering performance has always been a target for researchers. Since similarity or dissimilarity (distance) measures are the core components of distance-based clustering algorithms, their efficiency directly influences the performance of those algorithms. These algorithms use similarity or distance measures to cluster similar data points into the same clusters, while dissimilar or distant data points are placed into different clusters. Examples of distance-based clustering algorithms include partitioning algorithms, such as k-means and k-medoids, and hierarchical clustering [17].

Although various studies have compared similarity/distance measures for clustering numerical data, this study differs from existing work in two ways. First, its aim is to investigate similarity/distance measures against low-dimensional and high-dimensional datasets and to analyze their behavior in this context. Second, our datasets come from a variety of applications and domains, whereas other works are confined to a specific domain. In essence, the target of this research is to compare and benchmark similarity and distance measures for clustering continuous data and to examine their performance when they are applied to low and high-dimensional datasets. For the sake of reproducibility, fifteen publicly available datasets [18,19] were used for this study, so future distance measures can be evaluated and compared with the results of the traditional measures discussed here. These datasets are classified as low or high-dimensional, and each measure is studied against each category. Before studying similarity or dissimilarity measures, however, it must be established that they have a significant influence on clustering quality and are therefore worth studying. Section 3 (methodology) shows that similarity or distance measures have a significant influence on clustering results.

The key contributions of this paper are as follows:

  • Twelve similarity measures frequently used for clustering continuous data from various fields are compiled in this study to be evaluated in a single framework. Most of these similarity measures have not been examined in domains other than the originally proposed one.
  • A technical framework is proposed in this study to analyze, compare and benchmark the influence of different similarity measures on the result of distance-based clustering algorithms.
  • Similarity measures are evaluated on a wide variety of publicly available datasets. Particularly, we evaluate and compare the performance of similarity measures for continuous data against datasets with low and high dimension.

The rest of the paper is organized as follows: section 2 gives a background on distance measures. Section 3 explains the methodology of the study. Experimental results with a discussion are presented in section 4, and section 5 summarizes the contributions of this study.

Background on Distance Measures for Continuous Data

The use of similarity measures is not limited to clustering; many data mining algorithms use similarity measures to some extent. To reveal the influence of various distance measures on data mining, researchers have done experimental studies in various fields and have compared and evaluated the results generated by different distance measures. Although it is not practical to identify a single best-performing similarity measure in general, a comparison study can shed light on the performance and behavior of measures. For instance, Boriah et al. conducted a comparison study on similarity measures for categorical data and evaluated them in the context of outlier detection [20]. It was concluded that the performance of an outlier detection algorithm is significantly affected by the similarity measure. In their research, it was not possible to identify a best-performing similarity measure, but they analyzed and reported the situations in which a measure has poor or superior performance. In another research work, Fernando et al. [21] reviewed, compared and benchmarked binary-based similarity measures for categorical data. Using several case studies, Deshpande et al. focused on data from a single knowledge area, biological data, and compared profile similarity measures for genetic interaction networks. They concluded that the Dot Product is consistently among the best measures across different conditions and genetic interaction datasets [22].

Similarly, in the context of clustering, studies have been done on the effects of similarity measures. In one study, Strehl and colleagues tried to recognize the impact of similarity measures on web clustering [23]. In another, six similarity measures were assessed, this time for trajectory clustering in outdoor surveillance scenes [24]. In chemical databases, Al Khalifa et al. [25] examined the performance of twelve coefficients for clustering, similarity searching and compound selection. From the results they concluded that no single coefficient is appropriate for all methodologies.

Despite these studies, no empirical analysis and comparison is available that investigates the behavior of similarity measures for clustering continuous data in low and high-dimensional datasets. In addition, our datasets come from a variety of applications and domains, whereas the studies above are each limited to a specific domain. In this study, we gather known similarity/distance measures available for clustering continuous data, which are examined using various clustering algorithms and against 15 publicly available datasets. It is not possible to introduce a perfect similarity measure for all kinds of datasets, but in this paper we examine how similarity measures respond to low and high-dimensional datasets. The similarity measures with the best results in each category are also reported.

Before presenting the similarity measures for clustering continuous data, a definition of a clustering problem should be given. Assuming that the number of clusters required to be created is an input value k, the clustering problem is defined as follows [26]:

Definition 1

Given a dataset $D = \{v_1, v_2, \ldots, v_n\}$ of data vectors and an integer value $k$, the clustering problem is to define a mapping $f: D \to \{1, \ldots, k\}$ where each $v_i$ is assigned to one cluster $C_j$, $1 \le j \le k$. A cluster $C_j$ contains precisely those data vectors mapped to it; that is, $C_j = \{v_i \mid f(v_i) = j,\ 1 \le i \le n,\ \text{and } v_i \in D\}$.

In the rest of this study, $v_1$, $v_2$ represent two data vectors defined as $v_1 = \{x_1, x_2, \ldots, x_n\}$, $v_2 = \{y_1, y_2, \ldots, y_n\}$, where $x_i$, $y_i$ are called attributes.

Subsequently, similarity measures for clustering continuous data are discussed. Some of these similarity measures are frequently employed for clustering purposes while others have scarcely appeared in literature.

Minkowski

The Minkowski family includes Euclidean distance and Manhattan distance, which are particular cases of the Minkowski distance [27–29]. The Minkowski distance is defined by $d_{min}(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^m \right)^{1/m}$, where $m$ is a positive real number and $x_i$ and $y_i$ are the $i$-th attributes of two vectors $x$ and $y$ in $n$-dimensional space. The Minkowski distance performs well when the dataset clusters are isolated or compact; if the dataset does not fulfil this condition, the largest-scale attributes dominate the others [30,31]. Normalizing the continuous features is the solution to this problem [31].
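The effect of attribute scale on the Minkowski distance can be illustrated with a small sketch. The records below are hypothetical (salary and height), and min-max scaling is used here as one common normalization; the paper does not state which normalization it applied.

```python
def minkowski(x, y, m):
    """Minkowski distance of order m between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)


def min_max_scale(rows):
    """Rescale every column of a list of rows to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


# Hypothetical records: (salary, height in metres).
a, b, c = [30000.0, 1.60], [30100.0, 1.95], [31000.0, 1.61]

# On raw data the distance is almost exactly the salary gap of 100;
# the large height difference between a and b is swamped.
raw_ab = minkowski(a, b, 2)

# After scaling, the height difference is no longer swamped by salary.
sa, sb, sc = min_max_scale([a, b, c])
scaled_ab = minkowski(sa, sb, 2)
```

Setting `m = 1` or `m = 2` in `minkowski` gives the Manhattan and Euclidean special cases.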

A modified version of the Minkowski metric has been proposed to solve clustering obstacles. For example, Wilson and Martinez presented distance based on counts for nominal attributes and a modified Minkowski metric for continuous features [32].

Manhattan distance

Manhattan distance is a special case of the Minkowski distance at m = 1. Like its parent, Manhattan is sensitive to outliers. When this distance measure is used in clustering algorithms, the shape of clusters is hyper-rectangular [33]. A study by Perlibakas demonstrated that a modified version of this distance measure is among the best distance measures for PCA-based face recognition [34]. This measure is defined as $d_{man}(x, y) = \sum_{i=1}^{n} |x_i - y_i|$.

Euclidean distance

The most well-known distance used for numerical data is probably the Euclidean distance. This is a special case of the Minkowski distance when m = 2. Euclidean distance performs well when applied to datasets that include compact or isolated clusters [30,31]. Although Euclidean distance is very common in clustering, it has a drawback: if two data vectors have no attribute values in common, they may have a smaller distance than another pair of data vectors containing the same attribute values [31,35,36]. Another problem with Euclidean distance, as a member of the Minkowski family, is that the largest-scaled feature dominates the others. Normalization of continuous features is a solution to this problem [31].

Average distance

To address the above-mentioned drawback of Euclidean distance, the average distance is a modified version of the Euclidean distance intended to improve the results [27,35]. For two data points $x$, $y$ in $n$-dimensional space, the average distance is defined as $d_{ave}(x, y) = \left( \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2 \right)^{1/2}$.
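As a minimal sketch, assuming the standard root-mean-square form of the average distance, the following shows that it differs from the Euclidean distance by a factor of $1/\sqrt{n}$, so it does not grow merely because more dimensions are added. The all-ones and all-zeros vectors are illustrative only.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_distance(x, y):
    """Root of the mean squared attribute difference."""
    n = len(x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / n)


# Vectors that differ by 1 in every attribute, in 4 and in 100 dimensions.
for n in (4, 100):
    x, y = [0.0] * n, [1.0] * n
    # Euclidean grows with sqrt(n); the average distance stays at 1.
    print(n, euclidean(x, y), average_distance(x, y))
```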

Weighted Euclidean distance

If the relative importance of each attribute is available, then the Weighted Euclidean distance, another modification of Euclidean distance, can be used [37]. This distance is defined as $d_{we}(x, y) = \left( \sum_{i=1}^{n} w_i (x_i - y_i)^2 \right)^{1/2}$, where $w_i$ is the weight given to the $i$-th component.

This is the only measure not included in the comparison in this study, since calculating the weights is closely tied to the dataset and to the researcher's aim in analyzing it. For an example of this measure in use, the reader can refer to the work of Ji et al., who used it to propose a dynamic fuzzy cluster algorithm for time series [38].

Chord distance

Chord distance is one more modification of Euclidean distance designed to overcome the previously mentioned shortcomings. It can also solve problems caused by the scale of measurements. Chord distance is defined as the length of the chord joining two normalized points within a hypersphere of radius one. This distance can be calculated from non-normalized data as well [27]. Chord distance is defined as $d_{chord}(x, y) = \left( 2 - 2 \frac{\sum_{i=1}^{n} x_i y_i}{\|x\|_2 \|y\|_2} \right)^{1/2}$, where $\|x\|_2$ is the L2-norm $\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$.
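The geometric definition and the formula for non-normalized data can be checked against each other with a short sketch. The helper names are illustrative, and the vectors are assumed to be non-zero.

```python
import math


def l2_norm(v):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(a * a for a in v))


def chord_distance(x, y):
    """Chord distance computed directly from non-normalized vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    cos_xy = dot / (l2_norm(x) * l2_norm(y))
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, 2.0 - 2.0 * cos_xy))


def euclidean_on_unit_sphere(x, y):
    """Euclidean distance after projecting both vectors to unit length."""
    nx, ny = l2_norm(x), l2_norm(y)
    return math.sqrt(sum((a / nx - b / ny) ** 2 for a, b in zip(x, y)))


x, y = [1.0, 2.0, 3.0], [3.0, 1.0, 2.0]
d_formula = chord_distance(x, y)
d_geometric = euclidean_on_unit_sphere(x, y)   # agrees up to rounding
d_rescaled = chord_distance(x, [10 * v for v in y])  # unchanged by rescaling y
```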

Mahalanobis distance

Mahalanobis distance is a data-driven measure, in contrast to Euclidean and Manhattan distances, which are independent of the dataset to which the two data points belong [20,33]. A regularized Mahalanobis distance can be used for extracting hyperellipsoidal clusters [30]. In addition, Mahalanobis distance can alleviate distortion caused by linear correlation among features by applying a whitening transformation to the data or by using the squared Mahalanobis distance [31]. Mahalanobis distance is defined by $d_{mah}(x, y) = \sqrt{(x - y) S^{-1} (x - y)^T}$, where $S$ is the covariance matrix of the dataset [27,39].
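A dependency-free sketch of this computation follows. It assumes the sample covariance (divisor $n - 1$) and avoids forming $S^{-1}$ explicitly by solving $S z = (x - y)$ with Gaussian elimination; the data values and helper names are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1) of a list of row vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def mahalanobis(x, y, cov):
    """Mahalanobis distance between x and y for covariance matrix cov."""
    diff = [a - b for a, b in zip(x, y)]
    z = solve(cov, diff)  # z = S^{-1} (x - y)
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))


# Hypothetical two-attribute dataset with strongly correlated features.
data = [[2.0, 2.1], [4.0, 3.9], [6.0, 6.2], [8.0, 7.8]]
S = covariance_matrix(data)
d = mahalanobis(data[0], data[3], S)
```

With an identity covariance matrix the function reduces to the Euclidean distance, which gives a quick sanity check.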

Cosine measure

The Cosine similarity measure is mostly used in document similarity [28,33] and is defined as $\mathrm{Cosine}(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\|x\|_2 \|y\|_2}$, where $\|y\|_2$ is the Euclidean norm of vector $y = (y_1, y_2, \ldots, y_n)$, defined as $\|y\|_2 = \sqrt{\sum_{i=1}^{n} y_i^2}$. The Cosine measure is invariant to rotation but varies under general linear transformations. It is also independent of vector length [33].
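The independence from vector length can be seen in a few lines. This is a sketch with illustrative term-count vectors, assuming both vectors are non-zero.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Hypothetical term counts for a short document and a document three
# times longer with the same term proportions.
short_doc = [2.0, 1.0, 0.0, 3.0]
long_doc = [3 * v for v in short_doc]

same_direction = cosine_similarity(short_doc, long_doc)   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])    # exactly 0
```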

Pearson correlation

Pearson correlation is widely used in clustering gene expression data [33,36,40]. This similarity measure calculates the similarity between the shapes of two gene expression patterns. The Pearson correlation is defined by $\mathrm{Pearson}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i=1}^{n} (x_i - \mu_x)^2} \sqrt{\sum_{i=1}^{n} (y_i - \mu_y)^2}}$, where $\mu_x$ and $\mu_y$ are the means of $x$ and $y$ respectively. The Pearson correlation has the disadvantage of being sensitive to outliers [33,40].
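Both properties mentioned above, shape matching and sensitivity to outliers, can be sketched with made-up expression profiles. The function assumes neither vector is constant, since a zero variance would make the denominator zero.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    num = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mu_x) ** 2 for a in x))
           * math.sqrt(sum((b - mu_y) ** 2 for b in y)))
    return num / den


# Hypothetical profiles: y has the same shape as x at a different scale.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
same_shape = pearson(x, y)          # close to 1 despite the scale change

# Replacing a single value with an outlier flips the sign of the result.
y_outlier = [2.0, 4.0, 6.0, 8.0, -40.0]
with_outlier = pearson(x, y_outlier)
```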

The similarity measures explained above are the most commonly used for clustering continuous data. Table 1 presents a summary of these measures with some highlights of each.

Table 1. Similarity Measures for continuous data (in time complexity, n is the number of dimensions of x and y).

https://doi.org/10.1371/journal.pone.0144059.t001

Methodology of the Study

3.1 Experimental design

This section explains the method and framework used in this study to evaluate the effect of similarity measures on clustering quality. The main objective of this research is to analyze the effect of different distance measures on the quality of clustering algorithm results. As illustrated in Fig 1, 15 datasets are used with 4 distance-based algorithms and a total of 12 distance measures. All the distance measures in Table 1 are examined except the Weighted Euclidean distance, which depends on the dataset and the aim of clustering.

Fig 2 briefly explains the methodology of the study. For each dataset we examined all four distance-based algorithms, and the clustering quality of each algorithm was evaluated with each of the 12 distance measures, as demonstrated in Fig 1. This gives a total of 15 × 4 × 12 = 720 experiments for analyzing the effect of distance measures. Representing and comparing this large number of experiments is a challenging task that cannot be done using ordinary charts and tables. Consequently, we developed a special illustration method using heat-mapped tables to present all the results in a way that can be read and understood quickly. This method is described in section 4.1.

3.2 Rand Index

In this study, we used the Rand index (RI) to evaluate the clustering outcomes produced by various distance measures. This section gives an overview of this measure and explains why it was chosen.

The Rand index is frequently used for measuring clustering quality. It is a measure of agreement between two partitions of a set of objects: the first is the partition produced by the clustering process and the other is defined by external criteria. Although there are other clustering quality measures, such as Sum of Squared Error, Entropy, Purity and Jaccard, the Rand index is probably the most used index for cluster validation [17,41,42]. Assume that S = {o1, o2, …, on} is a set of n elements and that two partitions of S are given to compare: C = {c1, c2, …, cr}, a partition of S into r subsets, and G = {g1, g2, …, gs}, a partition of S into s subsets. The Rand index (R) is then defined as follows:

Definition 2

$$R = \frac{a + b}{a + b + c + d} = \frac{a + b}{\binom{n}{2}} \qquad (1)$$

where:

  • a is the number of pairs of elements in S that are in the same set in C and in the same set in G.
  • b is the number of pairs of elements in S that are in different sets in C and in different sets in G.
  • c is the number of pairs of elements in S that are in the same set in C and in different sets in G.
  • d is the number of pairs of elements in S that are in different sets in C and in the same set in G.
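The four pair counts can be computed directly by enumerating all pairs of elements. The sketch below represents each partition as a list of cluster labels, which is an assumption of this example rather than a detail given in the paper, and it requires at least two elements.

```python
from itertools import combinations


def rand_index(labels_c, labels_g):
    """Rand index of two partitions given as label lists of equal length."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_c)), 2):
        same_c = labels_c[i] == labels_c[j]
        same_g = labels_g[i] == labels_g[j]
        if same_c and same_g:
            a += 1          # together in C, together in G
        elif not same_c and not same_g:
            b += 1          # apart in C, apart in G
        elif same_c:
            c += 1          # together in C, apart in G
        else:
            d += 1          # apart in C, together in G
    return (a + b) / (a + b + c + d)


# Label names are arbitrary: only the grouping matters.
identical = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because only pair agreement is counted, relabeling the clusters leaves the index unchanged.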

There is a modified version of the Rand index called the Adjusted Rand Index (ARI), proposed by Hubert and Arabie [42] to address known problems with RI. These problems arise because the expected value of the RI of two random partitions does not take a constant value (zero, for example), and because the Rand statistic approaches its upper limit of unity as the number of clusters increases. However, since our datasets do not have these problems, and since the results generated using ARI followed the same pattern as the RI results, we used the Rand index in this study owing to its popularity in the clustering community for cluster validation.

In this study we normalized the Rand index values for the experiments. The normalized values lie between 0 and 1 and were obtained with the following formula: $$\hat{r}_i = \frac{r_i - \min(r)}{\max(r) - \min(r)} \qquad (2)$$ where $r = (r_1, \ldots, r_n)$ is the array of Rand indexes produced by each similarity measure.
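A minimal sketch of such a normalization is given below, assuming min-max scaling, which is one standard way to map an array of values onto the interval from 0 to 1. The Rand index values are invented for illustration.

```python
def normalize(r):
    """Min-max scale a list of values to [0, 1] (assumed normalization)."""
    lo, hi = min(r), max(r)
    if hi == lo:
        # All measures scored the same; there is no spread to rescale.
        return [0.0 for _ in r]
    return [(v - lo) / (hi - lo) for v in r]


# Hypothetical Rand indexes of three measures on one dataset.
rand_indexes = [0.5, 0.75, 1.0]
normalized = normalize(rand_indexes)
```

Under this scaling the best measure on each dataset maps to 1 and the worst to 0, which makes color scales comparable across datasets.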

3.3 Analysis of variance (ANOVA) test

Before continuing this study, the main hypothesis needs to be verified: “distance measure has a considerable influence on clustering results”. To show that distance measures cause a significant difference in clustering quality, we used the ANOVA test. For this purpose we consider the null hypothesis: “distance measures do not have a significant influence on clustering quality”. If the p-value from the ANOVA test is very small, the observed differences would be very unlikely if the null hypothesis were true, and consequently we can reject it.

Analysis of variance (ANOVA), developed by Ronald Fisher [43], analyzes the differences among group means. It is a statistical test of whether the means of several groups are equal, and it generalizes the t-test to more than two groups. Statistical significance is achieved when the p-value is less than the significance level [44]. The p-value is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true [45].
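The statistic behind a one-way ANOVA can be sketched without external libraries. The function below computes only the F statistic and its degrees of freedom; obtaining the p-value additionally requires the F distribution, which is left out here. The groups are hypothetical Rand index columns, one per distance measure, and the code assumes the within-group variance is not zero.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of sample groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, means) for v in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical scores: one list per distance measure, one value per dataset.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(groups)
```

A large F value means the spread between the group means is large relative to the spread inside the groups, which corresponds to a small p-value.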

For the ANOVA test we considered a table with the structure shown in Table 2, which covers all RI results for all four algorithms, each distance/similarity measure and all datasets. The table is divided into 4 sections for the four respective algorithms. In each section, each row contains the results generated with the distance measures for one dataset.

Table 2. Rand Index values used for ANOVA test (in the table HAverage stands for Hierarchical Average algorithm and HSingle stands for Hierarchical Single link).

https://doi.org/10.1371/journal.pone.0144059.t002

The ANOVA test is performed for each algorithm separately to determine whether distance measures have a significant impact on clustering results for that algorithm.

The ANOVA test results for the above table are shown in Tables 3–6.

The small Prob values indicate that the differences between the means of the columns are significant. From this we can conclude that similarity measures have a significant impact on clustering quality. In the rest of this study we inspect how these similarity measures influence clustering quality.

Experimental Results

References to all data employed in this work are available in the acknowledgment section. A diverse set of similarity measures for continuous data was studied on 15 low and high-dimensional continuous datasets [18,19,46–49] in order to clarify and compare the accuracy of each similarity measure across datasets of various dimensionality. Details of the datasets used in this study are presented in Table 7.

The experiments were conducted using partitioning (k-means and k-medoids) and hierarchical algorithms, which are distance-based. As discussed in section 3.2, the Rand index served to evaluate and compare the results. The results for each of these algorithms are discussed later in this section.

The k-means and k-medoids algorithms were used in this experiment as partitioning algorithms, and the Rand index was used to evaluate accuracy. Because the results of k-means and k-medoids depend on the initial, randomly selected centers, and in some cases their accuracy might be affected by becoming trapped in a local minimum, the experiment was repeated 100 times for each similarity measure, after which the maximum Rand index was considered for comparison.
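The repeat-and-take-the-maximum protocol can be sketched as follows. This is not the authors' implementation: the k-means below is a plain textbook variant with a pluggable distance function, centers are always updated as attribute-wise means (a simplification when the distance is not Euclidean), and the toy points, labels and seed are hypothetical.

```python
import random
from itertools import combinations


def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def rand_index(c, g):
    """Fraction of element pairs on which the two labelings agree."""
    pairs = list(combinations(range(len(c)), 2))
    agree = sum((c[i] == c[j]) == (g[i] == g[j]) for i, j in pairs)
    return agree / len(pairs)


def kmeans(points, k, dist, rng, max_iter=100):
    """Basic k-means with random initial centers; returns cluster labels."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:      # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:               # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def best_rand_index(points, truth, k, dist, runs=100, seed=0):
    """Run k-means `runs` times and keep the maximum Rand index."""
    rng = random.Random(seed)
    return max(rand_index(kmeans(points, k, dist, rng), truth)
               for _ in range(runs))


# Two well-separated toy clusters with known labels.
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
          [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
truth = [0, 0, 0, 1, 1, 1]
best = best_rand_index(points, truth, k=2, dist=euclidean)
```

Passing a different function as `dist` is how another measure would be plugged into the same loop.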

4.1 Illustration technique

A summary of the normalized Rand index results is illustrated in color scale tables in Fig 3 and Fig 4. Since the aim of this study is to investigate and evaluate the accuracy of similarity measures for different dimensional datasets, the tables are organized based on horizontally ascending dataset dimensions. After the first column, which contains the names of the similarity measures, the remaining table is divided in two batches of columns (low and high-dimensional) that demonstrate the normalized Rand indexes for low and high-dimensional datasets, respectively. The final column considered in this table is ‘overall average’ in order to explore the most accurate similarity measure in general. This illustrational structure and approach is used for all four algorithms in this paper.

Fig 3. K-means color scale table for normalized Rand index values (green represents the highest and it changes to red, which is the lowest Rand index value).

https://doi.org/10.1371/journal.pone.0144059.g003

Fig 4. K-medoids color scale table for normalized Rand index values (green is the highest and changes color to red, which is the lowest Rand index value).

https://doi.org/10.1371/journal.pone.0144059.g004

4.2 Benchmarking similarity measures for partitioning algorithms

Fig 3 represents the results for the k-means algorithm. According to the figure, for low-dimensional datasets, the Mahalanobis measure has the highest results among all similarity measures. On the other hand, for high-dimensional datasets, the Coefficient of Divergence is the most accurate with the highest Rand index values. Fig 4 provides the results for the k-medoids algorithm. Mean Character Difference is the most precise measure for low-dimensional datasets, while the Cosine measure represents better results in terms of accuracy for high-dimensional datasets. Overall, Mean Character Difference has high accuracy for most datasets.

As a general result for the partitioning algorithms used in this study, average distance results in more accurate and reliable outcomes for both algorithms. It is the most accurate measure in the k-means algorithm and at the same time, with very little difference, it stands in second place after Mean Character Difference for the k-medoids algorithm.

From another perspective, similarity measures in the k-means algorithm can be investigated to clarify which leads k-means to converge faster. However, k-means and k-medoids are not guaranteed to converge to the global optimum because they can fall into a local minimum. For this reason we ran the algorithm 100 times to reduce bias from this weakness. Fig 5 shows two sample box charts created using normalized data, which represent the normalized iteration count needed for convergence with each similarity measure. Results were collected after repeating the k-means algorithm 100 times for each similarity measure and dataset.

Fig 5. Sample box charts for k-means iteration counts created with a collection of normalized results after 100 times of repeating the algorithm for each similarity measure and dataset.

https://doi.org/10.1371/journal.pone.0144059.g005

Fig 6 is a summarized color scale table representing the mean and variance of iteration counts for all 100 algorithm runs. Pearson has the fastest convergence in most datasets. After Pearson, Average is the fastest similarity measure in terms of convergence.

Fig 6. Color scale table for iteration count mean and variance (green is the lowest and it changes color to red, which shows the greatest iteration count value).

https://doi.org/10.1371/journal.pone.0144059.g006

Regarding the discussion on Rand index and iteration count, it is evident that the Average measure is not only accurate for most datasets with both k-means and k-medoids, but is also the second fastest similarity measure after Pearson in terms of convergence, making it a safe choice when clustering with k-means or k-medoids.

4.3 Benchmarking similarity measures for hierarchical algorithms

In the previous section, the influence of different similarity measures on the k-means and k-medoids partitioning algorithms was evaluated and compared. In this section, the results for the Single-link and Group Average algorithms, two hierarchical clustering algorithms, are discussed for each similarity measure in terms of the Rand index. Fig 7 and Fig 8 show sample bar charts of the results for 6 sample datasets. Because bar charts for all datasets and similarity measures would be cluttered, the results are presented using color scale tables for easier understanding and discussion. Following the illustration technique of section 4.1, Fig 9 and Fig 10 are two color scale tables that show the normalized Rand index values for each similarity measure. The results in Fig 9 for Single-link show that for low-dimensional datasets the Mahalanobis distance is the most accurate similarity measure, while Pearson is the best measure for high-dimensional datasets. The overall average column in this figure shows that, generally, Pearson presents the highest accuracy and the Average and Euclidean distances are among the most accurate measures. For the Group Average algorithm, as seen in Fig 10, Euclidean and Average are the best among all similarity measures for low-dimensional datasets. For high-dimensional datasets, Cosine and Chord are the most accurate measures. Generally, in the Group Average algorithm, Manhattan and Mean Character Difference have the best overall Rand index results, followed by Euclidean and Average. Considering the overall results, the Average measure is consistently among the best measures for both the Single-link and Group Average algorithms.

Fig 7. Bar chart of normalized Rand index values for selected datasets using the Single-link algorithm.

https://doi.org/10.1371/journal.pone.0144059.g007

Fig 8. Bar chart of normalized Rand index values for selected datasets using the Group Average algorithm.

https://doi.org/10.1371/journal.pone.0144059.g008

Fig 9. Color scale table of normalized Rand index values for the Single-link method (green is the highest and it changes color to red, which represents the lowest Rand index value).

https://doi.org/10.1371/journal.pone.0144059.g009

Fig 10. Color scale table of normalized Rand index values for Group Average (green is the highest and it changes color to red, which signifies the lowest Rand index value).

https://doi.org/10.1371/journal.pone.0144059.g010

A review of the results and discussions on the k-means, k-medoids, Single-link and Group Average algorithms reveals that by considering the overall results, the Average measure is regularly among the most accurate measures for all four algorithms.

According to the heat map tables, Pearson correlation behaves differently from the other distance measures. In particular, it shows very weak results with the centroid-based algorithms, k-means and k-medoids. Based on the results of this research, Pearson correlation in general does not work well for low-dimensional datasets, while it shows better results for high-dimensional datasets.

Fig 11, which illustrates the overall average RI across all 4 algorithms and all 15 datasets, upholds the same conclusion. Fig 12, on the other hand, shows the average RI for each of the 4 algorithms separately. It can be inferred that the Average measure is more accurate than the other measures.

Furthermore, with the k-means algorithm, this similarity measure is the fastest after Pearson in terms of convergence.

Concluding Remarks

Selecting the right distance measure is one of the challenges encountered by professionals and researchers when attempting to apply a distance-based clustering algorithm to a dataset. The variety of similarity measures can cause confusion and difficulties in choosing a suitable measure. Similarity measures may perform differently for datasets with diverse dimensionalities. The aim of this study was to clarify which similarity measures are more appropriate for low-dimensional and which perform better for high-dimensional datasets in the experiments. In this work, similarity measures for clustering numerical data in distance-based algorithms were compared and benchmarked using 15 datasets categorized as low and high-dimensional. The accuracy of similarity measures in terms of the Rand index was studied, and the best similarity measures for each of the low and high-dimensional categories were discussed for four well-known distance-based algorithms. Overall, the results indicate that Average Distance is among the most accurate measures for all clustering algorithms employed in this article. Moreover, this measure is one of the fastest in terms of convergence when k-means is the target clustering algorithm. Based on the results of this study, Pearson correlation is in general not recommended for low-dimensional datasets, and it is not well suited to centroid-based algorithms. However, this measure is recommended for high-dimensional datasets and with hierarchical approaches.

Acknowledgments

Ali Seyed Shirkhorshidi would like to express his sincere gratitude to Fatemeh Zahedifar and Seyed Mohammad Reza Shirkhorshidi, who helped in revising and preparing the paper.

Author Contributions

Conceived and designed the experiments: ASS SA TYW. Performed the experiments: ASS SA TYW. Analyzed the data: ASS SA TYW. Contributed reagents/materials/analysis tools: ASS SA TYW. Wrote the paper: ASS SA TYW.

References

  1. Shirkhorshidi AS, Aghabozorgi S, Wah TY, Herawan T. Big Data Clustering: A Review. Computational Science and Its Applications–ICCSA 2014. Springer; 2014. pp. 707–720.
  2. Mohebi A, Aghabozorgi S, Ying Wah T, Herawan T, Yahyapour R. Iterative big data clustering algorithms: a review. Softw Pract Exp. 2015; n/a–n/a.
  3. Bezdek JC, Ehrlich R, Full W. FCM: The fuzzy c-means clustering algorithm [Internet]. Computers & Geosciences. 1984. pp. 191–203.
  4. Peters G. Some refinements of rough k-means clustering. Pattern Recognit. 2006;39: 1481–1491.
  5. Cui W, Wang Y, Fan Y, Feng Y, Lei T. Localized FCM clustering with spatial information for medical image segmentation and bias field estimation. Int J Biomed Imaging. 2013;2013: 930301. pmid:23997761
  6. Ye J, Lazar NA, Li Y. Sparse geostatistical analysis in clustering fMRI time series. J Neurosci Methods. 2011;199: 336–345. pmid:21641934
  7. Meyer G. Chinrungrueng F. J. Spatiotemporal clustering of fMRI time series in the spectral domain. Med Image Anal. 2004;9: 51–68.
  8. An L, Doerge RW. Dynamic Clustering of Gene Expression [Internet]. ISRN Bioinformatics. 2012. pp. 1–12.
  9. De Souto MCP, Costa IG, de Araujo DS a, Ludermir TB, Schliep A. Clustering cancer gene expression data: a comparative study. BMC Bioinformatics. 2008;9: 497. pmid:19038021
  10. Ernst J, Nau GJ, Bar-Joseph Z. Clustering short time series gene expression data. Bioinformatics. 2005;21: i159–i168. pmid:15961453
  11. Moolgavkar SH, Mcclellan RO, Dewanji A, Turim J, Georg Luebeck E, Edwards M. Time-series analyses of air pollution and mortality in the United States: A subsampling approach. Environ Health Perspect. 2013;121: 73–78. pmid:23108284
  12. Ignaccolo R, Ghigo S, Bande S. Functional zoning for air quality. Environ Ecol Stat. 2013;20: 109–127.
  13. Carbajal-Hernández JJ, Sánchez-Fernández LP, Carrasco-Ochoa J a., Martínez-Trinidad JF. Assessment and prediction of air quality using fuzzy logic and autoregressive models. Atmos Environ. Elsevier Ltd; 2012;60: 37–50.
  14. Shen W, Babushkin V, Aung Z, Woon WL. An ensemble model for day-ahead electricity demand time series forecasting. Proc fourth Int Conf Futur energy Syst - e-Energy ’13. New York, New York, USA: ACM Press; 2013; 51.
  15. Iglesias F, Kastner W. Analysis of Similarity Measures in Times Series Clustering for the Discovery of Building Energy Patterns. Energies. 2013;6: 579–597.
  16. Wijk J Van, Selow E Van. Cluster and calendar based visualization of time series data. Proc 1999 IEEE Symp Inf Vis. IEEE Comput. Soc; 1999; 4–9.
  17. Aghabozorgi S, Seyed Shirkhorshidi A, Ying Wah T. Time-series clustering–A decade review. Inf Syst. 2015;53: 16–38.
  18. Bache K, Lichman M. UCI Machine Learning Repository [Internet]. 2013. Available: http://archive.ics.uci.edu/ml
  19. Speech and Image Processing Unit, University of Eastern Finland [Internet]. Available: http://cs.joensuu.fi/sipu/datasets/
  20. Boriah S, Chandola V, Kumar V. Similarity measures for categorical data: A comparative evaluation. In Proceedings of the eighth SIAM International Conference on Data Mining. 2008. pp. 243–254.
  21. Lourenco F, Lobo V, Bacao F. Binary-based similarity measures for categorical data and their application in Self-Organizing Maps. 2004; 1–18.
  22. Deshpande R, VanderSluis B, Myers CL. Comparison of Profile Similarity Measures for Genetic Interaction Networks. PLoS One. 2013;8: e68664. pmid:23874711
  23. Strehl A, Ghosh J, Mooney R. Impact of similarity measures on web-page clustering. Work Artif Intell Web …. 2000; 58–64. Available: http://www.aaai.org/Papers/Workshops/2000/WS-00-01/WS00-01-011.pdf
  24. Zhang Z, Huang K, Tan T. Comparison of similarity measures for trajectory clustering in outdoor surveillance scenes. Proceedings - International Conference on Pattern Recognition. IEEE; 2006. pp. 1135–1138.
  25. Khalifa A Al, Haranczyk M, Holliday J. Comparison of Nonbinary Similarity Coefficients for Similarity Searching, Clustering and Compound Selection. J Chem Inf Model. 2009;49: 1193–1201. pmid:19405526
  26. Dunham MH. Data Mining: Introductory and Advanced Topics. Upper Saddle River, New Jersey: Prentice Hall; 2003.
  27. Gan G, Ma C, Wu J. Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM Series on Statistics and Applied. Society for Industrial and Applied Mathematics; 2007.
  28. Han J, Kamber M, Pei J. Data mining: concepts and techniques. Morgan Kaufmann; 2006.
  29. Cha Sung-Hyuk. Comprehensive survey on distance/similarity measures between probability density functions. Int J Math Model methods Appl Sci. 2007;1: 300–307. doi: 10.1.1.154.8446
  30. Mao J, Jain AK. A self-organizing network for hyperellipsoidal clustering (HEC). IEEE Trans Neural Networks. 1996;7: 16–29. pmid:18255555
  31. Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Computing Surveys. ACM; 1999. pp. 264–323.
  32. Wilson D, Martinez T. Improved heterogeneous distance functions. JAIR. 1997;6: 1–34. Available: http://arxiv.org/abs/cs/9701101
  33. Xu R, Wunsch D. Survey of clustering algorithms [Internet]. IEEE Transactions on Neural Networks. 2005. pp. 645–678. pmid:15940994
  34. Perlibakas V. Distance measures for PCA-based face recognition. Pattern Recognit Lett. 2004;25: 711–724.
  35. Legendre P, Legendre L. Numerical ecology. Elsevier; 2012.
  36. Wang H, Wang H, Wang W, Wang W, Yang H, Yang H, et al. Clustering by pattern similarity in large data sets. 2002 ACM SIGMOD international conference on Management of Data. New York, New York, USA: ACM Press; 2002. p. 394.
  37. Hand D, Mannila H, Smyth P. Principles of data mining (adaptive computation and machine learning). Drug safety. 2001.
  38. Ji M, Xie F, Ping Y. A dynamic fuzzy cluster algorithm for time series. Abstr Appl Anal. 2013;2013: 1–7.
  39. János Abonyi BF. Cluster Analysis for Data Mining and System Identification. Springer; 2007.
  40. Jiang D, Tang C, Zhang A. Cluster analysis for gene expression data: A survey. IEEE Trans Knowl Data Eng. 2004;16: 1370–1386.
  41. Santos JM, Embrechts M. On the Use of the Adjusted Rand Index as a Metric for Evaluating Supervised Classification. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 2009. pp. 175–184.
  42. Hubert L, Arabie P. Comparing partitions. J Classif. Springer; 1985;2: 193–218.
  43. Fisher R. Statistical methods for research workers [Internet]. Edinburgh: Oliver and Boyd; 1925. Available: https://scholar.google.com/scholar?hl=en&q=Statistical+Methods+for+Research+Workers&btnG=&as_sdt=1%2C5&as_sdtp=#0
  44. Cumming G. Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis [Internet]. 2013. Available: https://books.google.com/books?hl=en&lr=&id=1W6laNc7Xt8C&oi=fnd&pg=PR1&dq=Understanding+The+New+Statistics:+Effect+Sizes,+Confidence+Intervals,+and+Meta-Analysis&ots=PuHRVGc55O&sig=cEg6l3tSxFHlTI5dvubr1j7yMpI
  45. Schlotzhauer S. Elementary statistics using JMP [Internet]. 2007. Available: https://books.google.com/books?hl=en&lr=&id=5JYM1WxGDz8C&oi=fnd&pg=PR3&dq=Elementary+Statistics+Using+JMP&ots=MZOht9zZOP&sig=IFCsAn4Nd9clwioPf3qS_QXPzKc
  46. Gionis A, Mannila H, Tsaparas P. Clustering aggregation. ACM Trans Knowl Discov Data. 2005;1: Article 4.
  47. Zahn CT. Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters. IEEE Trans Comput. 1971;C-20: 68–86.
  48. Veenman CJ, Reinders MJT, Backer E. A maximum variance cluster algorithm. IEEE Trans Pattern Anal Mach Intell. 2002;24: 1273–1280.
  49. Fu L, Medico E. FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinformatics. 2007;8: 3. pmid:17204155