
Unsupervised ranking of clustering algorithms by INFOMAX

Abstract

Clustering and community detection provide a concise way of extracting meaningful information from large datasets. An ever-growing plethora of data clustering and community detection algorithms has been proposed. In this paper, we address the question of ranking the performance of clustering algorithms for a given dataset. We show that, for hard clustering and community detection, Linsker's Infomax principle can be used to rank clustering algorithms: in brief, the algorithm that yields the highest entropy of the partition, for a given number of clusters, is the best one. We show, on a wide range of datasets of various sizes and topological structures, that the ranking provided by the entropy of the partition over a variety of partitioning algorithms is strongly correlated with the overlap with a ground truth partition. The code for the project is available at https://github.com/Sandipan99/Ranking_cluster_algorithms.

1 Introduction

Cluster analysis is being increasingly used across a wide range of applications, from biology and bioinformatics [1] to social networks [2], which has led to the development of a plethora of clustering algorithms. Given this, an obvious question arises: how do we evaluate the performance of these algorithms in terms of the clusters they produce? In this paper, we show evidence in support of the idea that the Infomax principle [3] provides an answer to this question.

Clustering problem: We focus on the problem of hard partitioning: given a list of objects (or data points) the problem is that of dividing them into groups of similar ones. In the computer science and pattern recognition literature, this problem is popularly known as clustering. A plethora of different algorithms have been proposed for clustering (see [4, 5] for reviews) based on different measures of similarity between the data points. A large part of this literature has focused on the time complexity of the methods, which is particularly relevant for big data.

Quality of clusters: In this paper, we focus on the quality, i.e., on the accuracy of the method in terms of the results produced. Several algorithms (see e.g. [6, 7]) have been proposed claiming superior performance, yet it has been proven that no single clustering algorithm simultaneously satisfies a set of basic desiderata of data clustering [8]. In addition, the criteria for assessing the quality or validity of a clustering structure are not unique [4, 5]. When no ground truth is available, which is typically the case, (internal) criteria have been proposed based on stability [9] or on generalisability with respect to sub-sampling [10]. When a ground truth is available, an external criterion is possible, based on the distance of the predicted clustering to the ground truth; yet the choice of distance measure is not unique [5]. Even in cases where comparison with a ground truth is possible, different algorithms are found to perform better in different cases, and the predicted structures may differ substantially from the ground truth [11].

Infomax principle for measuring quality: We primarily intend to show that the Infomax principle [3] provides a natural measure for ranking clustering algorithms, for a given dataset, with respect to an unknown ground truth. In brief, a clustering algorithm is a mapping from data points xi in a high dimensional feature space to a set of labels si. The amount of information that the cluster structure retains about the data is given by the mutual information I(x, s) = H[s] − H[s|x]. The Infomax principle states that the optimal representation is the one that maximizes I(s, x). In hard clustering H[s|x] = 0, so I(x, s) = H[s] coincides with the entropy of the labels. We can visualize clustering as a translation of a dataset into a set of symbols (the cluster labels) drawn from an alphabet of S letters, where S is the number of clusters. Each partitioning algorithm is thus a translator that converts high dimensional data into a message. Following Shannon [12], the entropy of the cluster labels s provides a natural measure of the amount of information that the algorithm extracts from the data. Infomax then prescribes that the algorithm that "uses the most informative language", i.e., the one with the highest H[s], should be preferred. This allows one to rank partitioning algorithms in a completely unsupervised fashion for a given dataset, which is the fundamental contribution of this paper. The criterion is internal, in the sense that it is based only on the data (i.e., it is unsupervised), but we validate it by showing that the obtained ranking has a positive correlation with the ranking by similarity to a ground truth in all of the cases analyzed, and that this correlation is strong in most cases. Our results are based on an extensive comparison across different algorithms, different similarity metrics and different datasets.
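The entropy score described above can be sketched in a few lines of Python; the helper name `partition_entropy` is ours, not from the paper's released code.

```python
import math
from collections import Counter

def partition_entropy(labels):
    """Shannon entropy H[s] (in bits) of a list of hard cluster labels."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

# Two candidate partitions of the same six points:
balanced = [0, 0, 1, 1, 2, 2]   # equally sized clusters
skewed   = [0, 0, 0, 0, 0, 1]   # one dominant cluster

# Infomax prefers the partition with the higher label entropy.
assert partition_entropy(balanced) > partition_entropy(skewed)
```

For equal cluster sizes the entropy reaches its maximum, log2(S) bits, which is why balanced partitions score highest under this criterion.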

Contributions: Our contributions in this paper are threefold:

  1. We propose a metric, the entropy of the partition H[s], which ranks clustering algorithms very efficiently in a completely unsupervised way (i.e., without considering the ground truth cluster structure).
  2. Through rigorous experiments across a wide range of datasets we show the effectiveness of our metric in ranking the performance of data clustering algorithms. In fact, the metric correlates remarkably with the distance from the ground truth for a widely varying taxonomy of ground truth structures, including (i) ground truths with different granularities, (ii) ground truths built from different attributes, (iii) a very small number of ground truth clusters, (iv) ground truth clusters with very few data points, (v) ground truth clusters of equal sizes and (vi) ground truth clusters with skewed sizes.
  3. The proposed metric also outperforms existing unsupervised metrics across all the datasets.

2 Background

In this section we present a brief overview of the related literature encompassing clustering algorithms and cluster quality measurement metrics used in our work.

2.1 Clustering algorithms

We consider two broad classes of clustering algorithms (i) hierarchical and (ii) partitional.

Hierarchical methods: These methods construct clusters by recursively merging data points in a bottom-up fashion: initially each data point is assigned to a cluster of its own, and clusters are merged until the desired number of clusters is obtained. Clusters are merged according to a chosen similarity measure; we consider both the city-block (l1) and Euclidean (l2) distances. Hierarchical clustering methods can be further classified by the manner in which the similarity between clusters is computed. We consider the following three classical schemes: (1) single linkage (SI) [13], (2) complete linkage (CO) [14] and (3) average linkage (AV) [15]. Note that 'l1SI' denotes single linkage with the city-block distance metric, and so on. We use this combination of acronyms for the algorithms and distance metrics in all results presented in the subsequent sections.
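A minimal sketch of these hierarchical variants using SciPy; the data, the number of clusters and the selection of variants below are illustrative, not the paper's setup.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))           # 30 points, 4 features

# A few of the linkage/distance combinations named in the text.
variants = {
    "l1SI": ("single", "cityblock"),
    "l2CO": ("complete", "euclidean"),
    "l2AV": ("average", "euclidean"),
}

S = 3                                  # desired number of clusters
for name, (method, metric) in variants.items():
    Z = linkage(X, method=method, metric=metric)      # build the dendrogram
    labels = fcluster(Z, t=S, criterion="maxclust")   # cut it at S clusters
    print(name, "->", len(set(labels)), "clusters")
```

The `criterion="maxclust"` cut is what lets a hierarchical method be compared against partitional ones at a fixed S, as done in the evaluation.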

We also consider BIRCH (BI) (balanced iterative reducing and clustering using hierarchies) [16], which improves upon the traditional hierarchical clustering methods. The algorithm commences by creating a height-balanced tree out of the data points, followed by execution of an agglomerative clustering method to obtain sub-clusters.

Partitional methods: Among partitional methods we consider k-means, affinity propagation and spectral clustering. k-means (KM) employs a squared-error minimization criterion and is the most commonly used clustering technique in this category. The algorithm starts with an initial set of cluster centers chosen at random; in each round, each instance is assigned to its nearest cluster center according to the distance between the two (we consider both l1 and l2 distances), and the centers are then updated.

Affinity propagation (AP) algorithm introduced in [7] is based on the concept of passing messages between the data points. Unlike k-means clustering which identifies an exemplar (centroid) for each cluster, AP considers every data point to be a possible exemplar, representing a cluster. The goal is to obtain an appropriate set of exemplars which represents all the clusters.

Spectral clustering (SP) [17] employs a low dimensional embedding of the similarity matrix of the data points, followed by clustering of the eigenvector components in the low dimensional space.
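The three partitional methods can be sketched with scikit-learn defaults; settings below are illustrative only, not the paper's exact configuration.

```python
import numpy as np
from sklearn.cluster import KMeans, AffinityPropagation, SpectralClustering

# Three well-separated blobs in 2D as toy data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(20, 2)) for c in (0.0, 5.0, 10.0)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
sc = SpectralClustering(n_clusters=3, random_state=0).fit(X)
ap = AffinityPropagation(random_state=0).fit(X)   # chooses the number of exemplars itself

print("KM clusters:", len(set(km.labels_)))
print("SC clusters:", len(set(sc.labels_)))
print("AP clusters:", len(set(ap.labels_)))
```

Note the design difference the text points out: KM and spectral clustering take S as input, whereas affinity propagation discovers its own set of exemplars.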

2.2 Quality of cluster structure

The metrics available for determining the quality of clusters and thereby evaluating the performance of the clustering algorithms can be categorized as (i) external or supervised, which utilizes a benchmark or a ground truth cluster structure to determine quality and (ii) internal or unsupervised, which takes into account only the similarity between the data points used for clustering.

External metrics. The most commonly used external metrics are (i) purity [18], (ii) normalized mutual information (NMI) [19] and (iii) adjusted Rand index (ARI) [20]. We explain them below.

Let Ω = (ω1, ω2, …, ωK) represent the set of clusters, C = (c1, c2, …, cJ) denote the set of ground truth classes, and N the number of data points.

  1. Purity: The purity between Ω and C is calculated as
     \[ \mathrm{Purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j| \tag{1} \]
  2. Normalized mutual information (NMI): The NMI between Ω and C is calculated as
     \[ \mathrm{NMI}(\Omega, C) = \frac{I(\Omega; C)}{[H(\Omega) + H(C)]/2} \tag{2} \]
     where I(Ω; C) is the mutual information, defined as
     \[ I(\Omega; C) = \sum_{k} \sum_{j} \frac{|\omega_k \cap c_j|}{N} \log \frac{N\,|\omega_k \cap c_j|}{|\omega_k|\,|c_j|} \tag{3} \]
     and H(·) is the entropy, e.g. \( H(\Omega) = -\sum_k \frac{|\omega_k|}{N} \log \frac{|\omega_k|}{N} \).
  3. Adjusted Rand index (ARI): ARI is a chance-corrected version of the Rand index, and its value between Ω and C is calculated as
     \[ \mathrm{ARI} = \frac{\sum_{kj} \binom{n_{kj}}{2} - \left[\sum_k \binom{a_k}{2} \sum_j \binom{b_j}{2}\right] / \binom{N}{2}}{\frac{1}{2}\left[\sum_k \binom{a_k}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_k \binom{a_k}{2} \sum_j \binom{b_j}{2}\right] / \binom{N}{2}} \tag{4} \]
     where n_kj = |ω_k ∩ c_j|, a_k is the size of ω_k and b_j is the size of c_j. Other measures include the Jaccard index [21], Dice index [22] and Fowlkes-Mallows index [23].
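The three external metrics can be computed as follows: purity by hand from the contingency table, NMI and ARI via scikit-learn. The toy labels are our own illustration.

```python
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

def purity(labels_true, labels_pred):
    """Fraction of points covered by each cluster's majority class (Eq. 1)."""
    cm = contingency_matrix(labels_true, labels_pred)  # rows: classes, cols: clusters
    return cm.max(axis=0).sum() / cm.sum()

truth = [0, 0, 0, 1, 1, 1]
pred  = [0, 0, 1, 1, 1, 1]

print("purity:", purity(truth, pred))                         # 5/6 here
print("NMI   :", normalized_mutual_info_score(truth, pred))
print("ARI   :", adjusted_rand_score(truth, pred))
```

ARI, unlike purity, is corrected for chance: random labelings score near zero, and a perfect match scores exactly one.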

Internal metrics. Internal metrics for evaluation include the Davies-Bouldin index [24], Silhouette [25] and Dunn index [26]. Among these we compare our proposed metric with the Davies-Bouldin index (DB) and Silhouette (SH). DB is calculated as
\[ DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \tag{5} \]
where n is the number of clusters, c_x is the centroid of cluster x, σ_x represents the average distance of all elements in cluster x to the centroid c_x, and d(c_i, c_j) is the distance between centroids c_i and c_j.

For each data point, SH is computed from the mean intra-cluster distance a and the mean distance b to the nearest cluster that the point is not a part of, with the score given by (b − a)/max(a, b). The overall score is the mean over all data points.
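Both internal baselines are available in scikit-learn; a minimal sketch on synthetic, well-separated data (our own toy setup, not one of the paper's datasets):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Three tight, well-separated blobs.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(25, 2)) for c in (0, 4, 8)])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Lower DB is better; higher Silhouette (in [-1, 1]) is better.
print("Davies-Bouldin:", davies_bouldin_score(X, labels))
print("Silhouette    :", silhouette_score(X, labels))
```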

3 Proposed metric

In this section we first discuss the clustering problem and then introduce our proposed metric which ranks the clustering and community detection algorithms in a completely unsupervised way.

3.1 Clustering problem

Consider a dataset composed of M points in a high dimensional feature space (d ≫ 1). The primary objective of clustering is to assign each point xi a label si that indicates the partition to which xi belongs. If there are S partitions, si can be taken as an integer between 1 and S. A data clustering algorithm [4, 5] partitions objects xi into groups or clusters of "similar" objects, where similarity is defined in terms of a metric distance.

With numerous clustering algorithms available for this task and ground truth not always available, we propose in this paper a metric that ranks these algorithms by their performance in a completely unsupervised way (i.e., without considering a ground truth partition).

3.2 Infomax based metric

For a given dataset and number of clusters S, each algorithm assigns to each point xi in the sample a label si in an alphabet of S possible labels. Loosely speaking, each algorithm translates the data into a message of a language written in this alphabet. The information content of this message can be quantified by the Shannon entropy. Assuming the order in which the data occur to be uninformative, as is often the case, the information is stored uniquely in the symbol frequencies, i.e., in the number Ks of times that a symbol s occurs (which is the size of cluster s). As an estimate of the number of bits of information per character in the message we take
\[ H[s] = -\sum_{s=1}^{S} \frac{K_s}{M} \log_2 \frac{K_s}{M} \tag{6} \]

The Infomax principle [27] suggests a natural and universal criterion for scoring different algorithms: if algorithm A1 extracts more information than A2 from a dataset, i.e. if H[s^(A1)] > H[s^(A2)], then A1 should be preferred. For a given dataset and a fixed S, H[s] can be measured on the clusters predicted by different algorithms, thereby providing an unsupervised ranking of the algorithms. To summarize, given the output of an algorithm consisting of S clusters, our metric quantifies the quality of the clustering by computing the entropy of the cluster labels. We illustrate this with a toy example in Fig 1.

Fig 1. In this example there are 20 points that need to be clustered.

The number of clusters is set to 5 and we deploy two algorithms, A1 and A2, which generate clusters of sizes {5, 5, 4, 4, 2} and {7, 8, 3, 1, 1} respectively. Our metric assigns a higher score to the cluster output of A1 (2.26), and thus infers it to be better than A2.

https://doi.org/10.1371/journal.pone.0239331.g001
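The toy comparison of Fig 1 can be reproduced directly; `entropy_bits` is a hypothetical helper name, and the cluster sizes are those quoted in the caption.

```python
import math

def entropy_bits(sizes):
    """Entropy (Eq. 6) of a partition given its cluster sizes."""
    n = sum(sizes)
    return -sum((k / n) * math.log2(k / n) for k in sizes)

h1 = entropy_bits([5, 5, 4, 4, 2])   # algorithm A1: fairly balanced
h2 = entropy_bits([7, 8, 3, 1, 1])   # algorithm A2: skewed sizes

print(round(h1, 2), round(h2, 2))    # A1 scores 2.26, higher than A2
assert h1 > h2
```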

3.3 Advantages

The proposed metric has several advantages, which we summarize below:

  • Model-free. The proposed metric is model-free which allows for its application across any clustering algorithm and dataset.
  • Information theory-based. Unlike the existing internal metrics, our metric builds upon information theory, which is deeply rooted in the existing literature, making it more reliable.
  • Outperforms existing metrics. Our metric consistently outperforms the existing internal metrics across numerous datasets (refer to section 6 for details).
  • Unsupervised. In contrast to the existing external metrics, our metric does not require ground truth cluster structure making it completely unsupervised and hence suited to a wide range of datasets. Even though it requires less information, the proposed metric provides comparable performance to the external metrics (refer to section 6 for details).

4 Datasets

In this section we briefly discuss the datasets that we have used in this paper.

Abalone: The Abalone dataset https://archive.ics.uci.edu/ml/datasets/Abalone consists of abalone samples classified by age, which is essentially the number of shell rings [28]. The dataset consists of 4177 instances, each with 8 attributes. The task is treated as a classification problem, and there are 28 clusters in the ground truth.

Football: The Football network [29] http://www-personal.umich.edu/mejn/netdata/ consists of American football games between Division IA colleges during the regular season of Fall 2000. The vertices in the network are the football teams, identified by the respective college names, and an edge in the network represents a regular season game between two teams. The teams are divided into conferences containing around 8–12 teams each. Games are more frequent between members of the same conference than between members of different conferences; each conference therefore represents a ground truth community in the network. Note that the vertices in the network are devoid of any inherent features; we hence represent each vertex by vectors of (i) neighborhood (1 if the corresponding vertex is a neighbor and 0 otherwise) and (ii) shortest path (length of the shortest path to the corresponding vertex).
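The two vertex representations (neighborhood indicator and shortest-path length) can be sketched with SciPy; the 5-node path graph below is a stand-in, not the Football network.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

# Toy adjacency matrix of a 5-node path graph: 0-1-2-3-4.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1

# Representation (i): row i is the neighborhood indicator vector of node i.
neighborhood = A

# Representation (ii): row i holds shortest-path lengths from node i.
shortest = shortest_path(A, unweighted=True)

print(shortest[0])   # distances from node 0 to every node
```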

Railway: The Indian railway network was proposed in [30] http://www.cnergres.iitkgp.ac.in/permanence/ and it consists of stations (nodes) and edges between all pairs of stations that are connected by at least one train-route (both stations must be scheduled halts on the train-route). The weight of the edge between two stations is the number of train-routes on which both these stations are scheduled halts. We filter out the low-weight edges and then make the resultant network unweighted. The states act as communities since the number of trains within each state is much higher than the number of trains in between two states. Similar to the Football dataset we again obtain two representations of each vertex (neighborhood and shortest path).

Wine: We consider two wine datasets namely Red and White wine [31] http://archive.ics.uci.edu/ml/datasets/Wine+Quality. The datasets respectively contain samples of red and white wines. Each wine sample is associated with 11 attributes like fixed acidity, volatility, residual sugar etc. Each wine sample is also graded by experts between 0 (very bad) and 10 (very excellent) based on the quality. This quality score acts as the ground truth cluster for the two datasets.

Leaf: The leaf dataset [32] https://archive.ics.uci.edu/ml/datasets/One-hundred+plant+species+leaves+data+set consists of 100 varieties of leaves, with 16 examples per variety. Each leaf sample is associated with a shape, texture and margin feature, each a vector of 64 elements. Each variety of leaf acts as a ground truth cluster.

TREC: The TREC dataset [33] http://glaros.dtc.umn.edu/gkhome/views/cluto consists of articles from the Los Angeles Times; the categories correspond to the desk of the paper in which each article appeared, and include documents from the entertainment, financial, foreign, metro, national, and sports desks. The word frequencies in each document are its features. A stop-list was used to remove common words, and any word occurring in fewer than two documents was eliminated. Each desk represents a ground truth cluster.

Synthetic: The dataset is obtained using the model of correlated time series discussed in [34]. The dataset consists of 1000 data points and 68 clusters in the ground-truth. The dataset https://www.kaggle.com/sandipan99/synthetic-data-for-clustering has been made public.

Protein: This dataset http://www.fludb.org/brc/home.spg?decorator=influenza consists of sequences of HA1 (hemagglutinin) of the H3N2 strain taken from the uniprot database http://www.uniprot.org/uniprot/P03440. These are strings of 566 characters (amino acids), and each character is replaced by the corresponding values of side-chain polarity, side-chain charge, hydropathy index and weight to obtain the feature matrix. The ground truth cluster structure is based on the geographic location of the samples. The dataset https://www.kaggle.com/sandipan99/protein-dataset/ has been made public.

Stocks: We consider a stock market dataset (the same as used in [35]), where each xi is a time series of daily returns for the M = 4000 most actively traded assets on the New York Stock Exchange, over the period from 1 January 1990 to 30 April 1999 (i.e. d = 2358). Returns are defined as the logarithm of the ratio between close and opening price for each day (we refer to [35] for more details). The ground truth is given by the Security and Exchange Commission (SEC) classification of the stocks into industrial sectors, which assigns a code to each stock. Taking the first two digits of the SEC code yields Sσ = 68 clusters (we also compared our results with the classification based on three digits, Sσ = 302).

Crime: The crime dataset https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized combines socio-economic data from the 1990 Census, law enforcement data from the 1990 Law Enforcement Management and Administrative Statistics survey, and crime data from the 1995 FBI UCR [36]. This is typically a regression dataset; we bin the data points based on attribute values to obtain the ground truth cluster structure. Specifically, we consider three attributes: (i) murders per 100k population, (ii) robberies per 100k population and (iii) auto-thefts per 100k population.

MNIST: The MNIST dataset [37] http://yann.lecun.com/exdb/mnist/ consists of images of 70,000 handwritten digits (0-9). Each image is represented as a 28 × 28 pixel bounding box which we flatten to obtain a feature vector of size 784. The dataset consists of 10 classes each corresponding to a digit between 0 and 9.

5 Evaluation methodology

In this section we discuss in detail the evaluation methodology used in the paper.

To reiterate, we consider:

High dimensional datasets: These are composed of M points in a high dimensional feature space (d ≫ 1). For example, in the stock market data, the t-th component of the i-th point is the daily return of stock i on day t = 1, …, d.

Table 1 lists the datasets used in this study (details provided later in this section). Each consists of a set of points xi, i = 1, …, M. We consider different partitioning algorithms xi → si that associate to each point i = 1, …, M in the sample a label si indicating the partition to which xi belongs. If there are S partitions, si can be taken as an integer between 1 and S.

Table 1. Quantitative description of data sets: M is the number of points, H[σ] is the entropy of the ground truth classification.

d1-d2 represents the conformity among the different goodness metrics (purity, NMI and ARI) in terms of Kendall's τ and Spearman's ρ rank correlations (see text). The last column reports the Kendall's τ and Spearman's ρ rank correlations of H[s] with the majority ranking of similarity to the ground truth (see text).

https://doi.org/10.1371/journal.pone.0239331.t001

For each dataset studied, a ground truth classification σ = (σ1, …, σM) is also available. This associates to each point i a "true" classification σi, which can take one of Sσ values, where Sσ is the number of classes of the ground truth. For example, σ is the Security and Exchange Commission classification of stocks into economic sectors for the financial data, or the state where a station is located for the Indian railways dataset [30]. Recall that the classification s generated by a given partitioning method can be compared with the ground truth σ using three well-established metrics: purity, normalized mutual information (NMI) and adjusted Rand index (ARI). We also compare with two existing internal metrics, Davies-Bouldin (DB) and Silhouette (SH). Moreover, for the hierarchical methods, the number of clusters is set to be the same as for the partitional approaches. For a given dataset and a given S, we rank algorithms according to their similarity with the ground truth.
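Comparing two rankings with Kendall's τ and Spearman's ρ is a one-liner with SciPy; the scores below are purely illustrative, not values from the paper.

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical scores of five algorithms under two criteria,
# e.g. H[s] and NMI (numbers are illustrative only).
h_scores   = [2.1, 1.8, 2.4, 1.2, 2.0]
nmi_scores = [0.55, 0.40, 0.62, 0.20, 0.51]

# Both statistics compare the induced rankings, not the raw values.
tau, _ = kendalltau(h_scores, nmi_scores)
rho, _ = spearmanr(h_scores, nmi_scores)
print(f"tau={tau:.2f}, rho={rho:.2f}")
```

Here the two criteria order the five algorithms identically, so both correlations equal one; partially agreeing rankings give values between -1 and 1.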

5.1 Majority ranking

It is well known that all three similarity measures, i.e., purity, NMI and ARI, have their own shortcomings [38]. This manifests in the fact that, for different similarity measures, the rankings over algorithms do not necessarily coincide. For this reason, we also consider a "majority ranking": for algorithms A1 and A2, the majority ranks A1 higher than A2 (i.e. A1 > A2) if a majority of the three similarity measures rank A1 higher than A2. This procedure is not guaranteed to produce a transitive ranking across algorithms, since it can happen that A1 > A2, A2 > A3 and A3 > A1 for some A1, A2 and A3. This signals that a proper ranking is ill defined in such cases; hence we restrict attention to cases where this does not occur. As Table 1 further shows, our study covers a diverse variety of datasets, ranging from cases where the number of clusters in the ground truth is very small compared to the number of data points (red and white wines, TREC) to cases where clusters on average contain few points (football, railway). We also compare our results across different ground truths for the same dataset. For stocks we consider different levels of granularity given by the SEC codes at 2 or 3 digits. For the crime dataset we consider ground truths based on different indicators (geographic location of the community, incidence of different crimes in that community). We report the results for each case in the following subsections. The cluster size distribution also varies substantially across the datasets used. As a measure of concentration, Table 1 reports the ratio between the entropy of the cluster size distribution and its maximal value; this is one for equally sized clusters (e.g. Leaf, TREC), whereas smaller values indicate more skewed distributions.
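The majority rule described above can be sketched as a pairwise vote; `majority_beats` and the score tables are our own illustrative names and numbers.

```python
from itertools import combinations

def majority_beats(i, j, score_tables):
    """True if algorithm i beats j under a majority of the measures."""
    votes = sum(1 for scores in score_tables if scores[i] > scores[j])
    return votes > len(score_tables) / 2

# Illustrative scores of 3 algorithms under purity, NMI and ARI.
purity = [0.9, 0.7, 0.5]
nmi    = [0.6, 0.8, 0.4]
ari    = [0.5, 0.3, 0.2]
tables = [purity, nmi, ari]

for i, j in combinations(range(3), 2):
    if majority_beats(i, j, tables):
        print(f"A{i+1} > A{j+1}")
```

In this example the pairwise votes happen to be transitive (A1 > A2 > A3); as noted in the text, that is not guaranteed in general, and intransitive cases are excluded from the analysis.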

6 Results

The rest of the paper is devoted to testing the accuracy of this prediction by comparing the H[s] ranking with the ranking provided by the distance to the ground truth, according to the measures discussed above. We classify the datasets based on the associated ground truth cluster structure, to show that our metric is indeed independent of the ground truth structure. We report the methodology in detail for the stock dataset, which covers the case of different granularity levels of the ground truth; for the other cases we mainly report the results obtained. The same methodology has been employed throughout. For general information about each dataset (size, number of clusters in the ground truth) refer to Table 1.

6.1 Ground truth with different granularity

Dataset: To illustrate, we consider the stock market dataset consisting of 4000 data points and two sets of ground truth (Sσ = 68 and 302).

Observations: For each algorithm, and for each choice of measure, we compute the value of H[s] for the cluster structure obtained with Sσ clusters and compare it to the distance from the ground truth classification, for ARI, NMI and purity. The plots of NMI and ARI versus H[s] in Fig 2 show a clear positive correlation, which we quantify by computing the Kendall's τ and Spearman's ρ rank correlations between the corresponding rankings. A pairwise comparison between H[s] and the different measures, and among the different measures themselves, is shown in Table 2 for the stock dataset considering SEC codes at 2 digits. The corresponding results for SEC codes at 3 digits are presented in Table 3. Different distances rank the algorithms differently, and their correlation, though positive, is not one. For this reason, as already discussed, we also extract a majority ranking that combines the predictions of ARI, NMI and purity. The correlation between the majority ranking and the other rankings is also reported in Table 2 (last column). The top entry of the rightmost column (boxed) is reported in the last column of Table 1 for all the other datasets. This shows that H[s] correlates remarkably well with the majority ranking in most cases. As a comparison, we look into how the three similarity measures correlate among themselves; to this end we calculate the mean Kendall's and Spearman's correlations between the rankings obtained through purity-NMI, purity-ARI and NMI-ARI (underlined entries in Table 2). Further note that H[s] outperforms both SH and DB.

Fig 2.

H[S] versus purity, NMI and ARI for the stock dataset, using SEC codes at 2 (top) and 3 (bottom) digits. Different algorithms are represented by a code that depends on the distance metric used (“l1” or “l2”) and the algorithm (SI, AV and CO for single, average and complete linkage, KM for k-means, AP for affinity propagation).

https://doi.org/10.1371/journal.pone.0239331.g002

Table 2. Kendall’s Tau and Spearman correlation for stock considering SEC codes at 2 digits.

The correlation between the majority ranking and the H[s] ranking (top-right boxed entry) is reported in the last column of Table 1, whereas the average of the correlations between rankings provided by the different measures (underlined entries) is reported in the d1-d2 column of Table 1 for all datasets.

https://doi.org/10.1371/journal.pone.0239331.t002

Table 3. Kendall’s τ and Spearman’s correlation result for stock considering SEC codes at 3 digits.

https://doi.org/10.1371/journal.pone.0239331.t003

6.2 Ground truth built from different attributes

Dataset: We illustrate with the crime dataset with ground truth constructed from three attributes which are—(i) murders per 100k population, (ii) robberies per 100k population and (iii) auto-thefts per 100k population.

Observations: In Fig 3 (top), (middle) and (bottom) we plot H[s] against purity, NMI and ARI for the cluster structures obtained from each algorithm for crime murder, crime robbery and crime auto respectively. The similarities between the rankings obtained through H[s], purity, NMI, ARI and the majority for the corresponding ground truths are reported in Tables 4, 5 and 6 respectively. In almost all cases H[s] correlates highly with purity and NMI, while with ARI the correlation is low. The similarity of the H[s] ranking with the majority is high irrespective of the ground truth used. H[s] again seems to perform better than SH and DB.

Fig 3. H[S] versus purity, NMI and ARI for (i) crime murder (top), (ii) crime robbery (middle) and (iii) crime auto (bottom).

https://doi.org/10.1371/journal.pone.0239331.g003

Table 4. Kendall’s τ and Spearman’s correlation result for Crime (murder).

https://doi.org/10.1371/journal.pone.0239331.t004

Table 5. Kendall’s τ and Spearman’s correlation result for Crime (robbery).

https://doi.org/10.1371/journal.pone.0239331.t005

Table 6. Kendall’s τ and Spearman’s correlation result for Crime (auto).

https://doi.org/10.1371/journal.pone.0239331.t006

6.3 Small number of ground truth clusters compared to the number of points

Datasets: For this scenario we consider the wine, TREC and MNIST datasets. For TREC, M = 878 and Sσ = 10; the corresponding numbers for red and white wines are M = 1598, Sσ = 6 and M = 4598, Sσ = 7 respectively. MNIST consists of 70000 data points and 10 clusters (i.e., M = 70000 and Sσ = 10).

Observations: We plot H[s] against purity, NMI and ARI for the cluster structures obtained from each algorithm for red wine, white wine, TREC and MNIST in Fig 4 (top to bottom in that order). The similarity scores between the rankings obtained through H[s], purity, NMI, ARI and the majority are reported in Tables 7 and 8 for the respective wine datasets. In both cases, the ranking obtained through H[s] correlates only moderately with the majority ranking; in fact, the similarity values are low among the rankings obtained through the other metrics as well. The similarity is reasonably high for TREC (refer to Table 9) and MNIST (refer to Table 10).

Fig 4.

H[S] versus purity, NMI and ARI for (i) red wine, (ii) white wine, (iii) TREC and (iv) MNIST datasets (from top to bottom). Note that for the wine datasets we considered two types of feature matrices. For 'raw features' (represented in blue) we used the feature values as provided in the dataset to obtain the feature vector of each point, while for 'ranked features' (represented in red) we rank each feature based on its value and use this rank instead of the raw value.

https://doi.org/10.1371/journal.pone.0239331.g004

Table 7. Kendall’s τ and Spearman’s correlation result for Red Wine.

https://doi.org/10.1371/journal.pone.0239331.t007

Table 8. Kendall’s τ and Spearman’s correlation result for White Wine.

https://doi.org/10.1371/journal.pone.0239331.t008

Table 9. Kendall’s τ and Spearman’s correlation result for TREC.

https://doi.org/10.1371/journal.pone.0239331.t009

Table 10. Kendall’s τ and Spearman’s correlation result for MNIST.

https://doi.org/10.1371/journal.pone.0239331.t010

6.4 Ground truth clusters with very few points

Datasets: We consider the examples of football (M = 115, Sσ = 12) and railway (M = 301, Sσ = 20) datasets.

Observations: In Fig 5 (top) and (bottom) we plot H[s] against purity, NMI and ARI for the cluster structures obtained from each algorithm for football and railway. H[s] is indeed closely related to the other metrics in both cases, which demonstrates the effectiveness of our metric. We further report the similarity among the rankings of the clustering algorithms obtained through the different metrics in Tables 11 and 12. In fact, we observe a very high correlation between H[s] and the majority ranking.

Fig 5.

H[S] versus purity, NMI and ARI for (i) football (top) and (ii) railway (bottom). We consider two types of feature vectors for each data point (node). In the case of ‘neighborhood’ (represented in blue), the feature vector of node ui consists of 1s and 0s depending on whether uj (j ≠ i) is a neighbor of ui or not. For ‘shortest path’ (represented in red), the feature vector of node ui consists of the shortest-path lengths from ui to every uj (j ≠ i).

https://doi.org/10.1371/journal.pone.0239331.g005
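The two node-feature constructions in the caption can be sketched as follows (the toy path graph is illustrative; the paper's datasets are the actual inputs):

```python
# Sketch of the two feature constructions for network datasets:
# 'neighborhood' = binary adjacency indicators, 'shortest path' = BFS distances.
from collections import deque

def neighborhood_features(adj):
    """Row i is a 0/1 vector marking the neighbors of node i."""
    n = len(adj)
    return [[1 if j in adj[i] else 0 for j in range(n)] for i in range(n)]

def shortest_path_features(adj):
    """Row i holds BFS shortest-path lengths from node i to every node."""
    n = len(adj)
    feats = []
    for src in range(n):
        dist = [None] * n
        dist[src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] is None:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        feats.append(dist)
    return feats

# Path graph 0-1-2-3, as adjacency sets
adj = [{1}, {0, 2}, {1, 3}, {2}]
print(neighborhood_features(adj)[0])   # [0, 1, 0, 0]
print(shortest_path_features(adj)[0])  # [0, 1, 2, 3]
```

Either matrix can then be fed to any of the distance-based clustering algorithms compared above.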

Table 11. Kendall’s τ and Spearman’s correlation result for Football.

https://doi.org/10.1371/journal.pone.0239331.t011

Table 12. Kendall’s τ and Spearman’s correlation result for Railway.

https://doi.org/10.1371/journal.pone.0239331.t012

6.5 Ground truth clusters are of equal sizes

Datasets: Here we consider the leaf and the abalone datasets. While for leaf the number of points in each ground truth cluster is exactly 16, the corresponding number for abalone is ∼ 90.

Observations: In Fig 6 (top) and (bottom) we plot H[S] against the purity, NMI and ARI values of the cluster structure obtained as output from all the clustering algorithms. The strong positive dependence suggests that H[S] is able to correctly rank the performance of the clustering algorithms. The high correlation between the rankings of the clustering algorithms obtained through H[S] and majority (refer to Tables 13 (leaf) and 14 (abalone)) further supports our hypothesis.

Fig 6.

H[S] versus purity, NMI and ARI for Leaf (top) and Abalone (bottom) datasets.

https://doi.org/10.1371/journal.pone.0239331.g006

Table 13. Kendall’s τ and Spearman’s correlation result for Leaf.

https://doi.org/10.1371/journal.pone.0239331.t013

Table 14. Kendall’s τ and Spearman’s correlation result for Abalone.

https://doi.org/10.1371/journal.pone.0239331.t014

6.6 Ground truth cluster sizes are skewed

Datasets: Here we consider the synthetic and the protein datasets where the ground truth cluster size distributions are skewed.

Observations: It can be clearly observed from Fig 7, top (synthetic) and bottom (protein), that H[S] correlates nicely with the other metrics in measuring the goodness of the cluster structure obtained as output from the different clustering algorithms. The high similarity (refer to Table 15 (synthetic) and Table 16 (protein)) between the majority ranking and that obtained through H[S] further indicates the effectiveness of our metric in ranking the performance of the clustering algorithms.

Fig 7.

H[S] versus purity, NMI and ARI for Synthetic (top) and Protein (bottom) datasets.

https://doi.org/10.1371/journal.pone.0239331.g007

Table 15. Kendall’s τ and Spearman’s correlation result for Synthetic.

https://doi.org/10.1371/journal.pone.0239331.t015

Table 16. Kendall’s τ and Spearman’s correlation result for Protein.

https://doi.org/10.1371/journal.pone.0239331.t016

6.7 Summary

To summarize, we showed that the performance of H[S] is comparable to that of the other metrics even though, unlike the competing metrics, it does not require the ground-truth cluster structure. Through extensive experiments on a large variety of datasets we showed that our proposed metric is both effective and robust. This further indicates that H[S] is independent of the associated ground-truth structure. H[S] also consistently outperforms both of the baseline internal metrics across all the datasets.

6.8 Dependence on cluster structure

We have demonstrated that the proposed metric is able to outperform the existing internal metrics across different datasets. We now focus on analysing how the performance of our metric depends on the complexity of the dataset. To quantify the complexity of a dataset we define two metrics, q1 and q2, from the entropy of the ground-truth partition of the dataset. For q1, this entropy is normalized by the number of points in the dataset (specifically, log M), while for q2 it is normalized by the number of clusters in the ground truth (log Sσ). We calculate these two metrics for each dataset (refer to Table 1 for the exact values) and train a linear regression model to predict the performance of the proposed metric on each dataset. We obtain a reasonably high R2 of 0.52, indicating that the complexity of the dataset in terms of q1 and q2 is indeed correlated with the performance of the proposed metric.
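As a minimal sketch of these two complexity measures (the toy label list is illustrative, not one of the paper's datasets), q1 and q2 are the ground-truth partition entropy normalized by log M and log Sσ respectively:

```python
# Sketch: complexity measures q1 = H/log(M) and q2 = H/log(S_sigma), where H is
# the Shannon entropy of the ground-truth partition, M the number of points and
# S_sigma the number of ground-truth clusters.
import math
from collections import Counter

def partition_entropy(labels):
    """Shannon entropy (natural log) of the partition induced by the labels."""
    m = len(labels)
    return -sum((c / m) * math.log(c / m) for c in Counter(labels).values())

def q1_q2(labels):
    m = len(labels)
    s_sigma = len(set(labels))  # number of ground-truth clusters
    h = partition_entropy(labels)
    return h / math.log(m), h / math.log(s_sigma)

# Toy ground truth: 100 points in 3 clusters of sizes 50, 30, 20
labels = ['a'] * 50 + ['b'] * 30 + ['c'] * 20
q1, q2 = q1_q2(labels)
print(round(q1, 3), round(q2, 3))
```

Note that q2 equals 1 exactly when the ground-truth clusters are of equal size, so it measures the evenness of the partition while q1 also reflects its coarseness relative to the dataset size.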

7 Discussion

The results discussed in this paper suggest that Infomax can be used as a completely unsupervised measure, computable solely from the partition size distribution of each algorithm's output, to rank data clustering algorithms.

On community detection. A closely related problem, that of community detection in networks, has received considerable attention in physics recently. The core idea is to group the nodes of a network based on structural similarity. As in the case of clustering, there exists a plethora of community detection algorithms as well. An immediate extension would be to deploy our proposed metric to the problem of ranking community detection algorithms.

On experimenting with the various datasets we observed the following:

  1. The performance of clustering algorithms depends on the dataset: for the football dataset, average linkage performed best, whereas for the railway dataset k-means performed best.
  2. The performance of clustering algorithms also depends on the distance metric used between the data points, and this dependence varies from algorithm to algorithm. For example, on the crime dataset, the l2 distance performs better than l1 with k-means but worse than l1 with complete linkage.
  3. The performance changes depending on the feature matrix used.

These observations reinforce the conclusion [8] that the search for the perfect clustering algorithm is chimeric. This makes it all the more important to develop unsupervised methods, such as the one presented in this paper, to rank partitioning algorithms.

Acknowledgments

SS and AM would like to acknowledge the Simons Foundation for financial support through the Simons Visitor and Simons Associate programmes, respectively. SS would also like to acknowledge the ICTP-IAEA Sandwich Training Educational Programme (STEP) for financial support.

References

  1. Remm M, Storm CE, Sonnhammer EL. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. Journal of Molecular Biology. 2001;314(5):1041–1052. pmid:11743721
  2. Fogel J, Nehmad E. Internet social network communities: Risk taking, trust, and privacy concerns. Computers in Human Behavior. 2009;25(1):153–160.
  3. Linsker R. Self-organization in a perceptual network. IEEE Computer. 1988;21:105–117.
  4. Jain AK. Data clustering: 50 years beyond K-means. Pattern Recognition Letters. 2010;31(8):651–666.
  5. Gan G, Ma C, Wu J. Data clustering: theory, algorithms, and applications. vol. 20. SIAM; 2007.
  6. Slonim N, Atwal GS, Tkačik G, Bialek W. Information-based clustering. Proceedings of the National Academy of Sciences of the United States of America. 2005;102(51):18297–18302. pmid:16352721
  7. Frey BJ, Dueck D. Clustering by passing messages between data points. Science. 2007;315(5814):972–976. pmid:17218491
  8. Kleinberg J. An impossibility theorem for clustering. Advances in Neural Information Processing Systems. 2003; p. 463–470.
  9. Lange T, Roth V, Braun ML, Buhmann JM. Stability-based validation of clustering solutions. Neural Computation. 2004;16(6):1299–1323. pmid:15130251
  10. Shamir O, Tishby N. Cluster stability for finite samples. In: NIPS; 2007. p. 1297–1304.
  11. Hric D, Darst RK, Fortunato S. Community detection in networks: Structural communities versus ground truth. Physical Review E. 2014;90(6):062805. pmid:25615146
  12. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review. 2001;5(1):3–55.
  13. Sneath PH, Sokal RR. Numerical taxonomy: the principles and practice of numerical classification. 1973.
  14. King B. Step-wise clustering procedures. Journal of the American Statistical Association. 1967;62(317):86–101.
  15. Ward JH Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association. 1963;58(301):236–244.
  16. Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Record. 1996;25(2):103–114.
  17. Shi J, Malik J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000;22(8):888–905.
  18. Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval; 2008.
  19. Danon L, Diaz-Guilera A, Duch J, Arenas A. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment. 2005;2005(09):P09008.
  20. Hubert L, Arabie P. Comparing partitions. Journal of Classification. 1985;2(1):193–218.
  21. Sneath PH. Some thoughts on bacterial classification. Microbiology. 1957;17(1):184–200. pmid:13475685
  22. Dice LR. Measures of the amount of ecologic association between species. Ecology. 1945;26(3):297–302.
  23. Fowlkes EB, Mallows CL. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association. 1983;78(383):553–569.
  24. Davies DL, Bouldin DW. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1979;(2):224–227. pmid:21868852
  25. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics. 1987;20:53–65.
  26. Dunn JC. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. 1973.
  27. Cardoso JF. Infomax and maximum likelihood for blind source separation. 1997.
  28. Waugh SG. Extending and benchmarking Cascade-Correlation: extensions to the Cascade-Correlation architecture and benchmarking of feed-forward supervised artificial neural networks. University of Tasmania; 1995.
  29. Girvan M, Newman ME. Community structure in social and biological networks. Proceedings of the National Academy of Sciences. 2002;99(12):7821–7826. pmid:12060727
  30. Ghosh S, Banerjee A, Sharma N, Agarwal S, Ganguly N, Bhattacharya S, et al. Statistical analysis of the Indian railway network: A complex network approach. Acta Physica Polonica B Proceedings Supplement. 2011;4(2):123–138.
  31. Cortez P, Cerdeira A, Almeida F, Matos T, Reis J. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems. 2009;47(4):547–553.
  32. Mallah C, Cope J, Orwell J. Plant leaf classification using probabilistic integration of shape, texture and margin features. Signal Processing, Pattern Recognition and Applications. 2013;5:1.
  33. Zhao Y, Karypis G, Fayyad U. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery. 2005;10(2):141–168.
  34. Giada L, Marsili M. Data clustering and noise undressing of correlation matrices. Physical Review E. 2001;63(6):061101. pmid:11415062
  35. Marsili M, et al. Dissecting financial markets: sectors and states. Quantitative Finance. 2002;2(4):297–302.
  36. Redmond M, Baveja A. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research. 2002;141(3):660–678.
  37. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998;86(11):2278–2324.
  38. Wagner S, Wagner D. Comparing clusterings: an overview. Universität Karlsruhe, Fakultät für Informatik; 2007.