Unsupervised ranking of clustering algorithms by INFOMAX

Clustering and community detection provide a concise way of extracting meaningful information from large datasets. An ever-growing plethora of data clustering and community detection algorithms has been proposed. In this paper, we address the question of ranking the performance of clustering algorithms for a given dataset. We show that, for hard clustering and community detection, Linsker's Infomax principle can be used to rank clustering algorithms: in brief, the algorithm that yields the highest entropy of the partition, for a given number of clusters, is the best one. We show, on a wide range of datasets of various sizes and topological structures, that the ranking provided by the entropy of the partition over a variety of partitioning algorithms is strongly correlated with the overlap with a ground truth partition. The code related to the project is available at https://github.com/Sandipan99/Ranking_cluster_algorithms.


Introduction
Cluster analysis is being used across a wide range of applications, from biology and bioinformatics [1] to social networks [2], which has led to the development of a plethora of clustering algorithms. Given this, an obvious question arises: how do we evaluate the performance of these algorithms in terms of the clusters they produce? In this paper, we show evidence in support of the idea that the Infomax principle [3] provides an answer to this question.
Clustering problem: We focus on the problem of hard partitioning: given a list of objects (or data points) the problem is that of dividing them into groups of similar ones. In the computer science and pattern recognition literature, this problem is popularly known as clustering. A plethora of different algorithms have been proposed for clustering (see [4,5] for reviews) based on different measures of similarity between the data points. A large part of this literature has focused on the time complexity of the methods, which is particularly relevant for big data.
Quality of clusters: In this paper, we focus on the quality, i.e., on the accuracy of the method in terms of the results produced. Several algorithms (see e.g. [6,7]) have been proposed claiming superior performance, yet it has been proven that no single clustering algorithm simultaneously satisfies a set of basic desiderata of data clustering [8]. In addition, the criteria for assessing the quality or validity of a clustering structure are not unique [4,5]. When no ground truth is available, which is typically the case, (internal) criteria have been proposed based on stability [9] or on generalisability with respect to sub-sampling [10]. When a ground truth is available, an external criterion is possible, based on the distance of the predicted clustering to the ground truth; yet the choice of the distance measure is not unique [5]. Even in cases where comparison with a ground truth is possible, different algorithms are found to perform better in different cases, and the predicted structures may differ substantially from the ground truth [11].
Infomax principle for measuring quality: We primarily intend to show that the Infomax principle [3] provides a natural measure for ranking clustering algorithms, for a given dataset, with respect to an unknown ground truth. In brief, a clustering algorithm is a mapping from data points x_i in a high dimensional feature space to a set of labels s_i. The amount of information that the cluster structure retains about the data is given by the mutual information I(x, s) = H[s] − H[s|x]. The Infomax principle states that the optimal representation is the one that maximizes I(x, s). In hard clustering H[s|x] = 0, so I(x, s) = H[s] coincides with the entropy of the labels. We can visualize clustering as a translation of a dataset into a set of symbols, the cluster labels, drawn from an alphabet of S letters, where S is the number of clusters.
So, each partitioning algorithm is a translator that converts high dimensional data into a message. Following Shannon [12], the entropy Ĥ[s] of the cluster labels s provides a natural measure of the amount of information that the algorithm extracts from the data. Infomax then prescribes that the algorithm that "uses the most informative language", i.e., the one with the highest Ĥ[s], should be preferred. This allows one to rank partitioning algorithms in a completely unsupervised fashion for a given dataset, which is the fundamental contribution of this paper. The criterion is internal, in the sense that it is based only on the data (i.e., it is unsupervised), but we validate it by showing that the obtained ranking has a positive correlation with the distance to a ground truth in all of the cases analyzed, and that this correlation is strong in most cases. Our results are based on an extensive comparison across different algorithms, different similarity metrics and different datasets.
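To make the label-entropy computation concrete, here is a minimal sketch (the two 8-point partitions are hypothetical, introduced only for illustration): for hard clustering, I(x, s) = H[s] is just the Shannon entropy of the cluster-size distribution.

```python
from collections import Counter
from math import log2

def partition_entropy(labels):
    """Shannon entropy (in bits) of a hard-partition label sequence.

    For hard clustering H[s|x] = 0, so I(x, s) = H[s]: the entropy of
    the cluster labels is the information the partition retains."""
    counts = Counter(labels)
    m = len(labels)
    return -sum((k / m) * log2(k / m) for k in counts.values())

# Two hypothetical 8-point partitions into S = 4 clusters: balanced
# clusters carry log2(4) = 2 bits; a skewed partition carries less.
balanced = [1, 1, 2, 2, 3, 3, 4, 4]
skewed   = [1, 1, 1, 1, 1, 2, 3, 4]
print(partition_entropy(balanced))  # → 2.0
print(partition_entropy(skewed))    # ≈ 1.55 bits
```

Note that the entropy depends only on the cluster sizes K_s, not on which points receive which label.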
Contributions: Our contributions in this paper are threefold:
1. We propose a metric (Ĥ[s]) which is able to rank clustering algorithms very efficiently in a completely unsupervised way (i.e., without considering the ground truth cluster structure).
2. Through rigorous experiments across a wide range of datasets we show the effectiveness of our metric in ranking the performance of data clustering algorithms. In fact, the metric correlates remarkably well with the distance from the ground truth for a widely varying taxonomy of ground truth structures, including (i) ground truth with different granularities, (ii) ground truth built from different attributes, (iii) a very small number of ground truth clusters, (iv) ground truth clusters with very few data points, (v) ground truth clusters of equal sizes and (vi) ground truth clusters with skewed sizes.
3. The proposed metric also outperforms the existing unsupervised metrics across all the datasets.

Background
In this section we present a brief overview of the related literature encompassing clustering algorithms and cluster quality measurement metrics used in our work.

Clustering algorithms
We consider two broad classes of clustering algorithms: (i) hierarchical and (ii) partitional.
Hierarchical methods: These methods construct clusters by recursively merging data points in a bottom-up (agglomerative) fashion: initially each data point is assigned a cluster of its own, and clusters are merged until the desired number of clusters is obtained. The merging of the clusters is guided by some chosen similarity measure. We consider both city-block (l1) and Euclidean (l2) distance based similarity measures. Hierarchical clustering methods can be further classified according to the manner in which the similarity between clusters is calculated. We consider the following three classical ways: (1) single linkage (SI) [13], (2) complete linkage (CO) [14] and (3) average linkage (AV) [15]. Note that 'l1SI' means single linkage with city-block as the distance metric, and so on. We use this combination of acronyms for the algorithms and distance metrics in all results presented in the subsequent sections.
We also consider BIRCH (BI) (balanced iterative reducing and clustering using hierarchies) [16], which improves upon traditional hierarchical clustering methods. The algorithm first creates a height-balanced tree out of the data points and then executes an agglomerative clustering method to obtain the final clusters.
Partitional methods: Among partitional methods we consider k-means, affinity propagation and spectral clustering. The k-means (KM) clustering method employs a squared-error minimization criterion and is the most commonly used clustering technique in this category. The algorithm starts with an initial set of cluster centers chosen at random. In each round, each instance is assigned to its nearest cluster center according to the distance between the two (we consider both l1 and l2 distances).
The affinity propagation (AP) algorithm, introduced in [7], is based on the concept of passing messages between data points. Unlike k-means, which identifies an exemplar (centroid) for each cluster, AP considers every data point to be a possible exemplar representing a cluster. The goal is to obtain an appropriate set of exemplars that represents all the clusters.
Spectral clustering (SP) [17] employs a low dimensional embedding of the similarity matrix between the data points, followed by clustering of the eigenvector components in the low dimensional space.
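As a minimal, hypothetical sketch of the acronym scheme above, the six (metric, linkage) combinations of the hierarchical family can be generated with scipy's hierarchical-clustering routines, cutting each dendrogram at a desired number of clusters S (the two-blob toy data are an assumption for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Toy data (hypothetical): two well-separated Gaussian blobs in 2D.
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])

S = 2  # desired number of clusters
partitions = {}
for metric, mcode in [("cityblock", "l1"), ("euclidean", "l2")]:
    for method, acode in [("single", "SI"), ("complete", "CO"), ("average", "AV")]:
        Z = linkage(X, method=method, metric=metric)  # build the dendrogram
        # Cut the dendrogram into exactly S flat clusters.
        partitions[mcode + acode] = fcluster(Z, t=S, criterion="maxclust")

print(sorted(partitions))  # l1AV, l1CO, l1SI, l2AV, l2CO, l2SI
```

Each entry of `partitions` is a label vector s that any of the quality measures discussed next can be applied to.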

Quality of cluster structure
The metrics available for determining the quality of clusters, and thereby evaluating the performance of clustering algorithms, can be categorized as (i) external or supervised, which utilize a benchmark or ground truth cluster structure to determine quality, and (ii) internal or unsupervised, which take into account only the similarity between the data points being clustered. In what follows, Ω = {ω_1, . . ., ω_k, . . .} denotes the set of clusters produced by an algorithm and C = {c_1, . . ., c_j, . . .} the set of ground truth classes, over M data points.
External metrics. i. Purity: The purity between Ω and C is calculated as
Purity(Ω, C) = (1/M) Σ_k max_j |ω_k ∩ c_j|.
ii. Normalized mutual information (NMI): The NMI between Ω and C is calculated as
NMI(Ω, C) = I(Ω; C) / [(H(Ω) + H(C)) / 2],
where I is the mutual information, defined as
I(Ω; C) = Σ_k Σ_j (|ω_k ∩ c_j| / M) log( M |ω_k ∩ c_j| / (|ω_k| |c_j|) ),
and H is the entropy.
iii. Adjusted Rand index (ARI): ARI is a chance-corrected version of the Rand index. Writing n_kj = |ω_k ∩ c_j|, a_k = |ω_k|, b_j = |c_j| and C(n, 2) = n(n − 1)/2, its value between Ω and C is calculated as
ARI = [Σ_kj C(n_kj, 2) − Σ_k C(a_k, 2) Σ_j C(b_j, 2) / C(M, 2)] / [(Σ_k C(a_k, 2) + Σ_j C(b_j, 2))/2 − Σ_k C(a_k, 2) Σ_j C(b_j, 2) / C(M, 2)].
Other measures include the Jaccard index [21], Dice index [22] and Fowlkes-Mallows index [23].
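To make the external measures concrete, here is a compact pure-Python sketch of purity and of one common NMI normalisation (implementations differ in the normalising factor, so treat this as illustrative rather than as the exact variant used in the paper; the label sequences are hypothetical):

```python
from collections import Counter
from math import log

def purity(pred, truth):
    """Purity = (1/M) * sum_k max_j |omega_k ∩ c_j|: the fraction of points
    matching the majority true class of their assigned cluster."""
    m = len(pred)
    clusters = {}
    for s, c in zip(pred, truth):
        clusters.setdefault(s, Counter())[c] += 1
    return sum(max(cnt.values()) for cnt in clusters.values()) / m

def nmi(pred, truth):
    """NMI = I(Omega; C) / ((H[Omega] + H[C]) / 2), the arithmetic-mean
    normalisation (one of several in use)."""
    m = len(pred)
    po, pc, pj = Counter(pred), Counter(truth), Counter(zip(pred, truth))
    h = lambda p: -sum((v / m) * log(v / m) for v in p.values())
    i = sum((v / m) * log((v / m) / ((po[s] / m) * (pc[c] / m)))
            for (s, c), v in pj.items())
    return 2 * i / (h(po) + h(pc))

pred  = [1, 1, 1, 2, 2, 2]
truth = ['a', 'a', 'b', 'b', 'b', 'b']
print(purity(pred, truth))  # → (2 + 3) / 6 ≈ 0.833
print(nmi(pred, truth))
```

A partition identical to the ground truth gives NMI = 1, which is a useful sanity check for any implementation.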
Internal metrics. Internal metrics for evaluation include the Davies-Bouldin index [24], Silhouette [25] and Dunn index [26]. Among these we compare our proposed metric with the Davies-Bouldin index (DB) and Silhouette (SH). DB is calculated as
DB = (1/n) Σ_{i=1..n} max_{j≠i} (σ_i + σ_j) / d(c_i, c_j),
where n is the number of clusters, c_x is the centroid of cluster x, σ_x is the average distance of all elements in cluster x to the centroid c_x, and d(c_i, c_j) is the distance between centroids c_i and c_j. For each data point, SH is computed from the mean intra-cluster distance a and the distance b to the nearest cluster the point is not a part of, with the score obtained as (b − a)/max(a, b). The overall score is the mean over all individual data points.
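A minimal sketch of the DB computation, using Euclidean distances throughout (the four-point toy data are hypothetical; note that, unlike Ĥ[s] below, a lower DB indicates a better clustering):

```python
import numpy as np

def davies_bouldin(X, labels):
    """DB = (1/n) * sum_i max_{j != i} (sigma_i + sigma_j) / d(c_i, c_j),
    with sigma_x the mean distance of cluster x's points to its centroid c_x."""
    labs = np.unique(labels)
    cents = np.array([X[labels == l].mean(axis=0) for l in labs])
    sig = np.array([np.linalg.norm(X[labels == l] - c, axis=1).mean()
                    for l, c in zip(labs, cents)])
    n = len(labs)
    return np.mean([max((sig[i] + sig[j]) / np.linalg.norm(cents[i] - cents[j])
                        for j in range(n) if j != i)
                    for i in range(n)])

# Two tight clusters far apart: sigma = 0.5 each, centroid distance 10,
# so DB = (0.5 + 0.5) / 10 = 0.1.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
print(davies_bouldin(X, np.array([0, 0, 1, 1])))  # → 0.1
```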

Proposed metric
In this section we first discuss the clustering problem and then introduce our proposed metric which ranks the clustering and community detection algorithms in a completely unsupervised way.

Clustering problem
Consider a dataset composed of M points x ∈ R^d in a high dimensional feature space (d ≫ 1). The primary objective of clustering is to assign each point x_i a label s_i that indicates the partition to which x_i belongs. If there are S partitions, s_i can be taken as an integer between 1 and S. A data clustering algorithm [4,5] partitions the objects x_i into groups or clusters of "similar" objects, where similarity is defined in terms of a metric distance.
With numerous clustering algorithms available for this task, and ground truth not always available, in this paper we propose a metric that ranks these algorithms based on their performance in a completely unsupervised way (i.e., without considering the ground truth partition).

Infomax based metric Ĥ[s]
For a given dataset and number of clusters S, each algorithm assigns to each point x_i in the sample a label s_i from an alphabet of S possible labels. Loosely speaking, each algorithm translates the data into a message of a language written in this alphabet. The information content of this message can be quantified by the Shannon entropy. Assuming the order in which the data occur to be uninformative, as is often the case, the information is stored uniquely in the symbol frequencies, i.e. in the number K_s of times that a symbol s occurs (which is the size of cluster s). As an estimate of the amount of information (in bits per character) in the message we take
Ĥ[s] = − Σ_{s=1..S} (K_s / M) log2 (K_s / M).
The Infomax principle [27] suggests a natural and universal criterion for scoring different algorithms: if algorithm A1 extracts more information than A2 from a dataset, i.e. if Ĥ_{A1}[s] > Ĥ_{A2}[s], then A1 should be preferred. For a given dataset and a fixed S, Ĥ[s] can be measured on the clusters predicted by different algorithms, thereby providing an unsupervised ranking of the algorithms. To summarize, given the output of an algorithm consisting of S clusters, our metric quantifies the quality of that output by computing the entropy of the cluster labels. We illustrate with a toy example in Fig 1.
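The resulting ranking procedure can be sketched in a few lines (the two algorithm outputs below are hypothetical 8-point partitions with S = 4, chosen only to show the ordering):

```python
from collections import Counter
from math import log2

def entropy_bits(labels):
    # Ĥ[s] = -sum_s (K_s / M) log2(K_s / M), with K_s the size of cluster s.
    m = len(labels)
    return -sum((k / m) * log2(k / m) for k in Counter(labels).values())

def infomax_rank(predictions):
    """Order algorithm names best-first by the entropy of their partitions:
    the algorithm with the highest Ĥ[s] is preferred, with no ground truth."""
    return sorted(predictions, key=lambda a: entropy_bits(predictions[a]),
                  reverse=True)

preds = {
    "A1": [1, 1, 2, 2, 3, 3, 4, 4],  # balanced: Ĥ[s] = 2 bits
    "A2": [1, 1, 1, 1, 1, 2, 3, 4],  # skewed:   Ĥ[s] ≈ 1.55 bits
}
print(infomax_rank(preds))  # → ['A1', 'A2']
```

Since both partitions use the same S, the comparison is between equally sized alphabets, as the criterion requires.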

Advantages
The proposed metric has several advantages, which we summarize below:
• Model-free. The proposed metric is model-free, which allows its application to any clustering algorithm and dataset.
• Information theory-based. Unlike the existing internal metrics, our metric builds on information theory, which is deeply rooted in the existing literature, making it more reliable.
• Outperforms existing metrics. Our metric consistently outperforms the existing internal metrics across numerous datasets (refer to Section 6 for details).
• Unsupervised. In contrast to the existing external metrics, our metric does not require a ground truth cluster structure, making it completely unsupervised and hence suited to a wide range of datasets. Even though it requires less information, the proposed metric provides performance comparable to the external metrics (refer to Section 6 for details).

Datasets
In this section we briefly discuss the datasets used in this paper.
Abalone: The Abalone dataset https://archive.ics.uci.edu/ml/datasets/Abalone consists of measurements of abalone, classified by age, which corresponds to the number of shell rings [28]. The dataset consists of 4177 instances, each with 8 attributes. The task is treated as a classification problem and there are 28 clusters in the ground truth.
Football: The Football network [29] http://www-personal.umich.edu/mejn/netdata/ records American football games between Division IA colleges during the Fall 2000 regular season. The vertices in the network are the football teams, identified by the respective college names, and an edge represents a regular season game between two teams. The teams are divided into conferences containing around 8-12 teams each. Games are more frequent between members of the same conference than between members of different conferences, so each conference represents a ground truth community in the network. Note that the vertices in the network are devoid of any inherent features; we therefore represent each vertex by vectors of (i) neighborhood (1 if the corresponding vertex is a neighbor and 0 otherwise) and (ii) shortest path (length of the shortest path to the corresponding vertex).
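The two vertex representations for featureless graphs can be sketched as follows on a hypothetical 5-node path graph 0-1-2-3-4 (pure stdlib; breadth-first search gives hop distances in an unweighted graph):

```python
from collections import deque

# Hypothetical path graph 0-1-2-3-4 standing in for a featureless network.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
n = 5
adj = {v: set() for v in range(n)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def neighborhood_vector(v):
    """(i) 0/1 vector: entry u is 1 iff u is a neighbor of v."""
    return [1 if u in adj[v] else 0 for u in range(n)]

def shortest_path_vector(v):
    """(ii) Vector of shortest-path lengths from v to every vertex (BFS)."""
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return [dist[u] for u in range(n)]

print(neighborhood_vector(2))   # → [0, 1, 0, 1, 0]
print(shortest_path_vector(0))  # → [0, 1, 2, 3, 4]
```

Stacking these vectors over all vertices yields a feature matrix to which any of the clustering algorithms above can be applied.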
Railway: The Indian railway network was proposed in [30] http://www.cnergres.iitkgp.ac.in/permanence/ and consists of stations (nodes) and edges between all pairs of stations that are connected by at least one train-route (both stations must be scheduled halts on the train-route). The weight of the edge between two stations is the number of train-routes on which both stations are scheduled halts. We filter out the low-weight edges and then make the resultant network unweighted. The states act as communities, since the number of trains within each state is much higher than the number of trains between two states. As for the Football dataset, we obtain two representations of each vertex (neighborhood and shortest path).
Wine: We consider two wine datasets, Red and White wine [31] http://archive.ics.uci.edu/ml/datasets/Wine+Quality, containing samples of red and white wines respectively. Each wine sample is associated with 11 attributes such as fixed acidity, volatile acidity, residual sugar, etc. Each wine sample is also graded by experts between 0 (very bad) and 10 (very excellent) based on its quality. This quality score acts as the ground truth cluster label for the two datasets.
Leaf: The leaf dataset [32] https://archive.ics.uci.edu/ml/datasets/One-hundred+plant+species+leaves+data+set consists of 100 varieties of leaves, with 16 examples per variety. Each leaf sample is associated with a shape, a texture and a margin feature, each a vector of 64 elements. Each variety of leaf acts as a ground truth cluster.
TREC: The TREC dataset [33] http://glaros.dtc.umn.edu/gkhome/views/cluto consists of articles from the Los Angeles Times; the categories correspond to the desk of the paper in which each article appeared, including the entertainment, financial, foreign, metro, national, and sports desks. Word frequencies in each document are its features. A stop-list was used to remove common words, and any word occurring in fewer than two documents was eliminated. Each desk represents a ground truth cluster.
Synthetic: The dataset is obtained using the model of correlated time series discussed in [34]. It consists of 1000 data points and 68 clusters in the ground truth. The dataset has been made public at https://www.kaggle.com/sandipan99/synthetic-data-for-clustering.
Protein: This dataset http://www.fludb.org/brc/home.spg?decorator=influenza consists of sequences of HA1 (hemagglutinin) of the H3N2 strain taken from the UniProt database http://www.uniprot.org/uniprot/P03440. These are strings of 566 characters (amino acids); each character is replaced by the corresponding values of side-chain polarity, side-chain charge, hydropathy index and weight to obtain the feature matrix. The ground truth cluster structure is based on the place of collection. The dataset has been made public at https://www.kaggle.com/sandipan99/protein-dataset/.
Stocks: We consider a stock market dataset (the same used in [35]), where each x_i is a time series of daily returns for the M = 4000 most actively traded assets on the New York Stock Exchange, over the period from 1 January 1990 to 30 April 1999 (i.e. d = 2358). Returns are defined as the logarithm of the ratio between the closing and opening price for each day (we refer to [35] for more details). The ground truth is given by the Securities and Exchange Commission (SEC) classification of the stocks into industrial sectors, which assigns a code to each stock. Taking the first two digits of the SEC code yields S_σ = 68 clusters (we also compare our results with the classification based on three digits, S_σ = 302).
Crime: The crime dataset https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized combines socio-economic data from the 1990 Census, law enforcement data from the 1990 Law Enforcement Management and Admin Stats survey, and crime data from the 1995 FBI UCR [36]. This is typically a regression dataset; we bin the data points based on attribute values to obtain the ground-truth cluster structure. Specifically, we consider three attributes: (i) murders per 100k population, (ii) robberies per 100k population and (iii) auto-thefts per 100k population.
MNIST: The MNIST dataset [37] http://yann.lecun.com/exdb/mnist/ consists of images of 70,000 handwritten digits (0-9). Each image is represented as a 28 × 28 pixel bounding box which we flatten to obtain a feature vector of size 784. The dataset consists of 10 classes each corresponding to a digit between 0 and 9.

Evaluation methodology
In this section we discuss in detail the evaluation methodology used in the paper.
To reiterate, we consider:
High dimensional datasets: These are composed of M points x ∈ R^d in a high dimensional feature space (d ≫ 1). For example, in the stock market data, the t-th component x_i^(t) of the i-th point is the daily return of stock i on day t = 1, . . ., d. Table 1 lists the datasets used in this study (details provided later in this section). Each consists of a set of points x_i, i = 1, . . ., M. We consider different partitioning algorithms x_i → s_i that associate to each point i = 1, . . ., M in the sample a label s_i indicating the partition to which x_i belongs. If there are S partitions, s_i can be taken as an integer between 1 and S.
For each dataset studied, a ground truth classification σ = (σ_1, . . ., σ_M) is also available. This associates to each point i a "true" class σ_i, which can take one of S_σ values, where S_σ is the number of classes in the ground truth. For example, σ is the Securities and Exchange Commission classification of stocks into economic sectors for the financial data, or the state where a station is located for the Indian railway dataset [30]. Recall that the classification s generated by a given partitioning method can be compared with the ground truth σ using three well-established metrics: Purity, Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI). We also compare with two existing internal metrics, Davies-Bouldin (DB) and Silhouette (SH). Moreover, for the hierarchical methods, the number of clusters is set to be the same as for the partitional approaches. For a given dataset and a given S, we rank algorithms according to their similarity with the ground truth.

Majority ranking
It is well known that all three similarity measures, i.e., Purity, NMI and ARI, have their own shortcomings [38]. This manifests in the fact that, for different similarity measures, the rankings over algorithms do not necessarily coincide. For this reason, we also consider a "majority ranking": for algorithms A1 and A2, the majority ranks A1 higher than A2 (i.e. A1 > A2) if the majority of the three similarity measures rank A1 higher than A2. This procedure is not guaranteed to produce a transitive ranking across algorithms, since it can happen that A1 > A2, A2 > A3 and A3 > A1 for some A1, A2 and A3. This signals that a proper ranking is ill-defined in such cases, hence we restrict attention to cases where this does not occur. As Table 1 further shows, our study covers a diverse variety of datasets, ranging from cases where the number of clusters in the ground truth is very small compared to the number of data points (red and white wines, TREC) to cases where clusters on average contain few points (football, railway). We also compare our results across different ground truths for the same dataset. For stocks we consider different levels of granularity, given by the SEC codes at 2 or 3 digits. For the crime dataset we consider ground truths based on different indicators (geographic location of the community, incidence of different crimes in that community). We report the results for each case in the following subsections. The cluster size distribution also varies substantially across the datasets used. As a measure of concentration, Table 1 reports the ratio Ĥ[σ]/log(S_σ) between the entropy of the ground truth cluster size distribution and its maximal value. This ratio is one for equally sized clusters (e.g. Leaf, TREC), whereas smaller values indicate more skewed distributions.
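The pairwise majority relation, and the transitivity check it requires, can be sketched as follows (the Purity/NMI/ARI scores for three algorithms are hypothetical; ties between two algorithms on a measure contribute no vote):

```python
from itertools import combinations, permutations

def majority_order(scores):
    """Pairwise majority relation over algorithms.

    scores: dict measure -> dict algorithm -> score (higher = better).
    Returns the set of (winner, loser) pairs; transitivity not guaranteed."""
    algs = list(next(iter(scores.values())))
    wins = set()
    for a, b in combinations(algs, 2):
        votes = sum((m[a] > m[b]) - (m[a] < m[b]) for m in scores.values())
        if votes > 0:
            wins.add((a, b))
        elif votes < 0:
            wins.add((b, a))
    return wins

def has_cycle(wins, algs):
    # A Condorcet-style cycle A1 > A2 > A3 > A1 means the majority
    # ranking is ill-defined, and such cases are excluded.
    return any((a, b) in wins and (b, c) in wins and (c, a) in wins
               for a, b, c in permutations(algs, 3))

scores = {  # hypothetical Purity / NMI / ARI values for three algorithms
    "purity": {"A1": 0.9, "A2": 0.8, "A3": 0.7},
    "NMI":    {"A1": 0.6, "A2": 0.7, "A3": 0.5},
    "ARI":    {"A1": 0.5, "A2": 0.4, "A3": 0.3},
}
wins = majority_order(scores)
print(("A1", "A2") in wins, has_cycle(wins, ["A1", "A2", "A3"]))  # → True False
```

Here A1 beats A2 on two of the three measures, so the majority ranks A1 > A2 even though NMI alone disagrees.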

Results
The rest of the paper will be devoted to testing the accuracy of this prediction, by comparing it with the ranking provided by the distance to the ground truth, according to the measures discussed above. We classify the datasets based on the associated ground truth cluster structure. This is to show that our metric is indeed independent of the ground truth structure. We report in detail the methodology for the stock dataset which covers the case of different granularity levels of ground truth while for other cases we mainly report the results obtained. For all these cases the same methodology has been employed to obtain the results. For general information about each dataset (size, number of clusters in the ground truth) refer to Table 1.

Ground truth with different granularity
Dataset: To illustrate, we consider the stock market dataset, consisting of 4000 data points and two sets of ground truth (S_σ = 68, 302).
Observations: For each algorithm and choice of the distance measure, we compute the value of Ĥ[s] for the cluster structure obtained with S_σ clusters and compare it to the distance to the two-digit ground truth classification, for ARI, NMI and Purity. The plots of NMI and ARI versus Ĥ[s] in Fig 2 show a clear positive correlation, which we quantify by computing Kendall's τ and Spearman's rank correlation ρ between the corresponding rankings. A pairwise comparison between Ĥ[s] and the different measures, and among the different measures themselves, is shown in Table 2 for the stock dataset considering SEC codes at 2 digits. The corresponding results for SEC codes at 3 digits are presented in Table 3. Different distances rank the algorithms differently, and their correlation, though positive, is not one. For this reason, as already discussed, we also extract a majority ranking that combines the predictions of ARI, NMI and Purity. The correlation between the majority ranking and the other rankings is also reported in Table 2 (last column). The top entry of the rightmost column (boxed) is reported in the last column of Table 1 for all the other datasets. This shows that Ĥ[s] correlates remarkably well with the majority ranking in most cases. As a comparison, we look into how the three similarity measures correlate among themselves; to this aim we calculate the mean Kendall's and Spearman's correlations between the rankings obtained through Purity-NMI, Purity-ARI and NMI-ARI (underlined entries in Table 2). Further note that Ĥ[s] outperforms both SH and DB.
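The rank-correlation step can be sketched with scipy (the Ĥ[s] and NMI values for six hypothetical algorithms below are illustrative, chosen so the two induced rankings agree):

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical per-algorithm scores: Ĥ[s] and an external measure (NMI).
h_entropy = [2.1, 1.9, 1.7, 1.5, 1.2, 0.9]
nmi_score = [0.62, 0.60, 0.55, 0.49, 0.41, 0.30]

tau, _ = kendalltau(h_entropy, nmi_score)  # Kendall's τ between the rankings
rho, _ = spearmanr(h_entropy, nmi_score)   # Spearman's ρ between the rankings
print(tau, rho)  # → 1.0 1.0 (the two rankings are fully concordant here)
```

Both statistics depend only on the orderings, so they compare the unsupervised Ĥ[s] ranking directly against the ranking by distance to the ground truth.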

Ground truth built from different attributes
Dataset: We illustrate with the crime dataset, with ground truths constructed from three attributes: (i) murders per 100k population, (ii) robberies per 100k population and (iii) auto-thefts per 100k population.
Observations: In Fig 3(top), Fig 3(middle) and Fig 3(bottom) we plot Ĥ[s] against purity, NMI and ARI for the cluster structure obtained from each algorithm for crime murder, crime robbery and crime auto, respectively. The similarity between the rankings obtained through Ĥ[s], purity, NMI, ARI and majority for the corresponding ground truths is reported in Tables 4, 5 and 6, respectively. In almost all cases Ĥ[s] correlates highly with purity and NMI, while with ARI the correlation is low. The similarity of the Ĥ[s] ranking with the majority ranking is high irrespective of the ground truth used. Ĥ[s] also performs better than SH and DB.

[Fig 2. Ĥ[s] versus purity, NMI and ARI for the stock dataset, using SEC codes at 2 (top) and 3 (bottom) digits. Each algorithm is labeled by a code combining the distance metric ("l1" or "l2") and the algorithm (SI, AV and CO for single, average and complete linkage, KM for k-means, AP for affinity propagation).]

Small number of ground truth clusters compared to the number of points

Observations: We plot Ĥ[s] against purity, NMI and ARI for the cluster structure obtained from each algorithm for red wine, white wine, TREC and MNIST in Fig 4 (top to bottom in that order). The similarity scores between the rankings obtained through Ĥ[s], purity, NMI, ARI and majority are reported in Tables 7 and 8 for the respective wine datasets. In both cases the ranking obtained through Ĥ[s] correlates only moderately with the majority ranking; in fact, the similarity values are low among the rankings obtained through the other metrics as well. The similarity is reasonably high for TREC (refer to Table 9) and MNIST (refer to Table 10).

Ground truth clusters with very few data points

Observations: In Fig 5 (top) and (bottom) we plot Ĥ[s] against purity, NMI and ARI for the cluster structure obtained from each algorithm for football and railway. Ĥ[s] closely tracks the other metrics in both cases, demonstrating the effectiveness of our metric. We further report the similarity among the rankings of the clustering algorithms obtained through the different metrics in Tables 11 and 12. We observe a very high correlation between Ĥ[s] and the majority ranking.

Ground truth clusters are of equal sizes
Datasets: Here we consider the leaf and the abalone datasets. While for leaf the number of points in each ground truth cluster is exactly 16, the corresponding number for abalone is approximately 90.
Observations: In Fig 6(top) and (bottom) we plot Ĥ[s] against the purity, NMI and ARI values of the cluster structures produced by all the clustering algorithms. A strong positive dependence suggests that Ĥ[s] is able to correctly rank the performance of the clustering algorithms. The high correlation between the rankings obtained through Ĥ[s] and the majority ranking (refer to Tables 13 (leaf) and 14 (abalone)) further supports our hypothesis.

Ground truth cluster sizes are skewed
Datasets: Here we consider the synthetic and the protein datasets where the ground truth cluster size distributions are skewed.

[Fig 3. Ĥ[s] versus purity, NMI and ARI for (i) crime murder (top), (ii) crime robbery (middle) and (iii) crime auto (bottom).]
Observations: The high correlation (refer to Table 15 (synthetic) and Table 16 (protein)) between the majority ranking and that obtained through Ĥ[s] further indicates the effectiveness of our metric in ranking the performance of the clustering algorithms.

Summary
To summarize, we showed that the performance of Ĥ[s] is comparable to that of the other metrics even though, unlike the competing external metrics, it does not require the ground truth cluster structure. Through extensive experiments on a large variety of datasets we showed that our proposed metric is both effective and robust. This further indicates that Ĥ[s] is independent of the associated ground truth structure. Ĥ[s] also consistently outperforms both baseline internal metrics across all the datasets.

Dependence on cluster structure
We have demonstrated that the proposed metric is able to outperform the existing internal metrics across different datasets. We now focus on analysing the dependence of the performance of our metric on the complexity of the dataset. To quantify the complexity of a dataset we define two quantities, q_1 = Ĥ[σ]/log M and q_2 = Ĥ[σ]/log S_σ, where Ĥ[σ] is the entropy of the ground truth cluster structure of the dataset. For q_1, Ĥ[σ] is normalized by the logarithm of the number of points in the dataset (log M), while for q_2 it is normalized by the logarithm of the number of clusters in the ground truth (log S_σ). We calculate these two quantities for each dataset (refer to Table 1 for exact values) and train a linear regression model to predict the performance (the correlation, in terms of τ and ρ, between the Ĥ[s] ranking and the majority ranking) on each dataset. We obtain a reasonably high R² of 0.52. This indicates that the complexity of the dataset, in terms of q_1 and q_2, is indeed correlated with the performance of the proposed metric.
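The regression step can be sketched as an ordinary least-squares fit with an intercept (the q_1, q_2 and performance values below are hypothetical placeholders, not the paper's Table 1 values):

```python
import numpy as np

# Hypothetical per-dataset complexity measures and metric performance.
q1 = np.array([0.30, 0.45, 0.50, 0.62, 0.71, 0.80])    # Ĥ[σ] / log M
q2 = np.array([0.55, 0.60, 0.72, 0.78, 0.85, 0.95])    # Ĥ[σ] / log S_σ
perf = np.array([0.40, 0.52, 0.58, 0.70, 0.74, 0.88])  # Ĥ[s]-vs-majority correlation

# Design matrix: intercept column plus the two predictors.
X = np.column_stack([np.ones_like(q1), q1, q2])
beta, *_ = np.linalg.lstsq(X, perf, rcond=None)

# R² = 1 - SS_res / SS_tot on the fitted model.
resid = perf - X @ beta
r2 = 1 - (resid ** 2).sum() / ((perf - perf.mean()) ** 2).sum()
print(round(r2, 3))
```

With real Table 1 values in place of the placeholders, the same computation yields the R² reported above.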

Discussion
The results discussed in this paper suggest that Infomax can be used as a completely unsupervised measure, computable solely from the partition size distribution of each algorithm, to rank data clustering algorithms.
On community detection. A closely related problem, that of community detection in networks, has recently received considerable attention in physics. The core idea is to group nodes in the network based on structural similarity. As in the case of clustering, there exists a plethora of algorithms for community detection as well. An immediate extension would be to deploy our proposed metric to the problem of ranking community detection algorithms.
On experimenting with various datasets we observed that:
1. The performance of clustering algorithms depends on the dataset. In the case of the football dataset we observed that average linkage performed best, whereas in the case of the railway dataset k-means performed best.
2. The performance of clustering algorithms also depends on the distance metric used for calculating distance between the data points in the dataset. This dependence is different depending on the algorithm. For example, in the crime dataset, l2 distance performs better than l1 in k-means, but worse than l1 in complete linkage.
3. The performance changes depending on the feature matrix used.
These observations reinforce the conclusion [8] that the search for the perfect clustering algorithm is chimeric. This makes it important to develop unsupervised methods to rank partitioning algorithms, such as the one presented in this paper.