Clustering algorithms: A comparative approach

doi:10.1371/journal.pone.0210236

Table 1.

Clustering methods considered in our analysis and the respective libraries and functions in R employing the methods.

The first column shows the name of the algorithms used throughout the text. The second column indicates the category of the algorithms. The third and fourth columns contain, respectively, the function name and R library of each algorithm.

Fig 1.

Illustration of the k-means clustering method.

Each plot shows the partition obtained after specific iterations of the algorithm. The centroids of the clusters are shown as black markers. Points are colored according to their assigned clusters. Gray markers indicate the positions of the centroids in the previous iteration. The dataset contains 2 clusters, but k = 4 seeds were used in the algorithm.
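
The iteration illustrated in this figure can be sketched in a few lines. The following is a minimal Lloyd-style k-means under stated assumptions, not the R implementation used in the study: the names `nearest` and `kmeans`, the synthetic two-blob data, and the choice to seed centroids from randomly picked data points are all illustrative.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )

def kmeans(points, k, max_iter=100, seed=0):
    """Alternate assignment and update steps until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: seeds are k random data points
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                dim = len(members[0])
                updated.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                updated.append(centroids[c])  # an empty cluster keeps its centroid
        if updated == centroids:
            break
        centroids = updated
    # Final assignment so that labels match the returned centroids.
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Synthetic data: 2 well-separated blobs, clustered with k = 4 as in the figure.
rng = random.Random(1)
blob_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
blob_b = [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(50)]
points = blob_a + blob_b
centroids, labels = kmeans(points, k=4)
print(centroids)
```

Because k exceeds the number of true clusters, at least one blob ends up split across several centroids, which is the situation the figure depicts.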

Fig 2.

Examples of artificial datasets generated by the methodology.

The parameters used for each case are (a) C = 2, Ne = 100 and α = 3.3. (b) C = 2, Ne = 100 and α = 2.3. (c) C = 10, Ne = 50 and α = 4.3. (d) C = 10, Ne = 50 and α = 6.3. Note that each class can present highly distinct properties due to differences in correlation between their features.

Fig 3.

Average performance of the seven considered clustering algorithms according to the number of features in the dataset.

All artificial datasets were used for evaluation. The averages were calculated separately for datasets containing 2, 10 and 50 features. The considered performance indexes are (a) adjusted Rand, (b) Jaccard, (c) normalized mutual information and (d) Fowlkes-Mallows.
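
Three of these indexes (adjusted Rand, Jaccard and Fowlkes-Mallows) can be written in terms of pair counts between the true classes and the predicted clusters. The sketch below uses the standard textbook definitions; the function names are illustrative, and it is not claimed to match the R functions used in the study. Normalized mutual information is omitted because its normalization varies between implementations.

```python
from itertools import combinations
from math import sqrt

def pair_counts(truth, pred):
    """Count object pairs by whether they share a class and/or a cluster."""
    a = b = c = d = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        if same_class and same_cluster:
            a += 1  # together in both partitions
        elif same_class:
            b += 1  # same class, split by the clustering
        elif same_cluster:
            c += 1  # different classes, merged by the clustering
        else:
            d += 1  # separated in both partitions
    return a, b, c, d

def jaccard(truth, pred):
    a, b, c, _ = pair_counts(truth, pred)
    return a / (a + b + c)

def fowlkes_mallows(truth, pred):
    a, b, c, _ = pair_counts(truth, pred)
    return a / sqrt((a + b) * (a + c))

def adjusted_rand(truth, pred):
    # Undefined (0/0) when both partitions place all objects in a single group.
    a, b, c, d = pair_counts(truth, pred)
    n_pairs = a + b + c + d
    expected = (a + b) * (a + c) / n_pairs  # agreement expected by chance
    max_index = ((a + b) + (a + c)) / 2
    return (a - expected) / (max_index - expected)

truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 2, 2]
print(pair_counts(truth, pred))  # (2, 4, 1, 8)
print(jaccard(truth, pred), fowlkes_mallows(truth, pred), adjusted_rand(truth, pred))
```

All three indexes equal 1 when the clustering reproduces the classes exactly, regardless of how the cluster labels are numbered.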

Fig 4.

Average performance of the seven considered clustering algorithms according to the number of objects per class in the dataset.

All artificial datasets were used for evaluation. The averages were calculated separately for datasets containing 5, 50 and 100 objects per class. The considered performance indexes are (a) adjusted Rand, (b) Jaccard, (c) normalized mutual information and (d) Fowlkes-Mallows.

Fig 5.

Performance of the algorithms when the number of elements per class is Ne = 5, 50, 500 and 5000.

The plots correspond to the ARI, Jaccard and FM indexes averaged for all datasets containing 10 classes and 5 features (DB10C5F).

Fig 6.

Performance of the algorithms when changing the expected number of clusters K in the dataset.

The upper plots correspond to the ARI and Jaccard indices averaged for all datasets containing 10 classes and 10 features (DB10C10F). The lower plots correspond to the Silhouette and Dunn indices for the same dataset. The red line indicates the actual number of clusters in the dataset.
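
Unlike ARI and Jaccard, the Silhouette and Dunn indices need no ground-truth classes, only the points and their cluster labels. The sketch below uses one common formulation of each (mean silhouette width; Dunn as the smallest between-cluster point distance divided by the largest within-cluster diameter). These definitions and the function names are assumptions for illustration; the variants used in the study may differ.

```python
from math import dist

def group_by_label(points, labels):
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters

def silhouette(points, labels):
    """Mean silhouette width; requires at least two clusters."""
    clusters = group_by_label(points, labels)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # Mean distance to the other members of the same cluster (dist(p, p) = 0).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def dunn(points, labels):
    """Minimum between-cluster distance over maximum within-cluster diameter."""
    groups = list(group_by_label(points, labels).values())
    diameter = max(dist(p, q) for g in groups for p in g for q in g)
    separation = min(
        dist(p, q)
        for i, g in enumerate(groups)
        for h in groups[i + 1:]
        for p in g
        for q in h
    )
    return separation / diameter

points = [(0, 0), (0, 1), (5, 0), (5, 1)]
good = [0, 0, 1, 1]  # groups the two nearby pairs
bad = [0, 1, 0, 1]   # pairs each point with a distant one
print(silhouette(points, good), dunn(points, good))
print(silhouette(points, bad), dunn(points, bad))
```

Higher values of both indices indicate compact, well-separated clusters, which is why they are plotted against K alongside the supervised indices.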

Fig 7.

Performance of the algorithms when changing the expected number of clusters K in the dataset.

The upper plots correspond to the ARI and Jaccard indices averaged for all datasets containing 10 classes and 2 features (DB10C2F). The lower plots correspond to the Silhouette and Dunn indices for the same dataset. The red line indicates the actual number of clusters in the dataset.

Table 2.

Average difference of accuracies obtained when clustering algorithms are used with their default configuration of parameters.

In general, the spectral algorithm provides the highest accuracy rate among all evaluated methods.

Table 3.

One-parameter analysis performed in DB2C2F and DB10C2F.

This analysis is based on the performance (measured through the ARI index) obtained when varying a single parameter of the clustering algorithm while keeping the others at their default values. 〈S〉, ΔS and max S denote, respectively, the average, standard deviation and maximum of the difference between the performance obtained when varying a single parameter and the performance obtained with the default parameter values. We also report 〈max Acc〉, the average of the best ARI values obtained when varying each parameter, where the average is calculated over all considered datasets.
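
As a rough illustration of how the quantities 〈S〉, ΔS and max S could be computed for one parameter, the sketch below takes the ARI at the default setting and the ARI values obtained while sweeping that parameter. The function name, the use of the population standard deviation, and the input numbers are all hypothetical; they are not taken from the study.

```python
from statistics import mean, pstdev

def one_parameter_summary(default_ari, varied_ari):
    """Summarize ARI differences relative to the default configuration."""
    diffs = [value - default_ari for value in varied_ari]
    return {
        "mean_S": mean(diffs),   # <S>
        "std_S": pstdev(diffs),  # Delta S (population form is an assumption)
        "max_S": max(diffs),     # max S
    }

# Made-up ARI values for a sweep of a single parameter.
summary = one_parameter_summary(default_ari=0.50, varied_ari=[0.40, 0.50, 0.60, 0.70])
print(summary)
```

A positive max S means that at least one value of the parameter improved on the default; a 〈S〉 close to zero means that, on average, changing the parameter neither helped nor hurt.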

Table 4.

One-parameter analysis performed in DB2C10F and DB10C10F.

This analysis is based on the performance obtained when varying a single parameter while keeping the others at their default values. 〈S〉, ΔS and max S denote, respectively, the average, standard deviation and maximum of the difference between the performance obtained when varying a single parameter and the performance obtained with the default parameter values. We also report 〈max Acc〉, the average of the best ARI values obtained when varying each parameter, where the average is calculated over all considered datasets.

Table 5.

One-parameter analysis performed in DB2C200F and DB10C200F.

This analysis is based on the performance obtained when varying a single parameter while keeping the others at their default values. 〈S〉, ΔS and max S denote, respectively, the average, standard deviation and maximum of the difference between the performance obtained when varying a single parameter and the performance obtained with the default parameter values. We also report 〈max Acc〉, the average of the best ARI values obtained when varying each parameter.

Fig 8.

Distribution of ARI values obtained for the random sampling of the k-means parameters.

The algorithm was applied to dataset DB10C10F, and 500 sets of parameters were drawn.

Table 6.

Multi-parameter analysis performed in dataset DB2C2F.

The p-value represents the probability that the classifier configured with a random set of parameters outperforms the same classifier configured with its default parameters. 〈R〉, ΔR and max R represent the average, standard deviation and maximum value of the improvement obtained when random parameters are considered. Column 〈max ARI〉 indicates the average of the best accuracies obtained for each dataset.
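
One way to estimate such a probability empirically is to draw many random parameter configurations, score each one, and take the fraction that beats the default. The sketch below shows that procedure under stated assumptions: `random_search`, `outperform_fraction`, the parameter ranges, the default configuration and `toy_score` are hypothetical, and `toy_score` is only a stand-in for running a clustering algorithm and computing its ARI.

```python
import random

def random_search(evaluate, space, n_draws, seed=0):
    """Draw n_draws random configurations from `space` and score each one."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_draws):
        params = {name: rng.choice(values) for name, values in space.items()}
        results.append((params, evaluate(params)))
    return results

def outperform_fraction(default_score, scores):
    """Fraction of random configurations scoring strictly above the default."""
    return sum(score > default_score for score in scores) / len(scores)

def toy_score(params):
    # Hypothetical placeholder: a real study would cluster the data with
    # `params` and return the ARI against the known classes.
    return 1.0 - abs(params["k"] - 10) / 20

space = {"k": list(range(2, 21)), "restarts": [1, 5, 10]}  # hypothetical ranges
results = random_search(toy_score, space, n_draws=500)
scores = [score for _, score in results]
default_score = toy_score({"k": 5, "restarts": 1})  # hypothetical default
p_estimate = outperform_fraction(default_score, scores)
print(p_estimate, max(scores) - default_score)
```

Fixing the seed makes the draw reproducible, so the estimated fraction can be compared across algorithms on the same set of sampled configurations.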

Table 7.

Multi-parameter analysis performed in dataset DB10C2F.

The p-value represents the probability that the classifier configured with a random set of parameters outperforms the same classifier configured with its default parameters. 〈R〉, ΔR and max R represent the average, standard deviation and maximum value of the improvement obtained when random parameters are considered. Column 〈max ARI〉 indicates the average of the best accuracies obtained for each dataset.

Table 8.

Multi-parameter analysis performed in dataset DB2C10F.

The p-value represents the probability that the classifier configured with a random set of parameters outperforms the same classifier configured with its default parameters. 〈R〉, ΔR and max R represent the average, standard deviation and maximum value of the improvement obtained when random parameters are considered. Column 〈max ARI〉 indicates the average of the best accuracies obtained for each dataset.

Table 9.

Multi-parameter analysis performed in dataset DB10C10F.

The p-value represents the probability that the classifier configured with a random set of parameters outperforms the same classifier configured with its default parameters. 〈R〉, ΔR and max R represent the average, standard deviation and maximum value of the improvement obtained when random parameters are considered. Column 〈max ARI〉 indicates the average of the best accuracies obtained for each dataset.

Table 10.

Multi-parameter analysis performed in dataset DB2C200F.

The p-value represents the probability that the classifier configured with a random set of parameters outperforms the same classifier configured with its default parameters. 〈R〉, ΔR and max R represent the average, standard deviation and maximum value of the improvement obtained when random parameters are considered. Column 〈max ARI〉 indicates the average of the best accuracies obtained for each dataset.

Table 11.

Multi-parameter analysis performed in dataset DB10C200F.

The p-value represents the probability that the classifier configured with a random set of parameters outperforms the same classifier configured with its default parameters. 〈R〉, ΔR and max R represent the average, standard deviation and maximum value of the improvement obtained when random parameters are considered. Column 〈max ARI〉 indicates the average of the best accuracies obtained for each dataset.

Table 12.

Summary table for the performance of clustering algorithms in datasets DB2C2F and DB10C2F.

ARI_def represents the average accuracy obtained when considering the default parameters of the algorithms. The other two columns report, respectively, the average of the best accuracies obtained when varying a single parameter and the average of the best accuracies obtained when parameters are randomly selected.

Table 13.

Summary table for the performance of clustering algorithms in datasets DB2C10F and DB10C10F.

ARI_def represents the average accuracy obtained when considering the default parameters of the algorithms. The other two columns report, respectively, the average of the best accuracies obtained when varying a single parameter and the average of the best accuracies obtained when parameters are randomly selected.

Table 14.

Summary table for the performance of clustering algorithms in datasets DB2C200F and DB10C200F.

ARI_def represents the average accuracy obtained when considering the default parameters of the algorithms. The other two columns report, respectively, the average of the best accuracies obtained when varying a single parameter and the average of the best accuracies obtained when parameters are randomly selected.
