Cross-Clustering: A Partial Clustering Algorithm with Automatic Estimation of the Number of Clusters

Four of the most common limitations of the many available clustering methods are: i) the lack of a proper strategy to deal with outliers; ii) the need for a good a priori estimate of the number of clusters to obtain reasonable results; iii) the lack of a method able to detect when partitioning of a specific data set is not appropriate; and iv) the dependence of the result on the initialization. Here we propose Cross-clustering (CC), a partial clustering algorithm that overcomes these four limitations by combining the principles of two well established hierarchical clustering algorithms: Ward's minimum variance and Complete-linkage. We validated CC by comparing it with a number of existing clustering methods, including Ward's and Complete-linkage. We show, on both simulated and real datasets, that CC performs better than the other methods in terms of the identification of the correct number of clusters, the identification of outliers, and the determination of real cluster memberships. We used CC to cluster samples in order to identify disease subtypes, and to cluster gene profiles in order to determine groups of genes with the same behavior. Results obtained on a non-biological dataset show that the method is general enough to be successfully used in such diverse applications. The algorithm has been implemented in the statistical language R and is freely available from the CRAN contributed packages repository.


Average Silhouette Width
The Silhouette Width [1] quantifies the degree of confidence with which an element is assigned to its cluster. For an element i:

S(i) = (b_i − a_i) / max(a_i, b_i),

where a_i is the average distance between i and the elements in the same cluster, and b_i is the average distance between i and the elements in the nearest cluster. S(i) lies in [−1, 1] and should be maximized. Taking the average of S(i) over all objects i belonging to a cluster we obtain the Average Silhouette Width (ASW), which allows us to identify weak clusters.
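A minimal sketch of this computation, assuming the silhouette() function from the cluster package; the data and partition below are purely illustrative.

library(cluster)

set.seed(1)
x  <- matrix(rnorm(60), ncol = 2)                    # toy data: 30 elements, 2 features
cl <- kmeans(x, centers = 3)$cluster                 # any hard partition works here

sil <- silhouette(cl, dist(x))                       # one S(i) per element
tapply(sil[, "sil_width"], sil[, "cluster"], mean)   # per-cluster averages (weak-cluster check)
mean(sil[, "sil_width"])                             # overall Average Silhouette Width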

Special cases for Cross-clustering
Multiple optimal contingency matrices

There can be cases in which more than one A is associated with the same maximum number of clustered elements. In order to choose a unique solution, the largest ASW is used to select the best combination of k_Cj and k_Wi. If more than one partition attains the same maximum ASW, the algorithm chooses the one with the smaller number of clusters in the Ward algorithm.
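As an illustration of this tie-breaking rule, a minimal R sketch follows; cand, n_clustered, asw, and k_w are hypothetical names for a table holding one row per (k_Wi, k_Cj) combination.

best <- cand[cand$n_clustered == max(cand$n_clustered), ]   # keep the combinations clustering most elements
best <- best[best$asw == max(best$asw), ]                   # among ties, keep the largest ASW
best <- best[which.min(best$k_w), ]                         # then the fewest Ward clusters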

Empty clusters
The algorithm may end up with empty clusters; an example is the contingency table in Table S1. Here A* starts as an empty matrix of size 3 × 3. The algorithm chooses max(A) = 6, puts it in A*[1,1], and removes the row and column where max(A) appears, obtaining the sub-matrix A−. Then the algorithm sets A*[2,2] = max(A−) = 4 and removes the row and column where max(A−) appears, obtaining a sub-matrix A2− of dimensions 1 × 2 that contains only zero values. The last available cell on the diagonal of A* therefore receives max(A2−) = 0. In this case such empty clusters are simply eliminated and the number of clusters is decreased accordingly, even when only one cluster remains.

Table S1. Example of a contingency matrix that will end up with an empty cluster.
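A minimal R sketch of this diagonal-filling step, under the assumption that ties are broken by taking the first occurrence of the maximum; the toy matrix below mirrors the situation described for Table S1 but is illustrative.

# Build diag(A*): repeatedly take the largest entry of the contingency matrix,
# record it, and delete its row and column; zero entries correspond to empty
# clusters and are dropped.
diag_max <- function(A) {
  d <- numeric(0)
  while (nrow(A) > 0 && ncol(A) > 0 && max(A) > 0) {
    idx <- which(A == max(A), arr.ind = TRUE)[1, ]   # first position of the maximum
    d <- c(d, A[idx[1], idx[2]])
    A <- A[-idx[1], -idx[2], drop = FALSE]           # remove that row and column
  }
  d
}

# Toy 3 x 4 contingency matrix: the third Ward cluster shares no elements with
# any remaining CL cluster, so only two clusters survive.
A <- matrix(c(6, 0, 0, 0,
              0, 4, 0, 0,
              0, 0, 0, 0), nrow = 3, byrow = TRUE)
diag_max(A)   # 6 4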

Example of Cross-clustering
Data in a 10 × 7 matrix have been simulated, with the rows representing genes and the columns representing samples. The random values of the first two samples are drawn from N(µ = 10, σ = 0.1), the values of samples 3 and 4 from N(µ = 20, σ = 0.1), and the values of samples 5 and 6 from N(µ = 5, σ = 0.1), while the last sample is designed to be the outlier, with values drawn from U(0, 1). We used RStudio (version 0.98.507) [2], setting a seed of 123.

Table S2. Results from the simulated data at step 5 of the algorithm. The first two columns report the number of clusters used for Ward's minimum variance and for CL, and the third column reports how many samples have been classified together by both algorithms. The last column reports the Average Silhouette Width (ASW), computed only for the combinations reaching the maximum number of classified elements (6, in this case) in order to choose the optimal combination.

Ward  CL  Elements classified  ASW
 2     3          5
 2     4          4
 2     5          3
 2     6          3
 3     4          6            0.84
 3     5          5
 3     6          4
 4     5          6            0.56
 4     6          5
 5     6          6            0.28

Table S3. Contingency matrix A obtained after step 7 of the CC algorithm on the simulated data, when the number of clusters for the Ward algorithm is set to 3 and for CL is set to 4.
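A minimal sketch of how such a toy matrix could be generated in R; the exact draw order (and therefore the individual values) in the original script may differ, so this is illustrative only.

# Toy 10 x 7 gene-by-sample matrix following the distributions described above.
set.seed(123)
n_genes <- 10
sim <- cbind(matrix(rnorm(n_genes * 2, mean = 10, sd = 0.1), ncol = 2),  # samples 1-2
             matrix(rnorm(n_genes * 2, mean = 20, sd = 0.1), ncol = 2),  # samples 3-4
             matrix(rnorm(n_genes * 2, mean =  5, sd = 0.1), ncol = 2),  # samples 5-6
             runif(n_genes))                                             # sample 7: the outlier
colnames(sim) <- paste0("S", 1:7)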

The distance chosen for the algorithm is Euclidean, as it is the most commonly used metric [3]. In this example the interval of plausible numbers of clusters is easy to set: between n_Wmin = 2 and n_Wmax = 5 for the Ward algorithm, so that Complete-linkage (CL) always has a higher number of clusters (n_Cmin = 3 and n_Cmax = 6) and is able to isolate the outlier. This results in 10 contingency tables A, with the number of rows varying between 2 and 5 and the number of columns varying between 3 and 6, representing all the possible combinations of Ward and CL partitions. The list of possible combinations is shown in Table S2. An example of a combination is shown in Table S3, where the number of clusters is set to 3 for the Ward algorithm and to 4 for CL. A* starts as an empty matrix of size n_Wi × n_Cj. The algorithm chooses max(A), puts it in A*[1,1], and removes the row and column where max(A) appears, obtaining a sub-matrix A−. Then the algorithm sets A*[2,2] to max(A−) and removes the row and column where max(A−) appears, obtaining a sub-matrix of A−, denoted A2−. Lastly, it sets A*[3,3] to max(A2−), removes the row and column where max(A2−) appears, and stops, as an empty matrix is obtained. In this case A* is a 3 × 3 matrix with all diagonal entries equal to 2. Cross-clustering (CC) repeats this operation for every combination of numbers of clusters; in this example we obtain 10 different A* matrices. The maximum sum of the diagonal (representing the maximum number of elements clustered together by both Ward and CL) is obtained for the combinations (3,4), (4,5), and (5,6). The combination (3,4) is chosen as it yields the largest ASW, 0.84. CC thus suggests 3 clusters made of the first three pairs of samples, excluding the last sample, which is correctly identified as an outlier.
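A sketch of one such combination on the simulated matrix above, using base R's hclust and cutree together with the diag_max() helper sketched earlier; whether the original implementation uses ward.D or ward.D2 is an assumption here.

# One (k_W, k_C) combination of the worked example: Ward and CL partitions of
# the 7 samples and their contingency matrix A (cf. Table S3).
d      <- dist(t(sim))                                   # Euclidean distances between samples
ward_k <- cutree(hclust(d, method = "ward.D2"), k = 3)   # Ward's minimum variance
cl_k   <- cutree(hclust(d, method = "complete"), k = 4)  # Complete-linkage
A      <- table(Ward = ward_k, CL = cl_k)                # contingency matrix A
sum(diag_max(A))                                         # elements clustered together by both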
Simulation study

Figure S1. Principal Component Analysis plot of one simulation of the data with the highest variance (σ = 1.5) on the first three principal components. Each group of points corresponds to one of the five prototypes used to define the general behavior of the genes. The gray points represent the 150 completely random genes added to the set. The cumulative percentage of variance explained is 89.8%. The categories are well defined even when the variability added to the data is high, while the outlier genes are distributed uniformly around the center.

Figure S2. Boxplots of the Adjusted Rand Index (ARI) resulting from applying CC, CL, and Ward's algorithm (all with Euclidean distance) with σ = 0.2, 0.5, 1, 1.5 on simulated data. The ARI measures the agreement between the partition obtained with each method and the real partition (the higher, the better), showing that CC performs at least as well as the two methods that constitute it. The y-axis has been cut in order to focus on the central part of the distribution rather than on outlying values.

High AUC values represent high true positive rates and low false positive rates; thus, the higher the better. In the first clusters the performance is approximately the same, while in the last group, which contains the outliers, CC performs better.

Figure S13. Boxplots of the Area Under the Curve in each of the K = 6 simulated clusters for S = 100 simulations with CC, CL, and the Ward algorithm when σ = 0.5, using the Chebychev distance. The last group contains the outliers. As CC's AUC is always at least as high as its competitors', we can conclude that it performs better than them in terms of sensitivity and specificity.

Table S4. The average information gain, along with its standard deviation, computed for each of the considered methods (Ward, CL, CC) over all numbers of clusters (ranging between 2 and 20 for Ward and CL, and automatically determined by the algorithm for CC) in each of the 100 simulations, for every σ added to the data (0.2, 0.5, 1.0, 1.5). CC always led to the highest information gain and the lowest standard deviation, meaning that the quality of the clusters in terms of purity is always higher when using CC.
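The agreement measure behind Figure S2 can be computed, for example, with the adjustedRandIndex() function from the mclust package (one of several R implementations); the labels below are illustrative.

# Adjusted Rand Index between a reference partition and an estimated one.
library(mclust)
true_part <- rep(1:3, each = 4)                     # illustrative reference labels
est_part  <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)  # labels returned by some clustering
adjustedRandIndex(true_part, est_part)              # 1 = perfect agreement, ~0 = random labelling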

Figure S16. Boxplots of the Sensitivity in each of the K = 6 simulated clusters for S = 100 simulations with CC and Ward when σ = 0.2. The sensitivity is the ratio between the number of true positives and the sum of true positives and false negatives.
In the first clusters the performance is approximately the same, while in the last cluster CC performs better.
Figure S17. Boxplots of the Sensitivity in each of the K = 6 simulated clusters for S = 100 simulations with CC and Ward when σ = 0.5. The sensitivity is the ratio between the number of true positives and the sum of true positives and false negatives.
In the first clusters the performance is approximately the same, while in the last cluster CC performs better.

Figure S18. Boxplots of the Sensitivity in each of the K = 6 simulated clusters for S = 100 simulations with CC and Ward when σ = 1. The sensitivity is the ratio between the number of true positives and the sum of true positives and false negatives. In the first clusters the performance is approximately the same, while in the last cluster CC performs better.
Figure S19. Boxplots of the sensitivity in each of the K = 6 simulated clusters for S = 100 simulations with CC and Ward when σ = 1.5. The sensitivity is the ratio between the number of true positives and the sum of true positives and false negatives.
In the first clusters the performance is approximately the same, while in the last cluster CC performs better.
Figure S25. Boxplots of the Geometric Accuracy (Acc_g) in each of the K = 6 simulated clusters for S = 100 simulations with CC and Ward when σ = 0.5. The Geometric Accuracy is the geometric mean of sensitivity and PPV. CC always performed better.

Figure S26. Boxplots of the Geometric Accuracy (Acc_g) in each of the K = 6 simulated clusters for S = 100 simulations with CC and Ward when σ = 1. The Geometric Accuracy is the geometric mean of sensitivity and PPV. CC always performed better.
Figure S27. Boxplots of the Geometric Accuracy (Acc_g) in each of the K = 6 simulated clusters for S = 100 simulations with CC and Ward when σ = 1.5. The Geometric Accuracy is the geometric mean of sensitivity and PPV. CC always performed better.

Figure S52. Boxplots of the Adjusted Rand Index (ARI) resulting from CC and autoSOME on simulated data. The ARI measures the agreement between the partition obtained with each method and the real partition, showing that CC always performs better than autoSOME in terms of ARI.

Figure S53. Boxplots of the sensitivity in each of the K = 6 simulated clusters for S = 100 simulations with CC and autoSOME when σ = 0.2. The sensitivity is the ratio between the number of true positives and the sum of true positives and false negatives.
In the first clusters the performance is approximately the same, while in the last group, which contains the outliers, CC performs better.

Figure S54. Boxplots of the sensitivity in each of the K = 6 simulated clusters for S = 100 simulations with CC and autoSOME when σ = 0.5. The sensitivity is the ratio between the number of true positives and the sum of true positives and false negatives.

In the first clusters the performance is approximately the same, while in the last group, which contains the outliers, CC performs better.

Figure S61. Boxplots of the geometric accuracy in each of the K = 6 simulated clusters for S = 100 simulations with CC and autoSOME when σ = 0.2. The geometric accuracy is the geometric mean of sensitivity and PPV. In the first clusters the performance is approximately the same, while in the last group, which contains the outliers, CC performs better.

Figure S62. Boxplots of the geometric accuracy in each of the K = 6 simulated clusters for S = 100 simulations with CC and autoSOME when σ = 0.5. The geometric accuracy is the geometric mean of sensitivity and PPV. In the first clusters the performance is approximately the same, while in the last group, which contains the outliers, CC performs better.

Figure S63. Boxplots of the geometric accuracy in each of the K = 6 simulated clusters for S = 100 simulations with CC and autoSOME when σ = 1. The geometric accuracy is the geometric mean of sensitivity and PPV. In the first clusters the performance is approximately the same, while in the last group, which contains the outliers, CC performs better.

Figure S64. Boxplots of the geometric accuracy in each of the K = 6 simulated clusters for S = 100 simulations with CC and autoSOME when σ = 1.5. The geometric accuracy is the geometric mean of sensitivity and PPV. In the first clusters the performance is approximately the same, while in the last group, which contains the outliers, CC performs better.

Figure S103. Boxplots of the geometric accuracy (Acc_g) in each of the K = 6 simulated clusters for S = 100 simulations with CC and Spectral clustering when σ = 1.5. The Geometric Accuracy is the geometric mean of sensitivity and PPV. CC always performed better.
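A minimal sketch of how the per-cluster sensitivity, PPV, and geometric accuracy reported in these figures can be computed from binary membership vectors; the vectors below are illustrative.

# Per-cluster evaluation metrics: sensitivity, positive predictive value (PPV)
# and their geometric mean Acc_g, from 0/1 membership indicators for one cluster.
cluster_metrics <- function(truth, pred) {
  tp <- sum(truth == 1 & pred == 1)
  fn <- sum(truth == 1 & pred == 0)
  fp <- sum(truth == 0 & pred == 1)
  sens <- tp / (tp + fn)                         # true-positive rate
  ppv  <- tp / (tp + fp)                         # precision
  c(sensitivity = sens, ppv = ppv, acc_g = sqrt(sens * ppv))
}

cluster_metrics(truth = c(1, 1, 1, 0, 0, 0),
                pred  = c(1, 1, 0, 0, 0, 1))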

Identification of the number of clusters - CC versus Complete-linkage
Figs. S104 to S107 show the ability of CC to identify the correct number of clusters compared with CL combined with several methods for estimating that number. The figures show that CC is consistent across all 100 simulated datasets, similarly to CL with some of the tested methods. CC proved to be one of the best performing approaches, detecting the actual number of clusters in all 100 simulations. The C-index consistently missed the actual number of clusters, while Beale, Duda, and Gap show high variability, with medians far from the true value. As σ increases, even KL starts to perform poorly.
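For reference, indices such as KL, Duda, or the C-index can be combined with CL via the NbClust package; a minimal sketch follows, where x stands for the data matrix and the choice of index is illustrative.

# Estimating the number of clusters for Complete-linkage with a classical index.
library(NbClust)
res <- NbClust(data = x, distance = "euclidean",   # x: numeric data matrix (illustrative)
               min.nc = 2, max.nc = 10,
               method = "complete", index = "kl")  # e.g. the KL index
res$Best.nc                                        # suggested number of clusters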

Identification of the number of clusters - CC versus Ward

Figs. S108 to S111 show the ability of CC to identify the correct number of clusters compared with Ward combined with several methods for estimating that number. The figures show that CC is consistent across all 100 simulated datasets, similarly to Ward with some of the tested methods. CC proved to be one of the best performing approaches, detecting the actual number of clusters in all 100 simulations. The C-index and Gap consistently missed the actual number of clusters. As σ increases, even KL starts to perform poorly.
Identification of the number of clusters - CC versus K-means

Figs. S112 to S115 show the ability of CC to identify the correct number of clusters compared with K-means combined with several methods for estimating that number. The figures show that CC is consistent across all 100 simulated datasets, similarly to K-means with the Jump method.

Stability of CC
We checked the stability of CC with respect to the input parameters. We ran CC with 10 different pairs of values (shown in Fig. S116) for the boundaries of I_W, while setting a high n_Cmax = 99, on the simulated data after the removal of outliers. Figs. S117 to S120 show the robustness of the method, which in all 100 simulations is able to identify the correct number of clusters (K = 5).

Figure S120. Boxplots of the number of clusters found by the CC algorithm with ten different intervals I_W and with σ = 1.5. The vertical line represents the correct number of clusters, K = 5. Labels on the y-axis report the lower and upper boundaries of each interval. The maximum number of clusters for CL is always set equal to 99. Also in this case, CC was always able to detect the actual number of clusters.

Table S6. 70 combinations of x and y dimensions for the SOM grid, used to decide which one to apply to the brain tumors dataset. The last column of both tables reports the total number of resulting available cells.
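A sketch of such a stability check; cross_clustering() is a hypothetical wrapper standing in for the actual CC call (the interface of the CRAN package may differ), and the intervals are illustrative.

# Run CC over several Ward intervals I_W, keeping the CL upper bound at 99,
# and record how many clusters are found each time.
intervals <- list(c(2, 10), c(2, 20), c(3, 15), c(4, 25), c(5, 30))     # example I_W bounds
n_found <- sapply(intervals, function(iw) {
  fit <- cross_clustering(x, k_w_min = iw[1], k_w_max = iw[2],          # hypothetical wrapper
                          k_c_max = 99)
  length(unique(fit$cluster[fit$cluster != 0]))                          # 0 assumed to mark outliers
})
table(n_found)                                                           # ideally all equal to 5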

Parameters for SOM
We applied SOM with a rectangular topology, Euclidean distance, number of training iterations rlen = 100, and a learning rate α decreasing linearly from 0.05 to 0.01; the radius of the neighborhood starts with a value that covers 2/3 of all unit-to-unit distances, and the initial values for each node are chosen randomly, without replacement, from the data.
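These settings correspond, for example, to the following call with the kohonen package (the grid dimensions are illustrative; the neighborhood radius and the random initialization are the package defaults).

# Self-organising map with the parameters described above.
library(kohonen)
fit <- som(scale(x),                                        # x: numeric data matrix (illustrative)
           grid = somgrid(xdim = 5, ydim = 6, topo = "rectangular"),
           rlen = 100,                                      # training iterations
           alpha = c(0.05, 0.01))                           # learning rate, decreasing linearly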

Parameters for DBSCAN
The idea is that points in a cluster are roughly at the same distance from their n-th nearest neighbor, while this distance is higher for noise points. The suggested n for two-dimensional data is 4. Therefore, plotting the distance of each point from its n-th nearest neighbor, sorted in descending order, gives hints concerning the proximity of the elements in the data. A threshold point p should be chosen as the first point in the first "valley" of the sorted distances: all points on the left of the threshold are considered to be noise, while all other points are assigned to some cluster. The parameters are then set as ε = dist(p) and MinPts = n.
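A sketch of this procedure with the dbscan package; the eps value below is a placeholder to be read off the k-NN distance plot.

# eps selection via the sorted k-NN distance plot, followed by the DBSCAN call.
library(dbscan)
kNNdistplot(x, k = 4)                  # sorted 4-NN distances; look for the first "valley"/knee
eps <- 0.5                             # placeholder: distance read off at the threshold point p
fit <- dbscan(x, eps = eps, minPts = 4)
fit$cluster                            # cluster labels; 0 marks noise points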

DBSCAN results
Table S8. Contingency table reporting the real classification in subtypes and the clusters identified by the CL method on the brain tumors dataset. Values in bold are those that resulted statistically significant after the over-representation analysis.

        MD  MGlio  Ncer  PNET  Rhab  Total
Clu1    10      9     1     1     9     30
Clu2     0      0     1     0     1      2
Clu3     0      1     2     7     0     10
Total   10     10     4     8    10     42

Table S9. Contingency table reporting the real classification in subtypes and the clusters identified by the K-means method on the brain tumors dataset. Values in bold are those that resulted statistically significant after the over-representation analysis.

        MD  MGlio  Ncer  PNET  Rhab  Total
Clu1     0      0     0     1     0      1
Clu2    10     10     4     7    10     41
Total   10     10     4     8    10     42

Table S10. Contingency table reporting the real classification in subtypes and the clusters identified by the SOM method on the brain tumors dataset. No subtypes are over-represented by any cluster.

Table S11. Contingency table reporting the real classification in subtypes and the clusters identified by CC on the brain tumors dataset. Values in bold are those that resulted statistically significant after the over-representation analysis. Although CC identified more than five clusters, four of them almost perfectly represented the MD, MGlio, Ncer, and Rhab subtypes. The PNET subtype was spread over six of the CC clusters, but it is well known for having a high heterogeneity.
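The over-representation analysis referenced in these captions can be sketched, for example, as a one-sided Fisher's exact test on each cluster-by-subtype cell; this is an assumption, and the original analysis may use a different test or correction.

# One-sided Fisher test of over-representation for every cell of a
# cluster-by-subtype contingency table 'tab' (counts only, no margins).
overrep_p <- function(tab) {
  p <- matrix(NA_real_, nrow(tab), ncol(tab), dimnames = dimnames(tab))
  for (i in seq_len(nrow(tab))) {
    for (j in seq_len(ncol(tab))) {
      in_c  <- tab[i, j]                        # subtype j inside cluster i
      in_o  <- sum(tab[i, ]) - in_c             # other subtypes inside cluster i
      out_c <- sum(tab[, j]) - in_c             # subtype j outside cluster i
      out_o <- sum(tab) - in_c - in_o - out_c   # everything else
      p[i, j] <- fisher.test(matrix(c(in_c, in_o, out_c, out_o), nrow = 2),
                             alternative = "greater")$p.value
    }
  }
  p
}

Applying overrep_p() to the counts of, e.g., Table S8 would give one p-value per cluster-subtype pair, small values indicating over-representation.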

Method    # Clusters  ARI
Ward      2           0.14
CL        2           0.18
K-means   3           0.19
SOM       2           0.003
CC        9 + 1       0.64

Table S12. Summary of results on the brain tumors dataset with Euclidean distance. The table reports each method applied, with the resulting number of clusters and adjusted Rand Index (ARI; the higher, the better). The actual number of subtypes in the dataset is five. CC identified nine clusters plus one cluster containing a single outlier, leading to the highest ARI. None of the other methods was able to identify the correct number of clusters. Although CC identified more than five clusters, four of them almost perfectly represented the MD, MGlio, Ncer, and Rhab subtypes. The PNET subtype was spread over six of the CC clusters, but it is well known for having a high heterogeneity.

Olive oil data

Figure S125. Graphical representation of the true membership (first row) of the 572 olive oil specimens in the olive oil data, compared with the memberships resulting from CC, DBSCAN, SOM, K-means, CL, and Ward. The nine natural clusters, in the order of the first row of the image, are: Apulia north, Calabria, Apulia south, Sicily, Sardinia inland, Sardinia coast, Liguria east, Liguria west, Umbria. The colors represent the index of the cluster given by each method; white represents outliers, detected only by CC and DBSCAN.

The classical approaches performed poorly, obtaining ARI values of around 0.30; in terms of the number of clusters, the ASW criterion always identified 2 clusters. In contrast, CC obtained an ARI of 0.60, identifying three clusters and one sample as an outlier.

Table S25. Running times of the different methods in clustering the same 100 simulated data sets, sorted by increasing average time. Here s stands for "seconds", m for "minutes", and h for "hours". Computer information: Windows Server 2012 R2 Standard (64 bit); quad-core AMD Opteron(tm) Processor 8356 (4 quad-cores at 2.30 GHz); 128 GB RAM. Spectral clustering has been parallelized on 16 cores.

Figure S128. Boxplots of the information gain computed over Ward's results on 100 simulated datasets with σ = 0.5, for every number of clusters ranging between 2 and 20. At the beginning the information gain tends to increase as the number of clusters increases, showing a high variability of the information gained by splitting the entire dataset into 2, 3, and 4 clusters. When approaching the real number of clusters (6), the information gained by splitting the data is high, with a low variability of the results, meaning that the clusters have a good quality in terms of purity.
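For reference, a classical baseline on the olive oil data can be sketched as follows, assuming the oliveoil data shipped with the pdfCluster package (572 specimens, nine production areas) and adjustedRandIndex() from mclust; the column layout used below is an assumption.

# Ward baseline on the olive oil data and its agreement with the nine areas.
library(pdfCluster)   # assumed source of the 572-specimen 'oliveoil' data
library(mclust)
data(oliveoil)
acids <- scale(oliveoil[, -(1:2)])              # fatty-acid measurements (assumed layout)
truth <- oliveoil$region                        # nine production areas
ward9 <- cutree(hclust(dist(acids), method = "ward.D2"), k = 9)
adjustedRandIndex(truth, ward9)                 # classical baseline ARI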

Breast cancer dataset
Figure S129. Boxplots of the information gain computed over CL results on 100 simulated datasets with σ = 0.5, for every number of clusters ranging between 2 and 20. At the beginning the information gain tends to increase as the number of clusters increases, showing a high variability of the results. The information gained by splitting the entire dataset into each tried number of clusters is very variable and is not particularly high in correspondence with the real number of clusters (which is 6).

Figure S130. Boxplots of the information gain computed over CC results on 100 simulated datasets with σ = 0.5, for every number of clusters found by the algorithm. The information gained by splitting the whole dataset into the number of clusters chosen by the algorithm is high, even if the real number of clusters (6) is never correctly identified.
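The information gain plotted in Figures S128 to S130 can be sketched as the entropy of the true labels minus the cluster-weighted conditional entropy; this is an assumed formulation, and the original definition may differ in detail.

# Information gain of a partition 'cluster' with respect to the true labels 'truth'.
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

info_gain <- function(truth, cluster) {
  h_total <- entropy(prop.table(table(truth)))                  # entropy of the true labels
  h_cond <- sum(sapply(split(truth, cluster), function(part)
    length(part) / length(truth) * entropy(prop.table(table(part)))))
  h_total - h_cond                                              # reduction in impurity
}

info_gain(truth   = rep(1:3, each = 4),
          cluster = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3))      # illustrative labels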