ClustAll: An R package for patient stratification in complex diseases

doi:10.1371/journal.pcbi.1012656

Fig 1.

Schematic representation of the ClustAll pipeline.

(A) Main steps and features of the ClustAll package, illustrating the workflow integrated into the tool for clustering, data analysis, and results visualization; (B) overview of the ClustALL algorithm methodology; (C) overview of an application example. Figure based on Palomino-Echeverria et al. 2024 [5].

Table 1.

ClustALL methods nomenclature.

Fig 2.

Heatmap with the Jaccard indexes for population-robust stratifications.

Heatmap of the pairwise similarity between stratifications, measured with the Jaccard index. Similar stratifications are grouped together, which makes it possible to identify sets of stratifications that behave alike. The X-axis and Y-axis both list the stratifications. The color gradient, ranging from blue to white, encodes the Jaccard index: darker blue indicates a higher index and therefore greater similarity between two stratifications, while lighter blue (approaching white) indicates a lower index and less similarity. Red dashed lines mark groups of stratifications whose similarity meets the Jaccard index threshold (0.9 in this case). The label “Distance” refers to the distance measure used, such as correlation-based or Gower distance. The label “Clustering” indicates the clustering method applied, such as hierarchical clustering, k-means, or k-medoids. “Depth” refers to the embedding level, with a color range encoding the depth in the dendrogram. H-Clustering: hierarchical clustering.
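As an illustration of the similarity measure in this heatmap, the following minimal Python sketch computes a pair-counting Jaccard index between two stratifications and compares it with the 0.9 threshold. It is an assumption-laden sketch, not the ClustAll implementation: the package is written in R, the function name `pair_jaccard` is hypothetical, and the pair-counting definition (pairs of patients grouped together in both stratifications, divided by pairs grouped together in at least one) is one common way to compare clusterings that may differ from what the package computes.

```python
from itertools import combinations


def pair_jaccard(labels_a, labels_b):
    """Pair-counting Jaccard index between two cluster assignments.

    Hypothetical helper for illustration only. Cluster label names do not
    matter; only which samples share a cluster does.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both stratifications must cover the same samples")
    together_in_both = 0
    together_in_either = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            together_in_both += 1
        if same_a or same_b:
            together_in_either += 1
    # If no pair is ever grouped together, treat the partitions as identical.
    if together_in_either == 0:
        return 1.0
    return together_in_both / together_in_either


# Two stratifications of the same four patients with swapped label names.
strat_1 = [0, 0, 1, 1]
strat_2 = [1, 1, 0, 0]
score = pair_jaccard(strat_1, strat_2)
is_similar = score >= 0.9  # threshold taken from the figure caption
```

Because the index depends only on co-membership, relabeling the clusters leaves the score unchanged, which is why two stratifications with swapped label names count as identical.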

Table 2.

Breast Cancer Wisconsin (Diagnostic) dataset attributes description.

Table 3.

Sensitivity and specificity for the stratification representatives.

Performance metrics for the representative stratifications identified by ClustAll on the breast cancer dataset. The "Nomenclature" column shows the identifier of each stratification. The "Distance Metric" and "Clustering Method" columns indicate the distance measure and clustering algorithm used, respectively. "Embedding depth" refers to the level in the dendrogram at which the embedding was created during the Data Complexity Reduction step. "Sensitivity" is the proportion of true positive cases (malignant tumors) correctly classified by the stratification. "Specificity" is the proportion of true negative cases (benign tumors) correctly classified. Both are calculated by comparing the stratification results to the reference column (true labels) in the dataset.
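The two metrics defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the ClustAll package: the function name `sensitivity_specificity` and the toy labels are hypothetical, and each cluster is assumed to have already been mapped to a class ("M" for malignant, "B" for benign) before comparison with the reference column.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive):
    """Return (sensitivity, specificity) against a reference labeling.

    Hypothetical helper for illustration only. `positive` is the label
    treated as the positive class (here, malignant tumors).
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            if pred == positive:
                tp += 1  # malignant, classified as malignant
            else:
                fn += 1  # malignant, classified as benign
        else:
            if pred == positive:
                fp += 1  # benign, classified as malignant
            else:
                tn += 1  # benign, classified as benign
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy data, not taken from the paper.
reference = ["M", "M", "M", "B", "B", "B", "B"]
stratification = ["M", "M", "B", "B", "B", "B", "M"]
sens, spec = sensitivity_specificity(reference, stratification, positive="M")
```

In this toy example two of three malignant cases and three of four benign cases are recovered, so sensitivity is 2/3 and specificity is 3/4.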

Table 4.

ClustAll performance against standard clustering algorithms.

Comparison of ClustAll’s performance against standard clustering algorithms across multiple datasets. The "Method" column lists the clustering approaches evaluated, including ClustAll and various combinations of distance metrics and clustering algorithms. The "Sensitivity" and "Specificity" columns show the average proportion of positive and negative cases, respectively, that each method identifies correctly when compared to the known reference column (true labels). These values are averaged across multiple bootstrap samples to ensure reliability. The "Stability" column indicates the consistency of cluster assignments across bootstrap iterations, with higher values suggesting more robust clustering. Results are provided for three scenarios: the complete breast cancer dataset, the breast cancer dataset with imputed missing values, and the heart attack dataset.

Fig 3.

Runtime of ClustAll across different numbers of variables and cores.

(A) Runtime (in seconds) of two methods ("Linear_ClustALL" and "S4 ClustAll") for different dataset sizes (number of variables) in the breast cancer dataset. This comparison evaluates the sequential (non-parallelized) performance of linear ClustALL and S4 ClustAll using a single core. Results are shown for datasets with 10 imputations and without imputations. (B) Runtime (in seconds) of S4 ClustAll across different numbers of computational cores for three imputation scenarios (no imputation, 10 imputations, and 100 imputations) on the same dataset. The workstation used for benchmarking has an AMD EPYC 7742 64-core processor and 2.1 TB of RAM.
