Fig 1.
The GECO metric scoring process: from cluster assignment to GECO score.
Transcriptomic data were clustered using k-means clustering. Each cluster contained both ground truth genes (shown in blue) and non-ground truth genes (shown in grey). The scoring function was applied to each gene in each cluster. The gene score reflects the likelihood of that gene being a ground truth gene, based on the makeup of the cluster to which the gene belongs and the distribution of ground truth genes throughout the dataset. The output is a table containing all the genes in the dataset, scored by cluster, and their associated gene scores. The gene scores are used to generate a ROC plot, and the corresponding AUC value is the GECO metric, which indicates the overall quality of the clusters.
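The path from cluster assignment to a single AUC value can be illustrated with a minimal sketch. The scoring function is not specified in this caption, so the sketch assumes a simple stand-in: each gene is scored by the fraction of ground truth genes in its cluster. This uses only the cluster makeup and omits the dataset-wide distribution term mentioned above. The function names `cluster_fraction_scores` and `roc_auc` are hypothetical and are not part of GECO.

```python
from collections import Counter


def cluster_fraction_scores(labels, is_truth):
    """Score each gene by the fraction of ground truth genes in its cluster.

    ASSUMPTION: this fraction is an illustrative stand-in for the GECO
    scoring function, which the caption does not define.
    """
    size = Counter(labels)
    hits = Counter(lab for lab, t in zip(labels, is_truth) if t)
    return [hits[lab] / size[lab] for lab in labels]


def roc_auc(scores, is_truth):
    """Area under the ROC curve, computed as the probability that a random
    ground truth gene outscores a random non-ground-truth gene (ties = 0.5)."""
    pos = [s for s, t in zip(scores, is_truth) if t]
    neg = [s for s, t in zip(scores, is_truth) if not t]
    if not pos or not neg:
        raise ValueError("need both ground truth and non-ground-truth genes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 8 genes in 3 clusters; True marks a ground truth gene.
labels = [0, 0, 0, 1, 1, 1, 2, 2]
is_truth = [True, True, False, False, False, True, False, False]

scores = cluster_fraction_scores(labels, is_truth)
auc = roc_auc(scores, is_truth)  # the summary value for this clustering
```

In this toy case the ground truth genes are concentrated in cluster 0, so the AUC is well above the 0.5 expected when ground truth genes are spread evenly across clusters.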
Fig 2.
Using the GECO metric to determine an optimal number of clusters.
A. A ROC plot using the scores from the GECO metric over ten iterations of k-means clustering with 16 clusters. The GECO scores for the ribosomal protein gene ground truth set are shown in blue, and the scores for an equivalent number of randomly sampled genes are shown in grey. The GECO metric value is also noted for each group of genes. B. K-means clustering was performed ten times for each value of k, ranging logarithmically from 1 to 14,175; each clustering iteration was scored using the GECO metric. The values for the ribosomal protein gene ground truth set are plotted in blue, and the values for an equivalent number of randomly sampled genes are plotted in grey. The boxes span the 0.25 to 0.75 quantiles (the interquartile range). The whiskers range from the minimum to the maximum values. The bar within each box indicates the mean value over the ten iterations. C. A ROC plot generated in the same way as A, but with 91 clusters. Blue represents the ribosomal protein gene ground truth set and grey the randomly sampled genes. The average GECO metric values are again provided. D. GECO metric scores for all three ground truth sets. The boxes, whiskers, and inner bars are used in the same manner as detailed for panel B.
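The sweep over k and the box-plot summary in panel B can be sketched as follows. This is only an illustration: the number of grid points, the rounding rule, and the quantile interpolation method are assumptions, as are the helper names `log_spaced_ks` and `box_summary`; the ten example values are made up and are not results from the paper.

```python
import statistics


def log_spaced_ks(k_min, k_max, n_points):
    """Integer cluster counts spaced evenly on a log scale, duplicates removed.

    ASSUMPTION: the grid size and rounding used for the figure are not stated.
    """
    ratio = (k_max / k_min) ** (1 / (n_points - 1))
    ks = {min(k_max, max(k_min, round(k_min * ratio ** i))) for i in range(n_points)}
    return sorted(ks)


def box_summary(values):
    """Box-plot statistics as described in the caption: quartile box,
    min-to-max whiskers, and the mean as the inner bar."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }


ks = log_spaced_ks(1, 14175, 5)

# Hypothetical GECO values from ten clustering iterations at a single k.
example_scores = [0.70, 0.72, 0.74, 0.76, 0.78, 0.80, 0.82, 0.84, 0.86, 0.88]
summary = box_summary(example_scores)
```

One `box_summary` per value of k, computed separately for the ground truth set and for the randomly sampled genes, gives the paired blue and grey boxes.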
Fig 3.
GECO performance is driven by the choice of the ground truth dataset.
A. K-means clustering was performed on the 14,175-gene DICE dataset ten times per number of clusters (the x-axis value). Each iteration was scored by GECO, the scores were used to generate ROC plots, and those plots produced AUC values, which are summarized by the boxes in the boxplot. The boxes span the 0.25 to 0.75 quantiles (the interquartile range). The whiskers range from the minimum to the maximum values. The bar within each box indicates the mean value over the ten iterations. The purple datapoints are scores from the “Burel” 74-gene CD4 T-cell TB signature. B. K-means clustering and GECO scoring of the 10,263-gene CD4 T-cell dataset were carried out in the same manner as in part A. The scores for the Burel 74-gene CD4 T-cell TB signature are significantly higher in the CD4 T-cell dataset than in the DICE dataset, owing to the sensitivity of the metric to the choice of ground truth set in relation to the dataset.
Fig 4.
Validating the metric using known gene modules and the k91 DICE clusters.
A. The GECO metric values evaluating the cluster quality for three sets of clusters: the 2008 BTMs, the 2019 BTMs, and the k91 DICE clusters. The ground truth set used to score the different cluster groups was the “Burel” 74-gene CD4 T-cell TB signature. The error bar on the k91 DICE clusters indicates the range of potential scores from the other 9 iterations of the k91 DICE clusters. B. The GECO metric values evaluating the cluster quality of the same three cluster sets as in A. The ground truth set used to score the clusters was the “Berry” 393-gene whole blood TB signature.