A metric for evaluating biological information in gene sets and its application to identify co-expressed gene clusters in PBMC
Fig 3
GECO performance is driven by the choice of the ground truth dataset.
A. K-means clustering was performed on the 14,175-gene DICE dataset ten times for each number of clusters (x-axis). Each iteration was scored by GECO, the scores were used to generate ROC plots, and the AUC values from those plots are summarized in the boxplot. The boxes span the 0.25 to 0.75 quantiles (the interquartile range). The whiskers range from the minimum to the maximum value. The bar within each box indicates the mean value over the ten iterations. The purple data points are scores from the “Burel” 74-gene CD4 T-cell TB signature. B. K-means clustering and GECO scoring of the 10,263-gene CD4 T-cell dataset, carried out in the same manner as in part A. The scores for the Burel 74-gene CD4 T-cell TB signature are significantly higher in the CD4 T-cell dataset than in the DICE dataset, reflecting the sensitivity of the metric to the choice of ground truth set in relation to the dataset.
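The box statistics described in the legend can be illustrated with a minimal sketch. It assumes the ten AUC values for one cluster count are already available. The function name `box_summary` and the example values are hypothetical and are not taken from the study. The quartile interpolation method is also an assumption, since the legend does not specify one.

```python
import statistics


def box_summary(auc_values):
    """Summarize AUC values from repeated clustering runs at one cluster count.

    Hypothetical helper: returns the quantities the legend describes
    (box edges at the 0.25 and 0.75 quantiles, whiskers at min and max,
    and the mean shown as the bar within the box).
    """
    # "inclusive" treats the data as the full population of runs;
    # this interpolation choice is an assumption, not the authors' setting.
    q1, _median, q3 = statistics.quantiles(auc_values, n=4, method="inclusive")
    return {
        "q1": q1,                       # lower edge of the box
        "q3": q3,                       # upper edge of the box
        "min": min(auc_values),         # lower whisker
        "max": max(auc_values),         # upper whisker
        "mean": statistics.fmean(auc_values),  # bar within the box
    }


# Made-up AUC values standing in for ten k-means iterations at a single k.
example_aucs = [0.60, 0.62, 0.64, 0.66, 0.68, 0.70, 0.72, 0.74, 0.76, 0.78]
summary = box_summary(example_aucs)
print(summary)
```

Repeating this summary for each number of clusters would give one box per x-axis value, which is the layout the legend describes for both panels.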