Aggregating soft labels from crowd annotations improves uncertainty estimation under distribution shift
Fig 16
Significance testing for the Image Cls. task. We apply the Bonferroni correction across all tests (N = 56). Green indicates that the method in the row is significantly better than the method in the column, red indicates that it is significantly worse, and grey indicates no statistically significant difference.
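As a rough illustration of how the Bonferroni correction sets the cell colours, the sketch below divides a significance level by the number of tests and maps one pairwise comparison to a colour. The caption gives only N = 56. The significance level `alpha = 0.05`, the function names, and the `row_better` flag (whether the row method scored better than the column method) are assumptions made for illustration, and the underlying statistical test is not specified here.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def cell_colour(p_value: float, row_better: bool,
                alpha: float = 0.05, n_tests: int = 56) -> str:
    """Map one row-vs-column comparison to a colour in the figure.

    alpha = 0.05 is an assumed significance level; n_tests = 56 is the
    number of tests stated in the caption.
    """
    # Not significant after correction: no colour distinction is drawn.
    if p_value >= bonferroni_threshold(alpha, n_tests):
        return "grey"
    # Significant: the colour depends on the direction of the difference.
    return "green" if row_better else "red"
```

Under these assumptions the per-test threshold is 0.05 / 56, roughly 8.9e-4, so a comparison with an uncorrected p-value of 0.001 would still be shown in grey.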