Aggregating soft labels from crowd annotations improves uncertainty estimation under distribution shift
Fig 14
Significance testing for the POS task. We apply the Bonferroni correction across all tests (N = 56).
Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.
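As a rough illustration of how the colour coding follows from the Bonferroni correction, the sketch below divides a family-wise significance level by the number of tests and maps one row-versus-column comparison to a cell colour. The caption does not state the significance level or the underlying test, so `alpha = 0.05` is an assumption, and the names `bonferroni_threshold` and `cell_colour` are hypothetical helpers rather than the authors' code.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def cell_colour(p_value: float, row_is_better: bool,
                alpha: float = 0.05, n_tests: int = 56) -> str:
    """Map one row-vs-column comparison to a cell colour.

    alpha = 0.05 is an assumed family-wise level (not given in the caption);
    n_tests = 56 is the number of tests stated in the caption.
    """
    if p_value >= bonferroni_threshold(alpha, n_tests):
        return "grey"  # no statistically significant difference
    # Significant: the colour depends on the direction of the difference.
    return "green" if row_is_better else "red"


if __name__ == "__main__":
    # With the assumed alpha, the per-test threshold is 0.05 / 56.
    print(bonferroni_threshold(0.05, 56))
    print(cell_colour(0.0001, row_is_better=True))
    print(cell_colour(0.01, row_is_better=True))
```

Under these assumptions a comparison only counts as significant when its uncorrected p-value falls below 0.05 / 56, roughly 0.00089, which is why many cells in a grid like this can end up grey.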