Aggregating soft labels from crowd annotations improves uncertainty estimation under distribution shift
Fig 13
Significance testing for the RTE task. We apply the Bonferroni correction across all tests (N = 56).
Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.
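As a minimal sketch of how a single cell's colour could be derived under the Bonferroni correction described above: the significance level `alpha = 0.05`, the function name `classify_cell`, and the convention that a positive row-minus-column difference means the row method is better are all assumptions for illustration and are not stated in the caption. The underlying statistical test that produces each p-value is likewise not specified here.

```python
def classify_cell(p_value, row_minus_col, n_tests=56, alpha=0.05):
    """Return the colour of one row-vs-column cell (hypothetical helper).

    p_value       -- uncorrected p-value of the pairwise test
    row_minus_col -- score difference; positive is ASSUMED to favour the row
    n_tests       -- total number of tests (N = 56 in the caption)
    alpha         -- family-wise significance level (assumed, not from the caption)
    """
    # Bonferroni: each test is judged against alpha divided by the number of tests.
    threshold = alpha / n_tests
    if p_value >= threshold:
        return "grey"  # no statistically significant difference
    # Significant: the sign of the difference decides the direction.
    return "green" if row_minus_col > 0 else "red"
```

With N = 56 and the assumed alpha of 0.05, an individual comparison must reach p < 0.05 / 56 (roughly 0.0009) to be coloured green or red.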