Aggregating soft labels from crowd annotations improves uncertainty estimation under distribution shift
Fig 2
Significance testing for the POS task. We apply a Bonferroni correction over the number of independent variables (N = 7); a more conservative correction over the total number of tests (N = 56) can be found in the supplemental information.
Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.
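As a rough illustration of how the colour coding relates to the Bonferroni correction, the sketch below divides a base significance level by the number of comparisons and maps a single pairwise test to a cell colour. The base level alpha = 0.05, the function names, and the signed score difference `row_minus_col` are assumptions made for illustration; the caption states none of them, and this is not the authors' code.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def cell_colour(p_value: float, row_minus_col: float,
                alpha: float = 0.05, n_tests: int = 7) -> str:
    """Map one pairwise test to a cell colour.

    alpha = 0.05 is an assumed base level (not given in the caption).
    row_minus_col is a hypothetical signed difference in score, positive
    when the row method scores better than the column method.
    """
    threshold = bonferroni_threshold(alpha, n_tests)
    if p_value >= threshold or row_minus_col == 0:
        return "grey"   # no statistically significant difference
    return "green" if row_minus_col > 0 else "red"


# The same p-value can be significant under N = 7 but not under N = 56.
colour_n7 = cell_colour(0.001, 0.5, n_tests=7)
colour_n56 = cell_colour(0.001, 0.5, n_tests=56)
```

Under these assumptions the per-test threshold is 0.05 / 7 for the correction over independent variables and 0.05 / 56 for the correction over all tests, which is why the latter is the more conservative of the two.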