Aggregating soft labels from crowd annotations improves uncertainty estimation under distribution shift
Fig 5
Comparison of the average CLL and F1 score on the RTE task using different combinations of distributions for aggregation.
Points show the average performance across all combinations of a given number of distributions; error bars show 95% confidence intervals.
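The summary statistic in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function `summarize_by_size`, the placeholder names `a`, `b`, `c`, and the toy scores are all hypothetical, and the normal-approximation interval (1.96 standard errors) is an assumption, since the caption does not say how the confidence intervals were computed.

```python
import itertools
import math
import statistics


def summarize_by_size(names, score_fn):
    """For each subset size k, score every k-subset of `names` and
    return {k: (mean score, 95% CI half-width)}."""
    summary = {}
    for k in range(1, len(names) + 1):
        scores = [score_fn(combo) for combo in itertools.combinations(names, k)]
        mean = statistics.fmean(scores)
        if len(scores) > 1:
            # Assumed interval: normal approximation around the mean.
            half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
        else:
            # Only one combination exists (k == len(names)): no spread to report.
            half_width = 0.0
        summary[k] = (mean, half_width)
    return summary


# Toy stand-in for a real metric such as CLL or F1 (values are made up).
toy_scores = {"a": 0.5, "b": 0.7, "c": 0.9}


def toy_score_fn(combo):
    # Hypothetical: the score of a combination is the mean of its members' scores.
    return sum(toy_scores[name] for name in combo) / len(combo)


summary = summarize_by_size(list(toy_scores), toy_score_fn)
for k, (mean, half_width) in summary.items():
    print(f"k={k}: mean={mean:.3f} +/- {half_width:.3f}")
```

In a real evaluation, `score_fn` would aggregate the chosen distributions, train or evaluate a model on the result, and return the metric. With few combinations per size, a t-based or bootstrap interval may be more appropriate than the normal approximation used here.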