Aggregating soft labels from crowd annotations improves uncertainty estimation under distribution shift
Fig 7
Comparison of the average CLL and F1 score on the Toxicity detection task using different combinations of distributions for aggregation.
Points show the average performance across all combinations of a given number of distributions; error bars show 95% confidence intervals.
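The summary plotted in such a figure can be sketched as follows. This is a minimal illustration, not the authors' code: the distribution names, the `score_fn` callback, and the toy scores are hypothetical placeholders, and the normal-approximation interval is an assumption, since the caption does not state how the confidence intervals were computed.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev


def summarize_by_size(names, score_fn, z=1.96):
    """Return {k: (mean_score, ci_half_width)} over all size-k combinations.

    names    -- labels of the candidate distributions (hypothetical here)
    score_fn -- callable mapping a tuple of names to a metric such as F1
    z        -- critical value; 1.96 assumes a normal-approximation 95% interval
    """
    summary = {}
    for k in range(1, len(names) + 1):
        # Score every way of choosing k distributions to aggregate.
        scores = [score_fn(combo) for combo in combinations(names, k)]
        avg = mean(scores)
        # A single combination has no spread, so report a zero-width interval.
        half = z * stdev(scores) / sqrt(len(scores)) if len(scores) > 1 else 0.0
        summary[k] = (avg, half)
    return summary


# Toy example: made-up per-distribution scores, combined by simple averaging.
toy_scores = {"A": 0.6, "B": 0.7, "C": 0.8}
result = summarize_by_size(
    list(toy_scores),
    lambda combo: mean(toy_scores[name] for name in combo),
)
for k, (avg, half) in result.items():
    print(f"{k} distribution(s): mean={avg:.3f} +/- {half:.3f}")
```

In practice, `score_fn` would aggregate the chosen distributions, train or evaluate a model, and return the metric of interest; each entry of the result corresponds to one point and its error bar.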