Aggregating soft labels from crowd annotations improves uncertainty estimation under distribution shift

Fig 7

Comparison of the average CLL and F1 score on the Toxicity detection task using different combinations of distributions for aggregation.

Each point is the average performance across all combinations of a given number of distributions; error bars show 95% confidence intervals.
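The summary statistic described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the method names, the toy scores, the `score_fn` helper, and the use of a normal-approximation interval (z = 1.96) are all assumptions made for the example, since the caption does not state how the intervals were computed.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev


def summarize_by_subset_size(methods, score_fn, z=1.96):
    """For each subset size k, average score_fn over all k-subsets of methods.

    Returns {k: (mean_score, half_width)}, where half_width is a
    normal-approximation 95% confidence half-width (an assumption here).
    """
    summary = {}
    for k in range(1, len(methods) + 1):
        scores = [score_fn(subset) for subset in combinations(methods, k)]
        avg = mean(scores)
        # A single combination (k == len(methods)) has no spread to estimate.
        if len(scores) > 1:
            half_width = z * stdev(scores) / sqrt(len(scores))
        else:
            half_width = 0.0
        summary[k] = (avg, half_width)
    return summary


# Hypothetical placeholder methods with made-up scores, for illustration only.
toy_scores = {"A": 0.5, "B": 0.6, "C": 0.7, "D": 0.8}


def score_fn(subset):
    # Stand-in for "aggregate these distributions, then evaluate CLL or F1".
    return mean(toy_scores[name] for name in subset)


summary = summarize_by_subset_size(list(toy_scores), score_fn)
for k, (avg, half_width) in summary.items():
    print(f"{k} distributions: {avg:.3f} +/- {half_width:.3f}")
```

In a real evaluation, `score_fn` would aggregate the soft labels produced by the chosen methods and score a model trained on them; the enumeration and averaging over subsets would stay the same.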