Aggregating soft labels from crowd annotations improves uncertainty estimation under distribution shift

Fig 5

Comparison of the average CLL and F1 score on the RTE task using different combinations of distributions for aggregation.

Points show the average performance across all combinations of a given number of distributions; error bars show 95% confidence intervals.
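
The summary plotted in each point can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `summarize_by_subset_size`, the `score_fn` callback (standing in for training and evaluating with one combination of distributions, returning a CLL or F1 value), and the normal-approximation interval are all assumptions, since the caption does not state how the confidence intervals were computed.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev


def summarize_by_subset_size(names, score_fn, z=1.96):
    """Return {k: (mean_score, ci_half_width)} over all size-k combinations.

    names    -- identifiers of the candidate distributions (hypothetical)
    score_fn -- callable mapping a tuple of names to a score (hypothetical)
    z        -- critical value; 1.96 gives a 95% normal-approximation interval
    """
    summary = {}
    for k in range(1, len(names) + 1):
        # Score every combination that uses exactly k distributions.
        scores = [score_fn(combo) for combo in combinations(names, k)]
        # Assumed interval: z * standard error. With a single combination
        # the sample standard deviation is undefined, so report 0.0.
        if len(scores) > 1:
            half_width = z * stdev(scores) / sqrt(len(scores))
        else:
            half_width = 0.0
        summary[k] = (mean(scores), half_width)
    return summary
```

Each entry of the returned dictionary corresponds to one point and its error bar on the x-axis position `k`. With few combinations per `k`, a t-based or bootstrap interval may be more appropriate than the normal approximation used here.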