Aggregating soft labels from crowd annotations improves uncertainty estimation under distribution shift
Fig 10
Reliability diagram and expected calibration error (ECE, Equation 9, displayed × 100) for each soft labeling method in POS tagging.
Black bars indicate the accuracy in the given bin, and red bars indicate the gap between accuracy and confidence. Predictions are computed from the average of the logits produced by models trained with 10 different random seeds, with no temperature scaling. The ECE for aggregation is comparable to that of the best-performing methods (WaWA and ZBS). Models trained using aggregated soft labels are better calibrated in both the low- and high-confidence regimes.
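As an illustration of how the quantities in this figure could be computed, the following is a minimal sketch and not the authors' code. It assumes the standard binned definition of ECE (the bin-weighted average of the absolute gap between accuracy and confidence), equal-width confidence bins, and an array layout of `(n_seeds, n_examples, n_classes)` for the logits; the function names and the choice of 10 bins are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def reliability_and_ece(seed_logits, labels, n_bins=10):
    """Per-bin accuracy, per-bin confidence, and ECE from seed-averaged logits.

    seed_logits: array of shape (n_seeds, n_examples, n_classes)
    labels:      integer array of shape (n_examples,)
    n_bins:      number of equal-width confidence bins (assumed value)
    """
    # Average the logits over seeds; no temperature scaling is applied.
    probs = softmax(seed_logits.mean(axis=0))
    confidence = probs.max(axis=-1)
    correct = (probs.argmax(axis=-1) == labels).astype(float)

    # Assign each prediction to a right-closed bin (lo, hi].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_index = np.digitize(confidence, edges[1:-1], right=True)

    bin_acc = np.zeros(n_bins)
    bin_conf = np.zeros(n_bins)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_index == b
        if in_bin.any():
            bin_acc[b] = correct[in_bin].mean()
            bin_conf[b] = confidence[in_bin].mean()
            # Weight each bin's gap by the fraction of examples it holds.
            ece += in_bin.mean() * abs(bin_acc[b] - bin_conf[b])
    return bin_acc, bin_conf, ece
```

In a plot of this kind, `bin_acc` would give the bar heights for accuracy, `abs(bin_acc - bin_conf)` the gap bars, and `100 * ece` the value shown in the panel.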