Aggregating soft labels from crowd annotations improves uncertainty estimation under distribution shift

doi:10.1371/journal.pone.0323064

Table 1.

The accuracy of each annotation method with respect to the expert annotations in each dataset. Aggregating maintains best or near-best accuracy across tasks.

Table 2.

Negative log likelihood of each annotation method with respect to the expert annotations in each dataset. Individual soft-labeling methods vary between tasks, while aggregating maintains best or near-best NLL.

Table 3.

F1 and calibrated log likelihood. Results are averaged over 10 random seeds; standard deviation is given in the subscript. Tasks marked by * are subject to input data distribution shift, while tasks marked by † are subject to annotator pool distribution shift. Methods marked by ‡ are those which estimate either worker skill or item difficulty. Aggregating the individual soft-labeling methods yields classifiers with consistently good uncertainty estimation (best on all text-based tasks) and generally good raw performance in terms of F1 across tasks.

Fig 1.

Significance testing for the RTE task. We apply the Bonferroni correction across the number of independent variables (N = 7; a more conservative estimate across the total tests N = 56 can be found in the supplemental information).

Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.
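
The Bonferroni correction mentioned in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the base significance level `alpha = 0.05` is an assumption (the caption does not state it), and the function names are hypothetical. Only the two correction sizes, N = 7 and N = 56, come from the caption.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after a Bonferroni correction."""
    return alpha / n_tests


def is_significant(p_value: float, alpha: float = 0.05, n_tests: int = 7) -> bool:
    """True if p_value passes the corrected threshold.

    alpha=0.05 is an assumed default; n_tests=7 follows the caption.
    """
    return p_value < bonferroni_threshold(alpha, n_tests)


# The same p-value can pass with N = 7 and fail with the more
# conservative N = 56.
p = 0.005
print(is_significant(p, n_tests=7), is_significant(p, n_tests=56))
```

Dividing by 56 rather than 7 lowers the per-test threshold by a factor of eight, which is why the caption calls it the more conservative estimate.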

Fig 2.

Significance testing for the POS task. We apply the Bonferroni correction across the number of independent variables (N = 7; a more conservative estimate across the total tests N = 56 can be found in the supplemental information).

Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.

Fig 3.

Significance testing for the Toxicity task. We apply the Bonferroni correction across the number of independent variables (N = 7; a more conservative estimate across the total tests N = 56 can be found in the supplemental information).

Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.

Fig 4.

Significance testing for the Image Cls. task. We apply the Bonferroni correction across the number of independent variables (N = 7; a more conservative estimate across the total tests N = 56 can be found in the supplemental information).

Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.

Table 4.

A comparison of overall significance for each method. We obtain this score by comparing each method across all datasets: if method 1 is statistically significantly better than method 2, we add 1 to its score. If it is significantly worse, we subtract 1. If there is no difference, then we add 0 to the score.
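
The scoring rule described in this caption can be sketched as below. The input format is an assumption made for illustration: each comparison is a hypothetical tuple `(row_method, col_method, result)`, where `result` is +1 if the row method is significantly better, -1 if significantly worse, and 0 otherwise. The method names in the example are placeholders, not results from the paper.

```python
from collections import defaultdict


def overall_scores(comparisons):
    """Sum +1 / -1 / 0 outcomes per method across all datasets.

    comparisons: iterable of (row_method, col_method, result) tuples,
    with result in {+1, -1, 0} from the row method's point of view.
    """
    scores = defaultdict(int)
    for row_method, col_method, result in comparisons:
        if result not in (-1, 0, 1):
            raise ValueError(f"unexpected result: {result}")
        scores[row_method] += result
        scores[col_method] += 0  # make sure every method appears in the output
    return dict(scores)


# Hypothetical outcomes pooled over two datasets.
example = [
    ("method_a", "method_b", 1),
    ("method_a", "method_c", 0),
    ("method_b", "method_c", -1),
    ("method_a", "method_b", 1),
]
print(overall_scores(example))
```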

Fig 5.

Comparison of the average CLL and F1 score on the RTE task using different combinations of distributions for aggregation.

Points are the average performance across all combinations of a given number of distributions and error bars are 95% confidence intervals.
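
One way to compute a point and its error bar for each number of distributions is sketched below. It is an illustration under stated assumptions: `score_fn` is a hypothetical callable that returns a metric (for example CLL or F1) for a given combination of soft-labeling methods, and the 95% interval uses a normal approximation (z = 1.96), which the caption does not specify.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev


def mean_with_ci(values, z=1.96):
    """Return (mean, half-width of an approximate 95% confidence interval)."""
    m = mean(values)
    if len(values) < 2:
        return m, 0.0  # a single value has no estimable spread
    return m, z * stdev(values) / sqrt(len(values))


def summarize_by_size(methods, score_fn):
    """Average score_fn over every combination of k methods, for each k."""
    summary = {}
    for k in range(1, len(methods) + 1):
        scores = [score_fn(combo) for combo in combinations(methods, k)]
        summary[k] = mean_with_ci(scores)
    return summary


# Toy score function for demonstration only: the score is the combination size.
print(summarize_by_size(["m1", "m2", "m3"], lambda combo: float(len(combo))))
```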

Fig 6.

Comparison of the average CLL and F1 score on the POS tagging task using different combinations of distributions for aggregation.

Points are the average performance across all combinations of a given number of distributions and error bars are 95% confidence intervals.

Fig 7.

Comparison of the average CLL and F1 score on the Toxicity detection task using different combinations of distributions for aggregation.

Points are the average performance across all combinations of a given number of distributions and error bars are 95% confidence intervals.

Fig 8.

Comparison of the average CLL and F1 score on the image classification task using different combinations of distributions for aggregation.

Points are the average performance across all combinations of a given number of distributions and error bars are 95% confidence intervals.

Fig 9.

Reliability diagram and expected calibration error (ECE, displayed as ECE × 100) for each soft labeling method on image classification using the CINIC10 dataset.

Black bars indicate the accuracy in the given bin and red bars indicate the gap between accuracy and confidence. Perfect calibration, where confidence and accuracy are equal, would follow the dotted line (i.e. a black bar over the line indicates under-confidence and a black bar under the line indicates over-confidence). We use the average of the logits produced by models trained with 10 different random seeds with no temperature scaling. Aggregating soft labels results in the best overall calibration.
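
For readers unfamiliar with the metric, the following is a minimal sketch of the commonly used equal-width-bin ECE estimator. It is not taken from the paper: the number of bins (10), the half-open bin edges, and the function name are assumptions. Per-bin accuracy corresponds to the black bars in the diagram, and the absolute difference between accuracy and mean confidence corresponds to the red gap.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction matched the reference label.
    """
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Four predictions at 0.9 confidence, three of them correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```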

Fig 10.

Reliability diagram and expected calibration error (ECE, displayed as ECE × 100) for each soft labeling method in POS tagging.

Black bars indicate the accuracy in the given bin and red bars indicate the gap between accuracy and confidence. We use the average of the logits produced by models trained with 10 different random seeds with no temperature scaling. ECE for aggregation is comparable to the best performing methods (WaWA and ZBS). Models trained using aggregated soft labels have better calibration in both the low and high confidence regimes.

Fig 11.

Reliability diagram and expected calibration error (ECE, displayed as ECE × 100) for each soft labeling method on RTE.

Black bars indicate the accuracy in the given bin and red bars indicate the gap between accuracy and confidence. We use the average of the logits produced by models trained with 10 different random seeds with no temperature scaling. All models are highly overconfident, potentially as a result of the limited amount of training data with moderate reliability.

Fig 12.

Distribution of total variation distance (TVD) between model predictions and original crowd-sourced annotations for the Jigsaw toxicity detection dataset.

Perfect calibration is a TVD of 0, perfect miscalibration is a TVD of 1. “K” indicates the kurtosis of the distribution. Aggregation, standard, Wawa, and ZBS produce distributions with low mean and standard deviation compared to other methods.
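
The total variation distance between two discrete distributions is half the sum of absolute differences between their probabilities. A minimal sketch, assuming both inputs are probability vectors over the same ordered set of classes (the function name and the example values are illustrative, not from the dataset):

```python
def total_variation_distance(p, q):
    """TVD between two discrete distributions over the same classes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of classes")
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Identical distributions give 0; disjoint ones give 1.
print(total_variation_distance([0.5, 0.5], [0.5, 0.5]))
print(total_variation_distance([1.0, 0.0], [0.0, 1.0]))
```

Here `p` would be a model's predicted class distribution for an item and `q` the empirical distribution of crowd annotations for that item.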

Fig 13.

Significance testing for the RTE task. We apply the Bonferroni correction across the total tests (N = 56).

Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.

Fig 14.

Significance testing for the POS task. We apply the Bonferroni correction across the total tests (N = 56).

Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.

Fig 15.

Significance testing for the Toxicity task. We apply the Bonferroni correction across the total tests (N = 56).

Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.

Fig 16.

Significance testing for the Image Cls. task. We apply the Bonferroni correction across the total tests (N = 56).

Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.
