Table 1.
The accuracy of each annotation method with respect to the expert annotations in each dataset. Aggregating maintains best or near-best accuracy across tasks.
Table 2.
Negative log likelihood of each annotation method with respect to the expert annotations in each dataset. Individual soft-labeling methods vary between tasks, while aggregating maintains best or near-best NLL.
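As a minimal sketch of how the two quantities in Tables 1 and 2 could be computed, assume each annotation method yields one probability distribution over classes per item and each expert annotation is a single class index. The function names, the clipping constant `eps`, and the toy data are illustrative assumptions, not taken from the paper.

```python
import math

def accuracy(soft_labels, expert_labels):
    """Fraction of items whose most probable class matches the expert label."""
    hits = 0
    for dist, gold in zip(soft_labels, expert_labels):
        predicted = max(range(len(dist)), key=dist.__getitem__)  # argmax, first on ties
        hits += int(predicted == gold)
    return hits / len(expert_labels)

def negative_log_likelihood(soft_labels, expert_labels, eps=1e-12):
    """Mean negative log probability that each soft label assigns to the expert label."""
    total = 0.0
    for dist, gold in zip(soft_labels, expert_labels):
        total -= math.log(max(dist[gold], eps))  # clip to avoid log(0)
    return total / len(expert_labels)

# Hypothetical toy data: three items, two classes.
soft = [[0.9, 0.1], [0.4, 0.6], [0.5, 0.5]]
gold = [0, 1, 0]
acc = accuracy(soft, gold)
nll = negative_log_likelihood(soft, gold)
```

Accuracy only looks at the most probable class, whereas NLL also penalizes a soft label for placing little probability mass on the expert label, which is why the two tables can rank methods differently.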
Table 3.
F1 and calibrated log likelihood. Results are averaged over 10 random seeds; standard deviation is given in the subscript. Tasks marked by * are subject to input data distribution shift while datasets marked by † are subject to annotator pool distribution shift. Methods marked by ‡ are those which estimate either worker skill or item difficulty. Aggregating the individual soft-labeling methods yields classifiers with consistently good uncertainty estimation (best on all text-based tasks) and generally good raw performance in terms of F1 across tasks.
Fig 1.
Significance testing for the RTE task. We apply the Bonferroni correction across the number of independent variables (N = 7; a more conservative estimate across the total tests N = 56 can be found in the supplemental information).
Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.
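The effect of the two Bonferroni denominators (N = 7 and N = 56) can be illustrated with a short sketch. The significance level of 0.05 and the p-values below are hypothetical; the captions state neither the level nor the underlying statistical test.

```python
def bonferroni_significant(p_values, n_tests, alpha=0.05):
    """Flag p-values that stay significant after dividing alpha by n_tests."""
    threshold = alpha / n_tests
    return [p < threshold for p in p_values]

# Hypothetical p-values from three pairwise method comparisons.
p_values = [0.0005, 0.004, 0.03]

per_variable = bonferroni_significant(p_values, n_tests=7)    # threshold about 0.00714
per_test = bonferroni_significant(p_values, n_tests=56)       # threshold about 0.00089
```

With N = 7 the first two comparisons remain significant, while with N = 56 only the first does, which is the sense in which the correction over all tests is more conservative.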
Fig 2.
Significance testing for the POS task. We apply the Bonferroni correction across the number of independent variables (N = 7; a more conservative estimate across the total tests N = 56 can be found in the supplemental information).
Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.
Fig 3.
Significance testing for the Toxicity task. We apply the Bonferroni correction across the number of independent variables (N = 7; a more conservative estimate across the total tests N = 56 can be found in the supplemental information).
Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.
Fig 4.
Significance testing for the Image Cls. task. We apply the Bonferroni correction across the number of independent variables (N = 7; a more conservative estimate across the total tests N = 56 can be found in the supplemental information).
Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.
Table 4.
A comparison of overall significance for each method. We obtain this score by comparing each method across all datasets: if method 1 is statistically significantly better than method 2, we add 1 to its score. If it is significantly worse, we subtract 1. If there is no difference, then we add 0 to the score.
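The scoring rule can be sketched as follows, assuming the pairwise outcomes for each dataset are stored as +1 (row method significantly better), -1 (significantly worse), or 0 (no difference). The data layout and the placeholder method names A, B, and C are assumptions made for illustration.

```python
def overall_scores(outcomes, methods):
    """Sum the +1 / -1 / 0 pairwise outcomes of each method over all datasets."""
    scores = {m: 0 for m in methods}
    for table in outcomes.values():
        for (row, _col), result in table.items():
            scores[row] += result
    return scores

# Hypothetical outcomes for two datasets and three placeholder methods.
outcomes = {
    "dataset_1": {("A", "B"): 1, ("B", "A"): -1, ("A", "C"): 0,
                  ("C", "A"): 0, ("B", "C"): -1, ("C", "B"): 1},
    "dataset_2": {("A", "B"): 1, ("B", "A"): -1, ("A", "C"): 1,
                  ("C", "A"): -1, ("B", "C"): 0, ("C", "B"): 0},
}
scores = overall_scores(outcomes, ["A", "B", "C"])
```

Because every win for one method is a loss for another, the scores in such a table sum to zero across methods.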
Fig 5.
Comparison of the average CLL and F1 score on the RTE task using different combinations of distributions for aggregation.
Points are the average performance across all combinations of a given number of distributions and error bars are 95% confidence intervals.
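A rough sketch of how such points and error bars might be produced is given below. It assumes that aggregation is an element-wise mean of the chosen soft-label distributions and that the 95% interval uses a normal approximation (1.96 standard errors); neither detail is stated in the caption, and the method names, scores, and `score_fn` are hypothetical stand-ins for training and evaluating a model.

```python
import math
from itertools import combinations
from statistics import mean, stdev

def aggregate(distributions):
    """Element-wise mean of several distributions over the same classes."""
    n = len(distributions)
    return [sum(values) / n for values in zip(*distributions)]

def summarize_by_size(method_names, score_fn):
    """Map each subset size k to (mean score, 95% CI half-width) over all size-k subsets."""
    summary = {}
    for k in range(1, len(method_names) + 1):
        scores = [score_fn(combo) for combo in combinations(method_names, k)]
        if len(scores) > 1:
            half_width = 1.96 * stdev(scores) / math.sqrt(len(scores))
        else:
            half_width = 0.0  # only one subset of this size, no spread to report
        summary[k] = (mean(scores), half_width)
    return summary

# Hypothetical soft labels for a single item whose gold class is 0.
soft = {"m1": [0.8, 0.2], "m2": [0.6, 0.4], "m3": [0.4, 0.6]}

def score_fn(combo):
    """Toy score: probability the aggregated label assigns to the gold class."""
    return aggregate([soft[m] for m in combo])[0]

summary = summarize_by_size(list(soft), score_fn)
```

In this toy example the interval narrows as more distributions are combined, since the subsets of a given size become more alike.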
Fig 6.
Comparison of the average CLL and F1 score on the POS tagging task using different combinations of distributions for aggregation.
Points are the average performance across all combinations of a given number of distributions and error bars are 95% confidence intervals.
Fig 7.
Comparison of the average CLL and F1 score on the Toxicity detection task using different combinations of distributions for aggregation.
Points are the average performance across all combinations of a given number of distributions and error bars are 95% confidence intervals.
Fig 8.
Comparison of the average CLL and F1 score on the image classification task using different combinations of distributions for aggregation.
Points are the average performance across all combinations of a given number of distributions and error bars are 95% confidence intervals.
Fig 9.
Reliability diagram and expected calibration error (ECE, displayed as ×100) for each soft labeling method on image classification using the CINIC10 dataset.
Black bars indicate the accuracy in the given bin and red bars indicate the gap between accuracy and confidence. Perfect calibration, where confidence and accuracy are equal, would follow the dotted line (i.e. a black bar over the line indicates under-confidence and a black bar under the line indicates over-confidence). We use the average of the logits produced by models trained with 10 different random seeds with no temperature scaling. Aggregating soft labels results in the best overall calibration.
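For reference, the quantities in a reliability diagram can be computed as in the sketch below. It assumes 10 equal-width confidence bins and the standard definition of ECE as the bin-size-weighted mean gap between accuracy and confidence; the bin count and the toy predictions are assumptions, not values from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        bin_acc = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(bin_acc - mean_conf)
    return ece

# Hypothetical predictions: top-class confidence and whether the prediction was right.
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [1, 0, 1, 1]
ece = expected_calibration_error(confidences, correct)
```

Here the high-confidence bin is over-confident (accuracy 0.5 at confidence 0.95) and the mid-confidence bin is under-confident (accuracy 1.0 at confidence 0.55); both gaps count toward ECE regardless of sign.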
Fig 10.
Reliability diagram and expected calibration error (ECE, displayed as ×100) for each soft labeling method in POS tagging.
Black bars indicate the accuracy in the given bin and red bars indicate the gap between accuracy and confidence. We use the average of the logits produced by models trained with 10 different random seeds with no temperature scaling. ECE for aggregation is comparable to the best performing methods (WaWA and ZBS). Models trained using aggregated soft labels have better calibration in both the low and high confidence regimes.
Fig 11.
Reliability diagram and expected calibration error (ECE, displayed as ×100) for each soft labeling method on RTE.
Black bars indicate the accuracy in the given bin and red bars indicate the gap between accuracy and confidence. We use the average of the logits produced by models trained with 10 different random seeds with no temperature scaling. All models are highly overconfident, potentially as a result of the limited amount of training data with moderate reliability.
Fig 12.
Distribution of total variation distance (TVD) between model predictions and original crowd-sourced annotations for the Jigsaw toxicity detection dataset.
Perfect calibration corresponds to a TVD of 0 and perfect miscalibration to a TVD of 1. “K” indicates the kurtosis of the distribution. Aggregation, standard, Wawa, and ZBS produce distributions with low mean and standard deviation compared to other methods.
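The sketch below shows the standard definition of total variation distance between two discrete distributions, together with one common definition of kurtosis. Whether the figure reports Pearson kurtosis (as here, where a normal distribution gives 3) or excess kurtosis is not stated in the caption, so that choice and the toy inputs are assumptions.

```python
def total_variation_distance(p, q):
    """Half the L1 distance between two distributions over the same classes."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kurtosis(values):
    """Pearson kurtosis: fourth central moment divided by squared variance."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    return m4 / (m2 ** 2)

# Hypothetical model prediction versus the crowd annotation distribution for one item.
tvd_same = total_variation_distance([0.7, 0.3], [0.7, 0.3])
tvd_close = total_variation_distance([0.7, 0.3], [0.5, 0.5])
tvd_opposite = total_variation_distance([1.0, 0.0], [0.0, 1.0])
k = kurtosis([1, 2, 3, 4, 5])
```

Computing the TVD for every item yields the per-method distributions summarized in the figure; a distribution concentrated near 0 means the model's predicted probabilities track the annotators' label proportions closely.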
Fig 13.
Significance testing for the RTE task. We apply the Bonferroni correction across the total tests (N = 56).
Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.
Fig 14.
Significance testing for the POS task. We apply the Bonferroni correction across the total tests (N = 56).
Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.
Fig 15.
Significance testing for the Toxicity task. We apply the Bonferroni correction across the total tests (N = 56).
Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.
Fig 16.
Significance testing for the Image Cls. task. We apply the Bonferroni correction across the total tests (N = 56).
Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.