Aggregating soft labels from crowd annotations improves uncertainty estimation under distribution shift
Table 4
A comparison of overall significance for each method. We obtain this score by comparing each method against every other method across all datasets: if method 1 is statistically significantly better than method 2, we add 1 to method 1's score; if it is significantly worse, we subtract 1; if there is no significant difference, we add 0.
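The scoring rule in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the significance test has already been run and that each unordered pair of methods appears once per dataset with an outcome of +1 (first method significantly better), -1 (significantly worse), or 0 (no significant difference). The function name `overall_significance`, the input layout, and the method and dataset names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Hypothetical key layout: (dataset, method_a, method_b).
# The value is the precomputed test outcome for method_a versus method_b:
# +1 = significantly better, -1 = significantly worse, 0 = no difference.
Comparison = Tuple[str, str, str]


def overall_significance(outcomes: Dict[Comparison, int]) -> Dict[str, int]:
    """Sum pairwise significance outcomes into one score per method."""
    scores: Dict[str, int] = defaultdict(int)
    for (_dataset, method_a, method_b), outcome in outcomes.items():
        if outcome not in (-1, 0, 1):
            raise ValueError(f"outcome must be -1, 0, or 1, got {outcome}")
        # Assumption: each unordered pair is listed once per dataset, so the
        # mirrored comparison (method_b versus method_a) is applied here too.
        scores[method_a] += outcome
        scores[method_b] -= outcome
    return dict(scores)


# Illustrative, made-up outcomes for three methods on two datasets.
outcomes = {
    ("dataset_a", "method_1", "method_2"): 1,
    ("dataset_a", "method_1", "method_3"): 0,
    ("dataset_a", "method_2", "method_3"): -1,
    ("dataset_b", "method_1", "method_2"): 1,
    ("dataset_b", "method_1", "method_3"): 1,
    ("dataset_b", "method_2", "method_3"): 0,
}

scores = overall_significance(outcomes)
print(scores)
```

Because every win for one method is recorded as a loss for the other, the scores sum to zero under this convention; a positive score means a method wins more significant comparisons than it loses.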