Table 1.
Overview of the variables used in the annotation experiment.
Fig 1.
Interface of the annotation experiment.
Leftmost white box: tile-based configuration; white box in the middle: line-based configuration. Furthermore, questions are asked about the similarities of records A and C in comparison to B and the uncertainty of the answer.
Table 2.
Expertise of the 29 experiment participants in medicine, data mining, and image processing.
Fig 2.
Boxplots of EDA (left subfigure) and of duration (right subfigure) with one box per triplet.
The order of the triplets did not have any evident effect on EDA but had an effect on duration.
Fig 3.
A_CorrectnessRatio for the 29 non-expert annotators, when all triplets are considered (red line) and when the first three are removed (black line).
Although a slight improvement can be perceived for most of the annotators, the difference is very small, implying that the acclimatization phase has almost no effect on correctness.
Table 3.
Statistical analysis on the association between Stated_U and duration for each triplet annotation task (first row), between EDA and duration (second row) and between EDA and Stated_U (last row).
The association between Stated_U and duration is significant.
Fig 4.
T_CorrectnessRatio for the experts (black line) and for the annotators (red line).
The two groups performed similarly, and misclassified some triplets in agreement; slight differences can be explained by the large difference between the number of non-experts (29) and the number of experts (3).
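The two correctness ratios named in these captions are not defined in this list. As a minimal sketch, assuming A_CorrectnessRatio is the fraction of triplets a given annotator labels correctly and T_CorrectnessRatio is the fraction of annotators who label a given triplet correctly, they could be computed as follows. The function names, the data layout and the toy data are hypothetical and are not taken from the study.

```python
from typing import Dict, Tuple

# Hypothetical layout: (annotator, triplet_id) -> True if the annotation was correct.
Correctness = Dict[Tuple[str, int], bool]


def a_correctness_ratio(correct: Correctness, annotator: str) -> float:
    """Assumed definition: share of triplets this annotator labelled correctly."""
    outcomes = [ok for (a, _), ok in correct.items() if a == annotator]
    return sum(outcomes) / len(outcomes)


def t_correctness_ratio(correct: Correctness, triplet_id: int) -> float:
    """Assumed definition: share of annotators who labelled this triplet correctly."""
    outcomes = [ok for (_, t), ok in correct.items() if t == triplet_id]
    return sum(outcomes) / len(outcomes)


# Toy data for illustration only (not experimental results).
toy: Correctness = {
    ("p1", 1): True,
    ("p1", 2): False,
    ("p2", 1): True,
    ("p2", 2): True,
}

ratio_p1 = a_correctness_ratio(toy, "p1")
ratio_t2 = t_correctness_ratio(toy, 2)
```

Under these assumptions, both ratios lie in [0, 1], and plotting them per annotator or per triplet would give curves of the kind described for Fig 3 and Fig 4.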
Table 4.
Correctness vs Stated_U, here aggregated into binary uncertainty.
Fig 5.
Heatmaps of correctness and of Stated_U combined with correctness.
Left subfigure: heatmap of correctness, where green indicates a correct annotation of a triplet (x-axis) by an annotator (y-axis) and blue indicates an incorrect choice. Right subfigure: refined heatmap of Stated_U and correctness, with four categories: correct and certain (1: intense green), correct and uncertain (2: lime green), incorrect and uncertain (3: light blue), incorrect and certain (4: intense blue). Some triplets are consistently annotated incorrectly by (almost) all annotators.
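The four-category coding of the right subfigure can be sketched as a small lookup. This assumes a binary certain/uncertain flag has already been derived from Stated_U (how that flag is derived is not stated in this caption), and the function name is hypothetical.

```python
def heatmap_category(correct: bool, certain: bool) -> int:
    """Map one annotation to the category codes 1-4 listed in the Fig 5 caption.

    `certain` is an assumed binary flag derived from Stated_U.
    """
    if correct:
        return 1 if certain else 2  # 1: correct and certain, 2: correct and uncertain
    return 4 if certain else 3      # 3: incorrect and uncertain, 4: incorrect and certain


# Example: an incorrect answer given with certainty falls in category 4.
example = heatmap_category(correct=False, certain=True)
```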
Table 5.
Number of times an annotator opted for annotation value A vs C, distinguishing between correct and incorrect choices.
C is preferred over A.
Table 6.
Agreement of the annotators under each threshold τ on the annotation values A and C, correctly or incorrectly.
Table 7.
Juxtaposition of annotator performance (last column) and agreement on each triplet for the two thresholds τ.
We mark in bold the triplets for which experiment participants agreed on the wrong value for both thresholds, and in italic the triplets for which experiment participants agreed on the wrong value for only one of the two thresholds.
Table 8.
Juxtaposition of stated uncertainty and correctness of the annotators and the Artificial Similarity-Based Annotator (ASBA) for each triplet.
Fig 6.
Relationship between ASBA correctness and ASBA uncertainty using linear regression (left subfigure) and LOESS regression (right subfigure).
The correctness increases as uncertainty decreases.
Table 9.
Associations of Triplet ID with duration and EDA.
Associations were analyzed by mixed models with random intercept.
Fig 7.
Relationship between Triplet ID and duration and EDA.
Modelling using fractional polynomials with non-linear progressions for duration (left subfigure) and for EDA (right subfigure).
Table 10.
Associations of correctness with duration, EDA and Stated_U.
Associations were analyzed by mixed models with random intercept.
Table 11.
Associations of Stated_U with duration and EDA.
Associations were analyzed by mixed models with random intercept.