Assessing the difficulty of annotating medical data in crowdworking with help of experiments

doi:10.1371/journal.pone.0254764

Table 1.

Overview of the variables used in the annotation experiment.

Fig 1.

Interface of the annotation experiment.

Leftmost white box: tile-based configuration; white box in the middle: line-based configuration. Furthermore, questions are asked about the similarities of records A and C in comparison to B and the uncertainty of the answer.

Table 2.

Expertise of the 29 experiment participants in medicine, data mining, and image processing.

Fig 2.

Boxplots of EDA (left subfigure) and of duration (right subfigure) with one box per triplet.

The order of the triplets did not have any evident effect on EDA but had an effect on duration.

Fig 3.

A_CorrectnessRatio for the 29 non-expert annotators, when all triplets are considered (red line) and when the first three are removed (black line).

Although a slight improvement can be perceived for most of the annotators, the difference is very small, implying that the acclimatization phase has almost no effect on correctness.
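
The comparison in Fig 3 can be illustrated with a minimal sketch. It assumes that A_CorrectnessRatio is the fraction of triplets an annotator labelled correctly, which the caption suggests but does not define; the function name `correctness_ratios`, the parameter `skip_first`, and the toy data are hypothetical and are not taken from the study.

```python
def correctness_ratios(correct, skip_first=0):
    """Fraction of correctly annotated triplets per annotator.

    correct: list of rows, one per annotator; each row holds 1 (correct)
             or 0 (incorrect) per triplet, in presentation order.
    skip_first: number of leading triplets to drop, e.g. 3 to exclude
                an assumed acclimatization phase.
    """
    ratios = []
    for row in correct:
        kept = row[skip_first:]
        ratios.append(sum(kept) / len(kept))
    return ratios


# Hypothetical toy data: 2 annotators x 5 triplets (not from the study).
toy = [
    [0, 0, 1, 1, 1],
    [1, 0, 1, 0, 1],
]

all_triplets = correctness_ratios(toy)                       # every triplet counted
without_first_three = correctness_ratios(toy, skip_first=3)  # first three dropped
```

Plotting both lists per annotator would give the two lines of the figure; a small gap between them would correspond to the reported weak effect of the acclimatization phase.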

Table 3.

Statistical analysis on the association between Stated_U and duration for each triplet annotation task (first row), between EDA and duration (second row) and between EDA and Stated_U (last row).

The association between Stated_U and duration is significant.

Fig 4.

T_CorrectnessRatio for the experts (black line) and for the annotators (red line).

The two groups performed similarly, and misclassified some triplets in agreement; slight differences can be explained by the large difference between the number of non-experts (29) and the number of experts (3).

Table 4.

Correctness vs Stated_U, here aggregated into binary uncertainty.

Fig 5.

Heatmaps of correctness (left) and of Stated_U combined with correctness (right).

Left subfigure: heatmap of correctness, where green indicates a correct annotation of a triplet (x-axis) by an annotator (y-axis) and blue indicates an incorrect choice. Right subfigure: refined heatmap of Stated_U and correctness, with correct and certain (1: intense green), correct and uncertain (2: lime green), incorrect and uncertain (3: light blue), and incorrect and certain (4: intense blue). Some triplets are consistently annotated incorrectly by (almost) all annotators.

Table 5.

Number of times an annotator has opted for annotation value A vs C, distinguishing between correct and incorrect choice.

C is preferred over A.

Table 6.

Agreement of the annotators under each threshold τ on the annotation values A and C, correctly or incorrectly.

Table 7.

Juxtaposition of annotator performance (last column) and agreement on each triplet for the two thresholds τ.

We mark in bold the triplets for which experiment participants agreed on the wrong value for both thresholds, and in italic the triplets for which they agreed on the wrong value for only one of the two thresholds.