Assessing the difficulty of annotating medical data in crowdworking with help of experiments

Background As healthcare-related data proliferate, there is a need to annotate them expertly for the purposes of personalized medicine. Crowdworking is an alternative to expensive expert labour. Annotation corresponds to diagnosis, so comparing unlabeled records to labeled ones seems more appropriate for crowdworkers without medical expertise. We modeled the comparison of a record to two other records as a triplet annotation task, and we conducted an experiment to investigate to what extent sensor-measured stress, task duration, the annotators' stated uncertainty, and agreement among annotators could predict annotation correctness.
Materials and methods We conducted an annotation experiment on health data from a population-based study. The triplet annotation task was to decide whether an individual was more similar to a healthy individual or to one with a given disorder. We used hepatic steatosis as the example disorder and described the individuals with 10 pre-selected characteristics related to this disorder. We recorded task duration, electrodermal activity as a stress indicator, and the uncertainty stated by the experiment participants (n = 29 non-experts and three experts) for 30 triplets. We built an Artificial Similarity-Based Annotator (ASBA) and compared its correctness and uncertainty to those of the experiment participants.
Results We found no correlation between correctness and stated uncertainty, stress, or task duration. Annotator agreement was not predictive either. Notably, for some tasks, annotators agreed unanimously on an incorrect annotation. When controlling for triplet ID, we identified significant correlations, indicating that correctness, stress levels and annotation duration depend on the task itself. Average correctness among the experiment participants was slightly lower than that achieved by ASBA. Triplet annotation turned out to be similarly difficult for experts and for non-experts.
Conclusion Our lab experiment indicates that the task of triplet annotation must be prepared cautiously if delegated to crowdworkers. Neither certainty nor agreement among annotators should be assumed to imply correct annotation, because annotators may misjudge difficult tasks as easy and agree on incorrect annotations. Further research is needed to improve visualizations for complex tasks and to judiciously decide how much information to provide. Out-of-the-lab experiments in a crowdworker setting are needed to identify appropriate designs of human-annotation tasks and to assess under what circumstances non-human annotation should be preferred.
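The similarity-based baseline can be illustrated with a minimal sketch. The paper does not specify ASBA's internals, so the Euclidean distance measure, the assumption of pre-standardized characteristics, and the uncertainty score below are illustrative choices, not the published implementation: a triplet is labeled with the class of the nearer reference record, and uncertainty approaches 1 as the two reference distances become comparable.

```python
import math

def triplet_annotate(anchor, ref_healthy, ref_disordered):
    """Label the anchor record with the class of the nearer reference record.

    Each record is an equal-length vector of (assumed pre-standardized)
    characteristics. Returns (label, uncertainty): uncertainty is near 1
    when both references are about equally far from the anchor, and near 0
    when one reference is much closer than the other.
    """
    d_h = math.dist(anchor, ref_healthy)      # distance to the healthy reference
    d_d = math.dist(anchor, ref_disordered)   # distance to the disordered reference
    label = "healthy" if d_h < d_d else "disordered"
    total = d_h + d_d
    # Ratio of the smaller distance to the mean distance: 1 for a tie, -> 0
    # as the margin between the two references grows.
    uncertainty = 1.0 if total == 0 else min(d_h, d_d) / (total / 2)
    return label, uncertainty
```

With such a score, the per-triplet uncertainty of the artificial annotator can be compared directly against the uncertainty stated by human participants.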


Part A: Personal data
In the following, we will first collect some personal data about you. These data are needed for the statistical analysis of this study. You cannot be identified from your answers to these questions.
A1. Please enter your age.
A2. Please enter your sex: female or male
A3. Please enter your course of study.
A4. Please enter your native language.
A5. Please enter your country of origin.
A6. Are you left-handed or right-handed?
A7. How much experience do you have in the following areas?
-Medicine: none, little, much or very much
-Data Mining: none, little, much or very much
-Image processing: none, little, much or very much

Part B: Questions about the used graphical representation
The following is a list of statements about the graphical representation used previously. Please check off the statements you agree with. Multiple check marks are possible.
B1. The following questions refer to the tile-based configuration only.
-The graphical representation was easy to understand.
-The graphical representation was unnecessarily complex.
-The graphical representation was too cluttered.
-The graphical representation was clearly arranged.
-The size of the graphical representation was pleasant and appropriate.
-The individual elements of the graphical representation were pleasantly large.
-Terms and designations of the graphical representation were easily understandable.
-I could understand the graphical representation only with the help of the experimenter.
-I consider the graphical representation to be useful.
-It was easy for me to use the graphical representation.
-With the help of the graphical representation I was able to achieve my work goal.
-I could easily compare the instances with the help of the graphical representation.
-I was able to quickly compare instances using the graphical representation.
B2. The following questions refer to the parallel-based configuration only.
-The graphical representation was easy to understand.
-The graphical representation was unnecessarily complex.
-The graphical representation was too cluttered.
-The graphical representation was clearly arranged.
-The size of the graphical representation was pleasant and appropriate.
-The individual elements of the graphical representation were pleasantly large.
-Terms and designations of the graphical representation were easily understandable.
-I could understand the graphical representation only with the help of the experimenter.
-I consider the graphical representation to be useful.
-It was easy for me to use the graphical representation.
-With the help of the graphical representation I was able to achieve my work goal.
-I could easily compare the instances with the help of the graphical representation.
-I was able to quickly compare instances using the graphical representation.
-The software was designed to be understandable.
-The software was difficult to use.
-The software was well structured.
-The software was simply structured.
-The individual elements of the software were easily recognizable.
-The texts within the software were easily understandable.
-The size of the texts was too small.
-The terms and designations used within the software were easy to understand.
-The task was clearly formulated.
C2. Was the activity sensor perceived as a disturbance during the experiment? yes or no

Part D: Feedback
D1. Opportunity for criticism, praise and suggestions about the study and this survey.
You have reached the end of the survey. Thank you for completing the questionnaire! I hope you enjoyed the study and I thank you for your time and your effort.