Abstract
The use of artificial intelligence (AI) in image analysis is an intensively debated topic in the radiology community these days. AI computer vision algorithms typically rely on large-scale image databases annotated by specialists. Developing and maintaining such databases is time-consuming, so the involvement of non-experts in the annotation workflow should be considered. We assessed the learning rate of inexperienced evaluators in correctly labeling pediatric wrist fractures on digital radiographs. Students with and without a medical background labeled wrist fractures with bounding boxes in 7,000 radiographs over ten days, and pediatric radiologists regularly discussed their mistakes with them. We found F1 scores (a measure of detection rate) to increase substantially under specialist feedback (mean 0.61±0.19 at day 1 to 0.97±0.02 at day 10, p<0.001), whereas the Intersection over Union (a parameter of labeling precision) improved far less markedly (mean 0.27±0.29 at day 1 to 0.53±0.25 at day 10, p<0.001). The time needed to correct the students decreased significantly (mean 22.7±6.3 seconds per image at day 1 to 8.9±1.2 seconds at day 10, p<0.001) and was substantially lower than the time required for annotation by the radiologists alone. In conclusion, our data showed that involving undergraduate students in the annotation of pediatric wrist radiographs enables substantial time savings for specialists and should therefore be considered.
Citation: Nagy E, Marterer R, Hržić F, Sorantin E, Tschauner S (2022) Learning rate of students detecting and annotating pediatric wrist fractures in supervised artificial intelligence dataset preparations. PLoS ONE 17(10): e0276503. https://doi.org/10.1371/journal.pone.0276503
Editor: Sathishkumar V. E., Hanyang University, REPUBLIC OF KOREA
Received: June 13, 2022; Accepted: September 13, 2022; Published: October 20, 2022
Copyright: © 2022 Nagy et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting Information file.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Abbreviations and acronyms: AI, Artificial intelligence; DL, Deep learning; DR, Digital radiography; HITL, “Human-in-the-loop”; IoU, Intersection over Union; PACS, Picture archiving and communication system; SD, Standard deviation
Introduction
The use of artificial intelligence (AI) for image analysis is one of the leading topics in the field of radiology [1–4]. Radiological AI models usually originate from annotated image data, also known as supervised AI [5] or supervised machine learning. With few exceptions, they fall into the domain of deep learning (DL) [6, 7]. DL models commonly build upon large training image sets for robust outcomes [8, 9], often containing thousands or more different samples, as in the case of ImageNet [10], Open Images [11], or Microsoft Common Objects in Context (COCO) [12]. Corresponding radiological datasets [13, 14] are typically orders of magnitude smaller, since building and maintaining comprehensive deep learning systems is still challenging [8]. For image annotation, a user may choose among a palette of open-source and commercial software solutions with manual and (semi-)automatic labeling techniques [8, 15–18]. However, they require area-specific expert input and development, and their implementation is often computationally and time-intensive [8]. As the workload of radiologists has increased significantly in the last decades, mainly due to the growing number of time-consuming cross-sectional imaging studies [19], alternative solutions, such as involving a non-expert workforce in image annotation, might be reasonable. Medical students have demonstrated variable learning rates in other medical contexts such as surgical skills or ultrasound [20–25]. To the best of our knowledge, no study has so far investigated the involvement of students in reading or annotating radiographic examinations.
The goal of the current study was to estimate the learning rate of inexperienced evaluators in labeling pediatric wrist fractures on digital radiographs. We recruited students with and without a medical background or training to annotate fractures and, thus, to assess their utility to radiologists in creating a comprehensive supervised deep learning dataset.
Methods
We recruited nine medical students and one high-school student for the study and arranged them into four single raters and three teams of two evaluators. None of these ten individuals had specific experience in analyzing pediatric wrist fractures. Table 1 shows the particulars of these students, including previous experience in radiology or traumatology. They were instructed to manually tag all visible fractures of any age in randomly selected, non-overlapping pediatric wrist digital radiography (DR) studies. Each observer processed 1,000 images, composed of 100 pictures per workday over two weeks (ten business days).
Moreover, they were asked to annotate a list of additional image tags (laterality, image projection) and classes (text, metal, bone lesion, periosteal reaction, rotational axis, foreign bodies, and soft tissue swelling) in every image, where applicable. We also requested the raters to judge and note the subjective difficulty of every X-ray picture on a five-point Likert scale (1 = Very easy, 2 = Easy, 3 = Neither easy nor hard, 4 = Hard, 5 = Very hard). The cumulative 7,000 student-assessed trauma radiographs were part of a comprehensive, already published dataset of pediatric wrist trauma examinations containing 20,327 images in total [26].
Professional reporting workstations equipped with calibrated radiological 10-bit gray-level monitors (RX240, RX440, or RX650; Eizo, Ishikawa, Japan) displayed the X-ray studies in darkened reading rooms of the Division of Pediatric Radiology, Department of Radiology, Medical University of Graz. The students used the Supervisely artificial intelligence online platform (Deep Systems LLC, Moscow, Russia) to display the anonymized examinations and to label the pathologies. This platform logged numerous annotation-related parameters, such as overall and net annotation times or labeling durations for the available classes and tags.
Two pediatric radiologists with seven (S.T.) and eight (R.M.) years of professional experience in childhood trauma imaging re-evaluated the student interpretations, determining the numbers of true/false positive and negative fracture judgments in consensus. In cases where the reference radiologists were not able to ascertain the absence or presence of a fracture, they accepted the respective student classification as either true negative or true positive. The pediatric radiologists also recorded the time necessary to correct the erroneous annotations in the image sets. The long-term average labeling time per wrist image, including all previously mentioned classifications and objects, was 22 seconds for radiologist 1 and 21 seconds for radiologist 2. Each day, a pediatric radiologist gave constructive feedback to six of the seven raters to enable appropriate learning progress. Rater 7 (defined as the control) received no specialist response during the labeling period, but only after completion of the annotation procedure.
Sensitivity (true positive rate, TPR), specificity (true negative rate, TNR), positive predictive value (PPV), negative predictive value (NPV), and the F1 score [= 2*(TPR * PPV) / (TPR + PPV)] [27] were among the main parameters of interest, calculated from the aforementioned true/false positive/negative fracture numbers. The Intersection over Union (IoU) metric (or Jaccard index) [28] served as a measure of bounding box accuracy, comparing the reference radiologists' annotations with the student-produced ones. The literature commonly describes an overlapping area of more than 50% as good accordance between annotations by different raters [29]. A self-written Python script computed the IoU value for every image.
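For illustration, the metrics above can be computed with a few lines of Python. This is a minimal sketch, not the authors' original script; the bounding-box format (x_min, y_min, x_max, y_max) and the example numbers are assumptions.

```python
# Minimal sketch of the detection metrics and IoU computation described above.
# Not the authors' original script; the box format (x_min, y_min, x_max, y_max) is assumed.

def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity (TPR), specificity (TNR), PPV, NPV, and F1 score."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity / recall
    tnr = tn / (tn + fp) if (tn + fp) else 0.0   # specificity
    ppv = tp / (tp + fp) if (tp + fp) else 0.0   # precision
    npv = tn / (tn + fn) if (tn + fn) else 0.0
    f1 = 2 * tpr * ppv / (tpr + ppv) if (tpr + ppv) else 0.0
    return {"sensitivity": tpr, "specificity": tnr, "PPV": ppv, "NPV": npv, "F1": f1}

def iou(box_a, box_b) -> float:
    """Intersection over Union (Jaccard index) of two boxes (x_min, y_min, x_max, y_max)."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Hypothetical example: daily confusion-matrix counts and two partially overlapping boxes
print(detection_metrics(tp=90, fp=5, tn=80, fn=10))
print(iou((10, 10, 110, 110), (50, 10, 150, 110)))  # ≈ 0.43
```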
We performed the statistical calculations with IBM SPSS Statistics version 21 (IBM, Armonk, New York, United States of America). The dataset was analyzed with descriptive statistics and comparisons of means, specifically t-tests and ANOVAs for group comparisons. Appropriate regression curves were fitted and selected to demonstrate learning rates and to visualize progression over time. P values below 0.05 were considered statistically significant.
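The analyses were run in SPSS; purely as a hedged illustration, equivalent tests could be performed in Python with SciPy as sketched below. The variable names and example values are hypothetical and do not reproduce the study data.

```python
# Illustrative only: the study used IBM SPSS; this sketch shows equivalent tests in Python.
# Variable names and example data are hypothetical.
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

f1_teams = np.array([0.88, 0.91, 0.93, 0.95])      # hypothetical daily F1 scores (teams)
f1_singles = np.array([0.85, 0.90, 0.94, 0.96])    # hypothetical daily F1 scores (single raters)
f1_control = np.array([0.84, 0.85, 0.83, 0.86])    # hypothetical daily F1 scores (control)

# Independent-samples t-test for a two-group comparison (alpha = 0.05)
t_stat, p_val = stats.ttest_ind(f1_teams, f1_singles)

# One-way ANOVA across the three rater groups
f_stat, p_anova = stats.f_oneway(f1_teams, f1_singles, f1_control)

# Logarithmic regression of annotation time over study days (as in Fig 3)
days = np.arange(1, 11)
annotation_time = np.array([30, 27, 25, 24, 23, 22, 21, 20, 20, 19])  # hypothetical seconds
log_model = lambda x, a, b: a + b * np.log(x)
params, _ = curve_fit(log_model, days, annotation_time)

print(t_stat, p_val, f_stat, p_anova, params)
```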
The Ethics Committee of Medical University of Graz (IRB00002556) gave an affirmative vote for the retrospective data analyses (No. 31–108 ex 18/19), waiving the necessity to obtain informed consent.
Results
The reference radiologists diagnosed and labeled 6,072 fractures in 4,831 of the 7,000 wrist radiographs. The number of fractures ranged from 1 to a maximum of 3 per picture. Students marked 5,421 fractures in 4,246 images. A total of 1,261 fractures in 1,157 images were misjudged by the raters, splitting into 305 false positives and 956 false negatives. 1,814 images contained a cast and 322 contained metal implants. 3,658 left and 3,342 right radiographs were analyzed.
Fracture detection metrics
Apart from specificity (0.92 ±0.12 and 0.84 ±0.15, p = 0.022), most parameters did not differ significantly between single raters and teams in independent samples t-tests: false ratings (14.20 ±17.21 and 13.90 ±11.79, p = 0.937), sensitivity (0.86 ±0.18 and 0.91 ±0.11, p = 0.248), PPV (0.94 ±0.11 and 0.94 ±0.06, p = 0.782), NPV (0.79 ±0.17 and 0.81 ±0.18, p = 0.717), and F1 score (0.90 ±0.15 and 0.92 ±0.08, p = 0.473). For further information refer to Fig 1A–1F and Table 2.
For the control, a linear fitting was applied as baseline. R2 values are given for the three groups: teams (circles), individual raters (crosses), and control (boxes).
Sums are shown for false negatives and false positives, mean values for sensitivity, specificity, and F1 score, and mean values plus SD for IoU.
Sensitivity (average 0.83) decreased with higher difficulty ratings (ANOVA p<0.001): it was 0.83 in rating 1, 0.90 in rating 2, 0.82 in rating 3, 0.79 in rating 4, and 0.69 in difficulty rating 5. The raters perceived images with a cast as more difficult (2.12 ±1.05 vs. 2.27 ±1.07, p<0.001). However, the number of errors did not differ significantly (p = 0.789).
Labeling precision
The IoU increased statistically significantly in all three groups over time (p<0.001), as graphically depicted in Fig 2A. Mean IoU values were 0.45 ±0.28 in teams, 0.48 ±0.28 in individual raters, and 0.31 ±0.24 in the control (ANOVA p<0.001). In the Bonferroni post hoc analysis, all groups differed significantly (p<0.001), with individual raters performing best.
Image labeling precision during the study period (a) and F1 scores and IoU metrics in relation to patient age (b). In panel b, quadratic curve fittings with 95% CIs are displayed.
We found that the IoU was significantly better in images with a cast present (0.49 ±0.35 vs. 0.42 ±0.38, p<0.001). We noted a similar behavior with metal implants present (IoU 0.45 ±0.29 vs. 0.41 ±0.25, p<0.001). There was no statistically significant IoU difference between left and right sides (p = 0.412) or between projections (p = 0.441). The IoU was lower in images with a higher difficulty rating (ANOVA p<0.001): 0.48 ±0.28 in rating 1, 0.49 ±0.27 in rating 2, 0.41 ±0.27 in rating 3, 0.35 ±0.29 in rating 4, and 0.21 ±0.28 in difficulty rating 5.
The regression analysis revealed that the IoU and the F1 score were similarly influenced by patient age (Fig 2B). Image analysis was more challenging at the very young and older ends of the age range. However, the relation between the F1 score and patient age was stronger (R2 = 0.400) than that of the IoU (R2 = 0.266).
Annotation and correction times
The times required to annotate the images decreased over the study period, as shown in Fig 3A. The mean net annotation time was 21.8 ±9.7 seconds per image; 25.2 ±8.1 seconds in teams, 17.4 ±9.5 seconds in individual raters, and 24.6 ±9.9 seconds in the control. The correction time was 16.8 ±5.1 seconds per image on average; 14.8 ±5.6 seconds for teams, 15.5 ±5.7 seconds for individual raters, and 13.4 ±5.3 seconds for the control.
Regression analyses of annotation (a) and correction times (b). Logarithmic curve fittings are given for all three rater groups.
The number of errors in the different quarters (25 of 100 images) of the daily image sets did not differ significantly (ANOVA p = 0.218). The means and standard deviations (SD) were 1.16 ±0.40 in quarter 1, 1.11 ±0.31 in quarter 2, 1.11 ±0.34 in quarter 3, and 1.15 ±0.37 in quarter 4.
Discussion
In the current study, we assessed the learning rates of students, compared to board-certified pediatric radiologists, in detecting and annotating childhood wrist fractures in the context of supervised machine learning dataset generation.
The literature features only a few related studies on the learning rates of students in medical topics [21–23]. In the context of radiology, we found a few studies on ultrasound tasks [20, 24] and emergency neuroimaging [25]. Our literature search did not find any comparable study on students annotating radiographs to detect pediatric fractures in the context of supervised AI workflows.
We saw marked learning progress in the raters receiving professional radiologist feedback. Some of the teams and individual raters exceeded an F1 score of 0.99, and none of them dropped below 0.95 on the last day. However, nobody attained an F1 score of 1.00 during the annotations. In contrast to the control, who did not get repetitive feedback during the annotation process, all others achieved significantly higher scores from the second annotation day onward. Fig 4 gives examples of fractures often missed by the raters. Teams and individual raters did not exhibit relevant differences in learning rate and error patterns. Therefore, we assume that radiologists should prefer single non-expert annotators over teams with respect to responsible management of human resources. As expected, the control demonstrated near-steady results over the study timeframe of ten days. Difficulty differences between the datasets could be the cause of the perceptible daily variance. The radiologists also gave feedback to that person after the data acquisition, so that they could learn from the mistakes made. In the repetitive feedback sessions, the reference radiologists systematically reviewed all images of the prior rating together with the raters. The reasons for the mistakes made were discussed, when possible.
The stars mark the areas of bone injury. a, c & d) Dorsal compression fractures of the distal radius. b & f) Overlooked scaphoid fractures. e) Missed epiphysiolysis, Salter-Harris type 2.
Annotation times were dependent on the individual rater and demonstrated a substantial variance, as shown in Fig 3A. Overall, there was a decrease in annotation time per image, approaching the radiologists' typical annotation duration of about 21 seconds per image. A comparable, established system is in common use worldwide, when consultants sign off the reports of their radiologists in training. In that setting, the overall student and radiologist annotation times together added up to about 30 seconds per picture. As compensation, the non-experts benefited by receiving feedback and achieving learning success. More importantly, the correction times for the experts decreased steadily (Fig 3B), leading to a correction time per image of about 10 seconds at day ten. This reduction means considerable time savings for the experts and could approximately double annotation throughput, which is a major bottleneck.
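As a rough, hedged check of the "approximately double" estimate, the reported averages can be plugged into a two-line calculation; the exact figures vary by rater and day, and this is not one of the paper's reported analyses.

```python
# Rough back-of-the-envelope estimate (not a reported analysis): compare the radiologists'
# solo labeling time with their correction time after student pre-annotation at day ten.
solo_labeling_s = 21.0      # typical radiologist labeling time per image (seconds)
correction_day10_s = 10.0   # approximate expert correction time per image at day ten

throughput_factor = solo_labeling_s / correction_day10_s
print(f"Expert throughput factor: {throughput_factor:.1f}x")  # ~2.1x, i.e. roughly doubled
```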
The study results imply that it was easy for students to learn to recognize fractures, whereas grasping the whole extent of many bone injuries was not possible for any of the raters within the study duration. While F1 scores (a surrogate parameter for fracture recognition) increased substantially, we only saw a small increase in IoU (labeling precision) over the days. This discrepancy implies that the recognition of smaller details in the images was more challenging; even when a fracture was recognized correctly, the students often could not reproduce its actual extent. The results of this study regarding learning performance in fracture detection may not be directly transferable to other body regions or other specific tasks. Further studies in this area appear warranted.
Surprisingly, patient age clearly influenced the number of errors and the scores, as depicted in Fig 2B. The F1 score and the IoU decreased in teenagers and newborns, with a plateau between approximately one and ten years. Our experience indicates that the fusing growth plates of the distal radius and ulna at that age (compare Fig 5) hindered correct annotation to a certain degree. In addition, subtle fractures of the ulnar styloid process and the carpal bones occurred and were missed more commonly in teenagers.
a) The cast was mistaken for a fracture. The + sign indicates the second, correctly labeled bone injury. b, c & f) Students marked the ulnar and radial growth plates as fractures. d) A Madelung's deformity mimicked a fracture. e) The carpal bones were mistaken for an injury. g) A so-called Harris line was thought to be a fracture. * The stars indicate missed injuries.
Several authors have proposed deep-learning algorithms to speed up image annotation by professionals, which remains a significant bottleneck [8, 30–32]. We hypothesized that, depending on the complexity and difficulty of the labeling task, the help of inexperienced annotators accelerates the marking process. Other methods are available, such as training a neural network on a small subsample and then applying it to the rest [33]. This approach is known as the "human-in-the-loop" (HITL) method, which is established in many fields of artificial intelligence, including computer vision [34–36]. HITL is an alternative to the approach in this manuscript, which uses non-specialists to relieve experts' workload when creating supervised DL datasets; a minimal sketch of such a loop is given below. It remains undecided which of the mentioned techniques is superior.
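The following sketch outlines the HITL idea only at a conceptual level; the detector interface (train/predict) and the expert review step are hypothetical placeholders, not part of the study's workflow or any specific library.

```python
# Minimal sketch of a human-in-the-loop (HITL) annotation workflow, as an alternative
# to the non-expert pre-annotation used in this study. The detector interface
# (train/predict) and the expert_review callback are hypothetical placeholders.

def human_in_the_loop(images, initial_labels, detector, expert_review, rounds=3):
    """Iteratively grow a labeled set: train on what is labeled, propose labels
    for the remaining images, let an expert correct the proposals, and retrain."""
    labeled = dict(initial_labels)                 # image_id -> annotations
    for _ in range(rounds):
        detector.train(labeled)                    # train on the current labeled subset
        unlabeled = [img for img in images if img not in labeled]
        if not unlabeled:
            break
        proposals = {img: detector.predict(img) for img in unlabeled}
        corrected = expert_review(proposals)       # expert corrects a batch of proposals
        labeled.update(corrected)                  # fold corrections back into the training set
    return labeled
```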
Several limitations need to be reported and discussed. The observers faced randomly chosen datasets without overlapping examinations, which implies a certain amount of variability in the difficulty of solving them correctly. A specific, shared study set might have been more straightforward. Daily rates of true and false ratings may be affected in both directions by an "easier" or "harder" selection of studies in combination with a "lucky" or "unlucky" rater. To minimize the resulting selection bias, we decided to present the students with a substantial number of 100 images per day. Also, the reference radiologists' condition on a particular day may influence the fracture assessment. We tried to overcome that type of interference by accepting an index rating as correct if both reference radiologists were uncertain about a diagnosis. A reader should also keep in mind the well-known, methodically inherent fact of reduced fracture detection sensitivity in plain radiographs. Another drawback is that we did not assess parameters other than fractures in greater detail, such as bounding boxes containing text and metal, as they showed a low error rate and were of little relevance for the project goals. Transcription errors during the correction phases are conceivable and may have occurred occasionally. However, their influence should be vanishingly small in our comprehensive dataset.
Conclusion
In conclusion, students can help detect and label pediatric fractures around the wrist, assisting radiologists in building a supervised artificial intelligence dataset. While the error rate in fracture recognition decreased quickly under feedback, bounding box precision did not improve as much. Nevertheless, after a few days of instruction, substantial time savings for the specialists are possible. Our data showed no relevant benefit of employing teams over individual non-expert raters in that setting.
Acknowledgments
We sincerely thank all contributing students, Amina Pidro, Haris Muharemović, Nejra Selak, Amila Šabić, Edina Kovač, Etien Mama, Julie Vohryzkova, Sofia Tzafilkou, Moritz Trieb and Lisa Schöllnast for their help. We thank the team of Supervisely for supporting our artificial intelligence projects with reduced service fees.
References
- 1. Coppola F., et al., Artificial intelligence: radiologists’ expectations and opinions gleaned from a nationwide online survey. Radiol Med, 2021. 126(1): p. 63–71.
- 2. European Society of Radiology, Impact of artificial intelligence on radiology: a EuroAIM survey among members of the European Society of Radiology. Insights Imaging, 2019. 10(1): p. 105.
- 3. European Society of Radiology, Current practical experience with artificial intelligence in clinical radiology: a survey of the European Society of Radiology. Insights Imaging, 2022. 13(1): p. 107.
- 4. Kottler N., Artificial Intelligence: A Private Practice Perspective. J Am Coll Radiol, 2020. 17(11): p. 1398–1404. pmid:33010212
- 5. LeCun Y., Bengio Y., and Hinton G., Deep learning. Nature, 2015. 521(7553): p. 436–44. pmid:26017442
- 6. Bengio Y., Courville A., and Vincent P., Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell, 2013. 35(8): p. 1798–828.
- 7. Schmidhuber J., Deep learning in neural networks: an overview. Neural Netw, 2015. 61: p. 85–117. pmid:25462637
- 8. Philbrick K.A., et al., RIL-Contour: a Medical Imaging Dataset Annotation Tool for and with Deep Learning. J Digit Imaging, 2019. 32(4): p. 571–581. pmid:31089974
- 9. Shin H.C., et al., Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning. IEEE Trans Med Imaging, 2016. 35(5): p. 1285–98. pmid:26886976
- 10. Deng J., et al. ImageNet: A large-scale hierarchical image database. in 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009.
- 11. Kuznetsova A., et al., The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. 2018.
- 12. Lin T.-Y., et al., Microsoft COCO: Common Objects in Context. 2014. Cham: Springer International Publishing.
- 13. Rajpurkar P., et al., MURA Dataset: Towards Radiologist-Level Abnormality Detection in Musculoskeletal Radiographs. 2017.
- 14. Irvin J., et al., CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. 2019.
- 15. Park S., et al., Annotated normal CT data of the abdomen for deep learning: Challenges and strategies for implementation. Diagn Interv Imaging, 2019. pmid:31358460
- 16. Dreizin D., et al., Performance of a Deep Learning Algorithm for Automated Segmentation and Quantification of Traumatic Pelvic Hematomas on CT. J Digit Imaging, 2019.
- 17. Hu P., et al., Automatic abdominal multi-organ segmentation using deep convolutional neural network and time-implicit level sets. Int J Comput Assist Radiol Surg, 2017. 12(3): p. 399–411.
- 18. Tong N., et al., Fully automatic multi-organ segmentation for head and neck cancer radiotherapy using shape representation model constrained fully convolutional neural networks. Med Phys, 2018. 45(10): p. 4558–4567.
- 19. Kwee T.C. and Kwee R.M., Workload of diagnostic radiologists in the foreseeable future based on recent scientific advances: growth expectations and role of artificial intelligence. Insights Imaging, 2021. 12(1): p. 88.
- 20. Di Pietro S., et al., The learning curve of sonographic inferior vena cava evaluation by novice medical students: the Pavia experience. J Ultrasound, 2018. 21(2): p. 137–144. pmid:29564661
- 21. Feldman L.S., et al., A method to characterize the learning curve for performance of a fundamental laparoscopic simulator task: defining "learning plateau" and "learning rate". Surgery, 2009. 146(2): p. 381–6. pmid:19628099
- 22. Hogle N.J., Briggs W.M., and Fowler D.L., Documenting a learning curve and test-retest reliability of two tasks on a virtual reality training simulator in laparoscopic surgery. J Surg Educ, 2007. 64(6): p. 424–30.
- 23. Linsk A.M., et al., Validation of the VBLaST pattern cutting task: a learning curve study. Surg Endosc, 2018. 32(4): p. 1990–2002. pmid:29052071
- 24. Peyrony O., et al., Monitoring Personalized Learning Curves for Emergency Ultrasound With Risk-adjusted Learning-curve Cumulative Summation Method. AEM Educ Train, 2018. 2(1): p. 10–14. pmid:30051059
- 25. Pourmand A., et al., Impact of Asynchronous Training on Radiology Learning Curve among Emergency Medicine Residents and Clerkship Students. Perm J, 2018. 22: p. 17–055. pmid:29272248
- 26. Nagy E., et al., A pediatric wrist trauma X-ray dataset (GRAZPEDWRI-DX) for machine learning research. Sci Data, 2022. 9: p. 222. pmid:35595759
- 27. Trevethan R., Sensitivity, Specificity, and Predictive Values: Foundations, Pliabilities, and Pitfalls in Research and Practice. Front Public Health, 2017. 5: p. 307. pmid:29209603
- 28. Jaccard P., Lois de distribution florale dans la zone alpine. Bulletin de la Société vaudoise des sciences naturelles, 1902. 38: p. 69–130.
- 29. Everingham M., et al., The Pascal Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 2010. 88: p. 303–338.
- 30. Liang Q., et al., Weakly Supervised Biomedical Image Segmentation by Reiterative Learning. IEEE J Biomed Health Inform, 2019. 23(3): p. 1205–1214.
- 31. Shahedi M., et al., Accuracy Validation of an Automated Method for Prostate Segmentation in Magnetic Resonance Imaging. J Digit Imaging, 2017. 30(6): p. 782–795. pmid:28342043
- 32. Shahedi M., et al., Spatially varying accuracy and reproducibility of prostate segmentation in magnetic resonance images using manual and semiautomated methods. Med Phys, 2014. 41(11): p. 113503.
- 33. Koitka S., et al., Ossification area localization in pediatric hand radiographs using deep neural networks for object detection. PLoS One, 2018. 13(11): p. e0207496. pmid:30444906
- 34. Brostow G., Human in the loop computer vision. Perception, 2015. 44: p. 360–360.
- 35. Zanzotto F.M., Viewpoint: Human-in-the-loop Artificial Intelligence. Journal of Artificial Intelligence Research, 2019. 64: p. 243–252.
- 36. Bauckhage C., et al., Vision Systems with the Human in the Loop. EURASIP Journal on Advances in Signal Processing, 2005. 2005: p. 2375–2390.