Predicting semantic segmentation quality in laryngeal endoscopy images

Andreas M. Kist; Sina Razi; René Groh; Florian Gritsch; Anne Schützenberger

doi:10.1371/journal.pone.0314573

Abstract

Endoscopy is a major tool for assessing the physiology of inner organs. Contemporary artificial intelligence methods are used to fully automatically label medically important classes on a pixel-by-pixel level. This so-called semantic segmentation is used, for example, to detect cancer tissue or to assess laryngeal physiology. However, due to the diversity of patients presenting, it is necessary to judge the segmentation quality. In this study, we present a fully automatic system to evaluate the segmentation performance in laryngeal endoscopy images. Using glottal area segmentation as a showcase, we demonstrate that the predicted segmentation quality, represented by the intersection over union metric, is on par with human raters. Using a traffic light system, we are able to identify problematic segmentation frames to allow human-in-the-loop improvements, which is important for the clinical adaptation of automatic analysis procedures.

Citation: Kist AM, Razi S, Groh R, Gritsch F, Schützenberger A (2025) Predicting semantic segmentation quality in laryngeal endoscopy images. PLoS One 20(7): e0314573. https://doi.org/10.1371/journal.pone.0314573

Editor: Xiaohui Zhang, Bayer Crop Science United States: Bayer CropScience LP, UNITED STATES OF AMERICA

Received: November 8, 2024; Accepted: May 10, 2025; Published: July 3, 2025

Copyright: © 2025 Kist et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data is shared publicly at https://zenodo.org/records/14034494.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Semantic segmentation is the pixel-wise classification of present objects in a given image. Especially in a medical context, semantic segmentation of different tissues is an important task and has seen major improvement since the advent of Deep Learning [1]. In contrast to object detection, semantic segmentation delineates specific organs and allows the precise quantification of covered areas. In the case of laryngeal endoscopy, the area between the vocal folds, the so-called glottal area, is an important proxy for the vocal folds’ oscillation behavior [2,3]. Many works have used a variety of classical and Deep Learning-based computer vision techniques to approach glottal area segmentation [4], ideally in a fully automatic manner [5–7]. A plethora of clinical parameters have been described [8] that rely on the glottal area waveform, a biosignal derived from the glottal area. As the validity of these parameters heavily depends on the segmentation quality of the glottal area [9], it is important to assess this quality using meaningful metrics.

Common metrics to evaluate this segmentation quality are the Dice and the Intersection over Union (IoU) score. Both scores compare the ground-truth area and the predicted area and yield a value between 0 and 1 that needs to be maximized. Dice and IoU scores are closely related and can be converted into each other by a simple formula [10]; hence, for consistency, we use only the IoU score in the following. However, both need the underlying ground-truth segmentation commonly provided by domain experts. Providing this often manually generated ground-truth data in a clinical context is not feasible, as it involves a significant amount of hands-on time [7]. Therefore, previous studies proposed AI-powered systems that predict the IoU/Dice score of a given segmentation to hint at potential failure cases [11,12]. Such a system would be highly beneficial for analyzing segmentations of high-speed videoendoscopy footage, which is typically acquired at 4,000 frames per second, but does not yet exist.
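The relation between the two scores can be made concrete with a short sketch. The function names and the toy masks are illustrative assumptions, not part of the study's code; the conversion used is the standard identity Dice = 2·IoU / (1 + IoU).

```python
import numpy as np

def iou_score(a, b):
    """Intersection over Union of two binary masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return float(np.logical_and(a, b).sum() / union)

def dice_score(a, b):
    """Dice coefficient of two binary masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # same convention as above
    return float(2 * np.logical_and(a, b).sum() / total)

def dice_from_iou(iou):
    return 2 * iou / (1 + iou)

def iou_from_dice(dice):
    return dice / (2 - dice)

# Toy example: a 4x4 square and the same square shifted down by one row.
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True
pred = np.zeros((8, 8), dtype=bool)
pred[3:7, 2:6] = True

iou = iou_score(gt, pred)    # 12 shared pixels / 20 pixels in the union
dice = dice_score(gt, pred)  # 2 * 12 / (16 + 16)
print(iou, dice, dice_from_iou(iou))
```

Because the mapping between the two scores is monotonic, ranking segmentations by IoU or by Dice gives the same order, which is why reporting only one of them loses no information.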

However, the IoU score by itself carries no information about clinical usefulness, raising the question of which IoU scores are actually desired. We agree that an IoU score of 1 is considered perfect, but which IoU scores can be reached unequivocally by domain experts? First evidence has been given in [7], showing that the average IoU score for glottal area segmentation among three experts is 0.772, a value reasonably distant from 1. Maryn and colleagues [13] have demonstrated that variation in glottal area segmentation across individuals at that quality level barely affects the downstream parameter computation. Nevertheless, inter- and intra-rater reliability has not been clearly assessed for glottal area segmentation.

In this study, we address these issues by providing an in-depth analysis of inter- and intra-rater reliability and developing a novel IoU prediction system for glottal area segmentation. With this information, we suggest a traffic light system for glottal area segmentations to pinpoint problematic frames observed during the analysis to guide the clinician in interpreting downstream computed parameters with respective care.

Materials and methods

Data

In this study, we utilized the Benchmark for Automatic Glottis Segmentation (BAGLS, [7]) for analyses presented in Figs 1, 2, and 3. The data used in Fig 4 consisted of retrospective medical records accessed on February 15, 2024. The individual authors did not have access to the subject identities during data access. We used custom Python scripts to create the target dataset (see Github). We cropped images of an aspect ratio different from 1:1 around the glottal area and resized each frame to a common resolution of 224 × 224 × 3. This training data is available through Zenodo (see Data Availability). The ground-truth masks were modified to generate segmentation masks with segmentation artifacts. The full process is shown in Fig 1. In particular, we used erosion and dilation to modify the mask uniformly. For uncertainties at the glottal area border, a Sobel filter was applied to detect the edges. After binary dilation, each edge pixel was randomly set to be added or removed from the ground truth. Small segmentation artifacts close to the glottis were mimicked using up to five randomly drawn spheres with a random radius between one and three pixels. For large segmentation artifacts, 2D Perlin noise was added to the ground truth mask. Here, we relied on the Perlin noise generator for numpy (https://github.com/pvigier/perlin-numpy).
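A minimal sketch of the first three artifact steps, assuming NumPy and SciPy, may help to illustrate the idea. It is not the authors' script: the function names, the 50% probability of applying each step, the iteration range, and the placement offset are illustrative assumptions. The Perlin-noise step for large artifacts is left out because it depends on the external perlin-numpy package.

```python
import numpy as np
from scipy import ndimage

def scale_mask(mask, rng, max_iter=3):
    """Uniformly shrink or grow the mask by binary erosion or dilation."""
    n = int(rng.integers(1, max_iter + 1))  # assumed hyperparameter range
    op = ndimage.binary_erosion if rng.random() < 0.5 else ndimage.binary_dilation
    return op(mask, iterations=n)

def fuzzy_border(mask, rng):
    """Randomly add or remove pixels in a band around the Sobel edges."""
    m = mask.astype(float)
    edges = np.hypot(ndimage.sobel(m, axis=0), ndimage.sobel(m, axis=1)) > 0
    band = ndimage.binary_dilation(edges)
    out = mask.copy()
    out[band] = rng.random(int(band.sum())) < 0.5  # coin flip per edge pixel
    return out

def add_small_blobs(mask, rng, max_blobs=5, offset=10):
    """Add up to five discs with radius 1 to 3 px close to the glottis."""
    out = mask.copy()
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return out
    yy, xx = np.ogrid[:mask.shape[0], :mask.shape[1]]
    for _ in range(int(rng.integers(0, max_blobs + 1))):
        idx = rng.integers(ys.size)
        cy = ys[idx] + rng.integers(-offset, offset + 1)  # assumed offset
        cx = xs[idx] + rng.integers(-offset, offset + 1)
        r = rng.integers(1, 4)
        out |= (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
    return out

def make_artifact_mask(gt, rng):
    """Apply each step with probability 0.5 (the probability is an assumption)."""
    mask = np.asarray(gt, dtype=bool)
    for step in (scale_mask, fuzzy_border, add_small_blobs):
        if rng.random() < 0.5:
            mask = step(mask, rng)
    return mask

def iou(a, b):
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0

rng = np.random.default_rng(0)
gt = np.zeros((224, 224), dtype=bool)
gt[80:150, 100:125] = True  # synthetic stand-in for a glottal area mask
degraded = make_artifact_mask(gt, rng)
target_iou = iou(gt, degraded)  # regression target for this training pair
print(target_iou)
```

The pair of degraded mask and computed IoU is what a quality-prediction network would be trained on; the endoscopy frame itself stays unchanged.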

Fig 1. Artifact generation process.

A: The BAGLS dataset contains 55,750 paired training samples consisting of endoscopy images (left) and their respective glottal area segmentation (right). B: Ground-truth glottal area segmentation is sent to the artifact generator. We apply four steps to incorporate uniform mask scaling artifacts, border fuzziness, and small and large segmentation artifacts. Each step is randomly applied, and step-dependent hyperparameters are randomly chosen. The resulting segmentation mask is used to compute the IoU score with the ground-truth segmentation. The resulting segmentation masks together with the IoU score are used for training downstream deep neural networks.

https://doi.org/10.1371/journal.pone.0314573.g001

Fig 2. Inter- and intra-rater reliability shows non-perfect agreement.

(A) Task overview. A subset of the BAGLS dataset was taken (100 random frames) and annotated by trained raters three times in random order. (B) Inter-rater reliability among six raters, highlighting consistency in their evaluations. (C) Inter-rater reliability among neural network models described in [7]. (D) Intra-rater reliability for six raters across three rounds, assessing individual consistency. (E) Histogram of intra-rater reliability across raters. (F) Relationship between inter-rater reliability and the number of pixels in segmented areas.

https://doi.org/10.1371/journal.pone.0314573.g002

Fig 3. Segmentation quality prediction using neural networks.

(A) Showcases various combinations of endoscopic images and their corresponding segmentation processed through the neural network architecture. (B, C) RMSE across diverse network backbones under two distinct scenarios, namely pre-trained on ImageNet and trained from scratch with random initialization, respectively. (D) Comparison of RMSE between different neural networks and the average annotation mask of six human raters.

https://doi.org/10.1371/journal.pone.0314573.g003

Fig 4. Traffic light system.

(A) Inference on independent test videos: For each frame, the glottal area is segmented using three different segmentation networks. The predicted masks and the original input frame are used as input (format “see”, as demonstrated in Fig 3b/3c) to the trained MobileNetV2, which predicts the IoU score. A traffic light scheme is applied to these results. (B) Examples of traffic light bars using two exemplary videos: The traffic light bars show the predicted color for each segmentation network. For each video, three exemplary frames and their corresponding IoU predictions are shown.

https://doi.org/10.1371/journal.pone.0314573.g004

The Intersection over Union (IoU) score is our target metric (see evaluation) and ranges from 0 (no overlap, bad segmentation) to 1 (perfect overlap, excellent segmentation). We used 20 discrete bins, each with a width of 0.05, to ensure an even sampling across the full IoU range. From the BAGLS dataset, we draw random frames to generate 1,200 training pairs in each bin to ensure a balanced dataset, which leads to 24,000 data points. For each image, we applied random combinations of the above-mentioned techniques to the segmentation mask and determined the IoU score compared to the untreated ground-truth segmentation mask.
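The balanced sampling can be sketched as a simple rejection loop. This is an assumption about how such a dataset could be filled, not the authors' implementation; `draw_pair` is a hypothetical callable that returns one (sample, IoU) pair, and the example uses a small per-bin count and random numbers in place of real frames.

```python
import random

N_BINS = 20  # 20 bins of width 0.05 cover the IoU range [0, 1]

def bin_index(iou):
    """Map an IoU in [0, 1] to its bin; IoU == 1.0 falls into the last bin."""
    return min(int(iou * N_BINS), N_BINS - 1)

def fill_bins(draw_pair, per_bin, max_draws=1_000_000):
    """Draw (sample, iou) pairs until every bin holds `per_bin` entries."""
    bins = [[] for _ in range(N_BINS)]
    for _ in range(max_draws):
        if all(len(b) == per_bin for b in bins):
            break
        sample, iou = draw_pair()
        target = bins[bin_index(iou)]
        if len(target) < per_bin:  # reject pairs that land in a full bin
            target.append((sample, iou))
    return bins

# Stand-in generator: in practice this would degrade a random BAGLS mask
# and return the frame, the degraded mask, and the resulting IoU.
rng = random.Random(0)

def fake_pair():
    return ("frame-and-mask", rng.random())

bins = fill_bins(fake_pair, per_bin=12)
total = sum(len(b) for b in bins)
print(total)
```

With 1,200 pairs per bin instead of 12, the same loop yields 20 × 1,200 = 24,000 data points, matching the dataset size stated above.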

Deep neural networks

All deep neural networks were set up using TensorFlow/Keras (Keras 3.1.1 and TensorFlow 2.16.0) with enabled CUDA/cuDNN support. Experiments were performed on an RTX A4000 GPU. We relied on established architectures that have been widely used across tasks. Specifically, we utilized the MobileNet, MobileNetV2, MobileNetV3, ResNet50, and EfficientNetB0 [14–18] architectures to regress the IoU score. In detail, we used each backbone in its respective tensorflow.keras.applications implementation, removed the top classification layer, and added a global average pooling layer (GlobalAveragePooling2D), then a Dense layer with 256 units and ReLU activation, and a Dense layer with a single unit and sigmoid activation function. For regularization, a Dropout layer with 0.1 dropout probability was placed between the two Dense layers. Neural networks were trained for 50 epochs using either SGD with momentum (set to 0.9) or the Adam optimizer with an initial learning rate of 10⁻² and 10⁻³, respectively, with a reduction after 10 epochs. We evaluated the mean squared error (MSE), the mean absolute error (MAE), and the ordinary categorical cross-entropy as loss functions; the latter was used for subsequent analyses. We provided the endoscopy frame (e) and the segmentation mask (s) as input in various combinations to predict the target IoU score. The combinations are (i) only endoscopy images (abbreviated as eee), (ii) only segmentation masks (sss), (iii) two endoscopy images and one segmentation mask (ees or see), (iv) two segmentation masks and one endoscopy image (sse or ess). We used three channels for convenience in training, as pre-trained networks expect RGB images with three channels. For our experiments, any RGB endoscopy image was converted to grayscale prior to training. Input frames were normalized to a range from -1 to 1. The BAGLS dataset already contains a training/test split, which was used in this study as well. From the training data, 5% was allocated for validation, resulting in 22,800 training samples and 1,200 validation samples. Additionally, a separate test set of 100 samples was used not only for the inference of neural networks but also for intra-rater and inter-rater reliability analysis, similar to previous reports.
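The three-channel input schemes can be illustrated with a small NumPy sketch. The helper names are hypothetical, and scaling the binary mask to the same [-1, 1] range as the frame is an assumption, since the text only states that input frames were normalized.

```python
import numpy as np

def normalize(img_uint8):
    """Scale 8-bit intensities from [0, 255] to [-1, 1]."""
    return img_uint8.astype(np.float32) / 127.5 - 1.0

def compose_input(frame_gray, mask, scheme="see"):
    """Stack grayscale frame (e) and mask (s) into three channels.

    `scheme` is a three-letter string such as "eee", "sss", "see" or "sse".
    """
    if len(scheme) != 3 or set(scheme) - {"e", "s"}:
        raise ValueError("scheme must consist of three letters from {'e', 's'}")
    e = normalize(frame_gray)
    s = normalize(mask.astype(np.uint8) * 255)  # assumption: mask uses the same range
    return np.stack([e if c == "e" else s for c in scheme], axis=-1)

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(224, 224), dtype=np.uint8)  # synthetic frame
mask = np.zeros((224, 224), dtype=bool)
mask[80:150, 100:125] = True

x = compose_input(frame, mask, scheme="see")
print(x.shape, x.min(), x.max())

# 5% of the 24,000 generated pairs are held out for validation.
n_total = 24_000
n_val = n_total * 5 // 100
n_train = n_total - n_val
```

Keeping three channels regardless of the scheme means the same ImageNet-pretrained backbones can be reused without changing their first layer.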

Inter- and intrarater analysis

Assessing inter-annotator agreement is critically important for several fundamental reasons [19]. High inter-rater reliability (IRR) ensures the consistency of annotations, which is crucial for designing robust AI algorithms, and supports effective AI model training and evaluation by reducing noise and subjectivity. It further contributes to overall annotation validity, signifying good data quality for reproducible studies. We use rater and annotator interchangeably in this context.

Here, six independent annotators manually annotated the glottal area in the same 100 randomly selected frames from the BAGLS dataset exactly three times (see Fig 2). Each annotator had previously been trained with the same instructions, similar to our guidelines outlined in [8]. The annotators were selected to cover a range of training experience in order to check for short- and long-term training effects. For each annotator, each annotation round consisted of exactly the same 100 frames, but randomly shuffled to avoid any progression bias. For glottis annotation, we used the Pixel Precise Annotator (PiPrA, [7]) to generate single segmentation masks for each frame individually.

Inter-rater reliability (IeRR).

Inter-rater reliability (IeRR) measures how similarly different raters segment the same images. For each pair of raters i and j, we:

Compute the mean segmentation mask for each rater, averaged over all rounds.
Threshold each mean mask to obtain a binary mask.
Compute the Intersection-over-Union (IoU) between these two masks for every image.
Average these IoU values over all images.

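$$\mathrm{IeRR}_{i,j} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{IoU}\big(T(\bar{X}_{i,k}),\, T(\bar{X}_{j,k})\big) \tag{1}$$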

Intra-rater reliability (IaRR).

Intra-rater reliability (IaRR) measures each rater’s consistency with themselves across multiple rounds. For each rater i, we:

Compare every pair of that rater’s round-specific segmentation masks.
Compute IoU for each image.
Average these IoUs over all images and all pairwise round combinations.

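$$\mathrm{IaRR}_{i} = \frac{1}{K \binom{R}{2}} \sum_{k=1}^{K} \sum_{1 \le r < r' \le R} \mathrm{IoU}\big(X_{i,k,r},\, X_{i,k,r'}\big) \tag{2}$$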

We use $N$ to denote the total number of raters and $K$ to denote the number of images. Each rater completes $R$ rounds of segmentation. For rater $i$, $X_{i,k,r}$ represents the binary segmentation mask for image $k$ in round $r$. The notation $\bar{X}_{i,k}$ indicates the average of rater $i$'s segmentation masks over all $R$ rounds for image $k$. The function $T(\cdot)$ converts a (potentially continuous) mask to a binary mask via thresholding. Finally, $\mathrm{IoU}(A, B) = |A \cap B| / |A \cup B|$ is the standard Intersection-over-Union measure for two binary sets $A$ and $B$.
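Both reliability measures follow directly from the listed steps and can be sketched in NumPy. The array layout (raters, images, rounds, height, width), the function names, and the threshold of 0.5 are assumptions for illustration; the text does not state the threshold value.

```python
from itertools import combinations

import numpy as np

def iou(a, b):
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks agree perfectly
    return float(np.logical_and(a, b).sum() / union)

def inter_rater(masks, i, j, threshold=0.5):
    """IeRR for raters i and j; masks has shape (N, K, R, H, W)."""
    mean_i = masks[i].mean(axis=1)  # average over rounds -> (K, H, W)
    mean_j = masks[j].mean(axis=1)
    bin_i, bin_j = mean_i >= threshold, mean_j >= threshold
    return float(np.mean([iou(a, b) for a, b in zip(bin_i, bin_j)]))

def intra_rater(masks, i):
    """IaRR for rater i: mean IoU over all images and all pairs of rounds."""
    n_images, n_rounds = masks.shape[1], masks.shape[2]
    scores = [
        iou(masks[i, k, r], masks[i, k, q])
        for k in range(n_images)
        for r, q in combinations(range(n_rounds), 2)
    ]
    return float(np.mean(scores))

# Toy data: 2 raters, 2 images, 3 rounds, 6x6 masks.
base = np.zeros((6, 6), dtype=bool)
base[1:5, 1:5] = True
shifted = np.zeros((6, 6), dtype=bool)
shifted[2:6, 1:5] = True

masks = np.zeros((2, 2, 3, 6, 6), dtype=bool)
masks[0] = base           # rater 0 draws the same mask in every round
masks[1, :, :2] = shifted  # rater 1 draws a shifted mask in rounds 0 and 1
masks[1, :, 2] = base      # and the base mask in round 2

print(inter_rater(masks, 0, 1), intra_rater(masks, 0), intra_rater(masks, 1))
```

In the toy data, rater 1's thresholded mean mask equals the shifted square, so the inter-rater value is the IoU of the two squares, while the intra-rater value of rater 1 mixes one perfect and two imperfect round pairs.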

Prediction of segmentation quality using traffic light system

Five videos of healthy subjects containing at least 2,000 frames were analyzed. Each frame was segmented using previously established glottis segmentation networks [8], including CustomUnet Fast [7], ResNet50 Quality and EfficientNetB0 Balanced [20], and the segmentation quality, i.e. the IoU score, was predicted. For IoU score prediction, we relied on the MobileNetV2 S-E-E (“see”) variant shown in Fig 3. The reason for selecting this variant is thoroughly explained in the Results section.

Each video is visualized by a bar whose colors are derived from a traffic light scheme computed for each of the 2,000 frames. The predicted IoU score of each frame is mapped to the following color code: green indicates high IoU scores (above 0.7), suggesting good segmentation quality as derived from the IeRR analysis; yellow signals moderate IoU scores (between 0.6 and 0.7); and red denotes low predicted IoU scores (below 0.6), highlighting areas of potential concern. These values are in line with our IeRR study and with recent reports [21]. This visual tool allows for an immediate evaluation of the performance of various deep neural network models over continuous video frames.
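The color assignment amounts to a two-threshold lookup, sketched below. Treating scores of exactly 0.6 and 0.7 as yellow is an assumption, since the text does not specify the boundary cases, and the example predictions are made up.

```python
from collections import Counter

GREEN_ABOVE = 0.7
RED_BELOW = 0.6

def traffic_light(pred_iou):
    """Map a predicted IoU score to a traffic light color."""
    if pred_iou > GREEN_ABOVE:
        return "green"
    if pred_iou < RED_BELOW:
        return "red"
    return "yellow"  # assumption: both boundary values count as yellow

def video_bar(pred_ious):
    """One color per frame, in frame order."""
    return [traffic_light(p) for p in pred_ious]

# Made-up predictions for a handful of frames.
predictions = [0.85, 0.78, 0.66, 0.52, 0.91, 0.61]
bar = video_bar(predictions)
summary = Counter(bar)
print(bar, summary)
```

A per-video summary such as the color counts gives a quick impression of how many frames would need manual review.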

Evaluation

Common metrics to evaluate semantic segmentation quality are the Intersection over Union (IoU) and the Dice score. These metrics have been assessed in many works previously [6–9,21]. As IoU and Dice scores are mathematically related (e.g. [22]), we solely report the IoU score in this work.

Inter-rater and intra-rater reliability (IeRR and IaRR) were assessed using the IoU score by comparing each rater to every other rater and to their own three-fold annotations, respectively. For the IaRR, we compared each of a rater's masks with each of that rater's other masks; for the IeRR, we compared the raters' masks averaged across rounds. Any error is shown as the standard error of the mean.
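For reference, the standard error of the mean is the sample standard deviation divided by the square root of the number of values. The sketch below uses made-up IoU values and assumes the sample (n − 1) form of the standard deviation.

```python
import math
import statistics

def sem(values):
    """Standard error of the mean using the sample standard deviation."""
    return statistics.stdev(values) / math.sqrt(len(values))

ious = [0.70, 0.80, 0.75, 0.85, 0.65]  # made-up example values
mean_iou = statistics.mean(ious)
error = sem(ious)
print(f"{mean_iou:.2f} +/- {error:.3f}")
```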

Results

Intra- and inter-rater reliability

To find an automatic solution for segmentation quality prediction, we first need to investigate the variability and reliability across manual annotations by humans, the current gold standard in glottis segmentation. In total, six trained raters with varying levels of experience (ranging from weeks to multiple years) were tasked to segment the glottal area in a random subset of the BAGLS [7] dataset (overview in Fig 2a). Experts manually annotated each frame three times in a randomized order to also gain an intra-rater reliability measure (see Methods). For evaluation, we use the Intersection over Union (IoU), a common metric to compare two segmentation masks on a pixel-wise level (see Methods and [22]). In other words, we compare two segmentation masks, either from the same rater or from different raters, by using the IoU score.

Fig 2 shows an overview of the intra- and inter-rater reliability (IaRR and IeRR, respectively). We found that the average IaRR is 0.77 ± 0.01 (mean ± std), whereas the IeRR is 0.70 ± 0.06, as derived from Fig 2d and Fig 2b, respectively. This is in line with previous studies investigating the inter-segmenter reliability [7]. We additionally investigated how IaRR and IeRR change relative to the average amount of labeled pixels and the distance to the center of mass. As shown in Fig 2f, we found that especially small areas (higher than 0 px and lower than 20 px) have a low IaRR and IeRR (0.029 ± 0.042 and 0.26 ± 0.00, respectively) compared to rather large areas (more than 200 px), which show markedly higher IaRR and IeRR values (0.81 ± 0.01 and 0.66 ± 0.07, respectively). IeRR and IaRR are also highly consistent close to the center of mass, but decrease dramatically when plotted relative to the edge. In addition, Fig 2e shows a histogram illustrating the intra-rater reliability (IaRR) across six raters. This visualization highlights a trend where the frequency of IoU scores tends to increase with higher IoU values, indicating a tendency towards greater annotator consistency at higher levels of agreement. Furthermore, we also investigated the inter-model reliability (similar to IeRR) for previously established glottis segmentation networks [8]. These networks are contained in the GAT software and are established segmentation tools, namely a custom-tailored U-Net (Fast, similar to [9]) and two U-Nets based on the ResNet50 (Quality, [17]) and EfficientNetB0 (Balanced, [18]). These networks were also trained on the full BAGLS dataset (see Methods) and evaluated on the same 100 frames as the raters. We found that on average the agreement is 0.788, larger than the human IaRR and IeRR, indicating higher self-consistency. Nevertheless, these value ranges are similar to their human counterparts, indicating that glottis segmentation has some degree of uncertainty for humans and machines.

Comparison of segmentation quality and intra- and inter-rater reliability

Our primary objective is to compare the results of human raters and neural network models to determine which is superior in terms of IoU. To address this, we aimed to regress the IoU score using a deep neural network. First, we asked which information is crucial to accurately predict the IoU score. We systematically changed the input to the neural network as shown in Fig 3a: We either used only endoscopic images as input (eee), only the segmentation mask (sss), or a combination of both (see and sse). In Fig 3b and 3c, it can be observed that the use of the sss scheme or the eee scheme leads to higher RMSE values that are not sufficient for our task. This leads to the conclusion that the input scheme of the neural network should contain both the endoscopic image and the respective segmentation mask. Therefore, the see or the sse scheme should be preferred. Comparing these two configurations, it can be seen that they are on par. For the remaining results, we only performed experiments with the see scheme.

Finally, we compared the RMSE of different GAT neural networks and the average of six human raters. As shown in Fig 3d, we could not visually see a clear difference between the different models. We tested the samples for normality using the Shapiro-Wilk test and found that they were not normally distributed (p<0.05). We then applied the Mann-Whitney U test with Bonferroni correction to each combination and found no significant difference between the obtained RMSE values of each model (p>0.05). Based on these results, there was no bias towards any of the models, leading to the conclusion that each of them was equally suitable for our task.
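The testing procedure described here can be sketched with SciPy. The error values below are synthetic placeholders drawn from an exponential distribution, not results from the study, and the group labels are used only to mirror the four groups being compared.

```python
from itertools import combinations

import numpy as np
from scipy import stats

def compare_groups(groups, alpha=0.05):
    """Shapiro-Wilk per group, then pairwise Mann-Whitney U with Bonferroni."""
    normal = {}
    for name, values in groups.items():
        _, p = stats.shapiro(values)
        normal[name] = p > alpha  # True if normality is not rejected
    pairs = list(combinations(groups, 2))
    adjusted = {}
    for a, b in pairs:
        _, p = stats.mannwhitneyu(groups[a], groups[b], alternative="two-sided")
        adjusted[(a, b)] = min(1.0, p * len(pairs))  # Bonferroni correction
    return normal, adjusted

# Synthetic per-frame prediction errors (placeholders, not study data).
rng = np.random.default_rng(0)
groups = {
    "CustomUnet Fast": rng.exponential(0.10, size=100),
    "ResNet50 Quality": rng.exponential(0.10, size=100),
    "EfficientNetB0 Balanced": rng.exponential(0.10, size=100),
    "Human raters (average)": rng.exponential(0.10, size=100),
}

normal, adjusted = compare_groups(groups)
for pair, p in adjusted.items():
    print(pair, round(p, 3))
```

Multiplying each p-value by the number of pairwise comparisons is one common way to apply the Bonferroni correction; comparing the unadjusted p-values against alpha divided by the number of comparisons is equivalent.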

Traffic light system for IoU prediction

When looking at the generated traffic light scheme, a visual difference between the models can be seen. To generate the traffic light indications, we feed the footage frame-by-frame to our pre-trained neural networks and then estimate the IoU score using the “see” model from Fig 3. With the given thresholds (see Methods), we bin the predicted scores into three classes (green, yellow, and red). An overview is given in Fig 4a. Focusing on the EfficientNetB0 model, we observe that it predominantly features green throughout the videos (Fig 4b), consistently predicting high IoU scores where high-quality segmentation matches the ground truth. This observation is consistent with our findings where EfficientNetB0 exhibited the lowest RMSE among all models tested (Fig 3d), confirming its superior accuracy in predicting segmentation quality. In comparison, other models display a significant amount of yellow and red. This pattern suggests their comparatively lower performance, especially in frames challenged by adverse conditions such as motion blur or inadequate lighting. This direct visual comparison underscores EfficientNetB0’s effectiveness in practical scenarios, particularly in accurately handling frames suitable for endoscopic analysis.

Discussion

In this study, we showed the significance of our findings in the context of glottal area segmentation, focusing on both inter-rater and intra-rater reliability, the prediction of segmentation quality, and finally providing the traffic light system for visualizing IoU scores and assigning each frame to a specific group. Our study focused on the evaluation of glottal area segmentation, a critical aspect for the quantification of vocal fold physiology using laryngeal high-speed videoendoscopy. The predicted segmentation quality, as represented by the IoU metric, demonstrated parity with human raters. This alignment underscores the potential of our fully automatic system to reliably assess glottal area segmentation, which is important for a labor-free clinical implementation of high-speed footage quantification. The evaluation of inter- and intra-rater reliability shed light on manual annotations’ consistency and variability. Our findings extend and confirm prior work on inter-rater reliability. An early assessment has been performed in [7], where up to four segmenters and their inter-rater reliability have been assessed. In exploring intersegmenter variability in Glottal Analysis Tools (GAT) measures, [13] conducted a cohort study assessing rater reliability. This study focused on the analysis of the glottal area waveform (GAW) from high-speed videoendoscopy, involving trained segmenters who independently annotated videos. The study’s use of the Intraclass Correlation Coefficient (ICC) adds a quantitative dimension to the evaluation of GAW measures. With high ICC values across various parameters, the study highlights the clinical applicability of GAT, indicating strong inter-rater reliability.

In line with these findings, we observed a higher IaRR compared to the IeRR indicating that there is a self-contained ability to perform manual glottis segmentation. Nevertheless, these IeRR values are far from optimal. Examining IeRR across pixel sizes and distances to the center of mass of the glottis segmentation revealed further insights. Smaller areas (less than 20 pixels) presented challenges, exhibiting lower inter- and intra-rater reliability, most likely due to the relatively high fraction of fuzzy and dubious contour pixels in small segmented areas. As the variance is 0.1 for 75 percent of the data when using EfficientNet (as seen in Fig 4), predictions that are close to the traffic light thresholds are particularly prone to misclassification. For example, if the true IoU value is 0.65, the model predicts in the range of 0.55 to 0.75 due to the observed variance. Thus, there is a high probability that predictions in this critical range will be misclassified. Misclassifications in this range can lead to moderate segmentations being considered good and vice versa. This affects the reliability of the automatic segmentation assessment and can lead to incorrect medical decisions being made. This also means that very poor segmentations and very good segmentations are reliably identified by our model, resulting in a very low misclassification rate in these regions.
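The threshold argument can be worked through with a short sketch. It assumes a symmetric prediction tolerance of 0.1 around the true IoU, taken from the example in the text, and treats the boundary values 0.6 and 0.7 as yellow; both the function names and these conventions are illustrative.

```python
def traffic_light(pred_iou):
    """Green above 0.7, red below 0.6, yellow otherwise (boundary choice assumed)."""
    if pred_iou > 0.7:
        return "green"
    if pred_iou < 0.6:
        return "red"
    return "yellow"

def possible_lights(true_iou, tolerance=0.1):
    """Colors that can occur if the prediction lies within true_iou +/- tolerance."""
    lo = max(0.0, true_iou - tolerance)
    hi = min(1.0, true_iou + tolerance)
    lights = {traffic_light(lo), traffic_light(hi)}
    if lo <= 0.7 and hi >= 0.6:  # the interval overlaps the yellow band
        lights.add("yellow")
    return lights

for true_iou in (0.30, 0.65, 0.75, 0.90):
    print(true_iou, sorted(possible_lights(true_iou)))
```

Under this assumption, a true IoU of 0.65 can end up in any of the three classes, whereas values far from both thresholds always receive the same color.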

Our investigation into predicting segmentation quality through IoU score regression revealed dependencies. Systematically altering neural network inputs, including endoscopic images, masks, or their combination, helped us in fine-tuning the system’s IoU. The traffic light system, visually representing IoU scores, demonstrated the influence of pixel size and expert-derived scores on prediction accuracy. Our analysis included comparisons between human raters and various neural network models such as CustomUnet Fast, ResNet50 Quality, and EfficientNetB0 Balanced. While all models demonstrated competitive capabilities, the EfficientNet model slightly outperformed others, including human raters, in terms of RMSE and showed a predominance of green in the traffic light visualization (see Fig 4b), indicating higher segmentation quality consistently across frames. The successful alignment of automated predictions with human raters supports our system’s integration into clinical workflows.

The traffic light system offers a user-friendly interface for clinicians, promoting a collaborative human-in-the-loop approach. This model, merging AI strengths with expert judgment, holds promise for enhancing glottal area segmentation reliability and efficiency in clinical settings. Overall, this study underscores the benefits of AI in enhancing diagnostic procedures in laryngeal endoscopy and sets the stage for further innovations in medical imaging.

Our traffic light system for visualizing IoU scores provides an easy-to-grasp, quick evaluation framework for downstream analysis. However, we observed that especially small segmentation areas have a high variance in IoU prediction. Future work should look into refining the algorithm to better handle these small segmented areas, which currently show lower reliability, consistent with previous reports [7]. The use of more elaborate architectures may yield better outcomes as suggested in recent studies [23,24]. Improving the system’s ability to accurately analyze these small segments could significantly enhance overall accuracy.

Conclusion

In conclusion, our study dives into the details of glottal area segmentation, reliability among raters, and predicting the quality of segmentation in laryngeal endoscopy. Our findings also shed light on the inter-rater reliability among human annotators and neural network models. Notably, while neural networks generally perform well, they are not without issues. One possible reason for discrepancies in neural network performance could be the inherent variability in human annotations used for ground truth data. This variability might affect the learning outcomes of the models, potentially leading to inconsistencies in segmentation quality predictions.

This performance edge of the EfficientNet underscores its potential to supplement human judgment in clinical settings. The close results among different models, such as CustomUnet, ResNet50, and EfficientNetB0, compared to human raters, indicate the critical role of high-quality, consistent annotations in training robust models. As we continue to refine our techniques and expand our model comparisons, our goal is to enhance the reliability of both human and automated assessments. This effort will not only improve the accuracy of diagnostics but will also support the broader integration of AI tools in healthcare and effective patient care through advanced technology.

Acknowledgments

We thank Marion Dörrich, Patrick Schlegel, and Hernan Aguilera for technical support and data annotations.

References

1. Minaee S, Boykov Y, Porikli F, Plaza A, Kehtarnavaz N, Terzopoulos D. Image segmentation using deep learning: a survey. IEEE Trans Pattern Anal Mach Intell. 2022;44(7):3523–42. pmid:33596172
2. Deliyski DD, Petrushev PP, Bonilha HS, Gerlach TT, Martin-Harris B, Hillman RE. Clinical implementation of laryngeal high-speed videoendoscopy: challenges and evolution. Folia Phoniatr Logop. 2008;60(1):33–44. pmid:18057909
3. Zacharias SRC, Deliyski DD, Gerlach TT. Utility of laryngeal high-speed videoendoscopy in clinical voice assessment. J Voice. 2018;32(2):216–20. pmid:28596101
4. Andrade-Miranda G, Stylianou Y, Deliyski DD, Godino-Llorente JI, Henrich Bernardoni N. Laryngeal image processing of vocal folds motion. Appl Sci. 2020;10(5):1556.
5. Gloger O, Lehnert B, Schrade A, Völzke H. Fully automated glottis segmentation in endoscopic videos using local color and shape features of glottal regions. IEEE Trans Biomed Eng. 2015;62(3):795–806. pmid:25350912
6. Fehling MK, Grosch F, Schuster ME, Schick B, Lohscheller J. Fully automatic segmentation of glottis and vocal folds in endoscopic laryngeal high-speed videos using a deep convolutional LSTM network. PLoS One. 2020;15(2):e0227791. pmid:32040514
7. Gómez P, Kist AM, Schlegel P, Berry DA, Chhetri DK, Dürr S, et al. BAGLS, a multihospital benchmark for automatic glottis segmentation. Sci Data. 2020;7(1):186. pmid:32561845
8. Kist AM, Gómez P, Dubrovskiy D, Schlegel P, Kunduk M, Echternach M, et al. A deep learning enhanced novel software tool for laryngeal dynamics analysis. J Speech Lang Hear Res. 2021;64(6):1889–903. pmid:34000199
9. Kist AM, Dollinger M. Efficient biomedical image segmentation on EdgeTPUs at point of care. IEEE Access. 2020;8:139356–66.
10. Taha AA, Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med Imaging. 2015;15(1):1–28.
11. Robinson R, Oktay O, Bai W, Valindria VV, Sanghvi MM, Aung N. Real-time prediction of segmentation quality. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part IV. 2018. p. 578–85.
12. Valindria VV, Lavdas I, Bai W, Kamnitsas K, Aboagye EO, Rockall AG, et al. Reverse classification accuracy: predicting segmentation performance in the absence of ground truth. IEEE Trans Med Imaging. 2017;36(8):1597–606. pmid:28436849
13. Maryn Y, Verguts M, Demarsin H, van Dinther J, Gómez P, Schlegel P, et al. Intersegmenter variability in high-speed laryngoscopy-based glottal area waveform measures. Laryngoscope. 2020;130(11):E654–61. pmid:31840827
14. Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T. MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint. 2017. https://arxiv.org/abs/1704.04861
15. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C. MobileNetV2: inverted residuals and linear bottlenecks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018. p. 4510–20. https://doi.org/10.1109/cvpr.2018.00474
16. Howard A, Sandler M, Chen B, Wang W, Chen L-C, Tan M, et al. Searching for MobileNetV3. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019. p. 1314–24. https://doi.org/10.1109/iccv.2019.00140
17. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. p. 770–8.
18. Tan M, Le Q. EfficientNet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning. 2019. p. 6105–14.
19. Yang F, Zamzmi G, Angara S, Rajaraman S, Aquilina A, Xue Z, et al. Assessing inter-annotator agreement for medical image segmentation. IEEE Access. 2023;11:21300–12. pmid:37008654
20. Yakubovskiy P. Segmentation models. 2019. https://github.com/qubvel/segmentation_models
21. Dadras AA, Aichinger P. Deep learning-based detection of glottis segmentation failures. Bioengineering (Basel). 2024;11(5):443. pmid:38790311
22. Reinke A, Tizabi MD, Baumgartner M, Eisenmann M, Heckmann-Nötzel D, Kavur AE, et al. Understanding metric-related pitfalls in image analysis validation. Nat Methods. 2024;21(2):182–94. pmid:38347140
23. Montalbo FJP. S3AR U-Net: a separable squeezed similarity attention-gated residual U-Net for glottis segmentation. Biomed Signal Process Control. 2024;92:106047.
24. Kruse E, Döllinger M, Schützenberger A, Kist AM. GlottisNetV2: temporal glottal midline detection using deep convolutional neural networks. IEEE J Transl Eng Health Med. 2023;11:137–44. pmid:36816097

