Predicting semantic segmentation quality in laryngeal endoscopy images

doi:10.1371/journal.pone.0314573

Fig 1.

Artifact generation process.

A: The BAGLS dataset contains 55,750 paired training samples, each consisting of an endoscopy image (left) and its glottal area segmentation (right). B: The ground-truth glottal area segmentation is passed to the artifact generator, which applies four steps that introduce uniform mask scaling, border fuzziness, and small and large segmentation artifacts. Each step is applied at random, and its step-dependent hyperparameters are also chosen at random. The IoU score between the resulting segmentation mask and the ground-truth segmentation is then computed, and the mask together with its IoU score is used to train downstream deep neural networks.
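The pairing of a degraded mask with its IoU score can be illustrated with a minimal NumPy sketch. It is not the authors' artifact generator: `dilate` and `add_artifacts` are hypothetical helpers, only two crude steps (mask growth and a small false-positive patch) stand in for the four steps in the caption, and the probability `p` and all size ranges are assumptions chosen for illustration.

```python
import numpy as np

def iou(pred, truth):
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return float(np.logical_and(pred, truth).sum() / union)

def dilate(mask):
    """Grow a binary mask by one pixel (4-neighbourhood), NumPy only."""
    padded = np.pad(mask, 1)  # pads with False
    return (padded[1:-1, 1:-1] | padded[:-2, 1:-1] | padded[2:, 1:-1]
            | padded[1:-1, :-2] | padded[1:-1, 2:])

def add_artifacts(truth, rng, p=0.5):
    """Hypothetical stand-in: each step is applied with probability p."""
    mask = np.asarray(truth, dtype=bool).copy()
    if rng.random() < p:
        # crude mask growth; the number of growth steps is chosen at random
        for _ in range(int(rng.integers(1, 4))):
            mask = dilate(mask)
    if rng.random() < p:
        # small square false-positive patch at a random position
        size = int(rng.integers(2, 6))
        r = int(rng.integers(0, mask.shape[0] - size))
        c = int(rng.integers(0, mask.shape[1] - size))
        mask[r:r + size, c:c + size] = True
    return mask

rng = np.random.default_rng(0)
truth = np.zeros((64, 64), dtype=bool)
truth[20:40, 28:36] = True            # toy "glottal area"
degraded = add_artifacts(truth, rng)  # mask with synthetic artifacts
score = iou(degraded, truth)          # regression target paired with `degraded`
```

Each `(degraded, score)` pair plays the role of one training example for a network that regresses the IoU from the mask.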

Fig 2.

Inter- and intra-rater reliability shows non-perfect agreement.

(A) Task overview. A subset of 100 random frames from the BAGLS dataset was annotated three times, in random order, by trained raters. (B) Inter-rater reliability among the six raters. (C) Inter-rater reliability among the neural network models described in [7]. (D) Intra-rater reliability for each of the six raters across the three rounds. (E) Histogram of intra-rater reliability across the raters. (F) Relationship between inter-rater reliability and the number of pixels in the segmented area.

Fig 3.

Segmentation quality prediction using neural networks.

(A) Combinations of endoscopic images and their corresponding segmentations that are passed through the neural network architecture. (B, C) RMSE for different network backbones, pre-trained on ImageNet (B) and trained from scratch with random initialization (C). (D) Comparison of RMSE between the different neural networks and the average annotation mask of the six human raters.
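For reference, the RMSE reported in panels B to D measures how far predicted IoU scores deviate from reference IoU scores. A minimal sketch follows; the function name `rmse` and the three example values are illustrative assumptions, not results from the paper.

```python
import numpy as np

def rmse(predicted, reference):
    """Root-mean-square error between predicted and reference IoU scores."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if predicted.shape != reference.shape:
        raise ValueError("predicted and reference must have the same shape")
    return float(np.sqrt(np.mean((predicted - reference) ** 2)))

# Made-up scores for three frames, for illustration only
predicted_iou = [0.9, 0.5, 0.7]
reference_iou = [0.8, 0.6, 0.7]
error = rmse(predicted_iou, reference_iou)
```

A lower value means the quality-prediction network tracks the true IoU more closely; an RMSE of 0 would mean every prediction matches its reference exactly.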

Fig 4.

Traffic light system.

(A) Inference on independent test videos: for each frame, the glottal area is segmented by three different segmentation networks. The predicted masks and the original input frame are used as input (format “see”, as demonstrated in Fig 3b/3c) to the trained MobileNetV2, which predicts the IoU score. A traffic light scheme is then applied to the predicted scores. (B) Traffic light bars for two example videos: the bars show the predicted color for each segmentation network. For each video, three example frames and their corresponding IoU predictions are shown.
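The mapping from a predicted IoU score to a traffic light color can be sketched as a simple thresholding step. The caption does not state the thresholds, so the cut-offs `red_below=0.5` and `green_from=0.8`, the function name `traffic_light`, the network names, and the scores below are all assumptions for illustration.

```python
def traffic_light(iou_score, red_below=0.5, green_from=0.8):
    """Map a predicted IoU score to a color (thresholds are assumed, not from the paper)."""
    if not 0.0 <= iou_score <= 1.0:
        raise ValueError("IoU score must lie in [0, 1]")
    if iou_score >= green_from:
        return "green"
    if iou_score >= red_below:
        return "yellow"
    return "red"

# Hypothetical per-frame IoU predictions for three segmentation networks
predicted = {
    "network_a": [0.91, 0.62, 0.30],
    "network_b": [0.85, 0.81, 0.79],
    "network_c": [0.45, 0.50, 0.95],
}

# One traffic light bar (a list of colors, one per frame) per network
bars = {name: [traffic_light(s) for s in scores]
        for name, scores in predicted.items()}
```

Reading one bar from left to right then shows, frame by frame, where a given network's segmentation is predicted to be reliable.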
