
Table 1.

Benchmarking endoscopic image classification techniques: A comparative analysis.


Fig 1.

The main steps of the proposed scenarios for anatomical landmark detection from endoscopic video frames.


Fig 2.

Data description of anatomical landmarks and tasks derived from endoscopic video frames.


Table 2.

Architecture and parameters of autoencoder models for colorization task.

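Table 2 gives the authors' exact layer specification. For orientation only, the general pattern of a colorization autoencoder, with a grayscale frame as input and the original RGB frame as the reconstruction target, looks roughly like this minimal PyTorch sketch (channel sizes here are illustrative, not those of Table 2):

```python
import torch
import torch.nn as nn

class ColorizationAE(nn.Module):
    """Toy encoder-decoder for the colorization pretext task: the input is a
    1-channel grayscale frame, the target is the 3-channel original.
    Channel sizes are illustrative, not the paper's (see Table 2)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ColorizationAE()
gray = torch.rand(8, 1, 224, 224)        # batch of grayscale frames
rgb_target = torch.rand(8, 3, 224, 224)  # original color frames
loss = nn.functional.mse_loss(model(gray), rgb_target)
```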

Table 3.

Architecture and parameters of autoencoder models for patch prediction task.

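Table 3 lists the authors' architecture. The distinguishing step in patch prediction is the input corruption: a region of the frame is masked out and the network must reconstruct it. A minimal sketch of that step, assuming a central square patch of hypothetical size; an encoder-decoder like the one above then regresses the missing pixels:

```python
import torch

def mask_center_patch(frames: torch.Tensor, patch: int = 64) -> torch.Tensor:
    """Zero out a central square so the network must predict it.
    `frames`: (N, C, H, W); the patch size is illustrative (see Table 3)."""
    corrupted = frames.clone()
    _, _, h, w = frames.shape
    top, left = (h - patch) // 2, (w - patch) // 2
    corrupted[:, :, top:top + patch, left:left + patch] = 0.0
    return corrupted

frames = torch.rand(8, 3, 224, 224)
inputs = mask_center_patch(frames)  # network input with missing region
targets = frames                    # reconstruction target
```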

Table 4.

Architecture and parameters of autoencoder models for jigsaw puzzle task.

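Table 4 specifies the authors' model. In the jigsaw pretext task, a frame is cut into tiles, the tiles are shuffled by a permutation drawn from a fixed set, and the network predicts which permutation was applied. A sketch of the tile shuffling, assuming an illustrative 2x2 grid and four-permutation set (the paper's grid and permutation count may differ):

```python
import random
import torch

# A small, fixed set of tile orderings; the model classifies which one was
# applied. The 2x2 grid and the permutation set here are illustrative.
PERMUTATIONS = [(0, 1, 2, 3), (1, 0, 3, 2), (2, 3, 0, 1), (3, 2, 1, 0)]

def make_jigsaw(frame: torch.Tensor):
    """Split a (C, H, W) frame into 4 tiles, shuffle them, return image + label."""
    c, h, w = frame.shape
    hh, ww = h // 2, w // 2
    tiles = [frame[:, i * hh:(i + 1) * hh, j * ww:(j + 1) * ww]
             for i in range(2) for j in range(2)]
    label = random.randrange(len(PERMUTATIONS))
    order = PERMUTATIONS[label]
    rows = [torch.cat([tiles[order[0]], tiles[order[1]]], dim=2),
            torch.cat([tiles[order[2]], tiles[order[3]]], dim=2)]
    return torch.cat(rows, dim=1), label  # shuffled frame, permutation id
```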

Table 5.

Architectural overview of combined autoencoder models for different pretext tasks.

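Table 5 describes how the autoencoders are combined. One plausible wiring, sketched below under the assumption that a single corrupted grayscale frame feeds a shared encoder with one head per pretext task (here colorization plus jigsaw, as in CI-JigPuzz); the heads, channel sizes, and equal loss weighting are illustrative, not taken from the table:

```python
import torch
import torch.nn as nn

class MultiTaskAE(nn.Module):
    """Shared encoder with one head per pretext task (assumed wiring)."""
    def __init__(self, n_perms: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.color_head = nn.Sequential(   # reconstructs RGB (colorization)
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )
        self.jigsaw_head = nn.Sequential(  # classifies the permutation id
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, n_perms),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.color_head(z), self.jigsaw_head(z)

model = MultiTaskAE()
rgb_out, perm_logits = model(torch.rand(2, 1, 224, 224))
# total loss = reconstruction MSE + permutation cross-entropy (equal weights assumed)
```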

Fig 3.

Architecture of the ResNet50-based self-supervised learning framework for anatomical landmark classification in endoscopic images.

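Fig 3 shows the downstream stage: a ResNet50 backbone, initialized from the pretext training, is fine-tuned to classify anatomical landmarks. A minimal torchvision sketch of that stage; the three-class head and the `pretext_state` checkpoint name are assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Downstream classifier as in Fig 3: fine-tune a ResNet50 on landmark labels.
model = resnet50(weights=None)
num_classes = 3  # e.g. Z-line plus two other landmark classes (assumed)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Transferring the encoder weights learned on the pretext tasks is assumed;
# `pretext_state` is a hypothetical checkpoint.
# model.load_state_dict(pretext_state, strict=False)

logits = model(torch.rand(4, 3, 224, 224))  # (4, 3) class scores
```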

Table 6.

The performance measures of the proposed scenarios for anatomical landmark identification from endoscopic video frames.


Fig 4.

Training and test accuracy per epoch for different pretext task combinations.

Subplot (a) shows Scenario 1 (CI-PPred): training accuracy increases steadily to near 1.0 while test accuracy fluctuates slightly. Subplot (b) depicts Scenario 2 (CI-JigPuzz): training and test accuracy trend consistently upward, with test accuracy stabilizing around 0.95. Subplot (c) illustrates Scenario 3 (PPred-JigPuzz): training accuracy rises quickly and test accuracy improves steadily. Subplot (d) represents Scenario 4 (CI-PPred-JigPuzz): training accuracy reaches near 1.0 and test accuracy improves gradually, stabilizing around 0.95.


Fig 5.

Training and test loss per epoch for different pretext task combinations.

Subplot (a) shows Scenario 1: training loss decreases rapidly and stabilizes, with test loss following a similar trend. Subplot (b) depicts Scenario 2: both losses drop sharply at first and then decline gradually, with training loss generally lower than test loss. Subplot (c) illustrates Scenario 3: training loss decreases quickly and stabilizes, while test loss follows the same pattern but remains slightly higher. Subplot (d) represents Scenario 4: training loss decreases rapidly and stabilizes, and test loss follows a similar pattern from higher initial values.


Fig 6.

Confusion matrices for each pretext task combination.

Each matrix is color-coded with a gradient from light to dark blue indicating frequency. Matrix (a), Scenario 1, shows high accuracy for the true Z-line and esophageal categories. Matrix (b), Scenario 2, exhibits the highest accuracy among the scenarios. Matrix (c), Scenario 3, shows high accuracy with minor misclassifications. Matrix (d), Scenario 4, shows lower accuracy, with more misclassifications than the other scenarios.

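Matrices of this kind are typically produced with scikit-learn; a minimal sketch using a blue colormap as in the figure. The label names other than Z-line are placeholders, not the paper's class names:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

labels = ["z-line", "landmark-2", "landmark-3"]  # placeholder class names
y_true = ["z-line", "z-line", "landmark-2", "landmark-3", "landmark-2"]
y_pred = ["z-line", "landmark-2", "landmark-2", "landmark-3", "landmark-2"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
ConfusionMatrixDisplay(cm, display_labels=labels).plot(cmap="Blues")
plt.show()
```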

Fig 7.

ROC curves for each pretext task combination, illustrating model performance across classes.

Subplot (a), Scenario 1, indicates high performance, with per-class AUCs of 0.99, 0.96, and 0.95. Subplot (b), Scenario 2, performs best, with AUCs of 1.00, 0.97, and 0.97. Subplot (c), Scenario 3, shows high performance, with AUCs of 0.99, 0.97, and 0.96. Subplot (d), Scenario 4, indicates lower performance, with AUCs of 0.99, 0.95, and 0.94.

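The per-class curves follow the standard one-vs-rest construction: binarize the labels, then compute an ROC curve and AUC per class from the model's class scores. A minimal scikit-learn sketch with stand-in scores:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

n_classes = 3
y_true = label_binarize([0, 1, 2, 1, 0, 2], classes=[0, 1, 2])
y_score = np.random.rand(6, n_classes)  # stand-in for the model's softmax output

for c in range(n_classes):
    fpr, tpr, _ = roc_curve(y_true[:, c], y_score[:, c])
    print(f"class {c}: AUC = {auc(fpr, tpr):.2f}")
```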

Table 7.

The performance measures of each pretext task in isolation.


Fig 8.

Training accuracy and loss per epoch for individual pretext tasks (colorization, jigsaw puzzle, patch prediction).


Fig 9.

Comparison of classification performance for individual pretext tasks using confusion matrices and ROC curves.


Table 8.

Processing time for each pretext task.


Table 9.

The performance comparison of attention-based and transformer-based models.

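Table 9 compares attention-based and transformer-based enhancements of the multi-task model. One plausible form of the transformer variant, sketched under the assumption that the CNN feature map is flattened into tokens and passed through a standard encoder layer (the tokenization and layer count are not taken from the paper):

```python
import torch
import torch.nn as nn

feats = torch.rand(4, 2048, 7, 7)          # e.g. ResNet50 feature maps
tokens = feats.flatten(2).transpose(1, 2)  # (4, 49, 2048): one token per cell

encoder_layer = nn.TransformerEncoderLayer(d_model=2048, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)

out = encoder(tokens).mean(dim=1)          # pool tokens -> (4, 2048)
logits = nn.Linear(2048, 3)(out)           # 3 landmark classes (assumed)
```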

Fig 10.

Grad-CAM visualization of model predictions highlighting clinically relevant features in endoscopic images.

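Grad-CAM maps of this kind are computed from the last convolutional block: hooks capture its activations and gradients, the gradients are globally average-pooled into channel weights, and the weighted activation sum (after ReLU) is upsampled to the input size. A generic PyTorch sketch, not the authors' code:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights=None).eval()
acts, grads = {}, {}
layer = model.layer4  # last conv block
layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.rand(1, 3, 224, 224)
score = model(x)[0].max()  # score of the top class
score.backward()

weights = grads["g"].mean(dim=(2, 3), keepdim=True)            # GAP over gradients
cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))   # weighted sum
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear")    # upsample to input
```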

Fig 11.

Training dynamics of multi-task self-supervised models: comparative analysis of architectural enhancements (Transformer vs. Attention) for endoscopic image classification.


Fig 12.

Comparison of classification performance across three models using confusion matrices and ROC curves.

The figure illustrates accuracy and loss trends across training epochs for three model variations. (a) (CI-JigPuzz) + Transformer converges rapidly, starting at near-perfect accuracy (1.00) and maintaining high performance (≥0.86) throughout training, while its loss decreases steadily from 0.7 to 0.0, reflecting stable optimization. (b) (CI-JigPuzz) + Attention starts at 0.95 accuracy but declines to 0.80 over the epochs; its loss curve is more volatile (1.4 to 0.0), indicating slower convergence than the Transformer-based model. (c) The best scenario (CI-JigPuzz) achieves the best balance, with consistently high accuracy (1.000 to 0.875) and a smoothly decreasing loss (0.7 to 0.0), demonstrating effective learning and generalization.


Table 10.

Performance comparison of the proposed multi-task self-supervised model and SimCLR (Macro-Averaged Metrics).

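Macro averaging, as used in Table 10, is the unweighted mean of the per-class scores, so each class contributes equally regardless of its frequency. A short scikit-learn sketch with toy labels:

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 1, 2, 2, 1]
y_pred = [0, 1, 1, 2, 2, 0]
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"macro precision={p:.2f}, recall={r:.2f}, F1={f1:.2f}")
```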

Fig 13.

SimCLR baseline performance in endoscopic image classification.

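SimCLR trains its encoder by pulling two augmented views of the same frame together under the NT-Xent contrastive loss. A self-contained sketch of the standard loss, given here only to characterize the baseline; the temperature value is a common default, not necessarily the paper's setting:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """NT-Xent loss for two batches of projections z1, z2 of shape (N, D),
    where row i of z1 and row i of z2 are views of the same image."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)  # (2N, D), unit norm
    sim = z @ z.t() / tau                        # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))            # exclude self-pairs
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)         # positive: the other view

loss = nt_xent(torch.rand(8, 128), torch.rand(8, 128))
```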

Fig 14.

SHAP heatmaps.

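Heatmaps like these can be produced with the shap library's gradient explainer over a small background set; the sketch below shows a generic usage pattern, not the authors' exact pipeline:

```python
import torch
import shap
from torchvision.models import resnet50

model = resnet50(weights=None).eval()
background = torch.rand(10, 3, 224, 224)  # reference frames for the explainer
test_frames = torch.rand(2, 3, 224, 224)  # frames to explain

explainer = shap.GradientExplainer(model, background)
shap_values = explainer.shap_values(test_frames)  # per-class attribution maps
```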