Fig 1.
Traditional DR classification (top) relies on CNN-based models with extensive preprocessing of raw data [15,16].
In contrast, the proposed strategy (bottom) uses pre-trained models for feature extraction (FE). The generated feature vectors (FVs) are refined using a Graph Convolutional Network (GCN) and subsequently used for classification, quality assessment (QA), and uncertainty estimation (UE).
Table 1.
Related work on diabetic retinopathy (DR): Contributions, datasets, and gaps.
Fig 2.
Model architecture: The dataset undergoes basic preprocessing (e.g., resizing, transformations, rotations) to prepare the data.
The FE function processes each sample to generate a feature vector (FV). This FV is then reduced to a vector using Global Average Pooling (GAP). A graph is constructed with nodes corresponding to FVs, and the edge distance is computed from the spatial distance and the semantic distance between nodes.
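The pooling and graph-construction steps in Fig 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the caption does not define the two distances, so the sketch assumes Euclidean distance for the spatial term, cosine distance for the semantic term, an equal-weight mix `alpha`, and a k-nearest-neighbour rule for choosing edges. All function and parameter names are hypothetical.

```python
import math


def global_average_pool(feature_map):
    """Collapse a C x H x W feature map (nested lists) into a length-C vector."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def euclidean(u, v):
    """Assumed 'spatial' distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """Assumed 'semantic' distance: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def build_knn_edges(vectors, k=2, alpha=0.5):
    """Connect each node to its k nearest neighbours under the mixed distance.

    Returns a list of directed edges (source, target, distance).
    """
    edges = []
    for i, u in enumerate(vectors):
        scored = sorted(
            (alpha * euclidean(u, v) + (1 - alpha) * cosine_distance(u, v), j)
            for j, v in enumerate(vectors)
            if j != i
        )
        edges.extend((i, j, d) for d, j in scored[:k])
    return edges
```

The resulting edge list could then be converted into an adjacency matrix for a GCN layer; how the paper weights or thresholds edges is not stated in the caption.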
Table 2.
All datasets are categorized into five distinct groups based on the severity levels of DR present in the fundus images.
Here, “0” stands for No DR, “1” stands for Mild NPDR, “2” stands for Moderate NPDR, “3” stands for Severe NPDR, and “4” stands for PDR.
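As a small illustration, the five-grade labeling in Table 2 can be written as a lookup table. The names `DR_GRADES` and `grade_name` are hypothetical and chosen only for this sketch.

```python
# Severity grades as listed in the Table 2 caption.
DR_GRADES = {
    0: "No DR",
    1: "Mild NPDR",
    2: "Moderate NPDR",
    3: "Severe NPDR",
    4: "PDR",
}


def grade_name(label):
    """Return the severity name for an integer grade, rejecting unknown labels."""
    if label not in DR_GRADES:
        raise ValueError(f"unknown DR grade: {label!r}")
    return DR_GRADES[label]
```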
Table 3.
Hyperparameter values used in our experiments.
Table 4.
Model-specific training parameters.
The number of parameters refers to trainable parameters in the backbone. Single Image Inference Time (SIIT) is measured in milliseconds (ms), and Training Time Per Epoch (TTPE) is measured in seconds (s). These properties are independent of the dataset.
Fig 3.
Comparison of normalized confusion matrices for multiclass DR classification on the APTOS and Messidor-2 datasets using different preprocessing methods: our pipeline with no preprocessing (left), CLAHE preprocessing (middle), and Ben-Graham preprocessing (right).
The MobileViT model on the APTOS2019 dataset shows excellent performance, with minimal misclassification and a high average precision (AP = 1.00). DenseNet169 on the Messidor-2 dataset achieves high accuracy.
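For readers unfamiliar with normalized confusion matrices, the sketch below shows one common way to compute them. It assumes row normalization (each row divided by the number of true samples in that class), which the caption does not specify; the function name is hypothetical.

```python
def normalized_confusion_matrix(y_true, y_pred, num_classes=5):
    """Row-normalized confusion matrix: entry [t][p] is the fraction of
    class-t samples that were predicted as class p."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1

    normalized = []
    for row in counts:
        total = sum(row)
        # Classes with no true samples get an all-zero row.
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized
```

Under this convention the diagonal entries are the per-class recalls, so a near-identity matrix corresponds to minimal misclassification.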
Fig 4.
Precision-recall curves for multiclass DR classification on the APTOS and Messidor-2 datasets using three different preprocessing methods: our pipeline with no preprocessing (left), CLAHE preprocessing (middle), and Ben-Graham preprocessing (right).
The MobileViT model on the APTOS2019 dataset shows excellent performance, with minimal misclassification and a high average precision (AP = 1.00). DenseNet169 on the Messidor-2 dataset achieves high accuracy.
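The AP value quoted above summarizes a precision-recall curve as a single number. A minimal sketch of the computation for one class follows, assuming a one-vs-rest setup (labels are 1 for the class of interest, 0 otherwise) and distinct scores; the function name is hypothetical, and the paper's exact averaging scheme across classes is not stated in the caption.

```python
def average_precision(labels, scores):
    """Average precision for binary labels (1 = positive) and real-valued scores.

    Samples are ranked by descending score; AP is the mean of the precision
    values measured at the rank of each positive sample.
    """
    num_pos = sum(labels)
    if num_pos == 0:
        raise ValueError("average precision is undefined without positives")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / num_pos
```

An AP of 1.00 means every positive sample is ranked above every negative one for that class.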
Fig 5.
Grad-CAM heatmaps were generated for retinal fundus images in the DR classification task.
Each row presents the original image, its corresponding Grad-CAM heatmap, and the model’s prediction. The red regions in the heatmaps indicate areas that strongly influence the classification. (a) shows our approach without any sophisticated preprocessing techniques, which performs significantly better than the other two: (b) with CLAHE preprocessing and (c) with Ben-Graham preprocessing.
Table 5.
Comparison with SOTA: After training the models using three different approaches, namely (1) applying CLAHE, (2) applying Ben Graham’s preprocessing technique (♦), and (3) our proposed pipeline without sophisticated preprocessing, we evaluated them on the actual test sets of the APTOS2019 and Messidor-2 datasets.
Our approach outperformed all existing benchmarks for DR classification on both datasets, whether using CNN or Transformer architectures.
Table 6.
External validation on EyePACS: Performance of the two best backbones from our pipeline, DenseNet-169 (initially optimized on Messidor-2) and MobileViT (initially optimized on APTOS2019), after a brief fine-tuning stage on the EyePACS dataset.
Fig 6.
(a) Training strategies were applied to OgD WC, OgD RS, and OgD, with the proposed method evaluated on APTOS. Our approach (black bar) consistently achieves the highest accuracy, F1-score, AUROC, AUPR, and Kappa, demonstrating superior classification performance and agreement.
(b) In comparison, our proposed approach outperforms SGD across all metrics, highlighting its better generalization and robustness.
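Of the metrics listed in Fig 6, Kappa is the least familiar, so a short sketch may help. The caption does not state which variant is reported (for example, unweighted or quadratically weighted); the code below assumes plain unweighted Cohen's kappa, and the function name is hypothetical.

```python
def cohen_kappa(y_true, y_pred, num_classes=5):
    """Unweighted Cohen's kappa: agreement between labels and predictions,
    corrected for the agreement expected by chance."""
    n = len(y_true)
    if n == 0:
        raise ValueError("kappa is undefined for empty inputs")

    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    expected = sum(
        (y_true.count(c) / n) * (y_pred.count(c) / n) for c in range(num_classes)
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance.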