
Glaucoma detection and staging from visual field images using machine learning techniques

  • Nahida Akter,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Writing – original draft, Writing – review & editing

    Affiliation School of Optometry and Vision Science, UNSW Sydney, Sydney, New South Wales, Australia

  • Jack Gordon,

    Roles Data curation, Resources, Validation

    Affiliation School of Optometry and Vision Science, UNSW Sydney, Sydney, New South Wales, Australia

  • Sherry Li,

    Roles Data curation, Investigation, Methodology, Resources, Validation, Writing – original draft

    Affiliation School of Optometry and Vision Science, UNSW Sydney, Sydney, New South Wales, Australia

  • Mikki Poon,

    Roles Data curation, Investigation, Methodology, Resources, Validation, Writing – original draft

    Affiliation School of Optometry and Vision Science, UNSW Sydney, Sydney, New South Wales, Australia

  • Stuart Perry,

    Roles Methodology, Supervision, Validation, Visualization, Writing – review & editing

    Affiliation School of Electrical and Data Engineering, University of Technology Sydney, Sydney, New South Wales, Australia

  • John Fletcher,

    Roles Methodology, Supervision, Validation, Visualization, Writing – review & editing

    Affiliation School of Electrical Engineering and Telecommunications, UNSW Sydney, Sydney, New South Wales, Australia

  • Thomas Chan,

    Roles Resources, Writing – review & editing

    Affiliations Faculty of Medicine & Health, University of Sydney, Sydney, New South Wales, Australia, PersonalEYES, Sydney, New South Wales, Australia

  • Andrew White,

    Roles Writing – review & editing

    Affiliations Faculty of Medicine & Health, University of Sydney, Sydney, New South Wales, Australia, PersonalEYES, Sydney, New South Wales, Australia, Department of Ophthalmology, Westmead Hospital, Sydney, New South Wales, Australia

  • Maitreyee Roy

    Roles Conceptualization, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing – review & editing

    maitreyee.roy@unsw.edu.au

    Affiliation School of Optometry and Vision Science, UNSW Sydney, Sydney, New South Wales, Australia

Abstract

Purpose

In this study, we investigated the performance of deep learning (DL) models in differentiating between normal and glaucomatous visual fields (VFs) and in classifying glaucoma from the early to the advanced stage, to determine whether a DL model can stage glaucoma according to the Mills criteria using only the pattern deviation (PD) plots. The DL model results were compared with machine learning (ML) classifiers trained on conventional VF parameters.

Methods

A total of 265 PD plots and 265 numerical datasets of Humphrey 24–2 VF images were collected from 119 normal and 146 glaucomatous eyes to train the DL models to classify the images into four groups: normal, early glaucoma, moderate glaucoma, and advanced glaucoma. Two popular pre-trained DL models, ResNet18 and VGG16, were trained on the PD images using five-fold cross-validation (CV), and their performance was assessed using balanced, pre-augmented data (n = 476 images), the imbalanced original data (n = 265), and feature extraction. The trained images were further investigated using the Grad-CAM visualization technique. In addition, four ML models were trained on the global indices: mean deviation (MD), pattern standard deviation (PSD), and visual field index (VFI), using five-fold CV to compare their classification performance with the DL models' results.

Results

The DL model ResNet18, trained on the balanced, pre-augmented PD images, achieved high accuracy in classifying the groups, with an overall F1 score of 96.8%, precision of 97.0%, recall of 96.9%, and specificity of 99.0%. The highest F1 score was 87.8% for ResNet18 with the original dataset and 88.7% for VGG16 with feature extraction. The DL models successfully localized the affected VF loss in the PD plots. Among the ML models, the random forest (RF) classifier performed best, with an F1 score of 96%.

Conclusion

The DL model trained on PD plots was promising in differentiating the normal and glaucomatous groups and performed comparably to classifiers using the conventional global indices. Hence, the DL model trained on PD images demonstrated that glaucoma can be staged using only PD plots, in line with the Mills criteria. This automated DL model could assist clinicians in precise glaucoma detection and progression management during large-scale glaucoma screening.

Introduction

Glaucoma is a multi-factorial neurodegenerative eye disease and is considered the second leading cause of blindness worldwide [1]. To date, there is no definitive explanation of what causes glaucoma, but it is possible to detect the disease and track its progression by continually monitoring VF loss and visible damage to the optic nerve head (ONH) of the retina.

Standard automated perimetry (SAP) has been the primary method for glaucoma diagnosis and progression follow-up [2]. The Humphrey Visual Field Analyser (HVFA) 24–2 and 30–2 SITA-Standard testing protocols are commonly used by clinicians to measure the severity of VF loss and to grade glaucoma progression based on the Mills et al. criteria [3]. While it is the gold standard for identifying vision loss, SAP can still be a complicated, time-consuming, and subjective process, as it is highly dependent on the clinician's experience and expertise [4].

The SAP test is a diagnostic method used by clinicians to measure the sensitivity of functional vision in different parts of an individual's VF. For glaucoma, VF tests are essential because clinicians can observe peripheral vision loss before the patient experiences it. By assessing reduced sensitivity or blind spots, they can grade the severity of the VF loss and determine the appropriate management. Glaucomatous VF losses depend on the topology and anatomical distribution of the retinal ganglion cell fibre bundles, which usually follow an arcuate pattern before entering the optic nerve. These fibres respect the horizontal raphe in the temporal retina. As a result, glaucoma gives rise to generalized and localized VF defects, commonly formed as arcuate, altitudinal, nasal step, central, and paracentral scotomas, which respect the horizontal midline. In early glaucoma, the damage may present as partial arcuate, paracentral, and nasal step defects [5]. VF defects can widen in accordance with ONH structural damage [6]. Therefore, earlier diagnosis is crucial to halt severe peripheral vision loss.

Nowadays, clinicians follow the most widely used glaucoma staging system, suggested by Mills et al., which classifies glaucoma from stage 0 (no or minimal VF defect/ocular hypertension) to stage 5 (end-stage). The criteria are based on the MD and PSD scores, PD plot clusters, the glaucoma hemifield test (GHT), and dB plots. Additionally, the staging system can grade glaucoma severity from PD plots alone into four categories: early, moderate, advanced, and severe glaucoma.

The Mills grading system was developed from the Hodapp-Anderson-Parrish (HAP) classification system, which also used VF parameters [7]. However, the HAP system could not classify the full range of glaucoma stages and was unable to separate early and end-stage disease from minimal defects and blind patients [8]. The Mills criteria provide a standardized grading system for clinical practice that considers multiple parameters to classify the full range of glaucoma, in contrast to other grading systems: the Advanced Glaucoma Intervention Study (AGIS) [9], the Collaborative Initial Glaucoma Treatment Study (CIGTS) [10], the Glaucoma Staging System (GSS) [11], and the Enhanced Glaucoma Staging System (GSS2) [12].

Researchers have developed several ML algorithms utilizing VF images to differentiate glaucomatous eyes from normal eyes [8, 13–18]; however, few ML studies have addressed automated glaucoma staging [19–23] using VF images. The outcomes of ML studies are promising for differentiating glaucoma from normal eyes using VF images, and such models are sometimes better than clinicians at recognizing complex patterns [18].

In this study, we aimed to investigate DL model performance in classifying VFs into four categories (normal VF, early glaucoma, moderate glaucoma, and advanced glaucoma) using only PD plots, and to observe whether the DL model can stage glaucoma according to the Mills criteria. The outputs of the DL models were compared with ML models trained on conventional VF global indices for the same classification. This study highlights the potential for DL models to significantly enhance early glaucoma detection and staging. By automating the analysis of visual field images, these models could yield substantial savings in time and resource allocation within clinical settings. Efficient processing of large datasets allows clinicians to focus more on patient care than on manual analysis, which could improve patient outcomes and increase clinic throughput.

Methods

The detailed methodology implemented in this study is illustrated in Fig 1. The workflow is structured into several key stages, as detailed below:

Fig 1. Flowchart of the methodology of glaucoma detection and staging using machine learning techniques.

https://doi.org/10.1371/journal.pone.0316919.g001

1. Data collection

In our retrospective study, a total of 265 reliable VF images were collected: 138 images from the UNSW Optometry Clinic (accessed between September 2020 and August 2022) and 127 images from the Centre for Eye Health (CFEH) (accessed in November 2018) (n = 265; normal eyes = 119, glaucomatous eyes = 146). The 265 PD plots and 265 numerical datasets were extracted from Humphrey 24–2 SITA-Standard VF tests on the HVFA (II-i 745i, Carl Zeiss Meditec, Inc., Dublin, CA); the patients underwent VF testing from April 2011 to July 2022. Perimetry testing measures a patient's response to light stimuli in their VF, generating reliability indices, sensitivity maps, and trend analyses. Sensitivity maps quantify VF sensitivity at specific points (dB plots), compared to age-matched norms (total deviation map) and to the patient's own hill of vision (pattern deviation (PD) plot). The PD plot helps detect the localized visual field defects typical of glaucoma. Trend analysis includes the Visual Field Index (VFI), which shows the percentage of remaining functional VF; Mean Deviation (MD), representing overall VF sensitivity loss compared to norms; and Pattern Standard Deviation (PSD), indicating localized VF variability. Additionally, the Glaucoma Hemifield Test (GHT) compares the superior and inferior VF, aiding glaucoma diagnosis.

The most widely used glaucoma staging system is the one developed by Mills et al., which comprises a 6-stage classification: normal, early, moderate, advanced, severe, and end-stage glaucoma. The severity of glaucoma is diagnosed based on the MD score and the additional criteria of PD plot clusters, PSD, GHT, or dB plots. The staging system also includes decision rules to distinguish between stages, whereby the MD score and an additional criterion must be met to diagnose a particular stage. In our study, expert clinicians used the Mills et al. criteria and graded the glaucoma data into 72 early, 43 moderate, 21 advanced, and 10 severe cases.

Due to the limited number of severe patients, we combined the severe patients with the advanced group. VFs labeled 'Preperimetric' and 'No glaucoma with RNFL losses' were excluded from the glaucoma and normal groups. The inclusion criteria for VF test reliability were fixation losses ≤ 20% and false positive and false negative rates ≤ 15%. To keep the PD plot pattern consistent, 10–2 and 30–2 SITA-Standard VF scans were excluded from the final dataset. The ethics committee of UNSW Sydney approved the data collection. The study followed the tenets of the Declaration of Helsinki, and written informed consent was collected from the patients to use the de-identified clinical data for research purposes. The authors had no access to information that could identify individual participants during or after data collection.

2. Power and sample size estimation

The power and sample size for the two classes (normal and glaucoma) and the four classes (normal, early, moderate, and advanced) were calculated. For two classes, power was estimated using an independent t-test with the following parameters: effect size d = 0.50, α error = 0.05, and group allocation ratio N2/N1 = 1.2. For four classes, power was computed using F-tests (linear multiple regression) with an effect size of 0.15 and α error = 0.05.
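As a rough check of the two-class calculation above, the power of the two-sided independent t-test can be approximated with the standard normal distribution. This stdlib-only sketch uses the usual non-centrality approximation; the study presumably used a dedicated power tool:

```python
import math

def ttest_power(d, n1, n2):
    """Approximate power of a two-sided independent t-test at alpha = 0.05
    via the normal approximation (adequate at these sample sizes)."""
    ncp = d * math.sqrt(n1 * n2 / (n1 + n2))   # non-centrality for Cohen's d
    z_crit = 1.959964                           # two-sided critical z, alpha = 0.05
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal CDF
    return phi(ncp - z_crit) + phi(-ncp - z_crit)

# Study groups: 119 normal vs 146 glaucomatous eyes, effect size d = 0.50
power = ttest_power(0.50, 119, 146)
print(f"estimated power: {power:.3f}")  # exceeds 0.95, consistent with the Results
```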

3. Image pre-processing

To train the DL models, only the PD plots were cropped from the original VF images (Fig 2). The images were resized to 224×224 pixels for the input layer of the DL models. For low-contrast images, the sharpness and contrast were enhanced.
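The crop-resize-enhance step can be sketched with Pillow; the crop box and enhancement factors below are illustrative assumptions, not the study's actual values:

```python
from PIL import Image, ImageEnhance

def preprocess_pd_plot(img, box):
    """Crop the PD-plot region from a VF printout, resize to the 224x224
    input of ResNet18/VGG16, and boost contrast/sharpness for low-contrast
    scans. `box` (left, upper, right, lower) and the factors are illustrative."""
    pd = img.crop(box).resize((224, 224), Image.BILINEAR)
    pd = ImageEnhance.Contrast(pd).enhance(1.3)
    pd = ImageEnhance.Sharpness(pd).enhance(1.5)
    return pd

# Synthetic grayscale "printout" standing in for a real HVFA report image
page = Image.new("L", (800, 1000), color=200)
pd_plot = preprocess_pd_plot(page, box=(100, 300, 500, 700))
print(pd_plot.size)  # (224, 224)
```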

Fig 2. Pattern deviation plots of a) normal b) early glaucoma c) moderate glaucoma d) advanced glaucoma, and (e) severe glaucoma.

https://doi.org/10.1371/journal.pone.0316919.g002

4. Deep learning models

4.1 Model selection and architecture.

Two pre-trained DL models, VGG-16 [24] and ResNet18 [25], were used to evaluate the classification performance in detecting and staging glaucoma from PD plots by classifying the images into four classes: normal, early, moderate, and advanced. ResNet and VGG are two popular CNN models that have shown remarkable progress in medical image analysis for smart medicine [26]. In the literature, these two models are commonly used in VF image prediction for glaucoma detection and progression grading [18, 19, 27–31]. In this study, the VGG16 architecture comprises 41 layers, predominantly convolutional layers followed by max-pooling layers. ResNet18 consists of 71 layers, including convolutional layers, batch normalization, ReLU activations, and skip connections. Images of 224×224 pixels were used for the input layers.

Given the small and imbalanced dataset, two combinations of images were used to train the DL models. In the first approach, the images in each group were augmented and balanced to match the largest group (normal = 119). The image count for the early, moderate, and advanced glaucoma groups was increased using three augmentation techniques: horizontal flip, minus 5-degree rotation, and plus 5-degree rotation. The final image dataset therefore contained 476 images, 119 in each group. For the augmentation, we used Fiji ImageJ (https://imagej.net/software/fiji), an open-source software package for biological image processing. Given the small dataset, we used 5-fold CV in every DL approach to assess the generalizability of the DL models on unseen data and to utilize the whole dataset for training.
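The balancing arithmetic can be illustrated in plain Python: each glaucoma group is topped up to the size of the normal group (119) by cycling the three augmentations over its originals, giving the 476-image balanced set. Group counts are from the study (severe merged into advanced: 21 + 10 = 31); the actual augmentation was performed in Fiji ImageJ:

```python
from itertools import cycle

# Group counts from the study (severe merged into advanced)
groups = {"normal": 119, "early": 72, "moderate": 43, "advanced": 31}
transforms = ["hflip", "rot-5deg", "rot+5deg"]  # the three augmentations used
target = max(groups.values())                   # balance to the largest group

def augmented_count(count, target, transforms):
    """Number of augmented copies needed to top a group up to `target`,
    cycling (transform, original-image) pairs so no copy repeats early."""
    pool = cycle((t, i) for t in transforms for i in range(count))
    plan = [next(pool) for _ in range(target - count)]
    return len(plan)

totals = {g: n + augmented_count(n, target, transforms) for g, n in groups.items()}
print(totals, sum(totals.values()))  # every group at 119; 476 images total
```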

4.2 Training with an augmented balanced dataset.

In the first approach, the two pre-trained models were trained using the pre-augmented balanced dataset. In the CV, for each fold, one of the subsets was used as the test set and the remaining subsets were combined to form the training set. Hence, all the data were eventually used for training, and the evaluation scores were averaged across the 5 test subsets. Although image augmentation was performed in the pre-processing step, during model training the MATLAB 'imageDataAugmenter' function was used for further data augmentation with three arguments ('RandXTranslation', 'RandYTranslation', and 'RandXShear') to reduce overfitting. The Stochastic Gradient Descent with Momentum (SGDM) optimizer (an iterative optimization algorithm) with an initial learning rate of 0.0005 for ResNet18 and 0.00001 for VGG16 was used to train the models. The mini-batch size was set to 10, and a maximum of 20 epochs was selected by monitoring the loss and accuracy on the training and validation datasets.
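The fold rotation described above can be sketched as a plain index splitter (the study used MATLAB; this is a language-agnostic illustration of the 5-fold logic, not the study's code):

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for k-fold CV: each fold serves once
    as the held-out test set, so every image is used for both training and
    evaluation across the k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

# The 476 pre-augmented images split into 5 folds
fold_sizes = [(len(tr), len(te)) for tr, te in kfold_indices(476)]
print(fold_sizes)
```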

4.3 Training with imbalanced dataset.

Second, the two models were trained using the imbalanced image groups (no pre-augmentation). As before, image augmentation with the arguments mentioned above was applied during model training to improve the models' generalizability. This temporary datastore augments the images without saving them to disk; the model updates its parameters and discards the augmented images after training. Because of the small number of images in the advanced group, the mini-batch size was set to 5 and the maximum number of epochs to 30. For ResNet18, the learning rate was 0.0005; for VGG16, it was 0.00001.

4.4 Training using feature extraction.

In the final DL method, the two pre-trained models, ResNet18 and VGG16, were used only as image feature extractors, and the extracted features were used to train a multiclass support vector machine (SVM) classifier to classify the images into four groups. For VGG16, features were extracted from the last convolutional layer before flattening, i.e. the pooling layer, which provides rich feature representations suitable for classification tasks; for ResNet18, we extracted features from the last residual block, which captures high-level spatial information while maintaining the depth of the model. The pooling layer pools the input features over spatial locations, providing 512 features for VGG16 and 25088 for ResNet18. The features from the two pre-trained models were fed into the SVM classifier, and the output results were evaluated using the confusion matrix.
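The pipeline reduces to "fixed CNN features in, multiclass SVM out". A minimal scikit-learn sketch, using random vectors with class-dependent mean shifts in place of real CNN features (in the study the feature vectors were 512- or 25088-dimensional, depending on the backbone and layer):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Random vectors with a class-dependent mean shift stand in for the
# fixed CNN features extracted per PD image (hypothetical data).
n_per_class, dim = 30, 512
X = np.vstack([rng.normal(loc=c, size=(n_per_class, dim)) for c in range(4)])
y = np.repeat(np.arange(4), n_per_class)  # normal/early/moderate/advanced

# Multiclass SVM on the extracted features, evaluated with five-fold CV
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=5)
print(scores.mean())
```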

4.5 Grad-CAM visualization.

The DL-trained images were further investigated using the Gradient-weighted Class Activation Mapping (Grad-CAM) technique [32] to visualize whether the models classify the images correctly by identifying the actual extent of VF loss in the PD plot. Grad-CAM generates a heatmap for a specific class by highlighting image features using the gradient of the prediction score from the SoftMax layer for that class with respect to the final convolutional layer of the model [32]. The heatmap is transparently overlaid on the predicted image: deep red marks the peak values with the greatest impact on the decision, and deep blue the lowest.
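The Grad-CAM combination step itself is simple once a framework's autograd supplies the gradients: the channel weights are the spatially averaged gradients, and the heatmap is the ReLU of the weighted sum of feature maps [32]. A NumPy sketch with random stand-in tensors:

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap from the final conv layer's activations and the
    gradient of the class score w.r.t. those activations (both shaped
    [channels, H, W]); a DL framework supplies the gradients via autograd."""
    weights = gradients.mean(axis=(1, 2))              # global-average-pool the grads
    cam = np.tensordot(weights, feature_maps, axes=1)  # weighted sum of maps
    cam = np.maximum(cam, 0)                           # ReLU: keep positive evidence
    return cam / cam.max() if cam.max() > 0 else cam   # normalize to [0, 1]

rng = np.random.default_rng(1)
A = rng.random((512, 7, 7))        # stand-in for a last conv block's output
dYdA = rng.normal(size=(512, 7, 7))
heatmap = grad_cam(A, dYdA)
print(heatmap.shape)  # upsampled onto the 224x224 PD plot for display
```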

5. Machine learning models

The DL model results were compared with four popular ML classifiers trained on the global indices extracted from the VF images: MD, PSD, and VFI. An RF classifier with a maximum depth of 5, multinomial logistic regression, k-nearest neighbours (k-NN, k = 5), and a support vector classifier (SVC) with a linear kernel were trained on the datasets from the 119 normal and 146 glaucomatous eyes. These ML classifiers were selected based on previous literature on numerical multiclass data prediction [18, 33, 34]. The ML classifiers' performance was assessed with five-fold CV. Finally, the result of the best-performing ML model was compared with the DL model result using performance metrics derived from the confusion matrix (CM).
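A scikit-learn sketch of the four classifiers and five-fold CV on the three global indices; the data below are synthetic stand-ins whose labels follow illustrative MD severity bands only, not the study's clinical gradings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 265  # matching the study's sample size

# Synthetic stand-ins for MD, PSD, VFI; real values came from the reports.
md = rng.uniform(-25, 2, n)                                  # mean deviation (dB)
psd = np.clip(2 - 0.4 * md + rng.normal(0, 1, n), 1, None)   # pattern SD
vfi = np.clip(100 + 3 * md + rng.normal(0, 5, n), 0, 100)    # VF index (%)
X = np.column_stack([md, psd, vfi])
y = np.digitize(md, [-12, -6, -2])   # illustrative severity bands, 0=worst

models = {
    "RF": RandomForestClassifier(max_depth=5, random_state=0),
    "LogReg": LogisticRegression(max_iter=1000),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "SVC": SVC(kernel="linear"),
}
results = {name: cross_val_score(m, X, y, cv=5, scoring="f1_weighted").mean()
           for name, m in models.items()}
print(results)
```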

Results

The demographic characteristics of the data are shown in Table 1. Group differences were analyzed using an independent samples t-test and are expressed as mean ± standard deviation (SD). All the extracted features differed significantly between the two groups (p < 0.05). A total of n = 265 records (119 normal and 146 glaucoma) in two groups provides greater than 95% power to detect a statistical difference using an independent t-test for continuous variables. For four groups (n = 265), the power was estimated at 90%, and at least 27 samples are needed per group to identify a statistical difference using linear multiple regression.

The pair representation of the features: MD, PSD and VFI of four groups are shown in Fig 3 using distribution and scatter plots.

Fig 3. The pair scatter and distribution plot of the three global indices, MD, PSD and VFI, in relation to the normal and glaucoma group (early, moderate, and advanced).

https://doi.org/10.1371/journal.pone.0316919.g003

The pair scatter and distribution plots in Fig 3 show that the features overlapped for the early and normal groups. It is also noticeable that the MD and PSD pair differentiates the normal from the early group better than the MD and VFI or PSD and VFI pairs, although with those pairs the early, moderate, and advanced glaucoma groups can still be correctly discriminated.

DL model results

The following metrics were used to estimate the DL and ML model performance and are defined as follows:

Precision = TP / (TP + FP); Recall (sensitivity) = TP / (TP + FN); Specificity = TN / (TN + FP); F1 score = 2 × (Precision × Recall) / (Precision + Recall)

where TP is a true positive, TN is a true negative, FP is a false positive, and FN is a false negative. The F1 score was considered the overall accuracy for multiclass classification.
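These one-vs-rest definitions extend to the four-class case by computing TP, FN, FP, and TN per class from the confusion matrix. A minimal sketch with a toy matrix (not the study's counts):

```python
def per_class_metrics(cm):
    """Per-class precision, recall, specificity and F1 from a multiclass
    confusion matrix cm[actual][predicted], using one-vs-rest counts."""
    k, total = len(cm), sum(map(sum, cm))
    out = {}
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                         # missed members of class c
        fp = sum(cm[r][c] for r in range(k)) - tp    # others predicted as c
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        out[c] = dict(precision=precision, recall=recall,
                      specificity=specificity, f1=f1)
    return out

# Toy 4-class confusion matrix (rows: actual, cols: predicted)
cm = [[115, 4, 0, 0], [0, 72, 0, 0], [0, 2, 40, 1], [0, 0, 1, 30]]
m = per_class_metrics(cm)
print(round(m[0]["recall"], 3))
```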

The confusion matrices of the three DL approaches are shown in Fig 4.

Fig 4. The confusion matrices of the ResNet18 (left) and VGG16 (right) models classifying normal, early, moderate, and advanced glaucoma using (a) the pre-augmented balanced dataset, (b) the original imbalanced dataset, and (c) an SVM classifier trained on features extracted from the original imbalanced dataset.

https://doi.org/10.1371/journal.pone.0316919.g004

In Fig 4(a), the confusion matrices show that both DL models performed well in classifying the four groups using the pre-augmented balanced dataset. The ResNet18 model achieved higher accuracy than the VGG16 model, with an overall F1 score of 96.8%. The other performance metrics (precision, recall, and specificity) are shown in Table 3. Early cases were detected at 100%, and only four normal images were misclassified as early glaucoma. These normal-eye images were further investigated, and it was found that the GHT was borderline for three images, and one eye was myopic with high-risk factors for glaucoma. For VGG16, although the number of false positives for the early group was high (13) and overlapped with the moderate group, only two normal-eye images were detected as glaucomatous. The model showed performance similar to ResNet18 for the advanced and moderate glaucomatous images.

The performance of the DL models degraded when the original (imbalanced) dataset was used in the two other methods (Fig 4(b) and 4(c)). The detection rate for the glaucoma groups was significantly reduced, and the accuracy varied widely between groups due to the imbalanced class sizes. VGG16 performed poorly on this dataset (Fig 4(b), right). However, the two SVM classifiers performed almost identically when trained on features extracted from the two pre-trained models (Fig 4(c)). Notably, the DL models distinguished the normal group from the glaucomatous groups with high accuracy in all three methods. For the original dataset in Fig 4(b), the normal detection rate was 97.5% for ResNet18 and 100% for VGG16, indicating that a DL model trained on PD plots is a promising automated tool for separating glaucoma from normal eyes.

Grad-CAM visualization

The trained DL models were further investigated with Grad-CAM visualization, and a few resultant images are shown in Fig 5.

Fig 5. Generated heatmap for the visual explanation produced by Grad-CAM on pattern deviation plot scans (normal, early, moderate, advanced) from left to right in multi-class classification by (a) ResNet18 (b) VGG16.

The Grad-CAM heatmap demonstrated that the ResNet18 and VGG16 models distinguished normal and glaucoma groups by localizing the affected regions in the visual field.

https://doi.org/10.1371/journal.pone.0316919.g005

Fig 5 shows the heat maps of randomly selected categorical images of normal and glaucomatous eyes from each stage. Both DL models successfully identified the affected VF loss in the PD plots and distinguished normal VF images from glaucomatous images.

The model’s ability to distinguish between normal and glaucomatous visual fields was fundamentally informed by the features identified through Grad-CAM visualization. For normal eyes, the Grad-CAM heatmap displayed a uniform distribution of activation across the entire visual field, indicating that the model recognized the absence of visual field loss and confirmed the integrity of the visual field features. In contrast, for glaucomatous eyes, the Grad-CAM technique highlighted specific regions of visual field loss, with localized activations indicating areas of concern. This differentiation underscores the model’s learning process, where it captured the salient features associated with healthy visual fields and the distinct patterns of loss characteristic of glaucoma. Through the Grad-CAM-generated heatmaps, it becomes evident that the model effectively learned to focus on critical areas within the visual field, enhancing its capacity for accurate classification and providing valuable insights into the underlying pathology of glaucoma. By localizing affected VF loss, Grad-CAM provides clinicians with crucial insights into the areas that may require further evaluation. This transparency is vital in clinical settings, as it helps bridge the gap between automated diagnosis and human expertise, supporting informed decision-making in glaucoma management.

Machine learning results

The RF model performed best among the ML classifiers. The F1 score was 92% for multinomial logistic regression, 92% for the SVC (linear) model, and 90% for k-NN. We also trained the models with and without VFI and found no effect on the classification results. Table 2 shows the performance summary of the RF model.

Table 2. The performance metrics of the random forest model in classifying normal, early, moderate and advanced glaucoma using five-fold cross-validation.

https://doi.org/10.1371/journal.pone.0316919.t002

Tables 2 and 3 show that the best DL model (ResNet18), trained on the balanced dataset of PD plots, and the best ML model (RF), trained on the global indices, performed similarly in classifying the four groups: normal, early, moderate, and advanced. ResNet18 performed marginally better than the RF model, with precision of 96.98%, recall of 96.85%, and an F1 score of 96.8%. The results of the DL model trained on the balanced dataset and of the ML model indicate that the PD plot can detect and stage glaucoma as well as the global indices. Moreover, in both cases, a model trained on VF data can separate glaucoma from healthy eyes and stage glaucoma from the early to the advanced stage based on the Mills et al. criteria. However, with the original dataset, the DL model could not achieve accuracy as high as the ML model's. Therefore, a larger dataset with quantitative and qualitative analysis, including comparative assessment by clinicians, is required to confirm these results; this is planned for a future study.

Table 3. The performance comparison summary of the DL and ML models for classifying normal, early, moderate and advanced glaucoma using five-fold cross-validation.

https://doi.org/10.1371/journal.pone.0316919.t003

The performance metrics of the DL and ML models provide valuable insights into their potential applications in the diagnosis and management of glaucoma. Notably, the high sensitivity and specificity demonstrated by the ResNet18 model highlight its capability as a reliable tool for early glaucoma detection. With an F1 score of 96.8% achieved using a balanced dataset, this model shows promise in accurately distinguishing between normal and glaucomatous eyes. This early detection is crucial, as it allows for timely intervention that can significantly reduce the risk of irreversible vision loss.

Similarly, the RF model exhibited an F1 score of 96.0%, underscoring its effectiveness in classifying glaucoma severity based on global indices. This capability can aid clinicians in better stratifying patients for appropriate treatment plans, ensuring that individuals receive the most suitable care based on their specific disease stage.

Our findings also emphasize the importance of employing balanced datasets during model training. The performance drop observed with the initial imbalanced dataset reinforces the necessity of using diverse and representative data to develop robust diagnostic models. By accurately differentiating between normal and glaucomatous eyes and classifying the stages of glaucoma, both DL and ML models represent significant advancements in the automation of glaucoma screening processes.

Discussion

Our study investigated three different DL approaches to detect and stage glaucoma using PD plots and compared them with ML models trained on the global indices of VF images. The results demonstrated that a single PD image can classify glaucoma as well as the global indices MD, PSD, and VFI. Moreover, the accuracy of the DL model improved when it was trained with a balanced dataset in the multiclass classification.

Several studies have used ML classification algorithms to detect and stage glaucoma using VF images [15, 19, 23, 35]; however, very few used PD plots for glaucoma detection [18, 31, 36–38], and we did not find any studies on glaucoma staging. Recently, Shin et al. used wide-field OCT-based DL classification to categorize images into five groups: normal, pre-perimetric glaucoma, early, moderate, and severe glaucoma. The model achieved high accuracy in separating the normal group from the glaucoma groups but had difficulty staging glaucoma, with 72.69% accuracy [27]. Another study, by Huang et al., used the 30–2 total deviation plots of VF images to train a fine-grained DL model to classify images into five categories: clear VF and mild, moderate, severe, and diffuse VF defect. The model achieved an AUC of 0.93 with an accuracy of 85%. Zhen et al. trained DL models on fundus images to detect glaucoma severity and achieved an accuracy of 85.29%. Although improving the accuracy of glaucoma staging still requires DL model improvements with diverse datasets, a strength of these studies is that their DL models were trained on large training datasets. In contrast, in our study, the DL model with the original dataset achieved an F1 score of 87.8%, and with feature extraction the F1 score was 88.7%, for classifying the images into the normal, early, moderate, and advanced groups. For detecting glaucomatous VF images, it achieved an AUC of 0.99 with an accuracy of 98.1%, and for staging (early, moderate, and advanced), the DL model achieved an F1 score of 85.8%, which is promising for using PD plots for automated glaucoma diagnosis and progression grading.

In this study, we used the Mills et al. grading system as the foundation for our glaucoma grading. Although there is still no clinically accepted gold standard, this is the most popular grading system in clinical practice [39]. The criteria consider the MD, PSD, GHT, and PD plots to classify the full range of glaucoma from early to end-stage disease. We focused on the PD plots as training data because this plot best highlights any localised defects by taking the difference between the observed threshold at each point and the average threshold across the entire VF. This eliminates field defects caused by cataract, giving a more accurate representation of glaucomatous loss [40].

In clinical practice, diagnosing glaucoma involves a combined, multimodal assessment of structural and functional glaucomatous damage via OCT, VFs, and stereoscopic fundus images [41]. While characteristic glaucomatous VF defects appear in some glaucoma patients, atypical patterns can still occur given the variety of VF patterns that glaucoma can present. As a result, interpreting, diagnosing, and staging glaucoma is a complex process. Glaucoma diagnosis and staging involve not only interpretation of the PD maps but also of other features such as the global indices, which typically form part of the criteria in established glaucoma staging systems such as that of Mills et al. However, our study revealed that the DL model could use PD maps alone to detect glaucoma and stage its severity with the same performance as these trend-analysis values. Therefore, the PD map can serve as a novel clinical imaging feature for training artificial neural networks for automated glaucoma detection and classification. Glaucoma diagnosis is a complex, time-consuming process requiring trained and skilled practitioners to interpret the results. Thus, implemented in clinical practice, glaucoma detection and classification from PD images alone may improve the efficiency of clinicians and reduce the patient burden associated with invasive clinical procedures. Moreover, an artificial neural network can be useful in the early detection of glaucoma, prompting early interventions to prevent progressive vision loss and its negative impact on the patient's quality of life.

Our study also demonstrated visual explanation of the DL model's decision-making, with the model correctly localising the affected regions of VF loss in the images. Practitioners can thus rely on DL systems that provide both automated predictions and visual explanations. The DL model can discriminate the patterns that distinguish normal from glaucomatous VF images, as well as the patterns that distinguish glaucoma severity levels. Ultimately, this explainable DL-based glaucoma detection and staging will help eye care professionals and clinicians trust the DL algorithm and rely on its clinical decisions.
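The core of the Grad-CAM technique [32] used for such visual explanations can be sketched framework-free: channel importance weights are obtained by global-average-pooling the gradients of the class score with respect to the convolutional feature maps, and the heatmap is the ReLU of the weighted sum of those maps. This is a minimal illustration of the published method, not the exact pipeline used in the study; the nested-list representation is an assumption for self-containedness.

```python
def grad_cam(feature_maps, gradients):
    """Compute a Grad-CAM heatmap (Selvaraju et al.) from conv feature maps
    and the gradients of the class score w.r.t. those maps.

    feature_maps, gradients: K channels, each an HxW list of lists.
    """
    k = len(feature_maps)
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    # Channel importance: global-average-pool each gradient map.
    weights = [sum(sum(row) for row in gradients[c]) / (h * w)
               for c in range(k)]
    # Weighted sum over channels, then ReLU so that only regions with a
    # positive influence on the predicted class survive in the heatmap.
    return [[max(0.0, sum(weights[c] * feature_maps[c][i][j]
                          for c in range(k)))
             for j in range(w)] for i in range(h)]
```

In practice the resulting low-resolution heatmap is upsampled and overlaid on the input PD image, highlighting the VF regions that drove the prediction.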

Our study has some limitations. Firstly, the limited size of the training dataset restricts the validity of our DL model in real-world clinical settings. Large amounts of data are required to train ML and DL algorithms that are "well-phenotyped" to the disease and to optimize their diagnostic performance. As our clinical data were obtained from the CFEH, a secondary referral centre, and the UNSW Optometry clinic, most of our dataset consists of normal and early-stage eyes (some of which may also be glaucoma suspects), with limited numbers of moderate and advanced cases. Our ML models had difficulty differentiating the early and moderate glaucoma stages. In clinical practice, this is a vital delineation that dictates the management plan, whereby early glaucoma can be initially managed by an optometrist.

In contrast, advanced glaucoma is not suitable for collaborative care and requires referral to an ophthalmologist [42]. Although we collected data from two eye care centres, future work involves incorporating data from tertiary referral centres such as eye hospitals and ophthalmology clinics, which may offer greater availability of moderate to severe glaucoma cases and improve the performance of the ML model.
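Until more moderate and advanced cases can be collected, one common mitigation for this skew toward normal and early-stage eyes is inverse-frequency class weighting of the training loss. The sketch below is a generic illustration of that technique, not the procedure used in this study, and the class labels and counts are hypothetical:

```python
def inverse_frequency_weights(label_counts):
    """Per-class loss weights inversely proportional to class frequency,
    normalized so the average weight over the dataset equals 1.0.

    Rare classes (e.g., advanced glaucoma) receive larger weights, so
    misclassifying them costs more during training.
    """
    total = sum(label_counts.values())
    n_classes = len(label_counts)
    return {label: total / (n_classes * count)
            for label, count in label_counts.items()}
```

With hypothetical counts of 600 normal, 300 early, 60 moderate, and 40 advanced eyes, the advanced class receives a weight of 6.25 versus roughly 0.42 for the normal class, partially compensating for the imbalance.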

Secondly, including more data from these centres may reduce ethnicity bias. In our current clinical dataset, most VFs were obtained from Caucasian and Asian patients. As a result, our study cannot truly reflect the disease in the broader population or explore the epidemiological effects of ethnicity on glaucoma. It is important to note that obtaining large amounts of high-quality data may also involve time constraints and technological difficulties, and may ultimately be financially unfeasible [43].

Thirdly, with glaucoma being a multifactorial disease, we could not correlate and validate other diagnostic features and risk factors of glaucoma such as RNFL thickness, cup-disc ratio, IOP, age, and family history. Furthermore, our study primarily focused on distinguishing between normal and glaucomatous defects from a database that was previously grouped as glaucoma or healthy. Our study is therefore limited in its ability to differentiate glaucoma from other ocular pathologies that may present with similar VF loss patterns.

Fourthly, our study relies solely on PD plots extracted from HVFA 24–2 SITA Standard VFs. The 30–2 protocol has also been used to investigate glaucomatous VF loss, employing a 76-point test grid over the central 30 degrees of the VF; hence, our approach may miss mid-peripheral glaucomatous scotomas. However, the 24–2 testing strategy has been shown to provide information comparable to the 30–2 protocol in more than 95% of cases, with the advantages of shorter test duration and less variability [44]. VF printouts obtained under the 30–2 testing strategy could nonetheless be incorporated, which would increase the training dataset and strengthen the performance of the artificial neural network.

In conclusion, our proposed DL-based glaucoma detection and staging using PD plots showed performance that is promising and comparable to that of conventional functional parameters. Glaucoma diagnosis is a complicated, time-consuming, and subjective process dependent on the clinician's experience and expertise. Given the high diagnostic accuracy of the DL model in detecting glaucoma, its application in clinical practice could assist in yielding more accurate glaucoma diagnoses and, in particular, earlier detection of glaucomatous VF loss. Although the DL results of this study are promising for glaucoma detection and classification, further data collection and comparative studies are warranted to train the algorithm with larger VF datasets, validate the results, and make it clinically useful and reliable. To improve the generalizability of the DL algorithms across different patient populations, it is essential to diversify the datasets to encompass a broader range of demographics and clinical presentations. Additionally, incorporating multimodal data, such as optical coherence tomography (OCT) and fundus imaging, may significantly enhance the robustness of these DL approaches. By addressing these areas, we can pave the way for the effective transition of DL models from research settings to practical applications in glaucoma management, ultimately leading to improved patient outcomes.

Acknowledgments

We would like to thank Dr Kathleen Watt, Clinic Director of UNSW Optometry clinic, for helping us with the data collection. We also want to thank the clinicians of CFEH, UNSW Sydney, for assisting in the data collection. Guide Dogs NSW/ACT funds clinical services at the CFEH. Guide Dogs NSW/ACT had no role in study design, data collection and analysis, publication decisions, or manuscript preparation.

References

1. Kingman S. Glaucoma is second leading cause of blindness globally. Bull World Health Organ. 2004;82(11):887–8. Epub 2005/01/11. pmid:15640929.
2. Alencar LM, Medeiros FA. The role of standard automated perimetry and newer functional methods for glaucoma diagnosis and follow-up. Indian J Ophthalmol. 2011;59 Suppl(Suppl1):S53–8. Epub 2010/12/22. pmid:21150035.
3. Mills RP, Budenz DL, Lee PP, Noecker RJ, Walt JG, Siegartel LR, et al. Categorizing the stage of glaucoma from pre-diagnosis to end-stage disease. Am J Ophthalmol. 2006;141(1):24–30. Epub 2006/01/03. pmid:16386972.
4. Devalla SK, Liang Z, Pham TH, Boote C, Strouthidis NG, Thiery AH, et al. Glaucoma management in the era of artificial intelligence. Br J Ophthalmol. 2020;104(3):301–11. Epub 2019/10/24. pmid:31640973.
5. Keltner JL, Johnson CA, Cello KE, Edwards MA, Bandermann SE, Kass MA, et al. Classification of visual field abnormalities in the ocular hypertension treatment study. Arch Ophthalmol. 2003;121(5):643–50. Epub 2003/05/14. pmid:12742841.
6. Kitazawa Y, Yamamoto T. Glaucomatous visual field defects: their characteristics and how to detect them. Clin Neurosci. 1997;4(5):279–83. Epub 1997/01/01. pmid:9292256.
7. Hodapp EA, Parrish RK, Anderson DR, editors. Clinical Decisions in Glaucoma. 1993.
8. Goldbaum MH, Sample PA, White H, Colt BO, Raphaelian P, Fechtner RD, et al. Interpretation of automated perimetry for glaucoma by neural network. Invest Ophthalmol Vis Sci. 1994;35(9):3362–73. pmid:8056511.
9. Ederer F, Ma, Gaasterland DE, Sullivan EK, Beck AD, Prum BE, et al. The Advanced Glaucoma Intervention Study (AGIS): 7. The relationship between control of intraocular pressure and visual field deterioration. Am J Ophthalmol. 2000;130:429–40.
10. Mills RP, Janz N, Wren PA, Guire KE. Correlation of visual field with quality-of-life measures at diagnosis in the Collaborative Initial Glaucoma Treatment Study (CIGTS). J Glaucoma. 2001;10:192–8. pmid:11442181.
11. Brusini P. Clinical use of a new method for visual field damage classification in glaucoma. Eur J Ophthalmol. 1996;6(4):402–7. Epub 1996/10/01. pmid:8997583.
12. Brusini P, Filacorda S. Enhanced Glaucoma Staging System (GSS 2) for classifying functional damage in glaucoma. J Glaucoma. 2006;15:40–6. pmid:16378017.
13. Andersson S, Heijl A, Bizios D, Bengtsson B. Comparison of clinicians and an artificial neural network regarding accuracy and certainty in performance of visual field assessment for the diagnosis of glaucoma. Acta Ophthalmol. 2013;91(5):413–7. Epub 2012/05/16. pmid:22583841.
14. Asaoka R, Iwase A, Hirasawa K, Murata H, Araie M. Identifying "preperimetric" glaucoma in standard automated perimetry visual fields. Invest Ophthalmol Vis Sci. 2014;55(12):7814–20. Epub 2014/10/25. pmid:25342615.
15. Asaoka R, Murata H, Iwase A, Araie M. Detecting preperimetric glaucoma with standard automated perimetry using a deep learning classifier. Ophthalmology. 2016;123(9):1974–80. Epub 2016/07/11. pmid:27395766.
16. Goldbaum MH, Sample PA, Chan K, Williams J, Lee TW, Blumenthal E, et al. Comparing machine learning classifiers for diagnosing glaucoma from standard automated perimetry. Invest Ophthalmol Vis Sci. 2002;43(1):162–9. Epub 2002/01/05. pmid:11773027.
17. Kucur ŞS, Holló G, Sznitman R. A deep learning approach to automatic detection of early glaucoma from visual fields. PLoS One. 2018;13(11):e0206081. Epub 2018/11/30. pmid:30485270.
18. Li F, Wang Z, Qu G, Song D, Yuan Y, Xu Y, et al. Automatic differentiation of Glaucoma visual field from non-glaucoma visual filed using deep convolutional neural network. BMC Med Imaging. 2018;18(1):35. Epub 2018/10/06. pmid:30286740.
19. Huang X, Jin K, Zhu J, Xue Y, Si K, Zhang C, et al. A Structure-Related Fine-Grained Deep Learning System With Diversity Data for Universal Glaucoma Visual Field Grading. Front Med (Lausanne). 2022;9:832920. Epub 2022/04/05. pmid:35372429.
20. Goldbaum MH, Lee I, Jang G, Balasubramanian M, Sample PA, Weinreb RN, et al. Progression of patterns (POP): a machine classifier algorithm to identify glaucoma progression in visual fields. Invest Ophthalmol Vis Sci. 2012;53(10):6557–67. Epub 2012/07/13. pmid:22786913.
21. Wang M, Shen LQ, Pasquale LR, Boland MV, Wellik SR, De Moraes CG, et al. Artificial Intelligence Classification of Central Visual Field Patterns in Glaucoma. Ophthalmology. 2020;127(6):731–8. Epub 2020/02/23. pmid:32081491.
22. Yousefi S, Goldbaum MH, Balasubramanian M, Jung TP, Weinreb RN, Medeiros FA, et al. Glaucoma progression detection using structural retinal nerve fiber layer measurements and functional visual field points. IEEE Trans Biomed Eng. 2014;61(4):1143–54. Epub 2014/03/25. pmid:24658239.
23. Yousefi S, Kiwaki T, Zheng Y, Sugiura H, Asaoka R, Murata H, et al. Detection of Longitudinal Visual Field Progression in Glaucoma Using Machine Learning. Am J Ophthalmol. 2018;193:71–9. Epub 2018/06/20. pmid:29920226.
24. Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR. 2015;abs/1409.1556.
25. He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016:770–8.
26. Zhang Q, Bai C, Liu Z, Yang LT, Yu H, Zhao J, et al. A GPU-based residual network for medical image classification in smart medicine. Information Sciences. 2020;536:91–100.
27. Shin Y, Cho H, Jeong HC, Seong M, Choi JW, Lee WJ. Deep Learning-based Diagnosis of Glaucoma Using Wide-field Optical Coherence Tomography Images. J Glaucoma. 2021;30(9):803–12. Epub 2021/05/13. pmid:33979115.
28. Park K, Kim J, Lee J. A deep learning approach to predict visual field using optical coherence tomography. PLoS One. 2020;15(7):e0234902. pmid:32628672.
29. Asano S, Asaoka R, Murata H, Hashimoto Y, Miki A, Mori K, et al. Predicting the central 10 degrees visual field in glaucoma by applying a deep learning algorithm to optical coherence tomography images. Sci Rep. 2021;11(1):2214. Epub 2021/01/28. pmid:33500462.
30. Hashimoto Y, Kiwaki T, Sugiura H, Asano S, Murata H, Fujino Y, et al. Predicting 10–2 Visual Field From Optical Coherence Tomography in Glaucoma Using Deep Learning Corrected With 24-2/30-2 Visual Field. Transl Vis Sci Technol. 2021;10(13):28. pmid:34812893.
31. Li F, Song D, Chen H, Xiong J, Li X, Zhong H, et al. Development and clinical deployment of a smartphone-based visual field deep learning system for glaucoma detection. npj Digital Medicine. 2020;3(1):123. pmid:33043147.
32. Selvaraju RR, Das A, Vedantam R, Cogswell M, Parikh D, Batra D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. International Journal of Computer Vision. 2017;128:336–59.
33. Kusumaningrum R, Indihatmoko TA, Juwita SR, Hanifah AF, Khadijah K, Surarso B. Benchmarking of Multi-Class Algorithms for Classifying Documents Related to Stunting. Applied Sciences. 2020;10(23):8621.
34. Baik R. Class imbalance learning–driven Alzheimer's detection using hybrid features. International Journal of Distributed Sensor Networks. 2019;15:155014771982604.
35. Yi S, Zhang G, Qian C, Lu Y, Zhong H, He J. A Multimodal Classification Architecture for the Severity Diagnosis of Glaucoma Based on Deep Learning. Front Neurosci. 2022;16:939472. Epub 2022/07/19. pmid:35844230.
36. Lee J, Kim YW, Ha A, Kim YK, Park KH, Choi HJ, et al. Estimating visual field loss from monoscopic optic disc photography using deep learning model. Sci Rep. 2020;10(1):21052. Epub 2020/12/05. pmid:33273643.
37. Mariottoni EB, Datta S, Dov D, Jammal AA, Berchuck S, Tavares I, et al. A Deep Learning-Based Mapping of Structure to Function in Glaucoma. Invest Ophthalmol Vis Sci. 2020;61(7):3213.
38. Mariottoni EB, Datta S, Dov D, Jammal AA, Berchuck SI, Tavares IM, et al. Artificial Intelligence Mapping of Structure to Function in Glaucoma. Transl Vis Sci Technol. 2020;9(2):19. pmid:32818080.
39. Brusini P. Global Glaucoma Staging System (GGSS): A New Method to Simultaneously Assess the Severity of Both Functional and Structural Damage in Glaucoma. J Clin Med. 2021;10(19). Epub 2021/10/14. pmid:34640431.
40. Chaglasian M. Sharpen Your Visual Field Interpretation Skills. Review of Optometry. 2013.
41. Lucy KA, Wollstein G. Structural and Functional Evaluations for the Early Detection of Glaucoma. Expert Rev Ophthalmol. 2016;11(5):367–76. Epub 2016/01/01. pmid:28603546.
42. Clinical practice guidelines for the collaborative care of glaucoma patients and suspects by ophthalmologists and optometrists in Australia. Optometry Australia; 2020.
43. Mayro EL, Wang M, Elze T, Pasquale LR. The impact of artificial intelligence in the diagnosis and management of glaucoma. Eye. 2020;34(1):1–11. pmid:31541215.
44. Khoury JM, Donahue SP, Lavin PJ, Tsai JC. Comparison of 24–2 and 30–2 perimetry in glaucomatous and nonglaucomatous optic neuropathies. J Neuroophthalmol. 1999;19(2):100–8. Epub 1999/06/24. pmid:10380130.