Glaucoma classification based on scanning laser ophthalmoscopic images using a deep learning ensemble method

This study aimed to assess the utility of optic nerve head (onh) en-face images, captured with scanning laser ophthalmoscopy (slo) during standard optical coherence tomography (oct) imaging of the posterior segment, and demonstrate the potential of deep learning (dl) ensemble method that operates in a low data regime to differentiate glaucoma patients from healthy controls. The two groups of subjects were initially categorized based on a range of clinical tests including measurements of intraocular pressure, visual fields, oct derived retinal nerve fiber layer (rnfl) thickness and dilated stereoscopic examination of onh. 227 slo images of 227 subjects (105 glaucoma patients and 122 controls) were used. A new task-specific convolutional neural network architecture was developed for slo image-based classification. To benchmark the results of the proposed method, a range of classifiers were tested including five machine learning methods to classify glaucoma based on rnfl thickness—a well-known biomarker in glaucoma diagnostics, ensemble classifier based on inception v3 architecture, and classifiers based on features extracted from the image. The study shows that cross-validation dl ensemble based on slo images achieved a good discrimination performance with up to 0.962 of balanced accuracy, outperforming all of the other tested classifiers.


Introduction
As the world's population ages, glaucoma is becoming a leading cause of irreversible vision loss and blindness, with primary open-angle glaucoma (POAG) being the most prevalent form of it [1]. Until it reaches an advanced stage, glaucoma is an asymptomatic disease, so methods of early diagnosis are of high importance [2]. Glaucoma diagnosis and management heavily rely on advanced imaging techniques, which typically image the optic nerve head (ONH) and surrounding tissue [3]. The appearance of the ONH is usually assessed with a fundus camera, but scanning laser ophthalmoscopy (SLO) en-face imaging can also be utilized for differentiating glaucoma patients from normal subjects with high accuracy. For example, Haleem et al. [4] classified glaucoma patients based on geometric and non-geometric properties of different regions of the SLO image, whereas Wollstein et al. [5] used the parameters of optics disk derived from SLO images to differentiate early glaucoma patients from healthy individuals. Machine learning techniques have firmly entered the field of ophthalmology [6][7][8][9] and the number of studies showing their potential in glaucoma diagnosing is steadily growing [10][11][12]. These algorithms have also been used to differentiate early glaucoma patients from controls [13,14]. However, most of those techniques focused on classifying glaucoma are based on information from visual field measurements, fundus camera images, or measurements of retinal nerve fiber layer (RNFL) thickness. The use of SLO images, which are usually captured during the optical coherence tomography (OCT) acquisition, for glaucoma classification using deep learning (DL) methods has not received, until recently, as much attention [15,16]. DL methods are usually associated with large data volumes [17], where the overall performance of a classification system is highly dependent on the size of training data. However, for many applications within ophthalmology, data may be scarce. There exist methods to deal with the problem of insufficient sample size [18,19], which include transfer learning [20], data augmentation [21], and model architecture modifications such as dropout [22]. Additionally, ensemble methods have been widely used to stabilize and improve the final model performance in the biomedical classification task [23]. The classifiers based on ensemble learning consist in the integration of multiple base classifiers, i.e., classifiers whose predictions had an impact on the final result. The goal is to create a model that will outperform all base classifiers included in its composition, whereas the effectiveness of such a model depends both on the diversity of its base classifiers [24] and on the proper choice of integration rule.
This study aimed to assess the utility of retinal SLO images to support glaucoma diagnosis and to design a cross-validation ensemble of DL models that would accurately differentiate glaucoma patients from healthy controls in a low data regime.

Subjects and clinical measurements
The study was approved by the Bioethical Committee of the Wroclaw Medical University (KB-332/2015) and adhered to the tenets of the Declaration of Helsinki. Informed written consent to participate in the study was obtained from all subjects.
All subjects provided their medical history and underwent a comprehensive ophthalmic examination. In particular, Goldmann applanation tonometry, slit lamp examination, and dilated stereoscopic examination of the optic disc were performed for all subjects. Visual field (VF) parameters including mean deviation (MD) and pattern standard deviation (PSD) were measured using standard automated perimetry (Humphrey Field Analyzer II 750; 24-2 Swedish interactive threshold algorithm; Carl Zeiss Meditec, Inc., Dublin, CA). Additionally, spectral domain SD-OCT (Spectralis, Heidelberg Engineering GmbH, Heidelberg, Germany) were acquired, using a circular scanning protocol around the optic nerve head to measure the average RNFL thickness as well as its mean value in six different ONH sectors, that is temporal-superior (TS), temporal (T), temporal-inferior (TI), nasal-superior (NS), nasal (N) and nasal-inferior (NI). The OCT instrument acquires an additional en-face SLO image simultaneously during the acquisition of the OCT-scan. Those images are used here for classification.
Subjects were excluded if they had a history of ocular surgery within 12 months before the onset of the study. Patients younger than 40 years old, with intraocular disease (e.g., macular degeneration, diabetic retinopathy, retinal vein occlusion) or neurological disorders affecting visual fields were also excluded from the study. Eyes with spherical equivalent of <−6.0 Diopters (D) or >+ 3.0 D, and cylinder correction of < − 3.0 D or >+ 3.0 D were also excluded. When both eyes met the inclusion criteria, the eye used for the study was randomly selected.
All the glaucoma patients in this study were clinically categorized as POAG type. POAG was defined as the persistence presence of glaucomatous optic nerve damage assessed using dilated stereoscopic examination of ONH (i.e., concentric enlargement of the optic disc, rim thinning, or notching) with associated visual field defects in the presence of an open-angle. A normal visual field was defined as the absence of glaucomatous and neurologic field defects. Table 1 contains the group statistics of the clinical examination used to differentiate the two considered groups of subjects. The classification was performed by an experienced ophthalmologist (P.K.-B.).

Dataset
A total of 227 SD-OCT SLO images of 227 participants were used in this study. The participants were selected from consecutive patients who presented at the time of the study at the Outpatient and Glaucoma Clinic at the Department of Ophthalmology, Wroclaw Medical University. The dataset included 122 SLO images of healthy control subjects and 105 images of glaucoma patients (see S1 Dataset). Additionally, the measurements of RNFL thickness were considered for comparison. The groups of glaucoma patients and healthy controls represent a valid sample from the general population. This has been ensured by examining the clinical parameters (see Table 1) that a trained ophthalmologist used for assigning a subject to a particular group. Therefore, it is assumed that the corresponding SLO images from those subjects are also representative of the general population.

Techniques
The dataset in this study contains en-face OCT SLO images. For the image classification task, a Convolutional Neural Network (CNN) was used, because CNN-based DL algorithms have proven in recent years to provide state-of-the-art performance for medical image classification tasks  [25]. In ophthalmic applications, this model has already been applied to several image analysis applications, including retinal layer segmentation [26,27], cone photoreceptor detection [28] and attention-based glaucoma detection [29]. DL algorithms generally require a relatively large amount of data to yield good classification results, which in the case of this study was infeasible. Thus, this limitation was considered during the initial stages of development. Further, it was decided to design a task-specific CNN architecture with reduced complexity, with the aim of having fewer parameters required to train the model. The proposed neural network architecture is shown in Fig 1, with a cropped SLO image of size 156×238 × 1 pixels being its input. Cropping corresponded to the instrument's overlayed rectangular area and was performed to focus on the optic nerve head. The architecture consists of three major blocks, each of them containing 3, 2 and 2 convolution layers respectively, preceded by average pooling layer and followed by a maximum pooling layer. Following these blocks, there are two fully connected layers containing 128 and 2 units respectively. Between those layers, there is a dropout with a 0.60 rate, which aims to reduce the likelihood of overfitting. The last layer provides the result of the model, that is, the probability of an image belonging to a given class (healthy control and glaucoma subject). After every convolutional layer a batch normalization was performed using a Rectified Linear Unit (ReLU) activation function. For the first fully connected layer ReLU was utilized as activation function. To estimate the probability that a given image belongs to one of the two classes, the softmax function was used for the last and final layer, which provides the class with a higher probability selected as a model prediction.
For training purposes, the Adaptive Moment Estimation (Adam) optimizer [30] and binary cross entropy as the loss function were utilized. The models were trained up to 250 epochs. During training, the model with the best validation accuracy was saved and used later for testing. That occurred, in general, well before the 250th epoch. The learning rate was set to 0.001. The Glorot uniform initializer [31] was used for kernel and weights initialization, while the initial bias was set to zero. The software environment that was used for experimental evaluation consists of Keras 2.2.5 [32] with Tensorflow backend [33], Scikit-learn 0.20.3 [34] in Python 3.7.3.
To benchmark the proposed method performance with those of other networks, a modified inception v3 architecture [35] was also implemented. The original network, pre-trained on Imagenet, was used for the experiment, and the fully connected layers were removed, replacing them with two smaller fully connected layers of sizes 128 and 2, respectively. The training procedure, which includes the whole model, was identical to that of the custom architecture. The modification was introduced to improve the performance of the preliminary experiments on the original architecture.
Additionally to the DL methods a range of machine learning techniques were tested. This helps to assess the proposed whole-picture approach using DL, and its effectiveness in automatically extracting features for image classification versus the traditional machine learning that requires manual feature extraction methods. After extracting the features such as parameters obtained from Principal Component Analysis (PCA) and gray-level co-occurrence matrix (GLCM), a Support Vector Machine (SVM) was used as a classifier.
Given that information on structural data (thickness) is commonly used to support glaucoma diagnosis, additional classifiers based on RNFL  The code used in the experiments is located on the Github platform and all the data used is available upon request (https://github.com/dsulot/slo_classification).

Preprocessing of images
An original scan from OCT Spectralis contains both the SLO image of size 496 by 496 pixels, which provides the en-face 30˚-view of ONH, and the cross-sectional OCT scans, which were not used for this study. A region of interest (ROI) centered on the ONH and of size of 156 × 238 × 1 pixels was used for analysis. An example of such SLO image is shown in Fig 1. The image grey scale intensity within the ROI was normalized between 0 and 1.

Design of experiments
Because of limited data, it was decided to use data augmentation techniques for training purposes. Image transformations facilitate creating more training samples, prevent model overfitting and improve final accuracy. On observation of the image content, it was decided to use the following transformations: 1) horizontal/vertical flip, 2) shift (± 0.15 fraction of the total width/height), and 3) image rotation (± 50-degree range for random rotations). The parameters for each of these transformations were selected experimentally.
To check the stability of the proposed model and the model based on modified inception v3 architecture, k-fold cross-validation was utilized. The experiment was performed and conducted twice for different k values: 5 and 10. Within folds, to increase classification accuracy and stability, the ensemble of classifiers was created. The cross-validation ensemble was created from k − 1 classifiers trained on objects from train fold split into training and validation parts. Because each of the models was trained on a slightly different dataset (i.e., a different part of the training set was used to train and validate), they were able to establish different features to classify the images. An example scheme of a 5-fold cross-validation ensemble is presented in Fig  2. The cross-validation protocol was also used for the other methods, including the RNFL thickness classifier and the classifier learned on the features extracted from images.
Finally, the effect of two types of classifier combination techniques with and without weighting was tested, including majority voting and support accumulation [36]. In a weighted approach, each classifier is weighted according to its performance that is calculated based on the validation dataset. The balanced accuracy score was used to evaluate the performance of the model and was defined as the mean between sensitivity and specificity [37].Additionally, for all results, Wilcoxon test [38] was performed to assess whether the obtained results are statistically significantly different from each other.

RNFL thickness-based classifiers
The results of glaucoma classification based on the RNFL thickness are considered because they represent one of the essential biomarkers in clinical diagnosis, whereas the information extracted from the SLO images has a supplementary character, which is currently not utilized in the clinical practice. The scores obtained from the five considered standard classifiers based on RNFL thickness are presented in Table 2. It is evident that, in this case, MLP achieved the worst results, while the rest of the classifiers achieved similar balanced accuracy around 0.88. All of the considered methods are characterized by a relatively high standard deviation.

Assessing the efficacy of image features
In recent years, DL methods have become the standard for image classification, yet machine learning methods (features extracted from image + classifiers) have shown to also provide a good image classification performance. Therefore, it was decided to check the performance achieved by the traditional machine learning algorithm. Table 3 shows the results of SVM classifier trained using image features and trained on the whole image. Four techniques were considered: based on the vector created by averaging the image over columns and rows, the parameters calculated from GLCM, the results from PCA, and the combination of GLCM and PCA. After preliminary experiments, contrast and dissimilarity were used for further analysis from the available parameters calculated based on GLCM. For the PCA method, 99% of the explained variance was used (the image was initially flattened into a vector).The results indicate that this approach could obtain a balance accuracy metric up to 0.786 with the use of combined parameters from GLCM and PCA, which overall is inferior to the metrics from the RNFL thickness classifier. Within the considered techniques, there are no statistically significant differences in the results.
Assessing the potential of well-known, pre-trained architecture In this experiment, it was decided to check not only the effect of well-known, pre-trained CNN architecture but also further its connection to ensemble learning. The results from a 5 and 10-fold cross-validation ensemble using the modified inception v3 architecture, as well as the results from the single inception v3 model, are presented in Table 4. The preliminary experiments have shown that a fine-tuned model on SLO images for 10 epochs provided poor performance (0.492 ± 0.017 for a single model). Because of that, all models were fine-tuned for 250 epochs, like the models with custom architectures. As anticipated, the experiments indicate that ensemble learning improves the classification quality for each type of combination technique, providing statistically significant better metrics in the results. The obtained results show that for the modified, well-known architecture with transfer and ensemble learning, the model can classify glaucoma with a balanced accuracy of 0.945 based on SLO images only. Table 5 presents the overall classification performance of the considered DL methods for SLO images using the custom CNN architecture. The mean value and the standard deviation of Overall, the results of classifier ensemble using a 5-fold cross-validation reached only a marginally superior balanced accuracy than that using a 10-fold one. However, the 5-fold cross-validation showed a better (smaller) standard deviation. Assessing the accuracy metrics, it is evident that information contained in the relatively low-resolution SLO images can be successfully used for supporting glaucoma diagnosis. The DL methods reached accuracy of 0.962. The weighted ensemble models achieved almost identical results to those by regular models indicating that using this approach has no particular advantage, at least for the considered set of SLO images. Additionally, Table 6 shows the mean values of the performance characteristics calculated across the folds obtained from the confusion matrix for all presented models together with the resulting sensitivity and specificity. It is evident that all ensemble models achieve high performance levels.

SLO-based classifier
While comparing the results using custom architecture and the pre-trained, modified inception v3 architecture, it can be seen that for relatively small data sets, creating a compact, tailored architecture can be sufficient to achieve high classification accuracy and to reduce the time of experiments by using a smaller model that is faster to train. In each case, the classifier ensemble achieved better results with the task-specific architecture than with the modified, pre-trained well-known inception v3.

Conclusion
In this study, the development of an ensemble of CNN models to classify en-face non-structural SLO images into two different categories (glaucoma patients and healthy control subjects) were proposed. Despite the relatively low data regime and, consequently, the relatively small dataset to train the model, the results demonstrate that separation of the two considered groups could be performed with high accuracy using cross-validation ensemble of DL models (balanced accuracy up to 0.962).
Given the results presented in this paper, it is evident that the SLO image contains valuable clinical information. Thus, this imaging modality combined with DL methods can support glaucoma diagnosis. Additionally, it is worth noting that for our dataset the classifiers based on RNFL thickness show an inferior performance. The thickness data is extracted from a single circular B-scan around the ONH that may not be detailed enough to capture the structural changes in this cohort of glaucoma subjects. While comparing the traditional machine learning methods with the DL techniques, it is evident that ML method, as expected, showed an inferior performance to that of the proposed DL method. Regrading the DL solution, the findings demonstrate it was beneficial to develop a customized network architecture for this problem. The combination of SLO (non-structural) and OCT derived thickness data (structural) in a multi-modal DL approach should be considered in the future to further improve classification accuracy. Given the limited dataset size, it is expected that by increasing it in future studies, the classification performance may be further improved. One of the limitations of working in a small data regime is a potential overfitting. Every effort has been made to prevent this, among other things, by using dropout layers and augmentation as well as by applying cross-validation which primarily shows that the results are repeatable at a similar level regardless of the test and training parts. The results for 5 and 10-fold cross-validation do not substantially vary. Additionally, generative adversarial methods to generate synthetic SLO images can be used for more complex data augmentation purposes and improvement of model performance [39]. They will be explored in the future. It is worth noting that the SLO images used in this study are captured as part of a standard OCT scan. These en-face images are commonly used to check for measurement alignment within the retina or to track thickness changes in the follow-up studies. Although the SLO image is embedded in OCT, it is not normally used by clinicians for patient screening. However, several studies have shown the potential of SLO imaging for glaucoma diagnosis, particularly for differentiating glaucoma patients from normal subjects [4,5]. This study supports such developments with matching or increasing classification accuracy. Adding that the use of a smaller, task-specific architecture can be beneficial for classifying small data sets.
Finally, this study shows that DL methods based on an ensemble of classifiers can provide balanced accuracy to discriminate ONH SLO images of healthy and glaucoma patients, even in the low data regime. Given that this imaging modality is normally captured along with OCT images, it can be relatively easily utilized for supporting glaucoma detection. Hence, ONH SLO images have the clinical utility to support glaucoma detection and management.
Supporting information S1 Dataset. The data used in this study.