DMENet: Diabetic Macular Edema diagnosis using Hierarchical Ensemble of CNNs

Diabetic Macular Edema (DME) is an advanced stage of Diabetic Retinopathy (DR) and can lead to permanent vision loss. It currently affects 26.7 million people globally, and on account of such a huge number of DME cases and the limited number of ophthalmologists, it is desirable to automate the diagnosis process. Computer-assisted, deep learning based diagnosis could help in early detection, following which precision medication can help mitigate vision loss. Method: In order to automate the screening of DME, we propose a novel DMENet algorithm built on the pillars of Convolutional Neural Networks (CNNs). DMENet analyses preprocessed color fundus images and passes them through a two-stage pipeline. The first stage detects the presence or absence of DME, whereas the second stage takes only the positive cases and grades the images based on severity. In both stages, we use a novel Hierarchical Ensemble of CNNs (HE-CNN). This paper uses two popular publicly available datasets, IDRiD and MESSIDOR, for classification. Preprocessing of the images is performed using morphological opening and a Gaussian kernel, and the dataset is augmented to solve the class imbalance problem for better performance of the proposed model. Results: The proposed methodology achieved an average accuracy of 96.12%, sensitivity of 96.32%, specificity of 95.84%, and F1-score of 0.9609 on the MESSIDOR and IDRiD datasets. Conclusion: These excellent results establish the validity of the proposed methodology for use in DME screening and solidify the applicability of the HE-CNN classification technique in the domain of biomedical imaging.

Diabetic macular edema (DME) is a complication of Diabetic Retinopathy (DR); it usually occurs when vessels in the central part of the macula are affected by fluid accretion [1]. DME is caused by diabetes, a chronic disease induced by an inherited and/or acquired deficiency in the production of insulin by the pancreas. DME is an advanced stage of DR that can lead to irreversible vision loss [2][3][4].

Diabetes currently affects more than 425 million people worldwide and is expected to affect an estimated 520 million by 2025. It is estimated that 10% of people who suffer from some form of diabetes are at risk of DME. DME currently affects around 26.7 million people globally, and the number is expected to rise to around 50 million by 2025 [3,5]. About 7.7 million Americans have DR and approximately 750,000 are suffering from DME [4].

Identifying exudates in fundus images is a standard technique for determining DME [6,7]. The nearness of exudates to the macula determines the severity of DME; the probability of DME increases as the exudates move closer to the macula, with the risk being maximum when they are inside the macula [8]. Immediate treatment is required if a person is diagnosed with DME to avoid complete vision loss.

The ratio of ophthalmologists to population in developed countries like the USA is 1:15,800, and in developing countries like India it is 1:25,000 in urban areas and 1:219,000 in rural areas [9,10]. The limited number of ophthalmologists cannot keep up with the rapidly increasing number of DME patients, and this heavily skewed ratio of ophthalmologists to DME patients is also leading to delayed services. Manual evaluation of DME is not adaptable in a large-scale screening scheme, especially in developing countries where there is a shortage of ophthalmologists [11].
During the screening tests, one out of nine people turns out to be a positive case of DME [12]. Another challenging aspect of the healthcare sector in developing nations is providing the best and most timely diagnosis at an affordable cost. In such a scenario we require an automatic disease discovery framework that can reduce cost and workload, as well as alleviate the shortage of ophthalmologists by limiting referrals to those cases that require prompt consideration. Reducing the effort and time ophthalmologists spend on diagnosis will be pivotal for arresting the growth of DME cases.

Propelled by these promising possibilities, we propose to develop an effective solution using deep learning techniques to automatically grade the fundus images. Machine learning techniques have powered many aspects of medical investigations and clinical practice. Deep learning is emerging as a leading machine learning tool in computer vision and has started to command significant consideration in the field of medical imaging. Deep learning techniques, in particular convolutional neural networks, have rapidly gained prominence for the analysis of medical images.

In this paper, we propose a novel algorithm, DMENet, which automatically analyses preprocessed color fundus images. Preprocessing of the images is performed using morphological opening and a Gaussian kernel, and the dataset is augmented to solve the class imbalance problem. After preprocessing, the images are passed through a two-stage pipeline. In the first stage, the algorithm detects the presence or absence of DME; once the presence of DME is confirmed, the image is passed to the second stage, where it is graded based on severity. Both these stages are equipped with a state-of-the-art technique called the Hierarchical Ensemble of Convolutional Neural Networks (HE-CNN).

Our proposed DMENet algorithm is built on a novel classification structure known as HE-CNN, which uses the concept of ensemble learning. Ensemble learning is a technique which combines the outputs from multiple classifiers to improve the classification performance of the model [13]. This approach is intuitively used in our daily lives, where we seek guidance from multiple experts and weigh and combine their views to make a more informed and optimized decision. In matters of great importance that have financial or medical implications, we often seek a second opinion before making a decision; for example, having several doctors agree on a diagnosis reduces the risk of following the advice of a single doctor whose specific experience may differ significantly from that of others.

Clusters of these pixels were formed, and the largest cluster formed was considered as the macula.

They used a Gabor filter as well as adaptive thresholding to detect exudates, and graded DME severity with the help of a support vector machine (SVM). The work given in [18] proposed a method for texture extraction from various regions of interest of the fundus image and employed an SVM for grading the severity of DME.

However, the overall performance of the above-described grading systems is largely limited, and ensemble learning has been shown to yield an improvement in performance [21,22]. Feature space expansion and the use of different starting points have been instrumental in providing a better approximation of the optimal solution, and have been cited as the rationale for the improved classification obtained with ensemble learning. Ensemble learning offers a promising alternative to the techniques currently used in the computer vision domain, and we wish to explore it in our proposed methodology.

Each grade is described as given in Table 1.

Preprocessing plays a key role in improving the accuracy of results by reducing noise in the background. It removes some of the variations between images due to differing lighting conditions and camera resolution, thus making the data consistent [25]. The images contained in the above-mentioned databases have different resolutions and were scaled down to a fixed resolution to form a standardized dataset and also to reduce computational cost.

The first preprocessing step enhances the image:

U_en(x, y) = λ · (G(x, y; σ) * U(x, y)) + ω · U(x, y) + δ

where U_en denotes the corresponding output image after enhancement, U(x, y) denotes the raw fundus image, and G(x, y; σ) is a Gaussian kernel with scale σ [25]. The convolution operator is represented by *. In this paper σ is empirically set to 25; λ, ω and δ are parameters that control the weights, empirically set to -4, 4 and 0.5 respectively. We then perform morphological opening on the enhanced image to control the brightness of the blood vessels, which appear brighter than other retinal surfaces due to their lower reflectance [26]. The second step is given by

U_p(x, y) = U_en(x, y) • ϕ

where U_p denotes the corresponding output image after performing morphological opening and is the final output of the preprocessing stage, ϕ represents the disk structuring element with a radius of 3 pixels, and • denotes the morphological opening operation. Fig. 1 shows a comparison of the preprocessed image with the original image.
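The two preprocessing steps can be sketched in NumPy. This is a minimal illustration under the parameter values stated above (σ, λ, ω, δ); the function names are ours, and a production pipeline would typically rely on OpenCV or scikit-image primitives instead:

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """1-D Gaussian kernel; applied separably for the 2-D blur."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable 2-D Gaussian convolution with reflect padding."""
    k = gaussian_kernel_1d(sigma)
    r = len(k) // 2
    padded = np.pad(img, r, mode="reflect")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def enhance(U, sigma=25, lam=-4.0, omega=4.0, delta=0.5):
    """U_en = lam * (G * U) + omega * U + delta, clipped to [0, 1]."""
    return np.clip(lam * gaussian_blur(U, sigma) + omega * U + delta, 0.0, 1.0)

def opening(img, radius=3):
    """Grayscale morphological opening (erosion then dilation) with a disk."""
    size = 2 * radius + 1
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disk = (yy ** 2 + xx ** 2) <= radius ** 2

    def windows(a):
        p = np.pad(a, radius, mode="edge")
        return np.lib.stride_tricks.sliding_window_view(p, (size, size))

    eroded = np.where(disk, windows(img), np.inf).min(axis=(2, 3))
    return np.where(disk, windows(eroded), -np.inf).max(axis=(2, 3))

def preprocess(U):
    """Full preprocessing: enhancement followed by morphological opening."""
    return opening(enhance(U))
```

Calling `preprocess` on a normalized grayscale fundus channel in [0, 1] returns the opened, contrast-enhanced image of the same shape.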

In order to build a robust automated disease grading system using CNNs, it is important to have a dataset of images with uniform representation from all the classes. In the field of medical imaging, datasets are often imbalanced, as the number of patients that turn out to be positive is very small compared to the total number of cases. This class imbalance problem introduces a significant challenge when training deep CNNs, which are data intensive [26]. Another common issue that occurs while training deep CNNs on smaller datasets is overfitting [27]. Training deep CNNs on larger datasets has been shown to improve the robustness and generalisability of the model [25]. Data augmentation is an effective solution to reduce overfitting during CNN training as well as to balance the classes.

The feature map k of the m-th convolutional layer is computed as

y_k^m = a^m( Σ_{j ∈ p_k^m} W_{j,k}^m * y_j^{m-1} + b_k^m )

where a^m(.) denotes the activation function, W_{j,k}^m is the weight vector of the filter that connects feature map j in layer m-1 to feature map k in layer m, and b_k^m is the bias term associated with the k-th feature map in the m-th layer. p_k^m denotes the set of planes in layer m-1 that are connected to the k-th feature map. Fig. 3 illustrates the convolutional layer. For an n × n input feature map and an f × f filter applied with stride s and no padding, the size of the output feature map would be ((n − f)/s + 1) × ((n − f)/s + 1).

The pooling layer is placed between two convolutional layers with the aim of reducing the spatial dimensions, improving computing performance and reducing the number of parameters, which controls overfitting. There are various types of pooling operations; the most commonly used are max-pooling [33] and average pooling [34]. The feature map k for the m-th pooling layer can be mathematically represented as

y_k^m = w_k^m · Φ(y_k^{m-1}) + b_k^m

where Φ(.) denotes the pooling function, and w_k^m and b_k^m are the scalar weight and the bias term respectively. Here we divide the feature map k of convolutional layer m-1 into non-overlapping blocks of size 2 × 2 pixels; in the case of max-pooling we take the largest value in each block. Fig. 4 shows the pooling layer.
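As a concrete illustration of these operations, a minimal NumPy sketch (single channel, stride 1, no padding; note that `conv2d_valid` computes cross-correlation, as deep-learning frameworks do):

```python
import numpy as np

def conv2d_valid(x, kernel):
    """'Valid' 2-D convolution of a single feature map with one filter."""
    f = kernel.shape[0]
    windows = np.lib.stride_tricks.sliding_window_view(x, (f, f))
    return np.einsum("ijkl,kl->ij", windows, kernel)

def conv_output_size(n, f, stride=1):
    """Output side length for an n x n input and f x f filter (no padding)."""
    return (n - f) // stride + 1

def max_pool_2x2(fmap):
    """Non-overlapping 2 x 2 max pooling, as used between conv layers."""
    h, w = (fmap.shape[0] // 2) * 2, (fmap.shape[1] // 2) * 2
    return fmap[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
```

For a 4 × 4 input and a 2 × 2 filter, `conv_output_size(4, 2)` gives a 3 × 3 output, matching the formula above.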

After several convolutional and pooling layers, we have a fully connected layer which performs high-level reasoning [35,36]. A fully connected layer takes all the neurons in the previous layer and connects them to each single neuron of the current layer to generate global semantic information [32]. The final output layer then produces the class prediction.

Various studies [37,38] have demonstrated that transfer learning, or fine-tuning a CNN, is better than training a CNN from scratch. Training CNNs from scratch suffers from several issues; in particular, the requirement of a huge dataset is a serious concern in the medical domain, where expert annotation is a costly affair. Training a CNN from scratch is also computationally expensive and time-consuming, and suffers from frequent overfitting and convergence issues. An effective alternative to training a CNN from scratch is transfer learning, a method in which a CNN extracts generic features from smaller datasets based on its feature learning from a larger and well-labeled dataset. The factors which affect the transfer learning strategy are the size of the new dataset on which it is applied and its similarity to the original dataset [39].

Fine-tuning is an iterative process that works by minimizing the cost function with respect to the pre-trained weights [22]. The cost function is represented as

C(w) = (1/n) Σ_{j=1}^{n} l(c_j, ĉ_j)

where D is the training dataset with n images, d_j is the j-th image of D, f(d_j, w) is the CNN function that predicts the class c_j of d_j given w, ĉ_j is the ground truth of the j-th image, and l(c_j, ĉ_j) is a penalty function based on the logistic loss l for predicting c_j instead of ĉ_j.
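The fine-tuning objective can be made concrete with a small sketch (the helper names are ours; since the paper does not state the exact loss weighting, a plain mean logistic loss is assumed):

```python
import numpy as np

def logistic_loss(p, y):
    """Penalty l(c_j, c_hat_j) for predicting probability p when the label is y."""
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def empirical_cost(predict, dataset, labels):
    """Mean logistic loss of a classifier over the training set D: C(w)."""
    losses = [logistic_loss(predict(d), y) for d, y in zip(dataset, labels)]
    return float(np.mean(losses))
```

Fine-tuning then amounts to adjusting the pre-trained weights inside `predict` so that `empirical_cost` decreases.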

In order to minimize the cost function, the Stochastic Gradient Descent (SGD) algorithm [40] is commonly used, where the cost over the entire training dataset is approximated by the results obtained over mini-batches of data. The weight updates for iteration t are computed as follows [38]:

Δw_{t+1} = η · Δw_t − α_m · (1/|Y|) · Σ_{d ∈ Y} ∂C/∂w
w_{t+1} = w_t + Δw_{t+1}

where |Y| denotes the number of training images, α_m is the learning rate of the m-th layer which controls the size of the updates to the weights, and η is the momentum coefficient that indicates the contribution of the previous weight update to the current iteration.

This has the effect of speeding up the learning process while simultaneously smoothing the weight updates. τ is the scheduling rate that decreases the learning rate α at the end of each epoch.

The HE-CNN architecture organizes the CNN learners into clusters. The output of the i-th cluster is given by

O_i = Σ_j w_i^j · O_ij

where w_i^j is the activation of the j-th output unit of the Local Gating System in the i-th cluster and O_ij denotes the output given by the j-th learner in the i-th cluster. The final output of the HE-CNN architecture is given by

O = Σ_i w_i · O_i,  with  w_i = exp(g_i) / Σ_k exp(g_k)  and  w_i^j = exp(g_i^j) / Σ_k exp(g_i^k)

where g_i and g_i^j are the weighted sums at the output units of the corresponding gating networks.
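The momentum update rule can be sketched in a few lines (a generic implementation of the rule above, not the authors' code):

```python
def sgd_momentum_step(w, grad, velocity, lr, eta):
    """One momentum-SGD step: v <- eta*v - lr*grad;  w <- w + v."""
    velocity = eta * velocity - lr * grad
    return w + velocity, velocity

def decay_lr(lr, tau=0.9):
    """Scheduling rate tau shrinks the learning rate after each epoch."""
    return tau * lr
```

For example, minimizing f(w) = w^2 (gradient 2w) from w = 1 with lr = 0.1 and η = 0.9 drives w toward 0 within a few hundred steps.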

As shown in Fig. 5, in this architecture we divide the input space into a nested set of regions and then attempt to fit simple surfaces to the data which fall in these regions. The regions have soft boundaries, which means that the data points are spread across multiple regions, and the boundaries between these regions are simple parameterized surfaces which are adjusted through the learning process [42]. Every learner has expertise in one specific area of the high-dimensional input space, and each learner estimates the conditional posterior probability on the partitioned feature space.

We have analysed the hierarchical nature of this architecture using the probabilistic interpretation given below. We then discuss the error functions that were employed for training both the CNN learners and the gating systems simultaneously.
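The soft, two-level combination can be sketched as a hierarchical mixture-of-experts forward pass (softmax gating is assumed, consistent with the weighted-sum gate outputs g described above; the CNN learners are stubbed out here as fixed output vectors):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of gate activations."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def he_combine(expert_outputs, local_gates, global_gate):
    """Two-level soft combination: the LGS weights the learners inside each
    cluster, then the GGS weights the cluster outputs."""
    cluster_outputs = []
    for outs, g in zip(expert_outputs, local_gates):
        w = softmax(np.asarray(g, dtype=float))
        cluster_outputs.append(sum(wi * np.asarray(o) for wi, o in zip(w, outs)))
    W = softmax(np.asarray(global_gate, dtype=float))
    return sum(Wi * c for Wi, c in zip(W, cluster_outputs))
```

With strongly peaked gates the model reduces to picking one expert; with flat gates it averages, which is the soft-boundary behaviour described above.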
Here P(.) represents the likelihood function of the HE-CNN model. The gating system is responsible for assigning weights, thereby allowing the overall model to execute a competitive learning process by maximizing the likelihood function of the training data [43][44][45]. Each learner module specializes in exclusive regions of the feature space, and all modules in this methodology learn simultaneously by interacting with each other rather than learning independently [46]. The learners in a cluster are closely linked with each other and learn similar mappings early in the training phase. They differentiate later in training as the probabilities associated with the cluster to which they belong become larger. Thus the architecture tends to acquire coarse structure before acquiring fine structure; this particular feature of the architecture is notable as it provides robustness against overfitting in deep hierarchies.

Taking the error function used in [42,44],

E^v = −ln Σ_i g_i Σ_j g_i^j exp(−½ ‖D_v − O_i^{j,v}‖²)

where D_v is the desired output vector and O_i^{j,v} is the output vector of the j-th learner in the i-th cluster. The effective error for the j-th learner in the i-th cluster on the full training set is defined as

E_i^j = Σ_v h_i^{j,v} ‖D_v − O_i^{j,v}‖²

where h_i^{j,v} is the posterior probability estimate provided by the j-th learner of the i-th cluster for input x_v, given as

h_i^{j,v} = g_i g_i^j exp(−½ ‖D_v − O_i^{j,v}‖²) / Σ_k g_k Σ_l g_k^l exp(−½ ‖D_v − O_k^{l,v}‖²)

The error functions of both the gating systems (GGS, LGS) are defined analogously in terms of h_j^v, the posterior probability provided by the j-th learner in the cluster for which the error function is being calculated.

Receiver Operating Characteristic (ROC) analyses were used to calculate the accuracy, precision, sensitivity/recall, specificity and the average area under the ROC curve (AUC). The F1-score is the harmonic mean of precision and recall [47] and is calculated as given below.
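These quantities can be sketched directly, assuming the standard Gaussian-likelihood form of the hierarchical mixture-of-experts error [42,44] (our reconstruction; learner outputs and gate priors are passed in as plain arrays):

```python
import numpy as np

def posteriors(D, outputs, g, g_local):
    """Posterior h_i^j that learner j of cluster i produced target D,
    assuming Gaussian likelihoods around each learner's output."""
    num = np.array([[g[i] * g_local[i][j]
                     * np.exp(-0.5 * np.sum((D - outputs[i][j]) ** 2))
                     for j in range(len(outputs[i]))]
                    for i in range(len(outputs))])
    return num / num.sum()

def effective_error(D, outputs, h):
    """Posterior-weighted squared error that drives each learner's update."""
    return sum(h[i][j] * np.sum((D - outputs[i][j]) ** 2)
               for i in range(len(outputs))
               for j in range(len(outputs[i])))
```

Learners whose outputs match the target receive larger posteriors, so their weights are pushed hardest toward that region of the feature space, which is the competitive specialization described above.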

F1-score = 2 × (Precision × Recall) / (Precision + Recall)

The F1-score reaches its best value of 1 when both precision and recall are perfect. Cohen's kappa (κ) score is used to determine the potential of HE-CNN in partitioning the feature space [48]. The κ measure signifies the level of agreement between classifiers, and its value ranges between −1 and 1 (the higher the κ score, the higher the agreement).
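All of the reported metrics can be derived from the binary confusion matrix; a small sketch with generic formulas (not tied to the paper's code):

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, sensitivity (recall), specificity and F1-score."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

def cohen_kappa(tp, tn, fp, fn):
    """Cohen's kappa: observed agreement corrected for chance, in [-1, 1]."""
    n = tp + tn + fp + fn
    po = (tp + tn) / n                 # observed agreement
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2  # chance
    return (po - pe) / (1 - pe)
```

For example, a confusion matrix with tp = 35, tn = 50, fp = 10, fn = 5 gives sensitivity 0.875 and κ ≈ 0.69.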

In this study, we have applied the 5-fold cross-validation technique to assess the performance of the proposed model.

This study is based on a two-stage DMENet methodology which uses HE-CNN as the key classification algorithm in each stage. One of the main issues we faced while developing the DME screening solution is the class-imbalance problem, as the datasets we are using contain a significant portion of DME-negative images and, more importantly, the grade 1 images constitute less than 10% of the whole dataset. This led to mis-classification of images with grade 1 characteristics; thus we have broken the tri-class (Grade 0, 1, 2) problem into a binary classification problem using the two-stage DMENet model depicted in Fig. 2. We have used a combination of pre-trained networks, namely ResNet [49], DenseNet [50], SqueezeNet [51], GoogLeNet [31] and SE-ResNet [52], as the learners in each cluster of the ensemble model. The gating systems, both local and global, are based on the CNN architectures given in Table 2.

The Gamma Ensemble follows the two-level HE-CNN architecture as shown in Fig. 5.
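The two-stage control flow reduces to a short dispatch; in this sketch the stage classifiers are hypothetical stand-ins (any callable returning DME presence or a severity grade works):

```python
def dmenet_grade(image, stage1_model, stage2_model):
    """Two-stage DMENet decision: stage 1 screens for DME presence
    (grade 0 vs. DME-positive); stage 2 grades positives (grade 1 vs. 2)."""
    if not stage1_model(image):   # stage 1: DME absent -> grade 0
        return 0
    return stage2_model(image)    # stage 2: severity grade (1 or 2)
```

Because stage 2 only ever sees DME-positive images, each stage faces a far less imbalanced binary problem than the original tri-class task.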

The CNN learners in the first cluster are set up using a combination of ResNet and DenseNet architectures. We used a pretrained ResNet-50 as learner-1 and a pretrained DenseNet-161 as learner-2, and the outputs of these learners were aggregated using an LGS whose architecture is based on the CNN-1 structure given in Table 2. For the second cluster, the corresponding LGS architecture given in Table 2 is used for aggregating the outputs. In order to combine the outputs of both these clusters, we used a GGS based on the CNN-3 architecture given in Table 2.

In the Delta Ensemble, the first cluster uses a pretrained DenseNet-169 as learner-1 and SE-ResNet-50 as learner-2, and we combined the outputs of these learners using LGS-1, which uses the CNN-2 architecture given in Table 2. The two pretrained CNN learners in the second cluster are based on the architectures of ResNet-50 and SqueezeNet; we used the CNN-1 architecture given in Table 2 as LGS-2 to combine the outputs of the learners in this cluster. The third cluster is composed of three pretrained CNN learners: DenseNet-161, ResNet-34 and GoogLeNet. The CNN-3 architecture is employed as LGS-3, and the CNN-4 architecture given in Table 2 is used for the GGS.

384
All the CNN learners were initialized with ImageNet [53] weights. The filter weights derived from ImageNet were then fine-tuned through back-propagation, thus minimizing the CNN's empirical cost in equation 6. We used Stochastic Gradient Descent, as discussed above, to minimize the cost function. Here the cost calculated over mini-batches of size 32 is used to approximate the cost over the entire training set. We set the learning rate α_m to 10^-3, ensuring proper convergence, and the scheduling rate τ, which depends on the speed of convergence, is set to 0.9. The training process for both the gating systems LGS and GGS is performed using the Adam optimizer [54].

The confusion matrix in Fig. 6.A shows the performance of the Gamma Ensemble on 360 test images, and Fig. 6.B gives the performance of the Delta Ensemble on 244 test images.

We also trained individual ResNet and DenseNet architectures, and the results obtained are shown in Table 4. However, we can see from the results that these models did not perform well, as they were overfitting the given data. Preventive measures were taken to control overfitting by adding a dropout factor of 50% on the fully connected layer before the output. Although dropout helped in reducing overfitting, the results obtained are not satisfactory compared to the results given by the HE-CNN ensemble. To elaborate the comparative analysis, we also evaluated other ensemble techniques: one in which a fully-connected layer processes all the features [57], [58]; Mixture-Ensemble, which uses multiple CNNs and a weight regulation network [59]; and Pruned-Ensemble, which uses pruning to retain the best pretrained models and then makes predictions using maximum voting [60].

The results of these ensemble techniques can be observed in Table 4. DMENet's performance in comparison to other existing methodologies is shown in Table 7.

In the HE-CNN architecture, one of the key components is the pair of gating systems (LGS and GGS), and we used the Adam optimizer while training them. We compared SGD with Nesterov momentum (with the learning rate set to 10^-2 and the momentum set to 0.9) against the Adam optimization method and concluded that the Adam optimizer performed better on this task. The results obtained are shown in Table 5.

We have analyzed the performance of the DMENet two-stage pipeline with respect to the tri-class problem to show the competence of the DMENet pipeline in DME screening. The fundus images used in DME screening are generally graded into one of three classes (Grade 0, 1 and 2). The results obtained by performing tri-class classification using pretrained CNNs as well as ensemble methods using HE-CNN are shown in Table 6, along with the results obtained using the corresponding pretrained CNN and HE-CNN. The results strongly validate our premise of breaking the tri-class classification problem into binary classifications over two stages.

In this paper we have successfully demonstrated the efficacy of our proposed DMENet methodology and HE-CNN architecture. The proposed methodology is simple yet extremely efficient. We strongly believe that the HE-CNN architecture is fairly general and, with little modification, it has the potential of solving such classification problems in the field of bio-medical imaging elegantly.