Towards implementation of AI in New Zealand national diabetic screening program: Cloud-based, robust, and bespoke

Convolutional Neural Networks (CNNs) have become a prominent method of AI implementation in medical classification tasks. Grading Diabetic Retinopathy (DR) has been at the forefront of the development of AI for ophthalmology. However, major obstacles remain in the generalization of these CNNs onto real-world DR screening programs. We believe these difficulties are due to use of 1) small training datasets (<5,000 images), 2) private and ‘curated’ repositories, 3) locally implemented CNN implementation methods, while 4) relying on measured Area Under the Curve (AUC) as the sole measure of CNN performance. To address these issues, the public EyePACS Kaggle Diabetic Retinopathy dataset was uploaded onto Microsoft Azure™ cloud platform. Two CNNs were trained; 1 a “Quality Assurance”, and 2. a “Classifier”. The Diabetic Retinopathy classifier CNN (DRCNN) performance was then tested both on ‘un-curated’ as well as the ‘curated’ test set created by the “Quality Assessment” CNN model. Finally, the sensitivity of the DRCNNs was boosted using two post-training techniques. Our DRCNN proved to be robust, as its performance was similar on ‘curated’ and ‘un-curated’ test sets. The implementation of ‘cascading thresholds’ and ‘max margin’ techniques led to significant improvements in the DRCNN’s sensitivity, while also enhancing the specificity of other grades.


Introduction
It is estimated that by 2040, nearly 600 million people will have diabetes worldwide [1]. Diabetic retinopathy (DR) is a common diabetes-related microvascular complication, and is the leading cause of preventable blindness in people of working age worldwide [2,3]. It has been estimated that the overall prevalence of non-vision-threatening DR, vision-threatening DR and the blinding diabetic eye disease were 34�6%, 10�2%, and 6�8% respectively [3][4][5][6]. Clinical trials have shown that the risk of DR progression can be significantly reduced by controlling major risk factors such as hyperglycaemia and hypertension [7][8][9]. It is further estimated that screening, appropriate referral and treatment can reduce the vision loss from DR by 50% [10][11][12]. However, DR screening programs are expensive to implement and administer and even in In this, paper we report the use of two established post training techniques; cascading thresholds and margin max to manipulate the sensitivity and specificity of a bespoke DRCNN towards: 1. No disease and 2. Sight threatening DR. We are actively pursuing clinical implementation of our AIs and as such our recent findings would be of great interest for similar groups around the world.

Methodology
Here is a short summary of the methodological steps in this project, which are explained in more details subsequently. The original EyePACS Kaggle DR dataset was obtained and uploaded onto the Microsoft Azure™ platform. Initially, a "Quality Assessment" CNN was trained for assessing the quality of the retinal images. Next and to better match the New Zealand grading scheme, the original grading was modified in two ways [Healthy vs Diseased] and [Healthy vs Non-referable DR vs Referable DR]. The uploaded dataset was then divided into training (70%), validation (15%) and test (15%) sets. A separate DR classifier CNN was then trained on the Microsoft Azure™ platform, using the 'un-curated' training and validation datasets. The not-seen-before test set was then analysed by the "Quality Assessment" CNN thus creating a 'curated' test set in addition to the original 'un-curated' set. The performance of the DRCNN was then assessed using both 'curated' and 'un-curated' test sets. Finally, the 'cascading threshold' and 'margin max' techniques were implemented post-training, to investigate their effects on boosting the sensitivity of the DRCNN [Fig 1].

Quality assessment dataset
A subset of 7,026 number of images (8% of the entire dataset), randomly chosen from the original set were used for creating the "Quality Assessment" CNN. The images were audited by a senior retinal specialist (DS) and labelled as 'adequate' or 'inadequate' (3400 \ 3626) respectively [ Fig 2]. They were then split into (70%) training, (15%) validation and (15%) testing sets.

"Quality-assessment" CNN architecture
To choose the optimum CNN design, several architectures were tested on Microsoft Azure™ cloud platform. These included ResNet, DenseNet, Inception and Inception-ResNet and Inception-ResNet-V2 [41]. The "Quality-Assessment" CNN was then based on a modified version of the InceptionResNet-V2 architecture. This particular neural network structure, in our experience, has faster convergence and avoids converging onto local minima. For our purposes, the number of neurons in the final output layer was changed to two, corresponding to 'adequate' and 'inadequate' classes. The learning rate was 0.001, using ADAM optimizer, with a mini-batch size of 30, and training was continued to 140 epochs.

Classifier dataset
The public Kaggle Diabetic Retinopathy was downloaded through EyePACS, which can be found in https://www.kaggle.com/c/diabetic-retinopathy-detection/data. This dataset contains 88,700 high-resolution fundus images of the retina, labelled as No DR, Mild, Moderate, Severe, Proliferative DR. To mimic the decision making of the New Zealand national DR Screening program, the original grading was remapped to three cohorts of Healthy, Non-referable DR and Referable DR [ Table 1]. Furthermore, as one potential gain of using an AI in DR Screening program is to quickly identify those that are healthy, separately the dataset was remapped to the broad classification of Healthy vs Diseased [ Table 2].  The public data attaiment and upload onto the Microsoft Azure cloud platform was the first step. "quality assessment" CNN was trained to identify adequate and inadequate images. the entire public dataset was then devided to training, validation and test sets. The test set was then 'curated' by the "quality assessment" CNN. The DRCNN was trained on un-curated data, and then tested on 'curated' and 'un-curated' data. Its performance was also assessed using 2 or 3 DR labels. https://doi.org/10.1371/journal.pone.0225015.g001 Each re-categorized dataset was then split into training set, validation set and testing set with corresponding ratios of 70%, 15% and 15% respectively. The 'curated' dataset was created by excluding lower quality images from the 'un-curated' set, as identified by the "Quality Assessment" CNN.

Pre-processing
The Kaggle EyePACS images were cropped and resized to 600 � 600. The choice of image size was to minimize the computational load on the Microsoft Azure™ platform, while not compromising the performance of the trained CNNs. According to existing literature [42] and based on our experience, larger image sizes would have led to diminishing returns in accuracy and overall performance of the designed CNNs. The resized cropped images were enhanced by applying a Gaussian blur technique [43], using the equation below.
A series of Gaussian blur parameters were tried and an optimum set was chosen by a senior retinal specialist (DS): The Gaussian blur technique has been designed to remove the variation between images due to differing lighting conditions, camera resolution and image qualities [Fig 3].

DRCNN architecture
The DRCNN was then designed based on the Inception-ResNet-V2 architecture, since this architecture has outstanding network capacity, faster convergence speed and better stability, which are critical when training utilizing such a large dataset. Three sequential layers of a Glo-balAveragePooling2D layer, a dropout layer (dropout rate = 0.3) and a fully connected layer were added to the original architecture. The activation function of the added dense layer was a Softmax function; and cross-entropy loss/error was utilized as the loss function, while Adam algorithm was utilized as the optimizer. The CNN was initially trained using the ImageNet dataset (i.e. transfer learning). An image shuffling function was applied prior to each mini- batch assembly and mini-batches were normalized prior to each epoch. In machine learning literature, k-fold cross validation is routinely used for detect any potential model's bias. Here, due to the size of the dataset and the cloud implementation of DRCNN (i.e. resource constraints), it was ensured that the DRCNN is unbiased in its performance by using randomlysplit training, validation and testing sets. The learning rate was 0.001 and a mini-batch size of 64 was used for model training, and training was continued for 140 epochs [Fig 4]. Finally, a weighted loss-function was used here to address the class imbalance of the Kaggle EyePACS dataset. It appeared that the DRCNN was over-fitted after the 120th epoch. The best-performing epoch was then manually selected based on minimum cross-entropy loss of the validation set.

Cascading thresholds
The cascading thresholds technique has been used previously in the literature, in order to boost the sensitivity of a given CNN [35]. Normally, a classifying CNN has a Softmax layer followed by a Classification Output layer as its last layers. The Softmax layer generates a list of probabilities of given input (i.e. fundus photo), to belong to a certain class (i.e. Healthy, Nonreferable DR, Referable DR). The Classification Output layer will then choose the class with maximum probability as the outcome of the classifier CNN. Alternatively, to increase the sensitivity of the CNN towards a specific grade (e.g. Referable DR), sub-maximum probabilities of that specific grade could be used.
An example of (~, 0.3, 0.3) cascading thresholds limit is presented here. Following AI's image analysis, if the output of the Softmax layer for the Referable class reaches the threshold of 0.3, then regardless of the less severe grades probabilities, the image is classified as Referable. If the image is not classified as Referable and if the Softmax layer output of Non-referable DR grade reaches the threshold of 0.3, this image is then assigned to this grade, regardless of the Healthy grade probability. Otherwise, the photos that are not classified as either Referable DR or Non-referable DR, are classified as Healthy. Here, we experimented with the cascading

Margin max
To our knowledge, the 'margin max' technique has not previously applied in similar studies. In this method, if the top two less sever classes' probabilities (e.g. Healthy and Non-referable DR) are within a set given threshold, to boost the sensitivity of a certain class (e.g. Healthy) the maximum rule will be ignored. As an example, consider the case of the 'margin max' of (0.2) for boosting the sensitivity of the Healthy grade. If the Softmax scores of Healthy, Non-referable and Referable DR were assigned as [0.3-0.45-0.25] respectively, the Healthy grade is chosen although it is not the maximum of three probabilities.

Microsoft Azure parameters
A standard NV6 Windows instance (6 VCPUs, 56 GB memory) from East US 2 region was selected as the training virtual machine. An additional standard SSD disk of 1023 GB storage space was attached to the training virtual machine [ Fig 5]. Resources rented on a cloud platform, such as Microsoft Azure, have a direct impact on the cost of operating an AI as a screening tool. The Azure setup parameters for this project was very modest, as a very expensive but efficient cloud infrastructure (i.e. GPU based Virtual Machines) will become a barrier to its clinical implementation and scalability. However the utility of the screening tests result will be influenced by the prevalence of the diseased state being screened and as such it is also helpful to calculate the negative and positive predictive values.
The positive predictive value (PPV) is the probability that subjects with a positive screening test is diseased, defined by: The negative predictive value (NPV) is the probability that subjects with a negative screening test is disease free, defined by:

NPV ¼ number of true negatives number of true negatives þ number of false negatives
In a disease like sight-threatening DR, where the prevalence of the diseased state is low (<5%), a tool which drives the false negative rate to near zero will therefore generate not only a high sensitivity, but also a very high negative predictive value of a negative test result. F1 score is a measurement of classification accuracy that considers both positive predictive value and sensitivity and is defined by:

Generate the curated testing set
The "Quality Assessment" CNN reached 99% accuracy and the validation loss of lower than 0.05. This CNN model was then used to create a 'curated' test set from the Kaggle EyePACS dataset. The 'curated' test set included 6,900 images from the original 13,305 'un-curated' set (i.e. a 47% rejection rate).

Un-curated testing set versus curated testing set
The DRCNN was trained and validated using the Microsoft Azure™ cloud platform. This was done twice, once for the binary DR grading classification Healthy vs Diseased, and once for the DR grading classification Healthy, Non-referable DR, and Referable DR. The crossentropy and accuracy were tracked and recorded throughout the training and validation process. The training progress was monitored for 140 epochs and the best set of weights that resulted in minimal validation loss was picked and set for the proceeding CNN performance assessment. While, the DRCNN was trained and validated using 'un-curated' data, it was tested separately using unseen 'curated' and 'un-curated' data. One would assume that using 'curated' (i.e. higher quality) data for the CNN test would improve the performance of the model. Here and for the first time, we wanted to assess this hypothesis [Tables 3&4]. Interestingly, the DRCNN prediction performance improved only marginally for the 'curated' test sets, compared to the 'un-curated' set. It appeared that boosting the sensitivity using both 'cascading thresholds' and 'margin max' had a similar effect for 'curated' and 'un-curated' datasets. Also, it seemed that uplifting the sensitivity of the Healthy grade, also enhanced the specificity of the Diseased state, and vice versa.
Here we have shown [ Tables 7 & 8] that by adjusting the post processing of the outcome of a CNN, we have outperformed the previously published best performance of Kaggle EyePACS, which was later failed to be replicated [37].
There have been several studies that have used the EyePACS dataset for training a DR AI [28, [44][45][46][47]. However, these studies are not directly comparable to the DRCNN presented here, as each study uses different (sometimes private) training sets, validation sets, DR grading schemes and performance reporting metrics. Regardless, a comparison of the performance of these models against the un-boosted and boosted DRCNN is provided here [ Table 9]. It can be seen that while our DRCNN is not the best performing neural network (although different studies are not directly comparable), different performance aspects of it (sensitivity to healthy or diseased, etc.) would outperform other published CNNs, depending on the boosting strategy. The confusion matrix of each study mentioned here is presented as a S1 Data.

Discussion
Diabetic retinopathy (DR) is the most common microvascular complication of diabetes and is the leading cause of blindness among the working-age population [48]. Whilst the risk of sight loss from DR can be reduced through good glycaemic management [49], if sight-threatening DR develops, timely intervention with laser photocoagulation or injections of anti-vascular endothelial growth factor is required [50,51]. Thus, if the risk of sight loss is to be reduced, individuals with diabetes should have their eyes screened regularly to facilitate the detection of DR, before vision loss occurs [52]. Unfortunately, in many regions including New Zealand, the attendance at DR screening falls below the recommended rates [53][54][55], and this is particularly true for those who live in remote areas and those of lower socioeconomic status [56][57][58].
There remain then significant challenges to ensure that delivery of these services is equitable and that all patients at risk are being screened regularly. Currently, the existing model of DR screening is resource intensive, requiring a team of trained clinicians to read the photographs. They also have high capital setup costs, related to the retinal cameras required to take the photographs, and require an efficient administrative IT support system to support and run them. As a result, the existing DR programs are relatively inflexible and are not easily scalable. All these issues are more acute in the developing world, which lack both capital funds and trained healthcare professionals [59]. Incorporating AI to accurately grade the retinal images for DR would offer many benefits to DR screening programs; reducing their reliance on trained clinicians to read photographs, enabling point of contact diagnosis whilst reducing the need for complex IT support systems. Research into AI design and its development for DR screening has progressed significantly in recent years, and this field has enjoyed a good deal of attention of late [60][61][62]. However, for all the excitement very little of this work has progressed to a clinically useful tool which provides a real-world AI-solution for DR screening programs and this is due largely to the challenges of the research-driven AI to generalize to a real-world setup. Whilst there are many reasons for such a lack of generalisation, the principal ones are the use of small and 'curated' datasets and an emphasis on overall accuracy, rather than sensitivity of the developed AI. The AI's reliance on powerful computers that are not available in most clinical environments has been an additional contributory factor. Traditionally, several metrics have been used to describe the performance of a DRCNN including, but not limited to accuracy, sensitivity, specificity, precision, negative and positive predictive values. However traditional screening is a binary exercise, categorising patients into those at low risk of having disease from those at high risk of having disease. As such, there is trade-off between the need for a high sensitivity with an acceptable specificity. Traditional diabetic eye screening programs have therefore mandated a minimum sensitivity of >85% and specificity of >80% for detecting sight-threatening diabetic retinopathy as there is a personal and financial cost associated with unnecessary referrals to eye clinics [63]. Whilst it is appropriate that a screening program has to strike the correct balance between these parameters, we envisage that in many situations, a classifier DRCNN will not be the sole arbitrator for grading diabetic retinopathy. It is therefore appropriate to consider the role that a classifier DRCNN could play in a diabetic eye screening program as currently there has been very little discussion of this subject. In this study, we report the results of two techniques, 'cascading thresholds' and 'margin max' to assess how they could be used to drive up the sensitivity, the negative predictive value, or the specificity and the positive predictive value in bespoke DRCNN's depending on the applications mode. In doing so, we boosted the AI's sensitivity to detect Healthy cases to more than 98%, while also improving the specificity of the other more severe classes. These techniques also boosted the AI's sensitivity of referable disease classes to near 80%.
Using the techniques described in this paper, it then becomes possible to develop classifier DRCNNs, tailored to the specific requirements of the DR program they have been commissioned to support [63]. The United Kingdom national Ophthalmology database study revealed that of the 48,450 eyes with structured assessment data at the time of their last record, 11,356 (23.8%) eyes had no DR and 21,986 (46.0%) had "non-referable" DR [64]. Thus a sensitivity boosted classifier like the one described here, manipulated to detect eyes with referable DR with a very high sensitivity (>95%) and ultra-high negative predictive value (>99.5%), could be embedded into an existing eye screening program rapidly and safely triaging eyes with referable disease from eyes which do not. Such a DRCNN would reduce the number of images sent to the human grading team for review by between 70-80%, leading to immediate and significant cost savings for the program.
In the context of a rural or developing-world setting, the ability to identify those patients with no disease from those with any disease may be desirable as it is well-recognised that the development of any DR, no matter how "mild" is a significant event [65]. Thus its detection could be used to target valuable and scarce health care resources more effectively to those individuals at highest risk to ensure that their diabetes and blood pressure control are reviewed. In effect, even if no other CNN approach was then used, the relatively simple cloud-based DRCNN we described here would help identify those patients at increased risk of either advanced disease or disease progression, and who therefore merit review of their systemic disease. Moreover, using the techniques described here, more sophisticated classifier CNNs could also be developed, ones that are manipulated to detect disease with a very high sensitivity. It is even conceivable that different classifier CNNs could then be run concurrently within diabetic eye screening programs to sequentially grade differing levels of disease with high sensitivity, ultimately leaving the human grading team with a relatively small number of images to review for adjudication and quality assurance.
Arguably, one of the biggest challenges that faces all AI-based "diagnostic" systems is the issue of public trust. Whilst it is accepted that in a traditional screening program with a sensitivity of 90%, 1 in 10 patients will be informed that are healthy when in actual fact they have disease, well-publicised failures of AI systems suggest that the public would not accept such failure rates from a "computer". In this context, the negative predictive value is arguably more important than traditional sensitivity and specificity. Whilst the relatively simple CNN described in this paper lacks the required sensitivity to be the sole arbitrator for identifying referable disease in a structured screening program, the fact that the methods we describe boosted both the sensitivity of the DRCNN to detect disease by over 10%, and thus the negative predictive value to near 100%, is noteworthy. We therefore believe that the techniques we describe here will prove to be valuable tools for those looking to build bespoke CNN's in the future.
In this research, we have endeavoured to address those issues that hinder the clinical translation of an in-house bespoke AI for DR screening. Our DRCNN was developed and tested using real-world 'un-curated' data. Here we demonstrated that our DRCNN is 'robust', as its performance is not critically affected by the quality of the input data. Furthermore, this process of data management, model training and validation was performed using Microsoft's Azure™ cloud platform. In doing so, we have demonstrated that one can build AI that is constantly retrainable and scalable through cloud computing platforms. Although few DRCNN's are accessible online, to our knowledge this is the first time that an AI has been fully implemented and re-trainable through a cloud platform. Hence, provided there is internet access, our DRCNN is capable of reaching remote and rural places; areas traditionally not well served by existing DR screening services.
In conclusion, we have demonstrated how existing machine learning techniques can be used to boost the sensitivity, and hence negative predictive value, and specificity of a DRCNN classifier. We have also demonstrated how even a relatively simple classifier CNN, one that is capable of running on a cloud-based provider, can be utilised to support both existing DR screening programs and the development of new programs serving rural and hard to reach communities. Further work is required to both develop classifiers that can detect sight-threatening DR with a very high sensitivity, and evaluate how a battery of DRCNN's each with differing specifications and roles, may be used concurrently to develop a real-world capable, fully automated DR screening program.