Towards the implementation of AI in a New Zealand national screening program: Cloud-based, Robust, and Bespoke

Convolutional neural networks (CNNs) have become a prominent method of AI implementation in medical classification tasks. Grading diabetic retinopathy (DR) has been at the forefront of the development of AI for ophthalmology. However, major obstacles remain to generalizing these CNNs to real-world DR screening programs. We believe these difficulties are due to the use of 1) small training datasets (<5,000 images), 2) private and 'curated' repositories, and 3) offline CNN implementation methods, while 4) relying on accuracy, measured as area under the curve (AUC), as the sole measure of CNN performance. To address these issues, the public EyePACS Kaggle Diabetic Retinopathy dataset was uploaded to the Microsoft Azure™ cloud platform. Two CNNs were trained: a "Quality Assessment" CNN and a "Classifier" CNN. The "Classifier" CNN's performance was then tested on both the 'un-curated' test set and the 'curated' test set created by the "Quality Assessment" CNN. Finally, the sensitivity of the "Classifier" CNN was boosted using two post-training techniques. Our "Classifier" CNN proved to be robust, as its performance was similar on the 'curated' and 'un-curated' sets. The implementation of the 'cascading thresholds' and 'max margin' techniques led to significant improvements in the "Classifier" CNN's sensitivity, while also enhancing the specificity of other grades.


Introduction
It is estimated that by 2040, nearly 600 million people will have diabetes worldwide (1). Diabetic retinopathy (DR) is a common diabetes-related microvascular complication, and is […] care, due to major generalizability issues of research-built AIs. Some of the major flaws of research-built AIs that hinder their generalizability are 1) the use of small training datasets (<5,000 images), 2) repositories that are often private and 'curated' to remove images deemed to be of low quality, and 3) a lack of external validation (17, 27-29). These issues are […]

It should be noted that different diabetic eye screening programs will have different requirements. It could be argued that the emphasis of a community-based screening program, potentially operating in remote and low-socioeconomic regions and using portable handheld cameras, is on distinguishing patients with no disease from those with any disease. In traditional screening, however, a CNN which is highly sensitive and removes the need for a significant (>70%) portion of images to be sent for human review would lead to immediate and significant cost savings for the program.

We are actively working towards the implementation of our DR-classifying AI within a long-established diabetic eye-screening program in New Zealand (37, 38). This program has multiple facets, including community-based and clinic-based screening phases. In this project, the public EyePACS Kaggle Diabetic Retinopathy dataset was used to develop two CNNs, based on one of the most sophisticated architectures available. Both CNNs were deployed and trained on the Microsoft Azure™ cloud platform: 1) a fundus image "Quality Assessment" CNN and 2) a DR "Classifier" CNN. The "Quality Assessment" CNN was used to create a 'curated' test set, in addition to the original 'un-curated' set. The performance of the "Classifier" CNN was assessed on both sets.
Finally, we used two post-training methods to boost the sensitivity of the "Classifier" CNN towards 1) the Healthy grade and 2) the most severe DR grade. We are actively pursuing clinical implementation of our AIs, and our recent findings should be of great interest to similar groups around the world.

[…]

The Gaussian blur technique was designed to remove variation between images due to differing lighting conditions, camera resolutions and image quality [Figure 3]. The "Classifier" CNN was trained and validated using the Microsoft Azure™ cloud platform.
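The lighting-normalisation idea behind the Gaussian blur step can be illustrated with a minimal 1-D sketch, not our production pipeline: subtract a Gaussian-blurred copy of the signal from the signal itself, so slow lighting gradients cancel out while local detail survives. In practice this is applied per colour channel of the 2-D fundus image; the `sigma`, `alpha` and `offset` values here are illustrative assumptions, not the parameters used in the study.

```python
import math

def gaussian_kernel(radius, sigma):
    # Discrete 1-D Gaussian weights, normalised to sum to 1.
    w = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    s = sum(w)
    return [v / s for v in w]

def blur_1d(signal, radius, sigma):
    # Gaussian-blur a 1-D signal, clamping indices at the edges.
    k = gaussian_kernel(radius, sigma)
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, wj in enumerate(k):
            idx = min(max(i + j - radius, 0), n - 1)
            acc += wj * signal[idx]
        out.append(acc)
    return out

def normalise_lighting(signal, radius=2, sigma=1.0, alpha=4.0, offset=128.0):
    # Subtract the local (blurred) average and re-centre around `offset`:
    # a flat region maps to `offset`, local deviations are amplified by `alpha`.
    blurred = blur_1d(signal, radius, sigma)
    return [alpha * (s - b) + offset for s, b in zip(signal, blurred)]
```

A uniformly lit region (constant signal) maps to the neutral offset, while a local bright feature such as a lesion remains above it, which is exactly the behaviour wanted when removing camera- and lighting-dependent variation.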

This was done twice: once for the binary DR grading classification (Healthy vs Diseased), and once for the tertiary DR grading classification (Healthy, Non-referable DR, and Referable DR).
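The two groupings of the five EyePACS severity grades (0–4) can be sketched as simple label maps. The exact cut-off for "referable" used here (grade 2, i.e. moderate NPDR, or worse) is an assumption based on common screening practice, not a detail stated in the text:

```python
def to_binary(grade: int) -> str:
    # Binary task: grade 0 is Healthy, any other grade is Diseased.
    return "Healthy" if grade == 0 else "Diseased"

def to_tertiary(grade: int) -> str:
    # Tertiary task (assumed grouping): 0 = Healthy, 1 = Non-referable DR
    # (mild disease), 2-4 = Referable DR (moderate NPDR or worse).
    if grade == 0:
        return "Healthy"
    return "Non-referable DR" if grade == 1 else "Referable DR"
```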

The cross-entropy and accuracy were tracked and recorded throughout the training and validation process. Training was monitored for 100 epochs, and the set of weights that yielded the minimal validation loss was selected for the subsequent CNN performance assessment.
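The "keep the weights with minimal validation loss" rule amounts to recording per-epoch metrics and selecting the best checkpoint afterwards. A framework-agnostic sketch (the 100-epoch budget comes from the text; the `run_one_epoch` callback is an illustrative stand-in for a real training step):

```python
def select_best_epoch(val_losses):
    # Return (epoch, loss) of the checkpoint with the minimal validation loss.
    best_epoch = min(range(len(val_losses)), key=lambda e: val_losses[e])
    return best_epoch, val_losses[best_epoch]

def train(num_epochs=100, run_one_epoch=None):
    # run_one_epoch(epoch) trains for one epoch and returns its validation
    # loss; every loss is recorded and the best checkpoint picked at the end.
    history = [run_one_epoch(e) for e in range(num_epochs)]
    return select_best_epoch(history)
```

In a real pipeline the model weights would be saved at each improvement (e.g. a checkpoint callback) rather than re-derived from the loss history.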

While the "Classifier" CNN was trained and validated using 'un-curated' data, it was tested separately using unseen 'curated' and 'un-curated' data. One would assume that using 'curated' (i.e. higher-quality) data for the CNN test would improve the performance of the model. Here, and for the first time, we wanted to assess this hypothesis [Tables 3 & 4].
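The curated/un-curated comparison can be expressed as filtering the test set with the "Quality Assessment" model's gradability score and evaluating the same classifier on both subsets. The 0.5 cut-off and the score interface below are illustrative assumptions:

```python
def curate(test_set, quality_score, threshold=0.5):
    # Keep only items the Quality Assessment model scores as gradable.
    return [item for item in test_set if quality_score(item) >= threshold]

def accuracy(test_set, predict):
    # test_set: list of (image, label) pairs; predict maps image -> label.
    correct = sum(1 for img, label in test_set if predict(img) == label)
    return correct / len(test_set)
```

Running `accuracy` once on the full set and once on `curate(...)`'s output gives the two numbers being compared; a "robust" classifier is one where the two differ little.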
Discussion
Research into AI design and its development for DR screening has progressed significantly in recent years, and the field has enjoyed a good deal of attention of late (54-56). However, for all the excitement, none of this work has progressed to a clinically useful tool providing a real-world AI solution for DR screening programs. This is largely due to the inability of research-driven AI to generalize to a real-world setup. Whilst there are many reasons for such a lack of generalization, the principal ones are the use of small and 'curated' datasets and an emphasis on the overall accuracy, rather than the sensitivity, of the developed AI. The AI's reliance on powerful computers that are not available in most clinical environments has been an additional contributory factor.

During this research, we endeavoured to address those issues that hinder the clinical translation of an in-house-developed AI for DR screening. Our "Classifier" CNN was developed and tested using real-world 'un-curated' data. Here we demonstrated that our "Classifier" CNN is 'robust', as its performance is not critically affected by the quality of the input data.

Furthermore, this process of data management, model training and validation was performed using Microsoft's Azure™ cloud platform. In doing so, we have demonstrated that one can build an AI that is constantly re-trainable and scalable through cloud computing platforms.

Although a few DR AIs are accessible online, to our knowledge this is the first time that an AI has been fully implemented and made re-trainable through a cloud platform. Hence, provided there is internet access, our AI is capable of reaching remote and rural areas traditionally not well served by existing DR screening services.

We have also successfully experimented with two "sensitivity-boosting" techniques: 'cascading thresholds' and the 'max margin' technique. We observed good improvements in the sensitivities and specificities of either Healthy or Diseased grades, depending on the application mode. In doing so, we boosted the AI's sensitivity for detecting Healthy cases to more than 98% (while also improving the specificities of the other, more severe classes). These techniques also boosted the AI's sensitivity for the referable disease classes to nearly 80%.
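The exact formulation of 'cascading thresholds' is not reproduced in this excerpt; one common reading, sketched below as an assumption, is to check classes in a priority order against per-class probability cut-offs instead of taking a plain argmax over the softmax outputs. The class names, ordering and threshold values are illustrative only:

```python
def cascading_threshold_predict(probs, order, thresholds):
    # probs: dict mapping class name -> softmax probability for one image.
    # order: classes checked from highest to lowest priority (e.g. most
    #        severe first, so a modest Referable probability still wins).
    # thresholds: per-class cut-offs; the first class in `order` whose
    # probability clears its cut-off is predicted. Otherwise fall back
    # to the ordinary argmax prediction.
    for cls in order:
        if probs[cls] >= thresholds[cls]:
            return cls
    return max(probs, key=probs.get)
```

Lowering the cut-off on a priority class trades some specificity elsewhere for higher sensitivity on that class, which matches the direction of the improvements reported above.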

The sensitivity of a screening test is the percentage of those with the condition who are correctly detected; the specificity is the percentage of those without the condition who are correctly identified as disease-free, its complement (1 − specificity) being the proportion of healthy people referred unnecessarily.
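Concretely, both quantities fall out of a 2×2 confusion matrix:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    # Sensitivity: diseased cases correctly flagged, TP / (TP + FN).
    # Specificity: healthy cases correctly passed,  TN / (TN + FP);
    # 1 - specificity is the fraction of healthy people referred unnecessarily.
    return tp / (tp + fn), tn / (tn + fp)
```

For example, 90 detected out of 100 diseased and 80 correctly passed out of 100 healthy gives a sensitivity of 0.90 and a specificity of 0.80.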

Within all screening programs, the need to balance high sensitivity with an acceptable specificity has long been recognised. […] for adjudication and quality assurance.

Arguably, one of the biggest challenges facing all AI-based "diagnostic" systems is the issue of public trust. Whereas it is accepted that in a screening program with a sensitivity of 90%, 1 in 10 patients with the disease will be informed that they are healthy when in fact they are not, well-publicised failures of AI systems suggest that the public would not accept such failure rates from a "computer" (61). Whilst the relatively simple CNN described in this paper lacks the required sensitivity to be the sole arbitrator for identifying referable disease in a structured screening program, the fact that the methods we describe boosted the sensitivity of the CNN for detecting disease by over 10% in most cases is noteworthy. We therefore believe that the techniques we describe here will prove to be valuable tools for those looking to build bespoke CNNs in the future.

In conclusion, we have demonstrated how existing machine learning techniques can be used to boost the sensitivity of a CNN classifier to detect both health and disease. We have also demonstrated how even a relatively simple classifier CNN, one that is capable of running on a […]

Competing interests
The authors declare no competing interests.