Deep learning for AI-based diagnosis of skin-related neglected tropical diseases: A pilot study

Background
Deep learning, part of the broader concept of artificial intelligence (AI) and machine learning, has achieved remarkable success in vision tasks. While there is growing interest in using this technology for diagnostic support of skin-related neglected tropical diseases (skin NTDs), studies in this area have been limited, and fewer still have focused on dark skin. In this study, we aimed to develop deep learning-based AI models with clinical images we collected for five skin NTDs, namely Buruli ulcer, leprosy, mycetoma, scabies, and yaws, to understand how diagnostic accuracy can or cannot be improved using different models and training patterns.
Methodology
This study used photographs collected prospectively in Côte d'Ivoire and Ghana through our ongoing studies using digital health tools for clinical data documentation and teledermatology. Our dataset included a total of 1,709 images from 506 patients. Two convolutional neural networks, the ResNet-50 and VGG-16 models, were adopted to examine the performance of different deep learning architectures and validate their feasibility in diagnosing the targeted skin NTDs.
Principal findings
The two models were able to correctly predict over 70% of the diagnoses, and performance improved consistently with more training samples. The ResNet-50 model performed better than the VGG-16 model. A model trained with PCR-confirmed cases of Buruli ulcer yielded a 1-3% increase in prediction accuracy across all diseases except mycetoma, compared with a model whose training set included unconfirmed cases.
Conclusions
Our approach was to have the deep learning model distinguish between multiple pathologies simultaneously, which is close to real-world practice. The more images used for training, the more accurate the diagnosis became. The percentage of correct diagnoses increased with PCR-positive cases of Buruli ulcer. This suggests that training on more accurately diagnosed cases may also yield better accuracy in the generated AI models. However, the increase was marginal, which may indicate that clinical diagnosis alone is reliable to an extent for Buruli ulcer. Diagnostic tests also have their flaws and are not always reliable. One hope for AI is that, with the addition of another tool, it will objectively resolve this gap between diagnostic tests and clinical diagnosis. While there are still challenges to overcome, AI has the potential to address unmet needs where access to medical care is limited, as for those affected by skin NTDs.


Introduction
Deep learning has achieved remarkable success in vision tasks such as image classification, image localization, and semantic segmentation, including skin disease prediction. Deep learning is part of the broader concept of artificial intelligence and machine learning, in which vast volumes of data and complex algorithms are used to train a model to perform certain tasks. The success of the approach can undoubtedly be attributed to its ability to learn abstract semantic knowledge from visual signals through a hierarchical network architecture [1]. It is gaining interest and becoming increasingly important in the field of dermatology in this digital era. Evidence is accumulating that deep learning can assist healthcare providers in making better clinical decisions, sometimes even exceeding human judgement [2,3,4]. However, many of the diseases studied are pigmented lesions such as melanoma and basal cell carcinoma, or inflammatory dermatoses, which often affect people with lighter skin color and thus provide a high degree of contrast [5,6,7].
Skin-related neglected tropical diseases, or skin NTDs, comprise a group of infectious diseases whose morbidity is expressed on the skin. They include at least nine diseases and disease groups listed by the World Health Organization (WHO) [8]. More than 1 billion people are known to be either at risk of or infected with skin NTDs [9]. They prevail mainly in poor communities of low- and middle-income countries (LMICs), where resources are scarce and there are limited numbers of dermatologists to diagnose the conditions. Additionally, skin NTDs more often affect people of color. The availability of screening systems is therefore critical for this set of diseases, as it will enable earlier diagnosis and treatment. The longer the delay in diagnosis, the more patients with skin NTDs may be left with life-long disabilities and deformities.
While there is growing interest in the use of deep learning for diagnosis of skin NTDs to fill these gaps, there have been limited studies to date investigating the development of an AI model for a combination of these less studied diseases in less studied populations. In this study, we aimed to develop deep learning-based AI models with clinical images we collected for five skin NTDs, namely Buruli ulcer, leprosy, mycetoma, scabies, and yaws, to understand how diagnostic accuracy is influenced by different models, especially when the training images are relatively small in number and collected under diverse conditions. All images are from dark-skinned African populations, with Fitzpatrick skin type IV or above. We anticipate that our findings will support the future development of AI models for skin NTDs, as well as for other skin diseases in people with darker skin types.

Ethics statement
The study obtained ethical approvals from the institutional review boards of the Tulane University School of Public Health and Tropical Medicine (2020-2054-SPHTM) (USA), the Ministry of Health of Côte d'Ivoire (No. IRB000111917), and the Ministry of Health of Ghana (GHS-ERC:014/05/21). Written informed consent was obtained from all patients for use of their images.

Data collection
This study used photographs collected prospectively in the West African countries of Côte d'Ivoire and Ghana through our ongoing studies using digital health tools for clinical data documentation and teledermatology. The design of this study is described elsewhere [10]. Briefly, photographs of skin lesions were collected along with clinical information including sex, age, past medical history, contact history, and disease description such as body location, duration, patient complaints including itchiness and pain / no pain, and progression, to support dermatologists in providing diagnoses remotely. The photographs were taken by nurses or community health workers who had been trained in dermatological photography during a course provided by dermatologists. They were taken using the camera of Lenovo Tab M10 FHD Plus smart tablets, under field conditions and in rural clinics, in a total of six health districts (four in Côte d'Ivoire and two in Ghana) known to be endemic for one or more skin NTDs. Image resolution was 1920 x 2560 pixels, stored in JPEG format.

Dataset screening
Images were selected from our data repository for cases in which a diagnosis of one of the five targeted diseases (Buruli ulcer, leprosy, mycetoma, scabies, and yaws) was made either remotely or in person. For cases diagnosed remotely, diagnosis was made independently by two dermatologists with more than 10 years of experience diagnosing patients in the context of Côte d'Ivoire and Ghana (RRY, AD). In case of disagreement between the two, a third dermatologist (BV) was invited to review the case, and discussions were held to agree on the final diagnosis. Some of our cases were diagnosed in person by a dermatologist during monitoring visits for the parent projects; in these cases, the in-person diagnosis was regarded as superior to the remote diagnosis. A portion of Buruli ulcer cases underwent polymerase chain reaction (PCR) testing for confirmation. Likewise, dual path platform (treponemal and non-treponemal) (DPP) testing (Chembio Diagnostics, Medford, NY, USA) was done for a portion of yaws cases. Table 1 summarizes the data for the five diseases, with the number of patients and number of images for each disease. Multiple images were obtained for most patients. For Buruli ulcer and yaws, the numbers in parentheses are those with positive PCR and DPP results, respectively.

AI-based skin disease diagnosis model
Convolutional neural networks (CNNs) are popular deep learning techniques for extracting feature representations from visual image samples for image classification tasks such as disease diagnosis. A skin image is basically a 2-dimensional (2D) grid of pixel values. CNNs are multilayer neural networks with 2D convolutional filters that capture visual patterns from skin images and generate low-dimensional feature vectors. In the context of deep learning, a vector is an ordered collection of elements where each element corresponds to a specific feature of the input sample. In this study, we adopted two popular CNN architectures, ResNet-50 (50-layer residual neural network) [11] and VGG-16 (16-layer Visual Geometry Group network) [12]. These two models are frequently used in visual classification tasks, with different architectures and performance. Evaluating the performance of two (or more) models on the same diagnosis task and data can offer reassurance that the results are not overly dependent on a particular choice of model.
All original images were resized to 224 x 224 pixel resolution with 3 RGB (red, green, blue) channels to fit the input of the deep learning models. Data augmentation and normalization pre-processing strategies were also employed, following existing image classification tasks [13]. The images were then fed into the ResNet-50 or VGG-16 model, pretrained on the ImageNet dataset, a large-scale, open-source image repository [14]. Each image was represented as a 2048-dimensional feature vector for ResNet-50 and a 4096-dimensional feature vector for VGG-16. We then designed the disease diagnosis classifier to output a 5-dimensional vector as a 5-disease probability vector. For model optimization, we adopted stochastic gradient descent (SGD) with a momentum of 0.9 as the optimizer to update the whole network's parameters (i.e., the backbone and classifier parameters). We performed the experiments using the PyTorch library running on one Graphics Processing Unit (GPU) (NVIDIA Titan V).

Analysis
As part of our quantitative analysis, we performed Task 1 to assess how model performance would change as more training data were included. To train our models, the images from k% of patients, chosen at random from our collection, were used as the training set (k% being the percentage of patients used in training), and the images from the remaining patients were used to test model performance. The value of k was increased progressively. Training and test images did not overlap. Furthermore, we tested whether and how laboratory confirmation may change the accuracy of the classifier. For this, we performed two kinds of experiments: first, using all cases [all cases] (Task 1(a)), and second, using only those cases that tested positive with PCR or DPP for Buruli ulcer and yaws, respectively [test positives] (Task 1(b)). Otherwise, the analysis was the same. We adopted two metrics, Top-1 accuracy (%) and the Matthews correlation coefficient (MCC, 0-1), to evaluate our models [15]. Top-1 accuracy measures the proportion of test images for which the predicted disease matches the single target disease. MCC is a reliable statistical score that produces a high value only if the prediction obtained good results in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives). Next, we performed a comparison test (Task 2), matching the number of training images, to compare the effect of diagnostic confirmation against clinical diagnosis. This was necessary because the performances of the models trained in Task 1 were not directly comparable, since the numbers of training images in the two experiments differed. We used Buruli ulcer cases for this task, as we had more PCR-confirmed Buruli ulcer cases than DPP-confirmed yaws cases, and ran the two experiments.
In Task 2(a), we used 50 clinically diagnosed Buruli ulcer patients (including every patient diagnosed with Buruli ulcer irrespective of their PCR results) [clinical diagnosis] plus the other 4 diseases (19, 6, 54, and 31 cases of leprosy, mycetoma, scabies, and yaws, respectively) to train the model based on ResNet-50. In Task 2(b), we used 50 PCR-positive Buruli ulcer cases [test positives] plus the same samples from the other 4 diseases (19, 6, 54, and 31 cases of leprosy, mycetoma, scabies, and yaws, respectively) to train the model based on ResNet-50. The test set was the same for the two models and included 100 clinically diagnosed Buruli ulcer cases plus the other 4 diseases (19, 5, 54, and 75 cases of leprosy, mycetoma, scabies, and yaws, respectively). As before, there was no patient overlap between the training and test sets.
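The patient-level split and the two evaluation metrics described above can be sketched as follows (the patient IDs and labels are toy data; `split_by_patient` is a hypothetical helper, and scikit-learn's `matthews_corrcoef` supplies the MCC):

```python
# Patient-level train/test split: one patient's images never appear in
# both sets, matching the "no patient overlap" rule in the text.
import random
from sklearn.metrics import matthews_corrcoef

def split_by_patient(patient_ids, k, seed=0):
    """Assign k% of unique patients to the training set (hypothetical helper)."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * k / 100)
    return set(ids[:n_train]), set(ids[n_train:])

train_ids, test_ids = split_by_patient(range(100), k=50)
assert train_ids.isdisjoint(test_ids)

# Top-1 accuracy and MCC on toy predictions (0-4 encode the 5 diseases).
y_true = [0, 1, 2, 2, 3, 4, 0, 1]
y_pred = [0, 1, 2, 3, 3, 4, 0, 2]
top1 = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
mcc = matthews_corrcoef(y_true, y_pred)
print(f"Top-1: {top1:.2f}")  # Top-1: 0.75
```

Splitting by patient rather than by image matters because multiple images per patient would otherwise leak between the training and test sets.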
For our qualitative analysis, we reviewed every image with the likelihood of the diagnosis given by the prediction model [prediction label] compared with the actual diagnosis [true label]. This was to assess what resulted in incorrect predictions by our pilot AI model. We based this analysis on the ResNet-50 model trained with k = 50% of the data for all cases (Task 1(a)).
Lastly, to further understand why we achieved better performance for some diseases than others, we used a dimensionality reduction method, Principal Component Analysis (PCA) [16], to map the learned visual representations (the 2048-dimensional ResNet-50 features) of each test-class image to a 2-D plane. The goal was to visualize the learned feature representation and provide a direct way to understand the discriminative ability of AI features extracted from raw skin images. An uncertainty score was also given to each test image, calculated from the correlation between the predicted probability and a random guess; higher correlation means a higher uncertainty score. The uncertainty score indicates the degree of irrelevant evidence the AI model finds in a given test image when predicting its diagnosis. For example, Fig 2A shows a true-label score for yaws of 0.187 and a predicted label for Buruli ulcer of 0.254, with a high uncertainty of 0.93: the model predicted the image to be more like Buruli ulcer than yaws, but it was also highly uncertain. An uncertainty score closer to 1 represents higher uncertainty in the diagnosis output. When the model is 100% uncertain, the prediction amounts to a random guess, with a confidence score of 0.200 (5 diseases, 1/5 = 0.200). The AI prediction is better when the uncertainty score is lower, although the diagnosis could still be incorrect. Fig 3A shows the training samples while Fig 3B lists
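The two steps above can be illustrated as follows. The features are random stand-ins, and the `uncertainty` function is our assumption since the exact correlation computation is not spelled out here: it returns 1.0 when the prediction equals the uniform random guess (0.200 per disease) and approaches 0 for a confident one-hot prediction.

```python
import numpy as np
from sklearn.decomposition import PCA

# Project stand-in 2048-d "ResNet-50 features" of 100 test images to 2-D.
rng = np.random.default_rng(0)
feats = rng.random((100, 2048))
coords = PCA(n_components=2).fit_transform(feats)
print(coords.shape)  # (100, 2)

def uncertainty(probs, n_classes=5):
    """Illustrative proxy: 1.0 when probs equals the uniform random
    guess, near 0.0 for a one-hot (fully confident) prediction."""
    uniform = np.full(n_classes, 1.0 / n_classes)
    max_dev = 2.0 * (1.0 - 1.0 / n_classes)  # L1 distance of a one-hot vector
    return 1.0 - np.abs(probs - uniform).sum() / max_dev

print(uncertainty(np.array([0.2, 0.2, 0.2, 0.2, 0.2])))  # 1.0
print(uncertainty(np.array([1.0, 0.0, 0.0, 0.0, 0.0])))  # effectively 0 (up to rounding)
```

Plotting `coords` colored by true label is what reveals whether the learned features separate the five diseases.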

Discussion
In this report, we explored how deep learning might help in the screening and/or diagnosis of skin NTDs, which often affect people with darker skin tones. Two deep learning models were examined in our work. Between the ResNet-50 and VGG-16 models, we conclude that ResNet-50 achieved better performance (around 2% better prediction across all evaluations) on our skin images. The major difference between the two models is their depth: ResNet-50 contains 50 layers of convolutional and pooling operations, while VGG-16 contains only 16. Generally, deeper models with more layers can extract more powerful representations from image data [12]. This tendency was also consistent for our dataset, which focused on skin disease diagnosis. However, models with more layers contain more parameters, which makes them heavier and less efficient [12]. VGG-16 is more efficient because fewer layers are included.
Although classified together as skin NTDs, the target infections have quite different appearances, presentations, and progressions. Lesions can be raised, depressed, smooth or rough, and of various colors or multicolored, even for the same condition. We observed that the deep learning approaches for identifying Buruli ulcer, scabies, and yaws showed good performance of close to or over 80% correct prediction, perhaps because these were trained with more images. Leprosy and mycetoma had smaller sample sizes and poorer performance. For leprosy, we speculate that it was not only the sample size but also the complexity of the disease presentation that impacted performance [17]. We had a range of images from tuberculoid to borderline to lepromatous leprosy, as well as some that included deformities and wounds caused by peripheral neuropathy. We stratified these different conditions and ran the same analysis, expecting that this might increase power by reducing variance. However, this further decreased the number of samples, and we were unable to obtain meaningful results this time. Similar results were obtained for yaws when stratified into ulcerative versus non-ulcerative (papilloma, hyperkeratosis, patch, etc.) lesions. We believe, however, that with enough images, stratifying may increase the accuracy of the predicted diagnosis. Moreover, as a prior study on AI-based diagnosis for leprosy showed, clinical data other than images, most importantly loss of sensation for leprosy, should be combined into the deep learning dataset for better model development [17].
The percentage of correct diagnoses increased with PCR-positive cases of Buruli ulcer. This suggests that training on more accurately diagnosed cases may yield better accuracy in the generated AI models. Interestingly, the PCR-confirmed cases of Buruli ulcer contributed to increasing the diagnostic accuracy not just for Buruli ulcer but also for the other diseases. On the other hand, contrary to our hypothesis, the increase was minimal (3% for Buruli ulcer), which may indicate that the accuracy of clinical diagnosis alone is reliable to an extent. Especially for Buruli ulcer, a previous study by Eddyani et al. showed that the sensitivity of clinical diagnosis was as high as 92% (95% CI, 85-96%), the highest among all methods, including PCR [18]. PCR results can be falsely negative in Buruli ulcer due to several factors, such as the site of sample collection, skill in sample taking, and duration of the wound [19]. While PCR is currently the preferred test for diagnostic confirmation, it has its flaws and is not always reliable. In many studies, PCR is considered 65-70% sensitive [20], or even only 61% sensitive [21]. Specificity is perhaps highest for PCR-positive cases, but sensitivity is highest for clinically identified cases. The set of PCR-positive cases should be enriched for true cases, but it also misses true cases. One hope for AI, which our findings also support, is that it will objectively resolve this gap between diagnostic tests like PCR and clinical diagnosis with the addition of another tool.
Incorrect diagnoses made by our model were skewed towards other skin NTDs being diagnosed as Buruli ulcer, as about half of our images were of Buruli ulcer. Fairness issues in deep learning arise when the dataset is extremely imbalanced across different categories or groups [22]. When the images with incorrect predictions were reviewed, some cases would have been difficult to differentiate even with the human eye, such as the case of yaws in Fig 2B. On the other hand, some cases with obviously different presentations were predicted to be Buruli ulcer, such as the case of mycetoma in (d) and leprosy in (f); we were unsure why they were predicted to be Buruli ulcer. For the cases shown in (a) and (g), the location of the lesions on the limbs may have played some role in their being predicted as Buruli ulcer, as the most commonly affected body parts in Buruli ulcer are the limbs [23,24]. Fig 2C was a case of yaws, but the main lesion was not centered, and the lesion of interest was not very obvious. The backgrounds or clothing may have disturbed the predictions in cases such as Fig 2A, 2E, 2G and 2H. Understanding these patterns will be necessary to resolve incorrect predictions, and this will be one of our future study directions.
A major source of bias in AI applications stems from the availability and variety of images used in training. There are very limited numbers of images of these diseases, and even fewer images of people of color. In addition, the phrase "people of color" embraces a huge range of hues and surface characteristics, even within the African continent. One of the strong points of this pilot has been the use of local dermatologists. In one example from the field, the local dermatologists recognized a series of deeply pigmented lesions as a reaction to skin-whitening agents, a diagnosis that would not easily be reached by physicians in the US or Europe. A key observation here reinforces the need for more images from a wider diversity of cases from this part of the world, similar to the recently recognized gap in dermatological training in general [25,26]. We were able to achieve nearly the same diagnostic accuracy with the model trained on clinically diagnosed images as with the model trained on laboratory-confirmed images; this was partly possible because of the involvement of our skilled local dermatologists.
There are limitations to our study, some already described, such as the limited number of images and the imbalance in image numbers between diseases. Moreover, images were taken under different conditions and were highly heterogeneous, with, for example, distracting objects in the background or variable lighting. We are currently working on how to mitigate these issues, as the photos are taken under field conditions in Côte d'Ivoire and Ghana, where conditions are less controlled than in many other studies. As it is difficult to mandate that images be taken in a uniform environment in these settings, and as doing so would also limit the number of images available for deep learning, overcoming this challenge may depend more on the technology itself. The development of such technologies holds immense potential for advancing deep learning models in diagnosing a diverse range of skin diseases prevalent in regions where skin NTDs are endemic, beyond the specific target diseases of this study. Furthermore, the implications extend to areas where skin NTDs are rare. In today's globalized world, the import and export of skin NTD cases are common, and diverse skin diseases occur across various skin types. By harnessing the power of these emerging technologies, we can effectively address the pressing needs associated with the diagnosis and treatment of skin diseases in different populations.

Conclusions
Here, we presented our exploratory approach to developing deep learning models for skin NTDs and the challenges we encountered. These attempts have only just begun. We hope that the lessons learnt here will support the future development of AI technology for these neglected diseases in neglected populations. Our approach was to have the deep learning model distinguish between multiple pathologies simultaneously. This differs from many other studies, in which deep learning models were asked to make a diagnosis of a single disease. However, in the real world, clinicians are required to compare between different pathologies; accordingly, we devised an approach more in line with this practice. AI is not yet a replacement for human diagnosis, but if used well and appropriately, it is a tool that can be useful in screening for diseases and improving patient outcomes. In particular, the hope is that it will address the unmet needs where access to medical care is limited, as for those affected by skin NTDs.

Acknowledgments
We thank the dermatologists for their support in making diagnoses of skin diseases on site and remotely. We would also like to thank all members of the project teams in Côte d'Ivoire and Ghana for making this study possible.