Acral melanoma detection using a convolutional neural network for dermoscopy images

Background/Purpose Acral melanoma is the most common type of melanoma in Asians, and usually results in a poor prognosis due to late diagnosis. We applied a convolutional neural network to dermoscopy images of acral melanoma and benign nevi on the hands and feet and evaluated its usefulness for the early diagnosis of these conditions. Methods A total of 724 dermoscopy images comprising acral melanoma (350 images from 81 patients) and benign nevi (374 images from 194 patients), and confirmed by histopathological examination, were analyzed in this study. To perform the 2-fold cross validation, we split them into two mutually exclusive subsets: half of the total image dataset was selected for training and the rest for testing, and we calculated the accuracy of diagnosis comparing it with the dermatologist’s and non-expert’s evaluation. Results The accuracy (percentage of true positive and true negative from all images) of the convolutional neural network was 83.51% and 80.23%, which was higher than the non-expert’s evaluation (67.84%, 62.71%) and close to that of the expert (81.08%, 81.64%). Moreover, the convolutional neural network showed area-under-the-curve values like 0.8, 0.84 and Youden’s index like 0.6795, 0.6073, which were similar score with the expert. Conclusion Although further data analysis is necessary to improve their accuracy, convolutional neural networks would be helpful to detect acral melanoma from dermoscopy images of the hands and feet.


Introduction
In Asians, melanoma is rare, compared to its prevalence in Caucasians, and usually occurs in acral areas such as the hands and feet. It can be misrecognized as benign nevi (BN), is occasionally hidden by calluses, and eventually results in late diagnosis at an advanced stage, with a poor prognosis [1][2][3]. Since effective anti-cancer agents for treating melanoma have not yet been developed, early detection and wide excision of the skin lesion is more crucial to the cure for melanoma.
Recently, to aid the early diagnosis of melanoma and the reduction of unnecessary skin biopsy, dermoscopy has been widely used [4,5]. Moreover, because it is difficult for nonexperts to use [6], artificial intelligence and deep-learning models have been applied to help physicians who are untrained to handle a digital dermoscope [7]; its use is expected to increase in the field of teledermatology.
A convolutional neural network (CNN) is one of the representative models among the various deep-learning models. It has already shown potential for general and highly variable tasks across many fine-grained object categories [8][9][10][11][12] and has been shown to exceed human performance in object recognition [9]. Recently, it was applied to detect skin cancers in images, including from dermoscopy, and successfully demonstrated artificial intelligence capable of classifying skin cancer with a competence level comparable to that of dermatologists [13]. For the success of CNN models, a large amount of training data labeled with class types to produce a rich feature hierarchy is necessary, and therefore, its usefulness in the diagnosis of rare diseases with insufficient data has not been fully established.
In this study, we applied an end-to-end CNN framework to detect a rare disease in Asians, acral melanoma (AM), from the dermoscopy images of pigmentation on the hands and feet. To overcome the insufficiency of the datasets, we adopted a transfer learning technique to leverage learned features from a CNN model pre-trained on a large-scale natural image dataset [14]. Moreover, we also applied a half-training and half-trial method to validate its clinical usefulness for the early diagnosis of patients compared with the dermatologist's and non-expert's evaluation.

Dermoscopy images
A total of 724 dermoscopy images were collected from January 2013 to March 2014 at the Severance Hospital in the Yonsei University Health System, Seoul, Korea, and from March 2015 to April 2016 at the Dongsan Hospital in the Keimyung University Health System, Daegu, Korea. Among them, 350 dermoscopy images were from 81 patients with AM and 374 images were from 194 patients with BN of the acral area (Fig 1). A total of 632 dermoscopy images were captured by the DermLite Cam (3Gen Inc., USA), and 92 images were captured by the Dermlite hybrid II (3 Gen Inc., USA), connected to a digital camera (Nikon Coolpix P6000, Japan). All diagnoses were histopathologically confirmed and multiple images were captured in cases of large lesions. We provide a STROBE checklist for the study of diagnostic efficacy as supporting information (S1 Table). Dermoscopy images of BN were divided into nine types, and AM images into three types according to the reference [15], by two dermatologists. This study protocol was approved by the Institutional Review Board of Yonsei University, Severance Hospital and Keimyung University, Dongsan Hospital and was conducted according to the Declaration of Helsinki Principles. Patient records/information was anonymized and deidentified prior to analysis.

Convolutional neural network to detect melanoma
We have described the CNN architecture we adopted in Section 2. 1 and presented the training and inference methods for detecting melanoma in Section 2. 2.
2.1 Convolutional neural network. CNNs are composed of several convolutional layers, each involving linear and nonlinear operators, as well as fully connected layers. The architecture for the state-of-the-art CNN has many parameters; for example, the VGG-16 Model has 138 million parameters, where the parameters are learned from the ImageNet dataset containing 1.2 million general object images of 1,000 different object categories for training [16]. Deep neural networks are difficult to train using small datasets (i.e., a few hundred images). To circumvent this problem, we used the fine-tuning technique, which is one of the regularization techniques. We fine-tuned a modified VGG model with 16 layers (13 convolutional and three fully connected layers), which uses the convolution filters of the same size (i.e., 3 × 3) for all convolution layers, as seen in Table 1. Our network configuration is depicted in Fig 2 and Table 1. Each layer and feature map in the CNN is represented by a three-dimensional array of  Table 1; "conv" represents a convolutional layer and "FC" represents a fully connected layer).
The input with a fixed-size, 224 × 224, was passed through a stack of convolutional layers, where each followed a rectified linear unit (ReLU) activation function, and max-pooling was performed over a 2 × 2 pixel window with a stride of 2. A series of convolutional layers (conv1, conv2, conv3, conv4, and conv5) were followed by three fully connected layers: the first 2 fully connected layers (FC6 and FC7) had 4,096 channels each, where each followed a ReLU activation function, while the last fully connected layer (FC8) had 2 channels since our problem was a two-way classification problem (melanoma and non-melanoma class). It should be noted that the number of channels of the last fully connected layer was the same as the number of classes. Hence, we replaced the original fully connected layer (FC8: FC with 1000 channels) with a fully connected layer with two channels. The last layer had the soft-max activation function and predicted whether the input patch was a melanoma or non-melanoma lesion.
Moreover, the VGG-16 model pre-trained on the ImageNet database are used to perform transfer learning, and the weights of the last convolutional layers (the last two layers of conv5) and three fully convolutional layers (FC6, FC7, and FC8) are initialized using Xavier weight initialization [17]. In order to perform fine-tuning, we froze the weights of conv1, conv2, conv3, conv4, and the first layer of conv5 on pre-trained ImageNet, and trained the initialized weights on our dermoscopy image dataset. The above procedure is performed to prevent the large gradient caused by randomly initialized weights from ruining the pre-trained weights. After several training epochs, we trained the all weights of our network without freezing any layer.

Training and inference.
Our dataset consisted of 724 images and associated labels, which were split into two mutually exclusive subsets (group A and B); half of the total image dataset was selected for training and the rest for testing. The scale and location of a skin lesion in a captured image were changed according to the capture conditions. To resolve this issue, we adopted a sliding window strategy and used the cropped patches instead of the full image at the training and inference time. At the inference time, we extracted about 12 image patches from each test image on a regularly spaced grid with a partial overlap between neighboring patches and then each patch was rescaled to the size of 224 × 224 pixels, as seen in Fig 3. In addition, to increase the robustness of the variation of geometric transformation in our CNN model, the training dataset was artificially augmented at training. Additional augmented data were formed by rotating and flipping images from the original training set. We generated 216 image patches from a single image using rotations by 0˚, 45˚, 90˚, and 135˚, as well as leftright and top-bottom reflections. In addition, the patches that did not contain any melanoma lesions among the melanoma training images were manually removed and the patches that did not contain any skin lesions among the non-melanoma training images were assigned to the non-melanoma class at training time.
We randomly selected 30% of the training dataset as a validation set and the rest as a training set at the onset of training. The validation data were used to prevent the overfitting of the training data and to provide guidance on when to stop training the network. The training of our CNN was stopped when the validation error on the validated dataset stopped decreasing. We trained the network using an adaptive stochastic sub-gradient method where the batch size is set to 50, and the momentum parameter, learning rate, and weight decay are set to 0.9, 0.0001, and 0.0005, respectively.
Some of the filters learned from our melanoma dataset may be seen in Fig 4. Fig 4(a) shows 64 learned filters at the 1 st convolutional layer, where each represents a learned filter with a 3 × 3 kernel size. The input of the first layer is an RGB image with 224x224 pixel size, and it is convolved with 64 learned filters with 3x3 kernel size as shown in Fig 3(a) and 64 feature maps with 224x224 size are generated. In addition, the output feature maps are used as the input of the next layer. Fig 4(b)-4(m) shows 100 filters among the learned filters from the 2 nd to the 13 th convolution layer, respectively, where each represents a learned filter with a 3 × 3 kernel size. At the time of inference, we interpreted 12 image patches per test image, and when one or more images were predicted as containing melanoma, the corresponding test image was interpreted as containing melanoma. Each input of the network was an RGB image subtracted from the average image and calculated over the entire training image dataset. We implemented our method using MatConvNet, a Matlab-based CNN framework for computer vision applications [18]. Moreover, we fine-tuned a VGG model with 16 layers downloaded from http:// www.vlfeat.org/matconvnet/pretrained).

Comparison of diagnostic rate
To assess the clinical usefulness of the CNN, we compared its diagnostic rate with those of two dermatologists who had five or more years of clinical experience in dermoscopy (expert group) and two non-trained general physicians (non-expert group). All images on the computer screen were evaluated simultaneously. If there was a dissensus between two physicians, they reached a conclusion under the agreement. Since 724 images were randomly and equally The agreement between the pathologic result and each rater's diagnosis was measured using the calculation of Cohen's kappa coefficient. All statistical analyses were performed with MedCalc software version 17.9.

Cohen's Kappa
(Po = Accuracy, Pe = hypothetical probability of a chance agreement)

Results
Among 724 dermoscopy images, 71 images were from the hands and fingers, and the others were from the feet and toes. A total of 350 AM images included homogenous diffuse irregular pigmented, parallel ridge, and multicomponent patterns, while 374 BN images included parallel furrow, fibrillar, lattice-like, reticular, globular, and homogenous patterns (S1 Table).
In the group A results obtained by the training of Group B images, CNN showed 92.57% sensitivity and 75.39% specificity, which were similar to those of the expert (94.88% and 68.72%, respectively). However, the non-expert showed lower sensitivity (41.71%) and relatively higher specificity (91.28%, Table 2). For diagnostic accuracy, both the CNN and expert group showed similar scores (83.51% and 81.08%, respectively), which were higher than that of the non-expert (67.84%, Fig 5). In the result of group B by the training of group A images, CNN also showed a higher diagnostic accuracy (80.23%) than that of the non-expert (62.71%) but was similar to that of the expert (81.64%). For validating diagnostic reliability, both the CNN and expert showed an AUC above 0.8 in group A and B (Fig 5). However, the non-expert Regarding the concordance rate between the CNN and expert group, 73 cases (73/362, 20.17%) in Group A (AM: 14 cases, BN: 59 cases) were discordant. Of these, 41 cases (56.16%) of the CNN and 32 cases (43.84%) of the expert were identical with the pathologic results. However, in the concordant cases between them, 29 cases (29/362, 8.01%) differed from the pathology reports. In Group B, 57 cases (AM: 12 cases, BN: 45 cases) showed discordance between the CNN and expert, and 26 cases (45.61%) of the CNN and 31 cases (54.39%) of the expert were identical with the pathologic results. Among the concordant cases in group B, 39 cases (39/362, 10.77%) differed from the pathology results. Cohen's kappa between CNN and Expert, CNN and Non-expert, Expert and Non-expert is shown in Table 3.
To verify the performance of CNN architecture for the discrimination of acral melanoma, we perform the deep learning architecture, Inception-V3, in [13], the state-of-the-art publication for the classification of skin cancer. In [13], a single image was used for learning. Meanwhile, we applied multiple images for learning. Thus, we compared Inception-V3 with a single image and Inception-V3 with multiple images to CNN with multiple images. The results are shown in Table 4.

Discussion and conclusions
Although non-invasive and automated diagnostic techniques have been introduced for the early detection of melanoma, they are still not easy to apply in the acral type [7,19]. This may be due to the overall low occurrence rate of melanoma in Asians, depending on the ethnic differences, which need a longer time to provide a sufficient dataset to improve diagnostic accuracy.
To overcome the problem of an insufficient dataset, we adopt a 2-fold cross validation method, for the training and test groups. In addition, capturing images at different places for one lesion helps to construct a robust CNN. Similarly, data augmentation generating virtual images using rotation, translation, different angle positioning from one image also helps for a robust CNN. These procedures are necessary to construct an automated diagnosis system from small datasets due to the low occurrence rate of acral melanoma. For the effective screening of melanoma, higher sensitivity is required. Thus, if there is a small compartment corresponding to the melanoma in one image, our system considers it as melanoma. Also, our system recognizes one image as one patient.
From the results, the accuracy of the CNN was above 80%, which was similar in both groups and was close to that of the expert. The CNN and expert also showed AUC values above 0.8, indicating good discrimination. Generally, higher AUC values are considered to demonstrate better discriminatory abilities as follows: excellent discrimination, AUC of !0.90; good discrimination, 0.80 AUC < 0.90; fair discrimination, 0.70 AUC < 0.80; and poor discrimination, AUC of <0.70 [20]. Since the AUC of the non-expert was lower than 0.7, CNN can be a useful tool for the early detection of AM by the physicians who are not familiar with the dermoscopic images. Moreover, additional datasets of AM images can improve the diagnostic accuracy of CNN [21], making it a more reliable tool for the evaluation of the need for skin biopsy for hand and feet pigmentation.
There were several auto-classification methods independent of the size of training data using dermatologists' checklist, such as the ABCD rule and 7-point scale [22][23][24][25]. This method used particular features such as color, shape, size, the boundary of the skin lesion, and statistical features of wavelength, which showed 91.26% of accuracy and 0.937 AUC value [25]. However, these cannot be directly applied to acral melanoma due to the different morphologic features such as ridge or furrow patterns. Although there was a new dermoscopic algorithm reflecting these characteristics for diagnosing acral melanoma: BRAAFF [26], it has not yet been applied to the automated diagnosis. In addition, although there is a state-of-art automated classification method for acral melanoma, these methods cannot be generalized and only work well for a particular pattern of acral melanoma, which is a ridge-and-furrow pattern [27]. Automated diagnosis methods using particular features are able to reflect experts' perception and the speed of performance is fast. However, it is not easy to catch experts' perception, although we are trying to reach the goal with significant features. On the other hand, deep learning does not require specific features as inputs. It automatically finds the most correlated features with expert's perception by learning. Thus the accuracy is higher than feature-based methods. However, a large database is critical for the successful completion of deep learning.
Recently, the melanoma classification performance of CNN using 1,010 dermoscopy images was reported as having an AUC of 0.94 [13], which was higher than noted in our results (0.84, 0.8). Our inferior results may be due to the characteristics of AM; it occurs on the pressure area, thick skin, callus, etc., which can hinder and transform the classic pigmented lesion into an atypical case. Because of this, experts in our experiment also showed an AUC of 0.82. Therefore, if the datasets are analyzed separately considering these anatomic characters, CNN may perform a more precise discrimination. Furthermore, if combined with images from noninvasive devices for melanoma diagnosis, which may overcome the problems presented by a thick skin, the accuracy of CNN can be markedly improved. Several non-invasive devices such as confocal and photon microscopy are being introduced to provide convenient ways to diagnose melanoma early [28]. However, they require much effort and time for a physician to gain expertise. An automated diagnostic system using a CNN, even with a small dataset, may alleviate the difficulty of learning how to use these newly developed devices.
In conclusion, a half-training and half-trial method were useful for creating a comparatively accurate deep-learning model from a relatively small dataset. Although further data analysis is necessary to improve its accuracy, CNN would be helpful for the early detection of AM, which is usually associated with delayed diagnosis and poor prognosis.
Supporting information S1