SkinViT: A transformer based method for Melanoma and Nonmelanoma classification

Over the past few decades, skin cancer has emerged as a major global health concern. The efficacy of skin cancer treatment depends greatly on early diagnosis and effective treatment. The automated classification of Melanoma and Nonmelanoma is a challenging task due to high visual similarities across different classes and variabilities within each class. To the best of our knowledge, this study is the first to classify Melanoma and Nonmelanoma with Basal Cell Carcinoma (BCC) and Squamous Cell Carcinoma (SCC) under the Nonmelanoma class. This research therefore focuses on automated detection of different skin cancer types to assist dermatologists in the timely diagnosis and treatment of Melanoma and Nonmelanoma patients. Recently, artificial intelligence (AI) methods have gained popularity, with Convolutional Neural Networks (CNNs) employed to accurately classify various skin diseases. However, CNNs are limited in their ability to capture global contextual information, which may lead to missing important information. To address this issue, this research explores the outlook attention mechanism inspired by vision outlooker, which enhances important features while suppressing noisy features. The proposed SkinViT architecture integrates an outlooker block, a transformer block and an MLP head block to efficiently capture both fine level and global features in order to enhance the accuracy of Melanoma and Nonmelanoma classification. The proposed SkinViT method is assessed with different performance metrics: recall, precision, classification accuracy, and F1 score. We performed extensive experiments on three datasets: Dataset1, which is extracted from ISIC2019; Dataset2, collected from various online dermatological databases; and Dataset3, which combines both datasets. The proposed SkinViT achieved 0.9109 accuracy on Dataset1, 0.8911 accuracy on Dataset3 and 0.8611 accuracy on Dataset2. Moreover, the proposed SkinViT method outperformed other SOTA models and displayed higher accuracy compared to previous work in the literature. The proposed method demonstrated higher performance efficiency in the classification of Melanoma and Nonmelanoma dermoscopic images. This work is expected to inspire further research on implementing a skin cancer detection system that can assist dermatologists in the timely diagnosis of Melanoma and Nonmelanoma patients.


Introduction
Cancer has become a major concern in the healthcare sector, with a projected global incidence of 18.1 million in 2020 [1]. According to a World Health Organization (WHO) report [2], cancer has emerged as one of the top killers worldwide, responsible for approximately 10 million fatalities in 2020 alone. Skin cancer is now one of the most commonly detected types of the disease, with an estimated 1.2 million cases reported globally in 2020. Of the two primary kinds of skin cancer, Melanoma and Nonmelanoma, Melanoma is the more fatal. According to the American Cancer Society, the estimated annual incidence of Melanoma in the United States is about 100,000 people, with around 7,650 succumbing to it [3]. Melanoma has one of its highest incidence rates in New Zealand, where 6000 people are diagnosed annually, and it accounts for almost 80% of all skin cancer deaths [4]. Cancer Research UK statistics show that the relative 5-year survival rate for skin cancer is 90% [5]. The survival rate for skin cancer can be increased by early detection and treatment.
The healthcare sector has been revolutionized by recent advancements in Artificial Intelligence (AI) [6]. The emergence of machine learning in computer vision has opened up new avenues for Computer Aided Diagnosis (CAD) [7,8]. Over the years, CAD has made considerable progress, especially in the diagnosis of diseases affecting the lung [9], breast [10], thyroid [11], brain [12], retina (diabetic retinopathy) [13] and liver [14]. Due to the alarming increase in the skin cancer incidence rate, there is a shortage of experienced dermatologists, which can lead to difficulties in timely skin cancer identification. Moreover, CAD tools are more efficient than existing clinical approaches, saving both time and cost. Therefore, a CAD system for skin cancer detection is essential.
The rapid developments and improvements in deep learning (DL) have greatly impacted the performance of CAD systems [15]. The Convolutional Neural Network (CNN), a DL algorithm, has been employed extensively in applications such as image classification and object detection. With the advancements in CNNs, DL has seen a significant rise in real-world applications such as surveillance [16,17], smart city applications [18,19] and healthcare [20]. The emergence of self-attention in vision based applications has paved the way for the success of transformers [21]. The Vision Transformer, which employs a self-attention mechanism, has displayed promising results when trained on large datasets [22].
CNNs face problems distinguishing low-level features, which may result in missing crucial information. Furthermore, a minimal number of false alarms is vital for accurate diagnosis in the medical field. We propose a transformer based approach which employs the outlook attention mechanism to generate fine level features for token representation, which helps improve the model performance. Unlike other vision transformers that use dot product attention computation, the outlook attention approach utilizes linear projection to aggregate surrounding tokens from anchor token features. This characteristic of the outlook attention mechanism enhances the cost-effectiveness of the model. Furthermore, an SVM with L2 kernel classifier is utilized for the classification task. The proposed SkinViT model detects Melanoma and Nonmelanoma with higher accuracy. The major contributions of this research are given below:
1. Distinguishing between Melanoma and Nonmelanoma is a significant challenge owing to considerable visual interclass similarities and intraclass variations. To the best of our knowledge, this is the first work on Melanoma and Nonmelanoma classification using BCC and SCC in the Nonmelanoma class. Therefore, this work focuses on the detection of different types of skin cancer with the help of CAD.
2. The datasets used for this research have an imbalanced class distribution, which can lead to misinterpretation of the class with fewer image samples. Thus, we apply a data augmentation technique to address the dataset imbalance.
3. While the visual differences in skin lesions may seem small and localized, it is crucial to consider both fine level and global context information for efficient recognition of skin lesions. Therefore, we present a novel DL model named SkinViT that efficiently integrates fine level information from the outlooker with global information from the transformer for more reliable classification of Melanoma and Nonmelanoma images.
4. Furthermore, a detailed analysis of the proposed SkinViT approach with different optimizers and classifiers on three datasets for detecting Melanoma and Nonmelanoma is presented and compared with other SOTA models and existing research to validate the efficacy of the SkinViT model.
The remaining sections are structured as follows: Section 2 presents the related work. Section 3 describes the datasets used in this research and the proposed method. Section 4 presents the simulation setup and results, and Section 5 provides the discussion and conclusion.

Related work
Over the past few years, the massive increase in skin cancer cases has overwhelmed dermatologists. To help in the accurate differentiation and diagnosis of melanoma from other skin lesions, the International Society for Digital Imaging of Skin (ISDIS) took the initiative to tackle the increasing incidence rate of skin cancer by introducing an annual challenge known as the International Skin Imaging Collaboration (ISIC) [23]. With these efforts, there has been considerable interest among researchers in exploring computer vision techniques for skin cancer detection [24,25].
In [26], Pham et al. conducted a comparative analysis of different data processing methods, feature extraction methods and classifiers for melanoma classification. In their analysis, Linear Normalization for data processing, HSV for feature extraction and a Balanced Random Forest classifier performed best, with 74.75% accuracy on the HAM10000 dataset. In [27], Shen et al. proposed a high performance data augmentation method, which can be integrated into any deep learning method to classify skin lesions. Their proposed approach with EfficientNet-B0 showed the best results, with 85.3% accuracy on the ISIC2018/HAM10000 dataset. In [28], Zhang et al. proposed a convolutional neural network with attention residual learning (ARL-CNN) to classify skin diseases. Their proposed method achieved 91.8% AUC on the ISIC2017 dataset for the binary classification task. Liu et al. [29] proposed a mid level feature representation method for learning features, in which a CNN model is used as an extractor of ROI images. Their proposed method achieved 92.1% AUC in classifying melanoma and seborrheic keratosis using the ISIC2017 dataset. In [30], Zhou et al. proposed convolutional spiking neural networks (SNN) employing the spike-time-dependent plasticity (STDP) learning rule for melanoma skin lesion classification. Their proposed method showed an accuracy of 87.7% in classifying malignant melanoma and nevus skin lesions using the ISIC2018 dataset.
Gouda et al. [31] proposed pre-trained deep learning models based on transfer learning, such as CNN, ResNet50, InceptionV3 and Inception ResNet, for skin cancer classification. In their analysis, InceptionV3 showed the highest accuracy of 85.76% in classifying malignant and benign skin lesions using the ISIC2018 dataset. In [32], Damian et al. proposed a MobileNet based transfer learning model for melanoma and nevus skin lesion classification. Their proposed method achieved an accuracy of 89.7% using the ISIC2018 dataset. In [33], Indraswari et al. proposed a transfer learning technique based on the MobileNetV2 model for melanoma classification. Their proposed method showed 85% accuracy on the ISIC archive dataset. Hoang et al. [34] proposed an EW-FCM+ShuffleNet based hybrid method, in which entropy-based weighting and first-order cumulative moment (EW-FCM) is used for segmentation and wide-ShuffleNet for classification. In their proposed method, EW-FCM with wide-ShuffleNet performed best, with 86.33% accuracy for multi-class classification on the HAM10000 dataset. In [35], Lopez et al. proposed a CNN model based on transfer learning with VGGNet (VGG16) for skin lesion classification. Their proposed method achieved 81.33% accuracy in classifying malignant and benign skin lesions using the ISIC2016 dataset.
In [36], Xie et al. proposed a Swin-SimAM based hybrid method for detecting melanoma, where a Swin transformer is used for feature extraction and SimAM is a parameter-free attention module. Their proposed method displayed 90% AUC in classifying melanoma and nonmelanoma (nevus and seborrheic keratosis). In [37], Naeem et al. introduced the SCDNet approach, which integrates the VGG16 architecture with convolutional neural networks. Their proposed SCDNet method showed an accuracy of 96.91% on the ISIC2019 dataset for multi-class skin cancer classification. Tahir et al. [38] proposed DSCC_Net, which utilizes convolutional neural networks (CNN). Their proposed method demonstrated promising results, with an accuracy of 94.17% on three publicly available datasets (ISIC2020, HAM10000 and DermIS) for the classification of multi-class skin cancer types. Table 1 details the comprehensive analysis of all the work in the literature.
The reviewed approaches for melanoma detection showed promising results, mainly in distinguishing melanoma from benign dermoscopic images. The existing research predominantly focused on detecting malignant and benign lesions, or melanoma and benign lesions with either nevus or seborrheic keratosis in the benign class, which are non-cancerous skin lesion types. To the best of our knowledge, no existing research considered the Melanoma and Nonmelanoma (BCC and SCC) classes, which are the most common skin cancer types, for the classification task. Discriminating Melanoma and Nonmelanoma is quite challenging due to high intraclass differences. Although the previous research showed promising results, for efficient skin cancer detection there is a further need to research Melanoma and Nonmelanoma where

Materials and methods
This section details the proposed SkinViT method and dermoscopic image datasets utilized for Melanoma and Nonmelanoma classification.

Dataset acquisition
This study considers the binary classification problem of Melanoma and Nonmelanoma. For this research, we considered three dermoscopic image datasets, each with a Melanoma class and a Nonmelanoma class, where Nonmelanoma comprises Basal Cell Carcinoma and Squamous Cell Carcinoma. Dataset1 [39] is a public dataset that contains 25,331 dermoscopic images in 8 different classes. We considered only two classes, with 4521 Melanoma dermoscopic images and 3952 Nonmelanoma images. Dataset2 contains dermoscopic images collected from various online dermatological databases such as DermIS [40], PH2 [41] and Dermnet-NZ [42], to obtain more representation of the considered classes for the classification task. We considered two classes, with 410 Melanoma images and 672 Nonmelanoma images. For Dataset3, we combined Dataset1 and Dataset2 to obtain 4930 Melanoma images and 4624 Nonmelanoma images.
Melanoma is comparatively less common but is the most fatal form of skin cancer. It begins in the melanocytes, which are responsible for producing melanin, a pigment that gives color to the skin. It usually appears as a dark colored mole that changes shape, size or color over time. Nonmelanoma is the most common kind of skin cancer and can be categorized as Basal Cell Carcinoma (BCC) and Squamous Cell Carcinoma (SCC). BCC affects the basal cells of the epidermis. BCC can vary in appearance but often appears as a small pinkish or pearly-white bump. It can also be a red scaly patch, sometimes with brown or black pigment within the patch. SCC develops in the keratinocytes of the squamous cell layer of the epidermis. SCC can also vary in appearance: it typically appears as a red to pink rough or scaly patch, and can look like a raised wart-like growth, sometimes with a spiky horn-like surface sticking out. Sample images of Melanoma and Nonmelanoma are shown in Fig 1.

Dataset preprocessing
Data preprocessing is one of the key steps in deep learning models. The dermoscopic images of Melanoma are labelled as 1, whereas the dermoscopic images of Nonmelanoma are labelled as 0. Dataset1 has a total of 8473 dermoscopic images, with 4521 images for the Melanoma class and 3952 images for the Nonmelanoma class. Dataset2 comprises 1082 dermoscopic images for the considered binary classification task, where 410 images are assigned to the Melanoma class and 672 images to the Nonmelanoma class. Dataset3 consists of 4930 images for the Melanoma class and 4624 images for the Nonmelanoma class, totaling 9554 images. The datasets are split into training and testing sets by applying the 80:20 splitting rule, where 80% is used for training and 20% for testing, as depicted in Table 2. The images in the datasets are of different sizes, so it is essential to convert them to a single image size to match the input of the deep learning model. Therefore, all the images in our research are resized to 224 × 224.
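The labelling and 80:20 split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the file names, the fixed random seed and the choice to split each class separately are assumptions, and resizing to 224 × 224 is left out because it needs an image library. The exact counts in Table 2 may differ from what this sketch produces.

```python
import random

def split_80_20(items, train_frac=0.8, seed=0):
    """Shuffle items reproducibly and split them into train and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Class sizes reported for Dataset1; the file names are hypothetical placeholders.
melanoma = [(f"mel_{i}.jpg", 1) for i in range(4521)]      # label 1 = Melanoma
nonmelanoma = [(f"non_{i}.jpg", 0) for i in range(3952)]   # label 0 = Nonmelanoma

# Splitting each class separately keeps the class ratio the same in both parts.
mel_train, mel_test = split_80_20(melanoma)
non_train, non_test = split_80_20(nonmelanoma)
train = mel_train + non_train
test = mel_test + non_test
print(len(train), len(test))
```

Splitting per class is one common way to avoid a test set whose class ratio drifts from the training set; the paper does not state whether its split was stratified.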

Data augmentation
The size of the data has a significant impact on the performance of deep learning models. The larger the dataset, the greater the chance that a deep learning model performs well. The datasets considered for our research have imbalanced data in the considered classes, which can greatly impact the performance of the model. Therefore, a data augmentation technique is applied to handle the imbalanced data, which could otherwise cause misinterpretation of the class with fewer sample images. For our research, we performed geometric augmentation, also known as position augmentation, on the training data, as depicted in Table 3. The images are transformed by 180° rotation, horizontal flip and shear transformation by a factor of 0.2.
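The three geometric transformations named above can be illustrated on a tiny image stored as a list of rows. This is only a sketch: it treats the shear factor of 0.2 as a horizontal displacement of 0.2 pixels per row with nearest-neighbour sampling, which is an assumption, since augmentation libraries define the shear parameter in different ways (some as an angle).

```python
def rotate_180(img):
    """Rotate a 2D image (list of rows) by 180 degrees."""
    return [row[::-1] for row in img[::-1]]

def horizontal_flip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def shear_x(img, factor=0.2, fill=0):
    """Shear along x: output pixel (y, x) is sampled from input column x - factor * y."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            src = int(round(x - factor * y))   # nearest-neighbour source column
            if 0 <= src < w:
                out[y][x] = img[y][src]
    return out

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
print(rotate_180(img))       # [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(horizontal_flip(img))  # [[3, 2, 1], [6, 5, 4], [9, 8, 7]]
```

Each transformed copy keeps the label of its source image, so applying the transformations to training images increases the number of training samples without changing what they depict.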

Proposed SkinViT architecture
The proposed SkinViT model is designed to classify Melanoma and Nonmelanoma. The architecture of the proposed SkinViT model is depicted in Fig 2. Inspired by VOLO [43] and ViT [44], the proposed SkinViT model combines the outlooker block, transformer block and SkinViT multi-layer perceptron (MLP) head block. The proposed method first converts images into patches of size 8 × 8, which are then passed through the outlooker to generate the fine level token representation. After that, the tokens are further downsampled using a patch embedding module and passed to the Transformer encoder for processing. The output is then passed into the MLP head, which in our proposed SkinViT consists of a flatten layer, a dense layer with the Swish function and a classification layer with SVM linear kernel L2 that outputs the prediction for the skin cancer type. The details of the proposed SkinViT architecture are presented as follows:
1. Outlooker: The outlooker is responsible for generating fine level features for tokenization. The outlooker comprises an outlook attention layer, which encodes spatial information, and an MLP, which is responsible for inter-channel information interaction. The outlook attention, unlike self-attention, computes the similarity between each spatial location $(i, j)$ and its neighboring elements to focus on fine level features. For a given input $X$, each $C$-dimensional feature is projected with two layers of linear weights: $A \in \mathbb{R}^{H \times W \times K^4}$ as outlook weights and $V \in \mathbb{R}^{H \times W \times C}$ as value representations. Suppose the value representations within the local window at $(i, j)$ are $V_{\Delta_{i,j}} \in \mathbb{R}^{C \times K^2}$, where, as in VOLO [43],
$$V_{\Delta_{i,j}} = \{ V_{i+p-\lfloor K/2 \rfloor,\; j+q-\lfloor K/2 \rfloor} \}, \quad 0 \le p, q < K$$
The outlook weight is reshaped into $\hat{A}_{(i,j)} \in \mathbb{R}^{K^2 \times K^2}$ to obtain the aggregated value of the attention weight. The value projection is the weighted average of the values under the outlook weights and can be calculated as follows:
$$Y_{\Delta_{i,j}} = \mathrm{MatMul}(\mathrm{Softmax}(\hat{A}_{(i,j)}),\; V_{\Delta_{i,j}})$$
Unlike self-attention, which depends on query-key matrix multiplications, the outlook attention matrix can be generated from the attention weights within the local window located at $(i, j)$, followed by a reshape operation. Following VOLO [43], each layer of the Outlooker can be written as:
$$\tilde{X} = \mathrm{OutlookAtt}(\mathrm{LN}(X)) + X$$
$$Z = \mathrm{MLP}(\mathrm{LN}(\tilde{X})) + \tilde{X}$$
where LN denotes layer normalization.
2. Transformer: The transformer encoder consists of multi-head outlook attention layers, layer normalization and an MLP. The architecture of the transformer is similar to ViT but, unlike ViT, which uses the self-attention mechanism, it uses the outlook attention mechanism. The multi-head outlook attention is obtained by combining the computed outlook weight $A_n$ and value embedding $V_n$. For $N$ heads, each head $n$ has its own outlook weight $A_n \in \mathbb{R}^{H \times W \times K^4}$ and value embedding $V_n$.
3. SkinViT MLP head: The dense layer uses the Swish function, which can be written as:
$$\mathrm{Swish}(X) = X \cdot \mathrm{sigmoid}(X)$$
where $X$ is the input and $\mathrm{sigmoid}(X)$ is the sigmoid function, which outputs a value in $(0, 1)$.
SVM with L2 Kernel: We employed the linear kernel L2 [45] to implement the SVM in our proposed method because it helps to handle the multicollinearity issue (correlated independent variables) by shrinking the coefficients while retaining all the variables. The linear kernel performs best when the number of features is large. In contrast to L1, which uses the median of the data to estimate, the linear kernel L2 makes a prediction based on the mean of the data to prevent overfitting. The L2 kernel adds a penalty to the cost function equal to the squared value of the weights and learns complex patterns. L2 is computationally efficient and improves prediction accuracy when the output is a function of all input variables. The L2 kernel can be calculated by:
$$L_2 = \lambda \sum_{i} W_i^2$$
where $W_i$ is the weight and $\lambda$ represents the regularization parameter.
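The local aggregation step of outlook attention and the Swish activation can be sketched in plain Python for a single anchor location. This is an illustrative toy under stated assumptions, not the SkinViT implementation: the function names are hypothetical, the window values are stored as $K^2$ rows of $C$ channels (the transpose of the $C \times K^2$ layout used in the text), and the folding of overlapping windows back onto the feature map is omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def outlook_attention_at(a_hat, v_window):
    """Aggregate the values inside one local window.

    a_hat:    K^2 x K^2 outlook weights for the anchor location (i, j)
    v_window: K^2 x C values of the tokens inside the K x K window
    returns:  K^2 x C weighted averages, i.e. Softmax(a_hat) @ v_window
    """
    weights = [softmax(row) for row in a_hat]
    channels = len(v_window[0])
    return [
        [sum(w * v[c] for w, v in zip(w_row, v_window)) for c in range(channels)]
        for w_row in weights
    ]

def swish(x):
    """Swish activation: x * sigmoid(x), written for moderate |x|."""
    return x / (1.0 + math.exp(-x))

# Toy example with K = 2 (K^2 = 4 window positions) and C = 1 channel.
v_window = [[1.0], [2.0], [3.0], [4.0]]
a_hat = [[0.0] * 4 for _ in range(4)]   # equal weights give a plain average
y = outlook_attention_at(a_hat, v_window)
print(y[0][0])      # 2.5
print(swish(0.0))   # 0.0
```

Note that the weights in `a_hat` come directly from a linear projection of the anchor token, so no query-key product is computed; that is the property the text contrasts with self-attention.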

Optimizer:
The proposed SkinViT model uses Adam as the optimizer. The Adam optimizer works by computing the exponential moving average of the gradients of the loss function with respect to the parameters. It combines gradient descent with momentum and scales each step by the moving average of the squared gradient. The equation for the Adam optimizer is as follows:
$$W_{t+1} = W_t - \frac{\eta}{\sqrt{V_t} + \epsilon}\, M_t$$
Here $W$ is the model weights, $\eta$ is the step size, $M_t$ is the unbiased estimate of the moving average of the gradient, $V_t$ is the unbiased estimate of the moving average of the squared gradient and $\epsilon$ is the constant used for numerical stability, with a value of $10^{-8}$.
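A single Adam update for one scalar weight can be sketched as below. The decay rates `beta1 = 0.9` and `beta2 = 0.999` are the commonly used defaults and are an assumption here, since the text does not state them; the toy objective and the larger step size in the usage loop are likewise only for illustration.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad           # moving average of the gradient
    v = beta2 * v + (1.0 - beta2) * grad * grad    # moving average of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)                 # bias-corrected (unbiased) estimates
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimise f(w) = w^2 (gradient 2w), using lr = 0.01 so the movement is visible.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 51):
    w, m, v = adam_step(w, 2.0 * w, m, v, t, lr=0.01)
print(w)
```

Because the step is normalised by the root of the squared-gradient average, each update moves the weight by roughly the step size regardless of the gradient's scale, which is why a small learning rate such as 1e-5 leads to slow but stable training.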

Loss function:
The binary cross entropy ($L_{BCE}$) loss, also called log loss ($L_L$), is often used for binary classification tasks. The $L_{BCE}$ helps evaluate the model by scoring its predicted probabilities. It measures the difference between the actual label and the predicted probability and can be calculated as:
$$L_{BCE} = -\left[ x \log \hat{x} + (1 - x) \log (1 - \hat{x}) \right]$$
where $x$ is the label, i.e. 1 for Melanoma and 0 for Nonmelanoma, and $\hat{x}$ is the predicted probability that $x = 1$.
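The per-sample binary cross entropy can be sketched as follows. The clipping constant `eps` is an implementation detail assumed here to avoid taking the logarithm of zero; it is not part of the definition in the text.

```python
import math

def binary_cross_entropy(label, prob, eps=1e-12):
    """Per-sample BCE; label is 0 or 1, prob is the predicted probability of class 1."""
    p = min(max(prob, eps), 1.0 - eps)   # clip to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

print(binary_cross_entropy(1, 0.9))   # confident and correct: small loss
print(binary_cross_entropy(1, 0.1))   # confident and wrong: large loss
```

For a Melanoma image (label 1), the loss grows without bound as the predicted Melanoma probability approaches 0, so confident misclassifications are penalised most heavily.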

Performance metrics
To evaluate the performance of the proposed SkinViT, we considered the following performance metrics:
$$\mathrm{Accuracy} = \frac{T_{pos} + T_{neg}}{T_{pos} + T_{neg} + F_{pos} + F_{neg}}$$
$$\mathrm{Recall} = \frac{T_{pos}}{T_{pos} + F_{neg}}$$
$$\mathrm{Precision} = \frac{T_{pos}}{T_{pos} + F_{pos}}$$
$$F1\text{-}score = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$T_{neg}$ represents the true negatives, i.e. the correctly classified Nonmelanoma images, and $T_{pos}$ the true positives, i.e. the Melanoma images correctly classified by the proposed model. A false positive $F_{pos}$ is a Nonmelanoma image wrongly classified as Melanoma, while a false negative $F_{neg}$ is the opposite, i.e. a Melanoma image misclassified as Nonmelanoma. Recall measures how often the model correctly predicts a positive result among all samples that should have been classified as positive, whereas Precision measures how often a positive prediction made by the model is correct. The F1-score is the harmonic mean of precision and recall, which summarizes how well the classifier predicts the positive class.
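The metrics described above can be computed from the four confusion counts; a minimal sketch follows. The confusion counts in the usage example are hypothetical, while the precision, recall and F1 triples in `reported` are the per-dataset values stated in the performance analysis of this paper, used here to check that the harmonic mean reproduces the reported F1-scores.

```python
def classification_metrics(t_pos, t_neg, f_pos, f_neg):
    """Return accuracy, recall, precision and F1 from confusion counts."""
    accuracy = (t_pos + t_neg) / (t_pos + t_neg + f_pos + f_neg)
    recall = t_pos / (t_pos + f_neg)
    precision = t_pos / (t_pos + f_pos)
    f1 = 2.0 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# Hypothetical counts: 100 Melanoma and 100 Nonmelanoma test images.
print(classification_metrics(t_pos=90, t_neg=95, f_pos=5, f_neg=10))

# (precision, recall, F1) as reported for SkinViT on each dataset.
reported = {
    "Dataset1": (0.9235, 0.9082, 0.9158),
    "Dataset2": (0.8514, 0.7683, 0.8077),
    "Dataset3": (0.8836, 0.9087, 0.8960),
}
for name, (p, r, f1) in reported.items():
    print(name, round(f1_from(p, r), 4), f1)
```

Because Melanoma is the positive class, recall is the metric most directly tied to missed Melanoma cases (false negatives).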

Simulation setup and results
This section details the simulation setup and results of the proposed SkinViT.

Simulation setup
The proposed SkinViT model is implemented in the Anaconda environment using Python 3.8 with the TensorFlow, Keras, Scikit-Learn, Matplotlib and NumPy libraries, installed on Windows OS with the following system configuration: Intel Core i7-11800H @2.3GHz, 16GB DDR4, NVIDIA RTX 3060. The SkinViT model is trained on 3 datasets, as described in the dataset acquisition section, with 8473 dermoscopic images in Dataset1, 1082 in Dataset2 and 9555 in Dataset3, as depicted in Table 2. Moreover, we augmented the datasets as described in the data augmentation section to avoid overfitting while increasing the classifier's efficiency on unseen images. Furthermore, Adam is employed as the optimizer to update the SkinViT parameters during model training. The number of epochs and the batch size are set to 70 and 16, respectively. The learning rate for the proposed work is set to 1e-5.

Simulation results
This section first discusses various ablation studies related to the proposed SkinViT model. Next, a comprehensive analysis of the performance of the proposed SkinViT is carried out and compared with other SOTA models.

Ablations
The ablations of this work comprise 1) determining the best value of the L2 kernel; 2) training the model with different optimizers to determine the optimum one; 3) the effect of training with and without augmentations on the proposed method; 4) a comparative analysis of different kernel classifiers.
1. Tuning the L2 kernel classifier: In this simulation, we experimented with different values of L2 to select the best value of L2 (Eq 6) for the proposed model. We varied the value of L2 from 0.01 to 1.0, with five intervals in total, and selected the best result.
3. Effect of Augmentations on Proposed SkinViT: In this simulation, we evaluated the effect on classification accuracy of using augmentations on the training datasets. It can be observed from Table 6 that the proposed SkinViT performed better when using augmentations, because there was more representation of the training samples for generalization. From the results, it can be seen that augmentations improved the accuracy of the proposed SkinViT by 2.72% on Dataset1, 3.24% on Dataset2 and 1.52% on Dataset3.
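To make the role of the L2 value in item 1 concrete, the sketch below evaluates a linear SVM objective (hinge loss plus the L2 penalty $\lambda \sum_i W_i^2$) for several values of $\lambda$. Everything here is an assumption made for illustration: the use of the plain hinge loss, the ±1 label encoding, the toy features and weights, and the grid of three $\lambda$ values, which is not the grid used in the paper.

```python
def l2_penalty(weights, lam):
    """L2 regularization term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def hinge_loss(labels, scores):
    """Mean hinge loss; labels are +1 (Melanoma) or -1 (Nonmelanoma)."""
    return sum(max(0.0, 1.0 - y * s) for y, s in zip(labels, scores)) / len(labels)

def svm_objective(weights, bias, features, labels, lam):
    """Hinge loss of a linear classifier plus its L2 weight penalty."""
    scores = [sum(w * x for w, x in zip(weights, row)) + bias for row in features]
    return hinge_loss(labels, scores) + l2_penalty(weights, lam)

# Toy data: two samples with two features each, both classified with margin 2.
features = [[2.0, 0.0], [0.0, 2.0]]
labels = [1, -1]
weights, bias = [1.0, -1.0], 0.0

for lam in (0.01, 0.1, 1.0):   # hypothetical grid within the stated 0.01 to 1.0 range
    print(lam, svm_objective(weights, bias, features, labels, lam))
```

With the hinge term at zero, the objective is driven entirely by the penalty, which shows how a larger $\lambda$ pushes training toward smaller weights; the best value is the one that gives the highest held-out accuracy.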

Selection of classifier:
In this simulation, we used various classifiers to select the optimal classifier for classifying Melanoma and Nonmelanoma and achieving the best results on the test set. Table 7 details the results of the different classifiers. It can be observed that the proposed SkinViT achieves the best performance using the L2 kernel for the classification task, exceeding the Gaussian kernel by 1.72% on Dataset1, 7.4% on Dataset2 and 0.37% on Dataset3. It can also be observed that SkinViT on Dataset2 achieved better accuracy with the L1 kernel than with the Gaussian kernel. This is because Dataset2 has a small number of training samples, so the more complex Gaussian kernel performed poorly, while the less complex L1 kernel achieved better accuracy. Overall, from Table 8, it can be seen that using the L2 kernel increased the performance of the proposed SkinViT.

SkinViT performance analysis
This section describes the performance analysis of the proposed SkinViT on the considered datasets. The feasibility of using a pretrained model on our custom dataset depends on the nature and characteristics of the dataset on which the model was trained. The lack of unique medical related features means that transfer learning cannot achieve a higher level of classification accuracy. Therefore, the proposed SkinViT model is trained from scratch on the considered datasets. From Table 8, it can be noticed that SkinViT performed best on Dataset1, with an overall accuracy of 0.9109; it achieved 0.9082 accuracy for the Melanoma class and 0.9139 accuracy for the Nonmelanoma class. On Dataset1, the overall recall was 0.9082, the precision 0.9235 and the F1-score 0.9158. On Dataset2, SkinViT achieved an accuracy of 0.7682 in the Melanoma class and 0.9179 in the Nonmelanoma class, which is the highest accuracy for Nonmelanoma. The overall accuracy of SkinViT on Dataset2 was 0.8611, with an overall recall of 0.7683. Moreover, SkinViT achieved an overall precision of 0.8514 and an F1-score of 0.8077 on Dataset2. Dataset3 achieved an accuracy of 0.9087 in the Melanoma class, which is slightly higher than  ViT with 0.9004, precision of 0.9235 which is a gain of 2.81% over the second best 0.8954 displayed by EfficientNetV2 and the F1-score of 0.9158 which is a gain of 4.01% over the second best 0.8757 by ViT. SkinViT on Dataset2 displayed an accuracy of 0.8611, a gain of 3.7% over the second best 0.8241 by ViT; a recall of 0.7683, a gain of 10.98% over the second best 0.6585 displayed by EfficientNetV2; a precision of 0.8514, a loss of 2.79% from the best 0.8793 achieved by ViT; and an F1-score of 0.8077, a gain of 8.17% over the second best 0.7286 achieved by ViT. Similarly, SkinViT on Dataset3 showed the best accuracy of 0.8911, a gain of 4.55% over the second best accuracy of 0.8456 displayed by MaxViT; a recall of 0.9087, a gain of 9.73% over the second best 0.8114 by EfficientNetV2; a precision of 0.8836, a loss of 0.59% from the best 0.8895 by MaxViT; and an F1-score of 0.8960, a gain of 5.35% over the second best of 0.8425 displayed by MaxViT.

SkinViT performance on HAM10000
Further, to validate the performance of our proposed SkinViT, we evaluated it on the HAM10000 dataset, which is used by most of the previously published work. It can be seen from Table 10 that the proposed SkinViT obtained an accuracy of 0.  is because of the attention mechanism that helps the model learn the desired features more efficiently.

Discussion
In this work, SkinViT displayed the ability of the outlooker and self-attention to diagnose Melanoma and Nonmelanoma from dermoscopic images. It can be seen from the results that SkinViT performed better than the other CNN and Transformer based models. In contrast to transformers, which can compute the attention of any patch regardless of its distance, a CNN alone needs to perform additional convolutions to increase the receptive field in order to determine the relationship between any two pixels that are far apart, which makes long-range computation difficult. In SkinViT, the outlooker block is used instead of the patch embedding component in ViT to learn features, whereas self-attention is used to learn important features and ignore the noisy ones. The results show that SkinViT performed better than the CNN and Transformer based models, which validates its superiority over other models.
From the results, it can be noted that SkinViT performed better on both the Melanoma and Nonmelanoma classes. However, the CNN model EfficientNetV2 was better at predicting Nonmelanoma images while performing poorly in the classification of Melanoma, as given in Table 9. Moreover, the Transformer based methods MaxViT and ViT performed better in classifying Melanoma images, whereas the hybrid model MobileViTV2 performed better on Nonmelanoma images than on Melanoma images. SkinViT, in contrast, was equally good at classifying both classes, indicating that SkinViT is more robust than CNN or transformer based models alone in dealing with imbalanced datasets. We also observed from the results Most researchers used CNN models for Melanoma and Nonmelanoma detection, while some used transformer-only architectures for the considered task. To the best of our knowledge, this is the first use of outlooker and transformers for the skin cancer detection task with the Melanoma and Nonmelanoma (SCC and BCC) classes, which are the most frequently diagnosed skin cancer types. Additionally, previous research results were compared with the proposed work, as depicted in Table 10. It is essential to point out that each researcher used different classes for their respective problems. Although MobileNet and the STDP based SNN, which took into account Melanoma and Nonmelanoma (Nevus), showed good accuracy of 0.897 and 0.877 compared to others, our proposed method outperformed both with an accuracy of 0.9254 on the same dataset, which validates the efficiency of our proposed method SkinViT. However, our work focused on the binary classification problem of Melanoma and Nonmelanoma (BCC and SCC). Moreover, the high accuracy of the proposed method can help the early diagnosis of skin cancer and ease the burden on dermatologists. This research can help researchers further improve the methodology for image segmentation to detect abnormalities in dermoscopic images in terms of Melanoma and Nonmelanoma.
Despite the strong performance of the proposed SkinViT model compared to SOTA, there are some limitations and challenges in this research. Firstly, the data used for training the proposed SkinViT model from scratch is of moderate size. The size of the data significantly impacts the efficacy of training a deep learning model for optimal performance: the greater the amount of data, the higher the efficiency of the model. Therefore, for future research, a large dataset should be curated by combining the publicly available datasets (ISIC archive, HAM10000 etc.) to improve the model efficiency. Furthermore, the dataset used has a large class imbalance, which can strongly impact the performance of the proposed model. The current research employed the geometric data augmentation technique to handle the class imbalance issue. Future research will explore the use of generative adversarial networks (GAN) or advanced data augmentation techniques like MixUp and CutMix, which combine multiple images or patches to create new training samples and enhance model generalization.

Conclusion
The focus of this research is on the automated detection of the skin cancer types Melanoma and Nonmelanoma (BCC and SCC), which can help reduce the mortality rate through early diagnosis and also help ease the burden on dermatologists. To achieve this goal, we devised a novel deep learning model named SkinViT, which employs transformer blocks, outlooker blocks and an MLP head for classification. Further, the proposed SkinViT reduces the requirement for high computational power due to its smaller number of parameters (around 27.1 million), as opposed to other popular classification models such as ViT and MaxViT.
The total number of training samples was increased by employing augmentations such as horizontal flip, shear transformation and rotation to address the class imbalance problem. Moreover, the use of an SVM classifier, specifically the L2 kernel, has increased the optimality of the prediction value by taking the mean of the data to avoid overfitting. We performed multiple simulations to assess the performance of the SkinViT model. It is evident from Tables 9 and 10 that SkinViT achieved higher classification accuracy than the other methods. This is perhaps because the outlooker block in SkinViT, unlike ViT, efficiently encodes fine level features by measuring the similarity between token pair representations, which is more parameter-efficient for feature learning than convolutions. Moreover, the sliding window adopted in outlook attention locally encodes token representations and preserves important positional information for the classification task. Furthermore, outlook attention weight generation is a simple reshaping operation, unlike self-attention, which depends on query-key matrix multiplications. Finally, the proposed MLP head uses the SVM L2 kernel classifier, which takes the mean of the values for the prediction score and further optimizes the model. This provides SkinViT with better feature learning ability, which results in higher accuracy in classifying Melanoma and Nonmelanoma in this work.
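The augmentation-based balancing described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline used in the experiments: images are nested lists, rotation is restricted to 90 degrees, shear is omitted, and the helper names horizontal_flip, rotate_90 and oversample_minority are hypothetical.

```python
import random


def horizontal_flip(image):
    """Mirror each row of the image left to right."""
    return [list(reversed(row)) for row in image]


def rotate_90(image):
    """Rotate the image 90 degrees clockwise (a simplified rotation)."""
    return [list(row) for row in zip(*image[::-1])]


def oversample_minority(minority_images, target_count, rng=random):
    """Grow the minority class to target_count using random geometric augmentations."""
    if not minority_images:
        raise ValueError("minority_images must not be empty")
    # Start from copies of the original images, then append augmented ones.
    augmented = [[list(row) for row in img] for img in minority_images]
    operations = [horizontal_flip, rotate_90]
    while len(augmented) < target_count:
        source = rng.choice(minority_images)
        operation = rng.choice(operations)
        augmented.append(operation(source))
    return augmented
```

Calling oversample_minority on the Melanoma images with the Nonmelanoma count as target_count would yield equally sized classes; the augmented images are transformed copies, so they add geometric variety but no new lesions.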
For the skin cancer detection problem, it is crucial that false classifications are minimal to ensure the model's applicability and reliability in real-world scenarios. For Melanoma detection, it is imperative to minimize false negatives, as they may lead to treatment delays and subsequently diminish the 5-year survival rate, whereas a false positive would only necessitate further diagnostic procedures such as a biopsy. The false classification of Melanoma images as Nonmelanoma (F_neg) could be due to insufficient representation of Melanoma images: as can be observed from Table 2, the number of Melanoma images is significantly lower than that of Nonmelanoma images. This can make it difficult for the model to generalize to instances on which it was not trained. Another reason for false classifications can be excessive noise, such as hair and air bubbles, in the images, which makes it challenging for the model to learn the important features. The proposed SkinViT method can be further improved in the future using additional publicly available datasets together with a segmentation task to detect skin diseases. Furthermore, the image quality of the dermoscopic images for Melanoma and Nonmelanoma can be further improved. Also, the classification results of SkinViT can be utilised for implementing a Melanoma and Nonmelanoma recognition system to assist dermatologists in diagnosis.
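To make the role of false negatives concrete, the sketch below computes the metrics used in this study (accuracy, precision, recall and F1 score) from confusion-matrix counts, treating Melanoma as the positive class. The function name and the counts in the usage note are hypothetical and are not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    tp: Melanoma predicted as Melanoma      fn: Melanoma predicted as Nonmelanoma
    fp: Nonmelanoma predicted as Melanoma   tn: Nonmelanoma predicted as Nonmelanoma
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall falls as false negatives (missed Melanoma cases) increase.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For example, with made-up counts tp=80, fp=10, fn=20, tn=90, accuracy is 0.85 while recall is only 0.80, showing how missed Melanoma cases can be partly hidden by an overall accuracy figure on an imbalanced dataset.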

Fig 1. Sample images of a) Melanoma, b) Nonmelanoma (BCC) and c) Nonmelanoma (SCC). https://doi.org/10.1371/journal.pone.0295151.g001
4 and $V_n \in \mathbb{R}^{H \times W \times C_N}$ respectively. Here $n = 1, 2, \ldots, N$ represents the dimensions of each head. The MLP in the transformer encoder in the proposed model has two layers with GeLU. Layer normalization is added before each block, which helps to enhance the training performance.
3. SkinViT MLP head: The transformer encoder output is fed into the newly designed SkinViT MLP head to classify Melanoma and Nonmelanoma skin cancer. The MLP head comprises flatten layers to flatten the output, a dense layer with the Swish activation function and an SVM linear kernel L2 as a classifier. Swish is a nonlinear and continuous function. It has a non-zero gradient for negative inputs, which allows better optimization during training.
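The two ingredients of the MLP head described above can be sketched in a few lines. This is an illustrative reading, not the authors' implementation: Swish is taken as x * sigmoid(x), and the "SVM linear kernel L2" classifier is assumed here to be a linear layer trained with a hinge loss plus an L2 weight penalty. The function names, the choice of the plain hinge loss, and the way the L2 value enters the objective are all assumptions.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x * sigmoid(x)


def swish_grad(x):
    """Derivative of Swish; unlike ReLU it does not vanish for all negative x."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)


def l2_svm_loss(weights, bias, features, labels, l2=0.1):
    """Mean hinge loss of a linear classifier plus an L2 penalty on the weights.

    labels must be -1 or +1. Using l2 as the penalty coefficient is an
    assumption made for this sketch.
    """
    hinge_total = 0.0
    for x, y in zip(features, labels):
        score = sum(w * xi for w, xi in zip(weights, x)) + bias
        hinge_total += max(0.0, 1.0 - y * score)
    penalty = l2 * sum(w * w for w in weights)
    return hinge_total / len(features) + penalty
```

In this reading, the dense Swish layer produces the feature vector, and the linear layer with the margin-based loss plays the role of the SVM, with the L2 term discouraging large weights.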

Table 1. Comparative analysis of the related work. Columns: Author, Method, Dataset, Classes, Accuracy, Precision, Recall, AUC.
The Nonmelanoma class comprises BCC and SCC. Previous work primarily focused on CNN models. CNNs face problems distinguishing low level features, which may result in missing crucial information. Furthermore, a minimal false alarm rate is vital for accurate medical diagnosis. The focus of this research is on the automatic and accurate detection of Melanoma and Nonmelanoma, which can help reduce the mortality rate due to skin cancer through early diagnosis and ease the burden on dermatologists. For efficient Melanoma and Nonmelanoma classification, we propose the SkinViT model based on outlooker and transformer blocks, which can further aid the development of automated skin cancer detection. https://doi.org/10.1371/journal.pone.0295151.t001
Fig 2. Architecture of SkinViT. https://doi.org/10.1371/journal.pone.0295151.g002
Table 4 depicts the performance of the proposed SkinViT for different values of L2. From Table 5 it can be observed that the best result was obtained with a value of 0.1, with 0.9109 on Dataset1, 0.8611 on Dataset2 and 0.8911 on Dataset3.
2. Selection of Optimizer: In this simulation, the proposed model is trained with different optimizers to evaluate the classification performance, as given in Table 5. SkinViT performed best with the Adam optimizer, achieving a classification accuracy of 0.9109 on Dataset1, 0.8611 on Dataset2 and 0.8911 on Dataset3, which is superior to RMSprop, which achieved 0.8996 on Dataset1, 0.8518 on Dataset2 and 0.8733 on Dataset3.

Table 8. Performance Comparison of SkinViT on different datasets. https://doi.org/10.1371/journal.pone.0295151.t008
9254, followed by MobileNet by Damian et al. with 0.897 accuracy and STDP based SNN by Zhou et al. with 0.877 accuracy. The higher accuracy of SkinViT is due to the fewer false classifications, which