Evaluating deep learning-based melanoma classification using immunohistochemistry and routine histology: A three center study

Pathologists routinely use immunohistochemical (IHC)-stained tissue slides against MelanA in addition to hematoxylin and eosin (H&E)-stained slides to improve their accuracy in diagnosing melanomas. The use of diagnostic Deep Learning (DL)-based support systems for automated examination of tissue morphology and cellular composition has been well studied in standard H&E-stained tissue slides. In contrast, few studies have analyzed IHC slides using DL. Therefore, we investigated the separate and joint performance of ResNets trained on MelanA and corresponding H&E-stained slides. The MelanA classifier achieved an area under the receiver operating characteristic curve (AUROC) of 0.82 and 0.74 on out-of-distribution (OOD) datasets, similar to the H&E-based benchmark classification of 0.81 and 0.75, respectively. A combined classifier using MelanA and H&E achieved AUROCs of 0.85 and 0.81 on the OOD datasets. DL MelanA-based assistance systems perform similarly to the benchmark H&E classification and may be improved by multi-stain classification to assist pathologists in their clinical routine.


Introduction
Melanoma diagnoses have increased in recent decades (1) and melanoma is the fifth most common cancer in the United States (2). In spite of its relatively high frequency, melanoma is often difficult to differentiate histopathologically from nevi, with a diagnostic discordance rate of up to 25% reported even among experienced histopathologists (3,4). If a melanoma is initially misclassified as a nevus and therefore diagnosed at a later stage, the patient's chances of survival are significantly reduced and therapy will likely have to be more intense. On the other hand, if harmless benign lesions are diagnosed as melanoma, the patient will suffer an unnecessary psychological and physical burden. In individual cases, overdiagnosis can even lead to unnecessary, expensive and stressful therapies, which can also be associated with high costs in the healthcare system and unnecessary toxicity for affected patients. More precise diagnostic options could contribute to overcoming these problems.
Due to the rapid technological advances of the last few years, AI-based assistance systems may become powerful tools for pathological cancer diagnostics. Deep Learning (DL) with Convolutional Neural Networks (CNN) has shown promise in studies aimed at distinguishing melanomas and nevi on digitized hematoxylin and eosin (H&E)-stained whole slide images (WSI), even outperforming humans in some cases (5). However, the accuracy of these classifiers, especially on external data, still shows room for improvement.
In addition to standard H&E-stained slides, immunohistochemical (IHC)-stained tissue sections are often available for many cancer entities and represent a source of complementary prognostic and/or predictive information. The analysis of IHC-stained slides by DL models is a relatively new area of research. Recent studies, however, have successfully employed DL on IHC-stained slides of non-skin cancer entities, e.g., to determine HER2 status in breast cancer (6) and to use immune cell multistains as prognostic and predictive biomarkers in colorectal cancer (7). Moreover, as shown in previous work, the fusion of different data modalities often improves generalizability and performance of DL models (7-9).
IHC expression of MelanA can be automatically analyzed using state-of-the-art artificial intelligence (AI) methods. In this study, we investigate the use of DL-based image analysis models on MelanA-stained tissue for melanoma classification in comparison and in addition to the standard H&E-based diagnosis.

Materials and Methods
The presented study investigates lesions that were melanoma-suspicious on dermatoscopic investigation and were verified histopathologically as melanoma or nevus. We use DL models to classify whether a lesion is a melanoma or a nevus based on MelanA- or H&E-stained tumor tissue or a combination of both stains.

Datasets
The inclusion criteria for our study were being 18 years old and having melanoma-suspicious skin lesions that were biopsied after dermoscopic examination. Suspicious lesions that were pre-biopsied or located near the eye or under the fingernails or toenails were excluded. The ground truth labels were histopathologically confirmed by at least one reference dermatopathologist investigating at least the H&E-stained reference slide. MelanA (MART-1) (16,17) immunohistochemical (IHC) and hematoxylin and eosin (H&E)-stained tissue slides from the university hospital in Dresden were used for training, validation and hold-out testing. Slides from the university hospital in Erlangen and from the National Cancer Institute of Naples were used for out-of-distribution (OOD) testing. Table 1 describes the population of all three cohorts. The Dresden, Erlangen and Naples cohorts were collected prospectively. Data received before 2023 from the university hospital in Dresden were used as the training set, and data received later were used as the holdout test dataset. The labels of the datasets were pathologically verified. All three cohorts differ in the staining of the MelanA slides: antibodies from different manufacturers with different dilutions were used at each site (Table 2).

Pre-processing
IHC and adjacent H&E slides from the Dresden, Erlangen and Naples cohorts were digitized with an Aperio® AT2 Slide Scanner at 40x magnification, resulting in WSIs with a resolution of 0.25 µm/px. Tumor boundaries were manually annotated under expert supervision with the QuPath digital pathology software version 0.3 (18). WSIs were tessellated into patches of 237 px x 237 px by an in-house developed QuPath script, at four magnifications (40x, 20x, 10x, 5x) for IHC WSIs and at 40x magnification for H&E WSIs. Tiles at 40x magnification were created with a size of 60 x 60 µm, which corresponds to 237 px x 237 px. Tile sizes at 20x, 10x and 5x magnification are 120 x 120 µm, 240 x 240 µm and 480 x 480 µm, respectively.
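The relation between magnification, physical tile size and pixel spacing can be checked with a few lines of arithmetic. The following sketch only uses the numbers stated above (60 µm tiles of 237 px at 40x) and assumes that tiles keep the same pixel size at every magnification; the function names are illustrative and not part of the study's pipeline.

```python
BASE_MAGNIFICATION = 40    # scanning magnification reported in the text
BASE_TILE_EDGE_UM = 60.0   # physical tile edge at 40x, from the text
TILE_EDGE_PX = 237         # tile edge in pixels, from the text


def tile_edge_um(magnification: float) -> float:
    """Physical tile edge in micrometres at the given magnification."""
    return BASE_TILE_EDGE_UM * BASE_MAGNIFICATION / magnification


def effective_um_per_px(magnification: float) -> float:
    """Pixel spacing, assuming every tile is 237 px wide (an assumption)."""
    return tile_edge_um(magnification) / TILE_EDGE_PX


for mag in (40, 20, 10, 5):
    edge = tile_edge_um(mag)
    print(f"{mag}x: {edge:.0f} um edge, {effective_um_per_px(mag):.3f} um/px")
```

Halving the magnification doubles the physical edge length, which reproduces the 120, 240 and 480 µm tile sizes listed above.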

Models
To classify pigmented lesions as melanomas or nevi, the ResNet architecture introduced by He et al. (19) was selected as the model for all data modalities. The hyperparameters of the different models were tuned individually using the Bayesian optimization framework Optuna (20) and five-fold cross-validation. To avoid overfitting to slides containing a very large number of tiles, we used weighted sampling to train with a predefined number of tiles per slide in all epochs.
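The effect of training with a fixed number of tiles per slide can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that every slide contributes the same number of randomly drawn tiles per epoch (which corresponds to weighting each tile by the inverse of its slide's tile count), and the function name and the dictionary layout are hypothetical.

```python
import random
from typing import Dict, List


def sample_epoch_tiles(
    tiles_by_slide: Dict[str, List[str]],
    tiles_per_slide: int,
    rng: random.Random,
) -> List[str]:
    """Draw the same number of tiles from every slide for one epoch."""
    epoch: List[str] = []
    for tiles in tiles_by_slide.values():
        if not tiles:
            continue  # a slide without tiles cannot contribute
        if len(tiles) >= tiles_per_slide:
            # Enough tiles: draw without replacement.
            epoch.extend(rng.sample(tiles, tiles_per_slide))
        else:
            # Small slide: draw with replacement to reach the quota.
            epoch.extend(rng.choices(tiles, k=tiles_per_slide))
    rng.shuffle(epoch)
    return epoch
```

Calling the function again at the start of each epoch yields a fresh draw, so large slides are still covered over the course of training without dominating any single epoch.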
The tuned hyperparameters were the size of the ResNet, the learning rate, the number of training epochs, the type of pooling, the number of tiles used per training epoch and whether or not the ResNet was initialized with weights pretrained on ImageNet. The parameterization of all models is shown in the supplements in Table S1.
The slide prediction procedure for the different image modalities is as follows: The models were trained at the tile level, using the slide label for each tile of the slide. All tiles of the slide were predicted and the slide score was calculated by averaging all tile scores (see Figure 1). To train models capable of handling domain shifts, the color jitter augmentation of PyTorch (21) was used as part of the training process. In contrast to the features visible in H&E-stained slides, features of protein expression can be distributed over a larger area in the cytoplasm. For this reason, different magnifications were used to analyze these larger features.
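The aggregation from tile scores to a slide score described above amounts to an unweighted mean. A minimal sketch follows; the function names and the default threshold of 0.5 are placeholders chosen for illustration, since the text refers to an optimal decision threshold without stating its value.

```python
from statistics import fmean
from typing import Sequence


def slide_score(tile_scores: Sequence[float]) -> float:
    """Average the per-tile melanoma probabilities of one slide."""
    if not tile_scores:
        raise ValueError("a slide needs at least one tile score")
    return fmean(tile_scores)


def predict_slide(tile_scores: Sequence[float], threshold: float = 0.5) -> str:
    """Label a slide by comparing its mean score with a threshold.

    The default threshold is a placeholder, not a value from the study.
    """
    return "melanoma" if slide_score(tile_scores) >= threshold else "nevus"
```

Because every tile has the same weight, a small malignant region inside a large benign lesion can be averaged out, which is relevant to the large mixed lesions mentioned in the Discussion.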

Combined Models
Unimodal classifiers were combined to build models based on multiple data modalities. A classifier based on all four MelanA magnifications was built, in which predictions with higher certainty contribute more to the combined prediction. Scores of the different magnification models were averaged, weighted by their distance to the optimal decision threshold. Other fusion approaches, such as averaging the scores unweighted or weighted by the models' validation performance, were also investigated, all of which yielded comparable results (data not shown). The H&E classifier was combined with the MelanA multiscale classifier using the same fusion method.
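One possible reading of this certainty-weighted fusion is sketched below. The exact weighting formula is not given in the text, so the sketch assumes that each model's weight is the absolute distance between its slide score and its own decision threshold, and that the fused score is the weighted mean of the slide scores; the function name and the fallback for zero weights are likewise assumptions.

```python
from statistics import fmean
from typing import Sequence


def fuse_scores(scores: Sequence[float], thresholds: Sequence[float]) -> float:
    """Weighted mean of model scores; weight = distance to the threshold."""
    if not scores or len(scores) != len(thresholds):
        raise ValueError("need one threshold per score")
    weights = [abs(s - t) for s, t in zip(scores, thresholds)]
    total = sum(weights)
    if total == 0.0:
        # Every model sits exactly on its threshold: plain average.
        return fmean(scores)
    return sum(w * s for w, s in zip(weights, scores)) / total
```

With scores of 0.9 and 0.55 and thresholds of 0.5 for both models, the confident model receives eight times the weight of the uncertain one, so the fused score lies close to 0.9.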
Motivated by clinical practice, we investigated another setup, called the hierarchical setup, in which we first predict the label with the H&E classifier and add the MelanA-based classifier only for those lesions where the H&E WSI leads to an uncertain prediction.
To determine whether an H&E-based prediction was uncertain, we calculated confidence intervals (CIs) of the slide-level score via bootstrap and then checked whether the optimal decision threshold was contained in the 95% CI. For cases where the threshold was contained in the CI of the slide-level score, we added the MelanA-based classifier.
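The uncertainty check can be sketched as follows. The text does not state what is resampled, so this sketch assumes that the tile scores of a slide are resampled with replacement and that the percentile method is used for the interval; the function names, the number of bootstrap rounds and the fixed seed are illustrative choices.

```python
import random
from statistics import fmean
from typing import List, Sequence, Tuple


def bootstrap_ci(
    tile_scores: Sequence[float],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI of the mean tile score (tiles resampled: an assumption)."""
    rng = random.Random(seed)
    n = len(tile_scores)
    means = sorted(fmean(rng.choices(tile_scores, k=n)) for _ in range(n_boot))
    lower = means[int(alpha / 2 * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


def is_uncertain(tile_scores: Sequence[float], threshold: float) -> bool:
    """True if the decision threshold lies inside the bootstrap CI."""
    lower, upper = bootstrap_ci(tile_scores)
    return lower <= threshold <= upper


def modalities_for_slide(he_tile_scores: Sequence[float], threshold: float) -> List[str]:
    """Hierarchical setup: consult MelanA only for uncertain H&E predictions."""
    if is_uncertain(he_tile_scores, threshold):
        return ["H&E", "MelanA"]
    return ["H&E"]
```

A slide whose tile scores cluster far from the threshold yields a narrow interval and is handled by the H&E classifier alone, whereas a slide with a mean score near the threshold triggers the additional MelanA-based prediction.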

Reporting
For all results, 95% CIs are given next to the corresponding areas under the receiver operating characteristic curve (AUROCs) of the model. CIs were calculated using the bootstrap method (22).
The predicted values of a cohort were resampled with replacement, and the AUROC was calculated for each resulting bootstrap cohort. After 10,000 repetitions, the 2.5% and 97.5% quantiles were taken as the bounds of the 95% CI.
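The bootstrap procedure for the AUROC can be sketched in plain Python. The AUROC is computed here in its Mann-Whitney form (the fraction of melanoma/nevus pairs that are ranked correctly, with ties counted as one half); redrawing resamples that contain only one class is an assumption of this sketch, as are the function names and the fixed seed.

```python
import random
from typing import List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the share of correctly ranked positive/negative pairs."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_ci(
    labels: Sequence[int],
    scores: Sequence[float],
    n_boot: int = 10_000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile 95% CI of the AUROC over resampled cohorts."""
    rng = random.Random(seed)
    indices = range(len(labels))
    stats: List[float] = []
    while len(stats) < n_boot:
        sample = rng.choices(indices, k=len(labels))
        ys = [labels[i] for i in sample]
        if 0 < sum(ys) < len(ys):  # skip one-class resamples (assumption)
            stats.append(auroc(ys, [scores[i] for i in sample]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[min(n_boot - 1, int(0.975 * n_boot))]
```

The pairwise AUROC is quadratic in cohort size, which is acceptable for cohorts of a few hundred slides; for larger cohorts a rank-based implementation would be faster.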

Results
We trained one model on H&E slides with a resolution of 0.25 µm/px, as has been done in several other works (8,23-26), and four models on the corresponding MelanA slides with resolutions of 0.25, 0.5, 1.0 and 2.0 µm/px, respectively. Afterwards, we fused the four MelanA-based models into one MelanA classifier and all five models into one multi-modal classifier. The H&E-based model as well as all MelanA-based models and all combinations were tested in-distribution (InD) on a holdout set from the university hospital in Dresden, and OOD on cohorts from the university hospital in Erlangen and the National Cancer Institute of Naples (see Table 1). AUROCs and bootstrapped CIs for all models are shown in Table 3.
All results differ significantly from random guessing, since no CI contains the critical value of 0.5. Thus, all developed MelanA-based models as well as the benchmark H&E-based model are able to classify melanoma. Besides this, it should be highlighted that almost all models on all cohorts perform with an AUROC significantly better than 0.7, which makes the findings potentially relevant for clinical practice. However, note that CIs overlap in several cases, indicating that different models perform similarly and thus probably contain a large amount of shared information.
In addition, we investigated another hierarchical approach motivated by clinical practice, using only MelanA-stained slides for cases where the H&E-based model is uncertain.
The ROC diagrams of the MelanA-based, the H&E-based, and the combined models for all three cohorts are shown in Figure 2. Another representation of this plot, which makes it easier to compare models within one cohort, is shown in the supplements in Figure S3. Additional ROC plots of the individual MelanA models, which consider only one magnification, and results of the hierarchical approach are shown in the supplementary material in Figure S1 and Figure S2.

MelanA-based classifiers
The curves of the different magnifications are shown in the supplementary material in Figure S1.
They overlap at several points in all cohorts, which means that for different sensitivity/specificity trade-offs, different magnifications lead to the best results. In the internal cohort, the classifiers reached AUROCs between 0.85 and 0.92, in the Erlangen cohort AUROCs between 0.67 and 0.78, and in the Naples cohort AUROCs between 0.75 and 0.80. The CIs of the different magnifications overlap in all cohorts, so no magnification performs significantly better than the others overall.
The combination of all 4 magnifications, shown in Figure 2 A), was not significantly different from the models that use only one magnification.
In the Dresden (0.88) and Erlangen (0.74) cohorts, the AUROC of the combined MelanA model is numerically worse (without considering CIs) than that of the 0.50 µm/px model as a stand-alone classifier. For the Naples cohort, the AUROC of the combined MelanA classifier (0.82) is slightly, but not significantly, better than that of all individual models.

H&E-based classifier
The classifier using only H&E-stained tissue, our benchmark, achieved an AUROC of 0.96 on the internal test set and AUROCs of 0.75 and 0.81 on the external Erlangen and Naples cohorts, respectively. The ROC plot in Figure 2 B) and the results in Table 3 show that the internal performance is significantly better than the external performance. Performance on the two external datasets is not significantly different.

Combined Classifiers using H&E and MelanA
The model based on both data modalities, the H&E-stained tissue as well as the MelanA-stained tissue at all investigated resolutions, shown in Figure 2 C), performs numerically slightly worse than the H&E model on the Dresden cohort, reaching an AUROC of 0.94. However, in the external cohorts the combined model performs best in absolute numbers, reaching AUROCs of 0.81 and 0.85 on the Erlangen and Naples cohorts, respectively. Nevertheless, the performance of the combined model is not significantly different from that of the MelanA-based model or the H&E-based model for any of the investigated cohorts.
The hierarchical approach, in which MelanA predictions are only taken into account when the H&E-based prediction is uncertain and which better reflects the diagnostic path, leads to the ROC plots shown in Figure S2. This approach resulted in the numerically best, albeit still not significantly different, performance on the internal cohort. It did not change the results on the smaller external cohorts, since the H&E-based model was uncertain for only one sample in the Naples cohort and was certain for all samples in the Erlangen cohort.

Discussion
In this work, we were able to classify melanomas and nevi across multiple datasets on MelanA slides with an accuracy similar to that on benchmark H&E slides using DL-based image analysis.
Furthermore, the results suggest that the multistain approach has the potential to improve prediction accuracy and robustness, since the combined model reached the highest AUROCs on both external cohorts.
To integrate the presented work into clinical practice, a method for AI-pathologist interaction needs to be developed. For this purpose, we are developing an explainable artificial intelligence system in collaboration with dermatologists (27), which produces easily interpretable explanations based on dermatoscopic images and aims to be integrated as an AI tool into digital pathology and clinical practice. Such a system can be expanded to include other data modalities such as immunohistochemistry or routine histology.
In clinical practice, pathologists often use H&E-stained tissue sections for melanoma diagnosis and resort to IHC-stained tissue in uncertain cases (28). While DL-assisted detection of melanoma on H&E sections has been well studied (5), few studies have been performed using additional routine IHC-stained slides. Digital image analysis by automated quantification of the proliferation marker Ki-67 was used to distinguish melanoma from nevi as a diagnostic and prognostic aid (29). Recently, an improved DL annotation method for H&E/SOX10 dual stains was developed to better identify tumor cells in cutaneous melanoma (30). In the study presented here, MelanA-stained tissue was selected as an additional diagnostic tool since it highlights the cytomorphology and the distribution of melanocytes, thereby allowing a more accurate evaluation of the architecture of any melanocytic tumor, along with the size and the shape of single cells. Other IHC stains such as HMB45, p16, and PRAME were excluded because they are useful only in selected cases. SOX10 was not chosen because it is a nuclear marker and provides no information about the actual size of melanocytes or the morphologic features of their dendritic processes. Finally, Ki-67, although widely used in routine practice, is of little help in the recognition of in situ and early invasive melanoma (11-14).
In the current pathological routine, IHC markers including MelanA are used heterogeneously in different hospitals and laboratories. At the university hospital in Dresden, generally all dermatologically melanoma-suspicious skin lesions are stained with MelanA, providing an unbiased training dataset for our study. In contrast, the OOD datasets likely contain more challenging lesions, since at the university hospital in Erlangen MelanA-stained tissue was prepared only when the H&E-stained slides provided uncertain pathological results. The Naples dataset contained 40% in situ melanomas, all of which are small in size and generate few tiles, which potentially makes classification difficult in general.
The Dresden test set apparently does not benefit from the inclusion of additional data (Figure S3), since the H&E-stained tissue slides are already sufficient to yield maximum accuracy. This may be due to the rather unambiguous dataset, which yields very high performance, and to a broad dataset with many subclasses. In contrast, the OOD datasets benefit from incorporating the additional MelanA-stained slides, making the classifier externally more robust. A combined classifier thus provides an advantage here, a finding we have already made in predicting BRAF status using H&E, clinical and methylation data in melanoma (8), suggesting that a multi-stain classifier can lead to better generalizability. Although the information contained in the H&E- and the MelanA-stained slides is probably partially redundant, one can still see a benefit of combining both stains on OOD data.
Due to the cytoplasmic distribution of the MelanA protein, tiles from a higher magnification can potentially be too small to extract all relevant features. Pathologists frequently investigate the MelanA stains at lower magnifications to evaluate the silhouette and overall architecture of the lesion, which also contains valuable information. Our data could not show that there is an identifiable best magnification. However, each magnification appears to contain partly different information, as the combination of all four magnifications brings a slight overall improvement.
Contrary to clinical practice, the hierarchical approach did not lead to any improvement on external datasets. This suggests that an unbiased dataset is preferable for training a DL model, since the network can make better decisions with larger datasets. Interestingly, the uncertain Dresden specimens are lesions with large diameters of 8 mm to 17 mm, where a melanoma has developed in the center of a nevus, with melanoma features smoothly merging into nevus features. This probably confuses the model, as all tiles are weighted equally in our model. In contrast, the uncertain lesion in Naples is very small, with a diameter of <1.0 mm.

Limitations
Overall, the major limitation of this study is the relatively small sample size of the external test sets. In addition, the above-mentioned variability in the pathological routine as well as the different staining protocols of the respective clinics complicate the comparison of the results and findings. Furthermore, considerable label noise must be taken into account: although the labels were histopathologically verified according to the gold standard of care, a high interrater variability must be assumed, as shown in previous studies (4,31).

Conclusions
With DL analysis of MelanA-stained tissue, we were able to classify melanomas and nevi in two distinct OOD cohorts with an accuracy similar to that achieved with H&E-stained tissue. The numerically, but not statistically significantly, better classification results achieved by combining H&E and MelanA classifiers suggest that the combination of these image modalities may lead to improved generalizability and performance. However, these results need to be confirmed in larger studies containing more lesions.
Brinker, German Cancer Research Center, Heidelberg, Germany). The sponsors had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Institutional Review Board Statement: Ethics approval was obtained from the ethics committees of the respective universities before the study was initiated. Patients provided informed written consent. This work was performed in accordance with the Declaration of Helsinki (32).
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Figure 1: Schematic diagram of the different models. The red box shows the pipeline for MelanA-stained WSIs and the purple box the pipeline for H&E-stained WSIs. We tessellated MelanA-stained WSIs at different magnifications and trained individual models on each tile size. The class probabilities for each tile were predicted and aggregated into a slide score by averaging all tile scores. For the H&E-based model we proceeded in the same way.

Figure 2: ROC plots by data modality with corresponding AUROC values. The different subplots show results for the individual developed models. A: H&E-based performance. B: MelanA-based performance taking all magnifications into account. C: combined model using H&E as well as MelanA by aggregating the individual scores. The colors of the ROC curves indicate the data source site: red, internal results (Dresden); blue, external results (Erlangen); purple, external results (Naples).

Table 1: Description of the population in our datasets. For continuous features we report median, range, and number of NAs; for categorical features we report the total number of observations per group. The training population as well as all three test populations are described.

Table 2: Antibodies and parameters of staining methods used by the different clinics.

Table 3: AUROC values as well as 95% bootstrapped CIs for the three test cohorts and all developed models.

Table S1: Hyperparameters of all developed models.