
Chi2 weighted ensemble: A multi-layer ensemble approach for skin lesion classification using a novel framework - optimized RegNet synergy with Attention-Triplet

Abstract

Skin lesions, including various abnormalities and potentially fatal skin cancers, require early detection for effective treatment. However, current methods often struggle to identify the precise areas responsible for these abnormalities. To address this, we propose a novel Transfer Learning-based framework that integrates Optimized RegNet Synergy architectures and Attention-Triplet mechanisms (comprising channel attention, squeeze-excitation attention, and soft attention), combined with an advanced Ensemble Learning strategy. A significant gap in current research is the lack of techniques for optimal weight allocation in model predictions. Our study fills this gap by introducing the Chi2 Weighted Ensemble (CWE) method, which is further extended into a Multi-Layer Chi2 Weighted Ensemble (ML-CWE) to improve model aggregation across multiple layers. Evaluation on the HAM10000 dataset demonstrates that our ML-CWE approach achieves an accuracy of 94.08%, outperforming existing state-of-the-art methods. To enhance model interpretability, we employ Gradient-weighted Class Activation Mapping (Grad-CAM) to highlight critical regions of interest, improving both transparency and reliability. This work not only boosts accuracy but also facilitates early diagnosis, addressing challenges related to time, accessibility, and cost in skin lesion detection, and offering valuable insights for practical applications in dermatology.

1 Introduction

Skin lesions represent abnormal changes in the skin’s appearance, while skin diseases encompass a broad range of conditions affecting the skin’s health, structure, and function. These conditions range from common ailments like acne to more severe issues such as skin cancer. While skin diseases present a variety of symptoms, they are not solely defined by the presence of lesions. Lesions arise due to infections, inflammatory responses, allergic reactions, malignancies, insect bites, trauma, autoimmune disorders, genetic predispositions, environmental factors, vascular anomalies, warts, and cysts, each with unique causes and characteristics [1]. Broadly, lesions are classified based on their potential harm. ‘Benign skin lesions’ are non-cancerous and generally pose no significant threat, including examples like moles, skin tags, warts, seborrheic keratoses, and hemangiomas. In contrast, ‘malignant skin lesions’ are cancerous, with the potential to spread to other parts of the body, with basal cell carcinoma, squamous cell carcinoma, and melanoma being the most common types [2].

Accurate diagnosis and effective treatment of skin conditions typically require a combination of clinical assessments and diagnostic tests. Neglecting symptoms can lead to serious consequences, including the development of skin cancer, which is the most common form of cancer globally. Melanoma, although relatively rare, is the leading cause of skin cancer-related deaths [3]. According to recent estimates, around 2.2% of men and women are diagnosed with melanoma during their lifetime, with 97,610 new cases and 7,990 deaths reported in the United States in 2023. The impact of melanoma is significant, with over 1.4 million people currently living with the disease in the U.S. [4].

Early detection of skin lesions is crucial for preventing the progression of serious conditions. However, many individuals are unaware of their skin abnormalities due to the extensive medical evaluations and associated costs. Dermatoscopy, or dermoscopy, is a non-invasive diagnostic technique that uses a magnifying device with lighting to examine skin lesions, aiding in the early detection of skin cancer and other skin conditions. While effective, the accuracy of dermatoscopy depends heavily on the expertise of the examiner, which introduces the potential for human error [5].

In contrast, AI-powered systems, particularly those leveraging Machine Learning (ML) and Deep Learning (DL) techniques, show great promise in skin lesion detection by enabling rapid image analysis, early diagnosis, and improved medical outcomes. Despite advancements by numerous researchers, challenges remain, including overreliance on data-rich classes and difficulties in capturing deep features in TL models without fine-tuning. Additionally, combining multiple models effectively is complex, and current models often lack interpretability while being prone to bias from using the same data for validation and testing. Transfer learning (TL) models like DenseNet and ResNet also face issues such as inefficient scaling, reliance on manual design choices, and rigidity in fixed scaling rules, which hinder adaptation to diverse environments and lead to suboptimal performance. High computational costs further limit their suitability for resource-constrained settings, highlighting the need for more flexible architectures and systematic exploration of design spaces to optimize performance across varied tasks [6].

In response to these challenges, researchers have explored the use of convolutional neural networks (CNNs) and ensemble learning techniques to overcome the limitations of individual models. However, the absence of an optimal weight selection process for each model in traditional ensemble techniques, such as majority voting, softmax averaging, and weighted averaging, limits the accuracy of results. These methods do not consider the varying importance of individual predictors, which leads to suboptimal outcomes. As a result, there is a clear need for more sophisticated approaches that effectively harness the strengths of different models while ensuring accurate and interpretable results.

Our methodology is strategically designed to directly confront these challenges, with the primary objective of addressing the following key research questions. These questions form the foundation for developing a robust architectural framework that provides well-informed responses.

RQ1: What actions optimize the Transfer Learning model for specific tasks?

- With a vast array of TL models available, selecting the appropriate one is challenging. Moreover, the fixed architecture of models trained on the ImageNet dataset may not be suitable for all tasks. Therefore, optimizing and tailoring a TL model to the specific task is crucial for achieving optimal performance.

RQ2: What approach effectively identifies the most critical features, particularly significant areas or regions?

- In classification tasks, not all regions of an image contribute equally to feature extraction; some introduce redundant or irrelevant information, negatively impacting results. Identifying and focusing on the most crucial regions is essential for improving classification accuracy.

RQ3: Is a single algorithm sufficient, or is an Ensemble Learning (EL) technique necessary? If so, which should be employed?

- No single algorithm consistently classifies all data accurately, making reliance on one method risky. Employing an EL technique mitigates this issue, offering a more reliable solution through a combination of algorithms.

RQ4: What are the limitations of traditional EL methods that demand a new approach?

- Traditional EL methods cannot always determine the optimal prediction ratio for each model in an ensemble. A novel approach that calculates and applies the optimal ratio for model predictions is necessary to enhance overall performance.

The above-mentioned inquiries are meticulously addressed, culminating in the study’s significant contributions:

  • The issue of class imbalance is effectively tackled through rigorous augmentation of the training dataset. This strategic augmentation ensures balanced class distribution, preventing model bias toward dominant classes. As a result, the architecture demonstrates reliability and impartiality in handling test and validation data.
  • To optimize the TL model for specific tasks, we select the RegNet model due to its versatile design and advantages. We then optimize various versions of RegNet with customized layers and combine them into the “Optimized RegNet Synergy (ORNS)” network, capable of extracting both shallow and complex deep features.
  • To focus on critical features, we ingeniously integrate Attention-Triplet (AT) mechanisms within customized architectures. This innovative approach ensures that models concentrate on the most essential aspects of the input data.
  • We introduce a pioneering technique, the Chi2 Weighted Ensemble (CWE), to enhance the model’s robustness, accuracy, and generalization capabilities. This novel ensemble method operates across multiple layers, strategically boosting the architecture’s performance.
  • Grad-CAM visualization is incorporated to prioritize interpretability, allowing specific regions related to diagnosed skin conditions to be highlighted. This advanced visualization method enhances the transparency and insightfulness of the architecture.

The structure of the paper is organized to ensure clarity and coherence. It commences with an in-depth exploration of the existing literature in Sect 2, followed by a comprehensive presentation of the materials and methods in Sect 3. The subsequent Sect 4 provides a concise yet thorough analysis of the achieved performances. Building on these findings, Sect 5 engages in a comprehensive discussion, assessing the model’s practical implications along with potential areas for improvement, offering a holistic view. Ultimately, Sect 7 concludes the paper, encapsulating the essential takeaways and contributions of the study.

2 Literature review

The classification of skin lesions has been a well-explored area of research, with many studies contributing significantly to the understanding and advancement of this field. This section highlights the significant contributions of various researchers in this area. For instance, studies such as [7] through [11] have presented diverse approaches to classification and segmentation, each offering valuable insights for our current research. Additionally, a range of studies, from [12] to [17], have utilized Custom CNN architectures, while [16] and [17] have incorporated various transformation techniques. Innovative methods on skin lesion datasets have been explored in studies [18] and [19]. On the other hand, studies ranging from [20] to [27] have concentrated on feature extraction through TL, and studies [28] and [29] have combined soft attention mechanisms with TL.

Tajerian et al. [7] employed TL with EfficientNET-B1, achieving an 84.30% accuracy in identifying pigmented skin lesions. However, this approach struggled with highlighting specific features unique to skin datasets, affecting the diagnostic precision. In contrast, Hosny et al. [8] used AlexNet in a TL framework to classify skin lesions automatically, achieving high accuracy rates in diagnosing melanoma and nevus lesions. Dong et al. [9] introduced the TC-Net, a fusion network combining Transformer and CNN architectures, which significantly improved skin lesion segmentation by effectively integrating local and global features, outperforming other models like Swin UNet.

Khan et al. [10] developed the SkinViT architecture, incorporating an outlook attention mechanism, transformer blocks, and an MLP head block, which achieved up to 91.09% accuracy on different datasets, enhancing melanoma and nonmelanoma skin cancer classification. Singh et al. [11] proposed the SkiNet framework, which used Bayesian MultiResUNet for segmentation and DenseNet-169 for classification, achieving an accuracy of 86.67%, although this was considered suboptimal.

Saarela and Georgieva [12] proposed a Bayesian inference-based approach to improve model interpretability, achieving 80% accuracy, which was not particularly impressive. Sevli [13] created a CNN model for skin lesion classification, integrating it with a web application via a REST API and obtaining a 91.51% accuracy after evaluation by dermatologists. However, the custom CNN could not focus adequately on critical features. Shetty et al. [14] used a CNN to detect skin cancer, achieving a 94% accuracy, but their method was limited by using only a small subset of the dataset, raising concerns about generalizability.

Hoang et al. [15] used a lightweight neural network architecture, wide-ShuffleNet, for skin lesion classification, but it resulted in comparatively lower accuracy rates of 84.80% and 86.33% on different test datasets. Sun et al. [16] incorporated additional metadata and supplementary information during data augmentation, achieving an accuracy of 88.7% with a single model and 89.5% for the embedding solution, though the augmentation process lacked clarity. Nie et al. [17] presented a hybrid CNN-transformer model with focal loss, achieving an accuracy of 89.48%, but the approach had limitations in deep feature extraction.

Khan et al. [18] introduced a DL and Entropy-NDOELM-based architecture for multiclass skin lesion classification, fine-tuning EfficientNetB0 and DarkNet19 models, which achieved over 90% accuracy on all datasets. Ajmal et al. [19] developed a novel architecture combining DL models and a fuzzy entropy slime mould algorithm for feature optimization, achieving high accuracy on the HAM10000 and ISIC 2018 datasets with Grad-CAM for explainability.

Wang et al. [20] proposed a two-stream network combining DenseNet-121 and VGG-16, which extracted multiscale pathological information and achieved a 91.24% test accuracy, although the pre-trained model lacked fine-tuning. Mahbod et al. [21] explored the impact of image size on classification using TL, achieving a balanced multi-class accuracy of 86.2%, although the model was computationally heavy. Harangi et al. [22] used a TL-based CNN framework for multiclass classification with binary classification outcomes, achieving an average accuracy of 93.46%. However, the rationale for combining binary and multi-class classifications was not provided. Rahman et al. [23] created a weighted ensemble model using five deep neural networks via TL, enhancing the accuracy to 88%, but the model struggled with dataset specificity.

Popescu et al. [24] developed a skin lesion classification system using TL and collective intelligence, achieving an 86.71% validation accuracy through decision fusion but lacking independent test results. Nigar et al. [25] presented an explainable AI-based system using the LIME framework and ResNet-18, achieving 94.47% accuracy but relying on a small dataset and single pre-trained model. Gouda et al. [26] enhanced image quality using ESRGAN before applying a CNN, achieving 83.2% accuracy but not addressing data imbalance. Khan et al. [27] employed Resnet50 and a feature pyramid network for segmentation, followed by a 24-layer CNN for classification, achieving 86.5% accuracy but failing to utilize mask information during segmentation.

Datta et al. [28] explored the use of a soft-attention mechanism in skin cancer classification, achieving a 93.4% accuracy, though they struggled to find proper color channel weights. Nguyen et al. [29] combined DL with soft-attention, achieving accuracies of 90% and 86% with different models, but did not justify the choice of soft attention over other modules.

Building on the insights from these studies, our research addresses identified gaps by using the entire dataset and augmenting the training set to address data imbalance. This approach ensures the independence of the test set, providing a more accurate evaluation of the model on unseen data. We utilize the Attention-Triplet (AT) mechanisms to identify crucial regions of interest and integrate them with TL models. Additionally, we fine-tune the TL models and ORNS architecture to reduce dependence on the ImageNet dataset, and our novel ensemble approach optimally weights predictions, overcoming previous limitations.

3 Materials and methods

3.1 Dataset description

Our study utilizes the publicly available Human Against Machine (HAM10000) dataset from the Harvard Dataverse repository, meticulously curated to encompass a diverse collection of skin lesion samples [30]. It includes 10,015 dermatoscopic images, all in jpg format, distributed into 7 classes: Melanoma (MEL), Nevus (NV), Vascular lesions (VASC), Actinic keratosis (AK), Basal Cell Carcinoma (BCC), Benign keratosis (BKL), and Dermatofibroma (DF), where MEL, AK, and BCC are types of cancer. NV, BKL, and DF are non-cancerous, whereas some types of VASC can be cancerous. The overview of the dataset is presented in Table 1.

In Fig 1, examples of images are displayed, with one sample provided per class in the dataset, while the high degree of class representation imbalance is corroborated by the class distribution depicted in Fig 2.

To align with the objectives of our study, the dataset is meticulously preprocessed. Details on the specific version used can be found in [31].

3.2 Methodology

Our approach begins with dataset collection, followed by a vital phase of data preprocessing. The dataset is then partitioned into training, testing, and validation sets. To handle class imbalances, data augmentation is applied solely to the training set, ensuring that validation and testing remain unaffected by augmented data. We utilize Optimized RegNet Synergy (ORNS) architectures, which are trained on the training data and validated on the validation data. These models are then evaluated using the testing set. Predictions from each architecture are integrated using the Multi-Layer Weighted Ensemble (ML-CWE) method to boost performance. The efficacy of CWE is assessed across several layers to support our conclusions. Grad-CAM visualization is ultimately employed to provide insight into the models’ internal mechanisms. The sequential process of this investigation is illustrated in Fig 3.

3.3 Preprocessing and data augmentation

In this phase, we first categorize the images based on their lesion IDs and then carefully sample distinct images for the training, testing, and validation sets. We assign 15% of the images each to the testing and validation datasets, leaving 70% for the training dataset. To ensure that the testing set remains completely unseen during training, redundant images (multiple captures of the same lesion) are assigned to the training set. This strategy enhances the robustness and reliability of our model by keeping the test data entirely separate from the training process. After this separation, we apply augmentation techniques exclusively to the training data, maintaining the independence of the test and validation sets. Through this approach, we generate around 8,000 images per class, effectively addressing potential data imbalance issues.
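The lesion-ID-aware split described above can be sketched as follows; the grouping logic, identifier names, and counts here are illustrative, not the paper's actual code:

```python
import numpy as np

# Hypothetical lesion-ID-aware split: all images sharing a lesion ID stay in the
# same subset, so near-duplicate captures never leak from training into test/validation.
rng = np.random.default_rng(0)
lesion_ids = np.array([f"lesion_{i // 3}" for i in range(300)])  # ~3 images per lesion
unique_ids = rng.permutation(np.unique(lesion_ids))
n = len(unique_ids)
test_set = set(unique_ids[: int(0.15 * n)])              # 15% of lesions for testing
val_set = set(unique_ids[int(0.15 * n): int(0.30 * n)])  # next 15% for validation
split = np.array([
    "test" if lid in test_set else "val" if lid in val_set else "train"
    for lid in lesion_ids
])
# Verify no lesion appears in more than one subset.
for lid in np.unique(lesion_ids):
    assert len(set(split[lesion_ids == lid])) == 1
print(sorted(set(split)))  # ['test', 'train', 'val']
```

Splitting by lesion ID rather than by image is what keeps the test set genuinely unseen, since several dermatoscopic images in HAM10000 depict the same lesion.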

In our study, we implement a sophisticated image augmentation strategy using TensorFlow’s ‘ImageDataGenerator’. We begin by enhancing the contrast of the original images to improve their quality before augmentation. The augmentation process involves various transformations to significantly diversify the training data and strengthen the model’s robustness. These transformations include random rotations up to 180 degrees, width and height shifts of 10%, and zoom variations within a 10% range. Additionally, horizontal and vertical flips are used to further increase variability. To handle any gaps introduced by these transformations, we use the nearest neighbor fill mode to ensure consistency in the augmented images. This comprehensive approach simulates a wide range of possible image variations, thereby enhancing the generalization ability of our DL model.
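A minimal sketch of such an augmentation pipeline, assuming TensorFlow's ‘ImageDataGenerator’ with the parameter values stated above (the paper's exact settings may differ):

```python
import numpy as np
import tensorflow as tf

# Augmentation settings mirroring the description above (values are assumptions).
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=180,      # random rotations up to 180 degrees
    width_shift_range=0.1,   # horizontal shifts of 10%
    height_shift_range=0.1,  # vertical shifts of 10%
    zoom_range=0.1,          # zoom variations within a 10% range
    horizontal_flip=True,
    vertical_flip=True,
    fill_mode="nearest",     # fill transformation gaps with nearest-neighbor pixels
)

# Demo on a random "image" batch standing in for real dermatoscopic images.
images = np.random.rand(4, 224, 224, 3).astype("float32")
batch = next(datagen.flow(images, batch_size=4, shuffle=False))
print(batch.shape)  # (4, 224, 224, 3)
```

In practice, ‘flow’ (or ‘flow_from_directory’) would be iterated repeatedly per class until roughly 8,000 augmented images are accumulated.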

As shown in Fig 4, the original, contrast-enhanced, and augmented images are presented, using a sample from Vascular Lesions (VASC) along with its augmented versions. To address the issue of dataset imbalance, we aim to generate approximately 8,000 images in the training set for each class. Consequently, we achieve the following image distribution: AK (7,854), BCC (7,965), BKL (7,944), DF (7,377), MEL (7,932), NV (8,004), and VASC (7,706).

3.4 Creation of optimized RegNet synergy (ORNS) architectures

Our approach centers on the strategic utilization of Optimized RegNet models within the framework of Transfer Learning. Specifically, we focus on fine-tuning a diverse set of 24 pre-trained models, including all 12 variants of both RegNetX (RNX) and RegNetY (RNY), which accommodate input images of size 224x224x3. Recognizing that these models are not originally trained on our dataset, we meticulously fine-tune them to optimize the extraction of both shallow and deep features relevant to our data.

To achieve this, we introduce four customized CNN structures: Customized ORNS (C-ORNS), Channel Attention-based ORNS (CA-ORNS), Squeeze-Excitation Attention-based ORNS (SEA-ORNS), and Soft Attention-based ORNS (SA-ORNS). These architectures are specifically designed to leverage the power of AT, ensuring an enhanced focus on pertinent features during the learning process. The graphical depiction of the full architecture is illustrated in Fig 5.

The integration process begins by importing the pre-trained model from the ‘keras’ library, adapting it to our unique input shape, and transforming the output into a four-dimensional structure (None, height, width, channels). This adaptation aligns our model with the pre-trained one, allowing for seamless integration and effective feature extraction.

Fine-tuning is executed systematically, culminating in the generation of predictions from each individualized model for subsequent analysis.

The C-ORNS structure features two Convolution Blocks, each containing four ‘Conv2D’ layers with varying kernel sizes (7x7, 5x5, 3x3, 1x1), accompanied by ‘BatchNormalization’ layers. The first block employs 128 filters, while the second block uses 256 filters, with ‘MaxPooling2D’ layers condensing the output. The ReLU activation function is consistently used across all convolutional layers, effectively mitigating vanishing gradient issues.

Again, we enhance the C-ORNS architecture by integrating a Channel Attention (CA) module within each convolution block. The CA layer is positioned after each ‘Conv2D’ layer and its corresponding ‘BatchNormalization’ layer, refining feature emphasis and selectively enhancing significant channel-wise information at intermediate stages of processing.

In the SEA-ORNS variant, we embed the Squeeze-Excitation Attention (SEA) module after each convolution block, following the C-ORNS structure. The SEA layer recalibrates feature responses across channels, allowing high-level adjustment of channel importance and improving the model’s ability to capture complex, hierarchical features.

For the Soft Attention (SA) module in SA-ORNS, we adopt a similar integration approach as in SEA-ORNS but position the SA layer after each convolution block rather than after every ‘Conv2D’ layer. This selective placement balances the increased parameter count while capturing fine-grained patterns within feature maps.

Finally, the output from the last max-pooling layer in each architecture is flattened and passed through three fully connected layers, configured with dimensions of 1024, 512, and 7 (corresponding to the number of classes). The first two layers utilize the ReLU activation function, while the final layer employs softmax activation to predict class probabilities.

This methodical architecture design ensures that our ORNS models are finely tuned and fully optimized for extracting the most relevant features from our dataset.
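As an illustration, the customized head described above (two convolution blocks followed by the fully connected classifier) can be sketched in Keras; the input shape is a placeholder standing in for the reshaped RegNet backbone output, not the actual dimensions used in the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Four Conv2D layers with kernel sizes 7x7, 5x5, 3x3, and 1x1, each followed
    # by BatchNormalization and ReLU, then a MaxPooling2D layer condenses the output.
    for k in (7, 5, 3, 1):
        x = layers.Conv2D(filters, k, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    return layers.MaxPooling2D()(x)

# Placeholder input standing in for the reshaped pre-trained backbone output;
# the real spatial size depends on the chosen RegNet variant.
inputs = layers.Input(shape=(28, 28, 64))
x = conv_block(inputs, 128)               # first convolution block: 128 filters
x = conv_block(x, 256)                    # second convolution block: 256 filters
x = layers.Flatten()(x)
x = layers.Dense(1024, activation="relu")(x)
x = layers.Dense(512, activation="relu")(x)
outputs = layers.Dense(7, activation="softmax")(x)  # 7 lesion classes
model = Model(inputs, outputs)
print(model.output_shape)  # (None, 7)
```

The attention-based variants (CA-ORNS, SEA-ORNS, SA-ORNS) would insert their respective attention modules inside or after ‘conv_block’, as described above.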

3.4.1 Feature extraction process.

As previously mentioned, we employed ORNS models for effective feature extraction, excluding the top fully connected layers (using ‘include_top=False’) and applying global average pooling (using ‘pooling="avg"’). The resulting model output was reshaped to optimally supported dimensions, preparing it for further processing through multiple convolutional layers. These layers, equipped with filter sizes of 7x7, 5x5, 3x3, and 1x1, were systematically followed by ReLU activation and batch normalization to ensure stabilization and efficient learning. Max pooling layers were then utilized to reduce spatial dimensions and sharpen feature focus. The extracted feature maps were eventually flattened and passed through a series of fully connected layers with ReLU activation, leading to a final dense layer with softmax activation for class probability prediction.

Fig 6 provides a visual representation of the feature map activations at various stages of the TL model, exemplified here by an Optimized RegNetX002 architecture. Each row corresponds to activations from different layers within the model, offering a step-by-step view of the feature extraction process:

  • Input Layer (input_1): Displays the preprocessed input image, showing the raw pixel data.
  • Zero Padding (zero_padding2d): Feature maps after zero padding, which prepares the input tensor for subsequent convolution operations.
  • Convolution (conv2d): Activation maps post convolution with 64 filters, revealing learned patterns and edges.
  • Batch Normalization (batch_normalization): Normalized feature maps following batch normalization, enhancing training stability and convergence.
  • ReLU Activation (activation): Output after the ReLU activation function, introducing the necessary non-linearity to the network.
  • Max Pooling (max_pooling2d): Downsampled feature maps post max pooling, reducing spatial dimensions while preserving critical features.
  • Concatenation (concatenate): Activation maps after concatenating feature maps from previous layers, integrating multi-path information.
  • Dense Layer (dense): Transformed feature maps into vector form before entering the fully connected dense layer.
  • Output Layer (dense_1): Final layer activations showing class probabilities through softmax activation.
Fig 6. Feature extraction after activation of each layer (one sample).

https://doi.org/10.1371/journal.pone.0321803.g006

Each subplot illustrates up to 5 filters per layer, using the ‘viridis’ colormap to ensure clarity and contrast. This figure offers valuable insights into how the model progressively processes and transforms input images, capturing hierarchical features crucial for accurate classification.

This approach, exemplified through a single sample and a subset of layers, underlines our strategy of extracting thousands of feature images. These images significantly contribute to enhancing the algorithm’s overall performance by providing a detailed understanding of the feature extraction process across different layers.

3.4.2 Attention-Triplet (AT).

Our approach leverages a trio of attention modules, collectively referred to as the AT, to enhance model focus on critical input features while suppressing less relevant ones. This strategic incorporation of Channel Attention (CA), Squeeze-Excitation Attention (SEA), and Soft Attention (SA) enables our models to effectively capture and prioritize essential patterns within the data.

3.4.3 Channel Attention (CA).

The Channel Attention (CA) module refines feature maps by computing attention weights across channels. These weights, derived from the mean and standard deviation of the input feature maps, are applied to emphasize key features [32]. The process is mathematically represented as follows:

(1) $z = \sigma\left(W_2\,\delta\left(W_1\,[\mu(x);\, s(x)]\right)\right)$
(2) $\tilde{x} = z \otimes x$

where $x$ denotes the input feature maps of size $H \times W \times C$, $\mu(x)$ and $s(x)$ are the channel-wise mean and standard deviation, $W_1$ and $W_2$ are weight matrices, $\delta$ represents ReLU activation, $\sigma$ indicates sigmoid activation, $z$ is the calculated channel attention weight, and $\otimes$ denotes element-wise multiplication [32].
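A minimal NumPy sketch of this mechanism, assuming a two-layer bottleneck acting on the concatenated channel statistics (the actual layer shapes are not specified here, so the dimensions below are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """Sketch of Eqs (1)-(2): channel statistics -> bottleneck MLP -> reweighting."""
    mu = x.mean(axis=(0, 1))            # channel-wise mean over H, W: shape (C,)
    sd = x.std(axis=(0, 1))             # channel-wise std over H, W: shape (C,)
    stats = np.concatenate([mu, sd])    # shape (2C,)
    z = sigmoid(w2 @ relu(w1 @ stats))  # channel attention weights: shape (C,)
    return x * z                        # element-wise multiplication (broadcast over channels)

rng = np.random.default_rng(0)
x = rng.random((8, 8, 16))                  # toy H x W x C feature map
w1 = rng.standard_normal((4, 32)) * 0.1     # reduction weight matrix (assumed shape)
w2 = rng.standard_normal((16, 4)) * 0.1     # expansion weight matrix (assumed shape)
out = channel_attention(x, w1, w2)
print(out.shape)  # (8, 8, 16)
```

Because the sigmoid output lies in (0, 1), each channel is attenuated in proportion to its learned importance.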

3.4.4 Squeeze-Excitation Attention (SEA).

The Squeeze-Excitation Attention (SEA) module enhances the representational power of feature maps by combining spatial dimension reduction with channel-wise attention learning [33]. Given an input feature map $x$ of size $H \times W \times C$, the SEA module operates as follows:

(3) $z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j)$
(4) $s = \sigma\left(W_2\,\delta\left(W_1 z\right)\right)$
(5) $\tilde{x}_c = s_c \cdot x_c$

where Eq (3) squeezes each channel $c$ via global average pooling, Eq (4) learns the channel-wise excitation weights $s$ through a bottleneck of weight matrices $W_1$ and $W_2$ with ReLU ($\delta$) and sigmoid ($\sigma$) activations, and Eq (5) rescales each channel accordingly.

This approach ensures that the model dynamically recalibrates channel-wise feature responses, focusing on the most informative aspects of the input data.
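The squeeze-excitation recalibration can be sketched in NumPy as follows; the reduction ratio and weight shapes are assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excitation(x, w1, w2):
    """Squeeze (global average pooling) -> excitation (bottleneck MLP) -> rescaling."""
    z = x.mean(axis=(0, 1))         # squeeze: one descriptor per channel, shape (C,)
    s = sigmoid(w2 @ relu(w1 @ z))  # excitation: channel importance weights, shape (C,)
    return x * s                    # rescale each channel by its weight

rng = np.random.default_rng(1)
x = rng.random((8, 8, 16))                       # toy H x W x C feature map
r = 4                                            # reduction ratio (assumed)
w1 = rng.standard_normal((16 // r, 16)) * 0.1    # squeeze-side weights
w2 = rng.standard_normal((16, 16 // r)) * 0.1    # excitation-side weights
out = squeeze_excitation(x, w1, w2)
print(out.shape)  # (8, 8, 16)
```

The bottleneck (reduction by ratio $r$) keeps the module lightweight while still letting the network learn non-trivial channel interdependencies.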

3.4.5 Soft Attention (SA).

The Soft Attention (SA) module assigns attention weights to individual input elements, prioritizing specific regions based on their importance [34]. The attention mechanism is expressed as:

(6) $a_i = \dfrac{\exp(e_i)}{\sum_{t=1}^{T} \exp(e_t)}$

where $a_i$ represents the attention weight for the $i$-th input element, $T$ is the input length, and $e_i$ is the scalar value associated with the $i$-th element [35].

This targeted emphasis enables the model to concentrate on the most relevant portions of the input, thereby improving overall performance.
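Equation (6) is a softmax over the scalar scores; a small NumPy sketch:

```python
import numpy as np

def soft_attention_weights(e):
    """Eq (6): normalize scalar scores e_1..e_T into attention weights that sum to 1."""
    e = np.asarray(e, dtype=float)
    exp = np.exp(e - e.max())   # subtract the max for numerical stability
    return exp / exp.sum()

scores = [2.0, 1.0, 0.1]        # toy scores for T = 3 input elements
a = soft_attention_weights(scores)
print(a.sum())                  # weights sum to 1; higher scores get larger weights
```

Elements with larger scores receive proportionally more attention, which is what lets the model emphasize the most relevant regions of the feature map.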

3.5 Chi2 weighted ensemble (CWE)

We introduce a novel ensemble learning approach, the Chi2 Weighted Ensemble (CWE). It calculates the most suitable weight for the predictions of each classifier and then combines them through weighted averaging. To achieve this, the Chi-Square ($\chi^2$) value is employed. The Chi-Square statistic plays a pivotal role in this approach by quantifying the accuracy of each classifier. The classifier with the highest $\chi^2$ value is considered the most accurate, as it exhibits the greatest alignment with the true label distribution. This method prioritizes classifiers that demonstrate a stronger relationship between predictions and actual outcomes, thereby enhancing the overall reliability of the ensemble. The sequential procedure for implementing CWE is outlined as follows.

Step - 1

This method begins by assessing the performance of individual classifiers using the $\chi^2$ test. To facilitate this, correctly classified samples are labeled as class ‘1’, while incorrectly classified ones are labeled as ‘0’. The $\chi^2$ test is a statistical measure used to evaluate the association between observed and expected frequencies in categorical data. It is calculated using the following formula:

(7) $\chi^2 = \sum_{i} \dfrac{(O_i - E_i)^2}{E_i}$

where $\chi^2$ represents the Chi-Square statistic; $O_i$ and $E_i$ denote the observed and expected frequencies, respectively. The $\chi^2$ value quantifies the discrepancy between observed and expected frequencies, with a higher value indicating a stronger relationship between the classifier’s predictions and the actual labels.

The observed frequencies are derived from the number of correctly and incorrectly classified samples for each classifier. Expected frequencies are calculated based on the overall distribution of the true labels. By applying the $\chi^2$ test, this method evaluates how well each classifier’s predictions align with the expected distribution of labels. This serves as a performance indicator for each classifier.

Step - 2

After obtaining the $\chi^2$ values for each classifier, the ensembling weights are computed. These weights are derived as the ratio of each classifier’s $\chi^2$ value to the total $\chi^2$ value across all classifiers. This ensures that classifiers with higher $\chi^2$ values contribute more prominently to the ensemble.

(8) $w_i = \dfrac{\chi^2_i}{\sum_{k=1}^{n} \chi^2_k}$

where $n$ is the number of classifiers and $\chi^2_i$ is the Chi-Square statistic of classifier $i$.

Step - 3

Finally, the predictions are averaged according to their respective weights, resulting in a blended output that leverages the strengths of each classifier within the ensemble. This can be achieved through the following process. Let $N$ be the number of individual classifiers in the ensemble. Each classifier $i$ produces predictions denoted as $p_{i1}, p_{i2}, \ldots, p_{in}$, where $n$ represents the number of instances in the dataset. The weights for each classifier, denoted as $w_i$, are determined based on the $\chi^2$ value. The ensemble prediction for each instance $j$ is calculated as:

(9) $E_j = \sum_{i=1}^{N} w_i \, p_{ij}$

where $E_j$ is the final prediction for instance $j$, $w_i$ is the weight assigned to classifier $i$, and $p_{ij}$ is the prediction of classifier $i$ for instance $j$. The weights $w_i$ are determined based on the $\chi^2$ values, emphasizing the classifiers that exhibit better performance. Overall, the CWE technique combines individual classifiers’ predictions using weighted averaging, where the weights are based on the Chi-Square statistic, enhancing the ensemble’s accuracy and robustness. The overall approach of the CWE method is illustrated in Fig 7.
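The three steps can be sketched end-to-end in NumPy; note that the choice of expected frequencies here (a uniform correct/incorrect expectation) is an assumption, as are all variable names:

```python
import numpy as np

def chi2_statistic(correct_flags):
    """Step 1: chi-square of observed correct/incorrect counts vs expected counts."""
    n = len(correct_flags)
    observed = np.array([correct_flags.sum(), n - correct_flags.sum()])
    expected = np.array([n / 2.0, n / 2.0])  # uniform expectation (assumed)
    return ((observed - expected) ** 2 / expected).sum()

def cwe(predictions, y_true):
    """Chi2 Weighted Ensemble: weight each classifier's softmax output by its chi2."""
    chi2 = np.array([
        chi2_statistic(p.argmax(axis=1) == y_true) for p in predictions
    ])
    w = chi2 / chi2.sum()                                   # Step 2: normalized weights
    return sum(wi * p for wi, p in zip(w, predictions)), w  # Step 3: weighted average

rng = np.random.default_rng(2)
y = rng.integers(0, 7, size=100)                                 # toy true labels
preds = [rng.dirichlet(np.ones(7), size=100) for _ in range(4)]  # 4 toy classifiers
ensemble, weights = cwe(preds, y)
print(ensemble.shape, round(weights.sum(), 6))  # (100, 7) 1.0
```

Because the weights are a convex combination, the ensemble output remains a valid probability distribution per instance.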

3.6 Multi-layer CWE

Our Multi-Layer CWE technique is applied across four distinct layers to effectively highlight and leverage the strengths of different models. This strategy is essential because, within a single layer, assigning adequate importance to a superior model is challenging due to the relatively low individual classifier weights. To address this, we adopt a sequential "Layer by Layer" ensembling approach. This method allows us to progressively emphasize models that demonstrate superior performance at each layer, with their influence being further compounded as they are ensembled in subsequent layers. The "Layer by Layer" approach is detailed in the following sections, while a generic visual overview of the Multi-Layer CWE is provided in Fig 8.

3.6.1 CWE in Layer 1.

In the first layer, we ensemble the predictions from the four foundational models—C-ORNS, CA-ORNS, SEA-ORNS, and SA-ORNS—for each classifier. This process results in a total of 24 predictions, derived from both RegNetX (RNX) and RegNetY (RNY), each containing 12 ORNS architectures.

3.6.2 CWE in Layer 2.

The 24 predictions obtained from Layer 1 are then combined using two approaches. First, we separately ensemble all 12 architectures from both RNX and RNY, resulting in two distinct predictions. Second, we ensemble the common versions from both RNX and RNY to generate 12 additional predictions.

3.6.3 CWE in Layer 3.

In Layer 3, we further ensemble the two predictions (RNX and RNY) obtained from Layer 2 to create a comprehensive third layer prediction (RNXY), which serves as the pre-final outcome. Additionally, the common versions’ 12 ensembled predictions are combined to produce another key prediction, referred to as RN_XY.

3.6.4 CWE in Layer 4.

Finally, the two predictions (RNXY and RN_XY) generated in Layer 3 are ensembled to form the ultimate prediction, denoted as RN. This final layer prediction represents the conclusive outcome of our study.
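To make the four-layer flow concrete, here is a toy sketch of the aggregation structure. The random Dirichlet stand-ins and the equal placeholder weights are assumptions for illustration; in the actual method each blend uses the chi-square weights of Eq. (8).

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_classes = 8, 7  # toy sizes; HAM10000 has 7 lesion classes

def cwe(preds, chi2_values):
    """Chi-square-weighted blend of prediction arrays (Eqs. 8-9)."""
    w = np.asarray(chi2_values, dtype=float)
    w = w / w.sum()
    return sum(wi * p for wi, p in zip(w, preds))

# Toy stand-ins for the 24 Layer-1 outputs: one prediction per RNX and RNY
# version (each already a CWE over its four attention variants).
versions = ["002", "004", "006", "008", "016", "032",
            "040", "064", "080", "120", "160", "320"]
rnx = {v: rng.dirichlet(np.ones(n_classes), size=n_samples) for v in versions}
rny = {v: rng.dirichlet(np.ones(n_classes), size=n_samples) for v in versions}

# Layer 2: (a) all 12 RNX and all 12 RNY; (b) matching versions across backbones.
RNX = cwe(list(rnx.values()), [1] * 12)   # placeholder chi-square weights
RNY = cwe(list(rny.values()), [1] * 12)
XY = {v: cwe([rnx[v], rny[v]], [1, 1]) for v in versions}

# Layer 3: RNXY from the two backbone predictions; RN_XY from the 12 XY blends.
RNXY = cwe([RNX, RNY], [1, 1])
RN_XY = cwe(list(XY.values()), [1] * 12)

# Layer 4: the final prediction, RN.
RN = cwe([RNXY, RN_XY], [1, 1])
```

Because every blend is a convex combination, each row of RN remains a valid probability distribution over the seven lesion classes.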

4 Experimental results and analysis

This section provides a comprehensive analysis, combining both theoretical insights and visual representations, to assess the classification performance. The primary objective of these results is to demonstrate the effectiveness of using CWE to improve the predictions of the ORNS architectures. By presenting experimental outcomes, which include a wide range of evaluation metrics, visual aids, and confusion matrices, we enable a detailed comparison of the different methods discussed earlier.

4.1 Performance evaluation metrics

To evaluate the performance and effectiveness of our models, we utilize several metrics such as accuracy, precision, recall (sensitivity), f1-score, specificity, and ROC-AUC (Receiver Operating Characteristic Area Under the Curve). These metrics provide valuable insights into the models’ predictive capabilities. Each metric can be derived from the confusion matrix, which summarizes the model’s predictions into true positives, false positives, true negatives, and false negatives. The mathematical expressions for these metrics are as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (10)

Precision = TP / (TP + FP)    (11)

Recall (Sensitivity) = TP / (TP + FN)    (12)

F1-score = 2 × (Precision × Recall) / (Precision + Recall)    (13)

Specificity = TN / (TN + FP)    (14)

TPR = TP / (TP + FN)    (15)

FPR = FP / (FP + TN)    (16)

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. TPR and FPR are the true and false positive rates; the ROC curve plots TPR against FPR across classification thresholds, and the area under this curve gives the ROC-AUC.

By thoroughly evaluating these metrics, we obtain a detailed understanding of our models’ classification capabilities, allowing us to make informed decisions about their applicability in real-world scenarios.
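As a quick sanity check, the count-based metrics can be computed directly from the four confusion-matrix entries (ROC-AUC, which integrates over thresholds, is omitted here); the helper name is illustrative:

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Standard classification metrics from confusion-matrix counts (Eqs. 10-14)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "specificity": specificity}
```

For example, with TP = 50, FP = 10, TN = 30, FN = 10 this yields an accuracy of 0.80 and a specificity of 0.75.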

4.2 Experimental setup

The entire architecture is implemented in a Kaggle notebook, utilizing a P100 GPU and a 2-core Intel Xeon CPU, with training progressing at roughly 690 ms per step. The dataset, containing unique lesion images resized to (224, 224, 3), is split with 15% set aside for validation, another 15% for testing, and the remainder for training. The models are trained over 50 epochs with a batch size of 16. The Adam optimizer, initialized with a learning rate of 0.001, drives the optimization process. Categorical cross-entropy is used for loss calculation, and a reduce-on-plateau schedule with a patience of 25 epochs serves as the early-stopping mechanism.


4.2.1 Trainable parameters.

Since we ensemble the algorithms at prediction time, the number of trainable parameters remains unchanged post-ensemble. Table 2 summarizes the trainable parameters for each model.

It is evident that most of the RNY versions have a higher number of parameters than their RNX counterparts, with RNY320 and RNX320 having the highest counts: their SEA-ORNS variants train approximately 154 million and 121 million parameters, respectively. The second-ranked models have less than half as many parameters, and the remaining models have significantly fewer. However, since all algorithms operate independently and in parallel before being ensembled, the final CWE prediction is obtained efficiently without introducing a significant time overhead.

4.2.2 Hyperparameters selection.

Hyperparameter tuning can significantly enhance performance, often in ways that exceed expectations [36]. The hyperparameters are carefully selected through a detailed process of manual tuning, guided by empirical observations and well-established practices in DL. Each aspect, from the learning rate and batch size to architectural decisions such as kernel sizes and activation functions, is meticulously assessed to enhance model performance and ensure resistance to overfitting. This approach is informed by extensive experimentation and a thorough understanding of the network’s behavior, with the goal of balancing computational efficiency and achieving top-tier results for the task at hand.

We use a learning rate of 0.001 with the Adam optimizer to facilitate precise weight updates, which are essential for navigating the complex optimization landscape of our model. Batch normalization is incorporated to stabilize training by normalizing the inputs of each layer, thereby improving convergence speed and reducing the risk of overfitting. The ‘he_normal’ kernel initializer is employed to ensure effective weight initialization, aiding in the maintenance of gradient flow and enhancing the model’s learning capacity. Additionally, the use of the ‘ReLU’ activation function allows our model to effectively capture complex patterns and relationships in the data, which is vital for achieving high accuracy in classification tasks.
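A minimal Keras sketch of this configuration follows. The dense head standing in for the ORNS backbone and its layer sizes are assumptions for illustration; the optimizer, loss, initializer, activation, batch normalization, and callback settings mirror the text.

```python
import tensorflow as tf

# Illustrative classifier head (the real models use ORNS/RegNet backbones).
def build_head(num_classes=7, feature_dim=1024):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(feature_dim,)),
        tf.keras.layers.Dense(256, activation="relu",
                              kernel_initializer="he_normal"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Learning-rate reduction on plateau with a patience of 25 epochs.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", patience=25)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=50, batch_size=16, callbacks=[reduce_lr])
```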

4.3 Performance analysis of ORNS architectures in CWE

The CWE application with all classifiers at each layer, denoted CWEL (where L stands for the layer number), is employed to produce the ML-CWE.

4.3.1 ORNS architectures in CWE-Layer 1.

Twenty-four models, previously discussed, are utilized in combination with C-ORNS, CA-ORNS, SEA-ORNS, and SA-ORNS variants. The performance outcomes from these varied combinations, along with results from Layer-1 CWE, are summarized in Tables 3 through 14.

thumbnail
Table 3. Performance evaluation of RNX002 and RNY002 in CWE-Layer 1.

https://doi.org/10.1371/journal.pone.0321803.t003

thumbnail
Table 4. Performance evaluation of RNX004 and RNY004 in CWE-Layer 1.

https://doi.org/10.1371/journal.pone.0321803.t004

thumbnail
Table 5. Performance evaluation of RNX006 and RNY006 in CWE-Layer 1.

https://doi.org/10.1371/journal.pone.0321803.t005

thumbnail
Table 6. Performance evaluation of RNX008 and RNY008 in CWE-Layer 1.

https://doi.org/10.1371/journal.pone.0321803.t006

thumbnail
Table 7. Performance evaluation of RNX016 and RNY016 in CWE-Layer 1.

https://doi.org/10.1371/journal.pone.0321803.t007

thumbnail
Table 8. Performance evaluation of RNX032 and RNY032 in CWE-Layer 1.

https://doi.org/10.1371/journal.pone.0321803.t008

thumbnail
Table 9. Performance evaluation of RNX040 and RNY040 in CWE-Layer 1.

https://doi.org/10.1371/journal.pone.0321803.t009

thumbnail
Table 10. Performance evaluation of RNX064 and RNY064 in CWE-Layer 1.

https://doi.org/10.1371/journal.pone.0321803.t010

thumbnail
Table 11. Performance evaluation of RNX080 and RNY080 in CWE-Layer 1.

https://doi.org/10.1371/journal.pone.0321803.t011

thumbnail
Table 12. Performance evaluation of RNX120 and RNY120 in CWE-Layer 1.

https://doi.org/10.1371/journal.pone.0321803.t012

thumbnail
Table 13. Performance evaluation of RNX160 and RNY160 in CWE-Layer 1.

https://doi.org/10.1371/journal.pone.0321803.t013

thumbnail
Table 14. Performance evaluation of RNX320 and RNY320 in CWE-Layer 1.

https://doi.org/10.1371/journal.pone.0321803.t014

Table 3 presents the performance evaluation of the RNX002 and RNY002 models across different basic blocks, with CWE representing an ensemble of these blocks designed to harness their combined strengths. The CWE ensembles, RNX002 (CWE1) and RNY002 (CWE1), consistently outperform individual blocks.

For RNX002, the CWE ensemble achieves the highest accuracy at 91.43% and an F1-score of 90.85%, indicating that the ensemble approach yields a more robust and effective model compared to individual blocks.

Similarly, RNY002 (CWE1) excels with an accuracy of 91.67% and an F1-score of 91.08%, outperforming all basic blocks. The ensemble’s integration of attention mechanisms and customization leads to improved precision (91.13%) and recall (91.67%), underscoring its superior performance.

The CWE ensembles, by combining the strengths of the basic ORNS architectures, deliver superior performance across all key metrics, highlighting the value of this ensemble approach in optimizing model outcomes.

Table 4 depicts the performance evaluation of RNX004 and RNY004 models within CWE-Layer 1. The results highlight the effectiveness of the CWE ensemble approach compared to the individual basic blocks.

For RNX004, the CWE ensemble (CWE1) achieves the highest performance across all key metrics, with an accuracy of 92.87% and an F1-score of 92.48%. These values surpass those of the individual blocks, demonstrating the added value of the ensemble method. The CWE ensemble also shows a notable improvement in precision (92.49%) and recall (92.87%), indicating a more balanced model.

Likewise, the RNY004 model benefits significantly from the CWE ensemble, which reaches an accuracy of 92.63% and an F1-score of 92.20%. The precision (92.08%) and recall (92.63%) for the CWE variant also outperform the individual block variants, further confirming the advantages of combining these blocks.

Table 5 provides an assessment of the RNX006 and RNY006 models within CWE-Layer 1, highlighting the advantages of the CWE ensemble over individual basic blocks.

For RNX006, the CWE ensemble (CWE1) surpasses individual blocks, achieving the highest accuracy at 92.39% and an F1-score of 91.87%. The ensemble also enhances precision (92.04%) and recall (92.39%), demonstrating the benefits of integrating the strengths of each block.

Similarly, the RNY006 model benefits significantly from the CWE ensemble, achieving a peak accuracy of 92.63% and an F1-score of 92.23%, alongside high precision (92.28%) and recall (92.63%). These results underscore the ensemble’s ability to deliver a more balanced and robust performance compared to the individual blocks.

Table 6 presents the performance evaluation of RNX008 and RNY008 models within CWE-Layer 1, highlighting the benefits of the CWE ensemble compared to individual basic blocks.

For RNX008, the CWE ensemble (CWE1) achieves the highest performance across all metrics, with an accuracy of 93.60% and an F1-score of 93.30%. The ensemble also delivers strong precision (93.29%) and recall (93.60%), outperforming the individual blocks and demonstrating the advantages of the ensemble approach.

Similarly, the RNY008 model shows significant performance improvements with the CWE ensemble. It achieves an accuracy of 92.99% and an F1-score of 92.49%, both of which are higher than those of the individual block variants. The CWE ensemble also exhibits enhanced precision (92.67%) and recall (92.99%).

Table 7 provides the performance evaluation of the RNX016 and RNY016 models within CWE-Layer 1, emphasizing the benefits of the CWE ensemble over individual basic blocks.

For RNX016, the CWE ensemble (CWE1) stands out with the highest accuracy (93.36%) and F1-score (92.98%). These metrics surpass those of the individual blocks, illustrating the ensemble’s superior ability to balance precision (92.97%) and recall (93.36%) effectively. This performance indicates that the CWE approach successfully combines the strengths of different blocks, leading to more robust results.

Similarly, RNY016 demonstrates notable performance improvements with the CWE ensemble. It achieves an accuracy of 93.24% and an F1-score of 92.80%, which are higher than the individual block variants. The ensemble’s precision (92.83%) and recall (93.24%) further highlight its advantage in delivering balanced and enhanced model performance.

Table 8 highlights the comparative performance of RNX032 and RNY032 models in CWE-Layer 1, showcasing the benefits of the CWE ensemble over the individual basic blocks.

For RNX032, the CWE ensemble (CWE1) stands out with an impressive accuracy of 92.87% and an F1-score of 92.39%, outperforming each individual block. The precision of 92.67% and recall of 92.87% underline the ensemble’s capability to achieve a more comprehensive and balanced performance compared to the basic blocks.

In the case of RNY032, the CWE ensemble also excels, achieving the highest accuracy of 93.48% and an F1-score of 93.22%. The precision and recall rates of 93.47% and 93.48%, respectively, further demonstrate the ensemble’s superior performance and effectiveness over the standalone models.

Table 9 presents the performance evaluation of RNX040 and RNY040 models within CWE-Layer 1, highlighting the advantages of the CWE ensemble over individual basic blocks.

For RNX040, the CWE ensemble (CWE1) exhibits the highest performance with an accuracy of 93.12% and an F1-score of 92.59%. This outperforms the individual blocks, showcasing the ensemble’s enhanced ability to deliver balanced and effective model results. The CWE ensemble also achieves superior precision (92.77%) and recall (93.12%), reinforcing its overall effectiveness.

In the case of RNY040, the CWE ensemble (CWE1) shows improved results with an accuracy of 92.51% and an F1-score of 91.94%. Although it does not surpass all individual block metrics, it delivers strong performance in precision (92.16%) and recall (92.51%), demonstrating the ensemble’s benefit in achieving high-quality results.

Table 10 presents the performance evaluation of the RNX064 and RNY064 models within CWE-Layer 1, emphasizing the CWE ensemble’s effectiveness over individual basic blocks.

For RNX064, the CWE ensemble (CWE1) stands out with the highest accuracy of 92.87% and an F1-score of 92.48%, surpassing the performance of individual blocks. This demonstrates the ensemble’s superior ability to balance precision (92.45%) and recall (92.87%). Additionally, the CWE ensemble maintains a strong specificity of 83.33%, highlighting its overall effectiveness.

Similarly, the RNY064 model benefits significantly from the CWE ensemble, achieving the highest accuracy at 93.24% and an F1-score of 92.83%. The ensemble’s precision (92.84%) and recall (93.24%) further emphasize its advantage in delivering a balanced and high-quality performance. The CWE ensemble also exhibits strong specificity at 84.79%, supporting its effectiveness in producing robust results.

Table 11 displays the performance evaluation of RNX080 and RNY080 models within CWE-Layer 1, demonstrating the clear advantages of the CWE ensemble over individual basic blocks.

For RNX080, the CWE ensemble (CWE1) leads with the highest accuracy (93.72%) and F1-score (93.27%). It also achieves the best precision (93.40%) and recall (93.72%), surpassing all individual block variants. The specificity of 84.80% further emphasizes the CWE ensemble’s robust performance across key metrics.

In the case of RNY080, the CWE ensemble also shows superior results, with an accuracy of 93.24% and an F1-score of 92.76%. The precision (92.78%) and recall (93.24%) are the highest among the tested variants, highlighting the ensemble’s effectiveness in achieving balanced and high-quality performance. The specificity of 85.74% underscores its strong performance in distinguishing true negatives.

Table 12 presents the performance evaluation of RNX120 and RNY120 models within CWE-Layer 1, illustrating the advantages of using the CWE ensemble compared to individual basic blocks.

For RNX120, the CWE ensemble (CWE1) achieves the highest accuracy (92.99%) and F1-score (92.43%). It also excels in precision (92.60%) and recall (92.99%), surpassing all individual basic blocks. The specificity of 80.47% indicates that the CWE ensemble maintains robust performance in distinguishing true negatives as well.

In a similar manner, for RNY120, the CWE ensemble (CWE1) delivers top performance with an accuracy of 93.24% and an F1-score of 92.81%. It also shows the highest precision (92.87%) and recall (93.24%) among the variants, highlighting its effectiveness in achieving balanced and high-quality results. The specificity of 83.35% further supports the CWE ensemble’s strong performance across key metrics.

Table 13 highlights the performance evaluation of the RNX160 and RNY160 models within CWE-Layer 1, underscoring the benefits of the CWE ensemble approach.

For RNX160, the CWE ensemble (CWE1) surpasses the individual basic blocks, achieving the highest accuracy at 92.87% and an F1-score of 92.30%. It also excels in precision (92.40%) and recall (92.87%), demonstrating consistent and high-quality performance across these metrics. The specificity of 82.37% indicates its effectiveness in identifying true negatives.

Similarly, for RNY160, the CWE ensemble (CWE1) achieves an accuracy of 92.63% and an F1-score of 92.15%. It shows competitive precision (92.02%) and recall (92.63%), with a specificity of 83.33%, further emphasizing the CWE ensemble’s robust performance.

Table 14 presents the performance evaluation of RNX320 and RNY320 models in CWE-Layer 1, demonstrating the effectiveness of the CWE ensemble approach.

For RNX320, the CWE ensemble (CWE1) achieves an accuracy of 93.36% and an F1-score of 93.00%. It excels in precision (93.08%) and recall (93.36%), indicating robust and consistent performance. The specificity is 87.16%, reflecting strong capability in identifying true negatives.

Likewise, for RNY320, the CWE ensemble (CWE1) shows even higher performance with an accuracy of 93.60% and an F1-score of 93.20%. It also performs well in precision (93.18%) and recall (93.60%), with a specificity of 86.71%, underscoring its effective performance across all metrics.

Overall, in the initial layer of CWE, the ensembles consistently outperform individual basic blocks across various RegNet configurations, regardless of the model variant. This robust performance is observed across all key metrics, including accuracy, precision, recall, and F1-score, underscoring the effectiveness of the ensemble approach in optimizing model outcomes. By leveraging the strengths of both RNX and RNY architectures through ensembling, we achieve superior results, confirming the ensemble method’s ability to enhance and maximize model performance across diverse scenarios.

4.3.2 ORNS architectures in CWE-Layer 2.

At Layer 2, our method utilizes two distinct approaches based on the predictions from Layer 1. Specifically, the first approach involves combining the twelve RNX models and twelve RNY models, resulting in two layer-2 predictions. The outcomes of this approach are detailed in Table 15. The second approach merges the common versions of both RegNet variants, producing twelve predictions. The results of this combined model are presented in Table 16.

thumbnail
Table 15. Performance evaluation of RNX and RNY in CWE-Layer 2.

https://doi.org/10.1371/journal.pone.0321803.t015

thumbnail
Table 16. Performance evaluation of common variants ensemble of RNX and RNY in CWE-Layer 2.

https://doi.org/10.1371/journal.pone.0321803.t016

Table 15 provides an insightful performance evaluation of the RNX and RNY models at Layer 2 within the CWE framework. The results clearly demonstrate the enhancement in model performance when utilizing the CWE ensemble approach compared to individual models.

For the RNX models, the CWE Layer 1 (CWE1) already exhibits strong performance across the board, with accuracies ranging from 91.43% to 93.72%. Notably, the RNX080 and RNX320 models achieve the highest accuracies, highlighting the effectiveness of the CWE ensemble in aggregating predictions from multiple models. However, the Layer 2 ensemble (CWE2) further refines these predictions, pushing the overall accuracy to an impressive 93.84%, along with superior precision (93.51%) and F1-score (93.41%).

Similarly, the RNY models under CWE Layer 1 (CWE1) demonstrate robust performance, with accuracies spanning from 91.67% to 93.60%. The RNY032 and RNY320 models stand out with the highest accuracy scores, underscoring the ensemble’s ability to capture complex patterns. The Layer 2 ensemble (CWE2) consolidates these gains further, achieving an accuracy of 93.72%, accompanied by a precision of 93.40% and an F1-score of 93.27%.

Overall, the table illustrates the clear advantage of using a ML-CWE ensemble approach. The transition from Layer 1 to Layer 2 significantly enhances the performance metrics across all RNX and RNY models, making a compelling case for the effectiveness of this methodology in achieving superior predictive accuracy and reliability.

Table 16 presents the performance evaluation of the common variant ensemble of RNX and RNY architectures across twelve different model configurations at Layer 2. Each pair of RNX and RNY models is combined to create a unified prediction (denoted as ‘XY’), showcasing the strength of the CWE ensemble at this layer.

For the smaller models, XY002 and XY004 deliver consistent performance, with accuracy values of 92.15% and 92.87%, respectively, reflecting the ensemble’s reliability even in more compact configurations. Moving to the mid-range models, XY006, XY008, and XY016 also maintain strong results, with XY008 achieving the highest accuracy of 93.84% among these, demonstrating the ensemble’s ability to leverage the strengths of both RNX and RNY.

The larger models, particularly XY032, XY064, and XY320, continue to exhibit robust performance, with XY320 reaching a peak accuracy of 93.84%, which matches the best results seen in the smaller models like XY008. This consistency across varying model sizes emphasizes the versatility and effectiveness of the CWE ensemble approach. Other models, such as XY040, XY080, XY120, and XY160, also perform well, with accuracy values ranging from 93.24% to 93.60%, further supporting the ensemble’s overall effectiveness in combining the predictive strengths of both RNX and RNY variants.

4.3.3 ORNS architectures in CWE-Layer 3.

In this layer, two pre-final predictions are generated. The first prediction is derived from the ensemble of RNX and RNY models from the previous layer, referred to as RNXY in this layer. The second prediction comes from the ensemble of 12 common variants, labeled as RN_XY. The evaluations of these predictions are presented in Tables 17 and 18.

thumbnail
Table 17. Performance evaluation of RNXY in CWE-Layer 3.

https://doi.org/10.1371/journal.pone.0321803.t017

thumbnail
Table 18. Performance evaluation of RN_XY in CWE-Layer 3.

https://doi.org/10.1371/journal.pone.0321803.t018

Table 17 captures the culminating evaluation of the RNXY architecture at CWE-Layer 3, showcasing the pre-final predictions derived from the ensemble of RNX and RNY models. More specifically, the RNXY (CWE3) ensemble model demonstrates a slight but significant improvement across all metrics compared to the individual RNX and RNY (CWE2) models. It achieves the highest accuracy at 93.96%, outstripping RNX (CWE2) at 93.84% and RNY (CWE2) at 93.72%. This model also leads in precision (93.64%), recall (93.96%), and F1-score (93.58%), confirming the effectiveness of this layered ensemble approach. The specificity of the RNXY ensemble is also enhanced to 85.30%, underscoring its robustness in distinguishing true negatives.

The incremental gains observed with the RNXY (CWE3) model over its predecessors highlight the cumulative strength of the ensemble learning strategy employed throughout the layers. This strategic layering enhances predictive accuracy and consistency, culminating in a model that is not only superior in performance but also balanced across all evaluated metrics.

Table 18 provides a comprehensive evaluation of the ensemble method applied to twelve common variants of the RNX and RNY architectures, culminating in the RN_XY ensemble at CWE-Layer 3.

The individual XY models demonstrate strong performance metrics, with accuracy ranging from 92.15% for XY002 to 93.84% for models like XY008, XY080, and XY320. Notably, these models also exhibit consistent precision, recall, and F1-scores, underscoring their robustness across different configurations. For example, XY016 achieves a commendable F1-score of 93.15% and stands out with a specificity of 88.60%, the highest among the individual models, highlighting its balanced performance in both positive and negative classifications.

The RN_XY ensemble, however, elevates the predictive power to a new level, reaching an accuracy of 93.96%, with precision and recall closely aligned at 93.64% and 93.96%, respectively. The F1-score of 93.56% further reflects the ensemble’s efficiency in synthesizing the strengths of the individual variants. Specificity remains competitive at 84.82%, demonstrating the ensemble’s ability to maintain a high standard across various metrics. This layer represents the culmination of the model’s iterative refinement, bringing together the best of each architecture into a final, high-performing ensemble.

4.3.4 ORNS architectures in CWE-Layer 4.

In the final evaluation at CWE-Layer 4, the performance metrics indicate a significant culmination of the ensemble strategies applied in the previous layers. Table 19 showcases the precision, accuracy, recall, F1-score, and specificity of the models, offering a comprehensive view of the refined prediction quality.

Both the RNXY and RN_XY models, derived from the earlier CWE-Layer 3 predictions, demonstrate identical accuracy (93.96%), precision (93.64%), and recall (93.96%). These models also maintain a high F1-score, with RNXY at 93.58% and RN_XY closely following at 93.56%. Specificity, a crucial metric for understanding the models’ true negative rate, stands at 85.30% for RNXY and 84.82% for RN_XY, indicating robust performance in correctly identifying non-target classes.

The RN model, which represents the final amalgamation in CWE-Layer 4, slightly surpasses its predecessors with an accuracy of 94.08%, a precision of 93.77%, and a recall of 94.08%. The F1-score, crucial for balancing precision and recall, is also higher at 93.71%. Additionally, the RN model achieves a specificity of 85.78%, underscoring its enhanced ability to correctly exclude non-relevant instances.

This final result at CWE-Layer 4 marks the pinnacle of the ORNS architecture’s performance, demonstrating the effectiveness of iterative refinement and the strategic combination of model predictions across layers.

4.4 Performance analysis by visualization

To simplify the analysis, we do not display confusion matrices for every classifier, given the wide variety and differences among them. Instead, we focus on the confusion matrices for the pre-final and final layers of the CWE model. These matrices, presented in Figs 9, 11, and 13, clearly illustrate the rates of correct and incorrect classifications for each category.

thumbnail
Fig 9. Confusion matrix obtained by RNXY architecture in CWE-Layer 3.

https://doi.org/10.1371/journal.pone.0321803.g009

thumbnail
Fig 10. ROC-AUC curve obtained by RNXY architecture in CWE-Layer 3.

https://doi.org/10.1371/journal.pone.0321803.g010

Similarly, we analyze the ROC-AUC curves, which are useful for visualizing a model’s performance. We specifically focus on the ROC-AUC curves for the Multi-Layer CWE approach, similar to how we present the confusion matrices for only the last two layers. For consistency, these ROC-AUC curves are also shown in Figs 10, 12, and 14.

More specifically, the performance of our first pre-final architecture, RNXY (CWE3), demonstrates its effectiveness across various classes. The VASC class achieves perfect accuracy, correctly classifying all 9 samples. In the DF class, 5 out of 6 samples are correctly identified, with only one misclassification. The NV class also performs impressively, correctly classifying 558 out of 663 samples, showing strong results for both minority and majority classes. For the AK class, 14 samples are correctly classified, with 8 errors, while the BCC class has 21 correct classifications and 6 mistakes. The BKL class performs well, with 52 out of 66 samples correctly classified. The MEL class, which is the most challenging, still manages 19 correct classifications out of 35 samples. Overall, the architecture shows high accuracy and robust performance across all classes.

The ROC curve AUC scores further highlight the strong performance of the RNXY (CWE3) architecture. The MEL class, which has the lowest AUC score, still achieves an impressive 0.995, indicating excellent performance. The VASC class achieves a perfect AUC score of 1, while the other classes also perform well with AUC scores around 0.99. These minimal fluctuations in ROC curve demonstrate the architecture’s precision and reliability.

Moving on to the performance of another pre-final architecture, RN_XY (CWE3), we find that it also performs effectively across different classes, though slightly less so than RNXY. Starting with the most challenging class, MEL, we see that it successfully classifies 19 out of 35 samples. The BKL class performs well, with 53 out of 66 samples correctly classified. In the BCC class, 21 out of 27 samples are accurately classified, with 6 errors. The AK class identifies 13 samples correctly but has 9 misclassifications. The NV class showcases impressive performance, correctly classifying 558 out of 663 samples, indicating strong results in both minority and majority classes. For the DF class, 5 out of 6 samples are accurately classified, with only one misclassification. The VASC class achieves perfect accuracy, with all 9 samples correctly classified. Overall, while the RN_XY (CWE3) architecture performs slightly less effectively than RNXY, it still demonstrates significant accuracy and robustness across all classes.

The ROC curve AUC scores for the RN_XY (CWE3) architecture mirror those of RNXY, reinforcing its strong performance. The MEL class, despite having the lowest AUC score at 0.995, still shows excellent results. The AUC scores for the other classes also demonstrate very little fluctuation, underscoring the architecture’s consistency and effectiveness.

Finally, the performance of our ultimate architecture, RN (CWE4), stands out as the most effective across all classes, significantly outperforming previous layers. Starting with the VASC class, it achieves perfect accuracy with all 9 samples correctly classified, and an AUC score of 1, showcasing flawless performance. The DF class also performs exceptionally well, with 5 out of 6 samples accurately classified and only one misclassification, resulting in a near-perfect AUC score of 0.999. The AK class demonstrates strong results, correctly identifying 14 samples with 8 misclassifications, achieving an AUC score of 0.994, indicating good performance on the ROC curve. The NV class performs remarkably as well, correctly classifying 558 out of 663 samples and achieving an AUC score of 0.987, reflecting its capability in both minority and majority classes. Next, the BKL class manages to correctly classify 53 out of 66 samples, with an AUC score of 0.989, closely followed by the BCC class, which correctly classifies 21 out of 27 samples and attains an AUC score of 0.987. Lastly, the MEL class, which is the most challenging, still manages to classify 19 out of 35 samples correctly, achieving the lowest AUC score of 0.957. However, this score is still better than any obtained in the previous architectures. Overall, RN (CWE4) showcases exceptional performance and robustness across all classes, with impressive accuracy and reliability.

4.4.1 Gradient Class Activation Map (GradCAM) for interpretability.

GradCAM (Gradient-weighted Class Activation Mapping) was employed as a visualization technique to improve the interpretability of the proposed model. It highlighted the regions in input images that contributed most significantly to the model’s predictions. The final convolutional layer of the network was selected for generating activation maps, as this layer captures high-level spatial features crucial for classification.

The methodology for implementing GradCAM is depicted in Fig 15. The process began by constructing a gradient model capable of mapping input images to both the activations of the last convolutional layer and the model’s final predictions. Using TensorFlow’s GradientTape, the gradient of the class-specific output score with respect to the activations of the selected convolutional layer was computed.

These gradients were then spatially pooled by calculating the mean intensity for each feature map channel, which represented the importance of each channel for the target class. The pooled gradients were used to weight the activation maps of the final convolutional layer, and the weighted activations were aggregated to produce a class activation heatmap. This heatmap was normalized to a range between 0 and 1 for better visualization. Finally, the normalized heatmap was overlaid on the original input image using a colormap, highlighting the regions that influenced the model’s decision.
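The pooling, weighting, and normalization steps described above can be sketched in NumPy. This is an illustrative helper rather than the study's code: it assumes the activations and gradients have already been obtained (in the actual pipeline, via TensorFlow's GradientTape), and the function name `grad_cam_heatmap` is ours.

```python
import numpy as np

def grad_cam_heatmap(activations, gradients):
    """Turn last-conv activations and class-score gradients into a heatmap.

    activations: (H, W, C) feature maps from the final convolutional layer.
    gradients:   (H, W, C) gradient of the class score w.r.t. those maps,
                 e.g. computed with tf.GradientTape in the real pipeline.
    """
    # Spatially pool the gradients: one importance weight per channel.
    pooled = gradients.mean(axis=(0, 1))            # shape (C,)
    # Weight each activation map by its channel importance and aggregate.
    heatmap = (activations * pooled).sum(axis=-1)   # shape (H, W)
    # Keep only positive evidence for the target class (Grad-CAM's ReLU).
    heatmap = np.maximum(heatmap, 0.0)
    # Normalize to [0, 1] so the map can be overlaid with a colormap.
    peak = heatmap.max()
    return heatmap / peak if peak > 0 else heatmap
```

The resulting array is then resized to the input resolution and blended with the original image for display.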

This approach was extended to multiple classes by iterating over the test dataset and selecting representative images from each class. GradCAM outputs were generated for these images, revealing the regions of interest for each class. The resulting heatmaps demonstrated the model’s ability to focus on salient areas related to the target classes, such as lesions in medical images, thus confirming its capacity to identify key features.

Despite its strengths, GradCAM has certain limitations. Its reliance on model predictions means that any misclassification can result in inaccurate or misleading heatmaps. Additionally, for complex or subtle features, such as ambiguous skin lesions, GradCAM might sometimes highlight irrelevant regions, which can reduce its reliability in such cases.

These strengths and limitations together underscore the importance of complementing GradCAM visualizations with robust model evaluation to ensure reliable insights.

In Fig 16, we present visualizations for each class, showcasing seven instances across seven different classes. The GradCAM views are seamlessly integrated with the original images, revealing how effectively our model focuses on the most critical regions rather than the entire image. This targeted approach substantially enhances the model’s ability to classify images with higher accuracy, underscoring the strength of our contribution.

When a model successfully generates a precise heatmap that highlights the relevant region, it signals the model’s ability to make accurate classifications. On the other hand, if the heatmap is misaligned, it often suggests that the model’s classification may also be incorrect. To demonstrate this concept and evaluate the performance of AT, we provide an illustrative example in Fig 17 for clearer understanding.

Fig 11. Confusion matrix obtained by RN_XY architecture in CWE-Layer 3.

https://doi.org/10.1371/journal.pone.0321803.g011

Fig 12. ROC-AUC curve obtained by RN_XY architecture in CWE-Layer 3.

https://doi.org/10.1371/journal.pone.0321803.g012

Fig 13. Confusion matrix obtained by RN architecture in CWE-Layer 4.

https://doi.org/10.1371/journal.pone.0321803.g013

Fig 14. ROC-AUC curve obtained by RN architecture in CWE-Layer 4.

https://doi.org/10.1371/journal.pone.0321803.g014

Fig 15. Step by step implementation of gradient class activation map.

https://doi.org/10.1371/journal.pone.0321803.g015

Fig 17. GradCAM visualization for AT explainability (Example by RNX002).

https://doi.org/10.1371/journal.pone.0321803.g017

In Fig 17, the original image serves as a representative example. Notably, for RNX002, both the CA-based and SA-based ORNS models successfully predict the correct class, while the SEA-based ORNS model does not. This pattern is observed across other models as well, reinforcing our argument. By integrating multiple models, our final prediction consistently proves to be accurate, as clearly illustrated through the GradCAM visualizations of these models. The true innovation lies in our CWE ensemble approach, which enables precise class prediction by effectively compensating for the limitations of individual classifiers. This observation highlights the superiority of our advanced Multi-Layer CWE technique, demonstrating its ability to achieve accurate predictions where other methods may falter, underscoring the strength of our contribution.

4.5 Ablation study

To highlight the superiority of our novel approach over state-of-the-art methods, we conduct an extensive ablation study centered on three key innovations: the significance of the AT, the superiority of the CWE, and the critical role of CWE across multiple layers. We assess the impact of these components by comparing the model’s performance with and without each element, providing a clear analysis of their contributions.

4.5.1 Utilization of CWE without AT.

We apply CWE across various models, including all variants of RegNet, at different layers as previously mentioned. Each model is tested in four configurations: three with AT and one without AT. To highlight the efficacy of AT, we present the results in Table 20, showcasing the performance of CWE in a Multi-Layer Ensemble (MLE) setup, excluding the AT-integrated models.

The table presents the performance metrics for the different configurations of the CWE model without AT, compared across various layers in the MLE setup. The analysis highlights that as the layers progress from CWE1 to CWE4, the performance generally improves, with the highest accuracy achieved by RN (CWE4), which reports an accuracy of 93.84%. However, despite these improvements across layers, the absence of AT still limits the overall effectiveness of the models.

For instance, models like RNXY and RN_XY in CWE3 achieve commendable results, both reaching an accuracy of 93.72%. However, even the best-performing model without AT, RN (CWE4), with its accuracy of 93.84%, precision of 93.54%, and F1-score of 93.42%, falls short compared to our proposed model.

Our proposed model, labeled as Ours, which utilizes AT, significantly surpasses all previous configurations. It achieves an accuracy of 94.08%, precision of 93.77%, and F1-score of 93.71%, clearly demonstrating the effectiveness of AT integration. This final model not only outperforms the configurations in CWE4 but also proves the importance of AT in maximizing the model’s potential.

4.5.2 Utilization of conventional ensemble methods instead of CWE.

As previously mentioned, we apply the CWE across multiple layers. Predictions from all architectures are combined by determining the optimal weights for each. To showcase the superiority of CWE, we compare its performance against traditional ensemble methods: Softmax Averaging (SA), Majority Voting (MV), and Weighted Averaging (WA) with random weights. In this comparison, each method applied at Layer L is denoted as MethodL, such as SAL for Softmax Averaging. The results of these comparisons are presented in this section.
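For concreteness, the three baseline methods can be sketched as follows. This is a minimal illustration of their standard definitions, not the study's implementation; `probs` is assumed to hold each classifier's softmax outputs on the test set.

```python
import numpy as np

def softmax_average(probs):
    """SA: average the per-class probabilities over classifiers, then argmax.
    probs: array of shape (n_models, n_samples, n_classes)."""
    return probs.mean(axis=0).argmax(axis=-1)

def majority_vote(probs):
    """MV: each classifier votes for its argmax class; the modal class wins."""
    votes = probs.argmax(axis=-1)                 # (n_models, n_samples)
    n_classes = probs.shape[-1]
    # Count votes per class for every sample, then take the modal class.
    counts = np.apply_along_axis(np.bincount, 0, votes, None, n_classes)
    return counts.argmax(axis=0)

def weighted_average(probs, weights):
    """WA: probability average with per-classifier weights summing to one."""
    w = np.asarray(weights, dtype=float)
    w = w[:, None, None] / w.sum()
    return (w * probs).sum(axis=0).argmax(axis=-1)
```

CWE differs from WA only in how the weight vector is chosen: WA fixes it by hand, whereas CWE derives it from each classifier's measured performance.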

Softmax Averaging (SA):

As shown in the following tables, Softmax Averaging (SA) using the predictions at Layer L is denoted as SAL.

The results of SA across different layers are presented in Table 21, showcasing the performance metrics of accuracy, precision, recall, F1-score, and specificity. As expected, the performance of SA typically improves as we move up to higher layers. However, this trend is inconsistent, indicating a limitation in relying solely on SA for optimal ensemble outcomes.

Table 21. Performance evaluation using SA instead of CWE.

https://doi.org/10.1371/journal.pone.0321803.t021

For instance, while the SA1 predictions for models such as RNX008 and RNX016 achieve accuracies of 93.60% and 93.36%, respectively, this improvement is not sustained across all layers. Specifically, at SA4, the RN model achieves a relatively modest accuracy of 93.48%, falling short of expectations. Moreover, even the best performance metrics achieved by SA4, precision of 93.16%, recall of 93.48%, F1-score of 93.02%, and specificity of 84.32%, do not surpass those of our proposed method, CWE, at Layer 4 (CWE4). Our CWE4 model outperforms SA, achieving a notable accuracy of 94.08%, precision of 93.77%, recall of 94.08%, F1-score of 93.71%, and specificity of 85.78%.

These results highlight the limitations of SA, particularly its inability to consistently improve performance across layers and its ultimate failure to outperform our CWE method. In contrast, CWE demonstrates superior performance, effectively leveraging the strengths of multiple classifiers through optimal weighting, ultimately delivering the best results. Thus, it is evident that the accuracy obtained by SA4 is 0.60% lower than that of our CWE4.

Majority Voting (MV):

As shown in the upcoming table, Majority Voting (MV) using all predictions at Layer L is denoted as MVL.

The results for MV across different layers are presented in Table 22, showing how the ensemble method performs in terms of evaluation metrics as it progresses through the layers. As anticipated, the performance should ideally improve with higher layers, culminating in the final outcome at MV4. However, the results reveal certain limitations of the MV approach.

Table 22. Performance evaluation using MV instead of CWE.

https://doi.org/10.1371/journal.pone.0321803.t022

For example, at MV1, models like RNX008 and RNX064 show reasonably strong performance with accuracies of 92.75% and 93.36%, respectively. Yet, this upward trend is not consistently maintained across all layers. The final predictions at MV4 for RN achieve an accuracy of 93.84%, which, while solid, fails to outperform the best results obtained with our proposed CWE method.

Notably, our CWE4 model surpasses MV's performance with an accuracy of 94.08%, an F1-score of 93.71%, and superior values across all other metrics. These results underscore the effectiveness of our CWE approach, which optimally combines classifier outputs, outperforming traditional ensemble methods like MV, particularly at higher layers where performance is expected to peak. The consistency and superiority of CWE4 highlight its robustness and advantage over MV, even when the latter is executed perfectly. Thus, it is clear that the highest accuracy achieved by MV4 is 0.24% lower than that of our CWE4.

Weighted Averaging (WA):

As depicted in the next table, Weighted Averaging (WA) with all predictions at Layer L is denoted as WAL. The performance of WA across various layers is presented in Table 23, where heuristically chosen (rather than optimized) weights are assigned to each classifier, ensuring that the sum of all weights equals one. The weights are allocated such that the highest-performing models receive the greatest weights, followed by progressively lower weights for less effective models.

Table 23. Performance evaluation using WA instead of CWE.

https://doi.org/10.1371/journal.pone.0321803.t023

The results show that while some configurations exhibit strong performance, particularly with higher accuracy and F1-scores in the upper layers, the final results (WA4) still fall short when compared to our proposed method. For instance, at WA1, accuracy values are notably high for several models, but the performance does not consistently improve across the layers. Specifically, the final layer WA4 achieves an accuracy of 93.60% and an F1-score of 93.15%, which, despite being competitive, does not surpass the results obtained with our CWE method.

In conclusion, although Weighted Averaging demonstrates reasonable performance across different layers and configurations, it does not achieve the same level of effectiveness as our CWE model. The CWE approach outperforms WA, highlighting its superior capability in leveraging classifier predictions for enhanced overall performance. So, it is evident from our findings that WA4 achieves an accuracy that is 0.48% lower than our CWE4.

Across all tables and metrics, CWE demonstrates superior performance compared to any other conventional ensemble method. The consistent improvement in accuracy, precision, recall, F1-score, and specificity across different classifier ensembles proves that CWE is a more effective method for combining classifier outputs, resulting in better overall model performance.

4.5.3 ORNS architectures in single-layer CWE.

In the single-layer CWE approach, we evaluate the same metrics to reassess the claim regarding the superiority of the multi-layer CWE. The results, as shown in Table 24, clearly indicate that the multi-layer CWE significantly outperforms the single-layer version, which achieves only 93.48% accuracy. This difference is further supported by the other performance metrics, confirming the robustness of the claim.

The comprehensive performance comparisons clearly demonstrate that our approach, which integrates AT with CWE across multiple layers, offers a superior architecture compared to existing methods.

4.6 Answers to the research questions

Answer to RQ1: To optimize the Transfer Learning model for our classification tasks, we strategically employ all variants of RegNet. By harnessing RegNet’s diverse architectural patterns, we explore a wide design space, ensuring our models are well-suited for various tasks. Fine-tuning with customized layers allows us to tailor each model to the specific requirements of each task. This approach leverages RegNet’s scalable and efficient quantized linear parameterization, enhancing performance while maintaining computational efficiency. Ultimately, combining RegNet’s robust architecture with targeted fine-tuning results in a highly effective TL model, demonstrating RegNet’s versatility in TL applications.

Answer to RQ2: To effectively identify the most critical features, particularly significant areas or regions, the Attention-Triplet (AT) approach is applied in the design of an ORNS model. This method prioritizes important features by focusing more on them as feature maps pass through different layers, enhancing the model’s ability to capture relevant details. It balances the risks associated with ignoring deep features in simpler models or overfitting in more complex ones. By incorporating AT into the ORNS architecture, the model demonstrates superior performance compared to versions without an attention mechanism.

Answer to RQ3: A single algorithm is not sufficient for skin lesion classification; instead, an Ensemble Learning (EL) technique is necessary. Our study shows that using multiple classifiers in an ensemble is more effective than relying on a single classifier. By integrating a customized CNN architecture with three different attention mechanisms (CA, SEA, and SA) and combining them with TL models, we construct various ORNS architectures. Testing these architectures across 24 models, including all RegNet variants, and applying diverse ensemble strategies for predictions reveal a significant improvement in handling unseen data. This approach enhances both the accuracy and robustness of skin lesion classification by leveraging diverse methods for feature extraction and classification.

Answer to RQ4: Traditional Ensemble Learning (EL) methods have limitations in effectively weighting each model’s contribution, often relying on fixed or manually determined weights, which can lead to suboptimal performance on unseen data. The proposed approach overcomes these limitations by dynamically calculating optimal weight ratios for each model using a novel method based on χ² values. This data-driven technique allows for better generalization and performance by selecting the optimal ratio of predictions from each model. Visual results from the ML-CWE demonstrate significant performance improvements, highlighting increased robustness and efficiency in managing diverse patterns for complex DL tasks.
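The excerpt does not spell out the exact weighting formula, but one plausible reading, assumed here purely for illustration, is that each classifier's weight is its Pearson χ² statistic (computed from the contingency table of true labels versus predictions on held-out data), normalized so that the weights sum to one:

```python
import numpy as np

def chi2_statistic(y_true, y_pred, n_classes):
    """Pearson chi-squared statistic of the true-label vs. prediction table."""
    observed = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    # Expected counts if labels and predictions were independent.
    expected = np.outer(observed.sum(axis=1),
                        observed.sum(axis=0)) / observed.sum()
    mask = expected > 0
    return ((observed[mask] - expected[mask]) ** 2 / expected[mask]).sum()

def chi2_weights(y_true, model_predictions, n_classes):
    """Normalize each classifier's chi2 score into an ensemble weight."""
    stats = np.array([chi2_statistic(y_true, p, n_classes)
                      for p in model_predictions])
    return stats / stats.sum()
```

Under this scheme, a classifier whose predictions associate more strongly with the true labels receives a proportionally larger share of the ensemble's vote.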

5 Discussion and extended comparison

Our study demonstrates the effectiveness of combining ORNS architectures with a sophisticated ensemble approach, CWE, for enhancing classification performance. By integrating data preprocessing, augmentation, and fine-tuning of RegNet models, we address class imbalances and optimize feature extraction, leading to significant improvements in model accuracy.

The final results, achieved at CWE-Layer 4, show that our approach effectively leverages iterative refinement and model ensembling. The RN model, with an accuracy of 94.08%, outperforms its predecessors, highlighting the success of our multi-layer CWE strategy. The high specificity of 85.78% further demonstrates the model’s capability in distinguishing non-target classes, ensuring reliable performance across various metrics.

These outcomes validate our methodological choices, including the application of advanced attention mechanisms and ensemble techniques. The enhanced performance metrics underscore the potential of our approach to achieve robust and accurate predictions, making it a valuable contribution to the field of image classification.

Despite employing a larger array of classifiers, our advanced yet user-friendly approach demonstrates clear superiority in overall performance, generalization, and evaluation measures in this domain. A comprehensive comparison of our proposed model’s performance against existing literature is detailed in Table 25. Notably, we compare only against studies conducted using the HAM10000 dataset.

Table 25. Comparison of our proposed architecture with existing others.

https://doi.org/10.1371/journal.pone.0321803.t025

Additionally, we compared the performance of our architecture with state-of-the-art methods to demonstrate its superiority. As evidenced in Table 26, our customized architecture outperforms existing approaches, successfully validating its effectiveness.

Table 26. Comparison of our proposed architecture with state-of-the-art methods.

https://doi.org/10.1371/journal.pone.0321803.t026

5.1 Advantages and disadvantages of CWE

The CWE method offers several advantages and a few limitations compared to other weighting schemes. Its primary advantage lies in its statistical foundation, leveraging the χ² statistic to assign weights based on classifier performance. This ensures that models demonstrating a stronger relationship between predictions and actual outcomes have a greater influence, thereby enhancing the reliability and accuracy of the ensemble. Additionally, the CWE method is adaptable and can be applied in multi-layer scenarios, allowing incremental emphasis on high-performing classifiers. Finally, its guarantee of an optimal weight for each classification model can be regarded as its most important advantage.

However, CWE relies heavily on the quality of the test results, making it sensitive to small sample sizes or imbalanced datasets, which could skew the weight assignments. Moreover, its computational complexity may increase when dealing with large-scale datasets or a high number of classifiers, due to the need to calculate and normalize χ² values. Despite these limitations, CWE is particularly advantageous for scenarios requiring robust and statistically sound ensembling strategies.

5.2 Threats to validity

Although our approach demonstrates strong results in image classification, outlined below are certain limitations of our study, which may serve as opportunities for future research and refinement:

Extensive Use of Classifiers in the Initial Ensemble Layer

Incorporating a diverse set of classifiers in the initial layer can enhance model robustness, but it also introduces significant computational demands. The increased complexity and resource requirements during both training and inference limit the scalability of our approach, particularly when applied to larger datasets or in environments with limited computing resources.

Dependence on a Single Dataset

Our study’s reliance on a single dataset for training and evaluation poses a potential limitation in terms of generalizability. A model trained on just one dataset may not capture the full range of variations across different data sources, potentially leading to reduced performance when applied to new and unseen datasets.

6 Future work and research directions

Our future work will address several key research gaps and explore areas that require further investigation to enhance the applicability and efficiency of our model.

Streamlined Web-Based API for Real-Time Diagnostic Predictions:

We plan to develop a more user-friendly, web-based API that allows for real-time diagnostic predictions. This would provide dermatologists and patients with a practical tool for on-demand skin lesion analysis. The API will improve accessibility and scalability, allowing users to submit skin images from anywhere and receive immediate feedback on potential skin conditions.

Evaluation Across Multiple Datasets:

To ensure our model’s robustness and generalizability, we intend to test its performance on a variety of diverse datasets. This will help assess the model’s ability to handle different types of skin lesions, lighting conditions, and image quality. Such an evaluation will provide valuable insights into how the model performs across real-world scenarios and across different populations.

Reduction of Computational Costs:

While our model achieves high accuracy, computational efficiency remains a key area for improvement. In future research, we will focus on optimizing the architecture by reducing the number of algorithms used, while maintaining or even improving accuracy. This will help reduce computational costs, making the model more feasible for deployment in resource-constrained environments, such as mobile devices or remote healthcare settings.

These research directions aim to address the challenges of scalability, accessibility, and efficiency, ensuring that our model can be widely adopted and deployed to improve early diagnosis and patient care in dermatology.

7 Conclusion

This paper presents a comprehensive approach to image classification through the innovative application of ORNS architectures and a Multi-Layer Weighted Ensemble (ML-CWE) strategy. Our methodology begins with meticulous data preprocessing and augmentation, leading to the effective training and fine-tuning of diverse ORNS models. The integration of three advanced attention mechanisms enhances the model’s ability to focus on critical features and improve performance.

The Multi-Layer CWE technique, comprising sequential refinement across four layers, proves to be a powerful tool for optimizing ensemble predictions. By leveraging the strengths of various models at each layer, the final RN model achieves remarkable performance. This iterative approach effectively harnesses the benefits of each model, culminating in a robust and high-performing classification system.

Our study highlights the potential of ORNS architectures and Multi-Layer CWE to advance image classification through improved accuracy, reliability, and interpretability, as demonstrated by GradCAM visualizations. These contributions set new benchmarks in Transfer Learning and hold significant practical value, particularly in healthcare. By enabling early diagnosis of skin conditions, the approach can improve patient outcomes and accessibility.

All data used in this study, including the augmented training data, are publicly accessible on the Kaggle repository: [HAM10000 Dataset] [31] (https://www.kaggle.com/datasets/anwarhossaine/ham10000-splitted-and-augmented-CWE-70-15-15). Our use of the HAM10000 dataset complies with the Creative Commons Attribution-NonCommercial 4.0 International Public License because we have properly attributed the original dataset and cited the recommended paper by the authors. This fulfills the attribution requirement of the license. Additionally, our use of the dataset is for non-commercial purposes, aligning with the non-commercial clause of the license.

References

  1. Bibi S, Khan M, Shah J, Damaševičius R, Alasiry A, Marzougui M, et al. MSRNet: Multiclass skin lesion recognition using additional residual block based fine-tuned deep models information fusion and best feature selection. Diagnostics. 2023;13(19):3063.
  2. Dillshad V, Khan M, Nazir M, Saidani O, Alturki N, Kadry S. D2LFS2Net: Multi-class skin lesion diagnosis using deep learning and variance-controlled Marine Predator optimisation: An application for precision medicine. CAAI Trans Intell Technol. 2023.
  3. Hussain M, Khan M, Damaševičius R, Alasiry A, Marzougui M, Alhaisoni M, et al. SkinNet-INIO: Multiclass skin lesion localization and classification using fusion-assisted deep neural networks and improved nature-inspired optimization algorithm. Diagnostics. 2023;13(18):2869.
  4. Efat AH, Hasan SMM, Uddin MP, Mamun MA. A Multi-level ensemble approach for skin lesion classification using Customized Transfer Learning with Triple Attention. PLoS One. 2024;19(10):e0309430. pmid:39446759
  5. Ahmad N, Shah J, Khan M, Baili J, Ansari G, Tariq U, et al. A novel framework of multiclass skin lesion recognition from dermoscopic images using deep learning and explainable AI. Front Oncol. 2023;13(6):1151257.
  6. Malik S, Akram T, Awais M, Khan MA, Hadjouni M, Elmannai H, et al. An improved skin lesion boundary estimation for enhanced-intensity images using hybrid metaheuristics. Diagnostics (Basel). 2023;13(7):1285. pmid:37046503
  7. Tajerian A, Kazemian M, Tajerian M, Akhavan Malayeri A. Design and validation of a new machine-learning-based diagnostic tool for the differentiation of dermatoscopic skin cancer images. PLoS One. 2023;18(4):e0284437. pmid:37058446
  8. Hosny KM, Kassem MA, Foaud MM. Classification of skin lesions using transfer learning and augmentation with Alex-net. PLoS One. 2019;14(5):e0217293. pmid:31112591
  9. Dong Y, Wang L, Li Y. TC-Net: Dual coding network of Transformer and CNN for skin lesion segmentation. PLoS One. 2022;17(11):e0277578. pmid:36409714
  10. Khan S, Khan A. SkinViT: A transformer based method for melanoma and nonmelanoma classification. PLoS One. 2023;18(12):e0295151. pmid:38150449
  11. Singh RK, Gorantla R, Allada SGR, Narra P. SkiNet: A deep learning framework for skin lesion diagnosis with uncertainty estimation and explainability. PLoS One. 2022;17(10):e0276836. pmid:36315487
  12. Saarela M, Geogieva L. Robustness, stability, and fidelity of explanations for a deep skin cancer classification model. Appl Sci. 2022;12(19):9545.
  13. Sevli O. A deep convolutional neural network-based pigmented skin lesion classification application and experts evaluation. Neural Comput Applic. 2021;33(18):12039–50.
  14. Shetty B, Fernandes R, Rodrigues AP, Chengoden R, Bhattacharya S, Lakshmanna K. Skin lesion classification of dermoscopic images using machine learning and convolutional neural network. Sci Rep. 2022;12(1):18134. pmid:36307467
  15. Hoang L, Lee SH, Lee EJ, Kwon KR. Multiclass skin lesion classification using a novel lightweight deep learning framework for smart healthcare. Appl Sci. 2022;12(5):2677.
  16. Sun Q, Huang C, Chen M, Xu H, Yang Y. Skin lesion classification using additional patient information. Biomed Res Int. 2021;2021:6673852. pmid:33937410
  17. Nie Y, Sommella P, Carratù M, O’Nils M, Lundgren J. A deep CNN transformer hybrid model for skin lesion classification of dermoscopic images using focal loss. Diagnostics (Basel). 2022;13(1):72. pmid:36611363
  18. Khan MA, Akram T, Zhang Y, Alhaisoni M, Al Hejaili A, Shaban KA, et al. SkinNet-ENDO: Multiclass skin lesion recognition using deep neural network and Entropy-Normal distribution optimization algorithm with ELM. Int J Imaging Syst Tech. 2023;33(4):1275–92.
  19. Ajmal M, Khan M, Akram T, Alqahtani A, Alhaisoni M, Armghan A, et al. BF2SkNet: Best deep learning features fusion-assisted framework for multiclass skin lesion classification. Neural Comput Applic. 2023;35(30):22115–31.
  20. Wang G, Yan P, Tang Q, Yang L, Chen J. Multiscale feature fusion for skin lesion classification. Biomed Res Int. 2023;2023:5146543. pmid:36644161
  21. Mahbod A, Schaefer G, Wang C, Dorffner G, Ecker R, Ellinger I. Transfer learning using a multi-scale and multi-network ensemble for skin lesion classification. Comput Methods Progr Biomed. 2020;193:105475.
  22. Harangi B, Baran A, Hajdu A. Assisted deep learning framework for multi-class skin lesion classification considering a binary classification support. Biomed Signal Process Control. 2020;62(1):102041.
  23. Rahman Z, Hossain M, Islam M, Hasan M, Hridhee R. An approach for multiclass skin lesion classification based on ensemble learning. Inform Med Unlocked. 2021;25:100659.
  24. Popescu D, El-Khatib M, Ichim L. Skin lesion classification using collective intelligence of multiple neural networks. Sensors. 2022;22(12):4399.
  25. Nigar N, Umar M, Shahzad M, Islam S, Abalo D. A deep learning approach based on explainable artificial intelligence for skin lesion classification. IEEE Access. 2022;10:113715–25.
  26. Gouda W, Sama N, Al-Waakid G, Humayun M, Jhanjhi N. Detection of skin cancer based on skin lesion images using deep learning. Healthcare. 2022;10(7):1183.
  27. Khan M, Zhang Y, Sharif M, Akram T. Pixels to classes: Intelligent learning framework for multiclass skin lesion localization and classification. Comput Electr Eng. 2021;90:106956.
  28. Datta S, Shaikh M, Srihari S, Gao M. Soft attention improves skin cancer classification performance. In: Interpretability of machine intelligence in medical image computing, and topological data analysis and its applications for medical data: 4th international workshop, iMIMIC 2021, and 1st international workshop, TDA4MedicalData 2021, held in conjunction with MICCAI 2021, Strasbourg, France, September 27; 2021. p. 13–23.
  29. Nguyen V, Bui N, Do H. Skin lesion classification on imbalanced data using deep learning with soft attention. Sensors. 2022;22(19):7530.
  30. Tschandl P, Rosendahl C, Kittler H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data. 2018;5(1):1–9.
  31. Efat AH. HAM10000: Splitted and Augmented CWE (70 15 15). https://www.kaggle.com/datasets/ahefatresearch/ham10000-split-and-augmented
  32. Joy T, Efat A, Hasan S, Jannat N, Oishe M, Mitu M, et al. Attention trinity net and DenseNet fusion: Revolutionizing American Sign Language Recognition for inclusive communication. In: 2023 26th international conference on computer and information technology (ICCIT). IEEE; 2023. p. 1–6.
  33. Shafin S, Efat AH, Mahedy Hasan SM, Jannat N, Oishe M, Mitu M, et al. Skin lesion classification through sequential triple attention DenseNet: Diverse utilization of the combination of attention modules. In: 2023 26th international conference on computer and information technology (ICCIT). IEEE; 2023. p. 1–6. https://doi.org/10.1109/iccit60459.2023.10441280
  34. Sikder S, Efat A, Hasan S, Jannat N, Mitu M, Oishe M, et al. A triple-level ensemble-based brain tumor classification using Dense-ResNet in association with three attention mechanisms. In: Proceedings of the 26th international conference on computer and information technology (ICCIT). IEEE; 2023. p. 1–6.
  35. Haque N, Efat A, Hasan S, Jannat N, Oishe M, Mitu M. Revolutionizing pest detection for sustainable agriculture: A transfer learning fusion network with attention-triplet and multi-layer ensemble. In: Proceedings of the 26th international conference on computer and information technology (ICCIT). IEEE; 2023. p. 1–6.
  36. Efat A, Hasant S, Jannat N, Mitu M, Taraque M, Ferdous S, et al. Inquisition of the support vector machine classifier in association with hyper-parameter tuning: A disease prognostication model. In: Proceedings of the 4th international conference on electrical, computer & telecommunication engineering (ICECTE). IEEE; 2022. p. 131–4.