
Brain tumor classification in VIT-B/16 based on relative position encoding and residual MLP

  • Shuang Hong ,

    Roles Conceptualization, Data curation, Investigation, Methodology, Software, Validation, Writing – original draft

    hongshuang1206@wust.edu.cn

    Affiliation School of Information Science and Engineering, Wuhan University of Science and Technology, Wuhan, Hubei, China

  • Jin Wu,

    Roles Data curation, Methodology, Supervision, Validation, Writing – review & editing

    Affiliation School of Information Science and Engineering, Wuhan University of Science and Technology, Wuhan, Hubei, China

  • Lei Zhu,

    Roles Data curation, Formal analysis, Methodology, Software, Supervision, Validation, Writing – review & editing

    Affiliation School of Information Science and Engineering, Wuhan University of Science and Technology, Wuhan, Hubei, China

  • Weijie Chen

    Roles Data curation, Formal analysis, Writing – review & editing

    Affiliation School of Information Science and Engineering, Wuhan University of Science and Technology, Wuhan, Hubei, China

Abstract

Brain tumors pose a significant threat to health, and their early detection and classification are crucial. Currently, diagnosis heavily relies on pathologists conducting time-consuming morphological examinations of brain images, leading to subjective outcomes and potential misdiagnoses. In response to these challenges, this study proposes an improved Vision Transformer-based algorithm for human brain tumor classification. To overcome the limitations of small existing datasets, Homomorphic Filtering, Contrast Limited Adaptive Histogram Equalization, and Unsharp Masking techniques are applied to enrich dataset images, enhancing information and improving model generalization. To address the inability of the Vision Transformer's self-attention structure to capture the order of input tokens, a novel relative position encoding method is employed to enhance the overall predictive capabilities of the model. Furthermore, the introduction of residual structures in the Multi-Layer Perceptron tackles convergence degradation during training, leading to faster convergence and enhanced algorithm accuracy. Finally, this study comprehensively analyzes the network model's performance on validation sets in terms of accuracy, precision, and recall. Experimental results demonstrate that the proposed model achieves a classification accuracy of 91.36% on an augmented open-source brain tumor dataset, surpassing the original VIT-B/16 accuracy by 5.54%. This validates the effectiveness of the proposed approach in brain tumor classification, offering a potential reference for clinical diagnoses by medical practitioners.

1. Introduction

Cancer remains one of the leading causes of death in human society. According to statistics released by the National Cancer Center, of the 4.064 million cancer patients in China in 2016 [1], approximately 10.9% were diagnosed with brain tumors. As the most intricate organ in the human body, the brain is susceptible to tumors arising from the abnormal growth and functioning of cells within the brain tissue [2]. To determine a treatment plan, early screening and diagnosis of brain tumors rely primarily on modern medical imaging. Physicians use techniques such as X-rays, ultrasound, CT scans, and MRI scans to identify the location and extent of lesions within the patient's brain [3]. Brain tumors encompass a wide range of tumor types, each with distinct characteristics and behavior; diagnosing them therefore demands a high level of expertise [4], extensive clinical experience, and a strong foundation of prior understanding. However, healthcare professionals generally face high work intensity and heavy workloads, which can lead to missed diagnoses and misdiagnoses during the screening and diagnosis of brain tumor images [5].

The integration of artificial intelligence technology has led to the widespread adoption of computer-aided diagnosis and treatment techniques, significantly mitigating missed diagnoses and misinterpretations during tumor image screening and diagnosis. Currently, brain tumor classification tasks rely primarily on Convolutional Neural Networks (CNNs), and the available open-source brain tumor datasets are few and relatively small. The emergence of Vision Transformers (VIT) [6] has challenged the necessity of CNNs for classification tasks. Applying the Transformer module [7] directly to sequences of small image patches, especially after pre-training on a comprehensive image dataset, has demonstrated remarkable outcomes. Vision Transformers excel with limited computational resources, attaining substantial classification accuracy even on modest datasets.

Yet, Vision Transformers face challenges such as the inability of their core self-attention mechanism to capture input token order, limiting effectiveness in structured data modeling. Additionally, the MLP structure within Vision Transformers for classification encounters information degradation, leading to the loss of significant image details and premature performance saturation. To address the limitations of the aforementioned methods, this paper proposes a brain tumor classification approach based on relative position encoding and residual MLP structure. The main contributions of this work are as follows:

  1. Enhanced Dataset Utilization with Image Enhancement Techniques: We enhance the brain tumor dataset using advanced image enhancement techniques, including Homomorphic Filtering (HF) [8], Contrast Limited Adaptive Histogram Equalization (CLAHE) [9], and Unsharp Masking (UM) [10]. These methods are chosen to enhance image details, mitigate noise, and improve contrast in brain tumor images [11]. The resulting dataset not only improves the quality of the input data but also amplifies the generalization capability of the network, enabling the Vision Transformer model to leverage the data for more accurate classification.
  2. Incorporating Relative Position Encoding for Understanding Brain Tumor MRI Images: A novel relative position encoding method is introduced during network training, encoding the relative distances between input tokens. This mechanism enables our model to discern and comprehend pairwise relationships between tokens, enhancing its ability to understand the spatial context of brain tumor images.
  3. Empowering Vision Transformers through a Residual MLP Architecture: The incorporation of this architecture elevates the network's capacity to model complex features within brain tumor images, thereby improving the model's classification accuracy. Furthermore, by introducing adaptive average pooling in the fully connected layer of the MLP head, we achieve higher classification accuracy without a substantial increase in computational complexity [12].

2. Related work

In the field of medical-assisted diagnosis, convolutional neural networks have been widely used as deep learning models and have achieved significant results [13]. COVID-19, which emerged in 2019 as a pandemic disease, has affected millions of human lives and placed a massive burden on healthcare centers. Khan et al. [14] proposed a quick, accurate, and low-cost computer-based tool comprising two new deep learning frameworks, Deep Hybrid Learning (DHL) and Deep Boosted Hybrid Learning (DBHL), for the timely detection and treatment of COVID-19 patients. After a further period of research, Khan et al. [15] also presented a new deep CNN-based framework built on novel channel-boosted CNNs to detect and analyze COVID-19 from lung CT images; it captures useful dynamic features of the infected regions, discriminating COVID-19 infected regions from healthy ones. These two innovative studies by Khan et al. provided robust support to the healthcare system at the time, enabling rapid and accurate identification of COVID-19 infection through pulmonary CT imaging. In research on lymphocytes, Rauf et al. [16] proposed a lymphocyte analysis framework based on a deep convolutional neural network (DC-Lym-AF) to analyze lymphocytes in immunohistochemistry images; this framework has the potential to become a medical diagnostic tool for investigating various histopathological problems. Amidst these strides in medical research, malaria remained a significant public health challenge, affecting millions worldwide each year. Khan [17] responded to this pressing concern by developing an innovative Deep Boosted and Ensemble Learning (DBEL) framework tailored to the screening of malaria parasite images. The essence of the approach was the strategic combination of new Boosted-BR-STM CNNs with ensemble machine learning classifiers, a fusion of methodologies that offered a novel way to address malaria's potentially fatal impact on red blood cells.

In brain tumor research, Rehman et al. [18] proposed the use of a 3D CNN for brain tumor detection, employing a feedforward neural network to select the optimal features for classification. While their approach achieved commendable accuracy, the detection process is time-consuming and the classification procedure complex; these limitations underscore the need for more efficient methods that balance accuracy and computational efficiency. Zhao et al. [19] introduced a feature fusion layer into the original U-Net network for brain glioma classification, aiming to create an end-to-end classification system by merging shallow and deep features. However, the approach tended to over-express redundant features and did not fully utilize the images' global and local salient features, leaving room for further improvement in classification accuracy. Indeed, brain tumor classification is challenging because of the tumors' complex structure, texture, size, location, and appearance. Zahoor et al. [20] developed a novel deep residual and regional-based Res-BRNet convolutional neural network for effective brain tumor Magnetic Resonance Imaging (MRI) classification and achieved excellent performance in image classification tasks, although there is still room for improvement in the model's computational complexity and efficiency.

In light of the limitations of the aforementioned approaches, ViT divides the input image into a set of fixed-size patches, which are then transformed into sequential data. It operates on the self-attention mechanism, allowing it to capture long-range dependencies and relationships between image patches [21]. This global contextual understanding is especially valuable for brain tumor analysis, where capturing intricate patterns and relationships across different regions of the image is crucial. Traditional CNNs, by contrast, rely on local receptive fields, which can struggle to capture global contextual information effectively. Moreover, compared with traditional CNNs, ViT can achieve comparable or even better performance with fewer parameters, improving computational efficiency during training and inference, which is important for medical applications where computational resources may be limited. Introducing the ideas of Transformers provides a new paradigm for image processing, allowing models to learn features and relationships directly from raw pixel-level data without relying on handcrafted feature extractors. This innovative model structure opens up new avenues for exploring Transformer applications in visual tasks. For instance, in medical image segmentation, Khan et al. [22] propose a novel Vision Transformer-based segmentation technique that effectively models dependencies between distant structures through a multi-scale attention mechanism, providing a successful solution to the challenge of segmenting complex, interconnected structures.

3. Materials and methods

In this study, the brain tumor images are preprocessed using the HF, CLAHE, and Unsharp Masking image enhancement methods. The relative position encoding method is applied to generate weights for the positional relationships between input tokens, and the performance of VIT-B/16 is optimized by incorporating a residual structure in the MLP. The overall model architecture is illustrated in Fig 1. As the architecture diagram shows, the model is based on the VIT-B/16 network and is divided into three parts: the Patches Embedding layer, the Transformer Encoder layer with repeatedly stacked Encoder Blocks, and the MLP Block. Before the tokens are input to the Transformer Encoder, a trainable vector called the Class Token, designed specifically for classification, is prepended to them. To address the inherent limitation of self-attention in capturing the order of input tokens, a new two-dimensional image encoding method called Image-Relative Position Encoding (IRPE) is employed in the multi-head self-attention mechanism of the Transformer [23]. The interaction computation between the query, value, and key in the self-attention module is performed by incorporating learnable parameters.

In IRPE, the clip function is introduced to map relative positions to encodings, reducing computational costs and the number of parameters. Dropout/Drop Path is then added to accelerate computation and improve the model's generalization ability [24]. Brain tumors of different types and stages exhibit varying degrees of lesion severity, and most do not differ markedly from a normal brain on MRI. This makes it easy for the model to overlook crucial features, causing gradients to vanish quickly and recognition accuracy to drop. To address this issue, an improved residual MLP is introduced to enhance the transmission of feature information within the network and improve its performance. Because the residual structure of the MLP increases the depth, and thus the complexity, of the network, an adaptive average pooling layer is added to the MLP Head, which helps reduce the number of parameters and computational overhead to some extent. Finally, the classification results are obtained through a fully connected layer with a tanh activation function. Our source code will be made publicly available at https://github.com/zhulei2016/RST-saliency/ upon acceptance.
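For illustration, a minimal PyTorch sketch of such a pooled MLP head is given below; the class name, the exact layer ordering, and the four-class output are our assumptions rather than the paper's released implementation.

```python
import torch
import torch.nn as nn

class PooledMLPHead(nn.Module):
    """Hypothetical MLP head: adaptive average pooling over the token
    sequence, a tanh-activated fully connected layer, and a classifier."""
    def __init__(self, embed_dim=768, num_classes=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)            # pool all tokens to one vector
        self.pre_logits = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.Tanh())
        self.fc = nn.Linear(embed_dim, num_classes)    # glioma/meningioma/pituitary/no tumor

    def forward(self, tokens):                         # tokens: (batch, seq_len, embed_dim)
        x = self.pool(tokens.transpose(1, 2)).squeeze(-1)  # (batch, embed_dim)
        return self.fc(self.pre_logits(x))
```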

3.1 Dataset

The dataset used in this paper combines several open-source CE-MRI datasets from the Figshare and Kaggle sites [25]. The original dataset comprises a total of 7023 MRI images of brain tumors: 1621 glioma slices, 1645 meningioma slices, 1757 pituitary tumor slices, and 2000 tumor-free slices. Each tumor type is represented in three orientations: axial, sagittal, and coronal, as shown in Fig 2.

Fig 2. The examples of images in different orientations for different types of brain tumors.

https://doi.org/10.1371/journal.pone.0298102.g002

To address the limited number of samples in the dataset, this paper employs image augmentation to expand it, thereby mitigating overfitting and improving the accuracy and generalization of the network. Three image enhancement methods, namely HF, CLAHE, and Unsharp Masking, are used to generate additional images, resulting in a total of 14,046 images. All images are then standardized by resizing them to a uniform size of 224×224 pixels and normalizing their pixel values. To evaluate the performance of the model, a random split is performed, with 80% of the images used as the training set and the remaining 20% as the test set, as shown in Table 1.
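For reference, a minimal torchvision sketch of this preprocessing and split is shown below; the folder layout and the normalization statistics are assumptions, not values reported in the paper.

```python
from torchvision import transforms, datasets
from torch.utils.data import random_split

transform = transforms.Compose([
    transforms.Resize((224, 224)),                    # uniform 224x224 input size
    transforms.ToTensor(),                            # scale pixel values to [0, 1]
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])

dataset = datasets.ImageFolder("data/", transform=transform)  # 14,046 augmented images
n_train = int(0.8 * len(dataset))                             # 80/20 random split
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
```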

3.2 Image enhancement

Image enhancement techniques start from the initial MRI images and aim to enrich the image information, enhance image quality, and meet the input requirements of subsequent models while improving model accuracy [26]. We employ Homomorphic Filtering (HF), Contrast Limited Adaptive Histogram Equalization (CLAHE), and Unsharp Masking (UM) in this paper. Brain tumor MRI images often contain noise and significant variations in brightness, resulting in poor contrast and blurred details. Homomorphic filtering, which combines spatial and frequency domain characteristics of the image, enhances details in the darker regions and improves overall resolution. Histogram equalization is a commonly used enhancement technique; however, traditional histogram equalization operates over the entire brightness range of the image, which can amplify noise and unwanted details and may also disrupt local image information. CLAHE improves the overall contrast of an image without degrading image details: it applies adaptive equalization locally, preventing the amplification of noise while enhancing contrast. Lastly, the Unsharp Masking technique is applied to enhance the edges and details of MRI images, resulting in sharper image contours.

3.2.1 Homomorphic filtering.

The homomorphic filtering technique applies the illumination-reflection model to transform each pixel value f(x, y). In this model, each pixel value is equal to the product of the illumination component i(x, y) and the reflection component r(x, y). The formula is as follows:

f(x, y) = i(x, y) · r(x, y)   (1)

Taking the logarithm of f(x, y) separates the product into a sum:

ln f(x, y) = ln i(x, y) + ln r(x, y)   (2)

ln r(x, y) is subjected to the Fourier transform, resulting in R(u, v), and ln i(x, y) is likewise transformed, resulting in I(u, v), giving the frequency-domain representation F(u, v):

F(u, v) = I(u, v) + R(u, v)   (3)

A Gaussian high-pass filter H(u, v) is then applied to F(u, v) to enhance contrast:

H(u, v) = (γH − γL)[1 − e^(−c·D²(u, v)/D0²)] + γL   (4)

S(u, v) = H(u, v) · F(u, v)   (5)

The enhanced image is recovered by applying the inverse Fourier transform to S(u, v) and exponentiating the result.

where the high-frequency gain γH = 1.5, the low-frequency gain γL = 0.5, the constant c = 1, the cutoff frequency D0 = 40, and D(u, v) denotes the distance from the point (u, v) to the filter center (u0, v0).
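For reference, a minimal NumPy sketch of this filter with the parameters above is shown below; the function name and the log1p/expm1 handling of zero-valued pixels are our own choices.

```python
import numpy as np

def homomorphic_filter(img, gamma_h=1.5, gamma_l=0.5, c=1.0, d0=40.0):
    """Homomorphic filtering of a 2-D grayscale image (Eqs 1-5)."""
    log_img = np.log1p(img.astype(np.float64))        # ln f(x, y), safe at zero
    F = np.fft.fftshift(np.fft.fft2(log_img))         # centered spectrum F(u, v)

    rows, cols = img.shape
    u = np.arange(rows) - rows / 2
    v = np.arange(cols) - cols / 2
    D2 = u[:, None] ** 2 + v[None, :] ** 2            # squared distance to center

    H = (gamma_h - gamma_l) * (1 - np.exp(-c * D2 / d0 ** 2)) + gamma_l
    filtered = np.fft.ifft2(np.fft.ifftshift(H * F)).real
    return np.expm1(filtered)                         # invert the log transform
```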

As shown in Fig 3, homomorphic filtering enhances contrast and highlights image details, resulting in a significant improvement in overall resolution. After applying homomorphic filtering to brain tumor MRI images, CLAHE is performed.

Fig 3. Original image and homomorphic filtered enhanced image.

https://doi.org/10.1371/journal.pone.0298102.g003

3.2.2 Contrast limited adaptive histogram equalization.

Traditional histogram equalization equalizes the overall image brightness [27]. However, the uneven brightness distribution in brain tumor MRI images can cause the loss of local image information. CLAHE limits the height of the local histogram to control the amplification of image noise and to prevent excessive local contrast enhancement, which can lead to the loss of details.

CLAHE divides the image into multiple sub-blocks, each containing S pixels. The histogram of each sub-block, h(rk), is clipped at a corresponding clip threshold; the clipped pixels are redistributed uniformly across the histogram grayscale, and this process is repeated until all pixels have been assigned. After the redistribution, the grayscale histogram of each sub-region undergoes histogram equalization. Finally, the grayscale value of each pixel in the output image is calculated from the processed center pixel of the corresponding sub-region.
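As an illustration, CLAHE as described can be applied with OpenCV; the clip limit of 2.0 and the 8×8 tile grid below are common defaults rather than values reported here.

```python
import cv2

def apply_clahe(gray_img, clip_limit=2.0, tiles=(8, 8)):
    """CLAHE on a uint8 single-channel MRI slice."""
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tiles)
    return clahe.apply(gray_img)
```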

As shown in Fig 4, the image undergoes CLAHE processing, where the parameters for histogram equalization are adjusted based on the characteristics of different parts of the image. This effectively enhances the contrast and details of the image, further improving the quality and interpretability of the deep learning network. It enables the network to better learn and extract image features while reducing the risk of overfitting and the impact of noise.

Fig 4. Original image, the CLAHE-enhanced image, and the image enhanced by combining HF and CLAHE.

https://doi.org/10.1371/journal.pone.0298102.g004

3.2.3 Unsharp Masking.

Unsharp Masking is applied to brain tumor MRI images to enhance image features and strengthen image edges, because essential feature information lies at the boundaries between brain tissues and structures, while MRI images often suffer from edge blurring.

Unsharp Masking utilizes a high-pass filter to extract the high-frequency components of the image [28]. These high-frequency components are amplified and then added back to the original image. This process enhances the high-frequency information, such as edges and details, in the original image while preserving the low-frequency information.
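A minimal sketch of this operation is given below, using a Gaussian low-pass filter to isolate the high-frequency residual; the sigma and amount values are illustrative assumptions.

```python
import cv2
import numpy as np

def unsharp_mask(img, sigma=2.0, amount=1.0):
    """Amplify the high-frequency residual and add it back to the image."""
    blurred = cv2.GaussianBlur(img, (0, 0), sigma)    # low-pass version
    high_freq = img.astype(np.float64) - blurred      # edges and fine detail
    sharpened = img + amount * high_freq
    return np.clip(sharpened, 0, 255).astype(np.uint8)
```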

As shown in Fig 5, the sharpening process applied to the brain tumor MRI image further enhances the edge and detail features between the tissues. The overall clarity and accuracy of the image are significantly improved. This enhancement facilitates faster convergence and training of the network, thereby improving the model’s robustness and generalization capability [29].

Fig 5. Original image, the image enhanced by Unsharp Masking, and the image enhanced by combining HF, CLAHE, and UM.

https://doi.org/10.1371/journal.pone.0298102.g005

The information entropy represents the richness of detail contained in an image: the more information the image contains, the higher its entropy [30]. The information entropy of the several methods is evaluated in Table 2. The combination of HF, CLAHE, and UM shows the highest information entropy, indicating that its enhancement effect is the most pronounced.
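For reference, the information entropy reported in Table 2 can be computed from the normalized grayscale histogram, as in this short sketch.

```python
import numpy as np

def image_entropy(gray_img):
    """Shannon entropy of the grayscale histogram; higher means richer detail."""
    hist, _ = np.histogram(gray_img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                   # skip empty bins to avoid log(0)
    return -np.sum(p * np.log2(p))
```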

3.3 Relative position encoding methods

The Transformer Encoder layer consists primarily of Layer Norm, Multi-Head Attention, Dropout/Drop Path, and the MLP Block. Self-attention is the core of the Transformer. As shown in Fig 6, it models the relationships between tokens to map a query and a set of key-value pairs to an output. Before being input into the network, brain tumor MRI images are segmented into blocks and flattened, then mapped to a representation a = (a1, a2, …, an). Each element ai is processed using three parameter matrices WQ, WK, and WV to compute its corresponding q (query, which is matched against each k), k (key, which is matched against each q), and v (value, which contains the information extracted from a). The matching between q and k calculates their correlation; the higher the correlation, the larger the weight assigned to the corresponding v.

The weight coefficients αij are calculated using softmax:

αij = exp(eij) / Σk exp(eik)   (6)

where:

eij = qi kjᵀ / √dk   (7)

The value of eij is calculated by scaled dot-product attention. The dot product may produce large values, which can cause gradients to become very small after applying softmax, so the result is scaled by √dk, the square root of the key dimension.

Self-attention computes an output sequence z = (z1, z2, …, zn), where each output element zi is a weighted sum of the input elements:

zi = Σj αij vj   (8)

In matrix form, the computation is summarized as follows:

Attention(Q, K, V) = softmax(QKᵀ / √dk) V   (9)
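A minimal PyTorch sketch of Eqs (6)-(9) for a single attention head (batching kept, masking and dropout omitted) is shown below.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_k); returns the output sequence z."""
    d_k = q.size(-1)
    e = q @ k.transpose(-2, -1) / d_k ** 0.5   # e_ij, scaled dot products
    alpha = F.softmax(e, dim=-1)               # weight coefficients alpha_ij
    return alpha @ v                           # z_i = sum_j alpha_ij * v_j
```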

Multi-Head Attention is a crucial component of the Transformer architecture and plays a key role in brain tumor classification. It performs parallel computations of multiple self-attentions using multiple attention heads, and concatenates their outputs for a linear transformation to obtain the desired dimensions. In processing brain tumor MRI images, each attention head can capture a different aspect of the information, providing multiple representation subspaces and a more comprehensive understanding of brain tumors. For brain tumor MRI images, we use randomly initialized Q, K, and V weight matrices in each Attention Head. As shown in Fig 7(b), Q, K, and V undergo linear transformations through the Linear layers before entering the Scale Dot-Product Attention layer (shown in Fig 7(a)). In the Scale Dot-Product Attention layer, Q and K are matrix multiplied using Matmul, followed by dimension scaling in the scale layer. This yields the correlations between Q and each K, which are then used to calculate the softmax-weighted matrix. In the task of brain tumor classification, these weight matrices can be regarded as the similarities between image blocks, allowing a weighted fusion of the image blocks that emphasizes features related to brain tumors. During training, a Mask layer can also be used to mask out irrelevant sequence information: brain tumor MRI images may contain irrelevant image blocks or noise, and the Mask layer suppresses these interfering factors and enhances the model's focus on tumor-relevant features. Finally, the output is obtained by multiplying the softmax-weighted matrix with V.

Fig 7. (a) Scale Dot-Product Attention (b) Multi-Head Attention.

https://doi.org/10.1371/journal.pone.0298102.g007

In the task of brain tumor classification, brain tumor MRI images have specific spatial structures and positional information. These pieces of information are crucial for accurately capturing the features and dependencies of brain tumors. However, traditional self-attention mechanisms have inherent limitations in capturing the sequential order of input tokens. To address this issue, we introduce the relative positional encoding method to encode the relative distances between input elements. This allows us to learn the positional relationships between input tokens, particularly capturing longer dependencies between tokens. In the processing of brain tumor MRI images, this method encodes the relative positional information between input elements ai and aj into vectors aijQ, aijK, and aijV, which are then combined with self-attention. By weighting the dot product results of the query and key with the relative positional information, the consideration of positional relationships is introduced in the attention computation. This approach enables the Transformer model to better understand the dependency relationships between different positions in brain tumor MRI images and more accurately capture important tumor features.

The output element zi is then represented as follows:

zi = Σj αij (vj + aijV)   (10)

where the weight coefficients αij are still obtained from Eq (6), but eij now incorporates the relative position encoding:

eij = qi (kj + aijK)ᵀ / √dk   (11)

The approach used in this paper is an independent relative position encoding method, separate from the input embedding layer. In the input embedding layer, the interaction between query, key, and value takes place, while the relative position encoding, which captures the positional relationships between tokens, does not participate in that interaction. Instead, it is added to the dot product of the query and key before the softmax operation, as shown in Fig 8. In addition, we observed that not all relative positional information between input elements is useful. In brain tumor MRI images, distant positional information is often redundant and can increase the number of model parameters and computational costs. Therefore, we apply the clip function to limit the distance of relative positions, retaining only the relative positional information within a certain range [31]. By doing so, we can more effectively utilize the limited positional information and improve the performance of the brain tumor classification model. Summing up, we can express this with the following formula:

eij = qi kjᵀ / √dk + b_clip(i−j, k)   (12)

where b_clip(i−j, k) is a learnable scalar obtained by applying the clip function to the two-dimensional relative positional encoding.

The clip function is defined as:

clip(x, k) = max(−k, min(k, x))   (13)
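The following simplified PyTorch sketch illustrates Eqs (12) and (13) in one dimension: a learnable scalar bias, indexed by the clipped relative distance, is added to the attention logits before softmax. The actual IRPE method buckets two-dimensional patch offsets, so this reduction is ours for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClippedRelativeBias(nn.Module):
    """Learnable bias b_clip(i-j, k) added to attention logits (1-D sketch)."""
    def __init__(self, k=8):
        super().__init__()
        self.k = k
        self.bias = nn.Parameter(torch.zeros(2 * k + 1))  # one scalar per distance bucket

    def forward(self, logits):                            # logits: (batch, n, n)
        n = logits.size(-1)
        idx = torch.arange(n, device=logits.device)
        rel = idx[:, None] - idx[None, :]                 # relative position i - j
        rel = rel.clamp(-self.k, self.k) + self.k         # clip(x, k), shifted to >= 0
        return F.softmax(logits + self.bias[rel], dim=-1)
```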

The rationale behind introducing relative position encoding is rooted in the recognition of the importance of sequential information in medical imaging. In scenarios like brain tumor classification, the spatial arrangement and order of image patches can carry valuable diagnostic insights. Conventional self-attention mechanisms, while powerful, lack the inherent capability to understand the sequential context. The relative position encoding method addresses this gap by providing an explicit mechanism for the model to comprehend the sequential relationships. Our method involves applying the clip function to two-dimensional relative positional encodings, allowing the model to differentiate between tokens based on their positions and maintain a notion of order. By introducing such positional awareness, the model can better capture the spatial layout of features and thereby improve its capacity to recognize intricate patterns within brain tumor images. In medical imaging classification tasks, the arrangement of features often carries crucial diagnostic information. By equipping the model with the ability to consider positional context, we enable it to harness sequential patterns that could significantly enhance its diagnostic accuracy.

3.4 Residual MLP


In Vision Transformer, the Multi-Layer Perceptron (MLP) is utilized as a component of the self-attention layer to perform the non-linear mapping. This module projects each position's feature vectors in the input to a higher-dimensional space, followed by dimension reduction. A non-linear activation function, such as the Gaussian Error Linear Unit (GELU), is then applied to the feature vectors. The MLP, resembling a fully connected neural network, enriches the overall representational capacity of the model [32]. A significant enhancement in our approach, depicted in Fig 9, is the introduction of a residual structure into the MLP. Rather than using the original MLP module, we employ a Residual MLP, which takes the output of the Multi-Head Self-Attention as input. This modification is not arbitrary but is driven by the need to address the degradation problem typically associated with the original MLP network. In the conventional MLP setup, the input vector first passes through a fully connected layer, which expands the number of nodes by a factor of four. A GELU activation function [33] and a dropout layer are then applied. The GELU activation function introduces stochastic regularity, enhancing non-linearity in the network while maintaining the integrity of the input data, which improves the model's generalization ability by preventing overfitting. The input subsequently passes through another fully connected layer to restore the node number, followed by a dropout layer to produce the final output.

However, with the original MLP, we observed that the network often converged too swiftly, which could lead to suboptimal learning of complex features. This issue is where the residual structure comes into play. In the Residual MLP block, we introduce a shortcut connection that bypasses several layers. This modification transforms the original mapping F(x) into F(x) + x, where x is the input vector. The addition of residual structures allows for the learning of more intricate features without significantly increasing the computational burden. The residual structure essentially creates a form of memory in the model, allowing it to learn from previously seen data and thus improving the model’s capacity to generalize [34]. Furthermore, the residual structure mitigates the vanishing gradient problem, enabling the model to learn deeper representations without converging prematurely. As a result, we achieve better network recognition accuracy.
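A minimal PyTorch sketch of the Residual MLP block described above is given below; the 4× expansion follows the text, while the dropout rate is an assumption.

```python
import torch
import torch.nn as nn

class ResidualMLP(nn.Module):
    """ViT MLP block (expand 4x, GELU, dropout, project back) with a shortcut,
    so the block computes F(x) + x instead of F(x)."""
    def __init__(self, dim=768, expansion=4, drop=0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * expansion),   # expand the node count by a factor of 4
            nn.GELU(),
            nn.Dropout(drop),
            nn.Linear(dim * expansion, dim),   # restore the original dimension
            nn.Dropout(drop),
        )

    def forward(self, x):
        return x + self.mlp(x)                 # residual mapping F(x) + x
```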

4. Experiment and analysis

The experiments use Python 3.10.8 with the PyTorch 1.13.0 deep learning framework as the backend. The operating system is Windows 11, the CPU is an Intel Core i7-12700H, and the GPU is an NVIDIA RTX 3070 Ti.

4.1 The evaluation metrics

Accuracy is used to evaluate the overall performance of the model and is defined as follows:

Accuracy = Nt / N   (14)

where Nt represents the number of correctly classified test samples and N represents the total number of test samples.

To comprehensively evaluate the model, this study also calculates the accuracy for individual classes, defined as:

Accuracy(i) = Nt(i) / N(i)   (15)

where Nt(i) represents the number of images of a single class correctly predicted by the model, and N(i) represents the total number of images belonging to that class in the test dataset. In addition to these two metrics, we also use precision [35], recall [36], F1-score [37], and precision-recall (PR) curves to assess the developed model. The definitions are given in Eqs (16) to (18):

Precision = TP / (TP + FP)   (16)

Recall = TP / (TP + FN)   (17)

F1-score = 2 × Precision × Recall / (Precision + Recall)   (18)

where TP denotes true positive predictions, TN true negative predictions, FP false positive predictions, and FN false negative predictions.
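These metrics follow directly from the four counts, as in this short sketch (the accuracy here is the binary form of Eq (14)).

```python
def classification_metrics(tp, tn, fp, fn):
    """Eqs (16)-(18) plus overall accuracy from the confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```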

4.2 The training process of the experiments

Based on deep learning theory, a brain tumor image classification task was implemented using the Transformer neural network. The experiments used the Adam optimizer with a learning rate of 0.001, a batch size of 8, and 30 training epochs. The CE-MRI dataset was employed, and image augmentation techniques were applied to enrich the information in the images, prevent overfitting, and improve the model's generalization ability. Preprocessing steps were also conducted to enhance the model's sensitivity. The overall process of training the brain tumor recognition model is illustrated in Fig 10.
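A minimal sketch of this training configuration is shown below; the model and the datasets are assumed to be defined elsewhere.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)                                    # improved ViT-B/16, defined elsewhere
loader = DataLoader(train_set, batch_size=8, shuffle=True)  # batch size 8
optimizer = optim.Adam(model.parameters(), lr=0.001)        # Adam, learning rate 0.001
criterion = nn.CrossEntropyLoss()

for epoch in range(30):                                     # 30 epochs
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```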

Fig 10. Training curves of improved VIT-B/16 on training and test sets.

https://doi.org/10.1371/journal.pone.0298102.g010

During the experiment, the accuracy and loss of the model were recorded and plotted after each epoch. Fig 10 shows the overall trend of the model's accuracy improving over iterations, as both the training and testing accuracy increase. Simultaneously, the loss gradually decreases, indicating that the model is learning and making more accurate predictions as training progresses. The training accuracy starts to converge around the 10th epoch and begins to plateau around the 20th, stabilizing at approximately 0.8765; the training loss stabilizes around 0.3651. The testing accuracy and loss fluctuate similarly to the training curves. The final testing accuracy is 0.9136 and the loss is 0.3064, indicating that the model has high overall accuracy and good robustness.

From Fig 11, it is evident that the improved network achieves significantly higher accuracy compared to VIT-B/16. The loss rate of the original network is 0.484, while the improved network has a loss rate of 0.306. This indicates that the improved network exhibits better classification performance for brain tumors. The accuracy of the models before and after the improvement, as well as the accuracy for each individual class, are shown in Tables 3 and 4, respectively.

Fig 11. Loss and accuracy plots of VIT-B/16 before and after improvement.

https://doi.org/10.1371/journal.pone.0298102.g011

Table 3. Classification results of the network before improvement.

https://doi.org/10.1371/journal.pone.0298102.t003

According to Table 3, it can be observed that the network before the improvement achieves the highest accuracy of 89.38% in classifying normal brain images. However, the accuracy is relatively lower for other types of brain tumor MRI images. This is because the features of the normal brain without tumors are relatively distinct, with no pathological areas in the images. On the other hand, the visual features of other brain tumor images are not easily distinguishable. Some tumor locations may be close in proximity, making it challenging for the network to extract sufficient detailed features. As a result, the classification accuracy for these images is lower. By comparing Tables 3 and 4, we can observe that the improved network achieved an increase in accuracy for each category by 8.64%, 8.25%, 9.67%, and 5.24% respectively. The overall accuracy improved by 5.54%. These results indicate that the improved network demonstrates higher classification accuracy in brain tumor classification tasks.

Precision and recall (sensitivity) are the primary metrics for evaluating the efficiency of medical assistive diagnosis systems, and effective brain tumor classification demands good performance on both. As shown in Table 5, our model achieved a precision of 90.6% and a recall of 90.74%, effectively enhancing the accuracy of the model. The normal distribution plot with the associated 95% confidence interval in Fig 12 provides a comprehensive statistical analysis of our model's accuracy: the shape of the curve reveals the variability in accuracy, while the 95% confidence interval offers an estimate of the true accuracy. This visualization shows that our model maintains relatively stable accuracy across multiple experiments, with overall performance at a consistent level. Additionally, the PR curve analysis in Fig 13 shows that our proposed brain tumor classification model outperforms the other classification models on this dataset. The PR curve highlights the model's strength in precision and recall, signifying a lower false positive rate and greater sensitivity. These results demonstrate the robust discriminative capability of our model in the intricate landscape of brain tumor classification, and our study offers a potential solution for enhancing the accuracy and reliability of brain tumor diagnostic processes.

Fig 12. Normal distribution of accuracy with 95% confidence interval.

https://doi.org/10.1371/journal.pone.0298102.g012

Fig 13. PR curve based analysis of different classification models.

https://doi.org/10.1371/journal.pone.0298102.g013

Table 5. Performance evaluation of VIT-B/16 before and after improvement.

https://doi.org/10.1371/journal.pone.0298102.t005

Fig 14 presents the visual attention maps for the transformer encoder blocks of our model. We randomly selected three types of brain tumor images from the dataset, with each column representing the attention map of a different layer of encoder blocks. We can observe that the model highlights the regions of interest in brain tumor images, with the areas of interest becoming more pronounced toward the final block of the model. This indicates that the model effectively focuses on the locations of lesions in the brain tumor images. These visualization results demonstrate that our model has successfully captured the areas in MRI images that require identification.

Fig 14. Visualization results of transformer encoder blocks.

https://doi.org/10.1371/journal.pone.0298102.g014

4.3 Ablation experiments

In this paper, VIT-B/16 is used as the base network for the brain tumor classification experiments. In this section, the accuracy obtained at each step of the network improvement is analyzed, and ablation experiments are performed for each improvement to verify its importance and contribution to the model. The results of the ablation experiments are shown in Table 6.

The results of the ablation studies demonstrate that performing image enhancement on the dataset alone, which increases the diversity of the images and gives the network more examples to learn from, mitigated overfitting to a certain extent and improved the original model's accuracy by 0.82%. When only the Residual MLP was added to the network, increasing the depth of the classification network and thus the overall model complexity, the model was better able to learn complex image features, and the accuracy improved to 88.57%. Combining both methods to enrich the image information and enhance the learning capability of the network raised the accuracy to 89.51%. Finally, the new relative positional encoding method was incorporated to improve recognition accuracy by learning the spatial relationships between different parts of the input images. Together, these enhancements resulted in an overall accuracy improvement of 5.54% over the original network.

4.4 Comparison of different models

Table 7 presents a comparison of the performance of the improved network proposed in this study with traditional neural networks, including AlexNet [38], VGGNet [39], GoogLeNet [40], ResNet [41], the lightweight neural network MobileNet [42], and EfficientNet [43] which employs a compound scaling strategy.

From the table, it can be observed that AlexNet demonstrates the effectiveness of deep convolutional neural networks in image classification, achieving an accuracy of 88.02% on this dataset. VGGNet, building on AlexNet, demonstrated that increasing the depth of the network while using smaller convolutions can still effectively improve performance; in this study, VGGNet achieved an accuracy 1.71% higher than AlexNet. Like VGGNet, GoogLeNet also utilizes small convolutional layers for dimensionality reduction, and in addition it replaces the fully connected layers with average pooling layers [44]. These modifications allow it to achieve a classification accuracy of over 90% on the brain tumor dataset. VGGNet and GoogLeNet, as traditional neural networks, achieved performance improvements by stacking convolutional and downsampling layers, but they also suffer from the degradation problem. In response, ResNet introduced residual connections to alleviate degradation and enable the construction of extremely deep network architectures [45]; as a result, ResNet further improved the accuracy to 90.16%. Extremely deep architectures, however, come with high memory requirements and computational complexity, which may not suit mobile and embedded applications. The increasing demand in these domains calls for lightweight network models. MobileNet addresses this requirement with the Depthwise Convolution structure, which reduces computational complexity while maintaining a reasonable level of classification accuracy. The more advanced EfficientNet architecture incorporates techniques such as depthwise separable convolutions, linear bottleneck structures, and advanced regularization to optimize the trade-off between accuracy and efficiency, making it well suited to image classification tasks and leading to a higher accuracy of 90.67% in our classification experiments.

In contrast to traditional convolutional networks, ViT uses self-attention mechanisms to capture global context from the entire image instead of relying solely on convolutional layers [46], which is particularly useful for tasks like brain tumor classification where the relationships between different regions of the image are important. ViT also generates attention maps that help in interpreting which parts of the image contribute most to the classification decision, which matters in medical applications for understanding the model's decision-making process. Our improved network achieves an accuracy of 91.36%, which is 3.34% higher than AlexNet, 1.63% higher than VGGNet, 1.2% higher than ResNet, 1.23% higher than MobileNet, and 0.69% higher than the advanced EfficientNet. These results demonstrate that our proposed method achieves higher classification accuracy in brain tumor classification.

5. Conclusion

Brain tumors, as highly malignant tumors with high incidence and mortality rates, require timely detection and treatment to potentially save patients' lives. Magnetic resonance imaging provides excellent visualization of the internal structure and tissue of the brain. With the assistance of classification-aided diagnostic techniques, doctors can make faster and more accurate judgments about tumor type, aiding the efficient diagnosis and treatment of brain tumors. However, the limited size of brain tumor datasets and the scarcity of image sources pose challenges to accurate classification. In this study, image enhancement techniques were employed to enrich the information within the images and increase the number of samples, benefiting classification accuracy. Traditional classification-aided diagnostic techniques rely heavily on convolutional neural networks; here, ViT-B/16 was employed for brain tumor classification, offering advantages in handling global features, generalization ability, and training speed. The network structure and performance were further optimized by incorporating relative positional encoding and the Residual MLP. The improved network achieved a final accuracy of 91.36%, an improvement of 5.54% over the original ViT-B/16 model, validating the effectiveness of the proposed modifications.

However, it’s imperative to acknowledge the inherent limitations and potential pitfalls of deep learning and ViT models, especially when applied to medical applications like brain tumor classification. Datasets are often limited in size and can be biased due to variations in data collection methods, patient demographics, and clinical settings. Such biases can impact model performance and generalization to diverse patient populations. While advancements in generalization have been achieved through techniques like transfer learning, there’s still a risk that models might struggle with unseen or rare cases, leading to potential misdiagnoses or inaccuracies in clinical practice. Future research should focus on acquiring more diverse datasets to mitigate biases and enhance model generalization. Advanced data augmentation methods should be explored to artificially increase dataset size and variability, thereby improving robustness and performance. Integrating MRI data with other medical data types, such as genomic, histopathological, and clinical data, can provide a more comprehensive understanding of brain tumors and enhance classification accuracy. Utilizing transfer learning and domain adaptation techniques can further improve performance on small and diverse datasets. Finally, conducting longitudinal studies to validate model performance over time and across different patient cohorts is essential to ensure reliability and robustness in clinical practice.

References

  1. 1. Bray F, Ferlay J, Soerjomataram I, et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries[J]. CA: a cancer journal for clinicians, 2018, 68(6): 394–424. pmid:30207593
  2. 2. Liu D, Zhang H, Zhao M, et al. Brain Tumor Segmentation Based on Dilated Convolution Refine Networks. 2018 IEEE 16th International Conference on Software Engineering Research, Management and Applications (SERA). IEEE, 2018: 113–120. https://doi.org/10.1109/sera.2018.8477213.
  3. 3. Eis M, Els T, Hoehn-Berlage M. High resolution quantitative relaxation and diffusion MRI of three different experimental brain tumors in rat[J]. Magnetic Resonance in Medicine Official Journal of the Society of Magnetic Resonance in Medicine, 2010, 34(6): 835–844.
  4. 4. Doi K. Computer-aided diagnosis in medical imaging: historical review, current status and future potential[J]. Computerized medical imaging and graphics, 2007, 31(4-5): 198–211. pmid:17349778
  5. 5. Buell JF, Gross T, Alloway RR, et al. Central nervous system tumors in donors: misdiagnosis carries a high morbidity and mortality[C]. Transplantation proceedings. Elsevier, 2005, 37(2): 583–584. pmid:15848464
  6. 6. Dosovitskiy A, Beyer L, Kolesnikov A, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations(ICLR). New Orleans: 2021: 1–22.
  7. 7. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). California: Curran Associates Inc, 2017: 6000–6010.
  8. 8. Yugander P, Tejaswini CH, Meenakshi J, et al. MR Image Enhancement using Adaptive Weighted Mean Filtering and Homomorphic Filtering. Procedia Computer Science Volume 167, 2020, Pages 677–685.
  9. 9. Abood Loay Kadom. Contrast enhancement of infrared images using Adaptive Histogram Equalization (AHE) with Contrast Limited Adaptive Histogram Equalization (CLAHE). Iraqi Journal of Physics. 2018 Sep, volume 16, issue 37, pages 127–135. https://doi.org/10.30723/ijp.v16i37.84.
  10. 10. Pu XT, Jia ZH, Wang LJ, Hu YJ, Yang J. The remote sensing image enhancement based on nonsubsampled contourlet transform and unsharp masking. Concurrency and Computation: Practice and Experience. 2014 Mar, volume 26, issue 3, pages 742–747.
  11. 11. Mozaffarzadeh M, Mahloojifar A, Orooji M. Image enhancement and noise reduction using modified Delay-Multiply-and-Sum beamformer: Application to medical photoacoustic imaging. Iranian Conference on Electrical Engineering (ICEE). 2017 May. https://doi.org/10.1109/iraniancee.2017.7985131.
  12. 12. Liu Y, Zhang YM, Zhang XY, et al. Adaptive spatial pooling for image classification[J]. Pattern Recognition, 2016, 55: 58–67.
  13. 13. Yadav SS, Jadhav SM. Deep convolutional neural network based medical image classification for disease diagnosis[J]. Journal of Big data, 2019, 6(1): 1–18.
  14. 14. Khan S. H., Sohail A., Khan A., Hassan M., Lee Y. S., Alam J. et al. COVID-19 detection in chest X-ray images using deep boosted hybrid learning. Computers in Biology and Medicine, 137, 104816. pmid:34482199
  15. 15. Khan Saddam Hussain, Iqbal Javed, et al. Covid-19 detection and analysis from lung ct images using novel channel boosted cnns. Expert Systems with Applications 229 (2023): 120477. pmid:37220492
  16. 16. Rauf Zunaira, Sohail Anabia, et al. Attention-guided multi-scale deep object detection framework for lymphocyte analysis in IHC histological images. Microscopy 72, no. 1 (2023): 27–42. pmid:36239597
  17. 17. Khan Saddam Hussain. Malaria Parasitic Detection using a New Deep Boosted and Ensemble Learning Framework. Converg. Inf. Ind. Telecommun. Broadcast. data Process. 1981-1996, vol. 26, no. 1, pp. 125–150, Dec. 2022.
  18. 18. Rehman A, Khan MA, Saba T, et al. Microscopic brain tumor detection and classification using 3D CNN and feature selection architecture. Microscopy Research and Technique, 2020,84(1):133–149. pmid:32959422
  19. 19. O Ronneberger, P Fischer, T Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, LNCS, Vol.9351: 234–241, 2015. https://doi.org/10.1007/978-3-662-54345-03.
  20. 20. Zahoor, M. M. and Khan, S. H. Brain tumor MRI Classification using a Novel Deep Residual and Regional CNN. arXiv preprint arXiv:2211.16571.
  21. 21. Gu J, Wang Z, Kuen J, et al. Recent advances in convolutional neural networks[J]. Pattern Recognition, 2018, 77: 354–377.
  22. 22. Khan A, Rauf Z, Khan A R, et al. A Recent Survey of Vision Transformers for Medical Image Segmentation[J]. arXiv preprint ArXiv abs/2312.00634 (2023): n. pag.
  23. 23. Wu K, Peng H, Chen M, et al. Rethinking and improving relative position encoding for vision transformer[C] Proceedings of the IEEE/CVF International Conference on Computer Vision(ICCV). 2021: 10033–10041.
  24. 24. Srivastava Nitish et al. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (2014): 1929–1958.
  25. 25. Figshare and Kaggle site Brain Tumor MRI Dataset. https://figshare.com/articles/dataset/brain_tumor_dataset/1512427 https://www.kaggle.com/sartajbhuvaji/brain_tumor_classification-mri/metadata https://www.kaggle.com/datasets/ahmedhamada0/brain_tumor_detection/metadata.
  26. 26. Lundervold AS, Lundervold A. An overview of deep learning in medical imaging focusing on MRI[J]. Zeitschrift für Medizinische Physik, 2019, 29(2): 102–127. pmid:30553609
  27. 27. Abdullah-Al-Wadud M, Kabir MH, Dewan MAA, et al. A dynamic histogram equalization for image contrast enhancement[J]. IEEE transactions on consumer electronics, 2007, 53(2): 593–600.
  28. 28. Bai J, Yuan L, Xia S T, et al. Improving vision transformers by revisiting high-frequency components[C]. European Conference on Computer Vision(ECCV). Cham: Springer Nature Switzerland, 2022: 1–18.
  29. 29. Tian L, Cao Y, He B, et al. Image enhancement driven by object characteristics and dense feature reuse network for ship target detection in remote sensing imagery[J]. Remote Sensing, 2021, 13(7): 1327.
  30. 30. Shi W, Zhu CQ, Tian Y, et al. Wavelet-based image fusion and quality assessment[J]. International Journal of Applied Earth Observation and Geoinformation, 2005, 6(3-4): 241–251.
  31. 31. Shaw P, Uszkoreit J, Vaswani A. Self-Attention with Relative Position Representations[C] Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018: 464–468.
  32. 32. Zhou Z H, Feng J. Deep forest: towards an alternative to deep neural networks[C] Proceedings of the 26th International Joint Conference on Artificial Intelligence(IJCAI). 2017: 3553–3559. https://doi.org/10.24963/ijcai.2017/497.
  33. 33. Hendrycks D, Gimpel K. Gaussian error linear units (gelus)[J]. arXiv preprint 2016. https://doi.org/10.48550/arXiv.1606.08415
  34. 34. Balnarsaiah B, Nayak BA, Sujeetha GS, et al. Parkinson’s disease detection using modified ResNeXt deep learning model from brain MRI images[J]. Soft Computing, 2023: 1–10. https://doi.org/10.1007/s00500-023-08535-9.
  35. 35. Buckland M. and Gey F. The relationship between recall and precision[J]. Journal of the American Society for Information Science, 1994, 45(1): 12–19.
  36. 36. J. Davis and M. Goadrich. The relationship between Precision-Recall and ROC curves[C]. Proceedings of the 23rd International Conference on Machine Learning(ICML). 2006: 233–240.
  37. 37. M. Sokolova, N. Japkowicz, and S. Szpakowicz. Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation[C]. Australasian joint conference on artificial intelligence. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006: 1015–1021.
  38. 38. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolution neural networks. Advances in neural information processing systems, 2012: 1097–1105.
  39. 39. S. Liu and W. Deng Very deep convolutional neural network based image classification using small training sample size 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, Malaysia, 2015, pp. 730–734. https://doi.org/10.1109/ACPR.2015.7486599
  40. 40. Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 1–9.
  41. 41. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: 770–778. https://doi.org/10.1109/cvpr.2016.90.
  42. 42. Howard AG, Zhu M, Chen B, et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications[J]. CoRR [Internet]. 2017; abs/1704.04861. Available: http://arxiv.org/abs/1704.04861.
  43. 43. Tan M, Le Q. Efficientnet: Rethinking model scaling for convolutional neural networks. International Conference on Machine Learning (ICML). California: 2019: 6105–6114.
  44. 44. Kumar RL, Kakarla J, Isunuri BV, et al. Multi-class brain tumor classification using residual network and global average pooling[J]. Multimedia Tools and Applications, 2021, 80: 13429–13438. https://doi.org/10.1007/s11042-020-10335-4.
  45. 45. Saini S S, Rawat P. Deep Residual Network for Image Recognition[C]. 2022 IEEE International Conference on Distributed Computing and Electrical Circuits and Electronics (ICDCECE). IEEE, 2022: 1–4.
  46. 46. Cordonnier J B, Loukas A, Jaggi M. On the Relationship between Self-Attention and Convolutional Layers[C]. Eighth International Conference on Learning Representations (ICLR) 2020. 2020 (CONF).