
DS-Mamba: Depthwise separable mamba for hyperspectral image classification

  • Lin Wei ,

    Contributed equally to this work with: Lin Wei, Huihan Yang, Yuping Yin, Zhiyuan Qu, Haonan Zheng

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Visualization, Writing – review & editing

    Affiliations Basic Teaching Department, Liaoning Technical University, Huludao, Liaoning, China, School of Electronic and Information Engineering, Liaoning Technical University, Huludao, Liaoning, China

  • Huihan Yang ,

    Contributed equally to this work with: Lin Wei, Huihan Yang, Yuping Yin, Zhiyuan Qu, Haonan Zheng

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    1165692334@qq.com

    Affiliation School of Electronic and Information Engineering, Liaoning Technical University, Huludao, Liaoning, China

  • Yuping Yin ,

    Contributed equally to this work with: Lin Wei, Huihan Yang, Yuping Yin, Zhiyuan Qu, Haonan Zheng

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Validation, Writing – review & editing

    Affiliation Faculty of Electrical and Control Engineering, Liaoning Technical University, Huludao, Liaoning, China

  • Zhiyuan Qu ,

    Contributed equally to this work with: Lin Wei, Huihan Yang, Yuping Yin, Zhiyuan Qu, Haonan Zheng

    Roles Conceptualization, Data curation, Validation, Writing – original draft, Writing – review & editing

    Affiliation School of Electronic and Information Engineering, Liaoning Technical University, Huludao, Liaoning, China

  • Haonan Zheng

    Contributed equally to this work with: Lin Wei, Huihan Yang, Yuping Yin, Zhiyuan Qu, Haonan Zheng

    Roles Conceptualization, Data curation, Validation, Writing – original draft, Writing – review & editing

    Affiliation School of Electronic and Information Engineering, Liaoning Technical University, Huludao, Liaoning, China

Abstract

Transformers suffer from quadratic computational complexity in hyperspectral image (HSI) classification tasks, which can result in error propagation and excessive memory usage. Recently, Mamba architectures built upon State Space Models have supplanted Transformers across various domains, achieving long-range sequence modeling capability while offering linear computational efficiency. However, employing the basic Mamba model for HSI classification poses problems in the extraction of spatial and spectral features. Motivated by this, we propose DS-Mamba, a novel depthwise separable Mamba for HSI classification. Specifically, to extract spatial and spectral features more efficiently, we design a depth spatial Mamba block (DSpaM), a depth spectral Mamba block (DSpeM) and a feature enhancement module. These blocks use depthwise separable convolution in conjunction with the basic Mamba block to improve classification accuracy while maintaining a low computational cost. Subsequently, to enhance classification performance, feature weights are adjusted and spatial as well as spectral information is integrated through the feature fusion module. Finally, the feature information is enhanced and categorized by a classification module with Efficient Channel Attention (ECA). In comparative experiments, DS-Mamba achieved overall accuracies of 96.54%, 91.52%, and 94.89% on the Pavia University, Hanchuan, and Houston datasets, respectively, surpassing several advanced Transformer-based methods. Furthermore, DS-Mamba has fewer model parameters and floating point operations (FLOPs), with only 137.74K parameters and 12.52G FLOPs recorded on the Pavia University dataset.

1 Introduction

Hyperspectral images (HSI) capture information about the electromagnetic radiation reflected by objects across continuous narrow bands [1,2], which enables precise identification and classification of objects. Owing to these advantages, HSI has found extensive applications across various remote sensing scenarios, such as geological resource exploration [3], environmental monitoring [4] and precision agriculture [5]. As a fundamental task of HSI processing, the core objective of HSI classification is to distinguish features at the pixel level [6].

Early research methodologies commonly employed techniques such as support vector machine (SVM) [7], principal component analysis (PCA) [8], and linear discriminant analysis (LDA) [9] for feature extraction or dimensionality reduction. These methods primarily concentrate on leveraging the spectral features of HSI while overlooking the spatial information. Therefore, some researchers have developed classification frameworks based on spectral spatial features, such as extended morphological profiles (EMP) [10], extended multi-attribute profiles (EMAP) [11], and sparse manifold representations [12]. However, these methods rely on manually designed features and predetermined parameters, which are insufficient for effectively capturing feature information in complex environments.

In recent years, deep learning has been widely used in the field of computer vision [13,14], and it has also propelled advancements in HSI classification research [15]. The primary classical deep learning models encompass convolutional neural networks (CNN) [16], recurrent neural networks (RNN) [17], graph convolutional networks (GCN) [18], and Transformers [19]. Among these, CNN architectures have garnered significant attention in research due to their local receptive fields and the properties of parameter sharing. The 2D-CNN proposed by Lee et al. [20] uses multiple convolutional and pooling layers to extract deep features, but the structure of the deep full convolution results in a relatively high number of parameters and computational complexity. Zhong et al. [21] developed an end-to-end residual network utilizing 3D-CNN, which effectively captures deep spectral-spatial information directly from the original 3D HSI cube. However, the operation of 3D convolution significantly escalates the computational demands of the model. The HybridSN proposed by Roy et al. [22] integrates 2D and 3D convolution, reducing the complexity of the model. Li et al. [23] constructed a depth-separable residual neural network (ResNet), which separates the spectral and spatial information using depthwise separable convolution and reduces the network size to mitigate the risk of overfitting. However, CNN-based models have limited ability to model global context, and fixed convolutional kernels pose challenges in adapting to dynamically changing input features. When training data is scarce, these models are susceptible to overfitting, resulting in diminished generalization capabilities. Recently, the Transformer has been widely applied owing to its powerful capability in modeling long-range dependencies. He et al. [24] proposed a cross-spectral vision transformer (CSiT), which employs a dual-branch architecture to extract pixel-level multi-scale features. Moreover, Sun et al. 
[25] introduced a spectral-spatial feature tokenization transformer (SSFTT) that leverages the strengths of both CNN and Transformer. This model employs 2D and 3D convolutions to extract shallow features, subsequently incorporating a Gaussian-weighted feature tokenizer for feature transformation, which generates the input tokens required for the Transformer block. Hong et al. [26] introduced SpectralFormer to produce grouped spectral embeddings by learning spectrally localized sequence information from neighboring bands. MorphFormer, proposed by Roy et al. [27], employs spectral and spatial morphological convolution to improve the interaction between structure and shape information. The cross spatial-spectral dense transformer (CS2DT) [28] utilizes an adaptive dense encoder to extract multi-scale semantic information and employs cross-attention mechanisms for effective feature fusion. In addition, [29] designed a lightweight network, GSC-ViT, which uses groupwise separable convolution to decrease the number of parameters while effectively capturing local spectral-spatial information.

However, the quadratic computational complexity introduced by the self-attention mechanism of Transformers may lead to inefficiency and memory limitations when dealing with high-dimensional HSI data. Recently, Mamba [30], built on the state-space model (SSM), has shown excellent performance in natural language processing (NLP) tasks. By introducing a selective scanning mechanism and a hardware-aware algorithm, Mamba exhibits linear computational efficiency while enabling long-range modeling, and is anticipated to serve as an alternative to the Transformer. Consequently, several studies have begun applying Mamba models to computer vision tasks. Vim [31] employs positional embedding to annotate image sequences and models state space representations using bidirectional compressed vision. VMamba [32] collects contextual information through four distinct scanning routes, drawing from a diverse array of sources and perspectives. These models demonstrate outstanding performance in tasks such as image classification and segmentation. However, Mamba is less frequently employed in HSI classification tasks. To this end, we propose a depthwise separable Mamba (DS-Mamba) for HSI classification. The main contributions are summarized as follows:

  1. By integrating depthwise separable convolution with Mamba, a depth spatial Mamba block and a depth spectral Mamba block are developed to effectively extract both spatial and spectral features. This approach significantly reduces the computational complexity of the model, thereby alleviating the overall computational burden.
  2. A feature fusion module is developed to integrate the extracted spatial and spectral information by adjusting the weights accordingly. Additionally, the concept of residual learning is incorporated through skip connections, which enhances both model performance and generalization capability.
  3. The lightweight Efficient Channel Attention (ECA) [33] is introduced prior to classification, facilitating local cross-channel interactions and thereby enhancing the representation of features.

2 State space models and Mamba

The State Space Model (SSM) [34] originates from continuous linear time-invariant systems and is widely used to model dynamic systems through state variables. SSM maps a one-dimensional input signal $x(t) \in \mathbb{R}$ to an output $y(t) \in \mathbb{R}$ through an intermediate hidden state $h(t) \in \mathbb{R}^{N}$. This process can be expressed through a linear ordinary differential equation as:

$$h'(t) = \mathbf{A} h(t) + \mathbf{B} x(t) \tag{1}$$

$$y(t) = \mathbf{C} h(t) \tag{2}$$

where $h'(t)$ denotes the time derivative of $h(t)$, $\mathbf{A} \in \mathbb{R}^{N \times N}$ represents the state matrix, and $\mathbf{B} \in \mathbb{R}^{N \times 1}$ and $\mathbf{C} \in \mathbb{R}^{1 \times N}$ signify the projection matrices.

However, SSM, as a continuous-time system, is difficult to integrate directly into deep learning algorithms. Consequently, discretization is achieved through the zero-order hold (ZOH) technique with a specified time scale. This process transforms the continuous parameters $\mathbf{A}$ and $\mathbf{B}$ into discrete parameters $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$:

$$\bar{\mathbf{A}} = \exp(\Delta \mathbf{A}) \tag{3}$$

$$\bar{\mathbf{B}} = (\Delta \mathbf{A})^{-1} \left( \exp(\Delta \mathbf{A}) - \mathbf{I} \right) \Delta \mathbf{B} \tag{4}$$

where $\Delta$ is the time scale parameter. The discretized SSM can then be expressed as a linear recurrence:

$$h_t = \bar{\mathbf{A}} h_{t-1} + \bar{\mathbf{B}} x_t \tag{5}$$

$$y_t = \mathbf{C} h_t \tag{6}$$

The above computation can also be expressed as a global convolution:

$$\bar{\mathbf{K}} = \left( \mathbf{C}\bar{\mathbf{B}},\ \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\ \ldots,\ \mathbf{C}\bar{\mathbf{A}}^{L-1}\bar{\mathbf{B}} \right) \tag{7}$$

$$y = x * \bar{\mathbf{K}} \tag{8}$$

where $L$ denotes the length of the input sequence $x$, $\bar{\mathbf{K}} \in \mathbb{R}^{L}$ represents the structured convolution kernel, and $*$ is the convolution operation.

The traditional SSM struggles to effectively capture the contextual information within an input sequence [26]. To address this limitation, Mamba introduces a distinctive selection mechanism that enables the model to dynamically adjust the SSM parameters according to the input data. This mechanism allows the model to selectively retain or discard information according to the sequence context, thereby enhancing its capability to process long sequential data. Additionally, Mamba introduces a hardware-aware algorithm designed to enhance computational efficiency. The SSM parameters (Δ, A, B, C) are loaded into fast Static Random-Access Memory (SRAM) instead of slower High Bandwidth Memory (HBM). A series of preprocessing steps, such as discretization, are conducted in SRAM before the final output is written back to HBM, which improves training efficiency. The detailed architecture of Mamba is shown in Fig 1.
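As an illustration of Eqs. (3)–(6), the ZOH discretization and the recurrent scan can be sketched in a few lines of NumPy. This is an illustrative toy, not code from the paper (the names `discretize_zoh` and `ssm_scan` are ours), assuming a diagonal state matrix A as in Mamba's S4D-style parameterization:

```python
import numpy as np

def discretize_zoh(A, B, delta):
    """ZOH discretization for a diagonal state matrix A (stored as a vector):
    A_bar = exp(delta*A), B_bar = (delta*A)^-1 (exp(delta*A) - I) delta*B,
    which reduces elementwise to (A_bar - 1) / A * B."""
    A_bar = np.exp(delta * A)
    B_bar = (A_bar - 1.0) / A * B
    return A_bar, B_bar

def ssm_scan(x, A, B, C, delta):
    """Recurrent evaluation of the discretized SSM:
    h_t = A_bar * h_{t-1} + B_bar * x_t,  y_t = C . h_t"""
    A_bar, B_bar = discretize_zoh(A, B, delta)
    h = np.zeros_like(A)
    ys = []
    for x_t in x:
        h = A_bar * h + B_bar * x_t
        ys.append(C @ h)
    return np.array(ys)
```

The same outputs can be obtained with the global convolution of Eqs. (7)–(8) using the kernel entries $\mathbf{C}\bar{\mathbf{A}}^{k}\bar{\mathbf{B}}$, which is what allows parallel training.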

3 Model topology

In this section, the design of the DS-Mamba model is presented. The overall framework is illustrated in Fig 2. The model has three main components: a feature enhancement module, a feature extraction module, and an ECA classification module. Unlike traditional patch-based models, the image input first passes through the feature enhancement module to extract information from each pixel. This fine-grained pixel embedding enables the model to capture more intricate local features, thereby improving accuracy and robustness in HSI tasks. The feature extraction module then extracts the spatial and spectral features and fuses them; it mainly contains three parts: the depth spatial Mamba block (DSpaM), the depth spectral Mamba block (DSpeM) and the feature fusion module. Finally, classification is conducted by the classification head containing ECA attention.

3.1 Feature enhancement module

The feature enhancement module (FEM), as illustrated in Fig 3(a), aims to construct more discriminative features in the spectral domain. Specifically, group normalization mitigates the dependence on batch size by normalizing the feature maps within each group, thereby providing enhanced stability when training with small batches. The SiLU activation function improves the nonlinear representation of the model through its smoothness and self-gating mechanism, while also accelerating convergence. The FEM not only allows the model to capture subtle variations in spectral information but also significantly enhances the discrimination and robustness of features.

3.2 Depth spatial Mamba block

Although existing Transformer-based models demonstrate exceptional long-range modeling capabilities, their quadratic computational complexity can lead to significant error propagation within the model. Furthermore, their extensive parameter counts and high computational demands also cause memory usage issues. To address these issues, we design the DSpaM using depthwise separable convolution to minimize computation and parameter count. Simultaneously, the DSpaM utilizes Mamba as the basic feature extraction unit to build long-range dependencies with linear computational efficiency. The detailed structure of the DSpaM is shown in Fig 3(a). The forward process can be formulated as follows:

$$X' = \mathrm{SiLU}\left(\mathrm{GN}\left(\mathrm{PConv}\left(\mathrm{DConv}(X)\right)\right)\right) \tag{9}$$

$$Y = \mathrm{FEM}\left(\mathrm{Mamba}(X')\right) \tag{10}$$

where DConv, PConv, GN, SiLU, and FEM denote the depthwise convolution with 3 × 3 kernel size, the pointwise convolution with 1 × 1 kernel size, the group normalization (GN) layer, the SiLU activation function, and the feature enhancement module, respectively, and Mamba denotes the standard Mamba block proposed in [30].
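The parameter savings from the depthwise–pointwise factorization used in the DSpaM can be checked with simple weight counting. This is an illustrative sketch; the 103-channel example below merely assumes the Pavia University band count:

```python
def conv2d_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def ds_conv2d_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution (one filter per input
    channel) followed by a 1 x 1 pointwise convolution."""
    return c_in * k * k + c_in * c_out

# For c_in = c_out = 103 and k = 3:
#   standard conv: 95,481 weights; depthwise separable: 11,536 weights
```

That is roughly an 8× reduction, which is why the two Mamba blocks remain lightweight.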

3.3 Depth spectral Mamba block

Each pixel in an HSI encompasses hundreds of consecutive spectral bands, which exhibit complex interactions and dependencies. Effectively modeling the relationships among these spectral bands and extracting discriminative features is a critical challenge. We therefore propose the DSpeM to effectively harness the abundant spectral information of HSI. The details are illustrated in Fig 3(b). The input features are first grouped along the spectral dimension, then fed into the Mamba block after depthwise separable convolution and the SiLU activation function, and finally pass through the feature enhancement module. The DSpeM is computed as follows:

$$X_g = \mathrm{Group}(X) \tag{11}$$

$$X' = \mathrm{Mamba}\left(\mathrm{SiLU}\left(\mathrm{PConv}\left(\mathrm{DConv}(X_g)\right)\right)\right) \tag{12}$$

$$Y = \mathrm{FEM}(X') \tag{13}$$

where Group denotes the spectral grouping operation and the remaining operators are defined as for the DSpaM.
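The spectral grouping step can be sketched as follows. This is a hypothetical implementation: the paper does not specify the grouping scheme, so contiguous groups with zero padding are assumed here:

```python
import numpy as np

def group_bands(x, num_groups):
    """Split an (H, W, C) HSI cube into contiguous spectral groups,
    zero-padding the band axis when C is not divisible by num_groups."""
    H, W, C = x.shape
    pad = (-C) % num_groups
    if pad:
        x = np.concatenate([x, np.zeros((H, W, pad))], axis=-1)
    return np.split(x, num_groups, axis=-1)
```

Each group can then be processed independently by the depthwise separable convolution and Mamba block.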

3.4 Feature fusion module

In HSI classification, the effective integration of spatial and spectral information enables more accurate identification and classification of ground objects, thus improving the overall performance of the model. This motivates us to design a feature fusion module. As depicted in Fig 3(d), we employ skip connections and weighting to fuse spatial and spectral information while mitigating gradient vanishing and overfitting. The weights W are randomly initialized and then learned and updated through backpropagation.
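The fusion can be sketched as a learned convex combination plus a skip connection. This is our illustrative formulation; the exact weighting in Fig 3(d) may differ:

```python
import numpy as np

def fuse_features(f_spa, f_spe, w, x_skip):
    """Weighted fusion of spatial and spectral features with a skip
    connection: out = w * f_spa + (1 - w) * f_spe + x_skip.
    During training, w would be randomly initialized and updated by
    backpropagation."""
    return w * f_spa + (1.0 - w) * f_spe + x_skip
```

The skip term passes the unfused input forward, which is what mitigates gradient vanishing in deeper stacks.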

3.5 ECA classification module

We incorporate the ECA mechanism into the classification head, as shown in Fig 4, to capture inter-channel dependencies prior to classification and enhance feature representation. ECA implements a local cross-channel interaction strategy without dimensionality reduction through 1D convolution, improving performance while maintaining low model complexity. ECA adaptively determines the convolution kernel size k from the channel dimension C, which sets the coverage of local cross-channel interaction. Input features are first processed by a pointwise convolution and the feature enhancement module, followed by global average pooling to obtain aggregated features. These features then undergo 1D convolution to generate channel weights, which are passed through a sigmoid activation function before a final 2D convolution performs the classification.
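For reference, ECA-Net determines the kernel size as the nearest odd value of (log2(C) + b)/γ, with γ = 2 and b = 1 in the original paper. The gating pipeline can be sketched as follows (a sketch only: the averaging kernel below merely stands in for the learned 1D convolution weights):

```python
import math
import numpy as np

def eca_kernel_size(channels, gamma=2, b=1):
    """Nearest odd value of (log2(C) + b) / gamma, per ECA-Net."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1

def eca_weights(feat):
    """Channel gate for a (C, H, W) feature map: global average pooling,
    1D convolution across channels, then a sigmoid."""
    C = feat.shape[0]
    k = eca_kernel_size(C)
    pooled = feat.mean(axis=(1, 2))              # aggregated (C,) descriptor
    kernel = np.full(k, 1.0 / k)                 # stand-in for learned weights
    mixed = np.convolve(np.pad(pooled, k // 2, mode="edge"), kernel, mode="valid")
    return 1.0 / (1.0 + np.exp(-mixed))          # per-channel weight in (0, 1)
```

For C = 103 bands this yields a small odd kernel, so the attention cost is negligible relative to the backbone.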

4 Experimental results and analysis

4.1 Datasets description

In order to comprehensively evaluate the performance of the proposed model, considering the diversity of spectral information, spatial resolution, and scene types, three widely used datasets are selected: the Pavia University, the WHU-Hi-HanChuan (HanChuan) [35] and the Houston.

  1. The Pavia University dataset was collected from the University of Pavia campus in Italy, with an image size of 610 × 340 pixels and a spatial resolution of 1.3 m. The HSI contains 103 bands with a wavelength range of approximately 0.43 µm to 0.86 µm, and covers nine distinct land use categories. The Pavia University dataset is a classic benchmark commonly used for HSI classification and is effective for evaluating the performance disparities between novel models and traditional approaches. The detailed category information is presented in Table 1.
Table 1. Category details of the Pavia University dataset.

https://doi.org/10.1371/journal.pone.0342343.t001

  2. The Hanchuan dataset was acquired in 2016 over the Hanchuan area, Hubei Province, China, by an airborne remote sensing platform, with an image size of 1217 × 303 pixels and a spatial resolution of approximately 0.109 m. It contains 274 bands in the wavelength range of 0.4 µm to 1 µm. The study area combines urban and rural regions and contains 16 diverse land cover types such as buildings, water bodies, cultivated land, and crops; the predominantly agricultural scenes fill a gap in agricultural hyperspectral analysis. In particular, the farmland land cover types have high spectral similarity, making the dataset suitable for testing the feature learning and discrimination capabilities of models. An overview of this dataset is given in Table 2.
  3. The Houston dataset, which covers the University of Houston and surrounding regions in Texas, USA, is the official dataset of the 2013 IEEE Geoscience and Remote Sensing Society (GRSS) Data Fusion Contest and is an authoritative benchmark for evaluating model performance. It features an image size of 349 × 1905 pixels and a spatial resolution of 2.5 m. The hyperspectral modality contains 144 spectral bands covering the range from 0.38 µm to 1.05 µm. The scene encompasses 15 typical urban features with high spectral mixing, rendering it an ideal setting for evaluating the model's generalization ability in complex environments. The detailed category information is shown in Table 3.

4.2 Experimental setup and evaluation metrics

The training and testing environment of this study was established on the PyTorch 2.1.2 framework, accelerated with CUDA 11.8. The Adam optimizer was used to train the model with a learning rate of 0.0003. For each dataset, 30 and 10 samples were randomly selected for the training and validation sets, respectively, and the remainder formed the test set. Five generalized performance evaluation metrics were used: Overall Accuracy (OA), Average Accuracy (AA), the Kappa coefficient, the number of parameters, and floating point operations (FLOPs). To ensure fair comparison, all experiments were carried out under identical conditions, and the results are reported as the mean and standard deviation of ten consecutive runs. Hardware environment: Windows Subsystem for Linux on Windows 11, an Intel Core i7-14700HX CPU, an NVIDIA GeForce RTX 4070 GPU with 8 GB of VRAM, and 16 GB of system RAM.
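The three accuracy metrics are all functions of the confusion matrix; a minimal reference implementation of the standard formulas (not code from the paper) is:

```python
import numpy as np

def oa_aa_kappa(cm):
    """Overall accuracy, average accuracy, and Cohen's kappa from a
    confusion matrix (rows: ground truth, columns: prediction)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    oa = np.trace(cm) / n                                # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))           # mean per-class accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2  # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa
```

Kappa discounts the agreement expected by chance, which is why it penalizes models that favor majority classes more than OA does.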

4.3 Comparison experiments and analysis

To demonstrate the effectiveness of DS-Mamba, we select representative classification methods from three categories for comparison: a traditional method, CNN-based methods and Transformer-based methods. The details are as follows:

  1. SVM [7]: The model adopts a support vector machine (SVM) for HSI classification.
  2. 2D-CNN [20]: The model mainly consists of 2D convolution, max pooling and fully connected layers.
  3. HybridSN [22]: The model uses a mixture of 2D and 3D convolution to extract spatial and spectral information, mainly consisting of 2D convolution, 3D convolution and fully connected layers.
  4. SpectralFormer [26]: The model generates grouped spectral embeddings by learning the spectral sequence information of adjacent bands, mainly consisting of a grouped spectral embedding layer, cross-layer adaptive fusion, transformer layers, and an MLP (Multilayer Perceptron) for classification.
  5. MorphFormer [27]: The model fuses attention and morphological features, mainly consisting of spectral-spatial morphological convolution, an attention module based on Morphological Feature Fusion (MFF), and a linear layer for classification.
  6. GSC-ViT [29]: The model adopts a groupwise separable convolution ViT to capture local and global spatial-spectral information, mainly consisting of groupwise separable convolutional blocks, groupwise separable multi-head self-attention, a global average pooling layer and a softmax classifier.

The classification results and complexity of the various models across the three datasets are presented in Tables 4–6 and Figs 5–7, while the confusion matrices on the test sets are illustrated in Figs 8–10. It can be observed that the proposed DS-Mamba demonstrates superior performance compared to the other methods: as Tables 4–6 show, DS-Mamba attains the highest classification accuracy among all networks.

Table 4. Quantitative Comparison results of the Pavia University dataset.

https://doi.org/10.1371/journal.pone.0342343.t004

Table 5. Quantitative comparison results of the HanChuan dataset.

https://doi.org/10.1371/journal.pone.0342343.t005

Table 6. Quantitative comparison results of the Houston dataset.

https://doi.org/10.1371/journal.pone.0342343.t006

Fig 5. Qualitative visualization of the classification map for the Pavia University dataset.

https://doi.org/10.1371/journal.pone.0342343.g005

Fig 6. Qualitative visualization of the classification map for the HanChuan dataset.

https://doi.org/10.1371/journal.pone.0342343.g006

Fig 7. Qualitative visualization of the classification map for the Houston dataset.

https://doi.org/10.1371/journal.pone.0342343.g007

On the Pavia University dataset, compared with GSC-ViT, the OA, AA and Kappa of DS-Mamba increased by 3.89%, 4.54% and 6.87%, respectively. On the Hanchuan dataset, the classification results of the three Transformer-based models were comparable, and DS-Mamba demonstrated superior accuracy, with OA higher by approximately 8%, AA by around 17%, and Kappa by about 9%. On the Houston dataset, compared with the second-place GSC-ViT, DS-Mamba improved OA, AA and Kappa by approximately 3%. Across ten repeated experiments, the fluctuations in OA, AA, and Kappa for DS-Mamba on the Hanchuan dataset were only 0.91%, 0.7%, and 1.62%, respectively, notably 2% to 3% lower than those of the other models, indicating that our model is more stable. As seen in Figs 5–7, the proposed DS-Mamba aligns more closely with the ground truth map, whereas the other networks suffer from more classification errors, higher noise, and unclear boundaries.

Meanwhile, DS-Mamba has fewer parameters and exhibits significantly lower computational intensity than most of the other models. Although the parameter count of DS-Mamba is slightly higher than that of GSC-ViT across all three datasets, its FLOPs are considerably reduced. On the Pavia University dataset, DS-Mamba shows a modest increase in parameter count of only 59K compared to GSC-ViT, while achieving a significant reduction in FLOPs of 2210G. On the Hanchuan dataset, DS-Mamba not only decreased the parameter count by 98K but also reduced FLOPs by an impressive 9269G when compared to MorphFormer. Compared with CNN-based models, both the parameter count and computational cost are markedly diminished: on the Houston dataset, relative to HybridSN, DS-Mamba achieved a substantial reduction in parameters of 4984M and a decrease in FLOPs of 5536G. This efficiency is attributed to the linear computational complexity inherent in Mamba. The comparison of the complexity and OA of each model is presented in Fig 11, where DS-Mamba demonstrates superior classification accuracy while maintaining lower model complexity, further demonstrating the efficacy of the proposed method.

Fig 11. Comparison of complexity and OA of the Pavia University Dataset.

https://doi.org/10.1371/journal.pone.0342343.g011

4.4 Ablation experiments and analysis

To evaluate the effectiveness of the key components of the model, ablation experiments were conducted on all three datasets. The parameter settings and training strategies were kept consistent across all models, with results reported as the mean and standard deviation of ten consecutive experiments. Four comparison models were selected: DS-Mamba-0 (using only the DSpaM in the feature extraction module), DS-Mamba-1 (using only the DSpeM in the feature extraction module), DS-Mamba-2 (using a plain classification head instead of the one incorporating ECA), and DS-Mamba-3 (removing the feature enhancement module). The experimental results are shown in Tables 7–9. Compared with DS-Mamba, the accuracy of both DS-Mamba-0 and DS-Mamba-1 decreased to varying degrees; on the Hanchuan dataset, OA, AA, and Kappa all declined by approximately 4%. This indicates that both spatial and spectral information are of significant importance. On the Pavia University and Houston datasets, the classification accuracy of DS-Mamba-0 was 1% to 2% higher than that of DS-Mamba-1, indicating that spatial features are more distinguishable than spectral features. Compared with DS-Mamba-2, DS-Mamba increased the parameters and FLOPs by only 6 and 1.4M, respectively, on the Pavia University dataset, yet OA, AA, and Kappa improved by about 1.5%, 1%, and 3%, respectively. This indicates that the ECA classification module brings significant performance gains while introducing very few additional parameters and negligible computation. Compared with DS-Mamba, the OA, AA, and Kappa of DS-Mamba-3 decreased by 3% to 8%, demonstrating the efficacy of the feature enhancement module. This is because group normalization enhances the model's generalization capability by minimizing dependence on batch statistics, while the SiLU activation function helps mitigate the vanishing gradient problem. Consequently, the feature enhancement module contributes to improved training stability, augments the model's expressive power, and ultimately leads to more accurate classification outcomes.

Table 7. Results of the ablation experiments on the Pavia University dataset.

https://doi.org/10.1371/journal.pone.0342343.t007

Table 8. Results of the ablation experiments on the Hanchuan dataset.

https://doi.org/10.1371/journal.pone.0342343.t008

Table 9. Results of the ablation experiments on the Houston dataset.

https://doi.org/10.1371/journal.pone.0342343.t009

4.5 Discussion

Through experiments, we found that the Transformer-based model demonstrates superior classification performance compared to the CNN-based model. CNNs require stacking multiple convolutional kernels to expand their receptive fields, which limits their ability to effectively capture global context and necessitates a fixed input size. In contrast, the self-attention mechanism inherent in Transformers endows them with robust long-range modeling capabilities and allows for flexible adjustment of input sizes through positional encoding. However, the self-attention computation in Transformers relies on weight matrix multiplication, introducing quadratic computational complexity that limits its efficiency in hyperspectral image classification.

The proposed DS-Mamba is an advancement based on Mamba, characterized by linear computational complexity and fewer parameters and computational costs compared to Transformers. DS-Mamba not only efficiently captures long-range dependencies but also dynamically adjusts weights through a selection mechanism, enabling it to extract more detailed feature information. The model utilizes a dual-branch architecture to model spatial and spectral information independently and in depth, thereby fully accommodating the two-dimensional spatial characteristics and one-dimensional spectral sequence properties inherent in HSI. This design ensures a complementary relationship between spatial and spectral features. The depth spatial Mamba block effectively addresses local spatial discontinuities that arise from one-dimensional scanning, while the depth spectral Mamba block is dedicated to capturing fine-grained discriminative features within the spectral dimension. Consequently, the model can simultaneously consider both local details and global dependencies. Depthwise separable convolutions are employed to model spatial features for each channel without significantly increasing the number of parameters; cross-channel information is subsequently fused through pointwise convolutions. This approach enhances feature extraction efficiency while preserving rich discriminative information. The feature fusion module dynamically adjusts feature weights during the integration of spatial and spectral data, thereby enhancing the model's focus on critical information. Incorporating an ECA mechanism into the classification head emphasizes discriminative features across key channels through effective local cross-channel interactions. Compared to traditional attention mechanisms, ECA improves fine-grained classification accuracy without imposing significant computational overhead.
Comparative experiments conducted on three representative datasets demonstrate that DS-Mamba consistently outperforms CNN- and Transformer-based models across all evaluated metrics: OA, AA, and Kappa coefficient. Notably, it achieves these results with substantially fewer parameters and reduced computational resources compared to Transformer-based models. Ablation studies further validate the effectiveness of the proposed modules.

However, this study has certain limitations. It does not account for hierarchical feature extraction, which may hinder the full utilization of contextual information across different levels, particularly when category differences are subtle or feature scales vary significantly; this can result in suboptimal classification performance in complex scenarios. The method's effectiveness may also be compromised on datasets with highly imbalanced class sample counts, and its robustness in small-sample or long-tail settings still requires further improvement. Moreover, validation was conducted primarily on commonly used public datasets, whereas practical applications may pose more intricate challenges such as varying lighting conditions, noise interference, and sensor discrepancies. Future research could investigate multi-scale feature extraction and fusion, data augmentation techniques, or transfer learning strategies to bolster the model's robustness and generalization capabilities.

5 Conclusion

We propose DS-Mamba, a model for HSI classification that consists mainly of a Deep Spatial Mamba (DSpaM) block, a Deep Spectral Mamba (DSpeM) block, a feature fusion module, and an ECA classification module. By integrating depthwise separable convolution with Mamba blocks, the model captures long-range dependencies with linear complexity while reducing both the number of parameters and the computational requirements, and the incorporation of ECA attention further enhances its performance. Experimental results on three commonly used datasets demonstrate the effectiveness of the proposed model, which achieves high classification accuracy at reduced computational complexity. Future work will explore further applications of Mamba to hyperspectral image classification and develop lightweight networks to further improve classification accuracy.

References

1. Gong J, Li F, Wang J, Yang Z, Ding X. A Split-Frequency Filter Network for Hyperspectral Image Classification. Remote Sens. 2023;15(15):3900.
2. Hong D, He W, Yokoya N, Yao J, Gao L, Zhang L, et al. Interpretable Hyperspectral Artificial Intelligence: When nonconvex modeling meets hyperspectral remote sensing. IEEE Geosci Remote Sens Mag. 2021;9(2):52–87.
3. Shirmard H, Farahbakhsh E, Müller RD, Chandra R. A review of machine learning in processing remote sensing data for mineral exploration. Remote Sens Environ. 2022;268:1–21.
4. Wei L, Ran H, Yin Y, Yang H. Multi-Scale Depthwise Separable Capsule Network for hyperspectral image classification. PLoS One. 2024;19(8):e0308789. pmid:39197053
5. Zhang X, Sun Y, Shang K, Zhang L, Wang S. Crop Classification Based on Feature Band Set Construction and Object-Oriented Approach Using Hyperspectral Images. IEEE J Sel Top Appl Earth Observations Remote Sens. 2016;9(9):4117–28.
6. Li S, Song W, Fang L, Chen Y, Ghamisi P, Benediktsson JA. Deep Learning for Hyperspectral Image Classification: An Overview. IEEE Trans Geosci Remote Sens. 2019;57(9):6690–709.
7. Melgani F, Bruzzone L. Classification of hyperspectral remote sensing images with support vector machines. IEEE Trans Geosci Remote Sens. 2004;42(8):1778–90.
8. Kang X, Xiang X, Li S, Benediktsson JA. PCA-Based Edge-Preserving Features for Hyperspectral Image Classification. IEEE Trans Geosci Remote Sens. 2017;55(12):7140–51.
9. Camps-Valls G, Tuia D, Bruzzone L, Benediktsson JA. Advances in Hyperspectral Image Classification: Earth Monitoring with Statistical Learning Methods. IEEE Signal Process Mag. 2014;31(1):45–54.
10. Fauvel M, Benediktsson JA, Chanussot J, Sveinsson JR. Spectral and Spatial Classification of Hyperspectral Data Using SVMs and Morphological Profiles. IEEE Trans Geosci Remote Sens. 2008;46(11):3804–14.
11. Dalla Mura M, Atli Benediktsson J, Waske B, Bruzzone L. Extended profiles with morphological attribute filters for the analysis of hyperspectral data. Int J Remote Sens. 2010;31(22):5975–91.
12. Duan Y, Huang H, Wang T. Semisupervised Feature Extraction of Hyperspectral Image Using Nonlinear Geodesic Sparse Hypergraphs. IEEE Trans Geosci Remote Sens. 2022;60:1–15.
13. Hu F, Xia G-S, Hu J, Zhang L. Transferring Deep Convolutional Neural Networks for the Scene Classification of High-Resolution Remote Sensing Imagery. Remote Sens. 2015;7(11):14680–707.
14. Scott GJ, England MR, Starms WA, Marcum RA, Davis CH. Training deep convolutional neural networks for land-cover classification of high-resolution imagery. IEEE Geosci Remote Sens Lett. 2017;14(4):549–53.
15. Paoletti ME, Haut JM, Plaza J, Plaza A. Deep learning classifiers for hyperspectral imaging: A review. ISPRS-J Photogramm Remote Sens. 2019;158:279–317.
16. Yang X, Ye Y, Li X, Lau RYK, Zhang X, Huang X. Hyperspectral Image Classification With Deep Learning Models. IEEE Trans Geosci Remote Sens. 2018;56(9):5408–23.
17. Mou L, Ghamisi P, Zhu XX. Deep Recurrent Neural Networks for Hyperspectral Image Classification. IEEE Trans Geosci Remote Sens. 2017;55(7):3639–55.
18. Wang D, Du B, Zhang L. Spectral-Spatial Global Graph Reasoning for Hyperspectral Image Classification. IEEE Trans Neural Netw Learn Syst. 2024;35(9):12924–37. pmid:37134039
19. Yang X, Cao W, Lu Y, Zhou Y. Hyperspectral Image Transformer Classification Networks. IEEE Trans Geosci Remote Sens. 2022;60:1–15.
20. Lee H, Kwon H. Contextual deep CNN based hyperspectral classification. In: 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). 2016. p. 3322–5.
21. Zhong Z, Li J, Luo Z, Chapman M. Spectral–Spatial Residual Network for Hyperspectral Image Classification: A 3-D Deep Learning Framework. IEEE Trans Geosci Remote Sens. 2018;56(2):847–58.
22. Roy SK, Krishna G, Dubey SR, Chaudhuri BB. HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification. IEEE Geosci Remote Sens Lett. 2019;17(2):277–81.
23. Li K, Ma Z, Xu L, Chen Y, Ma Y, Wu W, et al. Depthwise Separable ResNet in the MAP Framework for Hyperspectral Image Classification. IEEE Geosci Remote Sens Lett. 2022;19:1–5.
24. He W, Huang W, Liao S, Xu Z, Yan J. CSiT: A Multiscale Vision Transformer for Hyperspectral Image Classification. IEEE J Sel Top Appl Earth Observations Remote Sens. 2022;15:9266–77.
25. Sun L, Zhao G, Zheng Y, Wu Z. Spectral–Spatial Feature Tokenization Transformer for Hyperspectral Image Classification. IEEE Trans Geosci Remote Sens. 2022;60:1–14.
26. Hong D, Gao L, Chanussot J. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Trans Geosci Remote Sens. 2022;60.
27. Roy SK, Deria A, Shah C, Haut JM, Du Q, Plaza A. Spectral–spatial morphological attention transformer for hyperspectral image classification. IEEE Trans Geosci Remote Sens. 2023;61:1–15.
28. Xu H, Zeng Z, Yao W, Lu J. CS2DT: Cross Spatial–Spectral Dense Transformer for Hyperspectral Image Classification. IEEE Geosci Remote Sens Lett. 2023;20:1–5.
29. Zhao Z, Xu X, Li S, Plaza A. Hyperspectral image classification using groupwise separable convolutional vision transformer network. IEEE Trans Geosci Remote Sens. 2024;62:Art. no. 5511817.
30. Gu A, Dao T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint. 2023;arXiv:2312.00752.
31. Zhu L, Liao B, Zhang Q, Wang X, Liu W, Wang X. Vision Mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint. 2024;arXiv:2401.09417.
32. Liu Y, Tian Y, Zhao Y, Yu H, Xie L, Wang Y, et al. VMamba: Visual state space model. arXiv preprint. 2024;arXiv:2401.10166.
33. Wang Q, Wu B, Zhu P, Li P, Zuo W, Hu Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020. p. 11534–42.
34. Gu A, Goel K, Re C. Efficiently modeling long sequences with structured state spaces. In: International Conference on Learning Representations. 2021.
35. Zhong Y, Hu X, Luo C, Wang X, Zhao J, Zhang L. WHU-Hi: UAV-borne hyperspectral with high spatial resolution (H2) benchmark datasets and classifier for precise crop identification based on deep convolutional neural network with CRF. Remote Sens Environ. 2020;250:112012.