
Leveraging potential of limpid attention transformer with dynamic tokenization for hyperspectral image classification

  • Dhirendra Prasad Yadav,

    Roles Conceptualization, Data curation, Methodology, Writing – original draft

    Affiliations Department of Computer Engineering & Applications, G.L.A. University, Mathura, Uttar Pradesh, India, Department of Computer Engineering, NIT Meghalaya, Shillong, Meghalaya, India

  • Deepak Kumar,

    Roles Investigation, Project administration, Writing – original draft

    Affiliation Department of Computer Engineering, NIT Meghalaya, Shillong, Meghalaya, India

  • Anand Singh Jalal,

    Roles Investigation, Project administration, Supervision, Writing – review & editing

    Affiliation School of Computer Science & Information Technology, Devi Ahilya Vishwavidyalaya (DAVV), Indore, Madhya Pradesh, India

  • Bhisham Sharma ,

    Roles Conceptualization, Formal analysis, Investigation, Supervision, Writing – review & editing

    Panos.liatsis@ku.ac.ae (PL); bhisham.pec@gmail.com (BS)

    Affiliation Centre for Research Impact & Outcome, Chitkara University Institute of Engineering and Technology, Chitkara University, Rajpura, Punjab, India

  • Panos Liatsis

    Roles Project administration, Resources, Software, Validation, Writing – review & editing


    Affiliation Center for Cyberphysical Systems, Department of Computer Science, Khalifa University, United Arab Emirates

Abstract

Hyperspectral data consist of continuous narrow spectral bands and consequently carry high spectral but limited spatial information. Convolutional neural networks (CNNs) have emerged as highly contextual models for remote sensing applications. Unfortunately, the underlying architecture of CNNs constrains the global correlation of spatial and spectral features, making them less reliable for mining and representing the sequential properties of spectral signatures. In this article, the limpid size attention network (LSANet) is proposed, which contains 3D and 2D convolution blocks to enhance the spatial-spectral features of the hyperspectral image (HSI). In addition, a limpid attention block (LAB) is designed to provide a global correlation of the spectral and spatial features through LS-attention. Furthermore, the computational cost of LS-attention is lower than that of the multi-head self-attention (MHSA) in the classical vision transformer (ViT). In the ViT encoder, a conditional position encoding (CPE) module is utilized that dynamically generates tokens from the feature maps to capture a richer contextual representation. The LSANet obtained overall accuracies (OA) of 98.78%, 98.67%, 97.52% and 89.45% on the Indian Pines (IP), Pavia University (PU), Salinas Valley (SV) and Botswana datasets, respectively. Our model's quantitative and qualitative results are considerably better than those of classical CNN and transformer-based methods.

1. Introduction

Hyperspectral images are acquired through spectrometer sensors that capture several narrow overlapping spectral bands [1]. In an HSI, each pixel is represented by a vector equal to the number of spectral bands. Since every vector component is measured by matching to a specific wavelength, the pixels have enormously detailed spectral signatures. The contiguous acquisition allows the radiance spectrum to be precisely estimated at each pixel in the image [2]. The extensive spectrum information improves surface feature and object discrimination over standard imaging methods [3]. However, these bands have close relationships due to the short spectral distance and contain redundant information. Since hyperspectral cameras are not built for particular applications, certain beneficial bands may not be helpful in others. As a result, collecting application-specific information is critical for maximizing the benefits of hyperspectral images [4].

The classification of HSIs is a non-linear problem [5], and initial attempts with linear transformation-based statistical techniques such as discriminant analysis [6,7], principal component analysis, and wavelet transforms [8] did not yield satisfactory results for HS data. In contrast, composite [9], probabilistic [10], and generalized kernel [11] methods demonstrated the potential to produce promising results. Nevertheless, these methods focus only on spatial features for HS data classification. Feature extraction strategies aided by machine learning algorithms reduce time, cost, and space complexity, but their classification performance is not optimal. Following the success of these classical methodologies for HSI categorization, researchers applied emerging computer vision models, which made the classification procedure easier and more accurate.

Over the last decade, advances in artificial intelligence (AI) have made it one of the most rapidly evolving areas of automated technology. Machine learning (ML) is a sophisticated technology that mimics the cognition of the human brain. Through abstraction, it describes a complex system simply. As a result, it can reduce complexity and delve into large amounts of HSI data to uncover promising spatial and spectral features [12]. Recently, deep learning methods have provided promising results for HSI classification. However, local kernel features extracted by CNNs lack the global correlation of the spectral and spatial features. The ViT improves the global correlation of the features through the attention mechanism. However, the classical ViT fails to perform well on HSI data, and its attention mechanism is computationally expensive.

To overcome these challenges, LSANet is developed, which improves the accuracy of HS data classification. The spectral and spatial features are extracted through lightweight 3D-CNN and 2D-CNN blocks, and attention is provided through a transformer. CPE generates dynamic positional encoding of the feature map in the transformer block. The feature map is divided into rows and columns for parallel LS-attention. In LS-attention, tokens interact directly within the regions and capture more comprehensive contextual information from the HS data. Further, model performance is evaluated on four datasets and achieves better quantitative and visual results.

The significant contributions of the paper are as follows:

  1. A lightweight 3D-CNN and 2D-CNN are designed for the spectral and spatial features. The 3D-CNN captures spectral features, the 2D-CNN explores spatial features, and a global correlation of spectral and spatial features is provided using a transformer.
  2. The LSANet contains a CPE module that generates dynamic positional encoding using a positional encoding generator (PEG), which is translation-equivariant and captures complex positional relationships. In addition, zero padding is added in the CPE to retain knowledge of the position and boundary region.
  3. In the LSA, the feature map is divided into rows and columns for parallel attention calculation within each region through the tokens, reducing computation costs and producing more comprehensive contextual information.
  4. The proposed model is evaluated on four standard datasets and obtains better classification performance than CNN and transformer-based methods.

The rest of the paper is organized as follows.

In Section 2, a detailed description of the classical, CNN and transformer-based methods is given. Section 3 provides a detailed description of the proposed method. Section 4 includes a detailed overview of the datasets, experimental results, and ablation study. Finally, Section 5 comprehensively concludes the proposed method.

2. Related work

Several methods based on machine learning, CNNs and ViTs have been developed to classify the land covers available in hyperspectral data. Chen et al. [13] utilized the classical PCA method for dimensionality reduction. After that, a local binary pattern (LBP) is applied to extract texture features. Furthermore, the grey wolf optimization technique is used to improve the features. Finally, a kernel extreme learning machine (KELM) classifies the objects of the hyperspectral image. Camps-Valls and Bruzzone [14] applied kernel-based methods to assess the performance of support vector machines (SVMs), regularized radial basis function neural networks (Reg-RBFNN), regularized AdaBoost (Reg-AB), and kernel Fisher discriminant (KFD) analysis. They compared Reg-AB and Reg-RBFNN for HSI classification and achieved high accuracy in noisy environments. The edge-preserving filtering method of Kang et al. [15] improved the spectral-spatial features. Their method classifies the HSI with a pixel-wise classifier, and the result is presented through multiple probability maps; the class of each pixel is then selected based on maximum probability. Ratle et al. [16] proposed a Laplacian support vector machine (LapSVM) method for HS data classification. The semi-supervised LapSVM results are compared with those of a supervised SVM.

Deep learning (DL) methods have recently been developed to classify HSI. Sun et al. [17] proposed a fully convolutional segmentation network (FCSN) to identify the land cover labels of all pixels in an HSI cube. First, they demonstrated the weak generalization capabilities of CNN-based methods. Their method then labels all pixels in the HSI cube to obtain detailed spatial land-cover distributions, and uses these pixel labels to improve the diversity of spatial feature distributions in the HSI, achieving an average accuracy (AA) of 88.31% on the IP dataset. Wang et al. [18] proposed a unified multiscale learning (UML) model for the classification of land covers. They proposed two mechanisms, spatial channel attention and a multiscale shuffle block, to enhance spatial and spectral features in the land covers. Bai et al. [19] argued that the rich spectral information in HSI produces similar spectral curve trends, which makes land cover classification challenging. To resolve the issue, they proposed a spectral curve-based method to enhance the spectral features and applied a dual attention mechanism to enhance the spatial features. Liu et al. [20] analyzed the spectral and spatial features of the HSI for pixel-level classification of the land covers. They extracted spectral and spatial features from the central pixel through scaled dot-product central attention (SDPCA). Furthermore, a central attention network (CAN) module is designed to classify the land covers in three datasets. Paoletti et al. [21] used a ghost-module architecture with CNNs to reduce the computational cost and achieve efficient classification performance.

Hong et al. [22] applied CNNs and graph convolutional networks (GCNs) for HSI classification. The GCN works on non-grid data representations. Furthermore, they designed a new mini-batch GCN (miniGCN) for training large-scale GCNs, and subsequently compared the different HSI features extracted using CNNs and GCNs. Hang et al. [23] proposed an attention-aided CNN model for HSI classification. The attention mechanism focuses on the more discriminative channels, and a spectral attention subnetwork improves the land cover classification. Using spectral-spatial attention (SPA), their method obtained an accuracy of 89.76% on the Houston 2013 dataset. Cao et al. [24] applied a deep learning approach for HSI classification using a unified framework. They trained a CNN using labelled pixels and selected further pixels from the labelling pool. Finally, fine-tuned labelled pixels with a new training set are passed to a Markov random field to enforce class label smoothness and enhance the classification performance. Hou et al. [25] used contrastive learning for hyperspectral image classification. Their method exploits the information of abundant unlabelled samples to compensate for insufficient label information in hyperspectral data. They designed a two-stage model that enables positive and negative sample judgement. After that, a small number of samples is used to extract and fine-tune the features of the hyperspectral image.

Sun et al. [26] proposed a multi-structure KELM with an attention fusion strategy (MSAF-KELM) for the accurate fusion of multiple classifiers, which effectively classified the land covers. Furthermore, they applied a weighted self-attention fusion strategy (WSAFS), which merges the KELM sub-branch outputs with a self-attention mechanism to achieve efficient fusion results. Their method obtained an accuracy of 95.64% on the SV dataset using spectral-spatial attention MSAF-KELM. Zheng et al. [27] suggested that cropping the HSI data may cause spatial information loss in the input image. They proposed two modules based on a rotation-invariant attention network (RIAN) for HS data classification. A central spectral attention (CSpeA) module avoids the effects of other land cover categories and suppresses extreme spectral bands. Furthermore, a 1×1 convolution-based rectified spatial attention (RSpaA) module is utilized to avoid the rotation-invariance problem and extract spatial and spectral features for land cover classification. Zhang et al. [28] developed a single-source domain expansion network (SDEnet) to ensure the reliability and effectiveness of domain extension. They use generative adversarial learning to train on the source domain (SD) and test on the target domain (TD); a semantic encoder and a morph encoder generate the extended domain (ED). The overall accuracy of SDEnet on Houston 2018 data is 79.96%. Recently, several ViT-based methods have been utilized to classify HS data. In this regard, Ahmad et al. [29] claim that average pooling in ViT may result in information loss. To solve the issue, a wavelet-based attention mechanism was utilized to design WaveFormer. The WaveFormer enhanced the interaction of the tokens between different patches, shapes and channel maps, resulting in better classification performance. Sun et al. [30] proposed a spectral-spatial attention network (SSAN) for HSI classification using spectral and spatial features. First, a simple spectral-spatial network (SSN) extracts spatial-spectral features. After that, spectral and spatial modules consisting of 3D convolutions and activation functions are applied to reduce the influence of identical neighbouring pixels. In other research, Sun et al. [31] introduced a spectral-spatial feature tokenization transformer (SSFTT) for spectral-spatial and high-dimensional semantic features. First, they utilized 3D and 2D convolutional layers to extract spectral and spatial features. These feed a transformer encoder in which Gaussian weighted tokens are generated for feature transformation and representation. After that, the learnable token of the sample label is classified using a softmax layer. Haut et al. [32] introduced visual attention-driven techniques for HSI classification using residual networks (ResNets) for feature extraction. This method evaluates a mask applied to the obtained features and identifies the different land covers in HS data.

Problem statement

Hyperspectral images contain rich spatial and spectral features that are utilized in several applications, including precision agriculture, cancer diagnosis, and surveillance. For land cover classification, pixels are labelled based on their spatial and spectral characteristics. In addition, HSI poses a high-dimensional and complex spectral-spatial correlation that must be carefully exploited for classification. Traditional CNNs excel at exploring spectral and spatial features but lack global contextual information, often leading to misclassification of objects at boundaries and edge regions. At the same time, transformer-based methods provide long-range dependency to the feature map but require a high volume of training data. In addition, the computational cost of the classical ViT attention mechanism is quadratic in the number of tokens. This study designed LSANet with three major components: a 3D-CNN, a 2D-CNN, and a limpid shape attention-based ViT encoder. The 3D-CNN is designed using 64 convolution filters with a kernel size of 3×3, followed by batch normalization (BN) and the ReLU activation function. A moderate number of filters, such as 64, ensures diverse features from different filters, while a small kernel size of 3×3 captures fine-grained spatial features from the edge and boundary regions. Moreover, batch normalization stabilizes the training, and ReLU activation allows LSANet to learn the complex patterns of the HSI. The features extracted from the 3D-CNN are reshaped and passed to the 2D-CNN module for spatial feature refinement. The 2D-CNN block contains two convolution layers, the first with 128 convolution filters and a 3×3 kernel, followed by BN, ReLU activation and zero padding. The second convolution layer has 256 filters, a 3×3 kernel and ReLU activation. The features obtained from the 2D-CNN block are flattened, and tokens are generated for the ViT encoder.
The limpid shape attention block captures rich contextual information by splitting the feature map into multiple limpid shape blocks, each with equal numbers of rows and columns, on which self-attention is computed. In addition, LSANet is GPU-friendly, and parallel computation of the attention scores can be performed, which results in less training time.
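The cost argument above can be made concrete with a back-of-envelope count of token interactions for global MHSA versus a row/column scheme. This is only an illustration: the 16 × 16 token grid is an assumed example size, not a value from the paper.

```python
# Illustrative token-interaction counts for an H x W token grid
# (assumed 16 x 16 grid, chosen only for illustration).
H, W = 16, 16

# Classical MHSA: every one of the H*W tokens attends to every other token.
global_pairs = (H * W) ** 2

# Row-wise plus column-wise regional attention: each token attends only
# within its row (W tokens) and, in the parallel path, within its column (H tokens).
ls_pairs = H * W ** 2 + W * H ** 2

print(global_pairs, ls_pairs)  # the regional scheme needs far fewer interactions
```

For this grid the regional scheme requires 8× fewer pairwise interactions, which grows with the grid size since the global count is quadratic in the token count.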

3. Proposed method

In the proposed study, the limpid size attention network (LSANet) is designed as shown in Fig 1. It has 3D and 2D convolution blocks to leverage the spectral and spatial features from the HSI. In addition, a limpid attention block (LAB) is designed to provide a global correlation of the spectral and spatial features through LS-attention. Moreover, a conditional position encoding (CPE) module is incorporated that dynamically generates tokens from the feature maps to capture a richer contextual representation.

3.1. The convolutional module

Let the HSI image be $X \in \mathbb{R}^{P \times Q \times D}$, which has height P, width Q and spectral bands D. Principal component analysis (PCA) is applied to reduce the dimension of the HSI to $X' \in \mathbb{R}^{P \times Q \times B}$, where $B < D$ is the number of retained spectral components. After that, patches of dimension $S \times S \times B$ are extracted. Each patch is formed around a pixel, which serves as the patch's focal point. Furthermore, a padding procedure is applied to generate a spatial foundation for these pixels. The HSI contains rich spatial and reduced spectral features, and for the efficient segregation of land cover it is necessary to extract both spectral and spatial features. The 3D-CNN is capable of extracting spectral features; at the same time, a higher number of 3D-CNN layers increases computational costs. In the proposed study, we therefore utilized a single 3D-CNN layer for spectral and spatial features. The 3D-CNN is designed using 64 convolution filters with a kernel size of 3×3, followed by BN and the ReLU activation function. A moderate number of filters, such as 64, ensures diverse feature extraction from different filters, while the small kernel size of 3×3 captures fine-grained spatial features from the edge and boundary regions. Moreover, batch normalization stabilizes training, and ReLU activation allows LSANet to learn the complex patterns of the HSI. The features extracted from the 3D-CNN are reshaped and passed to the 2D-CNN module for spatial feature refinement. The 2D-CNN block contains two convolution layers, the first with 128 convolution filters and a 3×3 kernel, followed by BN, ReLU activation and zero padding. Zero padding preserves the spatial resolution needed for pixel-wise HSI classification; it does not increase the computational burden, since only zeros are added around the feature map, and it ensures that the output dimension matches the input. Reflection padding, by contrast, creates mirrors of the edge and boundary regions; in HSI the pixel patterns of land covers differ, and reflection padding may lead to overlapping of different class pixels. Learned padding embeddings, in turn, introduce additional trainable parameters, which increase training time and computational burden. Zero padding does not carry meaningful information about the edge and boundary regions, which may reduce classification performance; the ViT encoder is utilized to mitigate this problem and provide a global correlation across the overlapping patches. The second convolution layer has 256 filters, a 3×3 kernel and ReLU activation. The spatial features extracted in the convolutional layers are calculated as follows.

$v_{ij}^{uv} = f\!\left(b_{ij} + \sum_{m}\sum_{l=0}^{I_i-1}\sum_{r=0}^{J_i-1} w_{ijm}^{lr}\, v_{(i-1)m}^{(u+l)(v+r)}\right) \qquad (1)$

where $i$ denotes the layer under consideration, $j$ indexes the feature maps in layer $i$, $v_{ij}^{uv}$ is the output feature at position $(u, v)$ of the $j$th feature map in layer $i$, and $b_{ij}$ is the network bias. The activation function for each layer is denoted by $f(\cdot)$. The index $m$ runs over the collection of feature maps from layer $(i-1)$ that are the inputs to layer $i$. $w_{ijm}^{lr}$ is the weight at position $(l, r)$ of the convolution kernel relating the $m$th feature map of layer $(i-1)$ to the $j$th feature map of the $i$th layer; $J_i$ and $I_i$ are the kernel's column and row sizes. The feature extraction specification for the 3D-CNN model is very similar to that of the 2D-CNN model, as shown in Eq. (2).

The connection of the spectral dimension is preserved by organizing the related spectral bands in ascending order. The patch extraction and label identification process in the 3D and 2D convolutions are very similar. The 3D CNN block feature extraction procedure is defined as follows.

$v_{ij}^{uvz} = f\!\left(b_{ij} + \sum_{m}\sum_{l=0}^{I_i-1}\sum_{r=0}^{J_i-1}\sum_{k=0}^{K_i-1} w_{ijm}^{lrk}\, v_{(i-1)m}^{(u+l)(v+r)(z+k)}\right) \qquad (2)$

where $K$ is the total number of kernels in layer $i$ and $K_i$ is the size of the 3D kernel in the spectral dimension. Parameter $w_{ijm}^{lrk}$ is the weight at location $(l, r, k)$ whose convolution kernel corresponds to the $j$th feature map in the $i$th layer.
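To illustrate the per-position sum in Eq. (1), a direct (naive) NumPy translation is sketched below. The function and variable names are our own, and a real implementation would use an optimized convolution routine rather than explicit loops.

```python
import numpy as np

def conv2d_feature_map(inputs, weights, bias,
                       activation=lambda x: np.maximum(x, 0.0)):
    """Naive 2D convolution following Eq. (1).

    inputs  : (M, H, W) feature maps from layer i-1
    weights : (M, I, J) one convolution kernel per input map
    bias    : scalar bias b_ij
    Returns one output feature map of shape (H - I + 1, W - J + 1).
    """
    M, H, W = inputs.shape
    _, I, J = weights.shape
    out = np.zeros((H - I + 1, W - J + 1))
    for u in range(out.shape[0]):
        for v in range(out.shape[1]):
            # sum over input maps m and kernel offsets (l, r), as in Eq. (1)
            out[u, v] = np.sum(weights * inputs[:, u:u + I, v:v + J]) + bias
    return activation(out)  # f(.) — here ReLU, as used in LSANet
```

The 3D case of Eq. (2) adds one more loop over the spectral offset $k$ but is otherwise identical.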

3.2. The ViT module

The features obtained from the 2D-CNN block are flattened and reshaped to $X \in \mathbb{R}^{H \times W \times C}$. After that, the feature map is divided into several limpid blocks $\{L_1, \ldots, L_N\}$ of the same size. A padding technique is used to ensure that the feature map divides evenly into the $N$ limpid blocks. In addition, the distances between neighbouring rows and columns are identical for all limpid blocks. Furthermore, self-attention is carried out within each limpid block separately. Fig 2 shows that, compared to all preceding local self-attention processes, the LS-attention's receptive field is substantially larger and richer, providing a more effective ability for context modelling.

Fig 2. Illustration of standard GSA (2(a)), axial self-attention (2(b)), cross shaped attention (2(c)) and the proposed LS-attention (2(d)).

In 2(b), 2(c) and 2(d), the shaded area represents the input features split into different groups on which SA is conducted, and the yellow dot can directly interact with the tokens covered in the shaded region. Here, $H$, $W$ and $C$ represent the height, width and channel of the image, respectively.

https://doi.org/10.1371/journal.pone.0328160.g002

The LS-attention is divided into two parallel row and column paths to reduce the computational costs. Furthermore, the row- and column-wise self-attention mechanism improves token interaction within the groups. The feature map $X$ is split into two halves, $X_r$ and $X_c$, along the channel dimension, defined as follows.

$X_r, X_c = \mathrm{Split}(X), \quad X_r, X_c \in \mathbb{R}^{H \times W \times C/2} \qquad (3)$

where $X_r$ contains $u_r$ interlaced rows and $X_c$ contains $u_c$ interlaced columns, respectively. SA is then applied to each group of tokens organized by row and column, respectively. Furthermore, three convolution layers, $W^{Q}$, $W^{K}$ and $W^{V}$, are used to generate the query, key, and value as follows.

$Q = W^{Q} * X, \quad K = W^{K} * X, \quad V = W^{V} * X \qquad (4)$

The final output $X_{\mathrm{out}}$ is generated by concatenating the row-wise attention $A_r$ and the column-wise attention $A_c$ along the channel dimension as follows.

$X_{\mathrm{out}} = \mathrm{Concat}(A_r, A_c) \qquad (5)$

where $A_r = \mathrm{SA}(Q_r, K_r, V_r)$ and $A_c = \mathrm{SA}(Q_c, K_c, V_c)$.
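The split-attend-concatenate flow of Eqs. (3)-(5) can be sketched in NumPy as follows. This is a simplified sketch: identity projections stand in for the learned $W^{Q}$, $W^{K}$, $W^{V}$ convolutions, and the interlacing of rows and columns is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_attention(q, k, v):
    d = q.shape[-1]
    return softmax(q @ k.swapaxes(-1, -2) / np.sqrt(d)) @ v

def ls_attention(x):
    """x: (H, W, C) feature map. Split channels into two halves (Eq. 3),
    attend row-wise on one half and column-wise on the other, then
    concatenate along the channel dimension (Eq. 5)."""
    H, W, C = x.shape
    xr, xc = x[..., : C // 2], x[..., C // 2:]
    # row path: each row of xr is a sequence of W tokens
    ar = scaled_dot_attention(xr, xr, xr)                       # (H, W, C/2)
    # column path: each column of xc is a sequence of H tokens
    xct = xc.transpose(1, 0, 2)
    ac = scaled_dot_attention(xct, xct, xct).transpose(1, 0, 2)  # (H, W, C/2)
    return np.concatenate([ar, ac], axis=-1)                     # (H, W, C)
```

Because the two paths are independent, they can be evaluated in parallel, which is the source of the efficiency claim above.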

3.2.1. Limpid Attention Block (LAB).

In the ViT, self-attention is calculated on patches to provide a global correlation of the spatial features. Several ViT variants, e.g., the Swin transformer, use a square window-based self-attention mechanism, while axial attention-based ViTs calculate attention row- and column-wise. In the proposed study, the LSANet attention block, named limpid-shaped attention, is non-square and calculates attention on the tokens. The limpid-shaped attention has broader receptive fields and linear time complexity. Our LAB is composed of the CPE, which dynamically generates the positional embedding using the PEG; the LSA module, which provides global attention over the spectral and spatial features and collects contextual information; and the MLP, which captures complex patterns in the hyperspectral image. The lth block's forward pass can be written as follows:

$\hat{z}^{l} = \mathrm{CPE}(z^{l-1}) + z^{l-1} \qquad (6)$

$\tilde{z}^{l} = \mathrm{LSA}(\mathrm{LN}(\hat{z}^{l})) + \hat{z}^{l} \qquad (7)$

$z^{l} = \mathrm{MLP}(\mathrm{LN}(\tilde{z}^{l})) + \tilde{z}^{l} \qquad (8)$

where $\mathrm{LN}(\cdot)$ denotes layer normalization.
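The residual structure of Eqs. (6)-(8) can be sketched as below. This is a minimal NumPy sketch; the `cpe`, `lsa` and `mlp` callables are placeholders for the actual modules, not the paper's implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LN(.) of Eqs. (7)-(8): normalize over the embedding dimension."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def lab_forward(z, cpe, lsa, mlp):
    """One LAB block following Eqs. (6)-(8): CPE, LS-attention and MLP,
    each wrapped in a residual connection (pre-LN for LSA and MLP)."""
    z = cpe(z) + z                  # Eq. (6)
    z = lsa(layer_norm(z)) + z     # Eq. (7)
    z = mlp(layer_norm(z)) + z     # Eq. (8)
    return z
```

Stacking several such blocks yields the encoder; the learnable scale/shift parameters of layer normalization are omitted here for brevity.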

The global self-attention (GSA) model sets a global token to capture the global correlation of the features (Fig 2(a)); however, it may miss local information, and the computational cost also increases. On the other hand, axial self-attention captures information through local regions, row-wise or column-wise (Fig 2(b)); it reduces the computation cost but limits the model's interaction with other parts of the input token sequence. In cross-shaped attention, the model can interact with several rows and columns of the input sequence to improve the features (Fig 2(c)); however, interacting only with local neighbours restricts the model's capability in certain applications, and selecting the window size for a specific application is crucial to attaining better performance. The limpid shape attention block captures rich contextual information. The feature map is split into multiple limpid shape blocks, each with equal rows and columns, on which self-attention is computed. In addition, it is GPU-friendly, allowing parallel computation of attention scores, which reduces the training time. The yellow shadow shown in Fig 2(d) represents one limpid block that interacts with the other tokens in the same limpid block. Furthermore, parallel LS-attention is implemented for more contextual information.

In the traditional ViT, positional encoding is performed by either learned position embeddings or fixed positional encoding techniques (Fig 3(a)). However, these techniques produce fixed-length input encodings on which the model is trained, which causes difficulties at test time for data with longer sequences. In the proposed method, a position-invariant encoding scheme, CPE, is used to generate the input sequence (Fig 3(b)). The CPE utilizes a depth-wise convolution in the PEG, shown in Fig 3(c). The flattened input sequence in the PEG is reshaped to the 2D image space $\mathbb{R}^{H \times W \times C}$. After that, a depth-wise convolution block with a 3×3 filter is applied to produce the CPE. In addition, zero padding is applied, which supplies absolute position information. The CPE is defined using Eq. (9).

Fig 3. Illustration of (a) standard embedding in ViT, (b) proposed CPE, (c) PEG block.

https://doi.org/10.1371/journal.pone.0328160.g003

$X_{2D} = \mathrm{Reshape}(X_{seq}) \qquad (9)$

where $X_{seq} \in \mathbb{R}^{N \times C}$ is the flattened input sequence and $X_{2D} \in \mathbb{R}^{H \times W \times C}$ is the reshaped 2D space. After that, a depth-wise convolution is applied to the 2D space to generate the CPE.

$P = \mathrm{DWConv}_{3 \times 3}(X_{2D}) \qquad (10)$

where $\mathrm{DWConv}_{3 \times 3}$ applies a separate 3×3 convolution filter to each channel. Furthermore, zero padding is applied to the generated CPE to maintain the absolute position knowledge as follows.

$P_{pad} = \mathrm{ZeroPad}(P) \qquad (11)$

The final CPE is generated by trimming the padded CPE to the original dimension $H \times W \times C$.
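The PEG of Eqs. (9)-(11) can be sketched in NumPy as follows. This is an illustrative sketch: the averaging filter stands in for the learned depth-wise kernel, and adding the generated encoding back to the tokens is an assumption consistent with the residual form of Eq. (6).

```python
import numpy as np

def peg_cpe(tokens, H, W, kernel=None):
    """tokens: (H*W, C) flattened sequence. Reshape to (H, W, C) (Eq. 9),
    apply a per-channel (depth-wise) 3x3 convolution with zero padding
    (Eqs. 10-11), and flatten back. `kernel` has shape (3, 3, C)."""
    C = tokens.shape[1]
    if kernel is None:
        kernel = np.full((3, 3, C), 1.0 / 9.0)  # assumed averaging filter
    x = tokens.reshape(H, W, C)
    # zero padding keeps the H x W resolution and marks absolute borders
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    pos = np.zeros_like(x)
    for u in range(H):
        for v in range(W):
            # each channel is convolved with its own 3x3 filter
            pos[u, v] = np.sum(xp[u:u + 3, v:v + 3] * kernel, axis=(0, 1))
    # conditional positional encoding added back to the token sequence
    return (x + pos).reshape(H * W, C)
```

Because the encoding is computed from the tokens themselves, its length automatically matches any input resolution, which is the property motivating CPE over fixed embeddings.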

The LSA is described in Eq. (7) and is constructed using Eqs. (3) to (5). Moreover, the MLP described in Eq. (8) consists of two linear projection layers that expand and shrink the embedding dimension. A softmax layer is added on top of the model to classify the land covers, and the loss is calculated using the categorical cross-entropy function.

The algorithm for land cover classification using LSANet is described below.

Algorithm 1. The LSANet for land cover classification

Input: HSI cube

Output: The predicted label Y

1: Apply PCA on the HSI cube to reduce the spectral dimension.

2: Extract patches centred on each pixel.

3: For epoch = 1 to 200 do

      (a) Feed the patches to the 3D-CNN block.

      (b) Reshape the features extracted from the 3D-CNN and pass them to the 2D-CNN block.

      (c) Flatten and reshape the features extracted from the 2D-CNN block.

      (d) Partition the feature map into two independent components $X_r$ and $X_c$.

      (e) Generate Q, K, and V using Eq. (4).

      (f) Calculate the LS-attention and MLP projection using Eqs. (6), (7) and (8).

End.
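Steps 1-2 of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative sketch: the helper names and the SVD-based PCA are our own choices, not taken from the paper.

```python
import numpy as np

def pca_reduce(cube, n_components):
    """cube: (P, Q, D) HSI. Project each pixel's D-band spectrum onto the
    top principal components (step 1 of Algorithm 1)."""
    P, Q, D = cube.shape
    X = cube.reshape(-1, D).astype(float)
    Xc = X - X.mean(axis=0)
    # principal directions of the band covariance via SVD
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return (Xc @ Vt[:n_components].T).reshape(P, Q, n_components)

def extract_patches(cube, size):
    """Zero-pad the reduced cube and cut a size x size patch centred on
    every pixel (step 2 of Algorithm 1)."""
    P, Q, B = cube.shape
    m = size // 2
    padded = np.pad(cube, ((m, m), (m, m), (0, 0)))
    patches = np.empty((P * Q, size, size, B))
    for i in range(P):
        for j in range(Q):
            patches[i * Q + j] = padded[i:i + size, j:j + size]
    return patches
```

Each patch then becomes one training sample whose label is the class of its central pixel.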

4. Experimental result and discussion

In this section, quantitative and visual results on PU, IP, SV and Botswana datasets and different parameters affecting the performance are discussed in detail.

4.1. Dataset

The Pavia University (PU), Indian Pines (IP), Salinas Valley (SV) and Botswana datasets are the four benchmarks on which the model was evaluated. The PU dataset comprises images of 610 × 340 pixels, each containing 115 spectral bands, and is categorized into 9 land cover types: asphalt, meadows, gravel, trees, metal sheet, bare soil, bitumen, brick, and shadow. The collection comprises a total of 42,776 accurately tagged samples. The IP dataset is a collection of hyperspectral images of an area in Indiana, US. It consists of 220 spectral reflectance bands and 145 × 145 pixels, each representing a 20 m × 20 m region, with 16 land cover classes. The SV dataset was acquired using the 224-band AVIRIS sensor over the geographical area of Salinas Valley, California. The image size is 512 × 217, with 16 distinct land cover classes, including vegetables, bare soils, and grape fields; the dataset is extensively utilized in hyperspectral image classification applications. The Botswana dataset comprises hyperspectral images depicting the various land cover categories of the Okavango Delta region in Botswana. The dataset is suitable for hyperspectral image classification, with each image pixel labelled according to its spectral signature. The descriptions of the datasets are depicted in Figs 4–7.

Fig 4. Details of the Pavia University (PU) dataset with samples and color coding.

https://doi.org/10.1371/journal.pone.0328160.g004

Fig 5. Details of the Indiana Pines (IP) dataset with samples and color coding.

https://doi.org/10.1371/journal.pone.0328160.g005

Fig 6. Details of the Salinas Valley (SV) dataset with samples and color coding.

https://doi.org/10.1371/journal.pone.0328160.g006

Fig 7. Details of the Botswana dataset with samples and color coding.

https://doi.org/10.1371/journal.pone.0328160.g007

4.2. Experimental setup

The LSANet is evaluated on the PU, IP, SV and Botswana datasets using an NVIDIA Quadro RTX 4000 GPU. Python was used for scripting on the Windows 10 operating system. For each experiment, the model was trained for 200 epochs with a batch size of 64. Furthermore, the Adam optimizer with an initial learning rate of 3e-4 was used to accelerate the training process.

4.3. Quantitative results comparison

The proposed LSANet's performance is compared with 2D-CNN [32], 3D-CNN [33], HybridSN [34], CSIL [35], SpectralFormer (SF) [36], DSGSF [37], morphFormer (MF) [38] and SS1DSwin [39], as shown in Tables 1–4. The 2D-CNN and 3D-CNN utilize 2D and 3D convolution layers to extract spatial and spectral features. Meanwhile, HybridSN implements joint 3D-CNN and 2D-CNN to improve land cover classification. Centre-to-surrounding interactive learning (CSIL) utilizes two transformer modules: the first for central region pixel extraction and avoiding the blur effect in the image; the second, surrounding transformer block performs local attention and improves the spatial features globally. In the SF model, HSI classification can be done using pixel-wise and patch-wise approaches.

Table 2. Quantitative results comparison on the IP dataset.

https://doi.org/10.1371/journal.pone.0328160.t002

Furthermore, a transformer encoder with cross-layer adaptive fusion (CAF) was designed for spectral-spatial feature extraction. In addition, the group-wise spectral embedding (GSE) module was designed to capture local spectral profiles. Dual-view spectral and global spatial feature fusion (DSGSF) has two subnetworks for spectral and spatial feature extraction: the first is based on an encoder and decoder for global spatial feature extraction, while the second extracts spectral features. MF extracts high-dimensional spatial and spectral features using 2D and 3D convolution blocks; in addition, it performs morphological dilation operations in the transformer block to increase the interaction between the CLS and HSI tokens. The SS1DSwin network consists of a group feature tokenization module (GFTM) for token embedding and a 1DSwin Transformer block for global attention over the spectral-spatial features. The performance of the CNN-based models is relatively lower: 2D-CNN extracts only spatial features, while 3D-CNN also captures spectral features, which makes its performance superior in several land cover classes.

Moreover, HybridSN utilized 2D-CNN and 3D-CNN for joint spatial and spectral features and demonstrated superior performance in several classes. Further, DSGSF improved performance through enhanced spatial features extracted from the encoder and decoder blocks and spectral features via the CNN module. However, traditional CNN-based models extract shallow high-dimensional features and ignore edge features. ViT-based models have better quantitative results than traditional CNN models.

In CSIL, the inclusion of two transformer blocks improved the classification accuracy; at the same time, the computational cost increased. The SF method utilized cross-layer fusion to improve the global correlation of the features; however, its cost increased due to the additional learnable embeddings. In the MF method, dilation-based spectral and spatial morphological operations are performed, which improved classification performance. The proposed method's OA on the PU dataset is 3.1% better, and its kappa value 2.1% better, than those of SS1DSwin. Furthermore, its OA on the IP dataset is 1.39% better than that of SS1DSwin.

Furthermore, the proposed model showed remarkable performance on the SV and Botswana datasets, obtaining 97.52% and 89.45% OA, respectively. LSANet obtained the highest OA in the corn, grass-pasture, grass-pasture-mowed, hay-windrowed, soyabean-notill, soyabean-clean, woods and stone-steel-towers classes. Furthermore, on the Botswana dataset, the model achieved 2.22% and 1.99% higher kappa and OA, respectively, compared to SS1DSwin. Moreover, on the SV dataset, LSANet obtained kappa and OA values of 96.17% and 97.52%, respectively, which is better than the other methods.

4.4. Visual results

The visual results of the proposed LSANet and 2D-CNN [32], 3D-CNN [33], HybridSN [34], CSIL [35], SpectralFormer (SF) [36], DSGSF [37], morphFormer (MF) [38] and SS1DSwin [39] on the PU, IP, SV and Botswana datasets are depicted in Figs 8–11. The region of interest (ROI) is selected and zoomed three times for better views of the Soyabean-clean and Stone-Steel-Towers classes of the IP dataset. Furthermore, the classification maps of the Bare Soil and Bitumen classes of the PU dataset are highlighted. In the classification map of SV, the Grapes_untrained class is highlighted, and on the Botswana dataset the floodplain grasses1 and firescar2 ROIs are zoomed three times. The 2D-CNN and 3D-CNN classification maps have large areas of noise and suffer from severe oversmoothing in the boundary regions. HybridSN and DSGSF improved the visual results in several classes but still suffer from smoothing issues. ViT-based methods have clearer classification maps but fail to differentiate similar classes, showing overlap. The proposed method's classification maps are close to the GT except in a few classes.

Fig 8. The visual map on PU dataset.

(a) GT (b) 2D-CNN (c) 3D-CNN (d) HybridSN (e) CSIL (f) SF (g) DSGSF (h) MF (i) SS1DSwin and (j) LSANet.

https://doi.org/10.1371/journal.pone.0328160.g008

Fig 9. Illustration of the visual map on IP dataset.

(a) GT (b) 2D-CNN (c) 3D-CNN (d) HybridSN (e) CSIL (f) SF (g) DSGSF (h) MF (i) SS1DSwin and (j) LSANet.

https://doi.org/10.1371/journal.pone.0328160.g009

Fig 10. The visual map on SV dataset.

(a) GT (b) 2D-CNN (c) 3D-CNN (d) HybridSN (e) CSIL (f) SF (g) DSGSF (h) MF (i) SS1DSwin and (j) LSANet.

https://doi.org/10.1371/journal.pone.0328160.g010

Fig 11. The visual map on Botswana dataset.

(a) GT (b) 2D-CNN (c) 3D-CNN (d) HybridSN (e) CSIL (f) SF (g) DSGSF (h) MF (i) SS1DSwin and (j) LSANet.

https://doi.org/10.1371/journal.pone.0328160.g011

4.5. Ablation studies

In this section, several hyperparameters that affect model performance on the IP, PU, SV and Botswana datasets are discussed in detail.

4.5.1. Effect of limpid size.

The limpid size greatly affects the contextual information and the accuracy of the model. An experiment was conducted by varying the size in the range of 1–9 in the four stages of the transformer encoder. Beyond this range, the model did not gain significant improvement in land cover classification accuracy, as shown in Fig 12. However, the number of FLOPs increases with the limpid size, and the highest accuracy is achieved for a size of 7 on the IP, PU, SV and Botswana datasets.

Fig 12. Illustration of limpid size on IP, PU, SV and Botswana datasets.

https://doi.org/10.1371/journal.pone.0328160.g012

4.5.2. Effect of different components.

The results of the ablation study using different components of the proposed LSANet on the four datasets are presented in Table 5, which shows that the OA of the ViT on the PU and SV datasets is 90.18% and 91.67%, respectively. On the other hand, 2D-CNN + ViT improved the OA and Kappa values on the datasets: the 2D-CNN extracted high-dimensional local spatial features, and the ViT provided global context. Furthermore, 3D-CNN + ViT achieved more than a 2% improvement in OA on the PU, IP and SV datasets due to the capability of 3D-CNN for spatial and spectral feature extraction. Moreover, 3D-CNN + 2D-CNN + ViT achieved the highest OA and Kappa values on all the datasets due to the improved spatial-spectral features and the global attention provided by the LS-attention-based ViT encoder.

Table 5. Performance evaluation using different components.

https://doi.org/10.1371/journal.pone.0328160.t005
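The 3D-CNN → 2D-CNN feature flow behind these component combinations can be made concrete with simple convolution shape arithmetic. The kernel sizes below (3×3×7 for the 3D convolution, 3×3 for the 2D one) and the 11×11 input patch are hypothetical stand-ins, not values reported in the paper:

```python
def conv_out(n, k, s=1, p=0):
    """Output length of a convolution along one axis: floor((n + 2p - k)/s) + 1."""
    return (n + 2 * p - k) // s + 1

# A 3D conv first compresses the spectral axis of an 11x11x30 patch
# (30 bands matches the band count used in the parameter comparison),
# then a 2D conv refines the resulting spatial map.
h = w = 11
bands = 30
h1, w1, d1 = conv_out(h, 3), conv_out(w, 3), conv_out(bands, 7)
print((h1, w1, d1))  # (9, 9, 24) -- spatial-spectral volume after the 3D conv
h2, w2 = conv_out(h1, 3), conv_out(w1, 3)
print((h2, w2))      # (7, 7) -- spatial map after the 2D conv
```

The point of the arithmetic is that the 3D stage retains a spectral axis for the 2D stage to aggregate, which is what the 3D-CNN + 2D-CNN + ViT combination exploits.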

4.5.3. Effect of batch size and learning rate.

The edge and boundary regions of the images require extensive parameter tuning for classification [40,41]. In addition, focusing on the edge and boundary regions can enhance object detection [42]. Hyperparameters such as batch size and learning rate are important choices for stable training and generalization of model performance. Generally, a larger batch size leads to faster convergence; at the same time, it might miss the fine-grained details of the hyperspectral data [43]. As can be noticed in Fig 13(b), with a batch size of 16, the proposed model's accuracy on the PU, IP, SV, and Botswana datasets is lower. OA increases gradually with the batch size, and the highest value is obtained with a batch size of 64; it then slightly decreases with further increases in batch size. The learning rate determines the number of iterations required for the model's stable training. Higher learning rates may lead to sensitivity to noise, and lower rates may increase the training iterations needed for stable performance. Fig 13(a) shows that the model achieved the best classification accuracy with a learning rate of 3e-4.

Fig 13. Illustration of the (a) learning rate and (b) batch size on the IP, PU, SV and Botswana datasets.

https://doi.org/10.1371/journal.pone.0328160.g013

4.5.4. Effect of CPE and zero padding.

CPE and zero padding are important factors for improving the classification performance of LSANet. We summarize model performance on the IP, PU, SV and Botswana datasets with several combinations of these parameters in Table 6. Without CPE and zero padding, classification accuracy on the IP dataset is 95.19%, whereas CPE improves performance by more than 1.5%. Furthermore, the model achieved 98.67% accuracy with both CPE and zero padding on the IP dataset. Moreover, on the PU dataset, the model gained a 0.16% performance improvement with CPE and zero padding, while on the SV dataset it achieved an OA of 95.26% with CPE and 96.29% with zero padding. LSANet achieved the highest performance on all datasets using both CPE and zero padding.

Table 6. LSANet performance on IP dataset with CPE and zero padding.

https://doi.org/10.1371/journal.pone.0328160.t006

4.5.5. Training time analysis.

The trainable parameters and the samples present in the land covers influence the models' training time. In the proposed study, the training times of each model under the same experimental conditions were measured on the PU, IP, SV and Botswana datasets. As shown in Fig 14, it can be observed that 2D-CNN takes the least time on all datasets due to its fewer trainable parameters, while 3D-CNN has a high training time due to its large number of training parameters. Furthermore, HybridSN, which has both 3D and 2D convolutional layers, exhibits higher training times than 2D-CNN. The ViT-based methods SF and MF have higher computation times than the classical CNNs, except for 3D-CNN, due to the attention computation in the encoder. SS1DSwin has relatively lower training times due to its attention calculation through 1D-CNN layers. The proposed LSANet is designed using lightweight 3D-CNN and 2D-CNN blocks, and attention is calculated in parallel row- and column-wise; as a result, its training times are lower than those of the majority of methods.

Fig 14. Training time comparison on PU, IP, SV, and Botswana datasets.

https://doi.org/10.1371/journal.pone.0328160.g014

4.5.6. Performance comparison with different attention.

The experiments are conducted using self-attention (SA), axial self-attention (ASA), and cross self-attention (CSA) by replacing the LSA in the proposed model on the PU, IP, SV and Botswana datasets. The experimental settings were kept the same as discussed in section 4.2. After training, OA and Kappa values on each dataset are calculated, as shown in Table 7. ViT + SA achieved Kappa values of 91.56% and 93.19% on the PU and IP datasets, respectively. Meanwhile, ViT + ASA obtained Kappa values of 92.07% and 94.16% on the PU and SV datasets, respectively. ViT + CSA improved the Kappa values by 2.75%, 2.14% and 0.88% on the PU, Botswana and SV datasets. Moreover, our ViT + LSA obtained Kappa values of 96.05%, 97.10% and 86.14% on the PU, IP and Botswana datasets. Furthermore, the computational complexity of SA is O((hw)²·dk), which grows quadratically with the size of the feature map. In ASA, attention is calculated row- and column-wise sequentially, and the complexity is reduced to O(hw(h + w)·dk). Meanwhile, in CSA attention is calculated between two feature maps along a criss-cross pattern, which costs O(hw(h + w − 1)·dk). Moreover, our LSA calculates attention in parallel over the rows and columns of each block and has complexity O(hw·s·dk), where h, w is the input shape, dk is the dimension of the query/key, and s is the size of the limpid. The SA mechanism in the ViT calculates the attention score across all token pairs and can provide global attention; however, it may miss the local mid-level features essential for HSI classification. Moreover, ASA calculates attention over one axis (rows or columns) at a time, so diagonal features may be missed. At the same time, CSA has a cross pattern that is broader along the axes but narrower in terms of directional coverage. Lastly, LSA calculates attention in all directions and is more adaptive in providing contextual information.

Table 7. Performance comparison using different attention mechanisms.

https://doi.org/10.1371/journal.pone.0328160.t007
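The relative costs of these attention variants can be checked numerically. The sketch below encodes the scaling behaviours discussed in the text — full SA quadratic in the number of tokens, axial attention linear in h + w, and an assumed O(hw·s·dk) form for LS-attention with limpid size s. The formulas and constants here are illustrative reconstructions, not figures taken from the paper:

```python
def sa_cost(h, w, dk):
    """Full self-attention: every one of the h*w tokens attends to all h*w tokens."""
    return (h * w) ** 2 * dk

def asa_cost(h, w, dk):
    """Axial attention: each token attends along its row and its column."""
    return h * w * (h + w) * dk

def lsa_cost(h, w, dk, s):
    """Assumed LS-attention cost: row/column attention restricted to
    limpid blocks of size s."""
    return h * w * s * dk

h, w, dk, s = 64, 64, 32, 7  # s = 7 is the best limpid size from the ablation
assert lsa_cost(h, w, dk, s) < asa_cost(h, w, dk) < sa_cost(h, w, dk)
print(sa_cost(h, w, dk) // lsa_cost(h, w, dk, s))  # 585 -- times cheaper than full SA
```

The ordering, rather than the exact ratio, is the point: restricting attention to small blocks keeps the cost linear in the number of tokens.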

4.5.7. Performance evaluation using different encoding techniques.

An experiment is conducted using FPE and LPE on the PU, IP, SV and Botswana datasets, and the results are presented in Table 8. Table 8 shows that FPE, using the sine function, has an OA of 94.86% and a Kappa value of 93.18%. At the same time, it achieved 86.15% and 93.28% OA on the SV and Botswana datasets, respectively. LPE achieved better OA and Kappa values than FPE on all the datasets. Moreover, CPE obtained OAs of 98.78% and 89.45% on the PU and SV datasets, respectively. Fixed positional encoding (FPE), e.g., sinusoidal, uses sine and cosine functions at different frequencies to generate the image encodings. Learned positional encoding (LPE) lets the model learn the positional encoding vectors during training; however, after training, they remain fixed. Our CPE generates dynamic positional encodings from the input feature map using depth-wise convolution, which makes it efficient and flexible across different resolutions.

Table 8. Performance comparison using different encoding techniques.

https://doi.org/10.1371/journal.pone.0328160.t008
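The mechanism behind CPE, and the role zero padding plays in it, can be illustrated with a small NumPy sketch: a zero-padded depth-wise convolution over the token grid produces encodings that depend on each token's position, and the padding is what lets border tokens "know" they are at the border. The random weights, function name and 3×3 kernel are illustrative assumptions, not the paper's learned parameters:

```python
import numpy as np

def conditional_pos_encoding(x, kernel=3):
    """Sketch of CPE: a zero-padded depth-wise convolution over the token
    feature map (H, W, C) generates position-dependent encodings that are
    added back to the input."""
    h, w, c = x.shape
    pad = kernel // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    rng = np.random.default_rng(0)
    dw = rng.standard_normal((kernel, kernel, c)) * 0.02  # one filter per channel
    pe = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            # Depth-wise: each channel is convolved with its own kernel slice.
            pe[i, j] = np.sum(xp[i:i + kernel, j:j + kernel, :] * dw, axis=(0, 1))
    return x + pe

tokens = np.ones((6, 6, 8))            # a 6x6 grid of 8-dim tokens
out = conditional_pos_encoding(tokens)
print(out.shape)  # (6, 6, 8) -- same resolution, position-aware
```

On this constant input, interior tokens all receive the same encoding, while border tokens differ because their windows overlap the zero padding — the position signal comes entirely from the padding, which is consistent with the ablation showing CPE and zero padding work best together.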

4.5.8. Parameters and flops comparison.

The trainable parameters in millions (M) and FLOPs in gigaflops (G) of each model, based on 30 bands of the datasets, are depicted in Table 9. Table 9 shows that the 3D-CNN model has the highest number of trainable parameters and 2D-CNN the lowest. At the same time, both 2D-CNN and 3D-CNN have high FLOP counts. The transformer-based methods CSIL, SpectralFormer, DSGSF, MorphFormer and SS1DSwin also have high FLOP counts; for instance, SS1DSwin has 26.05 M trainable parameters. The proposed LSANet has 2.3 M trainable parameters and 1.27 G FLOPs.

5. Conclusion

Hyperspectral data is characterized by numerous narrow, contiguous spectral bands that provide extensive spectral information but limited spatial detail. CNNs have made remarkable progress in capturing high-level contextual information for remote sensing applications, particularly hyperspectral image classification. However, CNNs struggle to adequately characterize the sequential features of spectral signatures and to capture the global characteristics of complete HSIs with local kernels. In this study, LSANet is proposed, which overcomes these constraints by combining global attention mechanisms with the spatial and spectral information generated by CNNs. A conditional position encoding (CPE) is utilized that dynamically creates tokens, improving the model's ability to identify subtle patterns in the data. Furthermore, by separating feature maps into smaller regions for token interaction, our LS-attention strategy provides a more comprehensive contextual representation than earlier attention mechanisms. The model was evaluated on four datasets, achieving OAs of 98.78%, 98.67%, 97.52% and 89.45% on the PU, IP, SV and Botswana datasets, respectively. One major limitation is the computational cost of the attention mechanism, which could be reduced through other approaches; another is the reliance on zero padding and CPE to improve classification performance. In the future, transformer-based architectures with advanced techniques, e.g., efficient attention and self-supervised learning, can be designed to better fit the HSI classification task. Additionally, a lightweight ViT network can be designed to decrease complexity while maintaining performance. Furthermore, more physical features of the spectral bands, based on prior information, can be extracted to improve classification performance.

References

  1. 1. Cai W, Ning X, Zhou G, Bai X, Jiang Y, Li W, et al. A novel hyperspectral image classification model using bole convolution with three-direction attention mechanism: small sample and unbalanced learning. IEEE Transactions on Geoscience and Remote Sensing. 2022;61:1–17.
  2. 2. Jia S, Jiang S, Lin Z, Li N, Xu M, Yu S. A survey: Deep learning for hyperspectral image classification with few labeled samples. Neurocomputing. 2021;448:179–204.
  3. 3. Mohan A, Venkatesan M. HybridCNN based hyperspectral image classification using multiscale spatiospectral features. Infrared Physics & Technology. 2020;108:103326.
  4. 4. Ang KLM, Seng JKP. Big data and machine learning with hyperspectral information in agriculture. IEEE Access. 2021;9:36699–718.
  5. 5. Han T, Goodenough DG. Investigation of Nonlinearity in Hyperspectral Imagery Using Surrogate Data Methods. IEEE Trans Geosci Remote Sensing. 2008;46(10):2840–7.
  6. 6. Li W, Prasad S, Fowler JE, Bruce LM. Locality-Preserving Discriminant Analysis in Kernel-Induced Feature Spaces for Hyperspectral Image Classification. IEEE Geosci Remote Sensing Lett. 2011;8(5):894–8.
  7. 7. Imani M, Ghassemian H. Principal component discriminant analysis for feature extraction and classification of hyperspectral images. In Proceedings of the 2014 Iranian Conference on Intelligent Systems (ICIS). IEEE, Bam, Iran; 4 February 2014.
  8. 8. Cao X, Yao J, Fu X, Bi H, Hong D. An enhanced 3-D discrete wavelet transform for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters. 2020;18:1–5.
  9. 9. Peng J, Chen H, Zhou Y, Li L. Ideal Regularized Composite Kernel for Hyperspectral Image Classification. IEEE J Sel Top Appl Earth Observations Remote Sensing. 2017;10(4):1563–74.
  10. 10. Li J, Marpu PR, Plaza A, Bioucas-Dias JM, Benediktsson JA. Generalized composite kernel framework for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing. 2013;51(9):4816–29.
  11. 11. Liu J, Wu Z, Li J, Plaza A, Yuan Y. Probabilistic kernel collaborative representation for spatial–spectral hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing. 2016;54(4):2371–84.
  12. 12. Kumar MS, Keerthi V, Anjnai RN, Sarma MM, Bothale V. Evaluation of machine learning methods for hyperspectral image classification. In: Proceedings of the 2020 IEEE India Geoscience and Remote Sensing Symposium (InGARSS). Ahmedabad, India; 2020. p. 225–8.
  13. 13. Chen H, Miao F, Chen Y, Xiong Y, Chen T. A Hyperspectral Image Classification Method Using Multifeature Vectors and Optimized KELM. IEEE J Sel Top Appl Earth Observations Remote Sensing. 2021;14:2781–95.
  14. 14. Camps-Valls G, Bruzzone L. Kernel-based methods for hyperspectral image classification. IEEE Trans Geosci Remote Sensing. 2005;43(6):1351–62.
  15. 15. Kang X, Li S, Benediktsson JA. Spectral–spatial hyperspectral image classification with edge-preserving filtering. IEEE Transactions on Geoscience and Remote Sensing. 2013;52(5):2666–77.
  16. 16. Ratle F, Camps-Valls G, Weston J. Semisupervised Neural Networks for Efficient Hyperspectral Image Classification. IEEE Trans Geosci Remote Sensing. 2010;48(5):2271–82.
  17. 17. Sun H, Zheng X, Lu X. A Supervised Segmentation Network for Hyperspectral Image Classification. IEEE Trans Image Process. 2021;30:2810–25. pmid:33539293
  18. 18. Wang X, Tan K, Du P, Pan C, Ding J. A Unified Multiscale Learning Framework for Hyperspectral Image Classification. IEEE Trans Geosci Remote Sensing. 2022;60:1–19.
  19. 19. Bai J, Shi W, Xiao Z, Ali TAA, Ye F, Jiao L. Achieving better category separability for hyperspectral image classification: A spatial–spectral approach. IEEE Transactions on Neural Networks and Learning Systems. 2023.
  20. 20. Liu H, Li W, Xia XG, Zhang M, Gao CZ, Tao R. Central attention network for hyperspectral imagery classification. IEEE Transactions on Neural Networks and Learning Systems. 2022.
  21. 21. Paoletti ME, Haut JM, Pereira NS, Plaza J, Plaza A. Ghostnet for Hyperspectral Image Classification. IEEE Trans Geosci Remote Sensing. 2021;59(12):10378–93.
  22. 22. Hong D, Gao L, Yao J, Zhang B, Plaza A, Chanussot J. Graph convolutional networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing. 2020;59(7):5966–78.
  23. 23. Hang R, Li Z, Liu Q, Ghamisi P, Bhattacharyya SS. Hyperspectral image classification with attention-aided CNNs. IEEE Transactions on Geoscience and Remote Sensing. 2020;59(3):2281–93.
  24. 24. Cao X, Yao J, Xu Z, Meng D. Hyperspectral Image Classification With Convolutional Neural Network and Active Learning. IEEE Trans Geosci Remote Sensing. 2020;58(7):4604–16.
  25. 25. Hou S, Shi H, Cao X, Zhang X, Jiao L. Hyperspectral imagery classification based on contrastive learning. IEEE Transactions on Geoscience and Remote Sensing. 2021;60:1–13.
  26. 26. Sun L, Fang Y, Chen Y, Huang W, Wu Z, Jeon B. Multi-structure KELM with attention fusion strategy for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing. 2022;60:1–17.
  27. 27. Zheng X, Sun H, Lu X, Xie W. Rotation-Invariant Attention Network for Hyperspectral Image Classification. IEEE Trans Image Process. 2022;31:4251–65. pmid:35635815
  28. 28. Zhang Y, Li W, Sun W, Tao R, Du Q. Single-Source Domain Expansion Network for Cross-Scene Hyperspectral Image Classification. IEEE Trans Image Process. 2023;32:1498–512. pmid:37027628
  29. 29. Ahmad M, Ghous U, Usama M, Mazzara M. WaveFormer: Spectral–spatial wavelet transformer for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters. 2024.
  30. 30. Sun H, Zheng X, Lu X, Wu S. Spectral–Spatial Attention Network for Hyperspectral Image Classification. IEEE Trans Geosci Remote Sensing. 2020;58(5):3232–45.
  31. 31. Sun L, Zhao G, Zheng Y, Wu Z. Spectral–Spatial Feature Tokenization Transformer for Hyperspectral Image Classification. IEEE Trans Geosci Remote Sensing. 2022;60:1–14.
  32. 32. Chen Y, Jiang H, Li C, Jia X, Ghamisi P. Deep Feature Extraction and Classification of Hyperspectral Images Based on Convolutional Neural Networks. IEEE Trans Geosci Remote Sensing. 2016;54(10):6232–51.
  33. 33. Li Y, Zhang H, Shen Q. Spectral–Spatial Classification of Hyperspectral Imagery with 3D Convolutional Neural Network. Remote Sensing. 2017;9(1):67.
  34. 34. Roy SK, Krishna G, Dubey SR, Chaudhuri BB. HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters. 2019;17(2):277–81.
  35. 35. Yang J, Du B, Zhang L. From center to surrounding: An interactive learning framework for hyperspectral image classification. ISPRS Journal of Photogrammetry and Remote Sensing. 2023;197:145–66.
  36. 36. Hong D, Han Z, Yao J, Gao L, Zhang B, Plaza A, et al. SpectralFormer: Rethinking Hyperspectral Image Classification With Transformers. IEEE Trans Geosci Remote Sensing. 2022;60:1–15.
  37. 37. Guo T, Wang R, Luo F, Gong X, Zhang L, Gao X. Dual-View Spectral and Global Spatial Feature Fusion Network for Hyperspectral Image Classification. IEEE Transactions on Geoscience and Remote Sensing. 2023.
  38. 38. Roy SK, Deria A, Shah C, Haut JM, Du Q, Plaza A. Spectral–Spatial Morphological Attention Transformer for Hyperspectral Image Classification. IEEE Trans Geosci Remote Sensing. 2023;61:1–15.
  39. 39. Xu Y, Xie Y, Li B, Xie C, Zhang Y, Wang A, et al. Spatial-Spectral 1DSwin Transformer with Group-wise Feature Tokenization for Hyperspectral Image Classification. IEEE Transactions on Geoscience and Remote Sensing. 2023.
  40. 40. Wang J, Hu F, Abbas G, Albekairi M, Rashid N. Enhancing image categorization with the quantized object recognition model in surveillance systems. Expert Systems with Applications. 2024;238:122240.
  41. 41. Xu Z, Wang J, Hu F, Abbas G, Touti E, Albekairi M, et al. Improved camouflaged detection in the large-scale images and videos with minimum boundary contrast in detection technique. Expert Systems with Applications. 2024;249:123558.
  42. 42. Wang J, Alshahir A, Abbas G, Kaaniche K, Albekairi M, Alshahr S, et al. A Deep Recurrent Learning-Based Region-Focused Feature Detection for Enhanced Target Detection in Multi-Object Media. Sensors (Basel). 2023;23(17):7556. pmid:37688012
  43. 43. Vaddi R, Phaneendra Kumar BLN, Manoharan P, Agilandeeswari L, Sangeetha V. Strategies for dimensionality reduction in hyperspectral remote sensing: A comprehensive overview. The Egyptian Journal of Remote Sensing and Space Sciences. 2024;27(1):82–92.