Attention 3D central difference convolutional dense network for hyperspectral image classification

Hyperspectral Images (HSI) classification is a challenging task due to a large number of spatial-spectral bands of images with high inter-similarity, extra variability classes, and complex region relationships, including overlapping and nested regions. Classification becomes a complex problem in remote sensing images like HSIs. Convolutional Neural Networks (CNNs) have gained popularity in addressing this challenge by focusing on HSI data classification. However, the performance of 2D-CNN methods heavily relies on spatial information, while 3D-CNN methods offer an alternative approach by considering both spectral and spatial information. Nonetheless, the computational complexity of 3D-CNN methods increases significantly due to the large capacity size and spectral dimensions. These methods also face difficulties in manipulating information from local intrinsic detailed patterns of feature maps and low-rank frequency feature tuning. To overcome these challenges and improve HSI classification performance, we propose an innovative approach called the Attention 3D Central Difference Convolutional Dense Network (3D-CDC Attention DenseNet). Our 3D-CDC method leverages the manipulation of local intrinsic detailed patterns in the spatial-spectral features maps, utilizing pixel-wise concatenation and spatial attention mechanism within a dense strategy to incorporate low-rank frequency features and guide the feature tuning. Experimental results on benchmark datasets such as Pavia University, Houston 2018, and Indian Pines demonstrate the superiority of our method compared to other HSI classification methods, including state-of-the-art techniques. The proposed method achieved 97.93% overall accuracy on the Houston-2018, 99.89% on Pavia University, and 99.38% on the Indian Pines dataset with the 25 × 25 window size.


Introduction
Hyperspectral images (HS images) consist of numerous contiguous bands that provide extensive spectral information at high spectral resolution. HS images have various real-world applications, including urban analysis, land cover analysis, agriculture, and environmental analysis. HS image classification is an effective way to distinguish the variety of features and supports critical decision-making. Usually, the classification of these images relies on analyzing and interpreting the unique spectral signatures exhibited by the objects within them. However, real-world applications of hyperspectral imaging often face challenges such as low spatial resolution [1] caused by signal-to-noise ratio, sensor limitations, and complexity constraints. To address this, various classification techniques such as K-Nearest Neighbor (KNN) [2], Support Vector Machine (SVM) [3], Maximum Likelihood (ML) [4], Logistic Regression (LR) [2], and Extreme Learning Machine (ELM) [5] have been employed over the past decades to classify spectral features, aiming for improved accuracy and robustness. The effectiveness of these classifiers is nevertheless limited by redundant factors and high correlation among spectral bands, and by their inability to consider the spatial heterogeneity of hyperspectral image data, which leads to suboptimal results. Achieving optimal classification accuracy requires a classifier that effectively integrates both spectral and spatial information. Spatial characteristics offer additional distinguishing details regarding an object's dimensions, configuration, and arrangement, and proper integration of these details can lead to more effective outcomes. Approaches that exploit spatial-spectral characteristics can be divided into two groups.
1) Spatial and spectral features are analyzed separately through independent evaluations. Spatial attributes are obtained using advanced-level modules such as Entropy [5], Morphological [6,7], Low-Rank Representation [8], and Attribute Self-Represent [9]. These spatial features are later merged with spectral features for pixel-level classification.
2) Joint spectral-spatial features are investigated [10]. This involves extracting features that combine spectral and spatial information, accomplished through 3-D wavelets, scattering wavelets [11], and Gabor filters [12] at various frequencies and scales.
These traditional feature extraction methods rely on shallow, handcrafted learning approaches derived from expert knowledge [13], which may limit their ability to achieve accurate classification.
Recently, deep learning-based methods have shown effectiveness in various applications, specifically in image classification and object detection, by identifying low- to high-level features, which enables precise classification. The high-spatial-resolution HSI data is organized into 3D cubes, capturing complex details and effectively maintaining correlations between spectral and spatial features. This process enhances feature extraction and classification outcomes. Among these models, the convolutional neural network (CNN) [14] has gained popularity because its learned features classify better than manually designed ones. This strategy has been applied in various image processing tasks, including image classification [15,16], object detection [17], semantic segmentation [18], colon cancer classification [19], depth estimation [20], face anti-spoofing [21], and related domains. Advanced deep learning techniques have also been suggested for dealing with the problem of shifting domains in hyperspectral imaging (HSI). The paper [22] introduces an approach called LRR-Net for anomaly detection. LRR-Net is a baseline network that combines the low-rank representation (LRR) model with deep learning techniques. It employs the ADMM optimizer to efficiently solve the LRR model and convert normal parameters into trainable ones, hence reducing the need for human adjustment. The paper [23] presents a comprehensive framework for remote sensing (RS) applications, SpectralGPT, designed to overcome the limited emphasis on spectral data in visual representation learning. Unlike conventional models that mainly focus on RGB images, SpectralGPT is designed specifically for processing spectral RS images. It utilizes a 3D generative pre-trained transformer (GPT) architecture and supports images of diverse sizes, resolutions, and time series through incremental training. A spatial and spectral BERT is proposed in [24], utilizing local and global features to improve HSI classification.
Recent years have witnessed remarkable progress in hyperspectral image analysis through deep learning techniques. Deep learning-based methods have shown effectiveness in extracting deep semantic features from HS images using stacked-layer architectures with either 2D or 3D convolutional neural networks, which allow the customization of spatial features, as shown in several studies [25][26][27]. However, 2D approaches are mostly used to extract spatial features independently, which may limit the full utilization of the spectral-spatial data present in hyperspectral images. Moreover, these methods are weak in providing detailed, complex information about the spatial and spectral dimensions of HS images. Although CNN-based frameworks provide deep information, they cannot extract local intrinsic detailed and low-frequency information, which is necessary for accurate classification. To overcome this limitation, we propose a novel 3D-CDCN dense architecture equipped with a 3D attention mechanism for exploring more appropriate features, while the proposed dense connection makes the architecture robust to the finely detailed and low-frequency features of HSIs. The CDC strategy is used to explore the intrinsic detailed information. In summary, the main contributions of this article are as follows:
1. A customized 2D-to-3D CDCN module is proposed that incorporates central difference into vanilla convolution to enhance its representational characteristics and improve its generalization capacity. This method combines intensity and gradient data to extract intricate patterns within the spatial-spectral data, and offers a higher level of robustness and adaptability than traditional CNN methodologies.

2. A novel CDCN architecture is introduced that embeds 3D attention within the basic CDCN architecture, which makes the proposed method robust and lets it efficiently consider detailed intrinsic features during the classification task.
3. A Dense Network module is introduced in the architecture that employs pixel-wise channel concatenation techniques to extract low-rank frequency features from 3D-CDCN, as explained in reference [28]. The spatial attention mechanism fine-tunes the 3D feature maps, enabling the model to fully leverage the benefits of low-rank frequency features while minimizing data loss.
4. Besides evaluating the effectiveness of the proposed CDCN architecture in terms of Overall accuracy (OA), Average accuracy (AA), and kappa coefficient (Kappa), we compared the efficiency of the proposed method with existing HSI classification-based methods.
This paper is organized as follows. Section 2 reviews the related work. Section 3 discusses the methodology, encompassing its technical aspects and theoretical foundations. Section 4 presents the experimental datasets, accompanied by the analysis of results and discussions, and Section 4.3 presents the ablation study. Section 5 draws conclusions, emphasizing the findings and their implications for future research. The abbreviations and their meanings are shown in Table 1.
Computer vision and digital image processing have various applied domains, such as object detection, remote sensing image classification, medical image classification, video analysis, crime detection, and industrial automation [38,39]. Feature fusion and deep learning algorithms have shown robust results in various domains of computer vision [40,41]. CNNs have seen widespread use in HSI classification. For instance, a basic CNN model with five layers was proposed [32], focusing solely on spectral information. To address this limitation, an enhanced CNN model [36] was introduced, utilizing 3-D patches as input to incorporate both spectral and spatial information. Another approach integrates a CNN with a spatial pyramid pooling strategy to capture spatial information comprehensively [36]. Additionally, CNN features have been combined with hand-crafted features and a Conditional Random Field (CRF) [42]. Another variant, CNN with Markov Random Field (MRF) [33], was introduced to leverage label correlations effectively.
A dual-channel CNN [43,44] was introduced, utilizing 1-D and 2-D CNNs for feature extraction. To expand the training dataset for deep CNNs, a pixel-pair method [37] was proposed. Moreover, a 3-D Convolutional Neural Network (3D-CNN) [45] was introduced, enabling joint spectral and spatial information processing. Similarly, a 3-D contextual deep CNN (3D-FCN) [35] was suggested to optimize the exploration of local contextual interactions among neighboring pixel vectors. Applications of computer vision covered by prior research include wheat classification [46], brightness correction [47], pattern analysis [48,49], and photo-synthesis [50]. Transformer learning models also perform well for target object detection [51][52][53]. The advancement of Convolutional Neural Networks (CNNs) has spurred the development of various convolution techniques. One such technique, tiled convolution, employs distinct filters for feature map neurons with nearby receptive fields on the input image [54]. Consequently, this method generates a feature map using multiple filters, extracting more definitive features with an equal number of feature maps. In one study, the augmented linear mixing model (ALMM) tackles spectral variability in hyperspectral images by using a data-driven approach to isolate scaling factors linked to illumination or topography using an endmember dictionary, while also capturing additional variations from environmental factors and instrument settings via a spectral variability dictionary [55]. The method, integrated into the spectral unmixing framework, allows the spectral variability dictionary to be learned while the abundance maps are estimated, and showed enhanced effectiveness compared to earlier advanced techniques in experiments on both synthetic and real datasets. Dilated convolution [36], another convolutional approach, focuses on broadening the receptive field of a filter without increasing the number of parameters. This is achieved by introducing cells with zero weight values into the filters. Studies have indicated that this technique may enhance performance in certain scenarios [36].
In a different study, micro Multi-Layer Perceptron (MLP) structures are utilized as filters, referred to as Network In Network (NIN) [56,57]. This enables the filters to learn more intricate relationships during the training phase. Another notable alternative convolution technique is Inception [58], where varying-sized filters are incorporated within a single convolution layer. By utilizing the inception module, this method can simultaneously execute convolution and pooling processes, demonstrating improved performance without escalating the parameter count. While deep learning methods have accomplished impressive performance in HSI classification, they often require more data as input. Moreover, unlike traditional descriptors, Convolutional Neural Networks (CNNs) tend to overfit easily and face challenges in generalizing well to unseen scenes. This difficulty in generalization can pose a significant issue when applying CNNs to HSIs, hindering their adaptability to diverse and unfamiliar environments or scenes. Additionally, the reliance of these methods on extensive sequences as input poses practical limitations, particularly in scenarios where real-time processing is necessary. Therefore, despite achieving state-of-the-art results, these drawbacks highlight the need to improve the generalization ability and adaptability of CNN-based approaches in HSI classification tasks. The convolution operator plays a vital role in extracting fundamental visual features within the deep learning framework.
Recently, there have been developments and extensions to the conventional convolution operator. One direction involves incorporating classical local descriptors like LBP [59] and Gabor filters [60] into convolution design. Notable works include Local Binary Convolution [61] and Gabor Convolution [62,63]. Local Binary Convolution is devised to reduce computational costs, while Gabor Convolution aims to improve resilience against spatial alterations. Another direction in extending convolution operators involves modifying the spatial scope for aggregation. Noteworthy works in this area include dilated convolution [64] and deformable convolution [65,66]. These adaptations alter the convolution's receptive field, allowing for wider spatial information aggregation.
However, these convolutional operators have primarily been designed and studied for the RGB modality. How effectively they perform when applied to depth and abundant spectral modalities remains uncertain. Specifically, how these modified convolutional techniques function and adapt to HSI data is an open question that must be addressed to understand their efficacy across modalities. These alternative techniques aim to increase the number of features acquired from a single convolution layer. Aligning with this notion, this paper introduces a novel convolution technique termed "central difference convolution." The proposed method has unique qualities and advantages compared to existing approaches.

Proposed methodology
The hyperspectral image, which contains spectral-spatial features, is assumed to be represented as

$$X \in \mathbb{R}^{W \times H \times L_B},$$

where $X$ denotes the original input, $L_B$ the number of spectral bands, $W$ the width, and $H$ the height. Each band therefore contains $H \times W$ samples, and every sample is assigned one of $C_L$ classes. Each hyperspectral pixel in $X$ comprises $L_B$ spectral measurements and is associated with a one-hot label vector

$$Y = (y_1, y_2, \ldots, y_{C_L}).$$

The classification of hyperspectral pixels into the $C_L$ land-cover categories poses a significant challenge for classification models. This complexity arises from various factors, such as diverse land-cover classes, high inter-class similarity, high intra-class variability, and overlapping and nested regions. Overcoming these complexities requires substantial and intensive efforts [67][68][69][70][71], so any model aiming to resolve these issues effectively faces significant obstacles. To reduce the spectral dimensionality from $L_B$ to $B$ while preserving the spatial dimensions of height ($H$) and width ($W$), Principal Component Analysis (PCA) is employed. This step is illustrated in Fig 1, where the selective reduction of spectral bands retains the spatial feature information essential for object recognition. After PCA, the data cube becomes the modified input $X \in \mathbb{R}^{W \times H \times B}$, where $B$ is the number of spectral bands retained after PCA reduction.

The data cube is then divided into small, overlapping 3-D patches for HSI classification. The ground-truth label of each patch is the pre-defined label of its central pixel. The patches $P \in \mathbb{R}^{S_{ws} \times S_{ws} \times B}$ are generated from $X$, where each patch is centered at a spatial location $(\alpha, \beta)$ and covers a spatial extent of $S_{ws} \times S_{ws}$, with $S_{ws}$ the window size. The number of 3-D neighboring patches generated from $X$ is $(W - S_{ws} + 1) \times (H - S_{ws} + 1)$. The 3-D patch at location $(\alpha, \beta)$, denoted $P_{(\alpha, \beta)}$, covers the range from $\alpha - (S_{ws} - 1)/2$ to $\alpha + (S_{ws} - 1)/2$ in width and from $\beta - (S_{ws} - 1)/2$ to $\beta + (S_{ws} - 1)/2$ in height, and includes all $B$ spectral bands of the PCA-reduced data cube $X$.

The 3-D patches are then forwarded to the CDCN blocks, which are arranged as presented in Fig 1. Each block is composed of two CDCN modules with one 3D attention module placed between them. This arrangement equips the architecture with 3D attention, makes it more robust, and helps to reduce the computational complexity while improving classification accuracy. The outputs of Block 1, Block 2, and Block 3 are concatenated and passed through the flatten layer. To extract deeper features, these features are passed through two more CDCN blocks, and finally through the fully connected layer for the classification task. More detail about the composition of the blocks and the dense connectivity of the proposed network is given in the next section.
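The preprocessing described above (PCA along the spectral axis, then overlapping patches labelled by their central pixel) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names `pca_reduce` and `extract_patches`, the (rows, columns, bands) array layout, the SVD-based PCA, and the choice to keep only patches that lie fully inside the cube (no padding) are all assumptions made for the example.

```python
import numpy as np

def pca_reduce(cube, n_components):
    """Project the spectral axis of a (rows, cols, L_B) cube onto its leading principal components."""
    rows, cols, bands = cube.shape
    flat = cube.reshape(-1, bands).astype(np.float64)  # astype copies, input is untouched
    flat -= flat.mean(axis=0)                          # centre each band
    # Rows of vt are the principal axes of the centred data.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    return (flat @ vt[:n_components].T).reshape(rows, cols, n_components)

def extract_patches(cube, labels, window):
    """Return every window x window x B patch lying fully inside the cube, with its centre label."""
    assert window % 2 == 1, "window must be odd so that a central pixel exists"
    rows, cols, _ = cube.shape
    r = (window - 1) // 2
    patches, centre_labels = [], []
    for a in range(r, rows - r):
        for b in range(r, cols - r):
            patches.append(cube[a - r:a + r + 1, b - r:b + r + 1, :])
            centre_labels.append(labels[a, b])
    return np.stack(patches), np.array(centre_labels)

# Toy data standing in for a hyperspectral scene (hypothetical sizes).
rng = np.random.default_rng(0)
cube = rng.normal(size=(10, 8, 30))
labels = rng.integers(0, 4, size=(10, 8))

reduced = pca_reduce(cube, n_components=5)
patches, y = extract_patches(reduced, labels, window=3)
print(reduced.shape, patches.shape, y.shape)
```

With a 10 × 8 scene and a window of 3, the patch count is (10 − 3 + 1) × (8 − 3 + 1) = 48, which matches the count formula given in the text.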

Central difference convolution
The CDC, as utilized in the proposed method, comprises a convolutional layer designed to extract features from hyperspectral data. It applies central differences within the convolution operation to capture complex spectral-spatial information in hyperspectral images. During convolution, the CDCN module computes the central differences between adjacent spectral bands or channels. This enhances the feature representation by emphasizing subtle spectral variations across neighboring bands, and allows the extraction of discriminative features that encapsulate the spectral and spatial characteristics unique to hyperspectral data. In modern deep learning frameworks, convolution operators play a fundamental role in capturing spatial-spectral features, and the convolution operation of a 3D-CNN remains consistent across the channel dimension. The CDC approach is implemented within the CNN to capture the finely detailed information in HSIs, which is vital for accurate classification. The following subsections briefly explain how the CDC is incorporated into the CNN.

Vanilla convolution.
The main operation in Convolutional Neural Networks (CNNs) for visual tasks is the 3D spatial-spectral convolution, referred to here as vanilla convolution. It involves two steps. The first step samples a local receptive field region $R_l$ from the input feature map $X_{fm}$. The second step aggregates the sampled values by weighted summation using the weights $W_{sv}$. The output feature map $Y_{fm}$ can therefore be expressed as

$$Y_{fm}(P_0) = \sum_{P_n \in R_l} W_{sv}(P_n) \cdot X_{fm}(P_0 + P_n). \tag{3}$$
Here $P_0$ denotes the current location on both the input and output feature maps, while $P_n$ enumerates the locations within the region $R_l$. For example, for a 3D-CNN convolution with a kernel size of 3 × 3 × 3 and a dilation of 1, the local receptive field region is $R_l = \{(-1, -1, -1), (-1, -1, 0), \ldots, (0, 1, 1), (1, 1, 1)\}$, which contains 27 offsets.
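As a small illustration of the receptive field region, the offsets in $R_l$ can be enumerated directly. The helper name `receptive_field` is hypothetical, and the sketch assumes a cubic kernel with an odd side length.

```python
from itertools import product

def receptive_field(kernel_size=3, dilation=1):
    """List the 3D offsets P_n covered by a cubic kernel centred on P_0."""
    r = (kernel_size - 1) // 2
    offsets = [d * dilation for d in range(-r, r + 1)]
    return list(product(offsets, repeat=3))

R_l = receptive_field()
print(len(R_l), R_l[0], R_l[1], R_l[-1])
# 27 (-1, -1, -1) (-1, -1, 0) (1, 1, 1)
```

Setting `dilation=2` keeps the same 27 offsets but spreads them over a 5 × 5 × 5 neighbourhood, which is how dilation widens the receptive field without adding weights.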

Vanilla convolution (Central difference).
The suggested methodology is based on the well-known local binary pattern [72,73], which analyzes intricate local relationships using a binary central difference approach. Our proposal incorporates central difference into vanilla convolution to enhance its representational characteristics and improve its generalization capacity. Central Difference Convolution (CDC) consists of two interconnected stages: sampling and aggregation. The sampling stage resembles vanilla convolution, while the aggregation stage differs, as depicted in Fig 2. In CDC, we aggregate the center-oriented gradient of the sampled feature values, so Eq 3 is modified accordingly:

$$Y_{fm}(P_0) = \sum_{P_n \in R_l} W_{sv}(P_n) \cdot \big( X_{fm}(P_0 + P_n) - X_{fm}(P_0) \big). \tag{4}$$

At the origin $P_n = (0, 0, 0)$, the gradient value is always zero with respect to the central location $P_0$. The classification of hyperspectral images involves analyzing complex and interconnected patterns, such as intensity-level semantics and gradient-level details, which are both crucial and mutually supportive. Combining the vanilla 3D-CNN convolution with 3D-CDC is therefore a promising way to improve the modeling capacity, offering increased resilience and discrimination. We present a generalized formulation of central difference convolution as follows:

$$Y_{fm}(P_0) = \theta_h \cdot \underbrace{\sum_{P_n \in R_l} W_{sv}(P_n) \cdot \big( X_{fm}(P_0 + P_n) - X_{fm}(P_0) \big)}_{\text{CDC}} + (1 - \theta_h) \cdot \underbrace{\sum_{P_n \in R_l} W_{sv}(P_n) \cdot X_{fm}(P_0 + P_n)}_{\text{VC}} \tag{5}$$
The hyper-parameter $\theta_h$ balances the impact of intensity-level semantic information and gradient-level detailed feature information. It is confined to the closed interval [0, 1]. Increasing the value of $\theta_h$ amplifies the significance of the central difference gradient information, enhancing its overall contribution.
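The effect of the balancing hyper-parameter can be seen on a single 3 × 3 × 3 neighbourhood. The sketch below evaluates the weighted combination of the gradient-level and intensity-level terms at one location. The averaging kernel and the constant patch are hypothetical inputs chosen so that the result is easy to check by hand, and the function name `cdc_at_centre` is an illustrative assumption.

```python
import numpy as np

def cdc_at_centre(patch, weights, theta):
    """Evaluate the generalized central difference convolution at the centre of one 3x3x3 patch."""
    centre = patch[1, 1, 1]
    vanilla = np.sum(weights * patch)                   # intensity-level term
    central_diff = np.sum(weights * (patch - centre))   # gradient-level term
    return theta * central_diff + (1.0 - theta) * vanilla

weights = np.full((3, 3, 3), 1.0 / 27.0)   # hypothetical averaging kernel
flat_patch = np.full((3, 3, 3), 5.0)       # constant intensity, hence zero gradient

print(cdc_at_centre(flat_patch, weights, theta=0.0))   # pure vanilla convolution
print(cdc_at_centre(flat_patch, weights, theta=1.0))   # pure central difference
```

On a constant patch the vanilla term returns the mean intensity (about 5), while the central difference term vanishes, which illustrates why larger values of the hyper-parameter emphasise local variation over absolute intensity.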

Implementation for CDC.
To successfully integrate CDC into modern deep learning frameworks, we adopt a strategy that involves decomposing and combining Eq 5 by incorporating a vanilla convolution and an extra central difference term.This novel convolution technique, known as CDC, draws its name from a similar concept introduced in [74,75].By implementing this approach, we enhance the effectiveness of CDC within contemporary deep learning frameworks.
Detailed derivation for CDC.
This subsection gives the derivation of the Central Difference Convolution (CDC), which is a crucial element of the proposed model. The derivation clarifies the mathematical foundations of the CDC, including its two essential elements, the Central Difference Term (CDT) and the Vanilla Convolution (VC). Combined, these elements form the basis of our method, as represented by Eq 6, and show how the CDC improves the feature extraction capabilities of the model.
$$\begin{aligned}
Y_{fm}(P_0) &= \theta_h \cdot \sum_{P_n \in R_l} W_{sv}(P_n) \cdot \big( X_{fm}(P_0 + P_n) - X_{fm}(P_0) \big) + (1 - \theta_h) \cdot \sum_{P_n \in R_l} W_{sv}(P_n) \cdot X_{fm}(P_0 + P_n) \\
&= \theta_h \sum_{P_n \in R_l} W_{sv}(P_n) \cdot X_{fm}(P_0 + P_n) - \theta_h \cdot X_{fm}(P_0) \sum_{P_n \in R_l} W_{sv}(P_n) + (1 - \theta_h) \sum_{P_n \in R_l} W_{sv}(P_n) \cdot X_{fm}(P_0 + P_n) \\
&= \underbrace{\sum_{P_n \in R_l} W_{sv}(P_n) \cdot X_{fm}(P_0 + P_n)}_{\text{VC}} + \theta_h \cdot \underbrace{\Big( -X_{fm}(P_0) \cdot \sum_{P_n \in R_l} W_{sv}(P_n) \Big)}_{\text{CDT}}
\end{aligned} \tag{6}$$

In this equation, $Y_{fm}(P_0)$ represents the output at the central position $P_0$, and $\theta_h$ is the hyper-parameter that weights the contribution of the central difference term (CDT) relative to the vanilla convolution term (VC). $W_{sv}(P_n)$ denotes the weights for the neighboring positions $P_n$ within the local region $R_l$. $X_{fm}(P_0 + P_n)$ and $X_{fm}(P_0)$ are the feature map values at the neighboring position and at the central position, respectively. The CDT captures local differences in the feature maps, emphasizing edges and fine details, while the VC aggregates local information, ensuring robustness to noise. The hyper-parameter $\theta_h$ thus adjusts the model's sensitivity to local differences.
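The practical benefit of the decomposition is that CDC can be computed from an ordinary convolution plus a cheap correction involving only the centre value and the sum of the kernel weights. The NumPy sketch below, which is an illustration under stated assumptions rather than the authors' code, implements the decomposed form for a single-channel volume with no padding and checks it against a term-by-term evaluation of the generalized formulation. The function names and the odd cubic kernel are assumptions of the example.

```python
import numpy as np

def vanilla_conv3d(x, w):
    """Valid (no padding) 3D cross-correlation of a volume x with a kernel w."""
    kd, kh, kw = w.shape
    out = np.zeros((x.shape[0] - kd + 1, x.shape[1] - kh + 1, x.shape[2] - kw + 1))
    for i, j, k in np.ndindex(out.shape):
        out[i, j, k] = np.sum(w * x[i:i + kd, j:j + kh, k:k + kw])
    return out

def cdc_decomposed(x, w, theta):
    """Vanilla convolution plus theta times the central difference term (odd kernel sizes)."""
    rd, rh, rw = (s // 2 for s in w.shape)
    centre = x[rd:x.shape[0] - rd, rh:x.shape[1] - rh, rw:x.shape[2] - rw]
    return vanilla_conv3d(x, w) + theta * (-centre * w.sum())

def cdc_direct(x, w, theta):
    """Reference: weighted sum of the central difference and vanilla terms at every location."""
    kd, kh, kw = w.shape
    out = np.zeros((x.shape[0] - kd + 1, x.shape[1] - kh + 1, x.shape[2] - kw + 1))
    for i, j, k in np.ndindex(out.shape):
        patch = x[i:i + kd, j:j + kh, k:k + kw]
        centre = patch[kd // 2, kh // 2, kw // 2]
        out[i, j, k] = theta * np.sum(w * (patch - centre)) + (1 - theta) * np.sum(w * patch)
    return out

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 7, 8))      # toy feature volume
w = rng.normal(size=(3, 3, 3))      # toy kernel
theta = 0.7                         # arbitrary value in [0, 1]

same = np.allclose(cdc_decomposed(x, w, theta), cdc_direct(x, w, theta))
print(same)
```

In a deep learning framework the same idea would typically reuse the built-in convolution for the first term, so the extra cost of CDC is small.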

Attention module
Fig 3 visualizes the attention module, which is an integral part of the CDC network. The spatial module computes the spatial attention map $M_S$ for the CDCN features $C$, which is then used to refine and fine-tune the features:

$$C' = M_S(C) \otimes C,$$

where $\otimes$ denotes feature-wise multiplication. Attention blocks based on this feature-wise multiplication have demonstrated their empirical effectiveness in natural image processing [76,77]. Average pooling and max pooling are applied along the filter axis, producing the descriptors $F^C_{avg}$ and $F^C_{max}$, which direct the spatial attention mechanism towards the CDC spatial feature maps $C$. The two descriptors are concatenated and passed through the convolution function $f$, and the output is then passed through the non-linear sigmoid activation function $\sigma$. To summarize, the entire procedure can be described as

$$M_S(C) = \sigma\big( f([F^C_{avg}; F^C_{max}]) \big).$$
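The attention computation can be sketched in NumPy as follows. This is a minimal sketch under several assumptions that the text does not fix: the features are laid out as (filters, depth, height, width), the convolution $f$ is a single cubic kernel with zero padding that preserves the spatial size, and the kernel weights are supplied by the caller. The function name `spatial_attention_3d` is hypothetical.

```python
import numpy as np

def spatial_attention_3d(features, kernel):
    """Pool over the filter axis, convolve the two descriptors, and gate the features.

    features: array of shape (filters, D, H, W)
    kernel:   array of shape (2, k, k, k), one slice per pooled descriptor (assumed layout)
    """
    avg_desc = features.mean(axis=0)              # average pooling along the filter axis
    max_desc = features.max(axis=0)               # max pooling along the filter axis
    desc = np.stack([avg_desc, max_desc])         # concatenated descriptors, shape (2, D, H, W)

    k = kernel.shape[-1]
    r = k // 2
    padded = np.pad(desc, ((0, 0), (r, r), (r, r), (r, r)))   # zero padding keeps D, H, W
    conv = np.zeros(avg_desc.shape)
    for i, j, l in np.ndindex(conv.shape):
        conv[i, j, l] = np.sum(kernel * padded[:, i:i + k, j:j + k, l:l + k])

    attention = 1.0 / (1.0 + np.exp(-conv))       # sigmoid
    return attention, features * attention        # broadcast over the filter axis

rng = np.random.default_rng(2)
features = rng.normal(size=(4, 5, 6, 6))
kernel = rng.normal(scale=0.1, size=(2, 3, 3, 3))
attention, refined = spatial_attention_3d(features, kernel)
print(attention.shape, refined.shape)
```

Because the attention map has one value per spatial-spectral location and is shared by all filters, it rescales locations rather than channels.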

3D-CDC Attention Dense Network model
The 3D-CDC Attention Dense Network model is described in terms of the layer types employed, the dimensions of the output maps generated, and the number of parameters involved. The first dense layer contains the greatest number of parameters among all the layers. The number of nodes in the final dense layer corresponds to the number of categories in the Houston2018 dataset, which is seven. Because the parameter count of the proposed model varies with the number of classes in the dataset, a single total parameter count cannot be stated. The network's weights are randomly initialized and then trained with the back-propagation algorithm, using the Adam optimizer and the softmax loss function. Training uses mini-batches of 256 samples and runs for 100 epochs. Batch normalization and data augmentation are not used during training.

Datasets explanation
The experiments in this study use three hyperspectral datasets: Indian Pines (IP), University of Pavia (UP), and Houston 2018 (HT). Each dataset is described below. These datasets are publicly available and can be downloaded from the website www.ehu.eus [78].

Indian Pines (IP):
The IP dataset was acquired in 1992 with the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) [79] sensor. The scene comprises several agricultural fields with an organized geometric structure and some areas of irregular forest. The image is 145 × 145 pixels and contains 224 spectral bands spanning the wavelength range of 400 to 2500 nanometers, captured at a spatial resolution of 20 meters per pixel. After removing four null bands and 20 other bands affected by atmospheric water absorption, the pre-processed data consists of the 200 remaining bands, which are used for experimentation. Almost 50% of the dataset, i.e., 10,249 pixels out of the total 21,025, has ground-truth information that provides a single label belonging to one of 16 different classes.

Pavia University (UP):
The UP dataset, covering the campus of the University of Pavia in northern Italy, was obtained using the Reflective Optics System Imaging Spectrometer (ROSIS) [80] sensor. It mainly encompasses an urban environment characterized by numerous solid structures such as asphalt, gravel, and metal sheets, along with natural objects such as trees, meadows, and soil. The dataset also includes shadows. After the elimination of the noisy bands, 103 spectral bands remain within the spectral range of 0.43 to 0.86 micrometers, with a spatial resolution of 1.3 meters per pixel and an image size of 610 × 340 pixels. Out of the total of 207,400 pixels, approximately 20%, amounting to 42,776 pixels, are annotated with ground-truth information belonging to nine distinct class labels. For this dataset, the model processes an input of 200 units in width, 200 units in height, and 103 units in bands.
Houston 2018 (HT):
The Houston 2018 scene covers an area of 210 × 954 pixels and comprises 48 spectral bands within the wavelength range of 380 to 1050 nanometers, with a ground sampling distance of 1 meter. The ground-truth data for the Houston 2018 scene has a pixel size of 0.5, meaning that each ground-truth pixel corresponds to a physical area of 0.5 square units. To process the Houston 2018 scene, a model input of 200 (width) × 200 (height) × 48 (bands) is used. Reference [78] provides further details about the experimental datasets.

In all experiments, the samples were divided into training and test sets in a 70:30 ratio, with 70% reserved for training and the remaining 30% for testing. To ensure fair evaluations, we used a learning rate of 1e-2 and a decay rate of 1e-6 for all experiments. Furthermore, we used the rectified linear unit (ReLU) activation function in all layers except the last, where we employed the softmax activation. The patch sizes used in our experiments were 13 × 13 × 20, 17 × 17 × 20, 21 × 21 × 20, and 25 × 25 × 20. The spectral depth of 20 in these patches was obtained by using the PCA method to retain the 20 most informative bands. For optimization, the Adam optimizer was used, and the model was trained for 100 epochs.
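The preprocessing described in the experimental settings (PCA to a spectral depth of 20, window-sized patches, and a 70:30 split) can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in the experiments: the function names, the SVD-based PCA, the zero padding at the image border, and the random seed are all assumptions made for the example.

```python
import numpy as np

def pca_reduce(cube, n_components=20):
    """Project an (H, W, B) cube onto its leading principal components."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b).astype(np.float64)   # astype returns a copy
    flat -= flat.mean(axis=0)                       # centre each band
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    return (flat @ vt[:n_components].T).reshape(h, w, n_components)

def extract_patches(cube, labels, window=25):
    """Cut a window x window x C patch around every labelled pixel (label > 0)."""
    r = window // 2
    # Assumption: zero padding so that border pixels also get full patches.
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)))
    rows, cols = np.nonzero(labels)
    patches = np.stack([padded[i:i + window, j:j + window]
                        for i, j in zip(rows, cols)])
    return patches, labels[rows, cols] - 1          # zero-based class ids

def split_train_test(x, y, train_fraction=0.7, seed=0):
    """Random split with 70% of the samples for training by default."""
    order = np.random.default_rng(seed).permutation(len(y))
    n_train = int(round(train_fraction * len(y)))
    tr, te = order[:n_train], order[n_train:]
    return x[tr], y[tr], x[te], y[te]

# Synthetic stand-in for a hyperspectral scene (48 bands, 3 classes).
rng = np.random.default_rng(0)
demo_cube = rng.normal(size=(30, 30, 48))
demo_labels = rng.integers(0, 4, size=(30, 30))     # 0 marks unlabelled pixels
demo_patches, demo_y = extract_patches(pca_reduce(demo_cube), demo_labels, window=13)
x_tr, y_tr, x_te, y_te = split_train_test(demo_patches, demo_y)
print(demo_patches.shape, x_tr.shape, x_te.shape)
```

Each patch is centred on its labelled pixel, so an odd window size such as 13, 17, 21, or 25 is assumed.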

Evaluation metrics.
The evaluation of hyperspectral image (HSI) classification performance relies on three complementary assessment metrics: overall accuracy (OA), average accuracy (AA), and the kappa coefficient (Kappa). OA measures the proportion of correctly classified test samples among all test samples and is calculated by the following equation.

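$$\mathrm{OA} = \frac{\text{No. of correct predictions}}{\text{Total no. of test samples}}$$

AA calculates the average accuracy across different classes. The mathematical equation of the AA is given by

$$\mathrm{AA} = \frac{1}{C} \sum_{c=1}^{C} \frac{\text{No. of correct predictions in class } c}{\text{No. of test samples in class } c}$$

where $C$ is the number of classes. Kappa is a statistical metric that quantifies the agreement between the ground truth and the classification map and is calculated by the equation

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the observed agreement (equal to OA) and $p_e$ is the agreement expected by chance.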
A comprehensive analysis of these evaluation measures is essential for accurately assessing HSI classification performance.
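The three metrics can be computed from a confusion matrix, as in the following sketch. It follows the standard definitions of OA, AA, and Cohen's kappa; the function names are illustrative, and the code assumes that every class has at least one test sample.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def oa_aa_kappa(cm):
    total = cm.sum()
    oa = np.trace(cm) / total                        # overall accuracy
    aa = (np.diag(cm) / cm.sum(axis=1)).mean()       # mean of per-class accuracies
    # Chance agreement from the row and column marginals.
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - p_e) / (1 - p_e)
    return oa, aa, kappa

# Toy example with three classes and six test samples.
cm_demo = confusion_matrix([0, 0, 0, 1, 1, 2], [0, 0, 1, 1, 1, 2], n_classes=3)
oa_demo, aa_demo, kappa_demo = oa_aa_kappa(cm_demo)
print(f"OA={oa_demo:.4f}  AA={aa_demo:.4f}  Kappa={kappa_demo:.4f}")
```

In the toy example, five of six samples are correct, so OA is 5/6, while AA averages the per-class accuracies 2/3, 1, and 1.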
Experiments of proposed method using the different window sizes.
First, we measured the efficiency of the proposed model with different window sizes; the outcomes are listed in Table 2. Four cases were formulated, one per window size, to evaluate the efficiency of the proposed model.

Case 1:
In the first scenario, we selected the 13 × 13 patch size with 20 spectral bands. Experiments were performed on the three datasets, i.e., HT, PU, and IP. The proposed method achieved an OA of 97.75% on the HT dataset, 99.57% on PU, and 98.94% on IP. With the 13 × 13 window size, the highest OA was therefore obtained on the PU dataset.

Case 2:
In the second case, we selected the 17 × 17 window size and noted the efficiency of our method. Table 2 shows that 3D-CDC Attention DenseNet produced its highest OA, 99.80%, on the PU images.

Case 3:
In the third case, the 21 × 21 window size was used to check the behavior of 3D-CDC Attention DenseNet on the HSI datasets under study. With this window size, the proposed method achieved its top overall accuracy on PU, i.e., 99.86%, and its second-best result on the IP dataset with 98.71%.

Case 4:
In this case, the largest window size, 25 × 25, was chosen for the experiments. The proposed method was again effective, achieving its highest OA on the PU dataset at 99.89% and its second-best result on IP. The experiments across all four cases show that our method performs well with different window sizes and produces good results on the different datasets. Moreover, all the experimental settings used an equal number of spectral bands, i.e., 20. In the next part of the experiments, the performance of the proposed method is compared with the benchmarks.

Comparison with benchmarks.
The comparative analysis covers five main approaches based on Convolutional Neural Networks (CNNs): the Semi-Supervised 3D-CNN method [81], the Spectral-Spatial 3D-SSCNN method [82], the Fast and Compact 3D-FCCNN method [83], the Hybrid-SN method [13], and the Jigsaw-HSI method [84]. We performed extensive experiments with the proposed method and the comparison methods to assess the efficiency of our method. First, all the models were tested with the 13 × 13 window, the smallest window size in our experiments. Twenty spectral bands were used with all the window sizes. The results of all compared models are listed in Table 3 and in Fig 6. The 3D-FCCNN algorithm showed good results on the HT and PU datasets but could not maintain consistency on the IP dataset, where it reached an overall accuracy of 95.21%, which is 0.02 lower than that of the 3D-CNN model. 3D-CNN provided better results on the IP dataset than the rest of the algorithms except the proposed model. In this series, Jigsaw-HSI did not perform well on the HT, PU, and IP datasets. In contrast, the proposed method outperformed the comparison methods, achieving OA = 98.75%, AA = 79.01%, and Kappa = 92.29% on the HT dataset; OA = 99.57%, AA = 89.97%, and Kappa = 99.46% on PU; and OA = 98.94%, AA = 98.48%, and Kappa = 98.79% on the IP dataset, which are good results for this window size.

To further check the reliability of the proposed method against the comparison methods on different patch sizes, experiments were performed with the 17 × 17 patch size. Here, 3D-SSCNN showed satisfactory results on the HT and PU datasets but yielded a lower AA of 71.78%, which is below that of 3D-CNN, and an OA on IP that is lower than those of 3D-CNN and 3D-FCCNN. Hybrid-SN and Jigsaw-HSI produced poor values overall, whereas the proposed method produced the top results on all the measuring scales. These results are listed in Table 4 and in Fig 7.

With the 21 × 21 window size, the proposed 3D-CDC Attention DenseNet again showed effective and superior results. Meanwhile, Jigsaw-HSI improved its accuracy on the measuring scales but still produced the lowest predictions. 3D-FCCNN produced a better OA (97.50%) only on HT and showed unsatisfactory results on the remaining measuring scales as well as on the other two datasets. These outcomes are recorded in Table 5 and Fig 8.

To make the experiments more comprehensive, a 25 × 25 patch size was also used. Here, 3D-FCCNN reached OA = 97.55%, the second-highest result on HT, but its AA was still lower than that of most methods, which placed this model third in the comparison. Hybrid-SN and Jigsaw-HSI did not produce prominent classification results. On the other hand, the proposed 3D-CDC Attention DenseNet consistently maintained the highest accuracy on all the evaluation metrics. These results can be seen in Table 6. The results indicate that our method outperformed the alternatives in terms of both loss and accuracy. The proposed method achieved good accuracy, demonstrating the efficiency of the proposed spatial attention mechanism. The spatial dimensions outlined in Table 6 were analyzed by our model, resulting in the computation of the accuracy metrics (OA, AA, and Kappa). These metrics are presented in Table 2.

Convergence analysis
From Fig 10, it can be observed that the proposed spatial attention mechanism performs well and produces fast and accurate results in just 15 epochs. The fast convergence of the proposed attention mechanism can be attributed to the following factors:
1. Selective Feature Learning: Attention mechanisms allow the model to selectively focus on specific regions or features within the input data. This selective attention enables the network to prioritize relevant information and ignore less important details, leading to a more efficient learning process.
2. Adaptive Learning Rates: Adam (short for Adaptive Moment Estimation) is an optimizer that maintains adaptive learning rates for each parameter. It computes the adaptive learning rates based on both the first-order moment (mean) and the second-order moment (uncentered variance) of the gradients. This adaptability allows Adam to converge quickly by adjusting the learning rates for each parameter individually.
4. Improved Gradient Flow: The attention mechanism can assist in improving the flow of gradients during backpropagation. By assigning higher gradients to more relevant features, the model can learn more effectively from the most critical information, facilitating faster convergence.
5. Enhanced Discriminative Power: Attention mechanisms enable the network to assign different weights to different input parts, allowing it to focus on the most discriminative features for the task at hand. This enhanced discriminative power can lead to quicker convergence as the model homes in on the crucial aspects of the data.
6. Facilitation of Long-Range Dependencies: In 3D data, capturing long-range dependencies is crucial for understanding temporal dynamics. Attention mechanisms facilitate the modeling of long-range dependencies by allowing the model to selectively attend to relevant frames or sequences, aiding faster convergence.
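The adaptive learning rates mentioned in the list above can be made concrete with a single Adam update step. The sketch below follows the standard Adam update rule; the learning rate of 1e-2 matches the experimental settings, while the values of beta1, beta2, and eps are the commonly used defaults and are assumptions here, as are the function name and the toy objective.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                 # bias correction for zero initialisation
    v_hat = v / (1 - beta2 ** t)
    # Each parameter gets its own effective step size lr / sqrt(v_hat).
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy usage: minimise f(x) = x^2 elementwise, whose gradient is 2x.
x = np.array([5.0, -3.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
for t in range(1, 101):
    x, m, v = adam_step(x, 2 * x, m, v, t)
print(x)
```

Because the update is normalised by the square root of the second moment, the very first step has a magnitude of approximately lr for every parameter, regardless of the gradient scale.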

Ablation study
An ablation study was performed to measure the effectiveness of each module in the proposed architecture, in which the 3D-CDC network is combined with spatial attention and dense-net modules. The results of the ablation experiments are listed in Table 7. The first experiment used a CNN with vanilla convolution, which achieved an OA of 96.07% with a computational time of 2.78 seconds. In the second experiment, the efficiency of the spatial attention (SA) module was measured with the CNN. Adding the SA module increased the overall efficacy of the CNN, with an OA of 96.27% in 3.98 seconds. In the third experiment, the CDCN (central difference convolutional network) achieved 97.88% OA in 4.98 seconds. In the fourth experiment, 3D-CDCN with the SA module achieved the highest accuracy, 97.93%, in less time, i.e., 4.38 seconds, which is 0.6 seconds less than the CDCN without the SA module. Overall, the SA module increases the performance of the proposed network.
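To illustrate the difference between the vanilla convolution and the central difference convolution compared in the ablation study, the following sketch applies a single 3D kernel to a small volume. It uses the commonly cited formulation of central difference convolution, y(p0) = Σ w(pn) x(p0 + pn) − θ x(p0) Σ w(pn), which reduces to vanilla convolution for θ = 0. The value θ = 0.7, the kernel size, and the function name are assumptions for illustration and are not taken from the experiments reported here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def cdc3d(x, w, theta=0.7):
    """Single-channel 3D central difference convolution without padding.

    x: (D, H, W) input volume, w: (k, k, k) kernel with odd k.
    """
    windows = sliding_window_view(x, w.shape)        # (D', H', W', k, k, k)
    vanilla = np.einsum("dhwijk,ijk->dhw", windows, w)
    c = tuple(s // 2 for s in w.shape)
    center = windows[..., c[0], c[1], c[2]]          # x(p0) for every output position
    # theta = 0 gives vanilla convolution, theta = 1 uses only local differences.
    return vanilla - theta * center * w.sum()

rng = np.random.default_rng(0)
volume = rng.normal(size=(7, 7, 5))                  # toy spatial-spectral block
kernel = rng.normal(size=(3, 3, 3))
print(cdc3d(volume, kernel).shape)
```

With θ = 1 the operator responds only to local differences, so a constant input volume produces a zero output; intermediate values of θ blend intensity and gradient information.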

Computational efficiency
The run time of each model was measured to determine its computational efficiency. The average run times are presented in Table 8, which shows that information is processed quickly with a small patch size, i.e., 13 × 13. HybridSN produced its results fastest and ranks first. 3D-FCCNN is the second-fastest model, while Jigsaw-HSI ranks third in this series. Our model ranks fourth. The proposed method is deeper than the comparison algorithms; in return, it handles the detailed intrinsic information and provides the best classification results. The proposed model becomes more efficient at the larger patch size of 25 × 25, where 3D-CDCN produced its results in just 4.38 seconds, making it the second-fastest model at this size, while the other models do not sustain their efficiency. We can therefore say that the proposed model is the best fit and reduces the complexity of processing HSIs.
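An average run time of the kind reported above can be measured with a small timing helper. The sketch below is only an illustration of the measurement idea; the function name, the number of repetitions, and the dummy workload are assumptions and do not reflect the exact protocol behind Table 8.

```python
import time

def average_runtime(fn, repeats=5):
    """Mean wall-clock time in seconds of calling fn() `repeats` times."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)

# Dummy workload standing in for one inference pass of a model.
elapsed = average_runtime(lambda: sum(i * i for i in range(10_000)), repeats=3)
print(f"average run time: {elapsed:.6f} s")
```

Averaging over several repetitions reduces the influence of one-off effects such as caching or background load on the reported time.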

Comparison with base-line deep learning methods
We extensively evaluated the performance of the proposed approach, which is built on a 3D layered architecture, by comparing it with several deep learning methods. We first examined our method in the context of 3D layered architectures during the experimental phase. Afterward, we extended the comparison to deep learning methods beyond this specific architecture. The complete comparison with baseline deep learning models is shown in Table 9, which clearly shows the superiority of our approach over common deep learning models such as 1D-CNN, 2D-CNN, SSFCN, and GCN across three distinct datasets: HT, PU, and IP. The 1D-CNN has a reasonable level of performance, with its maximum overall accuracy (OA) of 80.23% on the HT dataset. Its performance, however, declines significantly on the PU and IP datasets, indicating limitations in handling these kinds of data. While 2D-CNN outperforms 1D-CNN, particularly on the PU dataset with an OA of 88.19%, it is still not as effective as more advanced models. This implies that its ability to capture the complex structure of the datasets is limited. The SSFCN model consistently achieves good performance across all datasets, with its maximum OA of 89.62% observed on the IP dataset. This demonstrates a superior ability to manage diverse data structures in contrast to the prior models. Although GCN performs better on the HT dataset, with an OA of 71.41%, its effectiveness diminishes significantly on the IP dataset, indicating possible challenges in generalizing across heterogeneous data types. The proposed method demonstrates superior performance compared to all other models across all datasets, with OA scores of 97.93%, 99.89%, and 99.38% on the HT, PU, and IP datasets, respectively. These results show the efficacy of the 3D layered architecture in managing diverse and intricate data structures. The significant superiority of our method over the baseline models emphasizes its improved capacity to reliably identify and predict results, even in complex scenarios. The investigation highlights the improved efficiency of the proposed method in handling complex data structures. This comparative analysis not only confirms the success of our approach but also paves the way for its application in increasingly complex and diverse data environments.

Limitation of the network
Using 3D Convolutional Neural Networks (CNNs) with 3D attention helps handle moving images and time-related data. However, these methods have limitations. They require substantial computing power because they process complex 3D data, which can slow down processing, especially with large datasets. There is also a risk of overfitting, in which the model focuses too closely on small details in the training data. This may cause it to struggle with new or different data, especially when little training data is available.
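The computational burden of 3D convolution can be illustrated by counting parameters and multiply-accumulate operations (MACs) for a single layer. The sketch below compares a 2D layer that treats the 20 bands as input channels with a 3D layer that slides over the bands as a third dimension. The layer sizes (3 × 3 and 3 × 3 × 3 kernels, 32 filters, a 25 × 25 × 20 patch with same-size output) are hypothetical choices for the example and do not describe the proposed network.

```python
from math import prod

def conv_cost(kernel, c_in, c_out, out_shape):
    """Return (parameters incl. bias, multiply-accumulates) of one conv layer."""
    weights = prod(kernel) * c_in * c_out
    return weights + c_out, weights * prod(out_shape)

# 2D layer: the 20 bands act as input channels, output is 25 x 25 per filter.
params_2d, macs_2d = conv_cost((3, 3), c_in=20, c_out=32, out_shape=(25, 25))
# 3D layer: one input channel, output is 25 x 25 x 20 per filter.
params_3d, macs_3d = conv_cost((3, 3, 3), c_in=1, c_out=32, out_shape=(25, 25, 20))

print(f"2D: {params_2d} parameters, {macs_2d} MACs")
print(f"3D: {params_3d} parameters, {macs_3d} MACs")
```

Under these assumptions the 3D layer has fewer parameters (896 versus 5,792) but requires three times as many MACs (10.8 million versus 3.6 million), because its output keeps the spectral dimension.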

Discussion
Various experiments were performed to measure the effectiveness of the proposed 3D-CDC Attention DenseNet. Three publicly available datasets, i.e., HT, PU, and IP, were used. First, the proposed method was tested with different patch sizes: the window size was set to 13 × 13, 17 × 17, 21 × 21, and 25 × 25 with 20 spectral bands. From Table 2, it can be seen that the performance of the proposed method increased with larger patch sizes.
In the second experiment, we compared the performance of the proposed method with existing CNN-based methods. For a fair comparison, most of the selected methods are 3D, and an identical experimental environment was provided. These methods were also tested on different patch sizes. Summarizing all the experiments, the performance of the comparison methods varied with patch size and dataset. For example, 3D-FCCNN produced OA = 98.36% on the HT dataset, the second-best result. This model could not maintain its position on the PU dataset, where it achieved OA = 95.21% and ranked third, as it did on the IP dataset. Jigsaw-HSI produced the lowest results with the 13 × 13 window size. In contrast, the proposed 3D-CDC Attention DenseNet shows the highest results across the different window sizes and maintains its consistency.
The second discussion point is each method's efficiency and capacity. For instance, the semi-supervised 3D-CNN can capture distinct spatial classes across various wavelengths, enabling the analysis of a broad spectrum of spectral data. Compared to a conventional 2D-CNN, the Spectral-Spatial 3D-SSCNN introduces higher computational complexity and makes it challenging to obtain a diverse and representative HSI dataset for training. The Fast and Compact 3D-FCCNN approach also has a drawback: it divides the HSI cube into smaller overlapping patches, which leads to a loss of spatial context. Consequently, this method cannot capture global relationships between different regions. For Hybrid-SN and Jigsaw-HSI, losses in interpretability and transparency are an important consideration. To address these challenges and improve HSI classification performance and robustness, the 3D-CDC Attention DenseNet has been developed. This model focuses on extracting spatial-spectral feature maps, using joint local intrinsic detailed patterns and the interrelation among spectral features. The attention mechanism and dense network incorporate low-rank frequency feature information and guide feature tuning. As a result, these advancements overcome the challenges faced by existing state-of-the-art models.

Conclusion
In this article, we proposed a method for HSI classification based on central difference convolution, which incorporates the central difference into vanilla convolution to enhance the representational characteristics and generalization capacity of the convolutional neural network, so that it extracts the detailed intrinsic features needed for accurate classification. The proposed method exploits a 3D attention mechanism to select the more appropriate features, a 3D Central Difference CNN to extract the detailed intrinsic features, and dense connections to improve the robustness of the architecture. Although 2D-CNN and 3D-CNN-based approaches have been widely used for HSI classification, they have limitations: a 2D-CNN does not analyze the spatial and spectral data simultaneously. The 3D-CNN is a superior alternative, because precise HSI classification requires considering both spatial and spectral features; however, it ignores the intrinsic features that are important for the accurate classification of HSIs. Our proposed algorithm achieves superior experimental outcomes on three HSI benchmark datasets, Houston 2018 (HT), PU, and IP, establishing state-of-the-art results with different window sizes, i.e., 13 × 13, 17 × 17, 21 × 21, and 25 × 25, with 20 spectral bands. 3D-CDC Attention DenseNet produced a better OA score than the comparison methods on all three datasets and achieved its highest OA with the 25 × 25 × 20 patch, which shows the efficiency of the proposed method. Our experiments demonstrate that our approach not only surpasses conventional 3D-CNN-based models but also shows superiority over the baseline deep learning methods, i.e., 1D-CNN, 2D-CNN, SSFCN, and GCN, and outperforms state-of-the-art networks on various public benchmarks while maintaining lower complexity. For future work, we intend to introduce more convolutional operators into CNNs to improve the generalization of CNN-based architectures for HSI classification.

Fig 4. The proposed 3D-CDC Attention Dense Network architecture, summarized layer-wise for a window size of 25 × 25. The final layer of this architecture is specifically designed using the Houston 2018 dataset. https://doi.org/10.1371/journal.pone.0300013.g004

Fig 5 presents visual representations of the ground-truth images for all the experimental datasets.

Experimental settings.
Experiments were carried out in this study using the proposed model and the comparison methods. Three well-known HSI datasets, Houston-2018 (HT), IP, and PU, were used to check the efficiency of the 3D-CDCN. All the experiments were run in Colab Pro.

Fig 6. Our models, which incorporate state-of-the-art techniques, perform sophisticated processing of the Ground Truths with high precision and accuracy, considering the complexity of each spatial dimension. https://doi.org/10.1371/journal.pone.0300013.g006

Fig 7. Our models, which incorporate state-of-the-art techniques, perform sophisticated processing of the Ground Truths with high precision and accuracy, considering the complexity of each spatial dimension. https://doi.org/10.1371/journal.pone.0300013.g007

In Fig 9, we can observe the classification maps of Houston 2018 (HT), Pavia University (PU), and Indian Pines (IP). These maps present the geographical characteristics of each class based on different window sizes (spatial dimensions). Fig 10 illustrates the loss and accuracy across 100 epochs of training, comparing our method to other techniques.

Fig 8. Our models, which incorporate state-of-the-art techniques, perform sophisticated processing of the Ground Truths with high precision and accuracy, considering the complexity of each spatial dimension. https://doi.org/10.1371/journal.pone.0300013.g008

3. Reduction of Redundant Information: Attention mechanisms help in identifying and emphasizing important spatial and temporal features in the 3D data. By reducing the

Fig 9. Our models, which incorporate state-of-the-art techniques, perform sophisticated processing of the Ground Truths with high precision and accuracy, considering the complexity of each spatial dimension. https://doi.org/10.1371/journal.pone.0300013.g009

Fig 10. The epoch-wise evaluation on the Houston 2018, Pavia University, and Indian Pines datasets with a window patch size of 25 × 25, where a, b, and c are the accuracy graphs and d, e, and f are the loss graphs over 100 epochs. https://doi.org/10.1371/journal.pone.0300013.g010

Table 2. Based on our research findings, the efficacy of the proposed model is contingent upon the size of the window.
https://doi.org/10.1371/journal.pone.0300013.t002

Table 8. Computational times, measured in seconds, recorded for the HT experimental dataset across various window sizes.
https://doi.org/10.1371/journal.pone.0300013.t008