Remote sensing image super-resolution using multi-scale convolutional sparse coding network

With the development of convolutional neural networks, impressive success has been achieved in remote sensing image super-resolution. However, the performance of super-resolution reconstruction remains unsatisfactory due to the lack of details in remote sensing images compared to natural images. Therefore, this paper presents a novel multiscale convolutional sparse coding network (MCSCN) to carry out remote sensing image SR reconstruction with rich details. The MCSCN, which consists of a multiscale convolutional sparse coding module (MCSCM) with dictionary convolution units, improves the extraction of high-frequency features, and more plentiful feature information is obtained by combining sparse features at multiple scales. Finally, a sub-pixel convolutional layer that combines global and local features serves as the reconstruction block. Experimental results show that the MCSCN gains an advantage over several existing state-of-the-art methods in terms of peak signal-to-noise ratio and structural similarity.


Introduction
Many remote sensing applications rely on high-resolution (HR) images with rich details, such as target detection and recognition [1][2][3][4], classification [5][6][7][8][9], and segmentation [9,10]. However, some remote sensing satellites only provide images with low spatial resolution, which do not meet practical requirements in real-world scenes. Image super-resolution (SR) attempts to recover the HR image from the related low-resolution (LR) image; it is therefore an essential topic in remote sensing. SR methods can be divided into multiple image super-resolution (MISR) and single image super-resolution (SISR). We focus on SISR, a well-known ill-posed inverse problem in which the same LR image admits multiple HR solutions [11].
The SISR problem can be solved by three kinds of methods: interpolation-based, reconstruction-based, and learning-based. Interpolation-based methods, such as bicubic interpolation (Bicubic) and bilinear interpolation (BI), are simple to implement. However, their performance is limited to a few smooth images, and their inability to recover high-frequency information limits their application [12]. The reconstruction-based methods

In summary, the currently popular approaches typically have the following problems: 1) Difficulty of reproduction: some SR reconstruction methods contain many network layers, which necessitates complex hardware. Besides, the same model can obtain varied performance when trained with alternative tricks, implying that a gain in performance may be due not to a change in model architecture but to the application of some undisclosed training techniques. These qualities make such network models difficult to reproduce. 2) Inadequate feature utilization: most approaches fail to make full use of the LR image features and merely increase the depth of the network instead. It is critical for the network to learn how to make full use of these features to rebuild HR images. 3) Ignorance of domain expertise: domain expertise can be used to design better deep model architectures, e.g., the sparse coding model [23]. However, most networks are built purely from convolutional layers, which means all their knowledge about SR is learned from training data. Therefore, in deep learning-based methods, people's domain expertise about images, such as natural image priors and the image degradation model, is largely ignored.
This article presents a novel multiscale convolutional sparse coding network (MCSCN) to solve the problems mentioned above. It has been shown that domain expertise can improve SR performance [23]; we therefore adopt a multiscale convolutional sparse coding module (MCSCM) in the MCSCN, which combines sparse coding and deep learning. Firstly, we use the MCSCM to obtain image features at different scales, referred to as local multiscale features. Secondly, the outputs of each MCSCM are concatenated for global feature fusion. Finally, the combination of local multiscale features and global features maximizes the use of the LR image features. The contributions of this paper are as follows:
• Proposing a novel MCSCM. This module extracts multiscale features with stacked dictionary convolutional units, implements multiscale sparse coding using different convolutional kernel sizes, and adaptively improves image feature extraction.
• Combining convolutional sparse coding with deep learning for image SR. Based on dictionary convolutional units, we can construct a feed-forward neural network that carries out convolutional sparse coding. This improves performance by consolidating the merits of convolutional sparse coding, which encodes domain knowledge, with those of deep neural networks.
• Conducting an objective evaluation on several representative and state-of-the-art SR methods with remote sensing image datasets.

Materials and methods
In this section, we give a brief overview of the proposed network and then present the details of each part. Fig 1 shows the architecture of the network. We apply a new network that combines convolutional sparse coding and deep learning to image SR. Unlike most patch-based SR algorithms, our proposed network directly accepts LR images as input. Our model can be divided into three parts: the basic feature extraction (BFE), the multiscale convolutional sparse coding module (MCSCM) and the reconstruction module. Each module is described in the following. Given that sparse coding can be effectively implemented with generalized dictionary convolutional units (DCUs), it is straightforward to build a multilayer neural network that extracts the sparse features, so we first describe the DCUs.

Dictionary convolutional units (DCUs)
Given an image X ∈ R^(c×h×w) (c = 1 for gray images and c = 3 for RGB images) and q convolutional filters D ∈ R^(q×c×s×s), the convolutional sparse coding (CSC) model can be formulated as the following problem:

min_z (1/2)‖X − D ∗ z‖₂² + λ g(z),   (1)

where λ is a hyperparameter, ∗ denotes the convolution operator, z is the sparse feature map, and g(·) is a sparse regularizer. This problem can be solved by iterative methods, and the update is easily written as

z^(k+1) = prox_{ρλg}( z^k + ρ D^T ∗ (X − D ∗ z^k) ),   (2)

where ρ is the step size and D^T is the flipped version of D along the horizontal and vertical directions. Note that prox(·) is the proximal operator; if g(·) is the ℓ1-norm, the proximal operator is the soft shrinkage thresholding function. By the principle of algorithm unrolling [22], we can employ convolutional units to replace the filters and extend the proximal operator to an activation function σ(·), so Eq (2) can be rewritten as

z^(k+1) = σ( BN( z^k + ρ D^T ∗ (X − D ∗ z^k) ) ),   (3)

where we also take batch normalization (BN) into account and the convolutions with D and D^T become learnable layers. Eq (3) is called a dictionary convolutional unit (DCU); its implementation is shown in Fig 2. For the encoder module, we use convolution layers to map the feature space into the image space, and for the decoder module, we use convolution layers to map the residual between the input images and the reconstructed images from the image space back to the feature space. By stacking DCUs, the original CSC model can be represented as a deep neural network; this process can be regarded as an iterative auto-encoder [22].
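Under these definitions, a single DCU can be sketched as a PyTorch module. This is a minimal illustration of the unrolled update, not the authors' released code; the class name DCU, the channel counts, the step size, and the choice of ReLU as the activation are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class DCU(nn.Module):
    """One unrolled ISTA step of the CSC model.

    Following the paper's naming, the encoder convolution plays the role of
    D (feature space -> image space) and the decoder convolution plays the
    role of D^T (image residual -> feature space). The proximal operator is
    replaced by an activation, and BN stabilizes training.
    """

    def __init__(self, n_feats=64, n_colors=3, kernel_size=3, step=0.1):
        super().__init__()
        pad = kernel_size // 2
        self.encoder = nn.Conv2d(n_feats, n_colors, kernel_size, padding=pad)
        self.decoder = nn.Conv2d(n_colors, n_feats, kernel_size, padding=pad)
        self.bn = nn.BatchNorm2d(n_feats)
        self.act = nn.ReLU(inplace=True)
        self.step = step  # step size rho

    def forward(self, z, x):
        # residual between the input image x and the reconstruction D * z
        residual = x - self.encoder(z)
        # gradient step in feature space, then BN and activation (prox)
        return self.act(self.bn(z + self.step * self.decoder(residual)))
```

Stacking several such units, each with its own weights, yields the stacked dictionary convolutional units used in the MCSCM.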

Basic feature extraction (BFE)
BFE first embeds the LR image into the feature space and then lets the embedding feature pass through M mapping layers to obtain the output feature. We name the output feature from BFE as the base feature because we need to reconstruct the SR details by passing the feature through the MCSCM.
Given the input LR image I_LR ∈ R^(h×w×c), where h and w denote the height and width of the image, respectively, we first define the embedding feature and the M mapping layers as

F_e = Conv_{3×3}^{3,n}(I_LR),
F_i = f_{3×3,ReLU}^{i}(F_{i−1}), i = 1, …, M,

where Conv_{3×3}^{3,n} denotes a 3×3 convolution operation whose numbers of input and output channels are 3 and n, respectively, F_e is the embedding feature, f_{3×3,ReLU}^{i} represents the i-th mapping layer in BFE, and F_{i−1} and F_i are the input and output features of the i-th mapping layer.
Besides, we use local residual learning to integrate the features in BFE, so the entire BFE process can be formulated as

F_B = f_LRL(F_e, F_M) = F_e + F_M,

where f_LRL(·) denotes the local residual learning operation and F_B indicates the output feature of the BFE module.
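As a concrete illustration, the BFE stage (embedding, M mapping layers, and a residual skip) might look as follows in PyTorch. The channel count n and depth M are arbitrary choices, and the residual-addition form of the local residual learning is our reading of the description above, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class BFE(nn.Module):
    """Basic feature extraction: embed the LR image, pass it through
    M mapping layers, and fuse the result with the embedding feature
    via a local residual (skip) connection."""

    def __init__(self, n=64, M=4, n_colors=3):
        super().__init__()
        self.embed = nn.Conv2d(n_colors, n, 3, padding=1)  # 3 -> n channels
        self.mapping = nn.ModuleList(
            nn.Sequential(nn.Conv2d(n, n, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(M)
        )

    def forward(self, lr):
        f_e = self.embed(lr)   # embedding feature F_e
        f = f_e
        for layer in self.mapping:
            f = layer(f)       # mapping layers F_1 ... F_M
        return f_e + f         # local residual learning
```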

Multiscale convolutional sparse coding module (MCSCM)
PLOS ONE

As we know, the performance of the traditional ISTA algorithm for the CSC model depends highly on the hyperparameter configuration. The multiscale nature of an image is similar to that of human eyes observing an object. In order to detect image sparse features at different scales, we propose the multiscale convolutional sparse coding module (MCSCM). This module consists of a number of DCUs at different scales, as shown in Fig 2. The basic image feature F_B passes through stacked dictionary convolutional units (SDCU) with different convolutional filters (kernel sizes 3 × 3 and 5 × 5), respectively. The structure outputs P_1 and P_2 can be expressed as

P_i = SDCU(F_B; Θ_i), i = 1, 2,

where SDCU(·) denotes the sparse coding of the basic feature F_B predicted using the CSC model with parameter set Θ_i (i = 1, 2). Additionally, the output of each SDCU contains distinct sparse features. These sparse features carry more information, but using them directly for reconstruction would increase the computational complexity. In order to adaptively make use of these hierarchical features, we introduce the bottleneck layer and a 3 × 3 convolution following Xu et al. [22] and Li et al. [28]. The fused output can be formulated as

F_M = w_{3×3} ∗ (w_{1×1} ∗ concat(P_1, P_2) + b_0) + b_1,

where P_i (i = 1, 2) represents the output of the i-th stack of DCUs, w_{1×1} and w_{3×3} are the 1 × 1 and 3 × 3 convolution kernels and b_0, b_1 their biases, respectively; note that concat(·) is the concatenation operator.
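The fusion step can be sketched as a small PyTorch module. The class name FusionBlock and the channel sizes are illustrative; the two inputs stand for the outputs P_1 and P_2 of the two SDCU branches.

```python
import torch
import torch.nn as nn


class FusionBlock(nn.Module):
    """Hierarchical feature fusion: concatenate the two SDCU branch
    outputs, reduce channels with a 1x1 bottleneck, then refine with
    a 3x3 convolution."""

    def __init__(self, n_in=128, n_out=128):
        super().__init__()
        self.bottleneck = nn.Conv2d(n_in, n_out, 1)            # w_1x1, b_0
        self.refine = nn.Conv2d(n_out, n_out, 3, padding=1)    # w_3x3, b_1

    def forward(self, p1, p2):
        return self.refine(self.bottleneck(torch.cat([p1, p2], dim=1)))
```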

Image reconstruction
The LR inputs of previous super-resolution methods are often upsampled to the same dimensions as the HR image using Bicubic, which increases the computational complexity. The sub-pixel convolutional operation is widely applied to solve this problem in single image super-resolution [28,33]. Furthermore, it is critical to find a mechanism to combine the shallow and sparse features. As a result, a reconstruction structure is built from the basic feature F_B and the sparse features of the multiscale convolutional sparse model. As shown in Fig 1, the basic feature F_B and the sparse features from the MCSCM each pass through a sub-pixel convolutional layer, which rearranges an image tensor of dimensions H × W × Cr² into rH × rW × C. Then, the features are reconstructed into the SR image by a 3 × 3 standard convolution. It has been shown that this reconstruction structure makes use of the original feature information and prevents information loss [22].
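A minimal PyTorch sketch of such a sub-pixel reconstruction block follows. It operates on a single feature stream for simplicity (the actual model applies it to both F_B and the MCSCM features before the final convolution), and the channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn


class Upsampler(nn.Module):
    """Sub-pixel reconstruction: a convolution expands the channels to
    C * r^2, PixelShuffle rearranges (H, W, C*r^2) into (rH, rW, C),
    and a final 3x3 convolution produces the SR image."""

    def __init__(self, n_feats=128, scale=4, n_colors=3):
        super().__init__()
        self.expand = nn.Conv2d(n_feats, n_colors * scale ** 2, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)
        self.recon = nn.Conv2d(n_colors, n_colors, 3, padding=1)

    def forward(self, f):
        return self.recon(self.shuffle(self.expand(f)))
```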

Results and discussion
In this section, we evaluate the performance of our model on several benchmark test datasets. Firstly, we describe the datasets used for training and testing and give the implementation details. Secondly, we compare our model with several state-of-the-art methods. Finally, we present the results of our model and provide some analysis.

Datasets
We choose two datasets with plentiful scenes to verify the robustness of our proposed method, namely the aerial image dataset (AID) and the UCMerced Land Use dataset (UCM). The AID is a large aerial image dataset whose sample images are collected from Google Earth. It contains more than 10,000 images of 30 land-use scenes, including river, mountain, farmland, pond, and so on. The images of each category were carefully selected from different countries and regions of the world, so the intra-class diversity of the data is strongly increased. We randomly choose 20% of the images as the testing set and the remaining 80% as the training set.
The UCM dataset was released by the University of California in 2010. It contains 21 types of remote sensing scenes, such as medium residential, airplanes, storage tanks, and parking lots. Each class includes 100 images. We also randomly selected 80% of the images as the training set and 20% as the testing set.
During testing, we also use the RSSCN7 dataset and the 20-image test set (called Test20 for short) used by Fernandez-Beltran et al. [34].

Implementation details
During training, the image data are augmented by random rotations and flips to expand the dataset. We generate the LR images by Bicubic downsampling and extract LR patches of size 48 × 48. We set the number of training epochs to 1000. We train our model with the ADAM optimizer, setting the learning rate to 0.0001, β1 = 0.9 and β2 = 0.999. In our model, we use 4 DCUs per SDCU, and the output of the MCSCM has 128 features. Our model is directly trained and tested in RGB color space. In addition, the upscaling factors ×2, ×3 and ×4 are used for both training and testing. We implement the MCSCN in the PyTorch framework and train it on NVIDIA RTX 2080 Ti GPUs.
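The optimizer configuration above can be expressed in PyTorch as follows; the one-layer model here is only a stand-in for the full MCSCN.

```python
import torch

# Stand-in module; the real MCSCN would be passed here instead.
model = torch.nn.Conv2d(3, 3, 3, padding=1)

# ADAM with lr = 0.0001, beta1 = 0.9, beta2 = 0.999, as in the paper.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
```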

Evaluation metrics
The evaluation metrics for the experimental results are the peak signal-to-noise ratio (PSNR), the structural similarity (SSIM) and the spectral angle mapper (SAM). Given a reference image I and a reconstructed image Î, the most widely used metric is the PSNR, defined as follows:

RMSE(I, Î) = sqrt( (1/(K N)) Σ_{j=1}^{K} Σ_{i=1}^{N} (I_{i,j} − Î_{i,j})² ),
PSNR(I, Î) = 20 log₁₀( I_max / RMSE(I, Î) ),

where the index j is used to identify each of the K image bands, N is the total number of pixels in each band, and I_max is the maximum possible pixel value. The SSIM is calculated as

SSIM(I, Î) = ( (2 u_I u_Î + c₁)(2 σ_{IÎ} + c₂) ) / ( (u_I² + u_Î² + c₁)(σ_I² + σ_Î² + c₂) ),

where u_I and u_Î are the means of I and Î, respectively, σ_I² and σ_Î² are the variances of I and Î, respectively, and σ_{IÎ} is the covariance of I and Î. c₁ = (k₁L)² and c₂ = (k₂L)² are constants used to maintain stability, L is the dynamic range of the pixel values, and k₁ = 0.001 and k₂ = 0.003. Higher PSNR and SSIM values represent better image quality.
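These two metrics can be computed as in the NumPy sketch below. Note that this SSIM uses a single global window rather than the sliding window of common implementations, and it defaults to the conventional SSIM constants k1 = 0.01 and k2 = 0.03; both simplifications are ours.

```python
import numpy as np


def psnr(ref, rec, peak=255.0):
    """PSNR = 20 * log10(peak / RMSE), with RMSE over all bands."""
    diff = ref.astype(np.float64) - rec.astype(np.float64)
    rmse = np.sqrt(np.mean(diff ** 2))
    return 20 * np.log10(peak / rmse)


def ssim_global(ref, rec, L=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM from global means, variances, and covariance."""
    x, y = ref.astype(np.float64), rec.astype(np.float64)
    ux, uy = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - ux) * (y - uy)).mean()
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    return ((2 * ux * uy + c1) * (2 * cov + c2)) / \
           ((ux ** 2 + uy ** 2 + c1) * (vx + vy + c2))
```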
SAM considers each spectral band as a coordinate axis and computes the average angle between the spectral vectors of corresponding pixels of I and Î:

SAM(I, Î) = (1/N) Σ_{i=1}^{N} arccos( ⟨I_i, Î_i⟩ / (‖I_i‖₂ ‖Î_i‖₂) ),

where I_i and Î_i denote the spectral vectors at pixel i. Note that the ideal value of SAM is 0.
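A NumPy sketch of SAM follows; the small epsilon guarding against division by zero is our addition.

```python
import numpy as np


def sam(ref, rec, eps=1e-12):
    """Spectral angle mapper: average angle (in radians) between the
    spectral vectors of corresponding pixels. Inputs are H x W x K."""
    x = ref.reshape(-1, ref.shape[-1]).astype(np.float64)
    y = rec.reshape(-1, rec.shape[-1]).astype(np.float64)
    dot = (x * y).sum(axis=1)
    norms = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.mean(np.arccos(cos)))
```

Because SAM only measures angles, it is insensitive to a uniform scaling of the reconstructed spectrum.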


Loss function
We choose the L1 loss (i.e., the mean absolute error) as the loss function: the L2 loss (i.e., the mean square error) penalizes larger errors but is more tolerant of small errors, and thus often produces overly smooth results. The L1 loss can be formulated as

L₁ = (1/(hwc)) Σ | I^SR − I^HR |,

where h, w and c are the height, width and number of channels of the evaluated images, respectively, and I^SR and I^HR denote the reconstructed and ground-truth images.
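A NumPy sketch of this loss (in PyTorch training code, nn.L1Loss computes the same quantity):

```python
import numpy as np


def l1_loss(sr, hr):
    """Mean absolute error averaged over height, width, and channels."""
    diff = sr.astype(np.float64) - hr.astype(np.float64)
    return float(np.mean(np.abs(diff)))
```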

Ablation experiments
We designed a set of ablation experiments to verify the effectiveness of the MCSCM structure, covering the kernel sizes of the MCSCM and the number of stacked DCUs.
The ablation experiments on the kernel sizes of the MCSCM module are performed on the AID dataset with up-scaling factor ×4, as shown in Table 1. We test single-scale and multiscale convolution kernel sizes for the MCSCN to show the impact of multiscale processing on the reconstruction results. The reconstruction quality improves by 0.57-0.76 dB when employing convolution kernels of different scales. A small-scale convolution kernel can extract local details, whereas a large-scale kernel can extract broader global features [12,28]. We gain more plentiful details by integrating features collected from different convolution kernels, and better results are obtained by combining global and local multiscale features.
The ablation experiment on the number of stacked DCUs is shown in Fig 3. The PSNR and SSIM results with 4 stacked DCUs are higher than those with 2 or 6 stacked DCUs, indicating that 4 stacked DCUs are the most effective choice for the proposed structure.

Comparison with state-of-the-art methods
In this subsection, we compare our model with FSRCNN [35], VDSR [36], LGCNet [11], EDSR [37] and IRAN [29] on the RSSCN7, UCM and Test20 datasets. LGCNet and IRAN are representative SR models for remote sensing images, while the other methods are excellent models for natural scenes. For fairness, all these methods are trained and tested under the same conditions. Table 2 shows the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) with the up-scaling factors ×2 and ×4 for the methods mentioned above on the RSSCN7 dataset, covering the Grass, Field, Industry, RiverLake, Forest, Resident, and Parking classes. The results in bold indicate the best-performing methods. We obtain average PSNR gains of 0.126 dB and 0.121 dB for the up-scaling factors ×2 and ×4, respectively. Fig 4 shows the visual results obtained by our method and the compared methods on RSSCN7 with up-scaling factor ×4. To improve contrast, a tiny region marked by the red rectangle is enlarged, and the enlarged image is shown to the right of the images. As observed in the enlarged regions, our approach produces images with more refined boundaries and richer textures than the others, so our method is clearly superior to the compared methods. Furthermore, we also compare our model on the UCM test images and the Test20 dataset against the methods stated before plus MRMFSCSR [30] and ESRGAN [38]. Table 3 provides the values of PSNR, SSIM and SAM on the 4 test images from the UCM dataset and all images from the Test20 dataset with up-scaling factor ×4. Overall, the PSNR and SSIM of our model outperform the compared approaches. Figs 5 and 6 show visual comparisons with the previous methods on Test20 with up-scaling factors ×3 and ×4, respectively. Our model produces finer details, and the detailed information of the reconstructed SR images matches the ground-truth images more closely.
This demonstrates that our model achieves competitive performance compared with other methods.

Our method can achieve higher PSNR than EDSR and ESRGAN with a much smaller number of parameters.

The limitations of this research
The LR images of the training data are degraded using bicubic interpolation. Actual LR images have a different distribution from those generated synthetically with bicubic interpolation; as a result, our method cannot be used for blind SR. There are very few works whose target SR rates are higher than 8× [12]. Under such extreme upsampling conditions, it becomes challenging to preserve accurate local details in the image, and this limitation also exists in our model. The sub-pixel layer may produce artifacts near the boundaries of different blocks; on the other hand, it may cause unsmooth outputs [39]. Future research on deep learning for remote sensing image SR can proceed along the following directions:
• There is still a scarcity of datasets specific to remote sensing SR. Future research could create a remote sensing SR dataset with abundant LR and HR images. Besides, blind SR methods could also be applied to remote sensing images.
• Currently, most upsampling methods rely on bicubic interpolation. To overcome this shortcoming, upsampling can be learned in an end-to-end manner [39]. We will use such learning-based layers as the upsampling method in our model in the future.
• SR performance can be improved by combining multi-stage and multiscale features, which points in the direction of increased SR rates. In the future, we can look deeper into these scenarios.

Conclusion
In this paper, we put forward a novel SR model for remote sensing images that combines convolutional sparse coding and deep networks. We employ the multiscale sparse coding module to obtain multiscale sparse features, which we then fuse with global features to derive abundant features. By using sparse coding knowledge, we gain considerable improvements over several deep learning models.
In the future, we plan to apply the MCSCN approach to additional problems where sparse convolutional coding might be beneficial. The interplay of deep networks for low- and high-level vision tasks will also be investigated. We will also research the application of this model to multi-spectral images.