FDNet: An end-to-end fusion decomposition network for infrared and visible images

Infrared and visible image fusion can generate a fused image with clear texture and prominent targets under extreme conditions. This capability is important for all-weather, all-day detection and other tasks. However, most existing fusion methods extract features from infrared and visible images with convolutional neural networks (CNNs). These methods often fail to make full use of the salient objects and texture features in the raw images, leading to problems such as insufficient texture detail and low contrast in the fused images. To this end, we propose an unsupervised end-to-end Fusion Decomposition Network (FDNet) for infrared and visible image fusion. Firstly, we construct a fusion network that extracts gradient and intensity information from the raw images using multi-scale layers, depthwise separable convolution, and an improved convolutional block attention module (I-CBAM). Secondly, because FDNet extracts features based on the gradient and intensity information of the image, gradient and intensity losses are designed accordingly. The intensity loss adopts an improved Frobenius norm to adjust the weighting between the fused image and the two raw images so that more effective information is selected. The gradient loss introduces an adaptive weight block that determines the optimization objective based on the richness of texture information at the pixel scale, ultimately guiding the fused image toward more abundant texture information. Finally, we design a decomposition network with single- and dual-channel convolutional layers, which keeps the decomposed images as consistent as possible with the input raw images, forcing the fused image to contain richer detail information. Compared with various other representative image fusion methods, our proposed method not only produces good subjective visual quality, but also achieves advanced fusion performance in objective evaluation.


Introduction
Image fusion methods use suitable feature extraction methods and fusion strategies to generate a single image containing the key image information. These methods take two or more raw images, which provide complementary and redundant characteristics. In the realm of image fusion, one of the most important topics is infrared and visible image fusion [1], which can effectively extract the complementary and redundant information of the raw images and combine it into a high-quality, stable, and informative image. It has critical image processing applications, such as remote sensing [2,3], target detection [4,5], security surveillance [6], medical imaging [7,8], and military applications [9].
Infrared and visible image fusion methods can be broadly divided into two categories: traditional methods and deep learning-based methods [10][11][12][13]. The traditional methods typically accomplish image fusion in the spatial domain or the frequency domain using corresponding mathematical transformations, such as the wavelet transform [14], multiscale transforms [15,16], and sparse representation [17]. However, in the image fusion stage, all these methods require complex, manually designed fusion rules. Deep learning-based methods extract and combine image features using the strong feature learning capabilities of neural networks, and can be classified into supervised learning-based methods and unsupervised learning-based methods. Liu et al. [18] adopted a Convolutional Neural Network (CNN) for image fusion and made significant progress compared with traditional algorithms, but the CNN requires supervised training. For infrared and visible image fusion tasks, it is impossible to generate usable labeled data. In other words, it is impossible to artificially construct fused images that can serve as references for supervised training. To address this problem, Li et al. [19] proposed using a pre-trained VGG network to fuse infrared and visible images. This algorithm enables the extraction and fusion of multi-level deep features from the source images. Later, ResNet-50 [20] was used to extract and fuse deep features from the source images. However, a significant drawback of these models is their reliance on pre-trained CNN models as offline feature extractors. This limitation prevents adaptive extraction and fusion of features from the source images. Subsequently, scholars designed end-to-end network frameworks specifically for image fusion. Prabhakar et al. [21] proposed an unsupervised end-to-end convolutional neural network learning framework which, unlike other image fusion methods, does not require manually designed complex fusion strategies. This framework is more flexible and versatile than earlier approaches, but its performance is not optimal for specific image fusion tasks.
To solve the above problems, we propose a Fusion Decomposition Network called FDNet to achieve infrared and visible image fusion. Fig 1 shows a set of infrared and visible image pairs, and the corresponding fused images generated by deep learning-based methods and our proposed FDNet. There are two main aspects to the proposed network. On the one hand, considering that infrared and visible images come from different sensors and have different characteristics, we use multi-scale layers, depthwise separable convolution, and an improved Convolutional Block Attention Module (I-CBAM) to create a double-branch network framework that extracts the gradient and intensity information of the images. We also design a new loss function that represents the gradient and intensity information of each image. On the other hand, we consider not only the image fusion process, but also the image decomposition process from the fusion result back to the raw images. Accordingly, we design single- and double-channel convolutional layers to maintain consistency between each decomposition result and the corresponding raw image. As a result, the image fusion results can contain a large amount of detailed information.
Competing interests: We also declare that no competing interests exist.

The contribution of our work consists of the following main aspects:
1. We propose a novel deep learning-based method called FDNet for fusing infrared and visible images. Compared to traditional image fusion methods, our approach successfully completes the image fusion task without manually setting activity level measurements and fusion rules. The overall fusion method can simultaneously perform both the fusion and decomposition stages. The fusion network uses a double-branch network to complete feature extraction, including multi-scale layers, depthwise separable convolution, and I-CBAM. The decomposition network is composed of single- and double-channel convolutional layers, which makes the fusion result contain more scene detail information and improves the fusion performance of the network.
2. For the shallow feature extraction step, we design multi-scale convolutional network structures to extract image feature information with different receptive field sizes from infrared and visible images. This effectively solves the problem of insufficient feature extraction with a single-scale convolution kernel. It not only enriches the multi-scale convolution structures applied to the processed image, but also accurately extracts the image features of object regions and improves the shallow feature extraction ability.
3. For multi-scale deep feature fusion, we design a depthwise separable convolutional structure, which considers the channel information and the spatial information of the image regions separately. The depthwise and pointwise convolution operations keep the feature map size unchanged as the network deepens, improve the expressive ability of the network, and keep the network lightweight.
4. We propose a novel Frobenius norm loss function, an adaptive gradient loss function, and a structural similarity loss function between the decomposed fused image and the raw image, which together guide the network to generate the desired image fusion result.
The remainder of the paper is structured as follows: Section 2 reviews related work. Section 3 presents the overall framework, network architecture, and loss functions. Section 4 conducts experimental analyses, algorithm comparisons, and ablation experiments. Section 5 draws conclusions and suggests future work.

Fig 1. From left to right: the infrared image, the visible image, and the fusion results of the CNN [18], the Deeplearning approach [19], the ResNet50 approach [20], and our proposed FDNet. https://doi.org/10.1371/journal.pone.0290231.g001

Infrared and visible image fusion
With the emergence of various methods, image fusion techniques have made significant advancements. Currently, the most popular image fusion methods are the deep learning-based methods, which can be further classified into supervised learning-based methods and unsupervised learning-based methods. The main challenge for supervised image fusion methods is the lack of ground-truth fused images. DeepFuse [21] is the first image fusion method based on unsupervised learning, and includes an encoding step, an image fusion step, and a decoding step. As a general image fusion framework, DeepFuse does not perform well enough on specific problems. Subsequently, Li et al. [22] proposed DenseFuse, which incorporates a DenseBlock into the encoder-decoder structure and better preserves the original image information. Li et al. [23] proposed NestFuse, a method derived from DenseFuse that retains more detailed features and provides more infrared target information. However, finding an effective fusion strategy remains difficult. To address the issue of arbitrary fusion strategies, Ma et al. [24] proposed the FusionGAN framework based on a generative adversarial network. This framework utilizes a generator to extract and combine meaningful information from the raw images. The purpose of the discriminator in FusionGAN is to force the fused image to contain more of the detailed information in the visible image. However, the discriminator network cannot preserve all image detail information. Zhang et al. [25] proposed a novel generative adversarial network called GAN-FM, specifically designed to retain more detailed image information. In this network, a full-scale skip-connected generator is applied to extract shallow features, and two Markovian discriminators play adversarial games with the generator to fully retain the valid information in the infrared and visible images. In addition, a novel intensity masking generative adversarial network (IM GAN) [26] and an unsupervised continual-learning generative adversarial network (UIFGAN) [27] were designed to complement multimodal image information, but they fail to integrate the extracted features efficiently. Xu et al. [12] introduced attention mechanisms into the fusion network for feature extraction, while Liu [28] proposed an Attention-guided and Wavelet-constrained Generative Adversarial Network for infrared and visible image fusion (AWFGAN), based on Generative Adversarial Nets (GAN), which better preserves the important information of the raw images.

Attention mechanism
The attention mechanism plays an important role in the human visual system and has brought significant successes in the image processing field. Attention methods focus selectively on regions of interest by reassigning the weights of the input sequence [29]. The attention mechanism has many image processing applications, including target detection [30], image enhancement [31], and emotion recognition [32], and it can be divided into local attention, soft attention, and hard attention [33] according to how it is implemented. Among these, soft attention can be further subdivided into channel attention, spatial attention, and their combined modules. The Convolutional Block Attention Module (CBAM) is a typical combined module: its spatial attention model focuses only on important information regions to reduce resource consumption, and its channel attention model allocates channel resources effectively by considering the relationship between feature map channels. The CBAM equations are written as follows:
$$F' = M_C(F) \otimes F$$
$$F'' = M_S(F') \otimes F'$$
$$M_C(F) = \sigma\left(W_1(W_0(F^C_{avg})) + W_1(W_0(F^C_{max}))\right)$$
$$M_S(F') = \sigma\left(f^{7\times 7}([\mathrm{AvgPool}(F'); \mathrm{MaxPool}(F')])\right)$$
where $\otimes$ denotes element-wise multiplication, $F$ denotes the input feature map, $F'$ denotes the output of the channel attention mechanism, and $F''$ denotes the output of the spatial attention mechanism. $M_C(F)$ denotes the output weights of $F$ based on channel attention, and $M_S(F')$ denotes the output weights of $F'$ based on spatial attention. $\sigma$ denotes the sigmoid function. $W_0$ and $W_1$ denote the weights of the MLP. $F^C_{avg}$ and $F^C_{max}$ denote the average-pooled feature and the max-pooled feature, respectively. $f^{7\times 7}$ denotes a convolution operation with a filter size of 7 × 7.
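To make the channel attention step concrete, the following is a minimal pure-Python sketch of the weights $M_C(F)$ described above: average and max pooling per channel, a shared two-layer MLP, and a sigmoid. The toy feature map and the MLP weights `w0` and `w1` are hypothetical values chosen for illustration, and the ReLU between the two MLP layers follows the standard CBAM design rather than anything stated in this paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mat_vec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def shared_mlp(v, w0, w1):
    """Two-layer MLP with a ReLU in between; both pooled vectors share these weights."""
    hidden = [max(0.0, h) for h in mat_vec(w0, v)]
    return mat_vec(w1, hidden)

def channel_attention(feature, w0, w1):
    """feature: list of C channels, each an H x W list of lists. Returns C weights in (0, 1)."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    mx = [max(map(max, ch)) for ch in feature]
    a = shared_mlp(avg, w0, w1)
    m = shared_mlp(mx, w0, w1)
    return [sigmoid(x + y) for x, y in zip(a, m)]

def apply_channel_attention(feature, weights):
    """Scale every channel by its attention weight (element-wise multiplication)."""
    return [[[w * v for v in row] for row in ch] for ch, w in zip(feature, weights)]

# Toy input: C = 2 channels of size 2 x 2, reduced to a single hidden unit.
feature = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
w0 = [[0.5, 0.5]]        # hypothetical weights, shape (C/r) x C
w1 = [[1.0], [-1.0]]     # hypothetical weights, shape C x (C/r)
weights = channel_attention(feature, w0, w1)
refined = apply_channel_attention(feature, weights)
```

The spatial attention branch follows the same pattern, except that pooling runs across channels at each pixel and the MLP is replaced by a convolution.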

Depthwise separable convolution
Depthwise separable convolution (DSC) was developed by Laurent Sifre at Google Brain in 2013 and applied to AlexNet, where it moderately improved recognition accuracy and reduced the size of the model. The first layer of Inception V1 and Inception V2 also used depthwise separable convolution [34,35]. Within Google, Howard et al. [36] introduced efficient mobile models called MobileNets based on depthwise separable convolution. Depthwise separable convolution is a factorized convolution. It has two main steps, depthwise convolution and pointwise convolution, which are used to filter and to combine feature information, respectively. Compared to standard convolution, this factorization not only reduces computational complexity but can also yield better trained models, and it is widely used in image classification and image segmentation.
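The reduction in complexity can be checked with a short parameter count. The sketch below compares a standard k × k convolution with its depthwise separable factorization; the layer sizes (3 × 3 kernel, 32 input channels, 64 output channels) are hypothetical and bias terms are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard convolution: one k x k x c_in filter per output channel."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights of the factorized form: depthwise filtering plus pointwise mixing."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution that combines channels
    return depthwise + pointwise

k, c_in, c_out = 3, 32, 64        # hypothetical layer sizes
std = standard_conv_params(k, c_in, c_out)
dsc = depthwise_separable_params(k, c_in, c_out)
ratio = dsc / std                 # equals 1 / c_out + 1 / k**2
```

For these sizes the factorized layer needs 2,336 weights instead of 18,432, which is roughly one eighth.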

Overall framework
Infrared images have strong anti-interference capability and are not limited by weather conditions. Visible images provide significant texture and detail information, and have high spatial resolution. In order to enhance the feature extraction capabilities and image fusion performance for visible and infrared images, we propose FDNet, whose general framework is shown in Fig 4. The proposed FDNet consists of a fusion network and a decomposition network. The fusion network takes into account the different properties of raw images from different sensors, so we design a double-branch network to process the infrared and visible images, which differ considerably in spatial resolution. The purpose of the decomposition network is to make the fused images contain more abundant scene information, so we design single- and double-channel convolution layers to obtain more finely decomposed images. The two branches of the fusion network have the same network structure and shared parameters, and receive the infrared and visible images as inputs. Each branch consists of multi-scale layers, depthwise separable convolution, and I-CBAM. In the training stage, two modal images of the same size, 120 × 120, are first fed into the double-branch network. The multi-scale convolution layer not only extracts multi-scale features from the raw images, but also reduces the loss of image feature information. The depthwise separable convolution applies an independent spatial convolution to each channel of the multi-scale input features through the depthwise convolution operation, and then combines the channels into new features through the pointwise convolution operation. In this way, the number of network parameters is reduced and a lightweight network is constructed for deep feature extraction. Subsequently, I-CBAM focuses on the salient information of the infrared and visible images from both the channel and spatial aspects, and suppresses useless channel information to ensure that all salient features can be utilized during image fusion. The features extracted by the double-branch network from the infrared and visible images are combined using concatenation and convolution, shown as the large yellow box in Fig 4. Finally, the decomposition network extracts image feature information from the fused images with common convolutional layers and decomposes them into two branches to generate decomposed images consistent with the raw images. In the testing process, the fused images are generated using the trained model only.

Shallow feature extraction.
In deep learning-based methods, feature information is typically extracted using convolutional layers. However, when using a single-scale convolution kernel, feature extraction is insufficient, so multi-scale convolution kernels are used here. In the multi-scale feature extraction equations, $F_{in}$ and $F_{out}$ are the input feature map and the output feature map, respectively, $*$ represents the convolution operation, and $f_j$ represents a convolution kernel of size $j$ ($j$ = 1, 3, 5, 7).
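The idea of applying kernels of sizes 1, 3, 5, and 7 to the same input can be sketched as follows. This is an illustration under stated assumptions, not the authors' layer: the kernels are placeholder averaging filters rather than learned weights, zero padding is assumed so that every branch keeps the input size, and the branches are simply collected as separate channels, since how the paper combines them is not recoverable from the text.

```python
def conv2d_same(image, kernel):
    """2D correlation with zero padding so the output keeps the input size."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(k):
                for v in range(k):
                    y, x = i + u - pad, j + v - pad
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[u][v]
            out[i][j] = acc
    return out

def box_kernel(size):
    """Placeholder averaging kernel; a trained network would learn these weights."""
    return [[1.0 / (size * size)] * size for _ in range(size)]

def multi_scale_features(image, sizes=(1, 3, 5, 7)):
    """One output map per kernel size, each with a different receptive field."""
    return [conv2d_same(image, box_kernel(s)) for s in sizes]

# Toy 8 x 8 input image.
image = [[float((i * 8 + j) % 5) for j in range(8)] for i in range(8)]
features = multi_scale_features(image)
```

Because every branch preserves the spatial size, the four maps can be stacked or merged by a later layer without resizing.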

Deep feature fusion.
The depthwise separable convolution module plays a crucial role in deep feature fusion. Whereas a standard convolution operation considers the spatial and channel information of image regions jointly, depthwise separable convolution considers channel information and spatial regions separately and learns richer feature representations with fewer parameters. In this work, we employ the depthwise separable convolution module from the second to the fourth layers for deep feature extraction, and select Leaky ReLU as the activation function. Firstly, the depthwise stage adopts 3 × 3 convolution kernels to perform the spatial convolution of each channel, which decreases the number of parameters. Secondly, the pointwise stage deepens the network with 1 × 1 convolution kernels without changing the size of the feature map, which realizes cross-channel information interaction and integration, learns deep target information, and improves the expressive capability of the network. The parameters of the depthwise separable convolution layers are presented in Table 1.
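The two stages can be illustrated on a toy tensor. The sketch below applies one 3 × 3 kernel per channel (depthwise), mixes channels with a 1 × 1 convolution (pointwise), and then applies Leaky ReLU. The kernels, the mixing weights, and the negative slope of 0.2 are hypothetical choices for illustration; the actual layer parameters are those listed in Table 1.

```python
def depthwise_conv3x3(channels, kernels):
    """Apply one 3 x 3 kernel per channel with zero padding (spatial size unchanged)."""
    out = []
    for ch, ker in zip(channels, kernels):
        h, w = len(ch), len(ch[0])
        res = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                for u in range(3):
                    for v in range(3):
                        y, x = i + u - 1, j + v - 1
                        if 0 <= y < h and 0 <= x < w:
                            res[i][j] += ch[y][x] * ker[u][v]
        out.append(res)
    return out

def pointwise_conv(channels, weights):
    """1 x 1 convolution: each output channel is a weighted sum of the input channels."""
    h, w = len(channels[0]), len(channels[0][0])
    return [[[sum(wc * ch[i][j] for wc, ch in zip(row, channels)) for j in range(w)]
             for i in range(h)] for row in weights]

def leaky_relu(channels, slope=0.2):
    """Leaky ReLU with a hypothetical negative slope."""
    return [[[v if v > 0 else slope * v for v in r] for r in ch] for ch in channels]

IDENTITY = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
BLUR = [[1 / 9] * 3 for _ in range(3)]

x = [
    [[float(i + j) for j in range(4)] for i in range(4)],   # channel 0
    [[float(i * j) for j in range(4)] for i in range(4)],   # channel 1
]
depthwise = depthwise_conv3x3(x, [IDENTITY, BLUR])
mixing = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # hypothetical 1 x 1 weights, 2 -> 3 channels
y = leaky_relu(pointwise_conv(depthwise, mixing))
```

The output has three channels of the same 4 × 4 size as the input, which shows how the pointwise stage changes the channel count while the feature map size stays fixed.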

Improved CBAM.
To enhance the image fusion performance, CBAM is used as the attention module in this study. The receptive field size in CBAM determines the spatial attention performance. To aggregate broader spatial context features, CBAM uses a 7 × 7 convolution kernel rather than a 3 × 3 kernel, but the 7 × 7 kernel noticeably increases the number of module parameters. Therefore, we design a spatial attention module that uses dilated convolution for feature aggregation, which enlarges the receptive field while reducing the number of parameters:
$$M_S(F) = \sigma\left(f^{3\times 3}_{dilat}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\right)$$
where $F$ denotes the input feature map, $M_S(F)$ denotes the output weights of $F$ based on spatial attention, $\sigma$ denotes the sigmoid function, and $f^{3\times 3}_{dilat}$ denotes the dilated convolution with a convolution kernel size of 3. The experiments use a dilation rate of 2.
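The trade-off between receptive field and parameter count can be made explicit with two small formulas. The sketch below assumes, as in standard CBAM, that the spatial attention convolution takes two input channels (the average-pooled and max-pooled maps) and produces one output channel; bias terms are ignored.

```python
def effective_kernel_size(k, dilation):
    """Side length of the window covered by a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)

def conv_weight_count(k, c_in, c_out):
    """Number of weights in a k x k convolution; dilation does not add weights."""
    return k * k * c_in * c_out

dilated_extent = effective_kernel_size(3, 2)   # 3 x 3 kernel, rate 2, covers a 5 x 5 window
wider_extent = effective_kernel_size(3, 3)     # rate 3 would cover a 7 x 7 window
dilated_params = conv_weight_count(3, 2, 1)    # 18 weights
plain_params = conv_weight_count(7, 2, 1)      # 98 weights for an undilated 7 x 7 kernel
```

With a dilation rate of 2, the 3 × 3 kernel covers a 5 × 5 window using 18 weights, compared with 98 weights for a plain 7 × 7 kernel.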
The CBAM attention mechanism generally adopts a "cascade" connection, which has the drawback that the earlier feature mapping determines the weights and features learned by the later attention module. This interference caused by the "cascade connection" can degrade the attention modules in image fusion tasks. Therefore, we change the original "cascade connection" to a "parallel connection", in which both attention modules learn directly from the initial input feature map, regardless of the order of spatial attention and channel attention. The final output feature map $F''$ is then computed from $F$ together with the channel attention weights $M_C(F)$ and the spatial attention weights $M_S(F)$.
In I-CBAM, the spatial attention module and the channel attention module are learned simultaneously. In the channel attention module, the input feature map $F$ (H × W × C) is subjected to max pooling and average pooling to obtain two feature maps of size 1 × 1 × C, which are then fed to a Multi-Layer Perceptron (MLP). The channel attention map $M_C$ is generated by an element-wise operation followed by sigmoid activation. In the spatial attention module, firstly, the input feature map $F$ is reduced to two feature maps of size H × W × 1 by max pooling and average pooling. Secondly, we concatenate them along the channel dimension and use the dilated convolution with a kernel size of 3 to reduce the number of dimensions. Thirdly, the sigmoid activation function produces the final spatial attention map $M_S$. Fourthly, the attention maps obtained by channel attention and spatial attention are directly applied as weights to the original input feature map $F$ to obtain the final output feature map. The overall block diagram of I-CBAM is shown in Fig 6.

Decomposition networks
The purpose of the decomposition network is to decompose the fused images into results that are as close as possible to the raw images, which in turn leads to better image fusion results. The framework of the decomposition network is illustrated in Fig 7.
As shown in Fig 7, we extract image features from the fused image using three single-channel convolutional layers, and then generate the decomposition results with two dual-channel convolutional layers. The first convolutional layer uses a 1 × 1 convolution kernel, and the remaining convolutional layers use 3 × 3 convolution kernels. Leaky ReLU is chosen as the activation function for the common convolutional layers, and Tanh is adopted for the last double-channel convolutional layer.

Loss functions
The FDNet architecture is divided into fusion and decomposition components. The fusion network combines multiple images into a single fused image through feature extraction.
The decomposition network, in turn, makes the fused results contain deeper scene information. Accordingly, the total loss function $L$ consists of the fusion loss $L_{sf}$ and the decomposition loss $L_{dc}$.

Fusion loss.
The most basic components of infrared and visible images are pixels, whose intensities represent the overall luminance distribution. The differences between pixels form gradient information, which represents the texture details of a raw image. Therefore, infrared and visible image fusion can be formulated as extracting and reconstructing the gradient and intensity information of the raw images. Correspondingly, the fusion loss function is composed of an intensity loss and a gradient loss. In the corresponding equation, $\beta$ is the key parameter that balances the intensity term and the gradient term, $L_{grad}$ represents the adaptive gradient loss function, and $L_{int}$ represents the intensity loss function.
An adaptive gradient loss function $L_{grad}$ is designed to add abundant texture features to the fused images. We also introduce an adaptive weight block, in which a Gaussian low-pass filter reduces the influence of noise on the weights. This adaptive weight block evaluates the optimization objective of each pixel in the raw images based on the richness of its gradients. The complete process of the adaptive weight block is depicted in Fig 8.
The equations of the gradient loss function include the adaptive weight term:
$$S_{1_{i,j}} = \mathrm{sign}\left(\left|\nabla(L(I_{1_{i,j}}))\right| - \min\left(\left|\nabla(L(I_{1_{i,j}}))\right|, \left|\nabla(L(I_{2_{i,j}}))\right|\right)\right) \quad (12)$$
where $I_1$ and $I_2$ are the raw images, $I_{fused}$ is the fused image, and $H$ and $W$ denote the height and width of the images, respectively. $i$ and $j$ represent the pixel coordinates at position $(i, j)$. $\nabla(\cdot)$ is the Laplacian operator. $L(\cdot)$ and $|\cdot|$ represent the Gaussian low-pass filter function and the absolute value function, respectively. $\min(\cdot)$ and $\mathrm{sign}(\cdot)$ denote the minimum function and the sign function, respectively.
The intensity loss, which adopts an improved Frobenius norm, affects the brightness and contrast of the image and gives the fused images a natural and realistic appearance. The Frobenius norm is defined as the square root of the sum of the squares of the matrix elements. Its main role is to measure the distance between the pixel matrices of the raw and fused images and to adjust their weighting effectively, so that more effective information is selected during network training. In the related formula (Eq 14), the normalization factor is $\frac{1}{HW}$ and $\alpha$ is used to adjust the infrared and visible image intensity information.
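The following pure-Python sketch illustrates the adaptive weighting and the two fusion loss terms under explicit assumptions. The weight $S_1$ follows the sign/min rule of Eq 12; the complementary weight `s2 = 1 - s1`, the squared-difference form of `gradient_loss`, and the α-weighted Frobenius form of `intensity_loss` are assumptions made for illustration, because those equations are not recoverable from the text. The 3 × 3 Gaussian and Laplacian kernels and the edge-replication padding are likewise hypothetical choices.

```python
import math

GAUSS = [[1 / 16, 2 / 16, 1 / 16], [2 / 16, 4 / 16, 2 / 16], [1 / 16, 2 / 16, 1 / 16]]
LAPLACE = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]

def filter3x3(img, ker):
    """3 x 3 filtering with edge replication, so flat regions give a zero Laplacian."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for u in range(3):
                for v in range(3):
                    y = min(max(i + u - 1, 0), h - 1)
                    x = min(max(j + v - 1, 0), w - 1)
                    out[i][j] += img[y][x] * ker[u][v]
    return out

def sign(x):
    return (x > 0) - (x < 0)

def adaptive_weights(i1, i2):
    """S1 is 1 where the smoothed image I1 has the richer gradient, else 0 (Eq 12)."""
    g1 = filter3x3(filter3x3(i1, GAUSS), LAPLACE)
    g2 = filter3x3(filter3x3(i2, GAUSS), LAPLACE)
    s1 = [[sign(abs(a) - min(abs(a), abs(b))) for a, b in zip(r1, r2)]
          for r1, r2 in zip(g1, g2)]
    s2 = [[1 - s for s in row] for row in s1]   # assumption: S2 complements S1
    return s1, s2

def gradient_loss(fused, i1, i2):
    """Assumed form: weighted squared gradient differences, averaged over H x W."""
    s1, s2 = adaptive_weights(i1, i2)
    gf, g1, g2 = (filter3x3(x, LAPLACE) for x in (fused, i1, i2))
    h, w = len(fused), len(fused[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            total += s1[i][j] * (gf[i][j] - g1[i][j]) ** 2
            total += s2[i][j] * (gf[i][j] - g2[i][j]) ** 2
    return total / (h * w)

def frobenius(a, b):
    """Frobenius norm of the difference between two equally sized images."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))

def intensity_loss(fused, i1, i2, alpha=0.5):
    """Assumed form: alpha-weighted Frobenius distances, normalized by H x W."""
    hw = len(fused) * len(fused[0])
    return (alpha * frobenius(fused, i1) + (1 - alpha) * frobenius(fused, i2)) / hw

# Toy inputs: I1 has a single bright pixel (texture), I2 is flat.
textured = [[1.0 if (i, j) == (2, 2) else 0.0 for j in range(5)] for i in range(5)]
flat = [[0.0] * 5 for _ in range(5)]
s1, s2 = adaptive_weights(textured, flat)
```

In this toy case the weight map selects the textured image around the bright pixel, so a fused image that copies its gradients there incurs no gradient penalty.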
Decomposition loss.
The decomposition loss $L_{dc}$ requires that each decomposition result of the fused image be close to the corresponding raw image. We choose structural similarity (SSIM) as the loss function, and calculate the SSIM value between each decomposition result and the corresponding raw image in terms of structural distortion, contrast distortion, and luminance distortion. In the corresponding formulae, $I_{1de}$ and $I_{2de}$ are the decomposition results, and $I_1$ and $I_2$ are the raw images. $\mu$ and $\sigma$ are the mean value and the standard deviation, respectively. The parameters $C_1$, $C_2$, and $C_3$ are three constants that keep the SSIM computation stable when its denominators are close to zero during training.
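As an illustration of an SSIM-based decomposition loss, the sketch below computes a global (single-window) SSIM from its luminance, contrast, and structure terms and sums `1 - SSIM` over the two decomposed images. The single-window simplification, the default constants (the conventional values for images scaled to [0, 1]), and the exact way the two terms are combined are assumptions and are not taken from the paper.

```python
import math

def ssim(x, y, c1=1e-4, c2=9e-4, c3=4.5e-4):
    """Global SSIM of two equally sized images with values in [0, 1]."""
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((v - mx) ** 2 for v in xs) / n
    vy = sum((v - my) ** 2 for v in ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
    sx, sy = math.sqrt(vx), math.sqrt(vy)
    luminance = (2 * mx * my + c1) / (mx * mx + my * my + c1)
    contrast = (2 * sx * sy + c2) / (vx + vy + c2)
    structure = (cov + c3) / (sx * sy + c3)
    return luminance * contrast * structure

def decomposition_loss(d1, d2, i1, i2):
    """Assumed form: penalize dissimilarity of each decomposed image to its raw image."""
    return (1.0 - ssim(d1, i1)) + (1.0 - ssim(d2, i2))

a = [[0.0, 1.0], [1.0, 0.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
same = ssim(a, a)        # identical images
opposite = ssim(a, b)    # inverted pattern
```

Identical images give an SSIM of 1 and therefore a zero loss, while the inverted pattern gives a negative structure term and a large loss.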

Datasets and setup details
This paper uses the publicly available TNO database for the infrared and visible image fusion task. In the training stage, we augment the experimental images by cropping and decomposition. Training images with a maximum size of 576 × 768 pixels and a minimum size of 360 × 270 pixels are selected and cropped to generate 42,484 experimental images of size 120 × 120. In contrast to the training data, ten pairs of testing images are used at the original sizes of the raw images. The experiments are conducted on a Windows 10 operating system with an Intel Core i5-1035G1 CPU. TensorFlow and imageio are used to train and test the network in the PyCharm IDE. The experimental parameters are set to epoch = 15, batch size = 32, and learning rate = 1e-4. The Adam optimizer, an adaptive optimization algorithm with good convergence, is adopted. In Eq 14, the parameter α is 0.5, allowing the network to obtain the main intensity information from the infrared images to maintain high contrast. In addition, after repeated experiments, the ratios of the gradient loss, intensity loss, and decomposition loss in the total loss are set to 80, 1, and 1, respectively.
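The reported settings can be collected in a small configuration object. In the sketch below the class and field names are hypothetical; the values are those stated above. Treating the 80 : 1 : 1 ratios as multiplicative coefficients of the three loss terms, and keeping the final partial batch when counting steps, are assumptions.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    epochs: int = 15
    batch_size: int = 32
    learning_rate: float = 1e-4
    alpha: float = 0.5          # intensity balance parameter of Eq 14
    w_grad: float = 80.0        # ratio of the gradient loss
    w_int: float = 1.0          # ratio of the intensity loss
    w_dec: float = 1.0          # ratio of the decomposition loss
    num_patches: int = 42484    # 120 x 120 training crops

def total_loss(l_grad, l_int, l_dec, cfg):
    """Weighted sum of the three loss terms (assumed combination)."""
    return cfg.w_grad * l_grad + cfg.w_int * l_int + cfg.w_dec * l_dec

def steps_per_epoch(cfg):
    """Number of batches per epoch, assuming the last partial batch is kept."""
    return math.ceil(cfg.num_patches / cfg.batch_size)

cfg = TrainConfig()
steps = steps_per_epoch(cfg)
```

With 42,484 patches and a batch size of 32, one epoch corresponds to 1,328 optimizer steps under this assumption.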

Ablation experiments

Module performance test.
To validate the effectiveness of our proposed methods, ablation experiments are conducted as follows:
1. Image fusion experiments with multi-scale depthwise separable convolution (M-DSC).
An image fusion result of "soldier-behind-smoke-1" was randomly selected from the testing dataset for subjective evaluation. Ten groups of image fusion results were chosen for objective evaluation. The evaluation results of the ablation experiment are shown in Fig 9. When only M-DSC is used, the overall image contrast is slightly insufficient and some image details cannot be captured. Combining M-DSC with CBAM largely resolves these problems, making the person in the forest clearer and improving the overall image contrast to reveal more image details. Furthermore, the image fusion results obtained by combining our proposed M-DSC with I-CBAM effectively retain essential image information and show significant improvements over the previous two configurations.
In the ablation experiment, we select the average gradient (AG) and multi-scale structural similarity (MSSSIM) as objective evaluation metrics. MSSSIM evaluates the image fusion quality at different resolutions and, together with AG, reflects the sharpness and texture detail of the fused image. Table 2 lists the objective evaluation results of these metrics for the ablation experiment.
As shown in Table 2, adding modules brings an obvious improvement in image fusion performance, and the computed value of the multi-scale structural similarity metric is 88.
As shown in Fig 10, using the decomposition network improves the clarity of the trees and soldiers in the image fusion result, and the overall visual effect is better. Additionally, two objective evaluation metrics, spatial frequency (SF) and average gradient (AG), are chosen to reflect the clarity of the processed images. The experimental data are shown in Table 3.
As shown in Table 3, the fused images produced by the complete network are clearer than those produced without the decomposition network, which confirms that the decomposition network improves the image fusion performance.
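For reference, the two clarity metrics can be computed as sketched below. The sketch uses common textbook definitions of average gradient and spatial frequency with forward differences; the paper does not spell out its exact variants, so the normalization details here are assumptions.

```python
import math

def average_gradient(img):
    """Mean magnitude of horizontal and vertical forward differences."""
    h, w = len(img), len(img[0])
    total = 0.0
    for i in range(h - 1):
        for j in range(w - 1):
            dx = img[i][j + 1] - img[i][j]
            dy = img[i + 1][j] - img[i][j]
            total += math.sqrt((dx * dx + dy * dy) / 2.0)
    return total / ((h - 1) * (w - 1))

def spatial_frequency(img):
    """Square root of the summed row-frequency and column-frequency energies."""
    h, w = len(img), len(img[0])
    rf = sum((img[i][j] - img[i][j - 1]) ** 2 for i in range(h) for j in range(1, w))
    cf = sum((img[i][j] - img[i - 1][j]) ** 2 for i in range(1, h) for j in range(w))
    return math.sqrt((rf + cf) / (h * w))

flat = [[0.5, 0.5], [0.5, 0.5]]     # no detail at all
edge = [[0.0, 1.0], [0.0, 1.0]]     # one vertical edge
ag_edge = average_gradient(edge)
sf_edge = spatial_frequency(edge)
```

Both metrics are zero for a flat image and grow with the amount of edge and texture content, which is why higher values are read as a sharper fused image.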

Intensity loss analysis.
The intensity loss plays a key role in helping the fused image retain important information, such as contrast. It also helps to maintain a natural scene style in the fused image. We therefore perform ablation experiments to verify the effectiveness of the proposed loss, as shown in Fig 11. Removing the intensity loss causes several problems in the fused images, including low brightness, information loss, and an unrealistic style. This indicates that the intensity loss is critical for the image fusion results. Because the experimental results without the intensity loss deviate significantly from the expected outcome, we decided not to conduct quantitative experiments in this case.

Gradient loss analysis.
The gradient loss forces the fused image to contain more texture details, as demonstrated by our ablation experiments. The results of the gradient loss ablation experiment are shown in Fig 12. The fusion results without the gradient loss show texture loss and reduced sharpness, while the results with the gradient loss retain the original sharpness and contain more texture details. Additionally, the objective evaluation results of the gradient loss ablation experiment are given in Table 4. Table 4 shows that including the gradient loss leads to a further improvement in the image fusion results. This demonstrates the significance of the gradient loss in enhancing the fusion performance.

Fusion image analysis
Different image evaluation methods tend to give different evaluation results in most cases. In this study, we adopt both subjective and objective evaluation methods to evaluate the image fusion effect on randomly selected images.
The image fusion result of "Nato camp" is shown in Fig 13. In the BF method, the grass, person, and plants around the building are all blurred, and the overall fusion effect is poor. The ResNet50, Deeplearning, SMVIF, U2Fusion, and FLFusion methods provide more detailed information about the person in the fused image, but the plants are still blurred. In the MLGCF, SDNet, SwinFusion, and PIAFusion methods, the person and grass are not blurred, but the overall contrast is low, limiting the visibility of additional feature information. The CDL and CCFL methods have better overall contrast, but the target edges are not clear enough, and some detailed information is unclear. Compared to these popular methods, our proposed method yields clearer texture details, richer scene information, and more prominent target objects.
The "helicopter" fusion result is shown in Fig 14. The FLFusion method fails to fully integrate the visible image information, resulting in a fusion result dominated by infrared information. Although the BF method shows some improvement over FLFusion, it still falls short in fully extracting the original visible information, resulting in an image with only contour information. The ResNet50, Deeplearning, SMVIF, MLGCF, and U2Fusion methods maintain the related information of the raw images, but the whole image is blurry and the specific texture information is unclear. The SwinFusion and PIAFusion methods have prominent infrared object regions, but their low overall contrast prevents them from representing more detailed information. The CDL and CCFL methods have better overall brightness, but the clarity is not high and there are artifacts around the target. The SDNet method shows significant targets and clear textures, but some edge information is missing. In comparison, our proposed method retains the feature information while obtaining the best brightness and largest edge gradient, with clear background texture and a good visual effect.
[...] background texture information, but the texture details on the road are lost and the overall contrast is low. The FLFusion and U2Fusion methods do not lose texture details on the road, but the window features appear blurry. The MLGCF and SDNet methods have good overall contrast and meet subjective visual perception requirements, and the significant feature information (texture on the road and car windows) from the raw images is well shown in the fused images, but more detailed regions cannot be reflected. Our proposed method acquires good image fusion results with clearer background texture information and better brightness than the previous methods.
The image fusion result of "Movie-01" is shown in Fig 16. The CDL and CCFL methods produce blurred fusion images overall, in which the trees and houses are not clear enough. The overall clarity of the SwinFusion and PIAFusion methods is improved, but the river in front is blurred. The visual effect of the BF method is improved, but the windows of the house are blurry, with low clarity and contrast. The SMVIF, ResNet50, Deeplearning, SDNet, and FLFusion methods render the infrared object regions clearly, but with low contrast and a large amount of missing detail. The MLGCF and U2Fusion methods have high contrast, but the object region shows a ghost shadow with low clarity. Our method preserves the image feature information while achieving the best brightness, clarity, and detail.
The "Movie-18" fusion result is shown in Fig 17. The fused image produced by the BF method is relatively dark with low contrast, and the person and objects on the road are not clear. The ResNet50, Deeplearning, FLFusion, and U2Fusion methods contain much noise, and the overall fused image is more blurred. The fused image of the SMVIF method lacks the detailed texture information from the infrared and visible images and contains only some contour information. The CDL, CCFL, MLGCF, SwinFusion, and PIAFusion methods exhibit more prominent object regions, but lack detail around the person. The SDNet method demonstrates good overall contrast with prominent target features, but the street light appears blurred. In comparison, our proposed method achieves good contrast, high definition, and rich detail, offering superior performance.
The "bench" fusion results are shown in Fig 18. The BF method has low overall contrast, with a blurred flame and much noise from the infrared image. The FLFusion method loses the texture detail information in the visible image, resulting in a serious lack of information in the fused image. The ResNet50, SMVIF, Deeplearning, and U2Fusion methods improve the overall fusion effect with rich scene information, but the flame in the object region is still not obvious and has low clarity. The CDL, CCFL, SwinFusion, and PIAFusion methods show an obvious flame from the infrared image, but the clarity is not high and much detail is lost. The SDNet and MLGCF methods achieve better overall effects, with prominent targets, but there is a slight loss of scene information. In comparison, our proposed method yields more prominent object regions, clearer background information, richer scene information, and better visual effects.

Objective evaluation.
In this paper, we select eight evaluation metrics to objectively evaluate the fused images: average gradient (AG), entropy (EN), standard deviation (SD), spatial frequency (SF), correlation coefficient (CC), visual information fidelity for fusion (VIFF), signal-to-noise ratio (SNR), and mutual information (MI).
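Four of the metrics listed above (EN, SD, SF, and AG) have widely used closed-form definitions, which can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the experiments: the normalization constants, the forward-difference gradient, and the function names are assumptions, and the image is represented as a plain list of rows of integer gray levels so that the sketch needs only the standard library.

```python
import math
from collections import Counter


def entropy(img):
    """EN: Shannon entropy (in bits) of the gray-level histogram."""
    pixels = [p for row in img for p in row]
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(pixels).values())


def standard_deviation(img):
    """SD: root-mean-square deviation of intensities from their mean."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))


def spatial_frequency(img):
    """SF: sqrt(RF^2 + CF^2), built from row and column differences.

    Assumption: both sums are normalized by the total pixel count.
    """
    rows, cols = len(img), len(img[0])
    rf = sum((img[i][j] - img[i][j - 1]) ** 2
             for i in range(rows) for j in range(1, cols))
    cf = sum((img[i][j] - img[i - 1][j]) ** 2
             for i in range(1, rows) for j in range(cols))
    return math.sqrt((rf + cf) / (rows * cols))


def average_gradient(img):
    """AG: mean of sqrt((dx^2 + dy^2) / 2) using forward differences."""
    rows, cols = len(img), len(img[0])
    total = 0.0
    for i in range(rows - 1):
        for j in range(cols - 1):
            dx = img[i][j + 1] - img[i][j]
            dy = img[i + 1][j] - img[i][j]
            total += math.sqrt((dx * dx + dy * dy) / 2)
    return total / ((rows - 1) * (cols - 1))


if __name__ == "__main__":
    # Hypothetical 2 x 2 test image: a checkerboard of extreme gray levels.
    checker = [[0, 255], [255, 0]]
    for metric in (entropy, standard_deviation,
                   spatial_frequency, average_gradient):
        print(metric.__name__, metric(checker))
```

On a constant image all four values are zero, while the checkerboard above maximizes the local differences, which matches the intuition that higher values indicate more information, contrast, and sharpness.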
AG characterizes the detail and texture of a fused image by calculating the average rate of change of its gray levels; EN measures the richness of the image by calculating the average information content of the fusion result; SD reflects the dispersion of the gray-scale values by calculating the deviation of the intensity values from the mean intensity, which helps to assess image contrast; SF reflects the sharpness of the fused image by calculating the gray-level activity in the spatial domain; MI, drawing on information theory, measures how much information from the raw images is contained in the fused image, and thus the similarity between them; VIFF provides an objective evaluation value consistent with the human visual system; SNR reflects the quality of the fused image in terms of useful information relative to noise.
An analysis of the objective evaluation metrics in Tables 5 to 12 shows that our method exhibits high EN values, demonstrating that the fusion result contains abundant information; a high SF value corresponds to high clarity in the fusion results; the high AG value suggests that the fused image has more detailed feature and texture information; the high SD value indicates that the fused image contains abundant detailed information with high pixel intensity; the high VIFF value suggests that the subjective perception of the fused image is consistent with the human visual system; the high SNR value suggests that the useful information in the fusion result is retained and rarely affected by image noise; and the high CC value suggests that the raw images transfer many important features to the fused image, resulting in a high correlation between the fusion result and the raw images.
However, our MI values are slightly lower than those of some comparison algorithms. This can be attributed to the fact that we employ concat and convolution fusion strategies to preserve luminance information in infrared images and texture information in visible images. The MI metric focuses mainly on luminance information based on the mean method; if a fused image ultimately contains much noise, its luminance information also increases. The CDL, CCFL, PIAFusion, and BF methods focus on infrared information fusion while ignoring visible information, so they achieve the best MI values.
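To make the behavior of MI more concrete, the sketch below estimates it from the joint gray-level histogram of a source image and the fused image. It assumes the common convention that the fusion MI is the sum of the MI between the fused image and each source; the function names and the toy images are hypothetical, and this is not the evaluation code used in the experiments.

```python
import math
from collections import Counter


def mutual_information(src, fused):
    """MI (in bits) between two equally sized gray-level images."""
    a = [p for row in src for p in row]
    f = [p for row in fused for p in row]
    if len(a) != len(f):
        raise ValueError("images must have the same number of pixels")
    n = len(a)
    joint = Counter(zip(a, f))      # joint histogram of pixel pairs
    pa, pf = Counter(a), Counter(f)  # marginal histograms
    # p(x, y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
    return sum((c / n) * math.log2(c * n / (pa[x] * pf[y]))
               for (x, y), c in joint.items())


def fusion_mi(ir, vis, fused):
    """Assumed convention: MI(ir, fused) + MI(vis, fused)."""
    return mutual_information(ir, fused) + mutual_information(vis, fused)


if __name__ == "__main__":
    # Hypothetical toy images: the fused image simply copies the infrared one.
    ir = [[0, 0], [255, 255]]
    vis = [[0, 255], [0, 255]]
    fused = [[0, 0], [255, 255]]
    print("MI(ir, fused) =", mutual_information(ir, fused))
    print("MI(vis, fused) =", mutual_information(vis, fused))
    print("fusion MI =", fusion_mi(ir, vis, fused))
```

In this toy case the fused image shares all of its information with the infrared source and none with the visible one, which illustrates how a result dominated by one source can still obtain a high MI contribution from that source.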

Conclusion and future work
We propose a fusion decomposition network called FDNet to achieve the goal of image fusion.
In the image fusion stage, considering the large differences between the raw images, we propose a double-branch fusion network consisting of multi-scale layers, depthwise separable convolution, and I-CBAM. Additionally, an improved Frobenius norm and an adaptive gradient loss term are designed for unsupervised learning. The network can effectively extract image feature information while reducing computational complexity. In the image decomposition stage, the fusion result is decomposed to regenerate the raw images, and an SSIM structure loss is used as the decomposition loss. The experimental results demonstrate that our method offers high subjective visibility, good overall clarity, and clear background texture information. However, it should be noted that our image fusion framework is applicable to aligned images, which limits its use for real-time, non-aligned images. In future work, we will not only explore how to efficiently fuse unaligned images for real-time tasks, but also integrate more advanced image processing techniques and design a unified fusion framework to handle other complex image fusion tasks.
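The reduction in computational complexity attributed to depthwise separable convolution can be illustrated with the standard weight count for a single layer. The sketch below compares a standard convolution with its depthwise separable counterpart; the kernel size and channel widths are hypothetical values chosen for illustration, not the configuration of FDNet, and biases are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = kernel * kernel * c_in   # one spatial filter per input channel
    pointwise = c_in * c_out             # 1 x 1 conv that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 3 x 3 kernel, 64 input and 64 output channels.
    k, c_in, c_out = 3, 64, 64
    std = standard_conv_params(k, c_in, c_out)
    sep = depthwise_separable_params(k, c_in, c_out)
    print("standard:", std)
    print("depthwise separable:", sep)
    print("ratio:", sep / std)  # equals 1 / c_out + 1 / k**2
```

The same ratio, 1/C_out + 1/k^2, applies to the number of multiply-accumulate operations per output position, so the saving grows with the number of output channels and the kernel size.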

Fig 1. Schematic illustration of our proposed FDNet through comparison with other popular algorithms. From left to right: the infrared image, the visible image, the fusion results of the CNN [18], the Deeplearning approach [19], the ResNet50 approach [20], and our proposed FDNet.
Fig 2 shows the block diagram of CBAM.

In Fig 4, the purple box represents the raw image used for multi-scale feature extraction, the orange box represents the concatenation of the multi-scale feature maps, the blue box represents the depthwise separable convolution operation, the green box represents the I-CBAM attention mechanism, the yellow box represents concatenation, the red box represents a 1 × 1 convolution with tanh as the activation function, the lavender box represents a 1 × 1 convolution with LReLu as the activation function, the light green box represents a 3 × 3 convolution with LReLu as the activation function, and the light orange box represents a 3 × 3 convolution with tanh as the activation function.