SIFusion: Lightweight infrared and visible image fusion based on semantic injection

Abstract

The objective of image fusion is to integrate complementary features from source images to better serve both human and machine vision. However, existing image fusion algorithms predominantly focus on enhancing the visual appeal of the fused image for human perception, often neglecting its impact on subsequent high-level visual tasks, particularly the processing of semantic information. Moreover, the fusion methods that do incorporate downstream tasks tend to be overly complex and computationally intensive, which hinders practical application. To address these issues, this paper proposes SIFusion, a lightweight infrared and visible light image fusion method based on semantic injection. The method employs a semantic-aware branch to extract semantic feature information and then integrates these features into the fused features through a Semantic Injection Module (SIM) to meet the semantic requirements of high-level visual tasks. Furthermore, to reduce the complexity of the fusion network, the method introduces an Edge Convolution Block (ECB) based on structural re-parameterization to enhance the representational capacity of the encoder and decoder. Extensive experimental comparisons demonstrate that the proposed method performs well in terms of both visual appeal and high-level semantics, providing satisfactory fusion results for subsequent high-level visual tasks even in challenging scenarios.

Introduction

Different imaging modalities have their own characteristics, and they describe a scene in different ways [1]. Take visible light cameras and infrared cameras as examples. The former form images from the light reflected by objects, which has the advantage of capturing rich texture details, but they are susceptible to adverse imaging conditions such as poor scene brightness and fog. Infrared cameras, in contrast, form images from the thermal radiation of objects, so they perform well in highlighting targets, but they also have limitations, such as low image resolution and relatively little background detail [2]. Because of the naturally complementary characteristics of these two sensor modalities, many researchers have explored methods of fusing infrared and visible light images to generate fused images that carry richer information.

Fig 1 shows a challenging example. As shown in Fig 1(a), in nighttime scenes, visible light images are affected by ambient lighting, making it difficult to recognize pedestrians and vehicles. Infrared images, however, capture these targets clearly, thanks to the imaging principle of infrared sensors. Fig 1(b) shows the result of an advanced image fusion method [3], but the detection results of the mainstream YOLOv5s object detector [4] show that, in complex scenes, this method still needs to be improved in terms of incorporating semantic information into the fused image. Therefore, as shown in Fig 1(d), an ideal image fusion method should not only pursue good visual effects but also ensure the integrity of the source image information and highlight salient targets. Only in this way can image fusion technology play a greater role in practical applications such as target tracking [5], nighttime assisted driving [6], pedestrian re-identification [7], object detection [8], and semantic segmentation [9].

Fig 1. Detection results of SeAFusion and SIFusion in challenging scenarios.

https://doi.org/10.1371/journal.pone.0307236.g001

In recent years, the fusion of infrared and visible light images has received widespread attention from researchers. Early image fusion methods mainly rely on techniques such as multi-scale transformation [10], subspace transformation [11], sparse representation [12], and saliency analysis [13] to enhance visual effects. With the rapid development of deep learning, researchers have begun to use convolutional neural networks (CNN), autoencoders (AE), and generative adversarial networks (GAN) to further improve the visual performance of fusion results. For example, Li et al. [14] added dense connections to the AE-based framework, enhancing the network’s ability to extract features and making it easier to train. Jian et al. [15] introduced attention mechanisms into the AE, enabling the network to focus more on salient targets and texture details in the image. Tang et al. [16] improved the network’s ability to exchange intermodal features by adding a cross-modality differential aware fusion module (CMDAF) to the CNN. Ma et al. [17] proposed STDFusionNet, which uses a salient object mask to select important information from infrared and visible light images. Ma et al. [18] were the first to apply GANs to the image fusion task, transforming the fusion task into an adversarial game between the generator and discriminator, but this method may be insufficient in preserving texture details. Subsequently, Ma et al. [19] proposed DDcGAN, which uses a dual discriminator designed around the modality differences between infrared and visible light images, but it may introduce artifacts in some results. Although the aforementioned methods, especially those based on deep learning, have achieved good fusion results, they often overlook how to facilitate subsequent high-level visual tasks.

To address this issue, Tang et al. [3] proposed the semantic-aware fusion framework SeAFusion, which enhances the semantic information in the fused image by attaching a segmentation model behind the fusion network. In addition, Tang et al. [20] proposed PSFusion, in which the fusion network and segmentation network share the same semantic feature extraction network, thus better achieving semantic information fusion. Liu et al. [21] and Sun et al. [22] designed image fusion methods based on object detection, aiming to force the fusion network to retain more semantic information from the perspective of detection. However, these methods mainly rely on high-level visual task models to push more semantic information into the fused image, which may limit the visual performance of the fused image. In addition, complex designs may significantly increase the computational load of the network, which hinders the application of image fusion technology in practical engineering scenarios.

Lightweight model design allows image fusion technology to better adapt to real-world scenarios, but lightweight networks often suffer a decrease in fusion performance. To address this issue, researchers have proposed a series of innovative methods. IFCNN [23] uses only two convolutional layers for feature extraction and image reconstruction in its encoder and decoder. This method adjusts the fusion rules based on the type of source image, enabling a unified network to solve various fusion tasks. SDNet [24] generates fused images by constructing a squeeze-and-excitation network structure, ensuring that the fused image contains more source image information. SeAFusion [3] uses gradient residual dense blocks for feature extraction and combines a semantic segmentation task loss to guide the training of the fusion network. Xue et al. [25] designed a fast and lightweight fusion network that simultaneously performs feature extraction and fusion. Chen et al. [26] designed a lightweight fusion network based on structural re-parameterization. Lu et al. [27] proposed LDRepFM, which achieves a balance between fusion speed and evaluation metrics through real-time end-to-end hierarchical decomposition and re-parameterization networks. Although structural re-parameterization can effectively balance computational resource consumption against fusion performance, these methods still do not focus sufficiently on preserving semantic information. Therefore, further research and optimization of lightweight image fusion networks are needed to preserve semantic information effectively.

In view of the limitations of existing image fusion algorithms, this paper proposes a lightweight infrared and visible light image fusion network SIFusion based on semantic injection. The method includes a semantic feature extraction branch and an image information fusion branch, which effectively integrates modality image information and semantic information through a unique architecture. Then, this paper uses the semantic injection module (SIM) to fuse semantic features with heterogeneous modality image features, ensuring that the fused image has rich semantic clues. In addition, this paper introduces structural re-parameterization technology for optimizing the encoder and decoder, which not only improves the fusion performance, but also significantly reduces the number of required parameters and reduces the computational resource consumption. The main contributions of this paper are as follows:

  • A lightweight infrared and visible image fusion framework based on semantic injection is proposed in this paper, which recognizes semantic features in multimodal images and effectively integrates them into the fusion result.
  • A semantic injection module (SIM) is designed in this paper, which integrates semantic features with heteromodal image features, thus ensuring that the fused image is rich in semantic cues.
  • The edge convolution block (ECB) based on structural reparameterization technique is introduced in this paper as an encoder and decoder, which significantly improves the fusion performance without increasing the computational burden in the inference phase.
  • Numerous experiments demonstrate that the fusion results of the proposed method have good visual perception and advanced semantics, and outperform existing fusion algorithms in terms of fusion performance.

The remainder of this paper is organized as follows. Section 2 briefly introduces related work on image fusion and semantic injection. Section 3 details the proposed SIFusion, including the overall framework, the semantic injection module, and the loss function. Section 4 compares the proposed method with other methods through extensive experiments. Section 5 draws conclusions.

Related work

Infrared and visible fusion

In early research, the main purpose of infrared and visible light image fusion was to ensure that the fusion results fully presented the information of the source images and were consistent with the human visual perception system. To better learn image features, researchers first considered training encoders and decoders on large-scale datasets. Li et al. [14] proposed a pretrained fusion model called DenseFuse, which used a natural image dataset to train the encoder and decoder and then used fusion rules to combine the information of different modal images. Tang et al. [28] designed multiple encoders based on the Retinex theory to decompose the illumination and reflection components of visible light images, thereby enhancing the fusion effect in night scenes. To achieve end-to-end image fusion, researchers also designed many dedicated loss functions and network architectures. Ma et al. [17] designed a fusion loss based on salient target masks, which guides the fusion network to selectively process salient objects and background regions. Tang et al. [16] constructed an illumination-aware subnetwork and a cross-modality differential aware fusion module to ensure that the fusion results are visually appealing. To address the challenge of modeling cross-modality features and decomposing ideal modality-specific and modality-shared features, Zhao et al. [29] proposed a Correlation-Driven feature Decomposition Fusion network, CDDFuse, which achieves good fusion effects through a dual-branch Transformer-CNN feature extractor. Because real reference images are lacking, researchers also introduced generative adversarial mechanisms for unsupervised learning. Ma et al. [18] proposed FusionGAN, which was the first to introduce a generative adversarial network to the field of image fusion, aiming to preserve more texture details and saliency. To alleviate the modal imbalance problem caused by a single discriminator, they also designed the dual-discriminator-based DDcGAN [19]. Wang et al. [30] proposed ICAFusion, which constructs a triple-path interactive compensatory attention fusion network to enhance the model’s ability to extract global feature information. These studies have laid a solid foundation for subsequent image fusion technologies and promoted the development of this field.

Although complex network designs can improve fusion performance to some extent, they also increase the difficulty of applying image fusion technology in real-world scenarios. To address this issue, some researchers have focused on developing lightweight fusion networks and have carefully designed various network architectures. Tang et al. [3] proposed SeAFusion, which uses gradient residual dense blocks to improve the efficiency of feature extraction. Xue et al. [25] designed a fast and lightweight fusion network called FLFuse, which simultaneously performs feature extraction and feature fusion, thereby improving the fusion speed. Chen et al. [26] designed a lightweight fusion network based on structural re-parameterization, which significantly improves fusion performance without increasing the computational burden during the inference stage. Lu et al. [27] proposed LDRepFM, which combines real-time end-to-end hierarchical decomposition networks with re-parameterization networks to achieve a balance between fusion speed and evaluation metrics. Because the goal here is to improve fusion performance while keeping the network lightweight, this paper introduces structural re-parameterization to further enhance the feature extraction capability of the network. This technique enriches the network structure during training and removes the redundant branches at inference, improving the efficiency and accuracy of feature extraction and thereby achieving better image fusion results.

Semantic guidance

In recent years, some researchers have proposed practical solutions to enhance the deep learning networks’ ability to extract semantic information by using semantic guidance maps. Wang et al. [31] achieved more realistic texture recovery effects by injecting semantic features into the super-resolution network through spatial feature transformation. Tang et al. [3] constructed a framework that integrates fusion branches with segmentation branches, guiding the training of the fusion network through segmentation loss, which allows the segmentation task to promote the performance improvement of the fusion task. The SCFusion proposed by Liu et al. [32] enhances the target prominence of the fusion results by fusing infrared salient information into the texture extraction branch network through spatial biasing. Liu et al. [33] proposed a multi-interactive feature learning architecture for image fusion and segmentation, SegMiF, and leveraged the correlation between dual tasks to enhance the performance of both.

However, these methods have not achieved semantic injection that is both effective and lightweight. Therefore, this paper draws on the network design of SCFusion and proposes a lightweight infrared and visible image fusion framework based on semantic injection. To obtain semantic feature information more easily, this framework uses semantic masks to guide the training of the semantic extraction encoder, which extracts semantic features from the source images. Furthermore, through the semantic injection module, the method integrates richer semantic information into the fused image, thereby achieving excellent performance on high-level visual tasks.

Methods

Network architecture

The overall architecture of SIFusion is shown in Fig 2. It mainly consists of the image information fusion branch and the semantic feature extraction branch. The specific process is as follows. First, given a pair of registered infrared (IR) and visible light (VI) images, the two images are concatenated. Then, the fusion encoder extracts the fused feature ϕf from the concatenated image. At the same time, the semantic extraction encoder extracts the semantic feature ϕs along a separate path. Subsequently, the semantic feature ϕs is processed by the semantic injection module (SIM), where it interacts with the fused feature ϕf. This step integrates more semantic information into the fused feature, thereby enhancing the semantic richness of the fused image. Finally, the fusion decoder reconstructs the fused image (Fusion) from the processed features. Throughout this process, a carefully designed loss function guides the training of the network, ensuring that the fused image performs well in both visual perception and high-level semantic tasks. In this way, SIFusion improves fusion performance while remaining lightweight, providing an efficient solution for practical application scenarios.

Fig 2. The overall framework of SIFusion.

The figure includes the Edge Convolution Block (ECB) and the Semantic Injection Module (SIM).

https://doi.org/10.1371/journal.pone.0307236.g002

In the field of infrared and visible image fusion, both single-branch and dual-branch structures are among the mainstream network architectures in existing fusion methods. The dual-branch structure can extract features from two different modalities separately before fusing them, offering better interpretability. In contrast, the single-branch structure obtains shared feature information with similar information domains through the same feature extraction approach. To further achieve a lightweight network model and in conjunction with the characteristics of the semantic injection architecture, the method proposed in this paper selects a single-branch structure with fewer parameters, while the unique feature information is combined with the shared feature information through a semantic injection approach.

In addition, the main goal of this paper is to achieve lightweight image fusion. In order to enhance the representational ability of the network model while minimizing additional inference computation, this paper references the efficient and lightweight ECBSR [34] method. This paper applies the edge convolution block (ECB) based on structural re-parameterization technique to the encoder and decoder. This design enables the fusion network to have stronger feature extraction ability while maintaining relatively few parameters. It is worth noting that ECB adopts different network structures in the training stage and inference stage, which fully complies with the definition of structural re-parameterization technique [35]. This design not only improves the performance of the network, but also significantly reduces the computational cost, making it more suitable for practical application scenarios. The transformation process of the structural reparameterization of ECB is shown in Fig 3.

Fig 3. The structural re-parameterization of ECB.

The figure shows the transformation process of the structural re-parameterization of ECB.

https://doi.org/10.1371/journal.pone.0307236.g003

To allow the network to extract more meaningful information, such as the edge features of the source image, predefined Sobel and Laplacian operators are added to multiple branches of ECB. The definitions of these two operators are as follows: (1) (2) Here, the branches that use the Sobel and Laplacian operators can effectively extract edge feature information from images. Notably, these operators are computed in the same way as a depthwise convolution (DWConv), so structural re-parameterization can combine the parameters of the multiple branches into the parameters of a single ordinary convolution, resulting in a more concise network structure during the inference stage.
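As an illustration of how such fixed edge operators act on an image, the following minimal sketch applies the standard 3 × 3 Sobel and Laplacian kernels to a small step edge. The kernel values are the common textbook forms and are an assumption here, since Eqs 1 and 2 are not reproduced in this text; `conv3x3` is a hypothetical helper, not part of the original method.

```python
# Standard 3x3 edge kernels (assumed textbook forms, not taken from Eqs 1 and 2).
SOBEL_X = [[1, 0, -1],
           [2, 0, -2],
           [1, 0, -1]]
SOBEL_Y = [[1, 2, 1],
           [0, 0, 0],
           [-1, -2, -1]]
LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]


def conv3x3(image, kernel):
    """Cross-correlate a 2D list with a 3x3 kernel using zero padding (same-size output)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(3):
                for dj in range(3):
                    y, x = i + di - 1, j + dj - 1
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di][dj] * image[y][x]
            out[i][j] = acc
    return out


# A vertical step edge: left half dark, right half bright.
step = [[0, 0, 1, 1] for _ in range(4)]
gx = conv3x3(step, SOBEL_X)      # responds to the vertical edge
gy = conv3x3(step, SOBEL_Y)      # no response at interior pixels (no horizontal edge)
lap = conv3x3(step, LAPLACIAN)   # changes sign across the edge
```

Because each operator is just a fixed 3 × 3 kernel applied per channel, it has the same form as the weights of a depthwise convolution, which is what makes the branch merging discussed in the text possible.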

During the training phase, having more branches is beneficial for the network model to possess a richer feature representation. Therefore, ECB can be formulated as: (3)

Upon completion of the training of the fusion network, in order to further enhance the model’s inference speed and reduce the computational consumption of the network, the trained network is optimized using structural reparameterization techniques. The optimized ECB can be formulated as: (4)

By referring to Fig 3, Eqs 3 and 4, it can be observed that the structural reparameterization technique can effectively reduce the number of branches in the network, thereby further decreasing computational consumption. This is beneficial for the model to better accomplish lightweight image fusion tasks.
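The equivalence between the multi-branch training form and the single-convolution inference form follows from the linearity of convolution. The sketch below checks this numerically under strong simplifying assumptions: a single channel, no biases, and each edge branch reduced to a scalar scale times a fixed kernel. The real ECB also has multi-channel 1 × 1 convolutions and bias terms, so this is an illustration of the principle rather than the paper's Eqs 3 and 4. All names and numeric values are hypothetical.

```python
SOBEL_X = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
SOBEL_Y = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]


def conv3x3(image, kernel):
    """Cross-correlate a 2D list with a 3x3 kernel using zero padding."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in range(3):
                for dj in range(3):
                    y, x = i + di - 1, j + dj - 1
                    if 0 <= y < h and 0 <= x < w:
                        out[i][j] += kernel[di][dj] * image[y][x]
    return out


def merge_branches(branches):
    """Fold (kernel, scale) branches into one 3x3 kernel by summing scaled weights."""
    merged = [[0.0] * 3 for _ in range(3)]
    for kernel, scale in branches:
        for i in range(3):
            for j in range(3):
                merged[i][j] += scale * kernel[i][j]
    return merged


# Hypothetical "learned" 3x3 kernel and per-branch scales (illustrative values only).
learned = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, -0.3]]
branches = [(learned, 1.0), (SOBEL_X, 0.7), (SOBEL_Y, -0.4), (LAPLACIAN, 0.25)]
image = [[float((3 * i + 5 * j) % 7) for j in range(6)] for i in range(5)]

# Training-time form: run every branch separately and sum the responses.
h, w = len(image), len(image[0])
multi_branch = [[0.0] * w for _ in range(h)]
for kernel, scale in branches:
    response = conv3x3(image, kernel)
    for i in range(h):
        for j in range(w):
            multi_branch[i][j] += scale * response[i][j]

# Inference-time form: a single convolution with the merged kernel.
single_branch = conv3x3(image, merge_branches(branches))
max_diff = max(abs(a - b)
               for ra, rb in zip(multi_branch, single_branch)
               for a, b in zip(ra, rb))
```

Here `max_diff` is zero up to floating-point rounding, so the four-branch computation can be replaced by one convolution at inference without changing the output.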

Semantic injection module (SIM)

Image fusion is a fundamental visual task, aiming to better serve high-level visual tasks such as semantic segmentation and object detection. Semantic information is particularly important in these high-level visual tasks. If we only pursue the visual effect of fused images and ignore the aggregation of semantic information, it may lead to poor fusion results. To address this issue, this paper introduces a deep semantic information injection and modulation component in the intermediate stage of the encoder and decoder. This design enables the fusion network to aggregate more meaningful semantic information, thereby improving the visual effect of fused images while better serving high-level visual tasks.

Specifically, as shown in Fig 2, we expect that the semantic features ϕs, obtained from the source images by the semantic encoder, can be better injected into the fusion branch. To achieve this goal, this paper draws inspiration from the design of Wang et al. [31] and develops an efficient Semantic Injection Module (SIM). This module consists of N units (N = 3), each containing a semantic injection block (SIB), a 3 × 3 convolution, and a LeakyReLU activation function. The SIB is the core of the module and enables effective interaction between semantic features and fused features. Through this module, the model can integrate more semantic information into the fused image, thus improving the suitability of the fusion results for high-level visual tasks. SIB can be formulated as: (5) where ⊗ denotes element-wise (pixel-wise) multiplication and ⊕ denotes element-wise addition.

In order to better integrate semantics into the image information fusion branch, the Semantic Injection Module (SIM) fuses image information features and semantic information features through multiple Semantic Injection Blocks (SIBs) connected to convolutional pipelines. Additionally, residual connections are used to enhance the expression of source image feature information. This process can be formalized as: (6)
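The data flow described above can be sketched as follows. Since Eqs 5 and 6 are not reproduced in this text, the sketch assumes a modulation in the style of spatial feature transformation [31], in which the semantic feature produces a scale and a shift that are applied to the fused feature with ⊗ and ⊕. The 3 × 3 convolutions are replaced by scalar weights, features are flat lists, and the LeakyReLU slope of 0.2 and the position of the residual connection are assumptions; the names `sib` and `sim` are hypothetical.

```python
def leaky_relu(x, slope=0.2):
    """LeakyReLU with an assumed negative slope of 0.2."""
    return x if x >= 0 else slope * x


def sib(fused, semantic, gamma_w, beta_w):
    """Assumed SIB form: fused (x) scale(semantic) (+) shift(semantic).

    gamma_w and beta_w are scalar stand-ins for the layers that would map
    the semantic feature to a per-pixel scale and shift.
    """
    return [f * (gamma_w * s) + beta_w * s for f, s in zip(fused, semantic)]


def sim(fused, semantic, units):
    """Chain of units (SIB -> conv stand-in -> LeakyReLU) plus a residual connection."""
    x = fused
    for gamma_w, beta_w, conv_w in units:
        x = sib(x, semantic, gamma_w, beta_w)
        x = [leaky_relu(conv_w * v) for v in x]  # scalar stand-in for the 3x3 convolution
    return [f + v for f, v in zip(fused, x)]     # residual keeps the source feature


# N = 3 units, as in the text; the weights are illustrative only.
units = [(1.0, 0.5, 0.8), (0.9, -0.2, 1.1), (1.2, 0.1, 0.7)]
fused_feature = [0.4, -1.0, 2.0]
semantic_feature = [1.0, 0.0, 0.5]
injected = sim(fused_feature, semantic_feature, units)
```

In this simplified form, a position whose semantic feature is zero receives no injected signal, so the residual connection passes the fused feature through unchanged at that position.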

Loss function

SIFusion not only constrains the fusion result directly through the fusion loss but also uses the target loss to constrain the encoder of the semantic feature extraction branch. The fusion loss and the target loss are detailed below.

Fusion loss mainly includes content loss and correlation loss. In recent years, content loss functions have been widely used in CNN-type fusion networks. The definition of content loss function is as follows: (7)

However, because of the nature of the maximum-intensity loss, when the visible light image is captured in extreme conditions such as strong light or fog, the fused image may learn more information from the visible light image. As a result, meaningful infrared image information may be overwritten by higher-intensity visible light image information, which is not the desired result. To prevent the loss of target-level infrared information in complex scenes, this paper uses a target mask to preserve infrared intensity information. In this way, meaningful infrared information is not obscured by visible light information during the fusion process, and more target-level details are preserved in the final fused image. The detailed definition of the intensity loss is as follows: (8) where ‖⋅‖1 denotes the ℓ1 norm and ⊗ denotes element-wise (pixel-wise) multiplication.
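To make the role of the target mask concrete, the sketch below implements one plausible masked intensity loss. The exact form of Eq 8 is not reproduced in this text, so the combination used here is an assumption: inside the mask the fused image is pulled toward the infrared intensity, and outside the mask toward the element-wise maximum of the two sources, with the ℓ1 norm averaged over all pixels. The function name is hypothetical.

```python
def masked_intensity_loss(fused, ir, vi, mask):
    """Assumed form: mean of  M*|F - IR| + (1 - M)*|F - max(IR, VI)|  over all pixels."""
    h, w = len(fused), len(fused[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            m = mask[i][j]
            background_target = max(ir[i][j], vi[i][j])
            total += m * abs(fused[i][j] - ir[i][j])
            total += (1 - m) * abs(fused[i][j] - background_target)
    return total / (h * w)


# One target pixel (mask = 1) where visible glare is brighter than the infrared
# target, and one background pixel (mask = 0).
ir = [[0.8, 0.1]]
vi = [[1.0, 0.3]]
mask = [[1, 0]]

keeps_infrared = masked_intensity_loss([[0.8, 0.3]], ir, vi, mask)
follows_glare = masked_intensity_loss([[1.0, 0.3]], ir, vi, mask)
```

Under this assumed form, a fused image that copies the visible glare at the target pixel is penalized, while one that keeps the infrared intensity there incurs no loss.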

To ensure that the fusion result captures the texture details of the source images, this paper considers not only the intensity information of the source images but also their edge gradient information, because edge gradients play a crucial role in the visual effect and texture recovery of the image. The detailed definition of the gradient loss is as follows: (9) where ∇ denotes the Sobel operator and |⋅| denotes the absolute value operation.
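A gradient loss of this kind can be sketched as follows. Eq 9 is not reproduced in this text, so the form below is an assumption borrowed from common practice in fusion networks: the Sobel gradient magnitude of the fused image is pulled toward the element-wise maximum of the source gradient magnitudes. Replicate padding at the borders and the |gx| + |gy| magnitude are also assumptions; all function names are hypothetical.

```python
SOBEL_X = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
SOBEL_Y = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]


def sobel_magnitude(image):
    """Return |gx| + |gy| per pixel, clamping indices at the border (replicate padding)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            gx = gy = 0.0
            for di in range(3):
                for dj in range(3):
                    y = min(max(i + di - 1, 0), h - 1)
                    x = min(max(j + dj - 1, 0), w - 1)
                    gx += SOBEL_X[di][dj] * image[y][x]
                    gy += SOBEL_Y[di][dj] * image[y][x]
            out[i][j] = abs(gx) + abs(gy)
    return out


def gradient_loss(fused, ir, vi):
    """Assumed form: mean of | |grad F| - max(|grad IR|, |grad VI|) | over all pixels."""
    gf, gi, gv = sobel_magnitude(fused), sobel_magnitude(ir), sobel_magnitude(vi)
    h, w = len(fused), len(fused[0])
    total = sum(abs(gf[i][j] - max(gi[i][j], gv[i][j]))
                for i in range(h) for j in range(w))
    return total / (h * w)


# The visible image has a vertical edge; the infrared image is flat.
vi = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
ir = [[0.5] * 4 for _ in range(4)]

loss_keeps_edge = gradient_loss(vi, ir, vi)   # fused image keeps the visible edge
loss_drops_edge = gradient_loss(ir, ir, vi)   # fused image is flat like the infrared one
```

The element-wise maximum means the fused image is rewarded for keeping whichever source has the stronger edge at each pixel, regardless of modality.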

In addition, the correlation loss is introduced in this paper to strengthen the correlation between the fused image and the source image, and the correlation loss is defined as follows: (10) where, corr(⋅) represents the correlation function.
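The text does not specify which correlation function corr(⋅) is used, and Eq 10 is not reproduced here. As one common choice, the sketch below computes the Pearson correlation coefficient between two images; it illustrates only the correlation function, not the loss built from it. The function name and the small stabilizing constant `eps` are assumptions.

```python
import math


def pearson_corr(a, b, eps=1e-12):
    """Pearson correlation between two equally sized 2D lists (eps avoids division by zero)."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    mean_a = sum(flat_a) / len(flat_a)
    mean_b = sum(flat_b) / len(flat_b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(flat_a, flat_b))
    var_a = sum((x - mean_a) ** 2 for x in flat_a)
    var_b = sum((y - mean_b) ** 2 for y in flat_b)
    return cov / (math.sqrt(var_a * var_b) + eps)


source = [[0.1, 0.4], [0.7, 0.9]]
brighter_copy = [[2 * v + 0.1 for v in row] for row in source]   # same structure, new contrast
inverted = [[1.0 - v for v in row] for row in source]            # reversed structure

corr_same_structure = pearson_corr(source, brighter_copy)
corr_inverted = pearson_corr(source, inverted)
```

A correlation of this type is insensitive to global brightness and contrast changes, so it measures whether the fused image follows the structure of a source image rather than its absolute intensity.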

To better extract the semantic information of the source images, this paper uses the target loss to constrain the training of the semantic encoder. The target loss is defined as follows: (11) where CA(⋅) denotes the channel average function.

Finally, the training of SIFusion is jointly constrained by multiple loss functions to obtain better fusion results. The overall loss is defined as follows: (12)

Experimental validation

Experimental configurations

Benchmark dataset.

To verify the effectiveness of the proposed method, comparative experiments were conducted on three public datasets: MSRS [16], M3FD [21], and LLVIP [36]. A common way to enlarge the training set is to apply a reshape operation, but this destroys the continuity between adjacent pixels, which hinders the model from learning pixel-level texture details. Therefore, each original 480 × 640 image in the MSRS training set was instead cut into 16 patches of 120 × 160, expanding the training set from 1083 to 17328 pairs, each comprising a visible light image, an infrared image, and the corresponding mask. In addition to the comparative experiments on the MSRS dataset, generalization experiments were conducted on the M3FD and LLVIP datasets to verify the performance of SIFusion.
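The patch arithmetic above can be checked with a short sketch. It assumes non-overlapping patches on a regular 4 × 4 grid, which is consistent with cutting a 480 × 640 image into 16 patches of 120 × 160; the helper names are hypothetical.

```python
def patch_boxes(height, width, patch_h, patch_w):
    """Return (top, left, bottom, right) boxes of non-overlapping patches."""
    boxes = []
    for top in range(0, height - patch_h + 1, patch_h):
        for left in range(0, width - patch_w + 1, patch_w):
            boxes.append((top, left, top + patch_h, left + patch_w))
    return boxes


def crop(image, box):
    """Crop a 2D list with a (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


boxes = patch_boxes(480, 640, 120, 160)
num_training_pairs = 1083 * len(boxes)

# The same boxes would be applied to the visible image, the infrared image,
# and the mask so that the three crops stay aligned.
dummy_image = [[0] * 640 for _ in range(480)]
first_patch = crop(dummy_image, boxes[0])
```

Because each patch is a contiguous crop, neighboring pixels inside a patch remain neighbors, which is the continuity property the text refers to.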

Implementation details.

The proposed method is an end-to-end model. The network is optimized with AdamW for 100 epochs with an initial learning rate of 5 × 10⁻⁴, and the loss function parameters are μ = 2, α = 25, β = 25, and λ = 50. The test sets are the public infrared and visible light image fusion datasets MSRS, TNO, and LLVIP, from which 30, 42, and 50 image pairs, respectively, are selected for the algorithm comparison experiments. All experiments were implemented in the PyTorch deep learning framework on an NVIDIA 2080Ti GPU with 11 GB of memory. All comparison algorithms were configured according to their original papers.

Comparison algorithm.

In this article, we compare SIFusion with three AE-based methods (DenseFuse [14], RFN-Nest [37], and CSF [38]), five CNN-based methods (SDNet [24], FLFuse [25], U2Fusion [39], SeAFusion [3], and PSFusion [20]), and four GAN-based methods (FusionGAN [18], GANMcC [40], TarDAL [21], and UMF-CMGR [41]).

Evaluation metrics.

Because infrared and visible light image fusion has no reference image, a single evaluation metric is not sufficient to demonstrate the quality of a fusion result. Therefore, six general image quality evaluation metrics are used in this paper, namely SD, MI, VIF, SCD, EN, and Qabf, to measure the fusion result from different perspectives. MI, SCD, and EN evaluate the amount of information contained in the fused image. SD reflects the contrast of the fused image. Qabf measures the edge information retained in the fused image. VIF quantifies the information shared between the fused image and the source images, thereby measuring the degree to which the fusion result conforms to human visual perception. All six metrics are positive indicators; that is, higher values represent better results.
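Two of these metrics have simple closed forms and can be sketched directly: EN is the Shannon entropy of the gray-level histogram, and SD is the standard deviation of the pixel intensities. The sketch below uses these standard definitions; the function names are hypothetical, and the remaining metrics (MI, VIF, SCD, Qabf) are more involved and are not shown.

```python
import math
from collections import Counter


def entropy(image):
    """EN: Shannon entropy (in bits) of the gray-level histogram of a 2D list."""
    pixels = [v for row in image for v in row]
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def standard_deviation(image):
    """SD: population standard deviation of the pixel intensities."""
    pixels = [v for row in image for v in row]
    mean = sum(pixels) / len(pixels)
    return math.sqrt(sum((v - mean) ** 2 for v in pixels) / len(pixels))


flat_image = [[128, 128], [128, 128]]      # no information, no contrast
two_level_image = [[0, 255], [0, 255]]     # two equally likely gray levels

en_flat, sd_flat = entropy(flat_image), standard_deviation(flat_image)
en_two, sd_two = entropy(two_level_image), standard_deviation(two_level_image)
```

Both values grow as the image uses more gray levels with greater spread, which is why higher EN and SD are read as richer information and stronger contrast.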

Comparison experiments

Figs 4 and 5 show the visualization results of the proposed method and twelve comparative algorithms. The red box highlights the degree of preservation of salient objects by each method, while the green box shows the differences in background details between different methods.

Fig 4. Visualization results for daytime scenes in the MSRS dataset.

https://doi.org/10.1371/journal.pone.0307236.g004

Fig 5. Visualization results for nighttime scenes in the MSRS dataset.

https://doi.org/10.1371/journal.pone.0307236.g005

The MSRS dataset contains infrared and visible light images of urban street scenes in both daytime and nighttime. As shown in the visualization results in Fig 4, in daytime scenes, SIFusion exhibits clear advantages in visual effects compared to other methods. FusionGAN suffers from texture blurring, while SDNet has clearer texture but insufficient overall brightness. FLFuse and U2Fusion have insufficient contrast. The results of DenseFuse, CSF, and RFN-Nest achieve good visual effects to some extent, but the contrast and brightness of the scene are still insufficient. Although TarDAL can highlight the saliency of semantic targets, the texture details of the background are distorted. In nighttime scenes, as shown in Fig 5, the above methods have similar limitations. All of these algorithms preserve the saliency of pedestrians in the infrared images to varying degrees, but SeAFusion, PSFusion, and SIFusion additionally make the background texture clearer and enhance the visual experience while largely maintaining the saliency of semantic targets.

Qualitative comparison relies on human vision to distinguish between methods, but because display devices differ, some results may be hard to tell apart, so a quantitative comparison using the evaluation metrics is also needed.

The quantitative comparison results in the MSRS dataset are shown in Table 1. From the results, it can be seen that SeAFusion, PSFusion, and SIFusion belong to semantic-aware types of methods, which have better performance than other fusion methods. This indicates that the fused images generated by these methods contain rich information and transfer a substantial amount of information from the source images. Among them, PSFusion, which has the most complex model, has the highest evaluation metrics, while the evaluation metrics of SeAFusion and SIFusion are not significantly different, implying that SIFusion has a similar capability to retain semantic information as SeAFusion. However, compared to PSFusion and SeAFusion, SIFusion has a lighter network design, with specific model parameter comparisons shown in Table 6. In summary, the results of both qualitative and quantitative analysis can prove the superiority of SIFusion on the MSRS dataset.

Table 1. Results of quantitative comparisons in the MSRS dataset.

The best results are marked in bold, the second-best results are underlined, and the third-best results are italicized.

https://doi.org/10.1371/journal.pone.0307236.t001

Generalization experiments

In addition to the experiments on the MSRS dataset, we conducted experiments on the M3FD and LLVIP datasets to validate the generalization ability of the proposed method.

The M3FD dataset.

The M3FD dataset is a multi-scenario, multi-modal dataset covering daylight, overcast, nighttime, and other challenging environments, with rich semantic information about people, vehicles, and similar targets. Figs 6 and 7 display the visualization results on the M3FD dataset, where the red boxes mark salient targets and the green boxes mark background details. In terms of pedestrian saliency, the targets in the results of TarDAL, PSFusion, and SIFusion are all quite prominent. However, the fusion results of TarDAL are too close to the infrared images, so some background texture information from the visible light images is lost. In contrast, PSFusion and SIFusion better preserve the scene description of the visible light images, especially the background texture details.

In Table 2, PSFusion performs best, with SIFusion and SeAFusion ranking second and third overall, respectively. SIFusion ranks just behind PSFusion on the SD, SCD, and EN metrics, indicating that the proposed method can generate fusion results with better saliency. It also performs well on the VIF and Qabf metrics, indicating that its fusion results are more consistent with human visual perception. In summary, the experiments on the M3FD dataset show that SIFusion fuses infrared and visible light images well across multiple scenarios.
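To make two of the metrics named above concrete, the following is a minimal sketch of how EN (entropy) and SD (standard deviation) are commonly computed for a single 8-bit grayscale fused image. It assumes the usual textbook definitions (Shannon entropy of the gray-level histogram and the standard deviation of pixel intensities); the exact formulas and implementation used for the tables are not given in this section, and the function names and the toy image are illustrative only.

```python
import numpy as np


def entropy(img: np.ndarray, levels: int = 256) -> float:
    """Shannon entropy (in bits) of the gray-level histogram of an 8-bit image."""
    hist = np.bincount(img.ravel(), minlength=levels).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]  # empty bins contribute nothing (0 * log 0 is taken as 0)
    return float(-(p * np.log2(p)).sum())


def standard_deviation(img: np.ndarray) -> float:
    """Standard deviation of pixel intensities, a simple proxy for contrast."""
    return float(img.astype(np.float64).std())


# Toy "fused image": left half black, right half white (illustrative data only).
toy = np.zeros((4, 4), dtype=np.uint8)
toy[:, 2:] = 255

en_value = entropy(toy)             # two equally likely gray levels
sd_value = standard_deviation(toy)  # pixels sit 127.5 away from the mean
```

Under these definitions, a higher EN means the gray levels are spread over more values, and a higher SD means stronger global contrast, which is how the two metrics are read in the comparison above.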

Table 2. Results of quantitative comparisons in the M3FD dataset.

The best results are marked in bold, the second-best results are underlined, and the third-best results are italicized.

https://doi.org/10.1371/journal.pone.0307236.t002

The LLVIP dataset.

LLVIP is a public dataset of infrared and visible light images of urban traffic at night, characterized by high image quality and a large number of common semantic targets, such as pedestrians and vehicles. The visualization results on the LLVIP dataset are shown in Figs 8 and 9. The magnified red boxes show that, with the exception of SIFusion and PSFusion, the methods weaken the contrast of the infrared targets to varying degrees. The magnified green boxes show that DenseFuse, GANMcC, FLFuse, TarDAL, and SeAFusion blur the text in the background, while the remaining methods display the background details but do not match the clarity of SIFusion and PSFusion.

Furthermore, in Table 3, SIFusion scores higher than the other methods on the SD, SCD, and Qabf metrics, indicating that its fusion results exhibit good contrast. Although its results on the other metrics are not as good as those of PSFusion and SeAFusion, the proposed method achieves satisfactory performance with a lighter model. These results demonstrate that SIFusion has a distinct advantage in fusing infrared and visible light images of nighttime urban traffic scenes.

Table 3. Results of quantitative comparisons in the LLVIP dataset.

The best results are marked in bold, the second-best results are underlined, and the third-best results are italicized.

https://doi.org/10.1371/journal.pone.0307236.t003

Ablation experiment

To further validate the effectiveness of each module in the proposed method, ablation experiments were conducted. The qualitative and quantitative results are shown in Fig 10 and Table 4, respectively.

Table 4. Quantitative comparison results of ablation experiments.

The best results are marked in bold.

https://doi.org/10.1371/journal.pone.0307236.t004

Semantic injection module (SIM).

The semantic injection module (SIM) uses semantic information injection and modulation to enable the fusion network to integrate more semantic information. As shown in Fig 10(b), the proposed method maintains high contrast of infrared targets even in extreme scenarios. As shown in Fig 10(c), when the SIM is removed, the target contrast in the enlarged red box decreases significantly, and the values of the evaluation metrics in Table 4 also decrease. These results indicate that the semantic injection module helps preserve the contrast of infrared targets.

Edge convolution block (ECB).

The Sobel and Laplacian operators are added to the Edge Convolution Block (ECB) to enhance the network’s fine-grained expression. As shown in Fig 10(d), after the ECB is removed, the enlarged background details in the green box become blurred, the gradient variation decreases, and the values of the evaluation metrics in Table 4 drop, indicating that performance deteriorates. Both the qualitative and quantitative results confirm the contribution of this module to the overall network.
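The idea behind structural reparameterization in an edge-oriented block can be illustrated with a simplified, single-channel sketch. It assumes a training-time block made of four parallel 3 × 3 branches (one learned kernel plus scaled Sobel-x, Sobel-y, and Laplacian filters) whose outputs are summed; because convolution is linear, these branches can be folded into one 3 × 3 kernel for inference. The branch layout, the scale values, and the helper `correlate2d_same` are assumptions made for illustration and are not the exact ECB of SIFusion, which may also contain 1 × 1 convolutions and biases.

```python
import numpy as np


def correlate2d_same(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """3x3 cross-correlation with zero padding (the operation deep-learning
    frameworks call 'convolution'), keeping the input size."""
    h, w = x.shape
    xp = np.pad(x, 1)  # zero padding of one pixel on every side
    out = np.zeros((h, w), dtype=np.float64)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * xp[i:i + h, j:j + w]
    return out


SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T
LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=np.float64)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))       # toy single-channel feature map
w_conv = rng.standard_normal((3, 3))  # hypothetical learned 3x3 kernel
s_x, s_y, s_lap = 0.5, -0.3, 0.2      # hypothetical learned branch scales

# Training-time form: four parallel branches whose outputs are summed.
y_multi = (correlate2d_same(x, w_conv)
           + s_x * correlate2d_same(x, SOBEL_X)
           + s_y * correlate2d_same(x, SOBEL_Y)
           + s_lap * correlate2d_same(x, LAPLACIAN))

# Inference-time form: fold every branch into one equivalent 3x3 kernel.
w_merged = w_conv + s_x * SOBEL_X + s_y * SOBEL_Y + s_lap * LAPLACIAN
y_single = correlate2d_same(x, w_merged)

max_abs_diff = float(np.abs(y_multi - y_single).max())
```

In this sketch the merged kernel reproduces the multi-branch output up to floating-point error, so the edge-aware branches add capacity during training without adding cost at inference.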

Additionally, to further validate the effectiveness of ECB, this paper also compares ECB with other typical modules, including 3 × 3 convolution, MobileNet Block [42], and GhostNet Block [43]. The comparison results are shown in Table 5. As indicated in Table 5, the MobileNet Block and GhostNet Block, as two typical modules, can achieve lightweight fusion to a certain extent. However, the ECB demonstrates better results across various evaluation metrics, which validates its effectiveness.

Table 5. Comparative experiment of ECB with other typical modules.

The best results are marked in bold.

https://doi.org/10.1371/journal.pone.0307236.t005

Efficiency comparison experiment

The proposed method attends not only to the quality of the fusion results but also to the lightweight nature of the model. To assess the performance of SIFusion more comprehensively, this section follows the image size of the MSRS dataset and sets the input dimensions to 640 × 480 × 1. The parameter counts and FLOPs of the network models are calculated using TorchSummary and Thop. SIFusion is compared with twelve comparative algorithms in terms of runtime, runtime memory usage, parameter count, weight size, and FLOPs. The results are shown in Table 6.
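As a rough guide to what such counts measure, the sketch below works out the parameters and multiply-accumulate operations (MACs) of stride-1, same-padded convolution layers by hand for a 640 × 480 input. The three-layer stack is a hypothetical example, not the SIFusion architecture, and the helper `conv2d_cost` is introduced here only for illustration. Profiling tools differ in whether one multiply-accumulate is reported as one or two operations, so hand-computed figures may differ from reported FLOPs by a factor of two.

```python
def conv2d_cost(c_in: int, c_out: int, k: int, h: int, w: int, bias: bool = True):
    """Return (parameters, MACs) of a stride-1, 'same'-padded k x k convolution
    applied to a feature map of spatial size h x w."""
    weights = c_in * c_out * k * k
    params = weights + (c_out if bias else 0)
    macs = weights * h * w  # one multiply-accumulate per weight per output pixel
    return params, macs


H, W = 480, 640  # spatial size of the 640 x 480 x 1 input described above

# Hypothetical (c_in, c_out, kernel) stack used only to show the arithmetic.
layers = [(1, 16, 3), (16, 32, 3), (32, 1, 3)]

total_params = 0
total_macs = 0
for c_in, c_out, k in layers:
    params, macs = conv2d_cost(c_in, c_out, k, H, W)
    total_params += params
    total_macs += macs

print(f"parameters: {total_params}")          # 5089
print(f"MACs: {total_macs / 1e9:.3f} G")      # 1.548 G
```

The example shows why cost grows quickly with channel width at this resolution: every weight is applied at each of the 307,200 output positions, so the 16-to-32-channel layer dominates the total.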

Table 6. The model parameters of SIFusion and comparative algorithms.

Bold indicates the best result and underline represents the second best result.

https://doi.org/10.1371/journal.pone.0307236.t006

The data in Table 6 show that, with the exception of FLFuse, SIFusion runs significantly faster than the other comparative algorithms. Although FLFuse is the fastest, its lightweight model limits its ability to extract information from the source images, leading to weaker information integration. The models of SeAFusion and PSFusion, in turn, require more parameters and computational resources. In contrast, the proposed method maintains low computational complexity while effectively integrating information from the source images, generating fused images that better match human visual perception. This makes SIFusion more efficient and practical, especially in scenarios with constraints on real-time performance and computational resources. The proposed method therefore balances network performance and lightweight design well, providing a useful reference for the development of lightweight image fusion.

The proposed method significantly reduces the complexity and computational cost of the SIFusion inference network through structural reparameterization. In practice, the fusion network can first be trained on a high-performance computing platform, after which its structure is simplified through structural reparameterization. This process enables the network to be deployed on mobile devices at lower computational cost.

Application of semantic segmentation

Although the proposed method performs well in terms of the image quality and visibility of the fusion results, the fusion results also need to meet the semantic requirements of high-level machine vision. To verify the semantic expressiveness of the proposed method, more detailed experiments are conducted in this section. Specifically, twelve comparative algorithms are compared with the proposed method on a semantic segmentation task. To ensure fairness, the segmentation network is retrained on the MSRS dataset, and the training and test sets are configured consistently with SeAFusion [3]. First, the various fusion methods are used to generate fused images. Then, the pixel-wise intersection over union (IoU), a metric commonly used in semantic segmentation, is adopted to evaluate the segmentation performance on the different fusion results. Table 7 shows that, compared with the other twelve methods, the proposed method achieves good segmentation accuracy across all categories, second only to PSFusion. This further demonstrates the advantage of the proposed method in helping the segmentation model recognize semantic information.
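For reference, the IoU of a class is the number of pixels labeled with that class in both the prediction and the ground truth, divided by the number of pixels labeled with it in either. The following minimal sketch computes per-class IoU and their mean on a toy 3 × 3 label map; the label maps, the three-class setting, and the function name `per_class_iou` are illustrative assumptions, and in practice the counts are usually accumulated over the whole test set rather than per image.

```python
import numpy as np


def per_class_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> np.ndarray:
    """Per-class intersection over union from integer label maps.
    Classes absent from both maps are reported as NaN."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union > 0:
            ious[c] = np.logical_and(p, t).sum() / union
    return ious


# Toy ground truth and prediction with three classes (illustrative data only).
target = np.array([[0, 0, 1],
                   [0, 1, 1],
                   [2, 2, 2]])
pred = np.array([[0, 0, 1],
                 [0, 0, 1],
                 [2, 2, 1]])

ious = per_class_iou(pred, target, num_classes=3)  # [0.75, 0.5, 0.666...]
miou = float(np.nanmean(ious))                     # mean over present classes
```

Here class 0 is over-predicted by one pixel (3 of 4 union pixels match), class 1 matches on 2 of 4 union pixels, and class 2 matches on 2 of 3, which gives the per-class values noted in the comments.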

Table 7. Comparative experimental results for semantic segmentation performance.

Bold indicates the best result and underline represents the second best result.

https://doi.org/10.1371/journal.pone.0307236.t007

We believe that this result is mainly due to two factors. First, SIFusion effectively integrates the complementary information in infrared and visible light images, helping the segmentation model to understand the imaging scene comprehensively. Second, the semantic injection module significantly enhances the expression of meaningful semantic information, so that the fused image contains rich semantic information. In summary, enriching the semantic information in the fused image is the key factor that makes our method superior to the other fusion algorithms in segmentation performance.

Performance discussion on the TNO dataset

The TNO dataset is a classic dataset in the field of image fusion, involving a large number of military-related targets and scenes. Because it was released early, its image quality is poor, and it lacks effective background texture details and salient target information. To explore the performance of the proposed method more comprehensively, generalization comparison experiments were also conducted on the TNO dataset, and the results are shown in Table 8.

Table 8. Results of quantitative comparisons in the TNO dataset.

The best results are marked in bold, the second-best results are underlined, and the third-best results are italicized.

https://doi.org/10.1371/journal.pone.0307236.t008

As the results in Table 8 show, the proposed method did not achieve satisfactory fusion performance on this dataset. We attribute this to the lack of common semantic targets such as pedestrians and vehicles in the TNO dataset, which prevents our method from effectively extracting the necessary semantic information and results in poor fusion performance. Although our method places high demands on the quality of the input images, as a lightweight image fusion algorithm it still demonstrates good performance and great potential.

Conclusion

In this study, we propose a lightweight semantic-infused fusion network framework called SIFusion. The framework adaptively integrates meaningful semantic information through a semantic injection module (SIM), which injects semantic feature information into the fused features, and introduces an edge convolution block (ECB) based on structural re-parameterization to achieve high-performance lightweight image fusion. We also design a content loss, a similarity loss, and a target semantic loss based on the mask of salient objects to better achieve the desired results. Extensive experiments show that SIFusion handles a variety of complex scenarios well.

However, in low-light scenarios, due to the severe degradation of visible light images, almost all fusion methods, including the method in this paper, have the limitation of not being able to effectively extract feature information from visible light images. A potential solution is to combine SIFusion with low-light enhancement techniques to achieve semantic-driven fusion in low-light scenarios. In addition, in the future, we can further improve SIFusion to meet the real-time demands of complex scenes in video image fusion, which has significant application value in the field of security surveillance.

Acknowledgments

We thank all the editors and reviewers in advance for their valuable comments that will improve the presentation of this paper.

References

  1. Zhang H, Xu H, Tian X, et al. Image fusion meets deep learning: A survey and perspective. Information Fusion. 2021; 76:323–336.
  2. Chen J, Li X, Luo L, et al. Multi-focus image fusion based on multi-scale gradients and image matting. IEEE Transactions on Multimedia. 2021; 24:655–667.
  3. Tang L, Yuan J, Ma J. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network. Information Fusion. 2022; 82:28–42.
  4. Pan Yang, Yang Jinhua, Zhu Lei, Yao Lina, Zhang Bo. Aerial images object detection method based on cross-scale multi-feature fusion. Mathematical Biosciences and Engineering. 2023; 20(9):16148–16168. pmid:37920007
  5. Zhang P, Wang D, Lu H, et al. Learning adaptive attribute-driven representation for real-time RGB-T tracking. International Journal of Computer Vision. 2021; 129:2714–2729.
  6. Ha Q, Watanabe K, Karasawa T, et al. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2017: 5108–5115.
  7. Guan D, Cao Y, Yang J, et al. Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection. Information Fusion. 2019; 50:148–157.
  8. Jain D K, Zhao X, González-Almagro G, et al. Multimodal pedestrian detection using metaheuristics with deep convolutional neural network in crowded scenes. Information Fusion. 2023; 95:401–414.
  9. Zhang Q, Zhao S, Luo Y, et al. ABMDRNet: Adaptive-weighted bi-directional modality difference reduction network for RGB-T semantic segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 2633–2642.
  10. Chen J, Li X, Luo L, et al. Infrared and visible image fusion based on target-enhanced multiscale transform decomposition. Information Sciences. 2020; 508:64–78.
  11. Fu Z, Wang X, Xu J, et al. Infrared and visible images fusion based on RPCA and NSCT. Infrared Physics & Technology. 2016; 77:114–123.
  12. Li H, Wu X J, Kittler J. MDLatLRR: A novel decomposition method for infrared and visible image fusion. IEEE Transactions on Image Processing. 2020; 29:4733–4746. pmid:32142438
  13. Ma J, Zhou Z, Wang B, et al. Infrared and visible image fusion based on visual saliency map and weighted least square optimization. Infrared Physics & Technology. 2017; 82:8–17.
  14. Li H, Wu X J. DenseFuse: A fusion approach to infrared and visible images. IEEE Transactions on Image Processing. 2019; 28(5):2614–2623.
  15. Jian L, Yang X, Liu Z, et al. SEDRFuse: A symmetric encoder–decoder with residual block network for infrared and visible image fusion. IEEE Transactions on Instrumentation and Measurement. 2021; 70:1–15.
  16. Tang L, Yuan J, Zhang H, et al. PIAFusion: A progressive infrared and visible image fusion network based on illumination aware. Information Fusion. 2022; 83:79–92.
  17. Ma J, Tang L, Xu M, et al. STDFusionNet: An infrared and visible image fusion network based on salient target detection. IEEE Transactions on Instrumentation and Measurement. 2021; 70:1–13.
  18. Ma J, Yu W, Liang P, et al. FusionGAN: A generative adversarial network for infrared and visible image fusion. Information Fusion. 2019; 48:11–26.
  19. Ma J, Xu H, Jiang J, et al. DDcGAN: A dual-discriminator conditional generative adversarial network for multi-resolution image fusion. IEEE Transactions on Image Processing. 2021; 29:4980–4995.
  20. Tang L, Zhang H, Xu H, et al. Rethinking the necessity of image fusion in high-level vision tasks: A practical infrared and visible image fusion network based on progressive semantic injection and scene fidelity. Information Fusion. 2023: 101870.
  21. Liu J, Fan X, Huang Z, et al. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 5802–5811.
  22. Sun Y, Cao B, Zhu P, et al. Detfusion: A detection-driven infrared and visible image fusion network. Proceedings of the 30th ACM International Conference on Multimedia. 2022: 4003–4011.
  23. Zhang Y, Liu Y, Sun P, et al. IFCNN: A general image fusion framework based on convolutional neural network. Information Fusion. 2020; 54:99–118.
  24. Zhang H, Ma J. SDNet: A versatile squeeze-and-decomposition network for real-time image fusion. International Journal of Computer Vision. 2021; 129:2761–2785.
  25. Xue W, Wang A, Zhao L. FLFuse-Net: A fast and lightweight infrared and visible image fusion network via feature flow and edge compensation for salient information. Infrared Physics & Technology. 2022; 127:104383.
  26. Chen Z, Fan H, Ma M, et al. FECFusion: Infrared and visible image fusion network based on fast edge convolution. Mathematical Biosciences and Engineering. 2023; 20(9):16060–16082. pmid:37920003
  27. Lu M, Jiang M, Kong J, et al. LDRepFM: A Real-time End-to-End Visible and Infrared Image Fusion Model Based on Layer Decomposition and Re-parameterization. IEEE Transactions on Instrumentation and Measurement. 2023:3280496
  28. Tang L, Xiang X, Zhang H, et al. DIVFusion: Darkness-free infrared and visible image fusion. Information Fusion. 2023; 91:477–493.
  29. Zhao Z, Bai H, Zhang J, et al. Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 5906–5916.
  30. Wang Z, Shao W, Chen Y, et al. Infrared and visible image fusion via interactive compensatory attention adversarial learning. IEEE Transactions on Multimedia. 2022; 25:7800–7813.
  31. Wang X, Yu K, Dong C, et al. Recovering realistic texture in image super-resolution by deep spatial feature transform. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 606–615.
  32. Liu H, Ma M, Wang M, Chen Z, Zhao Y. SCFusion: Infrared and Visible Fusion Based on Salient Compensation. Entropy. 2023; 25:985. pmid:37509931
  33. Liu J, Liu Z, Wu G, et al. Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 8115–8124.
  34. Zhang X, Zeng H, Zhang L. Edge-oriented convolution block for real-time super resolution on mobile devices. Proceedings of the 29th ACM International Conference on Multimedia. 2021: 4034–4043.
  35. Ding X, Zhang X, Han J, et al. Diverse branch block: Building a convolution as an inception-like unit. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 10886–10895.
  36. Jia X, Zhu C, Li M, et al. LLVIP: A visible-infrared paired dataset for low-light vision. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 3496–3504.
  37. Li H, Wu X J, Kittler J. RFN-Nest: An end-to-end residual fusion network for infrared and visible images. Information Fusion. 2021; 73:72–86.
  38. Xu H, Zhang H, Ma J. Classification saliency-based rule for visible and infrared image fusion. IEEE Transactions on Computational Imaging. 2021; 7:824–836.
  39. Xu H, Ma J, Jiang J, et al. U2Fusion: A unified unsupervised image fusion network. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2022; 44(1):502–518. pmid:32750838
  40. Ma J, Zhang H, Shao Z, et al. GANMcC: A generative adversarial network with multiclassification constraints for infrared and visible image fusion. IEEE Transactions on Instrumentation and Measurement. 2021; 70:1–14.
  41. Wang D, Liu J, Fan X, et al. Unsupervised misaligned infrared and visible image fusion via cross-modality image generation and registration. arXiv preprint arXiv:2205.11876. 2022.
  42. Howard A G, Zhu M, Chen B, et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. 2017.
  43. Han K, Wang Y, Tian Q, et al. Ghostnet: More features from cheap operations. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 1580–1589.