
Progressive decomposition of infrared and visible image fusion network with joint transformer and Resnet

Abstract

The objective of image fusion is to synthesize information from multiple source images into a single, high-quality composite that is information-rich, thereby enhancing both human visual interpretation and machine perception capabilities. This process also establishes a robust foundation for downstream image-related tasks. Nevertheless, current deep learning-based networks frequently neglect the distinctive features inherent in source images, presenting challenges in effectively balancing the interplay between basic and detailed features. To tackle this limitation, we introduce a progressive decomposition network that integrates the Lite Transformer (LT) and ResNet architectures for infrared and visible image fusion (IVIF). Our methodology unfolds in three principal stages: Initially, a foundational convolutional neural network (CNN) is deployed to extract coarse-scale features from the source images. Subsequently, the LT is employed to bifurcate these coarse features into basic and detailed feature components. In the second phase, to augment the detail information across various inter-layer extractions, we substitute the conventional ResNet preprocessing with a combination of coarse convolution and LT modules. Cascaded LT operations are implemented following the first two ResNet blocks (ResB), enabling two-branch feature extraction from these reconfigured blocks. The final stage involves the design of specialized fusion sub-networks to process the basic and detail information blocks extracted from different layers. These processed image feature blocks are then channeled through the semantic injection module (SIM) and Transformer decoders to generate the fused image. Complementing this architecture, we have developed a semantic information extraction module that aligns with the progressive inter-layer detail extraction framework.
The LT module is strategically embedded within the ResNet network architecture to optimize the extraction of both basic and detailed features across diverse layers. Moreover, we introduce a novel correlation loss function that operates on the basic and detail information between layers, facilitating the correlation of basic features while maintaining the independence of detail features across layers. Through comprehensive qualitative and quantitative analyses conducted on multiple infrared-visible datasets, we demonstrate the superior potential of our proposed network for advanced visual tasks. Our network exhibits remarkable performance in detail extraction, significantly outperforming existing deep learning methodologies in this domain.

1. Introduction

Different imaging devices have unique imaging mechanisms, resulting in captured images with distinct characteristics that reflect information from various perspectives. The primary objective of image fusion research has consistently been to extract richer and more comprehensive information from multi-modal images [1–4]. In challenging external environments, many devices are limited to capturing only partial information characteristics of an image based on their inherent capabilities. For example, unmanned equipment [2] equipped with camera technology designed to handle complex field scenes may face difficulties in accurately locating and identifying concealed objects due to uncontrollable factors, thereby presenting significant challenges to scientific research. To address these limitations, researchers have found that infrared imaging can effectively capture prominent target objects based on thermal radiation, offering notable advantages in terms of anti-interference and anti-obscuration. This capability partially mitigates the shortcomings associated with outdoor imaging. However, infrared imaging is often limited in its ability to describe environmental details, frequently resulting in the surroundings of the target object appearing blurred and indistinct. In contrast, visible light imaging excels at capturing rich texture and detail information, providing a complementary perspective that enhances overall image quality and utility. Consequently, numerous researchers have capitalized on the complementary characteristics of these two imaging modalities to devise diverse fusion methodologies. These approaches are designed to enhance target prominence while simultaneously delivering comprehensive contextual information, effectively addressing the constraints inherent in single-modal visual techniques.

The infrared and visible light images in Fig 1 are representative image pairs selected from the classic TNO dataset. Fig 1 illustrates a typical example of IVIF in a complex and harsh field environment. Under such conditions, visible light can only capture the texture information around the fog, making it difficult to perceive the characteristics of the target behind the fog, with much of the detailed information obscured. Infrared images, due to their unique imaging mechanism, can capture an overall view of the target through the fog based on thermal radiation. As demonstrated in Fig 1c-e, image fusion effectively aggregates information from both modal images. Clearly, effective fusion methods can highlight the features of the target while also describing the detailed information of the environment.

Fig 1. Visual comparison of different fusion methods.

https://doi.org/10.1371/journal.pone.0330328.g001

Fig 1 provides a visual comparison of different fusion methods. To quantitatively validate the superiority of our approach, we employ two rigorous evaluation metrics: Kullback-Leibler Divergence (KL) for measuring distributional differences between images and Centered Kernel Alignment (CKA) for assessing their structural similarity. As shown in Table 1, which presents the average KL and CKA values between fused and original images, CCDFusion demonstrates significantly higher KL divergence (indicating poorer distribution matching) and lower CKA values (reflecting weaker structural similarity) compared to other methods. Our proposed method, by contrast, achieves optimal performance on both metrics, with the lowest KL divergence and highest CKA values, thereby conclusively demonstrating the effectiveness of our fusion strategy. In this paper, the term “basic features” refers to cross-modal features with CKA > 0.6, while “detailed features” denote those with CKA < 0.3.
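The two evaluation metrics above can be computed straightforwardly. The sketch below is illustrative rather than the paper's exact evaluation code: it assumes KL divergence is taken between normalized intensity histograms of the two images, and it uses the common linear form of CKA on flattened image features.

```python
import numpy as np

def kl_divergence(img_a, img_b, bins=256, eps=1e-10):
    """KL divergence between the intensity histograms of two images.

    Lower values indicate a closer match between the fused image's
    intensity distribution and the source image's distribution.
    """
    p, _ = np.histogram(img_a, bins=bins, range=(0, 1), density=True)
    q, _ = np.histogram(img_b, bins=bins, range=(0, 1), density=True)
    p = p / p.sum() + eps  # renormalize and avoid log(0)
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))

def linear_cka(x, y):
    """Linear Centered Kernel Alignment between two feature matrices.

    x, y: (n_samples, n_features) arrays; returns a value in [0, 1],
    where 1 indicates identical structure up to linear transformation.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    num = np.linalg.norm(x.T @ y, 'fro') ** 2
    den = np.linalg.norm(x.T @ x, 'fro') * np.linalg.norm(y.T @ y, 'fro')
    return float(num / den)

rng = np.random.default_rng(0)
img = rng.random((64, 64))
# An image compared with itself: zero KL divergence, CKA of 1.
print(round(kl_divergence(img, img), 6))  # 0.0
print(round(linear_cka(img, img), 6))     # 1.0
```

With this convention, a fused image that closely matches a source yields a low KL divergence and a CKA near 1, which is the direction of improvement reported in Table 1; the CKA thresholds of 0.6 and 0.3 in the text then partition cross-modal features into basic and detailed groups.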

Currently, several advanced deep learning-based IVIF frameworks have demonstrated remarkable capabilities in target enhancement and texture preservation. Notable examples include adversarial mutual constraint-based fusion networks, deep learning architectures emphasizing texture and intensity ratio preservation [5–7], and saliency-guided fusion frameworks [8]. However, these approaches predominantly focus on optimizing fusion quality metrics while overlooking two critical aspects: inter-modal variability and the intrinsic relationship between image fusion and subsequent high-level vision tasks. Such limitations often result in fusion outcomes with substantial detail loss and inadequate target saliency. In response to these challenges, Luo et al. [9] proposed a novel multi-branch encoder with contrastive constraints, specifically designed to learn both basic and detailed features across modalities for multi-modal image fusion. While innovative, their approach presents certain limitations: the experimental validation methodology appears outdated and lacks robustness, and the network architecture remains relatively conventional in its monolithic design. A significant advancement was achieved by Zhao et al. [10] through their correlation-driven fusion network (CCDFusion), which redefined traditional fusion paradigms by explicitly modeling modality-shared and modality-specific features. Despite its impressive fusion performance, CCDFusion’s architectural design shows room for improvement, particularly in its simplistic approach to base and detail encoding where information from different modalities is merely summed and concatenated before decoding. As illustrated in Fig 1, this architecture demonstrates sub-optimal performance in detail extraction compared to its counterparts.
In contrast, PSFusion [11] exhibits superior capability in capturing detailed target information, as evidenced by its clearer representation of fog dispersion and enhanced visibility of vegetation in bush areas. This comparative analysis underscores the pivotal role of fusion methodology design in effective information aggregation and highlights the need for more sophisticated architectural approaches in IVIF systems.

Motivated by the aforementioned challenges, we propose a novel and robust framework designed to overcome the limitations of incomplete extraction of basic and detailed features in existing networks. Our primary objective is to enhance texture detail preservation while establishing a robust foundation for downstream image processing tasks. The proposed network architecture comprises a backbone network and three specialized branch networks, each serving distinct functional purposes. The backbone network integrates LT with ResNet34 to hierarchically extract multi-level feature information from shallow to deep layers. This architecture is further enhanced by a superficial information fusion module (SIFM), which effectively integrates basic image information across different modules. The first branch network employs an invertible neural network (INN) for hierarchical detail information extraction, building upon the backbone network’s output. This branch incorporates two specialized modules: an INN-based specialized detail fusion module (SDFM) and a multi-attention based deep detail fusion module (D2FM), which collectively facilitate the integration of both shallow and deep layer detail information. The second branch network, designated as the feature fusion network, performs layer-wise integration of basic and detail information through a deep feature fusion module (DF2M). This network implements bottom-up feature information aggregation utilizing the semantic injection module (SIM), while enhancing the layer output of LT to achieve optimal fusion results. Additionally, it strengthens the original scene fidelity path to enable accurate image reconstruction from feature information. The third branch network serves as a bottom-up semantic information extraction network, which progressively aggregates deep detail extraction information and processes it through the sparse semantic perception module (S2PM). 
This network combines convolutional layers with batch normalization (BN) and ReLU activation to achieve efficient feature-level semantic information extraction. Notably, our semantic extraction is implemented through the detail extraction sub-network for information aggregation, eliminating the need for a dedicated semantic feature extraction network. This innovative approach significantly reduces network parameters and computational overhead. Through extensive data training and parameter optimization in the PyTorch environment, we have thoroughly validated the effectiveness of our proposed method. As demonstrated in Fig 1e, our fusion network surpasses existing state-of-the-art (SOTA) networks in both detail preservation and visual perception, achieving superior performance metrics.

The two calibrated input images are jointly processed by our network, where each level of the backbone independently extracts multi-scale features from both images without shared parameters, which are dynamically optimized based on the global network structure. The backbone generates six hierarchical feature representations for each input: features from levels 1–4 are integrated by the SIFM to capture structural information, while levels 2–6 undergo detail refinement through the INN+SDFM and D2FM. The fused shallow and deep features are combined via hierarchical addition and further optimized by the DF2M, with the SIM and LT then collaboratively decoding the unified representation to produce the final fused image. This work presents several key advantages:

  1. Network architecture design: Our proposed framework employs a hierarchical architecture comprising a top-down backbone network integrated with bottom-up information processing pathways. The backbone network synergistically combines LT and ResNet architecture to enable comprehensive extraction of both basic and detailed features. Additionally, we have developed two specialized bottom-up networks for hierarchical information integration, incorporating an advanced semantic information extraction mechanism to enhance feature representation and processing efficiency.
  2. Backbone network construction: We have optimized the original CNN and Transformer blocks to facilitate both local and global feature extraction while maintaining computational efficiency. The architecture incorporates coarse and LT modules as preprocessing components within ResNet, specifically integrating the LT network into the first two layers of ResNet. This design strategy enables comprehensive extraction of fundamental image information through the Transformer’s multi-head attention mechanism, followed by progressive extraction of detailed texture features using the INN in a layer-wise manner.
  3. Auxiliary network architecture: Our framework incorporates a hierarchical feature integration system comprising three specialized modules. The superficial information fusion module (SIFM) facilitates the integration of fundamental features extracted from multiple network layers. For enhanced detail processing, we have developed two complementary components: the specialized detail fusion module (SDFM) and the deep detail fusion module (D2FM) synergistically facilitate an exhaustive integration of both superficial and profound details. The integrated base and detail features undergo layer-wise summation before being processed by our novel deep feature fusion module (DF2M), which leverages multi-attention mechanisms for advanced feature re-integration. Furthermore, to substantiate the network’s semantic extraction capabilities, we have implemented a bottom-up semantic information extraction pipeline that operates concurrently with the detail information consolidation process.
  4. Experimental validation: Through comprehensive experimentation and rigorous evaluation, we have validated the efficacy and robustness of our proposed network architecture. Quantitative and qualitative analyses demonstrate that our framework achieves superior performance in both fundamental feature aggregation and intricate detail preservation. Comparative studies against contemporary SOTA fusion networks reveal statistically significant improvements across multiple evaluation metrics, substantiating the competitive advantages of our approach.

The experimental results substantiate that our proposed network architecture achieves three critical objectives: (1) enhanced target saliency, (2) superior texture information preservation, and (3) comprehensive semantic feature extraction. The remainder of this paper is systematically organized as follows: Section 2 provides a comprehensive review of current IVIF methodologies and establishes the theoretical foundation for our proposed framework. Section 3 presents the architectural overview of our approach, accompanied by detailed technical explanations of each component. Section 4 conducts extensive experimental validation through comparative analysis with SOTA methods, employing both quantitative metrics and qualitative visual assessments. Additionally, we perform rigorous ablation studies to validate the architectural rationality and operational efficacy of our framework. Finally, Section 5 concludes the paper by summarizing our key contributions and discussing potential future research directions.

2. Preliminary

In recent years, deep learning-driven fusion networks have witnessed remarkable advancements, emerging as a predominant research focus in the field of image processing. This section presents a systematic review of the historical progression and technological evolution in deep learning-based image fusion methodologies. Furthermore, it establishes the theoretical foundation essential for our proposed framework, encompassing two critical components: (1) advanced encoding-decoding mechanisms based on Transformer architectures, and (2) fundamental theoretical principles of INN. These elements collectively form the cornerstone of our innovative approach to image fusion.

2.1. Current status of IVIF

The field of image fusion has undergone significant paradigm shifts, progressing from rudimentary pixel-level fusion to sophisticated transform-domain based methodologies. This technological evolution has witnessed the development of numerous advanced transform domain tools, including but not limited to DWT [12], NSST [13], and various adaptive filters with their derivatives [14–18]. Concurrently, researchers have pioneered several innovative fusion frameworks, such as sparse representation-based approaches [19,20], Markov random field models [21], and low-rank representation techniques [22,23]. These methodologies have dominated the image fusion landscape for decades, delivering remarkable innovations and enhanced visual performance. Nevertheless, the field currently faces substantial challenges in achieving groundbreaking advancements, presenting both obstacles and opportunities for contemporary academic research.

The advent of machine learning has catalyzed a transformative shift in image fusion methodologies, enabling researchers to transcend the limitations of conventional algorithms and explore innovative fusion strategies through deep learning paradigms. This transition has yielded remarkable outcomes and demonstrated unprecedented developmental potential [24]. Early deep fusion architectures primarily employed basic convolutional layers to extract image features [25–27], with fusion achieved through dimensionality reduction of these feature outputs. Notable contributions include Long et al.'s [28] residual dense network framework and Ma et al.'s [8] ResBlock-based dense-block networks, which incorporated shared encoding-decoding mechanisms and saliency detection to enhance feature learning. These pioneering architectures [25–30], characterized by their encoder-decoder structures with convolutional dense blocks, offered simplicity and computational efficiency. However, researchers gradually recognized the inherent limitations of basic convolutional processing in capturing comprehensive image details, compounded by training process variability, often resulting in detail loss, texture degradation, and image blurring. This realization spurred the development of adversarial-based fusion networks [31–35], where the generator-discriminator paradigm was employed to enhance feature extraction. Despite their initial promise, these architectures revealed practical limitations, prompting further innovation. Liu et al. [33] pioneered a goal-driven dual adversarial learning network for multi-scene and multi-modal image fusion. Zhou et al. [34] enhanced adversarial networks by incorporating gradient and intensity components, developing a dual-discriminator architecture with distinct optimization objectives. Xu et al. [35] advanced the field through spectral and spatial loss-constrained adversarial networks.
These evolutionary developments demonstrate that adversarial networks augmented with specialized feature representation elements significantly improve algorithmic adaptability and offer novel perspectives for fusion framework design.

The remarkable success of Transformer networks in natural language processing has catalyzed their adaptation for image fusion tasks, despite the computational challenges inherent in their architecture. Researchers have made significant strides in optimizing Transformer models for efficient image fusion. Ma et al. [36] pioneered a generalized fusion framework utilizing the Swin Transformer architecture, implementing an innovative long-short distance learning mechanism to enhance the multi-head attention process. Zhao et al. [10] advanced this approach through a dual-branch feature extraction network based on LT, incorporating specialized architectures for base and detail feature extraction. The integration of downstream vision tasks has emerged as a promising direction in image fusion research. Numerous studies have successfully embedded critical computer vision tasks (including image segmentation, object detection, and semantic segmentation) within fusion frameworks [11,37–40], creating synergistic relationships between fusion quality and task performance. Tang et al. [37] and Zhang et al. [38] implemented post-fusion semantic segmentation and object detection respectively, employing customized loss functions to establish mutual constraints. However, these approaches were limited to pixel-level semantic extraction. A significant advancement was achieved by Tang et al. [11], who reimagined high-level vision task integration by embedding semantic extraction within the fusion network itself, enabling feature-level semantic information extraction and substantially improving outcomes. Wang et al. [40] further innovated with an interactive enhancement paradigm that incrementally integrates saliency-based IVIF with object detection through layer-wise feature incorporation. The rapid evolution of deep learning-based fusion frameworks and their expanding application domains [41–45] underscore the tremendous potential for future research in this dynamic field.
These developments not only demonstrate the adaptability of Transformer architectures but also highlight the growing sophistication of task-integrated fusion approaches.

However, existing deep learning-based fusion frameworks exhibit several notable limitations that warrant further investigation. The primary concern lies in the architectural design of early CNN-based fusion networks. These frameworks demonstrate excessive homogeneity, predominantly focusing on optimizing convolutional kernel designs and layer configurations, which consequently leads to substantial loss of detailed information in cross-modality feature representation. Although recent advancements integrating CNNs with Transformer architectures and downstream tasks have shown promising results, they simultaneously incur significantly increased computational costs and training complexity. Furthermore, the backbone networks designed for semantic information extraction often exhibit oversimplified structures with inadequate refinement. Another critical limitation is the insufficient theoretical foundation underlying CNN mechanisms. Current literature rarely provides comprehensive explanations regarding the internal operational principles of CNNs, nor does it adequately justify the rationale behind network architecture designs or the selection criteria for convolutional layer configurations. This theoretical gap hinders the development of more efficient and effective fusion frameworks. From a practical perspective, CNN-based networks face challenges in information preservation during forward propagation, particularly in balancing the extraction of basic and detailed features. This imbalance frequently results in fused images with weak detail contrast and sub-optimal visual quality. Additionally, the training process of most fusion networks is constrained by limited dataset availability and insufficient sample diversity, coupled with a lack of robust theoretical guidance in training methodology development. 
In this study, we aim to systematically address these limitations through a comprehensive approach that encompasses architectural innovation, theoretical analysis, and practical implementation improvements in existing fusion algorithms.

2.2. Transformer and its variants

In recent years, the Transformer model [46] has emerged as a fundamental tool in natural language processing (NLP), showcasing exceptional feature extraction capabilities across both low-level [47–49] and high-level visual tasks [50–53], with significant practical applications. Building on its success, researchers have extensively explored and enhanced the Transformer model by leveraging its remote dependency mechanisms, leading to the development of innovative and efficient fusion networks such as SwinFusion [36], IFT [54], and AFT [55]. These advanced architectures have demonstrated outstanding performance in various domains, including classification tasks [56,57], target detection [58,59], image segmentation [60,61], and multi-modal learning [62,63]. Despite its remarkable achievements, the Transformer model is not without limitations. Its high computational demands and substantial hardware requirements pose significant challenges for practical deployment. To mitigate these issues, Wu et al. [64] introduced the Lite Transformer, which incorporates long-short range attention mechanisms and a planarized feed-forward network. This approach enables effective modeling of both global and local contextual information while significantly reducing the number of parameters, thereby maintaining the performance of the original Transformer with improved efficiency. In this work, we adopt the LT as the core algorithm for our framework, considering its multi-head attention mechanism and efficient parameter utilization. This choice not only facilitates the deep extraction of image feature information but also optimizes computational efficiency, making it well-suited for our objectives.
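To make the long-short range attention idea concrete, the following schematic splits the channel dimension between a global self-attention branch and a local convolutional branch, in the spirit of the Lite Transformer. It is a minimal numerical sketch, not the published implementation: the local branch uses a uniform averaging kernel as a stand-in for learned depthwise convolution weights, and the attention is single-head without learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(x):
    """Single-head scaled dot-product self-attention (long-range branch)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)        # (n, n) pairwise token similarities
    return softmax(scores, axis=-1) @ x  # (n, d) globally mixed features

def local_conv(x, k=3):
    """Uniform 1-D convolution over the token axis (short-range branch)."""
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode='edge')
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        out[i] = xp[i:i + k].mean(axis=0)  # stand-in for learned depthwise weights
    return out

def lsra_block(x):
    """Long-Short Range Attention: split channels between the two branches."""
    half = x.shape[-1] // 2
    g = global_attention(x[:, :half])  # long-range dependencies
    l = local_conv(x[:, half:])        # local neighbourhood detail
    return np.concatenate([g, l], axis=-1)

tokens = np.random.default_rng(1).standard_normal((16, 8))  # 16 tokens, 8 dims
out = lsra_block(tokens)
print(out.shape)  # (16, 8)
```

Because each branch operates on only half the channels, the quadratic attention cost applies to a reduced dimension, which is the source of the parameter and compute savings the text attributes to the LT.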

2.3. Invertible neural networks (INN) and their variants

Invertible neural networks (INNs) have gained significant attention due to their distinctive feature extraction capabilities and ability to preserve information losslessly. These properties have enabled their successful application in various practical domains, including image coloring [65], information hiding [66], and high-resolution image processing [67]. A key strength of INNs lies in their ability to enhance backbone network performance while minimizing memory consumption, achieved through innovative inter-crossing convolutional management strategies. Recognizing these advantages, we employ an INN in this work as the primary mechanism for detail extraction and feature fusion. This choice is particularly motivated by its effectiveness in facilitating the seamless integration of detailed features across different network layers, thereby addressing critical challenges in multi-level feature representation.
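The lossless-preservation property follows from the affine coupling structure commonly used in INNs: one half of the features conditions an invertible affine transform of the other half, so the conditioning sub-networks themselves never need to be inverted. The sketch below illustrates this with a fixed `tanh` stand-in for the learned sub-network; the names `subnet`, `coupling_forward`, and `coupling_inverse` are illustrative, not this paper's module names.

```python
import numpy as np

def subnet(x):
    """Stand-in for a learned sub-network. Any function works here: it is
    never inverted, so invertibility of the whole layer is preserved."""
    return np.tanh(x)

def coupling_forward(x1, x2):
    """Affine coupling: transform x2 conditioned on x1; pass x1 through."""
    s, t = subnet(x1), subnet(-x1)  # scale and shift derived from x1 only
    y2 = x2 * np.exp(s) + t
    return x1, y2

def coupling_inverse(y1, y2):
    """Exact inverse: recompute s, t from the untouched half and undo."""
    s, t = subnet(y1), subnet(-y1)
    x2 = (y2 - t) * np.exp(-s)
    return y1, x2

rng = np.random.default_rng(2)
x1, x2 = rng.standard_normal(8), rng.standard_normal(8)
y1, y2 = coupling_forward(x1, x2)
r1, r2 = coupling_inverse(y1, y2)
print(np.allclose(x1, r1), np.allclose(x2, r2))  # True True
```

Stacking such layers while alternating which half is transformed yields a deep network whose forward pass discards no information, which is why INNs suit the detail-extraction role described above.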

2.4. The data flow between modules

Our proposed network architecture integrates ResNet and LT modules to implement a hybrid information processing paradigm that combines top-down feature extraction with bottom-up feature integration. The processing pipeline begins by standardizing input images to a 256 × 256 resolution before feeding them into our multi-stage backbone network. This backbone consists of six sequentially connected modules: (1) initial coarse feature extraction, (2) LT processing, (3) ResNet34 residual blocks followed by LT, (4) ResB and LT processing, and finally (5–6) two additional ResNet34 residual blocks. Through this architecture, we observe progressively decreasing feature scales across the six modules, leading us to classify the first four modules’ outputs as basic features and the last two as detailed features. The fusion process involves several key operations: SIFM integrates basic features from all four initial modules, while INN performs coarse-to-fine detail extraction on the outputs of modules 2–4. Subsequently, SDFM and D2FM handle detail integration for the INN-processed features and the final module outputs, respectively. To address scale discrepancies, we implement an upsampling-based feature recombination strategy in which integrated detail features are progressively combined with the corresponding fundamental features at each level. The complete fusion is ultimately achieved through DF2M module processing under SIM module coordination, which effectively combines the multi-scale feature representations to generate the final output image. This hierarchical approach ensures comprehensive utilization of both fundamental and detailed visual information throughout the fusion pipeline.
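The scale bookkeeping described above can be traced with a toy sketch. Everything here is a schematic stand-in: average pooling replaces the stride-2 ResNet/LT stages, and a single addition replaces the SIFM/SDFM/D2FM fusion modules. Only the six-stage scale progression and the upsampling-based recombination of basic and detailed features are meant to mirror the pipeline.

```python
import numpy as np

def downsample(x):
    """Halve spatial resolution with 2x2 average pooling (schematic
    stand-in for a stride-2 backbone stage)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Double spatial resolution by nearest-neighbour repetition."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

# Schematic backbone: six stages at progressively smaller scales.
img = np.random.default_rng(3).random((256, 256))
features, f = [], img
for stage in range(6):
    features.append(f)
    f = downsample(f)

basic = features[:4]   # stages 1-4: "basic" features (coarse structure)
detail = features[4:]  # stages 5-6: "detailed" features (fine residue)

# Scale-matched recombination: upsample a deep detail map until it matches
# the shallowest basic feature, then combine the two representations.
d = detail[1]  # deepest stage, 8 x 8
while d.shape != basic[0].shape:
    d = upsample(d)
fused = basic[0] + d
print([f.shape for f in features])
print(fused.shape)  # (256, 256)
```

The printed shapes show the six decreasing feature scales (256 down to 8 per side); in the actual network each level additionally carries learned channels and passes through the dedicated fusion modules before decoding.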

3. The proposed fusion network

In this section, we present our comprehensive conceptualization of the proposed network and offer a detailed explanation and illustration of its three main components. To enhance the persuasiveness of our work, we provide a comparative analysis that encompasses both theoretical aspects and the presentation of results. Additionally, we introduce the mechanism and principles behind the construction of the loss function, drawing on established theories in image fusion to seek insights and make improvements.

3.1. Analysis of the fusion framework

Deep learning has evolved a distinctive network architecture system through years of exploration and refinement. From the perspective of image fusion intrinsic architecture, nearly all deep learning networks are built upon CNNs, which rely on sufficient and reliable training sets as their foundational support. This enables computers to simulate the human brain, achieving high efficiency and intensive learning capabilities that surpass human abilities. Upon examining deep learning fusion frameworks over the past decade, it is evident that differences in fusion performance are primarily influenced by the design of the internal structure of the framework. The reasonableness and timeliness of this internal structure are crucial for effective information extraction. To highlight the differences in framework design, this paper focuses on two recent network architectures relevant to our study, which are briefly summarized in Fig 2 below.

Fig 2. A comparative analysis of streamlined fusion frameworks.

https://doi.org/10.1371/journal.pone.0330328.g002

In 2023, Zhao et al. [10] proposed the mutually driven two-branch feature decomposition network, CCDFusion, to emphasize both basic and detailed information. This network utilizes improved LT blocks and INN for targeted extraction of basic and detailed features, adopting an additive strategy to aggregate information features. Meanwhile, Tang et al. [11] revisited high-level vision tasks by implementing different layers of information extraction from the input image based on ResNet. They then designed corresponding strategies to realize the network PSFusion for feature integration and semantic information extraction. A thorough review of the literature reveals that these two networks are considered advanced frameworks in the field of image fusion. Both frameworks demonstrate the advantages of current deep learning, whether through their overall design or sub-network architectural choices. However, they also exhibit some minor flaws. For instance, while CCDFusion has a commendable initial intention, its implementation is somewhat rudimentary. It forcibly uses the Restormer block and INN block in a loop to attempt basic and detailed information extraction, followed by a simple additive strategy to merge common and private features of the two modal images into the Restormer block for decoding, resulting in the fused image. On the other hand, PSFusion’s overall framework design is both polished and exquisite, fully considering the semantic requirements of downstream tasks. However, it neglects the private feature extraction of the two modal images throughout the network. Although the visual effect of fusion is highly satisfactory, the extraction of detailed information in regions rich in information features still requires improvement.

From the comparative framework diagram in Fig 2, it is evident that the CDDFuse framework is straightforward and easy to use, yet it falls short in extracting deeper layers of information. In contrast, the PSFusion framework boasts a meticulous design that carefully accounts for feature extraction of basic information across various levels, albeit without a distinct separation of basic and private information. Drawing on the strengths and addressing the shortcomings of these two frameworks, we propose an optimized and enhanced scheme. Our objective is to thoroughly extract both detailed and basic information across different layers, followed by the efficient amalgamation of this information through inter-layer integration.

3.2. Analysis of semantic segmentation framework

Semantic segmentation represents a sophisticated downstream task in image processing, extending beyond mere image classification. It constitutes a comprehensive prediction challenge within computer vision systems, with extensive applications in areas such as intelligent recognition, autonomous driving, and artificial intelligence. Initially, semantic segmentation methodologies predominantly relied on fully convolutional network architectures, utilizing mainstream classification backbones for image segmentation. Nonetheless, the swift advancement of deep learning networks has spurred the development of more efficient segmentation networks, including SegFormer [51], BANet [68], and SegNeXt [69]. Recently, Liu et al. [70] accomplished multi-modal image task segmentation employing the advanced SegFormer network, which stands as a cutting-edge algorithm in the semantic segmentation domain. It is noteworthy that numerous scholars have addressed the dual requirements of image fusion and semantic segmentation by integrating diverse feature elements into the model framework design, aiming for a synergistic effect between feature fusion and semantic segmentation modules. However, the majority of existing approaches position semantic information extraction subsequent to the fusion network [37,38], leading to uni-modal semantic information segmentation of the fusion outcome, which frequently falls short of precise semantic segmentation. Progressively, scholars have refined the original fusion framework design [11], transcending pixel-level feature extraction to embed semantic information extraction within the fusion framework. This entails the construction of a multi-branch backbone network to facilitate feature-level semantic information integration. 
Earlier methodologies depended entirely on the fusion result for semantic information extraction, with the fusion result’s quality constrained by the fusion network’s design, complicating the achievement of accurate information feature classification. PSFusion [11] re-examines semantic information features, eschewing the prior pixel-level semantic segmentation approach. Instead, it devises a specialized semantic segmentation network at the feature level, attaining high-accuracy segmentation. Drawing inspiration from PSFusion, our network amalgamates semantic information while sequentially integrating detailed features. This strategy seeks to fulfill the dual objectives of feature extraction and semantic segmentation without imposing additional parameter load.

3.3. Overall framework

Through a comprehensive analysis of the CDDFuse, PSFusion, and other network frameworks, we have identified inherent limitations in each. Our objective is to refine and enhance these frameworks to the greatest extent possible, striving for robust aggregation of information features and the optimal presentation of visual effects. Our network maintains the overarching structure of the PSFusion framework while incorporating LT sub-networks within the ResNet34 architecture to facilitate basic information extraction across various levels. The INN is used to intersperse the extraction of detailed features at multiple levels, ensuring comprehensive extraction of both basic and detailed information at each stage. Furthermore, we have developed several sub-networks to aid in the progressive integration of information at every level. The overall network architecture comprises a backbone network and three branch networks, enabling a top-down approach to information extraction and a bottom-up strategy for information integration. The specific network framework has been streamlined based on extensive experimentation, as depicted in Fig 3.

Fig 4. Annotated supplementary documentation for the comprehensive framework schematic.

https://doi.org/10.1371/journal.pone.0330328.g004

The overall framework is not a product of hasty design but rather the culmination of persistent exploration and refinement, achieved through rigorous data training and bolstered by SOTA computer hardware. Below, we provide a detailed description of the backbone and branch networks.

Figs 6 and 7 demonstrate the encoding-decoding pipeline for infrared and visible light image fusion in our proposed network. The framework processes the input images through parallel streams in the backbone network, effectively extracting both fundamental and detailed features, which are subsequently integrated via dedicated fusion modules. Notably, Fig 7 adopts UNet-like architecture that facilitates hierarchical feature aggregation through its bottom-up pathway. The final fused image is reconstructed through the LT decoder’s output layer. The specific formulas for encoding and decoding will be elaborated in detail below.

3.3.1. The backbone network.

As illustrated in Fig 3, the core structure of the framework is built upon LT and ResNet34. These two components work in tandem to extract the fundamental information from the input image, facilitating a top-down, layer-by-layer information extraction process. Concurrently, the SIFM is engineered to effectively amalgamate the basic information extracted at each layer. Given a pair of aligned infrared and visible light images, the initial step involves coarse-scale information extraction to derive the basic information of the first layer. In this structure, the coarse-scale stage primarily consists of two convolutions that achieve preliminary information extraction from the original image, as encapsulated by the following formula:

(1)(2)(3)

where BN represents the batch normalization operation and FReLU stands for the Flexible Rectified Linear Unit. The process initiates with a broad, shallow-level generalization and integration of the input information. Subsequently, the extracted coarse-scale information is channeled into the LT, where multiple attention heads collaboratively uncover the second layer of underlying information.
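To make the Conv-BN-activation pattern of the coarse-scale stage concrete, here is a minimal NumPy sketch of per-channel batch normalization; the function name and defaults are ours, not the paper's:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Per-channel batch normalization over a (N, C, H, W) feature tensor:
    normalize each channel to zero mean / unit variance across the batch
    and spatial dims, then apply the learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

In the actual network this sits between each convolution and the FReLU activation, stabilizing the shallow features before they enter the LT block.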

(4)

For the internal construction of the LT block, please refer to reference [64]. The primary objective of this architectural design is to facilitate multi-angle extraction of basic features from the original image, leveraging the previously obtained coarse-scale information as a foundation. The processed information from the preceding layer is then sequentially propagated through the initial layers of ResNet34, strategically interleaved with the LT module, to progressively extract the higher-level features corresponding to the third and fourth layers. This hierarchical feature extraction process can be formally expressed as follows:

(5)

where denote the first two residual blocks of ResNet34. To capture more granular and sophisticated features, we employ direct feature extraction through the third and fourth modules of ResNet34, leveraging their inherent capacity for processing higher-level visual information. This strategic approach enables the network to progressively refine its feature representation, focusing on intricate patterns and detailed characteristics that are essential for comprehensive image understanding.

(6)

here, the symbols represent the hierarchical information features extracted from shallow to deep layers of the infrared-visible images, respectively, and the remaining operators represent the third and fourth residual blocks of ResNet34. A total of six levels of main information are extracted through the coordinated action of LT and ResNet34. The first four levels capture layer-by-layer general information from the original image, so we refer to them as basic information; the last two levels penetrate deeper into the content and are regarded as detail information.

As can be seen from Table 2, the original images are first resized to a uniform resolution and then fed into our network framework. The backbone network follows the principle of increasing the number of feature channels while decreasing the spatial size.

3.3.2. Base information integration.

The multi-level basic information captured by the backbone network is strategically fused with the corresponding SIFM to ensure comprehensive integration of essential details at each layer. This structured approach facilitates the seamless aggregation of overall information in subsequent stages. The fused base information can be expressed as:

(7)

The SIFM employed in this study is the SDFM adopted from reference [11], which effectively enables both local and global information fusion. This module ensures optimal integration of multi-level features from infrared-visible images, maximizing the preservation and utilization of complementary information across different scales.

3.3.3. Detail information integration.

Building upon the backbone network, we employ the INN architecture to facilitate hierarchical detail information extraction. To effectively integrate feature information across different network depths, we propose two novel modules: an INN-based SDFM and a D2FM enhanced with multi-attention mechanisms. The initial extraction of base-level detail information is accomplished through the INN framework, which can be formally expressed as:

(8)(9)

INN primarily extracts the detailed depth information of infrared-visible images layer by layer. The specific framework structure can be referred to in literature [10]. SDFM is a detail integration module constructed based on the principles of information extraction from INN. The construction concept of SDFM is briefly explained as follows:

(10)(11)

here, denote the detail information extracted from the two original images at different layers. represent the information combined in the SDFM process, while denote the detailed information integrated between the different layers. The BRB (bottleneck residual block) mainly refers to the structure described in [10], which balances the efficiency of computer operations and feature extraction capability. It consists of a sequential connection of 1 × 1 convolution+ReLU6, depth-wise convolution + ReLU6, and 1 × 1 convolution + Linear.
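To make the BRB structure concrete, here is a toy NumPy sketch: the 1 × 1 convolutions become channel-mixing matrix products, and the 3 × 3 depthwise convolution is simplified to a per-channel scaling (an assumption for brevity, not the exact operator of [10]):

```python
import numpy as np

def relu6(x):
    # ReLU6 caps activations at 6, as used inside the BRB
    return np.clip(x, 0.0, 6.0)

def brb(x, w_expand, w_dw, w_project):
    """Toy bottleneck residual block on a (C, H, W) feature map.
    1x1 conv = channel-mixing matmul; the depthwise conv is replaced
    by a per-channel scaling for brevity."""
    c = x.shape[0]
    y = relu6(np.einsum('oc,chw->ohw', w_expand, x))   # 1x1 conv + ReLU6
    y = relu6(w_dw[:, None, None] * y)                 # depthwise stand-in + ReLU6
    y = np.einsum('oc,chw->ohw', w_project, y)         # 1x1 conv + Linear
    return x + y if w_project.shape[0] == c else y     # residual when shapes match
```

The expand-then-project shape keeps the heavy spatial operation on a cheap per-channel path, which is the efficiency/feature-extraction trade-off the BRB is chosen for.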

In conjunction with Fig 5, we designed D2FM to integrate deeper levels of detail information, with the aim of extracting deep detail information using multiple heads of attention. The specific formula is as follows:

(12)(13)(14)(15)(16)

where CR denotes the joint convolution + ReLU operation on the feature information, denotes the reshape operation, denotes concatenation of the preceding and following items, and denote the key, value and query of the two modal images' information.
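The attention core of D2FM follows the standard scaled dot-product pattern, with queries from one modality attending to keys/values from the other. A single-head NumPy sketch (the real module uses multiple heads and learned projections):

```python
import numpy as np

def cross_attention(q, k, v):
    """Scaled dot-product cross-attention.
    q: (Nq, d) queries from one modality; k, v: (Nk, d) from the other."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (Nq, Nk) affinities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over keys
    return attn @ v                                # (Nq, d) attended features
```

Swapping which modality supplies the queries yields the two symmetric branches whose outputs are spliced before the final convolution.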

3.3.4. Feature integration network.

The inter-layer basic and detail information is efficiently extracted and integrated in the preceding network through a layer-by-layer process. As depicted in Fig 4, we perform a hierarchical summation of the integrated basic and detail information, which is then fed into the DF2M. This facilitates bottom-up feature aggregation, guided by the semantic injection module (SIM). Ultimately, the output layer of the LT is enhanced to produce the final fusion result, while the original scene fidelity path (SFP) is refined to predict the original images from the feature information. The DF2M consists of three sequentially connected dense blocks, as illustrated in Fig 4. To optimize the bottom-up information integration, we retain the framework structure proposed in literature [11].

(17)(18)(19)(20)(21)(22)

Here, denote the merging of the detailed information obtained from the integration of each layer into the previous three layers. represent the result of summing the base and detail information of each layer before sending it to DF2M for processing. To achieve lossless information transfer between layers, denote the results of integrating inter-layer information using SIM, and denotes the final fusion output.
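The dense connectivity inside DF2M can be illustrated with a minimal sketch: each layer sees the channel-wise concatenation of all earlier feature maps. The layers are passed in as plain callables here; the real module stacks three convolutional dense blocks:

```python
import numpy as np

def dense_block(x, layers):
    """Toy dense block on a (C, H, W) feature map: every layer receives
    the concatenation (along channels) of the input and all previous
    layer outputs, and the block returns the full concatenation."""
    feats = [x]
    for f in layers:
        feats.append(f(np.concatenate(feats, axis=0)))
    return np.concatenate(feats, axis=0)
```

This growth of the channel dimension is what lets the bottom-up path carry both shallow and deep information forward without overwriting either.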

To harmonize the architecture of the entire network, we improved the SFP model from literature [11] to derive the prediction results of the source images. As illustrated in Fig 5, the refined framework structure is as follows:

(23)(24)

where denote local and global processing of feature information respectively. BN denotes the block normalization operation and denote the final prediction results for the input images.

3.3.5. Semantic information integration network.

The semantic information extraction network is designed without the need for a separate framework system; instead, it is implemented through the progressive upward aggregation of deep detail information across layers. Initially, the integrated detail information is processed by the sparse semantic perception module (S2PM), which performs deep convolutional operations on the information features. Subsequently, a straightforward top-down information aggregation network is constructed. Finally, the extraction of binary, semantic, and boundary information is accomplished using a combination of convolution, batch normalization, and ReLU activation (CBR). As shown in Fig 4, the S2PM comprises a dense block network formed by three sequentially connected CBR layers.

(25)(26)(27)

where Binary, Semantic and Boundary denote binary, semantic and boundary information, respectively.

3.4. Loss function

The proposed framework incorporates a fusion loss function that integrates intensity, gradient, and structural similarity metrics to directly constrain the fusion output results. Furthermore, auxiliary loss and reconstruction loss are employed to indirectly regulate the network’s feature extraction and integration processes. In the following sections, we provide a detailed explanation of the fusion loss, auxiliary loss, and reconstruction loss.

3.4.1. Fusion loss.

We design corresponding loss functions based on intensity, gradient, and similarity to jointly constrain the fusion results, enabling continuous optimization of the output through the coordinated interplay between the fusion framework and the loss functions. As illustrated in Fig 1, the CDDFuse framework, despite its concise design, falls short of achieving high-quality fusion results due to the lack of strong constraints, such as gradient preservation in the fused images. To overcome this limitation, we propose a more robust fusion loss strategy built upon the CDDFuse framework and introduce corresponding parameters to fine-tune the final fusion output. The intensity loss function is defined as follows:

(28)(29)

here, denotes the contrast mask of the original image. denotes the salient target mask, which aims to guide the fusion network to retain more targets from the infrared image and can be generated directly from the semantic segmentation labels. denotes another modality, denotes the variance of the different modal images, and denotes the fused image. The gradient loss function is defined as follows:

(30)

where denotes the Sobel gradient computation, denotes the absolute-value operation, and denotes element-wise maximum selection. The similarity loss function is defined as follows:

(31)

denotes the structural similarity of the two images.

(32)

where are used to balance the weights between different loss functions and regulate the final fusion loss.
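The gradient and similarity terms above can be sketched in NumPy: the gradient loss pulls the fused gradient map toward the element-wise maximum of the two source gradients, and SSIM is shown in a simplified single-window form (the standard metric averages this over local sliding windows):

```python
import numpy as np

def sobel_grad(img):
    """Absolute Sobel gradient magnitude |Gx| + |Gy| of a 2-D image
    (zero-padded borders)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    p = np.pad(img.astype(float), 1)
    h, w = img.shape
    gx = np.empty((h, w))
    gy = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            win = p[i:i + 3, j:j + 3]
            gx[i, j] = (win * kx).sum()
            gy[i, j] = (win * ky).sum()
    return np.abs(gx) + np.abs(gy)

def gradient_loss(fused, ir, vis):
    """L1 distance between the fused gradient map and the element-wise
    maximum of the two source gradient maps."""
    target = np.maximum(sobel_grad(ir), sobel_grad(vis))
    return np.abs(sobel_grad(fused) - target).mean()

def ssim_global(a, b, L=1.0):
    """Single-window SSIM over whole images with dynamic range L."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (va + vb + c2))
```

The total fusion loss then weights these terms together with the intensity term, as in Eq. (32).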

3.4.2. Auxiliary loss.

In the proposed fusion framework, multi-layer extraction and integration of base and detail information are conducted. To effectively constrain the fusion features and ensure that the fused image incorporates more informative features, we introduce a correlation loss based on inter-layer information.

(33)

here, denotes the correlation coefficient operator, and the constant (set to 0.01) ensures that the denominator remains positive. Additionally, a scene fidelity loss is added to guarantee the efficiency of reconstructing the original image.
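A minimal sketch of the correlation coefficient with the stabilizing constant (the function name is ours):

```python
import numpy as np

def corr_coef(a, b, eps=0.01):
    """Pearson correlation between two feature maps; eps (the paper's
    0.01 constant) keeps the denominator positive even for flat inputs."""
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).sum() / (np.sqrt((a ** 2).sum()) * np.sqrt((b ** 2).sum()) + eps)
```

The auxiliary loss built on this operator rewards inter-layer features that stay correlated with the information they are meant to preserve.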

(34)

In addition, to ensure sufficient semantic information output, reference [11] designs a corresponding binary loss , semantic loss and boundary loss . The binary loss uses a weighted cross-entropy loss to mitigate the class imbalance between object and background. The semantic loss is implemented through OHEMCELoss, while the boundary loss applies a cross-entropy loss to measure the discrepancy between the predicted boundary and the ground truth.

(35)

where are hyper-parameters used to regulate the fusion loss, relevance loss, scene fidelity loss and semantic relevance loss. The optimal values are found during training and simulation on a large amount of data.

4. Experimental validation

The previous section primarily outlined the overall framework structure of the proposed method and the construction of the loss function from a theoretical perspective. This section aims to supplement and further validate the theoretical foundations through empirical analysis. To comprehensively demonstrate the effectiveness of the proposed framework, we select several widely used infrared-visible datasets for training and testing, ensuring robust validation across multiple datasets. First, we detail the experimental configurations and determine the optimal parameters through extensive comparative analyses of experimental operations. Next, we conduct qualitative and quantitative evaluations on three distinct datasets to highlight the advantages of the proposed network, both intuitively and theoretically. Furthermore, we present semantic segmentation results under different segmentation models to showcase the superior performance and generalization capability of our approach. Finally, to substantiate the design rationale of the proposed framework, we perform two sets of ablation experiments to underscore the rationality and necessity of its structural components.

4.1. Preparation for the experiment

The MSRS [37] dataset is utilized for training and validation of our model, while the RoadScene [27], MSRS, and M3FD [33] datasets are employed for testing. For the visualization and comparison experiments, we select three representative image pairs: a daytime image pair from RoadScene, a nighttime image pair from MSRS, and a haze image pair from M3FD, which features challenging external environmental conditions. To ensure a comprehensive and realistic evaluation of the fusion performance, we compare our method with nine classical deep learning-based fusion approaches. These include the generalized fusion framework IFCNN [26], based on convolutional neural networks; and the adversarial-based fusion network DDCGAN [32]. Additionally, we incorporate the widely recognized STDFusion [8], which leverages saliency detection; SeAFusion [37], a semantic-aware fusion network; and SwinFusion [36], a fusion network based on the Swin Transformer. We include several advanced networks proposed in 2023, such as the interactive reinforcement fusion network IRFS [40], which integrates saliency detection; the correlation-driven dual-branch fusion network CDDFuse [10]; and the progressive semantic injection-based fusion network PSFusion [11]. Additionally, to ensure methodological currency in our comparisons, we integrated the diffusion Transformer-based feature-guided image fusion framework recently developed by Yang et al. [71] (2025) (LFDT). This architecture demonstrates particular efficacy for multi-modal image fusion tasks while maintaining robust generalization capabilities. To evaluate the extraction of semantic information, we employ three classical segmentation models (BANet [68], SegFormer [51], and SegNeXt [69]) to qualitatively and quantitatively assess the fusion results on the MSRS dataset.

To comprehensively demonstrate the effectiveness of the proposed fusion networks, we employ six widely recognized statistical evaluation metrics. These include generalized quantitative assessment metrics such as entropy (EN) [72], standard deviation (SD) [73], mutual information (MI) [74], visual information fidelity (VIF) [75], sum of correlation differences (SCD) [76], and edge preservation (QAB/F) [77]. For all these metrics, higher values indicate superior fusion performance. Additionally, to evaluate the semantic segmentation results, we conduct qualitative and quantitative analyses on the MSRS and MFNet [78] datasets using the pixel intersection over union (IoU) metric, which is a standard evaluation tool in segmentation models.
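For reference, three of the evaluation measures (EN, SD, and per-class IoU) can be sketched as follows; these are generic textbook definitions, not the exact scripts used in the experiments:

```python
import numpy as np

def entropy(img, bins=256):
    """Shannon entropy (EN) of a grey image with values in [0, 256)."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def std_dev(img):
    """Standard deviation (SD): contrast spread of the fused image."""
    return float(img.std())

def iou(pred, gt, cls):
    """Per-class intersection-over-union for segmentation masks."""
    inter = np.logical_and(pred == cls, gt == cls).sum()
    union = np.logical_or(pred == cls, gt == cls).sum()
    return inter / union if union else 0.0
```

Higher EN and SD indicate a more informative, higher-contrast fusion; IoU is averaged over classes (mIoU) when ranking segmentation results.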

The proposed framework is designed to generate high-quality fused images through the synergistic interaction of feature extraction and loss functions, while also emphasizing the extraction of semantic perceptual information. After extensive experimental training and empirical tuning, the hyper-parameters for the various loss functions are set to and. The model is trained using classical stochastic gradient descent (SGD) with a batch size of 16. The learning rate is initialized at 0.001 and follows a decay strategy. Training spans 2500 epochs over approximately 30 hours, ensuring that the intrinsic features of the images and the semantic information are thoroughly explored. All input images are normalized to the range [0, 1] for consistent processing within the network. Based on prior experience, the YCbCr color space is adopted for processing color images. The proposed network is implemented on the PyTorch platform using the PyCharm tool, while the comparison networks and segmentation models are implemented as described in their respective original papers. All experiments in this study were conducted on an NVIDIA GeForce RTX 4080 GPU and a 13th Gen Intel(R) Core(TM) i7-13700F 2.10 GHz processor.
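The YCbCr handling can be sketched as an ITU-R BT.601 conversion on images normalized to [0, 1]; in the common practice this paper follows, fusion is applied on the Y channel while Cb/Cr are carried over from the visible image (the exact coefficients used in the authors' code are an assumption):

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """BT.601 full-range RGB -> YCbCr on (..., 3) arrays in [0, 1].
    Chroma channels are recentred at 0.5."""
    m = np.array([[ 0.299,   0.587,   0.114 ],
                  [-0.1687, -0.3313,  0.5   ],
                  [ 0.5,    -0.4187, -0.0813]])
    ycc = rgb @ m.T
    ycc[..., 1:] += 0.5
    return ycc
```

The fused Y channel is recombined with the visible image's Cb/Cr and inverted back to RGB for display.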

4.2. Fusion comparison and analysis

Once the network training achieves robustness and stability, we conduct experimental comparisons and analyses on the RoadScene, MSRS, and M3FD datasets. To ensure a comprehensive and realistic evaluation, we select representative scenarios from each dataset: a daytime scenario, a nighttime scenario, and a haze scenario characterized by challenging external conditions. In this section, we present both qualitative and quantitative results, enabling a dual validation of the proposed method through intuitive visual assessments and theoretical data analysis.

4.2.1. RoadScene experiments.

Here, we test 50 pairs of infrared-visible images from the RoadScene dataset and select one pair of images from daytime scenes for visualization. The exact results of the experiment are shown in Fig 8, with a partially enlarged portion displayed in Fig 9. Given the substantial volume of image pairs in the RoadScene dataset, it is impractical to display each result individually. Consequently, we have adopted a cumulative distribution approach to effectively present the six key metric values derived from our experimental results, as illustrated in Fig 10. This methodological approach enables a comprehensive and statistically robust representation of our findings across the entire dataset.

Fig 8. Qualitative comparison of ten fusion methods on scene 05005 from the RoadScene dataset.

https://doi.org/10.1371/journal.pone.0330328.g008

Fig 9. Localized zoom effects for ten fusion results of the 05005 scene.

https://doi.org/10.1371/journal.pone.0330328.g009

Fig 10. Cumulative distribution of the six metrics on the RoadScene dataset.

https://doi.org/10.1371/journal.pone.0330328.g010

By analyzing Figs 8 and 9 in combination, several key observations can be made regarding the performance of different fusion methods. IFCNN and DDCGAN exhibit inadequate integration of letter information in the red zoomed-in areas located at the lower left corner. STDFusion and SeAFusion present considerable advancements in ground marking integration; however, as revealed in Fig 9, these methods struggle with proper integration around vehicles and pedestrians, suffering from overexposure issues that result in less distinct target features. The IRFS image appears notably darker, while SwinFusion fails to achieve sufficient detail extraction. CDDFuse manifests two critical limitations: overexposure on the vehicle’s side and seriously inadequate detail extraction. LFDT has a certain advantage in terms of clarity, but it still lacks in processing detailed information, particularly evident in the display of ground letters. Although PSFusion demonstrates substantial improvement over previous fusion methods, a detailed comparison with Fig 9 reveals that its detail extraction capability, particularly on the right side of the vehicle and around pedestrians, remains inferior to our proposed method. In contrast, our method successfully integrates the complementary information from source images while providing superior detailed information for characterization, outperforming all comparative methods in terms of both integration quality and detail preservation.

The RoadScene test set comprises 50 carefully curated infrared-visible image pairs, encompassing diverse day and night scenarios, which were evaluated using cumulative distribution functions for comprehensive analysis. As illustrated in Fig 10, the performance analysis reveals distinct patterns among the evaluated methods: IFCNN, DDCGAN, and STDFusion demonstrate suboptimal performance across all metric values. SwinFusion and CDDFuse show relatively improved metric values, while our proposed method consistently achieves the highest scores across all six evaluation metrics; PSFusion and LFDT come closest, with comparable but slightly inferior results. These extensive tests on the RoadScene dataset demonstrate that our method delivers highly satisfactory and robust performance across various environmental conditions.

4.2.2. MSRS experiments.

For comprehensive evaluation, we conducted extensive testing on 361 infrared-visible image pairs from the MSRS dataset, with the experimental results systematically presented in Figs 11 and 12. Specifically, Fig 11 showcases a representative pair of late-night images, carefully selected to demonstrate the fusion performance under challenging low-light conditions, thereby highlighting the method’s effectiveness across diverse environmental scenarios.

Fig 11. Qualitative comparison of ten fusion methods on scene 00754N from the MSRS dataset.

https://doi.org/10.1371/journal.pone.0330328.g011

Fig 12. Cumulative distribution of the six metrics on the MSRS dataset.

https://doi.org/10.1371/journal.pone.0330328.g012

As illustrated in Fig 11, the late-night environment presents a particularly challenging scenario for image fusion, as both infrared and visible light sensors capture significantly less information compared to daytime conditions, thereby testing the robustness of fusion methods. The fusion results reveal distinct performance variations among the evaluated approaches: DDCGAN and STDFusion produce fuzzy and dark outputs. IFCNN, SwinFusion, IRFS, CDDFuse and LFDT demonstrate clear limitations in preserving target details and distant building features, with magnified targets appearing predominantly dim. Although PSFusion outperforms other methods in maintaining brightness and detail for both distant houses and nearby targets, it still falls short of our proposed method in target detail extraction. Consequently, our method demonstrates remarkable advantages in challenging low-light conditions, particularly in dark night environments, establishing its superior capability in handling adverse external conditions.

The cumulative distribution function analysis of 361 image pairs, as presented in Fig 12, clearly demonstrates the superior performance of our proposed method across multiple evaluation metrics. This comprehensive evaluation reveals that our method maintains consistent advantages in several critical aspects: detail information extraction, visual fidelity preservation, correlation maintenance between fused images, and mutual information retention.

4.2.3. M3FD experiments.

The M3FD dataset consists of 300 pairs of outdoor infrared-visible images, predominantly captured under challenging environmental conditions including haze, rain, and low visibility scenarios. These adverse conditions significantly influence the image quality and characteristics during acquisition. For visual demonstration purposes, we have selected a representative set of images captured under severe haze conditions, which effectively illustrates the performance of fusion methods in extreme environmental situations.

As demonstrated in Fig 13, the comparative analysis reveals distinct performance characteristics among the evaluated methods. DDCGAN exhibits an extremely blurry condition. IFCNN exhibits a significant black spot artifact that compromises image quality. STDFusion, SeAFusion, and CDDFuse show limited effectiveness in haze removal, failing to achieve satisfactory de-fogging results. IRFS presents severe black shadow artifacts throughout the image. LFDT is still significantly affected by fog, and the brightness of the target objects is insufficient. Although SwinFusion and PSFusion show relatively better performance, they still demonstrate clear limitations in detail extraction capabilities. In contrast, our proposed method not only demonstrates effective haze reduction but also significantly outperforms other methods in preserving and enhancing detailed information, establishing its superior performance in challenging foggy conditions.

Fig 13. Qualitative comparison of ten fusion methods on scene 01443 from the M3FD dataset.

https://doi.org/10.1371/journal.pone.0330328.g013

As illustrated in Fig 14, the proposed fusion method demonstrates superior performance across the majority of evaluation metrics. Notably, SwinFusion and PSFusion show comparable results to our method specifically in terms of VIF and SCD metrics. This performance similarity suggests that these two methods also achieve commendable results in maintaining visual fidelity and preserving correlation between source and fused images. However, our proposed method maintains an overall advantage across the comprehensive set of evaluation metrics.

Fig 14. Cumulative distribution of the six metrics on the M3FD dataset.

https://doi.org/10.1371/journal.pone.0330328.g014

In conclusion, the proposed framework demonstrates superior performance in both luminance representation and detail extraction compared to existing methods. Notably, it exhibits robust performance even under challenging environmental conditions, a capability primarily attributed to the innovative design of its LT and detail sub-networks.

4.3. Segmentation comparison and analysis

Based on the comprehensive analysis of fusion experiments, we conduct a systematic evaluation of the semantic segmentation performance across ten fusion methods using the MSRS dataset. For the segmentation experiments, we employ the classical BANet architecture as the primary model, complemented by two state-of-the-art models, SegFormer and SegNeXt, to ensure robust validation. The quantitative and qualitative segmentation results are presented in Figs 15 and 16, respectively.

Fig 15. Segmentation results of ten fusion methods under BANet.

https://doi.org/10.1371/journal.pone.0330328.g015

The semantic segmentation results of the ten fusion methods, as obtained through the classical BANet architecture, are presented in Fig 15. A comparative analysis reveals that our proposed method demonstrates superior performance in semantic information extraction, particularly evident in its enhanced capability to identify distant and challenging targets, including pedestrians and bicycles. This remarkable performance can be fundamentally attributed to the well-designed network architecture, which establishes a robust framework for effective semantic feature extraction. Specifically, the detail sub-network’s sophisticated information integration mechanism plays a pivotal role in preserving and enhancing critical semantic features throughout the processing pipeline.

The quantitative segmentation results of the three models (BANet, SegFormer, and SegNeXt) on the MFNet dataset are presented in Table 3. As evidenced by the statistical data, our proposed method consistently achieves superior performance, obtaining the highest IoU values in critical categories including car, bike, and person, while maintaining competitive scores in the remaining categories. Notably, all three segmentation models attain their peak average IoU when integrated with our proposed method. These experimental results substantiate that our approach surpasses existing fusion methods in both qualitative visual perception and quantitative IoU measurements. This performance can be attributed to two key factors: the innovative architectural design of the network framework, which optimizes feature representation, and the meticulously configured detail sub-network, which not only enhances fusion capability but also effectively preserves and integrates crucial semantic information throughout the processing pipeline.
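As a reference for how the IoU figures in Table 3 are defined, a minimal per-class IoU and mean-IoU computation over predicted and ground-truth label maps can be sketched as follows (a simplified illustration, not the exact evaluation code used in our experiments):

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Per-class intersection-over-union between two integer label maps.

    IoU_c = |pred==c AND gt==c| / |pred==c OR gt==c|. Classes absent from
    both maps yield NaN so they can be excluded when averaging.
    """
    ious = []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        union = np.logical_or(p, g).sum()
        inter = np.logical_and(p, g).sum()
        ious.append(inter / union if union else float("nan"))
    return ious

def mean_iou(pred, gt, num_classes):
    """Mean IoU over the classes present in prediction or ground truth."""
    return float(np.nanmean(per_class_iou(pred, gt, num_classes)))
```

In practice the intersections and unions are accumulated over the whole test set before dividing, rather than averaged per image, which is the convention the tabulated scores follow.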

Table 3. Segmentation of MFNet data under three segmentation models. STDFusion, SeAFusion, SwinFusion, CDDFuse, and PSFusion are abbreviated as STDF., SeAF., SwinF., CDDF., and PSF., respectively. The highest value is marked in bold.

https://doi.org/10.1371/journal.pone.0330328.t003

4.4. Ablation experiments

Extensive research on deep learning-based fusion frameworks has established that fusion performance is fundamentally determined by the intrinsic architectural design of the framework. To systematically investigate the impact of our framework’s structural components, we conducted a comprehensive ablation study. The proposed framework introduces two innovative architectural features: an interspersed LT mechanism within the backbone network, which leverages the Transformer architecture for multi-attention feature extraction, and a dedicated detail information extraction module that operates synergistically with the hierarchical feature extraction process in the main network, complemented by an INN-based detail integration strategy. To quantitatively assess the individual contributions of these components, we implemented three controlled experimental configurations: removal of the detail feature extraction module (-w/o DFEM), elimination of the LT mechanism in the backbone network (-w/o LT), and exclusion of the INN component in the detail integration phase (-w/o INN). The comparative experimental results, illustrated in Fig 16, provide valuable insights into the performance impact of each architectural component.

The visualized comparative results in Fig 16 reveal significant performance variations across the architectural configurations. The network lacking the DFEM demonstrates limited capability, extracting only generic features from the source images. This limitation is particularly pronounced in the first experimental set, where the detail module proves crucial under challenging environmental conditions: in region (a3) of Fig 16, the network fails to effectively handle the overexposed infrared image in its fusion output. The architecture without the LT component produces visually flat fusion results, characterized by insufficient feature representation and sub-optimal brightness adjustment, particularly in target regions. While the configuration without the INN module performs relatively better than the previous two cases, it still underperforms our complete method in critical aspects, particularly the accurate representation of pedestrians and distant targets. Furthermore, the quantitative metrics in Table 4 confirm that the key components of our fusion framework make unique and significant contributions to the overall performance, which holds important practical implications for real-world applications.
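To illustrate how ablation tables such as Table 4 can be read, the short sketch below computes each variant’s relative change against the full model. The metric values used here are placeholders for illustration only, not the numbers reported in Table 4.

```python
def ablation_drop(full, variant):
    """Relative change (%) of each metric for an ablated variant versus
    the full model; negative values indicate degradation."""
    return {m: 100.0 * (variant[m] - full[m]) / full[m] for m in full}

# Placeholder metric values for illustration only -- NOT the values
# reported in Table 4.
full_model = {"EN": 6.8, "VIF": 1.02, "SCD": 1.80}
wo_lt = {"EN": 6.5, "VIF": 0.93, "SCD": 1.71}  # hypothetical "-w/o LT" run
drops = ablation_drop(full_model, wo_lt)
```

A uniformly negative row for an ablated variant, as in this toy example, is the quantitative signature that the removed component contributes across all metrics rather than trading one off against another.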

Table 4. Average quantitative metrics of ablation experiments on the MSRS dataset.

https://doi.org/10.1371/journal.pone.0330328.t004

5. Conclusion

Building upon state-of-the-art (SOTA) fusion frameworks, we propose a novel, more robust, and higher-performance fusion network architecture. This comprehensive design integrates four key components: (1) a dual-stream backbone network combining LT and ResNet architectures, (2) a sophisticated detail extraction and integration network based on INN, (3) a basic information integration sub-network, and (4) a hierarchical semantic information extraction module. The backbone network utilizes a multi-head attention mechanism to facilitate progressive, layer-wise feature extraction, enabling comprehensive information mining from shallow to deep levels. Unlike the existing CDDFuse and PSFusion methods, which neglect the extraction of intrinsic deep-level image details, our approach introduces a critical innovation: a detail-oriented sub-network. This sub-network employs deep hierarchical processing to meticulously capture textures and intrinsic features. Departing from conventional approaches that directly feed extracted information into fusion strategies, we implement a novel bottom-up progressive integration paradigm. Crucially, our architecture achieves semantic information integration through inter-layer detail extraction, thereby eliminating the need for additional network components. This design innovation substantially reduces computational complexity and parameter count. To optimize network performance, we have developed specialized correlation loss functions that jointly constrain the network’s coordination during both base and detail information extraction across layers. Leveraging advanced GPU computational capabilities, we conducted extensive training and fine-tuning of the network framework. Experimental results demonstrate that our network surpasses numerous popular architectures in both qualitative visual perception and quantitative metrics.
Particularly noteworthy is its exceptional performance in detail extraction under challenging environmental conditions, showcasing remarkable de-fogging capabilities and information integration efficiency. The network also excels in image clarity and color reproduction. While the network demonstrates superior performance in most aspects, we have identified areas for improvement, particularly in color rendering of specific regions such as sky areas, which occasionally appear darker than desired. Addressing this limitation will be a focus of our future research efforts, along with further optimization of the network’s computational efficiency and generalization capabilities across diverse environmental conditions.
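The INN-based detail integration summarized above relies on invertible affine coupling layers. As a minimal NumPy sketch of that mechanism (with fixed random linear maps standing in for the learned scale and translation sub-networks, not our actual architecture), one coupling step and its exact inverse can be written as:

```python
import numpy as np

rng = np.random.default_rng(42)
# Fixed random linear maps standing in for the learned scale (s) and
# translation (t) sub-networks of one affine coupling layer.
W_s = 0.1 * rng.standard_normal((4, 4))
W_t = 0.1 * rng.standard_normal((4, 4))
s = lambda h: np.tanh(h @ W_s)  # bounded log-scale keeps exp() stable
t = lambda h: h @ W_t

def coupling_forward(x1, x2):
    """One affine coupling step: x1 passes through unchanged, x2 is
    scaled and shifted conditioned on x1 -- invertible by construction."""
    return x1, x2 * np.exp(s(x1)) + t(x1)

def coupling_inverse(y1, y2):
    """Exact inverse of coupling_forward."""
    return y1, (y2 - t(y1)) * np.exp(-s(y1))
```

In INN-based designs, stacking such layers while alternating the roles of the two streams yields a transform whose inverse reconstructs its input exactly, which is why the integration step in a detail sub-network of this kind loses no information.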

Supporting information

S1 File. All original code and part of the data analyzed in this study are provided as supplementary materials in a file named “code.rar”.

https://doi.org/10.1371/journal.pone.0330328.s001

(RAR)

Acknowledgments

The authors would like to extend their heartfelt appreciation to the editorial board and anonymous reviewers for their meticulous evaluation, valuable insights, and constructive recommendations, which have significantly enhanced the quality of this work.

References

1. Xiao G, Bavirisetti DP, Liu G, Zhang X. Image Fusion. Springer; 2020.
2. Paramanandham N, Rajendiran K. Infrared and visible image fusion using discrete cosine transform and swarm intelligence for surveillance applications. Infrared Physics & Technology. 2018;88:13–22.
3. Bogdoll D, Nitsche M, Zöllner JM. Anomaly detection in autonomous driving: A survey. In: IEEE Conference on Computer Vision and Pattern Recognition. 2022:4488–99.
4. Zhang P, Wang D, Lu H, Yang X. Learning Adaptive Attribute-Driven Representation for Real-Time RGB-T Tracking. Int J Comput Vis. 2021;129(9):2714–29.
5. Zhang H, Xu H, Xiao Y, Guo X, Ma J. Rethinking the Image Fusion: A Fast Unified Image Fusion Network based on Proportional Maintenance of Gradient and Intensity. AAAI. 2020;34(07):12797–804.
6. Zhao Z, Xu S, Zhang C, Liu J, Zhang J, Li P. DIDFuse: Deep image decomposition for infrared and visible image fusion. In: International Joint Conference on Artificial Intelligence. 2020:970–6.
7. Li J, Liu J, Zhou S, Zhang Q, Kasabov NK. Infrared and visible image fusion based on residual dense network and gradient loss. Infrared Physics & Technology. 2023;128:104486.
8. Ma J, Tang L, Xu M, Zhang H, Xiao G. STDFusionNet: An Infrared and Visible Image Fusion Network Based on Salient Target Detection. IEEE Trans Instrum Meas. 2021;70:1–13.
9. Luo X, Gao Y, Wang A, Zhang Z, Wu X-J. IFSepR: A General Framework for Image Fusion Based on Separate Representation Learning. IEEE Trans Multimedia. 2023;25:608–23.
10. Zhao Z, Bai H, Zhang J, Zhang Y, Xu S, Lin Z, et al. CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023:5906–16.
11. Tang L, Zhang H, Xu H, Ma J. Rethinking the necessity of image fusion in high-level vision tasks: A practical infrared and visible image fusion network based on progressive semantic injection and scene fidelity. Information Fusion. 2023;99:101870.
12. Vanmali AV, Kataria T, Kelkar SG, Gadre VM. Ringing artifacts in wavelet based image fusion: Analysis, measurement and remedies. Information Fusion. 2020;56:39–69.
13. Ullah H, Ullah B, Wu L, Abdalla FYO, Ren G, Zhao Y. Multi-modality medical images fusion based on local-features fuzzy sets and novel sum-modified-Laplacian in non-subsampled shearlet transform domain. Biomedical Signal Processing and Control. 2020;57:101724.
14. Zhu D, Zhang Y, Gao Q, Lu Y, Sun D. Infrared and Visible Image Fusion Using Threshold Segmentation and Weight Optimization. IEEE Sensors J. 2023;23(20):24970–82.
15. Chen J, Li X, Luo L, Mei X, Ma J. Infrared and visible image fusion based on target-enhanced multiscale transform decomposition. Information Sciences. 2020;508:64–78.
16. Bavirisetti DP, Dhuli R. Fusion of Infrared and Visible Sensor Images Based on Anisotropic Diffusion and Karhunen-Loeve Transform. IEEE Sensors J. 2016;16(1):203–9.
17. Zhou Z, Fei E, Miao L, Yang R. A perceptual framework for infrared–visible image fusion based on multiscale structure decomposition and biological vision. Information Fusion. 2023;93:174–91.
18. Zhu F, Liu W. Infrared-visible image fusion method based on multi-scale shearing Co-occurrence filter. Infrared Physics & Technology. 2024;136:105009.
19. Maqsood S, Javed U. Multi-modal Medical Image Fusion based on Two-scale Image Decomposition and Sparse Representation. Biomedical Signal Processing and Control. 2020;57:101810.
20. Zhang Q, Li G, Cao Y, Han J. Multi-focus image fusion based on non-negative sparse representation and patch-level consistency rectification. Pattern Recognition. 2020;104:107325.
21. Luo X, Jiang Y, Wang A, Wang J, Zhang Z, Wu X-J. Infrared and visible image fusion based on Multi-State contextual hidden Markov Model. Pattern Recognition. 2023;138:109431.
22. Li H, Wu X-J, Kittler J. MDLatLRR: A novel decomposition method for infrared and visible image fusion. IEEE Trans Image Process. 2020:4733–46. pmid:32142438
23. Li Y, Liu G, Bavirisetti DP, Gu X, Zhou X. Infrared-visible image fusion method based on sparse and prior joint saliency detection and LatLRR-FPDE. Digital Signal Processing. 2023;134:103910.
24. Karim S, Tong G, Li J, Qadir A, Farooq U, Yu Y. Current advances and future perspectives of image fusion: A comprehensive review. Information Fusion. 2023;90:185–217.
25. Li H, Wu X-J. DenseFuse: A Fusion Approach to Infrared and Visible Images. IEEE Trans Image Process. 2018. doi:10.1109/TIP.2018.2887342. pmid:30575534
26. Zhang Y, Liu Y, Sun P, Yan H, Zhao X, Zhang L. IFCNN: A general image fusion framework based on convolutional neural network. Information Fusion. 2020;54:99–118.
27. Xu H, Ma J, Jiang J, Guo X, Ling H. U2Fusion: A Unified Unsupervised Image Fusion Network. IEEE Trans Pattern Anal Mach Intell. 2022;44(1):502–18. pmid:32750838
28. Long Y, Jia H, Zhong Y, Jiang Y, Jia Y. RXDNFuse: A aggregated residual dense network for infrared and visible image fusion. Information Fusion. 2021;69:128–41.
29. Xu H, Gong M, Tian X, Huang J, Ma J. CUFD: An encoder–decoder network for visible and infrared image fusion based on common and unique feature decomposition. Computer Vision and Image Understanding. 2022;218:103407.
30. Özer S, Ege M, Özkanoglu MA. SiameseFuse: A computationally efficient and a not-so-deep network to fuse visible and infrared images. Pattern Recognition. 2022;129:108712.
31. Ma J, Yu W, Liang P, Li C, Jiang J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Information Fusion. 2019;48:11–26.
32. Ma J, Xu H, Jiang J, Mei X, Zhang X-P. DDcGAN: A Dual-discriminator Conditional Generative Adversarial Network for Multi-resolution Image Fusion. IEEE Trans Image Process. 2020. doi:10.1109/TIP.2020.2977573. pmid:32167894
33. Liu J, Fan X, Huang Z, Wu G, Liu R, Zhong W, et al. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In: IEEE Conference on Computer Vision and Pattern Recognition. 2022:5802–11.
34. Zhou H, Hou J, Zhang Y, Ma J, Ling H. Unified gradient- and intensity-discriminator generative adversarial network for image fusion. Information Fusion. 2022;88:184–201.
35. Xu Q, Li Y, Nie J, Liu Q, Guo M. UPanGAN: Unsupervised pansharpening based on the spectral and spatial loss constrained Generative Adversarial Network. Information Fusion. 2023;91:31–46.
36. Ma J, Tang L, Fan F, Huang J, Mei X, Ma Y. SwinFusion: Cross-domain Long-range Learning for General Image Fusion via Swin Transformer. IEEE/CAA J Autom Sinica. 2022;9(7):1200–17.
37. Tang L, Yuan J, Ma J. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network. Information Fusion. 2022;82:28–42.
38. Zhang X, Zhai H, Liu J, Wang Z, Sun H. Real-time infrared and visible image fusion network using adaptive pixel weighting strategy. Information Fusion. 2023;99:101863.
39. Xie H, Zhang Y, Qiu J, Zhai X, Liu X, Yang Y, et al. Semantics lead all: Towards unified image registration and fusion from a semantic perspective. Information Fusion. 2023;98:101835.
40. Wang D, Liu J, Liu R, Fan X. An interactively reinforced paradigm for joint infrared-visible image fusion and saliency object detection. Information Fusion. 2023;98:101828.
41. Wang C, Wu J, Zhu Z, Chen H. MSFNet: MultiStage Fusion Network for infrared and visible image fusion. Neurocomputing. 2022;507:26–39.
42. Zhang W, Wang Y, Li C. Underwater Image Enhancement by Attenuated Color Channel Correction and Detail Preserved Contrast Enhancement. IEEE J Oceanic Eng. 2022;47(3):718–35.
43. Liu L, Wang F, Jung C. LRINet: Long-range imaging using multispectral fusion of RGB and NIR images. Information Fusion. 2023;92:177–89.
44. Sun L, Li Y, Zheng M, Zhong Z, Zhang Y. MCnet: Multiscale visible image and infrared image fusion network. Signal Processing. 2023;208:108996.
45. Zhang H, Ma J. IID-MEF: A multi-exposure fusion network based on intrinsic image decomposition. Information Fusion. 2023;95:326–40.
46. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Neural Information Processing Systems. 2017:5998–6008.
47. Yang F, Yang H, Fu J, Lu H, Guo B. Learning texture transformer network for image super-resolution. In: IEEE Conference on Computer Vision and Pattern Recognition. 2020:5791–800.
48. Chen H, Wang Y, Guo T, Xu C, Deng Y, Liu Z, et al. Pre-trained image processing transformer. In: IEEE Conference on Computer Vision and Pattern Recognition. 2021:12299–310.
49. Liang J, Cao J, Sun G, Zhang K, Gool LV, Timofte R. SwinIR: Image restoration using swin transformer. In: IEEE International Conference on Computer Vision. 2021:11–7.
50. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T. An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations. 2021.
51. Zheng S, Lu J, Zhao H, Zhu X, Luo Z, Wang Y, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: IEEE Conference on Computer Vision and Pattern Recognition. 2021:6881–90.
52. Xie E, Wang W, Yu Z, Anandkumar A, Alvarez JM, Luo P. SegFormer: Simple and efficient design for semantic segmentation with transformers. In: Neural Information Processing Systems. 2021:12077–90.
53. Lin L, Fan H, Zhang Z, Xu Y, Ling H. SwinTrack: A simple and strong baseline for transformer tracking. In: Neural Information Processing Systems. 2022.
54. Vs V, Valanarasu JMJ, Oza P, Patel VM. Image fusion transformer. In: IEEE International Conference on Image Processing. 2022:3566–70.
55. Chang Z, Feng Z, Yang S, Gao Q. AFT: Adaptive Fusion Transformer for Visible and Infrared Images. IEEE Trans Image Process. 2023;32:2077–92. pmid:37018097
56. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: Hierarchical vision transformer using shifted windows. In: International Conference on Computer Vision. 2021:9992–10002.
57. Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jegou H. Training data-efficient image transformers & distillation through attention. In: ICML. 2021:10347–57.
58. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In: ECCV. 2020:213–29.
59. Zhu X, Su W, Lu L, Li B, Wang X, Dai J. Deformable DETR: Deformable transformers for end-to-end object detection. In: ICLR. 2021.
60. Wang W, Xie E, Li X, Fan DP, Song K, Liang D, et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: International Conference on Computer Vision. 2021:548–58.
61. Zheng S, Lu J, Zhao H, Zhu X, Luo Z, Wang Y, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: IEEE Conference on Computer Vision and Pattern Recognition. 2021:6881–90.
62. Ju X, Zhang D, Li J, Zhou G. Transformer-based label set generation for multi-modal multi-label emotion detection. In: ACM International Conference on Multimedia. 2020:512–20.
63. Zhao J, Zhao Y, Li J. M3TR: Multi-modal multi-label recognition with transformer. In: ACM International Conference on Multimedia. 2021:469–77.
64. Wu Z, Liu Z, Lin J, Lin Y, Han S. Lite transformer with long-short range attention. In: ICLR. 2020.
65. Ardizzone L, Lüth C, Kruse J, Rother C, Köthe U. Guided image generation with conditional invertible neural networks. CoRR. 2019.
66. Jing J, Deng X, Xu M, Wang J, Guan Z. HiNet: Deep image hiding by invertible network. In: International Conference on Computer Vision. 2021:4713–22.
67. Zhou M, Yan K, Huang J, Yang Z, Fu X, Zhao F. Mutual Information-driven Pan-sharpening. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022:1788–98.
68. Peng C, Tian T, Chen C, Guo X, Ma J. Bilateral attention decoder: A lightweight decoder for real-time semantic segmentation. Neural Netw. 2021;137:188–99. pmid:33647536
69. Guo M, Lu C, Hou Q, Liu ZN, Cheng MM, Hu S. SegNeXt: Rethinking convolutional attention design for semantic segmentation. In: Neural Information Processing Systems. 2022.
70. Zhang J, Liu H, Yang K, Hu X, Liu R, Stiefelhagen R. CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation With Transformers. IEEE Trans Intell Transport Syst. 2023;24(12):14679–94.
71. Yang B, Jiang Z, Pan D, Yu H, Gui G, Gui W. LFDT-Fusion: A latent feature-guided diffusion Transformer model for general image fusion. Information Fusion. 2025;113:102639.
72. Van Aardt J. Assessment of image fusion procedures using entropy, image quality, and multispectral classification. J Appl Remote Sens. 2008;2(1):023522.
73. Rao Y-J. In-fibre Bragg grating sensors. Meas Sci Technol. 1997;8(4):355–75.
74. Qu G, Zhang D, Yan P. Information measure for performance of image fusion. Electron Lett. 2002;38(7):313–5.
75. Han Y, Cai Y, Cao Y, Xu X. A new image fusion performance metric based on visual information fidelity. Information Fusion. 2013;14(2):127–35.
76. Aslantas V, Bendes E. A new image quality metric for image fusion: The sum of the correlations of differences. AEU - International Journal of Electronics and Communications. 2015;69(12):1890–6.
77. Petrović V, Xydeas C. On the effects of sensor noise in pixel-level image fusion performance. In: Proceedings of the Third International Conference on Image Fusion. 2000;2:14–9.
78. Ha Q, Watanabe K, Karasawa T, Ushiku Y, Harada T. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2017:5108–15.