
DEF-Net: A dual-modal feature enhancement and fusion network for infrared and visible object detection

  • Xiaoming Guo,

    Roles Methodology, Writing – original draft

    Affiliations Shanxi Key Laboratory of Machine Vision and Virtual Reality, North University of China, Taiyuan, China, School of Information and Communication Engineering, North University of China, Taiyuan, China

  • Fengbao Yang ,

    Roles Funding acquisition, Project administration, Validation

    yangfbnuc@163.com

    Affiliation School of Information and Communication Engineering, North University of China, Taiyuan, China

  • Linna Ji

    Roles Project administration, Supervision, Writing – review & editing

    Affiliation School of Information and Communication Engineering, North University of China, Taiyuan, China

Abstract

Infrared-visible object detection in complex dynamic environments often suffers from weak feature representation and underutilized cross-modal complementarity, leading to missed and false detections. To address these issues, we propose a Dual-modal feature Enhancement and Fusion Network (DEF-Net). To enhance the model’s focus on informative features within both infrared and visible modalities, a feature interaction enhancement module is designed to effectively highlight and reinforce salient information. Furthermore, to better exploit the complementary characteristics of the two modalities, a transformer-based fusion architecture incorporating a cross-attention mechanism is introduced, enabling deep inter-modal feature integration. Experiments on the SYUGV and LLVIP datasets show that DEF-Net outperforms existing methods in accuracy while maintaining real-time processing speed.

1. Introduction

Unmanned mobile platforms are widely used in autonomous driving, surveillance, and disaster response due to their mobility, sensing capability, and low risk. These systems typically rely on target detection technologies to execute tasks [1]. However, in complex environments, images acquired by single-sensor systems are often degraded by various environmental factors, resulting in compromised recognition performance. To support applications across diverse scenarios, unmanned platforms are often equipped with dual-modal vision sensors, with infrared and visible sensors being among the most widely deployed for target detection [2]. Visible images offer rich texture and color details but are highly sensitive to changes in lighting and visibility. In contrast, infrared imagery, which captures thermal radiation information, remains relatively invariant to illumination variations, though it typically exhibits lower spatial detail and overall image quality [3]. By integrating visible and infrared sensors, the complementary characteristics of both modalities can be effectively leveraged to achieve more robust and accurate target detection under varying environmental conditions.

Pedestrian and vehicle detection using infrared-visible image pairs is a typical multimodal detection task. However, it faces challenges such as complex environments with blurred targets and the difficulty of effectively leveraging cross-modal complementary information [4]. To address these issues, several researchers have adopted fusion-based approaches that combine infrared and visible imagery, using the integrated result for subsequent target recognition. For instance, Yue et al. [5] employed a saliency detection network to generate target saliency maps from infrared images, followed by fusion with visible images via non-subsampled contourlet transform (NSCT). The combined saliency and fusion results were then fed into a detector, demonstrating improved accuracy under low-light conditions. Similarly, Liu et al. [6] introduced a Target-aware Dual Adversarial Learning (TarDAL) framework to merge infrared and visible images into a single representation, which was subsequently processed by a YOLOv5 detector. Tang et al. [7] developed a parameter transfer model and later proposed a deep learning-based decision-level fusion framework for joint infrared-visible detection. Sun et al. [8] designed a triple-branch recognition network named UA-CMDet, incorporating separate branches for visible, infrared, and fused features, with ResNet-50 serving as the backbone for feature extraction, followed by decision integration across branches. Despite these efforts, front-end fusion strategies tend to preserve both foreground and background details in the fused modality, often introducing redundant information that can hinder recognition performance. On the other hand, back-end fusion methods, which operate at the decision level, typically lack the ability to exploit cross-modal interactions during feature learning and often require multiple models, leading to high computational costs.

Feature-level fusion is a dominant approach in multimodal detection, effectively integrating complementary information while keeping the model lightweight. For instance, Geng et al. [9] developed a dual-branch detection network based on Faster R-CNN that processes paired visible and infrared images through a shared VGG-16 backbone for separate feature extraction, followed by the fusion of both branches via classification and bounding box regression modules. Xue et al. [10] introduced a real-time detection model named MAF-YOLO, which incorporates an image brightness attention mechanism. Their framework uses a Darknet-53-based multimodal feature extraction module and integrates the features through a modal-weighted fusion module. Cheng et al. [11] proposed a lightweight dual-modal fusion network called SLBAF-Net, which includes a Bimodal Adaptive Fusion Module (BAFM) to adaptively merge visible and infrared feature maps, thereby improving detection robustness in challenging environments. Recognizing that conventional CNNs are limited to local receptive fields, Shen et al. [12] designed a fusion method guided by bidirectional attention. This approach employs a Transformer architecture to enable global feature interaction and complementary information capture between visible and infrared modalities, enhancing the discriminability of target features and alleviating the issue of spatial misalignment. Despite these advances, current fusion strategies—such as simple channel concatenation or stacking, and attention-based weighting mechanisms—often fall short in fully exploiting complementary inter-modal information. Particularly under low-quality feature conditions, these methods may still result in the loss of critical information. To address these problems, we combine an attention mechanism with a Transformer model and propose an object detection method based on a dual-modal fusion network. A dual-branch feature-level fusion strategy is adopted, with a feature interaction enhancement module and a fusion module embedded, so that the model can make full use of dual-modal information.

Our main contributions are as follows:

  1). We design a dual-branch backbone to extract features from visible and infrared modalities separately, balancing and mining their unique characteristics to enhance feature richness and specificity.
  2). We introduce an interaction enhancement module that dynamically highlights key features by integrating dual-modal information, thereby improving feature saliency and cross-modal complementarity.
  3). We propose a fusion network architecture with a Cross-Attention mechanism, enabling deep interaction and comprehensive fusion of cross-modal features to significantly boost integration efficiency and detection accuracy.

The remainder of this paper is structured as follows. Section 2 describes related work. Section 3 introduces the proposed DEF-Net, including the dual-branch backbone, the cross-attention fusion network, the object detection structure and the loss function. Experiments and discussion are presented in Section 4. Finally, the conclusion is provided in Section 5.

2 Related works

2.1 Traditional dual-modal object detection

Traditional dual-modal object detection methods primarily fall into two categories. The first relies on visual feature extraction, utilizing salient characteristics such as edges and color in visible images and brightness in infrared images. Techniques include clustering, threshold-based, and region-based methods. While effective in specific scenarios—such as agricultural detection using color and morphological operations or target recognition through infrared-visible feature fusion—these methods are often sensitive to noise and limited in efficiency and accuracy.

The second category employs handcrafted features with candidate region processing. Candidate regions are generated via sliding windows [13] or selective search [14], followed by feature extraction using algorithms like SIFT [15], HOG [16], ORB [17], or LBP [18], and classification with SVM [19] or Adaboost [20]. Although strategies such as ROI-focused HOG sampling improve efficiency, these methods generally suffer from slow feature extraction, limited generalization, and constrained practical applicability.

2.2 Dual-modal object detection based on deep learning

Compared to traditional object detection methods, deep learning-based dual-modal object detection utilizes network layers to extract image features, achieving robust feature representations and significantly improving detection performance. The core idea is to adapt and extend existing single-modal detection frameworks through dual-modal fusion and enhancement. Current single-modal detectors can be categorized into two types based on their pipeline: Two-Stage and One-Stage methods, as illustrated in Fig 1.

Fig 1. Dual-modal object detection based on deep learning.

https://doi.org/10.1371/journal.pone.0345815.g001

Two-Stage methods, primarily based on the R-CNN series [21], first generate region proposals using algorithms like selective search, then perform classification and regression on these regions. While achieving high accuracy, these methods suffer from computational redundancy due to overlapping proposals. Fast R-CNN improved efficiency by sharing convolutional features, and Faster R-CNN further replaced selective search with a Region Proposal Network (RPN), reducing redundancy and increasing speed [22]. Despite their accuracy, Two-Stage methods are often outperformed in practical applications by One-Stage approaches due to their superior inference speed and competitive performance.

One-Stage methods, such as SSD, YOLO, and DETR, perform end-to-end detection by directly predicting object categories and bounding boxes from the input image, eliminating the need for region proposal generation [23–30]. SSD uses multi-scale feature maps for detection but struggles with small objects due to insufficient shallow feature utilization. The YOLO series [31], built on Darknet, divides the image into grids and predicts objects within each cell. YOLOv3 introduced multi-scale detection, YOLOv5 improved efficiency and ease of deployment [32], and later versions such as YOLOv8 and YOLOv10 incorporated advanced attention mechanisms and end-to-end strategies [33,34]. YOLOv11 further reduced parameters while maintaining performance. Despite this progress, detecting small and dense objects remains challenging.

Transformers, initially designed for NLP, have also been adapted for vision tasks. DETR [35] pioneered the use of Transformer encoder-decoder architecture for object detection, using object queries to directly output predictions without non-maximum suppression. While accurate, it suffers from high computational cost. RT-DETR enhanced real-time performance with a hybrid encoder and IoU-aware query selection, offering a better speed-accuracy balance for practical use.

Existing dual-modal object detection algorithms based on visible and infrared images have not fully exploited the complementarity and rich contextual information between modalities. Considerable room therefore remains for improving dual-modal detection performance, especially in complex scenes. This study explores a more effective mechanism for deep interaction between features of different modalities, so as to further improve the accuracy and generalization ability of dual-modal object detection.

3 Method

The overall flow chart of the proposed dual-modal fusion network (DEF-Net) is shown in Fig 2. It mainly includes three parts: the Dual-branch Backbone, the Cross-attention Fusion Network, and the Object Detection Structure.

  1). The Dual-Branch Backbone: composed of the Convolutional Feature Extraction and the Dual-branch Interaction Enhancement. The visible and infrared images are passed through the Feature Encoder to obtain the features F_vis and F_ir, respectively.
  2). Cross-Transformer Fusion: F_vis and F_ir are refined into more effective feature representations through the cross-attention network layers to obtain the fused feature F_fused, realizing the complementarity of dual-modal feature information.
  3). Object Detection Structure: the output features of the first two parts are used as the input of the Feature Pyramid Network and sent to the detection head to obtain the classification and localization of the target.

3.1 Dual-branch backbone

The dual-branch backbone network is composed of the Convolutional Feature Extraction and the Dual-branch Interaction Enhancement.

3.1.1 Convolutional feature extraction.

The dual-branch backbone network aims to extract and represent the features of infrared and visible data, laying the foundation for feature fusion. Darknet53, a backbone architecture based on residual learning, is well suited to dual-modal object detection tasks thanks to its structural design, as shown in Fig 3.

Firstly, the network uses a 3 × 3 convolution layer with a stride of 2 for feature extraction, then concatenates 1 × 1 and 3 × 3 convolution operations and forms a residual connection with the preceding feature map. Through the deep stacking of structural units comprising 1 × 1 convolutions, 3 × 3 convolutions and residual edges, the network is deepened and its expressive power is significantly enhanced. In addition, each convolutional module adopts the DarknetConv2D structure, including L2 regularization, Batch Normalization, and a ReLU activation function. This design not only optimizes the training process and improves model performance by increasing network depth, but also effectively alleviates the vanishing-gradient problem in deep neural networks through skip connections. However, simply extending the single-branch model to a two-branch structure leads to a significant increase in computational cost. To address this challenge, we introduce the Cross Stage Partial Network (CSPNet) to optimize the network structure and computation, which significantly improves the feature-learning ability of lightweight models. Its core idea is to split the gradient flow so that gradient information propagates through different network paths; by alternating the series computation and transition steps, the propagated gradient information acquires a larger correlation difference, significantly reducing computational complexity while improving inference efficiency and detection accuracy.
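The CSP idea of splitting the gradient flow across two paths can be illustrated with a minimal NumPy sketch. The `transform` callable stands in for the learned convolution-and-residual stack, so all names and values here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def csp_block(x, transform):
    """Cross Stage Partial sketch: split the channels, send one half
    through the (expensive) transform path, keep the other half as a
    shortcut, then merge. Gradients thus flow along two distinct paths."""
    c = x.shape[0] // 2
    shortcut, dense = x[:c], x[c:]                      # channel split
    return np.concatenate([shortcut, transform(dense)], axis=0)

# toy usage: the "transform" is a stand-in for conv + residual layers
feat = np.ones((16, 8, 8), dtype=np.float32)
out = csp_block(feat, lambda t: np.maximum(t * 0.5, 0.0))
```

Because only half the channels traverse the transform path, the per-block computation roughly halves relative to a plain stacked design, which is the efficiency argument made above.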

Referring to the CSPDarknet53 backbone network [36], the dual-branch backbone is designed as shown in Fig 4, comprising the CONV module, the CSP structure and the DBE module. The CONV module serves as the basic feature-extraction unit, consisting of a convolutional layer, a batch normalization layer and a SiLU activation function. The CSP structure introduces residual features to propagate gradient information and enhance the learning ability of the convolutional network. The DBE module combines the dual-branch features to obtain a dual-modal feature map with enhanced representation. To balance the dual-modal information and limit network complexity, both branches take 640 × 640 × 3 inputs, which are finally downsampled to 20 × 20 × 256 feature maps. This design fully extracts and enhances the bimodal information, providing a favorable feature-representation basis for subsequent fusion.

3.1.2 Dual-branch interaction enhancement.

At present, most methods for improving feature extraction fuse features using various types of attention, including channel attention and convolutional attention. However, intra-modal attention can only address the uneven distribution of single-modal features and ignores unevenness at the bi-modal feature level. Although the channel attention inside a feature attention module integrates differences between channels, it does not consider the context between infrared and visible light, leading to an unbalanced spatial distribution of infrared and visible information within each feature channel. In addition, the spatial and channel attention weights in feature attention usually lack information interaction. It is therefore necessary to combine the infrared and visible feature channels to construct a spatially important feature map and achieve enhancement of the bimodal information through feature interaction. To solve these problems, the Dual-branch Interaction Enhancement (DBE) module is designed, as shown in Fig 5.

The proposed DBE addresses these limitations by introducing a parallel dual-attention mechanism that enables simultaneous spatial and channel-wise enhancement of bimodal features. Through broadcast addition, the module facilitates early interaction between spatial and channel attentions, generating adaptive fusion weights. These weights are further refined with original features to produce channel-specific spatial interaction maps, which dynamically highlight important regions across modalities while preserving original information via residual connections. This design ensures effective bimodal feature interaction and enhancement at the attention level, providing a more discriminative feature representation for subsequent fusion.

After a series of convolution modules, the visible and infrared features F_vis and F_ir are obtained. In order to exploit the bimodal information, their concatenation F = Concat(F_vis, F_ir) is used as the input of the DBE to calculate the spatial attention and the channel attention, yielding the spatial attention weight W_s and the channel attention weight W_c via Eq (1) and Eq (2).

W_s = Conv([AvgPool_c(F); MaxPool_c(F)])  (1)
W_c = Conv_2(δ(Conv_1(AvgPool_s(F))))  (2)

In Eq (1), AvgPool_c represents global average pooling across the channel dimension and MaxPool_c represents global maximum pooling across the channel dimension. In Eq (2), AvgPool_s represents global average pooling across the spatial dimensions, and the two convolutions respectively reduce and then restore the dimensionality: the first convolution reduces the dimension from 2C to 2C/r, where r is a scaling factor, and the second convolution expands it back to 2C. To limit the complexity of the model, r is set to 8, which keeps the computational cost manageable while delivering considerable benefit. An activation function is introduced between the two convolutions to improve the feature representation and aid gradient flow. To make the two kinds of attention interact, broadcast addition is used to combine W_s and W_c into the initial attention weight W_0, as follows.

W_0 = W_s ⊕ W_c  (3)

The initial attention weight W_0 is refined according to each channel of the input bimodal feature map so that the model focuses on a distinct part of the features in each channel, generating the final spatial interaction map (SIM). The feature layer obtained by concatenating the input dual-modal feature map F with the attention weight W_0 is fed to a convolution layer, and the resulting feature weights are normalized by the sigmoid function to obtain W, as shown in Eq (4). W combines the dual-modal feature information and assigns a unique SIM to each channel, guiding the model to focus on important regions in the infrared and visible channels.

W = σ(Conv([F; W_0]))  (4)

The weighted feature information is calculated by combining W with the input features, and a residual connection adds the input back, as shown in Eq (5), which strengthens gradient transfer and prevents the loss of feature information caused by overly low weight assignments.

F_e = W ⊗ F + F  (5)

Finally, the features are split along the channel dimension to obtain the interaction-enhanced bimodal features, as shown in Eq (6).

(F′_vis, F′_ir) = Split(F_e)  (6)

After the channel-wise split, the enhanced features F′_vis and F′_ir are obtained and used as the input of the subsequent fusion module.
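As a reading aid, the DBE computation of Eqs (1)-(6) can be sketched in plain NumPy. The learned convolutions are replaced by pooling-based stand-ins and random linear maps, so every weight below is an illustrative assumption rather than the trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbe_block(f_vis, f_ir, r=8, rng=None):
    """Sketch of the DBE step for (C, H, W) feature maps."""
    rng = np.random.default_rng(0) if rng is None else rng
    f = np.concatenate([f_vis, f_ir], axis=0)      # F = Concat(.), (2C, H, W)
    c2 = f.shape[0]

    # Eq (1): spatial attention from channel-wise avg/max pooling
    # (the learned conv over [avg; max] is replaced by a fixed mean).
    w_s = 0.5 * (f.mean(axis=0, keepdims=True) + f.max(axis=0, keepdims=True))

    # Eq (2): channel attention, squeeze (2C -> 2C/r) then excite (-> 2C).
    squeeze = rng.standard_normal((c2 // r, c2)) * 0.1
    excite = rng.standard_normal((c2, c2 // r)) * 0.1
    w_c = excite @ np.maximum(squeeze @ f.mean(axis=(1, 2)), 0.0)

    # Eq (3): broadcast-add the two attentions -> initial weight W_0.
    w0 = w_s + w_c[:, None, None]

    # Eq (4): refine with the input features (stand-in for the conv over
    # Concat(F, W_0)) and normalize with a sigmoid -> per-channel SIM.
    w = sigmoid(f + w0)

    # Eq (5): weight the features and keep a residual connection.
    f_e = w * f + f

    # Eq (6): split back into the two modal branches.
    return f_e[: c2 // 2], f_e[c2 // 2:]

f_vis = np.random.default_rng(1).standard_normal((8, 4, 4))
f_ir = np.random.default_rng(2).standard_normal((8, 4, 4))
e_vis, e_ir = dbe_block(f_vis, f_ir)
```

The residual term in Eq (5) keeps each output within a bounded multiple of its input, so low attention weights attenuate but never erase a feature.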

3.2 Cross attention fusion network

The interaction enhancement module based on spatial and channel attention mechanisms enables the model to focus on important features. However, merely assigning weights to bimodal feature channels cannot effectively exploit the complementarity between the two modalities. To fully mine and utilize the complementary information of infrared and visible features, a Cross-Attention Fusion Network (CAFN) is constructed, leveraging the Transformer's strength in modeling long-range sequence relationships and the idea of the cross-attention Transformer (CAT). The structure is shown in Fig 6.

The CAFN employs a bidirectional query-retrieval mechanism to achieve deep cross-modal complementarity: infrared features serve as queries to retrieve relevant visible information, while visible features query complementary infrared content. This dynamic retrieval process is further enhanced through multi-head attention, which captures diverse cross-modal correlations across different representation subspaces. The resulting fused features exhibit both semantic consistency and detail completeness, effectively integrating complementary information from both modalities. By structuring fusion as an interactive retrieval process, CAFN enables more robust and discriminative feature representations for downstream detection tasks.

In the feature maps of infrared and visible light, useful information is mutually retrieved and extracted, while interfering information remains inactive. This essentially constitutes a bidirectional complementary information extraction process—mutual information retrieval from visible to infrared and from infrared to visible light—to fuse dual-modal features. Specifically, given the interactively enhanced visible and infrared features F′_vis and F′_ir, in order to utilize Transformer-based attention, the inputs are downsampled via average pooling and reshaped into token sequences X_vis and X_ir; the sequence length is kept small to constrain model complexity. The input sequences are projected by three weight matrices to compute a set of Query, Key, and Value vectors, as shown in Eqs (7) to (9).

Q = X_ir·W_Q  (7)
K = X_vis·W_K  (8)
V = X_vis·W_V  (9)

In Eqs (7) to (9), W_Q, W_K and W_V are learnable weight parameters; in this model, the number of channels for both modalities is C. The infrared feature information is used to retrieve visible feature information to compute channel correlations, and the cross-attention weights are calculated through scaled dot-product operations. In this process, the visible feature information provides the Key and Value, as shown in Eq (10). The output represents the visible attention weights fused with infrared features.

Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V  (10)

The cross-attention mechanism enables adaptive fusion of visible and infrared image features across different representation spaces. By automatically performing both intra-modal and inter-modal information fusion, this mechanism computes a correlation matrix between features to capture potential interactions between the visible and infrared modalities. These relationships are quantified into corresponding attention weights, thereby achieving effective and complementary feature fusion. A schematic diagram of this computation is shown in Fig 7.

To handle representations of different complex relations from different locations, the multi-head attention mechanism is adopted, as shown in Eq (11).

MultiHead(Q, K, V) = Concat(head_1, …, head_h)·W^O, where head_i = Attention(Q_i, K_i, V_i)  (11)

Here, h denotes the number of heads, which is set to 4 in our method; W^O is the output projection matrix applied after concatenating the heads, and is used to generate the fused infrared-visible feature attention weights. Similarly, the corresponding weights can be obtained for the infrared features. The two sets of weights are applied to the initial branch features, respectively, and then concatenated to form F_cat. The fused feature F_fused is output through a subsequent Transformer connection layer, as shown in Eq (12):

F_fused = FC_2(ReLU(FC_1(F_cat)))  (12)

The Transformer connection layer consists of a two-layer feed-forward network, passing sequentially through a linear layer FC₁, a ReLU non-linear activation, and a second linear transformation FC₂. As illustrated in Fig 6, the fusion module contains 8 cross-attention blocks to achieve inter-modal fusion. The final output matches the input feature dimensions and is upsampled to the original resolution via bilinear interpolation. By leveraging the cross-attention mechanism of the Transformer, the model can naturally perform inter-modal fusion and effectively capture interactive relationships between visible and infrared features, thereby enhancing the performance of dual-modal object detection.
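The bidirectional retrieval of Eqs (7)-(11) reduces to standard scaled dot-product cross-attention. A compact NumPy sketch follows, with random projection matrices in place of learned weights; all shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x_q, x_kv, n_heads=4, rng=None):
    """x_q queries x_kv; both are (N, C) token sequences (Eqs 7-11)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, c = x_q.shape
    d = c // n_heads
    w_q, w_k, w_v, w_o = (rng.standard_normal((c, c)) * 0.1 for _ in range(4))
    q, k, v = x_q @ w_q, x_kv @ w_k, x_kv @ w_v        # Eqs (7)-(9)
    heads = []
    for i in range(n_heads):
        s = slice(i * d, (i + 1) * d)
        attn = softmax(q[:, s] @ k[:, s].T / np.sqrt(d))   # Eq (10)
        heads.append(attn @ v[:, s])
    return np.concatenate(heads, axis=1) @ w_o         # Eq (11), h = 4

x_ir = np.random.default_rng(1).standard_normal((16, 32))
x_vis = np.random.default_rng(2).standard_normal((16, 32))
a_vis = cross_attention(x_ir, x_vis)   # infrared queries visible
a_ir = cross_attention(x_vis, x_ir)    # visible queries infrared
fused = np.concatenate([a_vis, a_ir], axis=1)  # input to the FFN of Eq (12)
```

Swapping which modality supplies the queries versus the keys/values is what makes the retrieval bidirectional; the concatenated result is then passed through the two-layer FFN of Eq (12).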

3.3 Object detection structure and loss function

3.3.1 Object detection structure.

The backbone of the network consists of two parts: the dual-branch backbone network and the fusion module described above. The dual-branch backbone progressively enhances dual-modal features through multiple feature-interaction blocks, while the feature fusion module enables cross-modal fusion at various scales. In this paper, the outputs of the backbone are as follows:

  • A 32 × downsampled dual-modal feature map obtained after 3 layers of DBE feature enhancement and fusion module processing,
  • A 16 × downsampled dual-modal feature map formed by concatenating features after 2 layers of DBE enhancement,
  • An 8 × downsampled dual-modal feature map formed by concatenating features after 1 layer of DBE enhancement.

These three multi-scale feature maps are fed into a Feature Pyramid Network (FPN) to improve object detection performance by integrating both high-level and low-level features. Finally, the three levels of features are processed by the YOLOv8-Head to output the final bounding box regressions and classification results.
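With the 640 × 640 inputs from Section 3.1.1, the three backbone strides give the FPN input resolutions below (a trivial sketch; channel counts are omitted):

```python
def fpn_input_shapes(input_size=640, strides=(8, 16, 32)):
    """Spatial sizes of the three multi-scale backbone outputs."""
    return [(input_size // s, input_size // s) for s in strides]

shapes = fpn_input_shapes()  # [(80, 80), (40, 40), (20, 20)]
```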

3.3.2 Loss function.

The loss function of the proposed method consists of two parts: the bounding box loss for object detection and the improved cross-entropy loss for classification.

The bounding box loss uses Complete-IoU (CIoU) Loss, which comprehensively considers three key metrics: overlap area, center-point distance, and aspect ratio. By constructing the loss so that predicted boxes are drawn closer to ground-truth boxes, CIoU enables more thorough optimization. The CIoU calculation is given by Eqs (13)-(15):

L_CIoU = 1 − IoU + ρ²(b, b^gt)/c² + αv  (13)
v = (4/π²)·(arctan(w^gt/h^gt) − arctan(w/h))²  (14)
α = v/((1 − IoU) + v)  (15)

Here, IoU denotes the intersection over union between the predicted and ground-truth boxes, ρ represents the Euclidean distance between their center points, c is the diagonal length of the smallest enclosing box covering both boxes, and α is a balancing coefficient. The terms b and b^gt refer to the center points of the predicted and ground-truth boxes, respectively. The variables w, h, w^gt and h^gt denote the widths and heights of the predicted and ground-truth boxes, while v measures the consistency of aspect ratio between the two boxes.
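The CIoU terms above map directly to code. A minimal reference implementation in plain Python, assuming `(x1, y1, x2, y2)` corner-format boxes:

```python
import math

def ciou_loss(box_p, box_g):
    """CIoU loss of Eqs (13)-(15); boxes given as (x1, y1, x2, y2)."""
    eps = 1e-9
    # intersection area and IoU
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    iou = inter / (area_p + area_g - inter + eps)
    # squared centre distance rho^2 and enclosing-box diagonal c^2
    cpx, cpy = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cgx, cgy = (box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2
    rho2 = (cpx - cgx) ** 2 + (cpy - cgy) ** 2
    ex1, ey1 = min(box_p[0], box_g[0]), min(box_p[1], box_g[1])
    ex2, ey2 = max(box_p[2], box_g[2]), max(box_p[3], box_g[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps
    # aspect-ratio consistency v (Eq 14) and trade-off alpha (Eq 15)
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wg, hg = box_g[2] - box_g[0], box_g[3] - box_g[1]
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(wp / hp)) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v             # Eq (13)
```

For identical boxes the loss approaches zero; for disjoint boxes the distance and aspect-ratio penalties keep a useful gradient even when IoU is zero, which is the motivation for CIoU over plain IoU loss.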

The classification loss employs Varifocal Loss (VFL), which adopts an asymmetric weighting scheme for training samples. It reduces the weight of negative samples while increasing the weight of high-quality positive samples, thereby focusing training on high-quality positives. The loss is computed as shown in Eq (16).

VFL(p, q) = −q·(q·log(p) + (1 − q)·log(1 − p)),  if q > 0
VFL(p, q) = −α·p^γ·log(1 − p),  if q = 0  (16)

Here, p is the predicted classification score, and q is the IoU between the predicted box and the ground truth box. When q > 0, the sample is considered positive. If the two boxes do not overlap, q = 0, and the sample is treated as negative. The terms α and γ are hyperparameters. This loss function places greater emphasis on hard-to-classify positive samples, thereby improving overall object detection performance.
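Eq (16) can be written as a small scalar helper in plain Python. The defaults α = 0.75 and γ = 2.0 are the common values from the original VFL formulation, assumed here rather than taken from this work:

```python
import math

def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """Varifocal loss of Eq (16): p = predicted score, q = IoU target."""
    eps = 1e-9
    if q > 0:
        # positive sample: binary cross-entropy weighted by the IoU target q
        return -q * (q * math.log(p + eps) + (1 - q) * math.log(1 - p + eps))
    # negative sample: down-weighted by alpha * p^gamma
    return -alpha * (p ** gamma) * math.log(1 - p + eps)
```

The asymmetry is visible directly: for a positive target (q = 0.9), a low prediction is penalized far more heavily than a high one, focusing training on hard, high-quality positives.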

4 Experiment and analysis

4.1 Experimental environment and training strategy

All models were trained on a system running Windows 11 23H2. The hardware platform consisted of an Intel Core i7-13650 CPU and an NVIDIA GeForce RTX 4060 GPU, with CUDA version 11.6. All experiments were implemented using Python 3.8 and the PyTorch 1.13.1 framework. No pre-trained weights were loaded during training to ensure a fair comparison across different architectures.

The detailed hyperparameter settings used in our experiments are summarized in Table 1. We adopted Stochastic Gradient Descent (SGD) with momentum as the optimizer, with an initial learning rate of 0.01 and a weight decay coefficient of 5 × 10⁻⁴. The learning rate was reduced by a factor of 0.1 at epochs 50, 100, and 150. The batch size was set to 8 due to GPU memory constraints. During training, a visualization tool was employed to monitor the loss function, which stabilized after approximately 200 epochs, indicating model convergence.
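The step schedule described above (multiply by 0.1 at epochs 50, 100 and 150, starting from 0.01) can be written as a small helper. This mirrors the stated hyperparameters, not any released training code:

```python
def learning_rate(epoch, base_lr=0.01, milestones=(50, 100, 150), gamma=0.1):
    """Step-decay schedule: multiply the rate by gamma at each milestone."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```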

4.2 Datasets and evaluation indicators

To validate the effectiveness of the proposed method in complex dual-modal detection scenarios, we constructed the SYUGV dataset using our unmanned ground vehicle platform SYUGV-01, equipped with synchronized infrared and visible cameras (30 fps, 720 × 576 resolution). The dataset contains 6,272 image pairs (3,232 daytime, 3,040 nighttime) covering challenging conditions such as nighttime, overexposure, motion blur, and occlusion (as shown in Fig 8).

We applied a coarse feature-based registration method with an alignment accuracy within ~10 pixels, which provides region-level semantic consistency suitable for object detection, consistent with common practice in multimodal detection benchmarks. All image pairs share COCO-format annotations across two categories (pedestrians, vehicles) and are split into 5,146 training and 1,126 test pairs with no overlap. The dataset will be publicly released upon acceptance to support reproducibility. Additionally, we evaluate our method on the public LLVIP dataset [37] for generalization assessment.

The evaluation metrics used in the experiments include Precision (P), Recall (R), mean Average Precision (mAP), and Frames Per Second (FPS). Additionally, Floating Point Operations (FLOPs) and the number of model parameters (Params) were used to assess algorithm complexity.

4.3 Comparative experiment and analysis

To verify the effectiveness of the proposed algorithm, it is evaluated on the SYUGV and LLVIP datasets against recent dual-modal object detection algorithms, including MAF-YOLO [10], SLBAF-Net [11], and ICAFusion [12], using the metrics above for evaluation and comparison.

4.3.1 Results and analysis of SYUGV dataset.

To validate the performance of the proposed method under realistic operational scenarios, we conducted comprehensive experiments on our self-built SYUGV dataset. This dataset emulates challenging conditions encountered by unmanned platforms in the field (e.g., low-light, occlusion, and motion blur), providing an in-depth evaluation of accuracy, robustness, and real-time capability.

  1). Quantitative Performance

As shown in Fig 9 and Table 2, the proposed DEF-Net achieves state-of-the-art results on the SYUGV test set. The model attains the highest mAP@0.5 of 95.64%, outperforming MAF-YOLO (92.76%) and SLBAF-Net (93.13%) by 2.88% and 2.51%, respectively. This improvement can be attributed to the synergistic enhancement of salient features via the Dual-branch Interaction Enhancement (DBE) module and the deep integration of complementary information through the Cross-Attention Fusion Network (CAFN).

Table 2. Comparative experimental results of different models.

https://doi.org/10.1371/journal.pone.0345815.t002

In particular, the model achieves a recall (R) of 90.48%, the highest among all compared methods, indicating a lower miss-detection rate. This demonstrates the effectiveness of the CAT module in retrieving missing features from the complementary modality. Under the stricter mAP@0.5:0.95 metric, DEF-Net also leads with 69.63%, reflecting more accurate bounding-box localization. Importantly, while maintaining high accuracy, DEF-Net reaches an inference speed of 117 FPS, significantly exceeding other methods and satisfying real-time detection requirements.
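Throughput figures such as the FPS values reported here are typically measured by timing repeated forward passes after a short warm-up; a minimal sketch (with a toy callable standing in for the detector's forward pass) might look like:

```python
import time

def measure_fps(infer, frames, warmup=3):
    """Average end-to-end throughput of `infer` over a list of frames.
    `infer` is any per-frame callable (a detector forward pass in practice)."""
    for f in frames[:warmup]:        # warm-up iterations, excluded from timing
        infer(f)
    start = time.perf_counter()
    for f in frames:
        infer(f)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed

# Toy stand-in for a detector forward pass.
fps = measure_fps(lambda f: sum(f), [[1, 2, 3]] * 1000)
```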

Fig 9. Comparison of model training on the SYUGV dataset.

https://doi.org/10.1371/journal.pone.0345815.g009

2) Qualitative Analysis

Fig 10 displays detection results on six challenging scenes from the SYUGV dataset, visualized on infrared images for clearer target contrast. Inference thresholds were fixed at 0.25 (confidence) and 0.7 (IoU) for all methods.
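The fixed thresholds above correspond to standard post-processing: a confidence filter followed by greedy non-maximum suppression (NMS). The pure-Python sketch below illustrates that step on toy boxes; it is not the paper's actual inference pipeline.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def filter_detections(dets, conf_thr=0.25, iou_thr=0.7):
    """Drop low-confidence boxes, then apply greedy NMS to the rest.
    Each detection is a (box, score) pair."""
    kept = []
    for box, score in sorted(dets, key=lambda d: -d[1]):
        if score < conf_thr:
            continue
        # Keep the box only if it does not heavily overlap a kept one.
        if all(iou(box, k) <= iou_thr for k, _ in kept):
            kept.append((box, score))
    return kept

dets = [((0, 0, 10, 10), 0.9), ((1, 1, 10, 10), 0.8), ((20, 20, 30, 30), 0.1)]
```

Here the second box (IoU 0.81 with the first) is suppressed and the third falls below the confidence threshold, leaving a single detection.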

Fig 10. Comparison of detection effects of different models on the SYUGV dataset.

https://doi.org/10.1371/journal.pone.0345815.g010

The qualitative results demonstrate the robustness of our DEF-Net. In low-light or nighttime scenes (e.g., Scene 1 and Scene 4), DEF-Net successfully detects several pedestrians that are missed by other methods (highlighted with yellow circles), which visually substantiates the effectiveness of the Cross-Attention Fusion Network (CAT) in leveraging thermal information to reinforce features from the visible spectrum. In scenarios involving occlusion or target interaction (e.g., Scene 2 and Scene 3), DEF-Net produces more complete bounding boxes around partially obscured targets while generating fewer false positives. This can be attributed to the Dual-branch Interaction Enhancement (DBE) module, which dynamically emphasizes spatially and semantically informative regions within the feature maps.

Fig 11. Comparison of model training on the LLVIP dataset.

https://doi.org/10.1371/journal.pone.0345815.g011

These visual results align with the quantitative metrics in Table 2, confirming DEF-Net’s robustness and practical suitability for real-time detection in complex environments.

4.3.2 Results and analysis of LLVIP dataset.

To assess the generalization capability and benchmark performance of the proposed method, we further evaluated DEF-Net on the public LLVIP dataset.

1) Quantitative Performance

The quantitative results in Fig 11 and Table 3 demonstrate that DEF-Net remains highly competitive on this independent benchmark. Although its precision (95.39%) is marginally lower than that of MAF-YOLO (96.09%) and ICAFusion (96.53%), our method achieves the highest recall (89.51%), indicating a stronger ability to reduce missed detections through enhanced cross-modal complementarity. More importantly, DEF-Net obtains the best overall accuracy, with mAP@0.5 of 95.70% (exceeding ICAFusion by 0.87%) and mAP@0.5:0.95 of 61.67% (surpassing ICAFusion by 1.80%). These results confirm that DEF-Net effectively balances precision and recall, leading to superior comprehensive detection performance. Notably, DEF-Net maintains an excellent balance between model complexity and accuracy. Compared with the high-accuracy ICAFusion, our method uses approximately 47% fewer parameters (12.3 M vs. 23.2 M) and runs 4.7 times faster (113 FPS vs. 24 FPS), making it markedly more suitable for real-time deployment in practical systems.

Table 3. Comparative experimental results of different models.

https://doi.org/10.1371/journal.pone.0345815.t003

It is worth emphasizing that the performance trends observed on SYUGV—specifically, superior recall and leading mAP—are consistently reproduced on LLVIP. This cross‑dataset coherence reinforces the generalizability and reliability of the proposed architecture.

2) Qualitative Validation

For a visual comparison, Fig 12 presents detection results on six representative scenes from LLVIP. To ensure clarity, all results are overlaid on the infrared modality, where target information is typically more distinct. During inference, the confidence threshold and the Intersection over Union (IoU) threshold were uniformly set to 0.25 and 0.7, respectively. The visualizations show that DEF-Net produces more complete and accurate bounding boxes across diverse scenarios, further corroborating the quantitative advantages reported in Table 3.

Fig 12. Comparison of detection effects of different models on the LLVIP dataset.

https://doi.org/10.1371/journal.pone.0345815.g012

The above experimental results show that our method performs well on the bimodal object detection task. Compared with other methods, it maintains stable detection across different scenes and conditions, and in particular retains good performance under occlusion (Scenes 4 and 6) and illumination change (Scenes 2 and 3).

4.4 Ablation study

4.4.1 Number of feature channels in backbone network.

To limit model complexity while maintaining accuracy, YOLOv8s is selected as the single-modal baseline detection network in this study. The dual-modal baseline replaces the Darknet backbone with the dual-branch backbone structure described above, reduces the number of feature channels, omits the DBE module, and concatenates the infrared and visible features through a convolutional layer to obtain fused features, which are sent to the detection head. The resulting Dual-CSPDarkNet (hereinafter Dual-CDNet) serves as the dual-branch baseline model for subsequent experiments to verify the effectiveness of the proposed method.
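The baseline fusion described here, channel-wise concatenation followed by a convolutional layer, can be illustrated with a toy pure-Python sketch in which a 1x1 convolution mixes the concatenated channels per pixel. Shapes, weights, and values below are illustrative only, not the network's actual parameters.

```python
def concat_channels(ir_feats, vi_feats):
    """Channel-wise concatenation of two feature maps,
    each represented as [channels][pixels] flat lists."""
    return ir_feats + vi_feats

def conv1x1(feats, weights):
    """A 1x1 convolution is a per-pixel weighted sum over channels:
    out[c_out][p] = sum over c_in of weights[c_out][c_in] * feats[c_in][p]."""
    n_pix = len(feats[0])
    return [
        [sum(w[c] * feats[c][p] for c in range(len(feats))) for p in range(n_pix)]
        for w in weights
    ]

ir = [[1.0, 2.0]]   # 1 infrared channel, 2 pixels
vi = [[3.0, 4.0]]   # 1 visible channel, 2 pixels
# Fuse 2 concatenated channels down to 1 output channel.
fused = conv1x1(concat_channels(ir, vi), [[0.5, 0.5]])
```

In the real network, this simple averaging of modalities is exactly what the DBE and CAT modules improve upon by weighting and cross-attending between channels adaptively.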

Experimental comparison on the SYUGV dataset, shown in Table 4, verifies that the dual-branch backbone network improves detection performance. Specifically, the YOLOv8s model achieves an mAP@0.5 of 86.74% with only visible images (VI) as input and 91.82% with only infrared images (IR). The dual-modal model built on the dual-branch backbone reaches an mAP@0.5 of 94.06%, which is 7.32% and 2.24% higher than the visible-only and infrared-only inputs, respectively. Meanwhile, by reducing the number of backbone feature channels, Dual-CDNet lowers FLOPs by 5.8 G and Params by 2.3 M relative to the single-branch baseline. In terms of detection speed, the proposed dual-branch structure maintains the speed of the single-branch structure. Fig 13 plots the Precision-Recall (P-R) curves of the different models; it can be seen intuitively that extracting infrared and visible features with separate branches significantly improves overall detection accuracy over either single modality.

Table 4. Model detection performance of different modal inputs.

https://doi.org/10.1371/journal.pone.0345815.t004

4.4.2 The detection performance of dual branch model.

The Grad-CAM method is used to identify and highlight the regions of the input image that drive the network's category predictions. For all target categories, weight features are computed from the feature maps of different layers to generate a heatmap. In the resulting heatmap, red regions are those the model attends to most when making a prediction, i.e., the regions with the greatest influence on the result; yellow regions receive the second-most attention, and blue regions have the least influence. Taking night-scene data as an example, Grad-CAM is applied to different network layers to visualize feature attention, as shown in Fig 14.
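The Grad-CAM computation summarized above (gradient-derived channel weights, a weighted sum of activations, ReLU, and normalization) can be sketched on toy arrays as follows; in real usage, the activations and gradients would come from a chosen layer of the trained network rather than the hand-written values used here.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap: each channel's weight is the spatial mean of its
    gradient map; the heatmap is the ReLU of the weighted activation sum,
    normalized to [0, 1]. Inputs are shaped [channels][rows][cols]."""
    n_ch = len(activations)
    rows, cols = len(activations[0]), len(activations[0][0])
    # Channel weight = global average of that channel's gradient map.
    weights = [
        sum(sum(row) for row in gradients[c]) / (rows * cols) for c in range(n_ch)
    ]
    # Weighted sum over channels, clipped at zero (ReLU).
    cam = [
        [
            max(0.0, sum(weights[c] * activations[c][i][j] for c in range(n_ch)))
            for j in range(cols)
        ]
        for i in range(rows)
    ]
    # Normalize so the hottest location maps to 1.0.
    peak = max(max(row) for row in cam) or 1.0
    return [[v / peak for v in row] for row in cam]

acts = [[[0.0, 1.0], [2.0, 4.0]]]    # 1 channel, 2x2 feature map
grads = [[[1.0, 1.0], [1.0, 1.0]]]   # uniform positive gradient
heat = grad_cam(acts, grads)         # hottest where activation is largest
```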

As can be seen from Fig 14, when processing nighttime images, especially under local overexposure or insufficient light, the visible image struggles to reveal all target information (Fig 14(a)), and the visible single-branch model attends only to some weak features (Fig 14(b)). After the infrared branch is introduced, the model clearly attends to the brightness characteristics of thermal targets that are hard to detect in visible images but prominent in infrared images (Fig 14(d)). Processing the two modalities separately with the dual-branch backbone exploits the specific characteristics of infrared and visible imagery, verifying the effectiveness of the dual-branch structure for handling dual-modal information.

Due to the modal heterogeneity of infrared and visible images, independent networks are required to extract their respective features, which motivates the dual-branch backbone constructed above and demonstrated to be effective. In addition, since different backbone networks differ in parameter count and computational complexity, experiments are needed to evaluate backbone performance. We therefore select several mainstream backbone models and construct corresponding dual-branch structures. Six dual-branch architectures—Dual-EfficientNet, Dual-DenseNet, Dual-MobileNet, Dual-SwinT, Dual-RTDETR, and Dual-CDNet—are built on EfficientNet-B0 [38], DenseNet-121 [39], MobileNet-v3 [40], SwinTransformer-T [41], RT-DETR-R18 [42], and CSPDarkNet, respectively. To evaluate each dual-branch model objectively, comparative experiments were conducted on the SYUGV dataset with both infrared and visible modalities as input. Detailed metric results are shown in Table 5.

Table 5. Model detection performance of different backbone networks.

https://doi.org/10.1371/journal.pone.0345815.t005

It can be seen that the Dual-CDNet model built on CSPDarkNet shows clear advantages, achieving a better balance between computing-resource consumption and detection accuracy. Compared with Transformer-backbone models such as Dual-SwinT and Dual-RTDETR, the CNN-based models, including Dual-DenseNet, Dual-MobileNet, Dual-EfficientNet, and Dual-CDNet, offer clear advantages in computational overhead while maintaining comparable performance; this is because the limited data scale of the current object detection task cannot fully exploit the strengths of the Transformer architecture. Among the CNN-based models, Dual-CDNet has a more flexible feature hierarchy, which allows it to adapt better to the subsequent interactive fusion module and provides ideal infrastructure for fusing dual-modal features. Therefore, this paper chooses Dual-CDNet as the base framework to provide architectural support for the subsequent dual-modal feature fusion research.

4.4.3 Contribution analysis of the proposed modules.

To systematically evaluate the individual and synergistic contributions of the proposed Dual-branch Interaction Enhancement (DBE) module and the Cross-attention Fusion (CAT) module, a comprehensive ablation study was conducted on both the SYUGV and LLVIP datasets. The baseline model (Baseline) employs a dual-branch backbone with simple feature concatenation.

1) Quantitative Analysis

The detailed results are summarized in Table 6 (for SYUGV) and Table 7 (for LLVIP). Several key observations can be made:

Table 6. Ablation study of different module combinations on the SYUGV dataset.

https://doi.org/10.1371/journal.pone.0345815.t006

Table 7. Ablation study of different module combinations on the LLVIP dataset.

https://doi.org/10.1371/journal.pone.0345815.t007

The incorporation of only the DBE module improves the mAP@0.5 over the Baseline by 1.17% (from 94.06% to 95.23%) on SYUGV and 0.41% (from 93.98% to 94.39%) on LLVIP. This enhancement surpasses that achieved by applying other attention mechanisms like CBAM [43] or ECA [44] to the dual-branch network, demonstrating the specific effectiveness of our interactive enhancement design for bimodal features.

Adding only the CAT module yields a moderate improvement (+0.96% on SYUGV, +0.04% on LLVIP), indicating that fusion alone provides limited gains without the enhanced feature specificity supplied by DBE.

The full model (Baseline+DBE+CAT) achieves the highest mAP@0.5 on both datasets (95.64% on SYUGV, 95.70% on LLVIP). The combined improvement (+1.58% on SYUGV, +1.72% on LLVIP) exceeds either module's individual gain, and on LLVIP it even exceeds the sum of the two, indicating a synergistic effect in which DBE-enhanced features enable more effective fusion via CAT.

2) Convergence and Stability

The mAP@0.5 curves during training are plotted in Fig 15. It is observed that the model combining both DBE and CAT converges faster and more stably to a higher performance plateau compared to all other variants, further confirming the robustness introduced by the proposed modules.

Fig 15. mAP@0.5 curves of ablation studies on the (a) SYUGV and (b) LLVIP datasets.

https://doi.org/10.1371/journal.pone.0345815.g015

3) Visualization of Feature Focus

The feature activation heatmaps for different model variants are displayed in Fig 16. It can be observed that: (1) The attention of the Baseline model is relatively dispersed; (2) The DBE module helps the model focus more precisely on the target objects themselves; (3) The CAT module alone cannot fully utilize complementary information without prior enhancement from DBE; (4) Only when DBE and CAT are combined does the model achieve precise and focused attention on the complementary features from both modalities, visually validating the mechanism of the proposed approach.

Fig 16. Visualization of feature activation (heatmaps) for different model variants.

https://doi.org/10.1371/journal.pone.0345815.g016

5 Conclusions

In this paper, a target detection method based on the dual-modal fusion network DEF-Net is proposed. First, a dual-branch feature encoder suited to infrared and visible dual inputs is constructed, with each branch extracting the features of one modality. Second, to make better use of dual-modal features for object detection in complex scenes, the DBE module is proposed, which uses internal feature attention to enhance information interaction between modalities. In addition, a cross-attention mechanism is introduced to further improve the network's ability to express complementary information, effectively improving detection performance. Experimental results show that, compared with existing methods, the proposed method performs well on the SYUGV dataset; even in complex environments involving mutual occlusion of targets or loss of single-modal information caused by local overexposure, it maintains good detection performance and strong robustness. Future work will increase scene complexity, expand the dataset, and explore ways to improve the model's generalization ability so that it better suits practical unmanned detection systems.

Supporting information

References

  1. Al-lQubaydhi N, Alenezi A, Alanazi T, Senyor A, Alanezi N, Alotaibi B, et al. Deep learning for unmanned aerial vehicles detection: A review. Computer Science Review. 2024;51:100614.
  2. Feng Y, Chen F, Sun G, Wu F, Ji Y, Liu T, et al. Learning multi-granularity representation with transformer for visible-infrared person re-identification. Pattern Recognition. 2025;164:111510.
  3. Guo J, Du H, Hao X, Zhang M. CFET: A Cross-Fusion Enhanced Transformer for Visible-infrared person re-identification. Expert Systems with Applications. 2025;271:126645.
  4. Bustos N, Mashhadi M, Lai-Yuen SK, Sarkar S, Das TK. A systematic literature review on object detection using near infrared and thermal images. Neurocomputing. 2023;560:126804.
  5. Yue G, Li Z, Tao Y, Jin T. Low-illumination traffic object detection using the saliency region of infrared image masking on infrared-visible fusion image. J Electron Imag. 2022;31(03).
  6. Liu J, Fan X, Huang Z, Wu G, Liu R, Zhong W, et al. Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 5792–801.
  7. Tang C, Lin YS, Yang H. Decision-level fusion detection for infrared and visible spectra based on deep learning. Infrared and Laser Engineering. 2019;48(6):626001.
  8. Sun Y, Cao B, Zhu P, Hu Q. Drone-Based RGB-Infrared Cross-Modality Vehicle Detection Via Uncertainty-Aware Learning. IEEE Trans Circuits Syst Video Technol. 2022;32(10):6700–13.
  9. Geng KK, Zou W, Yin GD. Low-observable targets detection for autonomous vehicles based on dual-modal sensor fusion with deep learning approach. J Automob Eng. 2019;233(9):2270–83.
  10. Xue Y, Ju Z, Li Y, Zhang W. MAF-YOLO: Multi-modal attention fusion based YOLO for pedestrian detection. Infrared Physics & Technology. 2021;118:103906.
  11. Cheng X, Geng K, Wang Z, Wang J, Sun Y, Ding P. SLBAF-Net: Super-Lightweight bimodal adaptive fusion network for UAV detection in low recognition environment. Multimed Tools Appl. 2023;82(30):47773–92.
  12. Shen J, Chen Y, Liu Y, Zuo X, Fan H, Yang W. ICAFusion: Iterative cross-attention guided feature fusion for multispectral object detection. Pattern Recognition. 2024;145:109913.
  13. Zhao X, Zhu YP, Chen Z, Xu DT. Marine target detection and recognition method based on YOLO neural network in embedded system. In: 2023 IEEE 6th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), 2023. 229–35.
  14. Liu J, Sun N, Li X. Rare Bird Sparse Recognition via Part-Based Gist Feature Fusion and Regularized Intraclass Dictionary Learning. CMC: Computers, Materials & Continua. 2018;55(3):435–46.
  15. Hossein-Nejad Z, Nasri M. Adaptive RANSAC and extended region-growing algorithm for object recognition over remote-sensing images. Multimed Tools Appl. 2022;81(22):31685–708.
  16. Al-Tamimi A-K, Qasaimeh A, Qaddoum K. Offline signature recognition system using oriented FAST and rotated BRIEF. IJECE. 2021;11(5):4095.
  17. Sultana M, Ahmed T, Chakraborty P, Khatun M, Rakib Md, Shorif M. Object Detection using Template and HOG Feature Matching. IJACSA. 2020;11(7).
  18. Xiao Z, Xu P, Wang X, Chen L, An F. A Multi-Class Objects Detection Coprocessor With Dual Feature Space and Weighted Softmax. IEEE Trans Circuits Syst II. 2020;67(9):1629–33.
  19. Sasaki Y, Emaru T, Ravankar AA. SVM based Pedestrian Detection System for Sidewalk Snow Removing Machines. In: 2021 IEEE/SICE International Symposium on System Integration (SII), Iwaki, Fukushima, Japan, 2021. 700–1. https://doi.org/10.1109/IEEECONF49454.2021.9382618
  20. Yan C, Wang Z, Xu C. Gentle Adaboost algorithm based on multi-feature fusion for face detection. The Journal of Engineering. 2019;2019(15):609–12.
  21. Wang H, Xiao NF. Underwater object detection method based on improved Faster RCNN. Applied Sciences. 2023;13(4):2746.
  22. Xie J, Pang Y, Nie J, Cao J, Han J. Latent Feature Pyramid Network for Object Detection. IEEE Transactions on Multimedia. 2023;25:2153–63.
  23. Cao Z, Kooistra L, Wang W, Guo L, Valente J. Real-Time Object Detection Based on UAV Remote Sensing: A Systematic Literature Review. Drones. 2023;7(10):620.
  24. Redmon J, Farhadi A. YOLO9000: Better, Faster, Stronger. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 6517–25.
  25. Redmon J, Farhadi A. YOLOv3: An incremental improvement. arXiv preprint. 2018.
  26. Bochkovskiy A, Wang CY, Liao MH. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv preprint. 2020.
  27. Olorunshola OE, Irhebhude ME, Evwiekpaefe AE. A Comparative Study of YOLOv5 and YOLOv7 Object Detection Algorithms. JCSI. 2023;2(1):1–12.
  28. Wang CY, Bochkovskiy A, Liao HY. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 2023. 7464–75.
  29. Wang CY, Yeh IH, Liao HY. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In: Leonardis A, Ricci E, Roth S, Russakovsky O, Sattler T, Varol G, editors. Computer Vision – ECCV 2024. Cham: Springer. 2025.
  30. Harikrishnan PM, Thomas A, Gopi VP. Inception single shot multi-box detector with affinity propagation clustering and their application in multi-class vehicle counting. Applied Intelligence. 2021;51(7):4714–29.
  31. Vijayakumar A, Vairavasundaram S. YOLO-based Object Detection Models: A Review and its Applications. Multimed Tools Appl. 2024;83(35):83535–74.
  32. Jin R, Xu Y, Xue W, Li B, Yang Y, Chen W. An Improved MobileNetv3-YOLOv5 Infrared Target Detection Algorithm Based on Attention Distillation. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering. Springer International Publishing. 2022. p. 266–79.
  33. Lou H, Duan X, Guo J, Liu H, Gu J, Bi L, et al. DC-YOLOv8: Small-Size Object Detection Algorithm Based on Camera Sensor. Electronics. 2023;12(10):2323.
  34. Wang A, Chen H, Liu LH, et al. YOLOv10: Real-time end-to-end object detection. In: 38th Conference on Neural Information Processing Systems (NeurIPS 2024). arXiv:2405.14458.
  35. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-End Object Detection with Transformers. Lecture Notes in Computer Science. Springer International Publishing. 2020. p. 213–29.
  36. Mahasin M, Dewi IA. Comparison of CSPDarkNet53, CSPResNeXt-50, and EfficientNet-B0 backbones on YOLO v4 as object detector. International Journal of Engineering, Science and Information Technology. 2022;2(3):64–72.
  37. Jia X, Zhu C, Li M, Tang W, Zhou W. LLVIP: A Visible-infrared Paired Dataset for Low-light Vision. In: 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2021. 3489–97.
  38. Tadepalli Y, Kollati M, Kuraparthi S, Kora P. EfficientNet-B0 Based Monocular Dense-Depth Map Estimation. TS. 2021;38(5):1485–93.
  39. Tan PS, Lim KM, Tan CH, Lee CP. Pre-trained DenseNet-121 with Multilayer Perceptron for Acoustic Event Classification. IAENG International Journal of Computer Science. 2023;50(1):IJCS_50_1_07.
  40. Koonce B. MobileNetV3. In: Koonce B, editor. Convolutional neural networks with Swift for TensorFlow: Image recognition and dataset categorization. Apress. 2021. p. 125–44.
  41. Liu Z, Tan Y, He Q, Xiao Y. SwinNet: Swin Transformer Drives Edge-Aware RGB-D and RGB-T Salient Object Detection. IEEE Trans Circuits Syst Video Technol. 2022;32(7):4486–97.
  42. Thin Jun EL, Tham M-L, Kwan B-H. A Comparative Analysis of RT-DETR and YOLOv8 for Urban Zone Aerial Object Detection. In: 2024 IEEE International Conference on Automatic Control and Intelligent Systems (I2CACIS), Shah Alam, Malaysia, 2024. 340–5.
  43. Woo S, Park J, Lee J-Y, Kweon IS. CBAM: Convolutional Block Attention Module. Lecture Notes in Computer Science. Springer International Publishing. 2018. p. 3–19.
  44. Wang Q, Wu B, Zhu P, Li P, Zuo W, Hu Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 11531–9.