
CORE-Net: A cross-modal orthogonal representation enhancement network for low-altitude multispectral object detection

  • Daoze Tang ,

    Contributed equally to this work with: Daoze Tang, Shuyun Tang

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Harbin University of Commerce, Harbin, China, College of Information and Electrical Engineering, China Agricultural University, Beijing, China

  • Shuyun Tang ,

    Contributed equally to this work with: Daoze Tang, Shuyun Tang

    Roles Data curation, Formal analysis, Investigation, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Harbin University of Commerce, Harbin, China, College of Information and Electrical Engineering, China Agricultural University, Beijing, China

  • Dequan Zheng

    Roles Funding acquisition, Project administration, Supervision, Writing – original draft, Writing – review & editing

    dqzheng@hrbcu.edu.cn

    Affiliation Harbin University of Commerce, Harbin, China

Abstract

Object detection in visible light (RGB) images is frequently compromised by low-illumination conditions, whereas infrared (IR) imaging typically exhibits superior robustness in such environments. Multispectral fusion addresses this limitation by leveraging complementary information from both modalities; however, existing methods predominantly rely on intricate fusion modules to integrate cross-modal features, inevitably incurring significant computational overhead and architectural complexity. To mitigate this issue, we propose a novel Cross-modal Orthogonal Representation Enhancement Network (CORE-Net). Diverging from conventional heavy-fusion paradigms, our framework adopts a dual-branch architecture integrated with a streamlined Cross-modal Concatenation Network Framework (CCNF), which achieves efficient feature integration while substantially reducing model complexity. Furthermore, CORE-Net incorporates two distinct components—the Multiple Pooling Convolution Downsampling (MPCD) module and the Refined Integration Network (RINet)—specifically designed to optimize feature extraction capabilities. Extensive evaluations on the DroneVehicle and LLVIP datasets demonstrate that CORE-Net achieves state-of-the-art (SOTA) performance in terms of both detection accuracy and computational efficiency. Ablation studies substantiate the individual and synergistic contributions of each proposed component, while deployment on edge devices further corroborates the model’s practical efficiency. Additionally, qualitative visualizations confirm the model’s efficacy in suppressing background noise and enhancing discriminative fine-grained features. In summary, CORE-Net establishes a robust new paradigm for high-performance and efficient multispectral object detection.

Introduction

In recent years, optical remote sensing has been widely applied in fields such as autonomous driving, surveillance, environmental monitoring, and urban planning [1–5]. Advances in deep learning (DL) have significantly enhanced object detection accuracy. Two-stage detection algorithms, such as R-CNN [6], Faster R-CNN [7], and SPP-Net [8], rely on region proposal networks to generate candidate bounding boxes for classification and recognition. In contrast, single-stage detection algorithms simultaneously predict both object categories and locations, eliminating the need for region proposal generation. Notable examples include SSD [9], RetinaNet [10], and the YOLO series [11–13]. Remote sensing object detection (RSOD) [14–16] further plays a critical role in target localization and recognition within remote sensing applications.

In previous studies [17,18], most methodologies have been designed and optimized for single-modal target detection using red-green-blue (RGB) and infrared (IR) imagery. However, the performance of these approaches remains limited when detecting targets with subtle or indistinct features [19]. Traditional low-altitude remote sensing target detection predominantly relies on RGB images as the primary data source due to their detailed edge structures, complex textures, and rich color information [20], particularly under optimal lighting conditions. Furthermore, the widespread availability of large-scale RGB datasets provides ample training samples for such tasks. Nevertheless, in extreme environments, RGB images are susceptible to illumination variations, which can lead to detail degradation, while complex backgrounds may result in occlusion or incomplete data coverage [21]. These limitations present significant challenges to enhancing detection accuracy and robustness in low-altitude remote sensing applications. Conversely, infrared images capture thermal radiation patterns and temperature gradients, enabling clear delineation of target contours even under low-light conditions, nighttime operations, long-distance scenarios, or severe occlusion [22,23]. While IR imagery exhibits a distinct advantage over RGB in low-light environments [24,25], it performs suboptimally when target boundaries are ambiguous. Given the complementary nature of RGB and IR modalities in target detection, integrating their synergistic information could substantially improve the performance of multimodal low-altitude remote sensing systems.

Previous research on RGB-infrared (RGB-IR) object detection has faced two primary challenges in complementary feature extraction and image fusion. First, the distinct imaging principles of RGB and IR modalities lead to inherent inconsistencies in their feature representations [26]. Second, objects with analogous visual or thermal signatures often introduce redundancy during fusion [27], which can degrade model performance. Consequently, the effective integration of multimodal complementary features is critical for improving detection accuracy. Existing RGB-IR fusion strategies broadly fall into three categories: concatenation-based, summation-based, and gated control-based methods [28,29]. The concatenation strategy [30] typically fuses RGB and IR features at the channel level, thereby maximizing the preservation of multimodal information. The summation strategy integrates [31,32] features through pixel-wise weighted addition, ensuring the overall consistency between different modalities. In contrast, the gated control strategy [33,34] employs gated units to dynamically adjust the weights of RGB and IR features, enabling adaptive multimodal fusion. Concatenation-based approaches are further categorized as early, intermediate-level, or late fusion. Early fusion integrates RGB and IR data at the pixel level or input channels. Intermediate-level fusion leverages semantic correlations between modalities by extracting features from each and combining them into enriched representations. Late fusion independently processes both modalities and merges their final detection outputs. Although intermediate-level fusion generally yields superior accuracy, its high computational complexity and memory requirements hinder practical deployment. Hence, a paramount objective in multimodal object detection is to navigate the inherent trade-off between computational demands and the robustness of detection performance.

To address these challenges, this study proposes the Cross-Modal Orthogonal Representation Enhancement Network (CORE-Net), an innovative RGB-IR fusion framework tailored for small-target detection in low-altitude remote sensing imagery. By leveraging orthogonal representation learning, CORE-Net enhances discriminative multispectral feature extraction while mitigating cross-modal redundancy—a key challenge stemming from the limited pixel footprint of targets in aerial remote sensing scenarios.

The main contributions of this study are as follows:

  • A series of structural optimization modules—including Multiple Pooling Convolution Downsampling (MPCD) and Refined Integration Network (RINet)—is proposed to enhance discriminative feature extraction capabilities in visually complex environments.
  • A Cross-modal Concatenation Network Framework (CCNF) is introduced, which replaces computationally intensive fusion operators with streamlined channel-wise concatenation. This design reduces the architectural complexity inherent in traditional cross-modal fusion paradigms while maintaining inter-modal feature compatibility.
  • A novel multimodal object detection model, CORE-Net, is developed based on CCNF. This model features a collaborative architecture comprising a dual-branch backbone network and a fusion-guided neck network, significantly improving detection accuracy and model generalizability in challenging scenarios.

The structure of this paper proceeds as follows. A survey of the literature is presented in Sect Related Work, which synthesizes contemporary advances in object detection and multimodal object detection specific to the remote sensing domain. Sect Methods furnishes a comprehensive exposition of the architectural blueprint and implementation specifics of our proposed CORE-Net framework. Sect Results delineates the experimental protocol, benchmark performance against prevailing state-of-the-art methods, and a series of ablation studies that critically evaluate the model’s constituent components. Finally, Sect Discussion and Sect Conclusions critically examine the model’s performance, address its limitations, and outline promising directions for future research.

Related work

Remote sensing object detection

Remote sensing object detection [35,36], a critical subtask in computer vision, leverages deep learning to classify and localize objects in remote sensing images (RSIs). This task faces challenges such as complex backgrounds, arbitrary object orientations, and small target sizes, which necessitate robust methods to minimize false negatives (missed detections) and false positives (erroneous detections). To address these issues, recent studies have proposed optimization frameworks to improve detection accuracy and robustness.

For instance, Li et al. [37] introduced the Lightweight Large Selective Kernel Network (LSKNet), which enhances contextual semantic modeling across object categories by stacking LSK modules. Similarly, Bi et al. [38] developed the Local Semantic Enhancement Convolutional Network (LSE-Net). By integrating a context-aware Class Peak Response (CACPR) mechanism, LSE-Net extracts discriminative local features and refines semantic representations, thereby improving recognition performance in aerial scenes. Wu et al. [39] proposed CBGS-YOLO, a framework addressing object occlusion in cluttered environments through the Ghost and SPD-Conv modules, which strengthen feature extraction and detection precision. In another approach, Zhang et al. [40] incorporated a Bidirectional Feature Pyramid Network (BiFPN) into YOLO, utilizing adaptive feature pooling and fully connected fusion layers to retain spatial information, achieving state-of-the-art performance in benchmark evaluations.

While these studies demonstrate progress, they primarily focus on single-modal data (e.g., RGB images). However, single-modal approaches inherently suffer from limited feature representation due to sensor-specific constraints. In contrast, multimodal data fusion can leverage complementary information from diverse sources (e.g., LiDAR, hyperspectral, SAR), offering richer contextual cues for improved detection.

Object detection with multimodal data

Object detection models are widely applied in practical scenarios such as remote sensing image classification, autonomous driving, and visual question answering. To enhance detection performance, researchers increasingly leverage multimodal data sources, including synthetic aperture radar (SAR), light detection and ranging (LiDAR), infrared (IR), and multispectral (MS) data [41,42]. By integrating complementary features from diverse modalities, these models improve detection accuracy and system robustness while mitigating the risk of object omission inherent to single-modal approaches.

Multimodal fusion methods are broadly categorized into early fusion, intermediate-level fusion [43], and late fusion, which correspond to pixel-level, feature-level, and decision-level strategies, respectively. For instance, Zhang et al. [44] proposed SuperYOLO, a pixel-level fusion framework that integrates auxiliary super-resolution (SR) techniques to enhance multi-scale target feature learning. Fei et al. [45] introduced ACDF-YOLO, employing an efficient shuffle attention (ESA) mechanism and a cross-modal difference module (CDM) to optimize global feature extraction while reducing computational redundancy, thereby improving multimodal fusion efficiency. Sharma et al. [46] developed YOLOrs, a feature-level fusion-based convolutional neural network, designed for real-time vehicle detection and prediction using multi-scale features. Wang et al. [47] proposed YOLOfiv, a dual-stream architecture incorporating an attention mechanism. This model integrates an efficient channel attention (ECA) module and a rotating detection head to enhance accuracy and stability in all-weather remote sensing imagery (ARSI) applications. Building on this, Xie et al. [48] addressed spatial misalignment in ARSI detection by incorporating cross-modal local calibration (CLC) and cross-modal global context modeling (CGC) modules.

Existing studies demonstrate the prevalence of multi-branch feature-level fusion in remote sensing applications. However, persistent challenges include modality inconsistency (e.g., spatial or spectral mismatches) and information redundancy (e.g., overlapping features across modalities). To address these challenges, we propose CORE-Net, a dual-branch feature extraction framework. By fusing features from RGB and infrared images at multiple levels, the model enhances the extraction of semantic and spatial features, achieves cross-modal feature complementarity, and improves target detection accuracy and robustness while reducing computational overhead.

Methods

To address the challenges of object detection in low-altitude remote sensing imagery under demanding conditions—including small object size, dense spatial distributions, and low-light environments—we introduce the CORE-Net model. This architecture optimizes the cross-modal integration of visible light and infrared spectral features using computationally efficient architectural components. The operational workflow of CORE-Net is detailed in Algorithm 1, while its architecture is illustrated in Fig 1.

Fig 1. Architectural overview of the CORE-Net model.

https://doi.org/10.1371/journal.pone.0340499.g001

Algorithm 1. Workflow of the CORE-Net model.

Conventional multimodal (RGB+IR) detection architectures typically employ complex fusion modules that achieve significant performance gains at the expense of substantial computational complexity. The CORE-Net framework introduces a Cross-modal Concatenation Network Framework (CCNF) that establishes multi-spectral feature integration through hierarchical channel concatenation operations. This configuration enables progressive cross-channel feature fusion through cascaded extraction components, forming a symmetrical dual-branch backbone and neck network architecture that maintains computational efficiency while effectively leveraging multimodal information.

The complete CORE-Net system integrates multiple innovative components including MPCD and RINet modules. Its architecture begins with a dual-branch, cross-modal backbone for parallel feature extraction. These features are then progressively fused within the neck network. Subsequently, the consolidated feature representation undergoes final processing in the task head to generate bounding box predictions and classification labels.

Cross-modal concatenation network framework

The architecture of the Cross-modal Concatenation Network Framework (CCNF) is illustrated in Fig 2. This framework is designed to achieve efficient cross-modal feature fusion. Unlike conventional single-layer fusion approaches, CCNF adopts a multi-stage feature fusion strategy.

The core principle of the proposed framework is the multi-stage fusion of multimodal features. At each stage, features are concatenated along the channel dimension according to a predefined ratio. These fused representations are then propagated to subsequent layers. Although relying on computationally simple operations, this design enables robust performance with minimal computational overhead.
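The per-stage fusion described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `ccnf_stage`, the 50/50 channel ratio, and the take-the-first-channels selection scheme are all assumptions, since the paper only states that features are concatenated along the channel dimension according to a predefined ratio.

```python
import numpy as np

def ccnf_stage(rgb_feat, ir_feat, ratio=0.5):
    """One CCNF fusion stage (sketch): concatenate RGB and IR feature
    maps along the channel axis according to a predefined ratio.
    Feature maps are (C, H, W); the 50/50 split is an assumed value."""
    c_rgb = int(rgb_feat.shape[0] * ratio)   # channels drawn from the RGB branch
    c_ir = ir_feat.shape[0] - c_rgb          # remaining channels from the IR branch
    # Channel-wise concatenation: cheap compared with attention-based fusion.
    fused = np.concatenate([rgb_feat[:c_rgb], ir_feat[:c_ir]], axis=0)
    return fused  # propagated to the next extraction stage

# Multi-stage use: each stage re-fuses the two branch outputs.
rgb = np.random.rand(16, 32, 32)
ir = np.random.rand(16, 32, 32)
print(ccnf_stage(rgb, ir).shape)  # (16, 32, 32)
```

Because each stage is a plain concatenation followed by ordinary extraction blocks, the fusion itself adds essentially no parameters, which is the efficiency argument made above.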

Multiple pooling convolution downsampling

In remote sensing image processing, challenges such as cluttered backgrounds, significant variations in target sizes, low spatial resolution, and dynamic lighting conditions pose substantial obstacles to effective multi-scale feature extraction. Conventional methods often utilize convolutional (Conv) modules for downsampling. However, standard Conv modules predominantly rely on fixed-scale convolutional kernels, inherently restricting their capacity to capture multi-scale features, consequently compromising detection accuracy for targets of diverse dimensions. Although downsampling techniques facilitate the aggregation of local features while reducing computational complexity and model parameters, they simultaneously constrain the receptive field, which may result in the loss of critical spatial or contextual information.

To address the challenge of feature extraction, we propose an adaptive feature fusion module termed Multiple Pooling Convolutional Downsampling (MPCD). The MPCD first applies local average pooling to the input feature map and processes the features through two parallel branches: one branch employs a convolutional (Conv) module, while the other combines max pooling with pointwise convolution. By independently processing features through these branches, the MPCD preserves spatial details captured by the Conv module while incorporating high-frequency information from the max pooling branch. This dual-path architecture enables multi-scale feature representation and reduces computational complexity without sacrificing critical information. A structural comparison with the baseline Conv module is shown in Fig 3.

Fig 3. Schematic representation of the MPCD in CORE-Net.

https://doi.org/10.1371/journal.pone.0340499.g003

Eq (1) describes the average pooling operation used in MPCD. Let \(X \in \mathbb{R}^{C \times H \times W}\) denote the input feature map, where C, H, and W correspond to the number of channels, height, and width, respectively. The value of X at spatial position (i, j) is denoted by X(i, j), and K corresponds to the pooling kernel size (scalar or tuple).

\[X_{\mathrm{avg}}(i, j) = \frac{1}{K^2} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} X(i + m,\; j + n) \tag{1}\]

The input feature map is partitioned into two distinct sub-tensors along the channel axis.

The first sub-tensor \(X_1\) undergoes a 3×3 convolutional layer (kernel size = 3, stride = 2, padding = 1) to capture spatially localized patterns, as formalized in Eq (2). Here, \(W\) denotes the convolutional kernel weight matrix, and \(b\) is the bias term.

\[X_1' = W \ast X_1 + b \tag{2}\]

The second component employs Max Pooling Pointwise Convolution (MPPConv) to capture global information. MPPConv applies max pooling to extract the most significant features, thereby expanding the receptive field. It then integrates inter-channel features via pointwise convolution to enhance multi-scale information representation, as defined in Eq (3) and Eq (4).

\[X_2^{\max} = \mathrm{MaxPool}(X_2) \tag{3}\]

\[X_2' = W_{1 \times 1} \ast X_2^{\max} + b' \tag{4}\]

Finally, the two components are concatenated and fused to integrate local and global features effectively.

MPCD integrates both average pooling and maximum pooling operations, combining their advantages to reduce spatial data redundancy while improving computational efficiency. This integration simultaneously enlarges the receptive field and enhances the model’s ability to prioritize salient features within the target region, thereby minimizing the model’s reliance on irrelevant information. Furthermore, the component mitigates overfitting by suppressing excessive background attention during feature extraction, resulting in more robust and generalizable representations.
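The MPCD pipeline described above can be sketched as a small PyTorch module. This is an illustrative reconstruction from the text, not the released code: the average-pooling and max-pooling kernel sizes (3, size-preserving for the former), the even channel split, and the class name `MPCD` are assumptions; only the 3×3 stride-2 convolution is specified in the paper.

```python
import torch
import torch.nn as nn

class MPCD(nn.Module):
    """Multiple Pooling Convolution Downsampling (sketch).
    Local average pooling, a channel split, then two parallel branches:
    (a) 3x3 conv, stride 2, padding 1; (b) max pooling + pointwise conv
    (MPPConv). Pooling kernel sizes are assumed values."""
    def __init__(self, c_in, c_out):
        super().__init__()
        half_in, half_out = c_in // 2, c_out // 2
        self.avg = nn.AvgPool2d(3, stride=1, padding=1)   # local smoothing, size-preserving
        self.conv = nn.Conv2d(half_in, half_out, 3, stride=2, padding=1)
        self.mpp = nn.Sequential(                         # MPPConv: max pool + 1x1 conv
            nn.MaxPool2d(3, stride=2, padding=1),
            nn.Conv2d(half_in, half_out, 1),
        )

    def forward(self, x):
        x = self.avg(x)
        x1, x2 = torch.chunk(x, 2, dim=1)                 # split along the channel axis
        # Concatenate local (conv) and high-frequency (max-pool) features.
        return torch.cat([self.conv(x1), self.mpp(x2)], dim=1)
```

With these assumed strides, a 16-channel 32×32 input yields a 32-channel 16×16 output, i.e. the module halves spatial resolution like a standard downsampling conv while mixing both pooling statistics.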

Refined integration network

To enhance adaptability to targets of varying scales and orientations in complex backgrounds, this paper proposes a Refined Integration Network (RINet). RINet employs multi-branch architecture to efficiently capture key directional and multi-scale features in remote sensing images. It prioritizes edge-related and high-frequency features while optimizing computational efficiency and feature extraction accuracy.

As illustrated in Fig 4, RINet comprises three branches:

  1. The Asymmetric Edge-extended convolution (AEEC) module branch, designed to strengthen directional feature representation and emphasize edge-specific target features;
  2. The 1 × 1 bypass convolutional branch, which redistributes and calibrates channel-wise features;
  3. The inverted bottleneck structure branch, composed of a depthwise convolution followed by two 1 × 1 convolutions, to enhance hierarchical multi-scale feature extraction.

In the AEEC module, the input feature map is first replicated four times to generate four parallel intermediate branches. Each branch is padded with two pixels at its four cardinal directions (top-left, top-right, bottom-left, and bottom-right). Subsequently, each branch is independently processed by a 3 × 3 convolutional layer for feature extraction. The outputs of the four branches are then concatenated along the channel dimension, followed by a 3 × 3 convolutional layer with stride 1 to fuse spatial and channel features. This design strengthens the model’s attention to edge-related features, expands the receptive field, and integrates global context, thereby improving the detection of small targets, edge-region objects, and multi-scale objects. Consequently, the module achieves higher detection accuracy in complex backgrounds.

In the inverted bottleneck branch, the input feature map first undergoes channel-wise information fusion via a 1×1 convolutional layer while maintaining the original channel count. The fused features are then processed by a 3×3 depthwise convolutional layer (DWConv) to extract spatial details. Finally, a 1×1 convolutional layer integrates cross-channel information while reducing the channel dimension by half. This design allows the inverted bottleneck branch to supplement feature representations and attention scales for the AEEC module branch, striking a balance between model performance and computational efficiency.
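The two RINet branches with fully specified structure can be sketched as below. This is a reconstruction under stated assumptions: the directional padding order, the branch channel widths, and the class names `AEEC` and `InvertedBottleneck` are choices of ours; how the three branch outputs are finally merged is not detailed in the text, so the sketch stops at the individual branches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AEEC(nn.Module):
    """Asymmetric Edge-extended Convolution (sketch). The input is replicated
    into four branches, each padded by 2 pixels toward one corner, convolved
    3x3 without padding (so size is preserved but spatially shifted),
    concatenated, and fused by a stride-1 3x3 conv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        # F.pad order is (left, right, top, bottom): one corner per branch.
        self.pads = [(2, 0, 2, 0), (0, 2, 2, 0), (2, 0, 0, 2), (0, 2, 0, 2)]
        self.convs = nn.ModuleList(nn.Conv2d(c_in, c_in, 3) for _ in self.pads)
        self.fuse = nn.Conv2d(4 * c_in, c_out, 3, stride=1, padding=1)

    def forward(self, x):
        branches = [conv(F.pad(x, p)) for conv, p in zip(self.convs, self.pads)]
        return self.fuse(torch.cat(branches, dim=1))

class InvertedBottleneck(nn.Module):
    """RINet's inverted bottleneck branch: 1x1 conv (channel fusion, same
    width), 3x3 depthwise conv (spatial detail), then a 1x1 conv that
    halves the channel count."""
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c, c, 1),
            nn.Conv2d(c, c, 3, padding=1, groups=c),  # depthwise
            nn.Conv2d(c, c // 2, 1),
        )

    def forward(self, x):
        return self.block(x)
```

Note how the asymmetric padding plus an unpadded 3×3 conv keeps the spatial size constant while shifting the receptive field toward one corner, which is what gives AEEC its directional, edge-oriented response.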

Spatial pyramid pooling - Fast

In the CORE-Net model, a Spatial Pyramid Pooling - Fast (SPPF) [49] module is incorporated, as illustrated in Fig 5. This module hierarchically processes feature maps derived from the final backbone network layer by integrating convolutional operations with a series of max-pooling layers.

By default, three consecutive max-pooling layers are employed. Initially, the input feature map is processed through a convolutional layer to integrate cross-channel information and adjust the channel dimensionality. Subsequently, multi-scale salient features across varying receptive fields are extracted via cascaded max-pooling layers. The outputs from each pooling stage are concatenated along the channel dimension to form a comprehensive multi-scale feature representation. Finally, a concluding convolutional layer fuses these aggregated features and refines the channel dimension.
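The SPPF flow just described follows the widely used YOLOv5 formulation and can be sketched as follows; the hidden-channel width `c_in // 2` and pooling kernel size 5 are the conventional defaults, assumed here rather than taken from the paper.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast (sketch): a 1x1 conv, three cascaded
    size-preserving max-pooling layers whose outputs are concatenated,
    then a final 1x1 conv to fuse features and set the channel count."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hid = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hid, 1)
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)  # size-preserving
        self.cv2 = nn.Conv2d(4 * c_hid, c_out, 1)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)   # cascaded pools = progressively larger receptive fields
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```

Cascading a single k=5 pool three times is equivalent in receptive field to parallel 5/9/13 pools (as in the original SPP) but reuses intermediate results, which is where the "Fast" variant saves computation.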

This architecture enhances the model’s sensitivity to salient features, improves the distinction between target characteristics and background noise, and strengthens the integration of multi-scale spatial context. Furthermore, it achieves an optimal balance between computational efficiency and detection performance through hierarchical feature abstraction and parameter optimization.

Cross-stage-partial convolution with position-sensitive attention

The CORE-Net model further incorporates the Cross-Stage-Partial Convolution with Position-Sensitive Attention (C2PSA) [50] module, whose architecture is illustrated in Fig 6.

The C2PSA integrates multiple Position-Sensitive Attention (PSA) sub-modules. Each PSA employs a Multi-Head Self-Attention (MHSA) mechanism and a Feed-Forward Network (FFN) comprising two convolutional layers, thereby enhancing the model’s capability to locate and emphasize critical spatial regions. To preserve crucial information, a skip connection is adopted within the component, concatenating the original features with those refined by the PSA. Subsequently, a convolutional operation is applied to the concatenated features to facilitate cross-channel interaction and multi-level feature fusion.

By combining multi-scale convolution with a feature weighting mechanism, the C2PSA module can extract richer and more discriminative feature representations. This design strengthens spatial attention and perceptual capacity in complex environments, effectively addressing the challenge of feature extraction for tiny and occluded objects in complex, low-altitude remote sensing environments.
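A single PSA sub-module as described (MHSA over spatial positions, a two-conv feed-forward block, and skip connections) can be sketched as below. This is an illustrative approximation: the head count, hidden width, and use of additive residuals (rather than the concatenation-plus-conv wrapper of the full C2PSA) are assumptions.

```python
import torch
import torch.nn as nn

class PSA(nn.Module):
    """Position-Sensitive Attention sketch: flatten the spatial grid into a
    token sequence, apply multi-head self-attention, then a two-conv
    feed-forward block, each with a residual (skip) connection."""
    def __init__(self, c, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(c, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Conv2d(c, c * 2, 1), nn.SiLU(), nn.Conv2d(c * 2, c, 1)
        )

    def forward(self, x):
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)        # (B, H*W, C): one token per pixel
        attn_out, _ = self.attn(seq, seq, seq)
        x = x + attn_out.transpose(1, 2).view(b, c, h, w)  # attention skip
        return x + self.ffn(x)                    # feed-forward skip
```

Treating each spatial position as a token is what makes the attention "position-sensitive": the attention weights directly express which image regions support which others.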

Task head

Small objects in low-altitude remote sensing imagery frequently exhibit low contrast relative to the background environment, leading to a tendency toward missed detections in conventional approaches. To address this challenge and enhance the robustness of small object detection, the proposed CORE-Net model employs a task-specific processing head featuring a decoupled dual-branch architecture. As illustrated in Fig 7, this architecture comprises two distinct branches: one dedicated to bounding box prediction and the other to category prediction. Such a decoupled design has been demonstrated in prior studies [50] to enhance detection accuracy while mitigating interference between localization and classification tasks.

The bounding box branch employs two consecutive standard convolutional (Conv) modules for feature extraction from the input feature map. A 1×1 2D convolutional layer is incorporated to strengthen inter-channel feature interactions, while the integration of distributed focal loss (DFL) and complete intersection over union (CIoU) optimizes bounding box localization accuracy.

The classification branch employs two sequential depthwise separable convolutional layers—composed of a depthwise convolution (DWConv) followed by a pointwise convolution (PWConv)—to hierarchically extract discriminative features. These layers are succeeded by a standard 2D convolutional layer coupled with a binary cross-entropy (BCE) loss function to perform category prediction. This parameter-efficient design achieves a favorable trade-off between computational overhead and classification accuracy.
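The decoupled head layout of the two paragraphs above can be sketched as follows. This shows only the forward structure, not the DFL/CIoU and BCE loss computation; the channel widths, activation choice, and `reg_max = 16` (the usual DFL bin count) are assumed values.

```python
import torch
import torch.nn as nn

def dsconv(c_in, c_out):
    """Depthwise separable conv: a depthwise 3x3 (DWConv) followed by a
    pointwise 1x1 (PWConv), as used in the classification branch."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in),
        nn.Conv2d(c_in, c_out, 1),
    )

class DecoupledHead(nn.Module):
    """Task head sketch: a box branch (two standard convs + 1x1 conv whose
    outputs feed DFL/CIoU) and a class branch (two depthwise separable
    convs + 1x1 conv producing logits for BCE)."""
    def __init__(self, c, num_classes, reg_max=16):
        super().__init__()
        self.box = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c, 4 * reg_max, 1),      # one distribution per box side (DFL)
        )
        self.cls = nn.Sequential(
            dsconv(c, c), nn.SiLU(),
            dsconv(c, c), nn.SiLU(),
            nn.Conv2d(c, num_classes, 1),      # class logits for BCE loss
        )

    def forward(self, x):
        return self.box(x), self.cls(x)
```

Keeping the two branches parameter-disjoint is what prevents the localization and classification gradients from interfering, at the cost of a modest parameter increase that the depthwise separable classification branch partly offsets.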

Results

Datasets

The DroneVehicle dataset [51], a large-scale visible-infrared aerial benchmark developed by Tianjin University, provides 28,439 spatially aligned image pairs containing 953,087 annotated vehicle instances across five categories (Car, Truck, Bus, Van, and Freight Car). This low-altitude remote sensing dataset is partitioned into three subsets: 17,990 training pairs, 1,469 validation pairs, and 8,980 test pairs, encompassing diverse illumination conditions (daytime, dusk, night) and urban scenarios (roadways, parking facilities, residential zones). All aerial captures were acquired through drone-mounted dual-spectral cameras, with geometric alignment ensured through affine transformations and region cropping during preprocessing.

The LLVIP dataset [52], developed by Beijing University of Posts and Telecommunications researchers, focuses on low-light pedestrian detection with 15,488 registered RGB-IR pairs predominantly captured under nocturnal conditions. This specialized dataset provides 12,025 training pairs and 3,463 test pairs, maintaining temporal-spatial synchronization between modalities to ensure experimental reproducibility.

Experimental evaluations are primarily conducted using the DroneVehicle dataset, while the LLVIP dataset is employed to validate the cross-domain generalization capability of the CORE-Net framework. Fig 8(a) illustrates the category-wise instance distribution for both visible and infrared modalities within the DroneVehicle dataset, whereas Fig 8(b) details the partitioning schemes for these two benchmarks. Notably, the RGB and IR images in these datasets are strictly paired.

Fig 8. (a) Ratio of RGB to infrared image counts across different categories in the DroneVehicle dataset; (b) image partitioning details of the DroneVehicle and LLVIP datasets.

https://doi.org/10.1371/journal.pone.0340499.g008

Evaluation metrics

To assess model performance comprehensively, two primary metrics were employed for computational overhead evaluation: parameter count (Params) and floating-point operations (FLOPs). Detection capabilities were quantified using established benchmarks including precision, recall, and Average Precision (AP), with additional analysis through the mean Average Precision metrics mAP@0.5 and mAP@0.5:0.95 for multi-threshold evaluation.

The parameter count (Params) quantifies all learnable weights in the model architecture, indicating model size. Floating-point operations (FLOPs) measure the arithmetic computations required per input instance, characterizing computational complexity. Reduced parameter counts correlate with lower memory requirements, while diminished FLOPs suggest greater computational efficiency – particularly advantageous for resource-constrained deployments or real-time inference systems.

\[\mathrm{Precision} = \frac{TP}{TP + FP} \tag{5}\]

\[\mathrm{Recall} = \frac{TP}{TP + FN} \tag{6}\]

Precision and recall are calculated using Eq (5) and Eq (6). Here, True Positive (TP) and True Negative (TN) are the counts of correct positive and negative predictions, respectively; False Positive (FP) is the count of negatives incorrectly predicted as positive; and False Negative (FN) is the count of positives incorrectly predicted as negative.

\[AP = \int_{0}^{1} P(R)\, dR \tag{7}\]

The Average Precision (AP) for each class, computed as the area under its Precision-Recall (PR) curve, provides a comprehensive summary of detector performance across all confidence thresholds. The calculation method is formally presented in Eq (7).
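In practice the integral in Eq (7) is evaluated numerically from the sampled PR curve. A common all-point-interpolation implementation (the exact interpolation scheme used in the paper is not stated, so this is the standard variant) looks like:

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the PR curve with all-point interpolation: precision is
    made monotonically non-increasing, then the area is summed over the
    recall steps. Inputs are sorted by increasing recall."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]     # envelope: running max from the right
    idx = np.where(r[1:] != r[:-1])[0]           # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Two operating points: (recall 0.5, precision 1.0) and (1.0, 0.5).
print(average_precision([0.5, 1.0], [1.0, 0.5]))  # 0.75
```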

\[IoU = \frac{|B_{p} \cap B_{gt}|}{|B_{p} \cup B_{gt}|} \tag{8}\]

Intersection over Union (IoU) serves as a standard evaluation metric quantifying the spatial overlap between predicted and ground-truth bounding boxes. This metric exhibits a monotonic relationship with localization accuracy, where higher values indicate greater alignment between predicted and ground-truth regions. The mathematical formulation is formally defined in Eq (8).
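For axis-aligned boxes, Eq (8) is computed directly from the corner coordinates (function name and the `(x1, y1, x2, y2)` convention are ours):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)   # zero when boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 2x2 boxes overlapping in a 1x1 region: IoU = 1 / (4 + 4 - 1).
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```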

\[mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i} \tag{9}\]

To address the inherent ambiguity in object boundary delineation, two complementary evaluation metrics are employed: mAP@0.5 (mean average precision across all N categories at a fixed IoU threshold of 0.5) and mAP@0.5:0.95 (average precision computed over multiple IoU thresholds from 0.5 to 0.95 in 0.05 increments). These metrics jointly assess model performance in terms of localization consistency and multi-threshold generalization capability. The formal mathematical definition is provided in Eq (9).

Experimental environment and hyperparameters

The experimental setup for both the comparative and ablation studies consisted of a server equipped with an AMD EPYC 9654 CPU, 120 GB of RAM, and dual NVIDIA GeForce RTX 4090 GPUs. Detailed hardware and software specifications are provided in Table 1.

Table 1. Server-side computing environment specifications.

https://doi.org/10.1371/journal.pone.0340499.t001

The experiments on edge device deployment were conducted using an NVIDIA Jetson AGX Orin. Its core components comprise a 12-core Arm® Cortex®-A78AE CPU, 32 GB of memory, and a GPU implementing the NVIDIA Ampere architecture. Full specifications are delineated in Table 2.

Table 2. Edge computing environment specifications.

https://doi.org/10.1371/journal.pone.0340499.t002

Training hyperparameters are summarized in Table 3. To ensure experimental consistency, all trials retained identical hardware, software configurations, and hyperparameter settings.

Table 3. Hyperparameter specifications for training protocol.

https://doi.org/10.1371/journal.pone.0340499.t003

Comparative experiments

A comprehensive comparative evaluation of performance was conducted on the DroneVehicle dataset, involving the proposed CORE-Net model and a series of baseline models including LF-MDet [53], C2Former-S2ANet [54], DDCI-S2ANet [55], MDA [56], CM-YOLO-m [57], and UDDet [58].

As presented in Table 4, the proposed CORE-Net demonstrates superior performance across all evaluation metrics. Specifically, compared to LF-MDet, CORE-Net achieves accuracy improvements of 13.6% and 17.7% on the two respective metrics, while simultaneously reducing the parameter count by 95.3% and computational cost by 93.2%. Furthermore, the efficiency advantage of CORE-Net is even more pronounced when benchmarked against heavy-weight models. For instance, relative to UDDet, CORE-Net secures accuracy gains of 4.6% and 9.2%, yet requires only 3.1% of the parameters and 2.8% of the computational resources.

Table 4. Performance comparison between CORE-Net and baseline methods on the DroneVehicle dataset.

https://doi.org/10.1371/journal.pone.0340499.t004

To further assess the adaptability and generalizability of the proposed CORE-Net model, additional robustness verification experiments were conducted on the LLVIP dataset. The baseline models selected for comparison were ACDF-YOLO [45], YOLOXCPCF [59], FS-Diff [60], RTMF-Net [61], UIRGBfuse [62], DIVFusion [63], FQDNet_n [64], and Diff-IF [65].

As illustrated in Table 5, CORE-Net exhibits superior performance on the LLVIP dataset, consistent with its results on the DroneVehicle dataset, effectively outperforming all baseline models. Specifically, compared with RTMF-Net, which possesses a comparable model scale, CORE-Net achieves improvements of 0.7% and 6.5% in two accuracy metrics, respectively, while simultaneously reducing parameter count by 57.1% and computational cost by 54.3%. Notably, despite utilizing only 3.1% of the parameters and less than 0.01% of the computational resources required by the computationally intensive FS-Diff model, CORE-Net surpasses it by 3.3% and 4.6% in the respective accuracy metrics.

Table 5. Performance comparison between CORE-Net and baseline methods on the LLVIP dataset.

https://doi.org/10.1371/journal.pone.0340499.t005

The experimental findings demonstrate that CORE-Net’s design successfully reconciles high accuracy with low computational demands, resulting in a net gain in overall performance.

Ablation experiments

In this ablation study, YOLO11n serves as the baseline model. Each component and design of the proposed CORE-Net was progressively integrated into this baseline architecture, with the corresponding performance metrics on the DroneVehicle dataset reported in Table 6.

Table 6. Component ablation study for CORE-Net on the DroneVehicle dataset.

https://doi.org/10.1371/journal.pone.0340499.t006

Initially, we evaluated the individual contributions of the three core components: MPCD, RINet, and CCNF. When the MPCD module is implemented independently, accuracy metrics improve by 1.3% and 1.2%, while the parameter count and computational cost decrease by 19.2% and 15.9%, respectively. Notably, MPCD achieves the most significant reduction in parameters among the three components. The isolated implementation of the RINet module yields performance gains of 3.4% and 4.2%; however, this comes at the cost of a moderate increase in model complexity. Conversely, implementing CCNF alone results in accuracy improvements of 5.5% and 10.6%, accompanied by a 7.7% reduction in parameters and a 34.0% decrease in FLOPs. This represents the most substantial improvement in comprehensive performance, primarily attributed to the effective fusion and utilization of multi-modal features. These experiments substantiate the individual effectiveness of CORE-Net’s core components.

Subsequently, we conducted experiments involving pairwise combinations and the complete integration of MPCD, RINet, and CCNF. Across all combination settings, pairwise integrations consistently outperform the individual implementations of their respective components in terms of accuracy. Moreover, the simultaneous implementation of all components achieves superior accuracy compared to any partial integration configuration. This phenomenon demonstrates the synergistic efficacy and mutual reinforcement among the core components of CORE-Net.

In summary, this systematic evaluation validates both the independent validity and the complementary interaction of the CORE-Net architectural elements.

Edge device deployment experiments

The deployment experiment on edge devices evaluated the proposed CORE-Net against the YOLO11 series baseline models (recognized for real-time performance) using the DroneVehicle dataset.

Experimental results summarized in Table 7 indicate that CORE-Net achieves a 28.1% reduction in computational latency compared to the Baseline-m variant while simultaneously improving accuracy metrics by 2.4% and 3.8%. When compared to the highest-accuracy Baseline-x model, CORE-Net further reduces computational latency by 68.6% with additional accuracy gains of 2.5% and 3.2%. These findings demonstrate that CORE-Net effectively balances accuracy and computational efficiency, showing stronger suitability for high-precision real-time tasks on resource-constrained edge devices.

Table 7. Performance comparison between CORE-Net and baseline methods on the DroneVehicle dataset under edge deployment constraints.

https://doi.org/10.1371/journal.pone.0340499.t007

Results visualization

Experimental results are comprehensively presented via Precision-Recall (PR) curves, confusion matrices, feature heatmaps, and qualitative visualizations to demonstrate the overall performance superiority of CORE-Net over the baseline model. To ensure a fair and consistent comparison, we adhere to the baseline settings established in the ablation study, thereby explicitly attributing the performance gains to the novel architectural components and structural innovations of CORE-Net.

The PR curve serves as a pivotal metric for evaluating model performance, particularly in scenarios characterized by class imbalance. It elucidates the trade-off between precision and recall across varying decision thresholds, facilitating both intuitive interpretation and rigorous quantitative analysis. Typically, precision exhibits a downward trend as recall increases; lowering the classification threshold captures more True Positives (TP) but inevitably introduces additional False Positives (FP). A superior model demonstrates a marginal decay in precision as recall increases, whereas a suboptimal model suffers a precipitous decline, indicating limited discriminative capability between positive and negative samples. Consequently, a curve approaching the top-right corner—corresponding to a larger Area Under the Curve (AUC)—signifies superior overall performance and an effective equilibrium between precision and recall.
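The threshold behavior described above can be reproduced on toy data by sweeping a confidence cutoff over scored predictions; a minimal sketch (the scores and match labels below are invented):

```python
def pr_at_threshold(scores, labels, thresh, n_positives):
    """Precision and recall when predictions with score >= thresh are kept.

    `labels[i]` is True if prediction i matches a ground-truth object.
    """
    tp = sum(1 for s, l in zip(scores, labels) if s >= thresh and l)
    fp = sum(1 for s, l in zip(scores, labels) if s >= thresh and not l)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / n_positives
    return precision, recall

scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [True, True, False, True, False]
# Lowering the cutoff raises recall but lowers precision.
for t in (0.85, 0.65, 0.45):
    print(t, pr_at_threshold(scores, labels, t, n_positives=3))
```

Tracing these (precision, recall) pairs over a fine grid of thresholds is exactly what produces the PR curve discussed above.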

Fig 9 illustrates the comparative PR curves for (a) the baseline model and (b) the proposed CORE-Net. With the aid of auxiliary reference lines, the all-category PR curve of the baseline is observed to intersect the line y = x at approximately (0.70, 0.70), whereas the CORE-Net curve intersects it at approximately (0.75, 0.75). This shift demonstrates that the CORE-Net curve lies consistently closer to the ideal top-right corner and yields a higher AUC value. This indicates that CORE-Net maintains high precision even as recall improves, displaying a superior capability to distinguish true positives from background noise while effectively suppressing false positives. These results confirm that CORE-Net achieves a more favorable precision-recall trade-off, thereby outperforming the baseline model in overall detection performance.

Fig 9. Precision-recall curves for CORE-Net and baseline methods on the DroneVehicle dataset.

https://doi.org/10.1371/journal.pone.0340499.g009

The confusion matrix serves as a foundational evaluation tool for assessing classification model performance. This tabular visualization presents predicted and actual class labels in parallel alignment, conventionally with columns denoting true classes and rows indicating predicted classes (or inversely arranged). Each cell contains the count of samples corresponding to a unique actual-predicted class pair. Diagonal entries correspond to correct classifications, whereas off-diagonal elements quantify classification errors.

Fig 10 displays normalized confusion matrices for the (a) baseline and (b) CORE-Net models. Each matrix column is normalized to visualize prediction distributions across categories while mitigating class imbalance bias. Diagonal elements (running from top-left to bottom-right) denote correct classifications, with off-diagonal elements representing misclassifications. Specifically, the top row of off-diagonal elements indicates false negatives where true positives are misclassified as background, while the rightmost column corresponds to false positives where background regions are erroneously classified as target categories. The remaining off-diagonal elements reflect inter-class confusion errors. The CORE-Net matrix exhibits intensified diagonal dominance and suppressed off-diagonal values compared to the baseline, demonstrating enhanced classification performance through reduced error propagation.
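The column normalization applied in Fig 10 amounts to dividing each cell by its column sum; a minimal sketch on an invented 3×3 count matrix:

```python
def normalize_columns(matrix):
    """Divide each cell by its column sum so every column sums to 1."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    col_sums = [sum(matrix[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [[matrix[r][c] / col_sums[c] if col_sums[c] else 0.0
             for c in range(n_cols)]
            for r in range(n_rows)]

# Rows: predicted class; columns: true class (toy counts, not paper data).
cm = [[8, 1, 0],
      [2, 7, 1],
      [0, 2, 9]]
norm = normalize_columns(cm)
print([round(norm[i][i], 2) for i in range(3)])  # per-class recall on the diagonal
```

With true classes along the columns, each diagonal entry of the normalized matrix is the fraction of that class's samples predicted correctly, which is why diagonal dominance indicates good per-class performance regardless of class frequency.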

Fig 10. Confusion matrices for CORE-Net and baseline methods on the DroneVehicle dataset.

https://doi.org/10.1371/journal.pone.0340499.g010

Grad-CAM [66] is utilized to produce visual heatmaps, facilitating a comparative analysis of feature attention patterns across the evaluated models. The resulting heatmaps illustrate the regions of interest upon which the model focuses during object detection, with gradations of color intensity indicating the degree of attention allocated to different areas. By comparing these heatmaps against the ground-truth target regions, it is possible to evaluate whether the model effectively directs its attention to the relevant objects in the image.
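Grad-CAM weights each activation channel by its spatially averaged gradient and passes the weighted sum through a ReLU; the following framework-free sketch illustrates that computation on invented toy maps (it is not CORE-Net code):

```python
def grad_cam(activations, gradients):
    """Grad-CAM over K channel maps, each given as an HxW nested list.

    Channel weights are the spatially averaged gradients; the heatmap is
    the ReLU of the weighted channel sum.
    """
    h, w = len(activations[0]), len(activations[0][0])
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    return [[max(0.0, sum(wk * a[i][j] for wk, a in zip(weights, activations)))
             for j in range(w)]
            for i in range(h)]

acts = [[[1.0, 0.0], [0.0, 2.0]],        # channel 0 activations
        [[0.0, 1.0], [1.0, 0.0]]]        # channel 1 activations
grads = [[[0.5, 0.5], [0.5, 0.5]],       # weight 0.5 for channel 0
         [[-1.0, -1.0], [-1.0, -1.0]]]   # weight -1.0 (ReLU suppresses it)
print(grad_cam(acts, grads))
```

Negatively weighted channels are zeroed by the ReLU, which is what makes the resulting heatmap highlight only evidence in favor of the predicted class.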

Fig 11 presents a comparative visualization of several randomly selected samples from the DroneVehicle dataset, illustrating: (a) feature activations of the baseline model, (b) salient regions identified by the proposed CORE-Net, and (c) corresponding infrared frames that serve as visual references for target feature comparison. The baseline model does attend to target regions; however, it fails to adequately differentiate between densely distributed targets, leading to suboptimal generalization in such scenarios and resulting in target confusion or missed detections. In contrast, for clusters of small and dense targets, the feature activations generated by CORE-Net demonstrate finer granularity and more clearly delineated region boundaries. It is noteworthy, however, that despite CORE-Net’s effective feature discrimination, residual background activations persist, which may cause certain areas to be erroneously overemphasized. This issue may stem from feature similarities between annotated targets and environmental elements.

Fig 11. Heatmap comparison on DroneVehicle dataset samples: (a) baseline method, (b) CORE-Net, and (c) corresponding infrared frames.

https://doi.org/10.1371/journal.pone.0340499.g011

A subset of samples from the DroneVehicle dataset was randomly selected for visual comparison. As illustrated in Fig 12, the baseline model predictions (a), CORE-Net predictions (b), and infrared reference frames overlaid with CORE-Net detections (c) are systematically presented. Comparative analysis reveals that CORE-Net demonstrates enhanced robustness against ambient lighting variations and cluttered backgrounds. Specifically, the proposed model achieves superior detection accuracy for distant small targets, low-light objects, and densely clustered or occluded instances compared to the baseline, exhibiting higher correct detection rates in these challenging scenarios. These improvements can be attributed to the algorithm’s effective fusion of multi-spectral features.

Fig 12. Detection result comparison on DroneVehicle dataset samples: (a) baseline method, (b) CORE-Net, and (c) infrared frames overlaid with CORE-Net detections.

https://doi.org/10.1371/journal.pone.0340499.g012

Discussion

The experimental results demonstrate that the proposed CORE-Net model effectively balances high detection accuracy with low computational cost for low-altitude multispectral object detection. It achieves this by leveraging a dual-branch architecture and a streamlined Cross-modal Concatenation Network Framework (CCNF), which uses simple channel concatenation instead of complex fusion modules to integrate RGB and IR features efficiently.

Ablation studies confirm the individual and synergistic contributions of the core components. The MPCD module significantly enhances multi-scale feature representation and reduces spatial redundancy, leading to a notable decrease in model parameters. The RINet strengthens the model’s capacity to discern targets across varying scales and orientations, particularly in cluttered backgrounds. The CCNF itself was instrumental in achieving substantial gains in accuracy with reduced FLOPs, demonstrating that efficient cross-modal integration can be realized without resorting to complex gating mechanisms. The collective integration of these components within CORE-Net yields superior performance on both the DroneVehicle and LLVIP benchmarks, underscoring the model’s robustness and generalizability across different low-altitude sensing tasks.

Qualitative analyses, including Grad-CAM visualizations and detection result comparisons, provide further evidence of the model’s advanced capabilities. CORE-Net focuses more precisely on small and dense objects than baselines, reducing missed detections in challenging conditions like low light and occlusion. Nevertheless, the visualizations also reveal a limitation: the model occasionally allocates non-trivial attention to non-salient background regions. This phenomenon indicates a potential avenue for future refinement, wherein feature prioritization mechanisms could be further optimized to suppress irrelevant contextual information without compromising the detection of genuine targets.

Conclusions

This paper presents CORE-Net, a novel network for efficient and accurate RGB-IR object detection in low-altitude remote sensing. The model successfully addresses key challenges in multispectral fusion through a dual-branch architecture and a simple yet effective Cross-modal Concatenation Network Framework (CCNF).

Supported by the MPCD and RINet modules, CORE-Net achieves superior detection performance on standard benchmarks while maintaining significantly lower computational complexity than existing methods. The model’s efficiency is further validated through deployment experiments on edge devices, confirming its suitability for real-time, resource-constrained applications.

In conclusion, CORE-Net establishes a streamlined and powerful paradigm for multispectral object detection, effectively balancing high precision with low computational overhead. Future work will focus on augmenting the model’s feature discrimination capabilities to reduce background interference and exploring self-supervised learning strategies to leverage unlabeled multispectral data for further performance enhancement.

Supporting information

S1 File. The DroneVehicle and LLVIP datasets are publicly available via https://github.com/VisDrone/DroneVehicle and https://github.com/bupt-ai-cz/LLVIP, respectively.

https://doi.org/10.1371/journal.pone.0340499.s001

(DOCX)

S2 File. The CORE-Net implementation and source code are accessible at https://github.com/DaozeTang/CORE-Net.

https://doi.org/10.1371/journal.pone.0340499.s002

(DOCX)

Acknowledgments

Gratitude is extended to all contributors who supported this study through their valuable assistance and insights.

References

1. Zhang X, Zhang T, Wang G, Zhu P, Tang X, Jia X, et al. Remote sensing object detection meets deep learning: A metareview of challenges and advances. IEEE Geosci Remote Sens Mag. 2023;11(4):8–44.
2. Lv Z, Zhang P, Sun W, Benediktsson JA, Li J, Wang W. Novel adaptive region spectral–spatial features for land cover classification with high spatial resolution remotely sensed imagery. IEEE Trans Geosci Remote Sensing. 2023;61:1–12.
3. Kim S, Hong SH, Kim H, Lee M, Hwang S. Small object detection (SOD) system for comprehensive construction site safety monitoring. Automat Construct. 2023;156:105103.
4. Yao X, Feng X, Han J, Cheng G, Guo L. Automatic weakly supervised object detection from high spatial resolution remote sensing images via dynamic curriculum learning. IEEE Trans Geosci Remote Sensing. 2021;59(1):675–85.
5. Bin J, Zhang R, Wang R, Cao Y, Zheng Y, Blasch E, et al. An efficient and uncertainty-aware decision support system for disaster response using aerial imagery. Sensors (Basel). 2022;22(19):7167. pmid:36236263
6. Dollar P, Appel R, Belongie S, Perona P. Fast feature pyramids for object detection. IEEE Trans Pattern Anal Mach Intell. 2014;36(8):1532–45.
7. Girshick R. Fast R-CNN. In: 2015 IEEE international conference on computer vision (ICCV); 2015. p. 1440–8.
8. He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2015;37(9):1904–16. pmid:26353135
9. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, et al. SSD: Single shot MultiBox detector. In: Leibe B, Matas J, Sebe N, Welling M, editors. Computer vision – ECCV 2016. Cham: Springer International Publishing; 2016. p. 21–37.
10. Lin T-Y, Goyal P, Girshick R, He K, Dollar P. Focal loss for dense object detection. In: 2017 IEEE international conference on computer vision (ICCV); 2017. p. 2999–3007. https://doi.org/10.1109/iccv.2017.324
11. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: Unified, real-time object detection. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR); 2016. p. 779–88.
12. Wang A, Chen H, Liu L, Chen K, Lin Z, Han J, et al. YOLOv10: Real-time end-to-end object detection. In: Proceedings of the 38th international conference on neural information processing systems. NIPS ’24. Red Hook, NY, USA: Curran Associates Inc.; 2024. p. 3429.
13. Tian Y, Ye Q, Doermann D. YOLOv12: Attention-centric real-time object detectors; 2025. https://arxiv.org/abs/2502.12524
14. Nie J, Wang C, Yu S, Shi J, Lv X, Wei Z. MIGN: Multiscale image generation network for remote sensing image semantic segmentation. IEEE Trans Multimedia. 2023;25:5601–13.
15. Liu W, Quijano K, Crawford MM. YOLOv5-Tassel: Detecting tassels in RGB UAV imagery with improved YOLOv5 based on transfer learning. IEEE J Sel Top Appl Earth Observations Remote Sensing. 2022;15:8085–94.
16. Lu W, Lan C, Niu C, Liu W, Lyu L, Shi Q, et al. A CNN-transformer hybrid model based on CSWin transformer for UAV image object detection. IEEE J Sel Top Appl Earth Observ Remote Sensing. 2023;16:1211–31.
17. Liu M, Hu Q, Wang C, Tian T, Chen W. Daff-Net: Dual attention feature fusion network for aircraft detection in remote sensing images. In: 2021 IEEE international geoscience and remote sensing symposium IGARSS; 2021. p. 4196–9.
18. Li N, Wei D. IR-ADMDet: An anisotropic dynamic-aware multi-scale network for infrared small target detection. Remote Sensing. 2025;17(10):1694.
19. Pan L, Liu T, Cheng J, Cheng B, Cai Y. AIMED-Net: An enhancing infrared small target detection net in UAVs with multi-layer feature enhancement for edge computing. Remote Sensing. 2024;16(10):1776.
20. Wang Q, Zhou L, Xu C, Shang Y, Jin P, Cao C, et al. Progress and perspectives on UAV visual object tracking. IEEE J Sel Top Appl Earth Observ Remote Sensing. 2025;18:20214–39.
21. Muzammul M, Li X. Comprehensive review of deep learning-based tiny object detection: Challenges, strategies, and future directions. Knowl Inf Syst. 2025;67(5):3825–913.
22. Wang H, Wang H, Han F. Detection of low-altitude infrared small targets for UAVs using a density-based artificial bee colony algorithm. Sci Rep. 2025;15(1):23344. pmid:40604036
23. Huang Y, Qu J, Wang H, Yang J. An all-time detection algorithm for UAV images in urban low altitude. Drones. 2024;8(7):332.
24. Nguyen PT, Nguyen LH. YOLOv11n-UAV: Improved YOLOv11n model for detecting small UAVs using infrared images on complex backgrounds. Neural Comput Applic. 2025;37(22):17231–47.
25. Ma D, Su J, Li S, Xian Y. AerialIRGAN: Unpaired aerial visible-to-infrared image translation with dual-encoder structure. Sci Rep. 2024;14(1):22105. pmid:39333306
26. Zhou K, Chen L, Cao X. Improving multispectral pedestrian detection by addressing modality imbalance problems. In: Vedaldi A, Bischof H, Brox T, Frahm JM, editors. Computer vision – ECCV 2020. Cham: Springer International Publishing; 2020. p. 787–803.
27. Yang X, Qian Y, Zhu H, Wang C, Yang M. BAANet: Learning Bi-directional adaptive attention gates for multispectral pedestrian detection. In: 2022 International conference on robotics and automation (ICRA); 2022. p. 2920–6. https://doi.org/10.1109/icra46639.2022.9811999
28. Hosseinpour H, Samadzadegan F, Javan FD. CMGFNet: A deep cross-modal gated fusion network for building extraction from very high-resolution remote sensing images. ISPRS J Photogrammetry Remote Sensing. 2022;184:96–115.
29. Sun Y, Fu Z, Sun C, Hu Y, Zhang S. Deep multimodal fusion network for semantic segmentation using remote sensing image and LiDAR data. IEEE Trans Geosci Remote Sensing. 2022;60:1–18.
30. Zhang Y, Ye M, Zhu G, Liu Y, Guo P, Yan J. FFCA-YOLO for small object detection in remote sensing images. IEEE Trans Geosci Remote Sensing. 2024;62:1–15.
31. Zhao Z, Zhang W, Xiao Y, Li C, Tang J. Reflectance-guided progressive feature alignment network for all-day UAV object detection. IEEE Trans Geosci Remote Sensing. 2025;63:1–15.
32. Zhang Z, Dong W, Bai J, Wang C, Qiu J, Sun D, et al. UAV perspective small object detection with RGB-IR fusion perception. J Shanghai Jiaotong Univ (Sci). 2025.
33. Li W, Sun K, Li W, Wei J, Miao S, Gao S, et al. Aligning semantic distribution in fusing optical and SAR images for land use classification. ISPRS J Photogrammetry Remote Sensing. 2023;199:272–88.
34. Dong W, Zhu H, Lin S, Luo X, Shen Y, Guo G, et al. Fusion-mamba for cross-modality object detection. IEEE Trans Multimedia. 2025;27:7392–406.
35. Li W, Chen Y, Hu K, Zhu J. Oriented RepPoints for aerial object detection. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR); 2022. p. 1819–28. https://doi.org/10.1109/cvpr52688.2022.00187
36. Zhang Z. Drone-YOLO: An efficient neural network method for target detection in drone images. Drones. 2023;7(8):526.
37. Li Y, Li X, Dai Y, Hou Q, Liu L, Liu Y, et al. LSKNet: A foundation lightweight backbone for remote sensing. Int J Comput Vis. 2024;133(3):1410–31.
38. Bi Q, Qin K, Zhang H, Xia G-S. Local semantic enhanced ConvNet for aerial scene recognition. IEEE Trans Image Process. 2021;30:6498–511. pmid:34236963
39. Wu Z, Wu D, Li N, Chen W, Yuan J, Yu X, et al. CBGS-YOLO: A lightweight network for detecting small targets in remote sensing images based on a double attention mechanism. Remote Sensing. 2024;17(1):109.
40. Doherty J, Gardiner B, Kerr E, Siddique N. BiFPN-YOLO: One-stage object detection integrating bi-directional feature pyramid networks. Pattern Recognit. 2025;160:111209.
41. Qingyun F, Zhaokui W. Cross-modality attentive feature fusion for object detection in multispectral remote sensing imagery. Pattern Recognit. 2022;130:108786.
42. Wang Z, Li S, Huang K. Cross-modal adaptation for object detection in infrared remote sensing imagery. IEEE Geosci Remote Sensing Lett. 2025;22:1–5.
43. Gomez-Chova L, Tuia D, Moser G, Camps-Valls G. Multimodal classification of remote sensing images: A review and future directions. Proc IEEE. 2015;103(9):1560–84.
44. Zhang J, Lei J, Xie W, Fang Z, Li Y, Du Q. SuperYOLO: Super resolution assisted object detection in multimodal remote sensing imagery. IEEE Trans Geosci Remote Sensing. 2023;61:1–15.
45. Fei X, Guo M, Li Y, Yu R, Sun L. ACDF-YOLO: Attentive and cross-differential fusion network for multimodal remote sensing object detection. Remote Sensing. 2024;16(18):3532.
46. Sharma M, Dhanaraj M, Karnam S, Chachlakis DG, Ptucha R, Markopoulos PP, et al. YOLOrs: Object detection in multimodal remote sensing imagery. IEEE J Sel Top Appl Earth Observ Remote Sensing. 2021;14:1497–508.
47. Wang H, Wang C, Fu Q, Si B, Zhang D, Kou R, et al. YOLOFIV: Object detection algorithm for around-the-clock aerial remote sensing images by fusing infrared and visible features. IEEE J Sel Top Appl Earth Observ Remote Sensing. 2024;17:15269–87.
48. Xie J, Nie J, Ding B, Yu M, Cao J. Cross-modal local calibration and global context modeling network for RGB–infrared remote-sensing object detection. IEEE J Sel Top Appl Earth Observ Remote Sensing. 2023;16:8933–42.
49. Jocher G. Ultralytics YOLOv5; 2020. https://github.com/ultralytics/yolov5
50. Jocher G, Qiu J. Ultralytics YOLO11. https://github.com/ultralytics/ultralytics
51. Sun Y, Cao B, Zhu P, Hu Q. Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE Trans Circuits Syst Video Technol. 2022;32(10):6700–13.
52. Jia X, Zhu C, Li M, Tang W, Zhou W. LLVIP: A visible-infrared paired dataset for low-light vision. In: 2021 IEEE/CVF international conference on computer vision workshops (ICCVW); 2021. p. 3489–97.
53. Sun X, Yu Y, Cheng Q. Low-rank multimodal remote sensing object detection with frequency filtering experts. IEEE Trans Geosci Remote Sensing. 2024;62:1–14.
54. Yuan M, Wei X. C2Former: Calibrated and complementary transformer for RGB-infrared object detection. IEEE Trans Geosci Remote Sensing. 2024;62:1–12.
55. Bao W, Huang M, Hu J, Xiang X. Dual-dynamic cross-modal interaction network for multimodal remote sensing object detection. IEEE Trans Geosci Remote Sensing. 2025;63:1–13.
56. Zhu J, Zhang H, Li S, Wang S, Ma H. Cross teaching-enhanced multispectral remote sensing object detection with transformer. IEEE J Sel Top Appl Earth Observ Remote Sensing. 2025;18:2401–13.
57. Zhao C, Mo B, Zhao J, Tao Y, Zhao D. CMIFDF: A lightweight cross-modal image fusion and weight-sharing object detection network framework. Infrared Phys Technol. 2025;145:105631.
58. Sun X, Yu Y, Cheng Q. Unified diffusion-based object detection in multi-modal and low-light remote sensing images. Electron Lett. 2024;60(22).
59. Hu S, Bonardi F, Bouchafa S, Prendinger H, Sidibé D. Rethinking self-attention for multispectral object detection. IEEE Trans Intell Transport Syst. 2024;25(11):16300–11.
60. Jie Y, Xu Y, Li X, Zhou F, Lv J, Li H. FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution. Inform Fusion. 2025;121:103146.
61. Wei X, Li Z, Wang Y, Zhu S. RTMF-Net: A dual-modal feature-aware fusion network for dense forest object detection. Sensors. 2025;25(18):5631.
62. Yi S, Guo S, Chen M, Wang J, Jia Y. UIRGBfuse: Revisiting infrared and visible image fusion from the unified fusion of infrared channel with R, G, and B channels. Infrared Phys Technol. 2024;143:105626.
63. Tang L, Xiang X, Zhang H, Gong M, Ma J. DIVFusion: Darkness-free infrared and visible image fusion. Inform Fusion. 2023;91:477–93.
64. Meng F, Hong A, Tang H, Tong G. FQDNet: A fusion-enhanced quad-head network for RGB-infrared object detection. Remote Sensing. 2025;17(6):1095.
65. Yi X, Tang L, Zhang H, Xu H, Ma J. Diff-IF: Multi-modality image fusion via diffusion model with fusion knowledge prior. Inform Fusion. 2024;110:102450.
66. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: 2017 IEEE international conference on computer vision (ICCV); 2017. p. 618–26. https://doi.org/10.1109/iccv.2017.74