Fig 1.
Overall architecture of the proposed LDA-DETR.
The framework comprises three components: (1) Backbone, a lightweight feature extraction backbone with SRFM modules that reduce redundancy while preserving essential spatial cues; (2) Neck, a multi-scale fusion pathway that combines DMSFM for receptive-field aggregation with AEFN for enhancing interactions between low-level details and high-level semantics; (3) Head, in which initial object queries are selected via an IoU-aware mechanism to initialize the decoder, followed by iterative refinement of bounding boxes and confidence scores through an auxiliary prediction head.
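The IoU-aware query selection described for the Head can be sketched generically: the decoder is initialized from the top-K encoder tokens ranked by a classification score trained to reflect localization quality. This is a minimal illustrative sketch, not the paper's implementation; all names and shapes are assumptions.

```python
import numpy as np

def select_topk_queries(enc_feats, scores, k):
    # enc_feats: (N, D) encoder output tokens; scores: (N,) IoU-aware
    # classification scores. Pick the k highest-scoring tokens as the
    # initial decoder queries (illustrative only).
    idx = np.argpartition(-scores, k - 1)[:k]   # unordered top-k indices
    idx = idx[np.argsort(-scores[idx])]         # sort selected by score, descending
    return enc_feats[idx], idx

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 16))   # hypothetical encoder tokens
scores = rng.random(100)                 # hypothetical IoU-aware scores
queries, idx = select_topk_queries(feats, scores, 10)
```

In the actual model the selected tokens would seed the decoder's iterative box and score refinement; here they are simply returned.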
Fig 2.
Comparison between RT-DETR and LDA-DETR.
The upper part illustrates the overall processing pipeline, while the lower part highlights the modifications in LDA-DETR: (1) Backbone: SRFM modules (yellow) replace ResNet-18 blocks for lightweight feature extraction; (2) Neck: DMSFM (pink) replaces the baseline fusion unit for multi-scale receptive-field aggregation; (3) Detail enhancement: AEFN with CHM (green) replaces PAN to improve shallow-feature interaction and small-object detection.
Fig 3.
Structure of the proposed lightweight feature extraction backbone (LFEB).
The network begins with a Conv Stem Block for low-level feature extraction, followed by four Sparse Residual Feature Module (SRFM) stages. Spatial downsampling between stages is performed by convolutions with a stride of 2.
Fig 4.
Architecture of the proposed dynamic multi-scale fusion module (DMSFM).
It employs parallel convolution and pooling branches with diverse receptive fields, aggregated by element-wise branch addition to enhance multi-scale feature representation.
Fig 5.
Structural reparameterization operations in DMSFM.
Six transformations are applied to unify multi-branch structures into equivalent convolutions.
Fig 6.
Inference-time transformation of DMSFM.
It converts six training-time branches (B1–B6) into a single equivalent convolutional layer via structural reparameterization.
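The branch-merging idea behind this reparameterization can be illustrated with a minimal single-channel sketch (two branches instead of six; all shapes and kernels here are illustrative assumptions, not the paper's configuration): by linearity of convolution, a 3×3 branch and a 1×1 branch can be fused by zero-padding the 1×1 kernel to 3×3 and adding the kernels, so inference needs only one convolution.

```python
import numpy as np

def conv2d(x, w, pad):
    # Naive single-channel 2D cross-correlation with stride 1 and zero
    # padding `pad`; enough to demonstrate the equivalence.
    xp = np.pad(x, pad)
    k = w.shape[0]
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

rng = np.random.default_rng(0)
w3 = rng.standard_normal((3, 3))   # a 3x3 conv branch
w1 = rng.standard_normal((1, 1))   # a 1x1 conv branch
x = rng.standard_normal((6, 6))

# Training-time multi-branch output: sum of the two branch outputs.
y_multi = conv2d(x, w3, pad=1) + conv2d(x, w1, pad=0)

# Inference-time reparameterization: zero-pad the 1x1 kernel to 3x3 and
# add the kernels, giving one equivalent 3x3 convolution (by linearity).
w_merged = w3 + np.pad(w1, 1)
y_single = conv2d(x, w_merged, pad=1)
```

The same kernel-padding-and-summation argument extends to the six training-time branches shown in the figure, with pooling and batch-norm branches first rewritten as equivalent convolutions.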
Fig 7.
Structure of the proposed attention-enhanced fusion network (AEFN).
A channel attention module (CHM) is integrated into the PAN pathway to strengthen shallow features, improve cross-scale interactions, and preserve fine-grained details for small-object detection.
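The caption does not specify CHM's internal design; a generic squeeze-and-excitation-style channel attention block (a hedged sketch under that assumption, with illustrative shapes) reweights channels as follows:

```python
import numpy as np

def channel_attention(x, w1, w2):
    # x: (C, H, W) feature map. Generic SE-style channel attention:
    # squeeze (global average pool) -> two FC layers -> sigmoid gate
    # -> per-channel reweighting. Illustrative only, not the paper's CHM.
    s = x.mean(axis=(1, 2))                 # (C,) channel descriptor
    h = np.maximum(0.0, w1 @ s)             # ReLU, reduced dimension
    g = 1.0 / (1.0 + np.exp(-(w2 @ h)))     # (C,) gates in (0, 1)
    return x * g[:, None, None]             # reweight each channel

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))   # hypothetical shallow feature map
w1 = rng.standard_normal((2, 8))     # reduction ratio 4 (assumption)
w2 = rng.standard_normal((8, 2))
y = channel_attention(x, w1, w2)
```

Because the gates lie in (0, 1), the block can only attenuate channels; learning pushes the gates toward 1 for detail-bearing shallow channels that matter for small objects.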
Fig 8.
Representative images from the four datasets:
(a) URPC2020–underwater scenes with dense marine organisms under low contrast; (b) RSOD–aerial imagery containing aircraft and industrial targets; (c) NWPU VHR-10–remote sensing scenes with small artificial structures; (d) VisDrone-DET–UAV imagery of urban and suburban scenes containing diverse small object categories.
Table 1.
Ablation experiments on the URPC2020 dataset.
Table 2.
Ablation experiments on the NWPU VHR-10 dataset.
Table 3.
Ablation experiments on the VisDrone-DET dataset.
Table 4.
Hyperparameter analysis of the PConv ratio in SRFM on the URPC2020 dataset.
Table 5.
Hyperparameter analysis of branch-weight settings in DMSFM on the URPC2020 dataset.
Table 6.
Performance comparison by category on the RSOD dataset.
Table 7.
Comparative experiments on the RSOD dataset.
Table 8.
Performance comparison by category on the NWPU VHR-10 dataset.
Table 9.
Comparative experiments on the NWPU VHR-10 dataset.
Table 10.
Comparative experiments on the URPC2020 dataset.
Table 11.
Comparative experiments on the VisDrone-DET dataset.
Fig 9.
Confusion matrices of LDA-DETR on four datasets.
Fig 10.
Training performance comparison of LDA-DETR, RT-DETR-r18, and three YOLO variants (YOLOv10-m, YOLOv11-m, and YOLOv13-l) on the NWPU VHR-10 dataset: (a) mAP@0.5; (b) mAP@0.5:0.95.
Fig 11.
Attention heatmap comparisons between RT-DETR-r18 (second column) and LDA-DETR (third column) on the NWPU VHR-10 dataset.
The first column shows the original images. Warmer colors (e.g., red and yellow) denote higher attention weights. LDA-DETR exhibits more concentrated and accurately localized activations compared with the more diffuse responses of RT-DETR-r18, particularly for small or densely distributed targets.
Fig 12.
Visualization of attention heatmaps showing the incremental contributions of LFEB, DMSFM, and AEFN to LDA-DETR.
(a) Original image. (b) Baseline RT-DETR (ResNet-18). (c) +LFEB. (d) +DMSFM. (e) +AEFN (full LDA-DETR). These progressive visualizations illustrate how each module enhances small-object localization and background suppression on the NWPU VHR-10 dataset.
Fig 13.
Qualitative detection results on NWPU VHR-10 and URPC2020.
Left: baseline; right: LDA-DETR. White ovals mark missed detections. (a–b) NWPU VHR-10: LDA-DETR detects a bridge and a tennis court missed by RT-DETR-r18; (c–d) URPC2020: LDA-DETR detects a holothurian and a starfish missed by the baseline.
Fig 14.
Representative failure cases of the proposed method across four datasets: (a) RSOD, (b) VisDrone-DET, (c) NWPU VHR-10, and (d) URPC2020.
Purple boxes and arrows highlight missed detections of extremely small or occluded targets (e.g., tiny aircraft, pedestrians) and false activations of background structures or visually confusing categories (e.g., circular landscaping, kiosks, docks, seaweed, and rocks).