Fig 1.

Overall architecture of the proposed LDA-DETR.

The framework comprises three components: (1) Backbone, a lightweight feature extraction backbone with SRFM modules to reduce redundancy while preserving essential spatial cues; (2) Neck, a multi-scale fusion pathway that combines DMSFusion for receptive field aggregation with AEFN for enhancing interactions between low-level details and high-level semantics; and (3) Head, where initial object queries are selected via an IoU-aware mechanism to initialize the decoder, followed by iterative refinement of bounding boxes and confidence scores through an auxiliary prediction head.
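For illustration, the IoU-aware selection step can be sketched in PyTorch-style code (a minimal sketch; the function, tensor names, and shapes are assumptions rather than the authors' implementation): encoder tokens are ranked by a classification score supervised with an IoU-aware target, and the top-K features initialize the decoder queries.

```python
import torch

def select_initial_queries(enc_feats, cls_logits, num_queries=300):
    """Hypothetical sketch of IoU-aware query selection: rank flattened
    encoder tokens by a classification score supervised with an IoU-aware
    target, then take the top-K features as initial decoder queries.

    enc_feats:  (B, N, C) flattened encoder memory
    cls_logits: (B, N, num_classes) per-token classification logits
    """
    scores = cls_logits.sigmoid().max(dim=-1).values           # (B, N)
    topk_idx = scores.topk(num_queries, dim=1).indices         # (B, K)
    gather_idx = topk_idx.unsqueeze(-1).expand(-1, -1, enc_feats.size(-1))
    return enc_feats.gather(1, gather_idx)                     # (B, K, C)
```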

Fig 2.

Comparison between RT-DETR and LDA-DETR.

The upper part illustrates the overall processing pipeline, while the lower part highlights the modifications in LDA-DETR: (1) Backbone: SRFM modules (yellow) replace ResNet-18 blocks for lightweight feature extraction; (2) Neck: DMSFusion (pink) replaces the baseline fusion unit for multi-scale receptive field aggregation; (3) Detail enhancement: AEFN with CHM (green) replaces PAN to improve shallow interaction and small object detection.

Fig 3.

Structure of the proposed lightweight feature extraction backbone (LFEB).

The network begins with a Conv Stem Block for low-level feature extraction, followed by four Sparse Residual Feature Module (SRFM) stages. Spatial downsampling is performed between stages using stride-2 convolutions.
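A minimal sketch of this layout (stage widths, depths, and the SRFM block body are illustrative assumptions, not the paper's specification) might look as follows:

```python
import torch.nn as nn

class LFEB(nn.Module):
    """Sketch of the Fig 3 layout. Stage widths, depths, and the SRFM body
    are illustrative assumptions; srfm_block is any channel-preserving
    block factory standing in for the real SRFM."""

    def __init__(self, srfm_block, channels=(64, 128, 256, 512), depths=(2, 2, 2, 2)):
        super().__init__()
        # Conv Stem Block: low-level feature extraction
        self.stem = nn.Sequential(
            nn.Conv2d(3, channels[0], 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(channels[0]),
            nn.SiLU(),
        )
        stages, in_ch = [], channels[0]
        for ch, depth in zip(channels, depths):
            layers = []
            if ch != in_ch:
                # downsampling between stages via a stride-2 convolution
                layers.append(nn.Conv2d(in_ch, ch, 3, stride=2, padding=1))
            layers += [srfm_block(ch) for _ in range(depth)]
            stages.append(nn.Sequential(*layers))
            in_ch = ch
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)   # multi-scale outputs consumed by the neck
        return feats
```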

Fig 4.

Architecture of the proposed dynamic multi-scale fusion module (DMSFM).

It employs parallel convolution and pooling branches with diverse receptive fields, aggregated by element-wise branch addition to enhance multi-scale feature representation.
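The branch-addition idea can be sketched as follows (branch count and kernel sizes are assumptions for illustration; Figs 5 and 6 indicate the actual module uses six branches):

```python
import torch.nn as nn

class DMSFM(nn.Module):
    """Sketch of parallel branches with diverse receptive fields, merged by
    element-wise addition. Branch choices are illustrative; the paper's
    module uses six branches (B1-B6, Figs 5-6)."""

    def __init__(self, ch):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(ch, ch, 1),                 # pointwise branch
            nn.Conv2d(ch, ch, 3, padding=1),      # local context
            nn.Conv2d(ch, ch, 5, padding=2),      # wider receptive field
            nn.Sequential(                        # pooling branch
                nn.AvgPool2d(3, stride=1, padding=1),
                nn.Conv2d(ch, ch, 1),
            ),
        ])

    def forward(self, x):
        # branch addition: sum the multi-scale responses element-wise
        return sum(branch(x) for branch in self.branches)
```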

Fig 5.

Structural reparameterization operations in DMSFM.

Six transformations are applied to unify multi-branch structures into equivalent convolutions.

Fig 6.

Inference-time transformation of DMSFM.

It converts six training-time branches (B1–B6) into a single equivalent convolutional layer via structural reparameterization.
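Two of the standard transforms behind such a merge can be sketched generically (RepVGG/DBB-style identities, not the paper's exact code): folding a BatchNorm into the preceding convolution, and summing parallel kernels after zero-padding them to a common spatial size.

```python
import torch
import torch.nn.functional as F

def fuse_conv_bn(w, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm following a bias-free conv into the conv itself:
    y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta."""
    std = (var + eps).sqrt()
    w_fused = w * (gamma / std).reshape(-1, 1, 1, 1)  # scale per out-channel
    b_fused = beta - gamma * mean / std
    return w_fused, b_fused

def merge_parallel_convs(kernels, biases, target_ks=5):
    """Sum parallel conv branches into one equivalent kernel: each smaller
    kernel is zero-padded to the largest spatial size first. Equivalence
    assumes all branches share stride and 'same' padding."""
    merged_w = 0
    for w in kernels:
        pad = (target_ks - w.shape[-1]) // 2
        merged_w = merged_w + F.pad(w, [pad, pad, pad, pad])
    return merged_w, sum(biases)
```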

Fig 7.

Structure of the proposed attention-enhanced fusion network (AEFN).

A channel attention module (CHM) is integrated into the PAN pathway to strengthen shallow features, improve cross-scale interactions, and preserve fine-grained details for small-object detection.
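A squeeze-and-excitation-style sketch conveys the idea of such a channel attention module (the paper's exact CHM design may differ; the block below is an assumption for illustration):

```python
import torch.nn as nn

class CHM(nn.Module):
    """Squeeze-and-excitation-style channel attention sketch: global average
    pooling summarizes each channel, a bottleneck MLP produces per-channel
    weights, and the input is reweighted channel-wise."""

    def __init__(self, ch, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # squeeze: (B, C, 1, 1)
        self.fc = nn.Sequential(                   # excitation
            nn.Conv2d(ch, ch // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))           # channel reweighting
```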

Fig 8.

Representative images from the four datasets:

(a) URPC2020–underwater scenes with dense marine organisms under low contrast; (b) RSOD–aerial imagery containing aircraft and industrial targets; (c) NWPU VHR-10–remote sensing scenes with small artificial structures; (d) VisDrone-DET–UAV imagery of urban and suburban scenes containing diverse small object categories.

Table 1.

Ablation experiments on the URPC2020 dataset.

Table 2.

Ablation experiments on the NWPU VHR-10 dataset.

Table 3.

Ablation experiments on the VisDrone-DET dataset.

Table 4.

Hyperparameter analysis of the PConv ratio in SRFM on the URPC2020 dataset.
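For context, the PConv ratio controls the fraction of channels that the partial convolution actually convolves, with the remaining channels passed through untouched; a FasterNet-style sketch (an illustration, not the paper's implementation) is:

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """FasterNet-style partial convolution: convolve only the first
    ratio * C channels and pass the rest through unchanged, reducing
    FLOPs and memory access. `ratio` is the hyperparameter in Table 4."""

    def __init__(self, ch, ratio=0.25):
        super().__init__()
        self.ch_conv = max(1, int(ch * ratio))
        self.conv = nn.Conv2d(self.ch_conv, self.ch_conv, 3, padding=1)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.ch_conv, x.size(1) - self.ch_conv], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)
```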

Table 5.

Hyperparameter analysis of branch-weight settings in DMSFM on the URPC2020 dataset.

Table 6.

Performance comparison by category on the RSOD dataset.

Table 7.

Comparative experiments on the RSOD dataset.

Table 8.

Performance comparison by category on the NWPU VHR-10 dataset.

Table 9.

Comparative experiments on the NWPU VHR-10 dataset.

Table 10.

Comparative experiments on the URPC2020 dataset.

Table 11.

Comparative experiments on the VisDrone-DET dataset.

Fig 9.

Confusion matrices of LDA-DETR on four datasets.

Fig 10.

Training performance comparison of LDA-DETR, RT-DETR-r18, and three YOLO variants (YOLOv10-m, YOLOv11-m, and YOLOv13-l) on the NWPU VHR-10 dataset: (a) mAP@0.5; (b) mAP@0.5:0.95.

Fig 11.

Attention heatmap comparisons between RT-DETR-r18 (second column) and LDA-DETR (third column) on the NWPU VHR-10 dataset.

The first column shows the original images. Warmer colors (e.g., red and yellow) denote higher attention weights. LDA-DETR exhibits more concentrated and accurately localized activations compared with the more diffuse responses of RT-DETR-r18, particularly for small or densely distributed targets.

Fig 12.

Visualization of attention heatmaps showing the incremental contributions of LFEB, DMSFM, and AEFN to LDA-DETR.

(a) Original image. (b) Baseline RT-DETR (ResNet-18). (c) +LFEB. (d) +DMSFM. (e) +AEFN (full LDA-DETR). These progressive visualizations illustrate how each module enhances small-object localization and background suppression on the NWPU VHR-10 dataset.

Fig 13.

Qualitative detection results on NWPU VHR-10 and URPC2020.

Left: baseline; right: LDA-DETR. White ovals mark missed detections. (a–b) NWPU VHR-10: LDA-DETR detects the bridge and tennis court missed by RT-DETR-r18. (c–d) URPC2020: LDA-DETR detects the holothurian and starfish missed by the baseline.

Fig 14.

Representative failure cases of the proposed method across four datasets: (a) RSOD, (b) VisDrone-DET, (c) NWPU VHR-10, and (d) URPC2020.

Purple boxes and arrows highlight missed detections of extremely small or occluded targets (e.g., tiny aircraft, pedestrians) and false activations of background structures or visually confusing categories (e.g., circular landscaping, kiosks, docks, seaweed, and rocks).
