
LDA-DETR: A lightweight dynamic attention-enhanced DETR for small object detection

  • Yanli Shi ,

    Roles Conceptualization, Supervision, Writing – review & editing

    syl@jlict.edu.cn

    Affiliation College of Science, Jilin Institute of Chemical Technology, Jilin, China

  • Jing Li,

    Roles Conceptualization, Methodology, Visualization, Writing – original draft

    Affiliation College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin, China

  • Yi Jia,

    Roles Formal analysis, Investigation, Project administration

    Affiliation College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin, China

  • Qihua Hong

    Roles Resources, Software, Validation

    Affiliation College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin, China

Abstract

The issues of complex background interference, dense distribution, and insufficient feature representation for small objects have become significant challenges and research hotspots in computer vision. Particularly when the algorithm needs to be deployed in practical applications, many state-of-the-art detectors struggle to balance accuracy and efficiency, often requiring extensive computational power or suffering from degraded detection performance on small objects. To tackle these problems, this paper proposes a lightweight dynamic attention-enhanced DETR (LDA-DETR). Firstly, a lightweight feature extraction backbone (LFEB) is designed to improve the efficiency of object detection under limited computational resources. The proposed backbone enhances gradient flow and reduces the model’s parameters through residual structures and partial convolution operations. Then, a Dynamic Multi-Scale Fusion Module (DMSFM) is proposed to improve the model’s adaptability and its ability to fuse diverse features. The proposed module enhances feature representation ability and inference performance by performing convolutions at different scales across multiple branches and dynamically selecting operations. Finally, considering that shallow features contain more detailed information, the Attention-Enhanced Fusion Network (AEFN) is constructed. The proposed approach refines and enriches features through attention mechanisms and cascading operations, endowing the features with comprehensive semantic and spatial details. Extensive experiments on the RSOD, NWPU VHR-10, URPC2020, and VisDrone-DET datasets demonstrate that LDA-DETR outperforms state-of-the-art detection methods and further validate that it is better suited for small object detection applications.

Introduction

Recent progress in computer vision technology has significantly enhanced object detection, facilitating its application in various fields, including autonomous driving, facial recognition, defect inspection, and remote sensing image analysis. However, small object detection is still challenging because the objects of interest are typically small in size (less than 32×32 pixels), lack sufficient target information, occur in complex detection scenes, and are prone to noise and deformation. These issues make small objects difficult to distinguish, hindering further development in real-world scenarios.

Currently, object detection approaches using deep learning are usually categorized into two groups, depending on their detection strategies. One approach is the two-stage method, where candidate regions are initially generated and then undergo classification and refinement. Typical two-stage methods include Faster R-CNN (Region-based Convolutional Neural Network) [1], Mask R-CNN [2], and Cascade R-CNN [3]. Although these methods achieve better accuracy, their inference speed is slower, limiting their suitability for real-time applications. The other is the regression-based single-stage method, such as the SSD series [4], YOLO series [5], EfficientNet [6], and RetinaNet [7]. Single-stage methods directly classify and detect objects in the image using an end-to-end convolutional neural network without generating candidate regions, thus maintaining high detection accuracy while ensuring faster speed. Previous research [8] has shown that Transformer-based models excel in generalization and robustness. DETR (DEtection TRansformer) [9] uses a CNN as the backbone network and combines it with a transformer encoder-decoder network to perform object detection tasks. In contrast to one-stage and two-stage detectors, it adopts an end-to-end training approach that removes the requirement for post-processing procedures like non-maximum suppression (NMS) and prior knowledge, thereby simplifying the detection pipeline. Research shows that DETR outperforms CNN-based detectors in object detection [10]. However, DETR has some limitations, including slow convergence during training and difficulty with small object detection. Several improved versions have been proposed to address these challenges, including Deformable-DETR [11] and SMCA-DETR [12], which accelerate convergence by modifying the attention modules.
Group DETR [13] introduces a group-wise one-to-many assignment scheme, where multiple groups of object queries are processed independently, injecting diverse supervision signals into training and significantly improving efficiency and accuracy. Similarly, DN-DETR [14] adopts a query denoising approach by injecting noised ground-truth boxes into the decoder, simplifying the bipartite matching problem and facilitating faster training. H-DETR [15] proposes a hybrid matching scheme that combines one-to-one and one-to-many assignments during training while retaining the Hungarian algorithm. This strategy allows more object queries to capture spatial information effectively, particularly benefiting the detection of small and densely distributed objects. In addition, the Real-Time Detection Transformer (RT-DETR) [16] employs an efficient hybrid encoder to process multi-scale features, achieving superior precision and speed compared to leading YOLO detectors. In contrast to heavier variants, such as Deformable-DETR and DINO [17], the RT-DETR-r18 offers a lightweight, real-time framework that is more conducive to further optimization. However, as a general object detector, RT-DETR-r18 shows limited effectiveness in handling small objects. Specifically, the features from the final three backbone stages are fed into the hybrid encoder, yet it struggles to capture small-object features from deeper layers. Moreover, while the cross-scale feature-fusion module (CCFM) within the hybrid encoder integrates features from different layers, the interactions between features at different scales may not be fully optimized. Despite these limitations, RT-DETR-r18 presents a promising foundation for small object detection due to its real-time processing capabilities, hybrid encoder design, and multi-scale feature handling. Given the growing importance of real-time detection in many applications, optimizing RT-DETR-r18 for small object detection could yield valuable improvements. 
Therefore, this paper chooses RT-DETR-r18 as the baseline for the study.

To address the above-mentioned issues, a lightweight dynamic attention-enhanced DETR (LDA-DETR) is proposed. Firstly, the lightweight module is designed using residual structures and partial convolution operations, along with the design of an appropriate backbone, to reduce parameters and enhance the model’s feature representation and training stability. Then, the multi-scale fusion module is constructed to strengthen feature expression by performing convolutions and pooling operations at different scales across multiple branches. Meanwhile, the model uses the fused main branch for inference, thereby improving inference performance. Finally, the attention-enhanced fusion network is constructed to enhance shallow feature information by embedding a channel attention module at the PAN’s bottom layer and then fusing it with the intermediate layer features from the FPN, thereby enhancing the positional and semantic information of small objects. The main contributions of this work can be summarized as follows:

1. A lightweight small object detection framework (LDA-DETR) is proposed by integrating the advantages of CNNs and Transformers. The proposed framework consists of three main components: a Lightweight Feature Extraction Backbone (LFEB), a Dynamic Multi-Scale Fusion Module (DMSFM), and an Attention-Enhancing Fusion Network (AEFN), all integrated within the RT-DETR-r18 architecture. This framework enables the small object detection model to more effectively extract and fuse multi-scale features while maintaining a low computational cost and high detection accuracy.

2. Lightweight Feature Extraction Backbone (LFEB) is designed to minimize parameters while improving gradient flow and optimizing feature representation. Meanwhile, the introduction of partial convolution operations allows for selective processing of the effective regions in the input, reducing model parameters while preserving key feature information.

3. Dynamic Multi-Scale Fusion Module (DMSFM) is constructed to fuse feature representations across multiple levels and perspectives, improving the model’s expressiveness and ability to capture complex data relationships. During inference process, the model uses the fused main branch, reducing complexity and computational cost.

4. Attention-Enhanced Fusion Network (AEFN) is proposed. This network embeds a newly designed CHannel attention Module (CHM) to enhance key feature information, and fuses it with the intermediate feature information from FPN through cascading operations, thereby improving the localization and semantic information of small objects.

5. LDA-DETR is verified on the RSOD, NWPU VHR-10, URPC2020, and VisDrone-DET datasets, and its performance surpasses that of the benchmark model and several state-of-the-art techniques for small object detection. The visualization results further demonstrate the outstanding detection capability of the proposed method.

Related work

Small object detection techniques

Detecting small objects remains challenging because of their ambiguous features, unclear boundaries, and dense distributions. Existing approaches can be broadly categorized into feature enhancement and feature fusion optimization. The first direction, feature enhancement, employs attention-based modules to improve the quality and expressiveness of extracted features. For example, PHAM [18] combines channel, spatial, and coordinate attention in a dual-branch structure to emphasize defect-relevant regions while suppressing background interference, and Coordinate Attention integrated into YOLOv5 [19] helps capture long-range spatial dependencies and focus on salient features. Although these methods enhance feature representation, they often increase parameter overhead and computational burden. More importantly, as data propagate into deeper layers, fine-grained spatial cues crucial for small-object localization tend to degrade, especially during fusion stages. The second direction, feature fusion optimization, focuses on integrating multi-scale features across semantic levels to improve robustness and context awareness. For instance, Sun et al. [20] employed a lightweight SPP-based fusion structure with heuristic strategies for real-time UAV-based detection. DFPN [21] enhances multi-scale semantics through deeper aggregation in the neck network. While these methods improve semantic integration, they often increase architectural complexity and may fail to preserve the fine-grained positional information required for accurate localization.

To alleviate these issues, attention mechanisms have been incorporated into fusion processes to strengthen spatially informative shallow features. Representative modules such as CBAM [22] and Coordinate Attention [23] refine channel and spatial representations and are widely embedded in feature extraction or fusion. Recent attention-guided fusion networks further explore this direction. For example, CANet [24] employs centerness-aware attention within a feature pyramid to improve robustness in cluttered scenes. DET-YOLO [25] integrates deformable-embedding transformers with pyramid modules to enhance detail-sensitive features in aerial imagery. Nguyen et al. [26] introduced a hybrid convolution-transformer framework that employs multi-scale adaptive attention with feature fusion to jointly capture local details and global context, thereby improving feature activation and localization accuracy.

Inspired by these studies, this paper adopts an attention-guided fusion strategy. Specifically, lightweight channel attention is embedded into the shallow feature path and then cascaded with mid-level semantic features, preserving positional cues and improving small-object detection performance.

Model lightweight

Recent advances in machine learning have provided valuable insights for lightweight model design. Autoencoders, for example, learn compact and noise-tolerant feature representations, thereby reducing storage and computational demands without sacrificing essential information [27]. In a complementary direction, generative AI techniques for distributed learning enable local synthetic data generation to reduce communication overhead while maintaining global performance [28]. Both approaches align with the resource-saving goals of lightweight object detection, offering perspectives beyond traditional compression and architecture simplification.

Lightweight object detection models can minimize the model size and computational cost, facilitating efficient deployment and inference in resource-constrained environments. Early researchers have conducted in-depth studies on model lightweighting, and many classic lightweight convolutional neural networks have been proposed. MobileNetV2 [29] reduces model complexity through depthwise separable convolutions while maintaining strong representational capacity. ShuffleNet [30] further decreases parameters by employing pointwise group convolutions combined with channel shuffling. GhostNet [31] minimizes redundancy by generating additional feature maps via inexpensive linear operations, lowering computational overhead without compromising quality. Conformer-S [32], a compact variant of the Conformer architecture, couples convolutional modules for local feature extraction with Transformer blocks for global representation learning, offering a balanced and lightweight design. Meanwhile, several studies have introduced object detection models that utilize lightweight backbone structures. CSL-YOLO [33] introduces a Cross-Stage Lightweight (CSL) module with CSL-Bone and CSL-FPN to reduce complexity. YOLO-SM [34] constructs a lightweight backbone (DCMNet) with multi-scale and attention mechanisms, combined with a lightweight neck (GMF), to achieve an efficient trade-off for single-class multi-deformation detection. LAR-YOLOv8 [35] adapts YOLOv8 with a locally enhanced backbone and an attention-guided bidirectional feature pyramid, improving small-object detection in remote sensing while reducing parameter count by approximately 40%.

However, classical lightweight networks and simplified detection architectures still struggle with small or densely distributed objects. Depthwise or grouped convolutions often weaken channel interactions, while aggressive pruning or channel reduction may degrade semantic richness in deeper layers. To address these issues, this paper proposes a lightweight backbone design that combines residual connections with partial convolution. This structure improves gradient flow, promotes efficient feature reuse, and reduces parameter overhead, thereby maintaining strong representational capacity while remaining suitable for real-time small-object detection in resource-constrained scenarios.

Multiscale feature fusion

In small object detection, models must remain highly sensitive to fine-grained details while leveraging sufficient contextual information for accurate recognition and localization. However, conventional convolutional neural networks (CNNs), which rely on fixed-size kernels and downsampling operations, possess limited receptive fields and often fail to represent features across multiple scales. This limitation weakens spatial representation and hampers detection performance, particularly for small or densely distributed objects. To address these challenges, several multiscale fusion methods have been proposed. MFFSODNet [36] employs a multi-branch convolution module with kernels of varying sizes (1×1, 3×3, and 5×5) to capture diverse receptive fields and incorporates a bidirectional dense feature pyramid to fuse shallow fine-grained with deep semantic features, thereby improving detection of small and densely distributed objects. Nie et al. [37] proposed the ECM, which enhances feature representation by integrating multiscale contextual information through pyramid-structured convolutions of different kernel sizes at each level, leading to significant performance gains. Jiang et al. [38] introduced MSFEM, which divides the input into residual and main branches; the main branch further applies multiple convolutions (1×1, 3×3, and 5×5) followed by dimensional adjustment, while the residual branch uses a single 1×1 convolution. The outputs are merged and compressed to yield the final representation.

Although multiscale modules enhance representational capacity, their reliance on parallel convolutional branches often increases inference latency and architectural complexity. To address this issue, recent studies have explored structural reparameterization, which converts multi-branch training architectures into equivalent single-path forms during inference. Representative approaches, such as RepVGG [39], ACNet [40], and Diverse Branch Block (DBB) [41], achieve substantial reductions in computational cost while preserving expressive power.

Inspired by this paradigm, this paper proposes a multiscale fusion strategy that incorporates diverse receptive fields with structural reparameterization. During training, multi-branch convolutional blocks are employed to capture information at different scales. At inference, these branches are equivalently transformed into a single branch, thereby preserving multiscale representation capacity without introducing additional computational costs.

Methods

Overall structure

The overall architecture of the proposed lightweight dynamic attention-enhanced DETR (LDA-DETR) is shown in Fig 1. It follows the RT-DETR framework while proposing three key improvements. Firstly, the backbone is redesigned with Sparse Residual Feature Modules (SRFM) to achieve lightweight and efficient feature extraction. Secondly, the neck incorporates a Dynamic Multi-Scale Fusion Module (DMSFM) to aggregate features across diverse receptive fields. Finally, an Attention-Enhanced Fusion Network (AEFN) is introduced to strengthen the interaction between shallow details and high-level semantics. The positions of these modifications are illustrated in Fig 2.

Fig 1. Overall architecture of the proposed LDA-DETR.

The framework comprises three components: (1) Backbone, a lightweight feature extraction backbone with SRFM modules to reduce redundancy while preserving essential spatial cues; (2) Neck, a multi-scale fusion pathway that combines DMSFusion for receptive field aggregation and AEFN for enhancing interactions between low-level details and high-level semantics; and (3) Head, where initial object queries are selected via an IoU-aware mechanism to initialize the decoder, followed by iterative refinement of bounding boxes and confidence scores through an auxiliary prediction head.

https://doi.org/10.1371/journal.pone.0340977.g001

Fig 2. Comparison between RT-DETR and LDA-DETR.

The upper part illustrates the overall processing pipeline, while the lower part highlights the modifications in LDA-DETR: (1) Backbone: SRFM modules (yellow) replace ResNet-18 blocks for lightweight feature extraction; (2) Neck: DMSFusion (pink) replaces the baseline fusion unit for multi-scale receptive field aggregation; (3) Detail enhancement: AEFN with CHM (green) replaces PAN to improve shallow interaction and small object detection.

https://doi.org/10.1371/journal.pone.0340977.g002

Fig 2 provides a comparative overview of RT-DETR and the proposed LDA-DETR. The upper part depicts the overall processing pipeline, where input images are processed through the backbone, Efficient Hybrid Encoder, Uncertainty-minimal Query Selection, and Decoder & Head to generate detection results. The lower part details the three modifications, where bold black arrows indicate module replacement relationships rather than feature flow:

(1) Backbone modification: The original ResNet-18 BasicBlocks are replaced with a 3×3 convolution layer with stride 2 for downsampling (except in Stage 1, where the stem block is applied), followed by SRFM modules (yellow). Each SRFM integrates residual connections with partial convolution (PConv) to reduce parameters and computational cost while maintaining strong feature representation and enhancing gradient flow.

(2) Neck fusion modification: In the multi-scale fusion stage, the baseline fusion unit (Fusion) is replaced by DMSFusion (pink). This module employs the reparameterizable multi-branch DMSFM instead of the original RepBlock, aggregating features from different receptive fields. During inference, the multi-branch structure is equivalently transformed into a single branch, thereby reducing computational cost while retaining representational capacity.

(3) Shallow detail enhancement path: The AEFN replaces the PAN to enhance interactions between shallow and intermediate features. At the bottom of the PAN, the Channel Attention Module (CHM) (green) reweights shallow features, and the enhanced outputs are cascaded with intermediate features from the FPN. This design effectively integrates fine-grained details with high-level semantics, thereby improving the localization and recognition of small objects.

Lightweight feature extraction backbone

The baseline backbone is constructed on ResNet, a widely used deep architecture that alleviates vanishing and exploding gradients through residual connections, thereby enabling the effective training of deeper networks. Despite its strong performance across various tasks, ResNet exhibits inherent limitations. As network depth increases, optimization becomes more difficult, leading to performance degradation, while the associated computational cost and memory consumption grow substantially, limiting its suitability for real-time detection.

To address these challenges, this paper proposes a Lightweight Feature Extraction Backbone (LFEB). As shown in Fig 3, LFEB begins with a Conv Stem Block composed of three sequential ConvNorm layers (3×3 kernels; strides of 2, 1, and 1), followed by a 3×3 max-pooling layer with stride = 2, which extracts low-level features and reduces the spatial resolution to one-quarter of the input. The backbone then proceeds through four stages (Stage 1–4), each containing a single Sparse Residual Feature Module (SRFM) with stride = 1. Spatial downsampling is decoupled from the SRFM and performed between stages using inter-stage 3×3 convolutions with stride = 2, following the resolution schedule (1/4 → 1/8 → 1/16 → 1/32).
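The resolution schedule above can be traced with a short sketch. The 640×640 input size and padding of 1 are illustrative assumptions, not values stated in this section:

```python
def conv_out(size, k, s, p):
    """Output spatial size of a conv/pool layer (floor division)."""
    return (size + 2 * p - k) // s + 1

def lfeb_schedule(size=640):
    """Trace the LFEB resolution schedule: stem -> 1/4, then /2 between stages."""
    # Conv Stem Block: three 3x3 ConvNorm layers (strides 2, 1, 1),
    # followed by a 3x3 max-pooling layer with stride 2.
    for s in (2, 1, 1):
        size = conv_out(size, k=3, s=s, p=1)
    size = conv_out(size, k=3, s=2, p=1)   # max-pool -> 1/4 of the input
    sizes = [size]                         # Stage 1 operates at 1/4 resolution
    # SRFM stages use stride 1; inter-stage 3x3 convs with stride 2 downsample.
    for _ in range(3):
        size = conv_out(size, k=3, s=2, p=1)
        sizes.append(size)
    return sizes

print(lfeb_schedule(640))  # [160, 80, 40, 20] -> the 1/4, 1/8, 1/16, 1/32 schedule
```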

Fig 3. Structure of the proposed lightweight feature extraction backbone (LFEB).

The network begins with a Conv Stem Block for low-level feature extraction, followed by four Sparse Residual Feature Modules (SRFM) stages. Spatial downsampling is performed between stages using convolutions with stride of 2.

https://doi.org/10.1371/journal.pone.0340977.g003

Compared with prior lightweight backbones, MobileNet applies depthwise separable convolutions to all channels, while ShuffleNet reduces computation via grouped convolutions combined with channel shuffling to restore cross-group information flow. MobileViT [42] integrates convolutional layers with lightweight Transformer blocks, applying self-attention within locally unfolded regions to improve contextual modeling; however, this design weakens the spatial inductive bias inherent to CNNs and introduces additional computation due to unfolding and folding operations. Motivated by the need to reduce redundant computation while retaining effective cross-channel interactions, the proposed SRFM adopts a partial-channel transformation strategy. In the main branch, a partial convolution (PConv) applies a standard convolution to only a subset of input channels, capturing spatial features at reduced computational cost. In contrast, the remaining channels are preserved and concatenated with the processed ones for subsequent operations. In parallel, a residual pathway directly adds the unaltered input to the final output, thereby improving gradient flow and preserving global contextual information.

In each SRFM, features obtained from the PConv operation in the main branch are further processed by two sequential convolutions. Batch normalization and ReLU activation are applied only after the first convolution to avoid excessive normalization and activation, which could otherwise suppress feature diversity. Formally, let the input feature map be denoted as X ∈ ℝ^(C×H×W), and the output of the SRFM be Y. The computation is expressed as:

F₁ = ReLU(BN(Conv₁(PConv(X)))) (1)

Y = X + Conv₂(F₁) (2)

where Conv₁ and Conv₂ denote the two sequential convolutions. The PConv applies a k×k convolution to cp = rC channels (0 < r < 1), while the remaining C − cp channels bypass the operation. For a feature map of spatial size h×w, the computational complexity of PConv is:

FLOPs(PConv) = h × w × k² × cp² (3)

which reduces to only 1/16 of the cost of a standard convolution when r = 1/4. The corresponding memory access cost (MAC) is:

MAC(PConv) = h × w × 2cp + k² × cp² ≈ h × w × 2cp (4)

This significant reduction in FLOPs and MACs highlights the efficiency of the partial convolution strategy, establishing SRFM as an effective component of LFEB and particularly well suited for lightweight, real-time object detection backbones.
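A minimal numpy sketch of the partial-channel idea, reduced to the 1×1 kernel case for brevity; the identity test weights and the C = 8 toy shape are illustrative assumptions:

```python
import numpy as np

def pconv(x, weight, r=0.25):
    """Partial convolution sketch (1x1 kernel case): convolve only the first
    cp = r*C channels; the remaining channels bypass and are concatenated."""
    c = x.shape[0]
    cp = int(r * c)
    y_conv = np.einsum('oc,chw->ohw', weight, x[:cp])  # 1x1 conv on cp channels
    return np.concatenate([y_conv, x[cp:]], axis=0)    # untouched channels pass through

x = np.random.randn(8, 4, 4)        # (C, H, W) with C = 8
y = pconv(x, np.eye(2), r=0.25)     # cp = 2; identity weights for the check
assert y.shape == x.shape
assert np.allclose(y, x)            # identity kernel leaves all features intact

# FLOPs ratio vs. a standard convolution: (cp/C)^2 = r^2 -> 1/16 at r = 1/4
assert 0.25 ** 2 == 1 / 16
```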

Dynamic multi-scale fusion module

Traditional multi-scale fusion modules often incur significant inference latency due to parallel branch computation, which increases model complexity and computational overhead. To address this issue and improve small-object perception, this paper proposes the Dynamic Multi-Scale Fusion Module (DMSFM) (Fig 4). The module employs convolutions with varying receptive fields in a multi-branch structure to capture features across scales, thereby enhancing spatial relationship modeling and detection accuracy. In contrast to designs such as BiFPN [6], which rely on iterative top-down and bottom-up pathways with learnable weights and consequently introduce additional inference overhead, DMSFM employs a parallel structure that achieves one-stage multi-scale integration. Furthermore, it leverages structural reparameterization to merge multi-branch operations into a single equivalent convolution during inference, maintaining detection accuracy without additional computational cost.

Fig 4. Architecture of the proposed dynamic multi-scale fusion module (DMSFM).

It employs parallel convolution and pooling branches with diverse receptive fields, aggregated by branch addition to enhance multi-scale feature representation.

https://doi.org/10.1371/journal.pone.0340977.g004

Specifically, DMSFM employs a multi-branch structure that combines multi-scale convolutions, average pooling, sequential convolutions, and branch addition. These operations provide receptive fields of varying sizes and complexities, enriching features from different perspectives. The outputs are then aggregated through branch addition to form a comprehensive input representation. Each convolutional and pooling layer is followed by batch normalization to enhance training stability. In this design, four intermediate branches are constructed in paired identical structures: two branches consist of a 1×1 convolution followed by a K×K convolution, and two branches consist of a 1×1 convolution followed by average pooling. This symmetry allows one branch to compensate when the other extracts unstable features, improving overall performance. Six structural reparameterization techniques [41] are employed to reduce model complexity. Let W and b denote the convolution kernel and bias, respectively. Let γ, β, μ, and σ² represent the batch normalization (BN) scale, shift, mean, and variance, and let ε be a small constant for numerical stability. Let u, v denote the spatial indices of the convolution kernel.

Transform I – Merge convolution and BN into a single convolution layer:

W′(u, v) = (γ / √(σ² + ε)) · W(u, v),  b′ = (γ / √(σ² + ε)) · (b − μ) + β (5)

Transform II – Sum outputs of convolutions with identical configurations:

W′ = Σᵢ Wᵢ,  b′ = Σᵢ bᵢ (6)

Transform III – Merge sequential convolutions. A 1×1 convolution (W₁, b₁) followed by a K×K convolution (W₂, b₂) is equivalent to a single K×K convolution whose kernel is obtained by convolving W₂ with the channel-transposed W₁, with b₁ absorbed into the bias:

W′ = W₂ ∗ TRANS(W₁),  b′ = b₂ + Σ_{c,u,v} W₂(c, u, v) · b₁(c) (7)

where TRANS denotes transposing the input and output channel dimensions of W₁.

Transform IV – Handle group convolutions by applying Transforms I–III within each group, then concatenating along the channel dimension.

Transform V – Replace K×K average pooling with a fixed convolution kernel AK where all elements are 1/K².

Transform VI – Expand smaller kernels (e.g., 1×1) to K×K by zero-padding around the center.

These transformations are illustrated in Fig 5.
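Transform I can be checked numerically. The sketch below folds BN into a 1×1 convolution (a simplification of the general K×K case) using randomly drawn toy parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1x1 convolution over C = 4 channels followed by per-channel BN.
C = 4
W = rng.standard_normal((C, C))          # 1x1 kernel: (out_ch, in_ch)
b = rng.standard_normal(C)
gamma, beta = rng.standard_normal(C), rng.standard_normal(C)
mu, var = rng.standard_normal(C), rng.random(C) + 0.1
eps = 1e-5

x = rng.standard_normal((C, 5, 5))       # (C, H, W) feature map

def conv1x1(x, W, b):
    return np.einsum('oc,chw->ohw', W, x) + b[:, None, None]

# Reference path: convolution, then batch normalization.
y_ref = conv1x1(x, W, b)
y_ref = (gamma[:, None, None] * (y_ref - mu[:, None, None])
         / np.sqrt(var + eps)[:, None, None] + beta[:, None, None])

# Transform I: fold BN into the convolution (Eq. 5).
s = gamma / np.sqrt(var + eps)
W_fused = s[:, None] * W                 # scale each output-channel row
b_fused = s * (b - mu) + beta
y_fused = conv1x1(x, W_fused, b_fused)

assert np.allclose(y_ref, y_fused)       # identical outputs, single conv layer
```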

Fig 5. Structural reparameterization operations in DMSFM.

Six transformations are applied to unify multi-branch structures into equivalent convolutions.

https://doi.org/10.1371/journal.pone.0340977.g005

Inference-time conversion of DMSFM: Let x be the input feature map, and Wi, bi be the equivalent kernel and bias of branch i after applying the above transforms. From Fig 4, DMSFM contains six branches (from left to right, B1–B6): B1: 1×1 convolution; B2, B3: 1×1 convolution followed by a K×K convolution; B4, B5: 1×1 convolution followed by AvgPooling; B6: K×K convolution.

Step 1 (BN Fusion, Transform I): Fuse BN into all convolution layers to obtain the fused kernels Wi′ and biases bi′.

Step 2 (AvgPooling to Convolution, Transform V): Convert B4 and B5’s average pooling into fixed convolution kernels AK.

Step 3 (Sequential merging, Transform III with VI): B1: Expand the 1×1 kernel to K×K (Transform VI). B2, B3: Expand the 1×1 kernel (Transform VI), then merge it with the K×K convolution (Transform III). B4, B5: Expand the 1×1 kernel (Transform VI), then merge it with the AK convolution (Transform III). B6: Remains unchanged.

Step 4 (Pairwise branch addition, Transform II): Sum B2 & B3 to form one branch; sum B4 & B5 likewise.

Step 5 (Final branch fusion, Transform II): All equivalent convolution kernels and biases are summed to obtain

W = Σᵢ Wᵢ′,  b = Σᵢ bᵢ′ (8)

Step 6 (Deployment): During inference, the DMSFM is computed as

y = W ⊛ x + b (9)

This transformation replaces the original multi-branch structure with a single convolutional layer, maintaining identical outputs while significantly reducing inference cost. The overall reparameterization process from multi-branch training to single-path inference is illustrated in Fig 6.
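Transforms II and V can likewise be verified with a naive single-channel convolution; stride 1 and "valid" padding are simplifying assumptions made for the check:

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' 2D convolution (single channel, stride 1)."""
    kh, kw = k.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 6))
K = 3

# Transform V: a K x K average pooling equals convolution with A_K = 1/K^2.
A_K = np.full((K, K), 1.0 / K**2)
pooled = np.array([[x[i:i+K, j:j+K].mean() for j in range(4)] for i in range(4)])
assert np.allclose(conv2d_valid(x, A_K), pooled)

# Transform II: summing two parallel branches equals one conv with summed kernels.
k1, k2 = rng.standard_normal((K, K)), rng.standard_normal((K, K))
two_branch = conv2d_valid(x, k1) + conv2d_valid(x, k2)
one_branch = conv2d_valid(x, k1 + k2)
assert np.allclose(two_branch, one_branch)
```

Both checks rest on the linearity of convolution, which is exactly what allows the six-branch training structure to collapse into a single kernel at inference.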

Fig 6. Inference-time transformation of DMSFM.

It converts six training-time branches (B1–B6) into a single equivalent convolutional layer via structural reparameterization.

https://doi.org/10.1371/journal.pone.0340977.g006

Attention-enhanced fusion network

Existing fusion strategies improve semantic integration but fail to fully exploit shallow discriminative cues, which are critical for small-object detection in cluttered scenes. Transformer-based methods, such as DINO [17], advance detection accuracy by employing denoising training and contrastive learning to stabilize query refinement in the decoder. However, DINO primarily enhances high-level semantic representations without explicitly preserving low-level spatial details, limiting its ability to localize densely distributed or occluded small objects. To address this limitation, this paper proposes an Attention-Enhanced Fusion Network (AEFN) (Fig 7). The network integrates a newly designed Channel Attention Module (CHM) into the PAN-based structure, enhancing low-level feature representations and preserving critical spatial cues for accurate small-object detection. The CHM captures global context through global pooling, which is processed by a shared multi-layer perceptron (MLP) to generate channel attention weights. Specifically:

Fig 7. Structure of the proposed attention-enhanced fusion network (AEFN).

A channel attention module (CHM) is integrated into the PAN pathway to strengthen shallow features, improve cross-scale interactions, and preserve fine-grained details for small-object detection.

https://doi.org/10.1371/journal.pone.0340977.g007

For the input feature map F ∈ ℝ^(C×H×W) (where C represents the channel count, and H and W are the height and width), average pooling over the spatial dimensions is performed to create a global feature vector F_avg ∈ ℝ^(C×1×1). The formula is:

F_avg(c) = (1/(H×W)) Σ_{i=1}^{H} Σ_{j=1}^{W} F(c, i, j)    (10)

where F_avg(c) denotes the average pooling output of the c-th channel.

Apply maximum pooling across the spatial dimensions of the input feature map F to create another global feature vector F_max ∈ ℝ^(C×1×1). The formula is:

F_max(c) = max_{i,j} F(c, i, j)    (11)

where F_max(c) denotes the maximum pooling output of the c-th channel.

The two pooling results are fed into a shared-parameter 2D convolutional layer to generate two channel attention vectors. The network structure can be represented as:

M_avg = W1(Hardswish(W0(F_avg))),  M_max = W1(Hardswish(W0(F_max)))    (12)

where W0 and W1 are learnable weight matrices, and Hardswish is a nonlinear activation function. Through this process, the model can capture more complex inter-channel relationships.

After summing the two channel attention vectors, the result is normalized using the sigmoid function to produce the final channel attention weight M_c ∈ ℝ^(C×1×1). The corresponding formula is:

M_c = σ(M_avg + M_max)    (13)

where σ is the sigmoid function, ensuring the output is between 0 and 1.

Finally, the channel attention weights M_c are utilized to adjust the feature map F through element-wise multiplication, yielding the weighted feature map F′.

F′ = M_c ⊗ F    (14)

where ⊗ denotes element-wise multiplication.

The intermediate features from the FPN path are then fused with the enhanced underlying features from the AEFN path through cascading operations, ensuring that the fused features retain rich semantic and spatial information. Compared with the Sigmoid function, Hardswish exhibits greater numerical stability by avoiding unstable exponential operations. Its piecewise smooth formulation preserves gradient continuity while introducing effective nonlinearity, enhancing feature expressiveness. This, in turn, allows attention mechanisms to emphasize informative regions and suppress irrelevant background responses more effectively.
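A hedged PyTorch sketch of the CHM described by Eqs (10)–(14) is given below; the use of 1 × 1 convolutions for the shared MLP and the reduction ratio of 16 are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

# Illustrative CBAM-style channel attention matching the CHM description:
# average- and max-pooled descriptors pass through a shared MLP with
# Hardswish, are summed, squashed by a sigmoid, and rescale the input.
class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):  # reduction assumed
        super().__init__()
        self.mlp = nn.Sequential(                 # shared weights W0, W1
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.Hardswish(),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(f.mean(dim=(2, 3), keepdim=True))  # Eq 10 + Eq 12
        mx = self.mlp(f.amax(dim=(2, 3), keepdim=True))   # Eq 11 + Eq 12
        m_c = torch.sigmoid(avg + mx)                     # Eq 13
        return f * m_c                                    # Eq 14 (broadcast)

x = torch.randn(2, 64, 32, 32)
out = ChannelAttention(64)(x)
assert out.shape == x.shape
```

The element-wise product in Eq (14) relies on broadcasting the C×1×1 weight vector over the spatial dimensions, so every position in a channel is scaled by the same attention weight.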

Experiments

Datasets

Four publicly available datasets with diverse characteristics were employed in our experiments to evaluate the model’s robustness across domains. Fig 8 illustrates representative images from each dataset, highlighting their distinct visual characteristics and detection challenges.

Fig 8. Representative images from the four datasets:

(a) URPC2020–underwater scenes with dense marine organisms under low contrast; (b) RSOD–aerial imagery containing aircraft and industrial targets; (c) NWPU VHR-10–remote sensing scenes with small artificial structures. (d) VisDrone-DET–UAV imagery of urban and suburban scenes containing diverse small object categories.

https://doi.org/10.1371/journal.pone.0340977.g008

The first dataset used is URPC2020, provided by the China Underwater Robot Competition. It contains images of four sea creatures: holothurian, echinus, scallop, and starfish. All the images were captured in natural underwater conditions characterized by low contrast, complex backgrounds, and a high proportion of small and densely distributed objects. It contains 5,400 JPG images, divided into 90% for training and 10% for testing. A sample from this dataset is shown in Fig 8(a).

The second dataset is the RSOD dataset proposed by Wuhan University, which includes 180 overpasses, 191 playgrounds, 1586 oil tanks, and 4993 aircraft. Image sizes range from 512×512 to 1961×1193 pixels, and the dataset is partitioned in a 6:2:2 ratio for training, validation, and testing. A sample from this dataset is shown in Fig 8(b).

The third dataset is the NWPU VHR-10, a geospatial dataset for remote sensing object detection, proposed by Cheng et al. from Northwestern Polytechnic University, China. It consists of 650 annotated images and 150 background-annotated images, with sizes varying from 533×597 to 1728×1028 pixels. It includes ten categories: ship, ground track field, harbour, bridge, tennis court, vehicle, baseball diamond, storage tank, plane, and basketball court. Many of these objects are small and densely distributed. Independent training, validation, and testing sets are created using a 6:2:2 proportion. A sample from this dataset is shown in Fig 8(c).

The fourth dataset is the VisDrone-DET dataset, released in 2018 by the AISKYEYE team from Tianjin University. It contains 10,209 static images with resolutions ranging from 960×540 to 2000×1500, and 288 video clips, all collected from 14 cities with diverse and complex scenes. Ten object categories are annotated, including both pedestrians and vehicles. The dataset is split into 6,471 images for training, 548 for validation, and 3,190 for testing. Due to its severe imbalance in object categories and scales, it serves as a challenging benchmark for small object detection. A sample from this dataset is shown in Fig 8(d).

Experimental environment setup

All experiments were conducted on an Intel Core i5-8264U CPU and an NVIDIA GeForce RTX 3090 Ti GPU (CUDA 11.7), using Python 3.8.13 and PyTorch 1.12.1 with the Ultralytics framework. Input images were resized to pixels. The model was trained for 300 epochs with a batch size of 8, using the AdamW optimizer ( × 10−4, momentum  = 0.9, weight decay  = 1 × 10−4). A linear learning rate schedule was applied, with 2000 warmup iterations (warmup momentum  = 0.8, bias learning rate  = 0.1) followed by linear decay (). Loss weights were set as , , and . The same training strategy and hyperparameters were applied across all four datasets without dataset-specific tuning, indicating the robustness and parameter insensitivity of the proposed method.

Evaluation indicators

To assess the model’s efficiency and accuracy, we use metrics like Recall (R), Precision (P), FPS, Average Precision (AP), model parameters (Params), Floating Point Operations (GFLOPs), as well as AP at different scales (APs, APm, APl). Additionally, mAP@0.5 and mAP@0.5:0.95 are employed to assess the mean average precision at IoU thresholds of 0.5 and from 0.5 to 0.95. The formulas for Precision and Recall are as follows:

P = TP / (TP + FP)    (15)

R = TP / (TP + FN)    (16)

where TP refers to the instances that the model identifies as positive and are indeed positive. FP refers to those identified as positive but are negative. FN refers to the ones predicted as negative but are positive.

Average Precision (AP) measures the detection performance for each object category by computing the area under the precision-recall (P-R) curve.

AP = ∫_0^1 P(R) dR    (17)

FPS measures how many image frames the model processes each second. It is commonly employed to assess the model’s processing rate or ability to operate in real time.

The number of model parameters (Params) is an important metric for assessing the model’s complexity and resource requirements. The more parameters a model has, the larger its capacity to learn complex patterns. However, this can also lead to overfitting and require more computational resources.

Floating Point Operations (GFLOPs) represents the overall count of floating-point calculations performed during inference or training. It is commonly used to evaluate the model’s computational demands. A smaller value signifies greater model efficiency, which is beneficial for resource-constrained environments.

Mean Average Precision (mAP) calculates the average precision across various categories, providing an overall measure of the model’s performance:

mAP = (1/N) Σ_{i=1}^{N} AP_i    (18)

where N denotes the number of object categories.

Specifically, mAP@0.5 refers to the average precision when the IoU threshold is 0.5, whereas mAP@0.5:0.95 calculates the average precision across IoU thresholds ranging from 0.5 to 0.95, with a step of 0.05.
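As a toy illustration (not the paper's evaluation code), the metric definitions in Eqs (15)–(18) can be computed as follows; the TP/FP/FN counts and the P-R samples are made up:

```python
# Illustrative computation of precision, recall, and AP for one
# hypothetical category, following Eqs 15-17.
def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)     # Eq 15, Eq 16

def average_precision(recalls, precisions):
    # Eq 17 approximated by a Riemann sum over sampled P-R points.
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

p, r = precision_recall(tp=80, fp=20, fn=20)
print(p, r)                                   # 0.8 0.8
ap = average_precision([0.2, 0.5, 1.0], [1.0, 0.9, 0.6])
# mAP (Eq 18) would then average such AP values over all categories.
```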

Following the COCO evaluation protocol, objects are categorized into three size groups based on their bounding box area: small (area < 32² pixels), medium (32² ≤ area < 96² pixels), and large (area ≥ 96² pixels). This study adopts this standard to evaluate model performance on the URPC2020, RSOD, NWPU VHR-10, and VisDrone-DET datasets. Although these datasets do not provide explicit size annotations, we calculate the bounding box area for each object to enable scale-specific performance analysis using APs, APm, and APl.
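The COCO-style scale assignment described above can be sketched with a simplified helper (not part of the official evaluation toolkit):

```python
# Bucket a bounding box into the COCO size groups used for APs/APm/APl.
def size_group(w: float, h: float) -> str:
    area = w * h
    if area < 32 ** 2:        # area < 1024 px^2
        return "small"
    if area < 96 ** 2:        # 1024 <= area < 9216 px^2
        return "medium"
    return "large"

print(size_group(20, 20))     # small  (400 < 1024)
print(size_group(50, 50))     # medium (2500 < 9216)
print(size_group(100, 100))   # large
```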

Ablation experiments

This section conducts ablation experiments to verify the contribution of the three proposed modules in different combinations to the overall detection performance. All experiments are performed under a consistent environment on three small object detection datasets: URPC2020, NWPU VHR-10, and VisDrone-DET. The results are presented in Tables 1-3. It is worth noting that the RSOD dataset, which shares similar aerial remote sensing characteristics with NWPU VHR-10, is reserved for the final comparative analysis to validate model generalization and is therefore omitted from this section to avoid redundancy.

Table 2. Ablation experiments on the NWPU VHR-10 dataset.

https://doi.org/10.1371/journal.pone.0340977.t002

Table 3. Ablation experiments on the VisDrone-DET dataset.

https://doi.org/10.1371/journal.pone.0340977.t003

As shown in the experimental results, the step-by-step introduction of each module leads to varying degrees of performance improvement. Experiment A serves as a reference with the unmodified baseline and is used for comparative analysis with all other experiments. Firstly, integrating DMSFM significantly enhances the model’s recall and average precision, particularly on the VisDrone-DET dataset, where mAP@0.5 improves by 1.7%. This improvement is attributed to the module’s ability to capture multi-scale semantic information across different receptive fields, thereby enhancing the feature representation of small objects. Secondly, replacing ResNet18 with LFEB yields a better trade-off between accuracy and efficiency on all three datasets. For instance, on NWPU VHR-10, mAP@0.5 increases by 0.7%, while the parameters are reduced by 3.09M, and GFLOPs are decreased by 7.5. This indicates that the residual structure and partial convolution strategies effectively retain spatial detail while reducing redundant computations. Furthermore, incorporating AEFN further improves the precision of small object localization. On the URPC2020 dataset, mAP@0.5:0.95 increases by 0.6%, demonstrating that the channel attention mechanism enhances the fusion of semantic and spatial information between shallow and intermediate layers, thereby improving the recognizability of small objects. More importantly, the joint use of two modules consistently outperforms individual configurations, indicating the structural complementarity among the proposed components. On the URPC2020 dataset, the combination of LFEB and DMSFM achieves mAP@0.5 and mAP@0.5:0.95 scores of 84.4% and 50.0%, respectively, surpassing either module’s performance alone. This improvement stems from the lightweight architecture of LFEB, which effectively preserves key spatial information, and the multi-scale semantic modeling capability of DMSFM under varying receptive fields. 
Their integration enables the preservation of spatial details while enhancing high-level semantic representation. On the VisDrone-DET dataset, the combination of LFEB and AEFN attains the highest mAP@0.5 among dual-module configurations (50.4%), whereas DMSFM combined with LFEB offers a balanced trade-off, achieving competitive accuracy with reduced parameters (16.80M). Similarly, on the NWPU VHR-10 dataset, the combination of DMSFM and AEFN yields a mAP@0.5 of 91.1%, outperforming the results of using DMSFM (90.6%) or AEFN (90.7%) individually. In this configuration, DMSFM enhances semantic expressiveness through scale-aware feature aggregation, which is suitable for objects of varying sizes, while AEFN employs an attention mechanism to fuse position-sensitive shallow features with semantically enriched intermediate representations. This integration facilitates precise localization of small objects while maintaining semantic consistency, ultimately leading to more robust small object detection performance. Finally, when all three modules are jointly employed, the overall detection performance reaches its optimal level. On the URPC2020 dataset, mAP@0.5 increases to 84.5%, and mAP@0.5:0.95 reaches 50.2%, while the number of parameters is reduced to 16.92M and GFLOPs to 49.7. On the NWPU VHR-10 dataset, mAP@0.5 reaches 91.4%, marking a 1.4% improvement over the baseline. On the VisDrone-DET dataset, Experiment H achieves mAP@0.5 of 50.6% and mAP@0.5:0.95 of 31.6%, improving precision to 64.3%. Although these absolute values appear lower than those on URPC2020 or NWPU VHR-10 due to the intrinsic difficulty of the dataset, such as extreme object density, scale variation, and severe category imbalance, they represent a significant improvement over the baseline, increasing mAP@0.5 from 47.0% to 50.6% and mAP@0.5:0.95 from 28.7% to 31.6%. This consistent gain demonstrates the effectiveness of the proposed modules in handling complex scenarios.

These results indicate that the three modules are highly complementary: LFEB preserves low-level spatial details, DMSFM enhances multi-scale semantic aggregation, and AEFN strengthens the fusion of positional and semantic cues. Their joint integration produces synergistic effects beyond simple additive gains. Furthermore, the consistent improvements across all three datasets–characterized by diverse object categories and complex visual scenes–demonstrate that LDA-DETR with the proposed modules achieves strong generalization and robustness under stringent evaluation benchmarks.

Hyperparameter sensitivity analysis

To further evaluate the robustness of the proposed design, we performed ablation studies on the key hyperparameters of SRFM and DMSFM. In particular, we investigated the impact of the partial convolution ratio in SRFM and the branch-weight configurations in DMSFM, with the results summarized in Tables 4 and 5.

Table 4. Hyperparameter analysis of the PConv ratio in SRFM on the URPC2020 dataset.

https://doi.org/10.1371/journal.pone.0340977.t004

Table 5. Hyperparameter analysis of branch-weight settings in DMSFM on the URPC2020 dataset.

https://doi.org/10.1371/journal.pone.0340977.t005

Table 4 presents the results under different PConv ratios, ranging from 1/2 to 1/10. As the ratio decreases, parameter count and FLOPs are consistently reduced, confirming the lightweight nature of the design. Detection performance remains stable across all configurations, with mAP@0.5 varying within 0.6% and mAP@0.5:0.95 within 0.9%. The default ratio 1/4 achieves the best trade-off, yielding 84.5% mAP@0.5 and 50.2% mAP@0.5:0.95, while reducing the parameter count to 16.92M and FLOPs to 49.7. These results demonstrate that SRFM consistently sustains strong detection accuracy under varying levels of lightweight compression, thereby highlighting the robustness of the proposed partial convolution strategy.

Table 5 presents the ablation study on branch-weight configurations in DMSFM. Each branch weight vector corresponds, from left to right, to six parallel branches: the main K × K convolution, the 1 × 1 convolution, two 1 × 1–AvgPooling branches, and two hybrid 1 × 1–convolution branches.

Configurations are defined as follows:

  • A1 [1,1,1,1,1,1]: Equal-weight baseline.
  • A2 [1,0,0,0,0,0]: Retains only the main branch, used to assess the necessity of auxiliary branches.
  • A3 [2,2,0.5,0.5,1,1]: Emphasizes the main and 1 × 1 convolutions (× 2); suppresses pooling (× 0.5); hybrid branches unchanged.
  • A4 [1,1,2,2,0.5,0.5]: Emphasizes pooling (× 2); suppresses hybrid branches (× 0.5); others unchanged.
  • A5 [2,1,1,1,1,1]: Applies a mild bias toward the main branch.
  • A6 [1,1,1,1,1,1]*: Learns weights adaptively, initialized equally (default setting).

Although the fixed-weight settings (A2–A5) achieve comparable accuracy, their results vary slightly depending on which branches are emphasized or suppressed. In contrast, the learnable configuration A6 achieves the best performance, indicating that adaptive optimization of branch contributions is more effective than manual weighting. This confirms the necessity of multi-branch fusion and the robustness of the proposed learnable weight allocation.
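A hypothetical sketch of the learnable-weight configuration A6 is shown below; the branch definitions are simplified placeholders rather than the actual DMSFM branches:

```python
import torch
import torch.nn as nn

# Six parallel branches combined with learnable scalar weights,
# initialized to the equal-weight baseline A1 (this is the A6 idea).
class WeightedBranches(nn.Module):
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2),  # main KxK
            nn.Conv2d(channels, channels, 1),                  # 1x1
            nn.AvgPool2d(k, stride=1, padding=k // 2),         # pooling
            nn.AvgPool2d(k, stride=1, padding=k // 2),         # pooling
            nn.Conv2d(channels, channels, 1),                  # hybrid (simplified)
            nn.Conv2d(channels, channels, 1),                  # hybrid (simplified)
        ])
        # Learnable branch weights, optimized jointly with the network.
        self.alpha = nn.Parameter(torch.ones(len(self.branches)))

    def forward(self, x):
        return sum(a * b(x) for a, b in zip(self.alpha, self.branches))

x = torch.randn(1, 16, 8, 8)
y = WeightedBranches(16)(x)
assert y.shape == x.shape
```

Fixed configurations such as A2–A5 would correspond to freezing `alpha` at the listed vectors instead of learning it.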

Quantitative analysis

To verify whether LDA-DETR can achieve better detection performance, this section compares LDA-DETR with several models across four datasets: URPC2020, RSOD, NWPU VHR-10, and VisDrone-DET. The competing methods are grouped into three categories: (1) classical baselines, including Faster R-CNN [1], SSD [4], EfficientDet [6], Sig-NMS [43], and MobileNetV2 [29]; (2) widely adopted mainstream detectors, such as YOLOv7 [5], YOLOv8, Cascade R-CNN [3], Sparse R-CNN [44], FCOS [45], and YOLOX [46]; and (3) recent state-of-the-art approaches, represented by DN-DETR [14], Group-DETR [13], DINO [17], CANet [24], SuperYOLO [47], and Swin Transformer [48]. This categorization ensures the evaluation involves traditional reference models, widely adopted strong baselines, and the latest frontier methods.

Comparative analysis on the RSOD dataset: The effectiveness of the proposed model is evaluated by comparing its performance with existing methods, as shown in Table 6. For ease of identification, the best results are highlighted in bold black and the second-best in bold green. The proposed model achieves a detection accuracy of 97.2% for the oil-tank class, a 1.3% improvement over RT-DETR, and a 4.5% improvement for the overpass class. It further improves mAP@0.5 by 5.1% and 5.7% over transformer-based models such as MobileVit [42] and DETR [9], respectively, and by 0.9% relative to the baseline RT-DETR. In addition, a comparison in terms of mAP, FPS, and parameters is provided in Table 7. The results show that LDA-DETR achieves the highest mAP@0.5 of 95.3%, which is 1.65% higher than the second-best MobileNetV2 [29]. Owing to its lightweight design, the model reduces parameters to 16.93M while sustaining an FPS of 65.6. This combination of accuracy and efficiency highlights the competitiveness of the proposed method, surpassing most existing approaches.

Table 6. Performance comparison by category on the RSOD dataset.

https://doi.org/10.1371/journal.pone.0340977.t006

Comparison on the NWPU VHR-10 dataset: The performance of the proposed model is compared with existing methods, as shown in Table 8. For ease of identification, the best results are highlighted in bold black and the second-best in bold green. The results show that the proposed model achieves leading performance in 5 out of 10 categories, with particularly notable gains in the Ground Track Field category, where detection accuracy reaches 100%, representing a 12.5% improvement over EfficientDet [6]. Significant advantages are also observed in the Tennis Court and Vehicle categories. Specifically, in the Tennis Court category, the detection accuracy of LDA-DETR is 3.1% higher than RT-DETR, 2.7% higher than DAB-DETR [50], and 18.3% higher than YOLOv8. In the Vehicle category, it surpasses PR-Deformable DETR [49] by 8.3%, EfficientDet by 8.7%, and Deformable DETR [11] by 20.7%. Overall, LDA-DETR achieves the highest mAP@0.5 of 91.4%. Furthermore, comparisons in terms of mAP, FPS, and parameters, as presented in Table 9, confirm the significant competitive advantage of the proposed model.

Table 8. Performance comparison by category on the NWPU VHR-10 dataset.

https://doi.org/10.1371/journal.pone.0340977.t008

Table 9. Comparative experiments on the NWPU VHR-10 dataset.

https://doi.org/10.1371/journal.pone.0340977.t009

Comparison on the URPC2020 dataset: The proposed model achieves superior performance over several advanced methods, as shown in Table 10. For ease of identification, the best results are highlighted in bold black and the second-best in bold green. LDA-DETR attains an mAP@0.5:0.95 of 50.2%, ranking first, tied with Swin Transformer [48]. In terms of mAP@0.5, it reaches 84.8%, exceeding PVTv2 [56] by 1.6%. Compared with other DETR-based models, LDA-DETR also demonstrates clear advantages. In the scale-specific evaluation, it achieves the highest small-object accuracy (35.5%), exceeding Deformable DETR [11] by 10.1%. It also improves medium-object detection by 4.3%, underscoring its robustness in handling objects of varying scales.

Table 10. Comparative experiments on the URPC2020 dataset.

https://doi.org/10.1371/journal.pone.0340977.t010

Comparison on the VisDrone-DET dataset: The performance of the proposed model is evaluated against existing methods, as shown in Table 11. For ease of identification, the best results are highlighted in bold black and the second-best in bold green. LDA-DETR achieves the best accuracy on small- and medium-scale targets, with APs of 21.4% and APm of 42.1%, surpassing all competing approaches. Furthermore, LDA-DETR requires substantially lower computational cost, with 49.7 GFLOPs compared to 120.3 GFLOPs for Cascade R-CNN [3] and 547.2 GFLOPs for Cascade ADPN [58]. Moreover, LDA-DETR outperforms QueryDet [59] by 1.6% in APs, underscoring its superior capability to detect small objects.

Table 11. Comparative experiments on the VisDrone-DET dataset.

https://doi.org/10.1371/journal.pone.0340977.t011

Notably, LDA-DETR maintains high accuracy in object localization and classification while requiring fewer parameters. This balance enables the model to achieve an optimal trade-off between performance and complexity, underscoring its suitability for practical applications. Furthermore, LDA-DETR demonstrates clear advantages for small and medium objects in the datasets with scale-specific evaluation. On URPC2020, it achieves the best APs (35.5%) and APm (47.2%) among all competing methods in Table 10. On the VisDrone-DET dataset, although the absolute performance metrics are generally lower due to the intrinsic difficulty of the benchmark, LDA-DETR achieves a notable improvement in small object detection. Specifically, it improves APs to 21.4%, which is 3.2% higher than the baseline RT-DETR, and attains the best APm (42.1%). These results indicate that the proposed modules are particularly effective in enhancing small- and medium-object representations, achieving a balanced trade-off between accuracy and efficiency.

Qualitative analysis

Classification performance and training behavior.

Fig 9 shows the normalized confusion matrices of LDA-DETR on the URPC2020, NWPU VHR-10, RSOD, and VisDrone-DET datasets. The diagonal elements indicate the normalized classification accuracy of each category, with darker colors representing higher accuracy. However, some categories still exhibit confusion due to visual similarity or background interference. Such misclassifications are common in complex underwater, aerial, and UAV scenes.
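A minimal sketch of the row normalization typically used for such confusion matrices is given below, assuming Fig 9 normalizes each row by its ground-truth count; the counts here are made up:

```python
import numpy as np

# Row-normalize a raw confusion matrix so that diagonal entries read as
# per-class accuracy (each row divided by that class's ground-truth total).
cm = np.array([[90, 10, 0],
               [5, 80, 15],
               [0, 20, 80]], dtype=float)
cm_norm = cm / cm.sum(axis=1, keepdims=True)
print(np.round(np.diag(cm_norm), 2))   # per-class accuracies, e.g. 0.9, 0.8, 0.8
```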

Fig 10 compares the training accuracy of LDA-DETR, RT-DETR-r18, and three YOLO variants (YOLOv10-m, YOLOv11-m, and YOLOv13-l) over 300 epochs. In Fig 10(a) and 10(b), corresponding to mAP@0.5 and mAP@0.5:0.95, LDA-DETR exhibits a smooth and stable convergence trajectory, ultimately attaining the best accuracy. RT-DETR-r18 also converges steadily but remains inferior to LDA-DETR. In contrast, the YOLO-series models rise rapidly during the early epochs but plateau prematurely, showing limited improvement in the later stages and ultimately lower performance under both metrics. These results demonstrate that LDA-DETR achieves superior convergence stability and detection accuracy compared to Transformer- and CNN-based counterparts.

Fig 10. Training performance comparison of LDA-DETR, RT-DETR-r18, and three YOLO variants (YOLOv10-m, YOLOv11-m, and YOLOv13-l) on the NWPU VHR-10 dataset: (a) mAP@0.5; (b) mAP@0.5:0.95.

https://doi.org/10.1371/journal.pone.0340977.g010

Regarding training efficiency, LDA-DETR requires approximately 15 seconds per epoch, compared with 9 seconds for RT-DETR-r18. Although the per-epoch time is moderately higher, improved convergence stability and superior accuracy compensate for this cost. These results indicate that LDA-DETR balances computational overhead and performance well, delivering stable optimization without imposing a substantial training burden.

Comparison of heat map visualizations.

As shown in Fig 11, the attention heatmaps of RT-DETR-r18 and the proposed LDA-DETR are visualized and compared on the NWPU VHR-10 dataset. The first column presents the original image, the second column displays the heatmap from RT-DETR-r18, and the third column shows the heatmap from LDA-DETR. Warmer colors (e.g., red and yellow) denote higher attention weights, corresponding to regions where the model exhibits greater confidence in object detection. Compared with RT-DETR-r18, LDA-DETR produces more concentrated and accurately localized activation regions, particularly in complex background scenarios such as ground track fields and bridges, where the baseline model tends to generate more diffuse responses. For small or densely distributed targets, such as airplanes and storage tanks, LDA-DETR effectively emphasizes object regions while suppressing irrelevant background activations, enhancing detection robustness. In several instances (e.g., storage tanks within complex background environments), where RT-DETR-r18 fails to detect targets, LDA-DETR still generates strong activations at the correct locations, demonstrating its superior robustness and localization capability.

Fig 11. Attention heatmap comparisons between RT-DETR-r18 (second column) and LDA-DETR (third column) on the NWPU VHR-10 dataset.

The first column shows the original images. Warmer colors (e.g., red and yellow) denote higher attention weights. LDA-DETR exhibits more concentrated and accurately localized activations compared with the more diffuse responses of RT-DETR-r18, particularly for small or densely distributed targets.

https://doi.org/10.1371/journal.pone.0340977.g011

Progressive visualization of module contributions.

To better demonstrate the contribution of each proposed component to small object localization and representation, Fig 12 presents attention heatmaps showing the progressive integration of the LFEB, DMSFM, and AEFN modules. The experiments were conducted on the NWPU VHR-10 dataset. The baseline RT-DETR (Fig 12(b)) exhibits scattered or weak activations, resulting in incomplete object coverage. After incorporating LFEB (Fig 12(c)), the network generates more compact and sharper responses in local regions, benefiting from the residual structures and partial convolutions. These designs enhance gradient flow and improve feature representation. With the integration of DMSFM (Fig 12(d)), attention becomes more coherent and contextually aware, effectively capturing both object boundaries and surrounding background cues through the fusion of multi-scale features. Finally, the AEFN module (Fig 12(e)) introduces spatial-channel attention refinements, allowing the model to suppress interference better and emphasize subtle object regions.

Fig 12. Visualization of attention heatmaps showing the incremental contributions of LFEB, DMSFM, and AEFN to LDA-DETR.

(a) Original image. (b) Baseline RT-DETR (ResNet-18). (c) +LFEB. (d) +DMSFM. (e) +AEFN (full LDA-DETR). These progressive visualizations illustrate how each module enhances small-object localization and background suppression on the NWPU VHR-10 dataset.

https://doi.org/10.1371/journal.pone.0340977.g012

Visualization of improvement effects.

To verify the improvements over the baseline network, we present side-by-side visual comparisons on the NWPU VHR-10 and URPC2020 datasets (Fig 13). White ovals indicate the missed regions in the baseline results. On NWPU VHR-10, LDA-DETR successfully detects bridge and tennis court instances missed by RT-DETR-r18 in (a)→(b). On URPC2020, LDA-DETR accurately identifies holothurian and starfish overlooked by the baseline in (c)→(d). These qualitative improvements are consistent with the quantitative findings: on NWPU VHR-10, LDA-DETR achieves mAP@0.5 scores of 90.4% for tennis court and 85.5% for bridge, compared to 87.3% and 84.9% for RT-DETR-r18. On URPC2020, it increases the overall mAP@0.5 from 83.4% to 84.5% and boosts AP for small objects to 35.5%, indicating stronger recall for dense small targets.

Fig 13. Qualitative detection results on NWPU VHR-10 and URPC2020.

Left: baseline; right: LDA-DETR. White ovals mark missed detections. (a–b) NWPU VHR-10: LDA-DETR detects bridge and tennis court missed by RT-DETR-r18. (c–d) URPC2020: LDA-DETR detects holothurian and starfish missed by the baseline.

https://doi.org/10.1371/journal.pone.0340977.g013

Overall, LDA-DETR demonstrates enhanced sensitivity to small objects and greater robustness to background clutter, effectively reducing missed detections and expanding class coverage.

Failure case analysis and method limitations.

As illustrated in the representative failure cases of Fig 14, the limitations of the proposed method can be summarized into three main aspects. Firstly, the model exhibits difficulty in detecting extremely small, distant, or partially occluded objects. In particular, targets smaller than 10 px often fall below the effective receptive field, making them especially prone to missed detections. Although the DMSFM module introduces multi-scale receptive fields to enrich representations at different levels, it cannot fully offset the resolution loss in deeper layers or the limited semantic abstraction in shallow layers. Consequently, fine-grained cues under low contrast or occlusion are frequently attenuated, leading to missed detections such as tiny aircraft in RSOD or pedestrians in VisDrone. Secondly, false detections commonly arise in scenes with repetitive textures or complex artificial structures. In AEFN, shallow features are enhanced by CHM and fused with intermediate FPN features to achieve detail–semantic integration. However, this process remains predominantly channel-driven and lacks explicit spatial–semantic suppression. As a result, high-frequency backgrounds–such as circular landscaping in RSOD, industrial facilities in NWPU, or underwater seaweed and rocks in URPC–are easily misactivated as targets, thereby increasing false positives. In dense urban scenarios, this limitation also causes confusion between small objects and surrounding structures, for example, misclassifying traffic booths as bicycles or kiosks as trucks in VisDrone. Furthermore, the LFEB backbone employs a Conv Stem followed by SRFM stages, where partial convolution and residual connections are combined to reduce redundant computation and model parameters while maintaining gradient flow. 
However, this simplification occasionally suppresses critical fine-grained signals, resulting in misclassification between visually similar categories–for example, confusing sea snails with echinus or seaweed with holothurians in the URPC dataset. These limitations underscore the need for future refinement, including resolution-adaptive fusion, spatial–semantic background suppression, and detail-preserving feature enhancement.

Fig 14. Representative failure cases of the proposed method across four datasets: (a) RSOD, (b) VisDrone-DET, (c) NWPU VHR-10, and (d) URPC2020.

Purple boxes and arrows highlight missed detections of extremely small or occluded targets (e.g., tiny aircraft, pedestrians) and false activations of background structures or visually confusing categories (e.g., circular landscaping, kiosks, docks, seaweed, and rocks).

https://doi.org/10.1371/journal.pone.0340977.g014

Conclusions

This paper focuses on efficient small object detection and proposes a lightweight dynamic attention-enhanced DETR (LDA-DETR) to address the challenges of detecting small and densely distributed objects under complex scenarios. Firstly, a Lightweight Feature Extraction Backbone (LFEB) is designed, which enhances gradient flow and reduces the model’s parameter count through residual structures and partial convolution operations. Secondly, a Dynamic Multi-Scale Fusion Module (DMSFM) is constructed. This module uses reparameterization techniques to fuse features from multiple branches into a single main branch, significantly enhancing the model’s capability and performance without adding to inference time. Finally, an Attention-Enhanced Fusion Network is proposed, integrating feature enhancement and fusion optimization techniques to improve small object detection performance further.

Ablation experiments confirm the effectiveness of the LFEB, DMSFM, and AEFN modules, and the overall performance of the final LDA-DETR network is validated on four benchmark datasets. Compared with the original RT-DETR model, LDA-DETR achieves a mAP@0.5 of 84.5% on the URPC2020 dataset, representing a 1.1% improvement while reducing the parameter count by 2.96M and GFLOPs by 7.3. On the NWPU VHR-10 dataset, the model attains a mAP@0.5 of 91.4%, a gain of 1.4%. On the RSOD dataset, it achieves 95.3% mAP@0.5, surpassing the baseline by 0.9%. On the VisDrone-DET dataset, LDA-DETR reaches 50.6% mAP@0.5, an improvement of 1.2% over RT-DETR, and demonstrates clear advantages on small and medium objects. Overall, the proposed LDA-DETR consistently outperforms the baseline and most contemporary detectors in terms of detection accuracy, model efficiency, and computational cost. These results indicate that LDA-DETR is particularly well suited for small object detection in practical application domains, including aerial and UAV-based scenarios, underwater environments, and remote sensing or real-world industrial scenes, where high accuracy and real-time performance are both required.

In future research, several promising directions are worth exploring: (1) Incorporating multimodal information to improve robustness in adverse conditions such as low-light or occluded environments; (2) Embedding task-specific priors, including geometric constraints or scene graphs, into the transformer decoder to enhance contextual reasoning and reduce false positives; and (3) Extending the framework to video-based detection and tracking by leveraging temporal context modeling for improved performance in dynamic scenes.

References

  1. Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2017;39(6):1137–49. pmid:27295650
  2. He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. In: Proceedings of the IEEE international conference on computer vision; 2017. p. 2961–9.
  3. Cai Z, Vasconcelos N. Cascade R-CNN: Delving into high quality object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 6154–62.
  4. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, et al. SSD: Single shot multibox detector. In: Computer vision – ECCV 2016: 14th European conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I; 2016. p. 21–37.
  5. Wang C-Y, Bochkovskiy A, Liao H-YM. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: 2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR); 2023. p. 7464–75. https://doi.org/10.1109/cvpr52729.2023.00721
  6. Tan M, Pang R, Le QV. EfficientDet: Scalable and efficient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2020. p. 10781–90.
  7. Lin T-Y, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision; 2017. p. 2980–8.
  8. Wang W, Dai J, Chen Z, Huang Z, Li Z, Zhu X, et al. InternImage: Exploring large-scale vision foundation models with deformable convolutions. In: 2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR); 2023. p. 14408–19. https://doi.org/10.1109/cvpr52729.2023.01385
  9. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In: European conference on computer vision. Springer; 2020. p. 213–29.
  10. Shehzadi T, Hashmi KA, Liwicki M, Stricker D, Afzal MZ. Object detection with transformers: A review. Sensors (Basel). 2025;25(19):6025. pmid:41094848
  11. Zhu X, Su W, Lu L, Li B, Wang X, Dai J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint. 2020.
  12. Gao P, Zheng M, Wang X, Dai J, Li H. Fast convergence of DETR with spatially modulated co-attention. In: Proceedings of the IEEE/CVF international conference on computer vision; 2021. p. 3621–30.
  13. Chen Q, Chen X, Wang J, Zhang S, Yao K, Feng H, et al. Group DETR: Fast DETR training with group-wise one-to-many assignment. In: Proceedings of the IEEE/CVF international conference on computer vision; 2023. p. 6633–42.
  14. Li F, Zhang H, Liu S, Guo J, Ni LM, Zhang L. DN-DETR: Accelerate DETR training by introducing query denoising. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2022. p. 13619–27.
  15. Jia D, Yuan Y, He H, Wu X, Yu H, Lin W, et al. DETRs with hybrid matching. In: 2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR); 2023. p. 19702–12. https://doi.org/10.1109/cvpr52729.2023.01887
  16. Zhao Y, Lv W, Xu S, Wei J, Wang G, Dang Q. DETRs beat YOLOs on real-time object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2024. p. 16965–74.
  17. Zhang H, Li F, Liu S, Zhang L, Su H, Zhu J. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv preprint; 2022.
  18. Dong H, Yuan M, Wang S, Zhang L, Bao W, Liu Y, et al. PHAM-YOLO: A parallel hybrid attention mechanism network for defect detection of meter in substation. Sensors (Basel). 2023;23(13):6052. pmid:37447900
  19. Xu L, Dong S, Wei H, Ren Q, Huang J, Liu J. Defect signal intelligent recognition of weld radiographs based on YOLO V5-IMPROVEMENT. J Manuf Process. 2023;99:373–81.
  20. Sun W, Dai L, Zhang X, Chang P, He X. RSOD: Real-time small object detection algorithm in UAV-based traffic monitoring. Appl Intell. 2021;52(8):8448–63.
  21. Zhao C, Shu X, Yan X, Zuo X, Zhu F. RDD-YOLO: A modified YOLO for detection of steel surface defects. Measurement. 2023;214:112776.
  22. Woo S, Park J, Lee JY, Kweon IS. CBAM: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV); 2018. p. 3–19.
  23. Hou Q, Zhou D, Feng J. Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021. p. 13713–22.
  24. Shi L, Kuang L, Xu X, Pan B, Shi Z. CANet: Centerness-aware network for object detection in remote sensing images. IEEE Trans Geosci Remote Sensing. 2022;60:1–13.
  25. Wu Y, Li J. YOLOv4 with deformable-embedding-transformer feature extractor for exact object detection in aerial imagery. Sensors (Basel). 2023;23(5):2522. pmid:36904727
  26. Nguyen H, Ngo TQ, Uyen HTT, Duong MK. Enhanced object recognition from remote sensing images based on hybrid convolution and transformer structure. Earth Sci Inform. 2025;18(2).
  27. Berahmand K, Daneshfar F, Salehi ES, Li Y, Xu Y. Autoencoders and their applications in machine learning: a survey. Artif Intell Rev. 2024;57(2).
  28. Sajjadi Mohammadabadi SM, Entezami M, Karimi Moghaddam A, Orangian M, Nejadshamsi S. Generative artificial intelligence for distributed learning to enhance smart grid communication. Int J Intell Netw. 2024;5:267–74.
  29. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C. MobileNetV2: Inverted residuals and linear bottlenecks. In: 2018 IEEE/CVF conference on computer vision and pattern recognition; 2018. p. 4510–20. https://doi.org/10.1109/cvpr.2018.00474
  30. Zhang X, Zhou X, Lin M, Sun J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 6848–56.
  31. Han K, Wang Y, Tian Q, Guo J, Xu C, Xu C. GhostNet: More features from cheap operations. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2020. p. 1580–9.
  32. Peng Z, Huang W, Gu S, Xie L, Wang Y, Jiao J, et al. Conformer: Local features coupling global representations for visual recognition. In: Proceedings of the IEEE/CVF international conference on computer vision; 2021. p. 367–76.
  33. Zhang YM, Lee CC, Hsieh JW, Fan KC. CSL-YOLO: A new lightweight object detection system for edge computing. arXiv preprint arXiv:2107.04829. 2021.
  34. Yue X, Meng L. YOLO-SM: A lightweight single-class multi-deformation object detection network. IEEE Trans Emerg Top Comput Intell. 2024;8(3):2467–80.
  35. Yi H, Liu B, Zhao B, Liu E. Small object detection algorithm based on improved YOLOv8 for remote sensing. IEEE J Sel Top Appl Earth Observations Remote Sensing. 2024;17:1734–47.
  36. Yang L, Gu Y, Feng H. Multi-scale feature fusion and feature calibration with edge information enhancement for remote sensing object detection. Sci Rep. 2025;15(1):15371. pmid:40316719
  37. Nie J, Pang Y, Zhao S, Han J, Li X. Efficient selective context network for accurate object detection. IEEE Trans Circuits Syst Video Technol. 2021;31(9):3456–68.
  38. Jiang L, Yuan B, Du J, Chen B, Xie H, Tian J, et al. MFFSODNet: Multiscale feature fusion small object detection network for UAV aerial images. IEEE Trans Instrum Meas. 2024;73:1–14.
  39. Ding X, Zhang X, Ma N, Han J, Ding G, Sun J. RepVGG: Making VGG-style ConvNets great again. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021. p. 13733–42.
  40. Ding X, Guo Y, Ding G, Han J. ACNet: Strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks. In: 2019 IEEE/CVF international conference on computer vision (ICCV); 2019. p. 1911–20. https://doi.org/10.1109/iccv.2019.00200
  41. Ding X, Zhang X, Han J, Ding G. Diverse branch block: Building a convolution as an inception-like unit. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021. p. 10886–95.
  42. Mehta S, Rastegari M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint. 2021.
  43. Dong R, Xu D, Zhao J, Jiao L, An J. Sig-NMS-based faster R-CNN combining transfer learning for small target detection in VHR optical remote sensing imagery. IEEE Trans Geosci Remote Sensing. 2019;57(11):8534–45.
  44. Sun P, Zhang R, Jiang Y, Kong T, Xu C, Zhan W, et al. Sparse R-CNN: An end-to-end framework for object detection. IEEE Trans Pattern Anal Mach Intell. 2023;45(12):15650–64. pmid:37402189
  45. Tian Z, Shen C, Chen H, He T. FCOS: A simple and strong anchor-free object detector. IEEE Trans Pattern Anal Mach Intell. 2022;44(4):1922–33. pmid:33074804
  46. Ge Z. YOLOX: Exceeding YOLO series in 2021. arXiv preprint. 2021.
  47. Zhang J, Lei J, Xie W, Fang Z, Li Y, Du Q. SuperYOLO: Super resolution assisted object detection in multimodal remote sensing imagery. IEEE Trans Geosci Remote Sensing. 2023;61:1–15.
  48. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision; 2021. p. 10012–22.
  49. Chen Y, Liu B, Yuan L. PR-Deformable DETR: DETR for remote sensing object detection. IEEE Geosci Remote Sensing Lett. 2024.
  50. Liu S, Li F, Zhang H, Yang X, Qi X, Su H, et al. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv preprint arXiv:2201.12329. 2022.
  51. Wang S, Chen Y, Liu B. DLA-Deformable DETR: Dynamic layer assignment deformable DETR for remote sensing images. In: 2025 5th international conference on consumer electronics and computer engineering (ICCECE). IEEE; 2025. p. 737–40.
  52. Xi LH, Hou JW, Ma GL, Hei YQ, Li WT. A multiscale information fusion network based on PixelShuffle integrated with YOLO for aerial remote sensing object detection. IEEE Geosci Remote Sensing Lett. 2024;21:1–5.
  53. Zhang H, Chang H, Ma B, Wang N, Chen X. Dynamic R-CNN: Towards high quality object detection via dynamic training. In: Computer vision – ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV; 2020. p. 260–75.
  54. Li Y, Li X, Dai Y, Hou Q, Liu L, Liu Y, et al. LSKNet: A foundation lightweight backbone for remote sensing. Int J Comput Vis. 2024;133(3):1410–31.
  55. Zhu X, Lyu S, Wang X, Zhao Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In: 2021 IEEE/CVF international conference on computer vision workshops (ICCVW); 2021. p. 2778–88. https://doi.org/10.1109/iccvw54120.2021.00312
  56. Wang W, Xie E, Li X, Fan D-P, Song K, Liang D, et al. PVT v2: Improved baselines with pyramid vision transformer. Comp Visual Med. 2022;8(3):415–24.
  57. Chen Q, Wang Y, Yang T, Zhang X, Cheng J, Sun J. You only look one-level feature. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021. p. 13039–48.
  58. Zhang R, Shao Z, Huang X, Wang J, Wang Y, Li D. Adaptive dense pyramid network for object detection in UAV imagery. Neurocomputing. 2022;489:377–89.
  59. Yang C, Huang Z, Wang N. QueryDet: Cascaded sparse query for accelerating high-resolution small object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2022. p. 13668–77.
  60. Du B, Huang Y, Chen J, Huang D. Adaptive sparse convolutional networks with global context enhancement for faster object detection on drone images. In: 2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR); 2023. p. 13435–44. https://doi.org/10.1109/cvpr52729.2023.01291
  61. Qi S, Song X, Shang T, Hu X, Han K. MSFE-YOLO: An improved YOLOv8 network for object detection on drone view. IEEE Geosci Remote Sensing Lett. 2024;21:1–5.
  62. Hou W, Wu H, Wu D, Shen Y, Liu Z, Zhang L, et al. Small object detection method for UAV remote sensing images based on αS-YOLO. IEEE J Sel Top Appl Earth Observ Remote Sensing. 2025;18:8984–94.