
MSRRT-DETR: A high-precision apple detection method with strong cross-domain generalization capability in complex orchard scenes

  • Xinyu Zhang,

    Roles Conceptualization, Data curation, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft

    Affiliations College of Geography and Remote Sensing Sciences, Xinjiang University, Urumqi, China, Key Laboratory of Coupling Process and Effect of Natural Resources Elements, Ministry of Natural Resources, Beijing, China

  • Sawut Mamat ,

    Roles Conceptualization, Methodology, Writing – review & editing

    korxat@xju.edu.cn (SM); 15313256806@163.com (XHL)

    Affiliations College of Geography and Remote Sensing Sciences, Xinjiang University, Urumqi, China, Xinjiang Key Laboratory of Oasis Ecology, Xinjiang University, Urumqi, China

  • Xiaohuang Liu ,

    Roles Conceptualization, Funding acquisition, Project administration, Resources, Writing – review & editing

    korxat@xju.edu.cn (SM); 15313256806@163.com (XHL)

    Affiliations Key Laboratory of Coupling Process and Effect of Natural Resources Elements, Ministry of Natural Resources, Beijing, China, Xinjiang Altay Natural Resources Element Field Observation Research Station, Altay, China

  • Jiufen Liu,

    Roles Formal analysis, Writing – review & editing

    Affiliations Key Laboratory of Coupling Process and Effect of Natural Resources Elements, Ministry of Natural Resources, Beijing, China, Xinjiang Altay Natural Resources Element Field Observation Research Station, Altay, China

  • Run Liu,

    Roles Data curation, Formal analysis

    Affiliations Key Laboratory of Coupling Process and Effect of Natural Resources Elements, Ministry of Natural Resources, Beijing, China, Xinjiang Altay Natural Resources Element Field Observation Research Station, Altay, China

  • Guangjie Wu,

    Roles Data curation, Methodology, Software, Supervision, Visualization, Writing – review & editing

    Affiliations Key Laboratory of Coupling Process and Effect of Natural Resources Elements, Ministry of Natural Resources, Beijing, China, Xinjiang Altay Natural Resources Element Field Observation Research Station, Altay, China

  • Ping Zhu,

    Roles Data curation, Investigation

    Affiliations Key Laboratory of Coupling Process and Effect of Natural Resources Elements, Ministry of Natural Resources, Beijing, China, Xinjiang Altay Natural Resources Element Field Observation Research Station, Altay, China

  • Hongyu Li,

    Roles Data curation, Formal analysis, Validation

    Affiliations Key Laboratory of Coupling Process and Effect of Natural Resources Elements, Ministry of Natural Resources, Beijing, China, Xinjiang Altay Natural Resources Element Field Observation Research Station, Altay, China

  • Min Ma,

    Roles Data curation, Formal analysis

    Affiliations Key Laboratory of Coupling Process and Effect of Natural Resources Elements, Ministry of Natural Resources, Beijing, China, Xinjiang Altay Natural Resources Element Field Observation Research Station, Altay, China

  • Xiaotong Liu

    Roles Data curation, Validation

    Affiliations Key Laboratory of Coupling Process and Effect of Natural Resources Elements, Ministry of Natural Resources, Beijing, China, Xinjiang Altay Natural Resources Element Field Observation Research Station, Altay, China

Abstract

Accurate fruit detection is a key component of precision agriculture applications such as crop yield estimation, orchard management, and intelligent harvesting. In scenarios where immature fruits exhibit visual similarity to the background or where significant varietal differences exist, traditional models often lack sufficient generalization ability, resulting in reduced detection accuracy and unstable predictions. To address this problem, this paper proposes a fruit detection model, MSRRT-DETR, which achieves a balance of high accuracy, real-time performance, and strong generalization capability. To improve detection accuracy and robustness in complex orchard environments, MSRRT-DETR introduces three major enhancements to the RT-DETR framework: a Multi-Scale Convolutional Attention Module (MSBlock) to enhance feature representation at different scales; a Spatial and Channel Synergistic Attention Module (SCSA) to improve object focus and discriminative capability; and a Re-parameterized Feature Pyramid Network (RepGFPN) to achieve efficient multi-scale feature fusion. Experimental results show that MSRRT-DETR achieves a mAP50 of 87.3% on the self-constructed TSApple dataset, outperforming mainstream lightweight models YOLOv8, YOLO11, and YOLO12 by 2.0–7.9 percentage points, exceeding two-stage detectors including Faster R-CNN, Mask R-CNN, and Cascade R-CNN by 5.1–8.6 percentage points, and surpassing the RT-DETR series by 1.1–2.6 percentage points. With an inference speed of 30.2 FPS, comparable to the YOLO series, MSRRT-DETR achieves an excellent balance between accuracy and real-time performance. In addition, MSRRT-DETR demonstrates outstanding cross-domain generalization capability on four public datasets including MinneApple, validating its stable applicability across diverse scenarios and fruit varieties. 
MSRRT-DETR combines high recognition accuracy, fast inference, and strong cross-domain generalization, fully meeting the requirements of fruit detection in complex agricultural scenarios. The model provides robust technical support for intelligent monitoring and automated orchard management in precision agriculture, and holds significant practical value and broad potential for application.

Introduction

In recent years, deep learning has achieved remarkable progress in agricultural intelligence, particularly demonstrating broad prospects in automated fruit harvesting and yield estimation scenarios. As the fundamental component of these tasks, the accuracy and robustness of fruit target detection directly impact the practicality of agricultural robotic systems [1]. However, in practical orchard environments, fruits often exist at different maturity stages and are susceptible to interference from occlusion, scale variation, and complex lighting conditions. These factors commonly lead to challenges such as small targets and weak boundaries in images, posing significant difficulties for existing detection models [2–4]. Consequently, how to improve target detection models’ precise recognition capability in complex orchard environments while maintaining real-time performance and computational efficiency has become a critical issue in this field.

With the increasing demand for agricultural intelligence, deep learning-based object detection technology has been widely applied in smart agriculture scenarios such as precise fruit recognition and automated harvesting [5–7]. Current mainstream object detection methods mainly include two-stage detectors and one-stage detectors, both of which have been extensively explored and applied in agricultural settings. Among two-stage detectors, Faster R-CNN has attracted attention in agricultural vision due to its high accuracy. Gao et al. [8] used this model to achieve multi-state detection of non-occluded, leaf-occluded and fruit-occluded apples in orchards. However, this method requires generating numerous candidate regions, resulting in high computational complexity and insufficient real-time performance, making it difficult to directly apply in orchard robotic systems.

To address the computational complexity issues of two-stage methods and improve real-time performance, researchers have increasingly focused on one-stage detectors represented by the YOLO series [9–12]. Tian et al. [13] proposed an improved YOLOv3 model for apples at different growth stages, which optimized feature layer propagation through DenseNet architecture and effectively improved detection performance under varying lighting conditions and fruit scale changes. Sun et al. [14] developed ESC-YOLO that uses spatial-channel feature reconstruction convolution (SCConv) to improve detection accuracy under occlusion while maintaining real-time detection speed. However, one-stage models typically employ shared-channel feature pyramids, making it difficult to fully capture fine-grained features of multi-scale fruits, leading to suboptimal performance in dense small-target or heavily occluded scenarios [15]. In recent years, Transformer architectures have provided new research directions for object detection due to their excellent global modeling capabilities [16]. DETR-based models have significantly improved object detection performance in complex environments through attention mechanisms [17]. Hu & Li [18] proposed the FC-DETR model that achieves high recognition accuracy in heavily occluded and noise-interfered environments through cross-scale adaptive feature fusion and efficient loss function design. However, standard Transformers exhibit high computational complexity when processing high-resolution images, making them difficult to run efficiently on resource-constrained edge devices, thus limiting their practical deployment in real-time orchard detection tasks [19].

To address the aforementioned challenges, this paper proposes MSRRT-DETR, an enhanced fruit detection model built upon the RT-DETR framework with three key improvements. First, a Multi-Scale Convolutional Attention Module (MSBlock) is designed to improve feature representation across different scales and enhance the receptive field adaptability for fruit targets. Second, a Spatial and Channel Collaborative Attention Module (SCSA) is introduced, which integrates spatial multi-semantic attention with progressive channel self-attention to improve the model’s sensitivity to occluded regions, small objects, and complex backgrounds, thereby enhancing both focus and discriminative precision. Third, to overcome performance bottlenecks in feature fusion, a Re-parameterized Feature Pyramid Network (RepGFPN) is incorporated, leveraging non-shared channels and multi-scale interaction to achieve more efficient feature aggregation and representation. The proposed model adopts an end-to-end architecture and employs the Complete IoU loss function to further optimize localization accuracy.

Experimental results demonstrate that MSRRT-DETR outperforms all compared methods across key metrics including mAP50, mAP50:90, and precision, with notable improvements in detecting small and occluded targets. Compared with YOLO series models, MSRRT-DETR achieves a 2.0%–7.9% gain in mAP50 while maintaining comparable inference speed (30.2 FPS), enabling real-time detection. Relative to two-stage detectors, it achieves significantly higher accuracy and inference speed with nearly half the number of parameters. Compared with its RT-DETR counterparts, MSRRT-DETR offers a moderate improvement in accuracy while reducing parameter count by approximately one-third and doubling the FPS. Moreover, MSRRT-DETR demonstrates superior cross-domain generalization on four unseen public fruit datasets including MinneApple, consistently ranking first or second across all major metrics, validating its robustness and adaptability in diverse orchard environments.

The main contributions of this paper are as follows:

  1. A high-quality apple detection dataset (TSAppleData) is constructed, covering various fruit growth stages, occlusion types, scale variations, and complex lighting conditions, providing a solid benchmark for fruit detection in challenging orchard scenarios.
  2. An effective transformer-based detection network, MSRRT-DETR, is proposed, which integrates multi-scale feature construction, spatial-channel collaborative attention, and efficient feature fusion to significantly improve detection precision and robustness, particularly for small or occluded targets.
  3. Extensive experiments on both the proposed dataset and multiple public datasets validate that MSRRT-DETR achieves a well-balanced performance in accuracy, speed, and generalization, consistently outperforming YOLO series, two-stage detectors, and baseline RT-DETR models.

Construction of TSAppleData dataset and data augmentation

Data collection

From July to September 2024, we conducted multiple field surveys in apple orchards in Shuimogou District, Urumqi, Xinjiang, China, collecting an apple image dataset, TSAppleData, that covers different temporal phases and spatial distributions. The temporal dimension included various growth stages and harvesting times within a day, while the spatial dimension covered different occlusion conditions and shooting distances. Using a HUAWEI nova 12 smartphone, we captured 1,266 apple images (3000 × 4000 resolution). After manual screening and downsampling, we obtained 1,078 high-quality images (640 × 640 resolution). Fig 1 displays a selection of images from the TSAppleData Dataset.

Fig 1. Sample images from TSAppleData Dataset, showing various growth stages, lighting conditions, occlusion types, and shooting distances.

https://doi.org/10.1371/journal.pone.0342854.g001

Data collection spanned three maturity stages (immature, semi-mature, mature) to ensure comprehensive growth stage coverage. We scheduled collection between 11:00–14:00 and 17:00–19:00 to account for lighting variations, maintaining balanced frontlight/backlight representation.

For spatial distribution, we categorized occlusion into four types: leaf occlusion, branch occlusion, fruit-to-fruit occlusion, and non-occlusion [8]. Images were captured at close (0.5–1 m), medium (1–2 m), and long (2–5 m) distances to provide multi-scale spatial information.

Data annotation and dataset partitioning

This study employed LabelImg software for precise annotation of apple images. During annotation, the bounding box size reflected the relative distance between fruit targets and the camera. Considering the limited reach of robotic arms, small distant targets located on branches other than those of the close-range targets within the same image were treated as background and left unannotated, thereby reducing false detections and wasted computation [20]. After annotation, the dataset was randomly divided into training, validation, and test sets at a ratio of 7:2:1 to ensure scientific rigor and effectiveness in model training, validation, and testing.
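The 7:2:1 random partition can be reproduced with a few lines of standard Python. The function name and fixed seed below are illustrative choices for reproducibility, not the authors' actual tooling:

```python
import random

def split_dataset(image_ids, ratios=(0.7, 0.2, 0.1), seed=42):
    """Randomly partition image IDs into train/val/test subsets.

    `ratios` follows the paper's 7:2:1 split; the seed is an
    illustrative choice that makes the shuffle reproducible.
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # remainder, roughly 10%
    }

# 1,078 screened images as in TSAppleData
splits = split_dataset(range(1078))
```

With 1,078 images this yields 754 training, 215 validation, and 109 test images, with every image assigned to exactly one subset.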

Data augmentation

To enhance dataset diversity, improve model robustness, and effectively prevent overfitting, this study applied online augmentation techniques to process the apple image dataset. Specifically, the augmentation operations included comprehensive geometric transformations such as random rotation, translation, scaling, horizontal flipping, vertical flipping, and cropping to simulate apples’ appearance from various perspectives and positions. Additionally, adjustments to hue, saturation, and brightness in HSV color space were implemented to replicate color variations under different lighting conditions. To further strengthen the model’s robustness against practical shooting factors like occlusion, overlap, and object interactions in complex scenarios, advanced augmentation methods including Random Erasing, Mosaic, and Mixup were introduced. These techniques improved dataset diversity and enhanced the model’s adaptability to environmental disturbances. Detailed illustrations of the data augmentation methods are shown in Fig 2.

Fig 2. Schematic diagram of partial data augmentation methods for Online Augmentation.

https://doi.org/10.1371/journal.pone.0342854.g002
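Two of the augmentations listed above (horizontal flipping and Random Erasing) can be sketched in NumPy as follows. The box format, flip probability, and erasing-region bounds are illustrative assumptions, and Mosaic, Mixup, and the HSV adjustments are omitted for brevity:

```python
import numpy as np

def augment(image, boxes, rng):
    """Apply a random horizontal flip and Random Erasing to one image.

    image: (H, W, 3) uint8 array; boxes: (N, 4) in [x1, y1, x2, y2] format.
    A minimal sketch of two of the paper's augmentations, not the full
    online-augmentation pipeline.
    """
    h, w = image.shape[:2]
    img = image.copy()
    boxes = boxes.astype(float)
    if rng.random() < 0.5:                       # horizontal flip
        img = img[:, ::-1, :].copy()
        boxes[:, [0, 2]] = w - boxes[:, [2, 0]]  # mirror x-coordinates
    # Random Erasing: blank a random rectangle up to half the image side
    eh = rng.integers(1, h // 2 + 1)
    ew = rng.integers(1, w // 2 + 1)
    y0 = rng.integers(0, h - eh + 1)
    x0 = rng.integers(0, w - ew + 1)
    img[y0:y0 + eh, x0:x0 + ew] = rng.integers(0, 256)
    return img, boxes

rng = np.random.default_rng(0)
img = np.zeros((64, 64, 3), dtype=np.uint8)
boxes = np.array([[10, 10, 40, 40]])
out_img, out_boxes = augment(img, boxes, rng)
```

In a real pipeline the geometric transforms must update the bounding boxes consistently, as the flip branch does here; color-space adjustments leave the boxes untouched.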

Method overview

Overall architecture of MSRRT-DETR

Existing detection models still face significant challenges in accurately identifying and localizing occluded targets in complex real-world environments with diverse objects and frequent occlusions. To address this, we propose an optimized object detection model based on the RT-DETR architecture [21], as illustrated in Fig 3. The model employs a modified lightweight ResNet as the backbone network for efficient feature extraction. A Multi-Scale Convolutional Attention Module (MSBlock) is introduced to enhance multi-scale spatial representation. The Spatial and Channel Synergistic Attention mechanism (SCSA) enables joint modeling of spatial-channel information.

Fig 3. Schematic diagram of the MSRRT-DETR overall architecture, where ResBlock denotes the residual module.

https://doi.org/10.1371/journal.pone.0342854.g003

For the neck network, we implement RepGFPN structure with non-shared channel configuration for dense target representation. Complete IoU (CIoU) loss function is adopted for precise bounding box optimization.

Improved backbone feature extraction network

Lightweight ResNet network.

The backbone network of the RT-DETR model employs a lightweight ResNet architecture, achieving efficient feature extraction through streamlined module design and channel optimization [22], with its structure shown in Fig 4. The network primarily consists of three components: an initial downsampling stage, a deep residual extraction stage, and a multi-scale output module. The initial stage sequentially applies three ConvNormLayer operations and one max pooling operation to accomplish feature space compression and low-level feature capture. Subsequently, four sets of residual blocks (Blocks) extract mid-to-high level semantic information with 64, 128, 256, and 512 channels respectively, outputting feature maps at three scales (P3, P4, P5) for subsequent detection heads to perform multi-scale target modeling. Each Block consists of multiple BasicBlocks, exhibiting excellent feature reusability and computational efficiency to meet the representation requirements for complex targets and multi-scale objects in object detection tasks. While maintaining the model’s overall inference speed, this backbone network provides high-quality semantic features for subsequent feature fusion and decoder modules.

Multi-scale Feature Construction Module.

The MSBlock primarily consists of three components: channel expansion and grouping, inverted bottleneck branches, and feature fusion modules. This structure draws inspiration from Res2Net’s hierarchical feature fusion concept while incorporating large-kernel inverted bottleneck architectures to achieve efficient and comprehensive multi-scale feature representation [23,24]. The schematic diagram of the MSBlock structure is shown in Fig 5.

First, the input feature $X$ undergoes channel expansion via a $1\times1$ convolution to obtain feature maps with expanded channels. The expanded features are then evenly divided into $n$ groups along the channel dimension, denoted as $\{X_1, X_2, \ldots, X_n\}$. To achieve an optimal balance between representational capacity and computational efficiency, this study sets $n = 3$.

Building upon the feature grouping, MSBlock employs a progressive fusion strategy to process each group’s features. The first group $X_1$ serves directly as a cross-stage connection, preserving the original feature information. The remaining groups $X_i$ ($i > 1$) undergo fusion with the output of the previous group through inverted bottleneck branches (Inverted Bottleneck, $\mathrm{IB}_{k\times k}$), achieving hierarchical extraction of multi-scale features. The specific operations are as follows:

(1) $Y_i = \begin{cases} X_i, & i = 1 \\ \mathrm{IB}_{k\times k}(Y_{i-1} + X_i), & i > 1 \end{cases}$

$\mathrm{IB}_{k\times k}$ denotes an inverted bottleneck structure equipped with large convolutional kernels, where $k$ represents the kernel size. This design facilitates effective multi-scale information interaction by integrating the output from the previous group with the input of the current group.

The outputs from all branches, denoted as $\{Y_1, Y_2, \ldots, Y_n\}$, are concatenated along the channel dimension and subsequently processed by a $1\times1$ convolution to enable global information exchange and channel-wise feature recalibration. This operation ensures that the fused features are well adapted to the subsequent network structure.

The proposed MSBlock not only enhances the backbone network’s capability to capture and exploit multi-scale features, but also maintains a favorable balance between computational cost and inference speed. The enriched grouping and fusion mechanisms significantly improve the network’s robustness and detection accuracy when dealing with objects of varying scales and diverse scene complexities.
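The grouping and progressive-fusion wiring described above can be sketched in NumPy. This is a minimal illustration of Eq. (1)'s data flow only: the large-kernel inverted bottleneck $\mathrm{IB}_{k\times k}$ is replaced by a placeholder elementwise function, and the final $1\times1$ fusion convolution is omitted:

```python
import numpy as np

def msblock_fusion(x, n=3, branch=None):
    """Sketch of MSBlock's grouped progressive fusion (paper's Eq. 1).

    x: (C, H, W) feature map whose C channels are split into n groups.
    `branch` stands in for the large-kernel inverted bottleneck IB_k;
    it defaults to a toy elementwise ReLU so the wiring is testable.
    """
    if branch is None:
        branch = lambda t: np.maximum(t, 0)   # placeholder for IB_k
    groups = np.array_split(x, n, axis=0)     # channel-wise split
    outs = [groups[0]]                        # group 1: identity shortcut
    for xi in groups[1:]:
        outs.append(branch(outs[-1] + xi))    # Y_i = IB(Y_{i-1} + X_i)
    # a 1x1 conv would follow here for cross-channel recalibration
    return np.concatenate(outs, axis=0)
```

Note that the first group passes through unchanged while each later group sees the accumulated output of its predecessor, which is what gives the block its progressively enlarged effective receptive field.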

Spatial and channel synergistic attention mechanism.

The Spatial and Channel Synergistic Attention (SCSA) mechanism focuses on the joint optimization of spatial and channel information. It integrates two key components—Shared Multi-Semantic Spatial Attention (SMSA) and Progressive Channel Self-Attention (PCSA)—to fully exploit the representational potential of features along both spatial and channel dimensions through a cascaded design [25,26].

In the SMSA module, the input feature $X \in \mathbb{R}^{C \times H \times W}$ is first subjected to global average pooling along each spatial dimension, producing two separate feature sequences corresponding to the height and width directions, respectively:

(2) $X_H = \mathrm{AvgPool}_W(X) \in \mathbb{R}^{C \times H}, \quad X_W = \mathrm{AvgPool}_H(X) \in \mathbb{R}^{C \times W}$

Subsequently, $X_H$ and $X_W$ are partitioned along the channel dimension into $K$ equally sized and mutually independent sub-features, denoted as $X_H^i$ and $X_W^i$, respectively. Each sub-feature is then processed using a depthwise separable one-dimensional convolution with a distinct kernel size $k_i$, enabling the efficient capture of multi-semantic structural information distributed across different spatial regions.

(3) $A_H^i = \mathrm{DWConv1D}_{k_i}(X_H^i), \quad A_W^i = \mathrm{DWConv1D}_{k_i}(X_W^i), \quad i = 1, \ldots, K$

The outputs from the above branches are concatenated along the channel dimension and normalized using Group Normalization, which effectively mitigates the semantic interference commonly introduced by Batch Normalization. A Sigmoid activation is then applied to generate the final spatial attention maps.

(4) $\mathrm{Attn}_H = \sigma\big(\mathrm{GN}([A_H^1; \ldots; A_H^K])\big), \quad \mathrm{Attn}_W = \sigma\big(\mathrm{GN}([A_W^1; \ldots; A_W^K])\big)$

The spatial attention maps perform element-wise weighting on the original input features, enhancing multi-scale spatial information and ultimately yielding a spatially restructured feature representation:

(5) $X' = X \odot \mathrm{Attn}_H \odot \mathrm{Attn}_W$

Building upon this, the PCSA module further performs channel dependency modeling on the spatially enhanced features $X'$. First, $X'$ undergoes spatial dimension reduction via pooling to reduce computational cost, followed by depthwise separable convolution for linear projection, producing the query ($Q$), key ($K$), and value ($V$) vectors along the channel dimension:

(6) $Q = \mathrm{DWConv}_q(\mathrm{Pool}(X')), \quad K = \mathrm{DWConv}_k(\mathrm{Pool}(X')), \quad V = \mathrm{DWConv}_v(\mathrm{Pool}(X'))$

A single-head self-attention mechanism is then applied along the channel dimension to compute global inter-channel correlations:

(7) $F = \mathrm{softmax}\!\left(\dfrac{QK^{\top}}{\sqrt{d}}\right)V$

Subsequently, $F$ is passed through global pooling and a Sigmoid activation function to generate channel attention weights, which are used to reweight the spatially enhanced features across the channel dimension, thereby achieving synergistic fusion of spatial and channel information:

(8) $X'' = X' \odot \sigma\big(\mathrm{Pool}(F)\big)$

The design of SCSA fully considers the interaction and complementarity between spatial and channel information. The SMSA module leverages a multi-branch, multi-scale grouped convolution mechanism to effectively enhance the model’s ability to represent different levels of spatial semantics, enabling precise emphasis on salient regions while suppressing redundant noise. On top of spatial enhancement, the PCSA module employs a self-attention mechanism to model global channel dependencies, allowing for fine-grained modeling of high-level channel semantics. The organic integration of these two modules significantly alleviates the semantic fragmentation commonly caused by the decoupling of spatial and channel attention in conventional mechanisms, thereby enhancing the diversity and robustness of feature representations. The overall architecture of SCSA is illustrated in Fig 6.

Fig 6. Schematic diagram of the SCSA attention mechanism structure.

https://doi.org/10.1371/journal.pone.0342854.g006
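The cascaded spatial-then-channel gating of Eqs. (2)–(8) can be sketched as a toy NumPy forward pass. The grouped 1-D convolutions, Group Normalization, and depthwise projections are replaced by identity mappings here, so only the weighting structure is illustrated, not the learned components:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def scsa(x):
    """Toy SCSA forward pass on a (C, H, W) feature map.

    Learned layers (1-D convolutions, GN, DWConv projections) are
    replaced by identities; only the cascade of spatial gating (SMSA)
    followed by channel self-attention gating (PCSA) is shown.
    """
    C, H, W = x.shape
    # --- SMSA: directional pooling -> spatial attention maps (Eqs. 2-5)
    x_h = x.mean(axis=2)                  # (C, H), pooled over width
    x_w = x.mean(axis=1)                  # (C, W), pooled over height
    attn_h = sigmoid(x_h)[:, :, None]     # broadcast along width
    attn_w = sigmoid(x_w)[:, None, :]     # broadcast along height
    x_s = x * attn_h * attn_w             # spatially reweighted feature
    # --- PCSA: single-head channel self-attention (Eqs. 6-8)
    tokens = x_s.reshape(C, H * W)        # Q = K = V (identity projection)
    logits = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    f = attn @ tokens                     # channel-mixed feature F
    gate = sigmoid(f.mean(axis=1))        # per-channel weight
    return x_s * gate[:, None, None]      # X'' = X' * sigma(Pool(F))
```

The point of the cascade is that channel attention operates on features already reweighted spatially, which is what distinguishes SCSA from running the two attentions independently and summing them.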

Neck with reparameterized generalized feature pyramid network

To enhance the model’s multi-scale feature fusion capability, this work adopts an efficient Reparameterized Generalized Feature Pyramid Network (RepGFPN) as the neck structure [27], whose overall architecture is illustrated in Fig 7. Building upon the classical Feature Pyramid Network (FPN), RepGFPN incorporates several key techniques, including a non-shared channel configuration, a re-parameterization mechanism, as well as CSP structures and ELAN-style connections, effectively mitigating the limitations of the traditional FPN architecture [28].

Fig 7. Schematic illustration of the Efficient RepGFPN architecture.

https://doi.org/10.1371/journal.pone.0342854.g007

Unlike the standard FPN, which enforces channel sharing across features at different scales, RepGFPN employs a non-shared channel configuration. This allows feature maps at each specific scale to retain their native spatial and channel characteristics. This design eliminates the need for the frequent rescaling and dimensional alignment operations required in traditional methods, thereby improving the efficiency of multi-scale feature fusion while simultaneously enhancing feature representational capacity.

In addition to optimizing the fusion process, RepGFPN introduces an enhanced cross-scale feature interaction mechanism. At its core, it integrates the CSPStage and Rep modules (as shown in Fig 8) within the neck structure. Specifically, five CSPStage modules are employed to incorporate multi-scale feature inputs from both adjacent and same-level layers, enabling more comprehensive feature fusion. This design significantly promotes the flow and interaction of multi-scale information across feature layers, thereby enhancing feature reuse and representation power.

Fig 8. Structural illustration of the CSPStage and Rep modules.

https://doi.org/10.1371/journal.pone.0342854.g008

Moreover, this mechanism achieves these improvements without imposing substantial additional computational overhead, leading to higher detection accuracy and robustness. As a result, RepGFPN provides a strong technical foundation for object detection tasks under complex real-world scenarios.
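The "Rep" in RepGFPN refers to structural re-parameterization: parallel convolution branches used during training are folded into a single convolution at inference, so the extra capacity costs nothing at deployment time. Below is a minimal single-channel sketch of folding a parallel 1×1 branch into a 3×3 kernel (BatchNorm fusion omitted); `conv2d` is a naive helper written only to verify the equivalence, not a production routine:

```python
import numpy as np

def conv2d(x, k):
    """Naive single-channel 'same'-padded cross-correlation."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + kh, j:j + kw] * k).sum()
    return out

def fuse_rep_branches(k3, k1):
    """Fold a parallel 1x1 branch into a 3x3 kernel (re-parameterization).

    Convolution is linear, so conv(x, k3) + conv(x, k1) equals a single
    convolution whose kernel is k3 with k1 added at its center tap.
    """
    fused = k3.copy()
    fused[1, 1] += k1[0, 0]
    return fused
```

At training time the two branches are computed separately; before deployment `fuse_rep_branches` produces one kernel whose output is numerically identical, which is why the re-parameterized network gains accuracy without inference-time overhead.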

Complete IoU loss function

To enhance the performance of bounding box regression in object detection, this work adopts the Complete Intersection over Union (CIoU) loss function. Unlike conventional IoU-based losses that consider only the overlap area between the predicted and ground truth boxes, CIoU additionally takes into account the distance between box centers and the aspect ratio, thereby providing a more comprehensive and effective optimization of the regression targets [29].

Specifically, let the predicted bounding box be denoted as $B$ and the ground truth box as $B^{gt}$. The CIoU loss is defined as follows:

(9) $\mathcal{L}_{CIoU} = 1 - IoU + \dfrac{\rho^2(b, b^{gt})}{c^2} + \alpha v$

where $IoU$ represents the intersection-over-union between the predicted and ground truth boxes, $\rho(b, b^{gt})$ denotes the Euclidean distance between the centers of the predicted and ground truth boxes, and $c$ is the diagonal length of the smallest enclosing box covering both bounding boxes. The term $v$ measures the consistency of aspect ratios and is defined as:

(10) $v = \dfrac{4}{\pi^2}\left(\arctan\dfrac{w^{gt}}{h^{gt}} - \arctan\dfrac{w}{h}\right)^2$

$\alpha$ is a weighting factor for the regularization term, computed as:

(11) $\alpha = \dfrac{v}{(1 - IoU) + v}$

where $w, h$ and $w^{gt}, h^{gt}$ correspond to the widths and heights of the predicted and ground truth boxes, respectively.

Unlike traditional IoU-based loss functions, the CIoU loss simultaneously optimizes localization accuracy and shape constraints, which facilitates faster convergence and improves the final detection performance.
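Eqs. (9)–(11) can be computed directly from corner-format boxes. The following is an illustrative standalone implementation; the small `1e-9` stabilizer in the denominator of $\alpha$ is an implementation convenience added here, not part of the paper's formulation:

```python
import math

def ciou_loss(box_p, box_g):
    """CIoU loss for two [x1, y1, x2, y2] boxes (paper's Eqs. 9-11)."""
    # intersection and IoU
    ix1 = max(box_p[0], box_g[0]); iy1 = max(box_p[1], box_g[1])
    ix2 = min(box_p[2], box_g[2]); iy2 = min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wg, hg = box_g[2] - box_g[0], box_g[3] - box_g[1]
    iou = inter / (wp * hp + wg * hg - inter)
    # squared center distance rho^2 and enclosing-box diagonal c^2
    cxp, cyp = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cxg, cyg = (box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2
    rho2 = (cxp - cxg) ** 2 + (cyp - cyg) ** 2
    ex1 = min(box_p[0], box_g[0]); ey1 = min(box_p[1], box_g[1])
    ex2 = max(box_p[2], box_g[2]); ey2 = max(box_p[3], box_g[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    # aspect-ratio consistency v (Eq. 10) and its weight alpha (Eq. 11)
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(wp / hp)) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v
```

For identical boxes the loss is zero, and it grows as the predicted box drifts away from the ground truth even when the two no longer overlap, which is the property that gives CIoU its faster convergence than plain IoU loss.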

Experimental results and analysis

Ablation study on module design

To evaluate the impact of each improved module on overall performance, we conducted multiple ablation experiments by progressively introducing the multi-scale feature extraction module (MSBlock), spatial-channel collaborative attention mechanism (SCSA), and efficient feature fusion module (RepGFPN). The changes in detection accuracy, inference efficiency, and computational cost under different configurations were analyzed. The experimental results are shown in Table 1.

As shown in Table 1, after introducing MSBlock (Method 2) to the baseline model (Method 1), the inference speed increased from 26.7 FPS to 37.1 FPS, representing a 39% improvement, while GFLOPs only increased by 2.4%. Meanwhile, Precision improved to 87.1% and Recall increased to 77.3%. These results indicate that MSBlock accelerates inference with minimal additional computational cost and positively impacts classification performance.

By adding the SCSA module to Method 2 (Method 3), the model’s mAP50 improved from 84.4% to 86.1%, mAP50:90 increased from 67.4% to 70.3%, and Recall further improved to 79.4%, indicating significant enhancement in feature representation capability and target localization accuracy. Although SCSA introduced additional computational overhead, reducing inference speed to 22.3 FPS, it substantially improved detection performance in complex scenarios [30]. The enhanced feature representation and localization accuracy provide superior comprehensive performance, which is particularly valuable for applications requiring high detection precision.

Further integrating RepGFPN into Method 3 (Method 4) achieved optimal overall performance. mAP50 increased to 87.3%, mAP50:90 improved to 73.0%, Precision reached 88.2%, and Recall rose to 80.3%. Inference speed also recovered from 22.3 FPS to 30.2 FPS. This demonstrates that RepGFPN successfully mitigated the inference speed reduction caused by the attention mechanism while strengthening multi-scale feature fusion and enriching local detail representation. Although parameter count increased, the performance gains far outweighed the additional computational cost in real-time detection and complex orchard environments.

The ablation study results show that MSBlock, SCSA and RepGFPN effectively complement different performance dimensions. MSBlock significantly accelerates inference while improving precision, SCSA enhances feature representation and target localization, and RepGFPN optimizes the balance between inference efficiency and detection accuracy while strengthening feature fusion. Through their synergistic effects, the improved RT-DETR model achieved simultaneous improvements in detection accuracy, recall rate and inference speed, demonstrating particularly strong comprehensive performance in complex scenarios involving weak features and small object recognition. Statistical tests confirm that our method significantly outperforms all other configurations in terms of comprehensive metrics including mAP50 and mAP50:90 with a p-value less than 0.01. Furthermore, it achieves statistically significant improvements in Precision compared to the baseline and Method 3 and in Recall compared to the baseline and Method 2.

Attention mechanism comparison experiments

To systematically evaluate the impact of different attention mechanism modules on the improved model’s performance, this study adopted a unified baseline network architecture and evaluated configurations with no attention mechanism (No Attention) and with the attention module replaced by current mainstream alternatives, namely SENetV1, EMA, Biformer, and the proposed SCSA module. The experimental results are shown in Table 2.

Table 2. Performance comparison of different attention mechanisms on MSRRT-DETR evaluated on the TSAppleData dataset.

https://doi.org/10.1371/journal.pone.0342854.t002

The experiments on attention mechanism integration indicate that embedding attention modules can improve the model’s detection performance to varying degrees while keeping the computational cost largely unchanged; however, their impact on the trade-off between inference efficiency and accuracy varies significantly. EMA demonstrates strong performance in improving recognition accuracy and recall rate, but also leads to a significant decrease in model inference speed. In contrast, the SCSA module achieves the best mAP50:90 of 73.0% while maintaining high recognition performance and an inference speed second only to the no-attention baseline. The t-test results further confirm its superiority in high-precision detection scenarios, showing significantly better performance than other state-of-the-art attention methods. For metrics that did not reach statistical significance, the performance of the competing attention modules is essentially on par with SCSA, indicating that the advantages of the SCSA module are mainly manifested in high-precision detection scenarios rather than showing significant improvements across all evaluation metrics.

As shown in Fig 9, the heatmap comparison further demonstrates that attention mechanisms play a significant role in guiding feature extraction towards key regions, with distinct differences in feature response locations across mechanisms. The model without attention (No Attention) exhibits dispersed attention and frequently misses critical target areas under complex backgrounds and occlusion. While SENetV1 and EMA focus more tightly on targets, they also show noticeable erroneous activation in background interference regions; for instance, EMA incorrectly activates on branches in the upper-left corner of densely distributed images. BiFormer shows a relatively scattered attention distribution with insufficient boundary awareness.

Fig 9. Feature response heatmaps of different attention mechanisms in apple detection across multiple scenarios.

https://doi.org/10.1371/journal.pone.0342854.g009

In contrast, our proposed SCSA mechanism more effectively concentrates attention on target regions, accurately capturing target contours under complex occlusion and background interference while effectively suppressing attention responses to non-target areas. Moreover, compared to other attention mechanisms, SCSA demonstrates superior performance in small target recognition. Further analysis reveals that SCSA possesses stronger boundary perception and regional discrimination capabilities at the feature level, particularly exhibiting more stable performance when processing small-sized, highly overlapping, or occluded targets. Its achievement of 73.0% on the mAP50:90 metric significantly surpasses other attention mechanisms, quantitatively validating its advantages in target localization accuracy.

Detection performance analysis across different complex scenarios

To comprehensively evaluate the detection performance of the MSRRT-DETR model in complex natural scenarios, this study conducted systematic analysis from both temporal and spatial dimensions. Temporally, three growth stages of apples were selected for investigation: immature, semi-mature, and mature. Spatially, three viewing fields were examined: close-range, medium-range, and long-range. Since all images were captured in natural environments and randomly contained various scenarios including front lighting, back lighting, and fruit occlusion, these characteristics existed as natural attributes in each image and were therefore not listed as separate scenarios. As shown in Fig 10, the improved model demonstrated significantly superior detection performance over the original model across different temporal stages and spatial distribution conditions.

Fig 10. Comparison of detection results before and after model improvement for apples at different growth stages and spatial distributions, along with typical false detection examples.

The first row shows original images, the second and third rows display inference results from the improved and original models respectively, while the fourth row presents enlarged views of error regions from the original model (marked by gray boxes). Undetected targets are indicated by orange boxes, and false detections by orange circles.

https://doi.org/10.1371/journal.pone.0342854.g010

In the temporal dimension, fruits at the immature and semi-mature stages mostly appear green or light red, exhibiting high color similarity with background leaves and constituting typical low-contrast detection scenarios. In such images, the original model failed to effectively distinguish fruit from background information and showed deviations in boundary recognition at fruit-overlap regions, resulting in relatively obvious missed and false detections, as shown in Fig 10m. In contrast, the improved model under the same scenarios (Fig 10g) distinguished detection targets from image backgrounds more accurately, reflecting its stronger perception and representation capabilities in low-contrast and weak-edge regions.

In the spatial dimension, as the shooting field of view moves from near to far, the pixel size of fruit targets continuously decreases while detail information diminishes, particularly under medium-to-long-range viewing conditions, placing higher demands on the model’s scale perception and multi-scale feature fusion capabilities. In such scenarios, the original model showed a markedly increased missed detection rate and degraded detection performance. As shown in Fig 10r, numerous problems including repeated, missed and false detections appeared in areas with dense targets or severe occlusion, revealing insufficient robustness under complex conditions. Compared with the original model, the improved model demonstrated significant performance advantages in complex scenarios, especially for small-size targets. Under the long-range viewing fields shown in Figs 10l and 10r, MSRRT-DETR exhibited stronger small-target detection capability than the original model, further demonstrating its superior scale robustness.

Further analysis of the typical detection errors in Figs 10s and 10t shows that the original model had multiple performance deficiencies in complex scenarios. For instance, under complicated lighting conditions such as backlighting or bright backgrounds (Figs 10o and 10p), the original model tended to falsely detect bright areas as apples (detailed in Figs 10u and 10v), or to miss real targets in shadowed areas (Fig 10t). When targets were severely occluded, it struggled to accurately infer the contours of occluded targets, leading to missed or false detections of fruits, as shown in Fig 10w. In comparison, the improved model achieved detection performance significantly superior to the original model in backlit and bright background areas (Figs 10i and 10j), while also recognizing and localizing fruit targets more accurately under shadowed and severely occluded conditions (Fig 10k).

This performance improvement mainly benefits from multi-level optimization of the model’s feature extraction and fusion strategies. Specifically, the improved model introduces a multi-scale perception module at shallow layers, incorporates a spatial-channel attention mechanism at middle layers, and reconstructs multi-scale feature transmission paths at the detection heads, effectively enhancing the model’s perception of feature regions and its target discrimination ability in complex scenarios. Consequently, under adverse factors such as strong light interference and shadow occlusion, the improved model can focus more stably on effective target areas, exhibiting higher overall detection robustness and better environmental adaptability.

Performance comparative analysis with mainstream detectors

To thoroughly investigate the comprehensive performance of the improved MSRRT-DETR model in object detection tasks, this paper conducted a systematic comparative analysis with current mainstream object detection models, with the results presented in Table 3.

Table 3. Performance comparison of MSRRT-DETR versus mainstream object detection models on the TSAppleData dataset.

https://doi.org/10.1371/journal.pone.0342854.t003

As shown in Table 3, compared to lightweight YOLO series models, the proposed model demonstrates significant advantages across all accuracy metrics. Particularly for mAP50:90, it achieves improvements of 12.1%, 5.7%, and 9.0% over YOLOv8, YOLO11, and YOLO12 respectively. With an inference speed of 30.2 FPS, comparable to the YOLO series, the model maintains excellent real-time performance while significantly enhancing high-precision target recognition capabilities.
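As a reminder of what these accuracy metrics measure: mAP50 counts a predicted box as correct when its intersection-over-union (IoU) with a ground-truth box reaches 0.5, while mAP50:90 averages precision over progressively stricter IoU thresholds. A minimal IoU computation for axis-aligned boxes:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

# Two 10x10 boxes overlapping by half: intersection 50, union 150.
print(round(iou((0, 0, 10, 10), (5, 0, 15, 10)), 3))  # 0.333
```

A detection like this example would count as a hit at the 0.5 threshold only if overlap improved, which is why gains on mAP50:90 indicate tighter localization rather than merely more detections.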

Compared to two-stage detectors (Faster R-CNN, Mask R-CNN, Cascade R-CNN), MSRRT-DETR achieves comprehensive superiority in both accuracy and efficiency. It improves mAP50:90 by 7.3%, 12.5%, and 13.5% respectively, with Precision increasing by over 12%, demonstrating stronger localization accuracy and false detection suppression. In terms of inference speed, MSRRT-DETR has significantly fewer parameters than most two-stage models and reaches 30.2 FPS, far exceeding Faster R-CNN (6.6 FPS) and Cascade R-CNN (2.1 FPS), indicating excellent real-time capability and deployment potential. In terms of computational cost (GFLOPs), its overhead is only 12.1%–65.5% of that of the two-stage models. Its coordinated design of accuracy, speed, and model size makes it better suited to practical detection requirements in complex orchard scenarios.

Compared to the RT-DETR series, MSRRT-DETR achieves significant improvements in both detection accuracy and inference efficiency. Specifically, versus RT-DETR-50, our method improves mAP50:90 by 2.6 percentage points while more than doubling the inference speed (from 15 FPS to 30.2 FPS), with a computational cost only 49.2% of that of RT-DETR-50. Compared to RT-DETR-L, it achieves a 3.7 percentage point improvement in mAP50:90 with approximately 33.6% faster inference speed and a computational cost only 60.6% of that of RT-DETR-L.
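The speed-up claim can be checked directly from the FPS figures quoted above:

```python
# FPS figures quoted in the comparison with RT-DETR-50.
fps_ours = 30.2       # MSRRT-DETR
fps_rtdetr50 = 15.0   # RT-DETR-50
speedup = fps_ours / fps_rtdetr50
print(f"{speedup:.2f}x")  # 2.01x, i.e. "more than doubling"
```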

In summary, while maintaining the global modeling capabilities of the Transformer, the improved RT-DETR model achieves an optimal balance between detection accuracy and inference efficiency through the introduction of (1) the multi-scale feature extraction module (MSBlock), (2) the spatial-channel collaborative attention (SCSA), and (3) the efficient feature fusion network (RepGFPN). In addition, statistical significance tests further validate the robustness of the proposed method. For most evaluation metrics, MSRRT-DETR achieves statistically significant improvements over competing models (p < 0.05). For the few cases where no significant difference is observed, the performance gaps are marginal, indicating that MSRRT-DETR maintains comparable accuracy while offering clear advantages in high-precision detection and efficiency.

To facilitate intuitive comparison of each model’s comprehensive performance in accuracy and efficiency, this study introduces the Composite Score as an evaluation metric, calculated as follows:

(12)

where mAP50_i represents the mAP50 of the i-th model, and FPS_i represents the FPS of the i-th model.
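Eq. (12) itself is not reproduced in this excerpt. Purely as an illustration, one common way to build such a composite is a weighted sum of min-max-normalised mAP50 and FPS; the model pool and equal weighting below are hypothetical, and the paper's exact formula may differ.

```python
def composite_score(map50, fps, all_map50, all_fps, alpha=0.5):
    """One plausible composite: weighted sum of min-max-normalised
    mAP50 and FPS. The paper's exact Eq. (12) may differ."""
    def norm(v, vs):
        lo, hi = min(vs), max(vs)
        return (v - lo) / (hi - lo) if hi > lo else 0.0
    return alpha * norm(map50, all_map50) + (1 - alpha) * norm(fps, all_fps)

# Hypothetical model pool: name -> (mAP50, FPS).
models = {"A": (0.85, 22.0), "B": (0.92, 38.0), "C": (0.78, 60.0)}
maps = [m for m, _ in models.values()]
fpss = [f for _, f in models.values()]
scores = {k: composite_score(m, f, maps, fpss) for k, (m, f) in models.items()}
print(max(scores, key=scores.get))  # "B": balanced accuracy and speed wins
```

Under this kind of scoring, a model that is merely fastest or merely most accurate cannot dominate; the highest score goes to the model that balances both, which matches the diagonal placement described for Fig 11.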

As shown in Fig 11, compared to the two-stage detectors and YOLO series models, MSRRT-DETR achieves the highest Composite Score (indicated by the darkest color) and is positioned along the diagonal. This demonstrates that its high Composite Score does not stem from exceptional performance on a single metric combined with mediocre or poor performance on the others. These results indicate that MSRRT-DETR achieves an excellent balance across three key dimensions (accuracy, inference speed, and parameter count), demonstrating significant advantages in comprehensive model performance.

Fig 11. Performance comparison of different object detection models in terms of FPS, mAP50, parameter count and Composite Score.

The horizontal axis represents the model’s FPS value, while the vertical axis represents the model’s mAP50 value. The circle size indicates the model’s parameter count (model complexity), with larger circles representing higher parameter counts. The color depth of the circles represents the model’s comprehensive score, where darker colors indicate better overall performance in both accuracy and speed.

https://doi.org/10.1371/journal.pone.0342854.g011

Cross-domain generalization experiment

To comprehensively evaluate the cross-domain generalization capability of the proposed model, this study selected four apple detection datasets that were not involved in training for validation, covering different fruit maturity stages (AppleBBCH76, AppleBBCH81) [31], complex background and small target scenarios (Minneapple) [32], as well as datasets with significant lighting variations and occlusion interference (AppleDatas). The dataset descriptions and basic information are shown in Table 4 and Fig 12. The comparison subjects include YOLO series models (YOLO11, YOLOv8, YOLO12), two-stage detectors (Faster R-CNN, Mask R-CNN, Cascade R-CNN), RT-DETR series models (RT-DETR-L, RT-DETR-18, RT-DETR-50), and the proposed improved model.

Table 4. Detailed statistics and characteristic descriptions of apple datasets for generalization experiment.

https://doi.org/10.1371/journal.pone.0342854.t004

Fig 12. Sample images from each dataset used in the generalization experiments.

https://doi.org/10.1371/journal.pone.0342854.g012

As shown in Table 5, while the YOLO series demonstrates relatively good performance on certain metrics in standard scenarios, its accuracy declines significantly in complex environments. The two-stage detectors exhibit considerable fluctuations across different datasets for certain metrics, indicating insufficient stability. In contrast, our model achieves more balanced performance across various scenarios with stronger generalization capability.

Table 5. Cross-domain generalization performance comparison of different detection models across datasets.

https://doi.org/10.1371/journal.pone.0342854.t005

Notably, our model delivers excellent performance across all datasets, particularly achieving the best overall F1-Score of 0.792 on AppleBBCH81. Even in the Minneapple dataset where overall accuracy generally decreases, our model still maintains a leading score of 0.439, demonstrating robust small-target detection capabilities.
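The F1-Score reported here is the harmonic mean of precision and recall. For example, an F1 of 0.792 could arise from a precision of 0.810 and a recall of 0.775; these two values are hypothetical and used only to illustrate the formula.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical operating point reproducing an F1 of 0.792.
print(round(f1_score(0.810, 0.775), 3))  # 0.792
```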

As shown in Fig 13, the models demonstrate good overall performance under standard detection conditions such as AppleBBCH76 and AppleBBCH81, but exhibit a general decline in detection accuracy in datasets with dense small targets and complex lighting conditions like Minneapple and AppleDatas, indicating that the robustness of current mainstream detection models in complex scenarios still needs improvement.

Fig 13. Performance comparison of various detection models on multi-source apple detection datasets (F1-score bar chart and mAP50 heatmap).

https://doi.org/10.1371/journal.pone.0342854.g013

Notably, the RT-DETR series models perform significantly better than both the YOLO series and the two-stage detectors across all datasets. In particular, both RT-DETR-50 and our model achieve mAP50 scores exceeding 0.82 on the AppleBBCH81 dataset, ranking first among all models. On the Minneapple dataset, although overall accuracy decreases significantly, the Transformer-based series still maintains a relative advantage, demonstrating its modeling capability and cross-domain generalization in complex environments with small targets and heavy occlusion.

Conclusion

This paper presents MSRRT-DETR, an end-to-end fruit detection model designed to meet the demands of accuracy, real-time performance, and cross-domain generalization in complex orchard environments. The proposed model integrates a hierarchical multi-scale feature extraction module (MSBlock), a spatial–channel collaborative attention mechanism (SCSA), and an efficient re-parameterized feature fusion neck (RepGFPN), significantly enhancing detection robustness for small, occluded, and low-contrast targets.

Comprehensive experiments demonstrate that MSRRT-DETR achieves superior detection accuracy, fast inference speed (30.2 FPS), and strong cross-domain generalization without the need for retraining. Compared with state-of-the-art YOLO series models, two-stage detectors, and original RT-DETR variants, MSRRT-DETR exhibits notable advantages in both detection precision and computational efficiency.

Despite its overall superior performance, MSRRT-DETR still exhibits limitations in generalization when there is a significant discrepancy between the training data and the target apple varieties. Future research will focus on enhancing model robustness under extreme conditions, including the integration of multimodal data sources (e.g., depth maps or thermal imagery), and the adoption of model sparsification or knowledge distillation techniques to further improve cross-domain generalization and lightweight deployment. These efforts aim to support the continued advancement of intelligent fruit detection systems in complex and dynamic agricultural environments.

References

  1. Zhu F, Zhang W, Wang S, Jiang B, Feng X, Zhao Q. Apple-harvesting robot based on the YOLOv5-RACF Model. Biomimetics (Basel). 2024;9(8):495. pmid:39194474
  2. YOLO-ALW: An enhanced high-precision model for chili maturity detection. https://www.mdpi.com/1424-8220/25/5/1405. Accessed 2025 July 10.
  3. Cao D, Luo W, Tang R, Liu Y, Zhao J, Li X, et al. Research on apple detection and tracking count in complex scenes based on the improved YOLOv7-Tiny-PDE. Agriculture. 2025;15(5):483.
  4. Zhang S, Wan H, Fan Z, Zeng X, Zhang K. Bunet: An effective and efficient segmentation method based on bilateral encoder-decoder structure for rapid detection of apple tree branches. Appl Intell. 2023;53(20):23336–48.
  5. Xiao F, Wang H, Xu Y, Zhang R. Fruit detection and recognition based on deep learning for automatic harvesting: An overview and review. Agronomy. 2023;13(6):1625.
  6. Han W, Li T, Guo Z, Wu T, Huang W, Feng Q, et al. LGVM-YOLOv8n: A lightweight apple instance segmentation model for standard orchard environments. Agriculture. 2025;15(12):1238.
  7. Wu Z, Sun X, Jiang H, Gao F, Li R, Fu L, et al. Twice matched fruit counting system: An automatic fruit counting pipeline in modern apple orchard using mutual and secondary matches. Biosystems Engineering. 2023;234:140–55.
  8. Gao F, Fu L, Zhang X, Majeed Y, Li R, Karkee M, et al. Multi-class fruit-on-plant detection for apple in SNAP system using Faster R-CNN. Computers and Electronics in Agriculture. 2020;176:105634.
  9. Jiang P, Chen Y, Liu B, He D, Liang C. Real-time detection of apple leaf diseases using deep learning approach based on improved convolutional neural networks. IEEE Access. 2019;7:59069–80.
  10. Wu H, Mo X, Wen S, Wu K, Ye Y, Wang Y, et al. DNE-YOLO: A method for apple fruit detection in Diverse Natural Environments. Journal of King Saud University - Computer and Information Sciences. 2024;36(9):102220.
  11. Wang X, Liu J, Chen Q. An advanced deep learning method for pepper diseases and pests detection. Plant Methods. 2025;21(1):70. pmid:40420214
  12. Liu J, Wang X. Early recognition of tomato gray leaf spot disease based on MobileNetv2-YOLOv3 model. Plant Methods. 2020;16:83. pmid:32523613
  13. Tian Y, Yang G, Wang Z, Wang H, Li E, Liang Z. Apple detection during different growth stages in orchards using the improved YOLO-V3 model. Computers and Electronics in Agriculture. 2019;157:417–26.
  14. Sun J, Peng Y, Chen C, Zhang B, Wu Z, Jia Y, et al. ESC-YOLO: optimizing apple fruit recognition with efficient spatial and channel features in YOLOX. J Real-Time Image Proc. 2024;21(5).
  15. Lin Y, Huang Z, Liang Y, Liu Y, Jiang W. AG-YOLO: A rapid citrus fruit detection algorithm with global context fusion. Agriculture. 2024;14(1):114.
  16. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with Transformers. 2020.
  17. Wu Y, Yuan S, Tang L. Plant recognition of maize seedling stage in UAV remote sensing images based on H-RT-DETR. Plant Methods. 2025;21(1):60. pmid:40369645
  18. Hu L, Li X. Apple recognition in complex environments based on FC-DETR. Heliyon. 2024;10(18):e37605. pmid:39974622
  19. Meng F, Li J, Zhang Y, Qi S, Tang Y. Transforming unmanned pineapple picking with spatio-temporal convolutional neural networks. Computers and Electronics in Agriculture. 2023;214:108298.
  20. Lin Y, Xia Y, Xia P, Liu Z, Wang H, Qin C, et al. YOLO11-ARAF: An accurate and lightweight method for apple detection in real-world complex orchard environments. Agriculture. 2025;15(10):1104.
  21. DETRs beat YOLOs on real-time object detection. IEEE Conference Publication, IEEE Xplore. https://ieeexplore.ieee.org/document/10657220. Accessed 2025 July 10.
  22. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. 2015.
  23. Chen Y, Yuan X, Wang J, Wu R, Li X, Hou Q, et al. YOLO-MS: Rethinking multi-scale representation learning for real-time object detection. IEEE Trans Pattern Anal Mach Intell. 2025;47(6):4240–52. pmid:40031746
  24. Ming M, Elsherbiny O, Gao J. Trinocular vision-driven robotic fertilization: Enhanced YOLOv8n for precision mulberry growth synchronization. Sensors (Basel). 2025;25(9):2691. pmid:40363130
  25. Si Y, Xu H, Zhu X, Zhang W, Dong Y, Chen Y, et al. SCSA: Exploring the synergistic effects between spatial and channel attention. Neurocomputing. 2025;634:129866.
  26. Wang D, Song H, Wang B. YO-AFD: an improved YOLOv8-based deep learning approach for rapid and accurate apple flower detection. Front Plant Sci. 2025;16:1541266. pmid:40144752
  27. Xu X, Jiang Y, Chen W, Huang Y, Zhang Y, Sun X. DAMO-YOLO: A report on real-time object detection design. 2023.
  28. Zheng X, Shao Z, Chen Y, Zeng H, Chen J. MSPB-YOLO: High-precision detection algorithm of multi-site pepper blight disease based on improved YOLOv8. Agronomy. 2025;15(4):839.
  29. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Journals & Magazine. https://ieeexplore.ieee.org/document/9523600. Accessed 2025 July 10.
  30. Duan Y, Han W, Guo P, Wei X. YOLOv8-GDCI: Research on the phytophthora blight detection method of different parts of chili based on improved YOLOv8 model. Agronomy. 2024;14(11):2734.
  31. Kodors S, Zarembo I, Lācis G, Litavniece L, Apeināns I, Sondors M, et al. Autonomous yield estimation system for small commercial orchards using UAV and AI. Drones. 2024;8(12):734.
  32. Hani N, Roy P, Isler V. MinneApple: A benchmark dataset for apple detection and segmentation. IEEE Robot Autom Lett. 2020;5(2):852–8.