Abstract
Sugarcane stem node detection is critical for monitoring sugarcane growth, enabling precision cutting, reducing impurities during harvesting, and supporting breeding for lodging resistance. However, in complex field environments, detection accuracy for sugarcane stem nodes is often reduced by background interference and shadowing. To address this, this paper proposes an improved sugarcane stem node detection model based on YOLO11. This study incorporates the ASF-YOLO (Attentional Scale Sequence Fusion based You Only Look Once) mechanism to enhance the feature fusion layer of YOLO11. Additionally, a high-resolution detection layer, P2, is integrated into the fusion module to improve the model's ability to detect small objects, particularly sugarcane stem nodes, and to better handle multi-scale feature representations. Second, to better align with the P2 small-object detection layer, this paper adopts a shared convolutional detection head, LSDECD (Lightweight Shared Detail-Enhanced Convolutional Detection Head), which handles small-target detection more effectively while reducing the number of model parameters through parameter sharing and detail-enhanced convolution. Third, Soft-NMS (non-maximum suppression) replaces the original NMS and is combined with Shape-IoU, a bounding box regression method that focuses on the shape and scale of the bounding box itself; this makes bounding box regression more accurate and mitigates the impact of occlusion and illumination on detection. Finally, to address the increased complexity introduced by the added P2 detection layer and the replacement detection head, channel pruning is applied to the model, effectively reducing its overall complexity and parameter count.
The experimental results show that, before pruning, the model achieves 96.1% mAP50 and 53.2% mAP50:95, which are 11.9% and 11.1% higher than the original YOLO11n, respectively; after pruning, the model is still 10.8% and 9.3% higher than the original YOLO11n on these metrics, while the number of parameters is reduced to 279,778 and the model size to 1.3 MB. The computational cost decreases from 11.6 GFLOPs to 6.6 GFLOPs.
Citation: Wen C, Liu L, Li S, Cheng Y, Liang Q, Li K, et al. (2025) Sugarcane stem node detection with algorithm based on improved YOLO11 channel pruning with small target enhancement. PLoS One 20(9): e0332870. https://doi.org/10.1371/journal.pone.0332870
Editor: Narendra Khatri, Manipal Institute of Technology, INDIA
Received: March 19, 2025; Accepted: September 7, 2025; Published: September 18, 2025
Copyright: © 2025 Wen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data and code underlying the findings of this study are fully available without restriction from the ScienceDB. The sugarcane stem node dataset is available at DOI: https://doi.org/10.57760/sciencedb.27078 The source code used in this study is available at DOI: https://doi.org/10.57760/sciencedb.27287.
Funding: This work was supported by the Guangxi Science and Technology Major Project (Guike AA22117006), the Open Project of Key Laboratory of Artificial Intelligence and Information Processing in Guangxi Colleges and Universities (2024GXZDSY017), the Key Project of Teaching Reform of the Guangxi Zhuang Autonomous Region in 2022 (2022JGZ121), the Project of Reform of Degree and Postgraduate Education in Guangxi in 2023 (JGY2023115), and the 2020 Guangxi Degree and Postgraduate Education Reform Project (JGY2020066). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
As the global demand for agricultural intelligence and automation increases, the combination of artificial intelligence and agriculture is gradually becoming one of the most important means to improve crop yield and quality. As an economically important crop, sugarcane is not only the main raw material for sugar production but also an important resource in the light industry, chemical industry, and energy sector. Currently, sugarcane is grown in more than 90 countries worldwide, covering an area of about 22 million hectares. Major sugarcane-producing countries include Brazil, India, China, Thailand, the United States, and Australia [1]. In China, Guangxi accounts for 60% of the national planting area and most of the sugar production [2], and the sugarcane industry has become an important pillar of Guangxi's economy. At present, in the harvesting stage, whole-stalk sugarcane combine harvesters are constrained by Guangxi's hilly terrain: they demand favorable operating conditions, and their potential is not fully realized. Even with the current rise of intelligent agriculture, sugarcane field operations still require extensive manual work, especially for identifying and locating stalks; this manual intervention not only increases labor intensity but also limits operational efficiency. In the development of intelligent sugarcane harvesting technology, identifying sugarcane stems and nodes under the complex lighting conditions of the field is a problem that demands focused attention.
With the rapid adoption of deep learning in agriculture, both one-stage detectors like YOLO [3] and two-stage methods like Faster R-CNN [4] have been applied to sugarcane stem node detection with promising results. Early efforts began in 2008, when Moshashai et al. [5] utilized diameter differences between stem nodes and internodes to initiate image-based detection. In 2010, Lu Shangping et al. [6] applied machine vision techniques using HSV color space thresholding and support vector machines for classification. Huang Yi-Qi et al. [7–8] later introduced grayscale and edge-based preprocessing methods, including local mean algorithms, to locate stem nodes. Recent years have seen the emergence of deep learning-based methods. In 2019, Li Shangping [9] proposed a YOLOv3-based real-time detection network, achieving 96.89% accuracy. Zhou et al. [10] (2020) enhanced RGB image analysis using median filtering and Sobel edge detection. Xie et al. [11] (2023) introduced a lightweight YOLOv5s variant by removing CBS and C3 modules, achieving high accuracy with a model size of only 2.6 MB. In 2024, our group [12] proposed a field-adapted YOLOv7-based model that integrates SimAM attention and replaces some traditional convolutions with deformable layers, achieving 94.53% mAP and an F1-score of 92.41%.
In recent years, extensive efforts have been made to improve multi-scale feature learning and small object detection. Mask-Refined R-CNN (Regions with Convolutional Neural Networks) proposed by Zhang et al. [13] refines region-level features through a mask-guided mechanism and effectively preserves boundary details during multi-scale fusion, offering insights for enhancing small object detection accuracy. Lin et al. [14] introduced a feature disentanglement strategy in a one-stage detector, which models classification and localization tasks separately to mitigate task interference and improve both accuracy and robustness. The design principles of these works—multi-scale representation refinement and task-specific feature separation—align well with the motivation behind our lightweight ASF module [15], highlighting the potential of structurally optimized and robust detection frameworks in real-world agricultural applications.
In recent years, a growing number of advanced agricultural vision models have been proposed, covering various tasks such as fruit detection, weed recognition, and leaf disease classification. In fruit detection, Ma et al. [16] proposed an improved YOLOv7-tiny model for small apple detection under complex environments, demonstrating robustness to occlusion and illumination variations. In leaf disease identification, Sanida et al. [17] developed a two-stage transfer learning approach to accurately classify various tomato leaf diseases, achieving strong generalization performance across different datasets. To enhance detection performance in orchard settings, Zhang et al. [18] introduced a multi-scale feature adaptive fusion model that effectively improves the detection of small and partially occluded citrus fruits in real-time applications.
Although deep learning methods have achieved some success in recognizing sugarcane and its stem nodes, most existing studies have been validated in indoor environments, and their detection capability in real, complex field environments remains to be explored. Few methods have been proposed for field conditions such as dust and cane leaf occlusion, and there is still no approach that achieves high robustness and generalization across both indoor and outdoor scenes simultaneously.
While our method builds upon the YOLO family, it departs from incremental modifications in several key aspects tailored to the challenges of sugarcane stem node detection in complex field environments. First, we design a long-strip-aware shared convolutional detection head (LSDECD) to enhance the model’s ability to capture the spatial continuity and elongated structure of sugarcane stem nodes. Second, we introduce the Attentional Scale Sequence Fusion (ASF) module to improve multi-scale feature aggregation and robustness against occlusion and lighting variations. Third, we refine the Shape-IoU loss with a Soft-NMS-inspired formulation, allowing the model to be more sensitive to boundary shapes and morphological consistency—an important characteristic for structured agricultural objects like stem nodes. Furthermore, we apply LAMP (Layer-Adaptive Magnitude-based Pruning) to optimize the model structure, achieving significant compression without sacrificing accuracy. This pruning strategy is especially effective in agricultural scenarios with limited computation and power constraints. Combined with other lightweight design considerations, our model achieves deployment-friendly performance on edge devices, making it suitable for real-time applications in intelligent agricultural machinery.
Methodology
YOLO11 model
YOLO11 is a new generation of target detection algorithm proposed by the Ultralytics team in 2024 based on YOLOv8, and its core improvements include module reconstruction and structure optimization. First, the algorithm upgrades the C2f module of YOLOv8 to the C3k2 module: by dynamically controlling the c3k parameter, when c3k is activated, it replaces the original Bottleneck module with a CSP (Cross Stage Partial) Bottleneck using a smaller kernel to balance lightweight design and feature extraction efficiency. Meanwhile, the C2PSA (Cross-Stage Progressive Spatial Attention) mechanism is introduced to integrate the PSA (Pyramid Squeeze Attention) block into the C2f module, which enhances the ability to capture key features through attention weighting. Second, at the network structure level, YOLO11 adds a new C2PSA module after the SPPF (Spatial Pyramid Pooling Fast) module to optimize gradient propagation and training stability by combining multi-head attention and residual structures; the neck layer uniformly replaces C2f with the C3k2 module to accelerate multi-resolution feature aggregation and enhance cross-scale information fusion. In the detection head, the algorithm adds two depth-wise convolutions (DWConv), which exploit per-channel spatial computation to dramatically reduce the number of parameters and computational complexity while preserving the independence of cross-channel features. These improvements enable YOLO11 to balance detection accuracy and computational efficiency in complex scenarios, making it especially suitable for sugarcane stem node detection in field environments with strong background interference and changing light. The network structure of YOLO11 is shown in Fig 1.
Improved attention scale sequence with new small target detection layer
Although the YOLO11 algorithm performs well in common task scenarios (e.g., pedestrian and vehicle detection), many problems remain when it is applied directly to sugarcane stem node recognition in complex environments. In real sugarcane fields, the plants grow densely, the stem nodes are small and numerous, and occlusion and overlap by sugarcane leaves and roots lead to missed and false detections. Since sugarcane stem nodes are usually small targets occupying few pixels in an image, the neck network of YOLO11 extracts feature information at 80 × 80, 40 × 40, and 20 × 20 pixel scales through downsampling and fuses multi-scale information through an upsampling strategy. However, this scheme overlooks the tiny sugarcane stem node features, making these small targets difficult to recognize at long distances.
In order to address the above problems, this paper introduces an additional up-sampling layer in the neck to create a 160 × 160 pixel P2 detection layer. The detailed procedure of this improvement is shown in Fig 2. During feature extraction, a convolutional neural network progressively reduces feature information through successive downsampling layers. For example, at a stride of 16, a target region of 32 × 32 pixels occupies only 2 × 2 pixels in the feature map, leaving too little effective area for small objects to be recognized. Moreover, as the network deepens, the feature and position information of small targets is gradually lost, which hinders target localization. In convolutional neural networks, high-level feature maps have large receptive fields and rich semantic features but lose a large amount of spatial detail. In contrast, shallow feature maps have smaller receptive fields [19] but high spatial resolution and accurate target locations, making them suitable for small object detection; a high-resolution feature layer (P2) is therefore added to the original model to better exploit shallow detail and location information.
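The footprint arithmetic above can be sanity-checked directly. The helper below is an illustration, not part of the model code: it computes the approximate feature-map footprint of a 32 × 32 pixel stem node at the strides of the P2 to P5 layers (4, 8, 16, 32).

```python
def feature_footprint(target_px: int, stride: int) -> int:
    """Approximate side length (in feature-map cells) of a square target."""
    return max(1, target_px // stride)

# A 32x32-pixel stem node at the strides of the P2-P5 layers:
footprints = {f"P{i}": feature_footprint(32, 4 * 2 ** (i - 2)) for i in range(2, 6)}
print(footprints)  # P2 keeps an 8x8 footprint, while P5 collapses to one cell
```

The added P2 layer (stride 4) retains an 8 × 8 footprint for the same target that shrinks to a single cell at P5, which is why the shallow layer helps small-object localization.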
To further enhance the multi-scale feature fusion capability, this paper introduces the ASF structure into the neck layer of YOLO11 to cope with sugarcane stem node detection in complex field environments. ASF is a target detection and segmentation framework that combines spatial attention with multi-scale features, and its core idea is to dynamically fuse feature maps from different layers of the backbone network.
In this paper, we add the P2 feature layer on top of the ASF architecture to comprehensively capture the multi-scale information of the target and form a new feature fusion architecture. The architecture mainly consists of the SSFF (Scale Sequence Feature Fusion Layer) module and the TFE (Triple Feature Encoder) module, which complement each other in global semantics and local detail modeling. The SSFF module aims to efficiently fuse the feature maps of P2, P3, P4, and P5, which capture different spatial scales covering various sizes of sugarcane in the dataset due to the shooting angle of the stem node targets; in SSFF, the feature maps of P2, P3, P4, and P5 are normalized to the same size and then stacked together as inputs to a three-dimensional (3D) convolution to combine multi-scale features.
The TFE module aims to enhance the detection of small targets among dense sugarcane stem nodes by concatenating features of three different sizes (large, medium, and small) in the spatial dimension, capturing detailed information about small targets. The detailed feature information of the TFE module is subsequently integrated into each feature branch through the PANet (Path Aggregation Network) structure. Specifically, the TFE module models fine-grained details across small, medium, and large receptive fields: it slices features at three spatial scales and fuses them via concatenation followed by convolutional layers. The final enhanced feature is:
$$F_{\mathrm{TFE}} = \mathrm{Concat}(F_l, F_m, F_s) \tag{2-1}$$
where $F_{\mathrm{TFE}}$ denotes the feature map output by the TFE module, and $F_l$, $F_m$, and $F_s$ correspond to the feature maps of large-, medium-, and small-size targets, respectively.
In order to solve the multi-scale problem of sugarcane stem nodes caused by the shooting angle, existing literature uses the feature pyramid [20] structure for feature fusion, which usually merges pyramid features by summation or concatenation. However, existing feature pyramid networks do not fully exploit the correlation among all pyramid feature maps. SSFF better combines the high-dimensional information of deep feature maps with the detailed information of shallow feature maps. The image dimensions change during downsampling, but scale-invariant features remain unchanged. A scale space is constructed along the image axes to represent not just a single scale but the range of scales an object may take. Although details may be lost in blurred images, the structural features of the image are preserved. The scaled images input to SSFF are obtained by Eqs (2–2) and (2–3):
$$F_\sigma(w, h) = G_\sigma(w, h) * f(w, h) \tag{2-2}$$
$$G_\sigma(w, h) = \frac{1}{2\pi\sigma^2}\, e^{-(w^2 + h^2)/2\sigma^2} \tag{2-3}$$
where $f(w, h)$ denotes a 2D input image with width $w$ and height $h$, $F_\sigma(w, h)$ denotes the smoothed image produced by convolving $f(w, h)$ with the 2D Gaussian filter $G_\sigma(w, h)$, and $\sigma$ denotes the scale parameter that determines the standard deviation of the 2D Gaussian filter used in the convolution. The SSFF module performs multi-scale feature aggregation by normalizing feature maps from different levels (P2, P3, P4, P5) to the same resolution and stacking them along the channel dimension. These are then processed with 3D convolution to capture inter-scale dependencies. The fused feature can be expressed as:
$$F_{\mathrm{SSFF}} = \mathrm{Conv3D}\Big(\mathrm{Stack}\big(\mathrm{Resize}(P_2), \mathrm{Resize}(P_3), \mathrm{Resize}(P_4), \mathrm{Resize}(P_5)\big)\Big)$$
where $P_l$ represents the feature map from level $l$, and Resize denotes upsampling or downsampling to a unified resolution.
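As a rough sketch of the SSFF input pipeline, the NumPy snippet below mimics the Resize-and-Stack step with nearest-neighbor interpolation. The function names, channel count, and the unified 80 × 80 resolution are illustrative assumptions; the real module uses learned convolutional layers rather than plain resizing.

```python
import numpy as np

def nn_resize(fm: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbor resize of a (C, H, W) feature map to (C, size, size)."""
    c, h, w = fm.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return fm[:, rows][:, :, cols]

def ssff_stack(levels, size=80):
    """Resize P2..P5 maps to a common resolution and stack along a scale axis.
    The output shape (C, S, H, W) is what a 3D convolution would consume."""
    return np.stack([nn_resize(f, size) for f in levels], axis=1)

# Toy feature maps at P2..P5 resolutions, channel count unified to 4:
levels = [np.random.rand(4, s, s) for s in (160, 80, 40, 20)]
fused_in = ssff_stack(levels, size=80)
print(fused_in.shape)  # (4, 4, 80, 80): channels, scales, height, width
```

Stacking along a new scale axis is what lets a 3D convolution model correlations across pyramid levels instead of treating each level independently.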
The information from the TFE module is subsequently integrated into each feature branch through the PANet structure and combined with the multi-scale information from the SSFF module in the P3 branch. The three large, medium, and small feature maps with the same dimensions are each convolved once and then concatenated along the channel dimension. Building upon YOLO11, we reconstruct the neck by integrating the concept of ASF, introduce a shallow P2 path, and optimize the feature fusion process. This approach aims to enhance small object detection performance with a lighter architecture and reduced complexity. The specific improvements are illustrated in Fig 2.
Co-optimizing lightweight shared convolutional detection heads for small object detection
Aiming at the shortcomings of the traditional YOLO single-scale prediction structure in multi-scale feature utilization, YOLO11 significantly enhances multi-scale feature extraction through improvements such as the C3k2 and C2PSA modules, better adapting to target detection in complex scenes. In this paper, a Lightweight Shared Detail-Enhanced Convolution Detection Head (LSDECD) [21] is designed to co-optimize with the added P2 feature layer. This detection head uses Group Normalization (GN) instead of the traditional Batch Normalization (BN) [22], which is especially suitable for small-batch training scenarios. By grouping feature channels when computing the normalization, group normalization effectively improves the stability of model training and further optimizes localization and classification performance.
Since the introduction of the P2 layer increases the number of model parameters, LSDECD adopts a parameter-sharing strategy to reduce the parameter size and computation of the detection head, integrating the original four feature extraction operations into a single process and adjusting the feature map size through a scale module. At the same time, Detail-Enhanced Convolution (DEConv) is introduced in place of conventional convolution to improve the expressiveness and generalization of feature extraction. DEConv comprises five parallel convolutions: four differential convolutions and one regular convolution, all performing feature extraction in parallel. Central Differential Convolution (CDC), Angular Differential Convolution (ADC), Horizontal Differential Convolution (HDC), and Vertical Differential Convolution (VDC) integrate traditional local descriptors into the convolutional layer, enhancing representation and generalization capabilities. In a differential convolution, pixel differences in the image are first computed and then convolved with the kernel to generate the output feature map; by designing the pixel-difference computation strategy, prior information can be explicitly encoded into the convolutional neural network (CNN).
The structure of the LSDECD detector head is shown in Fig 3, and the overall improved structure is shown in Fig 4, where the feature fusion layer is modified based on the original YOLO11, and the TFE, SSFF, and Add modules are introduced, replacing the original detector head with the LSDECD detector head, all of which are marked with a red box in the figure:
Loss function improvement
In the field environment, detection suffers from factors such as illumination and occlusion by sugarcane leaves, and stem nodes are also densely distributed targets. The Shape-IoU [23] loss function not only accounts for the geometric relationship between the predicted box and the ground-truth box but also integrates the shape and scale of the bounding box itself. For bounding box regression samples of the same size, IoU values may still differ even when the positional or shape deviation is the same. In addition, deviations along the short side of the bounding box have a more significant effect on the IoU value. Among bounding box regression samples of the same shape, the IoU values of smaller samples are more easily affected by the shape of the ground-truth (GT) box. The formulation of Shape-IoU can be derived from Fig 5 and is defined as follows:
Following Bodla et al. [24], we adopt the linear variant of Soft-NMS, which decays the confidence scores of overlapping boxes rather than eliminating them. Unlike traditional NMS that uses hard thresholding, Soft-NMS ensures that adjacent but distinct stem nodes are retained. First, the standard Non-Maximum Suppression (NMS) was replaced with Soft-NMS, which adjusts the suppression score using a continuous decay function rather than hard thresholding. This allows nearby overlapping boxes with slightly lower confidence to be retained, mitigating missed detections of dense and adjacent stem nodes.
The score update follows the linear decay version of Soft-NMS, as in Eq (2–12):
$$s_i = \begin{cases} s_i, & \mathrm{IoU}(\mathcal{M}, b_i) < N_t \\ s_i \left(1 - \mathrm{IoU}(\mathcal{M}, b_i)\right), & \mathrm{IoU}(\mathcal{M}, b_i) \ge N_t \end{cases} \tag{2-12}$$
where $s_i$ is the score of box $b_i$, $\mathcal{M}$ is the highest-scoring box, and $N_t$ is the overlap threshold. We set $N_t = 0.5$.
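The linear-decay update can be sketched in a few lines of Python. This is a minimal single-class version for illustration, not the implementation used in the model:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def soft_nms_linear(boxes, scores, nt=0.5):
    """Linear Soft-NMS: decay the scores of boxes overlapping the current
    best by (1 - IoU) instead of discarding them outright."""
    boxes, scores = list(boxes), list(scores)
    keep = []
    while boxes:
        m = int(np.argmax(scores))
        best_box, best_score = boxes.pop(m), scores.pop(m)
        keep.append((best_box, best_score))
        for k in range(len(boxes)):
            ov = iou(best_box, boxes[k])
            if ov >= nt:
                scores[k] *= (1.0 - ov)  # decay instead of hard suppression
    return keep

# Two heavily overlapping boxes: the weaker one survives with a decayed score,
# which is what preserves adjacent, densely packed stem nodes.
dets = soft_nms_linear([(0, 0, 10, 10), (1, 0, 11, 10)], [0.9, 0.8], nt=0.5)
print([round(s, 3) for _, s in dets])
```

Under hard NMS the second box would be removed entirely at this overlap; here it is merely down-weighted and can still be kept by a final score threshold.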
Second, we replace the IoU-based regression loss with a Shape-IoU loss, which introduces a shape-aware penalty to account for the width–height discrepancy, better capturing the anisotropic structure of sugarcane stem nodes. Shape-IoU is defined as:
Where the ShapePenalty is:
We set α = 1.5 to balance sensitivity to aspect-ratio differences. This design was inspired by the shape-sensitive penalty mechanisms used in attention modeling for forgery detection [25], where architectural attention is modulated by structural divergence; we adopt the same spirit for spatial regularization in object shape regression. Finally, Shape-IoU is used to replace the original IoU in Soft-NMS, which better distinguishes the bounding box quality of slender targets. Combining the above formulas then gives:
Model lightweight
The addition of the high-resolution P2 detection layer and the replacement of the shared convolutional detection head increase the model's parameters and complexity, making it difficult to apply in resource-constrained scenarios such as field environments. To facilitate deployment on resource-constrained devices, this paper employs LAMP (Layer-Adaptive Sparsity for Magnitude-Based Pruning) [26] to reduce the model size through structured pruning.
Traditional Magnitude-Based Pruning (MBP) usually uses a uniform pruning rate (i.e., all layers are pruned at the same rate) without considering the impact of different layers on the model performance, which may lead to poor pruning results. In this paper, we adopt a channel pruning approach using the LAMP algorithm to compute the channel importance score. Applying the concept of unstructured pruning to channel pruning enhances flexibility and adaptability, which enables accurate pruning of the model according to specific tasks and requirements. This flexibility enables optimization of the model structure according to different application scenarios and platform specifications.
First, LAMP is an unstructured pruning approach that computes a layer-adaptive sparsity score, which is then used to determine appropriate sparsity levels across layers and to guide magnitude-based pruning. This approach balances sparsity and accuracy without tuning hyperparameters or performing complex computations. The fine-grained behavior of unstructured pruning motivates this paper to incorporate the LAMP score into the importance-score calculation for structured pruning. Advanced channel pruning methods typically begin with sparsity training to guide the model, producing many scaling factors close to zero; connections corresponding to these near-zero factors are then removed (pruning), and the pruned model is retrained to recover accuracy (fine-tuning). The pruning scheme in this paper uses the LAMP score to simplify importance-score computation and channel pruning, avoiding extensive sparsity training. It synchronizes the pruning process and retains the weighted connections with higher relative importance, maximizing accuracy recovery and minimizing accuracy loss during fine-tuning. The score is computed as shown in Eq (2–16):
$$\mathrm{score}(i; W) = \frac{\left(W[i]\right)^2}{\sum_{j \ge i} \left(W[j]\right)^2} \tag{2-16}$$
where |W[i]| denotes the weight of tensor W mapped by index i, and score(i; W) denotes the LAMP score for the i-th index of W. Consider a feedforward neural network: for a fully connected layer the weight tensor is a two-dimensional matrix, while for a two-dimensional convolutional layer the corresponding tensor is four-dimensional. To define the score uniformly, each weight tensor is flattened into a one-dimensional vector, and the weights are assumed to be sorted in ascending order of magnitude under the index mapping, i.e., |W[i]| ≥ |W[j]| when i > j. LAMP measures the relative importance of all retained connections in the same layer and prunes those with smaller weights by a simple check, as expressed in Eq (2–17):
In our implementation, pruning is performed in a layer-wise manner, without any additional sparse training. The pruning process is executed iteratively over 200 steps, where a regularization term is gradually increased using a linear decay schedule. During each iteration, the algorithm evaluates the importance of each convolutional channel by computing the LAMP score: the ℓ1-norm magnitudes of each channel’s weights are first sorted layer-wise, and the cumulative distribution is used to normalize the importance scores. Channels with the lowest LAMP scores are progressively pruned in each layer, ensuring a fine-grained local selection. The max_sparsity is set to 1.0 to allow full exploration of pruning potential in each layer, while the target speed_up ratio is fixed at 3.0 to constrain the total pruning intensity. The pruning regularization strength reg is set to 0.0005 and adjusted with a small increment at each step to gradually enforce sparsity.
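The per-weight score computation for one layer can be sketched as follows. This is a simplified illustration of the LAMP score on a flattened weight vector; the pruning in this paper aggregates such scores at the channel level.

```python
import numpy as np

def lamp_scores(weights: np.ndarray) -> np.ndarray:
    """LAMP score per weight: w_i^2 divided by the sum of w_j^2 over all
    weights in the same layer with magnitude >= |w_i|."""
    flat = np.sort(np.abs(weights.ravel()))      # ascending magnitudes
    sq = flat ** 2
    tail = np.cumsum(sq[::-1])[::-1]             # sum_{j >= i} w_j^2
    return sq / tail

# The largest-magnitude weight in a layer always scores 1.0, so layer-wise
# magnitude distributions, not raw magnitudes, decide where pruning happens.
scores = lamp_scores(np.array([0.1, 0.5, 0.2, 1.0]))
print(scores.round(4))
```

Because the denominator is layer-local, a small weight in a layer of mostly small weights can outscore a larger weight in a heavy-tailed layer, which is the layer-adaptive behavior that uniform magnitude pruning lacks.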
The effect of pruning is shown in Fig 6, where the horizontal axis is the network layer index and the vertical axis is the number of channels; yellow marks the per-layer parameter count of the original model and red that of the pruned model. The number of channels in the feature fusion layer is drastically reduced, suggesting a high level of redundancy there and a smaller contribution to the task. The parameter count of the detection head can hardly be reduced, indicating that the detection head is the structure that most affects model size and is key to detection and recognition. The channels of the backbone network and feature fusion layer are reduced by 84%, while the channels of the detection head are reduced by only 7%, so the detection head needs to keep more channels to maintain localization accuracy. After pruning, the number of model parameters is reduced by 90% and FLOPs by 44%, while mAP50 decreases by only 1.1%, indicating that pruning significantly improves inference efficiency while maintaining accuracy.
Experimentation and analysis
Dataset preparation
The dataset used in this study consists of 2,346 images. Among them, 1,618 images were collected from sugarcane fields under real-world outdoor conditions at the experimental site in Balian Village, Fusui County, Chongzuo City, Guangxi Zhuang Autonomous Region. The remaining 728 images were taken under controlled indoor conditions. These indoor samples were obtained from sugarcane collected at the Agricultural Science and Technology Innovation Base in Fusui County; after manual harvesting and leaf removal, the samples were photographed in a laboratory at Guangxi University for Nationalities. The outdoor images were captured with an Apple iPhone 13 Pro at a resolution of 3024 × 4032, while the indoor images were captured with an Intel RealSense D435i depth camera mounted at a height of 1 m at a resolution of 1920 × 1080; all images were saved in JPG format. The images were randomly divided into 80% for training and 20% for validation. The dataset was then manually annotated using LabelImg by drawing bounding boxes around the sugarcane stem nodes in each image, with annotations saved in YOLO text format.
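A reproducible 80/20 split of this kind takes only a few lines of Python; the filename pattern and seed below are illustrative assumptions, not the study's actual script.

```python
import random

def split_dataset(paths, train_frac=0.8, seed=0):
    """Shuffle image paths reproducibly and split into train/val lists."""
    rng = random.Random(seed)          # fixed seed keeps the split repeatable
    shuffled = paths[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# 2,346 images as in the dataset above (hypothetical filenames):
train, val = split_dataset([f"img_{i:04d}.jpg" for i in range(2346)])
print(len(train), len(val))  # 1876 470
```

Fixing the seed matters here: ablation and comparison experiments are only comparable if every model sees the same train/validation partition.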
Experimental environment and evaluation indicators
In this study, model training was performed on a computer running the Ubuntu 20.04 operating system, configured with 16 virtual CPUs at 2.10 GHz and an RTX 4090 GPU (24 GB); the experimental environment was based on Python 3.8 and CUDA 11.8. The relevant hyperparameters were set as follows: the model accepts images at a standard input resolution of 640 × 640 pixels, the initial learning rate is 0.01, and the optimizer is SGD. Considering variations in model size and training speed, the batch size is set to 16 and the number of training epochs to 200.
On the test dataset, this paper uses several evaluation metrics to assess the performance of the trained model. All samples are categorized into four types: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). Precision (P) and recall (R) are defined from the counts of these four sample types. The metrics for evaluating the model mainly include average precision (mAP50, mAP50:95), model size, precision (P), recall (R), number of parameters, and FLOPs. Together, these metrics characterize the model's performance in the target detection task, covering detection accuracy, efficiency, and complexity. Specifically, average precision is computed at different IoU thresholds, with mAP50 and mAP50:95 being the commonly used standards. The inference speed of the model is an important factor in assessing its usability in real-world applications, while the number of parameters and FLOPs directly affect deployment and computational resource requirements. The formulas for precision, recall, and average precision are given in Eqs (3−1), (3−2), and (3−3), respectively:
$$P = \frac{TP}{TP + FP} \tag{3-1}$$
$$R = \frac{TP}{TP + FN} \tag{3-2}$$
$$mAP = \frac{1}{N} \sum_{i=1}^{N} \int_0^1 P_i(R)\, \mathrm{d}R \tag{3-3}$$
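As a worked example of these metrics, the snippet below computes precision and recall from raw counts and approximates AP as the area under a precision-recall curve; the rectangle-rule AP is a simplified stand-in for the interpolated computation detection toolkits actually use.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Area under a precision-recall curve by the rectangle rule,
    assuming recalls are given in increasing order."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# Hypothetical counts from one validation pass:
p, r = precision_recall(tp=90, fp=10, fn=8)
print(round(p, 3), round(r, 3))                   # 0.9 0.918
print(average_precision([0.5, 1.0], [1.0, 0.8]))  # 0.5*1.0 + 0.5*0.8 = 0.9
```

mAP50 applies this AP at an IoU threshold of 0.5, while mAP50:95 averages AP over thresholds from 0.5 to 0.95 in steps of 0.05.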
Comparison, ablation experiments and analysis of their results
To validate the independent and combined effectiveness of soft-NMS and Shape-IoU, ablation experiments were conducted sequentially on these improvements using our enhanced algorithm. Table 1 compares baseline (NMS + CIoU), Soft-NMS, Shape-IoU, and their combination.
As shown in Table 1, the baseline model achieves 93.2% mAP50 and 89.7% recall. With the addition of Soft-NMS, the mAP50 increases to 95.2% and the recall improves slightly to 90.6%, demonstrating that Soft-NMS effectively preserves overlapping detections, thus enhancing overall detection performance. Introducing Shape-IoU as the localization loss further improves both metrics, achieving 94.7% mAP50 and 91.1% recall. Notably, when both Soft-NMS and Shape-IoU are applied simultaneously, the model achieves the best performance, with a mAP50 of 96.1% and recall of 92.6%. These results indicate that the combination of Soft-NMS and Shape-IoU provides complementary benefits, leading to more accurate and robust object detection, particularly in complex agricultural scenes involving dense and elongated targets.
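The key difference between hard NMS and Soft-NMS is that overlapping boxes have their scores decayed rather than being discarded outright, which preserves true detections in dense stem-node clusters. A minimal NumPy sketch of the Gaussian variant from Bodla et al. [24] follows; the sigma and score-threshold values are typical defaults, not the paper's settings.

```python
import numpy as np

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of boxes overlapping the current best
    detection instead of removing them. boxes: (N, 4) [x1, y1, x2, y2]; scores: (N,)."""
    boxes = boxes.astype(float)
    scores = scores.astype(float).copy()
    keep = []
    idxs = np.arange(len(scores))
    while len(idxs) > 0:
        m = np.argmax(scores[idxs])       # highest-scoring remaining box
        best = idxs[m]
        keep.append(int(best))
        idxs = np.delete(idxs, m)
        if len(idxs) == 0:
            break
        # IoU of the best box with every remaining box
        x1 = np.maximum(boxes[best, 0], boxes[idxs, 0])
        y1 = np.maximum(boxes[best, 1], boxes[idxs, 1])
        x2 = np.minimum(boxes[best, 2], boxes[idxs, 2])
        y2 = np.minimum(boxes[best, 3], boxes[idxs, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_b = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_r = (boxes[idxs, 2] - boxes[idxs, 0]) * (boxes[idxs, 3] - boxes[idxs, 1])
        iou = inter / (area_b + area_r - inter)
        # Gaussian decay: the higher the overlap, the stronger the suppression
        scores[idxs] *= np.exp(-(iou ** 2) / sigma)
        idxs = idxs[scores[idxs] > score_thresh]   # drop only near-zero scores
    return keep
```

With hard NMS, a duplicate box with IoU above the threshold would be removed immediately; here it survives with a reduced score, so partially occluded neighboring stem nodes are less likely to be suppressed.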
In this paper, comparisons are made with other algorithms, including RT-DETR-l as well as the YOLO family: YOLOv8n, YOLOv9t, YOLOv10n, YOLOv12n, Hyper-YOLO, Mamba-YOLO, and YOLO11's larger variants YOLO11s, YOLO11m, and YOLO11l. All models were retrained under the improved loss function and compared on the same validation dataset. The comparison results are shown in Table 2 and Fig 7:
The experimental results show that although RT-DETR-l reaches 94.9%, 94.3%, 97.3%, and 51.8% on the four key metrics of precision, recall, mAP50, and mAP50:95, respectively, leading on the first three, its parameter count exceeds ten million and its large GFLOPs make it difficult to deploy in field scenarios. In contrast, the improved algorithm in this paper reaches 93.9%, 92.6%, 96.1%, and 53.2% on the same four metrics with only about one-tenth of the computational complexity and model size; it trails RT-DETR-l only slightly in precision, exceeds it in mAP50:95, and leads all the other YOLO-series algorithms compared.
Moreover, our approach demonstrates superior performance over emerging models including YOLOv12n, Hyper-YOLO, and Mamba-YOLO in terms of both accuracy and model efficiency. Although Hyper-YOLO and Mamba-YOLO achieve competitive mAP50 scores (89.9% and 89.0% respectively), they still fall short of our method and have relatively larger parameter sizes (3.6M and 5.7M) and GFLOPs (9.5 and 12.3).
Meanwhile, in terms of lightweight design, the pruned algorithm has about one-tenth the parameters of the original YOLO11n, as few as 279,778. Its GFLOPs are only slightly higher than those of the YOLO11n model after the high-resolution detection layer is added, and the model size is only 1.3MB. While remaining lightweight, it still achieves 92.3%, 90.7%, 95.0%, and 51.4% on the four key metrics, ahead of the other algorithms in the table. Before the loss function was improved, the mAP50 and mAP50:95 obtained with YOLO11's original CIoU loss were 2.9% and 3.8% lower, respectively. The prediction comparison between the improved algorithm and YOLO11n is shown in Fig 8. The predicted bounding boxes show intuitively that the improved model detects more sugarcane stem nodes with higher accuracy than YOLO11n in both sparse and complex scenes, and it remains better than YOLO11n even after the slight accuracy drop caused by pruning.
Inspired by the rigorous evaluation protocol adopted in Khan et al. [33], we structured our experiments to include extensive comparisons with state-of-the-art models, cross-validation, and statistical significance testing. To evaluate the robustness and generalization ability of the proposed model, we adopted a 5-fold cross-validation strategy. The dataset was randomly partitioned into five equal subsets. In each round, one subset was used as the validation set while the remaining four served as the training set; the process was repeated five times. The experimental results are shown in Table 3. Subsequently, paired t-tests were performed on the five cross-validation folds. The test results demonstrated that mAP50 achieved a statistically significant improvement (t = 23.68, p < 0.0001), suggesting that the performance enhancement was not attributable to random chance. This method effectively reduces evaluation bias and ensures the reliability of the experimental results.
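The paired t-test on per-fold metrics can be sketched in plain Python. The fold scores below are illustrative placeholders, not the paper's measured values; only the statistic computation itself is standard.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Paired t-test statistic over per-fold metric differences (scores_a - scores_b).
    With n folds the statistic has n - 1 degrees of freedom."""
    d = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)  # sample variance of differences
    return mean_d / math.sqrt(var_d / n)

# Hypothetical per-fold mAP50 values for the improved model vs. the baseline
improved = [0.958, 0.962, 0.960, 0.959, 0.963]
baseline = [0.842, 0.845, 0.840, 0.844, 0.843]
t = paired_t_statistic(improved, baseline)  # compare against t-distribution, df = 4
```

Pairing by fold removes the fold-to-fold variance shared by both models, which is why a consistent improvement across folds yields such a large t value even with only five samples.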
To further validate the contribution of each improvement, ablation experiments were designed in this paper. The results are shown in Table 4 and Fig 9:
As can be seen from Table 4 and Fig 9, the improved feature fusion layer raises mAP50 by 9.9% and mAP50:95 by 9.2%. Adding the high-resolution small-target detection layer P2 to the feature fusion layer dramatically improves the model's attention to sugarcane stem nodes, but it also increases the model's computation, reducing the FPS from 80.9 to 63.6. The heat-map comparison is shown in Fig 10, contrasting the model before and after the improvement of the feature fusion layer: redder regions in the heat map indicate larger activation values, and more activation boxes indicate that the model attends more strongly to the stem-node targets in the image. It can be seen that the original model is not accurate enough when localizing stem nodes during detection, with missed and false detections; after the feature fusion layer is improved, the model localizes stem nodes more accurately and comprehensively, and the average number of activation boxes per map is nearly 40% higher than in the original version.
To further validate the contribution of each component in the improved architecture, detailed ablation studies are presented in Table 5, which systematically quantifies the individual effects of the enhanced feature fusion layer, the P2 high-resolution detection module, and LSDECD detection head on model performance.
As shown in Table 5, the baseline model, without any enhancements, achieved an mAP50 of 84.2% and an mAP50:95 of 42.1%. Introducing the ASF module alone led to a significant increase in mAP50 (+5.5%) but brought only marginal gains in mAP50:95 (+0.3%), suggesting that ASF primarily enhances global semantic representation but is less sensitive to finer object boundaries. The addition of the high-resolution P2 detection head resulted in substantial improvements in both mAP50 (+10.0%) and mAP50:95 (+5.4%), confirming its effectiveness in detecting small and occluded targets. Similarly, integrating the LSDECD module provided consistent gains in detection performance, particularly in mAP50:95 (+3.2%), demonstrating its contribution to scale-aware feature extraction. When all three modules were combined, the model achieved the highest performance with an mAP50 of 96.1% and an mAP50:95 of 53.2%, indicating strong complementarity among the modules and validating the overall design. Based on the aforementioned improvements, the computational cost of the model reached 11.6 GFlops. Therefore, we pruned the model, and the pruning comparison results are shown in Table 6.
As shown in Table 6 and Fig 11, after pruning, the model’s parameters decrease by nearly 90% (from 2.58M to 279K), while computation drops by 43% (5GFLOPs reduction). Despite slight mAP declines (1.1% in mAP50, 1.8% in mAP50:95), performance remains comparable to the YOLO11-SUG. The model size shrinks by 77% (5.2MB → 1.3MB), and detection speed improves by 15.3 FPS (63.6 → 78.9), enhancing real-time capability. These optimizations meet the requirements for field deployment in sugarcane plantations, balancing efficiency, accuracy, and hardware constraints.
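Channel pruning here follows the magnitude-based family of methods (see the layer-adaptive sparsity approach of [26]). As a simplified illustration of the core idea, ranking output channels by weight magnitude and keeping only the most important fraction, consider this NumPy sketch; the keep ratio and ranking criterion are illustrative assumptions, not the paper's exact pruning schedule.

```python
import numpy as np

def prune_conv_channels(weight, keep_ratio=0.5):
    """Rank the output channels of a conv weight tensor (out_c, in_c, kH, kW)
    by their L1 norm and keep only the top fraction.
    Returns the sorted indices of kept channels and the pruned weight tensor."""
    l1 = np.abs(weight).sum(axis=(1, 2, 3))           # importance score per output channel
    n_keep = max(1, int(round(weight.shape[0] * keep_ratio)))
    keep = np.sort(np.argsort(l1)[::-1][:n_keep])     # highest-norm channels, index order
    return keep, weight[keep]
```

In a full pipeline, the kept indices would also be used to slice the input channels of the following layer and the affected BatchNorm parameters, after which the network is fine-tuned to recover accuracy, which is what allows the ~90% parameter reduction here at only a ~1-2% mAP cost.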
Scale-specific performance analysis
Our improved feature fusion layer method draws inspiration from the saliency-driven multi-scale perception framework proposed by Khan et al. [34], which emphasizes the role of attention-enhanced instinctive vision in improving small-object detection through saliency-aware feature refinement. Their work highlights that shallow, high-resolution feature layers are crucial for preserving spatial detail and local structure, especially for small and task-relevant targets. Building on this insight, we incorporate and adapt the Attentional Scale Sequence Fusion (ASF) module from ASF-YOLO into the neck of YOLO11, and further enhance it by introducing an additional P2 layer to better retain fine-grained spatial information. This attentional multi-scale fusion strategy enables the model to dynamically emphasize critical features across different scales, proving particularly effective in complex field environments where sugarcane stem nodes are frequently occluded or entangled with background vegetation.
To further validate the effectiveness of our proposed multi-scale fusion strategy, we selected a subset of images from our dataset and, based on bounding box area, classified the sugarcane stem node targets into small, medium, and large validation sets containing 112, 162, and 177 samples, respectively.
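The small/medium/large split can be sketched as a simple area bucketing. The 32² and 96² pixel-area cutoffs below follow the common COCO convention and are an assumption, since the paper does not state its exact thresholds.

```python
def size_bucket(w, h, small_max=32 ** 2, medium_max=96 ** 2):
    """Assign a bounding box of pixel width w and height h to a scale bucket
    by area. Thresholds follow the COCO convention (assumed, not paper-stated)."""
    area = w * h
    if area < small_max:
        return "small"
    if area < medium_max:
        return "medium"
    return "large"
```

Evaluating mAP separately per bucket isolates the contribution of the P2 layer, whose high-resolution features mainly benefit the "small" bucket.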
Experiments were conducted on these three validation sets in the same experimental environment using the baseline model YOLO11, the classic multi-scale detection method EfficientDet [35], and our improved YOLO11-SUG.
As shown in Table 7, compared with YOLO11, our method achieves substantial improvements, particularly in small object detection, where the mAP50 increases from 82.4% to 94.2%. This demonstrates the effectiveness of introducing the ASF module and the additional P2 layer in preserving fine-grained spatial features. Furthermore, our model consistently outperforms EfficientDet, with slightly higher mAP50 on small objects (94.2% vs. 93.7%) and more evident gains on medium (96.8% vs. 94.3%) and large objects (97.7% vs. 96.4%). These results validate the robustness and adaptability of our architecture in complex field environments, especially for detecting dense and small-scale sugarcane stem nodes.
Discussions
This paper proposes several enhancement strategies for YOLO11 to improve sugarcane stem node detection, particularly for small object detection in complex field environments. The addition of the high-resolution P2 detection layer improves recall and precision on small targets like stem nodes, especially in scenarios with occlusion and varying illumination. While the P2 layer increases computational complexity, channel pruning effectively mitigates this issue, reducing parameters to 279,778 and model size to 1.3MB.
The ASF feature fusion module enhances multi-scale representation, and the LSDECD detection head balances parameter sharing with detail enhancement, providing stronger representation for small targets without significantly increasing model size. In bounding box regression, Shape-IoU and Soft-NMS improve localization accuracy under occlusion and illumination challenges, though they introduce some computational overhead.
Overall, the improvements in this paper show high utility in the sugarcane stem node detection task, making especially significant progress in small target detection and adaptation to complex environments. However, the generalizability of these improvements still needs to be verified on a broader range of agricultural target detection tasks. To evaluate model robustness, data augmentation was performed on 659 images from the original dataset to simulate three extreme weather conditions: dust, rain, and strong light exposure. This expanded the dataset to 1,977 images, which were then divided into training and validation sets at an 8:2 ratio. Under the same experimental environment as the comparative experiments, the proposed model, YOLO11n, YOLOv12n, and Mamba-YOLO were trained and validated in turn on this dataset. The experimental results are shown in Table 8, and the prediction comparison is illustrated in Fig 12.
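The paper does not specify how the three weather conditions were synthesized. The NumPy sketch below shows one plausible image-level approximation of each; every function name, parameter value, and effect model is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def simulate_strong_light(img, gain=1.6):
    """Global brightness boost to mimic harsh sunlight (clipped to the valid range)."""
    return np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)

def simulate_dust(img, strength=30, seed=0):
    """Reduced contrast, a sandy tint, and additive noise as a rough dust proxy."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0, strength, img.shape)
    tinted = img.astype(np.float32) * 0.85 + np.array([40.0, 35.0, 20.0])
    return np.clip(tinted + noise, 0, 255).astype(np.uint8)

def simulate_rain(img, n_streaks=200, length=15, seed=0):
    """Short bright vertical streaks as a simple rain approximation."""
    out = img.copy()
    rng = np.random.default_rng(seed)
    h, w = img.shape[:2]
    for _ in range(n_streaks):
        x = rng.integers(0, w)
        y = rng.integers(0, max(1, h - length))
        out[y:y + length, x] = np.clip(out[y:y + length, x].astype(int) + 80, 0, 255)
    return out
```

Applying each transform once to the 659 source images would yield the 3x expansion to 1,977 images described above.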
While real-world multi-condition data collection would be ideal, it was logistically constrained; our approach was instead inspired by Khan et al. [36], who emphasized enhancing feature robustness under visually challenging environments through attention-guided feature selection. Following this principle, we designed a comparative experiment to test the generalizability of our model against YOLO11n, YOLOv12n, and Mamba-YOLO under simulated complex conditions.
The experimental results demonstrate that, after being subjected to simulated extreme weather conditions, all four models exhibited a decline in both mAP50 and mAP50:95. However, our model experienced the smallest reduction in mAP50, and its decrease in mAP50:95 was only slightly larger than that of YOLOv12n. Furthermore, our model maintained higher precision values than the other three models, which strongly suggests its robust performance.
Nevertheless, while the proposed improvements significantly enhance detection accuracy and robustness in field conditions, model deployment efficiency on edge devices remains a critical concern in real-world agricultural scenarios. Due to hardware constraints, real-world inference performance testing on platforms such as Jetson Nano or Raspberry Pi 4 was not conducted. However, to assess the model’s deployment feasibility, we referred to several representative benchmarking studies of YOLO-based models on edge computing devices [37–41].
Alqahtani et al. [37] analyzed the performance trade-offs between accuracy and latency on devices such as Jetson Nano and Raspberry Pi, showing that while SSD-MobileNetV1 provides low inference time, YOLO variants deliver better accuracy at acceptable latency. Feng et al. [38] demonstrated that YOLOv3-tiny and YOLOv4-tiny can achieve over 20 FPS on Raspberry Pi 4 with Intel NCS2 accelerators.
The large-scale benchmarking work YOLOBench [39] confirmed that lightweight YOLO variants with fewer than 1M parameters and sizes under 5MB typically achieve 20–30 FPS on Jetson Nano. More recent studies like FRYOLO [40] optimized YOLOv8 using pruning and Ghost modules to achieve 33 FPS on Jetson Orin Nano with minimal accuracy loss. Edge-YOLO [41] adopted a pruning edge-cloud collaboration strategy, achieving 26.6 FPS on embedded systems while maintaining high robustness.
Our pruned model, with only 279,778 parameters and a size of 1.3MB, is significantly smaller than mainstream lightweight detectors such as YOLOv5n or YOLOv8n. Given the similar or higher resource demands of the models evaluated in these studies, it is reasonable to infer that our model could also achieve real-time performance on edge hardware, especially when combined with inference optimizations such as quantization, TensorRT, or ONNX acceleration.
This indicates that the proposed model, while tailored for robust sugarcane stem node detection in challenging environments, also holds promise for practical deployment in resource-constrained agricultural edge systems, where low latency and compact model design are critical.
Conclusions
In this paper, a series of improvements are proposed for the sugarcane stem node detection task, including the introduction of an ASF feature fusion layer, the addition of a high-resolution small-target detection layer P2, the adoption of a shared convolutional detection head LSDECD, the combination of Soft-NMS and Shape-IoU, and the optimization of the model complexity by channel pruning. Experimental results show that these improvements significantly enhance the detection performance and practical application value of the model.
Before pruning, the mAP50 and mAP50:95 of the improved model reach 96.1% and 53.2%, respectively, improvements of 11.9% and 11.1% over the original YOLO11n, which thoroughly verifies the superiority of the improved method in small target detection and multi-scale feature processing. The pruned model effectively reduces the number of parameters and computational complexity while its mAP50 and mAP50:95 still improve by 10.8% and 9.3%, respectively, over the original YOLO11n, indicating that the pruning strategy can significantly improve inference efficiency while maintaining model performance, providing good support for practical deployment.
In summary, the method proposed in this paper not only achieves significant improvement in detection accuracy but also effectively balances the relationship between model complexity and performance through channel pruning. These improvements explore feasible technical paths for agricultural intelligent detection tasks such as sugarcane stem node detection and have specific practical application value. Future research can further optimize the model structure, improve its robustness and adaptability in complex field environments, and provide specific technical references for the development of agricultural automation.
References
- 1. Xie ZH. Utilization of modern agricultural technology to open a new mode of raw material production. China Sugar Crops. 2015;37:61–3, 65.
- 2. Liao DS, Qin SJ. Analysis of production cost accounting and control issues in Guangxi sugarcane industry. Acad Forum. 2013;36:83–7.
- 3. Redmon J, Divvala S, Girshick R, Farhadi A. You Only Look Once: Unified, Real-Time Object Detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, p. 779–88.
- 4. Ren S, He K, Girshick R, Sun J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv Neur Inform Process Syst. 2015;28:91–9.
- 5. Moshashai K, Almasi M, Minaei S, Borghei AM. Identification of Sugarcane Nodes Using Image Processing and Machine Vision Technology. Int J Agric Res. 2008;3(5):357–64.
- 6. Lu SP, Wen YX, Ge W, Peng H. Feature extraction and recognition of sugarcane stem nodes based on machine vision. Trans Chin Soc Agric Mach. 2010;41:190–4.
- 7. Huang YQ, Qiao X, Tang SX, Luo ZY, Zhang PD. Localization and experiment of sugarcane stem node feature distribution based on Matlab. Trans Chin Soc Agric Mach. 2013;44(10):93–7.
- 8. Huang YQ, Huang TS, Huang MZ, Yin K, Wang XB. Sugarcane Stem Node Recognition Based on Local Mean. J Chin Agric Mechan. 2017;38:76–80.
- 9. Li SP, Li XH, Zhang K, Li KH, Yuan HL. Improved YOLOv3 Network for Enhancing Real-Time Dynamic Recognition Efficiency of Sugarcane Stem Nodes. Trans Chin Soc Agric Eng. 2019;35:185–91.
- 10. Zhou D, Fan Y, Deng G, He F, Wang M. A new design of sugarcane seed cutting systems based on machine vision. Comput Electron Agric. 2020;175:105611.
- 11. Xie Z, Li Y, Xiao Y, Diao Y, Liao H, Zhang Y, et al. Sugarcane stem node identification algorithm based on improved YOLOv5. PLoS One. 2023;18(12):e0295565. pmid:38079443
- 12. Wen C, Guo H, Li J, Hou B, Huang Y, Li K, et al. Application of improved YOLOv7-based sugarcane stem node recognition algorithm in complex environments. Front Plant Sci. 2023;14:1230517. pmid:37680364
- 13. Zhang Y, Chu J, Leng L, Miao J. Mask-Refined R-CNN: A Network for Refining Object Details in Instance Segmentation. Sensors (Basel). 2020;20(4):1010. pmid:32069927
- 14. Lin W, Chu J, Leng L, Miao J, Wang L. Feature disentanglement in one-stage object detection. Pattern Recog. 2024;145:109878.
- 15. Kang M, Ting C-M, Ting FF, Phan RC-W. ASF-YOLO: A novel YOLO model with attentional scale sequence fusion for cell instance segmentation. Imag Vis Comput. 2024;147:105057.
- 16. Ma L, Zhao L, Wang Z, Zhang J, Chen G. Detection and Counting of Small Target Apples under Complicated Environments by Using Improved YOLOv7-tiny. Agronomy. 2023;13(5):1419.
- 17. Sanida T, Sideris A, Sanida MV, Dasygenis M. Tomato leaf disease identification via two–stage transfer learning approach. Smart Agric Technol. 2023;5:100275.
- 18. Zhang Y, Li L, Chun C, Wen Y, Xu G. Multi-scale feature adaptive fusion model for real-time detection in complex citrus orchard environments. Comput Electron Agric. 2024;219:108836.
- 19. Luo W, Li Y, Urtasun R, Zemel R. Understanding the effective receptive field in deep convolutional neural networks. Adv Neur Inform Process Syst. 2016;29.
- 20. Lin T-Y, Dollar P, Girshick R, He K, Hariharan B, Belongie S. Feature Pyramid Networks for Object Detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
- 21. Chen Z, He Z, Lu Z-M. DEA-Net: Single Image Dehazing Based on Detail-Enhanced Convolution and Content-Guided Attention. IEEE Trans Image Process. 2024;33:1002–15. pmid:38252568
- 22. Tian Z, Shen C, Chen H, He T. FCOS: A Simple and Strong Anchor-Free Object Detector. IEEE Trans Pattern Anal Mach Intell. 2022;44(4):1922–33. pmid:33074804
- 23. Zhang H, Zhang S. Shape-IoU: More Accurate Metric Considering Bounding Box Shape and Scale. 2023.
- 24. Bodla N, Singh B, Chellappa R, Davis LS. Soft-NMS — Improving Object Detection with One Line of Code. In: 2017 IEEE International Conference on Computer Vision (ICCV). 2017, p. 5562–70.
- 25. Talha Usman M, Khan H, Kumar Singh S, Young Lee M, Koo J. Efficient Deepfake Detection via Layer-Frozen Assisted Dual Attention Network for Consumer Imaging Devices. IEEE Trans Consumer Electron. 2025;71(1):281–91.
- 26. Lee J, Park S, Mo S, Ahn S, Shin J. Layer-Adaptive Sparsity for the Magnitude-Based Pruning. 2020.
- 27. Zhao Y, Lv W, Xu S, Wei J, Wang G, Dang Q, et al. DETRs Beat YOLOs on Real-time Object Detection. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2024, p. 16965–74.
- 28. Wang CY, Yeh IH, Mark Liao HY. Yolov9: Learning what you want to learn using programmable gradient information. European conference on computer vision. Cham: Springer Nature Switzerland; 2024, p. 1–21.
- 29. Wang A, Chen H, Liu L, Chen K, Lin Z, Han J. Yolov10: Real-time end-to-end object detection. Adv Neur Inform Process Syst. 2024;37:107984–8011.
- 30. Tian Y, Ye Q, Doermann D. Yolov12: Attention-centric real-time object detectors. arXiv e-prints. 2025.
- 31. Feng Y, Huang J, Du S, Ying S, Yong J-H, Li Y, et al. Hyper-YOLO: When Visual Object Detection Meets Hypergraph Computation. IEEE Trans Pattern Anal Mach Intell. 2025;47(4):2388–401. pmid:40030788
- 32. Wang Z, Li C, Xu H, Zhu X. Mamba YOLO: SSMs-based YOLO for object detection. arXiv e-prints. 2024.
- 33. Khan H, Ullah I, Shabaz M, Omer MF, Usman MT, Guellil MS, et al. Visionary vigilance: Optimized YOLOV8 for fallen person detection with large-scale benchmark dataset. Image Vis Comput. 2024;149:105195.
- 34. Khan H, Usman MT, Rida I, Koo J. Attention enhanced machine instinctive vision with human-inspired saliency detection. Image Vis Comput. 2024;152:105308.
- 35. Tan M, Pang R, Le QV. Efficientdet: scalable and efficient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, p. 10781–90.
- 36. Khan H, Ullah Khan S, Ullah W, Wook Baik S. Optimal features driven hybrid attention network for effective video summarization. Eng Appl Artif Intellig. 2025;158:111211.
- 37. Alqahtani DK, Cheema MA, Toosi AN. Benchmarking deep learning models for object detection on edge computing devices. International Conference on Service-Oriented Computing. Singapore: Springer Nature Singapore; 2024, p. 142–50.
- 38. Feng H, Mu G, Zhong S, Zhang P, Yuan T. Benchmark Analysis of YOLO Performance on Edge Intelligence Devices. Cryptography. 2022;6(2):16.
- 39. Lazarevich I, Grimaldi M, Kumar R, Mitra S, Khan S, Sah S. YOLOBench: Benchmarking Efficient Object Detectors on Embedded Systems. In: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). 2023. p. 1161–70.
- 40. Chen R, Wang P, Lin B, Wang L, Zeng X, Hu X, et al. An optimized lightweight real-time detection network model for IoT embedded devices. Sci Rep. 2025;15(1):3839. pmid:39885208
- 41. Liang S, Wu H, Zhen L, Hua Q, Garg S, Kaddoum G, et al. Edge YOLO: Real-Time Intelligent Object Detection System Based on Edge-Cloud Cooperation in Autonomous Vehicles. IEEE Trans Intell Trans Syst. 2022;23(12):25345–60.