
Oak-YOLO: A high-performance detection model for automated Oak seed defect identification

Abstract

Oak seeds are highly susceptible to pest infestations due to their elevated starch content, which significantly impairs germination and subsequent growth. To address this challenge, we developed a high-resolution imaging system and proposed an improved YOLO-based model named Oak-YOLO for efficient and accurate defect detection in oak seeds. The proposed model enhances the YOLOv8 architecture by incorporating EfficientViT as the backbone to improve global feature extraction, and integrates a Ghost-DynamicConv detection head to enhance the representation of small and irregular defects such as insect holes and cracks. Additionally, the WIoUv3 loss function is introduced to optimize bounding box regression for complex target shapes and overlapping instances. Extensive experiments were conducted on both single-object and multi-object datasets. Oak-YOLO achieved a mAP50 of 94.5%, an F1-score of 95.3%, and a precision of 94.% on the oak-intensive dataset, with an inference speed of 132.2 FPS. Cross-device validation using mobile-captured images further demonstrated the model's robustness, achieving mAP50 scores of 94.7% and 93.8% on different smartphone test sets. Comparative evaluations show that Oak-YOLO outperforms existing YOLO models, including YOLOv9 to YOLOv12, by delivering a favorable trade-off between detection accuracy and computational efficiency. These results highlight the potential of Oak-YOLO as a practical solution for real-time seed quality inspection in forestry applications.

Introduction

In modern forestry production, oak seeds, due to their high starch content, are particularly vulnerable to insect infestations [1]. These infestations not only affect the quality of the seeds but also significantly reduce their germination rate and growth potential. To address these issues, chemical pesticides have traditionally been widely used. However, the effectiveness of this approach has been limited, with insect damage remaining prevalent [2], and the overuse of chemical pesticides potentially causing adverse environmental effects.

Traditional methods for screening oak seeds primarily include the weighing method and the water immersion method. The weighing method estimates seed damage by comparing a seed's weight against that of healthy seeds, while the water immersion method relies on the buoyancy difference of seeds in water to distinguish healthy seeds. However, these methods have significant limitations, as they cannot accurately detect subtle internal insect damage, leading to suboptimal screening results and affecting overall seed quality and planting success [3].

Given the limitations of traditional methods in detecting defects in oak seeds, this study introduces machine vision technology for efficient and accurate identification of oak seed defects. Traditional machine vision methods rely on manually designed features such as color, shape, texture, and spectral information. These methods have been widely applied in plant seed detection and classification. For example, Wang et al. [4] proposed a maize seed recognition method based on genetic algorithms (GA) and multi-class support vector machines (SVM), while Nguyen-Quoc et al. [5] utilized image preprocessing, HOG descriptors, and various imputation techniques combined with SVM classifiers to classify different rice seeds.

In contrast to traditional methods, deep learning techniques can automatically extract rich, multi-level features from images, enabling higher detection accuracy and efficiency in complex environments. YOLO frameworks [6–8], along with region-based models like Faster R-CNN [9] and single-shot detectors like SSD [10], enable fast and accurate defect identification in real time. Recent advancements, such as Swin Transformer-based models [11] and transfer learning approaches [12], have further enhanced feature extraction and adaptability. Standard datasets like Pascal VOC [13] and COCO [14] provide benchmarks for evaluating these models. These innovations streamline seed quality assessment, ensuring higher efficiency and accuracy for agricultural applications. For instance, Mukasa et al. [15] used DD-SIMCA, SVM, and deep learning classifiers to distinguish between triploid and diploid watermelon seeds; Kurtulmus [16] proposed a sunflower seed classification method based on deep convolutional neural networks (CNNs), enhancing classification performance through various CNN architectures; Wang et al. [17] combined hyperspectral imaging with deep learning, proposing a novel CNN-LSTM model for maize seed variety identification that showed excellent performance; Shi et al. [18] employed iPhone images and deep learning methods, using data augmentation and transfer learning strategies to improve barley seed variety identification accuracy; Barrio-Conde et al. [19] successfully classified high oleic sunflower seed varieties using deep learning algorithms; Bi et al. [20] proposed a maize seed recognition method based on an improved Swin Transformer, which incorporated feature attention mechanisms and multi-scale feature fusion to enhance recognition accuracy; and Thakur et al. [21] introduced deep transfer learning photon sensors, improving seed quality evaluation accuracy and efficiency through laser backscattering and deep learning. In short, deep learning technology has significantly advanced the development of crop seed detection. However, existing computer vision research focuses primarily on the seeds of other crops, while the detection of tree seeds has received relatively little attention.

In this study, we constructed a dedicated dataset for oak seed defect detection. The dataset was meticulously curated to cover various defect types, and incorporated diverse lighting conditions, backgrounds, and spatial arrangements. To further enhance the dataset's robustness, we employed automatic data augmentation techniques, including rotation, brightness adjustment, and background variation, ensuring a wide range of sample diversity. Building on this comprehensive dataset, we then proposed targeted improvements specifically aimed at enhancing small target detection in oak seed defect analysis. To improve the YOLO model's performance in identifying small defects, we introduced a multi-scale feature fusion mechanism designed to preserve high-resolution feature information. This approach ensures that small defects, such as cracks and insect holes, are effectively retained within high-level feature maps.

Additionally, we incorporated an improved attention mechanism to enhance boundary separation between adjacent seeds, reducing interference caused by overlapping targets. To address the challenge of detecting irregularly shaped defects, we proposed a novel loss function that enables the model to accurately fit the boundaries of cracks and insect holes using rotated rectangular boxes.

Experimental results demonstrate that the proposed detection framework significantly outperforms traditional methods in detecting oak seed defects. The model effectively addresses the limitations of the YOLO architecture, particularly in small target detection, overlapping object handling, irregular shape adaptation, and data labeling challenges. This comprehensive improvement greatly enhances the accuracy and efficiency of oak seed quality inspection, making the framework highly suitable for practical applications in forestry production.

Materials and methods

Selection of seeds

The oak seeds used in this study were purchased from oak farmers in Xinxu Town, Suqian, Jiangsu, China. The seeds were cleaned to remove plant debris, soil, dust, and stones. Subsequently, the seeds were stored at temperatures ranging from -20°C to -30°C for two weeks to eliminate residual pests.

Photographic equipment

As shown in Fig 1, the image acquisition system utilized a Canon camera (ME2P-G-P, Hangzhou, China) with a resolution of 4508 × 4096. An adjustable lighting setup was employed to facilitate image capture, with the light source set to a color temperature of 5000K to replicate natural lighting conditions. A circular LED light was positioned at a 45° angle, with the illumination diffused through white fabric to ensure soft, uniform lighting and to minimize specular reflections that could impede the detection of subtle defects. Based on measurements, the optimal focusing distance between the camera and the seeds was maintained between 12 and 15 cm. The platform was designed to enable flexible adjustments of shooting angles and heights, allowing for comprehensive capture of seed surface features from various perspectives. To maintain consistent image exposure, the ISO setting of all imaging devices was fixed at 800.

Segmentation of seed images

In this study, we employed the automated segmentation algorithm EfficientSAM to annotate the dataset, as shown in Fig 2. The original images were first denoised using median filtering to remove background noise and interference on the seed surface. Subsequently, the images were converted to the HSV color space to enhance the contrast between the seeds and the background. In the HSV space, histogram equalization was applied to the brightness channel to enhance seed surface details:

$$s(k) = \frac{L}{N}\sum_{i=0}^{k} h(i) \tag{1}$$

$L$ denotes the maximum grayscale level, $N$ is the total number of pixels, and $h(i)$ represents the frequency of the grayscale level $i$. This equalization further improves the detail of the seed surface. We then applied an adaptive threshold segmentation method to separate the seed region from the background, generating an initial binary mask:

$$T(x,y) = \frac{1}{|N(x,y)|}\sum_{(u,v)\in N(x,y)} I(u,v) - C \tag{2}$$

In this equation, T(x,y) denotes the threshold value, N(x,y) is the neighborhood around the pixel, and C is a constant for adjustment. This formula calculates the average value within the local neighborhood and adjusts it to achieve binarization. Subsequently, morphological operations, including dilation and erosion, were employed to refine the mask edges. Dilation helps connect adjacent seed regions, while erosion removes noise, making seed contours clearer. After these processing steps, the resulting binary mask accurately represents the seed area.
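The mean-minus-C thresholding of Eq. (2) can be sketched in a few lines of NumPy. This is a minimal illustration: the function name, window size, and synthetic test image are ours, not from the paper, and a production pipeline would use an optimized routine rather than explicit loops.

```python
import numpy as np

def adaptive_threshold(gray, win=3, C=2):
    """Eq. (2): threshold each pixel against the mean of its local
    neighbourhood N(x, y) minus a constant C, producing a binary mask."""
    pad = win // 2
    padded = np.pad(gray.astype(float), pad, mode="edge")
    h, w = gray.shape
    mask = np.zeros((h, w), dtype=np.uint8)
    for y in range(h):
        for x in range(w):
            local = padded[y:y + win, x:x + win]   # neighbourhood N(x, y)
            T = local.mean() - C                   # T(x, y)
            mask[y, x] = 255 if gray[y, x] > T else 0
    return mask

# A bright "seed" blob on a dark background: the blob interior and the
# background separate, while boundary background pixels stay dark.
img = np.zeros((8, 8), dtype=np.uint8)
img[2:6, 2:6] = 200
mask = adaptive_threshold(img, win=3, C=2)
```

As in standard adaptive thresholding, perfectly flat regions fall above their own mean minus C, which is why the subsequent morphological erosion step in the paper is needed to clean up such noise.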

To annotate irregular defect areas such as cracks and insect holes, we introduced the convex hull algorithm. Traditional rectangular annotations often fail to capture complex defect boundaries, especially for irregular geometries like cracks and insect holes. The convex hull method effectively addresses this limitation by precisely enclosing defect areas.

Given a set of points $P$ representing the boundary of the defect region, the convex hull $CH(P)$ is the smallest convex polygon that encloses the set $P$. Assuming $P_i$, $P_j$, and $P_k$ are three points in the set, the cross product of the vectors $\vec{P_iP_j}$ and $\vec{P_iP_k}$ is calculated as follows:

$$\vec{P_iP_j} \times \vec{P_iP_k} = (x_j - x_i)(y_k - y_i) - (y_j - y_i)(x_k - x_i) \tag{3}$$

The sign of this cross product indicates whether the three points form a left or right turn, which the hull construction uses to discard points lying inside the boundary.

After annotation, to match the YOLO object detection model’s input format, all annotated bounding box information was converted to the standard YOLO label format. Each label contains five values: the object class, the center coordinates of the bounding box (x,y), the width (w), and the height (h). These values were normalized, and the segmented data were further augmented to increase model robustness. The augmented data covers diverse lighting conditions, angle variations, and noise simulation to match real-world production environments.
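The conversion from an annotated defect region to the normalized YOLO label line described above can be sketched as follows. The function name and the sample hull points are ours; the label layout (class, center, width, height, all normalized to [0, 1]) follows the standard YOLO convention the paper references.

```python
def hull_to_yolo_label(points, img_w, img_h, cls_id):
    """Take the convex-hull points of a defect region, compute the
    enclosing axis-aligned box, and emit the normalized YOLO line:
    class x_center y_center width height (all in [0, 1])."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    xc = (x_min + x_max) / 2 / img_w   # normalized center x
    yc = (y_min + y_max) / 2 / img_h   # normalized center y
    w = (x_max - x_min) / img_w        # normalized width
    h = (y_max - y_min) / img_h        # normalized height
    return f"{cls_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# A crack outlined by four hull points in a 640x640 image.
label = hull_to_yolo_label([(100, 200), (140, 210), (130, 260), (105, 255)],
                           640, 640, cls_id=1)
```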

To evaluate the effectiveness of the EfficientSAM segmentation algorithm, we used a subset of manually annotated seed images as a control group for accuracy comparison. Compared with traditional rectangular annotation methods, EfficientSAM significantly improved segmentation accuracy, especially for complex defect shapes. The convex hull annotation method provided more precise coverage of defect areas. The comparison results are presented in Table 1.

Data augmentation

The color images were first converted to grayscale, reducing storage space and computational cost to approximately one-third. Subsequently, each pixel value was multiplied by a randomly generated brightness factor ranging from 0.5 to 1.5, enhancing the model's robustness to variations in illumination. Next, the images were randomly rotated within an angle range of –90° to 90°, and Gaussian noise was added with a 50% probability, using a noise matrix generated from a Gaussian distribution with a mean of 0 and a standard deviation of 0.01. Finally, multiple augmented images were randomly combined into a single large image using 2 × 2, 3 × 3, or 4 × 4 grid patterns, and the corresponding labels were updated based on the offset caused by the image stitching. Fig 4 illustrates the augmentation process across different samples.
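The brightness, rotation, and noise steps above can be sketched in NumPy. This is a simplified illustration with assumed parameters: for brevity it rotates by multiples of 90° via `np.rot90` rather than the arbitrary –90° to 90° rotation (with label adjustment) the paper describes.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(gray):
    """Brightness scaling in [0.5, 1.5], a random rotation (multiples of
    90 degrees here for simplicity), and Gaussian noise (mean 0,
    std 0.01) applied with 50% probability."""
    out = gray.astype(float) / 255.0
    out *= rng.uniform(0.5, 1.5)               # random brightness factor
    out = np.rot90(out, k=rng.integers(0, 4))  # simplified rotation
    if rng.random() < 0.5:                     # 50% chance of noise
        out += rng.normal(0.0, 0.01, size=out.shape)
    return np.clip(out * 255, 0, 255).astype(np.uint8)

img = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)
aug = augment(img)
```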

Fig 3. Comparison of concat-image and densely detected data.

(a) Features of concat-image data. (b) Features of densely detected data.

https://doi.org/10.1371/journal.pone.0327371.g003


Fig 3(a) illustrates the distribution of x and y coordinates for all annotations in the seed dataset. The figure shows that annotation centers are evenly distributed across the image, with no significant clustering or outliers. This indicates that the seed positions exhibit good diversity, which helps the model learn a wider range of features. Additionally, the distributions of annotation width (w) and height (h) are shown, with w and h primarily concentrated between 0.01 and 0.05. This suggests a certain degree of clustering in annotation sizes, likely due to denser seed placement in specific image regions, leading to more similar annotation sizes in those areas. In contrast, the distribution of x and y coordinates in Fig 3(b) appears more concentrated, likely reflecting the characteristics of densely packed detection samples. Similarly, w and h are concentrated between 0.015 and 0.04, further highlighting the annotation patterns in densely populated regions.

After applying the aforementioned augmentation techniques, the number of single-object seed images in the dataset increased from 2,000 to 2,537, while the multi-object dataset grew from 1,572 to 1,949 images. The dataset was split into training, validation, and test sets using an 8:1:1 ratio with stratified random sampling to ensure that the class distribution remained consistent across all subsets. Specifically, the training set consists of 4,538 images, while the testing and validation sets each contain 567 images. The test set was exclusively reserved for final performance evaluation. The single-object dataset is named oak-simple, and the multi-object dataset is named oak-intensive, as summarized in Table 2.
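An 8:1:1 stratified split of the kind described above can be sketched as follows. The function name, seed, and sample data are ours; shuffling and slicing within each class keeps the label distribution consistent across the three subsets.

```python
import random
from collections import defaultdict

def stratified_split(samples, seed=42):
    """Split (path, class) pairs 8:1:1 per class so the label
    distribution stays consistent across train/val/test."""
    by_class = defaultdict(list)
    for path, cls in samples:
        by_class[cls].append(path)
    rnd = random.Random(seed)
    train, val, test = [], [], []
    for cls, paths in by_class.items():
        rnd.shuffle(paths)
        n = len(paths)
        n_train, n_val = int(n * 0.8), int(n * 0.1)
        train += paths[:n_train]
        val += paths[n_train:n_train + n_val]
        test += paths[n_train + n_val:]   # remainder goes to the test set
    return train, val, test

# 100 hypothetical images split evenly across two defect classes.
samples = [(f"img_{i}.jpg", i % 2) for i in range(100)]
train, val, test = stratified_split(samples)
```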

Oak-YOLO

YOLOv8 [22] exhibits high real-time detection capability and accuracy optimization, making it suitable for object detection tasks. However, YOLOv8 still has limitations in detecting irregular-shaped defects in oak seeds, as the rectangular bounding box cannot accurately describe the edges, leading to incomplete detection or boundary deviation. Additionally, its real-time performance and efficiency are limited when deployed on resource-constrained embedded devices.

As shown in Fig 4, we improved YOLOv8 in three aspects: 1) Introducing the Ghost-Dynamic prediction head, which combines shallow and deep features to effectively enhance the detection accuracy of small defects such as cracks and wormholes; 2) Upgrading the YOLOv8 backbone by replacing traditional CNN modules with EfficientViT to improve the capture of global features and long-range dependencies; 3) Employing the WIoUv3 loss function to optimize IoU calculation for small and overlapping defects, ensuring that the predicted bounding box shape better matches the defect characteristics.

Ghost-dynamic.

In the YOLOv8 model, instead of using the shallow feature map F2 with limited semantic information, the deeper feature maps F3, F4, and F5 extracted from the backbone network are passed into the neck for feature fusion. These feature maps, after undergoing multiple convolutional down-sampling layers, gradually expand their receptive fields. The deeper feature maps contain richer semantic information, sufficient to handle typical object detection tasks. However, in the case of oak seed defect detection, particularly for small and complex defects such as insect holes and cracks, using deeper features may lead to information dilution, thereby reducing both detection precision and localization accuracy. This is because these defects often appear as small objects, and the positional information in deeper features is relatively sparse and blurred due to the convolutional layers, making precise defect localization and classification more challenging [23]. Moreover, due to insufficient information extracted by the prediction head from the feature maps, the detection accuracy of defects is compromised. Similar-looking, overlapping, and partially occluded defects further exacerbate the difficulty of object detection.

Shallow feature maps, in contrast to deep feature maps, have smaller receptive fields, higher spatial resolution, and more precise positional information, making them particularly effective for small object detection. This makes them highly suitable for the accurate detection of seed defects, such as cracks and insect holes. Therefore, to improve the detection performance for oak seed defects, this study integrates the Ghost module [24] with Dynamic Convolution [25] to construct a high-resolution Ghost-Dynamic prediction head, as illustrated in Fig 5. This approach effectively leverages the rich positional information and high resolution inherent in shallow features, enhancing the detection performance for small defect targets. Ghost-DynamicConv utilizes a two-step convolution operation by dividing the convolution process into primary and auxiliary convolutions. The primary convolution generates initial features Y, and the auxiliary convolution further optimizes the feature extraction. The final feature F can be expressed as:

$$Y = W_1 \ast X, \qquad F = f\left(W_2 \ast Y\right) \tag{4}$$

$W_1$ and $W_2$ represent the weights of the primary and auxiliary convolutions, respectively, and $f$ denotes the nonlinear activation function. The core of the dynamic weight allocation mechanism lies in selecting appropriate expert convolution kernels based on the features of the input image. This is achieved through a multi-layer perceptron (MLP), which generates dynamic weights. Specifically, the MLP takes the input image features as input and outputs a dynamic weight vector that indicates the activation level of different expert convolution kernels. The process of generating dynamic weights is as follows:

$$\alpha = \mathrm{Softmax}\left(\mathrm{MLP}(X)\right) \tag{5}$$

As shown in Fig 6, to further enhance the richness of feature fusion, the PAN (Path Aggregation Network) structure is employed to fuse the F2 feature map from the backbone network with feature maps from other scales (e.g., F3, F4, etc.). This fusion process facilitates the creation of a smaller P2 prediction head. The introduction of the P2 prediction head provides additional positional and feature information for defect detection, while effectively reducing the loss of spatial features that can occur during down-sampling due to scale variations. When combined with the other three prediction heads, the P2 head helps mitigate the decline in detection accuracy that is often caused by significant changes in object scales. This method enables the prediction heads to extract richer global features through a self-attention mechanism, which is particularly beneficial for the precise localization of defects in cases with significant overlap or occlusion.
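The dynamic weight allocation idea behind Ghost-DynamicConv can be sketched in NumPy: pooled input features pass through a small MLP, a softmax yields per-expert weights, and the effective kernel is their weighted sum. All dimensions, the MLP shape, and the random weights here are illustrative assumptions; the actual module is a learned network, not this toy.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Four 3x3 "expert" kernels; the routing MLP has one hidden layer.
experts = rng.normal(size=(4, 3, 3))
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(4, 8))

def dynamic_kernel(feat):
    """Eq. (5) in miniature: pooled features -> MLP -> softmax weights
    alpha, then the effective kernel is the alpha-weighted expert sum."""
    pooled = feat.mean(axis=(0, 1))             # global average pool, (16,)
    alpha = softmax(W2 @ np.tanh(W1 @ pooled))  # dynamic weights, sum to 1
    return np.tensordot(alpha, experts, axes=1) # (3, 3) mixed kernel

feat = rng.normal(size=(10, 10, 16))            # a dummy feature map
kernel = dynamic_kernel(feat)
```

Because the routing depends on the input features, different images activate different expert mixtures, which is what lets the head adapt its filters to small, irregular defects.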

EfficientViT.

EfficientViT [26] is an efficient visual Transformer model designed to combine the strengths of Convolutional Neural Networks (CNNs) and self-attention mechanisms, providing more precise and computationally efficient feature extraction for visual tasks. Fig 7 shows the structure of EfficientViT. In the context of oak seed defect detection, where defects such as cracks and insect holes are structurally complex, small in area, and irregular in shape, traditional CNN models face limitations in capturing long-range dependencies and global features. The core optimizations of EfficientViT are reflected in several key aspects. The "Sandwich Layout" effectively reduces memory consumption. Unlike conventional self-attention mechanisms, EfficientViT uses a single layer of self-attention for spatial feature mixing, supplemented by additional feed-forward network (FFN) layers before and after the self-attention layer to enhance communication between channels. The mathematical expression of this optimization is as follows:

Fig 4. The overall structure of the Oak-YOLO model.

https://doi.org/10.1371/journal.pone.0327371.g004

$$X_{i+1} = \Phi^{F}\left(\Phi^{A}\left(\Phi^{F}\left(X_i\right)\right)\right) \tag{6}$$

$X_i$ represents the input features of the $i$-th layer, $\Phi^{A}$ denotes the self-attention layer, and $\Phi^{F}$ denotes the feed-forward network layer. By reducing the usage of self-attention layers, EfficientViT significantly lowers memory consumption while enhancing channel communication by increasing the number of feed-forward network layers. In the computation of Multi-Head Self-Attention (MHSA), EfficientViT introduces the Cascaded Group Attention (CGA) mechanism to further enhance computational efficiency. Unlike traditional MHSA models, CGA partitions the input features into multiple subsets, with each attention head processing only a portion of the features, thus reducing redundancy in feature computation. The core formula of this mechanism is given by:

$$\tilde{X}_{ij} = \mathrm{Attn}\left(X_{ij}W^{Q}_{ij},\ X_{ij}W^{K}_{ij},\ X_{ij}W^{V}_{ij}\right) \tag{7}$$
$$\tilde{X}_{i+1} = \mathrm{Concat}\left[\tilde{X}_{ij}\right]_{j=1:h} W^{P}_{i} \tag{8}$$

$\tilde{X}_{ij}$ represents the self-attention result of the $j$-th head, $W^{Q}_{ij}$, $W^{K}_{ij}$, and $W^{V}_{ij}$ are the projection matrices, and $W^{P}_{i}$ is the final projection layer. Through this grouped computation, CGA effectively reduces the computational redundancy of MHSA, while the concatenation operation enhances the feature representational power.
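The group-attention idea can be sketched in NumPy: the channels are divided into one subset per head, each head attends only over its own subset, and the per-head outputs are concatenated. For brevity this sketch omits the learned Q/K/V and output projections of Eqs. (7)–(8) and the cascading step (feeding each head's output into the next head's input) that full CGA adds; shapes and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def group_attention(X, heads=4):
    """Split channels into one subset per head, run scaled dot-product
    attention on each subset only, then concatenate the head outputs."""
    n, d = X.shape
    d_h = d // heads
    outs = []
    for j in range(heads):
        Xj = X[:, j * d_h:(j + 1) * d_h]            # this head's subset
        scores = softmax(Xj @ Xj.T / np.sqrt(d_h))  # attention weights
        outs.append(scores @ Xj)                    # attended features
    return np.concatenate(outs, axis=1)             # back to (n, d)

X = rng.normal(size=(6, 16))     # 6 tokens, 16 channels
out = group_attention(X, heads=4)
```

Because each head touches only d/h channels, the per-head attention cost shrinks accordingly, which is the redundancy reduction the text describes.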

WIoUv3.

The size of the object is crucial in object detection tasks, especially in small object detection, where the standard IoU may not give sufficient attention due to the small area occupied by small objects. Existing bounding box loss functions, such as Smooth L1 [27], GIoU [28] and CIoU [29], often fail to effectively address the complex scenarios in defect detection, such as occlusion, small sizes, and class imbalance. To address this, WIoU introduces a weighted IoU calculation that assigns greater loss contributions to small objects. WIoU [30] is a weighted variant of the standard IoU, which adjusts the loss for each object or region by introducing a weight factor w based on the standard IoU calculation. WIoUv3 improves upon this by introducing a dynamic non-monotonic focusing mechanism, allowing the model to allocate more computational resources to challenging samples with occluded or small objects, thus improving the accuracy and stability of the training process. Additionally, WIoUv3 provides a more refined method for bounding box regression by considering the overlap degree and dynamically adjusting the loss contribution of each bounding box. The weight factor based on the target center distance is defined as:

$$w = \left(1 + \frac{\rho\left(b,\ b^{gt}\right)}{c + \varepsilon}\right)^{\beta} \tag{9}$$

where $\rho(b, b^{gt})$ denotes the distance between the centers of the predicted and ground-truth boxes, $c$ is the diagonal length of the smallest box enclosing both, $\varepsilon$ is a small constant to prevent division by zero, and $\beta$ is a hyperparameter that adjusts the impact of the distance. The formula for WIoU, incorporating the weight factor $w$, is expressed as:

$$\mathcal{L}_{\mathrm{WIoU}} = w \cdot \left(1 - \mathrm{IoU}\right) \tag{10}$$

We conducted experiments on the dataset, comparing WIoUv3 with existing bounding box loss functions (Smooth L1, CIoU, and GIoU). As shown in Table 3, the experimental results show that WIoUv3 significantly improves model performance, especially in handling complex scenarios such as occlusion, small object sizes, and class imbalance.
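A distance-weighted IoU loss in the spirit of Eqs. (9)–(10) can be sketched as follows. This is our simplified illustration: boxes whose centers sit far apart get a larger weight, so hard samples contribute more to the loss. The full WIoUv3 additionally uses a dynamic non-monotonic focusing mechanism over the training history, which is omitted here.

```python
import math

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def weighted_iou_loss(pred, gt, beta=2.0, eps=1e-7):
    """Weight the IoU loss by normalized center distance: far-apart
    (hard) boxes get w > 1, perfectly centered boxes get w = 1."""
    cx = lambda r: (r[0] + r[2]) / 2
    cy = lambda r: (r[1] + r[3]) / 2
    d = math.hypot(cx(pred) - cx(gt), cy(pred) - cy(gt))
    # diagonal of the smallest box enclosing both, for normalization
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c = math.hypot(cw, ch)
    w = (1 + d / (c + eps)) ** beta
    return w * (1 - iou(pred, gt))

aligned = weighted_iou_loss((0, 0, 10, 10), (0, 0, 10, 10))  # perfect match
shifted = weighted_iou_loss((5, 5, 15, 15), (0, 0, 10, 10))  # offset box
```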

Table 3. Performance comparison of different loss functions.

https://doi.org/10.1371/journal.pone.0327371.t003

Results and discussion

Indicators for model evaluation

These metrics are used to evaluate the model’s effectiveness in detecting seed cracks and insect hole defects, with the following formulas applied:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{11}$$
$$P_r = \frac{TP}{TP + FP} \tag{12}$$
$$R_r = \frac{TP}{TP + FN} \tag{13}$$
$$F1 = \frac{2 \cdot P_r \cdot R_r}{P_r + R_r} \tag{14}$$
$$\mathrm{mAP} = \frac{1}{N}\sum_{r=1}^{N} AP_r, \qquad AP_r = \int_0^1 P_r(R_r)\, dR_r \tag{15}$$

TP represents true positives (correct defect predictions), TN denotes true negatives (correct non-defect predictions), FP refers to false positives (incorrect defect predictions), FN indicates false negatives (missed defects), and N is the total number of categories. Rr and Pr represent recall and precision for the r-th class respectively.
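The per-class metrics above reduce to a few lines of arithmetic. The function names and the example counts are ours; AP itself requires the full precision-recall curve, so only its averaging into mAP is shown.

```python
def detection_metrics(tp, fp, fn):
    """Per-class precision, recall, and F1 from the definitions above."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def mean_ap(ap_per_class):
    """mAP: the mean of per-class average precision over N categories."""
    return sum(ap_per_class) / len(ap_per_class)

# e.g. a wormhole class with 90 hits, 5 false alarms, 10 misses,
# and hypothetical per-class AP values for the two defect types.
p, r, f1 = detection_metrics(tp=90, fp=5, fn=10)
m = mean_ap([0.95, 0.91])
```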

Experimental configuration

As shown in Table 4, the hardware and software environment of the experimental testing platform is listed. We use the Adam optimizer to fine-tune the model parameters. Compared to traditional optimizers such as SGD, Adam provides adaptive learning rates and faster convergence, which is beneficial for models like Oak-YOLO that integrate complex modules such as EfficientViT and Ghost-DynamicConv. The training process consists of a maximum of 300 epochs, the batch size is set to 16, and the initial learning rate is 0.001. The hyperparameters used in training, including the learning rate, batch size, and optimizer settings, were determined through manual tuning based on empirical performance on the validation set.

Results of ablation experiments

To evaluate the effectiveness of each proposed module, a series of ablation experiments were conducted on the oak seed dataset using consistent training, validation, and testing protocols. The detailed quantitative results are summarized in Table 5. Integrating the Ghost-Dynamic module into the YOLOv8 detection head resulted in consistent improvements in mAP50, F1 score, and inference speed. Specifically, the enhanced YOLOv8-Ghost-Dynamic variant achieved an mAP50 of 95.97% and a precision of 98.61%, while also demonstrating a speed increase of 12.4 FPS (reducing inference time by 1.1 ms per image) compared to the baseline YOLOv8 model. Notably, the full Oak-YOLO configuration, which incorporates both the Ghost-Dynamic and EfficientViT modules, delivered the most substantial gains: it achieved an mAP50 of 96.92%, a precision of 98.12%, and an inference speed of 132.2 FPS (corresponding to 7.6 ms per image). These results underscore the effectiveness of the multi-module design in improving both accuracy and real-time performance.

As illustrated in Fig 8, precision and recall curves gradually converge after approximately 100 epochs, exhibiting minimal fluctuations in the later stages. Notably, Oak-YOLO consistently outperforms other models throughout training in both mAP0.5 and mAP0.5:0.95 metrics, achieving higher accuracy and faster convergence. This suggests that the integration of the Ghost-Dynamic and EfficientViT modules enhances both the localization precision and the generalization capability of the model across different IoU thresholds. From the comparison of the confusion matrices shown in Fig 9, it is evident that YOLOv8 performs poorly in identifying cracks, with frequent missed detections. Following the enhancement, the issues of missed detections and misclassifications are significantly alleviated, achieving detection rates above 95% for both wormholes and cracks.

Fig 8. Comparison of ablation experiments on different datasets.

https://doi.org/10.1371/journal.pone.0327371.g008

Fig 9. Comparison of confusion matrices between YOLOv8 and Oak-YOLO.

https://doi.org/10.1371/journal.pone.0327371.g009

Comparative performance analysis against alternative models

Based on the experimental results, our study primarily focuses on evaluating the differences between Oak-YOLO and YOLOv8 in terms of detection performance and inference speed. Two representative categories of object detection models were selected for comparative analysis: Transformer-based detectors, including RT-DETR-18 [31], RT-DETR-50 [31], and Deformable DETR [32]; and lightweight CNN-based detectors, comprising the YOLO series (YOLOv5s [33], YOLOv7 [34], YOLOv9 [35], YOLOv10 [36], YOLOv11 [37], and YOLOv12 [38]) as well as the proposed Oak-YOLO.

As shown in Table 6, although Transformer-based models achieved superior detection accuracy, they require over 40 million parameters and entail high computational complexity. In contrast, the YOLO series strikes a better balance between accuracy and model efficiency. Notably, YOLOv5s and YOLOv7 have relatively small parameter sizes of only 16.4M and 18.8M, respectively. However, due to their limited number of convolutional layers and channels, these models struggle to extract effective features in complex backgrounds. It is worth highlighting that YOLOv9 to YOLOv12 exhibit a consistent improvement in detection performance while maintaining the efficiency of the YOLO architecture. Specifically, YOLOv9 achieves an mAP50 of 94.5%, which increases to 97.3% in YOLOv12. Nonetheless, this performance gain comes at the cost of increased model size and complexity—parameter counts grow from 21.2M to 29.8M, GFLOPs rise from 12.3 to 29.2, and inference speed declines from 195 FPS to 120 FPS, indicating a trade-off between performance and computational cost.

Table 6. Comparative Evaluation of Detection Performance and Computational Efficiency Across Transformer-Based and YOLO Series Models.

https://doi.org/10.1371/journal.pone.0327371.t006

Among all evaluated models, Oak-YOLO offers the most favorable trade-off between performance and efficiency. It achieves an mAP50 of 96.2%, ranking among the top within the YOLO family. With only 12.0M parameters and 6.4 GFLOPs, it reaches an inference speed of 264 FPS, substantially outperforming all baseline models, including YOLOv12. Compared with YOLOv12, Oak-YOLO reduces GFLOPs by approximately 78%, parameter count by nearly 60%, and improves inference speed by over 120 FPS. Fig 10 provides a visual comparison of each model's accuracy, speed, and computational complexity, clearly demonstrating the comprehensive advantages of Oak-YOLO.

Fig 10. Comparative Evaluation of Detection Performance and Computational Efficiency Across Transformer-Based and YOLO Series Models.

https://doi.org/10.1371/journal.pone.0327371.g010

Verification of experimental results

To evaluate the effectiveness of Oak-YOLO and YOLOv8n in detecting defects during actual production, this study applied the pre-trained Oak-YOLO and YOLOv8n models to validation experiments involving seed images. Fig 11 visually presents the detection results of Oak-YOLO.

Fig 11. Experimental results verification, illustrating the comparison between predicted and actual outcomes.

https://doi.org/10.1371/journal.pone.0327371.g011

The visualization results demonstrate that in the oak-intensive task, characterized by densely packed seeds, the YOLOv8n model struggles to accurately identify defects, especially at scales lacking fine detail and particularly when cracks dominate the scene. In contrast, Oak-YOLO, with its dynamic detection head, significantly enhances the expression of fine-grained details.

Similarly, in scenarios dominated by wormholes, YOLOv8n often misjudges fine details, whereas Oak-YOLO correctly identifies almost all wormhole cases. In single-defect scenarios, YOLOv8n tends to misinterpret the seed base as a crack defect, while Oak-YOLO exhibits outstanding performance in crack detection. Both models demonstrate similar performance in wormhole identification, though YOLOv8n occasionally misclassifies surface textures as wormholes, an issue not observed with Oak-YOLO.

Robustness evaluation

To validate the model’s robustness across devices and scenarios, we conducted cross-domain testing using images captured by mobile devices. The external validation sets included: 1) OnePlus ACE2 Pro (OnePlus Technology Co., Ltd., Shenzhen, China) and 2) iPhone 13 Plus (Apple Inc., USA), with resolutions of 3,264 × 2,448 and 3,024 × 4,032 pixels, respectively.

As shown in Table 7, Oak-YOLO achieved mAP50 scores of 94.7% and 93.8% on mobile-captured test sets (Test3 and Test4), reflecting a drop of only 2.3–3.3 percentage points compared to the laboratory-controlled environments (Test1 and Test2). Despite the more challenging conditions associated with mobile devices, such as inconsistent lighting and complex backgrounds, the model maintained a precision above 92% and F1-scores above 89%, indicating stable and reliable detection performance. Furthermore, the mAP50–95 values on mobile-captured sets remained above 68%, further demonstrating the model’s generalization capability across diverse imaging sources.

Table 7. Cross-device performance comparison.

https://doi.org/10.1371/journal.pone.0327371.t007
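The mAP50 values in Table 7 count a prediction as a true positive when its IoU with an unmatched ground-truth box reaches at least 0.5. As an illustration of this matching rule (a minimal sketch, not the authors' evaluation code; box and field names are ours), in Python:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match_at_iou50(preds, gts, thr=0.5):
    """Greedy matching: each prediction (highest confidence first) claims
    the best still-unmatched ground-truth box; IoU >= thr counts as a TP."""
    preds = sorted(preds, key=lambda p: -p["conf"])
    used, tp = set(), 0
    for p in preds:
        best, best_iou = None, thr
        for i, g in enumerate(gts):
            if i in used:
                continue
            v = iou(p["box"], g)
            if v >= best_iou:
                best, best_iou = i, v
        if best is not None:
            used.add(best)
            tp += 1
    return tp, len(preds) - tp  # (true positives, false positives)
```

Averaging the precision over recall levels and over classes from these TP/FP counts yields AP50 per class and mAP50 overall.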

To evaluate the defect localization capability of the model in cross-domain scenarios, this study uses the Grad-CAM++ heatmap visualization method [39] to systematically compare the attention region distributions of Oak-YOLO and YOLOv8 under complex backgrounds. The heatmaps use a blue-to-red color spectrum (low-to-high values) to visually present the response intensity of the model to the input images. As shown in Fig 12, Oak-YOLO exhibits more concentrated high-response regions in images captured by mobile devices, with activation area coverage exceeding 85%, accurately focusing on seed defect locations. In contrast, YOLOv8 shows more dispersed attention distribution, with 12–15% false activations near seed edges and greater susceptibility to background noise.

Fig 12. Attention heatmap comparison between Oak-YOLO and YOLOv8.

https://doi.org/10.1371/journal.pone.0327371.g012
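Grad-CAM++ derives a per-channel weight from first-, second-, and third-order gradients of the class score with respect to the final convolutional feature maps, then combines the weighted activations through a ReLU. The following NumPy sketch of that weighting is illustrative only (the visualizations above follow Chattopadhay et al. [39]; here the gradients are assumed to be precomputed and passed in):

```python
import numpy as np

def grad_cam_pp(activations, grads, eps=1e-8):
    """Grad-CAM++ heatmap from conv activations and their gradients.

    activations, grads: arrays of shape (K, H, W) for K channels.
    Returns a heatmap of shape (H, W) normalized to [0, 1].
    """
    g2, g3 = grads ** 2, grads ** 3
    # Per-location coefficients: alpha = g^2 / (2*g^2 + sum_{h,w}(A) * g^3)
    denom = 2.0 * g2 + activations.sum(axis=(1, 2), keepdims=True) * g3
    alpha = g2 / np.where(np.abs(denom) > eps, denom, eps)
    # Channel weights: spatial sum of alpha * ReLU(gradient).
    weights = (alpha * np.maximum(grads, 0)).sum(axis=(1, 2))
    # Weighted combination of activation maps, rectified.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0)
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

Resizing `cam` to the input resolution and rendering it with a blue-to-red colormap produces heatmaps of the kind shown in Fig 12.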

Conclusion

In this study, we proposed Oak-YOLO, an improved YOLOv8-based detection framework designed specifically for identifying defects in oak seeds. The model integrates the EfficientViT backbone for enhanced global feature extraction and introduces a Ghost-DynamicConv detection head to better detect small and irregular targets. Furthermore, the adoption of the WIoUv3 loss function improves bounding box regression for overlapping and deformable defects such as cracks and insect holes.

Experimental evaluations demonstrated that Oak-YOLO achieved a mAP50 of 94.5%, an F1-score of 95.3%, and a detection speed of 132.2 FPS on the oak-intensive dataset, significantly outperforming the baseline YOLOv8 model. In cross-device validation using smartphone-captured images, the model maintained high accuracy and robustness, confirming its generalization ability across diverse acquisition environments. Comparative analysis also showed that Oak-YOLO offers superior performance-efficiency trade-offs compared to state-of-the-art YOLO variants and Transformer-based models.

These findings highlight the practical applicability of Oak-YOLO for real-time and high-precision seed defect detection in forestry. Future work will focus on further reducing computational costs and extending the model to support defect screening across a broader range of tree species.

References

  1. Parker CL, Hibbard MJ. Factors controlling germination and early survival in oaks. Forest Ecology and Management. 2002;159(2):133–44.
  2. Li X, Wu Y, Dong X, Zhao S, Mu X, Chen Z, et al. Checklist of major oak pests in China. Acta Sericologica Sinica. 2010;36(2):330–6.
  3. Nickerson DM. Addendum to counting by weighing and the seed testing problem. Annals of Applied Biology. 2003;143(3):371–4.
  4. Wang H, Hou H, Liu S. Maize seed recognition based on genetic algorithm and multi-class SVM. Computer Engineering and Application. 2008;44(18):221–3.
  5. Nguyen-Quoc H, Truong Hoang V. Rice seed image classification based on HOG descriptor with missing values imputation. TELKOMNIKA. 2020;18(4):1897.
  6. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. p. 779–88.
  7. Redmon J, Farhadi A. YOLO9000: better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. p. 7263–71.
  8. Redmon J, Farhadi A. YOLOv3: an incremental improvement. arXiv preprint. 2018.
  9. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, 2015. p. 91–9.
  10. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY. SSD: single shot multibox detector. In: European Conference on Computer Vision, 2016. p. 21–37.
  11. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin Transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. p. 10012–22. https://doi.org/10.1109/ICCV48922.2021.00988
  12. Zhuang F, Qi Z, Duan K, Xi D, Zhu Y, Zhu H, et al. A comprehensive survey on transfer learning. Proc IEEE. 2021;109(1):43–76.
  13. Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A. The PASCAL visual object classes (VOC) challenge. Int J Comput Vision. 2010;88(2):303–38.
  14. Chen X, Fang H, Lin TY, Vedantam R, Gupta S, Dollár P, et al. Microsoft COCO captions: data collection and evaluation server. arXiv preprint. 2015. https://arxiv.org/abs/1504.00325
  15. Mukasa P, Wakholi C, Akbar Faqeerzada M, Amanah HZ, Kim H, Joshi R, et al. Nondestructive discrimination of seedless from seeded watermelon seeds by using multivariate and deep learning image analysis. Computers and Electronics in Agriculture. 2022;194:106799.
  16. Kurtulmuş F. Identification of sunflower seeds with deep convolutional neural networks. Food Measure. 2020;15(2):1024–33.
  17. Wang Y, Song S. Variety identification of sweet maize seeds based on hyperspectral imaging combined with deep learning. Infrared Physics & Technology. 2023;130:104611.
  18. Shi Y, Patel Y, Rostami B, Chen H, Wu L, Yu Z, et al. Barley variety identification by iPhone images and deep learning. Journal of the American Society of Brewing Chemists. 2021;80(3):215–24.
  19. Barrio-Conde M, Zanella MA, Aguiar-Perez JM, Ruiz-Gonzalez R, Gomez-Gil J. A deep learning image system for classifying high oleic sunflower seed varieties. Sensors (Basel). 2023;23(5):2471. pmid:36904675
  20. Bi C, Hu N, Zou Y, Zhang S, Xu S, Yu H. Development of deep learning methodology for maize seed variety recognition based on improved Swin Transformer. Agronomy. 2022;12(8):1843.
  21. Singh Thakur P, Tiwari B, Kumar A, Gedam B, Bhatia V, Krejcar O, et al. Deep transfer learning based photonics sensor for assessment of seed-quality. Computers and Electronics in Agriculture. 2022;196:106891.
  22. Ultralytics Team. YOLOv8: the latest version of YOLO for object detection. 2023. [cited 2024 Dec 10]. Available from: https://github.com/ultralytics/yolov8
  23. Xu X, Chen S, Lv X, Wang J, Hu X. Guided multi-scale refinement network for camouflaged object detection. Multimed Tools Appl. 2023;82(4):5785–801. pmid:35968408
  24. Han K, Wang Y, Cheng Q, Zhang Y, Xu C, Li J, et al. GhostNet: more features from cheap operations. arXiv preprint. 2020. https://arxiv.org/abs/2003.08056
  25. Chen Y, Li H, Yu X, Li Z. Dynamic convolution: attention meets convolution. arXiv preprint. 2020. https://arxiv.org/abs/2003.03126
  26. Liu X, Peng H, Zheng N, Yang Y, Hu H, Yuan Y. EfficientViT: memory efficient vision transformer with cascaded group attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. p. 14420–30. https://doi.org/10.1109/CVPR52729.2023.01386
  27. Girshick R. Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015. p. 1440–8. https://doi.org/10.1109/iccv.2015.169
  28. Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I, Savarese S. Generalized intersection over union: a metric and a loss for bounding box regression. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. p. 658–66. https://doi.org/10.1109/cvpr.2019.00075
  29. Zheng Z, Wang P, Liu W, Li J, Ye R, Ren D. Distance-IoU loss: faster and better learning for bounding box regression. AAAI. 2020;34(07):12993–3000.
  30. Tong Z, Chen Y, Xu Z, Yu R. Wise-IoU: bounding box regression loss with dynamic focusing mechanism. arXiv preprint. 2023. https://arxiv.org/abs/2301.10051
  31. Yin J, Wu Y, Zhang Y. RT-DETR: real-time detection transformer. arXiv preprint. 2023.
  32. Zhu X, Su W, Lu L, Li B, Wang X, Dai J. Deformable DETR: deformable transformers for end-to-end object detection. In: Proceedings of the International Conference on Learning Representations (ICLR), 2021.
  33. Jocher G. Ultralytics YOLOv5. 2020. [Online]. Available from: https://github.com/ultralytics/yolov5
  34. Wang C-Y, Bochkovskiy A, Liao H-YM. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint. 2022. https://doi.org/10.48550/arXiv.2207.02696
  35. Wang C-Y, Yeh I-H, Liao H-YM. YOLOv9: learning what you want to learn using programmable gradient information. arXiv preprint. 2024.
  36. Wang A, Chen H, Liu L, Chen K, Lin Z, Han J, et al. YOLOv10: real-time end-to-end object detection. arXiv preprint. 2024.
  37. Ultralytics. YOLOv11. 2024. [Online]. Available from: https://github.com/ultralytics/ultralytics
  38. Tian Y, Ye Q, Doermann D. YOLOv12: attention-centric real-time object detectors. arXiv preprint. 2025.
  39. Chattopadhay A, Sarkar A, Howlader P, Balasubramanian VN. Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018. p. 839–47. https://doi.org/10.1109/wacv.2018.00097