SAMF-YOLO: A self-supervised, high-precision approach for defect detection in complex industrial environments

  • Jun Huang ,

    Roles Conceptualization, Methodology, Project administration, Resources, Validation, Writing – original draft, Writing – review & editing

    huangjun1207@whit.edu.cn

    Affiliations Faculty of Intelligent Manufacturing, Wuhu Institute of Technology, Anhui, China, Faculty of Information Technology, City University Malaysia, Kuala Lumpur, Malaysia

  • Shamsul Arrieya Ariffin,

    Roles Supervision

    Affiliations Faculty of Information Technology, City University Malaysia, Kuala Lumpur, Malaysia, Faculty of Computing and Meta Technology, Sultan Idris Education University, Perak, Malaysia

  • Qiang Zhu,

    Roles Data curation

    Affiliation Faculty of Intelligent Manufacturing, Wuhu Institute of Technology, Anhui, China

  • Wanting Xu,

    Roles Formal analysis

    Affiliations Faculty of Information Technology, City University Malaysia, Kuala Lumpur, Malaysia, Faculty of Automotive and Aviation, Wuhu Institute of Technology, Anhui, China

  • Qun Yang

    Roles Conceptualization, Software

    Affiliation Faculty of Intelligent Manufacturing, Wuhu Institute of Technology, Anhui, China

Abstract

As object detection models grow in complexity, balancing computational efficiency and feature expressiveness becomes a critical challenge. To address this, we propose SAMF-YOLO, a novel model integrating three key components: SONet, BFAM, and FASFF-Head. The UniRepLKNet backbone, enhanced by the Star Operation, expands the feature space with high efficiency. FASFF-Head performs adaptive multi-scale feature fusion with minimal overhead, and the Bi-temporal Feature Aggregation Module (BFAM) strengthens the detection of small defects. Additionally, the Focaler-IoU loss improves bounding box regression for challenging object scales, and a self-supervised contrastive learning strategy enhances feature representation and model robustness without relying on labeled data. Experimental results demonstrate that SAMF-YOLO surpasses YOLOv11s with a 6.38% improvement in mAP@0.5 and a notable reduction in computational cost, confirming its superiority in accuracy, efficiency, and robustness. The code is released at https://github.com/Missing24ff/SAMF-YOLO.git.

Introduction

Object detection has witnessed tremendous progress in recent years, largely propelled by deep learning techniques, especially convolutional neural networks (CNNs). Although models such as YOLO, SSD, and Faster R-CNN [1–3] have set benchmarks for performance and speed, they often face challenges in balancing computational efficiency with the capacity to learn rich, high-dimensional representations. Traditional convolutional backbones, despite their power, depend on complex hierarchical structures and explicit non-linear activation functions. This not only introduces substantial computational overhead but also requires extensive manual tuning for optimal performance [4,5].

To overcome these limitations, we propose UniRepLKNet, a novel backbone architecture integrated into the YOLOv11 framework. Leveraging the Star Operation, UniRepLKNet generates high-dimensional features within a low-dimensional computational space, significantly enhancing feature extraction capabilities without incurring additional computational costs. By stacking multiple Star Operation layers [6], the feature space grows exponentially, thereby increasing the network’s representational capacity. This design enables UniRepLKNet to maintain high efficiency while enriching feature dimensionality, offering a compelling alternative to traditional convolution-heavy backbones [7].

In addition, we introduce the FASFF-Head, an adaptive multi-scale fusion mechanism that improves detection robustness across varying object sizes [8]. Complementing this is the Bi-temporal Feature Aggregation Module (BFAM), which aggregates low-level textures and multi-level semantic cues to enhance detection precision, particularly for small or complex defects [9]. To further improve bounding box regression, we present the Focaler-IoU loss, designed to mitigate sample imbalance by emphasizing hard examples during training [10].

Extensive experiments demonstrate that SAMF-YOLO, which integrates these innovations, achieves notable improvements over existing methods in both detection accuracy and computational efficiency, making it a powerful and practical solution for real-time object detection in industrial applications [11,12].

Related work

Object detection has been a key area of research within computer vision, and numerous methods have emerged to balance speed, accuracy, and efficiency. Early object detection models, such as R-CNN, achieved remarkable success by using region proposals and CNNs [13]. However, these methods were computationally expensive and slow, which led to the development of more efficient models like Fast R-CNN [14] and Faster R-CNN [15]. These models introduced the concept of region proposal networks (RPNs), which significantly improved detection speed and accuracy.

The YOLO series revolutionized real-time object detection by framing the task as a regression problem, allowing predictions to be made in a single network pass. YOLOv3 [16] and YOLOv4 [17] continued to improve upon the framework, incorporating additional techniques like multi-scale predictions and improved anchor box generation. YOLOv5 [18] further optimized the architecture, improving inference speed and detection performance.

In recent years, lightweight architectures have gained attention due to the increasing demand for real-time inference on resource-constrained devices. Models like MobileNetv2 [19] and EfficientNet [20] introduced depthwise separable convolutions and compound scaling techniques to improve computational efficiency without sacrificing accuracy. MobileNeXt [21] and MobileNetv3 [22] further optimized these techniques for edge devices, achieving significant improvements in both accuracy and efficiency.

Feature fusion techniques have also been explored extensively, with methods like Feature Pyramid Networks (FPN) [23] and BiFPN [24] providing robust multi-scale feature fusion. These networks combine features from different levels of the network to enhance object detection, especially for small or distant objects. The use of attention mechanisms has become increasingly popular, with Squeeze-and-Excitation Networks (SENet) [25] and Non-local Networks [26] improving feature selection by emphasizing important regions of the input data.

Moreover, the introduction of advanced loss functions like IoU-based losses [27] and Focal Loss [28] has significantly improved the performance of object detection networks, particularly for tasks involving small objects or imbalanced datasets. Despite these advancements, existing models still struggle with achieving both high computational efficiency and feature expressiveness. Our proposed UniRepLknet, combined with the FASFF-Head, BFAM, and Focaler-IoU loss, offers a novel solution by focusing on efficiently learning implicit high-dimensional features and enhancing feature fusion while reducing computational complexity [29,30].

Materials and methods

To address the limitations of existing models in detecting defects in complex industrial parts, this study proposes an enhanced detection framework based on YOLOv11, named SAMF-YOLO. As shown in Fig 1, the model achieves significant performance improvements through three core modules. First, a novel UniRepLKNet backbone network is designed, which generates high-dimensional feature representations in a low-dimensional computational space using the Star Operation. This effectively compensates for the feature loss caused by traditional pooling operations. The structure adopts a four-stage hierarchical architecture and adjusts the feature dimension dynamically through a channel expansion factor, enabling fine-grained multi-scale feature extraction while maintaining computational efficiency. Second, to address the challenges of defect detection in complex scenarios, a module that filters conflicting information and enhances scale invariance (FASFF-Head) is proposed. This module can be trained via backpropagation and has minimal computational overhead, effectively mitigating the feature loss caused by cross-scale interactions. Lastly, because standard initial convolutions capture features insufficiently, a Bi-temporal Feature Aggregation Module (BFAM) is introduced to replace the standard C3K2 blocks in the backbone network. BFAM gradually merges low-level texture information with multi-scale features, allowing the network to capture subtle changes and broader spatial relationships. After channel recalibration, these features are fused, improving the quality of the initial features while maintaining computational efficiency. Experimental results demonstrate that the Focaler-IoU loss function significantly enhances the model’s detection performance for complex-shaped targets, improving mAP@0.5 by 6.38% compared to the baseline model. The technical highlights of each module are as follows:

UniRepLKNet

With the increasing complexity of object detection models, balancing computational efficiency and expressive power remains a major challenge. Traditional convolutional backbones rely on explicit non-linear activation functions and complex hierarchical structures to enhance feature extraction. However, these approaches introduce significant computational overhead and require meticulous manual optimization. To address this issue, we propose UniRepLknet, an innovative backbone that integrates the Star Operation into the YOLOv11 framework, offering a novel perspective on implicit high-dimensional representation learning.

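In a single layer of a neural network, the star operation is typically expressed as:

$$(W_1^{\mathrm{T}} X + B_1) * (W_2^{\mathrm{T}} X + B_2) \tag{1}$$

which represents the fusion of two linearly transformed features through element-wise multiplication. To simplify the notation, we combine the weight matrices and bias terms into a single entity, $W = [W; B]$ with $X = [X; 1]$, so that the star operation becomes $(W_1^{\mathrm{T}} X) * (W_2^{\mathrm{T}} X)$. For a single output channel and a single input element, we define:

$$w_1, w_2, x \in \mathbb{R}^{(d+1) \times 1} \tag{2}$$

Where $d$ represents the number of input channels. This formulation can be naturally extended to accommodate multiple output channels ($d'$ of them) and multiple feature elements ($n$ of them), as follows:

$$W_1, W_2 \in \mathbb{R}^{(d+1) \times (d'+1)}, \quad X \in \mathbb{R}^{(d+1) \times n} \tag{3}$$

Thus, the star operation can be rewritten as:

$$w_1^{\mathrm{T}} x * w_2^{\mathrm{T}} x = \sum_{i=1}^{d+1} \sum_{j=1}^{d+1} w_1^{i} w_2^{j} x^{i} x^{j} = \alpha_{(1,1)} x^{1} x^{1} + \cdots + \alpha_{(d+1,d+1)} x^{d+1} x^{d+1} \tag{4}$$

Here, we use $i,j$ to index the channels and define $\alpha_{(i,j)}$ as the coefficient for each term:

$$\alpha_{(i,j)} = \begin{cases} w_1^{i} w_2^{j}, & i = j \\ w_1^{i} w_2^{j} + w_1^{j} w_2^{i}, & i \neq j \end{cases} \tag{5}$$

Except for $\alpha_{(d+1,:)} x^{d+1} x$, each term maintains a nonlinear relationship with $x$, indicating that they correspond to independent, implicit high-dimensional feature dimensions. Consequently, while computations are performed within a $d$-dimensional space using the computationally efficient star operation, the resulting feature representation resides in an implicit feature space of dimension $\frac{(d+2)(d+1)}{2} \approx \left(\frac{d}{\sqrt{2}}\right)^{2}$.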

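Given that $d \gg 2$, this characteristic effectively expands the feature dimensions without introducing additional computational overhead within a single layer. By stacking multiple layers of star operations, the implicit dimensions can grow exponentially in a recursive manner, approaching infinity. Assuming the initial network layer has a width of $d$, applying a single star operation yields the following expression, where $w_{(1,1)}$ and $w_{(1,2)}$ are the two weight vectors of the first layer:

$$\sum_{i=1}^{d+1} \sum_{j=1}^{d+1} w_{(1,1)}^{i} w_{(1,2)}^{j} x^{i} x^{j} \tag{6}$$

This operation results in an implicit feature space of dimension $\left(\frac{d}{\sqrt{2}}\right)^{2^{1}}$. Let $M_l$ denote the output of the $l$-th star operation, then we obtain:

$$\begin{aligned} M_1 &= W_{1,1}^{\mathrm{T}} x * W_{1,2}^{\mathrm{T}} x && \in \mathbb{R}^{(d/\sqrt{2})^{2^{1}}} \\ M_2 &= W_{2,1}^{\mathrm{T}} M_1 * W_{2,2}^{\mathrm{T}} M_1 && \in \mathbb{R}^{(d/\sqrt{2})^{2^{2}}} \\ &\;\;\vdots \\ M_l &= W_{l,1}^{\mathrm{T}} M_{l-1} * W_{l,2}^{\mathrm{T}} M_{l-1} && \in \mathbb{R}^{(d/\sqrt{2})^{2^{l}}} \end{aligned} \tag{7}$$

After $l$ layers of computation, the feature space expands to a dimension of approximately $\left(\frac{d}{\sqrt{2}}\right)^{2^{l}}$. This means that by stacking only a few layers of star operations, the implicit feature dimension can be exponentially expanded, significantly enhancing the representation capacity.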
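As a rough numerical illustration of the dimension argument above, the sketch below counts the distinct second-order terms $x^i x^j$ produced by one star operation on a $(d+1)$-dimensional input (the $d$ channels plus the folded-in bias) and compares the count with the closed form $(d+2)(d+1)/2$. It also evaluates the approximate growth $(d/\sqrt{2})^{2^l}$ on a log scale, since the raw value overflows quickly. The function names are illustrative and are not taken from the released code.

```python
import math
from itertools import combinations_with_replacement


def count_star_terms(d: int) -> int:
    """Count distinct products x^i * x^j (i <= j) over d channels plus one bias entry."""
    return sum(1 for _ in combinations_with_replacement(range(d + 1), 2))


def closed_form_terms(d: int) -> int:
    """Closed-form number of distinct second-order terms: (d + 2)(d + 1) / 2."""
    return (d + 2) * (d + 1) // 2


def log10_implicit_dim(d: int, layers: int) -> float:
    """log10 of the approximate implicit dimension (d / sqrt(2)) ** (2 ** layers).

    The logarithm is returned because the raw value exceeds float range
    after only a handful of layers.
    """
    return (2 ** layers) * math.log10(d / math.sqrt(2))


if __name__ == "__main__":
    for d in (4, 16, 64):
        print(d, count_star_terms(d), closed_form_terms(d))
    for layers in (1, 2, 5, 10):
        print(layers, round(log10_implicit_dim(128, layers), 1))
```

The enumeration and the closed form agree for any width, which is the single-layer claim; the log-scale helper only restates the approximation and does not measure anything about a trained network.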

To transform UniRepLknet into a self-supervised learning model, we adopt a contrastive learning approach aimed at learning effective feature representations through unsupervised tasks. Specifically, during training, positive and negative sample pairs are generated through data augmentation techniques, including random cropping (crop ratio between 0.6 and 1.0 of the original image), horizontal flipping (with a probability of 0.5), random rotation (within the range of to ), color jittering (brightness, contrast, saturation, and hue adjusted with a strength factor of 0.4), grayscale transformation (with a probability of 0.2), Gaussian blur (applied with a kernel size of 3 and sigma in the range [0.1, 2.0]), and random erasing (with an area ratio ranging from 2% to 10% of the image), as depicted in Fig 2. Positive sample pairs are formed by applying two distinct augmentations to the same original image, thereby producing two different but semantically similar views. Negative samples, conversely, are derived from different images, ensuring that the learned representation distinguishes clearly between varied object instances.

Assuming the input data consists of augmented pairs of samples $(x_i, x_j)$, their respective high-dimensional feature vectors $(z_i, z_j)$ are extracted via the UniRepLKNet backbone utilizing the Star Operation. To measure similarity between the feature vectors, we employ the cosine similarity metric as defined:

$$\mathrm{sim}(z_i, z_j) = \frac{z_i^{\mathrm{T}} z_j}{\lVert z_i \rVert \, \lVert z_j \rVert} \tag{8}$$

For the optimization objective, we employ the widely-adopted InfoNCE loss as our contrastive loss function. This loss function aims to maximize similarity for positive pairs while minimizing similarity for negative pairs:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{N} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)} \tag{9}$$

Here, the sum runs over the positive view and all negative views $z_k$ paired with the anchor $z_i$, and $\tau$ represents the temperature parameter, which scales the distribution of similarity scores and is empirically set to 0.07 in our experiments. During training, we adopt the SGD optimizer with an initial learning rate of 0.01, momentum of 0.937, and weight decay of , accompanied by a cosine annealing schedule to gradually adjust the learning rate over epochs.
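To make the contrastive objective concrete, the following is a minimal sketch of cosine similarity and an InfoNCE loss for one anchor, one positive, and a list of negatives, using the temperature 0.07 quoted above. It also includes a plain cosine annealing schedule starting at the stated learning rate of 0.01. The function names, the single-anchor formulation, and the minimum learning rate of 0 are assumptions made for illustration; the released training code may batch these computations differently.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss for one anchor: -log softmax score of the positive view."""
    logits = [cosine_similarity(anchor, positive) / temperature]
    logits += [cosine_similarity(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[0]


def cosine_annealing_lr(epoch, total_epochs, lr_initial=0.01, lr_min=0.0):
    """Cosine annealing without restarts (lr_min = 0 is an assumption)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_initial - lr_min) * (1.0 + cosine)


if __name__ == "__main__":
    aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
    swapped = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
    print(aligned, swapped)
    print([round(cosine_annealing_lr(e, 100), 5) for e in (0, 50, 100)])
```

The loss is close to zero when the positive view is far more similar to the anchor than any negative, and grows when a negative is the closer match, which is the behaviour the objective is meant to reward and penalize.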

Combining the Star Operation in UniRepLknet with contrastive learning, the final self-supervised loss function integrates the contrastive loss and an L2 regularization term to prevent feature space collapse and enhance the model’s generalization capability:

(10)

where is empirically set to to balance the impact of the regularization term.

Given the unique advantage of UniRepLknet–which performs computations in a low-dimensional space while generating high-dimensional features–we have identified its practical value in efficient network architectures. Therefore, we integrate UniRepLknet as a proof-of-concept model into the YOLOv11 backbone, where it demonstrates outstanding performance, highlighting the effectiveness of the Star Operation, as Fig 3 shows.

UniRepLknet adopts a four-stage hierarchical architecture, where downsampling is performed through convolutional layers, and feature extraction is carried out using modified blocks. To enhance efficiency, Batch Normalization (BatchNorm) replaces Layer Normalization, positioned after depthwise convolution, enabling fusion during inference. Inspired by MobileNeXt, we introduce an additional depthwise convolution at the end of each module. The channel expansion factor is consistently set to 4, while the network width doubles at each stage. The architecture follows the design principles of MobileNetv2, replacing the GELU activation function in the demonstration blocks with ReLU6, further improving computational efficiency. The overall UniRepLknet framework is illustrated in the accompanying diagram. By simply adjusting the number of blocks and input embedding channels, we construct different scales of UniRepLknet, optimizing the YOLOv11 backbone to achieve a balance between efficiency and performance.
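The remark that placing BatchNorm after the depthwise convolution enables fusion during inference can be illustrated with the standard folding identity: for each output channel, the BatchNorm scale and shift are absorbed into the convolution weight and bias. The sketch below works on plain Python lists for one kernel per channel and also shows ReLU6. The helper names and the flat-list weight layout are hypothetical and chosen only to keep the example self-contained.

```python
import math


def fold_batchnorm(weight, bias, gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold per-channel BatchNorm statistics into the preceding convolution.

    weight: one flat kernel (list of floats) per output channel.
    bias, gamma, beta, running_mean, running_var: one float per output channel.
    Returns (folded_weight, folded_bias) such that conv followed by BatchNorm
    equals a single conv with the folded parameters (inference mode only).
    """
    folded_weight, folded_bias = [], []
    for kernel, conv_bias, g, shift, mean, var in zip(
        weight, bias, gamma, beta, running_mean, running_var
    ):
        scale = g / math.sqrt(var + eps)
        folded_weight.append([w * scale for w in kernel])
        folded_bias.append(shift + (conv_bias - mean) * scale)
    return folded_weight, folded_bias


def relu6(value):
    """ReLU6 activation: clamp to the interval [0, 6]."""
    return min(max(value, 0.0), 6.0)


if __name__ == "__main__":
    w, b = fold_batchnorm([[1.0, 2.0, 3.0]], [0.5], [2.0], [0.1], [0.3], [4.0])
    print(w, b, relu6(7.5))
```

Because the folded layer is a single convolution, the normalization costs nothing at inference time, which is the efficiency argument made for replacing Layer Normalization.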

FASFF-Head

Unlike traditional feature fusion methods, we introduce FASFF-HEAD, which integrates Four-Head Adaptive Spatial Feature Fusion (FASFF) directly into the YOLOv11 detection head. This approach enables more efficient multi-scale feature fusion while addressing feature loss caused by cross-scale interactions. The key innovation of FASFF-HEAD lies in its adaptive spatial feature fusion mechanism, which effectively filters conflicting information and enhances scale invariance. The FASFF method is model-agnostic, trainable via backpropagation, and introduces minimal computational overhead, making it a practical enhancement for existing object detection frameworks, as Fig 4 shows. The key innovations of FASFF-HEAD are as follows:

Fig 4. Illustration of the adaptively spatial feature fusion mechanism.

https://doi.org/10.1371/journal.pone.0327001.g004

  1. Adaptive Spatial Feature Fusion: We propose a novel pyramid feature fusion strategy that spatially filters conflicting information and suppresses inconsistencies between multi-scale features.
  2. Enhanced Scale Invariance: The FASFF strategy significantly improves feature scale invariance, leading to more accurate object detection.
  3. Low Inference Overhead: While improving detection performance, FASFF introduces negligible additional computational cost during inference.

These innovations make FASFF-HEAD a significant advancement in the field of single-shot object detection, particularly in improving the ability to handle objects of varying scales. Fig 4 illustrates the working principle of the Four-Head Adaptive Spatial Feature Fusion (FASFF) mechanism, specifically designed for single-shot object detection.

In this structure, features at different levels (represented by layers of different colors) first undergo downsampling or upsampling according to their respective strides to ensure uniform spatial dimensions across all feature maps. Level 1, Level 2, Level 3, and Level 4 correspond to different levels in the feature pyramid, each with distinct spatial resolutions. ASFF-1, ASFF-2, ASFF-3, and ASFF-4 represent different levels where the FASFF mechanism is applied to achieve feature fusion.

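In the zoomed-in section of FASFF-4, we observe that feature maps from other levels ($x^{1 \to 4}$, $x^{2 \to 4}$, $x^{3 \to 4}$) are resized to match the spatial dimensions of the fourth level ($x^{4}$). These resized features are then adaptively weighted using learned importance maps before being fused into a final refined feature representation ($y^{4}$), which is used for prediction. Let $x_{ij}^{n \to l}$ denote the feature vector at the position $(i,j)$ on the feature maps resized from level $n$ to level $l$. We propose to fuse the features at the corresponding level $l$ as follows:

$$y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{1 \to l} + \beta_{ij}^{l} \cdot x_{ij}^{2 \to l} + \gamma_{ij}^{l} \cdot x_{ij}^{3 \to l} + \delta_{ij}^{l} \cdot x_{ij}^{4 \to l} \tag{11}$$

where $y_{ij}^{l}$ denotes the $(i,j)$-th vector of the output feature maps $y^{l}$ among channels. $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$, and $\delta_{ij}^{l}$ refer to the spatial importance weights for the feature maps at the four different levels to level $l$, which are adaptively learned by the network. Note that $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$, and $\delta_{ij}^{l}$ can be simple scalar variables, shared across all the channels. Inspired by [31], we enforce the constraint $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} + \delta_{ij}^{l} = 1$ and $\alpha_{ij}^{l}, \beta_{ij}^{l}, \gamma_{ij}^{l}, \delta_{ij}^{l} \in [0,1]$, and define:

$$\alpha_{ij}^{l} = \frac{e^{\lambda_{\alpha_{ij}}^{l}}}{e^{\lambda_{\alpha_{ij}}^{l}} + e^{\lambda_{\beta_{ij}}^{l}} + e^{\lambda_{\gamma_{ij}}^{l}} + e^{\lambda_{\delta_{ij}}^{l}}} \tag{12}$$

Here, $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$, and $\delta_{ij}^{l}$ are defined by using the softmax function with $\lambda_{\alpha_{ij}}^{l}$, $\lambda_{\beta_{ij}}^{l}$, $\lambda_{\gamma_{ij}}^{l}$, and $\lambda_{\delta_{ij}}^{l}$ as control parameters, respectively; $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$, and $\delta_{ij}^{l}$ follow Eq (12) with the corresponding numerator. We use convolution layers to compute the weight scalar maps $\lambda_{\alpha}^{l}$, $\lambda_{\beta}^{l}$, $\lambda_{\gamma}^{l}$, and $\lambda_{\delta}^{l}$ from $x^{1 \to l}$, $x^{2 \to l}$, $x^{3 \to l}$, and $x^{4 \to l}$, respectively. These weight maps can thus be learned through standard back-propagation.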

With this method, the features at all the levels are adaptively aggregated at each scale. The outputs are used for object detection following the same pipeline of YOLOv11. By incorporating four-head feature fusion into the detection pipeline, FASFF-HEAD achieves weighted fusion across multiple feature levels. Compared to traditional three-head fusion methods, this approach further enhances detection accuracy, especially in complex environments and for objects of varying scales.
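The weighted fusion described above can be sketched at a single spatial position: four feature vectors, already resized to the target level, are combined with softmax-normalized weights so that the weights are non-negative and sum to one. The sketch assumes the per-position control values have already been produced by the weight-prediction convolutions; the function names and list-based layout are illustrative rather than the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_position(features, control_values):
    """Fuse four resized feature vectors at one (i, j) position.

    features: four equal-length channel vectors, one per pyramid level.
    control_values: four scalars for this position, one per level.
    Returns the fused channel vector and the four fusion weights.
    """
    if len(features) != 4 or len(control_values) != 4:
        raise ValueError("expected exactly four levels")
    weights = softmax(control_values)
    channels = len(features[0])
    fused = [
        sum(weight * level[c] for weight, level in zip(weights, features))
        for c in range(channels)
    ]
    return fused, weights


if __name__ == "__main__":
    levels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
    print(fuse_position(levels, [0.0, 0.0, 0.0, 0.0]))
    print(fuse_position(levels, [0.0, 0.0, 0.0, 50.0]))
```

With equal control values the result is a plain average of the four levels; a dominant control value lets one level effectively override the others at that position, which is how conflicting information from other scales can be suppressed.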

C3K2-BFAM

In the original YOLOv11 framework, the C3K2 layer primarily focuses on capturing mid-level features but lacks a comprehensive integration of low-level details and multilevel information. While the C3K2 layer captures important features, it struggles to effectively combine fine-grained details (such as texture and color) with higher-level context, which is crucial for detecting small, intricate industrial defects.

To address this limitation, we introduce the Bi-temporal Feature Aggregation Module (BFAM) within the C3K2 layer, as shown in Fig 5. BFAM progressively merges low-level texture information with multilevel features, allowing the network to capture both subtle changes and broader spatial relationships. This fusion improves the model’s ability to detect small, complex defects, which is essential for industrial part defect detection.

To effectively extract fine-grained features while preserving spatial relationships, we apply channel splicing to the input bitemporal features . Four parallel dilated convolutions () with varying dilation rates (1, 2, 3, 4) and group size c (representing the number of channels) are employed. This strategy enables the model to capture change regions at different scales through diverse receptive fields, all while maintaining the spatial coherence of the features via grouped convolutions. The outputs from these convolutions are then concatenated along the channel dimension, followed by a convolution block to reduce the channel dimensions. Finally, feature refinement is achieved using the SimAM attention mechanism [32]. The corresponding equations are as follows:

(13)(14)(15)

Let d represent the dilation rate, g the group size, and Concat the channel concatenation operation. Considering the shared characteristics within the bitemporal features, this approach enhances the precision of low-level texture information. We compute the importance of pixel-level features from different temporal inputs using the SimAM attention mechanism. Subsequently, the extracted common feature is multiplied by the respective temporal features to measure their similarity. The equations are as follows:

(16)(17)

Next, we combine the low-level detail features with the advanced change feature . The final aggregated feature is generated using the SimAM attention mechanism. As the CNN model deepens, low-level information gradually diminishes. To retain sufficient texture details in the deeper layers, we introduce residual connections. The equations are as follows:

(18)

We incorporated multilevel features through the use of a BFAM to improve the spatial texture information in the change region by integrating various receptive fields. Additionally, these features were fused with high-level semantic information to provide a more precise and holistic understanding of the change.
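The effect of the four dilation rates (1, 2, 3, 4) on the receptive field can be checked with simple arithmetic. The sketch below computes the effective kernel extent of a dilated convolution and the padding that keeps the spatial size unchanged, so that the four branch outputs can be concatenated along the channel dimension. The base kernel size of 3 is an assumption made for illustration, since the kernel size is not stated in the text above; the helper names are hypothetical.

```python
def effective_kernel(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel: k + (k - 1)(d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size, dilation):
    """Padding that preserves spatial size for stride 1 and odd kernel sizes."""
    return dilation * (kernel_size - 1) // 2


def conv_output_size(size, kernel_size, dilation, padding, stride=1):
    """Standard convolution output-size formula along one spatial axis."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


if __name__ == "__main__":
    kernel = 3  # assumed base kernel size, not given in the text
    for rate in (1, 2, 3, 4):
        pad = same_padding(kernel, rate)
        print(rate, effective_kernel(kernel, rate), pad,
              conv_output_size(64, kernel, rate, pad))
```

Under the 3 x 3 assumption the four branches cover extents of 3, 5, 7, and 9 pixels while producing feature maps of identical size, which is what allows their outputs to be concatenated directly.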

Focaler-IoU

In various object detection tasks, the problem of imbalanced samples is prevalent. Samples can be classified into hard samples and easy samples according to the difficulty of object detection. From the perspective of target scale analysis, general detection targets can be regarded as easy samples, while extremely small targets are considered hard samples due to the great difficulty in their precise positioning.

For detection tasks dominated by easy samples, paying attention to easy samples during the bounding box regression process can contribute to the improvement of detection performance. Conversely, for detection tasks with a high proportion of hard samples, it is necessary to focus on the bounding box regression of hard samples.

To focus on different regression samples for various detection tasks, we reconstruct the Intersection over Union loss (IoU loss) using the linear interval mapping method [33]. This is conducive to enhancing the boundary regression effect. The formula is as follows:

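$$IoU^{\mathrm{focaler}} = \begin{cases} 0, & IoU < d \\ \dfrac{IoU - d}{u - d}, & d \le IoU \le u \\ 1, & IoU > u \end{cases} \tag{19}$$

Where $IoU^{\mathrm{focaler}}$ is the reconstructed Focaler-IoU, $IoU$ is the original IoU value, and $[d, u] \in [0, 1]$. By adjusting the values of $d$ and $u$, we can make $IoU^{\mathrm{focaler}}$ focus on different regression samples. Its loss is defined below:

$$L_{\mathrm{Focaler\text{-}IoU}} = 1 - IoU^{\mathrm{focaler}} \tag{20}$$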

For general object detection tasks, placing appropriate emphasis on easy samples proves beneficial in boosting overall detection performance; conversely, in scenarios where extremely small targets constitute a significant proportion, directing attention to hard samples markedly enhances the precision of boundary localization. Subsequent experiments and analyses further substantiate that Focaler-IoU not only preserves detection accuracy but also effectively mitigates sample imbalance in bounding box regression, thereby paving the way for adaptive regression strategies tailored to varying target scales and detection contexts.
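The linear interval mapping behind Focaler-IoU can be sketched in a few lines: IoU values below a lower bound d map to 0, values above an upper bound u map to 1, and values in between are rescaled linearly, with the loss taken as one minus the mapped value. The box format (x1, y1, x2, y2), the function names, and the example bounds are illustrative assumptions; the bounds actually used in the experiments are not stated in this section.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def focaler_iou(iou, d, u):
    """Linear interval mapping of IoU onto [0, 1] using bounds 0 <= d < u <= 1."""
    if not 0.0 <= d < u <= 1.0:
        raise ValueError("bounds must satisfy 0 <= d < u <= 1")
    if iou < d:
        return 0.0
    if iou > u:
        return 1.0
    return (iou - d) / (u - d)


def focaler_iou_loss(iou, d, u):
    """Focaler-IoU loss: one minus the remapped IoU."""
    return 1.0 - focaler_iou(iou, d, u)


if __name__ == "__main__":
    iou = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
    # Example bounds only; they are not the values used in the paper.
    print(iou, focaler_iou_loss(iou, d=0.0, u=0.95))
```

Raising d makes low-IoU (hard) boxes contribute the maximum loss while stretching the gradient over the remaining range, whereas lowering u saturates high-IoU (easy) boxes; choosing the pair is how the emphasis is shifted between easy and hard samples.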

Experimental results and analyses

Experimental environments and dataset

As shown in Table 1, the deep learning environment and framework used in the experiments have been consistently applied to all tested networks with identical configurations. Detailed hyper-parameters listed in Table 2 were uniformly used across all models evaluated, ensuring fairness and consistency throughout the experimental comparisons.

Table 1. Experimental environment configuration.

https://doi.org/10.1371/journal.pone.0327001.t001

To comprehensively evaluate the robustness and generalization capability of the SAMF-YOLO model, two distinct datasets were employed. The first dataset, self-collected, consists of solar heating tape defect images, comprising 907 images for training and 227 images for validation. This dataset covers diverse defect types, including wrinkles, dents, scratches, punctures, damage, and dirt, providing varied scenarios to rigorously test the model’s ability to generalize across different conditions and anomalies. Additionally, the publicly available NEU-DET [34] dataset was utilized to further validate the effectiveness and performance of our proposed approach. NEU-DET focuses specifically on steel surface defects, containing 1,800 grayscale images classified into six typical defect categories: crazing, patches, inclusion, pitted surface, rolled-in scale, and scratches. The incorporation of this standard dataset further demonstrates the model’s adaptability and effectiveness in broader industrial defect detection tasks.

To ensure efficient training and evaluation, the experiments were conducted on high-performance hardware, configured with an NVIDIA RTX 4080 Super GPU and CUDA 12.1 to enhance computational efficiency. The software environment is based on Python 3.11, with model training and evaluation carried out using the PyTorch deep learning framework. Additionally, to improve the consistency of training and evaluation, the datasets were split into 70% for training, 20% for validation, and 10% for testing. The dataset was meticulously annotated to capture complex scenarios such as occlusion and texture variations. All images were resized to 640 × 640 before being fed into the model to maintain consistency and reduce computational load.

With these two datasets and robust hardware support, the SAMF-YOLO model is capable of efficient defect detection in a wide range of real-world environments, ensuring its broad applicability and high efficiency.

Evaluation metrics

This study evaluates object detection performance through Precision, Recall, mAP, GFLOPs, and model parameters. Classification outcomes are defined by True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN), providing a comprehensive view of accuracy and efficiency. Precision reflects the ability to correctly identify positive samples. High precision means fewer false positives, while low precision indicates a higher rate of incorrect positive predictions. It is expressed as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{21}$$

Recall quantifies the proportion of actual positive samples correctly identified by the classifier. A high recall indicates that most positive instances are detected, while a low recall suggests many positive instances are missed. It is expressed as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{22}$$

Average Precision (AP) measures object detection performance by computing the area under the Precision-Recall curve across different thresholds. A higher AP value signifies better detection accuracy. It is calculated as:

$$AP = \int_{0}^{1} P(R)\, dR \tag{23}$$
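The precision, recall, and AP definitions above can be sketched as follows. Detections are assumed to be pre-sorted by descending confidence and already matched to ground truth, so each one is simply flagged as a true or false positive. The area under the precision-recall curve is computed with all-point interpolation over the monotone precision envelope; this is one common convention and an assumption here, since evaluation toolkits differ in how they interpolate the curve.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw counts (0.0 when a denominator is zero)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(is_true_positive, num_ground_truth):
    """Area under the precision-recall curve with all-point interpolation.

    is_true_positive: booleans for detections sorted by descending confidence.
    num_ground_truth: number of ground-truth objects of this class.
    """
    if num_ground_truth <= 0:
        return 0.0
    tp = fp = 0
    recalls, precisions = [0.0], [1.0]
    for flag in is_true_positive:
        if flag:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas wherever recall increases.
    return sum(
        (recalls[i] - recalls[i - 1]) * precisions[i]
        for i in range(1, len(recalls))
    )


if __name__ == "__main__":
    print(precision_recall(tp=8, fp=2, fn=4))
    print(average_precision([True, False, True], num_ground_truth=2))
```

A detector that ranks every true positive ahead of every false positive reaches an AP of 1, while false positives ranked early pull the envelope down and reduce the area.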

Mean Average Precision (mAP) averages the AP values across all classes in multi-class tasks. mAP@0.5, a common benchmark, is computed at an IoU threshold of 0.5 to assess detection performance. It is expressed as:

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{24}$$

Where $N$ represents the total number of classes, and $AP_i$ denotes the average precision for the $i$-th class.

To provide a more comprehensive evaluation, mAP@0.5:0.95 averages AP across multiple IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. This metric reflects both localization accuracy and robustness. It is defined as:

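$$mAP@0.5{:}0.95 = \frac{1}{N \cdot T} \sum_{i=1}^{N} \sum_{t=1}^{T} AP_{i,t} \tag{25}$$

Where $T$ is the number of IoU thresholds (typically 10), and $AP_{i,t}$ denotes the average precision for the $i$-th class at the $t$-th IoU threshold.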
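The two averaging steps can be sketched with a small table of per-class AP values, one row per class and one column per IoU threshold. mAP@0.5 averages the first column over classes, while mAP@0.5:0.95 averages every entry over both classes and thresholds. The table layout and function names are illustrative assumptions, and the AP values in the usage lines are placeholders rather than results from the paper.

```python
# The ten IoU thresholds 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * step, 2) for step in range(10)]


def map_at_threshold(ap_table, threshold_index=0):
    """Mean AP over classes at one IoU threshold (index 0 is IoU = 0.5)."""
    column = [row[threshold_index] for row in ap_table]
    return sum(column) / len(column)


def map_over_thresholds(ap_table):
    """Mean AP over all classes and all IoU thresholds."""
    widths = {len(row) for row in ap_table}
    if len(widths) != 1:
        raise ValueError("every class needs one AP value per threshold")
    values = [ap for row in ap_table for ap in row]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Placeholder AP values for two classes, for illustration only.
    table = [[0.9] * 10, [0.5] * 10]
    print(IOU_THRESHOLDS)
    print(map_at_threshold(table), map_over_thresholds(table))
```

Because the stricter thresholds usually yield lower AP, mAP@0.5:0.95 is normally well below mAP@0.5 for the same model, which matches the gap between the two metrics reported in the experiments.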

GFLOPs (Giga Floating Point Operations) quantify the computational complexity of a deep learning model. Specifically, they measure the number of floating-point operations required by the model, expressed in billions. A higher GFLOPs value indicates a more complex model that demands greater computational resources, resulting in longer training and inference times. This metric is essential for evaluating the efficiency and scalability of a model, especially when deploying it on limited hardware.

The number of parameters refers to the learnable weights and biases within a model, which directly impacts its capacity for learning complex features. While increasing the number of parameters can enhance the model’s ability to capture intricate patterns in the data, it also introduces higher resource requirements. More parameters lead to increased memory usage for storage, higher computational costs during training and inference, and potentially longer training times. Striking a balance between model complexity and resource efficiency is key for optimizing performance and ensuring practical deployment.
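As a back-of-the-envelope companion to the two complexity metrics, the sketch below counts the learnable parameters of a standard convolution and of a depthwise separable one, and estimates multiply-accumulate operations (MACs) for a given output resolution. The layer sizes in the usage lines are arbitrary examples, not layers of SAMF-YOLO, and whether one MAC is reported as one or two floating-point operations varies between tools, so the figures are indicative only.

```python
def conv_params(c_in, c_out, kernel_size, groups=1, bias=False):
    """Learnable parameters of a 2D convolution with square kernels."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    params = (c_in // groups) * c_out * kernel_size * kernel_size
    return params + (c_out if bias else 0)


def depthwise_separable_params(c_in, c_out, kernel_size):
    """Depthwise conv (groups = c_in) followed by a 1 x 1 pointwise conv."""
    depthwise = conv_params(c_in, c_in, kernel_size, groups=c_in)
    pointwise = conv_params(c_in, c_out, 1)
    return depthwise + pointwise


def conv_macs(c_in, c_out, kernel_size, h_out, w_out, groups=1):
    """Multiply-accumulate operations: weights applied at every output position."""
    return conv_params(c_in, c_out, kernel_size, groups) * h_out * w_out


if __name__ == "__main__":
    standard = conv_params(64, 128, 3)
    separable = depthwise_separable_params(64, 128, 3)
    print(standard, separable, round(standard / separable, 1))
    print(conv_macs(64, 128, 3, 80, 80) / 1e9)  # giga-MACs for an 80 x 80 output
```

The comparison shows why depthwise designs are attractive for lightweight backbones: the separable variant needs far fewer weights and MACs than the standard convolution for the same input and output widths.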

Ablation experiments

To evaluate the performance gain obtained by enlarging the implicit feature space with the Star Operation, we varied only its feature-dimension expansion coefficient , keeping every other architectural component and training hyper-parameter unchanged. As reported in Table 3, setting lifts mAP@0.5 from 67.32% to 69.10%–a 1.78 pp increase–at the cost of just +0.6 M trainable parameters. Raising the coefficient to yields a better trade-off: mAP@0.5 reaches 70.24% (a 2.92 pp gain over the baseline) while the model remains compact at 2.6 M parameters. Further widening to provides only an additional 0.56 pp but nearly doubles the parameter budget, indicating sharply diminishing returns. Consequently, is adopted as the default configuration in all subsequent experiments.

Table 3. The Star Operation feature–dimension expansion coefficient .

https://doi.org/10.1371/journal.pone.0327001.t003

To evaluate the effectiveness of each proposed module in SAMF-YOLO, we conducted a comprehensive ablation study, as shown in Table 4, focusing on the contributions of the Star Operation Net (SONet), the Bi-temporal Feature Aggregation Module (BFAM), the Adaptive Spatial Feature Fusion Head (FASFF), and the Focaler-IoU loss function.

Using SONet as the backbone alone yields a strong baseline, reaching 70.24% mAP@0.5 and 34.3% mAP@0.5:0.95, with a high inference speed of 81.2 FPS. This demonstrates its ability to extract rich multi-scale features while maintaining efficiency.

Introducing the BFAM module, which captures complementary temporal-spatial patterns, leads to notable improvements: Precision increases to 72.9%, mAP@0.5 rises to 72.4%, and mAP@0.5:0.95 improves to 35.6%, with only a slight FPS drop to 79.4.

Adding the FASFF module to the detection head further enhances the results, especially for small-object detection. The model achieves 75.1% mAP@0.5 and 36.8% mAP@0.5:0.95, with only a minor speed trade-off (78.6 FPS), demonstrating FASFF’s strength in adaptive spatial feature fusion.

Finally, replacing the standard CIoU loss with Focaler-IoU yields the best overall performance: Precision reaches 76.5%, mAP@0.5 peaks at 75.7%, and mAP@0.5:0.95 rises to 37.4%, while maintaining high inference speed at 77.9 FPS.

Overall, each component of SAMF-YOLO contributes significantly to performance, validating the design’s effectiveness in building an accurate and efficient industrial defect detection framework.

The results in Table 5 show that the Focaler-IoU loss function outperforms the other IoU-based variants on all key evaluation metrics. It achieves the highest mAP@0.5 (75.7%) and the best mAP@0.5:0.95 (37.4%), indicating superior localization across a broad range of IoU thresholds. It also delivers the highest precision (76.5%) and recall (74.8%), a balanced improvement in both the confidence and the completeness of detections. Compared with traditional losses such as IoU (72.0%, 34.3%), GIoU (73.2%, 34.9%), and CIoU (73.9%, 35.8%), Focaler-IoU provides a clear edge, particularly in challenging scenarios where precise bounding box regression is critical, as Fig 6 shows.

Fig 6. Performances of different loss functions.

https://doi.org/10.1371/journal.pone.0327001.g006

Table 5. Performances of different loss functions.

https://doi.org/10.1371/journal.pone.0327001.t005

This performance gain is particularly valuable in industrial defect detection tasks, where small and irregularly shaped objects require more refined localization. By emphasizing informative samples and adapting better to varying object geometries, Focaler-IoU enhances both detection robustness and reliability, making it a key contributor to the overall effectiveness of the SAMF-YOLO framework.
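To make the idea of "emphasizing informative samples" concrete, the sketch below shows the piecewise-linear IoU remapping commonly associated with Focaler-IoU: IoU values below a lower threshold `d` map to 0, values above an upper threshold `u` map to 1, and values in between are rescaled linearly. This is an illustrative reading, not the authors' implementation. The threshold values used here, and the helper names `box_iou`, `focaler_iou`, and `focaler_iou_loss`, are assumptions; this section does not state the thresholds used in SAMF-YOLO.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def focaler_iou(iou, d=0.0, u=0.95):
    """Linear interval mapping of IoU: 0 below d, 1 above u, linear between.

    The defaults for d and u are assumed values for illustration only.
    """
    if not 0.0 <= d < u <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= d < u <= 1")
    return min(1.0, max(0.0, (iou - d) / (u - d)))


def focaler_iou_loss(pred, target, d=0.0, u=0.95):
    """Loss = 1 - remapped IoU between a predicted and a target box."""
    return 1.0 - focaler_iou(box_iou(pred, target), d, u)


if __name__ == "__main__":
    pred, target = (0, 0, 2, 2), (1, 1, 3, 3)
    print("IoU:", box_iou(pred, target))
    print("Focaler-IoU loss:", focaler_iou_loss(pred, target, d=0.0, u=0.95))
```

Choosing the interval `[d, u]` controls which samples receive gradient: boxes whose IoU falls outside the interval saturate at 0 or 1, so the regression signal concentrates on the samples inside it.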

Comparative experiment

In this study, we evaluated the proposed SAMF-YOLO algorithm against mainstream object detection models, including YOLOv5s [35], YOLOv6n [36], YOLOv7-tiny [37], YOLOv8n, RT-DETR-R18 [38], YOLOv9s [39], YOLOv10s [40], and YOLOv11s.

As shown in Table 6, on the solar heating tape surface defect dataset, SAMF-YOLO achieves the best trade-off between detection accuracy and computational efficiency. Compared with YOLOv5s, our model improves Precision from 60.42% to 76.5% and mAP@0.5 from 63.11% to 75.7% while reducing GFLOPs from 15.8 to 12.4, which reflects the efficiency of the UniRepLKNet backbone.

Against YOLOv6n, SAMF-YOLO increases mAP@0.5 by 14.56% with only a marginal increase in computation (12.4 vs. 11.8 GFLOPs), supporting the lightweight design of the C3K2-BFAM module. Compared with YOLOv7-tiny, our model is 12.21% higher in mAP@0.5 and 0.6 GFLOPs cheaper, consistent with the advantages of Focaler-IoU and the overall efficient architecture.

Compared with YOLOv8n, SAMF-YOLO achieves 15.18% higher Precision and 11.38% higher mAP@0.5, at a cost of 12.4 GFLOPs versus 8.1, highlighting the robustness of the FASFF-Head in handling complex spatial features. Furthermore, compared with RT-DETR-R18, SAMF-YOLO improves mAP@0.5 by 10.42% and reduces GFLOPs by nearly 79%, making it substantially more suitable for real-time applications.

Even against the more recent YOLOv9s, YOLOv10s, and YOLOv11s, SAMF-YOLO maintains an advantage in both accuracy and efficiency, offering higher mAP@0.5 while reducing GFLOPs by 53.6%, 49.2%, and 41.8%, respectively. It also sustains a real-time inference speed of 77.9 FPS, a good balance between performance and speed.
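The comparison above mixes absolute GFLOPs values with relative reductions. As a rough aid for reading it, the sketch below converts between the two. Only the figures quoted in the text are used; the helper names `relative_reduction` and `implied_baseline` are illustrative, and the implied baselines are back-calculated estimates, not values taken from Table 6.

```python
SAMF_YOLO_GFLOPS = 12.4  # quoted in the text


def relative_reduction(baseline, ours):
    """Percentage reduction of `ours` relative to `baseline`.

    A negative result means `ours` costs more than the baseline.
    """
    return 100.0 * (baseline - ours) / baseline


def implied_baseline(ours, reduction_pct):
    """Baseline cost implied by a quoted relative reduction."""
    return ours / (1.0 - reduction_pct / 100.0)


if __name__ == "__main__":
    # Absolute GFLOPs quoted in the text.
    for name, gflops in [("YOLOv5s", 15.8), ("YOLOv6n", 11.8), ("YOLOv8n", 8.1)]:
        change = relative_reduction(gflops, SAMF_YOLO_GFLOPS)
        print(f"{name}: {change:+.1f}% GFLOPs reduction")

    # Relative reductions quoted in the text, converted back to GFLOPs.
    for name, pct in [("YOLOv9s", 53.6), ("YOLOv10s", 49.2), ("YOLOv11s", 41.8)]:
        estimate = implied_baseline(SAMF_YOLO_GFLOPS, pct)
        print(f"{name}: implied baseline of about {estimate:.1f} GFLOPs")
```

The negative values for YOLOv6n and YOLOv8n make explicit that SAMF-YOLO is more expensive than those two nano-scale models, so its advantage over them lies in accuracy rather than in computational cost.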

Table 6. Performance comparison on the solar heating tape surface defect dataset.

https://doi.org/10.1371/journal.pone.0327001.t006

On the NEU-DET dataset (Table 7), SAMF-YOLO again demonstrates excellent generalization ability, achieving the highest scores across all key metrics: Precision of 79.46%, mAP@0.5 of 76.32%, and mAP@0.5:0.95 of 39.17%. Despite its lightweight architecture (12.4 GFLOPs), the model runs at a high-speed 85.7 FPS, confirming its robustness and versatility for diverse industrial defect detection tasks.

Table 7. Performance comparison on the NEU-DET dataset.

https://doi.org/10.1371/journal.pone.0327001.t007

Detection results and analysis

Fig 7 presents the detection results of different models on images of solar heating tapes with various defects. The first column shows the original images, the second column shows the results of the baseline model YOLOv11s, and the third column shows the results of the improved SAMF-YOLO model. The first row contains regular scenes with no significant occlusion or clutter, in which both YOLOv11s and SAMF-YOLO accurately detect the defects. The following rows contain scenes with multiple defects. Here YOLOv11s occasionally misses defect targets, whereas SAMF-YOLO identifies and localizes all of them, including small, subtle ones. Overall, the results indicate that SAMF-YOLO is more accurate and more robust in complex scenes, particularly when detecting small and intricate defects.

Fig 7. Comparison of YOLOv8n, YOLOv11n and SAMF-YOLO.

https://doi.org/10.1371/journal.pone.0327001.g007

Score-CAM [41] is a class activation mapping method designed to improve the transparency and interpretability of convolutional neural networks. Unlike gradient-based class activation mapping methods [42,43], Score-CAM requires no gradient calculation or backpropagation. Each activation map is upsampled, normalized, and used to mask the input image; the model's score for the target class on the masked input then serves as the weight of that map, and the class activation map is the weighted combination of all maps. Because the weights come from forward-pass scores rather than gradients, the method avoids issues such as vanishing or noisy gradients and tends to produce clearer, less noisy activation maps. In this way, Score-CAM highlights the regions of the input image that most influence the model's prediction, such as subtle defect areas.

Furthermore, because Score-CAM needs only forward passes through the trained network, it can be applied to a detector without modifying it, at the cost of one masked forward pass per activation map. This makes it well suited to applications that require high model interpretability and decision transparency. We therefore use Score-CAM to interpret SAMF-YOLO, to better understand its reasoning process, and to provide qualitative support for the model's robustness and generalization capabilities.
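The Score-CAM procedure described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the pipeline used to produce Fig 8: the image is a single-channel array, nearest-neighbour upsampling stands in for the usual bilinear interpolation, and `score_fn` is a hypothetical callable that returns the model's scalar score for the target class (for a detector, how that score is defined, for example as the confidence of a chosen box, is itself an assumption).

```python
import numpy as np


def upsample_nearest(act, out_h, out_w):
    """Nearest-neighbour upsampling of a 2-D activation map."""
    rows = np.arange(out_h) * act.shape[0] // out_h
    cols = np.arange(out_w) * act.shape[1] // out_w
    return act[np.ix_(rows, cols)]


def score_cam(image, activations, score_fn):
    """Gradient-free class activation map.

    image:       (H, W) float array
    activations: (K, h, w) array of feature maps from one layer
    score_fn:    callable mapping an (H, W) image to a scalar class score
    """
    height, width = image.shape
    baseline = score_fn(np.zeros_like(image))  # score of an empty input

    upsampled, scores = [], []
    for act in activations:
        up = upsample_nearest(act, height, width)
        span = up.max() - up.min()
        # Normalise the map to [0, 1] so it can act as a soft mask.
        mask = (up - up.min()) / span if span > 0 else np.zeros_like(up)
        upsampled.append(up)
        # Weight of this map = score gain when only the masked region is kept.
        scores.append(score_fn(image * mask) - baseline)

    scores = np.asarray(scores, dtype=float)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()  # softmax over the K maps

    cam = np.maximum(np.tensordot(weights, np.stack(upsampled), axes=1), 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

The loop makes the cost explicit: one forward pass through `score_fn` per activation map, which is why the number of maps in the chosen layer determines how long a heatmap takes to compute.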

Fig 8 shows the heatmap performance of the baseline model YOLOv11s and the improved SAMF-YOLO model in different scenarios. Each row represents a different scene: the first row shows a regular scene, the second row displays a scene with occlusion, and the third row shows a scene with small occluded targets. The left column displays the original image, the middle column shows the heatmap generated by YOLOv11s, and the right column shows the heatmap generated by SAMF-YOLO. In the regular scene, both YOLOv11s and SAMF-YOLO accurately identify the targets, but the SAMF-YOLO heatmap is more concentrated, reflecting greater attention to the target. For occluded scenes, the YOLOv11s heatmap shows some signal confusion and scattered focus, failing to effectively concentrate on the target. In contrast, SAMF-YOLO can accurately localize the target area, maintaining detection accuracy even in complex backgrounds. Finally, in the scene with small occluded targets, YOLOv11s misses the target that has fallen to the right, exhibiting weaker heatmap performance. In comparison, SAMF-YOLO maintains high brightness and focus, better emphasizing the target area in these challenging scenarios. Overall, SAMF-YOLO demonstrates significant advantages in handling subtle defects and small target detection, showcasing its stability and effectiveness across different levels of complexity.

Conclusion

This study presents SAMF-YOLO, a lightweight yet accurate detector designed for industrial defect inspection. Compared with YOLOv11s, SAMF-YOLO achieves higher Precision (76.5%), Recall (74.8%), and mAP@0.5 (75.7%), while reducing GFLOPs by 41.8% and maintaining real-time speed at 77.9 FPS. These improvements stem from the integration of UniRepLKNet, C3K2-BFAM, FASFF-Head, and Focaler-IoU, each of which contributes to robust, efficient detection. Overall, SAMF-YOLO strikes a strong balance between performance and efficiency, making it well suited to real-world industrial applications.

References

  1. Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR, 2014.
  2. Girshick R. Fast R-CNN. In: ICCV, 2015.
  3. Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. In: NeurIPS, 2015.
  4. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: Unified, real-time object detection. In: CVPR, 2016.
  5. Redmon J, Farhadi A. YOLOv3: An incremental improvement. arXiv Preprint. 2018. https://doi.org/10.48550/arXiv.1804.02767
  6. Bochkovskiy A, Wang CY, Liao HYM. YOLOv4: Optimal speed and accuracy of object detection. arXiv Preprint. 2020. https://doi.org/10.48550/arXiv.2004.10934
  7. Jocher G, et al. YOLOv5. GitHub repository. 2020. https://github.com/ultralytics/yolov5
  8. Wu Y, He K. Group normalization. In: ECCV, 2018.
  9. Sun C, Neven D. SE-ResNet: Squeeze and excitation networks. arXiv Preprint. 2018. https://doi.org/10.48550/arXiv.1803.08103
  10. Zhang X, Liu D. Efficient loss function optimization for object detection. In: ICCV, 2020.
  11. Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: ICCV, 2017.
  12. Li Y, Wu X. Efficient object detection using multi-level feature aggregation. In: CVPR, 2019.
  13. Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR, 2014.
  14. Girshick R. Fast R-CNN. In: Proceedings of the IEEE international conference on computer vision. 2015.
  15. Ren S, He K, Girshick R. Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, 2015.
  16. Redmon J, Farhadi A. YOLOv3: An incremental improvement. arXiv Preprint. 2018.
  17. Bochkovskiy A, Wang CY, Liao HYM. YOLOv4: Optimal speed and accuracy of object detection. arXiv Preprint. 2020.
  18. Jocher G, Chaurasia A, Stoken A. Ultralytics/yolov5: v6.2 - YOLOv5 classification models, Apple M1, reproducibility, ClearML and Deci.ai integrations. In: Zenodo, 2022.
  19. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L. MobileNetV2: Inverted residuals and linear bottlenecks. In: CVPR, 2018.
  20. Tan M, Le QV. EfficientNet: Rethinking model scaling for convolutional neural networks. In: ICML, 2019.
  21. Kong L, Li L, Han L. MobileNeXt: Towards a new family of mobile network. arXiv Preprint. 2020. https://arxiv.org/abs/2005.10285
  22. Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, et al. MobileNetV3: A network architecture for mobile vision applications. In: CVPR, 2019.
  23. Lin TY, Dollár P, Girshick R, He K, Hariharan B. Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2017.
  24. Tan M, Pang R, Le QV. EfficientDet: Scalable and efficient object detection. In: CVPR, 2020.
  25. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2018.
  26. Wang X, Zhang H, Zhang J. Non-local neural networks. In: CVPR, 2018.
  27. Zhang Z, Wei Y, Yu Y. IoU-Net: An IoU-based loss function for object detection. In: NeurIPS, 2018.
  28. Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: ICCV, 2017.
  29. Liu D, Zhang H. Scalable object detection with contextual attention. In: ECCV, 2019.
  30. Zhang Z, Wei X. High-precision object detection with adversarial training. In: NeurIPS, 2020.
  31. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. Adv Neural Inf Process Syst. 2014.
  32. Yang L, Zhang RY, Li L, Xie X. SimAM: A simple, parameter-free attention module for convolutional neural networks. In: Proceedings of the 38th international conference on machine learning, 2021. p. 11863–74.
  33. Zheng Z, Wang P, Liu W. Distance-IoU loss: Faster and better learning for bounding box regression. In: Proceedings of the AAAI conference on artificial intelligence, 2020. p. 12993–3000.
  34. He Y, Song K, Meng Q, Yan Y. An end-to-end steel surface defect detection approach via fusing multiple hierarchical features. IEEE Transactions on Instrumentation and Measurement. 2019. p. 1493–504.
  35. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: Unified, real-time object detection. In: 2016 IEEE Conference on computer vision and pattern recognition (CVPR), 2016. p. 779–88. https://doi.org/10.1109/cvpr.2016.91
  36. Li C, Li L, Jiang H. YOLOv6: A single-stage object detection framework for industrial applications. arXiv Preprint. 2022. https://doi.org/10.48550/arXiv.2209.02976
  37. Wang CY, Bochkovskiy A, Liao HYM. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023. p. 7464–75.
  38. Zhao Y, et al. DETRs beat YOLOs on real-time object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024. p. 16965–74.
  39. Wang CY, Yeh IH, Liao HYM. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv Preprint. 2024. https://doi.org/10.48550/arXiv.2402.13616
  40. Wang A, et al. YOLOv10: Real-time end-to-end object detection. arXiv Preprint. 2024.
  41. Wang H, Wang Z, Du M. Score-CAM: Score-weighted visual explanations for convolutional neural networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020. p. 24–5.
  42. Selvaraju RR, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision, 2017. p. 618–26.
  43. Chattopadhay A, Sarkar A, Howlader P, Balasubramanian VN. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In: 2018 IEEE Winter conference on applications of computer vision (WACV), 2018. p. 839–47.