
ACCYolo: Transmission equipment inspection image detection method based on multi-scale and occluded targets

  • Xi Chen,

    Roles Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing

    Affiliation School of Software, Shenyang University of Technology, Shenyang, China

  • Fulong Yao,

    Roles Conceptualization, Data curation, Investigation, Methodology, Software, Validation, Writing – original draft

    Affiliation School of Software, Shenyang University of Technology, Shenyang, China

  • Rongbin Cui,

    Roles Conceptualization, Software, Visualization, Writing – review & editing

    Affiliation School of Software, Shenyang University of Technology, Shenyang, China

  • Shulei Zhang,

    Roles Funding acquisition, Project administration, Resources, Supervision, Writing – review & editing

    Affiliation School of Control Engineering, Northeastern University, Shenyang, China

  • Haixing Li ,

    Roles Supervision, Writing – review & editing

E-mail: lhx@sut.edu.cn (HL); songchunhe@sia.cn (CS)

    Affiliation School of Artificial Intelligence, Shenyang University of Technology, Shenyang, China

  • Chunhe Song ,

    Roles Funding acquisition, Project administration, Supervision, Writing – review & editing

E-mail: lhx@sut.edu.cn (HL); songchunhe@sia.cn (CS)

Affiliations The Institute of AI for Industries, Nanjing, China; Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, China

  • Shimao Yu

    Roles Project administration, Resources, Supervision

    Affiliation Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, China

Abstract

With rising global demand for electricity, transmission infrastructure has become a key support for ensuring a stable and reliable power supply. In recent years, UAVs have been widely used in the inspection and maintenance of transmission equipment because of their efficiency, flexibility and intelligence, greatly improving operation and maintenance efficiency and safety. However, transmission equipment is exposed to harsh natural environments during prolonged use, such as high temperatures, humidity changes, wind and sand erosion, and electromagnetic interference, coupled with complex terrain such as mountainous, hilly and forested areas. As a result, the inspection process is challenged by occlusion and large differences in target scale. To cope with these problems, this paper proposes ACCYolo, a model based on the YOLOv10n architecture, with the goal of improving image detection of transmission equipment under multi-scale and occluded targets in UAV-based scenes. On the one hand, to address the occlusion problem, ACCYolo incorporates the ACmix module, which combines convolution with a self-attention mechanism to achieve dynamic feature extraction, effectively improving detection performance in overlapping scenes. On the other hand, to cope with scale differences in multi-scale detection, the GELAN structure combines a lightweight design with the Programmable Gradient Information (PGI) mechanism, while the ASFF module improves multi-scale detection accuracy through adaptive spatial feature fusion. Experimental results show that the proposed method offers significant advantages in transmission equipment monitoring tasks, raising overall mAP@50 to 0.950, and provides an effective scheme for ensuring power supply reliability.

Introduction

In recent years, with the continuous promotion of the global “carbon neutral” [1] goal and the accelerated construction of new power systems, the power industry worldwide is moving toward cleaner, more efficient and more intelligent operation. As the key link between energy supply and demand, the safety, stability and economy of the transmission network have become core issues of widespread international concern. In the face of large-scale integration of new energy, cross-regional grid interconnection and increasingly complex operation and maintenance scenarios, the demand for accurate sensing and intelligent inspection of transmission equipment continues to grow globally. However, how to fully exploit UAV remote sensing imagery in complex and changing field environments to realize efficient and reliable intelligent inspection remains an important technical bottleneck in the development of the smart grid.

The northeast region of Brazil has a highly diverse climate, consisting mainly of the vast semi-arid Caatinga zone in the interior and the humid coastal zone. The inland area is hot all year round, with scarce precipitation that is extremely uneven in its spatial and temporal distribution; it frequently experiences long-term drought and high-temperature stress and is one of the world’s typical arid and vulnerable zones. The coast is affected by the Atlantic monsoon and receives relatively abundant precipitation, but is subject to short-term heavy rainfall and occasional extreme weather events such as tropical storms. The high complexity and extremity of the regional climate pose a great challenge to the stable operation of transmission equipment and the adaptability of intelligent inspection systems. Therefore, to promote the research and application of intelligent inspection technology in extremely complex environments, it is urgently necessary to rely on highly diverse and authentic datasets for method validation.

At present, the mainstream public power line datasets [2] include Power line dataset (2017), Tomaszewski et al. (2018), Tower dataset (2019), CPLID (2020), TTPLA (2020), STN PLAD (2021) and PLT-AI (2022). These datasets focus on conductors (such as Power line dataset, which only contains pixel-level segmentation of conductors), insulators (such as Tomaszewski et al. and CPLID, which are mainly used for target detection and are mostly single categories), transmission towers (Tower dataset, which only contains transmission tower targets), instance segmentation or target detection of multi-category components (such as TTPLA and STN PLAD), and PLT-AI, which contains a small number of defective components such as the Bird’s Nest. Although the above dataset utilizes UAV remote sensing imagery to provide strong data support for power asset identification and detection research, most of them have limitations such as limited asset categories, small number of annotations and samples, lack of defect types, and single task types, which makes it difficult to fully reflect the complexity and diversity of the real field power transmission environment.

To make up for the shortcomings of existing datasets, this paper uses the InsPLAD dataset collected by drones in real power transmission line scenarios in northeastern Brazil. InsPLAD covers 17 types of power assets, 28,933 component instances, 10,607 high-resolution images, and includes multiple types of real defective assets, which not only fully reflects the actual challenges of variable lighting, complex background, occlusion and natural equipment degradation in natural environment, but also supports various computer vision tasks such as target detection, image classification and anomaly detection. InsPLAD is significantly better than the above mainstream datasets in terms of asset category, sample size, defect type and task diversity, and is a representative and valuable test benchmark for the research and development of intelligent transmission equipment inspection methods and engineering applications.

Power transmission equipment is usually exposed to complex field environments for long periods of time [3], and is susceptible to a variety of factors such as wind and sand, drastic changes in temperature and humidity, mechanical fatigue and environmental corrosion, leading to degradation of its performance and even functional failure, which induces power line failures or wide-scale grid shutdown accidents. In order to reduce the risk of equipment failure, the traditional power system relies on manual inspection for periodic equipment monitoring. However, this method has significant problems such as low efficiency, high labor cost, and high security risk, which makes it difficult to meet the current operation and maintenance needs of “large-scale, high-density, and high-reliability” transmission networks.

At the policy level [4], developed and developing countries show significant stratification in their paths for promoting smart grid policies. Developed countries such as the United States, Japan, and South Korea generally incorporate smart grids into their national energy strategies or legislative systems, issuing special bills and roadmaps to clarify phased construction goals and innovation orientations. These countries attach great importance to building standards systems, information security and interoperability; promote the large-scale application of smart meters, advanced metering infrastructure (AMI), demand response and distributed renewable energy; and promote the deep integration of ICT and power systems through pilot demonstrations and market-oriented mechanisms. In contrast, developing countries, represented by mainland China, pay more attention to leading the upgrading of power grid infrastructure through national development planning, taking the smart grid as an important lever for realizing energy transformation and grid modernization, emphasizing breakthroughs in core technologies, the construction of independent intellectual property systems and whole-industrial-chain synergy, and promoting the intelligent coverage of urban and rural power grids in a phased and coordinated manner. Overall, developed countries focus on innovation-driven, green and low-carbon integration, while developing countries emphasize planning-driven and localized innovation; the two complement each other and jointly promote the system upgrade and energy transformation of the global smart grid.

With the development of artificial intelligence, especially deep learning technology [5], image-based automated transmission equipment inspection has become an important means of realizing intelligent perception. By means of high-resolution image acquisition [6] (e.g., drones, tower camera systems) combined with target detection algorithms, remote sensing, defect identification and structural analysis of transmission equipment status can be realized. However, two prominent difficulties remain in practical applications. First, the types of power transmission equipment are diverse [7] and their structural sizes vary significantly [8]; traditional detection models find it difficult to balance recognition accuracy for small and large targets. Second, interference factors such as occlusion and background complexity often exist in field environments, resulting in weak expression of equipment target features and high false detection and missed detection rates [9], which seriously restricts the model’s generalization ability and engineering practicability.

Aiming at the above problems, this paper proposes a transmission equipment inspection image detection method for multi-scale and occluded targets based on an improved YOLO algorithm, with the goal of building a deep detection framework that balances accuracy, speed and robustness, thereby effectively reducing operation and maintenance costs and improving detection efficiency and engineering feasibility. Specifically: (1) the ACmix module is introduced to fuse the self-attention mechanism with convolution to achieve dynamic regulation of feature extraction, significantly improving the model’s perception ability in scenarios where devices overlap or are occluded; (2) a lightweight GELAN structure is designed in the backbone network, with the PGI mechanism enhancing computational efficiency to improve multi-scale detection; (3) an ASFF module is constructed to realize multi-scale spatial feature fusion and enhance robust detection of equipment targets of different sizes.

Through comparative experiments and ablation analysis on complex power transmission scenario datasets, the proposed method significantly improves detection accuracy under multi-scale and occluded target conditions while keeping the model lightweight, showing good engineering applicability and deployment value. This study not only provides key algorithmic support for new intelligent inspection systems, but also provides a technical path reference for the intelligent transformation of China’s power grid operation and maintenance model. The main contributions of this paper are as follows:

1. First, to address the occlusion problem in the transmission equipment detection task, this paper improves the Neck. The improved Neck module has global perception capability and also captures local features through convolution, thus improving model performance while maintaining a low computational cost.

2. In addition, to cope with the challenge of detecting targets of different sizes, this paper improves the Backbone and Head modules. The improved Backbone module designs a lightweight GELAN structure in the backbone network and enhances computational efficiency through the PGI mechanism to improve multi-scale detection. The improved Head module introduces an adaptive spatial feature fusion approach, which enhances scale invariance.

This paper is organized as follows. First, we introduce relevant prior work. Next, we present the proposed algorithm in detail. Then, we present experimental results and analyze them. Finally, we draw conclusions.

Related work

In power systems, especially in the inspection of high-voltage transmission lines, drone inspections are widely used because of their safety and efficiency. However, due to the characteristics of the transmission equipment itself and its complex operating environment, the inspection process faces challenges such as occlusion and large size differences. To address these issues, transmission equipment inspection image detection based on multi-scale and occluded targets is proposed. This effectively reduces the burden of manual inspection, improves detection accuracy, and helps staff complete inspection tasks more easily and efficiently. To facilitate our research, we introduce Table 1 for analyzing related work; each subsection below refers to this table.

Target detection model based on traditional methods

In the early stages of computer vision, target detection mainly relied on the combination of manual feature extraction and traditional classifiers, which were widely applied across a variety of scenes. Early studies mostly used traditional methods based on the combination of Haar features and cascade classifiers. Specifically, Haar features scan the image block by block through a sliding window and are combined with a pre-trained Haar cascade classifier to accomplish classification and detection. Zhao et al. [10] proposed retaining insulators based on orientation angle detection and prior knowledge of the insulator’s binary shape, and localizing them with a minimum outer rectangular frame; by traversing all possible orientation angles, multiple insulators with different orientation angles can be localized. Jaya Bharata Reddy et al. [11] proposed using the Discrete Orthogonal S-transform to extract image features, combined with a Support Vector Machine (SVM) and an Adaptive Neuro-Fuzzy Inference System (ANFIS) to estimate the insulator condition and identify the insulator state. Zuo et al. [12] proposed a classifier that recognizes and locates insulators, obtained through feature extraction and training with Haar features, integral maps, a cascade classifier, and histograms of oriented gradients; the insulators are then segmented by a series of digital image processing methods, and finally the segmented insulator pixels are statistically analyzed to determine whether insulators are missing. To improve model robustness in such environments, Schwegmann et al. [13] proposed a novel vessel detection method, which first performs an initial screening via a constant false alarm rate (CFAR) pre-screening step and then uses a cascade classifier based on Haar-like features for vessel identification.

Although these methods provided a solid foundation for target detection, they require manual design of feature extractors, offer limited feature expressiveness and recognition accuracy, and struggle to cope with complex scenarios.

Deep learning based target detection method

With the development of deep learning, specific network structures for target detection have emerged, which are divided into two main categories: single-stage detectors and two-stage detectors.

The single-stage detector directly predicts the targets in the image and accomplishes the detection of target location and category. Liu et al. [14] proposed Single Shot MultiBox Detector (SSD), a single-step anchor-box based target detection algorithm, which is able to predict a set of default bounding boxes from feature maps at different scales, handle targets of different sizes, and predict both the category confidence and adjustment within each bounding box. Joseph Redmon [15] proposed You Only Look Once (YOLO), which treats target detection as a regression problem and directly predicts the bounding boxes and their category probabilities in an image, providing a basis for the optimization of the subsequent YOLO series of algorithms.

The two-stage detector divides target detection into two steps: candidate region generation and classification. Ross Girshick et al. [16] proposed the RCNN, which combines region proposals with deep convolutional neural network (CNN) features. The method generates candidate regions through unsupervised learning, extracts features using pre-trained CNNs, and ultimately classifies the target with an SVM classifier. Subsequently, Ross Girshick et al. [17] proposed Fast R-CNN based on RCNN, where candidate regions are mapped to a fixed-size feature map through ROI pooling layer, and the image needs only one forward propagation to extract features, and then a fully connected layer is used to complete the classification and regression tasks, which significantly improves the speed and efficiency of the RCNN. Ren et al. [18] further proposed the Regional Proposal Network (RPN) to share the convolutional features of the image with Fast R-CNN, which reduces the computational cost.

These network structures have achieved significant results in target detection: by training deep learning models, targets can be detected with improved accuracy and efficiency. However, these general methods have limitations across different scenes and object types, and still require substantial adjustment and optimization for specific scenarios.

Improved deep learning based target detection

The general target detection method has certain limitations, but through the improved method based on the basic algorithm, we can see its feasibility. Haiwei et al. [19] proposed an improved YOLOv8 classroom behavior detection model, which improves the effectiveness and accuracy of YOLOv8 in classroom behavior detection by combining Res2Net and YOLOv8 network modules, proposing the C2f_Res2block module, and introducing MHSA and EMA mechanisms. In addition, to further enhance the feature extraction effect of the model, Wenqi et al. [20] proposed the CSB-YOLO model, which enhances multi-scale feature fusion through bidirectional feature pyramid network (BiFPN), designs an efficient reparameterized detection head (ERD Head) to improve the inference speed, and introduces self-calibrating convolution (SCConv) to compensate for the loss of accuracy in the lightweight design.

Li et al. [21] utilized a feature pyramid network to fuse the features of the upsampling layer and the scale-invariant convolutional layer, and retained the multi-scale feature layers extracted by the traditional SSD structure, improving detection accuracy. Ding et al. [22] replaced IoU with EIoU for the box loss and, based on YOLOv5, introduced the Assumption-free K-MC2 (AFK-MC2) algorithm and cluster non-maximum suppression (Cluster-NMS) to reduce missed detections caused by occlusion. Henghuai et al. [23] proposed an improved YOLOv4 behavior detection algorithm that combined a cross-stage local network and an embedded connection component to improve the recognition of student and teacher behaviors; to address the difficulty of recognizing student behavior under occlusion, a Repulsion loss function is added to reduce false and missed detections through the RepGT and RepBox losses. Hsu et al. [24] proposed the RSA-YOLO (Ratio and Scale Aware YOLO) method, which aims to solve the poor detection performance in pedestrian detection caused by small target scales and large differences in the aspect ratio of the input image. The method dynamically adjusts the input layer hyperparameters of YOLOv3 by introducing a ratio-aware mechanism and adopts intelligent image segmentation to improve detection accuracy. Meanwhile, a scale-aware mechanism with multi-resolution fusion is proposed to effectively improve the detection of small-target pedestrians.

Tang et al. [25] proposed a YOLOv7-based target detection method, called TOD-YOLOv7, to address the challenges of low-resolution images, dense occlusions, and different poses in the TinyPerson dataset. By adding a small target detection layer to the YOLOv7 network, introducing a recursive gated convolution module and a coordinate attention mechanism, the model’s detection capability is enhanced and inference time is reduced. In addition, data enhancement techniques are combined to improve the algorithm’s representation learning capability. Wu et al. [26] proposed a Faster R-CNN-based Different Scale Face Detector (DSFD), which aims to solve the challenge of small-scale face detection. The method first obtains face ROIs through a multi-task regional proposal network (RPN) combined with augmented face detection, and assigns the proposals to three corresponding Fast R-CNN networks based on different proposal scales. Wang et al. [27] proposed a UAV target detection model based on YOLOv8 optimization, called UAV-YOLOv8, aiming to solve the problem of low accuracy and resource constraints in detecting small targets in UAV images. The model introduces Wise-IoU (WIoU) v3 as a regression loss function to improve the localization ability and optimizes the backbone network through BiFormer attention mechanism. A Focal FasterNet module is also designed for multi-scale feature fusion to improve the detection performance of small targets.

Tianyong et al. [28] proposed an improved YOLOv8 network called YOLO-SE, which aims to address the challenges of multi-scale target detection and small target detection in remote sensing images. By introducing lightweight SEConv convolution and SEF modules, the network parameters are reduced and the detection speed is accelerated. Also, an efficient multi-scale attention (EMA) mechanism is integrated to enhance the feature extraction capability. The network also contains specialized tiny target detection heads and uses Transformer [29] prediction heads instead of the original detection heads. In addition, the Wise-IoU loss function is introduced to cope with the gradient problem for low-quality instances. Wang et al. [30] proposed an adaptive enhancement fusion framework based on YOLO (YOLO_AEF). This framework employs a multiple exposure enhancement module (MEEM) to improve image quality under complex lighting conditions and utilizes an adaptive feature fusion module (AFFM) to fuse raw and enhanced image features, thereby enhancing robustness and contextual expression. Concurrently, it incorporates a fusion detection module (FDM) to achieve robust occlusion detection.

Although generalized target detection methods have shown good results, challenges remain in detecting transmission line components. For example, small-sized targets such as the Spacer class and overlapping targets can degrade the effectiveness of deep learning models, yet it is difficult to obtain a sufficiently large amount of training data because transmission line data acquisition is hard. Therefore, this paper focuses on solving the problems of overlapping targets and large size differences in transmission line component detection. By improving the training method of the deep learning model, the detection accuracy and robustness for transmission line components can be improved.

Method

The main architecture of the model is shown in Fig 1. The backbone network of the architecture is YOLOv10. On this basis, the Backbone module replaces the original C2f with C2f-ELAN4. For the Neck architecture, the ACmix attention mechanism is added at the output positions of the large, medium, and small target detection layers. In the Head, the original v10Detect detection head is replaced with the ASFFDetect detection head. In the rest of this section, the ACmix attention mechanism, the ASFF detection head, and the GELAN module are each described in detail.

Improved neck module

In target detection tasks, occlusion leads to the loss of local information about the target, which poses a challenge to the recognition performance of the model. The traditional Neck module, even with a self-attention mechanism, fails to effectively capture the contextual information of occluded objects due to the mechanism’s high computational complexity, leading to unsatisfactory performance in occlusion situations. To counteract this problem, we improve the Neck module. Specifically, the ACmix module [31] is added after the output channels of C2f and C2fCIB. With ACmix’s decomposition and reconstruction strategy, we can extract the contextual information of occluded targets more efficiently, enhancing detection accuracy.

This improvement leads to a significant optimization of the model’s performance under occlusion. ACmix combines the advantages of convolution and self-attention. Analysis shows that both techniques operate similarly in their first stage, relying mainly on 1×1 convolutions for feature projection, with computational complexity quadratic in the number of channels. Based on this finding, ACmix shares the projection operations between the convolution and self-attention paths, reducing computational overhead. While keeping the computational cost low, ACmix achieves significant performance gains on tasks such as image classification and target detection. Fig 2(a) and 2(b) represent the standard convolution and self-attention mechanisms, respectively, and the workflow of the ACmix model is shown in Fig 2(c).

ACmix forms a hybrid module by combining convolution and self-attention mechanisms. In the first stage, the input feature maps are projected through three convolutions to generate multiple intermediate feature maps. In the second stage, these intermediate feature maps follow two different aggregation methods: convolutional paths and self-attention paths.

For the self-attentive path, the generated intermediate feature maps are divided into groups and each group contains three feature maps which are used as query, key and value. Through the self-attention mechanism, the similarity between the query and the key is computed and the values are weighted and summed using the attention weights. This process can be represented by the following equation:

$F_{ij}^{\mathrm{att}} = \sum_{a,b \in \mathcal{N}_k(i,j)} \operatorname{softmax}\big(q_{ij}^{\top} k_{ab}\big)\, v_{ab}$  (1)

where $q_{ij}$ and $k_{ab}$ are the feature representations of the query and key, respectively, the softmax terms denote the attention weights, $v_{ab}$ are the values, and $\mathcal{N}_k(i,j)$ is the local $k \times k$ window centered at position $(i,j)$.

For the convolutional path, a lightweight fully connected layer is used to generate $k^2$ feature maps, which are shifted and summed according to:

$F_{ij}^{\mathrm{conv}} = \sum_{p,q \in \{0,\dots,k-1\}} \operatorname{Shift}\big(\tilde{f}_{ij}^{(p,q)},\, p - \lfloor k/2 \rfloor,\, q - \lfloor k/2 \rfloor\big)$  (2)

where $\tilde{f}^{(p,q)}$ denotes the feature map obtained after applying the convolution kernel weight at position $(p,q)$, and $\operatorname{Shift}(\cdot)$ is a shift operation.

Eventually, the outputs of the convolutional and self-attention paths are weighted and summed to form the final output:

$F_{\mathrm{out}} = \alpha\, F^{\mathrm{att}} + \beta\, F^{\mathrm{conv}}$  (3)

where α and β are learnable scalars that control the output strengths of the self-attention path and the convolutional path, respectively.
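To make the two-path computation concrete, the following is a minimal numpy sketch of the ACmix-style fusion in Eqs. (1)–(3). It is not the authors’ implementation: the local attention window is simplified to global attention, the shift-and-sum convolution path is reduced to a three-tap roll along one axis, and all function and parameter names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def acmix_sketch(x, w_q, w_k, w_v, alpha=0.5, beta=0.5):
    """Toy ACmix fusion: both paths reuse the same stage-1 projections.

    x: (N, C) flattened spatial positions; w_q / w_k / w_v: (C, C) 1x1
    projections; alpha / beta: the learnable path weights of Eq. (3).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # shared 1x1 projections
    # Self-attention path (Eq. 1, global window for simplicity)
    att = softmax(q @ k.T / np.sqrt(x.shape[1])) @ v
    # Convolution path stand-in (Eq. 2): the real module shifts k^2
    # feature maps; here we aggregate three shifted copies of v
    conv = np.stack([np.roll(v, s, axis=0) for s in (-1, 0, 1)]).mean(0)
    return alpha * att + beta * conv               # Eq. (3)
```

Setting `alpha=1.0, beta=0.0` recovers a pure attention path, which is one way to sanity-check that the two branches are combined as in Eq. (3).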


Improved backbone and head modules

The improved Backbone module adds C2f-ELAN4 (GELAN) [32] to the output channel of CBS. GELAN incorporates the CSPNet and ELAN mechanisms and utilizes RepConv to obtain more effective features while specializing into a single-branch structure at inference time, thus increasing detection accuracy at multiple scales. The GELAN module improves feature extraction by combining multiple convolutional operations, as shown in Fig 3. First, the input feature map undergoes an initial convolution operation to generate an output feature map, expressed as:

$Y = \operatorname{Conv}(X), \quad Y \in \mathbb{R}^{H \times W \times C}$  (4)

where $X$ is the input feature map, H and W are the height and width of the feature map, respectively, and C is the number of channels. Next, the generated feature map Y is split evenly into two parts $Y_1$ and $Y_2$ with $C/2$ channels each, which can be expressed as:

$Y_1, Y_2 = \operatorname{Split}(Y), \quad Y_1, Y_2 \in \mathbb{R}^{H \times W \times C/2}$  (5)

The split feature map $Y_2$ is subsequently processed through the RepNCSP module, whose core combines $3 \times 3$ and $1 \times 1$ convolution operations:

$Z_2 = \operatorname{Conv}\big(\operatorname{RepNCSP}(Y_2)\big), \quad Z_3 = \operatorname{Conv}\big(\operatorname{RepNCSP}(Z_2)\big)$  (6)

This design allows the model to process features from different receptive fields simultaneously, enhancing its ability to characterize complex scenes.

Next, the two RepNCSP-processed features $Z_2$ and $Z_3$ are re-fused with $Y_1$ by a concatenation operation to form a complete feature map Z, denoted as:

$Z = \operatorname{Concat}(Y_1, Z_2, Z_3)$  (7)

Finally, the fused feature map Z is again subjected to a convolution operation to generate the final output $Z_{\mathrm{out}}$:

$Z_{\mathrm{out}} = \operatorname{Conv}(Z)$  (8)

This design improves the expressive power of the model in feature extraction by splitting the channels, processing features with different convolution kernels and applying a final fusion convolution, which is particularly suitable for tasks such as target detection where multi-scale features must be processed.
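The split–process–concatenate flow of Eqs. (4)–(8) can be sketched as follows. This is a toy numpy version in which simple pointwise (1×1) channel-mixing stands in for the real convolutions and RepNCSP blocks; all shapes, weights and helper names are illustrative, not the released code.

```python
import numpy as np

def conv1x1(x, w):
    """Pointwise convolution as channel mixing: x (H, W, Cin) @ w (Cin, Cout)."""
    return x @ w

def gelan_block_sketch(x, w_in, w_rep1, w_rep2, w_out):
    """Toy C2f-ELAN4 flow: project, split channels, two cascaded
    RepNCSP stand-ins, concatenate all branches, fuse (Eqs. 4-8)."""
    y = conv1x1(x, w_in)                          # Eq. (4): initial conv
    c = y.shape[-1] // 2
    y1, y2 = y[..., :c], y[..., c:]               # Eq. (5): channel split
    z2 = np.maximum(conv1x1(y2, w_rep1), 0)       # Eq. (6): RepNCSP stand-in
    z3 = np.maximum(conv1x1(z2, w_rep2), 0)       #          cascaded branch
    z = np.concatenate([y1, z2, z3], axis=-1)     # Eq. (7): fuse branches
    return conv1x1(z, w_out)                      # Eq. (8): final fusion conv
```

The key point the sketch preserves is that the untouched half `y1` flows directly to the concatenation, so gradients reach early layers through a short path, while `z2` and `z3` carry the deeper transformed features.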

The improved Head module adds ASFF [33] to the output channels of ACmix. ASFF (Adaptively Spatial Feature Fusion) introduces an adaptive spatial feature fusion method that effectively filters out conflicting information, enhancing scale invariance, and improves multi-scale target detection by fusing feature maps from different scales. Fig 4 illustrates the feature maps from different levels of the feature pyramid (Level 1, Level 2, Level 3), which have different resolutions and strides. The ASFF module adjusts the weighting of the features at each scale using adaptive weights $\alpha$, $\beta$ and $\gamma$, which are combined with the feature maps resized to the target level ($x^{1\to3}$, $x^{2\to3}$, $x^{3\to3}$) by element-wise multiplication, ensuring that the contribution of each level is effectively integrated. Low-level high-resolution features are especially crucial for multi-scale detection, and the ASFF module generates a unified feature for subsequent target prediction by adaptively weighting the features at different scales and finally summing them.
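As a sketch of the adaptive fusion idea (not the authors’ implementation), the per-position weights can be softmax-normalized so that $\alpha + \beta + \gamma = 1$ at every spatial location before the element-wise weighted sum. The helper below assumes the three level features have already been resized to a common shape; the names are illustrative.

```python
import numpy as np

def asff_fuse(feats, logits):
    """Toy ASFF at one level.

    feats: list of 3 feature maps already resized to (H, W, C);
    logits: (H, W, 3) raw per-position fusion scores for the 3 levels.
    """
    # Softmax over the level axis: weights sum to 1 at every position
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)              # alpha + beta + gamma = 1
    # Element-wise weighted sum of the three resized level features
    return sum(w[..., i:i + 1] * f for i, f in enumerate(feats))
```

Because the weights are learned per position, the fusion can favor high-resolution features where small targets appear and coarser features elsewhere, which is the spatial adaptivity the module is named for.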

thumbnail
Fig 4. Schematic diagram of the GELAN architecture.

https://doi.org/10.1371/journal.pone.0335186.g004

Experiments

The training parameters of the experiment, based on the YOLOv10 model for the transmission equipment detection task, are shown in Table 2; the experiment explores the performance of the model in complex scenarios. First, the configuration files of the model and data were loaded and the key training parameters were set, including the number of training epochs, batch size, learning rate and momentum. In the data preprocessing stage, data augmentation techniques such as Mosaic, random flipping and color transformation (including random adjustments of hue, saturation and luminance) are used to improve the generalization ability of the model and reduce possible overfitting. In addition, pre-trained weights are used for model initialization to accelerate convergence. During training, SGD is used as the optimizer, with a warm-up phase and a dynamic learning-rate adjustment strategy to improve the convergence speed of the model. After each training epoch, metrics including precision, recall, mAP@50 and mAP@50-95 are computed on the validation set to monitor the performance of the model on the transmission equipment detection task in real time. To avoid overfitting, an early-stopping mechanism is set up: training is automatically terminated when no significant performance improvement occurs within 100 epochs. The experiment generates graphs of the loss and performance metrics during training and validation to provide reliable data support for model optimization and performance improvement.
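The warm-up schedule and early-stopping rule described above can be sketched as follows. This is a minimal illustration with hypothetical hyperparameter values (base learning rate, warm-up length, decay target), not the paper's exact training script:

```python
def warmup_lr(epoch, base_lr=0.01, warmup_epochs=3, total_epochs=150, final_lr=0.0001):
    # Linear warm-up for the first few epochs, then a linear decay toward
    # final_lr (one common YOLO-style dynamic learning-rate strategy).
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return base_lr + t * (final_lr - base_lr)

def should_stop(map_history, patience=100):
    # Early stopping: terminate when the best validation mAP@50 has not
    # improved within the last `patience` epochs.
    if len(map_history) <= patience:
        return False
    best_epoch = max(range(len(map_history)), key=map_history.__getitem__)
    return len(map_history) - 1 - best_epoch >= patience

print(round(warmup_lr(0), 5))    # 0.00333  (first warm-up step)
print(should_stop([0.5] * 50))   # False    (fewer epochs than patience)
```

After the warm-up the learning rate has reached `base_lr`, and the monitor only needs the per-epoch mAP@50 history to decide when to terminate.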

Experimental setup and data description

The code for the paper was implemented on a server equipped with 2 RTX 4090 GPUs. As shown in Table 3, compared with earlier datasets such as the Power line dataset (2017) and Tomaszewski et al. (2018), which contain only a single asset class and lack defective samples, the InsPLAD dataset has a significant lead in the number of asset classes (17), the number of annotations (28,933) and the number of images (10,607), covers 5 classes of defective assets, and supports a variety of visual tasks such as Object Detection (OD), Image Classification (IC) and Anomaly Detection (AD). Therefore, this paper adopts InsPLAD, a dataset and benchmark for power line asset inspection, with specific parameters shown in Table 4.

thumbnail
Table 4. Basic parameters of the InsPLAD dataset.

https://doi.org/10.1371/journal.pone.0335186.t004

Ablation experiment

In Table 5, we show the results of the ablation experiments, systematically investigating the independent contribution of each module to the model performance in order to assess its importance in complex detection tasks. In this set of experiments, experiments 0 through 7 were tested on raw data, with the modules labeled A (Acmix), B (ASFF) and C (GELAN); in the figures, panels a0, a1 and a2 correspond to multi-scale target detection and panels a3, a4 and a5 to occluded target detection.

thumbnail
Table 5. Ablation study results of YOLOv10 with different modules.

https://doi.org/10.1371/journal.pone.0335186.t005

First, in the base model without any improvement module (experiment a), the overall mAP@50 is 0.838. Lacking dedicated module support, the base model has obvious deficiencies in coping with target occlusion and scale differences. For example, the low detection scores for Stockbridge Damper and Yoke Suspension indicate that the base model is limited in these complex detection tasks, as shown by a(0) in Fig 5: the prediction confidence for Stockbridge Damper reaches only 0.7 and that for Yoke Suspension only 0.45.

The experiment with the Acmix module added (experiment d) shows the detection effect in b(5) of Fig 5, where the Polymer Insulator Tower Shackle detection confidence is improved from 0.3 to 0.91 compared with the original a(5). Likewise, b(4) in Fig 5 shows that, compared with the original a(4), the Polymer Insulator Tower Shackle detection confidence is improved from 0.86 to 0.89, effectively addressing the target occlusion problem.

thumbnail
Fig 5. Detection results of YOLOv10 with progressively enhanced modules.

https://doi.org/10.1371/journal.pone.0335186.g005

In the experiment with the GELAN and ASFF modules added (experiment e), the detection effect is shown as b(1) in Fig 5: compared with the original a(1), the Yoke Suspension detection confidence is improved from 0.84 to 0.91, and Stockbridge Damper from 0.41 to 0.89. This improves detection efficiency in multi-scale complex backgrounds and effectively enhances multi-scale target detection.

Finally, the complete model containing all enhancement modules (experiment h) achieves the best performance on all categories, overall mAP@50 raised to 0.950. The synergy of the modules with the introduction of data enhancement strategies not only improves the detection performance of the model in multi-scale, complex background and occluded target environments, but also significantly enhances the model’s generalization ability.

Comparison experiment

To verify the effectiveness of the proposed method, we compare it with several classical target detection methods: yolo11 [34], rtdetr-l [35], yolov8 [36], rtdetr-x [37], yolov9 [38], rtdetr-resnet50 [39], yolov10 [40] and rtdetr-resnet101 [41], as shown in Table 6. Our improved model ACC_yolov10_Enhance (ours) addresses the two key challenges of the transmission equipment inspection task (overlapping targets and large size differences) and significantly outperforms the other models in mAP@50. From the tabular data we can clearly see how ACC_yolov10_Enhance (ours) compares with the other models on the different categories, further validating its advantages in complex detection tasks.

thumbnail
Table 6. Performance comparison of object detection models for transmission equipment inspection (mAP@50).

https://doi.org/10.1371/journal.pone.0335186.t006

For the “target occlusion” problem, ACC_yolov10_Enhance (ours) utilizes the Acmix module to combine the self-attention mechanism with convolutional feature extraction, improving the model’s discriminative ability in complex backgrounds. In the “lightning rod shackle” category, the mAP@50 of ACC_yolov10_Enhance (ours) reaches 0.924, significantly higher than that of yolo11 (0.794) and yolov8 (0.82). In the “Polymer Insulator Tower Shackle” category, d(3) at 0.75 in Fig 6 is improved to c(3) at 0.8. The results show that the improved model has stronger detection accuracy in scenes with severe target occlusion.
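The core ACmix idea of feeding one shared projection into both a convolutional path and a self-attention path, then mixing the two outputs with learned scalars, can be caricatured in NumPy. This is a deliberately simplified toy under our own assumptions (1D tokens, a shift-based "convolution", fixed mixing scalars), not the module from [31]:

```python
import numpy as np

def acmix_toy(x, alpha=0.5, beta=0.5, seed=0):
    # x: (N, C) token features. One shared projection feeds both paths.
    rng = np.random.default_rng(seed)
    c = x.shape[1]
    w = rng.standard_normal((c, c)) * 0.1
    h = x @ w                                 # shared 1x1-style projection
    # Convolution path: local aggregation of neighboring tokens (shift-based).
    conv_out = np.roll(h, 1, axis=0) + h + np.roll(h, -1, axis=0)
    # Self-attention path: global aggregation via softmax-normalized scores.
    att = h @ h.T / np.sqrt(c)
    att = np.exp(att - att.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)
    att_out = att @ h
    # Learned scalars alpha, beta mix the local and global views.
    return alpha * conv_out + beta * att_out

x = np.random.default_rng(1).standard_normal((6, 8))
print(acmix_toy(x).shape)   # (6, 8)
```

The mixing scalars are what let the module lean on the convolutional view for local texture and on the attention view for long-range context, which is the behavior that helps with occluded targets.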

thumbnail
Fig 6. Comparative visualization of detection performance across different models (Ours vs. YOLOv8/RT-DETR).

https://doi.org/10.1371/journal.pone.0335186.g006

To address the problem of “large size differences”, ACC_yolov10_Enhance (ours) enhances the adaptability to multi-scale targets through the GELAN and ASFF modules. For example, in the “polymer insulator lower shackle” category, which has large size differences, other models such as yolo11 and yolov9 also perform well (0.907 and 0.928, respectively), but ACC_yolov10_Enhance (ours) is more robust in the overall test and, in particular, more balanced across the different categories. In addition, in the “yoke” category, ACC_yolov10_Enhance (ours) reaches 0.973, while yolov10 only reaches 0.953. In the “Yoke Suspension” and “Polymer Insulator Tower Shackle” categories, d(5) at 0.47 and 0.81 in Fig 6 improves to c(5) at 0.95 and 0.88. This shows the excellent performance of the improved model on targets with large size differences.

Combining the results of “all classes”, the overall mAP@50 of ACC_yolov10_Enhance (ours) reaches 0.95, which is significantly higher than that of yolov9, which is the best performer among other models (0.901). In the overall comparison of different categories, ACC_yolov10_Enhance (ours) shows high stability and accuracy under all kinds of challenges, providing higher reliability and accuracy for the automated inspection of transmission equipment.

Figs 7, 8, and 9 demonstrate the trends of the key loss terms and evaluation metrics for the ACC_yolov10_Enhance model over 150 training epochs. Specifically, the bounding box loss (box loss), classification loss (cls loss), and distribution focal loss (DFL loss) on the training and validation sets decrease rapidly in the initial stage and then level off, indicating that the model converges gradually. In addition, performance metrics such as precision, recall, and average precision (mAP@50 and mAP@50-95) increase rapidly in the early stage and stabilize at a high level later on, reflecting the gradual improvement of the model’s performance on the detection task and its good generalization ability. Overall, the figures show that the model achieves good convergence during training and stable performance on the validation set.
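Since mAP@50 counts a prediction as correct when its Intersection-over-Union with a ground-truth box reaches 0.5, the underlying matching criterion can be sketched directly (an illustration of the standard definition, with boxes assumed to be in [x1, y1, x2, y2] format):

```python
def iou(a, b):
    # Intersection-over-Union of two boxes in [x1, y1, x2, y2] format.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, thr=0.5):
    # The "@50" in mAP@50: a detection matches a ground-truth box
    # when their IoU is at least 0.5.
    return iou(pred, gt) >= thr

print(round(iou([0, 0, 2, 2], [1, 1, 3, 3]), 4))     # 0.1429
print(is_true_positive([0, 0, 2, 2], [0, 1, 2, 3]))  # False (IoU = 1/3)
```

mAP@50 then averages, over all classes, the area under each class's precision–recall curve computed with this matching rule; mAP@50-95 repeats the computation at IoU thresholds from 0.5 to 0.95 and averages the results.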

thumbnail
Fig 7. Convergence trends of training loss components (Box/Cls/DFL).

https://doi.org/10.1371/journal.pone.0335186.g007

thumbnail
Fig 8. Evolution of model performance metrics during training (Precision/Recall/mAP).

https://doi.org/10.1371/journal.pone.0335186.g008

thumbnail
Fig 9. Validation loss analysis across training epochs (Box/Cls/DFL).

https://doi.org/10.1371/journal.pone.0335186.g009

Conclusion

In this study, the ACCYolo model is designed as an innovative solution to the key challenges of target detection in complex backgrounds, such as target occlusion, size differences and background complexity. By introducing the Acmix model, which combines the self-attention mechanism with convolution, the detection ability of the model for occluded and overlapping targets is significantly improved. Meanwhile, the combination of the GELAN structure and the Programmable Gradient Information (PGI) mechanism makes the model more efficient in dealing with multi-scale targets and improves detection speed while maintaining accuracy. In addition, the ASFF module further improves the accuracy of multi-scale target detection through adaptive spatial feature fusion and shows strong robustness, especially in complex backgrounds.

To verify the effectiveness of the proposed method, we conducted ablation and comparison experiments. The ablation experiments show that the introduction of the Acmix module effectively solves the target occlusion problem and significantly improves detection accuracy; for example, the detection accuracy for the Polymer Insulator Tower Shackle category is improved from 0.3 to 0.91. The addition of the GELAN and ASFF modules also brings significant improvements, especially in multi-scale target detection, where the detection accuracies for the Stockbridge Damper and Yoke Suspension categories improve from 0.41 and 0.84 to 0.89 and 0.91, respectively.

When compared with classical target detection methods (e.g., YOLOv10, YOLOv8, RT-DETR), the present method demonstrates significant advantages. In the “Lightning rod shackle” category, the mAP@50 of ACC_yolov10_Enhance (the present method) reaches 0.924, much higher than 0.794 for yolo11 and 0.82 for yolov8. Meanwhile, in the “Polymer Insulator Tower Shackle” category, the detection accuracy is improved from 0.75 to 0.8, showing the strong detection ability of this method in complex backgrounds.

Combining the detection results across all categories, ACC_yolov10_Enhance achieves an overall mAP@50 of 0.95, significantly higher than the best of the other comparison models (0.901 for YOLOv9). This indicates that the method proposed in this study not only has obvious advantages in detection accuracy but, more importantly, effectively reduces the cost of transmission line inspection and greatly improves work efficiency by replacing traditional manual inspection with automation. Before the improvement, the model’s frame rate was 7.58 FPS (frames per second); after the improvement, it increased to 15.13 FPS, demonstrating a significant enhancement in real-time processing capability. Compared with traditional manual inspection, the automated inspection system is able to monitor equipment 24 hours a day, discover potential problems in real time, prevent faults from escalating, and reduce the investment of manpower and resources, thus improving inspection efficiency while reducing economic losses.

The method in this study is not limited to power systems; it has strong versatility and can be widely extended to other industrial inspection fields, such as health monitoring of transportation facilities, bridge structures and other critical infrastructure. These fields are generally characterized by technical difficulties such as target occlusion and large differences in the scale of the detected objects. In the future, this technology will provide solid technical support for the stable operation of the smart grid and promote the development of power systems toward intelligence, providing an effective guarantee for the reliability of the global power supply.

References

  1. Jiang T, Yu Y, Jahanger A, Balsalobre-Lorente D. Structural emissions reduction of China’s power and heating industry under the goal of “double carbon”: a perspective from input-output analysis. Sustainable Production and Consumption. 2022;31:346–56.
  2. Silva ALV, Simões F, Kowerko D, Schlosser T, Battisti F, Teichrieb V. Attention modules improve image-level anomaly detection for industrial inspection: a DifferNet case study. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Waikoloa, HI, USA. 2024. p. 8246–55.
  3. Mohd Zainuddin N, Abd. Rahman MS, Ab. Kadir MZA, Nik Ali NH, Ali Z, Osman M, et al. Review of thermal stress and condition monitoring technologies for overhead transmission lines: issues and challenges. IEEE Access. 2020;8:120053–81.
  4. Dileep G. A survey on smart grid technologies and applications. Renewable Energy. 2020;146:2589–625.
  5. Gupta R, Srivastava D, Sahu M, Tiwari S, Ambasta RK, Kumar P. Artificial intelligence to deep learning: machine intelligence approach for drug discovery. Mol Divers. 2021;25(3):1315–60. pmid:33844136
  6. Ballesteros R, Ortega JF, Hernández D, Moreno MA. Applications of georeferenced high-resolution images obtained with unmanned aerial vehicles. Part I: description of image acquisition and processing. Precision Agric. 2014;15(6):579–92.
  7. Arcia-Garibaldi G, Cruz-Romero P, Gómez-Expósito A. Future power transmission: visions, technologies and challenges. Renewable and Sustainable Energy Reviews. 2018;94:285–301.
  8. Kim P, Han WS, Kim H, Kim JH, Kang YJ, Kim S. Simplified design of power transmission tower: strategic variable analysis study. Structures. 2025;71:108084.
  9. Liu Z, Wu G, He W, Fan F, Ye X. Key target and defect detection of high-voltage power transmission lines with deep learning. International Journal of Electrical Power & Energy Systems. 2022;142:108277.
  10. Zhao Z, Liu N, Wang L. Localization of multiple insulators by orientation angle detection and binary shape prior knowledge. IEEE Trans Dielect Electr Insul. 2015;22(6):3421–8.
  11. Reddy MJB, Chandra BK, Mohanta DK. Condition monitoring of 11 kV distribution system insulators incorporating complex imagery using combined DOST-SVM approach. IEEE Trans Dielect Electr Insul. 2013;20(2):664–74.
  12. Zuo D, Hu H, Qian R, Liu Z. An insulator defect detection algorithm based on computer vision. In: 2017 IEEE International Conference on Information and Automation (ICIA). 2017. p. 361–5. https://doi.org/10.1109/icinfa.2017.8078934
  13. Schwegmann CP, Kleynhans W, Salmon BP. Synthetic aperture radar ship detection using haar-like features. IEEE Geosci Remote Sensing Lett. 2017;14(2):154–8.
  14. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, et al. SSD: single shot multibox detector. In: Computer Vision–ECCV 2016: 14th European Conference. Amsterdam, The Netherlands: Springer International Publishing; 2016. p. 21–37.
  15. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. p. 779–88. https://doi.org/10.1109/cvpr.2016.91
  16. Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA. 2014. p. 580–7.
  17. Girshick R. Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. Santiago, Chile. 2015. p. 1440–8.
  18. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems. 2015;28:91–9.
  19. Chen H, Zhou G, Jiang H. Student behavior detection in the classroom based on improved YOLOv8. Sensors (Basel). 2023;23(20):8385. pmid:37896479
  20. Zhu W, Yang Z. CSB-YOLO: a rapid and efficient real-time algorithm for classroom student behavior detection. J Real-Time Image Proc. 2024;21(4):1–17.
  21. Li H, Lin K, Bai J, Li A, Yu J. Small object detection algorithm based on feature pyramid-enhanced fusion SSD. Complexity. 2019;2019(1):3416307.
  22. Ding J, Cao H, Ding X, An C. High accuracy real-time insulator string defect detection method based on improved YOLOv5. Front Energy Res. 2022;10.
  23. Chen H, Guan J. Teacher–student behavior recognition in classroom teaching based on improved YOLO-v4 and Internet of Things technology. Electronics. 2022;11(23):3998.
  24. Hsu W-Y, Lin W-Y. Ratio-and-scale-aware YOLO for pedestrian detection. IEEE Trans Image Process. 2021;30:934–47. pmid:33242306
  25. Tang F, Yang F, Tian X. Long-distance person detection based on YOLOv7. Electronics. 2023;12(6):1502.
  26. Wu W, Yin Y, Wang X, Xu D. Face detection with different scales based on faster R-CNN. IEEE Trans Cybern. 2019;49(11):4017–28. pmid:30113907
  27. Wang G, Chen Y, An P, Hong H, Hu J, Huang T. UAV-YOLOv8: a small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios. Sensors (Basel). 2023;23(16):7190. pmid:37631727
  28. Wu T, Dong Y. YOLO-SE: improved YOLOv8 for remote sensing object detection and recognition. Applied Sciences. 2023;13(24):12977.
  29. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, et al. Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020. p. 38–45.
  30. Wang P, Mohamed R, Mustapha N, Manshor N. YOLO-AEF: traffic sign detection on challenging traffic scenes via adaptive enhancement and fusion. Neurocomputing. 2025;655:131430.
  31. Pan X, Ge C, Lu R. On the integration of self-attention and convolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 815–25.
  32. Wang CY, Yeh IH, Liao HYM. YOLOv9: learning what you want to learn using programmable gradient information. In: European Conference on Computer Vision. Cham: Springer; 2024. p. 1–21.
  33. Liu S, Huang D, Wang Y. Learning spatial fusion for single-shot object detection. arXiv preprint. 2019.
  34. Jegham N, Koh CY, Abdelatti M, Al Marzouqi H. Evaluating the evolution of YOLO (You Only Look Once) models: a comprehensive benchmark study of YOLO11 and its predecessors. arXiv preprint. 2024.
  35. Tian F, Song C, Liu X. Small target detection in coal mine underground based on improved RTDETR algorithm. Sci Rep. 2025;15(1):12006. pmid:40199916
  36. Li Y, Li Q, Pan J, Zhou Y, Zhu H, Wei H, et al. SOD-YOLO: small-object-detection algorithm based on improved YOLOv8 for UAV images. Remote Sensing. 2024;16(16):3057.
  37. Li X, Cai M, Tan X, Yin C, Chen W, Liu Z, et al. An efficient transformer network for detecting multi-scale chicken in complex free-range farming environments via improved RT-DETR. Computers and Electronics in Agriculture. 2024;224:109160.
  38. Chien CT, Ju RY, Chou KY, Chiang JS, Tsai MH, Yang CJ, et al. YOLOv9 for fracture detection in pediatric wrist trauma X-ray images. 2024.
  39. Theckedath D, Sedamkar RR. Detecting affect states using VGG16, ResNet50 and SE-ResNet50 networks. SN Comput Sci. 2020;1(2):79.
  40. Wang A, Chen H, Liu L. YOLOv10: real-time end-to-end object detection. arXiv preprint. 2024.
  41. Zhang Q. A novel ResNet101 model based on dense dilated convolution for image classification. SN Appl Sci. 2021;4(1):79.