Abstract
Personal protective equipment (PPE) is critical for ensuring the safety of construction workers. However, surveillance images from construction sites often contain targets of widely varying sizes and scales, leading to low PPE detection accuracy in existing models. To address this issue, this paper proposes an improved model based on YOLOv8n that raises detection accuracy by enriching feature diversity and enhancing the model's adaptability to geometric transformations. A Multi-scale Group Pointwise Convolution (MGPC) module was designed to extract multi-level features using convolution kernels of different sizes. A Multi-Scale Feature Diffusion Pyramid Network (MFDPN) was developed, which aggregates multi-scale features through the Multiscale Feature Focus (MFF) module and diffuses them across scales, providing each scale with detailed contextual information. A customized task alignment module was introduced to integrate interactive features, optimizing both the classification and localization tasks. The DCNv2 (Deformable Convolutional Networks v2) module was incorporated to handle geometric scale transformations by generating spatial offsets and feature masks from the interactive features, thereby improving localization accuracy, while dynamically selected weights enhance classification precision. The improved model incorporates rich multi-level and multi-scale features, allowing it to better adapt to tasks involving geometric transformations and to match the image data distribution of construction scenarios. Additionally, structured pruning was applied to the model at varying levels, further reducing the computational and parameter loads. Experimental results indicate that at a pruning level of 1.5, mAP@0.5 and mAP@0.5:0.95 improved by 3.9% and 4.6%, respectively, while the computational load decreased by 21% and the parameter count dropped by 53%. The proposed MFD-YOLO(1.5) model achieves significant progress in detecting personal protective equipment on construction sites with a substantially reduced parameter count, making it suitable for deployment on resource-constrained edge devices.
Citation: Tong B, Li G, Bu X, Wang Y, Yu X (2025) A deep learning-based algorithm for the detection of personal protective equipment. PLoS One 20(5): e0322115. https://doi.org/10.1371/journal.pone.0322115
Editor: Yile Chen, Macau University of Science and Technology, MACAO
Received: November 21, 2024; Accepted: March 17, 2025; Published: May 29, 2025
Copyright: © 2025 Tong et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The public dataset used in this study can be obtained from https://doi.org/10.1016/j.autcon.2022.104499. All other data and code referenced in this paper are available at https://github.com/afbb12123/PPE-Detection.
Funding: This work was supported by the Fundamental Research Funds for the Central Universities (Grant Nos. 3142024037 and 3142024036) and by the Science and Technology Project of Hebei Education Department (Grant No. ZD2022163).
Competing interests: The authors have declared that no competing interests exist.
Introduction
The construction industry has consistently been one of the most hazardous sectors due to its labor-intensive nature and dangerous working environments [1]. According to relevant statistics, over one-fifth of fatal workplace accidents in the European Union occurred in the construction industry in 2021 [2]. In China, there were 773 and 689 safety incidents in housing and municipal engineering in 2019 and 2020, resulting in 904 and 794 fatalities, respectively [3,4]. In the United States, the number of fatalities in the construction industry reached 1,008 in 2020 [5]. Approximately 30% of production accidents in the UK occur in the construction sector [6]. With the ongoing expansion of the construction industry in developing countries, the accident rate may continue to rise [7]. These data clearly indicate that it is imperative to focus on construction safety and take effective measures to ensure the safety of workers.
There are many potential hazards on construction sites, such as mechanical collisions, falls, and electrocution [8]. The majority of construction accidents are related to human factors, with over 80% of accidents stemming from unsafe worker behaviour [9–11]. The lack of necessary protective equipment is one of the main causes of serious casualties [12]. The correct use of personal protective equipment (PPE) can significantly safeguard the lives of workers. However, for a large construction project, it can be difficult to carry out such inspections manually [13,14]. With advances in computer vision (CV), AI has surpassed humans in some aspects of detecting unsafe behaviour [15]. At the same time, closed-circuit television (CCTV) at construction sites can capture a large amount of video and image data, providing the objective conditions for computer vision-based personal protective equipment detection [8].
However, there are currently several problems in using CV for personal protective equipment detection: (1) Detection accuracy is limited. Cameras on construction sites usually have a wide-angle field of view, so captured images contain target objects with large scale variations and different levels of detail and feature information. The algorithm therefore needs to handle features of varying sizes effectively. In traditional deep learning, low-level features typically capture basic information and details, such as edges and textures, while high-level features are more abstract and complex representations derived from low-level features. As network depth increases, some low-level information may be lost, leading to the loss of crucial details during the fusion of high- and low-level features, which consequently reduces detection accuracy. (2) The model is too large for deployment. Edge devices typically have limited resources, and constraints on memory and computational power necessitate smaller and more efficient models.
To enhance the accuracy of detection algorithms, Wang et al. [16] introduced cross-level path aggregation into the feature fusion structure, which reduced the loss of target feature information and effectively supported the detection of personal protective equipment in coal mines. However, there are significant differences between construction sites and coal mine scenarios: coal mine environments tend to have darker backgrounds and limited scene sizes, whereas construction sites have complex backgrounds and larger areas, requiring wide-angle cameras to ensure a broad view of the target locations and necessitating the handling of images at various scales. Huang et al. [17] employed a progressive adaptive feature fusion technique to effectively address the issue of information conflict in helmet detection, achieving improved accuracy while maintaining a fast inference speed.
In practical engineering applications, the processing of image data often requires real-time or near-real-time responses. Considering cost constraints, there is a need for a model with lower computational and memory requirements.
Lin et al. [18] developed a lightweight decoupled detection head that reduces the model's parameter count while enhancing its classification and localization capabilities. In a one-stage detector, the detection head typically accounts for about 50% of the total parameter count, so optimizing it can significantly reduce the parameter size and make the model more suitable for deployment on edge embedded devices. Zhao et al. [19] utilized channel pruning to significantly reduce the model's parameter count and size, supporting efficient deployment. This paper presents the MFD-YOLO architecture to address the challenges of detecting personal protective equipment in construction scenarios. Its main contributions are as follows:
- (1) A Bottleneck structure is designed that leverages the concepts of group convolution and pointwise convolution, sending feature maps into convolutions of varying sizes for feature extraction. The fused feature maps possess richer features, making them more suitable for the variable target information present in construction scene images.
- (2) In the Neck section, a customized Multi-Scale Feature Aggregation Module and feature diffusion mechanism are implemented, ensuring that each scale of features possesses detailed contextual information. The feature aggregation module accepts input from three scales, utilizing depthwise convolutions of varying sizes to capture rich cross-scale information. It includes a parameter-free Identity branch and employs the diffusion mechanism to spread detailed contextual information to each detection scale.
- (3) In the detection head section, Batch Normalization (BN) layers are replaced with better-performing Group Normalization (GN) layers, significantly reducing the parameter count through shared convolutions. A Scale layer is introduced to ensure consistency in target scale across the detection head. A customized task alignment structure enhances the interaction between classification and localization features. Additionally, the introduction of DCNV2 deformable convolutions adjusts spatial offsets and weights, improving localization performance. Dynamic feature selection based on interactive features further enhances classification performance.
- (4) An adaptive network pruning method is employed for global pruning, assessing parameter importance and removing unimportant network parameters to reduce model size, decrease computational complexity, and enhance inference speed. After pruning, the model is fine-tuned to recover any potential performance loss.
The remainder of this paper is organized as follows: Sect 2 presents related research on PPE detection development. Sect 3 provides a detailed description of the proposed MFD-YOLO architecture. Sect 4 discusses the experiments related to model validation. Finally, Sect 5 summarizes the work presented in this paper.
Related work
The use of personal protective equipment (PPE) is crucial for protecting workers, especially those working in hazardous areas, from injuries caused by sudden incidents [13]. Traditionally, safety inspectors verify workers’ PPE compliance using checklists to reduce workplace accidents [14]. However, this manual verification method is not only time-consuming and labor-intensive but also prone to oversight, and it cannot achieve real-time monitoring, leading to many potential safety hazards going undetected. With advancements in artificial intelligence technology and closed-circuit television on construction sites, more effective supervision methods have emerged.
Detection of PPE wearing based on traditional sensors and machine learning
Sensor-based methods typically require workers to wear specific sensor devices and communication-enabled equipment, such as Global Positioning Systems (GPS), Ultra-Wideband (UWB), and Bluetooth [20–22]. The data collected by these sensors are processed and analyzed through Internet of Things (IoT) platforms [23]. For example, Riaz et al. monitored work environments in confined spaces using integrated Building Information Modeling (BIM) and wireless sensors to prevent workers from encountering time-sensitive hazards [24]. Gómez-de-Gabriel et al. combined Bluetooth Low Energy (BLE) and Bayesian distance estimators to prompt and intervene in cases of improperly worn PPE or insufficient distance from tools [20]. Zhang et al. utilized Radio Frequency Identification (RFID) in conjunction with IoT to issue warnings when workers enter hazardous areas on construction sites, transmitting alert information to remote servers via LoRa technology [23]. However, sensor-based methods have limitations: they rely on the installation of tags, which incurs additional installation and maintenance costs [25], and the extra equipment may interfere with workers' normal tasks [26].
Detection methods based on machine learning typically focus on identifying single devices within images. Rubaiyat et al. extracted differential features from images using Histogram of Oriented Gradients (HOG) and combined color information with Circular Hough Transform (CHT) to detect helmets [27]. Liu et al. employed a Support Vector Machine (SVM) model to classify images based on helmet usage [28]. Wu et al. constructed a hybrid operator based on color Local Binary Patterns (LBP), Hu Moment Invariants (HMI), and color histograms (CH) to extract features of helmets in various colors and classified four types of helmet usage using Hierarchical Support Vector Machines (H-SVM) [29]. These methods often focus on specific regions of the image, extracting information from critical areas and using the differences between these areas and their surroundings for classification. While effective for certain tasks, they generally suffer from low detection accuracy, slow inference speed, and poor model generalization, making them inadequate for the real-time and complex requirements of construction sites [16].
Detection of PPE wearing based on object detection
Deep learning-based object detection algorithms are typically divided into two categories [30]: two-stage and one-stage object detection algorithms. Two-stage algorithms, such as Region-based Convolutional Neural Networks (R-CNN) [31] and Faster R-CNN [32], offer high accuracy but have complex network structures, demanding significant computational resources and resulting in slower inference speeds, which makes them unsuitable for practical engineering applications [16]. In contrast, one-stage object detection algorithms, such as YOLO (You Only Look Once) [33] and the Single Shot MultiBox Detector (SSD) [34], are more suitable for real-world applications due to their higher speed and lower computational demands.
In practical applications, one-stage object detection algorithms perform particularly well. For instance, Nath et al. [35] significantly reduced model complexity while maintaining detection accuracy by replacing the backbone network of YOLOv4 with MobileNet v3. Wang et al. [16] improved the feature fusion path, enhancing the accuracy of PPE detection in coal mine scenarios. Zhao et al. [19] applied network pruning techniques to reduce the parameter count of building recognition models, making them more suitable for deployment on edge devices and improving inference speed.
Methods
MFD-YOLO overview
As a one-stage detector, YOLOv8 has demonstrated outstanding performance in the field of object detection, particularly showing great potential for personal protective equipment detection. Therefore, this paper proposes a PPE detection method based on YOLOv8n, referred to as MFD-YOLO, aiming to leverage the advantages of one-stage detectors while addressing the shortcomings of traditional methods. The model structure is illustrated in Fig 1.
In the MFD-YOLO architecture, we propose the MGPC (Multi-scale Group Pointwise Convolution) module, which utilizes convolutions of different sizes to extract features at various levels, enriching the feature flow. In the Neck section, we designed the MFF module to aggregate features from three scales, and through the Multiscale Feature Diffusion Pyramid Network (MFDPN), these aggregated features are diffused across the various detection scales, ensuring that each scale possesses detailed contextual information. In the detection head section, we designed the Dynamic Task-alignment Detection Heads (DTADH) module, implementing a customized task alignment mechanism. This mechanism aims to enhance the interaction between classification and localization tasks by generating spatial offsets and masks based on interactive features. To address the geometric deformation of detection targets, we introduced DCNv2 in the localization task. For the classification task, dynamic feature selection facilitates matching across multi-scale features. Finally, we performed global pruning on the model using adaptive pruning techniques, resulting in a model with a smaller parameter count and lower computational requirements.
MGPC module
In construction scenarios, the diverse positioning of cameras often results in the acquisition of image data at varying scales. In the traditional YOLOv8 backbone network, the C2f module is utilized for feature extraction, where input features are first processed through a 1x1 convolution to reduce the channel count and lower computational load. The features are then sent into a Bottleneck structure, where a 3x3 convolution is employed for feature extraction, and residual connections are used to prevent gradient vanishing. However, this approach relies solely on the receptive field of the 3x3 convolution, resulting in insufficiently rich information extraction.
To address this, we designed a new Bottleneck structure called Multi-scale Group Pointwise Convolution (MGPC), as illustrated in Fig 2. MGPC employs grouped convolution to divide the input feature channels into four groups, utilizing convolution kernels of sizes 1×1, 3×3, 5×5, and 7×7 to extract features and capture information at different levels, ultimately fusing the features through concatenation. Previous studies have shown that grouped convolution not only increases the diagonal correlation between filters but also reduces training parameters, making it less prone to overfitting [36]. Assuming the input feature shape is [C1, H, W] and the output feature shape is [C2, H, W], the parameters and computational load of a traditional K×K convolution are given by (Eq 1):

$$\mathrm{Params} = K \times K \times C_1 \times C_2, \qquad \mathrm{FLOPs} = K \times K \times C_1 \times C_2 \times H \times W$$

For grouped convolution divided into $g$ groups, the parameter count and the computational load in FLOPs become

$$\mathrm{Params}_g = \frac{K \times K \times C_1 \times C_2}{g}, \qquad \mathrm{FLOPs}_g = \frac{K \times K \times C_1 \times C_2 \times H \times W}{g}$$

Both are reduced to varying extents depending on the number of groups, which is advantageous for deployment on embedded devices. Additionally, the 1×1 convolution is a common means of exchanging channel information between grouped feature maps. By employing pointwise convolution, we enrich the inter-channel information, enhancing the expressiveness of the final feature map [37]. The MGPC structure combines features from different levels and channels, making it better suited to the multi-scale transformation requirements of construction scenarios.
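The grouped-plus-pointwise idea can be summarized in a short PyTorch sketch. The layer sizes, SiLU activation, and residual placement below are illustrative assumptions rather than the authors' released code; only the four-way channel split with 1×1/3×3/5×5/7×7 kernels and the pointwise fusion follow the description above.

```python
# Minimal sketch of an MGPC-style bottleneck (assumed layer sizes, not the paper's exact code).
import torch
import torch.nn as nn

class MGPCBottleneck(nn.Module):
    """Split channels into 4 groups, convolve each group with a different
    kernel size (1/3/5/7), concatenate, then mix channels with a 1x1
    pointwise convolution. A residual connection preserves gradient flow."""
    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0, "channel count must be divisible into 4 groups"
        g = channels // 4
        self.branches = nn.ModuleList([
            nn.Conv2d(g, g, k, padding=k // 2) for k in (1, 3, 5, 7)
        ])
        self.pointwise = nn.Conv2d(channels, channels, 1)  # exchange inter-group channel info
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        groups = torch.chunk(x, 4, dim=1)                  # grouped convolution: 4 channel groups
        multi_scale = [branch(g) for branch, g in zip(self.branches, groups)]
        fused = self.pointwise(torch.cat(multi_scale, dim=1))
        return self.act(fused) + x                         # residual connection

x = torch.randn(1, 64, 80, 80)
print(MGPCBottleneck(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```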
MFDPN module
The image data obtained from CCTV presents challenges related to target scale variations and diverse contextual environments. PKINet has been shown to effectively address scale transformations by extracting features within different receptive fields and collecting local contextual information [38]. Based on this concept, we designed the MFF module, which accepts input from three scales, P3, P4, and P5, as illustrated in Fig 3. When processing features from the P3 layer, we use an ADown module for downsampling, which retains more detailed information than traditional convolutions [39]. For the P5 layer, we first upsample to restore resolution and then apply a 1×1 convolution to adjust the channel count. The features from P3, P4, and P5 are then aggregated using Concat to form a fusion of multi-scale features. The larger the receptive field, the richer the information obtained; thus, the aggregated features are passed through depthwise convolutions of four different kernel sizes (5×5, 7×7, 9×9, and 11×11) along with a parameter-free Identity mapping. The Identity branch, similar to a residual connection, maintains the continuity of the information flow. These features with varying receptive fields are then fused with the previously aggregated features, and a 1×1 convolution is applied for further channel integration. The resulting features are re-fused with the aggregated features from the three scales, yielding features enriched with multi-level and detailed contextual information. Ultimately, the features containing multi-scale and detailed contextual information are fused again with the initial features from the three scales through upsampling, downsampling, and Concat operations, enhancing the accuracy and robustness of detection at each scale.
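The MFF aggregation and the subsequent multi-kernel enrichment can be sketched as follows. The channel counts, the stride-2 convolution standing in for the ADown block, and the summation-based fusion are illustrative assumptions; only the three-scale aggregation, the 5×5–11×11 depthwise branches with an identity path, and the 1×1 re-fusion follow the description above.

```python
# Minimal sketch of the MFF aggregation idea (assumed sizes, not the authors' implementation).
import torch
import torch.nn as nn

class MFF(nn.Module):
    """Aggregate P3/P4/P5 at the P4 resolution, enrich the fused map with
    depthwise convolutions of several kernel sizes plus an identity branch,
    then re-fuse with the aggregated features via a 1x1 convolution."""
    def __init__(self, c3: int, c4: int, c5: int, c_out: int):
        super().__init__()
        self.down_p3 = nn.Conv2d(c3, c4, 3, stride=2, padding=1)   # stand-in for ADown
        self.up_p5 = nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                                   nn.Conv2d(c5, c4, 1))
        self.squeeze = nn.Conv2d(3 * c4, c_out, 1)
        # parameter-free identity branch + depthwise convs with growing receptive fields
        self.dw = nn.ModuleList([nn.Identity()] + [
            nn.Conv2d(c_out, c_out, k, padding=k // 2, groups=c_out) for k in (5, 7, 9, 11)
        ])
        self.fuse = nn.Conv2d(c_out, c_out, 1)

    def forward(self, p3, p4, p5):
        agg = self.squeeze(torch.cat([self.down_p3(p3), p4, self.up_p5(p5)], dim=1))
        multi = sum(branch(agg) for branch in self.dw)             # fuse receptive fields
        return self.fuse(multi) + agg                              # keep aggregated context

p3, p4, p5 = torch.randn(1, 64, 80, 80), torch.randn(1, 128, 40, 40), torch.randn(1, 256, 20, 20)
print(MFF(64, 128, 256, 128)(p3, p4, p5).shape)  # torch.Size([1, 128, 40, 40])
```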
DTADH module
In the design of the detection head for YOLOv8, the classification and localization tasks operate independently. Due to their different learning mechanisms, there may be discrepancies in the spatial distribution of their features, leading to biases during prediction [40]. To address this issue, we restructured the detection head by drawing inspiration from TOOD, as illustrated in Fig 4. Research indicates that Group Normalization (GN) can significantly enhance the performance of the detection head in both localization and classification tasks [41]. Consequently, we employed a 3×3 convolution combined with a GN layer as the feature extractor to derive interactive features from the P3, P4, and P5 layers. The input features processed by the Neck are denoted as X_input, while the interactive features learned by the feature extractor are denoted as X_inter, where H, W, and C correspond to the height, width, and number of channels, respectively. The Task Decomposition module utilizes a layer attention mechanism to compute the attention weight ω_i for each feature map; based on the varying levels of attention, features are mapped for subsequent independent processing in the classification and localization tasks. The classification task dynamically adjusts the convolution weights based on the interactive features, generating classification predictions. The localization task uses the interactive features to produce spatial displacement offsets, denoted as Offset, along with masks, denoted as Mask. By employing DCNv2 with the generated offsets and masks, the module effectively adjusts the spatial positions and weights of the feature maps, allowing it to better accommodate localization tasks involving geometric transformations. DCNv2 introduces deformable convolutions that adaptively adjust receptive fields, allowing for better handling of objects with geometric variations and improving detection robustness. To reduce the model's parameter count and facilitate deployment on resource-constrained edge devices, we implemented shared convolution: the feature extractor, task alignment processing, and the subsequent classification and localization components share parameters, as highlighted by the red dashed box in Fig 4. Furthermore, we introduced Scale layers to ensure consistency in the target scales handled by each detection head, as detailed in Equations 5, 6, and 7.
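Before turning to the task-decomposition details below, the offset-and-mask pathway of the localization branch can be illustrated with torchvision's modulated deformable convolution. The channel sizes, the single 27-channel predictor, and the final 1×1 regression layer are illustrative assumptions, not the authors' exact head; the sketch only shows how offsets and a sigmoid mask derived from the interactive features drive DCNv2.

```python
# Hedged sketch of a DCNv2-style localization branch (illustrative sizes and names).
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableLocHead(nn.Module):
    def __init__(self, channels: int, reg_out: int = 64):    # e.g. 4 * reg_max DFL bins (assumed)
        super().__init__()
        # 3x3 kernel -> 2*3*3 = 18 offset channels and 3*3 = 9 mask channels
        self.offset_mask = nn.Conv2d(channels, 27, 3, padding=1)
        self.weight = nn.Parameter(torch.randn(channels, channels, 3, 3) * 0.01)
        self.bias = nn.Parameter(torch.zeros(channels))
        self.reg = nn.Conv2d(channels, reg_out, 1)            # final box regression layer

    def forward(self, x_inter: torch.Tensor) -> torch.Tensor:
        om = self.offset_mask(x_inter)
        offset, mask = om[:, :18], om[:, 18:].sigmoid()        # spatial offsets + modulation mask
        feat = deform_conv2d(x_inter, offset, self.weight, self.bias,
                             padding=1, mask=mask)
        return self.reg(feat)

x = torch.randn(1, 128, 40, 40)
print(DeformableLocHead(128)(x).shape)  # torch.Size([1, 64, 40, 40])
```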
In (Eq 4), $X_k^{task}$ denotes the task-decomposed feature of the $k$-th layer, obtained by re-weighting the interactive features with layer attention, while $fc_1$ and $fc_2$ represent fully connected layers. The attention weights are computed as $\omega = \sigma\big(fc_2\big(\delta\big(fc_1\big(x^{inter}\big)\big)\big)\big)$, where $\delta$ and $\sigma$ denote the ReLU and sigmoid activations. The interactive features $X_k^{inter}$ first undergo global average pooling to capture the contextual information of the feature maps, yielding a global representation for each channel; these pooled features are then concatenated to form $x^{inter}$. The concatenated $x^{inter}$ is fed into the fully connected layers to generate feature mappings, and the activation function produces the attention weights, which assess the importance of each layer's features for the classification and localization tasks. Based on these attention weights, the interactive features are allocated to their respective tasks for further processing.
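A minimal sketch of this layer-attention task decomposition, assuming the interactive features are stacked along a layer dimension and that ReLU and sigmoid are used inside the attention branch:

```python
# TOOD-style layer attention: pooled interactive features produce per-layer weights
# that re-weight the stacked interactive features for one task. Names are illustrative.
import torch
import torch.nn as nn

class TaskDecomposition(nn.Module):
    def __init__(self, channels: int, num_layers: int):
        super().__init__()
        self.fc1 = nn.Linear(num_layers * channels, channels)
        self.fc2 = nn.Linear(channels, num_layers)

    def forward(self, x_inter: torch.Tensor) -> torch.Tensor:
        # x_inter: [B, N_layers, C, H, W] -- stacked interactive features
        b, n, c, h, w = x_inter.shape
        pooled = x_inter.mean(dim=(3, 4)).reshape(b, n * c)               # global average pooling
        weights = torch.sigmoid(self.fc2(torch.relu(self.fc1(pooled))))   # per-layer attention
        return (x_inter * weights.view(b, n, 1, 1, 1)).sum(dim=1)         # task-specific feature map

x_inter = torch.randn(2, 4, 128, 40, 40)   # 4 stacked feature-extractor layers (assumed)
print(TaskDecomposition(128, 4)(x_inter).shape)  # torch.Size([2, 128, 40, 40])
```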
Prune
In the context of embedded devices frequently facing resource limitations on construction sites, it is crucial to reduce both the model's parameter count and computational load for effective deployment. Network pruning, a technique that reduces the size of neural networks by removing unimportant weights, can help minimize costs. This paper employs Layer-wise Adaptive Magnitude-based Pruning (LAMP) to perform structured pruning on the improved MFD-YOLO model [42]. LAMP pruning involves designing a scoring function to quantify the loss incurred at the output layer due to pruning; we select weights that have minimal impact on model performance for removal. During the pruning process, each layer's weight scores are calculated and ranked, and a global sparsity target is set to remove the weights with the lowest scores until the desired global pruning ratio (speedup) is reached. Through this method, we effectively eliminate unimportant channels in the MFD-YOLO network, significantly reducing the model's parameter count and computational complexity, as detailed in the following (Eq 7).
In (Eq 7), $W^{(i)}$ represents the weight tensor of a particular layer, where $i$ denotes the index of that layer and the weights are flattened and sorted in ascending order of magnitude; $u$ indexes the current weight and $v \ge u$ runs over the current weight and all weights that follow it:

$$\mathrm{score}\big(u; W^{(i)}\big) = \frac{W^{(i)}[u]^{2}}{\sum_{v \ge u} W^{(i)}[v]^{2}}$$

The denominator is the sum of squares of the remaining weights in the current layer, while the numerator is the square of the current weight, so the score assesses its overall importance to the model. LAMP prunes weights in ascending order of their scores, with the speedup denoting the pruning ratio

$$\mathrm{speedup} = \frac{\mathrm{GFLOPs}_1}{\mathrm{GFLOPs}_2}$$

where GFLOPs$_1$ and GFLOPs$_2$ represent the computational loads before and after pruning, respectively. This ratio is not strictly defined, and pruning typically stops once it approaches a chosen threshold.
In LAMP pruning, the Frobenius norm is used to approximate the loss-minimization problem. In (Eq 9), $f(x; W)$ represents the output of a specific layer of the network, where $x$ is the input and $W$ is the weight tensor; $M$ is a binary mask matrix indicating which weights of the layer participate in pruning. The product $M \odot W$ yields the weight matrix after pruning, and the objective is to keep the pruned output as close as possible to the original one:

$$\min_{M}\; \big\| f(x; W) - f(x; M \odot W) \big\|_{F}$$

By quantifying the distortion introduced during the pruning process, the retained weights of each layer are chosen so that the loss at the output layer is minimized, ensuring that the outputs remain as close as possible before and after pruning.
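A small sketch of the LAMP scoring and global thresholding described above, assuming unstructured (per-weight) masks for clarity rather than the channel-level structured pruning actually applied to MFD-YOLO:

```python
# LAMP score per Eq 7: each squared weight divided by the sum of itself and all
# larger weights in the same layer; the globally lowest-scoring weights are pruned.
import torch

def lamp_scores(weight: torch.Tensor) -> torch.Tensor:
    """Return a LAMP score for every weight in one layer (same shape as input)."""
    w2 = weight.detach().flatten().pow(2)
    order = torch.argsort(w2)                               # ascending magnitude
    sorted_w2 = w2[order]
    # denominator: sum of this weight and every weight ranked after it
    tail_sums = torch.flip(torch.cumsum(torch.flip(sorted_w2, [0]), dim=0), [0])
    scores = torch.empty_like(w2)
    scores[order] = sorted_w2 / tail_sums
    return scores.view_as(weight)

def global_lamp_mask(weights: list, sparsity: float) -> list:
    """Keep the (1 - sparsity) fraction of weights with the highest LAMP score."""
    scores = [lamp_scores(w) for w in weights]
    flat = torch.cat([s.flatten() for s in scores])
    threshold = torch.quantile(flat, sparsity)              # one global cut across all layers
    return [(s > threshold).float() for s in scores]

layers = [torch.randn(32, 16, 3, 3), torch.randn(64, 32, 3, 3)]
masks = global_lamp_mask(layers, sparsity=0.5)
print([round(m.mean().item(), 2) for m in masks])           # surviving fraction per layer
```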
Experiment and results
Dataset and environment configuration
The experimental environment was configured as follows: Software—Operating System: Ubuntu 22.04; Python: 3.9; PyTorch: 2.1.2; Ultralytics: 8.2.50. Hardware—Processor: Intel(R) Xeon(R) Platinum 8352V at 2.1 GHz; RAM: 80 GB; GPU: Nvidia GeForce RTX 4090 with 24 GB. To enable a quantitative comparison of model performance before and after improvements, we standardized the training parameters, as detailed in Table 1.
Evaluation metrics
In this experiment, we evaluated the improved model based on detection accuracy, speed, and model size, using the following metrics: mean Average Precision (mAP), inference speed (FPS), number of parameters, and GFLOPs (giga floating-point operations). The mAP is derived from the average precision (AP), which is related to precision (P) and recall (R). The corresponding formulas (Eqs 10–13) are as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

$$AP = \int_{0}^{1} P(R)\, dR, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$

where TP, FP, and FN represent true positive, false positive, and false negative samples, respectively. The AP is the area under the curve formed by P and R, while the mAP is the average of the AP values across all classes, where N is the total number of detection classes. For a model to be deployable on edge devices, model size and inference speed are crucial evaluation metrics. We quantify model size using the number of parameters and assess inference speed using FPS, as shown in (Eq 14):

$$FPS = \frac{1}{t_1 + t_2 + t_3}$$

where $t_1$, $t_2$, and $t_3$ represent the preprocessing, inference, and postprocessing times (in seconds) of the model on a single image, respectively.
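As a small worked example of these formulas (toy numbers only, not experimental data):

```python
# Precision/recall from counts, AP as the area under a PR curve,
# mAP as the class-wise mean, and FPS from per-stage times.
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve (trapezoidal rule)."""
    order = np.argsort(recall)
    return float(np.trapz(precision[order], recall[order]))

tp, fp, fn = 80, 20, 10
p = tp / (tp + fp)                               # precision = 0.80
r = tp / (tp + fn)                               # recall ~= 0.89

ap = average_precision(np.array([0.0, 0.5, 0.8, 1.0]),
                       np.array([1.0, 0.9, 0.7, 0.5]))
per_class_ap = [0.72, 0.65, 0.81]                # hypothetical per-class APs
map50 = sum(per_class_ap) / len(per_class_ap)    # mAP over N = 3 classes

t1, t2, t3 = 0.002, 0.006, 0.001                 # seconds per image
fps = 1.0 / (t1 + t2 + t3)
print(round(p, 2), round(r, 2), round(ap, 2), round(map50, 2), round(fps, 1))
```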
Ablation experiment
In this experiment, YOLOv8n was used as the baseline model. To validate the effectiveness of the proposed improvements, we conducted ablation experiments, comparing each modified module one by one and analyzing the results in detail. Our proposed model is named MFD-YOLO, and we provide quantification results for certain modules along with a visual analysis. The results of the ablation experiments are shown in Table 2, while the PR comparison chart of the model improvements is illustrated in Fig 6, where A represents the MGPC module, B represents the MFDPN module, and C represents the DTADH module. The detection results for each category are detailed in Table 3.
MGPC Module. In Table 2, the mAP@0.5, mAP@0.5:0.95, and Parameters for the baseline model YOLOv8n before improvements were 70.2%, 35.1%, and 3,006,233, respectively. The GFLOPs and inference speed before the improvements were 8.1 and 158.7, respectively. After replacing part of the C2f Bottleneck in the network structure with MGPC, the parameter count and computation decreased by 0.2M and 0.2G, respectively, while the mAP@0.5 and mAP@0.5:0.95 increased by 0.7% and 0.3%. This indicates that using multi-scale features to enhance personal protective equipment detection in construction scenarios is effective. The changes in parameter count after introducing MGPC into the backbone are shown in Table 4. In networks of the same depth, using MGPC instead of C2f resulted in a reduction in both parameter count and computation, although the inference time slightly increased.
MFDPN Module. By replacing the PAN+FPN structure in YOLOv8 with the multi-scale feature aggregation pyramid, the mAP@0.5 and mAP@0.5:0.95 increased by 2.1% and 3.5%, respectively. However, the increased complexity of the network structure also led to an increase in computation and parameter count, resulting in a decrease in model inference speed (FPS) by 27.8. The improvements to the Neck enhance the richness of feature information, as illustrated in the heatmap shown in Fig 7, demonstrating that MFDPN improves detection performance across different scales.
(a) Dataset distribution, (b) dataset size distribution.
Origin is the original image, Detect is the detection result, and Heatmap is the detection heatmap.
DTADH Module. The interaction features obtained through feature fusion allow the detection head’s localization task to generate relevant spatial offsets and masks based on these interaction features, while the classification task dynamically selects weights based on the same features. The introduction of DCNv2 and Scale layers better accommodates geometric transformations in images. We separated the test images by size and predicted the images corresponding to the top four size proportions. The detection results, as shown in Table 6, indicate that the DTADH module improves detection accuracy across different sizes compared to the original model.
After introducing the DTADH module, mAP@0.5 and mAP@0.5:0.95 increased by 1.8% and 1.1%, respectively. Although the computational load slightly increased due to the uniform number of channels, the parameter count decreased by 25%, and the inference speed (FPS) dropped by 20. Overall, the improved model showed gains of 4.2% and 5.1% in mAP@0.5 and mAP@0.5:0.95, respectively, while reducing the parameter count by 17%. The modified network not only lowered the parameter count but also increased the detection accuracy of personal protective equipment for workers. The GFLOPs increased by 1.6 due to the higher complexity of the network; the designed modules require more computation than the detection part of YOLOv8n, resulting in a final FPS decrease of 45.7%.
To further validate the multi-scale detection capability of MFD-YOLO, we conducted additional experiments on the CHV dataset, following the COCO evaluation criteria to classify objects into three categories based on their bounding-box area: small (area < 32² pixels), medium (32² ≤ area < 96² pixels), and large (area ≥ 96² pixels). We then computed the AP for small, medium, and large objects to quantitatively assess the model's performance across different object scales. The results, presented in Table 6, demonstrate that MFD-YOLO consistently outperforms the baseline model across all object sizes, further substantiating its effectiveness in multi-scale object detection. This additional analysis strengthens the claim regarding our model's capability to handle objects of varying sizes.
Pruning experiment
Although the improved model enhances target detection accuracy, it also leads to an increase in computational load. Reducing this computational burden and improving inference speed are especially critical for resource-constrained embedded devices. We employed an adaptive pruning approach to remove unimportant channels, thereby reducing both the parameter count and computational complexity. The results of the pruning experiments are shown in Table 7, and the channel comparisons of each module before and after pruning are illustrated in Fig 8.
(a) Speed up 1.5, (b) speed up 1.7, (c) speed up 2.0.
After applying the LAMP method for structured pruning, the channel counts of various modules were reduced, with some modules completely pruned. At a pruning ratio of 1.5, the model’s performance slightly decreased by 0.3% and 0.5% for mAP@0.5 and mAP@0.5:0.95, respectively, while the computational load decreased by 34%, and the parameter count dropped by 45%, leading to a 28.8% increase in FPS.
At a pruning ratio of 1.7, performance metrics for mAP@0.5 and mAP@0.5:0.95 dropped by 2.8% for both, with a 42% reduction in computational load and a 50% decrease in parameters, resulting in a 32.8% increase in FPS. At a pruning ratio of 2, the performance further declined by 5.8% and 7.1% for mAP@0.5 and mAP@0.5:0.95, respectively, with a 50% drop in computational load, a 56% decrease in parameters, and a 27.1% increase in FPS. Overall, when comparing mAP@0.5 alongside computational load, parameter count, and FPS, the pruning ratio of 1.5 yielded the best results.
Comparative experiment
To better illustrate the advantages of the improved MFD-YOLO detection model, we conducted comparative experiments, selecting currently advanced detection algorithms as comparison models [44]. Model size was quantified using parameter counts, while GFLOPs and FPS were used to characterize computational complexity and inference speed. For detection accuracy, we chose AP and mAP@0.5 (%) as the metrics. The results of the comparative experiments are shown in Table 8, with a radar chart comparing the other metrics depicted in Fig 9.
The mAP values for the Faster R-CNN and SSD algorithms are 61.3% and 58.7%, respectively, with GFLOPs of 257.3 and 52.7 and FPS of 21.3 and 47.2. These algorithms exhibit low detection accuracy for personal protective equipment and high computational complexity. YOLOv3 has an mAP of 60.8%, 155.3 GFLOPs, and a parameter count of 62.1M, resulting in a detection speed of 43.4 FPS; its relatively high parameter count leads to significant memory usage. YOLOv4 achieves an mAP of 63.4%, surpassing earlier algorithms, but still has considerable computational and parameter demands, reaching 140 GFLOPs and 64.6M parameters. YOLOv5n is a commonly used model with a detection accuracy of 68.8%, a parameter count of 2.6M, and a computational complexity of 7.1 GFLOPs; however, its accuracy is lower than that of YOLOv8, and its reliance on anchor boxes and a coupled detection head leaves less room for further improvement.

We also experimented with replacing YOLOv8n's backbone with the widely used lightweight MobileNet architecture. This resulted in an mAP of 64.8%, with a parameter count of 5.8M and a computational complexity of 7.8 GFLOPs; although this modification reduced the computational cost compared to YOLOv8n, it led to a significant drop in accuracy. Additionally, we evaluated the latest YOLOv11n model, which achieved an average detection accuracy of 70.0%, with a parameter count of 2.6M and a computational complexity of 6.5 GFLOPs. Despite its lower parameter count and computational cost, its inference speed (FPS) was lower than that of YOLOv8n, making it less suitable for real-world deployment. YOLOv8n achieves an mAP of 70.2%, with 2.5M parameters, 9.7 GFLOPs, and an FPS of 148.2; although it performs well, its computational load and parameter count remain somewhat high for deployment on edge embedded devices.

The improved model, MFD-YOLO, reaches a maximum mAP of 74.4%, with a computational load of 9.7 GFLOPs, 2.5M parameters, and an FPS of 102.5. Compared to YOLOv8n, although the parameter count has decreased, the computational load has increased, resulting in slower detection speeds, which is also not ideal for deployment. After applying a pruning ratio of 1.5, we obtained MFD-YOLO(1.5). Compared to YOLOv8n, the mAP increased by 3.9%, while the computational load decreased by 21% and the parameter count was reduced by 53%. Although the FPS decreased by 16.9, this has minimal impact on real-time detection tasks. The radar chart demonstrates that MFD-YOLO achieves optimal detection across the various categories. Additionally, the pruned MFD-YOLO addresses the issue of high computational load, meeting the requirements for real-time detection of personal protective equipment in construction scenarios while occupying less memory and maintaining a high running speed. Overall, the algorithm obtained after pruning is superior to the other mainstream algorithms.
Algorithm deployment
To evaluate the real-world performance of our proposed model, we deployed it on a Jetson Nano B1 edge computing device. Both the YOLOv8n and MFD-YOLO(1.5) models were deployed and tested under two lighting conditions: well-lit and low-light environments.
For performance evaluation, we used FPS (frames per second) as the metric for inference speed and AP (average precision) as the metric for detection accuracy. The FPS and AP values were averaged over 100 consecutive frames to obtain a comprehensive assessment. The summarized results are presented in the table below, while sample detection images from the real-world deployment are shown in the accompanying Fig 10.
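A hedged sketch of how such a per-frame average can be collected with the Ultralytics API; the weight file and video source names are placeholders, and the per-stage timings are read from each result's speed dictionary (milliseconds):

```python
# Average FPS over 100 consecutive frames using Ultralytics' streaming predict.
from ultralytics import YOLO

model = YOLO("mfd_yolo_pruned.pt")            # hypothetical path to the pruned weights

fps_values = []
for result in model.predict(source="site_camera.mp4", stream=True):
    s = result.speed                          # {'preprocess', 'inference', 'postprocess'} in ms
    total_ms = s["preprocess"] + s["inference"] + s["postprocess"]
    fps_values.append(1000.0 / total_ms)
    if len(fps_values) == 100:                # average over 100 consecutive frames
        break

print(f"mean FPS over {len(fps_values)} frames: {sum(fps_values) / len(fps_values):.1f}")
```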
(a) YOLOv8n in low light conditions, (b) YOLOv8n under sufficient lighting conditions, (c) MFD-YOLO in low light conditions, (d) MFD-YOLO under sufficient lighting conditions.
(a) shows the detection results of YOLOv8n under low-light conditions, while (b) presents the detection results of YOLOv8n under well-lit conditions. It can be observed that the detection accuracy decreases under low-light conditions. (c) and (d) display the detection results of MFD-YOLO (1.5) under low-light and well-lit conditions, respectively. Compared to YOLOv8n, the detection accuracy of MFD-YOLO (1.5) improves in both scenarios. The average FPS for YOLOv8n is 35, whereas for MFD-YOLO (1.5), it is 30. For edge detection tasks, this difference is entirely acceptable.
Discussion and conclusion
In this study, we proposed an improved model for detecting personal protective equipment (PPE) worn on construction sites, based on YOLOv8n. The goal is to address issues such as low detection accuracy and difficulties in model deployment. The main improvements include:
(1) Designed the MGPC module, which utilizes grouped convolutions of varying kernel sizes to extract multi-level features while reducing the number of parameters and the computational load, thereby enhancing feature representation.
(2) Constructed the MFDPN (Multi-scale Feature Diffusion Pyramid Network) structure, which aggregates feature information from the P3, P4, and P5 layers through the MFF module. A feature fusion pathway spreads detailed contextual information across the various detection layers, improving the model's capability to handle targets of different scales.
(3) Customized a task alignment mechanism that separates the classification and localization tasks. Introduced DCNv2 to adjust spatial positions and weights, allowing the localization task to better adapt to the geometric transformations of targets at different scales. The classification task dynamically adjusts weights based on the interactive features, and the Scale layer ensures consistency in the detection results.
(4) Employed an adaptive structured pruning method to trim the improved model, removing redundant network parameters and reducing both the parameter count and the computational load.
Ablation experiments and comparative tests demonstrated the effectiveness of the proposed improvements. The performance of each module was quantified and visualized during the experiments. Ultimately, MFD-YOLO(1.5) successfully reduced both the parameter count and computational load while effectively enhancing the accuracy of personal protective equipment detection on construction sites, thereby ensuring the safety of workers during construction activities.
Despite the improvements in detection accuracy across various scales, there are still instances of missed detections in specific scenarios, such as target occlusion and personnel overlap. Future work will focus on designing new attention mechanisms to enhance the model’s ability to recognize challenging samples. Additionally, considering the difficulties in data collection for construction scenarios, future efforts could incorporate GANs and anomaly generation modules to create synthetic detection datasets, effectively addressing the issue of insufficient data for specific tasks.
References
- 1. Guo BHW, Zou Y, Fang Y, Goh YM, Zou PXW. Computer vision technologies for safety science and management in construction: a critical review and future research directions. Safety Science. 2021;135:105130.
- 2. Eurostat. Accidents at work statistics; 2021. Available from: https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Accidents_at_work_statistics
- 3. Notice of the Ministry of Housing and Urban-Rural Development, the People’s Republic of China. General Office of the Ministry of Housing and Urban-Rural Development on the production safety accidents of housing municipal engineering in 2020; 2022. Available from: https://www.mohurd.gov.cn/gongkai/zhengce/zhengcefilelib/202210/20221026_768565.html
- 4. Notice of the Ministry of Housing and Urban-Rural Development, the People’s Republic of China. General Office of the Ministry of Housing and Urban-Rural Development on the production safety accidents of housing municipal engineering in 2019; 2020. Available from: https://www.mohurd.gov.cn/gongkai/zhengce/zhengcefilelib/202006/20200624_246031.html
- 5. U.S. Bureau of Labor Statistics. Fatal occupational injuries by industry and event or exposure; 2021. Available from: https://www.bls.gov/iif/fatal-injuries-tables/fatal-occupational-injuries-table-a-1-2021.htm
- 6. Health and Safety Executive. Work-related fatal injuries in Great Britain; 2023. Available from: https://www.hse.gov.uk/statistics/pdf/fatalinjuries.pdf
- 7. Amiri M, Ardeshir A, Fazel Zarandi MH. Fuzzy probabilistic expert system for occupational hazard assessment in construction. Saf Sci. 2017;93:16–28.
- 8. Kulinan AS, Park M, Aung PPW, Cha G, Park S. Advancing construction site workforce safety monitoring through BIM and computer vision integration. Autom Constr. 2024;158:105227.
- 9. Choi B, Lee S. An empirically based agent-based model of the sociocognitive process of construction workers’ safety behavior. J Constr Eng Manage. 2018;144(2):04017102.
- 10. Zhang P, Li N, Jiang Z, Fang D, Anumba CJ. An agent-based modeling approach for understanding the effect of worker-management interactions on construction workers’ safety-related behaviors. Autom Constr. 2019;97:29–43.
- 11. Fang Q, Li H, Luo X, Ding L, Luo H, Rose TM, et al. Detecting non-hardhat-use by a deep learning method from far-field surveillance videos. Autom Constr. 2018;85:1–9.
- 12. Katsakiori P, Manatakis E, Goutsos S, Athanassiou G. Factors attributed to fatal occupational accidents in a period of 5 years preceding the Athens 2004 Olympic Games. Int J Occup Saf Ergon. 2008;14(3):285–92. pmid:18954538
- 13. Xiong R, Tang P. Pose guided anchoring for detecting proper use of personal protective equipment. Autom Constr. 2021;130:103828.
- 14. Ammad S, Alaloul WS, Saad S, Qureshi AH. Personal protective equipment (PPE) usage in construction projects: a scientometric approach. J Build Eng. 2021;35:102086.
- 15. Rasouli S, Alipouri Y, Chamanzad S. Smart personal protective equipment (PPE) for construction safety: a literature review. Saf Sci. 2024;170:106368.
- 16. Wang Z, Zhu Y, Ji Z, Liu S, Zhang Y. An efficient YOLOv8-based model with cross-level path aggregation enabling personal protective equipment detection. IEEE Trans Ind Inform. 2024.
- 17. Huang Y, Fang M, Peng J. Lightweight helmet target detection algorithm combined with Effici-Bi-Level Routing Attention. PLoS One. 2024;19(5):e0303866. pmid:38809845
- 18. Lin J, Zhao Y, Wang S, Tang Y. YOLO-DA: an efficient YOLO-based detector for remote sensing object detection. IEEE Geosci Remote Sens Lett. 2023:1–5.
- 19. Zhao J, Li Y, Cao J, Gu Y, Wu Y, Chen C, et al. An improved YOLOv5s model for building detection. Electronics. 2024;13(11):2197.
- 20. Gómez-de-Gabriel JM, Fernández-Madrigal JA, Rey-Merchán MC, López-Arquillos A. A safety system based on bluetooth low energy (BLE) to prevent the misuse of personal protection equipment (PPE) in construction. Saf Sci. 2023;158:105995.
- 21. Cheng T, Venugopal M, Teizer J, Vela P. Performance evaluation of ultra wideband technology for construction resource location tracking in harsh environments. Automation in Construction. 2011;20(8):1173–84.
- 22. Zhang S, Teizer J, Pradhanang N. Global positioning system data to model and visualize workspace density in construction safety planning. In: ISARC proceedings of the international symposium on automation and robotics in construction. Vol. 32. IAARC Publications; 2015. p. 1.
- 23. Zhang M, Ghodrati N, Poshdar M, Seet B-C, Yongchareon S. A construction accident prevention system based on the Internet of Things (IoT). Saf Sci. 2023;159:106012.
- 24. Riaz Z, Arslan M, Kiani AK, Azhar S. CoSMoS: a BIM and wireless sensor based integrated solution for worker safety in confined spaces. Autom Constr. 2014;45:96–106.
- 25. Lee H, Lee G, Lee S, Ahn CR. Assessing exposure to slip, trip, and fall hazards based on abnormal gait patterns predicted from confidence interval estimation. Autom Constr. 2022;139:104253.
- 26. Di B, Xiang L, Daoqing Y, Kaimin P. MARA-YOLO: an efficient method for multiclass personal protective equipment detection. IEEE Access. 2024.
- 27. Rubaiyat AH, Toma TT, Kalantari-Khandani M, Rahman SA, Chen L, Ye Y, et al. Automatic detection of helmet uses for construction safety. 2016 IEEE/WIC/ACM international conference on web intelligence workshops (WIW); 2016. p. 135–42.
- 28. Liu X, Ye X. Application of skin color detection and Hu moment in helmet recognition. J East China Univ Sci Technol: Nat Sci Ed. 2014;3:365–70.
- 29. Wu H, Zhao J. An intelligent vision-based approach for helmet identification for work safety. Computers in Industry. 2018;100:267–77.
- 30. Zou Z, Chen K, Shi Z, Guo Y, Ye J. Object detection in 20 years: a survey. Proc IEEE. 2023;111(3):257–76.
- 31. Xu P, Li F, Wang H. A novel concatenate feature fusion RCNN architecture for sEMG-based hand gesture recognition. PLoS One. 2022;17(1):e0262810. pmid:35051235
- 32. Jiang X, Wu Z, Han S, Yan H, Zhou B, Li J. A multi-scale approach to detecting standing dead trees in UAV RGB images based on improved faster R-CNN. PLoS One. 2023;18(2):e0281084. pmid:36827399
- 33. Fu Z, Yuan X, Xie Z, Li R, Huang L. Research on improved gangue target detection algorithm based on Yolov8s. PLoS One. 2024;19(7):e0293777. pmid:38980881
- 34. Zhang X, Chen F, Yu T, An J, Huang Z, Liu J, et al. Real-time gastric polyp detection using convolutional neural networks. PLoS One. 2019;14(3):e0214133. pmid:30908513
- 35. Nath ND, Behzadan AH, Paal SG. Deep learning for site safety: real-time detection of personal protective equipment. Autom Constr. 2020;112:103085.
- 36. Ioannou Y, Robertson D, Cipolla R, Criminisi A. Deep roots: improving CNN efficiency with hierarchical filter groups. Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 1231–40.
- 37. Sinha D, El-Sharkawy M. Thin mobilenet: an enhanced mobilenet architecture. 2019 IEEE 10th annual ubiquitous computing, electronics & mobile communication conference (UEMCON). IEEE; 2019. p. 0280–5.
- 38. Cai X, Lai Q, Wang Y, Wang W, Sun Z, Yao Y. Poly kernel inception network for remote sensing detection. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2024. p. 27706–16.
- 39. Wang C, Yeh I, Liao H. Yolov9: learning what you want to learn using programmable gradient information. arXiv Preprint. 2024.
- 40. Feng C, Zhong Y, Gao Y, Scott MR, Huang W. TOOD: task-aligned one-stage object detection. 2021 IEEE/CVF international conference on computer vision (ICCV). IEEE Computer Society; 2021. p. 3490–9.
- 41. Tian Z, Shen C, Chen H, He T. FCOS: fully convolutional one-stage object detection. arXiv Preprint. 2019.
- 42. Lee J, Park S, Mo S, Ahn S, Shin J. Layer-adaptive sparsity for the magnitude-based pruning. arXiv preprint arXiv:2010.07611. 2020.
- 43. Duan R, Deng H, Tian M, Deng Y, Lin J. SODA: a large-scale open site object detection dataset for deep learning in construction. Autom Constr. 2022;142:104499.
- 44. Chen K, Wang J, Pang J, Cao Y, Xiong Y, Li X, et al. MMDetection: open MMLab detection toolbox and benchmark. arXiv Preprint. 2019.