Abstract
Pose estimation is a crucial task in the field of human motion analysis, and detecting poses is a topic of significant interest. Traditional detection algorithms are not only time-consuming and labor-intensive but also suffer from deficiencies in accuracy and objectivity. To address these issues, we propose an improved pose estimation algorithm based on the YOLOv8 framework. By incorporating a novel attention mechanism, SimDLKA, into the original YOLOv8 model, we enhance the model’s ability to selectively focus on input data, thereby improving its decoupling and flexibility. In the feature fusion module of YOLOv8, we replace the original Bottleneck module with the SimDLKA module and integrate it with the C2F module to form the C2F-SimDLKA structure, which fuses global semantics more effectively, especially for medium and large targets. Furthermore, we introduce a new loss function, DCIOU, based on the YOLOv8 loss function, to improve the convergence of model training. Results indicate that our new loss function reduces the final loss value by 3–5 compared to other loss functions. Additionally, we have independently constructed a large-scale pose estimation dataset, HP, employing various data augmentation strategies, and utilized the open-source COCO and MPII datasets for model training. Experimental results demonstrate that, compared to the traditional YOLOv8, our improved algorithm increases the mAP value on the pose estimation dataset by 2.7% and the average frame rate by approximately 3 frames per second. This method provides a valuable reference for pose detection in pose estimation.
Citation: Xu X, Wu T, Du Z, Rong H, Wang S, Li S, et al. (2025) Enhanced human pose estimation using YOLOv8 with Integrated SimDLKA attention mechanism and DCIOU loss function: Analysis of human body behavior and posture. PLoS One 20(5): e0318578. https://doi.org/10.1371/journal.pone.0318578
Editor: Xu Yanwu, South China University of Technology, CHINA
Received: August 10, 2024; Accepted: January 19, 2025; Published: May 7, 2025
Copyright: © 2025 Xu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The underlying code and data are held at GitHub. Repository: https://github.com/Mrsaibei/yolov8.
Funding: This work was supported in part by the Ministry of Science and Technology of the People's Republic of China under the National Key Research and Development Program of China grant [2016YFB0303103 to XX] and the Natural Science Foundation of Nantong grant [MS2023074 to XX].
Competing interests: The authors declare that they have no competing interests.
1. Introduction
Against the backdrop of rapid advancements in modern technology, computer vision has gradually entered the public eye. Since the 1950s, there have been basic applications of computer vision. Neurophysiologists David Hubel and Torsten Wiesel discovered the relationship between visual neurons and moving edge stimuli through their experiments on cats, thus pioneering initial visual neural research [1]. Subsequently, devices capable of converting images into formats recognizable by computers were developed, leading to the advent of digital image processing methods. The journey of computer vision further extended to understanding the three-dimensional world to enhance image recognition capabilities. The formal research into computer vision began in the 1960s, with milestones including Roberts’ understanding of three-dimensional information, MIT’s Summer Vision Project, and the invention of CCDs, among others [2]. These developments each propelled the field forward in different ways. From the 1970s to the 1980s, computer vision began to establish itself as an independent discipline, with theoretical research and practical applications mutually reinforcing each other. This period saw influential developments, from MIT courses to Fukushima’s Neocognitron, and David Marr’s visual theories, all of which shaped foundational concepts and technologies in the field [3,4]. By the 1990s, feature and object recognition became research focal points, with significant theories and tools emerging, such as SIFT features and the extensive application of GPUs [5]. These advancements greatly enhanced the practicality of computer vision technologies.
Entering the 21st century, the development of high-quality, highly generalized datasets and the rise of deep learning ushered in a new golden era for computer vision. From Viola-Jones face detection to innovative applications of GANs, and the maturation of deep learning frameworks, computer vision achieved not only technical breakthroughs but also demonstrated tremendous potential and prospects in applications [6–8]. Today, computer vision is more than just a discipline; it is a capability that permeates all aspects of our lives, from security surveillance to social media, and from autonomous driving to content creation.
Pose estimation [9] is a crucial subfield of computer vision. Pose estimation involves enabling computers to infer human postures from images by recognizing key body parts, subsequently identifying and reasoning human motion poses. The recent progress in human pose estimation technology is mainly attributed to the development of deep learning, particularly the powerful capabilities of convolutional neural networks in image feature extraction. Researchers have proposed various algorithms and models, such as OpenPose, DeepCut, RMPE (AlphaPose), and Mask RCNN [10–13], each suitable for different application scenarios.
For example, OpenPose achieves accurate multi-person pose estimation by constructing part confidence maps and using Part Affinity Fields to prune bounding boxes. However, OpenPose’s computation cannot run efficiently on GPUs, making it suboptimal in terms of performance [10]. DeepCut uses a bottom-up approach by solving an integer linear programming problem to optimize the allocation of body parts, thereby estimating multi-person poses [11]. Nevertheless, DeepCut requires detecting all body parts in the image and then assigning parts to different individuals, making the computation process complex and unsuitable for lightweight applications on devices like drones. RMPE improves pose estimation accuracy by using a symmetric spatial transformer network to extract high-quality single-person regions from inaccurate bounding boxes [12]. However, RMPE heavily relies on the performance of the human detector; if the detector’s performance does not meet expectations, the entire pose recognition process can be affected. Moreover, these algorithms face significant challenges in complex environments such as overlapping individuals, varying lighting conditions, and occlusion of key body parts.
Recent advances in transformer-based architectures, such as the Vision Transformer (ViT) and transformer-based pose estimation models, have brought promising improvements to human pose estimation tasks. These models, which are adept at capturing long-range dependencies and learning global context, are increasingly being explored for pose estimation. For example, Transformer-based networks have demonstrated superior performance in addressing occlusions and complex background noise compared to traditional convolutional models. Several recent studies have leveraged transformers for pose estimation, showing significant improvements in both accuracy and computational efficiency, particularly when dealing with multi-person poses in cluttered environments [14–16]. Additionally, recent works have combined the strengths of transformers with convolutional neural networks in hybrid architectures, further enhancing the robustness and adaptability of pose estimation systems [17,18]. These new developments suggest that transformers, with their attention mechanisms and scalability, could be the key to overcoming many of the challenges.
To address these challenges, recent approaches have turned to more efficient real-time detection models, such as YOLO. The YOLO (You Only Look Once) algorithm is used for real-time detection, which shows considerable promise in adapting to pose estimation tasks. The advantage of YOLO is its ability to process images in one go, enabling efficient detection and localization of multiple objects or human poses in real time. By leveraging YOLO’s fast end-to-end architecture, researchers have begun exploring its potential for human pose estimation, enabling the network to simultaneously detect and localize key body parts. This method has significant advantages in speed and scalability, so we utilize the YOLO algorithm for the human pose estimation task.
1.1. Introduction to YOLOv8 algorithm
YOLOv8 (You Only Look Once, version 8) [19] is a real-time object detection algorithm that enhances detection speed and accuracy over previous YOLO versions. It predicts object bounding boxes and class probabilities directly from the entire image through a single neural network, supports GPU computation for efficient model training, and can be embedded into devices. The network structure includes three main components: the Backbone (extracts image features using modules like CBS, C2F, and SPPF in Darknet-53 [20]), the Neck (fuses extracted features), and the Head (detects fused features to generate final results). The network architecture of YOLOv8 is illustrated in Fig 1.
The Backbone is responsible for feature extraction. The Neck, situated between the Backbone and the Head, is responsible for feature fusion. The Head is responsible for outputting the detection results.
This paper employs the YOLOv8-pose network architecture for pose estimation [21]. Compared to YOLOv8’s object detection architecture, YOLOv8-pose adds a branch head in the output layer (the object detection output layer has two branches) specifically for keypoint detection; the architecture of this new branch head is identical to that of the original branches. YOLO-Pose [22], constrained to a budget of 150 GMACS, outperforms comparable algorithms and offers strong real-time detection capability, making it well suited to pose estimation.
2. Integration of attention mechanism and C2F
YOLOv8 offers extensive applicability across scenarios including object detection, segmentation, pose estimation, and tracking. For deployment on embedded or mobile platforms, it can incorporate lightweight processing such as replacing the backbone with PP-LCNet [23], adding depthwise separable convolutions [24], or using SCConv modules [25]. Although YOLOv8 prioritizes real-time detection over accuracy, integrating attention mechanisms enhances accuracy by focusing on key features while reducing parameters [26]. The C2F module, which improves accuracy by 2%, uses residual connections and an FPN structure for effective feature fusion [27]. This paper introduces a C2F-SimDLKA module with an attention mechanism, preserving the original network’s collaborative operation while improving feature extraction.
2.1. Introduction to the C2F module
YOLOv8’s backbone network, continuing YOLOv7’s architecture and similar to CSPDarkNet-53, uses the C2F module for improved gradient flow and convergence speed. The Neck part, referencing YOLOv5’s PAN-FPN, optimizes feature fusion with bottom-up and top-down pathways [28]. The Head part replaces the coupled head with a decoupled head structure, incorporating Distributional Focal Loss (DFL) [29], enhancing semantic and positional information. However, the C2F module’s standard convolution layers and residual connections limit flexibility and spatial transformations, suggesting room for structural enhancements.
2.2. Integration of C2F module
To ensure that the model can capture human motion poses more accurately and in real-time during pose estimation, this paper introduces the SimDLKA module to replace the Bottleneck module while maintaining a relatively simple structure. This is followed by integrating it with the C2F module to form the C2F-SimDLKA module, which is embedded into and replaces the original C2F module in the YOLOv8 network architecture.
2.3. DLKA and LKA attention mechanisms
Large Kernel Attention (LKA) is an attention mechanism specifically designed for visual tasks by Guo, which combines the advantages of convolutional neural networks (CNNs) and self-attention mechanisms (such as those used in Transformers) while avoiding their drawbacks [30]. The core idea of LKA is to capture long-range dependencies in images using large kernels, while still maintaining sensitivity to local structural information.
Fig 2 shows the decomposition diagram of the large kernel convolution. In LKA, the large kernel convolution is divided into three parts: a depthwise convolution (DW-Conv), a depthwise dilated convolution (DW-DConv), and a pointwise convolution (1×1 Conv). Specifically, an S×S convolution is decomposed into a ⌈S/d⌉ × ⌈S/d⌉ depthwise dilated convolution with a dilation rate of d, a (2d−1) × (2d−1) depthwise convolution, and a 1×1 convolution. The decomposed convolution has the advantages of low computational cost, few parameters, and the ability to capture long-range relationships. Thus, the LKA module can be formulated as follows:

Attention = Conv1×1(DW-DConv(DW-Conv(F)))

Output = Attention ⊗ F

The large kernel convolution in the figure is decomposed into a depthwise convolution, a depthwise dilated convolution, and a pointwise convolution.

F represents the input features. Attention represents the attention map, which indicates the importance of each feature, and ⊗ denotes element-wise multiplication.
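As a rough sanity check on the decomposition’s efficiency, the parameter counts of a dense S×S convolution and of the standard LKA decomposition (a (2d−1)×(2d−1) depthwise convolution, a ⌈S/d⌉×⌈S/d⌉ dilated depthwise convolution, and a 1×1 pointwise convolution) can be compared with a short sketch; the channel count C and the function name are illustrative, not from the paper:

```python
import math

def lka_param_counts(C, S, d):
    """Compare parameters of a dense SxS conv with the LKA decomposition."""
    dense = C * C * S * S                 # full SxS convolution over C channels
    dw = C * (2 * d - 1) ** 2             # (2d-1)x(2d-1) depthwise conv
    k = math.ceil(S / d)
    dw_dilated = C * k * k                # ceil(S/d) x ceil(S/d) dilated depthwise conv
    pw = C * C                            # 1x1 pointwise conv
    return dense, dw + dw_dilated + pw

dense, decomposed = lka_param_counts(C=64, S=21, d=3)
```

For C = 64 and a 21×21 kernel with dilation 3, the decomposition requires orders of magnitude fewer parameters than the dense convolution while preserving the large receptive field.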
The DLKA structure diagram is shown in Fig 3. It includes deformable depthwise convolutions and deformable dilated convolutions, which can adaptively adjust their receptive fields to better capture the spatial distribution of critical information in images. DLKA maintains computational efficiency while effectively handling a broader range of contextual information. Additionally, the deformable nature of DLKA allows the network to adapt more flexibly to various shapes and sizes of objects, thereby improving the overall performance and robustness of the model.
The FFN module is composed of deformable convolutions.
2.4. Construction of the SimDLKA module
Real-time performance is a major advantage of the YOLO algorithm, and thus, maintaining real-time capability is crucial when optimizing the model. The introduction of the DLKA module inevitably increases computational costs, potentially reducing the model’s real-time performance. Therefore, this paper proposes the SimDLKA module, which enhances detection accuracy while maintaining a simple structure.
The original DLKA module involves multiple processes. To simplify the structure and reduce computational costs, we combined the convolutional layer with the normalization layer. For a batch of data, the steps to merge the convolutional layer and normalization layer are as follows:
y_i = γ · (x_i − μ) / √(σ² + ε) + β

where x_i is the i-th input; y_i is the i-th output; γ is the scaling factor; β is the shifting factor; μ is the mean of the input batch; σ² is the variance of the input batch; and ε is a very small value to avoid division by zero.
For C feature maps F, the normalization process can be written as follows:

F̂_{c,i,j} = γ_c · (F_{c,i,j} − μ_c) / √(σ_c² + ε) + β_c

where F̂_{c,i,j} is the output feature map at position (i, j) for channel c, and F_{c,i,j} is the input feature map at position (i, j) for channel c.
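As a minimal numerical sketch of this normalization step (pure Python, single channel; γ, β, and the inputs are illustrative values):

```python
import math

def batch_norm(xs, gamma, beta, eps=1e-5):
    """Normalize a batch to zero mean / unit variance, then scale and shift."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]

ys = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=1.0, beta=0.0)
```

With γ = 1 and β = 0, the output batch has (approximately) zero mean and unit variance, matching the formula above.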
It can be observed that the normalization result is equivalent to a 1×1 convolution. Therefore, normalization can be directly integrated into the convolution operation, giving the following equation:

BN(Conv(x)) = γ · (W ∗ x + b − μ) / √(σ² + ε) + β

where W is the weight parameter during convolution; b is the bias during convolution; γ is the weight parameter during normalization; β is the bias during normalization; the normalization itself acts as a convolution kernel of size 1×1; C_in is the number of channels in the input layer; and k is the size of the convolution kernel.

By expanding the equation, we get the updated weight parameter W′ and bias b′:

W′ = γ · W / √(σ² + ε)

b′ = β + γ · (b − μ) / √(σ² + ε)
By using the above equations, we combine the normalization layer and the convolution layer, merging what originally required two separate convolution operations into one. This integration not only simplifies the processing flow of the DLKA module but also significantly reduces computational costs due to the frequent execution of normalization and convolution layers in the module. Therefore, this method of merging convolution operations effectively reduces resource consumption while enhancing processing efficiency. We replace the Bottleneck part of YOLOv8 with the SimDLKA module and integrate it with the C2F module, resulting in the C2F-SimDLKA module structure as shown in Fig 4.
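The merging step can be verified numerically. The sketch below, which assumes a single-channel 1×1 convolution for clarity (all weights and statistics are illustrative), checks that the fused weight W′ = γW/√(σ² + ε) and bias b′ = β + γ(b − μ)/√(σ² + ε) reproduce the separate convolution-then-normalization result:

```python
import math

def fuse_conv_bn(W, b, gamma, beta, mu, var, eps=1e-5):
    """Fold BatchNorm parameters into the convolution's weight and bias."""
    s = gamma / math.sqrt(var + eps)
    return W * s, beta + s * (b - mu)

def conv1x1(xs, W, b):
    return [W * x + b for x in xs]

def bn(xs, gamma, beta, mu, var, eps=1e-5):
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]

xs = [0.5, -1.0, 2.0]
W, b, gamma, beta, mu, var = 1.5, 0.2, 0.9, 0.1, 0.3, 1.2
separate = bn(conv1x1(xs, W, b), gamma, beta, mu, var)
Wf, bf = fuse_conv_bn(W, b, gamma, beta, mu, var)
fused = conv1x1(xs, Wf, bf)   # one pass, same result as conv followed by BN
```

The two outputs agree to floating-point precision, which is why the fusion halves the per-module operation count at inference time.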
Each SimDLKA module splits into two channels: one channel passes the processed features to the next SimDLKA module, while the other channel retains the features for later concatenation. Finally, after passing through n SimDLKA modules, all features are fused together.
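The split-and-concatenate flow described above can be sketched abstractly. This is a hypothetical toy, with plain Python lists standing in for feature maps and a placeholder transform standing in for the SimDLKA module:

```python
def c2f_flow(features, stages):
    """Pass features through each stage, retaining every intermediate output
    for a final concatenation (the C2F-style split/concat pattern)."""
    retained = [features]
    x = features
    for stage in stages:
        x = stage(x)
        retained.append(x)                 # one branch keeps the features
    return [v for part in retained for v in part]  # fuse all branches

# placeholder "SimDLKA" stages that simply shift the features
out = c2f_flow([1, 2], [lambda xs: [v + 1 for v in xs]] * 2)
```

The final fused output contains the input plus every stage’s result, which is what lets gradients and semantics from all depths reach the detection head.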
3. Improved DCIOU loss function
The YOLOv8 algorithm demonstrates significant advantages in real-time object detection, achieved by sacrificing some detection accuracy. To further optimize the algorithm, this paper proposes a new loss function. The DCIOU loss function, the latest technique proposed in this paper, redefines the term measuring the aspect ratio difference of bounding boxes, enhancing the model’s convergence capability. This improves the extraction of key human body features and increases the model’s accuracy.
The original YOLO-pose for pose estimation used the CIOU loss function, which is derived from the DIOU loss function. The Distance-IoU (DIoU) loss function is an optimization method designed for bounding box regression [31]. By incorporating the Euclidean distance between the center points of the predicted and ground truth boxes, the DIoU loss function improves the training convergence speed, outperforming the traditional IoU (Intersection over Union) and GIoU (Generalized Intersection over Union) loss functions [32]. The DIoU loss function is defined as follows:

L_DIoU = 1 − IoU + ρ²(b, b^gt) / c²

where IoU is the Intersection over Union, the ratio of the overlapping area between the predicted box and the ground truth box to their combined area; ρ(b, b^gt) represents the Euclidean distance between the center points of the predicted box and the ground truth box; and c is the diagonal length of the smallest enclosing box that covers both the predicted and ground truth boxes.
The DIoU loss function adds a penalty term for the center point distance to the IoU loss, making the loss function better account for the geometric center distance between the bounding boxes. This results in faster model convergence and improved accuracy in bounding box regression. However, the DIoU loss function does not consider the aspect ratio consistency of the bounding boxes. Therefore, Zheng et al. proposed the improved CIOU loss function, defined as follows:

L_CIoU = 1 − IoU + ρ²(b, b^gt) / c² + αν

ν = (4 / π²) · (arctan(w^gt / h^gt) − arctan(w / h))²

where w^gt, h^gt and w, h are the widths and heights of the ground truth box and the predicted box, respectively; α is a weight coefficient used to balance the importance of aspect ratio consistency; and ν is a term that measures the difference in aspect ratios between the bounding boxes.
The CIOU loss function not only considers the overlapping area of the boxes and the distance between their center points but also adds a penalty term for the aspect ratio difference. This enables the loss function to perform better when handling targets of different shapes and sizes. However, the CIOU loss function has certain limitations. When the height and width of the ground truth box and the predicted box are exactly equal, i.e., w = w^gt and h = h^gt, ν becomes zero, causing the CIOU loss function to degrade into the DIoU loss function. This degradation can reduce the function’s convergence ability, especially in human pose estimation, where slower convergence can lead to failure in capturing key point features and cause the entire model to fail. Therefore, this paper proposes a new loss function, DCIOU (Double-Complete-IoU), which retains the advantages of the CIOU loss function while avoiding its degradation. It is defined as follows:

L_DCIoU = 1 − IoU + ρ²(b, b^gt) / c² + αν′

ν′ = (4 / π²) · (arctan((w^gt + ϵ) / h^gt) − arctan(w / (h + ϵ)))²

In the equation, ϵ is a very small value. As shown, the DCIOU loss function has penalty terms similar to those of the CIOU loss function: it also considers the overlapping area of the boxes, the distance between the center points, and the shape of the boxes. The difference lies in the modification of ν. Due to the addition of ϵ, the new ν′ value differs from the old ν value, and since ϵ is very small and the arctan function is not sensitive to small perturbations, ν′ retains the advantage of measuring the aspect ratio difference of the bounding boxes. The inclusion of ϵ ensures that the two arctan terms can never be exactly equal, so ν′ never becomes zero, meaning the model will always consider the importance of aspect ratio consistency. This increases the model’s convergence ability and prevents the loss function from degrading in certain situations.
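For reference, here is a minimal pure-Python sketch of the IoU and CIoU terms for axis-aligned boxes in (x1, y1, x2, y2) form; the DCIoU variant would differ only in perturbing the aspect-ratio term ν with the small ϵ, so it is omitted here:

```python
import math

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def ciou_loss(pred, gt):
    """Standard CIoU loss: IoU term + center-distance term + aspect-ratio term."""
    i = iou(pred, gt)
    cx = lambda r: ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)
    (px, py), (gx, gy) = cx(pred), cx(gt)
    # squared center distance over squared enclosing-box diagonal
    ex1, ey1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    ex2, ey2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    rho2 = (px - gx) ** 2 + (py - gy) ** 2
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / (1 - i + v + 1e-9)
    return 1 - i + rho2 / c2 + alpha * v
```

A perfectly matching box yields a loss of zero, since every penalty term vanishes when IoU = 1.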
To verify the convergence ability of the DCIOU loss function in pose estimation, we trained the original YOLOv8 model on the COCO dataset using the CIOU loss function. Then, we replaced the CIOU loss function in the YOLOv8 model with the improved DCIOU loss function and continued training on the COCO dataset. Fig 5 shows the training loss curves under the different loss functions.
As can be seen from Fig 5, from the 0th epoch to the 20th epoch, the curves of the DCIOU, DIOU, and CIOU loss functions all decline very quickly, with no significant difference in their rates of decline. After the 20th epoch, differences emerge among the three loss functions: the DCIOU loss function declines fastest, followed by the CIOU loss function, while the DIOU loss function declines slightly more slowly than the CIOU loss function. After the 90th epoch, the three loss curves slowly converge, with the DCIOU, DIOU, and CIOU loss functions finally stabilizing at 42.6, 48.7, and 46.6, respectively. The results show that the DCIOU loss function reduces the final loss value by 3–5 compared to the other loss functions and obtains reasonable parameters faster, thereby achieving better training results. Therefore, in this training, the DCIOU loss function showed faster convergence, better stability, and better performance than the other two loss functions.
4. Experimental section
4.1. Dataset introduction
In the experimental part, we selected two open-source datasets widely used by researchers, namely the COCO [33] dataset and the MPII [34] dataset. To ensure the diversity of the images, we also independently constructed an HP dataset. The HP dataset contains more than 5,000 human pose images manually annotated by professionals, covering subjects of different ages, genders, and poses. The subjects are set in different life scenes, and 20 key points are annotated for human body parts. We also used data augmentation techniques [35] (such as rotation, scaling, flipping, and color adjustment) to preprocess the dataset to improve the generalization and robustness of the model.
4.2. Experimental preparation
In this experiment, we used an NVIDIA RTX 3060 Ti GPU, CUDA 11.4, and PyTorch 1.10.0. This combination of software and hardware provides strong support for our research. In our human pose estimation tasks, we compared several widely used pose estimation models, specifically OpenPose, DeepCut, AlphaPose, HRNet [36], FasterPose [37], TransPose [38], and YOLOv8-pose. In this experiment, we uniformly selected 300 epochs, a batch size of 32, an initial learning rate (lr0) of 0.01, and a final learning rate of 0.1. We set the dropout to 0.1, the initial warm-up momentum to 0.85, and the warm-up bias to 0.12. Other network parameters were left at default settings to align with the model training tasks. AP values reflect the model’s accuracy; in this paper, AP@0.5 represents the AP value at an IoU threshold of 0.5. AP@M and AP@L calculate the AP values for medium and large target objects, respectively. In the experiments, the IoU thresholds were set at 0.5 and 0.75 to observe the model’s performance on different sizes of targets.
4.3. Result analysis
As shown in Table 1, this paper analyzes the Average Precision (AP) values of different human pose estimation models compared to our improved model on the COCO, MPII, and HP datasets. We follow standard evaluation metrics and use OKS-based metrics for pose estimation.
On the COCO dataset, the improved model exceeds the others in AP@50 (85.18), AP@M, and AP@L (76.86), with strong performance in AP@75 (56.25), comparable to TransPose (56.18). On the MPII dataset, it leads in all metrics: AP@50 (83.04), AP@75 (56.15), AP@M (61.51), and AP@L (71.18), maintaining superior pose estimation. On the HP dataset, the improved model also tops all metrics, excelling in large target recognition with AP@50 (79.88), AP@75 (52.85), AP@M (56.24), and AP@L (73.16). Consistent strong performance across datasets indicates the model’s superior accuracy and recall, especially for large targets. Fig 6 visualizes the table’s content, providing a more intuitive sense of the performance differences between models on the different datasets. These results have reference and research value for the field of human pose estimation.
To compare the inference speed of different pose estimation models, this paper uses Frames Per Second (FPS) as the indicator. All models were run in the same hardware environment, and the software environment was kept consistent. We conducted inference on all three datasets to ensure comprehensive testing. Each model was tested sequentially, recording the time to process a batch of images, and multiple runs were performed to reduce random errors, with the average value taken as the final result. The FPS value for each model was then calculated, determined by the number of images in a batch divided by the average inference time. The results are shown in Table 2.
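The FPS computation described above amounts to the following short sketch (batch size and timings are illustrative, not measured values from the paper):

```python
def fps_from_runs(batch_size, elapsed_times):
    """Average several timed inference runs and convert to frames per second."""
    avg = sum(elapsed_times) / len(elapsed_times)   # mean seconds per batch
    return batch_size / avg

fps = fps_from_runs(batch_size=32, elapsed_times=[0.85, 0.80, 0.90])
```

Averaging over multiple runs before dividing smooths out one-off timing jitter, which is why the procedure repeats each measurement.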
The FPS data shows that the YOLOv8-pose model has the highest inference speeds across all datasets—33, 36, and 36 FPS on COCO, MPII, and HP datasets, respectively, demonstrating its superiority in real-time detection. TransPose and HRNet have lower FPS values, not exceeding 12 FPS, due to their complex structures and high parameter counts. Our improved model outperforms YOLOv8-pose with even higher FPS values—37, 40, and 39 FPS on the COCO, MPII, and HP datasets, respectively, indicating better optimization for speed. This advantage is crucial for real-time applications, as depicted in Fig 7, highlighting our model’s efficiency over others in processing speed.
Blue represents AP@50, orange represents AP@75, green represents AP@M, and red represents AP@L.
4.4. Ablation study
The ablation study’s core idea is to evaluate each component’s impact on system performance. We started with a baseline YOLOv8-pose model. Experiment one used the baseline model, experiment two added the LKA module, and experiment three added the SimDLKA module, testing on COCO, MPII, and HP datasets.
Table 3 showed that experiment two slightly improved over the baseline, indicating the LKA module’s positive effect. Experiment three achieved the highest AP values in all metrics, especially AP@M and AP@L, the average AP values for the three datasets were 59.02 and 73.73 respectively. It confirms the SimDLKA module’s significant performance enhancement for medium and large targets. Consistent improvements were observed across all datasets, demonstrating the SimDLKA module’s effectiveness in real-time pose estimation for large targets.
We studied the impact of different loss functions (GIOU, DIOU, CIOU, and DCIOU) on pose estimation. As can be seen from Table 4, on the COCO dataset, GIOU showed lower AP values. DIOU improved all metrics, especially AP@L (76.13), indicating better large-target detection. CIOU slightly outperformed DIOU in AP@50 and AP@75. DCIOU provided optimal performance across all indicators, demonstrating superior localization accuracy and size consistency. On the MPII dataset, GIOU’s performance was relatively low. DIOU significantly improved AP@M and AP@L, with CIOU slightly better in AP@L. DCIOU again excelled in all metrics, proving its superiority. On the HP dataset, GIOU had the lowest AP values. DIOU and CIOU performed better in AP@M and AP@L, but DCIOU excelled in all AP indicators, particularly AP@L (73.16).
Hence, DCIOU outperforms GIOU, DIOU, and CIOU across all datasets, providing better predictive bounding boxes for medium and large targets, significantly enhancing pose estimation performance.
Conclusion
This paper introduces the SimDLKA module to replace the Bottleneck module in YOLOv8-pose and integrates it with the C2F module, enhancing feature extraction and proposing an optimized DCIOU loss function for better convergence. Experiments show improved accuracy and speed, especially in large object detection, providing valuable insights for pose estimation research and applications. Despite its strengths, the model has reduced sensitivity to small objects and extreme poses and faces challenges in varied lighting and dynamic backgrounds.
In the future, work can proceed along two lines: extracting more refined image features and adapting better to environmental changes. Continuously updating the model is also particularly important so that it remains competitive with newer models. In addition, combining pose estimation with fields such as transportation can broaden its scope of application. For example, in the transportation field, the danger level of pedestrians can be assessed by estimating their posture and uploading the data to the cloud; a vehicle then receives this information and takes reasonable avoiding action. Through deeper exploration of pose estimation, future pose estimation technology can not only make further academic progress but also play a greater role in practical applications, further promoting the development of intelligent systems, human-computer interaction, and human motion analysis.
Supporting information
S1 Data. Source data for figures and tables.
https://doi.org/10.1371/journal.pone.0318578.s001
(ZIP)
S1 File. Implementation code and computational results.
https://doi.org/10.1371/journal.pone.0318578.s002
(ZIP)
References
- 1. Lukiw WJ. David Hunter Hubel, the “Circe effect”, and SARS-CoV-2 infection of the human visual system. Front Biosci (Landmark Ed). 2022;27(1):7. pmid:35090312
- 2. Smith ML, Smith LN, Hansen MF. The quiet revolution in machine vision - a state-of-the-art survey paper, including historical review, perspectives, and future directions. Computers in Industry. 2021;130:103472.
- 3. Goyal P, Verma DK, Kumar S. Plant Leaf Disease Detection Using an Optimized Evolutionary Gravitational Neocognitron Neural Network. Natl Acad Sci Lett. 2024;47(4):347–54.
- 4. Martinez-Conde S, Macknik SL, Heeger DJ. An Enduring Dialogue between Computational and Empirical Vision. Trends Neurosci. 2018;41(4):163–5. pmid:29602332
- 5. Dong L, Jiao N, Zhang T, Liu F, You H. GPU Accelerated Processing Method for Feature Point Extraction and Matching in Satellite SAR Images. Applied Sciences. 2024;14(4):1528.
- 6. Ananthi G, Pujaa M, Amretha VM. Eye gaze capture for preference tracking. Multimed Tools Appl. 2023;83(16):47139–50.
- 7. Farhadinia B, Ahangari MR, Heydari A, Datta A. A generalized optimization-based generative adversarial network. Expert Systems with Applications. 2024;248:123413.
- 8. Upadhyay A, Meena YK, Chauhan GS. SatCoBiLSTM: Self-attention based hybrid deep learning framework for crisis event detection in social media. Expert Systems with Applications. 2024;249(1):Art no. 123604.
- 9. Dang M, Liu G, Xu Q, Li K, Wang D, He L. Multi-object behavior recognition based on object detection for dense crowds. Expert Systems with Applications. 2024;248:123397.
- 10. Cao Z, Simon T, Wei S-E, Sheikh Y. Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:7291–9.
- 11. Pishchulin L, et al. DeepCut: Joint subset partition and labeling for multi person pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:4929–37.
- 12. Fang H-S, Xie S, Tai Y-W, Lu C. RMPE: Regional multi-person pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision. 2017:2334–43.
- 13. He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. 2017:2961–9.
- 14. Bertasius G, Wang H, Torresani L. Is space-time attention all you need for video understanding? In: ICML. 2021;2(3):4.
- 15. Zheng C, Zhu S, Mendieta M, Yang T, Chen C, Ding Z. 3D human pose estimation with spatial and temporal transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021:11656–65.
- 16. Shi D, Wei X, Li L, Ren Y, Tan W. End-to-end multi-person pose estimation with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022:11069–78.
- 17. Chen D, Wu L, Chen Z, Lin X. CTHPose: An efficient and effective CNN-transformer hybrid network for human pose estimation. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV). 2023:327–39.
- 18. Alomar K, Aysel HI, Cai X. RNNs, CNNs and transformers in human action recognition: A survey and a hybrid model. arXiv preprint. 2024.
- 19. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:779–88.
- 20. Redmon J, Farhadi A. YOLOv3: An incremental improvement. arXiv preprint. 2018.
- 21. Kuzdeuov A, Taratynova D, Tleuliyev A, Varol HA. OpenThermalPose: An open-source annotated thermal human pose dataset and initial YOLOv8-Pose baselines. Authorea Preprints; 2024.
- 22. Maji D, Nagori S, Mathew M, Poddar D. YOLO-Pose: Enhancing YOLO for multi-person pose estimation using object keypoint similarity loss. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022:2637–46.
- 23. Cui C. PP-LCNet: A lightweight CPU convolutional neural network. arXiv preprint. 2021.
- 24. Howard AG, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint. 2017.
- 25. Li J, Wen Y, He L. SCConv: Spatial and channel reconstruction convolution for feature redundancy. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023:6153–62.
- 26. Vaswani A, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017;30.
- 27. Lin T-Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S. Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:2117–25.
- 28. Liu S, Qi L, Qin H, Shi J, Jia J. Path aggregation network for instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018:8759–68.
- 29. Lin T-Y, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. 2017:2980–8.
- 30. Guo M-H, Lu C-Z, Liu Z-N, Cheng M-M, Hu S-M. Visual attention network. Comp Visual Med. 2023;9(4):733–52.
- 31. Zheng Z, Wang P, Liu W, Li J, Ye R, Ren D. Distance-IoU loss: Faster and better learning for bounding box regression. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2020;34(07):12993–3000.
- 32. Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I, Savarese S. Generalized intersection over union: A metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019:658–66.
- 33. Zhang J, Chen Z, Tao D. Towards high performance human keypoint detection. International Journal of Computer Vision. 2021;129(9):2639–62.
- 34. Zhang F, Zhu X, Ye M. Fast human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019:3517–26.
- 35. Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. Journal of Big Data. 2019;6(1):1–48.
- 36. Wang J, Sun K, Cheng T, Jiang B, Deng C, Zhao Y, et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans Pattern Anal Mach Intell. 2021;43(10):3349–64. pmid:32248092
- 37. Dai H, Shi H, Liu W, Wang L, Liu Y, Mei T. FasterPose: A faster simple baseline for human pose estimation. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM). 2022;18(4):1–16.
- 38. Yang S, Quan Z, Nie M, Yang W. TransPose: Keypoint localization via transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021:11802–12.