Abstract
Scale variation is a challenge in human pose estimation: the scale variation of the human body affects the accuracy and robustness of pose estimation. For example, the prediction accuracy for smaller joints (such as ankles and wrists) is lower than that for larger joints (such as head and shoulders). To address the impact of scale variations across parts of the human body on keypoint localization, we propose a Detail Enhanced High-Resolution Network (DE-HRNet), which efficiently extracts local detail features and mitigates the impact of scale variations in human pose estimation. First, we propose a Detail Enhancement Module (DEM) to relearn lost low-level detail features and enhance the model's ability to capture delicate local features, which is crucial for improving the accuracy of scale-varying keypoints. Second, we introduce an ultra-lightweight dynamic sampler, dySample, to replace nearest up-sampling. It reduces the loss of detail information from low-resolution features during up-sampling while preserving finer local representations at high resolution, which improves the robustness of the model on scale-varying keypoints. On the COCO test-dev2017 and MPII validation datasets, our method achieves 75.6 AP and 90.7 PCKh@0.5, respectively, improvements of 0.7 and 0.4 points over the High-Resolution Network (HRNet). Compared with other works, the proposed method performs well under scale variation.
Citation: Liu Y, Zhou G, He W, Zhu H, Cui Y (2025) DE-HRNet: Detail enhanced high-resolution network for human pose estimation. PLoS One 20(9): e0325540. https://doi.org/10.1371/journal.pone.0325540
Editor: Yongjie Li, University of Electronic Science and Technology of China, CHINA
Received: December 9, 2024; Accepted: May 14, 2025; Published: September 2, 2025
Copyright: © 2025 Liu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting Information files.
Funding: Funding Number: GZ20220131 Funding name: Optimization modeling of knowledge representation and interpretable confidence rule base in the education big data environment Supported by: Harbin Normal University Responsible person: Zhou Guohui Phone: 13796827949 The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
2D human pose estimation is a challenging task in computer vision, providing crucial technical support for downstream vision tasks such as pose tracking, action recognition, and action prediction. However, as a pivotal component of pose estimation, keypoint localization often faces persistent challenges in practical scenarios due to factors including scale variations [1,2], illumination changes, occlusions, and complex backgrounds, which collectively hinder the achievement of robust performance.
Scale variation presents a fundamental challenge in human pose estimation due to the inherent diversity of target human joints across practical scenarios. Human joints exhibit multi-scale properties where finger joints (3–5px in 640 × 480 images) require 10 × higher localization precision than torso joints [3]. These intrinsic challenges are exacerbated in real-world scenarios where camera-subject distances (5-50m), dynamic postures (e.g., crouching vs. jumping), and inter-person occlusions create compound scale variations within single frames.
Recent studies [4] have demonstrated that top-down methods, which typically localize keypoints at a single approximate scale, offer advantages in dealing with the varying scale of human joints. For example, Sun et al. [5] proposed the High-Resolution Network (HRNet), which extracts fine local representations through multi-resolution parallel interactions and preserves high-resolution representations to maintain accurate spatial information, as shown in Fig 1. The network has achieved good performance, providing a strong baseline for addressing scale variation in human pose. HRFormer [6], designed for dense prediction, enhances robustness to scale variations. Scale-aware HRNet variants, such as HigherHRNet [4] and SWAHR [7], have also achieved breakthroughs. In addition, Cai et al. [8] proposed the Residual Steps Network (RSN), which effectively aggregates features of the same spatial size (intra-level features) to obtain fine local representations; the network is advantageous for identifying joints with scale variation. However, RSN is too large in network scale, which incurs high resource costs and is not conducive to efficiency.
The multi-resolution stage modules are marked with blue areas. The remaining three stages consist of parallel multi-resolution subnetworks with multi-resolution information interactions.
Unlike bottom-up methods that focus on learning scale-aware representations, and unlike intra-level feature fusion for obtaining fine local representations, our method utilizes detailed information from high-resolution features to recover the local detail features lost at low resolutions, thereby improving the model's ability to capture local details. Inspired by HRNet [5,9,10,11], we observe that repeated downsampling causes a loss of detail information, which prevents the model from capturing enough features of small joints. Based on this, we propose a Detail Enhancement Module (DEM) that introduces low-level detail information to enable interaction of high-level features across channels and resolutions, thereby enhancing the model's ability to capture local details. Furthermore, we introduce an ultra-lightweight dynamic sampler (dySample) as a replacement for nearest up-sampling. This effectively reduces information loss during the up-sampling process and provides richer local and global feature information for inter-layer fusion. In conclusion, our method employs HRNet as the backbone, with the primary objective of addressing scale variability by improving prediction precision for keypoints across various scales. Our contributions can be summarized as follows:
- We propose a detail-enhanced high-resolution network (DE-HRNet). Derived from HRNet, it introduces dySample to reduce information loss during up-sampling, thereby improving the network's performance on keypoints with scale variations.
- We propose the Detail Enhancement Module (DEM), which effectively combines detail information from disparate resolution branches and enhances the feature representation of these branches. It refines the local representation of features by modeling channel-wise features.
- Experimental results demonstrate that our proposed method achieves improved accuracy on the COCO and MPII datasets while maintaining comparable performance to HRNet.
2 Related work
2.1 Top-down methods
In human pose estimation, top-down methods [5,8,9,12–14] first detect person instances in the input image and then localize keypoints within each detection, so the detected keypoints of a person are at an approximately uniform scale; this provides a fundamental basis for studying the scale variations of human keypoints. Consequently, we adopt a top-down approach to locate joints of different scales, which mitigates interference from varying human scales in multi-person scenes.
2.2 Feature fusion and up-sampling
Most existing human pose estimation methods adopt inter-layer feature fusion [5,9,15], which can integrate more spatial and semantic information. For example, Newell et al. [15] proposed a symmetric, hourglass-shaped convolutional neural network in which low-level features are upsampled in a single step and then combined with high-level features. A representative method for inter-layer feature fusion is FPN [16], which fuses upsampled high-level features with low-level features through element-wise summation. The process of feature fusion is as follows:
$F = F_h + \mathcal{U}(F_l)$

where $\mathcal{U}$ denotes upsampling, and $F_h$ and $F_l$ represent the high-resolution and low-resolution features before fusion, respectively. The results of inter-layer fusion are relatively coarse: simple upsampling operations can lead to fused features that lack fine local details. In addition, HRNet [5,9] provides high-frequency local details to low-resolution features through the downsampling of low-level features. The process is as follows:
$Y_s = \sum_{r} f_{r \to s}(X_r)$

where $r$ and $s$ index the resolution branches, $X_r$ is the feature of branch $r$, and $f_{r \to s}$ is the identity when $r = s$, strided-convolution downsampling when the target resolution is lower, and upsampling when it is higher. Considering that important detail information may be lost during the upsampling process, in this work the nearest up-sampling is substituted with an ultra-lightweight dynamic sampler, dySample [17], which retains more spatially sensitive detail information and improves sampling efficiency.
From [17], we observe that dySample is designed from a point-sampling perspective to guide high-resolution feature learning. With only a minimal parameter increase, it outperforms standard upsampling methods such as nearest and bilinear in accuracy. In addition, while alternative upsampling methods such as FreqFusion [18], A2U [19], and SAPA-B [20] achieve competitive performance, our evaluation of parameter count (Params) and computational complexity (FLOPs) based on [18] shows that dySample provides the best cost-effectiveness among these approaches.
In summary, dySample effectively mitigates these issues, leading to a more robust and consistent upsampling process.
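To make the replacement concrete, here is a minimal PyTorch sketch of point-based dynamic upsampling in the spirit of dySample [17]; the class name, the offset scale factor, and the zero initialization are our illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleSketch(nn.Module):
    """Sketch of point-based dynamic upsampling: a lightweight 1x1 conv
    predicts per-position sampling offsets, and the upsampled output is
    gathered with grid_sample instead of copying nearest neighbors."""
    def __init__(self, channels, scale=2):
        super().__init__()
        self.scale = scale
        # predicts (x, y) offsets for each of the scale*scale sub-positions
        self.offset = nn.Conv2d(channels, 2 * scale * scale, 1)
        nn.init.zeros_(self.offset.weight)  # start close to plain upsampling
        nn.init.zeros_(self.offset.bias)

    def forward(self, x):
        b, c, h, w = x.shape
        s = self.scale
        offsets = self.offset(x) * 0.25        # small learned perturbations (assumed scale)
        offsets = F.pixel_shuffle(offsets, s)  # (b, 2, h*s, w*s)
        # base sampling grid in normalized [-1, 1] coordinates
        ys = torch.linspace(-1, 1, h * s, device=x.device)
        xs = torch.linspace(-1, 1, w * s, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        grid = grid + offsets.permute(0, 2, 3, 1)
        return F.grid_sample(x, grid, align_corners=True)
```

Because the sampling points are learned per content, fine structures can be preserved where nearest upsampling would simply replicate pixels.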
2.3 Channel attention mechanism
Channel attention mechanisms [21–24] aim to enhance the model's capacity to represent specific task objectives by modeling relationships between feature channels. Hu et al. [21] proposed SENet, shown in the bottom block of Fig 2, which explicitly models the interdependencies between channels by leveraging global feature information. However, Wang et al. [23] argue that SENet's channel dimensionality reduction is detrimental to learning channel relationships and is inefficient; they therefore proposed ECANet, shown in the top block of Fig 2, which avoids dimensionality reduction and captures local cross-channel interactions through a one-dimensional convolution. For spatially position-sensitive tasks such as human pose estimation, the spatial positions of keypoints require strong local feature representations. Inspired by [21,22], we propose a Detail Enhancement Module (DEM) that incorporates the SE block to learn relationships between the channels of resolution features, enhancing the capability to capture detailed information and improving the representational capacity of local features in each resolution branch.
The top block (a) is the ECA block, which consists of average pooling, a fast 1D convolution of size k, and a sigmoid activation. The bottom block (b) is the SE block, which consists of average pooling, two fully-connected (FC) layers, and a sigmoid activation.
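For reference, the two attention blocks in Fig 2 can be sketched in PyTorch as follows; the reduction ratio `r` and kernel size `k` use common defaults and are assumptions here:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """SE block [21]: global average pooling, two FC layers with a
    reduction ratio r, and a sigmoid gate over channels."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        w = x.mean(dim=(2, 3))          # squeeze: (b, c)
        w = self.fc(w)                  # excitation
        return x * w[:, :, None, None]  # channel-wise reweighting

class ECABlock(nn.Module):
    """ECA block [23]: no dimensionality reduction; a 1D convolution of
    kernel size k captures local cross-channel interaction."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        w = x.mean(dim=(2, 3)).unsqueeze(1)        # (b, 1, c)
        w = torch.sigmoid(self.conv(w)).squeeze(1)
        return x * w[:, :, None, None]
```

The key contrast visible in the code: SE mixes all channels through a bottleneck, while ECA only mixes each channel with its k nearest neighbors.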
3 Method
The architecture of DE-HRNet is illustrated in Fig 3. Our method extends the HRNet [5,9] backbone network (the HRNet architecture is shown in Fig 1). In this section, we first provide a brief introduction to DE-HRNet, then explain our proposed DEM, and finally discuss the differences between our method and others.
The dySample and the Detail Enhancement Module (DEM) are applied to the HRNet [5,9], further implementation details are provided in Section 3.1.
3.1 Overview of DE-HRNet
The overall network structure of DE-HRNet is shown in Fig 3; it uses HRNet as the backbone network. HRNet [5,9] consists of four stages of parallel subnetworks. The first stage contains only the high-resolution subnetwork, constructed with convolution blocks. The second, third, and fourth stages consist of parallel high-to-low resolution subnetworks, constructed with convolution blocks and multi-resolution fusion.
We propose a Detail Enhancement Module (DEM) to strengthen feature representations across multi-resolution branches, which systematically improves multi-scale fusion through a cross-branch attention mechanism. As shown in Fig 3, this module is embedded after the feature extraction stages (second, third, and fourth) but prior to multi-resolution feature fusion. For the highest-resolution branch, variant (a) is adopted to enhance feature representation, employing a single input and a single output. For the other branches, variant (b) is utilized, featuring dual inputs (from the higher- and lower-resolution branches) and producing an output for the lower-resolution branch. In addition, to address the cross-resolution feature utilization gap, we integrate a content-aware adaptive sampler, dySample [17], which preserves critical structural information during upsampling through learnable point sampling, enabling low-loss propagation of enhanced low-resolution features to guide high-resolution refinement.
3.2 Detail enhancement module (DEM)
In our study, the Detail Enhancement Module (DEM) is primarily designed to enhance the capacity of each resolution branch to capture detailed features, thereby improving the network's ability to perceive keypoints at varying scales. The structure of the DEM is illustrated in Fig 4. Suppose $X_h$ and $X_l$ are the higher- and lower-resolution inputs of the DEM, and $Y$ is its output. The DEM contains two branches corresponding to the higher- and lower-resolution features: the output of the higher-resolution branch is the high-resolution feature itself, while the lower-resolution branch combines the higher- and lower-resolution features to capture delicate local representations.
The module is designed based on the SE block. Additionally, global average pooling and dropout [25] are located before and after the SE block, respectively. Dashed lines denote selective identity mappings.
The module operates in two steps. First, the high-resolution branch passes to the next stage via an identity mapping. For the low-resolution branch, we apply adaptive average pooling to the low-level features so as to preserve their semantic relationships, and then concatenate the high-level and low-level features to obtain a feature map $F$. Second, similar to SENet [21], we compute channel weights to re-calibrate the different feature channels and refine the local feature information. The generation process is formulated as:

$Y = F \otimes \mathrm{Dropout}(\omega(F))$

where $\otimes$ denotes element-wise multiplication, Dropout [25] is used to prevent overfitting, and $\omega(\cdot)$ is the channel weighting function:

$\omega(F) = \sigma(W_2\,\delta(W_1\,\mathrm{GAP}(F)))$

with $\mathrm{GAP}$ denoting global average pooling, $\delta$ the ReLU activation, $\sigma$ the sigmoid function, and $W_1$, $W_2$ the weights of the two fully-connected layers.
For the output of each resolution branch layer of HRNet, useful local features are extracted through consistent residual blocks. Each layer extracts only the local features of the current branch and lacks the lower-layer detail features needed to capture finer local representations. To address this lack of detailed features, the Detail Enhancement Module enhances the ability of each resolution branch to capture detailed features. In particular, we use an average pooling operation to introduce detail information from high-resolution features to boost weight learning while preserving the semantic relationships between neighboring features.
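A minimal PyTorch sketch of the DEM's two-branch computation, under our reading of this section, might look as follows; the reduction ratio, dropout rate, and the 1x1 projection back to the branch width are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DEM(nn.Module):
    """Sketch of the Detail Enhancement Module: the higher-resolution
    input is adaptively pooled to the lower resolution, concatenated
    with the lower-resolution input, and the concatenation is
    re-weighted channel-wise by an SE-style gate with dropout."""
    def __init__(self, c_high, c_low, r=16, p=0.2):
        super().__init__()
        c = c_high + c_low
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(c, c // r), nn.ReLU(inplace=True),
            nn.Linear(c // r, c), nn.Sigmoid(),
        )
        self.drop = nn.Dropout(p)
        self.proj = nn.Conv2d(c, c_low, 1)  # assumed projection back to branch width

    def forward(self, x_high, x_low):
        h, w = x_low.shape[-2:]
        # pool high-res detail to low resolution, preserving neighborhood semantics
        pooled = nn.functional.adaptive_avg_pool2d(x_high, (h, w))
        f = torch.cat((pooled, x_low), dim=1)
        weights = self.drop(self.gate(f))         # channel re-calibration
        f = f * weights[:, :, None, None]
        return self.proj(f)
```

The high-resolution branch itself is left untouched (identity mapping), matching variant (a) in Fig 3.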
3.3 Discussions
We use HRNet as the backbone network and the SE block as the core component of our DEM. Therefore, we discuss the differences of our method from two aspects:
Differences with HRNet [5]. Both HRNet and our method maintain high-resolution representations throughout the whole process, but they differ in the following aspects: HRNet achieves multi-resolution fusion through upsampling and downsampling across parallel resolution streams. In contrast, our method employs a DEM to augment fine-grained information in each parallel resolution branch while replacing nearest-neighbor upsampling with dySample. This design ensures that enhanced details are effectively fused across resolutions, optimizing feature preservation during cross-scale integration. Our method enhances the capability of capturing detailed features across branches of various resolutions, especially low-resolution branches, thereby improving the model's sensitivity to certain keypoints (such as ankles).
Differences with SENet [21]. Both models use the SE block as a component to adaptively recalibrate channel-wise feature responses, but they differ in the following aspects: SENet focuses on learning inter-channel dependencies, where the input at each stage is derived from the resolution-specific streams of the previous stage. In contrast, our DEM employs downsampled higher-resolution features from preceding stages combined with aggregated branches at the current resolution. We posit that integrating higher-resolution features from upper stages facilitates the aggregation of feature hierarchies absent in the current resolution, thereby enhancing spatial encoding quality and strengthening the representational capacity of CNNs. Additionally, the aggregated feature information aids in recalibrating overlooked or missing detail information through feature reweighting.
4 Experiments
To verify the impact of our method on the accuracy of keypoint localization under scale variations, we compare it with related methods on two datasets: the MS COCO dataset [26] and the MPII Human Pose dataset [27].
4.1 COCO keypoint detection
4.1.1 Dataset and evaluation metric.
The COCO dataset [26] comprises over 200,000 images and 250,000 annotated instances of humans with 17 keypoints, which is useful for learning complex models capable of precise localization. We utilize the COCO 2017 dataset, which is split into training, validation, and test sets at an approximate ratio. The evaluation metrics adopted are the average precision (AP) and average recall (AR), both based on object keypoint similarity (OKS):
$\mathrm{OKS} = \dfrac{\sum_i \exp(-d_i^2 / 2s^2 k_i^2)\,\delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$ [26]

where $d_i$ is the Euclidean distance between the $i$-th predicted keypoint and its ground truth, $s$ is the object scale, $k_i$ is a per-keypoint constant controlling falloff, and $v_i$ is the visibility flag of the ground-truth keypoint.
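For reference, the standard OKS computation can be expressed compactly in NumPy (the function name and argument layout are ours):

```python
import numpy as np

def oks(pred, gt, vis, area, k):
    """Object keypoint similarity as defined for COCO [26].
    pred, gt: (N, 2) predicted / ground-truth keypoint coordinates;
    vis: (N,) visibility flags; area: object segment area (s^2);
    k: (N,) per-keypoint constants controlling falloff."""
    d2 = np.sum((pred - gt) ** 2, axis=1)
    e = d2 / (2 * area * k ** 2 + np.finfo(float).eps)
    labeled = vis > 0
    return np.exp(-e)[labeled].sum() / max(labeled.sum(), 1)
```

AP and AR are then computed by thresholding OKS at a range of values, mirroring IoU-based detection metrics.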
4.1.2 Training.
The network is trained on 4 NVIDIA GeForce RTX 2080 Ti GPUs with a batch size of 32 per GPU. We use the Adam optimizer [28] with a base learning rate of 1e-3, dropped to 1e-4 at the 170th epoch and 1e-5 at the 200th epoch, for 210 epochs in total.
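This schedule corresponds to a standard `MultiStepLR` configuration in PyTorch; the placeholder model below stands in for DE-HRNet:

```python
import torch

# Adam with base lr 1e-3, decayed by 10x at epochs 170 and 200 (210 total).
model = torch.nn.Conv2d(3, 17, 1)  # placeholder for DE-HRNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[170, 200], gamma=0.1)
```

Calling `scheduler.step()` once per epoch reproduces the stated 1e-3 → 1e-4 → 1e-5 schedule.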
Following [9], we use the same human detection box cropping method, and each image is resized to a fixed input size. Data augmentation includes random scaling, random rotation, flipping, and half-body data augmentation.
4.1.3 Testing.
Consistent with the backbone network, we use Faster R-CNN [29] as the human detector. In line with the hourglass network [15], we average the predicted heatmaps from the flipped images with those from the original images to estimate joint positions, then offset the location by one quarter of the distance from the highest toward the second-highest response to determine the final predicted position.
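The decoding rule can be sketched in NumPy for a single-joint heatmap; approximating the quarter-pixel shift by the sign of the neighboring-response difference is the common implementation of this rule and an assumption here:

```python
import numpy as np

def decode_heatmap(hm):
    """Take the argmax of a single-joint heatmap, then shift the
    prediction a quarter pixel from the highest toward the
    second-highest response, approximated by the sign of the local
    difference along each axis."""
    h, w = hm.shape
    idx = np.argmax(hm)
    y, x = divmod(idx, w)
    px, py = float(x), float(y)
    if 0 < x < w - 1:
        px += 0.25 * np.sign(hm[y, x + 1] - hm[y, x - 1])
    if 0 < y < h - 1:
        py += 0.25 * np.sign(hm[y + 1, x] - hm[y - 1, x])
    return px, py
```

The sub-pixel shift reduces the quantization error introduced by predicting on a downsampled heatmap grid.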
4.1.4 Results on the validation set.
Table 1 reports the results of our method and other related methods. Our DE-HRNet-32, trained from a model pretrained on ImageNet, achieves a 76.4 AP score, outperforming other methods with the same input size. Compared to SimpleBaseline [31], our network improves AP by 1.9 points while requiring fewer than half the GFLOPs and parameters. Compared to HRNet-32 and HRNet-48, our networks DE-HRNet-32 and DE-HRNet-48, with slightly larger model size and slightly higher complexity, achieve gains of 0.6 and 0.3 points, respectively.
4.1.5 Results on the test-dev set.
Table 2 reports the results of our method and other state-of-the-art methods. First, our method outperforms bottom-up approaches designed to address scale variation challenges. Second, our DE-HRNet achieves 75.6 AP on the test set, showing a gain (+0.7) comparable to that on the validation set (+0.6), which substantiates its generalization capability across different data distributions. Notably, under the same input setting, the performance gap between our lightweight variant (33.6M parameters) and the full model (74.8M) becomes statistically insignificant, suggesting that model compression can be strategically applied in resolution-constrained scenarios. The sole limitation of our method compared to the original HRNet lies in its increased model size (by nearly one-fifth in parameters) and elevated computational complexity (by approximately one-tenth in FLOPs).
4.2 MPII human pose estimation
4.2.1 Dataset, training and evaluation metric.
The MPII Human Pose dataset [27] contains 40k person samples, each labeled with 16 joints. The dataset is split into training and validation sets in a ratio of approximately 7:6. Following [5], we adopt the same training, testing, and data augmentation strategies (e.g., the input is cropped to a fixed size). We employ the standard Percentage of Correct Keypoints (PCK) [27] metric, which measures the proportion of detected keypoints lying within a normalized distance threshold of their ground-truth locations. To ensure evaluative fairness, all performance metrics are reported using PCKh@0.5, a standardized protocol in which the matching threshold corresponds to 50% of the head segment length.
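A minimal NumPy sketch of the PCKh computation (function name and argument layout are ours):

```python
import numpy as np

def pckh(pred, gt, head_sizes, alpha=0.5):
    """PCKh@alpha: fraction of predicted joints within alpha * head
    segment length of the ground truth.
    pred, gt: (N, J, 2) keypoint coordinates; head_sizes: (N,)."""
    d = np.linalg.norm(pred - gt, axis=-1)   # (N, J) joint-wise distances
    thr = alpha * head_sizes[:, None]        # per-sample threshold
    return (d <= thr).mean()
```

With alpha = 0.5 this is exactly the PCKh@0.5 protocol used in Tables 3–5.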
4.2.2 Results on the validation set.
Table 3 reports the results of our method and others on the MPII validation set. For comparison with other methods, our results are rounded to one decimal place. Our method achieves a 90.7 PCKh@0.5 score; compared to HRNet and SimpleBaseline, it achieves gains of 0.4 and 0.5 points, respectively. In line with HRNet, we also trained a wider variant (DE-HRNet-w48); its accuracy is not significantly different.
4.3 Ablation study
In this subsection, through systematic ablation studies on the MPII human pose benchmark, we quantify the contribution of individual architectural components to keypoint prediction consistency, with all evaluations strictly adhering to the same input protocol.
Table 4 illustrates the effects of the two critical components of our method, dySample and the Detail Enhancement Module (DEM), on multi-joint prediction. We progressively integrated the DEM and the dynamic sampler (dySample) into the HRNet-W32 baseline through stepwise ablation studies. As shown in Table 4 (b), replacing only nearest upsampling with the dynamic sampler (dySample) yields 90.5 PCKh@0.5, a 0.2-point improvement over the baseline HRNet-W32. Table 4 (c) shows that integrating the Detail Enhancement Module (DEM) alone into the parallel multi-resolution branches attains 90.4 PCKh@0.5, a 0.1-point gain over the original HRNet-W32. Table 4 reveals two critical observations: first, the substitution with dySample effectively reduces detail loss through adaptive feature resampling, which significantly improves the model's capability in multi-scale joint prediction; second, the experimental results for the DEM validate that our Detail Enhancement Module effectively processes multi-scale features while augmenting reconstructed local details, thereby enhancing the overall performance of human pose estimation.
Additionally, we validate the impact of SENet and ECANet by replacing DEM. The results are shown in Table 5. The experimental results demonstrate that incorporating either SE blocks or ECA blocks leads to degraded prediction accuracy across all body joints. We therefore propose the Detail Enhancement Module (DEM) by reconstructing these baseline attention components, which improves multi-scale keypoint detection accuracy.
4.4 Discussions
Table 2 compares DE-HRNet with recent HRNet variants across two technical directions: Transformer-based adaptations (HRFormer [6] and TokenPose [34]) and quantization-error-aware frameworks (DARK [30] and UDP [35]); in addition, there are advanced CNN architectures (RSN [8] and MSPN [33]). DE-HRNet shows modest accuracy gaps compared to these specialized approaches (e.g., in AP against HRFormer on COCO test-dev). Unlike transformer-based methods [6,34], which capture global dependencies at higher computational cost, our convolutional enhancement module prioritizes local feature refinement for speed (ours 17.7 GFLOPs vs. HRFormer's 26.8 GFLOPs). Unlike quantization error correction methods [30,35], which primarily operate in pre-processing or post-processing stages, our approach enhances the network's capability to capture multi-scale joint features through embedded convolutional modules. While this design choice inevitably increases parameter count and model complexity, it fundamentally addresses scale variation from the feature learning perspective. Meanwhile, compared to large convolutional architectures [8,33], our HRNet-based method demonstrates distinct efficiency advantages: although it currently exhibits a modest performance gap (approximately 0.5 AP on COCO test-dev), it requires only 47% of the computational resources and 30% of the parameters of MSPN [33], making it more suitable for deployment-constrained scenarios. Therefore, DE-HRNet improves efficiency in addressing the challenge of scale variations.
To investigate the decreased prediction accuracy for knee and ankle joints observed in the ablation experiments of Table 4, we conducted further analysis, summarized in Table 5. We analyzed the causes of accuracy degradation through the changes in joint accuracy in configurations (a), (b), (e), and (g), where (a) is the HRNet baseline, (b) is the baseline with dySample, (e) is the baseline with the DEM and SE block, and (g) is the baseline with dySample, the DEM, and the SE block. Taking the knee joint accuracy as an example: (a) achieves 87.1, (b) 86.6, (e) 86.9, and (g) 86.7. The data suggest that the contextual information introduced by the DEM may interfere with detail enhancement for knee joints, leading to decreased prediction scores. Additionally, dySample [17], derived from point sampling, maintains depth values in flat regions while processing gradually changing depth values; we infer that dySample [17] is less effective than nearest upsampling when handling features with complex local depth variations.
A limitation of our current analysis is that it does not specifically address the robustness of our method in special scenarios such as occlusion or extreme pose variations. However, the COCO keypoint dataset [26] provides naturally challenging images with diverse human poses, varying body scales, and complex occlusion patterns.
5 Conclusion
In this paper, we aim to enhance the precision of anatomical keypoint localization under scale variations. We propose a novel Detail-Enhanced High-Resolution Network (DE-HRNet), an architectural extension of HRNet tailored for human pose estimation. Our contribution stems from two advancements: first, a Detail Enhancement Module (DEM) that facilitates multi-level feature fusion to preserve localized texture patterns in high-resolution streams; second, a dynamic sampler, dySample, which replaces nearest-neighbor interpolation upsampling and mitigates representational degradation during resolution recovery.
Our method is suitable for similarly position-sensitive vision applications, such as sports analytics or autonomous driving. It still has limitations: the analysis under extreme poses or heavy occlusion is insufficient, and the model contains a large number of parameters. In future work, we will explore Transformer-based approaches to address extreme poses and occlusions, and apply pruning and other compression techniques for model lightweighting. Additionally, we plan to explore 3D pose estimation, which could resolve many challenges (scale variations, illumination changes, occlusions, and complex backgrounds) in human pose estimation.
References
- 1. Singh B, Davis LS. An analysis of scale invariance in object detection - SNIP. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA; 2018. p. 3578–87.
- 2. Lin W, Wu Z, Chen J, Huang J, Jin L. Scale-aware modulation meet transformer. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Paris, France; 2023. p. 5992–6003.
- 3. Xu L, Jin S, Liu W, Qian C, Ouyang W, Luo P, et al. ZoomNAS: searching for whole-body human pose estimation in the wild. IEEE Trans Pattern Anal Mach Intell. 2023;45(4):5296–313. pmid:35939471
- 4. Cheng B, Xiao B, Wang J, Shi H, Huang TS, Zhang L. HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA; 2020. p. 5385–94.
- 5. Sun K, Xiao B, Liu D, Wang J. Deep High-Resolution Representation Learning for Human Pose Estimation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE; 2019. p. 5686–96.
- 6. Yuan Y, Fu R, Huang L. HRFormer: high-resolution transformer for dense prediction. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. 2021. p. 7281–93.
- 7. Luo Z, Wang Z, Huang Y, Wang L, Tan T, Zhou E. Rethinking the heatmap regression for bottom-up human pose estimation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA; 2021. p. 13259–68.
- 8. Cai Y, Wang Z, Luo Z, Yin B, Du A, Wang H, et al. Learning delicate local representations for multi-person pose estimation. In: Vedaldi A, Bischof H, Brox T, Frahm JM, editors. Lecture Notes in Computer Science. Cham: Springer International Publishing; 2020. p. 455–72.
- 9. Wang J, Sun K, Cheng T, Jiang B, Deng C, Zhao Y, et al. Deep high-resolution representation learning for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2021;43(10):3349–64. pmid:32248092
- 10. Tompson J, Goroshin R, Jain A, LeCun Y, Bregler C. Efficient object localization using Convolutional Networks. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA; 2015.
- 11. Yu C, Xiao B, Gao C, Yuan L, Zhang L, Sang N, et al. Lite-HRNet: A lightweight high-resolution network. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA; 2021. p. 10435–45.
- 12. Chen Y, Wang Z, Peng Y, Zhang Z, Yu G, Sun J. Cascaded pyramid network for multi-person pose estimation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT: IEEE; 2018. p. 7103–12.
- 13. Toshev A, Szegedy C. DeepPose: Human pose estimation via deep neural networks. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA; 2014. p. 1653–60.
- 14. Ke L, Qi H, Chang M-C, Lyu S. Multi-scale supervised network for human pose estimation. In: 2018 25th IEEE International Conference on Image Processing (ICIP). Athens, Greece; 2018. p. 564–8.
- 15. Newell A, Yang K, Deng J. Stacked hourglass networks for human pose estimation. In: Leibe B, Matas J, Sebe N, Welling M, editors. Lecture Notes in Computer Science. Cham: Springer International Publishing; 2016. p. 483–99.
- 16. Lin TY, Dollár P, Girshick R. Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. p. 2117–25.
- 17. Liu W, Lu H, Fu H. Learning to upsample by learning to sample. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. p. 6027–37.
- 18. Chen L, Fu Y, Gu L, Yan C, Harada T, Huang G. Frequency-aware feature fusion for dense image prediction. IEEE Trans Pattern Anal Mach Intell. 2024;46(12):10763–80. pmid:39186415
- 19. Dai Y, Lu H, Shen C. Learning affinity-aware upsampling for deep image matting. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2021. p. 6841–50.
- 20. Lu H, Liu W, Ye Z, Fu H, Liu Y, Cao Z. SAPA: Similarity-aware point affiliation for feature upsampling. In: Proceedings of Advances in Neural Information Processing Systems. 2022. p. 1–12.
- 21. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA; 2018.
- 22. Woo S, Park J, Lee J-Y, Kweon IS. CBAM: Convolutional block attention module. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y, editors. Lecture Notes in Computer Science. Cham: Springer International Publishing. 2018. p. 3–19.
- 23. Wang Q, Wu B, Zhu P. ECA-Net: Efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. p. 11534–42.
- 24. Xiao C, Yang W, Ouyang W, Ma C, Yuille AL, Wang X. Multi-context attention for human pose estimation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017. p. 5669–78.
- 25. Hinton GE. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint. 2012.
- 26. Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft COCO: common objects in context. In: Fleet D, Pajdla T, Schiele B, Tuytelaars T, editors. Lecture Notes in Computer Science. Cham: Springer International Publishing; 2014. p. 740–55.
- 27. Andriluka M, Pishchulin L, Gehler P, Schiele B. 2D human pose estimation: new benchmark and state of the art analysis. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE; 2014.
- 28. Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.
- 29. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2017;39(6):1137–49. pmid:27295650
- 30. Zhang F, Zhu X, Dai H. Distribution-aware coordinate representation for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. p. 7093–102.
- 31. Xiao B, Wu H, Wei Y. Simple baselines for human pose estimation and tracking. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y, editors. Lecture Notes in Computer Science. Cham: Springer International Publishing; 2018. p. 472–87.
- 32. Huang S, Gong M, Tao D. A coarse-fine network for keypoint localization. In: 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy; 2017.
- 33. Li W, Wang Z, Yin B, Peng Q, Du Y, Xiao T, et al. Rethinking on multi-stage networks for human pose estimation. arXiv preprint arXiv:1901.00148. 2019.
- 34. Li Y, Zhang S, Wang Z. TokenPose: Learning keypoint tokens for human pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. p. 11313–22.
- 35. Huang J, Zhu Z, Guo F. The devil is in the details: delving into unbiased data processing for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. p. 5700–9.