Abstract
This paper proposes a unified skeleton-based framework for 3D human pose estimation and Parkinson’s disease classification, integrating a Dual-Stage Adaptive Temporal Perception (DATP) strategy and an Adaptive Graph Topology Modeling Network (AGTM-Net). DATP enhances robustness to joint occlusion and sequence degradation through occlusion-aware interpolation, trend-extrapolated frame padding, and multi-scale spatiotemporal modeling. On the MPI-INF-3DHP dataset with 16 missing joints, DATP achieves 77.72 PCK and 43.57 AUC, outperforming state-of-the-art methods. On Human3.6M, DATP also shows strong generalization with MPJPE reduced to 32.68 mm. For clinical classification, AGTM-Net dynamically models skeletal structure variations and achieves an F1-score of 0.898 and accuracy of 0.881 in distinguishing healthy individuals from Parkinson’s patients with a score of 0 based on the “3.9 Arising from Chair” task. Interpretability analyses—based on gradient and perturbation methods—highlight the spine, chest, and hips as decisive joints, aligning with clinical understanding of gait disorders and enhancing the model’s transparency and clinical reliability.
Citation: Zuo M, Li J, Chang M, Zhang Q, Fan S (2026) Robust 3D Pose estimation and Parkinson’s Disease classification via Dual-Stage Adaptive Temporal Perception and graph topology modeling network. PLoS One 21(3): e0344375. https://doi.org/10.1371/journal.pone.0344375
Editor: Paulo Jorge Simões Coelho, Polytechnic Institute of Leiria: Instituto Politecnico de Leiria, PORTUGAL
Received: June 18, 2025; Accepted: February 19, 2026; Published: March 19, 2026
Copyright: © 2026 Zuo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The code and datasets used in this project are available at the following GitHub repository: https://github.com/JL-Li-st/DATP.
Funding: This research was funded by the National Natural Science Foundation of China under Grant No. 62433002 and No. 62476014, the Project of Construction and Support for High-level Innovative Teams of Beijing Municipal Institutions under Grant No. BPHR20220104, and the Beijing Scholars Program under Grant No. 099. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. No authors received a salary from any of the funders.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Human 3D pose estimation and skeleton-based motion analysis are fundamental tasks with wide applications in healthcare, human-computer interaction, and robotics. In particular, accurate 3D motion modeling from 2D inputs is crucial for clinical scenarios such as Parkinson’s disease diagnosis and symptom assessment. However, applying these technologies in real-world clinical settings faces significant hurdles. Clinical videos often suffer from severe joint occlusions (e.g., body parts blocked by furniture or self-occlusion) and varying motion speeds. Standard computer vision models typically struggle to reconstruct coherent skeletons under these conditions, leading to “jittery” or incomplete data that is unreliable for medical diagnosis. Furthermore, traditional classification models often treat the human skeleton as a static graph, failing to capture the dynamic, subtle motion variations characteristic of early-stage Parkinson’s disease (PD).
To address these challenges, we propose a unified skeleton-based framework that integrates robust pose estimation with precise disease classification. First, to ensure data quality, we develop a Dual-Stage Adaptive Temporal Perception (DATP) strategy. This module acts as a robust pre-processor, employing occlusion-aware interpolation to restore missing joints and trend-extrapolated padding to fix incomplete sequences. By using a multi-scale spatial-temporal modeling approach, DATP ensures that the input skeleton data maintains temporal coherence even in the presence of severe occlusion.
Building on this high-fidelity data, we introduce the Adaptive Graph Topology Modeling Network (AGTM-Net) for symptom classification. Unlike static models, AGTM-Net captures the dynamic evolution of skeletal structures during movement via attention-guided updates. For clinical evaluation, this study specifically focuses on the MDS-UPDRS item 3.9 (“Arising from Chair”). This task was selected because it serves as a critical indicator of axial motor impairment and postural instability—symptoms that are often resistant to medication and strongly predictive of fall risk. Based on the features extracted by AGTM-Net, we construct binary classification tasks to distinguish PD patients with varying symptom severity (scores of 0, 1, and 2) from healthy controls.
Finally, to bridge the gap between “black-box” AI and clinical trust, we conduct gradient-based and perturbation-based interpretability analyses. These analyses visualize the key skeletal joints driving the model’s decisions, verifying alignment with clinical knowledge of gait disorders.
The main contributions of this paper are:
- (1) A novel DATP framework is proposed, incorporating occlusion-aware interpolation, trend-extrapolated frame padding, unified temporal alignment, and multi-scale adaptive modeling. This innovation significantly improves robustness in scenarios with missing or occluded frames.
- (2) An AGTM-Net is designed to dynamically capture evolving joint relationships and extract discriminative features. It is applied to clinically relevant binary classification tasks based on the “3.9 Arising from Chair” score, achieving effective symptom differentiation.
- (3) Gradient-based and perturbation-based interpretability analyses are employed to identify critical skeletal joints influencing classification decisions, thereby enhancing model transparency and clinical trustworthiness.
Related work
3D human posture estimation
In recent years, 3D human pose estimation has attracted growing research attention, and many strong algorithms have emerged that make significant progress on generalization across scenarios, occlusion handling, and computational efficiency. Wu et al. [1] proposed a lightweight human pose estimation algorithm based on adaptive feature sensing, which compressed the model size while improving detection efficiency and robustness. Wang et al. [2] proposed another lightweight algorithm that exploits the non-rigid characteristics of human posture and the diverse distribution of human landmarks, significantly improving accuracy over the baseline. To address inaccurate estimates caused by the complexity of human limbs and environmental factors, Zhang et al. [3] proposed a human pose estimation method based on two-stage generative adversarial training, which effectively improves the estimation accuracy of the stacked hourglass network (SHN). Similarly, for pose estimation in complex backgrounds, Fu et al. [4] proposed an improved YOLOv7-POSE algorithm and created a dataset with varied shooting angles for training; accuracy on the self-constructed dataset improved by 4% over the original YOLOv7. Song et al. [5] proposed a hybrid attention adaptive sampling network incorporating a dynamic attention module and a pose quality attention module, which jointly accounts for dynamic information and the quality of pose data.
Compared with traditional sampling strategies, such as sparse uniform sampling and keyframe selection based on convolutional neural networks (CNNs), it demonstrates significantly stronger robustness under challenging conditions, including heavy occlusion, motion blur, and illumination variations. To estimate accurate and temporally consistent 3D human motion from video, Sun et al. [6] proposed Bidirectional Temporal Feature for Human Motion Recovery (BTMR). This model employs bidirectional temporal features in place of the unidirectional features used in previous studies, enabling it to generate more accurate and temporally coherent 3D human motion. Because the human-body prior template in the SMPL model is fixed, significant discrepancies can arise in the reconstructed body shapes when individuals perform vigorous movements such as sports or dancing. To tackle this issue, Wu et al. [7] introduced a parallel-branch network featuring a custom-designed spatiotemporal (ST) branch and an SMPL branch; the 3D joint information from the ST branch supervises the 3D joints of the SMPL branch, effectively rectifying biases in the SMPL model. To address the scarcity of real-world multi-view datasets, Wang et al. [8] introduced the FreeMan dataset, which provides large-scale multi-view image and sequence data and offers a more challenging benchmark for 3D pose estimation. Finally, the PoseIRM method proposed by Cai et al. [9] uses the Invariant Risk Minimization (IRM) paradigm to overcome the challenges of 3D pose estimation under unknown camera settings, enabling the model to adapt to diverse camera configurations with a small number of training samples and improving generalization.
Classification of disease subtypes
Differentiating the subtypes of Parkinson’s disease is important for treatment and pathologic studies. Mestre et al. [10] and Dulski et al. [11] found that there is currently no consensus in the medical field on the classification of Parkinson’s disease subtypes, mainly owing to the extreme heterogeneity of the disease, the complexity of its etiology and pathogenesis, the lack of reliable biomarkers, and the shortcomings of existing methods. Subtype classification is usually studied from two perspectives: hypothesis-driven and data-driven subtyping. Hypothesis-driven subtyping relies on predefined clinical criteria, with motor subtypes, specifically Tremor Dominant (TD) and Postural Instability/Gait Difficulty (PIGD), being the most widely used in clinical practice. However, these classifications often depend on subjective clinical scales (e.g., MDS-UPDRS), which may lack granularity in capturing subtle motor variations. Consequently, data-driven subtyping has gained attention for its objectivity. Deng et al. [12] found that data-driven subtyping methods do not rely on preconceived assumptions but instead define the phenotypic characteristics of the disease through comprehensive analysis of multidimensional data, offering greater objectivity. Park et al. [13], in a study using follow-up records, found that Parkinson’s disease is heterogeneous in terms of disability and mortality. Gong et al. [14] analyzed kinematic data with a machine-learning model that identified PD kinematic subtypes with an F1-score of 79.6%. Krishnagopal et al. [15] developed a data-driven, network-based trajectory profile clustering (TPC) algorithm for identifying disease subtypes and for early prediction of subtypes and disease progression. Fereshtehnejad et al. [16] applied cluster analysis to a comprehensive baseline (i.e., cross-sectional) dataset consisting of clinical features, neuroimaging, biospecimens, and genetic information to identify distinct subgroups, and then developed criteria for assigning patients to different Parkinson’s disease subtypes. Building on these data-driven approaches, this study focuses on utilizing high-fidelity 3D skeletal data to objectively classify motor subtypes related to axial impairment.
Interpretability analysis
Interpretability analysis reveals which parts of the human body play an important role in decision making and can effectively improve the credibility of a model. This is especially important in high-risk domains such as healthcare, where a model’s outputs must be explainable; sound interpretability analysis can also inspire researchers and drive research progress. Suara et al. [17] explored the fundamentals of interpretable deep learning and its importance in medical imaging, reviewed various interpretability techniques and their limitations, and focused on the application of Grad-CAM in medical image analysis; their results show that interpretable deep learning and Grad-CAM help improve the accuracy and interpretability of deep learning models in medical image analysis and strengthen medical practitioners’ trust in AI diagnosis. In perturbation-based approaches, Singh et al. [18] applied perturbations to keypoint information while observing the changes in model output to determine which keypoints have a significant impact on the judgment results. Shen et al. [19], in a study predicting the risk of in-hospital death for patients with chronic heart failure complicated by pulmonary infection, interpreted the model using the SHAP method. Luo et al. [20] likewise used the SHAP method in an interpretable acute kidney injury (AKI) prediction study in intensive care.
Materials and methods
To address the challenges of severe joint occlusion, incomplete motion sequences, and diverse temporal dynamics in 3D human pose estimation and clinical action recognition, this paper proposes a comprehensive skeleton-based framework. Specifically, we introduce the Dual-Stage Adaptive Temporal Perception (DATP) framework for 3D pose estimation, which includes an occlusion-guided frame padding and trend-based extrapolation mechanism, a unified feature encoding and temporal alignment module, and a multi-scale spatial-temporal modeling strategy with adaptive scale weighting. Furthermore, we propose the Adaptive Graph Topology Modeling Network (AGTM-Net) to extract informative skeletal features. Based on the clinical action score “3.9 Arising from Chair,” we construct three binary classification tasks to distinguish Parkinson’s patients from healthy individuals [21].
Datasets
Three public datasets are employed to evaluate the proposed framework across 3D pose estimation and Parkinson’s disease classification tasks: Human3.6M, MPI-INF-3DHP, and the REMAP Open dataset.
Human3.6M dataset
Human3.6M is a large-scale benchmark for 3D pose estimation, containing millions of frames from 11 subjects performing various daily activities. Each frame is annotated with 3D joint positions using a motion capture system. Following standard protocols, subjects S1, S5, S6, S7, and S8 are used for training, and S9 and S11 for testing. We adopt 2D key points normalized to camera coordinates and predict 17 target joints for evaluation.
MPI-INF-3DHP dataset
MPI-INF-3DHP is a widely used benchmark for 3D human pose estimation in unconstrained environments. It includes both indoor and outdoor scenes, covering a wide variety of motions, viewpoints, and lighting conditions. The dataset provides synchronized RGB images, 2D keypoints, and accurate 3D pose annotations captured using a markerless motion capture system. It is particularly suitable for evaluating the generalization and robustness of 3D pose estimation models under occlusion and real-world conditions.
REMAP Open dataset
The REMAP Open dataset, derived from the PD SENSORS study at the University of Bristol, provides real-world mobility data captured using markerless Microsoft Kinect sensors (640x480 resolution, 30 fps). This study specifically utilizes 403 Sit-to-Stand (STS) episodes, which are expertly annotated according to the MDS-UPDRS item 3.9 criteria, with scores ranging from 0 (’Normal’) to 4 (’Unable to arise without help’). Crucially, the dataset records the medication status of participants (defined as ’On’ or ’Off’ dopaminergic medication), enabling a robust analysis of symptom fluctuations. We selected the STS task as it specifically targets axial motor impairment, a symptom often resistant to medication and a strong predictor of postural instability and fall risk.
The REMAP dataset is publicly available and can be downloaded from: https://github.com/ale152/SitToStandPD.
Human 3D position estimation based on Dual-Stage adaptive temporal perception modeling
According to Article 32, Clauses 1 and 2 of the Administrative Measures for Ethical Review of Life Science and Medical Research Involving Humans (issued on February 18, 2023), ethical review may be exempted under the following conditions: (1) the research utilizes publicly available data obtained through legitimate means or data derived from the observation of public behavior without intervention; (2) the research is conducted using anonymized information or data. Fig 1 illustrates the overall architecture of the proposed Dual-Stage Adaptive Temporal Perception (DATP) model. Unlike conventional pose estimation methods, DATP addresses challenges such as joint occlusion and temporal modeling difficulties in real-world applications through systematic improvements at both the input enhancement and feature modeling levels. The framework consists of three key components: the occlusion-aware and trend-enhanced input module, the unified feature encoding and temporal alignment module, and the multi-scale perception with adaptive scale weighting module.
The DATP model consists of three core components: the Occlusion-Aware and Trend-Enhanced (OATE) input module, the Unified Feature Encoding and Temporal Alignment (UFTA) module, and the multi-scale perception module with adaptive scale weighting.
Occlusion sensing and trend enhancement input module
Aiming at the common problems of missing joints and insufficient temporal boundary information in 2D inputs, this module introduces two new mechanisms beyond the traditional occlusion-handling and frame-padding strategies: Temporal Locally Smoothed Interpolation (TLSI) and Temporal Trend Extrapolated Padding (TTEP).
Temporal Locally Smoothed Interpolation (TLSI)
Traditional occlusion completion mostly uses nearest-neighbor interpolation, which ignores local inter-frame smoothness. For this reason, we propose Temporal Locally Smoothed Interpolation (TLSI), which uses information from the k neighboring frames for weighted completion.
For a missing keypoint (f, j), the set of its neighborhood frames is defined as N(f) = { f′ : 0 < |f′ − f| ≤ k, c_{f′,j} > 0 }, and the TLSI interpolation is

p̂_{f,j} = Σ_{f′ ∈ N(f)} w(f, f′) · p_{f′,j},

where the weight w(f, f′) is a weighting factor between the current interpolated frame f and the neighboring frame f′ that considers both time distance and confidence, and p_{f′,j} is the position vector of the j-th joint in the neighboring frame f′. The weights take the form

w(f, f′) ∝ exp(−|f − f′| / τ) · c_{f′,j}, normalized so that Σ_{f′ ∈ N(f)} w(f, f′) = 1,

where τ is the time decay factor and c_{f′,j} is the positional confidence of the j-th keypoint in the neighboring frame f′. The closer the frame and the higher its confidence, the larger its contribution weight, which guarantees that the completed position is continuous in time and smooth in space.

The final interpolation result is more natural and alleviates the jumps that traditional nearest-neighbor interpolation may cause.
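As a concrete illustration, the interpolation above can be sketched in a few lines of NumPy. The function name, the exponential decay weighting, and the default τ are illustrative assumptions rather than the paper’s exact implementation:

```python
import numpy as np

def tlsi_interpolate(seq, conf, f, j, k=2, tau=1.0):
    """Temporal Locally Smoothed Interpolation (TLSI) for one missing joint.

    seq  : (F, J, 2) array of 2D joint positions
    conf : (F, J) per-joint confidence scores (0 for missing)
    f, j : frame and joint index of the missing keypoint
    k    : half-width of the temporal neighborhood
    tau  : time-decay factor (hypothetical default)
    """
    F = seq.shape[0]
    # Neighborhood: frames within +/- k of f whose joint j is observed.
    neighbours = [fp for fp in range(max(0, f - k), min(F, f + k + 1))
                  if fp != f and conf[fp, j] > 0]
    # Weight each neighbor by temporal proximity and confidence.
    weights = np.array([np.exp(-abs(f - fp) / tau) * conf[fp, j]
                        for fp in neighbours])
    weights /= weights.sum()
    # Confidence- and distance-weighted average of neighboring positions.
    return weights @ seq[neighbours, j]
```

For a joint moving linearly, the completed position falls on the motion path between its observed neighbors, rather than snapping to the nearest frame.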
Temporal Trend Extrapolated Padding (TTEP)
Conventional frame padding that simply replicates the first and last frames easily introduces repetitive or unnatural information. We propose the Temporal Trend Extrapolated Padding (TTEP) strategy, which predicts the padded frames from the local motion trends at the beginning and end of the sequence.
Assume the first k frames of the sequence are X_{1:k} and the last k frames are X_{F−k+1:F}; their linear trends are extrapolated backward and forward to generate P supplementary frames at each end. The forward-padded frames are

X_{1−p} = X_1 − p · Δ_start,  p = 1, …, P,

where Δ_start is the mean difference of consecutive frames within the first k frames, inferring the motion trend at the beginning of the sequence. The backward-padded frames are

X_{F+p} = X_F + p · Δ_end,  p = 1, …, P,

where Δ_end is the mean difference of consecutive frames within the last k frames, inferring the motion trend at the end of the sequence, and X_{F+p} is the backward-extrapolated supplementary frame.

Linear extrapolation using Δ_start and Δ_end lets the supplementary frames continue the original motion trend rather than producing the static redundancy caused by simple replication. The supplementary frames thus conform to the local motion trend, making the boundary features more natural and continuous and improving the quality of temporal modeling at both ends of the sequence.
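The padding step can be sketched as follows; this is a minimal NumPy illustration in which the function name and default values of k and P are assumptions:

```python
import numpy as np

def ttep_pad(X, k=3, P=2):
    """Temporal Trend Extrapolated Padding (TTEP) sketch.

    X : (F, J, C) skeleton sequence.
    Extends the sequence by P frames on each side by linearly
    extrapolating the mean frame-to-frame difference of the
    first/last k frames.
    """
    # Mean difference over the first k frames -> start-of-sequence trend.
    d_start = np.diff(X[:k], axis=0).mean(axis=0)
    # Mean difference over the last k frames -> end-of-sequence trend.
    d_end = np.diff(X[-k:], axis=0).mean(axis=0)
    # Extrapolate P frames backward from X[0] and forward from X[-1].
    front = [X[0] - (p + 1) * d_start for p in range(P)][::-1]
    back = [X[-1] + (p + 1) * d_end for p in range(P)]
    return np.stack(front + [*X] + back)
```

For a purely linear motion, the padded frames continue the trajectory exactly, whereas replicating the boundary frames would produce a static plateau.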
Unified feature encoding and temporal alignment module
In order to enhance the representation of the input skeleton sequence in the channel and temporal dimensions, this module adopts a two-level feature encoder structure. Each level of this structure contains normalization, feed-forward modeling and residual connectivity, aiming to gradually enhance the interaction between key nodes and construct a unified temporal structure of the frame sequence.
P padding frames are added before and after the input to obtain the padded input tensor X′, which is subsequently rearranged into a temporal feature stream X″.
The sequence is then passed sequentially through two residual enhancement modelers to extract temporal dynamic features across nodes. First, the joint representation at the channel level is modeled for each frame to obtain the first-stage feature representation

Z1 = X″ + FFN_1(LayerNorm(X″)),

where Z1 denotes the local spatial-structure features fused after the first stage of modeling and FFN_1 is the feed-forward mapping network in the channel direction. Then, the inter-frame temporal variation patterns are modeled in the second stage to obtain features containing temporal context information:

Z2 = Z1 + FFN_2(LayerNorm(Z1)),

where Z2 denotes the high-level feature representation incorporating temporal dynamic information. To enrich the feature dimension and representation, the outputs of the two stages are mapped separately into a unified high-dimensional space; E1 and E2 denote the high-dimensional mappings of the first- and second-stage encoding results in the feature space, serving as abstract feature streams of the two perspectives.
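One residual enhancement stage of this kind (normalization, feed-forward mapping, residual connection) can be sketched as below; the ReLU activation, weight shapes, and function names are illustrative assumptions:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize features along the last (channel) axis."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def residual_ffn_stage(X, W1, b1, W2, b2):
    """One residual enhancement stage: LayerNorm -> FFN -> residual add.

    X : (T, C) feature stream; W1 (C, H) and W2 (H, C) are FFN weights.
    """
    h = np.maximum(layer_norm(X) @ W1 + b1, 0.0)  # ReLU feed-forward
    return X + h @ W2 + b2                        # residual connection
```

Stacking two such stages, one over the joint-channel axis and one over the frame axis, yields the Z1 and Z2 representations described above.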
After obtaining the encodings of the two perspectives, we introduce a cross-attention mechanism to fuse them and enhance the overall modeling of the spatial-temporal structure. The fusion operation can be expressed as

F = Softmax(Q K^T / √d_k) V,

where E1 serves as the query (Q), E2 serves as the key and value (K, V), and √d_k is a feature-dimension normalization factor. This attention realizes a dynamic weighted aggregation between the different feature views, so that the fused feature F expresses both local retention and cross-time dependencies. Since the input sequence was extended with P frames before and after, a temporal cropping operation is performed on the fused feature F to ensure that the output sequence F′ has the same length as the original sequence.
This operation is essentially a temporal crop: it removes the padded-frame regions and retains only the original-length frames in the middle segment, ensuring that the model output dimensions are consistent with the input and maintaining structural alignment with the subsequent label supervision. Ultimately, the encoded unified feature output is represented as F′.
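The fusion-and-crop step can be sketched as a single-head scaled dot-product attention followed by slicing off the P padded frames at each end; the function names are illustrative, not from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_and_crop(E1, E2, P):
    """Cross-attention fusion of the two encoding stages, then temporal crop.

    E1, E2 : (T, D) stage-1 / stage-2 encodings, where T includes the
             2*P padded frames.
    Returns the fused features with P frames removed at each end.
    """
    D = E1.shape[-1]
    # E1 queries attend over E2 keys/values (scaled dot-product attention).
    attn = softmax(E1 @ E2.T / np.sqrt(D), axis=-1)
    F = attn @ E2
    # Remove the P supplementary frames at both ends -> original length.
    return F[P:-P]
```

Because each attention row is a convex combination of E2’s frames, the fused output stays within the range of the stage-2 features while being re-weighted by stage-1 queries.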
Up to this point, the module has completed the two-stage structural coding of features, multi-view dynamic fusion and sequence length alignment processing, providing structurally unified, dynamically consistent and high-quality feature inputs for subsequent scale modeling and attitude regression.
Multi-scale perception and scale-adaptive weighting module
In 3D human action sequences, there is significant diversity in the time scales of action changes. For example, fine-grained hand movements usually occur within a short time window, whereas gait changes require a longer temporal receptive field to be modeled. Traditional time series modeling often uses a single scale that cannot simultaneously account for local fine-grained changes and global long-term dynamics.
To this end, we propose a multi-scale perceptual modeling strategy and further introduce the Adaptive Scale Weighting (ASW) mechanism, which enables features at different time scales to automatically adjust their contribution weights according to the dynamic complexity of the input action sequences, thus realizing a more flexible and accurate time-series feature representation.
This module is divided into two main stages: multi-scale spatial-temporal feature extraction; and adaptive weight-based scale fusion.
Fig 2 illustrates the detailed process of the multi-scale transformer encoder.
This figure illustrates the architecture of the multi-scale perception and adaptive scale weighting module used in the proposed DATP model.
Multi-scale spatial-temporal feature extraction
First, the input features undergo multi-scale decomposition. At each scale d, a spatial Transformer encoder is applied to the time-series features to extract dependencies in the local spatial structure:

F_d^s = SpatialTransformer(F_d),

where F_d is the input feature sequence at the d-th scale and F_d^s is the output feature sequence after spatial encoding.
To separate the dynamic features over different time ranges, the encoded features are divided along the time dimension into n_d segments of length s_d, satisfying n_d × s_d = F, and the features are reshaped into an (n_d, s_d, C) tensor.
This division can cut long sequences into local segments, which facilitates local dynamic modeling and improves the perception of short-term dynamic changes. The reshape operation splits the original time dimension into two dimensions: (number of subsequences, number of frames per segment), which is ready for subsequent temporal-local Attention modeling.
Within each sub-segment, the local temporal Transformer module is applied:

F_d^t = LocalTemporalTransformer(F_d^s),

where F_d^t is the feature output after temporally localized attention coding. Self-attention is modeled over the time steps within each local segment. Through this localized attention mechanism, the model can effectively capture short-term temporal change patterns while maintaining temporal consistency.
The features of the local segments are then restored to a continuous time-series form F̃_d, where F̃_d is the feature output after the local segments are reassembled into a continuous time series. This operation restores the features to the original temporal length F, providing a uniform size for subsequent multi-scale fusion.
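The segment/model/restore cycle can be sketched generically; here `fn` stands in for the local temporal Transformer, and the function name is an assumption:

```python
import numpy as np

def local_segment_apply(F_seq, s_d, fn):
    """Split a (T, D) feature sequence into n_d segments of length s_d
    (T = n_d * s_d), apply `fn` independently to each segment, and
    restore the continuous time axis, as in the temporal-local
    attention step.
    """
    T, D = F_seq.shape
    assert T % s_d == 0, "segment length must divide the sequence length"
    n_d = T // s_d
    segs = F_seq.reshape(n_d, s_d, D)       # (n_d, s_d, D)
    segs = np.stack([fn(s) for s in segs])  # per-segment modeling
    return segs.reshape(T, D)               # back to (T, D)
```

Since the reshape only regroups frames, applying an identity `fn` returns the input unchanged, which confirms the restore step aligns with the original time axis.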
Adaptive Scale Weighting (ASW) Mechanism
Although multi-scale feature extraction can capture dynamic information at different levels, if all scale features are directly fused with equal weights, it may introduce redundant information or weaken the contribution of important scales. Therefore, we further design the scale adaptive weighting mechanism (ASW) to automatically adjust the importance of each scale by learning action sequence features.
- (1) Scale Weight Generation.
For each scale d, the extracted output features F̃_d are first subjected to global average pooling along the time dimension, g_d = GlobalAvgPool(F̃_d), which takes the mean over the time axis and extracts the overall profile of the action. Subsequently, g_d is fed into a small multi-layer perceptron (MLP) for nonlinear feature transformation to capture scale importance. Finally, the output is normalized to a scale weight in the (0, 1) interval by a Sigmoid activation:

α_d = Sigmoid(MLP(g_d)),

where α_d controls the degree of contribution of the d-th scale feature in the final fusion and is sample-adaptive, adjusting dynamically to different action characteristics.
- (2) Multi-scale feature weighted fusion
The features of all scales are weighted and summed according to the corresponding weights to obtain the final multi-scale integrated feature representation:

F_fused = Σ_{d=1}^{D} α_d · F̃_d,

where D is the total number of scales. Each scale output is dynamically weighted by its learned weight α_d, so that important scale features are strengthened and redundant scales are suppressed; F_fused is the final fused multi-scale feature representation. This mechanism greatly improves the model’s ability to adapt to the diverse temporal dynamics of complex action sequences.
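The ASW mechanism can be sketched as follows; the per-scale MLPs are passed in as plain callables, and all names are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def asw_fuse(scale_feats, mlps):
    """Adaptive Scale Weighting (ASW) sketch.

    scale_feats : list of D arrays, each (T, C), one per temporal scale.
    mlps        : list of callables mapping a (C,) pooled vector to a
                  scalar logit (stand-ins for the small per-scale MLPs).
    Returns the weighted sum of the per-scale features.
    """
    fused = np.zeros_like(scale_feats[0])
    for Fd, mlp in zip(scale_feats, mlps):
        g_d = Fd.mean(axis=0)        # GlobalAvgPool over the time axis
        alpha_d = sigmoid(mlp(g_d))  # scale weight in (0, 1)
        fused += alpha_d * Fd
    return fused
```

When one scale’s logit saturates high and the others low, the fusion effectively selects that scale, which is the suppression behavior described above.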
Finally, the fused multi-scale features are fed into the regression head, which maps them through a convolutional layer to the 3D spatial coordinates of each joint in each frame and reshapes them into the final output Ŷ ∈ R^{F×J×3}, the 3D joint positions predicted for each frame.
Skeleton-based feature extraction and Parkinson’s disease classification task
In order to realize the dynamic modeling and feature expression of key skeletal structure relationships in human actions, an Adaptive Graph Topology Modeling Network (AGTM-Net) is proposed in this study.
As shown in Fig 3, the AGTM-Net architecture consists of three main stages: (1) initial graph construction based on the skeletal structure; (2) dynamic adjacency update via attention-guided mechanisms to capture temporal joint interactions; and (3) feature extraction through graph convolutional layers, spatial pooling, and a multi-layer perceptron (MLP) for downstream classification. The attention-based adjacency module enhances the adaptability of skeletal graphs by learning context-dependent joint relations.
This section first introduces the overall architecture and key module design of the AGTM-Net model, and then constructs three independent binary classification tasks based on the skeleton features extracted around the “3.9 Arising from Chair” action indicator in the Real-world Mobility Activities in Parkinson’s Disease (REMAP) dataset.
AGTM-net skeleton feature extraction model
In order to fully capture the dynamic association relationships between skeletal nodes in a motion sequence, AGTM-Net adopts an adaptive modeling strategy based on graph structure. In this model, the skeletal sequence is first encoded as a static topological initial graph structure, and then the adjacency matrix is dynamically updated through the introduction of an attention mechanism to realize the modeling of the changes in the intensity of the joint interactions during the movement. Ultimately, a skeleton feature representation that can effectively distinguish action patterns is generated through multi-layer feature fusion and cross-node feature interaction.
The input skeleton sequence at time step t can be represented as a node feature matrix X_t ∈ R^{J×C}, where J denotes the number of joints and C denotes the input feature dimension of each node. The initial static skeleton connectivity is represented by the adjacency matrix A_0 ∈ {0, 1}^{J×J}, defined by the natural connectivity structure of the skeleton, with 1 indicating the presence of a connection and 0 indicating no connection.
To introduce dynamic dependency changes between node pairs, AGTM-Net is designed with an adaptive adjacency matrix updating mechanism. For each time step, the dynamic adjacency matrix \(A_d\) is calculated as:

\[A_d = \mathrm{softmax}\!\left(\frac{(X_t W_Q)(X_t W_K)^{\top}}{\sqrt{d_k}}\right)\]

where \(W_Q, W_K \in \mathbb{R}^{C \times d_k}\) are the learnable query and key mapping matrices, and \(d_k\) is the dimension of the key vectors, used for scaling to stabilize the gradient. The final effective adjacency matrix is defined as a weighted combination of static and dynamic structures:

\[A = \alpha A_0 + (1 - \alpha) A_d\]
where \(\alpha \in [0,1]\) is a hyper-parameter that controls the fusion ratio between static and dynamic structures. After obtaining the updated adjacency matrix, the skeleton features are updated by local information aggregation based on graph convolution:

\[H_t = \sigma(A X_t W)\]

where \(W \in \mathbb{R}^{C \times C'}\) is the feature transformation matrix (C is the original feature dimension and C′ is the mapped feature dimension), \(\sigma(\cdot)\) is the nonlinear activation function, and \(H_t \in \mathbb{R}^{J \times C'}\) denotes the feature representation of each joint after updating at time step t.
To further model the relationships between different node channels, AGTM-Net introduces a Cross-Node Feature Interaction (CCI) mechanism, which performs local self-attention enhancement of node features along the channel direction:

\[\tilde{H}_t = H_t + \mathrm{MLP}(\mathrm{LayerNorm}(H_t))\]

where MLP is a two-layer perceptron and LayerNorm normalizes the node features. Finally, the skeleton features of all time steps are stacked to obtain the complete skeleton sequence feature representation:

\[Z = [\tilde{H}_1; \tilde{H}_2; \dots; \tilde{H}_T] \in \mathbb{R}^{T \times J \times C'}\]
where T denotes the total number of time steps in the action sequence. AGTM-Net introduces a rich representation of spatial-temporal dynamics changes on the basis of ensuring the rationality of skeleton structure modeling through the above-mentioned neighbor matrix updating, graph convolutional feature aggregation, and channel interaction mechanisms, which provides more discriminative skeleton features for downstream classification tasks.
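The update pipeline described above — attention-based adjacency fusion, graph convolution, and the CCI residual block — can be sketched in PyTorch as follows. Layer sizes, the ReLU activation, and the exact residual form are illustrative assumptions, not the authors' released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AGTMLayer(nn.Module):
    """One AGTM-Net block: attention-updated adjacency, graph
    convolution, and cross-node feature interaction (CCI).
    Hyper-parameters are illustrative assumptions."""

    def __init__(self, in_dim, out_dim, d_k=16, alpha=0.5):
        super().__init__()
        self.alpha = alpha                                # static/dynamic fusion ratio
        self.d_k = d_k
        self.W_q = nn.Linear(in_dim, d_k, bias=False)     # query mapping W_Q
        self.W_k = nn.Linear(in_dim, d_k, bias=False)     # key mapping W_K
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # feature transformation W
        self.norm = nn.LayerNorm(out_dim)
        self.mlp = nn.Sequential(                         # two-layer perceptron for CCI
            nn.Linear(out_dim, 2 * out_dim), nn.ReLU(),
            nn.Linear(2 * out_dim, out_dim))

    def forward(self, x, a0):
        # x: (J, C) node features at one time step; a0: (J, J) static adjacency
        q, k = self.W_q(x), self.W_k(x)
        a_dyn = F.softmax(q @ k.T / self.d_k ** 0.5, dim=-1)  # dynamic adjacency A_d
        a = self.alpha * a0 + (1 - self.alpha) * a_dyn        # fused adjacency A
        h = F.relu(a @ self.W(x))                             # graph convolution
        return h + self.mlp(self.norm(h))                     # CCI residual update
```

Applying the layer per frame and stacking the outputs yields the \(T \times J \times C'\) sequence representation Z.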
Skeleton-based binary classification task design
To verify the discriminative ability of the skeleton features produced by the AGTM-Net model in the recognition of Parkinson’s disease movement symptoms, this study builds on the clinical movement score index “3.9 Arising from Chair” in the Real-world Mobility Activities in Parkinson’s Disease (REMAP) dataset. Three binary classification tasks are constructed to discriminate the movement performance of patients who scored 0, 1, or 2, respectively, from the healthy population, realizing symptom recognition and subtype character modeling.
For each standardized action sequence, the AGTM-Net output frame-level skeleton feature sequence is denoted as \(Z \in \mathbb{R}^{T \times J \times C'}\), where T is the number of time steps, J is the number of skeleton joints, and C′ is the output feature dimension of each node. To obtain a uniform temporal representation, all time-step and joint features are average-pooled and compressed into the sample-level global feature vector g:

\[g = \frac{1}{TJ}\sum_{t=1}^{T}\sum_{j=1}^{J} Z_{t,j} \in \mathbb{R}^{C'}\]
Next, feature mapping and binary prediction are performed by a two-layer fully connected classification network. Each binary classification (i.e., “score = 0 vs. healthy,” “score = 1 vs. healthy,” and “score = 2 vs. healthy”) is treated as an independent task. The three tasks share the same AGTM-Net feature extractor, but the corresponding classifiers are constructed and optimized separately in the training phase. The final model can be used for Parkinson’s disease diagnosis and symptom severity modeling by accurately identifying subtle differences in motor ability.
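The pooling-and-classification head can be sketched as below; the hidden width of 64 and the two-logit output are assumptions, since the paper does not specify the classifier's layer sizes:

```python
import torch
import torch.nn as nn

class BinaryHead(nn.Module):
    """Sample-level pooling and a two-layer fully connected
    classifier, a minimal sketch of the design described above."""

    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))      # two logits: patient vs. healthy

    def forward(self, z):
        # z: (T, J, C') frame-level skeleton features from AGTM-Net
        g = z.mean(dim=(0, 1))         # global average pooling over time and joints
        return self.fc(g)              # class logits for one sample
```

One such head would be trained per task (score 0, 1, or 2 vs. healthy) on top of the shared feature extractor.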
Evaluation metrics
To comprehensively evaluate the proposed framework for both 3D pose estimation and skeleton-based classification tasks, multiple evaluation metrics are adopted according to task characteristics.
Evaluation metrics for 3D pose estimation
Mean Per Joint Position Error (MPJPE) measures the average Euclidean distance between predicted and ground-truth 3D joint locations across all joints and frames. It is defined as:

\[\mathrm{MPJPE} = \frac{1}{N}\sum_{i=1}^{N}\left\lVert \hat{p}_i - p_i \right\rVert_2\]

where N denotes the total number of joints, and \(\hat{p}_i\) and \(p_i\) are the predicted and ground-truth positions of the i-th joint, respectively.
Percentage of Correct Key-points (PCK) measures the percentage of joints whose prediction errors are within a certain threshold \(\tau\) (e.g., 150 mm). It is formulated as:

\[\mathrm{PCK}(\tau) = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\!\left(\left\lVert \hat{p}_i - p_i \right\rVert_2 \le \tau\right)\]

where \(\mathbb{1}(\cdot)\) is the indicator function that returns 1 if the condition holds and 0 otherwise.
Area Under Curve (AUC) integrates the PCK curve across varying thresholds, providing a single scalar measure of overall pose estimation performance:

\[\mathrm{AUC} = \frac{1}{\tau_{\max}}\int_{0}^{\tau_{\max}} \mathrm{PCK}(\tau)\, d\tau\]

where \(\tau\) represents varying distance thresholds and \(\tau_{\max}\) is the maximum threshold considered.
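The three pose metrics follow directly from their definitions. The NumPy sketch below assumes joint positions in millimetres and approximates the AUC integral with a uniform threshold grid (the grid resolution is an assumption):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: mean Euclidean distance (mm)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pck(pred, gt, tau=150.0):
    """Fraction of joints whose error is within threshold tau (mm)."""
    return (np.linalg.norm(pred - gt, axis=-1) <= tau).mean()

def auc(pred, gt, tau_max=150.0, steps=31):
    """Normalized area under the PCK curve over thresholds in [0, tau_max]."""
    taus = np.linspace(0.0, tau_max, steps)
    return float(np.mean([pck(pred, gt, t) for t in taus]))
```

Here `pred` and `gt` are arrays of shape (..., J, 3), so the same functions apply per frame or over a whole sequence.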
Evaluation metrics for skeleton-based classification
Accuracy measures the proportion of correctly classified samples among all samples. It is defined as:

\[\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]

To further evaluate model performance, especially on imbalanced datasets, precision, recall, and F1-score are reported:

\[\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}\]
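All four classification metrics derive from the confusion-matrix counts; a self-contained sketch for binary labels:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts.

    y_true, y_pred: sequences of 0/1 labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0   # guard empty denominators
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1
```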
Details of implementation
In the 3D pose estimation experiments, the proposed framework incorporates two sequential feature encoding stages, each consisting of 2 encoder layers with 9 attention heads. The input feature dimension is set to dm = 25, and the hidden dimension is set to df = 512. After encoding, a multi-scale temporal refinement module with 5 stages is employed to progressively model motion dynamics across different time scales. All experiments were implemented using the PyTorch framework and trained on two NVIDIA A100 GPUs. The Adam optimizer was used with an initial learning rate of 0.001, and a learning rate decay factor of 0.95 applied after each epoch. During training, horizontal flipping was applied as a data augmentation strategy. The input 2D poses were either obtained from classical 2D pose detectors or directly from ground-truth annotations, depending on the evaluation setting.
For the Parkinson’s disease movement classification experiments, the number of computational threads was set to 4. The Adam optimizer was adopted with an initial learning rate of 0.0001, decayed by a factor of 0.1 at milestone epochs 30 and 40. Training was conducted for 100 epochs on a single NVIDIA RTX 4060 GPU under CUDA 11.7 and Python 3.12.4, based on the PyTorch framework. During classification, skeleton sequences were normalized and mini-batch training was employed to ensure stable optimization.
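The two learning-rate schedules described above map onto standard PyTorch schedulers; the loop below is a minimal sketch (the dummy parameter and the bare step calls stand in for actual training code):

```python
import torch

# Dummy parameter so the optimizers have something to manage.
params = [torch.nn.Parameter(torch.zeros(1))]

# Pose estimation: Adam with lr 0.001, decayed by 0.95 after each epoch.
opt_pose = torch.optim.Adam(params, lr=0.001)
sched_pose = torch.optim.lr_scheduler.ExponentialLR(opt_pose, gamma=0.95)

# Classification: Adam with lr 0.0001, decayed by 0.1 at epochs 30 and 40.
opt_cls = torch.optim.Adam(params, lr=0.0001)
sched_cls = torch.optim.lr_scheduler.MultiStepLR(
    opt_cls, milestones=[30, 40], gamma=0.1)

for epoch in range(100):
    opt_pose.step()      # (forward/backward passes would precede this)
    opt_cls.step()
    sched_pose.step()    # decay once per epoch
    sched_cls.step()
```

After 100 epochs the classification learning rate has been reduced twice (to 1e-6), while the pose learning rate has decayed geometrically to about 0.001 × 0.95¹⁰⁰.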
Results
Experimental results of human 3D pose estimation based on DATP modeling
Experimental results comparing DATP with state-of-the-art methods on robustness to joint occlusion.
To evaluate the clinical readiness of our model, we benchmarked it against several representative state-of-the-art (SOTA) methods commonly used in computer vision. These baselines serve as proxies for current standard AI capabilities: CNN-based methods (e.g., T3D-CNN) represent traditional deep learning models that are fast but often struggle to capture long-term movement patterns; Transformer-based methods (e.g., MHFormer, STCFormer) represent the current “gold standard” in research, offering high accuracy in ideal conditions by modeling complex temporal relationships. However, these methods are typically not optimized for clinical scenarios where body parts are frequently obscured (occluded). By comparing our DATP model against these established benchmarks, specifically under conditions of “missing joints,” we aim to demonstrate the specific advantage of our framework in handling the imperfect video data typical of real-world medical assessments.
To further investigate the robustness of the proposed DATP model under joint-level occlusion, we conducted a comprehensive comparison on the MPI-INF-3DHP datasets. In this evaluation, we progressively increased the number of randomly missing joints per frame and assessed model performance in terms of MPJPE.
As shown in Fig 4, all evaluated methods exhibit increasing MPJPE as the number of missing joints rises. However, the extent of performance degradation varies. Among them, STCFormer displays the steepest error curve, reflecting high sensitivity to partial occlusion. T3D-CNN and MHFormer show moderately rising trends but still suffer from significant accuracy loss under severe occlusion.
Fig 4. MPJPE of T3D-CNN, MHFormer, STCFormer, and DTF compared with the proposed fusion model under different levels of missing joints on the MPI-INF-3DHP dataset. As the number of missing joints increases, the proposed method maintains the best robustness, exhibiting the lowest overall MPJPE increase.
In contrast, both DTF and the proposed DATP model show relatively stable performance, with DATP achieving consistently lower MPJPE at each occlusion level. This advantage stems from its enhanced temporal alignment and multi-scale feature fusion strategy, which improves the model’s ability to capture dynamic spatial-temporal relationships.
Notably, DATP maintains a smooth and moderate MPJPE growth curve, especially under heavy occlusion (e.g., 14 or 16 missing joints), where other models experience a sharp spike in error. This observation highlights DATP’s superior generalization ability and resilience under real-world partial observation scenarios. Collectively, the results confirm that DATP not only achieves high accuracy under ideal conditions but also maintains robustness in the presence of substantial joint occlusion.
Quantitative comparison of DATP and SOTA methods under severe joint occlusion
Table 1 presents a comparative analysis of several state-of-the-art methods (T3D-CNN, P-STMO, MHFormer, STCFormer, and DTF) along with the proposed DATP model on the MPI-INF-3DHP datasets under the condition of 16 missing joints per frame. Two evaluation metrics are considered: Percentage of Correct Key-points (PCK) and Area Under Curve (AUC), which collectively reflect the model’s accuracy and robustness.
From the PCK results, it is evident that DATP consistently outperforms DTF across nearly all activity categories. Notably, in high-motion activities such as Exercising, Sitting, and Reaching/Crouching, DATP achieves the highest accuracy, surpassing DTF by a significant margin. This suggests that the fusion and temporal alignment strategies of DATP effectively compensate for the structural information lost to joint-level occlusion.
Similarly, in terms of AUC, the proposed DATP model demonstrates superior generalization ability, consistently achieving higher or second-highest scores across categories. Especially for fine-grained tasks like Sitting and Miscellaneous, DATP shows stronger performance than other SOTA methods, indicating its ability to maintain stability under partial visibility.
On average, DATP surpasses all competing methods in both PCK (77.72) and AUC (43.57), proving its strong robustness and reliability under severe occlusion scenarios.
Comprehensive evaluation of DATP and SOTA Methods under Human3.6M Protocols
Table 2 and Table 3 present a detailed comparison of the proposed DATP model with several state-of-the-art (SOTA) methods on the Human3.6M dataset under Protocols 1 and 2, with 16 missing joints per frame using 2D CPN detections. From the average MPJPE results, it can be observed that DATP consistently achieves lower errors than the baseline method DTF under both protocols. Specifically, DATP ranks second among all methods, trailing only MHFormer (under Protocol 1), and shows competitive performance across almost all action categories.
Under Protocol 2, which is generally considered more challenging due to the stricter evaluation strategy, DATP still maintains robust estimation capability, outperforming DTF and many other transformer-based approaches in multiple categories such as Eat, Pose, Photo, and Walk Together. This result highlights DATP’s generalization ability and robustness against severe joint occlusion scenarios.
In summary, DATP demonstrates superior performance by effectively combining multi-scale temporal-spatial encoding and adaptive temporal padding, which contributes to more stable and accurate pose estimation across varied human actions and occlusion levels.
Results of the DATP model ablation experiments
To evaluate the contribution of each component in the proposed DATP model, we conducted ablation studies under Protocol 1 and Protocol 2 with 4 random missing joints per frame.
As shown in Table 4, removing both the frame padding strategy and the occlusion confidence leads to a significant performance drop, with MPJPE increasing to 52.12 mm and 41.06 mm, respectively, on the two protocols. When removing only the occlusion confidence, the model still suffers, although slightly less, indicating its standalone contribution to mitigating joint uncertainty.
Similarly, removing only the frame padding strategy increases the MPJPE to 46.15 mm (P1) and 37.02 mm (P2), showing that temporal sequence consistency plays an important role in robust pose recovery. Moreover, excluding the Adaptive Scale Weighting (ASW) module also deteriorates performance, especially under Protocol 2, confirming that dynamic scale weighting is essential for multi-scale feature fusion.
The full model DATP achieves the lowest error (41.32 mm and 32.68 mm), outperforming all ablated variants, demonstrating that each component of the proposed architecture is beneficial and synergistically contributes to the model’s robustness under joint occlusion.
Experimental results of Parkinson’s patient type classification based on AGTM-Net
Experimental results of the classification model based on AGTM-Net.
Table 5 presents the comparative results of the AGTM-Net model against several existing models on the dataset of healthy people versus patients with a score of 0. As demonstrated, AGTM-Net achieves outstanding performance across all evaluation metrics, with particularly high precision (0.931) and F1-score (0.898), outperforming all baseline methods. AGTM-Net also reaches a recall of 0.867, indicating that it identifies the large majority of true patient samples. This demonstrates the model’s excellent classification performance in practical gait analysis scenarios. Among the competing methods, DeGCN (accuracy 0.820, F1-score 0.859) and SkateFormer (accuracy 0.803, F1-score 0.795) also show competitive results, yet they remain slightly behind AGTM-Net overall. In contrast, CTR-GCN (accuracy 0.765, F1-score 0.762) and HCN (accuracy 0.731, F1-score 0.749) perform noticeably worse in all respects, suggesting that their capacity to handle complex gait patterns is relatively limited. These results highlight the superior effectiveness of AGTM-Net in gait classification tasks.
Interpretability analysis of the classification model based on AGTM-Net
In order to improve the transparency of the model’s decision-making, this study conducted an interpretability analysis of the model using two methods: gradient-based interpretability analysis and perturbation-based interpretability analysis. These two methods provide insight into the contribution of each skeletal key point to the final decision outcome, thus helping to understand the model’s decision-making rationale when making disease severity judgments.
Taking the classification task between score-0 patients and healthy individuals as an example, Fig 5 shows the results of the gradient-based interpretability analysis on the dataset with a score of 0. As shown in Fig 5, the skeletal key-points that most strongly affect the model’s decisions are concentrated in nodes 0, 1, 4, 7, and 8, which correspond to the spine, thorax, and hips. Fig 6 shows the results of the perturbation-based interpretability analysis on the same dataset. This analysis likewise finds that key nodes such as the spine, chest, and hips strongly influence the final prediction results. These nodes highly overlap with the joint locations identified by the gradient-based analysis, indicating that they play a decisive role in judging disease severity on the score-0 dataset.
By combining the gradient-based and perturbation-based interpretability analyses, the spine, chest, and hips were identified as the key factors influencing the model’s assessment of Parkinson’s disease status. This finding not only provides a basis for the model’s decisions but also coincides with the clinical physiological understanding of gait disorders, further validating the effectiveness and practical applicability of the model in subtype classification.
Discussion
This study proposes a unified framework combining DATP for robust pose estimation and AGTM-Net for disease classification. While the results are promising, this study represents a preliminary analysis.
First, our clinical validation focused exclusively on the MDS-UPDRS item 3.9 (’Arising from Chair’). Although this task is a critical indicator of axial impairment, a comprehensive PD diagnosis requires analyzing a broader range of motor tasks (e.g., gait, finger tapping) to fully capture disease heterogeneity.
Second, the current validation was performed on the REMAP Open dataset. To establish broad clinical utility, larger sample sizes and multi-site validation are necessary to account for geographic variability and diverse patient demographics. Future work will focus on expanding the dataset to include greater variability in PD stages (Hoehn and Yahr stages) to better define disease trajectories and phenotypes.
Finally, while DATP and AGTM-Net are evaluated as separate modules in this paper, they are conceptually unified: DATP reconstructs high-fidelity skeletons from imperfect video data, serving as the necessary pre-processing step for the AGTM-Net classifier.
Conclusion
The proposed framework in this paper demonstrates excellent performance in 3D pose estimation and Parkinson’s disease classification tasks. In the aspect of 3D pose estimation, experiments based on the DATP model show that this model performs outstandingly in handling joint occlusion problems. On the MPI-INF-3DHP datasets, as the number of missing joints increases, the DATP model maintains better robustness compared with other state-of-the-art methods such as T3D-CNN, MHFormer, STCFormer, and DTF, with the smallest increase in MPJPE. Under the condition of severe joint occlusion (16 missing joints per frame), the DATP model outperforms most of its competitors in both PCK and AUC metrics. Under the two protocols of the Human3.6M datasets, it can also achieve a lower MPJPE error, demonstrating good generalization ability and robustness. Ablation experiments further prove that components such as the frame padding strategy, occlusion confidence, and Adaptive Scale Weighting (ASW) module in the model contribute significantly to its performance improvement under joint occlusion.
In the Parkinson’s disease classification task, the AGTM-Net model performs excellently. On the datasets of healthy people and patients with a score of 0, its precision reaches 0.931, and the F1-score is 0.898, surpassing all baseline methods. Through gradient-based and perturbation-based interpretability analyses, it is found that the skeletal key points of the spine, chest, and hips have a significant impact on the model’s decision-making results. This not only provides a basis for model decision-making but also coincides with the clinical physiological understanding of gait disorders, further validating the effectiveness and operability of the research model in subtype classification. In conclusion, the proposed framework exhibits strong performance and reliability in both tasks, providing valuable references for research and applications in related fields.
Supporting information
S1 Fig. The gradient-based interpretability analysis performed on the datasets with a score of 1.
As shown in the figure, the skeletal key-points that have a large impact on the model’s decisions are mainly concentrated in nodes 0, 1, 4, 7, and 8, which correspond to the spine, thorax, and hips.
https://doi.org/10.1371/journal.pone.0344375.s001
(TIF)
S2 Fig. The perturbation-based interpretability analysis performed on the dataset with a score of 1.
The figure demonstrates the results of the perturbation-based interpretability analysis on the dataset with a score of 1. This analysis also finds that key nodes such as the spine, chest, and hips strongly influence the final prediction results. These nodes highly overlap with the joint locations identified by the gradient-based analysis, indicating that they play a decisive role in judging disease severity on the score-1 dataset.
https://doi.org/10.1371/journal.pone.0344375.s002
(TIF)
S3 Fig. The gradient-based interpretability analysis performed on the datasets with a score of 2.
The figure shows the results of the gradient-based interpretability analysis on the dataset with a score of 2. The skeletal key-points that most influence the model’s decisions are mainly concentrated in nodes 0, 1, 4, 7, and 8, which correspond to the spine, thorax, and hips.
https://doi.org/10.1371/journal.pone.0344375.s003
(TIF)
S4 Fig. The perturbation-based interpretability analysis performed on the datasets with a score of 2.
The figure demonstrates the results of the perturbation-based interpretability analysis on the dataset with a score of 2. This analysis reveals that key nodes such as the spine, chest, and hips have the greatest impact on the final prediction results. These nodes highly overlap with the joint locations identified by the gradient-based analysis, indicating that they play a decisive role in the judgment of disease severity.
https://doi.org/10.1371/journal.pone.0344375.s004
(TIF)
References
- 1. Wu N, Wang P, Li X, Lü Z, Sun M. Lightweight human pose estimation based on adaptive feature sensing. Chinese Journal of Liquid Crystals and Displays. 2023;38(8):1107–17.
- 2. Wang M, Xu W, Jiang H. Improved lightweight human pose estimation algorithm. Chinese Journal of Liquid Crystals and Displays. 2023;38(7):955–63.
- 3. Zhang X, Zhang R, Liu Y. Human pose estimation based on secondary generation adversary. Laser & Optoelectronics Progress. 2020;57(20):201509.
- 4. Fu H, Gao J, Che L. Human posture estimation and movement recognition in fitness behavior. Chinese Journal of Liquid Crystals and Displays. 2024;39(2):217–27.
- 5. Song Q, Zhang H, Liu Y, Sun S, Xu D. Hybrid attention adaptive sampling network for human pose estimation in videos. Computer Animation & Virtual. 2024;35(4).
- 6. Sun L, Tang T, Qu Y, Qin W. Bidirectional temporal feature for 3D human pose and shape estimation from a video. Comput Anim Virtual Worlds. 2023;34(3):e2187.
- 7. Wu Y, Wang C. Parallel‐branch network for 3D human pose and shape estimation in video. Computer Animation & Virtual. 2022;33(3–4).
- 8. Wang J, Yang F, Li B, Gou W, Yan D, Zeng A, et al. In: 2024. 21978–88.
- 9. Cai Y, Zhang W, Wu C. PoseIRM: Enhance 3D Human Pose Estimation on Unseen Camera Settings via Invariant Risk Minimization. In: 2024. 2124–33.
- 10. Mestre TA, Fereshtehnejad S-M, Berg D, Bohnen NI, Dujardin K, Erro R, et al. Parkinson’s Disease Subtypes: Critical Appraisal and Recommendations. J Parkinsons Dis. 2021;11(2):395–404. pmid:33682731
- 11. Dulski J, Uitti RJ, Ross OA, Wszolek ZK. Genetic architecture of Parkinson’s disease subtypes. Frontiers in Aging Neuroscience. 2022;14:1023574.
- 12. Deng X, et al. Disease progression of data-driven subtypes of Parkinson’s disease: 5-year longitudinal study. Journal of Parkinson’s Disease. 2024;14(5):1051–9.
- 13. Park D, Lee SY, Kim JH. Classification of long-term clinical course of Parkinson’s disease using clustering algorithms. J Big Data. 2023;10:140.
- 14. Gong J, Huan L, Zheng X. Deep learning interpretability analysis methods in image interpretation. Acta Geodaetica et Cartographica Sinica. 2022;51(6):873–84.
- 15. Krishnagopal S, Coelln RV, Shulman LM, Girvan M. Identifying and predicting Parkinson’s disease subtypes via bipartite networks. PloS One. 2020;15(6):e0233296.
- 16. Fereshtehnejad SM, Zeighami Y, Dagher A, Postuma RB. Clinical criteria for subtyping Parkinson’s disease. Brain. 2017;140(7):1959–76.
- 17. Suara S, Jha A, Sinha P, Sekh AA. In: 2023. 124–35.
- 18. Singh A, Sengupta S, Lakshminarayanan V. Explainable Deep Learning Models in Medical Image Analysis. J Imaging. 2020;6(6):52. pmid:34460598
- 19. Shen C, Wang S, Zhou R. Journal of Southern Medical University. 2024;44(6):1141–8.
- 20. Luo Y, Wang C, Ye W. An interpretable prediction model for acute kidney injury based on XGBoost and SHAP. Journal of Electronics & Information Technology. 2022;44(1):27–38.
- 21. Ahmed Khan N, Ahmed Alarfaj A, Alabdulqader EA, Zamzami N, Umer M, Innab N, et al. TRI-POSE-Net: Adaptive 3D human pose estimation through selective kernel networks and self-supervision with trifocal tensors. PLoS One. 2024;19(12):e0310831. pmid:39636815