Abstract
An accurate 3D skeletal model is fundamental to human pose estimation and body shape reconstruction, as it encodes intricate motion dynamics and spatial configurations. However, generating high-fidelity 3D skeleton samples that adhere to human kinematic constraints remains a significant challenge. To address this problem, the Constrained Dynamic Graph Spatial Perception Adversarial Network (CDGSPAN) is proposed, which is designed to model and synthesize human motion poses with high realism. CDGSPAN leverages dynamic graph-based operations to capture the spatial angular relationships between skeletal joints, while incorporating a constraint-aware regularization mechanism to guide the learning process. This joint modeling enables the network to effectively learn motion priors from real 3D skeletal samples and generate synthetic poses that closely align with biomechanical plausibility. Extensive experiments demonstrate that CDGSPAN achieves superior performance compared to recent adversarial network frameworks in generating sparse 3D skeletal sequences that preserve natural human motion characteristics.
Citation: Li W, Yang J, Li J, Zhao Y, Fan Y, Wu Y, et al. (2026) Constrained graph dynamic spatial perception adversarial network for human motion generation. PLoS One 21(1): e0339297. https://doi.org/10.1371/journal.pone.0339297
Editor: Panos Liatsis, Khalifa University of Science and Technology, UNITED ARAB EMIRATES
Received: May 10, 2025; Accepted: December 3, 2025; Published: January 5, 2026
Copyright: © 2026 Li et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting information files.
Funding: This work is supported by the Development of Guangdong Province Philosophy and Social Science (No. GD25CTY14), the Educational Science Planning Project of Guangdong Province (No. 2023GXJK125), the Guangdong Provincial Special Program in Key Areas for Higher Education Institutions (New Generation Electronic Information (Semiconductors)) (No. 2024ZDZX1040) and the Special Support Program for Cultivating High-Level Talents of Guangdong University of Education (2022 Outstanding Young Teacher Cultivation Object: Wanyi Li).
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
The generation of 3D human motion skeletal models [1] is a prominent research topic in the fields of Artificial Intelligence (AI) and Computer Vision (CV), with significant implications for 3D body shape estimation [2], the development of lifelike 3D character models [3,4], and virtual reality technologies [5]. However, generating accurate 3D skeletal (keypoint/joint) models of human motion poses presents numerous technical challenges. These include the high dimensionality of spatial data, the complexity of human skeletal spatial and angular structures, deficiencies in input data, the need to model dynamic and pose variations, and substantial computational demands. Firstly, the high-dimensional nature of 3D human keypoint data requires models to handle complex spatial relationships—capturing not only joint-to-joint spatial dependencies but also the dynamic transitions among different body parts. Furthermore, due to the intricacy of human joints and skeletal anatomy, models must adhere to the physical constraints of joint movement and inter-joint relations to produce valid motion poses. When reconstructing (or estimating) 3D human poses from 2D images [6,7], critical spatial information is lost due to 2D projection, and issues such as noise and occlusion often lead to incomplete input data, compromising accuracy.
Dynamic pose generation further requires models to go beyond static pose prediction and capture temporal-spatial transitions across actions, which involves modeling complex dependencies among keypoints across time. In recent years, Graph Neural Networks (GNNs) [8–10] have gained popularity for processing graph-structured data (composed of nodes and edges), showing potential in modeling the topology and spatial relations of human skeletons. However, conventional graph convolutions in GNNs rely on fixed skeletal connectivity and fail to capture the intrinsic motion patterns of 3D skeleton samples. Even with extensive training, such models often struggle to generate realistic, physically plausible 3D motion sequences. Some GNN variants employ Graph Attention Networks (GATs) [11–13], which have shown promise in various graph-based tasks. Nonetheless, their performance in generating 3D skeletal models remains limited. The primary limitation lies in the coarse modeling of spatial relations—human motion requires fine-grained constraints on joint positions and angular configurations, whereas GATs focus on localized attention computations, which are insufficient to capture the complex spatial dependencies among joints in 3D space.
To address these challenges, the Constrained Dynamic Graph Spatial Perception Adversarial Network (CDGSPAN) is proposed, which offers enhanced flexibility in modeling human motion patterns and the relative spatial configuration of keypoints. CDGSPAN introduces a dynamic graph computation framework for both the generator and discriminator, enabling adversarial learning to progressively optimize the generation of 3D skeletal samples that conform to human motion rules. Unlike existing adversarial networks [14–18], which primarily focus on generating high-dimensional image data (often exceeding hundreds of thousands of pixels), our proposed CDGSPAN is tailored for sparse, structured data that embodies human motion dynamics. CDGSPAN incorporates a spatial perception mechanism capable of capturing the relative joint positions and limb angles of 3D skeletal models through dynamic graph operations. Specifically, edge weights in the graph are updated dynamically based on node features and spatial direction vectors, allowing the model to adaptively handle various poses and joint relationships. This dynamic update mechanism significantly improves the model’s ability to capture meaningful spatial information from each pose sample.
Based on this dynamic computation, the generator and discriminator are constructed to form the adversarial network. During training, we introduce a spatial constraint module that regularizes the loss function, effectively preventing the generation of anatomically implausible limb configurations and improving the realism of synthesized poses. Additionally, we propose an evaluation method to quantitatively assess the validity of generated 3D skeletal samples, independent of visual observation. This method evaluates whether the joint angles and spatial configurations adhere to realistic human kinematics and closely resemble real samples.
In summary, the major contributions of this article are as follows:
- A new dynamic graph operation is proposed, and a corresponding generator and discriminator are designed to construct the CDGSPAN model.
- Spatial constraint models are incorporated during the training process, enhancing the model’s understanding of pose geometry and enabling the generation of physically plausible 3D skeletal samples under kinematic constraints.
- A style-specific sample selection criterion for CDGSPAN is introduced, which filters generated poses by their Frobenius distance to a reference skeleton under a given tolerance and selects the most representative style-specific sample. This provides a consistent way to obtain style-constrained samples for generation and evaluation.
- A non-visual evaluation method is developed to assess the quality of generated 3D skeletal data, based on whether the joint angles and spatial positions conform to human motion principles and resemble real-world samples.
The following sections provide a detailed account of the methodology, experiments, and findings.
2. Related works
2.1. Spatial modeling and spatial constraint models
Many graph-based pose methods still rely on a fixed skeletal topology or coarse, local attention, which limits fine joint-to-joint modeling across different poses. Multi-scale residual GCNs [19] for motion prediction typically operate on predefined skeleton graphs, so edge patterns cannot adapt per pose. Dynamic dense GCNs [20] relax the static-graph assumption, yet they usually do not integrate spatial constraint models into training, leaving geometric validity under-regularized. Evidence that fixed or non-adaptive edges restrict representation also appears outside human pose: a 3D–2D hybrid and GAT model [13] in hyperspectral classification explicitly argues that fixed node weights/edges constrain learning and shows gains from attention-based adaptive edges. Likewise, graph-attention solutions in address matching [11] improve local matching but remain local/coarse and are unrelated to human geometry. Overall, prior work either uses fixed graphs or coarse attention, and seldom employs spatial constraint models to enforce limb relations or angle/length consistency.
2.2. Frame-wise spatial representation and sampling stability
Another line of work emphasizes generative quality or distributional fit while omitting explicit constraints on per-frame body geometry. DCNN-based pose generators [21] for animated characters optimize reconstruction/adversarial losses but omit spatial constraint models, which can yield anatomically implausible poses. Flow-based structured prediction [22] improves likelihood and mode coverage yet typically lacks spatial constraint models that suppress per-frame geometric artifacts. Diffusion-style generators stress realism/diversity—e.g., a pose-guided diffusion transformer [23] and cross-diffusion motion models [24]—but evaluate mainly with image/video quality or retrieval metrics (FID, R-Precision, etc.), again without explicit spatial constraint models. Several GAN variants [15–17] from adjacent domains (document enhancement, orthogonal subspace disentanglement, differentiable GAN search) likewise provide no built-in safeguards for frame-wise anatomical validity if transferred to skeleton generation, because spatial constraint models are absent by design.
2.3. Data robustness and objective (non-visual) evaluation
Most prior evaluations lean on visual or task-level scores and rarely report non-visual criteria that test geometric validity (e.g., joint-angle legality, bone-length consistency, self-intersection checks)—i.e., metrics aligned with spatial constraint models. Image/video pose transfer typically reports LPIPS/SSIM/PSNR or perceptual features rather than limb-level spatial validity [25]. Diffusion models [23,24] report distributional and user-study metrics rather than non-visual spatial tests. Graph-based predictors [19,20] commonly use MPJPE-style geometric errors, which do not certify anatomical plausibility. Broader generative/vision literature—style transfer [26], super-resolution [18], document enhancement [17], text-to-image surveys [14], network embedding [27], radiomics-GANs [28]—centers on visual fidelity, segmentation accuracy, or topology preservation, not non-visual spatial evaluation [11,13,16]. These gaps motivate methods that pair adaptive graph modeling with explicit spatial constraint models and that adopt objective, non-visual metrics to judge whether generated skeletons are spatially valid.
2.4. Improving on existing methods
Based on these challenges, we propose the Constrained Dynamic Graph Spatial Perception Adversarial Network (CDGSPAN). This method introduces a dynamic graph computation framework for both the generator and discriminator, enabling adversarial learning to progressively optimize the generation of 3D skeletal samples that adhere to human motion principles. By incorporating spatial constraint models, such as joint-angle limits and bone-length consistency, CDGSPAN addresses the issues found in previous methods and ensures the anatomical plausibility of generated poses. This makes CDGSPAN a significant step forward in addressing the lack of fine-grained motion modeling, the absence of explicit geometric constraints, and the deficiency of objective, non-visual evaluation metrics in existing methods.
3. Construction of 3D human motion skeletal model
A 17 × 3 matrix is used to represent the 3D human skeletal model, where each of the 17 rows corresponds to a uniquely indexed keypoint in 3D space. Each row encodes the spatial coordinates of a human joint; such joints are commonly referred to as keypoints in the pose estimation literature. The skeletal data can be obtained either through direct capture from motion sensors or reconstructed from 2D images using deep learning-based pose estimation algorithms [29,30]. As illustrated in Fig 1, connecting these keypoints yields the structural representation of human limbs, whether in natural daily-life scenarios or in videos extracted from standard benchmark datasets such as HumanEva [31] and Human3.6M [32], capturing their relative lengths, spatial positions, and angular orientations. This structure enables intuitive visualization and modeling of diverse human motion patterns, as shown in Fig 2.
The spatial positioning of each keypoint is critical, as it encodes the semantic logic of human pose. Different joint angles and spatial configurations reflect different motion states. However, if the keypoints are arranged in a disordered or anatomically implausible manner—for example, when the hands, feet, or head are misaligned or reversed—the resulting 3D pose is invalid, as illustrated in Figs 3 and 4. Therefore, accurate modeling of the relative spatial relationships among keypoints is essential for training models that can generate valid and realistic 3D human poses.
An end-to-end preprocessing workflow is adopted to convert raw acquisitions into analyzable 3D skeleton sequences, as shown in Fig 5. Starting from raw images/videos, 2D joints are first extracted per frame by a standard pose detector, and basic 2D quality control is applied (filtering low-confidence joints and optional temporal smoothing). Depending on data conditions, 3D poses are then obtained: with multi-view data, camera calibration and triangulation are performed; with monocular data, a 2D-to-3D lifting network is used. The resulting 3D coordinates are centered on a reference joint and scale-normalized to remove body-size effects and maintain consistent bone lengths. Metadata are loaded and class labels (e.g., action/style) are encoded as integer IDs. Data cleaning and validity checks are conducted to remove NaN/Inf entries and poses that clearly violate anatomical constraints (e.g., extreme limb lengths), while class balance is preserved as much as possible. Finally, the dataset is split—preferably stratified and, when necessary, grouped by subject/scene—into training/validation/test sets to ensure reproducibility and prevent information leakage. Through this workflow, scale-consistent and anatomically plausible 3D skeleton data, along with fair and comparable splits, are prepared for model training and evaluation.
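To make the normalization step concrete, the following is a minimal NumPy sketch of the root-centering and scale-normalization described above; the root index, the unit-scale convention, and the validity filter are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def normalize_skeletons(seq, root_idx=0):
    """Center each frame on the root joint and normalize scale.

    seq: (T, 17, 3) array of 3D joint coordinates.
    Returns an array of the same shape, root-centered and scaled
    so that the mean joint-to-root distance is 1 per frame.
    """
    centered = seq - seq[:, root_idx:root_idx + 1, :]      # root-relative coordinates
    dists = np.linalg.norm(centered, axis=-1)              # (T, 17) joint-to-root distances
    scale = dists.mean(axis=-1, keepdims=True)[..., None]  # (T, 1, 1) per-frame scale
    scale = np.maximum(scale, 1e-8)                        # guard against degenerate frames
    normalized = centered / scale
    # basic validity check: drop frames containing NaN/Inf entries
    valid = np.isfinite(normalized).all(axis=(1, 2))
    return normalized[valid]
```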
4. CDGSPAN model
4.1. Model architecture and computational principles
The overall architecture of the proposed CDGSPAN model is illustrated in Fig 6. Some of the key computational definitions are presented as follows. The CDV function is used to calculate the relative position encoding of 3D skeletal model keypoints with respect to a designated root joint. It is defined as:

$$DV = \mathrm{CDV}(X) = X - \mathrm{Broadcast}(x_{\text{root}}) \tag{1}$$

where $DV \in \mathbb{R}^{N \times 3}$ represents the relative position (direction vector) of each keypoint with respect to the root joint, and $x_{\text{root}} \in \mathbb{R}^{1 \times 3}$ denotes the reference (root) joint features extracted from the input $X$. $\mathrm{Broadcast}(\cdot)$ denotes tensor broadcasting: the reference joint vector $x_{\text{root}}$ is replicated $N$ times along the joint dimension, resulting in a tensor in $\mathbb{R}^{N \times 3}$. Next, the DynamicGraphGNN function $F_{\text{out}} = \mathrm{DynamicGraphGNN}(\cdot)$ performs the proposed dynamic graph operation to estimate spatially adaptive relations among skeletal keypoints. The relative spatial distances between nodes are encoded as follows. First, we compute pairwise relative features between all joints, then perform a one-dimensional concatenation across the feature vectors:

$$E_{ij} = \mathrm{Cat}\left(F_i,\ F_j - F_i\right) \tag{2}$$

This results in edge feature representations $E \in \mathbb{R}^{N \times N \times 2C}$. After that, a multi-layer perceptron (MLP) is applied to compute the dynamic edge weights:

$$A_{ij} = \mathrm{MLP}(E_{ij}) \tag{3}$$

where $A \in \mathbb{R}^{N \times N}$, and the MLP employs the ReLU activation function $\sigma(\cdot)$. From Equation (3), the computed edge weights $A$ are then expanded into a 4D tensor and used to modulate spatial information passing through the dynamic graph. The original input tensor is also processed with a similar expansion in the third dimension, resulting in $X' \in \mathbb{R}^{N \times N \times C}$. These representations are used in subsequent modules to compute the weighted neighbor features for each joint:

$$F_{\text{agg},i} = \sum_{j=1}^{N} A_{ij}\, X'_{ij} \tag{4}$$

From Equation (4), the tensor $F_{\text{agg}}$ is obtained to represent the weighted neighbor features for each joint. To integrate multi-source spatial information, the tensor $F_{\text{agg}}$, the original input $X$, and the relative position tensor $DV$ are concatenated along the last dimension. As a result, a fused feature tensor is constructed, as defined in Equation (5):

$$F_{\text{fused}} = \mathrm{Cat}\left(F_{\text{agg}},\ X,\ DV;\ \mathrm{dim} = -1\right) \tag{5}$$

In Equation (5), Cat denotes the tensor concatenation operation along dimension dim; we set dim = −1 to indicate the last dimension. The tensor $F_{\text{fused}}$ is subsequently passed through a two-layer fully connected network to compute the final joint-level feature representation, as formulated in Equation (6):

$$F_{\text{out}} = W_2\, \sigma\left(W_1 F_{\text{fused}} + b_1\right) + b_2 \tag{6}$$

where $W_1$ and $W_2$ denote the learnable weight matrices, and $b_1$, $b_2$ are the corresponding bias terms. The ReLU activation function $\sigma(\cdot)$ is employed between the two layers. Consequently, the output feature tensor $F_{\text{out}}$ is obtained, which encodes the spatial representation of each joint by incorporating both individual joint features and context from neighboring keypoints.
The operational pipeline of DynamicGraphGNN can be seen in Fig 7. Given an input sequence of poses, we first compute direction vectors (DV) between relevant joints to encode local geometry. These vectors parameterize the dynamic edge-weight function $A_{ij}$, producing dynamically varying weight matrices that adapt the graph topology per frame. Using these weights, the network performs weighted neighbor feature aggregation to obtain $F_{\text{agg}}$, which is then combined with each node's current features to form $F_{\text{fused}}$. A node feature update module (e.g., linear/MLP with residual gating) outputs the refined representations $F_{\text{out}}$. The purple arrows indicate that this process is applied recurrently across frames, enabling adaptive and geometry-aware message passing.
The DynamicGraphGNN accepts either a pose sequence together with the root-relative direction vectors (DV) computed with respect to the first (root) joint, or a random tensor $Z$ with the same total number of elements as the pose sequence (possibly a different shape), paired with the same root-relative direction vectors. In both cases, the direction vectors are obtained by subtracting the coordinates of the root joint from those of all joints in each frame (i.e., vectors pointing from the root to each joint). The representations produced by DynamicGraphGNN are fed into task-specific linear heads. In the generator G(*), a linear projection maps the features to a new pose sample. In the discriminator D(*), a linear classifier outputs a real/fake label indicating whether the generated pose sample is consistent with the training samples.
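To summarize the computation of Equations (1)–(6) in one place, the following is a minimal PyTorch sketch of the dynamic graph operation; the layer widths, root index, and exact MLP depth are illustrative assumptions rather than the paper's fixed architecture.

```python
import torch
import torch.nn as nn

class DynamicGraphGNN(nn.Module):
    """Sketch of Eqs. (1)-(6): direction vectors, dynamic edge
    weights, weighted aggregation, and a two-layer node update."""

    def __init__(self, in_dim=3, hidden=64, out_dim=64, root=0):
        super().__init__()
        self.root = root
        # Eq. (3): MLP over pairwise edge features -> scalar edge weight
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # Eq. (6): two-layer update over [aggregated | input | DV]
        self.update = nn.Sequential(
            nn.Linear(3 * in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

    def forward(self, x):                        # x: (B, N, 3) joint coordinates
        B, N, C = x.shape
        dv = x - x[:, self.root:self.root + 1]   # Eq. (1): root-relative direction vectors
        xi = x.unsqueeze(2).expand(B, N, N, C)   # features of joint i, broadcast over j
        xj = x.unsqueeze(1).expand(B, N, N, C)   # features of joint j (expanded input X')
        e = torch.cat([xi, xj - xi], dim=-1)     # Eq. (2): pairwise edge features
        a = self.edge_mlp(e)                     # Eq. (3): dynamic edge weights (B, N, N, 1)
        agg = (a * xj).sum(dim=2)                # Eq. (4): weighted neighbor aggregation
        fused = torch.cat([agg, x, dv], dim=-1)  # Eq. (5): fuse multi-source features
        return self.update(fused)                # Eq. (6): joint-level output features
```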
4.2. Construction of the spatial constraint models
In adversarial learning, constraints are built to ensure that the generated 3D skeletal pose samples not only satisfy the discriminator’s classification criteria, but also conform to the physical and biomechanical principles of real human motion. In particular, for pose generation tasks, it is essential that the positions and angles of the generated joints adhere to human kinematic constraints, thereby avoiding the generation of unnatural or physically implausible poses.
By incorporating spatial position constraints, the generator is guided to maintain reasonable distances between adjacent joints, which helps prevent the generation of joint pairs that are either excessively close or unnaturally far apart. In parallel, angular constraints are applied to ensure that the angle variations between connected joints remain within physiologically feasible ranges. For example, the bending angles of the elbow and knee joints must not exceed human anatomical limits. These constraints not only improve the plausibility and realism of the generated poses, but also help stabilize the generator and mitigate issues such as mode collapse during training. By integrating spatial and angular constraints into the adversarial learning framework, the generator is enabled to produce physically valid and kinematically consistent poses while maintaining high visual fidelity. This integration enhances training efficiency and model usability. Moreover, the inclusion of such constraints improves model stability, reduces convergence time, and ensures that the generated poses are not only visually realistic but also practically viable for downstream applications.
To enforce these constraints, regularization terms are incorporated into the generator’s loss function during training. These regularization terms act as guiding signals, steering the generator to comply with predefined spatial and kinematic rules and thus improving the quality of the generated 3D skeletal samples.
In the following, we define two essential constraint formulations for pose modeling.
The 3D skeletal model used in this work (as shown in Fig 2) is subject to inter-joint distance limitations. Since the skeleton has a predefined joint connection structure, a set of connected joint pairs can be defined as:

$$\mathcal{E} = \{(p_i, q_i)\}_{i=1}^{16} \tag{7}$$

From Equation (7), the set $\mathcal{E}$ defines 16 connections between the 17 joints in the 3D skeletal model, where each pair $(p_i, q_i)$ indexes two connected joints. Therefore, let $\hat{X}, X \in \mathbb{R}^{B \times 17 \times 3}$ denote the generated and real 3D skeletal samples for a batch $B$ of training data, respectively. A spatial distance constraint model can be established as follows:

$$L_{\text{spatial}} = \frac{1}{B\,\lvert\mathcal{E}\rvert} \sum_{b=1}^{B} \sum_{i=1}^{16} \left( \lVert \hat{v}_{b,i} \rVert - \lVert v_{b,i} \rVert \right)^2 \tag{8}$$

In Equation (8), let the $i$-th skeletal edge be $(p_i, q_i)$, and define the limb vectors $\hat{v}_{b,i} = \hat{x}_{b,q_i} - \hat{x}_{b,p_i}$ and $v_{b,i} = x_{b,q_i} - x_{b,p_i}$. Equation (8) is expressed as the following equality constraints (bone-length consistency): $\lVert \hat{v}_{b,i} \rVert = \lVert v_{b,i} \rVert$ for all $b, i$. Using joint triplets, a set of angular constraint joint groups is defined as:

$$\mathcal{T} = \{(a_k, b_k, c_k)\}_{k=1}^{K} \tag{9}$$

Then, the cosine similarity between the generated and ground-truth limb directions is computed using Equation (10) and Equation (11):

$$\hat{c}_{b,k} = \frac{(\hat{x}_{b,a_k} - \hat{x}_{b,b_k}) \cdot (\hat{x}_{b,c_k} - \hat{x}_{b,b_k})}{\lVert \hat{x}_{b,a_k} - \hat{x}_{b,b_k} \rVert\, \lVert \hat{x}_{b,c_k} - \hat{x}_{b,b_k} \rVert} \tag{10}$$

$$c_{b,k} = \frac{(x_{b,a_k} - x_{b,b_k}) \cdot (x_{b,c_k} - x_{b,b_k})}{\lVert x_{b,a_k} - x_{b,b_k} \rVert\, \lVert x_{b,c_k} - x_{b,b_k} \rVert} \tag{11}$$

Subsequently, the angular error between the generated data and the ground-truth data is computed as:

$$L_{\text{angle}} = \frac{1}{B\,K} \sum_{b=1}^{B} \sum_{k=1}^{K} \max\left(0,\ \lvert \hat{c}_{b,k} - c_{b,k} \rvert - \varepsilon \right) \tag{12}$$

In Equation (12), $\varepsilon$ denotes the acceptable angular deviation margin; this constraint is equivalent to $\lvert \hat{c}_{b,k} - c_{b,k} \rvert \le \varepsilon$. If the deviation is smaller than $\varepsilon$, no loss is applied. These constraints define the feasible manifold $\mathcal{M} = \{\hat{X} : \lVert \hat{v}_{b,i} \rVert = \lVert v_{b,i} \rVert,\ \lvert \hat{c}_{b,k} - c_{b,k} \rvert \le \varepsilon\}$.
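A compact PyTorch sketch of the two constraint terms (Equations (8) and (12)) follows; the edge and triplet index lists are passed in by the caller, and the numerical guards are illustrative assumptions.

```python
import torch

def spatial_loss(gen, real, edges):
    """Eq. (8): bone-length consistency between generated and real
    skeletons. gen, real: (B, 17, 3); edges: list of (p, q) pairs."""
    p = torch.tensor([e[0] for e in edges])
    q = torch.tensor([e[1] for e in edges])
    len_gen = (gen[:, p] - gen[:, q]).norm(dim=-1)    # (B, 16) generated bone lengths
    len_real = (real[:, p] - real[:, q]).norm(dim=-1) # (B, 16) real bone lengths
    return ((len_gen - len_real) ** 2).mean()

def angle_loss(gen, real, triplets, eps=1e-3):
    """Eqs. (9)-(12): hinge on the deviation of joint-angle cosines,
    with tolerance eps (the margin epsilon in Eq. (12))."""
    a, b, c = (torch.tensor([t[i] for t in triplets]) for i in range(3))

    def cosines(x):
        u = x[:, a] - x[:, b]                         # limb direction b -> a
        v = x[:, c] - x[:, b]                         # limb direction b -> c
        return (u * v).sum(-1) / (u.norm(dim=-1) * v.norm(dim=-1) + 1e-8)

    dev = (cosines(gen) - cosines(real)).abs()        # per-triplet cosine deviation
    return torch.clamp(dev - eps, min=0).mean()       # no penalty within the margin
```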
4.3. Construction of the training procedure
Based on Sections 4.1 and 4.2, the training procedure can be constructed as illustrated in Fig 6. First, the loss functions for the generator and discriminator are defined as follows:

$$L_D = -\,\mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right] - \mathbb{E}_{z \sim p_z}\!\left[\log\left(1 - D(G(z))\right)\right] \tag{13}$$

$$L_G^{\text{adv}} = -\,\mathbb{E}_{z \sim p_z}\!\left[\log D(G(z))\right] \tag{14}$$

The full objective of the generator and discriminator is then defined as:

$$\min_{f_G} \max_{f_D}\ \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\!\left[\log\left(1 - D(G(z))\right)\right] \tag{15}$$

$$L_G = L_G^{\text{adv}} + \lambda_{\text{spatial}}\, L_{\text{spatial}} + \lambda_{\text{angle}}\, L_{\text{angle}} \tag{16}$$

Equations (13) through (16) are established to guide the optimization of the model parameters $f_D$, $f_G$ during training. The training process is based on binary cross-entropy (BCE) loss, with the final generator objective defined in Equation (16), which incorporates regularization terms on spatial and angular constraints, weighted by the hyperparameters $\lambda_{\text{spatial}}$ and $\lambda_{\text{angle}}$, respectively. During training, the discriminator is optimized to maximize the probability of correctly identifying real data, i.e., assigning a confidence score close to 1 for real samples. Simultaneously, the discriminator $D$ is trained to minimize the likelihood of mistakenly classifying generated data as real, pushing the predicted probability for fake data (i.e., samples generated by the generator) toward 0. Then, the generator $G$ is trained to minimize the overall loss $L_G$, such that the generated 3D skeletal samples not only adhere to human joint spatial and kinematic constraints but also successfully deceive the discriminator, thereby making the output probability from $D$ as close to 1 as possible. Ultimately, the training process aims to produce generated skeletal data that is structurally and spatially close to real human motion samples. Through the adversarial process, the model parameters $f_D$ and $f_G$ are iteratively optimized. The pseudo-code of the complete training procedure is provided in Table 1.
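The following sketch shows one adversarial step corresponding to Equations (13)–(16), reusing the constraint helpers sketched in Section 4.2; it assumes the discriminator outputs sigmoid probabilities of shape (B, 1), which is an assumption about the head rather than a detail stated in the paper.

```python
import torch
import torch.nn.functional as F

def train_step(G, D, real, opt_g, opt_d, lam_spatial, lam_angle, edges, triplets):
    """One adversarial step of Eqs. (13)-(16) with BCE losses.
    real: (B, 17, 3) batch; G maps noise of matching size to poses."""
    B = real.size(0)
    z = torch.randn(B, 17, 3)                       # random input tensor for the generator

    # --- discriminator update (Eqs. 13, 15) ---
    fake = G(z).detach()                            # block gradients into G
    d_loss = F.binary_cross_entropy(D(real), torch.ones(B, 1)) + \
             F.binary_cross_entropy(D(fake), torch.zeros(B, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- generator update (Eqs. 14, 16) with constraint regularizers ---
    fake = G(z)
    g_loss = F.binary_cross_entropy(D(fake), torch.ones(B, 1)) \
           + lam_spatial * spatial_loss(fake, real, edges) \
           + lam_angle * angle_loss(fake, real, triplets)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```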
Equation (16) can be connected to Lagrange-multiplier dynamics. The holonomic (bone-length) residual is $g(\hat{X}) = \{\lVert \hat{v}_{b,i} \rVert - \lVert v_{b,i} \rVert\}$, and the inequality residuals are $h(\hat{X}) = \{\lvert \hat{c}_{b,k} - c_{b,k} \rvert - \varepsilon\} \le 0$ from Equation (12). We set up the corresponding Jacobians as follows. For a vectorized pose $\hat{x} \in \mathbb{R}^{3J}$, where $J$ is the number of joints, $J_g = \partial g / \partial \hat{x}$ and $J_h = \partial h / \partial \hat{x}$ are the Jacobians of the equality and inequality residual vectors. The gradient step (discrete constrained dynamics) is

$$\hat{x}_{t+1} = \hat{x}_t - \eta\left(\nabla_{\hat{x}} L_G^{\text{adv}} + J_g^{\top}\lambda + J_h^{\top}\mu\right)$$

which matches discrete dynamics under Lagrange forces by identifying the multipliers $\lambda$ and $\mu$ with the gradients of the weighted penalty terms $\lambda_{\text{spatial}} L_{\text{spatial}}$ and $\lambda_{\text{angle}} L_{\text{angle}}$. In an augmented-Lagrangian variant, updating the duals by $\lambda \leftarrow \lambda + \rho\, g(\hat{x})$ and $\mu \leftarrow \max(0,\ \mu + \rho\, h(\hat{x}))$ yields the standard primal–dual scheme. At inference (optional projection), a one-step equality projection can be applied to sharpen feasibility:

$$\hat{x} \leftarrow \hat{x} - J_g^{\top}\left(J_g J_g^{\top}\right)^{-1} g(\hat{x})$$

with active inequalities incorporated if needed.
In this work, a subset of video sequences from the Human3.6M dataset [32] is selected to construct the training set. For each video, 2D keypoints are extracted using a keypoint detection model [29,30], as in Fig 1, and the corresponding 3D skeleton sequences are reconstructed through a pretrained lifting network (e.g., VideoPose3D).
The reconstructed 3D samples are used to train the proposed model. A total of ten human motions are included, such as directions, greeting, phoning, waiting, and sitting. 80% of the reconstructed 3D skeletons are selected randomly to train the proposed model, while the remaining 20% are reserved for validation and testing. This split exposes the model to diverse motion sequences during training and ensures that its generalization ability is evaluated on unseen samples. The frame statistics for each pose category are summarized in Table 2. A total of 16,180 frames are reconstructed and used for training and evaluation. The runtime environment for training, validation, and testing is shown in Table 3.
Our proposed generator and discriminator operate on sparse 3D skeletons (17 × 3 matrices) rather than high-dimensional RGB images. Feasible human poses lie on a kinematics-constrained low-dimensional manifold, which reduces sample complexity. Beyond data size, CDGSPAN uses explicit skeletal topology and spatial/angle constraints as structural regularizers. Early stopping and learning-rate scheduling are employed to reduce the risk of overfitting and to stabilize training for both the generator and discriminator, and training stability is maintained under adversarial balance without discriminator collapse.
A practical hyperparameter schedule is adopted. Training is started with a relatively high learning rate (0.0001) and with no structural regularization (λ_spatial = 0, λ_angle = 0). Once plausible human shapes are observed—while residual limb twisting may still be present—regularization weights are gradually introduced and tuned (e.g., λ_spatial = 0.005, λ_angle = 0.005), and a small angular tolerance ε = 0.001 is set for the angle-consistency term. The learning rate is then reduced (e.g., to 0.00001) and training is continued. Early stopping is applied when generated poses deteriorate, after which the above hyperparameters are readjusted so that generalization is improved.
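A minimal sketch of this two-phase schedule follows, with the reported values hard-coded; the warmup_done flag stands in for the visual check that plausible human shapes have emerged, and is a hypothetical control signal rather than part of the published procedure.

```python
def schedule(warmup_done: bool) -> dict:
    """Two-phase hyperparameter schedule from Section 4.3.

    Phase 1: higher learning rate, no structural regularization.
    Phase 2 (after plausible shapes appear): enable the constraint
    weights, lower the learning rate, set a small angular tolerance.
    """
    if not warmup_done:
        return dict(lr=1e-4, lam_spatial=0.0, lam_angle=0.0, eps=0.0)
    return dict(lr=1e-5, lam_spatial=0.005, lam_angle=0.005, eps=0.001)
```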
The training curves in Fig 8 show that the generator loss (G Loss) and discriminator loss (D Loss) do not converge monotonically but instead oscillate within a stable range. This oscillatory behavior is typical in adversarial learning, reflecting the dynamic competition between the generator and discriminator. Over the course of training, both losses gradually reach a balance, indicating that the model has approached an equilibrium where the generator produces plausible samples and the discriminator can no longer easily distinguish real from generated data. Furthermore, the training strategy first performs unconstrained learning until the generated poses visually resemble human-like structures. At this stage, constraint terms are incorporated to guide and refine the generated data, ensuring both structural plausibility and stability. To prevent overfitting during training, early stopping and learning rate adjustment techniques are applied. These strategies help maintain generalization by halting training when the model’s performance levels off. Additionally, the learning rate is gradually lowered to facilitate fine-tuning. As a result, the generator’s training curve may exhibit sudden upward spikes, reflecting the model’s adaptation to these constraints and fine-tuning processes.
(Blue: generator; Red: discriminator. X: epochs (0–1000); Y: loss value. The discriminator drops early and remains low, while the generator shows spikes near 200 and 600 before stabilizing—indicating a transition from early instability to a balanced adversarial regime without sustained mode collapse.).
Fig 9 displays the variation of the spectral gap and Lipschitz constant with respect to epochs during adversarial training. Both metrics gradually stabilize during the training process, reflecting the convergence of the model's training.
As shown in Fig 9, as training progresses, the values of the spectral gap and Lipschitz constant trend toward stabilization. This indicates that the min-max game between the generator and discriminator gradually reaches equilibrium. Specifically, the smaller fluctuations in the spectral gap and Lipschitz constant after a certain number of epochs suggest that the model is stabilizing and the training process is nearing convergence. The stabilization of the spectral gap reflects the steadying of the graph structure and the internal information flow of the network. This suggests that the interaction between the generator and discriminator is becoming balanced, and the quality of the model's output is continuously improving. The stabilization of the Lipschitz constant indicates that the model's parameter updates and training process are becoming more stable, avoiding issues such as gradient explosion or excessive updates, further validating the stability and convergence of the training process.
The spectral gap refers to the difference between the eigenvalues λ2 and λ1 of the graph, which reflects the connectivity and stability of the graph. In adversarial training, a larger spectral gap means a greater difference between the generator and discriminator, leading to instability in the network’s training. Conversely, a smaller spectral gap indicates that the generator and discriminator’s training is aligning, and the model is converging more effectively. Therefore, monitoring the spectral gap helps in observing the stability of the model. The Lipschitz constant measures the degree to which the output of a function changes with respect to changes in its input. In deep learning, a smaller Lipschitz constant typically indicates that the model is less sensitive to input changes, resulting in smoother gradient updates. A larger Lipschitz constant, on the other hand, can lead to gradient explosion or unstable training. In adversarial training, the stabilization and smaller value of the Lipschitz constant mean that the training process is more stable, avoiding large parameter updates, which in turn enhances convergence efficiency.
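For readers who wish to monitor these quantities, the following is a rough sketch of how a spectral gap and a Lipschitz bound could be tracked during training; both are generic proxies (eigenvalues of the Laplacian of the symmetrized dynamic adjacency, and a product of layer spectral norms), not necessarily the exact estimators used to produce Fig 9.

```python
import torch

def spectral_gap(adj):
    """Gap between the two smallest eigenvalues of the graph
    Laplacian built from dynamic edge weights adj: (N, N)."""
    a = (adj + adj.T) / 2                  # symmetrize the dynamic adjacency
    lap = torch.diag(a.sum(dim=1)) - a     # unnormalized graph Laplacian
    eig = torch.linalg.eigvalsh(lap)       # eigenvalues in ascending order
    return (eig[1] - eig[0]).item()        # lambda_2 - lambda_1

def lipschitz_upper_bound(model):
    """Crude Lipschitz upper bound: product of spectral norms of the
    linear layers (ReLU is 1-Lipschitz, so it does not increase it)."""
    bound = 1.0
    for m in model.modules():
        if isinstance(m, torch.nn.Linear):
            bound *= torch.linalg.matrix_norm(m.weight, ord=2).item()
    return bound
```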
By introducing the DynamicGraphGNN operation and the spatial constraints regularization, the training process of the min-max adversarial game can achieve effective equilibrium, avoiding instability and non-convergence issues during training. The theoretical foundation of this approach is based on the concept of Nash equilibrium, combined with the spectral properties and Lipschitz continuity found in modern Graph Neural Networks (GNNs). Ultimately, the generator and discriminator will gradually approach equilibrium during the training process, leading to the stable convergence of the model.
4.4. Evaluation model for generated data
Traditional image quality evaluation metrics such as PSNR [33] and FID [34] are reference-based and primarily measure the pixel-level or distributional similarity between the generated image and the original image. As the generated data becomes increasingly similar to the ground truth, these metrics tend to improve monotonically in a single direction. However, in the context of generative adversarial networks (GANs), the goal is to create realistic samples that match the training data in style and distribution, rather than replicating the original images pixel by pixel. Unlike image synthesis, our task generates 3D skeletal joint coordinates (17 × 3) rather than pixel images. Image-centric metrics such as PSNR (higher is better) and FID (lower is better) primarily reflect pixel-space appearance and are highly sensitive to rendering choices (e.g., line width, color, camera pose, etc.), while being largely insensitive to kinematic/physiological plausibility (bone-length consistency, joint-angle limits, left–right symmetry, etc.). Consequently, it is possible for visually incorrect or semantically meaningless images to achieve comparable PSNR or FID scores, highlighting a key limitation of traditional metrics. Therefore, this article proposes a novel evaluation metric specifically designed to assess the quality of 3D skeletal samples generated by generative adversarial networks.
After the CDGSPAN model generates 3D skeletal samples, it is necessary to evaluate whether the generated sample is similar to the real training samples and whether it exhibits valid human motion characteristics—such as continuity of joint movement, reasonable limb lengths, and physiological feasibility. Inspired by MPJPE [32], a new evaluation metric is proposed to assess the quality of generated 3D skeletal samples. Let the generated 3D pose sequence be represented as $\hat{P} = \{\hat{p}_{t,n}\}$, and the corresponding ground-truth pose sequence be denoted as $P = \{p_{t,n}\}$, where $t$ indexes frames and $n$ indexes joints. The mean error $\bar{e}$ and standard deviation $\sigma_e$ between corresponding joints of generated and real poses are computed as follows:

$$\bar{e} = \frac{1}{T\,N} \sum_{t=1}^{T} \sum_{n=1}^{N} \lVert \hat{p}_{t,n} - p_{t,n} \rVert_2 \tag{17}$$

$$\sigma_e = \sqrt{\frac{1}{T\,N} \sum_{t=1}^{T} \sum_{n=1}^{N} \left( \lVert \hat{p}_{t,n} - p_{t,n} \rVert_2 - \bar{e} \right)^2} \tag{18}$$

Subsequently, for both the generated and ground-truth 3D skeletal sequences, joint triplets (groups of three joints) are sampled. These triplets are substituted into Equations (9) through (12) to compute the angular similarity metric $L_{\text{angle}}$, which measures the deviation in joint angles. Then, the binary cross-entropy (BCE) loss $L_{\text{disc}}$ between the generated sequence and the real sequence is computed using a newly trained discriminator $D'$. This loss quantifies how well the generated sample is mistaken for a real one by $D'$. The discriminator $D'$ adopts the same architecture as shown in Fig 6, is trained on the same dataset as reported in Table 2, and is then used to evaluate the generated samples. It achieves accuracies of 99.51% on the training set, 99.48% on the validation set, and 99.39% on the test set. Finally, a comprehensive loss score is computed as a weighted combination of these terms:

$$t_{\text{score}} = w_1\,\bar{e} + w_2\,\sigma_e + w_3\,L_{\text{angle}} + w_4\,L_{\text{disc}} \tag{19}$$
Equation (19) serves as an integrated metric to evaluate the generated pose sequence in terms of reconstruction error, positional deviation, angular deviation, and adversarial loss. A lower tscore indicates that the 3D pose sequence produced by the CDGSPAN model is more similar to the ground-truth sequence. Conversely, if the generated poses contain noise, or the spatial configuration of joints violates human motion constraints, the corresponding tscore will be significantly higher.
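A sketch of the evaluation pipeline of Equations (17)–(19) follows; the component weights w are placeholders (the exact weighting of the published tscore is not reproduced here), and angle_loss refers to the constraint helper sketched in Section 4.2.

```python
import torch
import torch.nn.functional as F

def t_score(gen, real, D_prime, triplets, w=(1.0, 1.0, 1.0, 1.0), eps=1e-3):
    """Sketch of Eqs. (17)-(19) for one batch of pose frames.

    gen, real: (B, 17, 3) generated and ground-truth poses;
    D_prime: newly trained discriminator returning (B, 1) probabilities;
    w: placeholder weights for the four components of Eq. (19).
    """
    err = (gen - real).norm(dim=-1)                 # (B, 17) per-joint position errors
    e_bar, sigma_e = err.mean(), err.std()          # Eqs. (17)-(18)
    l_angle = angle_loss(gen, real, triplets, eps)  # Eqs. (9)-(12) on sampled triplets
    l_disc = F.binary_cross_entropy(                # adversarial term under D'
        D_prime(gen), torch.ones(gen.size(0), 1))
    parts = (e_bar, sigma_e, l_angle, l_disc)
    return sum(wi * pi for wi, pi in zip(w, parts)).item()
```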
4.5. Style-specific sample selection
To determine whether a generated pose corresponds to a specific style, we use a reference skeleton $X_{\text{ref}} \in \mathbb{R}^{17 \times 3}$ that characterizes the target style. Each generated candidate $\hat{X}_i$ is evaluated against this reference using the Frobenius norm:

$$d_i = \lVert \hat{X}_i - X_{\text{ref}} \rVert_F \tag{20}$$

A tolerance margin $\tau$ is introduced to account for natural variability. The set of candidates that meet this condition is given by

$$\mathcal{S} = \{\, i : d_i \le \tau \,\} \tag{21}$$

All candidates in $\mathcal{S}$ can be regarded as valid realizations of the target style, since their deviations from the reference skeleton remain within the acceptable margin. Among these, the most representative sample $i^*$ is identified as the one with the minimum error:

$$i^* = \arg\min_{i \in \mathcal{S}} d_i \tag{22}$$

If no candidate satisfies the tolerance, the threshold may be relaxed, or the top-$k$ closest samples may be retained. This procedure ensures that generated poses are filtered and selected strictly according to their proximity to the designated style, thereby providing a consistent criterion for style-constrained pose generation.
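The selection rule of Equations (20)–(22) can be implemented in a few lines, as sketched below; the fallback to the top-k closest samples mirrors the relaxation described above.

```python
import torch

def select_style_samples(candidates, reference, tol, top_k=5):
    """Section 4.5: filter candidates by Frobenius distance to the
    style reference (Eq. 20) and pick the representative sample.

    candidates: (M, 17, 3) generated poses; reference: (17, 3).
    Returns (indices within tolerance, index of the best sample).
    """
    d = (candidates - reference).flatten(1).norm(dim=1)  # Frobenius distance per candidate
    keep = torch.nonzero(d <= tol).squeeze(1)            # Eq. (21): candidates within tau
    if keep.numel() == 0:                                # fallback: relax to top-k closest
        keep = d.argsort()[:top_k]
    best = keep[d[keep].argmin()]                        # Eq. (22): representative sample i*
    return keep, best
```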
5. Experiments and evaluation
5.1. Comparative evaluation with other models
Adversarial models for skeleton generation have been proposed by many researchers in recent years. However, most of these models are trained on high-resolution datasets containing densely annotated data (such as image collections with over 100,000 samples), where a large amount of structural information is extracted from pixel-level features. As a result, such models are not designed to capture sparse spatial structures or joint-level relationships. Therefore, in this study, a sparse 17 × 3 skeletal representation is adopted, and comparative experiments are conducted using representative adversarial models, including MLPGAN [35], GRAGAN [36], SAGAN [37], DCGAN [38], CNNGAN [39], and RESNETGAN [40]. CNNGAN and RESNETGAN adopt conventional CNN and ResNet architectures for feature extraction within their adversarial frameworks. The evaluation metric is defined by Equation (19), and the results are shown in Table 4. CDGSPAN achieves the lowest tscore, and the generated samples are closer to the real samples than those produced by other models; the lowest mean error ē and standard deviation σe are also observed. As shown in Table 4, CDGSPAN achieves a discriminator loss Ldisc of 7.1205, which is at a comparatively low level. This indicates that the samples generated by CDGSPAN are more difficult for the discriminator to distinguish from real samples. Meanwhile, it is worth noting that tested models such as DCGAN, CNNGAN, and RESNETGAN exhibit lower values on certain traditional loss metrics compared with the proposed CDGSPAN.
According to the data in Table 4, beyond its low discriminator loss (Ldisc = 7.1205, compared with, e.g., DCGAN's 7.4296), CDGSPAN decisively outperforms all baselines on domain-relevant criteria. Specifically, CDGSPAN's joint-angle penalty (Langle) drops to 21.6705 rad, a reduction of about 74% relative to methods such as MLPGAN (Langle = 83.0964 rad), indicating significantly fewer extreme or illegal angles in the generated 3D skeletons. In terms of geometric accuracy and stability, CDGSPAN's mean error (ē) is 0.1494, much lower than that of other methods (e.g., RESNETGAN with ē = 0.4937), demonstrating better accuracy. Furthermore, its standard deviation (σe) is 0.0918, significantly lower than those of MLPGAN and SAGAN, indicating that CDGSPAN produces more consistent and stable poses. On the task-specific quality metric tscore, CDGSPAN achieves 520.3430, far lower than other methods such as CNNGAN (tscore = 4835.8330). This shows that CDGSPAN delivers superior structural plausibility, geometric consistency, and perceptual/physiological quality. Despite its higher model complexity (0.272M parameters) and inference time (8.1454E-04 s), CDGSPAN consistently produces higher-quality results than the other methods while remaining competitive in efficiency, maintaining a strong performance-to-efficiency ratio in generating high-quality 3D skeletons.
From the comparison with ground-truth samples, it is observed that the 3D skeleton sequences generated by CDGSPAN resemble the ground-truth ones more closely than the other methods. As illustrated in Figs 10–, the generated sequences closely match the structure of the ground-truth poses, with only minimal deviations. Although slight differences in limb motion patterns can be observed, the overall results exhibit strong consistency between the generated and real data. These visual observations align well with the quantitative results reported in Table 4, further confirming the effectiveness of CDGSPAN.
(CDGSPAN’s generated 3D skeleton sequences closely match the ground truth with only minor deviations, consistent with the quantitative results in Table 4.).
Fig 18. Stacked bar charts of mean subjective ratings per profession from five rater groups—clinician (n = 8), biomechanist (n = 6), ML (Machine Learning) researcher (n = 10), animator (n = 7), and layperson (n = 9); total N = 40—across eight conditions (MLPGAN, GRAGAN, SAGAN, DCGAN, CNNGAN, RESNETGAN, CDGSPAN (Ours), and ground-truth data). For each method, five side-by-side bars correspond to the five professions; each bar reflects the average score within that profession. Within every bar, colored segments denote Realism, Naturalness, and Diversity, and the total height gives the Overall score (0–30). Ground truth serves as the perceptual upper bound (≈29/30), while CDGSPAN closely approaches this reference and visually outperforms the baselines on all three dimensions.
(Across five rater groups, CDGSPAN nearly matches the ground truth (~28/30 vs. ~29/30 Overall) and consistently outperforms all baselines on Realism, Naturalness, and Diversity; the ranking (GT > CDGSPAN > others) is stable across professions.).
Across professions (group means), CDGSPAN attains Overall scores of 27.8–28.1, with mean gaps to ground truth of ~1.0–1.7 points (e.g., clinician 27.84 vs. 28.84, biomechanist 27.90 vs. 29.55, ML researcher 28.11 vs. 29.53, animator 28.14 vs. 29.17, layperson 27.79 vs. 29.21). Improved GAN baselines (DCGAN/CNNGAN/RESNETGAN) cluster around ~12–17, and traditional GANs (MLPGAN/GRAGAN/SAGAN) around ~1–7. The stack composition shows consistent gains in Realism and Naturalness (≈9.0–9.6 each) while maintaining strong Diversity, and the ranking (ground truth > CDGSPAN > others) remains stable across all five profession-specific averages. The results are consistent with the tscore values reported in Table 4.
5.2. Ablation study
An ablation study is conducted to evaluate the impact of removing key components from the CDGSPAN model. As shown in Table 5, removing either key component leads to marked degradation. The full CDGSPAN achieves the lowest joint-angle penalty Langle, which rises to 47.9018 rad without the spatial constraints and to 25.9700 rad without the dynamic-graph operation. Geometric accuracy and stability follow the same trend: the full model's mean error ē and standard deviation σe deteriorate in both the no-spatial and no-dynamic variants, showing that both modules are crucial for suppressing implausible limb postures and improving kinematic consistency.
Table 5 also shows that the composite quality metric tscore, which captures perceptual and physiological plausibility, is most sensitive to these removals: the full model attains tscore = 480.0798, which balloons without the spatial constraints and without the dynamic operation. Notably, discriminator confidence Ldisc alone can mislead: the no-dynamic model has an Ldisc lower than the full model's 7.4584, yet its tscore is far worse—hence the need for tscore as a task-relevant indicator.
Finally, Table 5 reports similar model sizes across settings (0.026–0.027M parameters) and inference times on the order of 10⁻⁴ s (full 8.1265E-04 s; no-spatial 7.6006E-04 s; no-dynamic 1.2549E-04 s, faster after removing the dynamic operation). In sum, Table 5 demonstrates that without increasing model scale, the full CDGSPAN simultaneously achieves the lowest Langle, ē, σe, and tscore, confirming that both the spatial constraints and the dynamic graph operation are indispensable for structural plausibility, kinematic consistency, and perceptual quality.
The ablation experiments above demonstrate that the comprehensive loss score calculated using Equation (19) serves as a valid and reliable measure for evaluating generation quality.
The 3D skeletal sequences generated by the models in Table 5 can be visually inspected and compared with the ground-truth samples, as shown in Figs 19–. It can be observed that for the CDGSPAN variants with the spatial constraint module or the dynamic graph operation removed, the generated poses exhibit mutual misalignment between limbs, and certain joint positions appear in anatomically implausible locations, resulting in a noticeably chaotic structure. In contrast, the full CDGSPAN model produces samples that resemble the ground-truth ones, with no structural dislocation of limbs or joint positions that violate human motion kinematics in either the spatial or angular dimension.
(Removing the spatial constraint or motion regulation causes limb misalignment and anatomically implausible joints, whereas the full CDGSPAN matches the ground truth without kinematic violations; visuals align with Table 4 and support tscore Equation (19) as a reliable quality metric.).
5.3. Evaluation of style-specific pose selection
Building on the method of Section 4.5, we first sample poses from the proposed adversarial generator under random noise; then, for each generated sample, we perform Euclidean nearest-neighbor matching, as in Equation (20), against the training or testing set over the full 17 × 3 keypoint representation to identify its closest real exemplar and inherit the corresponding style label. Fig 23 arranges 10 distinct styles column-wise: the top row (Generated) displays the synthesized 3D skeletons, and the bottom row (Real) shows the matched real poses with their style annotations. Visually, generated poses preserve the global skeletal topology of their real counterparts—evidence that the model has learned plausible spatial/kinematic constraints—while minor local deviations (e.g., limb flexion, torso orientation) remain. The clear differentiation across columns confirms style diversity and indicates that Section 4.5's selection and matching pipeline effectively ensures that generated samples are aligned with the correct style categories present in the training or testing samples.
(Using nearest-neighbor matching over 17 × 3 keypoints, generated poses inherit style labels from their closest real exemplars; across 10 styles, the synthesized (top) closely align with matched real poses (bottom), preserving global topology with minor local deviations and confirming style diversity/alignment).
A reference sequence of length T frames is used for temporal alignment. As shown in Fig 24, the reference is subsampled with an interval of 30 frames, meaning adjacent reference poses are taken 30 frames apart in the original stream. For each time step $t$, the distance displayed above the generated panel is computed using Equation (20); smaller values indicate closer matches. Following Section 4.5, the real, temporally ordered sequence is employed to guide selection from a pool of model outputs: at each $t$, the candidate minimizing the distance in Equation (20) is chosen, and a one-to-one constraint is enforced to prevent reuse. The selected poses are then concatenated over time, yielding a temporally aligned, model-generated sequence that mirrors the progression of the real sequence while preserving the target style.
(Top row shows temporally coherent samples generated by the proposed model, while the bottom row shows real temporal samples that are used to guide the generation of the top sequence.).
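A greedy implementation of this guided selection might look as follows; the one-to-one constraint is enforced by masking already-used candidates, which is one simple realization of the rule described above.

```python
import torch

def align_to_reference(pool, reference_seq):
    """Greedy one-to-one matching: for each reference frame, pick the
    unused generated pose minimizing the Eq. (20) distance.

    pool: (M, 17, 3) generated poses; reference_seq: (T, 17, 3).
    Returns a (T, 17, 3) temporally aligned generated sequence.
    """
    used = torch.zeros(pool.size(0), dtype=torch.bool)
    out = []
    for ref in reference_seq:                    # temporally ordered reference poses
        d = (pool - ref).flatten(1).norm(dim=1)  # Frobenius distance to every candidate
        d[used] = float('inf')                   # enforce the one-to-one constraint
        idx = int(d.argmin())
        used[idx] = True
        out.append(pool[idx])
    return torch.stack(out)                      # concatenated over time
```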
5.4. Visual sensitivity analysis of constraint loss weights
For the parameter sensitivity test, λ_spatial = 0.005 and λ_angle = 0.005 are used as the baseline configuration. The two parameters are adjusted individually and in combination, and the generated samples under these settings are visually compared to assess the effect of the constraint weights. It is observed that when λ_spatial or λ_angle is set too large or too small, the generated 3D skeletons become distorted: for example, sample diversity is suppressed and limb lengths deviate from normal human proportions, while moderate values (λ_spatial = 0.005 and λ_angle = 0.005) lead to more natural and diverse results. The sensitivity test results are shown in the corresponding figures: relative to the baseline (λ_spatial = 0.005, λ_angle = 0.005), increasing or decreasing either weight—alone or together—yields distorted 3D skeletons (reduced diversity, abnormal limb lengths), whereas the baseline produces the most natural and diverse samples.
5.5. Statistical evaluation of CDGSPAN performance
We evaluated sampling efficiency and acceptance-rate stability under a uniform per-class quota. Using the full training set (per-frame 17 × 3 keypoints), poses were normalized by centering on the pelvis/root joint and scaling by the mean joint-to-root distance. A pretrained generator produced samples from standard-normal noise, and each candidate was assigned a class via Euclidean nearest neighbor to the normalized training set. For total sample sizes Nsample ∈ {300, 600, 1000, 1500, 2000, 2500, 3000}, as shown in Figs 30 and 31, we enforced a 10% target per class (i.e., uniform 1/C for the 10 style categories) and stopped once all quotas were met. We recorded wall-clock time to completion and defined the acceptance rate as the ratio of accepted samples to all evaluated candidates. The figures report total samples vs. time with an ordinary least-squares fit (slope and R²) and total samples vs. acceptance rate with an in-figure dashed mean line; all measurements are collected on the same hardware/software configuration for consistency.
Under a uniform 10% per-class quota, time-to-quota increases approximately linearly with total samples, and the acceptance rate remains stable across scales. No collection bottlenecks or declines in acceptance are observed, indicating no evident performance impact from class-wise dataset bias; any residual effect, if present, is small and does not manifest as measurable efficiency or acceptance-rate degradation. Overall, the proposed method (CDGSPAN) shows good scalability and robustness within the current data distribution.
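A sketch of the quota-driven sampling loop used for this measurement follows; assigning classes by distance to per-class centroids is a simplification of the nearest-neighbor-to-training-set rule described above, and the stopping condition assumes every class remains reachable.

```python
import time
import torch

def sample_until_quota(G, class_centroids, n_total, n_classes=10):
    """Section 5.5 sampling loop: draw from the generator, assign a
    class by nearest neighbor, and stop once every quota is met.

    class_centroids: (C, 17, 3) normalized per-class references.
    Returns (wall-clock time, acceptance rate).
    """
    quota = n_total // n_classes
    counts = torch.zeros(n_classes, dtype=torch.long)
    evaluated, accepted, t0 = 0, 0, time.time()
    while (counts < quota).any():
        pose = G(torch.randn(1, 17, 3)).flatten(1)        # one candidate pose
        d = (class_centroids.flatten(1) - pose).norm(dim=1)
        c = int(d.argmin())                               # nearest-neighbor class label
        evaluated += 1
        if counts[c] < quota:                             # accept only if quota not full
            counts[c] += 1
            accepted += 1
    return time.time() - t0, accepted / evaluated         # time, acceptance rate
```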
The model’s generalization can be affected if the 3D skeleton samples use inconsistent coordinate conventions. If other videos or images are reconstructed and then mapped to the same coordinate system as the training data—i.e., scaled and normalized so that keypoint coordinates are unified—then even a single-source dataset can train a well-generalizing model. When training on multiple datasets, it is essential to standardize the 3D skeleton keypoint coordinates across datasets. Therefore, a robust preprocessing pipeline must be established to perform the necessary coordinate alignment and normalization.
Across ten independent runs per method (Trials = 10), the composite score tscore consistently ranks methods in a way that matches our qualitative, visual assessment of the generated 3D skeletons. A smaller tscore indicates better sample quality. In Table 6, CDGSPAN yields the lowest tscore values run-by-run, while baselines such as CNNGAN, DCGAN, GraphGAN, MLPGAN, RESNETGAN, and SAGAN exhibit substantially larger scores. This behavior is stable across runs and reflects the constituent terms as well: CDGSPAN maintains lower ē, σe, and Langle, resulting in a markedly reduced tscore. Overall, the numerical results corroborate the visual inspection of generated poses; methods with lower tscore also produce visibly cleaner, more anatomically plausible 3D skeleton samples.
Table 7 reports repeat-aligned comparisons on the raw tscore (lower is better; smaller values indicate the generated 3D skeletons are closer to real samples). For each pair (A, B) we summarize the paired mean difference d = tscore_A − tscore_B together with its 95% confidence intervals (paired-t and bootstrap), a one-sided Wilcoxon signed-rank p-value for the alternative d < 0, and the Hodges–Lehmann (HL) median difference with its bootstrap CI. A clear and consistent pattern emerges: whenever CDGSPAN appears as method A, the mean difference is large in magnitude and negative (e.g., CDGSPAN vs. CNNGAN: −7497.8400; vs. DCGAN: −3029.0040; vs. MLPGAN: −10208.0400, etc.). Because d < 0 means A's tscore is smaller, these rows show that CDGSPAN attains the smaller (better) tscore in every pairing. Moreover, for all "CDGSPAN vs. baseline" rows, both the paired-t and the bootstrap 95% CIs lie entirely below zero (e.g., CDGSPAN vs. CNNGAN: t-CI [−8358.8200, −6636.8500], boot-CI [−8236.7600, −6858.3600]); intervals excluding 0 provide strong evidence that the observed differences are not due to sampling variation.
The nonparametric results reinforce this conclusion: the Wilcoxon one-sided p-values for the CDGSPAN comparisons are extremely small (≈ 0.001 or smaller when the alternative is stated in the direction of CDGSPAN’s advantage), and the HL median differences are likewise far below zero with bootstrap CIs that do not cross 0, confirming the effect on the median scale and reducing sensitivity to non-normality and outliers. Several baseline-to-baseline comparisons show CIs that straddle 0 with non-significant Wilcoxon p-values, indicating no reliable difference between those methods on the paired runs and further highlighting the separation achieved by CDGSPAN. In sum, across all paired comparisons, smaller tscore reliably corresponds to higher sample quality—i.e., generated 3D skeletons that are closer to real data—and CDGSPAN consistently achieves the lowest (best) tscore among all methods. The agreement of the mean differences, both types of 95% confidence intervals, the Wilcoxon tests, and the HL median estimates provides convergent statistical evidence that CDGSPAN generates the highest-quality samples under this metric.
Table 8 provides a comprehensive analysis of biomechanical metrics, evaluating the various models in comparison to the ground truth. The key metrics analyzed include bone length consistency, angular velocity, angular acceleration, and joint angles, alongside their deviations from the ground-truth data.
Average bone length consistency (mm): The consistency of bone lengths across different models shows substantial variation. CDGSPAN, with an average of 31.0564 mm, deviates from the ground truth (0.0000 mm), indicating a residual mismatch in the generated bone lengths. In contrast, models like MLPGAN and DCGAN show much higher discrepancies (245.2761 mm and 107.8953 mm, respectively), suggesting poor biomechanical alignment in terms of bone structure. This could affect the physical plausibility of generated poses.
Angular velocity mean squared error (MSE): The angular velocity MSE quantifies the accuracy of the generated models in replicating the ground truth data’s dynamic motion. CDGSPAN achieves the best performance with an MSE of 0.0800, closely aligning with the real angular velocities. On the other hand, models such as MLPGAN (0.9421) and GraphGAN (1.3604) demonstrate significantly larger discrepancies. These higher values indicate that these models fail to match the dynamic behavior of the ground truth, which directly impacts their dynamic plausibility and smoothness, leading to less realistic motion trajectories.
Average angular acceleration (rad/s²): Angular acceleration measures the rate of change in angular velocity over time. CDGSPAN again performs best, with a value of 0.0005 rad/s², closely matching the ground truth. In contrast, GraphGAN (0.0061 rad/s²) shows a more significant difference, suggesting less accurate replication of motion dynamics. The smaller difference observed in CDGSPAN indicates better dynamic plausibility, as it generates smoother transitions in motion, avoiding abrupt changes that would be unnatural in real-world movement.
Angular acceleration deviation (rad/s²): This metric measures the deviation in angular acceleration. CDGSPAN maintains a low value (0.0013 rad/s²), showing minimal fluctuation relative to the ground-truth data. Models such as GraphGAN (0.0069 rad/s²) show larger discrepancies, indicating more erratic motion behavior and poor dynamic continuity.
Average joint angles (rad): Joint angles reflect the relative positions of body parts, and CDGSPAN achieves an average angle of 2.1657 rad, which is quite close to the ground truth (2.2070 rad). In comparison, GraphGAN’s joint angles are further off, with an average of 2.0495 rad. This discrepancy suggests that CDGSPAN more accurately captures the spatial arrangement of body parts, contributing to better realism in the generated poses.
Joint-angle deviation (rad): CDGSPAN shows a minimal difference of 0.0413 rad, consistent with the ground-truth data. Other models like GraphGAN exhibit larger differences, notably 0.1575 rad, which again shows poorer alignment with the expected human pose, diminishing the realism and smoothness of the generated motion.
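For reference, two of these biomechanical metrics can be computed as sketched below; the definitions used here (per-bone length variability over time, and MSE of finite-difference angular velocities) are plausible readings of the metric names in Table 8, not the paper's exact formulas.

```python
import torch

def bone_length_consistency(seq, edges):
    """Std of each bone's length over time, averaged over bones
    (reported in mm if inputs are in mm); rigid ground-truth
    skeletons score 0 by construction."""
    p = torch.tensor([e[0] for e in edges])
    q = torch.tensor([e[1] for e in edges])
    lengths = (seq[:, p] - seq[:, q]).norm(dim=-1)  # (T, 16) bone lengths per frame
    return lengths.std(dim=0).mean().item()

def angular_velocity_mse(gen_angles, real_angles, dt=1.0):
    """MSE between frame-to-frame angular velocities of generated
    and real joint-angle sequences, each of shape (T, K) in radians."""
    w_gen = (gen_angles[1:] - gen_angles[:-1]) / dt   # finite-difference velocity
    w_real = (real_angles[1:] - real_angles[:-1]) / dt
    return ((w_gen - w_real) ** 2).mean().item()
```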
In summary, CDGSPAN emerges as the top performer in terms of biomechanical accuracy, especially when considering both bone length consistency and dynamic plausibility. It exhibits the smallest deviations in angular velocity, angular acceleration, and joint angles, thereby generating smoother, more biologically plausible motion sequences. On the other hand, models like GraphGAN and MLPGAN show larger deviations across multiple metrics, indicating that while these models may generate dynamic motion, the results may lack the realism and consistency required for biomechanical simulations, especially in areas such as joint angles and angular velocity. These discrepancies highlight the challenges in generating physically accurate and dynamically plausible motion with certain models.
6. Conclusion
In this article, a Constrained Dynamic Graph Spatial Perception Adversarial Network (CDGSPAN) is proposed. The model is designed to learn from real 3D skeletal motion sequences and generate new sample sequences. Experimental results indicate that the generated 3D skeletal sequences closely resemble the ground-truth samples. CDGSPAN integrates dynamic graph operations and incorporates spatial constraint regularization during training, enabling the model to capture human motion features and joint-level spatial relationships from sparse real-world 3D skeletal data. Throughout adversarial training, the generator and discriminator are updated based on the spatial positions and angular dependencies among joints, which allows the model to learn structured kinematic patterns. After training, the model can generate, from random input tensors, 3D skeletal samples that conform to human kinematic rules. Despite its effectiveness in generating valid sparse 3D skeletons, CDGSPAN still exhibits certain limitations. Although CDGSPAN can yield style-consistent outputs after generation via the style-specific sample selection procedure introduced in Section 4.5, the model itself does not natively control style or action type at inference time. As a result, the diversity and granularity of generated poses remain constrained, particularly for specified or composite actions (e.g., bending, hand-raising, squatting, or transitions among them). This limitation stems from the absence of explicit style/action control variables during training, which prevents the generator from learning to conditionally regulate pose variation. Enabling model-intrinsic control and direct style-aware outputs requires more fine-grained modeling, for example, a conditional adversarial framework with explicit conditioning signals (style/action codes), disentangled representations, or auxiliary control heads.
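As one concrete form of the conditional extension suggested above, the sketch below shows a generator that concatenates a one-hot style/action code to the input noise tensor, so the action type could be specified at inference. This is illustrative only and not part of CDGSPAN; the layer widths, code dimension, and joint count are assumptions.

```python
# Illustrative conditional-generator sketch (not the CDGSPAN architecture).
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, noise_dim=64, n_styles=8, n_joints=17):
        super().__init__()
        self.n_joints = n_joints
        self.net = nn.Sequential(
            nn.Linear(noise_dim + n_styles, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_joints * 3),           # 3D coordinates per joint
        )

    def forward(self, z, style):
        # style: one-hot (batch, n_styles) code selecting the target action type
        x = torch.cat([z, style], dim=-1)
        return self.net(x).view(-1, self.n_joints, 3)

g = ConditionalGenerator()
z = torch.randn(4, 64)
style = nn.functional.one_hot(torch.tensor([0, 1, 2, 3]), num_classes=8).float()
print(g(z, style).shape)  # torch.Size([4, 17, 3])
```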
CDGSPAN’s lack of native style/action control at inference has several adverse implications for deployment: target styles or composite actions cannot be specified deterministically, so outputs remain uncertain; the diversity and granularity of generated poses are limited, with weak coverage of rare or composite motions (e.g., squat→stand with an arm raise) and under-expressed fine details; reliance on post-generation style filtering introduces latency and a non-trivial rejection rate; outputs are biased toward frequent training styles and yield poorer consistency on long-tail styles; and stylistic attributes may drift across runs, reducing reproducibility. These issues collectively constrain usability in real-time or style-critical settings such as clinical workflows and animation pipelines. Moreover, if the proposed model is to generate temporally related samples, it still requires real temporal samples for guidance, which can be inconvenient.
To address these issues, future research may incorporate a broader range of 3D skeleton samples with diverse pose types for training. In addition, conditional generative mechanisms or latent-space modeling techniques can be introduced to enhance controllability over motion styles and improve the diversity of generated results. The model may also suffer from overfitting or unstable training, especially when generating high-quality, biomechanically valid pose sequences. Therefore, further improvements in model architecture and integration with other generative paradigms are encouraged to enhance the realism and variety of generated samples. Future work will focus on controlling the style, type, and diversity of generated poses, thereby improving the flexibility and applicability of the model in practical scenarios.
Supporting information
S1 Dataset. The dataset file used for CDGSPAN training and visualization. The dataset is described in Table 2.
https://doi.org/10.1371/journal.pone.0339297.s001
(PKL)
S2 File. The Python script for reading the dataset.
The script must be run in an environment with the packages listed in S3 Text installed.
https://doi.org/10.1371/journal.pone.0339297.s002
(PY)
S3 Text. A list of the environment packages required to read the dataset.
The listed packages can be installed by running the command “pip install -r requirements.txt”. All files are available at: https://www.kaggle.com/datasets/luther1212/cdgspan-dataset#.
https://doi.org/10.1371/journal.pone.0339297.s003
(TXT)
References
- 1. Ibh M, Grasshof S, Witzner D, Madeleine P, editors. TemPose: a new skeleton-based transformer model designed for fine-grained motion recognition in badminton. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023.
- 2. Zhu H, Zheng Z, Nevatia R, editors. Gait recognition using 3-D human body shape inference. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; 2023.
- 3. Wibowo MC, Nugroho S, Wibowo A. The use of motion capture technology in 3D animation. International Journal of Computing and Digital Systems. 2024;15(1):975–87.
- 4. Sharma S, Verma S, Kumar M, Sharma L, editors. Use of Motion Capture in 3D Animation: Motion Capture Systems, Challenges, and Recent Trends. 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon); 2019 Feb 14-16.
- 5. Li X, Fan D, Feng J, Lei Y, Cheng C, Li X. Systematic review of motion capture in virtual reality: Enhancing the precision of sports training. Journal of Ambient Intelligence and Smart Environments. 2025;17(1):5–27.
- 6. Luvizon DC, Habermann M, Golyanik V, Kortylewski A, Theobalt C. Scene-Aware 3D Multi-Human Motion Capture from a Single Camera. Computer Graphics Forum. 2023;42(2):371–83.
- 7. Huang Y, Taheri O, Black MJ, Tzionas D. InterCap: Joint Markerless 3D Tracking of Humans and Objects in Interaction from Multi-view RGB-D Images. Int J Comput Vis. 2024;132(7):2551–66.
- 8. Xu G, Rao G, Zhang L, Cong Q. Entity-relation aggregation mechanism graph neural network for knowledge graph embedding. Appl Intell. 2024;55(1).
- 9. Song Q, Li C, Fu J, Zeng Q, Xie N. Self-supervised heterogeneous graph neural network based on deep and broad neighborhood encoding. Appl Intell. 2025;55(7).
- 10. Li M, Ma W, Chu Z. User preference interaction fusion and swap attention graph neural network for recommender system. Neural Netw. 2025;184:107116. pmid:39798353
- 11. Li M, Su J, Song Z, Qiu J, Lin Y. An interactive address matching method based on a graph attention mechanism. International Journal of Cognitive Computing in Engineering. 2025;6:191–200.
- 12. Zhou H, Zhao T, Fang Y, Liu Q. A trajectory prediction method based on graph attention mechanism. Applied Mathematics and Nonlinear Sciences. 2023;9(1).
- 13. Zhang H, Tu K, Lv H, Wang R. Hyperspectral Image Classification Based on 3D–2D Hybrid Convolution and Graph Attention Mechanism. Neural Process Lett. 2024;56(2).
- 14. Ullah A, Numan M, Khalid MNA, Majid A. Words shaping worlds: A comprehensive exploration of text-driven image and video generation with generative adversarial networks. Neurocomputing. 2025;632:129767.
- 15. Jiang H, Luo X, Yin J, Fu H, Wang F. Orthogonal Subspace Representation for Generative Adversarial Networks. IEEE Trans Neural Netw Learn Syst. 2025;36(3):4413–27. pmid:38530724
- 16. Yan C, Chang X, Li Z, Guan W, Ge Z, Zhu L, et al. ZeroNAS: Differentiable Generative Adversarial Networks Search for Zero-Shot Learning. IEEE Trans Pattern Anal Mach Intell. 2022;44(12):9733–40. pmid:34762584
- 17. Souibgui MA, Kessentini Y. DE-GAN: A Conditional Generative Adversarial Network for Document Enhancement. IEEE Trans Pattern Anal Mach Intell. 2022;44(3):1180–91. pmid:32894707
- 18. Wang J, Jin C, Zhou S. Segmentation-aware image super-resolution with generative adversarial networks. Multimedia Systems. 2025;31(2).
- 19. Zand M, Etemad A, Greenspan M. Multiscale residual learning of graph convolutional sequence chunks for human motion prediction. arXiv preprint arXiv:2308.16801. 2023.
- 20. Wang X, Zhang W, Wang C, Gao Y, Liu M. Dynamic Dense Graph Convolutional Network for Skeleton-Based Human Motion Prediction. IEEE Trans Image Process. 2024;33:1–15. pmid:38019621
- 21. Wang B. A pose generation model for animated characters based on DCNN and PFNN. Systems and Soft Computing. 2024;6:200115.
- 22. Zand M, Etemad A, Greenspan M. Flow-Based Spatio-Temporal Structured Prediction of Motion Dynamics. IEEE Trans Pattern Anal Mach Intell. 2023;45(11):13523–35. pmid:37463083
- 23. Gan Q, Ren Y, Zhang C, Ye Z, Xie P, Yin X, et al. HumanDiT: Pose-guided diffusion transformer for long-form human motion video generation. arXiv preprint arXiv:2502.04847. 2025.
- 24. Ren Z, Huang S, Li X, editors. Realistic human motion generation with cross-diffusion models. European Conference on Computer Vision. Springer; 2024.
- 25. Ma F, Xia G, Liu Q. 3D human model guided pose transfer via progressive flow prediction network. Journal of Visual Communication and Image Representation. 2024;105:104327.
- 26. Dong Y, Liu S, Li Y, Zheng L. Aesthetic-aware adversarial learning network for artistic style transfer. Neurocomputing. 2025;646:130431.
- 27. Zheng C, Pan L, Wu P. Attribute Augmented Network Embedding Based on Generative Adversarial Nets. IEEE Trans Neural Netw Learn Syst. 2023;34(7):3473–87. pmid:34623283
- 28. Li J, Pan S, Zhang X, Lin CT, Stayman JW, Gang GJ. Generative Adversarial Networks With Radiomics Supervision for Lung Lesion Generation. IEEE Trans Biomed Eng. 2025;72(1):286–96. pmid:39208053
- 29. Pavllo D, Feichtenhofer C, Grangier D, Auli M, editors. 3D human pose estimation in video with temporal convolutions and semi-supervised training. 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019); 2019 Jun 16-20; Long Beach, CA, United States: IEEE Computer Society; 2019.
- 30. Waleed A. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. GitHub repository; 2017. Available from: https://github.com/matterport/Mask_RCNN
- 31. Sigal L, Balan AO, Black MJ. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision. 2010;87(1-2):4–27.
- 32. Ionescu C, Papava D, Olaru V, Sminchisescu C. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Trans Pattern Anal Mach Intell. 2014;36(7):1325–39. pmid:26353306
- 33. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process. 2004;13(4):600–12. pmid:15376593
- 34. Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems. 2017;30.
- 35. Guan S, editor. Performance Analysis of Convolutional Neural Networks and Multilayer Perceptron in Generative Adversarial Networks. 3rd IEEE International Conference on Power, Electronics and Computer Applications (ICPECA 2023); 2023 Jan 29-31; Shenyang, China: Institute of Electrical and Electronics Engineers Inc.; 2023.
- 36. Fathallah M, Eletriby S, Alsabaan M, Ibrahem MI, Farok G. Advanced 3D Face Reconstruction from Single 2D Images Using Enhanced Adversarial Neural Networks and Graph Neural Networks. Sensors (Basel). 2024;24(19):6280. pmid:39409320
- 37. Shen L, Yan J, Sun X, Li B, Pan Z. Wavelet-Based Self-Attention GAN With Collaborative Feature Fusion for Image Inpainting. IEEE Trans Emerg Top Comput Intell. 2023;7(6):1651–64.
- 38. Li B, Li Z, Du Q, Luo J, Wang W, Xie Y, et al. LogiCity: Advancing Neuro-Symbolic AI with Abstract Urban Simulation. In: Globerson A, Mackey L, Belgrave D, Fan A, Paquet U, Tomczak J, et al., editors. Advances in Neural Information Processing Systems 37; 2024. p. 69840–64.
- 39. Purwono P, Ma’arif A, Rahmaniar W, Fathurrahman HIK, Frisky AZK, Haq QM ul. Understanding of Convolutional Neural Network (CNN): A Review. IJRCS. 2023;2(4):739–48.
- 40. Liang J. Image classification based on RESNET. J Phys: Conf Ser. 2020;1634(1):012110.