
JMM-TGT: Self-supervised 3D action recognition through joint motion masking and topology-guided transformer

  • Han Wen,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China

  • Guangping Zeng,

    Roles Conceptualization, Formal analysis, Funding acquisition, Project administration, Resources, Supervision, Writing – review & editing

    zgp@ustb.edu.cn

    Affiliation School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China

  • Qingchuan Zhang,

    Roles Conceptualization, Formal analysis, Methodology, Project administration, Resources, Supervision, Writing – review & editing

    Affiliation School of Computer and Artificial Intelligence, Beijing Technology and Business University, Beijing, China

  • Zihan Li,

    Roles Formal analysis, Investigation, Methodology

    Affiliation School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China

  • Mengyang Zhu

    Roles Data curation, Formal analysis, Investigation, Software, Validation

    Affiliation School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China

Abstract

In the field of 3D skeleton action recognition, research on self-supervised learning methods has primarily focused on spatio-temporal feature modeling. However, these methods rely heavily on modeling a single motion feature, which limits their ability to capture subtle motion variations and complex spatio-temporal relationships; as a result, the model's understanding of actions remains incomplete. To address this issue, this paper proposes the Joint Motion Masking with Topology-Guided Transformer model (JMM-TGT) for action recognition. First, the joint motion masking strategy is applied to enhance the ability of the model to perceive subtle joint movements. This strategy generates masking probabilities by combining the differences and similarities in joint motion, thereby guiding the selection of joints to be masked at each time step. Meanwhile, in the transformer-based encoder module, the topological relationships between joints are introduced to adjust the attention mechanism, allowing the model to capture spatio-temporal dependencies and better understand the complex dynamic patterns of joint motion. To verify the performance of the JMM-TGT model, we conducted comparison experiments against mainstream action recognition models. The experiments demonstrate that the proposed JMM-TGT achieves performance improvements ranging from 1.5% to 7.9% under different evaluation settings on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets.

1 Introduction

Action recognition is the task of identifying and categorizing specific actions within videos [1]. It is widely applied in fields such as virtual reality [2], autonomous driving [3], and video surveillance [4]. Advances in pose estimation algorithms [5] have simplified the acquisition of 3D skeleton data, which is also more resistant to interference. These factors have made 3D skeleton-based action recognition a popular research topic in recent years. Driven by graph convolutional networks [6] and transformers [7], numerous supervised learning methods have emerged, achieving excellent performance thanks to large amounts of labeled data. However, the increasing cost of acquiring labeled data makes it difficult to rely solely on supervised learning, so using unlabeled data for training has become increasingly important. Semi-supervised learning [8] with a small amount of annotated data and unsupervised learning based on unlabeled data have gradually attracted attention, but these methods still face limitations in accuracy and generalization. Therefore, self-supervised learning has become a promising solution in the field of action recognition. It learns from the intrinsic structure of the data itself, which not only significantly reduces the dependence on labeled data but also improves the learning ability of the model on large-scale unlabeled data.

The introduction of the Multi-Layer Perceptron (MLP) [9] has advanced the application of self-supervised learning in spatio-temporal modeling. Spatio-temporal feature modeling aims to capture the complex dependencies across both space and time in videos, enabling the model to better understand the temporal and spatial structure of actions. Current research focuses on efficiently extracting spatio-temporal features to improve the accuracy and robustness of action recognition models, particularly in complex environments. Given the impressive performance of the Graph Convolutional Network (GCN) and the transformer in action recognition tasks, researchers have explored combining these networks with other techniques. The integration of the GCN with contrastive learning, particularly Cross-Stream Contrastive Learning [10], has successfully reduced the reliance on labeled data and introduced new perspectives for spatio-temporal feature modeling. Furthermore, Contrastive Learning with Cross-Part Bidirectional Distillation [11] enhances the model's understanding of actions by applying bidirectional distillation between different parts of the joints and the skeleton. Moreover, the combination with masking strategies has shown promising results. For example, the Spatial-Temporal Masked Autoencoder framework (SkeletonMAE) [12] demonstrates the unique advantages of combining self-supervised learning with masking strategies by randomly masking certain joints or frames and using transformers for prediction. To improve model performance in complex action recognition tasks, spatio-temporal feature modeling has increasingly focused on capturing relative motion. The approach of modeling similarity between local and global skeleton sequences [13] laid a solid foundation for subsequent research; it effectively captures richer spatio-temporal relationships between actions across different skeleton parts. Finally, the Relative Visual Tempo Contrastive Learning framework for skeleton action representation (RVTCLR) [14] highlights the importance of relative motion modeling in recognizing subtle movement differences, especially for fine-grained and cross-view action recognition.

Although existing spatio-temporal feature modeling approaches have advanced 3D skeleton action recognition to some extent, they still struggle to capture subtle motion variations and complex spatio-temporal dependencies. Current models tend to rely overly on single motion feature modeling [12], which makes it difficult to capture global dynamic changes caused by small motion differences, thereby limiting the ability of the model to fully understand actions. Moreover, these models often overlook the structured information in skeleton data [12–14], preventing them from effectively learning and utilizing the spatio-temporal dependencies between joints in complex actions. This oversight affects both action recognition accuracy and the generalization ability of the model.

To address these defects, this paper proposes the Joint Motion Masking with Topology-Guided Transformer model (JMM-TGT) for 3D skeleton action recognition. The core idea of this model is to enhance the fine-grained modeling of complex spatio-temporal dependencies by incorporating joint motion difference, similarity information, and inter-joint topological relationships. To be specific, the main contributions of this paper are as follows:

  1. We propose a joint motion masking strategy (JMM). It integrates the differences and similarities in joint motion across adjacent frames as priors and employs an adaptive skeleton masking approach to better focus on semantically rich temporal regions. The motion similarity information enhances the ability of the model to perceive subtle joint movements. This method fuses global and local motion features of the joints, effectively overcoming the limitation of relying on a single motion modeling approach.
  2. We design a topology-guided transformer (TGT). The topological relationships between joints are introduced into the attention mechanism, which not only captures spatial dependencies between joints but also ensures effective learning of temporal details, enhances the focus on key joint motion variations, and enables the model to understand complex actions.
  3. To validate the effectiveness of the model, experiments are conducted on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets. The accuracy reaches 84.4% and 90.1% under the X-Sub and X-View protocols on NTU-60, respectively, and 78.2% and 78.8% under the X-Sub and X-Set protocols on NTU-120, as shown in Fig 1. Additionally, transfer learning on PKU-II achieves an accuracy of 73.0%. These results validate the effectiveness of JMM-TGT and its excellent transfer ability.
Fig 1. A comparison of the proposed method with SkeletonMAE (ICMEW 23) [12], using linear probing evaluation protocol.

https://doi.org/10.1371/journal.pone.0338008.g001

2 Related works

According to the training paradigm, action recognition approaches can be divided into two categories: supervised and unsupervised learning methods, and self-supervised learning methods. Table 1 summarizes the main related works on self-supervised learning.

Table 1. The summary of related work on self-supervised learning methods.

https://doi.org/10.1371/journal.pone.0338008.t001

2.1 Supervised and unsupervised action recognition

Supervised learning methods train models using labeled data, enabling them to predict or classify new, unlabeled data. Due to the spatio-temporal nature of skeleton data, the Graph Convolutional Network (GCN) has become widely used in action recognition. Yan S et al. [6] introduce the Spatial-Temporal Graph Convolutional Network (ST-GCN), which learns spatial and temporal patterns from data, overcoming the limitations of traditional methods. Cheng K et al. [15] propose Shift-GCN to address the high computational cost and rigid receptive fields of traditional GCNs. The Select-Assemble-Normalize Graph Convolutional Network (SAN-GCN) for improved feature modeling is introduced by Tian H et al. [16]. Jang S et al. [17] develop the Multi-Scale Structural Graph Convolutional Network (MSS-GCN) to enhance the use of high-dimensional information and multi-scale aggregation. Zhang Y et al. [18] introduce the Lightweight Graph Convolutional Network (LGCN) to reduce computational complexity for real-time applications. The Hybrid Network of Skeleton Guidance and Supervision (SGS-HN) is proposed by Ren Z et al. [19], improving multimodal feature learning and inter-modal alignment. Xu J et al. [20] create an end-to-end skeleton model with an out-of-distribution detection mechanism for better adaptability. To overcome the challenges posed by perspective changes and varying speeds, Aouaidjia K et al. [21] propose Spatio-Temporal Invariant Descriptors (STID). Recent studies have also explored the use of Generative Adversarial Networks (GANs) [22] in action recognition, improving data diversity and model robustness.

To improve the ability of the model to capture both temporal and spatial features, the transformer has become an essential component. Yang G et al. [7] introduce the Spatial-Temporal Attention Temporal Segment Network (STA-TSN), which combines an attention mechanism to improve sensitivity to dynamic changes and accurately capture key features in time-series data. Multi-AxisFormer (MAFormer) is proposed by Huang H et al. [23] to address limitations in integrating hierarchical information and handling ultra-long time series. In addition, cross-attention [24], dynamic attention [25], and multi-scale attention mechanisms [26] effectively utilize information from multiple perspectives, enhancing the capabilities of the model. It is worth noting that Xu Z et al. [27] propose the Spatiotemporal Decoupling Attention Transformer (SDAT) to address the issue of spatio-temporal interaction in complex action patterns.

Supervised learning methods require large amounts of labeled data, which is expensive and labor-intensive to collect. Additionally, these methods struggle to generalize to new scenarios and cannot effectively utilize the vast amount of available unlabeled data. To break through these limitations, semi-supervised and unsupervised learning methods have gained attention. Semi-supervised learning combines a small amount of labeled data with a large volume of unlabeled data, with models like Graph Representation Alignment (GRA) [8] and Dual-Stream Cross-Fusion with Class-Aware Memory Bank (DSCF-CAMB) [28] addressing challenges in category modeling and feature alignment. However, semi-supervised learning still relies heavily on the quality of the unlabeled data and often involves high model complexity. In contrast, unsupervised learning eliminates the need for labeled data, significantly reducing labeling costs. Lin L et al. [29] propose an Actionlet-based contrastive learning framework to enhance feature discriminability, while RDCL [30] captures complex joint relationships. He Z et al. [31] introduce multi-domain decoupling representation modeling to improve cross-domain generalization, and Liu Z et al. [32] develop a multi-view action recognition method. Lin L et al. [33] propose an idempotent unsupervised representation learning method to stabilize skeleton data representations. However, interpreting and evaluating unsupervised methods remains challenging, limiting their development.

2.2 Self-supervised learning for action recognition

Current research on self-supervised learning methods can be divided into spatio-temporal feature modeling and relative motion modeling. In spatio-temporal feature modeling, Jin Z et al. [34] propose Self-supervised Spatial-Temporal Representation Learning (SSRL), which improves model performance in complex dynamic environments by jointly modeling spatial and temporal features. Yao S et al. [35] introduce a GCN-based approach for recognizing martial arts leg poses in multimodal robots. Due to the inherent nature of self-supervised learning, masking strategies are particularly effective as a complementary method. These strategies force the model to learn useful feature representations by masking parts of the data and requiring the model to reconstruct them. SkeletonMAE is proposed by Wu W et al. [12], using a spatial-temporal masking strategy in which joints and frames are randomly masked at a predefined percentage. This framework leverages transformers to predict the masked joints or frames, effectively capturing dynamic features. Similarly, the Spatiotemporal Clues Disentanglement Network (SCD-Net) proposed by Wu C et al. [36] decouples spatio-temporal information through masking: it applies spatial bootstrap masks based on the skeleton structure and temporal bootstrap masks on small portions of the time series. In addition, the puzzle-question training method [37] disrupts action sequences to teach the model to learn the correct order. Given the effectiveness of masking strategies, this paper adopts them as a core method.

All of the above methods rely on single motion feature modeling and struggle to capture fine-grained motion patterns. Thus, relative motion modeling becomes crucial. Contrastive learning is a popular method for learning efficient representations by comparing similarities and differences between samples. Common methods include Cross-Stream Contrastive Learning [10], the Dual Min-Max method from game theory [38], and the Extremely Enhanced Contrastive Learning method [39]. Hu J et al. [13] improve model performance by combining global and local similarity modeling to capture fine-grained features. Furthermore, the integration of spatio-temporal information in relative motion modeling has been explored to improve adaptability in complex environments, with approaches like spatio-temporal contrastive clustering [40] and multi-scale motion contrastive learning [41]. The Cross-Part Bidirectional Distillation method is proposed by Yang H et al. [11], which focuses on learning features from different skeleton parts and emphasizes local motion. Liu R et al. [42] propose a semantic representation-guided contrastive learning method, enhancing the understanding of action details. However, contrastive learning performance depends heavily on the quality of negative sample selection; poor negative sample selection can result in ineffective model training. Moreover, the difficulty of clearly defining positive and negative samples in some tasks impacts training effectiveness. To address these issues, Zhu Y et al. [14] introduce the RVTCLR model, which focuses on capturing rhythmic variations and relative temporal relationships between movements and has shown effectiveness in capturing subtle movement differences. Building on this, this paper models motion features by calculating joint motion differences and similarities across adjacent frames.

In summary, existing self-supervised learning methods mainly focus on contrastive learning, and masking strategies still need further refinement. Spatio-temporal feature modeling typically relies on either absolute or relative motion alone, lacking integrated modeling of both local and global features. On the other hand, transformer-based models for skeletal structures often overlook subtle differences in local motion and the spatio-temporal context, leading to poor performance in understanding complex actions. Therefore, this paper proposes JMM-TGT, which captures both local and global features by leveraging the differences and similarities in joint motion, and introduces joint topological relationships as priors to enhance the ability of the model to understand complex actions.

3 Methodology

The overall framework of this paper is shown in Fig 2. The model consists of three core components: joint embeddings, the joint motion masking strategy, and the topology-guided transformer. The first stage involves a base embedding layer that converts raw skeleton data into a valid embedding representation. In the second stage, the model determines the masked joints by combining differences and similarities in joint motion across adjacent frames. The third stage is the pre-training phase, which utilizes transformers for encoding and decoding. In this stage, this paper introduces the topological relationships between joints as a prior into the attention layer, while also incorporating positional and temporal encoding to predict the motion sequence over time.

Fig 2. Joint motion masking with topology-guided transformer model for action recognition.

https://doi.org/10.1371/journal.pone.0338008.g002

3.1 Joint motion masking strategy

In this study, the joint motion masking strategy (JMM) generates masking probabilities by integrating the differences and similarities in joint motion between adjacent frames. Based on these probabilities, the model dynamically selects which joints to mask at each time step during training. Specifically, the joint motion difference focuses on absolute movement changes at each time step, helping the model capture significant motion variations. For instance, when a joint shows a large motion difference, the model can quickly identify this change. Meanwhile, the joint motion similarity emphasizes how small local movements can cause larger changes in the global motion pattern, i.e., relative motion. This helps the model detect subtle changes that impact the overall action. By combining these two aspects, the model is better able to focus on fine-grained motion changes, improving its ability to recognize complex actions. This strategy enhances the ability of the model to perceive both global and local motion features, overcoming the limitation of traditional methods that rely solely on a single motion feature.

As shown in Fig 3, the implementation process of the JMM is outlined in detail. First, the formulas for calculating joint motion difference and similarity are presented, followed by an explanation of how to generate the final motion information by combining these differences and similarities in joint motion. Next, the application of the masking strategy in the model is fully outlined, including the calculation of the masking probability and the process of generating the masking matrix.

Fig 3. Joint motion masking strategy for action recognition.

https://doi.org/10.1371/journal.pone.0338008.g003

3.1.1 Joint motion difference.

To quantify the change in motion of each joint at each time step, denoted as D_{t,v,c}, the absolute value of the positional difference between adjacent frames is calculated.

D_{t,v,c} = | x_{t,v,c} - x_{t-1,v,c} |    (1)

Here,

  • x = Skeleton sequence.
  • T = Number of time frames.
  • V = Number of joints.
  • C = Number of channels.

where D_{t,v,c} is the difference in motion of joint v in dimension c at time step t, and x_{t,v,c} is the position of joint v in dimension c at time step t.

Since the joint motion difference is calculated between adjacent frames, the time step offset is set to 1. As the first frame lacks a previous frame, zero padding is applied in the subsequent calculations. The same applies to the joint motion similarity calculated below.
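For concreteness, the following is a minimal PyTorch sketch of Eq (1), assuming the skeleton sequence is stored as a tensor of shape (T, V, C); the function name and tensor layout are illustrative rather than taken from the released code.

```python
import torch

def joint_motion_difference(x: torch.Tensor) -> torch.Tensor:
    """Eq (1): absolute positional difference between adjacent frames.

    x: skeleton sequence of shape (T, V, C) -- frames, joints, channels.
    The first frame has no predecessor, so it is zero-padded as described above.
    """
    diff = (x[1:] - x[:-1]).abs()         # |x_t - x_{t-1}| for t = 1..T-1
    pad = torch.zeros_like(x[:1])         # zero padding for the first frame
    return torch.cat([pad, diff], dim=0)  # shape (T, V, C)
```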

3.1.2 Motion similarity.

To further optimize the masking strategy, the similarity of each joint at adjacent time steps is also calculated, denoted as S_{t,v,c}. The similarity measures the trend of motion change of joint v in dimension c between the current time step t and the previous time step t-1.

(2)

Here,

  • 1e–10 = Prevent division by zero.

where \bar{x}_{t,v,c} and \bar{x}_{t-1,v,c} denote the mean values of joint v in dimension c at time steps t and t-1, respectively, and the similarity S_{t,v,c} is the ratio of the numerator to the denominator, which measures the similarity of the motion pattern between the current frame and the previous frame. If the similarity is close to 1, the motion changes between the two time steps are very similar.

Moreover, the numerator is the element-wise product of the joint difference vectors between the current and previous frames, indicating the magnitude of change for each joint in each dimension. The denominator is the Euclidean distance of the change between the two time steps, which measures the overall motion of each joint between the current and previous frames.
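Since the exact form of Eq (2) does not survive extraction here, the sketch below encodes one plausible reading of the description above: the numerator multiplies the mean-centred displacement components of adjacent frames element-wise, and the denominator is the Euclidean norm of the change between the two displacements plus 1e-10. The centring choice and tensor layout are assumptions, not details confirmed by the paper.

```python
import torch

def motion_similarity(x: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    """One plausible reading of Eq (2) for a skeleton sequence x of shape (T, V, C)."""
    d_cur = torch.zeros_like(x)
    d_cur[1:] = x[1:] - x[:-1]               # displacement of each joint at time t
    d_prev = torch.zeros_like(x)
    d_prev[1:] = d_cur[:-1]                  # displacement at time t-1 (zero-padded)
    # numerator: element-wise product of the mean-centred displacement components
    num = (d_cur - d_cur.mean(dim=-1, keepdim=True)) * (d_prev - d_prev.mean(dim=-1, keepdim=True))
    # denominator: Euclidean distance of the change between the two time steps
    den = (d_cur - d_prev).norm(dim=-1, keepdim=True) + eps
    return num / den                         # similarity S_{t,v,c}, shape (T, V, C)
```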

3.1.3 Final motion information.

To ensure that the model can precisely choose the retained time steps in the masking phase, the joint motion difference and similarity are combined to obtain the final motion information, denoted as M_{t,v,c}.

(3)

where M_{t,v,c} denotes the final motion information of joint v in dimension c at time step t.

3.1.4 Mask generation.

The mask probabilities are generated from the probability distribution of the final motion information M_{t,v,c}. First, the normalized motion information \hat{M}_{t,v,c} is obtained by dividing the final motion information by its maximum value over the time series T.

\hat{M}_{t,v,c} = \frac{M_{t,v,c}}{\max_{t' \in [1, T]} M_{t',v,c}}    (4)

where \max_{t' \in [1, T]} M_{t',v,c} denotes the maximum value over the time series T, and \hat{M}_{t,v,c} denotes the normalized motion information of joint v in dimension c at time step t.

Next, softmax is used to convert the normalized motion information into mask probabilities.

P_{t,v,c} = \frac{\exp(\hat{M}_{t,v,c} / \tau)}{\sum_{t'=1}^{T} \exp(\hat{M}_{t',v,c} / \tau)}    (5)

Here,

  • \tau = Temperature coefficient. Specifically, the value of \tau controls the smoothness of the mask probability distribution. A higher temperature makes the distribution flatter, leading to more random masking choices, while a lower temperature concentrates the distribution, making the masking choices more focused. The value of \tau is set to 0.8 in the experiments, selected by empirical tuning with the goal of balancing randomness and focus in the masking process.

where P_{t,v,c} denotes the mask probability of joint v in dimension c at time step t.

To prevent any single dimension from excessively influencing the mask generation, the probability vector of each joint at each time step is averaged over the C dimensions, and the resulting value is used as the final probability for that joint, denoted as P_{t,v}.

P_{t,v} = \frac{1}{C} \sum_{c=1}^{C} P_{t,v,c}    (6)

where P_{t,v} denotes the probability that joint v is masked at time step t.

Next, dynamic mask generation is achieved by sampling the probability distribution with the Gumbel-Softmax technique to determine which time steps are masked during each training epoch.

\tilde{P}_{t,v} = \log P_{t,v} + g_{t,v}, \quad g_{t,v} = -\log(-\log(u_{t,v})), \quad u_{t,v} \sim \mathrm{Uniform}(0, 1)    (7)

Here,

  • g_{t,v} = Random noise.

where \tilde{P}_{t,v} denotes the mask probability matrix with noise. For each joint, the time steps are then sorted according to these noisy probability values, yielding a sorting index over the time dimension.

Furthermore, based on the sorting index and the keep length L_{keep}, the first L_{keep} time steps are retained, i.e., time steps with higher probability are kept, and time steps with lower probability are masked out.

L_{keep} = \lfloor (1 - r) \cdot L \rfloor    (8)

Here,

  • L = Total number of time steps.

where r is the masking proportion, taking values between 0 and 1.

Ultimately, a binary mask matrix is generated to indicate whether each time step is preserved or not.

mask_{t,v} = \begin{cases} 1, & \text{if joint } v \text{ is masked at time step } t \\ 0, & \text{otherwise} \end{cases}    (9)

where mask_{t,v} denotes whether joint v is masked at time step t: a value of 1 means masked, and 0 means not masked.
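The following sketch strings Eqs (4)-(9) together in PyTorch, assuming the final motion information M has shape (T, V, C), that the softmax and the top-k selection run along the time axis for each joint, and that the Gumbel perturbation follows the standard log-probability-plus-Gumbel-noise trick; these layout choices are assumptions rather than details confirmed by the paper.

```python
import torch

def generate_joint_motion_mask(M: torch.Tensor, mask_ratio: float = 0.9,
                               tau: float = 0.8) -> torch.Tensor:
    """Turn final motion information M (T, V, C) into a binary mask (T, V);
    1 means the joint is masked at that time step, 0 means it is kept."""
    T, V, C = M.shape
    # Eq (4): normalise by the per-joint maximum over the time axis
    M_norm = M / (M.amax(dim=0, keepdim=True) + 1e-10)
    # Eq (5): temperature-scaled softmax over the time dimension
    P = torch.softmax(M_norm / tau, dim=0)
    # Eq (6): average the probabilities over the channel dimension
    P_joint = P.mean(dim=-1)                                   # (T, V)
    # Eq (7): perturb with Gumbel noise so the selection varies per epoch
    g = -torch.log(-torch.log(torch.rand_like(P_joint) + 1e-10) + 1e-10)
    P_noisy = torch.log(P_joint + 1e-10) + g
    # Eq (8): number of time steps to keep for each joint
    n_keep = max(1, int((1.0 - mask_ratio) * T))
    # Eq (9): keep the highest-scoring time steps, mask everything else
    keep_idx = P_noisy.topk(n_keep, dim=0).indices             # (n_keep, V)
    mask = torch.ones(T, V, device=M.device)
    mask.scatter_(0, keep_idx, 0.0)
    return mask
```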

3.2 Topology-guided transformer model

A detailed explanation of the data flow of the raw skeleton sequence in the JMM-TGT model is provided below, highlighting the key design of the topology-guided transformer encoder. The overall processing flow is as follows: the pre-processed skeleton data is first converted into feature embeddings using convolution operations, and then combined with positional and temporal encoding to help the model understand the relative positions and timing relationships of the joints. Next, a mask matrix is generated using the joint motion-based masking strategy and applied to the skeleton data. After the masking process is complete, the masked skeleton data is passed to the encoder, where the model learns spatio-temporal dependencies within the skeleton sequences through the multilayer topology-guided transformer. Finally, as shown in Fig 4, the encoder output is passed to the decoder to predict the temporal motion skeleton sequence.

Fig 4. Comparison of pre-training objectives with SkeletonMAE (ICMEW 23) [12].

https://doi.org/10.1371/journal.pone.0338008.g004

3.2.1 Embedding.

First, the preprocessed skeleton data is converted into a feature embedding. A convolution with a fixed kernel size is used for this operation.

x_{emb} = Conv(x)    (10)

In this way, the embedding x_{emb} is obtained. Next, positional and temporal encodings are added to help the model understand the relative positions of the skeleton joints as well as the temporal relationships between frames.

x_{tem} = x_{emb} + E_{tem}    (11)
x_{pos} = x_{tem} + E_{pos}    (12)

Here,

  • E_{tem} = Temporal encoding.
  • E_{pos} = Positional encoding.

where x_{tem} denotes the skeleton data obtained by adding the temporal encoding to x_{emb}, and x_{pos} denotes the skeleton data obtained by adding the positional encoding to x_{tem}.
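A minimal sketch of the embedding stage (Eqs 10-12), assuming the input is batched as (B, C, T, V), a 1x1 convolution for the embedding (the paper's kernel size did not survive extraction), and learnable temporal and positional encodings; all names and dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    """Convolutional joint embedding plus temporal and positional encodings."""

    def __init__(self, in_channels: int = 3, dim: int = 256,
                 num_joints: int = 25, num_frames: int = 120):
        super().__init__()
        # Eq (10): project the raw coordinates into the feature dimension
        self.embed = nn.Conv2d(in_channels, dim, kernel_size=1)
        # Learnable encodings, broadcast over joints / frames respectively
        self.temporal_enc = nn.Parameter(torch.zeros(1, dim, num_frames, 1))
        self.positional_enc = nn.Parameter(torch.zeros(1, dim, 1, num_joints))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T, V)
        x_emb = self.embed(x)                 # (B, dim, T, V)
        x_tem = x_emb + self.temporal_enc     # Eq (11): add temporal encoding
        x_pos = x_tem + self.positional_enc   # Eq (12): add positional encoding
        return x_pos
```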

3.2.2 Masking.

Next, the JMM strategy in Sect 3.1 produces a mask matrix mask for the skeleton data x, which is then used to mask x_{pos}.

mask = JMM(x)    (13)
x_{mask} = Masking(x_{pos}, mask)    (14)

Here,

  • JMM = Our joint motion masking strategy.

where x_{mask} is the masked skeleton data, and Masking refers to the process of masking the time steps of certain joints in x_{pos} using the mask matrix mask.

3.2.3 Topology-guided transformer encoder.

The topology-guided transformer (TGT) incorporates the topological relationships between skeleton joints into the attention mechanism to optimize the calculation of attention scores. These relationships are represented by an adjacency matrix, which is introduced as prior knowledge. This allows the attention mechanism to better model interactions between connected joints, enhancing the ability of the model to learn spatial dependencies within the skeleton data.

The encoder consists of eight transformer blocks, each containing a LayerNorm layer, an Attention layer, a DropPath layer, an MLP layer and a Dropout layer, as shown in Fig 5 below. Each layer serves a specific function to help the model learn complex spatio-temporal dependencies. The LayerNorm layer enhances training stability and prevents gradient issues, such as vanishing or exploding, by normalizing the inputs. The Attention layer adjusts attention scores using the adjacency matrix, modeling the interactions between joints to effectively capture long-range dependencies. The DropPath layer acts as regularization, helping the model avoid overfitting. The MLP layer strengthens the expressive power of the model through nonlinear transformations. Moreover, dropout improves the generalization ability of the model during training by randomly dropping neurons.

Fig 5. Architecture of topology-guided transformer encoder.

https://doi.org/10.1371/journal.pone.0338008.g005

First, the attention score is calculated.

A = \frac{Q K^{\top}}{\sqrt{d_k}}    (15)

Here,

  • A = Attention score.
  • Q, K = Query and key matrices obtained by linear projection of the input tokens.
  • d_k = Dimension of the key vectors.

To strengthen spatial structure modeling, the adjacency matrix of 25 skeleton joints is used in the calculation of attention scores. This matrix is predefined based on the topological structure of the human skeleton, encoding the relationships between joints and remaining fixed during training. The adjacency matrix A_{adj} is represented as:

A_{ij} = \begin{cases} 1, & \text{if joint } i \text{ is connected to joint } j \\ 0, & \text{otherwise} \end{cases}    (16)

where A_{ij} denotes the connection relationship between joint i and joint j.

The adjusted attention score is:

(17)

Here,

  • A_{adj} = Adjacency matrix. It serves as a prior in the calculation of the attention layer.
  • \tilde{A} = Adjusted attention score.

The output of the Attention layer is then passed as input to the DropPath layer and continues through the network. After eight transformer blocks, the final output of the encoder is denoted as x_{enc}.

x_{enc} = Encoder(x_{mask})    (18)
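The sketch below illustrates the topology-guided attention described in this subsection: standard multi-head scaled dot-product attention whose scores are biased by the fixed 25-joint adjacency matrix. The additive form of the bias and the token layout (one token per joint) are assumptions; the paper states only that the adjacency matrix adjusts the attention scores as a prior.

```python
import torch
import torch.nn as nn

class TopologyGuidedAttention(nn.Module):
    """Multi-head attention with a fixed joint-adjacency prior added to the scores."""

    def __init__(self, dim: int, num_heads: int, adjacency: torch.Tensor):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.register_buffer("adj", adjacency.float())    # (V, V), fixed during training

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, V, dim) joint tokens
        B, V, D = x.shape
        qkv = self.qkv(x).reshape(B, V, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)               # each (B, heads, V, head_dim)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5   # Eq (15)
        scores = scores + self.adj                         # Eq (17): adjacency as additive prior
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, V, D)
        return self.proj(out)
```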

3.2.4 Decoder.

The task of the decoder is to reconstruct the skeleton data for the masked time steps based on the output of the encoder. The decoder operates similarly to the encoder, except that its Attention layer utilizes the mask matrix to adjust the attention scores, as shown in Fig 6 below. After passing through five transformer layers, the final output of the decoder is denoted as \hat{x}.

(19)(20)(21)

where the adjusted attention score is obtained from the mask matrix mask and used in the attention layer, x_{dec} is the intermediate decoder representation produced by the multilayer transformer blocks, and Decoder_Pred refers to the prediction layer of the decoder.

3.2.5 Loss function.

Finally, the loss function computes the reconstruction error between the predicted skeleton sequence \hat{x} and the original skeleton sequence x. The mean squared error is used here.

L = \frac{1}{N} \sum_{i=1}^{N} \left\| \text{mask} \odot (\hat{x}_i - x_i) \right\|_2^2    (22)

Here,

  • mask = Binary mask matrix.
  • \hat{x} = Reconstructed skeleton data.
  • N = Number of samples.
  • L = Reconstruction error.
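A hedged PyTorch sketch of the reconstruction loss in Eq (22): mean squared error computed only on the masked positions. Whether the paper averages over masked elements or over whole samples is not recoverable here, so the normalisation below is an assumption.

```python
import torch

def masked_reconstruction_loss(pred: torch.Tensor, target: torch.Tensor,
                               mask: torch.Tensor) -> torch.Tensor:
    """pred/target: (B, T, V, C) reconstructed and original skeletons;
    mask: (T, V) binary matrix with 1 = masked, 0 = kept."""
    m = mask.unsqueeze(-1).float()                 # broadcast over batch and channels
    sq_err = (pred - target) ** 2 * m              # error only on masked positions
    denom = (m.sum() * pred.shape[0] * pred.shape[-1]).clamp(min=1.0)
    return sq_err.sum() / denom                    # average over masked elements
```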

4 Experiment

This section presents comparative experiments between the JMM-TGT proposed in this paper and other mainstream models on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets, along with the results and analysis of the ablation experiments.

Dataset and dataset preprocessing.

The experiments in this paper are based on the publicly available NTU RGB+D 60 (NTU-60) dataset [43], NTU RGB+D 120 (NTU-120) dataset [44], and PKU-MMD dataset [45]. The NTU-60 dataset contains 60 action categories captured from 40 different subjects in a multi-camera setup, with a total of 56,880 samples. The NTU-120 dataset extends NTU-60 to 120 action categories, with a total of 113,945 samples. The PKU-MMD dataset contains nearly 20,000 action instances, over 5 million frames, and covers 51 action categories.

To better evaluate the robustness and generalization ability of the model, datasets under different protocols are used, accounting for variations in shooting angles and subject differences. Specifically, the Cross-Subject (X-Sub) and Cross-View (X-View) protocols are used for NTU-60, and the Cross-Subject (X-Sub) and Cross-Setup (X-Set) protocols are used for NTU-120. The PKU-MMD dataset includes two subsets, the first part (PKU-I) and the second part (PKU-II). Compared to PKU-I, PKU-II is more challenging due to its more complex viewpoints. PKU-I and PKU-II are divided into training and test sets according to a cross-subject protocol.

Before feeding the data into the model, the original action sequences are preprocessed by unifying them to a fixed length of 120 frames through random sampling, maximizing training data diversity. Because the sequences are long, each input is further divided into four smaller segments [12] to reduce processing time. The input skeleton data consists of 25 joints, each with three dimensions.
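A small NumPy sketch of the preprocessing described above: random temporal sampling to a fixed length followed by a four-way split. The exact sampling scheme used by the authors is not specified, so the ordered random choice below is an assumption.

```python
import numpy as np

def preprocess_sequence(skeleton: np.ndarray, target_len: int = 120,
                        num_segments: int = 4) -> list:
    """skeleton: raw sequence of shape (T, 25, 3). Returns num_segments arrays
    of shape (target_len // num_segments, 25, 3)."""
    T = skeleton.shape[0]
    # random temporal sampling to a fixed length, kept in chronological order
    idx = np.sort(np.random.choice(T, size=target_len, replace=T < target_len))
    fixed = skeleton[idx]                              # (target_len, 25, 3)
    # divide the fixed-length sequence into equal, contiguous segments
    return np.split(fixed, num_segments, axis=0)
```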

Experimental setup.

In the topology-guided transformer module, the encoder depth is set to 8, the decoder depth to 5, and the transformer feature dimension per layer is 256. The JMM-TGT model is trained for 400 epochs with a batch size of 128, using the AdamW optimizer with a weight decay of 0.05. The learning rate follows a warm-up strategy of 20 epochs, increasing linearly from 0 to 1e-3, and then decreasing to 1e-4 by using a cosine decay. In the linear probing evaluation protocol, the pre-trained parameters are fixed, and a linear classifier is added after JMM-TGT. The model is trained for 100 epochs with a batch size of 48 and a learning rate of 0.1.
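For reference, the pre-training optimisation described above can be sketched as follows; the LambdaLR-based implementation of the warm-up-plus-cosine schedule is an assumption, and only the hyperparameter values come from the text.

```python
import math
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_pretraining_optimizer(model, base_lr=1e-3, min_lr=1e-4,
                                warmup_epochs=20, total_epochs=400):
    """AdamW with weight decay 0.05, 20-epoch linear warm-up to 1e-3,
    then cosine decay towards 1e-4 over the remaining epochs."""
    optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)

    def lr_lambda(epoch):                              # multiplicative factor on base_lr
        if epoch < warmup_epochs:
            return epoch / max(1, warmup_epochs)       # linear warm-up from 0
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return (min_lr + (base_lr - min_lr) * cosine) / base_lr

    return optimizer, LambdaLR(optimizer, lr_lambda)
```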

In the fine-tuning evaluation protocol, an MLP layer is added after the pre-trained JMM-TGT and trained for 100 epochs with a batch size of 32. The learning rate is linearly increased from 0 to 2e-4, then reduced to 1e-5 using a cosine decay scheme.

In the semi-supervised evaluation protocol, a classification layer is added after the pre-trained JMM-TGT, and the model is fine-tuned on a small training set. Other settings remain consistent with the fine-tuning protocol.

In the transfer learning evaluation protocol, the pre-trained JMM-TGT is connected to a linear classifier and fine-tuned on the target dataset. The training epochs, batch size, and learning rate are consistent with the settings used in the fine-tuning evaluation protocol. Finally, all experiments are conducted on three NVIDIA RTX 3090 Ti GPUs using the PyTorch framework.

Evaluation metrics.

This paper primarily uses Top-1 accuracy as the evaluation metric to measure the final classification performance of the JMM-TGT model. Top-1 accuracy measures whether the most probable category predicted by the model matches the true label. This metric is especially informative in 3D skeleton action recognition, as it reflects the ability of the model to correctly classify actions. The formula for calculating Top-1 accuracy is:

\text{Acc}_{\text{Top-1}} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = y_i)    (23)

Here,

  • N = Number of test samples.
  • \hat{y}_i = Predicted label of sample i.
  • y_i = True label of sample i.
  • \mathbb{1}(\cdot) = Indicator function.
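Eq (23) corresponds to the usual Top-1 computation; a minimal PyTorch version is given below for completeness.

```python
import torch

def top1_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Eq (23): fraction of samples whose most probable predicted class
    matches the true label. logits: (N, num_classes); labels: (N,)."""
    preds = logits.argmax(dim=-1)                 # predicted label of each sample
    return (preds == labels).float().mean().item()
```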

4.1 Comparison experiment

JMM-TGT is evaluated under four evaluation protocols: linear probing, fine-tuning, semi-supervised, and transfer learning. The linear probing evaluation protocol evaluates the quality of the learned feature representations by training a simple linear classifier on top of the pre-trained model. The fine-tuning evaluation protocol, on the other hand, is designed to evaluate the performance of the model on downstream tasks; a comparative analysis with supervised learning methods is also conducted. The semi-supervised evaluation protocol combines a limited amount of labeled data with a large volume of unlabeled data to evaluate the ability of the model to generalize with minimal labeled data. Finally, the transfer learning evaluation protocol assesses whether the representations learned by the pre-trained model on a source dataset can be successfully transferred to a target dataset, testing the generalization ability of the model.

Before these protocols are applied, the pre-training loss of the JMM-TGT model over 400 epochs on the NTU-60 and NTU-120 datasets is shown. As shown in Fig 7, the loss gradually decreases and stabilizes as training progresses. This indicates that the JMM-TGT model converges steadily during pre-training, providing a solid basis for the subsequent evaluation stages.

Fig 7. Pre-training loss of the JMM-TGT model on the NTU-60 and NTU-120 datasets.

https://doi.org/10.1371/journal.pone.0338008.g007

4.1.1 Linear probing evaluation experiment.

The effectiveness of JMM-TGT in feature learning is validated by comparing it with state-of-the-art self-supervised and unsupervised learning methods on the NTU-60 and NTU-120 datasets. These state-of-the-art methods include those based on masking strategy, such as SkeletonMAE [12], as well as those based on contrastive learning, including ActCLR [29], SSRL [34], 3s-SSRL [34], DMMG [38], SG-CLR [42], AimCLR [46], 3s-AimCLR [46], GT-Transformer [47], CPM [48], CMD [49], and HiCo-Transformer [50].

As shown in Table 2, JMM-TGT outperforms the contrastive learning-based methods and also surpasses SkeletonMAE [12] across the different protocols of the two datasets. Notably, JMM-TGT uses only a single joint stream as input. Meanwhile, JMM-TGT achieves impressive results on the X-Sub and X-View protocols of NTU-60, with accuracies of 84.4% and 90.1%, respectively, outperforming the recent method SG-CLR [42] by 0.9% and 0.2%. This demonstrates that the joint motion-based masking strategy and the topology-guided transformer help the encoder better learn to discriminate and represent skeleton features.

Table 2. Comparison of linear probing evaluation results with state-of-the-art methods on the NTU-60 and NTU-120 datasets.

https://doi.org/10.1371/journal.pone.0338008.t002

On the X-Sub and X-Set protocols of NTU-120, JMM-TGT achieves 78.2% and 78.8% accuracy, outperforming existing methods. Compared to SkeletonMAE [12], JMM-TGT improves by 5.7% and 5.3% on the X-Sub and X-Set protocols, respectively. Meanwhile, JMM-TGT surpasses the state-of-the-art SG-CLR [42] method, leading by 2.9% and 1.7% in the X-Sub and X-Set protocols, respectively. This shows that JMM-TGT is highly competitive even with large datasets and multiple action categories.

4.1.2 Fine-tuning evaluation experiment.

The fine-tuning performance of JMM-TGT is evaluated on the NTU-60 and NTU-120 datasets. Similarly, comparisons are made with the latest methods, including SkeletonMAE [12], SSRL [34], 3s-SSRL [34], SG-CLR [42], 3s-SG-CLR [42], AimCLR [46], 3s-AimCLR [46], CPM [48], and 3s-Hi-TRS [51].

As shown in Table 3, JMM-TGT achieves accuracies of 92.2% and 96.9% on the X-Sub and X-View protocols of NTU-60, respectively. Compared to SkeletonMAE [12], it improves by 5.6% and 4.0%, respectively. In addition, compared to the multi-stream 3s-Hi-TRS [51], which also employs a transformer, it improves by 2.2% and 1.2%, respectively. Moreover, on the X-Sub and X-Set protocols of NTU-120, JMM-TGT achieves accuracies of 89.7% and 90.5% after fine-tuning, surpassing 3s-Hi-TRS [51] by 4.4% and 3.1%, respectively.

Table 3. Comparison of fine-tuning evaluation results with state-of-the-art methods on the NTU-60 and NTU-120 datasets.

https://doi.org/10.1371/journal.pone.0338008.t003

Overall, JMM-TGT shows significant performance improvements on the NTU-60 and NTU-120 datasets. The final result outperforms all previous methods, even those with multi-stream fusion, such as 3s-SSRL [34], 3s-SG-CLR [42], 3s-AimCLR [46], and 3s-Hi-TRS [51]. In addition, JMM-TGT performs notably better than GCN, ST-GCN, and STTFormer, further validating the effectiveness of the transformer in action recognition tasks.

Moreover, JMM-TGT is compared with top-performing supervised methods such as ST-GCN [6], Shift-GCN [15], SAN-GCN [16], and MSS-GCN [17]. As shown in Table 4, JMM-TGT significantly outperforms supervised methods such as ST-GCN [6], Shift-GCN [15], and SAN-GCN [16] on both the NTU-60 and NTU-120 datasets. In the X-View protocol of NTU-60, JMM-TGT achieves a top accuracy of 96.9%, tying with MSS-GCN [17]. In the X-Sub protocol of NTU-120, JMM-TGT achieves 89.7%, just 0.1% lower than that of MSS-GCN [17]. These results demonstrate that JMM-TGT excels on both benchmark datasets and remains highly competitive with state-of-the-art supervised learning methods.

Table 4. Comparison with state-of-the-art supervised methods on the NTU-60 and NTU-120 datasets.

https://doi.org/10.1371/journal.pone.0338008.t004

4.1.3 Semi-supervised evaluation experiment.

In the semi-supervised setting, performance is reported on the NTU-60 dataset using 1% and 10% of the labeled training data. Comparisons are made with state-of-the-art methods, including 3s-SSRL [34], 3s-SG-CLR [42], 3s-AimCLR [46], GT-Transformer [47], CPM [48], CMD [49], HiCo-Transformer [50], LongT GAN [52], and MS2L [53].

As shown in Table 5, even with only 1% labeled data, JMM-TGT achieves accuracies of 65.3% and 67.5% on the X-Sub and X-View protocols, respectively, outperforming the state-of-the-art methods in all settings. Specifically, JMM-TGT surpasses 3s-SSRL [34] by 4.1% and 11.2% on the X-Sub and X-View protocols of NTU-60, respectively.

Table 5. Comparison of semi-supervised evaluation results with state-of-the-art methods on the NTU-60 dataset.

https://doi.org/10.1371/journal.pone.0338008.t005

The performance of the JMM-TGT model further improves to 87.3% and 89.2% when 10% labeled data is available. Notably, with 10% labeled data, JMM-TGT shows even greater improvements over other methods, outperforming 3s-SSRL [34] by 7.9% and 7.2% on the X-Sub and X-View protocols, respectively. This demonstrates the capability of JMM-TGT to extract more discriminative skeleton representations.

4.1.4 Transfer learning evaluation protocol experiment.

In the transfer learning evaluation, NTU-60, NTU-120, and PKU-I are selected as the source datasets, with PKU-II chosen as the target dataset. Comparisons are made with state-of-the-art methods, including LongT GAN [52], MS2L [53], ICS [54], and CMD [49].

As shown in Table 6, the proposed JMM-TGT achieves the best performance on the PKU-II dataset, indicating that the representations learned by the method have stronger transferability. On the source datasets NTU-60, NTU-120, and PKU-I, JMM-TGT outperforms LongT GAN [52], ML [53], ICS [54], and CMD [49] to varying degrees. Specifically, when pretraining on the NTU-60 and NTU-120 datasets, it improves by 15.4% and 16% over CMD [49], respectively. When pretraining on the PKU-I dataset, it outperforms ICS [54] by 25.5%. These results confirm the excellent generalization ability of the representations learned by the JMM-TGT.

Table 6. Comparison of transfer learning protocol results with state-of-the-art methods on the PKU-II dataset.

https://doi.org/10.1371/journal.pone.0338008.t006

4.2 Ablation experiment

Ablation studies are conducted to validate the contribution of the different components of JMM-TGT to the overall performance. All experimental results focus on linear probing evaluation protocol, using the X-Sub protocol on the NTU-60 dataset.

4.2.1 Effectiveness of JMM-TGT.

To evaluate the effectiveness of the joint motion masking strategy and the topology-guided transformer, this study explores how the three components (namely, joint motion difference, motion similarity, and topology information) impact the accuracy of 3D skeleton action recognition. Two baseline models are used: one that considers only the joint motion difference (Baseline-1) and one that considers only the joint motion similarity (Baseline-2). JMM-TGT* denotes the model with the topology-guided component removed.

As shown in Table 7, under the linear probing evaluation protocol, Baseline-1 and Baseline-2 achieve accuracies of 82.2% and 82.4%, respectively, demonstrating the effectiveness of each component in isolation. When joint motion difference and motion similarity are combined, JMM-TGT* sees an accuracy boost of 1.6% and 1.4% over Baseline-1 and Baseline-2, respectively. This suggests an interaction between the two components. When all components are used together, the accuracy reaches 84.4%, showing a significant synergistic effect of joint motion difference, motion similarity, and topology information in improving overall model performance.

Table 7. Ablation experiments of JMM-TGT components and complexity comparison with state-of-the-art methods on the NTU-60 dataset.

https://doi.org/10.1371/journal.pone.0338008.t007

In addition, the performance of the JMM-TGT model is evaluated in terms of training time, parameter count, and FLOPs. Overall, JMM-TGT is a mid-scale model with relatively low computational complexity. Each training epoch takes 14 seconds, which is longer than HiCo-Transformer [50]. This difference is primarily due to the additional joint motion difference and similarity calculation modules introduced in JMM-TGT, which significantly enhance the representational power of the model. JMM-TGT has 7.97M parameters and 3.56G FLOPs, both of which are considerably lower than those of DMMG* [38] and ST-GCN [6], thereby striking an effective balance between computational efficiency and model expressiveness. Ultimately, JMM-TGT achieves an accuracy of 84.4% on the X-Sub protocol of NTU-60, surpassing lightweight models such as HiCo-Transformer [50] and AimCLR [46]. This demonstrates that the additional computational cost yields significant performance improvements. These results suggest that JMM-TGT effectively balances computational cost, model complexity, and performance.
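The parameter count reported above can be reproduced with a short helper of the kind shown below; FLOPs are usually measured with an external profiler and are omitted here. The function name is illustrative.

```python
import torch.nn as nn

def count_parameters_in_millions(model: nn.Module) -> float:
    """Number of trainable parameters, reported in millions (M)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```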

4.2.2 Effectiveness of masking strategy.

To further investigate the effectiveness of the joint motion masking strategy, the joint motion masking strategy in JMM-TGT is replaced with random masking in the experiments. In addition, the impact of the masking ratio on the experiments is explored.

As shown in Table 8, compared to the random masking strategy, our joint motion masking strategy improves the absolute performance on NTU-60 dataset by 1.5%. This demonstrates that combining joint motion difference and similarity provides a rich semantic prior, effectively guiding the skeleton masking process.

Table 8. Ablation experiments with masking strategy on linear probing evaluation protocol.

https://doi.org/10.1371/journal.pone.0338008.t008

As shown in Fig 8, the effect of different masking ratios on the final experimental results is evaluated. With a 70% masking ratio, the accuracy is 75.7%. As the masking ratio increases, the accuracy improves, reaching 84.4% at 90%. However, when the masking ratio is increased to 95%, the accuracy drops to 81.6%. Results on the X-Sub protocol indicate that a 90% masking ratio provides optimal performance for the JMM-TGT.

Therefore, choosing an appropriate masking ratio is crucial for balancing model performance. Empirical findings suggest that both excessively high and excessively low masking ratios can negatively affect results: a low ratio may prevent the model from capturing enough joint motion variation, while a high ratio can remove too much information and lose important motion details. Based on our findings, a 90% masking ratio strikes the best balance between data masking and model accuracy. For other datasets, the masking ratio can be adjusted between 80% and 95% based on its impact on model performance.

5 Visualization

To visually demonstrate the effectiveness of the JMM-TGT, attention weight heatmaps are used to analyze the temporal relationships and focus of attention in action recognition results. As shown in Fig 9, the time attention weight distribution of the first attention head in the last layer for four action types is extracted. In the attention heatmap, the color green, from dark to light, indicates the attention weight ranging from low to high.

Fig 9. Visualization of attention weight distribution on the NTU-60 dataset.

https://doi.org/10.1371/journal.pone.0338008.g009

For complex actions like hugging, the attention weight is spread across multiple frames, indicating that the JMM-TGT captures details across a wider range to handle higher complexity. In contrast, for simpler actions like drinking, the attention is focused on a few key frames, avoiding redundant information. For fast movements like jumping, the model focuses on frames with significant changes, demonstrating its sensitivity to dynamic variations. The results show that the JMM-TGT model, by introducing a topology-guided attention mechanism and incorporating spatial dependencies between joints, can more accurately understand the spatiotemporal evolution of human actions.

6 Conclusion

In this work, we propose the JMM-TGT model for self-supervised 3D skeleton action recognition. The model introduces a joint motion masking strategy and a topology-guided transformer, which effectively fuse global and local spatio-temporal features, significantly improving performance on complex skeleton action recognition tasks. Specifically, the JMM integrates the differences and similarities in joint motion across adjacent frames and each dimension as the final motion information. This information is then transformed into a probability distribution for the mask, which dynamically guides the selection of joints to mask at each time step. The method captures the dynamic changes of each joint and effectively guides the model in selecting critical time-step information during training. The TGT adjusts the attention mechanism based on the topological relationships between joints, which ensures that the model focuses on spatio-temporal interactions when learning skeleton sequences and further enables the understanding of complex actions. Extensive experiments are conducted on three popular benchmarks using four evaluation protocols. The results show that the JMM-TGT model achieves state-of-the-art performance on the NTU-60 and NTU-120 datasets, validating the efficiency of the skeleton representation and demonstrating the potential of the transformer in action recognition. Moreover, the model generalizes well to the PKU-MMD dataset, showcasing its robustness for real-world applications. Additionally, JMM-TGT maintains low computational complexity, balancing efficiency with accuracy.

References

  1. Gayathri T, Mamatha H. How to improve video analytics with action recognition: a survey. ACM Comput Surv. 2024;57(1):1–36.
  2. Armeni I, Sener O, Zamir AR, Jiang H, Brilakis I, Fischer M, et al. 3D semantic parsing of large-scale indoor spaces. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. p. 1534–43. https://doi.org/10.1109/cvpr.2016.170
  3. Li Y, Yu AW, Meng T, Caine B, Ngiam J, Peng D. Deepfusion: LiDAR-camera depth fusion for multi-modal 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 17182–91. https://doi.org/10.1109/cvpr52688.2022.01667
  4. Zhu C, Jia Q, Chen W, Guo Y, Liu Y. Deep learning for video-text retrieval: a review. Int J Multimed Info Retr. 2023;12(1).
  5. Yang S, Liu J, Lu S, Hwa EM, Hu Y, Kot AC. Self-supervised 3D action representation learning with skeleton cloud colorization. IEEE Trans Pattern Anal Mach Intell. 2024;46(1):509–24. pmid:37856263
  6. Yan S, Xiong Y, Lin D. Spatial temporal graph convolutional networks for skeleton-based action recognition. AAAI. 2018;32(1).
  7. Yang G, Yang Y, Lu Z, Yang J, Liu D, Zhou C, et al. STA-TSN: spatial-temporal attention temporal segment network for action recognition in video. PLoS One. 2022;17(3):e0265115. pmid:35298497
  8. Huang K-H, Huang Y-B, Lin Y-X, Hua K-L, Tanveer M, Lu X, et al. GRA: graph representation alignment for semi-supervised action recognition. IEEE Trans Neural Netw Learn Syst. 2024;35(9):11896–905. pmid:38215319
  9. Dai C, Wei Y, Xu Z, Chen M, Liu Y, Fan J. ConMLP: MLP-based self-supervised contrastive learning for skeleton data analysis and action recognition. Sensors (Basel). 2023;23(5):2452. pmid:36904656
  10. Li D, Tang Y, Zhang Z, Zhang W. Cross-stream contrastive learning for self-supervised skeleton-based action recognition. Image and Vision Computing. 2023;135:104689.
  11. Yang H, Zhang Q, Ren Z, Yuan H, Zhang F. Contrastive learning with cross-part bidirectional distillation for self-supervised skeleton-based action recognition. Human-Centric Computing and Information Sciences. 2024;14.
  12. Wu W, Hua Y, Zheng C, Wu S, Chen C, Lu A. SkeletonMAE: spatial-temporal masked autoencoders for self-supervised skeleton action recognition. In: 2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW). 2023. p. 224–9. https://doi.org/10.1109/icmew59549.2023.00045
  13. Hu J, Hou Y, Guo Z, Gao J. Global and local contrastive learning for self-supervised skeleton-based action recognition. IEEE Trans Circuits Syst Video Technol. 2024;34(11):10578–89.
  14. Zhu Y, Han H, Yu Z, Liu G. Modeling the relative visual tempo for self-supervised skeleton-based action recognition. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). 2023. p. 13867–76. https://doi.org/10.1109/iccv51070.2023.01279
  15. Cheng K, Zhang Y, He X, Chen W, Cheng J, Lu H. Skeleton-based action recognition with shift graph convolutional network. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020. https://doi.org/10.1109/cvpr42600.2020.00026
  16. Tian H, Ma X, Li X, Li Y. Skeleton-based action recognition with select-assemble-normalize graph convolutional networks. IEEE Trans Multimedia. 2023;25:8527–38.
  17. Jang S, Lee H, Kim WJ, Lee J, Woo S, Lee S. Multi-scale structural graph convolutional network for skeleton-based action recognition. IEEE Trans Circuits Syst Video Technol. 2024;34(8):7244–58.
  18. Zhang Y, Yang Y, Gao X. Lightweight graph convolutional network for efficient skeleton based action recognition. In: 2024 International Joint Conference on Neural Networks (IJCNN). 2024. p. 1–8. https://doi.org/10.1109/ijcnn60899.2024.10651467
  19. Ren Z, Luo L, Qin Y, Gao X, Zhang Q. Skeleton-guided and supervised learning of hybrid network for multi-modal action recognition. Elsevier BV; 2024. https://doi.org/10.2139/ssrn.4970121
  20. Xu J, Zhu A, Lin J, Ke Q, Chen C. Skeleton-OOD: an end-to-end skeleton-based model for robust out-of-distribution human action detection. Neurocomputing. 2025;619:129158.
  21. Aouaidjia K, Zhang C, Pitas I. Spatio-temporal invariant descriptors for skeleton-based human action recognition. Information Sciences. 2025;700:121832.
  22. Wu S, Lu G, Han Z, Chen L. A robust two-stage framework for human skeleton action recognition with GAIN and masked autoencoder. Neurocomputing. 2025;623:129433.
  23. Huang H, Xu L, Zheng Y, Yan X. MAFormer: a cross-channel spatio-temporal feature aggregation method for human action recognition. AI Communications. 2024;37(4):735–49.
  24. Zhao Z, Liu Y, Ma L. Compositional action recognition with multi-view feature fusion. PLoS One. 2022;17(4):e0266259. pmid:35421122
  25. Yang H, Wang S, Jiang L, Su Y, Zhang Y. Hierarchical adaptive multi-scale hypergraph attention convolution network for skeleton-based action recognition. Applied Soft Computing. 2025;172:112855.
  26. Zhu S, Sun L, Ma Z, Li C, He D. Prompt-supervised dynamic attention graph convolutional network for skeleton-based action recognition. Neurocomputing. 2025;611:128623.
  27. Xu Z, Xu J. Spatiotemporal decoupling attention transformer for 3D skeleton-based driver action recognition. Complex Intell Syst. 2025;11(4).
  28. Huang B, Wang S, Hu C, Li X. Semi-supervised human action recognition via dual-stream cross-fusion and class-aware memory bank. Engineering Applications of Artificial Intelligence. 2024;136:108937.
  29. Lin L, Zhang J, Liu J. Actionlet-dependent contrastive learning for unsupervised skeleton-based action recognition. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023. p. 2363–72. https://doi.org/10.1109/cvpr52729.2023.00234
  30. Liu X, Gao B. Reconstruction-driven contrastive learning for unsupervised skeleton-based human action recognition. J Supercomput. 2024;81(1).
  31. He Z, Lv J. Representation modeling learning with multi-domain decoupling for unsupervised skeleton-based action recognition. Elsevier BV; 2023. https://doi.org/10.2139/ssrn.4634150
  32. Liu Z, Lu B, Wu Y, Gao C. Multi-view daily action recognition based on Hooke balanced matrix and broad learning system. Image and Vision Computing. 2024;143:104919.
  33. Lin L, Wu L, Zhang J, Liu J. Idempotent unsupervised representation learning for skeleton-based action recognition. In: European Conference on Computer Vision. 2024. p. 75–92.
  34. Jin Z, Wang Y, Wang Q, Shen Y, Meng H. SSRL: self-supervised spatial-temporal representation learning for 3D action recognition. IEEE Trans Circuits Syst Video Technol. 2024;34(1):274–85.
  35. Yao S, Ping Y, Yue X, Chen H. Graph convolutional networks for multi-modal robotic martial arts leg pose recognition. Front Neurorobot. 2025;18:1520983. pmid:39906517
  36. Wu C, Wu X-J, Kittler J, Xu T, Ahmed S, Awais M, et al. SCD-Net: spatiotemporal clues disentanglement network for self-supervised skeleton-based action recognition. AAAI. 2024;38(6):5949–57.
  37. Moutik O, Sekkat H, Tchakoucht TA, El Kari B, Alaoui AEH. A puzzle questions form training for self-supervised skeleton-based action recognition. Image and Vision Computing. 2024;148:105137.
  38. Guan S, Yu X, Huang W, Fang G, Lu H. DMMG: dual min-max games for self-supervised skeleton-based action recognition. IEEE Trans Image Process. 2024;33:395–407. pmid:38060368
  39. Guo T, Liu M, Liu H, Wang G, Li W. Improving self-supervised action recognition from extremely augmented skeleton sequences. Pattern Recognition. 2024;150:110333.
  40. Wang M, Li X, Chen S, Zhang X, Ma L, Zhang Y. Learning representations by contrastive spatio-temporal clustering for skeleton-based action recognition. IEEE Trans Multimedia. 2024;26:3207–20.
  41. Wu Y, Xu Z, Yuan M, Tang T, Meng R, Wang Z. Multi-scale motion contrastive learning for self-supervised skeleton-based action recognition. Multimedia Systems. 2024;30(5).
  42. Liu R, Liu Y, Wu M, Xin W, Miao Q, Liu X, et al. SG-CLR: semantic representation-guided contrastive learning for self-supervised skeleton-based action recognition. Pattern Recognition. 2025:111377.
  43. Shahroudy A, Liu J, Ng T-T, Wang G. NTU RGB+D: a large scale dataset for 3D human activity analysis. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. p. 1010–9. https://doi.org/10.1109/cvpr.2016.115
  44. Liu J, Shahroudy A, Perez M, Wang G, Duan L-Y, Kot AC. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding. IEEE Trans Pattern Anal Mach Intell. 2020;42(10):2684–701. pmid:31095476
  45. Liu J, Song S, Liu C, Li Y, Hu Y. A benchmark dataset and comparison study for multi-modal human action analytics. ACM Trans Multimedia Comput Commun Appl. 2020;16(2):1–24.
  46. Guo T, Liu H, Chen Z, Liu M, Wang T, Ding R. Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. AAAI. 2022;36(1):762–70.
  47. Kim B, Chang HJ, Kim J, Choi JY. Global-local motion transformer for unsupervised skeleton-based action learning. In: European Conference on Computer Vision. 2022. p. 209–25. https://doi.org/10.1007/978-3-031-19772-7_13
  48. Zhang H, Hou Y, Zhang W, Li W. Contrastive positive mining for unsupervised 3D action representation learning. In: European Conference on Computer Vision. 2022. p. 36–51. https://doi.org/10.1007/978-3-031-19772-7_3
  49. Mao Y, Zhou W, Lu Z, Deng J, Li H. CMD: self-supervised 3D action representation learning with cross-modal mutual distillation. In: European Conference on Computer Vision. 2022. p. 734–52. https://doi.org/10.1007/978-3-031-20062-5_42
  50. Dong J, Sun S, Liu Z, Chen S, Liu B, Wang X. Hierarchical contrast for unsupervised skeleton-based action representation learning. AAAI. 2023;37(1):525–33.
  51. Chen YX, Zhao L, Yuan JB, Tian Y, Xia ZY, Geng SJ, Han LG, Metaxas DN. Hierarchically self-supervised transformer for human skeleton representation learning. In: European Conference on Computer Vision. 2022. p. 185–202. https://doi.org/10.1007/978-3-031-19809-0_11
  52. Zheng N, Wen J, Liu R, Long L, Dai J, Gong Z. Unsupervised representation learning with long-term dynamics for skeleton based action recognition. AAAI. 2018;32(1).
  53. Lin L, Song S, Yang W, Liu J. MS2L: multi-task self-supervised learning for skeleton based action recognition. In: Proceedings of the 28th ACM International Conference on Multimedia. 2020. p. 2490–8. https://doi.org/10.1145/3394171.3413548
  54. Thoker FM, Doughty H, Snoek CGM. Skeleton-contrastive 3D action representation learning. In: Proceedings of the 29th ACM International Conference on Multimedia. 2021. p. 1655–63. https://doi.org/10.1145/3474085.3475307