Abstract
Graph Convolutional Networks (GCNs) perform well in skeleton action recognition tasks, but their pairwise node connections make it difficult to effectively model high-order dependencies between non-adjacent joints. To address this issue, hypergraph methods have emerged with the aim of capturing complex associations between multiple joints. However, existing methods either rely on static hypergraph structures or fail to fully exploit feature interactions between channels, limiting their ability to adapt to complex action patterns. Therefore, we propose the Dual-Branch Differential Channel Hypergraph Convolutional Network (DBC-HCN), which leverages the ability of hypergraphs to represent dependencies beyond the natural skeletal structure. It extracts spatio-temporal topological information and higher-order correlations by integrating static and dynamic hypergraphs, leveraging channel optimization and inter-hypergraph feature interactions. Our network comprises two parallel streams: a Spatio-Temporal Dynamic Hypergraph Convolutional Network (ST-HCN) and a Channel-Differential Hypergraph Convolutional Network (CD-HCN). The Spatio-Temporal Dynamic Hypergraph Convolutional stream is based mainly on the natural topology of the human skeleton and uses dynamic hypergraphs to model the dependencies of skeletal points in the spatio-temporal dimensions, so as to accurately capture the spatio-temporal characteristics of the movements. In contrast, the Channel-Differential Hypergraph Convolutional stream focuses on the feature differences between channels and extracts the characteristics of motion changes between individual skeletal points during action execution to enhance the portrayal of action details.
To enhance the network’s representational capability, we fuse the two streams with their different action feature representations, so that the Spatio-Temporal Dynamic Hypergraph Convolutional stream and the Channel-Differential Hypergraph Convolutional stream learn from each other’s representations and thereby enrich the action features. We evaluate the model on three datasets, Kinetics-Skeleton 400, NTU RGB + D 60 and NTU RGB + D 120, and the results show that our proposed network is highly competitive. Accuracy reaches 96.9% and 92.7% on the X-View and X-Sub benchmarks of the NTU RGB + D 60 dataset, respectively. Our code is publicly available at: https://github.com/hhh1234hhh/DBC-HCN.
Citation: Chen D, She K, Wu P, Chen M, Li C (2025) Dual-branch differential channel hypergraph convolutional network for human skeleton based action recognition. PLoS One 20(10): e0332066. https://doi.org/10.1371/journal.pone.0332066
Editor: Xiyu Liu, Shandong Normal University, CHINA
Received: May 2, 2025; Accepted: August 25, 2025; Published: October 27, 2025
Copyright: © 2025 Chen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting Information files.
Funding: This work was supported by Education Department of Guangxi Science and Technology Program (NO.GUIKEAB23075177) and Guangxi Zhuang Autonomous Region (Hechi University) (No. 2024GXZDSY015). These funds were both received by Drs. Dong Chen and Kaichen She.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
In the field of computer vision, action recognition [1], a core research area, has been extensively studied across various feature modalities, including RGB image frames, human skeleton data, and depth maps. In recent years, human skeleton-based action recognition has drawn significant attention due to its robustness against background noise and against challenges such as illumination changes, color distortions, and occlusions. In contrast to traditional action recognition methods that use RGB image frames, human skeletal data not only preserves high-order motion information but also integrates temporal and spatial information to effectively extract spatio-temporal action features. This capacity is essential for advancing human-computer interaction, action recognition research, and video content analysis.
Early in the development of deep learning, the prevalent methods, Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), treated human joint coordinates as independent features. These coordinates were then organized into sequences of vectors [2,3,4] or pseudo-images [5,6,7] and fed into the network for action label prediction. Since this vectorized or pseudo-image representation disregards the correlations between joints, these methods could not fully extract the features present in the skeletal data. Therefore, Yan et al. [8] used GCNs to model the correlations between human joints as graphs and proposed the spatio-temporal graph convolutional network (ST-GCN), which describes joints with a fixed partitioning strategy. This static partitioning strategy, however, proved insufficiently adaptive to the diversity of human actions. In 2019, Shi et al. [9] therefore introduced a novel two-stream adaptive graph convolutional network (2s-AGCN) for skeleton-based action recognition. In contrast to traditional approaches that rely on fixed graph structures, 2s-AGCN adaptively learns the graph structure from the input data: the network can automatically adjust the connections between skeletal points according to different types of actions, thus better capturing action-specific features. However, 2s-AGCN applies a single topology to all channels, forcing the GCN to aggregate only features associated with that topology and limiting the flexibility of feature extraction. To address this limitation, Chen et al. [10] proposed the Channel Topology Refinement Graph Convolutional Network (CTR-GCN) in 2021. CTR-GCN dynamically infers channel-specific correlations, accurately capturing the relationships between vertices in each channel.
This generation of channel topologies, achieved through the refinement process, eliminates the need for independent modeling of each channel’s topology, significantly reducing the complexity of channel topology modeling. These advancements in GCN models optimize the topology map structure and significantly enhance the flexibility of feature extraction. Nevertheless, GCN-based methods for skeleton-based action recognition still present two significant limitations: (1) Edges in the graph structure connect only adjacent nodes. This does not reflect dependencies between multiple joints or indirectly related joints, thus omitting significant hidden higher-order information between joints. (2) Sample skeleton graph structures are fixed, despite differences in the positions and angles of human joints across different samples in the dataset. This results in incomplete joint correlation data.
To address the aforementioned challenges, this paper proposes a Dual-Branch Differential Channel Hypergraph Convolutional Network (DBC-HCN). We utilize static and dynamic hypergraphs to model the skeleton data in DBC-HCN. The topology of the dynamic hypergraphs is constructed through the K-NN and K-means algorithms [11,12], as illustrated in Fig 1. Specifically, our network consists of two parallel streams: a Spatio-Temporal Dynamic Hypergraph Convolutional Network (ST-HCN) and a Channel-Differential Hypergraph Convolutional Network (CD-HCN). In the ST-HCN stream, the spatial topology module effectively extracts information about human joints by leveraging feature interactions among multiple hypergraphs and further refines these hypergraphs with the channel refinement module, followed by the fusion of the information derived from these two topological structures. This model-building strategy increases the flexibility of graph construction and strengthens the correlations between joints lacking natural dependencies. In the temporal convolution module, the spatial information is processed with a Dilated Temporal Convolutional Network (DTCN) to further derive the spatio-temporal features of the skeletal data. In the CD-HCN stream, the primary goal is to capitalize on the feature differences across channels to capture the inter-channel relationships between joints and to employ spatio-temporal learning for effective feature fusion. In addition, considering the differing representational features of the two streams, we fuse the features of both streams to enable complementary information exchange, enriching the action features through mutual learning between the information streams and thereby improving the accuracy of skeletal action recognition.
(a) Representation of a traditional skeleton graph with nodes on the lines indicating human joints. (b) Represents a static hypergraph with six hyperedges, and different colored lines indicate different hyperedges. (c) Represents the hypergraph topology obtained by the K-NN method. (d) Denotes the hypergraph topology with global information obtained by the K-means method.
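The K-NN and K-means constructions shown in panels (c) and (d) can be sketched in a few lines of numpy. This is a minimal illustration only: the neighbour count, cluster count, and the node-by-hyperedge incidence encoding are our assumptions, not the paper's released code.

```python
import numpy as np

def knn_hyperedges(coords, k=3):
    """One hyperedge per joint: the joint plus its k nearest neighbours."""
    n = coords.shape[0]
    # pairwise Euclidean distances between joints
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    H = np.zeros((n, n))                      # nodes x hyperedges incidence
    for e in range(n):
        members = np.argsort(d[e])[:k + 1]    # joint e and its k neighbours
        H[members, e] = 1.0
    return H

def kmeans_hyperedges(coords, n_clusters=5, iters=20, seed=0):
    """One hyperedge per cluster: joints grouped by a plain k-means."""
    rng = np.random.default_rng(seed)
    centers = coords[rng.choice(len(coords), n_clusters, replace=False)]
    for _ in range(iters):
        labels = np.argmin(
            np.linalg.norm(coords[:, None] - centers[None], axis=-1), axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):           # guard against empty clusters
                centers[c] = coords[labels == c].mean(axis=0)
    H = np.zeros((len(coords), n_clusters))
    H[np.arange(len(coords)), labels] = 1.0
    return H
```

The K-NN hyperedges capture local neighbourhoods, while the K-means hyperedges group joints globally, matching the local/global split between panels (c) and (d).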
Our primary contributions are summarized as follows:
- We propose a novel action recognition algorithm based on Spatio-Temporal Dynamic Hypergraph Convolution, which adopts a hypergraph representation strategy with spatial channel refinement, and constructs a novel model of spatio-temporal dependencies by integrating interaction features between spatial hypergraphs and merging them with the emerging Dilated Temporal Convolution operation.
- We design a hypergraph convolutional network model with a Channel Differential Mechanism, which focuses on frame-to-frame variability within channels and can effectively extract more diverse feature information.
- We propose the DBC-HCN parallel network. This architecture significantly improves the effectiveness of feature extraction concerning node interactions across different channels, while preserving the ability to extract features based on the topological connections in the human skeleton graph. Extensive experimental results demonstrate that the DBC-HCN model achieves highly competitive, state-of-the-art performance on the NTU RGB+D 60, NTU RGB+D 120 datasets and Kinetics-Skeleton dataset.
2. Related work
In this section, we primarily review skeleton-based action recognition methods, including recurrent neural network (RNN), convolutional neural network (CNN), and graph convolutional network (GCN) methods. In addition, we provide a brief introduction to hypergraph neural networks.
2.1. RNN and CNN based action recognition methods
With the development of deep learning, RNN- and CNN-based action recognition methods have become the dominant way to solve the problem of skeleton-based human action recognition. RNN-based methods are able to capture long-term dependencies in the data and usually use coordinate vectors to model the skeleton data. Yong et al. [2] developed an end-to-end hierarchical RNN architecture whose strength in processing time-series data enables it to extract richer spatio-temporal features. Song et al. [13] proposed an end-to-end spatio-temporal attention RNN that selectively focuses on the key spatial and temporal features to improve the understanding and recognition of complex actions. Zhang et al. [14] proposed a view-adaptive RNN with a Long Short-Term Memory (LSTM) architecture that dynamically re-orients the skeletal data to the most suitable viewpoint in an end-to-end manner to enhance the performance of human action recognition.
In comparison, CNN-based methods excel at processing images and spatial data, and generally model the skeleton data as pseudo-images. Liu et al. [15] proposed a multi-stream CNN fusion network based on a view-invariant approach for enhanced skeleton visualization, exploring the complementary properties between different types of enhanced color images to improve the performance of action recognition. Ke et al. [16] proposed a new method for 3D action recognition from skeletal sequences based on cylindrical coordinates. This method utilizes a multitask learning network to process feature vectors from all time frames in parallel for action recognition. Cao et al. [17] designed CNNs combining residual blocks and gated connectivity, which effectively exploit the ability of CNNs to process low-dimensional skeletal data and significantly improve the recognition accuracy of the network through the incorporation and refinement of residual blocks. Mumtaz et al. [18] studied a new dynamic frame-skipping technique to generate meaningful temporal representations for spatio-temporal motion sequences, aiming to better detect anomalous actions. However, these RNN- and CNN-based methods cannot properly represent the graph structure of skeleton data, so some skeletal information is lost when extracting features. Therefore, more effective methods for representing skeleton data are needed.
2.2. GCN-based action recognition methods
GCN is a novel action recognition approach in the field of deep learning, which uses graph structures to model the spatial relationships and temporal dynamics between human skeleton joints in order to efficiently capture the features of human actions. The spatio-temporal graph convolutional network (ST-GCN) was developed by Yan et al. [8] and was the first attempt to employ GCNs for modeling skeletal data. It enables the network to handle spatial structure information as well as temporal dynamic information by extending graph convolution to the spatio-temporal domain, and also removes the dependence of traditional skeleton modeling on handcrafted components or traversal rules, thereby enhancing the generalization and expressive power of the network. Shi et al. [9] put forward a new two-stream adaptive graph convolutional network (2s-AGCN). It defines a dual-stream network over spatial and temporal streams using the adaptive graph convolution concept, which enhances the freedom of the constructed graph models and their applicability to a wide range of datasets. Ye et al. [19] proposed the novel Dynamic GCN, an architecture that incorporates a Context Encoding Network (CeN) designed to learn context-rich dynamic skeleton topologies. CeN enables the modeling of dynamic graph topologies by integrating contextual information from other joints into the relationship between any two joints. Accordingly, the model is able to identify slight differences between actions, resulting in improved action recognition performance and enhanced generalization ability. Plizzari et al. [20] proposed a new Spatio-Temporal Transformer Network (ST-TR), which uses the Transformer’s self-attention operator to capture dependencies between joints.
The model effectively extracts the spatio-temporal features of the skeleton data through the Transformer architecture, thus optimizing performance on the skeletal action recognition task. Song et al. [21] put forth a collection of efficient GCN baseline models (EfficientGCN) that achieve higher accuracy with a small number of trainable parameters, build stronger models through a spatio-temporal separation learning strategy, provide faster baselines, and further reduce modeling complexity and over-parameterization. Nevertheless, despite the strength of such models on the simple pairwise relationships that graph neural networks handle well, they exhibit limitations when dealing with higher-order data relationships and more complex structures.
2.3. Hypergraph neural networks
Hypergraph neural network (HGNN) [22] is a state-of-the-art data representation learning method designed to capture higher-order dependencies among data points. Unlike traditional graph neural networks, HGNN connects an arbitrary number of nodes via hyperedges, thus enabling the modeling of more complex data relationships than pairwise connections. Jiang et al. [23] proposed the Dynamic Hypergraph Neural Network (DHNN), which is capable of updating the hypergraph structure at each layer of the network to adapt to the dynamics of the action. Moreover, Gao et al. [24] developed a dynamic hypergraph representation and learning framework that effectively portrays the higher-order dependencies between nodes in the hypergraph structure through tensor representations, providing an important extension to existing hypergraph learning techniques. Mei et al. [25] proposed a Dynamic Hypergraph Hyperbolic Neural Network (DHHNN) based on a variational autoencoder, which integrates hyperbolic geometry, a dynamic hypergraph structure, and a self-attention mechanism to dynamically adjust the contributions of nodes and hyperedges, effectively enhancing the modeling capability for complex multimodal data. Overall, through its unique structure, the hypergraph neural network is able to effectively extract and represent higher-order features in skeleton data, thus demonstrating excellent performance in skeleton-based action recognition tasks.
3. Method
In this section, we begin with a concise introduction to Graph Convolutional Networks (GCNs) and Hypergraph Convolutional Networks (HGCNs). We then present the DBC-HCN and describe the complete structure of the network in detail.
3.1. Graph convolutional networks
In traditional GCNs, a skeleton sequence consists of single-frame skeleton graphs. The data is represented by a graph structure $G = (V, E)$, where $V$ represents the set of joints and $E$ represents the set of edges. These edges include the skeletal edges $E_s$, which connect naturally adjacent joints within each frame, and the temporal edges $E_t$, which link corresponding skeletal points across frames. The connectivity of the skeleton graph is encoded by the adjacency matrix $A \in \mathbb{R}^{N \times N}$, whose element values reflect the connectivity between joints. The update rule of the graph convolutional network is defined as follows:

$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$

where $H^{(l)} \in \mathbb{R}^{N \times C}$ is the input signal of convolutional layer $l$, $H^{(0)} = X$ denotes the initial feature of the input layer, $W^{(l)}$ is the learnable weight matrix of convolutional layer $l$, $\sigma(\cdot)$ refers to the nonlinear activation function, $\tilde{A}$ is the normalized graph adjacency matrix, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $N$ is the number of nodes in the graph, and $C$ is the node feature dimension.
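As a concrete illustration, this update rule can be sketched in a few lines of numpy. The toy chain graph, ReLU activation, and weight values below are illustrative only:

```python
import numpy as np

def gcn_layer(X, A, W):
    """One GCN update: H' = ReLU(D~^{-1/2} (A + I) D~^{-1/2} X W)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # D~^{-1/2} as a vector
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)         # ReLU nonlinearity

# toy 4-joint chain graph (e.g. shoulder-elbow-wrist-hand)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.ones((4, 3))                                # 3-D joint features
W = np.full((3, 2), 0.5)                           # toy weight matrix
H1 = gcn_layer(X, A, W)                            # updated node features
```

Each output row mixes a joint's features with those of its graph neighbours, which is exactly the pairwise aggregation whose limitations motivate the hypergraph formulation below.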
3.2. Hypergraph convolutional networks
A hypergraph is defined by a triple $G = (V, E, W)$ consisting of a node set $V$, a hyperedge set $E$, and a hyperedge weight set $W$; its hyperedges are capable of connecting an arbitrary number of nodes for a more thorough representation of relationships. For the hypergraph $G$, the incidence matrix $H \in \mathbb{R}^{|V| \times |E|}$ is shown in Fig 2 and its elements are defined as:

$H(v, e) = \begin{cases} 1, & v \in e \\ 0, & v \notin e \end{cases}$

The degree of a given node denotes the weighted number of hyperedges containing that node, denoted as:

$d(v) = \sum_{e \in E} w(e) H(v, e)$

where $w(e)$ is the weight of hyperedge $e$, and the degree of a hyperedge $\delta(e)$ denotes the number of joints that constitute hyperedge $e$, denoted as:

$\delta(e) = \sum_{v \in V} H(v, e)$

The update rule for hypergraph convolutional networks is defined as [24]:

$X^{(l+1)} = \sigma\left(D_v^{-\frac{1}{2}} H W D_e^{-1} H^{\top} D_v^{-\frac{1}{2}} X^{(l)} \Theta^{(l)}\right)$

where $\Theta^{(l)}$ is the weight matrix of convolutional layer $l$, used to extract features for the nodes in the hypergraph, $D_v$ denotes the diagonal matrix of the node degrees $d(v)$, $D_e$ indicates the diagonal matrix of the hyperedge degrees $\delta(e)$, and $W$ is the diagonal matrix of the hyperedge weights.
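The hypergraph update rule follows directly from the incidence matrix and the degree definitions, as the small numpy sketch below shows. The two-hyperedge example and the tanh activation are illustrative assumptions:

```python
import numpy as np

def hgcn_layer(X, H, w, Theta):
    """One hypergraph convolution step:
    X' = tanh(Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta)."""
    W = np.diag(w)                          # hyperedge weight matrix
    dv = H @ w                              # node degrees d(v) = sum_e w(e) H(v,e)
    de = H.sum(axis=0)                      # hyperedge degrees delta(e)
    Dv_is = np.diag(1.0 / np.sqrt(dv))
    De_inv = np.diag(1.0 / de)
    G = Dv_is @ H @ W @ De_inv @ H.T @ Dv_is
    return np.tanh(G @ X @ Theta)

# 4 joints, 2 hyperedges: e0 = {0, 1, 2}, e1 = {2, 3}
H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]], dtype=float)
w = np.array([1.0, 1.0])                    # unit hyperedge weights
X = np.arange(8, dtype=float).reshape(4, 2)
out = hgcn_layer(X, H, w, np.eye(2))
```

Because hyperedge $e_0$ spans three joints at once, joint 0 receives information from joint 2 in a single layer, which a pairwise graph would need two hops to achieve.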
3.3. Dual-branch differential channel hypergraph convolutional network
As presented in Fig 3(a), our proposed DBC-HCN model has two parallel processing streams, namely the ST-HCN stream and the CD-HCN stream. The Spatio-Temporal Dynamic Hypergraph Convolution consists of a spatial convolution module, composed of a Dynamic Channel Refinement (DCR) module, a Hypergraph Feature Interaction (HFI) module, and a Feature Aggregation (FA) module, together with a Dilated Temporal Convolution (DTC) module. Specifically, the DCR module adjusts the weights of the feature channels through an adaptive mechanism to extract key features closely related to action recognition. The HFI module extracts representational differences with the help of the topologies of multiple hypergraphs, and the FA module fuses the information from both. The purpose of the DTC module is to widen the receptive field of the model so that the long-term dependencies in the time-series data can be captured. Compared with the ST-HCN stream, a distinctive feature of the CD-HCN stream is that an inter-channel difference operation is performed on the input feature maps; an advanced representation of the differential features is learned by feeding the resulting differential features into the convolutional network, which ultimately improves the discriminative power of the model.
3.3.1. Spatio-temporal dynamic hypergraph convolution.
In this section, we explore the structural details of the spatial convolution and the Dilated Temporal Convolution (DTC) in the ST-HCN, and analyze, at the mathematical and architectural levels, the HFI, DCR, and FA modules that make up the spatial convolution of the model.
Hypergraph Feature Interaction (HFI). Feature interactions between hypergraphs aim to reveal the deep connections between nodes in different hypergraph structures, in order to understand and exploit the information carried by multiple hypergraph structures. To this end, we propose a novel feature interaction mechanism, shown in Fig 3(b). First, we apply linear transformation functions $\phi(\cdot)$ and $\psi(\cdot)$ to the input features $X$, followed by feature extraction via the Einstein summation function $\mathrm{ES}(\cdot)$, as shown in Eq:

$F_1 = \mathrm{ES}\left(\phi(X), A_s\right), \qquad F_2 = \mathrm{ES}\left(\psi(X), A_d\right), \qquad A_d = A_{knn} \,\|\, A_{km}$

where $A_s$ is the adjacency matrix of the static hypergraph, $A_{knn}$ and $A_{km}$ denote the hypergraph adjacency matrices constructed by the K-NN and K-means algorithms [11,12,26], and $A_d$ represents the composite hypergraph adjacency matrix formed by the splicing of $A_{knn}$ and $A_{km}$; $F_1$ and $F_2$ are the output features after the interaction of the input with the hypergraph adjacency matrices $A_s$ and $A_d$.

Next, an element-wise subtraction of the feature vectors of each corresponding node in the different hypergraphs is performed to obtain a difference feature matrix, Eq:

$M = \sigma\left(F_1 - F_2\right)$

where $\sigma(\cdot)$ is the nonlinear activation function and $M$ denotes the feature interaction matrix of the hypergraphs.

After that, we first transform the features with the linear function $\xi(\cdot)$, then feed the transformed features into $\mathrm{ES}(\cdot)$ to complete the feature interaction with the interaction matrix $M$, and finally fuse the result with the features $F_1$ to obtain a more comprehensive interaction output, which is given by Eq:

$Z = F_1 + \alpha \cdot \mathrm{ES}\left(\xi(X), M\right)$

where $\alpha$ is a trainable scalar parameter and $Z$ denotes the final output feature, which contains the interaction information of all the above inputs.
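A rough numpy sketch of the HFI idea follows: propagate the input through the static and the dynamic hypergraph, turn the element-wise difference into an interaction map, and fuse it back with a trainable scalar. The function names, the ReLU difference, and the exact fusion target are our reading of the description, not the paper's released code:

```python
import numpy as np

def hfi(X, A_s, A_d, W1, W2, W3, alpha=0.1):
    """Hypergraph Feature Interaction sketch (assumed form):
    static branch F1, dynamic branch F2, interaction map M = relu(F1 - F2),
    fused output Z = F1 + alpha * (X W3) * M."""
    F1 = np.einsum('uv,vc->uc', A_s, X @ W1)   # static-branch features
    F2 = np.einsum('uv,vc->uc', A_d, X @ W2)   # dynamic-branch features
    M = np.maximum(F1 - F2, 0.0)               # difference -> interaction matrix
    Z = F1 + alpha * (X @ W3) * M              # scalar-gated fusion
    return Z
```

With `alpha = 0` the module degenerates to the plain static-hypergraph branch, which makes the contribution of the interaction term easy to isolate in ablations.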
Dynamic Channel Refinement (DCR). We segment the Dynamic Channel Refinement process into three stages. First, dynamic modeling and channel refinement are accomplished by inferring the hypergraph structure. Then, the transformation function $\mathcal{T}(\cdot)$ is used for feature transformation. Finally, the channel topology is aggregated. In this process, $Q$ denotes the channel correlation matrix and $R$ denotes the aggregated channel topology matrix.

(i) Dynamic modeling: as depicted in Fig 3(b), we begin by mapping the input features $X$ with the linear transformation functions $\phi(\cdot)$ and $\psi(\cdot)$, after which these dimensionality-reduced features are fed into the dynamic modeling function $\mathcal{M}(\cdot)$. The dynamic modeling function is denoted as:

$\mathcal{M}(x_i, x_j) = \sigma\left(\phi(x_i) - \psi(x_j)\right)$

where $\sigma(\cdot)$ is a nonlinear activation function, and $\mathcal{M}(\cdot)$ centers on computing the distance between the node features $x_i$ and $x_j$ in the hypergraph along the channel dimensions, generating channel-specific representations of the topological relationships.

Next, we apply a linear transformation $\xi(\cdot)$ on top of the modeling function to lift the features and learn the inter-channel correlation $Q$, Eq:

$Q = \mathrm{ES}\left(A, \xi\left(\mathcal{M}(\phi(X), \psi(X))\right)\right)$

where $A$ is the adjacency matrix of the static hypergraph and $\mathrm{ES}(\cdot)$ is a multidimensional tensor (Einstein summation) operation.

Finally, to further refine the correlation between channels, we refine $A$ with the channel correlation $Q$, computed as:

$R = \mathrm{ES}\left(A, Q\right)$

where $R$ denotes the channel topology after the refinement process and $\mathrm{ES}(\cdot)$ is the Einstein summation convention.

(ii) Feature transformation: we use a feature transform to turn the input into a high-level feature representation via $\mathcal{T}(\cdot)$, as shown in Fig 3(b). A graph convolution with a simple linear transformation is used for this purpose, with the following formula:

$\tilde{X} = \mathcal{T}(X) = XW$

where $\tilde{X}$ is the transformed feature, $X$ denotes the input feature matrix, and $W$ denotes the shared weight matrix, which is responsible for linearly combining the input features during graph convolution to extract a feature representation rich in structural information.

(iii) Channel aggregation: given the channel topology $R$ and the high-level features $\tilde{X}$, the final output features $Z$ can be obtained by aggregating the channel graphs through the aggregation function $\mathcal{A}(\cdot)$, with the formula:

$Z = \mathcal{A}(R, \tilde{X}) = \left[R_1 \tilde{x}_1 \,\|\, R_2 \tilde{x}_2 \,\|\, \cdots \,\|\, R_C \tilde{x}_C\right]$

where $R_c$ and $\tilde{x}_c$ come from the $c$-th channel of $R$ and $\tilde{X}$, with $c \in \{1, \ldots, C\}$; the channel aggregation function $\mathcal{A}(\cdot)$, according to the given channel topology $R$ and high-level features $\tilde{X}$, obtains the output matrix $Z$ by aggregating all connected channel graphs.
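The three DCR stages can be sketched together in numpy. This is our CTR-style reading of the description (pairwise feature differences produce per-channel correlations that refine the shared topology before aggregation); tanh activations and the additive refinement are assumptions:

```python
import numpy as np

def dcr(X, A, W, Wp, Wq):
    """Dynamic Channel Refinement sketch (assumed form).
    X: (N, C_in) node features; A: (N, N) static hypergraph adjacency;
    Wp, Wq: (C_in, R) reduction maps; W: (C_in, R) shared feature transform."""
    P, Q = np.tanh(X @ Wp), np.tanh(X @ Wq)     # (i) reduced node features
    # channel-specific correlations from pairwise differences: (N, N, R)
    D = np.tanh(P[:, None, :] - Q[None, :, :])
    R_top = A[:, :, None] + D                   # refined per-channel topology
    Xt = X @ W                                  # (ii) high-level features (N, R)
    # (iii) channel aggregation: Z[n, r] = sum_m R_top[n, m, r] * Xt[m, r]
    return np.einsum('nmr,mr->nr', R_top, Xt)
```

Each channel thus aggregates over its own refined graph rather than a single shared topology, which is the flexibility the refinement stage is meant to provide.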
Feature Aggregation (FA). In this study, each of the four output features involved characterizes differentiated action information among the nodes in the network; we aggregate these features to comprehensively extract the relevance of the nodes and thus construct a more complete global feature representation. The corresponding formula is:

$Z_{out} = \left[Z_1 \,\|\, Z_2 \,\|\, Z_3 \,\|\, Z_4\right]$

where $\|$ denotes the splicing (concatenation) operation.
Finally, on the basis of spatial convolution, we introduce a temporal convolution method for feature extraction for time series data, as shown in Fig 3(c). This study adopts the Dilated Temporal Convolution (DTC) strategy, which effectively improves the ability of feature extraction for time series data by flexibly adjusting the dilation coefficient.
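The dilated temporal convolution can be illustrated on a single joint's time series. The sketch below implements a 'same'-padded 1D dilated convolution in plain numpy, showing how the dilation coefficient widens the receptive field without adding parameters (kernel size and dilation values are illustrative):

```python
import numpy as np

def dilated_tconv(x, kernel, dilation):
    """'Same'-padded 1D dilated convolution over the time axis.
    x: (T,) one joint/channel time series; kernel: (K,) filter weights.
    Effective receptive field = (K - 1) * dilation + 1."""
    K = len(kernel)
    pad = (K - 1) * dilation // 2
    xp = np.pad(x, pad)
    out = np.zeros(len(x))
    for t in range(len(x)):
        for k in range(K):
            out[t] += kernel[k] * xp[t + k * dilation]
    return out
```

With a size-5 kernel, dilation 1 covers 5 frames while dilation 2 covers 9, so stacking layers with growing dilation captures long-range temporal dependencies cheaply.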
3.3.2. Channel-differential hypergraph convolution.
In order to recognize and understand human movements more accurately, we propose a Channel-Differential Hypergraph Convolution method. The method is structurally similar to the ST-HCN, but adopts a novel Channel Differential Mechanism (CDM). The core idea of this mechanism is to extract the dynamic changes between channels and thus characterize the relative motion of skeletal points in a finer way. Compared with the traditional HCN and the ST-HCN proposed above, this method exhibits significant differences in feature extraction and information modeling, enabling the model to capture the dynamic changes between channels during the action more effectively.
In most hypergraph-based action recognition methods, the HCN relies mainly on the absolute coordinate information of skeletal points, such as the X, Y, Z 3D coordinates. However, using only absolute coordinates may not adequately portray the dynamic features of human motion. The ST-HCN models the higher-order relationships of skeletal points by constructing spatio-temporal hypergraphs, but it still does not explore the channel dimension in depth. As shown in Fig 4, the Channel-Differential Hypergraph Convolutional Network (CD-HCN) is able to capture the subtle changes between neighboring time frames by introducing a Channel Differential Mechanism, which enhances the representation of the dynamic features of the movement and thus improves recognition accuracy.
In our approach, the core of the channel differential hypergraph convolution lies in first performing the difference operation on each channel of the input data, so as to realize the refined extraction of the features of the skeletal points, and then inputting the features into the model. Specifically, each skeletal point has a corresponding channel definition in the CD-HCN, and these channels represent the 3D coordinate information of the skeletal point respectively. By performing the difference operation on these channels, we obtain multiple single channels containing the 3D motion information of the skeletal points. This processing procedure is formally described by the following formula:
$d_i = x_{i+1} - x_i, \quad i = 1, 2, \ldots, C-1$

where $C$ is the number of input feature channels, $x_i$ and $x_{i+1}$ are the feature vectors of the $i$-th and $(i+1)$-th channels, and $d_i$ is the difference vector of the $i$-th channel.
Compared with the traditional HCN, which deals directly with the absolute coordinates of skeletal points, CD-HCN extracts change information between neighboring time frames via the Channel Differential Mechanism, enabling the model to focus on relative displacements during movement. This approach can eliminate static differences between individuals and improve the robustness of the model to complex movement patterns.
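The difference operation itself reduces to one line over the channel axis. The `(C, T, V)` channels-frames-joints layout below is an assumption for illustration:

```python
import numpy as np

def channel_diff(X):
    """Channel Differential Mechanism sketch: d_i = x_{i+1} - x_i along the
    channel axis. X: (C, T, V) = channels x frames x joints."""
    return X[1:] - X[:-1]   # yields C - 1 difference channels
```

Because each output channel encodes only the change between adjacent input channels, constant per-subject offsets in the raw coordinates cancel out, which is the robustness property the text describes.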
3.4. Model architecture
The architecture of the model proposed in this paper is shown in Fig 5, which consists of a two-stream network, where the second stream specifically integrates a Channel Differential Mechanism that focuses on analyzing frame-to-frame variations between channels. Both streams consist of ten layers, which include spatial convolution and temporal convolution. Spatial convolution aims to capture spatial information of joints and bones, while temporal convolution is used to extract temporal information in different frames. In spatial convolution, we incorporate Hypergraph Feature Interaction and Dynamic Channel Refinement to analyze static and dynamic hypergraph features. The convolution kernel size for temporal convolution is fixed to 5 × 1, except for the maximum pooling layer which is 3 × 1, and employs varying dilation rates to expand the receptive field. At layers 5 and 8, the temporal dimension is halved through stride temporal convolution. Subsequently, the model predicts action labels by global average pooling and FC layers.
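The ten-block layout of one stream can be summarized as a configuration sketch. Only the block count and the stride-2 placement at layers 5 and 8 follow the text; the channel widths are illustrative assumptions:

```python
# one stream: ten (spatial conv, temporal conv) blocks
# (in_channels, out_channels, temporal_stride); widths are assumed, not the
# paper's released configuration
BLOCKS = [
    (3, 64, 1), (64, 64, 1), (64, 64, 1), (64, 64, 1),
    (64, 128, 2),                 # layer 5: halve the temporal dimension
    (128, 128, 1), (128, 128, 1),
    (128, 256, 2),                # layer 8: halve again
    (256, 256, 1), (256, 256, 1),
]

def temporal_len(T):
    """Temporal length after the ten blocks (stride-2 at layers 5 and 8)."""
    for _, _, s in BLOCKS:
        T = (T + s - 1) // s      # ceil division for a strided convolution
    return T
```

For example, a 64-frame clip leaves the tenth block with 16 frames before global average pooling and the FC classifier.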
The temporal modeling structures of the ST-HCN and CD-HCN streams are similar.
Finally, in order to fuse the information of the two streams for interaction, we use a late fusion strategy. Specifically, the features of the ST-HCN stream are integrated with those of the CD-HCN stream, Eq:

$Y = F_{ST}(X) + F_{CD}(X)$

where $X$ denotes the human action data, and $F_{ST}(\cdot)$ and $F_{CD}(\cdot)$ denote the Spatio-Temporal Dynamic Hypergraph Convolutional Network and the Channel-Differential Hypergraph Convolution, respectively.
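A minimal sketch of score-level late fusion follows; the equal stream weighting is an assumption, not the paper's tuned setting:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(logits_st, logits_cd):
    """Sum the two streams' class scores, then pick the top class."""
    return (softmax(logits_st) + softmax(logits_cd)).argmax(axis=-1)
```

Because fusion happens after each stream has formed its own class scores, either stream can be trained or ablated independently, which matches the two-stream design of the model.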
4. Experiments
In this section, in order to evaluate the performance of the proposed model, we perform a number of experiments on the datasets NTU-RGB + D 60, NTU-RGB + D 120 and Kinetics-Skeleton. First, we provide a detailed description of the three datasets. Then we perform an ablation study on the NTU-RGB + D 60 dataset to examine the contribution of each module to the network. Finally, we validate the effectiveness of the proposed model by comparing it with several state-of-the-art methods.
4.1. Datasets
NTU RGB + D 60: NTU RGB + D 60 [27] dataset is a large-scale dataset widely used in the field of action recognition, which contains 56,880 skeleton action sequences performed by 40 volunteers covering 60 different action classes. Each action sample ensures that a maximum of two subjects are involved and is captured simultaneously by cameras from three different viewpoints, thus providing rich information about the 3D skeletal joint points. The authors of this dataset recommend two evaluation benchmarks: (1) Cross-subject (X-Sub) benchmark: the dataset is divided into two groups, with the training data coming from 20 subjects and the test data coming from the remaining 20 subjects, for a total of 40,320 training samples and 26,560 test samples. (2) Cross-View (X-View) benchmark: training samples come from camera views 2 and 3, totaling 37,920, while test samples come from camera view 1, totaling 18,960.
NTU RGB + D 120: The NTU RGB + D 120 [28] dataset is a significant extension of the NTU RGB + D 60 dataset, adding 57,367 new skeleton sequences and 60 new action categories, making it one of the largest 3D joint-annotated human action recognition datasets available. The dataset consists of more than 114,000 skeleton action sequences performed by volunteers in 32 different setups, each representing a different location and context, to enhance the model’s ability to generalize across environments. To evaluate model performance, NTU-120 proposes two benchmark evaluations: (1) Cross-Subject (X-Sub) benchmark: as with the X-Sub benchmark of NTU-60, the dataset is divided into two groups, one for training and the other for testing. (2) Cross-Setup (X-Set) benchmark: the training and testing samples are split according to the camera setup IDs.
Kinetics-Skeleton: The Kinetics-Skeleton [29] dataset is a large-scale human behavior recognition benchmark based on YouTube videos, containing about 300,000 video clips that cover 400 human behaviors, ranging from daily activities and sports scenarios to complex human-computer interactions. The original Kinetics dataset only provides raw video clips without skeleton sequences. ST-GCN [2] obtains the positions of 18 joints on each frame of a video by applying the publicly available Open-Pose toolkit, and selects the two individuals with the highest average joint confidence for skeleton data extraction. The processed skeleton data are divided into 240,000 training clips and 20,000 validation clips.
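The person-selection step described above can be sketched as follows; this is a minimal illustration, where the `(x, y, confidence)` joint layout follows the Open-Pose 18-keypoint output convention and the function name is ours:

```python
def top_two_by_confidence(people):
    """Keep the two pose estimates with the highest mean joint confidence.

    people: list of skeletons, each a list of 18 (x, y, confidence) tuples.
    Returns at most two skeletons, most confident first.
    """
    def mean_conf(person):
        return sum(joint[2] for joint in person) / len(person)

    return sorted(people, key=mean_conf, reverse=True)[:2]
```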
4.2. Implementation details
All experiments were conducted using the PyTorch deep learning framework. We employed stochastic gradient descent (SGD) with momentum 0.9 as the optimizer, cross-entropy as the loss function for backpropagation, and a weight decay of 0.0001. When training on NTU RGB + D 60 [27] and NTU RGB + D 120 [28], the learning rate was decayed at the 35th and 55th epochs, and training completed at the 80th epoch. For Kinetics-Skeleton [29], the learning rate was decayed at the 45th and 55th epochs, and training completed at the 70th epoch. The batch size was set to 70 for the NTU RGB + D 60 and NTU RGB + D 120 datasets, and to 64 for the Kinetics-Skeleton dataset.
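A minimal sketch of the step-decay schedule implied above; the base learning rate of 0.1 and decay factor of 0.1 are assumed common defaults, not values stated in this paper:

```python
def lr_at_epoch(epoch, base_lr=0.1, milestones=(35, 55), gamma=0.1):
    """Step decay: multiply base_lr by gamma once per passed milestone.

    For NTU RGB + D the milestones are (35, 55); for Kinetics-Skeleton,
    (45, 55), matching the schedules described above.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed
```

In PyTorch this corresponds to wrapping the SGD optimizer in `torch.optim.lr_scheduler.MultiStepLR` with the same milestones and gamma.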
4.3. Ablation study
In this section, to validate the effectiveness of our proposed module and dual-stream framework, we conduct the following experiments using the X-Sub benchmark on the NTU RGB + D 60 dataset.
In multi-stream fusion, we integrate the output features of different modalities, namely joints (J), bones (B), and motions (M). To evaluate the effect of each stream on model performance, we carried out three sets of controlled experiments, as shown in Table 1. The results show that the B (bones) stream has the most significant impact, with a performance gain of up to 2.6%, further improving the recognition accuracy.
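A minimal sketch of how the three input modalities and their late fusion can be derived from raw joint coordinates; the helper names, the zero-padding of the last motion frame, and the fusion weights are illustrative assumptions, not the paper's exact preprocessing:

```python
def bone_stream(joints, parents):
    """Bone modality: each joint minus its parent joint, per frame.

    joints: nested list of shape [T][V][C]; parents[v] is v's parent index.
    """
    return [[[joints[t][v][c] - joints[t][parents[v]][c]
              for c in range(len(joints[t][v]))]
             for v in range(len(parents))]
            for t in range(len(joints))]


def motion_stream(joints):
    """Motion modality: frame-wise difference, last frame zero-padded."""
    T = len(joints)
    return [[[joints[t + 1][v][c] - joints[t][v][c] if t + 1 < T else 0.0
              for c in range(len(joints[t][v]))]
             for v in range(len(joints[t]))]
            for t in range(T)]


def fuse_scores(scores, weights):
    """Late fusion: weighted sum of per-stream class-score vectors."""
    return [sum(w * s[i] for w, s in zip(weights, scores))
            for i in range(len(scores[0]))]
```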
To validate the effectiveness of the DCR module, the HFI module, and the ST-HCN and CD-HCN dual-stream design, we conducted a leave-one-out ablation study on the DBC-HCN model. By monitoring the performance changes of the B-stream, we evaluate the contribution of each component, as shown in Table 2. Removing HFI (DBC-HCN w/o HFI) causes a 0.6% performance drop, confirming the critical role of feature interaction between channels. Removing DCR (DBC-HCN w/o DCR) causes a 2.0% drop, highlighting the importance of DCR in improving model performance. Meanwhile, removing CD-HCN (DBC-HCN w/o CD) and ST-HCN (DBC-HCN w/o ST) degrades performance by 1.4% and 1.2%, respectively, indicating the vital role of the dual-stream interaction.
4.4. Comparative study
This section presents a comparative analysis of the DBC-HCN model against state-of-the-art skeleton-based action recognition methods, evaluated on the benchmarks of the NTU RGB + D 60, NTU RGB + D 120, and Kinetics-Skeleton datasets.
Comparative results are shown in Tables 3, 4 and 5. Across all three datasets, our method outperforms the vast majority of methods on almost all metrics. In particular, on the NTU RGB + D 120 dataset, our model, which combines joint and bone information, reaches state-of-the-art performance. DBC-HCN surpasses the current hypergraph model DST-HCN [37] on both evaluation benchmarks, with improvements of 0.6% and 0.5%, respectively.
We used the X-Sub benchmark on the NTU RGB + D 60 dataset as a challenging testbed for evaluating model performance. Fig 6 illustrates the confusion matrix of the DBC-HCN model on the NTU RGB + D 60 X-Sub benchmark. Analyzing this matrix shows that, although a few classes are recognized noticeably less accurately than others, our method identifies most classes correctly. Fig 7 further illustrates this trend. DBC-HCN achieves an accuracy above 90% on 38 action categories, accounting for 63.33% of all categories, indicating that the model identifies actions accurately in most cases. In addition, the model exceeds 80% accuracy on 52 action categories, accounting for 86.66% of all categories. These statistics highlight the accuracy and reliability of DBC-HCN in action recognition tasks.
Fig 6. Confusion matrices of DBC-HCN on the NTU RGB + D 60 dataset; the yellower a square on the diagonal, the more accurate the recognition. (a) X-Sub benchmark. (b) X-View benchmark.
Among the 60 action categories, the model demonstrates particularly strong performance in recognizing actions such as “take off jacket”, “jump up”, “hopping”, “staggering” and “walking towards”. The high accuracy on these actions can be attributed to their distinctive visual features, which the model captures successfully. In contrast, the “staple” action recorded the lowest recognition accuracy.
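The per-class statistics quoted above can be computed directly from the confusion matrix; the following is a small sketch, with function names of our own choosing:

```python
def per_class_accuracy(confusion):
    """Per-class recall from a confusion matrix.

    confusion[i][j] = number of class-i samples predicted as class j.
    """
    accs = []
    for i, row in enumerate(confusion):
        total = sum(row)
        accs.append(row[i] / total if total else 0.0)
    return accs


def count_at_or_above(accs, threshold):
    """How many classes reach the threshold (cf. the 38/60 and 52/60 counts)."""
    return sum(a >= threshold for a in accs)
```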
5. Conclusion
This paper proposes the Dual-Branch Differential Channel Hypergraph Convolutional Network (DBC-HCN), which improves skeleton-based action recognition by integrating Spatio-Temporal Dynamic Hypergraph Convolution and Channel-Differential Hypergraph Convolution. Compared with traditional graph convolution methods, the model uses a hypergraph structure to process high-order correlations among skeleton points, effectively capturing complex spatio-temporal features of motion, and enhances the representation of action details through the Dynamic Channel Refinement (DCR) module, the Hypergraph Feature Interaction (HFI) module, and the channel-difference mechanism. Experiments show that DBC-HCN outperforms mainstream methods on multiple datasets, verifying its superiority in modeling complex actions.
Although the model performs well, it has two limitations. On the one hand, the dual-branch architecture and the hypergraph computing mechanism significantly increase computational complexity, which restricts real-time deployment on edge computing devices. On the other hand, the construction of dynamic hypergraphs relies on the K-NN and K-means algorithms, which limits generalization to unconventional action patterns. To address these issues, future research will advance in three directions: first, developing a lightweight hypergraph convolution framework that combines network pruning and quantization techniques to achieve efficient inference at the edge; second, introducing an attention-driven mechanism to optimize the weight allocation of hyperedges and enhance the robustness of modeling heterogeneous action patterns; third, validating the extended model in practical scenarios such as abnormal-action detection in medical rehabilitation and intelligent human-computer interaction.
Supporting information
S1 File. Code for key modules.
It contains the code of the key modules of the neural network used in this experiment.
https://doi.org/10.1371/journal.pone.0332066.s001
(PDF)
References
- 1.
Shuchang Z. A survey on human action recognition. 2022.
- 2.
Du Y, Wang W, Wang L. Hierarchical recurrent neural network for skeleton based action recognition. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 1110–8. doi: https://doi.org/10.1109/cvpr.2015.7298714
- 3.
Li L, Wu Z, Zhang Z, Huang Y, Wang L. Skeleton-based relational modeling for action recognition. 2018.
- 4.
Li S, Li W, Cook C, Zhu C, Gao Y. Independently recurrent neural network (IndRNN): building a longer and deeper RNN. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018. 5457–66. doi: https://doi.org/10.1109/cvpr.2018.00572
- 5.
Li C, Zhong Q, Xie D, Pu S. Skeleton-based action recognition with convolutional neural networks. In: 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2017. 597–600. doi: https://doi.org/10.1109/icmew.2017.8026285
- 6.
Kim TS, Reiter A. Interpretable 3D human action analysis with temporal convolutional networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017. 1623–31. doi: https://doi.org/10.1109/cvprw.2017.207
- 7.
Li B, Dai Y, Cheng X, Chen H, Lin Y, He M. Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN. In: 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2017. 601–4. doi: https://doi.org/10.1109/icmew.2017.8026282
- 8. Yan S, Xiong Y, Lin D. Spatial temporal graph convolutional networks for skeleton-based action recognition. AAAI. 2018;32(1).
- 9.
Shi L, Zhang Y, Cheng J, Lu H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2019. 12018–27. doi: https://doi.org/10.1109/cvpr.2019.01230
- 10.
Chen Y, Zhang Z, Yuan C, Li B, Deng Y, Hu W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 13339–48. doi: https://doi.org/10.1109/iccv48922.2021.01311
- 11. Wang Y, Sun Y, Liu Z, Sarma SE, Bronstein MM, Solomon JM. Dynamic graph CNN for learning on point clouds. ACM Trans Graph. 2019;38(5):1–12.
- 12.
Chiang W-L, Liu X, Si S, Li Y, Bengio S, Hsieh C-J. Cluster-GCN. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 2019.
- 13. Song S, Lan C, Xing J, Zeng W, Liu J. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. AAAI. 2017;31(1).
- 14.
Zhang P, Lan C, Xing J, Zeng W, Xue J, Zheng N. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In: 2017 IEEE International Conference on Computer Vision (ICCV), 2017. 2136–45. doi: https://doi.org/10.1109/iccv.2017.233
- 15. Liu M, Liu H, Chen C. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recog. 2017;68:346–62.
- 16.
Ke Q, Bennamoun M, An S, Sohel F, Boussaid F. A new representation of skeleton sequences for 3D action recognition. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 4570–9. doi: https://doi.org/10.1109/cvpr.2017.486
- 17. Cao C, Lan C, Zhang Y, Zeng W, Lu H, Zhang Y. Skeleton-based action recognition with gated convolutional neural networks. IEEE Trans Circuits Syst Video Technol. 2019;29(11):3247–57.
- 18. Mumtaz A, Sargano AB, Habib Z. AnomalyNet: a spatiotemporal motion-aware CNN approach for detecting anomalies in real-world autonomous surveillance. Vis Comput. 2024;40(11):7823–44.
- 19.
Ye F, Pu S, Zhong Q, Li C, Xie D, Tang H. Dynamic GCN: context-enriched topology learning for skeleton-based action recognition. arXiv; 2020.
- 20.
Plizzari C, Cannici M, Matteucci M. Spatial temporal transformer network for skeleton-based action recognition. In: Pattern Recognition. ICPR International Workshops and Challenges, Lecture Notes in Computer Science, 2021. 694–701. doi: https://doi.org/10.1007/978-3-030-68796-0_50
- 21. Song Y-F, Zhang Z, Shan C, Wang L. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Trans Pattern Anal Mach Intell. 2023;45(2):1474–88. pmid:35254974
- 22. Feng Y, You H, Zhang Z, Ji R, Gao Y. Hypergraph neural networks. AAAI. 2019;33(01):3558–65.
- 23.
Jiang J, Wei Y, Feng Y, Cao J, Gao Y. Dynamic hypergraph neural networks. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019. 2635–41. doi: https://doi.org/10.24963/ijcai.2019/366
- 24. Gao Y, Zhang Z, Lin H, Zhao X, Du S, Zou C. Hypergraph learning: methods and practices. IEEE Trans Pattern Anal Mach Intell. 2022;44(5):2548–66. pmid:33211654
- 25. Mei Z, Bi X, Li D, Xia W, Yang F, Wu H. DHHNN: a dynamic hypergraph hyperbolic neural network based on variational autoencoder for multimodal data integration and node classification. Inform Fusion. 2025;119:103016.
- 26. Wang S, Zhang Y, Lin X, Hu Y, Huang Q, Yin B. Dynamic hypergraph structure learning for multivariate time series forecasting. IEEE Trans Big Data. 2024;10(4):556–67.
- 27.
Shahroudy A, Liu J, Ng T-T, Wang G. NTU RGB+D: a large scale dataset for 3D human activity analysis. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- 28. Liu J, Shahroudy A, Perez M, Wang G, Duan L-Y, Kot AC. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding. IEEE Trans Pattern Anal Mach Intell. 2020;42(10):2684–701. pmid:31095476
- 29.
Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, et al. The Kinetics human action video dataset. arXiv; 2017. https://arxiv.org/abs/1705.06950
- 30.
Cheng K, Zhang Y, He X, Chen W, Cheng J, Lu H. Skeleton-based action recognition with shift graph convolutional network. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2020. doi: https://doi.org/10.1109/cvpr42600.2020.00026
- 31.
Liu Z, Zhang H, Chen Z, Wang Z, Ouyang W. Disentangling and unifying graph convolutions for skeleton-based action recognition. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 140–9. doi: https://doi.org/10.1109/cvpr42600.2020.00022
- 32. Li X, Lu J, Zhou J, Liu W, Zhang K. Multiple temporal scale aggregation graph convolutional network for skeleton-based action recognition. Comp Electr Eng. 2023;110:108846.
- 33. Hao X, Li J, Guo Y, Jiang T, Yu M. Hypergraph neural network for skeleton-based action recognition. IEEE Trans Image Process. 2021;30:2263–75. pmid:33471763
- 34.
Wei J, Wang Y, Guo M, Lv P, Yang X, Xu M. Dynamic hypergraph convolutional networks for skeleton-based action recognition.
- 35.
Wang J, Falih I, Bergeret E. Skeleton-Based Action Recognition with Spatial-Structural Graph Convolution. In: 2024 International joint conference on neural networks (IJCNN), 2024. 1–9. doi: https://doi.org/10.1109/ijcnn60899.2024.10651306
- 36. Chen D, Chen M, Wu P, Wu M, Zhang T, Li C. Two-stream spatio-temporal GCN-transformer networks for skeleton-based action recognition. Sci Rep. 2025;15(1):4982. pmid:39929951
- 37.
Wang S, Zhang Y, Qi H, Zhao M, Jiang Y. Dynamic spatial-temporal hypergraph convolutional network for skeleton-based action recognition. 2023.