Abstract
The precise recognition of human lower limb movements based on wearable sensors is very important for human-computer interaction. However, existing methods tend to ignore the dynamic spatial information involved in executing human lower limb movements, leading to reduced decoding accuracy and limited robustness. In this paper, we construct skeleton graph data based on inertial measurement unit (IMU) sensors. We also propose a two-branch deep learning model, termed TCNN-MGCHN, to mine meaningful spatial and temporal feature representations from IMU-based skeleton graph data. Firstly, a temporal convolutional module (consisting of a multi-scale convolutional sub-module and an attention sub-module) is developed to extract temporal feature information with high discriminative power. Secondly, a multi-scale graph convolutional module and a spatial graph edge importance weighting method based on a body partitioning strategy are proposed to obtain intrinsic spatial feature information between different skeleton nodes. Finally, the fused spatio-temporal features are passed into the classification module to obtain the predicted gait movements and sub-phases. Extensive comparison and ablation studies are conducted on our self-constructed human lower limb movement dataset. The results demonstrate that TCNN-MGCHN delivers superior classification performance compared to mainstream methods. This study can provide a benchmark for IMU-based human lower limb movement recognition and related deep-learning modeling work.
Citation: Hu F, Zheng Q, Ye X, Qiao Z, Xiong J, Chang H (2025) Gait recognition using spatio-temporal representation fusion learning network with IMU-based skeleton graph and body partition strategy. PLoS One 20(10): e0332947. https://doi.org/10.1371/journal.pone.0332947
Editor: Andrea Tigrini, Polytechnic University of Marche: Universita Politecnica delle Marche, ITALY
Received: May 18, 2025; Accepted: September 5, 2025; Published: October 8, 2025
Copyright: © 2025 Hu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data are available publicly on github via the url: https://github.com/ZJUTofBrainIntelligence/Gait-Analysis-Dataset.
Funding: The authors declare financial support was received for the research, authorship, and/or publication of this article. This work was supported in part by Science and Technology Project of Traditional Chinese Medicine in Zhejiang Province (grant No. 2024ZL422 to HC), in part by the Medical and Health Science and Technology Project of Zhejiang Province (grant No. 2025KY969 to JX), in part by Ningbo Science Innovation Yongjiang 2035 Key Technological Breakthrough Project (grant 2024Z199 to FH), in part by Natural Science Foundation of Zhejiang Province (grant LQ23F030015 to FH), in part by Key Laboratory of Intelligent Processing Technology for Digital Music (Zhejiang Conservatory of Music), Ministry of Culture and Tourism (grant 2023DMKLC013 to FH), in part by the China Postdoctoral Science Foundation (grant 2025M772894 to FH).
Competing interests: The authors have declared that no competing interests exist.
Introduction
With the advancements in computer performance and data processing algorithms, human-computer interaction has progressed into a rapid development stage [1–5]. As an essential part of human-computer interaction technology, gait recognition method is widely used in medical rehabilitation [6–8], identity recognition [9], virtual reality [10], and other fields. At present, gait can be identified from various data forms (physiological signals [11–13], video streams [14,15], skeleton graph data [16,17], etc.). In particular, skeleton graph data has obvious advantages over other data types: even in complex environments, skeleton graph data retains sufficient kinematic and spatial information with good robustness. Therefore, research on gait recognition based on skeleton graph data has received much attention. However, skeleton graph data usually requires pre-installed vision equipment and thus cannot be used outside of the equipment’s working area. The inertial measurement unit (IMU) can overcome the limitation of the fixed area well and can provide abundant motion information. However, a single IMU sensor cannot provide abundant spatial information. Therefore, it is crucial to efficiently lay out multiple IMU sensors and establish a portable skeleton graph construction method.
In addition, IMU signals are non-stationary, weak, and low-frequency, so traditional machine learning methods (naive Bayes [18], hidden Markov models [19], decision trees [20], support vector machines [21], etc.) have insufficient decoding ability. With the rapid development of deep learning, many studies have applied it to gait recognition. Deep learning has been shown to generally outperform traditional machine learning methods, significantly improving the accuracy of gait recognition [22]. Deep learning methods fall into two groups, based on Euclidean and non-Euclidean data. Methods based on Euclidean data include the convolutional neural network (CNN) [23], the recurrent neural network (RNN) [24], and their variants. Zhao et al. [25] converted IMU data into angle-embedded gait dynamic images, designed a shallow (3-layer) CNN model, and tested it on two open gait movement databases. Their results show an average recognition accuracy of only 67.9% when testing on data from a different day, indicating poor generalization. Due to the inherent complexity of human movement, it is difficult for shallow models to learn the nonlinear relationships in human motion data, so many studies have explored deep neural network methods. Arshad et al. [26] converted IMU signals into Gramian angular field images, designed a deep CNN model, and implemented a sensor-based frailty assessment with an average recognition accuracy of 85.1%. However, IMU data is a time series with obvious time-dependence relationships, and neither shallow nor deep CNN models can effectively capture these dependencies. Therefore, some researchers [27,28] have used RNNs and their variants, which are good at extracting temporal features, for gait recognition.
However, RNNs and their variants struggle to learn local spatial features in IMU signals, so their improvement in gait recognition accuracy is limited. Various pattern recognition methods combining CNNs and RNNs have therefore been proposed [29,30]. Compared with a single network model, the CNN-RNN model effectively combines the advantages of both and has attracted wide attention. Although CNN-RNN models have achieved some improvement in gait recognition accuracy, their ability to extract spatial feature information is limited, and constructing an effective deep learning model to improve feature mining from IMU-based skeleton graph data remains a critical open problem. The skeleton graph is a kind of non-Euclidean data with highly correlated feature information among skeleton nodes, and classical CNN, RNN, and even CNN-RNN methods cannot extract the spatial feature information among these nodes. Deep learning methods for non-Euclidean data have therefore received extensive attention. The graph convolutional network (GCN) [31,32] can perform convolution on graph data and has excellent recognition performance for classification tasks based on skeleton graph data. Yan et al. [33] first proposed a spatial-temporal graph convolutional network to model skeleton graph data in the spatial-temporal domain. Studies have also shown that the GCN has strong spatial modeling capacity [33,34], but its temporal information mining ability still needs improvement. Generally speaking, existing deep learning methods extract only a portion of the feature information in human motion data, with limited expressive power and generalization difficulties. It is therefore crucial to establish a deep learning-based skeleton graph decoding method with greater expressive and generalization ability for gait and sub-phase recognition.
As highlighted above, the field of human lower limb movement recognition using multiple IMU sensors has several critical shortcomings. Firstly, traditional machine learning methods often struggle with decoding non-stationary, weak, and low-frequency IMU signals effectively, limiting their applicability in dynamic and complex movement scenarios. Secondly, while both shallow and deep convolutional neural networks (CNNs) have been explored, they exhibit significant limitations in capturing the essential time-dependent relationships inherent in IMU signals, resulting in suboptimal performance for sequential data analysis. Thirdly, recurrent neural networks (RNNs) and their variants, although adept at handling temporal dependencies, face substantial challenges in learning and representing local spatial features within the IMU data, which is crucial for accurate gait analysis. Finally, there is a pressing need for advanced methods capable of extracting and utilizing spatial feature information among skeleton nodes in non-Euclidean data structures, as conventional approaches often fall short in modeling the complex interrelationships present in such data. Addressing these issues is vital for advancing the accuracy of human movement recognition systems.
To address the issues above, we propose a multi-branch deep learning network, the temporal convolutional neural network and multi-scale dynamic graph convolutional hybrid network (TCNN-MGCHN), for accurate gait and sub-phase recognition based on skeleton graph data. Our model leverages a temporal convolutional module (TCM) to capture highly discriminative temporal features from IMU signals. We also introduce a multi-scale graph convolutional module (MGCM) combined with a body partitioning strategy to enhance the extraction of intrinsic spatial feature information between different skeleton nodes. In addition, the fused spatio-temporal features from our model improve classification performance for gait movements and sub-phases, demonstrating superior accuracy compared to mainstream methods. The main contributions of this paper are summarized as follows:
- We design a multi-task gait experiment and propose a portable skeleton graph data construction method based on multiple IMU sensors.
- We propose a spatial partitioning strategy to effectively explore the inherent spatial information among multiple IMU signals, thereby enhancing the spatial feature extraction capabilities of GCN.
- TCNN-MGCHN demonstrates strong human lower limb movement recognition performance and effective fused spatio-temporal feature extraction. Experimental results validate that TCNN-MGCHN outperforms state-of-the-art methods on our self-constructed dataset.
Materials and methods
Experimental design
Participants were recruited between June and December 2023 through an online forum, and the experimental protocol was designed in accordance with the regulations of the local ethics review committee. To ensure the validity and consistency of the gait recognition experiments, all participants were required to meet the following inclusion criteria: (1) no history of musculoskeletal disorders affecting the lower limbs (e.g., fractures, joint instability, or chronic pain); (2) no known neurological or motor function disorders that could affect fine motor control or movement coordination of the lower limbs. Ultimately, a total of 23 participants were recruited for the multi-task gait experiment, including 13 males (height: 172 ± 1.24 cm, age: 26.32 ± 2.19 years) and 12 females (height: 161 ± 1.47 cm, age: 23.72 ± 1.54 years). Before the experiment, they were also asked to eat properly and get enough sleep, and subjects received cash compensation after completing the experiment. Each participant read and signed a written informed consent form approved by the Ethics Committee of Zhejiang University of Technology under Application No. ZJUT-2023-47.
A multi-modal signal acquisition device was employed to record IMU and foot switch signals; the two modalities were synchronously acquired by the same device. As illustrated in Fig 1, the device comprises three modules: a laptop, a Raspberry Pi, and a data acquisition module. Specifically, we placed eight IMU sensors (WT901C485, Shenzhen Vit Intelligent Technology Co., Ltd.) with a sampling frequency of 100 Hz on the human body: chest, thighs, calves, and ankles. The accelerometer range was ±16 g, and the gyroscope range was ±2000°/s. Fig 2 shows the placement and triaxial orientation of the eight IMU sensors (I1-I8). It is worth noting that, to better capture upper-body movements, two IMU sensors were installed on the chest. This dual-sensor setup enables more accurate monitoring of subtle trunk motions, including rotations and asymmetrical movements, thereby providing higher spatial resolution for subsequent motion data analysis. Moreover, in the anatomical standing posture, the local reference frames of all IMUs (x, y, and z axes) were oriented identically to that of sensor I1. The orientation definitions in the body-centered coordinate system were as follows: 1) the x-axis was the horizontal axis aligned with the shoulders, pointing from the center of the torso toward the left side of the body; 2) the y-axis was the vertical axis aligned with the spine, pointing upward toward the head; 3) the z-axis was perpendicular to the chest, pointing anteriorly, i.e., forward in the direction the body is facing. Consistent with the analysis scheme of study [35], we use the triaxial angular rate signal as the analytical physical quantity. In addition, we used a wearable foot switch shoe cover made of thermoplastic polyurethane. The shoe cover fits tightly to the sole, making the sensors highly sensitive in use, as shown in Fig 2. Note that we placed two parallel-connected foot switch sensors at the forefoot and heel, respectively, which expands the response area of the sensors. In total, we collected four-channel foot switch signals at a sampling frequency of 1000 Hz, two channels for each foot.
The experiment was conducted in an open flat area on the Zhejiang University of Technology campus. As shown in Fig 2, subjects were required to complete four gait tasks: normal walking, walking downstairs, walking upstairs, and running. Fig 3 illustrates the sub-phase division method for gait. Here, ‘Forefoot (right)’ and ‘Heel (right)’ represent the footswitch recordings for the right forefoot and right heel, respectively. The gait cycle is divided into four stages based on the footswitch recordings: step down (SD), stance (St), push-up (PU), and swing (Sw). All subjects performed the gait at their normal pace, with each session lasting ten minutes.
Data preprocessing
The raw angular velocity signal contains a lot of artifacts and noise, so it is necessary to design an effective signal preprocessing algorithm to improve the signal quality. Specifically, a second-order Butterworth filter with a cutoff frequency of 10 Hz was applied to the angular velocity signal to eliminate high-frequency noise. Then, to preserve the high temporal resolution of the foot switch events, we resampled the angular rate signals to 1000 Hz. This is crucial for precise gait phase detection while maintaining consistent sampling rates across different modalities for accurate data fusion. Each foot switch sensor has two states: “ON” or “OFF”, where “ON” means that the foot switch sensor is in contact with the ground, and “OFF” is the opposite, as shown in Fig 3.
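The filtering and resampling steps above can be sketched as follows. This is a minimal NumPy/SciPy illustration assuming a zero-phase Butterworth filter and FFT-based resampling; the function name and resampling method are illustrative choices, not the authors' exact pipeline.

```python
import numpy as np
from scipy.signal import butter, filtfilt, resample

def preprocess_angular_rate(sig, fs_in=100, fs_out=1000, cutoff=10.0, order=2):
    """Low-pass filter at `cutoff` Hz, then resample from fs_in to fs_out.

    `sig` has shape (n_samples, n_channels). The second-order filter,
    10 Hz cutoff, and 100 Hz -> 1000 Hz resampling follow the paper.
    """
    b, a = butter(order, cutoff / (fs_in / 2), btype="low")
    filtered = filtfilt(b, a, sig, axis=0)          # zero-phase filtering
    n_out = int(sig.shape[0] * fs_out / fs_in)
    return resample(filtered, n_out, axis=0)        # FFT-based resampling

raw = np.random.randn(500, 24)     # 5 s of 100 Hz data, 8 IMUs x 3 axes
clean = preprocess_angular_rate(raw)
print(clean.shape)                 # (5000, 24)
```

Zero-phase filtering (filtfilt) avoids introducing a phase lag that would misalign the IMU signals with the 1000 Hz foot switch events.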
The foot switch is the most commonly used analytical signal for human movement sub-phase division. Additionally, in this study, the Z-axis of the IMU sensor positioned on the human lower limb is orthogonal to the sagittal plane of the body, providing sufficient kinematic information through the angle signal in this direction. Therefore, we use the Z-axis angle signal of the IMU sensor located at the thigh of the active foot (the power foot on takeoff) and the foot switch signals of both feet to realize the definition and division of the gait sub-phases, as shown in Fig 3. This study includes one gait classification task and four gait sub-phase classification tasks. Both the gait classification task and each of the gait sub-phase classification tasks comprise four distinct categories.
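The footswitch-based sub-phase assignment can be illustrated with a small lookup. The specific state-to-phase mapping below (SD = heel contact only, St = full contact, PU = forefoot contact only, Sw = no contact) is a plausible reading of Fig 3, not the paper's explicit rule, and is labeled as an assumption.

```python
def gait_subphase(forefoot_on, heel_on):
    """Map the ON/OFF states of the two footswitch channels of one foot
    to a gait sub-phase label. The mapping is an assumed interpretation
    of Fig 3, not taken verbatim from the paper."""
    if heel_on and not forefoot_on:
        return "SD"   # step down: heel strikes first
    if heel_on and forefoot_on:
        return "St"   # stance: full foot contact
    if forefoot_on and not heel_on:
        return "PU"   # push-up: heel lifted, forefoot loaded
    return "Sw"       # swing: no ground contact

print([gait_subphase(f, h) for f, h in
       [(False, True), (True, True), (True, False), (False, False)]])
```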
Next, we use a sliding window to intercept sample data segments from the angular rate signals (8 skeleton nodes × 3 axes, i.e., 24 channels). The sliding window has a length of 200 data points and a step size of 40 data points, so each input sample has the size 8 × 3 × 200 (nodes × channels × time steps). Since the delay caused by data interception is less than 300 ms, this interception scheme is considered sufficient for the model to perform classification tasks continuously and in real time [30]. In addition, sample balancing operations are performed to ensure that the model is evaluated accurately.
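The sliding-window interception can be sketched as below. The window length (200) and step (40) follow the paper; the channel-first output layout is an illustrative choice.

```python
import numpy as np

def sliding_windows(sig, win_len=200, step=40):
    """Cut a (n_samples, n_channels) signal into overlapping windows,
    returning an array of shape (n_windows, n_channels, win_len)."""
    n = (sig.shape[0] - win_len) // step + 1
    return np.stack([sig[i * step:i * step + win_len].T for i in range(n)])

sig = np.random.randn(1000, 24)    # 24 channels = 8 IMU nodes x 3 axes
windows = sliding_windows(sig)
print(windows.shape)               # (21, 24, 200)
```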
Methods
In this section, the proposed TCNN-MGCHN framework and algorithm are introduced in Model architecture. Subsequently, we provide the implementation details of the TCNN-MGCHN model.
Model architecture.
As illustrated in Fig 4, an end-to-end TCNN-MGCHN model was designed for human lower limb movement (HLLM) and sub-phase recognition, utilizing the spatio-temporal feature distribution of IMU-based skeleton graph data. The TCNN-MGCHN model consists of three functional modules: a TCM, an MGCM, and a classification module (CM). The TCM captures temporal dependencies in IMU signals by applying convolutional operations along the temporal dimension, enabling the model to learn high-level features essential for accurate movement recognition. The MGCM captures spatial relationships among skeleton nodes in a non-Euclidean space using multi-scale graph convolutional operations to learn both local and global features. The body partitioning strategy enhances this process by dividing the skeleton into regions, allowing the MGCM to focus on specific areas and capture detailed spatial information, leading to a more holistic understanding of movement patterns. The CM fuses the temporal and spatial features from the TCM and MGCM, creating a comprehensive spatio-temporal representation that enhances accuracy and robustness in gait recognition. In the following, we detail the specific model architecture.
The kernel parameter sizes used in each module are explicitly annotated within the figure.
TCM: The TCM consists of a multi-scale convolutional sub-module and an attention sub-module and transforms the angular velocity signal into high-level temporal domain features. To ensure that the TCM accurately captures strongly discriminative temporal domain features, we adopt convolution layers with different kernel sizes to automatically extract salient patterns at multiple time scales. The attention sub-module is then used to adaptively learn and focus on important time-scale feature information. As shown in Fig 4, two TCMs are used in this paper, and the output feature map of the j-th TCM is denoted F_T^(j). The parameters of the TCM are shown in Fig 4; taking a convolutional module with parameters (3, 32, k, 1) as an example, 3 denotes the input channel dimension, 32 the output channel dimension of the convolutional layer, k the convolution kernel size (the specific sizes are annotated in Fig 4), and 1 the stride of the convolution. The TCM receives pre-processed IMU signals segmented into fixed-length windows, forming multi-dimensional tensors X ∈ R^(S×C×T), whose dimensions represent the sensor count S, the IMU features C, and the time steps T, capturing the temporal dynamics of gait.
Multi-scale convolutional sub-module: Assume the input data is X ∈ R^(S×C×T), where S is the number of skeleton nodes, C the number of channels per skeleton node, and T the number of data samples per channel. In the j-th TCM, we first apply three convolution operations with small, medium, and large kernels to the input X to obtain the low-scale, medium-scale, and high-scale temporal features F_l, F_m, and F_h, respectively. These features are then successively processed through batch normalization (BN) and ReLU layers to obtain features F'_l, F'_m, and F'_h, which accelerates the model learning process and alleviates the vanishing gradient problem. Finally, F'_l, F'_m, and F'_h are concatenated to generate the multi-scale feature F_ms.
Attention sub-module: In this sub-module, we fuse the multi-scale temporal features and use an attention layer to emphasize important skeleton nodes and channels. First, a pooling operation is applied to F_ms to produce F_p, reducing the feature dimension and avoiding overfitting. The pooled features are then flattened into a one-dimensional vector f, which is passed through two fully connected (FC) layers followed by a softmax layer, yielding the attention weights w. Finally, the attention weights are multiplied with the multi-scale temporal features to obtain the final fused temporal feature map F_T (Eq (1)), where ⊙ denotes the Hadamard product. In summary, the TCM fuses features at multiple time scales and adaptively focuses on important nodes and channels, giving the extracted temporal features stronger discriminative power and generalization ability.
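The multi-scale convolution plus attention fusion described above can be sketched numerically. The following NumPy toy version is an illustration only: the kernel lengths (3, 7, 15), mean-pooled "energies" as attention logits, and the omission of BN are simplifying assumptions, not the paper's exact TCM.

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_conv(x, kernel):
    """'Same'-padded 1-D convolution along the last (time) axis."""
    pad = len(kernel) // 2
    xp = np.pad(x, [(0, 0), (0, 0), (pad, pad)], mode="edge")
    out = [[np.convolve(ch, kernel, mode="valid")[:x.shape[-1]]
            for ch in node] for node in xp]
    return np.array(out)

# Input: S = 8 skeleton nodes, C = 3 channels, T = 200 time steps.
x = rng.standard_normal((8, 3, 200))

# Three temporal scales; the averaging-kernel lengths are illustrative.
scales = [np.ones(k) / k for k in (3, 7, 15)]
feats = [np.maximum(temporal_conv(x, kern), 0) for kern in scales]  # conv + ReLU

# Attention over the three scales: softmax of pooled feature energies.
energy = np.array([f.mean() for f in feats])
w = np.exp(energy) / np.exp(energy).sum()

# Weighted (Hadamard-style) fusion of the multi-scale features.
fused = sum(wi * fi for wi, fi in zip(w, feats))
print(fused.shape)   # (8, 3, 200)
```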
MGCM: As mentioned above, we constructed the skeleton graph data based on eight IMU sensors. Then, the MGCM is designed to learn the spatio-temporal feature representations from skeleton graph data.
Skeleton graph data construction: Fig 5 shows the constructed skeleton graph, where the IMU sensors are used as skeleton nodes and the three-axis angular rate signals are used as feature vectors for each skeleton node. The sensor numbers are labeled with digits. The blue lines between the skeleton nodes represent the natural connection of the body joints. In addition, we connect the adjacent sample points of the same skeleton node to make the data retain the time information, as shown by the red dashed line in Fig 5.
Suppose that a single sample point’s spatio-temporal skeleton graph data is represented as G = (V, E), where V is the set of N (N = 8) skeleton nodes and E is the set of edges. In this paper, each skeleton node has an m-dimensional feature (m = 3). Therefore, the skeleton node feature matrix of a single sample point can be represented as X ∈ R^(N×m), and the adjacency matrix as A ∈ R^(N×N). A is defined as A_ij = 1 if nodes v_i and v_j are connected and A_ij = 0 otherwise (Eq (2)).

A is then utilized to compute the normalized Laplacian matrix L, as shown in Eq (3):

L = D̃^(−1/2) Ã D̃^(−1/2),

where Ã = A + I_N, I_N is the identity matrix, and D̃ is the degree matrix of the vertices (D̃_ii = Σ_j Ã_ij). Eq (4) illustrates the GCN updating process for each layer:

H = σ(L X Θ),

where X denotes the input data, Θ ∈ R^(m×G) represents the learnable parameter matrix of the GCN filter, and G denotes the output dimension of the feature after the GCN layer.
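A single GCN layer under the Kipf-style normalization above can be sketched in NumPy. The edge list below is a hypothetical 8-node skeleton (chest pair linked to two three-node leg chains), not the exact connectivity defined in Fig 5.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, G = 8, 3, 16          # nodes, input features, output features

# Undirected edges of an assumed 8-node skeleton (illustrative numbering).
edges = [(0, 1), (1, 2), (1, 5), (2, 3), (3, 4), (5, 6), (6, 7)]
A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Normalized propagation matrix: L = D^-1/2 (A + I) D^-1/2, as in Eq (3).
A_hat = A + np.eye(N)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
L = D_inv_sqrt @ A_hat @ D_inv_sqrt

# One GCN layer with ReLU activation: H = relu(L X Theta), as in Eq (4).
X = rng.standard_normal((N, m))
Theta = rng.standard_normal((m, G))
H = np.maximum(L @ X @ Theta, 0)
print(H.shape)   # (8, 16)
```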
According to Eq (4), the parameter Θ is shared in the graph convolution operation without considering the effect of different body regions on the edge weight parameters. Therefore, we design three partitioning strategies to enhance the high-level spatio-temporal feature extraction capability of the graph convolutional network: the uni-label partitioning strategy, the dual-label partitioning strategy, and the body partitioning strategy. As shown in Fig 6(a), a root node v_i and its neighbor nodes constitute a neighbor set B(v_i), marked by the red dotted box. Unlike the conventional graph convolution method, we apply a partitioning strategy that divides the neighbor set B(v_i) into K subsets, assigning each node in B(v_i) a subset label through a mapping l_i: B(v_i) → {0, ..., K − 1}. The weight parameter can therefore be expressed as w(v_j) = W[l_i(v_j)] (Eq (5)).
From left to right: (a) Single frame skeleton graph (b) Uni-label partition strategy. (c) Dual-label partition strategy. (d) Body partition strategy.
Uni-label partition strategy: This strategy treats the neighbor set B(v_i) as a whole. Employing it is equivalent to computing the inner product between a single weight vector and the average feature vector of the root node and all adjacent nodes. As shown in Fig 6(b), all nodes of the neighbor set B(v_i) have the same label (red). This strategy is sub-optimal because it ignores the differences and interaction information between body regions. In this case, K = 1 and l_i(v_j) = 0.
Dual-label partitioning strategy: This strategy divides the neighbor set into two parts: the root node and its neighbor nodes. Let d(v_j, v_i) denote the distance from node v_j to the root node v_i; the subset with d = 0 contains the root node itself, while the remaining neighboring nodes (d = 1) form the second subset. As shown in Fig 6(c), the root node is red and its neighbor nodes are green. The neighbor set B(v_i) thus receives two different weight vectors, which can model the local node interaction information. Formally, we have K = 2 and l_i(v_j) = d(v_j, v_i).
Body partitioning strategy: According to the spatial distribution characteristics and functional properties of the body regions, B(v_i) is divided into four subsets: 1) the root node v_i itself; 2) the trunk region (TR): nodes connected to the root node within the trunk region; 3) the left lower limb region (LLLR): nodes connected to the root node within the left lower limb region; and 4) the right lower limb region (RLLR): nodes connected to the root node within the right lower limb region. This strategy is inspired by the functional differences between body regions: the trunk stabilizes the center of gravity and generates and transmits force, while the left and right lower limbs support and move the body, with subtle functional differences between them. As shown in Fig 6(d), the root node v_i is red, the nodes connected to v_i in the TR are green, those in the LLLR are blue, and those in the RLLR are yellow. Formally, we have K = 4, and l_i(v_j) is calculated according to Eq (6).
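The subset-label mapping l_i of the body partitioning strategy can be illustrated with a small lookup table. The node-to-region assignment below (nodes 0-1 = chest/trunk, 2-4 = left leg, 5-7 = right leg) is an assumed sensor numbering, not the layout defined in Fig 5.

```python
# Assumed mapping of the 8 IMU nodes to body regions (illustrative only).
REGION = {0: "TR", 1: "TR",
          2: "LLLR", 3: "LLLR", 4: "LLLR",
          5: "RLLR", 6: "RLLR", 7: "RLLR"}
SUBSET_LABEL = {"root": 0, "TR": 1, "LLLR": 2, "RLLR": 3}   # K = 4 subsets

def partition_label(root, neighbor):
    """Map a node in the neighbor set of `root` to one of the K = 4
    subsets; each label indexes its own weight vector in Eq (5)."""
    if neighbor == root:
        return SUBSET_LABEL["root"]
    return SUBSET_LABEL[REGION[neighbor]]

print([partition_label(1, v) for v in (1, 0, 3, 6)])   # [0, 1, 2, 3]
```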
Fig 6 provides a visualization of the three partitioning strategies; we evaluate the effectiveness of the proposed body partitioning strategy in the Results section. More advanced partitioning strategies are expected to lead to better modeling capability and recognition performance. By combining the graph construction method and the partitioning strategy, each layer’s GCN update process can be expressed as in Eqs (7) and (8), where X is the input data and G denotes the output dimension of the feature after the GCN layer.
As shown in Fig 4, the backbone structure of the multi-branch GCN designed in this paper consists of a BN layer and nine spatio-temporal graph convolutional blocks (B1, B2, ..., B9). Each block comprises a graph convolutional layer, followed by a batch normalization (BN) layer, a ReLU activation function, a dropout (Dp) layer, and a temporal convolutional layer with an additional BN layer and ReLU function, as shown in Fig 7. As shown in Fig 4, the parameters of block B1 are (3, 16, 1, k), where 3 represents the input channel dimension, 16 denotes the output channel dimension of the block, 1 is the kernel size of the graph convolutional layer, and k is the kernel size of the temporal convolutional layer (the specific sizes are annotated in Fig 4). In B1, B2, B3, B5, B6, B8, and B9, the stride of the temporal convolutional layer is set to 1; in B4 and B7, it is set to 2 to reduce the number of feature parameters. Initially, BN is applied to normalize the input data X, yielding the feature F_0. F_0 is then fed into the spatio-temporal graph convolutional blocks to extract spatial feature representations. After the nine spatio-temporal graph convolution operations and a convolution operation in the backbone, we obtain the high-order feature F_b with W = 64 output channels. In addition, we branch off from blocks B3 and B6 to obtain the intermediate features F_3 and F_6, which are each passed through two convolutional layers to obtain features F'_3 and F'_6, respectively; the convolutional layer parameters of the backbone and the two branches are shown in Fig 4. Finally, the fused output F_S of the MGCM is obtained by element-wise summation of F_b, F'_3, and F'_6 (Eq (8)), where Conv(·) is the convolution function and ⊕ denotes the element-wise summation operation. The MGCM thus yields the spatial feature F_S. Additionally, the adjacency matrix A of the MGCM is continuously updated throughout the network training process; Algorithm 1 introduces the process of dynamically updating A.
ConvG denotes graph convolutional layer and ConvT denotes temporal convolutional layer.
Algorithm 1. The Optimizing Procedure of the TCNN-MGCHN.
Input: A labeled IMU-based skeleton graph dataset D = {(X_i, y_i)}, epoch parameter Ep, batch size parameter Ba, and model hyper-parameters Θ.
Output: The optimal parameters Θ and the learned adjacency matrix A.
1: for e = 1 : Ep do
2:   for b = 1 : Ba do
3:     Sample input X_b and labels y_b from D.
4:     Calculate the time-domain feature F_T by feeding X_b into the TCM according to Eq (1).
5:     Calculate the Laplacian matrix L according to Eq (3).
6:     Calculate the spatial feature F_S by feeding X_b into the MGCM according to Eqs (7) and (8).
7:     Obtain the predicted label ŷ_b by passing F_T and F_S into the classification module according to Eqs (9) and (10).
8:     Use y_b and ŷ_b to calculate the loss function according to Eq (11).
9:     Update the adjacency matrix A and model parameters Θ via the Adam optimizer based on the loss function.
10:  end for
11: end for
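The control flow of Algorithm 1 can be sketched as below. The stub modules and squared-error placeholder stand in for the real TCM, MGCM, classification module, and Eq (11); no actual gradient computation is performed (a real implementation would use PyTorch autograd), so this is a structural sketch only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stubs standing in for the real modules (assumptions, not the paper's code).
def tcm(x):
    return x.mean(axis=-1)                             # (N, C) "temporal" features

def mgcm(x, A):
    return np.einsum("ij,jct->ict", A, x).mean(axis=-1)  # (N, C) "spatial" features

def classify(ft, fs):
    return np.concatenate([ft, fs], axis=-1).mean()      # scalar "prediction"

Ep, Ba, N = 2, 4, 8            # epochs, batches per epoch, skeleton nodes
A = np.eye(N)                  # adjacency matrix, updated during training
losses = []
for e in range(Ep):            # Algorithm 1, line 1
    for b in range(Ba):        # Algorithm 1, line 2
        Xb = rng.standard_normal((N, 3, 200))       # sampled batch (line 3)
        yb = float(rng.integers(0, 4))              # sampled label
        ft = tcm(Xb)                                # line 4: TCM features
        fs = mgcm(Xb, A)                            # lines 5-6: MGCM features
        pred = classify(ft, fs)                     # line 7: prediction
        loss = (pred - yb) ** 2                     # line 8: placeholder loss
        A = A + 1e-4 * rng.standard_normal(A.shape) # line 9: A is learnable
        losses.append(float(loss))
print(len(losses))   # 8
```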
Classification module: Firstly, we concatenate the features F_T and F_S to obtain the fused spatio-temporal feature F_ST. F_ST is further refined across the channel dimension through a convolutional layer (with the kernel size annotated in Fig 4) to obtain the final spatio-temporal feature F (Eq (9)), where Conv(·) is the convolution operation. Secondly, the feature map F is sequentially passed through an FC layer and a softmax layer to obtain the predicted label ŷ (Eq (10)), where the FC weights and bias are learned parameters and ŷ is the predicted label of the model.
Implementation details for TCNN-MGCHN model
Consider a given IMU-based skeleton graph dataset D = {(X_i, y_i)}, where X_i is a skeleton graph sample and y_i its label. Initially, X_i is transformed into the temporal feature F_T by the TCM, as described in Eq (1). Next, X_i is further transformed into the spatial feature F_S by the MGCM based on Eqs (7) and (8). The fused spatio-temporal representation F_ST is then formed by feeding F_T and F_S into the classification module, and the prediction label ŷ is calculated by passing F_ST through the classification module according to Eqs (9) and (10). The detailed steps of the model optimization process are provided in Algorithm 1.

The cross-entropy loss L is used to evaluate the discrepancy between the true label y and the predicted label ŷ, with regularization terms added as in Eq (11), where α and β are constants, Θ denotes the model parameters, and ‖·‖₁ denotes the ℓ1 norm. In Eq (11), α and β are assigned values of 0.001 and 0.2, respectively, to facilitate dynamic updating of the adjacency matrix A and to mitigate overfitting. The Adam optimizer is employed with a learning rate of 0.001, and the batch size is set to 64. The TCNN-MGCHN model is implemented in Python 3.6 and PyTorch 1.9 and trained on an NVIDIA RTX 2080 Ti GPU.
Results
We conduct an in-depth experimental analysis to demonstrate the advantages of the proposed model. The evaluation focuses on three key aspects. First, we assess the impact of model parameters and different partitioning strategies on performance. Second, detailed ablation experiments are carried out to validate the contribution of each component, specifically the TCM and MGCM, within the TCNN-MGCHN model. Lastly, we compare the proposed model against mainstream approaches to highlight its effectiveness.
To ensure accurate evaluation, we adopt the 5-fold cross-validation method. Additionally, the model’s performance in classifying gait and sub-phases is measured using four standard metrics: precision, recall, accuracy, and Matthews correlation coefficient (MCC) [41]:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
where TP, TN, FP, and FN denote true positive, true negative, false positive, and false negative, respectively.
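These metrics can be computed directly from confusion-matrix counts; a minimal sketch (using the binary form of MCC — the cited reference [41] describes its K-category generalization):

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Precision, recall, accuracy, and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return precision, recall, accuracy, mcc

# Hypothetical counts, purely for illustration
p, r, a, m = binary_metrics(tp=90, tn=85, fp=10, fn=15)
```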
Impact analysis of model parameters and partitioning strategies
It is well known that model parameters and partitioning strategies can significantly impact classification performance. The convolution kernel size determines the receptive field of the model, affecting both the scope of temporal information analyzed from the angular velocity signal and the computational complexity. Therefore, it is important to explore how to effectively balance the information extraction capability of the TCNN-MGCHN model with computational efficiency. Table 1 presents the model’s performance under four different multi-scale convolution kernel settings for the TCM. The results indicate that the TCNN-MGCHN model achieves optimal classification performance for both the gait and gait sub-phase tasks with the best-performing kernel combination listed in Table 1. In particular, with this setting, the accuracy, recall, precision, and MCC on the gait classification task are 98.24%±0.24, 98.53%±0.30, 98.12%±0.26, and 98.13%±0.20, respectively, and the corresponding averages across all tasks are 97.09%±0.65, 97.43%±0.82, 96.96%±0.47, and 97.15%±0.58, respectively.
As mentioned above, we designed three partitioning strategies to help the model effectively mine the internal spatial dependence information between skeleton nodes in different body regions. Therefore, we explore the impact of the three partitioning strategies on model performance, as shown in Table 2. Experimental results indicate that the TCNN-MGCHN model achieves the lowest average recognition accuracy, 88.68%±1.42, when using the uni-label partitioning strategy. In addition, the dual-label partitioning strategy outperforms the uni-label partitioning strategy on all evaluation metrics. The body partitioning strategy helps the most in improving model performance: the average accuracy, recall, precision, and MCC of the gait and sub-phase classification tasks are 97.09%, 97.43%, 96.96%, and 97.15%, respectively. Moreover, to validate the statistical significance of the performance differences among the three partitioning strategies, we conducted paired t-tests on accuracy, recall, precision, and MCC across the five sub-tasks. The results revealed that the body partitioning strategy significantly outperformed both the dual-label (p<0.05) and uni-label (p<0.01) strategies across all metrics. The dual-label strategy also consistently outperformed the uni-label strategy (p<0.05). These findings confirm that explicitly modeling region-specific dependencies among skeletal joints contributes to better recognition performance.
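The paired t-test used here compares matched per-fold (or per-task) scores of two strategies. A self-contained sketch of the test statistic is given below; the example scores are hypothetical, not the paper's data, and the p-value would come from a t distribution with n−1 degrees of freedom (e.g., via scipy.stats.ttest_rel).

```python
import math

def paired_t_statistic(a, b):
    """Paired t-test statistic for two matched score lists, e.g., per-fold
    accuracies of two partitioning strategies. Only the statistic is
    computed; the p-value requires the t distribution with len(a) - 1
    degrees of freedom.
    """
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical per-fold accuracies for two strategies (not the paper's data)
body = [97.1, 96.8, 97.4, 96.9, 97.3]
uni = [88.2, 89.0, 88.5, 89.3, 88.4]
t = paired_t_statistic(body, uni)  # large positive t => body strategy better
```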
Impact analysis of multi-branch graph convolutional network architecture
To assess the impact of the multi-branch network structure in the proposed MGCM, we compare the performance of two models: TCNN-MGCHN and TCNN-MGCHN without the multi-branch network architecture (TCNN-MGCHN w/o MBNA). The TCNN-MGCHN w/o MBNA model retains only the backbone structure and removes the two branch structures. As illustrated in Fig 8, the TCNN-MGCHN model achieves an average accuracy 5.44% higher than that of the TCNN-MGCHN w/o MBNA model. Moreover, a notable accuracy difference between the two models is observed in the upstairs sub-phase classification task, where the TCNN-MGCHN model outperforms the TCNN-MGCHN w/o MBNA model by 10.6%. In terms of MCC, the TCNN-MGCHN model also generally outperforms the TCNN-MGCHN w/o MBNA model, except on the normal walking sub-phase classification task; its average MCC is 8.50% higher. These findings demonstrate that the multi-branch structure effectively enhances the recognition performance of the TCNN-MGCHN model in both gait classification and gait sub-phase classification tasks.
The red line represents the standard deviation.
In addition, we calculate the loss values of the TCNN-MGCHN and TCNN-MGCHN w/o MBNA models under the seven classification tasks, as shown in Fig 9. The loss value provides a more reliable measure of the error between the classifier’s output and the true label compared to the accuracy metric. As shown in Fig 9, the TCNN-MGCHN model consistently exhibits a lower loss value than the TCNN-MGCHN w/o MBNA model, except in the normal walking sub-phase classification task. Especially for the gait classification task, the loss value of the TCNN-MGCHN model is 0.8656 lower than that of the TCNN-MGCHN w/o MBNA model.
The black line represents the standard deviation.
Ablation study
Here, the importance of each module of the TCNN-MGCHN model, the TCM and the MGCM, is validated by ablation experiments. As shown in Fig 10, the average area under the curve (AUC) values of the TCNN-MGCHN model are consistently higher than those of TCNN-MGCHN w/o TCM and TCNN-MGCHN w/o MGCM.
The above results show that both TCM and MGCM contribute significantly to the classification performance of the TCNN-MGCHN model, which is consistent with the TCM capturing temporal information and the MGCM mining spatial dependence information among skeleton nodes. Consequently, the classification performance of the TCNN-MGCHN model degrades if either functional module is discarded. Furthermore, MGCM is the more helpful of the two modules: relative to TCNN-MGCHN w/o MGCM, the AUC increases by 2.96% (gait), 2.09% (normal walking sub-phase), 2.45% (downstairs sub-phase), 1.52% (upstairs sub-phase), and 1.75% (running sub-phase), respectively. Compared with the TCNN-MGCHN w/o TCM model, the AUC of the TCNN-MGCHN model increases by 1% (gait), 0.67% (normal walking sub-phase), 1.73% (downstairs sub-phase), 0.91% (upstairs sub-phase), and 0.92% (running sub-phase), respectively.
Fig 11 illustrates the increased accuracy (IA) and reduced loss (RL) of the TCNN-MGCHN model compared to TCNN-MGCHN w/o TCM and TCNN-MGCHN w/o MGCM across all classification tasks. In Fig 11(a), the accuracy improvement of the TCNN-MGCHN model relative to the TCNN-MGCHN w/o MGCM model varies with the classification task: the largest improvement of 7.34% is achieved on the normal walking sub-phase classification task, and the smallest improvement of 0.34% on the downstairs sub-phase classification task. In contrast, the accuracy improvement of the TCNN-MGCHN model relative to the TCNN-MGCHN w/o TCM model is similar across classification tasks, with an average improvement of 3.17%. As shown in Fig 11(b), the TCNN-MGCHN model achieves its largest loss reduction of 0.19 relative to the TCNN-MGCHN w/o MGCM model on the gait classification task; on the other sub-phase classification tasks, the reduction relative to this model is not greater than 0.03. Relative to the TCNN-MGCHN w/o TCM model, the largest loss reductions occur on the normal walking and downstairs classification tasks, at 0.55 and 0.51, respectively; on the other classification tasks, the reduction is likewise no more than 0.03. These findings suggest that both TCM and MGCM independently enhance the model’s recognition capabilities, and that their combination leads to further improvements in accuracy and robustness across a wide range of gait recognition tasks.
(a) The improved accuracy (IA). (b) The reduced loss (RL) value.
IMU contribution visualization for gait recognition
To further enhance the interpretability of the proposed model, we visualize the contribution of each IMU sensor to the classification of different gait types (i.e., normal walking, upstairs, downstairs, and running) in the form of a contribution heatmap, as shown in Fig 12. The matrix presents the average importance scores of the eight IMU sensors, which are computed using a gradient-based attribution method across all correctly classified samples. It can be observed that the sensors placed on the lower limbs—particularly those on the feet (I5/I8) and shanks (I4/I7)—exhibit higher contribution scores across all gait categories. This is especially evident in dynamic movements such as running and stair descent, where the foot-ground interaction and leg motion are more pronounced. In contrast, the sensors on the upper body (I1 on the chest and I2 on the abdomen) contribute relatively less, but still provide valuable information for posture and balance, particularly in walking and downstairs. The visualization confirms the rationality and effectiveness of the proposed architecture in leveraging spatial information from multi-IMU input for accurate gait recognition.
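One simple gradient-based attribution of this kind averages the absolute input gradients per sensor over correctly classified samples. The sketch below illustrates that computation on hypothetical gradient data; the paper's exact attribution procedure may differ.

```python
import numpy as np

def sensor_contributions(grads):
    """Sketch of a gradient-based attribution score per IMU sensor.
    grads: (S, T, C) input gradients for S correctly classified samples
    over T frames and C sensors (shapes are hypothetical).
    Returns one normalized importance score per sensor, summing to 1.
    """
    score = np.abs(grads).mean(axis=(0, 1))  # average |gradient| per sensor
    return score / score.sum()               # normalize across sensors

rng = np.random.default_rng(1)
grads = rng.normal(size=(20, 50, 8))  # hypothetical: 20 samples, 8 IMUs
scores = sensor_contributions(grads)  # one importance score per IMU
```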
Comparison with some mainstream classification algorithms
Like the proposed TCNN-MGCHN model, other studies have also achieved impressive results in the field of gait recognition. Table 3 presents a detailed performance comparison between the TCNN-MGCHN algorithm and several widely adopted methods in this domain. To ensure a fair comparison, we applied each of these methods to the IMU-based skeleton graph dataset collected in this study. The methods included in the comparison are DBN [42], Phase Variable [43], Bi-LSTM [44], CNN [45], ConvLSTM [46], DAFO [28], LDA-PSO-LSTM [27], as well as the recently proposed DPF-LSTM-CNN [47], LSTM-CRF [48], LSTM-RNN [49], and Bi-LSTM [50]. By applying these algorithms to the same IMU-based skeleton graph dataset, we ensured consistency and reliability in the performance assessment. Additionally, we employed uniform preprocessing across all models, and the evaluation was based on the same performance metric: accuracy. This approach provides an unbiased and comprehensive comparison of the methods.
As shown in Table 3, the TCNN-MGCHN model achieves the highest accuracy on the custom dataset, reaching an impressive 97.54%. This result is notably 11.93%, 10.22%, 5.74%, 5.23%, 5.32%, 1.96%, and 3.37% higher than the accuracy rates achieved by the DBN, Phase Variable, Bi-LSTM, CNN, ConvLSTM, DAFO, and LDA-PSO-LSTM models, respectively. This highlights the superior performance of the TCNN-MGCHN model in the context of IMU-based gait recognition, making it a significant advancement in this area. In conclusion, while the mainstream methods compared in this study have demonstrated solid performance in gait recognition, our work introduces a valuable contribution by providing a complementary approach that offers improved results. Specifically, 1) The TCNN-MGCHN model demonstrates the highest recognition accuracy, positioning it as an ideal method for driving future research and deep-learning modeling efforts related to IMU-based gait and sub-phase recognition. 2) The body partitioning strategy proposed in this study enhances the spatial feature extraction capability of the MGCM, which not only improves recognition accuracy but also holds potential for broader applications. Furthermore, this body partitioning strategy can be extended to other IMU-based GCN modeling works, enabling a more comprehensive understanding and analysis of human motion. Thus, this work sets a new benchmark in the field and paves the way for further research and innovation in the area of IMU-based gait recognition.
Comparison of model parameters
The average inference time per sample is approximately 13.25 ms, and the total parameter size of the proposed model is about 23.72 MB. While this is not the smallest among existing methods, it represents a relatively compact architecture with competitive recognition performance. As shown in Table 4, models specifically designed for lightweight deployment, such as SkeletonGait [36] (10.87 MB), NSVGT-ICBAM-FACN [37] (13.60 MB), and GPGait [38] (9.66 MB), typically have parameter sizes below 15 MB. In contrast, non-lightweight models generally require much larger storage, such as GaitRGA [39] with 76.37 MB and GaitGL [40] with 30.72 MB. The proposed model, with its 23.72 MB parameter size, is larger than the highly lightweight designs but remains considerably smaller than heavy-weight architectures like GaitRGA and GaitGL, thereby achieving a balance between model compactness and recognition capability, while achieving superior recognition performance (see Table 3 for accuracy comparisons).
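For reference, parameter storage sizes like those above follow directly from the parameter count; assuming 32-bit floats (4 bytes per parameter), a model of roughly 6.22 million parameters occupies about 23.7 MB, consistent with the reported 23.72 MB. The parameter count used below is an illustrative back-calculation, not a figure from the paper.

```python
def param_size_mb(num_params, bytes_per_param=4):
    """Convert a parameter count to storage size in MB, assuming 32-bit
    floats (4 bytes per parameter) and 1 MB = 1024 * 1024 bytes."""
    return num_params * bytes_per_param / (1024 ** 2)

# Hypothetical count chosen to match a ~23.7 MB float32 model
size = param_size_mb(6_220_000)
```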
Limitation
It is important to note that all participants in this study were healthy individuals. In real-world applications, gait patterns may be significantly altered in pathological populations, such as individuals with neurological diseases (e.g., Parkinson’s disease or stroke). These altered movement characteristics may challenge the model’s generalization ability, especially if it is trained solely on healthy data. To enhance robustness and applicability, future work should include subjects with gait impairments and consider incorporating domain adaptation or transfer learning techniques to improve performance in clinical scenarios. Moreover, although this study focuses solely on angular rate signals as the analytical physical quantity, we recognize that integrating linear acceleration may further improve recognition performance, particularly for non-cyclic or complex motion patterns. However, given the cyclic nature of gait and sub-phase movements, angular rate alone provided sufficient discriminatory power in our tasks. Future work will explore the fusion of angular and linear kinematic features to enhance model robustness and generalizability across a wider range of human activities.
Conclusion
This study designed a multi-task gait experiment and constructed IMU-based skeleton graph data. We also propose a multi-branch deep learning network, TCNN-MGCHN, for accurate gait and sub-phase recognition using these data. The model comprises two main components: TCM and MGCM. Initially, temporal feature representations of angular velocity signals are extracted through the TCM. Subsequently, the MGCM captures intrinsic spatial dependencies between skeleton nodes. Finally, the fused temporal and spatial features are fed into the classification module for gait and sub-phase prediction. The TCNN-MGCHN model’s classification accuracy and robustness are comprehensively evaluated using mathematical statistics and performance assessment, including overall performance, parameter analysis, and ablation studies. With an accuracy of 97.54%, the TCNN-MGCHN model outperforms seven mainstream methods. These results demonstrate the superior recognition performance and generalization capability of the proposed model. Also, the body partitioning strategy proposed in this paper can focus on crucial channels and skeleton nodes, which enhances the ability of the MGCM to mine discriminating information. Despite the robustness and superiority of our proposed model, some issues still need to be addressed. Firstly, our IMU-based skeleton graph data construction method is relatively simple and cannot learn interaction information between skeleton nodes from multiple perspectives. In future work, we will explore more sophisticated and principled methods for constructing IMU-based skeleton graph data. Secondly, we only consider four gait types and their sub-phases; recognition of additional gait types and sub-phases will be studied in future work.
Declaration
The study was conducted in accordance with the Declaration of Helsinki, and the experimental protocol was approved by the Human Ethical Review Committee of Zhejiang University of Technology (Approval No. 2023-314).
References
- 1. Zhou H, Wang D, Yu Y, Zhang Z. Research progress of human–computer interaction technology based on gesture recognition. Electronics. 2023;12(13):2805.
- 2. Zhen R, Song W, He Q, Cao J, Shi L, Luo J. Human-computer interaction system: a survey of talking-head generation. Electronics. 2023;12(1):218.
- 3. Liu B, Chen W, Wang Z, Pouriyeh S, Han M. RAdam-DA-NLSTM: a nested LSTM-based time series prediction method for human–computer intelligent systems. Electronics. 2023;12(14):3084.
- 4. Yang K, Kim M, Jung Y, Lee S. Hand gesture recognition using FSK radar sensors. Sensors (Basel). 2024;24(2):349. pmid:38257441
- 5. Khan HU, Ali F, Ghadi YY, Nazir S, Ullah I, Mohamed HG. Human–computer interaction and participation in software crowdsourcing. Electronics. 2023;12(4):934.
- 6. Lee H, Rosen J. Lower limb exoskeleton - energy optimization of bipedal walking with energy recycling - modeling and simulation. IEEE Robot Autom Lett. 2023;8(3):1579–86.
- 7. Li W, Lu W, Sha X, Xing H, Lou J, Sun H, et al. Wearable gait recognition systems based on MEMS pressure and inertial sensors: a review. IEEE Sensors J. 2022;22(2):1092–104.
- 8. Calafiore D, Negrini F, Tottoli N, Ferraro F, Ozyemisci-Taskiran O, de Sire A. Efficacy of robotic exoskeleton for gait rehabilitation in patients with subacute stroke: a systematic review. Eur J Phys Rehabil Med. 2022;58(1):1–8. pmid:34247470
- 9. Hutabarat Y, Owaki D, Hayashibe M. Recent advances in quantitative gait analysis using wearable sensors: a review. IEEE Sensors J. 2021;21(23):26470–87.
- 10. Guaitolini M, Petros FE, Prado A, Sabatini AM, Agrawal SK. Evaluating the accuracy of virtual reality trackers for computing spatiotemporal gait parameters. Sensors (Basel). 2021;21(10):3325. pmid:34064807
- 11. Zhou B, Wang H, Hu F, Feng N, Xi H, Zhang Z, et al. Accurate recognition of lower limb ambulation mode based on surface electromyography and motion data using machine learning. Comput Methods Programs Biomed. 2020;193:105486. pmid:32402846
- 12. Chen W, Lyu M, Ding X, Wang J, Zhang J. Electromyography-controlled lower extremity exoskeleton to provide wearers flexibility in walking. Biomedical Signal Processing and Control. 2023;79:104096.
- 13. Gurchiek RD, Donahue N, Fiorentino NM, McGinnis RS. Wearables-only analysis of muscle and joint mechanics: an EMG-driven approach. IEEE Trans Biomed Eng. 2022;69(2):580–9. pmid:34351852
- 14. Zhang J, Li W, Ogunbona PO, Wang P, Tang C. RGB-D-based action recognition datasets: a survey. Pattern Recognition. 2016;60:86–105.
- 15. Hynes A, Czarnuch S, Kirkland MC, Ploughman M. Spatiotemporal gait measurement with a side-view depth sensor using human joint proposals. IEEE J Biomed Health Inform. 2021;25(5):1758–69. pmid:32946402
- 16. Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems. 2014;27.
- 17. Rao H, Wang S, Hu X, Tan M, Guo Y, Cheng J, et al. A self-supervised gait encoding approach with locality-awareness for 3D skeleton based person re-identification. IEEE Trans Pattern Anal Mach Intell. 2022;44(10):6649–66. pmid:34181534
- 18. Frank E, Trigg L, Holmes G, Witten IH. Technical note: Naive Bayes for regression. Machine Learning. 2000;41(1):5–25.
- 19. Eddy SR. What is a hidden Markov model?. Nat Biotechnol. 2004;22(10):1315–6. pmid:15470472
- 20. Utgoff PE, Berkman NC, Clouse JA. Decision tree induction based on efficient tree restructuring. Machine Learning. 1997;29(1):5–44.
- 21. Hearst MA, Dumais ST, Osuna E, Platt J, Scholkopf B. Support vector machines. IEEE Intell Syst Their Appl. 1998;13(4):18–28.
- 22. Vu HTT, Cao H-L, Dong D, Verstraten T, Geeroms J, Vanderborght B. Comparison of machine learning and deep learning-based methods for locomotion mode recognition using a single inertial measurement unit. Front Neurorobot. 2022;16:923164. pmid:36524219
- 23. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint 2014.
- 24. Elman JL. Finding structure in time. Cognitive Science. 1990;14(2):179–211.
- 25. Zhao Y, Zhou S. Wearable device-based gait recognition using angle embedded gait dynamic images and a convolutional neural network. Sensors (Basel). 2017;17(3):478. pmid:28264503
- 26. Arshad MZ, Jung D, Park M, Shin H, Kim J, Mun K-R. Gait-based frailty assessment using image representation of IMU signals and deep CNN. Annu Int Conf IEEE Eng Med Biol Soc. 2021;2021:1874–9. pmid:34891653
- 27. Cai S, Chen D, Fan B, Du M, Bao G, Li G. Gait phases recognition based on lower limb sEMG signals using LDA-PSO-LSTM algorithm. Biomedical Signal Processing and Control. 2023;80:104272.
- 28. Zhang X, Zhang H, Hu J, Zheng J, Wang X, Deng J, et al. Gait pattern identification and phase estimation in continuous multilocomotion mode based on inertial measurement units. IEEE Sensors J. 2022;22(17):16952–62.
- 29. Chen C, Du Z, He L, Shi Y, Wang J, Dong W. A novel gait pattern recognition method based on LSTM-CNN for lower limb exoskeleton. J Bionic Eng. 2021;18(5):1059–72.
- 30. Moura Coelho R, Gouveia J, Botto MA, Krebs HI, Martins J. Real-time walking gait terrain classification from foot-mounted inertial measurement unit using convolutional long short-term memory neural network. Expert Systems with Applications. 2022;203:117306.
- 31. Hu F, Zhang L, Yang X, Zhang W-A. EEG-based driver fatigue detection using spatio-temporal fusion network with brain region partitioning strategy. IEEE Trans Intell Transport Syst. 2024;25(8):9618–30.
- 32. Niepert M, Ahmed M, Kutzkov K. Learning convolutional neural networks for graphs. In: International Conference on Machine Learning. PMLR; 2016. p. 2014–2023.
- 33. Yan S, Xiong Y, Lin D. Spatial temporal graph convolutional networks for skeleton-based action recognition. AAAI. 2018;32(1).
- 34. Zhang X, Lu D, Pan J, Shen J, Wu M, Hu X, et al. Fatigue detection with covariance manifolds of electroencephalography in transportation industry. IEEE Trans Ind Inf. 2021;17(5):3497–507.
- 35. Li H, Derrode S, Pieczynski W. An adaptive and on-line IMU-based locomotion activity classification method using a triplet Markov model. Neurocomputing. 2019;362:94–105.
- 36. Fan C, Hou S, Liang J, Shen C, Ma J, Jin D, et al. OpenGait: a comprehensive benchmark study for gait recognition toward better practicality. IEEE Trans Pattern Anal Mach Intell. 2025;47(10):8397–414. pmid:40460018
- 37. Li C, Wang B, Li Y, Liu B. A lightweight pathological gait recognition approach based on a new gait template in side-view and improved attention mechanism. Sensors (Basel). 2024;24(17):5574. pmid:39275485
- 38. Fu Y, Meng S, Hou S, Hu X, Huang Y. GPGait: generalized pose-based gait recognition. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). 2023. p. 19538–47. https://doi.org/10.1109/iccv51070.2023.01795
- 39. Liu J, Ke Y, Zhou T, Qiu Y, Wang C. GaitRGA: gait recognition based on relation-aware global attention. Sensors (Basel). 2025;25(8):2337. pmid:40285032
- 40. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint 2014.
- 41. Gorodkin J. Comparing two K-category assignments by a K-category correlation coefficient. Comput Biol Chem. 2004;28(5–6):367–74. pmid:15556477
- 42. Hassan MM, Uddin MdZ, Mohamed A, Almogren A. A robust human activity recognition system using smartphone sensors and deep learning. Future Generation Computer Systems. 2018;81:307–13.
- 43. Bartlett HL, Goldfarb M. A phase variable approach for IMU-based locomotion activity recognition. IEEE Trans Biomed Eng. 2018;65(6):1330–8. pmid:28910754
- 44. Turner A, Hayes S. The classification of minor gait alterations using wearable sensors and deep learning. IEEE Trans Biomed Eng. 2019;66(11):3136–45. pmid:30794506
- 45. Rohan A, Rabah M, Hosny T, Kim S-H. Human pose estimation-based real-time gait analysis using convolutional neural network. IEEE Access. 2020;8:191542–50.
- 46. Lu Y, Wang H, Qi Y, Xi H. Evaluation of classification performance in human lower limb jump phases of signal correlation information and LSTM models. Biomedical Signal Processing and Control. 2021;64:102279.
- 47. Liu K, Liu Y, Ji S, Gao C, Zhang S, Fu J. A novel gait phase recognition method based on DPF-LSTM-CNN using wearable inertial sensors. Sensors (Basel). 2023;23(13):5905. pmid:37447755
- 48. Wei H, Tong RK, Wang MY, Chen C. Gait phase detection based on LSTM-CRF for stair ambulation. IEEE Robot Autom Lett. 2023;8(9):6029–35.
- 49. Jung D, Lee C, Jeon HS. Multi-model gait-based KAM prediction system using LSTM-RNN and wearable devices. Applied Sciences. 2024;14(22):10721.
- 50. Jeon H, Lee D. Bi-directional long short-term memory-based gait phase recognition method robust to directional variations in subject’s gait progression using wearable inertial sensor. Sensors (Basel). 2024;24(4):1276. pmid:38400434