Abstract
Hand gesture recognition plays an important role in human–computer interaction, yet accurately modeling both spatial structure and temporal motion patterns in video-based vision systems remains challenging. Many existing approaches focus on either spatial appearance or motion information, which can limit their ability to capture the full complexity of dynamic hand gestures evolving over time. In this work, we present a unified feature representation framework that combines spatial descriptors modeled on the Symmetric Positive Definite (SPD) manifold with temporal motion features extracted from gesture video sequences using grid-based optical flow histograms in Euclidean space. Spatial covariance descriptors are mapped from the SPD manifold to a Euclidean space through the Log-Euclidean metric, enabling effective feature fusion while preserving intrinsic geometric properties. The resulting representation captures complementary spatial and temporal information in a compact and interpretable form. We evaluate the proposed framework on two publicly available video-based benchmark datasets for dynamic hand gesture recognition, the Cambridge Hand Gesture dataset and the Northwestern University Hand Gesture dataset. Experimental results demonstrate that the combined representation consistently improves classification performance compared to using spatial or temporal features alone, achieving 99.31% accuracy on the Cambridge dataset and 97.23% on the Northwestern dataset. These findings indicate that integrating manifold-aware spatial features with motion-based temporal cues provides a practical and effective solution for robust dynamic hand gesture recognition.
Citation: Bai Z, Snášel V, Mirjalili S, Vo B, Pan J-S, Kong L, et al. (2026) Video-based hand gesture recognition via SPD manifold spatial representation and optical flow motion features. PLoS One 21(5): e0348122. https://doi.org/10.1371/journal.pone.0348122
Editor: Ziyu Qi, University of Marburg: Philipps-Universitat Marburg, GERMANY
Received: January 26, 2026; Accepted: April 11, 2026; Published: May 14, 2026
Copyright: © 2026 Bai et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The datasets used in this study are publicly available. The data were obtained from a publicly accessible GitHub repository maintained by Ha0Tang, available at https://github.com/Ha0Tang/HandGestureRecognition. All data underlying the results presented in this study are fully available without restriction.
Funding: This scientific result is part of the CLARA project that has received funding from the European Union’s HORIZON EUROPE research and innovation programme under Grant Agreement No 101136607. The authors gratefully acknowledge financial support from ROBOPROX (No. CZ.02.01.01/00/22 008/0004590), provided by the Ministry of Education, Youth, and Sports, and from REFRESH (No. CZ.10.03.01/00/22_/0000048), provided by the European Union. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Hand gestures are a vital component of natural communication and a primary means of interaction for the hearing impaired. With advances in human-computer interaction and virtual reality technology, the field of gesture recognition has also developed rapidly. Gestures in general involve movements of the fingers, hands, arms, head, face, or body, and hand gestures in particular offer an intuitive way to convey information. Despite significant progress, recognizing hand gestures remains challenging due to their complexity and variability. Robust hand gesture recognition systems are essential given their wide range of applications, which include virtual reality systems [1], human-robot collaboration [2–4], sign language recognition [5,6], music conducting practice [7], and learning and teaching assistance [8].
Hand gesture recognition has attracted significant attention due to its wide range of applications and can be categorized into contact-based and vision-based systems. Contact-based approaches rely on physical devices (e.g., gloves or sensors), which may restrict movement and reduce user comfort. Vision-based systems, in contrast, allow more natural interaction and have become increasingly prominent, though achieving high accuracy remains challenging. In contemporary video-based recognition, methods generally rely on either traditional feature extraction or neural network–based feature learning. For example, support vector machine–based frameworks have been applied to general video classification tasks [9], while transformer-based architectures have shown strong potential for capturing complex visual interactions [10].
Following approaches used in general video-based recognition, video-based hand gesture recognition studies can be broadly categorized into traditional feature-based methods and neural network–based approaches. Traditional methods typically extract handcrafted spatial and temporal features from video frames; for example, Chen et al. [11] use real-time hand tracking with Fourier descriptors for spatial features and motion analysis for temporal dynamics, while Tang et al. [12] employ image entropy–based keyframe selection, followed by SURF and SIFT3D for spatial and temporal feature extraction. Neural network–based methods leverage deep architectures to jointly model spatial and temporal information, as in Feichtenhofer et al. [13], who integrate ConvNet streams spatially and temporally, and in Heidari et al. [14], who propose a 2DCNN-LSTM framework with improved keyframe extraction for dynamic gesture recognition.
However, many existing hand gesture recognition methods represent spatial features in Euclidean space, which may not explicitly preserve intrinsic geometric relationships when modeling covariance-based or region-level descriptors. In dynamic hand gesture classification, this limitation can affect the robustness of spatial representations, particularly in scenarios involving complex appearance variations across video frames. To address this issue, we adopt a Symmetric Positive Definite (SPD) manifold–based representation [15] to model spatial covariance features in a geometrically meaningful space. Using the Log-Euclidean metric, these manifold-based spatial descriptors are mapped into Euclidean space, enabling consistent integration with temporal motion features extracted via grid-based optical flow histograms. This representation facilitates coherent fusion of spatial and temporal information for dynamic hand gesture classification.
The main contributions of this paper are summarized as follows:
- Complementary Spatial and Temporal Feature Modeling: We present a unified framework that combines SPD manifold-based spatial covariance features with grid-based optical flow motion features, enabling the joint modeling of spatial structure and temporal dynamics in dynamic hand gesture recognition.
- Log-Euclidean Mapping for Feature Fusion: We employ the Log-Euclidean metric to map manifold-based spatial features into Euclidean space, allowing consistent integration with temporal motion descriptors within a unified representation.
- Evaluation on Benchmark Datasets: The proposed framework is evaluated on the Cambridge and Northwestern University hand gesture datasets, providing an empirical analysis of the effectiveness of combining spatial and temporal features for gesture classification.
The remainder of this paper is organized as follows. The Related work section reviews existing hand gesture recognition systems. The Methodology section presents our proposed framework, introducing the key components of our dynamic hand gesture recognition approach. Experiments and Analysis details the datasets, experimental setup, and performance evaluation. Finally, the Conclusion summarizes the main findings and outlines directions for future research.
Related work
Hand gesture recognition systems can be classified into contact-based and vision-based categories, depending on whether they require physical devices. In our paper, we primarily focus on vision-based systems, which can be further categorized into two primary approaches: model-based and feature representation-based.
Model-based methods in gesture recognition involve constructing explicit mathematical or computational models that simulate the physical or geometrical properties of gestures. These methods typically start by minimizing a cost function that is derived from image cues such as edges [16], segmented silhouettes [17], or patch-based errors [18]. The cost function guides the adjustment of parameters within the hand model, such as joint angles and positions. The goal of this iterative process is to align the model’s projection with the observed image, thereby reducing discrepancies indicated by the cost function. This iterative refinement improves the model’s ability to accurately represent and recognize the observed hand gesture.
Recent advancements have further refined these model-based methods. Saremi et al. [19,20] enhance particle swarm optimization (PSO) in the development of techniques for accurately modeling hand postures, addressing the limitations of conventional gradient-based methods and enabling robust exploration of the solution space, thereby ensuring more reliable hand posture estimation in diverse scenarios. Complementing this approach, Boukhayma et al. [21] introduce an end-to-end deep learning method for predicting 3D hand shape and pose from RGB images in real-world settings. Their network combines a deep convolutional encoder with a fixed model-based decoder, utilizing an articulated mesh deformation model and a re-projection module for accurate 3D hand reconstruction.
Feature representation-based methods aim to extract spatio-temporal features from images or video frames, capturing both the visual characteristics and the motion dynamics of gestures. Among conventional approaches, Holte et al. [22] propose a view-invariant algorithm that identifies motion primitives in 3D data using 3D optical flow and harmonic motion context, applying a probabilistic Edit Distance classifier for gesture classification. Liu et al. [23] introduce a method that employs Genetic Programming (GP) to optimize 3D sequence-processing operators. This approach randomly assembles low-level 3D operators into tree-based combinations, which evolve iteratively through the GP system. Shen et al. [24] propose an approach leveraging motion divergence fields, Maximum Stable Extremal Regions (MSER)-based region detection, and local motion pattern descriptors. Their method includes efficient indexing with Term Frequency-Inverse Document Frequency (TF-IDF), resulting in high performance on large-scale gesture databases.
Deep learning approaches have also been utilized within the realm of feature representation-based methods for extracting spatio-temporal features. Sarma et al. [25] proposed a two-stream network with a 3D convolutional neural network (C3D) for gesture videos and a 2D CNN for optical flow motion templates (OFMT). C3D captures spatio-temporal information, while OFMT enhances recognition by providing additional motion details and filtering out irrelevant gestures, resulting in improved accuracy through stream fusion. Hou et al. [26] introduce the spatial-temporal attention residual temporal convolutional network (STA-Res-TCN) designed for skeleton-based dynamic hand gesture recognition. This model employs an attention mechanism to selectively emphasize informative spatial-temporal features while filtering out noise, thereby improving recognition accuracy. Miah et al. [27] propose a method that utilizes a multi-branch architecture comprising two graph-based neural network channels and a general deep learning channel. The two graph-based branches capture spatial-temporal features, while the third general deep learning branch extracts additional features. These features are then concatenated and processed through a fully connected layer for final classification. Liu et al. [28] propose a novel model that uses a two-branch shallow CNN to extract spatial features, which are then passed into a long short-term memory (LSTM) layer to capture the temporal features. Sahoo et al. [29] introduced the Densely Connected Residual Channel Attention Module (DRCAM) network, which features a cascading structure of residual blocks combined with a multiscale channel attention module to effectively capture both low- and high-level information related to hand gestures, while the cascading structures are interconnected through dense connectivity for enhanced feature propagation. 
Additionally, various other neural network-based approaches have been proposed for spatio-temporal feature extraction and gesture recognition [30–35], contributing to advancements in this field.
To provide a more comprehensive overview of recent advances in vision-based hand gesture recognition, we summarize representative methods from 2020 to 2025 in Table 1. The table includes both classical deep learning approaches and more recent transformer- and graph-based methods, highlighting their datasets, model architectures, and key advantages and limitations.
Most existing hand gesture recognition methods represent spatial features in Euclidean space, which may not explicitly preserve intrinsic geometric relationships, especially for covariance-based or region-level descriptors. This limitation can affect robustness in dynamic scenarios with complex variations across frames. To address this gap, we map SPD-based spatial descriptors via the Log-Euclidean metric into Euclidean space and integrate them with temporal features, enabling coherent spatial-temporal fusion for dynamic hand gesture classification.
Methodology
Our methodology for hand gesture classification utilizes the Riemannian manifold properties of SPD matrices. Specifically, following the RieCovDs descriptor algorithm [42], we construct local feature vectors from each image, combining pixel coordinates, RGB values, and their spatial gradients. Using these features, we compute covariance matrices across image regions, forming SPD matrices that capture the spatial characteristics of hand gestures. By representing the hand gesture data on the SPD manifold, essential spatial relationships are preserved in a geometrically meaningful space. Furthermore, our approach combines these manifold-based features with conventional Euclidean motion features (grid-based optical flow histograms), so that the representation carries both spatial structure and temporal dynamics. This fusion of geometrically informed SPD features with conventional features is intended to characterize hand gestures more completely than either feature type alone.
To provide a high-level overview of our proposed approach, we present the overall architecture in Fig 1, which illustrates the main components and their interactions. In the following sections, we provide detailed descriptions of each component of the methodology.
The pipeline consists of four main components: (A) keyframe extraction from input video; (B) feature extraction, including regional spatial features via SPD covariance matrices and temporal features via optical flow histograms on a grid; (C) feature unification and vectorization, where SPD features are mapped to Euclidean space and concatenated with temporal features; and (D) classification using the combined feature representation.
SPD manifold construction
This section explores methods for generating SPD matrices in image analysis. It begins with the definition of the SPD manifold and then discusses conventional techniques for computing covariance matrices from image sets, focusing on pixel-wise correlations. It then introduces the Riemannian covariance descriptors (RieCovDs) [42] approach, which captures correlations between image regions using advanced Gaussian modeling and Riemannian geometry. This provides a more nuanced representation for improved feature extraction and classification.
SPD manifold overview.
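A real-valued matrix $M \in \mathbb{R}^{d \times d}$ is classified as symmetric positive definite (SPD) if it is symmetric and the quadratic form satisfies $x^\top M x > 0$ for every non-zero vector $x \in \mathbb{R}^{d}$. The set of all d × d SPD matrices forms a commutative Lie group with a manifold structure denoted as $\mathcal{S}_{++}^{d}$, defined in Eq (1):

$$\mathcal{S}_{++}^{d} = \left\{ M \in \mathbb{R}^{d \times d} : M = M^\top,\ x^\top M x > 0 \ \ \forall x \in \mathbb{R}^{d} \setminus \{0\} \right\} \tag{1}$$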
This manifold is non-Euclidean, indicating a curved geometric structure. The SPD manifold is particularly significant in image analysis because it preserves intrinsic geometric relationships among data points, making it more suitable for tasks like feature extraction and classification compared to traditional Euclidean space.
Conventional covariance image set description.
In the context of image analysis, the SPD manifold offers a powerful representation for capturing the statistical characteristics and structural information present within image data.
A common approach involves representing image sets using covariance matrices, which are themselves SPD matrices. Each covariance matrix captures the pairwise relationships between pixel intensities across the image set, serving as a compact summary of the spatial correlations. By positioning these covariance matrices as points on the SPD manifold, we leverage its unique geometric structure to facilitate effective feature extraction and classification. To compute the covariance matrix for a set of images $\{I_i\}_{i=1}^{N}$, where N is the number of images, we first calculate the covariance between specified pixel positions across all images. Then, we use these covariance values to construct the covariance matrix.
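To illustrate this calculation, let p and q denote two pixel positions within each image. The covariance of positions p and q can be expressed as:

$$C_{p,q} = \frac{1}{N-1} \sum_{i=1}^{N} \left( I_{i,p} - \bar{I}_{p} \right) \left( I_{i,q} - \bar{I}_{q} \right) \tag{2}$$

where $I_{i,p}$ and $I_{i,q}$ denote the pixel intensities at positions p and q, respectively, in the i-th image, and $\bar{I}_{p}$ and $\bar{I}_{q}$ are the means of the pixel intensities at positions p and q, respectively, across all images. The covariance matrix representation can be written as Eq 3:

$$C = \begin{bmatrix} C_{1,1} & \cdots & C_{1,d} \\ \vdots & \ddots & \vdots \\ C_{d,1} & \cdots & C_{d,d} \end{bmatrix} \tag{3}$$

where $C_{p,q}$ is computed by Eq 2 and d is the number of pixels in each image.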
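As an illustration of the pixel-wise description above, the following minimal sketch computes the covariance matrix of a small grayscale image set with NumPy. The function name `image_set_covariance`, the unbiased $N-1$ normalization, and the small ridge term added to keep the matrix strictly positive definite are assumptions of this sketch rather than details taken from the method.

```python
import numpy as np

def image_set_covariance(images, ridge=1e-6):
    """Pixel-wise covariance of an image set of shape (N, H, W).

    Returns a (d, d) matrix with d = H * W pixels per image.
    """
    X = np.asarray(images, dtype=np.float64)
    X = X.reshape(X.shape[0], -1)              # one row per image, d columns
    Xc = X - X.mean(axis=0, keepdims=True)     # subtract the per-pixel mean
    # Assumption: unbiased (N - 1) normalization.
    C = Xc.T @ Xc / (X.shape[0] - 1)
    # Assumption: a small ridge keeps C strictly positive definite when N <= d.
    return C + ridge * np.eye(C.shape[0])

rng = np.random.default_rng(0)
images = rng.random((20, 4, 4))                # 20 synthetic 4 x 4 images
C = image_set_covariance(images)
```

Because d grows with the number of pixels, this matrix becomes large for realistic image sizes, which is one practical motivation for the region-based description introduced next.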
Riemannian covariance image set description.
In conventional image analysis, covariance matrices are typically computed to assess pixel-level correlations. The Riemannian covariance descriptors (RieCovDs) method proposed in [42] departs from this convention. Instead of focusing solely on pixel-to-pixel correlations, RieCovDs captures correlations between distinct regions within images. This shift leads to a more discriminative representation of image sets, which is particularly advantageous for classification tasks. Below, we provide a detailed description of this method in Alg. 1.
Algorithm 1 RieCovDs Descriptor Algorithm
1: Input: A set of n images
2: Output: Riemannian covariance descriptors (RieCovDs) characterizing the image collection
3: Step 1: Region Partitioning:
4: Divide each image into D regions of uniform size.
5: Step 2: Feature Extraction:
6: Extract pixel-wise features from the designated regions for each image.
7: Step 3: Gaussian Modeling:
8: Model the feature vectors of each pair of regions (i-th and j-th) across the image collection using Gaussian distributions. This results in two sets of Gaussian models denoted as $\{g_i^k\}_{k=1}^{n}$ and $\{g_j^k\}_{k=1}^{n}$.
9: Step 4: Covariance Computation:
10: Calculate the covariance between the Gaussian models of $\{g_i^k\}_{k=1}^{n}$ and $\{g_j^k\}_{k=1}^{n}$ for each pair of regions (i-th and j-th).
11: Step 5: Covariance Matrix Generation:
12: Generate the resulting covariance matrix C using the formula provided in Eq 11.
Now, let’s delve into the fundamental concepts underlying this algorithm.
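Gaussian Model: The Gaussian model represents the distribution of local features within an image. Given a collection of N local features $\{f_i\}_{i=1}^{N}$ with $f_i \in \mathbb{R}^{m}$, their distribution is modeled using maximum likelihood estimation with a Gaussian distribution, as shown in Eq 4:

$$\mathcal{N}(f; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^{m} \, |\Sigma|}} \exp\!\left( -\frac{1}{2} (f - \mu)^\top \Sigma^{-1} (f - \mu) \right) \tag{4}$$

where $\mu$ denotes the mean vector and $\Sigma$ denotes the covariance matrix, calculated as $\mu = \frac{1}{N} \sum_{i=1}^{N} f_i$ and $\Sigma = \frac{1}{N} \sum_{i=1}^{N} (f_i - \mu)(f_i - \mu)^\top$, respectively.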
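The first three steps of Alg. 1 (region partitioning, feature extraction, and Gaussian modeling) can be sketched as follows. This is a minimal illustration only: the seven-channel feature layout (pixel coordinates, RGB values, and absolute gradients of the gray image), the use of `np.gradient`, the maximum-likelihood $1/N$ normalization, and the helper names `region_features` and `fit_gaussian` are all assumptions of the sketch.

```python
import numpy as np

def region_features(img, rows, cols):
    """Split an (H, W, 3) image into rows * cols regions.

    Returns one (n_pixels, 7) feature array per region, with columns
    x, y, R, G, B, |dI/dx|, |dI/dy| (an assumed feature layout).
    """
    img = np.asarray(img, dtype=np.float64)
    H, W, _ = img.shape
    gray = img.mean(axis=2)
    gy, gx = np.gradient(gray)                 # derivatives along rows, columns
    ys, xs = np.mgrid[0:H, 0:W]
    feats = np.dstack([xs, ys, img, np.abs(gx), np.abs(gy)])   # (H, W, 7)
    regions = []
    for band in np.array_split(np.arange(H), rows):
        for strip in np.array_split(np.arange(W), cols):
            block = feats[np.ix_(band, strip)]
            regions.append(block.reshape(-1, feats.shape[2]))
    return regions

def fit_gaussian(F):
    """Maximum-likelihood Gaussian (mean, covariance) of the rows of F."""
    mu = F.mean(axis=0)
    centered = F - mu
    sigma = centered.T @ centered / F.shape[0]  # 1/N: maximum-likelihood estimate
    return mu, sigma

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))                     # synthetic RGB image
regions = region_features(img, rows=2, cols=2)
mu, sigma = fit_gaussian(regions[0])
```

Each region of each image thus yields one Gaussian model, and collecting the models of a given region across the image set gives the sets used in Step 4 of Alg. 1.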
Covariance of Gaussian Models: To calculate the covariance between sets of Gaussian models, we first transform the problem into the task of computing Riemannian local difference vectors on the SPD manifold (RieLDV-S). This is achieved by embedding each Gaussian model into the SPD manifold, as expressed in Eq 5, whose scaling parameter ranges from 0 to 1. Faraki et al. [43] introduced a Riemannian local difference vector (RieLDV) formulation, which relies on the geodesic distance and the gradient of the geodesic distance function, as represented in Eq 6. In Eq 6, $S_i^k$ and $S_j^k$ are the SPD matrices obtained by applying Eq 5 to the Gaussian models of the i-th and j-th regions, $\{S_i^k\}_{k=1}^{n}$ and $\{S_j^k\}_{k=1}^{n}$ denote the two sets of embedding matrices, $\bar{S}_i$ denotes the expected value of $\{S_i^k\}_{k=1}^{n}$, and the column vector $v_i^k$ signifies the RieLDV of $S_i^k$ relative to $\bar{S}_i$.
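The computation of the RieLDV $v_i^k$ of an embedded matrix $S_i^k$ relative to the expected value $\bar{S}_i$ is elucidated in Eq 7, where $\delta(\cdot, \cdot)$ represents the geodesic distance on the curved manifold, and $\nabla_{\bar{S}_i} \delta^{2}$ stands for the gradient of the smooth distance function $\delta^{2}(S_i^k, \cdot)$ at point $\bar{S}_i$.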
The Log-Euclidean Metric (LEM) on the SPD manifold is employed, and the gradients essential for computing RieLDV-S through Eq 7 are detailed in Eq 8 and Eq 9.
The expected value of a set of SPD matrices is determined using the Fréchet mean with LEM divergence, as expressed in Eq 10.
Finally, the resulting covariance matrix C is computed from the individual covariances $C_{i,j}$, as defined in Eq 11, where i and j index the image regions. This approach provides a robust characterization of image collections, contributing to various image analysis tasks. Fig 2 illustrates the process of computing region-based covariance.
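Under the LEM, the Fréchet mean used for the expected value has a well-known closed form: the matrix exponential of the arithmetic mean of the matrix logarithms. The geodesic distance is likewise the Frobenius norm of the difference of the logarithms. The sketch below illustrates both with NumPy; the helper names are hypothetical, and the RieLDV computation itself (Eqs 6 to 9) is not reproduced here.

```python
import numpy as np

def spd_log(S):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, U = np.linalg.eigh(S)
    return (U * np.log(w)) @ U.T

def spd_exp(Y):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    w, U = np.linalg.eigh(Y)
    return (U * np.exp(w)) @ U.T

def log_euclidean_mean(mats):
    """Frechet mean of SPD matrices under the Log-Euclidean metric."""
    return spd_exp(np.mean([spd_log(S) for S in mats], axis=0))

def lem_distance(A, B):
    """Log-Euclidean geodesic distance between two SPD matrices."""
    return np.linalg.norm(spd_log(A) - spd_log(B), ord="fro")

A = np.diag([1.0, 4.0])
B = np.diag([4.0, 1.0])
M = log_euclidean_mean([A, B])   # equals 2 * I: the geometric, not arithmetic, mean
```

Note that the arithmetic mean of A and B would be 2.5 times the identity, whereas the Log-Euclidean mean is 2 times the identity, which shows how the manifold-aware mean differs from its Euclidean counterpart.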
Motion detection
This section consists of two parts. First, we employ common methods to select keyframes from a video, and then we discuss the method we use to extract motion features from these keyframes.
Keyframes extraction strategy.
In this section, we will introduce three keyframe extraction methods: interval-based sampling, cluster-based sampling, and optical flow-based sampling.
Interval-based sampling is a simple but effective method for extracting keyframes in video analysis. This technique involves selecting frames at fixed intervals throughout the video to ensure an even distribution of keyframes. The keyframe extraction formula is given by Eq 12:

$$KF_{idx}(k) = \left\lfloor k \cdot \frac{TF}{DKF} \right\rfloor, \quad k = 0, 1, \dots, DKF - 1 \tag{12}$$

where $KF_{idx}$ denotes the index of the keyframe in the original video, TF represents the total number of video frames, and DKF denotes the desired number of keyframes. Interval-based sampling is computationally efficient because keyframes are selected at fixed intervals, but it may miss important moments or major content changes because it treats all frames equally. It is therefore well suited to tasks that require only basic summarization of video content under limited computational resources.
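A minimal sketch of interval-based sampling is given below, assuming that Eq 12 takes the common form in which the k-th keyframe index is the floor of k times TF/DKF. The function name and the handling of edge cases (non-positive inputs, more keyframes requested than frames available) are assumptions of the sketch.

```python
import numpy as np

def interval_keyframes(total_frames, desired):
    """Indices of `desired` keyframes spread evenly over `total_frames`."""
    if total_frames <= 0 or desired <= 0:
        return np.array([], dtype=int)
    desired = min(desired, total_frames)       # cannot pick more frames than exist
    step = total_frames / desired
    return np.floor(np.arange(desired) * step).astype(int)

idx = interval_keyframes(100, 5)               # -> [0, 20, 40, 60, 80]
```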
Cluster-based sampling addresses the limitation of traditional methods, such as interval-based sampling, which may struggle to adequately summarize the variety of content in videos with a large number of frames. Cluster-based keyframe extraction consists of several steps. First, a feature vector $x_f$ is extracted from each frame f in the video. Subsequently, these feature vectors are clustered into K clusters using algorithms such as K-means. Each cluster $C_k$ contains frames with similar content features, and its assignment is determined by Eq 13:

$$C_k = \left\{ f : \| x_f - \mu_k \|^{2} \le \| x_f - \mu_j \|^{2}, \ \forall j \in \{1, \dots, K\} \right\} \tag{13}$$

where $\mu_j$ denotes the centroid of the cluster $C_j$, which is calculated as the average of the feature vectors within the cluster. The keyframes are selected according to Eq 14, where $f_k$ denotes the frame closest to each centroid $\mu_k$:

$$f_k = \arg\min_{f \in C_k} \| x_f - \mu_k \| \tag{14}$$
Cluster-based keyframe extraction offers content-sensitive summarization by considering dynamic changes in video content, ensuring selected keyframes represent diverse segments. However, it may be computationally complex, requiring careful tuning of clustering algorithms and parameters for optimal results. Yet, it excels in tasks demanding detailed representations of video content, such as scene segmentation and content-based video summarization, effectively capturing diverse content dynamics while minimizing redundancy.
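The cluster-based procedure can be sketched with a plain Lloyd-style K-means in NumPy, followed by selection of the frame nearest to each centroid. The per-frame feature vectors are taken as given; the initialization from randomly chosen frames, the iteration cap, and the function name are assumptions of this sketch, and a library K-means could be substituted.

```python
import numpy as np

def cluster_keyframes(features, k, n_iter=50, seed=0):
    """Return sorted indices of the frames closest to each K-means centroid."""
    X = np.asarray(features, dtype=np.float64)
    rng = np.random.default_rng(seed)
    # Assumption: centroids are initialized from k distinct frames.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)          # nearest-centroid assignment
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    keyframes = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size:                       # skip clusters that ended up empty
            keyframes.append(int(members[dists[members, j].argmin()]))
    return sorted(keyframes)

rng = np.random.default_rng(1)
# Synthetic 1-D features: three groups of ten frames each.
feats = np.concatenate([c + 0.1 * rng.standard_normal((10, 1)) for c in (0.0, 10.0, 20.0)])
kf = cluster_keyframes(feats, k=3)
```

Since K-means depends on its initialization, different seeds may return different keyframes for the same video.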
Optical flow-based sampling is also commonly employed to capture significant motion dynamics within video content. Optical flow vectors $OF_i$ are computed between consecutive frame pairs $(F_i, F_{i+1})$, representing the apparent motion of objects. For each frame pair, the mean magnitude of optical flow $MMOF_i$ is calculated as the average Euclidean norm of the optical flow vectors, as given by Eq 15:

$$MMOF_i = \frac{1}{H \, W} \sum_{x=1}^{W} \sum_{y=1}^{H} \left\| OF_i(x, y) \right\|_{2} \tag{15}$$

where H and W represent the height and width of the frames, respectively. Frame pairs are ranked by $MMOF_i$, and the top K pairs with the highest magnitudes are selected. Specifically, the first frame of each selected pair is chosen as the keyframe.
While optical flow-based keyframe selection effectively identifies frames with notable motion transitions, it may overlook prolonged movements occurring gradually over several frames if the magnitude of motion is small in each pair. Additionally, sensitivity to abrupt changes in optical flow between consecutive frames can lead to inconsistent representation of motion sequences. Despite these limitations, this method is well-suited for tasks emphasizing dynamic temporal transitions, such as surveillance video analysis and traffic monitoring, where capturing significant motion dynamics is crucial.
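The ranking step of optical flow-based sampling can be sketched as follows, assuming the dense flow fields between consecutive frames have already been computed and stacked into one array. The array layout and the function name are assumptions of the sketch.

```python
import numpy as np

def flow_keyframes(flows, k):
    """Select keyframes from flows of shape (T - 1, H, W, 2).

    flows[i] is the flow between frame i and frame i + 1. Returns the sorted
    indices of the first frame of the k pairs with the largest mean flow
    magnitude, together with the per-pair mean magnitudes.
    """
    flows = np.asarray(flows, dtype=np.float64)
    mmof = np.linalg.norm(flows, axis=-1).mean(axis=(1, 2))   # mean magnitude per pair
    top = np.argsort(mmof)[::-1][:k]
    return np.sort(top), mmof

flows = np.zeros((5, 4, 4, 2))
flows[1] = (3.0, 4.0)     # uniform motion of magnitude 5 between frames 1 and 2
flows[3] = (1.0, 0.0)     # uniform motion of magnitude 1 between frames 3 and 4
kf, mmof = flow_keyframes(flows, k=2)          # kf -> [1, 3]
```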
Optical flow feature.
This section details the Gunnar-Farnebäck Optical Flow method for estimating pixel displacements using polynomial expansion and image pyramids. It also covers the estimation of grid-based histograms to capture motion information by partitioning the image, binning motion directions, and constructing histograms for further analysis.
Optical Flow Estimation utilizes the Gunnar-Farnebäck method [44], which applies polynomial expansion to approximate the intensity value f(x) of a pixel at position x within its neighborhood. This approach involves fitting a quadratic model to the pixel intensities, expressed as

$$f(x) \approx x^\top A x + b^\top x + c \tag{16}$$

where the coefficients A (a symmetric matrix), b (a vector), and c (a scalar) are determined by the weighted least squares method to fit the grayscale values in the neighborhood. These coefficients capture the spatial relationships and intensity variations within the neighborhood. Assuming a theoretical case of pure translation, where no deformation or noise is present, the intensity value of a pixel at position x in the current frame N is identical to its intensity in the previous frame N − 1, after accounting for the displacement. This relationship is expressed as:

$$f_{N}(x) = f_{N-1}(x - d) \tag{17}$$

where d = (dx, dy) represents the translational displacement vector. Applying the quadratic polynomial model in Eq 16 to two consecutive frames and incorporating the translation assumption in Eq 17, equating coefficients gives $A_{N} = A_{N-1}$ and $b_{N} = b_{N-1} - 2 A_{N-1} d$, so the displacement vector can be derived as Eq 18:

$$d = -\frac{1}{2} A_{N-1}^{-1} \left( b_{N} - b_{N-1} \right) \tag{18}$$

The coefficients $A_{N-1}$ and $A_{N}$ are expected to be equal between two consecutive frames under ideal translation. However, in practice they differ, and the averaged approximation is employed, as shown in Eq 19:

$$A = \frac{A_{N-1} + A_{N}}{2} \tag{19}$$
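Although the displacement can be solved pointwise, the result is often too noisy. Therefore, Farnebäck [44] proposes a more robust approach: integrating information over a pixel neighborhood I and solving for the displacement under the assumption that it varies slowly. This method is described as follows:

$$e(d) = \sum_{\Delta x \in I} w(\Delta x) \left\| A(x + \Delta x) \, d - \Delta b(x + \Delta x) \right\|^{2}$$

where w represents the weight function assigned to individual pixels within the neighborhood, typically a Gaussian or similar weighting function, and A is the averaged coefficient matrix of Eq 19. The displacement d that minimizes e(d) is therefore obtained as Eq 20, with $\Delta b$ defined in Eq 21.

$$d = \left( \sum w \, A^\top A \right)^{-1} \sum w \, A^\top \Delta b \tag{20}$$

$$\Delta b = -\frac{1}{2} \left( b_{N} - b_{N-1} \right) \tag{21}$$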
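The displacement formulas can be checked numerically on a synthetic case. The sketch below builds the polynomial coefficients of a quadratic patch, shifts it by a known displacement, and recovers that displacement both pointwise and with the weighted neighborhood solution. It assumes the coefficients are already available (the polynomial expansion step itself is not implemented), and all function and variable names are hypothetical.

```python
import numpy as np

def pointwise_displacement(A_prev, b_prev, A_curr, b_curr):
    """Solve A d = delta_b with A averaged over the two frames."""
    A = 0.5 * (A_prev + A_curr)
    delta_b = -0.5 * (b_curr - b_prev)
    return np.linalg.solve(A, delta_b)

def neighbourhood_displacement(As, delta_bs, weights):
    """Weighted least-squares displacement over a neighbourhood.

    As: (n, 2, 2), delta_bs: (n, 2), weights: (n,).
    """
    G = np.einsum("n,nij,nik->jk", weights, As, As)       # sum of w * A^T A
    h = np.einsum("n,nij,ni->j", weights, As, delta_bs)   # sum of w * A^T delta_b
    return np.linalg.solve(G, h)

# Synthetic check: shifting f(x) = x^T A x + b^T x + c by d leaves A unchanged
# and changes b to b - 2 A d.
A1 = np.array([[2.0, 0.5], [0.5, 1.0]])
b1 = np.array([1.0, -2.0])
d_true = np.array([0.3, -0.7])
A2 = A1.copy()
b2 = b1 - 2.0 * A1 @ d_true

d_point = pointwise_displacement(A1, b1, A2, b2)
As = np.stack([A1, A1, A1])
delta_bs = np.stack([A1 @ d_true] * 3)
d_neigh = neighbourhood_displacement(As, delta_bs, np.array([1.0, 2.0, 1.0]))
```

In practice a library routine would normally be used instead, for example OpenCV's `calcOpticalFlowFarneback`, whose parameters include the number of pyramid levels and the neighborhood size; which implementation was used here is not stated.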
The basic algorithm is designed for small movements and struggles with large motions. To overcome this, the Farnebäck optical flow algorithm uses an image pyramid in its practical implementation. The pyramid comprises multiple levels, each at a progressively lower resolution than the one before. Estimation starts at the coarsest level, where large displacements between consecutive frames shrink to a few pixels, and the estimate is then propagated and refined level by level up to the full resolution. Increasing the number of pyramid levels allows the algorithm to accommodate larger displacements between frames, at the cost of additional computation.
Grid-based Histograms Estimation involves estimating motion information through a three-step process:
Step 1: Grid Partitioning To capture spatial coherence and reduce the dimensionality of motion information, the image plane is partitioned into a grid of non-overlapping cells. Each cell serves as a spatial unit for aggregating motion information within its boundaries. The size of the grid cells can be adjusted based on the application requirements and computational constraints. Larger grid cells may provide a broader overview of motion patterns but might overlook finer details, whereas smaller grid cells offer higher spatial resolution but increase computational complexity.
Step 2: Binning In this step, the direction of motion $\theta$ is quantized into a predefined number of bins B. Each bin covers an equal interval of $2\pi / B$. The quantization formula, as given by Eq 22, assigns each angle to its respective bin based on the angle’s value relative to the total range $[0, 2\pi)$:

$$\mathrm{bin}(\theta) = \left\lfloor \frac{\theta}{2\pi} \, B \right\rfloor \tag{22}$$
This process discretizes the continuous range of motion directions into discrete segments, facilitating histogram construction to capture the distribution of motion orientations.
Step 3: Histogram Construction After binning, histograms are created for each grid cell. These histograms depict the frequency distribution of motion vectors within the cell. Each bin in the histogram corresponds to a specific range of motion values, with the height of each bin indicating the frequency of occurrence of motion vectors falling within that range.
Fig 3 depicts this process: a grid-based histogram summarizes the spatial distribution of motion in a video sequence, which supports further analysis and interpretation of its dynamic changes.
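The three steps above can be sketched for a single dense flow field as follows. The grid size, the number of bins, the use of plain counts (rather than magnitude-weighted votes), and the function name are assumptions of this sketch.

```python
import numpy as np

def grid_flow_histograms(flow, grid=(4, 4), n_bins=8):
    """Orientation histograms of an (H, W, 2) flow field on a cell grid.

    Returns an array of shape (grid_rows, grid_cols, n_bins) with counts.
    """
    flow = np.asarray(flow, dtype=np.float64)
    H, W, _ = flow.shape
    # Direction of each flow vector, wrapped into [0, 2*pi).
    angle = np.arctan2(flow[..., 1], flow[..., 0]) % (2.0 * np.pi)
    # Quantize the direction; the final modulo guards against rounding to n_bins.
    bins = np.floor(angle / (2.0 * np.pi) * n_bins).astype(int) % n_bins
    # Non-overlapping grid cell index of every pixel row and column.
    row_id = np.arange(H) * grid[0] // H
    col_id = np.arange(W) * grid[1] // W
    hist = np.zeros((grid[0], grid[1], n_bins))
    np.add.at(hist, (row_id[:, None], col_id[None, :], bins), 1.0)
    return hist

flow_right = np.zeros((8, 8, 2))
flow_right[..., 0] = 1.0                       # uniform motion along +x
hist = grid_flow_histograms(flow_right, grid=(2, 2), n_bins=8)
```

For this uniform rightward flow, every pixel falls in bin 0, so each of the four cells holds 16 counts in its first bin.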
Feature fusion and classification
This section covers methods for mapping SPD matrices to Euclidean space, vectorizing features from tangent space and grid-based optical flow histograms, and combining these features for classification using an SVM.
Space mapping of SPD matrices.
The Riemannian manifold of SPD matrices has a complex, curved structure, making direct Euclidean computations inappropriate and often inaccurate. The tangent space at a point on this manifold provides a local Euclidean approximation, simplifying these computations.
For an SPD matrix S, the tangent space at S consists of all symmetric matrices that represent infinitesimal displacements from S. The Log-Euclidean Metric utilizes the matrix logarithm to map S to the tangent space at the identity matrix I, transforming the problem from the curved manifold to a flat Euclidean space.
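The matrix logarithm operation is central to this mapping. For an SPD matrix S, the logarithmic map $\log: \mathcal{S}_{++}^{d} \to \mathrm{Sym}(d)$, from the SPD manifold to the space of d × d symmetric matrices, is defined by Eq 23:

$$\log(S) = U \log(\Lambda) \, U^\top \tag{23}$$

where $S = U \Lambda U^\top$ is the eigenvalue decomposition of S, with U being the orthogonal matrix of eigenvectors and $\Lambda$ the diagonal matrix of eigenvalues. The matrix logarithm $\log(\Lambda)$ is computed by taking the natural logarithm of each eigenvalue in $\Lambda$.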
After the operation, the new matrix log(S) lies in the tangent space at the identity matrix I, which is in Euclidean space. Once mapped to the tangent space, SPD matrices can be processed using standard Euclidean methods and machine learning algorithms. This compatibility broadens the applicability and efficiency of existing techniques.
Feature vectorization.
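For the tangent space feature, points in the tangent space can be represented minimally by keeping only the independent coefficients of the symmetric matrices. Following Tuzel et al. [45], a vector operator is defined at the identity matrix I, as given by Eq 24:
$$\mathrm{vec}_{I}(Y) = \left[\, y_{1,1},\ \sqrt{2}\,y_{1,2},\ \sqrt{2}\,y_{1,3},\ \ldots,\ y_{2,2},\ \sqrt{2}\,y_{2,3},\ \ldots,\ y_{d,d} \,\right]^{T} \tag{24}$$
where $y_{i,j}$ represents the elements of the d × d symmetric matrix Y.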
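A minimal sketch of this vectorization, following the operator of Tuzel et al. [45]: the upper triangle of the symmetric matrix is stacked row by row, and off-diagonal entries are scaled by √2 so that the Euclidean norm of the vector equals the Frobenius norm of the matrix. The function name `vec_sym` is hypothetical.

```python
import numpy as np

def vec_sym(Y):
    """Stack the upper triangle of symmetric Y, scaling off-diagonals by sqrt(2)."""
    d = Y.shape[0]
    rows, cols = np.triu_indices(d)       # row-major order: (0,0), (0,1), ..., (d-1,d-1)
    scale = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return scale * Y[rows, cols]

Y = np.array([[1.0, 2.0],
              [2.0, 3.0]])
v = vec_sym(Y)                            # [1, 2*sqrt(2), 3]
```

For a d × d matrix the vector has d(d + 1)/2 entries, so an 11 × 11 descriptor yields a 66-dimensional vector.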
For the optical flow feature, after computing the grid-based histograms of optical flow for each frame, we vectorize these histograms to create a feature vector suitable for further analysis or machine learning applications.
The vectorization process involves flattening the histograms of each grid cell into a single one-dimensional array and then concatenating these arrays to form a comprehensive feature vector for each frame. Subsequently, the feature vectors from all frames are concatenated to represent the entire video sequence. Optionally, these feature vectors can be standardized to ensure uniform contribution of each feature.
Feature concatenation and classification.
Once both the optical flow and tangent space features have been vectorized, they are concatenated to form a single comprehensive feature vector. This combined feature vector encapsulates both motion and structural information from the video.
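The concatenation step, together with the optional standardization mentioned above, might look as follows. This is a sketch under stated assumptions: each video is already summarized by one SPD vector and one OFH vector, the array shapes are hypothetical, and the standardization statistics are taken from the training set only.

```python
import numpy as np

def fuse(spd_vecs, ofh_vecs):
    """Concatenate per-video SPD and OFH feature vectors column-wise."""
    return np.hstack([spd_vecs, ofh_vecs])

def standardize(train, test, eps=1e-12):
    """Z-score both sets using the mean and standard deviation of the training set."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    return (train - mean) / (std + eps), (test - mean) / (std + eps)

rng = np.random.default_rng(0)
# Hypothetical sizes: 6 training and 4 test videos, 66-D SPD and 128-D OFH vectors.
train = fuse(rng.normal(size=(6, 66)), rng.normal(size=(6, 128)))
test = fuse(rng.normal(size=(4, 66)), rng.normal(size=(4, 128)))
train_z, test_z = standardize(train, test)
```

Computing the statistics on the training split alone keeps information from the test split out of the classifier input.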
The concatenated feature vectors are then used to train a Support Vector Machine (SVM) classifier. The SVM is chosen for its effectiveness in high-dimensional spaces and its ability to handle the combined feature set.
Fig 4 illustrates the process of feature matrix space mapping, feature vectorization, feature fusion, and classification. In this process, the spatial features extracted from the Riemannian manifold and the temporal features extracted from the Euclidean space are fused and used for classification.
Spatial features modeled on the SPD manifold are mapped to Euclidean space using Log-Euclidean metrics and then fused with temporal motion features extracted from optical flow histograms. The combined feature vector is used for robust hand gesture classification.
Experiments and analysis
This section presents the experimental evaluation and analysis of the proposed methods. We first introduce the datasets and experimental settings, and then provide parameter studies and performance evaluations to validate the effectiveness of the approach.
Datasets and settings
We evaluate the effectiveness of the proposed method on two publicly available benchmark datasets: the Cambridge Hand Gesture Dataset [46] and the Northwestern University Hand Gesture Dataset [24]. The motions represented in these datasets are illustrated in Fig 5 and Fig 6. Both datasets are publicly accessible online via https://github.com/Ha0Tang/HandGestureRecognition.
The Cambridge Hand Gesture Dataset [46] contains 900 image sequences at a resolution of 320×240 pixels. It includes 9 gesture classes, formed by combinations of 3 hand shapes (flat, spread, V-shape) and 3 motions (leftward, rightward, contract), as illustrated in Fig 5. Each class contains variations under 5 illumination conditions and 10 sequences per gesture, recorded from 2 subjects using a fixed camera setup. These sequences provide spatially and temporally isolated gestures suitable for evaluating dynamic hand gesture recognition methods.
The Northwestern University Hand Gesture Dataset [24] contains 1,050 short video sequences of 10 dynamic hand gestures, including directional movements, rotations, circles, and symbolic gestures such as ‘Z’ and cross, as illustrated in Fig 6. The videos were recorded at a resolution of 640×480 pixels from 15 subjects, with 7 sequences per gesture category. Each sequence includes variations across seven hand postures: Fist, Fingers extended, ‘OK’, Index, Side Hand, Side Index, and Thumb. This dataset provides a diverse set of spatial and temporal patterns suitable for evaluating dynamic hand gesture recognition methods.
For both datasets, we used a 60:40 train-test split, where 60% of the data was used for training and 40% for testing. This split ensures a balanced evaluation of the proposed method by providing sufficient samples for training while retaining enough data for testing.
Parameter analysis
The proposed algorithm is implemented in Python. All experiments were conducted on a macOS machine with a 4-core 2.2 GHz Intel i7 CPU and 16 GB of RAM.
In this subsection, we will first determine the keyframe selection method used in this study. Next, we will describe the parameters utilized in our method and provide justification for their selection. Finally, we will discuss the parameter tuning process employed to optimize performance and avoid overfitting.
Keyframe selection method.
In our framework, two primary parameters must be determined: the keyframe extraction method and the number of frames to extract from the gesture sequence, as described in the section “Motion detection.”
Fig 7 shows the average performance of the test classification after running the experiment 20 times with different train-test split seeds. It is evident that the experimental performance of the “interval-based sampling method” for extracting keyframes outperforms the other two keyframe extraction methods on both datasets. This holds true whether the features extracted by the Grid-based Optical Flow Histogram (OFH) method are used alone for classification or in combination with the features extracted by the SPD matrices method.
Fig 7 also reveals the effect of changing the number of keyframes on classification accuracy using different feature extraction methods across the Northwestern and Cambridge datasets. In the Northwestern dataset, there is a consistent and notable improvement in classification accuracy with an increasing number of frames, whether using OFH features alone or combined with SPD features. Conversely, the Cambridge dataset exhibits more nuanced patterns: while classification based on OFH features initially benefits from additional frames, this advantage levels off or slightly diminishes with the interval-based sampling method. Moreover, the fusion of OFH and SPD features achieves peak classification performance with 6 frames; with more frames, the results fluctuate but consistently remain below the accuracy of the 6-frame configuration. This suggests that fewer frames are sufficient to distinguish actions in simpler datasets such as Cambridge, where the extra features extracted from additional frames may even reduce accuracy, particularly for the fusion of OFH and SPD features. Notably, interval-based sampling consistently outperforms the other methods in classification accuracy.
Besides accuracy, efficiency is also a crucial metric in evaluating these sampling methods. To provide a comprehensive assessment, we analyze their computational complexity. For cluster-based sampling, the primary computational task is K-Means clustering, which has a complexity of O(T × N × K × D), where T represents the number of iterations until convergence, N denotes the number of frames, K denotes the number of clusters, and D denotes the dimensionality of the feature space. Optical flow-based sampling involves calculating optical flow with a complexity of O(N × H × W), where N indicates the number of frames, and H and W denote the height and width of each frame, respectively. Interval-based sampling is the simplest, with a complexity of O(N), where N represents the number of frames. Theoretically, interval-based sampling is the most efficient. Empirically, for a hand gesture video with 61 frames, extracting 6 frames takes approximately 1.14 seconds for optical flow-based sampling, 0.12 seconds for cluster-based sampling, and 0.04 seconds for interval-based sampling. These results align with the theoretical complexities, indicating that interval-based sampling generally incurs the lowest computational cost, while optical flow-based and cluster-based methods are more resource-intensive, reflecting their respective complexities.
After analyzing and validating our experimental parameters, we selected the interval-based sampling method to extract keyframes, opting for 16 frames on the Northwestern dataset and 6 frames on the Cambridge dataset.
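To illustrate why interval-based sampling costs so little, a minimal sketch is given below. It assumes evenly spaced indices that include the first and last frame; the exact spacing rule used in the experiments is not restated here, and the function name `interval_keyframes` is hypothetical.

```python
import numpy as np

def interval_keyframes(n_frames, n_keyframes):
    """Indices of n_keyframes evenly spaced frames, first and last included."""
    if not 1 <= n_keyframes <= n_frames:
        raise ValueError("n_keyframes must be between 1 and n_frames")
    return np.linspace(0, n_frames - 1, n_keyframes).round().astype(int)

indices = interval_keyframes(61, 6)   # frames 0, 12, 24, 36, 48, 60
```

No pixel data is inspected when choosing the indices, in contrast to the clustering and optical flow variants.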
Parameter description and justification.
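To extract the local features required for RieCovDs, we generate an 11-dimensional local feature vector at each pixel of the color image I, as given by Eq 25:
$$\left[\, x,\ y,\ I_R(x,y),\ I_G(x,y),\ I_B(x,y),\ \frac{\partial I_R}{\partial x},\ \frac{\partial I_R}{\partial y},\ \frac{\partial I_G}{\partial x},\ \frac{\partial I_G}{\partial y},\ \frac{\partial I_B}{\partial x},\ \frac{\partial I_B}{\partial y} \,\right]^{T} \tag{25}$$
where x and y are the coordinates of a pixel. $I_R(x, y)$, $I_G(x, y)$, and $I_B(x, y)$ denote the color information at that pixel for the red, green, and blue channels, respectively. The gradients of the color information in the x and y directions are given by $\partial I_R/\partial x$ and $\partial I_R/\partial y$ for the red channel, $\partial I_G/\partial x$ and $\partial I_G/\partial y$ for the green channel, and $\partial I_B/\partial x$ and $\partial I_B/\partial y$ for the blue channel. These gradients represent the first-order changes in the color information along the x and y axes. Additionally, the parameter in Eq 5 is set to 0.7.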
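The 11-dimensional local feature map can be sketched as follows. The sketch assumes central-difference gradients (`np.gradient`) and a plain sample covariance per region; it does not reproduce the construction of Eq 5, and the function names `local_features` and `region_covariance` are hypothetical.

```python
import numpy as np

def local_features(image):
    """Per-pixel 11-D features: x, y, R, G, B and the x/y gradients of each channel.

    image: float array of shape (H, W, 3) in RGB order.
    Returns an array of shape (H, W, 11).
    """
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    feats = [xs, ys, image[..., 0], image[..., 1], image[..., 2]]
    for ch in range(3):
        # np.gradient returns derivatives along rows (y) first, then columns (x).
        d_dy, d_dx = np.gradient(image[..., ch])
        feats += [d_dx, d_dy]
    return np.stack(feats, axis=-1)

def region_covariance(features):
    """Sample covariance (11 x 11) of the feature vectors inside one region."""
    flat = features.reshape(-1, features.shape[-1])
    return np.cov(flat, rowvar=False)
```

A plain sample covariance is only guaranteed to be positive semi-definite, so a regularization step is typically needed before taking the matrix logarithm.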
The parameters employed in the experiment are detailed in Table 2. For the OFH algorithm, we utilized the cv2.calcOpticalFlowFarneback implementation [44], which integrates a pyramid scheme for enhanced motion estimation. In the SPD algorithm, parameters such as rx, ry, Wx, Wy, sx, and sy were employed to adjust the sliding window approach for extracting manifold features.
In the OFH method, the pyramid scale parameter (0.5) is crucial for balancing the capture of fine motion details with computational efficiency. A scale closer to 1 (less downsampling between pyramid layers) would increase computational cost with minimal improvement in accuracy, while a scale closer to 0 (more aggressive downsampling) could reduce the algorithm’s ability to capture small-scale motion details. The number of pyramid layers (L = 3) effectively captures motion at different scales. Three layers strike a good balance between capturing fine details and managing computational demands. The window size (W = 15) was selected to reduce noise while maintaining sensitivity to motion. Smaller windows resulted in noisier estimates, while larger windows caused over-smoothing, leading to a loss of crucial details. The grid sizes (Gx = 8, Gy = 8) were critical in determining the spatial resolution of the optical flow histograms. Properly chosen grid sizes ensured sufficient spatial detail without excessive computational overhead. Smaller grid sizes provided more detailed motion information but increased computational load and sensitivity to noise, while larger grid sizes reduced computational demands but might overlook finer motion details.
In the SPD method, the resizing dimensions (rx = 240, ry = 320) were selected to balance dimensionality reduction with the preservation of essential structural details, enhancing computational efficiency. The window sizes (Wx = 60, Wy = 80) were optimized to capture key features around each manifold point without incurring unnecessary computational costs. Smaller dimensions risked omitting relevant features, while larger ones could lead to redundancy. The step sizes for manifold sampling (sx = 30, sy = 40) were selected to ensure comprehensive image coverage while maintaining efficiency. Smaller steps might lead to redundant computations, whereas larger steps could miss significant features.
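As a quick check on these settings, the number of sliding windows per frame can be worked out directly. The sketch assumes that windows must fit entirely inside the resized image (no padding); the helper `window_origins` is hypothetical.

```python
def window_origins(length, window, step):
    """Start offsets of sliding windows that fit entirely inside `length`."""
    return list(range(0, length - window + 1, step))

xs = window_origins(240, 60, 30)   # rx, Wx, sx
ys = window_origins(320, 80, 40)   # ry, Wy, sy
print(len(xs), len(ys), len(xs) * len(ys))   # prints: 7 7 49
```

Because each step is half the window size, neighboring windows overlap by 50% along both axes, giving 49 regions per frame under this assumption.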
Parameter tuning and overfitting prevention.
To ensure the SVM classifier’s robustness and generalizability, we combined cross-validation with hyperparameter tuning. Specifically, we used 5-fold cross-validation to evaluate the model’s performance and reduce bias. The training dataset was divided into five subsets, with each subset serving as the validation set once while the remaining subsets were used for training. This approach provided a reliable estimate of the model’s overall performance.
In conjunction with cross-validation, hyperparameter tuning was performed using grid search to optimize model complexity and mitigate overfitting. We systematically explored various values for the regularization parameter C (0.1, 1, 10, 100) and experimented with different kernel types (linear, polynomial, RBF, sigmoid). Special attention was given to the gamma parameter, which significantly influences the model’s decision boundary. To balance model complexity and performance, gamma was evaluated using two specific formulations that are dependent on the feature dataset.
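Given the feature dataset $D = \{x_1, x_2, \ldots, x_n\}$ obtained after feature extraction, where n denotes the number of records and each record $x_i$ is represented as a feature vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{im})$ of dimension m, we evaluated gamma using two specific calculations, as given by Eq 26 and Eq 27:
$$\gamma = \frac{1}{m \cdot \mathrm{Var}(D)} \tag{26}$$
$$\gamma = \frac{1}{m} \tag{27}$$
where Var(D) represents the variance of the feature dataset:
$$\mathrm{Var}(D) = \frac{1}{n\,m} \sum_{i=1}^{n} \sum_{j=1}^{m} \left( x_{ij} - \bar{x} \right)^{2}$$
and $\bar{x}$ is the mean of all $x_{ij}$ values across the entire dataset.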
By combining grid search with 5-fold cross-validation, we determined that C = 10, gamma set to Eq 26, and the RBF kernel provided an optimal balance, ensuring the model was well-tuned to avoid both overfitting and underfitting.
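The data-dependent gamma values can be illustrated with a short sketch. It assumes the two candidates are 1/(m · Var(D)) and 1/m for an n × m feature matrix, which is our reading of the two formulations and coincides with the `gamma="scale"` and `gamma="auto"` options of scikit-learn's `SVC`; the function name `gamma_candidates` is hypothetical.

```python
import numpy as np

def gamma_candidates(X):
    """Two data-dependent RBF widths for a feature matrix X of shape (n, m)."""
    n, m = X.shape
    gamma_var = 1.0 / (m * X.var())   # variance taken over all n * m entries
    gamma_dim = 1.0 / m               # depends on the feature dimension only
    return gamma_var, gamma_dim

X = np.array([[0.0, 4.0],
              [4.0, 0.0]])            # overall mean 2, variance 4
gamma_var, gamma_dim = gamma_candidates(X)   # 0.125 and 0.5
```

The variance-scaled form shrinks gamma when the fused features have a large spread, which keeps the RBF kernel from becoming overly peaked on unnormalized inputs.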
Experiment evaluation
This section evaluates our classification model on the Cambridge and Northwestern datasets. We use four standard metrics: Accuracy, Precision, Recall, and F1 Score, defined in Eq 28–31. Our results demonstrate the competitive performance of the proposed approach compared to state-of-the-art methods.
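Evaluation on four metrics.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{28}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{29}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{30}$$
$$\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{31}$$
Here, TP, TN, FP, and FN denote True Positives, True Negatives, False Positives, and False Negatives, respectively.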
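For a multi-class problem, these quantities are usually read off a confusion matrix in a one-vs-rest fashion. The following sketch shows one way to do so; it assumes rows index the true class and columns the predicted class, and takes accuracy as the fraction of correctly classified samples. The function name `per_class_metrics` is hypothetical.

```python
import numpy as np

def per_class_metrics(conf):
    """Accuracy plus per-class precision, recall and F1 from a confusion matrix.

    conf[i, j] counts samples of true class i predicted as class j.
    """
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp        # predicted as the class but belonging elsewhere
    fn = conf.sum(axis=1) - tp        # belonging to the class but predicted elsewhere
    precision = tp / np.maximum(tp + fp, 1e-12)
    recall = tp / np.maximum(tp + fn, 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    accuracy = tp.sum() / conf.sum()
    return accuracy, precision, recall, f1

accuracy, precision, recall, f1 = per_class_metrics([[8, 2], [1, 9]])
```

Weighted averages, such as those reported alongside per-class scores, follow by weighting each class by its number of true samples.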
Classification performance.
The precision, recall, and F1-score results, reported in Table 3, further confirm the effectiveness of the proposed method across both datasets. For this evaluation, the training and testing split was fixed using a random seed of 43 to ensure reproducible results.
On the Northwestern dataset, most gesture classes achieve near-perfect performance. Minor deviations are observed in the “rotate down” class (recall of 0.951, F1-score of 0.975), as well as in “move right-down” and “clockwise circle,” reflecting slight variations in classification accuracy. The overall accuracy on the Northwestern dataset is 0.981, demonstrating that the model correctly classifies the vast majority of gestures.
Performance on the Cambridge dataset is even more consistent, with nearly all gesture classes attaining perfect scores. Only a slight reduction in recall is seen for “spread-rightward” (0.956). The overall accuracy on the Cambridge dataset is 0.994, indicating highly reliable classification across all gesture categories. The weighted average scores of 0.982 (Northwestern) and 0.994 (Cambridge) further underscore the robustness of the proposed method.
Ablation study.
To evaluate the contribution of each component in the proposed framework, we conduct an ablation study on SPD and OFH features, as shown in Table 4, comparing the performance of using SPD alone, OFH alone, and their combination. Each experiment was repeated 20 times, and the reported results correspond to the mean and standard deviation over these runs, providing a robust estimate of the model’s performance.
On the Cambridge dataset, the SPD feature alone achieves a high accuracy of 98.20%, indicating that spatial information is largely sufficient for this dataset. The OFH feature extracted from 6 keyframes also performs well, reaching 91.09%. By combining SPD and OFH features, the accuracy is further improved to 99.31%, demonstrating a complementary effect between spatial and temporal representations.
In contrast, on the Northwestern dataset, the SPD feature alone performs poorly, achieving only 47.95% accuracy. However, the OFH feature extracted from 16 keyframes significantly improves the performance to 96.75%, highlighting the importance of temporal motion information. The combination of SPD and OFH features further boosts the accuracy to 97.23%, achieving the best overall performance.
Error analysis and methodological justification.
To better understand the performance differences observed in the ablation experiments, we analyze the confusion matrices shown in Fig 8. Subfigures 8 (A) and (B) present the results of the combined SPD + OFH features on the Cambridge and Northwestern datasets, respectively, while subfigure 8 (C) shows the SPD-only results on the Northwestern dataset.
The combined features achieve strong performance, indicating that integrating spatial and temporal information captures complementary cues. Specifically, SPD features primarily capture spatial relationships, while OFH features encode temporal dynamics, enabling the model to distinguish directional motion patterns. In contrast, the SPD-only confusion matrix reveals systematic misclassification patterns (e.g., “move right” vs. “move left,” “rotate up” vs. “rotate down”), demonstrating that SPD features alone cannot capture temporal ordering due to their covariance-based representation, which ignores temporal information. This explains the poorer performance of SPD alone, particularly on the Northwestern dataset which relies heavily on temporal information.
Overall, this analysis provides insight into why the individual components perform differently and supports the methodological choice of integrating SPD and OFH features, highlighting their complementary roles and validating the design of our feature integration strategy.
Our evaluation on the Cambridge and Northwestern datasets achieved recognition accuracies of 99.31% ± 0.39% and 97.23% ± 0.54%, as detailed in Table 5 (Cambridge) and Table 6 (Northwestern). These results demonstrate that the proposed approach achieves strong and competitive performance compared to existing methods, and notably surpasses previously reported results on the Cambridge dataset. The observed performance gains can be attributed to the effective integration of spatial features modeled on the SPD manifold with complementary temporal motion features. To provide context, we briefly analyze the main ideas behind the compared algorithms and discuss their respective limitations.
Wong et al. [47] developed a method that constructs a motion gradient orientation image from raw video data, transforming it into a motion feature vector for classification using a sparse Bayesian classifier. Meanwhile, Niebles et al. [48] focused on spatial-temporal word representation derived from detected interest points, creating a sparse representation of video sequences; however, this approach may overlook broader spatial information. Kim et al. [46,49] expanded Canonical Correlation Analysis (CCA) to multiway data arrays for robust spatiotemporal pattern recognition, bypassing the need for explicit motion estimation. Liu et al. [23] utilized Genetic Programming to optimize 3D sequence-processing operators for gesture recognition, where accuracy is contingent upon the quality of the selected operators. Lui et al. [50,51] classified videos using distances on Grassmann manifolds, while Wong et al. [52] enhanced motion recognition through a generative model extending probabilistic latent semantic analysis (pLSA). Sanin et al. [53] utilized covariance descriptors and mapped them to Euclidean space for classification. Baraldi et al. [54] combined dense trajectories with segmentation techniques for feature extraction, while Zhao et al. [55] focused on keyframe selection and local motion feature extraction. Tang et al. [12] fused appearance features from SURF with motion features from SIFT 3D for classification. Shen et al. [24] introduced a method that uses motion divergence fields for hand motion recognition, detecting salient regions with Maximally Stable Extremal Regions and extracting local descriptors to capture motion patterns.
Despite significant advancements in hand gesture recognition, existing methods still exhibit certain limitations that can affect their performance. Many approaches focus primarily on either region-of-interest detection or feature extraction within Euclidean space or Riemannian manifolds, often emphasizing a single type of representation. Although both Euclidean and manifold-based features have demonstrated effectiveness in different contexts, their complementary integration has been relatively less explored in gesture recognition tasks.
In this work, we integrate features derived from both Euclidean space and Riemannian manifolds by leveraging the Log-Euclidean metric to map manifold-based features into Euclidean space. This unified representation enables effective fusion of spatial and temporal information, facilitating improved classification performance while maintaining computational efficiency.
Conclusion
In this paper, we present an innovative method to enhance hand gesture recognition by integrating spatial features represented on the SPD manifold with temporal features derived from optical flow in Euclidean space. Our method addresses limitations associated with Euclidean space representation, which may inadequately preserve intrinsic geometric relationships and consequently affect classification performance. By leveraging the SPD manifold for spatial feature representation and integrating it seamlessly with optical flow features, we achieved significant advancements in gesture recognition accuracy.
The evaluation of our approach on the Cambridge hand gesture dataset showed outstanding results, with an accuracy of 99.31% ± 0.39%, which is comparable to or exceeds previously reported traditional and neural network–based methods on this dataset. On the Northwestern hand gesture dataset, our approach achieved an accuracy of 97.23% ± 0.54%, remaining highly competitive with recently reported deep learning–based approaches. These results demonstrate the effectiveness of our framework in capturing both the spatial subtleties and temporal dynamics of hand gestures, while maintaining efficiency and interpretability.
Our method not only preserves the geometric properties of gesture data but also enhances the meaningfulness of feature representations, thereby improving classification performance. These findings highlight the potential of integrating manifold-based spatial features with traditional temporal features for robust hand gesture recognition across diverse applications, including human-computer interaction, virtual reality, and assistive technologies.
Moving forward, further exploration could focus on refining feature fusion techniques, scalability to larger datasets, and adapting our method to real-time applications. By continuing to refine and expand upon these findings, we seek to advance the development of gesture recognition technology and its practical applications.
References
- 1. Xu D. A Neural Network Approach for Hand Gesture Recognition in Virtual Reality Driving Training System of SPG. In: 18th International Conference on Pattern Recognition (ICPR’06). IEEE; 2006. p. 519–22.
- 2. Liu H, Wang L. Gesture recognition for human-robot collaboration: A review. Int J Indust Ergon. 2018;68:355–67.
- 3. Van den Bergh M, Carton D, De Nijs R, Mitsou N, Landsiedel C, Kuehnlenz K, et al. Real-time 3D hand gesture interaction with a robot for understanding directions from humans. In: 2011 RO-MAN. IEEE; 2011. p. 357–62.
- 4. Gao Q, Liu J, Ju Z. Hand gesture recognition using multimodal data fusion and multiscale parallel convolutional neural network for human–robot interaction. Expert Syst. 2020;38(5):e12490.
- 5. Chen Y, Zuo R, Wei F, Wu Y, Liu S, Mak B. Two-stream network for sign language recognition and translation. Adv Neural Inf Process Syst. 2022;35:17043-56.
- 6. Zhou H, Zhou W, Zhou Y, Li H. Spatial-Temporal Multi-Cue Network for Sign Language Recognition and Translation. IEEE Trans Multimedia. 2021;24:768–79.
- 7. Schramm R, Jung CR, Miranda ER. Dynamic Time Warping for Music Conducting Gestures Evaluation. IEEE Trans Multimedia. 2015;17(2):243–55.
- 8. Wang J, Liu T, Wang X. Human hand gesture recognition with convolutional neural networks for K-12 double-teachers instruction mode classroom. Infrared Phys Tech. 2020;111:103464.
- 9. Oyediran MO, Ajagbe SA, Ojo OS, Alshahrani R, Awodoye OO, Adigun MO. White shark optimizer via support vector machine for video-based gender classification system. Multimed Tools Appl. 2025;84(28):34645–61.
- 10. Taiwo G, Vadera S, Alameer A. Vision transformers for automated detection of pig interactions in groups. Smart Agr Technol. 2025;10:100774.
- 11. Chen F-S, Fu C-M, Huang C-L. Hand gesture recognition using a real-time tracking method and hidden Markov models. Image Vis Comput. 2003;21(8):745–58.
- 12. Tang H, Liu H, Xiao W, Sebe N. Fast and robust dynamic hand gesture recognition via key frames extraction and feature fusion. Neurocomputing. 2019;331:424–33.
- 13. Feichtenhofer C, Pinz A, Zisserman A. Convolutional Two-Stream Network Fusion for Video Action Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. p. 1933–41.
- 14. Heidari N, Norouzi J, Helfroush MS, Danyali H. Dynamic Hand Gesture Recognition with 2DCNN-LSTM and Improved Keyframe Extraction. In: 2024 14th International Conference on Computer and Knowledge Engineering (ICCKE). IEEE; 2024. p. 429–34.
- 15. Huang Z, Wang R, Shan S, Li X, Chen X. Log-euclidean metric learning on symmetric positive definite manifold with application to image set classification. In: International conference on machine learning. PMLR; 2015. p. 720–9.
- 16. Heap T, Hogg D. Towards 3D hand tracking using a deformable model. In: Proceedings of the Second International Conference on Automatic Face and Gesture Recognition. IEEE; 1996. p. 140–5.
- 17. Ouhaddi H, Horain P. 3D hand gesture tracking by model registration. In: Workshop on Synthetic-Natural Hybrid Coding and Three Dimensional Imaging. 1999. p. 70-3.
- 18. Sudderth EB, Mandel MI, Freeman WT, Willsky AS. Visual hand tracking using nonparametric belief propagation. In: 2004 Conference on Computer Vision and Pattern Recognition Workshop. IEEE; 2004. p. 189-9.
- 19. Saremi S, Mirjalili S. Optimisation Algorithms for Hand Posture Estimation. Springer; 2020.
- 20. Saremi S, Mirjalili S, Lewis A, Liew AWC, Dong JS. Enhanced multi-objective particle swarm optimisation for estimating hand postures. Knowl-Based Syst. 2018;158:175–95.
- 21. Boukhayma A, Bem RD, Torr PH. 3d hand shape and pose from images in the wild. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019. p. 10843–52.
- 22. Holte MB, Moeslund TB, Fihl P. View-invariant gesture recognition using 3D optical flow and harmonic motion context. Comput Vis Image Underst. 2010;114(12):1353–61.
- 23. Liu L, Shao L. Synthesis of spatio-temporal descriptors for dynamic hand gesture recognition using genetic programming. In: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG). IEEE; 2013. p. 1–7.
- 24. Shen X, Hua G, Williams L, Wu Y. Dynamic hand gesture recognition: An exemplar-based approach from motion divergence fields. Image Vis Comput. 2012;30(3):227–35.
- 25. Sarma D, Kavyasree V, Bhuyan MK. Two-stream fusion model for dynamic hand gesture recognition using 3d-cnn and 2d-cnn optical flow guided motion template. arXiv preprint arXiv:2007.08847. 2020.
- 26. Hou J, Wang G, Chen X, Xue JH, Zhu R, Yang H. Spatial-temporal attention res-TCN for skeleton-based dynamic hand gesture recognition. In: Proceedings of the European conference on computer vision (ECCV) workshops. 2018.
- 27. Miah ASM, Hasan MdAM, Shin J. Dynamic Hand Gesture Recognition Using Multi-Branch Attention Based Graph and General Deep Learning Model. IEEE Access. 2023;11:4703–16.
- 28. Liu H, Liu Z. A Multimodal Dynamic Hand Gesture Recognition Based on Radar–Vision Fusion. IEEE Trans Instrum Meas. 2023;72:1–15.
- 29. Sahoo JP, Sahoo SP, Ari S, Patra SK. Hand Gesture Recognition Using Densely Connected Deep Residual Network and Channel Attention Module for Mobile Robot Control. IEEE Trans Instrum Meas. 2023;72:1–11.
- 30. Deng Z, Leng Y, Hu J, Lin Z, Li X, Gao Q. SML: A Skeleton-based multi-feature learning method for sign language recognition. Knowl-Based Syst. 2024;301:112288.
- 31. Wang S, Zhang S, Zhang X, Geng Q. A two-branch hand gesture recognition approach combining atrous convolution and attention mechanism. Vis Comput. 2022;39(10):4487–500.
- 32. Alonazi M, Ansar H, Ai Mudawi NA, Alotaibi SS, Almujally NA, Alazeb A, et al. Smart Healthcare Hand Gesture Recognition Using CNN-Based Detector and Deep Belief Network. IEEE Access. 2023;11:84922–33.
- 33. Bhaumik G, Verma M, Govil MC, Vipparthi SK. HyFiNet: Hybrid feature attention network for hand gesture recognition. Multimed Tools Appl. 2023;82(4):4863–82.
- 34. Dahiya A, Katti R, Occhipinti LG. Real-Time Hand Gesture Classification Using Infrared Sensor Arrays-Based Wearable Bracelet and Efficient 1-D Convolutional Neural Network. IEEE Sens Lett. 2025;9(6):1–4.
- 35. Cui C, Sunar MS, Eg Su G. Deep vision-based real-time hand gesture recognition: a review. PeerJ Comput Sci. 2025;11:e2921. pmid:40989457
- 36. Gao Q, Chen Y, Ju Z, Liang Y. Dynamic Hand Gesture Recognition Based on 3D Hand Pose Estimation for Human–Robot Interaction. IEEE Sensors J. 2022;22(18):17421–30.
- 37. Vysocky A, Grushko S, Spurny T, Pastor R, Kot T. Generating Synthetic Depth Image Dataset for Industrial Applications of Hand Localization. IEEE Access. 2022;10:99734–44.
- 38. Zhou G, Cui Z, Qi J. FGDSNet: A Lightweight Hand Gesture Recognition Network for Human Robot Interaction. IEEE Robot Autom Lett. 2024;9(4):3076–83.
- 39. Rekik K, Gajjar N, Silva G, Müller R. Predictive intention recognition using deep learning for collaborative assembly. In: 2024 10th International Conference on Control, Decision and Information Technologies (CoDIT). IEEE; 2024. p. 1153–8.
- 40. Yu M, Gan A, Xue C, Yan G. DSTEN-CSLR: dual spatial–temporal enhancement network for continuous sign language recognition. Neural Comput Appl. 2025;37(19):13981–4004.
- 41. Hubert C, Odic N, Noel M, Gharib S, Zargarbashi SHH, Séoud L. MuViH: Multi-View Hand gesture dataset and recognition pipeline for human–robot interaction in a collaborative robotic finishing platform. Robot Comput-Integr Manuf. 2025;94:102957.
- 42. Chen K-X, Ren J-Y, Wu X-J, Kittler J. Covariance descriptors on a Gaussian manifold and their application to image set classification. Pattern Recognition. 2020;107:107463.
- 43. Faraki M, Harandi MT, Porikli F. A Comprehensive Look at Coding Techniques on Riemannian Manifolds. IEEE Trans Neural Netw Learn Syst. 2018;29(11):5701–12. pmid:29994290
- 44. Farnebäck G. Two-Frame Motion Estimation Based on Polynomial Expansion. Lecture Notes in Computer Science. Springer Berlin Heidelberg. 2003. p. 363–70.
- 45. Tuzel O, Porikli F, Meer P. Pedestrian detection via classification on Riemannian manifolds. IEEE Trans Pattern Anal Mach Intell. 2008;30(10):1713–27. pmid:18703826
- 46. Kim T-K, Wong S-F, Cipolla R. Tensor Canonical Correlation Analysis for Action Classification. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE; 2007. p. 1–8.
- 47. Wong SF, Cipolla R. Real-time Interpretation of Hand Motions using a Sparse Bayesian Classifier on Motion Gradient Orientation Images. BMVC; 2005. p. 170–9.
- 48. Niebles JC, Wang H, Fei-Fei L. Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. Int J Comput Vis. 2008;79(3):299–318.
- 49. Kim T-K, Cipolla R. Canonical correlation analysis of video volume tensors for action categorization and detection. IEEE Trans Pattern Anal Mach Intell. 2009;31(8):1415–28. pmid:19542576
- 50. Lui YM, Beveridge JR, Kirby M. Action classification on product manifolds. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE; 2010. p. 833–9.
- 51. Lui YM, Beveridge JR. Tangent bundle for human action recognition. In: 2011 International Conference on Automatic Face & Gesture Recognition (FG). IEEE; 2011. p. 97–102.
- 52. Wong S-F, Kim T-K, Cipolla R. Learning Motion Categories using both Semantic and Structural Information. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition. 2007. p. 1–6.
- 53. Sanin A, Sanderson C, Harandi MT, Lovell BC. Spatio-temporal covariance descriptors for action and gesture recognition. In: 2013 IEEE Workshop on Applications of Computer Vision (WACV). 2013. p. 103–10.
- 54. Baraldi L, Paci F, Serra G, Benini L, Cucchiara R. Gesture Recognition in Ego-centric Videos Using Dense Trajectories and Hand Segmentation. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2014. p. 688–93.
- 55. Zhao Z, Elgammal AM. Information Theoretic Key Frame Selection for Action Recognition. BMVC; 2008. p. 1–10.
- 56. Uke SN, Zade A. Optimal video processing and soft computing algorithms for human hand gesture recognition from real-time video. Multimed Tools Appl. 2023;83(17):50425–47.
- 57. Yu J, Qin M, Zhou S. Dynamic gesture recognition based on 2D convolutional neural network and feature fusion. Sci Rep. 2022;12(1):4345. pmid:35288612