Abstract
3D skeleton-based human activity recognition has gained significant attention due to its robustness against variations in background, lighting, and viewpoints. However, challenges remain in effectively capturing spatiotemporal dynamics and integrating complementary information from multiple data modalities, such as RGB video and skeletal data. To address these challenges, we propose a multimodal fusion framework that leverages optical flow-based key frame extraction, data augmentation techniques, and an innovative fusion of skeletal and RGB streams using self-attention and skeletal attention modules. The model employs a late fusion strategy to combine skeletal and RGB features, allowing for more effective capture of spatial and temporal dependencies. Extensive experiments on benchmark datasets, including NTU RGB+D, SYSU, and UTD-MHAD, demonstrate that our method outperforms existing models. This work not only enhances action recognition accuracy but also provides a robust foundation for future multimodal integration and real-time applications in diverse fields such as surveillance and healthcare.
Citation: Xie D, Zhang X, Gao X, Zhao H, Du D (2025) MAF-Net: A multimodal data fusion approach for human action recognition. PLoS ONE 20(4): e0319656. https://doi.org/10.1371/journal.pone.0319656
Editor: Yawen Lu, Purdue University, UNITED STATES OF AMERICA
Received: October 29, 2024; Accepted: February 5, 2025; Published: April 9, 2025
Copyright: © 2025 Xie et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The datasets used in this study are publicly available and can be accessed through the following links: The UTD-MHAD dataset, containing multimodal data for human action recognition, is available for download at https://www.utdallas.edu/~kehtar/UTD-MHAD.html. The NTU RGB+D dataset, a benchmark dataset with various modalities for action recognition, can be accessed via https://rose1.ntu.edu.sg/dataset/actionRecognition/. The SYSU dataset, focused on human-object interaction, is available at https://paperswithcode.com/dataset/sysu-mm01. These datasets are integral to the reproducibility and validation of the findings presented in this research.
Funding: AS: Swedish Research Council grant 2015-01835. https://vr.se AS, MM: Swedish Research Council for Health, Working Life and Welfare (FORTE) grant 2021-01646. https://forte.se, under the CHANSE ERA-NET Co-fund programme, which has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement 101004509. AS, MM: Swedish Research Council grant 2021-02769. https://vr.se The funders did not play any role in the study design, data collection and analysis, decision to publish, and preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
With the rapid development of Internet of Things (IoT) technology and the increasing demand for Human-Computer Interaction (HCI), action recognition technology plays a crucial role in these fields. Action recognition can be applied in various scenarios, such as surveillance, healthcare, and smart homes, and is also widely used in emerging fields like Augmented Reality (AR) and Virtual Reality (VR) [1–4]. In these applications, automated and accurate action recognition systems can significantly enhance the intelligence level of devices and improve user experience. Consequently, the efficient and accurate recognition and analysis of human actions has become a research hotspot in recent years [5–8]. Particularly, with the breakthroughs in deep learning technology, action recognition methods based on Deep Neural Networks (DNN) have made remarkable progress. Against this backdrop, researchers have begun to explore the combination of multimodal data sources, with the integration of RGB video modality and skeleton sequence modality showing great potential [9–13]. This combination not only improves the accuracy of action recognition but also achieves better robustness and generalization in complex scenarios.
In action recognition, RGB video and skeleton sequence modalities provide complementary information that enhances the accuracy of human motion analysis [14–16]. RGB video captures rich spatial information, including the trajectory of human movements and interactions with objects in the environment. For instance, in a grasping action, the RGB modality can reveal the motion trajectory of the hand, the shape of the object, and the relative positioning between the hand and the object, offering crucial contextual cues for action understanding. In contrast, the skeleton sequence modality represents human poses and limb movements through keypoints, primarily focusing on temporal motion patterns [17,18]. It excels at tracking the trajectory of body parts but lacks the spatial context needed to accurately interpret human-object interactions. Therefore, while RGB video provides detailed visual information, the skeleton sequence emphasizes the geometric structure of human motion, making these two modalities highly complementary for action recognition tasks.
Existing action recognition methods that fuse RGB video and skeleton sequence modalities face several notable challenges. Firstly, processing RGB video typically relies on computationally intensive 3D Convolutional Neural Networks (3D CNNs) or hierarchical 2D Convolutional Neural Networks (2D CNNs) [19–22]. Although these methods effectively capture spatiotemporal information within video sequences, their high computational cost makes them less suitable for resource-constrained or real-time applications. Furthermore, the efficiency of multimodal information fusion has become a bottleneck, with many approaches increasing model complexity and inference time as they aim to improve recognition accuracy, thus limiting the feasibility of real-world applications [23–26]. To address these issues, some approaches have adopted the strategy of extracting single-frame RGB images from videos, which are processed by 2D convolutional networks to reduce computational overhead. RGB frames can provide robust spatiotemporal information, particularly in scenarios where human-object interactions (such as with bottles or caps) persist throughout the video, making it effective to extract spatial information from intermediate frames [27–29]. Compared to full RGB video processing, using 2D convolution-based approaches significantly reduces network complexity while retaining most of the critical information.
Replacing RGB video with individual frames can effectively reduce computational complexity, but inevitably leads to a loss of temporal information from the video stream. This information loss hinders the ability to share consistent feature representations between the RGB and skeleton streams, ultimately impacting recognition accuracy. To address this issue, this paper proposes an innovative network architecture called MAF-Net, which aims to optimize the fusion strategy between RGB video and skeleton sequences, reducing computational complexity while enhancing the efficiency and accuracy of action recognition. Compared to traditional cascade or weighted-sum fusion methods [65, 70] (as shown in Fig 1a), this method introduces data augmentation for RGB frame images and early cross-modal feature fusion (as shown in Fig 1b). In the early fusion stage, the method generates an attention mask from the projection of the skeleton sequence onto the RGB frames, guiding the RGB network to focus on high-information regions related to limb movement, thereby improving the effectiveness of feature extraction. Additionally, a self-attention mechanism is integrated into the RGB frame network to suppress background noise and further enhance the extraction of relevant information. In the final fusion phase, a cross-modal attention integration module is utilized to seamlessly integrate skeletal data with RGB data, enabling comprehensive modal fusion and enhancing the overall action recognition performance. This two-stage fusion strategy significantly enhances the sharing of multimodal features and recognition accuracy while maintaining manageable model complexity.
To validate the effectiveness of MAF-Net, experiments were conducted on three public datasets: NTU RGB+D, SYSU, and UTD-MHAD. The experimental results demonstrate that MAF-Net significantly outperforms existing multimodal action recognition methods, particularly showcasing notable advantages in computational efficiency and model complexity. These findings underscore the potential of MAF-Net in the field of action recognition, especially in scenarios with high real-time requirements such as IoT and human-computer interaction, where MAF-Net provides a more efficient and accurate solution.
The three main contributions of the paper can be summarized as follows:
1. An innovative multimodal fusion framework, MAF-Net: We propose a multimodal fusion framework, MAF-Net, which combines multi-frame RGB images and skeleton sequences, significantly reducing computational complexity while preserving key spatiotemporal information. By leveraging self-attention and cross-attention mechanisms, the framework enhances the correlation between the RGB video modality and the skeleton sequence modality, achieving a well-balanced trade-off between performance and computational efficiency.
2. Introduced a skeleton-guided attention mechanism: We innovatively introduce a skeleton-guided attention mechanism that enables the RGB stream to focus on interaction regions between the human body and objects, compensating for the lack of temporal information in RGB images. By generating a skeleton attention mask based on a Gaussian distribution, the mechanism guides the RGB feature extraction process, effectively capturing crucial details and interaction information in the actions.
3. Extensive experimental validation: We conduct extensive experiments on three public datasets, NTU RGB+D, SYSU, and UTD-MHAD, to validate the effectiveness and superior performance of MAF-Net.
Related work
3D skeleton-based action recognition
In recent years, 3D skeleton-based action recognition has emerged as a prominent research direction due to its robustness against external conditions such as background, lighting, and viewpoint variations. Methods in this domain are primarily categorized into three major approaches: those based on Recurrent Neural Networks [30–32], Convolutional Neural Networks (CNN) [33–35], and Graph Convolutional Networks (GCN) [36–38]. Each of these methods leverages the structural and temporal characteristics of skeletal data to enhance recognition performance, offering distinct advantages in handling the inherent challenges of motion data analysis.
RNN-based methods.
Recurrent Neural Network (RNN)-based methods have been widely used in 3D skeleton-based action recognition due to their ability to model temporal sequences. By using recursive connections, RNNs process sequential data effectively, as demonstrated by standard RNNs, Long Short-Term Memory (LSTM) [39], and Gated Recurrent Units (GRUs) [40–42], which address issues such as vanishing gradients and long-term dependency modeling. Despite their effectiveness in capturing temporal dynamics, RNN-based methods often struggle with spatial modeling, limiting their performance compared to other approaches [43–45]. To overcome these limitations, researchers have proposed modifications like two-stream RNN architectures that model both spatial configurations and temporal dynamics simultaneously [46]. Furthermore, attention mechanisms have been integrated into RNNs to enhance spatial-temporal modeling by focusing on informative joints.
In response to challenges in standard RNNs, including gradient explosion and vanishing, new architectures like the Independently Recurrent Neural Network (IndRNN) [47] were developed, which allow for more robust and deeper RNNs, improving the handling of long sequences. Additionally, spatial transformations and attention mechanisms were incorporated to focus on critical joints and mitigate noise in skeleton data [48,49]. These innovations have expanded the applicability of RNNs to complex tasks, although issues with spatial modeling remain, prompting ongoing research into enhancing RNN architectures.
CNN-based methods.
CNN-based methods for skeleton-based action recognition typically transform 3D skeleton sequences into pseudo-images, where spatial and temporal information is encoded into 2D formats. This allows convolutional neural networks (CNNs) to extract features from skeleton data similarly to how they process regular images. A key challenge with this approach is capturing both spatial relationships between joints and temporal dynamics effectively. Several researchers have addressed this issue, such as Wang et al. [50], who introduced Joint Trajectory Maps (JTM) to represent joint movements as texture images, and Li et al. [51], who used translation-scale invariant mapping to combine spatial and temporal data more effectively.
However, basic CNN models may struggle to capture complex co-occurrence relationships among joints, as they typically focus on local interactions within a limited convolutional kernel. To mitigate this, Chao et al. [52] proposed a hierarchical method to progressively aggregate contextual information, enhancing feature learning at different levels. Other enhancements include the use of Temporal CNN (TCN) [53] for modeling spatio-temporal cues and the introduction of multi-stream CNN architectures that improve the representation of skeleton data for action recognition.
GCN-based methods.
In GCN-based methods for 3D skeleton-based action recognition, the human skeleton is modeled as a graph where joints represent the nodes, and bones or their temporal connections form the edges [54,55]. The adoption of Graph Convolutional Networks (GCNs) effectively captures spatial-temporal relationships between these joints. The Spatial-Temporal Graph Convolutional Network (ST-GCN) [56] significantly impacted this field by constructing a spatial-temporal graph that leverages the natural graph-like structure of the human skeleton. This model encodes both spatial and temporal dependencies between joints, leading to notable advancements in performance on action recognition tasks.
Recent improvements focus on optimizing the GCN-based framework for better spatial-temporal representation. For instance, the Action-Structural Graph Convolutional Network (AS-GCN) introduces multi-task learning to predict future poses, while 2s-AGCN adapts the graph structure during training for dynamic connections between joints [57]. These methods address challenges such as how to effectively represent the skeleton's inherent graph structure and the need for adaptability to noisy or incomplete data [58]. Despite these advancements, key challenges remain in further optimizing GCNs to fully capture the temporal and spatial nuances of skeleton-based action recognition.
Multimodal action recognition
Recent advances in multimodal fusion have significantly contributed to the field of action recognition [59,60]. By integrating multiple sensory modalities, such as video, RGB images, skeletal data, audio, and text, researchers have substantially improved both the accuracy and robustness of action recognition systems. For instance, Zhou et al. [61] proposed a dance motion detection system based on multimodal fusion, which combines multi-feature video analysis with Computer-Aided Design (CAD) techniques to achieve more precise action recognition. This system enhances the perception of complex actions through the collection and processing of multimodal data. Similarly, Tran et al. [62] developed a unified multimodal consistency framework aimed at addressing human activity recognition in videos. By maintaining consistency across different modalities, this framework significantly improves action localization and group activity recognition, demonstrating high accuracy and robustness in experimental results.
Other studies have also highlighted the broad applicability and potential of multimodal fusion techniques. Rehman et al. [63] proposed an algorithm that integrates RGB imaging, skeletal tracking, and pose estimation, demonstrating significant performance improvements on the UTD multimodal human action dataset. Liu et al. [64] addressed limitations in the multimodal fusion process by introducing an adaptive multimodal graph representation fusion method, which enhances action recognition accuracy by combining skeletal data with motion trajectories. Additionally, He et al. [65] explored the application of multimodal fusion for in-vehicle human action recognition, comparing early and late fusion strategies, and revealed the significant impact of different fusion strategies on overall recognition performance. Future research will continue to investigate optimization of multimodal fusion techniques to tackle the challenges posed by complex scenarios and action recognition tasks.
Transformer-based methods
Transformer-based methods have gained significant attention in the field of 3D skeleton-based action recognition due to their ability to capture long-range dependencies and global relationships, especially using the multi-head self-attention (MSA) mechanism [66–68]. These methods demonstrate superior performance in processing sequences by aggregating spatial-temporal data through attention-based approaches. Notable architectures, such as the Self-Attention Network (SAN) and the Spatial-Temporal Transformer Network (ST-TR) [69], have introduced novel ways to model the spatial-temporal correlations and dependencies between joints, improving the accuracy of action recognition models. However, the main challenge remains in effectively capturing high-dimensional semantic information and modeling intricate spatial relationships within skeleton data.
Hybrid models combining Transformers with GCNs or CNNs have also been explored to leverage the strengths of each architecture, allowing a more comprehensive framework for 3D skeleton-based tasks [70,71]. Such models benefit from Transformers' global relational modeling abilities while mitigating their limitations in spatial encoding. For instance, the decoupled spatial-temporal attention network [72] and TemPose [73] focus on improving temporal and spatial decoupling, demonstrating strong potential in handling action sequences of varying lengths. These approaches collectively enhance both computational efficiency and recognition accuracy, showcasing the growing dominance of Transformers in this domain.
Methods
In this section, we describe the comprehensive approach adopted for data preparation, feature extraction, and model design in our multimodal human activity recognition system, as illustrated in Fig 2. The proposed method focuses on leveraging both RGB and skeletal data to enhance recognition accuracy and robustness. By incorporating optical flow for key frame extraction, skeletal attention mechanisms, and feature fusion techniques, we aim to address the challenges of capturing temporal and spatial dynamics in complex action sequences. Furthermore, we detail the data augmentation strategies, feature selection processes, and the overall model architecture, including the fusion of RGB and skeletal modalities, to optimize the performance of the recognition task.
Data preprocess
This section provides a detailed explanation of the key steps involved in preprocessing the RGB data for activity recognition using a sequence modeling approach. Proper data preparation plays a vital role in the overall process, as it is essential for extracting meaningful features from the raw data. These features, when effectively extracted, significantly contribute to the success of the sequence model by improving its ability to recognize patterns and ensure optimal performance during activity recognition tasks.
Key frame extraction and uniform sampling based on optical flow.
First, the average motion intensity of each frame is calculated using the optical flow method. Specifically, for each frame and the previous frame, the optical flow vectors $u_i$ and $v_i$ are computed for each pixel $i$, representing the motion components in the $x$ and $y$ directions, respectively. Then, the average motion intensity for the entire frame is calculated using the following formula:

$$M = \frac{1}{N} \sum_{i=1}^{N} \sqrt{u_i^2 + v_i^2}$$

where $N$ is the total number of pixels in the frame, and $\sqrt{u_i^2 + v_i^2}$ represents the motion intensity of each pixel. Through this process, the motion intensity of each frame can be obtained, indicating the level of activity change in the video for each frame.
Next, based on the calculated motion intensity, all frames are ranked, with frames exhibiting higher motion intensity selected as candidate key frames. These frames represent portions of the video with significant activity changes. To avoid concentrating key frames within specific time intervals, frames are uniformly selected from the sorted candidate frames based on a sampling interval. The sampling interval is computed using the following formula:

$$s = \left\lfloor \frac{T}{K} \right\rfloor$$

where $T$ refers to the total number of frames in the video, and $K$ is the desired number of key frames to be extracted. This sampling interval ensures that the key frames are reasonably distributed across the video timeline, thereby covering the major activity changes throughout the video.
Data augmentation.
To enhance the robustness and adaptability of action recognition systems, we propose an improved skeletal data augmentation method, with a focus on optimizing viewpoint expansion and noise handling. For viewpoint expansion, we build upon existing rotation matrix-based augmentation techniques by incorporating additional angular variations, such as $180^\circ$ rotations, and introducing multi-axis simultaneous rotations, thereby extending the range of perspectives covered. This improvement enables the model to better handle multi-view inputs in complex environments, increasing the accuracy of action recognition. Furthermore, to enhance model stability in real-world applications, we introduce random noise into the skeletal data by applying small perturbations to joint coordinates, simulating sensor errors or jitter during data collection. This approach improves the model's tolerance to imprecise data and noise, ensuring robust performance under varying noise conditions. The combination of viewpoint expansion and random noise significantly increases the diversity and complexity of skeletal data, thereby improving the system's recognition performance under multi-view and noisy conditions, and enhancing its robustness and generalization capabilities in complex real-world scenarios.
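As an illustration of the augmentation just described, the sketch below applies composed rotation matrices and Gaussian jitter to an array of 3D joints. The function names and the noise level `sigma` are illustrative choices, not taken from the paper.

```python
import numpy as np

def rotate_skeleton(joints, rx=0.0, ry=0.0, rz=0.0):
    """Rotate an (N, 3) array of joint coordinates about the x, y, z axes.

    Composing all three matrices supports the multi-axis simultaneous
    rotations used for viewpoint expansion (angles in radians).
    """
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return joints @ (Rz @ Ry @ Rx).T

def jitter_skeleton(joints, sigma=0.01, rng=None):
    """Add small Gaussian perturbations to joint coordinates, simulating
    sensor error or jitter during data collection."""
    rng = np.random.default_rng(rng)
    return joints + rng.normal(0.0, sigma, size=joints.shape)
```

A $180^\circ$ rotation about the z-axis, for instance, maps the joint $(1, 0, 0)$ to $(-1, 0, 0)$, which is a quick sanity check for the composed matrices.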
In the selected RGB frames, since the human body occupies only a small portion of the image, we propose a projection-based cropping method for image preprocessing. This method utilizes skeleton data to define the bounding box of the human body, and subsequently crops the image based on this bounding box. Assuming the bounding box has a width of $w$ and a height of $h$, the image is cropped into a region with dimensions of $w + \Delta w$ and $h + \Delta h$, where $\Delta w$ and $\Delta h$ are random values between 100 and 300 pixels, effectively augmenting the dataset. Specifically, 3D skeleton coordinates are projected onto RGB frames using camera parameters to calculate the 2D pixel coordinates of the human body, which are then used to determine the cropping region. Unlike traditional random cropping centered on the image, this method focuses on the human body, aligning the skeleton data with image coordinates, thereby forming part of the action recognition pipeline. Its advantage lies in ensuring that the human body occupies the main part of the cropped image, reducing background noise and facilitating the application of skeleton attention mechanisms. However, applying projection cropping and coordinate alignment for each frame increases computational complexity and may affect the integrity of video representation.
2D pose estimation.
To obtain the 2D pose coordinates for each sampled frame, we employ OpenPose, a pose estimation algorithm built on deep learning models trained on large-scale datasets of labeled human poses. The model processes each frame and identifies the keypoints corresponding to major body joints, including the nose, elbows, knees, and shoulders. OpenPose specifically extracts 18 keypoints in the COCO format, covering keypoints such as the eyes, ears, wrists, hips, and ankles. These keypoints represent the coordinates of important joints and are used to construct the skeletal structure of the human body in 2D space, as shown in Table 1.
Data structure and features
Once all frames of 2D pose coordinates are obtained, the next step involves extracting features relevant to activity recognition, as shown in Table 2. This process includes computing the normalized Euclidean distances, improving angle-based features, and introducing velocity and acceleration features to comprehensively describe human motion behavior.
In order to mitigate the scale variations caused by factors such as individual height or camera distance, the 2D coordinates of body parts are first normalized. Specific anatomical landmarks, such as the distance between the left and right shoulders, can be chosen as the normalization factor. Let the coordinates of key points $A$ and $B$ be $(x_A, y_A)$ and $(x_B, y_B)$; the normalized Euclidean distance between these two points is calculated as:

$$d_{AB} = \frac{\sqrt{(x_A - x_B)^2 + (y_A - y_B)^2}}{d_{\mathrm{ref}}}$$

where $d_{\mathrm{ref}}$ is a reference distance, such as the distance between the shoulders or the hips. This normalization factor reduces the influence of varying body sizes, making the distance features more consistent across individuals.
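The normalized distance feature translates directly into a few lines of Python (function and argument names are illustrative):

```python
import math

def normalized_distance(a, b, ref_a, ref_b):
    """Euclidean distance between keypoints a and b, divided by a reference
    distance (e.g., between the left and right shoulders) so that the
    feature is comparable across body sizes and camera distances."""
    d = math.dist(a, b)          # raw joint-to-joint distance
    d_ref = math.dist(ref_a, ref_b)  # normalization factor
    return d / d_ref
```

For example, if the measured distance equals the shoulder width, the feature is exactly 1.0 regardless of the subject's scale in the image.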
Relative angles.
The calculation of relative angles can be enhanced using the cosine law. Compared to the simple arctangent function, the cosine law provides a more accurate calculation of the angle between two vectors. For two key points $P_1$ and $P_2$ with coordinates $(x_1, y_1)$ and $(x_2, y_2)$, we can treat these points as a vector $\mathbf{v} = (x_2 - x_1, y_2 - y_1)$, and the angle between this vector and a reference vector $\mathbf{r} = (x_r, y_r)$ (e.g., the vertical or horizontal axis) can be calculated using the following formula:

$$\cos\theta = \frac{\mathbf{v} \cdot \mathbf{r}}{\|\mathbf{v}\|\,\|\mathbf{r}\|}$$

where $(x_r, y_r)$ are the coordinates of the reference vector, and $\theta$ is the angle between the two vectors. By calculating the angles between key points, relative changes in human posture can be more accurately captured.
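The angle computation can be written as follows; the default vertical reference vector is an illustrative choice, and the result is clamped before `acos` for numerical safety:

```python
import math

def relative_angle(p1, p2, ref=(0.0, 1.0)):
    """Angle (degrees) between the vector p1->p2 and a reference vector,
    via cos(theta) = (v . r) / (|v| |r|)."""
    vx, vy = p2[0] - p1[0], p2[1] - p1[1]
    rx, ry = ref
    dot = vx * rx + vy * ry
    norm = math.hypot(vx, vy) * math.hypot(rx, ry)
    cos_t = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.degrees(math.acos(cos_t))
```

A horizontal limb segment measured against the vertical axis yields 90 degrees, while a vertical one yields 0.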
Velocity features.
In addition to static distance and angular features, dynamic features are also crucial. For each key point, velocity features can be computed based on positional changes between consecutive frames. Let the coordinates of key point $A$ at frame $t$ and frame $t+1$ be $(x_t, y_t)$ and $(x_{t+1}, y_{t+1})$, respectively; then the velocity can be calculated using the following formula:

$$v = \frac{\sqrt{(x_{t+1} - x_t)^2 + (y_{t+1} - y_t)^2}}{\Delta t}$$

where $\Delta t$ represents the time interval between consecutive frames. By incorporating velocity features, variations in the speed of human movements, such as running or jumping, can be effectively captured.
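Both dynamic features can be computed in a few lines; the acceleration helper mirrors the velocity definition (as with the accelerations mentioned earlier in this section), and the names are illustrative:

```python
import math

def velocity(p_t, p_t1, dt):
    """Speed of a keypoint between consecutive frames:
    v = ||p_{t+1} - p_t|| / dt, with dt the inter-frame interval (s)."""
    return math.dist(p_t, p_t1) / dt

def acceleration(v_t, v_t1, dt):
    """Change in speed between consecutive frames, complementing the
    velocity feature above."""
    return (v_t1 - v_t) / dt
```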
Optimal feature determination
In order to improve the efficiency and effectiveness of the activity recognition model, we implement a feature selection approach that combines L1 regularization (Lasso) with Recursive Feature Elimination (RFE); the process is shown in Fig 3. L1 regularization introduces sparsity constraints, automatically shrinking the weights of less important features to zero, thereby effectively reducing the dimensionality of the feature set. This regularization technique preserves the most critical features for model prediction and minimizes redundant information. Subsequently, Recursive Feature Elimination (RFE) is employed to further remove features with minimal contribution to the model's prediction, ensuring that only the most discriminative subset of features is retained.
In our activity recognition task, each frame contains 283 features, including frame index, angles, and distances between body parts. Although these features comprehensively represent the temporal variations in human posture, not all features are equally important for recognizing specific activities. A combination of Lasso and Recursive Feature Elimination (RFE) can efficiently select the most relevant features for activity recognition, thereby optimizing model performance and enhancing interpretability. This approach first reduces the feature dimensionality using Lasso, followed by a more refined selection of the remaining features through RFE. Subsequently, based on the selected optimal feature subset, a simplified Bidirectional LSTM model is trained to improve the accuracy and efficiency of human activity recognition. While increasing the number of features may extend the feature selection process, this method ensures model consistency during validation and testing phases, maintaining generalization capability without being affected by inconsistent feature sets.
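The two-stage selection can be sketched with scikit-learn (assumed available). The synthetic data, the `alpha` value, and the number of retained features below are illustrative, not the paper's settings; the point is the pipeline shape: Lasso prunes first, RFE refines the survivors.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))           # 20 candidate features
y = 3 * X[:, 0] - 2 * X[:, 5] + X[:, 9]  # only features 0, 5, 9 matter

# Stage 1: L1 regularization shrinks unhelpful coefficients to zero.
lasso = Lasso(alpha=0.05).fit(X, y)
kept = np.flatnonzero(lasso.coef_)

# Stage 2: RFE prunes the survivors down to the most discriminative subset.
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X[:, kept], y)
selected = kept[rfe.support_]
```

On this toy problem the pipeline recovers exactly the three informative feature indices, illustrating why the coarse-then-fine ordering keeps RFE cheap on a 283-feature frame representation.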
After preprocessing, the dataset is divided using stratified sampling to ensure that the proportion of samples in each class is consistent across the training, validation, and test sets, thus avoiding model bias caused by class imbalance. Specifically, 70% of the data is allocated for training, 15% for validation, and 15% for testing. This allocation allows the model to be trained on representative data while reserving sufficient validation and test data for hyperparameter tuning and performance evaluation. To further enhance the robustness of the model, K-fold cross-validation is applied on the training set, instead of relying on a single fixed validation set. By repeatedly validating the model’s performance within the training set, cross-validation aids in better hyperparameter selection and improves generalization. The test set remains independent throughout, ensuring a fair evaluation of the final model’s performance.
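The 70/15/15 stratified split and the K-fold setup can be reproduced with scikit-learn (assumed available); the toy labels below are illustrative. Splitting off 30% first and then halving it yields the 15% validation and 15% test portions while preserving class proportions at each step.

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

X = np.arange(400).reshape(200, 2)
y = np.array([0] * 140 + [1] * 60)  # imbalanced two-class toy labels

# 70 / 15 / 15 stratified split: carve off 30%, then halve it.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=0)

# K-fold cross-validation on the training portion only, keeping the
# test set untouched for the final evaluation.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(val_idx) for _, val_idx in skf.split(X_tr, y_tr)]
```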
The RGB frame stream
In our approach, the RGB stream is composed of three primary elements: the foundational convolutional layers, a self-attention mechanism, and a skeletal attention mechanism. We utilize the VGG19 network as the core convolutional layer to derive feature maps. Both the self-attention and skeletal attention modules are structured to produce attention weights, allowing the model to concentrate more effectively on key regions of the image. Compared to other convolutional architectures, the depth and multi-layered design of VGG19 enables it to achieve strong feature extraction, delivering rich feature representations for the subsequent attention processes.
Multi-head self-attention module
Drawing inspiration from techniques employed in human re-identification tasks for extracting body part features, we introduce a multi-head self-attention mechanism to analyze the feature maps derived from raw RGB frames, as illustrated in Fig 4. The advantage of the multi-head self-attention mechanism lies in its ability to allow the model to focus on multiple regions of the image from different perspectives or scales simultaneously, thus capturing more comprehensive feature representations, particularly enabling effective attention to different parts of complex actions.
Specifically, the input feature map (where
represents the number of channels, and
and
denote the width and height of the feature map) is first transformed through linear projections to generate the Query, Key, and Value matrices. The formulation for these matrices is as follows:
where ,
, and
are the learned weights for each attention head. These matrices allow the model to compute correlations between features across different spatial locations. The attention score for each head is computed using the scaled dot-product attention mechanism:
where represents the dimensionality of each head, and the scaling factor
controls the impact of increased dimensionality. The Softmax function calculates attention weights for each spatial position, reflecting the network’s focus on different parts of the human body.
Multi-head attention is performed in parallel across multiple heads, each focusing on different feature regions. The outputs from all attention heads are concatenated and projected back into the original feature space through a linear transformation:

\(F' = W^{O}\,\mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\)

To further emphasize human body features and suppress background information, a 1×1 convolutional layer is introduced to generate a self-attention mask \(M_{sa}\), constrained to [0, 1] using the Sigmoid activation function:

\(M_{sa} = \mathrm{Sigmoid}\!\left(\mathrm{Conv}_{1\times 1}(F')\right)\)

Finally, the self-attention mask \(M_{sa}\) is applied to the original feature map \(F\) through element-wise multiplication to produce the final feature map \(F_{sa}\):

\(F_{sa} = M_{sa} \odot F\)
This multi-head self-attention mechanism enables the network to extract features from different body parts and spatial regions simultaneously, significantly improving the recognition accuracy of complex human actions.
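The mechanism above can be sketched in PyTorch. The module below is an illustrative assumption, not the authors' implementation: the `MaskedSelfAttention` name, channel count, and head count are ours, and `nn.MultiheadAttention` stands in for the per-head projections and scaled dot-product step.

```python
import torch
import torch.nn as nn

class MaskedSelfAttention(nn.Module):
    """Sketch: multi-head self-attention over a CNN feature map,
    followed by a sigmoid-constrained 1x1-conv attention mask."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.mask_conv = nn.Conv2d(channels, 1, kernel_size=1)  # 1x1 conv -> mask

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        # Flatten the spatial grid into a token sequence: (B, H*W, C)
        tokens = feat.flatten(2).transpose(1, 2)
        attended, _ = self.attn(tokens, tokens, tokens)  # scaled dot-product per head
        attended = attended.transpose(1, 2).reshape(b, c, h, w)
        # Self-attention mask constrained to [0, 1] by the sigmoid
        mask = torch.sigmoid(self.mask_conv(attended))
        # Element-wise re-weighting of the original feature map
        return feat * mask

feat = torch.randn(2, 64, 14, 14)   # e.g. a VGG19 feature map
out = MaskedSelfAttention(64)(feat)
print(out.shape)  # torch.Size([2, 64, 14, 14])
```

Because the mask lies in (0, 1), the module can only attenuate background responses, never amplify them, which matches the stated goal of suppressing non-human regions.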
Skeletal attention module
We designed a skeleton attention module that fuses single-frame RGB images with skeleton sequences as a critical step in early feature fusion, as illustrated in Fig 5. Since static RGB frames lack temporal information, the skeleton sequence guides the model to focus on the regions of human-object interaction, complementing the missing temporal aspect. By computing the movement distance of each joint, the module identifies the most salient parts of the activity and generates a skeleton attention mask that assigns higher weights to these regions, helping the RGB stream capture complex motion features more effectively. First, the joint with the largest movement in the skeleton sequence is identified:

\(j^{*} = \arg\max_{j} \sum_{t} \left\| p_j^{t} - p_j^{\mathrm{mid}} \right\|_2\)

where \(p_j^{t}\) and \(p_j^{\mathrm{mid}}\) denote the 3D coordinates of joint \(j\) in skeleton frame \(t\) and in the frame at the mid-point of the RGB sequence, respectively. By measuring the movement of each joint in this way, the joint with the greatest variation \(j^{*}\) is identified.
Next, a skeleton attention mask is generated. A Gaussian function smooths the region around the joint with the largest movement so that the attention weights vary continuously:

\(M_{sk}(x, y) = \exp\!\left(-\frac{(x - x_{j^{*}})^2 + (y - y_{j^{*}})^2}{2\sigma^2}\right)\)

where \((x, y)\) represents the pixel location in the mask, \((x_{j^{*}}, y_{j^{*}})\) is the image position of the most-moving joint, and \(\sigma\) controls the spread of the Gaussian distribution. This ensures that regions near the joint with the largest movement receive higher weights while the weights decrease smoothly in space, preventing neglect of other important regions. The generated skeleton attention mask is then resized to match the spatial dimensions of the feature map, producing the final skeleton attention weights.

Finally, the skeleton-related features are obtained by element-wise multiplication of the skeleton attention weights and the input feature map \(F\):

\(F_{sk} = M_{sk} \odot F\)
Guided by the skeleton attention map, the RGB stream is able to more effectively capture human-object interaction features, improving the model’s comprehension of intricate actions.
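A minimal NumPy sketch of the mask construction above follows. The function name, argument layout, and the exact movement definition (summed distance to the mid-frame pose) are our assumptions; the paper specifies only that the most-moving joint anchors a Gaussian mask.

```python
import numpy as np

def skeleton_attention_mask(joints, joint_xy, hw, sigma=10.0):
    """Build a Gaussian skeleton-attention mask (sketch).

    joints   : (T, J, 3) skeleton sequence in 3D coordinates
    joint_xy : (J, 2) joint pixel positions in the mid RGB frame
    hw       : (H, W) output mask size
    """
    # Movement of each joint across the sequence, relative to the mid frame
    mid_pose = joints[len(joints) // 2]                            # (J, 3)
    movement = np.linalg.norm(joints - mid_pose, axis=2).sum(axis=0)  # (J,)
    j_star = int(np.argmax(movement))                              # most-moving joint

    h, w = hw
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = joint_xy[j_star]
    # Gaussian centred on that joint; weights decay smoothly with distance
    dist2 = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-dist2 / (2.0 * sigma ** 2))

mask = skeleton_attention_mask(np.random.rand(30, 25, 3),
                               np.random.rand(25, 2) * 100, (112, 112))
print(mask.shape)  # (112, 112)
```

In practice the mask would be resized to the feature-map resolution and multiplied element-wise with \(F\), as in the equation above.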
Post-fusion module
For action classification, the features of the two sub-network streams are integrated after skeleton data has been used to drive RGB attention. Since conventional decision-level fusion methods are highly dependent on the specific dataset, we present a feature fusion method inspired by multi-stream fusion approaches that effectively exploits the complementary information of the two modalities. Both an LSTM-based and a GCN-based fusion module are implemented.
LSTM-based fusion module.
Since the skeletal stream features are condensed into a single channel dimension \(f_s \in \mathbb{R}^{C_s}\), we utilize a post-fusion approach to merge them with the RGB features \(f_r\). The RGB features are produced by combining the self-attention features \(F_{sa}\) with the skeleton-attention features \(F_{sk}\): the two feature maps are first passed through MaxPooling layers to obtain vectors of dimension \(C\), which are subsequently concatenated to form the RGB stream feature \(f_r \in \mathbb{R}^{2C}\).
For the final integration of skeletal and RGB features, both feature sets are combined using a weighted summation rather than concatenation. Following this, a dropout layer is introduced to prevent overfitting by randomly deactivating neurons during training. Afterward, a fully connected layer is applied with the Exponential Linear Unit activation function to model the complex relationships between temporal skeletal features and spatial RGB features. Lastly, the classification task is performed by employing two fully connected layers followed by a sigmoid function for multi-class classification.
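The head described above can be sketched as follows. All layer widths, the learned fusion weight `alpha`, and the projection used to align RGB and skeleton dimensions are illustrative assumptions, not the authors' exact layers.

```python
import torch
import torch.nn as nn

class LSTMFusionHead(nn.Module):
    """Sketch of the LSTM-branch post-fusion head: max-pool both RGB
    attention maps, concatenate, then weighted-sum with the skeleton
    vector before dropout, an ELU FC layer, and a sigmoid classifier."""
    def __init__(self, c_rgb=512, c_sk=256, num_classes=60, p=0.5):
        super().__init__()
        self.pool = nn.AdaptiveMaxPool2d(1)
        self.proj = nn.Linear(2 * c_rgb, c_sk)        # align dims for summation
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learned fusion weight
        self.head = nn.Sequential(
            nn.Dropout(p),                            # regularization
            nn.Linear(c_sk, 256), nn.ELU(),           # ELU models cross-modal relations
            nn.Linear(256, 128), nn.ELU(),            # two FC layers for classification
            nn.Linear(128, num_classes), nn.Sigmoid(),
        )

    def forward(self, f_sa, f_sk_attn, f_skeleton):
        # Max-pool both attention feature maps to C-dim vectors, then concatenate
        v_sa = self.pool(f_sa).flatten(1)
        v_sk = self.pool(f_sk_attn).flatten(1)
        f_rgb = self.proj(torch.cat([v_sa, v_sk], dim=1))
        # Weighted summation instead of concatenation
        fused = self.alpha * f_rgb + (1 - self.alpha) * f_skeleton
        return self.head(fused)

logits = LSTMFusionHead()(torch.randn(2, 512, 7, 7),
                          torch.randn(2, 512, 7, 7),
                          torch.randn(2, 256))
print(logits.shape)  # torch.Size([2, 60])
```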
GCN-based fusion module.
To more effectively investigate the connections and exploit the synergistic information between the two modalities, we developed a fusion module that accounts for the distinct characteristics of the features. The skeletal features \(F_s\) contain spatiotemporal information, whereas the RGB features \(F_r\) contain only spatial information. Since the skeleton attention mechanism has already extracted the essential features from the RGB frame stream, the fusion module focuses on cross-modal spatial relationships.
To derive the RGB feature, the two attention maps are averaged, yielding \(F_r \in \mathbb{R}^{C_r \times N}\), where \(N\) is the product of \(W \times H\). For the skeleton features, max pooling along the temporal dimension converts them into \(F_s \in \mathbb{R}^{C_s \times V}\), which contains solely spatial information over the \(V\) joints. Subsequently, both the RGB and skeleton features are reduced to vectors \(v_r\) and \(v_s\) through global max pooling.
The two feature sets are then merged through concatenation along the channel dimension: the skeleton vector \(v_s\) is appended at every spatial position of the RGB features, while the RGB vector \(v_r\) is appended at every position of the skeleton features. As a result, the channel dimensionality of the two feature sets is aligned.
Since the merged features incorporate spatial information from both the skeleton stream and the RGB stream, two convolutional layers \(\theta\) and \(\phi\) are utilized to explore cross-spatial relationships, generating a relational mask \(M\):

\(M = \delta\!\left(\theta(\hat{F}_r)^{\top}\,\phi(\hat{F}_s)\right)\)

where \(\delta\) denotes the softmax function and \(\hat{F}_r\), \(\hat{F}_s\) are the vector-augmented RGB and skeleton features. The resulting features are then obtained through the following equation:

\(F_{rel} = M \odot \hat{F}_r\)

where \(\odot\) represents element-wise multiplication. Subsequently, the relational feature map is passed through a global average pooling (GAP) layer and two fully connected layers, with final classification performed via a softmax layer. This process yields the final output of the network.
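A runnable sketch of this cross-modal relational fusion is given below. The layer widths, the einsum formulation of the position-by-joint mask, and the class name are our assumptions; the paper specifies only the vector augmentation, two convolutional layers, a softmax relational mask, element-wise multiplication, and GAP plus two FC layers.

```python
import torch
import torch.nn as nn

class RelationalFusion(nn.Module):
    """Sketch of the GCN-branch fusion head (illustrative shapes)."""
    def __init__(self, c_rgb=256, c_sk=256, num_classes=60):
        super().__init__()
        c = c_rgb + c_sk
        # Two convolutional layers explore cross-spatial relationships
        self.theta = nn.Conv1d(c, 128, kernel_size=1)
        self.phi = nn.Conv1d(c, 128, kernel_size=1)
        self.fc = nn.Sequential(nn.Linear(c, 128), nn.ReLU(),
                                nn.Linear(128, num_classes))

    def forward(self, f_rgb, f_sk):
        # f_rgb: (B, C_r, N) spatial RGB feature; f_sk: (B, C_s, V) joint feature
        v_rgb = f_rgb.max(dim=2).values            # global max pooling
        v_sk = f_sk.max(dim=2).values
        # Append each modality's vector at every position of the other
        f_r = torch.cat([f_rgb, v_sk[:, :, None].expand(-1, -1, f_rgb.shape[2])], 1)
        f_s = torch.cat([f_sk, v_rgb[:, :, None].expand(-1, -1, f_sk.shape[2])], 1)
        # Relational mask over (RGB position, joint) pairs, softmax over joints
        m = torch.softmax(torch.einsum('bcn,bcv->bnv',
                                       self.theta(f_r), self.phi(f_s)), -1)
        rel = torch.einsum('bnv,bcv->bcn', m, f_s)  # joints aggregated per position
        fused = rel * f_r                           # element-wise multiplication
        out = fused.mean(dim=2)                     # global average pooling (GAP)
        return self.fc(out)

scores = RelationalFusion()(torch.randn(2, 256, 49), torch.randn(2, 256, 25))
print(scores.shape)  # torch.Size([2, 60])
```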
Train steps
During the initial training phase, the multimodal fusion module is removed by omitting the feature fusion component between the skeletal and RGB streams, allowing the two sub-networks (skeletal stream and RGB stream) to operate independently. A fully connected (FC) layer and a softmax layer are added on top of each sub-network.
For sub-network training, a simple early-stage fusion mechanism is designed during the training of the skeletal flow and RGB flow sub-networks, allowing partial interaction between the RGB and skeletal flows in the early stages of training to lay a better foundation for later fusion. The weights of each network are saved, but the weights of the FC and softmax layers are not.
The next phase involves gradually unfreezing sub-network weights and training the fusion module. After independently training the sub-networks and obtaining their initial weights, the training of the multimodal fusion module begins. To enhance the alignment between the sub-network and fusion module features, we progressively unfreeze some of the sub-network weights (e.g., the convolutional layers or feature extraction layers close to the top layer) during the training of the fusion module, allowing the sub-networks to further adjust according to the fused features. This approach allows the sub-networks to dynamically adapt to the changes in the fusion module, making the multimodal fusion process more seamless and improving the overall model’s performance and feature alignment accuracy.
Finally, the entire network, including the sub-networks and the multimodal fusion module, undergoes fine-tuning to ensure that the final fused features and sub-network features work in better coordination.
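The staged freeze/unfreeze schedule above can be expressed with `requires_grad` toggles. The tiny `Sequential` stand-ins below are hypothetical; only the staging logic reflects the text.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Freeze or unfreeze a module by toggling requires_grad on its parameters."""
    for p in module.parameters():
        p.requires_grad = flag

# Hypothetical stand-ins for the skeletal and RGB sub-networks
skeleton_net = nn.Sequential(nn.Linear(75, 128), nn.ReLU(), nn.Linear(128, 256))
rgb_net = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 256))
fusion_head = nn.Linear(512, 60)

# Stage 1: each stream is trained independently with a temporary
# FC + softmax head, whose weights are later discarded (not shown).

# Stage 2: freeze both streams and train only the fusion module.
set_trainable(skeleton_net, False)
set_trainable(rgb_net, False)

# Stage 3: progressively unfreeze the layers closest to the top of
# each stream so the sub-networks can adapt to the fused features.
set_trainable(skeleton_net[2], True)
set_trainable(rgb_net[2], True)

# Stage 4: unfreeze everything for end-to-end fine-tuning.
set_trainable(skeleton_net, True)
set_trainable(rgb_net, True)
```

Only parameters with `requires_grad=True` receive gradients, so the same optimizer can be reused across stages as long as frozen parameters are excluded or simply left unupdated.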
Experiments
Dataset description
The NTU RGB+D dataset is one of the most important benchmarks in the human action recognition field, containing 60 action classes with a total of 56,880 video samples. This dataset includes various modalities such as skeleton, depth, infrared (IR), and RGB video. Forty-nine action classes are performed by a single subject, while eleven involve two-person interactions. Each action includes data from 25 joints, providing rich spatiotemporal information. The dataset provides two standard evaluation protocols: Cross-Subject, which divides the 40 subjects into training and testing groups, and Cross-View, which uses data from cameras 2 and 3 for training and data from camera 1 for testing. The multimodal nature of the NTU RGB+D dataset makes it a challenging and valuable benchmark for action recognition research.
The UTD Multimodal Human Action Dataset (UTD-MHAD) is a publicly available dataset designed specifically for human action recognition research, covering multimodal data collection. It consists of 8 subjects (4 male and 4 female) performing 27 different actions, with each subject repeating each action 4 times, yielding 861 valid samples (excluding 3 damaged ones). The UTD-MHAD dataset includes four synchronized data modalities: RGB video, depth video, skeleton data, and signals from inertial sensors. RGB and depth video were captured using a Microsoft Kinect camera, while inertial sensors were worn on the subject’s right wrist or thigh, depending on the action. The Kinect camera was placed approximately 3 meters away from the subject to capture full-body movements. The dataset was recorded at a frame rate of 30 frames per second with a resolution of 640 × 480 pixels, making it particularly suitable for applications involving sensor fusion and human action recognition.
The SYSU dataset focuses on human-object interaction recognition, comprising 12 actions performed by 40 subjects, with each skeleton containing 20 joints. The dataset contains 480 action sequences, all involving interactions with everyday objects: phone, chair, backpack, wallet, cup, broom, and mop. The SYSU dataset adopts two standard evaluation protocols: Setting-1, which randomly splits the action sequences into training and testing sets, and Setting-2, which randomly assigns subjects to training and testing groups. In both settings, 30-fold cross-validation is conducted to ensure the stability and reliability of the evaluation results. This dataset is particularly well suited to action recognition tasks involving human-object interactions.
The MMAct dataset is a multi-modal activity recognition dataset containing 37 daily life activities, categorized into three groups: 16 complex activities (e.g., carrying), 12 simple activities (e.g., kicking), and 9 desk activities (e.g., using a computer). The dataset consists of 37,000 video clips from 20 subjects, with each activity performed five times by each subject. It is recorded using seven modalities, including RGB videos from four viewpoints, acceleration, gyroscope, and orientation data. The RGB videos are captured at a resolution of 1920 × 1080 pixels with a frame rate of 30 frames per second. The inertial sensor data is collected from a smartphone placed in the subject's pocket and a smartwatch, which record acceleration, gyroscope, and orientation data, resulting in data from a total of four sensors. To increase the diversity of scenes and viewpoints, the dataset includes four different scenes, each with four camera perspectives, providing rich variation for assessing the robustness of the model. Some of the data suffers from visual occlusions, offering a challenge for activity recognition with incomplete or occluded data. In the experiments, we follow the cross-subject and cross-view split protocol from the original paper. Additionally, since the dataset lacks skeleton sequences, we generate skeleton data from the RGB videos using OpenPose, providing complete input for subsequent analysis.
Implementation details
The model is implemented in PyTorch and trained on two Nvidia RTX 4090 GPUs. To optimize training performance, we utilize the AdamW optimizer with an initial learning rate of , and reduce the learning rate by a factor of 0.5 every 15 epochs to ensure smooth convergence and training stability. To enhance the robustness of the model, we refined the coordinate transformation strategy and improved the VA-pre preprocessing method. By setting the origin of the coordinate system to the midpoint of each action sequence, the model's adaptability to variations in viewpoint and camera position is strengthened. In subsequent frames, the skeleton coordinates are dynamically adjusted based on their relative positions, better capturing the relative displacement and spatiotemporal features of the actions. These improvements yield more precise and robust performance on multimodal action recognition tasks.
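The optimizer schedule and mid-point re-centering can be set up as follows. The initial learning-rate value did not survive extraction from the source, so the `1e-4` below is a placeholder, and `recenter` is our sketch of the described coordinate transformation, not the authors' exact VA-pre code.

```python
import numpy as np
import torch
import torch.nn as nn

model = nn.Linear(256, 60)  # stands in for the full MAF-Net
# AdamW with step decay: halve the learning rate every 15 epochs
# (1e-4 is a placeholder; the paper's value is not reproduced here)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.5)

def recenter(skeleton: np.ndarray) -> np.ndarray:
    """Shift a (T, J, 3) skeleton sequence so the origin sits at the
    joint centroid of the sequence's middle frame (our reading of the
    improved mid-point coordinate transformation)."""
    origin = skeleton[len(skeleton) // 2].mean(axis=0)
    return skeleton - origin

seq = recenter(np.random.rand(30, 25, 3))
print(np.allclose(seq[15].mean(axis=0), 0.0))  # True: mid-frame centroid at origin
```

In a training loop, `scheduler.step()` would be called once per epoch after `optimizer.step()`, producing the stepped decay described above.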
Experimental results
Comparison against other methods
Results on NTU RGB+D.
We compared accuracy on the NTU RGB+D dataset under the Cross-Subject and Cross-View settings against several state-of-the-art (SOTA) methods, as shown in Table 3. The experimental results demonstrate that GCN-based models outperform RNN-based models in regional relationship reasoning, leading to superior performance in action recognition tasks. For instance, MAF-Net improves accuracy by 4.7% in the Cross-Subject evaluation compared to EleAtt-GRU, with only an 8.6G increase in FLOPs.
To ensure fairness, we primarily compared methods using the same skeleton-stream backbone. For example, when comparing the Bi-LSTM-based 2 Stream RNN/CNN with MAF-Net (also employing Bi-LSTM), MAF-Net demonstrated competitive results. Additionally, compared to SGM-Net (using ST-GCN as the backbone), MAF-Net (also using ST-GCN) showed superior performance across both evaluation metrics while requiring fewer parameters and FLOPs.
These results indicate that although SGM-Net leverages the full RGB video modality, MAF-Net, using only single-frame inputs with two-stage feature fusion, surpasses SGM-Net in both performance and computational cost. Moreover, compared to recent fusion methods such as MMTM and MFAS, MAF-Net achieves superior results, with FLOPs amounting to only 11.4% of MMTM's and 11.1% of MFAS's.
For the ST-GCN-based Posemap and JOLO-GCN, the parameters and FLOPs of Posemap were not reported in the literature. Since Posemap uses CNNs to extract features from RGB videos, its computational burden is relatively high. JOLO-GCN, which constructs optical-flow motion maps around joints and employs a lightweight GCN for feature extraction, has fewer parameters and FLOPs than MAF-Net. However, this count does not include the computational expense of generating optical flow from images and the joint-guided flow maps.
Results on SYSU.
Table 4 presents a comprehensive comparison of MAF-Net against several state-of-the-art methods on the SYSU dataset, detailing the accuracy achieved by each method under the two settings, along with their respective parameter counts and FLOPs. VA-LSTM and DPRL+GCNN attained accuracies of 77.9% and 76.5%, while Local+LGN recorded 84.1%. EleAtt-GRU demonstrated strong performance, achieving 86.7% accuracy across both settings at a cost of 1.3M parameters and 7.7G FLOPs. LSGM+GTSC achieved 86.8%, while SGN reported 82.6% and 84.0% with 2.2M parameters and 1.8G FLOPs. Other notable methods include MTDA and JOULE, which attained accuracies of 80.2% / 85.5% and 80.6% / 85.9%, respectively, but with significantly higher computational demands of 47.6M and 48.4M parameters and 203.9G and 204.8G FLOPs. The MAF-Net configurations also performed competitively: the variant without LSTM achieved 81.9% / 83.6% accuracy with 28.4M parameters and 16.3G FLOPs, while the variant without GCN reached 86.3% / 88.1% with 30.1M parameters and 25.5G FLOPs. This analysis underscores the effectiveness of MAF-Net, particularly its GCN component, in achieving high accuracy while maintaining a manageable computational footprint compared to other methods.
Results on UTD-MHAD.
Table 5 presents the performance of various methods on the Cross-Subject task of the UTD-MHAD dataset, comparing skeleton-based (S), RGB-based (R), and multimodal fusion (S+R) approaches. Skeleton-based methods such as ST-GCN demonstrate strong performance, with MS-G3D Net achieving 90.5% accuracy and SGN reaching 89.0%. The RGB-based stream, such as Inflated ResNet50, achieves 88.6% accuracy, albeit with a high parameter count and computational complexity (170G FLOPs). Fusion methods combining skeleton and RGB streams also yield high accuracy, with MFAS and MMTM obtaining 88.0% and 90.9%, respectively. Furthermore, the table highlights the outstanding performance of the MAF-Net models (Bi-LSTM and ST-GCN), which achieve 91.5% and 92.0% accuracy while maintaining low computational overhead.
Results on MMACT.
Table 6 presents the performance of various methods on the MMAct dataset, comparing skeleton-based (S), RGB-based (R), and multimodal fusion (S+R) approaches. Skeleton-based methods such as JOLO-GCN achieve remarkable performance, with an accuracy of 91.4%. The RGB-based stream, represented by Inflated ResNet50, achieves an accuracy of 88.6%, though with a significant computational cost (170G FLOPs). Fusion methods that combine skeleton and RGB data also perform well, with MAF-Net w/ Bi-LSTM achieving the highest accuracy of 91.5%, while MAF-Net w/ ST-GCN follows closely with 90.3%. These fusion methods offer a strong balance between high accuracy and computational efficiency, with the MAF-Net models maintaining relatively low FLOPs (17.3G for Bi-LSTM and 26.5G for ST-GCN). Overall, the table highlights the competitive performance of skeleton-based, RGB-based, and multimodal approaches, with the MAF-Net models standing out for their superior accuracy and efficient computation.
Ablation study
In the process of key frame extraction from RGB videos, we compared two methods and conducted experiments on the UTD-MHAD, NTU RGB+D, and SYSU datasets. The first method is a single-frame RGB image selection strategy, where for efficiency, we extract only one RGB frame from the entire video. Ablation experiments were performed to evaluate the effect of selecting different frames between 10% and 90% of the video duration on model accuracy. Results showed that frames between 30% and 70% performed similarly, leading us to consistently choose the frame at the 50% position, demonstrating that a single RGB frame contains sufficient spatial information to describe human-object interactions. However, the second method, which involves key frame extraction and uniform sampling based on the optical flow technique, proved to be more advantageous. By calculating the motion intensity of each frame using optical flow, this method effectively identifies the segments of the video with significant motion changes. Frames are then ranked based on motion intensity, and key frames are uniformly selected according to a sampling interval, ensuring reasonable distribution of key frames along the time axis. This approach not only captures essential dynamic information in the video but also avoids the concentration of key frames within a specific time interval. Compared to the single-frame extraction strategy, it provides a more comprehensive representation of motion changes in the video. Ultimately, through analysis, we chose the key frame extraction and uniform sampling strategy based on optical flow to achieve more accurate activity recognition.
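The score-rank-sample strategy above can be sketched as follows. For a self-contained example, simple frame differencing stands in for the optical-flow magnitude used in the paper; the function name and the "keep the top half by motion" heuristic are our assumptions.

```python
import numpy as np

def key_frames(video: np.ndarray, k: int) -> list:
    """Rank frames by motion intensity and pick k of them spread along
    the timeline, so key frames do not cluster in one interval.

    video : (T, H, W, C) array of frames.
    """
    # Motion intensity: mean absolute change between consecutive frames
    # (a stand-in for per-frame optical-flow magnitude)
    diffs = np.abs(np.diff(video.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    scores = np.concatenate([[0.0], diffs])          # frame 0 has no predecessor
    # Rank frames by motion, keep the most active half as candidates
    order = np.argsort(scores)[::-1]
    candidates = np.sort(order[: max(k, len(scores) // 2)])
    # Sample uniformly along the candidate timeline
    idx = np.linspace(0, len(candidates) - 1, k).round().astype(int)
    return [int(candidates[i]) for i in idx]

video = np.random.rand(40, 8, 8, 3)
frames = key_frames(video, 5)
print(len(frames), frames == sorted(frames))  # 5 True
```

Sorting the candidate set before the uniform sampling step is what enforces the even temporal distribution described above.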
Effectiveness of data enhancement
Table 7 demonstrates the effectiveness of data enhancement methods across different datasets and settings. Specifically, comparing ST-GCN with and without data augmentation reveals a consistent performance improvement across all benchmarks. For instance, in the NTU RGB+D Cross-Subject setting, data augmentation boosts accuracy from 82.5% to 83.3%, and in the Cross-View setting, the improvement is even more pronounced, increasing from 89.3% to 90.6%. Similarly, for the Xception model, employing projection crop as a data enhancement technique raises the accuracy significantly from 50.9% to 66.4% in the NTU RGB+D Cross-Subject setting. The MAF-Net also benefits from data augmentation, with accuracy increasing from 89.2% to 90.6% in the NTU RGB+D Cross-Subject evaluation and from 84.7% to 86.3% in SYSU Setting-1. Across various datasets, including SYSU and UTD-MHAD, data enhancement strategies consistently contribute to higher classification accuracy, highlighting the critical role of augmentation techniques in improving model robustness and generalization.
Effectiveness of self-attention mechanisms
In the experiments conducted on the NTU RGB+D and SYSU datasets, the effectiveness of the self-attention mechanism was evaluated using Bi-LSTM and ST-GCN as backbone networks. As shown in Table 6, the inclusion of the self-attention mechanism significantly enhanced the performance of the backbone networks. For instance, on the SYSU dataset, the accuracy of MAF-Net with the self-attention module improved by 1.8% in the Cross-Subject evaluation compared to the Bi-LSTM version without the self-attention module. This indicates that the self-attention mechanism helps the RGB stream to focus more effectively on regions involving human-object interactions, thereby extracting more representative feature information.
Moreover, the skeleton attention mechanism also led to substantial performance improvements. Specifically, MAF-Net with the skeleton attention module outperformed the ST-GCN version without this module, achieving a 3.0% increase in accuracy on the NTU RGB+D dataset in the Cross-Subject evaluation. This suggests that the skeleton attention mechanism not only facilitates better fusion of RGB and skeleton modalities but also aids the model in distinguishing between foreground and background, thus improving overall action recognition performance.
Effectiveness of the post-fusion module
In our approach, we assessed the final fusion technique using the LSTM and ST-GCN backbone networks on the NTU RGB+D, SYSU, and UTD-MHAD datasets, contrasting the proposed method with both decision-level fusion and summation-based fusion. The experimental results demonstrate that MMFF achieves superior accuracy on all three datasets compared to the other methods. For methods based on Bi-LSTM and ST-GCN, the proposed late fusion module consistently outperforms decision fusion in most cases; for instance, our Bi-LSTM late fusion module improved performance by 5.0% over decision fusion in the Cross-View task of the NTU RGB+D dataset and by 3.8% in Setting-1 of the SYSU dataset. In one SYSU evaluation setting, however, decision fusion achieved the best single result; we regard this as an anomaly, since decision fusion showed limited generalization on the larger NTU RGB+D dataset. Moreover, MMFF combined with the proposed late fusion module outperforms summation fusion, with a 2.4% improvement in the Cross-Subject task of the NTU RGB+D dataset and a 2.0% improvement in Setting-1 of the SYSU dataset. The late fusion method likewise exhibited significant performance advantages on the UTD-MHAD dataset, further validating the effectiveness of the module across different datasets.
Visualization
The visualization results of the self-attention module and the skeletal attention module are shown in Figs 6, 7 and 8, respectively. All images have been cropped according to the implementation details. The results demonstrate that the self-attention module aids the model in better focusing on regions of human-object interaction, while the skeletal attention module effectively guides the model to focus on limb movements, thereby improving the accuracy of action recognition.
Discussion
The results from our experiments demonstrate the effectiveness of our proposed multimodal fusion approach for human activity recognition, particularly when utilizing skeletal and RGB data streams. By employing advanced techniques such as optical flow-based key frame extraction, enhanced skeletal data augmentation, and feature fusion strategies, our model achieves superior performance across multiple datasets, including NTU RGB+D, SYSU, and UTD-MHAD. These results underscore several critical insights:
- Key Frame Extraction and Optical Flow: The motion-based key frame extraction technique using optical flow significantly improves the temporal representation of videos. Compared to single-frame extraction, the optical flow approach ensures that frames with high activity intensity are captured, thereby offering more comprehensive coverage of motion dynamics. This method proved especially beneficial in improving accuracy across larger datasets like NTU RGB+D.
- Data Augmentation Techniques: Our findings emphasize the importance of data augmentation, particularly for enhancing the robustness of skeletal and RGB modalities. For the skeletal stream, viewpoint rotation and noise injection improved the system’s adaptability to real-world conditions. Similarly, projection-based cropping on RGB frames, aligned with skeletal data, focused the model’s attention on human-object interactions, reducing the influence of background noise.
- Self-Attention and Skeletal Attention Modules: Both attention modules were shown to contribute significantly to model performance. The self-attention module helped the model to focus on critical regions within RGB frames, capturing intricate human-object interactions. The skeletal attention module further aided in aligning key motion areas within the skeletal and RGB streams, leading to improved action recognition accuracy. This was particularly evident in tasks involving fine-grained body movements, where skeletal attention allowed the model to differentiate between similar actions.
- Fusion Techniques: The late fusion approach demonstrated clear advantages over decision and summation fusion methods. By effectively integrating features from both the skeletal and RGB streams, this method enabled the model to better utilize complementary information from both modalities, leading to improved performance. The superior results achieved on all datasets, particularly on challenging benchmarks like NTU RGB+D, validate the fusion strategy’s efficacy in multimodal action recognition tasks.
- Dataset Insights: The consistent improvement across NTU RGB+D, SYSU, and UTD-MHAD datasets highlights the generalizability of our approach. While the NTU RGB+D dataset, with its complex spatiotemporal data, challenged the model’s ability to capture detailed motion, our fusion methods effectively addressed these challenges. In contrast, the smaller SYSU dataset benefited more from data augmentation, reducing overfitting and enhancing performance in human-object interaction tasks.
Limitations and future work
Despite the promising results, there are areas for improvement. One limitation of the current approach lies in the computational complexity introduced by the fusion and attention modules, which may hinder real-time applications. Future work could focus on optimizing the computational efficiency of the model, perhaps by introducing more lightweight attention mechanisms or employing knowledge distillation techniques to simplify the model without compromising accuracy.
Additionally, while the optical flow-based key frame extraction showed significant improvement, this method adds preprocessing time, which could be streamlined in future iterations. Exploring alternative motion detection techniques or reducing the need for preprocessing could make the model more practical for real-time applications.
Lastly, integrating more diverse data sources, such as depth information or inertial sensor data, could further improve the robustness of multimodal action recognition systems, particularly in more complex, real-world environments.
Conclusion
In this work, we presented a comprehensive approach for multimodal human activity recognition using both skeletal and RGB data streams. Our method employs advanced techniques such as optical flow-based key frame extraction, data augmentation, and multimodal fusion with attention mechanisms, resulting in superior performance across three benchmark datasets. The integration of skeletal and RGB features, supported by self-attention and skeletal attention modules, enables the model to capture intricate spatiotemporal relationships and enhance its focus on relevant human-object interactions.
The late fusion strategy used in this work proved to be highly effective, outperforming other fusion techniques and ensuring the complementary strengths of both data streams were utilized to their fullest potential. Our model not only achieves state-of-the-art performance on challenging datasets like NTU RGB+D but also demonstrates strong generalization capabilities across smaller datasets like SYSU and UTD-MHAD.
Looking forward, optimizing the model for real-time performance and extending it to incorporate additional modalities will further enhance its practical applications. Overall, this work provides a robust foundation for future research in multimodal human activity recognition, with promising implications for real-world applications in surveillance, healthcare, and human-computer interaction.
References
- 1. Zhu X, Zhu Y, Wang H, Wen H, Yan Y, Liu P. Skeleton sequence and RGB frame based multi-modality feature fusion network for action recognition. ACM Trans Multimed Comput Commun Appl. 2022;18(3):1–24.
- 2. Tu Z, Zhang J, Li H, Chen Y, Yuan J. Joint-bone fusion graph convolutional network for semi-supervised skeleton action recognition. IEEE Trans Multimed. 2022;25:1819–31.
- 3. Wang J, Li F, Lv S, He L, Shen C. Physically realizable adversarial creating attack against vision-based BEV space 3D object detection. IEEE Trans Image Process. 2025; 34:538–51.
- 4. Zhang Y, Ding K, Hui J, Liu S, Guo W, Wang L. Skeleton-RGB integrated highly similar human action prediction in human–robot collaborative assembly. J Vis Commun Image Represent. 2024;86:102659.
- 5. Wang J, Li F, He L. A unified framework for adversarial patch attacks against visual 3D object detection in autonomous driving. IEEE Trans Circuits Syst Video Technol. 2025.
- 6. Zhang L, Liu J, Wei Y, An D, Ning X. Self-supervised learning-based multi-source spectral fusion for fruit quality evaluation: A case study in mango fruit ripeness prediction. Inf Fusion. 2025;117:102814.
- 7. Yue R, Tian Z, Du S. Action recognition based on RGB and skeleton data sets: A survey. Neurocomputing. 2022;512:287–306.
- 8. Liu S, Bai X, Fang M, Li L, Hung C-C. Mixed graph convolution and residual transformation network for skeleton-based action recognition. Appl Intell 2022;52(2):1544–55.
- 9. Sun Y, Xu W, Yu X, Gao J, Xia T. Integrating vision transformer-based bilinear pooling and attention network fusion of RGB and skeleton features for human action recognition. Int J Comput Intell Syst. 2023:16(1);116.
- 10. Amsaprabhaa M, et al. Multimodal spatiotemporal skeletal kinematic gait feature fusion for vision-based fall detection. Expert Syst Appl. 2023;212:118681.
- 11. Ghimire A, Kakani V, Kim H. SSRT: A sequential skeleton RGB transformer to recognize fine-grained human-object interactions and action recognition. IEEE Access. 2023;11:51930–48.
- 12. Liang PP, Zadeh A, Morency L-P. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Comput Surv. 2024;56(1):1–42.
- 13. Lu Y, Huang Y, Sun S, Zhang T, Zhang X, Fei S, et al. FM2fNet: Multi-modal forest monitoring network on large-scale virtual dataset. In: Proceedings of the IEEE conference. 2024. p. 539–543.
- 14. Kopuklu O, Kose N, Gunduz A, Rigoll G. Resource efficient 3D convolutional neural networks. 2019.
- 15. Maturana D, Scherer S. Voxnet: A 3D convolutional neural network for real-time object recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
- 16. Bhatti UA, Tang H, Wu G, Marjan S, Hussain A. Deep learning with graph convolutional networks: An overview and latest applications in computational intelligence. Int J Intell Syst. 2023;1:8342104.
- 17. Khosla M, Jamison K, Kuceyeski A, Sabuncu MR. 3D convolutional neural networks for classification of functional connectomes. Springer. 2018.
- 18. Torfi A, Iranmanesh SM, Nasrabadi N, Dawson J. 3D convolutional neural networks for cross audio-visual matching recognition. IEEE Trans Audio Speech Lang Process. 2017;5:22081–91.
- 19. Huang J, Zhou W, Li H, Li W. Sign language recognition using 3D convolutional neural networks. IEEE. 2015.
- 20. Zhang H, Yu L, Wang G, Tian S, Yu Z, Li W, et al. Cross-modal knowledge transfer for 3D point clouds via graph offset prediction. Pattern Recognit. 2025;111351.
- 21. Mzoughi H, Njeh I, Wali A, Slima MB, BenHamida A, Mhiri C, et al. Deep multi-scale 3D convolutional neural network (CNN) for MRI gliomas brain tumor classification. J Digit Imaging 2020;33(4):903–15. pmid:32440926
- 22. Li Y, Zhang H, Shen Q. Spectral–spatial classification of hyperspectral imagery with 3D convolutional neural network. Remote Sens 2017;9(1):67.
- 23. Zin TT, Htet Y, Akagi Y, Tamura H, Kondo K, Araki S, et al. Real-time action recognition system for elderly people using stereo depth camera. Sensors (Basel) 2021;21(17):5895. pmid:34502783
- 24. Yu, Tsz-Ho, Kim, Tae-Kyun, Cipolla, Roberto. Real-time Action Recognition by Spatiotemporal Semantic and Structural Forests. 2010;2(5):6.
- 25. Chen C, Liu K, Kehtarnavaz N. Real-time human action recognition based on depth motion maps. Springer. 2016;12:155–63.
- 26. Chen C, Wu Y, Dai Q, Zhou H-Y, Xu M, Yang S, et al. A survey on graph neural networks and graph transformers in computer vision: A task-oriented perspective. IEEE. 2024.
- 27. Bloom V, Makris D, Argyriou V. G3D: A gaming action dataset and real time action recognition evaluation framework. In: Proceedings of the IEEE conference. 2012.
- 28. Subramanian S, Rajesh S, Britto PI, Sankaran S. MDHO: mayfly deer hunting optimization algorithm for optimal obstacle avoidance based path planning using mobile robots. Cybern Syst. 2023;1–20.
- 29. Ludl D, Gulde T, Curio C. Simple yet efficient real-time pose-based action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2019. p. 1–10.
- 30. Nejad SMM, Abbasi-Moghadam D, Sharifi A, Farmonov N, Amankulova K, Laszlz M. Multispectral crop yield prediction using 3D-convolutional neural networks and attention convolutional LSTM approaches. IEEE J Sel Top Appl Earth Obs Remote Sens. 2023;16:254–66.
- 31. Niyas S, Pawan S, Kumar MA, Rajan J. Medical image segmentation with 3D convolutional neural networks: A survey. Med Image Anal. 2022;493:397–413.
- 32. Elbaz K, Shaban WM, Zhou A, Shen S-L. Real time image-based air quality forecasts using a 3D-CNN approach with an attention mechanism. Chemosphere. 2023;333:138867. pmid:37156287
- 33. Arunnehru J, Chamundeeswari G, Bharathi SP. Human action recognition using 3D convolutional neural networks with 3D motion cuboids in surveillance videos. Elsevier. 2018;133:471–477.
- 34. Tlijani H, Jouila A, Nouri K. Optimized sliding mode control based on cuckoo search algorithm: Application for 2DF robot manipulator. Cybern Syst. 2023(1):1–17.
- 35. Lu Y, Zhu Y and Lu G. 3D SceneFlowNet: Self-supervised 3D scene flow estimation based on graph CNN. IEEE. 2021;3647–3651.
- 36. Alakwaa W, Nassef M, Badr A. Lung cancer detection and classification with 3D convolutional neural network (3D-CNN). Science and Information (SAI) Organization Limited. 2017.
- 37. Yang C, Rangarajan A, Ranka S. Visual explanations from deep 3D convolutional neural networks for Alzheimer’s disease classification. AMIA Annu Symp Proc. 2018;2018:1571–80. PMID: 30815203
- 38. Yang H, Yuan C, Li B, Du Y, Xing J, Hu W, et al. Asymmetric 3D convolutional neural networks for action recognition. Elsevier. 2019;85:1–12.
- 39. Zhang F, He X, Teng Q, Wu X, Dong X. 3D-PMRNN: Reconstructing three-dimensional porous media from two-dimensional image with recurrent neural network. Elsevier. 2022;208:109652.
- 40. Kim S, Kim T-S, Lee WH. Accelerating 3D convolutional neural network with channel bottleneck module for EEG-based emotion recognition. Sensors (Basel) 2022;22(18):6813. pmid:36146160
- 41. Ren BP, Wang ZY. Strategic focus, tasks, and pathways for promoting China’s modernization through new productive forces. J Xi’an Univ Finance Econom. 2024;1:3–11.
- 42. DeMatteo C, Jakubowski J, Stazyk K, Randall S, Perrotta S, Zhang R. The headaches of developing a concussion app for youth: Balancing clinical goals and technology. Int J E-Health Med Commun. 2024;15(1):1–20.
- 43. Falahzadeh MR, Farsa EZ, Harimi A, Ahmadi A, Abraham A. 3D convolutional neural network for speech emotion recognition with its realization on Intel CPU and NVIDIA GPU. IEEE. 2022;10:112460–71.
- 44. Ilesanmi AE, Ilesanmi T, Idowu OP, Torigian DA, Udupa JK. Organ segmentation from computed tomography images using the 3D convolutional neural network: a systematic review. Springer. 2022;11(3):315–31.
- 45. Hu T, Lei Y, Su J, Yang H, Ni W, Gao C, et al. Learning spatiotemporal features of DSA using 3D CNN and BiConvGRU for ischemic Moyamoya disease detection. Int J Neurosci 2023;133(5):512–22. pmid:34042552
- 46. Wei Z, Zhu Q, Min C, Chen Y, Wang G. Bidirectional hybrid LSTM based recurrent neural network for multi-view stereo. IEEE Trans Pattern Anal Mach Intell. 2022.
- 47. Faraji M, Nadi S, Ghaffarpasand O, Homayoni S, Downey K. An integrated 3D CNN-GRU deep learning method for short-term prediction of PM2.5 concentration in urban environment. Sci Total Environ. 2022;834:155324. pmid:35452742
- 48. Chen Q, Zhang Z, Lu Y, Fu K, Zhao Q. 3-D convolutional neural networks for RGB-D salient object detection and beyond. IEEE Trans Neural Netw Learn Syst 2024;35(3):4309–23. pmid:36099219
- 49. Almayyan WI, AlGhannam BA. Detection of kidney diseases: Importance of feature selection and classifiers. Int J E-Health Med Commun. 2024;15(1):1–21.
- 50. Deng H, Zheng Y, Chen J, Yu S, Xiao K, Mao X. Learning 3D mineral prospectivity from 3D geological models using convolutional neural networks: Application to a structure-controlled hydrothermal gold deposit. J Geochem Explor. 2022;161:105074.
- 51. Zheng R, Wang Q, Lv S, Li C, Wang C, Chen W, et al. Automatic liver tumor segmentation on dynamic contrast enhanced MRI using 4D information: deep learning model based on 3D convolution and convolutional LSTM. IEEE Trans Med Imaging 2022;41(10):2965–76. pmid:35576424
- 52. Cheung L, Wang Y, Lau AS, Chan RM. Using a novel clustered 3D-CNN model for improving crop future price prediction. Elsevier. 2023;260:110133.
- 53. Huang X, Liu J, Xu S, Li C, Li Q, Tai Y. A 3D ConvLSTM-CNN network based on multi-channel color extraction for ultra-short-term solar irradiance forecasting. Elsevier. 2023;272:127140.
- 54. Smith J, Doe J. Understanding the impact of climate change on marine biodiversity. Mar Ecol Prog Ser 2023;123(4):567–89.
- 55. Shi L, Zhang Y, Cheng J, Lu H. Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE. 2020;29:9532–45.
- 56. Lee J, Lee M, Lee D, Lee S. Hierarchically decomposed graph convolutional networks for skeleton-based action recognition. 2023.
- 57. Cheng K, Zhang Y, He X, Chen W, Cheng J, Lu H. Skeleton-based action recognition with shift graph convolutional network. 2020.
- 58. Li S, Yi J, Farha YA, Gall J. Pose refinement graph convolutional network for skeleton-based action recognition. IEEE Trans Pattern Anal Mach Intell. 2021;6(2):1028–35.
- 59. Wang T, Yu Z, Fang J, Xie J, Yang F, Zhang H. Multidimensional fusion of frequency and spatial domain information for enhanced camouflaged object detection. Elsevier. 2024;272:102871.
- 60. Wu Y, Zhang P, Gu M, Zheng J, Bai X. Embodied navigation with multi-modal information: A survey from tasks to methodology. Inform Fusion. 2024;112:102532.
- 61. Chen Z, Li S, Yang B, Li Q, Liu H. Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. IEEE. 2021;35(2):1113–22.
- 62. Peng W, Shi J, Varanka T, Zhao G. Rethinking the ST-GCNs for 3D skeleton-based human action recognition. Elsevier. 2021;454:45–53.
- 63. Li W, Liu X, Liu Z, Du F, Zou Q. Skeleton-based action recognition using multi-scale and multi-stream improved graph convolutional network. IEEE Trans Pattern Anal Mach Intell. 2020;8:144529–42.
- 64. Yang W, Zhang J, Cai J, Xu Z. Shallow graph convolutional network for skeleton-based action recognition. Sensors (Basel) 2021;21(2):452. pmid:33440785
- 65. Liu Z, Zhang H, Chen Z, Wang Z, Ouyang W. Disentangling and unifying graph convolutions for skeleton-based action recognition. 2020.
- 66. Cho S, Maqbool M, Liu F, Foroosh H. Self-attention network for skeleton-based human action recognition. 2020.
- 67. Bandi C, Thomas U. Skeleton-based action recognition for human-robot interaction using self-attention mechanism. IEEE Trans Robot. 2021.
- 68. Zhang J, Huang L, Bai X, Zheng J, Gu L, Hancock E. Exploring the usage of pre-trained features for stereo matching. Int J Comput Vis. 2024;1–22.
- 69. Mazzia V, Angarano S, Salvetti F, Angelini F, Chiaberge M. Action transformer: A self-attention model for short-time pose-based human action recognition. Elsevier. 2022;124:108487.
- 70. Xin W, Liu R, Liu Y, Chen Y, Yu W, Miao Q. Transformer for skeleton-based action recognition: A review of recent advances. J Comput Sci. 2023;537:164–86.
- 71. Rahevar M, Ganatra A, Saba T, Rehman A, Bahaj SA. Spatial–temporal dynamic graph attention network for skeleton-based action recognition. IEEE. 2023;11:21546–53.
- 72. Shi L, Zhang Y, Cheng J, Lu H. Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition. 2020.
- 73. Ibh M, Grasshof S, Witzner D, Madeleine P. TemPose: a new skeleton-based transformer model designed for fine-grained motion recognition in badminton. 2023.