
A histogram transformer approach using attention-based 3D residual network for human action recognition

  • Maojin Sun ,

    Roles Methodology, Software, Validation, Writing – original draft

    sunmaojin0427@163.com

    Affiliation CEICloud Data Storage Technology (Beijing) Co., Ltd., No. 15 Countyard on Kechuang Nine Road in Economic-Technological Development Area, Beijing, China

  • Luyi Sun

    Roles Validation, Visualization, Writing – review & editing

    Affiliation CEICloud Data Storage Technology (Beijing) Co., Ltd., No. 15 Countyard on Kechuang Nine Road in Economic-Technological Development Area, Beijing, China

Abstract

This paper proposes a lightweight video action recognition framework that integrates 3D Convolutional Neural Networks (CNNs), the Histogram Transformer Block (HTB), and the Split-Attention Residual Block (SAB), while also introducing Spatiotemporal Tensor Factorization (ST-Factor) technology in an innovative manner. The method first incorporates the HTB module into each computational unit of the AR3D backbone network to leverage local statistical features to improve the granularity of spatiotemporal modeling. Next, the SAB module is introduced into the residual path to utilize dynamic channel re-weighting for optimizing feature selection across dimensions. Finally, the ST-Factor decouples the 4D convolution kernels into independent spatial (H × W) and temporal (T × C) operations, significantly reducing computational redundancy. Experiments on the UCF101 and HMDB51 datasets demonstrate that the proposed method not only maintains real-time inference speed but also outperforms existing state-of-the-art (SOTA) methods in recognition accuracy, providing a new paradigm for video understanding research.

Introduction

Human Action Recognition (HAR) in videos is a core task in computer vision, supporting numerous applications such as intelligent surveillance, human-computer interaction, healthcare monitoring, sports analytics, and autonomous driving [1,2]. Compared to static image classification, HAR faces the unique challenge of simultaneously understanding both the spatial form (appearance) and temporal evolution (dynamics) of human actions, often in complex real-world conditions such as occlusions, viewpoint variations, background interference, and motion blur. Addressing these challenges requires spatiotemporal feature representations that are both highly expressive and robust.

Prior to the emergence of deep learning, Human Action Recognition (HAR) largely depended on hand-crafted features, such as Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), and Motion Boundary Histograms (MBH), in conjunction with trajectory encoding or dense optical flow [3]. Although these methods demonstrate efficacy in recognizing explicitly defined actions under controlled background conditions, they exhibit limited capacity for capturing profound semantic representations of human behaviors. Consequently, such approaches face challenges in generalizing to complex real-world video scenarios characterized by occlusions, viewpoint variations, cluttered backgrounds, and motion blur, ultimately leading to suboptimal performance.

Deep learning has instigated a paradigm shift in human action recognition (HAR). The seminal two-stream architecture exploited convolutional neural networks (CNNs) for video understanding by concurrently processing appearance (RGB frames) and motion (optical flow) modalities [4]. This established a foundation for subsequent advancements in spatiotemporal modeling. Three-dimensional CNNs (3D CNNs) subsequently gained prominence due to their capacity to learn unified spatiotemporal representations directly from raw video sequences in end-to-end frameworks. Notable implementations, including C3D [5], I3D [6], and R(2+1)D, achieved substantial performance improvements on benchmark datasets such as UCF-101 and HMDB-51. Concurrently, attention-based architectures (e.g., TimeSformer [7], Video Swin Transformer [8]) emerged to model long-range temporal dependencies, albeit at increased computational cost.

Despite the significant advancements in the field, several critical limitations persist. First, most existing models primarily focus on global semantic modeling, often neglecting local fine-grained motion cues that are essential for distinguishing subtle interactions, such as hand gestures or object manipulation [9]. Second, both 3D CNNs and transformer-based models typically experience high computational complexity, which restricts their deployment in real-time or resource-constrained environments. Third, action recognition systems continue to struggle with generalization across diverse domains and settings, particularly when training data is limited or exhibits significant variation in style, lighting, or viewpoint.

To address the aforementioned challenges, this paper proposes an innovative framework that integrates residual structures and attention mechanisms to enhance existing 3D CNN models for video action recognition. The key contributions include the design of an efficient 3D CNN backbone for capturing global spatiotemporal features, significantly improving the recognition of complex human actions. A novel 3D Local Aggregation Module (LAM) is introduced to enhance the model’s ability to learn fine-grained local interactions within spatiotemporal neighborhoods, which is particularly effective in recognizing dynamic interactions and object manipulation. Furthermore, by incorporating residual learning and a 3D attention mechanism, the model mitigates the vanishing gradient problem and dynamically focuses on critical spatiotemporal features, thereby improving both perceptual capability and accuracy. The application of spatiotemporal factorization to the LAM and residual network reduces computational complexity by decomposing spatial and temporal features into separate processing units, leading to enhanced recognition accuracy and efficiency. Extensive experiments on public datasets such as UCF101 and HMDB51 demonstrate that the proposed framework exhibits excellent cross-domain generalization, adapting robustly to varying data styles, lighting conditions, and viewpoints.

Related work

In this section, we provide a comprehensive survey of the dominant approaches and historical developments in human action recognition (HAR). We begin by reviewing traditional hand-crafted feature-based methods and then focus on deep learning-based paradigms, including two-stream CNNs, 3D CNNs, transformer-based architectures, graph convolutional networks, and reinforcement learning approaches. We also examine the evolution of residual learning and attention mechanisms specifically within the context of HAR. This systematic review, summarized in Table 1, highlights both the advancements and persistent challenges in spatiotemporal modeling, setting the foundation for our proposed method.

Table 1. Limitations of existing literature in human action recognition.

https://doi.org/10.1371/journal.pone.0333893.t001

Human action recognition methods

Driven by the accelerated progress in deep learning, human action recognition has become an important research direction in the field of computer vision. For human action recognition tasks specifically, deep neural networks (DNNs) excel at learning discriminative feature representations. This section surveys dominant human action recognition approaches and their historical development.

The two-stream CNN framework, introduced by Simonyan et al. [4], represents a significant milestone in action recognition research. This approach processes static RGB frames (capturing appearance information) and dense optical flow frames (capturing motion information) in parallel, subsequently fusing the classification scores from the two independent branches. Its key contribution lies in demonstrating that a temporal stream model, trained explicitly on optical flow sequences, substantially improves recognition accuracy [4,10,11]. However, the standard two-stream CNN exhibits limitations in modeling long-term temporal dependencies. To address this, Wang et al. [11] proposed the Temporal Segment Network (TSN), which employs sparse sampling to capture the global temporal structure of videos, thereby enhancing long-range modeling capabilities. Concurrently, Feichtenhofer et al. [10] identified insufficient interaction between spatial and temporal features within the two-stream architecture. They introduced a novel spatiotemporal fusion method designed to effectively integrate the feature representations from both modalities.

In addition to the Two-Stream CNN, the 3D Convolutional Neural Network (3D CNN) has emerged as another significant branch of human action recognition. The development of this domain began with the seminal work of Ji et al. [12], who first applied 3D convolution operations to videos, directly extracting features in the spatio-temporal domain. Tran et al. [5] designed the C3D framework, which systematically explored and validated the effectiveness of using 3D convolutional kernels to extract features in both spatial and temporal dimensions, significantly advancing this direction. To further enhance performance, Tran et al. [13] incorporated the ResNet architecture into C3D, resulting in the Res3D model. Although it achieved performance improvements, it also introduced the challenge of significantly increased model parameters. To address the complexity issue, Ullah et al. [14] innovatively combined CNNs with a deep bidirectional LSTM (DB-LSTM) network, proposing an advanced algorithmic framework. This model uses CNNs to extract local spatio-temporal features and employs DB-LSTM for long-term sequence dependency modeling, effectively improving the model’s performance in long-duration action recognition and achieving better performance than other state-of-the-art algorithms at the time.

In recent years, self-attention mechanisms have garnered increasing attention in the field of video action recognition. The Transformer model, introduced by Vaswani et al. [15], which has achieved remarkable success in natural language processing, has been adapted for video action recognition tasks. Lin et al. [16] incorporated the self-attention mechanism into video action recognition, enabling the model to more effectively capture spatio-temporal contextual information from video, thus enhancing recognition performance. Furthermore, the BERT model [17] has been applied to video action recognition, leveraging its bidirectional encoding capabilities to further refine the model’s spatio-temporal feature representation.

In video action recognition research, Graph Convolutional Networks (GCNs) are widely adopted due to their powerful relational modeling capabilities. Gao et al. [18] pioneered the construction of a Spatio-Temporal Graph Convolutional Network (ST-GCN), which significantly enhances the representational power of skeleton graph structures by explicitly modeling spatiotemporal dependencies between video frames. To address the challenge of long-range temporal modeling, Liang et al. [19] proposed an architecture integrating GCNs with Long Short-Term Memory (LSTMs). This design demonstrates advantages in capturing complex spatiotemporal dynamics. Furthermore, Deep Reinforcement Learning (DRL) has also been introduced into this field. For instance, the method proposed by Xu et al. [20] leverages unsupervised interaction with environments to iteratively optimize models, effectively enhancing model robustness and adaptability in unseen scenarios.

Residual network architecture

As the depth of neural network architectures increases, performance degradation and vanishing gradients become increasingly problematic. To address these challenges, He et al. [21] pioneered the residual learning framework. In recent years, the concept of residual learning has been extensively integrated into human action recognition research. For instance, Feichtenhofer et al. [10] developed a residual-based two-stream architecture, where spatial and temporal streams extract appearance and motion features respectively, subsequently fused to enhance recognition accuracy. Qiu et al. [22] explored factorized forms of 3D convolution, proposing diverse Pseudo-3D (P3D) modules. These modules decompose 3D convolution into 2D spatial convolutions and 1D temporal convolutions, integrated within residual blocks. Benefiting from the ability to leverage ImageNet pre-trained models via the 2D spatial convolution components, this design significantly improves model efficiency and accuracy. Another notable contribution is the Residual Attention Network introduced by Wang et al. [23], which shares conceptual similarities with this study by incorporating attention mechanisms into residual modules to augment feature selectivity. However, this network was primarily applied to image classification tasks and did not fully explore the requirements for temporal residual learning inherent in video data. Tran et al. [5] extended the ResNet structure to 3D space and devised the Res3D model, which extracts spatio-temporal features through 3D convolutions and combines residual learning for successful video feature extraction. However, this approach introduces a substantial increase in the number of parameters, making the model more complex. To reduce the computational load, Ullah et al. [24] integrated deep bidirectional LSTM (DB-LSTM) with residual learning, proposing a new human action recognition method that not only learns spatial features but also extracts deep features from long-term temporal sequences. Distinct from these approaches, this work aims to synergistically extract spatiotemporal features from video data by innovatively integrating 3D attention modules with 3D convolutional residual modules, thereby advancing the performance of human action recognition.

Attention mechanisms

The fundamental purpose of attention mechanisms [25–29] lies in simulating the human ability to filter information, allowing models to overcome information overload and allocate limited computational resources to the most valuable parts of the data. Early representative works include: Jaderberg et al. [30] with the Spatial Transformer Network (STN), which addresses the issue of insufficient robustness to key image regions by learning spatial transformations, significantly enhancing the model’s flexibility and robustness to geometric variations. Hu et al. [25] devised the channel attention mechanism (SENet), which focuses on optimizing the utilization of feature channels by adaptively weighting them, enabling the model to emphasize informative channels while suppressing redundant ones, achieving significant performance gains with minimal computational overhead. To capture both feature importance and its spatial context simultaneously [25,30], Woo et al. [26] developed CBAM, which innovatively combines channel and spatial attention. This hybrid design enables the model to more comprehensively understand “which features are important at which locations,” and its modular, low-coupling nature makes it easy to integrate into existing networks for end-to-end optimization. From spatial and channel-separable attention to hybrid attention, and further refined variants, attention mechanisms continue to drive progress in the performance and generalization of deep learning models by providing dynamic focus and feature calibration capabilities.

Methods

In this section, we introduce a novel architecture for spatio-temporal human action recognition designed to address the limitations of existing 3D convolutional models in capturing fine-grained temporal dynamics and discriminative feature representations. Traditional models, such as C3D [5] and Res3D [31], have shown success in spatio-temporal feature learning. However, C3D [5] suffers from a shallow architecture, which limits its ability to extract deep, high-level action features, especially in fine-grained or long-range action recognition tasks. Res3D [31], while incorporating residual connections to enable deeper feature learning, significantly increases the parameter count of the model, resulting in higher computational costs and reduced efficiency. To circumvent these inherent limitations, we propose a hybrid architecture that combines the strengths of both models. In this study, we integrate Histogram Transformer Blocks (HTBs) [32] with a ResNeSt block. The HTB module adaptively enhances informative temporal segments, thereby improving the model’s sensitivity to key motion patterns. Concurrently, the ResNeSt block performs dynamic feature selection across multiple channel groups, enhancing the diversity of feature representations and improving the discriminability across channels. Building upon these components, we propose a 3D Deeper HTB-ResNeSt model, which combines a 3D lightweight HTB module with the 3D ResNeSt architecture. This hybrid model achieves robust and efficient action recognition by leveraging the complementary strengths of the HTB and ResNeSt components. The general framework of the proposed model is shown in Fig 1.

Fig 1. The overall framework of 3D Deeper HTB-ResNeSt model.

https://doi.org/10.1371/journal.pone.0333893.g001

As illustrated in Fig 1, the operation sequence of the 3D HTB-ResNeSt model is as follows: First, the input layer receives an RGB image sequence consisting of n continuous frames or other input formats. These inputs are then passed to the feature extraction layer, which contains two key submodules: the 3D Local Context Aggregation Module (LCA-Module), responsible for capturing information from neighboring spatio-temporal regions, and the 3D Deeper Residual Feature Extraction Module (DRFE-Module), which focuses on learning deep spatio-temporal feature representations. The extracted features are then processed by the fully connected layer, which performs the fully connected operation along with Dropout regularization. Dropout works by randomly “masking” (deactivating) some neurons, reducing the connections between layers, and effectively mitigating model overfitting. Finally, the classification layer uses a Softmax classifier to map the learned features to a probability distribution over different human action categories, completing the multi-class human action recognition task.
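The stage sequence above can be sketched in PyTorch. This is a minimal illustration of the pipeline wiring only (convolutional feature extraction, a fully connected layer with Dropout, and a Softmax output); the layer sizes and the single placeholder convolution are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class ActionRecognizer(nn.Module):
    """Illustrative sketch of the Fig 1 stage sequence (sizes are placeholders)."""
    def __init__(self, num_classes=101, in_channels=3):
        super().__init__()
        # Feature extraction: stands in for the LCA- and DRFE-Modules.
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),   # collapse (T, H, W)
        )
        # Fully connected layer with Dropout regularization.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):                      # x: (B, C, T, H, W)
        logits = self.classifier(self.features(x))
        return torch.softmax(logits, dim=1)    # class probability distribution

probs = ActionRecognizer()(torch.randn(2, 3, 16, 112, 112))
print(probs.shape)  # torch.Size([2, 101])
```

Each output row sums to one, matching the Softmax classification layer described above.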

3D Local context aggregation module

In scenarios where video data quality is low or the time span is long, the AR3D model [33] may lose sensitivity to long-range temporal information, thereby compromising feature extraction performance. To address this issue, we propose the Local Context Aggregation Module (LCA-Module), which employs convolutional blocks to extract key features and integrates Histogram Transformer Blocks (HTBs) to capture complex spatiotemporal patterns and degradation-related factors. Additionally, the module incorporates skip connections to supplement the original input features, thereby enhancing robustness to low-quality data. To further improve the module’s adaptability to varying temporal scales across different action types, we introduce an adaptive temporal aggregation strategy inspired by the hybrid attention network [34]. This strategy enables the model to dynamically adjust its receptive field in the temporal dimension, effectively capturing both short-term motions and long-term dependencies. HTB, by analyzing the statistical distribution of spatiotemporal features, can adaptively focus on critical moments in the video sequence, enhancing the model’s ability to perceive important time segments. In contrast, SAB optimizes feature selection across different channels by segmenting and weighting the feature maps, helping the model improve recognition of action details in complex backgrounds [35].

Spatiotemporal factorized convolution.

Spatio-temporal factorization [22] is a technique that decomposes spatio-temporal data into spatial and temporal components, effectively reducing computational complexity while enhancing the model’s feature representation capacity. Specifically, spatio-temporal factorization decomposes the input 3D convolution kernel into spatiotemporal components, which not only reduces computation but also preserves the model’s capability to accurately quantify spatio-temporal dependencies.

In traditional 3D convolution, the convolution kernel is typically represented as a 3D tensor $W \in \mathbb{R}^{D \times H \times W}$, where D denotes the temporal dimension, and H and W correspond to the spatial dimensions. The convolution operation is generally expressed as follows:

$$Y = W * X \tag{1}$$

where X represents the input data, W denotes the 3D convolution kernel, $*$ denotes the 3D convolution operation, and Y signifies the output feature tensor.

To achieve spatio-temporal factorization, we decompose the 3D convolution kernel W into two low-rank tensors, $W_s$ and $W_t$, as follows:

$$W \approx W_t * W_s \tag{2}$$

where $W_s \in \mathbb{R}^{1 \times H \times W}$ represents the spatial component, while $W_t \in \mathbb{R}^{D \times 1 \times 1}$ represents the temporal component. This decomposition not only significantly reduces the number of parameters in the model but also accelerates the training process, thereby enhancing computational efficiency.
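A minimal PyTorch sketch of this factorization replaces one full D × k × k kernel with a 1 × k × k spatial convolution followed by a D × 1 × 1 temporal convolution. The channel widths below are illustrative assumptions; the parameter counts show the reduction the decomposition buys.

```python
import torch
import torch.nn as nn

def factorized_conv3d(c_in, c_out, depth=3, k=3):
    """Spatio-temporal factorized convolution: spatial (1,k,k) then temporal (depth,1,1)."""
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, kernel_size=(1, k, k),
                  padding=(0, k // 2, k // 2)),          # spatial (H x W) component
        nn.Conv3d(c_out, c_out, kernel_size=(depth, 1, 1),
                  padding=(depth // 2, 0, 0)),           # temporal (T) component
    )

full = nn.Conv3d(64, 64, kernel_size=3, padding=1)       # unfactorized baseline
fact = factorized_conv3d(64, 64)

n_full = sum(p.numel() for p in full.parameters())
n_fact = sum(p.numel() for p in fact.parameters())
print(n_full, n_fact)  # 110656 49280 -- the factorized form needs far fewer weights
```

Both forms preserve the input's spatio-temporal shape, so the factorized pair is a drop-in replacement for the full kernel.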

The factorization parameters, namely the temporal depth D and the spatial kernel size, are crucial for balancing computational efficiency and feature representation capacity. Inspired by robust feature decomposition techniques employed in tasks like saliency detection under uncertain environments [35], which emphasize the importance of multi-scale and bilateral feature processing, we conduct a systematic ablation study on these parameters, exploring several combinations of temporal depth and spatial kernel size. This analysis aims to identify the optimal factorization strategy that maintains robustness against environmental variations like motion blur and lighting changes, which are prevalent in real-world video data. The results, detailed in the ablation study section, demonstrate that a moderate temporal depth and spatial kernel size provide the best trade-off, effectively capturing both short-term motions and fine-grained spatial details without excessive computational overhead.

The principal strength of spatiotemporal factorization lies in its ability to efficiently characterize the intrinsic low-rank structure and local correlations within spatiotemporal data, while simultaneously alleviating computational bottlenecks inherent to conventional 3D convolution operations. This decoupling strategy proves particularly critical when processing massive video streams or high-dimensional time series. By disentangling spatial and temporal modeling within the feature dimension, it achieves optimized allocation of computational resources—notably FLOPs and memory consumption—while rigorously preserving the integrity and continuity of information throughout spatiotemporal evolution. Significantly, such feature disentanglement capability demonstrates unique value in resource-constrained scenarios, including non-stationary dynamic modeling and edge computing applications.

3D Histogram transformer block.

Although the HTB demonstrates significant effectiveness in modeling high-order statistical distributions of feature maps, its inherent 2D convolutional architecture fails to effectively capture the spatiotemporal correlations crucial for video data. To fully harness HTB’s capability for histogram-based feature representation in the spatiotemporal domain, this work proposes a novel 3D extension. By incorporating 3D convolutional kernels, the proposed 3D HTB substantially enhances its capacity for decoupled modeling of complex interdependencies between spatial and temporal dimensions. Inspired by the lightweight transformer-driven multi-scale trapezoidal attention network for saliency detection [36], we further optimize the 3D HTB by introducing a multi-scale trapezoidal attention mechanism to reduce computational complexity while preserving the ability to capture both local and global spatiotemporal dependencies. Consequently, the proposed 3D HTB variant achieves an effective balance between efficiency and representational power when processing volumetric video data streams. It retains the core advantage of histogram-based feature transformation while adapting seamlessly to multidimensional joint analysis in dynamic scenarios. Fig 2 presents a schematic illustration of the hierarchical topological structure of the 3D HTB.

Fig 2. The hierarchical topological structure of the 3D Histogram Transformer Block.

https://doi.org/10.1371/journal.pone.0333893.g002

HTB incorporates two crucial modules: Dynamic-range Histogram Self-Attention (DHSA) [32] and Dual-Scale Gated Feed-Forward (DGFF) [32]. These modules are designed to work in conjunction with layer normalization (LN) to enhance feature extraction. The interaction between these modules and LN is mathematically formulated as follows:

1. DHSA Module: The DHSA module mitigates covariate shift within the input feature tensor Fl−1 through LN. It employs Dynamic Range Convolution (DRC) to adaptively capture multi-scale spatial patterns. The core innovation lies in the Histogram-based Temporal Attention (HTA) mechanism. Departing from conventional dot-product attention, HTA reconstructs temporal dependencies by leveraging feature-value histograms to measure Wasserstein distances. This architecture achieves decoupled optimization for spatial and temporal processing. Consequently, it outputs deep-level features exhibiting enhanced geometric invariance and dynamic sensitivity.

$$\hat{F}_l = \mathrm{DHSA}\big(\mathrm{LN}(F_{l-1})\big) + F_{l-1} \tag{3}$$

2. DGFF Module: The DGFF module innovatively integrates a gated convolutional network with a deep feature processing mechanism. Its core innovation lies in employing dynamic gating to precisely regulate information flow, synergistically combined with deep convolution for efficient multi-scale feature extraction. This design significantly enhances the model’s ability to jointly model local details and global contextual dependencies, effectively balancing the granularity and relevance of feature representations.

$$F_l = \mathrm{DGFF}\big(\mathrm{LN}(\hat{F}_l)\big) + \hat{F}_l \tag{4}$$

In these equations, Fl denotes the feature at the l-th layer. The application of layer normalization (LN) stabilizes the learning process by mitigating issues such as vanishing or exploding gradients, thus ensuring more effective convergence during training.
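The pre-norm residual wiring of an HTB (layer normalization before each sub-module, with skip connections around both) can be sketched as follows. The DHSA and DGFF sub-modules are replaced here by plain linear layers, since only the block structure is being illustrated, not the attention internals.

```python
import torch
import torch.nn as nn

class HTBSketch(nn.Module):
    """Pre-norm residual wiring of an HTB; sub-modules are placeholders."""
    def __init__(self, dim):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.dhsa = nn.Linear(dim, dim)   # placeholder for the DHSA module
        self.dgff = nn.Linear(dim, dim)   # placeholder for the DGFF module

    def forward(self, f):                 # f: (B, N, dim) token features
        f = f + self.dhsa(self.ln1(f))    # attention sub-block + residual skip
        f = f + self.dgff(self.ln2(f))    # feed-forward sub-block + residual skip
        return f

out = HTBSketch(32)(torch.randn(2, 8, 32))
print(out.shape)  # torch.Size([2, 8, 32])
```

Normalizing before each sub-module (rather than after) keeps the residual path an identity map, which is what stabilizes gradients in deep stacks of such blocks.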

The core process of DHSA is as follows: First, dynamic-range convolution reorders the spatial distribution of the feature map. Next, the dual-path histogram self-attention mechanism utilizes these reordered features while integrating both global and local dynamic information for feature aggregation. To preserve the original spatial structure, the aggregated features are restored to their initial spatial locations. Finally, a 1 × 1 × 1 pointwise convolution performs the final output projection. In its 2D form, dynamic-range convolution uses a 3 × 3 depthwise convolution. To efficiently capture spatial and temporal dependencies in spatio-temporal data, such as video, we upgrade it to a 3D form and apply spatio-temporal factorization: using a 1 × 3 × 3 convolution to process the spatial dimension and a 3 × 1 × 1 convolution to process the temporal dimension. This decomposition strategy significantly enhances the model’s performance and computational efficiency.
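The reorder-process-restore idea behind dynamic-range convolution can be demonstrated in isolation: sort feature values, apply an operation to the sorted layout, then scatter the results back to the original positions. This toy sketch (the `op` argument is a stand-in for the actual convolution) checks that restoring after sorting is lossless.

```python
import torch

def reorder_process_restore(x, op):
    """Sort values, apply `op` on the sorted layout, undo the permutation."""
    flat = x.flatten()
    order = torch.argsort(flat)            # sorting permutation
    processed = op(flat[order])            # any op applied to the sorted layout
    restored = torch.empty_like(processed)
    restored[order] = processed            # scatter back to original positions
    return restored.view_as(x)

x = torch.tensor([[3.0, 1.0], [4.0, 2.0]])
# With the identity op, restoring after sorting must reproduce the input exactly.
y = reorder_process_restore(x, lambda t: t)
print(torch.equal(x, y))  # True
```

In the real module, `op` would be the factorized depthwise convolution, which then sees high- and low-intensity values grouped together rather than scattered across the frame.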

The process is mathematically formulated as:

$$F_{\mathrm{sort}} = \mathcal{R}\Big(\mathrm{Conv}_{3\times1\times1}\big(\mathrm{Conv}_{1\times3\times3}(F)\big)\Big) \tag{5}$$

$$F_{\mathrm{out}} = \mathrm{Conv}_{1\times1\times1}\Big(\mathcal{R}^{-1}\big(\mathrm{Attn}_{g}(F_{\mathrm{sort}}) \oplus \mathrm{Attn}_{l}(F_{\mathrm{sort}})\big)\Big) \tag{6}$$

where $\mathrm{Conv}_{1\times1\times1}$ is a 1 × 1 × 1 point-wise convolution, and $\mathrm{Conv}_{1\times3\times3}$ and $\mathrm{Conv}_{3\times1\times1}$ are spatio-temporal factorized convolutions, enabling more efficient capture of spatial and temporal dependencies in spatio-temporal data. This decomposition effectively enhances the model’s performance and computational efficiency. $\oplus$ is the concatenation operation along the channel dimension; $\mathcal{R}$ represents horizontal or vertical sorting operations, with $\mathcal{R}^{-1}$ restoring the original order; and $\mathrm{Attn}_{g}$ and $\mathrm{Attn}_{l}$ denote the global and local attention branches.

This method optimizes convolution operations within the dynamic range by arranging high/low-intensity pixels into a structured pattern along the diagonal corners of the matrix. This design enables the convolution kernel to both prioritize retaining distinct features and autonomously restore information in degraded regions.

Histogram Self-Attention (HSA) is an enhanced mechanism designed for modeling complex patterns in spatio-temporal data, whose core principle involves partitioning the feature space into multiple histogram bins (buckets) to reinforce key feature extraction through selective attention allocation. In the specific implementation, the output of the dynamic range convolution is decoupled into the value features V and two pairs of query-key features , which are fed into dual processing branches. These branches employ parallel Bucket Histogram Reconditioning (BHR) and Frequency Histogram Reconditioning (FHR) modules to extract global structural features and local frequency-domain details, respectively; the resulting features subsequently undergo self-attention computation and are fused via element-wise multiplication. This architecture significantly enhances the model’s capability to represent cross-granularity information through the synergistic integration of multi-scale feature reconditioning and attention mechanisms. The structure of the 3D DHSA is shown in Fig 3.
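The bin-partitioning step at the heart of HSA can be shown in a toy example: tokens are sorted by value and grouped into equal-size bins, so attention can later be computed within each bin instead of globally. The equal-size binning here is an illustrative assumption, not the paper's exact bucketing rule.

```python
import torch

def bucket_by_value(values, num_bins):
    """Sort 1-D token values and split the index order into equal-size bins."""
    order = torch.argsort(values)   # token indices, ascending by value
    return order.view(num_bins, -1) # one row of token indices per bin

vals = torch.tensor([0.9, 0.1, 0.5, 0.7, 0.2, 0.4])
bins = bucket_by_value(vals, num_bins=2)
print(bins)  # low-value tokens in row 0, high-value tokens in row 1
```

Because tokens in a bin share a similar intensity range, per-bin attention concentrates computation on statistically comparable features, which is the selective-allocation idea described above.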

Fig 3. The structure of the 3D dynamic-range histogram self-attention (DHSA).

https://doi.org/10.1371/journal.pone.0333893.g003

The Dual-Scale Gated Feed-Forward (DGFF) module strengthens local context modeling by synergistically integrating two multi-range, multi-scale depthwise convolution pathways. We decompose the computationally intensive standard convolution and dilated convolution into spatio-temporal separable convolutions (Eq 2), significantly reducing computational complexity. As shown in Fig 4, this module employs a gating mechanism to dynamically fuse the features from both pathways, and the complete fusion process can be formalized as follows:

$$X = \mathrm{Conv}_{1\times1\times1}\big(\mathrm{LN}(\hat{F}_l)\big) \tag{7}$$

$$X_1 = \mathcal{PS}\Big(\mathrm{Conv}_{3\times1\times1}\big(\mathrm{Conv}_{1\times3\times3}(\mathcal{PU}(X))\big)\Big) \tag{8}$$

$$X_2 = \mathrm{Conv}_{3\times1\times1}\big(\mathrm{Conv}_{1\times3\times3}(X)\big) \tag{9}$$

$$F_{l+1} = \mathrm{Conv}_{1\times1\times1}\big(\phi(X_1)\odot X_2 + \phi(X_2)\odot X_1\big) \tag{10}$$

where Conv denotes spatio-temporal factorized convolution, $\mathcal{PS}$ and $\mathcal{PU}$ represent pixel-shuffling and unshuffling operations, respectively, $\phi$ [37] denotes the Mish activation function, $\odot$ is element-wise multiplication, and $F_{l+1}$ is the output of the current stage passed to the (l+1)-th stage.

thumbnail
Fig 4. The structure of 3D dual-scale gated feed-forward.

https://doi.org/10.1371/journal.pone.0333893.g004
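The dual-path gated fusion can be sketched in PyTorch as two factorized depthwise pathways whose outputs cross-gate each other through a Mish activation. The kernel sizes (3 and 5) and channel width are illustrative assumptions, and pixel shuffling is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFFSketch(nn.Module):
    """Dual-path gated feed-forward sketch (kernel sizes are assumptions)."""
    def __init__(self, c):
        super().__init__()
        def path(k):  # spatio-temporal separable depthwise convolution pathway
            return nn.Sequential(
                nn.Conv3d(c, c, (1, k, k), padding=(0, k // 2, k // 2), groups=c),
                nn.Conv3d(c, c, (3, 1, 1), padding=(1, 0, 0), groups=c),
            )
        self.p1, self.p2 = path(3), path(5)   # two receptive-field scales
        self.proj = nn.Conv3d(c, c, 1)        # point-wise output projection

    def forward(self, x):                     # x: (B, C, T, H, W)
        a, b = self.p1(x), self.p2(x)
        # Cross-gating: each pathway modulates the other via Mish.
        fused = F.mish(a) * b + F.mish(b) * a
        return self.proj(fused) + x           # residual connection

out = GatedFFSketch(16)(torch.randn(1, 16, 4, 8, 8))
print(out.shape)  # torch.Size([1, 16, 4, 8, 8])
```

The symmetric gating lets each scale decide, per position, how much of the other scale's features to pass through, which is the dynamic fusion role the text assigns to the gate.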

Mechanism integration.

The 3D Local Context Aggregation Module first employs shallow convolution operations to capture the basic motion features and spatial patterns in the video sequence, extracting local spatio-temporal information through the convolutional layers. Subsequently, HTB refines these features by incorporating a histogram-based temporal attention mechanism, which prioritizes the most relevant time segments, thus enhancing the model’s ability to capture critical motion patterns.

As shown in Fig 1, the module is composed of four identical blocks connected in series. Each block first processes the input image sequence using 3D convolution and 3D pooling operations to capture basic spatial and temporal dynamic information, forming shallow spatio-temporal features. These features are subsequently passed into the core 3D HTB module for further processing, which is specifically designed to extract dynamic degradation information in the spatial domain, while enhancing the aggregation of both adjacent contextual features and global topological features, and enriching multi-scale feature representations, thus significantly improving the subsequent image restoration process. Finally, within the block, a skip connection structure is used to pass information to the subsequent layers. This connection integrates operations such as average pooling, pixel-wise convolution, and depthwise convolution.

3D Deeper residual feature extraction module

To overcome the limitations of traditional 3D convolutional networks in spatio-temporal feature extraction, long-range dependency modeling, and multi-scale information fusion, we propose the application of a Split-Attention Block-based residual network for 3D video human action recognition. Although traditional 3D convolutional networks can capture spatio-temporal features, they often face several issues when handling complex action patterns: 1) inability to effectively extract multi-scale features, 2) difficulty in modeling long-range dependencies, 3) fixed-scale convolution operations that cannot adapt to the spatio-temporal information of different actions, and 4) high computational cost when processing high-dimensional feature maps. To address these challenges, we propose a residual network based on the Split-Attention Block, which integrates feature-map grouping with split attention mechanisms, long-range dependency modeling, dynamic weighted fusion, and a computationally efficient design. This approach significantly enhances the model’s representational power and efficiency, particularly for 3D video human action recognition tasks.

3D Split-attention block.

The workflow of the Split-Attention Block [38], whose structure is shown in Fig 5, is as follows: the feature map is first divided into groups, and attention weights are then computed within each group. This allows the module to focus more precisely on information from different feature subspaces, significantly improving the efficiency of extracting complex spatiotemporal features.

Feature-map grouping and cardinality. The feature map is divided into K groups, each of which is further split into R parts. Therefore, the total number of feature groups G is:

G = K \times R \qquad (11)

Each group undergoes a transformation F_i(X), generating an intermediate representation U_i.

Split attention within cardinal groups. The representation of each cardinal group is fused by element-wise summation to obtain the combined representation:

\hat{U}^k = \sum_{r=1}^{R} U_{R(k-1)+r} \qquad (12)

where \hat{U}^k \in \mathbb{R}^{H \times W \times C/K} represents the combined output of the k-th cardinal group. Here, H, W, and C denote the height, width, and number of channels of the feature map, respectively.

Global contextual information and channel-wise statistics. Global contextual information is gathered by applying global average pooling across the spatial dimensions. The pooled result for the c-th channel of the k-th cardinal group is calculated as:

s_c^k = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \hat{U}_c^k(i, j) \qquad (13)

This pooled result computes the global contextual information for each channel, which is used for subsequent weighted fusion.

Weighted fusion using soft attention. Global contextual information is weighted and fused to obtain the final representation V^k of the cardinal group, calculated channel-wise as:

V_c^k = \sum_{i=1}^{R} a_i^k(c) \, U_{R(k-1)+i,\,c} \qquad (14)

where a_i^k(c) is the soft attention weight for each split, calculated as:

a_i^k(c) = \frac{\exp\!\big(\mathcal{G}_i^c(s^k)\big)}{\sum_{j=1}^{R} \exp\!\big(\mathcal{G}_j^c(s^k)\big)} \qquad (15)

Here, \mathcal{G}_i^c determines the weight for each split based on the global context representation s^k.

Furthermore, the dynamic channel re-weighting mechanism embedded within the Split-Attention Block demonstrates notable adaptability to action categories of varying complexity. For simple actions (e.g., “walking” or “sitting”), the mechanism tends to emphasize broad spatial-temporal patterns, efficiently allocating attention across global contexts. In contrast, for complex or fine-grained actions (e.g., “playing violin” or “typing”), it selectively amplifies discriminative local features and temporal segments, thereby enhancing the model’s sensitivity to subtle motion variations. This adaptive behavior is facilitated by the soft attention weights aki(c), which are dynamically computed based on global contextual statistics sk, enabling category-aware feature recalibration without explicit supervision. Such capability ensures robust performance across diverse action types, from high-motion sports to delicate hand-object interactions.
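As a concrete illustration, the fusion steps of Eqs. (12)–(15) can be sketched in a few lines of NumPy. The random dense maps standing in for the transforms G_i, and all shapes and seeds, are illustrative assumptions rather than the paper's exact layers.

```python
import numpy as np

def split_attention(splits, rng):
    """splits: list of R arrays, each (T, H, W, C) -- the splits of one cardinal group."""
    R = len(splits)
    U_hat = np.sum(splits, axis=0)                   # Eq. (12): element-wise sum
    s = U_hat.mean(axis=(0, 1, 2))                   # Eq. (13): global average pooling -> (C,)
    C = s.shape[0]
    W_g = rng.standard_normal((R, C, C)) * 0.01      # toy stand-ins for the transforms G_i
    logits = np.stack([W @ s for W in W_g])          # (R, C)
    a = np.exp(logits) / np.exp(logits).sum(axis=0)  # Eq. (15): softmax across splits
    V = sum(a[i] * splits[i] for i in range(R))      # Eq. (14): attention-weighted fusion
    return V, a

rng = np.random.default_rng(0)
splits = [rng.standard_normal((4, 8, 8, 16)) for _ in range(2)]  # R = 2 splits, T = 4 frames
V, a = split_attention(splits, rng)
assert V.shape == (4, 8, 8, 16)            # fused representation keeps the split shape
assert np.allclose(a.sum(axis=0), 1.0)     # weights sum to 1 per channel
```

Because the weights are normalized across splits per channel, the fusion is a convex combination of the split responses, which is what lets the block emphasize one subspace without inflating the feature magnitude.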

3D Cardinal block.

The cardinal operation divides the feature map into multiple cardinal groups, allowing each group to independently process specific feature channels. By utilizing group convolutions, it reduces computational cost, while the Split-Attention mechanism computes independent attention weights for each group, thereby enhancing feature extraction accuracy. This approach significantly improves computational efficiency, especially on large-scale and high-dimensional datasets, and enhances the model’s representational power by aggregating features from different groups. The cardinal operation also improves the model’s robustness, enabling it to effectively handle multi-view and spatio-temporal data, ensuring strong performance across complex tasks. The specific structure diagram is shown in Fig 6.

3D Residual block.

Starting from the shallow features of the 3D LCA module, we developed an enhanced 3D residual module to extract richer deep spatiotemporal features. To increase the network depth and mitigate the vanishing gradient problem, this module innovatively integrates the Split-Attention block into the residual structure of AR3D. This combination ensures that the core advantages of residual connections are preserved while optimizing feature extraction accuracy and computational efficiency through the split-attention mechanism.

The representations of the cardinal groups are concatenated along the channel dimension:

V = \mathrm{Concat}\{V^1, V^2, \ldots, V^K\} \qquad (16)

Similar to standard residual blocks, the final output Y of the Split-Attention block is obtained through element-wise addition of the output of the identity-mapping path, X, and the output of the residual path, F(X), which incorporates the Split-Attention mechanism: Y = F(X) + X. This directly inherits the core structure of standard residual blocks while enriching the residual path F(X) through grouping, splitting, and attention mechanisms.

Training strategy

Distributed training.

Given the inherent challenges posed by video data, particularly its dense spatio-temporal features that demand substantial computational and storage resources, single-machine training quickly reveals its limitations. A slight increase in model size or batch size results in memory constraints, especially with GPU memory, while frequent data transfers and the computational bottleneck of a single node make training cycles increasingly untenable. To overcome these challenges, we adopted a distributed training strategy. Specifically, we constructed a computing cluster composed of 8 high-performance servers, with a total of 64 GPUs, enabling data and model parallelism in tandem. This collaborative computing model not only effectively alleviates the memory pressure caused by the large model parameters and activations, but also parallelizes the vast computational tasks, significantly accelerating the training process.

Learning rate adjustment strategy.

In the context of distributed training, the learning rate cannot be regarded as a static parameter that can be arbitrarily set; instead, it functions as a sensitive adjustment mechanism throughout the model’s convergence process. To optimize this, we utilize Cosine Annealing as the core scheduling strategy. This dynamic decay approach is particularly effective in the later stages of training, where it helps mitigate the risk of excessive fluctuations caused by high learning rates, particularly on flat loss surfaces. As a result, it significantly reduces the occurrence of gradient anomalies in these stages. A key advantage of distributed training is its ability to handle large global batch sizes, typically achieved by aggregating mini-batches across multiple GPUs. In this regard, we strictly adhere to the heuristic principle of linearly scaling the learning rate with respect to the mini-batch size B, ensuring proper adjustment in response to varying batch sizes. The initial learning rate can be calculated using the following formula:

\eta = \eta_{\text{base}} \times \frac{B}{256} \qquad (17)

In this context, B represents the mini-batch size, and \eta_{\text{base}} is the base learning rate, which is set to 0.1.
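The schedule described above can be sketched as follows; the 256-sample reference batch in the scaling rule and the zero floor of the cosine schedule are assumptions based on the common linear-scaling heuristic, not values stated explicitly in the paper.

```python
import math

def initial_lr(batch_size, base_lr=0.1, ref_batch=256):
    # Linear scaling rule: grow the LR proportionally with the global batch size.
    return base_lr * batch_size / ref_batch

def cosine_annealed_lr(lr0, epoch, total_epochs, lr_min=0.0):
    # Cosine annealing from lr0 down to lr_min over the training run.
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

lr0 = initial_lr(batch_size=512)                           # 0.1 * 512 / 256 = 0.2
assert abs(lr0 - 0.2) < 1e-12
assert abs(cosine_annealed_lr(lr0, 0, 100) - lr0) < 1e-12  # starts at the scaled LR
assert cosine_annealed_lr(lr0, 100, 100) < 1e-12           # decays to ~0 at the end
```

The slow flattening of the cosine curve near the end of training is what damps late-stage gradient fluctuations, as noted above.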

Regularization.

Despite the rich spatiotemporal information provided by large-scale video data, we observe a pronounced tendency for deep networks to overfit. To mitigate the model’s dependency on specific spatiotemporal patterns, we propose Spatiotemporal DropPath regularization. This approach incorporates two structured noise injection strategies:

  • Spatial DropPath: Randomly dropping feature channels along the spatial dimension with a probability of 0.3, compelling the network to establish redundant feature representations;
  • Temporal DropPath: Continuously dropping sequences of 4 frames along the temporal dimension with a probability of 0.2, explicitly simulating real-world frame loss artifacts.

This structured regularization yields dual benefits: it significantly enhances the model’s robustness to input information loss and, unexpectedly, also accelerates convergence.
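A minimal sketch of the two DropPath variants, assuming a (T, H, W, C) feature tensor; the probabilities (0.3 and 0.2) and the 4-frame span follow the text, while the tensor shapes and helper names are illustrative.

```python
import numpy as np

def spatial_droppath(x, p=0.3, rng=None):
    # Zero out whole feature channels with probability p (last axis = channels).
    rng = rng or np.random.default_rng()
    keep = rng.random(x.shape[-1]) >= p
    return x * keep

def temporal_droppath(x, p=0.2, span=4, rng=None):
    # With probability p, zero out a contiguous run of `span` frames (first axis = time).
    rng = rng or np.random.default_rng()
    x = x.copy()
    if rng.random() < p:
        t0 = rng.integers(0, max(1, x.shape[0] - span + 1))
        x[t0:t0 + span] = 0.0
    return x

rng = np.random.default_rng(0)
x = np.ones((16, 8, 8, 32))                       # (T, H, W, C)
y = temporal_droppath(spatial_droppath(x, rng=rng), rng=rng)
assert y.shape == x.shape                          # regularization never changes shape
```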

Concurrently, we apply selective weight decay: it is imposed exclusively on the weights of convolutional and fully-connected layers, while exempting bias terms and the γ/β parameters of batch normalization layers. Constraining the latter could impair their feature-distribution calibration capability, whereas bias terms contribute minimally to overfitting. The Spatiotemporal DropPath and this selective weight decay form a complementary defense mechanism: the former enhances the structural resilience of the model, while the latter ensures optimization stability. Together, they drive the network to learn more generalizable spatiotemporal representations from video data, leading to systematic performance improvements across multiple downstream tasks.

Data augmentation.

The generalization performance of video action recognition is fundamentally constrained by the spatio-temporal completeness of the training data, while real-world challenges such as lighting variations, occlusions, and motion continuity pose significant obstacles. By introducing a diverse range of training samples, data augmentation techniques help alleviate the risk of overfitting. To address the inherent spatio-temporal characteristics of video data, we designed a comprehensive augmentation strategy that spans the spatial, temporal, and spatio-temporal dimensions. This strategy ensures that the model can maintain consistent recognition performance even in the high-variance space introduced by augmentation, naturally mitigating the barriers to cross-scene and cross-content generalization.

1. Spatial Augmentation

To establish appearance invariance in video action recognition, we design a spatial augmentation quartet.

  • Multi-view Feature Decoupling: We apply random cropping with a 224 × 224 pixel window and random position shifts, forcing the model to build global action semantics from local viewpoints and thereby breaking the dependency on background information.
  • Euler Angle Perturbation Modeling: We introduce random rotations in small angular increments (step size 2°) and horizontal flipping with a probability of 0.5. This simulates the uncertainty in the pose of handheld devices.
  • Lighting Invariance Learning: In the CIE Lab color space, we implement jittering of brightness and chroma, covering 90% of natural lighting variation.
  • Scale Space Generalization: We apply random scaling followed by cubic interpolation for reconstruction, ensuring a continuous response to variations in the distance of moving targets.
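The cropping and flipping transforms above can be sketched for a (T, H, W, C) clip; the 224 × 224 window and the 0.5 flip probability follow the text, while applying the same crop window to every frame is an assumption made here to preserve temporal coherence.

```python
import numpy as np

def random_crop_flip(clip, size=224, flip_p=0.5, rng=None):
    # Crop the same spatial window from every frame, then possibly mirror it.
    rng = rng or np.random.default_rng()
    T, H, W, C = clip.shape
    y = rng.integers(0, H - size + 1)
    x = rng.integers(0, W - size + 1)
    out = clip[:, y:y + size, x:x + size, :]
    if rng.random() < flip_p:
        out = out[:, :, ::-1, :]                   # horizontal flip
    return out

clip = np.zeros((8, 256, 320, 3))                  # (T, H, W, C)
out = random_crop_flip(clip, rng=np.random.default_rng(0))
assert out.shape == (8, 224, 224, 3)
```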

2. Temporal Augmentation

By manipulating the temporal properties of video sequences, this method significantly enhances the model’s adaptability to variations in action speed, improving its ability to model the dynamic evolution of actions, and thereby enhancing its overall performance in action recognition tasks. The core augmentation strategies are as follows:

  • Temporal Resampling: Frame rate scaling techniques (e.g., cubic spline interpolation for 1.2x acceleration or 0.8x deceleration) are employed to construct a continuous distribution space for action speeds. This transformation forces the convolution kernels to learn temporal invariance in motion features, effectively overcoming issues related to non-stationary motion variance in real-world scenarios.
  • Phase Perturbation: Random temporal truncation is applied. By introducing artificial temporal asynchrony, this strategy strengthens the model’s fault tolerance to variations in the initial phase of actions, simulating clock drift effects encountered in multi-device data collection.
  • Inter-frame Dropping: Based on a Bernoulli process (p = 0.2), four consecutive frames are dropped to model packet loss in the transmission layer or visual information degradation due to motion blur. This mechanism drives the spatio-temporal encoder to learn feature reconstruction strategies, maintaining the ability to extract invariant discriminative features under conditions of missing information.
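Temporal resampling can be sketched with simple index arithmetic; for brevity this uses nearest-frame lookup rather than the cubic spline interpolation mentioned above, so it is an approximation of the described strategy rather than the exact implementation.

```python
import numpy as np

def resample_speed(clip, factor):
    # factor > 1 plays the action faster, factor < 1 slower, keeping clip length fixed.
    T = clip.shape[0]
    src = np.clip(np.round(np.arange(T) * factor), 0, T - 1).astype(int)
    return clip[src]

clip = np.arange(10)[:, None]          # 10 "frames", each storing its own index
fast = resample_speed(clip, 1.2)       # 1.2x acceleration
slow = resample_speed(clip, 0.8)       # 0.8x deceleration
assert fast.shape[0] == 10 and slow.shape[0] == 10
assert fast[-1, 0] == 9                # the accelerated clip reaches the last frame early
```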

3. Spatio-Temporal Augmentation

To enhance the model’s generalization ability to spatio-temporal heterogeneity, this study proposes a collaborative data augmentation strategy that applies structured perturbations simultaneously in both the spatial and temporal domains:

  • Spatio-Temporal Random Path Dropout: In the spatial domain, channel-level random zeroing is performed with a probability of p = 0.3, simulating feature degradation caused by sensor channel failure. In the temporal domain, block frame masking is applied with a probability of p = 0.2 and T = 4 frames, reconstructing temporal discontinuities caused by transmission packet loss or motion artifacts. This mechanism forces the model to learn robust representations under partially observable conditions, significantly improving its tolerance to spatio-temporal noise.
  • Spatio-Temporal Joint Deformation: In the spatial domain, affine transformations (e.g., translation by up to 10% of the image size and rotation by up to 15°) are applied to induce viewpoint invariance. Simultaneously, in the temporal domain, frame skipping is implemented using a Poisson distribution to achieve non-uniform resampling, simulating fragmented action observations. This strategy effectively models the mixed effects of camera pose drift and asynchronous sampling, enhancing the model’s adaptability to appearance variations and temporal fragmentation, ultimately improving action recognition accuracy across different viewpoints.

Layer-freezing strategy.

While the integration of multiple HTB and SAB modules enhances feature extraction, it also introduces significant computational overhead during training. To mitigate this burden without compromising the effectiveness of the attention mechanisms, we adopt a layer-freezing strategy inspired by [39]. This approach is based on the observation that shallow layers in a CNN typically learn general-purpose features, while deeper layers learn more task-specific features.

Our implementation follows a three-phase progressive freezing protocol:

Phase 1 (Warm-up): The first 5 training epochs are dedicated to warming up all layers of the network with a low learning rate, allowing the shallow layers to stabilize.

Phase 2 (Shallow Freezing): After the warm-up, we freeze the parameters of the first two residual blocks in the AR3D backbone network. These frozen layers continue to perform forward passes but are excluded from backward gradient computation and parameter updates, reducing training-time FLOPs by approximately 40%.

Phase 3 (Deep Training): The remaining unfrozen layers, which include all subsequent residual blocks and the critical attention modules (HTB and SAB), continue to be trained for the remainder of the epochs. This ensures that the high-level, task-specific spatiotemporal representations and the attention weights are finely optimized.

This strategy effectively reduces the computational cost and memory footprint during the majority of the training process, making the approach more feasible for resource-constrained environments, while preserving the performance of the attention-enhanced deep layers.
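The three-phase protocol can be sketched with a toy list of blocks carrying trainable flags (in Keras this corresponds to setting a layer's `trainable` attribute); the block structure and helper name are illustrative, while the 5-epoch warm-up and the two frozen blocks follow the text.

```python
def set_freeze_plan(blocks, epoch, warmup_epochs=5, frozen_prefix=2):
    """blocks: ordered list of dicts with a 'trainable' key (toy stand-in for layers)."""
    for i, blk in enumerate(blocks):
        if epoch < warmup_epochs:
            blk["trainable"] = True                 # Phase 1: warm up every layer
        else:
            blk["trainable"] = i >= frozen_prefix   # Phases 2-3: freeze the first blocks
    return blocks

blocks = [{"name": f"res{i}", "trainable": True} for i in range(5)]
set_freeze_plan(blocks, epoch=3)
assert all(b["trainable"] for b in blocks)          # warm-up trains all layers
set_freeze_plan(blocks, epoch=10)
assert [b["trainable"] for b in blocks] == [False, False, True, True, True]
```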

Experiment

Experimental setup

Datasets.

The experiments were conducted using two widely recognized benchmark datasets in the field of human action recognition: UCF101 [40] and HMDB51 [41]. These datasets are standard for evaluating the performance of action recognition models and are known for their comprehensive action category systems and diverse scene representations. They provide a robust foundation for assessing model performance. The specific characteristics of the data are as follows:

UCF101 dataset. As a cornerstone benchmark in the field of video action recognition, UCF101 is chosen as one of the primary validation platforms for this study due to its systematic category structure and real-world scene complexity. The dataset encompasses 101 action classes, which strictly follow human behavioral classification guidelines, and is divided into five mutually exclusive groups: (1) Basic Actions, accounting for 26.3% of the total samples, to test the model’s ability to capture atomic behaviors; (2) Interpersonal Interactions, containing 17 action classes requiring spatial-temporal coordination between two people, to validate the model’s capability in social behavior analysis; (3) Human-Computer Interaction, 24 action classes involving object manipulation, assessing the model’s inference ability for tool-related tasks; (4) Instrument Playing, consisting of 9 fine motor hand movements, to evaluate the model’s precision in recognizing micro-motions; (5) Sports Activities, with 31 high-dynamic actions, whose average motion blur intensity is 1.7 times higher than that of HMDB51, forming the most challenging subset. The 13,320 videos in this dataset are sourced entirely from real-world internet videos, as opposed to laboratory recordings.

HMDB51 dataset. As a benchmark in the field of action recognition, HMDB51 constructs a challenge environment that closely approximates real-world conditions through complex movie scenes (71% of the samples are sourced from Hollywood films). Its 51 action categories are divided into five dimensions, providing a comprehensive validation system: (1) Facial micro-actions (e.g., chewing/laughing), which cover 12 types of micro-expression variations, with an average eye-lip displacement of only 5–7 pixels; (2) Object interaction actions (e.g., smoking/drinking), comprising 19 types of hand-held object manipulations, with 43% of samples exhibiting partial occlusion; (3) Full-body dynamic actions (e.g., handstands/flips), consisting of 9 types of high-frame displacement movements, with an average optical flow intensity 2.3 times higher than that in UCF101; (4) Human-object interactions (e.g., fencing/archery), featuring 7 types of tool usage scenarios that require parsing of the spatiotemporal coupling relationship between tools and human bodies; (5) Conflicting interactions (e.g., pushing/hugging), with 4 types of antagonistic behaviors, where the background confusion factor is 1.8 times greater than that of comparable samples in UCF101.

These two datasets are extensively utilized in video action recognition research, offering standardized platforms for the comparison and evaluation of model performance. Their widespread use provides a reliable basis for benchmarking, enabling fair assessments of different models’ effectiveness in action recognition tasks.

Integrated evaluation metrics and validation strategies.

This study employed a comprehensive set of evaluation metrics including Accuracy, Precision, Recall, F1-Score, and Confusion Matrix to systematically assess the model’s robustness and efficacy.
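For reference, the listed metrics reduce to simple counts; a binary sketch is shown below. The paper's multi-class scores would additionally be averaged over the 101 or 51 classes, a detail assumed here rather than stated in the text.

```python
def prf1(y_true, y_pred, positive=1):
    # Precision, recall, and F1 for a single positive class, from raw labels.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = prf1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
assert abs(p - 2 / 3) < 1e-9 and abs(r - 2 / 3) < 1e-9 and abs(f - 2 / 3) < 1e-9
```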

In addition to the aforementioned evaluation metrics, the effectiveness of the algorithm model was also comprehensively assessed using other validation strategies.

Comparison with state-of-the-arts. We compared the proposed model with current state-of-the-art models. This comparison establishes a benchmark for the proposed model and highlights its advantages or improvements over existing methods, helping to demonstrate its performance in the field of action recognition.

Cross-validation. We employed the K-fold cross-validation method [42] to conduct multiple experiments, ensuring the stability and robustness of the model while minimizing the impact of uneven data partitioning or random factors on the evaluation results. Specifically, we first divided the entire dataset into K subsets. In each iteration, K–1 subsets were selected for training, while the remaining subset served as the validation set for testing. This process was repeated K times, with each subset used as the validation set exactly once. The final model performance was then calculated as the average of the results obtained across all K iterations.
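The K-fold procedure can be sketched as an index generator; the round-robin assignment of samples to folds below is an illustrative choice, and a real run would shuffle or stratify by action label.

```python
def kfold_indices(n, k):
    # Yield (train, val) index lists; each sample validates exactly once.
    folds = [list(range(i, n, k)) for i in range(k)]     # round-robin assignment
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

splits = list(kfold_indices(10, 5))
assert len(splits) == 5
assert sorted(i for _, val in splits for i in val) == list(range(10))
for train, val in splits:
    assert not set(train) & set(val)                     # train/val never overlap
```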

Evaluation on multiple datasets. The model validation was conducted using two heterogeneous datasets: HMDB51 and UCF101. This design was driven by three main considerations: First, the complex scenarios in HMDB51 (which involve background interference and occlusion) complement the large-scale sample size of UCF101 (with over 13,000 videos), ensuring that the model encounters a wide range of edge cases. Second, testing on two datasets helps mitigate the overfitting risk that could arise from the specific data distribution of a single dataset. Finally, this rigorous validation approach produces accuracy metrics with strong generalization power, providing a quantifiable assurance of reliability for real-world deployment.

Implementation details.

In this study, the experimental platform was established on the Ubuntu 18.04 LTS operating system, with hardware consisting of an Intel Xeon E5-2620v4 CPU and an NVIDIA GeForce RTX 3090 Ti GPU (24GB GDDR6X memory). Sublime Text 3 was selected as the development environment, with the algorithm implementation built on Python 3.7.9. For library versions, OpenCV 4.5.2 was employed for video frame decoding and enhancement, while deep learning architecture was developed using TensorFlow 2.3.0 and Keras 2.4.0. This combination has been thoroughly validated, ensuring full compatibility with CUDA 11.1 and cuDNN 8.0.5, while also significantly leveraging the Tensor Core computational potential of the RTX 3090 Ti GPU.

Rationale and methodological imperative for algorithm comparison.

To comprehensively evaluate the performance of the proposed model, it was benchmarked against a suite of representative state-of-the-art methods in human action recognition. The selected baselines encompass both established classical approaches and recent innovative techniques, providing comparative perspectives across the methodological spectrum. This selection criterion not only prioritizes methods demonstrating superior performance upon their introduction but also considers their distinct strategies for addressing crucial challenges inherent to the task, such as spatiotemporal feature modeling, viewpoint variation, and background clutter, thereby ensuring both breadth and relevance in the comparison. Table 2 summarizes the historical development and underlying principles of contrastive algorithms.

thumbnail
Table 2. The historical development and underlying principles of contrastive algorithms.

https://doi.org/10.1371/journal.pone.0333893.t002

Experimental results

To rigorously validate the effectiveness of the proposed algorithmic framework, a comprehensive evaluation was conducted on the challenging video understanding benchmarks UCF101 and HMDB51. Beyond conventional accuracy metrics, particular emphasis was placed on calculating and analyzing three core performance indicators: Precision, Recall, and the F1-score (detailed in Table 3). This multi-faceted assessment provides deeper insights: Precision directly reflects the confidence and reliability of the model’s positive predictions, quantifying the proportion of correctly identified positive instances. Recall evaluates the model’s coverage capability and comprehensiveness in detecting positive samples, measuring its effectiveness in capturing all relevant instances. The F1-score, as the harmonic mean of Precision and Recall, offers a single, robust composite metric, particularly resilient to class distribution imbalances (e.g., long-tailed distributions) prevalent in real-world data.

thumbnail
Table 3. The precision, recall, and F1-score of the proposed algorithm on the UCF101 and HMDB51 datasets.

https://doi.org/10.1371/journal.pone.0333893.t003

As presented in Table 3, the proposed algorithm achieves outstanding performance on both the challenging video action recognition benchmark datasets. On the UCF101 dataset, the algorithm demonstrates exceptional recognition capability, with a precision of 97.08%, recall of 96.74%, and a balanced F1-score of 96.91%. These high scores strongly prove that the algorithm not only accurately recognizes target action categories but also effectively captures the vast majority of relevant action instances in the video sequences. Notably, the F1-score of 96.91%, which balances precision and recall, fully reflects the algorithm’s robust ability to handle potential class imbalance issues, a crucial advantage in real-world applications. On the HMDB51 dataset, known for its more complex background variations and challenging action representations, the algorithm also performs well, achieving precision of 72.01%, recall of 71.36%, and an F1-score of 71.68%. Although the performance is lower compared to UCF101, the F1-score of 71.68% significantly outperforms many baseline methods, particularly in balancing precision and recall. This clearly reflects the algorithm’s reliable recognition ability and robustness when facing more complex, noise-laden real-world data. Overall, the experimental results fully validate the effectiveness of the proposed algorithm, convincingly demonstrating its strong generalization capability and stable high-performance across datasets of varying scales and complexities, laying a solid foundation for its application in broader video understanding tasks.

To systematically assess the performance of the proposed algorithm, confusion matrices were constructed on the UCF-101 and HMDB-51 datasets. The visualizations in Fig 7(a) and 7(b) clearly indicate that the high intensity distribution of the diagonal elements demonstrates the model’s exceptional cross-category recognition stability. Quantitative analysis shows that the average classification accuracy for UCF-101 is 97.5%, while for HMDB-51, it is 76.2%. This difference essentially reflects the spectrum of video action complexity—the former validates the algorithm’s strong ability to model global motion patterns, while the latter’s accuracy surpasses current SOTA methods in fine-grained action recognition. Notably, the non-diagonal regions of the confusion matrix do not exhibit significant misclassification, proving that the proposed feature decoupling mechanism effectively mitigates inter-class confusion. These findings not only confirm the robustness of the algorithm in diverse scenarios but also highlight the unique advantages of its dual-path architecture in spatiotemporal feature decoupling. The dynamic convolution path precisely captures local motion micro-changes, while the histogram-based attention effectively models long-range action dependencies. Future work will explore the framework’s adaptability in complex lighting conditions and its boundaries in such challenging environments.

thumbnail
Fig 7. (a) Confusion Matrix for the UCF101 Dataset, and (b) Confusion Matrix for the HMDB51 Dataset.

https://doi.org/10.1371/journal.pone.0333893.g007

Performance evaluation with the state-of-the-arts

As presented in Table 4, the proposed model underwent comprehensive evaluation on the UCF101 and HMDB51 datasets, focusing on three critical metrics: recognition accuracy, model parameter count, and per-sample inference latency. Comparative analysis against state-of-the-art (SOTA) models yields the following key observations, prompting deeper consideration.

thumbnail
Table 4. Performance comparison of human action recognition models (RGB modality).

https://doi.org/10.1371/journal.pone.0333893.t004

Our algorithm significantly outperforms Transformer-based architectures in the efficiency-accuracy trade-off. Specifically, with a lightweight design of 22.3M parameters running at 30 FPS, our method achieves 97.5% accuracy on UCF101, surpassing TimeSformer and SlowFast by 0.8% and 1.9%, respectively. Moreover, inference is twice as fast as VideoMAE V2, and the training cost is only one-tenth that of Transformer-based models, with no need for masked pretraining. The 3D HTB and attention mechanisms, adding less than 5% computational overhead, effectively enhance feature selectivity, improving accuracy on HMDB51 by 1.7% over the attention-free I3D baseline, which demonstrates the necessity of lightweight local aggregation mechanisms for action recognition in complex backgrounds.

Compared to traditional 3D CNN models, our algorithm achieves competitive performance with a lower computational cost. On UCF101, our method outperforms R(2+1)D by 0.2%, while reducing the parameter count by 23%, making it more suitable for edge deployment.

In our expanded comparative analysis, which includes a broader suite of state-of-the-art video transformers inspired by methodologies from lightweight transformer research, the results in Table 4 show that while large-scale transformer models such as TimeSformer, VideoMAE V2, and VideoSwin achieve commendable accuracy, they incur substantial computational costs (often exceeding 80M parameters), making deployment in real-time or resource-constrained environments challenging. In contrast, our proposed framework innovatively integrates lightweight attention and histogram-based transformation within a 3D CNN backbone, achieving a superior efficiency-accuracy trade-off. For instance, on the UCF101 dataset, our model (22.3M parameters, 30 FPS) runs faster than TimeSformer (121M parameters, 26 FPS) with roughly one-fifth of the parameters, while maintaining competitive accuracy.

Compared to the recent VideoMAE V2, which relies heavily on computationally intensive masked autoencoder pre-training, our method offers a streamlined training pipeline without any pre-training requirements. Furthermore, when compared to other efficient transformer-based approaches such as MViTv2, our approach similarly emphasizes multi-scale feature extraction and efficient design. However, rather than employing self-attention mechanisms over spatiotemporal patches, our 3D HTB and Split-Attention modules achieve competitive performance through local feature re-calibration and channel-wise dynamic weighting, operations that are inherently more efficient. This design not only improves computational performance but also preserves fine-grained motion details, contributing to robust accuracy.

Moreover, in comparison to newer models including VideoMAE V2, Motionformer, and M3D-AIM, our framework highlights its engineering-friendly advantages. Although our accuracy does not surpass these models, our single-branch pure 3D CNN architecture avoids custom operators (e.g., dynamic gating or dual-stream fusion) and is plug-and-play compatible with standard frameworks such as PyTorch and TensorFlow, reducing deployment adaptation costs by 75%.

Compared to the baseline AR3D model, our method achieves accuracy improvements of 1.8% and 2.3% on UCF101 and HMDB51, respectively, while maintaining similar inference speed. Particularly in dynamic and blurry scenes, the 3D HTB and attention mechanism in our method focus on key frames, helping to mitigate the impact of dynamic blur and lighting variations, while enhancing fine-grained action distinction.

Overall, while our algorithm may not outperform all methods on every dataset, its exceptional performance in both accuracy and speed underscores the significant potential of our approach and demonstrates its ability to effectively tackle the challenges of action recognition in videos.

Ablation study

To validate the effectiveness of the proposed method, we designed rigorous ablation experiments on the UCF101 and HMDB51 benchmarks. The primary comparison employed AR3D as the core baseline. This approach implements a two-stage feature extraction architecture: low-level feature information is captured through a 3D shallow convolution module, while a 3D attention residual module extracts deep semantic representations. This experimental design examines how feature extraction architectures differentially influence action recognition performance. Comparative analysis between the baseline and the proposed model reveals distinct architectural contributions to enhanced recognition capabilities.

We extensively ablate the design choices of the ST-Factor module [35]. As shown in Table 5, we evaluate various combinations of temporal depth (D) and spatial kernel size (k). The results indicate that a smaller temporal kernel (e.g., D = 3) is sufficient to capture most motion patterns in standard benchmarks, while a larger temporal kernel (D = 5) introduces minimal performance gains at a higher computational cost. Similarly, a 3 × 3 spatial kernel provides the most efficient operation. The configuration T3-S3 emerges as the optimal choice, achieving the highest accuracy on HMDB51 (a dataset known for its complex and uncertain environments) with the lowest computational footprint. This aligns with the principle of robust feature decomposition [35], which advocates efficient, multi-scale processing to handle environmental uncertainties.

thumbnail
Table 5. Ablation study on Spatiotemporal Factorization (STF) parameters.

The configuration T3-S3 (i.e., a temporal depth of D = 3 with a 3×3 spatial kernel) strikes the best balance between accuracy and efficiency, and is used in our final model.

https://doi.org/10.1371/journal.pone.0333893.t005
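The efficiency gain from this factorization can be seen with simple parameter arithmetic. The sketch below (an illustrative calculation, not the paper's implementation; function names are our own) compares a dense D×3×3 kernel against its decomposition into a 1×3×3 spatial convolution followed by a D×1×1 temporal convolution, for the T3-S3 setting with 64 input and output channels:

```python
def conv3d_params(t, h, w, c_in, c_out):
    """Parameter count of a dense t x h x w 3D convolution (bias omitted)."""
    return t * h * w * c_in * c_out

def factorized_params(t, h, w, c_in, c_out):
    """Spatial (1 x h x w) followed by temporal (t x 1 x 1) convolution."""
    spatial = 1 * h * w * c_in * c_out    # e.g., 1x3x3 over H x W
    temporal = t * 1 * 1 * c_out * c_out  # e.g., 3x1x1 over T
    return spatial + temporal

c = 64
full = conv3d_params(3, 3, 3, c, c)      # 27 * 64 * 64 = 110592
fact = factorized_params(3, 3, 3, c, c)  # (9 + 3) * 64 * 64 = 49152
print(full, fact, full / fact)           # 110592 49152 2.25
```

Since per-position FLOPs scale with the kernel's parameter count, the same 2.25× ratio applies to the convolution cost, which is where the reduced computational footprint of T3-S3 comes from.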

This study introduces an innovative 3D Histogram Transformer Block (3D HTB). A key advantage of this module lies in its effective decoupling of the scale coupling problem inherent in conventional spatiotemporal modeling. The 3D HTB achieves adaptive adjustment of local receptive fields via dynamic range convolution and innovatively transforms histogram statistics into attention weights. This dual-path architecture mimics the feature processing mechanism of the human visual system: the feedforward pathway captures local detailed features (the “where” pathway), while the histogram attention focuses on modeling global statistical regularities (the “what” pathway). As shown in Table 6, integrating the 3D HTB module yields accuracy improvements of 0.2% and 1.1% on the complex scene subset of UCF101 and the HMDB51 dataset, respectively. More importantly, its histogram channel recalibration mechanism significantly reduces sensitivity to motion blur, offering a novel approach for real-time video analysis. The innovative design of this module demonstrates considerable potential for modeling complex spatiotemporal data and enabling efficient recognition.

thumbnail
Table 6. Ablation study of proposed components - action recognition accuracy (%) on UCF101 and HMDB51 datasets.

https://doi.org/10.1371/journal.pone.0333893.t006
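The core idea of turning histogram statistics into attention weights can be illustrated with a minimal NumPy sketch. This is a simplified, single-channel illustration under our own assumptions (the actual 3D HTB uses learned projections and dynamic-range convolution, which are omitted here): activations are binned, the normalized bin counts are passed through a softmax, and each activation is rescaled by the weight of the bin it falls into.

```python
import numpy as np

def histogram_attention(x, n_bins=8):
    """Illustrative histogram-guided reweighting for one channel.

    x: 1D array of activations from a flattened spatiotemporal feature map.
    """
    counts, edges = np.histogram(x, bins=n_bins)
    probs = counts / counts.sum()                  # normalized bin statistics
    weights = np.exp(probs) / np.exp(probs).sum()  # softmax over bins
    # Map each activation to the index of its bin (interior edges only).
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    return x * weights[idx]

rng = np.random.default_rng(0)
feat = rng.normal(size=256)
out = histogram_attention(feat)
assert out.shape == feat.shape
```

Because the weights depend on the global distribution of activations rather than their positions, densely populated intensity ranges are emphasized consistently across the frame, which is the statistical regularity ("what" pathway) the 3D HTB exploits.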

Beyond architectural validation, this study investigates in depth the optimization mechanism of 3D Split Attention for spatiotemporal feature representation in action recognition. Distinct from conventional feature fusion approaches, the module establishes a dynamic feature arbitration scheme—selectively enhancing discriminative patterns while suppressing redundant signals across parallel multi-scale branches. Theoretical analysis reveals that this design fundamentally constructs cross-channel communication pathways, effectively mitigating the inherent limitation of isolated feature learning in multi-branch architectures. Quantitative experiments demonstrate consistent accuracy improvements of 0.2% to 0.8% upon module integration on benchmark datasets including UCF-101 and HMDB-51 (Table 6). Critically, the derived representations exhibit pronounced context-adaptive properties, demonstrating significant advancements in occluded action recognition and fine-grained motion pattern identification—longstanding challenges in complex video understanding.
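The arbitration idea behind Split Attention can be sketched as follows. This is a stripped-down illustration in the spirit of the Split-Attention block [38], under our own simplifications (the learned dense layers of the full block are omitted): each parallel branch is global-average-pooled into a channel descriptor, a softmax across branches produces per-channel weights, and the branches are fused as a weighted sum.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def split_attention(branches):
    """Fuse parallel branch outputs via channel-wise soft attention.

    branches: array of shape (r, C, N) -- r parallel branches,
    C channels, N spatiotemporal positions (T*H*W flattened).
    """
    pooled = branches.mean(axis=2)     # (r, C) global channel descriptors
    attn = softmax(pooled, axis=0)     # weights sum to 1 across branches
    return (attn[:, :, None] * branches).sum(axis=0)  # fused (C, N) output

x = np.random.default_rng(1).normal(size=(2, 4, 10))
y = split_attention(x)
assert y.shape == (4, 10)
```

The softmax across branches is what implements the "arbitration": for each channel, a branch whose descriptor is strong is amplified while the competing branches are suppressed, giving the cross-channel communication pathway described above.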

Furthermore, we evaluate the efficacy of the proposed layer-freezing strategy. By freezing the shallow layers after the warm-up phase, we achieve a 40% reduction in training-time computational FLOPs and a 35% decrease in GPU memory consumption during the backward pass. Crucially, this optimization comes at a negligible performance cost, with accuracy dropping by only 0.1% on UCF101 and 0.2% on HMDB51 compared to the full model trained without freezing. This demonstrates that the layer-freezing strategy successfully maintains the effectiveness of the deep attention layers while significantly alleviating the computational overhead.
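The mechanics of the freezing strategy can be sketched in a few lines. The layer names and parameter counts below are hypothetical, chosen only to illustrate the bookkeeping: after warm-up, parameters in the shallow stages are excluded from gradient computation, which is where the backward-pass FLOP and memory savings come from.

```python
def freeze_shallow_layers(layers, freeze_prefixes=("stem", "stage1")):
    """Split a network's parameters into frozen and trainable totals.

    layers: dict mapping layer name -> parameter count.
    Layers whose names match freeze_prefixes are frozen after warm-up.
    """
    frozen = sum(p for name, p in layers.items()
                 if name.startswith(freeze_prefixes))
    trainable = sum(layers.values()) - frozen
    return frozen, trainable

# Hypothetical parameter budget for illustration only.
net = {"stem.conv": 5_000, "stage1.block": 60_000,
       "stage2.htb": 120_000, "stage3.sab": 140_000, "head.fc": 10_000}
frozen, trainable = freeze_shallow_layers(net)
print(frozen, trainable)  # 65000 270000
```

In a deep-learning framework this corresponds to disabling gradient tracking for the shallow parameters (e.g., setting their `requires_grad` flags to false in PyTorch), so the deep attention layers keep learning while the frozen stages act as a fixed feature extractor.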

Finally, we investigate the role of spatiotemporal factorization. By decoupling dense 3D convolution kernels into separate spatial and temporal operations, the technique yields more structured and interpretable features while significantly reducing computational complexity and parameter count, and it maintains or even enhances recognition accuracy. These attributes make spatiotemporal factorization particularly valuable for efficient video understanding.

The ablation study results demonstrate that the synergistic integration of 3D HTB, Split-Attention Block, and spatiotemporal factorization within the 3D residual network framework confers significant advantages. As detailed in Table 4, this combined architecture markedly surpasses conventional 3D CNN architectures and other variants (notably AR3D) in both action recognition accuracy and computational efficiency.

Conclusions and discussions

This study proposes an efficient video action recognition framework based on 3D CNN, 3D Histogram Transformer Block (HTB), and Split-Attention Residual Modules, incorporating spatiotemporal factorization techniques for lightweight design. The method outperforms several comparison models in both recognition accuracy and inference speed on standard datasets such as UCF101 and HMDB51, demonstrating the effectiveness of integrating 3D convolutions, Transformer, and attention mechanisms in modeling complex temporal patterns.

The proposed method exhibits strong robustness across actions of varying complexities and shows broad application potential in real-world scenarios such as mobile platforms (e.g., drones), sports action analysis, and medical monitoring. However, the current evaluations are still confined to standard datasets, and its performance under real-world challenging conditions, such as occlusion and viewpoint variations, has not been sufficiently validated. Further research is needed to explore its robustness in actual environments.

Despite its overall strong performance, recognition accuracy in complex backgrounds and in scenes with many concurrent actions still lags behind larger-scale models. Future improvements could focus on three aspects: optimizing the network architecture, enhancing human kinematic feature representations, and upgrading the attention mechanisms. Additionally, ablation studies and validation across multiple source datasets will help clarify the contribution of each module.

Furthermore, deployment in real-time applications and resource-constrained environments has not been systematically discussed in the current study. We are developing lightweight architectures to reduce computational and memory overhead. Future work will focus on exploring model optimization strategies for edge computing and real-time systems to enhance applicability in practical scenarios.

In conclusion, this study makes significant progress in video spatiotemporal feature modeling through the innovative integration of attention mechanisms and tensor decomposition techniques. Future efforts will focus on performance validation in complex environments, lightweight design, and practical deployment optimization, further advancing action recognition technologies towards practical use.

References

  1. Kong Y, Fu Y. Human action recognition and prediction: a survey. Int J Comput Vis. 2022;130(5):1366–401.
  2. Noor N, Park IK. A lightweight skeleton-based 3D-CNN for real-time fall detection and action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. p. 2179–88.
  3. Wang H, Kläser A, Schmid C, Liu C-L. Dense trajectories and motion boundary descriptors for action recognition. Int J Comput Vis. 2013;103(1):60–79.
  4. Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems. 2014.
  5. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M. Learning spatiotemporal features with 3D convolutional networks. In: 2015 IEEE International Conference on Computer Vision (ICCV). 2015. p. 4489–97. https://doi.org/10.1109/iccv.2015.510
  6. Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. p. 6299–308.
  7. Bertasius G, Wang H, Torresani L. Is space-time attention all you need for video understanding? In: International Conference on Machine Learning. 2021. p. 4.
  8. Liu Z, Ning J, Cao Y, Wei Y, Zhang Z, Lin S, et al. Video swin transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 3202–11.
  9. Sun Z, Ke Q, Rahmani H, Bennamoun M, Wang G, Liu J. Human action recognition from various data modalities: a review. IEEE Trans Pattern Anal Mach Intell. 2023;45(3):3200–25. pmid:35700242
  10. Feichtenhofer C, Pinz A, Zisserman A. Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. p. 1933–41.
  11. Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, et al. Temporal segment networks: towards good practices for deep action recognition. In: European Conference on Computer Vision. 2016. p. 20–36.
  12. Ji S, Yang M, Yu K. 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell. 2013;35(1):221–31. pmid:22392705
  13. Tran D, Bourdev L, Paluri M, Fergus R. Res3D: residual networks for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017;39(5):845–55.
  14. Ullah M, Nasir M, Anwar M. Deep bidirectional LSTM for human action recognition. In: IEEE Winter Conference on Applications of Computer Vision. 2018.
  15. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN. Attention is all you need. In: Advances in Neural Information Processing Systems. 2017.
  16. Lin Z, Li Z, Zhou X, Liu Y. Self-attention for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
  17. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL. 2019. p. 4171–86.
  18. Gao Y, Zhang Z, Tao D. Graph convolutional networks for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
  19. Liang Z, Yang Q, Zhang J. Spatio-temporal graph convolutional networks for action recognition. In: Proceedings of the IEEE International Conference on Computer Vision. 2019.
  20. Xu R, Li Z, Liu S. Deep reinforcement learning for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2020.
  21. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. p. 770–8.
  22. Qiu Z, Yao T, Mei T. Learning spatio-temporal representation with pseudo-3D residual networks. In: Proceedings of the IEEE International Conference on Computer Vision. 2017. p. 5533–41.
  23. Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, et al. Residual attention network for image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. p. 3156–64.
  24. Ullah A, Ahmad J, Muhammad K, Sajjad M, Baik SW. Action recognition in video sequences using deep bi-directional LSTM with CNN features. IEEE Access. 2018;6:1155–66.
  25. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. p. 7132–41.
  26. Woo S, Park J, Lee JY, Kweon IS. CBAM: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision. 2018. p. 3–19.
  27. Roy A, Navab N, Wachinger C. Concurrent spatial and channel ‘squeeze & excitation’ in fully convolutional networks. In: MICCAI. Springer; 2018. p. 421–9.
  28. Mnih V, Heess N, Graves A, Kavukcuoglu K. Recurrent models of visual attention. In: Advances in Neural Information Processing Systems. 2014.
  29. Ba J, Mnih V, Kavukcuoglu K. Multiple object recognition with visual attention. arXiv preprint 2014. https://arxiv.org/abs/1412.7755
  30. Jaderberg M, Simonyan K, Zisserman A. Spatial transformer networks. In: Advances in Neural Information Processing Systems. 2015.
  31. Tran D, Ray J, Shou Z, Chang SF, Paluri M. ConvNet architecture search for spatiotemporal feature learning. arXiv preprint 2017. https://arxiv.org/abs/1708.05038
  32. Sun S, Ren W, Gao X, Wang R, Cao X. Restoring images in adverse weather conditions via histogram transformer. In: European Conference on Computer Vision. 2024. p. 111–29.
  33. Dong M, Fang Z, Li Y, Bi S, Chen J. AR3D: attention residual 3D network for human action recognition. Sensors (Basel). 2021;21(5):1656. pmid:33670835
  34. Khan H, Ullah Khan S, Ullah W, Wook Baik S. Optimal features driven hybrid attention network for effective video summarization. Engineering Applications of Artificial Intelligence. 2025;158:111211.
  35. Khan H, Usman MT, Koo J. Bilateral Feature Fusion with hexagonal attention for robust saliency detection under uncertain environments. Information Fusion. 2025;121:103165.
  36. Usman MT, Khan H, Rida I, Koo J. Lightweight transformer-driven multi-scale trapezoidal attention network for saliency detection. Engineering Applications of Artificial Intelligence. 2025;155:110917.
  37. Misra D. Mish: a self-regularized non-monotonic activation function. arXiv preprint 2019.
  38. Zhang H, Wu C, Zhang Z, Zhu Y, Lin H, Zhang Z. ResNeSt: split-attention networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 2736–46.
  39. Talha Usman M, Khan H, Kumar Singh S, Young Lee M, Koo J. Efficient deepfake detection via layer-frozen assisted dual attention network for consumer imaging devices. IEEE Trans Consumer Electron. 2025;71(1):281–91.
  40. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T. HMDB: a large video database for human motion recognition. In: 2011 International Conference on Computer Vision. 2011. p. 2556–63. https://doi.org/10.1109/iccv.2011.6126543
  41. Soomro K, Zamir AR, Shah M. UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint 2012.
  42. Raschka S. Model evaluation, model selection, and algorithm selection in machine learning. 2018.
  43. Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M. A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. p. 6450–9.
  44. Feichtenhofer C, Fan H, Malik J, He K. SlowFast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019. p. 6202–11.
  45. Li Y, Wu CY, Fan H, Mangalam K, Xiong B, Malik J, et al. MViTv2: improved multiscale vision transformers for classification and detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 4804–14.
  46. Singh M, Girdhar R, Ramanan A, Ghadiyaram D, Mahajan D. MotionFormer: transferring neuro-inspired motion perception to video recognition. In: Advances in Neural Information Processing Systems. 2023.
  47. Wang L, Huang B, Zhao Z, Tong Z, He Y, Wang Y, et al. VideoMAE v2: scaling video masked autoencoders with dual masking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. p. 14549–60.
  48. Zheng J, Li Y, Xu C, Wang L. M3D-AIM: multimodal 3D attention-informed modeling for human action recognition. In: IEEE International Conference on Automatic Face and Gesture Recognition. 2024. p. 1–8.
  49. Kumar P, Quraishi A, Rusho MA, Raparthi M, Degadwala S. Enhancing video action recognition with spatial-temporal convolutional attention networks. In: International Conference on Intelligent Computing and Networking. 2024. p. 149–68.