Abstract
Human action recognition forms an important part of several aerial security and surveillance applications, and numerous efforts have been made to solve the problem in an effective and efficient manner. Existing methods, however, are generally aimed at recognizing either solo actions or interactions, restricting their use to specific scenarios. Additionally, the need remains to devise lightweight and computationally efficient models that are deployable in real-world applications. To this end, this paper presents a generic, lightweight and computationally efficient Transformer-based model, referred to as InterAcT, that relies on bodily keypoints extracted using YOLO v8 to recognize human solo actions as well as interactions in aerial videos. It features a lightweight architecture with 0.0709M parameters and 0.0389 GFLOPs, distinguishing it from the AcT models. An extensive performance evaluation has been performed on two publicly available aerial datasets, Drone Action and UT-Interaction, comprising a total of 18 classes including both solo actions and interactions. The model is optimized and trained on an 80% training set and a 10% validation set, and its performance is evaluated on a 10% test set, achieving highly encouraging performance on multiple benchmarks and outperforming several state-of-the-art methods. Our model, with an accuracy of 0.9923, outperforms the AcT models (micro: 0.9353, small: 0.9893, base: 0.9907, and large: 0.9558), 2P-GCN (0.9337), LSTM (0.9774), 3D-ResNet (0.9921), and 3D CNN (0.9920). It has the strength to recognize a large number of solo-action and two-person interaction classes both in aerial videos and in footage from ground-level cameras (grayscale and RGB).
Citation: Shah M, Nawaz T, Nawaz R, Rashid N, Ali MO (2025) InterAcT: A generic keypoints-based lightweight transformer model for recognition of human solo actions and interactions in aerial videos. PLoS One 20(5): e0323314. https://doi.org/10.1371/journal.pone.0323314
Editor: Bushra Zafar, Government College University Faisalabad, PAKISTAN
Received: July 29, 2024; Accepted: April 4, 2025; Published: May 14, 2025
Copyright: © 2025 Shah et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: In this work, we used well-known video datasets (the Drone Action and UT-Interaction datasets) which are already published and made publicly available online by the original owners/authors for use by the research community. The links/references of the datasets are duly cited as references (32, 53) for the Drone Action dataset and (33, 54) for the UT-Interaction dataset.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Human action recognition involves automatically analyzing and comprehending varying actions in videos, which form a part of several security and surveillance applications [1,2]. Broadly, existing action recognition approaches may be categorized into non-vision-based and vision-based methods.
Non-vision-based methods rely on data from wearable sensors [3] or non-wearable sensors [4]. Wearable sensors are body-worn devices/accessories that capture physiological signals or motion data and include, for example, accelerometers, magnetometers, gyroscopes, and smart watches. Non-wearable sensors are stationary or mobile devices that generally capture environmental data without direct contact with the body and include, for example, sound sensors, pressure sensors, and temperature sensors. Examples of action recognition approaches employed in non-vision-based methods include Deep Neural Networks (DNNs) [5], Convolutional Neural Networks (CNNs) [6], autoencoders [7], Restricted Boltzmann Machines (RBMs) [8], Recurrent Neural Networks (RNNs) [9] and hybrid models [10]. While these methods are generally computationally less costly and robust to varying illumination, they have limitations in terms of scalability, accuracy and generalizability.
Vision-based methods, on the other hand, use image or video data from cameras to perform action recognition with better performance and generalizability [11]. Traditional vision-based methods comprise two key components [12]: action representation and action classification. Action representation is the process of converting video or image data into feature vectors [13], whereas action classification then infers the class labels from the encoded feature vectors [14]. Traditional machine learning methods [15] rely on hand-crafted feature extraction such as Histograms of Oriented Gradients (HOG) [16], optical flow [17], Spatiotemporal Interest Points (STIP) [18], and Motion History Images (MHI) [19]. These features are typically combined with classifiers such as Support Vector Machines (SVMs) [20], K-Nearest Neighbors (KNN) [21], and Random Forests [22]. Unlike traditional machine learning approaches, deep learning methods [23] have been demonstrated to provide a more robust, scalable, and effective solution by automatically learning rich representations from large datasets, allowing them to better address the variability and dynamics of aerial video data. Deep neural network architectures have unified both components in a seamless manner and have significantly enhanced classification performance. Examples of representative works include approaches relying on Convolutional Neural Networks (CNNs) [24], Graph Convolutional Networks (GCNs) [25], Recurrent Neural Networks (RNNs) and their variants such as Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM) networks [26], and Vision Transformers (ViTs) [27]. There has been a trend toward using pose information, in the form of extracted keypoints, in DNN frameworks [28] for human action recognition, under the motivation that different action types are better distinguishable by encoding bodily movements.
Additionally, pose-based approaches are more robust to background and illumination changes. Some pose-based approaches have employed transformer networks with promising results; however, they may not be directly deployable for action recognition in aerial settings due to viewpoint changes and motion dynamics (motion blur and jitter) caused by UAVs. Other methods work either with solo actions or with interactions [29–31]. Furthermore, the need remains to devise lightweight deep models to ease and improve their deployment in real-world applications where computational cost and complexity are critical considerations.
To this end, this paper presents an effective, lightweight Transformer model utilizing the YOLO v8 pose estimator to extract keypoints for human action recognition in aerial videos. The proposed model (InterAcT) is generic in that it is capable of recognizing both solo actions and interactions in a computationally efficient and effective manner. We have performed a thorough performance evaluation and comparison of the proposed method with several related state-of-the-art approaches on two well-known public datasets, the Drone Action dataset [32] and the UT-Interaction dataset [33], together comprising a total of 18 classes including both solo actions and interactions. The proposed method outperforms the existing methods in terms of both accuracy and computational cost.
The contributions of this paper are as follows. First, a generic, unified keypoints-based transformer model (InterAcT) is proposed that is capable of recognizing both solo actions and human-human interactions in aerial videos. Second, exhaustive grid searches are performed over various architectural parameters as part of training and validation to build an optimized transformer-based framework that is computationally efficient as well as accurate for real-world applications. Third, a detailed evaluation and comparison is conducted using multiple standard metrics to demonstrate the robustness and effectiveness of the proposed model and its superior performance over several state-of-the-art methods on two well-known publicly available datasets. Lastly, we have made our model accessible online [34] to enable its use for recognition tasks and to support reproducibility of the reported results.
Related work
While there exist several action recognition methods [11], only a limited number of approaches have focused on addressing the problem for aerial videos [35]. Moreover, even fewer methods are generic enough to recognize both solo actions and human-human interactions.
Authors in [36] proposed a parts-based model with an FCN that incorporates ORB attributes and texton maps for full-body features, and the Radon transform and 8-chain Freeman codes for keypoint features, to recognize solo-action as well as interaction classes. However, the model has limitations in effectively dealing with recognition in aerial datasets due to low resolution and fast camera motion. In [37], the authors proposed a method that utilized extracted pose information with SVM and Random Forest classifiers to recognize violent actions in aerial videos; however, they demonstrated the effectiveness of their method on solo actions only. The authors in [38] presented a model that incorporates DWT, LBP and HOG to extract distinct features and used a multiclass SVM classifier with a one-vs-one architecture to recognize action classes in aerial videos; however, their method is computationally costly due to the need to extract the conventional features and may not be suitable for deployment in real-time applications. ST-PCANet [39] is a two-stream network that used an unsupervised PCANet with feature encoding schemes, including BoF and VLAD, followed by an SVM classifier to classify actions in videos. The authors in [40] proposed an integrated X3D expanding architecture that utilized 3D ConvNets as a baseline model, flexible enough to handle occlusion and viewpoint changes in aerial videos, but recognized solo actions only. The computational performance of the model may, however, not be desirable due to model complexity. The authors in [41] proposed a disjoint multi-task learning method that utilized a 3D CNN framework to recognize solo actions in aerial camera settings. They used GAN-generated videos, created synthetically from GTA and FIFA games, for human action recognition in real-world aerial videos when only a few real aerial videos are available.
However, generating representative GAN videos for all classes may not be feasible, and game videos may be biased toward specific actions. The work on 3D Convolutional Neural Networks (3D CNNs) presented in [42] is an extension of 2D CNNs that incorporated the temporal dimension to capture temporal patterns of activity in videos. However, the computational cost of 3D CNN models may be an issue due to the need for a large amount of annotated training data. The work in [43] presented AR3D models that combine a 3D CNN, a residual structure, and an attention mechanism to address the above limitation. The method showed encouraging performance; however, the challenge pertaining to computational cost remains. Graph Convolutional Networks (GCNs) [44], including 2P-GCN, Graph Diffusion Convolutional Network, AI-GCN, CTR-GCN, SGN, and K-GCN, were proposed that also relied on pose estimation, but they generally aimed at human-human interactions only, without accounting for solo actions. The authors in [45] presented an H-LSTM model that captured long-term inter-related dynamics among a group of persons for human interaction recognition. In [46], a keypoints-based LSTM model was proposed for recognizing solo actions in aerial camera settings. The work presented in [30] utilized a keypoints-based transformer model that recognized solo actions in videos captured with fixed camera settings. Their work was extended in [31] to recognize solo actions in aerial videos.
Based on the review of related work, it is evident that there is a need to devise a robust solution for human action recognition in aerial videos that is generic enough to work effectively with both solo actions and interactions while being computationally efficient. Moreover, transformer networks are being increasingly employed to solve several vision problems, but they remain comparatively less explored for the human action recognition task. Within transformer-based models, keypoints-based transformers have been relatively less investigated for their effectiveness, accuracy and performance in recognizing a large number of diverse action categories, both in fixed camera settings and in aerial settings.
Method
Proposed transformer-based action recognition framework
The proposed method utilizes body keypoints extracted using the YOLO v8 pose estimation model [47]. The extracted keypoints data is preprocessed into the desired shape, which is then fed into the proposed Transformer framework for training and testing. The proposed model is inspired by the “micro” Transformer architecture in earlier work [30] that was used for recognizing solo actions in ground-based videos. Unlike the “micro” Transformer architecture [30], the proposed framework offers lightweight, optimized architectural settings to build a model that works effectively for both solo actions and interactions in a computationally efficient manner without compromising accuracy. Fig 1 illustrates the different stages of the framework from an implementation viewpoint.
Y represents true labels, while Y* represents predicted labels.
Pose estimation.
Numerous pose estimation models, including OpenPose [48], AlphaPose [49], HyperPose [50], BlazePose [51], YOLO v7 [52] and YOLO v8 [47], exist to extract 2D or 3D keypoints. We used the widely adopted YOLO v8, which has reported high detection performance and low computational cost [31,47]. It extracts 17 keypoints per person in each frame of a video. The YOLO v8 pose model uses a backbone based on Cross Stage Partial Network (CSPNet) blocks and a Path Aggregation Network (PANet) neck, which enhance multi-scale detection and gradient flow, enabling efficient keypoint prediction. It is pretrained on extensive pose-annotated datasets such as COCO, due to which it accurately identifies poses across diverse activities. It divides the image into a grid and predicts the positions of key body parts within each grid cell, allowing it to process and analyze human poses in a single forward pass through the network.
The YOLO v8 pose model takes an input video of shape (F, H, W, C), where F is the total number of frames, each frame having dimensions of height (H) × width (W), and C denotes the number of input channels per frame. It returns an output of shape (F, K), where F denotes the frame index and K denotes the keypoints array. The extracted keypoints data is preprocessed into the desired shape to be fed into the Transformer network for training and testing. Fig 2 below illustrates the extracted keypoints for sample frames of the “waving hands” solo-action class (top) and the “handshaking” interaction class (bottom). For keypoints extraction, we used grayscale imagery, which offers encouraging detection performance as well as reduced computational cost.
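As a concrete illustration of this preprocessing step, the sketch below flattens per-frame detections into fixed-length vectors of size K = 2 × 17 × 2 = 68, zero-padding when fewer than two persons are detected. The helper name and the zero-padding scheme are illustrative assumptions, not the paper’s released code.

```python
# Sketch of the keypoint preprocessing step (helper name and padding scheme
# are illustrative). Each frame contributes up to P_MAX persons x 17
# keypoints x 2 coordinates, flattened to a K-dimensional vector.
P_MAX = 2           # maximum number of persons (interactions involve two)
KP_PER_PERSON = 17  # YOLO v8 pose returns 17 keypoints per person
COORDS = 2          # (x, y) per keypoint
K = P_MAX * KP_PER_PERSON * COORDS  # 68

def frame_to_vector(persons):
    """Flatten detected persons' keypoints into a fixed K-length vector,
    zero-padding when fewer than P_MAX persons are detected."""
    vec = []
    for i in range(P_MAX):
        if i < len(persons):
            for (x, y) in persons[i]:
                vec.extend([x, y])
        else:
            vec.extend([0.0] * (KP_PER_PERSON * COORDS))
    return vec

# one detected person with 17 dummy keypoints -> second half is zero-padded
persons = [[(0.1 * j, 0.2 * j) for j in range(KP_PER_PERSON)]]
assert len(frame_to_vector(persons)) == 68
```

Stacking 30 such vectors then yields one (F, K) = (30, 68) sequence, matching the input shape used by the Transformer network.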
Republished from [53] under a CC BY license, with permission from the copyright owner of the Drone Action dataset, original copyright [2019].
Transformer-based action recognition model.
Fig 3 illustrates the proposed model architecture. The backbone of the framework is the Transformer encoder, which comprises multiple layers, each employing a self-attention block and a feed-forward block. After each block, dropout, layer normalization, and residual connections are applied. Each feed-forward block operates as a multi-layer perceptron using the GeLU non-linearity. The self-attention block is the memory block that learns the temporal patterns of the input sequences in the Transformer network. The input sequence tokens are fed into it as linearly transformed vectors known as Query ($Q$), Key ($K$) and Value ($V$). Using the $Q$ and $K$ vectors, the attention weight matrix is computed, which, multiplied with the $V$ vector, gives the self-attention output as per Equation 1:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^{T}/\sqrt{d_k}\right)V$. (1)

At the final stage, class tokens are fed into the MLP head, which outputs a logit vector on which softmax is applied to predict the class label.
(a) Image illustrating modules of the transformer network; (b) image illustrating the output shape of different layers.
The preprocessed keypoints for each class comprise $N$ sequences, where $N$ denotes the number of sequences per class. Each sequence has a shape $(F, K)$, where $F$ denotes the number of frames per sequence, set to 30 in our case, and $K$ denotes the number of keypoint values, computed as

$K = P \times K_p \times C$,

where $P$ is the maximum required number of persons in the videos, $K_p$ denotes the number of keypoints extracted per person, and $C$ is the number of coordinates/channels per keypoint. In our case, $K$ is set to 68 ($P = 2$, $K_p = 17$, $C = 2$). For each sequence, the Transformer network extracts temporal features using the 30 frames in it. First, the keypoints are transformed linearly into an embedding matrix, to which a positional embedding matrix with learnable parameters is added to provide the positional information of each frame, thus forming the embedded matrix $X_e$. The embedded matrix $X_e$ has a shape of $(F, D_{model})$, where $D_{model}$ denotes the embedded dimension of each vector (row) of $X_e$. The Query ($Q$), Key ($K$) and Value ($V$) vectors are generated from $X_e$ as given by Equations 2–4, respectively:

$Q = X_e W_Q$, (2)
$K = X_e W_K$, (3)
$V = X_e W_V$, (4)

where $W_Q$, $W_K$ and $W_V$ are weight matrices with learnable parameters whose dimensions are kept the same in the network, i.e., $(D_{model}, D_{model})$, which makes the dimensions of $Q$, $K$ and $V$ the same as well, i.e., $(F, D_{model})$. In our case, $D_{model}$ is set to 56. The self-attention output matrix is transformed by a layer having a weight matrix $W_O$ of shape $(D_v, D_{model})$. As $D_v = D_{model}$, the shape of $W_O$ becomes $(D_{model}, D_{model})$. This transformation makes the output shape of the self-attention block equal to $(F, D_{model})$. This output matrix is then fed into the feed-forward network, which transforms it using the operations given by Equation 5:

$\mathrm{FFN}(x) = \mathrm{GeLU}(x W_1 + b_1)\, W_2 + b_2$, (5)

where $x$ is the output of the attention block, and $W_1$ and $W_2$ are weight matrices having shapes $(D_{model}, D_{mlp})$ and $(D_{mlp}, D_{model})$, respectively. $b_1$ and $b_2$ are bias vectors; since $D_{mlp} = D_{model}$ in our case, both have the same shape. The output of the feed-forward network is then fed into the MLP head to predict the class label. The layer-wise output shapes of the InterAcT model are illustrated in Fig 3b.
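To make the shapes involved in the attention and feed-forward computations concrete, the following sketch runs one encoder pass with random stand-in weights at the paper’s settings (F = 30 frames, an embedding dimension of 56, and the feed-forward inner dimension taken equal to the embedding dimension here); the weights are not learned parameters, only shape placeholders.

```python
import numpy as np

# Shape walkthrough of the self-attention and feed-forward operations with
# F = 30, D = 56; all weights are random stand-ins for learned parameters.
F, D = 30, 56
rng = np.random.default_rng(0)
X = rng.standard_normal((F, D))            # embedded sequence X_e

Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) * 0.1 for _ in range(4))
Q, Km, V = X @ Wq, X @ Wk, X @ Wv          # Query, Key, Value projections

scores = Q @ Km.T / np.sqrt(D)             # scaled dot-product
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)         # row-wise softmax
attn = (A @ V) @ Wo                        # attention + output projection

# Feed-forward block, with the inner dimension equal to D here (assumption)
W1 = rng.standard_normal((D, D)) * 0.1
W2 = rng.standard_normal((D, D)) * 0.1
b1, b2 = np.zeros(D), np.zeros(D)
gelu = lambda z: 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))
out = gelu(attn @ W1 + b1) @ W2 + b2

assert out.shape == (F, D)                 # (F, D_model), as in Fig 3b
```

Each encoder layer thus maps a (30, 56) input to a (30, 56) output, which is why layers can be stacked freely before the MLP head.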
Experimental setup
This section presents the experimentation details including the description of the datasets and the system specifications that are used in training and testing the model.
Datasets
We used two publicly available datasets for evaluation: the Drone Action dataset [32,53] and the UT-Interaction dataset [33,54]. The two datasets were chosen because together they cover both solo actions and human-human interactions, which this study focuses on.
We used 13 solo-action classes from the Drone Action dataset [32,53], as illustrated in Fig 4. It contains RGB videos with a spatial resolution of 1920×1080, recorded at a frame rate of 25 fps in an outdoor environment with a low-altitude, slow-moving drone. The solo actions are performed by 10 actors on an unsealed road near wheat fields. The dataset contains challenges of cluttered backgrounds and viewpoint changes.
Republished from [53] under a CC BY license, with permission from the copyright owner of the Drone Action dataset, original copyright [2019].
Also, we used 5 human-human interaction classes from the UT-Interaction dataset [33,54], with corresponding illustrations available online [34]. It contains RGB videos with a spatial resolution of 720×480, recorded at a frame rate of 30 fps in an outdoor environment with a low-altitude camera. The human-human interactions are performed by 6 actors in two different scenarios: a parking lot and a lawn on a windy day. The dataset contains challenges of camera jitter, varying zoom rates, illumination changes and cluttered backgrounds. Table 1 summarizes both datasets. Our study aims at recognizing a wide range of actions and interactions captured by low-altitude flying drones, making these two datasets highly suitable and relevant for our research.
Both datasets contain varying numbers of videos per class, leading to an imbalance in the amount of extracted sequential keypoints data. This class imbalance can affect model performance, potentially limiting its accuracy and effectiveness. Data augmentation is therefore employed, particularly for classes with fewer videos, to extract sufficient sequential data for model training and to improve model generalizability. Among the many available augmentation techniques, we employed two, horizontal flipping and rotation, both of which simulate realistic yet diverse drone perspectives, enabling an evaluation of the robustness of the proposed model to variations in aerial settings. These techniques increase both the number of videos and the corresponding sequential data per class. Even after augmentation, some class imbalance persists. To address this, data slicing is employed: the class with the fewest sequences is used as a reference, and all classes are sliced to match this number, ensuring balanced sequential data across classes. The class-wise data statistics are given in Table 2. In this study, the preprocessed (balanced) keypoints data is split into three sets: 80% for training, 10% for validation, and 10% for testing.
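The balancing-by-slicing step described above can be sketched as follows; the class names and sequence counts are illustrative, not taken from Table 2.

```python
# Minimal sketch of balancing by slicing: every class is truncated to the
# sequence count of the smallest class (counts here are illustrative).
def balance_by_slicing(sequences_per_class):
    """sequences_per_class: dict mapping class name -> list of sequences."""
    n_min = min(len(seqs) for seqs in sequences_per_class.values())
    return {cls: seqs[:n_min] for cls, seqs in sequences_per_class.items()}

data = {"punching": list(range(120)),
        "kicking": list(range(95)),
        "hugging": list(range(110))}
balanced = balance_by_slicing(data)
assert all(len(s) == 95 for s in balanced.values())
```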
System specifications
This section presents the hardware and software resources that are utilized to perform the experimental evaluation of the proposed framework. Table 3 lists the specifications of the hardware and Table 4 shows the software requirements.
Results and discussion
This section presents the experimental results of our model in a systematic manner. Specifically, it describes model parameters tuning and training, followed by evaluation and comparison of the proposed model with existing related models.
Model parameters tuning
We performed an exhaustive grid search on architectural parameters as well as training hyperparameters to obtain a lightweight architecture with optimized parameter settings. Tables 5 and 6 show the architectural parameters and the training hyperparameters explored as part of this study. The architectural parameter tuning is performed sequentially with the following initial training settings: fixed-window sequential data, 500 training epochs, a sequence length of 30, the AdamW optimizer with a learning rate of 0.0001 and weight decay of 0.00001, the GeLU activation function, and a batch size of 32. The optimized values of the architectural parameters and hyperparameters are given in Tables 5 and 6, respectively.
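A minimal sketch of such an exhaustive grid search is given below; both the candidate values and the scoring function are illustrative stand-ins, since a real run would train the model for each configuration and read off its validation accuracy.

```python
import itertools

# Sketch of an exhaustive grid search over architectural parameters.
# Candidate values are illustrative, not the paper's Table 5 ranges.
grid = {
    "encoder_layers": [1, 2, 3, 4],
    "embed_dim": [24, 40, 56],
    "dropout": [0.10, 0.30, 0.50],
}

def validation_accuracy(cfg):
    # Placeholder score: a real run would train InterAcT with cfg and
    # evaluate it on the validation set.
    return (-abs(cfg["encoder_layers"] - 3)
            - abs(cfg["embed_dim"] - 56) / 100
            - cfg["dropout"])

best = max(
    (dict(zip(grid, combo)) for combo in itertools.product(*grid.values())),
    key=validation_accuracy,
)
assert best == {"encoder_layers": 3, "embed_dim": 56, "dropout": 0.10}
```

Performing the search sequentially, one parameter at a time as described above, reduces the number of trained configurations compared to the full Cartesian product.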
The results of architectural parameter tuning are shown in Fig 5. Fig 5a illustrates the effect of varying the number of encoder layers: the validation accuracy increases with the number of encoder layers and tends to stabilize at 3 encoder layers, giving the highest validation accuracy of 0.7404. Fig 5b illustrates the effect of the embedding dimension: validation accuracy increases with the embedding dimension, and the highest validation accuracy of 0.7610 is obtained at an embedding dimension of 56. Fig 5c illustrates the effect of varying the dropout rate: the validation accuracy decreases as the dropout rate increases, because more input units are set to zero during training, leading the model to underfit. The highest validation accuracy of 0.7500 is obtained at a dropout of 0.10. Fig 5d illustrates the effect of varying the MLP head dimension: the highest validation accuracy of 0.7692 is obtained at an MLP head dimension of 96. Thus, the selected architectural parameters for our model are: three encoder layers, an embedding dimension of 56, a dropout rate of 0.10, and an MLP head dimension of 96.
Similarly, the results of hyperparameter tuning are illustrated in Fig 6. The results for different optimizers with varying learning rates and weight decays are illustrated in Fig 6a. Among the optimizers, AdamW with a learning rate of 0.001 and a weight decay of 0.000001 provides the highest validation accuracy of 0.7720. Fig 6b shows the effect of activation functions; among them, GeLU gives the highest validation accuracy of 0.7830. Fig 6c illustrates the effect of batch size: increasing the batch size increases validation accuracy and leads to faster convergence and reduced training time, although increasing it too far can degrade performance. The highest validation accuracy of 0.7473 is obtained at a batch size of 192. Fig 6d compares fixed-window and sliding-window sequences for training the model. The sliding window yields a higher validation accuracy of 0.9967 due to the increased number of sequences, whereas the fixed window produces fewer sequences for model training, so performance drops. Thus, the selected training parameters for our model are: the AdamW optimizer with a learning rate of 0.001 and weight decay of 0.000001, the GeLU activation function, a batch size of 192, and sliding-window sequences.
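The fixed-window versus sliding-window contrast can be made concrete with a small sketch; frame indices stand in for per-frame keypoint vectors, the window length is 30 as in the paper, and a stride of 1 is assumed for the sliding case.

```python
# Fixed-window vs sliding-window sequence extraction from a frame stream.
# Window length 30 matches the paper; stride 1 for sliding is an assumption.
def fixed_windows(frames, length=30):
    """Non-overlapping windows: consecutive chunks of `length` frames."""
    return [frames[i:i + length]
            for i in range(0, len(frames) - length + 1, length)]

def sliding_windows(frames, length=30, stride=1):
    """Overlapping windows shifted by `stride` frames each time."""
    return [frames[i:i + length]
            for i in range(0, len(frames) - length + 1, stride)]

frames = list(range(90))                   # a 90-frame clip
assert len(fixed_windows(frames)) == 3     # few non-overlapping sequences
assert len(sliding_windows(frames)) == 61  # many more training sequences
```

The much larger number of overlapping sequences is what drives the accuracy gain reported for the sliding-window setting.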
Model training
After tuning the model parameters, we trained and evaluated the model. The training and validation accuracy-loss curves are shown in Fig 7. Both training and validation accuracies increase, and losses decrease, across epochs, showing that the model learns steadily. We trained the model for 500 epochs.
Performance evaluation and comparison
This section presents the performance evaluation and comparison of our model with the existing AcT models and other state-of-the-art approaches to demonstrate its effectiveness and robustness. Table 7 below shows the architecture-level comparison of our model with the existing AcT models [30]. From this table, it is clear that our model has optimized settings resulting in a lightweight architecture.
For performance evaluation, the computed confusion matrix and the classification report, including class-wise accuracy, precision, recall and F1 scores, are illustrated in Figs 8 and 9, respectively. The performance is generally very encouraging for all classes, with some exceptions where misclassifications occur due to similar spatio-temporal trends (Fig 8). For example, Fig 10 provides deeper insight into an instance where jogging_f_b_solo is misclassified as running_f_b_solo; both actions exhibit similar spatio-temporal patterns, making it difficult for the model to distinguish between them.
Column 2: a running_f_b_solo case that is misclassified as jogging_f_b_solo due to similar spatio-temporal trends.
We also compared the performance of our model with several traditional deep learning models, including 2P-GCN [55], LSTM [46], 3D-ResNet [56] and 3D-CNN [57], as well as the four state-of-the-art AcT models [30], as shown in Table 8. Table 8 reports multiple performance metrics: model parameters refer to the total learnable weights in the model; model FLOPs (floating point operations) indicate computational complexity; evaluation time is the duration for assessing model performance on the test set; inference time is the time taken to make a single prediction; throughput measures the number of sequences processed per second; accuracy reflects the percentage of correct predictions; precision is the proportion of correctly predicted positive instances among all positive predictions; recall (sensitivity) is the proportion of correctly predicted positive instances among all actual positives; and the F1 score, the harmonic mean of precision and recall, provides a single measure that balances both, which is useful in the case of class imbalance. From Table 8, it is clear that our model outperforms all the other models on multiple performance metrics. These findings indicate a better suitability of our model for deployment in real-time applications as compared to the other models.
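For reference, the per-class precision, recall and F1 score defined above follow the standard formulas, which the sketch below computes from a toy two-class confusion matrix; the counts are illustrative, not taken from the paper’s results.

```python
# Per-class metrics from a confusion matrix (rows = true, cols = predicted).
def metrics_for_class(cm, c):
    tp = cm[c][c]
    fp = sum(cm[r][c] for r in range(len(cm))) - tp   # predicted c, wrongly
    fn = sum(cm[c]) - tp                              # true c, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

cm = [[48, 2],   # e.g. a class occasionally confused with a similar one
      [3, 47]]
p, r, f1 = metrics_for_class(cm, 0)
assert round(p, 4) == 0.9412 and round(r, 4) == 0.96
```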
Conclusions and future work
In this paper, we presented an efficient and effective generic keypoints-based transformer model, called InterAcT, which is capable of recognizing solo actions as well as human-human interactions in aerial videos by utilizing bodily keypoints extracted with the YOLO v8 pose estimator. Our optimized model, comprising 0.0709 million parameters and 0.0389 GFLOPs, has demonstrated encouraging performance on the UT-Interaction and Drone Action datasets. With sliding-window sequential data, the model achieves a high accuracy of 0.9923, outperforming the AcT models (micro: 0.9353, small: 0.9893, base: 0.9907, and large: 0.9558), 2P-GCN (0.9337), LSTM (0.9774), 3D-ResNet (0.9921), and 3D CNN (0.9920). Moreover, our model also shows better performance than the other models in terms of evaluation time, inference time, throughput, precision, recall and F1 score, as well as model parameters and model FLOPs. The model’s performance, combined with its lower computational complexity, highlights its efficiency and robustness in comparison to several existing models. The key strength of the model lies in its lightweight architecture, making it more deployable in several real-world applications such as aerial surveillance, public safety monitoring, private security monitoring, home surveillance, healthcare systems, retail and customer monitoring, and autonomous vehicle systems.
As future work, the proposed model can be adapted to incorporate other action categories such as body gestures and multi-person interactions. Moreover, it would be interesting to explore the use of multi-modal data by fusing non-visual sensor data with visual data in an attempt to further enhance recognition performance in extreme weather and environmental conditions. Furthermore, it would be useful to deploy and evaluate the proposed framework on resource-constrained devices to further test its effectiveness.
References
- 1. Arshad MH, Bilal M, Gani A. Human activity recognition: review, taxonomy and open challenges. Sensors (Basel). 2022;22(17):6463. pmid:36080922
- 2. Nawaz T, Ferryman J. An annotation-free method for evaluating privacy protection techniques in videos. 2015 12th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). 2015. p. 1–6. https://doi.org/10.1109/avss.2015.7301800
- 3. Yang J, Nguyen MN, San PP, Li X, Krishnaswamy S. Deep convolutional neural networks on multichannel time series for human activity recognition. IJCAI. 2015:3995–4001.
- 4. Vepakomma P, De D, Das SK, Bhansali S. A-Wristocracy: Deep learning on wrist-worn sensing for recognition of user complex activities. 2015 IEEE 12th International Conference on Wearable and Implantable Body Sensor Networks (BSN). 2015. p. 1–6. https://doi.org/10.1109/bsn.2015.7299406
- 5. Hammerla NY, Halloran S, Plötz T. Deep, convolutional, and recurrent models for human activity recognition using wearables. arXiv preprint 2016. arXiv:1604.08880
- 6. Islam MM, Nooruddin S, Karray F, Muhammad G. Human activity recognition using tools of convolutional neural networks: a state of the art review, data sets, challenges, and future prospects. Comput Biol Med. 2022;149:106060. pmid:36084382
- 7. Almaslukh B, AlMuhtadi J, Artoli A. An effective deep autoencoder approach for online smartphone-based human activity recognition. Int J Comput Sci Netw Secur. 2017;17:160–5.
- 8. Lane ND, Georgiev P. Can deep learning revolutionize mobile sensing? Proceedings of the 16th International Workshop on Mobile Computing Systems and Applications. 2015. p. 117–22. https://doi.org/10.1145/2699343.2699349
- 9. Inoue M, Inoue S, Nishida T. Deep recurrent neural network for mobile human activity recognition with high throughput. Artif Life Robotics. 2017;23(2):173–85.
- 10. Yao S, Hu S, Zhao Y, Zhang A, Abdelzaher T. DeepSense: a unified deep learning framework for time-series mobile sensing data processing. Proceedings of the 26th International Conference on World Wide Web. 2017. p. 351–60.
- 11. Sun Z, Ke Q, Rahmani H, Bennamoun M, Wang G, Liu J. Human action recognition from various data modalities: a review. IEEE Trans Pattern Anal Mach Intell. 2023;45(3):3200–25. pmid:35700242
- 12. Poppe R. A survey on vision-based human action recognition. Image Vision Computing. 2010;28(6):976–90.
- 13. Niebles JC, Chen C-W, Fei-Fei L. Modeling temporal structure of decomposable motion segments for activity classification. Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5–11, 2010, Proceedings, Part II. 2010. p. 392–405.
- 14. Liu J, Kuipers B, Savarese S. Recognizing human actions by attributes. CVPR. 2011. pp. 3337–44.
- 15. Kong Y, Fu Y. Human action recognition and prediction: a survey. Int J Comput Vis. 2022;130:1366–1401.
- 16. Dalal N, Triggs B. Histograms of oriented gradients for human detection. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). 2005;1:886–93. https://doi.org/10.1109/cvpr.2005.177
- 17. Beyond pixels: exploring new representations and applications for motion analysis. [cited 11 Oct 2024]. Available from: https://www.researchgate.net/publication/234830763_Beyond_Pixels_Exploring_New_Representations_and_Applications_for_Motion_Analysis
- 18. Laptev I. On space-time interest points. Int J Comput Vis. 2005;64:107–23. https://doi.org/10.1007/s11263-005-1838-7
- 19. Bobick AF, Davis JW. The recognition of human movement using temporal templates. IEEE Trans Pattern Anal Machine Intell. 2001;23(3):257–67.
- 20. Vapnik VN. Statistical learning theory. Wiley; 1998. Available from: https://www.wiley.com/en-us/Statistical+Learning+Theory-p-9780471030034
- 21. Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inform Theory. 1967;13(1):21–7.
- 22. Xu L, Yang W, Cao Y, Li Q. Human activity recognition based on random forests. 2017 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD). 2017. p. 548–53. https://doi.org/10.1109/fskd.2017.8393329
- 23. Gu F, Chung M-H, Chignell M, Valaee S, Zhou B, Liu X. A survey on deep learning for human activity recognition. ACM Comput Surv. 2021;54(8):1–34.
- 24. Sánchez-Caballero A, de López-Diz S, Fuentes-Jimenez D, Losada-Gutiérrez C, Marrón-Romera M, Casillas-Pérez D, et al. 3DFCNN: real-time action recognition using 3D deep neural networks with raw depth information. Multimed Tools Appl. 2022;81(17):24119–43.
- 25. Zhang P, Lan C, Zeng W, Xing J, Xue J, Zheng N. Semantics-guided neural networks for efficient skeleton-based human action recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020.
- 26. Sánchez-Caballero A, Fuentes-Jiménez D, Losada-Gutiérrez C. Real-time human action recognition using raw depth video-based recurrent neural networks. Multimed Tools Appl. 2022;82(11):16213–35.
- 27. Ulhaq A, Akhtar N, Pogrebna G, Mian A. Vision transformers for action recognition: a survey. arXiv preprint 2022. arXiv:2209.05700
- 28. Boualia SN, Essoukri Ben Amara N. Pose-based human activity recognition: a review. 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC). 2019. p. 1468–75. https://doi.org/10.1109/iwcmc.2019.8766694
- 29. Plizzari C, Cannici M, Matteucci M. Skeleton-based action recognition via spatial and temporal transformer networks. Comput Vision Image Understanding. 2021;208–209:103219.
- 30. Mazzia V, Angarano S, Salvetti F, Angelini F, Chiaberge M. Action Transformer: a self-attention model for short-time pose-based human action recognition. Pattern Recognit. 2022;124:108487.
- 31. Uddin S, Nawaz T, Ferryman J, Rashid N, Asaduzzaman Md, Nawaz R. Skeletal keypoint-based transformer model for human action recognition in aerial videos. IEEE Access. 2024;12:11095–103.
- 32. Perera AG, Law YW, Chahl J. Drone-action: an outdoor recorded drone video dataset for action recognition. Drones. 2019;3(4):82.
- 33. Ryoo MS, Aggarwal JK. UT-Interaction Dataset, ICPR contest on Semantic Description of Human Activities (SDHA). 2010.
- 34. InterAcT GitHub Repository. 2024. Available from: https://github.com/Mshah99github/InterAcT
- 35. Azmat U, Alotaibi SS, Abdelhaq M, Alsufyani N, Shorfuzzaman M, Jalal A, et al. Aerial insights: deep learning-based human action recognition in drone imagery. IEEE Access. 2023;11:83946–61.
- 36. Ghadi Y, Waheed M, al Shloul T, Alsuhibany SA, Jalal A, Park J. Automated parts-based model for recognizing human–object interactions from aerial imagery with fully convolutional network. Remote Sensing. 2022;14(6):1492.
- 37. Srivastava A, Badal T, Garg A, Vidyarthi A, Singh R. Recognizing human violent action using drone surveillance within real-time proximity. J Real-Time Image Proc. 2021;18(5):1851–63.
- 38. Kushwaha A, Khare A, Srivastava P. On integration of multiple features for human activity recognition in video sequences. Multimed Tools Appl. 2021;80(21–23):32511–38.
- 39. Abdelbaky A, Aly S. Two-stream spatiotemporal feature fusion for human action recognition. Vis Comput. 2020;37(7):1821–35.
- 40. Xian R, Wang X, Manocha D. MITFAS: mutual information based temporal feature alignment and sampling for aerial video action recognition. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). 2024. p. 6625–34.
- 41. Sultani W, Shah M. Human action recognition in drone videos using a few aerial training examples. Comput Vision Image Understanding. 2021;206:103186.
- 42. Vrskova R, Hudec R, Kamencay P, Sykora P. Human activity classification using the 3DCNN architecture. Appl Sci. 2022;12(2):931.
- 43. Dong M, Fang Z, Li Y, Bi S, Chen J. AR3D: attention residual 3D network for human action recognition. Sensors (Basel). 2021;21(5):1656. pmid:33670835
- 44. Feng L, Zhao Y, Zhao W, Tang J. A comparative review of graph convolutional networks for human skeleton-based action recognition. Artif Intell Rev. 2021;55(5):4275–305.
- 45. Shu X, Tang J, Qi G-J, Liu W, Yang J. Hierarchical long short-term concurrent memory for human interaction recognition. IEEE Trans Pattern Anal Mach Intell. 2021;43(3):1110–8. pmid:31545711
- 46. Saeed SM, Akbar H, Nawaz T, Elahi H, Khan US. Body-pose-guided action recognition with convolutional long short-term memory (LSTM) in aerial videos. Appl Sci. 2023;13(16):9384.
- 47. Reis D, Kupec J, Hong J, Daoudi A. Real-time flying object detection with YOLOv8. 2023.
- 48. Cao Z, Hidalgo G, Simon T, Wei S-E, Sheikh Y. OpenPose: realtime multi-person 2D pose estimation using part affinity fields. 2019.
- 49. Fang H-S, Li J, Tang H, Xu C, Zhu H, Xiu Y, et al. AlphaPose: whole-body regional multi-person pose estimation and tracking in real-time. 2022.
- 50. Guo Y, Liu J, Li G, Mai L, Dong H. Fast and flexible human pose estimation with HyperPose. Proceedings of the 29th ACM International Conference on Multimedia. 2021. p. 3763–6. https://doi.org/10.1145/3474085.3478325
- 51. Bazarevsky V, Grishchenko I, Raveendran K, Zhu T, Zhang F, Grundmann M. BlazePose: on-device real-time body pose tracking. 2020.
- 52. Wang C-Y, Bochkovskiy A, Liao H-YM. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. 2022.
- 53. Drone Action: An Outdoor Recorded Drone Video Dataset for Action Recognition. 2019. Available from: https://asankagp.github.io/droneaction/
- 54. UT Interaction Dataset. 2010. Available from: https://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html
- 55. Li Z, Li Y, Tang L, Zhang T, Su J. Two-person graph convolutional network for skeleton-based human interaction recognition. IEEE Trans Circuits Syst Video Technol. 2023;33(7):3333–42.
- 56. Kataoka H, Wakamiya T, Hara K, Satoh Y. Would mega-scale datasets further enhance spatiotemporal 3D CNNs? 2020.
- 57. Ji S, Yang M, Yu K. 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell. 2013;35(1):221–31. pmid:22392705