TinyAct: A framework for real-time action recognition in the cloud through distillation learning

doi:10.1371/journal.pone.0347245

Fig 1.

TinyAct AIoT architecture.

Edge devices extract 1024-dimensional latent features via a lightweight 3D autoencoder and transmit them to the cloud for action classification and visualization. Note: Video frames are illustrative samples from the Kinetics-400 dataset [31]; facial regions have been obscured.
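
As a rough illustration of why transmitting latent features instead of video can reduce edge-to-cloud traffic, the sketch below packs a 1024-dimensional latent vector and compares its size with a raw 16-frame clip. The float32 little-endian encoding, the three color channels, the 8-bit pixel depth, and the helper names `pack_latent` and `unpack_latent` are assumptions made for illustration; the caption does not specify TinyAct's wire format.

```python
import struct

LATENT_DIM = 1024  # latent size stated in the caption


def pack_latent(latent):
    """Serialize a latent vector as little-endian float32 (assumed encoding)."""
    if len(latent) != LATENT_DIM:
        raise ValueError(f"expected {LATENT_DIM} values, got {len(latent)}")
    return struct.pack(f"<{LATENT_DIM}f", *latent)


def unpack_latent(payload):
    """Inverse of pack_latent, as the cloud side might apply it."""
    return list(struct.unpack(f"<{LATENT_DIM}f", payload))


payload = pack_latent([0.0] * LATENT_DIM)

# Raw clip size under the assumed layout: 16 frames of 256 x 256 RGB, uint8.
raw_clip_bytes = 16 * 256 * 256 * 3

print(len(payload), raw_clip_bytes)
```

Under these assumptions the latent payload is 4,096 bytes against 3,145,728 bytes for the uncompressed clip; a real deployment would compare against compressed video, so the practical ratio would be smaller.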

Fig 2.

The base autoencoder architecture showing the input and output layers with video frames of size 256 × 256 × N, and the latent representation layer used as input for traditional machine learning classifiers including SVM, Random Forest, and XGBoost.

Fig 3.

An overview of the knowledge distillation framework employed in our training pipeline.

The student model, consisting of a VideoAutoEncoder3D coupled with an MLPResNetClassifier, is optimized to replicate the behavior of a pretrained teacher model (ILA-ViT-B/16) by minimizing a composite loss function. This framework integrates three key components: (1) classification supervision via cross-entropy loss with ground-truth labels, (2) soft-target alignment by minimizing the Kullback–Leibler divergence between teacher and student logits, and (3) a reconstruction objective that serves as an auxiliary regularization signal to improve representational fidelity.
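
The three loss components named in this caption can be combined as in the minimal sketch below, written in plain Python for a single sample. The weights `alpha`, `beta`, and `gamma`, the temperature value, the T² scaling of the distillation term, the KL direction (teacher relative to student), and the use of mean squared error for reconstruction are all assumptions drawn from common distillation practice; the caption does not state the values or exact form used in TinyAct.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """(1) Supervised loss against the ground-truth class index."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(teacher_logits, student_logits, temperature):
    """(2) KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(x, x_hat):
    """(3) Reconstruction loss between the input and the decoder output."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def composite_loss(student_logits, teacher_logits, label, x, x_hat,
                   alpha=1.0, beta=1.0, gamma=1.0, temperature=2.0):
    """Weighted sum of the three terms; weights and temperature are hypothetical."""
    ce = cross_entropy(student_logits, label)
    # The T^2 factor is a common convention for soft-target gradients (assumed here).
    kd = temperature ** 2 * kl_divergence(teacher_logits, student_logits, temperature)
    rec = mse(x, x_hat)
    return alpha * ce + beta * kd + gamma * rec
```

When the student's logits match the teacher's and the reconstruction is exact, the second and third terms vanish and the loss reduces to the cross-entropy term alone, which is a quick sanity check for any implementation along these lines.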

Table 1.

Comparison of Model Performance without Knowledge Distillation Using Base AutoEncoders with 1024-Dimensional Latent Feature Vectors from 16-Frame Video Inputs.

Table 2.

Comparison of Model Performance under Knowledge Distillation Using Pretrained and Non-Pretrained AutoEncoders with 1024-Dimensional Latent Feature Vectors from 16-Frame Video Inputs.

Table 3.

Comparison with lightweight action recognition methods on Kinetics-400. SOTA metrics are taken from the original publications, which report results on server-class GPUs, whereas TinyAct was measured on edge hardware (Jetson Xavier NX). Direct latency comparison across platforms is therefore not applicable.
