Fig 1.
Edge devices extract 1024-dimensional latent features via a lightweight 3D autoencoder and transmit them to the cloud for action classification and visualization. Note: Video frames are illustrative samples from the Kinetics-400 dataset [31]; facial regions have been obscured.
Fig 2.
The base autoencoder architecture. The input and output layers operate on video frames of size 256 × 256 × N, and the latent representation layer serves as input to traditional machine learning classifiers, including SVM, Random Forest, and XGBoost.
Fig 3.
An overview of the knowledge distillation framework employed in our training pipeline.
The student model, a VideoAutoEncoder3D coupled with an MLPResNetClassifier, is optimized to replicate the behavior of a pretrained teacher model (ILA-ViT-B/16) by minimizing a composite loss function. The framework combines three components: (1) classification supervision via cross-entropy loss against ground-truth labels, (2) soft-target alignment by minimizing the Kullback–Leibler divergence between teacher and student logits, and (3) a reconstruction objective that serves as an auxiliary regularization signal to improve representational fidelity.
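The three loss components described in this caption can be illustrated with a minimal, dependency-free sketch. It is not the training code behind the figure: the weights `alpha`, `beta`, `gamma`, the softmax `temperature`, the `temperature ** 2` scaling of the distillation term, and the use of mean squared error for reconstruction are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """(1) Classification supervision against the ground-truth label."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(teacher_logits, student_logits, temperature):
    """(2) KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(reconstruction, target):
    """(3) Reconstruction objective; MSE is an assumed choice."""
    return sum((r - t) ** 2 for r, t in zip(reconstruction, target)) / len(target)


def composite_loss(student_logits, teacher_logits, label,
                   reconstruction, target,
                   alpha=1.0, beta=1.0, gamma=1.0, temperature=2.0):
    """Weighted sum of the three terms; all weights are hypothetical."""
    ce = cross_entropy(student_logits, label)
    # Scaling by temperature ** 2 is a common distillation convention,
    # assumed here rather than taken from the source.
    kd = temperature ** 2 * kl_divergence(teacher_logits, student_logits, temperature)
    rec = mse(reconstruction, target)
    return alpha * ce + beta * kd + gamma * rec
```

When the student logits match the teacher logits and the reconstruction matches its target, the distillation and reconstruction terms vanish and only the cross-entropy term remains, which is a convenient sanity check for any implementation of this kind of loss.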
Table 1.
Comparison of Model Performance without Knowledge Distillation Using Base AutoEncoders with 1024-Dimensional Latent Feature Vectors from 16-Frame Video Inputs.
Table 2.
Comparison of Model Performance under Knowledge Distillation Using Pretrained and Non-Pretrained AutoEncoders with 1024-Dimensional Latent Feature Vectors from 16-Frame Video Inputs.
Table 3.
Comparison with lightweight action recognition methods on Kinetics-400. SOTA metrics are taken from the original publications, which report results on server-class GPUs, whereas TinyAct is measured on edge hardware (Jetson Xavier NX). Latency is therefore not directly comparable across platforms.