Fig 1.
Two sample images from the dataset are shown.
Fig 2.
This figure depicts the architectures of the three reported DNN models, side by side. (a) CNN and LSTM implemented using a time-distributed layer: one image frame from the video is given as input at each time step, and the action is predicted at the end. (b) Conv2DLSTM: the image frame at each time step is presented to the network, and the features from all the images are used for action prediction. (c) 3DCNN: all images are concatenated and presented to the network for action prediction.
Fig 3.
This figure depicts an LSTM node of an RNN.
Ct−1 is the cell state at the previous frame and Ct is the cell state at the image frame being processed. ht−1 and ht are the hidden-state activations at the previous and current time steps, respectively. The patterned boxes depict the forget (ft), input (it) and output (ot) gates, respectively, at the current time step.
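The state update described in this caption can be sketched for a single scalar cell as follows. This is a minimal illustration that assumes the standard LSTM formulation (sigmoid gates, a tanh candidate value g_t, no peephole connections); the function name `lstm_step`, the `params` layout and the scalar weights are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell (standard formulation, assumed).

    params maps a gate name ("f", "i", "o", "g") to a tuple
    (w_x, w_h, b): input weight, recurrent weight and bias.
    Returns the new hidden state h_t and cell state c_t.
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h_prev + b

    f_t = sigmoid(pre_activation("f"))    # forget gate
    i_t = sigmoid(pre_activation("i"))    # input gate
    o_t = sigmoid(pre_activation("o"))    # output gate
    g_t = math.tanh(pre_activation("g"))  # candidate cell value

    c_t = f_t * c_prev + i_t * g_t        # keep part of C(t-1), add new content
    h_t = o_t * math.tanh(c_t)            # expose part of the cell state
    return h_t, c_t
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved at each step; this is a convenient sanity check for the update rule.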
Fig 4.
This figure illustrates a convolutional LSTM.
Xt depicts the image, and the red block in the middle encloses the local representation. ht−1 is the hidden state from processing the previous image frame and Ct−1 is the corresponding cell state. ht and Ct are the hidden and cell states, respectively, computed from Xt, ht−1 and Ct−1. A red box shows the neighboring states that contribute to the computation.
Fig 5.
This figure illustrates a 3DCNN.
The rectangle on top depicts slices of video frames; each slice is a color-coded image. Contiguous frames are convolved with a 3D kernel, which is depicted with regular and dashed lines. The rectangle at the bottom depicts the 3D convolution of the image tensor with the kernel tensor (it should not be confused with the shared filters).
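The sliding of a 3D kernel over contiguous frames can be sketched in plain Python as below. The sketch assumes a single channel, stride 1 and no padding, and it computes a cross-correlation (no kernel flip), which is what deep learning libraries conventionally call convolution; `conv3d_valid` is a hypothetical helper, not code from the paper.

```python
def conv3d_valid(volume, kernel):
    """Valid, stride-1 3D cross-correlation.

    volume and kernel are nested lists indexed as [frame][row][column].
    Each output value sums the products over a block of contiguous
    frames and a spatial neighbourhood covered by the kernel.
    """
    n_t, n_h, n_w = len(volume), len(volume[0]), len(volume[0][0])
    k_t, k_h, k_w = len(kernel), len(kernel[0]), len(kernel[0][0])

    output = []
    for t in range(n_t - k_t + 1):          # slide along time
        plane = []
        for y in range(n_h - k_h + 1):      # slide along height
            row = []
            for x in range(n_w - k_w + 1):  # slide along width
                acc = 0.0
                for dt in range(k_t):
                    for dy in range(k_h):
                        for dx in range(k_w):
                            acc += (volume[t + dt][y + dy][x + dx]
                                    * kernel[dt][dy][dx])
                row.append(acc)
            plane.append(row)
        output.append(plane)
    return output
```

For a 3×3×3 volume and a 2×2×2 kernel, the output has shape 2×2×2, because the kernel fits in two positions along each axis.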
Fig 6.
The graph in this image depicts the distribution of classes in the training data of a LOOCV sample.
Fig 7.
(a) shows the composition of a Conv2DLSTM network. The receptive field is adjusted using sub-sampling and strides (convLSTM is a Conv2DLSTM layer, fc is a fully connected layer and SM is a softmax layer). (b) shows the composition of a 3DCNN. The receptive fields of the subsequent layers are adjusted using strides only (3DConv is a 3DCNN layer). The number at the bottom of each shape gives the number of filters in that layer. The I/x term at the bottom corner of each layer gives the reduced feature size after subsampling: 1/x of the size of image I, applied to H and W separately.
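The I/x notation can be made concrete with a small calculation. The sketch below assumes "same"-style padding, where each layer's output size is the input size divided by the stride and rounded up; the 224×224 input and the stride list in the usage note are hypothetical values chosen for illustration, not the paper's configuration.

```python
import math


def feature_size(height, width, strides):
    """Spatial feature size after a stack of strided or subsampling layers.

    Assumes 'same'-style padding, so each layer maps a size n to
    ceil(n / stride). H and W are reduced separately. Returns the final
    (height, width) and the cumulative reduction factor x in I/x.
    """
    factor = 1
    for stride in strides:
        height = math.ceil(height / stride)
        width = math.ceil(width / stride)
        factor *= stride
    return height, width, factor
```

For example, `feature_size(224, 224, [2, 2, 2])` gives a 28×28 feature map with a cumulative factor of 8, which would be labelled I/8 in the figure's notation.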
Fig 8.
The graph in this image depicts the distribution of classes in the testing data of a LOOCV sample.
Table 1.
This table presents the test accuracy averaged over nine runs using leave-one-out cross-validation (LOOCV).
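The averaging behind such a table can be sketched as follows. The unit that is left out (for example, one video or one subject per fold) is not stated in the caption, so the `items` argument is a generic placeholder; both helper names are hypothetical.

```python
def leave_one_out_splits(items):
    """Yield (train, test) pairs where each item is held out exactly once."""
    for i in range(len(items)):
        train = items[:i] + items[i + 1:]
        test = [items[i]]
        yield train, test


def mean_accuracy(fold_accuracies):
    """Average the per-fold test accuracies into a single reported figure."""
    return sum(fold_accuracies) / len(fold_accuracies)
```

A model would be trained on `train` and evaluated on `test` for each split, and the resulting accuracies passed to `mean_accuracy`.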
Fig 9.
This chart depicts statistics of the classified actions in a complete one-hour lecture video.
The results were generated with Conv2DLSTM. These results can be correlated with the teaching standards.
Fig 10.
This image depicts a confusion matrix of the statistics presented in Fig 9.
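A confusion matrix of this kind can be built from per-frame or per-clip labels as sketched below. The row and column convention (rows are true classes, columns are predicted classes) is an assumption, and the function names and class labels are hypothetical placeholders rather than the dataset's action classes.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count predictions per class pair.

    Entry [r][c] is the number of samples whose true class is classes[r]
    and whose predicted class is classes[c] (assumed convention).
    """
    index = {name: k for k, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


def overall_accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. classified correctly."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total
```

Off-diagonal entries show which pairs of actions the classifier confuses, which the overall accuracy alone does not reveal.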
Fig 11.
Example of a poor-quality video: the writing on the board is not visible.
Table 2.
Comparison with the IAVID-1 dataset.