Employing automatic content recognition for teaching methodology analysis in classroom videos

doi:10.1371/journal.pone.0263448

Fig 1.

Two sample images from the dataset are shown.

Fig 2.

This figure depicts the architectures of the three reported DNN models, side by side. (a) CNN and LSTM implemented using a time-distributed layer: one image frame from the video is given as input per time step, and the action is predicted at the end. (b) Conv2DLSTM: the image frame at each time step is presented to the network, and the features from all the images are used for action prediction. (c) 3DCNN: all images are concatenated and presented to the network for action prediction.

Fig 3.

This figure depicts an LSTM node of an RNN.

C_t−1 is the cell state at the previous frame and C_t is the cell state at the image frame being processed. h_t−1 and h_t are the hidden-state activations at the previous and current time steps, respectively. The patterned boxes depict the forget (f_t), input (i_t) and output (o_t) gates, respectively, at the current time step.
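
The gate structure in this caption can be sketched as a single time step of a standard LSTM cell. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the `params` layout, and the candidate activation `g_t` are assumptions introduced here, and the textbook LSTM update is assumed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, bias, vec):
    """Return weights @ vec + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step (hypothetical helper).

    params maps a gate name ("f", "i", "o", "g") to a (W, b) pair that
    acts on the concatenation [h_{t-1}, x_t].
    """
    z = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(*params["f"], z)]    # forget gate
    i_t = [sigmoid(a) for a in affine(*params["i"], z)]    # input gate
    o_t = [sigmoid(a) for a in affine(*params["o"], z)]    # output gate
    g_t = [math.tanh(a) for a in affine(*params["g"], z)]  # candidate state
    # New cell state: keep part of C_{t-1}, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    # New hidden state: gated, squashed cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved at each step, which makes a convenient sanity check.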

Fig 4.

This figure illustrates convolutional LSTM.

X_t depicts the image, and the red block in the middle encloses the local representation. h_t−1 is the hidden state from processing the previous image frame and C_t−1 is the corresponding cell state. h_t and C_t are the hidden and cell states, respectively, computed from X_t, h_t−1 and C_t−1. A red box shows the neighboring states that contribute to the computation.
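
To make the role of the neighboring states concrete, the following sketch replaces the matrix products of an LSTM with small 2D convolutions over X_t and h_t−1. It is a simplified illustration under stated assumptions, not the model from the paper: a single channel, zero "same" padding, no peephole connections, and hypothetical names (`conv2d_same`, `conv_lstm_step`, the `kernels` layout).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def conv2d_same(grid, kernel):
    """Zero-padded 2D cross-correlation; output has the same H x W as grid."""
    h, w = len(grid), len(grid[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in range(kh):
                for dc in range(kw):
                    rr, cc = r + dr - kh // 2, c + dc - kw // 2
                    if 0 <= rr < h and 0 <= cc < w:  # outside the grid counts as zero
                        acc += kernel[dr][dc] * grid[rr][cc]
            out[r][c] = acc
    return out

def conv_lstm_step(x_t, h_prev, c_prev, kernels):
    """One ConvLSTM step; kernels maps gate -> (kernel for x_t, kernel for h_prev, bias)."""
    rows, cols = len(x_t), len(x_t[0])

    def pre_activation(gate):
        k_x, k_h, bias = kernels[gate]
        a_x, a_h = conv2d_same(x_t, k_x), conv2d_same(h_prev, k_h)
        return [[a_x[r][c] + a_h[r][c] + bias for c in range(cols)]
                for r in range(rows)]

    f, i, o, g = (pre_activation(name) for name in ("f", "i", "o", "g"))
    # Each output position depends only on a local window of x_t and h_prev.
    c_t = [[sigmoid(f[r][c]) * c_prev[r][c] + sigmoid(i[r][c]) * math.tanh(g[r][c])
            for c in range(cols)] for r in range(rows)]
    h_t = [[sigmoid(o[r][c]) * math.tanh(c_t[r][c])
            for c in range(cols)] for r in range(rows)]
    return h_t, c_t
```

Feeding a single bright pixel through a 3×3 kernel changes the cell state only in the 3×3 window around that pixel, which is the locality the red box in the figure is meant to convey.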

Fig 5.

This figure illustrates 3DCNN.

The rectangle at the top depicts slices of video frames; each slice is a color-coded image. Contiguous frames are convolved with a 3D kernel, which is depicted with regular and dashed lines. The rectangle at the bottom depicts the 3D convolution of the image tensor with the kernel tensor (it should not be confused with the shared filters).
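
The operation in this caption can be sketched as a 3D kernel sliding over time as well as height and width, so that each output value mixes several contiguous frames. This is a minimal single-channel illustration in plain Python, assuming no padding and a stride of 1; the function name `conv3d_valid` and the `[t][y][x]` indexing are choices made here, not taken from the paper.

```python
def conv3d_valid(volume, kernel):
    """3D cross-correlation without padding, stride 1.

    volume is indexed [t][y][x] (stacked frames); kernel is indexed [dt][dy][dx].
    """
    n_t, n_y, n_x = len(volume), len(volume[0]), len(volume[0][0])
    k_t, k_y, k_x = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(n_t - k_t + 1):
        plane = []
        for y in range(n_y - k_y + 1):
            row = []
            for x in range(n_x - k_x + 1):
                acc = 0.0
                # Sum over k_t contiguous frames and a k_y x k_x spatial window.
                for dt in range(k_t):
                    for dy in range(k_y):
                        for dx in range(k_x):
                            acc += kernel[dt][dy][dx] * volume[t + dt][y + dy][x + dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out
```

For example, four 3×3 frames convolved with a 2×2×2 kernel give an output of three 2×2 planes, each combining two neighboring frames.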

Fig 6.

The graph depicts the distribution of classes in the training data in one LOOCV sample.
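
A class distribution of this kind can be computed per fold as sketched below. The sketch assumes each sample is a `(sample_id, label)` pair and that exactly one sample is held out per fold; the caption does not say what unit the study leaves out, and the function names and labels are placeholders.

```python
from collections import Counter

def loocv_splits(samples):
    """Yield (training samples, held-out sample) for leave-one-out cross-validation."""
    for held_out in range(len(samples)):
        train = samples[:held_out] + samples[held_out + 1:]
        yield train, samples[held_out]

def class_distribution(train):
    """Count how many training samples carry each class label."""
    return Counter(label for _, label in train)

# Placeholder data: three clips with made-up labels "A" and "B".
samples = [("clip1", "A"), ("clip2", "A"), ("clip3", "B")]
for train, test in loocv_splits(samples):
    print(test[0], dict(class_distribution(train)))
```

Because a different sample is held out in each fold, the training distribution shifts slightly from fold to fold.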

Fig 7.

(a) shows the composition of a Conv2DLSTM network. The receptive field is adjusted using sub-sampling and strides (convLSTM is a Conv2DLSTM layer, fc is a fully connected layer and SM is a softmax layer). (b) shows the composition of a 3DCNN. The receptive fields of the subsequent layers are adjusted using strides only (3DConv is a 3DCNN layer). The number at the bottom of each shape gives the number of filters in that layer. The I/x term at the bottom corner of each layer shows the feature size after subsampling, reduced to 1/x of the size of image I (H and W separately).
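
The I/x notation can be illustrated with a short calculation of how the feature-map size shrinks through successive strided or subsampling layers. The sketch assumes "same"-style padding, so each layer divides H and W by its stride and rounds up; the input size and stride values are made-up examples, not the configuration used in the paper.

```python
def reduced_sizes(height, width, strides):
    """Return the (H, W) feature size after each strided or subsampling layer."""
    sizes = []
    for stride in strides:
        # Ceiling division, applied to H and W separately.
        height = -(-height // stride)
        width = -(-width // stride)
        sizes.append((height, width))
    return sizes

# Hypothetical 128 x 96 input passed through three stride-2 layers:
# the sizes correspond to I/2, I/4 and I/8.
print(reduced_sizes(128, 96, [2, 2, 2]))
```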
