Fig 1.
Top image: a classic 2D GAN uses a generator that transforms samples from a Gaussian distribution into 2D images. In practice, all separate frames are represented in this Gaussian latent space. Bottom image: the MeVGAN model takes such a pre-trained model and incorporates an additional neural network that models the correct order of frames (see the red curve in the latent space). Since the generator is pre-trained, we only need to train this additional network (the plugin).
Fig 2.
The MeVGAN model uses a pre-trained ProGAN model, which consists of a generator and a discriminator. In practice, all separate frames are represented in ProGAN's Gaussian latent space. MeVGAN takes this pre-trained model and incorporates two additional neural networks: a plugin and a video discriminator, which are responsible for the correct sequence of frames. The plugin transfers Gaussian noise z and time indexes into ProGAN's latent codes for the separate frames. Then we use the pre-trained ProGAN generator to obtain the video frames. Before applying the video discriminator, we pass the frames through the pre-trained ProGAN discriminator (without its last layer) to obtain a low-dimensional representation of the movie. This full video representation goes into a classic 2D video discriminator.
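For readers who prefer code, the following is a minimal PyTorch sketch of this generation pipeline under stated assumptions: the names plugin, progan_G, progan_D_features, and video_D, the latent dimension, and the normalized time indexes are illustrative placeholders, not the paper's actual implementation.

```python
import torch

# Minimal sketch of the MeVGAN pipeline described above (assumed components):
#   plugin            -- maps (noise z, time index t) to a ProGAN latent code
#   progan_G          -- pre-trained, frozen ProGAN generator (latent -> frame)
#   progan_D_features -- pre-trained ProGAN discriminator without its last layer
#   video_D           -- classic 2D discriminator judging the whole video

def generate_video(plugin, progan_G, n_frames, z_dim=512, device="cpu"):
    z = torch.randn(1, z_dim, device=device)            # one noise vector per video
    ts = torch.linspace(0, 1, n_frames, device=device)  # normalized time indexes
    frames = [progan_G(plugin(z, t.view(1, 1))) for t in ts]
    return torch.stack(frames, dim=1)                   # (batch, time, C, H, W)

def discriminate_video(video, progan_D_features, video_D):
    b, t = video.shape[:2]
    feats = progan_D_features(video.flatten(0, 1))  # low-dim code per frame
    feats = feats.view(b, 1, t, -1)                 # (batch, 1, time, feat_dim)
    return video_D(feats)                           # probability the video is real
```

Because progan_G and progan_D_features stay frozen, only the comparatively small plugin and video discriminator receive gradients, which is where the memory efficiency of the approach comes from.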
Fig 3.
The figure shows the learning curves of the plugin and the discriminator for several datasets.
Table 1.
Detailed description of the Plugin and Video Discriminator architectures. The Plugin takes a random noise vector z and a temporal vector T as input. At each of its four layers, it first concatenates the temporal vector T with the previous layer's output and then forwards the result through a Linear layer; the first three layers additionally apply a ReLU activation. After the fourth layer, the output vector is normalized to lie on a hypersphere. The Video Discriminator takes the low-dimensional representation of a generated video as input. Each of its four layers applies a 2D convolution followed by a ReLU activation. At the end, the output is flattened and forwarded through a Linear layer and a Sigmoid function to obtain a probability. The Conv2D parameters list, in order, the number of input and output channels, the kernel size, and the stride. The Linear layer parameters refer to the numbers of input and output channels.
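As a reading aid, here is a minimal PyTorch sketch of the two networks exactly as the caption describes them. The layer widths, channel counts, kernel sizes, and strides below are placeholders; the real values are the ones listed in the table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Plugin(nn.Module):
    """Four Linear layers; T is concatenated to the input of every layer,
    ReLU after the first three, hypersphere normalization at the end.
    Layer widths are placeholders -- see the table for the real values."""
    def __init__(self, z_dim=512, t_dim=1, hidden=512, out_dim=512):
        super().__init__()
        self.fc1 = nn.Linear(z_dim + t_dim, hidden)
        self.fc2 = nn.Linear(hidden + t_dim, hidden)
        self.fc3 = nn.Linear(hidden + t_dim, hidden)
        self.fc4 = nn.Linear(hidden + t_dim, out_dim)

    def forward(self, z, t):
        h = F.relu(self.fc1(torch.cat([z, t], dim=1)))
        h = F.relu(self.fc2(torch.cat([h, t], dim=1)))
        h = F.relu(self.fc3(torch.cat([h, t], dim=1)))
        h = self.fc4(torch.cat([h, t], dim=1))
        return F.normalize(h, dim=1)  # project onto the unit hypersphere

class VideoDiscriminator(nn.Module):
    """Four Conv2D+ReLU blocks over the low-dimensional video representation,
    then flatten -> Linear -> Sigmoid. Channels/kernels/strides are placeholders."""
    def __init__(self, in_ch=1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.fc = nn.LazyLinear(1)  # avoids hard-coding the flattened size

    def forward(self, x):              # x: (batch, in_ch, time, feat_dim)
        h = self.conv(x).flatten(1)
        return torch.sigmoid(self.fc(h))
```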
Fig 4.
An example of video frames generated by the MeVGAN model presented in this work, trained on four categories of the UCF-101 dataset. Each row presents twelve consecutive generated frames from a movie belonging to the category given on the right side of the figure. Compare with Fig 5, which presents example video frames generated by TGANv2.
Fig 5.
An example of video frames generated by the TGANv2 model trained on four categories of the UCF-101 dataset. Each row presents twelve consecutive generated frames from a movie belonging to the category given on the right side of the figure. Compare with Fig 4, which presents example video frames generated by MeVGAN.
Table 2.
The results of the comparison between our method (MeVGAN) and Temporal GAN v2 (TGANv2) in three evaluation metrics: Fréchet Video Distance (FVD), Fréchet Inception Distance (FID), and Inception Score (IS). All calculations were performed on 4092 videos, each containing 16 frames. Since the last two metrics are measured on images, we randomly divided all frames into five parts and calculated each subset's mean and standard deviation. A higher IS is better, in contrast to FVD and FID, where lower values are better. The values in the table show the improved performance of our approach according to these metrics.
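The five-way split used for the image-level metrics can be sketched as below; compute_metric is a hypothetical placeholder for an FID or IS implementation (for FID it would also close over the statistics of the real frames), and frames is assumed to be a NumPy array of generated frames.

```python
import numpy as np

def metric_mean_std(frames, compute_metric, n_splits=5, seed=0):
    """Randomly divide the frames into n_splits subsets and return the mean
    and standard deviation of a per-image metric over those subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(frames))                  # random assignment
    scores = [compute_metric(frames[part])              # metric per subset
              for part in np.array_split(idx, n_splits)]
    return float(np.mean(scores)), float(np.std(scores))
```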
Fig 6.
Example frames from five video clips produced by MeVGAN and TGANv2, both trained on our colonoscopy data. Each row contains 12 consecutive frames from one clip.
Fig 7.
An example of MeVGAN's ability to generate many different stages of the colonoscopy procedure, e.g., an intestine with visible haustration (a), the colonoscope sliding along the intestinal wall (b), a rinsing procedure to remove impurities (c), an intestine with a visible polyp (d), and ulcerative colitis (e).