MeVGAN: GAN-based plugin model for video generation with applications in colonoscopy

  • Łukasz Struski,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    lukasz.struski@uj.edu.pl

    Affiliations Faculty of Mathematics and Computer Science, Jagiellonian University, Kraków, Poland, Skopia Medical Center, Kraków, Poland

  • Tomasz Urbańczyk,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing

    Affiliations Marian Smoluchowski Institute of Physics, Jagiellonian University, Kraków, Poland, Skopia Medical Center, Kraków, Poland

  • Krzysztof Bucki,

    Roles Data curation, Formal analysis, Funding acquisition, Writing – original draft, Writing – review & editing

    Affiliation Skopia Medical Center, Kraków, Poland

  • Bartłomiej Cupiał,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Faculty of Mathematics and Computer Science, Jagiellonian University, Kraków, Poland

  • Aneta Kaczyńska,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Faculty of Mathematics and Computer Science, Jagiellonian University, Kraków, Poland

  • Przemysław Spurek,

    Roles Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing

    Affiliations Faculty of Mathematics and Computer Science, Jagiellonian University, Kraków, Poland, Skopia Medical Center, Kraków, Poland

  • Jacek Tabor

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing

    Affiliations Faculty of Mathematics and Computer Science, Jagiellonian University, Kraków, Poland, Skopia Medical Center, Kraków, Poland

Abstract

The generation of videos is crucial, particularly in the medical field, where a significant amount of data is presented in this format. However, due to the extensive memory requirements, creating high-resolution videos poses a substantial challenge for generative models. In this paper, we introduce the Memory Efficient Video GAN (MeVGAN), a Generative Adversarial Network (GAN) that incorporates a plugin-type architecture. This system utilizes a pre-trained 2D-image GAN, to which we attach a straightforward neural network designed to develop specific trajectories within the noise space. These trajectories, when processed through the GAN, produce realistic videos. We deploy MeVGAN specifically for creating colonoscopy videos, a critical procedure in the medical field, notably helpful for screening and treating colorectal cancer. We show that MeVGAN can produce good quality synthetic colonoscopy videos, which can be potentially used in virtual simulators.

Introduction

Video generation is an important field in AI, with many critical applications in biological domains and medicine [1, 2]. However, video generation for medicine, where the data typically has a high resolution, is very demanding for generative models due to their large memory requirements.

To generate high-quality images, we usually use a GAN [3] (Generative Adversarial Network), which uses a minimax game to model the data distribution. GAN learns a generator network G that transforms samples z from Gaussian noise into an image G(z). The generator learns by playing against an adversarial discriminator network D, which aims to distinguish between samples from the true data distribution and the generator's distribution. After training, we have the GAN generator G; see the top model in Fig 1.

Fig 1. Top image: a classic 2D GAN uses a generator G, which transforms samples from a Gaussian distribution into 2D images.

In practice, all separate frames are represented in the Gaussian space. Bottom image: the MeVGAN model uses such a pre-trained model and incorporates an additional neural network P, which models the correct order of frames (see the red curve in the latent space). Since the generator G is pre-trained, we only need to train P.

https://doi.org/10.1371/journal.pone.0312038.g001

We often need to produce many consecutive frames when we generate video. Therefore, the typical architecture that generates video is large [4, 5]. Furthermore, the specificity of videos, as sequences of similar images, forces the use of 3D convolutions [4–9] or two discriminators [10, 11]. The biggest drawback of the existing methods is that they generate full videos directly. Such an approach requires high computational resources. As a consequence, relatively short videos are produced. Alternatively, longer videos can be generated, but with lower quality.

To solve the above problem, we present MeVGAN (https://github.com/gmum/MeVGAN), a method that uses a classic GAN model trained directly on 2D images and produces videos by creating the correct combination of relevant images. Conceptually, MeVGAN can be seen as a plugin [12] to a GAN. Thanks to such a solution, we can produce long and high-resolution videos. In MeVGAN, we use frames from the training videos to train the 2D GAN generator. Then, we train an additional neural network P, which learns to connect existing frames into correct videos; see the bottom model in Fig 1. The neural network P produces a curve in the latent space of the pre-trained 2D model to merge frames into the output video. Such a solution generates high-quality videos.

We test MeVGAN on the task of generating high-quality videos from real colonoscopy images. Colonoscopy is an important medical procedure that can be used for both diagnostic and therapeutic purposes. Learning how to perform colonoscopy is difficult and time-consuming, so designing an effective training procedure is a challenge. Using live animals (usually pigs) for colonoscopy training can bring many benefits [2], but at the same time it is complicated and may raise ethical concerns. To partially solve the above problems, various colonoscopy simulators have been created [13]. There are many different approaches to constructing such devices. Applying MeVGAN, we can generate videos that can potentially be applied in colonoscopy simulators.

To summarize, the contributions of our work are the following:

  • We propose MeVGAN, a new memory efficient plugin-based generative model for videos.
  • MeVGAN produces high-quality videos, since we use a pre-trained 2D-image GAN and only add a neural network P that constructs trajectories in the noise space to produce realistic videos.
  • We show that MeVGAN can be used for modeling colonoscopy videos.

Related works

In this section, we present related works in three related research directions. First, we describe different 2D GAN models. Second, we present different approaches to video generation using GAN models. Third, we review existing colonoscopy simulators.

Generative model.

Generative modeling is a broad area of machine learning that deals with modeling a joint data distribution. Roughly speaking, generative models can produce examples similar to those already present in a training dataset X, but not exactly the same. Generative models are one of the fastest-growing areas of deep learning. In recent years, several families of generative models have been constructed, among them Variational Autoencoders (VAE) [14], Wasserstein Autoencoders (WAE) [15], Generative Adversarial Networks (GAN) [3], auto-regressive models [16], and flow-based generative models [17, 18].

The quality of generative image modeling has increased in recent years thanks to the GAN framework, which has solved many image generation tasks, like image-to-image translation [16, 19–21], image super-resolution [22, 23], and text-to-image synthesis [24, 25].

GAN is a framework for training deep generative models using a minimax game. The goal is to learn a generator distribution PG that matches the real data distribution Pdata(x). GAN learns a generator network G that generates samples from the generator distribution by transforming a noise variable z (usually Gaussian noise z ~ N(0, I)) into a sample G(z). The generator learns by playing against an adversarial discriminator network D, which aims to distinguish between samples from the true data distribution Pdata and the generator's distribution PG. More formally, the minimax game is given by the following expression:

min_G max_D E_{x ~ Pdata} [log D(x)] + E_{z ~ N(0, I)} [log(1 − D(G(z)))].

The main advantage of GANs over other models is producing sharp images that are indistinguishable from the real ones. GANs are impressive regarding the visual quality of images sampled from the model, but the training process is often challenging and unstable.
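
For concreteness, the minimax game above corresponds to the following single training step (a generic PyTorch sketch with hypothetical generator and discriminator modules and optimizers; it illustrates the standard GAN objective, not the exact implementation used in this paper):

import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, real, noise_dim=512):
    # Sample Gaussian noise z ~ N(0, I) and produce fake images G(z).
    # The discriminator is assumed to end with a sigmoid, so it outputs probabilities.
    z = torch.randn(real.size(0), noise_dim)
    fake = generator(z)
    ones = torch.ones(real.size(0), 1)
    zeros = torch.zeros(real.size(0), 1)

    # Discriminator step: push D(x) towards 1 on real data and D(G(z)) towards 0.
    d_opt.zero_grad()
    d_loss = F.binary_cross_entropy(discriminator(real), ones) \
           + F.binary_cross_entropy(discriminator(fake.detach()), zeros)
    d_loss.backward()
    d_opt.step()

    # Generator step: fool the discriminator (non-saturating variant of the minimax loss).
    g_opt.zero_grad()
    g_loss = F.binary_cross_entropy(discriminator(fake), ones)
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()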

In recent years, many researchers have focused on modifying the vanilla GAN procedure to improve the stability of the training process, e.g., by changing the objective function to the Wasserstein distance (WGAN) [26], imposing gradient penalties [27, 28], applying Spectral Normalization [29], using imbalanced learning rates for the generator and discriminator [27, 29], adding self-attention mechanisms (SAGAN) [30], and employing progressively growing architectures such as ProGAN [31] or StyleGAN [32].

In addition to works improving training stability, several modifications of the vanilla GAN architecture are dedicated to specific tasks, like generating textures [33] or producing images at different resolutions and training on a single image [34]. Such methods enable training GANs on images with varying resolutions.

It should be emphasized that there are also GANs for the synthesis of gastrointestinal images, e.g. [35, 36].

GAN for video.

Video GANs deal with multiple images with an additional time dimension. To solve such a problem, video GANs use many different strategies [37].

Video-GAN (VGAN) [38] is one of the first applications of GANs to video generation. The generator consists of two convolutional networks: the first is a 2D convolutional model for the static background, while the second is a 3D convolutional network that models moving objects in the foreground. In FTGAN [39], Ohnishi et al. add a progressive architecture to model the motion of an object more effectively. MoCoGAN [11] traverses N latent points, one per frame, using recurrent neural networks (RNNs). Like MoCoGAN, Temporal Generative Adversarial Nets (TGAN) [8, 40] use N latent vectors for N frames. However, in TGAN each frame is generated from a latent vector, whereas in MoCoGAN a frame is generated from a combination of a motion vector and a fixed content vector shared across the frames. Similarly, G3AN [41] proposes a three-stream generator that disentangles motion and appearance with a self-attention module.

In [42] Natarajan et al. use skeletal pose information and person images as input and produce high-quality videos. In the generator phase, the proposed model uses a U-Net-like network to generate target frames from skeletal poses.

In [43], Natarajan et al. propose an end-to-end deep learning framework for sign language recognition, translation, and video generation. Another approach is to use 3D convolutional networks instead of 2D convolutions [4–9]. Dual Video Discriminator GAN (DVD-GAN) [10] applied the BigGAN [44] architecture to video generation. Similar to MoCoGAN, there are two discriminators to deal with the temporal and spatial aspects of a video.

In [45], Skorokhodov et al. build the model on top of StyleGAN2 [46] and redesign its generator and discriminator networks for video synthesis. In [47] Yu et al. use an implicit representation of video. Some recent works also consider high-resolution video synthesis [9, 48], but only with training in the latent space of a pre-trained image generator.

Another approach uses a two-stream architecture for modeling different aspects of video: motion and content [49]. EncGAN3 [50] also decomposes the video into two streams representing content and movement but consists of three processing modules, representing Encoder, Generator, and Discriminator, each trained separately.

Colonoscopy simulators.

Colonoscopy is an important medical procedure that can be used for both diagnostic and therapeutic purposes. It plays a very important role in the diagnosis and prevention of colorectal cancer (CRC), because it enables early detection and extraction of polyps, which are often the first stage of colorectal cancer, one of the most prevalent and significant causes of morbidity and mortality in the developed world [2]. Moreover, colonoscopy is also helpful in diagnosing many other diseases, such as ulcerative colitis, Crohn's disease, and diverticulosis [51].

Learning how to perform colonoscopy is difficult and time-consuming. Additionally, choosing the right training procedure is problematic. To start independent practice, performing 250–300 colonoscopy procedures under supervision is required; however, statistical studies suggest that even up to 700 procedures may be needed to gain proficiency [52]. Proper preparation of the doctor performing colonoscopy is very important for the effectiveness of colorectal cancer prevention, as one of the main reasons for overlooking polyps during the examination is the inexperience of the endoscopist [53]. It is worth mentioning that work is currently being carried out to automate the detection of polyps in colonoscopy videos [54], which can significantly reduce the number of missed polyps. Similar automated systems have also been developed for other types of medical data; for example, there exist frameworks for cervical cancer detection and classification [55, 56].

On the other hand, it is ethically questionable to train colonoscopy practitioners on real patients, as a poorly performed colonoscopy can have severe complications, including perforation, bacteremia, and hemorrhage [57, 58]. However, it should be emphasized that serious complications are very rare, as long as the trained person is properly supervised by an experienced doctor. To partially solve the above problems, different colonoscopy simulators have been created [13]. These simulators use a variety of techniques, ranging from simple mechanical models [59], through composite devices that use explanted animal organs [60], to computerized virtual simulators that combine a visual interface with haptics [2, 61]. There are many different approaches to constructing such virtual simulators. This paper uses a neural network approach to generate artificial videos of colonoscopy procedures.

The use of virtual simulators to train colonoscopists, which could be developed, e.g., using video sequences created by MeVGAN, has numerous advantages. A big advantage of virtual simulators is the ability to easily and effectively simulate complex medical procedures or disease cases (e.g., polyps or colorectal cancer), which are relatively rare. In our work, we have shown that the MeVGAN model can be used to generate video sequences in which polyps are visible. An important advantage of generative models (including MeVGAN) used to generate textures or video sequences for virtual simulators is their ability to anonymize sensitive medical data. Although data from real patients are used to train the model, it is virtually impossible to link the results returned by the model to data from a specific patient. According to [62], the use of synthetic data is a method to share medical datasets with a wider audience. Therefore, using generative models can be beneficial from the point of view of protecting sensitive patient data.

It should be emphasized that the construction of a realistic medical simulator requires the development of elements other than realistic graphics. In particular, the haptic part of the simulator is very important; from the point of view of training quality, it is even more crucial than the graphics. Developing haptics is challenging because, to obtain realistic force feedback, the haptic refresh rate should be high enough (at least 500–1000 Hz [63, 64]). Such a high haptic refresh rate places high demands on physics simulation algorithms (in particular collision detection), which must be fast enough.

Description of MeVGAN

In this section, we describe our model. In MeVGAN (see Fig 2), we use a pre-trained GAN model dedicated to 2D images and add a neural network P to adapt such a model to video generation. In the presented model, we use ProGAN [31] as the backbone; therefore, we first describe the classic ProGAN model for 2D images and then introduce MeVGAN.

Fig 2. The MeVGAN model uses a pre-trained ProGAN model, which consists of a generator G and a discriminator D.

In practice, all separate frames are represented in ProGAN's Gaussian space. The MeVGAN model uses such a pre-trained model and incorporates additional neural networks: the plugin P and the video discriminator D_V, which are responsible for the correct sequence of frames. P transfers Gaussian noise z and time indexes into ProGAN's latent codes z_1, ..., z_n for the separate frames. Then we use the pre-trained ProGAN generator G to obtain the video frames G(z_1), ..., G(z_n). Before we use the video discriminator, we pass the frames through the pre-trained ProGAN discriminator D (without the last layer) to obtain a low-dimensional representation of the video. Such a full video representation goes to the classic 2D video discriminator D_V.

https://doi.org/10.1371/journal.pone.0312038.g002

ProGAN.

ProGAN [31] is a classic GAN with a minimax game. It consists of a generator network G that transfers samples from the prior Gaussian noise into 2D images, and a discriminator network D that aims to distinguish between samples from the true and learned data distributions. The main advantage of the ProGAN model is its architecture. The model starts with low-resolution images and then progressively increases the resolution by adding layers to the generator and the discriminator. This incremental nature allows the training first to discover the large-scale structure of the image distribution and then shift attention to increasingly finer-scale details, instead of having to learn all scales simultaneously.

Progressive training has several benefits. The training procedure on smaller images is substantially more stable because there is less class information and fewer modes. Another benefit is the reduced training time: with progressively growing GANs, most iterations are done at lower resolutions.

ProGAN can be easily trained on colonoscopy images with arbitrary resolution. In MeVGAN, we used the PyTorch implementation of ProGAN, trained on colonoscopy images up to a resolution of 1024 px. We mostly used the default training hyperparameters, which include the WGAN-GP loss [27]. The only modification we made was setting the batch size to 8 for resolutions up to 256 px, and then decreasing it to 4 for the rest of the training.
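
For illustration, the training configuration described above could be written down roughly as follows (our own sketch; the names are hypothetical and do not come from the official ProGAN code):

# Hypothetical summary of the ProGAN training setup used here:
# WGAN-GP loss [27], resolutions grown progressively up to 1024 px,
# batch size 8 up to 256 px and 4 for the higher-resolution stages.
PROGAN_TRAINING = {
    "loss": "wgan-gp",
    "max_resolution": 1024,
}

def batch_size_for_resolution(resolution: int) -> int:
    """Batch size used at a given resolution stage of progressive training."""
    return 8 if resolution <= 256 else 4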

MeVGAN.

In this part of the section, we present our extension of a generative model that was originally trained on 2D data (images). We assume that we have a ProGAN model pre-trained on frames from the training videos, so we have a generator G and a discriminator D dedicated to 2D images. In MeVGAN, these networks are frozen. We aim to train two additional networks: the plugin P and the video discriminator D_V. The first network allows the ProGAN generator to model the subsequent frames of the generated video. On the input of the plugin network P, we have Gaussian noise z and a timeline (consecutive indexes of frames in the video)

T = (t_1, ..., t_n).

The Plugin transfers such a representation to obtain the latent codes of the frames

(z_1, ..., z_n) = P(z, T).

The Plugin consists of three fully connected layers that use the temporal information (i.e., the order of frames in the video sequence) to produce a sequence of n noise vectors with the shape of ProGAN latent codes.

The ProGAN generator transfers this sequence of noise vectors into the video frames

G(z_1), ..., G(z_n).

Subsequently, the discriminator D_V is used to enforce a smooth transition between consecutive frames in the output video. In the majority of existing solutions, discriminators utilize 3D convolutional layers; in MeVGAN, however, we operate on a low-dimensional representation of the images. In practice, we use the pre-trained ProGAN discriminator D without the last layer to extract features. Such representations are combined into a single tensor

(D(G(z_1)), ..., D(G(z_n))).

In consequence, we obtain the MeVGAN generator, which consists of three neural networks: the plugin P, the pre-trained ProGAN generator G, and the pre-trained ProGAN discriminator D. Using the discriminator as a feature extractor, we can use a classic 2D video discriminator D_V instead of 3D convolutions.

This approach enables the discriminator to output a single value for a sequence of images, similar to a traditional discriminator. Treating the sequence of features as an image makes it easier for the discriminator to differentiate between a real movie and a sequence of unordered frames. Our discriminator architecture offers several advantages, including improved computational efficiency. By utilizing the ProGAN architecture, we can also ensure that our discriminator is well-suited for use with the generator in our proposed extension.
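
To make this composition explicit, the MeVGAN generator can be sketched as follows (a simplified PyTorch sketch under our own naming; plugin, progan_g, and progan_d_features stand for the plugin P, the frozen ProGAN generator G, and the frozen ProGAN discriminator D without its last layer):

import torch

def mevgan_generate(plugin, progan_g, progan_d_features, z, t):
    """Compose the MeVGAN generator: plugin -> frozen ProGAN G -> frozen ProGAN D features.

    z: (batch, noise_dim) Gaussian noise; t: (batch, n_frames) time indexes.
    The frozen ProGAN networks are assumed to have requires_grad disabled.
    """
    latents = plugin(z, t)                              # (batch, n_frames, latent_dim)
    b, n, d = latents.shape
    frames = progan_g(latents.reshape(b * n, d))        # (batch * n_frames, 3, H, W)
    feats = progan_d_features(frames)                   # assumed flat feature vector per frame
    frames = frames.reshape(b, n, *frames.shape[1:])
    feats = feats.reshape(b, n, -1)                     # sequence of per-frame features for D_V
    return frames, feats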

Our model is trained analogously to a classic GAN, by playing the minimax game

min_P max_{D_V} E_{x ~ Pvideo} [log D_V(D(x_1), ..., D(x_n))] + E_{z ~ N(0, I)} [log(1 − D_V(D(G(z_1)), ..., D(G(z_n))))],

where (x_1, ..., x_n) are the frames of a real video and (z_1, ..., z_n) = P(z, T).

In each step, the video discriminator D_V is trained to distinguish between real and fake videos, while the plugin network P is trained to fool the discriminator by generating (together with ProGAN's generator) videos that are as realistic as possible. A detailed description of the Plugin and Video Discriminator architectures is presented in Table 1. Both networks are trained using the Adam optimizer with a learning rate of 0.0002 and the Binary Cross Entropy loss, for 50 epochs on an NVIDIA Tesla V100 SXM2 32GB GPU. The training process of these two parts of our model is presented in Fig 3.
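
A minimal sketch of this training step is given below (our own simplification, reusing the mevgan_generate composition sketched earlier; the plugin, the video discriminator, the frozen ProGAN networks, and the optimizers are assumed to exist as PyTorch objects):

import torch
import torch.nn.functional as F

def mevgan_training_step(plugin, video_disc, progan_g, progan_d_features,
                         real_feats, z, t, opt_plugin, opt_disc):
    """One adversarial step. real_feats are the D-features of a real clip, z ~ N(0, I)."""
    ones = torch.ones(real_feats.size(0), 1)
    zeros = torch.zeros(real_feats.size(0), 1)

    _, fake_feats = mevgan_generate(plugin, progan_g, progan_d_features, z, t)

    # Video-discriminator step: distinguish real clips from generated ones (BCE loss).
    opt_disc.zero_grad()
    d_loss = F.binary_cross_entropy(video_disc(real_feats), ones) \
           + F.binary_cross_entropy(video_disc(fake_feats.detach()), zeros)
    d_loss.backward()
    opt_disc.step()

    # Plugin step: fool the video discriminator; the ProGAN networks stay frozen.
    opt_plugin.zero_grad()
    p_loss = F.binary_cross_entropy(video_disc(fake_feats), ones)
    p_loss.backward()
    opt_plugin.step()
    return d_loss.item(), p_loss.item()

# Optimizers as described above (Adam, learning rate 0.0002; other settings left at defaults):
# opt_plugin = torch.optim.Adam(plugin.parameters(), lr=2e-4)
# opt_disc = torch.optim.Adam(video_disc.parameters(), lr=2e-4)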

Fig 3. The picture shows the learning curves of the Plugin and the discriminator for several datasets.

https://doi.org/10.1371/journal.pone.0312038.g003

Table 1. Detailed description of the Plugin and Video Discriminator architectures. The Plugin takes a random noise vector z and a temporal vector T as input. At each of the four layers, it first concatenates the temporal vector T with the previous layer's output and then forwards it through a Linear layer. The first three layers additionally use a ReLU activation. After the fourth layer, the output vector is normalized to lie on a hypersphere. The Video Discriminator takes the low-dimensional representation of the generated video as input. At each of its four layers, it applies a 2D convolution and a ReLU activation. At the end, the output is flattened and forwarded through a Linear layer and a Sigmoid function to obtain a probability. Conv2D parameters describe, in order, the number of input and output channels, the kernel size, and the stride. Linear layer parameters refer to the number of input and output channels.

https://doi.org/10.1371/journal.pone.0312038.t001
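
Following the description in Table 1, the two trainable networks could be sketched roughly as below (a hypothetical PyTorch rendering compatible with the sketches above; exact layer widths, kernel sizes, strides, and the added padding are our own placeholders where Table 1 does not fix them):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Plugin(nn.Module):
    """Maps noise z and temporal vector T to per-frame latent codes (cf. Table 1)."""
    def __init__(self, noise_dim=2048, n_frames=16, latent_dim=512, hidden=1024):
        super().__init__()
        dims = [noise_dim, hidden, hidden, hidden, latent_dim * n_frames]
        # Each layer first concatenates T with the previous output, then applies Linear.
        self.layers = nn.ModuleList(
            [nn.Linear(dims[i] + n_frames, dims[i + 1]) for i in range(4)]
        )
        self.n_frames, self.latent_dim = n_frames, latent_dim

    def forward(self, z, t):                   # z: (batch, noise_dim), t: (batch, n_frames)
        h = z
        for i, layer in enumerate(self.layers):
            h = layer(torch.cat([h, t], dim=1))
            if i < 3:                          # ReLU on the first three layers only
                h = F.relu(h)
        h = h.view(-1, self.n_frames, self.latent_dim)
        return F.normalize(h, dim=-1)          # project each latent code onto the hypersphere

class VideoDiscriminator(nn.Module):
    """Treats the sequence of per-frame D-features as a one-channel 2D 'image' (rows = frames)."""
    def __init__(self, width=64):
        super().__init__()
        self.convs = nn.Sequential(            # four Conv2D + ReLU blocks, as in Table 1
            nn.Conv2d(1, width, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(width, width, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(width, width, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(width, width, 3, 2, 1), nn.ReLU(),
        )
        self.fc = nn.LazyLinear(1)             # flatten, then Linear + Sigmoid -> probability

    def forward(self, feats):                  # feats: (batch, n_frames, feat_dim)
        h = self.convs(feats.unsqueeze(1))     # add a channel dimension
        return torch.sigmoid(self.fc(h.flatten(1)))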

The MeVGAN model stands out as an innovative approach, replete with both limitations and distinct advantages vis-à-vis its competitor, the Temporal GAN model. A central constraint of the MeVGAN model lies in its profound reliance on a pre-trained generative model. The quality of the generated images is intricately intertwined with the performance and robustness of this foundational model. Consequently, any limitations or biases inherent in the pre-trained model may seamlessly permeate into the generated content.

Diverging from the Temporal GAN model, which commences training from scratch for the entire architecture, MeVGAN adopts a different approach. It incorporates a generator that remains fixed after pre-training. This static characteristic can obstruct the model’s adaptability and its capacity for fine-tuning, particularly in response to evolving data distributions over time.

However, this unique structural aspect of our model unveils a multitude of advantages. Foremost among these is its ease of learning. By harnessing a pre-trained generative model, MeVGAN circumvents the formidable challenge of learning a generative model from the beginning. This pragmatic choice often results in faster convergence during training. Moreover, MeVGAN’s reliance on a pre-trained generator typically entails fewer parameters compared to models that commence training anew for the entire architecture. This streamlined parameterization can significantly reduce computational demands and resource consumption, rendering MeVGAN a more efficient choice for various use cases.

MeVGAN’s core focus revolves around the task of learning noise paths to generate videos. This task, in contrast to training a model alongside a generator for individual frames, is notably more manageable. Furthermore, when the noise space exhibits ’richness’, such a learning paradigm can yield video content that is remarkably consistent and aesthetically pleasing, especially when contrasted with models that fail to capture such intricate noise patterns effectively.

Experiments

The experiments section consists of two main parts. In the first subsection, we compare our method with the baseline approach, Temporal GAN v2 [8], on classic benchmarks. The second subsection shows how our model works on colonoscopy videos.

Comparison with baseline model

This section compares our method with the baseline approach, Temporal GAN v2 [8]. For this purpose, we compared the performance of these methods on the widely used UCF-101 dataset [65], a benchmark for action recognition in videos consisting of 101 different action categories, each containing at least 100 videos. The average length of a video clip is 6 seconds. The videos in UCF-101 cover various actions, including sports activities, human-object interactions, and animal behavior. The dataset provides a challenging benchmark for action recognition algorithms, as the videos contain a lot of variability in viewpoint, lighting conditions, background clutter, and actor appearance. For our experiments, we selected four categories: BalanceBeam, BaseballPitch, Skiing, and TaiChi.

Setup.

For each selected dataset category, we trained the ProGAN model to a fixed output resolution. We extended the initial part of the ProGAN model in such a way that it generates a sequence of n noise vectors for the ProGAN generator from a single noise vector of size 2048. By using the pre-trained ProGAN, we could focus solely on training the initial component responsible for generating sequences of noise vectors that result in smooth and continuous video sequences. Otherwise, we would have to start from scratch and train the model to generate entire videos from data. Fig 4 provides several examples of video clips generated by our model.

Fig 4. An example of video frames generated by the MeVGAN model presented in this work, trained on four categories of the UCF-101 dataset.

Each row presents twelve consecutive generated frames from a video belonging to the category indicated on the right-hand side of the figure. Compare with Fig 5, which presents an example of video frames generated by TGANv2.

https://doi.org/10.1371/journal.pone.0312038.g004

In addition, we compared the model trained this way with the Temporal GAN v2 model. As with our model, we trained TGANv2 separately on the selected categories of the UCF-101 dataset. Example video frames generated by TGANv2 are presented in Fig 5. To evaluate the models, we used three popular metrics for assessing the quality of generative models in the domain of video and image generation: Fréchet Video Distance (FVD) [66], Fréchet Inception Distance (FID) [67], and Inception Score (IS) [68].

Fig 5. An example of video frames generated by the TGANv2 model trained on four categories of the UCF-101 dataset.

Each row presents twelve consecutive generated frames from a video belonging to the category indicated on the right-hand side of the figure. Compare with Fig 4, which presents an example of video frames generated by MeVGAN.

https://doi.org/10.1371/journal.pone.0312038.g005

The FID score is one of the most important measures of quality in image generation. Rather than directly comparing images pixel by pixel (for example, as done by the L2 norm), the FID compares the mean and covariance of the activations of the deepest layer of Inception v3. These layers are closer to the output nodes that correspond to real-world objects, such as a specific breed of dog or an airplane, and further from the shallow layers near the input image. FVD measures the similarity between two sets of video clips by comparing the distributions of their features extracted from a pre-trained neural network, analogously to FID.

FVD and FID are based on the popular Fréchet distance between the real distribution PR and the generated distribution PG. In the general case, the Fréchet distance is difficult to compute, but when PR and PG are multivariate Gaussians it takes the form

d(PR, PG)^2 = ||μ_R − μ_G||^2 + Tr(Σ_R + Σ_G − 2 (Σ_R Σ_G)^{1/2}),

where μ_R and μ_G are the means and Σ_R and Σ_G are the covariance matrices of PR and PG. To compute FVD and FID on real-world videos and images, we pass samples through an Inflated 3D ConvNet (I3D) and an Inception network, respectively, to record the activation responses of real and generated samples. The metrics are then computed from the means and covariances of the recorded responses. The IS also leverages the outputs of the Inception network, which produces class probabilities: it is the exponentiated average Kullback–Leibler divergence between the conditional class distribution of each generated sample and the marginal class distribution, so that confidently classified samples (quality) and a uniform marginal class distribution (diversity) both increase the score. To evaluate the performance of the two models, we generated a total of 4092 video clips, each containing 16 frames, and calculated the Fréchet Video Distance for each model. Since the other metrics, Fréchet Inception Distance and Inception Score, are designed for image data, we randomly split each video clip into 5 parts and calculated the FID and IS for each split. Finally, we computed the average and standard deviation across all splits for each model. The results of these evaluations are presented in Table 2.
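
The Gaussian form of the Fréchet distance above is straightforward to implement. A small NumPy/SciPy sketch (our own illustration, operating on pre-extracted feature matrices such as Inception activations for FID or I3D activations for FVD) is:

import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Squared Fréchet distance between Gaussians fitted to two sets of features.

    feats_real, feats_gen: arrays of shape (num_samples, feature_dim).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the product of the covariances; drop the tiny imaginary
    # parts that arise from numerical error.
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))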

Table 2. The results of the comparison between our method (MeVGAN) and Temporal GAN v2 (TGANv2), presented in three evaluation metrics: Fréchet Video Distance (FVD), Fréchet Inception Distance (FID), and Inception Score (IS). All calculations were performed on 4092 videos, each containing 16 frames. Since the last two metrics are measured on images, we randomly divided all frames into five parts and calculated the mean and standard deviation over the subsets. A higher IS is better, in contrast to the other metrics (FVD and FID), for which lower values are better. The values in the table reflect the improved performance of our approach according to these metrics.

https://doi.org/10.1371/journal.pone.0312038.t002

Our approach utilizes a two-stage model learning process that results in improved performance compared to previous methods such as TGANv2. In the first stage, we trained our generative model on images to capture the underlying distribution of image features. In the second stage, we utilized the pre-trained model to train the sequence-generation part of the model, which creates the noise vectors that form a video. This two-step approach allowed the second stage to focus on capturing temporal dependencies in video data, which is critical for generating smooth and continuous video sequences.

The effectiveness of our two-stage approach is reflected in the superior results presented in Table 2. By using FVD, FID, and IS metrics to evaluate our models, we can see that our approach outperforms the previous state-of-the-art method, TGANv2. Our method is capable of generating video sequences that are not only visually pleasing but also more realistic and diverse.

Colonoscopy movies

Colonoscopy is a medical procedure to examine the large intestine and rectum for abnormalities such as polyps or cancer. This procedure is an important tool for detecting and preventing colon cancer, which is one of the most common types of cancer worldwide.

Simulation-based training is becoming increasingly popular in the medical field, providing a safe and controlled environment for medical professionals to practice and develop their skills. However, using real patient data can be challenging due to ethical and privacy concerns. Generative models can overcome these challenges by generating synthetic medical data that resembles real data while maintaining patient privacy. Furthermore, generative models can create scenarios with specific medical conditions or abnormalities that may be difficult to encounter in real patient data.

We utilized colonoscopy data to train our generative video model MeVGAN and achieved promising results. By training our model on a large dataset of real colonoscopy videos, we were able to generate synthetic videos that closely resemble real videos in terms of visual quality and motion patterns; see Fig 6. Importantly, MeVGAN can generate many stages of the colonoscopy procedure, as presented in Fig 7.

Fig 6. Examples of frames from five video clips produced by MeVGAN and TGANv2 trained on our colonoscopy data.

Each row contains 12 consecutive frames from one clip.

https://doi.org/10.1371/journal.pone.0312038.g006

Fig 7. An example of the ability of MeVGAN to generate various stages of the colonoscopy procedure, e.g., intestine with visible haustration (a), the colonoscope sliding along the intestinal wall (b), the rinsing procedure to remove impurities (c), intestine with a visible polyp (d), and ulcerative colitis (e).

https://doi.org/10.1371/journal.pone.0312038.g007

Components for Fig 7 were chosen by a colonoscopy specialist. Our system produced a range of movies, which the specialist categorized into different phases of the colonoscopy process, such as intestines showing haustration (a), maneuvering the colonoscope along the intestinal wall (b), cleansing process to eliminate contaminants (c), intestines displaying polyps (d), and ulcerative colitis (e).

Our approach has several potential applications in the medical field, such as the simulation of colonoscopy procedures for training purposes. Additionally, our model could be used to generate synthetic data with specific conditions or abnormalities that may be challenging to encounter in real colonoscopy videos, thereby aiding in the development and testing of new medical devices and procedures.

Summary

This paper presented a novel approach to generating high-quality video sequences using a two-stage model learning process. In the first stage, we trained a ProGAN model on a dataset of images; in the second stage, we utilized the pre-trained model and an additional neural network to generate sequences of noise vectors, which were used to produce realistic and smooth video sequences. Our approach outperformed the previous state-of-the-art method, TGANv2, in terms of the FVD, FID, and IS metrics.

In the case of colonoscopy images, we are able to produce videos with a high level of realism. Moreover, we are able to model various stages of the colonoscopy procedure, e.g., the colonoscope sliding along the intestinal wall, the rinsing procedure to remove impurities, intestines with visible polyps, and ulcerative colitis.

Future work.

There are several directions for future research and improvement of our approach. One possible extension is to incorporate more sophisticated techniques for modeling temporal dependencies in video data, such as recurrent neural networks (RNNs) or attention mechanisms. Another potential direction is to explore the use of more complex datasets, such as action recognition datasets, to generate more complex and diverse video sequences. Additionally, it would be valuable to explore ways of further improving the quality and diversity of generated video sequences, such as through the use of adversarial training or fine-tuning the model on specific tasks.

References

  1. Golhar M, Bobrow TL, Ngamruengphong S, Durr NJ. GAN inversion for data augmentation to improve colonoscopy lesion classification. arXiv preprint 2022. https://arxiv.org/abs/2205.02840
  2. Wen T, Medveczky D, Wu J, Wu J. Colonoscopy procedure simulation: virtual reality training based on a real time computational approach. Biomed Eng Online. 2018;17(1):1–15.
  3. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial nets. In: Advances in Neural Information Processing Systems; 2014. p. 2672–80.
  4. Acharya D, Huang Z, Paudel DP, Van Gool L. Towards high resolution video generation with progressive growing of sliced Wasserstein GANs. arXiv preprint 2018. https://arxiv.org/abs/1810.02419
  5. Clark A, Donahue J, Simonyan K. Adversarial video generation on complex datasets. arXiv preprint 2019. https://arxiv.org/abs/1907.06571
  6. Kahembwe E, Ramamoorthy S. Lower dimensional kernels for video discriminators. Neural Netw. 2020;132:506–520. pmid:33039788
  7. Munoz A, Zolfaghari M, Argus M, Brox T. Temporal shift GAN for large scale video generation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; 2021. p. 3179–88.
  8. Saito M, Saito S, Koyama M, Kobayashi S. Train sparsely, generate densely: memory-efficient unsupervised training of high-resolution temporal GAN. Int J Comput Vis. 2020;128(10–11):2586–606.
  9. Tian Y, Ren J, Chai M, Olszewski K, Peng X, Metaxas DN, et al. A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations; 2021.
  10. Yu W, Zhang M, He Z, Shen Y. Convolutional two-stream generative adversarial network-based hyperspectral feature extraction. IEEE Trans Geosci Remote Sensing. 2021;60:1–10.
  11. Tulyakov S, Liu MY, Yang X, Kautz J. MoCoGAN: decomposing motion and content for video generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. p. 1526–35.
  12. Wołczyk M, Proszewska M, Maziarka Ł, Zieba M, Wielopolski P, Kurczab R, et al. PluGeN: multi-label conditional generation from pre-trained models. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2022;36:8647–56.
  13. Goodman AJ, Melson J, Aslanian HR, Bhutani MS, Krishnan K, Lichtenstein DR, et al. Endoscopic simulators. Gastrointest Endosc. 2019;90(1):1–12. pmid:31122746
  14. Kingma DP, Welling M. Auto-encoding variational Bayes. arXiv preprint 2013. https://arxiv.org/abs/1312.6114
  15. Tolstikhin I, Bousquet O, Gelly S, Schoelkopf B. Wasserstein auto-encoders. arXiv preprint 2017. https://arxiv.org/abs/1711.01558
  16. Isola P, Zhu JY, Zhou T, Efros AA. Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. p. 1125–34.
  17. Dinh L, Krueger D, Bengio Y. NICE: non-linear independent components estimation. arXiv preprint 2014. https://arxiv.org/abs/1410.8516
  18. Kingma DP, Dhariwal P. Glow: generative flow with invertible 1x1 convolutions. In: Advances in Neural Information Processing Systems; 2018. p. 10236–45.
  19. Zhu JY, Park T, Isola P, Efros AA. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision; 2017. p. 2223–32.
  20. Taigman Y, Polyak A, Wolf L. Unsupervised cross-domain image generation. arXiv preprint 2016. https://arxiv.org/abs/1611.02200
  21. Park T, Liu MY, Wang TC, Zhu JY. Semantic image synthesis with spatially-adaptive normalization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2019. p. 2337–46.
  22. Ledig C, Theis L, Huszár F, Caballero J, Cunningham A, Acosta A, et al. Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. p. 4681–90.
  23. Sønderby CK, Caballero J, Theis L, Shi W, Huszár F. Amortised MAP inference for image super-resolution. arXiv preprint 2016. https://arxiv.org/abs/1610.04490
  24. Reed S, Akata Z, Yan X, Logeswaran L, Schiele B, Lee H. Generative adversarial text to image synthesis. arXiv preprint 2016. https://arxiv.org/abs/1605.05396
  25. Hong S, Yang D, Choi J, Lee H. Inferring semantic layout for hierarchical text-to-image synthesis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. p. 7986–94.
  26. Arjovsky M, Chintala S, Bottou L. Wasserstein generative adversarial networks. In: International Conference on Machine Learning; 2017. p. 214–23.
  27. Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC. Improved training of Wasserstein GANs. In: Advances in Neural Information Processing Systems; 2017. p. 5767–77.
  28. Kodali N, Abernethy J, Hays J, Kira Z. On convergence and stability of GANs. arXiv preprint 2017. https://arxiv.org/abs/1705.07215
  29. Miyato T, Kataoka T, Koyama M, Yoshida Y. Spectral normalization for generative adversarial networks. arXiv preprint 2018. https://arxiv.org/abs/1802.05957
  30. Zhang H, Goodfellow I, Metaxas D, Odena A. Self-attention generative adversarial networks. arXiv preprint 2018. https://arxiv.org/abs/1805.08318
  31. Karras T, Aila T, Laine S, Lehtinen J. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint 2017. https://arxiv.org/abs/1710.10196
  32. Karras T, Laine S, Aila T. A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2019. p. 4401–10.
  33. Hedjazi MA, Genc Y. Efficient texture-aware multi-GAN for image inpainting. Knowl-Based Syst. 2021;217:106789.
  34. Shaham TR, Dekel T, Michaeli T. SinGAN: learning a generative model from a single natural image. In: Proceedings of the IEEE International Conference on Computer Vision; 2019. p. 4570–80.
  35. Adjei PE, Lonseko ZM, Rao N. GAN-based synthetic gastrointestinal image generation. In: 2020 17th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP); 2020. p. 338–42.
  36. Thambawita V, Salehi P, Sheshkal SA, Hicks SA, Hammer HL, Parasa S, et al. SinGAN-Seg: synthetic training data generation for medical image segmentation. PLoS One. 2022;17(5):e0267976. pmid:35500005
  37. Aldausari N, Sowmya A, Marcus N, Mohammadi G. Video generative adversarial networks: a review. ACM Comput Surv. 2022;55(2):1–25.
  38. Vondrick C, Pirsiavash H, Torralba A. Generating videos with scene dynamics. Adv Neural Inf Process Syst. 2016;29.
  39. Ohnishi K, Yamamoto S, Ushiku Y, Harada T. Hierarchical video generation from orthogonal information: optical flow and texture. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 32; 2018.
  40. Saito M, Matsumoto E, Saito S. Temporal generative adversarial nets with singular value clipping. In: Proceedings of the IEEE International Conference on Computer Vision; 2017. p. 2830–9.
  41. Wang Y, Bilinski P, Bremond F, Dantcheva A. G3AN: disentangling appearance and motion for video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020. p. 5264–73.
  42. Natarajan B, Elakkiya R. Dynamic GAN for high-quality sign language video generation from skeletal poses using generative adversarial networks. Soft Comput. 2022;26(23):13153–75.
  43. Natarajan B, Rajalakshmi E, Elakkiya R, Kotecha K, Abraham A, Gabralla LA, et al. Development of an end-to-end deep learning framework for sign language recognition, translation, and video generation. IEEE Access. 2022;10:104358–74.
  44. Brock A, Donahue J, Simonyan K. Large scale GAN training for high fidelity natural image synthesis. In: International Conference on Learning Representations; 2019.
  45. Skorokhodov I, Tulyakov S, Elhoseiny M. StyleGAN-V: a continuous video generator with the price, image quality and perks of StyleGAN2. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022. p. 3626–36.
  46. Karras T, Aittala M, Hellsten J, Laine S, Lehtinen J, Aila T. Training generative adversarial networks with limited data. Adv Neural Inf Process Syst. 2020;33:12104–14.
  47. Yu S, Tack J, Mo S, Kim H, Kim J, Ha JW, et al. Generating videos with dynamics-aware implicit generative adversarial networks. arXiv preprint 2022. https://arxiv.org/abs/2202.10571
  48. Fox G, Tewari A, Elgharib M, Theobalt C. StyleVideoGAN: a temporal generative model using a pretrained StyleGAN. arXiv preprint 2021. https://arxiv.org/abs/2107.07224
  49. Sun X, Xu H, Saenko K. TwoStreamVAN: improving motion modeling in video generation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; 2020. p. 2744–53.
  50. Yang J, Bors AG. Encoder enabled GAN-based video generators. In: 2022 IEEE International Conference on Image Processing (ICIP). IEEE; 2022. p. 1841–5.
  51. Kim YG, Jang BI. The role of colonoscopy in inflammatory bowel disease. Clin Endosc. 2013;46(4):317–20.
  52. Patwardhan VR, Feuerstein JD, Sengupta N, Lewandowski JJ, Tsao R, Kothari D, et al. Fellowship colonoscopy training and preparedness for independent gastroenterology practice. J Clin Gastroenterol. 2016;50(1):45–51. pmid:26125461
  53. Papanikolaou I, Karatzas P, Varytimiadis L, Tsigaridas A, Galanopoulos M, Viazis N, et al. Effective colonoscopy training techniques: strategies to improve patient outcomes. Adv Med Educ Pract. 2016;7:201–10.
  54. González-Bueno Puyal J, Brandao P, Ahmad OF, Bhatia KK, Toth D, Kader R, et al. Polyp detection on video colonoscopy using a hybrid 2D/3D CNN. Med Image Anal. 2022;82:102625. pmid:36209637
  55. Elakkiya R, Subramaniyaswamy V, Vijayakumar V, Mahanti A. Cervical cancer diagnostics healthcare system using hybrid object detection adversarial networks. IEEE J Biomed Health Inform. 2021;26(4):1464–71.
  56. Elakkiya R, Teja KSS, Jegatha L, Bisogni C, Medaglia C. Imaging based cervical cancer diagnostics using small object detection-generative adversarial networks. Multim Tools Appl. 2022:1–17.
  57. Fisher DA, Maple JT, Ben-Menachem T, Cash BD, Decker GA, Early DS, et al. Complications of colonoscopy. Gastrointest Endosc. 2011;74(4):745–52. pmid:21951473
  58. Latos W, Aebisher D, Latos M, Krupka-Olek M, Dynarowicz K, Chodurek E, et al. Colonoscopy: preparation and potential complications. Diagnostics. 2022;12(3):747. pmid:35328300
  59. Classen M, Ruppin H. Practical endoscopy training using a new gastrointestinal phantom. Endoscopy. 1974;6(02):127–31.
  60. Hochberger J, Matthes K, Maiss J, Koebnick C, Hahn EG, Cohen J. Training with the compactEASIE biologic endoscopy simulator significantly improves hemostatic technical skill of gastroenterology fellows: a randomized controlled comparison with clinical endoscopy training alone. Gastrointest Endosc. 2005;61(2):204–15. pmid:15729227
  61. Koch AD, Buzink SN, Heemskerk J, Botden SMBI, Veenendaal R, Jakimowicz JJ, et al. Expert and construct validity of the Simbionix GI Mentor II endoscopy simulator for colonoscopy. Surg Endosc. 2008;22(1):158–62.
  62. Gonzales A, Guruswamy G, Smith S. Synthetic data in health care: a narrative review. PLOS Digit Health. 2023;2(1):e0000082.
  63. Otaduy MA, Lin MC. High fidelity haptic rendering. Cham: Springer; 2007.
  64. Laycock SD, Day AM. A survey of haptic rendering techniques. Comput Graph Forum. 2007;26(1):50–65.
  65. Soomro K, Zamir AR, Shah M. UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint 2012. https://arxiv.org/abs/1212.0402
  66. Unterthiner T, van Steenkiste S, Kurach K, Marinier R, Michalski M, Gelly S. FVD: a new metric for video generation. In: ICLR Workshop DeepGenStruct 2019; 2019.
  67. Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv Neural Inf Process Syst. 2017;30.
  68. Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X. Improved techniques for training GANs. Adv Neural Inf Process Syst. 2016;29.