Figures
Abstract
The text-to-video generation task can provide people with rich and diverse video content, but it also has some typical issues, such as content inconsistency between video frames or text alignment failure, which degrade the smoothness of video. And in the process of improving the video smoothing problems, the background texture and artistic expression are often lost because of the excessive smoothing. Based on the above problems, this paper proposes INR Smooth, a type of video smoothing strategy based on the relationship between interframe noise, which can improve the smoothness of most T2V generation tasks. Based on INR Smooth, two video smoothing editing methods are proposed in this paper. One is for T2V training models, based on the studied interframe noise relationship, noise constraints are carried out from the beginning and end of the video simultaneously, and video smoothing loss functions are constructed. The other is for T2V training-free models, this paper introduces DDIM Inversion additionally to ensure text alignment, so as to improve the smoothness. Through experimental comparison, it is found that the proposed methods can significantly improve text alignment, temporal consistency, and has outstanding performance in the smooth transition of real scenes and the portrayal of artistic styles. The proposed training-free method and zero-shot fine-tuning training method for video smoothing do not add additional computing resources. The source codes and video demos are available at https://github.com/Cuihong-Yu/INR-Smooth.
Citation: Yu C, Han C, Zhang C (2025) INR Smooth: Interframe noise relation-based smooth video synthesis on diffusion models. PLoS One 20(4): e0321193. https://doi.org/10.1371/journal.pone.0321193
Editor: Matteo Bodini, Universita degli Studi di Milano, ITALY
Received: November 3, 2024; Accepted: March 3, 2025; Published: April 29, 2025
Copyright: © 2025 Yu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The source codes and video demos are available at https://github.com/Cuihong-Yu/INR-Smooth.
Funding: Natural Science Foundation of Jilin Province, No. 20230101179JC, China. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The purpose of text-to-video (T2V) generation is to generate realistic videos based on input text. As a new research hotspot of computer vision, text-to-video generation has been widely used in the fields of film and television production, games, entertainment short videos and so on [1]. The application potential of T2V is huge. The existing methods [2–7] mainly use large-scale diffusion models to achieve T2V tasks and achieve a series of stunning visual effects. Different from the research on image generation, video generation is greatly affected by the temporal consistency between frames and the spatial consistency of content features. If the consistency issue is not addressed effectively, it will result in flickering, incoherent action, and a reduction in video quality.
Tune-A-Video [8] simulates temporal consistency by fine-tuning the spatio-temporal attention mechanism, querying the relevant position of the current frame feature information in the previous frame. While Tune-A-Video generates significant text-to-video results with temporal consistency, unpleasant results may be generated when the input video contains multiple continuous objects. ControlVideo [9] employs the video keyframe attention mechanism for video consistency processing and introduces a temporal attention module to the diffusion model. A zero convolutional layer is added after each temporal attention module to preserve output before fine tuning. ControlVideo is affected by the guidance conditions, and there is still room for further improvement of video temporal consistency. Text2Video-Zero [10] introduces motion dynamic coding into potential coding and uses a new cross-frame attention mechanism to re-encode the self-attention mechanism of each frame to enforce temporal consistency. While the generated video has a smooth performance, the background content is relatively simple with no noticeable scene changes. In contrast to strategies that enhance the attention mechanism to improve video consistency, Stable Video Diffusion [11] adjusts the high-resolution text-to-video model into a frame interpolation model, uses the first and last frames as conditional frames, and feeds them to the UNet through the connection conditional mechanism, resulting in smooth videos at high frame rates. However, this frame interpolation method is limited by the length of the video.
The above advanced text-to-video generation models still face various problems that lead to the inconsistency of generated videos, so improving the temporal consistency of generated videos is the main challenge faced by text-to-video. In order to maintain the creativity of the original model and further improve video consistency, consistent video editing methods can be further explored. Therefore, this paper studies consistent video editing based on the noise relationship between video frames. Relatively few studies have investigated the inter-frame noise of text-to-video generation. PYoCo [12] investigated the correlation of noises corresponding to each frame of a video and thus developed mixed and progressive noise priors to guide the video training process, resulting in the consistency of sequential video frame generation. Smooth Video [13] introduces a loss term to noise predictions for adjacent video frames to enhance the edited object’s smoothness. However, the over-smoothing limits the alignment between the generated and source videos.
Based on the challenges of improving the consistency of text-to-video, this paper proposes INR Smooth, a video smoothing strategy for linking noise relation between different frames in video, which improves the inconsistency of existing models, enhances scene transitions, reduces over-smoothing, and maintains alignment with the source video. It is applicable to video generation and editing models, and compatible with the current personalized or conditional T2V generation tasks. Fig 1 illustrates the remarkable results of INR Smooth in T2V video smoothing application. By extensive qualitative and quantitative comparisons with state-of-the-art baselines, we demonstrate the superiority of the proposed methods. The main contributions of this work are as follows:
Based on INR Smooth, two video smoothing editing methods are proposed, one is for T2V training models, the other is for T2V training-free models. As shown in Fig 1, after employing the proposed methods, the body structure of the bird is repaired, the front structure of the car is aligned with other frames, and the color of the rearview mirror is repaired.
- Based on the diffusion models, the relation between the noise of each frame in the T2V model is studied, and a video smoothing strategy based on interframe noise relation is proposed.
- For T2V training models, this paper constructs the video smoothing loss based on the interframe noise relation, so that the video smoothing loss and original loss affect the quality of the generated video simultaneously.
- For T2V training-free models, we propose an improved version of DDIMs (Denoising Diffusion Implicit Models) inversion, which is suited for zero-shot video editing. By analyzing the correlation between inversion noise and random noise, inversion noise is estimated from the beginning and end of the video simultaneously. Finally, the weight of inversion noise is controlled to maintain the interframe consistency of generated video.
Related work
Text-to-video generation
The large-scale text-to-image (T2I) diffusion models has achieved unprecedented success, so researchers focus on applying the diffusion models to the field of video generation, and have achieved a series of remarkable research results. While many research [1,4,6,11] bring visual impact to people, but super computing resources make it difficult for individual researchers or small research teams, so researchers try to study how to better achieve text-to-video generation on the basis of lightweight resource utilization. For example, many T2V studies based on Zero-Shot, One-Shot, Few Shot and Training-Free have achieved better generation effects. This paper will also focus on lightweight resource utilization research for smooth editing.
Many major T2V generation models use the T2I model’s merging time module to generate video. Video Diffusion Models [14] extend the basic U-Net architecture of 2D image modeling to 3D space-time by factorizing the spatiotemporal attention module, and realize use the standard equation of diffusion model, which can obtain an effective process of video data generation model. Sora [15] explores large-scale training of generative models on video data, and leverages a transformer architecture that operates on spacetime patches of video and image latent codes, Sora demonstrates the feasibility of scaled video generation models. Imagen Video [16] improves video diffusion models by cascading them and using V-parametric decision making. Make-A-Video [17] begins with learning content description and move features and then extends a diffusion models-based T2I model [18] to T2V using a spatiotemporally factorized diffusion model. Lumiere [19] introduces a Space-Time U-Net architecture based on the pre-trained T2I U-Net to generate the entire temporal duration of the video, which synthefies state-of-the-art diverse and coherent videos. Follow Your Pose [20] also applies the image generation diffusion model to the video generation process, proposing a pose-guided text-to-video generation model to create a character video generation model with editable text and pose management. In addition to the guiding conditions such as pose, edge, and semantic segmentation, T2V tasks are evolving toward multi-conditional guidance. Therefore, in the face of extensive developing T2V generating tasks, research on video smoothing methods will be conducted throughout and is very important.
Text-to-video editing
The video editing based on diffusion models is often guided by text, so as to change the content, style and background layout of the generated video. Dreamix [21] proposed a text-based general appearance and action editing method for realistic scene videos, which achieved good video editing effects, but the model implementation process required relatively high computing resources. Patrick Esser [22] et al. proposed a structure and content perception model, which can modify video through sample images or text guidance, and demonstrated for the first time the control of time consistency by the joint training of image and video data. The experimental results are impressive, but the implementation process requires large-scale text image data resources. FateZero [23] captures the intermediate attention graph in the inversion process to retain the structure and motion information, fuse self-attentions with a blending mask obtained by cross-attention features from the source prompt, and changes the self-attention in the UNet to spatio-temporal attention, so as to realize the change of the video foreground content and the editing of the video style. Rerender A Video [24] presents a novel video-to-video translation framework, which translates the source video into the target video based on the style description of the text. It can be widely used in video style transfer. TokenFlow [25] finds that enhancing the consistency of internal diffusion features between frames can generate videos with temporal consistency during video editing. Video-P2P [26] achieve video editing by introducing cross-attention control and optimizing shared unconditional embedding of video inversion, as well as using different guides for source and editing prompts, and merging their attention. Slicedit [27] runs DDPM sampling using the extracted source video noise space, while injecting the guided conditional attention maps at specific timesteps to achieve video editing. FLATTEN [28] extends the T2I diffusion model and incorporates flow-guided attention into the DDIM sampling and inversion process, resulting in consistent video editing results. With the vigorous development of T2V tasks, T2V editing tasks do not require huge computing resources for the generation process, and can be flexible in the output video style, so T2V editing tasks will also be widely developed.
Latent diffusion models
Diffusion models have recently emerged as new leaders in deep generation models, owing to their high-quality generation effects in image generation tasks. They have excellent performance in many application fields, such as computer vision, NLP (Natural Language Processing), multimodal modeling, medical image reconstruction and so on [29]. Since the original diffusion models still has some shortcomings, such as slow sampling speed, poor maximization likelihood and weak data generalization ability, many optimization studies based on diffusion models have been conducted based on the above problems [30–36].
In this paper, video smoothing is studied on the basis of latent diffusion models [18]. Compared with other space compression methods, latent diffusion models propose a method to perform the diffusion process in latent space, which can reduce the computational complexity and achieve better image generation. Instead of training the model on pixel space, latent diffusion models process the original images through self-coding models, retaining only some features of the important foundation, thus greatly reducing the complexity of the training and sampling phases. latent diffusion models mainly consist of a self-coding model (including an encoder and a decoder) and diffusion operations of the latent space.
Specifically, given an image , H is the height, W is the width, and 3 is the number of color channels (R, G, B). First use an encoder ∈ encode the image into the latent space
, where
, then use a decoder to reconstruct the image from the latent space
, the downsampling factor is f = H ∕ h = W ∕ w. In the latent space, noise prediction network
need to be trained to implement the denoising process,
is a noisy sample of
at timestep t, t = 1 , . . . , T. In the latent diffusion models, a self-coding model is introduced, so the
obtained by the encoder can be used to make the model learn in the latent space during training, expressed as:
where noise ε is added to according to step t to obtain
. For the conditional guided image generation process, LDM introduces a domain-specific encoder
for pre-processing condition y, where learning can be expressed as:
Denoising diffusion implicit models (DDIMs)
The denoising diffusion implicit models (DDIMs) [30] are more efficient iterative implicit probability models that build on the denoising diffusion probability models (DDPMs) [37] to enhance sampling speed even further. In DDPMs, the generative process is defined as the reverse of a particular Markovian diffusion process, so the generation steps are continuous and non-skippable. However, DDIMs break the Markovian process through mathematical reasoning and realize multi-step sampling, which improves sampling speed. The non-Markovian process is derived as follows:
Analyze the inference diffusion process indexed by real vectors :
where is a latent variable, timestep t = 1 , . . . , T,
is a data distribution, when t>1 and diffusion hyperparameter
sufficiently close to 0,
, therefore, the following equation can be obtained:
According to Bayes’ rule, the following forward process can be derived:
where the forward process is no longer a Markovian process, because every can depend on
,
.
The generation process is as follows:
Define a trainable generation process , where each
leverages knowledge of
. Given a noise observation
,
, model
predict
from
, which is a prediction of
given
:
The generation process can be defined as follows:
According to the generation process, the implicit model equation of denoising diffusion is further obtained as follows:
By setting to 0, the sampling process is deterministic, the latent variables are consistent, and the sampling steps are reduced. The following equation is obtained:
Method
In this section, the noise relation between different frames during T2V sampling will be analyzed and studied, and then the interframe noise relation model will be constructed to improve the consistency between different frames. For both training and training-free T2V methods, specific smoothing editing methods are proposed based on interframe noise relation model. The proposed methods can meet the smoothing requirements of most T2V studies and have strong generalization ability. The framework is illustrated in Fig 2.
Based on the proposed INR Smooth strategy, two video smoothing editing methods are proposed, one for T2V training models (green) and one for T2V training-free models (blue).
Research on the interframe noise relation
The training and sampling process of the diffusion models are carried out around the noise. As the core content of the diffusion models, noise brings infinite possibilities to researchers. For the traditional T2I model, the process of add noise or denoising is only for static images, but the video generated by the T2V model is composed of a series of static images, so the process of T2V generation will increase resources according to the length and resolution of the video. We often consider the noise relation of each step in the process of adding noise and denoising for T2I, so what is the relation between the noise of different frames in T2V, and how does it affect the consistency of the video? At present, relatively few studies [12,13] have been carried out on the noise relationship between video frames, and our research work is carried out in this aspect.
For a single frame image, in the forward process, noise is progressively added to the observation , and
is obtained after t steps, for any step t,
can be expressed as a linear combination of
and random noise variable
:
In the sampling process, after progressively denoise from , the observation
is finally obtained, then
can also be expressed as a linear combination of
and estimated noise variable
:
Discuss the relation between the noise of different frames in T2V. In the forward process, ,
and
represent the original observations corresponding to any adjacent three frames, n is the frame index, after t steps, random noise variables
,
,
are added to the original observations, then
,
and
are obtained.
In the sampling process, after progressively denoise from the adjacent three frames, ,
and
are obtained;
,
,
respectively represent the estimated noise variable of the three frames during the denoising process. The diffusion hyperparameter
values for the adjacent three frames in the same step are equal, so we can express any adjacent three frames as follows:
For forward and sampling processes, the ultimate prediction objective is to reach or
and construct loss from there, so according to the prediction objective, we can construct the following equation:
Transform the two above equations as follows:
As shown in the above equations, for any adjacent three frames, when the steps are same, the increased noise variable and the estimated noise variable have a certain constraint relation, which will certainly affect the consistency of each frame.
INR Smooth: Video smoothing editing method for T2V training models
In order to improve the alignment of text and video and the consistency of different frames, the above results of the interframe noise relation can be applied to the T2V training models. Fig 3 depicts the proposed video smoothing editing framework for T2V training models.
The green arrow indicates the direction from the first frame to the last frame to build the interframe noise constraint, and the blue arrow indicates the opposite process.
As shown in Fig 3, construct interframe noise constraints from both directions simultaneously, then obtain the mean square error (MSE) of the interframe noise variables:
where n indexes the corresponding video frame, N is the total number of frames of the video, MSE(1) represents the interframe noise constraint from the last frame to the first, MSE(2) represents the interframe noise constraint from the first frame to the last. The video smoothing loss function of the training model is constructed as follows:
where and
are the weights of the constraint relation between frames in two directions respectively, and the values of
and
can be set to 0.5 respectively, which can be adjusted appropriately according to the characteristics of experimental objects during the experiment. Thus, the overall loss function of the T2V training model can be obtained as follows:
where λ is the weight of video smoothing loss, and is the original loss function of the diffusion models. The pseudocode of the video smoothing editing method for the T2V training models is shown in Algorithm 1.
INR Smooth: Video smoothing editing method for T2V training-free models
On the basis of large-scale diffusion models, more and more researchers begin to explore lightweight optimization research that can be achieved only through model fine-tuning without adding additional computing resources and training. The training-free T2V generating process uses fewer computing resources and produces results faster. However, due to the lack of training and learning of the study samples, it is often difficult to control the smooth performance of the generated video. To address such research challenges, DDIM Inversion is introduced based on the above interframe noise relation, and control the inversion noise to improve the smoothness of the generated video.
The purpose of DDIM Inversion is to reverse the sampling process, resulting in a noisy implicit representation that serves as the starting for the sampling process, and the generated image will be close to the source image. The whole process can be modeled as the process of adding random noise
to
gets
, then using
as the starting point for DDIM sampling to generate
and predict inversion noise (represented by
), and then gradually sampling to get the final generated image. From , we can generate a sample
from a sample
via:
Algorithm 1: Video smoothing editing method for T2V training models.
Input:
Output:
1: Hyperparameters: ,
2: Diffusion process
3: for t = 1 , 2 , . . . , T do
4:
5: end for
6: Denoising process
7: for t = T , ( T − 1 ) , . . . , 1 do
8:
9: end for
10: Deriving the noise relationship between video frames
11: from
12:
13:
14:
15: to
16:
17:
18: Constructing smoothing loss
19:
20:
21:
22: return
According to the DDIM Inversion process and the interframe noise relation studied in this paper, a video smoothing editing framework for training-free models is constructed, as shown in Fig 4.
Different from the smoothing editing method of the training models, inversion noise is used to replace the estimated noise for the training-free models.
As shown, random noise is added in the diffusion forward process, and inversion noise is removed by the DDIM Inversion sampling process, represents the inversion noise from the last frame to the first,
represents the inversion noise from the first frame to the last, a and b represent the step, n indexes the corresponding video frame. The noise constraint from both ends of the video is conducive to improving the consistency between video frames. According to Eqs 17 and 18 and DDIM Inversion process,
and
can be constructed as follows:
For training-free models, inversion noise is used to replace the estimated noise in the method for training models. Inversion noise has a value starting from the second-to-last frame of the video, inversion noise
has a value starting from the second frame of the video, they both depend on the added random noise and the removed inversion noise of the adjacent frame. Weighting the inversion noise in both directions to obtain the inversion noise
of the NTH frame:
where is inverse noise weight, by adjusting the inverse noise weight, the interframe consistency and text alignment of the video generated by the training-free T2V models can be improved. The pseudocode of the video smoothing editing method for the T2V training-free models is shown in Algorithm 2.
Algorithm 2. Video smoothing editing method for T2V training-free models.
Input:
Output:
1: Hyperparameters:
2: DDIM for inversion latents
3: for t = 1 , 2 , . . . , T do
4:
5: end for
6: Deriving the noise relationship between frames by inversion noise and random noise
7:
8:
9: Constructing the inversion noise
10:
11: return
Experiments
Implementation details
The proposed two smoothing editing methods can be applied to T2V baselines which require training or inversion-based baselines which are training-free, respectively. The implementation is based on each baseline’s official codes, original data frame rate, frame number, learning rate, batch size and resolution requirements. The comparison experiments with baselines are representative. In the inference phase, we use DDIM sampler with classifier-free guidance. As the internal parameters of different baselines are obviously different, in order to adapt to the optimization of different baselines, it is necessary to adjust the weights according to specific baselines. So the smoothing loss weight λ and inverse noise weight can be set according to the concrete experiment. The number of sampling steps in the experiment is 50. All experiments are conducted on an NVIDIA RTX 4090 GPU.
Dataset and metrics
This study uses official implementation datasets of each baseline for smoothing tests, which mainly include the DAVIS dataset [38]. We evaluate the quality of the generated video from two aspects: text alignment and frame consistency. According to the previous research [8,9,22,23,26], we select CLIP (T) (text alignment evaluation), CLIP (F) (frame consistency evaluation), and Optical flow [39] as the corresponding evaluation metrics. CLIP (T) refers to the average cosine similarity of text and image embedded across all frames of the generated video. CLIP (F) refers to the average cosine similarity between all consecutive frame pairs of the generated video. Optical flow computes the motion vector of each pixel between two frames using image information from succeeding frames. For the smoothing optimization effect of the baselines, the source videos provided by the official implementation process will be used as the main evaluation and comparison.
Baselines
Based on the generated data from the state-of-the-art baselines: Tune-A-Video [8], ControlVideo [9], FateZero [23] and Video-P2P [26], this paper carries out smoothing editing experiments. Smoothing editing methods include Smooth Video [13] and the proposed methods. At the same time, two advanced methods Slicedit [27] and STEM [40] are selected for extended comparison. Among them, Smooth Video [13] applies a video frame noise prediction method, which is in contrast with the proposed methods. Slicedit [27] and STEM [40] apply DDPM inversion and Spatial-Temporal Expectation-Maximization inversion, respectively, in contrast to the DDIM inversion applied in this paper.
Smoothing results based on training baselines
Comparison with Tune-A-Video. The experiments are based on the Stable Diffusion 1.4, with the guidance_scale set at 12.5. The experiments compare the baseline methods Tune-A-Video [8], Slicedit [27], and STEM [40], the smoothing editing method Smooth Video [13], and the proposed video smoothing editing method (Ours(T)) which is based on the training models. The smoothing loss weight of Ours(T) is set to 2.0.
As shown in Fig 5, each row in the figure is the middle 8 frames with obvious changes selected from the generated video. As shown, the video generated by Tune-A-Video can maintain the semantic alignment with the text, but there are still obvious flickers between frames (as shown in the yellow box of Tune-A-Video). With the introduction of Smooth Video, the smoothness of Tune-A-Video is improved, but the video results do not present a cartoon style and thus are not aligned with the text semantics. After introducing the proposed smoothing method (Ours(T)), the generated video exhibits better frame consistency, while the generated video has better quality and obvious cartoon style. The generated results of Slicedit do not present cartoon style and do not achieve text alignment. The STEM generation effect is remarkable, the text is well aligned, but there are slightly unsmooth details (as shown in the yellow box of STEM).
Fig 6 shows the experimental results of the first 8 frames selected from the generated video. As shown by the red box of Tune-A-Video, there are obvious background differences between the frames, resulting in flickering and unsmooth video. Smooth Video improves the smoothness of Tune-A-Video, but there are obvious generation anomalies in the first three frames (as shown in the yellow boxes of Smooth Video). Similarly, Slicedit does not achieve text alignment. The generation results of STEM are abnormal, as shown by the yellow boxes, additional cars that do not exist in the source video appear. Compared with the above methods, the proposed method (Ours(T)) can make the background texture of the generated video clearer and can improve the interframe consistency. Meanwhile the proposed method achieves text alignment, and preserves the temporal consistency with the source video.
Comparison with Control Video. The experimental results of the baseline method ControlVideo [9], the comparison method Smooth Video [13] and Ours(T) are compared. The smoothing loss weight of Ours(T) is set to 0.2. Fig 7 shows the first 7 frames results of the experiments. As the figure shows, while ControlVideo can generate text-aligned videos, there are still shortcomings in the shadows and leg structure of the person in the video (as shown in the yellow boxes). Smooth video, improved character quality in some frames, but due to over-smoothing, the background texture disappears, some action poses are distorted, and cannot be aligned with the source video (as shown in the red boxes). In contrast, the proposed method aligns better with the background, color, and texture of the source video, while improving the person’s shadows and poses, and maintaining temporal consistency with the source video.
Smoothing results based on training-free baselines
Comparison with FateZero. Due to FateZero [23]’s advantages in video content style and structural shape editing, the generated videos have better temporal consistency and smoothness. Under the condition that guidance_scale is set to 7.5, an style transfer experimental comparison is conducted between FateZero, Slicedit [27], STEM [40] and the proposed video smoothing editing method (Ours(TF)) which is based on the training-free models. The inverse noise weight of Ours(TF) is set to 1.0.
Fig 8 shows that the first three frames of FateZero have a strong stylized effect, whereas the last three frames have a weak stylized effect, displaying inconsistent stylized effect. The generation effect of Slicedit is close to the source video and does not achieve the Van Gogh style. The artistic effect of STEM needs to be strengthened, and the background does not realize the corresponding artistic style. In terms of style editing, the proposed method (Ours(TF)) demonstrates overall consistency, and the texture artistic effect of leaves and petals in the last three frames is consistent with that of the first three frames, and the background of each frame also shows consistency with the style (as shown in Fig 9). Through comparison, it is found that the proposed method not only maintains the alignment with the text, but also has obvious advantages in terms of overall artistic style, color, and texture presentation.
Comparison with Video-P2P. Video-P2P [26] can achieve better application in generating new videos by changing the guide text and attention weighting, and has good performance in maintaining frame consistency. In this paper, Video-P2P is selected to compare with Slicedit [27], STEM [40], and Ours(TF). The inversion noise weight of Ours(TF) is set to 0.5.
As shown in Fig 10, the results of Video-P2P show many inconsistencies, such as inconsistent background and texture (as shown in the red boxes of Video-P2P), inconsistent structure and shape of rabbit head (as shown in the yellow boxes of Video-P2P), and inconsistent color and shape of rabbit feet (as shown in the blue boxes of Video-P2P). Inconsistent backgrounds and textures similarly appear in the STEM results. In contrast, Slicedit and the proposed method Ours(TF) achieve text alignment, however, the origami effect of Slicedit needs to be strengthened.
Quantitative results
In this section, the training and training-free video smoothing methods proposed in this paper are quantitatively compared with the following T2V baselines, including Tune-A-Video [8], ControlVideo [9], Smooth Video [13], FateZero [23], Video-P2P [26], Slicedit [27], and STEM [40]. Figs 4, 5, and 7 show that the results of Slicedit are almost identical to the source video and are not aligned with the text. Therefore, Slicedit is excluded from these quantitative comparisons. Tables 1 and 2 display the results of the quantitative comparison. CLIP (T) represents the degree of alignment between the generated video and the text, so the higher the value, the better the alignment. CLIP (F) further indicates the frame consistency of the generated video, so higher value indicates the better smoothness. The purpose of optical flow metric is to compute the motion vector of each pixel between two frames using image information from succeeding frames in order to determine the video’s smoothness.
As shown in Table 1, through quantitative comparison, the proposed video smoothing editing method (Ours(T)) which is based on the training models, is significantly superior to the compared baselines in the comprehensive scores of the three metrics: text alignment CLIP (T), frame consistency CLIP (F), and Optical flow. In the experiments based on ControlVideo, the shadow of jeans is smoothed out due to the over-smoothing of Smooth Video, so the CLIP (T) and optical flow metrics of Smooth Video are improved, but the experimental results of the proposed method (Ours(T)) are more consistent with the background texture and light shadow of the source video.
Table 2 shows that the proposed video smoothing editing method (Ours(TF)), based on training-free models, significantly improves text alignment, frame consistency, and temporal consistency. It is worth noting that when the proposed method (Ours(TF)) is used to smooth the stylized video generated by FateZero, the art style texture effect improves, resulting in a higher CLIP (T); however, since the background of each frame also implements the artistic style, the CLIP(F) and optical flow metrics of the proposed method are relatively poor compared with the solid color background of FateZero and STEM.
The preceding studies demonstrate that the proposed methods have significant effects in text-to-video editing. We extensively evaluate the training and inference time of the training and training-free methods. For example, Tune-A-Video takes about 12 minutes to train on a 24-frame video with 500 epochs and requires 15.8G of memory resources. Compared with Tune-A-Video, the proposed method (Ours(T)) requires the same training time as Tune-A-Video, and occupies 18.8G of memory resources. It can be seen that the proposed method increases the memory resources by about 128M per frame while editing the training baseline, but does not add extra computation time. The improvement in text alignment, temporal consistency, and fidelity makes the additional memory resources worthwhile. Training-free method FateZero takes about 1 minute 32 seconds to infer 8 frames of video and requires 9.7 to 11.4G of memory resources. Our method (Ours(TF)) takes the same computational cost as FateZero. It can be seen that the computational resources required to edit a training-free model mainly depend on the baseline model. For various text inputs such as scene, dress, style, and material, compared with the state-of-the-art methods, the proposed method can generate consistent video content that conforms to the semantics of the text. For the additional guidance conditions, such as human pose, the proposed method can also achieve alignment with the source video and guide pose. The above stable generation performance further verifies the robustness of the proposed method.
Ablation study
We conduct ablation experiments by removing the forward noise constraint (Forward) and the backward noise constraint (Backward) from our frame work, respectively, as shown in Fig 11. Ablation experiments reveal that using the forward noise constraint can improve text alignment and local properties. For example, the material of the swan head and neck is improved, and the structure shape of the swan tail is improved. The use of the backward noise constraint also improves the text alignment, the material of the swan head and neck is improved, but the shape of the swan tail does not change significantly. By adding noise constraints in both directions at the same time (Double), the overall material and structure shape achieve more significant improvement, both text alignment and temporal consistency are enhanced.
Limitations
Qualitative and quantitative experiments show that the proposed method achieves superior text alignment and source video fidelity, and the generated video is temporal consistent. However, it is worth noting that the temporal consistency of the edited video is affected by the quality of the source video and the generation effect of the baseline method. Since the proposed method edits the generated results of the baseline method, if the generated results of the baseline method fully depart from the text semantics or the content of each frame is extremely diverse, the editing enhancement effect will be obscured.
Conclusion
This paper proposes a video smoothing strategy INR Smooth, which is based on the interframe noise relation. Based on INR Smooth, two specific video smoothing editing methods are proposed for training models and training-free models. A large number of experiments have proved that the proposed methods have achieved remarkable results in a wide range of applications. The proposed methods have obvious advantages in maintaining the consistency between video frames, text alignment and temporal consistency with the source video, and has outstanding performance in the smooth transition of real scenes, the portrayal of artistic styles, and the maintenance of guiding conditions.
References
- 1.
Qing Z, Zhang S, Wang J, Wang X, Wei Y, Zhang Y, et al. Hierarchical spatio-temporal decoupling for text-to-video generation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16–22, 2024. IEEE; 2024. p. 6635–45. Available from: https://doi.org/10.1109/CVPR52733.2024.00634
- 2.
Kondratyuk D, Yu L, Gu X, Lezama J, Huang J, Schindler G, et al. VideoPoet: a large language model for zero-shot video generation. In: Forty-first international conference on machine learning, ICML 2024, Vienna, Austria, July 21–27, 2024. OpenReview.net; 2024. Available from: https://openreview.net/forum?id=LRkJwPIDuE
- 3.
Guo Y, Yang C, Rao A, Agrawala M, Lin D, Dai B. SparseCtrl: adding sparse controls to text-to-video diffusion models. In: Leonardis A, Ricci E, Roth S, Russakovsky O, Sattler T, Varol G, editors. Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XLII. vol. 15100 of Lecture Notes in Computer Science. Springer; 2024. p. 330–48. Available from: https://doi.org/10.1007/978-3-031-72946-1_19
- 4.
Zeng Y, Wei G, Zheng J, Zou J, Wei Y, Zhang Y, et al. Make pixels dance: high-dynamic video generation. In: IEEE/CVF conference on computer vision and pattern recognition, CVPR 2024, Seattle, WA, USA, June 16–22, 2024. IEEE; 2024. p. 8850–60. Available from: https://doi.org/10.1109/CVPR52733.2024.0084
- 5.
Wang X, Zhang S, Yuan H, Qing Z, Gong B, Zhang Y, et al. A recipe for scaling up text-to-video generation with text-free videos. In: IEEE/CVF conference on computer vision and pattern recognition, CVPR 2024, Seattle, WA, USA, June 16–22, 2024. IEEE; 2024. p. 6572–82. Available from: https://doi.org/10.1109/CVPR52733.2024.00628
- 6.
Chen H, Zhang Y, Cun X, Xia M, Wang X, Weng C, et al. VideoCrafter2: overcoming data limitations for high-quality video diffusion models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16–22, 2024. IEEE; 2024. p. 7310–20. Available from: https://doi.org/10.1109/CVPR52733.2024.00698
- 7. Wang Y, Chen X, Ma X, Zhou S, Huang Z, Wang Y, et al. LAVIE: high-quality video generation with cascaded latent diffusion models. arXiv preprint 2023. https://arxiv.org/abs/2309.15103
- 8.
Wu JZ, Ge Y, Wang X, Lei SW, Gu Y, Shi Y, et al. Tune-a-video: one-shot tuning of image diffusion models for text-to-video generation. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1–6, 2023. IEEE; 2023. p. 7589–7599. Available from: https://doi.org/10.1109/ICCV51070.2023.00701
- 9.
Zhao M, Wang R, Bao F, Li C, Zhu J. ControlVideo: conditional control for one-shot text-driven video editing and beyond; 2023. Available from: https://arxiv.org/abs/2305.17098
- 10. Khachatryan L, Movsisyan A, Tadevosyan V, Henschel R, Wang Z, Navasardyan S, et al. Text2Video-zero: text-to-image diffusion models are zero-shot video generators. In: IEEE/CVF international conference on computer vision, ICCV 2023, Paris, France, October 1–6, 2023. IEEE; 2023. p. 15908–18. Available from:
- 11. Blattmann A, Dockhorn T, Kulal S, Mendelevitch D, Kilian M, Lorenz D, et al. Stable video diffusion: scaling latent video diffusion models to large datasets. arXiv preprint 2023. https://arxiv.org/abs/2311.15127
- 12.
Ge S, Nah S, Liu G, Poon T, Tao A, Catanzaro B, et al. Preserve your own correlation: a noise prior for video diffusion models. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1–6, 2023. IEEE; 2023. p. 22873–84. Available from: https://doi.org/10.1109/ICCV51070.2023.02096
- 13.
Peng L, Cheng H, Yang Z, Zhao R, Xia L, Song C, et al. SmoothVideo: smooth video synthesis with noise constraints on diffusion models for one-shot video tuning; 2024. Available from: https://arxiv.org/abs/2311.17536
- 14.
Ho J, Salimans T, Gritsenko AA, Chan W, Norouzi M, Fleet DJ. Video diffusion models. In: Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, editors. Advances in neural information processing systems 35: annual conference on neural information processing systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28–December 9, 2022; 2022. Available from: http://papers.nips.cc/paper_files/paper/2022/hash/39235c56aef13fb05a6adc95eb9d8d66-Abstract-Conference.html
- 15. Brooks T, Peebles B, Holmes C, DePue W, Guo Y, Jing L. Video generation models as world simulators. OpenAI. 2024.
- 16. Ho J, Chan W, Saharia C, Whang J, Gao R, Gritsenko AA, et al. Imagen video: high definition video generation with diffusion models. arXiv preprint 2022. https://arxiv.org/abs/2210.02303
- 17.
Singer U, Polyak A, Hayes T, Yin X, An J, Zhang S, et al. Make-a-video: text-to-video generation without text-video data. In: The eleventh international conference on learning representations, ICLR 2023, Kigali, Rwanda, May 1–5, 2023. OpenReview.net; 2023.Available from: https://openreview.net/forum?id=nJfylDvgzlq
- 18.
Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. High-resolution image synthesis with latent diffusion models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18–24, 2022. IEEE; 2022. p. 10674–85. Available from: https://doi.org/10.1109/CVPR52688.2022.01042
- 19.
Bar-Tal O, Chefer H, Tov O, Herrmann C, Paiss R, Zada S, et al. Lumiere: a space-time diffusion model for video generation. In: Igarashi T, Shamir A, Zhang HR, editors. SIGGRAPH Asia 2024 conference papers, SA 2024, Tokyo, Japan, December 3–6, 2024. ACM; 2024. p. 94:1–94:11. Available from: https://doi.org/10.1145/3680528.3687614
- 20.
Ma Y, He Y, Cun X, Wang X, Chen S, Li X, et al. Follow your pose: pose-guided text-to-video generation using pose-free videos. In: Wooldridge MJ, Dy JG, Natarajan S, editors. Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2014, February 20–27, 2024, Vancouver, Canada. AAAI Press; 2024. p. 4117–25. Available from: https://doi.org/10.1609/aaai.v38i5.28206
- 21.
Molad E, Horwitz E, Valevski D, Rav-Acha A, Matias Y, Pritch Y, et al. Dreamix: video diffusion models are general video editors. arXiv preprpint 2023 https://arxiv.org/abs/2302.01329
- 22.
Esser P, Chiu J, Atighehchian P, Granskog J, Germanidis A. Structure and content-guided video synthesis with diffusion models. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023. IEEE; 2023. p. 7312–22. Available from: https://doi.org/10.1109/ICCV51070.2023.00675
- 23.
Qi C, Cun X, Zhang Y, Lei C, Wang X, Shan Y, et al. FateZero: fusing attentions for zero-shot text-based video editing. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1–6, 2023. IEEE; 2023. p. 15886–96. Available from: https://doi.org/10.1109/ICCV51070.2023.01460
- 24.
Yang S, Zhou Y, Liu Z, Loy CC. Rerender a video: zero-shot text-guided video-to-video translation. In: Kim J, Lin MC, Bickel B, editors. SIGGRAPH Asia 2023 Conference Papers, SA 2023, Sydney, NSW, Australia, December 12–15, 2023. ACM; 2023. p. 95:1–95:11. Available from: https://doi.org/10.1145/3610548.3618160
- 25.
Geyer M, Bar-Tal O, Bagon S, Dekel T. TokenFlow: consistent diffusion features for consistent video editing. In: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7–11, 2024. OpenReview.net; 2024. Available from: https://openreview.net/forum?id=lKK50q2MtV
- 26. Liu S, Zhang Y, Li W, Lin Z, Jia J. Video-P2P: video editing with cross-attention control. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16–22, 2024. IEEE; 2024. p. 8599–608. Available from:
- 27.
Cohen N, Kulikov V, Kleiner M, Huberman-Spiegelglas I, Michaeli T. Slicedit: zero-shot video editing with text-to-image diffusion models using spatio-temporal slices. In: Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21–27, 2024. OpenReview.net; 2024. Available from: https://openreview.net/forum?id=kUm9iuvwIQ
- 28.
Cong Y, Xu M, Simon C, Chen S, Ren J, Xie Y, et al. FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing. In: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7–11, 2024. OpenReview.net; 2024. Available from: https://openreview.net/forum?id=JgqftqZQZ7
- 29. Yang L, Zhang Z, Song Y, Hong S, Xu R, Zhao Y, et al. Diffusion models: a comprehensive survey of methods and applications. ACM Comput Surv 2023;56(4):1–39. https://doi.org/10.1145/3626235
- 30.
Song J, Meng C, Ermon S. Denoising diffusion implicit models. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3–7, 2021. OpenReview.net; 2021. Available from: https://openreview.net/forum?id=St1giarCHLP
- 31.
Song Y, Sohl-Dickstein J, Kingma DP, Kumar A, Ermon S, Poole B. Score-based generative modeling through stochastic differential equations. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3–7, 2021. OpenReview.net; 2021. Available from: https://openreview.net/forum?id=PxTIG12RRHS
- 32.
Song Y, Ermon S. Improved techniques for training score-based generative models. In: Larochelle H, Ranzato M, Hadsell R, Balcan M, Lin H, editors. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual; 2020. Available from: https://proceedings.neurips.cc/paper/2020/hash/92c3b916311a5517d9290576e3ea37ad-Abstract.html
- 33.
Song Y, Durkan C, Murray I, Ermon S. Maximum likelihood training of score-based diffusion models. In: Ranzato M, Beygelzimer A, Dauphin YN, Liang P, Vaughan JW, editors. Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6–14, 2021, virtual; 2021. p. 1415–28. Available from: https://proceedings.neurips.cc/paper/2021/hash/0a9fdbb17feb6ccb7ec405cfb85222c4-Abstract.html
- 34.
Nichol AQ, Dhariwal P. Improved denoising diffusion probabilistic models. In: Meila M, Zhang T, editors. Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event. vol. 139 of Proceedings of Machine Learning Research. PMLR; 2021. p. 8162–71. Available from: http://proceedings.mlr.press/v139/nichol21a.html
- 35. Lu C, Zheng K, Bao F, Chen J, Li C, Zhu J. Maximum likelihood training for score-based diffusion ODEs by high order denoising score matching. Proc Mach Learn Res. 2022; 162:14429–60.
- 36.
Zhang Q, Chen Y. Fast sampling of diffusion models with exponential integrator. In: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1–5, 2023. OpenReview.net; 2023. Available from: https://openreview.net/forum?id=Loek7hfb46P
- 37.
Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. In: Larochelle H, Ranzato M, Hadsell R, Balcan M, Lin H, editors. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual; 2020. Available from: https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html
- 38. Pont-Tuset J, Perazzi F, Caelles S, Arbeláez P, Sorkine-Hornung A, Gool LV. The 2017 DAVIS challenge on video object segmentation. arXiv preprint 2017. https://arxiv.org/abs/1704.00675
- 39.
Teed Z, Deng J. RAFT: recurrent all-pairs field transforms for optical flow. In: Vedaldi A, Bischof H, Brox T, Frahm J, editors. Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part II. vol. 12347 of Lecture Notes in Computer Science. Springer; 2020. p. 402–19. Available from: https://doi.org/10.1007/978-3-030-58536-5_24
- 40.
Li M, Li Y, Yang T, Liu Y, Yue D, Lin Z, et al. A video is worth 256 bases: spatial-temporal expectation-maximization inversion for zero-shot video editing. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16–22, 2024. IEEE; 2024. p. 7528–37. Available from: https://doi.org/10.1109/CVPR52733.2024.00719