Video Prediction by Modeling Videos as Continuous Multi-Dimensional Processes
Accepted at the Conference on Computer Vision and Pattern Recognition (CVPR 2024)

Abstract

Diffusion models have made significant strides in image generation, mastering tasks such as unconditional image synthesis, text-to-image generation, and image-to-image translation. However, they fall short in video prediction, mainly because they treat a video as a collection of independent frames and rely on external mechanisms such as temporal attention to enforce temporal coherence. In this paper, we introduce a novel model class that treats video as a continuous multi-dimensional process rather than a series of discrete frames. Through extensive experimentation, we establish state-of-the-art performance in video prediction on benchmark datasets including KTH, BAIR, Human3.6M, and UCF101.
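To make the central idea concrete, the sketch below illustrates, in PyTorch, what "treating video as a continuous process between frames" can look like. It is a minimal toy and not the paper's actual formulation: the bridge-style interpolation between adjacent frames, the hypothetical TransitionNet architecture, the noise level sigma, and the MSE objective are all assumptions introduced here purely for illustration.

    import torch
    import torch.nn as nn

    # Toy illustration only: model the transition between two consecutive
    # frames as a continuous process indexed by t in [0, 1], instead of
    # denoising each frame independently.
    class TransitionNet(nn.Module):
        """Predicts the next frame from an intermediate state x_t and time t (hypothetical)."""
        def __init__(self, channels=3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(channels + 1, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, channels, 3, padding=1),
            )

        def forward(self, x_t, t):
            # Broadcast the scalar time t as an extra input channel.
            t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *x_t.shape[2:])
            return self.net(torch.cat([x_t, t_map], dim=1))

    def training_step(model, frame_prev, frame_next, sigma=0.1):
        """One step of an assumed bridge-style objective between adjacent frames."""
        t = torch.rand(frame_prev.shape[0], device=frame_prev.device)
        tb = t.view(-1, 1, 1, 1)
        # Continuous interpolation between the two frames, lightly perturbed with noise.
        x_t = (1 - tb) * frame_prev + tb * frame_next + sigma * torch.randn_like(frame_prev)
        pred = model(x_t, t)
        return nn.functional.mse_loss(pred, frame_next)

    # Usage with random tensors standing in for video frames of shape (B, C, H, W).
    model = TransitionNet()
    prev, nxt = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
    loss = training_step(model, prev, nxt)
    loss.backward()

At sampling time such a model would start from the last observed frame and step through t to roll out future frames; again, this only sketches the general notion of a continuous frame-to-frame process, not the authors' method.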

 

Please note: in the qualitative results shown for each dataset below, "Context" denotes the context frames given to the model, while "Predict" denotes the frames predicted conditioned on those context frames.

KTH Action Recognition Dataset

BAIR Robot Push Dataset

Human3.6M Dataset

UCF101 Dataset