Video Dynamics Prior: An Internal Learning Approach for Robust Video Enhancements
Accepted at the Conference on Neural Information Processing Systems (NeurIPS) 2023

 


 

Abstract

In this paper, we present a novel, robust framework for low-level vision tasks, including denoising, object removal, frame interpolation, and super-resolution, that does not require any external training data corpus. Our approach learns the weights of its neural modules directly by optimizing over the corrupted test sequence, leveraging the spatio-temporal coherence and internal statistics of videos. Furthermore, we introduce a novel spatial pyramid loss that exploits the recurrence of spatio-temporal patches across the different scales of a video. This loss enhances robustness to unstructured noise in both the spatial and temporal domains. As a result, our framework is highly robust to degradation in the input frames and yields state-of-the-art results on downstream tasks such as denoising, object removal, and frame interpolation. To validate the effectiveness of our approach, we conduct qualitative and quantitative evaluations on standard video datasets such as DAVIS, UCF-101, and VIMEO90K-T.
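To make the idea of a loss evaluated across spatial scales concrete, the following is a minimal sketch of a generic multi-scale L1 loss in NumPy. It is not the paper's spatial pyramid loss: the function names `downsample2x` and `pyramid_l1_loss`, the use of 2x average pooling, the L1 distance, and the equal weighting of scales are all assumptions made for illustration.

```python
import numpy as np


def downsample2x(frames):
    """Average-pool a (T, H, W, C) video by a factor of 2 in height and width."""
    t, h, w, c = frames.shape
    h2, w2 = (h // 2) * 2, (w // 2) * 2  # crop odd sizes so blocks tile evenly
    x = frames[:, :h2, :w2, :]
    x = x.reshape(t, h2 // 2, 2, w2 // 2, 2, c)
    return x.mean(axis=(2, 4))


def pyramid_l1_loss(pred, target, num_scales=3):
    """Mean L1 error averaged over `num_scales` pyramid levels (hypothetical)."""
    total = 0.0
    for _ in range(num_scales):
        total += float(np.mean(np.abs(pred - target)))
        # Move both videos to the next, coarser pyramid level.
        pred, target = downsample2x(pred), downsample2x(target)
    return total / num_scales


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.uniform(0.0, 1.0, size=(3, 16, 16, 3))
    noisy = clean + rng.normal(0.0, 0.1, size=clean.shape)
    print(pyramid_l1_loss(noisy, clean))
```

In this sketch, zero-mean noise is partly averaged out at the coarser levels, so those terms are dominated by structure rather than by per-pixel noise. Whether the actual loss behaves this way depends on details not given on this page.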

Video Denoising (Qualitative Results)

 

Example 1 - Videos are arranged in the format $\left(\frac{A|B|C}{D|E|F}\right)$

Example 2 $\left(\frac{A|B|C}{D|E|F}\right)$

Note: The video examples show that the baseline methods (FastDVDNet and M2F2) do not differentiate between the noise signal and the clean signal for additive Poisson noise. Hence, these methods increase the noise in the video when they perform spatio-temporal smoothing.

 
 
 

Video Super Resolution (Qualitative Results)

 

Example 1 - Videos are arranged in the format $\left(\frac{A|B|C}{D|E|F}\right)$

Note the details preserved in the super-resolved video produced by our VDP method. Cascaded models (Denoiser + VSR) wash away details such as (1) the colors of the artifact created by sunlight falling directly on the capturing lens and (2) the details in the bushes (zoom in to view them more clearly).

Example 2 $\left(\frac{A|B|C}{D|E|F}\right)$

Zooming in on video example 2 shows that cascaded (Denoiser + VSR) models create spatio-temporal artifacts (flickering and ringing) across the video, whereas our VDP model produces high-resolution videos without these artifacts.

 
 
 

Video Object Removal (Qualitative Results)

 

Example 1 - Videos are arranged in the format $\left(\frac{A|B|C}{D|E|F}\right)$

 
 
 

Video Frame Interpolation ($4\times$ Qualitative Results)

 

We perform $4\times$ frame-rate upsampling of a video sequence that consists of three frames. Each example below has three frames, of which the middle frame has been corrupted with Gaussian noise of intensity $\sigma = 15$.
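The corruption described above could be reproduced roughly as follows. This is a sketch under stated assumptions: the noise is taken to be additive and zero-mean, $\sigma$ is read as a standard deviation on a 0 to 255 intensity scale, and the helper name `corrupt_middle_frame` is hypothetical. The page does not specify these details.

```python
import numpy as np


def corrupt_middle_frame(frames, sigma=15.0, seed=0):
    """Add zero-mean Gaussian noise to the middle frame of a (T, H, W, C) clip.

    Assumes pixel intensities in [0, 255]; the other frames are left unchanged.
    """
    rng = np.random.default_rng(seed)
    noisy = frames.astype(np.float64)  # astype returns a new array
    mid = frames.shape[0] // 2
    noisy[mid] += rng.normal(0.0, sigma, size=frames.shape[1:])
    return np.clip(noisy, 0.0, 255.0)


if __name__ == "__main__":
    clip = np.full((3, 16, 16, 3), 128.0)  # three flat gray frames
    out = corrupt_middle_frame(clip)
    print(np.std(out[1] - clip[1]))  # close to sigma for this flat input
```

Clipping to the valid range slightly reduces the effective noise level near black or white regions, which is one reason the measured standard deviation on real footage may fall below the nominal $\sigma$.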

Example 1 - Videos are arranged in the format $\left(\frac{A|B|C}{D|E|F}\right)$

Note: The cascaded model (denoiser + VFI) smooths out many details in the video, for example (1) the brick texture on the wall and (2) the hairy texture on the bear (zoom in to view the difference clearly).