Video Dynamics Prior: An Internal Learning Approach for Robust Video Enhancements

Abstract

In this paper, we present a novel robust framework for low-level vision tasks, including denoising, object removal, frame interpolation, and super-resolution, that does not require any external training data corpus. Our approach learns the weights of its neural modules directly by optimizing over the corrupted test sequence, leveraging the spatio-temporal coherence and internal statistics of videos. Furthermore, we introduce a novel spatial pyramid loss that exploits the recurrence of spatio-temporal patches across the different scales of a video. This loss enhances robustness to unstructured noise in both the spatial and temporal domains. As a result, our framework is highly robust to degradation in the input frames and yields state-of-the-art results on downstream tasks such as denoising, object removal, and frame interpolation. To validate the effectiveness of our approach, we conduct qualitative and quantitative evaluations on standard video datasets such as DAVIS, UCF-101, and VIMEO90K-T.
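The abstract does not give the exact form of the spatial pyramid loss. As a rough illustration of the general idea of comparing a reconstruction and its target at several spatial scales, the sketch below averages an L1 error over a small pyramid built by 2x average pooling. The function names, the L1 error, the pooling operator, and the number of scales are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np

def downsample2x(video):
    """Average-pool every frame by a factor of 2. video has shape (T, H, W, C)."""
    t, h, w, c = video.shape
    h2, w2 = (h // 2) * 2, (w // 2) * 2          # drop an odd trailing row/column
    v = video[:, :h2, :w2, :]
    return v.reshape(t, h2 // 2, 2, w2 // 2, 2, c).mean(axis=(2, 4))

def pyramid_loss(pred, target, num_scales=3):
    """Mean L1 error averaged over num_scales pyramid levels (hypothetical form)."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    total = 0.0
    for _ in range(num_scales):
        total += np.mean(np.abs(pred - target))  # error at the current scale
        pred, target = downsample2x(pred), downsample2x(target)
    return total / num_scales
```

In this sketch, pixel-wise zero-mean noise contributes less at the coarser levels because average pooling partially cancels it, while a structural mismatch between the two videos persists across scales.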

Video Denoising (Qualitative Results)

A - Original Video with Poisson Noise ($\lambda$ = 25).
B - Video denoising utilizing FastDVDNet.
C - Video denoising utilizing UDVD.
D - Video denoising utilizing M2F2.
E - Video denoising utilizing our VDP method.
F - Original Noise-Free Video.

Example 1 - Videos are arranged in the format $\left(\frac{A|B|C}{D|E|F}\right)$

Example 2 - Videos are arranged in the format $\left(\frac{A|B|C}{D|E|F}\right)$

Note: The video examples show that the baseline methods (FastDVDNet and M2F2) do not differentiate between the noise signal and the clean signal for additive Poisson noise. Hence, these methods increase the noise in the video when they perform spatio-temporal smoothing.
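The page does not state how $\lambda$ parameterizes the Poisson noise. As one possible reading, the sketch below assumes that $\lambda$ acts as a peak photon count applied to frames scaled to $[0, 1]$, which is a common way to simulate Poisson noise. The function name `add_poisson_noise` and the `peak` parameter are hypothetical. If the noise is instead generated by adding independently drawn Poisson samples to the frames, this sketch does not apply.

```python
import numpy as np

def add_poisson_noise(clean, peak=25.0, seed=0):
    """Simulate Poisson noise on frames in [0, 1] (assumed model, not the authors' code).

    Each pixel is drawn from Poisson(clean * peak) and rescaled by 1 / peak,
    so the expected value equals the clean pixel and the variance is clean / peak.
    """
    rng = np.random.default_rng(seed)
    clean = np.clip(np.asarray(clean, dtype=np.float64), 0.0, 1.0)
    return rng.poisson(clean * peak) / peak

# Under this model, brighter regions receive noise with larger variance.
dark = add_poisson_noise(np.full((200, 200), 0.1))
bright = add_poisson_noise(np.full((200, 200), 0.9))
print(dark.var(), bright.var())
```

Unlike Gaussian noise with a fixed $\sigma$, the noise variance in this model depends on the underlying pixel intensity.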

Video Super Resolution (Qualitative Results)

A - Original Low Resolution Video with Gaussian Noise ($\sigma$ = 5).
B - Enhanced High Resolution Video with EDVR.
C - Enhanced High Resolution Video with BasicVSR++.
D - Cascaded Denoiser (FastDVDNet) + VSR (EDVR).
E - Cascaded Denoiser (FastDVDNet) + VSR (BasicVSR++).
F - Enhanced High Resolution Video with our VDP method.

Example 1 - Videos are arranged in the format $\left(\frac{A|B|C}{D|E|F}\right)$

Note the details preserved in the super-resolved video produced by our VDP method. Cascaded models (Denoiser + VSR) wash away details such as (1) the colors of the artifact created by sunlight falling directly on the capturing lens and (2) the details in the bushes (zoom in to view them more clearly).

Example 2 - Videos are arranged in the format $\left(\frac{A|B|C}{D|E|F}\right)$

Zooming in on Example 2 shows that the cascaded (Denoiser + VSR) models create spatio-temporal artifacts (flickering and ringing) across the video, whereas our VDP model produces high-resolution videos without these artifacts.

Video Object Removal (Qualitative Results)

A - Original Video.
B - Masked Video.
C - Masked Object Removal using FGVC method.
D - Masked Object Removal using InterVI method.
E - Masked Object Removal using LongVI method.
F - Masked Object Removal using our VDP method.

Example 1 - Videos are arranged in the format $\left(\frac{A|B|C}{D|E|F}\right)$

Video Frame Interpolation ($4\times$ Qualitative Results)

We perform $4\times$ frame-rate upsampling of a video sequence that consists of three frames. In each example below, the middle frame has been corrupted with Gaussian noise of intensity $\sigma = 15$.
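The setup described above can be sketched as follows. The sketch corrupts the middle frame of a three-frame clip with Gaussian noise and then performs $4\times$ upsampling by linearly blending neighboring frames. It assumes pixel values on a 0-255 scale and that $\sigma$ is the standard deviation of the noise. The linear blend is only a naive baseline, and whether it matches the "Pixels Interpolation" baseline in the comparison is an assumption. The function names are hypothetical.

```python
import numpy as np

def corrupt_middle_frame(frames, sigma=15.0, seed=0):
    """Add Gaussian noise (std sigma, 0-255 scale assumed) to the middle frame only."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(frames, dtype=np.float64).copy()
    mid = len(noisy) // 2
    noisy[mid] += rng.normal(0.0, sigma, size=noisy[mid].shape)
    return np.clip(noisy, 0.0, 255.0)

def linear_upsample(frames, factor=4):
    """Naive frame-rate upsampling: blend each pair of consecutive frames."""
    frames = np.asarray(frames, dtype=np.float64)
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        for k in range(factor):
            w = k / factor                 # w = 0 reproduces the input frame a
            out.append((1.0 - w) * a + w * b)
    out.append(frames[-1])                 # keep the final input frame
    return np.stack(out)

clip = np.full((3, 8, 8, 3), 100.0)        # toy three-frame clip
upsampled = linear_upsample(corrupt_middle_frame(clip))
print(upsampled.shape[0])                  # (3 - 1) * 4 + 1 = 9 frames
```

Because the blend copies the noisy middle frame into every interpolated frame, this baseline propagates the injected noise rather than removing it.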

A - Original Low framerate Video with Gaussian Noise ($\sigma$ = 15) injected into the second frame.
B - Enhanced High framerate Video with Pixels Interpolation.
C - Enhanced High framerate Video with SoftSplat.
D - Enhanced High framerate Video with RIFE.
E - Cascaded Denoiser (FastDVDNet) + VFI (SoftSplat).
F - Enhanced High framerate Video with our VDP method.

Example 1 - Videos are arranged in the format $\left(\frac{A|B|C}{D|E|F}\right)$

Note: The cascaded model (Denoiser + VFI) smooths out many details in the video, for example (1) the brick texture on the wall and (2) the hairy texture on the bear (zoom in to view the difference clearly).