NIRVANA: Neural Implicit Representations of Videos with Adaptive Networks and Autoregressive Patch-wise Modeling


Paper

Abstract

Implicit Neural Representations (INR) have recently been shown to be a powerful tool for high-quality video compression. However, existing works are limited in that they do not explicitly exploit the temporal redundancy in videos, leading to long encoding times. Additionally, these methods use fixed architectures which do not scale to longer videos or higher resolutions. To address these issues, we propose NIRVANA, which treats videos as groups of frames and fits a separate network to each group, performing patch-wise prediction. The video representation is modeled autoregressively: the network fit on the current group is initialized using the weights of the previous group's model. To further enhance efficiency, we quantize the network parameters during training, requiring no post-hoc pruning or quantization. Compared with previous works on the benchmark UVG dataset, NIRVANA improves encoding quality from 37.36 to 37.70 PSNR and encoding speed by 12×, while maintaining the same compression rate. In contrast to prior video INR works, which struggle with larger resolutions and longer videos, our algorithm is highly flexible and scales naturally thanks to its patch-wise and autoregressive design. Moreover, our method achieves variable-bitrate compression by adapting to videos with varying inter-frame motion. NIRVANA also achieves 6× faster decoding and scales well with more GPUs, making it practical for various deployment scenarios.
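Since quantization happens during training rather than as a post-hoc step, the weights already lie on the quantization grid when optimization ends. The paper's exact scheme is not reproduced on this page; a common way to realize training-time quantization is a straight-through estimator, sketched below in PyTorch with a placeholder uniform step size.

```python
import torch

class STEQuantize(torch.autograd.Function):
    """Uniform weight quantization with a straight-through gradient,
    letting the network train directly on its quantized parameters."""

    @staticmethod
    def forward(ctx, w, step):
        return torch.round(w / step) * step  # snap weights to the grid

    @staticmethod
    def backward(ctx, grad_out):
        # Pass gradients straight through the non-differentiable rounding.
        return grad_out, None

def quantize(w, step=1e-3):
    return STEQuantize.apply(w, step)
```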

1) Overview of NIRVANA


Overview

Prior video INR works perform either pixel-wise or frame-wise prediction. We instead perform spatio-temporal patch-wise prediction and fit individual neural networks to groups of frames (clips), with each group's network initialized from the one trained on the previous group. This autoregressive, patch-wise approach exploits both the spatial and temporal redundancies present in videos while promoting scalability and adaptability to varying video content, resolution, and duration (see the encoding-loop sketch below).
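As a concrete illustration, the encoding loop might look like the minimal Python sketch below. `make_model` and `fit_group` are hypothetical helpers standing in for model construction and per-group optimization; only the warm-start step reflects the autoregressive design described above.

```python
def encode_video(groups, make_model, fit_group):
    """Autoregressive group-wise encoding (sketch).

    groups:     frame groups (clips), in temporal order
    make_model: builds a fresh patch-wise INR
    fit_group:  optimizes one model to reconstruct one group
    """
    models, prev = [], None
    for frames in groups:
        model = make_model()
        if prev is not None:
            # Warm-start from the previous group's weights: adjacent clips
            # are similar, so optimization converges faster and the weight
            # residuals that need to be stored stay small.
            model.load_state_dict(prev.state_dict())
        fit_group(model, frames)  # minimize patch-wise reconstruction loss
        models.append(model)
        prev = model
    return models
```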

2) Model architecture


Model Architecture
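This page does not spell out the architecture, so the following is only a rough sketch of what a patch-wise coordinate network could look like: a SIREN-style MLP mapping a patch-center coordinate to a spatio-temporal RGB patch spanning the current group. The widths, depth, omega_0, patch size, and group size are placeholders rather than the paper's values, and SIREN's specialized weight initialization is omitted for brevity.

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """Linear layer followed by a sine activation, as in SIREN."""
    def __init__(self, in_dim, out_dim, omega_0=30.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.omega_0 = omega_0

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

class PatchWiseINR(nn.Module):
    """Maps a 2D patch-center coordinate to a spatio-temporal patch
    covering all frames of the current group."""
    def __init__(self, patch=32, group_size=3, hidden=512, depth=5):
        super().__init__()
        layers = [SineLayer(2, hidden)]
        layers += [SineLayer(hidden, hidden) for _ in range(depth - 2)]
        layers += [nn.Linear(hidden, group_size * patch * patch * 3)]
        self.net = nn.Sequential(*layers)
        self.patch, self.group_size = patch, group_size

    def forward(self, coords):  # coords: (N, 2), normalized to [-1, 1]
        out = self.net(coords)
        return out.view(-1, self.group_size, self.patch, self.patch, 3)
```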

3) Results

Comparison on UVG

Dataset   Method           Encoding Time (Hours) ↓   Decoding Speed (FPS) ↑   PSNR ↑   BPP ↓
UVG-HD    SIREN            ~30                       15.62                    27.20    0.28
UVG-HD    NIRVANA (Ours)   5.44                      87.91                    34.71    0.32
UVG-HD    NeRV             ~80                       11.01                    37.36    0.92
UVG-HD    NIRVANA (Ours)   6.71                      65.42                    37.70    0.86
UVG-4K    NeRV             ~134                      8.27                     35.24    0.28
UVG-4K    NIRVANA (Ours)   20.89                     50.83                    35.18    0.27

Comparison with video INR approaches on the UVG benchmarks. We vary NIRVANA's patch size on UVG-HD to match the BPP of SIREN and NeRV, respectively. NIRVANA achieves much faster encoding and decoding while maintaining better or on-par quality at comparable BPP.

Encoding Time

Num Frames   Method           Encoding Time (Hours) ↓   PSNR ↑   BPP ↓
2000         NeRV             84.44                     33.38    0.22
2000         NIRVANA (Ours)   20.85                     35.43    0.62
3000         NeRV             134.58                    31.60    0.16
3000         NIRVANA (Ours)   31.37                     35.21    0.64
4000         NeRV             190.30                    30.53    0.12
4000         NIRVANA (Ours)   41.84                     35.15    0.65

Video duration adaptability: for longer videos, we maintain similar reconstruction quality (~35 PSNR) and compression rate (~0.62 BPP), and we retain a significantly faster encoding speed than NeRV, which suffers significant quality degradation as the number of frames increases.

4) Adaptive Compression


Video content adaptability: videos are sorted in increasing order of variation between subsequent frames. Our approach adapts its bitrate to content: more static scenes exhibit lower BPP, while highly dynamic ones are allocated more bits, all while maintaining a PSNR similar to NeRV's (at 12× its encoding speed).
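One way to see where this adaptivity can come from: if each group's quantized weight residual (relative to the warm-started previous model) is entropy-coded, static content leaves residuals near zero and costs few bits. The helper below is purely illustrative, assuming numpy weight arrays and a placeholder uniform quantization step; it estimates the Shannon bound that a real entropy coder would approach.

```python
import numpy as np

def residual_bits(prev_w, curr_w, step=1e-3):
    """Estimate the bits needed to entropy-code one group's quantized
    weight residual (illustrative, not the paper's exact coder)."""
    q = np.round((curr_w - prev_w) / step).astype(np.int64)  # quantized residual
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    entropy = -(p * np.log2(p)).sum()  # bits per weight (Shannon bound)
    return entropy * q.size
```

For a nearly static clip the residual is mostly zeros, so the entropy, and hence the bits allocated to that group, collapses; dynamic clips yield broader residual distributions and more bits.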

5) GPU Scalability



GPU scalability of NIRVANA: we compare the scalability of our approach with NeRV in terms of encoding time with an increasing number of GPUs, at two video resolutions (1080p and 4K). NIRVANA scales close to linearly at 4K and has much lower overhead than NeRV at both resolutions.

6) Visual Results


Left to right: ground truth video frame, reconstruction from NIRVANA, reconstruction from NeRV.

7) Citation


The website template was borrowed from Ben Mildenhall.