HVTR:
Hybrid Volumetric-Textural Rendering for Human Avatars

Tao Hu1, Tao Yu2, Zerong Zheng2, He Zhang3, Yebin Liu*2, Matthias Zwicker1
1 University of Maryland, College Park; 2 Tsinghua University; 3 Beihang University

Key idea of HVTR: two-stage hybrid rendering.

Abstract

We propose a novel neural rendering pipeline, Hybrid Volumetric-Textural Rendering (HVTR), which synthesizes virtual human avatars from arbitrary poses efficiently and at high quality. First, we learn to encode articulated human motions on a dense UV manifold of the human body surface. To handle complicated motions (e.g., self-occlusions), we then leverage the encoded information on the UV manifold to construct a 3D volumetric representation based on a dynamic pose-conditioned neural radiance field. While this allows us to represent 3D geometry with changing topology, volumetric rendering is computationally heavy. Hence, we employ only a rough volumetric representation using a pose-conditioned downsampled neural radiance field (PD-NeRF), which we can render efficiently at low resolutions. In addition, we learn 2D textural features that are fused with the rendered volumetric features in image space. The key advantage of our approach is that we can then convert the fused features into a high-resolution, high-quality avatar with a fast GAN-based textural renderer. We demonstrate that hybrid rendering enables HVTR to handle complicated motions, render high-quality avatars under user-controlled poses and shapes, even with loose clothing, and, most importantly, run fast at inference time. Our experiments also show state-of-the-art quantitative results. HVTR is differentiable and can be trained end-to-end using only 2D images.

Video

Pipeline

Given a coarse SMPL mesh $I_p$ with pose p and a target viewpoint (o, d), our system renders a detailed avatar using four main components: pose encoding, 2D textural feature encoding, 3D volumetric representation, and hybrid rendering. ① Pose Encoding in UV space: We learn human motions on the UV manifold of the body mesh surface by recording the 3D positions of the mesh on a UV positional map and introducing optimizable geometry and texture latents to capture local motion and appearance details. This step yields pose-dependent features in UV space, which are projected into 2D textural features $\Psi^{im}_{tex}$ in ② 2D Tex-Encoding. ③ 3D Vol-Rep: To capture the rough geometry and address self-occlusion problems, we further learn a volumetric representation by constructing a pose-conditioned downsampled neural radiance field (PD-NeRF) that encodes 3D pose-dependent features. ④ Hybrid Rendering: PD-NeRF is rasterized into image-space features $\Psi^{im}_{vol}$ by volume rendering, where the 3D volumetric features are also preserved. The 2D textural and 3D volumetric features are pixel-aligned in image space, fused by Attentional Feature Fusion (AFF), and then converted into a realistic image and a mask by TexRenderer.
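The volume rendering step in ④ can be illustrated with a minimal NumPy sketch. It assumes the standard NeRF quadrature rule, applied to feature vectors instead of RGB colors; this page does not state the exact formulation used by HVTR, so treat it as an illustration rather than the authors' implementation. The function name `composite_features`, the channel count `C = 16`, and the depth range are hypothetical.

```python
import numpy as np

def composite_features(sigma, feats, t_vals, far_delta=1e10):
    """Alpha-composite per-sample features along each ray (standard NeRF quadrature).

    sigma:  (R, S) non-negative densities at S samples on R rays
    feats:  (R, S, C) feature vectors at those samples
    t_vals: (R, S) sample depths, ascending along each ray
    Returns (R, C) image-space features and (R,) accumulated opacity.
    """
    # Distance between adjacent samples; the last interval is treated as
    # open-ended (the convention of the original NeRF code, assumed here).
    deltas = np.diff(t_vals, axis=-1)
    last = np.full(t_vals.shape[:-1] + (1,), far_delta)
    deltas = np.concatenate([deltas, last], axis=-1)

    # Opacity contributed by each sample.
    alpha = 1.0 - np.exp(-sigma * deltas)

    # Transmittance: fraction of the ray that reaches sample i unoccluded
    # (exclusive cumulative product of 1 - alpha).
    trans = np.cumprod(1.0 - alpha, axis=-1)
    trans = np.concatenate([np.ones_like(trans[..., :1]), trans[..., :-1]], axis=-1)

    weights = trans * alpha                       # (R, S)
    feat_map = (weights[..., None] * feats).sum(axis=-2)
    opacity = weights.sum(axis=-1)
    return feat_map, opacity

# Toy usage: a 45x45 ray grid with 7 samples per ray and random inputs.
rng = np.random.default_rng(0)
R, S, C = 45 * 45, 7, 16                          # C is a hypothetical channel count
t_vals = np.tile(np.linspace(1.0, 3.0, S), (R, 1))  # hypothetical depth range
sigma = rng.random((R, S))
feats = rng.standard_normal((R, S, C))
feat_map, opacity = composite_features(sigma, feats, t_vals)
print(feat_map.reshape(45, 45, C).shape)          # (45, 45, 16)
```

Because each output pixel is a weighted sum of per-sample features, the result is a low-resolution feature image that can be aligned pixel by pixel with the 2D textural features before fusion. The fusion module (AFF) and TexRenderer are learned networks and are not sketched here.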

Geometry Reconstruction

As a byproduct, our method also reconstructs rough 3D geometry. The pose-conditioned downsampled NeRF is learned from 45x45-resolution images (1/16 of the full images) with only 7 sampled points along each ray in training, which enables efficient training and inference. Left: predicted geometry; right: reference image.
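To make the sampling budget concrete, the sketch below draws 7 depths per ray for a 45x45 ray grid and counts the resulting field queries. It assumes stratified sampling as in the original NeRF; the page does not say which sampling scheme HVTR uses, and the near/far bounds and the helper name `stratified_depths` are hypothetical.

```python
import numpy as np

def stratified_depths(n_rays, n_samples, near, far, rng):
    """Draw one jittered depth per bin for every ray.

    Splits [near, far] into n_samples equal bins and samples uniformly
    inside each bin, so depths are ascending along each ray.
    Returns an array of shape (n_rays, n_samples).
    """
    edges = np.linspace(near, far, n_samples + 1)
    lower, upper = edges[:-1], edges[1:]
    u = rng.random((n_rays, n_samples))           # uniform in [0, 1)
    return lower + (upper - lower) * u

H = W = 45            # low-resolution ray grid from the text
N_SAMPLES = 7         # samples per ray from the text
NEAR, FAR = 1.0, 3.0  # hypothetical depth bounds

rng = np.random.default_rng(0)
t_vals = stratified_depths(H * W, N_SAMPLES, NEAR, FAR, rng)
print(t_vals.shape)   # (2025, 7)
print(t_vals.size)    # 14175 field queries per rendered view
```

With these numbers, one view needs 45 x 45 x 7 = 14,175 field evaluations.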

Applications

HVTR can render human avatars with both pose and shape control from arbitrary viewpoints.

BibTeX

@article{hu2021hvtr,
      title={HVTR: Hybrid Volumetric-Textural Rendering for Human Avatars}, 
      author={Tao Hu and Tao Yu and Zerong Zheng and He Zhang and Yebin Liu and Matthias Zwicker},
      year={2021},
      eprint={2112.10203},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}