From Videos to 4D Worlds and Beyond
IRB-4105 (in-person talk). Zoom: https://umd.zoom.us/j/7316339020
The world underlying images and videos is three-dimensional and dynamic, with people interacting with each other, with objects, and with the underlying scene. Even in videos of a static scene, the camera itself is moving through the 4D world. Disentangling this 4D world from a video is a challenging inverse problem because of fundamental ambiguities of depth and scale. Yet accurately recovering this information is essential for building systems that can reason about and interact with the underlying scene, and it has immediate applications in visual effects and the creation of immersive digital worlds.
In this talk, I will discuss recent progress in 4D human perception, including disentangling camera motion from human motion in challenging in-the-wild videos with multiple people. Our approach uses background pixels as cues for camera motion, which, when combined with motion priors and inferred ground planes, can resolve scene scale and depth ambiguities up to an "anthropometric" scale. I will also talk about nerf.studio, a modular open-source framework for easily creating photorealistic 3D scenes and accelerating NeRF development. Finally, I will introduce two new works that show how language can be incorporated for editing and interacting with the recovered 3D scenes. These works leverage large-scale vision and language models, demonstrating the potential for multi-modal exploration and manipulation of 3D scenes.