PhD Defense: Neural Rendering Techniques for Photo-realistic Image Generation and Novel View Synthesis

Moustafa Meshry
06.09.2022 10:00 to 12:00

IRB 5137

Recent advances in deep generative models have enabled computers to imagine and generate fictional samples from any given distribution of images. Techniques like Generative Adversarial Networks (GANs) and image-to-image (I2I) translation can generate images by mapping random noise or an input image (e.g., a sketch or a semantic map) to photo-realistic images. However, many challenges remain in training such models and in improving their output quality and diversity. Furthermore, to harness this imaginative and generative power for real-world applications, we need to be able to control different aspects of the rendering process, for example, to specify the content and/or style of generated images, the camera pose, or the lighting.

One challenge in training image generation models is the multi-modal nature of image synthesis. An image with specific content, such as a dog or a car, can be generated in countless different styles (e.g., colors, lighting, and local texture details). To enable user control over the generated style, previous works train multi-modal I2I translation networks, but their training is complicated, slow, and specific to a single target image domain. We address this limitation and propose a style pre-training strategy that generalizes across many image domains, improves training stability and speed, and improves output quality and diversity.

Another challenge for GANs and I2I translation is providing 3D control over the rendering process. For example, applications such as AR/VR, virtual tours, and telepresence require generating consistent images or videos of 3D environments. However, GANs and I2I translation mainly operate in 2D, which limits their use in such applications.
To address this limitation, we propose to condition image synthesis on coarse geometric proxies (e.g., a point cloud, a coarse mesh, or a voxel grid), and we augment these rough proxies with machine-learned components that fix and complement their artifacts to render photo-realistic images. We apply our proposals to the task of novel view synthesis under different challenging settings, and show photo-realistic novel views of complex scenes with multiple objects, tourist landmarks under different appearances, and human subjects under novel head poses and facial expressions.
Examining Committee:

Chair: Dr. Abhinav Shrivastava
Co-Chair: Dr. Larry S Davis
Dean's Representative: Dr. Carol Espy-Wilson
Members: Dr. David Jacobs, Dr. Thomas Goldstein