PhD Proposal: Image-to-Image Translation for Multi-Modal Image Generation and Novel View Synthesis
For a long time, imagination was thought to be a quality unique to the human mind. Imagination, we believed, was difficult, even nearly impossible, to teach to others, let alone to machines. This changed with the introduction of Generative Adversarial Networks (GANs). Computers can now imagine fictional and beautiful images in a variety of categories: landscapes, faces, animals, indoor scenes, and many more. Harnessing this imaginative and generative power became further possible through GAN-based image-to-image (I2I) translation. We can now train an AI to turn user sketches into realistic pictures, semantic labels into beautiful scenes, or natural images into artistic paintings.

It is often said that "the sky is the limit" when it comes to our imagination. Now that AI can turn our imagination into visual art, there are countless real-world applications to explore. However, as in any relatively young field, GANs and I2I translation have not fully matured yet. Training GANs and I2I translation networks is hard: it can take a long time, is highly unstable, and often relies on a complicated training objective. Moreover, the output in many cases is not yet perfect and can exhibit clear visual artifacts. In this dissertation, we explore improving the training of I2I translation networks, as well as expanding their application, specifically to the problem of novel view synthesis.

In the first direction, we explore how to improve the training of I2I translation networks. We propose an alternative architecture and training pipeline that improves training stability and speed, simplifies the training objective, and improves output quality and diversity. We also explore how to improve the training objective so that it properly imposes user constraints while giving the machine the freedom to add creative details.

In the second direction, we explore extending the application of I2I translation to variations of the novel view synthesis problem.
We propose a novel framework that looks only at in-the-wild internet photos of a scene and learns to generate realistic images of that scene from arbitrary viewpoints and under arbitrary appearances. We also explore novel view synthesis in a low-shot setting, where only a few images of the target subject or object are available. In this setting, it is important to learn a strong prior over an object category such that, when combined with one or a few shots of an object instance, the learned prior can be used to reason about and complete missing information. We propose to disentangle the synthesis process into spatial information, such as a 2D layout or a 3D mesh, and appearance information, such as latent style codes or neural textures. We then train a network to learn priors of an object class over this disentangled representation.

Examining Committee:
Chair: Dr. Larry Davis
Dept. Representative: Dr. Tom Goldstein
Members: Dr. Abhinav Shrivastava, Dr. David Jacobs