PhD Proposal: Generative Visual Understanding: From Emergence to Application
Modern visual generative models have fueled creativity even among non-experts, lowering the barriers of artistic training and creation time. These models have demonstrated contextual understanding by maintaining a surprising level of consistency with the input condition, context, and constraints. This naturally leads to the following questions. Do these generative models actually understand the underlying structure, texture, and semantics of the images they generate? If so, can we exploit the structure of this understanding to improve generative models further? This thesis seeks answers to these questions in the case of denoising diffusion probabilistic models (hereafter referred to as diffusion models).
Firstly, we establish the existence of contextual understanding in diffusion models using the concrete exemplar task of audio-conditioned lip synchronization. Our results, which generalize to in-the-wild data, show that contextual understanding can be explicitly instilled in diffusion models through constraints such as conditional inputs and multiple losses.
Secondly, by probing unconditional diffusion models, we investigate whether diffusion training itself fosters this kind of understanding, or whether it arises only because of the conditions and constraints. To this end, we observe how the information in the features varies across noise levels and neural network blocks. Our feature accumulation techniques obtain promising performance on discriminative tasks, recasting diffusion models as unified self-supervised representation learners.
Finally, we analyze the striking resemblance between the hierarchical information content of the diffusion states and that of the scale spaces in Gaussian pyramids. To leverage this insight, we propose frameworks that integrate these two well-established computer vision techniques to achieve superior performance at increased efficiency, potentially bringing pixel-space diffusion back to the forefront.
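To make the two hierarchies being compared concrete, the following is a minimal sketch of each, not the frameworks proposed in this thesis. It assumes a linear noise schedule and a 5-tap binomial blur kernel; the function names and default parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def diffuse(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(a_bar_t) * x0, (1 - a_bar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def gaussian_pyramid(image, levels):
    """Build a pyramid by repeated blurring (binomial kernel) and 2x subsampling."""
    kernel = np.array([1, 4, 6, 4, 1], dtype=float) / 16.0
    pyramid = [image]
    for _ in range(levels - 1):
        padded = np.pad(pyramid[-1], 2, mode="reflect")
        # Separable convolution: filter along rows first, then along columns.
        rows = np.apply_along_axis(
            lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
        blurred = np.apply_along_axis(
            lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)
        pyramid.append(blurred[::2, ::2])
    return pyramid
```

In this sketch, a larger timestep t leaves less of the original signal in the diffusion state, while a higher pyramid level keeps only coarser spatial structure; the two functions simply expose both hierarchies for side-by-side inspection.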