PhD Proposal: Simulating and Imagining the World with Generative Foundation Models
IRB-3137
Generative foundation models for images and videos are pre-trained on internet-scale data, enabling them to learn broad visual priors for general-purpose generation. However, they are typically conditioned only on text prompts or a single reference image, which limits their applicability to real-world tasks that require richer visual guidance to produce specific, goal-directed outputs. In addition, training such models from scratch is prohibitively expensive for most domain-specific applications.
This proposal investigates how pre-trained generative foundation models can be applied to real-world image and video tasks through data-efficient adaptation and enhanced visual conditioning. It argues that, with only limited domain-specific data, these general-purpose models can be transformed into effective tools for simulation and imagination, with applications in image and video generation, editing, and robotic simulation.