PhD Defense: Generating Visual Content: From Pixel Orders to Videos and Beyond

Talk
Hanyu Wang
Time: December 17, 2025, 13:00 to 14:30

Visual content generation is a fundamental task in computer vision that enables diverse applications across domains. The high-dimensional nature of visual data makes it particularly challenging to achieve both high quality and precise control in generation tasks. This thesis investigates visual generation across varying levels of abstraction, ranging from fundamental pixel-level ordering to video synthesis, and extending beyond to the unification of perception and creation within large-scale multimodal systems.

We begin by addressing the foundational challenge of sequentially representing visual data through Neural Space-filling Curves, a data-driven approach that learns context-aware pixel orderings optimized for downstream tasks such as LZW compression. We then explore controlled image generation through two complementary approaches: Chop & Learn, a compositional generation framework that enables synthesis of novel object-state combinations, and a multimodal style transfer method that effectively combines guidance from both images and text.

For video generation, we introduce LARP, a novel tokenization approach with a learned autoregressive prior that achieves state-of-the-art performance while maintaining computational efficiency. Finally, we present Bridge, a unified framework that equips pre-trained multimodal large language models (MLLMs) with visual generative capabilities. By using a Mixture-of-Transformers architecture to handle conflicting modalities and a novel semantic-to-pixel discrete representation, Bridge enables high-precision visual understanding and high-fidelity generation within a single model, effectively closing the loop between perception and creation.