PhD Proposal: Towards Unifying Multimodal Perception, Reasoning and Generation

Talk
Jiuhai Chen
Time: 
05.12.2025 11:30 to 13:30
Location: 
IRB-5107

First, we introduce Florence-VL, a family of multimodal large language models built on Florence-2's generative vision encoder. Unlike conventional vision encoders such as CLIP, Florence-2 provides rich, multi-level visual features. We fuse these features with a novel depth-breadth fusion architecture and train the model in two stages: end-to-end pretraining followed by instruction tuning. Florence-VL's enriched visual embeddings yield state-of-the-art results across VQA, OCR, chart understanding, and knowledge-intensive benchmarks, and all code and weights are open-sourced.
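To make the fusion step concrete, below is a minimal sketch of what a depth-breadth fusion module could look like, assuming the multi-level features (from different encoder layers and different task prompts) are combined by channel-wise concatenation and projected into the language model's embedding space with a small MLP. The class name, dimensions, and projector design are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class DepthBreadthFusion(nn.Module):
    """Sketch: fuse visual features from several encoder depths (layers)
    and breadths (task prompts) by channel-wise concatenation, then project
    them into the LLM embedding space with a two-layer MLP."""

    def __init__(self, vis_dim: int, num_sources: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim * num_sources, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: list of [batch, num_patches, vis_dim] tensors taken from
        # different layers / prompts of the generative vision encoder
        fused = torch.cat(feats, dim=-1)   # [B, N, vis_dim * num_sources]
        return self.proj(fused)            # [B, N, llm_dim] visual tokens

# Usage (illustrative shapes): three feature sources fused into LLM tokens.
fusion = DepthBreadthFusion(vis_dim=1024, num_sources=3, llm_dim=4096)
feats = [torch.randn(2, 576, 1024) for _ in range(3)]
tokens = fusion(feats)  # fed to the language model alongside text tokens
```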
Next, we introduce BLIP-3U, a unified foundation model for both image understanding and image generation. Unified multimodal models that support both tasks have recently gained increasing attention, yet the optimal architecture and training strategy for such models remain open questions. In this work, we present a comprehensive study of image generation based on autoregressive and diffusion models, exploring different image representations (e.g., VAE and CLIP encoders) and training objectives such as mean squared error (MSE) and flow matching. We introduce a novel approach that uses a diffusion transformer to diffuse CLIP image features, achieving high training efficiency and strong performance. We also investigate joint and sequential training strategies for image understanding and image generation, and find that sequential training offers practical benefits: it preserves image understanding while enabling effective image generation. Based on these findings, we develop BLIP-3U, a state-of-the-art unified model that demonstrates superior performance on a wide range of benchmarks for both image understanding and generation. We also showcase applications such as image editing, reconstruction, and interleaved generation that highlight the necessity of integrating image understanding and generation. All model weights, code, and evaluation pipelines are open-sourced to support future research.
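As an illustration of a flow-matching objective on CLIP features, the sketch below assumes the standard linear-interpolation (rectified-flow) formulation: a conditional velocity model (in practice a diffusion transformer) regresses the constant velocity that transports Gaussian noise to the target CLIP feature tokens. The function and model signatures here are hypothetical stand-ins, not the actual training code.

```python
import torch
import torch.nn as nn

def flow_matching_loss(velocity_model: nn.Module,
                       clip_feats: torch.Tensor,
                       cond: torch.Tensor) -> torch.Tensor:
    """Flow-matching on CLIP image features: sample a point on the straight
    path from noise to the target features and regress its velocity."""
    b = clip_feats.size(0)
    x1 = clip_feats                                # target CLIP feature tokens
    x0 = torch.randn_like(x1)                      # Gaussian noise sample
    t = torch.rand(b, device=x1.device).view(b, 1, 1)
    xt = (1.0 - t) * x0 + t * x1                   # linear interpolation path
    v_target = x1 - x0                             # constant velocity of that path
    v_pred = velocity_model(xt, t.view(b), cond)   # conditional velocity prediction
    return ((v_pred - v_target) ** 2).mean()       # MSE on velocity

# Usage with a stand-in velocity network (a real system would use a DiT here);
# the conditioning tensor plays the role of, e.g., LLM hidden states.
class TinyVelocityNet(nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Linear(dim + cond_dim + 1, dim)

    def forward(self, xt, t, cond):
        t_feat = t.view(-1, 1, 1).expand(-1, xt.size(1), 1)
        return self.net(torch.cat([xt, cond, t_feat], dim=-1))

model = TinyVelocityNet(dim=1024, cond_dim=1024)
clip_feats = torch.randn(2, 64, 1024)   # target CLIP image feature tokens
cond = torch.randn(2, 64, 1024)         # conditioning features
loss = flow_matching_loss(model, clip_feats, cond)
loss.backward()
```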