PhD Proposal: Plug and Predict: Generative Recognition and Surrogate Training for VLMs
Location: IRB-5105 or https://umd.zoom.us/j/7173057078
Traditional object recognition models, such as ResNet and CLIP, rely on a predefined label gallery, which limits their ability to handle open-world scenarios. In the first work, we propose a generative framework that predicts object labels as next tokens conditioned on image embeddings. With the proposed one-shot sampling strategy, our method decodes labels in parallel, supporting large-scale predictions such as the top-100 labels per image.

The second work tackles the high cost of training giant vision-language models (VLMs) that use large language models (LLMs) as the decoder. We first analyze the prediction trajectories of LLMs to develop a general method for constructing a smaller surrogate language model for any target LLM. Vision encoders trained on these surrogates can be zero-shot grafted into the full-size LLM for downstream tasks without additional tuning. When the decoder is further fine-tuned with these encoders, our approach reduces overall training cost by up to 45% with Llama-70B as the decoder, while improving performance over baseline methods.
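To make the one-shot sampling idea concrete, the sketch below proposes the top-k first tokens in a single forward pass and then greedily completes all k candidate labels as one batch. The `lm(prefix_embeds, token_ids)` interface is a hypothetical stand-in for the image-conditioned language model, not the proposal's actual API.

```python
# Sketch of one-shot top-k label decoding. Assumption: `lm` maps an image
# embedding prefix plus decoded tokens to next-token logits over the vocab.
import torch

@torch.no_grad()
def one_shot_topk_labels(lm, image_prefix, k=100, max_len=8, eos_id=2):
    """Decode top-k label strings in parallel from an image-conditioned LM.

    image_prefix: (1, p, d) image embeddings used as the conditioning prefix.
    Returns (k, <=max_len) token ids, one candidate label per row.
    """
    # One forward pass proposes k distinct first tokens (the "one shot").
    first_logits = lm(image_prefix, token_ids=None)            # (1, V)
    first_tokens = first_logits.topk(k, dim=-1).indices.T      # (k, 1)

    # Expand the image prefix so all k candidates decode as a single batch.
    prefix = image_prefix.expand(k, -1, -1)
    seqs = first_tokens
    done = torch.zeros(k, dtype=torch.bool, device=image_prefix.device)

    # Greedy continuation: every step advances all k labels at once.
    for _ in range(max_len - 1):
        logits = lm(prefix, token_ids=seqs)                    # (k, V)
        next_tok = logits.argmax(dim=-1, keepdim=True)         # (k, 1)
        next_tok[done] = eos_id                                # freeze finished rows
        seqs = torch.cat([seqs, next_tok], dim=1)
        done |= next_tok.squeeze(1) == eos_id
        if done.all():
            break
    return seqs
```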
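For the second work, the following sketch shows one plausible way a surrogate could be carved out of a target decoder, assuming the surrogate keeps an early slice of the transformer stack together with the embedding and output head. The actual layer selection in the proposal comes from its prediction-trajectory analysis, and the Hugging Face model name here is only an illustrative placeholder.

```python
# Hedged sketch: build a surrogate by truncating a target decoder to its
# first n_keep layers, so it still predicts next tokens at far lower cost.
import torch.nn as nn
from transformers import AutoModelForCausalLM

def build_surrogate(target_name="meta-llama/Llama-2-70b-hf", n_keep=8):
    """Shrink a target LLM to an early prefix of its decoder stack.

    target_name and n_keep are illustrative; the proposal derives the kept
    layers from its analysis of the LLM's prediction trajectory.
    """
    full = AutoModelForCausalLM.from_pretrained(target_name)
    # Keep only an early slice of the decoder layers; the embeddings, final
    # norm, and LM head are retained so the surrogate remains a causal LM.
    full.model.layers = nn.ModuleList(full.model.layers[:n_keep])
    full.config.num_hidden_layers = n_keep
    return full

# Training (not shown): optimize only the vision encoder and projector
# against the frozen surrogate's next-token loss. At deployment, the trained
# encoder is plugged unchanged in front of the full-size decoder ("grafting").
```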