PhD Proposal: Plug and Predict: Generative Recognition and Surrogate Training for VLMs
Location: IRB-5105 or https://umd.zoom.us/j/7173057078
Traditional object recognition models, such as ResNet and CLIP, rely on a predefined label gallery, which limits their ability to handle open-world scenarios. In the first work, we propose a generative framework that predicts object labels as next tokens conditioned on image embeddings. With the proposed one-shot sampling strategy, our method decodes labels in parallel, supporting large-scale predictions such as the top-100 labels per image.

The second work tackles the high cost of training giant vision-language models (VLMs) that use large language models (LLMs) as the decoder. We first analyze the prediction trajectories of LLMs to develop a general method for constructing a smaller surrogate language model for any target LLM. Vision encoders trained on these surrogates can be zero-shot grafted into the full-size LLM for downstream tasks without additional tuning. When the decoder is further fine-tuned with these encoders, our approach reduces overall training cost by up to 45% with Llama-70B as the decoder, while improving performance over baseline methods.
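To make the one-shot sampling idea concrete, the sketch below proposes the top-k first tokens in a single forward pass and then greedily completes all k candidate labels as one batch. The `lm(prefix_embeds, token_ids)` interface is a hypothetical stand-in for the image-conditioned language model, not the proposal's actual API.

```python
# Sketch of one-shot top-k label decoding. Assumption: `lm` maps an image
# embedding prefix plus decoded tokens to next-token logits over the vocab.
import torch

@torch.no_grad()
def one_shot_topk_labels(lm, image_prefix, k=100, max_len=8, eos_id=2):
    """Decode top-k label strings in parallel from an image-conditioned LM.

    image_prefix: (1, p, d) image embeddings used as the conditioning prefix.
    Returns (k, <=max_len) token ids, one candidate label per row.
    """
    # One forward pass proposes k distinct first tokens (the "one shot").
    first_logits = lm(image_prefix, token_ids=None)            # (1, V)
    first_tokens = first_logits.topk(k, dim=-1).indices.T      # (k, 1)

    # Expand the image prefix so all k candidates decode as a single batch.
    prefix = image_prefix.expand(k, -1, -1)
    seqs = first_tokens
    done = torch.zeros(k, dtype=torch.bool, device=image_prefix.device)

    # Greedy continuation: every step advances all k labels at once.
    for _ in range(max_len - 1):
        logits = lm(prefix, token_ids=seqs)                    # (k, V)
        next_tok = logits.argmax(dim=-1, keepdim=True)         # (k, 1)
        next_tok[done] = eos_id                                # freeze finished rows
        seqs = torch.cat([seqs, next_tok], dim=1)
        done |= next_tok.squeeze(1) == eos_id
        if done.all():
            break
    return seqs
```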
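For the second work, the following sketch shows one plausible way a surrogate could be carved out of a target decoder, assuming the surrogate keeps an early slice of the transformer stack together with the embedding and output head. The actual layer selection in the proposal comes from its prediction-trajectory analysis, and the Hugging Face model name here is only an illustrative placeholder.

```python
# Hedged sketch: build a surrogate by truncating a target decoder to its
# first n_keep layers, so it still predicts next tokens at far lower cost.
import torch.nn as nn
from transformers import AutoModelForCausalLM

def build_surrogate(target_name="meta-llama/Llama-2-70b-hf", n_keep=8):
    """Shrink a target LLM to an early prefix of its decoder stack.

    target_name and n_keep are illustrative; the proposal derives the kept
    layers from its analysis of the LLM's prediction trajectory.
    """
    full = AutoModelForCausalLM.from_pretrained(target_name)
    # Keep only an early slice of the decoder layers; the embeddings, final
    # norm, and LM head are retained so the surrogate remains a causal LM.
    full.model.layers = nn.ModuleList(full.model.layers[:n_keep])
    full.config.num_hidden_layers = n_keep
    return full

# Training (not shown): optimize only the vision encoder and projector
# against the frozen surrogate's next-token loss. At deployment, the trained
# encoder is plugged unchanged in front of the full-size decoder ("grafting").
```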