Do Vision-Language Pretrained Models Learn Primitive Concepts for Recognition and Reasoning?
Vision-language models pretrained on web-scale data have revolutionized deep learning in the last few years. They have demonstrated strong transfer learning performance on a wide range of tasks, even in the zero-shot setting, where text prompts serve as a natural interface for humans to specify a task, as opposed to collecting labeled data. These models are trained on composite data, such as visual scenes containing multiple objects, or sentences that describe spatiotemporal events. However, it is not clear whether they achieve this performance by learning to reason over the lower-level spatiotemporal primitive concepts that humans naturally use to characterize these scenes and events, for example colors, shapes, or verbs that describe short actions. If they do, this has important implications for the capacity of models to support compositional generalization, and for the ability of humans to interpret the reasoning procedures that models undertake.
In this talk, I will present our recent attempts to answer this question. We study several representative vision-language (VL) models trained on images (e.g. CLIP) and videos (e.g. VideoBERT), and design corresponding “probing” frameworks to understand whether VL pretraining: (1) improves lexical grounding, (2) encodes verb meaning, and (3) learns visually grounded primitive concepts. I will also discuss our ongoing work on using the concept binding that emerges inside a pretrained neural network for visual reasoning tasks.