Learning With and Beyond Visual Knowledge
Virtual talk: https://umd.zoom.us/j/7316339020
The computer vision community has embraced the success of training specialist models on datasets with a fixed set of predetermined object categories, such as ImageNet or COCO. However, learning only from visual knowledge can limit the flexibility and generality of visual models: specifying any new visual concept requires additional labeled data, and it is hard for users to interact with the system.

In this talk, I will present our recent work LSeg, a novel multimodal modeling method for language-driven semantic image segmentation. LSeg uses a text encoder to compute embeddings of descriptive input labels (e.g., "grass" or "building") together with a transformer-based image encoder that computes dense per-pixel embeddings of the input image. The image encoder is trained with a contrastive objective to align pixel embeddings with the text embedding of the corresponding semantic class. The text embeddings provide a flexible label representation in which semantically similar labels map to similar regions in the embedding space (e.g., "cat" and "furry"). This allows LSeg to generalize to previously unseen categories at test time, without retraining and without requiring a single additional training sample. We show that joint embeddings enable the creation of semantic segmentation systems that can segment an image with any label set.

Beyond that, I will briefly introduce several works on data-efficient algorithms, such as data augmentation, that boost the performance of neural models. At the end of the talk, I will discuss ongoing research and potential future directions for multimodal modeling, such as commonsense reasoning and open-world recognition.
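To make the label-set mechanism concrete, the sketch below (in PyTorch) illustrates how dense per-pixel embeddings from an image encoder can be matched against text embeddings of an arbitrary label set via cosine similarity. It is a minimal illustration under assumed tensor shapes and a hypothetical helper name, not the authors' implementation.

```python
# Minimal sketch of language-driven segmentation by pixel-text matching.
# Encoder outputs are stood in for by random tensors; names, shapes, and
# the temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F

def segment_with_labels(pixel_embeds, text_embeds, temperature=0.07):
    """
    pixel_embeds: (H, W, D) dense per-pixel embeddings from an image encoder
    text_embeds:  (K, D) embeddings of K descriptive labels from a text encoder
    Returns a (H, W) map of per-pixel label indices.
    """
    # L2-normalize so a dot product becomes a cosine similarity
    pixel_embeds = F.normalize(pixel_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (H, W, K) similarity between every pixel and every candidate label
    logits = torch.einsum("hwd,kd->hwk", pixel_embeds, text_embeds) / temperature

    # At training time these logits would feed a per-pixel cross-entropy
    # (contrastive) loss against the ground-truth class; at test time we
    # simply pick the most similar label, which can come from any label set.
    return logits.argmax(dim=-1)

# Example usage with random stand-ins for real encoder outputs
H, W, D, K = 32, 32, 512, 3   # e.g., labels ["grass", "building", "cat"]
pixel_embeds = torch.randn(H, W, D)
text_embeds = torch.randn(K, D)
label_map = segment_with_labels(pixel_embeds, text_embeds)
print(label_map.shape)  # torch.Size([32, 32])
```

Because the label set only enters through the text embeddings, swapping in new descriptive labels at test time requires no retraining, which is the property the abstract highlights.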