Is this the beginning or is this the end for end-to-end vision models?

Talk
Ani Kembhavi
Time: 07.17.2023 14:30 to 15:30
Location: IRB 4105, Zoom Link: https://umd.zoom.us/j/99048512875

Large language models like GPT-4 support a whole gamut of tasks in natural language, some out of the box and others from a few examples via in-context learning. In contrast, unification has been more challenging in computer vision, partly due to the heterogeneity of tasks in the visual domain. How do we create unified systems for vision that can be as capable and creative as their language counterparts? In this talk, I will present two different paths that we are actively exploring. The first path is to build large end-to-end models for computer vision, and along this direction I will introduce Unified-IO, the first single neural model to perform a large and diverse set of AI tasks spanning classical computer vision, image synthesis, vision-and-language, and natural language processing. The second path is Visual Programming, where, given a natural language description of a vision task, a program generator creates a program that is then executed on the task inputs by a program interpreter. This paradigm uses language models to parse instructions and generate code, leverages the specialized vision models that the community is building and continually improving, and scales easily to large sets of diverse tasks.