Towards Human-Level Recognition via Contextual, Dynamic, and Predictive Representations
State-of-the-art computer vision models usually specialize in a single domain or task, whereas human-level recognition can be contextual across diverse scales and tasks. This specialization isolates different vision tasks and hinders the deployment of robust and effective vision systems. In this talk, I will discuss contextual image representations for different scales and tasks through the lens of pixel-level prediction. The connections across scales and tasks, established through the study of dilated convolutions and deep layer aggregation, help interpret convolutional network behavior and lead to model frameworks applicable to a wide range of tasks. I will then argue that image representations should be not only contextual but also dynamic and predictive. I will illustrate this with input-dependent dynamic networks, which yield new insights into the relationship between zero-shot/few-shot learning and network pruning, and with semantic predictive control, which uses prediction to improve driving policy learning. To conclude, I will discuss ongoing system and algorithm investigations that couple representation learning with real-world interaction to build intelligent agents that can continuously learn from and interact with the world.