Learning Spatio-Temporal Representations for Video Understanding
Video understanding is one of the fundamental problems in computer vision with various applications, including autonomous vehicles, robot learning, and visual perception. Compared with traditional image understanding, video understanding: (i) has higher model complexity and requires to learn from a much larger amount of data; (ii) requires more expensive annotations; (iii) and sometimes demands multimodal modeling, e.g., audiovisual modeling instead of visual only. In this talk, I will present some of our approaches addressing these challenges, such as efficient and scalable spatiotemporal learning, cross-modal self-supervised learning of video and audio representations, and multimodal learning. Finally, I will outline several potential future research directions in this area.