Learning Spatio-Temporal Representations for Video Understanding

Talk
Du Tran
Time: 02.22.2021 13:00 to 14:00

Video understanding is one of the fundamental problems in computer vision, with applications including autonomous vehicles, robot learning, and visual perception. Compared with traditional image understanding, video understanding (i) has higher model complexity and requires learning from a much larger amount of data; (ii) requires more expensive annotations; and (iii) sometimes demands multimodal modeling, e.g., audiovisual modeling instead of visual-only modeling. In this talk, I will present some of our approaches to addressing these challenges, such as efficient and scalable spatiotemporal learning, cross-modal self-supervised learning of video and audio representations, and multimodal learning. Finally, I will outline several potential future research directions in this area.