Compositional and Robust Action Understanding

Talk
Huijuan Xu
Time: 03.30.2021 13:00 to 14:00

With massive video data becoming available from a wide range of applications (e.g., smart home devices, medical instruments, intelligent transportation networks), designing algorithms that understand actions can enable machines to interact meaningfully with human partners. Practically, continuous video streams require temporal localization of actions before a trimmed action recognition method can be applied, yet such annotation is expensive and suffers from consistency issues. Moreover, early video understanding technologies mostly use holistic frame modeling and lack reasoning capabilities. In this talk, I will discuss how to detect actions in continuous video streams efficiently. Specifically, I will present several temporal action detection models with different levels of supervision. Next, I will introduce how to understand actions compositionally, using localized foreground subjects or objects to reduce the effect of confounding variables and to draw connections to common knowledge of the involved objects. Additionally, natural language provides an efficient and intuitive way to convey the details of an action to a human. I will conclude the talk with some perspectives on how compositional and efficient modeling opens the door to real-world action understanding with high complexity and fine granularity.
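To make the "localize, then recognize" pipeline in the abstract concrete, here is a minimal Python sketch: a toy actionness-thresholding step proposes candidate segments in an untrimmed stream, and a stand-in classifier then scores each proposed segment. All names and the scoring logic (propose_segments, classify_segment, per-frame actionness scores) are hypothetical simplifications for illustration, not the models presented in the talk.

```python
# Hypothetical two-stage sketch: (1) propose temporal segments from per-frame
# actionness scores, (2) score each segment with a stand-in trimmed classifier.
import numpy as np

def propose_segments(frame_scores, threshold=0.5, min_len=8):
    """Group consecutive frames whose actionness exceeds a threshold
    into candidate (start, end) segments, dropping very short runs."""
    active = frame_scores > threshold
    segments, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            if t - start >= min_len:
                segments.append((start, t))
            start = None
    if start is not None and len(active) - start >= min_len:
        segments.append((start, len(active)))
    return segments

def classify_segment(frame_scores, segment):
    """Stand-in for a trimmed action recognizer: here, simply the mean
    actionness over the segment (a real model would classify the clip)."""
    s, e = segment
    return float(frame_scores[s:e].mean())

# Toy untrimmed stream: 200 frames of background noise with two action bursts.
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 0.4, 200)
scores[40:80] += 0.5    # first action instance
scores[130:160] += 0.5  # second action instance

for seg in propose_segments(scores):
    print(seg, round(classify_segment(scores, seg), 3))
```

In this fully supervised caricature the threshold plays the role of learned segment boundaries; the weakly supervised variants mentioned in the talk would have to infer such boundaries without frame-level annotation.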