PhD Proposal: Long-term Temporal Modeling for Video Action Understanding

Talk
Xitong Yang
Time: 04.13.2020, 10:00 to 12:00

The tremendous growth of video data, both on the internet and in the real world, has encouraged the development of intelligent systems that can automatically analyze video content and understand human actions. Inspired by the success of deep convolutional neural networks (CNNs) on image understanding, many efforts have been made to extend deep networks to video understanding by modeling both spatial and temporal information. Compared to still image analysis, the temporal component of videos provides an additional, important cue for action recognition, as many actions can only be distinguished when motion information is taken into account. However, the temporal dimension also poses new challenges for effective temporal modeling, especially for semantic dynamics that span long time ranges.

In this proposal, we focus on developing effective temporal modeling methods for improved video action understanding. Specifically, we propose two approaches to model long-term temporal information for action recognition: (1) learning hierarchical motion representations (e.g., from lower-level to higher-level motion) through a multi-scale self-supervised learning framework; (2) integrating temporal relational reasoning into models through a decoupled version of non-local neural networks. We also propose a progressive learning framework for spatio-temporal action detection in videos, which naturally handles the large spatial displacement of human boxes caused by long sequences or rapid actor movement. Finally, we will discuss future work on modeling long-term temporal structures and reasoning about spatio-temporal relationships.
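As background for approach (2), the sketch below shows a standard non-local block (Wang et al., 2018) applied to video features: every spatio-temporal position attends to every other position in a single step, so the receptive field spans the full clip. The proposal's decoupled variant presumably modifies this formulation; the class and parameter names here are illustrative assumptions, not the proposal's implementation.

```python
# Minimal sketch of an embedded-Gaussian non-local block for video
# features (Wang et al., 2018). Names are illustrative, not from the proposal.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock3D(nn.Module):
    """Non-local block over 5D video features of shape (N, C, T, H, W)."""
    def __init__(self, channels: int):
        super().__init__()
        self.inter = channels // 2  # reduced embedding dimension
        self.theta = nn.Conv3d(channels, self.inter, kernel_size=1)  # query
        self.phi = nn.Conv3d(channels, self.inter, kernel_size=1)    # key
        self.g = nn.Conv3d(channels, self.inter, kernel_size=1)      # value
        self.out = nn.Conv3d(self.inter, channels, kernel_size=1)    # restore C

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = x.size(0)
        # Flatten all spatio-temporal positions so each position can attend
        # to every other one, regardless of temporal distance.
        q = self.theta(x).view(n, self.inter, -1)  # (N, C', THW)
        k = self.phi(x).view(n, self.inter, -1)    # (N, C', THW)
        v = self.g(x).view(n, self.inter, -1)      # (N, C', THW)
        # Pairwise affinities between all positions, normalized with softmax.
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)  # (N, THW, THW)
        # Aggregate values by attention, then restore the (T, H, W) layout.
        y = torch.bmm(v, attn.transpose(1, 2)).view(n, self.inter, *x.shape[2:])
        return x + self.out(y)  # residual: the block can be inserted anywhere

if __name__ == "__main__":
    block = NonLocalBlock3D(channels=64)
    clip = torch.randn(2, 64, 8, 14, 14)  # (N, C, T, H, W)
    print(block(clip).shape)  # torch.Size([2, 64, 8, 14, 14])
```

Because the attention is computed over all positions at once rather than through stacked local operations, such blocks are a natural fit for the long-range temporal reasoning the proposal targets.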

Examining Committee:

Chair: Dr. Larry S. Davis
Dept rep: Dr. Furong Huang
Members: Dr. Abhinav Shrivastava