PhD Proposal: Scalable Action Recognition Using Knowledge Bases and Heterogeneous Data Fusion
Action recognition requires reasoning over both spatial and temporal relationships in video, often across long time spans. In this thesis proposal, we develop a method that captures these relationships through edge connections in a graph convolutional network (GCN). Temporal edges link each frame not only to adjacent frames but also to the next several frames, which helps compensate for missed detections in individual frames. Our approach uses the GCN to fuse heterogeneous information from the video, such as detected objects, human pose, and scene features, in a flexible graph configuration that allows variable-length node features. Applying an hourglass structure to this irregularly connected graph further improves performance. Our experiments on the CAD120 and Charades datasets show improvements over state-of-the-art results.

For other tasks, such as zero-shot and few-shot action recognition, knowledge graphs play a vital role. We learn a classifier for unseen test classes by comparing them with similar, well-represented training classes, using similarity features between the two. We use sentence2vector embeddings to learn similarities between seen and unseen classes, providing an implicit relationship map. To address the absence of established knowledge graphs for action classes, we also propose a knowledge graph, built from these sentence2vector embeddings and their cosine distances, that provides an explicit relationship map. To leverage existing state-of-the-art models and datasets fairly, classes shared between those training datasets and our test sets must be excluded; we therefore propose a new benchmark based on the UCF101, HMDB51, and Kinetics datasets. Applying a GCN to this knowledge graph yields significant improvements over state-of-the-art zero-shot action recognition results.

In summary, we use graph convolutional networks on various types of input graphs to improve action recognition in videos.
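The explicit relationship map described above can be sketched as follows: embed each action class with a sentence encoder, then connect classes whose embeddings are sufficiently similar. This is a minimal illustration only; the function name, the threshold value, and the toy embeddings are hypothetical, not the proposal's actual configuration.

```python
import numpy as np

def build_knowledge_graph(embeddings, threshold=0.8):
    """Build a binary adjacency matrix over action classes by thresholding
    pairwise cosine similarity of their sentence embeddings.

    `embeddings` is an (n_classes, dim) array. The threshold is an
    illustrative choice, not a value from the proposal.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sim = unit @ unit.T                       # pairwise cosine similarities
    adj = (sim >= threshold).astype(float)    # connect similar classes
    np.fill_diagonal(adj, 1.0)                # self-loops, standard for GCN input
    return adj, sim

# Toy example: three "class" embeddings; the first two point in nearly
# the same direction, the third is orthogonal to them.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
adj, sim = build_knowledge_graph(emb, threshold=0.8)
```

The resulting adjacency matrix connects the first two classes and leaves the third isolated, giving the kind of explicit class-relationship graph a GCN can then propagate classifier information over.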
The input graphs may be spatio-temporal graphs constructed from contextual features in videos, or knowledge graphs derived from language-based models; in either case, fusing information from external knowledge bases or from context can improve video analysis across tasks such as action recognition and zero-shot analysis of actions. In the future we plan to apply this approach to unsupervised and few-shot learning on video data, and to test the scalability of GCNs on larger input graphs.
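Whichever input graph is used, the core computation is GCN message passing over its adjacency matrix. A minimal sketch of one propagation layer, following the standard normalized-adjacency formulation (H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)); the toy graph and weight matrix are hypothetical:

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One graph-convolution layer: symmetrically normalize the adjacency
    matrix with self-loops, aggregate neighbor features, and apply a
    linear transform followed by ReLU."""
    a_hat = adj + np.eye(adj.shape[0])             # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))  # D^-1/2 from node degrees
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ features @ weight, 0.0)  # ReLU activation

# Toy input graph: two connected nodes and one isolated node.
A = np.array([[0., 1., 0.],
              [1., 0., 0.],
              [0., 0., 0.]])
H = np.eye(3)          # one-hot node features
W = np.ones((3, 2))    # hypothetical weight matrix
out = gcn_layer(A, H, W)
```

Stacking such layers lets information flow along the graph's edges, whether those edges encode spatio-temporal context or class-similarity relationships.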
Chair: Dr. Larry Davis
Co-Chair: Dr. Abhinav Shrivastava
Dept. Representative: Dr. Marine Carpuat
Members: Dr. Soheil Feizi