PhD Defense: Deep Video Analytics of Humans: from Action Recognition to Forgery Detection

Steven Schwarcz
08.05.2021 13:00 to 15:00

IRB 4105

In this work, we explore a variety of techniques and applications for addressing visual problems involving videos of humans in the contexts of activity detection, pose detection, and forgery detection.

The first works discussed here address the issue of human activity detection in untrimmed video where the actions performed are spatially and temporally sparse. The video may therefore contain long sequences of frames where no actions occur, and the actions that do occur will often comprise only a very small percentage of the pixels on the screen. We address this with a two-stage architecture that first suggests many coarse proposals with high recall, and then classifies and refines these proposals to create temporally accurate activity detections. We present two methods that follow this high-level paradigm: TRI-3D and CHUNK-3D.

This work on activity detection is then extended to include results on few-shot learning. In this domain, a system must learn to perform detection given only an extremely limited set of training examples. We propose a method we call a Self-Denoising Neural Network (SDNN), which takes inspiration from denoising autoencoders to solve this problem in the contexts of both activity detection and image classification. We also propose a method that performs optical character recognition on real-world images when no labels are available in the language we wish to transcribe. Specifically, we build an accurate transcription system for Hebrew street name signs when no labeled training data is available.

We continue our analysis by proposing a method for the automatic detection of facial forgeries in videos and images. This work approaches the problem of facial forgery detection by breaking the face into multiple regions and training separate classifiers for each part. The end result is a collection of high-quality facial forgery detectors that are both accurate and explainable.
We exploit this explainability by providing extensive empirical analysis of our method's results.

Finally, we present work that focuses on multi-camera, multi-person 3D human pose estimation from video. To address this problem, we aggregate the outputs of a 2D human pose detector across cameras and actors using a novel factor graph formulation, which we optimize using the loopy belief propagation algorithm.

Examining Committee:

Chair: Dr. Rama Chellappa
Dean's rep: Dr. David Jacobs
Members: Dr. Christopher Metzler
Dr. Shuvra Bhattacharyya
Dr. Abhinav Shrivastava