PhD Proposal: Applications of Deep Learning to Sequential Visual Tasks

Talk
Steven Schwarcz
Time: 01.28.2019 14:00 to 16:00
Location: AVW 4424

In this work, we explore a variety of techniques and applications for tackling visual problems with sequential components, whether the sequence represents a temporal component, as in video analysis, or is recognized within a single image. More specifically, we address three sequential problems in computer vision: 3D human pose estimation, human action detection in untrimmed videos, and sequential optical character recognition.

The first of these works focuses on multi-camera, multi-person 3D human pose estimation from video. To address this problem, we aggregate the outputs of a 2D human pose detector across cameras and actors using a novel factor graph formulation, which we optimize using the loopy belief propagation algorithm. In particular, our factor graph introduces a temporal smoothing term to create smooth transitions between poses across frames.

The second work addresses human activity detection in untrimmed video where the actions performed are spatially and temporally sparse. The video may therefore contain long sequences of frames in which no actions occur, and the actions that do occur often occupy only a very small percentage of the pixels on the screen. We address this with a two-stage architecture that first suggests many coarse proposals with high recall, and then classifies and refines those proposals to create temporally accurate activity proposals.

The third work performs optical character recognition on real-world images when no labels are available in the language we wish to transcribe. Specifically, we build an accurate transcription system for Hebrew street name signs when no labeled training data is available. To do this, we divide the problem into two components and address each separately: content, which refers to the characters and language structure, and style, which refers to the domain of the images (for example, real or synthetic). We train with simple synthetic Hebrew street signs to address the content component, and with labeled French street signs to address the style component.

Examining Committee:

Chair: Dr. Rama Chellappa
Dept. rep: Dr. David Jacobs
Members: Dr. Abhinav Shrivastava