PhD Proposal: Efficient Image and Video Representations for Retrieval

Talk
Sravanthi Bondugula
Time: 12.18.2014, 11:30 to 13:00
Location: AVW 4424

Image/video retrieval is the problem of retrieving images or videos similar to a query. Similar items are obtained by finding nearest neighbors in the input representation space. Numerous input representations, in both real-valued and binary spaces, have been proposed to speed up retrieval. In this proposal, we present techniques that produce improved input representations for several less-explored retrieval problems.
High-dimensional vectors like VLAD have been shown to improve the performance of many computer vision applications. Existing binary encoding schemes such as spherical hashing cannot be readily extended to compress such vectors because of their high training times and large training data requirements. In the first part of the proposal, we aim to learn high-dimensional binary codes using hyperspherical hashing functions to compress such vectors. We overcome the computational challenges of hashing schemes designed for low-dimensional vectors by presenting a hierarchical model that partitions the data and learns sub-spherical hashing functions for each component. We then combine the sub-spherical hashing functions into full hyperspheres, preserving the hashing properties with a Random Select and Adjust (RSA) technique applied in a divide-and-conquer fashion. We show that the resulting high-dimensional binary codes outperform binary codes obtained with traditional hyperplane methods at higher compression ratios.
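The core idea of spherical hashing, learning one hypersphere per bit and setting the bit by whether a point falls inside, can be sketched on dimension blocks as below. This is a simplified illustration, not the proposal's actual RSA algorithm: the pivot selection (random centers), the radius rule (median distance, for balanced bits), and all function names are assumptions for the sketch, and the recombination step into full hyperspheres is omitted.

```python
import numpy as np

def fit_subspheres(X, n_blocks, bits_per_block, seed=0):
    # Split the dimensions into blocks; in each block, pick random
    # training points as sphere centers and set each radius so that
    # roughly half the training data falls inside (balanced bits).
    rng = np.random.default_rng(seed)
    blocks = np.array_split(np.arange(X.shape[1]), n_blocks)
    model = []
    for dims in blocks:
        Xb = X[:, dims]
        centers = Xb[rng.choice(len(Xb), bits_per_block, replace=False)]
        d = np.linalg.norm(Xb[:, None, :] - centers[None, :, :], axis=2)
        radii = np.median(d, axis=0)  # ~50% of points inside each sphere
        model.append((dims, centers, radii))
    return model

def encode(X, model):
    # One bit per sphere: 1 if the point lies inside that sphere.
    bits = []
    for dims, centers, radii in model:
        d = np.linalg.norm(X[:, dims][:, None, :] - centers[None, :, :],
                           axis=2)
        bits.append((d <= radii).astype(np.uint8))
    return np.hstack(bits)
```

Working per block keeps each distance computation low-dimensional, which is what makes the scheme tractable for very high-dimensional inputs.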
Supervised retrieval is the well-known problem of retrieving images of the same class as the query. In the second part, we address a variant of supervised retrieval that also takes into account retrieved images of related classes. Here, we learn relationship-aware binary codes by minimizing the difference between the inner product of the binary codes and the similarity between the classes. We compute the similarity between classes using output embeddings, which are vector representations of classes. Our method departs from previous supervised binary encoding schemes in that it is the first to use output embeddings for learning hashing functions. We also introduce new performance metrics that take related-class retrieval results into account and show significant gains over the state of the art.
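The objective described above can be sketched as a squared gap between the scaled code inner products and the class similarities derived from output embeddings. This is a minimal sketch of the loss only, assuming cosine similarity between embeddings and codes in {-1, +1}; the actual optimization of the hashing functions in the proposal is not shown, and all names here are hypothetical.

```python
import numpy as np

def class_similarity(embeddings):
    # Cosine similarity between output embeddings (one row per class).
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return E @ E.T

def relation_loss(B, labels, S):
    # B: n x r codes in {-1, +1}; labels: class index of each row.
    # Penalize the squared gap between the scaled code inner product
    # (in [-1, 1]) and the similarity of the corresponding classes.
    r = B.shape[1]
    inner = (B @ B.T) / r
    target = S[np.ix_(labels, labels)]
    return np.mean((inner - target) ** 2)
```

With orthogonal embeddings the target reduces to a same-class indicator, recovering the usual supervised hashing objective; correlated embeddings are what inject the related-class structure.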
Event classification of videos is a challenging problem in computer vision, particularly in the zero-shot setting, where no training videos are available for the event. In the third part of the proposal, we solve the zero-shot classification problem by posing it as a retrieval problem in a concept space. We learn a generic set of concept detectors and represent each video by its concept detection scores. Query events are likewise represented in the concept space using their textual descriptions. We then compute the similarity between the query event and each video in the concept space, and the videos most similar to the query event are classified as belonging to that event. We further show that combining the visual concept scores with audio and text concept scores significantly boosts performance.
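The retrieval step above amounts to ranking videos by similarity between two vectors in the same concept space. A minimal sketch, assuming cosine similarity and taking the event's concept vector as given (the mapping from textual descriptions to concept weights is an upstream step not shown here):

```python
import numpy as np

def rank_videos(video_scores, event_concept_vec):
    # video_scores: n_videos x n_concepts detector-score matrix.
    # event_concept_vec: concept weights for the query event, assumed
    # to come from its textual description.
    V = video_scores / np.linalg.norm(video_scores, axis=1, keepdims=True)
    q = event_concept_vec / np.linalg.norm(event_concept_vec)
    sims = V @ q  # cosine similarity of each video to the event
    return np.argsort(-sims), sims
```

Fusing modalities then reduces to combining the similarity scores produced by visual, audio, and text concept representations before ranking.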
In the future, we would like to improve image/video representations for other applications, such as tracking.
Examining Committee:
Committee Chair: Dr. Larry S. Davis
Dept's Representative: Dr. Dana Nau
Committee Member(s): Dr. David Jacobs