PhD Defense: Towards Effective and Efficient Video Understanding

Talk
Xijun Wang
Time: 06.09.2025 10:00 to 12:00
Location: IRB-4105 or umd.zoom.us/my/dmanocha

“If a picture is worth a thousand words, what is a video worth?” Video, due to its inherent richness and efficiency, plays a pivotal role in conveying complex information. However, video understanding faces numerous challenges, including selecting informative frames, addressing domain shifts, semantic grounding, reasoning and attention deficits, and significant computational burdens. Recent advances in computer vision underscore the need to address these challenges with effective and efficient approaches, which are crucial for applications ranging from autonomous systems to human-computer interaction that demand high accuracy and low latency. In this dissertation, we address these challenges along five critical directions: dataset development, preprocessing, visual reasoning, multimodal alignment, and computational acceleration.

For dataset development, we propose METEOR, a dataset for autonomous driving in dense, heterogeneous, and unstructured traffic scenarios with rare and challenging conditions. We also develop DAVE, a comprehensive benchmark dataset designed to advance video understanding research, with a focus on the safety of vulnerable road users in complex and unpredictable environments. Our analysis reveals substantial shortcomings of current object detection and behavior prediction models when evaluated on METEOR and DAVE.

For preprocessing, we propose AZTR, which combines an automatic zooming algorithm for dynamic target scaling with a temporal reasoning mechanism that accurately captures action sequences. We further introduce MITFAS, a mutual-information-based alignment and sampling method designed for UAV video action recognition, where challenges include varying human resolutions, large positional changes between frames, and occluded action features.

For visual reasoning, we introduce SCP, which guides the model to learn input-invariant (prompt experts) and input-specific (data-dependent) prompt knowledge, effectively capturing discriminative patterns and significantly improving accuracy on challenging datasets. We also develop ICAR, a compatibility learning framework with a novel category-aware Flexible Bidirectional Transformer (FBT) that generates cross-domain features based on visual similarity and complementarity for reasoning tasks.

For multimodal alignment, we propose ViLA, which addresses both efficient frame sampling and effective cross-modal alignment in a unified manner.

Finally, for computational acceleration, we propose Bi-VLM, an ultra-low-precision post-training quantization method that bridges the gap between computational demands and practical limitations. Bi-VLM employs a saliency-aware hybrid quantization algorithm together with a non-uniform model weight partition strategy, substantially reducing computational cost with minimal loss in accuracy.
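
To give a rough sense of the kind of saliency-aware, mixed-precision post-training quantization that Bi-VLM builds on, the sketch below partitions a weight matrix into salient and non-salient groups and assigns them different bit-widths. The magnitude-based saliency score, the bit-widths, and all function names here are illustrative assumptions, not the actual Bi-VLM algorithm or its non-uniform partition rule.

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric uniform quantization of a 1-D weight group to the given bit-width."""
    levels = 2 ** (bits - 1) - 1                      # e.g. 7 positive levels for 4-bit
    max_abs = np.max(np.abs(w)) if w.size > 0 else 0.0
    scale = max_abs / levels if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -levels, levels)
    return q * scale                                  # dequantized values for simulation

def quantize_binary(w):
    """1-bit (sign) quantization with a single per-group scale."""
    scale = np.mean(np.abs(w)) if w.size > 0 else 1.0
    return np.sign(w) * scale

def saliency_aware_ptq(W, salient_ratio=0.1, salient_bits=4):
    """
    Hypothetical saliency-aware hybrid PTQ sketch:
    - rank weights by a simple magnitude-based saliency proxy (assumption),
    - keep the most salient fraction at `salient_bits`,
    - binarize the remaining, non-salient weights.
    Returns a dequantized copy of W for simulating accuracy impact.
    """
    flat = W.flatten()
    saliency = np.abs(flat)                           # proxy saliency score (assumption)
    k = max(1, int(salient_ratio * flat.size))
    salient_idx = np.argsort(-saliency)[:k]           # indices of the top-k salient weights

    mask = np.zeros(flat.size, dtype=bool)
    mask[salient_idx] = True
    out = np.empty_like(flat)
    out[mask] = quantize_uniform(flat[mask], salient_bits)
    out[~mask] = quantize_binary(flat[~mask])
    return out.reshape(W.shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(256, 256)).astype(np.float32)
    W_q = saliency_aware_ptq(W, salient_ratio=0.1, salient_bits=4)
    print("mean abs quantization error:", np.mean(np.abs(W - W_q)))
```

This sketch uses a simple magnitude threshold to split salient from non-salient weights; the dissertation's method instead relies on its own saliency measure and a non-uniform partition of the weight distribution.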