PhD Defense: Small and Large Perception Models for Robotic Navigation

Talk
Tianrui Guan
Time: 06.16.2025, 12:00 to 14:00
Location:

Computer vision is fundamental to advancing robotic navigation and autonomous driving, enabling machines to interpret the visual data needed to interact with complex and diverse real-world environments. We address critical challenges in these domains by developing advanced vision methods that leverage the complementary strengths of small-scale specialized models and large vision-language models (LVLMs) with stronger zero-shot generalization. Small-scale models typically focus on specialized tasks such as object detection, segmentation, and terrain classification. Large-scale vision models, particularly LVLMs, leverage extensive training data to capture richer contextual information and generalize more broadly. Yet these benefits often come at the cost of increased computational demands and latency, and LVLMs can still fail when confronted with complex or nuanced scenarios specific to certain tasks.
To enhance perception in challenging scenarios, we introduce GA-Nav, an efficient transformer-based terrain segmentation approach designed explicitly for off-road robotic navigation. Our method simplifies the semantic segmentation task by emphasizing distinct terrain types, improving navigation in unstructured outdoor environments. We also propose M3DETR, the first unified transformer-based architecture for 3D object detection in autonomous driving that simultaneously models multi-representation, multi-scale, and mutual-relation features with transformers.

To understand hallucinations and further improve model generalization and adaptability for navigation applications, we propose HallusionBench and AutoHallusion, two of the first benchmarks designed to diagnose different types of hallucination and to systematically scale hallucination examples. In HallusionBench, we systematically analyze and benchmark the accuracy, sycophancy, robustness, and various failure modes of existing LVLMs. AutoHallusion is the first fully automated pipeline capable of generating challenging hallucination cases based on the patterns we discovered.

Leveraging insights from these benchmarks on LVLM failure modes and hallucination patterns, we introduce LOC-ZSON, which advances vision-language models for indoor zero-shot object navigation (ZSON). LOC-ZSON provides a practical, retrieval-based alternative to traditional reinforcement learning approaches for natural-language-guided navigation.

Finally, we present CrossLoc3D and AGL-Net, robust global localization techniques that support reliable long-horizon navigation by operating across different map modalities, providing enhanced robustness in GPS-denied regions. Through these advancements, we bridge theoretical innovations in computer vision with practical, scalable solutions for sophisticated robotic and autonomous navigation systems.