MS Defense: Post-Training of Vision-Language Agents for Decentralized Autonomous Vehicle Coordination Using Generalizable Multi-Agent Rewards

Talk
Time: 05.08.2026 10:30 to 12:00

Decentralized coordination at unsignalized intersections remains a persistent failure mode for modern autonomous driving policies when vehicle-to-everything (V2X) communication is unavailable. Policies trained primarily with ego-centric objectives (e.g., collision avoidance, comfort, and action consistency) can be overly conservative in symmetric interactions, leading to deadlocks, or can make conflicting commitments, leading to unsafe near-collisions. This thesis addresses this gap by introducing a social post-training method for Alpamayo-R1 (AR1) that explicitly rewards behavior that is predictable to neighboring agents.
We extend AR1's Group Relative Policy Optimization (GRPO) post-training by augmenting the reward with Expectation Alignment (ELIGN), an intrinsic social term that penalizes mismatch between a learned neighbor-expectation model and the realized shared next observation. To make ELIGN applicable to AR1's continuous trajectory outputs, we define the shared observation space over low-dimensional kinematic waypoints (x, y, ψ, v) rather than high-dimensional perception features, and we learn a compact trajectory prediction model jointly during fine-tuning.
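As a rough illustration of the reward shaping described above, the following Python sketch combines a task reward with an expectation-alignment penalty over (x, y, ψ, v) waypoints and then normalizes rewards within a sampled group, as GRPO does. It is a minimal sketch under stated assumptions, not the thesis implementation: the function names, the weight `beta`, the unweighted squared-error mismatch, and the heading wrap-around handling are all hypothetical choices made for this example.

```python
import math
from statistics import mean, pstdev


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def elign_reward(predicted, realized):
    """Negative mean squared mismatch between expected and realized waypoints.

    Each waypoint is a tuple (x, y, psi, v). The unweighted sum over
    position, heading, and speed is an assumption made for illustration;
    a real system would likely scale each term.
    """
    total = 0.0
    for (px, py, ppsi, pv), (rx, ry, rpsi, rv) in zip(predicted, realized):
        dpsi = wrap_angle(ppsi - rpsi)  # compare headings on the circle
        total += (px - rx) ** 2 + (py - ry) ** 2 + dpsi ** 2 + (pv - rv) ** 2
    return -total / len(predicted)


def total_reward(task_reward, predicted, realized, beta=0.1):
    """Task reward plus a weighted intrinsic social term (beta is hypothetical)."""
    return task_reward + beta * elign_reward(predicted, realized)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch a rollout whose realized waypoints match what neighbors were expected to observe receives no penalty, so within a group of sampled trajectories the more predictable ones end up with higher relative advantage.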
We evaluate the proposed AR1+ELIGN post-training on a multi-agent simulation benchmark of symmetric four-way arrival scenarios in AlpaSim and compare it against an ego-centric AR1 baseline as well as standard multi-agent reinforcement learning baselines (PPO and MAPPO). Performance is measured by collision rate (treated as a hard safety constraint), deadlock rate, intersection clearance time, and jerk variance as an indicator of indecision. Finally, we study zero-shot social generalization by testing whether ELIGN-fine-tuned agents coordinate effectively with novel partner agents not encountered during training. Results show that introducing an expectation-aligned intrinsic reward improves decentralized intersection throughput while preserving safety, and provide evidence of improved coordination with unseen partners.
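One way the jerk-variance metric named above could be computed is sketched below in Python. It assumes uniformly sampled longitudinal speeds and uses finite differences; the function name, the sampling assumption, and the restriction to longitudinal jerk are illustrative choices, not details taken from the thesis.

```python
from statistics import pvariance


def jerk_variance(speeds, dt):
    """Population variance of longitudinal jerk from uniformly sampled speeds.

    speeds: speed samples in m/s taken every dt seconds (assumed uniform).
    Jerk is approximated by differencing the speed samples twice.
    """
    if len(speeds) < 3:
        raise ValueError("need at least three speed samples to estimate jerk")
    accel = [(b - a) / dt for a, b in zip(speeds, speeds[1:])]
    jerk = [(b - a) / dt for a, b in zip(accel, accel[1:])]
    return pvariance(jerk)
```

A vehicle that accelerates steadily through the intersection yields zero jerk variance under this definition, while one that repeatedly starts and stops yields a large value, which is why the metric can serve as a proxy for indecision.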