PhD Proposal: Towards Generalist Embodied Agents via Representation Learning
In our dynamic and ever-evolving world, embodied agents for sequential decision-making (SDM) lie at the heart of intelligent behavior in machine learning systems. Just as large-scale pretraining has produced foundation models that revolutionized natural language processing and computer vision, foundation models for SDM hold similar potential by capturing the structure and semantics of decision trajectories. In this research proposal, I address the challenge of building such models from the perspective of representation learning: specifically, how to learn compact yet expressive state and action abstractions that are well-suited for downstream policy learning in embodied agents. To this end, I explore both state and action representations, and further introduce a surprisingly simple yet effective approach that leverages explicit visual prompting to bridge toward the capabilities of modern vision-language-action (VLA) foundation models.
First, I introduce TACO, a temporal action-driven contrastive learning objective for visual representation learning. The learned embeddings encode control-relevant dynamics in a compact latent space, significantly improving data efficiency during policy learning. When used for pretraining, these representations further allow embodied agents to generalize to novel tasks from only a handful of expert demonstrations.
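To make the idea concrete, the sketch below shows a TACO-style temporal action-driven contrastive (InfoNCE) objective: the representation of the current observation, combined with the next K actions, is trained to predict the representation of the observation K steps later, with the other trajectories in the batch serving as negatives. The module structure, projection heads, and dimensions here are illustrative assumptions rather than the exact architecture used in TACO.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalContrastive(nn.Module):
    """Sketch of a temporal action-driven contrastive objective.

    Positive pair: (observation at t, actions a_t..a_{t+K-1}) projected forward
    vs. the encoding of the observation K steps later; negatives come from the
    other trajectories in the batch (standard InfoNCE).
    """

    def __init__(self, obs_encoder: nn.Module, action_dim: int, horizon: int,
                 feat_dim: int = 128):
        super().__init__()
        self.obs_encoder = obs_encoder          # images -> feature vector of size feat_dim
        self.horizon = horizon                  # K: prediction horizon
        # Projects (current feature, K flattened actions) into the shared latent space.
        self.forward_proj = nn.Sequential(
            nn.Linear(feat_dim + action_dim * horizon, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )
        self.target_proj = nn.Linear(feat_dim, feat_dim)  # head for the future observation
        self.W = nn.Parameter(torch.eye(feat_dim))        # bilinear similarity matrix

    def forward(self, obs_t, actions, obs_tk):
        # obs_t, obs_tk: (B, C, H, W); actions: (B, K, action_dim)
        z_t = self.obs_encoder(obs_t)                          # (B, D)
        z_tk = self.target_proj(self.obs_encoder(obs_tk))      # (B, D)
        query = self.forward_proj(
            torch.cat([z_t, actions.flatten(1)], dim=-1))      # (B, D)
        logits = query @ self.W @ z_tk.T                       # (B, B) pairwise similarities
        labels = torch.arange(logits.size(0), device=logits.device)
        return F.cross_entropy(logits, labels)                 # InfoNCE loss
```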
Next, inspired by tokenization in large language models (LLMs), I present PRISE, a simple yet effective strategy for learning temporally abstracted action representations directly from raw trajectories. By shortening the effective planning horizon, these action abstractions significantly improve the performance of multitask imitation learning algorithms, while enabling better generalization to unseen tasks with minimal expert demonstrations.
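A rough sketch of the tokenization stage behind this idea appears below, under the assumption (borrowed from how LLMs tokenize text) that continuous actions are first quantized into a discrete vocabulary and byte-pair encoding (BPE) then merges frequently co-occurring codes into variable-length action primitives. The k-means quantizer and helper names are illustrative; the actual method learns its action quantization (e.g., with a state-conditioned encoder) and the downstream policy on top of these tokens, which is omitted here.

```python
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

def quantize_actions(actions: np.ndarray, num_codes: int = 64, seed: int = 0):
    """Map continuous actions of shape (N, action_dim) to discrete codes via k-means."""
    km = KMeans(n_clusters=num_codes, random_state=seed, n_init=10).fit(actions)
    return km, km.labels_

def bpe_merges(token_seqs, num_merges: int = 32):
    """Run byte-pair encoding over per-trajectory action-code sequences.

    Each merge fuses the most frequent adjacent pair of tokens into a new
    token, yielding variable-length action primitives (temporal abstractions).
    """
    seqs = [list(map(int, s)) for s in token_seqs]
    merges, next_token = [], max(t for s in seqs for t in s) + 1
    for _ in range(num_merges):
        pair_counts = Counter(
            (s[i], s[i + 1]) for s in seqs for i in range(len(s) - 1))
        if not pair_counts:
            break
        (a, b), _ = pair_counts.most_common(1)[0]
        merges.append(((a, b), next_token))
        # Replace every occurrence of the most frequent pair with the new merged token.
        new_seqs = []
        for s in seqs:
            out, i = [], 0
            while i < len(s):
                if i + 1 < len(s) and s[i] == a and s[i + 1] == b:
                    out.append(next_token)
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
        next_token += 1
    return merges, seqs
```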
Finally, building on recent advances in VLA models, I introduce TraceVLA, an explicit visual prompting technique that encodes a robot's recent execution history as a visual trace overlaid on its camera observation. This visual cue improves the spatio-temporal understanding of large VLA models and yields robust generalization across embodiments, outperforming existing VLA baselines on real-world robotic manipulation tasks.
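As a minimal illustration of the prompting step, the function below draws a 2D trace of a tracked point (for example, the end-effector as seen by the camera) onto the current frame before it is passed to the VLA policy. How the trace points are obtained (e.g., from an off-the-shelf point tracker) and the drawing style are assumptions for this sketch, not the exact procedure used in TraceVLA.

```python
import numpy as np
from PIL import Image, ImageDraw

def overlay_visual_trace(frame: np.ndarray, trace_xy: np.ndarray,
                         color=(255, 0, 0), width: int = 3) -> np.ndarray:
    """Draw the robot's recent 2D trajectory onto the current camera frame.

    frame:    (H, W, 3) uint8 RGB image from the current timestep.
    trace_xy: (T, 2) pixel coordinates of a tracked point (e.g., the end-effector)
              over the last T steps, oldest first.
    Returns a new image with the trace drawn as a polyline plus a marker at the
    most recent position; this prompted image is what the VLA policy consumes.
    """
    img = Image.fromarray(frame)
    draw = ImageDraw.Draw(img)
    points = [tuple(p) for p in trace_xy.astype(float)]
    if len(points) >= 2:
        draw.line(points, fill=color, width=width)
    if points:
        x, y = points[-1]
        r = width + 2
        draw.ellipse([x - r, y - r, x + r, y + r], fill=color)  # mark current position
    return np.asarray(img)
```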