PhD Proposal: Scaling Policy Gradient Methods to Open-Ended Domains

Ryan Sullivan
05.01.2024 14:00 to 16:00

IRB 2137


Curriculum learning has been a quiet yet crucial component of many of the major successes of reinforcement learning. AlphaGo learned to play the board game Go using self-play, which produces an implicit curriculum of increasingly challenging opponents. OpenAI Five was trained to play Dota 2 by progressively adding complexity to the environment and randomizing game features to encourage robustness. GT Sophy, an agent that plays the racing game Gran Turismo at a professional level, learned from a manually curated distribution of racing scenarios. Notably, with the use of curriculum learning, many of these milestones were achieved with simple policy gradient methods. Despite its near ubiquity in successful reinforcement learning applications, curriculum learning is rarely the focus of research and is often mentioned only as a minor implementation detail.

This began to change with the advent of open-endedness research. Open-ended environments have large, growing task spaces that present constantly evolving challenges to agents, similar to the real world. In these settings, where agents may devote time to countless tasks, it is crucial to identify tasks that will teach transferable skills and to learn those skills as efficiently as possible. Curriculum learning is therefore a required component of open-endedness research.

This work develops a stronger empirical understanding of policy gradient methods and curriculum learning in complex, multi-task environments. It proposes a new method for plotting reward surfaces and uses these plots to identify challenges for policy gradient methods in sparse-reward environments. We explore implementation tricks that have successfully improved the reward scale robustness of model-based RL algorithms and show that they are not effective when transferred to model-free PPO. Our findings demonstrate that direct policy optimization and clever implementation tricks are not enough for model-free policy gradient algorithms to solve challenging RL tasks. This motivates the use of curriculum learning, which circumvents these problems by training on easier subtasks.

We develop a general-purpose library for curriculum learning and reimplement several popular algorithms using that framework, identifying shared components between methods and evaluating their impact across algorithms and environments. This allows us to transfer improvements between methods, resulting in new algorithms and a stronger foundational understanding of automatic curriculum learning.

Examining Committee


Dr. John Dickerson

Department Representative:

Dr. Ming Lin


Dr. Furong Huang