Design principles
We’ve designed all Procgen environments to satisfy the following criteria:
* High Diversity: Environment generation logic is given maximal freedom, subject to basic design constraints. The diversity in the resulting level distributions presents agents with meaningful generalization challenges.
* Fast Evaluation: Environment difficulty is calibrated such that baseline agents make significant progress after training for 200M timesteps. Moreover, the environments are optimized to perform thousands of steps per second on a single CPU core, enabling a fast experimental pipeline.
* Tunable Difficulty: All environments support two well-calibrated difficulty settings: easy and hard. While we report results using the hard difficulty setting, we make the easy difficulty setting available for those with limited access to compute power. Easy environments require approximately an eighth of the resources to train.
* Emphasis on Visual Recognition and Motor Control: In keeping with precedent, environments mimic the style of many Atari and Gym Retro games. Performing well primarily depends on identifying key assets in the observation space and enacting appropriate low-level motor responses.
Evaluating generalization
We came to appreciate how hard RL generalization can be while conducting the Retro Contest, as agents continually failed to generalize from the limited data in the training set. Later, our CoinRun experiments painted an even clearer picture of our agents’ struggle to generalize. We’ve now expanded on those results, conducting our most thorough study of RL generalization to date using all 16 environments in Procgen Benchmark.
We first measured how the size of the training set impacts generalization. In each environment, we generated training sets ranging in size from 100 to 100,000 levels. We trained agents for 200M timesteps on these levels using Proximal Policy Optimization, and we measured performance on unseen test levels.
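The setup above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code: it assumes each level is identified by an integer seed, and the helper `make_level_split`, the intermediate set sizes, and the size of the held-out test set are all hypothetical choices made for the example.

```python
import random

def make_level_split(num_train_levels, num_test_levels, seed=0, seed_space=2**31 - 1):
    """Draw disjoint sets of level seeds for training and testing.

    Assumption: a level is fully determined by an integer seed.
    """
    rng = random.Random(seed)
    # sample() draws without replacement, so no seed appears in both sets.
    seeds = rng.sample(range(seed_space), num_train_levels + num_test_levels)
    return seeds[:num_train_levels], seeds[num_train_levels:]

# The endpoints (100 and 100,000) come from the text; the intermediate
# sizes are an assumption made for this sketch.
TRAIN_SET_SIZES = [100, 1_000, 10_000, 100_000]

# One train/test split per training set size; 1,000 test levels is arbitrary.
splits = {n: make_level_split(n, num_test_levels=1_000) for n in TRAIN_SET_SIZES}

for n, (train_levels, test_levels) in splits.items():
    print(f"{n:>7} training levels, {len(test_levels)} unseen test levels")
```

Because the test seeds never overlap with the training seeds, any score measured on them reflects behavior on levels the agent has not seen.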
[Figure: Generalization performance. Score over 100k levels (log scale), train vs. test, one panel per environment: CoinRun, StarPilot, CaveFlyer, Dodgeball, Fruitbot, Chaser, Miner, Jumper, Leaper, Maze, BigFish, Heist, Climber, Plunder, Ninja, BossFight.]
We found that agents strongly overfit to small training sets in almost all environments. In some cases, agents need access to as many as 10,000 levels to close the generalization gap. We also saw a peculiar trend emerge in many environments: past a certain threshold, training performance improves as the training set grows! This runs counter to trends found in supervised learning, where training performance commonly decreases with the size of the training set. We believe this increase in training performance comes from an implicit curriculum provided by a diverse set of levels. A larger training set can improve training performance if the agent learns to generalize _even across levels in the training set_. We previously noticed this effect with CoinRun, and have found that it occurs in many Procgen environments as well.
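As a rough illustration of what "closing the generalization gap" means, the sketch below treats the gap as mean training return minus mean test return and looks for the smallest training set whose gap falls under a tolerance. The function names, the tolerance, and the numbers are all hypothetical; the returns are made-up placeholders, not results from the benchmark.

```python
from statistics import mean

def generalization_gap(train_returns, test_returns):
    """Mean training return minus mean test return."""
    return mean(train_returns) - mean(test_returns)

def smallest_size_closing_gap(results, tolerance):
    """Return the smallest training set size whose gap is within tolerance.

    results maps num_levels -> (train_returns, test_returns).
    Returns None if no size closes the gap.
    """
    for num_levels in sorted(results):
        train_returns, test_returns = results[num_levels]
        if generalization_gap(train_returns, test_returns) <= tolerance:
            return num_levels
    return None

# Made-up placeholder returns, chosen only to exercise the functions.
toy_results = {
    100: ([9.0, 8.0, 10.0], [3.0, 2.0, 4.0]),
    1_000: ([8.0, 9.0, 10.0], [8.0, 8.5, 9.0]),
}

for num_levels, (train_r, test_r) in sorted(toy_results.items()):
    print(num_levels, generalization_gap(train_r, test_r))
print(smallest_size_closing_gap(toy_results, tolerance=1.0))
```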
An ablation with deterministic levels
[Figure: Train and test performance. Score over 200M timesteps.]
At test time, we remove the determinism in the sequence of levels, instead choosing level sequences at random. We find that agents become competent over the first several training levels in most games, giving an illusion of meaningful progress. However, test performance demonstrates that the agents have in fact learned almost nothing about the underlying level distribution. We believe this vast gap between training and test performance is worth highlighting. It reveals a crucial hidden flaw in training on environments that follow a fixed sequence of levels. These results show just how essential it is to use diverse environment distributions when training and evaluating RL agents.
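The contrast between a fixed level sequence and a randomized one can be sketched as follows. This is an illustration under stated assumptions rather than the experiment's actual code: levels are assumed to be identified by integer seeds, and both helper functions are hypothetical.

```python
import random

def fixed_level_sequence(num_levels, start_seed=0):
    """Deterministic sequence: the same levels in the same order every time."""
    return [start_seed + i for i in range(num_levels)]

def random_level_sequence(num_levels, rng, seed_space=2**31 - 1):
    """Randomized sequence: every level is drawn independently at random."""
    return [rng.randrange(seed_space) for _ in range(num_levels)]

# With a fixed sequence, every episode replays identical levels, so an agent
# can make progress by memorizing the first several of them.
episode_a = fixed_level_sequence(5)
episode_b = fixed_level_sequence(5)
print(episode_a == episode_b)

# With random sequences, each episode presents levels from the underlying
# distribution, so memorizing a particular ordering does not help.
rng = random.Random(0)
print(random_level_sequence(5, rng))
print(random_level_sequence(5, rng))
```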
Next steps
We expect many insights gleaned from this benchmark to apply in more complex settings, and we’re excited to use these new environments to design more capable and efficient agents.
_If you’re interested in helping develop diverse environments, we’re hiring!_
Authors
Karl Cobbe, Christopher Hesse, Jacob Hilton, John Schulman
Acknowledgments
Thanks to Marc Bellemare, Julian Togelius, Carles Gelada, Jacob Jackson, Alex Ray, Lilian Weng, and Joshua Achiam for their feedback on the paper.
Thanks to Mira Murati, Brooke Chan, Justin Jay Wang, Greg Brockman, Ashley Pilipiszyn and Jack Clark for their work supporting, designing, writing, and providing feedback on this post.
Special thanks to Kenney for the many high quality game assets used throughout these environments.
Additional thanks to Oleg Domrachev and Anton Tyshchenko (CraftPix.net) for several game backgrounds, as well as to GameArtGuppy and ansimuz. All asset licenses can be found here.