Procgen Benchmark
Procgen Benchmark consists of 16 unique environments designed to measure both sample efficiency and generalization in reinforcement learning. This benchmark is ideal for evaluating generalization since distinct training and test sets can be generated in each environment. This benchmark is also well-suited to evaluate sample efficiency, since all environments pose diverse and compelling challenges for RL agents. The environments’ intrinsic diversity demands that agents learn robust policies; overfitting to narrow regions in state space will not suffice. Put differently, the ability to generalize becomes an integral component of success when agents are faced with ever-changing levels.
Design principles
We’ve designed all Procgen environments to satisfy the following criteria:
* High Diversity: Environment generation logic is given maximal freedom, subject to basic design constraints. The diversity in the resulting level distributions presents agents with meaningful generalization challenges.
* Fast Evaluation: Environment difficulty is calibrated such that baseline agents make significant progress after training for 200M timesteps. Moreover, the environments are optimized to perform thousands of steps per second on a single CPU core, enabling a fast experimental pipeline.
* Tunable Difficulty: All environments support two well-calibrated difficulty settings: easy and hard. While we report results using the hard difficulty setting, we make the easy difficulty setting available for those with limited access to compute power. Easy environments require approximately an eighth of the resources to train.
* Emphasis on Visual Recognition and Motor Control: In keeping with precedent, environments mimic the style of many Atari and Gym Retro games. Performing well primarily depends on identifying key assets in the observation space and enacting appropriate low-level motor responses.
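In practice, these design choices surface as a handful of per-environment settings: which game to run, how many distinct levels to draw from, which seed to start at, and the difficulty mode. The sketch below models those knobs with a small illustrative config class. The field names mirror the keyword arguments the released package documents (`num_levels`, `start_level`, `distribution_mode`) and the Gym id format it registers, but the class itself is a stand-in for illustration, not the library's own API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProcgenConfig:
    """Illustrative stand-in for the knobs a Procgen environment exposes."""
    env_name: str = "coinrun"
    num_levels: int = 0              # 0 means draw from the full level distribution
    start_level: int = 0             # first level seed in the training set
    distribution_mode: str = "hard"  # "easy" needs roughly 1/8 the training compute

    def env_id(self) -> str:
        # The released package registers Gym ids of this form.
        return f"procgen:procgen-{self.env_name}-v0"

# A 500-level easy-mode training setup for StarPilot:
cfg = ProcgenConfig(env_name="starpilot", num_levels=500, distribution_mode="easy")
assert cfg.env_id() == "procgen:procgen-starpilot-v0"
```

Setting `num_levels=0` at test time corresponds to evaluating on the full (effectively unseen) level distribution, which is what makes the train/test split possible.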
Evaluating generalization
We came to appreciate how hard RL generalization can be while conducting the Retro Contest, as agents continually failed to generalize from the limited data in the training set. Later, our CoinRun experiments painted an even clearer picture of our agents’ struggle to generalize. We’ve now expanded on those results, conducting our most thorough study of RL generalization to date using all 16 environments in Procgen Benchmark.
We first measured how the size of the training set impacts generalization. In each environment, we generated training sets ranging in size from 100 to 100,000 levels. We trained agents for 200M timesteps on these levels using Proximal Policy Optimization, and we measured performance on unseen test levels.
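Because each Procgen level is identified by an integer seed, this train/test protocol boils down to carving out disjoint blocks of seeds. A minimal sketch of that split (the helper name is ours, not the benchmark's):

```python
import random

def make_level_split(train_size: int, test_size: int = 1000):
    """Partition integer level seeds into disjoint train and test sets.

    A training set of `train_size` levels is one contiguous block of seeds,
    and the test set is a disjoint block after it (hypothetical helper,
    for illustration only).
    """
    train_levels = list(range(train_size))
    test_levels = list(range(train_size, train_size + test_size))
    return train_levels, test_levels

train, test = make_level_split(train_size=500)
assert set(train).isdisjoint(test)  # agents never train on a test level

# During training, each episode draws a level uniformly from the train set.
rng = random.Random(0)
assert rng.choice(train) in train
```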
Generalization performance
[Figure: train and test score after 200M timesteps, plotted against training set size from 100 to 100k levels (log scale), for all 16 environments: CoinRun, StarPilot, CaveFlyer, Dodgeball, Fruitbot, Chaser, Miner, Jumper, Leaper, Maze, BigFish, Heist, Climber, Plunder, Ninja, and BossFight.]
We found that agents strongly overfit to small training sets in almost all environments. In some cases, agents need access to as many as 10,000 levels to close the generalization gap. We also saw a peculiar trend emerge in many environments: past a certain threshold, training performance improves as the training set grows! This runs counter to trends found in supervised learning, where training performance commonly decreases with the size of the training set. We believe this increase in training performance comes from an implicit curriculum provided by a diverse set of levels. A larger training set can improve training performance if the agent learns to generalize _even across levels in the training set_. We previously noticed this effect with CoinRun, and have found it often occurs in many Procgen environments as well.
An ablation with deterministic levels
[Figure: train and test performance over 200M timesteps for agents trained on a fixed, deterministic sequence of levels.]
At test time, we remove the determinism in the sequence of levels, instead choosing level sequences at random. We find that agents become competent over the first several training levels in most games, giving an illusion of meaningful progress. However, test performance demonstrates that the agents have in fact learned almost nothing about the underlying level distribution. We believe this vast gap between training and test performance is worth highlighting. It reveals a crucial hidden flaw in training on environments that follow a fixed sequence of levels. These results show just how essential it is to use diverse environment distributions when training and evaluating RL agents.
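The contrast between the two regimes can be made concrete: in the deterministic ablation, episode _t_ always replays the same level, while at test time each episode draws a level at random. A toy sketch (our own illustration, not the benchmark's code):

```python
import random

levels = list(range(8))  # a small pool of level seeds

def deterministic_level(t: int) -> int:
    """Fixed-sequence training: episode t always plays the same level."""
    return levels[t % len(levels)]

def random_level(rng: random.Random) -> int:
    """Test-time sampling: each episode draws a level at random."""
    return rng.choice(levels)

# Every pass through training replays the identical sequence, so an agent
# can memorize it without learning anything about the level distribution.
assert [deterministic_level(t) for t in range(8)] == levels
assert [deterministic_level(t) for t in range(8, 16)] == levels

assert random_level(random.Random(1)) in levels
```

Memorizing a fixed sequence inflates training curves for exactly the reason the figure shows: progress on the first few levels looks like learning, but transfers to nothing outside that sequence.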
Next steps
We expect many insights gleaned from this benchmark to apply in more complex settings, and we’re excited to use these new environments to design more capable and efficient agents.
_If you’re interested in helping develop diverse environments, we’re hiring!_
Authors
Karl Cobbe, Christopher Hesse, Jacob Hilton, John Schulman
Acknowledgments
Thanks to Marc Bellemare, Julian Togelius, Carles Gelada, Jacob Jackson, Alex Ray, Lilian Weng, and Joshua Achiam for their feedback on the paper.
Thanks to Mira Murati, Brooke Chan, Justin Jay Wang, Greg Brockman, Ashley Pilipiszyn and Jack Clark for their work supporting, designing, writing, and providing feedback on this post.
Special thanks to Kenney for the many high-quality game assets used throughout these environments.
Additional thanks to Oleg Domrachev and Anton Tyshchenko (CraftPix.net) for several game backgrounds, as well as to GameArtGuppy and ansimuz. All asset licenses can be found here.