Procgen Benchmark | The Next Input

Design principles

We’ve designed all Procgen environments to satisfy the following criteria:

* High Diversity: Environment generation logic is given maximal freedom, subject to basic design constraints. The diversity in the resulting level distributions presents agents with meaningful generalization challenges.

* Fast Evaluation: Environment difficulty is calibrated such that baseline agents make significant progress after training for 200M timesteps. Moreover, the environments are optimized to perform thousands of steps per second on a single CPU core, enabling a fast experimental pipeline.

* Tunable Difficulty: All environments support two well-calibrated difficulty settings: easy and hard. While we report results using the hard difficulty setting, we make the easy difficulty setting available for those with limited access to compute power. Easy environments require approximately an eighth of the resources to train.

* Emphasis on Visual Recognition and Motor Control: In keeping with precedent, environments mimic the style of many Atari and Gym Retro games. Performing well depends primarily on identifying key assets in the observation space and enacting appropriate low-level motor responses.
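As a rough planning aid, the difficulty and throughput figures above can be turned into back-of-the-envelope arithmetic. The sketch below assumes that "an eighth of the resources" translates to an eighth of the 200M-timestep budget, which is one possible reading. The function names `timestep_budget` and `simulation_hours` are hypothetical, and any throughput passed in is a placeholder rather than a measured number.

```python
HARD_TIMESTEPS = 200_000_000  # training budget quoted for the hard setting
EASY_FRACTION = 1 / 8         # assumption: "an eighth of the resources" read as timesteps


def timestep_budget(difficulty):
    """Return an assumed training budget, in timesteps, for a difficulty setting."""
    if difficulty == "hard":
        return HARD_TIMESTEPS
    if difficulty == "easy":
        return int(HARD_TIMESTEPS * EASY_FRACTION)
    raise ValueError(f"unknown difficulty: {difficulty!r}")


def simulation_hours(timesteps, steps_per_second, cores=1):
    """Estimate wall-clock hours spent stepping the environment only.

    Assumes throughput scales linearly with the number of cores and
    ignores the cost of the agent's own computation.
    """
    return timesteps / (steps_per_second * cores) / 3600
```

For example, `simulation_hours(timestep_budget("hard"), 2_000)` estimates environment stepping time on one core at a hypothetical 2,000 steps per second.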

Evaluating generalization

We came to appreciate how hard RL generalization can be while conducting the Retro Contest, as agents continually failed to generalize from the limited data in the training set. Later, our CoinRun experiments painted an even clearer picture of our agents’ struggle to generalize. We’ve now expanded on those results, conducting our most thorough study of RL generalization to date using all 16 environments in Procgen Benchmark.

We first measured how the size of the training set impacts generalization. In each environment, we generated training sets ranging in size from 100 to 100,000 levels. We trained agents for 200M timesteps on these levels using Proximal Policy Optimization, and we measured performance on unseen test levels.
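One way to picture this sweep is as a list of paired train and test configurations. The sketch below is illustrative only: the helper names are hypothetical, and the spacing of one size per factor of ten is an assumption. The keyword names `num_levels`, `start_level`, and `distribution_mode` follow the options of the open-source `procgen` package; they, and the convention that `num_levels=0` means an unrestricted level distribution, should be treated as assumptions here.

```python
def training_set_sizes(smallest=100, largest=100_000, factor=10):
    """Return geometrically spaced training-set sizes (spacing is an assumption)."""
    sizes = []
    n = smallest
    while n <= largest:
        sizes.append(n)
        n *= factor
    return sizes


def sweep_configs(env_name, distribution_mode="hard"):
    """Build one train/test configuration pair per training-set size."""
    configs = []
    for n in training_set_sizes():
        train = {
            "env_name": env_name,
            "num_levels": n,   # restrict training to n distinct levels
            "start_level": 0,  # first level seed in the training set
            "distribution_mode": distribution_mode,
        }
        # Assumed convention: num_levels=0 samples from the full level
        # distribution, so test levels are almost always unseen in training.
        test = dict(train, num_levels=0)
        configs.append({"train": train, "test": test})
    return configs
```

Each dictionary could then be passed as keyword arguments to whatever environment constructor is in use.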

[Figure: Generalization performance. Train and test score over 100k levels (log scale), with one panel per environment: CoinRun, StarPilot, CaveFlyer, Dodgeball, Fruitbot, Chaser, Miner, Jumper, Leaper, Maze, BigFish, Heist, Climber, Plunder, Ninja, BossFight.]

We found that agents strongly overfit to small training sets in almost all environments. In some cases, agents need access to as many as 10,000 levels to close the generalization gap. We also saw a peculiar trend emerge in many environments: past a certain threshold, training performance improves as the training set grows! This runs counter to trends found in supervised learning, where training performance commonly decreases with the size of the training set. We believe this increase in training performance comes from an implicit curriculum provided by a diverse set of levels. A larger training set can improve training performance if the agent learns to generalize _even across levels in the training set_. We previously noticed this effect with CoinRun, and have found that it occurs in many Procgen environments as well.
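The generalization gap discussed here is simply the difference between train and test scores at a given training-set size. The following minimal sketch shows how one might find the smallest training set at which that gap falls under a chosen tolerance. The function names are hypothetical, and the scores in `example_results` are invented placeholders for illustration, not benchmark results.

```python
def generalization_gap(train_score, test_score):
    """Gap between performance on training levels and on unseen test levels."""
    return train_score - test_score


def levels_needed_to_close_gap(results, tolerance):
    """Return the smallest training-set size whose gap is within tolerance.

    results maps training-set size -> (train_score, test_score).
    Returns None if no size in the sweep closes the gap.
    """
    for size in sorted(results):
        train_score, test_score = results[size]
        if generalization_gap(train_score, test_score) <= tolerance:
            return size
    return None


# Placeholder scores, invented purely for illustration.
example_results = {
    100: (9.0, 4.0),
    1_000: (8.0, 6.0),
    10_000: (8.5, 8.2),
    100_000: (9.0, 8.9),
}
```

In these placeholder numbers the train score dips and then rises again as the training set grows, which mirrors the shape of the trend described above without claiming any particular magnitude.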

An ablation with deterministic levels

[Figure: Train and test performance. Score over 200M timesteps.]

At test time, we remove the determinism in the sequence of levels, instead choosing level sequences at random. We find that agents become competent over the first several training levels in most games, giving an illusion of meaningful progress. However, test performance demonstrates that the agents have in fact learned almost nothing about the underlying level distribution. We believe this vast gap between training and test performance is worth highlighting. It reveals a crucial hidden flaw in training on environments that follow a fixed sequence of levels. These results show just how essential it is to use diverse environment distributions when training and evaluating RL agents.
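A toy model can make this failure mode concrete. The sketch below is not the experimental setup: it assumes a made-up game in which each level is solved by a single action determined by its seed, an episode that ends at the first failed level, and an "agent" that is nothing more than a lookup table over the levels it has seen. All names are hypothetical.

```python
import random

NUM_ACTIONS = 4


def correct_action(level_seed):
    """Toy rule: each level is solved by one action determined by its seed."""
    return random.Random(level_seed).randrange(NUM_ACTIONS)


def memorizing_policy(known_seeds):
    """A policy that only remembers answers for levels it has already seen."""
    table = {seed: correct_action(seed) for seed in known_seeds}
    # -1 is never a valid action, so unseen levels are always failed.
    return lambda seed: table.get(seed, -1)


def levels_cleared(policy, level_sequence):
    """Count levels cleared before the first failure ends the episode."""
    cleared = 0
    for seed in level_sequence:
        if policy(seed) != correct_action(seed):
            break
        cleared += 1
    return cleared


fixed_sequence = list(range(10))  # the same levels, in the same order
policy = memorizing_policy(fixed_sequence)

rng = random.Random(0)
random_sequence = [rng.randrange(1_000, 2_000) for _ in range(10)]  # unseen seeds
```

Here `levels_cleared(policy, fixed_sequence)` clears every level while `levels_cleared(policy, random_sequence)` clears none, because memorizing a fixed sequence carries no information about the wider level distribution.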

Next steps

We expect many insights gleaned from this benchmark to apply in more complex settings, and we’re excited to use these new environments to design more capable and efficient agents.

_If you’re interested in helping develop diverse environments, we’re hiring!_

Authors

Karl Cobbe, Christopher Hesse, Jacob Hilton, John Schulman

Acknowledgments

Thanks to Marc Bellemare, Julian Togelius, Carles Gelada, Jacob Jackson, Alex Ray, Lilian Weng, and Joshua Achiam for their feedback on the paper.

Thanks to Mira Murati, Brooke Chan, Justin Jay Wang, Greg Brockman, Ashley Pilipiszyn and Jack Clark for their work supporting, designing, writing, and providing feedback on this post.

Special thanks to Kenney for the many high-quality game assets used throughout these environments.

Additional thanks to Oleg Domrachev and Anton Tyshchenko (CraftPix.net) for several game backgrounds, as well as to GameArtGuppy and ansimuz. All asset licenses can be found here.
