Plan online, learn offline: Efficient learning and exploration via model-based control Plan online, learn offline: Efficient learning and exploration via model-based control

Read paper(opens in a new window) Read paper(opens in a new window)

Article details Artikelgegevens

AI maker AI-maker OpenAI Type Type Article Artikel Published Gepubliceerd 5 November 2018 5 november 2018 Updates Updates Videos Video's View original article Bekijk origineel artikel

Why it matters Waarom dit telt

Quick editorial signal Snelle redactionele duiding

1 min

Impact Impact

Worth checking before choosing or changing a subscription. Handig om te checken voordat je een abonnement kiest of wijzigt.

Audience Voor wie Creators Creators

Level Niveau Medium Gemiddeld

Track this as a OpenAI update, not just a standalone headline. Bekijk dit als OpenAI-update, niet alleen als losse headline.
Check plan details before changing subscriptions or advising a team. Controleer plandetails voordat je abonnementen wijzigt of een team adviseert.
Likely worth revisiting after people have used the release in practice. Waarschijnlijk de moeite waard om opnieuw te bekijken zodra mensen het in praktijk gebruiken.

model apps video pricing

Abstract

We propose a plan online and learn offline (POLO) framework for the setting where an agent, with an internal model, needs to continually act and learn in the world. Our work builds on the synergistic relationship between local model-based control, global value function learning, and exploration. We study how local trajectory optimization can cope with approximation errors in the value function, and can stabilize and accelerate value function learning. Conversely, we also study how approximate value functions can help reduce the planning horizon and allow for better policies beyond local solutions. Finally, we also demonstrate how trajectory optimization can be used to perform temporally coordinated exploration in conjunction with estimating uncertainty in value function approximation. This exploration is critical for fast and stable learning of the value function. Combining these components enable solutions to complex simulated control tasks, like humanoid locomotion and dexterous in-hand manipulation, in the equivalent of a few minutes of experience in the real world.

We propose a plan online and learn offline (POLO) framework for the setting where an agent, with an internal model, needs to continually act and learn in the world. Our work builds on the synergistic relationship between local model-based control, global value function learning, and exploration. We study how local trajectory optimization can cope with approximation errors in the value function, and can stabilize and accelerate value function learning. Conversely, we also study how approximate value functions can help reduce the planning horizon and allow for better policies beyond local solutions. Finally, we also demonstrate how trajectory optimization can be used to perform temporally coordinated exploration in conjunction with estimating uncertainty in value function approximation. This exploration is critical for fast and stable learning of the value function. Combining these components enable solutions to complex simulated control tasks, like humanoid locomotion and dexterous in-hand manipulation, in the equivalent of a few minutes of experience in the real world.

Authors

Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, Igor Mordatch

View all

Scaling laws for reward model overoptimization Publication Oct 19, 2022

Learning to play Minecraft with Video PreTraining Conclusion Jun 23, 2022

Dota 2 with large scale deep reinforcement learning Publication Dec 13, 2019

Help shape what we cover next Help bepalen wat we hierna volgen

Anonymous feedback, no frontend account needed. Anonieme feedback, zonder front-end account.

Share article Deel artikel

Plan online, learn offline: Efficient learning and exploration via model-based control Plan online, learn offline: Efficient learning and exploration via model-based control

Quick editorial signal Snelle redactionele duiding

Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, Igor Mordatch

View all

Help shape what we cover next Help bepalen wat we hierna volgen

More from OpenAI Meer van OpenAI

Just a moment... Zo begin je met Codex

Working with Codex Werken met Codex

GPT-5.5 Bio Bug Bounty GPT-5.5 Bio Bug Bounty

What is Codex? Wat is Codex?

Plan online, learn offline: Efficient learning and exploration via model-based control Plan online, learn offline: Efficient learning and exploration via model-based control

Quick editorial signal Snelle redactionele duiding

Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, Igor Mordatch

View all

Help shape what we cover next Help bepalen wat we hierna volgen

More from OpenAI Meer van OpenAI

Just a moment... Zo begin je met Codex

Working with Codex Werken met Codex

GPT-5.5 Bio Bug Bounty GPT-5.5 Bio Bug Bounty

What is Codex? Wat is Codex?

The Next Input keeps optional media off until you say yes. The Next Input houdt optionele media uit tot jij ja zegt.