Evolved Policy Gradients

The above video demonstrates how our method (left) teaches a robot to reach various targets without resetting the environment, in comparison with PPO (right). Top-left text specifies the number of seconds elapsed. Note that this video shows the complete learning process in real time.

The intuition behind EPG comes from something we are all familiar with: trying to pick up a new skill and experiencing the alternating frustration and joy involved in that process. Suppose you are just starting out learning to play the violin. Even without instruction, you will immediately have a feel for what to try, and, listening to the sounds you produce, you will have a sense of whether or not you are making progress. That’s because you effectively have access to very well-shaped internal reward functions, derived from prior experience on other motor tasks and through the course of biological evolution. In contrast, most reinforcement learning (RL) agents approach each new task without using prior knowledge. Instead, they rely entirely on external reward signals to guide their initial behavior. Coming from such a blank slate, it is no surprise that current RL agents take far longer than humans to learn simple skills. EPG takes a step toward agents that are not blank slates but instead know what it means to make progress on a new task, by having experienced making progress on similar tasks in the past.

The above video demonstrates how our method (left) teaches a robot to hop backwards, in comparison with PPO (right). EPG results in exploratory behavior where the agent first tries moving forwards before realizing that moving backwards gives higher rewards. Top-left text specifies the number of seconds elapsed. Note that this video shows the complete learning process in real time.

There has been a flurry of recent work on metalearning policies, and it’s worth asking: why learn a loss function rather than learn a policy directly? Learning recurrent policies tends to overfit the task at hand, while learning policy initializations has limited expressivity when it comes to exploration. Our motivation is that we expect loss functions to be the kind of object that may generalize very well across substantially different tasks. This is certainly true of hand-engineered loss functions: a well-designed RL loss function, such as that in PPO, can be very generically applicable, finding use in problems ranging from playing Atari games to controlling robots.

To test the generalization ability of EPG, we conducted a simple experiment. We evolved the EPG loss to be effective at getting “ants” to walk to randomly located targets on the right half of an arena. Then, we froze the loss, and gave the ants a new target, this time on the _left_ half of the arena. Surprisingly, the ants learned to walk to the left! Here is how their learning curves looked (red lines on graphs):

This result is exciting to us because it demonstrates generalization to a task _outside the training distribution_. This kind of generalization can be quite hard to achieve. We compared EPG to an alternative metalearning algorithm, called RL2, which tries to directly learn a policy that can adapt to novel tasks. In our experiment, RL2 was indeed successful at getting agents to walk to targets on the right half of the screen. However, when given a test-time target on the left half of the screen, it qualitatively failed and just kept walking to the right. In a sense, it “overfit” to the set of tasks on which it was trained (i.e., walking to the right).
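The evolve-then-freeze protocol used in this experiment can be sketched on a toy problem. What follows is a minimal illustration under loud assumptions, not the EPG implementation: the "agent" is a one-dimensional Gaussian policy with mean `theta`, the "learned loss" has only two parameters `phi`, a simple keep-the-best random search stands in for the evolutionary optimizer, and every name and constant (`train_policy`, `evolve_loss`, `SIGMA`, `STEPS`, `BATCH`, and so on) is hypothetical. The loss is evolved on targets in the right half of a line, frozen, and then used to train a fresh policy on a target in the left half.

```python
import numpy as np

SIGMA = 0.3   # exploration noise of the toy policy (assumed value)
STEPS = 20    # inner-loop updates per agent lifetime (assumed value)
BATCH = 16    # actions sampled per update (assumed value)


def learned_loss_grad(phi, theta, actions, rewards):
    """Gradient w.r.t. theta of a toy parameterized loss
    L = -mean(w * log N(a | theta, SIGMA^2)),
    where w = phi[0] * (centred reward) + phi[1]."""
    weights = phi[0] * (rewards - rewards.mean()) + phi[1]
    return -np.mean(weights * (actions - theta)) / SIGMA**2


def train_policy(phi, target, seed, lr=0.1):
    """Inner loop: train a fresh policy using only the loss defined by phi.
    The target is used solely to compute rewards; the agent never sees it."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    for _ in range(STEPS):
        actions = theta + SIGMA * rng.standard_normal(BATCH)
        rewards = -(actions - target) ** 2
        theta -= lr * learned_loss_grad(phi, theta, actions, rewards)
    return theta


def fitness(phi, targets, seeds):
    """Negative mean squared distance between trained policies and their targets."""
    finals = np.array([train_policy(phi, t, s) for t, s in zip(targets, seeds)])
    return -float(np.mean((finals - np.asarray(targets)) ** 2))


def evolve_loss(targets, seeds, generations=40, children=8, step=0.3, seed=0):
    """Outer loop: keep-the-best random search over the loss parameters phi."""
    rng = np.random.default_rng(seed)
    best = np.zeros(2)
    best_fit = fitness(best, targets, seeds)
    for _ in range(generations):
        for _ in range(children):
            candidate = best + step * rng.standard_normal(2)
            candidate_fit = fitness(candidate, targets, seeds)
            if candidate_fit > best_fit:
                best, best_fit = candidate, candidate_fit
    return best, best_fit


# Evolve the loss on right-half targets only.
train_targets = np.linspace(0.3, 1.0, 8)
train_seeds = list(range(8))
phi, train_fit = evolve_loss(train_targets, train_seeds)

# Freeze phi and train a fresh policy on a left-half target.
left_target = -0.8
theta = train_policy(phi, left_target, seed=123)
print("evolved loss parameters:", phi)
print("distance to left target after training:", abs(theta - left_target))
```

In this toy, the left-half task is a mirror image of the training tasks, so generalization is far easier than in the ant experiment. The sketch is only meant to show the structure: an outer search over loss parameters, an inner loop that learns from that loss alone, and a frozen loss applied to an unseen task.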

The above video demonstrates how our method (left) teaches an ant robot how to walk and reach a target (green circle) from scratch, in comparison with RL2 (right). Top-left text specifies the number of seconds elapsed. Note that this video demonstrates the complete learning process at 3X real-time speed.

Like all metalearning approaches, our method still has many limitations. Right now, we can train an EPG loss to be effective for one small family of tasks at a time, e.g., getting an ant to walk left and right. However, the EPG loss for this family of tasks is unlikely to be at all effective on a wildly different kind of task, like playing Space Invaders. In contrast, standard RL losses _do_ have this level of generality: the same loss function can be used to learn a huge variety of skills. EPG gains performance at the cost of generality. There is a long road ahead toward metalearning methods that both outperform standard RL methods _and_ have the same level of generality.
