Meta-Reinforcement Learning of Structured Exploration Strategies

2018 • Meta Learning • Meta Reinforcement Learning • NeurIPS 2018 • Reinforcement Learning • Structured Exploration • AI • Exploration • NeurIPS • RL

08 Jun 2019

Introduction

The paper looks at the problem of learning structured exploration policies for training RL agents.
Link to the paper

Structured Exploration

Consider a stochastic, parameterized policy π_θ(a|s) where θ represents the policy-parameters.
To encourage exploration, noise can be added to the policy at each time step t. But the noise added in such a manner does not have any notion of temporal coherence.
Another issue is that if the policy is represented by a simple distribution (say parameterized unimodal Gaussian), it can not model complex time-correlated stochastic processes.
The paper proposes to condition the policy on per-episode random variables (z) which are sampled from a learned latent distribution.
Consider a distibution over the tasks p(T). At the start of any episode of the i^th task, a latent variable z_i is sampled from the distribution N(μ_i, σ_i) where μ_i and σ_i are the learned parameters of the distribution and are referred to as the variation parameters.
Once sampled, the same z_i is used to condition the policy for as long as the current episode lasts and the action is sampled from then distribution π_θ(a|s, z_i).
The intuition is that the latent variable z_i would encode the notion of a task or goal that does not change arbitrarily during the episode.

Model Agnostic Exploration with Structured Noise

The paper focuses on the setting where the structured exploration policies are to be learned while leveraging the learning from prior tasks.
A meta-learning approach, called as model agnostic exploration with structured noise (MAESN) is proposed to learn a good initialization of the policy-parameters and to learn a latent space (for sampling the z from) that can inject structured stochasticity in the policy.
General meta-RL approaches have two limitations when it comes to “learning to explore”:
- Casting meta-RL problems as RL problems lead to policies that do not exhibit sufficient variability to explore effectively.
- Many current approaches try to meta-learn the entire learning algorithm which limits the asymptotic performance of the model.
Idea behind MAESN is to meta-train policy-parameters so that they learn to use the task-specific latent variables for exploration and can quickly adapt to a new task.
An important detail is that the parameters are optimized to maximize the expected rewards after one step of gradient update to ensure that the policy uses the latent variables for exploration.
For every iteration of meta-training, an “inner” gradient update is performed on the variational parameters and the post-inner-update parameters are used to perform the meta-update.
The authors report that performing the “inner” gradient update on the policy-parameters does not help the overall learning objective and that the step size for each parameter had to be meta-learned.
The variation parameters have the usual KL divergence loss which encourages them to be close to the prior distribution (unit Gaussian in this case).
After training, the variational parameters for each task are quite close to the prior probably because the training objective optimizes for the expected reward after one step of gradient descent on the variational parameters.
Another implementation detail is that reward shaping is used to ensure that the policy gets useful signal during meta-training. To be fair to the baselines, reward shaping is used while training baselines as well. Moreover, the policies trained with reward shaping generalizes to sparse reward setup as well (during meta-test time).

Experiments

Three tasks distributions: Robotic Manipulation, Wheeled Locomotion, and Legged Locomotion. Each task distribution has 100 meta-training tasks.
In the Manipulation task distribution, the learner has to push different blocks from different positions to different goal positions. In the Locomotion task distributions, the different tasks correspond to the different goal positions.
The experiments show that the proposed approach can adapt to new tasks quickly and the learn coherent exploration strategy.

• In some cases, learning from scratch also provides a strong asymptotic performance although learning from scratch takes much longer.

Papers I Read Notes and Summaries

Meta-Reinforcement Learning of Structured Exploration Strategies

Introduction

Structured Exploration

Model Agnostic Exploration with Structured Noise

Experiments

Related Posts

Toolformer - Language Models Can Teach Themselves to Use Tools 10 Feb 2023

Synthesized Policies for Transfer and Adaptation across Tasks and Environments 29 Mar 2021

Deep Neural Networks for YouTube Recommendations 22 Mar 2021