Meta-Reinforcement Learning of Structured Exploration Strategies
08 Jun 2019

Introduction

The paper looks at the problem of learning structured exploration policies for training RL agents.

Link to the paper
Structured Exploration

Consider a stochastic, parameterized policy π_{θ}(a|s) where θ represents the policy parameters.

To encourage exploration, noise can be added to the policy at each time step t. But the noise added in such a manner does not have any notion of temporal coherence.

Another issue is that if the policy is represented by a simple distribution (say, a parameterized unimodal Gaussian), it cannot model complex time-correlated stochastic processes.

The paper proposes to condition the policy on per-episode random variables (z) which are sampled from a learned latent distribution.

Consider a distribution over the tasks p(T). At the start of any episode of the i^{th} task, a latent variable z_{i} is sampled from the distribution N(μ_{i}, σ_{i}) where μ_{i} and σ_{i} are the learned parameters of the distribution and are referred to as the variational parameters.

Once sampled, the same z_{i} is used to condition the policy for as long as the current episode lasts, and the action is sampled from the distribution π_{θ}(a|s, z_{i}).

The intuition is that the latent variable z_{i} would encode the notion of a task or goal that does not change arbitrarily during the episode.
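The per-episode conditioning can be sketched as follows. This is a minimal illustration, not the paper's architecture: the dimensions, the linear policy, and the noise scale are all placeholder assumptions (the paper uses a neural-network policy trained with policy gradients).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
LATENT_DIM, OBS_DIM, ACT_DIM = 2, 4, 2

# Stand-ins for the learned variational parameters of task i.
mu_i = np.zeros(LATENT_DIM)
log_sigma_i = np.zeros(LATENT_DIM)  # log std, so sigma = exp(log_sigma) > 0

# A stand-in linear policy over the concatenated (observation, latent) input.
W = rng.normal(scale=0.1, size=(ACT_DIM, OBS_DIM + LATENT_DIM))

def run_episode(obs_sequence):
    # Sample z_i once at the start of the episode...
    z = mu_i + np.exp(log_sigma_i) * rng.standard_normal(LATENT_DIM)
    actions = []
    for obs in obs_sequence:
        # ...and condition every action on the same z, so the stochasticity
        # injected through z stays coherent across the whole episode.
        mean_action = W @ np.concatenate([obs, z])
        # Small per-step action noise keeps the policy itself stochastic.
        actions.append(mean_action + 0.01 * rng.standard_normal(ACT_DIM))
    return np.array(actions)

episode = run_episode([rng.standard_normal(OBS_DIM) for _ in range(5)])
```

Contrast this with adding independent noise at every time step: there, nothing ties the perturbations at step t and step t+1 together, whereas here the shared z acts like a per-episode "intention".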
Model Agnostic Exploration with Structured Noise

The paper focuses on the setting where the structured exploration policies are to be learned while leveraging the learning from prior tasks.

A meta-learning approach, called model agnostic exploration with structured noise (MAESN), is proposed to learn a good initialization of the policy parameters and to learn a latent space (for sampling z) that can inject structured stochasticity into the policy.

General meta-RL approaches have two limitations when it comes to “learning to explore”:
• Casting meta-RL problems as RL problems leads to policies that do not exhibit sufficient variability to explore effectively.
• Many current approaches try to meta-learn the entire learning algorithm, which limits the asymptotic performance of the model.

The idea behind MAESN is to meta-train the policy parameters so that they learn to use the task-specific latent variables for exploration and can quickly adapt to a new task.

An important detail is that the parameters are optimized to maximize the expected reward after one step of gradient update, which ensures that the policy actually uses the latent variables for exploration.

For every iteration of meta-training, an “inner” gradient update is performed on the variational parameters, and the post-inner-update parameters are used to perform the meta-update.
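A toy sketch of this inner/outer structure is below. The quadratic surrogate reward, the fixed step size, and all names are illustrative assumptions; MAESN estimates the inner and outer gradients with policy-gradient rollouts rather than a closed-form objective.

```python
import numpy as np

# Toy differentiable surrogate for a task's expected reward as a function
# of the variational mean mu (here, distance to a per-task goal).
def reward(mu, task_goal):
    return -np.sum((mu - task_goal) ** 2)

def reward_grad(mu, task_goal):
    return -2.0 * (mu - task_goal)

alpha = 0.1  # inner step size (meta-learned per parameter in the paper)

def post_update_reward(mu, task_goal):
    # "Inner" gradient ascent step on the variational parameters only;
    # the policy parameters are held fixed in the inner loop.
    mu_adapted = mu + alpha * reward_grad(mu, task_goal)
    # The meta-objective is the reward *after* this adaptation step.
    return reward(mu_adapted, task_goal)

tasks = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
mu = np.zeros(2)  # shared initialization across tasks
pre = np.mean([reward(mu, g) for g in tasks])
post = np.mean([post_update_reward(mu, g) for g in tasks])
```

The meta-update would then ascend the gradient of `post` with respect to the initialization (and the policy parameters), i.e. optimize for performance after adaptation rather than before it.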

The authors report that performing the “inner” gradient update on the policy parameters does not help the overall learning objective, and that the step size for each parameter had to be meta-learned.

The variational parameters have the usual KL divergence loss, which encourages them to stay close to the prior distribution (a unit Gaussian in this case).

After training, the variational parameters for each task remain quite close to the prior, probably because the training objective optimizes for the expected reward after one step of gradient descent on the variational parameters.
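For a diagonal Gaussian and a unit-Gaussian prior, this KL penalty has the standard closed form, sketched below (the function name is my own; the formula is the usual one for KL between Gaussians):

```python
import numpy as np

def kl_to_unit_gaussian(mu, log_sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.

    Per dimension: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2),
    with log sigma^2 = 2 * log_sigma.
    """
    sigma2 = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum(sigma2 + mu**2 - 1.0 - 2.0 * log_sigma)
```

At the prior itself (mu = 0, sigma = 1) the penalty is exactly zero, which is consistent with the observation that the trained variational parameters sit near the prior.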

Another implementation detail is that reward shaping is used to ensure that the policy gets a useful signal during meta-training. To be fair to the baselines, reward shaping is used while training the baselines as well. Moreover, the policies trained with reward shaping generalize to the sparse-reward setup as well (at meta-test time).
Experiments

Three task distributions are considered: Robotic Manipulation, Wheeled Locomotion, and Legged Locomotion. Each task distribution has 100 meta-training tasks.

In the Manipulation task distribution, the learner has to push different blocks from different positions to different goal positions. In the Locomotion task distributions, the different tasks correspond to different goal positions.

The experiments show that the proposed approach adapts to new tasks quickly and learns coherent exploration strategies.
• In some cases, learning from scratch also achieves strong asymptotic performance, although it takes much longer to get there.