Papers I Read: Notes and Summaries

Hindsight Experience Replay

Introduction

  • Hindsight Experience Replay (HER) is a sample-efficient technique for learning from sparse rewards.

  • Link to the paper: https://arxiv.org/abs/1707.01495

Idea

  • Assume a footballer narrowly misses the goal. Even though the player does not get any “reward” (in terms of a goal), the player realizes that had the goal post been shifted a bit, the same shot would have resulted in a goal (reward).

  • The same intuition applies to the RL agent - say the true goal state is g while the agent ends up in state s.

  • While the action sequence is not useful for reaching the goal state g, it is indeed useful for reaching state s. Hence, the trajectory can be replayed with the goal set to s (instead of g), as sketched below.
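
A minimal sketch of this relabeling step, assuming a simple `Transition` record and the sparse 0/-1 reward that goal-reaching tasks typically use (both are illustrative choices, not fixed by the paper):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Transition:
    state: tuple
    action: tuple
    next_state: tuple
    goal: tuple
    reward: float

def sparse_reward(achieved, goal):
    # Sparse signal: 0 when the goal is reached, -1 otherwise.
    return 0.0 if achieved == goal else -1.0

def relabel_with_hindsight(trajectory):
    """Replay a failed trajectory as if the state it actually reached
    had been the goal all along."""
    achieved = trajectory[-1].next_state  # the state s the agent ended up in
    return [
        replace(t, goal=achieved, reward=sparse_reward(t.next_state, achieved))
        for t in trajectory
    ]
```

Both the original transitions (with goal g) and the relabeled copies (with goal s) are added to the replay buffer.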

Technical Details

  • A multi-goal policy is trained using Universal Value Function Approximators (UVFA).

  • Every episode starts by sampling a start state and a goal state. Each goal has a different reward function.

  • The policy takes both the current state and the current goal as input, producing a state transition sequence s1, s2, …, sn (see the actor sketch after this list).

  • Each transition si -> si+1 is stored in the replay buffer with both the original goal and a subset of other goals.

  • For selecting these replay goals, the following strategies are tried (see the sampling sketch after this list):

    • Future - k random states from the same episode that were observed after the transition.

    • Final - goal state is the final state of the current episode.

    • Episode - k random states are selected from the current episode.

    • Random - k random states encountered so far over the whole training procedure.

  • Any off-policy RL algorithm can be used; specifically, the paper uses DDPG.
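
One way to picture the multi-goal policy: a DDPG-style actor that conditions on the goal simply by concatenating it with the state at the input, in the spirit of UVFA. A minimal PyTorch sketch; the layer sizes and architecture are assumptions:

```python
import torch
import torch.nn as nn

class GoalConditionedActor(nn.Module):
    """DDPG-style actor that conditions on the goal by concatenating it
    with the state, as in UVFA."""
    def __init__(self, state_dim, goal_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # actions in [-1, 1]
        )

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))
```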
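
The four strategies amount to different ways of drawing replay goals from achieved states. A sketch under the assumption that `achieved_in_episode` lists the per-step achieved states of the current episode and `all_achieved` pools achieved states across training (hypothetical names):

```python
import random

def sample_replay_goals(strategy, achieved_in_episode, t, all_achieved, k=4):
    """Pick extra goals for replaying the transition at step t of an episode."""
    if strategy == "future":
        # Random achieved states from the same episode, observed after step t.
        candidates = achieved_in_episode[t + 1:]
    elif strategy == "final":
        # The single final achieved state of the current episode.
        return [achieved_in_episode[-1]]
    elif strategy == "episode":
        # Random achieved states from anywhere in the current episode.
        candidates = achieved_in_episode
    elif strategy == "random":
        # Random achieved states encountered so far in training.
        candidates = all_achieved
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return random.sample(candidates, min(k, len(candidates)))
```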

Experiments

  • A robotic arm, simulated in MuJoCo, is used for the push, slide, and pick-and-place tasks.

  • DDPG, with and without HER, is evaluated on the three tasks.

  • DDPG with HER significantly outperforms the vanilla DDPG baseline in all cases.