Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
13 Jun 2019
Introduction

The paper proposes a new inverse RL (IRL) algorithm, called Trajectory-ranked Reward EXtrapolation (T-REX), that learns a reward function from a collection of ranked trajectories.

Standard IRL approaches aim to learn a reward function that “justifies” the demonstrations, and hence the policies they recover cannot outperform the demonstration policy.

In contrast, T-REX aims to learn a reward function that “explains” the ranking over demonstrations, and can therefore learn a policy that outperforms the demonstration policy.
Approach

The input is a sequence of trajectories T_{1}, …, T_{m} ranked in order of preference. That is, given any pair of trajectories, we know which of the two is better.

The setup is learning from observations: the agent has access neither to the true reward function nor to the actions taken by the demonstration policy.

Reward Inference

A parameterized reward function r_{θ} is trained on the ranking information using a binary classification loss that predicts which of two given trajectories is ranked higher.

Given a trajectory, the reward function predicts a reward for each state. The sums of predicted rewards over the two trajectories are used to predict the preferred trajectory.
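
A minimal sketch of this reward-inference loss in PyTorch (the network architecture, tensor shapes, and function names are illustrative assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardNet(nn.Module):
    """Small MLP reward model r_theta(s); architecture is illustrative."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states):          # states: (T, obs_dim) tensor of one trajectory
        return self.net(states).sum()   # predicted return = sum of per-state rewards

def trex_loss(reward_net, traj_i, traj_j, i_is_preferred):
    """Binary classification loss: which of the two trajectories is ranked higher?"""
    returns = torch.stack([reward_net(traj_i), reward_net(traj_j)])  # shape (2,)
    label = torch.tensor([0 if i_is_preferred else 1])               # index of preferred trajectory
    # Softmax over the two summed returns, cross-entropy against the ranking label
    return F.cross_entropy(returns.unsqueeze(0), label)
```

During training, many such pairs are drawn from the ranked demonstrations and the loss is minimized with a standard optimizer.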

T-REX uses partial trajectories instead of full trajectories as a data augmentation strategy.
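
A sketch of that augmentation, assuming random equal-length snippets are drawn from each trajectory of a ranked pair (the snippet length and sampling scheme here are assumptions; the paper has its own sampling details):

```python
import random

def sample_snippet_pair(traj_better, traj_worse, snippet_len=50):
    """Draw one random equal-length snippet from each trajectory of a ranked pair."""
    def snippet(traj):
        start = random.randint(0, max(0, len(traj) - snippet_len))
        return traj[start:start + snippet_len]
    return snippet(traj_better), snippet(traj_worse)
```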


Policy Optimization
Once a reward function has been learned, standard RL approaches can be used to train a new policy.
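
For example, the learned reward can stand in for the environment reward via a simple wrapper, after which an off-the-shelf RL algorithm such as PPO can be run unchanged (a sketch using the old Gym step API; the wrapper and the absence of any reward normalization are assumptions, not the paper's code):

```python
import gym
import torch

class LearnedRewardWrapper(gym.Wrapper):
    """Replaces the true environment reward with the learned reward r_theta(s)."""
    def __init__(self, env, reward_net):
        super().__init__(env)
        self.reward_net = reward_net  # any module mapping a state tensor to a scalar reward

    def step(self, action):
        obs, _, done, info = self.env.step(action)  # the true reward is discarded
        with torch.no_grad():
            reward = float(self.reward_net(torch.as_tensor(obs, dtype=torch.float32)))
        return obs, reward, done, info
```
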
Results

Environments: MuJoCo (Half Cheetah, Ant, Hopper) and Atari.

Demonstrations generated using PPO (checkpointed at different stages of training).

An ensemble of networks is used to learn the reward function.
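
A sketch of one way such an ensemble could be combined, assuming the member networks' predictions for a state are simply averaged (the combination rule is an assumption, not taken from the paper):

```python
def ensemble_reward(reward_nets, state):
    """Average the predicted reward of each network in the ensemble for a single state."""
    return sum(float(net(state)) for net in reward_nets) / len(reward_nets)
```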

The proposed approach outperforms the baselines, Behavioral Cloning from Observations (BCO) and Generative Adversarial Imitation Learning (GAIL).

In terms of reward extrapolation, T-REX can predict rewards for trajectories that are better than the demonstration trajectories.

Ablation studies considered the effect of adding noise (randomly swapping the preference between trajectory pairs) and found that the model is robust to moderate amounts of noise.
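
A sketch of that noise model, assuming a fixed fraction of pairwise preference labels is flipped uniformly at random (the exact noise-injection procedure may differ from the paper's):

```python
import random

def corrupt_preferences(pairs, noise_frac=0.15):
    """Randomly swap the preference for a fraction of (better, worse) trajectory pairs."""
    return [(worse, better) if random.random() < noise_frac else (better, worse)
            for (better, worse) in pairs]
```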