Deep Reinforcement Learning and the Deadly Triad
31 Aug 2020

Introduction
- The paper investigates the practical impact of the deadly triad (function approximation, bootstrapping, and off-policy learning) in deep Q-networks trained with experience replay.
- The deadly triad is so called because when all three components are combined, TD learning can diverge and value estimates can become unbounded.
- In practice, however, the components of the deadly triad have been combined successfully, for example when training DQN agents to play Atari games.
Setup
- The effect of each component of the triad can be regulated with some design choices:
  - Bootstrapping - by controlling the number of steps before bootstrapping.
  - Function approximation - by controlling the size of the neural network.
  - Off-policy learning - by controlling how data points are sampled from the replay buffer (i.e., using different prioritization approaches).
- The problem is studied in two contexts: a toy example and Atari 2600 games.
- The paper makes several hypotheses about how the different components of the triad may interact and evaluates them by training DQN with different hyperparameters (the resulting sweep is sketched at the end of this section):
  - Number of steps before bootstrapping - 1, 3, or 10.
  - Four levels of prioritization (for sampling data from the replay buffer).
  - Bootstrap target - Q-learning, target Q-learning, inverse double Q-learning, and double Q-learning.
  - Network size - small, medium, large, and extra-large.
- Each experiment was run with three different seeds.
- The paper formulates a series of hypotheses and designs experiments to support or reject them.
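To make the experimental grid concrete, here is a minimal sketch of the sweep. The knob names and the concrete prioritization values are my assumptions (the paper only states that four levels of prioritization were used); the grid structure itself follows the list above.

```python
from itertools import product

# Illustrative sketch of the swept settings (names and the prioritization
# values are assumptions, not taken from the paper's code). The full sweep is
# the Cartesian product, and each combination is run with three seeds.
SWEEP = {
    "multi_step": [1, 3, 10],                       # steps before bootstrapping
    "prioritization": [0.0, 0.4, 0.7, 1.0],         # four levels (exact values assumed)
    "bootstrap_target": ["q", "target_q", "inverse_double_q", "double_q"],
    "network_size": ["small", "medium", "large", "extra_large"],
    "seed": [0, 1, 2],
}

configs = [dict(zip(SWEEP, values)) for values in product(*SWEEP.values())]
print(len(configs))  # 3 * 4 * 4 * 4 * 3 = 576 runs in this sketch
```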
Hypothesis 1: Combining Q-learning with conventional deep RL function spaces does not commonly lead to divergence
- Rewards are clipped to [-1, 1] and the discount factor is set to 0.99. Hence, the maximum absolute action value is bounded by 1/(1 - 0.99) = 100. This upper bound is used to detect soft-divergence in the value estimates (see the check sketched below).
- The paper reports that while soft-divergence does occur, the values do not become unbounded, thus supporting the hypothesis.
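The bound follows directly from the reward clipping and the discount factor. A minimal sketch of the arithmetic and of a hypothetical soft-divergence check (the helper name is mine):

```python
# With rewards clipped to [-1, 1] and gamma = 0.99, true action values satisfy
# |Q| <= r_max / (1 - gamma) = 1 / 0.01 = 100.
GAMMA = 0.99
R_MAX = 1.0
MAX_ABS_Q = R_MAX / (1.0 - GAMMA)  # = 100.0

def soft_diverged(q_estimate: float, threshold: float = MAX_ABS_Q) -> bool:
    """Hypothetical check: flag estimates that exceed the largest value any
    true action value can take under clipping and discounting."""
    return abs(q_estimate) > threshold

print(MAX_ABS_Q)             # 100.0
print(soft_diverged(250.0))  # True: no realisable return can be this large
```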
Hypothesis 2: There is less divergence when correcting for overestimation bias or when bootstrapping on separate networks.
- One manifestation of bootstrapping on separate networks is target Q-learning. While using a separate target network helps on Atari, it does not entirely solve the problem in the toy setup.
- One manifestation of correcting for the overestimation bias is using double Q-learning.
- In its standard form, double Q-learning also benefits from bootstrapping on a separate network. To isolate the gains from each component, an inverse double Q-learning update is used that corrects for overestimation but does not use a separate target network for bootstrapping (all four bootstrap targets are sketched after this list).
- Experimentally, Q-learning is the most unstable, while target Q-learning and double Q-learning are the most stable. This observation supports the hypothesis.
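The four bootstrap targets differ in which network selects the maximizing action at the next state and which network evaluates it. A sketch based on my reading of the paper (function and variable names are mine); q_online and q_target are the action-value vectors at the next state from the online and target networks.

```python
import numpy as np

def bootstrap_target(kind, r, q_online, q_target, gamma=0.99):
    """Sketch of the four bootstrap targets for a single transition (r, s')."""
    if kind == "q":                 # Q-learning: select and evaluate with the online network
        return r + gamma * np.max(q_online)
    if kind == "target_q":          # target Q-learning: select and evaluate with the target network
        return r + gamma * np.max(q_target)
    if kind == "double_q":          # double Q-learning: select with online, evaluate with target
        return r + gamma * q_target[np.argmax(q_online)]
    if kind == "inverse_double_q":  # inverse double Q-learning: select with target, evaluate with online
        return r + gamma * q_online[np.argmax(q_target)]
    raise ValueError(kind)

# Example: the two double variants decouple action selection from evaluation
# (correcting overestimation); only target and double Q bootstrap on a separate network.
r, q_o, q_t = 0.5, np.array([1.0, 2.0]), np.array([1.5, 0.5])
for kind in ("q", "target_q", "double_q", "inverse_double_q"):
    print(kind, bootstrap_target(kind, r, q_o, q_t))
```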
Hypothesis 3: Longer multi-step returns will diverge less easily
- This hypothesis is intuitive, as the dependence on bootstrapping is reduced with multi-step returns (the n-step target is sketched after this list).
- Experimental results support this hypothesis.
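A minimal sketch of the n-step bootstrap target, to make the reduced reliance on bootstrapping concrete; names and numbers are illustrative.

```python
def n_step_target(rewards, q_boot, gamma=0.99):
    """n-step bootstrap target: the sum of n discounted observed rewards plus a
    single bootstrapped value at step n. With larger n the bootstrapped term is
    discounted by gamma**n, so the target leans more on real rewards."""
    n = len(rewards)
    discounted = sum(gamma**k * r for k, r in enumerate(rewards))
    return discounted + gamma**n * q_boot

rewards = [1.0, 0.0, 1.0]  # observed (clipped) rewards
q_boot = 50.0              # bootstrap value, e.g. max_a Q(s_{t+n}, a)
print(n_step_target(rewards[:1], q_boot))  # 1-step: 1.0 + 0.99 * 50
print(n_step_target(rewards, q_boot))      # 3-step: bootstrap term weighted by 0.99**3
```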
Hypothesis 4: Larger, higher-capacity networks will diverge less easily.
- This hypothesis is based on the assumption that more flexible value function approximations may behave more like the tabular case.
- In practice, smaller networks show fewer instances of instability than larger networks.
- The hypothesis is not supported by the experiments.
Hypothesis 5: Stronger prioritization of updates will diverge more easily.
- This hypothesis is supported by the experiments for all four updates (Q-learning, target Q-learning, inverse double Q-learning, and double Q-learning).
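The degree of prioritization controls how far sampling from the replay buffer departs from uniform, and hence how off-policy the updates are. A sketch in the style of prioritized experience replay, assuming priorities proportional to a power of the absolute TD error; the exact scheme and exponent values are my assumption, not the paper's.

```python
import numpy as np

def sampling_probs(td_errors, alpha):
    """Prioritized-replay-style sampling probabilities (sketch).

    alpha = 0 recovers uniform sampling; larger alpha samples high-error
    transitions more aggressively, which is the knob this hypothesis is about."""
    priorities = (np.abs(td_errors) + 1e-6) ** alpha
    return priorities / priorities.sum()

td_errors = np.array([0.1, 0.5, 2.0, 4.0])
for alpha in (0.0, 0.5, 1.0):  # illustrative levels, not the paper's exact settings
    print(alpha, np.round(sampling_probs(td_errors, alpha), 3))
```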
Effect of the deadly triad on the agent’s performance
- Generally, soft-divergence correlates with poor control performance.
- For example, longer multi-step returns lead to fewer instances of instability and better performance.
- The trend is more interesting in terms of network capacity: large networks tend to diverge more but also perform the best.
- While action-value estimates can grow to large values, they can recover to plausible values as training progresses.