Revisiting Fundamentals of Experience Replay
07 Sep 2020

Introduction
- The paper presents an extensive study of the effects of experience replay in Q-learning based methods.
- It focuses explicitly on the replay capacity and the replay ratio (ratio of learning updates to experience collected).
Setup
- Replay capacity is defined as the total number of transitions stored in the replay buffer.
- Age of a transition (stored in the replay buffer) is defined as the number of gradient steps taken by the agent since the transition was stored.
- The larger the replay capacity, the greater the age of the oldest transition (also referred to as the age of the oldest policy).
- The larger the replay capacity, the greater the degree of "off-policyness" of the transitions in the buffer (with everything else held constant).
- Replay ratio is the number of gradient updates per environment transition. This ratio can be used as a proxy for how often the agent uses old data (vs. collecting new data) and is related to off-policyness.
- In the DQN paper, the replay ratio is 0.25 (one gradient update for every four environment transitions).
- For the experiments, a subset of 14 games is selected from the Atari ALE (Arcade Learning Environment) benchmark, with sticky actions enabled.
- Each experiment is repeated with three seeds.
- Rainbow is used as the base algorithm.
- The total number of gradient updates and the batch size (per gradient update) are fixed for all the experiments.
- Rainbow uses a replay capacity of 1M transitions and an oldest policy of age 250K gradient steps (see the sketch after this list).
- In the experiments, the replay capacity varies from 0.1M to 10M (5 values), and the age of the oldest policy varies from 25K to 25M (4 values).
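Once the buffer is full, the age of the oldest policy is roughly the replay capacity multiplied by the replay ratio, which recovers Rainbow's 250K figure from its 1M capacity and 0.25 replay ratio. Below is a minimal sketch of this bookkeeping; the function and variable names are illustrative, not taken from the paper's code.

```python
# Minimal sketch of the relationship between replay capacity, replay ratio,
# and the age of the oldest policy. Names are hypothetical, not the paper's code.

def oldest_policy_age(replay_capacity: int, replay_ratio: float) -> float:
    """Approximate age (in gradient steps) of the oldest transition in a full
    FIFO buffer: each environment transition is accompanied by `replay_ratio`
    gradient updates, so the oldest of the `replay_capacity` stored transitions
    is roughly `replay_capacity * replay_ratio` gradient steps old."""
    return replay_capacity * replay_ratio

# Rainbow's defaults: 1M capacity at a replay ratio of 0.25 -> 250K oldest policy age.
assert oldest_policy_age(1_000_000, 0.25) == 250_000
```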
Observations
- With the age of the oldest policy fixed, performance improves with higher replay capacity, probably due to increased state-action coverage.
- With fixed replay capacity, reducing the oldest policy's age improves performance, probably due to the reduced off-policyness of the data in the replay buffer.
- However, in some specific instances (sparse-reward, hard-exploration games), performance can drop when the oldest policy's age is reduced.
- Increasing the replay capacity while keeping the replay ratio fixed provides varying improvements, depending on the particular values of replay capacity and replay ratio.
- The paper reports the effect of these choices for DQN as well.
- Unlike Rainbow, DQN does not improve with larger replay capacity, irrespective of whether the replay ratio or the age of the oldest policy is kept fixed.
- Given that the Rainbow agent is a DQN agent with additional components, the paper explores which of these components leads to an improvement in Rainbow's performance as replay capacity increases.
Additive Experiments
- Four new DQN variants are created by adding each of Rainbow's four components (prioritized experience replay, n-step returns, the Adam optimizer, and C51) to the base DQN agent, one at a time.
- DQN with n-step returns is the only variant that benefits from increased replay capacity (a sketch of the n-step target follows this list).
- The usefulness of n-step returns is further validated by the complementary ablation: a Rainbow agent without n-step returns does not benefit from increased replay capacity, while removing any of the other components still leaves an agent that does benefit from the increased capacity.
- Prioritized Experience Replay does not significantly affect performance as replay capacity increases.
- The observation that n-step returns are critical for taking advantage of larger replay sizes is surprising because uncorrected n-step returns are theoretically not suitable for off-policy learning.
- The paper tests the limits of increasing replay capacity (with n-step returns) in an offline RL setup: an agent collects a dataset of about 200M frames, and these frames are then used to train another agent.
- Even in this extreme setup, n-step returns improve the learning agent's performance.
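For concreteness, here is a minimal sketch of the uncorrected n-step target that such an agent regresses towards; the function name and arguments are illustrative, not the paper's code.

```python
from typing import Sequence

def n_step_target(rewards: Sequence[float],
                  bootstrap_value: float,
                  gamma: float = 0.99) -> float:
    """Uncorrected n-step return: the next n discounted rewards plus a
    discounted bootstrap value, with no off-policy correction applied.

    rewards: r_t, ..., r_{t+n-1} as stored along the sampled trajectory.
    bootstrap_value: max_a Q_target(s_{t+n}, a) from the target network.
    """
    n = len(rewards)
    discounted_rewards = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted_rewards + gamma ** n * bootstrap_value

# With n = 1 this reduces to the usual 1-step DQN target r_t + gamma * max_a Q(s_{t+1}, a).
print(n_step_target([1.0], bootstrap_value=2.0))  # 1.0 + 0.99 * 2.0 = 2.98
```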
Why do n-step returns help?
- Hypothesis 1: n-step returns help to counter the increased off-policyness produced by a larger replay buffer.
- This hypothesis does not seem to hold, as keeping the oldest policy fixed or using the same contractive factor as an n-step update does not improve the 1-step update's performance.
- Hypothesis 2: Increasing the replay buffer's capacity may reduce the variance of the n-step returns (see the variance sketch after this list).
- This hypothesis is evaluated by training on environments with lower variance or by turning off sticky actions in the Atari domain.
- While the hypothesis explains some of the gains from using n-step returns, n-step gains are observed even in environments with low variance.
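A rough way to see why n-step targets are noisier (a back-of-the-envelope sketch, not an equation from the paper): assuming per-step reward noise with variance $\sigma^2$ that is roughly independent across steps, the uncorrected n-step target

$$G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \max_a Q(s_{t+n}, a)$$

carries reward noise of variance about $\sum_{k=0}^{n-1} \gamma^{2k} \sigma^2$, which grows with $n$, whereas the 1-step target contains only a single reward term. A larger replay buffer offers more (and more diverse) samples over which this extra noise can average out, which is the intuition this hypothesis appeals to.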