Revisiting Fundamentals of Experience Replay
07 Sep 2020Introduction
- 
    The paper presents an extensive study of the effects of experience replay in Q-learning based methods. 
- 
    It focuses explicitly on the replay capacity and replay ratio (ratio of learning updates to experience collected). 
Setup
- 
    Replay capacity is defined as the total number of transitions stored in the replay buffer. 
- 
    Age of a transition (stored in the replay buffer) is defined as the number of gradient steps taken by the agent since the transition was stored. 
- 
    More is the replay capacity, more will be the age of the oldest transition (also referred to as the age of the oldest policy). 
- 
    More is the replay capacity, more will be the degree of “off-policyness” of the transitions in the buffer (with everything else held constant). 
- 
    Replay ratio is the number of gradient updates per environment transition. This ratio can be used as a proxy for how often the agent uses old data (vs. collecting new data) and is related to off-policyness. 
- 
    In DQN paper, the replay ratio is set to be 0.25. 
- 
    For experiments, a subset (of 14 games) is selected from Atari ALE (Arcade Learning Environment) with sticky actions. 
- 
    Each experiment is repeated with three seeds. 
- 
    Rainbow is used as the base algorithm. 
- 
    Total number of gradient updates and batch size (per gradient update) are fixed for all the experiments. 
- 
    Rainbow used replay capacity of 1M and oldest policy of age 250K. 
- 
    In experiments, replay capacity varies from 0.1M to 10M ( 5 values), and the age of the oldest policy varies from 25K to 25M (4 values). 
Observations
- 
    With the age of the oldest policy fixed, performance improves with higher replay capacity, probably due to increased state-action coverage. 
- 
    With fixed replay capacity, reducing the oldest policy’s age improves performance, probably due to the reduced off-policyness of the data in the replay buffer. 
- 
    However, in some specific instances (with sparse reward, hard exploration setup), performance can drop when reducing the oldest policy’s age. 
- 
    Increasing replay capacity, while keeping the replay ratio fixed, provides varying improvements and depends on the particular values of replacy capacity and replay ratio. 
- 
    The paper reports the effect of these choices for DQN as well. 
- 
    Unlike Rainbow, DQN does not improve with larger replay capacity, irrespective of whether the replay ratio or age of the oldest policy is kept fixed. 
- 
    Given that the Rainbow agent is a DQN agent with additional components, the paper explores which of these components leads to an improvement in Rainbow’s performance as replay capacity increases. 
Additive Experiments
- 
    Four new DQN variants are created by adding each of Rainbow’s four components to the base DQN agent. 
- 
    DQN with n-step returns is the only variant that benefits by increased replay capacity. 
- 
    The usefulness of n-step returns is further validated by verifying that Rainbow agent without n-step returns does not benefit by increased replay capacity. While Rainbow agent without any other component benefits by the increased capacity. 
- 
    Prioritized Experience Replay does not significantly affect the performance with increased replay capacity. 
- 
    The observation that n-step returns are critical for taking advantage of larger replay sizes is surprising because the uncorrected n-step returns are theoretically not suitable for off-policy learning. 
- 
    The paper tests the limits of increasing replay capacity (with n-step returns) by performing experiments in the offline-RL setup, the agent collects a dataset of about 200M frames. These frames are used to train another agent. 
- 
    Even in this extreme setup, n-step returns improve the learning agent’s performance. 
Why do n-step returns help?
- 
    Hypothesis 1: n-step returns help to counter the increased off-policyness produced by a larger replay buffer. - This hypothesis does not seem to hold as keeping the oldest policy fixed or using the same contrastive factor as an n-step update does not improve the 1-step update’s performance.
 
- 
    Hypothesis 2: Increasing the replay buffer’s capacity may reduce the variance of the n-step returns. - 
        This hypothesis is evaluated by training on environments with lesser variance or by turning off the sticky actions in the atari domain. 
- 
        While the hypothesis does explain the gains by using n-step returns to some extent, n-step gains are observed even in environments with low variance. 
 
-