
Observational Overfitting in Reinforcement Learning

Introduction

  • The paper studies observational overfitting: the phenomenon where an agent overfits to different observation spaces even though the underlying MDP remains fixed.

  • Unlike other works, the “background information” (in the pixel space) is correlated with the progress of the agent (and is not just noise).

  • Link to the paper

Setup

  • Base MDP $M = (S, A, R, T)$ where $S$ is the state space, $A$ is the action space, $R$ is the reward function, and $T$ is the transition dynamics.

  • $M$ is parameterized using $\theta$. In practice, this means introducing an observation function $\phi_{\theta}$, i.e., $M_{\theta} = (M, \phi_{\theta})$.

  • A distribution over $\theta$ defines a distribution over the MDPs.

  • The learning agent has access to the pixel space observations and not the state space observations.

  • The generalization gap is defined as $J_{\theta}(\pi) - J_{\theta^{train}}(\pi)$, where $\pi$ is the learning agent, $\theta$ denotes the distribution over all observation functions, and $\theta^{train}$ denotes the distribution over the observation functions corresponding to the training environments. $J_{\theta}(\pi)$ is the average reward the agent obtains over environments sampled from $M_{\theta}$.

  • $\phi_{\theta}$ produces two kinds of features - generalizable features (invariant across $\theta$) and non-generalizable features (dependent on $\theta$), i.e., $\phi_{\theta}(s) = concat(f(s), g_{\theta}(s))$, where $f$ is the invariant function and $g_{\theta}$ is the non-generalizable function (see the sketch after this list).

  • The problem is set up such that “explicit regularization” can easily solve it. The focus is on understanding the effect of “implicit regularization”.
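
To make the setup concrete, here is a minimal numpy sketch of the ingredients above. The linear map used for $g_{\theta}$, the seed-based sampling of $\theta$, and the `avg_return` stub (which would roll out $\pi$ in $M_{\theta}$ and return its average reward) are illustrative assumptions, not the paper's code.

```python
import numpy as np

d_state = 100  # illustrative state dimension


def f(s):
    """Generalizable part: invariant across all theta (placeholder choice)."""
    return s


def make_phi(theta_seed, d_noise=1000):
    """Sample one observation function phi_theta(s) = concat(f(s), g_theta(s))."""
    W_theta = np.random.default_rng(theta_seed).standard_normal((d_noise, d_state))

    def g_theta(s):
        return W_theta @ s  # non-generalizable part: depends on theta

    return lambda s: np.concatenate([f(s), g_theta(s)])


def generalization_gap(avg_return, all_thetas, train_thetas):
    """J_theta(pi) - J_theta_train(pi): average return over environments sampled
    from the full distribution minus the average over the training environments."""
    return (np.mean([avg_return(make_phi(t)) for t in all_thetas])
            - np.mean([avg_return(make_phi(t)) for t in train_thetas]))
```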

Experiments

Overparameterized LQR

  • LQR is used as a proxy for deep RL architectures because it offers advantages such as enabling exact gradient descent.

  • The functions are parameterized as follows:

    • $f(s) = W_c s$

    • $g_{\theta}(s) = W_{\theta} s$

  • Observation at time $t$, $o_t$, is obtained by stacking the two projections, $o_t = \begin{bmatrix} W_c \\ W_{\theta} \end{bmatrix} s_t$, i.e., the concatenation of $W_c s_t$ and $W_{\theta} s_t$ (see the sketch at the end of this section).

  • Action at time $t$ is given as $a_t = K o_{t}$ where $K$ is the policy matrix.

  • Dimensionality:

    • state $s$: $d_{state} = 100$
    • $f(s)$: $d_{state} = 100$
    • $g_{\theta}(s)$: $d_{noise} = 1000$
    • observation $o$: $d_{state} + d_{noise} = 1100$
  • When training on just one environment, multiple solutions exist, and the policy overfits.

  • Increasing $d_{noise}$ increases the generalization gap.

  • Overparameterizing the network decreases the generalization gap and also reduces the norm of the policy.
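
Below is a hedged sketch of the LQR setup described in this section: the state is projected through a shared matrix $W_c$ and a per-environment matrix $W_{\theta}$, and a linear policy acts on the concatenated observation. The Gaussian sampling of the matrices and the action dimension `d_act` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_noise, d_act = 100, 1000, 10  # d_act is an illustrative choice

# Shared (generalizable) projection, fixed across all environments.
W_c = rng.standard_normal((d_state, d_state))


def sample_W_theta(seed):
    """Per-environment (non-generalizable) projection."""
    return np.random.default_rng(seed).standard_normal((d_noise, d_state))


def observe(s, W_theta):
    """o_t = [W_c; W_theta] s_t, i.e. the two projections of the state stacked."""
    return np.concatenate([W_c @ s, W_theta @ s])


# Linear policy over the (d_state + d_noise)-dimensional observation: a_t = K o_t.
K = np.zeros((d_act, d_state + d_noise))


def act(o):
    return K @ o
```

The intent of the construction is that only the $W_c$-part of the observation carries information that transfers across environments; relying on the $W_{\theta}$-part is what shows up as a generalization gap.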

Projected Gym Environments

  • The base MDP is the Gym Environment.

  • $M_{\theta}$ is generated as before.

  • Increasing both width and depth for basic MLPs improves generalization (see the sketch after this list).

  • Generalization also depends on the choice of activation function, residual layers, etc.
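
As a rough illustration of what sweeping the width and depth of a basic MLP policy could look like, here is a PyTorch sketch; the framework choice, layer sizes, and dimensions are assumptions, not the paper's architecture.

```python
import torch.nn as nn


def mlp_policy(obs_dim, act_dim, width=64, depth=2, activation=nn.Tanh):
    """Basic MLP policy; generalization is probed by sweeping width and depth,
    and by swapping the activation (e.g. Tanh vs. ReLU) or adding residual layers."""
    layers, in_dim = [], obs_dim
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), activation()]
        in_dim = width
    layers.append(nn.Linear(in_dim, act_dim))
    return nn.Sequential(*layers)


# A wider and deeper variant of the same policy (illustrative sizes):
big_policy = mlp_policy(obs_dim=1100, act_dim=10, width=512, depth=4)
```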

Deconvolutional Projections

  • In the Gym environment, the actual state is projected to a larger vector and reshaped into an 84x84 tensor (image).

  • The image from $f$ is concatenated with the image from $g_{\theta}$; this setup is referred to as Gym-Deconv (see the sketch after this list).

  • The ranking of NatureCNN, IMPALA, and IMPALA-Large by performance (on both CoinRun and Gym-Deconv) matches their ranking by parameter count.

  • In an ablation, the policy is given access to only $g_{\theta}(s)$, which makes it impossible for the model to generalize. In this test of memorization capacity, implicit regularization seems to reduce the memorization effect.
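
A rough numpy sketch of how such an image-based observation could be assembled, assuming a single linear projection per image and channel-wise stacking of the $f$- and $g_{\theta}$-images (the exact projection network and concatenation axis are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, img_hw = 100, 84  # illustrative state dimension; 84x84 image as in the notes

# Fixed (generalizable) projection to image space; a per-environment W_g_theta
# of the same shape would be resampled for every theta.
W_f = rng.standard_normal((img_hw * img_hw, d_state))


def render(s, W):
    """Project the state to a larger vector and reshape it into an 84x84 'image'."""
    return (W @ s).reshape(img_hw, img_hw)


def gym_deconv_observation(s, W_g_theta):
    """Gym-Deconv observation: the f-image and the g_theta-image stacked as two channels."""
    return np.stack([render(s, W_f), render(s, W_g_theta)], axis=0)  # shape (2, 84, 84)
```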

Overparameterization in CoinRun

  • The pixel-space observation in CoinRun is downsized from 64x64 to 32x32 and flattened into a vector (see the sketch after this list).

  • In CoinRun, the dynamics change per level, and the noisy “irrelevant” features change location across the 1D input, making this setup more challenging than the previous ones.

  • Overparameterization improves generalization in this scenario as well.
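
For concreteness, one way the 64x64-to-32x32 downsizing and flattening could be done (the 2x2 average-pooling choice is an assumption; the notes only specify the input and output sizes):

```python
import numpy as np


def coinrun_flat_obs(frame):
    """Downsize a 64x64 CoinRun frame to 32x32 via 2x2 average pooling, then flatten."""
    assert frame.shape[:2] == (64, 64)
    pooled = frame.reshape(32, 2, 32, 2, -1).mean(axis=(1, 3))  # (32, 32, channels)
    return pooled.reshape(-1)  # vector of length 32 * 32 * channels
```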