Observational Overfitting in Reinforcement Learning
23 Jan 2020

Introduction

The paper studies observational overfitting: the phenomenon where an agent overfits to different observation spaces even though the underlying MDP remains fixed.

Unlike other works, the “background information” (in the pixel space) is correlated with the agent’s progress and is not just noise.
Setup

Base MDP $M = (S, A, R, T)$ where $S$ is the state space, $A$ is the action space, $R$ is the reward function, and $T$ is the transition dynamics.

$M$ is parameterized using $\theta$. In practice, this means introducing an observation function $\phi_{\theta}$, i.e., $M_{\theta} = (M, \phi_{\theta})$.

A distribution over $\theta$ defines a distribution over the MDPs.

The learning agent has access to the pixel space observations and not the state space observations.

The generalization gap is defined as $J_{\theta}(\pi) - J_{\theta^{train}}(\pi)$, where $\pi$ is the learning agent, $\theta$ is the distribution over all the observation functions, $\theta^{train}$ is the distribution over the observation functions corresponding to the training environments, and $J_{\theta}(\pi)$ is the average reward that the agent obtains over environments sampled from $M_{\theta}$.
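
To make the quantities concrete, here is a minimal sketch of how this gap could be estimated by Monte Carlo sampling; `make_env` and `rollout_return` are hypothetical helpers (not from the paper) that build $M_{\theta}$ and compute an episode's return.

```python
import numpy as np

def estimate_J(policy, thetas, make_env, rollout_return, episodes=5):
    """Monte Carlo estimate of J_theta(pi): the average return of `policy`
    over environments M_theta built from the given thetas."""
    returns = [
        rollout_return(policy, make_env(theta))
        for theta in thetas
        for _ in range(episodes)
    ]
    return float(np.mean(returns))

def generalization_gap(policy, train_thetas, all_thetas, make_env, rollout_return):
    """Gap as defined above: J over the full theta distribution minus
    J over the training thetas."""
    return (estimate_J(policy, all_thetas, make_env, rollout_return)
            - estimate_J(policy, train_thetas, make_env, rollout_return))
```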

$\phi_{\theta}$ produces two kinds of features: generalizable (invariant across $\theta$) and non-generalizable (dependent on $\theta$), i.e., $\phi_{\theta}(s) = concat(f(s), g_{\theta}(s))$, where $f$ is the invariant function and $g_{\theta}$ is the non-generalizable function.

The problem is set up such that “explicit regularization” can easily solve it. The focus is on understanding the effect of “implicit regularization”.
Experiments
Overparameterized LQR

LQR is used as a proxy for deep RL architectures because of advantages such as the ability to perform exact gradient descent.

The functions are parameterized as follows:

$f(s) = W_c s$

$g_{\theta}(s) = W_{\theta} s$


Observation at time $t$, $o_t$, is given as $o_t = [W_c; W_{\theta}] s_t$ (the two matrices stacked vertically), i.e., $o_t = concat(f(s_t), g_{\theta}(s_t))$.

Action at time $t$ is given as $a_t = K o_{t}$ where $K$ is the policy matrix.

Dimensionality:
state $s$: $d_{state} = 100$
$f(s)$: $d_{state} = 100$
$g_{\theta}(s)$: $d_{noise} = 1000$
observation $o$: $d_{state} + d_{noise} = 1100$
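
A minimal numpy sketch of this setup, using the dimensions above (the Gaussian sampling of $W_c$ and $W_{\theta}$ and the action dimension are illustrative assumptions, not the paper's exact scheme):

```python
import numpy as np

d_state, d_noise, d_act = 100, 1000, 100   # d_act is a placeholder choice

# f is shared across all environments; g_theta is resampled for each theta.
W_c = np.random.randn(d_state, d_state)       # invariant ("signal") projection
W_theta = np.random.randn(d_noise, d_state)   # per-environment ("noise") projection

def observe(s):
    """o = [W_c; W_theta] s, i.e. concat(f(s), g_theta(s))."""
    return np.concatenate([W_c @ s, W_theta @ s])   # shape (1100,)

K = np.zeros((d_act, d_state + d_noise))   # linear policy over the observation
s_t = np.random.randn(d_state)
a_t = K @ observe(s_t)                     # a_t = K o_t
```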

When training on just one environment, multiple optimal solutions exist, and the policy overfits.

Increasing $d_{noise}$ increases the generalization gap.

Overparameterizing the network decreases the generalization gap and also reduces the norm of the policy.
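
The note does not spell out how the policy is overparameterized; one natural reading is a deep linear parameterization, where $K$ becomes a product of matrices, sketched below. The hidden width is an arbitrary illustrative choice.

```python
import numpy as np

d_obs, d_act, hidden = 1100, 100, 512   # hidden width is an arbitrary choice

# Direct parameterization: a = K o.
K_direct = 0.01 * np.random.randn(d_act, d_obs)

# Overparameterized (deep linear) version: K is a product of matrices.
# The policy class is unchanged (still linear), but gradient descent on
# (K1, K2) behaves differently from gradient descent on K directly.
K1 = 0.01 * np.random.randn(hidden, d_obs)
K2 = 0.01 * np.random.randn(d_act, hidden)

effective_K = K2 @ K1                       # effective linear map of the policy
policy_norm = np.linalg.norm(effective_K)   # the norm referred to above
```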
Projected Gym Environments

The base MDP is a Gym environment.

$M_{\theta}$ is generated as before.

Increasing both width and depth for basic MLPs improves generalization.

Generalization also depends on the choice of activation function, residual layers, etc.
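
As a reference point, here is a sketch of what varying width and depth means for a basic MLP policy (PyTorch, illustrative only; the dimensions are placeholders):

```python
import torch.nn as nn

def mlp_policy(obs_dim, act_dim, width=256, depth=2, activation=nn.Tanh):
    """Basic MLP policy; width, depth, and activation are the knobs above."""
    layers, in_dim = [], obs_dim
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), activation()]
        in_dim = width
    layers.append(nn.Linear(in_dim, act_dim))
    return nn.Sequential(*layers)

# Narrow/shallow vs. wide/deep variants of the same policy class.
small = mlp_policy(obs_dim=1100, act_dim=100, width=64, depth=1)
large = mlp_policy(obs_dim=1100, act_dim=100, width=1024, depth=4)
```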
Deconvolutional Projections

In the Gym environment, the actual state is projected to a larger vector and reshaped into an 84x84 tensor (image).

The image from $f$ is concatenated with the image from $g_{\theta}$. This setup is referred to as GymDeconv.
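
A rough sketch of this projection, assuming a simple linear projection plus reshape and channel-wise concatenation of the two images (the concatenation axis is an assumption):

```python
import numpy as np

d_state, H, W = 100, 84, 84

P_f = np.random.randn(H * W, d_state)   # fixed projection, shared across environments
P_g = np.random.randn(H * W, d_state)   # per-theta projection, resampled per environment

def observe(s):
    """Project the state into two 84x84 images and concatenate them."""
    img_f = (P_f @ s).reshape(H, W)     # invariant image
    img_g = (P_g @ s).reshape(H, W)     # theta-dependent image
    return np.stack([img_f, img_g], axis=-1)   # shape (84, 84, 2); channel axis is assumed
```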

The relative ordering of NatureCNN, IMPALA, and IMPALALarge by performance (on both CoinRun and GymDeconv) matches their ordering by number of parameters.

In an ablation, the policy is given access to only $g_{\theta}(s)$, which makes it impossible for the model to generalize. In this test of memorization capacity, implicit regularization seems to reduce the memorization effect.
Overparameterization in CoinRun

The pixel space observation in CoinRun is downsized from 64x64 to 32x32 and flattened into a vector.
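
A small sketch of this preprocessing step; the strided downsample here is purely illustrative, since the exact resizing method is not specified in the note:

```python
import numpy as np

def preprocess(frame):
    """Downsize a 64x64x3 CoinRun frame to 32x32 and flatten it into a vector."""
    assert frame.shape == (64, 64, 3)
    small = frame[::2, ::2, :]   # crude strided 2x downsample -> (32, 32, 3)
    return small.reshape(-1)     # flattened vector of length 3072
```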

In CoinRun, the dynamics change per level, and the noisy “irrelevant” features change location across the 1D input, making this setup more challenging than the previous ones.

Overparameterization improves generalization in this scenario as well.