# Observational Overfitting in Reinforcement Learning

## Introduction

• The paper studies observational overfitting: the phenomenon where an agent overfits to features of the observation space even though the underlying MDP remains fixed across environments.

• Unlike in other works, the “background information” (in the pixel space) is correlated with the agent's progress (and is not just random noise).

• Link to the paper

## Setup

• Base MDP $M = (S, A, R, T)$ where $S$ is the state space, $A$ is the action space, $R$ is the reward function, and $T$ is the transition dynamics.

• $M$ is parameterized by $\theta$. In practice, this means introducing an observation function $\phi_{\theta}$, i.e., $M_{\theta} = (M, \phi_{\theta})$.

• A distribution over $\theta$ defines a distribution over the MDPs.

• The learning agent has access to the pixel space observations and not the state space observations.

• Generalization gap is defined as $J_{\theta}(\pi) - J_{\theta^{train}}(\pi)$, where $\pi$ is the learning agent, $\theta$ is sampled from the distribution over all the observation functions, and $\theta^{train}$ is sampled from the distribution over the observation functions corresponding to the training environments. $J_{\theta}(\pi)$ is the average reward that the agent obtains over environments $M_{\theta}$ sampled from the corresponding distribution over $\theta$.

• $\phi_{\theta}$ produces two kinds of features - generalizable (invariant across $\theta$) and non-generalizable (dependent on $\theta$), i.e., $\phi_{\theta}(s) = \text{concat}(f(s), g_{\theta}(s))$, where $f$ is the invariant function and $g_{\theta}$ is the non-generalizable function (see the sketch at the end of this section).

• The problem is set up such that “explicit regularization” can easily solve it. The focus is on understanding the effect of “implicit regularization”.
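
The setup can be made concrete with a minimal NumPy sketch. All names, dimensions, and the random sampling of $W_c$ and $W_{\theta}$ below are illustrative assumptions rather than the paper's exact construction: a fixed map $f$, a per-environment map $g_{\theta}$, and the generalization gap as the difference of average returns over held-out and training observation functions.

```python
import numpy as np

# Illustrative dimensions and linear maps (assumptions, not the paper's exact setup).
d_state, d_noise = 10, 50
rng = np.random.default_rng(0)
W_c = rng.normal(size=(d_state, d_state))      # f(s) = W_c s, invariant across environments

def make_phi(theta_seed):
    """phi_theta(s) = concat(f(s), g_theta(s)); g_theta differs per environment."""
    W_theta = np.random.default_rng(theta_seed).normal(size=(d_noise, d_state))
    return lambda s: np.concatenate([W_c @ s, W_theta @ s])

def generalization_gap(J, train_seeds, test_seeds):
    """J(phi) -> average return of the agent when observations come from phi.
    Gap = J over held-out thetas minus J over training thetas."""
    return (np.mean([J(make_phi(t)) for t in test_seeds])
            - np.mean([J(make_phi(t)) for t in train_seeds]))

# The agent only ever sees phi_theta(s), never the underlying state s.
s = rng.normal(size=d_state)
obs = make_phi(theta_seed=1)(s)                # shape: (d_state + d_noise,) = (60,)
```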

## Experiments

### Overparameterized LQR

• LQR is used as a proxy for deep RL architectures because of advantages such as the ability to perform exact gradient descent.

• The functions are parameterized as follows:

• $f(s) = W_c \, s$

• $g_{\theta}(s) = W_{\theta} \, s$

• Observation at time $t$, $o_t$, is given as $o_t = [W_c;\, W_{\theta}]\, s_t$, i.e., the concatenation of $f(s_t)$ and $g_{\theta}(s_t)$.

• Action at time $t$ is given as $a_t = K o_{t}$ where $K$ is the policy matrix.

• Dimensionality (see the sketch below):

• state $s$: $d_{state} = 100$
• $f(s)$: $d_{state} = 100$
• $g_{\theta}(s)$: $d_{noise} = 1000$
• observation $o$: $d_{state} + d_{noise} = 1100$
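
A small NumPy sketch of this parameterization with the dimensions above. The random sampling of $W_c$ and $W_{\theta}$, the zero initialization of $K$, and the choice of action dimension are illustrative assumptions.

```python
import numpy as np

d_state, d_noise = 100, 1000
rng = np.random.default_rng(0)

W_c = rng.normal(size=(d_state, d_state))       # shared "signal" block
W_theta = rng.normal(size=(d_noise, d_state))   # per-environment "noise" block
Phi = np.vstack([W_c, W_theta])                 # observation map, shape (1100, 100)

K = np.zeros((d_state, d_state + d_noise))      # policy matrix (action dim = d_state here,
                                                # an illustrative choice)

s_t = rng.normal(size=d_state)
o_t = Phi @ s_t                                 # o_t = [W_c; W_theta] s_t, 1100-dim
a_t = K @ o_t                                   # a_t = K o_t
```
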
• When training on just one environment, multiple (equally good) solutions exist, and overfitting happens.

• Increasing $d_{noise}$ increases the generalization gap.

• Overparameterizing the policy network decreases the generalization gap and also reduces the norm of the learned policy (see the sketch below).
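
In the linear LQR case, overparameterization can be realized by factoring the policy into a product of matrices. The sketch below is written under that assumption, with illustrative widths: the effective policy stays linear and keeps the same shape, but the number of trainable parameters grows.

```python
import numpy as np

d_obs, d_act, hidden = 1100, 100, 512           # illustrative widths
rng = np.random.default_rng(0)

# Direct linear policy: a single matrix K.
K_direct = rng.normal(size=(d_act, d_obs))

# Overparameterized (factored) linear policy: K = K2 @ K1.
# It represents the same class of linear policies but has more trainable parameters.
K1 = rng.normal(size=(hidden, d_obs))
K2 = rng.normal(size=(d_act, hidden))
K_effective = K2 @ K1                           # same shape as K_direct

params_direct = K_direct.size                   # 110,000
params_factored = K1.size + K2.size             # 614,400
norm_effective = np.linalg.norm(K_effective)    # the policy norm, reported to shrink with
                                                # overparameterization (after training)
```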

### Projected Gym Environments

• The base MDP is a standard Gym environment.

• $M_{\theta}$ is generated as before.

• Increasing both the width and depth of basic MLP policies improves generalization (see the sketch below).

• Generalization also depends on the choice of activation function, residual layers, etc.
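
A sketch of the "basic MLP" policy class whose width and depth are being varied. The tanh activation, the initialization scheme, and the observation/action dimensions are illustrative assumptions.

```python
import numpy as np

def mlp_policy(hidden_widths, d_obs, d_act, seed=0):
    """Build a forward pass for a plain MLP policy with the given hidden widths."""
    rng = np.random.default_rng(seed)
    dims = [d_obs, *hidden_widths, d_act]
    weights = [rng.normal(scale=1.0 / np.sqrt(m), size=(n, m))
               for m, n in zip(dims[:-1], dims[1:])]

    def act(obs):
        h = obs
        for W in weights[:-1]:
            h = np.tanh(W @ h)                  # hidden layers with tanh activation
        return weights[-1] @ h                  # linear output layer

    return act

# Narrower/shallower vs. wider/deeper variants of the same policy class.
small = mlp_policy([64], d_obs=1100, d_act=8)
large = mlp_policy([512, 512, 512], d_obs=1100, d_act=8)
```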

### Deconvolutional Projections

• In the Gym environment, the actual state is projected to a larger vector and reshaped into an 84x84 tensor (image).

• The image from $f$ is concatenated with the image from $g_{\theta}$. This setup is referred to as Gym-Deconv (see the sketch below).
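
A sketch of this construction. A random linear projection stands in for the actual deconvolutional maps, and stacking the two images as channels is an assumption; the notes only say the $f$-image and the $g_{\theta}$-image are concatenated.

```python
import numpy as np

d_state, hw = 10, 84                            # illustrative state dim; 84x84 images
rng = np.random.default_rng(0)
P_f = rng.normal(size=(hw * hw, d_state))       # fixed "signal" projection

def image_observation(s, theta_seed):
    """Project the state into two 84x84 images and concatenate them as channels."""
    P_g = np.random.default_rng(theta_seed).normal(size=(hw * hw, d_state))
    img_f = (P_f @ s).reshape(hw, hw)           # generalizable image
    img_g = (P_g @ s).reshape(hw, hw)           # environment-specific image
    return np.stack([img_f, img_g])             # shape (2, 84, 84), fed to the CNN policy

s = rng.normal(size=d_state)
obs = image_observation(s, theta_seed=3)
```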

• The relative order of performance of NatureCNN, IMPALA, and IMPALA-Large (on both CoinRun and Gym-Deconv) matches the order of their parameter counts.

• In an ablation, the policy is given access only to $g_{\theta}(s)$, which makes it impossible for the model to generalize. In this test of memorization capacity, implicit regularization seems to reduce the memorization effect.

### Overparameterization in CoinRun

• The pixel space observation in CoinRun is downsized from 64x64 to 32x32 and flattened into a vector.
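
A sketch of this preprocessing step. Using 2x2 average pooling as the downsizing operation is an assumption; the notes only say the frame is downsized and flattened.

```python
import numpy as np

# Downsize a 64x64 RGB frame to 32x32 by 2x2 average pooling, then flatten.
frame = np.random.default_rng(0).integers(0, 256, size=(64, 64, 3)).astype(np.float32)
pooled = frame.reshape(32, 2, 32, 2, 3).mean(axis=(1, 3))   # shape (32, 32, 3)
flat_obs = pooled.reshape(-1)                                # 32 * 32 * 3 = 3072-dim vector
```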

• In CoinRun, the dynamics change per level, and the noisy “irrelevant” features change location across the 1D input, making this setup more challenging than the previous ones.

• Overparameterization improves generalization in this scenario as well.