Meta-Learning Update Rules for Unsupervised Representation Learning
02 Apr 2019

Introduction

Standard unsupervised learning aims to learn transferable features. The paper proposes to learn a transferable learning rule (in an unsupervised manner) that can generalize across tasks and architectures.
Approach

Consider training the model with supervised learning: φ_{t+1} = SupervisedUpdate(φ_{t}, x_{t}, y_{t}, θ).

Here t denotes the step, (x_{t}, y_{t}) denotes a batch of data points, and θ denotes the hyperparameters of the optimizer.

Extending this formulation to meta-learning, t becomes the step of the inner loop and θ the parameters of the meta-learned update rule.

Further, the paper proposes to use φ_{t+1} = UnsupervisedUpdate(φ_{t}, x_{t}, θ), i.e., y_{t} is not used (nor even assumed to be available, since this is unsupervised learning).

The update rule's parameters θ are learned by performing SGD on the sum of the MetaObjective over a distribution of tasks, accumulated over the course of inner-loop training.
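The nested structure above can be sketched as follows. This is a minimal, hypothetical skeleton of the outer/inner loop; `unsupervised_update` and `meta_objective` are toy stand-ins for the paper's learned components, not real APIs.

```python
import numpy as np

def unsupervised_update(phi, x, theta):
    """Inner-loop step: update base-model weights phi using only
    unlabeled data x and the learned rule's parameters theta (toy)."""
    return phi - 0.1 * theta * np.mean(x)

def meta_objective(phi, task):
    """Stand-in for the few-shot evaluation of phi's features (toy)."""
    return float(np.sum(phi ** 2))

def meta_train_step(theta, tasks, inner_steps=4):
    """One outer-loop step: unroll the inner loop on each task and
    accumulate the MetaObjective over the unrolled trajectory."""
    total = 0.0
    for task in tasks:
        phi = np.zeros(3)                       # fresh base model per task
        for _ in range(inner_steps):            # inner (unsupervised) loop
            phi = unsupervised_update(phi, task["data"], theta)
            total += meta_objective(phi, task)  # summed over inner steps
    # In the paper, theta is then updated by SGD on d(total)/d(theta);
    # here we just return the accumulated objective.
    return total
```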
Model

Base model: MLP with parameters φ_{t}

To ensure that it generalizes across architectures, the update rule is designed to be neuron-local, i.e., each update is a function of the pre- and post-synaptic neurons. In practice, this constraint is relaxed by allowing some cross-neuron information in order to decorrelate neurons.

Each neuron i in every layer l of the base model has an update network (an MLP) that takes as input the feedforward activations, feedback weights, and error signals: h_{b}^{l}(i) = MLP(x_{b}^{l}(i), z_{b}^{l}(i), v^{l+1}, δ^{l}(i), θ), where

- b — index of the minibatch
- x^{l} — pre-nonlinearity activations
- z^{l} — post-nonlinearity activations
- v^{l} — feedback weights
- δ^{l} — error signal
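The per-neuron update network can be sketched as below. The 2-layer MLP and all dimensions are illustrative assumptions; the key point is that the same θ is shared by every unit.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def update_network(x_i, z_i, v_next_i, delta_i, theta):
    """Compute the hidden state h for one neuron i in layer l from its
    pre-activation x, post-activation z, incoming feedback weights
    v^{l+1}, and error signal delta. theta = (W1, b1, W2, b2) is the
    shared meta-parameter set (a toy 2-layer MLP here)."""
    W1, b1, W2, b2 = theta
    inp = np.concatenate([x_i, z_i, v_next_i, delta_i])
    return W2 @ relu(W1 @ inp + b1) + b2
```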

All the update networks share the meta-parameters θ.

The model is run in a standard feedforward manner and the update network (corresponding to each unit) is used to generate the error signal δ^{l}_{b}(i) = lin(h_{b}^{l}(i)).

This error signal is backpropagated using the set of learned backward weights v^{l} instead of the forward weights w_{l}.
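Propagating the error with learned feedback weights rather than transposed forward weights (a scheme related to feedback alignment) can be sketched as follows; the shapes and the ReLU derivative gate are illustrative assumptions.

```python
import numpy as np

def backward_pass(delta_top, v_weights, pre_activations):
    """Propagate a top-layer error signal down through the network
    using the learned feedback weights v^l instead of w_l^T.
    v_weights and pre_activations are ordered bottom-to-top."""
    delta = delta_top
    signals = [delta]
    for v, x in zip(reversed(v_weights), reversed(pre_activations)):
        delta = (v @ delta) * (x > 0)  # v^l replaces w_l^T; ReLU gate
        signals.append(delta)
    return signals[::-1]  # error signal per layer, bottom to top
```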

The weight update Δw_{l} is also generated using a per-neuron update network.
Meta Objective

The MetaObjective is based on fitting a linear regression model to a small number of labeled examples.

Given the emphasis on learning generalizable features, the weights (of linear regression) are estimated on one batch and evaluated on another batch.

The MetaObjective minimizes the cosine distance between y_{b} and v^{T}x_{b}^{L}, where

- y_{b} — actual labels on the evaluation batch
- x_{b}^{L} — features of the evaluation batch (computed by the base model)
- v — parameters of the linear regression model (fit on the training batch)
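The fit-on-one-batch, evaluate-on-another objective can be sketched as below. The closed-form ridge solve (and the `lam` regularizer) is an assumption for numerical stability, not a detail taken from the paper.

```python
import numpy as np

def meta_objective(feats_train, y_train, feats_eval, y_eval, lam=1e-3):
    """Fit a linear readout v on training-batch features, then score the
    cosine distance between its predictions and held-out labels."""
    d = feats_train.shape[1]
    # Closed-form ridge regression on the training batch
    v = np.linalg.solve(feats_train.T @ feats_train + lam * np.eye(d),
                        feats_train.T @ y_train)
    preds = feats_eval @ v
    # Cosine distance between v^T x_b^L and y_b on the evaluation batch
    cos = (preds @ y_eval) / (np.linalg.norm(preds) * np.linalg.norm(y_eval))
    return 1.0 - cos
```

Evaluating on a batch the readout was not fit on rewards features that generalize, rather than features that merely make one batch linearly separable.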

Practical Considerations

Meta-gradients are approximated using truncated backpropagation through time.

Increasing variation in the training dataset helps the meta-optimization process. Data is augmented with shifts, rotations, and noise, and predicting the sampled augmentation coefficients serves as an auxiliary regression task in the meta-objective.
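A toy version of this augmentation scheme is sketched below for two of the transformations (shift and noise); the parameter ranges are illustrative assumptions, and the returned coefficients are what the auxiliary regression task would predict.

```python
import numpy as np

def augment(img, rng):
    """Apply a random shift and additive noise to an image and return
    the augmented image plus the sampled coefficients, which serve as
    targets for the auxiliary regression task."""
    shift = rng.integers(-2, 3)            # pixel-shift coefficient
    noise_std = rng.uniform(0.0, 0.1)      # noise-magnitude coefficient
    out = np.roll(img, shift, axis=0)
    out = out + rng.normal(0.0, noise_std, size=img.shape)
    coeffs = np.array([shift, noise_std])  # auxiliary regression targets
    return out, coeffs
```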

Training the system is resource-intensive: roughly 8 days with 512 workers.
Results

With standard unsupervised learning, performance on the transfer task starts declining after some time even as performance on the unsupervised task keeps improving. This suggests that the two objectives become misaligned over time.

UnsupervisedUpdate leads to better generalization than both a VAE and supervised learning followed by transfer.

UnsupervisedUpdate also leads to positive transfer across domains (vision to language) when trained for a shorter duration (so that the meta-objective does not overfit).

UnsupervisedUpdate also generalizes to larger model architectures and different activation functions.