TaskNorm: Rethinking Batch Normalization in Meta-Learning
23 Jul 2020

Introduction
- Meta-learning techniques are shown to benefit from the use of deep neural networks.
- BatchNorm is a commonly used component when training deep networks, especially for vision tasks.
- However, BatchNorm and meta-learning make contradictory assumptions, and their combination may not work well in practice.
- The paper proposes TaskNorm, a normalization method designed explicitly for meta-learning.
Setup
- Standard meta-learning setup with $k$ tasks, each task with its own context and target set.
- Two sets of parameters are considered during meta-learning: (i) global parameters, and (ii) task-specific parameters.
- The meta-learning setup can be viewed as an inference task, where the task-specific parameters are inferred using a context set and some additional (trainable) parameters.
- Normalization layers are commonly used to accelerate the training of neural networks. The general approach is to normalize using moments (statistics) along with some learned parameters (see the sketch below).
- BatchNorm is a well-known and widely used normalization approach. It relies on the implicit assumption that the dataset comprises iid samples from some underlying distribution.
- However, in meta-learning, data points are assumed to be iid only within a specific task.
- This leaves open the question of which moments to use at meta-train and meta-test time.
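As a minimal sketch (PyTorch, not from the paper), a normalization layer whitens its input with a set of moments and then applies learned scale and shift parameters; the BatchNorm variants below differ only in which data the moments are computed from.

```python
import torch

def normalize(x, mean, var, gamma, beta, eps=1e-5):
    # Whiten with the given moments, then scale/shift with learned parameters.
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

# BatchNorm computes the moments over the batch and spatial dimensions
# of a feature map of shape (N, C, H, W), one pair per channel.
x = torch.randn(8, 3, 32, 32)
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
gamma = torch.ones(1, 3, 1, 1)   # learned scale
beta = torch.zeros(1, 3, 1, 1)   # learned shift
y = normalize(x, mean, var, gamma, beta)
```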
Variants of BatchNorm
Conventional BatchNorm (CBN)
- Compute running moments at meta-train time and use them at meta-test time (illustrated below).
- This is equivalent to lumping the moments in with the global parameters, i.e., the running moments are shared globally, while the data is iid only locally.
- Using CBN with MAML leads to poor results.
- Moreover, the meta-learning setup can sometimes require the use of a very small batch size (e.g., 1-shot learning). In those cases, the computed statistics are likely to be inaccurate.
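In PyTorch terms (an illustration, not the paper's code), CBN corresponds to the default `nn.BatchNorm2d` behaviour: batch moments are accumulated into running buffers at meta-train time, and those globally shared buffers are reused at meta-test time.

```python
import torch
import torch.nn as nn

cbn = nn.BatchNorm2d(num_features=64)   # default: track_running_stats=True

cbn.train()                             # meta-train: normalize with batch moments
_ = cbn(torch.randn(16, 64, 8, 8))      # and update running_mean / running_var

cbn.eval()                              # meta-test: normalize with the stored
_ = cbn(torch.randn(1, 64, 8, 8))       # running moments, shared across all tasks
```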
Transductive BatchNorm (TBN)
- Use the context/target set statistics at both meta-train and meta-test time (see the sketch below).
- This is the default BatchNorm mode used in MAML.
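Again as a PyTorch illustration (an assumption about the setup, not code from the paper): transductive BatchNorm keeps normalizing with the moments of the current context/target batch even at test time, which is what `nn.BatchNorm2d` does when running statistics are disabled.

```python
import torch
import torch.nn as nn

# With track_running_stats=False the layer always uses the statistics of the
# batch it is given, even in eval mode.
tbn = nn.BatchNorm2d(num_features=64, track_running_stats=False)

tbn.eval()                              # meta-test
_ = tbn(torch.randn(5, 64, 8, 8))       # moments come from these 5 target points,
                                        # so each prediction depends on the others
```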
Instance-based normalization
- Moments are computed separately for each instance (see the sketch below).
- This mode corresponds to treating the statistics as local at the observation level.
- These methods provide only limited improvement in performance, and can sometimes have a large overhead.
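A sketch of the two usual instance-based options (shapes assumed): Instance Normalization computes moments per sample and per channel, Layer Normalization per sample across channels, so neither depends on any other data point.

```python
import torch

x = torch.randn(4, 64, 8, 8)  # (N, C, H, W): one task's inputs

# Instance Norm: one (mean, var) per sample and per channel -> shape (N, C, 1, 1)
in_mean = x.mean(dim=(2, 3), keepdim=True)
in_var = x.var(dim=(2, 3), unbiased=False, keepdim=True)

# Layer Norm: one (mean, var) per sample across (C, H, W) -> shape (N, 1, 1, 1)
ln_mean = x.mean(dim=(1, 2, 3), keepdim=True)
ln_var = x.var(dim=(1, 2, 3), unbiased=False, keepdim=True)
```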
Task Normalization (Proposed)
- The normalization statistics are local at the task level: the statistics used for a given data point should depend only on the context set's data points, not on the other elements of the target set.
- Meta-Batch Normalization (METABN) is a precursor to TaskNorm in which the context set alone is used to compute the normalization statistics for both the context and the target set (at both meta-train and meta-test time).
- METABN does not perform well when used with small context sets.
- TaskNorm overcomes this limitation by using a set of non-transductive, secondary moments computed from the input being normalized.
- When the context set is small, these additional moments help improve the moment estimates.
- In the general case, a trainable blending factor, $\alpha$, is used to combine the two sets of moments (see the sketch after this list).
- While the computational cost of TaskNorm is slightly higher than that of CBN, it converges faster than CBN in practice.
- The normalization mechanism in Reptile can be interpreted as a particular case of TaskNorm.
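A hedged sketch of the blending step, assuming the instance-norm variant (TaskNorm-I): context-set moments are combined with per-input moments using a weight $\alpha$ that the paper parameterizes as a sigmoid of a learned affine function of the context-set size. Function and variable names below are illustrative, not the authors' implementation.

```python
import torch

def tasknorm_moments(x, context, scale, offset, eps=1e-5):
    """Blend context-set moments with per-input (instance-norm) moments.

    x:       tensor to normalize, shape (N, C, H, W)
    context: the task's context set, shape (M, C, H, W)
    scale, offset: learnable scalars that control the blend weight alpha
    """
    # Pooled moments from the context set (shared by context and target points).
    ctx_mean = context.mean(dim=(0, 2, 3), keepdim=True)             # (1, C, 1, 1)
    ctx_var = context.var(dim=(0, 2, 3), unbiased=False, keepdim=True)

    # Secondary, non-transductive moments from the input being normalized.
    inst_mean = x.mean(dim=(2, 3), keepdim=True)                     # (N, C, 1, 1)
    inst_var = x.var(dim=(2, 3), unbiased=False, keepdim=True)

    # Larger context sets push alpha towards 1 (trust the pooled moments);
    # tiny context sets fall back on the instance-level moments.
    alpha = torch.sigmoid(scale * context.shape[0] + offset)

    mean = alpha * ctx_mean + (1 - alpha) * inst_mean
    var = (alpha * (ctx_var + (ctx_mean - mean) ** 2)
           + (1 - alpha) * (inst_var + (inst_mean - mean) ** 2))
    return mean, var

# Usage: normalize a target batch with moments driven by a 5-shot context set.
context, target = torch.randn(5, 64, 8, 8), torch.randn(10, 64, 8, 8)
mean, var = tasknorm_moments(target, context,
                             scale=torch.tensor(1.0), offset=torch.tensor(0.0))
y = (target - mean) / torch.sqrt(var + 1e-5)
```

Context points would be normalized the same way (with `x = context`), so the statistics for any point never depend on the rest of the target set, keeping the scheme non-transductive.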
Experiments
- Small-scale few-shot classification experiments
  - Omniglot and mini-ImageNet datasets.
  - First-order MAML, with different normalization schemes.
  - Transductive BatchNorm performs the best.
  - Among non-transductive approaches, TaskNorm with Instance Normalization augmentation performs the best.
  - A similar trend holds for the speed of convergence.
- Large-scale few-shot classification experiments
  - Meta-Dataset benchmark.
  - CNAPs model.
  - The context set's size varies across tasks in this setup and can be as small as 5.
  - TaskNorm with Instance Normalization ranks first on 10 (out of 13) datasets and is also the fastest to train.
  - While the instance-based methods (Instance Normalization and Layer Normalization) are the slowest to converge, they still outperform the running-average-based method (Conventional BatchNorm).
- The results demonstrate that designing meta-learning-specific normalization methods can significantly improve performance, and that Transductive BatchNorm may not always be the optimal choice.