mixup  Beyond Empirical Risk Minimization
27 Feb 2020Introduction

The paper proposes a simple and datasetagnostic data augmentation mechanism called mixup.

Consider two training examples, $(x_1, y_1)$ and $(y_1, y_2)$, where $x_1$ and $x_2$ are the datapoints and $y_1$ and $y_2$ are the labels.

New training examples of the form $(\lambda \times x_1 + (1\lambda) \times x_2, \lambda \times y_1 + (1\lambda) \times y_2)$ are constructured by considering the linear interpolation of the datapoints and the labels. Here $\lambda \in [0, 1]$.

$\lambda$ is sampled from a Beta distribution $Beta(\alpha, \alpha)$ where $\alpha \in (0, \infty)$.

Setting $\lambda$ to 0 or 1 eliminates the effect of mixup.

Mixup encourages the neural network to favor linear behavior between the training examples.
Experiments

Supervised Learning

ImageNet for ResNet50, ResNet101 and ResNext101.

CIFAR10/CIFAR100 for PreAct ResNet18, WideResNet2810 and DenseNet.

Google command dataset for LeNet and VGG.


In all these setups, adding mixup improves the performance of the model.

Mixup makes the model more robust to noisy labels. Moreover, mixup + dropout improves over mixup alone. This hints that mixup’s benefits are complementary to those of dropout.

Mixup makes the network more robust to adversarial examples in both whitebox and blackbox settings (ImageNet + Resnet101).

Mixup also stabilizes the training of GANs by acting as a regularizer for the gradient of the discriminator.
Observations

Convex combination of three or more examples (with weights sampled from a Dirichlet distribution) does not provide gains over the case of two examples.

In the authors’ implementation, mixup is applied between images of the same batch (after shuffling).

Interpolating only between inputs, with the same labels, did not lead to the same kind of gains as mixup.