# mixup: Beyond Empirical Risk Minimization

## Introduction

• The paper proposes a simple and dataset-agnostic data augmentation mechanism called mixup.

• Consider two training examples, $(x_1, y_1)$ and $(x_2, y_2)$, where $x_1$ and $x_2$ are the datapoints and $y_1$ and $y_2$ are the labels.

• New training examples of the form $(\lambda x_1 + (1-\lambda) x_2,\ \lambda y_1 + (1-\lambda) y_2)$ are constructed by linearly interpolating both the datapoints and their one-hot labels, where $\lambda \in [0, 1]$ (see the sketch after this list).

• $\lambda$ is sampled from a Beta distribution $\text{Beta}(\alpha, \alpha)$, where the hyperparameter $\alpha \in (0, \infty)$ controls the strength of the interpolation.

• When $\lambda$ is 0 or 1, the mixed example reduces to one of the original examples; as $\alpha \to 0$, $\lambda$ concentrates at these endpoints and mixup recovers standard empirical risk minimization (ERM).

• Mixup encourages the neural network to favor linear behavior between the training examples.
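As a concrete illustration, here is a minimal NumPy sketch of the construction above. The function name `mixup_pair` and the default `alpha=0.2` are illustrative choices, not the paper's code; the paper tunes $\alpha$ per dataset.

```python
import numpy as np

def mixup_pair(x1, y1, x2, y2, alpha=0.2):
    """Build one mixup training example from two examples.

    x1, x2: input arrays (e.g. images); y1, y2: one-hot label vectors.
    alpha: Beta distribution parameter controlling interpolation strength
    (illustrative default; the paper tunes alpha per dataset).
    """
    lam = np.random.beta(alpha, alpha)   # lambda ~ Beta(alpha, alpha), in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2      # interpolate inputs
    y = lam * y1 + (1.0 - lam) * y2      # interpolate one-hot labels
    return x, y
```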

## Experiments

• Supervised Learning

• ImageNet with ResNet-50, ResNet-101, and ResNeXt-101.

• CIFAR-10/CIFAR-100 with PreAct ResNet-18, WideResNet-28-10, and DenseNet.

• Google Speech Commands dataset with LeNet and VGG.

• In all these setups, adding mixup improves the performance of the model.

• Mixup makes the model more robust to noisy labels. Moreover, mixup + dropout improves over mixup alone. This hints that mixup’s benefits are complementary to those of dropout.

• Mixup makes the network more robust to adversarial examples in both white-box and black-box settings (ImageNet + ResNet-101).

• Mixup also stabilizes the training of GANs by acting as a regularizer for the gradient of the discriminator.
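For reference, the paper formulates this by training the discriminator $d$ on interpolations between real samples $x$ and generated samples $g(z)$, with the interpolation coefficient $\lambda$ as the target (here $\ell$ denotes the binary cross-entropy loss):

$$\max_g \min_d \; \mathbb{E}_{x, z, \lambda} \, \ell\big(d(\lambda x + (1-\lambda)\, g(z)),\ \lambda\big)$$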

## Observations

• Convex combinations of three or more examples (with weights sampled from a Dirichlet distribution) do not provide gains over combining two examples.

• In the authors’ implementation, mixup is applied between images of the same minibatch (after random shuffling), as shown in the sketch after this list.

• Interpolating only the inputs of examples that share the same label (keeping the label fixed) did not yield gains comparable to mixup.
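A minimal PyTorch sketch of this batch-level scheme, in the spirit of the authors' reference implementation (the function name `mixup_batch` and the default `alpha=1.0` are illustrative):

```python
import numpy as np
import torch

def mixup_batch(x, y, alpha=1.0):
    """Mix a batch with a shuffled copy of itself.

    x: batch of inputs, shape (B, ...); y: integer class labels, shape (B,).
    Returns the mixed inputs plus both label sets and lambda, so the loss
    can be interpolated without materializing one-hot vectors.
    """
    lam = np.random.beta(alpha, alpha) if alpha > 0 else 1.0
    index = torch.randperm(x.size(0))           # shuffle within the batch
    mixed_x = lam * x + (1.0 - lam) * x[index]  # interpolate inputs
    return mixed_x, y, y[index], lam

# Training step: interpolating one-hot labels is equivalent to
# interpolating the cross-entropy losses of the two label sets.
# criterion = torch.nn.CrossEntropyLoss()
# mixed_x, y_a, y_b, lam = mixup_batch(x, y)
# pred = model(mixed_x)
# loss = lam * criterion(pred, y_a) + (1.0 - lam) * criterion(pred, y_b)
```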