Smooth Loss Functions for Deep Top-k Classification
25 Dec 2018
Introduction
- For top-k classification tasks, cross-entropy is widely used as the learning objective, even though it is the optimal choice only in the limit of infinite data.
- The paper introduces a family of smoothed loss functions that are specifically designed for top-k optimization.
Idea
- Inspired by multi-class SVMs, a surrogate loss lk is introduced that enforces a margin between the ground-truth score and the k-th largest score.
- Here s denotes the output of the classifier to be learned, y is the ground-truth label, s[p] denotes the p-th largest element of s, and s\p denotes the vector s with its p-th element removed.
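- With this notation, the surrogate can be sketched as a top-k hinge loss with unit margin (a reconstruction from the description above, not copied verbatim from the paper): the loss vanishes only when the ground-truth score beats the k-th largest competing score by the margin.

```latex
l_k(s, y) \;=\; \max\Big(0,\; 1 + \big(s_{\backslash y}\big)_{[k]} - s_y\Big)
```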
- This lk loss has two limitations:
  - It is continuous but not differentiable in s.
  - Its weak derivatives have at most two non-zero elements.
- The loss can be reformulated by adding and subtracting the k−1 largest scores of s\y as well as sy, and by introducing a temperature parameter τ.
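- Schematically (a hedged sketch that ignores the exact normalization constants used in the paper), the rewritten loss is a difference of two maxima over k-element label subsets, and the smoothing replaces each maximum by its temperature-τ counterpart max_τ(a_1, …, a_m) = τ log Σ_i exp(a_i/τ):

```latex
L_{k,\tau}(s, y) \;=\;
\tau \log \sum_{\bar{y} \in \mathcal{Y}_k}
  \exp\Big(\tfrac{1}{\tau}\big(\Delta(\bar{y}, y) + \textstyle\sum_{j \in \bar{y}} s_j\big)\Big)
\;-\;
\tau \log \sum_{\bar{y} \in \mathcal{Y}_k,\ y \in \bar{y}}
  \exp\Big(\tfrac{1}{\tau}\textstyle\sum_{j \in \bar{y}} s_j\Big)
```

where 𝒴k is the set of k-element label subsets and Δ(ȳ, y) equals 1 if y ∉ ȳ and 0 otherwise.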
Properties of Lkτ
- For any τ > 0, Lkτ is infinitely differentiable and has non-sparse gradients.
- Under mild conditions, Lkτ approaches lk (in a pointwise sense) as τ → 0+.
- It is an upper bound on the actual loss (up to a constant factor).
- It is a generalization of the cross-entropy loss to different values of k and τ and to larger margins.
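- As a sanity check on the last point: for k = 1 and τ = 1, and with the margin term Δ dropped, the schematic form above collapses to the standard cross-entropy loss:

```latex
L_{1,1}(s, y) \;=\; \log \sum_{j} \exp(s_j) - s_y \;=\; -\log \frac{\exp(s_y)}{\sum_{j} \exp(s_j)}
```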
Computational Challenges
- C(n, k) terms need to be evaluated to compute the loss for a single sample (n is the number of classes).
- The loss Lkτ can be expressed in terms of elementary symmetric polynomials σi(e) (the sum of all products of i distinct elements of a vector e). The challenge is therefore to compute σk efficiently.
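- For concreteness, the i-th elementary symmetric polynomial of e = (e1, …, en) is

```latex
\sigma_i(e) \;=\; \sum_{1 \le j_1 < j_2 < \dots < j_i \le n} e_{j_1} e_{j_2} \cdots e_{j_i},
\qquad \text{e.g. } \sigma_2(e_1, e_2, e_3) = e_1 e_2 + e_1 e_3 + e_2 e_3 .
```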
Forward Computation
- The goal is to compute σk(e), where e is an n-dimensional vector, k ≪ n, and e[i] != 0 for all i.
- σi(e) can be obtained from the coefficients of the polynomial (X+e1)(X+e2)…(X+en), which can be computed with a divide-and-conquer approach based on polynomial multiplication (see the sketch after this list).
- With further optimizations (e.g. log(n) levels of recursion, with each level parallelized on a GPU), the resulting algorithm scales well with n on a GPU.
- Operations are performed in log-space using the log-sum-exp trick, which keeps the computation numerically stable in single floating-point precision.
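- Below is a minimal Python/NumPy sketch of this forward computation (illustrative only, not the authors' GPU implementation; the function names are made up). It uses the reversed polynomial ∏i (1 + ei X) = Σi σi(e) X^i so that the product can be truncated at degree k, merges the factors by divide and conquer, and keeps all coefficients in log-space via SciPy's `logsumexp`:

```python
import numpy as np
from scipy.special import logsumexp

def log_poly_mul(log_p, log_q, k):
    """Multiply two polynomials whose coefficients are stored as logs
    (constant term first), truncating the result to degree k."""
    deg = min(k, (len(log_p) - 1) + (len(log_q) - 1))
    out = np.full(deg + 1, -np.inf)
    for d in range(deg + 1):
        lo = max(0, d - (len(log_q) - 1))
        hi = min(d, len(log_p) - 1)
        # log of sum_i p_i * q_{d-i}, accumulated with the log-sum-exp trick
        out[d] = logsumexp([log_p[i] + log_q[d - i] for i in range(lo, hi + 1)])
    return out

def log_sigma(x, k):
    """Return [log sigma_0(e), ..., log sigma_k(e)] for e = exp(x), with k <= len(x).

    Uses prod_i (1 + e_i X) = sum_i sigma_i(e) X^i: the log-coefficients of the
    truncated product are exactly the log elementary symmetric polynomials.
    """
    # Leaf polynomials 1 + e_i X, i.e. log-coefficients [log 1, log e_i] = [0, x_i].
    polys = [np.array([0.0, xi]) for xi in x]
    while len(polys) > 1:                      # about log2(n) levels of recursion
        merged = []
        for i in range(0, len(polys) - 1, 2):  # the pairs at one level are independent
            merged.append(log_poly_mul(polys[i], polys[i + 1], k))
        if len(polys) % 2 == 1:                # an odd factor passes through unchanged
            merged.append(polys[-1])
        polys = merged
    return polys[0][:k + 1]                    # entry i is log sigma_i(e)
```

- For example, `log_sigma(np.log([1.0, 2.0, 3.0]), 2)` returns the logs of (1, 6, 11), since σ0 = 1, σ1 = 1 + 2 + 3 and σ2 = 1·2 + 1·3 + 2·3.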
Backward Computation
- The backward pass uses further optimizations, such as computing the derivative of σj with respect to ei in a recursive manner (sketched below).
- The appendix of the paper describes these techniques in detail.
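- A plain-arithmetic sketch of that recursion (illustrative; the paper's implementation works in log-space, and its appendix discusses the numerical stability of this step). It relies on the identities ∂σj(e)/∂ei = σj−1(e∖i) and σp(e∖i) = σp(e) − ei·σp−1(e∖i):

```python
import numpy as np

def sigma_k_grad(e, sigma, k):
    """grad[i] = d sigma_k(e) / d e_i = sigma_{k-1}(e with e_i removed).

    `sigma[p]` holds sigma_p(e) for p = 0..k-1, as computed in the forward pass.
    """
    grad = np.empty(len(e))
    for i, e_i in enumerate(e):
        s = 1.0                      # sigma_0(e \ i) = 1
        for p in range(1, k):
            s = sigma[p] - e_i * s   # sigma_p(e \ i) = sigma_p(e) - e_i * sigma_{p-1}(e \ i)
        grad[i] = s                  # sigma_{k-1}(e \ i)
    return grad
```

- For k = 2, this gives grad[i] = σ1(e) − ei, i.e. the sum of all other entries of e, as expected.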
Experiments
- Experiments are performed on CIFAR-100 (with label noise) and ImageNet.
- For CIFAR-100 with noise, the labels are randomized with probability p (within the same top-level class).
- Compared to the cross-entropy loss, the proposed loss is considerably more robust to label noise and to reductions in the amount of training data, for both top-1 and top-k performance.