Papers I Read: Notes and Summaries

Smooth Loss Functions for Deep Top-k Classification

Introduction

  • For top-k classification tasks, cross-entropy is widely used as the learning objective, even though it is the optimal choice only in the limit of infinite data.

  • The paper introduces a family of smoothed loss functions that are specially designed for top-k optimization.

  • Paper

  • Code

Idea

  • Inspired by multi-class SVMs, a surrogate loss lk is introduced that enforces a margin between the ground-truth score and the kth largest score among the remaining classes.

Equation 1
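
A reconstruction of the margin loss from the notation described in the next bullet (my own transcription, so the paper's exact statement may differ slightly in its scaling or margin constant):

$$ l_k(s, y) \;=\; \max\Big(0,\; 1 + \big(s_{\backslash y}\big)_{[k]} - s_y \Big) $$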

  • Here s denotes the output of the classifier model to be learnt, y is the ground-truth label, s[p] denotes the pth largest element of s, and s\p denotes the vector s without its pth element.

  • This lk loss has two limitations:

    • It is continuous but not differentiable in s.

    • Its weak derivatives have at most two non-zero elements, so the gradient signal is sparse.

  • The loss can be reformulated by adding and subtracting the k-1 largest scores of s\y and the score sy, and then smoothed by introducing a temperature parameter τ.

Equation 2
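
The smoothing itself is the standard temperature-scaled log-sum-exp relaxation of the max; schematically (a sketch of the mechanism, not the paper's exact normalization):

$$ \max_i a_i \;\approx\; \tau \log \sum_i \exp\!\big(a_i/\tau\big), \qquad \tau > 0 $$

Applying this relaxation to both max terms of the rewritten lk turns them into sums over all k-subsets of exponentiated scores, which is presumably why the elementary symmetric polynomials of the later sections appear.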

Properties of L

  • For any τ > 0, L is infinitely differentiable and has non-sparse gradients.

  • Under mild conditions, L approaches lk (in a pointwise sense) as τ approaches 0 from above.

  • It is an upper bound (up to a constant factor) on the actual top-k loss.

  • It is a generalization of the cross-entropy loss to arbitrary k, to non-unit temperatures τ, and to larger margins.
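
As a sanity check on the last point, with k = 1, τ = 1 and the margin term dropped, the smooth loss should collapse to the usual softmax cross-entropy (my sketch of the special case, not a formula quoted from the paper):

$$ L_{1,1}(s, y) \;=\; \log\Big(\sum_{j} \exp(s_j)\Big) - s_y \;=\; -\log \operatorname{softmax}(s)_y $$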

Computational Challenges

  • Computing the loss naively for a single sample requires evaluating nCk terms, where n is the number of classes.

  • The loss L can be expressed in terms of elementary symmetric polynomials σi(e) (the sum of all products of i distinct elements of a vector e), so the challenge reduces to computing σk efficiently.
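
For reference, the elementary symmetric polynomial mentioned above, written in standard notation (with the convention σ0(e) = 1):

$$ \sigma_i(e) \;=\; \sum_{1 \le j_1 < j_2 < \dots < j_i \le n} e_{j_1} e_{j_2} \cdots e_{j_i} $$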

Forward Computation

  • The task is to compute σk(e), where e is an n-dimensional vector, k ≪ n, and e[i] ≠ 0 for all i.

  • The values σi(e) appear as the coefficients of the polynomial (X+e1)(X+e2)…(X+en), which can be computed with a divide-and-conquer approach based on polynomial multiplication (see the sketch after this list).

  • With further optimizations (e.g., the recursion has only log(n) levels and each level can be parallelized on a GPU), the resulting algorithm scales well with n.

  • Operations are performed in log-space using the log-sum-exp trick to maintain numerical stability in single-precision floating point.
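
A minimal NumPy sketch of the divide-and-conquer idea referred to above. It works in ordinary arithmetic on the CPU; the paper's implementation additionally parallelizes each level of the recursion on the GPU and carries out the same operations in log-space for stability, which this sketch does not attempt:

```python
import numpy as np

def elementary_symmetric(e, k):
    """Return [sigma_0(e), ..., sigma_k(e)] for a 1-D array e.

    sigma_i(e) is the coefficient of X^(n-i) in prod_i (X + e_i).
    Coefficients are stored highest-degree first, so index j of the
    result holds sigma_j directly.  Illustrative sketch only.
    """
    e = np.asarray(e, dtype=np.float64)

    def prod_poly(lo, hi):
        if hi - lo == 1:
            return np.array([1.0, e[lo]])  # X + e_lo  ->  coefficients [1, e_lo]
        mid = (lo + hi) // 2
        # Polynomial multiplication is a convolution of coefficient vectors;
        # only the first k+1 coefficients (sigma_0 .. sigma_k) are ever needed,
        # so intermediate results are truncated to keep the cost low.
        return np.convolve(prod_poly(lo, mid), prod_poly(mid, hi))[: k + 1]

    return prod_poly(0, len(e))[: k + 1]

# Example: elementary_symmetric([1.0, 2.0, 3.0], 2)
# gives sigma_0 = 1, sigma_1 = 1+2+3 = 6, sigma_2 = 1*2 + 1*3 + 2*3 = 11.
```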

Backward Computation

  • The backward pass uses optimizations such as computing the derivative of σj with respect to ei in a recursive manner (see the sketch after this list).

  • The appendix of the paper describes these techniques in detail.
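
One instance of such a recursion (my own illustration; the identities below are standard, while the appendix covers how to evaluate them in a numerically stable way):

$$ \frac{\partial \sigma_j(e)}{\partial e_i} \;=\; \sigma_{j-1}\big(e_{\backslash i}\big), \qquad \sigma_j\big(e_{\backslash i}\big) \;=\; \sigma_j(e) - e_i\,\sigma_{j-1}\big(e_{\backslash i}\big), \qquad \sigma_0\big(e_{\backslash i}\big) = 1 $$

Given the forward quantities σ0(e), …, σk(e), the second identity lets each partial derivative be obtained recursively in j without re-running the forward algorithm.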

Experiments

  • Experiments are performed on CIFAR-100 (with label noise) and ImageNet.

  • For CIFAR-100 with noise, each label is randomized with probability p, with the random label drawn from the same top-level class (a sketch of this protocol follows the list).

  • Compared to the cross-entropy loss, the proposed loss is considerably more robust both to label noise and to reductions in the amount of training data, in terms of both top-1 and top-k performance.
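
A minimal sketch of the label-randomization protocol mentioned above; the coarse/fine label mappings (coarse_of, fine_labels_in_coarse) are hypothetical helpers, not part of the paper's released code:

```python
import random

def randomize_label(fine_label, p, coarse_of, fine_labels_in_coarse):
    """With probability p, replace a CIFAR-100 fine label with a label drawn
    uniformly from the same coarse (top-level) class; otherwise keep it."""
    if random.random() < p:
        return random.choice(fine_labels_in_coarse[coarse_of[fine_label]])
    return fine_label
```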