Papers I Read: Notes and Summaries

Distilling the Knowledge in a Neural Network

Introduction

  • In machine learning, it is common to train a single large model (with a large number of parameters) or an ensemble of multiple models on the same dataset.

  • While such large models help to improve performance, they also make the system difficult and computationally expensive to deploy.

  • The paper proposes to transfer the knowledge from such “cumbersome” models into a single, “simpler” model which is more suitable for deployment. This transfer of knowledge is referred to as “distillation”.

  • Link to the paper: https://arxiv.org/abs/1503.02531

Idea

  • Train the cumbersome model using the given training data in the usual way.

  • Train the simpler, distilled model using the class probabilities produced by the cumbersome model as soft targets. The simpler model is thus trained to generalise in the same way as the cumbersome model.

  • When the soft targets have high entropy, they provide much more information per training case than hard targets, and the gradient varies much less between training cases.

  • One approach is to minimise the squared difference between the logits produced by the cumbersome model and those produced by the simpler model. This approach was pursued by Buciluǎ et al.

  • The paper proposes a more general solution, which it names “distillation”: the temperature of the cumbersome model’s final softmax is raised until it produces a suitably soft set of targets, and these soft targets are then used to train the simpler model (a minimal sketch of the temperature-scaled softmax is given after this list).

  • The paper also shows that matching logits (the first approach) is, in fact, a special case of distillation: in the high-temperature limit the two objectives coincide (a short derivation is sketched after this list).
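
A minimal sketch of how a temperature-scaled softmax produces soft targets, assuming PyTorch; the logit values and temperatures below are made up purely for illustration.

```python
import torch
import torch.nn.functional as F


def soft_targets(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Convert logits into a softened probability distribution.

    Higher temperatures raise the entropy of the output, exposing how the
    cumbersome model ranks the incorrect classes relative to each other.
    """
    return F.softmax(logits / temperature, dim=-1)


# Hypothetical logits from a trained cumbersome model for a single example.
logits = torch.tensor([5.0, 2.0, 0.5])

print(soft_targets(logits, temperature=1.0))  # sharp, close to a hard label
print(soft_targets(logits, temperature=5.0))  # softer, similarity structure visible
```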
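
The claim that matching logits is a limiting case of distillation can be seen from the gradient of the soft-target cross-entropy C with respect to a logit z_i of the simpler model (paper notation: v_i are the cumbersome model’s logits, q_i and p_i the corresponding softened probabilities, T the temperature, N the number of classes; the approximation assumes T is large compared with the logit magnitudes and that the logits are zero-mean for each example):

```latex
\frac{\partial C}{\partial z_i}
  = \frac{1}{T}\left(q_i - p_i\right)
  = \frac{1}{T}\left(\frac{e^{z_i/T}}{\sum_j e^{z_j/T}} - \frac{e^{v_i/T}}{\sum_j e^{v_j/T}}\right)
  \;\approx\; \frac{1}{N T^{2}}\left(z_i - v_i\right)
```

In this high-temperature limit, distillation is therefore equivalent (up to a constant factor) to minimising the squared difference between the two sets of logits.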

Approach

  • In the simplest setting, the soft targets are produced by running the trained cumbersome model with a high temperature in its softmax, and the same high temperature is used in the softmax of the simpler model while it is trained on these targets. After training, the simpler model uses a temperature of 1 when making predictions.

  • It also helps to add an auxiliary objective function: the cross-entropy loss with the correct labels, computed with the simpler model’s logits at a temperature of 1 and given a considerably lower weight. Since the gradients produced by the soft targets scale as 1/T², the soft-target term needs to be multiplied by T² so that the relative contribution of the two objectives stays roughly unchanged when the temperature is varied (a sketch of the combined loss follows this list).
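
A minimal sketch of the combined training objective described above, assuming PyTorch; the function name, the default temperature, and the weighting are illustrative choices, not values fixed by the paper.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 4.0,
                      soft_weight: float = 0.9) -> torch.Tensor:
    """Weighted sum of a soft-target term (at temperature T, scaled by T^2)
    and a hard-label cross-entropy term (at temperature 1)."""
    # Cross-entropy between the cumbersome (teacher) model's softened outputs
    # and the simpler (student) model's softened outputs. KL divergence differs
    # from it only by the teacher's entropy, which is constant w.r.t. the student.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)  # gradients from soft targets scale as 1/T^2

    # Auxiliary objective: cross-entropy with the correct labels at temperature 1.
    hard_loss = F.cross_entropy(student_logits, labels)

    # The hard-label term gets a considerably lower weight.
    return soft_weight * soft_loss + (1.0 - soft_weight) * hard_loss
```

The teacher’s logits would typically be precomputed over the transfer set, so only the simpler model is updated when minimising this loss.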

Experiments

  • The paper reports favourable results for the distillation task in the following domains:

    • Image Classification (on the MNIST dataset)

      • An extra experiment is performed where the simpler model is not shown any images of “3”, yet it fails on only 133 of the 1010 test cases involving “3”.

    • Automatic Speech Recognition (ASR)

      • An extra experiment is performed where the baseline model is trained on only 3% of the total dataset, once using hard targets and once using soft targets.

      • The model trained on hard targets overfits badly and has poor test accuracy, while the model trained on soft targets does not overfit and achieves much better test accuracy. This shows the regularising effect of soft targets.

    • Training ensembles of specialists on very large datasets (JFT, an internal dataset at Google)

      • The experiment shows that while training even a single large model on such a dataset takes a very long time, its performance can be improved by additionally training a small number of specialised networks (each focusing on a confusable subset of classes), which are much faster to train.

      • However, the paper has not yet shown that the knowledge of such specialist models can be distilled back into a single model.