HyperNetworks
25 Jan 2021

Introduction

The paper explores HyperNetworks. The idea is to use one network (HyperNetwork) to generate the weights for another network.
Approach
Static HyperNetworks: HyperNetworks for CNNs

Consider a $D$ layer CNN where the parameters for the $j^{th}$ layer are stored in a matrix $K^j$ of the shape $N_{in}f_{size} \times N_{out}f_{size}$.

The HyperNetwork is implemented as a two-layer linear network whose input is a layer embedding $z^j$ and whose output is $K^j$.

The first layer of the HyperNetwork maps the input embedding to $N_{in}$ hidden vectors using $N_{in}$ separate weight matrices.

The second layer maps each of the $N_{in}$ hidden vectors to a matrix $K_i$ using a single shared matrix. The resulting $N_{in}$ matrices $K_i$ are concatenated to obtain $K^j$.
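The two-layer generation step can be sketched in NumPy as below. This is a minimal illustration, not the paper's implementation; all sizes ($N_z$, $d$, $N_{in}$, $N_{out}$, $f$) and variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes, for illustration only.
N_z, d = 4, 8           # embedding dim, hypernetwork hidden dim
N_in, N_out, f = 16, 16, 3  # channels in/out, filter size

# First layer: N_in separate weight matrices project z^j to N_in hidden vectors.
W1 = rng.standard_normal((N_in, d, N_z)) * 0.01
b1 = np.zeros((N_in, d))
# Second layer: one shared projection maps each hidden vector to a slice K_i.
W2 = rng.standard_normal((f, N_out * f, d)) * 0.01
b2 = np.zeros((f, N_out * f))

def generate_kernel(z):
    """Generate K^j of shape (N_in * f, N_out * f) from a layer embedding z."""
    a = np.einsum('idz,z->id', W1, z) + b1            # (N_in, d) hidden vectors
    K_slices = np.einsum('fod,id->ifo', W2, a) + b2   # (N_in, f, N_out * f)
    return K_slices.reshape(N_in * f, N_out * f)      # concatenate the K_i

z_j = rng.standard_normal(N_z)
K_j = generate_kernel(z_j)
print(K_j.shape)  # (48, 48), i.e. (N_in * f, N_out * f)
```

Note that the learnable parameters are only `W1`, `b1`, `W2`, `b2`, and the per-layer embeddings, which is where the parameter savings come from.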

As a side note, a HyperNetwork has far fewer parameters than the network for which it produces weights.

In the general case, the kernel dimensions (across layers) are not all of the same size but are integer multiples of some basic size. In that case, the HyperNetwork can generate kernels of the basic size, which are concatenated to form larger kernels. This requires additional input embeddings but no change in the architecture of the HyperNetwork.
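Tiling basic-size kernels into a larger one can be sketched as follows. The block values here are placeholders standing in for kernels the HyperNetwork would generate from four different embeddings; the sizes are hypothetical.

```python
import numpy as np

# Basic kernel size: N_in * f_size = 16 * 3 = 48 (assumed, for illustration).
basic = 48

# Stand-ins for four basic kernels, each generated from its own embedding.
blocks = [[np.full((basic, basic), float(2 * r + c)) for c in range(2)]
          for r in range(2)]

# A layer that needs a kernel twice as large in each dimension tiles the
# four generated blocks into one (2 * basic, 2 * basic) kernel.
K_large = np.block(blocks)
print(K_large.shape)  # (96, 96)
```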
Dynamic HyperNetworks: HyperNetworks for RNNs

HyperRNNs/HyperLSTMs denote HyperNetworks that generate weights for RNNs/LSTMs.

HyperRNNs implement a form of relaxed weight sharing, an alternative to the full weight sharing of traditional RNNs.

At any timestep $t$, the input to the HyperRNN is the concatenation of $x_{t}$ (the input to the main RNN at time $t$) and the hidden state $h_{t-1}$ of the main RNN. The output is the set of weights for the main RNN at timestep $t$.

In practice, a weight-scaling vector $d$ is used to reduce the memory footprint, which would otherwise be $dim$ times that of a standard RNN, where $dim$ is the dimensionality of the embedding vector $z_j$.
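The memory-efficient variant can be sketched as below: instead of emitting full weight matrices at every step, the HyperRNN emits an embedding that is projected to scaling vectors, which rescale the rows of shared weights. This is a simplified sketch (the hypernetwork is reduced to a single linear map, and biases are omitted); all sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, n_z = 5, 8, 4  # input, hidden, embedding sizes (hypothetical)

# Main RNN weights, shared across timesteps.
W_h = rng.standard_normal((n_h, n_h)) * 0.1
W_x = rng.standard_normal((n_h, n_x)) * 0.1

# Simplified hypernetwork: maps [x_t; h_{t-1}] to an embedding z_t,
# then projects z_t to per-row scaling vectors d_h and d_x.
W_hyper = rng.standard_normal((n_z, n_x + n_h)) * 0.1
P_h = rng.standard_normal((n_h, n_z)) * 0.1
P_x = rng.standard_normal((n_h, n_z)) * 0.1

def step(x_t, h_prev):
    z_t = W_hyper @ np.concatenate([x_t, h_prev])
    d_h, d_x = P_h @ z_t, P_x @ z_t
    # Scale rows of the shared weights rather than generating them fully:
    # memory grows by the scaling vectors, not by full per-step matrices.
    return np.tanh(d_h * (W_h @ h_prev) + d_x * (W_x @ x_t))

h = np.zeros(n_h)
for x_t in rng.standard_normal((3, n_x)):
    h = step(x_t, h)
print(h.shape)  # (8,)
```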
Experiments

HyperNetworks are used to train standard CNNs on MNIST and ResNets on CIFAR-10. In these experiments, HyperNetworks slightly underperform the best-performing models but use far fewer parameters.

HyperLSTMs trained on the Penn Treebank and Hutter Prize Wikipedia datasets outperform stacked LSTMs and perform similarly to layernorm LSTMs. Interestingly, combining HyperLSTMs with layernorm improves performance over HyperLSTMs alone.

Given the similar performance of HyperLSTMs and layernorm LSTMs, the paper conducts an ablation study to understand whether HyperLSTMs learn a weight-adjustment policy similar to the statistics-based approach of layernorm LSTMs. However, an analysis of the histogram of hidden states suggests that layernorm reduces the saturation effect, while in HyperLSTMs the cell is saturated most of the time. This indicates that the two models learn different policies.

HyperLSTMs are also evaluated for handwriting sequence generation by training on the IAM online handwriting dataset. While HyperLSTMs are quite effective on this task, combining them with layernorm degrades performance.

On the WMT’14 English-to-French machine translation task, HyperLSTMs outperform LSTM-based approaches.