Papers I Read - Notes and Summaries

Swish - a Self-Gated Activation Function

Introduction

  • The paper presents a new activation function called Swish, defined as f(x) = x · sigmoid(x), along with a parameterised version called Swish-β, defined as f(x, β) = x · sigmoid(β · x), where β is either a constant or a trainable parameter (a minimal code sketch of both follows this list).

  • The paper shows that Swish consistently outperforms ReLU and other activation functions across a variety of datasets (CIFAR, ImageNet, WMT 2014), though in some cases only by small margins.

  • Link to the paper
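A minimal NumPy sketch of the two definitions summarized above (the function and parameter names are my own, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # Swish: f(x) = x * sigmoid(x)
    return x * sigmoid(x)

def swish_beta(x, beta=1.0):
    # Swish-beta: f(x, beta) = x * sigmoid(beta * x);
    # beta may be a constant or treated as a trainable parameter
    return x * sigmoid(beta * x)
```

Because the function is pointwise (one scalar in, one scalar out), it can be dropped in wherever ReLU is used without changing layer shapes.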

Properties of Swish

  • Plot Of Swish

  • Smooth, non-monotonic function.

  • Swish-β can be thought of as a smooth function that nonlinearly interpolates between a (scaled) linear function and ReLU, with β controlling the degree of interpolation (see the numerical check after this list).

  • Uses a self-gating mechanism, i.e., it uses its own value to gate itself. Gating usually requires multiple scalar inputs, but since self-gating needs only a single scalar input, Swish can directly replace activation functions, which are generally pointwise.

  • Being unbounded above (for x > 0), it avoids saturation, a regime where training slows down because gradients are close to 0.

  • Being bounded below induces a kind of regularisation effect, as large negative inputs are forgotten (their outputs go to 0).

  • Since the Swish function is smooth, the output landscape and the loss landscape are also smooth. A smooth landscape should be more traversable and less sensitive to initialization and learning rates.
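A rough numerical check of the properties listed above, again in NumPy; the test grid, the numerically stable sigmoid, and the extreme β values are my own illustrative choices, not from the paper:

```python
import numpy as np

def sigmoid(x):
    # numerically stable logistic function, safe for very large |x|
    z = np.exp(-np.abs(x))
    return np.where(x >= 0, 1.0 / (1.0 + z), z / (1.0 + z))

def swish_beta(x, beta=1.0):
    return x * sigmoid(beta * x)

x = np.linspace(-6.0, 6.0, 1001)
y = swish_beta(x, beta=1.0)

# As beta -> 0, Swish-beta approaches the scaled linear function x / 2
print(np.max(np.abs(swish_beta(x, beta=1e-6) - x / 2)))              # ~1e-5

# As beta -> infinity, Swish-beta approaches ReLU
print(np.max(np.abs(swish_beta(x, beta=1e6) - np.maximum(x, 0.0))))  # ~0

# Bounded below, unbounded above: the minimum of Swish (beta = 1) is a small
# negative value (about -0.28, reached near x = -1.28), while outputs keep
# growing without bound for positive x
print(x[np.argmin(y)], y.min())

# Non-monotonic: the function first decreases, then increases, as x moves
# through negative values towards 0
print(np.any(np.diff(y) < 0) and np.any(np.diff(y) > 0))
```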

Criticism

  • Swish is much more complicated than ReLU, and when weighed against the small improvements it provides, it might not end up with as strong an adoption as ReLU.