Papers I Read Notes and Summaries

Learning to Compute Word Embeddings On the Fly

Introduction

  • Word-based language models suffer from the problem of rare or out-of-vocabulary (OOV) words.

  • Learning representations for OOV words directly on the end task often results in poor representations.

  • Common alternatives are to replace all rare words with a single, generic representation (which loses information) or to use character-level models to obtain word representations (which tend to miss semantic relationships).

  • The paper proposes to learn a network that can predict the representations of words using auxiliary data (referred to as definitions), such as dictionary definitions, Wikipedia infoboxes, the spelling of the word, etc.

  • The auxiliary data encoders are trained jointly with the end task to ensure that word representations align with the requirements of the end task.

Approach

  • Given a rare word w, let d(w) = <x1, x2, …> denote its definition, where the xi are words.

  • d(w) is fed to a definition reader network f (an LSTM), and its last hidden state is used as the definition embedding ed(w) (a minimal sketch is given at the end of this section).

  • In case w has multiple definitions, the embeddings are combined using mean pooling.

  • The approach can be extended to in-vocabulary words as well by using the definition embedding of such words to update their original embeddings.
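Not code from the paper, but a minimal PyTorch sketch of how such a definition reader could look, assuming a single-layer LSTM and token indices as input; names like `DefinitionReader` are illustrative. It reads each definition, takes the LSTM's last hidden state as that definition's embedding, and mean-pools when a word has several definitions.

```python
import torch
import torch.nn as nn

class DefinitionReader(nn.Module):
    """Sketch of the definition reader f: an LSTM whose last hidden state
    is taken as the definition embedding e_d(w)."""

    def __init__(self, def_vocab_size, def_word_dim, embed_dim):
        super().__init__()
        # Embeddings for the words x_i appearing inside definitions.
        self.def_word_embed = nn.Embedding(def_vocab_size, def_word_dim)
        # LSTM that reads a definition d(w) = <x1, x2, ...> left to right.
        self.lstm = nn.LSTM(def_word_dim, embed_dim, batch_first=True)

    def forward(self, definitions):
        # definitions: list of LongTensors, one per definition of the word w,
        # each holding the token indices of that definition.
        per_definition = []
        for d in definitions:
            tokens = self.def_word_embed(d).unsqueeze(0)   # (1, len, def_word_dim)
            _, (h_last, _) = self.lstm(tokens)             # last hidden state
            per_definition.append(h_last.squeeze(0).squeeze(0))
        # Multiple definitions are combined with mean pooling.
        return torch.stack(per_definition).mean(dim=0)     # e_d(w), shape (embed_dim,)
```

In the end-task model, ed(w) would stand in for the missing embedding of a rare word (or be combined with the original embedding of an in-vocabulary word), and the reader is trained jointly with the end task so that the computed embeddings align with it.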

Experiments

  • Auxiliary data sources:
    • Word definitions from WordNet (a small retrieval sketch follows this list)
    • Spelling of words
  • The proposed approach was tested on several end tasks, including question answering, entailment prediction, and language modelling.
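As a rough illustration (not the paper's code), WordNet glosses for a rare word can be retrieved with NLTK and then fed to the definition reader; `get_definitions` is a hypothetical helper.

```python
from nltk.corpus import wordnet as wn  # requires a one-time nltk.download("wordnet")

def get_definitions(word):
    """Return the dictionary definitions (glosses) of all WordNet synsets of `word`."""
    return [synset.definition() for synset in wn.synsets(word)]

# Example: definitions a definition reader could consume for a rare word.
print(get_definitions("lattice"))
```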

  • For all the tasks, models using both spelling and dictionary definitions (SD) outperformed models using either source alone.

  • While SD does not outperform the GloVe model (with the full vocabulary), it narrows the performance gap significantly.

Future Work

  • Multi-token expressions like “San Francisco” are not currently handled.

  • The model does not handle rare words that appear within the definitions themselves and simply replaces them with a generic unknown-word token. Making the model recursive would be a useful addition.