Learning to Compute Word Embeddings On the Fly
21 Aug 2017

Introduction
- Word-based language models suffer from the problem of rare or out-of-vocabulary (OOV) words.
- Learning representations for OOV words directly on the end task often results in poor representations.
- The alternatives are to replace all the rare words with a single, unique representation (losing information) or to use character-level models to obtain word representations (which tend to miss the semantic relationships between words).
- The paper proposes to learn a network that predicts the representations of words from auxiliary data (referred to as definitions), such as dictionary definitions, Wikipedia infoboxes, or the spelling of the word.
- The auxiliary data encoders are trained jointly with the end task so that the word representations align with the requirements of the end task.
Approach
- Given a rare word w, let d(w) = <x1, x2, ...> denote its definition, where xi are words.
- d(w) is fed to a definition reader network f (an LSTM), and its last state is used as the definition embedding e_d(w).
- If w has multiple definitions, their embeddings are combined using mean pooling.
- The approach can be extended to in-vocabulary words as well by using the definition embeddings of such words to update their original embeddings (see the sketch after this list).
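
A minimal sketch of the definition reader in PyTorch; the module name, dimensions, and the way the definition embedding is combined with the base embedding are illustrative assumptions, not the authors' exact architecture:

```python
import torch
import torch.nn as nn


class DefinitionReader(nn.Module):
    """Encodes a definition d(w) = <x1, x2, ...> with an LSTM and mean-pools
    over multiple definitions of the same word."""

    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, definitions):
        # definitions: list of 1-D LongTensors, one tensor of word ids per definition.
        states = []
        for tokens in definitions:
            emb = self.token_emb(tokens).unsqueeze(0)   # (1, def_len, emb_dim)
            _, (h_n, _) = self.lstm(emb)                # final hidden state
            states.append(h_n[-1].squeeze(0))           # (hidden_dim,)
        # Mean pooling across definitions gives the definition embedding e_d(w).
        return torch.stack(states).mean(dim=0)


# For an in-vocabulary word, the definition embedding can be used to update the
# original embedding, e.g. by summation (an illustrative choice):
# updated = base_embeddings[word_id] + reader(definitions_of(word))
```

Mean pooling keeps the definition embedding the same size regardless of how many definitions a word has.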
Experiments
- Auxiliary data sources:
  - Word definitions from WordNet
  - Spelling of words
- The proposed approach was tested on the following tasks:
  - Extractive Question Answering over SQuAD
    - Base model from Xiong et al. 2016
  - Entailment Prediction over the SNLI corpus
    - Base models from Bowman et al. 2015 and Chen et al. 2016
  - One Billion Words Language Modelling
- For all the tasks, models using both the spelling and the dictionary (SD) outperformed models using just one of them (a toy sketch of combining the two sources follows this list).
- While SD does not outperform the GloVe model (with the full vocabulary), it does bridge the performance gap significantly.
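
A hedged sketch of how the spelling of a word can be treated as just another "definition" (a character sequence fed to a reader of the same form as above) and combined with the dictionary-based embedding; averaging the two sources, the char_to_id mapping, and the function names are assumptions for illustration, not the paper's exact recipe:

```python
import torch


def spelling_definition(word, char_to_id):
    # The spelling of a word as a sequence of character ids, so that it can be
    # encoded like a dictionary definition.
    return torch.tensor([char_to_id[c] for c in word], dtype=torch.long)


def sd_embedding(word, dict_defs, char_to_id, dict_reader, spelling_reader):
    # Combine the dictionary (D) and spelling (S) sources by averaging their
    # embeddings (one plausible way to use both sources together).
    e_dict = dict_reader(dict_defs)
    e_spell = spelling_reader([spelling_definition(word, char_to_id)])
    return (e_dict + e_spell) / 2.0
```

Here dict_reader and spelling_reader are assumed to be instances of the DefinitionReader sketched earlier, one per data source.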
Future Work
- Multi-token words like “San Francisco” are not accounted for yet.
- The model does not handle rare words that appear in a definition itself and just replaces them with the UNK token. Making the model recursive would be a useful addition.