Word Representations via Gaussian Embedding
05 Nov 2017
Introduction
- Existing word embedding models like Skip-Gram and GloVe map words to fixed-size vectors in a low-dimensional vector space.
- This point-estimate setting cannot capture uncertainty about a word's representation.
- Further, these point vectors are compared with symmetric measures like the dot product and cosine similarity, which are not suitable for capturing asymmetric properties like textual entailment and inclusion.
- The paper proposes to learn Gaussian distribution embeddings (with diagonal covariance) for the words.
- This way, words are mapped to soft regions in the embedding space, which enables modeling uncertainty as well as asymmetric properties like inclusion and entailment (a minimal parameterization sketch follows this list).
- Link to the paper
- Implementation
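A minimal NumPy sketch of how such a Gaussian embedding table could be parameterized, assuming diagonal covariances; the array names, vocabulary size, and dimensionality are illustrative placeholders, not taken from the paper or its implementation:

```python
import numpy as np

# Each word is embedded as a Gaussian: a mean vector plus a diagonal covariance,
# stored here as per-dimension variances. Sizes below are illustrative.
vocab_size, dim = 10000, 50
rng = np.random.default_rng(0)
means = rng.normal(scale=0.1, size=(vocab_size, dim))   # mu_w for every word w
variances = np.ones((vocab_size, dim))                  # diagonal of Sigma_w

def gaussian_of(word_id):
    """Return the (mean, diagonal variance) pair that represents one word."""
    return means[word_id], variances[word_id]
```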
Approach
- KL divergence is used as the asymmetric distance function for comparing the distributions.
- Unlike the Word2Vec model, the proposed model uses a ranking-based loss.
Similarity Measures Used
Symmetric Similarity
- For two Gaussian distributions P_i and P_j, compute the inner product (expected likelihood) E(P_i, P_j) = ∫ N(x; μ_i, Σ_i) N(x; μ_j, Σ_j) dx = N(0; μ_i − μ_j, Σ_i + Σ_j).
- Training uses the gradients of log E with respect to the means and covariances.
- The resulting loss can be interpreted as pushing the means closer together while encouraging the two Gaussians to be more concentrated (a sketch of this energy follows below).
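A minimal NumPy sketch of this symmetric energy for diagonal Gaussians, where each distribution is given as a mean vector plus per-dimension variances; the function name is illustrative:

```python
import numpy as np

def log_expected_likelihood(mu_i, var_i, mu_j, var_j):
    """log E(P_i, P_j) = log N(0; mu_i - mu_j, Sigma_i + Sigma_j)
    for diagonal Gaussians given as (mean, per-dimension variance) arrays."""
    var_sum = var_i + var_j          # diagonal of Sigma_i + Sigma_j
    diff = mu_i - mu_j
    d = mu_i.shape[0]
    return -0.5 * (d * np.log(2 * np.pi)
                   + np.sum(np.log(var_sum))
                   + np.sum(diff ** 2 / var_sum))
```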
Asymmetric Similarity
- Use KL divergence between the word and context distributions as the asymmetric energy function.
- The benefit over the symmetric setting is that now entailment type relations can also be modeled.
- For example, a low KL divergence from x to y indicates that y can be encoded as x, i.e., that y “entails” x (see the KL sketch below).
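For reference, a sketch of the closed-form KL divergence between two diagonal Gaussians, which is what this asymmetric energy reduces to in the diagonal-covariance setting; the function name is illustrative:

```python
import numpy as np

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for diagonal Gaussians P = N(mu_p, diag(var_p)) and
    Q = N(mu_q, diag(var_q)); asymmetric, so KL(P||Q) != KL(Q||P) in general."""
    d = mu_p.shape[0]
    diff = mu_q - mu_p
    return 0.5 * (np.sum(var_p / var_q)                     # tr(Sigma_q^{-1} Sigma_p)
                  + np.sum(diff ** 2 / var_q)               # Mahalanobis term
                  - d
                  + np.sum(np.log(var_q) - np.log(var_p)))  # log-determinant ratio
```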
Learning
- One of the two similarity measures is chosen as the energy function, and a max-margin ranking loss is used for training.
- The means are regularized with a simple constraint on their L2 norm.
- For the covariance matrices, the eigenvalues are constrained to lie within a hypercube. This maintains the positive-definite property of the covariance matrices while bounding their size (a rough sketch of these pieces follows this list).
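A rough sketch of these training-time pieces, assuming diagonal covariances; the margin and clipping thresholds below are placeholder values, not the paper's hyperparameters:

```python
import numpy as np

def max_margin_loss(energy_pos, energy_neg, margin=1.0):
    """Ranking loss: the energy of an observed (word, context) pair should exceed
    that of a negatively sampled pair by at least `margin`."""
    return max(0.0, margin - energy_pos + energy_neg)

def project_parameters(mu, var, max_norm=10.0, var_min=1e-3, var_max=10.0):
    """Apply the regularization constraints after a gradient step: clip the mean's
    L2 norm and keep the diagonal variances (covariance eigenvalues) in a box."""
    norm = np.linalg.norm(mu)
    if norm > max_norm:
        mu = mu * (max_norm / norm)
    var = np.clip(var, var_min, var_max)
    return mu, var
```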
Observations
- Polysemous words have embeddings with higher variance than specific words.
- KL divergence (with diagonal covariance) outperforms other models.
- Simple tree hierarchies can also be modeled by embedding them into the Gaussian space: a Gaussian is created for each node with a randomly initialized mean, and the same set of embeddings is used for both nodes and contexts.
- On word similarity benchmarks, embeddings with spherical covariance have a slight edge over embeddings with diagonal covariance and outperform the Skip-Gram model in all cases.
Future Work
- Use combinations of low rank and diagonal matrices for covariances.
- Improved optimization strategies.
- Trying other distributions, such as the Student's t-distribution.