Word Representations via Gaussian Embedding
05 Nov 2017
Introduction
- Existing word embedding models such as Skip-Gram and GloVe map words to fixed-size vectors in a low-dimensional vector space.
- This point-vector setting cannot capture uncertainty about the representation.
- Further, these point vectors are compared with measures like dot product and cosine similarity, which are symmetric and therefore not suitable for capturing asymmetric properties like textual entailment and inclusion.
- The paper proposes to learn Gaussian function embeddings (with diagonal covariance) for the word vectors.
- This way, the words are mapped to soft regions in the embedding space, which enables modeling uncertainty as well as asymmetric properties like inclusion and entailment.
- Link to the paper
- Implementation
- KL divergence is used as the asymmetric distance function for comparing the distributions.
- Unlike the Word2Vec model, the proposed model uses a ranking-based loss.
Similarity Measures used
Symmetric Similarity
- For two Gaussian distributions, Pi and Pj, compute the inner product E(Pi, Pj) as N(0; meani - meanj, sigmai + sigmaj), that is, a Gaussian density with mean (meani - meanj) and covariance (sigmai + sigmaj) evaluated at 0.
- Compute the gradient of log(E) with respect to the mean and sigma.
The resulting loss function can be interpreted as pushing the means closer together while encouraging the two Gaussians to be more concentrated.
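As a rough illustration of the symmetric measure, the sketch below evaluates log E(Pi, Pj) for diagonal covariances stored as variance vectors. The function name and the NumPy-based setup are assumptions made for this example, not taken from the paper's implementation.

```python
import numpy as np

def log_expected_likelihood(mu_i, var_i, mu_j, var_j):
    """log N(0; mu_i - mu_j, Sigma_i + Sigma_j) for diagonal Gaussians.

    mu_* are mean vectors; var_* hold the diagonal entries of the
    covariance matrices (hypothetical representation for this sketch).
    """
    diff = mu_i - mu_j
    var_sum = var_i + var_j          # diagonal of Sigma_i + Sigma_j
    d = diff.shape[0]
    log_det = np.sum(np.log(var_sum))        # log-determinant of a diagonal matrix
    mahalanobis = np.sum(diff ** 2 / var_sum)
    return -0.5 * (log_det + mahalanobis + d * np.log(2.0 * np.pi))

# Two 2-d "words": the score is higher when the means are closer.
mu_a, var_a = np.array([0.0, 0.0]), np.array([0.5, 0.5])
mu_b, var_b = np.array([0.1, 0.0]), np.array([0.5, 0.5])
mu_c, var_c = np.array([3.0, 3.0]), np.array([0.5, 0.5])
near = log_expected_likelihood(mu_a, var_a, mu_b, var_b)
far = log_expected_likelihood(mu_a, var_a, mu_c, var_c)
print(near > far)
```

Swapping the two arguments leaves the value unchanged, which is what makes this measure symmetric.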
Asymmetric Similarity
- Use KL divergence to encode the context distribution.
- The benefit over the symmetric setting is that now entailment type relations can also be modeled.
- For example, a low KL divergence from x to y indicates that y can be encoded as x or that y “entails” x.
- One of the two notions of similarity is chosen and max-margin is used as the loss function.
- Mean is regularized by adding a simple constraint on the L2-norm.
- For the covariance matrix, the eigenvalues are constrained to lie within a hypercube. This keeps the covariance matrix positive definite while also bounding its size.
- Polysemous words have higher variance in their word embeddings as compared to specific words.
- KL divergence (with diagonal covariance) outperforms other models.
- Simple tree hierarchies can also be modeled by embedding into the Gaussian space. A Gaussian is created for each node with randomly initialized mean and the same set of embeddings is used for nodes and context.
- For word similarity benchmarks, embeddings with spherical covariance have a slight edge over embeddings with diagonal covariance and outperform the Skip-Gram model in all the cases.
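The asymmetric measure, the max-margin loss, and the two regularizers described above can be sketched as follows for the diagonal-covariance case. This is a minimal sketch under stated assumptions: all function names, the margin value, the argument order of the KL divergence, and the bounds used in the example are hypothetical, and the energy is taken to be the negative KL divergence.

```python
import numpy as np

def kl_divergence(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) between diagonal Gaussians given as mean/variance vectors."""
    d = mu_p.shape[0]
    diff = mu_q - mu_p
    return 0.5 * (
        np.sum(var_p / var_q)            # trace term
        + np.sum(diff ** 2 / var_q)      # Mahalanobis term
        - d
        + np.sum(np.log(var_q)) - np.sum(np.log(var_p))  # log-determinant ratio
    )

def max_margin_loss(energy_pos, energy_neg, margin=1.0):
    """Ranking loss: the positive pair should score higher than the
    negative pair by at least `margin` (margin value is an assumption)."""
    return max(0.0, margin - energy_pos + energy_neg)

def project_mean(mu, max_norm):
    """Rescale the mean so that its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(mu)
    return mu if norm <= max_norm else mu * (max_norm / norm)

def clip_variance(var, lower, upper):
    """Keep each diagonal variance (eigenvalue) inside [lower, upper]."""
    return np.clip(var, lower, upper)

# A narrow Gaussian sits inside a broad one, so the divergence is asymmetric.
mu = np.zeros(1)
narrow, broad = np.array([0.1]), np.array([1.0])
print(kl_divergence(mu, narrow, mu, broad) < kl_divergence(mu, broad, mu, narrow))

# Energy = -KL; the loss is zero once the margin is satisfied.
e_pos = -kl_divergence(mu, narrow, mu, broad)
e_neg = -kl_divergence(mu, broad, mu, narrow)
print(max_margin_loss(e_pos, e_neg, margin=1.0))
```

With diagonal covariances the eigenvalues are just the diagonal entries, so the hypercube constraint reduces to an element-wise clip with a positive lower bound.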
Future Work
- Use combinations of low rank and diagonal matrices for covariances.
- Improve the optimisation strategies.
- Try other distributions such as the Student's t-distribution.