Word Representations via Gaussian Embedding
05 Nov 2017
Introduction
- Existing word embedding models like Skip-Gram and GloVe map words to fixed-size vectors in a low-dimensional vector space.
- This point-estimate setting cannot capture uncertainty about a word's representation.
- Further, these point vectors are compared with symmetric measures like the dot product and cosine similarity, which are not suitable for capturing asymmetric properties like textual entailment and inclusion.
- The paper proposes to learn Gaussian distribution embeddings (with diagonal covariance) for the words.
- This way, words are mapped to soft regions in the embedding space, which enables modeling uncertainty as well as asymmetric properties like inclusion and entailment (a minimal parameterization sketch follows this list).
- Link to the paper
- Implementation
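A minimal NumPy sketch of how such a Gaussian embedding table could be parameterized, assuming diagonal covariances; the array names, vocabulary size, and dimensionality are illustrative placeholders, not taken from the paper or its implementation:

```python
import numpy as np

# Each word is embedded as a Gaussian: a mean vector plus a diagonal covariance,
# stored here as per-dimension variances. Sizes below are illustrative.
vocab_size, dim = 10000, 50
rng = np.random.default_rng(0)
means = rng.normal(scale=0.1, size=(vocab_size, dim))   # mu_w for every word w
variances = np.ones((vocab_size, dim))                  # diagonal of Sigma_w

def gaussian_of(word_id):
    """Return the (mean, diagonal variance) pair that represents one word."""
    return means[word_id], variances[word_id]
```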
Approach
- KL divergence is used as the asymmetric distance function for comparing the distributions.
- Unlike the Word2Vec model, the proposed model uses a ranking-based loss.
Similarity Measures Used
Symmetric Similarity
- For two Gaussian distributions P_i and P_j, compute the inner product (expected likelihood) E(P_i, P_j) = ∫ N(x; μ_i, Σ_i) N(x; μ_j, Σ_j) dx = N(0; μ_i − μ_j, Σ_i + Σ_j).
- Training uses the gradients of log E with respect to the means and covariances.
- The resulting loss can be interpreted as pushing the means closer together while encouraging the two Gaussians to be more concentrated (a sketch of this energy follows below).
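A minimal NumPy sketch of this symmetric energy for diagonal Gaussians, where each distribution is given as a mean vector plus per-dimension variances; the function name is illustrative:

```python
import numpy as np

def log_expected_likelihood(mu_i, var_i, mu_j, var_j):
    """log E(P_i, P_j) = log N(0; mu_i - mu_j, Sigma_i + Sigma_j)
    for diagonal Gaussians given as (mean, per-dimension variance) arrays."""
    var_sum = var_i + var_j          # diagonal of Sigma_i + Sigma_j
    diff = mu_i - mu_j
    d = mu_i.shape[0]
    return -0.5 * (d * np.log(2 * np.pi)
                   + np.sum(np.log(var_sum))
                   + np.sum(diff ** 2 / var_sum))
```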
Asymmetric Similarity
- Use KL divergence between the word and context distributions as the asymmetric energy function.
- The benefit over the symmetric setting is that now entailment type relations can also be modeled.
- For example, a low KL divergence from x to y indicates that y can be encoded as x, i.e., that y “entails” x (see the KL sketch below).
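For reference, a sketch of the closed-form KL divergence between two diagonal Gaussians, which is what this asymmetric energy reduces to in the diagonal-covariance setting; the function name is illustrative:

```python
import numpy as np

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for diagonal Gaussians P = N(mu_p, diag(var_p)) and
    Q = N(mu_q, diag(var_q)); asymmetric, so KL(P||Q) != KL(Q||P) in general."""
    d = mu_p.shape[0]
    diff = mu_q - mu_p
    return 0.5 * (np.sum(var_p / var_q)                     # tr(Sigma_q^{-1} Sigma_p)
                  + np.sum(diff ** 2 / var_q)               # Mahalanobis term
                  - d
                  + np.sum(np.log(var_q) - np.log(var_p)))  # log-determinant ratio
```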
Learning
- One of the two similarity measures is chosen as the energy function, and a max-margin ranking loss is used for training.
- The means are regularized with a simple constraint on their L2 norm.
- For the covariance matrices, the eigenvalues are constrained to lie within a hypercube. This maintains the positive-definite property of the covariance matrices while bounding their size (a rough sketch of these pieces follows this list).
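A rough sketch of these training-time pieces, assuming diagonal covariances; the margin and clipping thresholds below are placeholder values, not the paper's hyperparameters:

```python
import numpy as np

def max_margin_loss(energy_pos, energy_neg, margin=1.0):
    """Ranking loss: the energy of an observed (word, context) pair should exceed
    that of a negatively sampled pair by at least `margin`."""
    return max(0.0, margin - energy_pos + energy_neg)

def project_parameters(mu, var, max_norm=10.0, var_min=1e-3, var_max=10.0):
    """Apply the regularization constraints after a gradient step: clip the mean's
    L2 norm and keep the diagonal variances (covariance eigenvalues) in a box."""
    norm = np.linalg.norm(mu)
    if norm > max_norm:
        mu = mu * (max_norm / norm)
    var = np.clip(var, var_min, var_max)
    return mu, var
```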
Observations
- Polysemous words have embeddings with higher variance than specific words.
- KL divergence (with diagonal covariance) outperforms other models.
- Simple tree hierarchies can also be modeled by embedding them into the Gaussian space: a Gaussian is created for each node with a randomly initialized mean, and the same set of embeddings is used for both nodes and contexts.
- On word similarity benchmarks, embeddings with spherical covariance have a slight edge over embeddings with diagonal covariance and outperform the Skip-Gram model in all cases.
Future Work
- Use combinations of low rank and diagonal matrices for covariances.
- Improved optimization strategies.
- Trying other distributions, such as the Student's t-distribution.