TuckER: Tensor Factorization for Knowledge Graph Completion
19 Feb 2019

Introduction
- TuckER is a simple, yet powerful linear model that uses Tucker decomposition for the task of link prediction in knowledge graphs.
Knowledge Graph as a Tensor
- Let E be the set of all entities and R the set of all relations in a given knowledge graph (KG).
- The KG can be represented as a list of triples of the form (subject entity, relation, object entity), or (e_s, r, e_o).
- The list of triples can be represented as a third-order tensor of binary values, where each element corresponds to a possible triple and its value indicates whether that triple is present in the KG or not (see the sketch after this list).
- The link prediction task can be formulated as: given the observed triples, learn a scoring function that assigns a score to each possible triple, indicating whether that triple is actually present in the KG or not.
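A minimal sketch of this tensor representation, using a toy KG (the entities, relations and triples below are made up for illustration):

```python
import numpy as np

# Toy KG: entities and relations are assigned integer indices,
# triples are (subject, relation, object).
entities = {"paris": 0, "france": 1, "berlin": 2, "germany": 3}
relations = {"capital_of": 0, "located_in": 1}
triples = [("paris", "capital_of", "france"),
           ("berlin", "capital_of", "germany"),
           ("paris", "located_in", "france")]

# Binary third-order tensor: X[s, r, o] = 1 iff (s, r, o) is in the KG.
X = np.zeros((len(entities), len(relations), len(entities)))
for s, r, o in triples:
    X[entities[s], relations[r], entities[o]] = 1.0
```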
Tucker Decomposition
- Tucker decomposition factorizes a tensor into a set of factor matrices and a smaller core tensor.
- In the specific case of three-mode tensors (the alternate representation of a KG), the original tensor X (of shape I×J×K) can be factorized into a core tensor W (of shape P×Q×R) and 3 factor matrices - A (of shape I×P), B (of shape J×Q) and C (of shape K×R) - such that X ≈ W ×₁ A ×₂ B ×₃ C, where ×ₙ denotes the tensor product along the n-th mode (see the sketch after this list).
- Generally, P, Q, R are smaller than I, J, K (respectively), so W can be seen as a compressed version of X.
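A minimal numpy sketch of this three-mode Tucker reconstruction (all shapes below are illustrative):

```python
import numpy as np

I, J, K = 8, 6, 8  # shape of the original tensor X
P, Q, R = 3, 2, 3  # shape of the (smaller) core tensor W

W = np.random.randn(P, Q, R)  # core tensor
A = np.random.randn(I, P)     # factor matrix for mode 1
B = np.random.randn(J, Q)     # factor matrix for mode 2
C = np.random.randn(K, R)     # factor matrix for mode 3

# X ≈ W ×₁ A ×₂ B ×₃ C: contract each mode of W with its factor matrix.
X_approx = np.einsum('pqr,ip,jq,kr->ijk', W, A, B, C)
print(X_approx.shape)  # (8, 6, 8)
```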
TuckER Decomposition for Link Prediction
- Two embedding matrices are used, for embedding the entities and the relations respectively.
- The entity embedding matrix E is shared between subject and object entities, i.e. E = A = C.
- The scoring function is given as φ(e_s, r, e_o) = W ×₁ e_s ×₂ w_r ×₃ e_o, where e_s, w_r and e_o are the embedding vectors corresponding to the subject entity, the relation and the object entity respectively. Note that both the core tensor and the factor matrices are to be learnt.
- The model is trained with the standard negative log-likelihood (binary cross-entropy) loss, given as (for one triple): -(y * log(p) + (1 - y) * log(1 - p)), where p is the predicted probability of the triple and y is its binary label.
- To speed up training and increase accuracy, 1-N scoring is used: a given (e_s, r) pair is simultaneously scored against all entities, using the local closed-world assumption (the knowledge graph is only locally complete). See the sketch after this list.
- Handling asymmetric relations is straightforward: each relation learns its own embedding, while the relation-agnostic core tensor enables knowledge sharing across relations.
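The pieces above fit together in a few lines. Below is a hedged PyTorch sketch of the scoring function, 1-N scoring and the binary cross-entropy loss; all sizes and names (E, Rel, W, score, score_1_to_N) are illustrative assumptions, not the authors' implementation:

```python
import torch

n_entities, n_relations = 1000, 50  # illustrative sizes
d_e, d_r = 200, 30                  # entity / relation embedding dimensions

E = torch.randn(n_entities, d_e, requires_grad=True)     # entity embeddings (shared for subject and object)
Rel = torch.randn(n_relations, d_r, requires_grad=True)  # relation embeddings
W = torch.randn(d_e, d_r, d_e, requires_grad=True)       # core tensor

def score(s, r, o):
    """phi(e_s, r, e_o) = W x1 e_s x2 w_r x3 e_o for a single triple."""
    phi = torch.einsum('i,j,k,ijk->', E[s], Rel[r], E[o], W)
    return torch.sigmoid(phi)  # probability that the triple holds

def score_1_to_N(s, r):
    """1-N scoring: score a (e_s, r) pair against every entity at once."""
    x = torch.einsum('i,j,ijk->k', E[s], Rel[r], W)  # shape (d_e,)
    return torch.sigmoid(E @ x)                      # shape (n_entities,)

# Binary cross-entropy over all candidate objects for one (s, r) pair.
p = score_1_to_N(0, 0)
y = torch.zeros(n_entities)
y[1] = 1.0  # entity 1 marked as an observed object for this pair
loss = torch.nn.functional.binary_cross_entropy(p, y)
loss.backward()
```

Note how 1-N scoring replaces n_entities separate evaluations with a single matrix multiplication against E, which is where the speed-up comes from.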
Theoretical Analysis
- One important consideration is the expressive power of TuckER models, especially in relation to other models like ComplEx and SimplE.
- It can be shown that TuckER is fully expressive, i.e. given any ground truth over E and R, there exists a TuckER model which can perfectly represent the data, using one-hot entity and relation embeddings.
- For full expressiveness, the required dimensionality of entity (relation) embeddings is n_E (n_R), where n_E (n_R) is the number of entities (relations). In comparison, the required dimensionality for ComplEx is n_E * n_R (for both entities and relations) and for SimplE it is min(n_E * n_R, number of facts + 1) (for both entities and relations).
- Many existing models like RESCAL, DistMult, ComplEx and SimplE can be seen as special cases of TuckER; for example, constraining the core tensor to be superdiagonal recovers DistMult (see the sketch after this list).
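To make the special-case claim concrete, here is a small sketch showing that a superdiagonal core tensor reduces the TuckER scoring function to DistMult's (the dimension and vectors are made up; this also assumes d_e = d_r):

```python
import torch

d = 5  # assumes entity and relation embeddings share this dimension
e_s, w_r, e_o = torch.randn(d), torch.randn(d), torch.randn(d)

# Superdiagonal core tensor: W[i, i, i] = 1, all other entries 0.
W = torch.zeros(d, d, d)
idx = torch.arange(d)
W[idx, idx, idx] = 1.0

tucker_score = torch.einsum('i,j,k,ijk->', e_s, w_r, e_o, W)
distmult_score = (e_s * w_r * e_o).sum()
assert torch.allclose(tucker_score, distmult_score)
```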
Experiments
Datasets
- FB15k, FB15k-237, WN18, WN18RR
- Across these datasets, the maximum number of entities is around 41K and the maximum number of relations is around 1.3K.
Implementation
- Batch normalization, dropout and learning rate decay are used.
Metrics
- Mean Reciprocal Rank (MRR) - the average, over all test triples, of the inverse of the rank assigned to the true triple among the n_e generated candidate triples.
- hits@k (k = 1, 3, 10) - the percentage of times the true triple is ranked in the top k of the n_e generated candidate triples.
- Higher is better for both metrics; see the computation sketch after this list.
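A small sketch of how both metrics are computed from the ranks of the true triples (the ranks below are made up):

```python
import numpy as np

# Rank of the true triple among the n_e candidates, one rank per test triple.
ranks = np.array([1, 3, 2, 10, 50])

mrr = np.mean(1.0 / ranks)                              # ≈ 0.39 here
hits_at = {k: np.mean(ranks <= k) for k in (1, 3, 10)}  # fraction ranked in top k
print(mrr, hits_at)  # hits@1 = 0.2, hits@3 = 0.6, hits@10 = 0.8
```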
Results
- TuckER outperforms all the baseline models on all but one task.
- Dropout is an important factor, with higher dropout rates (0.3, 0.4, 0.5) needed for datasets with fewer training examples per relation (and hence more prone to overfitting).
- TuckER improves performance more significantly when the number of relations is large.
- Even with lower embedding dimensions, TuckER's performance does not deteriorate as much as that of other models.