Papers I Read Notes and Summaries

Refining Source Representations with Relation Networks for Neural Machine Translation

Introduction

  • The paper introduces a Relation Network (RN) that refines the encoded representation of the given source document (or sentence).
  • This refined source representation can then be used in Neural Machine Translation (NMT) systems to counter the problem of RNNs forgetting old information.
  • Link to the paper

Limitations of existing NMT models

  • The RNN encoder-decoder architecture is the standard choice for NMT systems, but RNNs are prone to forgetting old information.
  • In NMT models, attention is modeled at the level of words, while attending over phrases (instead of words) would be a better choice.
  • While NMT systems might be able to capture certain relationships between words, they are not explicitly designed to capture such information.

Contributions of the paper

  • Learn the relationships between source words using their context (neighboring words).
  • Relation Networks (RNs) build pairwise relations between source words using the representations generated by the RNNs. The RN sits between the encoder and the attention layer of the encoder-decoder framework, thereby leaving the main architecture unaffected (a minimal wiring sketch follows this list).
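
As a rough illustration of this placement, here is a tiny wiring sketch. All function names are placeholders, not the paper's code; the point is only that the RN is a drop-in refinement step between the encoder and the attention-based decoder.

```python
# Hypothetical wiring (names are placeholders, not the paper's code): the RN
# refines the encoder states, so the attention layer and the decoder of the
# standard encoder-decoder framework stay unchanged.

def encode(source_words):                     # stand-in for the RNN encoder
    return [f"h({w})" for w in source_words]

def relation_network(states):                 # stand-in for the RN refinement
    return [f"refined({h})" for h in states]  # same number of source states

def decode_with_attention(states):            # stand-in for the unchanged decoder
    return f"translation attending over {len(states)} refined source states"

print(decode_with_attention(relation_network(encode("a source sentence".split()))))
```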

Relation Network

  • A neural network designed for relational reasoning.
  • Given a set of objects O = {o1, …, on}, the RN is formed as a composition over pairs of objects: RN(O) = f(Σ_{i,j} g(oi, oj)), where f and g are feed-forward networks used to learn the relations. (A fuller code sketch of this composition and the layers below follows this list.)
  • g learns how the objects are related, hence the name “relation”.
  • Components:
    • CNN Layer
      • Extract information from the words surrounding the given word (context).
      • The final output of this layer is a sequence of vectors, one set for each kernel width.
    • Graph Propagation (GP) Layer
      • Connect all the words with each other in the form of a graph.
      • Each output vector from the CNN corresponds to a node in the graph, and there is an edge between every pair of nodes.
      • Information flows between the nodes of the graph in a message-passing fashion (graph propagation) to obtain a new vector for each node.
    • Multi-Layer Perceptron (MLP) Layer
      • The representation from the GP Layer is fed to the MLP layer.
      • The layer uses residual connections from the previous layers, realized as concatenation.
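
As a rough, self-contained sketch of the pieces above (the pairwise composition RN(O) = f(Σ g(oi, oj)) and the CNN → graph propagation → MLP stack), here is a PyTorch-style module. The layer sizes, kernel widths, and the use of scaled dot-product attention for the message passing are illustrative assumptions, not the paper's exact parameterization.

```python
# Minimal sketch of a Relation Network block that refines encoder states.
# Layer sizes, kernel widths, and attention-based message passing are
# illustrative assumptions, not the paper's exact parameterization.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationNetwork(nn.Module):
    def __init__(self, d_model=512, kernel_widths=(1, 3, 5)):
        super().__init__()
        # CNN layer: one 1-D convolution per kernel width, so each word's
        # vector is built from its surrounding context.
        self.convs = nn.ModuleList(
            [nn.Conv1d(d_model, d_model, k, padding=k // 2) for k in kernel_widths]
        )
        d_ctx = d_model * len(kernel_widths)
        # Graph propagation layer: the source words form a fully connected
        # graph and exchange messages, here via attention over all node pairs
        # (a stand-in for g in RN(O) = f(sum g(oi, oj))).
        self.query = nn.Linear(d_ctx, d_model)
        self.key = nn.Linear(d_ctx, d_model)
        self.value = nn.Linear(d_ctx, d_model)
        # MLP layer with a residual connection realized as concatenation of
        # the original encoder states, the CNN context, and the propagated
        # representation.
        self.mlp = nn.Sequential(nn.Linear(d_model + d_ctx + d_model, d_model),
                                 nn.ReLU())

    def forward(self, enc_states):                 # (batch, src_len, d_model)
        x = enc_states.transpose(1, 2)             # (batch, d_model, src_len)
        # Context features from several kernel widths, concatenated per word.
        ctx = torch.cat([conv(x) for conv in self.convs], dim=1).transpose(1, 2)
        # Pairwise relations: every node attends to every other node.
        q, k, v = self.query(ctx), self.key(ctx), self.value(ctx)
        attn = F.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        propagated = attn @ v                      # aggregation over all pairs
        # Residual-by-concatenation into the final MLP.
        refined = self.mlp(torch.cat([enc_states, ctx, propagated], dim=-1))
        return refined                             # read by the decoder's attention


# Usage: refine the encoder outputs before the attention/decoder stage.
encoder_outputs = torch.randn(2, 30, 512)          # (batch, src_len, d_model)
refined = RelationNetwork()(encoder_outputs)       # same shape as the input
```

The concatenation feeding the final MLP is one way to realize the residual-by-concatenation mentioned above; the refined states are then consumed by the decoder's attention in place of the raw encoder states.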

Datasets

  • IWSLT Data - 44K sentences from the tourism and travel domain.
  • NIST Data - 1M Chinese-English parallel sentence pairs.

Models

  • MOSES - Open-source statistical machine translation system - http://www.statmt.org/moses/
  • NMT - Attention based NMT
  • NMT+ - NMT with improved decoder
  • TRANSFORMER - Google’s self-attention based NMT model
  • RNMT+ - Relation Network integrated with NMT+

Evaluation Metric

  • case-insensitive 4-gram BLEU score
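
As a minimal sketch of what case-insensitive 4-gram BLEU means, assuming NLTK's corpus_bleu (the paper's own evaluation script is not specified here and may differ, e.g. multi-bleu.perl):

```python
# Sketch of case-insensitive 4-gram BLEU using NLTK (an assumption; the
# paper's evaluation script may differ).
from nltk.translate.bleu_score import corpus_bleu

references = [[["The", "cat", "sat", "on", "the", "mat", "."]]]  # refs per hypothesis
hypotheses = [["the", "cat", "sat", "on", "the", "mat", "."]]

# Lowercasing first makes the score case-insensitive; the default weights
# (0.25, 0.25, 0.25, 0.25) give the standard 4-gram BLEU.
references = [[[tok.lower() for tok in ref] for ref in refs] for refs in references]
hypotheses = [[tok.lower() for tok in hyp] for hyp in hypotheses]

print(corpus_bleu(references, hypotheses))  # 1.0 once casing is ignored
```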

Observations

  • As sentences become longer (more than 50 words), RNMT+ clearly outperforms the other baselines.
  • Qualitative evaluation shows that the RNMT+ model captures word alignment better than the NMT+ model.
  • Similarly, the NMT+ system tends to miss some information from the source sentence (more so for longer sentences). While both CNNs and RNNs are weak at capturing long-term dependencies, using the relation layer mitigates this issue to some extent.