Papers I Read Notes and Summaries

Refining Source Representations with Relation Networks for Neural Machine Translation

Introduction

  • The paper introduces a Relation Network (RN) that refines the encoded representation of the given source document (or sentence).
  • This refined source representation can then be used in Neural Machine Translation (NMT) systems to counter the problem of RNNs forgetting old information.
  • Link to the paper

Limitations of existing NMT models

  • The RNN encoder-decoder architecture is the standard choice for NMT systems, but RNNs are prone to forgetting old information.
  • In NMT models, attention is modeled at the level of words, while attending over phrases (instead of words) would be a better choice.
  • While NMT systems might be able to capture certain relationships between words, they are not explicitly designed to capture such information.

Contributions of the paper

  • Learn the relationships between source words using their context (neighboring words).
  • Relation Networks (RNs) build pairwise relations between source words using the representations generated by the RNNs. The RN sits between the encoder and the attention layer of the encoder-decoder framework, thereby leaving the main architecture unaffected (a minimal wiring sketch follows this list).
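
As a rough illustration of this placement, here is a tiny wiring sketch. All function names are placeholders, not the paper's code; the point is only that the RN is a drop-in refinement step between the encoder and the attention-based decoder.

```python
# Hypothetical wiring (names are placeholders, not the paper's code): the RN
# refines the encoder states, so the attention layer and the decoder of the
# standard encoder-decoder framework stay unchanged.

def encode(source_words):                     # stand-in for the RNN encoder
    return [f"h({w})" for w in source_words]

def relation_network(states):                 # stand-in for the RN refinement
    return [f"refined({h})" for h in states]  # same number of source states

def decode_with_attention(states):            # stand-in for the unchanged decoder
    return f"translation attending over {len(states)} refined source states"

print(decode_with_attention(relation_network(encode("a source sentence".split()))))
```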

Relation Network

  • A neural network designed for relational reasoning.
  • Given a set of objects O = {o1, …, on}, the RN is formed as a composition over pairs of objects: RN(O) = f(Σ_{i,j} g(oi, oj)), where f and g are feed-forward networks used to learn the relations. (A fuller code sketch of this composition and the layers below follows this list.)
  • g learns how the objects are related, hence the name “relation”.
  • Components:
    • CNN Layer
      • Extract information from the words surrounding the given word (context).
      • The final output of this layer is a sequence of vectors, one set for each kernel width.
    • Graph Propagation (GP) Layer
      • Connect all the words with each other in the form of a graph.
      • Each output vector from the CNN corresponds to a node in the graph, and there is an edge between every pair of nodes.
      • Information flows between the nodes of the graph in a message-passing fashion (graph propagation) to obtain a new vector for each node.
    • Multi-Layer Perceptron (MLP) Layer
      • The representation from the GP Layer is fed to the MLP layer.
      • The layer uses residual connections from the previous layers, realized as concatenation.
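
As a rough, self-contained sketch of the pieces above (the pairwise composition RN(O) = f(Σ g(oi, oj)) and the CNN → graph propagation → MLP stack), here is a PyTorch-style module. The layer sizes, kernel widths, and the use of scaled dot-product attention for the message passing are illustrative assumptions, not the paper's exact parameterization.

```python
# Minimal sketch of a Relation Network block that refines encoder states.
# Layer sizes, kernel widths, and attention-based message passing are
# illustrative assumptions, not the paper's exact parameterization.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationNetwork(nn.Module):
    def __init__(self, d_model=512, kernel_widths=(1, 3, 5)):
        super().__init__()
        # CNN layer: one 1-D convolution per kernel width, so each word's
        # vector is built from its surrounding context.
        self.convs = nn.ModuleList(
            [nn.Conv1d(d_model, d_model, k, padding=k // 2) for k in kernel_widths]
        )
        d_ctx = d_model * len(kernel_widths)
        # Graph propagation layer: the source words form a fully connected
        # graph and exchange messages, here via attention over all node pairs
        # (a stand-in for g in RN(O) = f(sum g(oi, oj))).
        self.query = nn.Linear(d_ctx, d_model)
        self.key = nn.Linear(d_ctx, d_model)
        self.value = nn.Linear(d_ctx, d_model)
        # MLP layer with a residual connection realized as concatenation of
        # the original encoder states, the CNN context, and the propagated
        # representation.
        self.mlp = nn.Sequential(nn.Linear(d_model + d_ctx + d_model, d_model),
                                 nn.ReLU())

    def forward(self, enc_states):                 # (batch, src_len, d_model)
        x = enc_states.transpose(1, 2)             # (batch, d_model, src_len)
        # Context features from several kernel widths, concatenated per word.
        ctx = torch.cat([conv(x) for conv in self.convs], dim=1).transpose(1, 2)
        # Pairwise relations: every node attends to every other node.
        q, k, v = self.query(ctx), self.key(ctx), self.value(ctx)
        attn = F.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        propagated = attn @ v                      # aggregation over all pairs
        # Residual-by-concatenation into the final MLP.
        refined = self.mlp(torch.cat([enc_states, ctx, propagated], dim=-1))
        return refined                             # read by the decoder's attention


# Usage: refine the encoder outputs before the attention/decoder stage.
encoder_outputs = torch.randn(2, 30, 512)          # (batch, src_len, d_model)
refined = RelationNetwork()(encoder_outputs)       # same shape as the input
```

The concatenation feeding the final MLP is one way to realize the residual-by-concatenation mentioned above; the refined states are then consumed by the decoder's attention in place of the raw encoder states.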

Datasets

  • IWSLT Data - 44K sentences from the tourism and travel domain.
  • NIST Data - 1M Chinese-English parallel sentence pairs.

Models

  • MOSES - Open-source statistical machine translation system - http://www.statmt.org/moses/
  • NMT - Attention based NMT
  • NMT+ - NMT with improved decoder
  • TRANSFORMER - Google’s self-attention based NMT model
  • RNMT+ - Relation Network integrated with NMT+

Evaluation Metric

  • case-insensitive 4-gram BLEU score
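
As a minimal sketch of what case-insensitive 4-gram BLEU means, assuming NLTK's corpus_bleu (the paper's own evaluation script is not specified here and may differ, e.g. multi-bleu.perl):

```python
# Sketch of case-insensitive 4-gram BLEU using NLTK (an assumption; the
# paper's evaluation script may differ).
from nltk.translate.bleu_score import corpus_bleu

references = [[["The", "cat", "sat", "on", "the", "mat", "."]]]  # refs per hypothesis
hypotheses = [["the", "cat", "sat", "on", "the", "mat", "."]]

# Lowercasing first makes the score case-insensitive; the default weights
# (0.25, 0.25, 0.25, 0.25) give the standard 4-gram BLEU.
references = [[[tok.lower() for tok in ref] for ref in refs] for refs in references]
hypotheses = [[tok.lower() for tok in hyp] for hyp in hypotheses]

print(corpus_bleu(references, hypotheses))  # 1.0 once casing is ignored
```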

Observations

  • As sentences become longer (more than 50 words), RNMT+ clearly outperforms the other baselines.
  • Qualitative evaluation shows that the RNMT+ model captures word alignment better than the NMT+ model.
  • Similarly, the NMT+ system tends to miss some information from the source sentence (more so for longer sentences). While both CNNs and RNNs are weak at capturing long-term dependencies, using the relation layer mitigates this issue to some extent.