Get To The Point - Summarization with Pointer-Generator Networks
05 Feb 2018

Introduction
- Sequence-to-Sequence models have made abstractive summarization viable, but they still suffer from issues like out-of-vocabulary words and repetitive sentences.
- The paper proposes to overcome these limitations by using a hybrid Pointer-Generator network (to copy words from the source text) and a coverage vector that keeps track of content that has already been summarized, so as to discourage repetition.
Model
Pointer Generator Network
- It is a hybrid of the Sequence-to-Sequence network and the Pointer Network: at each decoding step, the model decides whether to generate the next word from the fixed output vocabulary (Sequence-to-Sequence softmax) or to copy it from the source text (Pointer Network).
- Since the model can copy words directly from the source text, the issue of out-of-vocabulary words is handled.
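Below is a minimal sketch, in NumPy, of how the final output distribution could be formed as a mixture of the generation and copy distributions. The function name, tensor names, and shapes are assumptions for illustration; pgen is the soft switch the decoder computes at each step, and the copy part scatters attention mass onto the vocabulary ids of the source tokens (with out-of-vocabulary source words mapped to ids beyond the fixed vocabulary).

```python
import numpy as np

def final_distribution(p_vocab, attention, source_ids, vocab_size, p_gen):
    """Mix generation and copy distributions (illustrative sketch).

    p_vocab:    [vocab_size] softmax over the fixed output vocabulary
    attention:  [src_len]    attention weights over source positions
    source_ids: [src_len]    ids of the source tokens in an extended
                             vocabulary (OOV source words get ids >= vocab_size)
    p_gen:      scalar in (0, 1), probability of generating vs. copying
    """
    extended_size = max(vocab_size, int(source_ids.max()) + 1)
    p_final = np.zeros(extended_size)
    p_final[:vocab_size] = p_gen * p_vocab               # generation part
    # copy part: scatter-add attention mass onto the source tokens' ids
    np.add.at(p_final, source_ids, (1.0 - p_gen) * attention)
    return p_final
```

Because the copy term can place probability on ids outside the fixed vocabulary, the model is able to emit out-of-vocabulary source words, which is how the OOV issue above is handled.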
Coverage Mechanism
- The model maintains a coverage vector, which is the sum of the attention distributions over all previous decoder timesteps.
- This coverage vector is fed as an input to the attention mechanism.
- A coverage loss is added to prevent the model from repeatedly attending to the same word.
- The idea is to capture how much coverage each source word has already received from the attention mechanism.
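A minimal sketch of the coverage bookkeeping, again in NumPy with assumed names: the coverage vector is the running sum of past attention distributions, and the per-step coverage loss is the sum of min(attention, coverage) over source positions, added (with a weight hyperparameter) to the main loss during the coverage finetuning phase mentioned below.

```python
import numpy as np

def coverage_and_loss(attention_steps, cov_loss_weight=1.0):
    """Accumulate coverage and the coverage loss over decoder timesteps.

    attention_steps: list of [src_len] attention distributions, one per
                     decoder timestep (the attention module would also
                     receive the current coverage vector as an input).
    cov_loss_weight: weight of the coverage loss in the total objective.
    """
    coverage = np.zeros_like(attention_steps[0])   # sum of past attention
    coverage_loss = 0.0
    for a_t in attention_steps:
        # penalize re-attending to source positions that are already covered
        coverage_loss += np.minimum(a_t, coverage).sum()
        coverage += a_t                            # update coverage for next step
    return coverage, cov_loss_weight * coverage_loss
```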
Observation
- When evaluated on the CNN/Daily Mail summarization task, the model outperforms the previous abstractive state-of-the-art by at least 2 ROUGE points, though it still does not outperform the lead-3 baseline.
- The lead-3 baseline uses the first 3 sentences of the article as the summary, which should be a strong baseline given that the dataset consists of news articles (a minimal sketch of this baseline is given after this list).
- The model is initially trained without coverage and then finetuned with the coverage loss.
- During training, the model first learns how to copy words and then how to generate words (pgen starts from 0.30 and converges to 0.53).
- During testing, the model strongly prefers copying over generating (pgen = 0.17).
- Further, whenever the model is at the beginning of a sentence or at the join between stitched-together fragments, it prefers to generate a word instead of copying one from the source text.
- The overall model is very simple, neat and interpretable, and it also performs well in practice.
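For reference, the lead-3 baseline mentioned above is trivial to implement; the sketch below uses a naive regex-based sentence splitter purely for illustration (any proper sentence tokenizer could be substituted).

```python
import re

def lead3_summary(article_text):
    """Lead-3 baseline: the summary is simply the first three sentences."""
    sentences = re.split(r'(?<=[.!?])\s+', article_text.strip())
    return ' '.join(sentences[:3])
```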