Papers I Read Notes and Summaries

PTE - Predictive Text Embedding through Large-scale Heterogeneous Text Networks

Introduction

  • Unsupervised text embeddings generalise across different tasks but have weaker predictive power (compared to end-to-end trained deep learning methods) on any particular task. Deep learning techniques, on the other hand, are computationally expensive and need a large amount of supervised data and a large number of parameters to tune.

  • The paper introduces Predictive Text Embedding (PTE) - a semi-supervised approach which learns an effective low-dimensional representation using a large amount of unlabelled data and a small amount of labelled data.

  • The work can be extended to general information networks as well, since classic techniques like MDS, IsoMap, Laplacian Eigenmaps etc. do not scale well to large graphs.

  • Further, unlike the previous works LINE and DeepWalk, which work on homogeneous networks only, this model can be applied to heterogeneous networks as well.

  • Link to the paper

Approach

  • The paper proposes 3 different kinds of networks:

    • Word-Word Network which captures the word co-occurrence information (local level).
    • Word-Document Network which captures the word-document co-occurrence information (local + document level).
    • Word-Label Network which captures the word-label co-occurrence information (bipartite graph).
  • All 3 graphs are integrated into one heterogeneous text network.
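
  • A minimal sketch (not the authors' code) of how the three edge-weight tables could be built from a tokenised corpus is shown below; the function name, the window size and the data format are assumptions made for illustration.

```python
from collections import defaultdict

def build_networks(docs, labels=None, window=5):
    """docs: list of tokenised documents; labels: list of class labels
    (None for unlabelled documents). All names here are illustrative."""
    ww = defaultdict(int)  # word-word edges: co-occurrence counts in a sliding window
    wd = defaultdict(int)  # word-document edges: term frequencies
    wl = defaultdict(int)  # word-label edges: counts of a word in documents of a class
    for d, tokens in enumerate(docs):
        for i, w in enumerate(tokens):
            wd[(w, d)] += 1
            if labels is not None and labels[d] is not None:
                wl[(w, labels[d])] += 1
            for u in tokens[max(0, i - window):i]:
                ww[(min(u, w), max(u, w))] += 1  # undirected edge, canonical key order
    return ww, wd, wl
```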

  • First, the authors extend their previous work, LINE, to heterogeneous bipartite text networks as explained below:

    • Given a bipartite graph G = (V_A ∪ V_B, E), where V_A and V_B are disjoint sets of vertices, the conditional probability of v_a (in V_A) being generated by v_b (in V_B) is given by the softmax p(v_a|v_b) = exp(u_a · u_b) / Σ_{a' ∈ V_A} exp(u_{a'} · u_b), i.e. the exponentiated dot product of the embeddings of v_a and v_b, normalised over all vertices in V_A.

    • The second-order proximity between vertices can then be determined by the similarity of their conditional distributions p(·|v_j).
    • The objective to be minimised is the KL divergence between the conditional distribution p(·|v_j) and the empirical distribution p̂(·|v_j), where p̂(v_i|v_j) = w_{i,j} / deg_j.

    • The objective can be further simplified and optimised using SGD with edge sampling and negative sampling.
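
  • A minimal numpy sketch of one such SGD update on a single bipartite network is given below; the function signature, learning rate and the uniform negative-sampling distribution are assumptions (the paper draws negatives from a noise distribution over vertices).

```python
import numpy as np

def sgd_step(U_a, U_b, edges, weights, rng, lr=0.025, num_neg=5):
    """One update: U_a, U_b are embedding matrices for the two vertex sets,
    edges is a list of (a, b) index pairs, weights the matching edge weights.
    Negative samples are drawn uniformly here (an assumption)."""
    p = np.asarray(weights, dtype=float)
    p /= p.sum()
    a, b = edges[rng.choice(len(edges), p=p)]           # edge sampling, prob. proportional to weight
    samples = [(a, 1.0)] + [(int(rng.integers(len(U_a))), 0.0) for _ in range(num_neg)]
    for i, label in samples:                            # one positive + num_neg negative samples
        score = 1.0 / (1.0 + np.exp(-U_a[i] @ U_b[b]))  # sigmoid of the dot product
        grad = score - label                            # gradient of the negative log-likelihood
        g_a = grad * U_b[b]
        U_b[b] -= lr * grad * U_a[i]
        U_a[i] -= lr * g_a
```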
  • Now, the 3 individual networks can all be interpreted as bipartite networks, so node representations for all 3 networks are obtained as described above.

  • For the word-label network, since the labelled data is sparse, one could either train on the unlabelled networks first and then fine-tune on the labelled network, or train all of them jointly.

  • For the case of joint training, edges are sampled from the 3 networks alternately.

  • For the fine-tuning case, edges are first sampled from the unlabelled networks and then from the labelled network.
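
  • The two schedules can be sketched as below, assuming a hypothetical sample_edge_and_update helper that performs one edge-sampling SGD step (e.g. the sgd_step sketch above) on the given network.

```python
# sample_edge_and_update is a hypothetical helper: one edge-sampling SGD step on a network.

def train_joint(g_ww, g_wd, g_wl, steps):
    networks = [g_ww, g_wd, g_wl]
    for t in range(steps):                      # alternate among the 3 networks
        sample_edge_and_update(networks[t % 3])

def train_finetune(g_ww, g_wd, g_wl, steps_pre, steps_fine):
    for t in range(steps_pre):                  # first the unlabelled networks
        sample_edge_and_update([g_ww, g_wd][t % 2])
    for _ in range(steps_fine):                 # then the word-label network
        sample_edge_and_update(g_wl)
```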

  • Once the word embeddings are learnt, the embedding of a piece of text is obtained by simply averaging the embeddings of its words.
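
  • A short sketch of this averaging step (the dict-of-vectors format is an assumption):

```python
import numpy as np

def text_embedding(tokens, word_vecs):
    """Average the embeddings of the words that appear in the text.
    word_vecs: dict mapping word -> np.ndarray (assumed format)."""
    vecs = [word_vecs[w] for w in tokens if w in word_vecs]
    return np.mean(vecs, axis=0) if vecs else None
```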

Evaluation

  • Baseline Models

    • Local word co-occurrence based methods - SkipGram, LINE(Gww)
    • Document word co-occurrence based methods - LINE(Gwd), PV-DBOW
    • Combined method - LINE (Gww + Gwd)
    • CNN
    • PTE
  • For long documents, PTE (joint) outperforms the CNN and the other PTE variants, and is around 10 times faster than the CNN model.

  • For short documents, PTE (joint) does not always outperform the CNN model, probably because word-sense ambiguity is more relevant in short documents.