Competitive Training of Mixtures of Independent Deep Generative Models
12 Mar 2020

Introduction

The paper proposes a competitive training mechanism for learning a mixture of independent generative models.

The idea is that this mixture of different models would divide the data distribution amongst themselves and specialize to their respective splits.

The training procedure is related to clustering-based methods.
Motivation

In causal modeling, a common assumption is that the data is generated by a set of independent mechanisms.

It is not known which mechanism generates which datapoint and recovering the underlying mechanisms can be modeled as learning a structural causal generative model.
Setup

The paper assumes that the supports of the different generators do not overlap, i.e., the underlying data distribution factorizes into non-overlapping regions.

This data factorization is learned using a set of discriminators.

If there are $k$ generators, $k$ binary partition functions $c_1, \ldots, c_k$ are used.

For a given datapoint $x$, if $c_i(x) = 1$ then $c_j(x) = 0$ for all other $j$, and $x$ is assigned to the $i^{th}$ generator.

For a fixed partition function $c_j^t$ (where $t$ denotes the training step), each generator minimizes the f-divergence between its model distribution and the portion of the data distribution assigned to it. The sum of these per-partition losses is an upper bound on the f-divergence between the mixture model and the full data distribution.
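Because the partitions are disjoint, each term of the upper bound can be minimized independently. The per-step update can be sketched as follows (a minimal illustration; `DummyModel` and `fit_step` are hypothetical stand-ins for a real generative model and its gradient update):

```python
import numpy as np

class DummyModel:
    """Stand-in for a generative model; records the batches it trains on."""
    def __init__(self):
        self.seen = []

    def fit_step(self, batch):
        # A real implementation would take a gradient step minimizing
        # an f-divergence (e.g., a GAN loss) on this batch.
        self.seen.append(batch)

def competitive_training_step(models, data, assignments):
    # Each model trains only on its current partition. Since the
    # partitions are disjoint, the mixture's f-divergence is upper-
    # bounded by the sum of per-partition divergences, so the models
    # can be updated independently.
    for i, model in enumerate(models):
        partition = data[assignments == i]
        if len(partition) > 0:
            model.fit_step(partition)

models = [DummyModel(), DummyModel()]
data = np.arange(6.0)
assignments = np.array([0, 0, 1, 1, 1, 0])
competitive_training_step(models, data, assignments)
print([len(m.seen[0]) for m in models])  # → [3, 3]
```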

In the next step, the data points are reassigned to the generative models, based on the likelihood of each data point under each model.

The likelihood is estimated by training a discriminator that can distinguish the generated samples from the real samples.
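For a near-optimal discriminator, $D_i(x) \approx p_{data}(x) / (p_{data}(x) + p_{model_i}(x))$, so the model likelihood is monotone in the ratio $(1 - D_i(x)) / D_i(x)$ and points can be reassigned by an argmax over models. A minimal sketch of this reassignment rule (function name and toy numbers are illustrative, not from the paper's code):

```python
import numpy as np

def reassign(disc_outputs):
    """Reassign each datapoint to the model with the highest estimated
    likelihood ratio.

    disc_outputs: (k, n) array where disc_outputs[i, j] is the i-th
    discriminator's probability that datapoint j is real.
    """
    eps = 1e-8
    # For a near-optimal discriminator,
    # p_model_i(x) / p_data(x) ≈ (1 - D_i(x)) / D_i(x),
    # so the argmax of this ratio picks the model most likely
    # to have generated x.
    ratios = (1.0 - disc_outputs) / (disc_outputs + eps)
    return np.argmax(ratios, axis=0)

# Toy example: k=2 discriminators scoring n=3 datapoints.
D = np.array([[0.9, 0.4, 0.5],
              [0.6, 0.8, 0.2]])
print(reassign(D))  # → [1 0 1]
```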
Independence as an inductive bias

The independence assumption may be too restrictive because the low-level features will be common across the distribution splits.

This “violation” can be avoided by pretraining the model using a uniform random split of the dataset. In that case, the independence assumption will hold approximately after pretraining.
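The initial split can be sketched as follows (assumed detail: each datapoint is assigned to one of the $k$ models uniformly at random, so every model initially sees an unbiased sample of the full distribution):

```python
import numpy as np

def uniform_pretrain_split(n_points, k, seed=0):
    # Uniform random assignment for pretraining: every model sees an
    # unbiased sample of the whole distribution, so shared low-level
    # features are learned before specialization begins.
    rng = np.random.default_rng(seed)
    return rng.integers(0, k, size=n_points)

split = uniform_pretrain_split(1000, k=4)
print(np.bincount(split, minlength=4))  # roughly equal counts
```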

Another approach could be to share some parameters across the models.

A “load balancing” approach is also used: if too few data points are assigned to a model in a given round, that model keeps training on the data points it was previously assigned.
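This rule can be sketched as follows (a hypothetical formulation; the function name, `min_points` threshold, and reclaim logic are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def balanced_assignments(new_assign, prev_assign, k, min_points):
    """Load-balancing sketch: a model that receives fewer than
    `min_points` datapoints under the new assignment keeps the
    datapoints from its previous assignment, so it never starves."""
    result = new_assign.copy()
    for i in range(k):
        if np.sum(new_assign == i) < min_points:
            # Restore this model's previous partition.
            result[prev_assign == i] = i
    return result

prev = np.array([0, 0, 1, 1])
new = np.array([1, 1, 1, 1])  # model 0 would receive nothing
print(balanced_assignments(new, prev, k=2, min_points=1))  # → [0 0 1 1]
```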
Comparison to VAEs and GANs

VAEs tend to be “overly inclusive” of the training distribution, i.e., they try to cover the entire support of the distribution.

GANs are prone to mode collapse where the model focuses only on one part of the distribution.

The proposed method provides a middle ground where the different generative models can focus on different parts of the distribution.
Experiments

The experiments are limited in scope. The paper shows that the proposed setup improves over VAE and GAN baselines.

For datasets, the paper uses two-dimensional synthetic data, MNIST, and CelebA.