Papers I Read Notes and Summaries

Competitive Training of Mixtures of Independent Deep Generative Models


  • The paper proposes a competitive training mechanism for a mixture of independent generative models.

  • The idea is that this mixture of different models would divide the data distribution amongst themselves and specialize to their respective splits.

  • The training procedure is related to clustering-based methods.

  • Link to the paper


  • In causal modeling, a common assumption is that the data is generated by a set of independent mechanisms.

  • It is not known which mechanism generates which datapoint, and recovering the underlying mechanisms can be modeled as learning a structural causal generative model.


  • The paper assumes that the supports of the different generators do not overlap, i.e., the underlying data distribution factorizes into non-overlapping regions.

  • This data factorization is learned using a set of discriminators.

  • If there are $k$ generators, $k$ binary partition functions $c_1, …, c_k$ are used.

  • For a given datapoint $x$, if $c_i(x) = 1$ then $c_j(x) = 0$ for all other $j$, and $x$ is assigned to the $i^{th}$ generator.
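This hard assignment can be sketched as follows (a minimal illustration; the function and variable names are mine, not the paper's): given a score for each (datapoint, model) pair, each point is assigned to exactly one generator, so $c_i(x) = 1$ for the winning model and $c_j(x) = 0$ for every other $j$.

```python
import numpy as np

def hard_partition(scores):
    """scores: array of shape (n_points, k_models)."""
    winners = np.argmax(scores, axis=1)       # index of the winning model
    c = np.zeros_like(scores, dtype=int)
    c[np.arange(len(scores)), winners] = 1    # one-hot assignment
    return c

scores = np.array([[0.9, 0.1], [0.2, 0.7], [0.4, 0.6]])
print(hard_partition(scores))
```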

  • For a fixed set of partition functions $c_i^t$ (where $t$ indexes the training step), each generator minimizes the f-divergence between its model distribution and the part of the data distribution assigned to it. The resulting loss is an upper bound on the f-divergence of the mixture model from the data distribution.
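The upper-bound property follows from the joint convexity of f-divergences. In my notation (which may differ from the paper's), if the data distribution factorizes as $p = \sum_i \pi_i p_i$ over the non-overlapping regions and $q_i$ is the $i^{th}$ generator's distribution, then

```latex
D_f\!\left(\sum_i \pi_i q_i \,\Big\|\, \sum_i \pi_i p_i\right) \;\le\; \sum_i \pi_i \, D_f(q_i \,\|\, p_i)
```

so driving down each per-model divergence also drives down the divergence of the full mixture from the data.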

  • In the next step, the data points are re-assigned to the generative models, based on the likelihood of each data point for each model.

  • The likelihood is estimated by training a discriminator that can distinguish the generated samples from the real samples.
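The alternating fit/reassign scheme has a clustering (hard-EM) flavor. Below is a toy sketch on 1-D data (all names are mine): each "generator" is reduced to a unit-variance Gaussian fit by its mean, so the likelihood used for reassignment is available in closed form; in the paper these likelihood ratios are instead estimated adversarially with discriminators.

```python
import numpy as np

rng = np.random.default_rng(0)
# bimodal data: two well-separated Gaussian modes
data = np.concatenate([rng.normal(-4, 1, 200), rng.normal(4, 1, 200)])

def log_lik(x, mu):
    return -0.5 * (x - mu) ** 2  # unit-variance Gaussian, up to a constant

# k = 2 models, initialised from a uniform random split of the data
assign = rng.integers(0, 2, size=len(data))
for _ in range(10):
    # Step 1: fit each model to the data points currently assigned to it
    mus = [data[assign == i].mean() for i in range(2)]
    # Step 2: reassign every point to the model under which it is most likely
    ll = np.stack([log_lik(data, mu) for mu in mus], axis=1)
    assign = np.argmax(ll, axis=1)

print(np.round(sorted(mus), 1))  # the two models specialise to the two modes
```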

Independence as an inductive bias

  • The independence assumption may be too restrictive because the low-level features will be common across the distribution splits.

  • This “violation” can be avoided by pretraining the models on a uniform random split of the dataset. In that case, the independence assumption holds approximately after pretraining.

  • Another approach could be to share some parameters across the models.

  • A “load balancing” heuristic is also used: if too few data points are assigned to a model, that model keeps training on the data points assigned to it, so that no model is starved of data.

Comparison to VAEs and GANs

  • VAEs tend to be “overly inclusive” of the training distribution, i.e., they try to cover the entire support of the distribution.

  • GANs are prone to mode collapse where the model focuses only on one part of the distribution.

  • The proposed method provides a middle ground where the different generative models can focus on different parts of the distribution.


  • The experiments seem to be limited. The paper shows that their proposed setup improves over the VAE and GAN baselines.

  • For datasets, the paper uses two-dimensional synthetic data, MNIST, and CelebA.