# Competitive Training of Mixtures of Independent Deep Generative Models

## Introduction

• The paper proposes a competitive training mechanism for training a mixture of independent generative models.

• The idea is that the models in the mixture divide the data distribution among themselves, each specializing to its own split.

• The training procedure is related to clustering-based methods.

## Motivation

• In causal modeling, a common assumption is that the data is generated by a set of independent mechanisms.

• It is not known which mechanism generates which data point, and recovering the underlying mechanisms can be framed as learning a structural causal generative model.

## Setup

• The paper assumes that the supports of the different generators do not overlap, i.e., the underlying data distribution factorizes into non-overlapping regions.

• This data factorization is learned using a set of discriminators.

• If there are $k$ generators, $k$ binary partition functions $c_1, \dots, c_k$ are used.

• For a given data point $x$, if $c_i(x) = 1$ then $c_j(x) = 0$ for all $j \neq i$, and $x$ is assigned to the $i^{th}$ generator.

• With the partition functions fixed at their current values $c_i^t$ (the superscript $t$ indexes the training iteration), each generator minimizes the f-divergence between its model distribution and the portion of the data distribution assigned to it. Summed over generators, this loss is an upper bound on the f-divergence of the mixture model: f-divergences are jointly convex, so the divergence between two mixtures is at most the corresponding mixture of component-wise divergences.

• In the next step, the data points are re-assigned to the generative models based on how likely each data point is under each model (see the sketch at the end of this section).

• The likelihood is estimated by training, for each generator, a discriminator that distinguishes that generator's samples from the real samples assigned to it; the discriminator's output serves as a density-ratio estimate.
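
Below is a minimal sketch of this alternating procedure on toy 2-D data, written in PyTorch. Every name, network size, learning rate, and the argmin re-assignment rule are illustrative assumptions rather than the paper's reference implementation; the discriminator logit is used only as a rough proxy for how well a model explains a point.

```python
# Minimal sketch of competitive training of k independent generators on toy
# 2-D data (all names, sizes, and hyper-parameters are illustrative choices,
# not the paper's reference implementation).
import torch
import torch.nn as nn

def mlp(sizes):
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(fan_in, fan_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])          # drop the final ReLU

k, z_dim, x_dim = 3, 2, 2
gens  = [mlp([z_dim, 64, x_dim]) for _ in range(k)]   # one generator per split
discs = [mlp([x_dim, 64, 1]) for _ in range(k)]       # one discriminator per generator
g_opts = [torch.optim.Adam(g.parameters(), lr=1e-3) for g in gens]
d_opts = [torch.optim.Adam(d.parameters(), lr=1e-3) for d in discs]
bce = nn.BCEWithLogitsLoss()

def sample_data(n):
    # Toy target: a mixture of three well-separated Gaussians.
    centers = torch.tensor([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
    return centers[torch.randint(0, 3, (n,))] + 0.3 * torch.randn(n, 2)

x = sample_data(3000)
assign = torch.randint(0, k, (len(x),))   # c_i: start from a uniform random split

for step in range(200):
    # Step 1: with the partition fixed, train each (generator, discriminator)
    # pair only on the data points currently assigned to it.
    for i in range(k):
        xi = x[assign == i]
        if len(xi) == 0:
            continue
        fake = gens[i](torch.randn(len(xi), z_dim))

        d_loss = bce(discs[i](xi), torch.ones(len(xi), 1)) + \
                 bce(discs[i](fake.detach()), torch.zeros(len(xi), 1))
        d_opts[i].zero_grad(); d_loss.backward(); d_opts[i].step()

        g_loss = bce(discs[i](fake), torch.ones(len(xi), 1))
        g_opts[i].zero_grad(); g_loss.backward(); g_opts[i].step()

    # Step 2: re-assign every point to the generator whose discriminator finds
    # it hardest to tell apart from that generator's samples (low logit), used
    # here as a crude proxy for "this model explains the point well".
    with torch.no_grad():
        scores = torch.cat([d(x) for d in discs], dim=1)   # shape (n, k)
        assign = scores.argmin(dim=1)
```

In this toy setup, the uniform random initial assignment also plays the role of the pretraining phase discussed in the next subsection; the intent is that each generator ends up specializing to one of the three Gaussian blobs.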

### Independence as an inductive bias

• The independence assumption may be too restrictive in practice, because low-level features are common across the distribution splits.

• This “violation” can be mitigated by pretraining all the models on a uniform random split of the dataset; after pretraining, the shared low-level structure is already captured by every model, so the independence assumption holds approximately during the competitive phase.

• Another approach could be to share some parameters across the models.

• A “load balancing” heuristic is also used: if too few data points get re-assigned to a model, that model keeps training on its previously assigned points, so that no model starves and drops out of the competition (see the sketch below).
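
As a concrete illustration, the re-assignment step in the sketch above could be wrapped with a simple guard of this kind. The `min_share` threshold and the exact rule are assumptions for the sake of the example, not the paper's precise scheme.

```python
import torch

def rebalanced_assignment(scores, prev_assign, min_share=0.05):
    """Re-assign points by argmin score, but let any model whose new share of
    the data would fall below `min_share` keep its previously assigned points.
    Illustrative heuristic only; threshold and rule are assumptions."""
    n, k = scores.shape
    new_assign = scores.argmin(dim=1)
    for i in range(k):
        if (new_assign == i).sum() < min_share * n:
            # Model i would starve: let it keep whatever it had before, so it
            # continues to train and can re-enter the competition later.
            new_assign[prev_assign == i] = i
    return new_assign
```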

### Comparison to VAEs and GANs

• VAEs tend to be “overly inclusive” of the training distribution, i.e., they try to cover the entire support of the distribution.

• GANs are prone to mode collapse where the model focuses only on one part of the distribution.

• The proposed method provides a middle ground where the different generative models can focus on different parts of the distribution.

## Experiments

• The experiments are somewhat limited in scope. The paper shows that the proposed setup improves over VAE and GAN baselines.

• For datasets, the paper uses two-dimensional synthetic data, MNIST, and CelebA.