MONet - Unsupervised Scene Decomposition and Representation

14 Sep 2020

Introduction

The paper introduces the Multi-Object Network (MONet) architecture, which learns a modular representation of images by spatially decomposing scenes into objects and learning a representation for each of these objects.
Link to the paper

MONet has two components:
- Attention Module: generates spatial masks corresponding to the objects in the scene.
- VAE: learns a representation for each object.
VAE components:
- Encoder: It takes as input the image and the attention mask generated by the attention module and produces the parameters of the distribution over the latent variable z.
- Decoder: It takes as input the latent variable z and attempts to reproduce the image.
The decoder's reconstruction loss is weighted by the attention mask, i.e., the decoder only needs to reproduce those parts of the image that the mask focuses on.
The attention mechanism is auto-regressive with an ongoing state (called a scope) that tracks which parts of the image are not yet attended over.
In the last step, no new attention mask is computed; the remaining scope is used as-is as the final mask. This ensures that the masks sum to 1 at every pixel.
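The scope update can be sketched for a single pixel as follows. This is a minimal illustration, not the paper's implementation: the function name `decompose_scope` and the list `alphas` (the fraction of the remaining scope claimed at each step, standing in for the attention network's output) are assumptions made for the example.

```python
def decompose_scope(alphas):
    """Turn per-step attention values into masks for a single pixel.

    `alphas` holds K-1 values in [0, 1], one per attention step. Each value
    is the fraction of the remaining scope claimed at that step (a stand-in
    for the attention network's output). Returns K masks.
    """
    scope = 1.0  # initially, nothing has been attended over
    masks = []
    for alpha in alphas:
        masks.append(scope * alpha)    # part of the scope claimed at this step
        scope = scope * (1.0 - alpha)  # what is left for later steps
    masks.append(scope)  # last step: the remaining scope is used as the mask
    return masks


masks = decompose_scope([0.5, 0.5, 0.2])
print(masks, sum(masks))
```

Because every step splits the current scope into a claimed part and a remainder, the masks add up to the initial scope of 1 regardless of the attention values.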
The VAE also models the attention masks over the components, i.e., the probability that each pixel belongs to a particular component.
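The pieces described above (a mask-weighted reconstruction term, a prior term on the latents, and a term matching the VAE's mask distribution to the attention masks) can be combined into a single objective. The following is a sketch of such an objective. The weights β and γ and the parameter symbols φ, θ, ψ (encoder, decoder, and attention network) are notation assumed here and are not stated in these notes; m_k is the mask of slot k and c is the per-pixel component assignment.

```latex
\mathcal{L}(\phi, \theta, \psi; x) =
  -\log \sum_{k=1}^{K} m_k \, p_\theta(x \mid z_k)
  + \beta \, D_{\mathrm{KL}}\!\Big( \prod_{k=1}^{K} q_\phi(z_k \mid x, m_k) \,\Big\|\, p(z) \Big)
  + \gamma \, D_{\mathrm{KL}}\!\big( q_\psi(c \mid x) \,\big\|\, p_\theta(c \mid \{z_k\}) \big)
```

The first term is the reconstruction likelihood weighted by the masks, the second keeps each slot's latent posterior close to the prior, and the third encourages the masks reconstructed by the VAE to agree with those produced by the attention module.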

A model could efficiently process compositional visual scenes if it can exploit some recurring structures in the scene.
The paper validates this hypothesis by showing that an autoencoder performs better if it builds up scenes compositionally, processing one mask at a time (using ground-truth spatial masks), rather than processing the whole scene at once.

The VAE encoder parameterizes a diagonal Gaussian latent posterior, and the decoder is a spatial broadcast decoder, which encourages the VAE to learn disentangled features.
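The broadcast step of a spatial broadcast decoder can be sketched as below: the latent vector is tiled over every spatial location and two coordinate channels are appended. This is a plain-Python illustration under assumed names (`spatial_broadcast`, `coord`); in a real decoder the resulting grid would be passed through convolutional layers, which are omitted here.

```python
def spatial_broadcast(z, height, width):
    """Tile latent vector `z` over a height x width grid and append coordinates.

    Returns a nested list of shape [height][width][len(z) + 2]. The last two
    entries at each location are the x and y coordinates, scaled to [-1, 1].
    """
    def coord(i, n):
        # Evenly spaced values from -1 to 1 (a single cell maps to 0).
        return 0.0 if n == 1 else -1.0 + 2.0 * i / (n - 1)

    return [
        [list(z) + [coord(col, width), coord(row, height)] for col in range(width)]
        for row in range(height)
    ]


grid = spatial_broadcast([0.3, -1.2], height=2, width=3)
print(len(grid), len(grid[0]), len(grid[0][0]))  # 2 3 4
```

Every location sees the same latent values and differs only in its coordinates, so position information has to come from the coordinate channels rather than from the latent code itself.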
MONet with seven slots is trained on the Objects Room dataset, with 1-3 objects per scene.
- It learns to generate a different attention mask for each object.
- Combining the reconstructed components using the corresponding attention masks produces good quality reconstruction for the entire scene.
- Since the attention mechanism is autoregressive, MONet can be evaluated with more slots than it was trained with. The model also generalizes to novel scene configurations (not seen during training).
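The mask-weighted recombination mentioned in the list above can be sketched as a per-pixel weighted sum of the slot reconstructions. The function name `compose` and the flattened single-channel layout are assumptions made to keep the example short.

```python
def compose(masks, components):
    """Blend per-slot reconstructions into one image.

    `masks[k][i]` is slot k's mask value at pixel i, and `components[k][i]`
    is slot k's reconstructed intensity at pixel i (flattened, single channel
    for brevity). Masks are assumed to sum to 1 at every pixel.
    """
    num_pixels = len(masks[0])
    return [
        sum(m[i] * c[i] for m, c in zip(masks, components))
        for i in range(num_pixels)
    ]


# Two slots, three pixels: slot 0 owns pixel 0, slot 1 owns pixel 2,
# and pixel 1 is shared equally between the two slots.
masks = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]
components = [[0.8, 0.8, 0.8], [0.2, 0.2, 0.2]]
image = compose(masks, components)
print(image)
```

Where a single slot owns a pixel, the output equals that slot's reconstruction; where slots share a pixel, the output is a blend in proportion to the masks.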
On the Multi-dSprites dataset (a modification of the dSprites dataset), the trained model distinguishes individual sprites from the background.
On the CLEVR dataset (2-10 objects per image), the model generates good image segmentations and reconstructions and can distinguish between overlapping shapes.