MONet - Unsupervised Scene Decomposition and Representation
14 Sep 2020
Introduction
-
The paper introduces the Multi-Object Network (MONet), an architecture that learns a modular representation of images by spatially decomposing scenes into objects and learning a representation for each object.
Architecture
-
Two components:
-
Attention Module: generates spatial masks corresponding to the objects in the scene.
-
VAE: learns a representation for each object.
-
VAE components:
-
Encoder: It takes as input the image and the attention mask generated by the attention module and produces the parameters of the distribution over the latent variable z.
-
Decoder: It takes as input the latent variable z and attempts to reproduce the image.
-
The decoder loss term is weighted by the mask, i.e., the decoder tries to reproduce only those parts of the image that the attention mask focuses on.
-
The attention mechanism is autoregressive, with an ongoing state (called a scope) that tracks which parts of the image have not yet been attended to.
-
In the last step, no attention mask is computed, and the previous scope is used as-is. This ensures that the masks sum to 1 at every pixel.
-
The VAE also models the attention mask over the components, i.e., the probability that the pixels belong to a particular component.
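The scope mechanism and the mask-weighted loss described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `decompose_scope` and `masked_reconstruction_error` are hypothetical, the per-step attention maps `alphas` are passed in directly rather than produced by an attention network, images are flattened to lists of pixel values, and a squared error stands in for the likelihood-based decoder loss.

```python
def decompose_scope(alphas):
    """Turn per-step attention maps into masks via the scope recursion.

    alphas: list of K-1 lists, each holding one value in [0, 1] per pixel
            (a hypothetical stand-in for the attention network's output).
    Returns K masks; the last mask is whatever scope is left over.
    """
    num_pixels = len(alphas[0])
    scope = [1.0] * num_pixels  # initial scope: nothing is explained yet
    masks = []
    for alpha in alphas:
        # The mask claims a share alpha of the still-unexplained scope.
        masks.append([s * a for s, a in zip(scope, alpha)])
        # The remaining scope shrinks by the share that was just claimed.
        scope = [s * (1.0 - a) for s, a in zip(scope, alpha)]
    masks.append(scope)  # last step: the previous scope is used as-is
    return masks


def masked_reconstruction_error(image, reconstructions, masks):
    """Mask-weighted squared error, summed over slots and pixels.

    Assumption: squared error is used only to show the weighting; it is
    a simplification of a likelihood-based loss.
    """
    total = 0.0
    for recon, mask in zip(reconstructions, masks):
        for x, r, m in zip(image, recon, mask):
            total += m * (x - r) ** 2
    return total


# Two attention steps over four pixels give three masks.
alphas = [[0.5, 1.0, 0.0, 0.25], [1.0, 0.0, 0.5, 0.0]]
masks = decompose_scope(alphas)
per_pixel_sums = [sum(column) for column in zip(*masks)]
```

Because each step only hands out part of the remaining scope and the final slot takes the rest, `per_pixel_sums` comes out as 1 at every pixel regardless of the values in `alphas`. In the loss, a slot is penalized for a wrong pixel only in proportion to how much of that pixel its mask claims.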
Motivation
-
A model could process compositional visual scenes efficiently if it could exploit recurring structure in the scene.
-
The paper validates this hypothesis by showing that an autoencoder performs better when it builds up a scene compositionally, processing one ground-truth spatial mask at a time, than when it processes the whole scene at once.
Results
-
The VAE encoder parameterizes a diagonal Gaussian latent posterior, and the VAE uses a spatial broadcast decoder, which encourages it to learn disentangled features.
-
MONet with seven slots is trained on the Objects Room dataset with 1-3 objects.
-
It learns to generate a different attention mask for each object.
-
Combining the reconstructed components using the corresponding attention masks produces a good-quality reconstruction of the entire scene.
-
Since it is an autoregressive model, MONet can be evaluated with more slots than it was trained with. The model also generalizes to novel scene configurations (not seen during training).
-
On the Multi-dSprites dataset (a modification of the dSprites dataset), the trained model separates the individual sprites and the background.
-
On the CLEVR dataset (2-10 objects per image), the model produces good image segmentations and reconstructions and can distinguish between overlapping shapes.
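Two of the ideas mentioned in the results, the spatial broadcast decoder and the mask-weighted combination of per-slot reconstructions, can be illustrated with a short sketch. It is a simplified illustration under assumptions, not the paper's code: `spatial_broadcast` and `compose_scene` are hypothetical names, the broadcast step only builds the decoder's input (the latent tiled over the grid with x and y coordinate channels in [-1, 1]) and omits the convolutional layers that would follow, and images are flattened to lists of pixel values.

```python
def spatial_broadcast(z, height, width):
    """Tile a latent vector over an H x W grid and append (x, y) coordinates.

    Returns a height x width grid whose cells are lists of len(z) + 2
    features, i.e. the input a broadcast decoder would then process.
    """
    def coord(i, n):
        # Evenly spaced coordinate in [-1, 1]; a single cell sits at 0.
        return 0.0 if n == 1 else -1.0 + 2.0 * i / (n - 1)

    return [
        [list(z) + [coord(col, width), coord(row, height)] for col in range(width)]
        for row in range(height)
    ]


def compose_scene(reconstructions, masks):
    """Blend per-slot reconstructions into one image, weighting by the masks."""
    num_pixels = len(masks[0])
    scene = [0.0] * num_pixels
    for recon, mask in zip(reconstructions, masks):
        for i in range(num_pixels):
            scene[i] += mask[i] * recon[i]
    return scene


# Every grid cell sees the same latent; only the coordinates differ.
grid = spatial_broadcast([0.5, -0.5], height=2, width=3)

# Each pixel of the scene is taken from the slot whose mask owns it.
scene = compose_scene(
    reconstructions=[[0.2, 0.9], [0.7, 0.4]],
    masks=[[1.0, 0.0], [0.0, 1.0]],
)
```

Since every position receives an identical copy of the latent, position information enters only through the coordinate channels. In `compose_scene`, soft masks would blend overlapping slots instead of picking one slot per pixel as in this hard-mask example.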