Papers I Read Notes and Summaries

MONet - Unsupervised Scene Decomposition and Representation


  • The paper introduces the Multi-Object Network (MONet) architecture, which learns a modular representation of images by spatially decomposing scenes into objects and learning a representation for each object.

  • Link to the paper


  • Two components:

    • Attention Module: generates spatial masks corresponding to the objects in the scene.

    • VAE: learns a representation for each object.

  • VAE components:

    • Encoder: It takes as input the image and the attention mask generated by the attention module and produces the parameters of a distribution over the latent variable z.

    • Decoder: It takes as input the latent variable z and attempts to reproduce the image.

  • The decoder loss term is weighted by the mask, i.e., the decoder is only penalized for reconstruction errors in the parts of the image that the attention mask focuses on.

  • The attention mechanism is auto-regressive and maintains an ongoing state (called the scope) that tracks which parts of the image have not yet been attended to.

  • In the last step, no attention mask is computed, and the previous scope is used as-is. This ensures that all the masks sum to 1.

  • The VAE also models the attention mask over the components, i.e., the probability that the pixels belong to a particular component.
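The scope bookkeeping and the mask-weighted loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: `alphas` is a hypothetical stand-in for the attention network's per-step output (the fraction of the remaining scope claimed at each step), and the squared-error loss is an assumed simplification of the decoder's likelihood term.

```python
import numpy as np

def attention_masks(alphas):
    """Turn per-step attention maps into masks that sum to 1 at every pixel.

    alphas: array of shape (K-1, H, W) with values in [0, 1]. In a real model
    these would come from the attention network; here they are just inputs.
    """
    scope = np.ones_like(alphas[0])      # initial scope: nothing explained yet
    masks = []
    for alpha in alphas:
        masks.append(scope * alpha)      # mask = part of the scope claimed now
        scope = scope * (1.0 - alpha)    # scope = what is still unexplained
    masks.append(scope)                  # last step: remaining scope used as-is
    return np.stack(masks)               # shape (K, H, W)

def masked_reconstruction_loss(image, recon, mask):
    """Squared error weighted per pixel by the mask (assumed simplification).

    image, recon: (H, W, C); mask: (H, W).
    """
    return float(np.sum(mask[..., None] * (image - recon) ** 2))

rng = np.random.default_rng(0)
alphas = rng.uniform(size=(3, 4, 4))     # 3 attention steps -> 4 masks
masks = attention_masks(alphas)
```

Because each mask takes a fraction of the current scope and the final mask takes whatever is left, the masks sum to 1 at every pixel regardless of the values in `alphas`.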


  • A model could efficiently process compositional visual scenes if it can exploit some recurring structures in the scene.

  • The paper validates this hypothesis by showing that an autoencoder performs better if it can build up the scenes compositionally, processing one mask at a time (these masks are ground-truth spatial masks) rather than processing the scene at once.


  • The VAE encoder parameterizes a diagonal Gaussian latent posterior, and the decoder is a spatial broadcast decoder, which encourages the VAE to learn disentangled features.

  • MONet with seven slots is trained on the Objects Room dataset, where each scene contains 1-3 objects.

    • It learns to generate a different attention mask for each object.

    • Combining the reconstructed components using the corresponding attention masks produces good quality reconstruction for the entire scene.

    • Since the attention mechanism is autoregressive, MONet can be evaluated with more slots than it was trained with. The model also generalizes to novel scene configurations (not seen during training).

  • On the Multi-dSprites dataset (a modification of the dSprites dataset), the trained model distinguishes individual sprites from the background.

  • On the CLEVR dataset (2-10 objects per image), the model produces good image segmentations and reconstructions and can distinguish between overlapping shapes.
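Two operations mentioned in these notes are easy to illustrate: the spatial broadcast step at the input of the decoder, and the combination of per-slot reconstructions with their attention masks. The sketch below is an assumption-laden illustration in NumPy, not the paper's code: the function names are hypothetical, the coordinate range of [-1, 1] is an assumed convention, and the convolutional layers that would follow the broadcast are omitted.

```python
import numpy as np

def spatial_broadcast(z, height, width):
    """Tile a latent vector over an H x W grid and append coordinate channels.

    z: (D,) latent vector. Returns an array of shape (H, W, D + 2) that a
    convolutional decoder (not shown) would take as input.
    """
    tiled = np.broadcast_to(z, (height, width, z.shape[0]))
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")   # each of shape (H, W)
    return np.concatenate([tiled, yy[..., None], xx[..., None]], axis=-1)

def composite(masks, components):
    """Combine per-slot reconstructions into one image.

    masks: (K, H, W), summing to 1 over K at every pixel.
    components: (K, H, W, C), one reconstruction per slot.
    """
    return np.sum(masks[..., None] * components, axis=0)

grid = spatial_broadcast(np.arange(3.0), height=4, width=5)

# Two slots: the first owns the left half of the image, the second the rest.
masks = np.zeros((2, 4, 4))
masks[0, :, :2] = 1.0
masks[1, :, 2:] = 1.0
components = np.stack([np.full((4, 4, 3), 0.2), np.full((4, 4, 3), 0.8)])
scene = composite(masks, components)
```

In the composite, every pixel is a mask-weighted average of the slot reconstructions, so a slot only influences the regions its mask selects.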