Proposed in the early days of computer vision (Grenander, 1976; Horn, 1977), analysis-by-synthesis is an approach to the problem of visual scene understanding. The idea is conceptually elegant and appealing: build a system that is able to synthesize complex scenes (e.g., by rendering), and then understand analysis (inference) as the inverse of this process that decomposes new scenes into their constituent components. The main challenges in this approach are the need for generative models of objects (and their composition into scenes) and the need to perform tractable inference given new inputs, including the task of decomposing scenes into objects in the first place. In this work, we aim to learn such a system in an unsupervised way from observations of scenes alone.
While models such as VAEs (Kingma & Welling, 2014; Rezende et al., 2014) and GANs (Goodfellow et al., 2014) constitute significant progress in generative modelling, these models still lack the ability to capture the compositional nature of reality: they typically generate entire images or scenes at once, i.e., with a single pass through a large feedforward network. While this approach works well for objects such as centred faces (and progress has been impressive on those tasks; Karras et al., 2019a;b), generating natural scenes containing several objects in non-trivial constellations gets increasingly difficult within this framework due to the combinatorial number of compositions that need to be represented and reasoned about (Bau et al., 2019).
Figure 1: Our ECON model learns to decompose training scenes (A) into layers of inpainted objects. Representing object classes separately allows controllable sampling of individual objects (B: samples from different experts) which can be recombined in novel ways (C: compositions sampled by layering the experts in B in the same order as seen during training (top), or choosing three (middle) or four (bottom) objects at random).
Image formation entangles different components in highly non-linear ways, such as occlusion. Due to the difficulty of choosing the correct model and the complexity of inference, the task of generating complex scenes containing compositions of objects still lacks success stories. More training data certainly helps, and progress on generating visually impressive scenes has been substantial (Radford et al., 2015), but we hypothesize that a satisfactory and robust solution that is not optimized to a relatively well constrained IID (independent and identically distributed) data scenario will require that our models correctly incorporate the (causal) generative nature of natural scenes.
Here, we take some first small steps towards addressing the aforementioned limitations by proposing ECON, a more physically-plausible generative scene model with explicitly compositional structure. Our approach is based on two main ideas. The first is to consider scenes as layered compositions of (partially) depth-ordered objects. The second is to represent object classes separately using an ensemble of generative models, or experts.
Our generative scene model consists of a sequential process which places independent objects in the scene, operating from the back to the front, so that objects occurring closer to the viewer can occlude those further away. During inference, this process is reversed: at each step, experts compete for explaining part of the remaining scene, and only the winning expert is further trained on the explained part (Parascandolo et al., 2018). This competition ideally drives each expert to specialise on representing and generating instances from one, or a few related, object classes or concepts, and the notion of “objects” should automatically emerge as contiguous regions that appear in a stable way across a range of training images. By decomposing scenes in the reverse order of generation, occluded objects can be inpainted within the already explained regions so that experts can learn to generate full, unoccluded objects which can be recombined in novel ways.
Learning a modular scene representation via object-specific experts has several benefits. First, each expert only needs to solve the simpler subtask of representing and generating instances from a single object class—something which current generative models have been shown to be capable of—while the composition process is treated separately. Secondly, expert models are useful in their own right as they can be dropped or added, reused and repurposed for other tasks on an individual level.
We highlight the following contributions.
• We summarise a physically-plausible model of scene generation in Section 2 and use it to categorise and contrast related scene models and their shortcomings.
• In Section 4, we present ECON, a compositional scene model, which, for a single expert, can be seen as an extension of MONET (Burgess et al., 2019) into a proper generative model.
• We introduce modular object representations through separate generators and propose a competition mechanism and objective to drive experts to specialise.
• In experiments on synthetic data in Section 6, we show qualitatively that ECON is able to decompose simple scenes into objects, represent these separately, and recombine them in a layer-wise fashion into novel, coherent scenes with arbitrary numbers and depth-orderings of objects.
• We critically discuss our assumptions and propose extensions for future work.
Figure 2: Assumed data generating process (dead-leaves model). Independent objects with shapes $m_t$ (drawn from class $k_t$ with properties $z_t$) are placed on the canvas sequentially (reflecting depth ordering) and appear in the final composition $x$ as dependent, partially occluded regions $r_t$.
To reflect the fact that 2D images are the result of projections of richer 3D scenes, we assume that data are generated from the well-known dead leaves model, i.e., in a layer-wise fashion, see Figure 2a for an illustration. Starting with an empty canvas $x = 0$, an image is sequentially generated in $T$ steps. At each step $t$ we sample an object from one of $K$ different classes and place it on the canvas as follows,

$$k_t \sim p(k_t), \quad z_t \sim p(z_t \mid k_t), \quad m_t \sim p(m_t \mid z_t, k_t), \quad x_t \sim p(x_t \mid z_t, k_t), \quad x \leftarrow m_t \odot x_t + (1 - m_t) \odot x, \tag{1}$$

where $k_t \in \{1, \dots, K\}$ represents the object class drawn at step $t$; $z_t$ is an abstract representation of the object's properties; $m_t$ is a binary image determining shape; $x_t$ is a full (unmasked) image containing the object; and $\odot$ denotes element-wise multiplication. The corresponding graphical model is shown in Figure 2b.
This sequential generation process captures the loss of depth information when projecting from 3D to 2D and is a natural way of handling occlusion phenomena. Consequently, sampling from this model is straightforward. We therefore consider it a more truthful approach to modelling visual scenes than, e.g., spatial mixture models, in line with Le Roux et al. (2011).
On the other hand, inferring the objects composing a given image $x$ is challenging. We will distinguish between shapes and regions in the following sense. The unoccluded object shapes $m_t$, top row in Figure 2a, remain hidden and only appear in $x$ via their corresponding, partially occluded segmentation regions $r_t$, see the final composition in the bottom row of Figure 2a for an illustration. In particular, a region $r_t$ is always a subset of the corresponding shape pixels $m_t$.
In addition to the separate treatment of shapes $m_t$ and regions $r_t$, we also introduce a scope variable $s_t$ to help write the above model in a convenient form. Following Burgess et al. (2019), $s_t$ is defined recursively as

$$s_T = 1, \qquad s_t = s_{t+1} \odot (1 - m_{t+1}), \quad t = T-1, \dots, 1.$$

The scope $s_t$ contains those parts of the image which have been completely generated after $t$ steps and will not be occluded in the subsequent $T - t$ steps. With $s_t$, the regions $r_t$ can be compactly defined as

$$r_t = m_t \odot s_t = m_t \odot \prod_{t'=t+1}^{T} (1 - m_{t'}). \tag{2}$$

Using these, we can express the final composition as

$$x = \sum_{t=1}^{T} r_t \odot x_t. \tag{3}$$

While (3) may look like a normal spatial mixture model, it is worth noting the following important point: even though the shapes $m_t$ are drawn independently, the resulting segmentation regions $r_t$ become (temporally) dependent due to the layer-wise generation process, i.e., the visible part of object $t$ depends on all objects subsequently placed on the canvas. This is intuitive and is evident from the fact that the RHS of (2) is a function of $m_{t:T}$.
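The relation between the sequential canvas update and the mixture-like form of the final composition can be checked numerically. The following is a minimal NumPy sketch, assuming grey-scale images and random pixel masks as stand-ins for real object shapes; the function names `compose_sequential` and `scopes_and_regions` are hypothetical and not part of the original model code.

```python
import numpy as np

def compose_sequential(masks, objects):
    """Paint objects back to front: x <- m_t * x_t + (1 - m_t) * x."""
    canvas = np.zeros_like(objects[0])
    for m, x_t in zip(masks, objects):
        canvas = m * x_t + (1 - m) * canvas
    return canvas

def scopes_and_regions(masks):
    """Scopes s_t (computed from the front layer backwards) and regions r_t = m_t * s_t."""
    T = len(masks)
    scopes = [None] * T
    scopes[T - 1] = np.ones_like(masks[0])      # the front object is fully visible
    for t in range(T - 2, -1, -1):
        scopes[t] = scopes[t + 1] * (1 - masks[t + 1])
    regions = [m * s for m, s in zip(masks, scopes)]
    return scopes, regions

rng = np.random.default_rng(0)
T, H, W = 4, 8, 8
# Assumption for this toy example: the first layer is a background covering the canvas.
masks = [np.ones((H, W))]
masks += [(rng.random((H, W)) < 0.3).astype(float) for _ in range(T - 1)]
objects = [rng.random((H, W)) for _ in range(T)]

x_seq = compose_sequential(masks, objects)
scopes, regions = scopes_and_regions(masks)
x_mix = sum(r * x_t for r, x_t in zip(regions, objects))
```

In this sketch `x_seq` and `x_mix` agree pixel by pixel, each region is contained in its shape, and (because of the full background layer) the regions partition the canvas. Changing any later mask changes the earlier regions, which is the dependence discussed above.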
Clustering & spatial mixture models One line of work (Greff et al., 2016; 2017; Van Steenkiste et al., 2018) approaches the perceptual grouping task of decomposing scenes into components by viewing separate regions as clusters. A scene $x$ is modelled with a spatial mixture model, parametrised by deep neural networks, in which learning is performed with a procedure akin to expectation maximisation (EM; Dempster et al., 1977). The recent IODINE model of Greff et al. (2019) instead uses a refinement network (Marino et al., 2018) to perform iterative amortised variational inference over independent scene components which are separately decoded and then combined via a softmax to form the scene. While IODINE is able to decompose a given scene, it cannot generate coherent samples of new scenes because dependencies between regions $r_t$ due to layering are not explicitly captured in its generative model.
This shortcoming of IODINE has also been pointed out by Engelcke et al. (2019) and addressed in their GENESIS model, which explicitly models dependencies between regions via an autoregressive prior. While this does enable sampling of coherent scenes which look similar to training data, GENESIS still assumes an additive, rather than layered, model of scene composition. As a consequence, the resulting entangled component samples contain holes and partially occluded objects and cannot be easily layered and recombined as shown in Figure 1 (e.g., to generate samples with exactly two circles and one triangle).
Sequential models Our work is closely related to sequential or recurrent approaches to image decomposition and generation (Mnih et al., 2014; Gregor et al., 2015; Eslami et al., 2016; Kosiorek et al., 2018; Yuan et al., 2019). In particular, we build on the recent MONET model for scene decomposition of Burgess et al. (2019). MONET combines a recurrent attention network with a VAE which encodes and reconstructs the input within the selected attention regions while remaining unconstrained to inpaint occluded parts outside these regions.
We extend this approach in two main directions. Firstly, we turn MONET into a proper generative model which respects the layer-wise generation of scenes described in Section 2. Secondly, we explicitly model the discrete variable $k$ (object class) with an ensemble of class-specific VAEs (the experts), as opposed to within a single large encoder-decoder architecture as in IODINE, GENESIS or MONET. Such specialisation allows controlling object constellations in new, but scene-consistent ways.
Competition of experts To achieve specialisation on different object classes in our model, we build on ideas from previous work using competitive training of experts (Jacobs & Jordan, 1991). More recently, these ideas have been successfully applied to tasks such as lifelong learning (Aljundi et al., 2017), learning independent causal mechanisms (Parascandolo et al., 2018), training mixtures of generative models (Locatello et al., 2018), as well as to dynamical systems via sparsely-interacting recurrent independent mechanisms (Goyal et al., 2019).
Probabilistic RBM models The work of Le Roux et al. (2011) and Heess (2012) introduced probabilistic scene models that also reason about occlusion. Le Roux et al. (2011) combine restricted Boltzmann machines (RBMs) to generate masks and shape separately for every object in the scenes into a masked RBM (M-RBM) model. Two variants are explored: one that respects a depth ordering and object occlusions, derived from similar arguments as we have put forward in the introduction; and a second model which uses a softmax combination akin to the spatial mixture models used in IODINE and GENESIS, although the authors argue it makes little sense from a modelling perspective. Inference is implemented as blocked Gibbs sampling with contrastive divergence as a learning objective. Inference over depth ordering is done exhaustively, that is, considering every permutation, as opposed to greedily using competition as in this work. Shortcomings of the model are mainly the limited expressiveness of RBMs (complexity and extent), as well as the cost of inference. Our work can be understood as an extension of the M-RBM formulation using VAEs in combination with attention, or segmentation, models.

Table 1: Comparison with related unsupervised scene decomposition and generation models.
Vision as inverse graphics & probabilistic programs Another way to programmatically introduce information about scene composition is through analysis-by-synthesis, see Bever & Poeppel (2010) for an overview. In this approach, the synthesis (i.e., generative) model is fully specified, e.g., through a graphics renderer, and inference becomes the inverse task, which poses a challenging optimisation problem. Probabilistic programming is often advocated as a means to automatically compile this inference task; for instance, PICTURE has been proposed by Kulkarni et al. (2015), and combinations with deep learning have been explored by Wu et al. (2017). This approach is sometimes also understood as an instance of Approximate Bayesian Computation (ABC; Dempster et al., 1977) or likelihood-free inference. While conceptually appealing, these methods require a detailed specification of the scene generation process—something that we aim to learn in an unsupervised way. Furthermore, gains achieved by a more accurate scene generation process are generally paid for by complicated inference, and most methods thus rely on variations of MCMC sampling schemes (Jampani et al., 2015; Wu et al., 2017).
Supervised approaches There is a body of work on augmenting generative models with ground-truth segmentation and other supervisory information. Turkoglu et al. (2019) proposed a layer-based model to add objects onto a background, Ashual & Wolf (2019) proposed a scene-generation method allowing for fine-grained user control, and Karras et al. (2019a;b) have achieved impressive image generation results by exclusively training on a single class of objects. The key difference between these approaches and our work is that we exclusively focus on unsupervised learning.
We now introduce ECON (for Ensemble of Competing Object Networks), a causal generative scene model which explicitly captures the compositional nature of visual scenes. On a high level, the proposed architecture is an ensemble of generative models, or experts, designed after the layer-based scene model described in Section 2. During training, experts compete to sequentially explain a given scene via attention over image regions, thereby specialising on different object classes. We perform variational inference (Jordan et al., 1999), amortised within the popular VAE framework (Kingma & Welling, 2014; Rezende et al., 2014), and use competition to greedily maximise a lower bound to the conditional likelihood w.r.t. object identity.
4.1 GENERATIVE MODEL
We adopt the generative model $p$ described in Section 2, parametrise it by $\theta$, and assume that it factorises over the graphical model in Figure 2b (i.e., assuming that objects at different time steps are drawn independently of each other). We model $k_t$ with a categorical distribution, and place a unit-variance isotropic Gaussian prior over $z_t$. Next, we parametrise the conditional distributions over $x_t$ and $m_t$ using $K$ decoders with respective parameters $\theta_1, \dots, \theta_K$. These compute object means and mask probabilities which determine pixel-wise distributions over $x_t$ and $m_t$.

Figure 3: ECON architecture: ensemble of K competing experts. Each expert consists of (i) an attention network which selects image regions $r_t$; (ii) an encoder which maps the image within the attended region to a latent code $z$; and (iii) a decoder which reconstructs both an object $x_t$ and its unoccluded shape $m_t$. A competition mechanism determines the winning expert at each step.

We note at this point that, while other handlings of the discrete variable $k$ are possible, we deliberately opt for $K$ separate decoders: (i) as an inductive bias encouraging modularity; and (ii) to be able to controllably sample individual objects and recombine them in novel ways.
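For concreteness, the modelling choices described above can be written out as follows. This is a sketch under stated assumptions rather than the paper's exact equations: the class probabilities $\pi_k$, the fixed pixel variance $\sigma_x^2$, and the names $\mu_{\theta_k}$ and $\tilde m_{\theta_k}$ for the two decoder outputs are notation introduced here for illustration.

```latex
\begin{align*}
p_\theta(k_t) &= \mathrm{Cat}(\pi_1, \dots, \pi_K), \\
p_\theta(z_t \mid k_t) &= \mathcal{N}(0, I), \\
p_\theta(x_t \mid z_t, k_t) &= \mathcal{N}\big(\mu_{\theta_{k_t}}(z_t),\, \sigma_x^2 I\big), \\
p_\theta(m_t \mid z_t, k_t) &= \mathrm{Bernoulli}\big(\tilde m_{\theta_{k_t}}(z_t)\big).
\end{align*}
```

Here the subscript $k_t$ on the decoder parameters makes explicit that each class has its own decoder, which is what allows an individual expert to be sampled in isolation.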
Finally, we need to specify a distribution over $x$. Due to its layer-wise generation, this is tricky and most easily done in terms of the visible regions $r_t$. From (3) and the linearity of Gaussians it follows that, pixel-wise, $x$ is Gaussian given the regions and objects. Similarly, one can show from (1), (2), and (4) that $r_t$ depends on the scope $s_t$ for t = 1, . . . , T; see Appendix A for detailed derivations.
The class-conditional joint distribution then factorises accordingly. Conditioning on the object classes $k_{1:T}$ is motivated by our inference procedure, see Section 5. Moreover, we express $p$ in terms of the segmentation regions $r_t$ as only these are visible in the final composition, which makes it easier to specify a distribution over $x$. Note, however, that while we will perform inference over regions $r_t$, we will learn to generate full shapes $m_t$ which are consistent with the inferred $r_t$ when composed layer-wise as captured in (7), thus respecting the physical data-generating process.
4.2 APPROXIMATE POSTERIOR
Since exact inference is intractable in our model, we approximate the posterior over $z_{1:T}$ and $r_{1:T}$ with a variational distribution $q$ parametrised by $\phi$. As for the generative distribution, we model dependence on $k$ using $K$ modules with separate parameters $\phi_1, \dots, \phi_K$. These inference modules consist of two parts. Attention nets compute region probabilities $\tilde r_t$ and amortise inference over regions $r_t$. Encoders compute means and log-variances which parametrise distributions over $z_t$. We refer to the collection of attention net, encoder, and decoder for a given $k$ as an expert, as it implements all computations (generation and inference) for a specific object class; see Figure 3 for an illustration.
Due to the assumed sequential generative process, the natural order of inference is the reverse (t = T, . . . , 1), i.e., foreground objects should be explained first and the background last. This is also captured by the dependence of $r_t$ on the subsequently placed objects via the scope $s_t$.
Such entanglement of scene components across composition steps makes inference over the entire scene intractable. We therefore choose the following greedy approach. At each inference step t = T, . . . , 1, we consider explanations from all possible object classes $k$, as provided by our ensemble of experts via attending, encoding and reconstructing different parts of the current scene, and then choose the best-fitting one. This offers an intuitive foreground-to-background decomposition of an image as foreground objects should be easier to reconstruct.
Concretely, we first lower bound the marginal likelihood conditioned on $k_{1:T}$, and then use a competition mechanism between experts to determine the best $k$. We now describe this inference procedure in more detail.
5.1 OBJECTIVE: CLASS-CONDITIONAL ELBO
First, we lower bound the class-conditional model evidence using the approximate posterior q as follows (see Appendix A for a detailed derivation):
Next, we use the reparametrization trick of Kingma & Welling (2014) to replace expectations w.r.t. $z_t$ by a Monte Carlo estimate using a single sample drawn as $z_t = \mu_t + \sigma_t \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, where $\mu_t$ and $\log \sigma_t^2$ are the means and log-variances computed by the encoder.
Finally, we approximate expectations w.r.t. $r_t$ using the Bernoulli means $\tilde r_t$. We opt for directly using a continuous approximation and against sampling discrete $r$'s (e.g., using continuous relaxations to the Bernoulli distribution (Maddison et al., 2017; Jang et al., 2017)) as our generative model does not require the ability to directly sample regions. (Instead, we sample $z$'s and decode them into unoccluded shapes which can be combined layer-wise to form scenes.)
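The single-sample estimate can be illustrated with a short NumPy sketch. It assumes an encoder that outputs a mean and a log-variance per latent dimension, as described above; the function names and the toy values of `mu` and `log_var` are hypothetical. The closed-form KL term to a unit-variance isotropic Gaussian prior is the standard VAE expression and is included only to show how the encoder outputs enter a bound.

```python
import numpy as np

def reparametrise(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so z is differentiable in (mu, log_var)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])          # toy encoder means
log_var = np.array([0.0, -2.0])     # toy encoder log-variances

z = reparametrise(mu, log_var, rng)                      # the single sample used in training
samples = np.stack([reparametrise(mu, log_var, rng) for _ in range(20000)])
kl = kl_to_standard_normal(mu, log_var)
```

Drawing many samples, as in `samples`, is done here only to confirm that the construction has the intended mean and standard deviation; training uses one sample per step.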
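With these approximations, we obtain estimates of the individual terms of the bound, which we combine to form the learning objective (13). The relative weights of these terms are hyperparameters. Note that, for suitable values of these hyperparameters, (13) still approximates a valid lower bound.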
Generative model extension of MONET as a special case For K = 1 (i.e., ignoring different object classes for the moment), our derived objective (13) is similar to that used by Burgess et al. (2019). However, we note the following crucial difference in (12): in our model, reconstructed attention regions $\tilde m_t$ are multiplied by the scope $s_t$ in the corresponding term of the KL, see (7). This implies that the generated shapes $\tilde m_t$ are constrained to match the attention region $\tilde r_t$ only within the current scope $s_t$, so that, unlike in MONET, the decoder is not penalised for generating entire unoccluded object shapes, allowing inpainting also on the level of masks. With just a single expert, our model can thus be understood as a generative model extension of MONET.
5.2 COMPETITION MECHANISM
For K > 1, i.e., when explicitly modelling object classes with separate experts, the objective (13) cannot be optimised directly because it is conditioned on the object identities $k_{1:T}$. To address this issue, we use the following competition mechanism between experts.
At each inference step t = T, . . . , 1, we apply all experts to the current input (the image $x$ together with the current scope $s_t$) and declare as the winner the expert which yields the best competition objective (see below). We then use the winning expert $\hat k_t$ to reconstruct the selected scene component from a latent code $z_t$, where $z_t$ is encoded from the region $\tilde r_t$ attended by the winning expert $\hat k_t$. We then compute the contribution to (13) from step $t$ assuming fixed $\hat k_t$, and use it to update the winning expert with a gradient step. Finally, we update the scope using the region attended by the winning expert, to allow for inpainting within the explained region in the following inference (decomposition) steps.
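The control flow of this greedy procedure can be sketched as follows. This is an illustrative toy, not the authors' implementation: the experts are hypothetical stand-ins that attend to pixels of one grey level, the competition loss (size of the attended region) is a placeholder chosen only so that the loop runs and the background is explained last, and the gradient update of the winner is indicated by a comment.

```python
import numpy as np

def make_toy_expert(grey_level):
    """Hypothetical expert: attends to pixels of one grey level inside the current scope."""
    def expert(x, scope):
        region = scope * (x == grey_level)
        size = region.sum()
        # Placeholder competition loss (NOT the paper's objective): prefer small, non-empty regions.
        loss = size if size > 0 else np.inf
        return region, loss
    return expert

def greedy_decompose(x, experts, T):
    scope = np.ones_like(x, dtype=float)      # nothing has been explained yet
    winners, regions, n_evals = [], [], 0
    for _ in range(T):                        # inference steps t = T, ..., 1
        outputs = [expert(x, scope) for expert in experts]   # apply all K experts
        n_evals += len(experts)
        k_hat = int(np.argmin([loss for _, loss in outputs]))
        region = outputs[k_hat][0]
        winners.append(k_hat)
        regions.append(region)
        # A real implementation would take a gradient step on expert k_hat here.
        scope = scope * (1 - region)          # explained pixels leave the scope
    return winners, regions, scope, n_evals

x = np.zeros((4, 4))        # grey level 0: background
x[0:2, 0:2] = 1.0           # a 4-pixel object
x[3, 1:4] = 2.0             # a 3-pixel object
experts = [make_toy_expert(g) for g in (0.0, 1.0, 2.0)]
winners, regions, scope, n_evals = greedy_decompose(x, experts, T=3)
```

With K = 3 experts and T = 3 steps the loop performs 9 expert evaluations, whereas scoring every assignment of experts to steps would require 27.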
This competition process can be seen as a greedy approximation to maximising (13) w.r.t. $k_{1:T}$. Whereas considering all possible object combinations would require $O(K^T)$ steps, our competition procedure is linear in the number of object classes and runs in $O(KT)$ steps. By choosing an expert at each step t = T, . . . , 1, we approximate the expectation which entangles the different composition steps and makes inference intractable, using the winning experts $\hat k_t$ and the updates in (14).

Competition objective While model parameters are updated using the learning objective (13) derived from the ELBO, the choice of competition objective is ours. Since we use competition to drive specialisation of experts on different object classes and to greedily infer $k_t$ (i.e., the identity of the current foreground object), the competition objective should reflect such differences between object classes. Object classes can differ in many ways (shape, color, size, etc.) and to different extents, so the choice of competition objective is data-dependent and may be informed by prior knowledge.
For instance, in the setting depicted in Figure 1 where both color and shape are class-specific, we found that using a combination of the appearance and shape reconstruction terms worked well. However, on the same data with randomised color (as used in the experiments in Section 6) it did not: due to the greedy optimisation procedure, the expert which is initially best at reconstructing a particular color continues to win the competition for explaining regions of that color and thus receives gradient updates to reinforce this specialisation; such undesired specialisation corresponds to a local minimum in the optimisation landscape and can be very hard for the model to escape.
We thus found that relying solely on the shape reconstruction term as the competition objective (i.e., the reconstruction of the attention region) helps to direct specialisation towards object categories. In this case, experts are chosen based on how well they can model shape, and only those experts which can easily reconstruct (the shape of) a selected region within the current scope will do well at any given step, meaning that the selected region corresponds to a foreground object.
Moreover, we found that using a stochastic, rather than deterministic, form of competition (i.e., experts win the competition with probabilities proportional to their competition objectives at a given step) helped specialisation. In particular, such an approach helps prevent the collapse of the experts in the initial stages of training.
Formally, the probability of expert $k$ winning the competition at step $t$ is computed from its appearance and shape reconstruction terms in (13) at that step. A hyperparameter controls the relative influence of the appearance and shape reconstruction objectives, which allows encoding the data-dependent assumptions about the competition mechanism discussed above.
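One way such a stochastic competition could be implemented is sketched below. The softmax over negative losses, the `weight` on the appearance term, and the `temperature` parameter are all assumptions made for illustration; the exact formula used in the paper is not reproduced in this text, and the loss values are made up.

```python
import numpy as np

def winning_probabilities(loss_x, loss_r, weight=0.0, temperature=1.0):
    """Assumed form: softmax over negative competition losses (lower loss, higher probability)."""
    total = weight * np.asarray(loss_x, dtype=float) + np.asarray(loss_r, dtype=float)
    scores = -total / temperature
    scores -= scores.max()              # subtract the max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def sample_winner(probs, rng):
    """Sample the index of the winning expert instead of taking the argmax."""
    return int(rng.choice(len(probs), p=probs))

# Made-up per-expert losses at one inference step (K = 3 experts).
loss_x = [5.0, 1.0, 3.0]    # appearance reconstruction
loss_r = [0.2, 2.0, 1.0]    # shape (attention region) reconstruction

rng = np.random.default_rng(0)
p_shape_only = winning_probabilities(loss_x, loss_r, weight=0.0)
p_combined = winning_probabilities(loss_x, loss_r, weight=1.0)
winner = sample_winner(p_shape_only, rng)
```

With `weight=0.0` only shape reconstruction matters and the first expert is the most likely winner; with `weight=1.0` appearance dominates and the second expert is favoured. Because the winner is sampled, experts with a currently worse loss still receive occasional updates, which is the property argued above to help against early collapse.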
To explore ECON’s ability to decompose and generate new scenes, we conduct experiments on synthetic data consisting of colored 2D objects or sprites (triangles, squares and circles) in different occlusion arrangements. We refer to Appendix C for a detailed account of the used data set, model architecture, choice of hyperparameters, and experimental setting. Further experiments can be found in Appendix B.
ECON decomposes scenes and inpaints occluded objects Fig. 4 shows an example of how ECON decomposes a scene with four objects. At each inference step, the winning expert segments a region (second col.) within the unexplained part of the image (first col.), and reconstructs the selected object within the attended region (fourth col.). A distinctive feature of our model is that, despite occlusion, the full shape (rightmost col.) of every object is imputed. This ability to infer complete shapes is a consequence of the assumed layer-wise generative model which manifests itself in our objective via the unconstrained shape reconstruction term (12).
ECON generalizes to novel scenes Fig. 4 also illustrates that the model is capable of decomposing scenes containing multiple objects of the same category, as well as multiple objects of the same color in separate steps. It does so for a scene with four objects, despite being trained on scenes containing only three objects, one from each class.
Single expert as generative extension of MONET We also investigate training a single expert which we claim to be akin to a generative extension of MONET. When trained on the data from Fig. 1 with ground truth masks provided, the expert learns to inpaint occluded shapes and objects as can be seen from the samples in Fig. 5. However, all object classes are represented in a shared latent space so that different classes cannot be sampled controllably.
Figure 4: By explaining away a scene from front to back, ECON can impute occluded components (third column) and, crucially for layered generation and recombination, their shapes $m_t$ (fifth column) within the already explained regions (first column). Each inference step $t$ shows only the winning expert’s output.
Figure 5: Random samples from a single expert (akin to a generative extension of MONET) trained on the data from Fig. 1 with ground-truth masks provided. The model learns to separately generate unoccluded objects and background, but lacks control over which object class is sampled.
Figure 6: Samples from individual experts trained on toy data with random colors (shown in top panel). Experts (corresponding to rows in the bottom panel) specialise on triangles, circles, background, and squares, respectively, but such specialisation based purely on shape is significantly harder when color is lost as a powerful cue. This is reflected, e.g., in the imperfect separation between squares and circles, cf. Fig. 1.
Figure 7: Illustration of layer-wise sampling from ECON after training on our toy data with random colors. Starting with a background sample, subsequent rows correspond to sampling additional objects by randomly choosing one of the specialised object experts.
Multiple experts specialise on different object classes Fig. 1B shows samples from each of the four experts trained on a dataset with uniquely colored objects (Fig. 1A). The samples from each expert contain either the same object in different spatial positions or differently coloured background, indicating that the experts specialised on the different object classes composing these scenes.
Fig. 6 shows the same plot for a model trained on scenes consisting of randomly colored objects. This setting is considerably more challenging because experts have to specialise purely based on shape while also representing color variations. Yet, experts specialise on different object classes: samples in Fig. 6 are either randomly colored background or objects from mostly one class with different colours and spatial positions, indicating that ECON is capable of representing the scenes as compositions of distinct objects in an unsupervised way.
Controlled and layered generation of new scenes The specialisation of experts allows us to controllably generate new scenes with specific properties. To do so, we follow the sequential generation procedure described in Section 2 by sampling from one of the experts at each time step. The number of generation steps $T$, as well as the choices of experts $k_1, \dots, k_T$, allow us to control the total number and categories of objects in the generated scene.
Fig. 1C shows samples generated using the experts in Fig. 1B. In Fig. 7 we show another example where more and more randomly colored objects are sequentially added. Even though the generated scenes are quite simple, we believe this result is important as the ability to generate scenes in a controlled way is a distinctive feature of our model, which current generative scene models lack.
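The controlled, layer-wise sampling procedure can be mimicked with hand-written stand-ins for trained experts. In the sketch below, the three sampler functions, the `EXPERTS` dictionary, and the grey-scale 16×16 canvas are hypothetical; in ECON each sampler would instead decode a latent sample into an unoccluded object and its shape.

```python
import numpy as np

H = W = 16

def sample_background(rng):
    """Toy 'background expert': full mask, random dark grey level."""
    return np.ones((H, W)), np.full((H, W), rng.uniform(0.0, 0.2))

def sample_square(rng):
    """Toy 'square expert': 6x6 square at a random position, grey level 0.5."""
    top, left = rng.integers(0, H - 6, size=2)
    mask = np.zeros((H, W))
    mask[top:top + 6, left:left + 6] = 1.0
    return mask, np.full((H, W), 0.5)

def sample_disc(rng):
    """Toy 'disc expert': disc of radius 3 at a random position, grey level 1.0."""
    cy, cx = rng.integers(4, H - 4, size=2)
    yy, xx = np.mgrid[:H, :W]
    mask = ((yy - cy) ** 2 + (xx - cx) ** 2 <= 9).astype(float)
    return mask, np.full((H, W), 1.0)

EXPERTS = {"background": sample_background, "square": sample_square, "disc": sample_disc}

def generate_scene(expert_names, rng):
    """Compose one sample per chosen expert, back to front."""
    canvas = np.zeros((H, W))
    layers = []
    for name in expert_names:
        mask, obj = EXPERTS[name](rng)
        canvas = mask * obj + (1 - mask) * canvas   # later layers occlude earlier ones
        layers.append((name, mask))
    return canvas, layers

rng = np.random.default_rng(1)
# The list of expert names fixes both the number of objects and their depth order.
scene, layers = generate_scene(["background", "square", "disc", "disc"], rng)
```

Changing the list passed to `generate_scene` changes the number, categories and depth ordering of objects without retraining anything, which is the kind of control discussed above.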
Model assumptions While ECON aims at modelling scene composition in a faithful way, we make a number of assumptions for the sake of tractable inference, which need to be revisited when moving to more general environments. We assume a known (maximum) number of object classes K which may be restrictive for realistic settings, and choosing K too small may force each expert to represent multiple object classes. Other assumptions are that the pixel values are modelled as normally distributed, even though they are discrete in the range {0, . . . , 255}, and that pixels are conditionally independent given shapes and objects.
Shared vs. object-specific representations Recent work on unsupervised representation learning (Bengio et al., 2013) has largely focused on disentangling factors of variation within a single shared representation space, e.g., by training a large encoder-decoder architecture with different forms of regularization (Higgins et al.; Kim & Mnih, 2018; Chen et al., 2018; Locatello et al., 2019). This is motivated by the observation that certain (continuous) attributes such as position, size, orientation or color are general concepts which transcend object-class boundaries. However, the range of values of these attributes, as well as other (discrete) properties such as shape, can strongly depend on object class. In this work, we investigate the other extreme of this spectrum by learning entirely object-specific representations. Exploring the more plausible middle ground combining both shared and object-specific representations is an attractive direction for further research.
Extensions and future work The goal of decomposing visual scenes into their constituents in an unsupervised manner from images alone will likely remain a long standing goal of visual representation learning. We have presented a model that recombines earlier ideas on layered scene compositions, with more recent models of larger representational power, and unsupervised attention models. The focus of this work is to establish physically plausible compositional models for an easy class of images and to propose a model that naturally captures object-specific specialization.
With ECON and other models as starting point, a number of extensions are possible. One direction of future work deals with incorporating additional information about scenes. Here, we consider static, semantically-free images. Optical flow and depth information can be cues to an attention process, facilitating segmentation and specialization. First results in the direction of video data have been shown by Xu et al. (2019). Natural images typically carry semantic meaning and objects are not ordered in arbitrary configurations. Capturing dependencies between objects (e.g., using an auto-regressive prior over depth ordering as in GENESIS), albeit challenging, could help disambiguate between scene components. Another direction of future work is to relax the unsupervised assumption, e.g., by exploring a semi-supervised approach, which might help improve stability.
On the modelling side, extensions to recurrent architectures and iterative refinement as in IODINE appear promising. Our model entirely separates experts from each other but, depending on object similarity, one can also include shared representations which will help transfer already learned knowledge to new experts in a continual learning scenario.
While the scenes studied here and in the recent works of Burgess et al. (2019), Greff et al. (2019), and Engelcke et al. (2019) are still in stark contrast to the impressive results that holistic generative models are able to achieve, we believe it is the right time to revisit the unsupervised scene composition problem. Our goal is to build recombinable systems, where different components can be used for new scene inference tasks. In the spirit of the analysis-by-synthesis approach, this requires the ability to re-create physically plausible visual scenes. Disentangling the scene formation process from the objects is one crucial component thereof, and the vast number of object types will require the ability to learn without supervision from visual input alone.
The authors would like to thank Alex Smola, Anirudh Goyal, Muhammad Waleed Gondal, Chris Russel, Adrian Weller, Neil Lawrence, and the Empirical Inference “deep learning & causality” team at the MPI for Intelligent Systems for helpful discussions and feedback.
M.B. and B.S. acknowledge support from the German Science Foundation (DFG) through the CRC 1233 “Robust Vision” project number 276693517, the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039A), and the DFG Cluster of Excellence “Machine Learning: New Perspectives for Science” EXC 2064/1, project number 390727645.
Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3366–3375, 2017.
Oron Ashual and Lior Wolf. Specifying object attributes and relations in interactive scene generation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4561–4569, 2019.
David Bau, Jun-Yan Zhu, Jonas Wulff, William Peebles, Hendrik Strobelt, Bolei Zhou, and Antonio Torralba. Seeing what a gan cannot generate. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4502–4511, 2019.
Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
Thomas G. Bever and David Poeppel. Analysis by synthesis: A (re-)emerging program of research for language and vision. Biolinguistics, 4(2):174–200, 2010.
Christopher P Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander Lerchner. Monet: Unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390, 2019.
Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620, 2018.
Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1): 1–22, 1977.
Martin Engelcke, Adam R Kosiorek, Oiwi Parker Jones, and Ingmar Posner. Genesis: Generative scene inference and sampling with object-centric latent representations. arXiv preprint arXiv:1907.13052, 2019.
SM Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Geoffrey E Hinton, et al. Attend, infer, repeat: Fast scene understanding with generative models. In Advances in Neural Information Processing Systems, pp. 3225–3233, 2016.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
Anirudh Goyal, Alex Lamb, Jordan Hoffmann, Shagun Sodhani, Sergey Levine, Yoshua Bengio, and Bernhard Schölkopf. Recurrent independent mechanisms. arXiv preprint arXiv:1909.10893, 2019.
Klaus Greff, Antti Rasmus, Mathias Berglund, Tele Hao, Harri Valpola, and Jürgen Schmidhuber. Tagger: Deep unsupervised perceptual grouping. In Advances in Neural Information Processing Systems, pp. 4484–4492, 2016.
Klaus Greff, Sjoerd Van Steenkiste, and Jürgen Schmidhuber. Neural expectation maximization. In Advances in Neural Information Processing Systems, pp. 6691–6701, 2017.
Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, Nick Watters, Christopher Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. In International Conference on Machine Learning, pp. 2424–2433, 2019.
Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, and Daan Wierstra. Draw: A recurrent neural network for image generation. In International Conference on Machine Learning, pp. 1462–1471, 2015.
Ulf Grenander. Pattern synthesis – lectures in pattern theory. 1976.
Nicolas Manfred Otto Heess. Learning generative models of mid-level structure in natural images. PhD thesis, 2012.
Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework.
Berthold K. P. Horn. Understanding image intensities. Artificial Intelligence, 8:201–231, 1977.
Robert A Jacobs and Michael I Jordan. A competitive modular connectionist architecture. In Advances in neural information processing systems, pp. 767–773, 1991.
Varun Jampani, Sebastian Nowozin, Matthew Loper, and Peter V. Gehler. The informed sampler: A discriminative approach to bayesian inference in generative computer vision models. Computer Vision and Image Understanding, 136:32–44, 2015. ISSN 1077-3142.
Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparametrization with gumbel-softmax. In International Conference on Learning Representations (ICLR 2017). OpenReview.net, 2017.
Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410, 2019a.
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. arXiv preprint arXiv:1912.04958, 2019b.
Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
Adam Kosiorek, Hyunjik Kim, Yee Whye Teh, and Ingmar Posner. Sequential attend, infer, repeat: Generative modelling of moving objects. In Advances in Neural Information Processing Systems, pp. 8606–8616, 2018.
T. D. Kulkarni, P. Kohli, J. B. Tenenbaum, and V. Mansinghka. Picture: A probabilistic programming language for scene perception. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4390–4399, June 2015. doi: 10.1109/CVPR.2015.7299068.
Nicolas Le Roux, Nicolas Heess, Jamie Shotton, and John Winn. Learning a generative model of images by factoring appearance and shape. Neural Computation, 23(3):593–650, 2011.
Francesco Locatello, Damien Vincent, Ilya Tolstikhin, Gunnar Rätsch, Sylvain Gelly, and Bernhard Schölkopf. Competitive training of mixtures of independent deep generative models. arXiv preprint arXiv:1804.11130, 2018.
Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pp. 4114–4124, 2019.
C Maddison, A Mnih, and Y Teh. The concrete distribution: A continuous relaxation of discrete random variables. International Conference on Learning Representations, 2017.
Joe Marino, Yisong Yue, and Stephan Mandt. Iterative amortized inference. In International Conference on Machine Learning, pp. 3403–3412, 2018.
Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in neural information processing systems, pp. 2204–2212, 2014.
Giambattista Parascandolo, Niki Kilbertus, Mateo Rojas-Carulla, and Bernhard Schölkopf. Learning independent causal mechanisms. In International Conference on Machine Learning, pp. 4036–4044, 2018.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8024–8035. 2019.
Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286, 2014.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi (eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241, Cham, 2015. Springer International Publishing. ISBN 978-3-319-24574-4.
Mehmet Ozgur Turkoglu, William Thong, Luuk Spreeuwers, and Berkay Kicanaoglu. A layer-based sequential framework for scene generation with gans. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 8901–8908, 2019.
Sjoerd Van Steenkiste, Michael Chang, Klaus Greff, and Jürgen Schmidhuber. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. arXiv preprint arXiv:1802.10353, 2018.
Jiajun Wu, Joshua B Tenenbaum, and Pushmeet Kohli. Neural scene de-rendering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 699–707, 2017.
Zhenjia Xu, Zhijian Liu, Chen Sun, Kevin Murphy, William T Freeman, Joshua B Tenenbaum, and Jiajun Wu. Unsupervised discovery of parts, structure, and dynamics. arXiv preprint arXiv:1903.05136, 2019.
Jinyang Yuan, Bin Li, and Xiangyang Xue. Generative modeling of infinite occluded objects for compositional scene representation. In International Conference on Machine Learning, pp. 7222–7231, 2019.
A DERIVATIONS
A.1 DERIVATION OF ELBO
We now provide a detailed derivation of the evidence lower bound (ELBO) used in the main paper.
For ease of notation we use vector notation and omit explicitly summing over pixel- and latent
dimensions (as done in the implementation).
We start by writing the marginal likelihood as an expectation w.r.t. q using importance sampling as follows:
Applying the concave logarithm function and using Jensen’s inequality, we obtain
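As a generic sketch of these two steps, write $\mathbf{h}$ for all latent variables of the model taken together and $\mathbf{x}$ for the observed image. Both symbols are placeholders assumed here for illustration and need not match the notation of the main paper. Importance sampling followed by Jensen's inequality then gives the standard bound:

```latex
\begin{align*}
\log p(\mathbf{x})
  &= \log \mathbb{E}_{q(\mathbf{h}\mid\mathbf{x})}
     \left[\frac{p(\mathbf{x},\mathbf{h})}{q(\mathbf{h}\mid\mathbf{x})}\right] \\
  &\geq \mathbb{E}_{q(\mathbf{h}\mid\mathbf{x})}
     \left[\log\frac{p(\mathbf{x},\mathbf{h})}{q(\mathbf{h}\mid\mathbf{x})}\right].
\end{align*}
```

The inequality holds because the logarithm is concave, so the logarithm of an expectation is at least the expectation of the logarithm.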
Using the chain rule of probability and properties of the logarithm, we can rearrange the integrand on the
RHS of (A.1) as
We will consider the three terms in (A.2) separately and define their expectations w.r.t. the approximate
posterior as
Next, we use our modelling assumptions stated in the paper to simplify these terms, starting with
Using the assumed factorisation of the approximate posterior, in particular
, as well as the fact that
, splitting the expectation into two parts, and
using linearity of the expectation operator, we find that can be written as follows:
Next, we consider . Using a similar argument as for
, we find that
Finally, we consider . Substituting the Gaussian likelihood for
, ignoring
constants which do not depend on any learnable parameters, and using the fact that is binary and
, we obtain
where denotes the pixel-wise L2-norm between two RGB vectors. (Recall that
are defined as quantities in , respectively, and that summation over these dimensions
yields the desired scalar objective.)
We observe that can all be written as sums over the T composition steps.
We thus define:
Finally, it then follows that
A.2 DERIVATION OF GENERATIVE REGION DISTRIBUTION
We now derive the distribution in (7). We will use the fact that , and that
can be
written as , as well as the conditional independencies implied by our model, see
Figure 2b. Considering the pixel-wise distribution and marginalising over , we obtain:
Since is binary, this fully determines its distribution.
B ADDITIONAL EXPERIMENTAL RESULTS
Figure 8 shows four additional examples of ECON decomposing scenes consisting of multiple
randomly coloured shapes. The model was trained on the data from Fig. 6, but is able to decompose
scenes with five objects (a), multiple occluding objects from the same class (b, c), and objects of
similar color to the background (d). Moreover, (b) suggests that additional timesteps () are
simply ignored if they are not needed.
Figure 8: Additional decomposition plots for o.o.d. data. The model was trained with four experts on scenes containing three objects (one triangle, one square, and one circle) arranged in random order.
C EXPERIMENTAL DETAILS
C.1 DATASETS
Synthetic dataset: uniquely colored objects. The dataset consists of images of circles, squares and
triangles on a randomly and uniformly colored background, such that there is a unique correspondence
between object color and class identities (red circles, green squares, blue triangles). The background
color is randomly chosen to be an RGB value with each channel being a random integer between
0 and 127, while the RGB values of the object colors are (255,0,0), (0,255,0), (0,0,255) for circles,
squares and triangles respectively. The spatial positions of the objects are randomly chosen such that
each of the objects entirely fits into an image without crossing the image boundary.
The models shown in Figs. 1 and 5 were trained on a version of this dataset containing images
with exactly three objects per image (one of each class) in random depth orders (Fig. 1, top row). The
training and validation splits include 50,000 and 100 such images, respectively.
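The sampling procedure described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' data-generation code: the image size (64×64), the object half-width `r`, the exact shape geometry, and the names `shape_mask` and `sample_scene` are assumptions made for the example. Only the colour scheme, the random depth order, and the constraint that objects fit inside the image come from the text.

```python
import numpy as np

# Class colours as stated in the text; image and object sizes below are assumptions.
CLASS_COLORS = {"circle": (255, 0, 0), "square": (0, 255, 0), "triangle": (0, 0, 255)}


def shape_mask(kind, cx, cy, r, size):
    """Boolean (size, size) mask of one shape centred at (cx, cy) with half-width r."""
    yy, xx = np.mgrid[0:size, 0:size]
    dx, dy = xx - cx, yy - cy
    if kind == "circle":
        return dx**2 + dy**2 <= r**2
    if kind == "square":
        return (np.abs(dx) <= r) & (np.abs(dy) <= r)
    if kind == "triangle":
        # Isosceles triangle: apex at the top row, base at the bottom row.
        return (np.abs(dy) <= r) & (2 * np.abs(dx) <= dy + r)
    raise ValueError(f"unknown shape: {kind}")


def sample_scene(rng, size=64, r=8):
    """Return (image, depth_order) for a scene with one object of each class."""
    # Background: a single colour whose RGB channels are random integers in [0, 127].
    background = rng.integers(0, 128, size=3)
    image = np.broadcast_to(background, (size, size, 3)).astype(np.uint8)
    # Random depth order: objects drawn later occlude those drawn earlier.
    order = [str(k) for k in rng.permutation(list(CLASS_COLORS))]
    for kind in order:
        # Centre sampled so that the whole object lies inside the image.
        cx, cy = rng.integers(r, size - r, size=2)
        image[shape_mask(kind, cx, cy, r, size)] = CLASS_COLORS[kind]
    return image, order
```

Calling `sample_scene(np.random.default_rng(0))` returns one `uint8` image together with the back-to-front order in which the three objects were drawn.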
Synthetic dataset: randomly colored objects. This dataset is the same as the one described above,
with the difference that the objects (circles, squares and triangles) are randomly colored, with the
corresponding RGB values being random integers between 128 and 255.
The models shown in Figs. 4, 6 and 8 were trained on a version of this dataset containing images
with exactly three objects per image (one of each class) in random depth orders (Fig. 6, top row). The
training and validation splits include 50,000 and 100 such images, respectively.
C.2 ARCHITECTURE DETAILS
Each expert in our model consists of an attention network computing the segmentation regions as a
function of the input image and the scope at a given time step, and a VAE reconstructing the image
appearance within the segmentation region and inpainting the unoccluded shape of the object. Below we
describe the details of the architectures we used for each of the expert networks.
C.2.1 EXPERT VAES
Encoder. The VAE encoder consists of multiple blocks, each of which is composed of a
convolutional layer, a ReLU non-linearity, and max pooling. The output of the final block is
flattened and transformed into a latent space vector by means of two fully connected layers. The
output of the first fully connected layer has four times as many activations as there are latent dimensions;
these are passed through a ReLU activation and finally linearly mapped to the latent vector by a
second fully connected layer.
Decoder. Following Burgess et al. (2019), we use a spatial broadcast decoder. First, the latent
vector is repeated on a spatial grid of the size of the input image, resulting in a 3D tensor whose spatial
dimensions are those of the input and which has as many feature maps as there are dimensions in the latent
space. Second, we concatenate two coordinate grids (for the x and y
coordinates) to this tensor.
Next, this tensor is processed by a decoding network consisting of as many blocks as the encoder,
with each block including a convolutional layer and ReLU non-linearity. Finally, we apply a
convolutional layer with sigmoid activation to the output of the decoding network, resulting in
an output with 4 channels (RGB + shape reconstruction).
C.2.2 ATTENTION NETWORK
We use the same attention network architecture as in Burgess et al. (2019) and the implementation
provided by Engelcke et al. (2019). It consists of a U-Net (Ronneberger et al., 2015) with 4 down and
up blocks consisting of a convolutional layer, instance normalisation, ReLU activation and
down- or up-sampling by a factor of two. The numbers of channels of the block outputs in the down
part (the up part is symmetric) of the network are: 4 - 32 - 64 - 64 - 64.
C.3 TRAINING DETAILS
We implemented the model in PyTorch (Paszke et al., 2019). We use a batch size of 32, the Adam
optimiser (Kingma & Ba, 2014), and an initial learning rate of . We compute the validation
loss every 100 iterations, and if the validation loss does not improve for 5 consecutive evaluations, we
decrease the learning rate by a factor of . We stop training after 5 learning rate decrease steps.
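The learning-rate schedule and stopping rule can be sketched in plain Python as below. This is one possible reading of the rule, not the authors' training loop: the function name `plateau_schedule` is hypothetical, "improve" is interpreted as a strict decrease below the best validation loss seen so far, and the initial learning rate and decay factor are left as arguments because their values are not given here.

```python
def plateau_schedule(val_losses, initial_lr, decay_factor, patience=5, max_decreases=5):
    """Replay a sequence of validation losses (one per evaluation, e.g. every 100 iterations).

    Returns (lr_history, stop_index): lr_history[i] is the learning rate in effect
    after the i-th evaluation; stop_index is the evaluation at which training stops,
    or None if the stopping criterion is never reached.
    """
    lr = initial_lr
    best = float("inf")
    stale = 0       # consecutive evaluations without improvement
    decreases = 0   # learning-rate decreases performed so far
    history = []
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
        if stale >= patience:
            # No improvement for `patience` evaluations in a row: decay the rate.
            lr *= decay_factor
            decreases += 1
            stale = 0
        history.append(lr)
        if decreases >= max_decreases:
            return history, i
    return history, None
```

For example, `plateau_schedule(losses, initial_lr=1.0, decay_factor=0.5)` uses placeholder values chosen only for illustration; with a validation loss that never improves after the first evaluation, training would stop at the 26th evaluation.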
C.4 CROSS-VALIDATION
Synthetic dataset: uniquely colored objects. The results in Fig. 1 were obtained by
cross-validating 512 randomly sampled architectures with the following ranges of parameters:
The best performing model in terms of the validation loss (which is shown in Fig. 1) has the latent
dimension of 2, 4 layers in encoder and decoder, 32 features per layer,
The results in Fig. 5 were obtained using the same model as above but with one expert.
Synthetic dataset: randomly colored objects. The results in Figs. 4, 6, and 7 were obtained by
cross-validating 512 randomly sampled architectures with the following ranges of parameters:
The best performing model in terms of the validation loss (which is shown in Fig. 1) has the latent
dimension of 5, 3 layers in encoder and decoder, 32 features per layer,