Towards causal generative scene models via competition of experts
2020·arXiv
ABSTRACT

Learning how to model complex scenes in a modular way with recombinable components is a pre-requisite for higher-order reasoning and acting in the physical world. However, current generative models lack the ability to capture the inherently compositional and layered nature of visual scenes. While recent work has made progress towards unsupervised learning of object-based scene representations, most models still maintain a global representation space (i.e., objects are not explicitly separated), and cannot generate scenes with novel object arrangement and depth ordering. Here, we present an alternative approach which uses an inductive bias encouraging modularity by training an ensemble of generative models (experts). During training, experts compete for explaining parts of a scene, and thus specialise on different object classes, with objects being identified as parts that re-occur across multiple scenes. Our model allows for controllable sampling of individual objects and recombination of experts in physically plausible ways. In contrast to other methods, depth layering and occlusion are handled correctly, moving this approach closer to a causal generative scene model. Experiments on simple toy data qualitatively demonstrate the conceptual advantages of the proposed approach.

Proposed in the early days of computer vision (Grenander, 1976; Horn, 1977), analysis-by-synthesis is an approach to the problem of visual scene understanding. The idea is conceptually elegant and appealing: build a system that is able to synthesize complex scenes (e.g., by rendering), and then understand analysis (inference) as the inverse of this process that decomposes new scenes into their constituent components. The main challenges in this approach are the need for generative models of objects (and their composition into scenes) and the need to perform tractable inference given new inputs, including the task of decomposing scenes into objects in the first place. In this work, we aim to learn such a system in an unsupervised way from observations of scenes alone.

While models such as VAEs (Kingma & Welling, 2014; Rezende et al., 2014) and GANs (Goodfellow et al., 2014) constitute significant progress in generative modelling, these models still lack the ability to capture the compositional nature of reality: they typically generate entire images or scenes at once, i.e., with a single pass through a large feedforward network. While this approach works well for objects such as centred faces, and progress has been impressive on those tasks (Karras et al., 2019a;b), generating natural scenes containing several objects in non-trivial constellations gets increasingly difficult within this framework due to the combinatorial number of compositions that need to be represented and reasoned about (Bau et al., 2019).

image

Figure 1: Our ECON model learns to decompose training scenes (A) into layers of inpainted objects. Representing object classes separately allows controllable sampling of individual objects (B: samples from different experts) which can be recombined in novel ways (C: compositions sampled by layering the experts in B in the same order as seen during training (top), or choosing three (middle) or four (bottom) objects at random).

Image formation entangles different components in highly non-linear ways, such as occlusion. Due to the difficulty of choosing the correct model and the complexity of inference, the task of generating complex scenes containing compositions of objects still lacks success stories. More training data certainly helps, and progress on generating visually impressive scenes has been substantial (Radford et al., 2015), but we hypothesize that a satisfactory and robust solution that is not optimized to a relatively well constrained IID (independent and identically distributed) data scenario will require that our models correctly incorporate the (causal) generative nature of natural scenes.

Here, we take some first small steps towards addressing the aforementioned limitations by proposing ECON, a more physically-plausible generative scene model with explicitly compositional structure. Our approach is based on two main ideas. The first is to consider scenes as layered compositions of (partially) depth-ordered objects. The second is to represent object classes separately using an ensemble of generative models, or experts.

Our generative scene model consists of a sequential process which places independent objects in the scene, operating from the back to the front, so that objects occurring closer to the viewer can occlude those further away. During inference, this process is reversed: at each step, experts compete for explaining part of the remaining scene, and only the winning expert is further trained on the explained part (Parascandolo et al., 2018). This competition ideally drives each expert to specialise on representing and generating instances from one, or a few related, object classes or concepts, and the notion of “objects” should automatically emerge as contiguous regions that appear in a stable way across a range of training images. By decomposing scenes in the reverse order of generation, occluded objects can be inpainted within the already explained regions so that experts can learn to generate full, unoccluded objects which can be recombined in novel ways.

Learning a modular scene representation via object-specific experts has several benefits. First, each expert only needs to solve the simpler subtask of representing and generating instances from a single object class—something which current generative models have been shown to be capable of—while the composition process is treated separately. Secondly, expert models are useful in their own right as they can be dropped or added, reused and repurposed for other tasks on an individual level.

We highlight the following contributions.

We summarise a physically-plausible model of scene generation in  §2 and use it to categorise and contrast related scene models and their shortcomings in  §3.

In §4, we present ECON, a compositional scene model, which, for a single expert, can be seen as an extension of MONET (Burgess et al., 2019) into a proper generative model (§5.1).

We introduce modular object representations through separate generators and propose a competition mechanism and objective to drive experts to specialise in  §5.2.

In experiments on synthetic data in  §6 we show qualitatively that ECON is able to decompose simple scenes into objects, represent these separately, and recombine them in a layer-wise fashion into novel, coherent scenes with arbitrary numbers and depth-orderings of objects.

We critically discuss our assumptions and propose extensions for future work in  §7.

image

Figure 2: Assumed data generating process (dead-leaves model). Independent objects $x_t$ with shapes $m_t$ (drawn from class $k_t$ with properties $z_t$) are placed on the canvas sequentially (reflecting depth ordering) and appear in the final composition $x$ as dependent, partially occluded regions $r_t$.

To reflect the fact that 2D images are the result of projections of richer 3D scenes, we assume that data are generated from the well-known dead leaves model, i.e., in a layer-wise fashion; see Figure 2a for an illustration. Starting with an empty canvas $x = 0$, an image $x \in [0, 1]^{D \times 3}$ is sequentially generated in $T$ steps. At each step we sample an object from one of $K$ different classes and place it on the canvas as follows,

image

where $k_t \in \{1, \ldots, K\}$ represents the object class drawn at step $t$; $z_t \in \mathbb{R}^L$ is an abstract representation of the object’s properties; $m_t \in \{0, 1\}^D$ is a binary image determining shape; $x_t \in [0, 1]^{D \times 3}$ is a full (unmasked) image containing the object; and $\odot$ denotes element-wise multiplication. The corresponding graphical model is shown in Figure 2b.

This sequential generation process captures the loss of depth information when projecting from 3D to 2D and is a natural way of handling occlusion phenomena. Consequently, sampling from this model is straightforward. We therefore consider it a more truthful approach to modelling visual scenes than, e.g., spatial mixture models, in line with Le Roux et al. (2011).
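The layer-wise process described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the compositing rule $x \leftarrow m_t \odot x_t + (1 - m_t) \odot x$ is an assumption consistent with the definitions of $m_t$, $x_t$ and $\odot$ given above, and the helper names `compose_back_to_front` and `flat`, as well as the toy squares, are hypothetical.

```python
import numpy as np

def compose_back_to_front(shapes, objects):
    """Paint objects onto an empty canvas; later objects occlude earlier ones.

    shapes:  list of T binary masks m_t, each of shape (H, W)
    objects: list of T full images x_t, each of shape (H, W, 3)
    """
    canvas = np.zeros(objects[0].shape, dtype=float)   # empty canvas x = 0
    for m_t, x_t in zip(shapes, objects):              # t = 1, ..., T (back to front)
        m = m_t[..., None].astype(float)               # broadcast the mask over RGB
        canvas = m * x_t + (1.0 - m) * canvas          # assumed update rule
    return canvas

# Toy scene (hypothetical): red background, a green square, then a blue square in front.
H = W = 8
background = np.ones((H, W), dtype=int)
square_a = np.zeros((H, W), dtype=int)
square_a[1:5, 1:5] = 1
square_b = np.zeros((H, W), dtype=int)
square_b[3:7, 3:7] = 1

def flat(rgb):
    """Constant-colour image of shape (H, W, 3)."""
    return np.broadcast_to(np.array(rgb, dtype=float), (H, W, 3))

scene = compose_back_to_front(
    [background, square_a, square_b],
    [flat([1, 0, 0]), flat([0, 1, 0]), flat([0, 0, 1])],
)
```

Where the two squares overlap, the blue square placed last hides the green one, which is the occlusion behaviour the sequential model is meant to capture.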

On the other hand, inferring the objects composing a given image $x$ is challenging. We will distinguish between shapes and regions in the following sense. The unoccluded object shapes $m_t$ (top row in Figure 2a) remain hidden and only appear in $x$ via their corresponding, partially occluded segmentation regions $r_t \in \{0, 1\}^D$; see the final composition in the bottom row of Figure 2a for an illustration. In particular, a region $r_t$ is always a subset of the corresponding shape pixels $m_t$.

In addition to the separate treatment of shapes $m_t$ and regions $r_t$, we also introduce a scope variable $s_t$ to help write the above model in a convenient form. Following Burgess et al. (2019), $s_t \in \{0, 1\}^D$ is defined recursively as

image

The scope $s_t$ at time $t$ contains those parts of the image which have been completely generated after $t$ steps and will not be occluded in the subsequent $T - t$ steps.

With $s_t$, the regions $r_t$ can be compactly defined as

image

Using these, we can express the final composition as

image

While (3) may look like a normal spatial mixture model, it is worth noting the following important point: even though the shapes $m_t$ are drawn independently, the resulting segmentation regions $r_t$ become (temporally) dependent due to the layer-wise generation process, i.e., the visible part of object $t$ depends on all objects subsequently placed on the canvas. This is intuitive and is evident from the fact that the RHS of (2) is a function of $m_{t:T}$.
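To make the relation between shapes, scopes and regions concrete, the following sketch computes them for random binary shapes and checks that the mixture-style form agrees with sequential back-to-front painting. Since the defining equations are not reproduced in the text here, the forms used are assumptions chosen to match the surrounding description: $s_T = 1$, $s_t = s_{t+1} \odot (1 - m_{t+1})$, $r_t = s_t \odot m_t$, and $x = \sum_t r_t \odot x_t$. The function name `scopes_and_regions` is hypothetical.

```python
import numpy as np

def scopes_and_regions(shapes):
    """Return (scopes, regions) for binary shapes m_1..m_T listed back to front."""
    T = len(shapes)
    scopes = [None] * T
    regions = [None] * T
    s = np.ones_like(shapes[0])        # s_T = 1: nothing lies in front of the last object
    for t in reversed(range(T)):       # front (t = T) to back (t = 1)
        scopes[t] = s
        regions[t] = s * shapes[t]     # visible region r_t = s_t * m_t
        s = s * (1 - shapes[t])        # pixels still visible behind object t
    return scopes, regions

rng = np.random.default_rng(0)
T, H, W = 4, 6, 6
shapes = [rng.integers(0, 2, size=(H, W)) for _ in range(T)]
shapes[0] = np.ones((H, W), dtype=int)             # background covers the whole canvas
objects = [rng.random((H, W, 3)) for _ in range(T)]

scopes, regions = scopes_and_regions(shapes)

# Mixture-style form: x = sum_t r_t * x_t
x_mixture = sum(r[..., None] * x for r, x in zip(regions, objects))

# Sequential form: paint back to front
x_sequential = np.zeros((H, W, 3))
for m, x in zip(shapes, objects):
    x_sequential = m[..., None] * x + (1 - m[..., None]) * x_sequential
```

Each region depends on every shape placed after it through the scope, which is the dependence between regions noted above; the shapes themselves are drawn independently.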

Clustering & spatial mixture models. One line of work (Greff et al., 2016; 2017; Van Steenkiste et al., 2018) approaches the perceptual grouping task of decomposing scenes into components by viewing separate regions $r_t$ as clusters. A scene $x$ is modelled with a spatial mixture model, parametrised by deep neural networks, in which learning is performed with a procedure akin to expectation maximisation (EM; Dempster et al., 1977). The recent IODINE model of Greff et al. (2019) instead uses a refinement network (Marino et al., 2018) to perform iterative amortised variational inference over independent scene components which are separately decoded and then combined via a softmax to form the scene. While IODINE is able to decompose a given scene, it cannot generate coherent samples of new scenes because dependencies between regions $r_t$ due to layering are not explicitly captured in its generative model.

This shortcoming of IODINE has also been pointed out by Engelcke et al. (2019) and addressed in their GENESIS model, which explicitly models dependencies between regions via an autoregressive prior over $r_{1:T}$. While this does enable sampling of coherent scenes which look similar to training data, GENESIS still assumes an additive, rather than layered, model of scene composition. As a consequence, the resulting entangled component samples contain holes and partially occluded objects and cannot be easily layered and recombined as shown in Figure 1 (e.g., to generate samples with exactly two circles and one triangle).

Sequential models. Our work is closely related to sequential or recurrent approaches to image decomposition and generation (Mnih et al., 2014; Gregor et al., 2015; Eslami et al., 2016; Kosiorek et al., 2018; Yuan et al., 2019). In particular, we build on the recent MONET model for scene decomposition of Burgess et al. (2019). MONET combines a recurrent attention network with a VAE which encodes and reconstructs the input within the selected attention regions $r_t$ while being unconstrained to inpaint occluded parts outside $r_t$.

We extend this approach in two main directions. Firstly, we turn MONET into a proper generative model which respects the layer-wise generation of scenes described in §2. Secondly, we explicitly model the discrete variable $k$ (object class) with an ensemble of class-specific VAEs (the experts), as opposed to within a single large encoder-decoder architecture as in IODINE, GENESIS or MONET. Such specialisation allows us to control object constellations in new, but scene-consistent ways.

Competition of experts. To achieve specialisation on different object classes in our model, we build on ideas from previous work using competitive training of experts (Jacobs & Jordan, 1991). More recently, these ideas have been successfully applied to tasks such as lifelong learning (Aljundi et al., 2017), learning independent causal mechanisms (Parascandolo et al., 2018), training mixtures of generative models (Locatello et al., 2018), as well as to dynamical systems via sparsely-interacting recurrent independent mechanisms (Goyal et al., 2019).

Probabilistic RBM models. The work of Le Roux et al. (2011) and Heess (2012) introduced probabilistic scene models that also reason about occlusion. Le Roux et al. (2011) combine restricted Boltzmann machines (RBMs) to generate masks and shape separately for every object in the scene into a masked RBM (M-RBM) model. Two variants are explored: one that respects a depth ordering and object occlusions, derived from similar arguments as we have put forward in the introduction; and a second model which uses a softmax combination akin to the spatial mixture models used in IODINE and GENESIS, although the authors argue it makes little sense from a modelling perspective. Inference is implemented as blocked Gibbs sampling with contrastive divergence as a learning objective. Inference over depth ordering is done exhaustively, that is, considering every permutation, as opposed to greedily using competition as in this work. Shortcomings of the model are mainly the limited expressiveness of RBMs (complexity and extent), as well as the cost of inference. Our work can be understood as an extension of the M-RBM formulation using VAEs in combination with attention, or segmentation, models.

Table 1: Comparison with related unsupervised scene decomposition and generation models.

image

Vision as inverse graphics & probabilistic programs. Another way to programmatically introduce information about scene composition is through analysis-by-synthesis, see Bever & Poeppel (2010) for an overview. In this approach, the synthesis (i.e., generative) model is fully specified, e.g., through a graphics renderer, and inference becomes the inverse task, which poses a challenging optimisation problem. Probabilistic programming is often advocated as a means to automatically compile this inference task; for instance, PICTURE has been proposed by Kulkarni et al. (2015), and combinations with deep learning have been explored by Wu et al. (2017). This approach is sometimes also understood as an instance of Approximate Bayesian Computation (ABC; Dempster et al., 1977) or likelihood-free inference. While conceptually appealing, these methods require a detailed specification of the scene generation process, something that we aim to learn in an unsupervised way. Furthermore, gains achieved by a more accurate scene generation process are generally paid for by complicated inference, and most methods thus rely on variations of MCMC sampling schemes (Jampani et al., 2015; Wu et al., 2017).

Supervised approaches. There is a body of work on augmenting generative models with ground-truth segmentation and other supervisory information. Turkoglu et al. (2019) proposed a layer-based model to add objects onto a background, Ashual & Wolf (2019) proposed a scene-generation method allowing for fine-grained user control, and Karras et al. (2019a;b) achieved impressive image generation results by exclusively training on a single class of objects. The key difference between these approaches and our work is that we focus exclusively on unsupervised learning.

We now introduce ECON (for Ensemble of Competing Object Networks), a causal generative scene model which explicitly captures the compositional nature of visual scenes. On a high level, the proposed architecture is an ensemble of generative models, or experts, designed after the layer-based scene model described in  §2. During training, experts compete to sequentially explain a given scene via attention over image regions, thereby specialising on different object classes. We perform variational inference (Jordan et al., 1999), amortised within the popular VAE framework (Kingma & Welling, 2014; Rezende et al., 2014), and use competition to greedily maximise a lower bound to the conditional likelihood w.r.t. object identity.

4.1 GENERATIVE MODEL

image

Figure 3: ECON architecture: ensemble of $K$ competing experts. Each expert consists of (i) an attention network which selects image regions $r_t$; (ii) an encoder which maps the image within the attended region to a latent code $z$; and (iii) a decoder which reconstructs both an object $x_t$ and its unoccluded shape $m_t$. A competition mechanism determines the winning expert at each step.

We adopt the generative model $p$ described in §2, parametrise it by $\theta$, and assume that it factorises over the graphical model in Figure 2b (i.e., assuming that objects at different time steps are drawn independently of each other). We model $p(k_t)$ with a categorical distribution, and place a unit-variance isotropic Gaussian prior over $z_t$,

image

Next, we parametrise $p(m_t \mid k, z)$ and $p(x_t \mid k, z)$ using $K$ decoders $f_1, \ldots, f_K : \mathbb{R}^L \to [0, 1]^{D \times 3} \times [0, 1]^D$ with respective parameters $\theta_1, \ldots, \theta_K$. These compute object means and mask probabilities $(\mu_{\theta_k}(z), \tilde{m}_{\theta_k}(z)) = f_k(z; \theta_k)$, which determine pixel-wise distributions over $m_t$ and $x_t$ via

image

We note at this point that, while other handlings of the discrete variable k are possible, we deliberately opt for K separate decoders: (i) as an inductive bias encouraging modularity; and (ii) to be able to controllably sample individual objects and recombine them in novel ways.

Finally, we need to specify a distribution over $x$. Due to its layer-wise generation, this is tricky and most easily done in terms of the visible regions $r_t$. From (3), (5), and linearity of Gaussians it follows that, pixel-wise,

image

Similarly, one can show from (1), (2), and (4) that $r_t$ depends on $r_{(t+1):T}$ only via $s_t$, and that

image

for $t = 1, \ldots, T$; see Appendix A for detailed derivations.

The class-conditional joint distribution then factorises as,

image

Conditioning on $k_{1:T}$ is motivated by our inference procedure, see §5. Moreover, we express $p$ in terms of the segmentation regions $r_t$ as only these are visible in the final composition, which makes it easier to specify a distribution over $x$. Note, however, that while we will perform inference over regions $r_{1:T}$, we will learn to generate full shapes $m_{1:T}$ which are consistent with the inferred $r_{1:T}$ when composed layer-wise as captured in (7), thus respecting the physical data-generating process.

4.2 APPROXIMATE POSTERIOR

Since exact inference is intractable in our model, we approximate the posterior over $z_{1:T}$ and $r_{1:T}$ with the following variational distribution $q$ parametrised by $\phi$ and $\psi$,

image

As for the generative distribution, we model dependence on $k_t$ using $K$ modules with separate parameters $\{\phi_1, \psi_1\}, \ldots, \{\phi_K, \psi_K\}$. These inference modules consist of two parts.

Attention nets $a_1, \ldots, a_K : [0, 1]^{D \times 3} \times [0, 1]^D \to [0, 1]^D$ compute region probabilities $\tilde{r}_{\psi_k}(x, s) = a_k(x, s; \psi_k)$ and amortise inference over regions $r_t$ via

image

Encoders $g_1, \ldots, g_K : [0, 1]^{D \times 3} \times [0, 1]^D \to \mathbb{R}^{L \times 2}$ compute means and log-variances $(\mu_{\phi_k}(x, r_t), \log \sigma^2_{\phi_k}(x, r_t)) = g_k(x, r_t; \phi_k)$, which parametrise distributions over $z_t$ via

image

We refer to the collection of $f_k(\,\cdot\,; \theta_k)$, $a_k(\,\cdot\,; \psi_k)$, and $g_k(\,\cdot\,; \phi_k)$ for a given $k$ as an expert, as it implements all computations (generation and inference) for a specific object class; see Figure 3 for an illustration.

Due to the assumed sequential generative process, the natural order of inference is the reverse ($t = T, \ldots, 1$), i.e., foreground objects should be explained first and the background last. This is also captured by the dependence of $r_t$ on $r_{(t+1):T}$ via the scope $s_t$ in $q_\psi$.

Such entanglement of scene components across composition steps makes inference over the entire scene intractable. We therefore choose the following greedy approach. At each inference step $t = T, \ldots, 1$, we consider explanations from all possible object classes ($k_t = 1, \ldots, K$), as provided by our ensemble of experts via attending, encoding and reconstructing different parts of the current scene, and then choose the best-fitting one. This offers an intuitive foreground-to-background decomposition of an image, as foreground objects should be easier to reconstruct.

Concretely, we first lower bound the marginal likelihood conditioned on $k_{1:T}$, $p_\theta(x \mid k_{1:T})$, and then use a competition mechanism between experts to determine the best $k$. We now describe this inference procedure in more detail.

5.1 OBJECTIVE: CLASS-CONDITIONAL ELBO

First, we lower bound the class-conditional model evidence $p_\theta(x \mid k_{1:T})$ using the approximate posterior $q$ as follows (see Appendix A for a detailed derivation):

image

Next, we use the reparametrization trick of Kingma & Welling (2014) to replace expectations w.r.t. $q_\phi(z_t \mid x, r_t, k_t)$ by a Monte Carlo estimate using a single sample drawn as:

image
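As an illustration of the single-sample estimate, the sketch below draws a latent code from encoder outputs given as a mean and a log-variance, using the standard reparametrisation $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. It also evaluates the closed-form KL divergence to a unit Gaussian, assuming the prior over $z_t$ has zero mean. The function names and the example values of `mu` and `log_var` are hypothetical and stand in for the output of an encoder $g_k$.

```python
import numpy as np

def reparameterise(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_unit_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), assuming a zero-mean prior."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])          # hypothetical encoder mean
log_var = np.array([-1.0, 0.0, 0.5])     # hypothetical encoder log-variance
z = reparameterise(mu, log_var, rng)     # one sample, as in the single-sample estimate
kl = kl_to_unit_gaussian(mu, log_var)
```

Because the noise is drawn independently of the encoder outputs, gradients can flow through `mu` and `log_var` when the same computation is written in an automatic differentiation framework.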

Finally, we approximate expectations w.r.t. $q_\psi(r_t \mid x, s_t, k_t)$ in $\mathcal{L}_{x,t}$ and $\mathcal{L}_{z,t}$ using the Bernoulli means $\tilde{r}_{\psi_{k_t}}(x, s_t)$. We opt for directly using a continuous approximation and against sampling discrete $r$’s (e.g., using continuous relaxations to the Bernoulli distribution (Maddison et al., 2017; Jang et al., 2017)) as our generative model does not require the ability to directly sample regions. (Instead, we sample $z$’s and decode them into unoccluded shapes which can be combined layer-wise to form scenes.)

With these approximations, we obtain the estimates

image

which we combine to form the learning objective

image

where $\beta, \gamma$ are hyperparameters. Note that for $\beta, \gamma > 1$, (13) still approximates a valid lower bound.

Generative model extension of MONET as a special case. For $K = 1$ (i.e., ignoring different object classes $k_t$ for the moment), our derived objective (13) is similar to that used by Burgess et al. (2019). However, we note the following crucial difference in (12): in our model, reconstructed attention regions $\tilde{m}_{\theta_k}(z)$ are multiplied by $s_t$ in the $p_\theta$ term of the KL, see (7). This implies that the generated shapes $m_t$ are constrained to match the attention region $r_t$ only within the current scope $s_t$, so that, unlike in MONET, the decoder is not penalised for generating entire unoccluded object shapes, allowing inpainting also on the level of masks. With just a single expert, our model can thus be understood as a generative model extension of MONET.
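The effect of multiplying the decoded shape by the scope can be seen in a toy one-dimensional example. The sketch below assumes the shape term is a pixel-wise KL divergence between Bernoulli distributions with means $\tilde{r}$ (attention) and $s_t \odot \tilde{m}$ (scoped shape); the exact form and direction of the KL in (12) are not reproduced here, so this is an assumption, and all arrays are hypothetical.

```python
import numpy as np

def bernoulli_kl(q, p, eps=1e-6):
    """Sum over pixels of KL( Bernoulli(q) || Bernoulli(p) ), clipped for stability."""
    q = np.clip(q, eps, 1.0 - eps)
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.sum(q * np.log(q / p) + (1.0 - q) * np.log((1.0 - q) / (1.0 - p))))

# Six pixels; the last two are already explained by an object in front (scope = 0).
scope = np.array([1, 1, 1, 1, 0, 0], dtype=float)
full_shape = np.array([0, 0, 1, 1, 1, 1], dtype=float)   # decoder inpaints the hidden part
visible_only = scope * full_shape                         # decoder outputs only what is visible
attended = scope * full_shape                             # attention region lies within the scope

kl_scoped_full = bernoulli_kl(attended, scope * full_shape)
kl_scoped_visible = bernoulli_kl(attended, scope * visible_only)
kl_unscoped_full = bernoulli_kl(attended, full_shape)     # comparison without the scope factor
```

With the scope factor, the inpainting decoder and the visible-only decoder incur the same (zero) penalty, whereas without it the full shape is penalised on the two occluded pixels.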

5.2 COMPETITION MECHANISM

For $K > 1$, i.e., when explicitly modelling object classes with separate experts, the objective (13) cannot be optimised directly because it is conditioned on the object identities $k_{1:T}$. To address this issue, we use the following competition mechanism between experts.

At each inference step $t = T, \ldots, 1$, we apply all experts ($k_t = 1, \ldots, K$) to the current input $(x, s_t)$ and declare as the winner the expert that yields the best competition objective (see below). We then use the winning expert $\hat{k}_t$ to reconstruct the selected scene component using

image

where $\tilde{z}_t(\hat{k}_t)$ is encoded from the region $\tilde{r}_{\psi_{\hat{k}_t}}$ attended by the winning expert $\hat{k}_t$. We then compute the contribution to (13) from step $t$ assuming fixed $\hat{k}_t$, and use it to update the winning expert with a gradient step. Finally, we update the scope using the winning expert,

image

to allow for inpainting within the explained region in the following inference (decomposition) steps.
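The competition step described above can be sketched as a greedy loop. The `ToyExpert` class below is a hypothetical stand-in for a trained expert (a fixed shape template and a flat colour instead of attention, encoder and decoder networks), the competition loss is a plain masked squared error, and the scope update $s \leftarrow s \odot (1 - \tilde{r})$ is an assumption. The gradient update of the winning expert is omitted.

```python
import numpy as np

class ToyExpert:
    """Hypothetical stand-in for an expert: a fixed shape template and a flat colour."""
    def __init__(self, shape, colour):
        self.shape = shape.astype(float)
        self.colour = np.asarray(colour, dtype=float)

    def explain(self, x, scope):
        region = scope * self.shape                      # attend only within the scope
        if region.sum() == 0:
            return region, np.inf                        # nothing left to explain
        err = ((x - self.colour) ** 2).sum(axis=-1)      # per-pixel squared error
        return region, (region * err).sum() / region.sum()

def greedy_decompose(x, experts, T):
    scope = np.ones(x.shape[:2])                         # s_T = 1
    winners, n_evals = [], 0
    for _ in range(T):                                   # t = T, ..., 1 (front to back)
        proposals = [e.explain(x, scope) for e in experts]
        n_evals += len(experts)
        k_hat = int(np.argmin([loss for _, loss in proposals]))
        winners.append(k_hat)
        scope = scope * (1.0 - proposals[k_hat][0])      # assumed scope update
    return winners, n_evals

# Toy scene: red background, green square, blue square in front (painted back to front).
H = W = 8
masks = [np.ones((H, W)), np.zeros((H, W)), np.zeros((H, W))]
masks[1][1:5, 1:5] = 1
masks[2][3:7, 3:7] = 1
colours = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
x = np.zeros((H, W, 3))
for m, c in zip(masks, colours):
    x = m[..., None] * np.array(c, dtype=float) + (1 - m[..., None]) * x

experts = [ToyExpert(m, c) for m, c in zip(masks, colours)]
winners, n_evals = greedy_decompose(x, experts, T=3)
```

In this toy setting the front-most object is explained first because it is the only one whose full shape is consistent with the image, and the loop evaluates $K \cdot T = 9$ expert proposals rather than the $K^T = 27$ joint assignments of an exhaustive search.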

This competition process can be seen as a greedy approximation to maximising (13) w.r.t. $k_{1:T}$. While considering all possible object combinations would require $O(K^T)$ steps, our competition procedure is linear in the number of object classes and runs in $O(K \cdot T)$ steps. By choosing an expert at each step $t = T, \ldots, 1$, we approximate the expectation w.r.t. $q_\psi(s_t \mid x, k_{(t+1):T})$, which entangles the different composition steps and makes inference intractable, using $s_T = 1$ and the updates in (14).

Competition objective. While model parameters are updated using the learning objective (13) derived from the ELBO, the choice of competition objective is ours. Since we use competition to drive specialisation of experts on different object classes and to greedily infer $k_t$ (i.e., the identity of the current foreground object), the competition objective should reflect such differences between object classes. Object classes can differ in many ways (shape, color, size, etc.) and to different extents, so the choice of competition objective is data-dependent and may be informed by prior knowledge.

For instance, in the setting depicted in Figure 1 where both color and shape are class-specific, we found that using a combination of $\hat{\mathcal{L}}_{x,t}$ and $\hat{\mathcal{L}}_{r,t}$ worked well. However, on the same data with randomised color (as used in the experiments in §6) it did not: due to the greedy optimisation procedure, the expert which is initially best at reconstructing a particular color continues to win the competition for explaining regions of that color and thus receives gradient updates to reinforce this specialisation; such undesired specialisation corresponds to a local minimum in the optimisation landscape and can be very hard for the model to escape.

We thus found that relying solely on $\hat{\mathcal{L}}_{r,t}$ as the competition objective (i.e., the reconstruction of the attention region) helps to direct specialisation towards object categories. In this case, experts are chosen based on how well they can model shape, and only those experts which can easily reconstruct (the shape of) a selected region within the current scope will do well at any given step, meaning that the selected region corresponds to a foreground object.

Moreover, we found that using a stochastic, rather than deterministic, form of competition (i.e., experts win the competition with probabilities proportional to their competition objectives at a given step) helped specialisation. In particular, such an approach helps prevent the collapse of the experts in the initial stages of training.

Formally, the probability of expert k winning the competition is

image

with $\hat{\mathcal{L}}_{x,t}(k)$ and $\hat{\mathcal{L}}_{r,t}(k)$ being the terms in (13) at step $t$ for an expert $k$. The hyperparameter $\lambda$ controls the relative influence of the appearance and shape reconstruction objectives, and thereby encodes the data-dependent assumptions about the competition mechanism discussed above.
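A stochastic competition of this kind can be sketched as follows. The exact expression for the winning probability is not reproduced in the text here, so the softmax over negative weighted losses used below is an assumption chosen so that lower losses give higher winning probability; the function name `win_probabilities` and the loss values are hypothetical.

```python
import numpy as np

def win_probabilities(loss_x, loss_r, lam):
    """Assumed form: softmax over -(lam * L_x + L_r), one entry per expert."""
    scores = -(lam * np.asarray(loss_x, dtype=float) + np.asarray(loss_r, dtype=float))
    scores -= scores.max()              # subtract the maximum to stabilise the softmax
    p = np.exp(scores)
    return p / p.sum()

rng = np.random.default_rng(0)
loss_x = [0.9, 0.2, 0.5]                # hypothetical appearance losses for K = 3 experts
loss_r = [0.1, 0.4, 0.3]                # hypothetical shape (region) losses

probs_shape_only = win_probabilities(loss_x, loss_r, lam=0.0)    # shape decides the winner
probs_appearance = win_probabilities(loss_x, loss_r, lam=10.0)   # appearance dominates
winner = int(rng.choice(len(probs_shape_only), p=probs_shape_only))
```

Setting `lam=0.0` corresponds to competing on shape reconstruction alone, the choice reported above to work better when color is randomised; sampling the winner rather than taking the argmax gives every expert a chance to receive gradient updates early in training.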

To explore ECON’s ability to decompose and generate new scenes, we conduct experiments on synthetic data consisting of colored 2D objects or sprites (triangles, squares and circles) in different occlusion arrangements. We refer to Appendix C for a detailed account of the used data set, model architecture, choice of hyperparameters, and experimental setting. Further experiments can be found in Appendix B.

ECON decomposes scenes and inpaints occluded objects. Fig. 4 shows an example of how ECON decomposes a scene with four objects. At each inference step, the winning expert segments a region (second col.) within the unexplained part of the image (first col.), and reconstructs the selected object within the attended region (fourth col.). A distinctive feature of our model is that, despite occlusion, the full shape (rightmost col.) of every object is imputed (e.g., at step $t' = 4$). This ability to infer complete shapes is a consequence of the assumed layer-wise generative model which manifests itself in our objective via the unconstrained shape reconstruction term (12).

ECON generalizes to novel scenes. Fig. 4 also illustrates that the model is capable of decomposing scenes containing multiple objects of the same category, as well as multiple objects of the same color in separate steps. It does so for a scene with four objects, despite being trained on scenes containing only three objects, one from each class.

Single expert as generative extension of MONET. We also investigate training a single expert which we claim to be akin to a generative extension of MONET. When trained on the data from Fig. 1 with ground truth masks provided, the expert learns to inpaint occluded shapes and objects as can be seen from the samples in Fig. 5. However, all object classes are represented in a shared latent space so that different classes cannot be sampled controllably.

image

Figure 4: By explaining away a scene from front to back, ECON can impute occluded components $x_t$ (third column) and, crucially for layered generation and recombination, their shapes $m_t$ (fifth column) within the already explained regions $s_t$ (first column). Each inference step ($t' = T + 1 - t$) shows only the winning expert’s output.

image

Figure 5: Random samples from a single expert (akin to a generative extension of MONET) trained on the data from Fig. 1 with ground-truth masks provided. The model learns to separately generate unoccluded objects and background, but lacks control over which object class is sampled.

image

Figure 6: Samples from individual experts trained on toy data with random colors (shown in top panel). Experts (corresponding to rows in the bottom panel) specialise on triangles, circles, background, and squares, respectively, but such specialisation based purely on shape is significantly harder when color is lost as a powerful cue. This is reflected, e.g., in the imperfect separation between squares and circles, cf. Fig. 1.

image

Figure 7: Illustration of layer-wise sampling from ECON after training on our toy data with random colors. Starting with a background sample, subsequent rows correspond to sampling additional objects by randomly choosing one of the specialised object experts.

Multiple experts specialise on different object classes. Fig. 1B shows samples from each of the four experts trained on a dataset with uniquely colored objects (Fig. 1A). The samples from each expert contain either the same object in different spatial positions or differently coloured background, indicating that the experts specialised on the different object classes composing these scenes.

Fig. 6 shows the same plot for a model trained on scenes consisting of randomly colored objects. This setting is considerably more challenging because experts have to specialise purely based on shape while also representing color variations. Yet, experts specialise on different object classes: samples in Fig. 6 are either randomly colored background or objects from mostly one class with different colours and spatial positions, indicating that ECON is capable of representing the scenes as compositions of distinct objects in an unsupervised way.

Controlled and layered generation of new scenes. The specialisation of experts allows us to controllably generate new scenes with specific properties. To do so, we follow the sequential generation procedure described in §2 by sampling from one of the experts at each time step. The number of generation steps $T$, as well as the choices of experts $k_{1:T}$, allow us to control the total number and categories of objects in the generated scene.

Fig. 1C shows samples generated using the experts in Fig. 1B. In Fig. 7 we show another example where more and more randomly colored objects are sequentially added. Even though the generated scenes are quite simple, we believe this result is important as the ability to generate scenes in a controlled way is a distinctive feature of our model, which current generative scene models lack.

Model assumptions. While ECON aims at modelling scene composition in a faithful way, we make a number of assumptions for the sake of tractable inference, which need to be revisited when moving to more general environments. We assume a known (maximum) number of object classes $K$ which may be restrictive for realistic settings, and choosing $K$ too small may force each expert to represent multiple object classes. Other assumptions are that the pixel values are modelled as normally distributed, even though they are discrete in the range $\{0, \ldots, 255\}$, and that pixels are conditionally independent given shapes and objects.

Shared vs. object-specific representations Recent work on unsupervised representation learning (Bengio et al., 2013) has largely focused on disentangling factors of variation within a single shared representation space, e.g., by training a large encoder-decoder architecture with different forms of regularization (Higgins et al.; Kim & Mnih, 2018; Chen et al., 2018; Locatello et al., 2019). This is motivated by the observation that certain (continuous) attributes such as position, size, orientation or color are general concepts which transcend object-class boundaries. However, the range of values of these attributes, as well as other (discrete) properties such as shape, can strongly depend on object class. In this work, we investigate the other extreme of this spectrum by learning entirely object-specific representations. Exploring the more plausible middle ground combining both shared and object-specific representations is an attractive direction for further research.

Extensions and future work The goal of decomposing visual scenes into their constituents in an unsupervised manner from images alone will likely remain a long-standing goal of visual representation learning. We have presented a model that combines earlier ideas on layered scene composition with more recent models of larger representational power and unsupervised attention models. The focus of this work is to establish physically plausible compositional models for an easy class of images and to propose a model that naturally captures object-specific specialization.

With ECON and other models as a starting point, a number of extensions are possible. One direction of future work deals with incorporating additional information about scenes. Here, we consider static, semantically-free images. Optical flow and depth information can be cues to an attention process, facilitating segmentation and specialization. First results in the direction of video data have been shown by Xu et al. (2019). Natural images typically carry semantic meaning and objects are not ordered in arbitrary configurations. Capturing dependencies between objects (e.g., using an auto-regressive prior over depth ordering as in GENESIS), albeit challenging, could help disambiguate between scene components. Another direction of future work is to relax the unsupervised assumption, e.g., by exploring a semi-supervised approach, which might help improve stability.

On the modelling side, extensions to recurrent architectures and iterative refinement as in IODINE appear promising. Our model entirely separates experts from each other but, depending on object similarity, one can also include shared representations which will help transfer already learned knowledge to new experts in a continual learning scenario.

While the scenes studied here and in the recent works of Burgess et al. (2019); Greff et al. (2019); Engelcke et al. (2019) are still in stark contrast to the impressive results that holistic generative models are able to achieve, we believe it is the right time to revisit the unsupervised scene composition problem. Our goal is to build recombinable systems, where different components can be used for new scene inference tasks. In the spirit of the analysis-by-synthesis approach, this requires the ability to re-create physically plausible visual scenes. Disentangling the scene formation process from the objects is one crucial component thereof, and the vast number of object types will require the ability to learn in an unsupervised way from visual input alone.

The authors would like to thank Alex Smola, Anirudh Goyal, Muhammad Waleed Gondal, Chris Russel, Adrian Weller, Neil Lawrence, and the Empirical Inference “deep learning & causality” team at the MPI for Intelligent Systems for helpful discussions and feedback.

M.B. and B.S. acknowledge support from the German Science Foundation (DFG) through the CRC 1233 “Robust Vision” project number 276693517, the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039A), and the DFG Cluster of Excellence “Machine Learning – New Perspectives for Science” EXC 2064/1, project number 390727645.

Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3366–3375, 2017.

Oron Ashual and Lior Wolf. Specifying object attributes and relations in interactive scene generation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4561–4569, 2019.

David Bau, Jun-Yan Zhu, Jonas Wulff, William Peebles, Hendrik Strobelt, Bolei Zhou, and Antonio Torralba. Seeing what a gan cannot generate. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4502–4511, 2019.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.

Thomas G. Bever and David Poeppel. Analysis by synthesis: A (re-)emerging program of research for language and vision. Biolinguistics, 4(2):174–200, 2010.

Christopher P Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander Lerchner. Monet: Unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390, 2019.

Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620, 2018.

Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1): 1–22, 1977.

Martin Engelcke, Adam R Kosiorek, Oiwi Parker Jones, and Ingmar Posner. Genesis: Generative scene inference and sampling with object-centric latent representations. arXiv preprint arXiv:1907.13052, 2019.

SM Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Geoffrey E Hinton, et al. Attend, infer, repeat: Fast scene understanding with generative models. In Advances in Neural Information Processing Systems, pp. 3225–3233, 2016.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.

Anirudh Goyal, Alex Lamb, Jordan Hoffmann, Shagun Sodhani, Sergey Levine, Yoshua Bengio, and Bernhard Schölkopf. Recurrent independent mechanisms. arXiv preprint arXiv:1909.10893, 2019.

Klaus Greff, Antti Rasmus, Mathias Berglund, Tele Hao, Harri Valpola, and Jürgen Schmidhuber. Tagger: Deep unsupervised perceptual grouping. In Advances in Neural Information Processing Systems, pp. 4484–4492, 2016.

Klaus Greff, Sjoerd Van Steenkiste, and Jürgen Schmidhuber. Neural expectation maximization. In Advances in Neural Information Processing Systems, pp. 6691–6701, 2017.

Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, Nick Watters, Christopher Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. In International Conference on Machine Learning, pp. 2424–2433, 2019.

Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, and Daan Wierstra. Draw: A recurrent neural network for image generation. In International Conference on Machine Learning, pp. 1462–1471, 2015.

Ulf Grenander. Pattern synthesis – lectures in pattern theory. 1976.

Nicolas Manfred Otto Heess. Learning generative models of mid-level structure in natural images. PhD thesis, 2012.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework.

Berthold K. P. Horn. Understanding image intensities. Artificial Intelligence, 8:201–231, 1977.

Robert A Jacobs and Michael I Jordan. A competitive modular connectionist architecture. In Advances in neural information processing systems, pp. 767–773, 1991.

Varun Jampani, Sebastian Nowozin, Matthew Loper, and Peter V. Gehler. The informed sampler: A discriminative approach to bayesian inference in generative computer vision models. Computer Vision and Image Understanding, 136:32–44, 2015. ISSN 1077-3142.

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparametrization with gumbel-softmax. In International Conference on Learning Representations (ICLR 2017). OpenReview.net, 2017.

Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410, 2019a.

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. arXiv preprint arXiv:1912.04958, 2019b.

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

Adam Kosiorek, Hyunjik Kim, Yee Whye Teh, and Ingmar Posner. Sequential attend, infer, repeat: Generative modelling of moving objects. In Advances in Neural Information Processing Systems, pp. 8606–8616, 2018.

T. D. Kulkarni, P. Kohli, J. B. Tenenbaum, and V. Mansinghka. Picture: A probabilistic programming language for scene perception. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4390–4399, June 2015. doi: 10.1109/CVPR.2015.7299068.

Nicolas Le Roux, Nicolas Heess, Jamie Shotton, and John Winn. Learning a generative model of images by factoring appearance and shape. Neural Computation, 23(3):593–650, 2011.

Francesco Locatello, Damien Vincent, Ilya Tolstikhin, Gunnar Rätsch, Sylvain Gelly, and Bernhard Schölkopf. Competitive training of mixtures of independent deep generative models. arXiv preprint arXiv:1804.11130, 2018.

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pp. 4114–4124, 2019.

C Maddison, A Mnih, and Y Teh. The concrete distribution: A continuous relaxation of discrete random variables. International Conference on Learning Representations, 2017.

Joe Marino, Yisong Yue, and Stephan Mandt. Iterative amortized inference. In International Conference on Machine Learning, pp. 3403–3412, 2018.

Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in neural information processing systems, pp. 2204–2212, 2014.

Giambattista Parascandolo, Niki Kilbertus, Mateo Rojas-Carulla, and Bernhard Schölkopf. Learning independent causal mechanisms. In International Conference on Machine Learning, pp. 4036–4044, 2018.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8024–8035. 2019.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286, 2014.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi (eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241, Cham, 2015. Springer International Publishing. ISBN 978-3-319-24574-4.

Mehmet Ozgur Turkoglu, William Thong, Luuk Spreeuwers, and Berkay Kicanaoglu. A layer-based sequential framework for scene generation with gans. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 8901–8908, 2019.

Sjoerd Van Steenkiste, Michael Chang, Klaus Greff, and Jürgen Schmidhuber. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. arXiv preprint arXiv:1802.10353, 2018.

Jiajun Wu, Joshua B Tenenbaum, and Pushmeet Kohli. Neural scene de-rendering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 699–707, 2017.

Zhenjia Xu, Zhijian Liu, Chen Sun, Kevin Murphy, William T Freeman, Joshua B Tenenbaum, and Jiajun Wu. Unsupervised discovery of parts, structure, and dynamics. arXiv preprint arXiv:1903.05136, 2019.

Jinyang Yuan, Bin Li, and Xiangyang Xue. Generative modeling of infinite occluded objects for compositional scene representation. In International Conference on Machine Learning, pp. 7222–7231, 2019.

A DERIVATIONS

A.1 DERIVATION OF ELBO

We now provide a detailed derivation of the evidence lower bound (ELBO) used in the main paper. For ease of notation we use vector notation and omit explicitly summing over pixel and latent dimensions (as done in the implementation).

We start by writing p_θ(x | k_{1:T}) as an expectation w.r.t. q using importance sampling as follows:

image

Applying the concave function log(·) and using Jensen’s inequality we obtain

image
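For orientation, the two steps above have the following generic form. This is a sketch and not a transcription of the original equations: q is used here as shorthand for the full approximate posterior over r_{1:T} and z_{1:T} given x and k_{1:T}, and the paper's exact notation and equation numbering are not reproduced.

```latex
\begin{align}
p_\theta(x \mid k_{1:T})
  &= \mathbb{E}_{q}\!\left[
     \frac{p_\theta(x, r_{1:T}, z_{1:T} \mid k_{1:T})}
          {q(r_{1:T}, z_{1:T} \mid x, k_{1:T})}
     \right], \\
\log p_\theta(x \mid k_{1:T})
  &\geq \mathbb{E}_{q}\!\left[
     \log \frac{p_\theta(x, r_{1:T}, z_{1:T} \mid k_{1:T})}
               {q(r_{1:T}, z_{1:T} \mid x, k_{1:T})}
     \right].
\end{align}
```

The first line holds for any q whose support covers that of the joint, and the second follows because the logarithm of an expectation is at least the expectation of the logarithm.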

Using the chain rule of probability and properties of log(·), we can rearrange the integrand on the RHS of (A.1) as

image

We will consider the three terms in (A.2) separately and define their expectations w.r.t. the approximate posterior as

image

Next, we use our modelling assumptions stated in the paper to simplify these terms, starting with L_z. Using the assumed factorisation of the approximate posterior, in particular q_ψ(r_t | x, r_{(t+1):T}, k_t) = q_ψ(r_t | x, s_t, k_t), as well as the fact that p_θ(z_t | k_t) = p(z), splitting the expectation into two parts, and using linearity of the expectation operator, we find that L_z can be written as follows:

image

Next, we consider L_r. Using a similar argument as for L_z, we find that

image

Finally, we consider L_x. Substituting the Gaussian likelihood for p_θ(x | r_{1:T}, z_{1:T}, k_{1:T}), ignoring constants which do not depend on any learnable parameters, and using the fact that r_t is binary and ∑_{t=1}^{T} r_t = 1, we obtain

image

L_x(θ, ψ, φ | k_{1:T}) = E_{q_ψ(r_{1:T} | x, k_{1:T})}

image

where ∥ · ∥2 denotes the pixel-wise L2-norm between two RGB vectors. (Recall that L_x, L_r, and L_z are defined as quantities in R^D, R^D, and R^L, respectively, and that summation over these dimensions yields the desired scalar objective.)
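The structure of the reconstruction term can be sketched numerically as follows. The sketch assumes the usual squared-error form of a Gaussian log-likelihood with fixed variance, so that, up to sign, scale and additive constants, each pixel contributes the squared error of the step whose region contains it. The array names `recons` and `regions` are hypothetical, and the expectations over the approximate posterior are omitted.

```python
import numpy as np

def masked_reconstruction_error(x, recons, regions):
    """Sum over steps t and pixels of r_t * ||x - x_t||^2.

    x:       (H, W, 3) image
    recons:  (T, H, W, 3) per-step reconstructions x_t (hypothetical name)
    regions: (T, H, W) binary regions r_t that sum to 1 at every pixel
    """
    assert np.all(regions.sum(axis=0) == 1), "regions must partition the image"
    sq_err = ((x[None] - recons) ** 2).sum(axis=-1)   # (T, H, W) pixel-wise squared L2
    per_step = (regions * sq_err).sum(axis=(1, 2))    # (T,) one term per composition step
    return per_step.sum(), per_step

# Toy example: a 2 x 2 black image explained by two steps
x = np.zeros((2, 2, 3))
recons = np.stack([np.zeros((2, 2, 3)), np.full((2, 2, 3), 0.5)])
regions = np.array([[[1, 1], [0, 0]],
                    [[0, 0], [1, 1]]])
total, per_step = masked_reconstruction_error(x, recons, regions)
```

Because the regions are binary and sum to one, the error decomposes into one term per composition step, which is the property used in the remainder of the derivation.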

We observe that L_x, L_r, and L_z can all be written as sums over the T composition steps. We thus define:

image

Finally, it then follows that

image

A.2 DERIVATION OF GENERATIVE REGION DISTRIBUTION

We now derive the distribution in (7). We will use the fact that r_t = m_t ⊙ s_t, and that s_t can be written as s_t = 1 − ∑_{t′=t+1}^{T} r_{t′}, as well as the conditional independencies implied by our model, see Figure 2b. Considering the pixel-wise distribution and marginalising over m_t, we obtain:

image

Since r_t is binary, this fully determines its distribution.
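The deterministic relations used in this derivation, r_t = m_t ⊙ s_t and s_t = 1 − ∑_{t′>t} r_{t′}, can be checked on a small example. The sketch below is an illustration only: it takes already-sampled binary shape masks as input and does not model their distribution; the function name is hypothetical.

```python
import numpy as np

def visible_regions(masks):
    """Compute r_t = m_t * s_t with s_t = 1 - sum_{t' > t} r_{t'}.

    masks: (T, H, W) binary shape masks m_t; index 0 corresponds to t = 1.
    Returns (regions, scopes), both of shape (T, H, W).
    """
    T = masks.shape[0]
    regions = np.zeros_like(masks)
    scopes = np.zeros_like(masks)
    scope = np.ones_like(masks[0])        # s_T = 1: no pixel is explained yet
    for t in reversed(range(T)):          # t = T, ..., 1
        scopes[t] = scope
        regions[t] = masks[t] * scope
        scope = scope - regions[t]        # s_{t-1} = s_t - r_t
    return regions, scopes

# One image row with four pixels: a full background mask and two overlapping shapes
masks = np.array([[[1, 1, 1, 1]],         # m_1
                  [[0, 1, 1, 0]],         # m_2
                  [[0, 0, 1, 1]]])        # m_3
regions, scopes = visible_regions(masks)
```

In this example the pixel covered by both m_2 and m_3 is assigned to step 3 only, so a pixel can belong to region t only if it lies within the scope s_t.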

B ADDITIONAL EXPERIMENTAL RESULTS

Figure 8 shows four additional examples of ECON decomposing scenes consisting of multiple randomly coloured shapes. The model was trained on the data from Fig. 6, but is able to decompose scenes with five objects (a), multiple occluding objects from the same class (b, c), and objects of similar color to the background (d). Moreover, (b) suggests that additional timesteps (t′ = 6) are simply ignored if they are not needed.

image

Figure 8: Additional decomposition plots for o.o.d. data. The model was trained with four experts on scenes containing three objects (one triangle, square, and circle each) arranged in random order.

C EXPERIMENTAL DETAILS

C.1 DATASETS

Synthetic dataset: uniquely colored objects The dataset consists of images of circles, squares and triangles on a randomly and uniformly colored background, such that there is a unique correspondence between object color and class identities (red circles, green squares, blue triangles). The background color is randomly chosen to be an RGB value with each channel being a random integer between 0 and 127, while the RGB values of the object colors are (255,0,0), (0,255,0), (0,0,255) for circles, squares and triangles respectively. The spatial positions of the objects are randomly chosen such that each of the objects entirely fits into an image without crossing the image boundary.

The models shown in Fig. 1 and 5 have been trained on a version of this dataset containing images with exactly three objects per image (one of each class) in random depth orders (Fig. 1, top row). The training and validation splits include 50 000 and 100 such images respectively.

Synthetic dataset: randomly colored objects This dataset is the same as the one described above with the difference that the objects (circles, squares and triangles) are randomly colored with the corresponding RGB values being random integers between 128 and 255.

The models shown in Fig. 4, 6 and 8 have been trained on a version of this dataset containing images with exactly three objects per image (one of each class) in random depth orders (Fig. 6, top row). The training and validation splits include 50 000 and 100 such images respectively.
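A generator for such images might look like the following sketch. The color ranges and the constraint that objects stay inside the image follow the description above; the image size, object size, exact shape geometry and all function names are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

OBJECT_COLORS = {"circle": (255, 0, 0), "square": (0, 255, 0), "triangle": (0, 0, 255)}

def shape_mask(kind, size, radius, cy, cx):
    """Binary mask of one shape centred at (cy, cx); the geometry is an illustrative choice."""
    yy, xx = np.mgrid[0:size, 0:size]
    dy, dx = yy - cy, xx - cx
    if kind == "circle":
        return dy ** 2 + dx ** 2 <= radius ** 2
    if kind == "square":
        return (np.abs(dy) <= radius) & (np.abs(dx) <= radius)
    # upward-pointing triangle inscribed in the same bounding box
    return (np.abs(dy) <= radius) & (2 * np.abs(dx) <= dy + radius)

def make_scene(rng, size=64, radius=8, random_colors=False):
    """One image with a circle, a square and a triangle in random depth order."""
    background = rng.integers(0, 128, size=3)              # each channel in 0..127
    image = np.tile(background, (size, size, 1)).astype(np.uint8)
    kinds = list(OBJECT_COLORS)
    rng.shuffle(kinds)                                      # random depth order
    for kind in kinds:                                      # later shapes occlude earlier ones
        # centres are chosen so that the whole shape lies inside the image
        cy, cx = rng.integers(radius, size - radius, size=2)
        color = rng.integers(128, 256, size=3) if random_colors else OBJECT_COLORS[kind]
        image[shape_mask(kind, size, radius, cy, cx)] = color
    return image

rng = np.random.default_rng(0)
unique_scene = make_scene(rng)                              # uniquely colored objects
random_scene = make_scene(rng, random_colors=True)          # randomly colored objects
```

Keeping the background channels below 128 and the object channels at 128 or above means that background and object colors never coincide exactly, although they can still be similar.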

C.2 ARCHITECTURE DETAILS

Each expert in our model consists of an attention network computing the segmentation regions as a function of the input image and the scope at a given time step, and a VAE reconstructing the image appearance within the segmentation region and inpainting the unoccluded shape of the object. Below we describe the details of the architectures we used for each of the expert networks.

C.2.1 EXPERT VAES

Encoder The VAE encoder consists of multiple blocks, each of which is composed of a 3 × 3 convolutional layer, a ReLU non-linearity, and 2 × 2 max pooling. The output of the final block is flattened and transformed into a latent space vector by means of two fully connected layers. The first fully connected layer outputs four times as many activations as there are latent dimensions; these are passed through a ReLU activation and finally linearly mapped to the latent vector by the second fully connected layer.

Decoder Following Burgess et al. (2019), we use a spatial broadcast decoder. First, the latent vector is repeated on a spatial grid of the size of the input image, resulting in a 3D tensor whose spatial dimensions are those of the input and which has as many feature maps as there are dimensions in the latent space. Second, we concatenate the two coordinate grids (for x- and y-coordinates) to this tensor. Next, this tensor is processed by a decoding network consisting of as many blocks as the encoder, with each block including a 3 × 3 convolutional layer and a ReLU non-linearity. Finally, we apply a 1 × 1 convolutional layer with sigmoid activation to the output of the decoding network, resulting in an output with 4 channels (RGB + shape reconstruction).
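The broadcasting step of the decoder (repeating the latent vector over a spatial grid and appending coordinate channels) can be sketched in NumPy as follows. The convolutional layers are omitted, the function name is hypothetical, and the coordinate range of [−1, 1] is an assumption that is not stated in the text.

```python
import numpy as np

def spatial_broadcast(z, height, width):
    """Tile latent z over an H x W grid and append x- and y-coordinate channels.

    z: (L,) latent vector. Returns an array of shape (L + 2, H, W).
    The coordinate range [-1, 1] is an assumption, not taken from the paper.
    """
    z = np.asarray(z, dtype=np.float32)
    tiled = np.broadcast_to(z[:, None, None], (z.shape[0], height, width))
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, height, dtype=np.float32),
        np.linspace(-1.0, 1.0, width, dtype=np.float32),
        indexing="ij",
    )
    return np.concatenate([tiled, xs[None], ys[None]], axis=0)

# A 2-dimensional latent broadcast onto a 4 x 6 grid gives 2 + 2 = 4 channels
features = spatial_broadcast([0.5, -1.0], height=4, width=6)
```

Every spatial location receives the same latent values and differs only in its coordinate channels, so position information reaches the subsequent convolutional blocks through the coordinates alone.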

C.2.2 ATTENTION NETWORK

We use the same attention network architecture as in Burgess et al. (2019) and the implementation provided by Engelcke et al. (2019). It consists of a U-Net (Ronneberger et al., 2015) with 4 down and up blocks, each consisting of a 3 × 3 convolutional layer, instance normalisation, a ReLU activation, and down- or up-sampling by a factor of two. The numbers of channels of the block outputs in the down part of the network (the up part is symmetric) are: 4 - 32 - 64 - 64 - 64.

C.3 TRAINING DETAILS

We implemented the model in PyTorch (Paszke et al., 2019). We use a batch size of 32, the Adam optimiser (Kingma & Ba, 2014), and an initial learning rate of 5 · 10^−4. We compute the validation loss every 100 iterations, and if the validation loss does not improve for 5 consecutive evaluations, we decrease the learning rate by a factor of √10. We stop the training after 5 learning rate decrease steps.
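The learning-rate schedule and stopping rule described above can be sketched in plain Python. The class and attribute names are hypothetical, and the sketch assumes that "improve" means falling below the best validation loss seen so far and that the counter of non-improving evaluations resets after each decrease; neither detail is specified in the text.

```python
import math

class PlateauSchedule:
    """Learning-rate schedule sketched from the description above (names are hypothetical)."""

    def __init__(self, lr=5e-4, patience=5, factor=1 / math.sqrt(10), max_decreases=5):
        self.lr = lr
        self.patience = patience            # evaluations without improvement before a decrease
        self.factor = factor                # multiply the learning rate by 1 / sqrt(10)
        self.max_decreases = max_decreases  # stop after this many decreases
        self.best = math.inf
        self.bad_evals = 0
        self.decreases = 0

    def step(self, val_loss):
        """Record one validation result; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        if self.bad_evals >= self.patience:
            self.lr *= self.factor
            self.decreases += 1
            self.bad_evals = 0
        return self.decreases >= self.max_decreases

# Call once per validation pass (every 100 training iterations in the setup above)
schedule = PlateauSchedule()
stop_flags = [schedule.step(1.0) for _ in range(26)]  # a validation loss that never improves
```

With a validation loss that never improves after the first evaluation, this sketch lowers the learning rate every five evaluations and signals a stop at the fifth decrease.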

C.4 CROSS-VALIDATION

Synthetic dataset: uniquely colored objects The results in Fig. 1 were obtained by cross-validating 512 randomly sampled architectures with the following ranges of parameters:

image

The best performing model in terms of the validation loss (which is shown in Fig. 1) has a latent dimension of 2, 4 layers in the encoder and decoder, 32 features per layer, β = 9.54 and γ = 0.52.

The results in Fig. 5 were obtained using the same model as above but with one expert.

Synthetic dataset: randomly colored objects The results in Figs. 4, 6, and 7 were obtained by cross-validating 512 randomly sampled architectures with the following ranges of parameters:

image

The best performing model in terms of the validation loss (which is shown in Fig. 1) has a latent dimension of 5, 3 layers in the encoder and decoder, 32 features per layer, β = 1 and γ = 3.26.

