Unsupervised Learning of Object Structure and Dynamics from Videos

2019·Arxiv

Abstract

Abstract

Extracting and predicting object structure and dynamics from videos without supervision is a major challenge in machine learning. To address this challenge, we adopt a keypoint-based image representation and learn a stochastic dynamics model of the keypoints. Future frames are reconstructed from the keypoints and a reference frame. By modeling dynamics in the keypoint coordinate space, we achieve stable learning and avoid compounding of errors in pixel space. Our method improves upon unstructured representations both for pixel-level video prediction and for downstream tasks requiring object-level understanding of motion dynamics. We evaluate our model on diverse datasets: a multi-agent sports dataset, the Human3.6M dataset, and datasets based on continuous control tasks from the DeepMind Control Suite. The spatially structured representation outperforms unstructured representations on a range of motion-related tasks such as object tracking, action recognition and reward prediction.

1 Introduction

Videos provide rich visual information to understand the dynamics of the world. However, extracting a useful representation from videos (e.g. detection and tracking of objects) remains challenging and typically requires expensive human annotations. In this work, we focus on unsupervised learning of object structure and dynamics from videos.

One approach for unsupervised video understanding is to learn to predict future frames [21, 20, 11, 19, 29, 35, 10, 3, 18]. Based on this body of work, we identify two main challenges: First, it is hard to make pixel-level predictions because motion in videos becomes highly stochastic for horizons beyond about a second. Since semantically insignificant deviations can lead to large error in pixel space, it is often difficult to distinguish good from bad predictions based on pixel losses. Second, even if good pixel-level prediction is achieved, this is rarely the desired final task. The representations of a model trained for pixel-level reconstruction are not guaranteed to be useful for downstream tasks such as tracking, motion prediction and control.

Here, we address both of these challenges by using an explicit, interpretable keypoint-based representation of object structure as the core of our model. Keypoints are a natural representation of dynamic objects, commonly used for face and pose tracking. Training keypoint detectors, however, generally requires supervision. We learn the keypoint-based representation directly from video, without any supervision beyond the pixel data, in two steps: first encode individual frames to keypoints, then model the dynamics of those points. As a result, the representation of the dynamics model is spatially structured, though the model is trained only with a pixel reconstruction loss. We show that enforcing spatial structure significantly improves video prediction quality and performance for tasks such as action recognition and reward prediction.

By decoupling pixel generation from dynamics prediction, we avoid compounding errors in pixel space because we never condition on predicted pixels. This approach has been shown to be beneficial for supervised video prediction [30]. Furthermore, modeling dynamics in keypoint coordinate space allows us to sample and evaluate predictions efficiently. Errors in coordinate space are more meaningful than in pixel space, since distance between keypoints is more closely related to semantically relevant differences than pixel-space distance. We exploit this by using a best-of-many-samples objective [5] during training to achieve stochastic predictions that are both highly diverse and of high quality, outperforming the predictions of models lacking spatial structure.

Finally, because we build spatial structure into our model a priori, its internal representation is biased to contain object-level information that is useful for downstream applications. This bias leads to better results on tasks such as trajectory prediction, action recognition and reward prediction.

Our contributions are: (1) a novel architecture and optimization techniques for unsupervised video prediction with a structured internal representation; (2) a model that outperforms recent work [10, 33] and our unstructured baseline in pixel-level video prediction; (3) improved performance vs. unstructured models on downstream tasks requiring object-level understanding.

2 Related work

Unsupervised learning of keypoints. Previous work explores learning to find keypoints in an image by applying an autoencoding architecture with keypoint-coordinates as a representational bottleneck [15, 38]. The bottleneck forces the image to be encoded in a small number of points. We build on these methods by extending them to the video setting.

Stochastic sequence prediction. Successful video prediction requires modeling uncertainty. We adopt the VRNN [8] architecture, which adds latent random variables to the standard RNN architecture, to sample from possible futures. More sophisticated approaches to stochastic prediction of keypoints have been recently explored [36, 25], but we find the basic VRNN architecture sufficient for our applications.

Unsupervised video prediction. A large body of work explores learning to predict video frames using only a pixel-reconstruction loss [22, 24, 21, 11, 29, 9]. Most similar to our work are approaches that perform deterministic image generation from a latent sample produced by stochastic sampling from a prior conditioned on previous timesteps [10, 3, 18]. Our approach replaces the unstructured image representation with a structured set of keypoints, improving performance on video prediction and downstream tasks compared with SVG [10] (Section 5).

Recent methods also apply adversarial training to improve prediction quality and diversity of samples [27, 18]. EPVA [33] predicts dynamics in a high-level feature space and applies an adversarial loss to the predicted features. We compare against EPVA and show improvement without adversarial training, but adversarial training is compatible with our method and is a promising future direction.

Video prediction with spatially structured representations. Like our approach, several recent methods explore explicit, spatially structured representations for video prediction. Xu et al. [34] proposed to discover object parts and structure by watching how they move in videos. Vid2Vid [32] proposed a video-to-video translation network from segmentation masks, edge masks and human pose. The method is also used for predicting a few frames into the future by predicting the structure representations first. Villegas et al. [30] proposed to train a human pose predictor and then use the predicted pose to generate future frames of human motion. In [31], a method is proposed where future human pose is predicted using a stochastic network and the pose is then used to generate future frames. Recent methods on video generation have used spatially structured representations for video motion transfer between humans [1, 7]. In contrast, our model is able to find spatially structured representation without supervision while using video frames as the only learning signal.

3 Architecture

Our model is composed of two parts: a keypoint detector that encodes each frame into a low-dimensional, keypoint-based representation, and a dynamics model that predicts dynamics in the keypoint space (Figure 1).

Figure 1: Architecture of our model. Variables are black, functions blue, losses red. Some arrows are omitted for clarity, see Equations 1 to 4 for details.

3.1 Unsupervised keypoint detector

The keypoint detection architecture is inspired by [15], which we adapt for the video setting. Let be a video sequence of length T. Our goal is to learn a keypoint detector that captures the spatial structure of the objects in each frame in a set of keypoints

The detector is a convolutional neural network that produces K feature maps, one for each keypoint. Each feature map is normalized and condensed into a single (x, y)-coordinate by computing the spatial expectation of the map. The number of heatmaps K is a hyperparameter that represents the maximum expected number of keypoints necessary to model the data.

For image reconstruction, we learn a generator that reconstructs frame from its keypoint representation. The generator also receives the first frame of the sequence to capture the static appearance of the scene: . Together, the keypoint detector and generator form an autoencoder architecture with a representational bottleneck that forces the structure of each frame to be encoded in a keypoint representation [15].

The generator is also a convolutional neural network. To supply the keypoints to the network, each point is converted into a heatmap with a Gaussian-shaped blob at the keypoint location. The K heatmaps are concatenated with feature maps from the first frame . We also concatenate the keypoint-heatmaps for the first frame to the decoder input for subsequent frames , to help the decoder to "inpaint" background regions that were occluded in the first frame. The resulting tensor forms the input to the generator. We add skip connections from the first frame of the sequence to the generator output such that the actual task of the generator is to predict

We use the mean intensity of each keypoint feature map returned by the detector as a continuousvalued indicator of the presence of the modeled object. When converting keypoints back into heatmaps, each map is scaled by the corresponding . The model can use to encode the presence or absence of individual objects on a frame-by-frame basis.

3.2 Stochastic dynamics model

To model the dynamics in the video, we use a variational recurrent neural network (VRNN) [8]. The core of the dynamics model is a latent belief z over keypoint locations x. In the VRNN architecture, the prior belief is conditioned on all previous timesteps through the hidden state of an RNN, and thus represents a prediction of the current keypoint locations before observing the image:

We obtain the posterior belief by combining the previous hidden state with the unsupervised keypoint coordinates detected in the current frame:

Predictions are made by decoding the latent belief:

Finally, the RNN is updated to pass information forward in time:

Note that to compute the posterior (Eq. 2), we obtain from the keypoint detector, but for the recurrence in Eq. 4, we obtain by decoding the latent belief. We can therefore predict into the future without observing images by decoding from the prior belief. Because the model has both deterministic and stochastic pathways across time, predictions can account for long-term dependencies as well as future uncertainty [13, 8].

4 Training

4.1 Keypoint detector

The keypoint detector is trained with a simple L2 image reconstruction loss where v is the true and is the reconstructed image. Errors from the dynamics model are not backpropagated into the keypoint detector.2

Ideally, the representation should use as few keypoints as possible to encode each object. To encourage such parsimony, we add two additional losses to the keypoint detector:

Temporal separation loss. Image features whose motion is highly correlated are likely to belong to the same object and should ideally be represented jointly by a single keypoint. We therefore add a separation loss that encourages keypoint trajectories to be decorrelated in time. The loss penalizes "overlap" between trajectories within a Gaussian radius

where is the distance between the trajectories of keypoints k and , computed after subtracting the temporal mean from each trajectory. denotes the squared Euclidean norm.

Keypoint sparsity loss. For similar reasons, we add an L1 penalty on the keypoint scales to encourage keypoints to be sparsely active.

In Section 5.3, we show that both contribute to stable keypoint detection.

4.2 Dynamics model

The standard VRNN [8] is trained to encode the detected keypoints by maximizing the evidence lower bound (ELBO), which is composed of a reconstruction loss and a KL term between the Gaussian prior and posterior distribution

The KL term regularizes the latent representation. In the VRNN architecture, it is also responsible for training the RNN, since it encourages the prior to predict the posterior based on past information. To balance reliance on predictions with fidelity to observations, we add the hyperparameter [2]). We found it essential to tune for each dataset to achieve a balance between reconstruction quality (lower ) and prediction diversity.

The KL term only trains the dynamics model for single-step predictions because the model receives observations after each step [13]. To encourage learning of long-term dependencies, we add a pure reconstruction loss, without the KL term, for multiple future timesteps:

Figure 2: Main datasets used in our experiments. First row: Ground truth images. Second row: Decoded coordinates (black dots; in Figure 1) and past trajectories (gray lines). Third row: Reconstructed image. Green borders indicate observed frames, red indicate predicted frames.

The standard approach to estimate in Eq. 6 and 7 is to sample a single . To further encourage diverse predictions, we instead use the best of a number of samples [5] at each timestep during training:

where for observed steps and for predicted steps. By giving the model several chances to make a good prediction, it is encouraged to cover a range of likely data modes, rather than just the most likely. Sampling and evaluating several predictions at each timestep would be expensive in pixel space. However, since we learn the dynamics in the low-dimensional keypoint space, we can evaluate sampled predictions without reconstructing pixels. Due to the keypoint structure, the L2 distance of samples from the observed keypoints meaningfully captures sample quality. This would not be guaranteed for an unstructured latent representation. As shown in Section 5, the best-of-many objective is crucial to the performance of our model.

The combined loss of the whole model is:

where and are scale parameters for the keypoint separation and sparsity losses. See Section S1 for implementation details, including a list of hyperparameters and tuning ranges (Table S1).

5 Results

We first show that the structured representation of our model improves prediction quality on two video datasets, and then show that it is more useful than unstructured representations for downstream tasks that require object-level information.

5.1 Structured representation improves video prediction

We evaluate frame prediction on two video datasets (Figure 2). The Basketball dataset consists of a synthetic top-down view of a basketball court containing five offensive players and the ball, all drawn as colored dots. The videos are generated from real basketball player trajectories [37], testing the ability of our model to detect and stably represent multiple objects with complex dynamics. The dataset contains 107,146 training and 13,845 test sequences. The Human3.6 dataset [14] contains video sequences of human actors performing various actions, which we downsample to 6.25 Hz. We use subjects S1, S5, S6, S7, and S8 for training (600 videos), and subjects S9 and S11 for evaluation (239 videos). For both datasets, ground truth object coordinates are available for evaluation, but are not used by the model. The Basketball dataset contains the coordinates of each of the 5 players and the ball. The Human dataset contains 32 motion capture points, of which we select 12 for evaluation.

We compare the full model (Struct-VRNN) to a series of baselines and ablations: the Struct-VRNN (no BoM) model was trained without the best-of-many objective; the Struct-RNN is deterministic; the CNN-VRNN architecture uses the same stochastic dynamics model as the Struct-VRNN, but uses an unstructured deep feature vector as its internal representation instead of structured keypoints. All structured models use K = 12 for Basketball, and K = 48 for Human3.6M, and were conditioned on

Figure 3: Video generation quality on Human3.6M. Our stochastic structured model (Struct-VRNN) outperforms our deterministic baseline (Struct-RNN), our unstructured baseline (CNN-VRNN), and the SVG [10] and SAVP [18] models. Top: Example observed (green borders) and predicted (red borders) frames (best viewed as video: https://mjlm.github.io/video_structure/). Example sequences are the closest or furthest samples from ground truth according to VGG cosine similarity, as indicated. Note that for Struct-VRNN, even the samples furthest from ground truth are of high visual quality. Bottom left: Mean VGG cosine similarity of the the samples closest to ground truth (left) and furthest from ground truth (right). Higher is better. Plots show mean performance across 5 model initializations, with the 95% confidence interval shaded. Bottom right: Fréchet Video Distance [28], using all samples. Lower is better. Dots represents separate model initializations. EPVA [33] is not stochastic, so we compare performance with a single sample from our method on their test set.

8 frames and trained to predict 8 future frames. For the CNN-VRNN, which lacks keypoint structure, we use a latent representation with 3K elements, such that its capacity is at least as large as that of the Struct-VRNN representation. Finally, we compare to three published models: SVG [10], SAVP [18] and EPVA [33] (Figure 3).

The Struct-VRNN model matches or outperforms the other models in perceptual image and video quality as measured by VGG [23] feature cosine similarity and Fréchet Video Distance [28] (Figure 3). Results for the lower-level metrics SSIM and PSNR are similar (see supplemental material).

The ablations suggest that the structured representation, the stochastic belief, and the best-of-many objective all contribute to model performance. The full Struct-VRNN model generates the best reconstructions of ground truth, and also generates the most diverse samples (i.e., samples that are furthest from ground truth; Figure 3 bottom left). In contrast, the ablated models and SVG show both lower best-case accuracy and smaller differences between closest and furthest samples, indicating less diverse samples. SAVP is closer, performing well on the single-frame metric (VGG cosine sim.), but still worse on FVD than the structured model. Qualitatively, Struct-VRNN exhibits sharper images and longer object permanence than the unstructured models (Figure 3, top; note limb detail and dynamics). This is true even for the samples that are far from ground truth (Figure 3 top, row "Struct-VRNN (furthest)"), which suggests that our model produces diverse high-quality samples, rather than just a few good samples among many diverse but unrealistic ones. This conclusion is

Figure 4: Prediction error for ground-truth trajectories by linear regression from predicted keypoints. (sup.) indicates supervised baseline.

Figure 5: Ablating either the temporal separation loss or the keypoint sparsity loss reduces model performance and stability. In the FVD plots, each dot corresponds to a different model initialization. Coordinate error plots show the prediction error when regressing the ground-truth object coordinates on the discovered keypoints. Lines show the mean of five model initializations, with the 95% confidence intervals shaded.

backed up by the FVD (Figure 3 bottom right), which measures the overall quality of a distribution of videos [28].

5.2 The learned keypoints track objects

We now examine how well the learned keypoints track the location of objects. Since we do not expect the keypoints to align exactly with human-labeled objects, we fit a linear regression from the keypoints to the ground truth object positions and measure trajectory prediction error on held-out sequences (Figure 4). The trajectory error is the average distance between true and predicted coordinates at each timestep. To account for stochasticity, we sample 50 predictions and report the error of the best.3

As a baseline, we train Struct-VRNN and CNN-VRNN models with additional supervision that forces the learned keypoints to match the ground-truth keypoints. The keypoints learned by the unsupervised Struct-VRNN model are nearly as predictive as those trained with supervision, indicating that the learned keypoints represent useful spatial information. In contrast, prediction from the internal representation of the unsupervised CNN-VRNN is poor. When trained with supervision, however, the CNN-VRNN reaches similar performance as the supervised Struct-VRNN. In other words, both the Struct-VRNN and the CNN-VRNN can learn a spatial internal representation, but the Struct-VRNN learns it without supervision.

As expected, the less diverse predictions of the Struct-VRNN (no BoM) and Struct-RNN perform worse on the coordinate regression task. Finally, for comparison, we remove the dynamics model entirely and simply predict the last observed keypoint locations for all future timepoints. All models except unsupervised CNN-VRNN outperform this baseline.

5.3 Simple inductive biases improve object tracking

In Section 4.1, we described losses intended to add inductive biases such as keypoint sparsity and uncorrelated object trajectories to the keypoint detector. We find that these losses improve object tracking performance and stability. Figure 5 shows that models without show reduced video prediction and tracking performance. The increased variability between model initializations without and suggests that these losses improve the learnability of a stable keypoint

Figure 6: Unsupervised keypoints allow human-guided exploration of object dynamics. We manipulated the observed coordinates for Player 1 (black arrow) to change the original (blue) trajectory. The other players were not manipulated. The dynamics were then rolled out into the future to predict how the players will behave in the manipulated (red) scenario. Black crosses mark initial player positions. Light-colored parts of the trajectories are observed, dark-colored parts are predicted. Dots indicate final position. Lines of the same color indicate different samples conditioned on the same observed coordinates.

structure (also see Figure S6). In summary, we find that training and final performance is most stable if K is chosen to be larger than the expected number of objects, such that the model can use in combination with to activate the optimal number of keypoints.

5.4 Manipulation of keypoints allows interaction with the model

Figure 7: Keypoints learned by our method may be manipulated to change peoples’ poses. Note that both manipulations and effects are spatially local. Best viewed in video (https://mjlm.github. io/video_structure/).

On the Basketball dataset, we can explore counterfactual scenarios such as predicting how the other players react if one player moves left as opposed to right (Figure 6). We simply manipulate the sequence observed keypoint locations before they are passed to the RNN, thus conditioning the RNN states and predictions on the manipulated observations.

For the Human3.6M dataset, we can independently manipulate body parts and generate poses that are not present in the training set (Figure 7; please see https://mjlm.github.io/ video_structure/for videos). The model

learns to associate keypoints with local areas of the body, such that moving keypoints near an arm moves the arm without changing the rest of the image.

5.5 Structured representation retains more semantic information

Figure 8: Action recognition on the Human3.6M dataset. Solid line: null model (predict the most frequent action). Dashed line: prediction from ground-truth coordinates. Sup., supervised.

Figure 9: Predicting rewards on the DeepMind Control Suite continuous control domains. We chose domains with dense rewards to ensure the random policy would provide a sufficient reward signal for this analysis. To make scales comparable across domains, errors are normalized to a null model which predicts the mean training-set-reward at all timesteps. Lines show the mean across test-set examples and 5 random model initializations, with the 95% confidence interval shaded.

from six tasks in the DeepMind Control Suite (DMCS),

a set of simulated continuous control environments (Figure 9). Image observations and rewards were collected from the DMCS environments using random actions, and we modified our model to condition predictions on the agent’s actions by feeding the actions as an additional input to the RNN. Models were trained without access to the task reward function. We used the latent state of the dynamics model as an input to a separate reward prediction model for each task (see Section S2.3 for details). The dynamics learned by the Struct-VRNN give better reward prediction performance than the unstructured CNN-VRNN baseline, suggesting our architecture may be a useful addition to planning and reinforcement learning models. Concurrent work that applies a similar keypoint-structured model to control tasks confirms these results [17].

6 Discussion

A major question in machine learning is to what degree prior knowledge should be built into a model, as opposed to learning it from the data. This question is especially important for unsupervised vision models trained on raw pixels, which are typically far removed from the information that is of interest for downstream tasks. We propose a model with a spatial inductive bias, resulting in a structured, keypoint-based internal representation. We show that this structure leads to superior results on downstream tasks compared to a representation derived from a CNN without a keypoint-based representational bottleneck.

The proposed spatial prior using keypoints represents a middle ground between unstructured representations and an explicitly object-centric approach. For example, we do not explicitly model object masks, occlusions, or depth. Our architecture either leaves these phenomena unmodeled, or learns them from the data. By choosing to not build this kind of structure into the architecture, we keep our model simple and achieve stable training (see variability across initializations in Figures 3, 4, and 5) on diverse datasets, including multiple objects and complex, articulated human shapes.

We also note the importance of stochasticity for the prediction of videos and object trajectories. In natural videos, any sequence of conditioning frames is consistent with an astronomical number of plausible future frames. We found that methods that increase sample diversity (e.g. the best-of-many objective [5]) led to large gains in FVD, which measures the similarity of real and predicted videos on the level of distributions over entire videos. Conversely, due to the diversity of plausible futures, frame-wise measures of similarity to ground truth (e.g. VGG cosine similarity, PSNR, and SSIM) are near-meaningless for measuring long-term video prediction quality.

Beyond image-based measures, the most meaningful evaluation of a predictive model is to apply it to downstream tasks of interest, such as planning and reinforcement learning for control tasks. Because of its simplicity, our architecture is straightforward to combine with existing architectures for tasks that may benefit from spatial structure. Applying our model to such tasks is an important future direction of this work.

References

[1] K. Aberman, R. Wu, D. Lischinski, B. Chen, and D. Cohen-Or. Learning Character-Agnostic Motion for Motion Retargeting in 2D. In SIGGRAPH, 2019.

[2] A. A. Alemi, B. Poole, I. Fischer, J. V. Dillon, R. A. Saurous, and K. Murphy. Fixing a Broken ELBO. In ICML, 2018.

[3] M. Babaeizadeh, C. Finn, R. Erhan, Dumitru an Campbell, and S. Levine. Stochastic variational video prediction. In ICLR, 2018.

[4] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. In NeurIPS, 2015.

[5] A. Bhattacharyya, B. Schiele, and M. Fritz. Accurate and Diverse Sampling of Sequences based on a "Best of Many" Sample Objective. In CVPR, 2018.

[6] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating Sentences from a Continuous Space. In CONLL, 2016.

[7] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros. Everybody Dance Now. In CoRR, volume abs/1808.07371, 2018.

[8] J. Chung, K. Kastner, L. Dinh, K. Goel, A. Courville, and Y. Bengio. A Recurrent Latent Variable Model for Sequential Data. In NeurIPS, 2015.

[9] E. Denton and V. Birodkar. Unsupervised Learning of Disentangled Representations from Video. In NeurIPS, 2017.

[10] E. Denton and R. Fergus. Stochastic Video Generation with a Learned Prior. In ICML, 2018.

[11] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.

[12] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google vizier: A service for black-box optimization. In KDD, 2017.

[13] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning Latent Dynamics for Planning from Pixels. In ICML, 2019.

[14] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. In PAMI, 2014.

[15] T. Jakab, A. Gupta, H. Bilen, and A. Vedaldi. Conditional Image Generation for Learning the Structure of Visual Objects. In NeurIPS, 2018.

[16] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.

[17] T. Kulkarni, A. Gupta, C. Ionescu, S. Borgeaud, M. Reynolds, A. Zisserman, and V. Mnih. Unsupervised learning of object keypoints for perception and control. In arXiv: 1906.11883, 2019.

[18] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic Adversarial Video Prediction. In CoRR, volume abs/1804.01523, 2018.

[19] W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. In ICLR, 2017.

[20] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.

[21] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in atari games. In NeurIPS, 2015.

[22] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint:1412.6604, 2014.

[23] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In CoRR, 2014.

[24] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised Learning of Video Representations using LSTMs. In ICML, 2015.

[25] C. Sun, P. Karlsson, J. Wu, J. B. Tenenbaum, and K. Murphy. Stochastic Prediction of MultiAgent Interactions From Partial Observations. In ICLR, 2019.

[26] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. Lillicrap, and M. Riedmiller. DeepMind Control Suite. In arXiv: 1801.00690, 2018.

[27] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for video generation. In CVPR, 2018.

[28] T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Towards Accurate Generative Models of Video: A New Metric & Challenges. In CoRR, 2018.

[29] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing Motion and Content for Natural Video Sequence Prediction. In ICLR, 2017.

[30] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee. Learning to Generate Long-term Future via Hierarchical Prediction. In ICML, 2017.

[31] J. Walker, K. Marino, A. Gupta, and M. Hebert. The Pose Knows: Video Forecasting by Generating Pose Futures. In NeurIPS, 2018.

[32] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-Video Synthesis. In NeurIPS, 2018.

[33] N. Wichers, R. Villegas, D. Erhan, and H. Lee. Hierarchical Long-term Video Prediction without Supervision. In ICML, 2018.

[34] Z. Xu, Z. Liu, C. Sun, K. Murphy, W. T. Freeman, J. B. Tenenbaum, and J. Wu. Unsupervised Discovery of Parts, Structure, and Dynamics. In ICLR, 2019.

[35] T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NeurIPS, 2016.

[36] X. Yan, A. Rastogi, R. Villegas, K. Sunkavalli, E. Shechtman, S. Hadap, E. Yumer, and H. Lee. Mt-vae: Learning motion transformations to generate multimodal human dynamics. In ECCV, 2018.

[37] E. Zhan, S. Zheng, Y. Yue, L. Sha, and P. Lucey. Generating Multi-Agent Trajectories using Programmatic Weak Supervision. In ICLR, 2019.

[38] Y. Zhang, Y. Guo, Y. Jin, Y. Luo, Z. He, and H. Lee. Unsupervised Discovery of Object Landmarks as Structural Representations. In CVPR, 2018.

S1 Model implementation details

S1.1 Architecture

S1.1.1 Keypoint detector

Our goal is to encode the input image in terms of the locations of objects in the image [15, 38]. To this end, we use a keypoint detector network (Figure 1, bottom) that consists of K keypoint detectors. Ideally, each detector learns to detect a distinct object or object part. The detector network is implemented as a series of convolutional layers with stride 2 that reduce the input image v into a stack of keypoint detection score maps with use the softplus function on the activations of the final layer to ensure the maps are positive.

The raw maps R are normalized to obtain detection weight maps D,

where is the value of the k-th channel of D at pixel (u, v). We then reduce each to a single (x, y)-coordinate by computing the weighted mean over pixel coordinates:

To model keypoint presence or absence, we compute the mean value of the raw detection score maps,

In summary, each keypoint is represented by a -triplet encoding its location in the image and its scale.

For image reconstruction, each keypoint is converted back into a pixel representation by creating a map containing a Gaussian blob with standard deviation at the location of the keypoint, scaled by

The map contains the same information as the keypoint tuple , but in a pixel representation that is suitable as input to the convolutional reconstructor network . The image is reconstructed as follows:

where applies alternating convolutional layers and twofold bilinear upsampling until the maps are expanded to the original image resolution, denotes concatenation (here, channelwise), and is a network with the same architecture as (except for the final softmax nonlinearity) that extracts image features from the first frame to capture appearance information of the scene.

The internal layers of the convolutional encoder and decoder are connected through leaky rectified linear units f(x) = max(x, 0.2x). L2 weight decay of is applied to all convolutional kernels. To increase model capacity, we add one (for Basketball and DMCS) or two (for Human3.6M) additional size-preserving (stride 1) convolutional layers at each resolution scale of the detector and reconstructor. The image resolution is

S1.1.2 Dynamics model

The dynamics model (Figure 1, top) has the following components:

The prior network consists of a dense layer with ReLU activation functions (for number of units, see Prior net size in Table S1), followed by a dense layer that projects the activations to the mean and standard deviation that parameterize the prior latent distribution

The encoder network consists of a dense layer with 128 units and ReLU activation functions, followed by a dense layer that projects the activations to the mean and standard deviation that parameterize the posterior latent distribution

The decoder network consists of a dense layer with 128 units and ReLU activation functions, followed by a dense layer that projects the activations to the the linearized keypoint vector (containing components),

The recurrent network consists of a GRU layer with 512 units:

For the action-conditional model used for reward prediction (Figure 9), the input to is , where is the vector of random actions used to generate frame t of the DeepMind Control Suite dataset.

The size of and the latent representation z were optimized as hyperparameters (see Table S1).

S1.2 Optimization

We used the ADAM optimizer [16] with and . We trained on batches of size 32 for steps. The learning rate was set to at the start of training and reduced by half every steps. We used an L2 weight decay of on the weights of the convolutional layers in the image encoder and decoder. Weights were initialized using the "He uniform" method as implemented by Keras. Models were trained on a single Nvidia P100 GPU. Training took approximately 12 hours.

During training, we linearly annealed the KL loss scale from over the first in [6].

S1.3 Scheduled sampling

When training an RNN for many timesteps, the initially large errors compound over time, leading to slow learning. Therefore, during training, we initially supplied the observed keypoint coordinates as to the RNN, instead of the RNN’s own predictions. This is similar to teacher forcing, although we note that we used the output of the unsupervised keypoint detector, rather than the ground truth.

We find that teacher forcing causes the model to make more dynamic predictions which are qualitatively realistic, but may have poor error metrics because of the mismatch between the training and test distributions. We therefore gradually switched to using samples from the model over the course of training (scheduled sampling, [4]). We linearly increased the probability of choosing samples from the model from 0 to a final value over the course of training. We chose the final probability to be 1.0 for the observed timesteps and 0.5 for the predicted timesteps.

S1.4 Hyperparameter optimization

We used a black-box optimization tool based on Gaussian process bandits [12] to tune several of the hyperparameters of our model. The target for optimization was the mean coordinate trajectory error as computed for Figure 4. See Table S1 for parameters and their tuning ranges.

Table S1: Hyperparameters

S2 Experimental details

S2.1 Comparison to SVG and SAVP

For both SVG and SAVP, models were trained on the same datasets as our models, using the code made available by the authors. We trained all models using 8 input steps and 8 predicted steps. For SAVP, the other hyperparameters were set to those recommended by the authors for the BAIR robot pushing dataset.

S2.2 Human3.6M action recognition

To understand how much semantically useful information the representations of our models contain, we predicted the actions performed in the Human3.6M dataset from the model representations (Figure 8). We first used trained models to extract keypoints (or unstructured image representations) for sequences of 8 observed steps from the Human3.6M test set. These keypoint sequences represented the dataset used for action recognition. The action recognition training set comprised 881 sequences, the test set 279 sequences. We ensured that no test sequences came from the same original videos as those used in the training set.

We then trained a separate recurrent neural network to classify each sequence into one of the 15 action categories (Walking, Sitting, Eating, Discussions, ...) in the Human3.6M dataset. No categories were excluded. The network consisted of two GRU layers (128 units), followed by a dense layer (15 units) and a softmax layer. We used 25% dropout after each GRU layer. The model was trained for 100 epochs using the ADAM optimizer with a starting learning rate of 0.01 that was successively reduced to 0.0001. We report the mean action recognition accuracy (fraction correct) on the 279 test sequences.

S2.3 DeepMind Control Suite reward prediction

To explore if the structured representation learned by our model may be useful for planning, we used our model to predict rewards in DeepMind Control Suite [26] continuous control tasks (Figure 9). We chose tasks that have dense rewards and thus provide a strong signal for evaluation (Acrobot Swingup, Cartpole Balance, Cheetah Run, Reacher Easy, Walker Stand, Walker Walk).

We generated a dataset based on DeepMind Control Suite (DMCS) continuous control tasks by performing random actions and recording 64 by 64 pixel observations, the actions, and the rewards. We then trained our model variants on this dataset. Importantly, we trained a single model on data from all domains, to test the generality of our approach. We modified our models to be action-conditional by passing the vector of actions as an additional input to the RNN at each timestep.

To predict rewards, we used the RNN hidden state of our models as a representation of the dynamics learned by the model. We first collected the hidden states of the trained models for 10,000 length-20 sequences from the test split for each of the six domains in our DMCS dataset. We then trained a separate, smaller reward prediction model to predict rewards for each of the six domains. The reward prediction models took the sequence of RNN hidden states as input and returned a sequence of scalar reward values as output. The model consisted of a fully connected layer (128 units), two GRU layers (128 units) and a dense layer (1 unit), all connected through rectified linear units. The reward prediction model was trained on 80% of the data with the ADAM optimizer with a starting learning rate of 0.001 that was successively reduced to 0.0001. We report the mean squared error of the predicted reward on the remaining 20% of the data.

Figure S1: Additional video metrics on Human3.6M: structural similarity (SSIM) and peak signal-to-noise ratio (PSNR). The models were conditioned on 8 frames and trained to predict 8 future frames. Higher is better (closer to ground truth). Top row shows the mean across all test-set examples of the closest-to-GT of 100 stochastic samples, bottom shows the furthest. Lines show the mean across 5 random model initializations, with the 95% confidence interval shaded.

Figure S2: Video generation quality on Basketball. The models were conditioned on 8 frames and trained to predict 8 future frames. Our stochastic structured model (Struct-VRNN) outperforms our deterministic baseline (Struct-VRNN), our unstructured baseline (CNN-VRNN), and the SVG model [10] in the FVD metric and qualitatively. Top: Example input (green borders) and predicted (red borders) frames. Bottom left: Fréchet Video Distance (FVD) [28]. Lower is better. Each dot represents a separate model initialization. For SVG, the FVD for several runs was greater than 1700. The example at the top comes from the best run. Bottom right: VGG feature cosine similarity, structural similarity (SSIM), and peak signal-to-noise ratio (PSNR). Higher is better. Lines show the mean across 5 random model initializations, with the 95% confidence interval shaded. The SVG model fails to represent objects stably at later timepoints. This is captured by the FVD metric, causing a large difference to our models. However, it is not captured by the other metrics, suggesting that they are not informative on this synthetic dataset. Also see videos in supplemental material or at https://mjlm.github.io/video_structure/.

Figure S3: Effect of stochastic belief and best-of-many-samples objective on sample diversity. Each row shows one example Basketball play, with the trajectories for one player in each column. The black line indicates the true trajectory, the colored lines indicate 20 stochastic predictions, all conditioned on the same observed steps. Trajectory endpoints are marked with dots. The model trained with the best-of-many-samples objective (a) produces more diverse samples than the model without (b). As expected, the deterministic model (c) lacks diversity completely. Players were matched to detected keypoints by finding, for each player, the keypoint which was closest to that player on average.

Figure S4: The Struct-VRNN model generates plausible and diverse predictions. Each block shows the true sequence in the top row, followed by three samples conditioned on the same initial frames (green outlines). Also see videos in supplemental material or at https://mjlm.github.io/video_ structure/.

Figure S5: Action-conditional predictions for the DeepMind Control Suite domains. Even though the CNN-VRNN has enough capacity to encode the observed frames (green outlines) well, it struggles to make future predictions (red outlines), in contrast to the Struct-VRNN. Also see videos in supplemental material or at https://mjlm.github.io/video_structure/.

(a) Coordinate error over time for individual objects. In the left plot, lines indicate the mean across 10 model initializations, with the 95% confidence interval shaded. In the middle and right plot, lines show individual model initializations. For these runs,

(b) Mean coordinate error of the ball (left) and Player 3 (right) across different settings of , relative to the best settings. Each entry in the heatmaps corresponds to the mean trajectory error across time and 10 model initializations.

Figure S6: Analysis of object tracking failure modes (Basketball dataset). (a) shows the coordinate error for individual objects. We identify two different failure modes, corresponding to failures of the dynamics model and failures of the keypoint detector: Some objects, e.g. the ball (yellow; middle plot), have relatively large tracking errors across all 10 model initializations, presumably because their dynamics are hard to learn. Other objects, e.g. Player 3 (pink; right plot) are tracked well in some and poorly in other model initializations, presumably because the keypoint detector completely fails to detect these objects in some runs. The sweep over in (b) shows that primarily reduce the keypoint detector failure mode (exemplified by the Player 3 error; right), while the tracking error of the ball (left) is insensitive to these losses. In other words, improve the reliability of the keypoint detector.

designed for accessibility and to further open science