Learning Variations in Human Motion via Mix-and-Match Perturbation

2019·Arxiv

Abstract

Human motion prediction is a stochastic process: Given an observed sequence of poses, multiple future motions are plausible. Existing approaches to modeling this stochasticity typically combine a random noise vector with information about the previous poses. This combination, however, is done in a deterministic manner, which gives the network the flexibility to learn to ignore the random noise. In this paper, we introduce an approach to stochastically combine the root of variations with previous pose information, which forces the model to take the noise into account. We exploit this idea for motion prediction by incorporating it into a recurrent encoder-decoder network with a conditional variational autoencoder block that learns to exploit the perturbations. Our experiments demonstrate that our model yields high-quality pose sequences that are much more diverse than those from state-of-the-art stochastic motion prediction techniques.

1. Introduction

Human motion prediction aims to forecast the sequence of future poses of a person given past observations of such poses. To achieve this, existing methods typically rely on recurrent neural networks (RNNs) that encode the person’s motion [23, 12, 32, 20, 4, 26, 27]. While they predict reasonable motions, RNNs are deterministic models and thus cannot account for the highly stochastic nature of human motion; given the beginning of a sequence, multiple, diverse futures are plausible. To correctly model this, it is therefore critical to develop algorithms that can learn the multiple modes of human motion, even when presented with only deterministic training samples.

Figure 1. Diversity of K RNN decoder inputs, generated with K = 50 different random vectors. We report the mean diversity over N = 50 samples and the corresponding standard deviation.

Recently, several attempts have been made at modeling the stochastic nature of human motion [34, 4, 32, 20, 22]. These methods rely on sampling a random vector that is then combined with an encoding of the observed pose sequence. In essence, this combination is similar to the conditioning of generative networks; the resulting models aim to generate an output from a random vector while taking into account additional information about the content.

Here, we argue that, while standard conditioning strategies may be effective for many tasks, as in [35, 18, 9, 8, 3, 21], they are ill-suited for motion prediction. The reason is the following: In other tasks, the conditioning variable only provides auxiliary information about the output to produce, such as the fact that a generated face should be smiling. By contrast, in motion prediction, it typically contains the core signal to produce the output, i.e., the information about the previous poses. Since the prediction model is trained using deterministic samples, it can then simply learn to ignore the random vector and still produce a meaningful output based on the conditioning variable only. In other words, the model can ignore the root of variations, and thus essentially become deterministic.

This problem was discussed in [5] in the context of text generation, and we identified it in our own motion prediction experiments. As evidence, we plot in Fig. 1 the diversity of the representations used as input to the RNN decoders of [34] (LHP) and [4] (RHP), two state-of-the-art methods that are closest in spirit to our approach. Here, diversity is measured as the average pairwise distance across the K representations produced for a single series of observations. We report the mean diversity over 50 samples and the corresponding standard deviation. As can be seen from the figure, the diversity of LHP and RHP decreases as training progresses, thus supporting our observation that the models learn to ignore the perturbations.

In this paper, we introduce a simple yet effective approach to counteracting this loss of diversity and thus to generating truly diverse future pose sequences. At the heart of our approach lies the idea of Mix-and-Match perturbations: Instead of combining a noise vector with the conditioning variables in a deterministic manner, we randomly select and perturb a subset of these variables. By randomly changing this subset at every iteration, our strategy prevents training from identifying the root of variations and forces the model to take it into account in the generation process. As a consequence, and as evidenced by the black curve in Fig. 1, which shows an increasing diversity as training progresses, our approach produces not only high-quality predictions but also truly diverse ones.

In short, our contributions are (1) a novel way of imposing diversity into conditional VAEs, called Mix-and-Match perturbations; (2) a new motion prediction model capable of generating multiple likely future pose sequences from an observed motion; and (3) a new evaluation metric for quantitatively measuring the quality and the diversity of generated motions, thus facilitating the comparison of different stochastic approaches.

2. Related Work

Most motion prediction approaches are based on deterministic models [27, 26, 12, 15, 23, 13, 10, 11], casting motion prediction as a regression task where only one outcome is possible given the observations. While this may produce accurate predictions, it fails to reflect the stochastic nature of human motion, where multiple plausible outcomes can be highly likely for a single given series of observations. Modeling this diversity is the topic of this paper, and we therefore focus the discussion below on the other methods that have attempted to do so.

The general trend to incorporate variations in the predicted motions consists of combining information about the observed pose sequence with a random vector. In this context, two types of approaches have been studied: The techniques that directly incorporate the random vector into the RNN decoder and those that make use of an additional Conditional Variational Autoencoder (CVAE) [31].

In the first class of methods, [22] samples a random vector at each time step and adds it to the pose input to the RNN decoder. By relying on different random vectors at each time step, however, this strategy is prone to generating discontinuous motions. To overcome this, [20] makes use of a single random vector to generate the entire sequence. This vector is both employed to alter the initialization of the decoder and concatenated with a pose embedding at each iteration of the RNN. By relying on concatenation, these two methods contain parameters that are specific to the random vector, and thus give the model the flexibility to ignore this information. In [4], instead of using concatenation, the random vector is added to the hidden state produced by the RNN encoder. While addition prevents having parameters that are specific to the random vector, this vector is first transformed by multiplication with a parameter matrix, and thus can again be zeroed out so as to remove the source of diversity, as we observe empirically in Section 4.1. In our experiments, we will refer to this method as RHP, for random hidden state perturbation.

The second category of stochastic methods introduces an additional CVAE between the RNN encoder and decoder. This allows them to learn a more meaningful transformation of the noise, combined with the conditioning variables, before passing the resulting information to the RNN decoder. In this context, [32] proposes to directly use the pose as conditioning variable. As will be shown in our experiments, while this approach is able to maintain some degree of diversity, albeit less than ours, it yields motions of lower quality because of its use of independent random vectors at each time step. In our experiments, we will refer to this method as LPP, for learned pose perturbation. In [6], an approach similar to that of [32] is proposed, but with one CVAE per limb. As such, this method suffers from the same discontinuity problem as [32, 22]. Finally, instead of perturbing the pose, the recent work [34] uses the RNN decoder hidden state as conditioning variable in the CVAE, concatenating it with the random vector. While this approach generates high-quality motions, it suffers from the fact that the CVAE decoder gives the model the flexibility to ignore the random vector. In the remainder of the paper, we will refer to this method as LHP, for learned hidden state perturbation.

Ultimately, both classes of methods suffer from the fact that they allow the model to ignore the random vector, thus relying entirely on the conditioning information to generate future poses. Here, we introduce an effective way to maintain the root of diversity by randomizing the combination of the random vector with the conditioning variable.

Figure 2. Mix-and-Match perturbation. (Top) Illustration of the Sampling operation (left) and of the Resampling one (right). Given a sampling rate and a vector length L, the Sampling operation samples a set of indices, say I; the remaining, unsampled indices form the complementary set. Then, given two L-dimensional vectors and the corresponding indices, the Resampling operation mixes the two vectors to form a new L-dimensional one. (Middle) Example of Mix-and-Match perturbation. (Bottom) Example of perturbation by concatenation, as in [34]. Note that, in Mix-and-Match perturbations, sampling is stochastic; the indices are sampled uniformly at random for each mini-batch. By contrast, in [34], sampling is deterministic, and the indices in I are fixed.

3. Proposed Method

In this section, we first present our Mix-and-Match approach to introducing diversity in CVAE-based motion prediction. We then describe the motion prediction architecture we used in our experiments and propose a novel evaluation metric to quantitatively measure the diversity and quality of generated motions.

3.1. Mix-and-Match Perturbation

The main limitation of prior work in the area of stochastic motion modeling, such as [32, 4, 34], lies in the way they fuse the random vector with the conditioning variable, i.e., RNN hidden state or pose, which causes the model to learn to ignore the randomness and solely exploit the deterministic conditioning information to generate motion [32, 4, 34]. To overcome this, we propose to make it harder for the model to decouple the random variable from the deterministic information. Specifically, we observe that the way the random variable and the conditioning one are combined in existing methods is deterministic. We therefore propose to make this process stochastic.

Similarly to [34], we propose to make use of the hidden state as the conditioning variable and generate a perturbed hidden state by combining a part of the original hidden state with the random vector. However, as illustrated in Fig. 2, instead of assigning predefined, deterministic indices to each piece of information, such as the first half for the hidden state and the second one for the random vector, we assign the values of the hidden state to random indices and the random vector to the complementary ones.

More specifically, as depicted in Fig. 2, a mix-and-match perturbation takes two vectors of size L as input, say the hidden state and the random vector z, and combines them in a stochastic manner. To this end, it relies on two operations. The first one, called Sampling, chooses indices uniformly at random among the L possible values, given a sampling rate. Let us denote by I the resulting set of indices; the unsampled indices form the complementary set. The second operation, called Resampling, then creates a new L-dimensional vector whose values at the indices in I are taken from the corresponding indices of the first input vector, and whose remaining values are taken from the complementary indices of the second input vector. Note that the second vector can also have a lower dimension, equal to the number of complementary indices, in which case its values are distributed among the remaining indices of the output vector.
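The two operations above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `sampling` and `resampling`, the argument `rate`, the rounding with `ceil`, and the toy vectors are all assumptions made for the example.

```python
import math
import random


def sampling(length, rate, rng):
    """Pick ceil(rate * length) distinct indices uniformly at random.

    Returns the sorted sampled indices and the sorted complementary ones.
    The ceil rounding is an assumption of this sketch.
    """
    count = math.ceil(rate * length)
    sampled = sorted(rng.sample(range(length), count))
    chosen = set(sampled)
    complement = [i for i in range(length) if i not in chosen]
    return sampled, complement


def resampling(first, second, sampled, complement):
    """Mix two equal-length vectors into a new vector of the same length."""
    mixed = [0.0] * len(first)
    for i in sampled:
        mixed[i] = first[i]   # values kept from the first vector
    for i in complement:
        mixed[i] = second[i]  # values taken from the second vector
    return mixed


rng = random.Random(0)
hidden = [1.0] * 8                               # stand-in for the hidden state
noise = [rng.gauss(0.0, 1.0) for _ in range(8)]  # stand-in for z
sampled, complement = sampling(len(hidden), 0.5, rng)
perturbed = resampling(hidden, noise, sampled, complement)
```

Calling `sampling` again yields a different index set, so the positions that carry noise change from one call to the next, which is the property the perturbation relies on.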

3.2. M&M Perturbation for Motion Prediction

Let us now describe the way we use our mix-and-match perturbation strategy for motion prediction. To this end, we first discuss the network we rely on during inference, and then explain our training strategy.

Inference. The high-level architecture we use at inference time is depicted in Fig. 3 (Top). It consists of an RNN encoder that takes t poses as input and outputs an L-dimensional hidden vector. A random portion of this hidden vector is then combined with a random vector via our mix-and-match perturbation strategy. The resulting L-dimensional output is passed through a small neural network (i.e., ResBlock2 in Fig. 3) that reduces its size, and then fused with the remaining portion of the hidden state. This, in turn, is passed through the VAE decoder to produce the final hidden state, from which the future poses are obtained via the RNN decoder.

Figure 3. Overview of our approach. (Top) Overview of the model during inference. Given past information and a random vector sampled from a Normal distribution, the model generates new motions. (Bottom) Overview of the model during training. We use a future pose autoencoder with a CVAE between the encoder and the decoder. The RNN encoder-decoder network mapping the past to the future then aims to generate good conditioning variables for the CVAE.

Training. During training, we aim to learn both the RNN parameters and the CVAE ones. Because the CVAE is an autoencoder, it needs to take as input information about future poses. To this end, we complement our inference architecture with an additional RNN future encoder, yielding the training architecture depicted in Fig. 3 (Bottom). Note that, in this architecture, we incorporate an additional mix-and-match perturbation that fuses the hidden state of the RNN past encoder with that of the RNN future encoder. This allows us to condition the VAE encoder in a manner similar to the decoder. Note that, for each mini-batch, we use the same set of sampled indices for all mix-and-match perturbation steps throughout the network. Furthermore, following the standard CVAE strategy, during training, the random vector z is sampled from a distribution whose mean and covariance matrix are produced by the CVAE encoder. This is done by the technique of [17], which computes z from this mean, this covariance matrix, and an auxiliary noise vector. Note that, during inference, we do not have access to x, and hence to this mean and covariance matrix; z is therefore sampled from the prior N(0, I).
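The sampling step used during training can be written compactly. The form below is the usual reparameterization for a Gaussian with diagonal covariance; the symbols \mu, \sigma and \epsilon, as well as the diagonal assumption, are choices made for this sketch and not notation taken from the text.

```latex
z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
```

Here \mu and \sigma would stand for the mean and the per-dimension standard deviation produced by the CVAE encoder, and \odot for the element-wise product. At inference time, z is instead drawn directly from \mathcal{N}(0, I).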

To learn the parameters of our model, we rely on the availability of a dataset containing N videos depicting a human performing an action. Each video consists of a sequence of T poses, and each pose comprises J joints forming a skeleton. The pose of each joint is represented as a quaternion. Given this data, we train our model by minimizing a loss function that combines the three terms described below.

The first term in this loss compares the output of the network with the ground-truth motion using the squared loss. That is, it penalizes the squared difference between the predicted 4D quaternion for joint j at time k in sample i and the corresponding ground-truth one. The main weakness of this loss is that it treats all joints equally. However, when working with angles, some joints have a much larger influence on the pose than others. For example, because of the kinematic chain, the pose of the shoulder affects that of the rest of the arm, whereas the pose of the wrists has only a minor effect.

To take this into account, we define our second loss term as the error in 3D space. That is, it compares the predicted 3D position of joint j at time k in sample i with the corresponding ground-truth one. These 3D positions can be computed using forward kinematics, as in [27, 26]. Note that, to compute this loss, we first perform a global alignment of the predicted pose and the ground-truth one by rotating the root joint to face [0, 0, 0].
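One way to write the two reconstruction terms described above is given below. This is a sketch under assumptions: the names \mathcal{L}_{\mathrm{rot}} and \mathcal{L}_{\mathrm{skl}}, the symbols \hat{q}, q, \hat{p}, p, the 1/N normalization, and the use of a squared norm for the 3D term are all chosen here for illustration; only the indices i (sample), k (time) and j (joint) come from the text.

```latex
\mathcal{L}_{\mathrm{rot}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{k} \sum_{j=1}^{J}
    \left\| \hat{q}_{i,k,j} - q_{i,k,j} \right\|^{2},
\qquad
\mathcal{L}_{\mathrm{skl}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{k} \sum_{j=1}^{J}
    \left\| \hat{p}_{i,k,j} - p_{i,k,j} \right\|^{2}
```

In this notation, \hat{q}_{i,k,j} and q_{i,k,j} would be the predicted and ground-truth quaternions, \hat{p}_{i,k,j} and p_{i,k,j} the predicted and ground-truth 3D joint positions obtained by forward kinematics, and the sum over k runs over the predicted time steps.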

Finally, following standard practice in training VAEs, we define our third loss term as the KL divergence between the distribution produced by the CVAE encoder and the prior.

In practice, since our VAE appears within a recurrent model, we weigh this term by a function corresponding to the KL annealing weight of [5]. We start from a low weight, forcing the model to encode as much information in z as possible, and gradually increase it, following a logistic curve.
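A logistic annealing schedule of this kind can be sketched as follows. The function name `kl_weight` and the values of `midpoint` and `steepness` are hypothetical choices for the example; the text does not specify them.

```python
import math


def kl_weight(step, midpoint=10000, steepness=0.001):
    """Logistic weight in (0, 1) that grows with the training step.

    `midpoint` is the step at which the weight reaches 0.5 and `steepness`
    controls how fast it rises; both values are illustrative assumptions.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (step - midpoint)))


# Weight at the start, middle and end of a hypothetical 20K-step schedule.
early, middle, late = kl_weight(0), kl_weight(10000), kl_weight(20000)
```

During training, the KL term would be multiplied by `kl_weight(step)` before being added to the reconstruction terms, so that it contributes almost nothing at first and is close to fully weighted later on.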

Curriculum Learning of Variation. The parameter of our mix-and-match perturbation scheme determines a trade-off between stochasticity and motion quality. The larger its value, the larger the portion of the original hidden state that will be perturbed. Thus, the model incorporates more randomness and less information from the original hidden state. As such, given a large value, it becomes harder for the model to deliver motion information from the observation to the future representation since a large portion of the hidden state is changing randomly. In particular, we observed that training becomes unstable if we use a large value from the beginning, with the motion-related loss terms fluctuating while the prior loss quickly converges to zero.

Figure 4. Example of curriculum perturbation of the hidden state. At the beginning of training, the perturbation occurs in a deterministic portion of the hidden state. As training progresses, we gradually, and randomly, spread the perturbation to the rest of the hidden state. This continues until the indices to perturb are uniformly randomly sampled.

To overcome this while still enabling the use of sufficiently large values of this parameter to achieve high diversity, we introduce the curriculum learning strategy depicted in Fig. 4. In essence, we initially select indices in a deterministic manner and gradually increase the randomness of these indices as training progresses. More specifically, given a set of indices, we replace c indices from the sampled ones with the corresponding ones from the remaining indices. Starting from c = 0, we gradually increase c to the point where all indices are sampled uniformly at random. More details, including the pseudocode of this approach, are provided in the supplementary material. This strategy helps the motion decoder to initially learn and incorporate information about the observations (as in [34]), yet, in the long run, still prevents it from ignoring the random vector.
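The curriculum above can be read in several ways, and the authors' pseudocode is in the supplementary material, which is not reproduced here. The sketch below is one possible interpretation, not their algorithm: the function name `curriculum_indices`, its arguments, and the choice of the first `count` positions as the deterministic starting set are all assumptions.

```python
import random


def curriculum_indices(length, count, c, rng):
    """Select `count` indices out of `length`, with `c` of them randomized.

    With c = 0 the result is the fixed set {0, ..., count - 1}. With
    c = count every index is drawn uniformly at random from all positions.
    """
    c = max(0, min(c, count))
    base = list(range(count))           # deterministic starting selection
    kept = rng.sample(base, count - c)  # deterministic indices that survive
    kept_set = set(kept)
    pool = [i for i in range(length) if i not in kept_set]
    drawn = rng.sample(pool, c)         # c indices drawn uniformly from the rest
    return sorted(kept + drawn)


rng = random.Random(0)
start = curriculum_indices(10, 4, 0, rng)   # fully deterministic
middle = curriculum_indices(10, 4, 2, rng)  # half deterministic, half random
end = curriculum_indices(10, 4, 4, rng)     # fully random
```

Increasing `c` over the course of training moves the selection smoothly from the fixed layout used by concatenation-based methods to uniformly random indices.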

3.3. Quality and Diversity Metrics

When dealing with multiple plausible motions, or in general diverse solutions to a problem, evaluation is a challenge. The standard metrics used for deterministic motion prediction models are ill-suited to this task, because they typically compare the predictions to the ground truth, thus inherently penalizing diversity. For multiple motions, two aspects are important: the diversity and the quality, or realism, of each individual motion. Prior work typically evaluates these aspects via human judgement. While human evaluation is highly valuable, and we will also report human results, it is very costly and time-consuming. Here, we therefore introduce two metrics that facilitate the quantitative evaluation of both quality and diversity.

To measure the quality of generated motions, we propose to rely on a binary classifier trained to discriminate real (ground-truth) samples from fake (generated) ones. The accuracy of this classifier on the test set is thus inversely related to the quality of the generated motions. In other words, high-quality motions are those that are not distinguishable from real ones. Note that we do not advocate for adversarial training of our approach. That is, we do not define a loss based on this classifier when training our model.

To measure the diversity of the generated motions, a naive approach would consist of relying on the distance between the generated motion and a reference one. However, generating identical motions that are all far from the reference one would then yield a high value, while not reflecting diversity. To prevent this, we propose to make use of the average distance between all pairs of generated motions.
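The difference between the naive measure and the pairwise one can be illustrated with a small example. This is a sketch assuming each motion is flattened into a single vector and compared with the Euclidean distance; the function names and toy data are made up for the illustration.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two flattened motions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_to_reference(motions, reference):
    """Naive measure: mean distance of the generated motions to a reference."""
    return sum(euclidean(m, reference) for m in motions) / len(motions)


def diversity(motions):
    """Pairwise measure: mean distance over all pairs of generated motions."""
    pairs = list(combinations(motions, 2))
    if not pairs:
        return 0.0
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


# Three identical motions, all far from the reference.
identical = [[10.0, 0.0], [10.0, 0.0], [10.0, 0.0]]
naive_score = distance_to_reference(identical, [0.0, 0.0])
pairwise_score = diversity(identical)
```

The naive score is large even though the three motions are the same, whereas the pairwise score is zero, which matches the intended notion of diversity.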

4. Experiments

Let us now evaluate the effectiveness of our approach at generating multiple plausible motions. To this end, we use Human3.6M [14], the largest publicly available motion capture dataset. Below, we first exploit the metrics introduced in Section 3.3 to compare the quality and diversity of the results of our approach with those obtained by state-of-the-art methods that produce multiple motions [34, 32, 4]. We then compare our results to the state-of-the-art deterministic motion prediction techniques for long-term motion prediction using standard metrics.

Implementation details. The motion encoders and decoders in our model are single-layer GRU [7] networks, comprising 1024 hidden units each. For the decoders, we use a teacher forcing technique [33] to decode motion. At each time-step, the network uses the ground-truth pose as input with a given probability, and its own output at the previous time-step otherwise. We decrease this probability linearly at each training epoch such that, after a certain number of epochs, the model becomes completely autoregressive, i.e., uses only its own output as input to the next time-step. We train our model on a single GPU with the Adam optimizer [16] for 100K iterations. We use a learning rate of 0.001 and a mini-batch size of 64. To avoid exploding gradients, we use the gradient-clipping technique of [24] for all layers in the network. We implemented our model using the PyTorch framework of [25].
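The teacher forcing schedule described above can be sketched as follows. The starting probability of 1.0 and the 50-epoch decay horizon are hypothetical values chosen for the example, as are the function names; the text only states that the probability decreases linearly until the model is fully autoregressive.

```python
import random


def teacher_forcing_prob(epoch, start=1.0, decay_epochs=50):
    """Probability of feeding the ground-truth pose, decayed linearly to zero.

    `start` and `decay_epochs` are illustrative assumptions.
    """
    return max(0.0, start * (1.0 - epoch / decay_epochs))


def next_input(own_output, ground_truth, prob, rng):
    """Pick the decoder input for the next time-step."""
    return ground_truth if rng.random() < prob else own_output


rng = random.Random(0)
first_epoch_input = next_input("own", "gt", teacher_forcing_prob(0), rng)
late_epoch_input = next_input("own", "gt", teacher_forcing_prob(60), rng)
```

Once the probability reaches zero, `next_input` always returns the model's own previous output, which corresponds to the fully autoregressive regime.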

4.1. Evaluating Quality and Diversity

Quantitative evaluation of a qualitative task is very challenging. While the ideal case is reporting the (log-)likelihood on a held-out set of samples, this is not possible in (non-probabilistic) decoder-based generative models. An alternative is to use non-parametric kernel density estimates (KDE), computed from samples only; however, KDE is only well-suited to very low-dimensional data spaces. Evaluating against one ground-truth motion (i.e., one sample from a multi-modal distribution) can lead to a high score for one sample while penalizing other plausible modes. This behaviour is undesirable since it cannot differentiate a multi-modal solution from a good but uni-modal one. Note that there exist some metrics [34] to evaluate motions; however, they do not reflect the quality of a prediction, but how likely ground-truth future motions are under the given model. Moreover, as discussed, the metrics in [34] only evaluate quality given one single ground truth. While the ground truth has high quality, there exist multiple high-quality continuations of an observation, which our metric accounts for. As discussed in Section 3.3, we evaluate both the quality and diversity of the predicted motions. Note that these two metrics should be considered together, since each one taken separately does not provide a complete picture of how well a model can predict multiple plausible future motions. For example, a model can generate diverse but unnatural motions, or, conversely, realistic but identical motions.

Figure 5. Architecture of the quality binary classifier.

We compare our Mix-and-Match approach with the different means of imposing variation in motion prediction discussed in Section 2, i.e., concatenating the hidden state to a learned latent variable (LHP) [34], concatenating the pose to a learned latent variable at each time-step (LPP) [32], and adding a (transformed) random noise to the hidden state (RHP) [4]. For the comparison to be fair, we use 16 frames (i.e., 640ms) as observation to generate the next 60 frames (i.e., 2.4sec) for all baselines. All models are trained with the same motion representation, backbone network, and losses, except for RHP, which cannot make use of the KL divergence term.

To evaluate quality, as discussed in Section 3.3, we use a recurrent binary classifier whose task is to determine whether a sample comes from the ground-truth data or was generated by the model. As depicted in Fig. 5, the model is based on a single-layer GRU network with 1024 hidden units to process the motion, followed by a three-layer fully connected network (with 512, 128 and 1 units, respectively) with ReLU non-linearity in between and a sigmoid non-linearity for binary classification. We train such a classifier for each method, using 25K samples generated at different training steps together with 25K real samples, forming a binary dataset of 50K motions for each method. We use stochastic gradient descent for 5K iterations, with a mini-batch size of 256, a learning rate of 0.01 and a momentum of 0.9. To evaluate diversity, as discussed in Section 3.3, we compute the mean Euclidean distance from each motion to all other motions when generating K = 50 motions. Furthermore, we also performed a human evaluation to measure the quality of the motions generated by each method. To this end, we asked eight users to rate the quality of 50 motions generated by each method, for a total of 200 motions. The ratings were defined on a scale of 1-5, 1 representing a low-quality motion and 5 a high-quality, realistic one. We then scaled the values to the range 0-50 to make them comparable with those of the binary classifier.

Figure 6. Quality and diversity evaluation. Our approach outperforms the baselines in terms of diversity while preserving a high quality, especially late in training.

The results of the metrics of Section 3.3 are provided in Fig. 6 and those of the human evaluation in Fig. 7. Below, we analyze the results of the different models.

LHP. As can be seen from Fig. 6, LHP tends to ignore the random variable z, thus ignoring the root of variation. As a consequence, it achieves a low diversity, much lower than ours, but produces samples of high quality, albeit almost identical. Note that this decrease in diversity occurs after 16K iterations, indicating that the model takes time to identify the part of the hidden state that contains the randomness. Nevertheless, at iteration 16K, prediction quality is low, and thus one could not simply stop training at this stage. Note that the lack of diversity of LHP is also evidenced by Fig. 1. To further confirm it, we performed an additional experiment where, at test time, we sampled each element of the random vector independently from N(50, 50) instead of from the prior N(0, I). This led to neither loss of quality nor increase of diversity of the generated motions. As can be verified in Fig. 7, where LHP appears in a region of high quality but low diversity, the results of human evaluation match those of our classifier-based quality metric.

Figure 7. Human (H) and classifier-based (C) evaluation of quality for different methods. We plot diversity vs quality. A good method should fall into the top-right part of the plot, i.e., have high quality and diversity. Only our approach, for both human and classifier-based evaluation, satisfies this criterion. Real motions (blue circle) are deterministic, i.e., one future per observation, and have 0 diversity. However, their quality is optimal, i.e., 50%. Note that the human and classifier-based results depict the same behavior.

RHP. As for LHP, Fig. 6 evidences the limited diversity of the motions produced by RHP despite its use of random noise during inference. Note that the authors of [4] mentioned in their paper that the random noise was added to the hidden state. Only by studying their publicly available code did we understand the precise way this combination was done. In fact, the addition relies on a parametric, linear transformation of the noise vector. That is, the perturbed hidden state is obtained by adding to the hidden state the noise vector z multiplied by a learned parameter matrix. Because the parameters are learnt, the model has the flexibility to ignore z, which causes the behavior observed in Figs. 1 and 6. Note that the authors of [4] acknowledged that, despite their best efforts, they noticed very little variation between predictions obtained with different z values. Since the perturbation is ignored, however, the quality of the generated motions is high. By depicting RHP in a region of high quality but low diversity, the human evaluation results in Fig. 7 again match those of our classifier-based quality metric.
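The argument above, that a learned linear map lets the model discard the noise, can be made concrete with a toy example. This is an illustrative sketch and not the code of [4]: the function name `additive_perturbation`, the list-based vectors, and the all-zero matrix are assumptions used to show the degenerate case.

```python
def additive_perturbation(hidden, noise, weight):
    """Return hidden + weight @ noise, using plain lists for clarity."""
    return [
        h + sum(w * n for w, n in zip(row, noise))
        for h, row in zip(hidden, weight)
    ]


hidden = [0.5, -1.0, 2.0]
noise_a = [1.0, -2.0]
noise_b = [-3.0, 4.0]

# A learned matrix that has collapsed to zero removes the effect of the noise.
zero_weight = [[0.0, 0.0] for _ in hidden]
out_a = additive_perturbation(hidden, noise_a, zero_weight)
out_b = additive_perturbation(hidden, noise_b, zero_weight)
```

Both outputs equal the unperturbed hidden state, so two different noise vectors lead to the same decoder input. Nothing in the training objective prevents the matrix from drifting toward this solution.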

LPP. As can be seen in Fig. 6, LPP produces motions with higher diversity than LHP and RHP, but of much lower quality. The main reason behind this is that the random vectors that are concatenated to the poses at each time-step are sampled independently of each other, which translates to discontinuities in the generated motions. This problem might be mitigated by sampling the noise in a time-dependent, autoregressive manner, as in [19] for video generation. Doing so, however, goes beyond the scope of our analysis. When it comes to human evaluation, Fig. 7 further confirms that LPP’s results lie in a low-quality, medium-diversity region.

Ours. The goal of our mix-and-match perturbations was to make it hard for the model to decouple the random vector from the deterministic hidden state information. The success of our approach is confirmed by Fig. 6. Our model generates diverse motions, even after a long training time, and the quality of these motions is high. While this quality is slightly lower than that of LHP and RHP when looking at our classifier-based metric, it is rated higher by humans, as can be verified from Fig. 7. We believe that this discrepancy is related to the binary classifier memorizing the ground-truth motions and thus not generalizing to the large diversity of motions generated by our model. As such, human evaluation still nicely complements our less expensive automatic one. Altogether, these results confirm the ability of our approach to generate highly diverse yet realistic motions. In Fig. 8, we further evidence this qualitatively by providing samples obtained by our approach for four different input sequences, as well as samples from the baselines. Note that, for each input sequence, we produce large, yet natural variations of future poses.

Note that our approach depends on a parameter that defines the amount of randomness used in our mix-and-match perturbations. In Fig. 9, we report the quality and diversity of our results when varying this parameter in {0.1, . . . , 0.9}. These plots show a trade-off between quality and diversity. This is to be expected since, when aiming to increase diversity, the resulting motions eventually become unrealistic. Nevertheless, our results can be seen to be highly diverse and of high quality for a wide range of values. Note that this is the only model-related hyper-parameter of Mix-and-Match. The quality and diversity metrics are monotonic functions of it; thus, one can choose a proper value given a task. Note that, with a different setting, our method still achieves a state-of-the-art diversity of 2.25 with a higher quality of 45.0%; however, for the sake of fair comparison, we use the default value.

4.2. Comparison with the State of the Art

Figure 8. Qualitative visualization of diverse motions generated by our model and by the baselines. Each block of columns shows the results for one observation (the first three poses of each sequence). The first row corresponds to the ground-truth motion and the other rows illustrate multiple motions generated by each method (better seen when zoomed in).

Figure 9. Quality and diversity of the motions generated with our approach as a function of the perturbation parameter. Note that, at some settings, diversity increases significantly, but this diversity is the result of poor-quality motions.

We now compare the results of our approach with those obtained using state-of-the-art deterministic motion prediction methods [23, 15, 13, 27, 26, 10, 12] for long-term motion prediction, i.e., up to 1000ms. For this experiment, following previous work [10, 26, 27, 23, 12], we model velocity instead of pose, and do the same for the stochastic baselines. This is achieved by adding a residual connection to the motion decoder. We then report the standard metric, i.e., the Euclidean distance between the predicted and ground-truth Euler angles. To evaluate this metric for our method, which generates multiple, diverse predictions, we use the best sample among the K generated ones, with K = 50 for both the stochastic baselines and our approach (i.e., the S-MSE metric [34]). In other words, we aim to show that, among the K motions we generate, at least one is close to the ground truth. As can be seen in the top portion of Fig. 10, our approach yields errors comparable to the best-performing baselines, despite their use of more complex architectures and strong losses, such as the adversarial loss used in [12]. Note that, unlike some of the baselines [12, 15, 23], our model does not require knowing the action class during either training or inference.
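As an illustration of modeling velocity through a residual connection, the following minimal sketch rolls out future poses by adding a predicted offset to the previous pose at every step. The function `decode_with_residual` and the callable `predict_velocity` are hypothetical stand-ins for the motion decoder, not the implementation used in the experiments.

```python
import numpy as np

def decode_with_residual(last_pose, predict_velocity, num_steps):
    """Roll out future poses, adding a predicted velocity to the previous pose.

    last_pose:        (D,) array, the last observed pose (e.g., joint angles).
    predict_velocity: hypothetical callable mapping a pose to a (D,) offset;
                      it stands in for one step of a learned decoder.
    num_steps:        number of future poses to generate.
    """
    pose = np.asarray(last_pose, dtype=float)
    poses = []
    for _ in range(num_steps):
        velocity = predict_velocity(pose)  # decoder output is a change in pose
        pose = pose + velocity             # residual connection: previous pose + change
        poses.append(pose)
    return np.stack(poses)                 # shape (num_steps, D)

# Toy usage with a constant-velocity "decoder" (an assumption for illustration).
constant_velocity = lambda pose: np.full_like(pose, 0.1)
rollout = decode_with_residual(np.zeros(3), constant_velocity, num_steps=4)
```

With this formulation, the decoder only has to predict the change between consecutive frames, while the previous pose is carried over unchanged by the skip connection.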

In the bottom portion of Fig. 10, we further compare our best estimate with that of the other stochastic baselines, LPH, RHP, and LPP, using the best of K motions in all cases. Note that, by providing better diversity, our approach outperforms these baselines.
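The best-of-K evaluation used in this comparison can be sketched as follows. This is a minimal illustration that assumes the best sample is the one with the lowest mean per-frame Euclidean error; the function name `best_of_k_error` and the array layout are assumptions rather than the exact evaluation code.

```python
import numpy as np

def best_of_k_error(samples, ground_truth):
    """Select the generated motion closest to the ground truth.

    samples:      (K, T, D) array of K generated motions, T frames, D angles.
    ground_truth: (T, D) array with the ground-truth motion.
    Returns the index of the best sample and its per-frame error, shape (T,).
    """
    samples = np.asarray(samples, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    # Euclidean distance between predicted and ground-truth angles per frame.
    per_frame = np.linalg.norm(samples - ground_truth[None], axis=-1)  # (K, T)
    # Assumption: rank samples by their mean error over all frames.
    mean_error = per_frame.mean(axis=1)                                # (K,)
    best = int(np.argmin(mean_error))
    return best, per_frame[best]

# Toy usage: three candidate motions, the second one is closest to the target.
gt = np.zeros((2, 3))
candidates = np.stack([np.ones((2, 3)), np.full((2, 3), 0.1), np.full((2, 3), 2.0)])
best_index, best_error = best_of_k_error(candidates, gt)
```

Because the minimum is taken over all K samples, a method that covers more modes of the future motion has a better chance of placing one sample near the ground truth.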

5. Evaluating the Effect of K

In the main paper, we used K = 50 to compare our approach with the state-of-the-art deterministic and stochastic baselines. Here, we provide an ablation study on the effect of K. To this end, we provide results when using K = 1 to K = 500. In Fig. 11, we plot the results with K = 50 as bold black lines, and the shaded area covers the results obtained with K = 1 to K = 500. While smaller values of K yield large errors, the difference between K = 50 and K = 500 is very small (barely visible in most cases).
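The behavior described above follows from the fact that a best-of-K error can only decrease or stay the same as K grows. The sketch below illustrates this with synthetic per-sample errors (random numbers, not results from the paper); `best_error_vs_k` is a hypothetical helper name.

```python
import numpy as np

def best_error_vs_k(sample_errors):
    """Return the best-of-k error for k = 1..K_max.

    sample_errors: (K_max,) array, error of each generated sample in the
                   order the samples were drawn.
    """
    # Running minimum: entry k-1 is the lowest error among the first k samples.
    return np.minimum.accumulate(np.asarray(sample_errors, dtype=float))

# Synthetic errors for 500 samples (an assumption, for illustration only).
rng = np.random.default_rng(0)
errors = rng.uniform(0.5, 2.0, size=500)
curve = best_error_vs_k(errors)

gap_50_vs_500 = curve[49] - curve[499]  # how much is gained beyond K = 50
```

The curve is non-increasing by construction, so the useful question is how quickly it flattens, which is what the comparison between K = 50 and K = 500 measures.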

6. Conclusion

In this paper, we have proposed an effective way of perturbing the hidden state of an RNN such that it becomes capable of learning the multiple modes of human motion. Our evaluation of quality and diversity, based on both new quantitative metrics and human judgment, has shown that our approach outperforms the state-of-the-art stochastic methods. Generating diverse plausible motions given a short sequence of observations has many applications, especially when the motions are generated in an action-agnostic manner. For instance, our model can be used for human action forecasting [1, 28, 30, 29, 2], where one seeks to anticipate the action as early as possible. It can also be employed for motion inpainting, where, given partial observations, one aims to generate multiple in-between solutions. In the future, we will therefore investigate the use of our approach in such applications.

Figure 10. Mean angle error (MAE) for the Human3.6M actions commonly used to report long-term prediction results. (Top) We compare the best of K motions generated by our approach with the state-of-the-art deterministic baselines. Note that, while our approach exploits knowledge of the action during neither training nor inference (unlike some of the baselines), it performs on par with the state-of-the-art deterministic baselines. (Bottom) We compare our approach with the state-of-the-art stochastic baselines. Note that the results for each stochastic baseline were obtained from the best of K generated motions.

Figure 11. Effect of K on the MAE for actions of the Human3.6M dataset. The bold black line is the best of K = 50 motions, and the shaded area indicates the region between the best of K = 1 and the best of K = 500.

References

[1] M. S. Aliakbarian, F. Saleh, B. Fernando, M. Salzmann, L. Petersson, and L. Andersson. Deep action- and context-aware sequence learning for activity recognition and anticipation. arXiv preprint arXiv:1611.05520, 2016. 9

[2] M. S. Aliakbarian, F. S. Saleh, M. Salzmann, B. Fernando, L. Petersson, and L. Andersson. Viena: A driving anticipation dataset. In Asian Conference on Computer Vision, pages 449–466. Springer, 2018. 9

[3] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. Cvae-gan: fine-grained image generation through asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision, pages 2745–2754, 2017. 1

[4] E. Barsoum, J. Kender, and Z. Liu. Hp-gan: Probabilistic 3d human motion prediction via gan. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1418–1427, 2018. 1, 2, 3, 5, 6, 7

[5] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015. 2, 4

[6] J. Bütepage, H. Kjellström, and D. Kragic. Anticipating many futures: Online human motion prediction and generation for human-robot interaction. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–9. IEEE, 2018. 2

[7] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014. 5

[8] J. Engel, M. Hoffman, and A. Roberts. Latent constraints: Learning to generate conditionally from unconditional generative models. arXiv preprint arXiv:1711.05772, 2017. 1

[9] P. Esser, E. Sutter, and B. Ommer. A variational u-net for conditional appearance and shape generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8857–8866, 2018. 1

[10] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision, pages 4346–4354, 2015. 2, 7, 9

[11] P. Ghosh, J. Song, E. Aksan, and O. Hilliges. Learning human motion models for long-term predictions. In 2017 International Conference on 3D Vision (3DV), pages 458–466. IEEE, 2017. 2, 9

[12] L.-Y. Gui, Y.-X. Wang, X. Liang, and J. M. Moura. Adversarial geometry-aware human motion prediction. In Proceedings of the European Conference on Computer Vision (ECCV), pages 786–803, 2018. 1, 2, 7, 8, 9

[13] L.-Y. Gui, Y.-X. Wang, D. Ramanan, and J. M. Moura. Few-shot human motion prediction via meta-learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 432–450, 2018. 2, 7

[14] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, jul 2014. 5

[15] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena. Structural-rnn: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5308–5317, 2016. 2, 7, 8, 9

[16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 5

[17] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. 4

[18] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In Advances in neural information processing systems, pages 2539–2547, 2015. 1

[19] M. Kumar, M. Babaeizadeh, D. Erhan, C. Finn, S. Levine, L. Dinh, and D. Kingma. Videoflow: A flow-based generative model for video. arXiv preprint arXiv:1903.01434, 2019. 7

[20] J. N. Kundu, M. Gor, and R. V. Babu. Bihmp-gan: Bidirectional 3d human motion prediction gan. arXiv preprint arXiv:1812.02591, 2018. 1, 2

[21] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015. 1

[22] X. Lin and M. R. Amer. Human motion modeling using dv-gans. arXiv preprint arXiv:1804.10652, 2018. 1, 2

[23] J. Martinez, M. J. Black, and J. Romero. On human motion prediction using recurrent neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4674–4683. IEEE, 2017. 1, 2, 7, 8, 9

[24] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318, 2013. 5

[25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017. 5

[26] D. Pavllo, C. Feichtenhofer, M. Auli, and D. Grangier. Modeling human motion with quaternion-based neural networks. arXiv preprint arXiv:1901.07677, 2019. 1, 2, 4, 7

[27] D. Pavllo, D. Grangier, and M. Auli. Quaternet: A quaternion-based recurrent model for human motion. arXiv preprint arXiv:1805.06485, 2018. 1, 2, 4, 7

[28] C. Rodriguez, B. Fernando, and H. Li. Action anticipation by predicting future dynamic images. In European Conference on Computer Vision, pages 89–105. Springer, 2018. 9

[29] M. Sadegh Aliakbarian, F. Sadat Saleh, M. Salzmann, B. Fernando, L. Petersson, and L. Andersson. Encouraging lstms to anticipate actions very early. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017. 9

[30] Y. Shi, B. Fernando, and R. Hartley. Action anticipation with rbf kernelized feature mapping rnn. In Proceedings of the European Conference on Computer Vision (ECCV), pages 301–317, 2018. 9

[31] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pages 3483–3491, 2015. 2

[32] J. Walker, K. Marino, A. Gupta, and M. Hebert. The pose knows: Video forecasting by generating pose futures. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 3352–3361. IEEE, 2017. 1, 2, 3, 5, 6

[33] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989. 5

[34] X. Yan, A. Rastogi, R. Villegas, K. Sunkavalli, E. Shechtman, S. Hadap, E. Yumer, and H. Lee. Mt-vae: Learning motion transformations to generate multimodal human dynamics. In European Conference on Computer Vision, pages 276–293. Springer, 2018. 1, 2, 3, 5, 6, 7

[35] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016. 1
