With the rapid growth of systems equipped with powerful sensing devices, it is important to develop algorithms that are capable of controlling systems from high-dimensional raw sensory inputs (e.g., pixel input). However, scaling stochastic optimal control and reinforcement learning (RL) methods to high-dimensional unknown environments remains an open challenge. To tackle this problem, a common approach is to employ various heuristics to embed the high-dimensional observations into a lower-dimensional la- tent space (Finn et al., 2016; Kurutach et al., 2018; Kaiser et al., 2019). The class of Learning Controllable Embedding (LCE) algorithms (Watter et al., 2015; Banijamali et al., 2018; Hafner et al., 2018; Zhang et al., 2019; Levine et al., 2020) further supplies the latent space with a latent dynamics model to enable planning directly in latent space.
Our present work focuses on this class of LCE algorithms and takes a critical look at the prevailing heuristic used to learn the controllable latent space: next-observation prediction. To ensure that the learned embedding and latent dynamics are predictive of future observations, existing LCE algorithms introduce a decoder during training and explicitly perform next-observation prediction by decoding the predicted latent states back into the observation space. Despite its empirical success (Watter et al., 2015; Banijamali et al., 2018; Zhang et al., 2019; Levine et al., 2020), this approach suffers from two critical drawbacks that motivate the search for better alternatives: (i) it requires the model to handle the challenging task of high-dimensional prediction; (ii) it does so in a parameter-inefficient manner—requiring the use of a decoder that is discarded during control.
To address these concerns, we propose a novel information-theoretic LCE approach for learning a controllable latent space. Our contributions are as follows.
1. We characterize the quality of the learned embedding through the lens of predictive suboptimality and show that predictive coding (van den Oord et al., 2018) is sufficient for minimizing predictive suboptimality.
2. Based on predictive coding, we propose a simpler and parameter-efficient model that jointly learns a controllable latent space and latent dynamics specifically amenable to locally-linear controllers.
3. We conduct detailed analyses and empirically characterize how model ablation impacts the learned latent space and control performance.
4. Finally, we show that our method out-performs state-of-the-art LCE algorithms on several benchmark tasks, demonstrating predictive coding as a superior alternative to next-observation prediction when learning controllable embeddings.
We are interested in controlling non-linear dynamical systems of the form , over the horizon T. In this definition, and are the state and action of the system at time step is the Gaussian system noise, and is the smooth non-linear system dynamics. We are particularly interested in the scenario in which we only have access to the high-dimensional observation of each state ). This scenario has application in many real-world problems, such as visual-servoing (Espiau et al., 1992), in which we only observe high-dimensional images of the environment and not its underlying state. We further assume that the high-dimensional observations x have been selected such that for any arbitrary control sequence , the observation se- quence is generated by a stationary Markov process, i.e.,
A common approach to control the non-linear dynamical system described above is to solve the following stochastic optimal control (SOC) problem (Shapiro et al., 2009) that minimizes the expected cumulative cost
where is the immediate cost function and is the observation at the initial state . Throughout the paper, we assume that all immediate cost functions are bounded by and Lipschitz with constant One form of the immediate cost function that is particularly common in goal tracking problems is where is the observation at the goal state.
The application of SOC to high-dimensional environments, however, faces several challenges. Since the observations x are high-dimensional and the dynamics in the observation space is unknown, solving (SOC1) is often intractable as it requires solving two difficult problems: high-dimensional dynamics estimation and high-dimensional optimal control. To address these issues, the Learning Controllable Embedding (LCE) framework proposes to learn a lowdimensional latent (embedding) space and a latent state dynamics, and then perform optimal control in the latent space instead. This framework includes algorithms such as E2C (Watter et al., 2015), RCE (Ban- ijamali et al., 2018), SOLAR (Zhang et al., 2019), and PCC (Levine et al., 2020). By learning a stochastic encoder and latent dynamics ,
LCE algorithms defines a new SOC in the latent space,
where is sampled from the distribution , and is the latent cost function. By solving the much lower-dimensional (SOC2), the resulting optimal control is then applied as a feasible solution to (SOC1) and incurs a suboptimality that depends on the choice of the encoder E and latent dynamics
Although Levine et al. (2020) provided an initial theoretical characterization of this SOC suboptimality, the selection of E and F ultimately remains heuristically-driven in all previous works. These heuristics vary across different studies (Levine et al., 2020; Banijamali et al., 2018; Watter et al., 2015; Zhang et al., 2019; Hafner et al., 2018), but the primary approach employed by the existing LCE algorithms is explicit next-observation prediction. By introducing a decoder , the composition is cast as an action-conditional latent variable model; the advances in latent variable modeling (Kingma & Welling, 2013; Rezende et al., 2014; Burda et al., 2015; Johnson et al., 2016; Sohn et al., 2015) are then leveraged to train E, F, and D to perform explicit next-observation prediction by maximizing a lower bound on the log-likelihood , over the dataset whose trajectories are drawn from
Next-observation prediction offers a natural way to learn a non-degenerate choice of E and F, and enjoys the merit of being empirically successful. However, it requires the introduction of a decoder D as nuisance parameter that only serves the auxiliary role of training the encoder E and latent dynamics F. The focus of our paper is whether E and F can be successfully selected via a decoder-free heuristic.
Existing methods instantiate (SOC2) by learning the encoder E and latent dynamics model F in conjunction with an auxiliary decoder D to explicitly perform next-observation prediction. The auxiliary decoder ensures that the learned representation can be used for next-observation prediction, and is discarded after the encoder and latent dynamics model are learned. Not only is this a parameter-inefficient procedure for learning (E, F), this approach also learns (E, F) by explicitly performing the challenging high-dimensional next-observation prediction. In this section, we propose an information-theoretic approach that can learn (E, F) without decoding and next-observation prediction.
Figure 1. Two high-level approaches to learn an E and F to instantiate (SOC2). One way is to explicitly introduce a decoder D and do next-observation prediction (left), whereas our method uses F as a variational device to train E via predictive coding (right).
3.1. Predictive Suboptimality of a Representation
Our approach exploits the observation that the sole purpose of the decoder is to ensure that the learned representation is good for next-observation prediction. In other words, the decoder is used to characterize the suboptimality of next-observation prediction when the prediction model is forced to rely on the learned representation. We refer to this concept as predictive suboptimality of the learned representation and formally define it as follows.
Definition 1. Let denote the data distribution. Given an encoder , 3 let denote the prediction model
where are expressive non-negative functions. We define the predictive suboptimality of a representa- tion induced by E as the best-case prediction loss
Importantly, the function should measure the compatibility of the triplet —but is only allowed to do so via the representations and . Thus, the behavior of the representation bottleneck plays a critical role in modulating the expressivity of the model q. If E is invertible, then q is a powerful prediction model; if E is a constant, then q can do no better than marginal density estimation of
While it is possible to minimize the predictive suboptimality of E by introducing the latent dynamics model F and de- coder D, and then performing next-observation prediction via , our key insight is that predictive suboptimality can be bounded by the following mutual information gap (see Appendix A.2 for proof).
Lemma 1. Let be the random variables associated with the data distribution . The predictive suboptimality is upper bounded by the
mutual information gap
Since is a constant and upper bounds by the data processing inequality, this means we can minimize the predictive suboptimality of E simply by maximizing the mutual information between the future latent state and the current latent state and action pair —a form of predictive coding. We denote this mutual information as a function of E. To maximize this quantity, we can then leverage the recent advances in variational mutual information approximation (van den Oord et al., 2018; Poole et al., 2019; Belghazi et al., 2018; Nguyen et al., 2010; Hjelm et al., 2018) to train the encoder in a decoder-free fashion.
3.2. Consistency in Prediction of the Next Latent State
A notable consequence of introducing the encoder E is that it can be paired with a latent cost function to define an alternative cost function in the observation space,
where z is sampled from This is particularly useful for high-dimensional SOC problems, where it is difficult to prescribe a meaningful cost function a priori in the observation space. For example, for goal tracking problems using visuosensory inputs, prescribing the cost function to be suffers from the uninformative nature of the 2-norm in high-dimensional pixel space (Beyer et al., 1999). In the absence of a prescribed c, a natural proxy for the unknown cost function is to replace it with and consider the new SOC problem,
Assuming (SOC1-E) to be the de facto SOC problem of interest, we wish to learn an F such that the optimal control in (SOC2) approximately solves (SOC1-E). One such consideration for the latent dynamics model would be to set F as the true latent dynamics induced by (p, E), and we refer to such F as the one that is consistent with (p, E).
Our main contribution in this section is to justify—from a control perspective—why selecting a consistent F with respect to (p, E) minimizes the suboptimality incurred from using (SOC2) as an approximation to (SOC1-E). The following lemma (see Appendix A.3 for proof) provides the suboptimality performance gap between the solutions of (SOC2) and (SOC1-E).
Lemma 2. For any given encoder E and latent dynamics F, let be the solution to (SOC1-E) and be a solution to (SOC2). Then, we have the following performance bound between the costs of the control signals
L
(1) where RCE
In Eq. 1, the expectation is over the state-action stationary distribution of the policy used to generate the training samples (uniformly random policy in this work), and U is the Lebesgue measure of are the probability over the next latent state . Based on Figure 1, we therefore interpret as the measure of discrepancy between the dynamics induced by (p, E) versus the latent dynamics model induced by (E, F). which we term the consistency regularizer. We note that while our resulting bound is similar to Lemma 2 in Levine et al. (2020), there are two key differences. First, our analysis makes explicit the assumption that the cost function c is not prescribed and thus replaced in practice with the proxy cost function based on the heuristically-learned encoder. Second, by making this assumption explicit, our bound is based on samples from the environment dynamics p instead of the next-observation prediction model dynamics as required in Levine et al. (2020).
By restricting the stochastic encoder E to be a distribution with fixed entropy (e.g., by fixing the variance if E is conditional Gaussian), the minimization of the consistency regularizer corresponds to maximizing the log-likelihood of F for predicting , given , under the dynamics induced by (p, E). This correspondence holds even in the limiting case of E being deterministic (e.g., fixing the variance to an arbitrarily small value). In other words, for (SOC2) to approximate (SOC1-E) well, we select F to be a good predictor of the true latent dynamics.
3.3. Suboptimality in Locally-Linear Control
In Section 3.2, we derived the suboptimality of using (SOC2) as a surrogate control objective for (SOC1-E), and showed that the suboptimality depends on the consistency of latent dynamics model F with respect to the true latent dynamics induced by (p, E).
We now shift our attention to the optimization of (SOC2) itself. Similar to previous works (Watter et al., 2015; Bani- jamali et al., 2018; Zhang et al., 2019; Levine et al., 2020), we shall specifically consider the class of locally-linear control (LLC) algorithms, e.g., iLQR (Li & Todorov, 2004), for solving (SOC2). The main idea in LLC algorithms is to compute an optimal action sequence by linearizing the dynamics around some nominal trajectory. This procedure implicitly assumes that the latent dynamics F has low curvature, so that local linearization via first-order Taylor expansion yields to a good linear approximation over a suf-ficiently large radius. As a result, the curvature of F will play an important role in the optimizability of (SOC2) via LLC algorithms.
Levine et al. (2020) analyzed the suboptimality incurred from applying LLC algorithms to (SOC2) as a function of the curvature of F. For self-containedness, we summarize their analysis as follows. We shall assume F to be a conditional Gaussian model with a mean prediction function . The curvature of can then be measured via
where is a tunable parameter that characterizes the radius of latent state-action space in which the latent dynamics model should have low curvature. Let be a LLC solution to (SOC2). Suppose the nominal latent state-action trajectory isfies the condition: , where is the optimal trajectory of (SOC2). Us- ing Eq. 29 of Levine et al. (2020), one can show that with probability , the LLC solution of (SOC2) has the following suboptimality performance gap when compared with the optimal cost of this problem using the solution
L
where
and X is the Lebesgue measure with respect to X. We therefore additionally constrain F to have low curvature so that it is amenable to the application of LLC algorithms.
Based on the analysis in Section 3, we identify three desiderata for guiding the selection of the encoder E and latent dynamics model F. We summarize them as follows: (i) predictive coding minimizes the predictive suboptimality of the encoder E; (ii) consistency of the latent dynamics model F with respective to (p, E) enables planning directly in the latent space; and (iii) low-curvature enables planning in latent space specifically using locally-linear controllers. We refer to these heuristics collectively as Predictive Coding-Consistency-Curvature (PC3). PC3 can be thought of as an information-theoretic extension of the Prediction-Consistency-Curvature (PCC) framework described by Levine et al. (2020)—differing primarily in the replacement of explicit next-observation prediction with predictive coding in the latent space.
In this section, we highlight some of the key design choices involved when instantiating PC3 in practice. In particular, we shall show how to leverage the CPC variational mutual information bound in a parameter-efficient manner and how to enforce the consistency of F with respect to (p, E) without destabilizing training.
4.1. Enforcing Predictive Codes
To estimate the mutual information , we employ contrastive predictive coding (CPC) proposed by van den Oord et al. (2018). We perform CPC by introducing a critic to construct the lower bound
where the expectation is over K i.i.d. samples of . Notice that the current latent state-action pair is specifically designated as the source of negative samples and used for the contrastive prediction of the next latent state . We then tie the critic f to our latent dynamics model F,
This particular design of the critic has two desirable properties. First, it exploits parameter-sharing to circumvent the instantiation of an auxiliary critic f. Second, it takes advantage of the property that an optimal critic for the lower bound in Eq. (2) is the true latent dynamics (Poole et al., 2019; Ma & Collins, 2018)—which we wish F to approximate. The resulting CPC objective is thus
which we denote as
4.2. Enforcing Consistency
Since the true latent dynamics is an optimal critic for the CPC bound, it is tempting to believe that optimizing (E, F) to maximize should be sufficient to encourage the learning of a latent dynamics model F that is consistent with the true latent dynamics induced by (p, E).
In this section, we show that it is easy to construct a simple counterexample illustrating the non-uniqueness of the true latent dynamics as an optimal critic—and that F may learn to be arbitrarily inconsistent with (p, E) while still maximizing under a fixed choice of E. Our simple counterexample proceeds as follows: let E be the identity function, let X = U = R, and let be a uniform distribution over the tuples Let be a conditional Gaussian distribution with learnable variance and mean function
where is a learnable parameter. By symmetry, the bound
In the denominator, the first term arises from the positive sample (e.g., (1, 1, 1)) whereas the second term arises from the negative sample (e.g., ). One way to maximize this bound would be to set spondingly, F would approach the true latent dynamics and precisely predict how transitions to an alternative procedure for maximizing this bound is to fix to any positive constant and let . In this scenario, F becomes an arbitrarily poor predictor of the underlying latent dynamics.
This counterexample highlights a simple but important characteristic of the CPC bound. In contrast to direct maximum likelihood training of using samples of from the true latent dynamics, the contrastive predictive training of the latent dynamics model simply ensures that relatively much higher value to the positive samples than to the negative samples. The fact that the CPC bound may be maximized without learning a consistent dynamics model F may be why previous work by Nachum et al. (2018) using CPC for representation learning in model-free RL chose not to perform model-based latent space control despite also learning an F as a variational artifact from their CPC bound.
Since our goal is to use F in (SOC2) for optimal control, it is critical that we ensure the latent dynamics model F indeed approximates the true latent dynamics. We therefore additionally train F via the maximum likelihood objective
However, naively optimizing (E, F) to maximize both and is unstable; whereas is geometry-invariant, is sensitive to non-volume preserving transformations of the latent space (Rezende & Mohamed, 2015; Dinh et al., 2016) and can increase arbitrarily simply by collapsing the latent space. To resolve this issue, we add Gaussian noise with fixed variance to the next-state encoding . Doing so yields the noise-perturbed objectives . The introduction of noise has two notable effects. First, it imposes an upper bound on the achievable log-likelihood
based on the entropy of the Gaussian noise. Second, is now a lower bound to the mutual information between and the noise-perturbed
Since the noise variance is fixed, can only be maximized by expanding the latent space. By tuning the noise variance as a hyperparameter, we can balance the latent space retraction encouraged by with the latent space expansion encouraged by and thus stabilize the learning of the latent space. For notational simplicity, we shall treat all subsequent mentions of and to mean their respective noise-perturbed variants, except in the specific ablation conditions where noise is explicitly removed (e.g., the “w/o ” condition in our experiments).
4.3. Enforcing Low Curvature
We measure the curvature of F by computing the first-order Taylor expansion error incurred when evaluating at
+
Levine et al. (2020) further proposes an amortized version of this objective to accelerate training when the latent dimensionality is large. However, since is relatively small in our benchmark tasks, our initial experimentation suggests amortization to have little wall-clock time impact on these tasks. Our overall objective is thus
which maximizes the CPC bound and consistency, while minimizing curvature.
In this section, we report a thorough ablation study on various components of PC3, as well as compare the performance of our proposed model6 with two state-of-the-art LCE baselines: PCC (Levine et al., 2020) and SOLAR (Zhang et al., 2019).7 The experiments are based on four image-based control benchmark domains: Planar System, Inverted Pendulum,8 Cartpole, and 3-Link Manipulator.
Data generation procedure: In PCC and PC3, each sample is a triplet , in which we (1) sample uniformly an underlying state and generate its corresponding observation , (2) sample uniformly an action , and (3) obtain the next state from the true dynamics and generate the corresponding observation . In SOLAR, each training sample is an episode , where T is the control horizon. We sample uniformly T actions from the action space, apply the dynamics T times from the initial state, and generate T corresponding observations.
Evaluation metric: We evaluate PC3 and the baselines in terms of control performance. For PC3 and PCC, we apply iLQR algorithm in the latent space with a quadratic cost, , where and are the encoded vectors of the current and goal observation, and . For SOLAR, we use their original local-inference-and-control algorithm.9 We report the percentage of time spent in the goal region in the underlying system (Levine et al., 2020).
5.1. Ablation Study
We characterize PC3 by ablating , and the noise added to . For each setting, we report the latent map size,10 , and the control performance. These statistics are averaged over 10 different models. All settings are run on Pendulum (Balance and Swing Up).
Consistency: In Table 1, we can see that when ted, the control performance drops. As discussed in Section 3.2, the latent dynamics model F performs poorly when not explicitly optimized for consistency. This is further demonstrated in Table 2, where we take a pretrained PC3 model, freeze the encoder, and retrain F to maximize either or . Despite both retrained models achieving similar scores, it is easy to see that training via results in much worse latent dynamics in terms of prediction.
Figure 2. Inverted pendulum representations. From left to right: PC3, w/o
Table 1. Ablation analysis. Percentage of steps spent in goal state. From top to bottom: full-model PC3, excluding consistency and latent noise, excluding consistency, excluding latent noise, excluding curvature. For each setting we report the latent map scale, CPC, consistency and curvature loss, and control performance on balance and swing.
Table 2. We took a pretrained PC3 model, froze the encoder E, and then retrained only the latent dynamics model F either without (first row) or without (second row). Note that we continue to use noise in both settings.
Noise: The control performance also decreases when we do not add noise to . This is because the model will collapse the latent space to inflate as shown in Table 1, leading to a degenerate solution. Adding noise to prevents the map from collapsing; since the noise variance is fixed, is only maximized by pushing points apart. Indeed, Table 1 shows that when noise is added but removed, the latent map expands aggressively.
Curvature: Finally, as previously observed in (Levine et al., 2020), imposing low curvature is an important component if we want to use locally-linear control algorithms such as iLQR. Without the curvature loss, the Taylor approximation when running iLQR might not be accurate, leading to poor control performance. The right-most map in Figure 2 shows a depiction of this setting, in which the map has very sharp regions, requiring the transition function to have high-curvature to move in these regions.
5.2. Control Performance Comparison
Experimental pipeline: For each control domain, we run 10 different subtasks (different initial and/or goal states), and report the average performance among these subtasks. For PCC and PC3, we train 10 different models, and each of them will perform all 10 subtasks (which means a total of subtasks), and we additionally report the performance of the best model. SOLAR training procedure depends on the specific subtask (i.e., initial and goal state), and since we cannot train 100 different models due to huge computation cost, we train only 1 model for each subtask. All subtasks are shared for three methods.
Result: Table 3 shows that our proposed PC3 model signifi-cantly outperforms the baselines by comparing the means and standard error of means on the different control tasks.11 PCC and SOLAR often fail at difficult tasks such as Swing Up and 3-Link. Moreover, SOLAR training procedure depends on the specific task, which makes them unsuitable to be reused for different tasks in the same environment. Figure 3 demonstrates some (randomly selected) latent maps of Planar and Inverted Pendulum domains learned by PCC and PC3. In general, PC3 produces more interpretable latent representation for Pendulum, due to the fact that next observation prediction is too conservative and may force a model to care about things that do not matter to downstream tasks. Finally, in terms of computation, PC3 enjoys huge improvements over the baselines, with faster than PCC and faster than SOLAR.
Figure 3. Top: Planar latent representations; Bottom: Inverted Pendulum latent representations. Left three: PCC, right three: PC3.
Table 3. Percentage steps in goal state for the average model (all) and top 1 model. Since SOLAR is task-specific, it does not have top 1.
LCE Approaches. In contrast to existing LCE methods (Watter et al., 2015; Banijamali et al., 2018; Levine et al., 2020; Zhang et al., 2019; Hafner et al., 2018), our main contribution in PC3 is the development of an information-theoretic approach for minimizing the predictive suboptimality of the encoder and circumventing the need to perform explicit next-observation prediction. In particular, PC3 can be seen as a natural information-theoretic extension of PCC (Levine et al., 2020), which itself extended and improved upon E2C (Watter et al., 2015) and RCE (Banijamali et al., 2018). Compared to SOLAR (Zhang et al., 2019), PC3 (as well as PCC, RCE, and E2C) decouples the representation learning and latent dynamics estimation from control—once the encoder and latent dynamics have been learned for a particular environment, it can be used to solve many SOC problems within the same environment. In contrast, SOLAR is an online algorithm that interleaves model learning with policy optimization. Furthermore, the latent model in SOLAR is restricted to be globally linear, which can potentially impact the control performance.
Information-Theoretic Approaches. Several works have previously explored information-theoretic approaches for representation learning in the reinforcement learning context (Nachum et al., 2018; Anand et al., 2019; Lu et al., 2019). However, these works do not test the quality of their learned representations for the purposes of model-based planning in the latent space, opting instead to leverage the representations for model-free RL. This is particularly no- table in the case of Nachum et al. (2018), who explicitly learned both an encoder E and latent dynamics model F. As we showed in Section 4.2, maximizing the CPC bound alone may not be sufficient for ensuring that F is a good predictor of the latent dynamics induced by (p, E). Thus, the resulting (E, F) from predictive coding alone may be unsuitable for multi-step latent planning, as we demonstrate in our ablation analysis in Section 5.1.
In this work, we propose a novel information-theoretic Learning Controllable Embedding approach for handling high-dimensional stochastic optimal control. Our approach challenges the necessity of the next-observation prediction in existing LCE algorithms. We show theoretically that predictive coding is a valid alternative to next-observation prediction for learning a representation that minimizes predictive suboptimality. To instantiate information-theoretic LCE, we develop the Predictive Coding-Consistency-Curvature (PC3) model and show that PC3 is a simpler, yet more effective method than existing next-observation prediction-based LCE approaches. We also provide a thorough study on various components of the PC3 objective via ablation analysis to assist the adoption of predictive coding in future LCE research. A natural follow-up would be to study the effi-cacy of predictive coding when used in conjunction with other techniques in the LCE literature (e.g. latent overshooting) as well as with other controllers beyond the class of locally-linear controllers considered in our present work.
Anand, A., Racah, E., Ozair, S., Bengio, Y., Cˆot´e, M.-A., and Hjelm, R. D. Unsupervised state representation learning in atari. In Advances in Neural Information Processing Systems, pp. 8766–8779, 2019.
Banijamali, E., Shu, R., Ghavamzadeh, M., Bui, H., and Gh- odsi, A. Robust locally-linear controllable embedding. In Proceedings of the Twenty First International Conference on Artificial Intelligence and Statistics, pp. 1751–1759, 2018.
Belghazi, M., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual information neural estimation. In International Conference on Machine Learning, pp. 531–540, 2018.
Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. When is nearest neighbor meaningful? In International conference on database theory, pp. 217–235. Springer, 1999.
Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estima- tion using real nvp. preprint arXiv:1605.08803, 2016.
Espiau, B., Chaumette, F., and Rives, P. A new approach to visual servoing in robotics. ieee Transactions on Robotics and Automation, 8(3):313–326, 1992.
Finn, C., Tan, X., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. Deep spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 512–519. IEEE, 2016.
Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
Hjelm, R., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. preprint arXiv:1808.06670, 2018.
Johnson, M., Duvenaud, D., Wiltschko, A., Adams, R., and Datta, S. Composing graphical models with neural networks for structured representations and fast inference. In Advances in neural information processing systems, pp. 2946–2954, 2016.
Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Camp- bell, R. H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., et al. Model-based reinforcement learning for atari. preprint arXiv:1903.00374, 2019.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kingma, D. and Welling, M. Auto-encoding variational bayes. preprint arXiv:1312.6114, 2013.
Koller, D. and Friedman, N. Probabilistic graphical models: principles and techniques. MIT press, 2009.
Kurutach, T., Tamar, A., Yang, G., Russell, S. J., and Abbeel, P. Learning plannable representations with causal infogan. In Advances in Neural Information Processing Systems, pp. 8733–8744, 2018.
Levine, N., Chow, Y., Shu, R., Li, A., Ghavamzadeh, M., and Bui, H. Prediction, consistency, curvature: Representation learning for locally-linear control. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum? id=BJxG_0EtDS.
Li, W. and Todorov, E. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1), pp. 222–229, 2004.
Lu, X., Tiomkin, S., and Abbeel, P. Predictive coding for boosting deep reinforcement learning with sparse rewards. arXiv preprint arXiv:1912.13414, 2019.
Ma, Z. and Collins, M. Noise contrastive estimation and neg- ative sampling for conditional models: Consistency and statistical efficiency. preprint arXiv:1809.01812, 2018.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. Preprint arXiv:1312.5602, 2013.
Nachum, O., Gu, S., Lee, H., and Levine, S. Near-optimal representation learning for hierarchical reinforcement learning. preprint arXiv:1810.01257, 2018.
Nguyen, X., Wainwright, M. J., and Jordan, M. I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
Petrik, M., Ghavamzadeh, M., and Chow, Y. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, pp. 2298–2306, 2016.
Poole, B., Ozair, S., van den Oord, A., Alemi, A., and Tucker, G. On variational bounds of mutual information. preprint arXiv:1905.06922, 2019.
Rezende, D. and Mohamed, S. Variational inference with normalizing flows. preprint arXiv:1505.05770, 2015.
Rezende, D., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. preprint arXiv:1401.4082, 2014.
Shapiro, A., Dentcheva, D., and Ruszczy´nski, A. Lectures on stochastic programming: modeling and theory. SIAM, 2009.
Sohn, K., Lee, H., and Yan, X. Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pp. 3483–3491, 2015.
van den Oord, A., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. preprint arXiv:1807.03748, 2018.
Watter, M., Springenberg, J., Boedecker, J., and Riedmiller, M. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pp. 2746–2754, 2015.
Zhang, M., Vikram, S., Smith, L., Abbeel, P., Johnson, M., and Levine, S. Solar: Deep structured latent representations for model-based reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.
A.1. Connecting (SOC1) and (SOC1-E) with Next-observation Prediction
Recall that for an arbitrarily given encoder E the proxy cost function in the observation space is given by , where z is sampled from E(x). Equipped with this cost the only difference between (SOC1-E), i.e.,
, and the original problem (SOC1), i.e., , is on the cost function used.
To motivate the heuristic method of learning an encoder E by maximizing the likelihood of the next-observation prediction model, we want to show there exists at least one latent cost function such that the aforementioned approach makes sense. Followed from the equivalence of the energy-based graphical model (Markov random field) and Bayesian neural network (Koller & Friedman, 2009), for any arbitrary encoder E there exists a latent dynamics model and decoder such that any energy-based LCE model that has an encoder model , can be written as
Now, suppose for simplicity the observation cost is only state-dependent, and the latent cost is constructed as follows: . Then one can write , and this implies
where is the total variation distance of two distributions. Using analogous derivations of Lemma 11 in (Petrik et al., 2016), for the case of finite-horizon MDPs, one has the following chain of inequalities for any given control sequence and initial observation
The first inequality is based on the result of the above lemma, the second inequality is based on Pinsker’s inequality, and the third inequality is based on Jensen’s inequality of
Notice that for any arbitrary action sequence it can always be expressed in form of deterministic policy some non-stationary state-action mapping . Therefore, the KL term can be written as:
where the expectation is taken over the state-action stationary distribution of the finite-horizon problem that is induced by data-sampling policy U. The last inequality is due to change of measures in policy, and the last inequality is due to the facts that (i) is a deterministic policy, (ii) is a sampling policy with lebesgue measure 1/U over all control actions, (iii) the following bounds for importance sampling factor holds:
Using the above results we now have the following sub-optimality performance bound between the optimizer of (SOC1), , and the optimizer of (SOC1-E),
This shows that the performance gap between (SOC1) and (SOC1-E) is bounded by the prediction loss. Thus this result motivates the approach of learning the encoder model E of proxy cost by maximizing the likelihood of the next-observation prediction LCE model.
A.2. Proof of Lemma 1
We first provide the proof in a more general setting. Consider the data distribution p(x, y). Given any two representation functions , we wish to inquire how good these two functions are for constructing a predictor of y given x. To do so, we introduce a restricted class of prediction models of the form
Let denote the model that minimizes
Our goal is to upper bound the best possible loss based on the mutual information gap . In particular, we find that
We prove via explicit construction of a model q(y | x) whose corresponding loss is exactly the mutual information gap. Let (X, Y ) be joint random variables associated with p(x, y). Let r(a | b) be the conditional distribution of a = e(x) given b = f(y) associated with the joint random variables (A, B) = (e(X), f(Y )). Simply choose
Then, by law of the unconscious statistician, we see that
Finally, we see that
Since , the mutual information gap thus upper bounds the loss associated with the best restricted predictor
To complete the proof for Lemma 1, simply let
A.3. Proof of Lemma 2
For the first part of the proof, at any time-step , for any arbitrary control action sequence , and any arbitrary latent dynamics model F, with a given encoder E consider the following decomposition of the expected cost: Now consider the two-stage cost function: . One can express this cost function as
The last inequality is based on the chain of inequalities at any
in which the first one is based on triangle inequality and the second one is based on the non-expansive property of continuing the above expansion, one can show that
where the second inequality is based on convexity of , the third inequality is based on Pinsker’s inequality and the last inequality is based on Jensen’s inequality of
For the second part of the proof, one can show the following chain of inequalities for solution of (SOC1-E) and (SOC2):
where the first and third inequalities are based on the first part of this lemma, and the second inequality is based on the optimality condition of problem (SOC2). This completes the proof.
In the following sections we will provide the description 4 control domains and implementation details used in the experiments.
B.1. Description of the domains
All control environments are the same as reported in (Levine et al., 2020), except that we report both balance and swing up tasks for pendulum, where the author only reported swing up.
B.2. Implementation details
SOLAR training specifics: We use their default setting:
• Batch size of 2.
• ADAM (Kingma & Ba, 2014) with , and . Learning rate for learning for other parameters.
• (
• Local inference and control:
PCC training specifics: We use their reported setting:
• Batch size of
• ADAM with
• L2 regularization with a coefficient of
• for the curvature loss. This setting is shared across all domains.
• Additional VAE (Kingma & Welling, 2013) loss term with a very small coefficient of 0.01, where p(z) = N(0, 1).
• Additional deterministic reconstruction loss with coefficient 0.3: given the current observation x, we take the means of the encoder output and the dynamics model output, and decode to get the reconstruction of the next observation.
PC3 training specifics:
• Batch size of 256.
• ADAM with
• L2 regularization with a coefficient of
• Latent noise across all domains without any tuning.
• was set to be 1 across all domains, after it was tuned using grid search in range {0.5, 0.75, 1} on Planar system.
• was set to be 7 across all domains, after it was tuned using grid search in range {1, 3, 7} on Planar system.
• for the curvature loss.
• Additional loss with a very small coefficient of 0.01, which is used to center the latent space around the origin. We found this term to be important to stabilize the training process.
We next present the specific architecture choices for each domain. For fair comparison, the architectures were shared across all algorithms when possible, ReLU non-linearities were used between each two layers.
Encoder: composed of a backbone (either a MLP or a CNN, depending on the domain) and an additional fully-connected (FLC) layer that outputs either a vector (for PC3) or a Gaussian distribution (for PCC and SOLAR).
Latent dynamics (PCC and PC3): the path that leads from , composed of a MLP backbone and an additional FLC layer that outputs either a vector (for PC3) or a Gaussian distribution (for PCC and SOLAR).
Decoder (PCC and SOLAR): composed of a backbone (either a MLP or a CNN, depending on the domain) and an additional FLC layer that outputs a Bernoulli distribution.
Backward dynamics: the path that leads from . Each of the inputs goes through a FLC network respectively. The outputs are concatenated and passed through another FLC network , and finally an additional FLC network which outputs a Gaussian distribution.
Planar system
• Input: training samples of the form for PCC and PC3, and 125 rollouts for SOLAR.
• Actions space: 2-dimensional
• Latent space: 2-dimensional
• Encoder: 3 Layers: 300 units - 300 units - 4 units for PCC and SOLAR (2 for mean and 2 for variance) or 2 units for PC3
• Dynamics: 3 Layers: 20 units - 20 units - 4 units for PCC and SOLAR or 2 units for PC3
• Decoder: 3 Layers: 300 units - 300 units - 1600 units (logits)
• Backward dynamics:
• Planning horizon: T = 40
• Initial standard deviation for collecting data (SOLAR): 1.5 for both global and local traning.
Inverted Pendulum Swing up and Balance
• Input: Two images. 20000 training samples of the form for PCC and PC3, and 200 rollouts for SOLAR.
• Actions space: 1-dimensional
• Latent space: 3-dimensional
• Encoder: 3 Layers: 500 units - 500 units - 6 units for PCC and SOLAR or 3 units for PC3
• Dynamics: 3 Layers: 30 units - 30 units - 4 units for PCC and SOLAR or 2 units for PC3
• Decoder: 3 Layers: 500 units - 500 units - 4608 units (logits)
• Backward dynamics:
• Planning horizon: T = 100
• Initial standard deviation for collecting data (SOLAR): 0.5 for both global and local training.
Cartpole
• Input: Two images. 15000 training samples of the form for PCC and PC3, and 300 rollouts for SOLAR.
• Actions space: 1-dimensional
• Latent space: 8-dimensional
• Encoder: 6 Layers: Convolutional layer: ; stride (1, 1) - Convolutional layer: ; stride (2, 2) -Convolutional layer: ) - Convolutional layer: PCC and SOLAR or 8 units for PC3
• Dynamics: 3 Layers: 40 units - 40 units - 16 units for PCC and SOLAR or 8 units for PC3
• Decoder: 6 Layers: 200 units - 1000 units - 100 units - Convolutional layer: ) - Upsampling (2, 2) - Convolutional layer: ) - Upsampling (2, 2) - Convolutional layer: Upsampling (2, 2) - Convolutional layer:
• Backward dynamics:
• Planning horizon: T = 50
• Initial standard deviation for collecting data (SOLAR): 10 for global and 5 for local training.
3-link Manipulator
• Input: Two images. 30000 training samples of the form for PCC and PC3, and 150 rollouts for SOLAR.
• Encoder: 6 Layers: Convolutional layer: ; stride (1, 1) - Convolutional layer: ; stride (2, 2) -Convolutional layer: ) - Convolutional layer: PCC and SOLAR or 8 units for PC3
• Decoder: 6 Layers: 200 units - 1000 units - 100 units - Convolutional layer: ) - Upsampling (2, 2) - Convolutional layer: ) - Upsampling (2, 2) - Convolutional layer: Upsampling (2, 2) - Convolutional layer: