b

DiscoverSearch
About
My stuff
Predictive Coding for Locally-Linear Control
2020·arXiv
Abstract
Abstract

High-dimensional observations and unknown dynamics are major challenges when applying optimal control to many real-world decision making tasks. The Learning Controllable Embedding (LCE) framework addresses these challenges by embedding the observations into a lower dimensional latent space, estimating the latent dynamics, and then performing control directly in the latent space. To ensure the learned latent dynamics are predictive of next-observations, all existing LCE approaches decode back into the observation space and explicitly perform next-observation prediction—a challenging high-dimensional task that furthermore introduces a large number of nuisance parameters (i.e., the decoder) which are discarded during control. In this paper, we propose a novel information-theoretic LCE approach and show theoretically that explicit next-observation prediction can be replaced with predictive coding. We then use predictive coding to develop a decoder-free LCE model whose latent dynamics are amenable to locally-linear control. Extensive experiments on benchmark tasks show that our model reliably learns a controllable latent space that leads to superior performance when compared with state-of-the-art LCE baselines.

With the rapid growth of systems equipped with powerful sensing devices, it is important to develop algorithms that are capable of controlling systems from high-dimensional raw sensory inputs (e.g., pixel input). However, scaling stochastic optimal control and reinforcement learning (RL) methods to high-dimensional unknown environments remains an open challenge. To tackle this problem, a common approach is to employ various heuristics to embed the high-dimensional observations into a lower-dimensional la- tent space (Finn et al., 2016; Kurutach et al., 2018; Kaiser et al., 2019). The class of Learning Controllable Embedding (LCE) algorithms (Watter et al., 2015; Banijamali et al., 2018; Hafner et al., 2018; Zhang et al., 2019; Levine et al., 2020) further supplies the latent space with a latent dynamics model to enable planning directly in latent space.

Our present work focuses on this class of LCE algorithms and takes a critical look at the prevailing heuristic used to learn the controllable latent space: next-observation prediction. To ensure that the learned embedding and latent dynamics are predictive of future observations, existing LCE algorithms introduce a decoder during training and explicitly perform next-observation prediction by decoding the predicted latent states back into the observation space. Despite its empirical success (Watter et al., 2015; Banijamali et al., 2018; Zhang et al., 2019; Levine et al., 2020), this approach suffers from two critical drawbacks that motivate the search for better alternatives: (i) it requires the model to handle the challenging task of high-dimensional prediction; (ii) it does so in a parameter-inefficient manner—requiring the use of a decoder that is discarded during control.

To address these concerns, we propose a novel information-theoretic LCE approach for learning a controllable latent space. Our contributions are as follows.

1. We characterize the quality of the learned embedding through the lens of predictive suboptimality and show that predictive coding (van den Oord et al., 2018) is sufficient for minimizing predictive suboptimality.

2. Based on predictive coding, we propose a simpler and parameter-efficient model that jointly learns a controllable latent space and latent dynamics specifically amenable to locally-linear controllers.

3. We conduct detailed analyses and empirically characterize how model ablation impacts the learned latent space and control performance.

4. Finally, we show that our method out-performs state-of-the-art LCE algorithms on several benchmark tasks, demonstrating predictive coding as a superior alternative to next-observation prediction when learning controllable embeddings.

We are interested in controlling non-linear dynamical systems of the form  st+1 = fS(st, ut) + w, over the horizon T. In this definition,  st ∈ S ⊆ Rnsand  ut ∈ U ⊆Rnuare the state and action of the system at time step t ∈ {0, . . . , T − 1}, wis the Gaussian system noise, and fSis the smooth non-linear system dynamics. We are particularly interested in the scenario in which we only have access to the high-dimensional observation  xt ∈ X ⊆ Rnxof each state  st (nx ≫ ns). This scenario has application in many real-world problems, such as visual-servoing (Espiau et al., 1992), in which we only observe high-dimensional images of the environment and not its underlying state. We further assume that the high-dimensional observations x have been selected such that for any arbitrary control sequence  U = {ut}T −1t=0, the observation se- quence  {xt}Tt=0 is generated by a stationary Markov process, i.e.,  xt+1 ∼ p(·|xt, ut), ∀t ∈ {0, . . . , T − 1}.1

A common approach to control the non-linear dynamical system described above is to solve the following stochastic optimal control (SOC) problem (Shapiro et al., 2009) that minimizes the expected cumulative cost

image

where  c : X × U → R≥0is the immediate cost function and  x0is the observation at the initial state  s0. Throughout the paper, we assume that all immediate cost functions are bounded by  cmax > 0and Lipschitz with constant  clip > 0.One form of the immediate cost function that is particularly common in goal tracking problems is  c(x, u) = ∥x−xgoal∥2,where  xgoalis the observation at the goal state.

The application of SOC to high-dimensional environments, however, faces several challenges. Since the observations x are high-dimensional and the dynamics in the observation space  p(·|xt, ut)is unknown, solving (SOC1) is often intractable as it requires solving two difficult problems: high-dimensional dynamics estimation and high-dimensional optimal control. To address these issues, the Learning Controllable Embedding (LCE) framework proposes to learn a lowdimensional latent (embedding) space  Z ⊆ Rnz (nz ≪ nx)and a latent state dynamics, and then perform optimal control in the latent space instead. This framework includes algorithms such as E2C (Watter et al., 2015), RCE (Ban- ijamali et al., 2018), SOLAR (Zhang et al., 2019), and PCC (Levine et al., 2020). By learning a stochastic encoder E : X → P(Z)and latent dynamics  F : Z × U → P(Z),

LCE algorithms defines a new SOC in the latent space,

image

where  z0is sampled from the distribution  E(x0), i.e., z0 ∼E(z0 | x0), and  ¯c : Z × U → R≥0is the latent cost function. By solving the much lower-dimensional (SOC2), the resulting optimal control  U ∗2 is then applied as a feasible solution to (SOC1) and incurs a suboptimality that depends on the choice of the encoder E and latent dynamics  F.2

Although Levine et al. (2020) provided an initial theoretical characterization of this SOC suboptimality, the selection of E and F ultimately remains heuristically-driven in all previous works. These heuristics vary across different studies (Levine et al., 2020; Banijamali et al., 2018; Watter et al., 2015; Zhang et al., 2019; Hafner et al., 2018), but the primary approach employed by the existing LCE algorithms is explicit next-observation prediction. By introducing a decoder  D : Z → P(X), the composition  D ◦ F ◦ Eis cast as an action-conditional latent variable model; the advances in latent variable modeling (Kingma & Welling, 2013; Rezende et al., 2014; Burda et al., 2015; Johnson et al., 2016; Sohn et al., 2015) are then leveraged to train E, F, and D to perform explicit next-observation prediction by maximizing a lower bound on the log-likelihood  ln�D(xt+1 |zt+1)F(zt+1 | zt, ut)E(zt | xt) dzt:t+1, over the dataset whose trajectories are drawn from  p(xt, ut, xt+1).

Next-observation prediction offers a natural way to learn a non-degenerate choice of E and F, and enjoys the merit of being empirically successful. However, it requires the introduction of a decoder D as nuisance parameter that only serves the auxiliary role of training the encoder E and latent dynamics F. The focus of our paper is whether E and F can be successfully selected via a decoder-free heuristic.

Existing methods instantiate (SOC2) by learning the encoder E and latent dynamics model F in conjunction with an auxiliary decoder D to explicitly perform next-observation prediction. The auxiliary decoder ensures that the learned representation can be used for next-observation prediction, and is discarded after the encoder and latent dynamics model are learned. Not only is this a parameter-inefficient procedure for learning (E, F), this approach also learns (E, F) by explicitly performing the challenging high-dimensional next-observation prediction. In this section, we propose an information-theoretic approach that can learn (E, F) without decoding and next-observation prediction.

image

Figure 1. Two high-level approaches to learn an E and F to instantiate (SOC2). One way is to explicitly introduce a decoder D and do next-observation prediction (left), whereas our method uses F as a variational device to train E via predictive coding (right).

3.1. Predictive Suboptimality of a Representation

Our approach exploits the observation that the sole purpose of the decoder is to ensure that the learned representation is good for next-observation prediction. In other words, the decoder is used to characterize the suboptimality of next-observation prediction when the prediction model is forced to rely on the learned representation. We refer to this concept as predictive suboptimality of the learned representation and formally define it as follows.

Definition 1. Let  p(xt+1, xt, ut)denote the data distribution. Given an encoder  E : X → Z, 3 let  q(xt+1 | xt, ut)denote the prediction model

image

where  ψ1 and ψ2are expressive non-negative functions. We define the predictive suboptimality  ℓ∗pred(E)of a representa- tion induced by E as the best-case prediction loss

image

Importantly, the function  ψ2should measure the compatibility of the triplet  (xt+1, xt, ut)—but is only allowed to do so via the representations  E(xt+1)and  E(xt). Thus, the behavior of the representation bottleneck plays a critical role in modulating the expressivity of the model q. If E is invertible, then q is a powerful prediction model; if E is a constant, then q can do no better than marginal density estimation of  p(xt+1).

While it is possible to minimize the predictive suboptimality of E by introducing the latent dynamics model F and de- coder D, and then performing next-observation prediction via  D◦F ◦E, our key insight is that predictive suboptimality can be bounded by the following mutual information gap (see Appendix A.2 for proof).

Lemma 1. Let  Xt+1, Xt, and Utbe the random variables associated with the data distribution  p(xt+1, xt, ut). The predictive suboptimality  ℓ∗pred(E)is upper bounded by the

mutual information gap

image

Since  I(Xt+1 ; Xt, Ut)is a constant and upper bounds I(E(Xt+1) ; E(Xt), Ut)by the data processing inequality, this means we can minimize the predictive suboptimality of E simply by maximizing the mutual information between the future latent state  E(Xt+1)and the current latent state and action pair  (E(Xt), Ut)—a form of predictive coding. We denote this mutual information  ℓMI(E)as a function of E. To maximize this quantity, we can then leverage the recent advances in variational mutual information approximation (van den Oord et al., 2018; Poole et al., 2019; Belghazi et al., 2018; Nguyen et al., 2010; Hjelm et al., 2018) to train the encoder in a decoder-free fashion.

3.2. Consistency in Prediction of the Next Latent State

A notable consequence of introducing the encoder E is that it can be paired with a latent cost function  ¯cto define an alternative cost function in the observation space,

image

where z is sampled from  E(x).4 This is particularly useful for high-dimensional SOC problems, where it is difficult to prescribe a meaningful cost function a priori in the observation space. For example, for goal tracking problems using visuosensory inputs, prescribing the cost function to be  c(x, u) = ∥x − xgoal∥2suffers from the uninformative nature of the 2-norm in high-dimensional pixel space (Beyer et al., 1999). In the absence of a prescribed c, a natural proxy for the unknown cost function is to replace it with  cEand consider the new SOC problem,

image

Assuming (SOC1-E) to be the de facto SOC problem of interest, we wish to learn an F such that the optimal control U ∗2in (SOC2) approximately solves (SOC1-E). One such consideration for the latent dynamics model would be to set F as the true latent dynamics induced by (p, E), and we refer to such F as the one that is consistent with (p, E).

Our main contribution in this section is to justify—from a control perspective—why selecting a consistent F with respect to (p, E) minimizes the suboptimality incurred from using (SOC2) as an approximation to (SOC1-E). The following lemma (see Appendix A.3 for proof) provides the suboptimality performance gap between the solutions of (SOC2) and (SOC1-E).

Lemma 2. For any given encoder E and latent dynamics F, let  U ∗1-Ebe the solution to (SOC1-E) and  U ∗2be a solution to (SOC2). Then, we have the following performance bound between the costs of the control signals  U ∗1-E and U ∗2 :

L(U ∗1-E, p, cE, x0) ≥ L(U ∗2 , p, cE, x0)−2λC·�2RC(E, F),

(1) where RC(E, F) = Ep(xt+1,xt,ut)[DKL(E(zt+1|xt+1)||(F◦E)(zt+1|xt, ut))] and λC = T 2cmaxU.

In Eq. 1, the expectation is over the state-action stationary distribution of the policy used to generate the training samples (uniformly random policy in this work), and U is the Lebesgue measure of  U.5 Moreover, E(zt+1|xt+1) and�F ◦ E�(zt+1|xt, ut) =�F(zt+1|zt, ut)E(zt|xt) dztare the probability over the next latent state  zt+1. Based on Figure 1, we therefore interpret  RC(E, F)as the measure of discrepancy between the dynamics  xt → xt+1 → zt+1induced by (p, E) versus the latent dynamics model  xt →zt → zt+1induced by (E, F). which we term the consistency regularizer. We note that while our resulting bound is similar to Lemma 2 in Levine et al. (2020), there are two key differences. First, our analysis makes explicit the assumption that the cost function c is not prescribed and thus replaced in practice with the proxy cost function  cEbased on the heuristically-learned encoder. Second, by making this assumption explicit, our bound is based on samples from the environment dynamics p instead of the next-observation prediction model dynamics  ˆpas required in Levine et al. (2020).

By restricting the stochastic encoder E to be a distribution with fixed entropy (e.g., by fixing the variance if E is conditional Gaussian), the minimization of the consistency regularizer corresponds to maximizing the log-likelihood of F for predicting  zt+1, given  (zt, ut), under the dynamics induced by (p, E). This correspondence holds even in the limiting case of E being deterministic (e.g., fixing the variance to an arbitrarily small value). In other words, for (SOC2) to approximate (SOC1-E) well, we select F to be a good predictor of the true latent dynamics.

3.3. Suboptimality in Locally-Linear Control

In Section 3.2, we derived the suboptimality of using (SOC2) as a surrogate control objective for (SOC1-E), and showed that the suboptimality depends on the consistency of latent dynamics model F with respect to the true latent dynamics induced by (p, E).

We now shift our attention to the optimization of (SOC2) itself. Similar to previous works (Watter et al., 2015; Bani- jamali et al., 2018; Zhang et al., 2019; Levine et al., 2020), we shall specifically consider the class of locally-linear control (LLC) algorithms, e.g., iLQR (Li & Todorov, 2004), for solving (SOC2). The main idea in LLC algorithms is to compute an optimal action sequence by linearizing the dynamics around some nominal trajectory. This procedure implicitly assumes that the latent dynamics F has low curvature, so that local linearization via first-order Taylor expansion yields to a good linear approximation over a suf-ficiently large radius. As a result, the curvature of F will play an important role in the optimizability of (SOC2) via LLC algorithms.

Levine et al. (2020) analyzed the suboptimality incurred from applying LLC algorithms to (SOC2) as a function of the curvature of F. For self-containedness, we summarize their analysis as follows. We shall assume F to be a conditional Gaussian model with a mean prediction function fZ(z, u). The curvature of  fZcan then be measured via

image

where  η = (ηz, ηu)⊤ ∼ N(0, δ2I), δ > 0is a tunable parameter that characterizes the radius of latent state-action space in which the latent dynamics model should have low curvature. Let  U ∗LLC be a LLC solution to (SOC2). Suppose the nominal latent state-action trajectory  {(zt, ut)}T −1t=0 sat-isfies the condition:  (zt, ut) ∼ N((z∗2,t, u∗2,t), δ2I), where {(z∗2,t, u∗2,t)}T −1t=0is the optimal trajectory of (SOC2). Us- ing Eq. 29 of Levine et al. (2020), one can show that with probability  1 − η, the LLC solution of (SOC2) has the following suboptimality performance gap when compared with the optimal cost of this problem using the solution  U ∗2 ,

L(U ∗2 , F, ¯c, z0) ≥ L(U ∗LLC, F, ¯c, z0) − 2λLLC ·�RLLC(F),

where

image

and X is the Lebesgue measure with respect to X. We therefore additionally constrain F to have low curvature so that it is amenable to the application of LLC algorithms.

Based on the analysis in Section 3, we identify three desiderata for guiding the selection of the encoder E and latent dynamics model F. We summarize them as follows: (i) predictive coding minimizes the predictive suboptimality of the encoder E; (ii) consistency of the latent dynamics model F with respective to (p, E) enables planning directly in the latent space; and (iii) low-curvature enables planning in latent space specifically using locally-linear controllers. We refer to these heuristics collectively as Predictive Coding-Consistency-Curvature (PC3). PC3 can be thought of as an information-theoretic extension of the Prediction-Consistency-Curvature (PCC) framework described by Levine et al. (2020)—differing primarily in the replacement of explicit next-observation prediction with predictive coding in the latent space.

In this section, we highlight some of the key design choices involved when instantiating PC3 in practice. In particular, we shall show how to leverage the CPC variational mutual information bound in a parameter-efficient manner and how to enforce the consistency of F with respect to (p, E) without destabilizing training.

4.1. Enforcing Predictive Codes

To estimate the mutual information  ℓMI(E), we employ contrastive predictive coding (CPC) proposed by van den Oord et al. (2018). We perform CPC by introducing a critic f : Z × Z × U → Rto construct the lower bound

image

where the expectation is over K i.i.d. samples of (xt+1, xt, ut). Notice that the current latent state-action pair  (E(xt), ut)is specifically designated as the source of negative samples and used for the contrastive prediction of the next latent state  E(xt+1). We then tie the critic f to our latent dynamics model F,

image

This particular design of the critic has two desirable properties. First, it exploits parameter-sharing to circumvent the instantiation of an auxiliary critic f. Second, it takes advantage of the property that an optimal critic for the lower bound in Eq. (2) is the true latent dynamics (Poole et al., 2019; Ma & Collins, 2018)—which we wish F to approximate. The resulting CPC objective is thus

image

which we denote as  ℓcpc(E, F).

4.2. Enforcing Consistency

Since the true latent dynamics is an optimal critic for the CPC bound, it is tempting to believe that optimizing (E, F) to maximize  ℓcpc(E, F)should be sufficient to encourage the learning of a latent dynamics model F that is consistent with the true latent dynamics induced by (p, E).

In this section, we show that it is easy to construct a simple counterexample illustrating the non-uniqueness of the true latent dynamics as an optimal critic—and that F may learn to be arbitrarily inconsistent with (p, E) while still maximizing  ℓcpc(E, F)under a fixed choice of E. Our simple counterexample proceeds as follows: let E be the identity function, let X = U = R, and let  p(xt+1, xt, ut)be a uniform distribution over the tuples  (1, 1, 1) and (−1, −1, −1).Let  F(zt+1 | zt, ut)be a conditional Gaussian distribution with learnable variance  σ2 > 0and mean function

image

where  η > 0is a learnable parameter. By symmetry, the bound  ℓcpc(E, F) where K = 2 becomes

image

In the denominator, the first term arises from the positive sample (e.g., (1, 1, 1)) whereas the second term arises from the negative sample (e.g.,  (1, −1, −1)). One way to maximize this bound would be to set  η = 1 and let σ → 0. Corre-spondingly, F would approach the true latent dynamics and precisely predict how  (zt, ut)transitions to  zt+1. However,an alternative procedure for maximizing this bound is to fix σto any positive constant and let  η → ∞. In this scenario, F becomes an arbitrarily poor predictor of the underlying latent dynamics.

This counterexample highlights a simple but important characteristic of the CPC bound. In contrast to direct maximum likelihood training of  F(zt+1 | zt, ut)using samples of (zt+1, zt, ut)from the true latent dynamics, the contrastive predictive training of the latent dynamics model simply ensures that  F(zt+1 | zt, ut) assigns arelatively much higher value to the positive samples than to the negative samples. The fact that the CPC bound may be maximized without learning a consistent dynamics model F may be why previous work by Nachum et al. (2018) using CPC for representation learning in model-free RL chose not to perform model-based latent space control despite also learning an F as a variational artifact from their CPC bound.

Since our goal is to use F in (SOC2) for optimal control, it is critical that we ensure the latent dynamics model F indeed approximates the true latent dynamics. We therefore additionally train F via the maximum likelihood objective

image

However, naively optimizing (E, F) to maximize both  ℓcpcand  ℓconsis unstable; whereas  ℓcpcis geometry-invariant, ℓconsis sensitive to non-volume preserving transformations of the latent space (Rezende & Mohamed, 2015; Dinh et al., 2016) and can increase arbitrarily simply by collapsing the latent space. To resolve this issue, we add Gaussian noise ϵ ∼ N(0, σ2I)with fixed variance to the next-state encoding  E(xt+1). Doing so yields the noise-perturbed objectives ℓcpc+ϵ and ℓcons+ϵ. The introduction of noise has two notable effects. First, it imposes an upper bound on the achievable log-likelihood

image

based on the entropy of the Gaussian noise. Second,  ℓcpc+ϵis now a lower bound to the mutual information between (E(Xt), Ut)and the noise-perturbed  E(Xt+1) + E,

image

Since the noise variance  σ2is fixed,  ℓcpc+ϵcan only be maximized by expanding the latent space. By tuning the noise variance  σ2 as a hyperparameter, we can balance the latent space retraction encouraged by  ℓcons+ϵwith the latent space expansion encouraged by  ℓcpc+ϵand thus stabilize the learning of the latent space. For notational simplicity, we shall treat all subsequent mentions of  ℓcpcand  ℓconsto mean their respective noise-perturbed variants, except in the specific ablation conditions where noise is explicitly removed (e.g., the “w/o  ϵ” condition in our experiments).

4.3. Enforcing Low Curvature

We measure the curvature of F by computing the first-order Taylor expansion error incurred when evaluating at  ¯z =

image

ℓcurv(F) = Eη∼N (0,δI)[∥fZ(¯z, ¯u) − (∇zfZ(¯z, ¯u)ηz+  ∇ufZ(¯z, ¯u)ηu) − fZ(z, u)∥22].

Levine et al. (2020) further proposes an amortized version of this objective to accelerate training when the latent dimensionality  nzis large. However, since  nzis relatively small in our benchmark tasks, our initial experimentation suggests amortization to have little wall-clock time impact on these tasks. Our overall objective is thus

image

which maximizes the CPC bound and consistency, while minimizing curvature.

In this section, we report a thorough ablation study on various components of PC3, as well as compare the performance of our proposed model6 with two state-of-the-art LCE baselines: PCC (Levine et al., 2020) and SOLAR (Zhang et al., 2019).7 The experiments are based on four image-based control benchmark domains: Planar System, Inverted Pendulum,8 Cartpole, and 3-Link Manipulator.

Data generation procedure: In PCC and PC3, each sample is a triplet  (xt, ut, xt+1), in which we (1) sample uniformly an underlying state  stand generate its corresponding observation  xt, (2) sample uniformly an action  ut, and (3) obtain the next state  st+1from the true dynamics and generate the corresponding observation  xt+1. In SOLAR, each training sample is an episode {x1, u1, x2, . . . , xT , uT , xT +1}, where T is the control horizon. We sample uniformly T actions from the action space, apply the dynamics T times from the initial state, and generate T corresponding observations.

Evaluation metric: We evaluate PC3 and the baselines in terms of control performance. For PC3 and PCC, we apply iLQR algorithm in the latent space with a quadratic cost, c(zt, ut) = (zt − zgoal)⊤Q(zt − zgoal) + u⊤t Rut, where  ztand  zgoalare the encoded vectors of the current and goal observation, and  Q = α·Inz, R = β ·Inu. For SOLAR, we use their original local-inference-and-control algorithm.9 We report the percentage of time spent in the goal region in the underlying system (Levine et al., 2020).

5.1. Ablation Study

We characterize PC3 by ablating  ℓcons, ℓcurv, and the noise ϵadded to  zt+1. For each setting, we report the latent map size,10 ℓcpc, ℓcons, ℓcurv, and the control performance. These statistics are averaged over 10 different models. All settings are run on Pendulum (Balance and Swing Up).

Consistency: In Table 1, we can see that when  ℓcons is omit-ted, the control performance drops. As discussed in Section 3.2, the latent dynamics model F performs poorly when not explicitly optimized for consistency. This is further demonstrated in Table 2, where we take a pretrained PC3 model, freeze the encoder, and retrain F to maximize either ℓcpcor  ℓcons. Despite both retrained models achieving similar  ℓcpcscores, it is easy to see that training via  ℓcpcresults in much worse latent dynamics in terms of prediction.

image

Figure 2. Inverted pendulum representations. From left to right: PC3, w/o  (ℓcons, ϵ), w/o (ℓcons), w/o ϵ, w/o ℓcurv

Table 1. Ablation analysis. Percentage of steps spent in goal state. From top to bottom: full-model PC3, excluding consistency and latent noise, excluding consistency, excluding latent noise, excluding curvature. For each setting we report the latent map scale, CPC, consistency and curvature loss, and control performance on balance and swing.

image

Table 2. We took a pretrained PC3 model, froze the encoder E, and then retrained only the latent dynamics model F either without  ℓcons(first row) or without  ℓcpc(second row). Note that we continue to use  ℓcurv and add ϵnoise in both settings.

image

Noise: The control performance also decreases when we do not add noise to  zt+1. This is because the model will collapse the latent space to inflate  ℓconsas shown in Table 1, leading to a degenerate solution. Adding noise to  zt+1prevents the map from collapsing; since the noise variance is fixed,  ℓcpcis only maximized by pushing points apart. Indeed, Table 1 shows that when noise is added but  ℓcons isremoved, the latent map expands aggressively.

Curvature: Finally, as previously observed in (Levine et al., 2020), imposing low curvature is an important component if we want to use locally-linear control algorithms such as iLQR. Without the curvature loss, the Taylor approximation when running iLQR might not be accurate, leading to poor control performance. The right-most map in Figure 2 shows a depiction of this setting, in which the map has very sharp regions, requiring the transition function to have high-curvature to move in these regions.

5.2. Control Performance Comparison

Experimental pipeline: For each control domain, we run 10 different subtasks (different initial and/or goal states), and report the average performance among these subtasks. For PCC and PC3, we train 10 different models, and each of them will perform all 10 subtasks (which means a total of  10 × 10 = 100subtasks), and we additionally report the performance of the best model. SOLAR training procedure depends on the specific subtask (i.e., initial and goal state), and since we cannot train 100 different models due to huge computation cost, we train only 1 model for each subtask. All subtasks are shared for three methods.

Result: Table 3 shows that our proposed PC3 model signifi-cantly outperforms the baselines by comparing the means and standard error of means on the different control tasks.11 PCC and SOLAR often fail at difficult tasks such as Swing Up and 3-Link. Moreover, SOLAR training procedure depends on the specific task, which makes them unsuitable to be reused for different tasks in the same environment. Figure 3 demonstrates some (randomly selected) latent maps of Planar and Inverted Pendulum domains learned by PCC and PC3. In general, PC3 produces more interpretable latent representation for Pendulum, due to the fact that next observation prediction is too conservative and may force a model to care about things that do not matter to downstream tasks. Finally, in terms of computation, PC3 enjoys huge improvements over the baselines, with  1.85×faster than PCC and  52.8×faster than SOLAR.

image

Figure 3. Top: Planar latent representations; Bottom: Inverted Pendulum latent representations. Left three: PCC, right three: PC3.

Table 3. Percentage steps in goal state for the average model (all) and top 1 model. Since SOLAR is task-specific, it does not have top 1.

image

LCE Approaches. In contrast to existing LCE methods (Watter et al., 2015; Banijamali et al., 2018; Levine et al., 2020; Zhang et al., 2019; Hafner et al., 2018), our main contribution in PC3 is the development of an information-theoretic approach for minimizing the predictive suboptimality of the encoder and circumventing the need to perform explicit next-observation prediction. In particular, PC3 can be seen as a natural information-theoretic extension of PCC (Levine et al., 2020), which itself extended and improved upon E2C (Watter et al., 2015) and RCE (Banijamali et al., 2018). Compared to SOLAR (Zhang et al., 2019), PC3 (as well as PCC, RCE, and E2C) decouples the representation learning and latent dynamics estimation from control—once the encoder and latent dynamics have been learned for a particular environment, it can be used to solve many SOC problems within the same environment. In contrast, SOLAR is an online algorithm that interleaves model learning with policy optimization. Furthermore, the latent model in SOLAR is restricted to be globally linear, which can potentially impact the control performance.

Information-Theoretic Approaches. Several works have previously explored information-theoretic approaches for representation learning in the reinforcement learning context (Nachum et al., 2018; Anand et al., 2019; Lu et al., 2019). However, these works do not test the quality of their learned representations for the purposes of model-based planning in the latent space, opting instead to leverage the representations for model-free RL. This is particularly no- table in the case of Nachum et al. (2018), who explicitly learned both an encoder E and latent dynamics model F. As we showed in Section 4.2, maximizing the CPC bound alone may not be sufficient for ensuring that F is a good predictor of the latent dynamics induced by (p, E). Thus, the resulting (E, F) from predictive coding alone may be unsuitable for multi-step latent planning, as we demonstrate in our ablation analysis in Section 5.1.

In this work, we propose a novel information-theoretic Learning Controllable Embedding approach for handling high-dimensional stochastic optimal control. Our approach challenges the necessity of the next-observation prediction in existing LCE algorithms. We show theoretically that predictive coding is a valid alternative to next-observation prediction for learning a representation that minimizes predictive suboptimality. To instantiate information-theoretic LCE, we develop the Predictive Coding-Consistency-Curvature (PC3) model and show that PC3 is a simpler, yet more effective method than existing next-observation prediction-based LCE approaches. We also provide a thorough study on various components of the PC3 objective via ablation analysis to assist the adoption of predictive coding in future LCE research. A natural follow-up would be to study the effi-cacy of predictive coding when used in conjunction with other techniques in the LCE literature (e.g. latent overshooting) as well as with other controllers beyond the class of locally-linear controllers considered in our present work.

Anand, A., Racah, E., Ozair, S., Bengio, Y., Cˆot´e, M.-A., and Hjelm, R. D. Unsupervised state representation learning in atari. In Advances in Neural Information Processing Systems, pp. 8766–8779, 2019.

Banijamali, E., Shu, R., Ghavamzadeh, M., Bui, H., and Gh- odsi, A. Robust locally-linear controllable embedding. In Proceedings of the Twenty First International Conference on Artificial Intelligence and Statistics, pp. 1751–1759, 2018.

Belghazi, M., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual information neural estimation. In International Conference on Machine Learning, pp. 531–540, 2018.

Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. When is nearest neighbor meaningful? In International conference on database theory, pp. 217–235. Springer, 1999.

Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estima- tion using real nvp. preprint arXiv:1605.08803, 2016.

Espiau, B., Chaumette, F., and Rives, P. A new approach to visual servoing in robotics. ieee Transactions on Robotics and Automation, 8(3):313–326, 1992.

Finn, C., Tan, X., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. Deep spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 512–519. IEEE, 2016.

Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.

Hjelm, R., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. preprint arXiv:1808.06670, 2018.

Johnson, M., Duvenaud, D., Wiltschko, A., Adams, R., and Datta, S. Composing graphical models with neural networks for structured representations and fast inference. In Advances in neural information processing systems, pp. 2946–2954, 2016.

Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Camp- bell, R. H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., et al. Model-based reinforcement learning for atari. preprint arXiv:1903.00374, 2019.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. and Welling, M. Auto-encoding variational bayes. preprint arXiv:1312.6114, 2013.

Koller, D. and Friedman, N. Probabilistic graphical models: principles and techniques. MIT press, 2009.

Kurutach, T., Tamar, A., Yang, G., Russell, S. J., and Abbeel, P. Learning plannable representations with causal infogan. In Advances in Neural Information Processing Systems, pp. 8733–8744, 2018.

Levine, N., Chow, Y., Shu, R., Li, A., Ghavamzadeh, M., and Bui, H. Prediction, consistency, curvature: Representation learning for locally-linear control. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum? id=BJxG_0EtDS.

Li, W. and Todorov, E. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1), pp. 222–229, 2004.

Lu, X., Tiomkin, S., and Abbeel, P. Predictive coding for boosting deep reinforcement learning with sparse rewards. arXiv preprint arXiv:1912.13414, 2019.

Ma, Z. and Collins, M. Noise contrastive estimation and neg- ative sampling for conditional models: Consistency and statistical efficiency. preprint arXiv:1809.01812, 2018.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. Preprint arXiv:1312.5602, 2013.

Nachum, O., Gu, S., Lee, H., and Levine, S. Near-optimal representation learning for hierarchical reinforcement learning. preprint arXiv:1810.01257, 2018.

Nguyen, X., Wainwright, M. J., and Jordan, M. I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.

Petrik, M., Ghavamzadeh, M., and Chow, Y. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, pp. 2298–2306, 2016.

Poole, B., Ozair, S., van den Oord, A., Alemi, A., and Tucker, G. On variational bounds of mutual information. preprint arXiv:1905.06922, 2019.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. preprint arXiv:1505.05770, 2015.

Rezende, D., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. preprint arXiv:1401.4082, 2014.

Shapiro, A., Dentcheva, D., and Ruszczy´nski, A. Lectures on stochastic programming: modeling and theory. SIAM, 2009.

Sohn, K., Lee, H., and Yan, X. Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pp. 3483–3491, 2015.

van den Oord, A., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. preprint arXiv:1807.03748, 2018.

Watter, M., Springenberg, J., Boedecker, J., and Riedmiller, M. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pp. 2746–2754, 2015.

Zhang, M., Vikram, S., Smith, L., Abbeel, P., Johnson, M., and Levine, S. Solar: Deep structured latent representations for model-based reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.

image

A.1. Connecting (SOC1) and (SOC1-E) with Next-observation Prediction

Recall that for an arbitrarily given encoder E the proxy cost function in the observation space is given by  cE(x, u) :=E�¯c(z, u) | E(x)�, where z is sampled from E(x). Equipped with this cost the only difference between (SOC1-E), i.e.,

minU L(U, p, cE, x0), and the original problem (SOC1), i.e.,  minU L(U, p, c, x0), is on the cost function used.

To motivate the heuristic method of learning an encoder E by maximizing the likelihood of the next-observation prediction model, we want to show there exists at least one latent cost function  ¯csuch that the aforementioned approach makes sense. Followed from the equivalence of the energy-based graphical model (Markov random field) and Bayesian neural network (Koller & Friedman, 2009), for any arbitrary encoder E there exists a latent dynamics model ˜Fand decoder ˜Dsuch that any energy-based LCE model that has an encoder model  E, namely qE(x′|x, u), can be written as  ( ˜D ◦ ˜F ◦ E)(x′|x, u).

Now, suppose for simplicity the observation cost is only state-dependent, and the latent cost  ¯cis constructed as follows: ¯c(z, u) :=�x′�z′ c(x′)d ˜F(z′|z, u)d ˜D(x′|z′). Then one can write  cE(x, u) =�x′ dqE(x′|x, u)c(x′), and this implies ��Ex′∼p(·|x,u)[c(x′)] − cE(x, u)�� ≤ cmax · DTV(p(·|x, u)||qE(·|x, u)),

where  DTVis the total variation distance of two distributions. Using analogous derivations of Lemma 11 in (Petrik et al., 2016), for the case of finite-horizon MDPs, one has the following chain of inequalities for any given control sequence {ut}T −1t=0 and initial observation  x0:

image

The first inequality is based on the result of the above lemma, the second inequality is based on Pinsker’s inequality, and the third inequality is based on Jensen’s inequality of�(·) function.

Notice that for any arbitrary action sequence it can always be expressed in form of deterministic policy  ut = π′(xt, t) withsome non-stationary state-action mapping  π′. Therefore, the KL term can be written as:

image

where the expectation is taken over the state-action stationary distribution of the finite-horizon problem that is induced by data-sampling policy U. The last inequality is due to change of measures in policy, and the last inequality is due to the facts that (i)  πis a deterministic policy, (ii)  dU(ut)is a sampling policy with lebesgue measure 1/U over all control actions, (iii) the following bounds for importance sampling factor holds:��� dπ′(ut|xt,t)dU(ut) ��� ≤ U.

image

Using the above results we now have the following sub-optimality performance bound between the optimizer of (SOC1), U ∗1 , and the optimizer of (SOC1-E),  U ∗1-E:

image

This shows that the performance gap between (SOC1) and (SOC1-E) is bounded by the prediction loss√2T 2 · cmaxU ·�Ex,u [DKL(p(·|x, u)||qE(·|x, u))]. Thus this result motivates the approach of learning the encoder model E of proxy cost by maximizing the likelihood of the next-observation prediction LCE model.

A.2. Proof of Lemma 1

We first provide the proof in a more general setting. Consider the data distribution p(x, y). Given any two representation functions  e : X → A and f : Y → B, we wish to inquire how good these two functions are for constructing a predictor of y given x. To do so, we introduce a restricted class of prediction models of the form

image

Let  q∗(y | x)denote the model that minimizes

Our goal is to upper bound the best possible loss  ℓ∗based on the mutual information gap  I(X ; Y ) − I(e(X) ; f(Y )). In particular, we find that

image

We prove via explicit construction of a model q(y | x) whose corresponding loss  ℓis exactly the mutual information gap. Let (X, Y ) be joint random variables associated with p(x, y). Let r(a | b) be the conditional distribution of a = e(x) given b = f(y) associated with the joint random variables (A, B) = (e(X), f(Y )). Simply choose

image

Then, by law of the unconscious statistician, we see that

image

Finally, we see that

Since  ℓ∗ ≤ ℓ, the mutual information gap thus upper bounds the loss associated with the best restricted predictor  q∗.

To complete the proof for Lemma 1, simply let

image

A.3. Proof of Lemma 2

For the first part of the proof, at any time-step  t ≥ 1, for any arbitrary control action sequence  {ut}T −1t=0, and any arbitrary latent dynamics model F, with a given encoder E consider the following decomposition of the expected cost: E[c(xt, ut) | P, x0] = E[c(zt, ut) | E, P, x0] =�x0:t�tk=1 P(xk|xk−1, uk−1) ·�zt E(zt|xt)¯c(zt, ut).Now consider the two-stage cost function:  E[c(xt−1, ut−1) + c(xt, ut) | P, x0]. One can express this cost function as

image

The last inequality is based on the chain of inequalities at any  (xt−2, ut−2) ∈ X × U:

image

in which the first one is based on triangle inequality and the second one is based on the non-expansive property of  DT V . Bycontinuing the above expansion, one can show that

image

where the second inequality is based on convexity of  DT V, the third inequality is based on Pinsker’s inequality and the last inequality is based on Jensen’s inequality of�(·) function.

For the second part of the proof, one can show the following chain of inequalities for solution of (SOC1-E) and (SOC2):

image

where the first and third inequalities are based on the first part of this lemma, and the second inequality is based on the optimality condition of problem (SOC2). This completes the proof.

In the following sections we will provide the description 4 control domains and implementation details used in the experiments.

B.1. Description of the domains

All control environments are the same as reported in (Levine et al., 2020), except that we report both balance and swing up tasks for pendulum, where the author only reported swing up.

B.2. Implementation details

image

SOLAR training specifics: We use their default setting:

Batch size of 2.

ADAM (Kingma & Ba, 2014) with  β1 = 0.9, β2 = 0.999, and  ϵ = 10−8. Learning rate  αmodel = 2 · 10−5 × horizonfor learning  MNIW prior and α = 10−3 for other parameters.

(βstart, βend, βrate) = (10−4, 10.0, 5 · 10−5)

Local inference and control:

image

PCC training specifics: We use their reported setting:

Batch size of  12812.

ADAM with  α = 5 · 10−4, β1 = 0.9, β2 = 0.999, and ϵ = 10−8.

L2 regularization with a coefficient of  10−3.

 (λp, λc, λcur) = (1, 8, 8), and δ = 0.01for the curvature loss. This setting is shared across all domains.

Additional VAE (Kingma & Welling, 2013) loss term  ℓVAE = −Eq(z|x)[log p(x|z)] + DKL(q(z|x)||p(z))with a very small coefficient of 0.01, where p(z) = N(0, 1).

Additional deterministic reconstruction loss with coefficient 0.3: given the current observation x, we take the means of the encoder output and the dynamics model output, and decode to get the reconstruction of the next observation.

PC3 training specifics:

Batch size of 256.

ADAM with  α = 5 · 10−4, β1 = 0.9, β2 = 0.999, and ϵ = 10−8.

L2 regularization with a coefficient of  10−3.

Latent noise  ϵ = 0.1 and λ1 = 1across all domains without any tuning.

 λ2was set to be 1 across all domains, after it was tuned using grid search in range {0.5, 0.75, 1} on Planar system.

 λ3was set to be 7 across all domains, after it was tuned using grid search in range {1, 3, 7} on Planar system.

 δ = 0.01for the curvature loss.

Additional loss  ℓadd = || 1N�Ni=1 zi||22with a very small coefficient of 0.01, which is used to center the latent space around the origin. We found this term to be important to stabilize the training process.

image

We next present the specific architecture choices for each domain. For fair comparison, the architectures were shared across all algorithms when possible, ReLU non-linearities were used between each two layers.

Encoder: composed of a backbone (either a MLP or a CNN, depending on the domain) and an additional fully-connected (FLC) layer that outputs either a vector (for PC3) or a Gaussian distribution (for PCC and SOLAR).

Latent dynamics (PCC and PC3): the path that leads from  {z, u} to z′, composed of a MLP backbone and an additional FLC layer that outputs either a vector (for PC3) or a Gaussian distribution (for PCC and SOLAR).

Decoder (PCC and SOLAR): composed of a backbone (either a MLP or a CNN, depending on the domain) and an additional FLC layer that outputs a Bernoulli distribution.

Backward dynamics: the path that leads from  {z′, u, x} to z. Each of the inputs goes through a FLC network  {Nz, Nu, Nx},respectively. The outputs are concatenated and passed through another FLC network  Njoint, and finally an additional FLC network which outputs a Gaussian distribution.

Planar system

Input:  40 × 40 images. 5000training samples of the form  (x, u, x′)for PCC and PC3, and 125 rollouts for SOLAR.

Actions space: 2-dimensional

Latent space: 2-dimensional

Encoder: 3 Layers: 300 units - 300 units - 4 units for PCC and SOLAR (2 for mean and 2 for variance) or 2 units for PC3

Dynamics: 3 Layers: 20 units - 20 units - 4 units for PCC and SOLAR or 2 units for PC3

Decoder: 3 Layers: 300 units - 300 units - 1600 units (logits)

Backward dynamics:  Nz = 5, Nu = 5, Nx = 100 − Njoint = 100 − 4 units

Planning horizon: T = 40

Initial standard deviation for collecting data (SOLAR): 1.5 for both global and local traning.

Inverted Pendulum  −Swing up and Balance

Input: Two  48 × 48images. 20000 training samples of the form  (x, u, x′)for PCC and PC3, and 200 rollouts for SOLAR.

Actions space: 1-dimensional

Latent space: 3-dimensional

Encoder: 3 Layers: 500 units - 500 units - 6 units for PCC and SOLAR or 3 units for PC3

Dynamics: 3 Layers: 30 units - 30 units - 4 units for PCC and SOLAR or 2 units for PC3

Decoder: 3 Layers: 500 units - 500 units - 4608 units (logits)

Backward dynamics:  Nz = 10, Nu = 10, Nx = 200 − Njoint = 200 − 6 units

Planning horizon: T = 100

Initial standard deviation for collecting data (SOLAR): 0.5 for both global and local training.

Cartpole

Input: Two  80 × 80images. 15000 training samples of the form  (x, u, x′)for PCC and PC3, and 300 rollouts for SOLAR.

Actions space: 1-dimensional

Latent space: 8-dimensional

Encoder: 6 Layers: Convolutional layer:  32 × 5 × 5; stride (1, 1) - Convolutional layer:  32 × 5 × 5; stride (2, 2) -Convolutional layer:  32 × 5 × 5; stride (2, 2) - Convolutional layer:  10 × 5 × 5; stride (2, 2) - 200 units - 16 units forPCC and SOLAR or 8 units for PC3

Dynamics: 3 Layers: 40 units - 40 units - 16 units for PCC and SOLAR or 8 units for PC3

Decoder: 6 Layers: 200 units - 1000 units - 100 units - Convolutional layer:  32 × 5 × 5; stride (1, 1) - Upsampling (2, 2) - Convolutional layer:  32 × 5 × 5; stride (1, 1) - Upsampling (2, 2) - Convolutional layer:  32 × 5 × 5; stride (1, 1) -Upsampling (2, 2) - Convolutional layer:  2 × 5 × 5; stride (1, 1)

Backward dynamics:  Nz = 10, Nu = 10, Nx = 300 − Njoint = 300 − 16 units

Planning horizon: T = 50

Initial standard deviation for collecting data (SOLAR): 10 for global and 5 for local training.

3-link Manipulator  − Swing up

Input: Two  80 × 80images. 30000 training samples of the form  (x, u, x′)for PCC and PC3, and 150 rollouts for SOLAR.

image

Encoder: 6 Layers: Convolutional layer:  32 × 5 × 5; stride (1, 1) - Convolutional layer:  32 × 5 × 5; stride (2, 2) -Convolutional layer:  32 × 5 × 5; stride (2, 2) - Convolutional layer:  10 × 5 × 5; stride (2, 2) - 200 units - 16 units forPCC and SOLAR or 8 units for PC3

image

Decoder: 6 Layers: 200 units - 1000 units - 100 units - Convolutional layer:  32 × 5 × 5; stride (1, 1) - Upsampling (2, 2) - Convolutional layer:  32 × 5 × 5; stride (1, 1) - Upsampling (2, 2) - Convolutional layer:  32 × 5 × 5; stride (1, 1) -Upsampling (2, 2) - Convolutional layer:  2 × 5 × 5; stride (1, 1)

image


Designed for Accessibility and to further Open Science