Learning Stabilizable Nonlinear Dynamics with Contraction-Based Regularization

2019·arXiv

Abstract

1 Introduction

The problem of efficiently and accurately estimating an unknown dynamical system,

$$\dot{x}(t) = F(x(t), u(t)), \qquad (1)$$

from a small set of sampled trajectories, where $x \in \mathbb{R}^n$ is the state and $u \in \mathbb{R}^m$ is the control input, is a central task in model-based Reinforcement Learning (RL). In this setting, a robotic agent strives to pair an estimated dynamics model with a feedback policy in order to act optimally in a dynamic and uncertain environment. The model of the dynamical system can be continuously updated as the robot experiences the consequences of its actions, and the improved model can be leveraged for different tasks, affording a natural form of transfer learning. When it works, model-based RL typically offers major improvements in sample efficiency in comparison to state-of-the-art model-free methods such as Policy Gradients (Chua et al., 2018; Nagabandi et al., 2017) that do not explicitly estimate the underlying system. Yet, all too often, when standard supervised learning with powerful function approximators such as Deep Neural Networks and Kernel Methods is applied to model complex dynamics, the resulting controllers do not perform on par with model-free RL methods in the limit of increasing sample size, due to compounding errors across long time horizons. The main goal of this paper is to develop a new control-theoretic regularizer for dynamics fitting rooted in the notion of stabilizability, which guarantees the existence of a robust tracking controller for arbitrary open-loop trajectories generated with the learned system.

Problem Statement: The motion planning task we wish to solve is to compute a (possibly non-stationary) policy mapping state and time to control that drives any given initial state to a desired compact goal region, while satisfying state and control input constraints, and minimizing some task-specific performance cost (e.g., control effort and time to completion). However, in this work, we assume that the dynamics function $F(x, u)$ is unknown to us and we are instead provided with a dataset of tuples $\{(x_i, u_i, \dot{x}_i)\}_{i=1}^{N}$ taken from a collection of observed trajectories (e.g., expert demonstrations) on the robot. Accordingly, the objective of this work is to learn a dynamics model $\hat{F}(x, u)$ for the robot that is subsequently amenable for use within standard planning algorithms.

Approach Overview: Our parametrization of the policy takes the form $u(x, t) = u^*(t) + k(x^*(t), x(t))$, where $(x^*(\cdot), u^*(\cdot))$ is a nominal open-loop state-input trajectory tuple, and $k(x^*, x)$ is a feedback tracking controller. The performance of such a policy, however, is strongly reliant upon the quality of the computed state-input trajectory and the tracking controller.

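Formally, a reference state-input trajectory tuple $(x^*(t), u^*(t))$, $t \in [0, T]$, for system (1) is termed exponentially stabilizable at rate $\lambda > 0$ if there exists a feedback controller $k(x^*, x)$ such that the solution $x(t)$ of the system

$$\dot{x}(t) = F\big(x(t),\, u^*(t) + k(x^*(t), x(t))\big)$$

converges exponentially to $x^*(t)$ at rate $\lambda$, that is,

$$\|x(t) - x^*(t)\|_2 \le C e^{-\lambda t}\, \|x(0) - x^*(0)\|_2$$

for some constant $C > 0$. The system (1) is termed exponentially stabilizable at rate $\lambda$ in an open, connected, bounded region $\mathcal{X} \subset \mathbb{R}^n$ if all state trajectories $x^*(t)$ satisfying $x^*(t) \in \mathcal{X}$ for all $t \in [0, T]$ are exponentially stabilizable at rate $\lambda$.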
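The exponential tracking bound $\|x(t) - x^*(t)\| \le C e^{-\lambda t} \|x(0) - x^*(0)\|$ can be checked numerically on a toy problem. The following is a minimal sketch, not taken from the paper: it assumes a hypothetical scalar system $\dot{x} = x + u$, a reference $x^*(t) = \sin t$ with nominal input $u^*(t) = \cos t - \sin t$, and a hand-picked feedback $k(x^*, x) = -(1 + \lambda)(x - x^*)$, for which the tracking error obeys $\dot{e} = -\lambda e$. The function names are illustrative only.

```python
import math


def simulate_tracking(x0, lam=2.0, dt=1e-3, T=3.0):
    """Simulate xdot = x + u under u = u*(t) + k(x*, x) with RK4.

    Assumed toy setup (not from the paper): x*(t) = sin(t),
    u*(t) = cos(t) - sin(t), k(x*, x) = -(1 + lam) * (x - x*).
    Returns the time grid and the tracking error |x(t) - x*(t)|.
    """
    def closed_loop(t, x):
        x_ref = math.sin(t)
        u_ref = math.cos(t) - math.sin(t)            # makes x* an open-loop solution
        u = u_ref - (1.0 + lam) * (x - x_ref)        # nominal input plus feedback
        return x + u

    t, x = 0.0, x0
    ts, errs = [t], [abs(x - math.sin(t))]
    for _ in range(int(round(T / dt))):
        k1 = closed_loop(t, x)
        k2 = closed_loop(t + dt / 2, x + dt / 2 * k1)
        k3 = closed_loop(t + dt / 2, x + dt / 2 * k2)
        k4 = closed_loop(t + dt, x + dt * k3)
        x += dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += dt
        ts.append(t)
        errs.append(abs(x - math.sin(t)))
    return ts, errs


def satisfies_exponential_bound(ts, errs, lam, C):
    """Check err(t) <= C * exp(-lam * t) * err(0) at every sample."""
    return all(e <= C * math.exp(-lam * t) * errs[0] for t, e in zip(ts, errs))


ts, errs = simulate_tracking(x0=1.0, lam=2.0)
print(satisfies_exponential_bound(ts, errs, lam=2.0, C=1.05))
```

In this toy case the bound holds at the designed rate but fails if one claims a faster rate than the feedback actually delivers, which is the sense in which the rate $\lambda$ is part of the definition.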

In this work, we illustrate that naïve regression techniques used to estimate the dynamics model from a small set of sample trajectories can yield model estimates that are severely ill-conditioned for trajectory generation and feedback control. Instead, this work advocates for the use of a constrained regression approach in which one attempts to solve the following problem:

$$\min_{\hat{F} \in \mathcal{H}} \ \sum_{i=1}^{N} \big\|\hat{F}(x_i, u_i) - \dot{x}_i\big\|_2^2 + \mu \|\hat{F}\|_{\mathcal{H}}^2 \quad \text{s.t.} \quad \hat{F} \text{ is stabilizable},$$

where $(x_i, u_i, \dot{x}_i)$, $i = 1, \ldots, N$, are the training tuples, $\mathcal{H}$ is an appropriate normed function space, and $\mu > 0$ is a regularization parameter. Note that we use $(\hat{\cdot})$ to differentiate the learned dynamics from the true dynamics. We demonstrate that for systems that are indeed stabilizable, enforcing such a constraint drastically prunes the hypothesis space, and therefore plays the role of a regularizer that is potentially more powerful and ultimately, more pertinent for the downstream control task of generating and tracking new trajectories.

Statement of Contributions: Stabilizability of trajectories is not only a complex task in non-linear control, but also a difficult notion to capture (in an algebraic sense) within a unified control theory. In this work, we leverage recent advances in contraction theory for control design through the use of Control Contraction Metrics (CCMs) (Manchester and Slotine, 2017; Singh et al., 2017) that turn stabilizability constraints into convex state-dependent Linear Matrix Inequalities (LMIs). Contraction theory (Lohmiller and Slotine, 1998) is a method of analyzing nonlinear systems in a differential framework, i.e., via the associated variational system (Crouch and van der Schaft, 1987, Chp 3), and is focused on the study of convergence between pairs of state trajectories towards each other. Thus, at its core, contraction explores a stronger notion of stability – that of incremental stability between solution trajectories, instead of the stability of an equilibrium point or invariant set. Importantly, we harness recent results in (Manchester et al., 2015; Manchester and Slotine, 2017; Singh et al., 2017) that illustrate how to use contraction theory to obtain a certificate for trajectory stabilizability and an accompanying tracking controller with exponential stability properties. For self containment, we provide a brief summary of these results in Section 3, which in turn will form the foundation of this work.

Our paper makes the following primary contributions.

• We formulate the learning stabilizable dynamics problem through the lens of control contraction metrics (Section 4). The resulting optimization problem is not only infinite-dimensional, as it is formulated over function spaces, but also infinitely-constrained due to the state-dependent LMI representing the stabilizability constraint.

• Under an arguably weak assumption on the structural form of the true dynamics model and a relaxation of the functional constraints to sampling-based constraints (Section 5), we derive a Representer Theorem (Schölkopf and Smola, 2001) specifying the form of the optimal solutions for the dynamics functions and the certificate of stabilizability by leveraging the powerful framework of vector-valued Reproducing Kernel Hilbert Spaces (Section 6). We motivate the sampling-based relaxation of the functional constraints from a standpoint of viewing the stabilizability condition as a novel control-theoretic regularizer for dynamics learning.

• Using theory from randomized matrix feature approximations, we derive a tractable algorithm that combines alternating convex optimization problems and adaptive sampling to iteratively solve a finite-dimensional optimization problem (Section 7).

• We perform an extensive set of numerical simulations on a 6-state, 2-input planar quadrotor model and provide a comprehensive study of various aspects of the iterative algorithm. Specifically, we demonstrate that naïve regression-based dynamics learning can yield estimated models that generate completely unstabilizable trajectories. In contrast, the control-theoretic regularized model generates vastly superior quality trackable trajectories, especially when learning from small supervised datasets (Sections 2.1 and 7.2).

• We validate our algorithm on a quadrotor testbed (Section 8) with partially closed control loops to emulate a planar quadrotor, where we verify that the stabilizability regularization effects in low-data regimes observed in simulations do indeed generalize to real-world noisy data. In particular, with just 150 noisy tuples of $(x, u, \dot{x})$, we are able to stably track a challenging test trajectory, which is generated with the learned model and substantially different from any of the training data. In contrast, a model learned using traditional regression techniques leads to consistently unstable behavior and eventual failure as the quadrotor repeatedly flips out of control and crashes (see Figure 1).

Figure 1: Time-lapse of a quadrotor trying to execute a figure-eight maneuver (blue curve) using a reference trajectory and an LQR feedback tracking controller generated using the learned dynamical system. Left: Model learned using traditional ridge-regression; Right: Model learned using control-theoretic regularization proposed within this work. The models were trained with the same, extremely limited (150 points) set of $(x, u, \dot{x})$ supervisory tuples. The quadrotor consistently failed and crashed into the floor with the trajectory and controller generated by the model learned with ridge-regression; the red triangles mark the points along the reference and actual trajectories at the moment of the crash – a separation of 1.6 m. In contrast, despite imperfect tracking (not unexpected given the extremely limited amount of supervision given to the learning algorithm), which leads to a slight graze along the floor at one point during the maneuver, the quadrotor manages to maintain bounded tracking error while using the model learned with control-theoretic regularization.

A preliminary version of this paper was presented at WAFR 2018 (Singh et al., 2018). In this revised and extended version, we include the following additional contributions: (i) rigorous derivation of the stabilizability-regularized finite-dimensional optimization problem using RKHS theory and random matrix features; (ii) extensive additional numerical studies into the convergence behavior of the iterative algorithm and comparison with traditional ridge-regression techniques; and (iii) validation of the algorithm on a quadrotor testbed with partially closed control loops to emulate a planar quadrotor.

Related Work: Model-based RL has enjoyed considerable success in various application domains within robotics such as underwater vehicles (Cui et al., 2017), soft robotic manipulators (Thuruthel et al., 2019), and control of agents with non-stationary dynamics (Ohnishi et al., 2019). While the literature on model-based RL is substantial (see (Polydoros and Nalpantidis, 2017) for a recent review), we focus our attention on five broad categories relevant to the problem we address in this work. Namely, these are: (i) direct regression for learning the full dynamics, where one ignores any control-theoretic notions tied to the learning task and treats dynamics estimation as a standard regression problem; (ii) residual learning, where one only attempts to learn corrections to a nominal prediction model that may have been derived, for example, from physics-based reasoning; (iii) uncertainty-aware model-based RL, where one tries to additionally represent the uncertainty in the learned model using probabilistic representations that are subsequently leveraged within the planning phase using robust or stochastic control techniques; (iv) hybrid model-based/model-free methods; and (v) imitation learning, where one learns dynamical representations of stable closed-loop behavior for a set of outputs (e.g., the end-effector on a robotic arm), and assumes knowledge of the robot controlled dynamics to realize the learned closed-loop motion, for instance, using dynamic inversion.

The simplest approach to learning dynamics is to ignore stabilizability and treat the problem as a standard one-step time series regression task (Punjani and Abbeel, 2015; Bansal et al., 2016; Nagabandi et al., 2017; Polydoros and Nalpantidis, 2017). However, coarse dynamics models trained on limited training data typically generate trajectories that rapidly diverge from expected paths, inducing controllers that are ineffective when applied to the true system. This divergence can be reduced by expanding the training data with corrections to boost multi-step prediction accuracy (Venkatraman et al., 2015, 2016). Despite being effective, these methods are still heuristic in the sense that the existence of a stabilizing feedback controller is not explicitly guaranteed. Alternatively, one can leverage strong physics-based priors and use learning to only regress the unmodeled dynamics. For instance, (Mohajerin et al., 2019; Shi et al., 2019; Punjani and Abbeel, 2015) aim to capture the unmodeled aerodynamic disturbance terms as corrections to a prior rigid body dynamics model. (Punjani and Abbeel, 2015) accomplish this for helicopter dynamics using a deep neural network, but then do not use the learned model for control. (Shi et al., 2019) attempt to capture the unmodeled ground-effect forces on quadrotors to build better controllers for near-ground tracking and precision landing. (Mohajerin et al., 2019) leverage a residual RNN in combination with a rigid-body model to generate time-series predictions for linear and angular velocities of a quadrotor as a function of current state and candidate future motor inputs, but do not use the model for closed-loop control. 
Finally, (Zhou et al., 2017) adopt a different perspective to learning “corrections” in that they attempt to learn the inverse dynamics (output to reference) for a system and pre-cascade the resulting predictions to correct an existing controller’s reference signal in order to improve trajectory tracking performance. The approach relies on the existence of a stabilizing controller and the stability of the system’s zero dynamics, thereby decoupling the effects of learning from stability. In similar spirit, (Taylor et al., 2019) leverage input-output feedback linearization to derive a Control Lyapunov Function (CLF) for a nominal dynamics model, assume that this function is a CLF for the actual dynamics as well, and regress only the correction terms in the derivative of this CLF. While leveraging physics-based priors can certainly be powerful, especially when the residual errors to be learned are small enough such that the system is feedback stabilizable with a controller derived from the physics model, in this work we are interested in the far more challenging scenario when such priors are unavailable and the full dynamics model must be learned from scratch. While exemplified using quadrotor models that can certainly be accurately stabilized even in the absence of learning, the insights provided in this work shed light on fundamental topics in the context of control-theoretic learning, which hopefully may influence dynamics-learning methods in more complex settings where priors are unavailable or too simple to be useful for adequate control.

An alternative strategy to cope with error in the learned dynamics model is to use uncertainty-aware model-based RL where control policies are optimized with respect to stochastic rollouts from probabilistic dynamics models (Kocijan et al., 2004; Kamthe and Deisenroth, 2018; Deisenroth and Rasmussen, 2011; Chua et al., 2018). For instance, PILCO (Deisenroth and Rasmussen, 2011) leverages a Gaussian Process (GP) state transition model and moment matching to analytically estimate the expected cost of a rollout with respect to the induced distribution. (Kamthe and Deisenroth, 2018) extend this formulation using nonlinear model predictive control (MPC) to incorporate chance constraints. (Chua et al., 2018) leverage an ensemble of probabilistic models to capture both epistemic (i.e., model) and aleatoric (i.e., intrinsic) uncertainty, and compute their control policy in receding horizon fashion through finite sample approximation of the random cost. Probabilistic models such as GPs may also be used to capture the residual error between a nominal physics-based model and the true dynamics. In (Ostafew et al., 2016), a GP is incrementally learned over multiple trials to capture unmodeled disturbances. The $3\sigma$ prediction range is subsequently leveraged to formulate chance constraints as a robust nonlinear MPC problem. The goal of (Fisac et al., 2017) and (Berkenkamp et al., 2017) is motivated from a safety perspective, where one wishes to actively learn a control policy while remaining “safe” in the presence of unmodeled dynamics, represented as GPs. The authors in (Fisac et al., 2017) leverage Hamilton-Jacobi reachability analysis to give high-probability invariance guarantees for a region of the state-space within which the learning controller is free to explore. On the other hand, (Berkenkamp et al., 2017) utilize Lyapunov analysis and smoothness arguments to incrementally grow the Lyapunov function’s region of attraction while simultaneously updating the GP.
For the special case where the underlying dynamics are linear-time-invariant, (Dean et al., 2019) derive high-probability convergence rates for the estimated model and leverage system-level robust control techniques (Wang et al., 2019) for guaranteeing state and control constraint satisfaction.

While utilizing probabilistic prediction models along with a control strategy that incorporates this uncertainty, such as robust or approximate stochastic MPC, can certainly help guard against imperfect dynamics models, large uncertainty in the dynamics can lead to overly conservative strategies. This is true especially when the learned model is not merely a correction or residual term, or if the probabilistic model is computationally intractable to use within planning (e.g., GPs without additional sparsifying simplifications), thereby forcing conservative approximations. Finally, with the exception of the “safe” RL methods mentioned above, the learning algorithms themselves do not incorporate knowledge of the downstream application of the function being regressed, in that learning is viewed purely from a statistical point-of-view, rather than within a control-theoretic context.

More recently, hybrid combinations of model-based and model-free techniques have gained attention within the learning community. The authors in (Bansal et al., 2017) use Bayesian optimization to find an optimal linear dynamics model whose induced MPC policy minimizes the task-specific cost. In similar spirit, (Amos et al., 2018) differentiate through the fixed-point solutions of a parametric MPC problem to find optimal MPC cost and dynamics functions in order to minimize the actual task-specific cost. (Nagabandi et al., 2017) use behavioral cloning with respect to an MPC policy generated from a learned dynamics model to initialize model-free policy fine-tuning. The works in (Levine et al., 2016; Finn et al., 2016; Chebotar et al., 2017) leverage subroutines where local time-varying dynamics are fitted around a set of policy rollouts, and then used to perform trajectory optimization via an LQR backward pass. The induced local linear-time-varying policy from this rollout is then used as a supervisory signal for global policy optimization. While these lines of work try to frame dynamics fitting within the downstream context of the task, thereby imbuing the resulting learning algorithm with a more closed-loop flavor, the learned dynamics may be substantially different from the actual dynamics of the robot since, with the exception of the local time-varying dynamics fitting, the true goal is to optimize the task-specific cost. This can yield distorted dynamic models whose induced policies are more cost-optimal than policies extracted from the true dynamics. Thus, while the work presented herein espouses a closed-loop learning ideology, it does so from the control-theoretic perspective of trajectory stabilizability, i.e., the true objective is dynamics fitting which will subsequently be used to derive optimal trajectories and tracking controllers.

Finally, we address lines of work closest in spirit to this work. Learning dynamical systems satisfying some desirable stability properties (such as asymptotic stability about an equilibrium point, e.g., for point-to-point motion) has been studied in the autonomous case, ˙x(t) = f(x(t)), in the context of imitation learning. In this line of work, one assumes perfect knowledge and invertibility of the robot’s controlled dynamics to solve for the input that realizes this desirable closed-loop motion (Lemme et al., 2014; Khansari-Zadeh and Khatib, 2017; Ravichandar et al., 2017; Khansari-Zadeh and Billard, 2011; Medina and Billard, 2017). In particular, for a vector-valued RKHS formulation in the autonomous case with constant (identity) contraction metric, see (Sindhwani et al., 2018). Crucially, in our work, we do not require knowledge or invertibility of the robot’s controlled dynamics. We seek to learn the full controlled dynamics of the robot, under the constraint that the resulting learned dynamics generate dynamically feasible and most importantly, stabilizable trajectories. Thus, this work generalizes existing literature by additionally incorporating the controllability limitations of the robot within the learning problem.

The tools we develop may also be used to extend standard adaptive robot control design, such as (Slotine and Li, 1987) – a technique which achieves stable concurrent learning and control using a combination of physical basis functions and general mathematical expansions, e.g., radial basis function approximations (Sanner and Slotine, 1992). Notably, our work allows us to handle complex underactuated systems – a consequence of the significantly more powerful function approximation framework developed herein, as well as of the use of a differential (rather than classical) Lyapunov-like setting, as we shall detail.

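Notation: Let $\mathbb{S}_n$ be the set of symmetric matrices in $\mathbb{R}^{n \times n}$, and denote by $\mathbb{S}_n^{\geq 0}$, respectively $\mathbb{S}_n^{> 0}$, the set of symmetric positive semi-definite, respectively positive definite, matrices in $\mathbb{R}^{n \times n}$. Given a matrix $X$, let $\widehat{X} := X + X^\top$. We denote the components of a vector $y \in \mathbb{R}^n$ as $y^j$, $j = 1, \ldots, n$, its Euclidean norm as $\|y\|$, and its weighted norm as $\|y\|_A := \sqrt{y^\top A y}$, where $A \in \mathbb{S}_n^{> 0}$. Let $\partial_y M(x)$ denote a matrix with $(i, j)$ entry given by the Lie derivative of the function $M_{ij}(x)$ along the vector $y$. Finally, let $\bar{\lambda}(A)$ and $\underline{\lambda}(A)$ denote the maximum and minimum eigenvalues of a square matrix $A$.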

2 Problem Formulation and Solution Methodology

In this section we formally outline the structure of the problem we wish to solve and describe a general solution methodology rooted in model-based RL. To motivate the contributions of this work, we additionally present an attempt at a solution that uses traditional model-fitting techniques, and demonstrate how it fails to capture the nuances of the problem and ultimately yields sub-par results.

Consider a robotic system with state $x \in \mathcal{X}$, where $\mathcal{X}$ is an open, connected, bounded subset of $\mathbb{R}^n$, and control $u \in \mathcal{U}$, where $\mathcal{U}$ is a closed, bounded subset of $\mathbb{R}^m$, governed by the following continuous-time dynamical system:

$$\dot{x}(t) = F(x(t), u(t)),$$

where $F$ is Lipschitz continuous in the state for fixed control, so that for any measurable control function $u(\cdot)$, there exists a unique state trajectory. The motion planning task we wish to solve is to find a (possibly non-stationary) policy that (i) drives the state $x$ to a compact goal region $\mathcal{X}_{\text{goal}} \subset \mathcal{X}$, (ii) satisfies the state and input constraints, and (iii) minimizes a quadratic cost accumulated up to time $T$, where $T$ is the first time at which $x(t) \in \mathcal{X}_{\text{goal}}$. While there exist several methods in the literature on how to solve this problem given knowledge of the dynamical system, in this work, we assume that we do not know the governing model $F(x, u)$. The problem we wish to address is how to solve the above motion planning task, given a dataset of tuples $\{(x_i, u_i, \dot{x}_i)\}_{i=1}^{N}$ from observed trajectories on the robot.

The solution approach presented in this work adopts the model-based RL paradigm, whereby one first estimates a model of the dynamical system $\hat{F}(x, u)$ using some form of regression, and then uses the learned model to solve the motion planning task with traditional planning algorithms. In this work, our strategy to solve the planning task is to parameterize general state-feedback policies as a sum of a nominal (open-loop) input and a feedback term designed to track the nominal state trajectory (induced by the nominal input $u^*$):

$$u(x(t), t) = u^*(t) + k(x^*(t), x(t)).$$

This formulation represents a compromise between the general class of state-feedback control laws (a computationally intractable space over which to optimize) and a purely open-loop formulation (i.e., no tracking). Note that we do not present a new methodology for solving the planning task. Specifically, it is assumed that there exists an algorithm for computing (i) the open-loop state and control trajectories $(x^*(\cdot), u^*(\cdot))$ that minimize the open-loop cost, and (ii) the feedback tracking controller $k(x^*, x)$, given a dynamical model. The focus of this paper is on how to design the regression algorithm for computing the model estimate $\hat{F}$.

2.1 Motivating Example

We ground the formalism within the following running example that will feature throughout this work.

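Example 1 (PVTOL). Consider the 6-state planar vertical-takeoff-vertical-landing (PVTOL) system depicted in Figure 2. The system is defined by the state $x = (p_x, p_z, \phi, v_x, v_z, \dot{\phi})$, where $(p_x, p_z)$ is the position in the 2D plane, $(v_x, v_z)$ is the body-reference velocity, and $(\phi, \dot{\phi})$ are the roll and angular rate respectively, and $u = (u_1, u_2)$ are the controlled motor thrusts. The true dynamics are given by:

$$\dot{x}(t) = \begin{bmatrix} v_x \cos\phi - v_z \sin\phi \\ v_x \sin\phi + v_z \cos\phi \\ \dot{\phi} \\ v_z \dot{\phi} - g \sin\phi \\ -v_x \dot{\phi} - g \cos\phi \\ 0 \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 1/m & 1/m \\ l/J & -l/J \end{bmatrix} u,$$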

where g is the acceleration due to gravity, m is the mass, l is the moment-arm of the thrusters, and J is the moment of inertia about the roll axis.
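For concreteness, the PVTOL model can be written as a short function. This is a sketch under stated assumptions rather than the authors' code: it assumes the commonly used body-frame planar quadrotor equations, the state ordering $(p_x, p_z, \phi, v_x, v_z, \dot{\phi})$, inputs ordered as (right thrust, left thrust), and placeholder parameter values for $m$, $l$, and $J$ that are not taken from the paper.

```python
import math


def pvtol_dynamics(x, u, m=1.0, l=0.25, J=0.01, g=9.81):
    """Time derivative of the PVTOL state (assumed body-frame model).

    x = (px, pz, phi, vx, vz, phi_dot); u = (u1, u2) are the two thrusts.
    The default values of m, l, J are illustrative placeholders.
    """
    px, pz, phi, vx, vz, phi_dot = x
    u1, u2 = u
    return [
        vx * math.cos(phi) - vz * math.sin(phi),            # px_dot (inertial frame)
        vx * math.sin(phi) + vz * math.cos(phi),            # pz_dot (inertial frame)
        phi_dot,                                            # roll rate
        vz * phi_dot - g * math.sin(phi),                   # vx_dot (body frame)
        -vx * phi_dot - g * math.cos(phi) + (u1 + u2) / m,  # vz_dot (body frame)
        l * (u1 - u2) / J,                                  # roll acceleration
    ]


# Hover check: at rest with level attitude, each motor carries half the weight.
hover = pvtol_dynamics([0.0] * 6, [0.5 * 9.81, 0.5 * 9.81])
print(hover)
```

The model is control-affine: the thrusts enter linearly and only through the last two state derivatives, which is the sparsity structure that a learned input matrix would need to replicate.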

Figure 2: Definition of planar quadrotor state variables: $l$ denotes the thrust moment arm (symmetric), and $u_1$ and $u_2$ denote the right and left thrust forces respectively.

The planar quadrotor is a complex non-minimum phase dynamical system that has been heavily featured within the acrobatic robotics literature and therefore serves as a suitable case-study.

2.1.1 Solution Parametrization

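The dynamics assume the general control-affine form:

$$\dot{x}(t) = f(x(t)) + B(x(t))\, u(t),$$

where $B(x) \in \mathbb{R}^{n \times m}$ is the input matrix, written in column-stacked form as $(b_1, \ldots, b_m)$. Let us define the model estimate also in control-affine form as $\dot{x} = \hat{f}(x) + \hat{B}(x) u$, where $\hat{B} = (\hat{b}_1, \ldots, \hat{b}_m)$. Consider, as a first solution attempt, the following linear parametrization for the vector-valued functions $\hat{f}$ and $\hat{b}_j$:

$$\hat{f}(x) = \Phi_f(x)^\top \alpha, \qquad \hat{b}_j(x) = \Phi_b(x)^\top \beta_j, \quad j = 1, \ldots, m,$$

where $\alpha \in \mathbb{R}^{d_f}$ and $\beta_j \in \mathbb{R}^{d_b}$ are constant vectors to be optimized over, and $\Phi_f(x) \in \mathbb{R}^{d_f \times n}$ and $\Phi_b(x) \in \mathbb{R}^{d_b \times n}$ are a priori chosen feature mappings. To replicate the sparsity structure of the PVTOL input matrix, the feature matrix $\Phi_b$ has all zeros in its first $n - m$ columns.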

The justification for a linear model and the construction of the feature mappings will be elaborated upon later. At this moment, we wish to study the quality of the learned models obtained from solving the following convex optimization problem:

$$\min_{\alpha,\, \beta_1, \ldots, \beta_m} \ \sum_{i=1}^{N} \big\|\hat{f}(x_i) + \hat{B}(x_i) u_i - \dot{x}_i\big\|_2^2 + \mu_f \|\alpha\|_2^2 + \mu_b \sum_{j=1}^{m} \|\beta_j\|_2^2,$$

where $\mu_f, \mu_b \geq 0$ are given regularization constants. Note that the above optimization corresponds to the ubiquitous ridge-regression problem and is therefore a viable solution approach.
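To make the ridge-regression baseline concrete, here is a minimal NumPy sketch of fitting a control-affine model that is linear in its parameters. It is an illustration under assumptions, not the paper's implementation: the feature map below (`features`, `matrix_features`) is a small hand-picked stand-in for the random matrix features used in the paper, `mu_f` and `mu_b` are assumed names for the two regularization constants, and the usage example is a hypothetical pendulum-like system rather than the PVTOL.

```python
import numpy as np


def features(x):
    """Hypothetical scalar features phi(x) = [1, x, sin(x), cos(x)]."""
    return np.concatenate(([1.0], x, np.sin(x), np.cos(x)))


def matrix_features(x):
    """Matrix feature map of shape (n * d, n): one copy of phi per state dimension."""
    return np.kron(np.eye(x.size), features(x)[:, None])


def fit_ridge(X, U, Xdot, mu_f=1e-6, mu_b=1e-6):
    """Solve the ridge problem for xdot ~ Phi(x)^T alpha + sum_j u_j Phi(x)^T beta_j."""
    m = U.shape[1]
    rows = []
    for x, u in zip(X, U):
        Phi_T = matrix_features(x).T                       # shape (n, n * d)
        rows.append(np.hstack([Phi_T] + [u_j * Phi_T for u_j in u]))
    A = np.vstack(rows)                                    # stacked design matrix
    y = Xdot.reshape(-1)                                   # stacked targets
    p = A.shape[1] // (m + 1)                              # parameters per block
    reg = np.concatenate([np.full(p, mu_f), np.full(m * p, mu_b)])
    theta = np.linalg.solve(A.T @ A + np.diag(reg), A.T @ y)
    return theta[:p], theta[p:].reshape(m, p)              # alpha, (beta_1..beta_m)


def predict(alpha, betas, x, u):
    """Evaluate the learned control-affine model at (x, u)."""
    Phi_T = matrix_features(x).T
    return Phi_T @ alpha + sum(u_j * (Phi_T @ b) for u_j, b in zip(u, betas))


# Usage on a toy system xdot = (x2, -sin(x1) + u), which lies in the feature span.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 2))
U = rng.uniform(-1.0, 1.0, size=(200, 1))
Xdot = np.stack([X[:, 1], -np.sin(X[:, 0]) + U[:, 0]], axis=1)
alpha, betas = fit_ridge(X, U, Xdot, mu_f=1e-8, mu_b=1e-8)
print(predict(alpha, betas, np.array([0.3, -0.7]), np.array([0.5])))
```

Note that this objective only penalizes one-step predierror on the training tuples and the parameter norms; nothing in it encodes whether trajectories of the fitted model can be stabilized, which is the gap the proposed regularizer targets.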

To evaluate the feasibility of this solution approach, we extracted a collection of training tuples $\{(x_i, u_i, \dot{x}_i)\}_{i=1}^{N}$ from simulations of the PVTOL system without any noise (for further details, please see Section 7.2). We learned three models: (i) N-R: un-regularized model; (ii) R-R: standard ridge-regularized model; and (iii) CCM-R: control-theoretic regularized model, corresponding to the algorithm proposed within this work and elaborated upon in the remainder of the paper.

We learned four versions of each model, corresponding to four training dataset sizes $N$. The dimensions of the parameter vectors $\alpha$ and $\beta_j$ were both 576 (corresponding to 96 parameters per state dimension). The feature mappings themselves are described in Section 7.2 and Appendix A. The regularization constants were held fixed for all $N$.

2.1.2 Evaluation

The evaluation corresponded to the motion planning task of generating and tracking trajectories using the learned models. We gridded the $(p_x, p_z)$ plane to create a set of 120 initial conditions between 4 m and 12 m away from (0, 0), and randomly sampled the other states for the rest of the initial conditions. These conditions were held fixed for all models and for all training dataset sizes to evaluate model improvement.

For each model at each value of $N$, the evaluation task was to (i) solve a trajectory optimization problem to compute a dynamically feasible trajectory for the learned model to go from the initial state to the goal state – a stable hover at (0, 0) at near-zero velocity; and (ii) track this trajectory with a feedback controller computed using time-varying LQR (TV-LQR). Note that all simulations without any feedback controller (i.e., open-loop control rollouts) led to the PVTOL crashing. This is understandable since the dynamics fitting objective does not optimize for multi-step error. The trajectory optimization step was solved as a fixed-endpoint, fixed-final time optimal control problem using the Chebyshev pseudospectral method (Fahroo and Ross, 2002) with the objective of minimizing the open-loop cost. The final time $T$ for a given initial condition was held fixed between all models. Note that 120 trajectory optimization problems were solved for each model and each value of $N$.
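The tracking step can be illustrated with a finite-horizon, time-varying LQR backward pass. The sketch below is a discrete-time version under assumptions not stated in the paper: it presumes that linearizations $(A_k, B_k)$ of the learned model along the reference trajectory are already available, and the function name and cost weights are hypothetical. The resulting tracking law would be $u_k = u^*_k - K_k (x_k - x^*_k)$.

```python
import numpy as np


def tvlqr_gains(A_seq, B_seq, Q, R, Q_final):
    """Backward Riccati recursion for discrete-time, time-varying LQR.

    A_seq[k], B_seq[k] are assumed linearizations of the model about the
    reference at step k. Returns gains K_k for u_k = u*_k - K_k (x_k - x*_k).
    """
    P = Q_final
    gains = []
    for A, B in zip(reversed(A_seq), reversed(B_seq)):
        # K = (R + B^T P B)^{-1} B^T P A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati update: P = Q + A^T P (A - B K)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]


# Usage on a time-invariant double integrator with a 0.1 s step (toy example).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
gains = tvlqr_gains([A] * 200, [B] * 200, np.eye(2), np.eye(1), np.eye(2))
print(np.abs(np.linalg.eigvals(A - B @ gains[0])))
```

Because the gains are computed from the learned model's linearization, their quality is only as good as the model itself, which is why a poorly conditioned model can yield a controller that fails on the true system.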

Figure 4 shows a boxplot comparison of the trajectory-wise RMS full state errors ($\|x(t) - x^*(t)\|_2$, where $x^*(t)$ is the reference trajectory obtained from the optimizer and $x(t)$ is the actual realized trajectory) for each model and all training dataset sizes. As $N$ increases, the spread of the RMS errors decreases for both R-R and CCM-R models as expected. However, we see that the N-R model generates