Deep Learning Tubes for Tube MPC

2020·Arxiv

Abstract

Abstract

Learning-based control aims to construct models of a system to use for planning or trajectory optimization, e.g. in model-based reinforcement learning. In order to obtain guarantees of safety in this context, uncertainty must be accurately quan-tified. This uncertainty may come from errors in learning (due to a lack of data, for example), or may be inherent to the system. Propagating uncertainty forward in learned dynamics models is a difficult problem. In this work we use deep learning to obtain expressive and flexible models of how distributions of trajectories behave, which we then use for nonlinear Model Predictive Control (MPC). We introduce a deep quantile regression framework for control that enforces probabilistic quantile bounds and quantifies epistemic uncertainty. Using our method we explore three different approaches for learning tubes that contain the possible trajectories of the system, and demonstrate how to use each of them in a Tube MPC scheme. We prove these schemes are recursively feasible and satisfy constraints with a desired margin of probability. We present experiments in simulation on a nonlinear quadrotor system, demonstrating the practical efficacy of these ideas.

I. INTRODUCTION

In controls and planning, the idea of adapting to unknown systems and environments is appealing; however, guaranteeing safety and feasibility in the midst of this adaptation is of paramount concern. The goal of robust MPC is to take into account uncertainty while planning, whether it be from modeling errors, unmodeled disturbances, or randomness within the system itself [2]. In addition to safety, other considerations such as optimality, real-time tractability, scalability to high dimensional systems, and hard state and control constraints make the problem more difficult. In spite of these difficulties, learning-based robust MPC continues to receive much attention [18, 49, 38, 19, 15, 10, 36, 35, 1, 5]. However, in an effort to satisfy the many competing design requirements in this space, certain restrictive assumptions are often made, which include predetermined error bounds, restricted classes of dynamics models, or fixed parameterizations of the uncertainty.

Consider the following nonlinear dynamics equation that describes a real system:

where is the state, are controls, and is noise or disturbance.

When attempting to find a model which captures the behavior of , there will be error that results from insufficient data, lack of knowledge of , or unknown or unobserved higherdimensional dynamics not observed in . One traditional approach has been to find robust bounds on the model error

Fig. 1. A learned tube (green) with learned mean (blue) that captures the distribution of trajectories (cyan) on a full quadrotor model tracking a target trajectory (black), propagated for 200 timesteps forward from the initial states (dots).

and plan using this robust model, i.e. . However, this approach can be too conservative since it is not time or space varying and does not capture the distribution of the disturbance [1, 42]. To partially address this one could extend W to be time and state-varying, i.e. , as is commonly done in the robust MPC and control literature. For example, [31] takes this approach for feedback linearizable systems using boundary layer control, [44] leverages contraction theory and sum-of-squares optimization to find stabilizing controllers for nonlinear systems under uncertainty, and [48] solves for forward invariant tubes using min-max differential inequalities (See [25] for a recent overview of other related approaches). In this work we aim to learn this uncertainty directly from data, which allows us to avoid structural assumptions of the system of interest or restrictive parameterizations of uncertainty. We learn a quantile representation of the bounds of the distribution of possible trajectories, in the form of a tube around some nominal trajectory (Figure 1).

More closely related to our approach is the wide range of recent work in learning-based planning and control that seeks to handle model uncertainty probabilistically, where a model is constructed from one-step prediction measurements, and it is assumed that the true underlying distribution of the function is Gaussian [11, 20, 8, 30, 3]:

where the mean function and variance function capture the uncertainty of the dynamics for one time step. Various approaches for approximating this posterior distribution have been developed [16, 14]. For example, in PILCO and related work [20], moment matching of the posterior distribution is performed to find an analytic expression for the evolution of the mean and the covariance in time. However, in order to arrive at these analytic expressions, assumptions must be made which lower the descriptive power for the model to capture the true underlying distribution, which

Fig. 2. Comparison of 3-bounds on distributions of trajectories using GP moment matching (red) and the proposed quantile regression method (green). 100 sampled trajectories are shown (cyan) along with starting and ending distributions (blue, left and right histograms). Left: GP moment matching overestimates the distribution for the dynamics , while our method models it well. Right: GP moment matching underestimates the distribution for the dynamics , while our method captures the tails of the distribution.

may be multi-modal and highly non-Gaussian. Furthermore, conservative estimates of the variance of the distribution will grow in an unbounded manner as the number of timesteps increases [26]. The result is that any chance constraints derived from these approximate models may be inaccurate. In Figure 2 we compare the classic GP-based moment matching approach for propagating uncertainty with our own deep quantile regression method on two different functions. While GP-moment matching can both underestimate and overestimate the true distribution of trajectories, our method is less prone to failures due to analytic simplifications or assumptions.

An alternative approach to Bayesian modeling for robust MPC has been to use quantile bounds to bound the tails of the distribution. This has the advantage that for planning in safetycritical contexts, we are generally not concerned with the full distribution of the trajectories, but the tails of these distributions only; specifically, we are interested in the probability of the tail of the distribution violating a safe set. A few recent works have taken this approach in the context of MPC; for example, [4] computes back-off sets with Gaussian Processes, and [5] uses an adaptive control approach to parameterize quantile bounds.

We are specifically interested in the idea of learning quantile bounds using the expressive power of deep neural networks. Quantile bounds give an explicit probability of violation at each timestep and allow for quantifying uncertainty which can be non-Gaussian, skewed, asymmetric, multimodal, and heteroskedastic [46]. Quantile regression itself is a well-studied field with the first results from [23], see also [24, 47]. Quantile regression in deep learning has been also recently considered as a general statistical modeling tool [39, 54, 41, 51, 46]. Bayesian quantile regression has also been studied [27, 52]. Recently quantile regression has gained popularity as a modeling tool within the reinforcement learning community [7].

In addition to introducing a method for deep learning quantile bounds for distributions of trajectories, we also show how this method can be tailored to a tube MPC framework. Tube MPC [28, 33] was introduced as a way to address some of the shortcomings of classic robust MPC; specifically that robust MPC relied on optimizing over an open-loop control sequence, which does not predict the closed-loop behavior well. Instead, tube MPC seeks to optimize over a local policy that generates some closed-loop behavior, which has advantages of robust constraint satisfaction, computational efficiency, and better performance.

Fig. 3. Diagram of a tube around the dynamics of z, within which x stays invariant. Note that the tube set is time-varying.

The use of tube MPC allows us to handle high dimensional systems, as well as making the learning problem more efficient, tractable, and reliable. To the best of our knowledge, our work is the first to combine deep quantile regression with tube-based MPC, or indeed any learning-based robust MPC method.

The structure of the paper is as follows: In Section II we present our approach for learning tubes, which includes deep quantile regression, enforcing a monotonicity condition with a negative divergence loss function, and quantifying epistemic uncertainty. In Section III we present three different learning tube MPC schemes that take advantage of our method. In Section IV we perform several experiments and studies to validate our method, and conclude in Section V.

II. DEEP LEARNING TUBES

A. Learning Tubes For Robust and Tube MPC

We propose learning time-varying invariant sets as a way to address the difficulties with propagating uncertainty for safety critical control, as well as to characterize the performance of a learned model or tracking controller. Consider the following quantile description of the dynamics:

where is a latent state of equal or lower dimension than is a pseudo-control input, also of equal or lower dimension than . In the simplest case, we can fix vector that we call the tube width, with each element of

This defines a ”tube” around the trajectory of z within which x will stay close to z with probability greater than (Figure 3). More formally, we can define the notion of closeness between some x and z by, for example, the distance between z and the projection of x onto , where is a projection operator. Let be a set in X associated with the tube width

where the is element-wise. Other tube parameterizations are possible, for example where

The coupled system (3) induces a sequence of sets that form a tube around . Our goal is to learn

Fig. 4. Learning tube dynamics from data. Left: The predicted tube at t+1 is too small. The gradient of the loss function will increase its size. Middle: The predicted tube at t+1 is larger than the actual trajectory in x taken, and will be shrunk. Right: The mapping is monotonic with respect to , which results in

how this tube changes over time in order to use it for planning safe trajectories.

B. Quantile Regression

Our challenge is to learn the dynamics of the tube width, . Given data collected as trajectories , we can formulate the learning problem for as follows.

Let be parameterized with a neural network, . For a given t and data point , let be the input tube width to the candidate output tube width. The candidate tube width at t+1 must be less than the estimate of the tube width at t+1, i.e: . To train the network to respect these bounds we can use the following check loss function:

where the loss is a function of each data sample . With the assumption of i.i.d. sampled data, when is minimized the quantile bound will be satisfied, (see Figure 4 and Theorem II.1). In practice we can smooth this loss function near the inflection point y=r with a slight modification, by multiplying with a Huber loss [21, 7].

Theorem II.1. Let minimize . Then with probability is an upper bound for

Proof. With a slight abuse of notation, let x denote the input variable to the loss function, and consider the expected loss . We find the minimum of this loss w.r.t. r by setting the gradient to 0:

Replacing completes the proof.

Note that quantile regression gives us tools for learning tube dynamics that are a function of the quantile probability as well. This opens the possibility to dynamically varying the margin of safety while planning, taking into account acceptable risks or value at risk [12]. For example, in planning a trajectory, one could choose a higher for the near-term and lower in the later parts of the trajectory, reducing the conservativeness of the solution.

Additionally, we note that we can train the tube bounds dynamics in a recurrent fashion to improve long sequence prediction accuracy. While we present the above and following theorems in the context of one timestep, they are easily extensible to the recurrent case.

C. Enforcing Monotonicity

In addition to the quantile loss we also introduce an approach to enforce monotonicity of the tube with respect to the tube width (Figure 4, right). This is important for ensuring recursive feasibility of the MPC problem, as well as allowing us to shrink the tube width during MPC at each timestep if we obtain measurement updates of the current state, or, in the context of state estimation, an update to the covariance of the estimate of the current state. Enforcing monotonicity in neural networks has been studied with a variety of techniques [43, 53]. Here we adopt the approach of using a loss function that penalizes the network for having negative divergence, similar to [17]:

where divis the divergence of with respect to . In practice we find that under gradient-based optimization, this loss decreases to 0 in the first epoch and does not noticeably affect the minimization of the quantile loss. Minimizing us to make claims about the monotonicity of the learned tube: Theorem II.2. Suppose minimizes and . Then for any and , if , then

Proof. Since and , then . Then and is nondecreasing with respect to . Since , then , so , which implies that

D. Epistemic Uncertainty

Finally, in order to account for uncertainty in regions where no data is available for estimating quantile bounds, we incorporate methods for estimating epistemic uncertainty. Such methods can include Bayesian neural networks, Gaussian Processes, or other heuristic methods in deep learning [13, 7, 37]. For the experiments in this work we adopt an approach that adds an additional output layer to our quantile regression network that is linear with respect to orthonormal weights [46]. We emphasize that a wide range of methods for quantifying epistemic uncertainty are available and we are not restricted to this one approach; however, for the sake of clarity, we present in detail our method of choice. Let g(z,v,t) be a neural network with either fixed weights that are either

Fig. 5. Estimating epistemic uncertainty for a 1-D function. Black dots indicate noisy data used to train the models, black line indicates the true function. Green colors indicate trained neural network models with green line indicating mean, and green dotted lines indicating learned 99% quantile bounds. Green shading indicates increased quantile bounds scaled by the learned epistemic uncertainty. Blue line and shading is GP regression with 99% bounds for comparison.

randomly chosen or pre-trained, with l dimensional output. We branch off a second output with a linear layer: where . The estimate of epistemic uncertainty is chosen as . Then, the parameters C are trained by minimizing the following loss:

where weights the orthonormal regularization. Minimizing this loss produces a network that has a value close to 0 when the input data is in-distribution, and increases with known rate as the input data moves farther from the training distribution (Figure 5, and see [46] for detailed analysis). We scale the predicted quantile bound by the epistemic uncertainty, then add a maximum bound to prevent unbounded growth as

where is a constant parameter that scales the effect of the epistemic uncertainty, and W is a vector that provides an upper bound on the total uncertainty. Finding an optimal analytically may require some assumptions such as a known Lipschitz constant of the underlying function, non-heteroskedastic noise, etc., which we leave for future investigation. We set and W by hand and find this approach to be effective in practice.

We expect that as the field matures, methods for providing guarantees on well-calibrated epistemic uncertainty in deep learning will continue to improve. In the meantime, we make the assumption that we have well-calibrated epistemic uncertainty, an assumption similar to those made with other learning-based controls methods, such as choosing noise covariances, disturbance magnitudes, or kernel types and widths. The main benefit of leveraging epistemic uncertainty modeling is that it allows us to maintain guarantees of safety and recursive feasibility when we have a limited amount of data to learn from. In the case when no reliable epistemic estimate is available, we can proceed if we simply assume there is sufficient data to learn a good model offline.

III. THREE WAYS TO LEARN TUBES FOR TUBE MPC

In this section we present three variations for applying our deep quantile regression approach to MPC problems, whose applicability may vary based on what components are available to the designer. By leveraging the previously described theorems for ensuring accurate quantiles, monotonicity, and uncertainty of the tube width dynamics, we can guarantee recursive feasibility of these MPC schemes, while ensuring that the trajectory of the system remains within a safe set with probability at each timestep. The three different approaches require different elements of the system to be known or given, and are summarized as:

1) Given a tracking control law and reference trajectory dynamics , construct an invariant tube with the reference trajectory at its center (Figure 3).

2) Given a tracking control law and reference trajectory dynamics , construct a model of the dynamics of the error , then learn an invariant tube with z+e as its center (Figure 6a).

3) From data generated from any control law, random or otherwise, learn a reduced representation of the dynamics (and optionally, a policy to track it), along with tube bounds on the tracking error (Figure 6b).

A. Learning Tube Dynamics for a Given Controller

We first consider the case where we are given a fixed ancillary controller (or potentially with a feed-forward term v), along with nominal dynamics that are used for planning and tracking in the classic tube MPC manner [32]. For now our goal is to learn

We sum the three losses discussed in the previous section:

to learn , and find and via stochastic gradient descent. Next, we perform planning on the coupled z and tube dynamics in the following nonlinear MPC problem. Let denote the planning horizon. We use the subscript notation to denote the variable for within the MPC problem at time denote the set of variables . Then, at time t, the MPC problem is:

Let denote the minimizer of the problem at time t. Note that we include in the cost, which allows us to encourage larger or smaller tube widths. The tube width is updated based on a measurement from the system, or can also be updated with information from a state estimator. In the absence of measurements we can also carry over the past optimized tube width, i.e. , as long as . The closed-loop control is set to

and the tracking target for the underlying policy is Under these assumptions we have the following theorem establishing recursive feasibility and safety:

Theorem III.1. Suppose that the MPC problem (11) is feasible at t=0. Then the problem is feasible for all and at each timestep the constraints are satisfied with probability

Proof. The proof is similar to that in [25] for general set-based robust adaptive MPC. Let and choose any such that (if measurements are unavailable, one can use ). With probability due to Theorem II.1. Let for , and let . Then is a feasible solution for the MPC problem at t=1, due to the terminal constraints (11e,11f) as well as the monotonicity of with respect to (Theorem II.2).

Since is nonlinear we find solutions to the MPC problem via iterative linear approximations, yielding an SQP MPC approach [9, 6]. Other optimization techniques are possible, including GPU-accelerated sampling-based ones [50]. We outline the entire procedure in Algorithm 1.

B. Learning Tracking Error Dynamics and Tube Dynamics

Next we show how to learn error dynamics along with a tube centered along these dynamics, where is the error between x and z, with x projected onto Z. These error dynamics function as the mean of the distribution of dynamics when the tracking policy is used . This allows the tube to take on a more accurately parameterized shape (Figure 6a). Setting up the learning problem in this way offers several distinct advantages. First, rather than relying on an accurate nominal model and learning the bounds between this model and the true dynamics, we directly characterize the difference between the two models with . This means that can be chosen more arbitrarily and does not need to be a high-fidelity dynamics model. Second, using the nominal dynamics as an input to

Fig. 6. (a) Learning error dynamics along with tube dynamics line is the nominal trajectory , blue line is data collected from the system. Cyan indicates tracking errors, whose dynamics are learned. Grey tube denotes , which captures the error between the true dynamics and . (b) Fitting learned dynamics to actual data. Blue inline indicates data collected from the system, black line is a learned dynamics trajectory fitted to the data.

and learning the error ”anchors” our prediction of the behavior of . This allows us to predict the expected distribution of with much higher accuracy for long time horizons, in contrast to the approach of learning a model f directly and propagating it forward in time, where the error between the learned model and the true dynamics tends to increase with time.

As before, we assume we have a known and nominal dynamics . Let be a set in X associated with the tube width

where the is element-wise. We have the following description of the error dynamics:

Given a dataset , we minimize the following loss over data samples in order to learn , which we parameterize with

Next, we learn by minimizing the quantile loss (10). However, while in the previous section approximate the tube width with . We obtain by propagating the learned dynamics forward in time, given . Then we can solve a similar tube-based robust MPC problem (15):

Notice that the cost and constraints are now a function of

and do not depend on only. This means that we are free to find paths for the tracking controller to track, which may violate constraints. We maintain the same guarantees of feasibility and constraint satisfaction as in Theorem III.1. Since the proof is similar we omit it for brevity. See Algorithm 2.

Theorem III.2. Suppose that the MPC problem (15) is feasible at t=0. Then the problem is feasible for all and at each timestep the constraints are satisfied with probability

C. Learning System Dynamics and Tube Dynamics

In our third approach to learning tubes, we wish to learn the dynamics directly without a prior nominal model . We restrict Z=X and V=U, and treat z as an approximation of x. Our goal is to learn to approximate f, along with that will determine a time-varying upper bound on the model error. Typically the open-loop model error will increase in time in an unbounded manner, which may make it difficult to find a feasible solution to the MPC problem. One approach is to assume the existence of a stabilizing controller and terminal set, and use a terminal condition that ensures the trajectory ends in this set [22, 26]. A second approach is to find a feedback control law to ensure bounded tube widths. We describe the latter approach in more detail, but do not restrict ourselves to it.

Using a standard L2 loss function, we first learn an approximation of with parameters

Next, we learn a policy with parameters by inverting the dynamics:

Algorithm 3: Learning Dynamics and Model Error Bounds for Tube MPC

1 Require: Safe set C, Quantile probability . MPC horizon T. 2 Initialize: Neural network for policy

dynamics . Initial state 3 Solve MPC problem (11) for initial feasible control sequence

15 Append data to dataset

By learning a policy in this manner we decouple the potentially inaccurate model from the true dynamics, in a learning inverse dynamics fashion [34]. To see this, suppose we have some and , and . If and we apply to the real system, , then the error will grow, i.e. . However, if instead we use the policy should be closer to , and the error is more likely to shrink. Other approaches are available for learning , including reinforcement learning [45], imitation learning [40], etc. Finally, we learn in the same manner as before by minimizing the quantile loss in (10). We generate data for learning the tube dynamics by fitting trajectories of the learned model to closely approximate the real data (Figure 6b). We randomly initialize along the trajectory by letting . We solve the following problem for each t:

From the fitted dynamics model data, we collect tube training data and proceed to train the tube model. We can now solve the same tube-based robust MPC problem (11), with replaced with This allows us to maintain the same guarantees of feasibility and safety with probability as before. See Algorithm 3.

IV. EXPERIMENTAL DETAILS

A. Evaluation on a 6-D problem

In this section we validate each of our three approaches to learned tubes for tube MPC on a 6-state simulated triple- integrator system. We introduce two sets of dynamics for f and to demonstrate our method. Consider the following 2D triple-integrator system with 6 states, where along with the 4 state 2D double-integrator dynamics for the reference system: . Let these systems have the following dynamics (we show the x-axis only for brevity sake):

where , and with similar dynamics for the y-axis. We construct the following cascaded PD control law:

We choose , and . We also bound . We simulate in discrete time with dt=0.1.

We collect 100 episodes with randomly generated controls, with episode lengths of 100 steps. Following each algorithm, we then set up an MPC task to navigate through a forest of obstacles (see Figure 7). We found an MPC planning horizon of 20-30 steps to be effective. We ran each MPC algorithm for 100 steps, or until the system reaches the goal. We also plot 100 rollouts of the ”true” system to evaluate the learned bounds. For each learned network, we use 3 layers with 256 units each. When calculating constraints for the tube, we treat the tube width as axes for an ellipse rather than a box. This alleviates the need for solving a mixed integer quadratic program, at the cost of a slightly larger tube. We use a quadratic running cost that penalizes deviation from the goal and excessively large velocities.

With Algorithm 1, we note that the tube widths are quite large. This is because this algorithm uses the reference trajectory itself as the center of the tube. While the tube encloses the trajectories, it does not create a tight bound. In Algorithm 2, we address this issue directly. We learn dynamics of the mean tracking error and use this as our tube center. The resulting tube dynamics bound the state distribution more closely. Note that when solving the MPC problem, the optimized reference trajectory is free to violate the constraints, as long as the system trajectories do not. This approach allows for much more aggressive behaviors. For Algorithm 3, without a good tracking controller, the tube width increases over time. However, because we replan at each timestep with a finite horizon, the planner is still able to fit through narrow passages. In the example shown we replan from the current state , with the assumption that it is measured. This allows us to create aggressive trajectories with narrow tube widths.

B. Comparison with analytic bounds

We compare our learned tubes with an analytic solution for robust bounds on the system (20). We derive these analytic bounds by assuming worst-case noise perturbations of the closed-loop system. We find the bound

Fig. 7. Comparison of 3 tube MPC approaches with learned tubes. Red circles denote obstacles, magenta cross denotes goal. Cyan lines indicate sampled trajectories from the system with randomized initial conditions. Top: Algorithm 1, learning a tube around the reference z (black) used for tracking. Green circles indicate the tube width obtained at each timestep. Mid: Algorithm 2, learning tracking error dynamics (blue line) for the center of the tube. Bot: Algorithm 3, tube MPC problem using learned policy, dynamics, and tube dynamics. Red lines indicate planned NN dynamics trajectories at each MPC timestep, along with the forward propagated tube dynamics (green), shown every 20 timesteps. Blue line indicates actual path taken (

Fig. 8. Learned 95% quantile error bounds (green) vs. 95% analytic bounds (dotted red) for the linear triple-integrator system, with 100 sampled trajectories, tracking a random reference trajectory.

). The worst-case error at each timestep is . We compare these bounds with those learned with our quantile method (Figure 8). Our method tends to underestimate the true bounds slightly, which is due to the training data rarely containing worst-case adversarial noise sequences.

C. Ablative Study

We perform an ablative study of our tube learning method. Using Algorithm 1, we learn error dynamics and tube dynamics. We collect randomized data (400 episodes of 40 timesteps) and train under varying values of . We then evaluate the accuracy of by sampling 100 new episodes of 10 timesteps, and plot the frequency that overestimates the true error, along with the magnitude of overestimation (Figure 9, left). We compare networks learned with the epistemic loss and without it, and find that our method produces well-calibrated

Fig. 9. Evaluation of learned tube dynamics on triple integrator system with varying (left) and varying number of datapoints (right). Red indicates fraction of validation samples that exceed the bound, while blue indicates average distance in excess of the bound. Models learned with the epistemic loss along with the quantile loss (circles, solid lines) perform better vs. models without epistemic uncertainty (triangles, dotted lines). Gray lines mark the best possible values.

uncertainties when using the epistemic loss, along with the quantile and monotonic losses (10). We evaluated ablation of the monotonic loss but found no noticeable differences.

We also evaluate estimation of epistemic uncertainty with varying amounts of data (from 10 to 400 episodes), with a fixed value of . We find that estimating epistemic uncertainty is particularly helpful in the low-data regimes (Figure 9, right). As expected, the network maintains good quantile estimates by increasing the value of , which results in larger tubes. This creates more conservative behavior when the model encounters new situations.

D. Evaluation on Quadrotor Dynamics

To validate our approach scales well to high-dimensional non-linear systems, we apply Algorithm 2 to a 12 state, 4 input quadrotor model, with dynamics:

where is the hat operator. The states are the position , the translational velocity , the rotation matrix from body to inertial frame , and the angular velocity in the body frame is the mass of the quadrotor, denotes gravitational force, and is the inertia matrix in body frame. The inputs to the model are the total thrust and the total moment in the body frame . Noise enters through the control channels, with . Our state is and control input is . We use a nonlinear geometric tracking controller that consists of a PD controller on position and velocity, which then cascades to an attitude controller [29]. For the nominal model we use a double integrator system on each position axis. The nominal state is with acceleration control inputs . See Figures 10 and 11.

V. CONCLUSION

We have introduced a deep quantile regression framework for learning bounds on controlled distributions of trajectories.

Fig. 10. Algorithm 2 working on quadrotor dynamics, showing 5 individual MPC solutions at different times along the path taken. Thinner lines (black and blue) indicate planned future trajectories , respectively.

Fig. 11. Tube widths for quadrotor dynamics, 10 episodes of 200 timesteps each, tracking random reference trajectories. From top to bottom, we plot . Green lines indicate the quantile bound , and cyan lines show 100 sampled error trajectories, stars indicate the start of a new episode.

For the first time we combine deep quantile regression in three robust MPC schemes with recursive feasibility and constraint satisfaction guarantees. We show that these schemes are useful for high dimensional learning-based control on quadrotor dynamics. We hope this work paves the way for more detailed investigation into a variety of topics, including deep quantile regression, learning invariant sets for control, handling epistemic uncertainty, and learning-based control for non-holonomic or non-feedback linearizable systems. Our immediate future work will involve hardware implementation and evaluation of these algorithms on a variety of systems.

ACKNOWLEDGEMENT

We thank Brett Lopez and Rohan Thakker for insightful discussions and suggestions, as well as the reviewers for helpful comments. This research was partially carried out at the Jet Propulsion Laboratory (JPL), California Institute of Technology, and was sponsored by the JPL Year Round Internship Program and the National Aeronautics and Space Administration (NASA). Copyright c2020. All rights reserved.

REFERENCES

[1] Anil Aswani, Humberto Gonzalez, S. Shankar Sastry, and Claire Tomlin. Provably safe and robust learning-based model predictive control. Automatica, 49(5):1216–1226, may 2013.

[2] Alberto Bemporad and Manfred Morari. Robust model predictive control: A survey. In Robustness in identification and control, pages 207–226. Springer, 1999.

[3] Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe Model-based Reinforcement Learning with Stability Guarantees. In Advances in neural information processing systems, pages 908–918, 2017.

[4] Eric Bradford, Lars Imsland, and Ehecatl Antonio del RioChanona. Nonlinear model predictive control with explicit back-offs for gaussian process state space models. In 58th Conference on decision and control (CDC). IEEE, 2019.

[5] Monimoy Bujarbaruah, Xiaojing Zhang, Marko Tanaskovic, and Francesco Borrelli. Adaptive MPC under time varying uncertainty: Robust and Stochastic. arXiv preprint arXiv:1909.13473, 2019.

[6] Rui Camacho, Ross King, and Ashwin Srinivasan. Inductive Logic Programming: 14th International Conference, ILP 2004, Porto, Portugal, September 6-8, 2004, Proceedings, volume 3194. Springer Science & Business Media, 2004.

[7] Will Dabney, Mark Rowland, Marc G Bellemare, and R´emi Munos. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[8] Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011.

[9] Moritz Diehl, Hans Joachim Ferreau, and Niels Haverbeke. Efficient numerical methods for nonlinear mpc and moving horizon estimation. In Nonlinear model predictive control, pages 391–417. Springer, 2009.

[10] David D Fan and Evangelos A Theodorou. Differential dynamic programming for time-delayed systems. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 573–579. IEEE, 2016.

[11] David D Fan, Jennifer Nguyen, Rohan Thakker, Nikhilesh Alatur, Ali-akbar Agha-mohammadi, and Evangelos A Theodorou. Bayesian learning-based adaptive control for safety critical systems. In International Conference on Robotics and Automation, 2020.

[12] Wagner Piazza Gaglianone, Luiz Renato Lima, Oliver Linton, and Daniel R Smith. Evaluating value-at-risk models via quantile regression. Journal of Business & Economic Statistics, 29(1):150–160, 2011.

[13] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.

[14] Yarin Gal, Rowan Thomas Mcallister, and Carl Edward Rasmussen. Improving PILCO with Bayesian Neural Network Dynamics Models. Data-Efficient Machine Learning Workshop, ICML, pages 1–7, 2016.

[15] Yiqi Gao, Andrew Gray, H. Eric Tseng, and Francesco Borrelli. A tube-based robust nonlinear predictive control approach to semiautonomous ground vehicles. Vehicle System Dynamics, 52(6):802–823, jun 2014. ISSN 0042-3114. doi: 10.1080/00423114.2014.902537.

[16] Agathe Girard, Carl Edward Rasmussen,

Joaquin Quinonero Candela, and Roderick Murray-Smith. Gaussian process priors with uncertain inputs application to multiple-step ahead time series forecasting. In Advances in neural information processing systems, pages 545–552, 2003.

[17] Akhil Gupta, Naman Shukla, Lavanya Marla, and Arinbj¨orn Kolbeinsson. Monotonic trends in deep neural networks. arXiv preprint arXiv:1909.10662, 2019.

[18] Lukas Hewing and Melanie N Zeilinger. Stochastic Model Predictive Control for Linear Systems using Probabilistic Reachable Sets. may 2018. doi: 10.1109/CDC.2018.8619554.

[19] Lukas Hewing, Juraj Kabzan, and Melanie N Zeilinger. Cautious Model Predictive Control using Gaussian Process Regression. arXiv, may 2017. URL http://arxiv.org/abs/1705.10702.

[20] Lukas Hewing, Kim P Wabersich, Marcel Menner, and Melanie N Zeilinger. Learning-Based Model Predictive Control: Toward Safe Learning in Control. Annual Review of Control, Robotics, and Autonomous Systems, 3, 2019.

[21] Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics, pages 492–518. Springer, 1992.

[22] Eric C Kerrigan and Jan M Maciejowski. Robust feasibility in model predictive control: Necessary and sufficient conditions. In Proceedings of the 40th IEEE Conference on Decision and Control, volume 1, pages 728–733. IEEE, 2001.

[23] Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica: journal of the Econometric Society, pages 33–50, 1978.

[24] Roger Koenker and Kevin F Hallock. Quantile regression. Journal of economic perspectives, 15(4):143–156, 2001.

[25] Johannes K¨ohler, Peter K¨otting, Raffaele Soloperto, Frank Allg¨ower, and Matthias A M¨uller. A robust adaptive model predictive control framework for nonlinear uncertain systems. arXiv preprint arXiv:1911.02899, 2019.

[26] Torsten Koller, Felix Berkenkamp, Matteo Turchetta, and Andreas Krause. Learning-Based Model Predictive Control for Safe Exploration. In 2018 IEEE Conference on Decision and Control (CDC), pages 6059–6066. IEEE, dec 2018. ISBN 978-1-5386-1395-5. doi: 10.1109/CDC.2018.8619572.

[27] Hideo Kozumi and Genya Kobayashi. Gibbs sampling methods for Bayesian quantile regression. Journal of statistical computation and simulation, 81(11): 1565–1578, 2011.

[28] W Langson, I Chryssochoos, S V Rakovi´c, and D Q Mayne. Robust model predictive control using tubes. Automatica, 40(1):125–133, jan 2004. ISSN 0005-1098. doi: 10.1016/J.AUTOMATICA.2003.08.009.

[29] Taeyoung Lee, Melvin Leok, and N Harris McClamroch. Geometric tracking control of a quadrotor uav on se (3). In 49th IEEE conference on decision and control (CDC), pages 5420–5425. IEEE, 2010.

[30] Anqi Liu, Guanya Shi, Soon-Jo Chung, Anima Anandku-

mar, and Yisong Yue. Robust Regression for Safe Exploration in Control. arXiv preprint arXiv:1906.05819, 2019.

[31] Brett T Lopez, Jonathan P How, and Jean-Jacques E Slotine. Dynamic tube mpc for nonlinear systems. In 2019 American Control Conference (ACC), pages 1655–1662. IEEE, 2019.

[32] D. Q. Mayne, E. C. Kerrigan, E. J. van Wyk, and P. Falugi. Tube-based robust nonlinear model predictive control. International Journal of Robust and Nonlinear Control, 21(11):1341–1353, jul 2011. ISSN 10498923. doi: 10.1002/rnc.1758.

[33] David Q Mayne. Model predictive control: Recent developments and future promise. Automatica, 50(12): 2967–2986, 2014.

[34] Duy Nguyen-Tuong, Jan Peters, Matthias Seeger, and Bernhard Sch¨olkopf. Learning inverse dynamics: a comparison. In European symposium on artificial neural networks, 2008.

[35] Chris J. Ostafew, Angela P. Schoellig, and Timothy D. Barfoot. Learning-based nonlinear model predictive control to improve vision-based mobile robot path-tracking in challenging outdoor environments. In Proceedings - IEEE International Conference on Robotics and Automation, pages 4029–4036. IEEE, may 2014. ISBN 978-1-4799-3685-4. doi: 10.1109/ICRA.2014.6907444.

[36] Chris J. Ostafew, Angela P. Schoellig, and Timothy D. Barfoot. Robust Constrained Learning-based NMPC enabling reliable mobile robot path tracking. The International Journal of Robotics Research, 35(13):1547–1563, nov 2016. ISSN 0278-3649. doi: 10.1177/0278364916645661.

[37] Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.

[38] Hadi Ravanbakhsh and Sriram Sankaranarayanan. Learning Control Lyapunov Functions from Counterexamples and Demonstrations. apr 2018. doi: 10.1007/s10514-018-9791-9.

[39] Filipe Rodrigues and Francisco C Pereira. Beyond expectation: Deep joint mean and quantile regression for spatio-temporal problems. arXiv preprint arXiv:1808.08798, 2018.

[40] St´ephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011.

[41] Jonathan Sadeghi, Marco De Angelis, and Edoardo Patelli. Efficient training of interval Neural Networks for imprecise training data. Neural Networks, 118:338–351, 2019.

[42] Mattia Seg´u, Antonio Loquercio, and Davide Scaramuzza. A general framework for uncertainty estimation in deep learning. arXiv preprint arXiv:1907.06890, 2019.

[43] Joseph Sill. Monotonic networks. In Advances in neural information processing systems, pages 661–667, 1998.

[44] Sumeet Singh, Anirudha Majumdar, Jean-Jacques Slotine, and Marco Pavone. Robust online motion planning

via contraction theory and convex optimization. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 5883–5890. IEEE, 2017.

[45] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.

[46] Natasa Tagasovska and David Lopez-Paz. Single-model uncertainties for deep learning. In Advances in Neural Information Processing Systems, pages 6414–6425, 2019.

[47] James W Taylor. A quantile regression approach to estimating the distribution of multiperiod returns. The Journal of Derivatives, 7(1):64–78, 1999.

[48] Mario E Villanueva, Rien Quirynen, Moritz Diehl, Benoˆıt Chachuat, and Boris Houska. Robust mpc via min–max differential inequalities. Automatica, 77:311–321, 2017.

[49] Kim P Wabersich and Melanie N Zeilinger. Linear model predictive safety certification for learning-based control. mar 2018. URL http://arxiv.org/abs/1803.08552.

[50] Grady Williams, Paul Drews, Brian Goldfain, James M Rehg, and Evangelos A Theodorou. Information-theoretic model predictive control: Theory and applications to autonomous driving. IEEE Transactions on Robotics, 34 (6):1603–1622, 2018.

[51] Xing Yan, Weizhong Zhang, Lin Ma, Wei Liu, and Qi Wu. Parsimonious quantile regression of financial asset tail dynamics via sequential learning. In Advances in Neural Information Processing Systems, pages 1575–1585, 2018.

[52] Yunwen Yang, Huixia Judy Wang, and Xuming He. Posterior inference in Bayesian quantile regression with asymmetric Laplace likelihood. International Statistical Review, 84(3):327–344, 2016.

[53] Seungil You, David Ding, Kevin Canini, Jan Pfeifer, and Maya Gupta. Deep lattice networks and partial monotonic functions. In Advances in Neural Information Processing Systems, pages 2981–2989, 2017.

[54] Faen Zhang, Xinyu Fan, Hui Xu, Pengcheng Zhou, Yujian He, and Junlong Liu. Regression via Arbitrary Quantile Modeling. arXiv preprint arXiv:1911.05441, 2019.

designed for accessibility and to further open science