Infinite-Horizon Differentiable Model Predictive Control

2020·Arxiv

ABSTRACT

ABSTRACT

This paper proposes a differentiable linear quadratic Model Predictive Control (MPC) framework for safe imitation learning. The infinite-horizon cost is enforced using a terminal cost function obtained from the discrete-time algebraic Riccati equation (DARE), so that the learned controller can be proven to be stabilizing in closed-loop. A central contribution is the derivation of the analytical derivative of the solution of the DARE, thereby allowing the use of differentiation-based learning methods. A further contribution is the structure of the MPC optimization problem: an augmented Lagrangian method ensures that the MPC optimization is feasible throughout training whilst enforcing hard constraints on state and input, and a pre-stabilizing controller ensures that the MPC solution and derivatives are accurate at each iteration. The learning capabilities of the framework are demonstrated in a set of numerical studies.

1 INTRODUCTION

Imitation Learning (IL, Osa et al., 2018) aims at reproducing an existing control policy by means of a function approximator and can be used, for instance, to hot-start reinforcement learning. Effective learning and generalisation to unseen data are paramount to IL success, especially in safety critical applications. Model Predictive Control (MPC, Maciejowski, 2000; Camacho & Bordons, 2007; Rawlings & Mayne, 2009; Kouvaritakis & Cannon, 2015; Gallieri, 2016; Borrelli et al., 2017; Rakovi´c & Levine, 2019) is the most successful advanced control methodology for systems with hard safety constraints. At each time step, a finite horizon forecast is made from a predictive model of the system and the optimal actions are computed, generally relying on convex constrained Quadratic Programming (QP, Boyd & Vandenberghe, 2004; Bemporad et al., 2000). Stability of the MPC in closed loop with the physical system requires the solution of a simpler unconstrained infinite horizon control problem (Mayne et al., 2000) which results in a value function (terminal cost and constraint) and a candidate terminal controller to be accounted for in the MPC forecasting. For Linear Time Invariant (LTI) models and quadratic costs, this means solving (offline) a Riccati equation (Kalman, 2001) or a linear matrix inequality (Boyd et al., 1994). Under these conditions, an MPC controller will effectively control a system, up to a certain accuracy, provided that uncertainties in the model dynamics are limited (Limon et al., 2009). Inaccuracies in the MPC predictions can reduce its effectiveness (and robustness) as the forecast diverges from the physical system trajectory over long horizons. This is particularly critical in applications with both short and long-term dynamics and it is generally addressed, for instance in robust MPC (Richards, 2004; Rakovi´c et al., 2012), by using a controller to pre-stabilise the predictions.

This paper presents an infinite-horizon differentiable linear quadratic MPC that can be learned using gradient-based methods. In particular, the learning method uses an MPC controller where the terminal cost and terminal policy are the solution of an unconstrained infinite-horizon Linear Quadratic Regulator (LQR). A closed-form solution for the derivative of the Discrete-time Algebraic Riccati Equation (DARE) associated with the LQR is presented so that the stationary solution of the forward pass is fully differentiable. This method allows analytical results from control theory to be used to determine the stabilizing properties of the learned controller when implemented in closed-loop. Once the unconstrained LQR is computed, the predictive model is pre-stabilised using a linear state-feedback controller to improve the conditioning of the QP and the numerical accuracy of the MPC solution and gradients. The proposed algorithm successfully learns an MPC with both local stability and intrinsic robustness guarantees under small model uncertainties.

Contributions This paper provides a framework for correctly learning an infinite-horizon, LTI quadratic MPC using recent developments in differentiable QPs (Amos & Kolter, 2017) and principles from optimal control (Blanchini & Miani, 2007). A primary contribution is that the Discrete-time Algebraic Riccati Equation (DARE) is used to provide infinite-horizon optimality and stability, and an analytical derivative of the solution of the DARE is derived so that differentiation-based optimization can be used for training. This connects known results on MPC stability (Limon et al., 2003; 2009) and on infinite-horizon optimality (Scokaert & Rawlings, 1998) to imitation learning (Osa et al., 2018).

A further contribution is the MPC control formulation: a pre-stabilizing linear state-feedback controller is implemented from the solution of the DARE, and then the total control input is obtained as a perturbation of the feedback control law from the solution of a convex QP. The pre-stabilizing controller ensures that the QP is well conditioned and promotes a highly accurate global solution, which in turn ensures that the gradients calculated in the backwards pass are accurate. Additionally, an augmented Lagrangian penalty method is used to enforce constraints on state and control input. This approach ensures that the hard constraints are strictly enforced if the penalty term is sufficiently large, and also guarantees that the MPC problem is feasible throughout the training process. These contributions are in contrast to (Amos et al., 2018) which did not consider state constraints, and implemented a differential dynamic programming (Tassa et al., 2014) method to solve the MPC optimization for which convergence could not be guaranteed.

The framework is implemented on a set of second order mass-spring-damper systems and a vehicle platooning model, where it is demonstrated that the infinite horizon cost can be learned and the hard constraints can be guaranteed using a short finite prediction horizon.

Notation identity matrix. matrix of zeros. a vector of n zeros. a vector of n ones. All inequalities are considered element-wise in the context of vectors. largest absolute eigenvalue of given matrix A. vec is defined as vec i.e. the columns of a matrix stacked into a vector. For a matrix permutation matrix is implicitly defined by The Kronecker product, , is defined as in (Magnus & Neudecker, 1999, pp. 440).

2 DIFFERENTIABLE MPC

Linear quadratic MPC This paper considers linear time invariant systems of the form

where is the system state, is the control input, is the state transition matrix, is the input matrix, is the time, and is the timestep (assumed constant). The control problem for this system is to determine the sequence of values of that achieve a desired level of performance (e.g. stability, frequency response, etc...), and when the system is subject to hard constraints on control input, , and state, , (or a combination of both), a well studied framework for controller synthesis is MPC. The principle of MPC is that the system’s control input and state are optimized over a finite prediction horizon, then the first element of the obtained control sequence is implemented at the current time step and the process is repeated ad infinitum. For linear MPC it is common to use a quadratic stage cost and box constraints on state and control ( and where ), so that at each time index t the vector of optimized control variables is obtained from

where is the predicted control trajectory, is the predicted state trajectory, represents the stage cost on control input, represents the stage cost on state, represents the terminal cost on state, is the prediction horizon, are slack variables for the control constraint, are slack variables for the state constraint, and and represent the cost of control and state constraint violations. The variables s and r enforce the box constraints on state and control using the augmented Lagrangian method (Nocedal & Wright, 2006, §17.2), and it can be shown that for sufficiently high constraints exactly guaranteed (Kerrigan & Maciejowski, 2000) (i.e. s = r = 0). The benefit of this approach is that it ensures that the MPC optimization is feasible at each iteration of the learning process, whilst still ensuring that the constraints are ‘hard’. To close the MPC control loop, at each timestep, t, the first element of the optimized control sequence, implemented as

Pre-stabilised MPC If the control input is decomposed into is a stabilizing linear state-feedback matrix and is a perturbation to the feedback control, system (1) becomes

and problem (2) becomes

so that is implemented as . Using this decomposition, system (3) controlled with the solution of (4) is precisely equal to system (1) controlled with the solution of (2), but problem (4) is preferable from a computational standpoint if A is open-loop unstable (i.e. ) and N is ‘large’, as this can lead to poor conditioning of the matrices defined in Appendix A. This is important in the context of differentiable MPC, as if A is being learned then there may be no bounds on its eigenvalues at any given iteration.

MPC derivative. Problems (2) and (4) can be rearranged into the QP form (details in Appendix A)

When is uniquely defined by (5), it can also be considered as the solution of an implicit function defined by the Karush-Kuhn-Tucker (KKT) conditions, and in Amos & Kolter (2017) it was demonstrated that it is possible to differentiate through this function to obtain the derivatives of with respect to the parameters H, q, l, M, and u. 1 The MPC controller can then be used as a layer in a neural network, and backpropagation can be used to determine the derivatives of an imitation cost function with respect to the MPC parameters

Imitation Learning. A possible use case of the derivative of a model predictive controller is imitation learning, where a subset of {cost function, system dynamics, constraints} are learned from observations of a system being controlled by an ‘expert’. Imitation learning can be performed by minimizing the loss

where is the measured control input, is the full MPC solution, and is a hyperpa- rameter. It is assumed that both the learning algorithm and MPC controller have completely precise measurements of both the state and control input. The first term of (6) is the control imitation loss, and the second term penalises the one-step ahead prediction error In practice, the prediction error loss might not be needed for the MPC to be learned correctly, however its use can be instrumental for stability, as discussed in the next section.

3 TERMINAL COST FOR INFINITE HORIZON

Terminal cost. The infinite-horizon discrete-time Linear Quadratic Regulator (LQR, Kalman, 2001) is given with state feedback gain

where P is obtained as a solution of the DARE

The principle of the approach presented in this paper is the MPC controller (2,4) is implemented with . Proposition 1 summarises the relevant properties of the proposed MPC, building on classic MPC results from Scokaert & Rawlings (1998); Limon et al. (2003; 2009).

Proposition 1. Consider the MPC problem (4) with , where P and K solve (7-8). Define as the optimal objective in (4) with . Denote the optimal stage cost with as . Then, for the closed-loop system, it follows that:

3. There exist a scalar, , such that, for any the MPC constraints are robustly feasible, , and the system is Input-to-State Stable (ISS) given an additive model error, , such that: . In other words:

Proof. Proof of Proposition 1 is given in Appendix C.

Implications. Proposition 1 has some important implications. First, point 1 implies that there exists a state-dependant finite horizon length, , which is sufficient to make the MPC problem infinite-horizon optimal. This can be upper bounded for a closed and bounded set of feasible states, . Scokaert & Rawlings (1998) proposed an iterative search that increases the horizon until optimality is verified; a similar algorithm is discussed in Appendix E where learning is completed with a large horizon and then iteratively reduced afterwards, although it is not implemented in this paper. Point 2,3 state that MPC that can provide stability and constraints satisfaction, hence safety, if the model error is small. This also applies to small errors in the QP solution. Finally, point 4 states that the QP matrices have finite norm when the system dynamics are pre-stabilised using the LQR gain2, so the MPC problem is well conditioned and can be solved reliably to high accuracy, even over long horizons. If the open-loop system is unstable then the terms of the matrices in Appendix A for the standard form are unbounded, so the QP solution may be poorly conditioned and the result inaccurate for long horizons. This can in turn invalidate the results of Amos & Kolter (2017) which assumes that the KKT conditions are exactly satisfied in order to compute its gradients.

DARE Derivative. In order to implement in a differentiable imitation learning framework such as that presented in Section 2, the solution of the DARE is differentiated as follows.

Proposition 2. Let P be the stabilizing solution of (8), and assume that and exist, then the Jacobians of the implicit function defined by (8) are given by

where are defined by

and are defined by

Proof. The proof of Proposition 2 is given in Appendix D.

The sensitivity of the DARE solution has been investigated in the context of robustness to perturbations in the input matrices, e.g. Sun (1998); Konstantinov et al. (1993), and the analytical derivative of the continuous-time algebraic Riccati equation was derived in Brewer (1977) by differentiating the exponential of the Hamiltonian matrix, but to the best of the authors’ knowledge this is the first presentation of an analytic derivative of the DARE using the differential calculus approach of Magnus & Neudecker (1999).

Algorithm overview Algorithm 1 presents the overall procedure for learning a subset, S, of the MPC controller parameters, M = , with the key steps of the forwards and backwards pass of a gradient-based optimization method. In each forward pass the MPC terminal cost matrix, and the pre-stabilizing controller, K, are set from the solution of the DARE, then the DARE and MPC QP solutions are differentiated in the backward pass to obtain the gradients. Note that the horizon, N, is not differentiable, and that learning the entire set M simultaneously is challenging in general.

4 NUMERICAL EXPERIMENTS

In this section the performance of the algorithm was demonstrated through numerical experiments in two test cases: firstly on a set of second order mass-spring-damper models to provide a performance baseline in an easily interpretable setting, and then on vehicle platooning problem to investigate a higher-dimensional real-world application.

4.1 MASS-SPRING-DAMPER

Model & Expert Expert data was generated using a mass-spring-damper model parameterized by a mass, , damping coefficient, , stiffness, , and timestep

so that is the position and velocity of the mass, and the is a force applied to the mass.

Seven models were created with m = 1, k = 1, and dt = 0.2, and c was varied as shown in Table 1 to affect the open-loop stability of the models (stable, unstable). The expert data was then generated by simulating each of the systems the initial condition closed-loop with an infinite-horizon MPC

controller (i.e. the horizon was increased until the open-loop state predictions matched the closed-loop response), using , and . The constraint set was chosen so that the constraints on both state and control input were strongly active at the solution whilst ensuring that the expert MPC optimization was feasbile. The values were found to be sufficient to enforce the hard constraints and were used for all experiments. It is important to note that the approach of (Amos et al., 2018) cannot be used reliably for even this simple example as it does not consider state constraints, and when hard constraints are added to the method it fails in general because the optimization problem has become infeasible in the forwards pass at some time t.

Figure 1: Mass-spring-damper. Imitation loss and model loss at each iteration of the training process. Top row: imitation loss. Bottom row: model loss given by is the learned model at iteration j, and A is the correct model. The model loss was not used as part of the training process, and shown only to indicate whether the model is converging correctly.

Learning The learner and expert shared all system and controller information apart from the state transition matrix A, which was learned, and the MPC horizon length, which was implemented as each of in three separate experiments. A was initialized with the correct state transition matrix plus a uniformly distributed pseudorandom perturbation in the interval added to each element. The learner was supplied with the first 50 elements of the closed loop state trajectory and corresponding controls as a batch of inputs, and was trained to minimize the imitation loss (6) with , i.e. the state dynamics were learned using predicted control trajectories only, and the state transitions are not

made available to the learner (this is the same approach used in Amos et al., 2018). The experiments were implemented in Pytorch 1.2.0 using the built-in Adam optimizer (Kingma & Ba, 2014) for 1000 steps using default parameters. The MPC optimization problems were solved for the ‘expert’ and ‘learner’ using OSQP (Stellato et al., 2017) with settings (eps_abs=1E-10, eps_rel=1E-10, eps_rim_inf=1E-10, eps_dual_inf=1E-10).

Results Figure 1 shows the imitation and model loss at each of the 1000 optimization iterations for each of the tested horizon lengths. It can be seen that for all of the generated systems the imitation loss converges to a low value, although this is a local minimum in general. In most cases, the learned model converges to a close approximation of the real model, although as the problem is non-convex this cannot be guaranteed, and it is also shown that there are some cases in which the model does not converge correctly. This occurred exclusively for N = 2, where neither system 4 nor system 2 converge to the correct dynamics. Additionally, it can be seen that both the imitation loss and model loss converge faster as the prediction horizon is increased. This suggests that a longer learning horizon improves the learning capabilities of the methods, but there is not sufficient data to demonstrate this conclusively.

To test generalization performance, each of the systems was re-initialized with initial condition and simulated in closed loop using the learned controller for each horizon length. The results are compared in Figure 2 against the same systems controlled with an infinite horizon MPC controller. The primary observation is that as the learned MPC horizon is increased to N = 6, the closed loop trajectories converge to expert trajectories, indicating that the infinite horizon cost has been learned (when using the infinite

horizon cost with no model mismatch or disturbance, the predicted MPC trajectory is exactly the same as the closed loop trajectory), and that the state constraints are guaranteed for . Furthermore, it can be seen that the learned controllers are stabilizing, even for the shortest horizon and the most unstable open-loop systems. This is also the case for systems 2 and 4 where the incorrect dynamics were learned, although in this case the state constraints are not guaranteed for N = 2.

4.2 VEHICLE PLATOONING

Figure 3: Platoon Model. vehicles in 1 degree of freedom where y is longitudinal displacement. Model & Expert Vehicle platoon control is a problem that has been studied using control theory (e.g. Zheng et al. (2019)), but here it is demonstrated that a safe, stabilizing controller can be learned from examples of vehicles driving in formation. Figure 3 shows an illustration of a platoon of vehicles for which the objective is to stabilize the relative longitudinal positions of each vehicle to the steady-state conditions and , subject to the hard constraint that relative position of the vehicles is never lower than a safe threshold , and that the vehicles’ ability to brake and accelerate is constrained by (note that only the relative positions and velocities of the vehicles is considered, as the global position and velocity of the platoon can be controlled separately by adding an equal perturbation to each element of ). In appendix F it is shown that this can be modelled as a discrete time LTI system. Expert data was generated from the model with vehicles so that and . 20 instances were generated using random feasible initial conditions with m and y = 10 m, and then simulated for 20 s in time intervals of dt = 0.7 s with an infinite-horizon MPC controller, using

Learning The learner and expert shared all system and controller information apart from the cost matrices Q and R, which were learned, and the MPC horizon length, which was implemented as each of in four separate experiments. The matrices Q and R were initialized as completely random diagonal matrices with each element uniformly distributed in the interval [0, 3], and the diagonal structure was maintained through training. 500 training iterations were used; otherwise the learning process (loss function, learning rate, etc...) was the same as in Section 4.1.

Figure 4: Vehicle platooning. Imitation loss and cost function loss at each iteration of the training process. Left: imitation loss. Right: model loss given by , where Q and R are the correct cost matrices and are the cost matrices

Results Figure 4 shows the imitation and cost function losses at each of the 500 optimization iterations for each of the tested horizon lengths and initial conditions. As with the mass-spring-damper experiments, it is suggested that a longer prediction horizon improves training as the imitation loss generally converges to a lower value for the examples with , but only convergence to a local minimum is achieved in general. The cost error also does not converge in general (although better convergence is observed again for the longer horizon lengths), however for this learning problem there is a manifold of matrices Q and R with the same minimizing argument, so divergence of the cost error does not necessarily indicate that the learned cost

function is ‘incorrect’. Furthermore, in this case the model is known exactly, so the closed-loop infinite-horizon properties can be obtained even without the correct cost function.

Figure 5 shows the model simulated from the same initial condition in closed loop using a learned controller for each of the horizon lengths, together with the error between the MPC state predictions and ensuing closed-loop behaviour. All of the controllers are observed to successfully satisfy the hard constraints on vehicle separation, and all converge to the correct steady-state vehicle separation. The differences between the prediction capabilities of the controllers is highlighted by the state prediction errors, and it can be seen that for N = 20 the state predictions match the ensuing behaviour, indicating that the infinite horizon cost is being used and that closed-loop stability is guaranteed, even without the use of a terminal constraint set. It is also demonstrated for N < 20 that the largest errors occur

Figure 5: Vehicle platooning. Closed loop simulation and prediction error for all horizon lengths. Top row: closed loop simulation where each shaded region is the safe separation distance for each vehicle. Bottom row: prediction error given by , where is the state trajectory predicted by the MPC at time t.

from predictions made at times when the state constraints are active, suggesting that these controllers deviate from their predictions to satisfy the constraints at later intervals.

4.3 LIMITATIONS

The above approach is limited in scope to LTI systems, and a more comprehensive solution would cover linear time varying systems (for which the MPC is still obtained from the solution of a QP). In this case the infinite horizon cost cannot be obtained from the solution of the DARE, and the extension of the methods presented in this paper to time varying or non-linear models is non-trivial (see Appendix G for further discussion). Additionally, the derivative of the DARE in Proposition 2 involves multiple Kronecker products and matrix inversions (including an matrix) that do not scale well to large state and control dimensions, although the dynamics of physical systems can usually be reasonably approximated with only a few tens of variables, so this may not become an issue in practice. The algorithm also requires a stabilizing solution of the DARE to exist; theories for the existence of stabilizing solutions are non-trivial (e.g. Ran & Vreugdenhil, 1988), and it is not immediately obvious how to enforce their existence throughout the training process (stabilizibility can be encouraged using the one-step ahead term in (6)).

ACKNOWLEDGMENTS

The authors are grateful to Brandon Amos for providing support using his differentiable QP tool (https://github.com/locuslab/optnet) in the preliminary work for this project (all of the methods presented in this paper were developed independently).

REFERENCES

Brandon Amos and J. Zico Kolter. OptNet: Differentiable Optimization as a Layer in Neural Networks. arXiv:1703.00443 [cs, math, stat], March 2017. URL http://arxiv.org/abs/ 1703.00443. arXiv: 1703.00443.

Brandon Amos, Ivan Jimenez, Jacob Sacks, Byron Boots, and J. Zico Kolter. Differentiable MPC for End-to-end Planning and Control. In Advances in Neural Information Processing Systems 31, pp. 8289–8300. Curran Associates, Inc., 2018.

A. Bemporad, M. Morari, V. Dua, and E.N. Pistikopoulos. The explicit solution of model predictive control via multiparametric quadratic programming. In Proceedings of the 2000 American Control Conference. ACC (IEEE Cat. No.00CH36334). IEEE, 2000. doi: 10.1109/acc.2000.876624. URL https://doi.org/10.1109/acc.2000.876624.

Franco Blanchini and Stefano Miani. Set-Theoretic Methods in Control (Systems & Control: Foundations & Applications). Birkhäuser, 2007. ISBN 0817632557.

R. V. Bobiti. Sampling driven stability domains computation and predictive control of constrained non-linear systems. PhD thesis, 2017. URL https://pure.tue.nl/ws/files/78458403/ 20171025_Bobiti.pdf.

Francesco Borrelli, Alberto Bemporad, and Manfred Morari. Predictive Control for Linear and Hybrid Systems. Cambridge University Press, 2017. ISBN 1107016886.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004. ISBN 0521833787.

Stephen Boyd, Laurent El Ghaoui, Eric Feron, and Vendataramanan Balakrishnan. Linear Matrix Inequalities in System & Control Theory (Studies in Applied Mathematics, Volume 15). Society for Industrial & Applied, 1994. ISBN 089871334X.

J. Brewer. The derivative of the riccati matrix with respect to a matrix. IEEE Transactions on Automatic Control, 22(6):980–983, December 1977. doi: 10.1109/TAC.1977.1101656.

E. F. Camacho and C. Bordons. Model Predictive control. Springer London, 2007. doi: 10.1007/ 978-0-85729-398-5.

Marco Gallieri. Lasso-MPC – Predictive Control with -Regularised Least Squares. Springer International Publishing, 2016. doi: 10.1007/978-3-319-27963-3. URL https://doi.org/ 10.1007/978-3-319-27963-3.

R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, New York, NY, USA, 2nd edition, 2012. ISBN 0521548233, 9780521548236.

Vlad Ionescu and Martin Weiss. On computing the stabilizing solution of the discrete-time riccati equation. Linear Algebra and its Applications, 174:229 – 238, 1992. ISSN 0024-3795. doi: https://doi.org/10.1016/0024-3795(92)90053-D. URL http://www.sciencedirect.com/ science/article/pii/002437959290053D.

Rudolf Kalman. Contribution to the theory of optimal control. Bol. Soc. Mat. Mexicana, 5, 02 2001.

Eric C. Kerrigan and Jan M. Maciejowski. Soft constraints and exact penalty functions in model predictive control. In Proc. UKACC International Conference (Control, 2000.

H. K. Khalil. Nonlinear Systems. Pearson Education, 3rd edition, 2014.

D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2014.

MM Konstantinov, P Petkov, and ND Christov. Perturbation analysis of the discrete riccati equation. Kybernetika, 29(1):18–29, 1993.

B. Kouvaritakis and M. Cannon. Model Predictive Control: Classical, Robust and Stochastic. Advanced Textbooks in Control and Signal Processing, Springer, London, 2015.

D. Limon, T. Alamo, and E. F. Camacho. Stable constrained MPC without terminal constraint. Proceedings of the 2003 American Control Conference, 2003., 6:4893–4898 vol.6, 2003.

D. Limon, T. Alamo, D. M. Raimondo, D. Muñoz de la Peña, J. M. Bravo, A. Ferramosca, and E. F. Camacho. Input-to-State Stability: A Unifying Framework for Robust Model Predictive Control. In Nonlinear Model Predictive Control, pp. 1–26. Springer Berlin Heidelberg, 2009. doi: 10.1007/ 978-3-642-01094-1_1. URL https://doi.org/10.1007/978-3-642-01094-1_1.

Jan Maciejowski. Predictive Control with Constraints. Prentice Hall, 2000. ISBN 0201398230.

Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley, second edition, 1999. ISBN 0471986321 9780471986324 047198633X 9780471986331.

D. Q. Mayne, J. B. Rawlings, C. V. Rao, and P. O. M. Scokaert. Constrained model predictive control: Stability and optimality. 2000.

J. Nocedal and S. J. Wright. Numerical optimization. Springer verlag, 2006.

Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J. Andrew Bagnell, Pieter Abbeel, and Jan Peters. An Algorithmic Perspective on Imitation Learning. Foundations and Trends in Robotics, 7 (1-2):1–179, 2018. ISSN 1935-8253, 1935-8261. doi: 10.1561/2300000053. URL http: //arxiv.org/abs/1811.06711. arXiv: 1811.06711.

S. V. Rakovi´c, B. Kouvaritakis, R. Findeisen, and M. Cannon. Homothetic tube model predictive control. Automatica, 48:1631–1638, 08 2012. doi: 10.1016/j.automatica.2012.05.003.

Savecsa V. Rakovi´c and William S. Levine (eds.). Handbook of Model Predictive Control. Springer International Publishing, 2019. doi: 10.1007/978-3-319-77489-3. URL https://doi.org/ 10.1007/978-3-319-77489-3.

A.C.M. Ran and R. Vreugdenhil. Existence and comparison theorems for algebraic riccati equations for continuous- and discrete-time systems. Linear Algebra and its Applications, 99:63 – 83, 1988. ISSN 0024-3795. doi: https://doi.org/10.1016/0024-3795(88)90125-5. URL http: //www.sciencedirect.com/science/article/pii/0024379588901255.

J. B. Rawlings and D. Q. Mayne. Model Predictive Control Theory and Design. Nob Hill Pub, Llc, 2009. ISBN 0975937707.

A. G. Richards. Robust Constrained Model Predictive Control. PhD thesis, MIT, 2004.

P.O.M. Scokaert and J.B. Rawlings. Constrained linear quadratic regulation. IEEE Transactions on Automatic Control, 43(8):1163–1169, 1998. doi: 10.1109/9.704994. URL https://doi.org/ 10.1109/9.704994.

B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd. OSQP: An operator splitting solver for quadratic programs. ArXiv e-prints, November 2017.

J. Sun. Sensitivity analysis of the discrete-time algebraic riccati equation. Linear Algebra and its Applications, 275-276:595 – 615, 1998. ISSN 0024-3795. doi: https://doi.org/ 10.1016/S0024-3795(97)10017-9. URL http://www.sciencedirect.com/science/ article/pii/S0024379597100179. Proceedings of the Sixth Conference of the International Linear Algebra Society.

Y. Tassa, N. Mansard, and E. Todorov. Control-limited differential dynamic programming. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 1168–1175, May 2014. doi: 10.1109/ICRA.2014.6907001.

Y. Zheng, Y. Bian, S. Li, and S. E. Li. Cooperative control of heterogeneous connected vehicles with directed acyclic interactions. IEEE Intelligent Transportation Systems Magazine, pp. 1–1, 2019. doi: 10.1109/MITS.2018.2889654.

APPENDICES

Problem (2) is equivalent to

where

are of conformal dimensions. Using the above, problem (4) is then equivalent to

where now

and

are of conformal dimensions.

OSQP solves quadratic programs of the form (5), and returns values for z, y, and s that satisfy

(Stellato et al., 2017, §2), where C is the set , and is the normal cone of C. The values of y that are returned by the solver can be used to determine whether the constraints are strongly active at the solution, where indicates that the constraints and are inactive, indicates that is strongly active, and indicates that strongly active. The solution can therefore be completely characterised by the KKT system

where , and the notation indicates a matrix consisting of the columns of given matrix indicates a vector consisting of the elements of given vector v. Equation (10) can then be differentiated using the techniques detailed in (Amos & Kolter, 2017, §3).

Proof. (Proposition 1) The first point follows from (Scokaert & Rawlings, 1998). The next two points of Proposition 1 stem from the results in (Limon et al., 2003; 2009). In particular, the closed-loop is Lipschitz since the model is linear and the controller is the solution of a strictly convex QP. Moreover, the LQR provides a contractive terminal set. The final point follows from the fact that has eigenvalues in the unit circle, . Proof of point 4 is concluded by inspection of the QP matrices (Appendix A) and by application of Theorem 5.6.12, page 298 of Horn & Johnson (2012) which states that, given a bound, , on the spectral radius, then there exists a matrix norm which is also less than

Proof. (Proposition 2) If a stabilizing solution () to (8) exists, it is unique (Ionescu & Weiss, 1992, Proposition 1), and the DARE can therefore be considered an implicit function of A, B, Q, and R. Using the assumption that exists, it can be concluded that and exist (the Kronecker product and matrix addition, subtraction, and multiplication always exist). Equation (8) can be given by

which is differentiable, and are also differentiable. Differentials are taken for (11) and each of

then these can be combined using the differential chain rule (Magnus & Neudecker, 1999, Theorem 18.2) to obtain

The Jacobians, as defined in Proposition 2, therefore exist if

A method is proposed for the reduction of the MPC prediction horizon after imitation learning. The idea is to be able to reproduce the infinite-horizon optimal MPC up to a tolerance with high probability. Do do so, we check that, for a candidate horizon , the MPC action deltas, , satisfy , for all . This means that the optimal action is equal to the LQR up to a tolerance . In order to provide a high probability guarantee of this condition, we propose the use of a probabilistic verification approach, similar to Bobiti (2017). This is described in Algorithm 2. In particular, the condition is checked on a high number, , of initial states. These states are sampled uniformly from a set of interest

X, which can be either the state constraints X or an estimate of the region of attraction, . If verified, this set is a region of attraction for the system with high probability. The relationship between the number of samples and the verification probability is discussed in (Bobiti, 2017, Chapter 5). The algorithm also checks whether the infinite horizon condition has been reached for the N used during training. Finally, a line search for a suitable X is proposed using a scaling factor . In particular, the initial set is downscaled until either an horizon is found or the set becomes empty. In latter case the search fails and the procedure returns to the training algorithm with an increased N. Noticeably, the proposed algorithm does not require to explicitly compute the terminal set in which the LQR is invariant and it could be used also for non-linear MPC if an infinite-horizon (or a stabilising) terminal controller is available.

The problem described in Section 4.2 can be decomposed into the regulation problem

If each vehicle is modelled as a mass then a continuous-time LTI state space model can be formed as

which can then be given as

If it is assumed that the control input is constant between sampling intervals t and , then this can be given in discrete time as

where and are subject to the constraints

As discussed in the main paper, our approach is currently limited to Linear Time Invariant (LTI) systems. In general, conditions for infinite-horizon optimality of systems that are not LTI are non-trivial. Some of the results on MPC stability could however be maintained, for example in the case when the LQR value function, , is a local control Lyapunov function (Khalil, 2014; Mayne et al., 2000). In this case, the stability and intrinsic robustness results are maintained (see Limon et al., 2003; 2009). For these system, it would be possible to use our method, for instance in combination with Amos et al. (2018), to provide a stable Non-linear MPC. This is however a big assumptions for systems that are very non-linear. Assessing this LQR controllability condition could be done, for instance, by training a local linear model around the target equilibrium (origin) and then checking whether the DARE is solvable. This should be performed before starting the imitation learning. We leave the study of more general systems to future work.