b

DiscoverSearch
About
My stuff
Neural Lyapunov Model Predictive Control: Learning Safe Global Controllers from Sub-optimal Examples
2020·arXiv
Abstract
Abstract

With a growing interest in data-driven control techniques, Model Predictive Control (MPC) provides an opportunity to exploit the surplus of data reliably, particularly while taking safety and stability into account. In many real-world and industrial applications, it is typical to have an existing control strategy, for instance, execution from a human operator. The objective of this work is to improve upon this unknown, safe but suboptimal policy by learning a new controller that retains safety and stability. Learning how to be safe is achieved directly from data and from a knowledge of the system constraints. The proposed algorithm alternatively learns the terminal cost and updates the MPC parameters according to a stability metric. The terminal cost is constructed as a Lyapunov function neural network with the aim of recovering or extending the stable region of the initial demonstrator using a short prediction horizon. Theorems that characterize the stability and performance of the learned MPC in the bearing of model uncertainties and sub-optimality due to function approximation are presented. The efficacy of the proposed algorithm is demonstrated on non-linear continuous control tasks with soft constraints. The proposed approach can improve upon the initial demonstrator also in practice and achieve better stability than popular reinforcement learning baselines.

Control systems comprise of safety requirements that need to be considered during the controller design process. In most applications, these are in the form of state/input constraints and convergence to an equilibrium point, a specific set or a trajectory. Typically, a control strategy that violates these specifications can lead to unsafe behavior. While learning-based methods are promising for solving challenging non-linear control problems, the lack of interpretability and provable safety guarantees impede their use in practical control settings [1]. Model-based reinforcement learning (RL) with planning uses a surrogate model to minimize the sum of future costs plus a learned value function terminal cost [2], [3]. Approximated value functions, however, do not offer safety guarantees. In contrast, control theory focuses on these guarantees but it is limited by its assumptions. Thus, there is a gap between theory and practice.

A Control Lyapunov Function (CLF) is necessary and sufficient for stabilization [4]. By exploiting the expressiveness of neural networks (NNs), Lyapunov NNs have been demonstrated as a general tool to produce non-conservative stability (and safety) certificates [5], [6] and also improve an existing controller [7], [8], [9]. In most of these settings, the controller is parameterized through a NN as well. The

M. Mittal and M. Gallieri provided equal contributions. M. Mittal is with ETH Z¨urich, Switzerland. mittalma@ethz.ch All authors are associated with NAISENSE SA, Lugano, Switzerland. This work was done by M. Mittal during his internship.

flexibility provided by this choice comes at the cost of increased sample complexity, which is often expensive in real-world safety-critical systems. In this work, we aim to overcome this limitation by leveraging an initial set of one-step transitions from an unknown expert demonstrator (which may be sub-optimal) and by using a learned Lyapunov function and surrogate model within a Model Predictive Control (MPC) formulation.

Our key contribution is an algorithmic framework, Neural Lyapunov MPC (NLMPC), that obtains a single-step horizon MPC for Lyapunov-based control of non-linear deterministic systems with constraints. We extend standard MPC stability results to a discounted setup and provide a theoretical bound for its performance using an imperfect forward model and a verified Lyapunov NN. This complements [3], which considers a perfect model. We draw links between control and RL by proving that a positive advantage function can be sufficient for stability. The proposed approach relates to Advantage Weighted Regression (AWR) [10] which favours positive advantage during the policy update. In this paper, however, the same is also done for the critic by detecting and favouring stability (advantage) over Bellman optimality.

We use alternate learning to train the Lyapunov NN and a scaling factor that make the MPC terminal cost. During training, we approximate the expensive formal verification using validation and the introduction of new points from near the region boundary across iterations. The approach is demonstrated on a constrained torque-limited inverted pendulum and non-holonomic vehicle kinematics. Experiments show that validation can be used as a faster and effective proxy for formal verification during Lyapunov NN learning while leading to stability in practice. The stable region can in fact be larger than that from an MPC demonstrator with a longer prediction horizon. NLMPC can transfer between using an inaccurate surrogate and a nominal forward model and outperform several RL baselines in terms of stability and constraints satisfaction.

a) Controlled Dynamical System: Consider a discretetime, time-invariant, deterministic system:

image

where  t ∈ Nis the timestep index,  x(t) ∈ Rnx, u(t) ∈ Rnuand  y(t) ∈ Rny are, respectively, the state, control input, and measurement at timestep t. We assume that the states and measurements are equivalent and the origin is the equilibrium

image

For further definitions, we refer the reader to [11], [12]. If condition (5), holds everywhere in  Xs, then the origin is a stable equilibrium (XT = {0}). If (most likely) this holds only outside a non-empty inner set,  XT = {x ∈ X : V (x) ≤lT } ⊂ Xs, with  XT ⊃ {0}, then the system converges to a neighborhood of the origin and remains there in the future.

image

results in a new safe set,  X(i+1)s = C(X(i)s ), the one-step controllable set of  X(i)sand the feasible region of (7), X(i+1)s ⊇ X(i)s. We soften the state constraints in (7) and use it recursively to estimate  X(j)s , j > i. We formulate an algorithm that learns the parameter  αas well as the safe set. We train a neural network via SGD to approximate V , hence the ROA estimate will not always increase through iterations. To aim for maximum ROA and minimum MPC horizon, we use cross-validation. We motivate our work by extending theoretical results on MPC stability and a sub-optimality bound for approximate f and V . Finally, we provide an error bound on the learned f for having stability.

d) Learning and Safety Verification: We wish to learn V (x) and  Xsfrom one-step on-policy rollouts, as well as a forward model ˆf(x, u). For MPC stability, we must assume that the Lyapunov network has been formally verified, namely, that (5) is valid within the learned set. Recent work on formal Lyapunov verification [6], [13], [9], [14], respectively, gridbased and symbolic methods, can be used for systems of limited dimensionality. For large-scale systems, however, this is an open area of research. Sampling-based methods [6] provide a high probability certificate which isn’t enough for MPC. In this paper, we demonstrate how validation and counterexamples can be used to help increasing the ROA in practise even without the use of verification during learning.

e) NN-based dynamics model: In some MPC applications, it might not be possible to gather sufficient data from demonstrations in order to be able to learn a model that predicts over long sequences. One-step or few-steps dynamics learning based on NNs can suffer when the model is unrolled for longer time. For instance, errors can accumulate through the horizon due to small instabilities either from the physical system or as artifacts from short sequence learning. Although some mitigations are possible for specific architectures or longer sequences [15], [16], [17], [18], we formulate our MPC to allow for a very short horizon and unstable dynamics. Since we learn a surrogate NN forward model, ˆf(x, u), fromone-step trajectories, we will assume this to have a locally bounded one-step-ahead prediction error, w(t), where:

image

for some compact set of states, ˜X ⊇ X. We also assume that both  f and ˆfare locally Lipschitz in this set, with constants Lfx, Lfu, and L ˆfx, L ˆfu respectively. A conservative value of  µcan be inferred from these constants as the input and state sets are bounded or it can be estimated from a test set.

In the context of MPC, a function V , which satisfies the Lyapunov property (5) for some local controller  K0, is instrumental to formally guarantee stability [19], [20]. We use this insight and build a general Lyapunov function terminal cost for our MPC, based on neural networks. We discuss the formulation of the Lyapunov network and the MPC in Section III-A and Section III-B respectively. In order to extend the controller’s ROA while maintaining a short prediction horizon, an alternate optimization scheme is proposed to tune the MPC and re-train the Lyapunov NN. We describe this procedure in Section III-C and provide a pseudocode in Algorithm 1.

A. Lyapunov Network and Advantage Learning

We use the Lyapunov function network introduced by [8]:

image

where  Vnet(x)is a (Lipschitz) feedforward network that produces a  nV × nxmatrix. The scalars  nVand  lℓ > 0are hyper-parameters. It is easy to verify that (9) satisfies the condition mentioned in (4). In our algorithm, we learn the parameters of the network,  Vnet(x), and a safe level,  ls. Note that equation (5) allows to learn V from demonstrations without explicitly knowing the current policy.

a) Loss function: Suppose  DKdenotes a set of state-action-transition tuples of the form  T = (x, u, x+), wherex+ is the next state obtained applying the policy u = K(x). The Lyapunov network is trained using the following loss:

image

∆V (x) = V (x+) − λV (x),Disadvantage Jvol(T ) =sign�− ∆V (x)� [ls − V (x)] .Discriminator

In (10),  IXsis the indicator function for the safe ROA Xs, which is multiplied to  Js, a function that penalises the instability. The term  Jvolis a classification loss that tries to compute the correct boundary between the stable and unstable points. It is also instrumental in increasing the ROA volume. The scalars  ϵV > 0, λ ∈ [0, 1), 0 < ρ ≪ 1, are hyper-parameters, where the latter trades off volume for stability (we take  ρ = 10−3 as in [21], [8]). To make sure that  Xs ⊆ X, wescale-down the learned  lsa-posteriori. The loss (10) extends the one proposed by [21] in the sense that we only use one-step transitions, and safe trajectories are not explicitly labeled before training. b) RL advantage and stability: The loss (10) is used as an alternative to the more common one-step value error from Reinforcement Learning (RL) [22] (maximising reward r):

image

In RL, (11) is used with accumulated targets to learn the value V⋆given transitions  T ∈ Dand rewards r(x, u). For a  γ-discounted Markov Decision Process (MDP) with r(x, u) = −ℓ(x, u), and candidate value  V(x) = −V (x), then our loss Js(T )is a one-sided version of (11), and (10) aims to learn an on-policy advantage proxy,  −∆V (x), and a region  Xswhere this is positive. Given a policy,  K0, the trueadvantage:

image

where Q is the action-value function. In a deterministic setting, given the assumptions on the reward being negative definite and zero at the desired target, the advantage function can be connected to stability by the following result which, to the best of our knowledge, is a novel connection to RL:

Theorem 1. There exists  ϵ ∈ (0, 1), ¯γ ∈ (1 − ϵ, 1)such that: 1 > γ ≥ ¯γand  u = K(x) ⇒ A⋆(x, u) ≥ 0, ∀x ∈ Xs ⊆ Xyields that  Xs is a safeROA: the state converges to  X¯γ ⊆ Xs.

A trivial choice is  ϵ = ℓℓ/LV. If  K0is stabilizing, the result holds also for  γ = 1. From Theorem 1, using  JTD(0) forRL could provide safety, if  A⋆(x, K(x)) → 0. In practice, our loss (10) can encourage roll-out stability earlier than Bellman error convergence. Replacing  ℓwith an equivalent contraction  λ on Vallows us to learn offline from stabilizing, sub-optimal examples. A combination of our approach with value/advantage learning is of interest for future work.

The remainder of the paper focuses on minimising losses, �∞t=0 γtℓ(x(t), u(t)), as this is more standard in control.

B. Neural Lyapunov MPC

The loss (10) is used to both learn V and to tune its scaling factor in the MPC loss,  α ≥ 1, to provide stability. We aim to improve the ROA of the initial controller, used to collect data, by replacing it with an MPC solving the following input-limited, state soft-constrained, optimal control problem:

image

where  ˆx(i)and  ˆu(i)are the predicted state and the input at i-steps in the future, s(i) are slack variables, u  = {u(i)}N−1i=0 ,the stage cost  ℓis given by (3),  γ ∈ (0, 1]is a discount factor, the function V is the Lyapunov NN from (9), scaled by a factor  α ≥ 1to provide stability, and x(t) is the measured system state at the current time. The penalty  ℓXis used for state constraint violation, see [23]. The optimal cost is denoted as  J⋆MPC(x(t)).

Problem (13) is solved online given the current state x(t); then, the first element of the optimal control sequence,  u⋆(0),provides the action for the physical system. Then, a new state is measured, and (13) is again solved, in a receding horizon. The implementation details are given in Appendix. a) Stability and safety: We extend standard results from [20], [24] to the discounted case and to the  λ-contractive V from (5). In order to prove them, we make use of the uniform continuity of the model, the SQP solution and the terminal cost, V , as done by [24]. Consider the candidate MPC ROA:

image

image

Theorem 2. Stability and robustness Assume that V (x) satisfies (5), with  λ ∈ [0, 1), XT = {0}. Then, given  N ≥ 1,for the MPC (13) there exist a constant  ¯α ≥ 0, a discount factor  ¯γ ∈ (0, 1], and a model error bound  ¯µsuch that, if α ≥ ¯α, µ ≤ ¯µand  γ ≥ ¯γ, then,  ∀x(0) ∈ C(Xs):

1) If  N = 1, µ = 0, then the system is asymptotically stable for any  γ > 0, ∀x(0) ∈ ΥN,γ,α ⊇ Xs.

2) If  N > 1, µ = 0, then the system reaches a set  Bγ thatis included in  Xs. This set increases with decreasing discount,  γ, ∀x(0) ∈ ΥN,γ,α. γ = 1 ⇒ Bγ = {0}.

3) If  αV (x)is the optimal value function in  Xsfor the problem,  µ = 0, and if C(Xs) ̸= Xs, then the system is asymptotically stable,  ∀x(0) ∈ ΥN,γ,α ⊃ Xs.

4) If  µ = 0, then  α ≥ ¯αimplies that  αV (x) ≥V ⋆(x), ∀x ∈ Xs, where  V ⋆is the optimal value for the infinite horizon problem with cost (3) subject to (2).

5) The MPC has a stability margin. If the MPC uses a surrogate model satisfying (8), with one-step error bound ∥w∥22 < ¯µ2 = 1−λLV L2Nˆfx ls, then the system is Input-to- State (practically) Stable (ISpS) and there exists a set BN,γ,µ : x(t) → BN,γ,µ, ∀x(0) ∈ βΥN,γ,α, β ≤ 1.

Theorem 2 states that for a given horizon N and contraction λ, one can find a minimum scaling of the Lyapunov function V and a lower bound on the discount factor  γsuch that the system under the MPC has ROA  ΥN,γ,α ⊇ Xs. If the model is not perfect, its error being less than  µ ≤ ¯µ, then the ROA size decreases but the system is still safe. In this case, a shorter horizon can be beneficial. The proof of the theorem follows standard MPC arguments and is in Appendix.

b) Performance with surrogate models: In order to further motivate for the search of a V giving the largest Xs, notice that a larger  Xscan allow for shortening the MPC horizon, possibly yielding the same ROA. Contrary to [3], we demonstrate how model mismatch and longer horizons can decrease performance with respect to an infinite-horizon oracle with same cost and perfect model. This links RL to arguments used in nominal MPC robustness [24], [25].

Let  ED[JV ⋆(K⋆)]define the expected infinite-horizon performance of the optimal policy  K⋆, evaluated by using the expected infinite-horizon performance (value function), V ⋆, for the stage cost (3) and subject to (2). Similarly, let  Ex∈D[J⋆MPC(x)]define the MPC’s expected performance with the learned V , when a surrogate model is used and Ex∈D[J⋆MPC(x; f)]when f is known.

image

for any  δ > 0:

image

where ¯Lf = min(L ˆfx, Lfx)and ¯ψis a  K∞-function representing the constraint penalty terms.

Theorem 3 is related to [26] for value-based RL. However, here we do not constrain the system and model to be stable, nor assume the MPC optimal cost to be Lipschitz. Theorem 3 shows that a discount  γor a shorter horizon N can mitigate model errors. Since  γ ≪ 1can limit stability (Theorem 2) we opt for the shortest horizon, hence  N = 1, γ = 1. Proof of Theorem 3 is in Appendix. c) MPC auto-tuning: The stability bounds discussed in Theorem 2 can be conservative and their computation is non-trivial. Theoretically, the bigger the  αthe larger is the ROA (the safe region) for the MPC, up to its maximum extent. Practically, for a very high  α, the MPC solver may not converge due to ill-conditioning. Initially, by using the tool from [27] within an SQP scheme, we tried to tune the parameters through gradient-based optimization of the loss (10). These attempts were not successful, as expected from the considerations in [28]. Therefore, for practical reasons, in this work we perform a grid search over the MPC parameter  α. Note that the discount factor  γ is mainlyintroduced for Theorem 3 and analysed in Theorem 2 to allow for future combination of stable MPC with value iteration.

C. Learning algorithm

Our alternate optimization of the Lyapunov NN, V (x), and the controller is similar to [8]. However, instead of training

image

TABLE I: Inverted Pendulum learning on nominal model. With iterations, the number of verified points increases.

image

a NN policy, we tune the scaling  α and learn V (x) used bythe MPC (13). Further, we extend their approach by using a dataset of demonstrations,  Ddemo, instead of an explicitly defined initial policy. These are one-step transition tuples, (x(0), u(0), x(1))m, m = 1, . . . , M, generated by a (possibly sub-optimal) stabilizing policy,  K0. Unlike in [21], our V is a piece-wise quadratic, and it is learned without labels. We in fact produce our own psuedo-labels using the sign of ∆V (x)in (10) in order to estimate  ls. The latter means that we don’t require episode-terminating (long) rollouts, which aren’t always available from data nor accurate when using a surrogate. Also, there is no ambiguity on how to label.

Once the initial  V , Xsare learned from the original demonstrations, we use V and a learned model, ˆf, within the MPC. We propose Algorithm 1, which runs multiple iterations where after each of them the tuned MPC and the surrogate model are used to generate new rollouts for training the next V and Xs. We search for the MPC parameter  αby minimizing the loss (10), using  (1+ϵext)Xs as a new enlargedtarget safe set instead of  Xs. Introducing possible counterexamples, the use of  ϵextcan push the safe set to extend and, together with validation, is used as a scalable proxy for verification. More specifically, we select V and  αusing the criteria that the percentage of stable points (∆V < 0) increases and that of unstable points decreases while iterating over j and i when evaluated on a validation set. The best iteration is picked1.

In Algorithm 1, MPC denotes the proposed Neural Lyapunov MPC, while one_step_sim is a one-step transition of the MPC and the surrogate. To train the parameters of V and the level  ls, Adam optimizer is used [29]. A grid search over the MPC parameter  αis performed. A thorough tuning of all MPC parameters is also possible, for instance, by using black-box optimisation methods. This is left for future work.

Through our experiments, we show the following: 1) increase in the safe set for the learned controller by using our proposed alternate learning algorithm, 2) robustness of the one-step NLMPC compared to a longer horizon MPC (used as demonstrator) when surrogate model is used for predictions, and 3) effectiveness of our proposed NLMPC against the demonstrator and various RL baselines. a) Constrained inverted pendulum: In this task, the pendulum starts near the unstable equilibrium (θ = 0°). The goal is to stay upright. We bound the input so that the system

0.0 6.6 13.2 19.8 26.4 33.0 39.6 46.2 52.8 59.4

(a) Demonstrator (b) NLMPC (nominal) (c) NLMPC (surrogate)

Fig. 1: Inverted Pendulum: Testing learned controller on nominal system. Lyapunov function with safe trajectories. NLMPC learns and transfers successfully to surrogate model.

cannot be stabilized if  |θ| > 60°. We use an MPC with horizon 4 as a demonstrator, with terminal cost,  500xT PLQRx,where  PLQRis the LQR optimal cost matrix. This is evaluated on 10K equally spaced initial states to generate the dataset Ddemo. We train a grey-box NN model, ˆf using 10K randomtransition tuples. More details are in Appendix. The learned V and  α, obtained from Algorithm 1, produce a one-step MPC that stabilizes both the surrogate and the actual system. Table I shows that the loss and percentage of verified points improve across iterations. The final ROA estimate is nearly maximal and is depicted along with the safe trajectories, produced by the MPC while using predictions from the nominal and surrogate models, in Figure 1. The performance matches that of the baseline and the transfer is successful due to the accuracy of the learned model. A full ablation is in Appendix. b) Constrained car kinematics: The goal is to steer the (x, y, θ)to (0, 0, 0) with constraints  [±10m, ±10m, ±180◦]. This is only possible through non-linear control. The vehicle cannot move sideways, hence policies such as LQR are not usable to generate demonstrations. Thus to create  Ddemo, anMPC with horizon 5 is evaluated over 10K random initial states. The surrogate, ˆfis a grey-box NN trained using 10K random transition tuples. More details are in Appendix. Figure 3 shows the learning curves, training of the Lyapunov function over iterations and line-search for the MPC auto-tuning. Table IIa summarises the metrics improvement across the iterations, indicating an increase in the ROA when a perfect model is used. With an imperfect model, the second iteration gives the best results, as shown in Table IIb.

We test the transfer capability of the approach in two ways.

TABLE II: Constrained vehicle kinematics learning.

image

image

Fig. 2: Car kinematics: Transfer from surrogate to a nominal model. Top: Lyapunov function contours at  φ = 0with trajectories for 40 steps. Bottom: Lyapunov function evaluated for specific policy on several initial states (decreasing means more stable).

image

Fig. 3: Car kinematics: Alternate learning on surrogate model. After every  NV = 800epochs of Lyapunov learning, the learned Lyapunov function is used to tune the MPC parameters. Top: The training curves for Lyapunov function. Vertical lines separate iterations. Middle: The resulting Lyapunov function V at  φ = 0with the best performance. Bottom: Line-search for the MPC parameter  αto minimize the Lyapunov loss (10) with V as terminal cost. The loss is plotted on the y-axis in a log(1 + x) scale. The point marked in red is the parameter which minimizes the loss.

First, we learn using the nominal model and test using the surrogate model for the MPC predictions. This is reported in Appendix for the sake of space. Second, the learning is performed using the surrogate model as in Algorithm 1, and the MPC is then tested on the nominal model while still using

the surrogate for prediction. This is depicted in Figure 2. Our MPC works better than the demonstrator when using the incorrect model. The learned MPC transfers successfully and completes the task safely.

TABLE III: Comparison with baselines. We compare our NLMPC (with surrogate model for predictions) with baselines. In the pendulum, our approach is second to the demonstrator for less than 1% margin. In the car task, NLMPC performs better than all baselines and improves convergence from the demonstrator, while it is nearly on par with the latter on constraints.

image

c) Comparison to baselines: Prior works such as constrained policy optimization (CPO) [30] provide safety guarantees in terms of constraint satisfaction that hold in expectation. However, due to unavailability of a working implementation, we are unable to compare our approach against it. Instead to enforce safety constraints during training of the RL algorithms, we use two different strategies: v1) early episode termination; v2) reward shaping with a constraint penalty. The v2 formulation is similar to the one used in [31], which demonstrated its practical equivalence to CPO when tuned. We compare our approach against model-free and model-based baseline algorithms. For the model-free baselines, we consider the on-policy algorithm proximal policy optimization (PPO) [32] and the off-policy algorithm soft actor-critic (SAC) [33]. For model-based baselines, we consider model-based policy optimization (MBPO) [34] and the demonstrator MPC. Further details about the reward shaping and learning curves are in Appendix.

We consider the performance of learned controllers in terms stability and safety. Stability performance is analogous to the success rate in performing the set-point tracking task. We consider a task is completed when  ||x(T)||2 < 0.2 where Tis the final time of the trajectory. For the car, we exclude the orientation from this index. The safety performance combines the former with state constraints satisfaction over the entire trajectory. As shown in Table III, for the inverted pendulum, all the policies lead to some safe trajectories. Note that the demonstrator (which has an LQR terminal cost) is an optimal controller and gives the maximum achievable performance. In terms of stability performance, our approach performs as good as the demonstrator MPC. The RL trained policies give sub-optimal behaviors, i.e. sometimes the system goes to the other equilibria. For the car, the demonstrator MPC is sub-optimal due to non-linearities. NLMPC improves upon it in performance and it is on par with it in terms of safety. NLMPC also significantly outperforms all of the considered RL baselines while using less samples for learning2. While the RL baselines have full access to the environments, it appears that our approach is better suited to non-linearities and constraints even by learning solely from offline data.

Stability and robustness of MPC and of discounted optimal control have been studied in several prior works [19], [35], [24], [20], [36], [37]. Numerical stability verification was studied in [5], [6] and, using neural network Lyapunov functions in [7], [8]. Neural Lyapunov controllers were also trained in [9]. MPC solvers based on iterative LQR (iLQR) were introduced in [38]. Sequential Quadratic Program (SQP) was studied in [39]. NNs with structural priors have been studied in [40], [41], [42]. Value functions for planning were learned in [3], [43], [44]. [8] learned a NN Lyapunov function and an NN policy with an alternating descent method, initialized using a known stabilizing policy. We remove this assumption and use MPC. Suboptimality was analysed in [45] for MPC and in [34] for policies. AWR [10] seeks positive advantage only during the policy update, not for the critic. Gaussian processes models have been studied in [46], [47].

We presented Neural Lyapunov MPC, a framework to train a stabilizing non-linear MPC based on learned neural network terminal cost and surrogate model. After extending existing theoretical results for MPC and value-based reinforcement learning, we have demonstrated that the proposed framework can incrementally increase the stability region of the MPC through offline RL and then safely transfer on simulated constrained non-linear control scenarios. Through comparison of our approach with existing RL baselines, we showed how NNs can be leveraged to achieve policies that outperform these methods on safety and stability.

Future work could address the reduction of the proposed sub-optimality bound, for instance through the integration of value learning with Lyapunov function learning as well as the optimal selection of the MPC prediction horizon. A broader class of stage costs and rewards could also be investigated.

[1] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Man´e, “Concrete problems in ai safety,” arXiv preprint arXiv:1606.06565, 2016.

[2] T. M. Moerland, J. Broekens, and C. M. Jonker, “Model-based reinforcement learning: A survey,” arXiv preprint arXiv:2006.16712, 2020.

[3] K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch, “Plan online, learn offline: Efficient learning and exploration via model-based control,” arXiv preprint arXiv:1811.01848, 2018.

[4] H. K. Khalil, Nonlinear Control. Pearson, 2014.

[5] R. V. Bobiti, “Sampling driven stability domains computation and predictive control of constrained nonlinear systems,” Ph.D. dissertation, 2017.

[6] R. Bobiti and M. Lazar, “Sampling-based verification of lyapunov’s inequality for piecewise continuous nonlinear systems,” arXiv preprint arXiv:1609.00302, 2016.

[7] F. Berkenkamp, M. Turchetta, A. P. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” arXiv preprint arXiv:1705.08551, 2017.

[8] M. Gallieri, S. S. M. Salehian, N. E. Toklu, A. Quaglino, J. Masci, J. Koutn´ık, and F. Gomez, “Safe interactive model-based learning,” arXiv preprint arXiv:1911.06556, 2019.

[9] Y.-C. Chang, N. Roohi, and S. Gao, “Neural lyapunov control,” arXiv preprint arXiv:2005.00611, 2020.

[10] X. B. Peng, A. Kumar, G. Zhang, and S. Levine, “Advantage-weighted regression: Simple and scalable off-policy reinforcement learning,” arXiv preprint arXiv:1910.00177, 2019.

[11] F. Blanchini and S. Miani, Set-Theoretic Methods in Control (Systems & Control: Foundations & Applications). Birkh¨auser, 2007.

[12] E. Kerrigan, “Robust constraint satisfaction: Invariant sets and predictive control,” Ph.D. dissertation, 2000.

[13] A. Abate, D. Ahmed, M. Giacobbe, and A. Peruffo, “Formal synthesis of lyapunov neural networks,” IEEE Control Systems Letters, vol. 5, no. 3, p. 773–778, Jul 2021.

[14] S. Kong, S. Gao, W. Chen, and E. Clarke, “dreach:  δ-reachability analysis for hybrid systems,” in Tools and Algorithms for the Construction and Analysis of Systems, C. Baier and C. Tinelli, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2015, pp. 200–205.

[15] L. B. Armenio, E. Terzi, M. Farina, and R. Scattolini, “Echo state networks: analysis, training and predictive control,” in 2019 18th European Control Conference (ECC). IEEE, Jun. 2019.

[16] N. A. K. Doan, W. Polifke, and L. Magri, “Physics-informed echo state networks for chaotic systems forecasting,” in Lecture Notes in Computer Science. Springer International Publishing, 2019, pp. 192– 198.

[17] J. Pathak, Z. Lu, B. R. Hunt, M. Girvan, and E. Ott, “Using machine learning to replicate chaotic attractors and calculate lyapunov exponents from data,” Chaos: An Interdisciplinary Journal of Nonlinear Science, vol. 27, no. 12, p. 121102, Dec. 2017.

[18] M. Ciccone, M. Gallieri, J. Masci, C. Osendorfer, and F. Gomez, “NAIS-Net: Stable Deep Networks from Non-Autonomous Differential Equations,” arXiv:1804.07209 [cs, stat], Apr. 2018, arXiv: 1804.07209.

[19] D. Mayne, J. Rawlings, C. Rao, and P. Scokaert, “Constrained model predictive control: Stability and optimality,” Automatica, vol. 36, no. 6, pp. 789 – 814, 2000.

[20] D. Limon, T. Alamo, and E. Camacho, “Stable constrained MPC without terminal constraint,” American Control Conference, pp. 4893– 4898, 2003.

[21] S. M. Richards, F. Berkenkamp, and A. Krause, “The lyapunov neural network: Adaptive stability certification for safe learning of dynamical systems,” in Conference on Robot Learning. PMLR, 2018, pp. 466– 476.

[22] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 1998, vol. 28.

[23] E. C. Kerrigan and J. M. Maciejowski, “Soft constraints and exact penalty functions in model predictive control,” in Proceedings of UKACC International Conference, 2000.

[24] D. Limon, T. Alamo, D. M. Raimondo, D. M. de la Pe˜na, J. M. Bravo, A. Ferramosca, and E. F. Camacho, “Input-to-State Stability: A Unifying Framework for Robust Model Predictive Control,” in Nonlinear Model Predictive Control. Springer Berlin Heidelberg, 2009, pp. 1–26.

[25] M. Gallieri, LASSO-MPC – Predictive Control with  ℓ1-Regularised Least Squares. Springer-Verlag, 2016.

[26] K. Asadi, D. Misra, and M. Littman, “Lipschitz continuity in model-based reinforcement learning,” in International Conference on Machine Learning. PMLR, 2018, pp. 264–273.

[27] A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and Z. Kolter, “Differentiable convex optimization layers,” arXiv preprint arXiv:1910.12430, 2019.

[28] B. Amos, I. D. J. Rodriguez, J. Sacks, B. Boots, and J. Z. Kolter, “Differentiable mpc for end-to-end planning and control,” arXiv preprint arXiv:1810.13400, 2018.

[29] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[30] J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” arXiv preprint arXiv:1705.10528, 2017.

[31] A. Ray, J. Achiam, and D. Amodei, “Benchmarking Safe Exploration in Deep Reinforcement Learning,” 2019.

[32] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.

[33] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Offpolicy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.

[34] M. Janner, J. Fu, M. Zhang, and S. Levine, “When to trust your model: Model-based policy optimization,” in Advances in Neural Information Processing Systems, vol. 32, 2019.

[35] J. B. Rawlings and D. Q. Mayne, Model Predictive Control Theory and Design. Nob Hill Pub, Llc, 2009.

[36] S. V. Rakovi´c, B. Kouvaritakis, R. Findeisen, and M. Cannon, “Homothetic tube model predictive control,” Automatica, vol. 48, pp. 1631–1638, 08 2012.

[37] V. Gaitsgory, L. Gr¨une, and N. Thatcher, “Stabilization with discounted optimal control,” Systems & Control Letters, vol. 82, pp. 91–98, 2015.

[38] Y. Tassa, T. Erez, and E. Todorov, “Synthesis and stabilization of complex behaviors through online trajectory optimization,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 4906–4913.

[39] J. Nocedal and S. Wright, Numerical optimization. Springer Science & Business Media, 2006.

[40] A. Quaglino, M. Gallieri, J. Masci, and J. Koutn´ık, “SNODE: Spectral Discretization of Neural ODEs for System Identification,” arXiv:1906.07038 [cs], Jan. 2020, arXiv: 1906.07038.

[41] C. Yıldız, M. Heinonen, and H. L¨ahdesm¨aki, “ODE2VAE: Deepgenerative second order ODEs with Bayesian neural networks,” arXiv:1905.10994 [cs, stat], Oct. 2019, arXiv: 1905.10994.

[42] S. Pozzoli, M. Gallieri, and R. Scattolini, “Tustin neural networks: a class of recurrent nets for adaptive MPC of mechanical systems,” Nov. 2019, arXiv: 1911.01310.

[43] R. Deits, T. Koolen, and R. Tedrake, “Lvis: learning from value function intervals for contact-aware robot controllers,” in International Conference on Robotics and Automation (ICRA), 2019, pp. 7762–7768.

[44] J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee, “Sample-efficient reinforcement learning with stochastic ensemble value expansion,” arXiv preprint arXiv:1807.01675, 2018.

[45] L. Grune and A. Rantzer, “On the infinite horizon performance of receding horizon controllers,” IEEE Transactions on Automatic Control, vol. 53, no. 9, pp. 2100–2111, 2008.

[46] T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause, “Learningbased model predictive control for safe exploration,” in 2018 IEEE Conference on Decision and Control (CDC), 2018, pp. 6059–6066.

[47] L. Hewing, K. P. Wabersich, M. Menner, and M. N. Zeilinger, “Learning-based model predictive control: Toward safe learning in control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, pp. 269–296, 2020.

[48] Y. Qin, M. Cao, and B. D. Anderson, “Lyapunov criterion for stochastic systems and its applications in distributed computation,” IEEE Transactions on Automatic Control, pp. 546–560, 2019.

[49] A. Bemporad, M. Morari, V. Dua, and E. Pistikopoulos, “The explicit solution of model predictive control via multiparametric quadratic programming,” in American Control Conference. IEEE, 2000.

[50] D. Hadfield-Menell, C. Lin, R. Chitnis, S. Russell, and P. Abbeel, “Sequential quadratic programming for task plan optimization,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2016, pp. 5040–5047.

[51] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” arXiv preprint arXiv:1912.01703, 2019.

[52] T. I. Fossen, Handbook of marine craft hydrodynamics and motion control. John Wiley & Sons, 2011.

[53] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in International Conference on Artificial Intelligence and Statistics (AISTATS), 2010, pp. 249–256.

Neural Lyapunov Model Predictive Control Supplementary Material

We provide proofs of the Theorems 1, 2 and 3, introduced in the paper, in Appendix VII. In Appendix B, we describe the formulation of the model predictive controller as a sequential quadratic program (SQP). In Appendix C, we discuss the experimental setup that comprises of the implementation specifics, details about baseline controllers, parameters for the experiments, and more description of the control problem setting. In Appendix D, we provide additional plots and results for the inverted pendulum and car kinematics examples. Finally, in Appendix E we discuss an algorithm for probabilistic safety verification.

Here we provide the proofs of Theorems stated in the paper. We write the proof for Theorem 3 before Theorem 2 since it is simpler and helps in proving the latter.

Proof of Theorem 1

Proof. We denote the advantage of action u with respect to an initial policy  K0(x), namely  AK0(x, u), as  A⋆(x, u). We assume that there is a new policy K(x) such that:

image

for all initial states  x such that x ∈ Xs ⊆ X. Recall that [22],  A⋆(x, u) = Q⋆(x, u) − V⋆(x), where the right-hand terms are, respectively, the action-value function (where we assume  K0(x)is used for future actions) and the value of policy  K0(x), namely,  V⋆(x) = VK0(x). Then, from their definition and by the above assumption:

image

Since, by assumption,  r(x, u) = −ℓ(x, u)and  ℓ(x, u) ≥ lℓ∥x∥22, is positive definite (lℓ > 0) and zero at the origin (the target), then we have that the value function also satisfies  V⋆(x) = −V ⋆(x), where  V ⋆satisfies a condition similar to (4) provided that either  ℓ(x, K0(x)) → 0, ∀x ∈ Xs (K0is stabilizing) or that  γ < 1. In other words, under these assumptions there is a positive constant  LV ⋆such that:

image

From the above considerations, it follows that  ∀x ∈ Xs:

where, in the last step, we have used the fact that the policy K(x) and the loss  ℓare deterministic. Following [48], the stability of a stochastic system origin is determined with high probability by means of (16) as well as:

image

that is, when the discount factor is  γ = 1. The same applies to deterministic systems, and with probability 1. Recall that in our setting  γ = 1yields a finite value only when trajectories terminate at  ℓ(x, u) = 0under  K0.

For systems that are deterministic, we generalize this result to show that there exist a discount factor less than 1 for which the set  Xsis invariant, namely, the trajectories starting in  Xs stay in Xs. Recall that we have assumed that  Xssatisfying the non-negative advantage condition is a level set of  V ⋆, namely,

image

In order to prove the claim, it is sufficient show that, there exists a  ¯γ ∈ (0, 1)and a positive scalar  c1 < csuch that trajectories starting in  Xsterminate in  X¯γ = {x ∈ Rnx : V ⋆(x) ≤ c1}. In other words, we wish to show that, given  ¯γ,

image

To proceed, notice that in  Xswe have that:

image

Rearranging the terms, we wish the following quantity to be less than zero:

image

In other words, we wish that:

The final step consists of using the describing function inequalities, leading to the sufficient condition:

image

where we defined  ϵ = lℓLV, with  ϵ ∈ (0, 1)and hence the claim follows with  c1 = 0.

image

For which, since the actions are bounded and hence  βV ⋆(x) ≥ ℓ(x, K(x)), for some  β > 0 (β = 1 ⇐ K0 = K = K⋆),

then it is sufficient to have:

Hence, the set  X¯γis defined and  ¯γdetermines whether it is inside  Xswhich would make  Xsinvariant.

Possible extensions of Theorem 1: This paper considered the case of a quadratic (or positive definite) cost with a single minimum at a given target (e.g. the origin). Reinforcement learning and economic MPC often deal with different classes of rewards/costs, possibly leading to complex equilibria. We make some considerations on how future work could aim to address more general cases.

An extension of the Theorem 1 could be considered for the case when the reward is upper bounded by  rmax, e.g. if we can express the reward as: r(x, u) = rmax − ℓ(x, u),

with  minu ℓ(x, u) > 0outside some target set,  x ̸∈ T, and zero at the target set T. Then the value function satisfies

image

where  V ⋆ satisfies a condition similar to (4). In particular, the constant reward terms in the infinite sums cancel out from the stability condition:

image

This hints to the fact that the advantage function could play a crucial role in stability and convergence under RL policies.

Limitations of Theorem 1: Theorem 1 requires the value function  V⋆(x)to be known exactly, which is generally hard for continuous control problems with function approximation. Our subsequent results relax this assumption to be that the candidate function V is a valid Lyapunov function rather than the exact value function. This for instance means that scaling errors, local sub-optimality and potentially even a large bias in the Bellman error could be tolerated without affecting the task success.

Proof of Theorem 3

Proof. To prove the result, we first write:

Ex∈D[J⋆MPC(x)] − Ex∈D[J⋆V ⋆(x)] = Ex∈D[J⋆MPC(x)] − Ex∈D[J⋆MPC,f(x)] � �� �I1

where  Ex∈D[J⋆MPC,f(x)]denotes the MPC performance when a perfect model is used for predictions. For the term  I2, [3] provide a bound on the performance of the MPC policy. It is important to note that in their problem formulation, the MPC’s objective is defined as a maximization over the cumulative discounted reward, while in our

formulation (13) we consider a minimization over the cost. Consequently, compared to inequality presented by [3], there is a change in sign of the terms in the left-hand side of the inequality. This means:

image

We now focus on the term  I1in Equation 27. Let us denote  x⋆(i)and  u⋆(i)as the optimal state and action predictions respectively, obtained by using the correct model, f, and the MPC policy at time i. By the principle of optimality, the optimal sequence for the MPC using the correct model, uf, can be used to upper-bound the optimal cost for the MPC using the surrogate model:

image

Since the input sequence is now the same for both terms in right-hand side of Equation 29, the difference in the cost is driven by the different state trajectories cost (the cost on x over the horizon which includes the state-constraint violation penalty, as defined in Equation 37) as well as the terminal cost. In form of equation, this means:

image

Recall that we assume the surrogate model is Lipschitz with constant  L ˆfx. This means that  ∀˜x, x ∈ Rnxand the same input  u ∈ Rnu, we have:

image

Further, from Equation 8,  ∀(x, u) ∈ ˜X × U, we have:

Under the optimal policy for the correct model, let us denote the deviation in the state prediction when the MPC input prediction is applied with a different model, ˆf, as ˆd(j) := ˆx(j) − x⋆(j). At step j = 1:

image

At step j = 2:

image

By induction, it can be shown that:

image

Alternately, if we assume the correct system that is to be controlled is Lipschitz with constant  Lfx, then proceeding as before: At step j = 1:

image

At step j = 2:

image

By induction, again we have:

Combining equations (31) and (32) and by letting ¯Lif = min(Liˆfx, Lifx), we obtain:

image

The following identity is used;  ∀δ > 0:

image

Hence, we can write the cost over the predicted state as:

From equations (33) and (34),  ∀j ∈ {0, 1, ..., N − 1}, we obtain:

image

Recall from Equation 4 that  V (x) ≤ LV ∥x∥22. Proceeding as before, we can write the following for the terminal cost:

image

The final part of the proof concerns the constraints cost term. Let the state constraints be defined as a set of inequalities:

image

where g is a convex function. For the optimal solution,  x⋆, the violation of the constraint is represented through the slack

variable:

image

Since the constraints are convex and compact, and they contain the origin, then at the optimal solution,  x⋆, we have that there exists a  K∞-function,  ¯η(r), such that: ���ℓX(s(x⋆ + ˆd)) − ℓX (s (x⋆))��� ≤ ¯η(∥ ˆd∥).

Using the above inequality and Equation 33, it follows that,  ∀j ∈ {0, 1, ..., N}:

image

By combining equations (27), (28), (29), (30), (35), (36) and (37), we obtain the bound stated in the Theorem:

Ex∈D[J⋆MPC(x)] − Ex∈D[J⋆V ⋆(x)]

image

� �� �=: ¯ψ(µ)+ δ Ex∈D [J⋆MPC(x; f)].

image

Proof of Theorem 2

We prove the following, extended version of the theorem.

Theorem 4. Stability and robustness Assume that V (x) satisfies (5), with  λ ∈ [0, 1), XT = {0}. Then, for any horizon length  N ≥ 1there exist a constant  ¯α ≥ 0, a minimum discount factor  ¯γ ∈ (0, 1], and a model error bound  ¯µsuch that, if α ≥ ¯α, µ ≤ ¯µand  γ ≥ ¯γ, then,  ∀x(0) ∈ C(Xs)

1) If  N = 1, µ = 0, then the system is asymptotically stable for any  γ > 0, ∀x(0) ∈ ΥN,γ,α.

2) If  N > 1, µ = 0, then the system reaches a set  Bγthat is included in  Xs. This set increases monotonically with decreasing discount factors,  γ, ∀x(0) ∈ ΥN,γ,α. γ = 1 ⇒ Bγ = {0}.

3) If  N > 1, µ = 0, and once in  Xswe switch to the expert policy, then the system is asymptotically stable,  ∀x(0) ∈ ΥN,γ,α.

4) If  αV (x)is the optimal value function for the discounted problem,  µ = 0, and if  C(Xs) = Xs, then the system is asymptotically stable,  ∀x(0) ∈ ΥN,γ,α.

5) If  αV (x)is the optimal value function in  Xsfor the problem,  µ = 0, and if C(Xs) ̸= Xs, then the system is asymptotically stable,  ∀x(0) ∈ ΥN,γ,α.

6) The MPC has a stability margin. If the MPC uses a surrogate model satisfying (8), with one-step error bound  ∥w∥22 <¯µ2 = 1−λLV L2Nˆfx ls, then the system is Input-to-State (practically) Stable (ISpS) and there exists a set  BN,γ,µ : x(t) → BN,γ,µ,∀x(0) ∈ βΥN,γ,α, β ≤ 1.

7) If  µ = 0, then  α ≥ ¯αimplies that  αV (x) ≥ V ⋆(x), ∀x ∈ Xs, where  V ⋆is the optimal value function for the infinite horizon problem with cost (3) and subject to (2).

Proof. In order to prove point 1 in the theorem, we first use the standard arguments for the MPC without terminal constraint [19], [20] in the undiscounted case. We then extend the results to the discounted case.

a) Nominal stability: First, when an invariant set terminal constraint is used, which in our case corresponds to the condition  V (x(N)) ≤ ls with Xs ⊆ X, then [19] have provided conditions to prove stability by demonstrating that  J⋆MPC(x)is a Lyapunov function. These require the terminal cost to be a Lyapunov function that satisfies Equation 5. Hence, we start by looking for values of  αsuch that  αV (x)satisfies Equation 5. In other words, we wish to find an  ¯α1 ≥ 1such that, for all  α ≥ ¯α1and for some policy  K0(in our case, the demonstrator for V ), the following condition holds:

image

Let us denote  x+ = f(x, K0(x))for brevity. We have, by assumption, that:

image

This implies that:

Recall that the loss function satisfies  lℓ∥x∥22 ≤ ℓ(x, u). Since the MPC is solved using a sequence of convex quadratic programs, it is also Lipschitz [49]. Similarly, if  K0is Lipschitz or (uniformly) continuous over the closed and bounded set U, then since X is also closed and bounded, there also exists a local upper bound for the loss function on this policy, namely, ℓ(x, K0(x)) ≤ Lℓ∥x∥22.

Further, recall from Equation 4 that  lℓ∥x∥22 ≤ V (x). Using the above notions, we have:

image

To satisfy the above condition, solving for a  β ≥ 1is sufficient. From Equations (40) and (41), it implies that:

image

Now, the function  αV (x)satisfies all the sufficient conditions stated by [19] for the stability of an MPC under the terminal constraint  ˆx(N) ∈ Xswhich is equivalent to  V (ˆx(N)) ≤ ls, without discount (with  γ = 1).

Since we do not wish to have such a terminal constraint, we wish for another lower bound  ˆα2 ≥ 1such that, if  α ≥ ¯α2,then  V (ˆx(N)) ≤ lsat the optimal solution. The computation of this  ¯α2has been outlined by [24] for the undiscounted case. Since our constraints are closed, bounded and they contain the origin, our model and the MPC control law are both Lipschitz, we directly use the result from [24] to compute  ¯α2:

image

where  ˜x(i), ˜u(i)represent a sub-optimal state-action sequence for which  V (˜x(N)) ≤ ρlswith  ρ ∈ [0, 1), and d is a lower bound for the stage loss  ℓfor all x outside  Xsand all u in U.

image

to guarantee stability when  γ = 1. When the discount factor (γ < 1) is used, condition (38) is still respected by the same range of  αsince

image

However, from the discussion in [37], for infinite horizon optimal control, it appears that Equation 38 is not sufficient for J⋆MPC(x)to be a Lyapunov function, even when a terminal constraint is used.

We wish to find a lower-bound  ¯γsuch that, given  αsatisfying Equation 44, the MPC is stable for  γ ≥ ¯γ. For infinite-horizon optimal control, this was done by [37]. First, recall that:

image

In [37], it shown that  1 ≤ C < 1/(1 − γ)is sufficient for stability of an infinite-horizon discounted optimal control problem, when  αV (x)is its value function. This means that:

image

which implies that:

For MPC, we will instead present an additional condition to the above one that leads to at least convergence to the safe set. This results in a bounded and safe solution. Exact convergence to the origin will be then confirmed when V is the actual value function, as in [37], or if we switch to the demonstrating policy,  K0, once in the terminal set. Finally, we will remove the terminal constraint as done for the undiscounted case with a final bound on  αand  γ.

Recall that condition (38) applies. If the terminal constraint was met at time t, namely, if  V (x⋆(N)) ≤ ls, then at the next time step, t + 1 we have that  u(N + 1) = K0(x⋆(N))is feasible. Hence, the optimal MPC solution can be upper-bounded by the shifted solution at the previous time  t, with the K0policy appended at the end of the horizon [19]. Denote this policy as  ˜u and  ˜xas the predictions. We have that:

image

Hence,

where  LN−1(x) = �N−1i=1 γi−1ℓ(˜x(i), ˜u(i))and we have taken  αsuch that  γα ≥ ¯α1.

Now, for  γ = 1, the effect of  LN−1disappears and the MPC optimal cost is a Lyapunov function as in the standard MPC stability result from [19]. By inspection of  LN−1, since the cost is bounded over bounded sets, also a small enough  γ couldbe found such that  LN−1(x) < ℓ(x, ˜u(0)). This γ, however, depends on  x. Consider x ̸∈ Xs, for which there exist a feasible solution, namely a solution providing  ˜x(N) ∈ Xs. Then, since  ℓis strictly increasing,  ℓ(0, 0) = 0, Xscontains the origin and the constraints are bounded, we have that there exist a  υ ≥ 1such that for any feasbile x:

image

is an upper bound for  LN−1(x). For instance,

is sufficient for any closed set of initial conditions  x(0) ∈ ϵX ⊃ Xs, with ϵ > 0. In order to have stability, it suffices to have (1 − γ)¯LN−1 − ℓ(x, ˜u(0)) ≤ 0which requires:

image

In the above condition  ¯γ(x)can be less than 1 only outside a neighborhood of origin. Consider again

image

Then taking

provides that the system trajectory will enter the safe set  Xs, hence  Bγ ⊆ Xs. Finally, once  x ∈ Xs, we that the policy K0(x)is feasible and:

image

Hence, we can use this policy to upper bound the MPC cost:

If the above is true with equality, then we can proceed as in Theorem 3.1 of [37], with  γ > ¯γ1. This would require  αV (x)

to be also a value function for the discounted problem. From the above considerations, we can conclude that that: 1) If N = 1, then ¯LN−1 = 0and the system is asymptotically stable for any  γ > 0. 2) If  N > 1, γ ≥ ¯γ2, then the system reaches an bound  Bγthat is included in  Xs. 3) If  N > 1 γ ≥ ¯γ2and once in  Xswe switch to the policy  K0(x)then the system is asymptotically stable.

4) If  αV (x)is the global value function for the discounted problem and if  R(Xs) = Xs, then  γ > ¯γ1provides that the system is Asymptotically stable.

5) If  αV (x)is only the value function in  Xsfor the discounted problem, and if  R(Xs) ̸= Xs, then  γ > max(¯γ1, ¯γ2)

image

Finally, following Theorem 3 from [20], the terminal constraint can be removed for all points  x ∈ RN(ρXs), with ρ ∈ [0, 1),by setting:

image

In fact, by the same argument of [20], for any states for which it exists a feasible sequence  ˜u taking  ˆx(N) to ρXs we havethat:

image

If  αsatisfies (52), then we also have that:

from which we can directly verify that the set defined in (14) is a ROA (for either asymptotic or practical stability):

ΥN,γ,α =�x  ∈ Rnx : J⋆MPC(x) ≤ 1 − γN1 −γ d  + γNαls

image

Using the above identity, one wish to bound the increase in the MPC cost due to uncertainty. At the same time, we wish the MPC to remain feasible and perform a contraction, namely, to have a stability margin. Since we are using soft constraints, then the MPC remains always feasible, however, we need the predictions at the end of the horizon to be in an invariant set Xseven under the effect of uncertainty. In particular, we will use V (x) and its contraction factor  λto compute a smaller level set  ζXs, for some ζ ∈ (0, 1)which is invariant under the uncertainty. Once this is found then we can compute a new  αfor this set according to (52). In particular, under the policy  K0, we have that:

image

We wish this quantity to be non-positive for  x ̸∈ ζXs, which means that  V (x) ≥ ζls. For this it is sufficient to have:

image

Since we apply the function V to the prediction at time N, and at the next step the measured state (prediction at index 0) differs from the previous MPC prediction of a quantity w, for the MPC this condition translates into:

image

Therefore, given the model error w, if the MPC remains feasible and if  αand  γexceed their lower bounds given the restricted set  ζXs, we have that  ∥w∥2 < ¯µimplies that:

J⋆MP C(x) = J⋆MP C(x+ + w) − J⋆MP C(x) +  J⋆MP C(x+) − J⋆MP C(x+) ≤ JMP C(x+,˜u)  − J⋆MP C(x) +  σ(∥w∥) ≤(1  − γ)¯LN−1 − ℓ(x, ˜u(0)) +  σ(∥w∥) ≤ −lℓ∥x∥22+  σ(∥w∥) + ¯d(N),

which is the definition of Input-to-State practical Stability [4], [24], where we have defined ¯d(N) = (1 − γ)¯LN−1. The trajectory of the system is therefore bounded by the level-set of  J⋆MP C(x)outside which  σ(µ) + ¯d(N) ≤ lℓ∥x∥22. Since σ isstrictly increasing and ¯dis strictly decreasing, we can also conclude that the size of this bound increases with increasing model error  µand with the horizon length N. Note that the term ¯dvanishes if  γ = 1but the term  σwill also increase with N if ¯Lf > 1. From the restriction of the terminal set, it also follows that the ROA defined in Equation 14 will also be restricted unless we recompute a larger  αfor this new set.

c) Value function upper bound: Denote ˆV (x) = αV (x). Recall that, if  α ≥ α1, as defined in Equation 42, and if ∥w(t)∥ = 0 ∀t, then we have that,  ∀x(t) ∈ Xs:

image

which in turn implies that, by means of  γ ≤ 1and of induction:

image

which is the value of the policy  K0. This is, by definition, an upper bound of the optimal value function,  V ⋆(x). Hence αV (x) ≥ V ⋆(x), ∀x ∈ Xs.

image

In practice, the ISS bound  σ(µ)from Theorem 2 has a form similar to the one discussed for the constraint penalty in the proof of Theorem 3, see Equation 37. Its explicit computation is omitted for brevity; however, in general, we can expect the bound to become worse for systems that are open-loop unstable as the horizon length increases.

We solve the MPC problem through iterative linearisations of the forward model, which gives the state and input matrices:

image

These are evaluated around a reference trajectory:

This is initialised by running the forward model with zero inputs, and then updated at each iteration by simulating the forward model on the current optimal soulution.

The Lyapunov function V is expanded to a second order term by using Taylor expansion and is evaluated around the same trajectory. The Jacobian and Hessian matrices, respectively,  Γand H, are:

image

All the quantities in Equation 63 and (66) are computed using automatic differentiation. Using these matrices, we solve the convex optimization problem:

image

δu⋆ = arg min γNα�∥H1/2 δˆx(N)∥22+ ΓT δˆx(N)�+

image

∥δˆu(i)∥∞ ≤rtrust,  ∀i ∈ [0, N  − 1],

where the state constraints are again softened and the last inequality constraint is used to impose a trust region with a fixed radius,  rtrust. This notably improves the search for an optimal solution, as shown for the inverted pendulum case in Figure 13.

Once problem (67) is solved, we obtain the delta sequence  δu∗. The new optimal solution is then computed according to the update rule:

u⋆ ← u⋆ +lr δu⋆,

where lr < 1 is a learning rate, which is annealed after each iteration. Finally, the reference trajectories used for the next linearization are u⋆and the state series resulting from simulating the forward model on u⋆, namely

x⋆ = ˆf ◦ u⋆.

This is summarised in Algorithm 2. Interested readers can find more details on SQP in the work by [39], [50], where adaptive schemes for computing the trust radius are discussed.

image

x⋆ ← {x(t)}N+1, u⋆ ← {0}N

We demonstrate our approach on an inverted pendulum and vehicle kinematics control problem. We also provide additional results and plots in Appendix D. 3

A. Implementation Specifics

A practical consideration for implementing Algorithm 1 is tuning the MPC terminal cost scaling,  α, via grid-search. The MPC needs to run over the entire dataset of initial points,  X0 = {xm(0)}Mm=1, with different configurations. In order to speed up the search for  α, we run the MPC only on a sample of 20% of the initial dataset. Once the optimal  αis found, only then we run the MPC over the entire dataset and use this data to compute the next V (x).

Additionally to what presented in the main text, we parameterize V (x) with a trainable scaling factor,  β, as:

image

where softplus(x) = log(1 + exp(x)). The parameter  βis initialized to 1 in all experiments except for the inverted pendulum without LQR loss, i.e. for results in Figure 12. The full set of parameters for the experiments can be found in Table IV.

B. Baseline Controllers

Our Neural Lyapunov MPC has a single-step horizon and uses the learned Lyapunov function as the terminal cost. To compare its performance, we consider two other MPCs and RL baselines: Long-horizon MPC (demonstrator/demo): An MPC with a longer horizon and a quadratic terminal cost  xT Px. ThisMPC is also used to generate the initial demonstrations for alternate learning.

Short-horizon MPC: An MPC with a single-step horizon and the same terminal cost as the long-horizon MPC. All other parameters are the same as the Neural Lyapunov MPC except  α, which is tuned manually.

Model-free RL: Neural-network parameterized policy trained using on-policy algorithm, PPO [32], and off-policy algorithm, SAC [33].

Model-based RL: Neural-network parameterized policy and ensemble of dynamic models trained using the algorithm,

image

In order to train the RL baselines, we consider two variations of our proposed models. In the first version (marked v1), the environment considers the stage cost used in demonstrator MDP as the penalty and terminates the episodes whenever the state constraints are violated. In the second version (marked v2), there is no environment termination. Instead an additional

TABLE IV: Configuration for experiments in main paper. We specify the parameters used for the simulation of the system dynamics, the demonstrator, the Neural Lyapunov MPC as well as for the alternate learning algorithm.

image

penalty term is added for violating the constraint. This is further described in Table V. We show the results of training in these variants of the environments for SAC and PPO in Figure 4.

TABLE V: Reward and termination conditions for RL MDPs. We train RL baselines on two variants of the inverted pendulum and car kinematics models. Note, for inverted pendulum:  β1 = 10, β2 = 200, while car kinematics:  β1 = 1, β2 = 200.The symbol  I(.)is the indicator function.

image

C. Forward models

We use an Euler forward model for the environments. Consider dt as the sampling time, then the state transition is:

image

where  η(t)is the state, u(t) is the input and  fu(η(t), u(t))is the time-invariant, deterministic dynamical system.

image

Fig. 4: Training of RL agents. The plots show the learning curves for PPO and SAC in the two variants each of inverted pendulum and car kinematics environments. We train the agent on 4 different seeds for  4 × 106 timesteps. The solid line and filled region indicates the mean and one standard deviation reward across the seeds.

image

Fig. 5: Testing the surrogate models. We generate each trajectory by propagating an initial state  η(0)with input sequence {u(t)}T −1t=0 through the nominal system and the learned surrogate model. (a) For the inverted pendulum:  u(t) ∼ U([−0.4, 0.4])and simulation is performed for T = 25. (b) For the vehicle kinematics:  u(t) ∼ U([−0.1, 0.1] × [−0.1, 0.1])and simulation is performed for T = 5.

1) Vehicle Kinematics: a) World model: For the non-holonomic vehicle,  η = (x, y, φ) ∈ R3is the state, respectively, the coordinates in the plane and the vehicle orientation, and  u = (v, ω) ∈ R2is the control input, respectively, the linear and angular velocity in the body frame.  fu(η, u)encodes the coordinate transformation from the body to the world frame [52]:

image

b) Surrogate model: We build a gray-box model using a neural network to model  J(η), similar to the work by [40]. The input feature to the network is  sin φand  cos φ, where  φis the vehicle’s orientation. The network consists of a hidden layer with 20 hidden units and tanh activation function, and an output layer without any activation function. The weights in the network are initialized using Xavier initialization [53] and biases are initialized from a standard normal distribution. c) Training the surrogate model: We generate a dataset of 10K sequences, each of length T = 1. For each sequence, the initial state  η(0)is sampled uniformly from X, while the input u(0) is sampled uniformly from U. We use a training and validation split of 7 : 3. Training is performed for 300 epochs using the Adam optimizer [29] and the mean-squared error (MSE) loss over the predicted states. The learning rate is 0.01 and the batch size is 700.

2) Inverted Pendulum: a) World model: Inverted pendulum is one of the most standard non-linear systems for testing control methods. We consider the following model [7]:

image

where  θ ∈ Ris the angle, m and l are the mass and pole length, respectively, g is gravitational acceleration,  λFis the rotational friction coefficient and  u ∈ Ris the control input. We denote the state of the system as  η = (θ, ˙θ) ∈ X ⊂ R2 andinput as  u ∈ U ⊂ R. We use an Euler discretization, as in Equation 69, to solve the initial-value problem (IVP) associated with the following equation:

image

b) Surrogate model: We use a neural network to predict the acceleration of the pendulum, ¨θ(t). The input to the network is the state  η(t)and action u(t) at the current time-step t. The network is a three layer feedforward network with 64 hidden units and tanh activation in each hidden layer. The output layer is linear. All the layers have no biases and their weights are initialized as in [53].

c) Training the surrogate model: To train the surrogate model, we generate a dataset of 10K sequences, each of length T = 1. We use MSE loss and Adam optimizer [29] to train the model. The rest of the parameters are kept the same as those used for the vehicle kinematics model training.

Here we provide additional results and plots related to the experiments specified in the paper.

A. Vehicle Kinematics

The vehicle model results are run over different machines. The resulting losses and  αare discussed. In all experiments, the parameters of V are reinitialised after every outer epoch. The Lyapunov loss does not make use of the LQR loss  ℓ. a) Choosing number of outer iterations  Next: We run our algorithm for more iterations on a machine with different operating system. This leads to a slight difference from the results mentioned in the paper. As observed from Table VI, the third iteration leads to the best performance in terms of verified and not verified percentages. We set  Next = 3in all experiments.

TABLE VI: Car Kinematics: Choosing number of outer iterations. Lyapunov Function learning performed on a different machine (results are slightly different from the ones in the paper).

image

b) Alternate learning on nominal model: We train the Neural Lyapunov MPC while using a nominal model for internal dynamics. In Figure 6, we plot the training curves, the resulting Lyapunov function, and line-search for MPC in each outer epoch. The results are also shown in Table VII. As can be seen, tuning the MPC parameter  αhelps in minimizing the loss (10) further. Points near the origin don’t always verify. The MPC obtained after the third iteration achieves the best performance. This can further be validated from Figure 7, where we plot trajectories obtained by using the Neural Lyapunov MPC from each iteration.

c) Alternate learning on surrogate model: In order to test the transfer capability of the approach, we perform the training of Neural Lyapunov MPC using an inaccurate surrogate model for the internal dynamics. This model is also used for calculating the Lyapunov loss (10) and evaluating the performance of the Lyapunov function. We plot the training curves, the resulting Lyapunov function, and line-search for MPC in each outer epoch in Figure 3. The results of the training procedure are presented in Table VIII. The MPC obtained from the second iteration achieves the best performance. In the third iteration, the Lyapunov loss increases and number of verified and not verified points becomes worse. The poor performance also reflects in the trajectories obtained by using the Lyapunov MPC from the third iteration, as shown in Figure 8.

TABLE VII: Car Kinematics: Learning on nominal model. Results for training Neural Lyapunov MPC while using a nominal model for internal dynamics. We use the Lyapunov loss (10) for both learning the Lyapunov function and tuning the MPC. This is specified in the log(1 + x) scale.

image

TABLE VIII: Car Kinematics: Learning on surrogate model. Results for training Neural Lyapunov MPC while using the surrogate model for internal dynamics as well as in Lyapunov training. We use the Lyapunov loss (10) for both learning the Lyapunov function and tuning the MPC. This is specified in the log(1 + x) scale.

image

B. Inverted Pendulum

Differently from the car kinematics, for the inverted pendulum task, the parameters of V are not re-initialized after every outer epoch, and the Lyapunov loss makes use of the LQR loss,  ℓ, for all the experiments except for the results in Figure 12. In this section, we discuss the results obtained from the alternate learning on the nominal model. We also provide an ablation study on: 1) a solution based solely on a contraction factor, and 2) the effect of having an imperfect solver, in particular the instability resulting from wrongly tuning the trust-region radius.

TABLE IX: Inverted Pendulum: Learning on nominal model. Results for training Neural Lyapunov MPC while using a nominal model for internal dynamics. We use the Lyapunov loss (10) for both learning the Lyapunov function and tuning the MPC. This is specified in the log(1 + x) scale.

image

a) Alternate learning: Since the trained surrogate model has a high accuracy, we only consider the scenario for alternate learning with the nominal model. The main results for this scenario are in Figure 9 and Table IX. We notice a slight improvement in the MPC’s performance in the second iteration of the training procedure. In Figure 9, it can be noticed that a small  αneeds to be used, which contradicts the ideal theoretical result. In practice, a very large value of this parameter results in bad conditioning for the QPs used by the MPC and causes the solver to fail.4 Since the pendulum is open-loop unstable, an increase of the Lyapunov loss can be noticed for larger values of  α. This demonstrates that it is necessary to perform a search over the parameter and that we cannot simply set it to a very large value.

In Figure 10, we show the trajectories obtained by running the Neural Lyapunov MPC obtained from the first and second iterations. The initial states are sampled inside the safe level-set by using rejection sampling. The trajectories obtained from both the iterations are similar even though the Lyapunov function is different. The Lyapunov function from second iteration has a larger ROA.

We also compare the Neural Lyapunov MPC from the second iteration with the baseline MPCs in Figure 11. The baselines controllers behave quite similarly in this problem, although they have different prediction horizons. This is because, for both of them, the LQR terminal cost is a dominating term in the optimization’s objective function. The Neural Lyapunov MPC achieves a slightly slower decrease rate compared to the demonstrator; however, it still stabilizes the system. The transfer from nominal to surrogate model is very successful for all the controllers, though in this case, the surrogate model is particularly accurate.

It should be kept in mind that in order to produce these results, it was necessary to impose in the Lyapunov loss (10) that the decrease rate of V (x) needs to be upper bounded by the LQR stage loss, as in Equation 5. This resulted in the most effective learning of the function V (x), contrarily to the vehicle kinematics example. b) Alternate learning without LQR loss: We now consider the case when the learning is performed while using a contraction factor of  λ = 0.9and without the LQR loss term in the Lyapunov loss (i.e., v = 0). The results are depicted in Figure 12. In order to obtain these results, the Lyapunov NN scaling  β, in Equation (68), was initialized with:

β0 =softplus−1(25),

according to a rough estimate of the minimum scaling  αfrom Equation 42. This was able to produce a Lyapunov function that makes the MPC safe with  α = 12. However, the learning becomes more difficult, and it results in a smaller safe region estimate with a slower convergence rate for the system trajectories.

c) Effects of the trust region: In Figure 13, we show the result of varying the trust radius of the SQP solver on the inverted pendulum. While a larger value can result in further approximation, given the limited number of iterations, in this case a small value of the radius results in a weaker control signal and local instability near the origin.

d) Formal verification:: These procedures are expensive but necessary, as pathological cases could result in the training loss (10) for which the safe set could be converging to a local minima with a very small set. Results can also vary where different random seeds are used. For these reasons, during training we also an informal verification over a fixed validation set and use this to select the best V from all training epochs. In practical applications we reccommend to use formal methods.

image

Fig. 6: Car kinematics: Alternate learning on nominal model. After every  NV = 500epochs of Lyapunov learning, the learned Lyapunov function is used to tune the MPC parameters. Top: The training curves for Lyapunov function. Vertical lines separate iterations. Middle: The resulting Lyapunov function V at  φ = 0with the best performance. Bottom: Line-search for the MPC parameter  αto minimize the Lyapunov loss (10) with V as terminal cost. The loss is plotted on the y-axis in a log(1 + x) scale. The point marked in red is the parameter which minimizes the loss.

image

Fig. 7: Car kinematics: Testing Neural Lyapunov MPC obtained from training on nominal model. For each iteration, we show the trajectories obtained through our Neural Lyapunov MPC while using the resulting Lyapunov function and the MPC parameter selected from the line-search. Top: The Lyapunov function at  φ = 0with trajectories for 40 steps at each iteration. Middle: The evaluated Lyapunov function. Bottom: The Lyapunov function time difference.

image

Fig. 8: Car kinematics: Testing Neural Lyapunov MPC obtained from training on surrogate model. For each iteration, we show the trajectories obtained through our Neural Lyapunov MPC while using the resulting Lyapunov function and the MPC parameter selected from the line-search. Top: The Lyapunov function at  φ = 0with trajectories for 40 steps at each iteration. Middle: The evaluated Lyapunov function. Bottom: The Lyapunov function time difference.

image

Fig. 9: Inverted Pendulum: Alternate learning on surrogate model. After every  NV = 200epochs of Lyapunov learning, the learned Lyapunov function is used to tune the MPC parameters. Unlike the vehicle kinematics example, we do not reinitialize V between the iterations. Top: The training curves for Lyapunov function. Vertical lines separate iterations. Middle: The resulting Lyapunov function V with the best performance. Bottom: Line-search for the MPC parameter  αto minimize the Lyapunov loss (10) with V as terminal cost. The loss is plotted on the y-axis in a log(1 + x) scale. The point marked in red is the parameter which minimizes the loss.

image

Fig. 10: Inverted Pendulum: Testing Neural Lyapunov MPC obtained from training on nominal model over iterations. For each iteration, we show the trajectories obtained through our Neural Lyapunov MPC while using the resulting Lyapunov function and the MPC parameter selected from the line-search. The initial states are sampled inside the safe level-set using rejection sampling. Top: The Lyapunov function with trajectories for 80 steps at each iteration. Middle: The evaluated Lyapunov function. Bottom: The Lyapunov function time difference.

image

Fig. 11: Inverted Pendulum: Transfer from nominal to surrogate model. Top: The Lyapunov function with overlaid trajectories for 80 timesteps. Middle: The Lyapunov function evaluated along trajectories. Bottom: The Lyapunov decrease evaluated along trajectories.

image

Fig. 12: Inverted Pendulum: Effect of using Lyapunov MPC with contractor factor and no LQR loss. The Lyapunov function and safe-level set obtained from the first iteration of alternate learning with  λ = 0.9, v = 0in the Lyapunov loss (10). This results in a smaller safe region estimate and slower closed-loop trajectories compared to the case when  λ = 0.99, v = 1.Each trajectory is simulated for T = 80 timesteps.

image

Fig. 13: Inverted Pendulum: Effect of trust region on MPC. The solver hyperparameter,  rtrust, can also affect the stability of the MPC and was tuned manually at this stage. Given the limited amount of solver iterations, a small trust region results in weaker control signals and local instability. A larger trust radius can in this case stabilize the system, while being possibly more sub-optimal. The depicted Lyapunov function is obtained from first iteration of alternate learning and is used by the MPC as its terminal cost.


Designed for Accessibility and to further Open Science