Robust Estimation, Prediction and Control with Linear Dynamics and Generic Costs

2020·Arxiv

Abstract

Abstract

We consider the problem of robust and adaptive model predictive control (MPC) of a linear system, with unknown parameters that are learned along the way (adaptive), in a critical setting where failures must be prevented (robust). This problem has been studied from different perspectives by different communities. However, the existing theory deals only with the case of quadratic costs (the LQ problem), which limits applications to stabilisation and tracking tasks only. In order to handle more general (non-convex) costs that naturally arise in many practical problems, we carefully select and bring together several tools from different communities, namely non-asymptotic linear regression, recent results in interval prediction, and tree-based planning. Combining and adapting the theoretical guarantees at each layer is non trivial, and we provide the first end-to-end suboptimality analysis for this setting. Interestingly, our analysis naturally adapts to handle many models and combines with a data-driven robust model selection strategy, which enables to relax the modelling assumptions. Last, we strive to preserve tractability at any stage of the method, that we illustrate on two challenging simulated environments.1

1 Introduction

Despite the recent successes of Reinforcement Learning [e.g. 33, 40], it has hardly been applied in real industrial issues. This could be attributed to two undesirable properties which limit its practical applications. First, it depends on a tremendous amount of interaction data that cannot always be simulated. This issue can be alleviated by model-based methods – which we consider in this work – that often benefit from better sample efficiencies than their model-free counterparts. Second, it relies on trial-and-error and random exploration. In order to overcome these shortages, and motivated by the path planning problem for a self-driving car, in this paper we consider the problem of controlling an unknown linear system x(t) so as to maximise an arbitrary bounded reward function R, in a critical setting where mistakes are costly and must be avoided at all times. This choice of rich reward space is crucial to have sufficient flexibility to model non-convex and non-smooth functions that naturally arise in many practical problems involving combinatorial optimisation, branching decisions, etc., while quadratic costs are mostly suited for tracking a fixed reference trajectory [e.g. 23]. Since experiencing failures is out of question, the only way to prevent them from the outset is to rely on some sort of prior knowledge. In this work, we assume that the system dynamics are partially known, in the form of a linear differential equation with unknown parameters and inputs. More precisely, we consider a linear system with state , acted on by controls and disturbances , and following dynamics in the form:

where the parameter vector in the state matrix belongs to a compact set . The control matrix and disturbance matrix are known. We also assume having access to the observation of x(t) and to a noisy measurement of in the form where is a measurement noise and is known. Assumptions over the disturbance and noise will be detailed further, and we denote . We argue that this structure assumption is realistic given that most industrial applications to date have been relying on physical models to describe their processes and well-engineered controllers to operate them, rather than machine learning. Our framework relaxes this modelling effort by allowing some structured uncertainty around the nominal model. We adopt a data-driven scheme to estimate the parameters more accurately as we interact with the true system. Many model-based reinforcement learning algorithms rely on the estimated dynamics to derive the corresponding optimal controls [e.g. 24, 29], but suffer from model bias: they ignore the error between the learned and true dynamics, which can dramatically degrade control performances [39].

To address this issue, we turn to the framework of robust decision-making: instead of merely considering a point estimate of the dynamics, for any , we build an entire confidence region , illustrated in Figure 1, that contains the true dynamics parameter with high probability:

where . In Section 2, having observed a history of transitions, our first contribution extends the work of Abbasi-Yadkori et al. [2] who provide a confidence ellipsoid for the least-square estimator to our setting of feature matrices, rather than feature vectors.

The robust control objective [8, 9, 18] aims to maximise the worst-case outcome with respect to this confidence region

is a discount factor, and is the state reached at step n under controls u and disturbances within the given admissible bounds . Maximin problems such as (3) are notoriously hard if the reward R has a simple form. However, without a restriction on the shape of functions R, we cannot hope to derive an explicit solution. In our second contribution, we propose a robust MPC algorithm for solving (3) numerically. In Section 3, we leverage recent results from the uncertain system simulation literature to derive an interval predictor [x(t), x(t)] for the system (1), illustrated in Figure 2. For any , this predictor takes the information on the current state , the confidence region , planned control sequence u and admissible disturbance bounds ; and must verify the inclusion property:

Since R is generic, potentially non-smooth and non-convex, solving the optimal – not to mention the robust – control objective is intractable. In Section 4, facing a sequential decision problem with continuous states, we turn to the literature of tree-based planning algorithms. Although there exist works addressing continuous actions [10, 43], we resort to a first approximation and discretise the continuous decision space by adopting a hierarchical control architecture: at each time, the agent can select a high-level action a from a finite space A. Each action corresponds to the selection of a low-level controller , that we take affine: For

Figure 1: The model estimation procedure, running on the obstacle avoidance problem of Section 6. The confidence region shrinks with the number of samples N.

instance, a tracking a subgoal can be achieved with . This discretisation induces a suboptimality, but it can be mitigated by diversifying the controller basis. The robust objective (3) becomes stems from (1) with . However, tree-based planning algorithms are designed for a single known generative model rather than a confidence region for the system dynamics. Our third contribution adapts them to the robust objective (3) by approximating it with a tractable surrogate that exploits the interval predictions (4) to define a pessimistic reward. In our main result, we show that the best surrogate performance achieved during planning is guaranteed to be attained on the true system, and provide an upper bound for the approximation gap and suboptimality of our framework in Theorem 3. This is the first result of this kind for maximin control with generic costs to the best of our knowledge. Algorithm 1 shows the full integration of the three procedures of estimation, prediction and control.

In Section 5, our forth contribution extends the proposed framework to consider multiple modelling assumptions, while narrowing uncertainty through data-driven model rejection, and still ensuring safety via robust model-selection during planning.

Finally, in Section 6 we demonstrate the applicability of Algorithm 1 in two numerical experiments: a simple illustrative example and a more challenging simulation for safe autonomous driving.

Notation The system dynamics are described in continuous time, but sensing and control happen in discrete time with time-step dt > 0. For any variable z, we use subscript to refer to these discrete times: with and . We use bold symbols to denote temporal sequences . We denote

1.1 Related Work

The control of uncertain systems is a long-standing problem, to which a vast body of literature is dedicated. Existing work is mostly concerned with the problem of stabilisation around a fixed reference state or trajectory, including approaches such as control [7], sliding-mode control [32] or system-level synthesis [11, 12]. This paper fits in the popular MPC framework, for which adaptive data-driven schemes have been developed to deal with model uncertainty [38, 41, 5], but lack guarantees. The family of tube-MPC algorithms seeks to derive theoretical guarantees of robust constraint satisfaction: the state x is constrained in a safe region X around the origin, often chosen convex [17, 4, 6, 42, 30, 22, 31, 28]. Yet, many tasks cannot be framed as stabilisation problems (e.g. obstacle avoidance) and are better addressed with the minimax control objective, which allows more flexible goal formulations. Minimax control has mostly been studied in two particular instances.

Finite states Minimax control of finite Markov Decision Processes with uncertain parameters was studied in [21, 34, 44], who showed that the main results of Dynamic Programming can be extended to their robust counterparts only when the dynamics ambiguity set verifies a certain rectangularity property. Since we consider continuous states, these methods do not apply.

Linear dynamics and quadratic costs Several approaches have been proposed for cumulative regret minimisation in the LQ problem. In the Optimism in the Face of Uncertainty paradigm, the best possible dynamics within a high-confidence region is selected under a controllability constraint, to compute the corresponding optimal control in closed-form by solving a Riccati equation. The results of [1, 20, 16] show that this procedure achieves a regret. Posterior sampling algorithms [35, 3] select candidate dynamics randomly instead, and obtain the same result. Other works use noise injection for exploration such as [11, 12]. However, neither optimism nor random exploration fit a critical setting, where ensuring safety requires instead to consider pessimistic outcomes. The work of Dean et al. [11] is close to our setting: after an offline estimation phase, they estimate a suboptimality between a minimax controller and the optimal performance. Our work differs in that it addresses a generic shape cost. Another work of interest is [37] where worst-case generic costs are considered. However, they assume the knowledge of the dynamics, and their rollout-based solution only produces inner-approximations and does not yield any guarantee. In this paper, interval prediction is used to produce oversets, while a near-optimal control is found using a tree-based planning procedure.

2 Model Estimation

To derive a confidence region (2) for , the functional relationship must be specified.

Assumption 1 (Structure). There exists a known feature tensor such that for all

where are known. For all n, we denote . We also assume to know a bound

We slightly abuse notations and include additional known terms in the measurement signal y(t) = to obtain a linear regression system

Regularised least square To derive an estimate on , we consider the -regularised regression problem with weights

Proposition 1 (Regularised solution). The solution to (6) is

Substituting into (7) yields the regression error: To bound this error, we need the noise to concentrate.

Assumption 2 (Noise Model). We assume that

2. at each time , the disturbance is enclosed by known bounds whose amplitude verifies

Theorem 1 (Confidence ellipsoid, a matricial version of Abbasi-Yadkori et al. 2). Under Assumption 2, it holds with probability at least

We convert this confidence ellipsoid from (8) into a polytope for . For simplicity, we present here a simple but coarse strategy: bound the ellipsoid by its enclosing axis-aligned hypercube:

where A

. A tighter polytope derivation is presented in the Supplementary Material.

3 State Prediction

A simple solution to (4) is proposed in [14], where, given bounds matrix interval arithmetic to derive the predictor:

Proposition 2 (Simple predictor of Efimov et al. 14). Assuming that (2) is satisfied for the system (1), then the interval predictor following

ensures the inclusion property (4) with confidence level

However, Leurent et al. [27] showed that this predictor can have unstable dynamics, even for stable systems, which causes a fast build-up of uncertainty. They proposed an enhanced predictor which exploits the polytopic structure (9) to produce more stable predictions, at the price of a requirement: Assumption 3. There exists an orthogonal matrix is Metzler2.

In practice, this assumption is often verified: it is for instance the case whenever is diagonalisable. The similarity transformation of [15] provides a method to compute such Z when the system is observable. To simplify the notation, we will further assume that . Denote

Proposition 3 (Enhanced predictor of Leurent et al. 27). Assuming that (9) and Assumption 3 are satisfied for the system (1), then the interval predictor following

˙˙

ensures the inclusion property (4) with confidence level

Figure 3 compares the performance of the predictors (10) and (11) in a simple example. It suggests to always prefer (11) whenever Assumption 3 is verified, and only fallback to (10) as a last resort.

Figure 3: Comparison of (10) and (11) for a simple system , with and

4 Robust Control

To evaluate the robust objective , we approximate it thanks to the interval prediction [x(t), x(t)]. Definition 1 (Surrogate objective). Let following the dynamics defined in (11) and

Such a substitution makes this pessimistic reward not Markovian, since the worst case is assessed over the whole past trajectory. Theorem 2 (Lower bound). The surrogate objective (12) is a lower bound of the objective (3).

Consequently, since all our approximations are conservative, if we manage to find a control sequence such that no “bad event” (e.g. a collision) happens according to the surrogate objective guaranteed not to happen either when the controls are executed on the true system.

To maximise , we cannot use DP algorithms since the state space is continuous and the pessimistic rewards are non-Markovian. Rather, we turn to tree-based planning algorithms, which optimise a sequence of actions based on the corresponding sequence of rewards, without requiring Markovity nor state enumeration. In particular, we consider the Optimistic Planning of Deterministic Systems (OPD) algorithm [19] tailored for the case when the relationship between actions and rewards is deterministic. Indeed, the stochasticity of the disturbances and measurements is encased in : given the observations up to time N both the predictor dynamics (11) and the pessimistic rewards in (12) are deterministic. At each planning iteration progressively builds a tree by forming upper-bounds over the value of sequences of actions a, and expanding3 the leaf with highest upper-bound:

where is the set of leaves of is the length of the sequence a, and the pessimistic reward (12) obtained at time n by following the controls

Lemma 1 (Planning performance of Hren & Munos 19). The suboptimality of the OPD algorithm (13) applied to the surrogate objective (12) after K planning iterations is:

where is a problem-dependent measure of the proportion of near-optimal paths.

Hence, by using enough computational budget K for planning we can get as close as we want to the optimal surrogate value , at a polynomial rate. Unfortunately, there exists a gap between and the true robust objective , which stems from three approximations: (i) the true reachable set was approximated by an enclosing interval in (4); (ii) the time-invariance of the dynamics uncertainty was handled by the interval predictor (11) as if it were a time-varying uncertainty ; and (iii) the lower-bound used to define the surrogate objective (12) is not tight. However, this gap can be bounded under additional assumptions.

Theorem 3 (Suboptimality bound). Under two conditions:

we can bound the suboptimality of Algorithm 1 with planning budget K as:

with probability at least , where V (a) is the optimal expected return when executing an action is an optimal action, and is a constant which corresponds to an irreducible suboptimality suffered from being robust to instantaneous disturbances

It is difficult to check the validity of the stability condition 2. since it applies to matrices by the algorithm rather than to the system parameters. A stronger but easier to check condition is that the polytope (9) at some iteration becomes included in a set where this property is uniformly satisfied. For instance, if the features are sufficiently excited, the estimation converges to a neighbourhood of the true dynamics . This also allows to further bound the input-dependent estimation error term. Corollary 1 (Asymptotic near-optimality). Under an additional persistent excitation (PE) assumption

the stability condition 2. of Theorem 3 can be relaxed to apply to the true system: there exist such that

and the suboptimality bound takes the more explicit form

which ensures asymptotic near-optimality when

5 Multi-model Selection

The procedure we developed in Sections 2 to 4 relies on strong modelling assumptions, such as the linear dynamics (1) and Assumption 1. But what if they are wrong?

Model adequacy One of the major benefits of using the family of linear models, compared to richer model classes, is that they provide strict conditions allowing to quantify the adequacy of the modelling assumptions to the observations. Given observations, Section 2 provides a polytopic confidence region (9) that contains with probability at least . Since the dynamics are linear, we can propagate this confidence region to the next observation: must belong to the Minkowski sum of a polytope representing model uncertainty polytope bounding the disturbance and measurement noises. Delos & Teissandier [13] provide a way to test this membership in polynomial time using linear programming. Whenever it is not verified, we can confidently reject the -modelling assumption 1. This enables us to consider a rich set of potential featuresrather than relying on a single assumption, and only retain those that are consistent with the observations so far. Then, every remaining hypothesis must be considered during planning.

Robust selection We temporarily ignore the parametric uncertainty on to simply consider several candidate dynamics models, which all correspond to different modelling assumptions. We also restrict to deterministic dynamics, which is the case of (11).

Assumption 4 (Multi-model ambiguity). The dynamics f lie in a finite set of candidates

We adapt our planning algorithm to balance these concurrent hypotheses in a robust fashion, i.e. maximise a robust objective with discrete ambiguity:

where is the reward obtained by following the action sequence a up to step n under the dynamics . This objective could be optimised in the same way as in Section 4, but this would result in a coarse and lossy approximation. Instead, we exploit the finite uncertainty structure of Assumption 4 to asymptotically recover the true by modifying the OPD algorithm in the following way:

Definition 2 (Robust UCB). We replace the upper-bound (13) on sequence values in OPD by:

Note that it is not equivalent to solving each control problem independently and following the action with highest worst-case value, as we show in the Supplementary Material. We analyse the sample complexity of the corresponding robust planning algorithm.

Proposition 4 (Robust planning performance). The robust version of OPD (16) enjoys the same regret bound as OPD in Lemma 1, with respect to the multi-model objective (15).

This result is of independent interest: the solution of a robust objective (15) with discrete ambiguity can be recovered exactly, asymptotically when the planning budget K goes to infinity, which Robust DP algorithms do not allow. This also contrasts with the results obtained in Section 4 for the robust objective (3) with continuous ambiguity , for which OPD only recovers the surrogate approximation , as discussed in Theorem 3. Note that here the regret depends on the number K of node expansions, but each expansion now requires M times more simulations than in the single-model setting. Finally, the two approaches of Sections 4 and 5 can be merged by using the pessimistic reward (12) in (16).

6 Experiments

Videos and code are available at https://eleurent.github.io/robust-beyond-quadratic/.

Obstacle avoidance with unknown friction We first consider a simple illustrative example, shown in Figure 2: the control of a 2D system moving by means of a force in an medium with anisotropic linear friction with unknown coefficients . The reward encodes the task of navigating to reach a goal state while avoiding collisions with obstacles: collides with an obstacle, 1 otherwise. The actions A are constant controls in the up, down, left and right directions. For the reasons mentioned above, no robust baseline applies to our setting. We compare Algorithm 1 to the non-robust adaptive control approach that plans with the estimated dynamics , and thus enjoys the same prior knowledge of dynamics structure and reward. This highlights the benefits of the robust formulation solely rather than stemming from algorithm design. We show in Table 1(a) the results of 100 simulations of a single episode: the robust agent performs worse than the nominal agent on average, but manages to ensure safety while the nominal agent collides with obstacles in 4% of simulations. We also compare to a standard model-free approach, DQN, which does not benefit from the prior knowledge on the system dynamics, and is instead trained over multiple episodes. The reported performance is that of the final policy obtained after training for 3000 episodes, during which collisions occurred (%). We study the evolution of the suboptimality with respect to the number of samples N, by comparing the empirical returns from a state to the value that the agent would get by acting optimally from with knowledge of the dynamics. Although the assumptions

Figure 4: The mean (solid), 95% CI for the mean (shaded) and maximum (dashed) suboptimality with respect to N.

Figure 5: The intersection crossing task. Trajectory intervals show behavioural uncertainty for each vehicle, with a multi-model assumption over their route.

Table 1: Frequency of collision, minimum and average return achieved on a single episode, repeated with 100 random seeds. In both tasks, the robust agent performs worse than the nominal agent on average, but manages to ensure safety and attains a better worst-case performance.

of Theorem 3 are not satisfied (e.g. non-smooth reward), the mean suboptimality of the robust agent, shown in Figure 4, still decreases polynomially with N: Algorithm 1 gets more efficient as it is more confident while ensuring safety at all times. In comparison, the nominal agent enjoys a smaller suboptimality on average, but higher in the worst-case.

Motion planning for an autonomous vehicle We consider the highway-env environment [25] for simulated driving decision problems. An autonomous vehicle with state is approaching an intersection among V other vehicles with states , resulting in a joint traffic state . These vehicles follow parametrized behaviours with unknown parameters . We appreciate a first advantage of the structure imposed in Assumption 1: the uncertainty space of . In comparison, the traditional LQ setting where the whole state matrix A is estimated would have resulted in a much larger parameter space . The system dynamics f, which describes the interactions between vehicles, can only be expressed in the form of Assumption 1 given the knowledge of the desired route for each vehicle, with features expressing deviations to the centerline of the followed lane. Since these intentions are unknown to the agent, we adopt the multi-model perspective of Section 5 and consider one model per possible route for every observed vehicle before an intersection. In Table 1(b), we compare Algorithm 1 to a nominal agent planning with two different modelling assumptions: Nominal 1 has access to the true followed route for each vehicle, while Nominal 2 does not and picks the model with minimal prediction error. Again we also compare to a DQN baseline trained over 3000 episodes, causing collisions while training (). As before, the robust agent has a higher worst-case performance and avoids collisions at all times, at the price of a decreased average performance..

Conclusion

We present a framework for the robust estimation, prediction and control of a partially known linear system with generic costs. Leveraging tools from linear regression, interval prediction, and tree-based planning, we guarantee the predicted performance and provide a suboptimality bound. The method applicability is further improved by a multi-model extension and demonstrated on two simulations.

Broader Impact

The motivation behind this work is to enable the development of Reinforcement Learning solutions for industrial applications, when it has been mainly limited to simulated games so far. In particular, many industries already rely on non-adaptive control systems and could benefit from an increased efficiency, including Oil and Gas, robotics for industrial automation, Data Center cooling, etc. But more often than not, safety-critical constraints proscribe the use of exploration, and industrials are reluctant to turn to learning-based methods that lack accountability. This work addresses these concerns by focusing on risk-averse decisions and by providing worst-case guarantees. Note however that these guarantees are only as good as the validity of the underlying hypotheses, and Assumption 1 in particular should be submitted to a comprehensive validation procedure; otherwise, decisions formed on a wrong basis could easily lead to dramatic consequences in such critical settings. Beyond industrial perspectives, this work could be of general interest for risk-averse decision-making. For instance, parametrized epidemiological models have been used to represent the propagation of Covid-19 and study the impact of lockdown policies. These model parameters are estimated from observational data and corresponding confidence intervals are often available, but rarely used in the decision-making loop. In contrast, our approach would enable evaluating and optimising the worst-case outcome of such public policies.

Acknowledgments and Disclosure of Funding

This work was supported by the French Ministry of Higher Education and Research, and CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015-2020.

References

[1] Abbasi-Yadkori, Y. and Szepesvári, C. Regret bounds for the adaptive control of linear quadratic systems. In Kakade, S. M. and von Luxburg, U. (eds.), Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pp. 1–26, Budapest, Hungary, 09–11 Jun 2011. PMLR.

[2] Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 2312–2320. Curran Associates, Inc., 2011.

[3] Abeille, M. and Lazaric, A. Improved regret bounds for thompson sampling in linear quadratic control problems. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1–9, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.

[4] Adetola, V., DeHaan, D., and Guay, M. Adaptive model predictive control for constrained nonlinear systems. Systems and Control Letters, 2009. ISSN 01676911. doi: 10.1016/j.sysconle. 2008.12.002.

[5] Amos, B., Rodriguez, I. D. J., Sacks, J., Boots, B., and Zico Kolter, J. Differentiable MPC for end-to-end planning and control. In Advances in Neural Information Processing Systems, 2018.

[6] Aswani, A., Gonzalez, H., Sastry, S. S., and Tomlin, C. Provably safe and robust learning-based model predictive control. Automatica, 2013. ISSN 00051098. doi: 10.1016/j.automatica.2013. 02.003.

[7] Basar, T. and Bernhard, P. H infinity - Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach, volume 41. 1996.

[8] Ben-Tal, A., El Ghaoui, L., and Nemirovski, A. Robust optimization, volume 28. Princeton University Press, 2009.

[9] Bertsimas, D., Brown, D. B., and Caramanis, C. Theory and applications of robust optimization. SIAM review, 53(3):464–501, 2011.

[10] Busoniu, L., Pall, E., and Munos, R. Continuous-action planning for discounted infinite-horizon nonlinear optimal control with lipschitz values. Automatica, 92:100–108, 06 2018.

[11] Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. On the sample complexity of the linear quadratic regulator. ArXiv, abs/1710.01688, 2017.

[12] Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. Regret bounds for robust adaptive control of the linear quadratic regulator. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 4188–4197. Curran Associates, Inc., 2018.

[13] Delos, V. and Teissandier, D. Minkowski Sum of Polytopes Defined by Their Vertices. Journal of Applied Mathematics and Physics (JAMP), 3(1):62–67, January 2015.

[14] Efimov, D., Fridman, L., Raïssi, T., Zolghadri, A., and Seydou, R. Interval estimation for LPV systems applying high order sliding mode techniques. Automatica, 48:2365–2371, 2012.

[15] Efimov, D., Raïssi, T., Chebotarev, S., and Zolghadri, A. Interval state observer for nonlinear time varying systems. Automatica, 49(1):200–205, 2013.

[16] Faradonbeh, M. K. S., Tewari, A., and Michailidis, G. Finite time analysis of optimal adaptive policies for linear-quadratic systems. CoRR, abs/1711.07230, 2017.

[17] Fukushima, H., Kim, T. H., and Sugie, T. Adaptive model predictive control for a class of constrained linear systems based on the comparison model. Automatica, 2007. ISSN 00051098. doi: 10.1016/j.automatica.2006.08.026.

[18] Gorissen, B. L., ˙Ihsan Yanıko˘glu, and den Hertog, D. A practical guide to robust optimization. Omega, 53:124 – 137, 2015.

[19] Hren, J.-F. and Munos, R. Optimistic planning of deterministic systems. In European Workshop on Reinforcement Learning, pp. 151–164, France, 2008.

[20] Ibrahimi, M., Javanmard, A., and Roy, B. Efficient reinforcement learning for high dimensional linear quadratic systems. Advances in Neural Information Processing Systems, 4, 03 2013.

[21] Iyengar, G. N. Robust Dynamic Programming. Mathematics of Operations Research, 30: 257–280, 2005.

[22] Köhler, J., Andina, E., Soloperto, R., Müller, M. A., and Allgöwer, F. Linear robust adaptive model predictive control: Computational complexity and conservatism. In 2019 IEEE 58th Conference on Decision and Control (CDC), pp. 1383–1388, 2019.

[23] Kumar, E. V. and Jerome, J. Robust lqr controller design for stabilizing and trajectory tracking of inverted pendulum. Procedia Engineering, 64:169 – 178, 2013. International Conference on Design and Manufacturing.

[24] Lenz, I., Knepper, R. A., and Saxena, A. Deepmpc: Learning deep latent features for model predictive control. In Robotics: Science and Systems, 2015.

[25] Leurent, E. An environment for autonomous driving decision-making. https://github.com/ eleurent/highway-env, 2018.

[26] Leurent, E. and Mercat, J. Social attention for autonomous decision-making in dense traffic. In Machine Learning for Autonomous Driving Workshop at NeurIPS 2019, 2019.

[27] Leurent, E., Efimov, D., Raïssi, T., and Perruquetti, W. Interval prediction for continuous-time systems with parametric uncertainties. In Proc. IEEE Conference on Decision and Control (CDC), Nice, 2019.

[28] Leurent, E., Efimov, D., and Maillard, O.-A. Robust-Adaptive Interval Predictive Control for Linear Uncertain Systems. In 2020 IEEE 59th Conference on Decision and Control (CDC), Jeju Island, Republic of Korea, 8–11 Dec 2020.

[29] Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. CoRR, abs/1504.00702, 2015.

[30] Lorenzen, M., Allgöwer, F., and Cannon, M. Adaptive model predictive control with robust constraint satisfaction. IFAC-PapersOnLine, 50(1):3313–3318, Jul 2017. ISSN 2405-8963.

[31] Lu, X. and Cannon, M. Robust adaptive tube model predictive control. In Proceedings of the American Control Conference, 2019. ISBN 9781538679265.

[32] Lu, X.-Y. and Spurgeon, S. K. Robust sliding mode control of uncertain nonlinear systems. Systems & Control Letters, 32(2):75–90, Nov 1997. ISSN 0167-6911.

[33] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.

[34] Nilim, A. and El Ghaoui, L. Robust Control of Markov Decision Processes with Uncertain Transition Matrices. Operations Research, 53:780–798, 2005.

[35] Ouyang, Y., Gagrani, M., and Jain, R. Learning-based control of unknown linear systems with thompson sampling. CoRR, abs/1709.04047, 2017.

[36] Peña, V. H., Lai, T. L., and Shao, Q.-M. Self-normalized processes: Limit theory and Statistical Applications. Springer Science & Business Media, 2008.

[37] Rosolia, U. and Borrelli, F. Sample-Based Learning Model Predictive Control for Linear Uncertain Systems. In Proceedings of the IEEE Conference on Decision and Control, 2019. ISBN 9781728113982. doi: 10.1109/CDC40024.2019.9030270.

[38] Sastry, S., Bodson, M., and Bartram, J. F. Adaptive Control: Stability, Convergence, and Robustness. The Journal of the Acoustical Society of America, 1990. ISSN 0001-4966. doi: 10.1121/1.399905.

[39] Schneider, J. Exploiting model uncertainty estimates for safe dynamic control learning. Advances in neural information processing systems, pp. 1047—-1053, 1997.

[40] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419): 1140–1144, 2018.

[41] Tanaskovic, M., Fagiano, L., Smith, R., and Morari, M. Adaptive receding horizon control for constrained MIMO systems. Automatica, 2014. ISSN 00051098. doi: 10.1016/j.automatica. 2014.10.036.

[42] Turchetta, M., Berkenkamp, F., and Krause, A. Safe exploration in finite Markov decision processes with Gaussian processes. In Advances in Neural Information Processing Systems, 2016.

[43] Weinstein, A. and Littman, M. Bandit-based planning and learning in continuous-action markov decision processes. Proceedings of the 22nd International Conference on Automated Planning and Scheduling, pp. 306–314, 01 2012.

[44] Wiesemann, W., Kuhn, D., and Rustem, B. Robust Markov Decision Processes. Mathematics of Operations Research, pp. 1–52, 2013.

Outline In Appendix A, we provide a proof for every novel result introduced in this paper. Appendix B provides additional details on our experiments. Appendix C gives a better method of conversion from ellipsoid to polytope than that of (9). Finally, Appendix D highlights the fact that robustness cannot be recovered by aggregating independent solutions to many optimal control problem.

A Proofs

A.1 Proof of Proposition 1

Proof. We differentiate as in (6) with respect to

Hence,

A.2 Proof of Theorem 1

We start by showing a preliminary proposition:

Proposition 5 (Matrix version of Theorem 1 of Abbasi-Yadkori et al. 2). Let be a filtration. Let -valued stochastic process such that -measurable and -sub-Gaussian.

Let -valued stochastic process such that -measurable. Assume that G is a positive definite matrix. For any

Then, for any , with probability at least

Proof. Let

And for any

Then,

and using the Sub-Gaussianity of

Showing that is indeed a supermartingale and in fact . It then follows by Doob’s upcrossing lemma for supermartingale that is almost surely well-defined, and so is for any random stopping time

Next, we consider the stopped martingale is a non-negative supermartin- gale and is a random stopping time, we deduce by Doob’s decomposition that

Finally, an application of Fatou’s lemma show that

This results allows to apply a result from [36]:

Lemma 2 (Theorem 14.7 of [36]). If Z is a random vector and B is a symmetric positive definite matrix such that

then for any positive definite non-random matrix C, it holds

In particular, by Markov inequality, for all

Here, by using

Having shown this preliminary result, we move on to the proof of Theorem 1.

Proof. For all , (7) gives:

Using the Cauchy-Schwartz inequality, we get:

In particular, for , we get after simplifying with

By applying Proposition 5 with , we obtain that with probability at least

And since ∥

A.3 Proof of Theorem 2

Proof. The predictor designed in Section 3 verifies the inclusion property (4). Thus, for sequence of controls u, any dynamics , and disturbances , the corresponding state at time is bounded by , which implies that

Thus, by taking the min over , we also have for any sequence of controls u:

And by definition.

A.4 Proof of Theorem 3

We first bound the model estimation error.

Lemma 3.

Proof. We have

Moreover, belongs to a linear image of this -ball. By writing a the column of a matrix M as , and its coefficient

Then, we propagate this estimation error through the state prediction.

then for all

where

Proof. Let . (11) gives the dynamics

where recall that for any matrix

We define the Lyapunov function , which is non-negative definite provided that P > 0, and compute its derivative

with

Thus, if we had , then we would have

We now examine the condition . We resort to its Schur complement: given if and only if is the top-left block of

as it is assumed in the conditions of the lemma. Hence, under such a choice of , we recover

. (17) follows with

Developing at the first order in

Finally, we propagate the state prediction error bound to the pessimistic rewards and surrogate objective to get our final result.

Proof. For any sequence of controls u, dynamical parameters and disturbances

Moreover, by the inclusion property (4), we have that , which implies that -lipschitz,

Finally, we use the result of Lemma 1 to account for planning with a finite budget, and relate to

A.5 Proof of Corollary 1

Proof. By (7) and (14), we have

and by (8),

Stability condition 2. By Lemma 3 and the above, the sequence converges to in Frobenius norm. Thus,

which is assumed to be negative definite.

Moreover, the two functions that map a matrix to its characteristic polynomial and a polynomial to its roots, are both continuous. Thus, by continuity, the largest eigenvalue of converges to that of M, which is strictly negative. Hence, there exists some such that for all is negative definite, as required in the condition 2. of Theorem 3.

A.6 Proof of Proposition 4

We start by showing the following lemma: Lemma 5 (Robust values ordering). In addition to the robust B-value defined in (16), that we extend to inner nodes

we also define the robust value of a sequence of actions a

and the robust U-values of a sequence of action a

Then, the robust values, U-values and B-values exhibit similar properties as the optimal values, U-values and B-values, that is: for all

Proof. By definition, when starting with sequence a, the value represents the minimum admissible reward, while corresponds to the best admissible reward achievable with respect to the the possible continuations of a. Thus, for all and are non-decreasing functions of are a non-increasing functions of depend on k.

Moreover, since the reward function R is assumed be bounded in [0, 1], the sum of discounted rewards from a node of depth d is at most . As a consequence, for all of depth d, and any sequence of rewards obtained from following a path in dynamics

Hence,

And as the left-hand and right-hand sides of (22) are independent of the particular path that was followed in , it also holds for the robust path:

that is,

Finally, (23) is extended to the rest of by recursive application of (19), (20) and (18).

We now turn to the proof of the theorem.

Proof. Hren & Munos [19] first show in Theorem 2 that the simple regret of their optimistic planner is bounded by where is the depth of . This properties relies on the fact that the returned action belongs to the deepest explored branch, which we can show likewise by contradiction using Lemma 5. This yields directly that the returned action where i is some node of maximal depth expanded at round , which by selection rule verifies

Secondly, they bound the depth with respect to K. To that end, they show that the expanded nodes always belong to the sub-tree of all the nodes of depth -optimal. Indeed, if a node i of depth d is expanded at round by selection rule, thus the max-backups of (16) up to the root yield . Moreover, by Lemma 5 we have that

Then from the definition of applied to nodes in , there exists such that the number of nodes of depth is bounded by . As a consequence,

B Experimental details

In both experiments, we used and a planning budget K = 100. The disturbances were sampled uniformly in while the measurements are Gaussian with covariance

B.1 Obstacle Avoidance

States The system is described by its position and velocity

Actions It is acted upon by means horizontal and vertical forces . We discretise the action space into four constant controls, for each direction:

Reward The reward encodes the task of navigating to reach a goal state while avoiding collisions with obstacles:

Dynamics The system dynamics consist in a double integrator, with friction parameters

Note that Assumption 3 is always verified.

DQN baseline In addition to the state, knowledge of the obstacles is encoded in the observation as an angular grid of laser-like distance measurements, as well as the goal location relative to the system position. As a model for the Q-function, we used a Multi-Layer Perceptron with two hidden layers of size -greedy strategy was used for exploration.

B.2 Autonomous Driving

In the following, we describe the structure of the dynamical system f representing the couplings and interactions between several vehicles.

States In addition to the ego-vehicle, the scene contains V other vehicles. Any vehicle represented by its position , its forward velocity its heading . The resulting joint state is the traffic description:

Actions The ego-vehicle is following a fixed path, and the tasks consists in adapting its velocity by means of three actions A = {faster, constant velocity, slower}. They are achieved by a longitudinal linear controller that tracks the desired velocity , as described below in the system dynamics.

Reward The reward function R is the following:

Dynamics The kinematics of any vehicle are represented by the Kinematic Bicycle Model:

where is the vehicle position, is its forward velocity and is its heading, l is the vehicle half-length, is the acceleration command and is the slip angle at the centre of gravity, used as a steering command.

Longitudinal dynamics Longitudinal behaviour is modelled by a linear controller using three features: a desired velocity, a braking term to drive slower than the front vehicle, and a braking term to respect a safe distance to the front vehicle.

Denoting the index of the front vehicle preceding vehicle i, the acceleration command can be presented as follows:

where and T respectively denote the speed limit, jam distance and time gap given by traffic rules.

Lateral dynamics The lane with the lateral position and heading is tracked by a cascade controller of lateral position and heading , which is selected in a way the closed-loop dynamics take the form:

We assume that the drivers choose their steering command such that (24) is always achieved:

LPV formulation The system presented so far is non-linear and must be cast into the LPV form. We approximate the non-linearities induced by the trigonometric operators through equilibrium linearisation around

This yields the following longitudinal dynamics:

where are set to 0 whenever the corresponding features are not active.

Here, the dependency in is seen as an uncertain parametric dependency, i.e. , with constant bounds assumed for using an overset of the longitudinal interval predictor.

Change of coordinates In both cases, the obtained polytope centre is non-Metzler. We use the similarity transformation of coordinates of Efimov et al. [15]. Precisely, we choose such that for any is always diagonalisable with real eigenvalues, and perform an eigendecomposition to compute its change of basis matrix Z. The transformed system verifies (2) with Metlzer as required to apply the interval predictor of Proposition 3. Finally, the obtained predictor is transformed back to the original coordinates Z by using the following lemma:

Lemma 6 (Interval arithmetic of Efimov et al. 14). Let be a vector variable, some

1. If is a constant matrix, then

2. If is a matrix variable and

DQN baseline In order to avoid discontinuities in the vehicles headings, the state is encoded as , with the ego-vehicle always in the first position. As a model for the Q-function, we used the Social Attention architecture from [26], that allows to support an arbitrary number of vehicles as input and enforce an invariance to their order.

C A tighter conversion from ellipsoid to polytope

D On the ordering of min and max

In the definition of and it is essential that the minimum over the models is only taken at the end of trajectories, in the same way as for the robust objective (15) in which the worst-case dynamics is only determined after the action sequence has been fully specified. Assume that is instead naively defined as:

This would not recover the robust policy, as we show in Figure 6 with a simple counter-example.

Figure 6: From left to right: two simple models and corresponding u-values with optimal sequences in blue; the naive version of the robust values returns sub-optimal paths in red; our robust U-value properly recovers the robust policy in green.