b

DiscoverSearch
About
My stuff
Robust Estimation, Prediction and Control with Linear Dynamics and Generic Costs
2020·arXiv
Abstract
Abstract

We consider the problem of robust and adaptive model predictive control (MPC) of a linear system, with unknown parameters that are learned along the way (adaptive), in a critical setting where failures must be prevented (robust). This problem has been studied from different perspectives by different communities. However, the existing theory deals only with the case of quadratic costs (the LQ problem), which limits applications to stabilisation and tracking tasks only. In order to handle more general (non-convex) costs that naturally arise in many practical problems, we carefully select and bring together several tools from different communities, namely non-asymptotic linear regression, recent results in interval prediction, and tree-based planning. Combining and adapting the theoretical guarantees at each layer is non trivial, and we provide the first end-to-end suboptimality analysis for this setting. Interestingly, our analysis naturally adapts to handle many models and combines with a data-driven robust model selection strategy, which enables to relax the modelling assumptions. Last, we strive to preserve tractability at any stage of the method, that we illustrate on two challenging simulated environments.1

Despite the recent successes of Reinforcement Learning [e.g. 33, 40], it has hardly been applied in real industrial issues. This could be attributed to two undesirable properties which limit its practical applications. First, it depends on a tremendous amount of interaction data that cannot always be simulated. This issue can be alleviated by model-based methods – which we consider in this work – that often benefit from better sample efficiencies than their model-free counterparts. Second, it relies on trial-and-error and random exploration. In order to overcome these shortages, and motivated by the path planning problem for a self-driving car, in this paper we consider the problem of controlling an unknown linear system x(t) so as to maximise an arbitrary bounded reward function R, in a critical setting where mistakes are costly and must be avoided at all times. This choice of rich reward space is crucial to have sufficient flexibility to model non-convex and non-smooth functions that naturally arise in many practical problems involving combinatorial optimisation, branching decisions, etc., while quadratic costs are mostly suited for tracking a fixed reference trajectory [e.g. 23]. Since experiencing failures is out of question, the only way to prevent them from the outset is to rely on some sort of prior knowledge. In this work, we assume that the system dynamics are partially known, in the form of a linear differential equation with unknown parameters and inputs. More precisely, we consider a linear system with state  x ∈ Rp, acted on by controls  u ∈ Rqand disturbances  ω ∈ Rr, and following dynamics in the form:

image

where the parameter vector  θin the state matrix  A(θ) ∈ Rp×pbelongs to a compact set  Θ ⊂ Rd. The control matrix  B ∈ Rp×q and disturbance matrix  D ∈ Rp×r are known. We also assume having access to the observation of x(t) and to a noisy measurement of  ˙x(t)in the form  y(t) = ˙x(t)+Cν(t),where  ν(t) ∈ Rs is a measurement noise and  C ∈ Rp×s is known. Assumptions over the disturbance ωand noise  νwill be detailed further, and we denote  η(t) = Cν(t) + Dω(t). We argue that this structure assumption is realistic given that most industrial applications to date have been relying on physical models to describe their processes and well-engineered controllers to operate them, rather than machine learning. Our framework relaxes this modelling effort by allowing some structured uncertainty around the nominal model. We adopt a data-driven scheme to estimate the parameters more accurately as we interact with the true system. Many model-based reinforcement learning algorithms rely on the estimated dynamics to derive the corresponding optimal controls [e.g. 24, 29], but suffer from model bias: they ignore the error between the learned and true dynamics, which can dramatically degrade control performances [39].

To address this issue, we turn to the framework of robust decision-making: instead of merely considering a point estimate of the dynamics, for any  N ∈ N, we build an entire confidence region CN,δ ⊂ Θ, illustrated in Figure 1, that contains the true dynamics parameter with high probability:

image

where  δ ∈ (0, 1). In Section 2, having observed a history  DN = {(xn, yn, un)}n∈[N]of transitions, our first contribution extends the work of Abbasi-Yadkori et al. [2] who provide a confidence ellipsoid for the least-square estimator to our setting of feature matrices, rather than feature vectors.

The robust control objective  V r [8, 9, 18] aims to maximise the worst-case outcome with respect to this confidence region  CN,δ:

image

γ ∈ (0, 1)is a discount factor, and  xn(u, ω)is the state reached at step n under controls u and disturbances  ω(t)within the given admissible bounds  [ω(t), ω(t)]. Maximin problems such as (3) are notoriously hard if the reward R has a simple form. However, without a restriction on the shape of functions R, we cannot hope to derive an explicit solution. In our second contribution, we propose a robust MPC algorithm for solving (3) numerically. In Section 3, we leverage recent results from the uncertain system simulation literature to derive an interval predictor [x(t), x(t)] for the system (1), illustrated in Figure 2. For any  N ∈ N, this predictor takes the information on the current state xN, the confidence region  CN,δ, planned control sequence u and admissible disturbance bounds [ω(t), ω(t)]; and must verify the inclusion property:

image

Since R is generic, potentially non-smooth and non-convex, solving the optimal – not to mention the robust – control objective is intractable. In Section 4, facing a sequential decision problem with continuous states, we turn to the literature of tree-based planning algorithms. Although there exist works addressing continuous actions [10, 43], we resort to a first approximation and discretise the continuous decision  (Rq)Nspace by adopting a hierarchical control architecture: at each time, the agent can select a high-level action a from a finite space A. Each action  a ∈ Acorresponds to the selection of a low-level controller  πa, that we take affine:  u(t) = πa(x(t))def= −Kax(t) + ua.For

image

Figure 1: The model estimation procedure, running on the obstacle avoidance problem of Section 6. The confidence region  CN,δshrinks with the number of samples N.

image

instance, a tracking a subgoal  xgcan be achieved with  πg = K(xg − x). This discretisation induces a suboptimality, but it can be mitigated by diversifying the controller basis. The robust objective (3) becomes  supa∈AN V r(a), where xn(a, ω)stems from (1) with  un = πan(xn). However, tree-based planning algorithms are designed for a single known generative model rather than a confidence region for the system dynamics. Our third contribution adapts them to the robust objective (3) by approximating it with a tractable surrogate ˆV rthat exploits the interval predictions (4) to define a pessimistic reward. In our main result, we show that the best surrogate performance achieved during planning is guaranteed to be attained on the true system, and provide an upper bound for the approximation gap and suboptimality of our framework in Theorem 3. This is the first result of this kind for maximin control with generic costs to the best of our knowledge. Algorithm 1 shows the full integration of the three procedures of estimation, prediction and control.

In Section 5, our forth contribution extends the proposed framework to consider multiple modelling assumptions, while narrowing uncertainty through data-driven model rejection, and still ensuring safety via robust model-selection during planning.

Finally, in Section 6 we demonstrate the applicability of Algorithm 1 in two numerical experiments: a simple illustrative example and a more challenging simulation for safe autonomous driving.

Notation The system dynamics are described in continuous time, but sensing and control happen in discrete time with time-step dt > 0. For any variable z, we use subscript to refer to these discrete times:  zn = z(tn)with  tn = ndtand  n ∈ N. We use bold symbols to denote temporal sequences z = (zn)n∈N. We denote  z+ = max(z, 0), z− = z+ − z, |z| = z+ + z− and [n] = {1, . . . , n}.

1.1 Related Work

The control of uncertain systems is a long-standing problem, to which a vast body of literature is dedicated. Existing work is mostly concerned with the problem of stabilisation around a fixed reference state or trajectory, including approaches such as  H∞control [7], sliding-mode control [32] or system-level synthesis [11, 12]. This paper fits in the popular MPC framework, for which adaptive data-driven schemes have been developed to deal with model uncertainty [38, 41, 5], but lack guarantees. The family of tube-MPC algorithms seeks to derive theoretical guarantees of robust constraint satisfaction: the state x is constrained in a safe region X around the origin, often chosen convex [17, 4, 6, 42, 30, 22, 31, 28]. Yet, many tasks cannot be framed as stabilisation problems (e.g. obstacle avoidance) and are better addressed with the minimax control objective, which allows more flexible goal formulations. Minimax control has mostly been studied in two particular instances.

Finite states Minimax control of finite Markov Decision Processes with uncertain parameters was studied in [21, 34, 44], who showed that the main results of Dynamic Programming can be extended to their robust counterparts only when the dynamics ambiguity set verifies a certain rectangularity property. Since we consider continuous states, these methods do not apply.

Linear dynamics and quadratic costs Several approaches have been proposed for cumulative regret minimisation in the LQ problem. In the Optimism in the Face of Uncertainty paradigm, the best possible dynamics within a high-confidence region is selected under a controllability constraint, to compute the corresponding optimal control in closed-form by solving a Riccati equation. The results of [1, 20, 16] show that this procedure achieves a �O�N 1/2�regret. Posterior sampling algorithms [35, 3] select candidate dynamics randomly instead, and obtain the same result. Other works use noise injection for exploration such as [11, 12]. However, neither optimism nor random exploration fit a critical setting, where ensuring safety requires instead to consider pessimistic outcomes. The work of Dean et al. [11] is close to our setting: after an offline estimation phase, they estimate a suboptimality between a minimax controller and the optimal performance. Our work differs in that it addresses a generic shape cost. Another work of interest is [37] where worst-case generic costs are considered. However, they assume the knowledge of the dynamics, and their rollout-based solution only produces inner-approximations and does not yield any guarantee. In this paper, interval prediction is used to produce oversets, while a near-optimal control is found using a tree-based planning procedure.

To derive a confidence region (2) for  θ, the functional relationship  A(θ)must be specified.

Assumption 1 (Structure). There exists a known feature tensor  φ ∈ Rd×p×p such that for all  θ ∈ Θ,

image

where  A, φ1, . . . , φd ∈ Rp×pare known. For all n, we denote  Φn = [φ1xn . . . φdxn] ∈ Rp×d. We also assume to know a bound  S such that θ ∈ [−S, S]d.

We slightly abuse notations and include additional known terms in the measurement signal y(t) = ˙x(t) + Cν(t) − Ax(t) − Bu(t),to obtain a linear regression system  yn = Φnθ + ηn.

Regularised least square To derive an estimate on  θ, we consider the  L2-regularised regression problem with weights  Σp ∈ Rp×p and λ ∈ R+∗ :

image

Proposition 1 (Regularised solution). The solution to (6) is

image

Substituting  yninto (7) yields the regression error:  θN,λ − θ = G−1N,λ�Nn=1 ΦTnΣ−1p ηn − λG−1N,λθ.To bound this error, we need the noise  ηnto concentrate.

Assumption 2 (Noise Model). We assume that

image

2. at each time  t ≥ 0, the disturbance  ω(t)is enclosed by known bounds  ω(t) ≤ ω(t) ≤ ω(t),whose amplitude verifies �∞n=0 γnCω(tn) < ∞, where

image

Theorem 1 (Confidence ellipsoid, a matricial version of Abbasi-Yadkori et al. 2). Under Assumption 2, it holds with probability at least  1 − δ that

image

We convert this confidence ellipsoid  CN,δfrom (8) into a polytope for  A(θ). For simplicity, we present here a simple but coarse strategy: bound the ellipsoid by its enclosing axis-aligned hypercube:

image

where AN = A(θN,λ), hi ∈ {−1, 1}d, ∆AN,i = hi�

βN(δ)λmax(GN,λ). A tighter polytope derivation is presented in the Supplementary Material.

A simple solution to (4) is proposed in [14], where, given bounds  A ≤ A(θ) ≤ A from CN,δ they usematrix interval arithmetic to derive the predictor:

Proposition 2 (Simple predictor of Efimov et al. 14). Assuming that (2) is satisfied for the system (1), then the interval predictor following  x(tN) = x(tN) = x(tN) and:

˙x(t) = A+x+(t) − A+x−(t) − A−x+(t) + A−x−(t) + Bu(t) + D+ω(t) − D−ω(t), (10)˙x(t) = A+x+(t) − A+x−(t) − A−x+(t) + A−x−(t) + Bu(t) + D+ω(t) − D−ω(t),

ensures the inclusion property (4) with confidence level  δ.

However, Leurent et al. [27] showed that this predictor can have unstable dynamics, even for stable systems, which causes a fast build-up of uncertainty. They proposed an enhanced predictor which exploits the polytopic structure (9) to produce more stable predictions, at the price of a requirement: Assumption 3. There exists an orthogonal matrix  Z ∈ Rp×p such that ZTANZis Metzler2.

In practice, this assumption is often verified: it is for instance the case whenever  ANis diagonalisable. The similarity transformation of [15] provides a method to compute such Z when the system is observable. To simplify the notation, we will further assume that  Z = Ip. Denote  ∆A+ =�2di=1 ∆A+N,i, ∆A− = �2di=1 ∆A−N,i.

Proposition 3 (Enhanced predictor of Leurent et al. 27). Assuming that (9) and Assumption 3 are satisfied for the system (1), then the interval predictor following  x(tN) = x(tN) = x(tN) and:

˙x(t) = ANx(t) − ∆A+x−(t) − ∆A−x+(t) + Bu(t) + D+ω(t) − D−ω(t), (11)˙x(t) = ANx(t) + ∆A+x+(t) + ∆A−x−(t) + Bu(t) + D+ω(t) − D−ω(t),

ensures the inclusion property (4) with confidence level  δ.

Figure 3 compares the performance of the predictors (10) and (11) in a simple example. It suggests to always prefer (11) whenever Assumption 3 is verified, and only fallback to (10) as a last resort.

image

Figure 3: Comparison of (10) and (11) for a simple system  ˙x(t) = −θx(t) + ω(t), with  θ ∈ [1, 2]and  ω(t) ∈ [−0.05, 0.05].

To evaluate the robust objective  V r (3), we approximate it thanks to the interval prediction [x(t), x(t)]. Definition 1 (Surrogate objective). Let  xn(u), xn(u)following the dynamics defined in (11) and

image

Such a substitution makes this pessimistic reward  Rnnot Markovian, since the worst case is assessed over the whole past trajectory. Theorem 2 (Lower bound). The surrogate objective (12) is a lower bound of the objective (3).

image

Consequently, since all our approximations are conservative, if we manage to find a control sequence such that no “bad event” (e.g. a collision) happens according to the surrogate objective ˆV r, they areguaranteed not to happen either when the controls are executed on the true system.

To maximise ˆV r, we cannot use DP algorithms since the state space is continuous and the pessimistic rewards are non-Markovian. Rather, we turn to tree-based planning algorithms, which optimise a sequence of actions based on the corresponding sequence of rewards, without requiring Markovity nor state enumeration. In particular, we consider the Optimistic Planning of Deterministic Systems (OPD) algorithm [19] tailored for the case when the relationship between actions and rewards is deterministic. Indeed, the stochasticity of the disturbances and measurements is encased in ˆV r: given the observations up to time N both the predictor dynamics (11) and the pessimistic rewards in (12) are deterministic. At each planning iteration  k ∈ [K], OPDprogressively builds a tree  Tk+1by forming upper-bounds  Ba(k)over the value of sequences of actions a, and expanding3 the leaf  akwith highest upper-bound:

image

where  Lkis the set of leaves of  Tk, h(a)is the length of the sequence a, and  Rn(a)the pessimistic reward (12) obtained at time n by following the controls  un = πan(xn).

Lemma 1 (Planning performance of Hren & Munos 19). The suboptimality of the OPD algorithm (13) applied to the surrogate objective (12) after K planning iterations is:

image

where  κdef= lim suph→∞����a ∈ Ah : ˆV r(a) ≥ ˆV r(a⋆) − γh+11−γ����1/his a problem-dependent measure of the proportion of near-optimal paths.

Hence, by using enough computational budget K for planning we can get as close as we want to the optimal surrogate value ˆV r(a⋆), at a polynomial rate. Unfortunately, there exists a gap between ˆV rand the true robust objective  V r, which stems from three approximations: (i) the true reachable set was approximated by an enclosing interval in (4); (ii) the time-invariance of the dynamics uncertainty A(θ) ∈ CN,δwas handled by the interval predictor (11) as if it were a time-varying uncertainty A(θ(t)) ∈ CN,δ, ∀t; and (iii) the lower-bound � min ≤ min �used to define the surrogate objective (12) is not tight. However, this gap can be bounded under additional assumptions.

Theorem 3 (Suboptimality bound). Under two conditions:

image

we can bound the suboptimality of Algorithm 1 with planning budget K as:

image

with probability at least  1 − δ, where V (a) is the optimal expected return when executing an action  a ∈ A, a⋆is an optimal action, and  ∆ωis a constant which corresponds to an irreducible suboptimality suffered from being robust to instantaneous disturbances  ω(t).

It is difficult to check the validity of the stability condition 2. since it applies to matrices  AN producedby the algorithm rather than to the system parameters. A stronger but easier to check condition is that the polytope (9) at some iteration becomes included in a set where this property is uniformly satisfied. For instance, if the features are sufficiently excited, the estimation converges to a neighbourhood of the true dynamics  A(θ). This also allows to further bound the input-dependent estimation error term. Corollary 1 (Asymptotic near-optimality). Under an additional persistent excitation (PE) assumption

image

the stability condition 2. of Theorem 3 can be relaxed to apply to the true system: there exist  P, Q0, ρsuch that

image

and the suboptimality bound takes the more explicit form

image

which ensures asymptotic near-optimality when  N → ∞ and K → ∞.

The procedure we developed in Sections 2 to 4 relies on strong modelling assumptions, such as the linear dynamics (1) and Assumption 1. But what if they are wrong?

Model adequacy One of the major benefits of using the family of linear models, compared to richer model classes, is that they provide strict conditions allowing to quantify the adequacy of the modelling assumptions to the observations. Given  N − 1observations, Section 2 provides a polytopic confidence region (9) that contains  A(θ)with probability at least  1 − δ. Since the dynamics are linear, we can propagate this confidence region to the next observation:  yNmust belong to the Minkowski sum of a polytope representing model uncertainty  P(A0xN + BuN, ∆A1xN, . . . , ∆A2dxN) and apolytope  P(0p, η, η)bounding the disturbance and measurement noises. Delos & Teissandier [13] provide a way to test this membership in polynomial time using linear programming. Whenever it is not verified, we can confidently reject the  (A, φ)-modelling assumption 1. This enables us to consider a rich set of potential features�(A1, φ1), . . . , (AM, φM)�rather than relying on a single assumption, and only retain those that are consistent with the observations so far. Then, every remaining hypothesis must be considered during planning.

Robust selection We temporarily ignore the parametric uncertainty on  θto simply consider several candidate dynamics models, which all correspond to different modelling assumptions. We also restrict to deterministic dynamics, which is the case of (11).

Assumption 4 (Multi-model ambiguity). The dynamics f lie in a finite set of candidates  (f m)m∈[M].

We adapt our planning algorithm to balance these concurrent hypotheses in a robust fashion, i.e. maximise a robust objective with discrete ambiguity:

image

where  Rmn is the reward obtained by following the action sequence a up to step n under the dynamics f m. This objective could be optimised in the same way as in Section 4, but this would result in a coarse and lossy approximation. Instead, we exploit the finite uncertainty structure of Assumption 4 to asymptotically recover the true  V r by modifying the OPD algorithm in the following way:

Definition 2 (Robust UCB). We replace the upper-bound (13) on sequence values in OPD by:

image

Note that it is not equivalent to solving each control problem independently and following the action with highest worst-case value, as we show in the Supplementary Material. We analyse the sample complexity of the corresponding robust planning algorithm.

Proposition 4 (Robust planning performance). The robust version of OPD (16) enjoys the same regret bound as OPD in Lemma 1, with respect to the multi-model objective (15).

This result is of independent interest: the solution of a robust objective (15) with discrete ambiguity f ∈ {f m}m∈[M]can be recovered exactly, asymptotically when the planning budget K goes to infinity, which Robust DP algorithms do not allow. This also contrasts with the results obtained in Section 4 for the robust objective (3) with continuous ambiguity  A(θ) ∈ CN,δ, for which OPD only recovers the surrogate approximation ˆV r, as discussed in Theorem 3. Note that here the regret depends on the number K of node expansions, but each expansion now requires M times more simulations than in the single-model setting. Finally, the two approaches of Sections 4 and 5 can be merged by using the pessimistic reward (12) in (16).

Videos and code are available at https://eleurent.github.io/robust-beyond-quadratic/.

Obstacle avoidance with unknown friction We first consider a simple illustrative example, shown in Figure 2: the control of a 2D system moving by means of a force  (ux, uy)in an medium with anisotropic linear friction with unknown coefficients  (θx, θy). The reward encodes the task of navigating to reach a goal state  xgwhile avoiding collisions with obstacles:  R(x) = δ(x)/(1 + ∥x −xg∥2) where δ(x) is 0 whenever xcollides with an obstacle, 1 otherwise. The actions A are constant controls in the up, down, left and right directions. For the reasons mentioned above, no robust baseline applies to our setting. We compare Algorithm 1 to the non-robust adaptive control approach that plans with the estimated dynamics  θN,λ, and thus enjoys the same prior knowledge of dynamics structure and reward. This highlights the benefits of the robust formulation solely rather than stemming from algorithm design. We show in Table 1(a) the results of 100 simulations of a single episode: the robust agent performs worse than the nominal agent on average, but manages to ensure safety while the nominal agent collides with obstacles in 4% of simulations. We also compare to a standard model-free approach, DQN, which does not benefit from the prior knowledge on the system dynamics, and is instead trained over multiple episodes. The reported performance is that of the final policy obtained after training for 3000 episodes, during which  897 ± 64collisions occurred (29.9 ± 2.1%). We study the evolution of the suboptimality  V (xN) − �n>N γn−NR(xn)with respect to the number of samples N, by comparing the empirical returns from a state  xNto the value  V (xN)that the agent would get by acting optimally from  xNwith knowledge of the dynamics. Although the assumptions

image

Figure 4: The mean (solid), 95% CI for the mean (shaded) and maximum (dashed) suboptimality with respect to N.

Figure 5: The intersection crossing task. Trajectory intervals show behavioural uncertainty for each vehicle, with a multi-model assumption over their route.

Table 1: Frequency of collision, minimum and average return achieved on a single episode, repeated with 100 random seeds. In both tasks, the robust agent performs worse than the nominal agent on average, but manages to ensure safety and attains a better worst-case performance.

image

of Theorem 3 are not satisfied (e.g. non-smooth reward), the mean suboptimality of the robust agent, shown in Figure 4, still decreases polynomially with N: Algorithm 1 gets more efficient as it is more confident while ensuring safety at all times. In comparison, the nominal agent enjoys a smaller suboptimality on average, but higher in the worst-case.

Motion planning for an autonomous vehicle We consider the highway-env environment [25] for simulated driving decision problems. An autonomous vehicle with state  χ0 ∈ R4is approaching an intersection among V other vehicles with states  χi ∈ R4, resulting in a joint traffic state x = [χ0, . . . , χV ]⊤ ∈ R4V +4. These vehicles follow parametrized behaviours  ˙χi = fi(x, θi)with unknown parameters  θi ∈ R5. We appreciate a first advantage of the structure imposed in Assumption 1: the uncertainty space of  θ is R5V . In comparison, the traditional LQ setting where the whole state matrix A is estimated would have resulted in a much larger parameter space  θ ∈ R16V 2. The system dynamics f, which describes the interactions between vehicles, can only be expressed in the form of Assumption 1 given the knowledge of the desired route for each vehicle, with features  φexpressing deviations to the centerline of the followed lane. Since these intentions are unknown to the agent, we adopt the multi-model perspective of Section 5 and consider one model per possible route for every observed vehicle before an intersection. In Table 1(b), we compare Algorithm 1 to a nominal agent planning with two different modelling assumptions: Nominal 1 has access to the true followed route for each vehicle, while Nominal 2 does not and picks the model with minimal prediction error. Again we also compare to a DQN baseline trained over 3000 episodes, causing 1058 ± 113collisions while training (35 ± 4%). As before, the robust agent has a higher worst-case performance and avoids collisions at all times, at the price of a decreased average performance..

We present a framework for the robust estimation, prediction and control of a partially known linear system with generic costs. Leveraging tools from linear regression, interval prediction, and tree-based planning, we guarantee the predicted performance and provide a suboptimality bound. The method applicability is further improved by a multi-model extension and demonstrated on two simulations.

The motivation behind this work is to enable the development of Reinforcement Learning solutions for industrial applications, when it has been mainly limited to simulated games so far. In particular, many industries already rely on non-adaptive control systems and could benefit from an increased efficiency, including Oil and Gas, robotics for industrial automation, Data Center cooling, etc. But more often than not, safety-critical constraints proscribe the use of exploration, and industrials are reluctant to turn to learning-based methods that lack accountability. This work addresses these concerns by focusing on risk-averse decisions and by providing worst-case guarantees. Note however that these guarantees are only as good as the validity of the underlying hypotheses, and Assumption 1 in particular should be submitted to a comprehensive validation procedure; otherwise, decisions formed on a wrong basis could easily lead to dramatic consequences in such critical settings. Beyond industrial perspectives, this work could be of general interest for risk-averse decision-making. For instance, parametrized epidemiological models have been used to represent the propagation of Covid-19 and study the impact of lockdown policies. These model parameters are estimated from observational data and corresponding confidence intervals are often available, but rarely used in the decision-making loop. In contrast, our approach would enable evaluating and optimising the worst-case outcome of such public policies.

This work was supported by the French Ministry of Higher Education and Research, and CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015-2020.

[1] Abbasi-Yadkori, Y. and Szepesvári, C. Regret bounds for the adaptive control of linear quadratic systems. In Kakade, S. M. and von Luxburg, U. (eds.), Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pp. 1–26, Budapest, Hungary, 09–11 Jun 2011. PMLR.

[2] Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 2312–2320. Curran Associates, Inc., 2011.

[3] Abeille, M. and Lazaric, A. Improved regret bounds for thompson sampling in linear quadratic control problems. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1–9, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.

[4] Adetola, V., DeHaan, D., and Guay, M. Adaptive model predictive control for constrained nonlinear systems. Systems and Control Letters, 2009. ISSN 01676911. doi: 10.1016/j.sysconle. 2008.12.002.

[5] Amos, B., Rodriguez, I. D. J., Sacks, J., Boots, B., and Zico Kolter, J. Differentiable MPC for end-to-end planning and control. In Advances in Neural Information Processing Systems, 2018.

[6] Aswani, A., Gonzalez, H., Sastry, S. S., and Tomlin, C. Provably safe and robust learning-based model predictive control. Automatica, 2013. ISSN 00051098. doi: 10.1016/j.automatica.2013. 02.003.

[7] Basar, T. and Bernhard, P. H infinity - Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach, volume 41. 1996.

[8] Ben-Tal, A., El Ghaoui, L., and Nemirovski, A. Robust optimization, volume 28. Princeton University Press, 2009.

[9] Bertsimas, D., Brown, D. B., and Caramanis, C. Theory and applications of robust optimization. SIAM review, 53(3):464–501, 2011.

[10] Busoniu, L., Pall, E., and Munos, R. Continuous-action planning for discounted infinite-horizon nonlinear optimal control with lipschitz values. Automatica, 92:100–108, 06 2018.

[11] Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. On the sample complexity of the linear quadratic regulator. ArXiv, abs/1710.01688, 2017.

[12] Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. Regret bounds for robust adaptive control of the linear quadratic regulator. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 4188–4197. Curran Associates, Inc., 2018.

[13] Delos, V. and Teissandier, D. Minkowski Sum of Polytopes Defined by Their Vertices. Journal of Applied Mathematics and Physics (JAMP), 3(1):62–67, January 2015.

[14] Efimov, D., Fridman, L., Raïssi, T., Zolghadri, A., and Seydou, R. Interval estimation for LPV systems applying high order sliding mode techniques. Automatica, 48:2365–2371, 2012.

[15] Efimov, D., Raïssi, T., Chebotarev, S., and Zolghadri, A. Interval state observer for nonlinear time varying systems. Automatica, 49(1):200–205, 2013.

[16] Faradonbeh, M. K. S., Tewari, A., and Michailidis, G. Finite time analysis of optimal adaptive policies for linear-quadratic systems. CoRR, abs/1711.07230, 2017.

[17] Fukushima, H., Kim, T. H., and Sugie, T. Adaptive model predictive control for a class of constrained linear systems based on the comparison model. Automatica, 2007. ISSN 00051098. doi: 10.1016/j.automatica.2006.08.026.

[18] Gorissen, B. L., ˙Ihsan Yanıko˘glu, and den Hertog, D. A practical guide to robust optimization. Omega, 53:124 – 137, 2015.

[19] Hren, J.-F. and Munos, R. Optimistic planning of deterministic systems. In European Workshop on Reinforcement Learning, pp. 151–164, France, 2008.

[20] Ibrahimi, M., Javanmard, A., and Roy, B. Efficient reinforcement learning for high dimensional linear quadratic systems. Advances in Neural Information Processing Systems, 4, 03 2013.

[21] Iyengar, G. N. Robust Dynamic Programming. Mathematics of Operations Research, 30: 257–280, 2005.

[22] Köhler, J., Andina, E., Soloperto, R., Müller, M. A., and Allgöwer, F. Linear robust adaptive model predictive control: Computational complexity and conservatism. In 2019 IEEE 58th Conference on Decision and Control (CDC), pp. 1383–1388, 2019.

[23] Kumar, E. V. and Jerome, J. Robust lqr controller design for stabilizing and trajectory tracking of inverted pendulum. Procedia Engineering, 64:169 – 178, 2013. International Conference on Design and Manufacturing.

[24] Lenz, I., Knepper, R. A., and Saxena, A. Deepmpc: Learning deep latent features for model predictive control. In Robotics: Science and Systems, 2015.

[25] Leurent, E. An environment for autonomous driving decision-making. https://github.com/ eleurent/highway-env, 2018.

[26] Leurent, E. and Mercat, J. Social attention for autonomous decision-making in dense traffic. In Machine Learning for Autonomous Driving Workshop at NeurIPS 2019, 2019.

[27] Leurent, E., Efimov, D., Raïssi, T., and Perruquetti, W. Interval prediction for continuous-time systems with parametric uncertainties. In Proc. IEEE Conference on Decision and Control (CDC), Nice, 2019.

[28] Leurent, E., Efimov, D., and Maillard, O.-A. Robust-Adaptive Interval Predictive Control for Linear Uncertain Systems. In 2020 IEEE 59th Conference on Decision and Control (CDC), Jeju Island, Republic of Korea, 8–11 Dec 2020.

[29] Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. CoRR, abs/1504.00702, 2015.

[30] Lorenzen, M., Allgöwer, F., and Cannon, M. Adaptive model predictive control with robust constraint satisfaction. IFAC-PapersOnLine, 50(1):3313–3318, Jul 2017. ISSN 2405-8963.

[31] Lu, X. and Cannon, M. Robust adaptive tube model predictive control. In Proceedings of the American Control Conference, 2019. ISBN 9781538679265.

[32] Lu, X.-Y. and Spurgeon, S. K. Robust sliding mode control of uncertain nonlinear systems. Systems & Control Letters, 32(2):75–90, Nov 1997. ISSN 0167-6911.

[33] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.

[34] Nilim, A. and El Ghaoui, L. Robust Control of Markov Decision Processes with Uncertain Transition Matrices. Operations Research, 53:780–798, 2005.

[35] Ouyang, Y., Gagrani, M., and Jain, R. Learning-based control of unknown linear systems with thompson sampling. CoRR, abs/1709.04047, 2017.

[36] Peña, V. H., Lai, T. L., and Shao, Q.-M. Self-normalized processes: Limit theory and Statistical Applications. Springer Science & Business Media, 2008.

[37] Rosolia, U. and Borrelli, F. Sample-Based Learning Model Predictive Control for Linear Uncertain Systems. In Proceedings of the IEEE Conference on Decision and Control, 2019. ISBN 9781728113982. doi: 10.1109/CDC40024.2019.9030270.

[38] Sastry, S., Bodson, M., and Bartram, J. F. Adaptive Control: Stability, Convergence, and Robustness. The Journal of the Acoustical Society of America, 1990. ISSN 0001-4966. doi: 10.1121/1.399905.

[39] Schneider, J. Exploiting model uncertainty estimates for safe dynamic control learning. Advances in neural information processing systems, pp. 1047—-1053, 1997.

[40] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419): 1140–1144, 2018.

[41] Tanaskovic, M., Fagiano, L., Smith, R., and Morari, M. Adaptive receding horizon control for constrained MIMO systems. Automatica, 2014. ISSN 00051098. doi: 10.1016/j.automatica. 2014.10.036.

[42] Turchetta, M., Berkenkamp, F., and Krause, A. Safe exploration in finite Markov decision processes with Gaussian processes. In Advances in Neural Information Processing Systems, 2016.

[43] Weinstein, A. and Littman, M. Bandit-based planning and learning in continuous-action markov decision processes. Proceedings of the 22nd International Conference on Automated Planning and Scheduling, pp. 306–314, 01 2012.

[44] Wiesemann, W., Kuhn, D., and Rustem, B. Robust Markov Decision Processes. Mathematics of Operations Research, pp. 1–52, 2013.

image

Outline In Appendix A, we provide a proof for every novel result introduced in this paper. Appendix B provides additional details on our experiments. Appendix C gives a better method of conversion from ellipsoid to polytope than that of (9). Finally, Appendix D highlights the fact that robustness cannot be recovered by aggregating independent solutions to many optimal control problem.

A.1 Proof of Proposition 1

Proof. We differentiate  J(θ) = �Nn=1 ∥yn − Φnθ∥2Σ−1p + λ∥θ∥2 as in (6) with respect to  θ:

image

Hence,

image

A.2 Proof of Theorem 1

We start by showing a preliminary proposition:

Proposition 5 (Matrix version of Theorem 1 of Abbasi-Yadkori et al. 2). Let  {Fn}n=0be a filtration. Let  {ηn}∞n=1 be a Rp-valued stochastic process such that  ηn is Fn-measurable and  E [ηn | Fn−1] isΣp-sub-Gaussian.

Let  {Φn}∞n=1 be an Rp×d-valued stochastic process such that  Φn is Fn-measurable. Assume that G is a  d × dpositive definite matrix. For any  n ≥ 0, define:

image

Then, for any  δ > 0, with probability at least  1 − δ, for all n ≥ 0,

image

Proof. Let

image

And for any  z ∈ Rd,

image

Then,

image

and using the Sub-Gaussianity of  ηt

image

Showing that  (M zt )∞t=1 is indeed a supermartingale and in fact  E[M zt ] ≤ 1. It then follows by Doob’s upcrossing lemma for supermartingale that  M z∞ = limt→∞ M ztis almost surely well-defined, and so is  M zτ for any random stopping time  τ.

Next, we consider the stopped martingale  M zmin(τ,t). Since (M zt )∞t=1 is a non-negative supermartin- gale and  τis a random stopping time, we deduce by Doob’s decomposition that

image

Finally, an application of Fatou’s lemma show that  E[M zτ ] = E[lim inft→∞ M zmin(τ,t)] ≤lim inft→∞ E[M zmin(τ,t)] ≤ 1.

This results allows to apply a result from [36]:

Lemma 2 (Theorem 14.7 of [36]). If Z is a random vector and B is a symmetric positive definite matrix such that

image

then for any positive definite non-random matrix C, it holds

image

In particular, by Markov inequality, for all  δ ∈ (0, 1),

image

Here, by using  Z = �ts=1 ΦsΣ−1p ηs, B = Gt, C = G,

image

Having shown this preliminary result, we move on to the proof of Theorem 1.

Proof. For all  x ∈ Rd, (7) gives:

image

Using the Cauchy-Schwartz inequality, we get:

image

In particular, for  x = GN,λ(θN,λ − θ), we get after simplifying with  ∥θN,λ − θ∥GN,λ:

image

By applying Proposition 5 with  G = λId, we obtain that with probability at least  1 − δ,

image

And since θ∥2G−1N,λ ≤ 1/λmin(GN,λ)∥θ∥22 ≤ 1/λ∥θ∥22 and ∥θ∥22 ≤ d∥θ∥2∞ ≤ dS2,

image

A.3 Proof of Theorem 2

Proof. The predictor designed in Section 3 verifies the inclusion property (4). Thus, for sequence of controls u, any dynamics  A(θ) ∈ CN,δ, and disturbances  ω ≤ ω ≤ ω, the corresponding state at time tnis bounded by  xn ≤ xn ≤ xn, which implies that  R(xn) ≥ minx∈[xn(u),xn(u)] R(x) = Rn(u).

Thus, by taking the min over  CN,δ and [ω, ω], we also have for any sequence of controls u:

image

And  V r(u) ≤ V (u) = Eω��∞n=N+1 γnR(xn)�by definition.

A.4 Proof of Theorem 3

We first bound the model estimation error.

Lemma 3.

image

Proof. We have

image

Moreover,  A(θ)belongs to a linear image of this  L2-ball. By writing a the  jth column of a matrix M as  Mj, and its coefficient  i, j as Mi,j,

image

Then, we propagate this estimation error through the state prediction.

image

then for all  t > tN,

image

where

image

Proof. Let  e = x − x. (11) gives the dynamics

image

where recall that  |M| = M + + M − for any matrix  M ∈ Rp×p.

We define the Lyapunov function  V = eTPe, which is non-negative definite provided that P > 0, and compute its derivative

image

with  X =�e ω − ω x+ + x−�T, for any Q ∈ Rp×p, ρ, α ∈ R .

image

Thus, if we had  Υ ≤ 0, Q > 0, ρ > 0, then we would have

image

We now examine the condition  Υ ≤ 0. We resort to its Schur complement: given  α > 0, Υ ≤ 0if and only if  R ≥ S, where S = α−1 [|∆A|TP 0]T [|∆A|TP 0] and Ris the top-left block of  −Υ:

image

as it is assumed in the conditions of the lemma. Hence, under such a choice of  α and Q, we recover

Υ ≤ 0. (17) follows with  µ = λmin(Q)λmax(P ) =

image

Developing at the first order in  α gives

image

Finally, we propagate the state prediction error bound to the pessimistic rewards and surrogate objective to get our final result.

Proof. For any sequence of controls u, dynamical parameters  θ ∈ CN,δand disturbances  ω ≤ ω ≤

image

Moreover, by the inclusion property (4), we have that  xn ≤ xn ≤ xn, which implies that  R(xn) ≤maxx∈[xn(u),xn(u)] R(x). Assuming R is L-lipschitz,

image

Finally, we use the result of Lemma 1 to account for planning with a finite budget, and relate ˆV r(a⋆)to ˆV r(aK).

A.5 Proof of Corollary 1

Proof. By (7) and (14), we have

image

and by (8),

image

Stability condition 2. By Lemma 3 and the above, the sequence  (AN)Nconverges to  A(θ)in Frobenius norm. Thus,

image

which is assumed to be negative definite.

Moreover, the two functions that map a matrix to its characteristic polynomial and a polynomial to its roots, are both continuous. Thus, by continuity, the largest eigenvalue of  Mnconverges to that of M, which is strictly negative. Hence, there exists some  N0 ∈ Nsuch that for all  N > N0, MNis negative definite, as required in the condition 2. of Theorem 3.

A.6 Proof of Proposition 4

We start by showing the following lemma: Lemma 5 (Robust values ordering). In addition to the robust B-value defined in (16), that we extend to inner nodes

image

we also define the robust value of a sequence of actions a

image

and the robust U-values of a sequence of action a

image

Then, the robust values, U-values and B-values exhibit similar properties as the optimal values, U-values and B-values, that is: for all  0 < k < K and a ∈ TT ,

image

Proof. By definition, when starting with sequence a, the value  U ma (k)represents the minimum admissible reward, while  Bma (k)corresponds to the best admissible reward achievable with respect to the the possible continuations of a. Thus, for all  a ∈ A∗, U ma (k)and  U ra(k)are non-decreasing functions of  k and Bma (k) and Bra(k)are a non-increasing functions of  k, while V ma and V ra do notdepend on k.

Moreover, since the reward function R is assumed be bounded in [0, 1], the sum of discounted rewards from a node of depth d is at most  γd + γd+1 + · · · = γd1−γ . As a consequence, for all  k ≥ 0 , a ∈ Lkof depth d, and any sequence of rewards  (Rn)n∈Nobtained from following a path in  aA∞ with anydynamics  m ∈ [M]:

image

Hence,

image

And as the left-hand and right-hand sides of (22) are independent of the particular path that was followed in  aA∞, it also holds for the robust path:

image

that is,

image

Finally, (23) is extended to the rest of  Tkby recursive application of (19), (20) and (18).

We now turn to the proof of the theorem.

Proof. Hren & Munos [19] first show in Theorem 2 that the simple regret  rKof their optimistic planner is bounded by γdK1−γwhere  dKis the depth of  TK. This properties relies on the fact that the returned action belongs to the deepest explored branch, which we can show likewise by contradiction using Lemma 5. This yields directly that the returned action  a = i0where i is some node of maximal depth  dKexpanded at round  k ≤ K, which by selection rule verifies  Bra(k) = Bri (k) =

image

Secondly, they bound the depth  dK of TKwith respect to K. To that end, they show that the expanded nodes always belong to the sub-tree  T∞of all the nodes of depth  d that are γd1−γ -optimal. Indeed, if a node i of depth d is expanded at round  k, then Bri (k) ≥ Brj (k) for all j ∈ Lkby selection rule, thus the max-backups of (16) up to the root yield  Bri (k) = Br∅(k). Moreover, by Lemma 5 we have that Br∅(k) ≥ V r∅ = V r and so V ri ≥ U ri (k) = Bri (k) − γd1−γ ≥ V r − γd1−γ , thus i ∈ T∞.

Then from the definition of  κapplied to nodes in  T∞, there exists  d0 and csuch that the number  ndof nodes of depth  d ≥ d0 in T∞is bounded by  cκd. As a consequence,

image

In both experiments, we used  γ = 0.9, δ = 0.9and a planning budget K = 100. The disturbances were sampled uniformly in  [−0.1, 0.1]rwhile the measurements are Gaussian with covariance Σs = 0.12Is.

B.1 Obstacle Avoidance

States The system is described by its position  (px, py)and velocity  (vx, vy):

image

Actions It is acted upon by means horizontal and vertical forces  u = (ux, uy) ∈ [−1, 1]2. We discretise the action space into four constant controls, for each direction:

image

Reward The reward encodes the task of navigating to reach a goal state  xgwhile avoiding collisions with obstacles:

image

Dynamics The system dynamics consist in a double integrator, with friction parameters  (θx, θy):

image

Note that Assumption 3 is always verified.

DQN baseline In addition to the state, knowledge of the obstacles is encoded in the observation as an angular grid of laser-like distance measurements, as well as the goal location relative to the system position. As a model for the Q-function, we used a Multi-Layer Perceptron with two hidden layers of size  100. An ε-greedy strategy was used for exploration.

B.2 Autonomous Driving

In the following, we describe the structure of the dynamical system f representing the couplings and interactions between several vehicles.

States In addition to the ego-vehicle, the scene contains V other vehicles. Any vehicle  i ∈ [0, V ] isrepresented by its position  (xi, yi), its forward velocity  viits heading  ψi. The resulting joint state is the traffic description:  x = (xi, yi, vi, ψi)i∈[0,V ] ∈ R4V +4.

Actions The ego-vehicle is following a fixed path, and the tasks consists in adapting its velocity by means of three actions A = {faster, constant velocity, slower}. They are achieved by a longitudinal linear controller that tracks the desired velocity  v0, as described below in the system dynamics.

Reward The reward function R is the following:

image

Dynamics The kinematics of any vehicle  i ∈ [V ]are represented by the Kinematic Bicycle Model:

image

where  (xi, yi)is the vehicle position,  viis its forward velocity and  ψiis its heading, l is the vehicle half-length,  aiis the acceleration command and  βiis the slip angle at the centre of gravity, used as a steering command.

Longitudinal dynamics Longitudinal behaviour is modelled by a linear controller using three features: a desired velocity, a braking term to drive slower than the front vehicle, and a braking term to respect a safe distance to the front vehicle.

Denoting  fithe index of the front vehicle preceding vehicle i, the acceleration command can be presented as follows:

image

where  v0, d0and T respectively denote the speed limit, jam distance and time gap given by traffic rules.

Lateral dynamics The lane  Liwith the lateral position  yLiand heading  ψLiis tracked by a cascade controller of lateral position and heading  βi, which is selected in a way the closed-loop dynamics take the form:

image

We assume that the drivers choose their steering command  βisuch that (24) is always achieved: βi = sin−1( lvi ˙ψi).

LPV formulation The system presented so far is non-linear and must be cast into the LPV form. We approximate the non-linearities induced by the trigonometric operators through equilibrium linearisation around  yi = yLi and ψi = ψLi.

This yields the following longitudinal dynamics:

image

where  θi,2 and θi,3are set to 0 whenever the corresponding features are not active.

image

Here, the dependency in  viis seen as an uncertain parametric dependency, i.e.  θi,6 = vi, with constant bounds assumed for  viusing an overset of the longitudinal interval predictor.

Change of coordinates In both cases, the obtained polytope centre  ANis non-Metzler. We use the similarity transformation of coordinates of Efimov et al. [15]. Precisely, we choose  Θsuch that for any  θ ∈ Θ, A(θ)is always diagonalisable with real eigenvalues, and perform an eigendecomposition to compute its change of basis matrix Z. The transformed system  X′ = Z−1(X − Xc)verifies (2) with  ANMetlzer as required to apply the interval predictor of Proposition 3. Finally, the obtained predictor is transformed back to the original coordinates Z by using the following lemma:

Lemma 6 (Interval arithmetic of Efimov et al. 14). Let  x ∈ Rn be a vector variable,  x ≤ x ≤ x forsome  x, x ∈ Rn.

1. If  A ∈ Rm×n is a constant matrix, then A+x − A−x ≤ Ax ≤ A+x − A−x. (25)

2. If  A ∈ Rm×n is a matrix variable and  A ≤ A ≤ A for some A, A ∈ Rm×n, then

image

DQN baseline In order to avoid discontinuities in the vehicles headings, the state is encoded as x = (xi, yi, vxi , vyi , cos ψi, sin ψi)i∈[0,V ] ∈ R6V +6, with the ego-vehicle always in the first position. As a model for the Q-function, we used the Social Attention architecture from [26], that allows to support an arbitrary number of vehicles as input and enforce an invariance to their order.

Lemma 7 (Confidence polytope). We can enclose the confidence ellipsoid obtained in (8) within a polytope  Cδ:

image

Proof. The ellipsoid in (8) is described by:

image

This describes a  Rdbox containing  θ′ = Pθ, whose  kthvertex is represented by  θ′N,λ +βN(δ)D−1/2hk. We obtain the corresponding box on  θby transforming each vertex of the box with  P −1.

In the definition of  Bra(k) (18)and  U ra(k) (20)it is essential that the minimum over the models is only taken at the end of trajectories, in the same way as for the robust objective (15) in which the worst-case dynamics is only determined after the action sequence has been fully specified. Assume that  U ra(k)is instead naively defined as:

image

This would not recover the robust policy, as we show in Figure 6 with a simple counter-example.

image

Figure 6: From left to right: two simple models and corresponding u-values with optimal sequences in blue; the naive version of the robust values returns sub-optimal paths in red; our robust U-value properly recovers the robust policy in green.


Designed for Accessibility and to further Open Science