b

DiscoverSearch
About
My stuff
Safely Learning to Control the Constrained Linear Quadratic Regulator
2018·arXiv
Abstract
Abstract

We study the constrained linear quadratic regulator with unknown dynamics, addressing the tension between safety and exploration in data-driven control techniques. We present a framework which allows for system identification through persistent excitation, while maintaining safety by guaranteeing the satisfaction of state and input constraints. This framework involves a novel method for synthesizing robust constraint-satisfying feedback controllers, leveraging newly developed tools from system level synthesis. We connect statistical results with cost sub-optimality bounds to give non-asymptotic guarantees on both estimation and controller performance.

Data-driven design has considerable potential in contemporary control systems where precise modeling of the dynamics is intractable, whether due to complex large-scale interactions or nonlinearities resulting from contact forces. However, a large hurdle in the way of practical deployment is the question of maintaining safe operating conditions during the learning process, and furthermore guaranteeing safe execution using learned components.

Motivated by this issue, we study the data-driven design of a controller for the constrained Linear Quadratic Regulator (LQR) problem. In constrained LQR, we design a controller for a potentially unknown linear dynamical system that minimizes a given quadratic cost, subject to the additional requirement that both the state and input stay within a specified safe region. This is a problem that has received much attention within the model predictive control (MPC) community.

For the LQR problem with no constraints, a natural method of exploration for learning the dynamics is to excite the system by injecting white noise. When safety is not an issue, this method is effective and recently Dean et al. [1] provide an end-to-end sample complexity on this “identify-then-control” scheme. However, this method of fails to consider safety, and the injected white noise may lead to constraint violation.

In this paper, we directly address the tension between exploration for learning and safety, which are fundamentally at odds. We do this by synthesizing a controller which simultaneously excites and regulates the system; we propose to learn by additively injecting bounded noise to the control inputs computed by a safe controller. By leveraging the recently developed system level synthesis (SLS) framework for control design [2], we give a computationally tractable algorithm which returns a controller that (a) guarantees the closed loop system remains within the specified constraint set and (b) ensures that enough noise can be injected into the system to obtain a statistical guarantee on learning. To the best of our knowledge, our algorithm is the first to simultaneously achieve both objectives. Furthermore, the controller synthesis is solved by a convex optimization problem whose feasibility is a certificate of safety, and no considerations of robust invariant sets are required.

Our second contribution is to provide a sub-optimality bound on control performance for constrained LQR. Using the same SLS framework, we quantify the excess cost incurred by playing a controller designed on the uncertain dynamics obtained from learning. The sub-optimality depends on both the size of the uncertainty sets and a type of constraint robustness cost gap of the optimal constrained controller for the true system. This allows us to provide the first end-to-end sample complexity guarantee for the control of constrained systems.

Estimation and control of the unconstrained LQR problem has been studied in the non-asymptotic setting [3, 1]. However, the identification schemes rely on pure excitation and system restarts, which is not suitable for the constrained setting. On the other hand, the online learning literature simultaneously considers learning and control, where strategies are based on optimism in the face of uncertainty [4] or Thompson Sampling [5]. These approaches guarantee system estimation only up to optimal closed-loop equivalence, and do not consider safety. Building on a statistical result by Simchowitz et al. [6] which allows for non-asymptotic guarantees on parameter estimation from a single trajectory of a linear system, Dean et al. [7] provide a robust online method that guarantees parameter estimation and stability throughout.

The design of controllers that guarantee robust constraint satisfaction has long been considered in the context of model predictive control [8], including methods that model uncertainty in the dynamics directly [9], or model it as a bounded state disturbance for computational efficiency [10, 11]. Strategies for incorporating estimation of the dynamics include experiment-design inspired costs [12], decoupling learning from constraint satisfaction [13], and set-membership methods rather than parameter estimation [14, 15]. Due to the receding horizon nature of model predictive controllers, this literature relies on set invariance theory for infinite horizon guarantees [16]. Our framework considers the infinite horizon problem directly, and therefore we do not require computation of invariant sets.

Finally, the machine-learning community has begun to consider safety in reinforcement learning, where much work positions itself as being for general dynamical systems in lieu of providing statistical guarantees [17, 18, 19, 20]. Some works assume the existence of an initial safe controller for learning [21], and robust MPC methods have been proposed to modify potentially unsafe learning inputs [22]. Our framework gives an alternative procedure for designing such a controller using coarse system estimates. Most similar to this work is that of Lu et al. [23], who propose a method to allow excitation on top of a safe controller, but consider only finite-time safety and require non-convex optimization to obtain formal guarantees. We fix an underlying linear dynamical system with full state observation,

image

where we have initial condition  x0 ∈ Rn, sequence of inputs  {u0, u1, . . . } ⊆ Rd, and disturbance process  {w0, w1, . . . } ⊆ Rn. Later, we will specify assumptions on the disturbances, including independence over time, bounded second moment, and bounded  ℓ∞ norm. Thedynamics matrices (A⋆, B⋆) are unknown. We denote the errors on estimates of the system ( �A, �B) in terms of the  ℓp → ℓpmatrix operator norm as

image

In this work, we will focus on the errors for  p = 2 and p = ∞.

Furthermore, we assume some prior knowledge on the system dynamics, in the form of initial estimates ( �A0, �B0) and uncertainty measures (ε0A,p, ε0B,p). Without such knowledge, guaranteeing safety in any form would be impossible. We note that the initial estimates may be coarse grained, and the goal of the learning procedure will be to refine this uncertainty prior to optimal control design.

The constrained LQR optimal control problem seeks to minimize the expected quadratic cost, subject to constraints on the state and input. Before further detailing this objective, we develop notation and background on system level synthesis.

1.1 System Level Synthesis

Many approaches to optimal control for systems with constraints involve receding horizon control, where an open loop finite-time trajectory is computed at each timestep; indeed, parameterizing optimal control problems by a state feedback controller generally leads to nonconvex optimization. As a motivating example, consider the static feedback law  uk =Kxkapplied to the linear system (1.1). Then we can write

image

and a similar expression for  uk. Then it is clear that convex constraints on  xk and uk willbe non-convex in K. Instead, we can parametrize the problem in terms of convolution with the closed-loop system response,

image

where we defined  w−1 = x0the fixed initial condition. We note that the expression above is valid for any linear dynamic controller, i.e. any controller which is a linear function of the state and its history. Since the relation in (1.2) is linear, convex constraints on state and input translate to convex constraints on the system response elements. The system level synthesis (SLS) framework shows that for any elements  {Φx(t), Φu(t)}constrained to obey, for all  k ≥ 1,

image

there exists a feedback controller that achieves the desired system responses (1.2). The state-feedback parameterization result in Theorem 1 of Wang et al. [2] formalizes this principle, and therefore any optimal control problem over linear systems can be cast as a constrained optimization problem over system response elements. We remark that similar observations have been used for constrained state feedback control in the finite horizon setting [11]. In what follows, we use boldface letters to denote transfer functions and signals, e.g. Φx(z) = �∞k=1 Φx(k)z−k and x(z) = �∞k=0 xkz−k. Under this notation, the affine constraints can be rewritten as

image

and the corresponding control law u = Kx is given by  K = ΦuΦ−1x .

1.2 Notation

In this paper, we restrict our attention to the function space  RH∞, consisting of discretetime stable matrix-valued transfer functions. We use 1zRH∞to denote the set of transfer functions  G such that zG ∈ RH∞. We further define

image

for transfer functions that satisfy a certain decay rate in the spectral norm of their impulse response elements.

When working with transfer functions and signals, we will denote the coefficient of the term of degree  k as G[k] = G(k) and x[k] = xk. We will also denote G[k : 1] as the block row vector of system response elements of G

image

As is standard, we let  ∥x∥pdenote the  ℓp-norm of a vector x. For a matrix M, we let ∥M∥pdenote its  ℓp → ℓpoperator norm. We will consider the  H2, H∞, and L1norms, which are infinite horizon analogs of the Frobenius, spectral, and  ℓ∞ → ℓ∞operator norms of a matrix, respectively:

image

Finally, for two numbers  a, b, we let a ≲ b (resp. a ≳ b) denote that there exists an absolute constant C > 0 such that  a ≤ Cb (resp. a ≥ Cb).

1.3 Optimal Control Problem

We now describe the constrained optimal control problem that we would want to solve given perfect knowledge of the system dynamics. First, we consider the expected infinite horizon quadratic cost for the system (A⋆, B⋆) in feedback with K:

image

We assume that  {wk}is any distribution that satisfies  E[wk] = 0 and E[wkw⊤k ] = σ2I andis independent across time, i.e.,  wk ⊥ wℓ for ℓ ̸= k. Note that any distribution satisfying these constraints induces the same expected quadratic cost, so it is unnecessary to specify a specific distribution.

Next, we consider polytopic constraint sets of the form

image

We will constrain the state and input trajectories to lie within these sets for any possible disturbance sequence  {wk}satisfying  ∥wk∥∞ ≤ σw for all k ≥0. Putting the cost and the constraints together, the optimal control problem that acts as our baseline is:

image

Above, we let the set K enumerate all inputs that result from linear dynamic stabilizing feedback controllers for (A⋆, B⋆) of the form u = Kx. This is made possible by the system level synthesis framework described in the previous section.

As we show in the following section, the optimal control problem given in (1.3) is a convex, but infinite-dimensional problem. It is an idealized baseline to compare our actual solutions to; our sub-optimality guarantees will be with respect to the optimal cost achieved by this problem. It is a relevant baseline, since it optimizes for average case performance but ensures safety for the worst-case behavior, consistent with MPC literature [10, 24]. We remark that an alternative to (1.3) is to replace the worst case constraint behavior with probabilistic chance constraints [25]. We do not work with chance constraints because they are generally difficult to directly enforce on an infinite horizon; arguments around recursive feasibility using robust invariant sets are common in the MPC literature to deal with this issue.

We begin by formulating a method for robustly operating a system while maintaining safety. First we describe a system level synthesis approach to the constrained LQR problem, and then we propose a modification which makes it robust to uncertainties in system dynamics. Finally, we outline a reduction to tractable synthesis via a finite-dimensional optimization problem over a finite impulse response (FIR) approximation.

2.1 A System Level Approach

Using the SLS formulation, we define an optimization problem that solves the optimal control problem presented in the section above.

Proposition 2.1. The following convex optimization problem solves the optimal control problem (1.3).

image

where the elements of the constraint functions are defined as

image

with j indexing the rows of  Fx and Fuand entries of  Gx and Gu.

Examining the form of the optimization problem above, we see that the expected quadratic cost transforms into a system  H2norm, while the worst-case polytopic constraints on state and input become closed-form polytopic constraints on the system response. For a controller K = Φu(Φx)−1, note that the LQR cost is equivalent to

image

Lastly, we remark here that the feasibility of the convex synthesis problem in (2.1) for an initial condition  x0implies that  x0is a member of a robust control invariant set.

Proof. By the state-feedback parameterization result in Theorem 1 of [2], the SLS parametrization encompasses all internally stabilizing state-feedback controllers acting on the true system (A⋆, B⋆). Thus, it is necessary only to show that the optimization problem in (2.1) is consistent with that of (1.3) under the system level parametrization. The equivalence between the

LQR cost and the  H2system norm is standard and we refer to the Appendix of [1], which presents this reformulation in terms of system responses for Gaussian process noise. The argument applies unchanged to any distribution satisfying our assumptions.

Therefore, it remains to consider the inequality constraints. Because the constraints must be satisfied robustly, it is equivalent to consider

image

Then considering elements in the second term with each j indexing the rows of  Fx,

image

Thus the inequality constraint on the function  Gx(Φx; k) is an equivalent condition. A similar computation holds for the input constraint.

We note the appearance of the row-wise  ℓ1norm over the multiplication of the the system response elements with the constraint matrix. The expression can be understood as an analog to the  ℓ∞ → ℓ∞operator norm, mediated by the shape of the polytope. In the next section when we introduce robustness to system dynamics, the  L1system norm will come into play for similar reasons.

2.2 Robust Control

We are further motivated to reformulate the optimal control problem in terms of system responses so that we can transparently consider uncertainties in the dynamics. We are interested in controller synthesis under model errors, where only nominal estimates of the system are known. Suppose that the system responses �Φx, �Φuare designed on the estimated system. Then

image

The model mismatch impacts the closed-loop behavior of the true system in feedback with the controller �K = �Φu(�Φx)−1 in a transparent way. Supposing that  ∥ �∆∥H∞ <1, it achieves the system response and bounded cost:

image

This follows from the robust stability result in Theorem 2 of Matni et al. [26] and the submultiplicativity of the  H2 and H∞norms. Note that  ∥ �∆∥ <1 is a sufficient condition for the existence of the inverse (I + �∆)−1 for any induced norm  ∥.∥by the small gain theorem.

This expression provides an upper bound on the cost achieved for a controller designed system estimates which depends only on the system estimates and the size of the error system �∆. Motivated by this bound, consider the following robust optimization problem:

image

where 0  ≤ γ, τ ≤1 are fixed parameters and for each j and k,

image

and similarly for  Gτu(Φu; k)j.

In this problem,  γbounds the increase in the  H2cost due to the dynamics uncertainty, while  τdetermines the increase in the state and input values with respect to the constraints. In fact, both values can be related to bounding an enlarge noise process ˜w = (I + �∆)−1wdriving the system.

Theorem 2.2. Any controller designed from a feasible solution to the robust control problem (2.3) for any 0  ≤ γ, τ < 1will stabilize the true system. Furthermore, the state and input constraints will be satisfied.

Proof. First, note that by the system norm constraints,

image

Then by (2.2), the true system trajectory will be given by the stable system responses

image

Therefore, the state constraints are satisfied as long as

image

The first term reduces to  Gx(Φx; k) as in the non-robust case. Because information about �∆is not known, we resort to a sufficient condition to bound the second term, letting ˜w =

�∆(I + �∆)−1w,

image

Consequently, a sufficient condition for satisfying state constraints is to have for j indexing rows of  Fxand elements of  bx,

image

Therefore, the constraints on  Φximply that the state constraints are satisfied. Similar logic shows that the constraints on  Φuimply that the input constraints are satisfied.

We remark on differences between the presented constraint-tightening and approaches common in the MPC literature. In both cases, the uncertainty in the dynamics induces an enlarged disturbance process. Common in the MPC literature, the additive disturbance approximation assigns ˜wk = wk +∆Axk +∆Buk, or equivalently ˜w = w+∆Ax+∆Bu. Then∥ ˜w∥∞can be bounded using the state and input constraint sets. Because this bound degrades as the constraint sets increase in size, this strategy can lead to counterintuitive behavior. On the other hand, our approach writes ˜w = w + �∆(I + �∆)−1w. While bounding  ∥ ˜w∥∞in this setting does not depend on the constraint set, it is affected by the size of the initial condition. Further comparison between the two approaches is presented in Appendix B.

2.3 Finite Dimensional Reduction

To make controller synthesis tractable, we can solve a finite approximation to optimization problem (2.3) wherein we only optimize over the first L impulse response elements of  Φxand  Φu, treating them as finite impulse response (FIR) filters. We show that in this setting, the optimization variables and constraints admit finite-dimensional representations. We first reformulate the constraints. Starting with the affine constraint, we have for  k = 1, ..., L − 1

image

where we will also optimize over V , a term which captures the tail of the system responses that we ignore in the synthesis.

image

SDP over Φx, Φu, γas in Theorem 5.8 of Dumitrescu [27], described explicitly for this setting in Appendix G.3 of Dean et al. [7]. For the  L1norm, the constraint becomes an  ℓ∞ → ℓ∞operator norm bound,

image

where the tail variable V enters transparently. For  k = 1, . . . , L−1, the inequality constraints on  Gτx(Φx; k) and Gτu(Φu; k) remain. For any  k ≥ L, the expression reduces to, for each j,

image

Therefore, the synthesis problem becomes, for any fixed 0  ≤ γ, τ < 1,

image

This is a finite dimensional SDP. The controller given by  K = ΦuΦ−1xcan be written in an equivalent state-space realization (AK, BK, CK, DK) via Theorem 2 of Anderson et al. [28].

How much is control performance degraded by uncertainties about the dynamics? In this section, we derive a sub-optimality bound which answers this question for the constrained LQR problem. First, consider the addition of an outer minimization over  γ and τ:1

image

Denote the solution to the true optimal control problem as (Φ⋆x, Φ⋆u), then define  K⋆ =Φ⋆uΦ⋆x−1 and J⋆ = J(A⋆, B⋆, K⋆).Additionally, define constants related to the optimal system norm and the dynamics uncertainties:

image

Before stating the suboptimality result, we define a robust version of the optimal constrained controller. Define the doubly robust constraint sets

image

and similarly for ¯Gζu(Φu; k)j. Note that these constraint sets resemble the robust constraint sets with  τ = 2ζ∞and an enlarged noise process with  σw ← σw1−ζ∞. Then the optimal robustly constrained controller is defined as

image

with  Kc = Φu(Φcx)−1. Note that this optimization problem designs a controller which has similar closed-loop behavior to the true optimal controller (i.e. system norms of similar size), but satisfies more stringent state and input constraints. The relative robustness cost gap is defined as

image

Now we are ready to state a data independent bound on the sub-optimality of robust controllers synthesized on estimated dynamics.

Theorem 3.1. Suppose that the robust optimal constrained controller problem is feasible. As long as  ζ2 ≤ 14√2 and ζ∞ ≤ 12, we have that the cost achieved by  �K = �Φu �Φ−1xsynthesized from the minimizers of (3.1) satisfies

image

This result is stated in terms of quantities related to the unknown true system, with the goal of highlighting properties of systems that make them harder or easier to robustly control. We see that the bound grows with the  H∞norm of the optimal controller and closed-loop response. It also grows with the robustness cost gap  Mζ.

Remark that  Mζis be difficult to characterize analytically; in general, it requires checking the boundaries of the robust polytopic constraints. Once errors  εA and εBare small enough that the optimal system response (Φ⋆x, Φ⋆u) satisfies the doubly robust constraints, we will have  Mζ= 0. If the optimal controller saturates its constraints, then  Mζwill be nonzero for any nonzero estimation errors. In Section 5 (Figure 2), we numerically characterize this optimality gap for a double integrator example.

Proof. Using (2.2) along with the norm bounds (2.4) and the constraints in optimization problem (3.1),

image

Next, we will connect the optimal system response to the estimated system by constructing a feasible solution to the robust optimization problem. We will use the following lemma

Lemma 3.2. Under the conditions of Theorem 3.1, we have that the following is a feasible solution to (3.1)

image

The proof of Lemma 3.2 follows by checking that the proposed solution satisfies all the constraints and is presented in Appendix A. Applying Lemma 3.2,

image

This is true because ( �K, �γ) is the optimal solution to (3.1), so objective function with feasible ( ˜Φu ˜Φ−1x = Kc, ˜γ) is an upper bound. Then we have

image

The second inequality follows from an application of (2.2) with the roles of the nominal and true systems switched. The final follows from bounding  ∥∆∥2 by √2ζand noticing thatx1−x ≤ 2x for 0 ≤ x ≤ 12, where we set  x = 2√2ζ2. Finally, we note that

image

Here, we briefly remark that a similar sub-optimality bound can be derived for the finite problem in (2.8). In short, controllers synthesized by minimizing over  γ and τwill satisfy a sub-optimality bound of the form in Theorem 3.1 with an additional factor due to the FIR truncation. The formal statement and proof of this result are deferred to Appendix A, but we highlight here that the cost penalty incurred due to FIR approximation decays exponentially in the horizon L over which the approximation is taken.

Finally, we connect the previous results on robust control with system estimation. We adopt control actions that both keep the system safe and provide excitation,

image

where each  η = (η0, η1, . . .) is stochastic and  ℓ∞-bounded, i.e. ∥ηk∥∞ ≤ ση. Given atrajectory sequence  {(xk, uk)}Tk=0, we propose to learn the dynamics (A⋆, B⋆) via least-squares regression on a trajectory of length T:

image

In what follows, we will prove a statistical rate on the least-squares estimate ( �A, �B) in termsof the system response and the trajectory length.

The bulk of the proof for the statistical rate comes from a general theorem regarding linear-response time series data from Simchowitz et al. [6]. Recently, this proof was adopted by Dean et al. [7] to show a rate of estimation in the setting given by (4.1) when both  ηand the disturbance w are Gaussian distributed. We modify the reduction given by Dean et al. to the case when the excitation and disturbance are no longer Gaussian, but instead zero-mean and bounded. We assume that  wk and ηkare both zero-mean sequences with independent coordinates and finite fourth moments. In particular, we assume  Ewk[wk(i)2] ≤σ2w, Ewk[wk(i)4] ≲ σ4w, Eηk[ηk(i)2] ≤ σ2η, Eηk[ηk(i)4] ≲ σ4η. These assumptions are quickly verified for common distributions such as uniform on a compact interval or over a discrete set of points. The main estimation result is the following.

Theorem 4.1. Fix a failure probability  δ ∈ (0, 1).Suppose the stochastic disturbance {wk}and the input disturbance  {ηk}satisfy the assumptions above. Assume for simplicity that  ση ≤ σw, and that the stabilizing controller K achieves a SLS response  Φx ∈

zRH∞(Cx, ρ), Φu ∈ 1zRH∞(Cu, ρ). Let C2K := nC2x + dC2u. Then as long as the trajec- tory length T satisfies the condition:

image

we have the following bound on the least-squares estimation errors that holds with probability at least 1  − δ,

max{∥A∥2, ∥∆B∥2} ≲ σwCuση

The proof of this result is presented in Appendix C. We remark on the interpretation of statistical learning bounds. A priori guarantees, like the one presented here, depend on quantities related to the underlying true system. As such, they are not directly useful when the system is unknown,2 but rather they indicate qualities of systems that make them easier or harder to estimate. For example, we see that this bound on estimation errors decreases with the size of the excitation, and increases with the size of the process noise. The bound also increases with  Cu, a constant which bounds the gain from disturbance to control inputs.

Applying our robustness results to the learning controller (4.1), it is possible to guarantee safety during the estimation process. We can define an expanded noise process ˜wk = B⋆ηk +wkto synthesize  K0using the robust control synthesis problem (2.3). As long as it is feasible for initial system estimates (A0, B0), initial dynamics uncertainties (ε0A, ε0B), σw replacedwith3 ση∥B⋆∥∞+σw, and bu,jreplaced with  bu,j −ση∥Fu,j∥1, then the control law  u = K0x+ηstabilizes the true system, satisfies state and input constraints, and allows for learning at the rate given in Theorem 4.1.

Finally, we connect the sub-optimality result to the statistical learning bound for an end-to-end sample complexity bound on the constrained LQR problem. Define  MT to bethe value of  Mζwhen the definition determined by (3.2) has values set as

image

Corollary 4.2. Assume initial feasibility of the learning problem. Then for a trajectory of length

image

the cost achieved by �K = �Φu(�Φx)−1 synthesized from (2.3) on the least-squares estimates �A, �Bsatisfies with probability at least 1  − δ,

image

Proof. (Sketch) This result follows by combining the statistical guarantee in Theorem 4.1 with the sub-optimality bound in Theorem 3.1. Note that we use the na¨ıve bound  εA,∞ ≤√nεA,2and similarly  εB,∞ ≤√dεB,2; this results in an extra factor of (n + d) appearing in (4.4) and in the definition of  MT.

Our final result depends both on the true system and the initial system estimates by way of the learning controller, which affects  T0and constants  Cx, Cu, and ρ. The system constraints enter through their effect on  MT.

image

Figure 1: Safe learning trajectories synthesized with coarse initial estimates (a), then robust execution with reduced model errors (b).

We demonstrate the utility of this framework on a double integrator example. In this case, the true dynamics are given by

image

with the constraints as states bounded between  −8 and 8, and inputs bounded in between −4 and 4. We have  σw = 0.1. Our initial estimate comes from a randomly generated initial perturbation of the true system with  εA,∞ = εB,∞ = 0.1. Safe controllers are generated with finite truncation length L = 15, and for larger initial conditions, the system is warm-started with a finite-time robust controller with horizon 20 to reduce the initial condition.

Figure 1 displays safe trajectories and input sequences for several example initial conditions. In 1a, the plotted trajectories are used for learning: the controller both regulates and excites the system (ση = 0.5), and is robust to initial uncertainties. Figure 1b demonstrates an ability to operate closer to the constraints when there is less uncertainty: in this case, there is no added excitation (ση= 0) and the system estimates are better specified (ε∞ = 0.001), so larger initial conditions are feasible.

Figure 2a displays the decreasing estimation errors over time, demonstrating learning. Shaded areas represent quartiles over 400 trials. Figure 2b displays the trade-off between safety and exploration by showing the largest value of  σηfor which the robust synthesis is feasible, given a size for the state constraint set  rx. Here, we leave  x0= 0, and examine a variety of errors in the dynamics estimates. As the uncertainties in the dynamics decrease, higher levels of both safety and exploration are achievable. Finally, Figure 2c shows the value of the relative robust sub-optimality gap  Mζfor the given example. We see a sharp

image

Figure 2: Over time, estimation errors decrease (a). As safety requirements increase, the maximum feasible excitation decreases (b). The robust cost suboptimality gap  Mζ displaysan abrupt transition from feasibility to near-optimality with error size (c).

transition from infeasibility for  ε ≥ 0.05 to near-optimality for  ε ≤ 0.03. This indicates that the gap may be most significant as a feasibility condition for our sub-optimality guarantees to hold.

In this paper, we propose a method for learning unknown linear systems while ensuring that they satisfy state and input constraints. By synthesizing a controller that both excites and regulates the system, we address the trade-off between safety and exploration directly. We further derive an end-to-end finite sample bound on the performance of constrained LQR controllers synthesized from collected data.

There are several directions for possible extensions of this work. To mitigate the conservativeness of the robust controller, tighter bounds on the uncertainty in the system response �∆could be derived for structured settings, where more than just the norm of the error is known. To connect this work to experiment design literature, the objective in the synthesis problem (2.3) could be replaced with an exploration inspired cost function for the learning stage.

Alternatively, the constrained LQR problem could be cast in the setting of online learning, where one seeks to minimize cost at all times, including during learning. This would require an analysis of recursive feasibility, to understand the transition that occurs when controllers are updated based on refined system estimates. It would also likely require a finer analysis on the performance loss characterized by  Mζ. Finally, we remark that the exploration vs. safety trade-off is most compelling for nonlinear systems, and a linear-time-varying extension to this work may be applicable to those problems.

We thank Francesco Borrelli and the members of the MPC Lab at UC Berkeley for their helpful comments and feedback. SD is supported by an NSF Graduate Research Fellowship under Grant No. DGE 1752814. ST is supported by a Google PhD fellowship. BR is generously supported in part by ONR awards N00014-17-1-2191, N00014-17-1-2401, and N00014-18-1-2833, the DARPA Assured Autonomy (FA8750-18-C-0101) and Lagrange (W911NF-16-1-0552) programs, and an Amazon AWS AI Research Award.

[1] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “On the sample complexity of the linear quadratic regulator,” arXiv preprint arXiv:1710.01688, 2017.

[2] Y.-S. Wang, N. Matni, and J. C. Doyle, “A system level approach to controller synthe- sis,” arXiv preprint arXiv:1610.04815, 2016.

[3] C.-N. Fiechter, “PAC adaptive control of linear systems,” in Conference on Learning Theory, 1997.

[4] Y. Abbasi-Yadkori and C. Szepesv´ari, “Regret bounds for the adaptive control of linear quadratic systems,” in Conference on Learning Theory, 2011.

[5] M. Abeille and A. Lazaric, “Thompson sampling for linear-quadratic control problems,” in AISTATS, 2017.

[6] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht, “Learning without mixing: Towards a sharp analysis of linear system identification,” in Conference on Learning Theory, 2018.

[7] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “Regret bounds for robust adaptive control of the linear quadratic regulator,” in Neural Information Processing Systems, 2018.

[8] A. Bemporad and M. Morari, “Robust model predictive control: A survey,” in Robustness in identification and control. Springer, 1999, pp. 207–226.

[9] M. V. Kothare, V. Balakrishnan, and M. Morari, “Robust constrained model predictive control using linear matrix inequalities,” Automatica, vol. 32, no. 10, pp. 1361–1379, 1996.

[10] D. Q. Mayne, M. M. Seron, and S. Rakovi´c, “Robust model predictive control of con- strained linear systems with bounded disturbances,” Automatica, vol. 41, no. 2, pp. 219–224, 2005.

[11] P. J. Goulart, E. C. Kerrigan, and J. M. Maciejowski, “Optimization over state feedback policies for robust control with constraints,” Automatica, vol. 42, no. 4, pp. 523–533, 2006.

[12] C. A. Larsson, C. R. Rojas, and H. Hjalmarsson, “MPC oriented experiment design,” IFAC Proceedings Volumes, vol. 44, no. 1, pp. 9966–9971, 2011.

[13] A. Aswani, H. Gonzalez, S. S. Sastry, and C. Tomlin, “Provably safe and robust learning- based model predictive control,” Automatica, vol. 49, no. 5, pp. 1216–1226, 2013.

[14] M. Tanaskovic, L. Fagiano, R. Smith, and M. Morari, “Adaptive receding horizon control for constrained MIMO systems,” Automatica, vol. 50, no. 12, pp. 3019–3029, 2014.

[15] M. Lorenzen, F. Allg¨ower, and M. Cannon, “Adaptive model predictive control with robust constraint satisfaction,” IFAC-PapersOnLine, vol. 50, no. 1, pp. 3313–3318, 2017.

[16] F. Blanchini, “Set invariance in control,” Automatica, vol. 35, no. 11, pp. 1747–1767, 1999.

[17] F. Berkenkamp and A. P. Schoellig, “Safe and robust learning control with gaussian processes,” in European Control Conference (ECC), 2015.

[18] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforce- ment learning with stability guarantees,” in Neural Information Processing Systems, 2017.

[19] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa, “Safe exploration in continuous action spaces,” arXiv preprint arXiv:1801.08757, 2018.

[20] Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh, “A lyapunov-based approach to safe reinforcement learning,” arXiv preprint arXiv:1805.07708, 2018.

[21] T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause, “Learning-based model predictive control for safe exploration and reinforcement learning,” arXiv preprint arXiv:1803.08287, 2018.

[22] K. P. Wabersich and M. N. Zeilinger, “Linear model predictive safety certification for learning-based control,” arXiv preprint arXiv:1803.08552, 2018.

[23] T. Lu, M. Zinkevich, C. Boutilier, B. Roy, and D. Schuurmans, “Safe exploration for identifying linear systems via robust optimization,” arXiv preprint arXiv:1711.11165, 2017.

[24] F. Oldewurtel, C. N. Jones, and M. Morari, “A tractable approximation of chance constrained stochastic MPC based on affine disturbance feedback,” in 47th IEEE Annual Conference on Decision and Control, 2008.

[25] M. Farina, L. Giulioni, and R. Scattolini, “Stochastic linear model predictive control with chance constraints–a review,” Journal of Process Control, vol. 44, pp. 53–67, 2016.

[26] N. Matni, Y.-S. Wang, and J. Anderson, “Scalable system level synthesis for virtually localizable systems,” in IEEE 56th Annual Conference on Decision and Control, 2017.

[27] B. Dumitrescu, Positive Trigonometric Polynomials and Signal Processing Applications. Springer, 2007, vol. 103.

[28] J. Anderson and N. Matni, “Structured state space realizations for sls distributed con- trollers,” in Allerton, 2017.

[29] B. Efron, “Bootstrap methods: another look at the jackknife,” in Breakthroughs in statistics. Springer, 1992, pp. 569–593.

[30] R. Ibragimov and S. Sharakhmetov, “The exact constant in the rosenthal inequality for random variables with mean zero,” Theory of Probability & Its Applications, vol. 46, no. 1, pp. 127–132, 2002.

This section contains proofs necessary for the sub-optimality results. It further develops a sub-optimality bound for the case of FIR truncation. We begin by considering the frequency response elements of composed transfer functions. First, define, for  D = �∞k=1 D(k)z−k,

image

Then we have the following lemma.

Lemma A.1. For M = �∞k=1 M(k)z−k and D = �∞k=1 D(k)z−kthe frequency response elements of ˜M = MD are given by

image

where we use the notation D(1 : k) for the vertical concatenation of D(1) through D(k). Further, we have

image

Proof. For the first statement, notice that we can write

image

Then for the second, we have

image

A.1 Proof of Lemma 3.2

Proof of Lemma 3.2. The proposed feasible solution is

image

where  ∆ = −�∆A ∆B� �ΦcxΦcu

Next, notice that

image

where the inequality follows due to the fact that√2∥∆∥H∞ ≤1. Then consider the  L1 normconstraint,

image

where the decomposition of the inverse is valid by assumption on the size of  ζ∞.Then it remains to show that the robust state and input constraints are satisfied. For compactness we will write  c0 = max(1, 1σw ∥x0∥∞). Recall that they are given by

image

Note that ˜Φx = Φcx + Φcx∆(I − ∆)−1. Define the frequency response elements of  ∆(I −∆)−1 by D(k). Then, using Lemma A.1 we have

image

Lemma A.2. Assume that  ζ∞ < 1 and let Drepresent the frequency response of  ∆(1−∆)−1.Then for any vector v,

image

Proof.

∥v⊤Toepk(D)∥1 = ∥Toepk(D)⊤v∥1 ≤ ∥Toepk(D)⊤∥1∥v∥1 = ∥Toepk(D)∥∞∥v∥1≤ ∥∆(1 − ∆)−1∥L1∥v∥1 ≤ ζ∞1 − ζ∞∥v∥1 .

Above, we make use of the fact that the  ℓ1 and ℓ∞norms are duals, and therefore  ∥A∥∞ =∥A⊤∥1. The second inequality holds because Toepk(D) is a truncation of the semi-infinite Toeplitz matrix associated with the operator  ∆(1 − ∆)−1. The final decomposition is valid because  ∥∆∥L1 ≤ ζ∞ < 1.

Now we are ready to consider the state constraint indexed by j and k,

image

Considering each term individually,

image

Then defining  E1to contain an identity in the first block and zeros elsewhere,

F ⊤x,jΦcx[k + 1 : 1]D(1 : k + 1)x0 ≤ ∥F ⊤x,jΦcx[k+ 1 : 1]Toepk+1(D)E1∥1∥x0∥∞≤ ∥F ⊤x,jΦcx[k + 1 : 1]∥1 ζ∞1 − ζ∞∥x0∥∞ ,

The first inequality is H¨older’s inequality and the second by Lemma A.2. Next, the second term:

image

where the inequality holds by Lemma A.2. Finally, the last term,

image

where we apply Lemma A.2 and plug in the definition of ˜τ:

image

The resulting sum is

image

Then considering constants around the final term,

image

Thus, we see that  G˜τx( ˜Φx; k)j ≤ ¯Gζx(Φcx; k)j ≤ bj due to the constraints on  Φcx. A similar computation with the input constraints shows the same result. Therefore, the proposed solution is feasible.

A.2 Finite Dimensional Sub-optimality

We can recover sub-optimality guarantees in the case that we optimize over only a finite set of system response variables. Define the truncated responses  ΦLx = �Lk=1 Φx(t)z−k andsimilarly for  ΦLu, and notice that

image

This reformulation allows for the optimization over the FIR filters  ΦLx and ΦLu, plus anadditional variable that represents the tail of the true response. Applying this observation to the robust synthesis problem (2.3) yields the following finite dimensional problem:

image

We now show that so long as the horizon L is suitably large, all of the properties that hold for the solution of infinite horizon problem (2.3) carry over to the solution of the finite dimensional approximation (A.2).

Proposition A.3. Any controller designed from a feasible solution to the finite robust control problem (A.2) for any 0  ≤ γ, τ < 1stabilizes the true system and ensures that state and input constraints will be satisfied.

Proof. Consider the affine constraint on  ΦLx and ΦLu. We can equivalently write, using the observation in (A.1),

image

by (2.2), the system is stabilized and the true system trajectory is given by

image

The rest of the proof follows exactly as in the proof of Theorem 2.2

To bound the sub-optimality of this robust synthesis, we need to connect the optimal controller to the robust problem. For this feasibility result, it is necessary to understand the system response induced by putting the optimal robustly constrained controller  Kc =Φcu(Φcx)−1, as defined in (3.2), in feedback with the estimated dynamics (  �A, �B). We denote this system responses as  Φcx(I − ∆)−1. As before,  ∆ = −∆AΦcx − ∆BΦcu.

We thus define constants related to the decay rate of the system responses. Let  C⋆ and ρ⋆be any constants such that  ∥Φcx(k)∥p ≤ C⋆ρk⋆ for p ∈ {2, ∞}. Note that these constants must exist since the optimal closed loop will be exponentially stable. Let  C∆ and ρ∆ similarlybe any constants that bound the norm of the frequency response elements of (I − ∆)−1.We note that these constants can be bounded by multiples of  C⋆ and ρ⋆, as worked out in Appendix G of Dean et al. [7].

Lemma A.4. Suppose that for  p ∈ {2, ∞}, ∥Φcx(k)∥p ≤ C⋆ρk⋆ and ∥(I − ∆)−1[k]∥p ≤ C∆ρk∆.Define  C = C⋆C∆log( max(ρ⋆,ρ∆)+12 max(ρ⋆,ρ∆) ) and ρ = max(ρ⋆,ρ∆)+12 . Then for p ∈ {2, ∞} and all t ≥ 0,

image

We defer the proof of this lemma to the end of the section. Next, we redefine the optimal robustly constrained responses (Φcx, Φcu) in (3.2) to satisfy the modified doubly robust constraint sets

image

Here, we have changed a factor of 2 to a factor of 3 in the last term to account for the finite approximation. We will define  Mζ as in (3.3) using the new (Φcx, Φcu). We are now ready to state the sub-optimality result.

Theorem A.5. Suppose that the truncation length is such that

image

As long as  ζ∞ < 13 and ζ2 < 13√2, we have that the cost achieved by  �K(L) = �Φu �Φ−1xsynthesized from the minimizers of minγ,τ �JL(γ, τ) satisfies

image

To prove this result, we first construct a feasible solution to the truncated synthesis problem (A.2).

Lemma A.6. Under the conditions of Theorem A.5, the following is a feasible solution to (A.2):

image

Proof of Lemma A.6. First, we consider the affine constraint,

image

Applying the observation in (A.1),

image

Next, we consider the  H∞norm constraint

image

The first inequality follows from the bound on  ∥˜V ∥2and because the proposed ˜Φx is atruncation of  Φcx(I−∆)−1and similarly for ˜Φu. The second follows by the sub-multiplicative property, because  ∥∆∥H∞ ≤ ζ2 <1, and due to the assumption on L. Then similarly considering the  L1norm constraint,

image

Then it remains only to show that the robust state and input constraints are satisfied. Notice that ˜Φxis a truncation of  Φcx(1 − ∆)−1. For k > L, the frequency response elements are zero, and therefore the constraints are trivially satisfied. For for 0  ≤ k ≤ L, we have, by Lemma A.1, using block-Toeplitz notation, that

image

where we take D(k) to represent the frequency response elements of  ∆(1 − ∆)−1. Then theproof of constraint satisfaction follows as in Section A.1, up to the computations with the definition of ˜τ

image

This affects the constant factors around the final term of the sum

image

Thus, we see that  G˜τx( ˜Φx; k)j ≤ ¯Gζx(Φcx; k)j ≤ bj due to the constraints on  Φcx. A similar computation with the input constraints shows the same result. Therefore, the proposed solution is feasible.

We are now ready to prove the main sub-optimality result.

image

Proof of Theorem A.5. Recall that we denote the minimizers of minγ,τ (A.2) as (�Φx, �Φu, �V , �γ, �τ).Let

image

Then we have that that

image

By the constraints of the optimization problem in (A.2). We now apply observation A.1 and (2.2) with �∆Lto characterize the response achieved by the FIR approximate controller �K(L) on the true system (A⋆, B⋆), giving the following:

image

The inequality follows because  ∥ �∆∥H∞ ≤ �γ < 1.

Denote by ( ˜Φx, ˜Φu, ˜V , ˜γ, ˜τ) the feasible solution constructed in Lemma A.6. Then,

image

where the first inequality follows from the optimality of (�Φx, �Φu, �V , �γ), the equality and second inequality from the fact that ( ˜Φx, ˜Φu) are truncations of the response of  Kc on( �A, �B) to the first L time steps. The second to last inequality follows from an application of (2.2) with the roles of the nominal and true systems switched. The final inequality follows by definition of the robustness cost gap and bounding  ∥∆∥2 by√2ζ2. We then simplify

image

which follows for  ζ2 < 16√2. Then finally,

image

Proof of Lemma A.4. Using Lemma A.1 with D(t) representing the frequency response elements of (I − ∆)−1, we have that

image

Then noting that for any  x ∈ (0, 1), α > 0, and k ≥ 1,

image

which is true because log� kα�≤ kα, we have for  p = 2, ∞

image

The final inequality follows from choosing  α = log�max(ρ⋆,ρ∆)+12 max(ρ⋆,ρ∆)�−1.

In this section we discuss an alternative to the proposed robust control problem in (2.3) in which the constraints are tightened using a traditional additive disturbance approximation. We will contrast this additive reduction with the system-response reduction presented in Section 2.2. Define the additive robust synthesis problem as

image

where 0  ≤ γ <1 is a fixed parameter and

image

where we define  rx and ru as the ℓ∞radius of the constraint set, i.e.  rx = maxx∈X ∥x∥∞ andsimilarly for  ru.

Theorem B.1. Any controller designed from a feasible solution to the robust control problem (B.1) for any 0  ≤ γ < 1will stabilize the true system. Furthermore, the state and input constraints will be satisfied.

Proof. First, note that by the system norm constraint,

image

We will make an additive disturbance approximation by defining ˜wk = wk + ∆Axk + ∆Buk.Though it is conservative, we can bound

image

Now note that if  xk ∈ X, we have

image

Thus as long as state and input constraints are satisfied at time k, we have that  ∥ ˜wk∥∞ ≤σw + εA,∞rx + εB,∞ru := ˜σw.

Then, notice that the constraint tightening is equivalent to replacing  σw in (2.1) withσw + εA,∞rx + εB,∞ru. Then by Proposition 2.1, the inequality constraints  Gεx and Gεu implythat the system satisfies the constraints for dynamics given by ( �A, �B) and any process noise bounded by ˜σw. Thus, we can recursively argue that state and input constraints are satisfied for all k as long as they are satisfied for the initial state  x0.

The next result establishes conditions under which this additive disturbance robust control is better than the system-response robust control.

Proposition B.2. Let �τbe an optimizer of (3.1). The additive disturbance reduction is less conservative than the system-response based reduction if and only if

max(σw, ∥x0∥∞) �τ1 − �τ > εA,∞rx + εB,∞ru .From this statement, we can see that the system-response reduction suffers for large initial conditions, while the additive reduction suffers with the size of the constraint sets. For this reason, that additive disturbance reduction would not properly characterize the tradeoff between safety and exploration as plotted in Figure 2b. However, we note that the condition in Proposition B.2 could be used as a switch in a control synthesis problem to decide which reduction to use.

image

This expression can be viewed as expanding the disturbance by the uncertainty as

image

In the robust constraints (2.3), this additional uncertainty is bounded as

image

On the other hand, the additive disturbance approximation notices that  xk+1 = �Axk +�Buk + wk + ∆Axk + ∆Buk ,and thus assigns ˜wk = wk + ∆Axk + ∆Buk, i.e.

image

Then in the robust constraints (B.1), this additional uncertainty is bounded as

image

Then the additive bound is worse than the system-response bound as stated.

Finally, we note that the sub-optimality result could be extended to controllers synthesized from minimizing (B.1) over  γ, and that the main difference would arise from the definition of the robust cost sub-optimality gap  Mζ.

First, we prove a simple small-ball result for random variables with finite fourth moments.

Proposition C.1. Let X ∈ Rbe a zero-mean random variable with finite fourth moment, which satisfies the conditions

image

Let  a ∈ Rbe a fixed scalar and  θ ∈ (0, 1). We have that

image

Proof. First, we note that we can assume  a ≥0 without loss of generality, since we can perform a change of variables  X ← −X. We have that  E[(a + X)4] = a4 + 6a2E[X2] +4aE[X3] + E[X4]. Therefore, by the Paley-Zygmund inequality and Young’s inequality, we have that

image

Now define the function  f(a) for a ≥ 0 as

image

Clearly  f(0) = µ2/(3β) and f(∞) = 1. Since by Jensen’s inequality we know that  µ2 ≤ β,this means  f(0) < f(∞). On the other hand,

image

Assume that  µ ̸= 0 (otherwise the claim is trivially true). The only critical points of the function f on non-negative reals is at  a = 0 and a =�

then a = 0 is the only critical point of  f, so f(a) ≥ f(0) = µ2/(3β) ≥ 1/(3C). On the other hand, if 4µ2 ≥ 3βholds, Some algebra yields that

image

The inequality above holds since 3β ∈ [3µ2, 4µ2]. Hence,

image

This gives us the main new technical ingredient needed to prove Theorem 4.1.

Proposition C.2. Let µ ∈ Rd and M ∈ Rd×d be a full rank matrix. Let  w ∈ Rd be arandom vector such that its fourth moment is finite and each coordinate  wiis independent and zero-mean, i.e.  E[wi] = 0. Then, for any fixed  v ∈ Rd,

image

where Σ =  Ew[ww⊤], Cis an absolute constant, and  Cw = max1≤i≤d E[w4i ]/(E[w2i ])2. Inparticular, we can take C = 192.

Proof. Fix a q ∈ Rd. By Rosenthal’s inequality, we have that  E[|⟨q, w⟩|4] ≤ C·CwE[|⟨q, w⟩|2]2for an absolute constant–we can take C = 4 (see e.g. [30]). We now apply Proposition C.1 with  a = ⟨v, µ⟩, X = ⟨M ⊤v, w⟩, C = 4Cw, and θ = 1/2. The claim now follows.

With Proposition C.2 in place, the proof of Theorem 4.1 is identical to that of Proposition C.1 in Dean et al. [7], except in the final step of establishing the block martingale condition: instead of using the small-ball calculation for Gaussian distributions we replace it with the estimate provided by Proposition C.2.


Designed for Accessibility and to further Open Science