Safely Learning to Control the Constrained Linear Quadratic Regulator

2018·Arxiv

Abstract

Abstract

We study the constrained linear quadratic regulator with unknown dynamics, addressing the tension between safety and exploration in data-driven control techniques. We present a framework which allows for system identification through persistent excitation, while maintaining safety by guaranteeing the satisfaction of state and input constraints. This framework involves a novel method for synthesizing robust constraint-satisfying feedback controllers, leveraging newly developed tools from system level synthesis. We connect statistical results with cost sub-optimality bounds to give non-asymptotic guarantees on both estimation and controller performance.

1 Introduction

Data-driven design has considerable potential in contemporary control systems where precise modeling of the dynamics is intractable, whether due to complex large-scale interactions or nonlinearities resulting from contact forces. However, a large hurdle in the way of practical deployment is the question of maintaining safe operating conditions during the learning process, and furthermore guaranteeing safe execution using learned components.

Motivated by this issue, we study the data-driven design of a controller for the constrained Linear Quadratic Regulator (LQR) problem. In constrained LQR, we design a controller for a potentially unknown linear dynamical system that minimizes a given quadratic cost, subject to the additional requirement that both the state and input stay within a specified safe region. This is a problem that has received much attention within the model predictive control (MPC) community.

For the LQR problem with no constraints, a natural method of exploration for learning the dynamics is to excite the system by injecting white noise. When safety is not an issue, this method is effective and recently Dean et al. [1] provide an end-to-end sample complexity on this “identify-then-control” scheme. However, this method of fails to consider safety, and the injected white noise may lead to constraint violation.

In this paper, we directly address the tension between exploration for learning and safety, which are fundamentally at odds. We do this by synthesizing a controller which simultaneously excites and regulates the system; we propose to learn by additively injecting bounded noise to the control inputs computed by a safe controller. By leveraging the recently developed system level synthesis (SLS) framework for control design [2], we give a computationally tractable algorithm which returns a controller that (a) guarantees the closed loop system remains within the specified constraint set and (b) ensures that enough noise can be injected into the system to obtain a statistical guarantee on learning. To the best of our knowledge, our algorithm is the first to simultaneously achieve both objectives. Furthermore, the controller synthesis is solved by a convex optimization problem whose feasibility is a certificate of safety, and no considerations of robust invariant sets are required.

Our second contribution is to provide a sub-optimality bound on control performance for constrained LQR. Using the same SLS framework, we quantify the excess cost incurred by playing a controller designed on the uncertain dynamics obtained from learning. The sub-optimality depends on both the size of the uncertainty sets and a type of constraint robustness cost gap of the optimal constrained controller for the true system. This allows us to provide the first end-to-end sample complexity guarantee for the control of constrained systems.

Estimation and control of the unconstrained LQR problem has been studied in the non-asymptotic setting [3, 1]. However, the identification schemes rely on pure excitation and system restarts, which is not suitable for the constrained setting. On the other hand, the online learning literature simultaneously considers learning and control, where strategies are based on optimism in the face of uncertainty [4] or Thompson Sampling [5]. These approaches guarantee system estimation only up to optimal closed-loop equivalence, and do not consider safety. Building on a statistical result by Simchowitz et al. [6] which allows for non-asymptotic guarantees on parameter estimation from a single trajectory of a linear system, Dean et al. [7] provide a robust online method that guarantees parameter estimation and stability throughout.

The design of controllers that guarantee robust constraint satisfaction has long been considered in the context of model predictive control [8], including methods that model uncertainty in the dynamics directly [9], or model it as a bounded state disturbance for computational efficiency [10, 11]. Strategies for incorporating estimation of the dynamics include experiment-design inspired costs [12], decoupling learning from constraint satisfaction [13], and set-membership methods rather than parameter estimation [14, 15]. Due to the receding horizon nature of model predictive controllers, this literature relies on set invariance theory for infinite horizon guarantees [16]. Our framework considers the infinite horizon problem directly, and therefore we do not require computation of invariant sets.

Finally, the machine-learning community has begun to consider safety in reinforcement learning, where much work positions itself as being for general dynamical systems in lieu of providing statistical guarantees [17, 18, 19, 20]. Some works assume the existence of an initial safe controller for learning [21], and robust MPC methods have been proposed to modify potentially unsafe learning inputs [22]. Our framework gives an alternative procedure for designing such a controller using coarse system estimates. Most similar to this work is that of Lu et al. [23], who propose a method to allow excitation on top of a safe controller, but consider only finite-time safety and require non-convex optimization to obtain formal guarantees. We fix an underlying linear dynamical system with full state observation,

where we have initial condition , sequence of inputs , and disturbance process . Later, we will specify assumptions on the disturbances, including independence over time, bounded second moment, and bounded dynamics matrices () are unknown. We denote the errors on estimates of the system ( ) in terms of the matrix operator norm as

In this work, we will focus on the errors for

Furthermore, we assume some prior knowledge on the system dynamics, in the form of initial estimates ( ) and uncertainty measures (). Without such knowledge, guaranteeing safety in any form would be impossible. We note that the initial estimates may be coarse grained, and the goal of the learning procedure will be to refine this uncertainty prior to optimal control design.

The constrained LQR optimal control problem seeks to minimize the expected quadratic cost, subject to constraints on the state and input. Before further detailing this objective, we develop notation and background on system level synthesis.

1.1 System Level Synthesis

Many approaches to optimal control for systems with constraints involve receding horizon control, where an open loop finite-time trajectory is computed at each timestep; indeed, parameterizing optimal control problems by a state feedback controller generally leads to nonconvex optimization. As a motivating example, consider the static feedback law applied to the linear system (1.1). Then we can write

and a similar expression for . Then it is clear that convex constraints on be non-convex in K. Instead, we can parametrize the problem in terms of convolution with the closed-loop system response,

where we defined the fixed initial condition. We note that the expression above is valid for any linear dynamic controller, i.e. any controller which is a linear function of the state and its history. Since the relation in (1.2) is linear, convex constraints on state and input translate to convex constraints on the system response elements. The system level synthesis (SLS) framework shows that for any elements constrained to obey, for all

there exists a feedback controller that achieves the desired system responses (1.2). The state-feedback parameterization result in Theorem 1 of Wang et al. [2] formalizes this principle, and therefore any optimal control problem over linear systems can be cast as a constrained optimization problem over system response elements. We remark that similar observations have been used for constrained state feedback control in the finite horizon setting [11]. In what follows, we use boldface letters to denote transfer functions and signals, e.g. . Under this notation, the affine constraints can be rewritten as

and the corresponding control law u = Kx is given by

1.2 Notation

In this paper, we restrict our attention to the function space , consisting of discretetime stable matrix-valued transfer functions. We use to denote the set of transfer functions . We further define

for transfer functions that satisfy a certain decay rate in the spectral norm of their impulse response elements.

When working with transfer functions and signals, we will denote the coefficient of the term of degree . We will also denote G[k : 1] as the block row vector of system response elements of G

As is standard, we let denote the -norm of a vector x. For a matrix M, we let denote its operator norm. We will consider the norms, which are infinite horizon analogs of the Frobenius, spectral, and operator norms of a matrix, respectively:

Finally, for two numbers ) denote that there exists an absolute constant C > 0 such that

1.3 Optimal Control Problem

We now describe the constrained optimal control problem that we would want to solve given perfect knowledge of the system dynamics. First, we consider the expected infinite horizon quadratic cost for the system () in feedback with K:

We assume that is any distribution that satisfies is independent across time, i.e., . Note that any distribution satisfying these constraints induces the same expected quadratic cost, so it is unnecessary to specify a specific distribution.

Next, we consider polytopic constraint sets of the form

We will constrain the state and input trajectories to lie within these sets for any possible disturbance sequence satisfying 0. Putting the cost and the constraints together, the optimal control problem that acts as our baseline is:

Above, we let the set K enumerate all inputs that result from linear dynamic stabilizing feedback controllers for () of the form u = Kx. This is made possible by the system level synthesis framework described in the previous section.

As we show in the following section, the optimal control problem given in (1.3) is a convex, but infinite-dimensional problem. It is an idealized baseline to compare our actual solutions to; our sub-optimality guarantees will be with respect to the optimal cost achieved by this problem. It is a relevant baseline, since it optimizes for average case performance but ensures safety for the worst-case behavior, consistent with MPC literature [10, 24]. We remark that an alternative to (1.3) is to replace the worst case constraint behavior with probabilistic chance constraints [25]. We do not work with chance constraints because they are generally difficult to directly enforce on an infinite horizon; arguments around recursive feasibility using robust invariant sets are common in the MPC literature to deal with this issue.

2 Constraint-Satisfying Control

We begin by formulating a method for robustly operating a system while maintaining safety. First we describe a system level synthesis approach to the constrained LQR problem, and then we propose a modification which makes it robust to uncertainties in system dynamics. Finally, we outline a reduction to tractable synthesis via a finite-dimensional optimization problem over a finite impulse response (FIR) approximation.

2.1 A System Level Approach

Using the SLS formulation, we define an optimization problem that solves the optimal control problem presented in the section above.

Proposition 2.1. The following convex optimization problem solves the optimal control problem (1.3).

where the elements of the constraint functions are defined as

with j indexing the rows of and entries of

Examining the form of the optimization problem above, we see that the expected quadratic cost transforms into a system norm, while the worst-case polytopic constraints on state and input become closed-form polytopic constraints on the system response. For a controller , note that the LQR cost is equivalent to

Lastly, we remark here that the feasibility of the convex synthesis problem in (2.1) for an initial condition implies that is a member of a robust control invariant set.

Proof. By the state-feedback parameterization result in Theorem 1 of [2], the SLS parametrization encompasses all internally stabilizing state-feedback controllers acting on the true system (). Thus, it is necessary only to show that the optimization problem in (2.1) is consistent with that of (1.3) under the system level parametrization. The equivalence between the

LQR cost and the system norm is standard and we refer to the Appendix of [1], which presents this reformulation in terms of system responses for Gaussian process noise. The argument applies unchanged to any distribution satisfying our assumptions.

Therefore, it remains to consider the inequality constraints. Because the constraints must be satisfied robustly, it is equivalent to consider

Then considering elements in the second term with each j indexing the rows of

Thus the inequality constraint on the function ) is an equivalent condition. A similar computation holds for the input constraint.

We note the appearance of the row-wise norm over the multiplication of the the system response elements with the constraint matrix. The expression can be understood as an analog to the operator norm, mediated by the shape of the polytope. In the next section when we introduce robustness to system dynamics, the system norm will come into play for similar reasons.

2.2 Robust Control

We are further motivated to reformulate the optimal control problem in terms of system responses so that we can transparently consider uncertainties in the dynamics. We are interested in controller synthesis under model errors, where only nominal estimates of the system are known. Suppose that the system responses are designed on the estimated system. Then

The model mismatch impacts the closed-loop behavior of the true system in feedback with the controller in a transparent way. Supposing that 1, it achieves the system response and bounded cost:

This follows from the robust stability result in Theorem 2 of Matni et al. [26] and the submultiplicativity of the norms. Note that 1 is a sufficient condition for the existence of the inverse (for any induced norm by the small gain theorem.

This expression provides an upper bound on the cost achieved for a controller designed system estimates which depends only on the system estimates and the size of the error system . Motivated by this bound, consider the following robust optimization problem:

where 0 1 are fixed parameters and for each j and k,

and similarly for

In this problem, bounds the increase in the cost due to the dynamics uncertainty, while determines the increase in the state and input values with respect to the constraints. In fact, both values can be related to bounding an enlarge noise process ˜driving the system.

Theorem 2.2. Any controller designed from a feasible solution to the robust control problem (2.3) for any 0 will stabilize the true system. Furthermore, the state and input constraints will be satisfied.

Proof. First, note that by the system norm constraints,

Then by (2.2), the true system trajectory will be given by the stable system responses

Therefore, the state constraints are satisfied as long as

The first term reduces to ) as in the non-robust case. Because information about is not known, we resort to a sufficient condition to bound the second term, letting ˜w =

Consequently, a sufficient condition for satisfying state constraints is to have for j indexing rows of and elements of

Therefore, the constraints on imply that the state constraints are satisfied. Similar logic shows that the constraints on imply that the input constraints are satisfied.

We remark on differences between the presented constraint-tightening and approaches common in the MPC literature. In both cases, the uncertainty in the dynamics induces an enlarged disturbance process. Common in the MPC literature, the additive disturbance approximation assigns ˜, or equivalently ˜can be bounded using the state and input constraint sets. Because this bound degrades as the constraint sets increase in size, this strategy can lead to counterintuitive behavior. On the other hand, our approach writes ˜. While bounding in this setting does not depend on the constraint set, it is affected by the size of the initial condition. Further comparison between the two approaches is presented in Appendix B.

2.3 Finite Dimensional Reduction

To make controller synthesis tractable, we can solve a finite approximation to optimization problem (2.3) wherein we only optimize over the first L impulse response elements of and , treating them as finite impulse response (FIR) filters. We show that in this setting, the optimization variables and constraints admit finite-dimensional representations. We first reformulate the constraints. Starting with the affine constraint, we have for

where we will also optimize over V , a term which captures the tail of the system responses that we ignore in the synthesis.

SDP over Φas in Theorem 5.8 of Dumitrescu [27], described explicitly for this setting in Appendix G.3 of Dean et al. [7]. For the norm, the constraint becomes an operator norm bound,

where the tail variable V enters transparently. For 1, the inequality constraints on ) remain. For any , the expression reduces to, for each j,

Therefore, the synthesis problem becomes, for any fixed 0

This is a finite dimensional SDP. The controller given by can be written in an equivalent state-space realization () via Theorem 2 of Anderson et al. [28].

3 Suboptimality Guarantees

How much is control performance degraded by uncertainties about the dynamics? In this section, we derive a sub-optimality bound which answers this question for the constrained LQR problem. First, consider the addition of an outer minimization over

Denote the solution to the true optimal control problem as (), then define Additionally, define constants related to the optimal system norm and the dynamics uncertainties:

Before stating the suboptimality result, we define a robust version of the optimal constrained controller. Define the doubly robust constraint sets

and similarly for ¯. Note that these constraint sets resemble the robust constraint sets with and an enlarged noise process with . Then the optimal robustly constrained controller is defined as

with . Note that this optimization problem designs a controller which has similar closed-loop behavior to the true optimal controller (i.e. system norms of similar size), but satisfies more stringent state and input constraints. The relative robustness cost gap is defined as

Now we are ready to state a data independent bound on the sub-optimality of robust controllers synthesized on estimated dynamics.

Theorem 3.1. Suppose that the robust optimal constrained controller problem is feasible. As long as , we have that the cost achieved by synthesized from the minimizers of (3.1) satisfies

This result is stated in terms of quantities related to the unknown true system, with the goal of highlighting properties of systems that make them harder or easier to robustly control. We see that the bound grows with the norm of the optimal controller and closed-loop response. It also grows with the robustness cost gap

Remark that is be difficult to characterize analytically; in general, it requires checking the boundaries of the robust polytopic constraints. Once errors are small enough that the optimal system response () satisfies the doubly robust constraints, we will have = 0. If the optimal controller saturates its constraints, then will be nonzero for any nonzero estimation errors. In Section 5 (Figure 2), we numerically characterize this optimality gap for a double integrator example.

Proof. Using (2.2) along with the norm bounds (2.4) and the constraints in optimization problem (3.1),

Next, we will connect the optimal system response to the estimated system by constructing a feasible solution to the robust optimization problem. We will use the following lemma

Lemma 3.2. Under the conditions of Theorem 3.1, we have that the following is a feasible solution to (3.1)

The proof of Lemma 3.2 follows by checking that the proposed solution satisfies all the constraints and is presented in Appendix A. Applying Lemma 3.2,

This is true because ( ) is the optimal solution to (3.1), so objective function with feasible ( ˜) is an upper bound. Then we have

The second inequality follows from an application of (2.2) with the roles of the nominal and true systems switched. The final follows from bounding and noticing that, where we set . Finally, we note that

Here, we briefly remark that a similar sub-optimality bound can be derived for the finite problem in (2.8). In short, controllers synthesized by minimizing over will satisfy a sub-optimality bound of the form in Theorem 3.1 with an additional factor due to the FIR truncation. The formal statement and proof of this result are deferred to Appendix A, but we highlight here that the cost penalty incurred due to FIR approximation decays exponentially in the horizon L over which the approximation is taken.

4 Learning with Control

Finally, we connect the previous results on robust control with system estimation. We adopt control actions that both keep the system safe and provide excitation,

where each ) is stochastic and -bounded, i.e. trajectory sequence , we propose to learn the dynamics () via least-squares regression on a trajectory of length T:

In what follows, we will prove a statistical rate on the least-squares estimate ( of the system response and the trajectory length.

The bulk of the proof for the statistical rate comes from a general theorem regarding linear-response time series data from Simchowitz et al. [6]. Recently, this proof was adopted by Dean et al. [7] to show a rate of estimation in the setting given by (4.1) when both and the disturbance w are Gaussian distributed. We modify the reduction given by Dean et al. to the case when the excitation and disturbance are no longer Gaussian, but instead zero-mean and bounded. We assume that are both zero-mean sequences with independent coordinates and finite fourth moments. In particular, we assume . These assumptions are quickly verified for common distributions such as uniform on a compact interval or over a discrete set of points. The main estimation result is the following.

Theorem 4.1. Fix a failure probability Suppose the stochastic disturbance and the input disturbance satisfy the assumptions above. Assume for simplicity that , and that the stabilizing controller K achieves a SLS response

. Then as long as the trajec- tory length T satisfies the condition:

we have the following bound on the least-squares estimation errors that holds with probability at least 1

max{∥∆

The proof of this result is presented in Appendix C. We remark on the interpretation of statistical learning bounds. A priori guarantees, like the one presented here, depend on quantities related to the underlying true system. As such, they are not directly useful when the system is unknown,but rather they indicate qualities of systems that make them easier or harder to estimate. For example, we see that this bound on estimation errors decreases with the size of the excitation, and increases with the size of the process noise. The bound also increases with , a constant which bounds the gain from disturbance to control inputs.

Applying our robustness results to the learning controller (4.1), it is possible to guarantee safety during the estimation process. We can define an expanded noise process ˜to synthesize using the robust control synthesis problem (2.3). As long as it is feasible for initial system estimates (), initial dynamics uncertainties (withreplaced with , then the control law stabilizes the true system, satisfies state and input constraints, and allows for learning at the rate given in Theorem 4.1.

Finally, we connect the sub-optimality result to the statistical learning bound for an end-to-end sample complexity bound on the constrained LQR problem. Define the value of when the definition determined by (3.2) has values set as

Corollary 4.2. Assume initial feasibility of the learning problem. Then for a trajectory of length

the cost achieved by synthesized from (2.3) on the least-squares estimates satisfies with probability at least 1

Proof. (Sketch) This result follows by combining the statistical guarantee in Theorem 4.1 with the sub-optimality bound in Theorem 3.1. Note that we use the na¨ıve bound and similarly ; this results in an extra factor of (n + d) appearing in (4.4) and in the definition of

Our final result depends both on the true system and the initial system estimates by way of the learning controller, which affects and constants . The system constraints enter through their effect on

Figure 1: Safe learning trajectories synthesized with coarse initial estimates (a), then robust execution with reduced model errors (b).

5 Numerical Experiments

We demonstrate the utility of this framework on a double integrator example. In this case, the true dynamics are given by

with the constraints as states bounded between 8 and 8, and inputs bounded in between 4 and 4. We have 1. Our initial estimate comes from a randomly generated initial perturbation of the true system with 1. Safe controllers are generated with finite truncation length L = 15, and for larger initial conditions, the system is warm-started with a finite-time robust controller with horizon 20 to reduce the initial condition.

Figure 1 displays safe trajectories and input sequences for several example initial conditions. In 1a, the plotted trajectories are used for learning: the controller both regulates and excites the system (5), and is robust to initial uncertainties. Figure 1b demonstrates an ability to operate closer to the constraints when there is less uncertainty: in this case, there is no added excitation (= 0) and the system estimates are better specified (001), so larger initial conditions are feasible.

Figure 2a displays the decreasing estimation errors over time, demonstrating learning. Shaded areas represent quartiles over 400 trials. Figure 2b displays the trade-off between safety and exploration by showing the largest value of for which the robust synthesis is feasible, given a size for the state constraint set . Here, we leave = 0, and examine a variety of errors in the dynamics estimates. As the uncertainties in the dynamics decrease, higher levels of both safety and exploration are achievable. Finally, Figure 2c shows the value of the relative robust sub-optimality gap for the given example. We see a sharp

Figure 2: Over time, estimation errors decrease (a). As safety requirements increase, the maximum feasible excitation decreases (b). The robust cost suboptimality gap an abrupt transition from feasibility to near-optimality with error size (c).

transition from infeasibility for 05 to near-optimality for 03. This indicates that the gap may be most significant as a feasibility condition for our sub-optimality guarantees to hold.

6 Discussion

In this paper, we propose a method for learning unknown linear systems while ensuring that they satisfy state and input constraints. By synthesizing a controller that both excites and regulates the system, we address the trade-off between safety and exploration directly. We further derive an end-to-end finite sample bound on the performance of constrained LQR controllers synthesized from collected data.

There are several directions for possible extensions of this work. To mitigate the conservativeness of the robust controller, tighter bounds on the uncertainty in the system response could be derived for structured settings, where more than just the norm of the error is known. To connect this work to experiment design literature, the objective in the synthesis problem (2.3) could be replaced with an exploration inspired cost function for the learning stage.

Alternatively, the constrained LQR problem could be cast in the setting of online learning, where one seeks to minimize cost at all times, including during learning. This would require an analysis of recursive feasibility, to understand the transition that occurs when controllers are updated based on refined system estimates. It would also likely require a finer analysis on the performance loss characterized by . Finally, we remark that the exploration vs. safety trade-off is most compelling for nonlinear systems, and a linear-time-varying extension to this work may be applicable to those problems.

ACKNOWLEDGMENT

We thank Francesco Borrelli and the members of the MPC Lab at UC Berkeley for their helpful comments and feedback. SD is supported by an NSF Graduate Research Fellowship under Grant No. DGE 1752814. ST is supported by a Google PhD fellowship. BR is generously supported in part by ONR awards N00014-17-1-2191, N00014-17-1-2401, and N00014-18-1-2833, the DARPA Assured Autonomy (FA8750-18-C-0101) and Lagrange (W911NF-16-1-0552) programs, and an Amazon AWS AI Research Award.

References

[1] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “On the sample complexity of the linear quadratic regulator,” arXiv preprint arXiv:1710.01688, 2017.

[2] Y.-S. Wang, N. Matni, and J. C. Doyle, “A system level approach to controller synthe- sis,” arXiv preprint arXiv:1610.04815, 2016.

[3] C.-N. Fiechter, “PAC adaptive control of linear systems,” in Conference on Learning Theory, 1997.

[4] Y. Abbasi-Yadkori and C. Szepesv´ari, “Regret bounds for the adaptive control of linear quadratic systems,” in Conference on Learning Theory, 2011.

[5] M. Abeille and A. Lazaric, “Thompson sampling for linear-quadratic control problems,” in AISTATS, 2017.

[6] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht, “Learning without mixing: Towards a sharp analysis of linear system identification,” in Conference on Learning Theory, 2018.

[7] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “Regret bounds for robust adaptive control of the linear quadratic regulator,” in Neural Information Processing Systems, 2018.

[8] A. Bemporad and M. Morari, “Robust model predictive control: A survey,” in Robustness in identification and control. Springer, 1999, pp. 207–226.

[9] M. V. Kothare, V. Balakrishnan, and M. Morari, “Robust constrained model predictive control using linear matrix inequalities,” Automatica, vol. 32, no. 10, pp. 1361–1379, 1996.

[10] D. Q. Mayne, M. M. Seron, and S. Rakovi´c, “Robust model predictive control of con- strained linear systems with bounded disturbances,” Automatica, vol. 41, no. 2, pp. 219–224, 2005.

[11] P. J. Goulart, E. C. Kerrigan, and J. M. Maciejowski, “Optimization over state feedback policies for robust control with constraints,” Automatica, vol. 42, no. 4, pp. 523–533, 2006.

[12] C. A. Larsson, C. R. Rojas, and H. Hjalmarsson, “MPC oriented experiment design,” IFAC Proceedings Volumes, vol. 44, no. 1, pp. 9966–9971, 2011.

[13] A. Aswani, H. Gonzalez, S. S. Sastry, and C. Tomlin, “Provably safe and robust learning- based model predictive control,” Automatica, vol. 49, no. 5, pp. 1216–1226, 2013.

[14] M. Tanaskovic, L. Fagiano, R. Smith, and M. Morari, “Adaptive receding horizon control for constrained MIMO systems,” Automatica, vol. 50, no. 12, pp. 3019–3029, 2014.

[15] M. Lorenzen, F. Allg¨ower, and M. Cannon, “Adaptive model predictive control with robust constraint satisfaction,” IFAC-PapersOnLine, vol. 50, no. 1, pp. 3313–3318, 2017.

[16] F. Blanchini, “Set invariance in control,” Automatica, vol. 35, no. 11, pp. 1747–1767, 1999.

[17] F. Berkenkamp and A. P. Schoellig, “Safe and robust learning control with gaussian processes,” in European Control Conference (ECC), 2015.

[18] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforce- ment learning with stability guarantees,” in Neural Information Processing Systems, 2017.

[19] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa, “Safe exploration in continuous action spaces,” arXiv preprint arXiv:1801.08757, 2018.

[20] Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh, “A lyapunov-based approach to safe reinforcement learning,” arXiv preprint arXiv:1805.07708, 2018.

[21] T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause, “Learning-based model predictive control for safe exploration and reinforcement learning,” arXiv preprint arXiv:1803.08287, 2018.

[22] K. P. Wabersich and M. N. Zeilinger, “Linear model predictive safety certification for learning-based control,” arXiv preprint arXiv:1803.08552, 2018.

[23] T. Lu, M. Zinkevich, C. Boutilier, B. Roy, and D. Schuurmans, “Safe exploration for identifying linear systems via robust optimization,” arXiv preprint arXiv:1711.11165, 2017.

[24] F. Oldewurtel, C. N. Jones, and M. Morari, “A tractable approximation of chance constrained stochastic MPC based on affine disturbance feedback,” in 47th IEEE Annual Conference on Decision and Control, 2008.

[25] M. Farina, L. Giulioni, and R. Scattolini, “Stochastic linear model predictive control with chance constraints–a review,” Journal of Process Control, vol. 44, pp. 53–67, 2016.

[26] N. Matni, Y.-S. Wang, and J. Anderson, “Scalable system level synthesis for virtually localizable systems,” in IEEE 56th Annual Conference on Decision and Control, 2017.

[27] B. Dumitrescu, Positive Trigonometric Polynomials and Signal Processing Applications. Springer, 2007, vol. 103.

[28] J. Anderson and N. Matni, “Structured state space realizations for sls distributed con- trollers,” in Allerton, 2017.

[29] B. Efron, “Bootstrap methods: another look at the jackknife,” in Breakthroughs in statistics. Springer, 1992, pp. 569–593.

[30] R. Ibragimov and S. Sharakhmetov, “The exact constant in the rosenthal inequality for random variables with mean zero,” Theory of Probability & Its Applications, vol. 46, no. 1, pp. 127–132, 2002.

A Extended Sub-optimality Results and Proofs

This section contains proofs necessary for the sub-optimality results. It further develops a sub-optimality bound for the case of FIR truncation. We begin by considering the frequency response elements of composed transfer functions. First, define, for

Then we have the following lemma.

the frequency response elements of ˜M = MD are given by

where we use the notation D(1 : k) for the vertical concatenation of D(1) through D(k). Further, we have

Proof. For the first statement, notice that we can write

Then for the second, we have

A.1 Proof of Lemma 3.2

Proof of Lemma 3.2. The proposed feasible solution is

where

Next, notice that

where the inequality follows due to the fact that1. Then consider the constraint,

where the decomposition of the inverse is valid by assumption on the size of Then it remains to show that the robust state and input constraints are satisfied. For compactness we will write ). Recall that they are given by

Note that ˜. Define the frequency response elements of ). Then, using Lemma A.1 we have

Lemma A.2. Assume that represent the frequency response of Then for any vector v,

Proof.

Above, we make use of the fact that the norms are duals, and therefore . The second inequality holds because Toep) is a truncation of the semi-infinite Toeplitz matrix associated with the operator . The final decomposition is valid because

Now we are ready to consider the state constraint indexed by j and k,

Considering each term individually,

Then defining to contain an identity in the first block and zeros elsewhere,

+ 1 : 1]Toep

The first inequality is H¨older’s inequality and the second by Lemma A.2. Next, the second term:

where the inequality holds by Lemma A.2. Finally, the last term,

where we apply Lemma A.2 and plug in the definition of ˜

The resulting sum is

Then considering constants around the final term,

Thus, we see that due to the constraints on . A similar computation with the input constraints shows the same result. Therefore, the proposed solution is feasible.

A.2 Finite Dimensional Sub-optimality

We can recover sub-optimality guarantees in the case that we optimize over only a finite set of system response variables. Define the truncated responses similarly for , and notice that

This reformulation allows for the optimization over the FIR filters additional variable that represents the tail of the true response. Applying this observation to the robust synthesis problem (2.3) yields the following finite dimensional problem:

We now show that so long as the horizon L is suitably large, all of the properties that hold for the solution of infinite horizon problem (2.3) carry over to the solution of the finite dimensional approximation (A.2).

Proposition A.3. Any controller designed from a feasible solution to the finite robust control problem (A.2) for any 0 stabilizes the true system and ensures that state and input constraints will be satisfied.

Proof. Consider the affine constraint on . We can equivalently write, using the observation in (A.1),

by (2.2), the system is stabilized and the true system trajectory is given by

The rest of the proof follows exactly as in the proof of Theorem 2.2

To bound the sub-optimality of this robust synthesis, we need to connect the optimal controller to the robust problem. For this feasibility result, it is necessary to understand the system response induced by putting the optimal robustly constrained controller , as defined in (3.2), in feedback with the estimated dynamics ( ). We denote this system responses as . As before,

We thus define constants related to the decay rate of the system responses. Let be any constants such that . Note that these constants must exist since the optimal closed loop will be exponentially stable. Let be any constants that bound the norm of the frequency response elements of (We note that these constants can be bounded by multiples of , as worked out in Appendix G of Dean et al. [7].

Lemma A.4. Suppose that for Define

We defer the proof of this lemma to the end of the section. Next, we redefine the optimal robustly constrained responses () to satisfy the modified doubly robust constraint sets

Here, we have changed a factor of 2 to a factor of 3 in the last term to account for the finite approximation. We will define ) using the new (). We are now ready to state the sub-optimality result.

Theorem A.5. Suppose that the truncation length is such that

As long as , we have that the cost achieved by synthesized from the minimizers of min

To prove this result, we first construct a feasible solution to the truncated synthesis problem (A.2).

Lemma A.6. Under the conditions of Theorem A.5, the following is a feasible solution to (A.2):

Proof of Lemma A.6. First, we consider the affine constraint,

Applying the observation in (A.1),

Next, we consider the norm constraint

The first inequality follows from the bound on and because the proposed ˜truncation of and similarly for ˜. The second follows by the sub-multiplicative property, because 1, and due to the assumption on L. Then similarly considering the norm constraint,

Then it remains only to show that the robust state and input constraints are satisfied. Notice that ˜is a truncation of , the frequency response elements are zero, and therefore the constraints are trivially satisfied. For for 0 , we have, by Lemma A.1, using block-Toeplitz notation, that

where we take D(k) to represent the frequency response elements of proof of constraint satisfaction follows as in Section A.1, up to the computations with the definition of ˜

This affects the constant factors around the final term of the sum

Thus, we see that due to the constraints on . A similar computation with the input constraints shows the same result. Therefore, the proposed solution is feasible.

We are now ready to prove the main sub-optimality result.

Proof of Theorem A.5. Recall that we denote the minimizers of minLet

Then we have that that

By the constraints of the optimization problem in (A.2). We now apply observation A.1 and (2.2) with to characterize the response achieved by the FIR approximate controller ) on the true system (), giving the following:

The inequality follows because

Denote by ( ˜) the feasible solution constructed in Lemma A.6. Then,

where the first inequality follows from the optimality of (), the equality and second inequality from the fact that ( ˜) are truncations of the response of ( ) to the first L time steps. The second to last inequality follows from an application of (2.2) with the roles of the nominal and true systems switched. The final inequality follows by definition of the robustness cost gap and bounding . We then simplify

which follows for . Then finally,

Proof of Lemma A.4. Using Lemma A.1 with D(t) representing the frequency response elements of (, we have that

Then noting that for any

which is true because log, we have for

The final inequality follows from choosing

B Additive Disturbance Alternative

In this section we discuss an alternative to the proposed robust control problem in (2.3) in which the constraints are tightened using a traditional additive disturbance approximation. We will contrast this additive reduction with the system-response reduction presented in Section 2.2. Define the additive robust synthesis problem as

where 0 1 is a fixed parameter and

where we define radius of the constraint set, i.e. similarly for

Theorem B.1. Any controller designed from a feasible solution to the robust control problem (B.1) for any 0 will stabilize the true system. Furthermore, the state and input constraints will be satisfied.

Proof. First, note that by the system norm constraint,

We will make an additive disturbance approximation by defining ˜Though it is conservative, we can bound

Now note that if

Thus as long as state and input constraints are satisfied at time k, we have that

Then, notice that the constraint tightening is equivalent to replacing . Then by Proposition 2.1, the inequality constraints that the system satisfies the constraints for dynamics given by ( ) and any process noise bounded by ˜. Thus, we can recursively argue that state and input constraints are satisfied for all k as long as they are satisfied for the initial state

The next result establishes conditions under which this additive disturbance robust control is better than the system-response robust control.

be an optimizer of (3.1). The additive disturbance reduction is less conservative than the system-response based reduction if and only if

max(From this statement, we can see that the system-response reduction suffers for large initial conditions, while the additive reduction suffers with the size of the constraint sets. For this reason, that additive disturbance reduction would not properly characterize the tradeoff between safety and exploration as plotted in Figure 2b. However, we note that the condition in Proposition B.2 could be used as a switch in a control synthesis problem to decide which reduction to use.

This expression can be viewed as expanding the disturbance by the uncertainty as

In the robust constraints (2.3), this additional uncertainty is bounded as

On the other hand, the additive disturbance approximation notices that and thus assigns ˜

Then in the robust constraints (B.1), this additional uncertainty is bounded as

Then the additive bound is worse than the system-response bound as stated.

Finally, we note that the sub-optimality result could be extended to controllers synthesized from minimizing (B.1) over , and that the main difference would arise from the definition of the robust cost sub-optimality gap

C Learning Results

First, we prove a simple small-ball result for random variables with finite fourth moments.

be a zero-mean random variable with finite fourth moment, which satisfies the conditions

Let be a fixed scalar and . We have that

Proof. First, we note that we can assume 0 without loss of generality, since we can perform a change of variables . We have that 4]. Therefore, by the Paley-Zygmund inequality and Young’s inequality, we have that

Now define the function

Clearly ) = 1. Since by Jensen’s inequality we know that this means ). On the other hand,

Assume that = 0 (otherwise the claim is trivially true). The only critical points of the function f on non-negative reals is at

then a = 0 is the only critical point of ). On the other hand, if 4holds, Some algebra yields that

The inequality above holds since 3

This gives us the main new technical ingredient needed to prove Theorem 4.1.

be a full rank matrix. Let random vector such that its fourth moment is finite and each coordinate is independent and zero-mean, i.e. . Then, for any fixed

where Σ = is an absolute constant, and particular, we can take C = 192.

. By Rosenthal’s inequality, we have that for an absolute constant–we can take C = 4 (see e.g. [30]). We now apply Proposition C.1 with 2. The claim now follows.

With Proposition C.2 in place, the proof of Theorem 4.1 is identical to that of Proposition C.1 in Dean et al. [7], except in the final step of establishing the block martingale condition: instead of using the small-ball calculation for Gaussian distributions we replace it with the estimate provided by Proposition C.2.

designed for accessibility and to further open science