Learning convex bounds for linear quadratic control policy synthesis

2018·Arxiv

Abstract

Abstract

Learning to make decisions from observed data in dynamic environments remains a problem of fundamental importance in a number of fields, from artificial intelligence and robotics, to medicine and finance. This paper concerns the problem of learning control policies for unknown linear dynamical systems so as to maximize a quadratic reward function. We present a method to optimize the expected value of the reward over the posterior distribution of the unknown system parameters, given data. The algorithm involves sequential convex programing, and enjoys reliable local convergence and robust stability guarantees. Numerical simulations and stabilization of a real-world inverted pendulum are used to demonstrate the approach, with strong performance and robustness properties observed in both.

1 Introduction

Decision making for dynamical systems in the presence of uncertainty is a problem of great prevalence and importance, as well as considerable difficulty, especially when knowledge of the dynamics is available only via limited observations of system behavior. In machine learning, the data-driven search for a control policy to maximize the expected reward attained by a stochastic dynamic process is known as reinforcement learning (RL) [41]. Despite remarkable recent success in games [28, 39], a major obstacle to the deployment RL-based control on physical systems (e.g. robots and self-driving cars) is the issue of robustness, i.e., guaranteed safe and reliable operation. With the necessity of such guarantees widely acknowledged [2], so-called ‘safe RL’ remains an active area of research [18].

The problem of robust automatic decision making for uncertain dynamical systems has also been the subject of intense study in the area of robust control (RC) [49]. In RC, one works with a set of plausible models and seeks a control policy that is guaranteed to stabilize all models within the set. In addition, there is also a performance objective to optimize, i.e. a reward to be maximized, or equivalently, a cost to be minimized. Such cost functions are usually defined with reference to either a nominal model [17, 22] or the worst-case model [32] in the set. RC has been extremely successful in a number of engineering applications [34]; however, as has been noted, e.g., [43, 31], robustness may (understandably) come at the expense of performance, particularly for worst-case design.

The problem we address in this paper lies at the intersection of reinforcement learning and robust control, and can be summarized as follows: given observations from an unknown dynamical system, we seek a policy to optimize the expected cost (as in RL), subject to certain robust stability guarantees (as in RC). Specifically, we focus our attention on control of linear time-invariant dynamical systems, subject to Gaussian disturbances, with the goal of minimizing a quadratic function penalizing state deviations and control action. When the system is known, this is the classical linear quadratic regulator (LQR), a.k.a. , optimal control problem [8]. We are interested in the setting in which the system is unknown, and knowledge of the dynamics must be inferred from observed data.

Contributions and paper structure The principle contribution of this paper is an algorithm to optimize the expected value of the linear quadratic regulator reward/cost function, where the expectation is w.r.t. the posterior distribution of unknown system parameters, given observed data; c.f. Section 3 for a detailed problem formulation. Specifically, we construct a sequence of convex approximations (upper bounds) to the expected cost, that can be optimized via semidefinite programing [45]. The algorithm, developed in Section 4, invokes the majorize-minimization (MM) principle [26], and consequently enjoys reliable convergence to local optima. An important part of our contribution lies in guarantees on the robust stability properties of the resulting control policies, c.f. Section 4.3. We demonstrate the proposed method via two experimental case studies: i) the benchmark problem on simulated systems considered in [14, 43], and ii) stabilization of a real-world inverted pendulum. Strong performance and robustness properties are observed in both. Moving forward, from a machine learning perspective this work contributes to the growing body of research concerned with ensuring robustness in RL, c.f. Section 2. From a control perspective, this work appropriates cost functions more commonly found in RL (namely, expected reward) to a RC setting, with the objective of reducing conservatism of the resulting robust control policies.

2 Related work

Incorporating various notions of ‘robustness’ into RL has long been an area of active research [18]. In so-called ‘safe RL’, one seeks to respect certain safety constraints during exploration and/or policy optimization, for example, avoiding undesirable regions of the state-action space [19, 1]. A related problem is addressed in ‘risk-sensitive RL’, in which the search for a policy takes both the expected value and variance of the reward into account [27, 16]. Recently, there has been an increased interest in notions of robustness more commonly considered in control theory, chiefly stability [31, 3]. Of particular relevance is the work of [4], which employs Lyapunov theory [24] to verify stability of learned policies. Like the present paper, [4] adopts a Bayesian framework; however, [4] makes use of Gaussian processes [35] to model the uncertain nonlinear dynamics, which are assumed to be deterministic. A major difference between [4] and our work is the cost function; in the former the policy is selected by optimizing for worst-case performance, whereas we optimize the expected cost. Robustness of data-driven control has also been the focus of a recently developed family of methods referred to as ‘coarse-ID control’, c.f.,[42, 14, 7, 40], in which finite-data bounds on the accuracy of the least squares estimator are combined with modern robust control tools, such as system level synthesis [47]. Coarse-ID builds upon so-called ‘identification’ methods for learning models of dynamical systems, along with error bounds that are compatible with robust synthesis methods [23, 11, 10]. identification assumes an adversarial (i.e. worst-case) disturbance model, whereas Coarse-ID is applicable to probabilistic models, such as those considered in the present paper. Of particular relevance to the present paper is [14], which provides sample complexity bounds on the performance of robust control synthesis for the infinite horizon LQR problem, when the true system is not known. Such bounds necessarily consider the worst-case model, given the observed data, where as we are concerned with expected cost over the posterior distribution of models. In closing, we briefly mention the so-called ‘Riemann-Stieltjes’ class of optimal control problems, for uncertain continuous-time dynamical systems, c.f., e.g., [37, 36]. Such problems often arise in aerospace applications (e.g. satellite control) where the objective is to design an open-loop control signal (e.g. for an orbital maneuver) rather than a feedback policy.

3 Problem formulation

In this section we describe in detail the specific problem that we address in this paper. The following notation is used: denotes the set of symmetric matrices; () denotes the cone of positive semdefinite (positive definite) matrices. denotes , similarly for and . The trace of A is denoted tr A. The transpose of A is denoted is shorthand for . The convex hull of set is denoted conv. The set of Schur stable matrices is denoted S.

Dynamics, reward function and policies We are concerned with control of discrete linear time-invariant dynamical systems of the form

where , and denote the state, input, and unobserved exogenous

disturbance at time t, respectively. Let . Our objective is to design a feedback control policy that minimizes the cost function , where evolves according to (1), and and are user defined weight matrices. A number of different parametrizations of the policy have been considered in the literature, from neural networks (popular in RL, e.g., [4]) to causal (typically linear) dynamical systems (common in RC, e.g., [32]). In this paper, we will restrict our attention to static-gain policies of the form , where is constant. As noted in [14], controller synthesis and implementation, is simpler (and more computationally efficient) for such policies. When the parameters of the true system, denoted , are known this is the infinite horizon LQR problem, the optimal solution of which is well-known [5]. We assume that is unknown; rather, our knowledge of the dynamics must be inferred from observed sequences of inputs and states.

Observed data We adopt the data-driven setup used in [14], and assume that where is the observed state sequence attained by evolving the true system for T time steps, starting from an arbitrary and driven by arbitrary input . Each of these N independent experiments is referred to as a rollout. We perform parameter inference in the offline/batch setting; i.e., all data D is assumed to be available at the time of controller synthesis.

Optimization objective Given observed data and, possibly, prior knowledge of the system, we then have the posterior distribution over the model parameters denoted , in place of the true parameters . The function that we seek to minimize is the expected cost w.r.t. the posterior distribution, i.e.,

In practice, the support of almost surely contains {A, B} that are unstabilizable, which implies that (2) is infinite. Consequently, we shall consider averages over confidence regions w.r.t. . For convenience, let us denote the infinite horizon LQR cost, for given system parameters

where the second equality follows from standard Gramian calculations, and S denotes the set of Schur stable matrices. As an alternative to (2) we consider the cost function where denotes the c % confidence region of the parameter space w.r.t. the posterior . Though better suited to optimization than (2), which is almost surely infinite, this integral cannot be evaluated in closed form, due to the complexity of . Furthermore, there is still no guarantee that contain only stabilizable models. To circumvent both of these issues, we propose the following Monte Carlo approximation of

where denotes the set of stabilizable {A, B}.

Posterior distribution Given data D, the parameter posterior distribution is given by Bayes’ rule:

where denotes our prior belief on , and denotes the unnormalized posterior. To sample from , we can distinguish between two different cases. First, consider the case when is known or can be reliably estimated independently of {A, B}. This is the setting in, e.g., [14]. In this case, the likelihood can be equivalently expressed as a Gaussian distribution over {A, B}. Then, when the prior p(A, B) is uniform (i.e. non-informative) or Gaussian (self-conjugate), the posterior is also Gaussian, c.f. Appendix A.1.1. Second, consider the general case in which , along with {A, B}, is unknown. In this setting, one can select from a number of methods adapted for Bayesian inference in dynamical systems, such as Metropolis-Hastings [29], Hamiltonian Monte Carlo [12], and Gibbs sampling [13, 48]. When one places a non-informative prior on ), each iteration of a Gibbs sampler targeting requires sampling from either a Gaussian or an inverse Wishart distribution, for which reliable numerical methods exist; c.f., Appendix A.1.2. In both of these cases we can sample from and evaluate point-wise. To draw , as in (4), we can first draw a large number of samples from , discard the (100)% of samples with the lowest unnormalized posterior values, and then further discard any samples that happen to be unstabilizable. For convenience, we define , which should be interpreted as a set of M realizations of this procedure for sampling

Summary We seek the solution of the optimization problem

4 Solution via semideﬁnite programing

In this section we present the principle contribution of this paper: a method for solving via convex (semidefinite) programing (SDP). It is convenient to consider an equivalent representation

where the Comparison Lemma [30, Lecture 2] has been used to replace the equality in (3b) with the inequality in (6b). We introduce the notation , where and c are arbitrarily small and large positive constants, respectively. serves as a compact approximation of , suitable for use with SDP solvers, i.e.,

4.1 Common Lyapunov relaxation

The principle challenge in solving (6) is that the constraint (6b) is not jointly convex in K and . The usual approach to circumventing this nonconvexity is to first apply the Schur complement to (6b), and then conjugate by the matrix diag, which leads to the equivalent constraint

With the change of variables and , (7) becomes an LMI, in and . This approach is effective when M = 1 (i.e. we have a single nominal system, as in standard LQR). However, when M > 1 we cannot introduce a new for each , as we lose uniqueness of the controller , i.e., in general . One strategy (prevalent in robust control, e.g., [14, §C]) is to employ a ‘common Lyapunov function’, i.e., for all i = 1, . . . , M. This gives the following convex relaxation (upper bound) of problem (6),

where denotes the Cholesky factorization of are slack variables used to encode the cost (6a) with the change of variables, i.e.,

The approximation in (8) is highly conservative, which motivates the iterative local optimization method presented in Section 4.2. Nevertheless, (8) provides a principled way (i.e., a one-shot convex program) to initialize the iterative search method derived in Section 4.2.

4.2 Iterative improvement by sequential semidefinite programing

To develop this iterative search method first consider an equivalent representation of

This representation highlights the nonconvexity of due to the term, which was addressed (in the usual way) by a change of variables in Section 4.1. In this section, we will instead replace with a linear approximation and prove that this leads to a tight convex upper bound. Given let denote the first order (i.e. linear) Taylor series approximation of about some nominal , i.e., . We now define the function

where that achieves the minimum in (9), with for some nominal . Analogously to (4), we define

We now show that is a convex upper bound on , which is tight at . The proof is given in A.2.2 and makes use of the following technical lemma (c.f. A.2.1 for proof),

Lemma 4.1. for all , where denotes the first-order Taylor series expansion of

Theorem 4.1. Let be defined as in (11), with such that is finite. Then is a convex upper bound on , i.e., . Furthermore, the bound is ‘tight’ at

Iterative algorithm To improve upon the common Lyapunov solution given by (8), we can solve a sequence of convex optimization problems: , c.f. Algorithm 1 for details. This procedure of optimizing tight surrogate functions in lieu of the actual objective function is an example of the ‘majorize-minimization (MM) principle’, a.k.a. optimization transfer [26]. MM algorithms enjoy good numerical robustness, and (with the exception of some pathological cases) reliable convergence to local minima [44]. Indeed, it is readily verified that , where equality follows from tightness of the bound, and the second inequality is due to the fact that is an upper bound. This implies that is a converging sequence.

Remark 4.1. This sequential SDP approach can be applied in other robust control settings, e.g., mixed ], to improve on the common Lyapunov solution, c.f., Section5.1 for an illustration.

4.3 Robustness

Hitherto, we have considered the performance component of the robust control problem, namely minimization of the expected cost; we now address the robust stability requirement. It is desirable for the learned policy to stabilize every model in the confidence region ; in fact, this is necessary for the cost to be finite. Algorithm 1 ensures stability of each of the M sampled systems from , which implies that stabilizes the entire region as . However, we would like to be able to say something about robustness for finite M. To this end, we make two remarks. First, if closed-loop stability of each sampled model is verified with a common Lyapunov function, then the policy stabilizes the convex hull of the sampled systems:

Theorem 4.2. Suppose there exists

where convdenotes the convex hull of

The proof of Theorem 4.2 is given in A.2.3. The conditions of Theorem 4.2 hold for the common Lyapunov approach in (8), and can be made to hold for Algorithm 1 by introducing an additional Lyapunov stability constraint (with common Lyapunov function) for each sampled system, at the expense of some conservatism. Second, we observe empirically that Algorithm 1 returns policies that very nearly stabilize the entire region , despite a very modest number of samples M relative to the dimension of the parameter space, c.f., Section5.1, in particular Figure 2. A number of recent papers have investigated sampling (or grid) based approaches to stability verification of control policies, e.g., [46, 4, 6]. Understanding why policies from Algorithm 1 generalize effectively to the entire region is an interesting topic of future research.

5 Experimental results

5.1 Numerical simulations using synthetic systems

In this section, we study the infinite horizon LQR problem specified by

where toeplitz(r, c) denotes the Toeplitz matrix with first row r and first colum c. This is the same problem studied in [14, §6] (for ), where it is noted that such dynamics naturally arise in consensus and distributed averaging problems. To obtain problem data D, each rollout involves simulating (1), with the true parameters, for T = 6 time steps, excited by Note: to facilitate comparison with [14], we too shall assume that is known. Furthermore, for all experiments will denote the 95% confidence region, as in [14]. We compare the following methods of control synthesis: existing methods: (i) nominal: standard LQR using the nominal model from the least squares, i.e., worst-case: optimize for worst-case model (95% confidence) s.t. robust stability constraints, i.e., the method of [14, §5.2]; (iii) : enforce stability constraint from [14, §5.2], but optimize performance for the nominal model proposed method(s): (iv) CL: the common Lyapunov relaxation of 8; (v) proposed: the method proposed in this paper, i.e., Algorithm 1; additional new methods: (vi) alternate-r: initialize with the solution, and apply the iterative optimization method proposed in Section 4.2, c.f., Remark 4.1; (vii) alternate-s: optimize for the nominal model enforce stability for the sampled systems in . Before proceeding, we wish to emphasize that the different control synthesis methods have different objectives; a lower cost does not mean that the associated method is ‘better’. This is particularly true for worst-case which seeks to optimize performance for the worst possible model so as to bound the cost on the true system.

To evaluate performance, we compare the cost of applying a learned policy K to the true system , to the optimal cost achieved by the optimal controller (designed using ), i.e., . We refer to this as ‘LQR suboptimality.’ In Figure 1 we plot LQR suboptimality is shown as a function of the number of rollouts . We make the following observations. Foremost, the methods that enforce stability ‘stochastically’ (i.e. point-wise), namely proposed and alternate-s, attain significantly lower costs than the methods that enforce stability

Table 1: Median % of unstable closed-loop models, with open-loop models sampled from the 95% confidence region of the posterior, for system of varying dimension ; c.f. Section 5.1 for details. Parenthesized quantities denote the % of cases for which the policy synthesis optimization problem was infeasible (i.e. no policy was returned). 50 experiments were conducted, with and alternate-r have the same robustness guarantees as worst-case, and are omitted.

‘robustly’. Furthermore, in situations with very little data, e.g. N = 5, the robust control methods are usually unable to find a stabilizing controller, yet the proposed method finds a stabilizing controller in the majority of trials. Finally, we note that the iterative procedure in proposed (and alternate-s) significantly improves on the common-Lyapunov relaxation CL; similarly, alternate-r consistently improves upon (as expected).

Figure 1: LQR suboptimality as a function of the number of rollouts (i.e. amount of training data). suboptimality denotes cases in which the method was unable to find a stabilizing controller for the true system (including infeasibility of the optimization problem for policy synthesis), and the % denotes the frequency with which this occurred for the 50 experimental trials conducted.

It is natural to ask whether the reduction in cost exhibited by proposed (and alternate-s) come at the expense of robustness, namely, the ability to stabilize a large region of the parameter space. Empirical results suggest that this is not the case. To investigate this we sample 5000 fresh (i.e. not used for learning) models from and check closed-loop stability of each; this is repeated for 50 independent experiments with varying and N = 50. The median percentage of models that were unstable in closed-loop is recorded in Table 1. We make two observations: (i) the proposed method exhibits strong robustness. Even for (i.e. 288-dim parameter space), it stabilizes > 99% of samples from the confidence region, with only M = 100 MC samples. (ii) when the robust methods (worst-case, alternate-r) are feasible, the resulting policies were found to stabilize 100% of samples; however, for , the methods were infeasible almost half the time, whereas proposed always returned a policy. Further evidence is provided in Figure 2, which plots robustness and performance as a function of the number of MC samples, M. For and , the entire confidence region is stabilized with very high probability, suggesting that is not required for robust stability in practice.

5.2 Real-world experiments on a rotary inverted pendulum

We now apply the proposed algorithm to the classic problem of stabilizing a (rotary) inverted pendulum, on real (i.e. physical) hardware (Quanser QUBE 2), c.f. A.3 for details. To generate

Figure 2: (left) Median % of unstable closed-loop models, with open-loop models sampled from the 95% confidence region of the posterior, for and N = 15, as a function of the number of samples M used in the MC approximation (4). (right) LQR suboptimality as a function of M. 50 experiments were conducted, c.f. Section5.1 for details. Shaded regions cover the interquartile range.

Figure 3: (a) LQR cost on real-world pendulum experiment, as a function of the number of rollouts. cost denotes controllers that resulted in instability during testing. n/a denotes cases in which the synthesis problem was infeasible. (b) pendulum angle and control signal recorded after 10 rollouts.

training data, the superposition of a non-stabilizing control signal and a sinusoid of random frequency is applied to the rotary arm motor while the pendulum is inverted. The arm and pendulum angles (along with velocities) are sampled at 100Hz until the pendulum angle exceeds , which takes no more than 5 seconds. This constitutes one rollout. We applied the worst-case, , and proposed methods to optimize the LQ cost with Q = I and R = 1. To generate bounds worst-case and , we sample 95% confidence region of the posterior, using Gibbs sampling, and take and . The proposed method used 100 such samples for synthesis. We also applied the least squares policy iteration method [25], but none of the policies could stabilize the pendulum given the amount of training data. Results are presented in Figure 3, from which we make the following remarks. First, as in Section5.1, the proposed method achieves high performance (low cost), especially in the low data regime where the magnitude of system uncertainty renders the other synthesis methods infeasible. Insight into this performance is offered by Figure 3(b), which indicates that policies from the proposed method stabilize the pendulum with control signals of smaller magnitude. Finally, performance of the proposed method converges after very few rollouts. Data-inefficiency is a well-known limitation of RL; understanding and mitigating this inefficiency is the subject of considerable research [14, 43, 15, 38, 20, 21]. Investigating the role that a Bayesian approach to uncertainty quantification plays in the apparent sample-efficiency of the proposed method is an interesting topic for further inquiry.

Acknowledgments

This research was financially supported by the Swedish Foundation for Strategic Research (SSF) via the project ASSEMBLE (contract number: RIT15-0012) and via the projects Learning flexible models for nonlinear dynamics (contract number: 2017-03807) and NewLEADS - New Directions in Learning Dynamical Systems (contract number: 621-2016-06079), both funded by the Swedish Research Council.

References

[1] P. Abbeel and A. Y. Ng. Exploration and apprenticeship learning in reinforcement learning. In Proceedings of the 22nd international conference on Machine learning, pages 1–8. ACM, 2005.

[2] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

[3] A. Aswani, H. Gonzalez, S. S. Sastry, and C. Tomlin. Provably safe and robust learning-based model predictive control. Automatica, 49(5):1216–1226, 2013.

[4] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems (NIPS). 2017.

[5] D. P. Bertsekas. Dynamic programming and optimal control. Belmont, MA: Athena Scientific, 1995.

[6] R. Bobiti and M. Lazar. A sampling approach to finding lyapunov functions for nonlinear discrete-time systems. In Control Conference (ECC), 2016 European, pages 561–566. IEEE, 2016.

[7] R. Boczar, N. Matni, and B. Recht. Finite-data performance guarantees for the output-feedback control of an unknown system. arXiv preprint arXiv:1803.09186, 2018.

[8] J. B. Burl. Linear optimal control: . Addison-Wesley Longman Publishing Co., Inc., 1998.

[9] C. K. Carter and R. Kohn. On gibbs sampling for state space models. Biometrika, 81(3):541–553, 1994.

[10] J. Chen and G. Gu. Control-oriented system identification: an , volume 19. WileyInterscience, 2000.

[11] J. Chen and C. N. Nett. The Caratheodory-Fejer problem and identification: a time domain approach. In Proceedings of the 32nd IEEE Conference on Decision and Control (CDC), pages 68–73, 1993.

[12] S. H. Cheung and J. L. Beck. Bayesian model updating using hybrid Monte Carlo simulation with application to structural dynamic models with many uncertain parameters. Journal of engineering mechanics, 135(4):243–255, 2009.

[13] J. Ching, M. Muto, and J. Beck. Bayesian linear structural model updating using Gibbs sampler with modal data. In Proceedings of the 9th International Conference on Structural Safety and Reliability, pages 2609–2616. Millpress, 2005.

[14] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688, 2017.

[15] M. Deisenroth and C. E. Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011.

[16] S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, and S. Udluft. Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-sensitive Learning. arXiv:1710.07283, 2017.

[17] J. Doyle, K. Zhou, K. Glover, and B. Bodenheimer. Mixed performance objectives. II. Optimal control. IEEE Transactions on Automatic Control, 39(8):1575–1587, 1994.

[18] J. Garcıa and F. Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.

[19] P. Geibel and F. Wysotzki. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 24:81–108, 2005.

[20] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-prop: Sample-efficient policy gradient with an off-policy critic. In International Conference on Learning Representations, 2017.

[21] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.

[22] W. M. Haddad, D. S. Bernstein, and D. Mustafa. Mixed-norm regulation and estimation: The discrete-time case. Systems & Control Letters, 16(4):235–247, 1991.

[23] A. J. Helmicki, C. A. Jacobson, and C. N. Nett. Control oriented system identification: a worst-case/deterministic approach in IEEE Transactions on Automatic control, 36(10):1163–1176, 1991.

[24] H. K. Khalil. Noninear systems. Prentice-Hall, New Jersey, 2(5):5–1, 1996.

[25] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of machine learning research, 4(Dec):1107–1149, 2003.

[26] K. Lange, D. R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. Journal of computational and graphical statistics, 9(1):1–20, 2000.

[27] O. Mihatsch and R. Neuneier. Risk-sensitive reinforcement learning. Machine learning, 49(2-3):267–290, 2002.

[28] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[29] B. Ninness and S. Henriksen. Bayesian system identification via Markov chain Monte Carlo techniques. Automatica, 46(1):40–51, 2010.

[30] Oliveira, Maurício de. MAE 280B: Linear Control Design., 2009.

[31] C. J. Ostafew, A. P. Schoellig, and T. D. Barfoot. Robust Constrained Learning-based NMPC enabling reliable mobile robot path tracking. The International Journal of Robotics Research, 35(13):1547–1563, 2016.

[32] I. R. Petersen, M. R. James, and P. Dupuis. Minimax optimal control of stochastic uncertain systems with relative entropy constraints. IEEE Transactions on Automatic Control, 45(3):398–412, 2000.

[33] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook., 2012.

[34] I. Postlethwaite, M. C. Turner, and G. Herrmann. Robust control applications. Annual Reviews in Control, 31(1):27–39, 2007.

[35] C. E. Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning, pages 63–71. Springer, 2004.

[36] I. M. Ross, R. J. Proulx, and M. Karpenko. Unscented optimal control for space flight. In 24th International Symposium on Space Flight Dynamics, 2014.

[37] I. M. Ross, R. J. Proulx, M. Karpenko, and Q. Gong. Riemann–stieltjes optimal control problems for uncertain dynamic systems. Journal of Guidance, Control, and Dynamics, 38(7):1251–1263, 2015.

[38] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

[39] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.

[40] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht. Learning without mixing: Towards a sharp analysis of linear system identification. arXiv preprint arXiv:1802.08334, 2018.

[41] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.

[42] S. Tu, R. Boczar, A. Packard, and B. Recht. Non-asymptotic analysis of robust control from coarse-grained identification. arXiv preprint arXiv:1707.04791, 2017.

[43] S. Tu and B. Recht. Least-squares temporal difference learning for the linear quadratic regulator. arXiv preprint arXiv:1712.08642, 2017.

[44] F. Vaida. Parameter convergence for EM and MM algorithms. Statistica Sinica, pages 831–840, 2005.

[45] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM review, 38(1):49–95, 1996.

[46] J. Vinogradska, B. Bischoff, D. Nguyen-Tuong, A. Romer, H. Schmidt, and J. Peters. Stability of controllers for gaussian process forward models. In International Conference on Machine Learning, pages 545–554, 2016.

[47] Y.-S. Wang, N. Matni, and J. C. Doyle. A system level approach to controller synthesis. arXiv preprint arXiv:1610.04815, 2016.

[48] A. Wills, T. B. Schön, F. Lindsten, and B. Ninness. Estimation of linear systems using a Gibbs sampler. IFAC Proceedings Volumes, 45(16):203–208, 2012.

[49] K. Zhou and J. C. Doyle. Essentials of robust control, volume 104. Prentice Hall Upper Saddle River, NJ, 1998.

A Supplementary material

A.1 Sampling from the posterior distribution

A.1.1 Case I:

where is known, we can express the likelihood in a form equivalent to an (un-normalized) Gaussian distribution over

. This implies that . Therefore, when the prior is non-informative (or Gaussian (self-conjugate), the posterior is also Gaussian.

A.1.2 Case II:

Next, consider the generic case in which all parameters are unknown. Then . One approach to sampling from the posterior involves Gibbs sampling [9], i.e., alternating between the following two sampling steps:

to form the Markov Chain . As demonstrated in A.1.1, the distribution Gaussian, so sampling is straightforward. To sample from , first note

where denotes the inverse Wishart distribution. Note, if is not valid. However, we may consider a prior (Jeffreys’ noninformative prior) which means

A.2 Proofs

A.2.1 Proof of Lemma 4.1

Lemma. denotes the first-order Taylor series expansion of

A.2.2 Proof of Theorem 4.1

Theorem. Let be defined as in (11), with is finite. Then convex upper bound on . Furthermore, the bound is ‘tight’ at

is a tight convex bound on First, we prove convexity. is defined as the supremum over an infinite family of convex functions over a compact convex set, and is therefore a itself convex function. Note: the minimum of a linear function, e.g. can be trivially expressed as the supremum of a convex function, i.e., . Next, we prove the upper bound. From Lemma 4.1, . Therefore, any satisfies (10b) also satisfies (9b). This means the feasible set of (10) is a subset of the feasible set of (9), hence . Finally, we prove tightness. As we have already proved suffices to prove that is a feasible solution to (10b). As is equivalent to (9b), which is feasible by definition of is a feasible solution of (10), that achieves tr , by definition of

A.2.3 Proof of Theorem 4.2

Theorem. Suppose there exists and all denotes the convex hull of

Proof. It is sufficient to show that defines a convex set in terms of (A, B). By the Schur complement,

A.3 Additional material for experiments on the rotary inverted pendulum

A.3.1 System description

The Quanser QUBE-Servo 2 inverted pendulum is depicted in Figure 4. The system consists of an actuated arm that rotates in the horizontal plane; actuation is provided by an electrical motor. Attached to the end of the rotary arm is an un-actuated pendulum, which is free to rotate. The voltage applied to the electric motor constitutes the control input u for the LQR problem. Rotary encoders record the angular position of the rotary arm and pendulum, denoted , respectively. These angular positions are fed through a high-pass-filter to provide angular velocity estimates, . The observed state is then given by

A.3.2 Experimental procedure

To generate one rollout of problem data, we first swing-up the pendulum to the inverted position, stabilized by an LQR designed using a physics-based model of the system. Then we apply the voltage signal and record the resulting state evolution (sampled at 100Hz), until the pendulum angle exceeds , or the rotary arm angle constitutes a state feedback policy that does not stabilize the system, but does keep the pendulum upright from slightly longer than if it were absent. This extends the typical rollout duration to around 3-5 seconds, before the pendulum angle exceeds . The angular frequency is randomly sampled each rollout, denotes the uniform distribution over the interval denotes band-limited white noise, with a sampling time of , and a gain of 0.05. represents additional exogenous disturbances that we artificially introduce to the system.

Training data then consists of denotes the recorded state sequence, and , i.e., the disturbance is not observed for learning. T is truncated to 500 (i.e. 5 seconds) in the event that the rollout lasts this long. A typical rollout is depicted in Figure 5.

Figure 4: The Quanser QUBE-Servo 2 rotary pendulum, in the inverted position. Photo: www.quanser.com/products/qube-servo-2.

A.3.3 Bounds for robust methods

The robust synthesis methods worst-case, alternate-r require bounds on the error of the least squares estimate, i.e., ], these bounds are estimated, with a specific confidence level, via a Boostrap algorithm, assuming that the covariance is known. In our setting, we estimate these bounds as described in Section 5.2, i.e., by sampling from the 95% confidence region of the posterior distribution. This ensures a fair comparison between the methods, as they are, in essence, required to stabilize the same region of the parameter space. We observed, however, that these bounds on the least squares error were too conservative; the magnitude of the uncertainty was too large, and the control synthesis optimization problems were infeasible. The experiments presented in Figure 3 were attained by scaling down these error bounds by a factor of 100. A number of scaling factors were tested, but 100 was found to achieve a reasonable trade-off between robustness and feasibility. It is worth emphasizing that the proposed method used the samples from 95% confidence region without any such scaling.

Figure 5: A typical rollout from the experimental procedure in A.3.2.

designed for accessibility and to further open science