Robust exploration in linear quadratic reinforcement learning

2019·Arxiv

Abstract

Abstract

This paper concerns the problem of learning control policies for an unknown linear dynamical system to minimize a quadratic cost function. We present a method, based on convex optimization, that accomplishes this task robustly: i.e., we minimize the worst-case cost, accounting for system uncertainty given the observed data. The method balances exploitation and exploration, exciting the system in such a way so as to reduce uncertainty in the model parameters to which the worst-case cost is most sensitive. Numerical simulations and application to a hardware-in-the-loop servo-mechanism demonstrate the approach, with appreciable performance and robustness gains over alternative methods observed in both.

1 Introduction

Learning to make decisions in an uncertain and dynamic environment is a task of fundamental importance in a number of domains. Though it has been the subject of intense research activity since the formulation of the ‘dual control problem’ in the 1960s[17], the recent success of reinforcement learning (RL), particularly in games [32, 36], has inspired a resurgence in interest in the topic. Problems of this nature require decisions to be made with respect to two objectives. First, there is a goal to be achieved, typically quantified as a reward function to be maximized. Second, due to the inherent uncertainty there is a need to gather information about the environment, often referred to as ‘learning’ via ‘exploration’. These two objectives are often competing, a fact known as the exploration/exploitation trade-off in RL, and the ‘dual effect’ (of decision) in control.

It is important to recognize that the second objective (exploration) is important only in so far as it facilitates the first (maximizing reward); there is no intrinsic value in reducing uncertainty. As a consequence, exploration should be targeted or application specific; it should not excite the system arbitrarily, but rather in such a way that the information gathered is useful for achieving the goal. Furthermore, in many real-world applications, it is essential that exploration does not compromise the safe and reliable operation of the system.

This paper is concerned with control of uncertain linear dynamical systems, with the goal of maximizing (minimizing) rewards (costs) that are a quadratic function of states and actions; cf. §2 for a detailed problem formulation. We derive methods to synthesize control policies that balance the exploration/exploitation tradeoff by performing robust, targeted exploration: robust in the sense that we optimize for worst-case performance given uncertainty in our knowledge of the system, and targeted in the sense that the policy excites the system so as to reduce uncertainty in such a way that specifically minimizes the worst-case cost. To this end, this paper makes the following specific contributions. We derive a high-probability bound on the spectral norm of the system parameter estimation error, in a form that is applicable to both robust control synthesis and design of targeted exploration; cf. §3. We also derive a convex approximation of the worst-case (w.r.t. parameter uncertainty) infinite-horizon linear quadratic regulator (LQR) problem; cf. §4.2. We then combine these two developments to present an approximate solution, via convex semidefinite programing (SDP), to the problem of minimizing the worst-case quadratic costs for an uncertain linear dynamical system; cf. §4. For brevity, we will refer to this as a ‘robust reinforcement learning’ (RRL) problem.

1.1 Related work

Inspired, perhaps in part, by the success of RL in games [32, 36], there has been a flurry of recent research activity in the analysis and design of RL methods for linear dynamical systems with quadratic rewards. Works such as [1, 23, 15] employ the so-called ‘optimism in the face of uncertainty’ (OFU) principle, which selects control actions assuming that the true system behaves as the ‘best-case’ model in the uncertain set. This leads to optimal regret but requires the solution of intractable nonconvex optimization problems. Alternatively, the works of [34, 3, 4] employ Thompson sampling, which optimizes the control action for a system drawn randomly from the posterior distribution over the set of uncertain models, given data. The work of [30] eschews uncertainty quantification, and demonstrates that ‘so-called’ certainty equivalent control attains optimal regret. There has also been considerable interest in ‘model-free’ methods[38] for direct policy optimization [16, 29], as well partially model-free methods based on spectral filtering [22, 21]. Unlike the present paper, none of the works above consider robustness which is essential for implementation on physical systems. Robustness is studied in the so-called ‘coarse-ID’ family of methods, c.f. [12, 13, 11]. In [11], sample convexity bounds are derived for LQR with unknown linear dynamics. This approach is extended to adaptive LQR in [12], however, unlike the present paper, the policies are not optimized for exploration and exploitation jointly; exploration is effectively random. Also of relevance is the field of so-called ‘safe RL’ [19] in which one seeks to respect certain safety constraints during exploration and/or policy optimization [20, 2], as well as ‘risk-sensitive RL’, in which the search for a policy also considers the variance of the reward [31, 14]. Other works seek to incorporate notions of robustness commonly encountered in control theory, e.g. stability [33, 7, 10]. In closing, we mention that related problems of simultaneous learning and control have a long history in control theory, beginning with the study of ‘dual control’ [17, 18] in the 1960s. Many of these formulations relied on a dynamic programing (DP) solution and, as such, were applicable only in special cases [6, 8]. Nevertheless, these early efforts [9] established the importance of balancing ‘probing’ (exploration) with ‘caution’ (robustness). For subsequent developments from the field of control theory, cf. e.g. [24, 25, 5].

2 Problem statement

In this section we describe in detail the problem addressed in this paper. Notation is as follows: denotes the transpose of a matrix is shorthand for the sequence denotes the maximum eigenvalue of a matrix denotes the Kronecker product. vec (A) stacks the columns of A to form a vector. ) denotes the cone(s) of symmetric positive semidefinite (definite) matrices. w.p. means ‘with probability.’ denotes the value of the Chi-squared distribution with n degrees of freedom and probability p. blkdiag is the block diagonal operator.

Dynamics and cost function We are concerned with control of linear time-invariant systems

where and denote the state (which is assumed to be directly measurable), input and process noise, respectively, at time t. The objective is to design a feedback control policy so as to minimize the cost function , where for user-specified positive semidefinite matrices Q and R. When the parameters of the true system, denoted , are known this is exactly the finite-horizon LQR problem, the optimal solution of which is well-known. We assume that are unknown.

Modeling and data As are unknown, all knowledge about the true system dynamics must be inferred from observed data, . We assume that is known, or has been estimated, and that we have access to initial data, denoted (with slight notational abuse) , obtained, e.g. during a preliminary experiment. For the model (1), parameter uncertainty can be quantified as:

Proposition 2.1. Given observed data from (1), and a uniform prior over the parameters , i.e., , the posterior distribution is given by , where , i.e., the ordinary least

squares estimator, and

Proof: cf. §A.1.1. Based on Proposition 2.1 we can define a high-probability credibility region by:

Policies Though not necessarily optimal, we will restrict our attention to static-gain policies of the form , where represent random excitations for the purpose of learning. A policy comprises and , and is denoted . Let , with , partition the time horizon T into N intervals. The ith interval is of length . We will then design N policies, , such that is deployed during the ith interval, . For convenience, we define the function given by , which maps time t to the index i = I(t) of the policy to be deployed. We also make use of the notation as shorthand for

Worst-case dynamics We are now in a position to define the optimization problem that we wish to solve in this paper. In the absence of knowledge of the true dynamics, , given initial data , we wish to find a sequence of policies that minimize the expected cost assuming that, at time t, the system evolves according to the worst-case dynamics within the high-probability credibility region

where the expectation is w.r.t. . We choose to optimize for the worst-case dynamics so as to bound, with high probability, the cost of applying the policies to the unknown true system. In principle, problems such as (3) can be solved via dynamic programing (DP) [17]. However, such DP-based solutions require gridding to obtain finite state-action spaces, and are hence computationally intractable for systems of even modest dimension [6]. In what follows, we will present an approximate solution to this problem, which we refer to as a ‘robust reinforcement learning’ (RRL) problem, that retains the continuous sate-action space formulation and is based on convex optimization. To facilitate such a solution, we require a refined means of quantifying system uncertainty, which we present next.

3 Modeling uncertainty for robust control

In this paper, we adopt a model-based approach to control, in which quantifying uncertainty in the estimates of the system dynamics is of central importance. From Proposition 2.1 the posterior distribution over parameters is Gaussian, which allows us to construct an ‘ellipsoidal’ credibility region , centered about the ordinary least squares estimates of the model parameters, as in (2).

To allow for an exact convex formulation of the control problem involving the worst-case dynamics, cf. §4.2, it is desirable to work with a credibility region that bounds uncertainty in terms of the spectral properties of the parameter error matrix are the ordinary least squares estimates, i.e. vec, cf. Proposition 2.1. To this end, we will work with

Figure 1: Cartoon depiction of the problem addressed in this paper. The goal is to design N policies, , so as to minimize the worst-case cost (blue area) over the time horizon

models of the form where specifies the following region, in parameter space, centered about

The following lemma, cf. §A.1.2 for proof, suggests a specific means of constructing D, so as to ensure that defines a high probability credibility region:

Lemma 3.1. Given data from (1), and , let

For convenience, we will make use of the following shorthand notation:

Credibility regions of the form (4), i.e. bounds on the spectral properties of the estimation error, have appeared in recent works on data-driven and adaptive control, cf. e.g., [11, Proposition 2.4] which makes use of results from high-dimensional statistics [39]. The construction in [11, Proposition 2.4] requires to be independent, and as such is not directly applicable to time series data, without subsampling to attain uncorrelated samples (though more complicated extensions to circumvent this limitation have been suggested [37]). Lemma 3.1 is directly applicable to correlated time series data, and provides a credibility region that is well suited to the RRL problem, cf. §4.3.

4 Convex approximation to robust reinforcement learning problem

Equipped with the high-probability bound on the spectral properties of the parameter estimation error presented in Lemma 3.1, we now proceed with the main contribution of this paper: a convex approximation to the ‘robust reinforcement learning’ (RRL) problem in (3).

4.1 Steady-state approximation of cost

In pursuit of a more tractable formulation, we first introduce the following approximation of (3),

Observe that (5) has introduced two approximations to (3). First, in (5) we only update the ‘worst-case’ model at the beginning of each epoch, when we deploy a new policy, rather than at each time step as in (3). This introduces some conservatism, as model uncertainty will generally decrease as more data is collected, but results in a simpler control synthesis problem. Second, we select the worst-case model from the ‘spectral’ credibility region as defined in (4), rather than the ‘ellipsoidal’ region defined in (2). Again, this introduces some conservatism as , cf. §A.1.2, but permits convex optimization of the worst-case cost, cf. §4.2. For convenience, we denote

Next, we approximate the cost between epochs with the infinite-horizon cost, scaled appropriately for the epoch duration, i.e., between the th epoch we approximate the cost as

This approximation is accurate when the epoch duration is sufficiently long relative to the time required for the state to reach the stationary distribution. Substituting (6) into (5), the cost function that we seek to minimize becomes

The expectation in (7) is w.r.t. to and , as depends on the random variables and , which evolve according to the worst-case dynamics in (5).

4.2 Optimization of worst-case cost

The previous subsection introduced an approximation of our ‘ideal’ problem (3), based on the worst-case infinite horizon cost, cf. (7). In this subsection we present a convex approach to the optimization of . The infinite horizon cost can be expressed as

Under the feedback policy K, the covariance appearing on the RHS of (8) can be expressed as

where denotes the stationary state covariance. For known A and B, W is given by the (minimum trace) solution to the Lyapunov inequality

i.e., . To optimize via convex optimization, there are two challenges to overcome: i. non-convexity of jointly searching for K and W, satisfying (10) and minimizing (8), ii. computing W for worst-case , rather than known {A, B}.

Let us begin with the first challenge: nonconvexity. To facilitate a convex formulation of the RRL problem (7) we write (10) as

and introduce the change of variables and , collated in the variable

. With this change of variables, minimizing (8) subject to (11) is a convex program.

Now, we turn to the second challenge: computation of the stationary state covariance under the worst-case dynamics. As a sufficient condition, we require (11) to hold for all particular, we define the following approximation of

Lemma 4.1. Consider the worst-case cost , cf. (6), and the approximation , cf. (12).

Proof: cf. §A.1.3. To optimize , as defined in (12), we make use of the following result from [28]:

Theorem 4.1. The data matrices (A, B, C, P, F, G, H) satisfy, for all robust fractional quadratic matrix inequality

(13) for some

To put (11) in a form to which Theorem 4.1 is applicable, we make use of of the nominal parameters defined as in (4), such that , we can express (11) as

where the ‘iff’ follows from the Schur complement. Given this equivalent representation, by Theorem 4.1, (11) holds for all

with the optimal policy given by

Note that as is purely an ‘exploitation’ problem in the above SDP; in general, in the RRL setting (i.e. (7)) where exploration is beneficial.

4.3 Approximate uncertainty propagation

Let us now return to the RRL problem (7). Given a model M, §4.2 furnished us with a convex method to minimize the worst-case cost. However, at time t = 0, we have access only to data , and therefore, only . To optimize (7) we need to approximate the models based on the future data, , that we expect to see. To this end, we denote the approximate model, at time given data , by . We now describe specific choices for , beginning with the latter.

Recall that the uncertainty matrix D at the ith epoch is denoted . The uncertainty matrix at the

i + 1th epoch is then given by

empirical covariance matrix in this expression with the worst-case state covariance as follows:

This approximation makes use of the same calculation appearing in (9). The equality makes use of the change of variables introduced in §4.2. Note that in proof of Theorem 4.2, cf. §A.1.4, it was shown that

Next, we turn our attention to approximating the effect of future data on the nominal parameter estimates . Updating these (ordinary least squares) estimates based on the expected value of future observations involves difficult integrals that must be approximated numerically [27, §5]. To preserve convexity in our formulation, we approximate future nominal parameter estimates with the current estimates, i.e., given data we set and . To summarize, our approximate model at epoch j is given by

4.4 Final convex program and receding horizon application

We are now in a position to present a convex approximation to our original problem (3). By substituting for , and for in (7), we attain the cost function: Consider the ith term in this sum, which can be optimized via the SDP (15), with . Notice two important facts: 1. for fixed multiplier , the uncertainty enters linearly in the constraint , cf. (15); 2. is linear in the decision variables , cf. end of §4.3. Therefore, the constraint remains linear in the decision variables, which means that the cost function derived by substituting be optimized as an SDP, cf. (17) below.

Hitherto, we have considered the problem of minimizing the expected cost over time horizon T given initial data . In practical applications, we employ a receding horizon strategy, i.e., at the ith epoch, given data , we find a sequence of policies that minimize the approximate h-step-ahead expected cost

and then apply during the ith epoch. At the beginning of the i + 1th epoch, we repeat the process; cf. Algorithm1. The problem can be solved as the SDP:

Selecting multipliers For optimization of given a model M, i.e., (15), the simultaneous search for the policy K and multiplier is convex, as D is fixed. However, in the RRL setting, ‘D’ is a function of the decision variables , cf. (17b), and so the multipliers must be specified in advance. We propose the following method of selecting the multipliers: given , solve via the SDP (15). Then, compute the cost by solving (17), but with the policies fixed to , and the multipliers as free decision variables. In other words, approximate the worst-case cost of deploying the epochs into the future. Then, use the multipliers found during the calculation of this cost for control policy synthesis at the ith epoch.

Computational complexity The proposed method can be implemented via semidefinite programing (SDP) for which computational complexity is well-understood. In particular, the cost of solving the SDP (15) scales as denotes the dimensionality of the decision variables, and is the dimensionality of the LMI . The cost of solving the SDP (17) is then given, approximately, by the cost of (15) multiplied by the horizon h.

5 Experimental results

Numerical simulations In this section, we consider the RRL problem with parameters

Figure 2: Results for the experiments described in §5. (a) total costs (sum of costs at each epoch) for the numerical simulations. On the left, we plot the costs when the methods are applied to the true system. In the middle, we plot the worst-case costs, when the model uncertainty is updated using the data obtained from applying the methods to the true system. On the right, we plot the worst-case costs when the model uncertainty is updated using the theoretical worst-case system behavior. (b) total costs sum of costs at each epoch) for the hardware-in-the-loop experiments. (c) on the left axis, we plot the (median) costs at each epoch for numerical simulations. On the right axis, we plot the information, a scalar measure of uncertainty defined in §5. The shaded region denotes the inter-quartile range. The scenarios are in the same order as (a). (d) (median) costs at each epoch for the hardware-in-the-loop experiments. The shaded region covers the best/worst costs at each epoch.

We partition the time horizon equally spaced intervals, each of length For robustness, we set . Each experimental trial consists of the following procedure. Initial data is obtained by driving the system forward 6 time steps, excited by . This open-loop experiment is repeated 500 times, such that . We then apply three methods: i. rrl - the method proposed in §4.4, with look-ahead horizon h = 10; ii. nom - applying the ‘nominal’ robust policy , i.e., a pure greedy exploitation policy, with no explicit exploration; iii. greedy - first obtaining a nominal robustly stabilizing policy as with nom, but then optimizing (i.e., increasing, if possible) the exploration variance until the greedy policy and the rrl policy have the same theoretical worst-case cost at the current epoch. This is a greedy exploration policy. We perform 100 of these trials and plot the results in Figure 2. In Figure 2(a) we plot the total costs (i.e. the sum of the costs for each epoch over the entire time horizon). In each setting (see caption for details), rrl attains lower total cost than the other methods. In Figure 2(c) we plot the costs at each epoch; as may be expected, nom (which does no explicit exploration) often attains lower cost than rrl at the initial epochs, rrl, which does exploration, achieves lower cost in the end. We emphasize that this balance of exploration/exploitation occurs automatically. rrl always outperforms greedy. In Figure 2(c) we also plot the information, defined as , at the ith epoch, which is the (inverse) of the 2-norm of parameter error, cf. (4). The larger the information, the more certain the system (in an absolute sense). Observe that rrl achieves larger information than nom (which does no exploration), but less information than greedy; however, rrl achieves lower cost than greedy. This suggests that rrl is reducing the uncertainty in a structured way, targeting uncertainty reduction in the parameters that ‘matter most for control’.

Hardware-in-the-loop experiment In this section, we consider the RRL problem for a hardware-in-the-loop simulation comprised of the interconnection of a physical servo mechanism (Quanser QUBE 2) and a synthetic (simulated) LTI dynamical system; cf. Appendix A.2 for full details of the experimental setup. An experimental trial consisted of the following procedure. Initial data was obtained by simulating the system for 0.5 seconds, under closed-loop feedback control (cf. Appendix A.2) with data sampled at 500Hz, to give 250 initial data points. We then applied methods rrl (with horizon h = 5) and greedy as described in §5. The total control horizon was T = 1250 (2.5 seconds at 500Hz) and was divided into N = 5 intervals, each of duration 0.5 seconds. We performed 5 of these experimental trials and plot the results in Figure2. In Figure 2(b) and (d) we plot the total cost (the sum of the costs at each epoch), and the cost at each epoch, respectively, for each method, and observe significantly better performance from rrl in both cases. Additional plots decomposing the cost into that associated with the physical and synthetic system are available in Appendix A.2.

References

[1] Y. Abbasi-Yadkori and C. Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.

[2] P. Abbeel and A. Y. Ng. Exploration and apprenticeship learning in reinforcement learning. In Proceedings of the 22nd international conference on Machine learning, pages 1–8. ACM, 2005.

[3] M. Abeille and A. Lazaric. Thompson Sampling for linear-quadratic control problems. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[4] M. Abeille and A. Lazaric. Improved regret bounds for thompson sampling in linear quadratic control problems. In International Conference on Machine Learning, pages 1–9, 2018.

[5] M. Annergren, C. A. Larsson, H. Hjalmarsson, X. Bombois, and B. Wahlberg. Application-oriented input design in system identification: Optimal input design for control [applications of control]. IEEE Control Systems Magazine, 37(2):31–56, 2017.

[6] K. J. Åström and B. Wittenmark. Problems of identification and control. Journal of Mathematical analysis and applications, 34(1):90–113, 1971.

[7] A. Aswani, H. Gonzalez, S. S. Sastry, and C. Tomlin. Provably safe and robust learning-based model predictive control. Automatica, 49(5):1216–1226, 2013.

[8] Y. Bar-Shalom. Stochastic dynamic programming: Caution and probing. IEEE Transactions on Automatic Control, 26(5):1184–1195, 1981.

[9] Y. Bar-Shalom and E. Tse. Caution, probing, and the value of information in the control of uncertain systems. In Annals of Economic and Social Measurement, Volume 5, number 3, pages 323–337. NBER, 1976.

[10] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems (NIPS). 2017.

[11] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadratic regulator. to appear in Foundations of Computational Mathematics (arXiv:1710.01688), 2017.

[12] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4192–4201, 2018.

[13] S. Dean, S. Tu, N. Matni, and B. Recht. Safely learning to control the constrained linear quadratic regulator. to appear at American Control Conference (arXiv:1809.10121), 2019.

[14] S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, and S. Udluft. Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-sensitive Learning. arXiv:1710.07283, 2017.

[15] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis. Finite time adaptive stabilization of LQ systems. IEEE Transactions on Automatic Control, 2019. accepted.

[16] M. Fazel, R. Ge, S. M. Kakade, and M. Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. arXiv preprint arXiv:1801.05039, 2018.

[17] A. Feldbaum. Dual control theory. i. Avtomatika i Telemekhanika, 21(9):1240–1249, 1960.

[18] A. Feldbaum. Dual control theory. ii. Avtomatika i Telemekhanika, 21(11):1453–1464, 1960.

[19] J. Garcıa and F. Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.

[20] P. Geibel and F. Wysotzki. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 24:81–108, 2005.

[21] E. Hazan, H. Lee, K. Singh, C. Zhang, and Y. Zhang. Spectral filtering for general linear dynamical systems. In Advances in Neural Information Processing Systems, pages 4634–4643, 2018.

[22] E. Hazan, K. Singh, and C. Zhang. Learning linear dynamical systems via spectral filtering. In Advances in Neural Information Processing Systems, pages 6702–6712, 2017.

[23] M. Ibrahimi, A. Javanmard, and B. V. Roy. Efficient reinforcement learning for high dimensional linear quadratic systems. In Advances in Neural Information Processing Systems, pages 2636–2644, 2012.

[24] C. Larsson, A. Ebadat, C. Rojas, X. Bombois, and H. Hjalmarsson. An application-oriented approach to dual control with excitation for closed-loop identification. European Journal of Control, 29:1–16, May 2016.

[25] C. Larsson, C. Rojas, X. Bombois, and H. Hjalmarsson. Experimental evaluation of model predictive control with excitation (MPC-X) on an industrial depropanizer. Journal of Process Control, 31:1–16, Jul 2015.

[26] Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications, 31(3):1235–1256, 2009.

[27] M. S. Lobo and S. Boyd. Policies for simultaneous estimation and optimization. In Proceedings of the 1999 American Control Conference (Cat. No. 99CH36251), volume 2, pages 958–964. IEEE, 1999.

[28] Z.-Q. Luo, J. F. Sturm, and S. Zhang. Multivariate nonnegative quadratic mappings. SIAM Journal on Optimization, 14(4):1140–1162, 2004.

[29] D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. L. Bartlett, and M. J. Wainwright. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. arXiv preprint arXiv:1812.08305, 2018.

[30] H. Mania, S. Tu, and B. Recht. Certainty equivalent control of lqr is efficient. arXiv preprint arXiv:1902.07826, 2019.

[31] O. Mihatsch and R. Neuneier. Risk-sensitive reinforcement learning. Machine learning, 49(2-3):267–290, 2002.

[32] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[33] C. J. Ostafew, A. P. Schoellig, and T. D. Barfoot. Robust Constrained Learning-based NMPC enabling reliable mobile robot path tracking. The International Journal of Robotics Research, 35(13):1547–1563, 2016.

[34] Y. Ouyang, M. Gagrani, and R. Jain. Learning-based control of unknown linear systems with thompson sampling. submitted to IEEE Transactions on Automatic Control (arXiv:1709.04047), 2017.

[35] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook., 2012.

[36] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

[37] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht. Learning without mixing: Towards a sharp analysis of linear system identification. Conf. on Learning Theory (arXiv:1802.08334), 2018.

[38] S. Tu and B. Recht. The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. to appear in Conference on Learning Theory (arXiv:1812.03565), 2019.

[39] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.

A Supplementary material

A.1 Proofs

A.1.1 Proof of Proposition 2.1

where proportionality follows from the assumption of a uniform prior, likelihood can be expressed as

which has the norm of the Gaussian distribution , we know that the posterior is proportional to the likelihood; therefore the posterior is given by

A.1.2 Proof of Lemma 3.1

Proof. As . Substituting this representation of into (2) we have, w.p.

c.f., [35, Equation 521], and vec c.f., [35, Equation 520] to get

by choosing is simply the cyclic trace property; (19d) follows from the fact that the Frobenius norm upper bounds the spectral norm (2-norm) of a matrix. As

A.1.3 Proof of Lemma 4.1

As the feasible set in (24) is a subset of the feasible set in (23), the cost of (24) cannot be less than that of (23). Therefore,

A.1.4 Proof of Theorem 4.2

Proof. As Theorem 4.1 is non-conservative (i.e. if and only if), is equivalent to solving

When we solve the convex SDP (15) in Theorem 4.2, we solve with , as a free variable, instead of we ignore the structural constraints implicit in . As we remove constraints from the problem, the SDP (15) has a solution that is at least as good as the solution to (25) (which in the optimal solution).

However, as we enforce , one can recover a feasible policy , as the Schur complement implies

Therefore, as the policy from the SDP (15) in Theorem 4.2 is: i) at least as good as the optimal policy, and ii) feasible, it must be equivalent to the optimal policy.

A.2 Description of hardware in the loop experiment

For the hardware-in-the-loop experiment described in §5, we consider a system comprised of two subsystems: i. a Quanser QUBE-Servo 2 physical (i.e. real-world) servomechanism, cf. Figure 3, and ii. a synthetic (i.e. simulated) LTI system of the form (1), with parameters

For the purpose of implementation in so as to output the full state . The two subsystems are interconnected as depicted in the Simulink block diagram shown in Figure 4. The coupling between these two systems, cf. Figure 4, is sampled from the physical system at 500 Hz, i.e., a sampling time of , and the position (measured directly via an encoder) was passed through a high-pass filter to obtain velocity estimates. Band-limited white noise (of unit power) was added to all states of the system, as shown in Figure 4. The gain for each ‘noise channel’ was set to

The experiment consisted of five trials, each comprising the following procedure. Initial data was generated by simulating the system for 0.5 seconds, i.e. 250 samples at 500 Hz, under feedback control with the policy

We then applied the methods rrl and greedy, as defined in §5. The matrices specifying the cost function were given by Q = diag(1, 0.1, 0.1, 0.1, 0.1, 10, 0.1) and R = diag(0.1, 0.1). The total time horizon was T = 1250, i.e. 2.5 seconds at 500 Hz, which was divided into N = 5 equal intervals.

In Figure 5 we decompose the total cost plotted in Figure 2(d) into the costs associated with the physical system and the synthetic (simulated) system.

Figure 3: The Quanser QUBE-Servo 2. Photo: www.quanser.com/products/qube-servo-2.

Figure 4: Simulink block diagram showing the interconnection of the physical system (Quarc) and the synthetic (simulated) system.

Figure 5: The (median) cost of rrl and greedy controllers on the synthetic system and the physical system. The shaded region covers the best and worst costs at each epoch.

designed for accessibility and to further open science