Recent years have witnessed tremendous success of reinforcement learning (RL) in various sequential decision-making applications (Silver et al., 2016; OpenAI, 2018; Vinyals et al., 2019) and continuous control tasks (Lillicrap et al., 2015; Schulman et al., 2015b; Levine et al., 2016; Recht, 2019). Interestingly, most successes hinge on the algorithmic framework of policy optimization (PO), umbrellaing policy gradient (PG) methods (Sutton et al., 2000; Kakade, 2002), actor-critic methods (Konda and Tsitsiklis, 2000; Bhatnagar et al., 2009), trust-region (Schulman et al., 2015a) and proximal PO (Schulman et al., 2017) methods, etc. This inspires an increasing interest in studying the convergence theory, especially global convergence to optimal policies, of PO methods; see recent progresses in both classical RL contexts (Bhandari and Russo, 2019; Zhang et al., 2019a; Wang et al., 2019; Agarwal et al., 2019a; Shani et al., 2019), and continuous control benchmarks (Fazel et al., 2018; Bu et al., 2019a; Malik et al., 2019; Tu and Recht, 2018; Zhang et al., 2019b; Matni et al., 2019).
Indeed, PO provides a general framework for control design.1 Consider a general control design problem for the following discrete-time nonlinear dynamical system
where xk is the state, uk is the control input, and wk is the process noise. Formally, PO is a constrained optimization problem minK∈K J (K), where the decision variable K is determined by the controller parameterization, the cost function J (K) is a pre-specified control performance measure, and the feasible set K carries the information of the constraints on the controller K. These concepts are briefly reviewed as follows.
• Optimization variable K: The control input uk is typically determined by a feedback law K which is also termed as a controller. In the simplest case where a LTI state-feedback controller is used, K is parameterized as a static matrix and uk is given as uk = −Kxk. Then this matrix K becomes the decision variable of the PO problem. For the so-called linear output feedback case where the state xk is not directly measured, the controller can be either a memoryless mapping or an LTI dynamical system. Hence, K can be parameterized by either a static matrix (Rautert and Sachs, 1997) or some state/input/output matrices (AK,BK,CK,DK) (Apkarian et al., 2008). It is also possible to deploy nonlinear controllers and parameterize K as either polynomials, kernels, or deep neural networks (Topcu et al., 2008; Levine et al., 2016).
• Objective function J (K): J (K) is specified to assess the performance of a given controller K. The cost function design is more of an art than a science. Popular choices of such cost functions include H2 or H∞-norm (or some related upper bounds) for the resultant feedback systems (Zhou et al., 1996; Skogestad and Postlethwaite, 2007; Dullerud and Paganini, 2013). For standard RL models that are based on Markov decision processes (MDPs), the cost J (K) usually has an additive structure over time. For instance, in the classical linear quadratic regulator (LQR) or state-feedback LQ Gaussian (LQG) problems, the cost is J (K) := �∞t=0 ], which also has an H2-norm interpretation (Zhou et al., 1996). Nevertheless, J (K) does not necessarily have an additive structure. We will further discuss the specification of J (K) in §2.2.
• Feasible set K: Constraints on the decision variable K are posed to account for either the stability, robustness, or safety concerns on the system. A common, though sometimes implicit, example in continuous control tasks is the stability constraint, i.e. K is required to stabilize the closed-loop dynamics (Makila and Toivonen, 1987; Bu et al., 2019a). There are also other constraints related to robustness or safety concerns in control design (Skogestad and Postlethwaite, 2007; Dullerud and Paganini, 2013; Apkarian et al., 2008). The constraints will naturally confine the policy search to a feasible set K. The cost function J (K) is either ∞ or just undefined for K � K.
To ensure the feasibility of K on the fly as the PO methods proceed, projection of the iterates onto the set K seems to be the first natural approach that comes to mind. However, such a projection may not be computationally efficient or even tractable. For example, projection onto the stability constraint in LQR problems can hardly be computed, as the set K therein is well known to be nonconvex (Fazel et al., 2018; Bu et al., 2019b). Fortunately, such a projection is not needed to preserve the feasibility of the iterates in PG-based methods, as recently reported by Fazel et al. (2018); Bu et al. (2019a). In particular, Bu et al. (2019a) has identified that the cost of LQR has the coercive property, such that it diverges to infinity as the controller K approaches the boundary of the feasible set K. In other words, the cost of LQR serves as a barrier function on K. This way, the level set of the cost becomes compact, and the decrease of the cost ensures the next iterate to stay inside the level set, which further implies the stay inside K. This desired property is further illustrated in Figure 1(a), where K and K′ are two consecutive iterates, and the level set {�K |J (�K) ≤ J (K)} is always separated from the set Kc by some distance δ > 0. Hence, as long as ||K − K′|| < δ, the next iterate K′ still stays in the set K. More importantly, such a separation distance δ can be re-used for the next iterate, as the next level set is at least δ away from Kc. By induction, this allows the existence of a constant stepsize that can guarantee the controllers’ stability along the iterations. It is worth emphasizing that such a property is algorithm-agnostic, in the sense that it is dictated by the cost, and independent of the algorithms adopted, as long as they follow any descent directions of the cost.
Besides the stability constraint, another commonly used one in the control literature is the so-called H∞ constraints. This type of constraints plays a fundamental role in robust control (Zhou et al., 1996; Skogestad and Postlethwaite, 2007; Dullerud and Paganini, 2013; Apkarian et al., 2008) and risk-sensitive control (Whittle, 1990; Glover and Doyle, 1988). Based on the well-known small gain theorem (Zames, 1966; Zhou et al., 1996), such constraints can be used to guarantee robust stability/performance of the closed-loop systems when model uncertainty is at presence. Compared with LQR under the stability constraint, control synthesis under the H∞ constraint leads to a fundamentally different optimization landscape, over which the behaviors of PO methods have not been fully investigated yet. In this paper, we take an initial step towards understanding the theoretical aspects of policy-based RL methods on robust/risk-sensitive control problems.
Specifically, we establish a convergence theory for PO methods on H2 linear control problems with H∞ constraints, referred to as mixed H2/H∞ state-feedback control design in the robust control literature (Glover and Doyle, 1988; Khargonekar and Rotea, 1991; Kaminer et al., 1993; Mustafa and Glover, 1990; Mustafa and Bernstein, 1991; Mustafa, 1989; Apkarian et al., 2008). As the name suggests, the goal of mixed design is to find a robust stabilizing controller that minimizes an upper bound for the H2-norm, under the restriction that the H∞-norm on a certain input-output channel is less than a pre-specified value. The H∞ constraint is explicitly posed here to guarantee the robustness of the closed-loop system to some extent. This general framework also includes risk-sensitive linear control, modeled as linear exponential quadratic Gaussian (LEQG) (Jacobson, 1973; Whittle, 1990) problems as a special case, when a certain upper bound of H2-norm is used. Detailed formulation for such H2 linear control with H∞ constraint is provided in §2. In contrast to LQR, two challenges exist in the analysis of PO methods for mixed design problems. First, by definition of H∞-norm (Zhou et al., 1996), the constraint is defined in the frequency domain, and is hard to impose, for instance, by directly projecting the iterates in that domain, especially in the context of RL when the system model is unknown. Note that preserving the constraint bound of H∞-norm as the controller updates is critical in practice, since violation of it can cause catastrophic consequences on the system. Second, more importantly, the coercive property of LQR fails to hold for mixed design problems, as illustrated in Figure 1(b) (and formally established later). Particularly, the cost value, though undefined outside the set K, remains finite around the boundary of K. Hence, the decrease of cost from K to K′ cannot guarantee that the iterate does not travel towards, and even beyond the feasibility boundary. With no strict separation between the cost level set and Kc, there may not exist a constant stepsize that induces global convergence to the optimal policy.
These two challenges naturally raise the question: does there exist any computationally tractable PO method, which preserves the robustness constraint along the iterations, and enjoys (hopefully global) convergence guarantees? We provide a positive answer to this question in the present work. Our key contribution is three-fold: First, we study the landscape of mixed H2/H∞ design problems for both discrete- and continuous-time settings, and propose three policy-gradient based methods, inspired by those for LQR (Fazel et al., 2018; Bu et al., 2019a). Second, we prove that two of them (the Gauss-Newton method and the natural PG method) enjoy the implicit regularization property, such that the iterates are automatically biased to satisfy the required H∞ constraint. Third, we establish the global convergence of those two PO methods to the globally optimal policy with globally sublinear and locally (super-)linear rates under certain conditions, despite the nonconvexity of the problem. In particular, the two policy search directions always lead to convergence to the global optimum, without getting stuck at any spurious stationary point/local optima. Along the way, we also derive new results on linear risk-sensitive control, i.e., LEQG problems, and discuss the connection of mixed design to zero-sum LQ dynamic games, for designing model-free versions of the algorithms. We expect our work to help pave the way for rigorous understanding of PO methods for general optimal control with H∞ robustness guarantees.
Note that the concept of (implicit) regularization has been adopted in many recent works on nonconvex optimization, including training neural networks (Allen-Zhu et al., 2018; Kubo et al., 2019), phase retrieval (Chen and Candes, 2015; Ma et al., 2017), matrix completion (Chen and Wainwright, 2015; Zheng and Lafferty, 2016), and blind deconvolution (Li et al., 2019), referring to any scheme that biases the search direction of the optimization algorithms. The term implicit emphasizes that the algorithms without regularization may behave as if they are regularized. This property has been advocated as an important feature of gradient-based methods for solving aforementioned nonconvex problems. We emphasize that it is a feature of both the problem and the algorithm, i.e., it holds for certain algorithms
Figure 1: Comparison of the landscapes of LQR and mixed H2/H∞ control design that illustrates the hardness of showing convergence of the latter. The dashed lines represent the boundaries of the constraint sets K. For (a) LQR, K is the set of all linear stabilizing state-feedback controllers; for (b) mixed H2/H∞ control, K is set of all linear stabilizing state-feedback controllers satisfying an extra H∞ constraint on some input-output channel. The solid lines represent the contour lines of the cost J (K). K and K′ denote the control gain of two consecutive iterates; denotes the global optimizer.
that solve certain problems. This is precisely the case in the present work. The specific search directions of the Gauss-Newton and the natural PG methods bias the iterates towards the set of the stabilizing controllers satisfying the H∞ constraint, although no explicit regularization, e.g., projection, is adopted, which contrasts to that the stability-preserving of PO methods for LQR problems is algorithm-agnostic. To the best of our knowledge, our work appears to be the first studying the implicit regularization properties of PO methods for learning-based control in general.
Related Work.
Mixed H2/H∞ & Risk-Sensitive Control. The history of mixed H2/H∞ control design dates back to the seminal works Bernstein and Haddad (1989); Khargonekar and Rotea (1991) for continuous-time settings, built upon the Riccati equation approach and the convex optimization approach, respectively. Such a formula can be viewed as a surrogate/sub-problem of the more challenging H∞-control problem, where the goal is to find the optimal controller that minimizes the H∞-norm (Doyle et al., 1989). These formulation and approaches were then investigated for discrete-time systems in Mustafa and Bernstein (1991); Kaminer et al. (1993). A non-smooth constrained optimization perspective for solving mixed design problems was adopted in Apkarian et al. (2008), with proximity control algorithm designed to handle the constraints explicitly, and convergence guarantees to stationarypoint controllers. Numerically, there also exist other packages for multi-objective H2/H∞ control (Gumussoy et al., 2009; Arzelier et al., 2011) that are based on non-smooth non-convex optimization. However, in spite of achieving impressive numerical performance, these methods have no theoretical guarantees for either the global convergence or the H∞-norm constraint violation. It is also not clear yet how these methods can be made model-free. On the other hand, risk-sensitive control with exponential performance measure was originally proposed by Jacobson (1973) for the linear quadratic case, and then generalized in Whittle (1981); Fleming and Hern´andez-Hern´andez (1997); Borkar and Meyn (2002); Ja´skiewicz et al. (2007). Under certain conditions on the cost, noise, and risk factor, convex optimization perspectives on linear risk sensitive control have been reported in Dvijotham et al. (2014b,a). Recently, first-order optimization methods have also been applied to finite-horizon risk sensitive nonlinear control, but the control inputs (instead of the policy) are treated as decision variables (Roulet et al., 2019). Convergence to stationary points was shown therein for the iterative LEQG algorithm. Interestingly, there is a relationship between mixed design and risk-sensitive control, as established in Glover and Doyle (1988); Whittle (1990). These two classes of problems can also be unified with maximum-entropy H∞ control (Glover and Doyle, 1988; Mustafa, 1989) and zero-sum dynamic games (Jacob- son, 1973; Bas¸ar and Bernhard, 1995). Besides these direct controller/policy search methods, general mixed H2/H∞ control can also be tackled via Youla-parameterization based approaches (Boyd et al., 1988; Scherer, 1995; Chen and Wen, 1995; Hindi et al., 1998; Rotstein and Sznaier, 1998), which lead to convex programming problems that can be solved numerically. However, actual implementation of these approaches either require a finite-horizon truncation of system impulse responses, which loses optimality guarantees (Boyd et al., 1988), or require solution of (a large enough sequence of (Chen and Wen, 1995; Rotstein and Sznaier, 1998)) semi-definite programs or linear matrix inequalities with lifted dimensions (Scherer, 1995; Chen and Wen, 1995; Rotstein and Sznaier, 1998; Hindi et al., 1998), which may not be computationally efficient for large-scale dynamical systems. More importantly, it is not clear yet how to implement these approaches in the data-driven regime, without identifying the model. In contrast, the direct search methods can easily be made model-free, see e.g., our recent attempt Zhang et al. (2021) for robust control design.
Constrained MDP & Safe RL. The mixed design formulation is also pertinent to constrained dynamic control problems, usually modeled as constrained MDPs (CMDPs) (Altman, 1999). However, the constraint in CMDPs is generally imposed on either the expected long-term cost (Borkar, 2005; Achiam et al., 2017; Chow et al., 2018), which shares the additive-in-time structure as the objective, or some risk-related constraint (Di Castro et al., 2012; Chow et al., 2015, 2017). Those constraints are in contrast to the H∞ robustness constraint considered here. Under the CMDP model, various safe RL algorithms, especially PO-based ones, have been developed (Borkar, 2005; Geibel and Wysotzki, 2005; Di Castro et al., 2012; Chow et al., 2015; Achiam et al., 2017; Chow et al., 2018; Yu et al., 2019). It is worth mentioning that except Achiam et al. (2017); Chow et al. (2018), other algorithms cannot guarantee the constraint to be satisfied during the learning iterations, as opposed to our on-the-fly implicit regularization property. Recently, safety constraint has also been incorporated into the LQR model for model-based learning control (Dean et al., 2019b). Several recent model-based safe RL algorithms include Garcia and Fern´andez (2012); Aswani et al. (2013); Akametalu et al. (2014); Berkenkamp et al. (2017).
PO for LQR. PO for LQR stemmed from the adaptive policy iteration algorithm in Bradtke et al. (1994). Lately, studying the global convergence of policy-gradient based methods for LQR (Fazel et al., 2018; Tu and Recht, 2018; Malik et al., 2019; Bu et al., 2019a; Gravell et al., 2019; Mohammadi et al., 2019b,a; Jansch-Porto et al., 2020; Venkataraman and Seiler, 2019) has drawn increasing attention. Specifically, Fazel et al. (2018) first identified the landscape of PO for LQR problems that stationary point implies global optimum, which motivated the development of first-order methods for solving this nonconvex problem. A more comprehensive landscape characterization was then reinforced in Bu et al. (2019a), where the coercive property of LQR cost was explicitly mentioned. Based on these, Malik et al. (2019) advocated two-point zeroth-order methods to improve the sample complexity of model-free PG; Yang et al. (2019) proposed actor-critic algorithms with non-asymptotic convergence guarantees; Tu and Recht (2018) compared the asymptotic behavior of model-based and model-free PG methods, with focus on finite-horizon LQR problems. Recently, the continuous-time setup has been considered in Mohammadi et al. (2019b,a), and the extensions to Markov jump linear quadratic control have been presented in Jansch-Porto et al. (2020). Moreover, Gravell et al. (2019) considered LQR with multiplicative noises, in order to improve the controller’s robustness. The robustness issue for the output feedback case has been further discussed in Venkataraman and Seiler (2019).
Robust RL. Robustness with respect to the model uncertainty/misspecification has long been a significant concern in RL. Indeed, the early attempt for robust RL was based on the notion of H∞ robustness considered here (Morimoto and Doya, 2005), where the uncertainty was modeled as the control of an adversarial agent playing against the nominal controller. This game-theoretic perspective, as we will also discuss in §6.1, enabled the development of actor-critic based algorithms therein, though without theoretical analysis. Such an idea has recently been carried forward in the empirical work Pinto et al. (2017), which proposed PO methods alternating between the two agents. Another line of work follows the robust MDP framework (Nilim and El Ghaoui, 2005; Iyengar, 2005), with RL algorithms developed in Lim et al. (2013); Lim and Autef (2019); Tessler et al. (2019); Mankowitz et al. (2019). However, these algorithms apply to only tabular/small-scale MDPs (not continuous control tasks) and/or do not belong to PO methods that guarantee robustness during learning. More recently, linear control design against adversarial disturbances has also been placed in the online learning context (Cohen et al., 2018; Agarwal et al., 2019b,c) to achieve nearly-optimal regret, where either the dynamics or the cost functions are adversarially changing. Model-based methods also exist for continuous control tasks (Berkenkamp and Schoellig, 2015; Dean et al., 2019a).
Notation. For two matrices A and B of proper dimensions, we use Tr(AB) to denote the trace of AB. For any X ∈ , we use vec(X) ∈
denote the vectorization, the spectral radius, the operator norm, the Frobenius norm, the largest and smallest singular values of X, respectively. If X ∈
is square, we use X > 0 (resp. X ≥ 0) to denote that X is positive definite (resp. nonnegative definite), i.e., for any nonzero vector v ∈
0). For a symmetric matrix X, we use λmax(X) and λmin(X) to denote, respectively, its largest and smallest eigenvalues. We use ⊗ to denote the Kronecker product. We use I and 0 to denote the identity matrix and all-zero matrix of proper dimensions. We use N (µ,Σ) to denote the Gaussian distribution with mean µ and covariance matrix Σ. For any integer m > 0, we use [m] to denote the set of integers {1,··· ,m}. For any complex number c ∈ C, we use ℜc to denote the real part
state-space linear dynamical systems:
which can also be written as G(z) = C(zI −A)−1B+D and G(s) = C(sI −A)−1B+D for discreteand continuous-time systems, respectively. The H∞-norm ∥G∥∞ is then defined as
We first provide some preliminary results on H2 linear control with H∞ robustness guarantees. Throughout this section, and the following sections in the main text, we will focus on systems in discrete time. Counterparts of these results for continuous-time systems are included in Appendix §A.
2.1 Motivating Example: LEQG
We start with an example of risk-sensitive control, the infinite-horizon state-feedback linear exponential quadratic Gaussian problem2 (Jacobson, 1973), which is motivating in that: i) it is closely related to the well-known linear optimal control problems, e.g., LQR and state-feedback LQG; ii) it illustrates the idea of mixed control design, especially introducing the H∞-norm constraint, though implicit, that guarantees robustness. The latter manifests the challenge in the convergence analysis of PO methods for this problem.
Specifically, at time t ≥ 0, the agent takes an action ut ∈ , which leads the system to a new state xt+1 by a linear dynamical system
where A and B are matrices of proper dimensions, x0 ∈ 0 are independent zero-mean Gaussian random variables with positive-definite covariance matrices X0 and W, respectively. The one-stage cost of applying control u at state x is given by c(x,u) = x⊤Qx + u⊤Ru, where Q and R are positive-definite matrices. Then, the long-term cost to minimize is
where β is the parameter that describes the intensity of risk-sensitivity, and the expectation is taken over the randomness of both x0 and wt for all t ≥ 0. The intuition behind the objective (2.1) is that by Taylor series expansion around β = 0,
Hence, if β > 0, the control is risk-averse since minimization also places positive weight on the variance, in addition to the expectation, of the cost; in contrast, if β < 0, the control is referred to as risk-seeking, which encourages the variance to be large. As β → 0, the objective (2.1) reduces to the risk-neutral objective of LQR/state-feedback LQG. Usually LEQG problems consider the case of β > 0. In this sense, LEQG can be viewed as a generalization of LQR/state-feedback LQG problems.
The goal of LEQG is to find the optimal control policy µt : (in general is a mapping from the history of state-action pairs till time t and current state xt, to the action ut in
, that minimizes the cost in (2.1). By assuming that such an optimal policy exists, the limsup in (2.1) can be replaced by lim. Moreover, we can show, see a formal statement in Lemma C.2 in §C, that the optimal control has a desired property of being memoryless and stationary, i.e., linear time-invariant (LTI), and current state-feedback, i.e., µt(x0:t,u0:t−1) = µ(xt) = −Kxt, for some K ∈
. Hence, it suffices to optimize over the control gain K, without loss of optimality, i.e.,
2.1.1 Cost Closed-Form
To solve (2.2) with PO methods, it is necessary to establish the closed-form of the objective with respect to K. To this end, we introduce the following algebraic Riccati equation
for given control gain K. If β → 0, (2.3) reduces to the Lyapunov equation of policy evaluation for given K in LQR problems. For notational simplicity, we also define �PK as
Then, the objective J (K) can be expressed by the solution to (2.3), PK, as follows.
Lemma 2.1. For any stabilizing LTI state-feedback controller ut = −Kxt, such that the Riccati equation (2.3) admits a solution PK ≥ 0 that: i) is stabilizing, i.e., ρ�(A−BK)⊤(I −βPKW)−1�< 1, and ii) satisfies W −1 − βPK > 0, J (K) has the form of
Note that when β → 0, the objective (2.5) reduces to Tr(PKW), the cost function for LQG problems.
Remark 2.2 (New Results on LEQG). To the best of our knowledge, our results on that the optimal controller is LTI state-feedback in Lemma C.2, and on the form of the objective J (K) in Lemma 2.1, though expected, have not been rigorously established for LEQG problems in the literature. For completeness, we present a self-contained proof in §C. Interestingly, the former argument has been hypothesized in Section 3 of Glover and Doyle (1988); while the form of J (K) in (2.5) connects to the performance criterion for more general optimal control problems with robustness guarantees, as to be shown shortly.
2.1.2 Implicit Constraint on H∞-Norm
Seemingly, (2.2) is an unconstrained optimization over K. However, as identified by Glover and Doyle (1988), there is an implicit constraint set for this problem, which corresponds to the lower-level set of the H∞-norm of the closed-loop transfer function under the linear stabilizing controller u = −Kx. We reiterate the result as follows.
Lemma 2.3 (Glover and Doyle (1988)). Consider the LEQG problem in (2.2) that finds the optimal stationary state-feedback control gain K, and a closed-loop transfer function from the noise {wt} to the output, T (K), as
Then, the feasible set of J (K) is the intersection of the set of linear stabilizing feedback controllers and the 1/�β-lower-level set of the H∞-norm of T (K), i.e.,�K���ρ(A − BK) < 1, and ∥T (K)∥∞ <
1/�β�.
Proof. The result follows by applying the results in Section 3 in Glover and Doyle (1988), writing out the transfer function, and replacing the θ therein by the −β here.
We note that the feasible set for LEQG in Lemma 2.3 may not necessarily be bounded. This feasible set, though quite concise to characterize, is hard to enforce directly onto the control gain K, since it is a frequency-domain characterization using the H∞-norm. To develop PO algorithms for finding K, the time-domain characterization in Lemma 2.1 is more useful. Interestingly, as we will show shortly, the conditions that lead to the form of J (K) in Lemma 2.1 are indeed equivalent to the feasible set given by H∞-norm constraint in Lemma 2.3; see Remark 2.8.
In fact, this reformulation of LEQG as a constrained optimization problem, belongs to a general class problems, mixed H2/H∞ control design with state-feedback.
2.2 Bigger Picture: Mixed H2/H∞ Control Synthesis
Consider the following discrete-time linear dynamical system with a single input-output channel
where xt ∈ denote the states and controls, respectively, wt ∈
is the disturbance, zt ∈
is the controlled output, and A,B,C,D,E are matrices of proper dimensions. Consider the admissible control policy µt to be a mapping from the history of state-action pairs till time t and the current state xt to action ut. It has been shown in Kaminer et al. (1993) that, LTI state-feedback controller (without memory) suffices to achieve the optimal performance of mixed H2/H∞ design under this state-feedback information structure3. As a consequence, it suffices to consider only stationary, current state-feedback controller parametrized as ut = −Kxt.
Remark 2.4 (Justification of LTI Control for LEQG). As to be shown shortly, LEQG is a special case of mixed H2/H∞ design. Hence, the result we derived in Lemma C.2, i.e., the optimal controller of LEQG is indeed LTI, is consistent with this earlier result on mixed design from Khar- gonekar and Rotea (1991); Kaminer et al. (1993).
In accordance with this parametrization, the transfer function from the disturbance wt to the output zt can be represented as �
In common with Glover and Doyle (1988); Khargonekar and Rotea (1991); Bas¸ar and Bern- hard (1995), we make the following assumption on the matrices A,B,C,D and E.
Assumption 2.5. The matrices A,B,C,D,E in (2.7) satisfy E⊤[C E] = [0 R] for some R > 0.
Assumption 2.5 is fairly standard, which clarifies the exposition substantially by normalising the control weighting and eliminating cross-weightings between control signal and state (Bas¸ar and Bernhard, 1995). Hence, the transfer function in (2.7) has the equivalent form4 of
Hence, robustness of the designed controller can be guaranteed by the constraint on the H∞-norm, i.e., ∥T (K)∥∞ < γ for some γ > 0. The intuition behind the constraint, which follows from small gain theorem (Zames, 1966), is that the constraint on ∥T (K)∥∞ implies that the closed-loop system is robustly stable in that any stable transfer function ∆ satisfying ∥∆∥ℓ2→ℓ2 < 1/γ may be connected from zt back to wt without destablizing the system. For more background on H∞ control, see Bas¸ar and Bernhard (1995); Zhou et al. (1996). For notational convenience, we define the feasible set of mixed H2/H∞ control design as
We note that the set K may be unbounded.
where PK is the solution to the following Riccati equation
with �PK defined as
Remark 2.6 (LEQG as a Special Case of Mixed-Design). By Lemma 2.3, replacing β, W, and Q in LEQG by γ−2, DD⊤ and C⊤C, respectively, yields the formulation of mixed H2/H∞ design. In particular, the closed-form cost of LEQG that we derived for the first time, see Lemma 2.1, is identical to the cost in (2.11); and the implicit constraint of LEQG in Lemma 2.3 is exactly the H∞-norm constraint in (2.9). Thus, LEQG is a mixed-design problem with D = W 1/2 and J (K) being (2.11). Note that DD⊤ = W > 0 for LEQG.
All three objectives in (2.10)-(2.12) are upper bounds of the H2-norm (Mustafa, 1989; Mustafa and Bernstein, 1991). In particular, cost (2.10) has been adopted in Bernstein and Haddad (1989); Haddad et al. (1991), which resembles the standard H2 control/LQG control objective, but with PK satisfying a Riccati equation instead of a Lyapunov equation. Cost (2.11) is closely related to maximum entropy H∞-control, see the detailed relationship between the two in Mustafa and Glover (1990). In addition, cost (2.12) can also be connected to the cost of LQG using a different Riccati equation (Mustafa and Bernstein, 1991, Remark 2.7). As γ → ∞, the costs in all (2.10)-(2.12) reduce to the cost for LQG, i.e., H2 control design problems.
In sum, the mixed H2/H∞ control design can be formulated as
with J (K) and K defined in (2.10)-(2.12) and (2.9), respectively.
2.3 Bounded Real Lemma
Though the constraint (2.9) is concise, it is hard to enforce over K in policy optimization, since the constraint is defined in the frequency domain. Interestingly, by using a significant result in robust control theory, i.e., Bounded Real Lemma (Bas¸ar and Bernhard, 1995, Chapter 1), (Zhou et al., 1996; Rantzer, 1996), constraint (2.9) can be related to the solution of a Riccati equation and a Riccati inequality. We formally introduce the result as follows, whose proof is deferred to §B.1.
Lemma 2.7 (Discrete-Time Bounded Real Lemma). Consider a discrete-time transfer function T (K) defined in (2.8), suppose K is stabilizing, i.e., ρ(A − BK) < 1, then the following conditions are equivalent:
• ∥T (K)∥∞ < γ, which, due to ρ(A − BK) < 1, further implies K ∈ K with K defined in (2.9).
• The Riccati equation (2.13) admits a unique stabilizing solution PK ≥ 0 such that: i) I − γ−2D⊤PKD > 0; ii) (I − γ−2PKDD⊤)−⊤(A − BK) is stable;
• There exists some P > 0, such that
The three equivalent conditions in Lemma 2.7 will be frequently used in the ensuing analysis. Note that the unique stabilizing solution to (2.13) for any K ∈ K, is also minimal, if the pair (A − BK,D) is stabilizable, see (Ran and Vreugdenhil, 1988, Theorem 3.1). This holds since K ∈ K is indeed stabilizing. Thus, the optimal control that minimizes (2.10)-(2.12), which are all monotonically increasing with respect to PK, only involves the stabilizing solution PK. Hence, it suffices to consider only stabilizing solution PK of the Riccati equation (2.13) for LEQG.
Remark 2.8 (Necessity of Lemma 2.1). By Lemma 2.7 and Remark 2.6, the conditions in Lemma 2.1 are equivalent to the H∞-norm constraint in (2.9) for LEQG. This implies that these conditions are not only sufficient for the form of J (K) in Lemma 2.1 to hold, but also necessary. In other words, any feasible K ∈ K should lead to the form of J (K) in (2.5).
Next, we develop policy optimization algorithms for solving the mixed H2/H∞ control problem in (2.15).
In this section, we investigate the optimization landscape of mixed H2/H∞ control design, and develop policy optimization algorithms with convergence guarantees. In particular, we study both discrete- and continuous-time settings focusing on two representative example costs J (K) from (2.11) and (2.10), respectively.5 The first combination of settings also by chance solves the discrete-time LEQG problems introduced in §2.1. The second combination for continuous-time settings is discussed in §A.
3.1 Optimization Landscape
We start by showing that, regardless of the cost J (K), the mixed-design problem in (2.15) is a nonconvex optimization problem.
Lemma 3.1 (Nonconvexity of Discrete-Time Mixed H2/H∞ Design). The discrete-time mixed H2/H∞ design problem (2.15) is nonconvex.
The proof of Lemma 3.1 is deferred to §B.2. In particular, we show by an easily-constructed example that the convex combination of two control gains K and K′ in K may no longer lie in K. As a result, this nonconvexity poses challenges in solving (2.15) using standard policy gradient-based approaches. Note that similar nonconvexity of the constraint set also exists in LQR problems (Fazel et al., 2018; Bu et al., 2019a), and has been recognized as one of the main challenges to address. Still, the landscape of LQR has some desired property of being coercive (Bu et al., 2019a, Lemma 3.7), which played a significant role in the analysis of PO methods for LQR. However, we establish in the following lemma that such a coercivity does not hold for mixed design problems.
Lemma 3.2 (No Coercivity of Discrete-Time Mixed H2/H∞ Design). The cost functions (2.10)- (2.12) for discrete-time mixed H2/H∞ design are not coercive. Particularly, as K → ∂K, where ∂K is the boundary of the constraint set K, the cost J (K) does not necessarily approach infinity.
The proof of Lemma 3.2 is provided in §B.3. The key of the argument is that for given K ∈ K, the policy evaluation equation for mixed design problems is a Riccati equation, see (2.13) (a quadratic equation of PK in 1-dimensional case); while for LQR problems, the policy evaluation equation is a Lyapunov equation, which is essentially linear. Hence, some additional condition on K is required for the existence of the solution, which can be restricter than the conditions on K and PK that makes the cost J (K) finite. In this case, the existence condition of the solution characterizes the boundary of K, which leads to a well-defined PK, and thus a finite value of the cost J (K), even when K approaches the boundary ∂K.
The lack of coercivity turns out to be the greatest challenge when analyzing the stability/feasibility of PO methods for mixed control design, in contrast to LQR problems. Detailed discussion on this is provided in §4.1. The illustration in Figure 1 in §1 of the landscape of mixed design problems was actually based on Lemmas 3.1 and 3.2. We then show the differentiability of J (K) at each K within the feasible set K, and provide the closed-form of the policy gradient. Here we focus on the most complicated objective defined in (2.11) among the three in (2.10)-(2.12), due to its direct connection to the risk-sensitive control problem; see Remark 2.6. We note that the proof can be used directly to establish similar results for the other two objectives, namely, (2.10) and (2.12), too.
Lemma 3.3. The cost J (K) defined in (2.11) is differentiable in K for any K ∈ K, and the policy gradient has the following form:
where ∆K ∈ is a matrix given by
and �PK is defined in (2.14).
The proof of Lemma 3.3 is provided in §B.4. Note that Lemma 3.3 also implies some property on the landscape of J (K). Specifically, if ∆K > 0 is full-rank, then ∇J (K) = 0 admits a unique solution K = (R+B⊤�PKB)−1B⊤�PKA, which corresponds to the unique global optimum. Otherwise, if ∆K ≥ 0 is not full-rank, there can be multiple stationary points. Yet, the global optimum is still of the same form. We formally establish this in the following proposition, which is proved in §B.6.
Proposition 3.4. Suppose that the discrete-time mixed H2/H∞ design admits a global optimal solution K∗ ∈ K; then, one such solution has the form of K∗ = (R + B⊤�PK∗B)−1B⊤�PK∗A. Additionally, if the pair�(I −γ−2PKDD⊤)−⊤(A −BK),D�is controllable at some stationary point of J (K), such that ∇J (K) = 0, then this is the unique stationary point, and corresponds to the unique global optimizer K∗.
The form of the optimal control gain K∗ above echoes back that of the solution to finite-horizon LEQG (Jacobson, 1973), (Whittle, 1990, Chapter 7). Note that for LEQG problems, D = W 1/2 > 0 implies that the controllability condition holds automatically. Thus, this K∗ corresponds to the unique global optimizer. We also remark that the landscape result above can also be shown for the other two objectives (2.10) and (2.12). In fact, the key in proving Proposition 3.4 is to show that PK∗ is matrix-wise minimal in the positive semi-definite sense for all PK with K ∈ K. Note that since the objectives (2.10) and (2.12) are both monotonically non-decreasing in the eigenvalues of PK, one can verify that the K∗ is also the global optimizer. Note that K∗ may not be the unique global minimizer without the controllability assumption. Finally, following the proof of Lemma 3.3, one can show that the policy gradients for (2.10) and (2.12) yield an almost identical form as in Lemma 3.3, except the definition of ∆K. Although the controllability assumption has been made in the literature (Mustafa and Bernstein, 1991), and is also satisfied automatically by LEQG problems, we will show next that our PO methods can find the global optimum K∗ even without this assumption.
3.2 Policy Optimization Algorithms
Consider three policy-gradient based methods as follows. For simplicity, we define
We also suppress the iteration index, and use K and K′ to represent the control gain before and after one-step of the update.
Policy Gradient: K′ = K − η∇J (K) = K − 2ηEK∆K (3.3) Natural Policy Gradient: K′ = K − η∇J (K)∆−1K = K − 2ηEK (3.4) Gauss-Newton: K′ = K − η(R + B⊤�PKB)−1∇J (K)∆−1K
where η > 0 is the stepsize. The updates are motivated by and resemble the policy optimization updates for LQR (Fazel et al., 2018; Bu et al., 2019a), but with PK therein replaced by �PK. The natural PG update is related to gradient over a Riemannian manifold; while the Gauss-Newton update is one type of quasi-Newton update, see Bu et al. (2019a) for further justifications on the updates. In particular, with η = 1/2, the Gauss-Newton update (3.5) can be viewed as the policy iteration update for infinite-horizon mixed H2/H∞ design. Modelfree versions of the PG update (3.3) can be directly obtained, since the gradient ∇J (K) can be estimated by sampled data, using for instance zeroth-order methods, as in Fazel et al. (2018); Malik et al. (2019). A direct model-free implementation of the natural PG update (3.4) using zeroth-order optimization methods requires estimating the matrix ∆K. It is not clear yet how to estimate it from the sampled trajectories. Instead, we propose one solution by the connections between mixed design and zero-sum linear quadratic games; see §6 for more details. Finally, as in LQR problems, the Gauss-Newton update (3.5) cannot yet be estimated using zeroth-order methods directly.
In this section, we investigate the convergence of the PO methods proposed in §3.
4.1 Implicit Regularization
The first key challenge in the convergence analysis for PO methods, is to ensure that the iterates remain feasible as the algorithms proceed, hopefully without the use of projection. This is especially significant in mixed design problems, as the feasibility here means robust stability, the violation of which can be catastrophic in practical online control design. We formally define the concept of implicit regularization to describe this feature.
Definition 4.1 (Implicit Regularization). For mixed H2/H∞ control design problem (2.15), suppose an iterative algorithm generates a sequence of control gains {Kn}. If Kn ∈ K for all n ≥ 0, this algorithm is called regularized; if it is regularized without projection onto K for any n ≥ 0, this algorithm is called implicitly regularized.
Remark 4.2. The concept of (implicit) regularization has been adopted in many recent studies on nonconvex optimization, including training neural networks (Allen-Zhu et al., 2018; Kubo et al., 2019), phase retrieval (Chen and Candes, 2015; Ma et al., 2017), matrix completion (Chen and Wainwright, 2015; Zheng and Lafferty, 2016), and blind deconvolution (Li et al., 2019), referring to any scheme that biases the search direction of gradient-based algorithms. Implicit regularization has been advocated as an important feature of (stochastic) gradient descent methods for solving these problems, which, as the name suggests, means that the algorithms without regularization may behave as if they are regularized. Note that the term regularization may refer to several different schemes in different problems, e.g., trimming/truncation the gradient, adding a regularization term in the objective, etc. Here we focus on the scheme of projection, as summarized in Ma et al. (2017). Also note that implicit regularization is a feature of both the problem and the algorithm, i.e., it holds for certain algorithms that solve certain nonconvex problems.
One possible way for the iterates to remain feasible is to keep shrinking the stepsize, whenever the next iterate goes outside K, following for example the Armijo rule (Bertsekas, 1976). However, as the cost J (K) is not necessarily smooth (see Lemma 5.1 and its discussion later), it may not converge within a finite number of iterations (Hintermuller and Von Neumann Haus, 2010, Theorem 3.2). Another option is to project the iterate onto K. Nonetheless, it is challenging to perform projection onto the H∞-norm constraint set directly in the frequency domain.
For LQR problems, due to the coercivity of the cost that as K approaches the boundary of the stability/feasibility region {K ∈ 1, the cost blows up to infinity, and due to the fact that the cost is continuous with respect to K, the lower-level set of the cost is compact (Makila and Toivonen, 1987) and is contained within the stability region. As a consequence, there is a strict separation between any lower-level set of the cost and the set {K ∈
. Hence, as discussed in the introduction, there exists a constant stepsize such that as long as the initialization control is stabilizing, the iterates along the path remain stabilizing and keep decreasing the cost. Such a property is algorithm-agnostic in that it is dictated by the property of the cost, and independent of the algorithms adopted, as long as they follow any descent directions of the cost. The stability proofs in Fazel et al. (2018); Bu et al. (2019a) for LQR are essentially built upon this idea.
In contrast, for mixed H2/H∞ design problems, lack of coercivity invalidates the argument above, as the control approaching the robustness constraint boundary ∂K may incur a finite cost, and the descent direction may still drive the iterates out of the feasibility region. In addition, there may not exist a strict separation between all the lower-level sets of the cost and the complementary set Kc. This difficulty has been illustrated in Figure 1 in the introduction, which compares the landscapes of the two problems. Interestingly, we show in the following theorem that the natural PG and Gauss-Newton methods in (3.4)-(3.5) enjoy the implicit regularization feature, with certain constant stepsize. We only highlight the idea of the proof here, and defer the details to §5.1. The proof includes two main steps. First, we directly use PK to construct a Lyapunov function for K′ to show a non-strict Riccati inequality that guarantees ∥T (K′)∥∞ ≤ γ. Second, we further perturb PK in a specific way to show the strict inequality ∥T (K′)∥∞ < γ. The perturbation argument is inspired by the proof of the Kalman-Yakubovich-Popov (KYP) Lemma (Dullerud and Paganini, 2013).
Theorem 4.3 (Implicit Regularization for Discrete-Time Mixed Design). For any control gain K ∈ K, i.e., ρ(A − BK) < 1 and ∥T (K)∥∞ < γ, with ∥K∥ < ∞, suppose that the stepsize η satisfies:
• Natural policy gradient (3.4): η ≤ 1/(2∥R + B⊤�PKB∥),
• Gauss-Newton (3.5): η ≤ 1/2.
Then the K′ obtained from (3.4)-(3.5) also lies in K. Equivalently, K′ is stabilizing, i.e., ρ(A − BK′) < 1, and satisfies that: i) there exists a solution PK′ ≥ 0 to the Riccati equation (2.13); ii) I − γ−2D⊤PK′D > 0; iii) ρ�(I − γ−2PK′DD⊤)−⊤(A − BK′)�< 1.
Proof Sketch. The general idea, contrary to the coercivity-based idea that works for any descent direction, is that we focus on the feasibility of K′ after an update along certain directions: either (3.4) or (3.5). By Bounded Real Lemma, i.e., Lemma 2.7, the feasibility condition for K′, if K′ is stabilizing, is equivalent to the existence of P > 0 such that the linear matrix inequalities (LMIs) in (2.16) hold for K′. Moreover, it is straightforward to see that such a P > 0, if exists, satisfies (A − BK′)⊤P(A − BK′) − P < 0, which can be used to show that K′ is stabilizing (Boyd et al., 1994). Thus, it now suffices to find such a P.
To show this, we first study the case with stepsizes being the upper bound in the theorem, i.e., η = 1/2 for Gauss-Newton and η = 1/(2∥R + B⊤�PKB∥) for natural PG. As the solution to the Riccati equation (2.13) under K, PK ≥ 0 satisfies I − γ−2D⊤PKD > 0, the first LMI in (2.16). Hence, it may be possible to perturb PK to obtain a P > 0, such that the equality in (2.13) becomes a strict inequality of the second LMI in (2.16), while preserving the first LMI. Moreover, if K′ is not too far away from K, such a perturbed PK should also work for K′. Such an observation motivates the use of PK as the candidate of P for the LMIs in (2.16) under K′.
Indeed, it can be shown that substituting P = PK makes the second LMI in (2.16) under K′ non-strict, namely, the left-hand side (LHS) ≤ 0; see (5.5) in the detailed proof. To make it strict, consider the perturbed P = PK + α ¯P for some α > 0, where ¯P > 0 is the solution to some Lyapunov equation
Such a Lyapunov equation (4.1) always admits a solution ¯P > 0, since K ∈ K implies that (I − γ−2DD⊤PK)−1(A − BK) is stable. The intuition of choosing (4.1) is as follows. First, the LHS of the second LMI in (2.16) under K′ can be separated as
By some algebra, the first term 1 is of order o(α). Since for small α,
this, combined with the Riccati equation (2.13) and (4.1), makes the second term 2 = −αI + o(α). Hence, there exists small enough α > 0 such that 1 + 2 < 0, ensuring that the updated K′ is feasible. Lastly, by the linearity of LMIs, any interpolation of K′ with a smaller stepsize is also feasible/robustly stable, thus completing the proof.
Note that theoretically, it is not clear yet if vanilla PG enjoys implicit regularization. In the worst-case, as discussed right after Remark 4.2, vanilla PG may take infinitely many iterations to converge. Hence, hereafter, we only focus on the global convergence of natural PG method (3.4) and Gauss-Newton method (3.5), with constant stepsizes.
4.2 Global Convergence
The term global convergence here refers to two notions: i) the convergence performance of the algorithms starting from any feasible initialization point K0 ∈ K; ii) convergence to the global optimal policy under certain conditions. We formally establish the results for the natural PG (3.4) and Gauss-Newton (3.5) updates in the following theorem.
Theorem 4.4 (Global Convergence for Discrete-Time Mixed Design). Suppose that K0 ∈ K and ∥K0∥ < ∞. Then, under the stepsize choices6 as in Theorem 4.3, both updates (3.4) and (3.5) converge to the global optimum K∗ = (R + B⊤�PK∗B)−1B⊤�PK∗A, in the sense that the average of {∥EKn∥2F} over iterations converges to zero with O(1/N) rate.
The proof of Theorem 4.4 is detailed in §5.2. We remark that, the controllability assumption made in Proposition 3.4 is not required for the global convergence here. Remarkably, there might be multiple stationary points such that ∇J (K) = 0, while the two specific policy search directions (3.4) and (3.5) provably avoid the suboptimal local minima, and always converge to the global optimum K∗. This can be viewed as another implication of implicit regularization, in that (3.4) and (3.5) always bias the iterates towards a certain global optimal solution, without getting stuck at spurious local minima. The key reason is that, without using the curvature information in ∆K, these two PO methods can converge to the specific and optimal stationary point such that EK = 0, instead of any arbitrary stationary point.
Moreover, in contrast to the results for LQR (Fazel et al., 2018), only globally sublinear O(1/N), instead of linear, convergence rate can be obtained so far. This O(1/N) rate of the (iteration average) gradient norm square matches the global convergence rate of gradient descent and second order algorithms to stationary points for general nonconvex optimization, either under the smoothness assumption of the objective (Cartis et al., 2010, 2017), or for a class of non-smooth objectives (Khamaru and Wainwright, 2018).
Remark 4.5 (Robust Initial Controller). Our global convergence requires the initial controller to satisfy the H∞-norm robustness constraint, which, as the assumption on the initial controller being stabilizing for LQR (Fazel et al., 2018; Bu et al., 2019a), is inherent to PO methods with iterative local search. Complementary to Fazel et al. (2018); Bu et al. (2019a), our iterates not only improve the performance criterion, but also preserve the robustness.
Though sublinear globally, much faster rates, i.e., (super-)linear rates, can be shown locally around the optimum as below. Proof of the following theorem is deferred to §5.3.
Theorem 4.6 (Local (Super-)Linear Convergence for Discrete-Time Mixed Design). Suppose that the conditions in Theorem 4.4 hold, and additionally DD⊤ > 0 holds. Then, under the stepsize choices as in Theorem 4.4, both updates (3.4) and (3.5) converge to the optimal control gain K∗ with locally linear rate, in the sense that the objective {J (Kn)} defined in (2.11) converges to J (K∗) with a linear rate. In addition, if η = 1/2, the Gauss-Newton update (3.5) converges to K∗ with a locally Q-quadratic rate.
Key to the locally linear rates is that the property of gradient dominance (Polyak, 1963; Nesterov and Polyak, 2006) holds locally around the optimum for mixed design problems. Such a property has been shown to hold globally for LQR problems (Fazel et al., 2018), and also hold locally for zero-sum LQ games (Zhang et al., 2019b). The Q-quadratic rate echoes back the rate of Gauss-Newton with η = 1/2 for LQR problems (Hewer, 1971; Bu et al., 2019a). This globally sublinear and locally (super-)linear convergence resembles the behavior of (Quasi)-Newton methods for nonconvex optimization (Nesterov and Polyak, 2006; Ueda and Yamashita, 2010), and policy gradient methods for zero-sum LQ games (Zhang et al., 2019b).
Remark 4.7 (Comparison to Zhang et al. (2019b)). Due to the close relationship between mixed design and zero-sum LQ games, see §6, one may compare the convergence results and find the rates here (globally sublinear and locally linear) not improved over Zhang et al. (2019b). However, one key difference is that an extra projection step is required to guarantee the stability of the system in Zhang et al. (2019b), which is essentially to regularize the iterates explicitly. More importantly, such a projection can only be calculated under more restrictive assumptions (see Assumption 2.1 therein), which, though cover a class of LQ games, are not standard in robust control. Here, similar convergence results are established, without projections or non-standard assumptions in robust control, thanks to implicit regularization. Moreover, we have established the local “superlinear” rate for the Gauss-Newton update, and whole new set of results for the “continuous-time” setup, which were not studied in Zhang et al. (2019b).
In this section, we provide detailed proofs for the main results of the paper.
5.1 Proof of Theorem 4.3
To show that K′ lies in K, we first argue that it suffices to find some P > 0 such that
where �P := P + PD(γ2I − D⊤PD)−1D⊤P. By Schur complement, showing (5.1)-(5.2) is also
equivalent to showing
Obviously, if such a P exists, we denote the LHS of (5.2) by −M < 0. Thus, (5.1) and (5.2) imply
which shows that K′ is stabilizing, i.e., ρ(A − BK′) < 1 (Boyd et al., 1994). Thus, Lemma 2.7 can be applied to K′. Then, (5.1) and (5.2) are identical to (2.16), which further shows that ∥T (K′)∥∞ < γ and thus shows K′ ∈ K. Hereafter we will focus on finding such a P > 0.
We first show that for the Gauss-Newton update (3.5) with stepsize η = 1/2, (5.1) and (5.2) hold for some P > 0. Specifically, we have
Since PK ≥ 0 satisfies the conditions i)-iii), and K and K′ are close to each other, we can choose PK as a candidate of P. Hence, by Riccati equation (2.13), the LHS of (5.2) can be
written as
where we substitute K′ from (5.4), and the last equation is due to completion of the squares. Now we need to perturb PK to obtain a P, such that (5.5) holds with a strict inequality. To this end, we define ¯P > 0 as the solution to the Lyapunov equation
and let P = PK + α ¯P > 0 for some α > 0. By Lemma 2.7, (I − γ−2DD⊤PK)−1(A − BK) is stable, and thus the solution ¯P > 0 exists. For (5.1) to hold, we need a small α > 0 to satisfy
Moreover, the LHS of (5.2) now can be written as
where we aim to show that there exists some α > 0 such that 1 + 2 < 0. Note that �P can be written as
= �PK + (I − γ−2PKDD⊤)−1(α ¯P)(I − γ−2DD⊤PK)−1 + o(α), (5.9)
where the first equation follows from definition, and the second one uses the fact that
for small perturbation Y around the matrix X. Thus, 1 can be written as
1 = −K′⊤B⊤�PA − A⊤�PBK′ + K′⊤(R + B⊤�PB)K′ + K⊤B⊤�PA + A⊤�PBK − K⊤(R + B⊤�PB)K ≤ −K′⊤B⊤�PA − A⊤�PBK′ + K′⊤(R + B⊤�PB)K′ + A⊤�PB(R + B⊤�PB)−1B⊤�PA
where the inequality follows by completing squares. By substituting in K′ from (5.4), we further have
Note that by (5.9), we have
where O(α) denotes the quantities that have the order of α. By plugging (5.12) into (5.11), we obtain that 1 = o(α). Moreover, by (5.9), 2 can be written as
where the first equation uses (5.9), the second one uses the Riccati equation (2.13), and the last one uses (5.6). Therefore, for small enough α > 0 such that 1 + 2 < 0, and also satisfies (5.7), there exists some P > 0 such that both (5.1) and (5.2) hold for K′ obtained with stepsize
η = 1/2. On the other hand, such a P also makes the LMI (5.3) hold for K, i.e.,
as now 1 in (5.8) is null, and the same α above makes 2 < 0. For η ∈ [0,1/2], let Kη = K + 2η(K′ − K) be the interpolation between K and K′. Combining (5.3) and (5.14) yields
for any η ∈ [0,1/2]. The second inequality in (5.15) uses the convexity of the quadratic form. Hence, (5.15) shows that for any stepsize η ∈ [0,1/2], Kη that lies between K and K′ satisfies the conditions i)-iii) in the theorem.
Now we prove a similar result for the natural PG update (3.4). Recall that
As before, we first choose P = PK. Then, the LHS of (5.2) under K′ can be written as:
where the equation holds by adding and subtracting [K−(R+B⊤�PKB)−1B⊤�PKA]⊤(R+B⊤�PKB)[K′− (R + B⊤�PKB)−1B⊤�PKA]. Substituting (5.16) into (5.17) yields
By requiring the stepsize η to satisfy
we can bound (5.18) as
namely, letting P = PK leads to the desired LMI that is not strict.
Now suppose that P = PK + α ¯P for some α > 0, where ¯P > 0 is the solution to (5.6). Note that α first still needs to satisfy (5.7). Also, the LHS of (5.2) can still be separated into 1 and 2 as in (5.8). From the LHS of the inequality in (5.10), we have
where the first equation follows by adding and subtracting [(R + B⊤�PB)−1B⊤�PA − K]⊤(R + B⊤�PB)[(R + B⊤�PB)−1B⊤�PA − K′], and the second one follows from the definition of K′ in (5.16). Suppose that η satisfies (5.19), then we have
where the equation follows by separating �P as �PK + α ¯P in (5.21). Notice that the first two terms on the right-hand side (RHS) of (5.22) are identical to the RHS of (5.18). Thus, the inequality (5.23) is due to (5.20). Moreover, notice that
which combined with (5.23) further yields
By assumption ∥K∥ < ∞ and PK ≥ 0 exists, we know that �PK is bounded, and so is K′ obtained from (5.16) using a finite stepsize η. Also, ¯P has bounded norm. Thus, 1 in (5.24) is o(α). Thus, there exists small enough α > 0 such that 1 + 2 < 0, since from (5.13), 2 = −αI +o(α). In words, there exists some P > 0 such that (5.3) holds for K′ obtained from (5.16) with stepsize satisfying (5.19). This completes the proof of the first argument.
Lastly, by Lemma 2.7, we equivalently have that the conditions i)-iii) in the theorem hold for K′, which completes the proof.
5.2 Proof of Theorem 4.4
We first introduce the following lemma that can be viewed as the counterpart of the Cost Difference Lemma in Fazel et al. (2018). Unlike the equality relation given in the lemma in Fazel et al. (2018), we establish both lower and upper bounds for the difference of two matrices PK′ and PK. The proof of the lemma is provided in §B.7.
Lemma 5.1 (Discrete-Time Cost Difference Lemma). Suppose that both K,K′ ∈ K. Then, we have the following upper bound:
where EK is defined in (3.2). If additionally ρ�(A − BK′)⊤(I − γ−2PKDD⊤)−1�< 1, then we also have the lower bound:
Notice that Lemma 5.1 also resembles the “Almost Smoothness Condition” (Lemma 9) in Fazel et al. (2018), which characterizes how the difference between PK and PK′ relies on the difference between K and K′. Due to the difference between �PK and PK, we have to establish lower and upper bounds of PK′ − PK separately. One can still identify that the leading terms in both bounds depend on ∥K′−K∥, with the remaining terms being in the order of o(∥K′−K∥) if K′ is close to K.
Gauss-Newton:
Recall that for the Gauss-Newton update, K′ = K − 2η(R + B⊤�PKB)−1EK. By Theorem 4.3, K′ also lies in K if η ≤ 1/2. Then, by the upper bound in (5.25), we know that if η ∈ [0,1/2],
which implies the monotonic decrease of PK (matrix-wise) along the update. Since PK is lower-bounded, such a monotonic sequence of {PKn} along the iterations must converges to some PK∞ ∈ K. Now we show that this PK∞ is indeed PK∗. By multiplying both sides of (5.27) with any matrix M > 0, and then taking the trace, we have that if η ∈ [0,1/2]
where the second inequality follows by keeping only the first term in the infinite summation of positive definite matrices, the third one uses that Tr(PA) ≥ σmin(A)Tr(P), and the last one is due to the monotonic decrease of PK, and the monotonicity of �PK with respect to PK, with K0 ∈ K being the initialization of K at iteration 0. From iterations n = 0 to N − 1, replacing M by I, summing over both sides of (5.28), and dividing by N, we further have
namely, the sequence {Kn} converges to the stationary point K such that EK = 0 with O(1/N) rate. By Proposition 3.4, this is towards the global optimal control gain K∗.
≤ 0, (5.29)
which also implies the matrix-wise monotonic decrease of PK along the update. Suppose the convergent matrix is PK∞. As before, multiplying both sides of (5.29) by M > 0, and taking the trace, yields
for any M > 0, where the inequality follows by only keeping the first term in the infinite summation. Letting M = I, summing up (5.30) from n = 0 to n = N − 1, and dividing by N, we conclude that
namely, {Kn} converges to the stationary point K such that EK = with O(1/N) rate, which is also the global optimum. In addition, since {PKn} is monotonically decreasing, it suffices to require the stepsize
which completes the proof.
5.3 Proof of Theorem 4.6
To ease the analysis, we show the convergence rate of a surrogate value Tr(PKDD⊤). This is built upon the following relationship between the objective value J (K) and Tr(PKDD⊤).
Lemma 5.2. Suppose that both K,K′ ∈ K and PK ≥ PK′. Then, it follows that
Proof. First, by Sylvester’s determinant theorem, J (K) can be re-written as
By the mean value theorem, for any matrices A and B with det(A),det(B) > 0, we have
for some 0 ≤ τ ≤ 1. This leads to
J (K) − J (K′) = −γ2 logdet(I − γ−2D⊤PKD) + γ2 logdet(I − γ−2D⊤PK′D) = Tr[XD⊤(PK − PK′)D] ≤ ∥X∥ · [Tr(D⊤PKD) − Tr(D⊤PK′D)] = ∥X∥ · [Tr(PKDD⊤) − Tr(PK′DD⊤)],
where X = (I −γ−2τD⊤PK′D −γ−2(1−τ)D⊤PKD)−1, and the inequality uses the facts PK ≥ PK′ and Tr(PA) ≤ ∥A∥ · Tr(P) for any real symmetric P ≥ 0. Note that by PK ≥ PK′,
This completes the proof.
Lemma 5.2 implies that in order to show the convergence of J (K), it suffices to study the convergence of Tr(PKDD⊤), as long as ∥(I − γ−2D⊤PKD)−1∥ is bounded along the iterations. This is indeed the case since by (5.27) and (5.29), PK is monotone along both updates (3.4) and (3.5). By induction, if K0 ∈ K, i.e., I − γ−2D⊤PK0D > 0, then I − γ−2D⊤PKnD ≥ I − γ−2D⊤PK0D > 0 holds for all iterations n ≥ 1. This further yields that for all n ≥ 1, ∥(I − γ−2D⊤PKnD)−1∥ ≤ ∥(I −γ−2D⊤PK0D)−1∥, namely, ∥(I −γ−2D⊤PKD)−1∥ is uniformly bounded.
Now we show the local linear convergence rate of Tr(PKDD⊤). By (5.26), for any K′ such that (I − γ−2PKDD⊤)−⊤(A − BK′) is stabilizing, we have
PK′ − PK ≥�
where the second inequality follows from completion of squares. By taking traces on both sides of (5.31), and letting K′ = K∗, we have
where WK,K∗ is defined as
Note that K∗ ∈ K and thus (I − γ−2PK∗DD⊤)−⊤(A − BK∗) is stabilizing. Let ϵ := 1 − ρ�(I − γ−2PK∗DD⊤)−⊤(A − BK∗)�, and note that ϵ > 0. By the continuity of PK, and that of ρ(·) (Tyrtyshnikov, 2012), there exists a ball B(K∗,r) ⊆ K, centered at K∗ with radius r > 0, such that for any K ∈ B(K∗,r),
Gauss-Newton:
By Theorem 4.4, {Kn} approaches K∗. Thus, there exists some Kn ∈ B(K∗,r). Let K = Knand thus K′ = Kn+1. Replacing M in (5.28) by DD⊤ > 0 and combining (5.32), we have
which further implies that
(5.34) shows that the sequence {Tr(PKn+pDD⊤)} decreases to Tr(PK∗DD⊤) starting from some Kn ∈ B(K∗,r). By continuity, there must exists a close enough Kn+p, such that the lower-level set {K | Tr(PKDD⊤) ≤ Tr(Kn+pDD⊤)} ⊆ B(K∗,r). Hence, starting from Kn+p, the iterates will never leave B(K∗,r). By (5.33), WK,K∗, as the unique solution to the Lyapunov equation
must have its norm bounded by some constant W r > ∥DD⊤∥ for all K ∈ B(K∗,r). Replacing the term ∥WK,K∗∥ in (5.34) by W r gives the uniform local linear contraction of {Tr(PKnDD⊤)}, which further leads to the local linear rate of {J (Kn)} by Lemma 5.2.
In addition, by the upper bound (5.25) and EK∗ = 0, we have
For η = 1/2, suppose that some K = Kn ∈ B(K∗,r). Then, K′ = Kn+1 = (R + B⊤�PKB)−1B⊤�PKA yields that
Moreover, notice that
�PK − �PK∗ = (I − γ−2PKDD⊤)−1PK − (I − γ−2PK∗DD⊤)−1PK∗ (5.37) = (I − γ−2PK∗DD⊤)−1γ−2(PK − PK∗)DD⊤(I − γ−2PKDD⊤)−1 + (I − γ−2PK∗DD⊤)−1(PK − PK∗),
which, combined with (5.36), gives
for some constant c > 0. Combining (5.35) and (5.38) yields
for some constant c′. Note that from some p ≥ 0 such that Kn+p onwards never leaves B(K∗,r), the constant c′ is uniformly bounded, which proves the Q-quadratic convergence rate of {Tr(PKnDD⊤)}, and thus the rate of {J (Kn)}, around K∗.
Natural Policy Gradient:
Replacing M in (5.30) by DD⊤ > 0 and combining (5.30) and (5.32) yield
Using similar argument as above, one can establish the local linear rate of {J (Kn)} with a different contracting factor. This concludes the proof.
We now provide additional discussions on the mixed H2/H∞ control design problem.
6.1 Connection to Zero-Sum LQ Games
It is well known that minimizing the risk-sensitive cost as (2.1), which is the logarithm of the expected values of exponential functions with quadratic forms, can be equivalent to solving a zero-sum dynamic game, for both general settings (Whittle, 1990; Fleming and McE- neaney, 1992; Fleming and Hern´andez-Hern´andez, 1997), and in particular for LQ settings (Jacobson, 1973). Due to the connection between LEQG and mixed design, as discussed in §2.2, the latter can be related to a zero-sum LQ game as well.
Specifically, consider the system that follows linear dynamics7
with the system state being xt ∈ , the control inputs of players 1 and 2 being ut ∈
vt ∈
, respectively. The matrices A,B, and D all have proper dimensions. The objective of player 1 (player 2) is to minimize (maximize) the infinite-horizon value function,
where the initial state x0 ∼ D for some distribution D, the matrices Q ∈ and Rv ∈
are all positive definite. Value of the game, i.e., the value of (6.1) when the inf and sup can interchange, is characterized by
is the solution to the generalized algebraic Riccati equation (GARE) (Bas¸ar and Bernhard, 1995)
Moreover, under the standard assumption that Rv − D⊤P∗D > 0, the solution policies, i.e., the Nash equilibrium (NE) policies that are stabilizing, of the two players have forms of LTI state-feedback, namely, u∗t = −K∗xt and v∗t = −L∗xt for some matrices K∗ ∈ L∗ ∈
. The corresponding values of (K∗,L∗) are given by
As a consequence, it suffices to search over all stabilizing control gain pairs (K,L) that solves
In fact, for any stabilizing (K,L) that makes ρ(A − BK − DL) < 1, C(K,L) = Tr(PK,LΣ0), where Σ0 = is the unique solution to the Lyapunov equation
Given K that makes8 Rv − D⊤PK,LD > 0, maximizing over L on the RHS of (6.5) yields
Notice that (6.6) is in fact a Riccati equation identical to (2.13), with Rv replaced by γ2I, Ru
replaced by R, and Q replaced by C⊤C. Hence, the problem (6.4) is equivalent to minimizing C(K,L(K)) = Tr(P∗KΣ0), subject to (6.6), which coincides with the mixed design problem (2.15), where J (K) takes the form of (2.10) with DD⊤ replaced by Σ0. Furthermore, the minimizer of the RHS on (6.6) is
which equals the global optimum for the mixed design problems.
6.2 Model-Free Algorithms
The connection above provides one angle to develop model-free RL algorithms for solving mixed design problems. Indeed, the natural PG in (3.4) cannot be sampled using trajectory data, due to the form of the matrix ∆K in (3.1). Fortunately, solution of the game (6.4) can be obtained by model-free PG-based methods, see Zhang et al. (2019b), and the more recent work Bu et al. (2019c), which, by (6.8) and Proposition 3.4, is equivalent to the global optimum of mixed design problems. Hence, model-free algorithms that solve the LQ game (6.4) can also be used to solve the mixed design problem (2.15).
Specifically, by Lemma 3.3 of Zhang et al. (2019b), under certain conditions, the stationary point (K,L) where ∇KC(K,L) = 0 and ∇LC(K,L) = 0 coincides with the NE. Therefore, it is straightforward to develop PG-based updates to find the minimizer L(K) for some K, and then perform PG-based algorithms to update K, which can both be implemented in a model-free fashion, using zeroth-order methods (Nesterov and Spokoiny, 2017; Fazel et al., 2018). Note that the PO methods in Zhang et al. (2019b) are essentially also based on this idea, but with the order of max and min interchanged, and require a projection step for updating L. More recently, Bu et al. (2019c) has developed double-loop PO methods that
remove this projection. Two examples of PG-based methods can be written as
where α,η > 0 are stepsizes, ΣK,L := correlation matrix under control pair (K,L),
) is the estimate of L(K) obtained by iterating either (6.9) or (6.11), �∇LC(K,L), �∇KC(K,L), and �ΣK,L are the estimates of ∇LC(K,L), ∇KC(K,L), and ΣK,L using sampled data, respectively.
Note that the simulator for the game (6.4) that generates the data samples can be obtained by the simulator for the mixed design problem (2.15), with the disturbance wt modeled as wt = −Lxt. This way, the updates of L in (6.9) and (6.11) can be understood as improving the disturbance to find the worst-case one, which manifests the idea of H∞ norm. Also, Bu et al. (2019c) has verified that in zero-sum LQ games, given a fixed K, such an update of L converges to the best-response disturbance L(K) given in (6.7). This justifies the feasibility of our algorithms (6.9)-(6.12).
In addition, by the form of the policy gradients for the game, see Lemma 3.2 in Zhang et al. (2019b), the exact natural PG update on the LHS of (6.12) is identical to that for mixed design problems in (3.4). In other words, the natural PG update (3.4) can be implemented in a model-free way by virtue of that outer-loop update of K in a zero-sum LQ game. As shown in Bu et al. (2019c), such an outer-loop update over K converges to the NE of the game. Details of the model-free algorithms are deferred to Algorithms 1, 2, and 3 in §D.
In this section, we present some simulation results to corroborate our theory. We mainly focus on the convergence properties for the discrete-time settings. We have also included extensive numerical comparisons with existing packages for solving H2/H∞ mixed design, which can only handle the continuous-time settings (Arzelier et al., 2011; Mahmoud and Pascal, 1996). The problem setup, PO algorithms, and their analyses, for the continuous-time settings can be found in §A. We show that our PO methods outperform these existing packages in many aspects, though the latter ones can handle more general setups.
7.1 Implicit Regularization & Global/Local Convergence
We first consider the following example, denoted by Case 1, whose parameters are:
and DD⊤ = I. Note that all matrices C⊤C, E⊤E, and DD⊤ are positive definite. This DD⊤ satisfies both the controllability assumption in Proposition 3.4, and the assumption DD⊤ >
Figure 2: Convergence of the gradient norm square ∥EK∥2F and the objective {J (K)}, behav- iors of the H∞-norm ∥T (K)∥∞, and the smallest eigenvalue of γ2I − D⊤PKD, for PG, NPG, and Gauss-Newtons with stepsizes 1 × 10−7, 1 × 10−4, 0.01, and 0.5, respectively.
Figure 3: Linear, linear and super-linear convergence rates for the NPG update with η = 10−4, the GN update with η = 10−2, and the GN update with η = 0.5, respectively.
0 in Theorem 4.6. We first randomly generate K0 with each element uniformly generated from [−0.25,0.25], such that K0 is stabilizing (i.e., ρ(A − BK0) < 1). Then, the H∞-norm ∥T (K0)∥∞ under K0 is calculated. The value of γ is then chosen as 1.00001·∥T (K0)∥∞, making sure that K0 ∈ K. We then perform all three algorithms (3.3)-(3.5) in §3.2 on the above problem setting, and illustrate the convergence of both the gradient norm square ∥EK∥2F,
Figure 4: Convergence of the NPG and GN updates to K∗ , when there exists an infinite number of stationary points (i.e. K s.t. ∇J (K) = 0). The stepsizes for the PG, NPG, and GN updates are 10−3, 10−2, and 0.5, respectively.
and the objective difference {J (Kn) − J (K∗)}. The stepsizes η for the PG, NPG, and GaussNewton updates are 1 × 10−7, 1 × 10−4, and 0.01, respectively. We have also used η = 1/2 for the Gauss-Newton update.
As shown in Figure 2, for both performance criteria, all four update rules converge successfully. At the beginning of the iterations, NPG and Gauss-Newton with η = 10−2 indeed yield sublinear convergence of the gradient norm square; as the iterations proceed, linear convergence rate appears. Moreover, for Gauss-Newton with η = 1/2, super-linear convergence rate has also been observed. These observations corroborate our theory in both Theorems 4.4 and 4.6.
Moreover, we have also illustrated the behaviors of the H∞-norm ∥T (K)∥∞ and the smallest eigenvalue of γ2I − D⊤PKD, denoted by λmin(γ2I − D⊤PKD), in Figure 2. It is seen that along the iterations, with the stepsizes that guarantee convergence, the H∞-norm is below the bound γ = 15.45 for all four update rules, which validates the implicit regularization result we have in Theorem 4.3. As another evidence for implicit regularization in accordance to Lemma 2.7, it is shown that the matrix γ2I − D⊤PKD > 0 along iterations.
Notice that the initialization K0 is very close to the boundary, as γ = 1.00001 · ∥T (K0)∥∞. It is shown that the vanilla PG update, even though with infinitesimal stepsize (10−7), still violates the H∞-norm constraint and fails to converge. We do observe, however, in several other numerical examples, that vanilla PG update converges successfully. It is thus not clear whether there exists a constant stepsize choice for the global convergence and robustness perservation of the vanilla PG iterates, which is left for future investigation.
To further verify our local convergence rates, we have also initialized our algorithms by randomly searching over 3. The convergence patterns are presented in Figure 3, which clearly demonstrates the faster local rates.
7.2 Escaping Suboptimal Stationary Points
We also investigate the setting where the controllability assumption in Proposition 3.4, and the assumption DD⊤ > 0 in Theorem 4.6 do not hold. In this case, there might exist multiple stationary points, many of which are suboptimal.
Specifically, consider the following problem parameter, which is denoted by Case 2:
Note that the system is open-loop unstable, as ρ(A) > 1. We choose γ = 10. Then, one can verify that the above mixed design problem admits an optimum K∗ = (R+B⊤�PK∗B)−1B⊤�PK∗A =
�1.6186 0
0 0
� . Moreover, there exist an infinite number of stationary points, which share the
orem 4.4, the NPG and GN updates provably converge to K∗, which automatically escape other suboptimal stationary points. We numerically evaluate the convergence to K∗ in Figure 4, for both the NPG and GN updates. For each of the 50 trails, we fix a random seed and initialize the algorithm by randomly searching a K0 ∈ that satisfies K0 ∈ K. It can be observed that two PG methods converge to K∗ in all trails. In start contrast, the vanilla PG update can easily get stuck at these suboptimal stationary points, depending on its initialization.
This can be understood as another meaning of implicit regularization: for this specific nonconvex problem, two certain search directions automatically bias the iterates to avoid bad local minima, and always towards the global optimal one.
7.3 Comparison with Existing H2/H∞ Control Solvers
To better justify the superiority of our PO methods, we numerically compare their convergence properties with other numerical H2/H∞ mixed control design packages, including the HIFOO method (Arzelier et al., 2011) and the h2hinfsyn method implemented in Matlab (Mahmoud and Pascal, 1996), which is based on linear matrix inequalities. Note that these full-fledged packages can only handle continuous-time settings. To make the comparison fair, we also implement our PO methods for the continuous-time settings as studied in §A.
We mainly compare them in terms of: 1) the H2 and H∞ norms of the controller that the algorithms converge to; 2) the computation complexity (runtime); 3) the H∞-norm constraint violation. We will validate that our PO methods indeed outperform HIFOO in these aspects. The larger-scale the dynamical system is, the more pronounced our advantages are, with provable robustness preserving guarantees.
Simulation Setup. All the experiments are executed on a MacBook Pro 2019 with a 2.8 GHz Quad-Core Intel Core i7 processor with Matlab R2020b. The device also has a 16GB 2133MHz LPDDR3 memory and an Intel Iris Plus 655 Graphics. To make the runtime comparison fair (or even in favor of HIFOO), we set the following parameters of HIFOO (version 3.501 with Hanso version 2.01): options.fast = 1 for using a fast optimization method; options.prtlevel = 0 to suppress unnecessary printing statements. Other parameters of HIFOO are set to be default. For Matlab’s h2hinfsyn function, tol has been set to 10−6. For our PO methods, we set the stepsizes of the NPG and GN updates to be 1/(2∥R∥) and 1/2, respectively, and solve the following mixed design problems Cases 3-6 until J (K) − J (K∗) < 10−6.
A Simple Example. We first consider a simple setting, denoted by Case 3. The time-invariant system dynamics are characterized by ˙x = Ax + Bu + Dw, z = Cx + Eu, where
One can verify that E⊤[C E] = [0, I], satisfying our assumption. Then, we solve
The simulations are run over 100 trails with the random seed being fixed at 1,··· ,100, respectively. The optimal (minimax) disturbance attenuation level of Case 3 is γ∗ ≈ 0.53, as computed/verified both by Matlab’s hinfsyn function and HIFOO’s tion. We summarize the following interesting findings based on the comparisons between HIFOO and our PO methods in Table 1:
1. (PO methods achieve lower H2 and H∞ norms, faster). When γ = 5, both HIFOO and PO methods preserve the ∥T (K)∥∞ < γ constraint during the optimization process, and output some convergent gain matrix K. However, HIFOO converges to a local minimum due to that it is directly optimizing the H2 norm of the closed-loop transfer function, and the landscape for such an optimization problem is unclear. In contrast, our PO methods, by definition, optimize an upper bound of the H2-norm, and as proved theoretically, converge to the global minimum of the problem. The solution yields lower values for ∥T (K)∥2 and ∥T (K)∥∞ compared to those of the HIFOO output. Indeed, this shows that minimizing the H2-norm upper bound as in Bernstein and Haddad (1989); Mustafa and Bernstein (1991) and our paper can obtain reasonably good solutions. More importantly, the average runtimes of our PG methods average over 100 trails are around 5.93× faster than HIFOO.
2. (PO methods always preserve H∞-norm constraint). When γ = 3 (which is still far from γ∗ ≈ 0.53), our methods consistently preserve the ∥T (K)∥∞ < γ constraint during the optimization process, validating our theoretical findings. However, the HIFOO iterates have reached ∥T (K)∥∞ = 3.4353 > 3 along the way. Further, PO methods also find solutions that have lower ∥T (K)∥2 and ∥T (K)∥∞ norms, with a 5.85× faster runtime.
3. (Smaller γ leads to worse performance for HIFOO). When γ = 1, our PO methods still preserve the ∥T (K)∥∞ < γ constraint during the optimization process. Our PO methods also converge to the optimum point with both small H2 and H∞ norms, while HIFOO’s performance is degraded much more. We remark that in this case, the runtime of our methods is longer than that of HIFOO, however, the time was mostly consumed in finding the initialization that is robustly stable, by randomly generating K0 in a
Table 1: Comparison average over 100 trails between HIFOO and two proposed PO methods, for solving the mixed design Case 3. All three methods initialize K0 on their own. For two PO methods, a K0 ∈ K is found by randomly search over [−1,1]3×3, which takes up > 90% of the total runtime. In contrast, HIFOO uses an in-house method to find initial points. ∥T (K)∥2 Diff. and ∥T (K)∥∞ Diff. represent the difference of H2 and H∞ norms achieved by HIFOO and our PO methods (which are identical for NPG and GN, as our methods have guarantees for finding the global optimum of our mixed design problem).
certain region. This becomes harder to find for a smaller γ. Our simple initialization method takes more than 90% of the time, which might be less efficient than the advanced initialization technique used in HIFOO.
Regarding the comparison with the h2hinfsyn function, we present the results in Table 2. It is shown that in this simple 3 × 3 problem, h2hinfsyn and our PG methods converge to nearly the same solution, while our computation time is around 4× faster. Next, we will show that h2hinfsyn scales poorly with respect to the problem dimensions, while our PO methods converge efficiently in high-dimensional problems.
More Challenging Cases. We test some more challenging cases with higher dimensions to further demonstrate the efficiency of our PO methods. In Cases 4-6, the dimensions of the control gain matrices are 15×15,60×60,90×90, respectively, corresponding to the number of decision variables being 225, 3600, 8100, respectively. Problem parameters are too long to enumerate here, and are provided at here, together with all the code and data.
The simulations are run over 10 trails with fixed random seeds. Both HIFOO, h2hinfsyn, and our PO methods converge to almost the same control gain matrices in these cases, without constraint violation. This again implies that minimizing the H2-norm upper bound (instead of H2-norm directly) can usually achieve quite competitive solutions. Notably, our PO methods are around 8×, 47×, 295× faster than HIFOO, respectively, in Cases 4-6, as reported in Table 3. Our PO methods are also much faster than h2hinfsyn, as it can hardly solve Cases 5-6 and fails to return a solution even after very long runtime.. This verifies that our PO algorithms indeed enjoy better scalability, and the higher the dimension is, the
Table 2: Comparison average over 100 trails between Matlab’s h2hinfsyn and two proposed PO methods, for solving the mixed design Case 3. For the two PO methods, a K0 ∈ K is found by randomly search over [−1,1]3×3, and the computation time for finding such an initial point is not taken into account for fair comparison. In contrast, h2hinfsyn implements a LMI-based synthesis procedure (Mahmoud and Pascal, 1996).
Table 3: Average runtime comparison over 10 trails between HIFOO, Matlab’s h2hinfsyn function, and two proposed PO methods for solving Cases 4-6. The speedup times outside and inside the parenthesis denote the ones of our PO methods compared to HIFOO and Matlab, respectively.
Table 4: Average ∥T (K)∥2 reached over 10 trails between HIFOO, Matlab’s h2hinfsyn function, and two proposed PO methods for solving Cases 4-6.
more pronounced our advantage is. These observations have justified that our PO methods are not only theoretically sound, but also numerically competitive.
In this paper, we have investigated the convergence theory of policy optimization methods for H2 linear control with H∞-norm robustness guarantees. Viewed as a constrained nonconvex optimization, this problem was addressed by PO methods with provable convergence to the global optimal policy. More importantly, we showed that the proposed PO
Table 5: Average ∥T (K)∥∞ reached over 10 trails between HIFOO, Matlab’s h2hinfsyn function, and two proposed PO methods for solving Cases 4-6.
methods enjoy the implicit regularization property, despite the lack of coercivity of the cost function. We expect the present work to serve as an initial step toward further understanding of RL algorithms on robust/risk-sensitive control tasks. We conclude this main part of the paper with several ongoing/potential research directions.
Implicit regularization of other PO methods: It is of particular interests to investigate whether other PO methods enjoy similar implicit regularization properties. One important example that has not been analyzed in this paper is the PG method. Notice that the model-free implementation of the PG method, see update (3.3), does not require the connection between mixed design and zero-sum LQ games, as the gradient can be sampled via zeroth-order methods directly. Among other examples are quasi-Newton methods with preconditioning matrices other than that in (3.5), accelerated PG using the idea from Nesterov (1983), and variance reduced PG methods (Papini et al., 2018; Xu et al., 2019).
Linear quadratic games: Thanks to the connection discussed in §6.1, our LMI-based techniques for showing implicit regularization in Theorem 4.3 may be of independent interest to improve the convergence of nested policy gradient methods in Zhang et al. (2019b), and even simultaneously-moving policy-gradient methods, for solving zero-sum LQ games using PO methods. This will place PO methods for multi-agent RL (MARL) under a more solid theoretical footing, as LQ games have served as a significant benchmark for MARL (Chasnov et al., 2019; Mazumdar et al., 2019). Rigorous analysis for this setting have been partially addressed in our ongoing work, and in a more recent work Bu et al. (2019c).
Model-based v.s. model-free methods for robust control: There is an increasing literature in model-based learning-based control with robustness concerns (Aswani et al., 2013; Berkenkamp and Schoellig, 2015; Berkenkamp et al., 2017; Dean et al., 2017, 2019b). On the other hand, our work serves as an intermediate step toward establishing the sample complexity of model-free PO methods for this setting. Hence, it is natural and interesting to compare the data efficiency (sample complexity) and computational scalability of the two lines of work. Note that such a comparison has been made in Tu and Recht (2018) for LQR problems without addressing the issue of robustness.
Beyond LTI systems and state-feedback controllers: It is possible to extend our analysis to the mixed H2/H∞ control of other types of dynamical systems such as periodic systems (Bittanti and Colaneri, 1996), Markov jump linear systems (Costa et al., 2006), and switching systems (Liberzon, 2003). These more general system models are widely adopted in control applications, and extensions to these cases will significantly expand the utility of our theory. On the other hand, it is interesting while challenging to study PO for outputfeedback mixed design, where a dynamic controller parameterized by (AK,BK,CK,DK) is synthesized (Apkarian et al., 2008). This way, the PO landscape depends on the order of the parameterization, making the analysis more involved.
PO landscape and algorithms for H∞ control synthesis: Our algorithms are based on the condition that an initial policy satisfying the specified H∞-norm constraint is available. To efficiently find such an initialization, it is natural to study the PO landscape of H∞ control synthesis (Doyle et al., 1989; Gahinet and Apkarian, 1994; Apkarian and Noll, 2006), where the goal is to find the controller that not only satisfies certain H∞-norm bound, but also minimizes it. It seems that the cost function for H∞ control is still coercive. However, the main challenge of PO for H∞ control is that the cost function is non-smooth (Apkarian and Noll, 2006), which necessitates the use of subgradient methods. The LMI arguments developed here may shed new lights on the convergence analysis of these methods.
K. Zhang and T. Bas¸ar were supported in part by the US Army Research Laboratory (ARL) Cooperative Agreement W911NF-17-2-0196, and in part by the Office of Naval Research (ONR) MURI Grant N00014-16-1-2710. The authors would like to thank Peter Seiler, Geir Dullerud, Na Li, and Mihailo Jovanovic for the valuable comments and feedback on our manuscript, as well as the helpful suggestions from the anonymous reviewers of L4DC and SICON. The authors would also like to thank Xiangyuan (Rocker) Zhang for helping with the simulations.
Achiam, J., Held, D., Tamar, A. and Abbeel, P. (2017). Constrained policy optimization. In International Conference on Machine Learning.
Agarwal, A., Kakade, S. M., Lee, J. D. and Mahajan, G. (2019a). Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261.
Agarwal, N., Bullins, B., Hazan, E., Kakade, S. and Singh, K. (2019b). Online control with adversarial disturbances. In International Conference on Machine Learning.
Agarwal, N., Hazan, E. and Singh, K. (2019c). Logarithmic regret for online control. arXiv preprint arXiv:1909.05062.
Akametalu, A. K., Fisac, J. F., Gillula, J. H., Kaynama, S., Zeilinger, M. N. and Tomlin, C. J. (2014). Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control. IEEE.
Allen-Zhu, Z., Li, Y. and Liang, Y. (2018). Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918.
Altman, E. (1999). Constrained Markov decision processes, vol. 7. CRC Press.
Apkarian, P. and Noll, D. (2006). Nonsmooth H∞ synthesis. IEEE Transactions on Automatic Control, 51 71–86.
Apkarian, P., Noll, D. and Rondepierre, A. (2008). Mixed H2/H∞ control via nonsmooth optimization. SIAM Journal on Control and Optimization, 47 1516–1546.
Arzelier, D., Georgia, D., Gumussoy, S. and Henrion, D. (2011). H2 for HIFOO. In International Conference on Control and Optimization with Industrial Applications.
Aswani, A., Gonzalez, H., Sastry, S. S. and Tomlin, C. (2013). Provably safe and robust learning-based model predictive control. Automatica, 49 1216–1226.
Bas¸ar, T. and Bernhard, P. (1995). H-infinity Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach. Birkh¨auser, Boston.
Berkenkamp, F. and Schoellig, A. P. (2015). Safe and robust learning control with Gaussian processes. In IEEE European Control Conference. IEEE.
Berkenkamp, F., Turchetta, M., Schoellig, A. and Krause, A. (2017). Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems.
Bernstein, D. S. and Haddad, W. M. (1989). LQG control with an H∞ performance bound: A Riccati equation approach. IEEE Transactions on Automatic Control, 34 293–305.
Bertsekas, D. P. (1976). On the Goldstein-Levitin-Polyak gradient projection method. IEEE Transactions on Automatic Control, 21 174–184.
Bhandari, J. and Russo, D. (2019). Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786.
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M. and Lee, M. (2009). Natural actor-critic algorithms. Automatica, 45 2471–2482.
Bittanti, S. and Colaneri, P. (1996). Analysis of discrete-time linear periodic systems. In Control and Dynamic Systems, vol. 78. 313–339.
Borkar, V. S. (2005). An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54 207–213.
Borkar, V. S. and Meyn, S. P. (2002). Risk-sensitive optimal control for Markov decision processes with monotone cost. Mathematics of Operations Research, 27 192–209.
Boyd, S., El Ghaoui, L., Feron, E. and Balakrishnan, V. (1994). Linear Matrix Inequalities in System and Control Theory, vol. 15. SIAM.
Boyd, S. P., Balakrishnan, V., Barratt, C. H., Khraishi, N. M., Li, X., Meyer, D. G. and Norman, S. A. (1988). A new CAD method and associated architectures for linear controllers. IEEE Transactions on Automatic Control, 33 268–283.
Bradtke, S. J., Ydstie, B. E. and Barto, A. G. (1994). Adaptive linear quadratic control using policy iteration. In IEEE American Control Conference, vol. 3.
Bu, J., Mesbahi, A., Fazel, M. and Mesbahi, M. (2019a). LQR through the lens of first order methods: Discrete-time case. arXiv preprint arXiv:1907.08921.
Bu, J., Mesbahi, A. and Mesbahi, M. (2019b). On topological and metrical properties of stabilizing feedback gains: the MIMO case. arXiv preprint arXiv:1904.02737.
Bu, J., Ratliff, L. J. and Mesbahi, M. (2019c). Global convergence of policy gradient for sequential zero-sum linear quadratic dynamic games. arXiv preprint arXiv:1911.04672.
Cartis, C., Gould, N. I. and Toint, P. L. (2010). On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization problems. SIAM Journal on Optimization, 20 2833–2852.
Cartis, C., Gould, N. I. and Toint, P. L. (2017). Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization. arXiv preprint arXiv:1709.07180.
Chasnov, B., Ratliff, L. J., Mazumdar, E. and Burden, S. A. (2019). Convergence analysis of gradient-based learning with non-uniform learning rates in non-cooperative multi-agent settings. arXiv preprint arXiv:1906.00731.
Chen, X. and Wen, J. T. (1995). A linear matrix inequality approach to the general mixed H2/H∞ control problem. In American Control Conference, vol. 2. IEEE.
Chen, Y. and Candes, E. (2015). Solving random quadratic systems of equations is nearly as easy as solving linear systems. In Advances in Neural Information Processing Systems.
Chen, Y. and Wainwright, M. J. (2015). Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025.
Chow, Y., Ghavamzadeh, M., Janson, L. and Pavone, M. (2017). Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18 6070–6120.
Chow, Y., Nachum, O., Duenez-Guzman, E. and Ghavamzadeh, M. (2018). A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems.
Chow, Y., Tamar, A., Mannor, S. and Pavone, M. (2015). Risk-sensitive and robust decision-making: A CVaR optimization approach. In Advances in Neural Information Processing Systems.
Cohen, A., Hasidim, A., Koren, T., Lazic, N., Mansour, Y. and Talwar, K. (2018). Online linear quadratic control. In International Conference on Machine Learning.
Costa, O. L. V., Fragoso, M. D. and Marques, R. P. (2006). Discrete-Time Markov Jump Linear Systems. Springer Science & Business Media.
Dean, S., Mania, H., Matni, N., Recht, B. and Tu, S. (2017). On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics 1–47.
Dean, S., Matni, N., Recht, B. and Ye, V. (2019a). Robust guarantees for perception-based control. arXiv preprint arXiv:1907.03680.
Dean, S., Tu, S., Matni, N. and Recht, B. (2019b). Safely learning to control the constrained linear quadratic regulator. In IEEE American Control Conference. IEEE.
Di Castro, D., Tamar, A. and Mannor, S. (2012). Policy gradients with variance related risk criteria. arXiv preprint arXiv:1206.6404.
Doyle, J., Glover, K., Khargonekar, P. and Francis, B. (1989). State-space solutions to standard H2 and H∞ control problems. IEEE Transactions on Automatic Control, 34 831–847.
Dullerud, G. E. and Paganini, F. (2013). A Course in Robust Control Theory: A Convex Approach, vol. 36. Springer Science & Business Media.
Dvijotham, K., Fazel, M. and Todorov, E. (2014a). Convex risk averse control design. In IEEE Conference on Decision and Control. IEEE.
Dvijotham, K., Fazel, M. and Todorov, E. (2014b). Universal convexification via risk-aversion. arXiv preprint arXiv:1406.0554.
Fazel, M., Ge, R., Kakade, S. M. and Mesbahi, M. (2018). Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning.
Fleming, W. and Hern´andez-Hern´andez, D. (1997). Risk-sensitive control of finite state machines on an infinite horizon i. SIAM Journal on Control and Optimization, 35 1790–1810.
Fleming, W. H. and McEneaney, W. M. (1992). Risk sensitive optimal control and differential games. In Stochastic Theory and Adaptive Control. Springer, 185–197.
Gahinet, P. and Apkarian, P. (1994). A linear matrix inequality approach to H∞ control. International Journal of Robust and Nonlinear Control, 4 421–448.
Garcia, J. and Fern´andez, F. (2012). Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, 45 515–564.
Geibel, P. and Wysotzki, F. (2005). Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 24 81–108.
Glover, K. and Doyle, J. C. (1988). State-space formulae for all stabilizing controllers that satisfy an H∞-norm bound and relations to relations to risk sensitivity. Systems & Control Letters, 11 167–172.
Gravell, B., Esfahani, P. M. and Summers, T. (2019). Learning robust controllers for linear quadratic systems with multiplicative noise via policy gradient. arXiv preprint arXiv:1907.03680.
Gumussoy, S., Henrion, D., Millstone, M. and Overton, M. L. (2009). Multiobjective robust control with HIFOO 2.0. IFAC Proceedings, 42 144–149.
Haddad, W. M., Bernstein, D. S. and Mustafa, D. (1991). Mixed-norm H2/H∞ regulation and estimation: The discrete-time case. Systems & Control Letters, 16 235–247.
Hewer, G. (1971). An iterative technique for the computation of the steady state gains for the discrete optimal regulator. IEEE Transactions on Automatic Control, 16 382–384.
Hindi, H. A., Hassibi, B. and Boyd, S. P. (1998). Multiobjective H2/H∞-optimal control via finite dimensional Q-parametrization and linear matrix inequalities. In Proceedings of American Control Conference, vol. 5. IEEE.
Hintermuller, M. and Von Neumann Haus, J. (2010). Nonlinear optimization. Vorlesungsskript University at Berlin.
Ionescu, V. and Weiss, M. (1992). On computing the stabilizing solution of the discrete-time Riccati equation. Linear Algebra And Its Applications, 174 229–238.
Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30 257– 280.
Jacobson, D. (1973). Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games. IEEE Transactions on Automatic Control, 18 124– 131.
Jansch-Porto, J. P., Hu, B. and Dullerud, G. (2020). Convergence guarantees of policy optimization methods for markovian jump linear systems. arXiv preprint arXiv:2002.04090.
Ja´skiewicz, A. et al. (2007). Average optimality for risk-sensitive control with general state space. The annals of applied probability, 17 654–675.
Kakade, S. M. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems.
Kaminer, I., Khargonekar, P. P. and Rotea, M. A. (1993). Mixed H2/H∞ control for discrete-time systems via convex optimization. Automatica, 29 57–70.
Khamaru, K. and Wainwright, M. J. (2018). Convergence guarantees for a class of non-convex and non-smooth optimization problems. arXiv preprint arXiv:1804.09629.
Khargonekar, P. P. and Rotea, M. A. (1991). Mixed H2/H∞ control: A convex optimization approach. IEEE Transactions on Automatic Control, 36 824–837.
Kleinman, D. (1968). On an iterative technique for Riccati equation computations. IEEE Transactions on Automatic Control, 13 114–115.
Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems.
Krantz, S. G. and Parks, H. R. (2012). The Implicit Function Theorem: History, Theory, and Applications. Springer Science & Business Media.
Kubo, M., Banno, R., Manabe, H. and Minoji, M. (2019). Implicit regularization in overparameterized neural networks. arXiv preprint arXiv:1903.01997.
Levine, S., Finn, C., Darrell, T. and Abbeel, P. (2016). End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17 1334–1373.
Li, X., Ling, S., Strohmer, T. and Wei, K. (2019). Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Applied and Computational Harmonic Analysis, 47 893–934.
Liberzon, D. (2003). Switching in Systems and Control. Springer Science & Business Media.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D. and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Lim, S. H. and Autef, A. (2019). Kernel-based reinforcement learning in robust markov decision processes. In International Conference on Machine Learning.
Lim, S. H., Xu, H. and Mannor, S. (2013). Reinforcement learning in robust Markov decision processes. In Advances in Neural Information Processing Systems.
Ma, C., Wang, K., Chi, Y. and Chen, Y. (2017). Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution. arXiv preprint arXiv:1711.10467.
Magnus, J. R. and Neudecker, H. (1985). Matrix differential calculus with applications to simple, Hadamard, and Kronecker products. Journal of Mathematical Psychology, 29 474–492.
Mahmoud, C. and Pascal, G. (1996). h∞ design with pole placement constraints: an LMI approach. IEEE Transactions on Automatic Control, 41 358–367.
Makila, P. and Toivonen, H. (1987). Computational methods for parametric LQ problems–A survey. IEEE Transactions on Automatic Control, 32 658–671.
Malik, D., Pananjady, A., Bhatia, K., Khamaru, K., Bartlett, P. and Wainwright, M. (2019). Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. In International Conference on Artificial Intelligence and Statistics.
Mankowitz, D. J., Levine, N., Jeong, R., Abdolmaleki, A., Springenberg, J. T., Mann, T., Hester, T. and Riedmiller, M. (2019). Robust reinforcement learning for continuous control with model misspecification. arXiv preprint arXiv:1906.07516.
Matni, N., Proutiere, A., Rantzer, A. and Tu, S. (2019). From self-tuning regulators to reinforcement learning and back again. arXiv preprint arXiv:1906.11392.
Mazumdar, E., Ratliff, L. J., Jordan, M. I. and Sastry, S. S. (2019). Policy-gradient algorithms have no guarantees of convergence in continuous action and state multi-agent settings. arXiv preprint arXiv:1907.03712.
Mohammadi, H., Zare, A., Soltanolkotabi, M. and Jovanovi´c, M. R. (2019a). Convergence and sample complexity of gradient methods for the model-free linear quadratic regulator problem. arXiv preprint arXiv:1912.11899.
Mohammadi, H., Zare, A., Soltanolkotabi, M. and Jovanovi´c, M. R. (2019b). Global exponential convergence of gradient methods over the nonconvex landscape of the linear quadratic regulator. In 2019 IEEE 58th Conference on Decision and Control (CDC).
Molinari, B. (1973). The stabilizing solution of the algebraic Riccati equation. SIAM Journal on Control, 11 262–271.
Morimoto, J. and Doya, K. (2005). Robust reinforcement learning. Neural Computation, 17 335–359.
Mustafa, D. (1989). Relations between maximum-entropy/H∞ control and combined H∞/LQG control. Systems & Control Letters, 12 193–203.
Mustafa, D. and Bernstein, D. S. (1991). LQG cost bounds in discrete-time H2/H∞ control. Transactions of the Institute of Measurement and Control, 13 269–275.
Mustafa, D. and Glover, K. (1990). Minimum entropy H∞ control. Lecture Notes in Control and Information Sciences.
Nesterov, Y. (1983). A method for solving the convex programming problem with convergence rate o(1/k2). Doklady ANSSSR, 269 543–547.
Nesterov, Y. and Polyak, B. T. (2006). Cubic regularization of Newton method and its global performance. Mathematical Programming, 108 177–205.
Nesterov, Y. and Spokoiny, V. (2017). Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17 527–566.
Nilim, A. and El Ghaoui, L. (2005). Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53 780–798.
OpenAI (2018). Openai five. https://blog.openai.com/openai-five/.
Papini, M., Binaghi, D., Canonaco, G., Pirotta, M. and Restelli, M. (2018). Stochastic variance-reduced policy gradient. arXiv preprint arXiv:1806.05618.
Pinto, L., Davidson, J., Sukthankar, R. and Gupta, A. (2017). Robust adversarial reinforcement learning. In International Conference on Machine Learning.
Polyak, B. T. (1963). Gradient methods for minimizing functionals. USSR Computational Mathematics and Mathematical Physics, 3 14–29.
Ran, A. and Vreugdenhil, R. (1988). Existence and comparison theorems for algebraic Riccati equations for continuous- and discrete-time systems. Linear Algebra And Its Applications, 99 63–83.
Rantzer, A. (1996). On the Kalman-Yakubovich-Popov Lemma. Systems & Control letters, 28 7–10.
Rautert, T. and Sachs, E. W. (1997). Computational design of optimal output feedback controllers. SIAM Journal on Optimization, 7 837–852.
Recht, B. (2019). A tour of reinforcement learning: The view from continuous control. Annual Review of Control, Robotics, and Autonomous Systems, 2 253–279.
Rotstein, H. and Sznaier, M. (1998). An exact solution to general four-block discrete-time mixed H2/H∞ problems via convex optimization. IEEE Transactions on Automatic Control, 43 1475–1480.
Roulet, V., Fazel, M., Srinivasa, S. and Harchaoui, Z. (2019). On the convergence to stationary points of the iterative linear exponential quadratic Gaussian algorithm. arXiv preprint arXiv:1910.08221.
Scherer, C. W. (1995). Multiobjective h/sub 2//h/sub/spl infin//control. IEEE Transactions on Automatic Control, 40 1054–1062.
Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P. (2015a). Trust region policy optimization. In International Conference on Machine Learning.
Schulman, J., Moritz, P., Levine, S., Jordan, M. and Abbeel, P. (2015b). High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Shani, L., Efroni, Y. and Mannor, S. (2019). Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. arXiv preprint arXiv:1909.02769.
Shapiro, E. (1974). On the Lyapunov matrix equation. IEEE Transactions on Automatic Control, 19 594–596.
Shebrawi, K. and Albadawi, H. (2009). Numerical radius and operator norm inequalities. Journal of Inequalities and Applications, 2009 492154.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529 484–489.
Skogestad, S. and Postlethwaite, I. (2007). Multivariable Feedback Control: Analysis and Design, vol. 2. Wiley New York.
Stoorvogel, A. A. and Weeren, A. J. (1994). The discrete-time Riccati equation related to the H∞ control problem. IEEE Transactions on Automatic Control, 39 686–691.
Sutton, R. S., McAllester, D. A., Singh, S. P. and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.
Tessler, C., Efroni, Y. and Mannor, S. (2019). Action robust reinforcement learning and applications in continuous control. In International Conference on Machine Learning.
Topcu, U., Packard, A. and Seiler, P. (2008). Local stability analysis using simulations and sum-of-squares programming. Automatica, 44 2669–2675.
Tu, S. and Recht, B. (2018). The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. arXiv preprint arXiv:1812.03565.
Tyrtyshnikov, E. E. (2012). A Brief Introduction to Numerical Analysis. Springer Science & Business Media.
Ueda, K. and Yamashita, N. (2010). Convergence properties of the regularized Newton method for the unconstrained nonconvex optimization. Applied Mathematics and Optimization, 62 27–46.
Venkataraman, H. K. and Seiler, P. J. (2019). Recovering robustness in model-free reinforcement learning. In 2019 American Control Conference (ACC).
Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W. M., Dudzik, A., Huang, A., Georgiev, P., Powell, R., Ewalds, T., Horgan, D., Kroiss, M., Danihelka, I., Agapiou, J., Oh, J., Dalibard, V., Choi, D., Sifre, L., Sulsky, Y., Vezhnevets, S., Molloy, J., Cai, T., Budden, D., Paine, T., Gulcehre, C., Wang, Z., Pfaff, T., Pohlen, T., Wu, Y., Yogatama, D., Cohen, J., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Apps, C., Kavukcuoglu, K., Hassabis, D. and Silver, D. (2019). AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. https:// deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/.
Wang, L., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150.
Whittle, P. (1981). Risk-sensitive linear/quadratic/Gaussian control. Advances in Applied Probability, 13 764–777.
Whittle, P. (1990). Risk-sensitive Optimal Control. Wiley Chichester.
Xu, P., Gao, F. and Gu, Q. (2019). An improved convergence analysis of stochastic variance-reduced policy gradient. arXiv preprint arXiv:1905.12615.
Yang, Z., Chen, Y., Hong, M. and Wang, Z. (2019). On the global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. arXiv preprint arXiv:1907.06246.
Yu, M., Yang, Z., Kolar, M. and Wang, Z. (2019). Convergent policy optimization for safe reinforcement learning. In Advances in Neural Information Processing Systems.
Zames, G. (1966). On the input-output stability of time-varying nonlinear feedback systems part one: Conditions derived using concepts of loop gain, conicity, and positivity. IEEE Transactions on Automatic Control, 11 228–238.
Zhang, K., Koppel, A., Zhu, H. and Bas¸ar, T. (2019a). Global convergence of policy gradient methods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383.
Zhang, K., Yang, Z. and Bas¸ar, T. (2019b). Policy optimization provably converges to Nash equilibria in zero-sum linear quadratic games. In Advances in Neural Information Processing Systems.
Zhang, K., Zhang, X., Hu, B. and Bas¸ar, T. (2021). Derivative-free policy optimization for risk-sensitive and robust control design: Implicit regularization and sample complexity. arXiv preprint arXiv:2101.01041.
Zheng, Q. and Lafferty, J. (2016). Convergence analysis for rectangular matrix completion using Burer-Monteiro factorization and gradient descent. arXiv preprint arXiv:1605.07051.
Zhou, K., Doyle, J. C. and Glover, K. (1996). Robust and Optimal Control, vol. 40. Prentice Hall New Jersey.
In this appendix, we present the counterparts of the formulation and results in the main part of the paper for the continuous-time mixed H2/H∞ design problem.
A.1 Formulation
Consider the linear dynamics
where x ∈ is the state, u ∈
is the control, w ∈
is the disturbance, z ∈
controlled output, and A,B,C,D,E are the matrices of proper dimensions. The performance measure of the mixed H2/H∞ design problem is usually some upper bound of the H2-norm of the system (Khargonekar and Rotea, 1991). With the state-feedback information structure, it has been shown in Khargonekar and Rotea (1991) that, LTI state-feedback controller suffices to achieve the optimal performance measure. As a consequence, it suffices to consider only stationary control policies parametrized as u = −Kx. Note that Assumption 2.5 is also standard for continuous-time settings. Hence, the transfer function T (K) from the disturbance w to the output z also has the form of (2.8). The robustness of the designed controller can thus be guaranteed by the constraint on the H∞-norm; see definition in (1.2). In particular, the constraint is ∥T (K)∥∞ < γ for some γ > 0. Define K to be the feasible set as
For continuous-time setting, the cost J (K) usually only takes the form of (2.10) (Mustafa, 1989; Mustafa and Bernstein, 1991), with the PK replaced by the solution to the continuous-time Riccati equation
In sum, the continuous-time mixed design can thus be formulated as
a constrained nonconvex optimization problem.
A.1.1 Bounded Real Lemma
There also exists a continuous-time Bounded Real Lemma (Zhou et al., 1996; Rantzer, 1996) that relates the H∞-norm bound to the solution of a Riccati equation and an LMI.
Lemma A.1 (Continuous-Time Bounded Real Lemma). Consider the continuous-time transfer function T (K) defined in (2.8), which is recalled here as
Suppose K is stabilizing, i.e., (A − BK) is Hurwitz, then, the following conditions are equivalent:
• ∥T (K)∥∞ < γ, which, due to that (A − BK) is Hurwitz, further implies that K ∈ K;
• The Riccati equation (A.1) admits a unique stabilizing solution PK ≥ 0 such that the matrix A − BK + γ−2DD⊤PK is Hurwitz;
• There exists some P > 0, such that
The three equivalent conditions in Lemma A.1 will be frequently used in the analysis. Similarly, the unique stabilizing solution to (A.1) for any K ∈ K, is also minimal, if the pair (A−BK,D) is stabilizable, see (Zhou et al., 1996, Corollary 13.13, page 339), which is indeed the case since any K ∈ K is stabilizing. Hence, it suffices to consider only stabilizing solution PK of the Riccati equation (A.1) in this case.
A.2 Landscape and Algorithms
Next we study the optimization landscape and policy-based algorithms for the continuous-time mixed H2/H∞ design.
A.2.1 Optimization Landscape
As in the discrete-time setting, this problem is nonconvex, and enjoys no coercivity. Proofs of the following lemmas are provided in §B.2 and §B.3, respectively.
Lemma A.2 (Nonconvexity of Continuous-Time Mixed H2/H∞ Design). The continuous-time mixed H2/H∞ design problem (A.2) is nonconvex.
Lemma A.3 (No Coercivity of Continuous-Time Mixed H2/H∞ Design). The cost function (A.2) for continuous-time mixed H2/H∞ design is not coercive.
The following lemma, whose proof is deferred to §B.5, establishes the differentiability and the policy gradient form of J (K) given by (A.2).
Lemma A.4. The objective defined in (A.2) is differentiable in K for any K ∈ K, and the policy gradient of J (K) with respect to K has the following form
where ΛK ∈ is the solution to the Lyapunov equation
Lemma A.4 further leads to the following proposition, which provides the formula for the optimal controller for mixed H2/H∞ control design.
Proposition A.5. Suppose that the continuous-time mixed H2/H∞ design admits a global optimal control gain solution K∗ ∈ K, and (A,C) is detectable, then one such solution has the form of K∗ = R−1B⊤PK∗. Additionally, if the pair�A−BK +γ−2DD⊤PK,D�is controllable at some stationary point of J (K) such that ∇J (K) = 0, then this is the unique stationary point, and corresponds to the unique global optimizer K∗.
Proposition A.5, proved in §B.6, not only gives the form of one global optimal solution, but also shows that under certain controllability conditions, the stationary point of J (K) is unique, and corresponds to that global optimum. Note that the controllability condition is satisfied if DD⊤ > 0. Such a desired property motivates the development of first-order methods to solve this nonconvex optimization problem.
A.2.2 Policy Optimization Algorithms
Consider three policy-gradient based algorithms as follows.
Policy Gradient: K′ = K − η∇J (K) = K − 2η[RK − B⊤PK]ΛK (A.5) Natural Policy Gradient: K′ = K − η∇J (K)Λ−1K = K − 2η(RK − B⊤PK) (A.6) Gauss-Newton: K′ = K − ηR−1∇J (K)Λ−1K = K − 2η(K − R−1B⊤PK) (A.7)
with η > 0 being the stepsize. The updates are designed to follow the updates for discrete-time settings in (A.5)-(A.7).
A.3 Theoretical Results
We now establish the convergence results on the algorithms proposed in §A.2.2.
A.3.1 Implicit Regularization
As in the discrete-time setting, we first show that both the natural PG update (A.6) and the Gauss-Newton update (A.7) enjoy the implicit regularization property.
Theorem A.6 (Implicit Regularization for Continuous-Time Mixed Design). For any control gain K ∈ K, i.e., (A − BK) being Hurwitz and ∥T (K)∥∞ < γ, with ∥K∥ < ∞, suppose that the stepsize η satisfies:
• Natural policy gradient (A.6): η ≤ 1/(2∥R∥),
• Gauss-Newton (A.7): η ≤ 1/2.
Then the K′ obtained from (A.6)-(A.7) also lies in K. Equivalently, K′ is stabilizing, i.e., (A−BK′) is Hurwitz, and also enables the Riccati equation (A.1) to admit a stabilizing solution PK′ ≥ 0 such that A − BK′ + γ−2DD⊤PK′ is Hurwitz.
The proof of Theorem A.6 can be found in §B.8, the key of which is the use of the Bounded Real Lemma, i.e., Lemma A.1, so that it suffices to find some P > 0 that makes (A.3) hold. Such a P can then be constructed by perturbing PK, the solution to the Riccati equation under K. Note that it is not clear yet if the vanilla PG update (A.5) also enjoys the implicit regularization property.
A.3.2 Global Convergence
Now we are ready to present the global convergence of the updates (A.6) and (A.7).
Theorem A.7 (Global Convergence for Continuous-Time Mixed Design). Suppose that K0 ∈ K, ∥K0∥ < ∞, and (A,C) is detectable. Then, under the stepsize choices as in Theorem A.6, both updates (A.6) and (A.7) converge to the global optimum K∗ = R−1B⊤PK∗, in the sense that the average of {∥RKn − B⊤PKn∥2F} along iterations converges to zero with O(1/N) rate.
The proof of Theorem A.7 is deferred to §B.9. Note that the controllability assumption made in Proposition A.5 is not required here. In other words, even if there might be multiple stationary points, the updates (A.6) and (A.7) can still avoid the spurious ones and converge to the globally optimal one K∗. This can be viewed as another implication of implicit regularization of the natural PG and Gauss-Newton methods: always avoiding bad local minima and converging to the global optimal one.
Although there is no existing result on the global convergence of policy gradient for continuous-time LQR, we believe that using similar techniques as in Fazel et al. (2018), global linear rate can be achieved. In comparison, only sublinear rate can be shown for global convergence for the continuous-time mixed design. But still, (super)-linear rates can be established locally around the optimum, whose proof is provided in §B.10.
Theorem A.8 (Local (Super-)Linear Convergence for Continuous-Time Mixed Design). Suppose that the conditions in Theorem A.7 hold, and additionally DD⊤ > 0 holds. Then, under the stepsize choices as in Theorem A.6, both updates (A.6) and (A.7) converge to the optimal control gain K∗ with locally linear rates, in the sense that the objective {J (Kn)} defined in (2.10) converges to J (K∗) with linear rate. In addition, if η = 1/2, the Gauss-Newton update (A.7) converges to K∗ with locally Q-quadratic rate.
As for the discrete-time results, the locally linear rates are caused by the fact that the gradient dominance property holds only locally for mixed design problems. Moreover, notice that the Gauss-Newton update here with stepsize η = 1/2 resembles the policy iteration algorithm for continuous-time LQR (Kleinman, 1968), where PK is the solution to a Lyapunov equation, instead of the Riccati equation (A.1) in our problem. Hence, the local Q-quadratic rate here is largely expected as in Kleinman (1968).
Remark A.9 (Model-Free Algorithms). For continuous-time settings, the relationship between H2/H∞ mixed design, risk-sensitive control (continuous-time LEQG), maximum entropy H∞ control, and zero-sum LQ differential games have also been established in the literature (Mustafa, 1989; Jacobson, 1973). Hence, model-free algorithms can also be developed from the perspective of LQ games, as we discussed in §6 for the discrete-time setting.
In this section, we provide supplementary proofs for several results stated before.
B.1 Proof of Lemmas 2.7 and A.1
As a surrogate, we also define a discrete-time transfer function �T (K) as
First note that since ρ(A − BK) < 1, �T (K) is a proper and real rational stable transfer matrix. Thus, the discrete-time Bounded Real Lemma (Zhou et al., 1996, Theorem 21.12, page 539) can be applied. Note that the statement (c) in (Zhou et al., 1996, Theorem 21.12) is equivalent to the second condition in the lemma. In particular, the Riccati equation (2.13) here is identical to the equation in (Zhou et al., 1996, Theorem 21.12 (c)), and the inequality condition in (Zhou et al., 1996, Theorem 21.12 (c)) translates to I − γ−2D⊤PKD > 0. Moreover, the stability argument in (Zhou et al., 1996, Theorem 21.12 (c)) translates to that the matrix (I −γ−2DD⊤PK)−1(A−BK) is stable. In addition, by (Ionescu and Weiss, 1992, Proposition 1), if such a stabilizing solution PK exists, it must be unique. Hence, by the equivalence between (a) and (c) in (Zhou et al., 1996, Theorem 21.12), we conclude that the second condition is equivalent to ∥�T (K)∥∞ < 1, which is further equivalent to ∥T (K)∥∞ < γ.
Moreover, by the equivalence between statements (c) and (d) in (Zhou et al., 1996, Theorem 21.12), the following linear matrix inequality (LMI) condition also holds equivalently: there exists some P > 0, such that I − γ−2D⊤PD > 0 and
which reduces to the third condition in the lemma. This proves Lemma 2.7.
By definition of T (K) in (2.8), since K is stabilizing, we know that T (K) is proper and real rational stable. Thus, the continuous-time Bounded Real Lemma (Zhou et al., 1996, Corollary 13.24, page 352) can be applied. The statement (iv) in (Zhou et al., 1996, Corollary 13.24) is equivalent to the second condition in the lemma. In particular, the matrices R and H in (Zhou et al., 1996, Corollary 13.24) have the specific forms of R := γ2 · I and
so that the Riccati equation induced by H is identical to (A.1). We also note that by (Moli- nari, 1973, Theorem 1), if such a stabilizing solution exists, it must be unique. Thus, the equivalence between the first two conditions follows by the equivalence of statements (i)
and (iv) in (Zhou et al., 1996, Corollary 13.24).
Define a surrogate transfer function �T (K) as in (B.1). Then, ∥T (K)∥∞ < γ is equivalent to
∥�T (K)∥∞ < 1. By the continuous-time KYP Lemma (Dullerud and Paganini, 2013, Lemma
7.3, page 212), this is equivalent to the statement that there exists some P > 0 such that
Since −I < 0, by Schur complement, it is also equivalent to (A.3), which completes the proof of Lemma A.1.
B.2 Proof of Lemmas 3.1 and A.2
Recall that for both discrete-time and continuous-time settings, we consider the transfer function T (K) defined in (2.8) of identical form. Consider the example with matrices A, B, Q, R all being 3 × 3 identity matrices, W = D⊤D = 0.01 · I. For the discrete-time setting, we choose
and K3 = (K1 + K2)/2, then all control gains K1, K2, K3 stabilize the system (A,B), since ρ(A − BK1),ρ(A − BK2) = 0 < 1, ρ(A − BK3) = 0.8660 < 1. Nonetheless, we have ∥T (K1)∥∞ = 0.4350, ∥T (K2)∥∞ = 0.7011, while ∥T (K3)∥∞ = 1.6575. Hence, the γ-lower-level set of H∞-norm of T (K) is nonconvex for any γ ∈ (0.7011,1.6575).
Similarly, for the continuous-time setting, we choose
and K3 = (K1 + K2)/2, then K1, K2, K3 all stabilize the system (A,B), since
However, ∥T (K1)∥∞ = 0.3860, ∥T (K2)∥∞ = 0.5306, while ∥T (K3)∥∞ = 1.1729. Therefore, K is not convex when γ ∈ (0.5306,1.1729), which completes the proof.
B.3 Proof of Lemmas 3.2 and A.3
Note that ∥K∥ may be unbounded for K ∈ K. We show via counterexamples that for K with ∥K∥ < ∞, the cost does not necessarily goes to infinity as K approaches the boundary of K. Suppose DD⊤ > 0 is full-rank. For cost J (K) of form (2.10), it remains finite as long as PK is finite. For continuous-time settings, by the Bounded Real Lemma, i.e., Lemma A.1, for any K ∈ K, A−BK +γ−2DD⊤PK is always Hurwitz. Then, if there is a sequence {Kn} approaching ∂K such that λmax(PKn) → ∞ as n → ∞, then there must exist some N > 0 such that for n ≥ N, the real part of A − BK + γ−2DD⊤PK is greater than 0, causing a contradiction. Thus, J (K) in (2.10) is always finite. For discrete-time settings, by Lemma 2.7, I − γ−2D⊤PKD > 0 always holds for K ∈ K. Thus, λmax(PK) also has to be finite.
For cost J (K) of the forms (2.11) and (2.12), with DD⊤ > 0, it is finite if both PK is finite and I − γ−2D⊤PKD > 0 is non-singular. The first condition is not violated as already shown above. We now show via a 1-dimensional example that the second condition is not violated either as K → ∂K. In fact, the Riccati equation (2.13) that defines PK becomes a quadratic equation for the 1-dimensional case:
Thus, it is possible that the condition for the existence of solutions to the quadratic equations is restricter than the conditions on PK in the Bounded Real Lemma. Specifically, the solutions have the following form
Denote the discriminant of (B.2) by = 0 admit at least one solution. Moreover,
which, as 0, can be greater than 0 with small enough D and large enough γ. Additionally, if the choices of A,B,C,D,R,γ ensure that (A − BK)(1 − γ−2PKD2)−1 < 1, then such a K ∈ K. This way, as K approaches the boundary of {K |
, it is also approaching ∂K, while the value of PK approaches [γ2 − (A − BK)2γ2 + (C2 + RK2)D2] · (2D2)−1, a finite value. The above argument can be verified numerically by choosing A = 2.75, B = 2, C2 = 1, R = 1, D2 = 0.01, γ = 0.2101. In this case, 1−γ−2D2PK → 0.2354 > 0 and (A−BK)(1−γ−2PKD2)−1 → 0.9998 < 1 if K → 1.2573, which is the value that makes
0. However, the corresponding PK → [γ2−(A−BK)2γ2+(C2+RK2)D2]·(2D2)−1 = 3.3752 > 0, a finite value that also satisfies 1 − γ−2D2PK > 0. Hence, both the costs in (2.11) and (2.12) approach a finite value, which completes the proof.
B.4 Proof of Lemmas 3.3
Note that J (K) defined in (2.11) is differentiable with respect to PK, provided that det(I − γ−2PKDD⊤) > 0. This holds for any K ∈ K since by Lemma 2.7
where we have used Sylvester’s determinant theorem that det(I +AB) = det(I +BA). Thus, it suffices to show that PK is differentiable with respect to K. Recall that
where the second equation uses matrix inversion lemma, and define the operator Ψ :
Note that Ψ is continuous with respect to both PK and K, provided that γ2I − D⊤PKD > 0. Also note that the Riccati equation (2.13) can be written as
Notice the fact that for any matrices A, B, and X with proper dimensions
Thus, by vectorizing both sides of (B.5), we have
By defining �Ψ : as a new mapping such that �Ψ�vec(PK),vec(K)�:= vec�Ψ(PK,K)�, the fixed-point equation (B.7) can be re-written as
Since vec is a linear mapping, it now suffices to show that vec(PK) is differentiable with respect to vec(K). To this end, we apply the implicit function theorem (Krantz and Parks, 2012) on the fixed-point equation (B.8). To ensure the applicability, we first note that the set K defined in (2.9) is an open set. In fact, by Lemma 2.7, for any K ∈ K, there exists some P > 0 such that the two LMIs in (2.16) hold. Since the inequality is strict, there must exists a small enough ball around K such for any K′ in the ball, the LMIs still hold. Hence, the set K is open by definition.
Moreover, by the chain rule of matrix differentials (Magnus and Neudecker, 1985, Theorem 9), we know that
To show this, we compare the element at the [(j −1)m+i]-th row and the [(l −1)m+k]-th column of both sides of (B.10) with i,j,k,l ∈ [m], where both sides are matrices of dimensions m2 × m2. On the LHS, notice that
which follows from (F−1)′ = −F−1F′F−1 for some matrix function F. Also notice that
where the first equation follows from the definition of Kronecker product, and the second one is due to that the matrix
is symmetric. Thus, (B.11) and (B.12) are identical for any (i,j,k,l), which proves (B.10). By substituting (B.10) into (B.9), we have
= [I + (γ−2PKDD⊤) · (I − γ−2PKDD⊤)−1] ⊗ (I − γ−2PKDD⊤)−1
where the second equation uses the fact that (A⊗B)(C ⊗D) = (AC)⊗(BD), the third one uses (A⊗B)+(C ⊗B) = (A+C)⊗B, and the last one uses matrix inversion lemma. Hence, by (B.9), we can write the partial derivative of �Ψ�vec(PK),vec(K)�as
Therefore, the partial derivative
Now we establish the form of the policy gradient. By Lemma 2.7, we know that for any K ∈ K, (A−BK)⊤(I −γ−2PKDD⊤)−1 is stable and I −γ−2D⊤PKD > 0. Therefore, the expression ∆K in (3.1) exists, and so does the expression for ∇J (K). We then verify the expressions by showing the form of the directional derivative ∇KijJ (K), i.e., the derivative with respect to each element Kij in the matrix K. By definition of J (K) in (2.11), we have
where the first equality follows from the chain rule and the fact that ∇X logdetX = X−⊤, and the second one follows from the fact that Tr(A⊤B⊤) = Tr(BA)⊤ = Tr(BA) = Tr(AB). Furthermore, since DD⊤ is independent of K, and Tr(ABC) = Tr(BCA), we obtain from (B.14) and (B.13) that
Now we establish the recursion of ∇KijJ (K) using Riccati equation (2.13). Specifically, letting M := D(I − γ−2D⊤PKD)−1D⊤, we have from (2.13), (B.4), and (B.15) that
where on the right-hand side of (B.16), the first term is due to the fact that ∇K Tr(K⊤RKM) = 2RKM for any positive definite (and thus symmetric) matrix M, the second term is the gradient with �PK fixed, and the third term is the gradient with A − BK fixed.In addition, by taking the derivative on both sides of (B.4), we have
where the second equation has used the fact that
the third one has used (B.13), and the last one has used the matrix inversion lemma. Also, notice that (I − γ−2DD⊤PK)−1 = (I − γ−2PKDD⊤)−⊤. Thus, (B.17) can be written as
Substituting (B.18) into (B.16) yields the following recursion
∇KijJ (K) = (2RKM)ij −�2B⊤�PK(A − BK)M�ij
By performing recursion on Tr�∇KijPK · M1�, and combining all the i,j terms into a matrix, we obtain the form of the gradient given in Lemma 3.3.
B.5 Proof of Lemma A.4
Similarly as the proof in §B.4, J (K) defined in (A.2) is always differentiable with respect to PK, which makes it sufficient to show the differentiability of PK with respect to K. We proceed by using the implicit function theorem, which requires the set K considered to be open. By Lemma A.1, requiring K to be in K is equivalent to requiring the existence of P > 0 such that (A.3) holds. Since the LHS of (A.3) is continuous in K, and the LMI is strict, for any K ∈ K, there must exist a small enough ball around K such that for any K′ inside the ball, the LMI (A.3) still holds. Hence, K is an open set.
By (B.6), we have from (A.1) that
where we define the LHS of (B.19) to be ψ(vec(PK),K). Since vec(·) is a linear mapping, it suffices to show that vec(PK) is differentiable with respect to K. By Theorem 9 in Magnus and Neudecker (1985), we have
where the first equation is due to the chain rule of matrix differentials, the second equation is by definition, and the last one uses the fact that (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD). Hence, by definition of ψ(vec(PK),K) in (B.19), we have
where the second equation follows from the facts that (A ⊗ B) + (C ⊗ B) = (A + C) ⊗ B and (PKDD⊤)⊤ = DD⊤PK. On the other hand, by Lemma A.1, the matrix A − BK + γ−2DD⊤PK is Hurwitz. As a result, ∂ψ(vec(PK),K)/∂vec⊤(PK) from (B.20) is invertible. By the implicit function theorem, we conclude that vec(PK) is continuously differentiable with respect to K, and so is PK and thus J (K), at some open neighborhood of vec(PK) and K (including vec(PK) and K). This completes the proof.
Now we establish the form of the policy gradient. Define
Then we have
where �ψPK denotes the differential of �ψ with respect to PK. Moreover, we have
Since dPK = P′KdK and
we combine (B.21) and (B.22) to obtain
P′KdK(A − BK) + (A − BK)⊤P′KdK + γ−2P′KdKDD⊤PK + γ−2PKDD⊤P′KdK (B.23) = PK(BdK) + (BdK)⊤PK − (dK)⊤RK − K⊤RdK = (PKB − K⊤R)dK + (dK)⊤(B⊤PK − RK).
On the other hand, we have
where ∇J (K) ∈ , while by definition of J (K), we also have
Since ΛK is also the solution to a Lyapunov equation (A.4), we multiply (B.26) by ΛK and multiply (A.4) by P′KdK, and then take trace on both sides of both equations, to obtain the following identity
By further equating (B.24), (B.25), and (B.27), we have
This gives the expression of the policy gradient ∇J (K) = 2[RK − B⊤PK]ΛK, completing the proof of Lemma A.4.
B.6 Proof of Propositions 3.4 and A.5
Discrete-Time:
The proof is based on a game-theoretic perspective on the problem. First, for any K ∈ K, by applying Theorem 3.7 in Bas¸ar and Bernhard (1995), with A, B, D, Q, R, γ therein being replaced by A − BK, 0, D, Q + K⊤RK, R, γ here, we obtain that the Riccati equation in (2.10) corresponds to the generalized algebraic Riccati equation (3.52b) in Bas¸ar and Bernhard (1995), for this auxiliary game. By Lemma 2.7, the solution PK ≥ 0 sat-isfies (3.53) in Bas¸ar and Bernhard (1995). Recall that PK is the unique stabilizing solution to (2.10), and is thus also minimal if (A − BK,D) is stabilizable (Ran and Vreug- denhil, 1988, Theorem 3.1), which is indeed the case since K ∈ K is stabilizing. Hence, by (Bas¸ar and Bernhard, 1995, Theorem 3.7) (ii)(iv), the controller and the disturbance that attain the upper-value of the game have, respectively, the forms of ut = 0 and wt = (γ2I −D⊤PKD)−1D⊤PK(A −BK)xt for all t. Note that (A,Q1/2) being detectable in (Bas¸ar and Bernhard, 1995, Theorem 3.7) is not used when applying (ii)(iv). This shows that in the original game with A, B, D, Q, R, γ (as defined in (Bas¸ar and Bernhard, 1995, Chapter 3.7)), and with a fixed K ∈ K, the maximizing disturbance has the form as wt above, and the value under the pair (K,−(γ2I − D⊤PKD)−1D⊤PK(A − BK)) is indeed x⊤0 PKx0. By again applying (Bas¸ar and Bernhard, 1995, Theorem 3.7) to the original game, we know that the value is x⊤0 PK∗x0, and is achieved by the optimal controller u∗t = −K∗xt and the maximizing disturbance w∗t = [(γ2I − D⊤PK∗D)−1D⊤PK∗(A − BK∗)]xt, with K∗ being defined in the propo- sition. By definition of the value of the game, we know that x⊤0 PKx0 ≥ x⊤0 PK∗x0 for any K ∈ K. As the above arguments hold for any x0, we know that PK ≥ PK∗. Finally, notice that for K,K∗ ∈ K, if PK ≥ PK∗, then 0 < I − γ−2D⊤PKD ≤ I − γ−2D⊤PK∗D (cf. Lemma 2.7). By det(I − γ−2PKDD⊤) = det(I − γ−2D⊤PKD), we know that J (K) ≥ J (K∗) for any K ∈ K. This completes the proof for the first half of the proposition.
For the second half of the proposition, note that ∆K ≥ 0 since I − γ−2D⊤PKD > 0 for any K ∈ K by Lemma 2.7. Also, since (I − γ−2D⊤PKD)−1 ≥ I, we know that
By (Zhou et al., 1996, Lemma 21.2), the RHS of (B.28) is always positive definite, since�(I − γ−2PKDD⊤)−⊤(A−BK),D�is controllable, i.e.,�(A−BK)⊤(I−γ−2PKDD⊤)−1,D⊤�is observable. Thus, ∆K > 0 is full-rank. By the necessary optimality condition ∇J (K) = 0, it follows that K∗ = (R + B⊤�PK∗B)−1B⊤�PK∗A is the unique stationary point, which is thus the unique global optimizer. This completes the proof of Proposition 3.4.
Continuous-Time:
The proof is analogous to the above one, except that for proving the first half, one applies (Bas¸ar and Bernhard, 1995, Theorem 4.8) for the continuous-time setting. Note that (A,C) being detectable is now needed to apply (Bas¸ar and Bernhard, 1995, Theorem 4.8). Note that the objective (A.2) is monotone in the eigenvalues of PK, in that if PK ≥ PK∗ for any K ∈ K, then J (K) ≥ J (K∗). For proving the second half, by (Zhou et al., 1996, Lemma 3.18 (iii)), the solution to the Lyapunov equation (A.4) ΛK > 0, since (A − BK + γ−2DD⊤PK,D) is controllable, or equivalently, the pair ((A − BK + γ−2DD⊤PK)⊤,D⊤) is observable. Thus, the necessary optimality condition ∇J (K) = 0 yields that K∗ = R−1B⊤PK∗ is the unique stationary point, which is thus the unique global optimizer. This completes the proof of Proposition A.5.
B.7 Proof of Lemma 5.1
We start with the following helper lemma.
Lemma B.1. Suppose that K,K′ ∈ K. Then we have that I − γ−2PK′DD⊤ is invertible, and
Proof. First, since K,K′ ∈ K, by Lemma 2.7, I − γ−2D⊤PK′D > 0 is invertible. Thus, det(I − γ−2PK′DD⊤) = det(I−γ−2D⊤PK′D) � 0, namely, I−γ−2PK′DD⊤ is invertible. Then the desired fact is equivalent to
which can be further simplified as
By �PK = (I − γ−2PKDD⊤)−1PK and (B.13), we have
Thus, it follows that
Therefore, (B.30) is equivalent to
Given the fact γ2I > D⊤PKD and another fact that
we know that the above inequality holds and hence our lemma is true. To show that (B.31) holds, it suffices to apply the matrix inversion lemma, i.e.,
where the first equation uses the matrix inversion lemma. This completes the proof.
By definition of �PK in (2.14) and the Riccati equation (2.13), we have
By (B.29) in Lemma B.1, we further have
By induction, we can apply the above inequality iteratively to show that
On the other hand, we have
which can be substituted into (B.33) to obtain the upper bound in (5.25). For the lower bound (5.26), note that the conditions in Lemma B.1 also hold here when the roles of K and K′ are interchanged. Thus, we have
which gives a lower bound on the RHS of (B.32) directly as
Continuing unrolling the RHS of (B.35) and substituting into (B.34), we obtain the desired lower bound in (5.26), which completes the proof.
B.8 Proof of Theorem A.6
We first argue that it suffices to find some P > 0 for K′, such that
Denote the LHS of (B.36) by −M < 0, then such a P > 0 yields
which implies that (A − BK′) is Hurwitz, i.e., K′ is stabilizing. Thus, Lemma A.1 can be applied to K′, which shows that (B.36) is equivalent to ∥T (K′)∥∞ < γ. This means that K′ ∈ K. Hence, we will focus on finding such a P > 0 hereafter.
We first show that the Gauss-Newton update (A.7) with stepsize η = 1/2 enables (B.36) to hold. Specifically, we have
By Lemma A.1, the closed-loop system A−BK +γ−2DD⊤PK is stable since K ∈ K. Hence, the following Lyapunov equation admits a solution ¯P > 0:
Hence, we choose P = PK + α ¯P > 0 as the candidate for some α > 0. The LHS of (B.36) now can be written as
We now need to show that there exists some α > 0 such that 1 + 2 < 0. By substituting in K′ from (B.37), we can write 1 as
Moreover, 2 can be written as
where the second equation has used the Riccati equation (A.1), and the third one has used the definition of ¯P in (B.38). Thus, there exists a small enough α > 0, such that 1 + 2 < 0, namely, there exists some P > 0 such that (B.36) holds for K′ obtained from (A.7) with stepsize η = 1/2. On the other hand, since such a P makes 2 < 0, it also makes the LMI (B.36) hold for K, i.e.,
As before, we choose P = PK+α ¯P for some α > 0. Then, the term 2 in (B.39) is still −αI+o(α). The term 1 in (B.39) can be written as
where we have used the definition of P, and the inequality is due to the fact that
In addition, if the stepsize η ≤ 1/(2∥R∥), then 1 can be further bounded from (B.43) that
1 ≤ −2η(RK − B⊤PK)⊤(RK − B⊤PK) + 2η(RK − B⊤PK)⊤(RK − B⊤PK) + 2α2η(B⊤ ¯P)⊤(B⊤ ¯P) = 2α2η(B⊤ ¯P)⊤(B⊤ ¯P) = o(α).
As a result, there exists small enough α > 0 (and thus P) such that 1 + 2 < 0. Hence, (B.36) holds for K′ obtained from (B.42) with η ≤ 1/(2∥R∥). By Lemma A.1, this proves the first argument that ∥T (K′)∥∞ < γ. Such a K′ also ensures the existence of the stabilizing solution PK′ ≥ 0 to the Riccati equation (A.1). This completes the proof.
B.9 Proof of Theorem A.7
We first introduce the continuous-time cost difference lemma that establishes the relationship between PK′ − PK and K′ − K.
Lemma B.2 (Continuous-Time Cost Difference Lemma). Suppose that both K,K′ ∈ K. Then, we have the following upper bound:
If additionally the matrix A − BK′ + γ−2DD⊤PK is Hurwitz, then we have the lower bound:
Proof. First, by Lemma A.1, the matrices A − BK′ + γ−2DD⊤PK′ and A − BK + γ−2DD⊤PK are both Hurwitz. As a result, the integral on the RHS of (B.44) is well defined. By subtracting two Riccati equations (A.1) corresponding to K′ and K, we have
+ (K′ − K)⊤(RK − B⊤PK) + (K′ − K)⊤R(K′ − K) − γ−2∆⊤P DD⊤∆P (B.46) = (A − BK′ + γ−2DD⊤PK)⊤∆P + ∆P(A − BK′ + γ−2DD⊤PK) + (RK − B⊤PK)⊤(K′ − K)
where we let ∆P = PK′ − PK, and the relationship follows from the facts below:
γ−2PK′DD⊤PK′ − γ−2PKDD⊤PK = γ−2PKDD⊤∆P + γ−2∆PDD⊤PK + γ−2∆⊤P DD⊤∆P = γ−2PK′DD⊤∆P + γ−2∆PDD⊤PK′ − γ−2∆⊤P DD⊤∆P.
From (B.46), we know that ∆P ≤ ˆ∆P, where ˆ∆P is the solution to the following Lyapunov equation
This is due to the fact that (B.46) subtracted from (B.48) yields
Note that �∆P has the form of the RHS of (B.45). Subtracting (B.50) from (B.47) yields the Lyapunov equation
Hence, γ−2∆⊤P DD⊤∆P ≥ 0 implies that the unique solution to (B.51), ∆P − �∆P ≥ 0, which completes the proof of the lower bound.
As in the discrete-time setting, Lemma B.2 also characterizes the “Almost Smoothness” of PK with respect to K (Fazel et al., 2018). Now we are ready to analyze the updates (A.7) and (A.6).
We start by the following helper lemma that lower-bounds the solution to the Lyapunov equation,
for any matrix M > 0.
Lemma B.3. Suppose that K ∈ K, and there exists a constant CK > 0 such that K⊤RK ≤ CK · I. Let MK > 0 be the unique solution to the Lyapunov equation (B.52) with M > 0. Then,
where ωK > 0 is defined as
Proof. By Shapiro (1974), the solution MK to the Lyapunov equation satisfies
On the other hand, multiplying P−1/2K on both sides of the Riccati equation (A.1) yields
which further implies that
by taking the largest and smallest eigenvalues on both sides, respectively. Note that in (B.55), the term K⊤RK is dropped, while in (B.56), K⊤RK is replaced by CK·I, and γ−2PKDD⊤PK is dropped. Let ω(X) be the numerical radius of a matrix X defined as
Then (B.55) and (B.56) together yield
By the relationship between operator norm and numerical radius (Shebrawi and Albadawi,
2009), we also have
Combining(B.54), (B.57), and (B.58) proves the desired result.
Gauss-Newton:
Recall that the Gauss-Newton update has the form K′ = K−2η(K−R−1B⊤PK). By Theorem A.6, K′ also lies in K if η ≤ 1/2. Then, by the upper bound (B.44), for any η ∈ [0,1/2],
which implies the monotonic decrease of PK (matrix-wise) along the update. Since PK is lower-bounded, such a monotonic sequence of {PKn} along the iterations must converge to some PK∞ ∈ K. Now we show this PK∞ is indeed PK∗. Multiplying by any M > 0 on both sides of (B.59) and taking the trace further implies that
≤ −2ησmin(M)4ωK Tr�(RK − B⊤PK)⊤R−1(RK − B⊤PK)�
≤ −ησmin(M)2ωKσmax(R) Tr�(RK − B⊤PK)⊤(RK − B⊤PK)�, (B.60)
where the second inequality follows by applying Lemma B.3, and ωK is as defined in (B.53). Since for any finite N > 0, there exists some constant CNK > 0 such that K⊤n RKn ≤ CNK ·I for all Kn with n ≤ N −1. By definition, ωK can be uniformly upper bounded along the iteration as
which is due to the facts that the first and second terms in the max operator are increasing and decreasing with respect to PK, respectively, and PK0 ≥ PKn ≥ PK∞ holds for all n ≥ 0 from (B.59).
From iterations n = 0 to N −1, replacing ωK by ωK and M by identify matrix I, summing over both sides of (B.60) and dividing by N, we have
This shows that the sequence {∥RKn −B⊤PKn∥2F} converges to zero, namely, the sequence {Kn} converges to the stationary point K such that RK −B⊤PK = 0, with sublinear O(1/N) rate. By Proposition A.5, this is in fact towards the global optimum K∗.
Natural Policy Gradient:
Recall that the natural PG update follows K′ = K −2ηEK. By Theorem A.6, K′ also lies in K if η ≤ 1/(2∥R∥). Then, by the upper bound in (B.44), we also have the monotonic decrease of PK (matrix-wise) along the iterations as
As in (B.60), taking the trace of both sides yields
for any M > 0, where ωK is also defined as in (B.61). Summing over both sides of (B.62) from n = 0 to n = N − 1 gives
where we have replaced M in (B.62) by I. This completes the proof.
B.10 Proof of Theorem A.8
By the lower bound (B.45) from Lemma B.2 and completion of squares, we have that
Multiplying DD⊤ > 0 and taking traces on both sides of (B.63) with K′ = K∗ gives
where by a slight abuse of notation, we define MK,K∗ as
Note that A−BK∗+γ−2DD⊤PK∗ is Hurwitz. Let ϵ := −maxi∈[m][ℜλi(A−BK∗+γ−2DD⊤PK∗)] > 0 be the largest real part of the eigenvalues of A−BK∗+γ−2DD⊤PK∗. By the continuity of PK, there exists a ball B(K∗,r) ⊆ K centered at K∗ with radius r > 0, such that for any K ∈ B(K∗,r),
Gauss-Newton:
Under DD⊤ > 0, which implies the observability condition in Theorem A.7, we know that {Kn} approaches K∗ = R−1B⊤PK∗. From (B.60), (B.61), and (B.64), with M replaced by DD⊤, we obtain that if some Kn is close enough to K∗ such that Kn ∈ B(K∗,r), then letting K = Kn and K′ = Kn+1, we have
which further implies that
By (B.66), the sequence {Tr(PKn+pDD⊤)}p≥0 decreases to Tr(PK∗DD⊤) starting from some Kn ∈ B(K∗,r). By continuity, there must exist a Kn+p close enough to K∗, such that the lower-level set {K | Tr(PKDD⊤) ≤ Tr(Kn+pDD⊤)} ⊆ B(K∗,r). Hence, starting from Kn+p, the iterates will never leave B(K∗,r). Thus, by (B.65), MK,K∗, as the unique solution to the Lyapunov equation
For η = 1/2, suppose that some K = Kn ∈ B(K∗,r), then K′ = Kn+1 = R−1B⊤PK, which yields that
for some constant c > 0. Combining (B.67) and (B.68) gives
for some constant c′. Note that from some p ≥ 0 such that Kn+p onwards never leaves B(K∗,r), the constant c′ is uniformly bounded, which completes the Q-quadratic convergence rate of {Tr(PKnDD⊤)} around K∗.
Natural Policy Gradient:
Combining (B.62) and (B.64) yields that
Using similar arguments as above, one can establish the local linear rate of {Tr(PKnDD⊤)} with a different contraction factor. This completes the proof.
In this section, we prove several auxiliary results used before.
Lemma C.1 (Integral of Gaussian Random Variables). Suppose that z ∼ N (¯z,Z). Then, for any positive semidefinite matrix P and scalar β satisfying I − βPZ > 0, it follows that
where �P = P + βP(Z−1 − βP)−1P.
Proof. By definition, we have
By completing the squares in the exponent, and using the fact that
for any Gaussian random variable following N (µ,Σ), we obtain the desired result.
Then, we have J ∗ = −β−1 logdet(I − βPK∗W). Moreover, among all controls that generate a well-defined objective, the optimal control is LTI state-feedback given by µt(x0:t,u0:t−1) = −K∗xt for all t ≥ 0.
Proof. Recall that the original LEQG problem is defined as
where x0:t and u0:t−1 denote the history of states from time 0 to t and actions from time 0 to t − 1, respectively. Let ht := (x0:t,u0:t−1). For notational convenience, we define
where Pt ≥ 0 is any nonnegative definite matrix, ut is the control at time t that is adapted to the σ-algebra generated by ht, and f : can be any function of x. Also, we note that the expectation in Fut(f ) is taken over the randomness of w given x and ut. Obviously,
Fut(f ) is a monotone operator, i.e., if some g satisfies g(x) ≥ f (x) for any x, then we have [Fut(g)](x) ≥ [Fut(f )](x) for any x. Consider a T -stage control sequence (u0,u1,··· ,uT −1), and notice that
Keeping imposing operators FuT −1,···Fu0 yields
[Fu0 ···FuT −1(V0)](x) = 2β log
we have from Lemma C.1 that
from which we obtain that
Note that (C.5) is a quadratic function of uT−t−1, which has the minimizer
Substituting (C.6) back to (C.5), we require Vt+1(x) to be updated as Vt+1(x) = [Fu∗T −t−1(Vt)](x)− λt, i.e., we require
In particular, (C.4), (C.6), and (C.7) constitute the recursion updates of the modified Riccati equation, which also gives the following relation
i.e.,
In addition, FuT −t−1(Vt) also has the property that
for any constant c, since c can be taken out of the expectation in FuT −t−1(V ). Combining (C.9), (C.10), and the monotonicity of Fut yields
Taking expectation over x and dividing both sides of (C.11) by T, we obtain from (C.3) that
By letting V0(x) = 0 and taking limit T → ∞ on both sides of (C.12), the LHS converges to the objective of LEQG defined in (C.2) (which is assumed to exist for the studied control sequence (u0,u1,··· ,)); while the RHS converges to the value of λ∗ := −β−1 logdet(I −βPK∗W), where PK∗ is the unique stabilizing solution to the modified Riccati equation given in (C.1), due to Lemma C.4 in §C. Thus, the sequence {λt} converges to λ∗ as t → ∞, and so does the sequence {�T −1t=0 λt/T} as T → ∞; while vanishes to zero since PT converges to PK∗ as T → ∞. Hence, we obtain from (C.12) that
where the equality can be obtained when u∗t = −(R + B⊤�PK∗B)−1B⊤�PK∗Ax = −K∗x for all t ≥ 0. In other words, among all controls that achieve the lim of (2.1), the optimal objective is J ∗ = λ∗, and can be achieved by the stationary linear state-feedback control K∗ obtained from (C.1), which completes the proof.
Lemma C.3 (Restatement of Lemma 2.1). For any LTI state-feedback controllers ut = −Kxt, such that the Riccati equation (2.3) admits a solution PK ≥ 0 that: i) is stabilizing, i.e., ρ�(A − BK)⊤(I − βPKW)−1�< 1, and ii) satisfies W −1 − βPK > 0, J (K) has the form of
Proof. Note that by (Ionescu and Weiss, 1992, Proposition 1), the stabilizing solution of the Riccati equation (2.3), if it exists, is unique. The proof then proceeds in the similar vein as that for Lemma C.2. In particular, we keep the updates (C.4) and (C.7), and replace (C.6) by u∗T −t−1 = −Kx. These updates constitute the recursion of the Riccati equation (2.3). Then, following the derivation from (C.8)-(C.11), but replacing the inequality by equality, we obtain that
Taking T → ∞ on both sides of (C.13), the limit on the LHS reduces to the definition of J (K), and the limit on the RHS converges to −β−1 logdet(I − βPKW). This follows by: i) Lemma C.5 in §C, showing that the recursion of (2.3) starting from P0 = 0 converges to PK, which further leads to the convergence of
and ii) limT →∞ = 0 due to the boundedness of PK. As a result, we obtain the desired form in (2.5), which completes the proof of Lemma 2.1.
Lemma C.4 (Recursion of Discrete-Time Modified Riccati Equation (C.1)). Suppose the modi-fied Riccati equation (C.1) admits a stabilizing fixed-point solution PK∗ such that W −1 −βPK∗ > 0. Let P0 ≥ 0 be some nonnegative definite matrix satisfying P0 ≤ Q and W −1 − βP0 > 0. Then, starting from P0, the iterate sequence {Pt} from (C.14) below converges to PK∗, which is unique.
Proof. First note that if PK > 0, then �PK in (2.4) can also be written as
which further implies the monotonicity of �PK, i.e., if PK′ ≥ PK for some K,K′, then �PK′ ≥ �PK. If Pt > 0, then the update in (C.14) can be written as
where the second equation is basic substitution, the third one uses matrix inversion lemma, and the last one uses (C.15). Obviously, F (Pt) is matrix-wise monotonically increasing with respect to Pt. Note that as a fixed point of F (P), PK∗ = F (PK∗) ≥ Q, while P0 ≤ Q. Hence, by monotonicity, if P0 > 0, then PK∗ ≥ F (P0) = P1. By induction, PK∗ ≥ Pt, for all t ≥ 0. If P0 ≥ 0, then
and we can thus apply the argument above starting from t ≥ 1; else if P0 > 0, we can apply it directly from t ≥ 0. In addition, since from (C.14), P1 ≥ Q ≥ P0, we have P2 = F (P1) ≥ P1 = F (P0). By induction, Pt+1 ≥ Pt for all t ≥ 0. Note that W −1 − βPt is well defined for all t, due to that W −1 − βPK∗ > 0 by assumption. Since the sequence {Pt} is monotone and upper-bounded, we conclude that the recursion of (C.14) must converge to some P∞, which constitutes a fixed-point solution to the modified Riccati equation (C.1).
In addition, by Lemma 3.1 in Stoorvogel and Weeren (1994), we know that the stabilizing fixed-point solution, i.e., the PK∗ that makes ρ�(I − βPK∗W)−⊤(A − BK∗)�< 1, is unique. Also,
by Lemma 3.8 therein, any other fixed-point solution P that makes A − B(R + B⊤�PB)−1B⊤�PA stable, which by Lemma 3.4 therein is a necessary condition for P to be a stabilizing solution, satisfies that P ≥ PK∗. Notice that any fixed-point solution is greater than Q, and thus greater than P0. Hence, starting from P0 converges to the minimal fixed point, i.e., the unique stabilizing one, which completes the proof.
Lemma C.5 (Recursion of Riccati Equation (2.3)). Suppose that control gain K yields a stabilizing solution PK, i.e., ρ�(A − BK)⊤(I − βPKW)−1�< 1, to the Riccati equation (2.3), such that W −1 − βPK > 0. Then, such a solution is unique, and minimal among all Hermitian solutions. In addition, if PK ≥ 0, the following recursion starting from any P0 such that 0 ≤ P0 ≤ Q and
W −1 − βP0 > 0 converges to PK,
Proof. First, by (Ran and Vreugdenhil, 1988, Theorem 3.1), if one chooses C = 0, R < 0 therein, by reversing the sign in the theorem, and noticing that the pair (A − BK,W 1/2) is stabilizable since W > 0, we know that there exists a maximal solution −P− ≥ −P for any Hermitian solution −P to the Riccati equation, which is also stabilizing. Hence, there exists a minimal solution P− that is also stabilizing. This conclusion can be directly applied to (2.3), with R = −1/βW −1 < 0 and since 1/βW −1 − PK > 0. Also, by (Ionescu and Weiss, 1992, Proposition 1), the stabilizing solution of the Riccati equation (2.3), if it exists, is unique. Thus, PK, as the unique stabilizing solution, is also minimal, i.e., PK ≤ P for any solution to (2.3).
On the other hand, if Pt > 0, we rewrite the update (C.16) as
which is matrix-wise monotone with respect to Pt. Since PK ≥ Q ≥ P0, by induction, FK(PK) = PK ≥ FK(Pt) = Pt+1 for all t ≥ 0. If P0 ≥ 0, then
and the conclusion above also holds for t ≥ 1; else it holds for t ≥ 0. In addition, since P1 ≥ Q ≥ P0, by induction, Pt+1 ≥ Pt for all t ≥ 0. Note that the matrix W −1 − βPt is always invertible along the recursion, by the assumption that W −1−βPK > 0. Therefore, by both the monotonicity and boundedness of the sequence {Pt}, the recursion (C.16) converges to some fixed-point solution to (2.3). Since PK is the minimal one, the sequence {Pt} thus converges to PK, which concludes the proof.
In this section, we provide the pseudocode of the model-free algorithms mentioned in §6.2. Particularly, Algorithm 1 estimates the policy gradient ∇KC(K,L) and the correlation matrix ΣK,L for any stabilizing (K,L); Algorithm 2 finds an estimate of the maximizer L(K) in (6.7) for a given K; Algorithm 3 describes the updates of K for finding an estimate of K∗.