Hamilton-Jacobi-Bellman Equations for Q-Learning in Continuous Time

2019·Arxiv

Abstract

Abstract

In this paper, we introduce Hamilton–Jacobi–Bellman (HJB) equations for Q-functions in continuous-time optimal control problems with Lipschitz continuous controls. The standard Q-function used in reinforcement learning is shown to be the unique viscosity solution of the HJB equation. A necessary and sufficient condition for optimality is provided using the viscosity solution framework. By using the HJB equation, we develop a Q-learning method for continuous-time dynamical systems. A DQN-like algorithm is also proposed for high-dimensional state and control spaces. The performance of the proposed Q-learning algorithm is demonstrated using 1-, 10- and 20-dimensional dynamical systems. Keywords: Hamilton-Jacobi-Bellman equation, Optimal control, Q-learning, Reinforcement learning, Deep Q-Networks.

1. Introduction

Q-learning is one of the most popular reinforcement learning methods that seek efficient control policies without the knowledge of an explicit system model Watkins and Dayan (1992). The key idea of Q-learning is to combine dynamic programming and stochastic approximation in a way to estimate the optimal state-action value function, also called the Q-function, by using trajectory samples. For discrete-time Markov decision processes, Q-learning has been extensively studied (see Bertsekas (2019); Matni et al. (2019) and the references therein), while the literature on continuous-time Q-learning is sparse. In discrete time, the Bellman equation for Q-functions can be defined by using dynamic programming in a straightforward manner. However, the corresponding Bellman equation for continuous-time Q-functions has not yet been fully characterized despite some prior attempts using HJB equations Doya (2000); Munos (2000). A variant of Q-function is used in Lee et al. (2012); Mehta and Meyn (2009), which has a different meaning from the Q-function in reinforcement learning. In other literature, a Q-function similar to that of reinforcement learning was introduced, but with function-valued control input Palanisamy et al. (2015) or heavily utilizing the linear-time-invariant (LTI) system structure Vamvoudakis (2017). A similar model-free approach for LTI systems has been also studied in Jiang and Jiang (2012); Vrabie et al. (2009), although an associated Q-function is not specifically defined. A continuous-time Q-function was also considered to prove the convergence of stochastic approximation Devraj and Meyn (2017).

In this paper, we consider continuous-time deterministic optimal control problems with Lipschitz continuous controls. We show that the associated Q-function corresponds to the unique viscosity solution of a Hamilton–Jacobi–Bellman (HJB) equation in a particular form. In the viscosity solution framework, even when it is not differentiable, the Q-function can be used to verify the optimality of a given control and to design an optimal control strategy. We use the proposed HJB equation to derive an integral equation that the optimal Q-function and optimal control trajectories should satisfy. Based on this equation, we propose a Q-learning algorithm for continuous-time dynamical systems. For high-dimensional state and control spaces, we also propose a DQN-like algorithm by using deep neural networks (DNNs) as a function approximator Mnih et al. (2015). This opens a new avenue of research that connects viscosity solution theory for HJB equations and Q-learning domain. The performance of the proposed Q-learning algorithm is tested through a set of numerical experiments with 1-, 10- and 20-dimensional systems.

2. Continuous-Time Q-Functions and HJB Equations

Consider a controlled dynamical system of the form

where is the system state and is the control input. Let measurable} be the set of admissible controls. The standard finite-horizon optimal control problem can be formulated as

with are running and terminal cost functions of interest, respectively, and is a subset of U. The Q-function is defined by

which represents the minimal cost incurred from time t to T when starting from x(t) = x with u(t) = u. In particular, when , the Q-function reduces to the standard optimal value function , defined by

Proposition 1 Suppose that . Then, the Q-function (3) corresponds to i.e.,

Proof Fix be an arbitrary positive constant. Then, there exists satisfies (1) with x(t) = x in the Carath´eodory sense: . We now construct a new control . Such a modification of controls at a single point does not affect the trajectory or the total cost. Therefore, we have

Since was arbitrary, we conclude that

Thus, if is chosen to be the entire set of measurable function U, the Q-function has no additional interesting property. Motivated by this observation, we restrict the control to be a Lipschitz continuous function. Since any Lipschitz continuous function is differentiable almost everywhere, we define the set of admissible controls a fixed constant. Then, for any , there exists a unique measurable function with such that the following ODE holds a.e.: . Thus, from now on, we will focus on the optimal control problem (2) when the control u is Lipschitz continuous such that a.e. By using the dynamic programming principle, we can deduce that

To derive the Hamilton-Jacobi equation that the Q-function satisfies, suppose for a moment that . We will discard this regularity assumption on Q by introducing the viscosity solution framework in Section 2.1. Then, the Taylor expansion of Q in (5) yields h tend to zero, we arrive at the following HJB equation for the Q-function:

Note that minimizes the Hamiltonian, and thus the HJB equation can be expressed as

In what follows, we uncover several mathematical properties of the HJB equation (6) and the Qfunction.

2.1. Viscosity Solution: Existence and Uniqueness

In general, the Q-function is not a -function. As a weak solution of the HJB equation, we use the framework of viscosity solutions Crandall et al. (1984); Crandall and Lions (1983). We begin by defining the viscosity solution of (6) in the following standard manner Bardi and Capuzzo-Dolcetta (1997); Evans (2010):

Definition 2 A continuous function viscosity solution of (6) if

1.

2. For any has a local maximum at

3. For any has a local minimum at

From now on, we assume the following regularity conditions on

• (A1) The functions f, r and q are bounded:

• (A2) The functions f, r and q are Lipschitz continuous: is a Lipschitz constant of argument.

Then, the HJB equation (6) has a unique viscosity solution, which corresponds to the Q-function.

Theorem 3 The Q-function (3) is the unique viscosity solution of the HJB equation (6). Moreover, it is a bounded and Lipschitz continuous function.

Proof To accommodate the Lipschitz continuity constraint on controls, we consider an augmented system are interpreted as a new state and a new input, respectively. Let z(t) := (x(t), u(t)) be the augmented state, and F(z, a) := (f(x, u), a) be the augmented vector field. Then, the Q-function can be expressed as . The HJB equation (6) can be rewritten as , where the Hamiltonian H = H(p, z) is defined by By the assumptions (A1) and (A2), we have

||

These imply that the Hamiltonian satisfies the Lipschitz continuity conditions, and thus the standard proof for the existence and the uniqueness of viscosity solution can be directly used (e.g., Evans (2010)). Furthermore, by Bardi and Capuzzo-Dolcetta (1997), the Q-function corresponds to the unique viscosity solution. The boundedness and the Lipschitz continuity of the Q-function can be proved as in Evans (2010).

2.2. Asymptotic Consistency

We now discuss the convergence of the Q-function to the optimal value function as the Lipschitz constant . This convergence property demonstrates that the proposed HJB framework is asymptotically consistent with our observation in Proposition 1. We parametrize the Q-function by so that the scaling becomes similar to that of the other classical singular limit problem. More precisely, let

where straightforward to observe that . We also notice that is a bounded function under Assumption (A1). Therefore, by the monotone convergence theorem, there exists a limit function

The limit function corresponds to the optimal value function (4) without the Lipschitz continuity constraint on controls.

Theorem 4 For any

Proof Since the argument in the proof of (Bardi and Capuzzo-Dolcetta, 1997, Theorem 4.1 in Ch. 7) can be used, we omit the proof.

2.3. Optimal Controls

To characterize a necessary and sufficient condition for optimality of a control , we consider the function By (5), we deduce that the control u is optimal if and only if is a constant function for each (x, u, t). On the other hand, the dynamic programming principle implies that the function is non-decreasing for any control . Thus, the control u is optimal if and only if the function is non-increasing. If the Q-function is differentiable, this implies Q-function satisfies HJB equation (6), we have

Therefore, when Q is differentiable, is optimal if and only if u(t) = u. To obtain the rigorous principle of optimality when Q is not differentiable, we use generalized derivatives of the Q-function, namely, sub- and superdifferntials of Q. The following optimality theorem is a direct application of the classical results in (Bardi and Capuzzo-Dolcetta, 1997, Theorem 3.39, Ch. 3).

Theorem 5 Suppose that f and q are continuously differentiable. Then, the trajectory-control pair is optimal if and only if

Proof Note that the controlled system is equivalent to the extended dynamics defined in the proof of Theorem 3. Then, the assertions can be obtained by directly applying (Bardi and Capuzzo-Dolcetta, 1997, Theorem 3.39 in Chapter 3) to this augmented system.

At a point (x, u, t) where Q is differentiable, the sub- and superdifferentials of Q are identical to the classical derivative of Q. Thus, at such a point, we can construct the optimal control by using . At a point where Q is not differentiable, one can choose any control given by (8).

3. Q-Learning Using the HJB Equation

In the infinite-horizon case, we consider the following discounted cost function (with

and the Q-function is defined by

Again, using the dynamic programming principle, we can derive the following HJB equation:

As in Theorem 3, we can show that the Q-function is the unique viscosity solution of the HJB equation (9). A necessary and sufficient condition for optimality can be characterized in a way similar to Theorem 5.

We now discuss how the HJB equation (9) can be used to design a Q-learning algorithm for estimating Q(x, u) using sample trajectories. To provide the essential idea, we assume for a moment that Q is differentiable. We then have

Suppose now that an optimal control is employed, i.e., . Then, the time derivative of Q-function along the trajectory is further simplified as

By integrating (10) along the optimal trajectory-control pair, we obtain

However, since the optimal Q-function and optimal trajectories are unknown a priori, we iteratively update the Q-function and the control by using sample data. The iteration is based on the equation (11), which is the characterizing equation of the optimal Qfunction. Specifically, for a given control , we obtain the system trajectory starting from randomly chosen (x, u) for time interval [0, h] with small h > 0 and collect the sample data . We then update a new estimate for Q-function by using (11) and sample data as holds asymptotically.

To handle high-dimensional state and control spaces, we propose a DQN-like algorithm by using DNNs as a function approximator. Let denote the approximate Q-function parameterized by . We update the network parameter by minimizing the mean squared error (MSE) loss between and the target . To enhance the stability of learning procedure, we use K sample points of for defining MSE loss and introduce the target network parameter in estimating target value Mnih et al. (2015). The target network is slowly updated as a weighted sum of and itself as in Lillicrap et al. (2016). The algorithm minimizes the error between the left and right-hand sides of (11) for each iteration, making the asymptotically satisfies (11) as much as possible. The overall procedures are summarized in Algorithm 1. Note that this algorithm does not need the knowledge of an explicit system model as in discrete-time Q-learning or DQN.

4. Numerical Experiments

We consider the following linear system with an exponentially discounted quadratic cost:

where . We restrict the control u as a Lipschitz continuous function with Lipschitz constant 1. The parameters for simulation are chosen as DNNs for approximating Q-functions, we use fully connected networks consisting of an input layer with m + n nodes, two hidden layers with 128 nodes and an output layer with a single node. For the two hidden layers, we use ReLU activation function. For training the networks, we use Adam optimizer with a learning rate Kingma and Ba (2015).

Figure 1: Numerical experiment result for n = m = 1: (a) learning curve; (b) controlled trajectory of x(t), u(t) and a(t).

4.1. One-dimensional problem

As a toy example for sanity check, we first consider a one dimensional model, where n = m = 1, A = 0 and B = 1. In order to measure the performance of control, we fix the initial state and control (x, u) = (1, 1) and we integrate the running cost over [0, 10]. Figure 1(a) shows the log of costs at each iteration of Algorithm 1. The solid line represents the learning curve averaged over five different trials and the shaded region represents the minimum and maximum of the cost over different trials. The running cost rapidly decreases as the network parameters is learned. We also note that the variation over five different trials vanishes as the parameters are learned. Figure 1(b) shows the trajectory of x(t), u(t), and a(t) generated by the learned Q-function iteration. The optimal control for this one dimensional problem is to drive both x(t) and u(t) as soon as possible to 0. Starting from (x, u) = (1, 1), this can be done by first driving the control u(t) to negative value so that x(t) moves towards the origin, and then reducing the absolute value of u(t) so that both x(t) and u(t) approaches 0 asymptotically. Such a behavior is observed in Figure 1(b). This confirms that the learned policy is near optimal.

4.2. 10- and 20-dimensional problems

For 10- and 20-dimensional problems, where n = m = 10 or n = m = 20, respectively, we set the coefficient matrices such that each element of A is randomly sampled from U[0, 1] and then multiplied by 0.1, and each element of B is randomly sampled from the uniform distribution U[0, 1] and then multiplied by 5:

We integrate the running cost from randomly chosen initial position and control . The learning curves are shown in Figure 2. As in the one-dimensional problem, the solid curves denote the average of the running costs over five trials and the shaded regions represent the minimum and maximum of the running costs. The results show that the cost decreases superexponentially in both 10- and 20-dimensional problems. Note that the y-axis of Figure 2(a) and Figure 2(b) are plotted with log-scale, and thus the decay of cost is rapid. Again, the results confirm that our continuous-time Q-learning algorithm presents the desired performance.

Figure 2: Numerical experiment result: (a) learning curve for n = m = 10; (b) learning curve for n = m = 20.

5. Conclusion

We introduced a Q-function for continuous-time deterministic optimal control problems with Lipschitz continuous control. By using the dynamic programming principle, we derived the corresponding HJB equation and showed that its unique viscosity solution corresponds to the Q-function. An optimality condition was also characterized in terms of the Q-function without the knowledge of system models. Using the HJB equation and the optimality condition, we construct a Q-learning algorithm and its DQN-like approximate version. The simulation results show that the proposed Q-learning algorithm is fast and stable, and that the learned controller presents a good performance, even in the 20-dimensional case.

Acknowledgments

This work was supported in part by the Creative-Pioneering Researchers Program through SNU, and the Basic Research Lab Program through the National Research Foundation of Korea funded by the MSIT(2020R1C1C1009766).

References

M. Bardi and I. Capuzzo-Dolcetta. Optimal Control and Viscosity Solutions of Hamilton-Jacobi-Bellman Equations. Birkh¨auser, 1997.

D. P. Bertsekas. Reinforcement learning and optimal control. Athena Scientific, 2019.

M. Crandall and P.-L. Lions. Viscosity solutions of hamilton–jacobi equations. Transactions of the American Mathematical Society, 277:1–42, 1983.

M. Crandall, L. C. Evans, and P.-L. Lions. Some properties of viscosity solutions of Hamilton– Jacobi equations. Transactions of the American Mathematical Society, 282:487–502, 1984.

A. M. Devraj and S. Meyn. Zap q-learning. In Advances in Neural Information Processing Systems, pages 2235–2244, 2017.

K. Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219– 245, 2000.

L. C. Evans. Partial Differential Equations. American Mathematical Society, 2010.

Y. Jiang and Z.-P. Jiang. Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica, 48(10):2699–2704, 2012.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conferences on Learning Representation (ICLR), 2015.

J. Y. Lee, J. B. Park, and Y. H. Choi. Integral Q-learning and explorized policy iteration for adaptive optimal control of continuous-time linear systems. Automatica, 48:2850–2859, 2012.

T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Con- tinuous control with deep reinforcement learning. In International Conferences on Learning Representation (ICLR), 2016.

N. Matni, A. Proutiere, A. Rantzer, and S. Tu. From self-tuning regulators to reinforcement learning and back again. In IEEE Conference on Decision and Control (CDC), 2019.

P. Mehta and S. Meyn. Q-learning and Pontryagin’s minimum principle. In Proceedings of the 48th IEEE Conference on Decision and Control, pages 3598–3605, 2009.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Ried- miller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

R. Munos. A study of reinforcement learning in the continuous case by the means of viscosity solutions. Machine Learning, 40:265–299, 2000.

M. Palanisamy, H. Modares, F. L. Lewis, and M. Aurangzeb. Continuous-time Q-learning for infinite-horizon discounted cost linear quadratic regulator problems. IEEE Transactions on Cybernetics, 45:165–176, 2015.

K. G. Vamvoudakis. Q-learning for continuous-time linear systems: A model-free infinite horizon optimal control approach. Systems & Control Letters, 100:14–20, 2017.

D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis. Adaptive optimal control for continuous- time linear systems based on policy iteration. Automatica, 45(2):477–484, 2009.

C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.