Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling

2020·Arxiv

Abstract

Abstract

Despite the wide applications of Adam in reinforcement learning (RL), the theoretical convergence of Adam-type RL algorithms has not been established. This paper provides the first such convergence analysis for two fundamental RL algorithms of policy gradient (PG) and temporal difference (TD) learning that incorporate AMSGrad updates (a standard alternative of Adam in theoretical analysis), referred to as PG-AMSGrad and TD-AMSGrad, respectively. Moreover, our analysis focuses on Markovian sampling for both algorithms. We show that under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of O(1/T ) (where T denotes the number of iterations), and with a diminishing stepsize converges exactly to a stationary point at the rate of ). Furthermore, under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of O(1/T ), and with a diminishing stepsize converges exactly to the global optimum at the rate of study develops new techniques for analyzing the Adam-type RL algorithms under Markovian sampling.

1 Introduction

Reinforcement learning (RL) aims to study how an agent learns a policy through interacting with its environment to maximize the accumulative reward for a task. RL has so far accomplished tremendous success in various applications such as playing video games Mnih et al. (2013), bipedal walking Castillo et al. (2019) and online advertising Pednault et al. (2002), to name a few. In general, there are two widely used classes of RL algorithms: policy-based methods and value function based methods.

For the first class, policy gradient (PG) Sutton et al. (2000) is a basic algorithm which has motivated many advanced policy-based algorithms including actor-critic Konda and Tsitsiklis (2000), DPG Silver et al. (2014), TRPO Schulman et al. (2015), PPO Schulman et al. (2017), etc. The idea of the vanilla PG Sutton et al. (2000) is to parameterize the policy and optimize a target accumulated reward function by (stochastic) gradient descent. The asymptotic convergence and finite-time (i.e., non-asymptotic) analysis have been characterized for PG in various scenarios, which will be further discussed in Section 1.2.

For the second class of value function based algorithms, temporal difference (TD) learning Sutton (1988) is a fundamental algorithm which has motivated more advanced algorithms such as Q-learning Watkins and Dayan (1992), SARSA Rummery and Niranjan (1994), etc. The vanilla TD Sutton (1988) typically parameterizes the value function of an unknown policy and iteratively finds the true value function or its estimator by following the (projected) Bellman operation, which can also be interpreted to be analogous to a stochastic gradient descent (SGD) update. The asymptotic convergence and finite-time analysis have been established for TD in various scenarios, which will be discussed in Section 1.2.

Despite extensive exploration, all the existing theoretical studies of PG and TD have focused on algorithms that adopt the SGD-type updates without adaption on the stepsize. In practice, however, the adaptive momentum estimation (Adam) method Kingma and Ba (2015) has been more commonly used in RL Bello et al. (2017); Stooke and Abbeel (2018). It is so far unclear whether RL algorithms that incorporate the Adam-type updates have provable converge guarantee. The goal of this paper is to provide the first non-asymptotic analysis to characterize the convergence rate of the Adam-type PG and TD algorithms. In addition, we develop new technical tools to analyze the Adam-type algorithms under Markovian sampling. Such analysis is not available in the existing studies of the Adam-type algorithms in optimization which usually assume independent and identically distributed (i.i.d.) sampling.

1.1 Our Contribution

The main contribution of the paper lies in establishing the first non-asymptotic analysis for the Adam-type PG and TD algorithms under the Markovian sampling model.

From the technical standpoint, although the existing analysis of the Adam-type algorithms in conventional optimization may provide useful tools, analysis of such algorithms in RL problems has new challenges. (1) Samples in TD learning are drawn from an unknown transition kernel in a non-i.i.d. manner, and hence a bias of the gradient estimation arises. Such a bias is also dependent on the AMSGrad updates, which further complicates the analysis. We develop techniques to bound the bias by the stepsize even under adaptive momentum updates. (2) The sampling process in PG (with a nonlinear approximation architecture) is not only non-i.i.d., but also follows time-varying parametric policies. Thus, it is even more challenging to bound such a bias with the AMSGrad updates. In response to this, we develop new techniques to characterize the bias error which can also be controlled by the stepsize.

As for the results, we provide the first convergence guarantee for RL algorithms (including PG and TD) incorporating the update rule of AMSGrad (referred to as PG-AMSGrad and TD-AMSGrad, respectively), and our techniques also lead to an improved result for the vanilla PG. (1) First, we show that under general nonlinear function approximation, PG-AMSGrad with a constant stepsizeconverges to a neighborhood of a stationary point at a rate of O(1/T ) (where T denotes the number of iterations), and with a diminishing stepsize converges exactly to a stationary point at a rate of O(log). (2) Furthermore, under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at a rate of O(1/T ), and with a diminishing stepsize converges exactly to the global optimum at a rate of O(log ). (3) In addition, by adapting our technical tools to analyze the vanilla PG with the SGD update under Markovian sampling, we obtain an orderwisely better computational complexity than the existing works.

1.2 Related Work

Due to the rapidly growing theoretical studies on RL, we only review the most relevant studies below.

Convergence analysis of PG: Asymptotic convergence of PG based on stochastic approximation (SA) was established in Williams (1992); Baxter and Bartlett (2001); Sutton et al. (2000); Kakade (2002); Pirotta et al. (2015); Tadi´c et al. (2017). In specific RL problems such as LQR, PG has been proved to converge to the optimal policy Fazel et al. (2018); Malik et al. (2019); Tu and Recht (2019). Under specific policy function approximation classes, Bhandari and Russo (2019); Agarwal et al. (2019) also showed that PG can find the optimal policy. Under the general nonconvex approximation, Shen et al. (2019); Papini et al. (2018, 2019); Xu et al. (2019a, 2020) characterized the convergence rate for PG and variance reduced PG to a stationary point in the finite-horizon scenario, and Zhang et al. (2019); Karimi et al. (2019) provided the convergence rate for PG in the infinite-horizon scenario. Wang et al. (2020) studied PG with neural network function approximation in an overparameterized regime. Convergence analysis has been also established for the variants of PG, such as actor-critic algorithms (Bhatnagar et al., 2008, 2009; Bhatnagar, 2010; Luo et al., 2019; Yang et al., 2019; Fu et al., 2020), TRPO (Shani et al., 2020), TRPO/PPO (Liu et al., 2019), etc. This paper studies the infinite-horizon scenario, but focuses on the Adam-type PG, which has not been studied in the literature. Our analysis is also applicable to other variants of PG.

Convergence analysis of TD: Originally proposed in Sutton (1988), TD learning with function approximation aroused great interest in analyzing its convergence. While a general TD may not converge as pointed out in Baird (1995); Gy¨orfi and Walk (1996), the authors in Tsitsiklis and Van Roy (1997) provided conditions to ensure asymptotic convergence of TD with linear function approximation under i.i.d. sampling. Other results on asymptotic convergence using the tools from linear SA were provided in Kushner and Yin (2003); Benveniste et al. (2012). Non-asymptotic convergence was established for TD under i.i.d. sampling in, e.g., Dalal et al. (2018); Bhandari et al. (2018); Lakshminarayanan and Szepesvari (2018), and under Markovian sampling in, e.g., Bhandari et al. (2018); Srikant and Ying (2019); Hu and Syed (2019). The convergence rate of TD with nonlinear function approximation has recently been studied in Cai et al. (2019) for overparameterized neural networks using i.i.d. samples. In contrast to the aforementioned work on TD with the SGD updates, this paper studies Adam-type TD under Markovian sampling.

Adaptive reinforcement learning algorithms: Adaptivity has been applied to RL algorithms to improve the performance. (Shani et al., 2020) used an adaptive proximity term to study the convergence of TRPO. An adaptive batch size was adopted to improve the policy performance (Papini et al., 2017) and reduce the variance (Ji et al., 2019) of PG. The abovementioned papers did not study how adaptive learning rates can affect the performance of PG or TD. More recently, a concurrent work (Sun et al., 2020) which is posted a few days after our paper provided an analysis of TD(0) and TD() when incorporating adaptive gradient descent (AdaGrad) updates. However, this paper focuses on a more popular class of adaptive algorithms: Adam-type methods, and provide the first convergence guarantee when they are applied to PG and TD.

Convergence analysis of Adam-type algorithms in conventional optimization: Adam was proposed in Kingma and Ba (2015) for speeding up the training of deep neural networks, but the vanilla Adam was shown not to converge in Reddi et al. (2018). Instead AMSGrad was proposed as a slightly modified version to justify the theoretic performance of Adam and its regret bounds were characterized in Reddi et al. (2018); Tran and Phong (2019) for online convex optimization. Recently, Adam/AMSGrad was proved to converge to a stationary point for certain general nonconvex optimization problems Zou et al. (2019a); Zhou et al. (2018); Chen et al. (2019a). Our study provides the first convergence guarantee for the Adam-type algorithms in the RL settings, where the Markovian sampling poses the key difference and challenge in our analysis from those in conventional optimization.

2 Preliminary

In this section, we provide the necessary background for the problems that we study in this paper.

2.1 Markov Decision Process

We consider the standard RL settings, where an agent interacts with a (possibly stochastic) environment (e.g. process or system dynamics). This interaction is usually modeled as a discrete-time discounted Markov Decision Processes (MDPs), described by a tuple (), where S is the state space, A is the action space, [0, 1] is the probability kernel for the state transitions, e.g., ) denotes the probability distribution of the next state given the current state s and action a. In addition, [0] is the reward function mapping station-action pairs to a bounded subset of (0, 1) is the discount factor, and denotes the initial state distribution. The agent’s decision is captured by the policy := ) which characterizes the density function over the action space A at the state . We denote := as the stationary distribution of the transition kernel P for a given . In addition, we define the -discounted stationary visitation distribution of the policy as ) = ). Further, we denote ) = ) as the (discounted) state-action visitation distribution.

In this paper, we assume that the state space S is countably infinite and the action space A is finite with possibly large cardinality |A|. Hence, we estimate the policy and the value function corresponding to a certain unknown policy by parameterized function classes as we introduce in Section Sections 3.1 and 4.1, respectively.

2.2 Update Rule of AMSGrad

Although Adam Kingma and Ba (2015) has gained great success in practice since being proposed, it is shown not to converge even in the simple convex setting Reddi et al. (2018). Instead, a slightly modified version called AMSGrad Reddi et al. (2018) is widely used to understand the success of adaptive momentum optimization algorithms. Given a gradient at time t, the generic form of AMSGrad is given by

where is the stepsize, and are two hyper-parameters. In addition, given in (1) and (2) are viewed as the estimation of the first moment and second moment, respectively, which play important roles in adapting the learning rate as in (4). Compared to Adam, the main difference of AMSGrad lies in (3), which guarantees the sequence ˆto be non-decreasing whereas Adam does not require this. Such difference is considered to be a central reason causing the non-convergent behavior of Adam Reddi et al. (2018); Chen et al. (2019a).

2.3 Notations

We use := to denote the -norm of a vector x, and use = maxto denote the infinity norm. When x, y are both vectors, are all calculated in the element-wise manner, which are used in the update of AMSGrad. We denote [n] = {1, 2, . . ., n}, and as the integer such that .

3 Convergence of PG-AMSGrad under Markovian Sampling

In this section, we study the convergence of an Adam-type policy gradient algorithm (PG-AMSGrad) with nonlinear function approximation and under non-i.i.d. sampling.

3.1 Policy Gradient and PG-AMSGrad

Consider a reinforcement learning problem which aims to find a policy that maximizes the expected accumulative reward. We assume that the policy is parameterized by and form a policy class Π := , which in general is a nonlinear function class. The policy gradient method is usually used to solve the following infinite-horizon optimization problem:

The gradient of ) with respect to is captured by the policy gradient theorem for infinite-horizon MDP with the discounted reward Sutton et al. (2000), and is given by

where the expectation is taken over the discounted state-action visitation distribution := ), and ) denotes the Q-function for an initial state-action pair (s, a) defined as

In addition, we refer to log ) as the score function corresponding to the policy .

Since the transition probability is unknown, the policy gradient in (6) needs to be estimated via sampling. The Q-function ) and the score function are typically estimated by independent samples. First, at each time t, we draw a sample trajectory to provide an estimated Q-function ˆ) based on the algorithm EstQ Zhang et al. (2019) (see Algorithm 3 in Appendix A for details). Such an estimator has been shown to be unbiased Zhang et al. (2019). That is, if we use to denote the randomness including the samples and horizon in EstQ, then we have

We note that the gradient estimator obtained in (8) is biased, because the score function is estimated by a sequence of Markovian samples. We will show that such a biased gradient estimator is in fact computationally more efficient than the unbiased estimator used in the existing literature Zhang et al. (2019) (see Section 3.4). Our main technical novelty here lies in developing techniques to analyze the biased estimator under the AMSGrad update for PG.

3.2 Technical Assumptions

In the following, we specify some technical assumptions in our convergence analysis. We consider a general class of parameterized policy functions that satisfy the following assumption.

Assumption 1. Assume that the parameterized policy is differentiable with respect to , and the score function log ) corresponding to ) exists. In addition, we assume both the policy function and the score function are Lipschitz continuous with the parameters and L, respectively, i.e., for all ,

This assumption is standard in the existing literature that studies PG with nonconvex function approximation Zhang et al. (2019); Xu et al. (2019a); Papini et al. (2018).

In Algorithm 1, we sample a data trajectory using the transition kernel ˆP and the policy . Such a sequence of samples are non-i.i.d. and follow a Markovian distribution. We assume that the MDP and the policies we consider satisfy the following standard mixing property.

where denotes the total-variation norm (or the total-variation distance between two probability measures and ).

This assumption holds for irreducible and aperiodic Markov chains Mitrophanov (2005), and is widely adopted in the theoretical analysis of RL algorithms under Markovian sampling settings Bhandari et al. (2018); Chen et al. (2019b); Zou et al. (2019b); Karimi et al. (2019).

3.3 Convergence of PG-AMSGrad

In this section, we provide the convergence analysis of PG-AMSGrad as given in Algorithm 1. We first consider the case with a constant stepsize, and then provide the result with a diminishing stepsize.

Although AMSGrad has been studied in conventional optimization, our analysis of PG-AMSGrad mainly deals with the following new challenges arising in RL. First, samples here are generated via an MDP and distributed in a non-i.i.d. fashion. Thus the gradient estimator is biased and we need to control the bias with a certain upper bound scaled by the stepsize. Second, the sampling distribution also changes over time, which causes additional complication. Thus, our technical development mainly handles the above two challenges under the adaptive momentum update rule of AMSGrad. We next provide the convergence results that we obtain and relegate all the proofs to the appendices.

We first provide the Lipschitz properties for the true policy gradient and its estimator, which are useful for establishing the convergence. Recall that in Algorithm 1, the gradient estimator = ˆlog()) at time t is obtained by using the Q-function estimator generated by the EstQ algorithm (see Appendix A). Note that ˆ) is an unbiased estimator of ) for all (s, a) Zhang et al. (2019), and the samples for estimation are independent of those for other steps in PG-AMSGrad except the initial sample. Taking expectation over the randomness in EstQ at time t (denoted as ), we obtain an estimator ˜) defined as

We next obtain the Lipschitz properties of ˜) and ) in the following lemma.

Lemma 1. (Lipschitz property of policy gradient) Under Assumptions 1 and 2, the policy gradient ) defined in (6) is Lipschitz continuous with the parameter , i.e., ,

where the constant coefficient = + 1 + log. Further, the policy gradient estimator ˜) defined in (9) is also Lipschitz continuous with the parameter , i.e., ,

= 1 + log

The following theorem characterizes the convergence of PG-AMSGrad with a constant stepsize. Recall that the stepsize refers to the parameter in the AMSGrad update (4), not the overall learning rate of the algorithm.

Theorem 1. (Convergence of PG-AMSGrad with constant stepsize) Fix in Algorithm 1. Initialize Algorithm 1 such that for all ] with some 0. Suppose Assumptions 1 and 2 hold. Let for t = 1, . . . , T . Then after running T steps of PG-AMSGrad as given in Algorithm 1, we have:

Theorem 1 indicates that under the constant stepsize, PG-AMSGrad converges to a neighborhood of a stationary point at a rate of . The size of the neighborhood can be controlled by the stepsize . One can observe that controls a tradeoff between the convergence rate and the convergence accuracy. Decreasing improves the convergence accuracy, but slows down the convergence, since the coefficient contains in the denominator. To balance such a tradeoff, we set the stepsize . In this case, the mixing time becomes = O(log T ) and thus PG-AMSGrad converges to a stationary point with a rate of

. In the following, we adopt a diminishing stepsize to eliminate the convergence error and obtain the exact

convergence.

Theorem 2. (Convergence of PG-AMSGrad with diminishing stepsize) Suppose the same conditions of Theorem 1 hold, and let for t = 1, . . . , T . Then running T steps of PG-AMSGrad as given in Algorithm 1, we have:

Under the diminishing stepsize, PG-AMSGrad can converge exactly to a stationary point. Since = O(log T ), the convergence rate is given by .

Theorems 1 and 2 indicate that under both a constant stepsize and diminishing stepsize, PG-AMSGrad finds a stationary point efficiently with guaranteed convergence. However, there is a tradeoff between the convergence rate and the convergence accuracy. With a constant stepsize, PG-AMSGrad can converge faster but only to a neighborhood of a stationary point whose size is controlled by the stepsize, whereas a diminishing stepsize yields a better convergence accuracy, but attains a lower convergence rate.

3.4 Implication on Vanilla PG under Markovian Data

Although our focus in this paper is on the Adam-type PG, slight modification of our analysis yields the convergence rate of the vanilla PG under infinite horizon Markovian sampling. In the following, we first present such results and then compare them with two recent studies Zhang et al. (2019); Karimi et al. (2019) on the same model to illustrate the novelty of our analysis and benefit of our sampling strategy.

We consider the vanilla PG algorithm that uses the same gradient estimator and sampling strategy as those of Algorithm 1, but adopts the SGD update (i.e., ) rather than the AMSGrad update. We call such an algorithm as PG-SGD. The following proposition characterizes the convergence rate for PG-SGD.

Proposition 1. Suppose Assumptions 1 and 2 hold. After running T steps of PG-SGD with a constant stepsize for t = 1, . . . , T , we have:

with defined in Lemma 1, = minand . Furthermore, after running T steps of PG-SGD with a diminishing stepsize = for t = 1, . . . , T , we have:

=min=

We next compare Proposition 1 with two recent studies on the same non-i.i.d. model. First, Karimi et al. (2019) studied infinite-horizon PG with a biased gradient estimator. In their analysis, they did not bound the gradient bias using the stepsize, and hence their convergence has a non-zero error even with a diminishing stepsize. In contrast, we obtain a fine-grained bound on the bias and characterize its dependence on the stepsize. We show that PG exactly converges to a stationary point under the diminishing stepsize.

Another closely related study on the infinite-horizon PG under non-i.i.d. sampling was by Zhang et al. (2019), but their algorithm adopts an unbiased gradient estimator at the cost of using more samples. As a comparison, Proposition 1 indicates that PG-SGD with a biased gradient estimator attains a better convergence rate and accuracy. More specifically, under the constant stepsize, (Zhang et al., 2019, Corollary 4.4) showed that their PG algorithm converges with an error bound of , whereas PG-SGD with

a biased gradient estimator achieves a much smaller error bound by taking = 1 . Similarly, under the diminishing stepsize, (Zhang et al., 2019, Theorem 4.3) showed that their PG algorithm converges at a rate of

, whereas our PG-SGD converges at a rate of , which is much faster since is usually close to 1, and log T is considered to be less influential in practice.

4 Convergence of TD-AMSGrad under Markovian Sampling

In this section, we adopt AMSGrad to TD learning and analyze its convergence under Markovian sampling. The proof techniques of bounding the bias and the nature of the convergence are very different from those of PG-AMSGrad.

4.1 TD Learning and TD-AMSGrad

Policy evaluation is a fundamental task in RL, and often plays a critical role in other algorithms such as PG that we study in Section 3. The goal of policy evaluation is to obtain an accurate estimation of the accumulated reward function known as the value function for a given policy defined as

The Bellman operator corresponding to a value function V is defined by ,

The key to approximate the value function is the observation that it satisfies the projected bellman equation (Π)(s) = V (s) when projected onto some convex space. Under the function approximation, the value function V (s) is parameterized by and denoted by ). As many recent finite-time analysis of TD Bhandari et al. (2018); Xu et al. (2019b); Srikant and Ying (2019), we consider the linear approximation class of the value function ) defined as

where , and is a vector function with the dimension d, and the elements of represent the nonlinear kernel (feature) functions. Then the vanilla TD algorithm follows a stochastic iterative method given by

As seen in Algorithm 2, the state-action pairs are sampled as a trajectory under the transition probability P with unknown policy . Therefore, the samples along the trajectory are dependent, and hence we need to analyze the convergence of TD-AMSGrad under Markovian sampling.

4.2 Technical Assumptions

In this section, we introduce some standard technical assumptions for our analysis. We first give the following standard assumption on the kernel function Tsitsiklis and Van Roy (1997);

Assumption 3. For any state , the kernel function is uniformly bounded, i.e., 1. In addition, we define a feature matrix Φ as

The boundedness assumption is mild since we can always normalize the kernel functions. We next define the following norm of the value function as:

where D = diag() is a diagonal matrix whose elements are determined by the stationary distribution , and Σ is the steady-state feature covariance matrix given by Σ = .The next assumption of the bounded domain is standard in theoretical analysis of the Adam-type algorithms Reddi et al. (2018); Tran and Phong (2019).

Assumption 4. The domain of approximation parameters is a ball originating at = 0 with a bounded diameter containing . That is, there exists , such that , and , for any .

The boundedness parameter can be chosen as discussed in Bhandari et al. (2018).

4.3 Convergence of TD-AMSGrad

In the following, we provide the convergence results for TD-AMSGrad with linear function approximation under Markovian sampling.

First consider the full pseudo-gradient ¯) in (16). We define as the fixed point of ¯), i.e., ¯) = 0. Then is the unique fixed point under Assumption 3 following from the contraction property of the projected Bellman operator Tsitsiklis and Van Roy (1997). The following lemma is useful for our convergence analysis.

Lemma 2. (Bhandari et al., 2018, Lemma 3) Let 0 be the minimum eigenvalue of the matrix Σ. Under Assumption 3 and for any , we have

Lemma 2 indicates that the update of TD learning exhibits a property similar to strongly convex optimization.

Since our analysis is under Markovian sampling, non-zero bias arises in our convergence analysis when estimating the gradient, i.e., = 0. The following lemma provides an upper bound on such a bias error for TD-AMSGrad.

Lemma 3. Consider a sequence of non-increasing stepsizes for Algorithm 2 and fix = min: . Initialize Algorithm 2 such that for all ] with some 0. Under Assumptions 2-4, we have for :

+ (1 +

The following theorem provides the convergence of TD-AMSGrad under a constant stepsize coupled with diminishing hyper-parameters in the AMSGrad update.

Theorem 3. (Convergence of TD-AMSGrad with constant stepsize) Let and with (0, 1) in Algorithm 2. Initialize Algorithm 2 such that for all ] with some 0. Let = 1, . . . , T , and suppose Assumptions 2-4 hold. Then the output of Algorithm 2 satisfies:

with c = (1

In Theorem 3, are constants and time-independent. Therefore, under the choice of the stepsize and hyper-parameters in the theorem, Algorithm 2 converges to a neighborhood of the global optimum at a rate of . The size of the neigborhood is controlled by the stepsize . Similarly to Theorem 1, we can balance the tradeoff between the convergence rate and the convergence accuracy, i.e., the size of neighborhood, by setting the stepsize , which yields a convergence to the global optimal solution at

the rate of . Next, we provide the convergence result with a diminishing stepsize in the following theorem.

Theorem 4. (Convergence of TD-AMSGrad with diminishing stepsize) Suppose the same conditions of Theorem 3 hold, and let for t = 1, . . . , T . Then the output of Algorithm 2 satisfies:

with c = (1 + (1 +

Comparing with Theorem 3 and observing = O(log T ), we conclude that TD-ASMGrad with the diminishing stepsize converges exactly to the global optimum at the rate of , rather than to a neighborhood.

5 Conclusion

This paper provides the first convergence analysis of the Adam-type RL algorithms under Markovian sampling. For PG-AMSGrad with nonlinear function approximation for the policy, we show that the algorithm converges to a stationary point with a diminishing stepsize. With a constant size, we show that PG-AMSGrad converges only to a neighborhood of a stationary point but at a faster rate. Furthermore, we also provide the finite-time convergence results for TD-AMSGrad to the global optimum under proper choices of the stepsize.

Several future directions along this topic are interesting. For example, the optimality of the convergence result of PG-AMSGrad is of importance to study. More general value function approximation, and convergence results for constant hyper-parameters in TD-AMSGrad are also of interest.

Acknowledgements

The work was supported in part by the U.S. National Science Foundation under Grants CCF-1761506, ECCS-1818904, CCF-1909291 and CCF-1900145.

References

Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2019). Optimality and approximation with policy gradient methods in markov decision processes. arXiv preprint arXiv:1908.00261.

Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier.

Baxter, J. and Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350.

Bello, I., Zoph, B., Vasudevan, V., and Le, Q. V. (2017). Neural optimizer search with reinforcement learning. In International Conference on Machine Learning (ICML), pages 459–468.

Benveniste, A., M´etivier, M., and Priouret, P. (2012). Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media.

Bhandari, J. and Russo, D. (2019). Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786.

Bhandari, J., Russo, D., and Singal, R. (2018). A finite time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory (COLT).

Bhatnagar, S. (2010). An actor-critic algorithm with function approximation for discounted cost constrained markov decision processes. Systems & Control Letters, 59(12):760–766.

Bhatnagar, S., Ghavamzadeh, M., Lee, M., and Sutton, R. S. (2008). Incremental natural actor-critic algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pages 105–112.

Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. (2009). Natural actor-critic algorithms. Automatica, 45(11):2471–2482.

Cai, Q., Yang, Z., Lee, J. D., and Wang, Z. (2019). Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems (NeurIPS), pages 11312–11322.

Castillo, G. A., Weng, B., Hereid, A., Wang, Z., and Zhang, W. (2019). Reinforcement learning meets hybrid zero dynamics: A case study for rabbit. In 2019 International Conference on Robotics and Automation (ICRA), pages 284–290.

Chen, X., Liu, S., Sun, R., and Hong, M. (2019a). On the convergence of a class of Adam-type algorithms for non-convex optimization. In International Conference on Learning Representations (ICLR).

Chen, Z., Zhang, S., Doan, T. T., Maguluri, S. T., and Clarke, J.-P. (2019b). Finite-time analysis of Q-learning with linear function approximation. arXiv preprint arXiv:1905.11425.

Dalal, G., Sz¨or´enyi, B., Thoppe, G., and Mannor, S. (2018). Finite sample analyses for td (0) with function approximation. In Thirty-Second AAAI Conference on Artificial Intelligence.

Fazel, M., Ge, R., Kakade, S., and Mesbahi, M. (2018). Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning (ICML), pages 1467–1476.

Fu, Z., Yang, Z., Chen, Y., and Wang, Z. (2020). Actor-critic provably finds nash equilibria of linear-quadratic mean-field games. In International Conference on Learning Representations (ICLR).

Gy¨orfi, L. and Walk, H. (1996). On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization, 34(1):31–61.

Hu, B. and Syed, U. (2019). Characterizing the exact behaviors of temporal difference learning algorithms using Markov jump linear system theory. In Advances in Neural Information Processing Systems (NeurIPS), pages 8477–8488.

Ji, K., Wang, Z., Zhou, Y., and Liang, Y. (2019). Faster stochastic algorithms via history-gradient aided batch size adaptation. arXiv preprint arXiv:1910.09670.

Kakade, S. M. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems (NeurIPS), pages 1531–1538.

Karimi, B., Miasojedow, B., Moulines, E., and Wai, H.-T. (2019). Non-asymptotic analysis of biased stochas- tic approximation scheme. In Conference on Learning Theory (COLT).

Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Konda, V. (2002). Actor-critic algorithms. PhD thesis, Massachusetts Institute of Technology.

Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pages 1008–1014.

Kushner, H. and Yin, G. G. (2003). Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media.

Lakshminarayanan, C. and Szepesvari, C. (2018). Linear stochastic approximation: How far does constant step-size and iterate averaging go? In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1347–1355.

Liu, B., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural trust region/proximal policy optimization attains globally optimal policy. In Advances in Neural Information Processing Systems (NeurIPS), pages 10564– 10575.

Luo, Y., Yang, Z., Wang, Z., and Kolar, M. (2019). Natural actor-critic converges globally for hierarchical linear quadratic regulator. arXiv preprint arXiv:1912.06875.

Malik, D., Pananjady, A., Bhatia, K., Khamaru, K., Bartlett, P., and Wainwright, M. (2019). Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 2916–2925.

Mitrophanov, A. Y. (2005). Sensitivity and convergence of uniformly ergodic Markov chains. Journal of Applied Probability, 42(4):1003–1014.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Papini, M., Binaghi, D., Canonaco, G., Pirotta, M., and Restelli, M. (2018). Stochastic variance-reduced policy gradient. In International Conference on Machine Learning (ICML), pages 4026–4035.

Papini, M., Pirotta, M., and Restelli, M. (2017). Adaptive batch size for safe policy gradients. In Advances in Neural Information Processing Systems (NeurIPS), pages 3591–3600.

Papini, M., Pirotta, M., and Restelli, M. (2019). Smoothing policies and safe policy gradients. arXiv preprint arXiv:1905.03231.

Pednault, E., Abe, N., and Zadrozny, B. (2002). Sequential cost-sensitive decision making with reinforcement learning. In Proceedings of the eighth ACM SIGKDD International conference on Knowledge Discovery and Data Mining, pages 259–268.

Pirotta, M., Restelli, M., and Bascetta, L. (2015). Policy gradient in lipschitz Markov decision processes. Machine Learning, 100(2-3):255–283.

Reddi, S. J., Kale, S., and Kumar, S. (2018). On the convergence of Adam and beyond. In International Conference on Learning Representations (ICLR).

Rummery, G. A. and Niranjan, M. (1994). On-line Q-learning using connectionist systems, volume 37.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning (ICML), pages 1889–1897.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Shani, L., Efroni, Y., and Mannor, S. (2020). Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps. In Thirty-Fourth AAAI Conference on Artificial Intelligence.

Stooke, A. and Abbeel, P. (2018). Accelerated methods for deep reinforcement learning. arXiv preprint arXiv:1803.02811.

Sun, T., Shen, H., Chen, T., and Li, D. (2020). daptive temporal difference learning with linear function approximation. arXiv preprint arXiv:2002.08537.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9– 44.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NeurIPS), pages 1057–1063.

Tadi´c, V. B., Doucet, A., et al. (2017). Asymptotic bias of stochastic gradient search. The Annals of Applied Probability, 27(6):3255–3304.

Tran, P. T. and Phong, L. T. (2019). On the convergence proof of AMSGrad and a new version. IEEE Access, 7:61706–61716.

Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-diffference learning with function approxi- mation. IEEE Transactions on Automatic Control, 42(5):674 690.

Tu, S. and Recht, B. (2019). The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. In Conference on Learning Theory (COLT), pages 3036–3083.

Wang, L., Cai, Q., Yang, Z., and Wang, Z. (2020). Neural policy gradient methods: Global optimality and rates of convergence. In International Conference on Learning Representations (ICLR).

Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4):279–292.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learn- ing. Machine Learning, 8(3-4):229–256.

Xu, P., Gao, F., and Gu, Q. (2019a). An improved convergence analysis of stochastic variance-reduced policy gradient. In International Conference on Uncertainty in Artificial Intelligence (UAI).

Xu, P., Gao, F., and Gu, Q. (2020). Sample efficient policy gradient methods with recursive variance reduction. In International Conference on Learning Representations (ICLR).

Xu, T., Zou, S., and Liang, Y. (2019b). Two time-scale off-policy td learning: Non-asymptotic analysis over markovian samples. In Advances in Neural Information Processing Systems (NeurIPS), pages 10633–10643.

Yang, Z., Chen, Y., Hong, M., and Wang, Z. (2019). Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. In Advances in Neural Information Processing Systems (NeurIPS), pages 8351–8363.

Zhang, K., Koppel, A., Zhu, H., and Ba¸sar, T. (2019). Global convergence of policy gradient methods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383.

Zhou, D., Tang, Y., Yang, Z., Cao, Y., and Gu, Q. (2018). On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671.

Zou, F., Shen, L., Jie, Z., Zhang, W., and Liu, W. (2019a). A sufficient condition for convergences of adam and rmsprop. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11127–11135.

Zou, S., Xu, T., and Liang, Y. (2019b). Finite-sample analysis for sarsa with linear function approximation. In Advances in Neural Information Processing Systems (NeurIPS), pages 8665–8675.

Supplementary Materials A EstQ algorithm

We introduce the unbiased Q-function estimating algorithm EstQ below.

In Algorithm 3, we emphasize that the samples are used only to estimate the Q-function, and are independent from those used to evaluate the score function in PG-AMSGrad except the initial pair. Zhang et al. (2019) showed that this algorithm provided an unbiased estimator of the Q-function ).

B Proof of Lemma 1

For notational simplicity, we denote ) = ) and ˜) = ˜) in the sequel. We first introduce two technical lemmas which are useful for the proof of Lemma 1.

Proof. The proof can be proceeded by the definition of .

where (i) follows from Assumption 1 and from the fact that the reward function is uniformly bounded. The proof of the second claim is similar to the first one.

Proof of Lemma 1: To derive the Lipschitz continuity of ), we first use the policy gradient theorem Sutton et al. (2000) which gives

where is denoted as the discounted visitation distribution. We further proceed the proof of Lipschitz continuity of ) as

where (i) follows from Assumption 1 and boundedness of ), and (ii) follows since

the following, we bound the last two terms in the right hand side. First, we show that ) is also

Lipschitz continuous with respect to . To see that, we have

where is the distribution of the state-action trajectory corresponding to the transition probability P and the policy . Although is different from the state-action visitation distribution , both of them share the same property. That is, for a different is determined by the same transition kernel and only differs in the policy. Therefore, Lemma 5 applies to two distributions and , which yields

Then we bound the last term of the right hand side of (18) via using the similar techniques above.

where (i) follows from Assumption 1 and boundedness of ), and (ii) follows from Lemma 5. By substituting the bounds from (19) and (20) to (18), we obtain the Lipschitz condition of ), i.e.,

where (i) follows from Assumption 1 and the boundedness of ), and (ii) follows from (19).

C Proof of Theorem 1

In this appendix we will first introduce some useful lemmas in Appendix C.1, and then provide technical proofs in Appendix C.2.

C.1 Supporting Lemmas

We first provide two lemmas to deal with the bias of the gradient estimation.

Proof. Since V is a diagonal matrix with ], we have

where (i) follows because ].

Lemma 7. Fix time t and any . Initialize Algorithm 1 such that for all ] with some 0. Suppose Assumption 1 and 2 hold. Under Assumption 1 and Assumption 2, we have

where (i) follows from the boundedness of . In the following, we bound. To this end, we need to build a new Markov chain in which the policy is fixed after time . To be specific, in Algorithm 1, we sample actions via a changing policy ), which yields a Markov chain:

Now, from time , we repeatedly use the policy , which yields an auxiliary Markov chain:

where we denote ˜= (˜) correspondingly. Clearly we have for + 1, . . . , t,

It remains to deal with . We proceed as follows

where (i) follows from Assumption 1 and (ii) follows from Lemma 9. It remains to bound the second term of the right hand side. To this end, we first write ) as

Similarly, for ) we have

Then we obtain the following bound

Then we have a dynamical form as

Applying (31) recursively yields

Substituting (32) into (28) and according to the fact 1, we have

The next a few lemmas are closely related to the update rule of AMSGrad.

Lemma 8. (Zhou et al., 2018, Lemma A.1) Let for t = 1, 2, . . . be sequences generated by Algorithm 1, and denote = . Given as in Lemma 4, .

Lemma 9. Let for t = 1, 2, . . . be sequences generated by Algorithm 1 and dentoe + (1 + . Then

Proof. The proof can proceed easily given Lemma 14 and ˆ.

Lemma 10. (Zhou et al., 2018, Lemma A.2) Let be the stepsize in Algorithm 1 and be the constant hyper-parameters with . Then we have

Proof. We refer the reader to the proof of Lemma A.2 in Zhou et al. (2018) for more details by reducing their proof to the case where p = = 1.

Since the update rule of AMSGrad is complicated, it is often useful to introduce an auxiliary sequences . If we define , then for 1, let

The following lemma captures the property of and its connection with .

C.2 Proof of Theorem 1

In the remaining, we can complete this proof by taking the following three steps. Step1: Establishing convergent sequence First, we observe that ) is Lipschitz continuous according to Lemma 1. Then we have

Next we bound the tail terms. The term can be bounded as

where (i) follows from Lemma 6. The term is the key to deal with under non-i.i.d. sampling, where a non-zero bias arises in the gradient estimation. We bound this term as

where (i) follows from Lemma 6 and (ii) follows because ˆis positive diagonal matrix and each entry is bounded as in Lemma 8. Next we bound the term as

where (i) follows from Cauchy-Schwarz inequality, (ii) follows from Lemma 11, (iii) follows because + and (iv) follows by the update rule of Algorithm 1. Last, we bound the term as

where (i) follows from Lemma 11, and (ii) follows due to the fact (+ 2. Substituting the upper bounds of the terms and in (43) yields

Next we rearrange the above inequality and take expectation over all the randomness to obtain

where in the last equation we denote = () for brevity, and this notation is used in the sequel as well. We emphasize that the filtration contains all the samples up to time t except the samples for estimating ). Thus we have ] = ) where the expectation is taken over the randomness in

It turns out that the terms are easier to bound and the term is the key to bound the bias. For the clarity of presentation, we first bound the terms . To bound the term , we have

where (i) follows from Cauchy-Schwarz inequality, (ii) follows because ˆ, (iii) follows from Lemma 1 and Lemma 8, (iv) follows by the triangle inequality, and (v) follows due to Lemma 9. Similarly, we can bound the term as follows:

Next, we bound the term and obtain

Last, it remains to bound the term ]. Observe that ] = ]]. We first deal with ] as

For the case with a constant stepsize , we choose = min. To take the summation over the time steps, notice that the bound in (45) holds only when , and hence we separate the summation of the bias term into two parts as follows:

=

Then applying (45) to (46) yields

where (i) follows from Lemma 10, (ii) follows from the definition of and (iii) follows since

Finally, we complete our proof by letting both sides of (47) be divided by T , which yields

where

D Proof of Theorem 2

The proof of Theorem 2 starts from steps similar to those of Theorem 1. The difference starts from (46). Here we consider is not constant. Then we divide both sides of (46) by and obtain

where

We choose = min. Again we choose if and if . If , then choice of yields

where the last inequality follows because

By taking the summation over time steps, we obtain

where (i) follows since is decreasing and from the definition of , and (ii) follows from Lemma 10. For clarity we bound separately. The key observation is that is uniformly bounded as

Thus

Finally, we complete our proof by substituting (51) in (49) and letting both sides of (49) be divided by

T , which yields

where

E Proof of Proposition 1

In the following, we show how to adapt the proof techniques in analyzing PG-AMSGrad to PG-SGD. We first reduce Lemma 7 to the vanilla PG case.

Then we use the steps similar to those in building an auxiliary Markov chain (24) which is generated by the policy from time . Similarly to (27), we have

where (i) follows from (26), (28) and (32). Observe that in PG-SGD, for any t we have,

Thus, we complete the proof by substituting the above observation to (54) and then (53).

Proof of Proposition 1: Following from the Lipschitz condition of ) in Lemma 1, we obtain

Then we rearrange the above inequality and take expectation over all the randomness to have

where the filtration contains all the samples up to time t except the samples for estimating ) in the EstQ algorithm. Thus we have ] = ) where the expectation is taken over the randomness in EstQ algorithm. Observe that if , we have

Next we bound and obtain

Similarly, we can bound term as

Last, we bound term and obtain

Thus for any , we have

rearrange (55) and take the summation over the time steps to obtain

where (i) follows because (˜, and (ii) follows from (56). Then we divide both sides of the above inequality by , and have

Now we consider a diminishing stepsize = for t = 1, . . . , T . Choose = min=

where (i) follows from (48). Then we divide both sides of the above inequality by T , and have

F Proof of Lemma 3

We bound the expectation of bias via constructing a new Markov chain and applying useful techniques from information theory. Before deriving the bound, we first introduce some technical lemmas.

Proof. The proof follows easily from the boundedness of and .

where (i) follows since the reward function is uniformly bounded and (ii) follows from Assumption 3 and Assumption 4.

Lemma 14. (Zhou et al., 2018, Lemma A.1) Let for t = 1, 2, . . . be sequences generated by Algorithm 2 and denote + (1 + . Given as in Lemma 13, we have .

Lemma 15. (Bhandari et al., 2018, Lemma 11) Let ) = ((). Fix (). Then is uniformly bounded by

for fixed t and 0. Suppose Assumption 2 holds. Let are independent copies drawn from the marginal distribution of X and Y , that is = = ) = ). Then, for any bounded v, we have

Remark 1. The notation indicates that the random variable X and Y are independent conditioned on Z, which is a standard notation in information theory.

Proof of Lemma 3: Now we bound the bias of the gradient estimation. We first develop the connection between ) and ) using Lemma 15. For notational brevity, we define a random tuple = () and clearly ). We denote

Thus we can relate ) and ) by using the Lipschitz property in Lemma 15 as follows.

Since is non-random, we have )] = 0. Now we are ready to bound )] with Lemma 16 via constructing a random process satisfying (58). To do so, consider random variables and drawn independently from the marginal distribution of and , thus = = ) = = = ). Further, we can obtain )] = ] = 0 since and are independent. Combining Lemma 15 and Lemma 16 yields

Finally, we are ready to bound the bias. First, we take expectation on both sides of (59) and obtain

Let = min. When , we choose and have

When , we choose and have

where (i) follows from (60), and (ii) follows due to the definition of the mixing time.

G Proof of Theorem 3

Differently from the regret bound for AMSGrad in conventional optimization Reddi et al. (2018), our analysis here focuses on the convergence rate. In fact, a slight modification of our proof also provides the convergence rate of AMSGrad for conventional strongly convex optimization, which can be of independent interest. Moreover, we provide results under the constant stepsize and under Marovian sampling, neither of which has been studied in Reddi et al. (2018).

To proceed the proof, we first observe that

Clearly Π) = due to Assumption 4. We start from the update of when 2.

where (i) follows from Cauchy-Schwarz inequality, and (ii) holds because ˆ.

Next, we take the expectation over all samples used up to time step t on both sides, which still preserves

the inequality. Since we consider the Markovian sampling case where the estimation of gradient is biased.

ˆV (θ+ αˆV + βˆV (θ+ αˆV m2α(θg

= ¯+ 2(¯

where (i) follows from Lemma 2 and because 10, (ii) follows because 1 and 0, and (iii) follows fromby Lemma 14 and Assumption 4. By rearranging the terms in the above inequality and taking the summation over time steps, we have

where (i) follows from . By further implementing the first term in the right-hand side of the last inequality, we can then bound the sum as

Next, we further bound the above term as

where (i) follows because

Then we are ready to obtain the upper bound by applying Lemma 3. Choosing if and if , we obtain

Finally, applying Jensen’s inequality yields

where

H Proof of Theorem 4

We first provide the following useful lemma.

Proof. The proof follows from taking the standard sum of geometric sequences.

Proof of Theorem 4: The proof starts with steps similar to those of Theorem 3. The difference begins from (63), where we now consider a diminishing stepsize . We then have

where (i) follows from Assumption 4, and Lemmas 14 and 17. Then we are ready to obtain the upper bound by applying Lemma 3. Choosing if and

if , we obtain

Finally, applying Jensen’s inequality completes our proof as

where