On Online Learning in Kernelized Markov Decision Processes

2019·Arxiv

Abstract

Abstract

We develop algorithms with low regret for learning episodic Markov decision processes based on kernel approximation techniques. The algorithms are based on both the Upper Confidence Bound (UCB) as well as Posterior or Thompson Sampling (PSRL) philosophies, and work in the general setting of continuous state and action spaces when the true unknown transition dynamics are assumed to have smoothness induced by an appropriate Reproducing Kernel Hilbert Space (RKHS).

I. INTRODUCTION

The goal of reinforcement learning (RL) is to learn optimal behavior by repeated interaction with an unknown environment, usually modeled as a Markov Decision Process (MDP). Performance is typically measured by the amount of interaction, in terms of episodes or rounds, needed to arrive at an optimal (or near-optimal) policy; this is also known as the sample complexity of RL [1]. The sample complexity objective encourages efficient exploration across states and actions, but, at the same time, is indifferent to the reward earned during the learning phase.

A related, but different, goal in RL is the online one, i.e., to learn to gather high cumulative reward, or to equivalently keep the learner’s regret (the gap between its and the optimal policy’s net reward) as low as possible. This is preferable in settings where experimentation comes at a premium and the reward earned in each round is of direct value, e.g., recommender systems (in which rewards correspond to clickthrough events and ultimately translate to revenue), dynamic pricing – in general, control of unknown dynamical systems with instantaneous costs.

A primary challenge in RL is to learn efficiently across complex (very large or infinite) state and action spaces. In the most general tabula rasa MDP setting, the learner must explore each state-action transition before developing a reasonably clear understanding of the environment, which is prohibitive for large problems. Real-world domains, though, possess more structure: transition and reward behavior often varies smoothly over states and actions, making it possible to generalize via inductive inference – observing a state transition or reward is informative of other, similar transitions or rewards. Scaling RL to large, complex, real-world domains requires exploiting regularity structure in the environment, which has typically been carried out via the use of parametric MDP models in model-based approaches, e.g., [2].

This paper takes a step in developing theory and algorithms for online RL in environments with smooth transition and reward structure. We specifically consider the episodic online learning problem in the nonparametric, kernelizable MDP setting, i.e., of minimizing regret (relative to an optimal finite-horizon policy) in MDPs with continuous state and action spaces, whose transition and reward functions exhibit smoothness over states and actions compatible with the structure of a reproducing kernel. We develop variants of the well-known UCRL and posterior sampling algorithms for MDPs with continuous state and action spaces, and show that they enjoy sublinear, finite-time regret bounds when the mean transition and reward functions are assumed to belong to the associated Reproducing Kernel Hilbert Space (RKHS) of functions.

Our results bound the regret of the algorithms in terms of a novel generalization of the information gain of the state transition and reward function kernels, from the memoryless kernel bandit setting [3] to the state-based kernel MDP setting, and help shed light on how the choice of kernel model influences regret performance. We also leverage two different kernel approximation techniques, namely the Quadrature Fourier Features (QFF) approximation [25] and the Nystr¨om approximation [31], to prove the results in the paper. To the best of our knowledge, these are the first concrete regret bounds for RL under kernel approximation, explicitly showing the dependence of regret on kernel structure.

Our results represent a generalization of several streams of work. We generalize online learning in the kernelized bandit setting [4], [6] to kernelized MDPs, and tabula rasa online learning approaches for MDPs such as Upper Confidence Bound for Reinforcement Learning (UCRL) [7] and Posterior Sampling for Reinforcement Learning (PSRL) [8], [9] to MDPs with kernel structure. Our results can also generalize regret minimization for an episodic variant of the well-known parametric Linear Quadratic Regulator (LQR) problem [10]– [13] to its nonlinear, nonparametric, infinite-dimensional, kernelizable counterpart.

II. RELATED WORK

Regret minimization has been studied with parametric MDPs in [7], [8], [14], [15] etc. For online regret minimization in complex MDPs, apart from the work of [2], [16] and [17] consider continuous state spaces with Lipschitz transition dynamics but unstructured, finite action spaces. As regards using the Gaussian process (GP) framework to model nonlinear, smooth MDP structure to achieve generalization, [20] develop a policy search method for MDPs with GPbased dynamics, but without provable exploration guarantees, whereas approaches in [21], [22] has been shown to be sample efficient. [23] develops a PAC learning algorithm by considering separate GP priors over the mean reward function, transition kernel and optimal Q-function, with a PAC guarantee. The most closely related work of ours is [24] which consider GP based variants of UCRL and PSRL algorithms and show no-regret guarantees for kernelizable MDPs. Our results, in a way, show that we can achieve the same guarantees as in [24] even with kernel approximations.

III. PROBLEM STATEMENT

We consider the problem of learning to optimize reward in an unknown finite-horizon MDP, , over repeated episodes of interaction. Here, represents the state space, the action space, H the episode length, the reward distribution over R, and the transition distribution over S. At each period h = 1, 2, . . ., H within an episode, an agent observes a state , takes an action , observes a reward , and causes the MDP to transition to a next state . We assume that the agent, while not possessing knowledge of the reward and transition distribution of the unknown MDP , knows S, A and H.

A policy is defined to be a mapping from a state and a period to an action . For any MDP and policy , the finite horizon, undiscounted, value function for every state and every period is defined as

where the subscript indicates the application of the learning policy , i.e., , and the subscript M explicitly references the MDP environment M, i.e., , for all j = h, . . . , H.

We useto denote the mean of the reward distribution that corresponds to playing action a at state s in the MDP M. We can view a sample r from the reward distribution as , where denotes a sample of zero-mean, real-valued additive noise. Similarly, the transition distribution can also be decomposed as a mean valuein plus a zero-mean additive noise in so that lies in1 . A policy is said to be optimal for the MDP M if

At the beginning of each episode l, an RL algorithm chooses a policy depending upon the observed state-action-reward sequences upto episode , denoted by the history , and executes it for the entire duration of the episode. In other words, at each period h of the l-th episode, the learning algorithm chooses action , receives reward and observes the next state . The goal of an episodic online RL algorithm is to maximize its cumulative reward across episodes, or, equivalently, minimize its cumulative regret: the loss incurred in terms of the value function due to not knowing the optimal policy of the unknown MDP beforehand and instead using the policy for each episode l, l = 1, 2, . . .. The cumulative (expected) regret of an RL algorithm upto time horizon is defined as

where the initial states are assumed to be fixed.

For the rest of the paper, unless otherwise specified, we define and for all and .

IV. ASSUMPTIONS

A. Smoothness of Value Function

, we define the one step future value function as the expected value of the optimal policy , with the next state distributed according to , i.e. . We assume the following regularity condition on the future value function of any MDP (also made by [2]). For any two single-step transition distributions over S, and ,

where denotes the mean of the distribution . In other words, the one-step future value functions for each period h are Lipschitz continuous with respect to the -norm of the mean2, with global Lipschitz constant . We also assume that there is a known constant L such that .

B. Smoothness of Mean Reward and Transition Functions

Attaining sub-linear regret is impossible in general for arbitrary reward and transition distributions, and thus some regularity assumptions are needed. In this paper, we assume smoothness for the mean reward functionis induced by the structure of a kernel on Z. Specifically, we make the standard assumption of a p.s.d. kernel such that for all , andbeing an element of the reproducing kernel Hilbert space (RKHS) of smooth real valued functions on Z. An RKHS of real-valued functions , denoted by , is completely specified by its kernel function and viceversa, with an inner product obeying the reproducing property for all . The induced RKHS norm is a measure of smoothness of f with respect to the kernel function k. We assume that the RKHS norm ofis bounded, i.e., for some . Boundedness of along the diagonal holds for any stationary kernel, i.e., where , e.g., the Squared Exponential kernel and the Matrn kernel :

where l > 0 and are hyperparameters of the kernels,

C. Sub-Gaussian Noise Variables

We assume that the random variables is conditionally zero-mean and -sub-Gaussian, i.e., there exists a known such that for any ,

where is the sigma algebra generated by the random variables , and . Similarly, the random variables is assumed to be conditionally component-wise independent, zero-mean and -subGaussian, in the sense that there exists a known such that for any and ,

where is the sigma algebra generated by the random variables , and and denotes the identity matrix of rank m.

V. ALGORITHM

A. Kernel Approximation For kernelized MDPs [24] develop variants of the UCRL2 algorithm which, at every episode l, constructs confidence sets for the mean reward and the mean transition functions. The construction of each confidence set require one inversion of the kernel (gram) matrix, which takes time. This makes the algorithm quite prohibitive for large number of episodes. To reduce this computational cost without compromising on the accuracy of the confidence sets, we incorporate two efficient kernel approximation schemes, namely the Quadrature Fourier Features (QFF) approximation [25] and the Nystr¨om approximation [26].

1) Quadrature Fourier Features (QFF) approximation:

If k is a bounded, continuous, positive definite, stationary kernel defined over and satisfies k(x, x) = 1 for all , then by Bochner’s theorem [27], k is the Fourier transform of a probability measure p, i.e., k(x, y) = . For the Squared Exponential kernel defined over , this measure has density (abusing notation for measure and density). [25] show that for any stationary kernel k on whose inverse Fourier transform decomposes product wise, i.e., , we can use Gauss-Hermite quadrature [28] to approximate it. If , the SE kernel is approximated as follows. Choose and , and construct the 2d-dimensional feature map

Here the set

, where isthe set of (real) roots of the -th Hermite polynomial , and for all . 2) Nystr¨om approximation: Unlike the QFF approximation where the basis functions (cosine and sine) do not depend on the data, the basis functions used by the Nystr¨om method are data dependent. For a set of points , the Nystr¨om method [26] approximates a kernel as follows: First, randomly sample d points to construct a dictionary D = , according to the following distribution. For each , include in D independently with some suitably chosen probability . (’s trade off between the quality and the size of the embedding.) Then, compute the (approximate) d-dimensional feature embedding

, where , and denotes the pseudo inverse of any matrix A.

Now we will present our algorithm Kernel-UCRL using the Nystr¨om approximation. The description and performance of Kernel-UCRL using the quadrature Fourier features approximation is deferred to the Appendix.

B. Kernel-UCRL Algorithm under Nystr¨om Approximation

Kernel-UCRL (Algorithm 1) is an optimistic algorithm based on the Upper Confidence Bound principle, which adapts the confidence sets of UCRL2 [7] to exploit the kernel structure. At the start of episode l, first we find feature embeddings and to efficiently approximate the kernels and , respectively. First, we construct a dictionary by including every state-action pair z from the set in with probability (to be defined later). Then, we define

we define , where the dictionary is constructed by including every (z, i) from the set with probability (to be defined later).

Then we construct confidence sets and for the mean reward and the mean transition functions, respectively, as follows. First, we compute , where with being the reward of the state-action pair and , , and . Fix any . Now, we define to be the set of all functions such that

where ˜) = ˜, ˜) = ˜˜) + ˜) and =

Similarly, we compute , where with being the next state of the state-action pair and [H], and . Now, we define to be the set of all functions such that

, (6) where , . . . , z, m, z, iz, i, . . . , z, m, z, i) = kz, i), (z, iz, iz, i) + mH z, iV z, i) and the confidence

Next, we build the set of all plausible MDPs M such that: (i) the mean reward functionthe mean transition functionand (iii) the global Lipschitz constant (of future value functions) . Finally, we select an optimistic policy for the family of MDPs in the sense that , where is the initial state and is the most optimistic realization from , and execute for the entire episode. The pseudo-code of kernel-UCRL is given in Algorithm 1.

VI. ANALYSIS OF KERNEL-UCRL UNDER NYSTR ¨OM APPROXIMATION

A. Regret Bound of Kernel-UCRL

Let , where denotes the vector of kernel evaluations between z and elements of , and denotes the kernel matrix computed at .

and . Then, with probability at least , uniformly over all and , the following holds:

, where denotes the vector of reward noise variables. Now see that . Then by [30, Theorem 1], for any , with probability at least , uniformly over all and ,

Further, from [31, Theorem 1], with probability at least , uniformly over all and , we haveand . Then, from [32, Equation 25], we have

(8) Now the result follows by combining 7 and 8, and applying an union bound.

where denotes the vector of kernel evaluations between (z, i) and elements of , and denotes the kernel matrix computed at .

and . Then, with probability at least , uniformly over all and , the following holds:

where L is a known upper bound over denotes the diameter of

By construction of the set of MDPs , it follows that when the events and are true for all episodes , the unknown MDP lies in . Thus , since is the most optimistic MDP of . This implies

Now by the reproducing property of RKHS and CauchySchwartz inequality for all . Thus 9, [24, Lemma 7] and [24, Lemma 9] together imply that for any , with probability at least ,

(10) Now, by triangle inequality . Therefore, when the event is true,

since the mean reward functionlies in the confidence set . Similarly when the event is true,

where we have used the fact that both and are non-decreasing with the number of episodes l and that by construction of (and since ). Now the result follows from [24, Lemma 11] by noting that and .

B. Interpretation of the Bound

Theorem 1 implies that the cumulative regret of KernelUCRL after episodes is

with high prob-

ability. ( hides logarithmic factors.) Now we illustrate the growth of and as functions of T with the following concrete examples.

Let , i.e., is an additive kernel of and . Then, from [33, Theorem 3], 2 ln T . Now if both are compact and convex sets, and both are Squared Exponential (SE) kernels, then from [3, Theorem 4], and . Hence, in this case .

Further, let , i.e., is a product kernel of and R. Then, from [33, Theorem 2], m ln(mT ), since all kernel matrices over any subset of {1, . . . , m} have rank at most m. Now if is a Squared Exponential kernel on Z, then . Hence, in this case .

In essence, and grow sublinearly with T for some popular kernels, e.g. Squared Exponential, polynomial and Mat´ern. Now, since the cumulative regret of KernelUCRL scales linearly with and , it, in turn, grows sublinearly with T for these kernels.

VII. ANALYSIS OF PSRL UNDER NYSTR ¨OM APPROXIMATION

Optimizing for an optimistic policy is not computationally tractable in general, even though planning for the optimal policy is possible for a given MDP. A popular approach to overcome this difficulty is to sample a random MDP at every episode and solve for its optimal policy, called posterior sampling [34].

PSRL (Algorithm 2), in its most general form, starts with a prior distribution over MDPs. At the beginning of episode l, using the history of observations , it updates the posterior and samples an MDP from . (Sampling can be done using MCMC methods even if doesn’t admit any closed form.) It then selects an optimal policy of the sampled MDP , in the sense that for all and for all h = 1, 2, . . ., H, and executes for the entire episode.

[34] show that if we have a frequentist regret bound for UCRL in hand, then we can obtain a similar bound (upto a constant factor) on the Bayes regret, defined as the expected regret under the prior distribution , of PSRL. We use this idea to obtain a sublinear bound on the Bayes regret of PSRL under kernel approximation.

Theorem 2 (Bayes regret of PSRL under Nystr¨om approximation): Let the assumptions in Section IV hold and be a (known) prior distribution over MDPs . Then, the Bayes regret of PSRL satisfies

VIII. CONCLUSIONS

Any MDP M whose mean reward function satisfies for some such that , and mean transition func-

tion satisfies , for some such that , lies in the set . However, there might be other MDPs in which do not posses this linear structure. Therefore, optimal planning may be computationally intractable even for a single MDP. So it is common in the literature to assume access to an approximate MDP planner which returns an -optimal policy for M. Given such a planner , if it is possible to obtain (through extended value iteration [7] or otherwise) an efficient planner which returns an -optimal policy for the most optimistic MDP from a family M, then we modify PSRL and Kernel-UCRL to choose and respectively at every episode l. It follows that this adds only an factor in the respective regret bounds. The design of such approximate planners for continuous state and action spaces remains a subject of active research, whereas our focus in this work is on the statistical efficiency of the online learning problem.

APPENDIX

In this section we will assume that , i.e., m = 1 and for some n = O(1). Also we will assume that and are Squared Exponential (SE) kernels defined over with length scale parameters and , respectively.

A. Kernel-UCRL under QFF Approximation

First, we choose an such that and for some ap- propriate constants and . Then we set and construct dimensional feature map using 4. Similarly, we choose an such that and for some appropriate constants and and construct the feature map of dimension . Now we construct confidence sets and as follows. First, we compute , where , and . Then we fix any and define to be the set of all functions such that

Similarly, we compute , where , and . Now, we define to be the set of all functions such that

Next, following the same approach as before, we build the set of all plausible MDPs, choose an optimistic policy for and execute for the entire episode.

B. Regret Bound under QFF Approximation

(0, 1). Then, with probability at least , uniformly over all and ,

, where and for every episode l. Now define . Then from [32, Equation 7], we have. Let . Then from [32, Lemma 15],

From [25, Theorem 1]

, since n = O(1). If , then . Further if , then and thus, in turn, O(1/T ) for each . Now the result follows by combining 7 and 13 using the triangle inequality.

(0, 1). Then, with probability at least , uniformly over all and ,

Theorem 3 (Regret bound for Kernel-UCRL): Let the assumptions in Section IV hold. Further, let m = 1, n = O(1) and are SE kernels on with length scale parameters , respectively. Then, for any (0, 1), Kernel-UCRL with QFF approximation enjoys, with probability at least , the regret bound

:=) + :=) +

When the events and are true for all episodes , then using a similar approach as in the proof of Theorem 1 we can show that for any , with probability at least ,

(14) Also for every episode l the following holds:

By our choice of and , Lemmas 3 and 4 together imply that the events and are true for all episodes . Further, since and

and are non-decreasing functions in l. Now, combining 14 and 15, and applying a union bound we have, with probability at least , that

From [24, Lemma 11], and , since O(d ln t) for any linear kernel k defined over . Now the result follows by noting that and for n = O(1).

ACKNOWLEDGMENT

Sayak Ray Chowdhury is supported by a Google PhD Fellowship. Aditya Gopalan is grateful for support from the DST INSPIRE faculty grant IFA13- ENG-69.

REFERENCES

[1] A. L. Strehl, L. Li, and M. L. Littman, “Reinforcement learning in finite MDPs: PAC analysis,” J. Mach. Learn. Res., vol. 10, pp. 2413– 2444, Dec. 2009.

[2] I. Osband and B. Van Roy, “Model-based reinforcement learning and the eluder dimension,” in Advances in Neural Information Processing Systems, 2014, pp. 1466–1474.

[3] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, “Gaussian process optimization in the bandit setting: No regret and experimental design,” arXiv preprint arXiv:0912.3995, 2009.

[4] S. R. Chowdhury and A. Gopalan, “On kernelized multi-armed bandits,” in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 844–853.

[5] A. Durand, O.-A. Maillard, and J. Pineau, “Streaming kernel regression with provably adaptive mean, variance, and regularization,” arXiv preprint arXiv:1708.00768, 2017.

[6] M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini, “Finite-time analysis of kernelised contextual bandits,” arXiv preprint arXiv:1309.6869, 2013.

[7] T. Jaksch, R. Ortner, and P. Auer, “Near-optimal regret bounds for reinforcement learning,” Journal of Machine Learning Research, vol. 11, no. Apr, pp. 1563–1600, 2010.

[8] I. Osband, D. Russo, and B. Van Roy, “(More) efficient reinforcement learning via posterior sampling,” in Advances in Neural Information Processing Systems, 2013, pp. 3003–3011.

[9] Y. Ouyang, M. Gagrani, A. Nayyar, and R. Jain, “Learning unknown markov decision processes: A thompson sampling approach,” in Advances in Neural Information Processing Systems, 2017, pp. 1333– 1342.

[10] Y. Abbasi-Yadkori and C. Szepesv´ari, “Regret bounds for the adaptive control of linear quadratic systems,” in Proceedings of the 24th Annual Conference on Learning Theory, 2011, pp. 1–26.

[11] ——, “Bayesian optimal control of smoothly parameterized systems.” in UAI. Citeseer, 2015, pp. 1–11.

[12] M. Ibrahimi, A. Javanmard, and B. V. Roy, “Efficient reinforcement learning for high dimensional linear quadratic systems,” in Advances in Neural Information Processing Systems, 2012, pp. 2636–2644.

[13] M. Abeille and A. Lazaric, “Thompson sampling for linear-quadratic control problems,” arXiv preprint arXiv:1703.08972, 2017.

[14] A. Gopalan and S. Mannor, “Thompson sampling for learning parameterized markov decision processes,” in Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, 2015, pp. 861–898.

[15] S. Agrawal and R. Jia, “Optimistic posterior sampling for reinforcement learning: worst-case regret bounds,” in Advances in Neural Information Processing Systems, 2017, pp. 1184–1194.

[16] R. Ortner and D. Ryabko, “Online regret bounds for undiscounted continuous reinforcement learning,” in Advances in Neural Information Processing Systems, 2012, pp. 1763–1771.

[17] K. Lakshmanan, R. Ortner, and D. Ryabko, “Improved regret bounds for undiscounted continuous reinforcement learning,” in International Conference on Machine Learning, 2015, pp. 524–532.

[18] M. Turchetta, F. Berkenkamp, and A. Krause, “Safe exploration in finite markov decision processes with gaussian processes,” in Advances in Neural Information Processing Systems, 2016, pp. 4312–4320.

[19] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Advances in Neural Information Processing Systems, 2017, pp. 908– 919.

[20] M. Deisenroth and C. E. Rasmussen, “Pilco: A model-based and data-efficient approach to policy search,” in Proceedings of the 28th International Conference on machine learning (ICML-11), 2011, pp. 465–472.

[21] T. Jung and P. Stone, “Gaussian processes for sample efficient reinforcement learning with rmax-like exploration,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2010, pp. 601–616.

[22] R. Grande, T. Walsh, and J. How, “Sample efficient reinforcement learning with gaussian processes,” in International Conference on Machine Learning, 2014, pp. 1332–1340.

[23] R. C. Grande, “Computationally efficient gaussian process changepoint detection and regression,” Ph.D. dissertation, Massachusetts Institute of Technology, 2014.

[24] S. R. Chowdhury and A. Gopalan, “Online learning in kernelized markov decision processes,” in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 3197–3205.

[25] M. Mutny and A. Krause, “Efficient high dimensional bayesian optimization with additivity and quadrature fourier features,” in Advances in Neural Information Processing Systems, 2018, pp. 9005–9016.

[26] T. Yang, Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou, “Nystr¨om method vs random fourier features: A theoretical and empirical comparison,” in Advances in neural information processing systems, 2012, pp. 476–484.

[27] S. Bochner, Lectures on Fourier integrals. Princeton University Press, 1959.

[28] F. B. Hildebrand, Introduction to numerical analysis. Courier Corporation, 1987.

[29] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in Advances in neural information processing systems, 2008, pp. 1177–1184.

[30] Y. Abbasi-Yadkori, D. P´al, and C. Szepesv´ari, “Improved algorithms for linear stochastic bandits,” in Advances in Neural Information Processing Systems, 2011, pp. 2312–2320.

[31] D. Calandriello, L. Carratino, A. Lazaric, M. Valko, and L. Rosasco, “Gaussian process optimization with adaptive sketching: Scalable and no regret,” In Conference on Learning Theory, 2019.

[32] S. R. Chowdhury and A. Gopalan, “Bayesian optimization under heavy-tailed payoffs,” arXiv preprint arXiv:1909.07040, 2019.

[33] A. Krause and C. S. Ong, “Contextual gaussian process bandit optimization,” in Advances in Neural Information Processing Systems, 2011, pp. 2447–2455.

[34] I. Osband and B. Van Roy, “Why is posterior sampling better than optimism for reinforcement learning?” arXiv preprint arXiv:1607.00215, 2016.

designed for accessibility and to further open science