b

DiscoverSearch
About
My stuff
On Online Learning in Kernelized Markov Decision Processes
2019·arXiv
Abstract
Abstract

We develop algorithms with low regret for learning episodic Markov decision processes based on kernel approximation techniques. The algorithms are based on both the Upper Confidence Bound (UCB) as well as Posterior or Thompson Sampling (PSRL) philosophies, and work in the general setting of continuous state and action spaces when the true unknown transition dynamics are assumed to have smoothness induced by an appropriate Reproducing Kernel Hilbert Space (RKHS).

The goal of reinforcement learning (RL) is to learn optimal behavior by repeated interaction with an unknown environment, usually modeled as a Markov Decision Process (MDP). Performance is typically measured by the amount of interaction, in terms of episodes or rounds, needed to arrive at an optimal (or near-optimal) policy; this is also known as the sample complexity of RL [1]. The sample complexity objective encourages efficient exploration across states and actions, but, at the same time, is indifferent to the reward earned during the learning phase.

A related, but different, goal in RL is the online one, i.e., to learn to gather high cumulative reward, or to equivalently keep the learner’s regret (the gap between its and the optimal policy’s net reward) as low as possible. This is preferable in settings where experimentation comes at a premium and the reward earned in each round is of direct value, e.g., recommender systems (in which rewards correspond to clickthrough events and ultimately translate to revenue), dynamic pricing – in general, control of unknown dynamical systems with instantaneous costs.

A primary challenge in RL is to learn efficiently across complex (very large or infinite) state and action spaces. In the most general tabula rasa MDP setting, the learner must explore each state-action transition before developing a reasonably clear understanding of the environment, which is prohibitive for large problems. Real-world domains, though, possess more structure: transition and reward behavior often varies smoothly over states and actions, making it possible to generalize via inductive inference – observing a state transition or reward is informative of other, similar transitions or rewards. Scaling RL to large, complex, real-world domains requires exploiting regularity structure in the environment, which has typically been carried out via the use of parametric MDP models in model-based approaches, e.g., [2].

This paper takes a step in developing theory and algorithms for online RL in environments with smooth transition and reward structure. We specifically consider the episodic online learning problem in the nonparametric, kernelizable MDP setting, i.e., of minimizing regret (relative to an optimal finite-horizon policy) in MDPs with continuous state and action spaces, whose transition and reward functions exhibit smoothness over states and actions compatible with the structure of a reproducing kernel. We develop variants of the well-known UCRL and posterior sampling algorithms for MDPs with continuous state and action spaces, and show that they enjoy sublinear, finite-time regret bounds when the mean transition and reward functions are assumed to belong to the associated Reproducing Kernel Hilbert Space (RKHS) of functions.

Our results bound the regret of the algorithms in terms of a novel generalization of the information gain of the state transition and reward function kernels, from the memoryless kernel bandit setting [3] to the state-based kernel MDP setting, and help shed light on how the choice of kernel model influences regret performance. We also leverage two different kernel approximation techniques, namely the Quadrature Fourier Features (QFF) approximation [25] and the Nystr¨om approximation [31], to prove the results in the paper. To the best of our knowledge, these are the first concrete regret bounds for RL under kernel approximation, explicitly showing the dependence of regret on kernel structure.

Our results represent a generalization of several streams of work. We generalize online learning in the kernelized bandit setting [4], [6] to kernelized MDPs, and tabula rasa online learning approaches for MDPs such as Upper Confidence Bound for Reinforcement Learning (UCRL) [7] and Posterior Sampling for Reinforcement Learning (PSRL) [8], [9] to MDPs with kernel structure. Our results can also generalize regret minimization for an episodic variant of the well-known parametric Linear Quadratic Regulator (LQR) problem [10]– [13] to its nonlinear, nonparametric, infinite-dimensional, kernelizable counterpart.

Regret minimization has been studied with parametric MDPs in [7], [8], [14], [15] etc. For online regret minimization in complex MDPs, apart from the work of [2], [16] and [17] consider continuous state spaces with Lipschitz transition dynamics but unstructured, finite action spaces. As regards using the Gaussian process (GP) framework to model nonlinear, smooth MDP structure to achieve generalization, [20] develop a policy search method for MDPs with GPbased dynamics, but without provable exploration guarantees, whereas approaches in [21], [22] has been shown to be sample efficient. [23] develops a PAC learning algorithm by considering separate GP priors over the mean reward function, transition kernel and optimal Q-function, with a PAC guarantee. The most closely related work of ours is [24] which consider GP based variants of UCRL and PSRL algorithms and show no-regret guarantees for kernelizable MDPs. Our results, in a way, show that we can achieve the same guarantees as in [24] even with kernel approximations.

We consider the problem of learning to optimize reward in an unknown finite-horizon MDP,  M⋆ = {S, A, R⋆, P⋆, H}, over repeated episodes of interaction. Here,  S ⊂ Rmrepresents the state space,  A ⊂ Rnthe action space, H the episode length,  R⋆(s, a)the reward distribution over R, and P⋆(s, a)the transition distribution over S. At each period h = 1, 2, . . ., H within an episode, an agent observes a state  sh ∈ S, takes an action  ah ∈ A, observes a reward rh ∼ R⋆(sh, ah), and causes the MDP to transition to a next state  sh+1 ∼ P⋆(sh, ah). We assume that the agent, while not possessing knowledge of the reward and transition distribution  R⋆, P⋆of the unknown MDP  M⋆, knows S, A and H.

A policy  π : S × {1, 2, . . ., H} → Ais defined to be a mapping from a state  s ∈ Sand a period  1 ≤ h ≤ Hto an action  a ∈ A. For any MDP  M = {S, A, RM, PM, H}and policy  π, the finite horizon, undiscounted, value function for every state  s ∈ Sand every period  1 ≤ h ≤ His defined as

image

where the subscript  πindicates the application of the learning policy  π, i.e.,  aj = π(sj, j), and the subscript M explicitly references the MDP environment M, i.e.,  sj+1 ∼PM(sj, aj), for all j = h, . . . , H.

We useRM(s, a) = E�r�� r ∼ RM(s, a)�to denote the mean of the reward distribution  RM(s, a)that corresponds to playing action a at state s in the MDP M. We can view a sample r from the reward distribution  RM(s, a)as r =RM(s, a) + εR, where  εRdenotes a sample of zero-mean, real-valued additive noise. Similarly, the transition distribution  PM(s, a)can also be decomposed as a mean valueP M(s, a)in  Rmplus a zero-mean additive noise  εPin  Rmso that  s′ =P M(s, a) + εPlies in1 S ⊂ Rm. A policy  πMis said to be optimal for the MDP M if

image

At the beginning of each episode l, an RL algorithm chooses a policy  πldepending upon the observed state-action-reward sequences upto episode  l − 1, denoted by the history  Hl−1 := {sj,k, aj,k, rj,k, sj,k+1}1≤j≤l−1,1≤k≤H, and executes it for the entire duration of the episode. In other words, at each period h of the l-th episode, the learning algorithm chooses action  al,h = πl(sl,h, h), receives reward rl,h = R⋆(sl,h, al,h) + εR,l,hand observes the next state sl,h+1 = P ⋆(sl,h, al,h) + εP,l,h. The goal of an episodic online RL algorithm is to maximize its cumulative reward across episodes, or, equivalently, minimize its cumulative regret: the loss incurred in terms of the value function due to not knowing the optimal policy  π⋆ := πM⋆of the unknown MDP  M⋆beforehand and instead using the policy  πlfor each episode l, l = 1, 2, . . .. The cumulative (expected) regret of an RL algorithm  π = {π1, π2, . . .}upto time horizon T = τHis defined as

image

where the initial states  sl,1, l ≥ 1are assumed to be fixed.

For the rest of the paper, unless otherwise specified, we define  Z := S × A, z := (s, a), z′ := (s′, a′)and  zl,h :=(sl,h, al,h)for all  l ≥ 1and  1 ≤ h ≤ H.

A. Smoothness of Value Function

image

1 ≤ h ≤ H, we define the one step future value function as the expected value of the optimal policy  πM, with the next state distributed according to  ϕ, i.e.  U Mh (ϕ) :=Es′∼ϕ�V MπM,h+1(s′)�. We assume the following regularity condition on the future value function of any MDP (also made by [2]). For any two single-step transition distributions ϕ1, ϕ2over S, and  1 ≤ h ≤ H,

image

where ϕ := Es′∼ϕ[s′] ∈ Sdenotes the mean of the distribution  ϕ. In other words, the one-step future value functions for each period h are Lipschitz continuous with respect to the  ∥·∥2-norm of the mean2, with global Lipschitz constant  LM. We also assume that there is a known constant L such that  LM⋆ ≤ L.

B. Smoothness of Mean Reward and Transition Functions

Attaining sub-linear regret is impossible in general for arbitrary reward and transition distributions, and thus some regularity assumptions are needed. In this paper, we assume smoothness for the mean reward functionR⋆ : Z → Ris induced by the structure of a kernel on Z. Specifically, we make the standard assumption of a p.s.d. kernel  kR :Z × Z → Rsuch that  kR(z, z) ≤ 1for all  z ∈ Z, andR⋆being an element of the reproducing kernel Hilbert space (RKHS)  HkR(Z)of smooth real valued functions on Z. An RKHS of real-valued functions  X → R, denoted by  Hk(X), is completely specified by its kernel function  k(·, ·)and viceversa, with an inner product  ⟨·, ·⟩kobeying the reproducing property  f(x) = ⟨f, k(x, ·)⟩kfor all  f ∈ Hk(X). The induced RKHS norm  ∥f∥k = �⟨f, f⟩kis a measure of smoothness of f with respect to the kernel function k. We assume that the RKHS norm ofR⋆is bounded, i.e., ��R⋆��kR ≤ BRfor some  BR < ∞. Boundedness of  kRalong the diagonal holds for any stationary kernel, i.e., where kR(z, z′) = kR(z −z′), e.g., the Squared Exponential kernel kSEand the Mat´ern kernel  kMat´ern:

image

where l > 0 and  ν > 0are hyperparameters of the kernels,

image

C. Sub-Gaussian Noise Variables

We assume that the random variables  {εR,l,h}l≥1,1≤h≤His conditionally zero-mean and  σR-sub-Gaussian, i.e., there exists a known  σR > 0such that for any  λ ∈ R,

image

where FR,l,h−1is the sigma algebra generated by the random variables {sj,k, aj,k, εR,j,k}1≤j≤l−1,1≤k≤H, {sl,k, al,k, εR,l,k}1≤k≤h−1, sl,hand  al,h. Similarly, the random variables  {εP,l,h}l≥1,1≤h≤His assumed to be conditionally component-wise independent, zero-mean and  σP-subGaussian, in the sense that there exists a known  σP > 0such that for any  λ ∈ Rand  1 ≤ i ≤ m,

image

where FP,l,h−1is the sigma algebra generated by the random variables {sj,k, aj,k, εP,j,k}1≤j≤l−1,1≤k≤H, {sl,k, al,k, εP,l,k}1≤k≤h−1, sl,hand  al,hand  Imdenotes the identity matrix of rank m.

A. Kernel Approximation For kernelized MDPs [24] develop variants of the UCRL2 algorithm which, at every episode l, constructs confidence sets for the mean reward and the mean transition functions. The construction of each confidence set require one inversion of the kernel (gram) matrix, which takes  O(l3)time. This makes the algorithm quite prohibitive for large number of episodes. To reduce this computational cost without compromising on the accuracy of the confidence sets, we incorporate two efficient kernel approximation schemes, namely the Quadrature Fourier Features (QFF) approximation [25] and the Nystr¨om approximation [26].

1) Quadrature Fourier Features (QFF) approximation:

If k is a bounded, continuous, positive definite, stationary kernel defined over  X ⊂ Rqand satisfies k(x, x) = 1 for all  x ∈ X, then by Bochner’s theorem [27], k is the Fourier transform of a probability measure p, i.e., k(x, y) = �Rq p(ω) cos(ωT (x − y))dω. For the Squared Exponential kernel defined over  X ⊂ Rq, this measure has density p(ω) =� l√2π�qe−l2∥ω∥222(abusing notation for measure and density). [25] show that for any stationary kernel k on Rqwhose inverse Fourier transform decomposes product wise, i.e.,  p(ω) = �qj=1 pj(ωj), we can use Gauss-Hermite quadrature [28] to approximate it. If  X = [0, 1]q, the SE kernel is approximated as follows. Choose ¯d ∈ Nand d = ( ¯d)q, and construct the 2d-dimensional feature map

image

Here the set  {ω1, . . . , ωd} =

A ¯d × · · · × A ¯d, where  A ¯disthe set of ¯d(real) roots of the ¯d-th Hermite polynomial  H ¯d, and  ν(z) = �qj=1 2¯d−1 ¯d!¯d2H ¯d−1(zj)2for all  z ∈ Rq. 2) Nystr¨om approximation: Unlike the QFF approximation where the basis functions (cosine and sine) do not depend on the data, the basis functions used by the Nystr¨om method are data dependent. For a set of points {x1, . . . , xt} ⊂ X, the Nystr¨om method [26] approximates a kernel  k : X × X → Ras follows: First, randomly sample d points to construct a dictionary D = {xi1, . . . , xid}; ij ∈ [t], according to the following distribution. For each  i ∈ [t], include  xiin D independently with some suitably chosen probability  pi. (pi’s trade off between the quality and the size of the embedding.) Then, compute the (approximate) d-dimensional feature embedding

˜ϕ(x) = �K1/2D �†kD(x), where  KD = [k(u, v)]u,v∈D, kD(x) = [k(xi1, x), . . . , k(xid, x)]Tand  A†denotes the pseudo inverse of any matrix A.

Now we will present our algorithm Kernel-UCRL using the Nystr¨om approximation. The description and performance of Kernel-UCRL using the quadrature Fourier features approximation is deferred to the Appendix.

B. Kernel-UCRL Algorithm under Nystr¨om Approximation

Kernel-UCRL (Algorithm 1) is an optimistic algorithm based on the Upper Confidence Bound principle, which adapts the confidence sets of UCRL2 [7] to exploit the kernel structure. At the start of episode l, first we find feature embeddings ˜ϕR,l : Z → RdR,Land ˜ϕP,l :¯Z → RdP,Lto efficiently approximate the kernels  kRand  kP, respectively. First, we construct a dictionary DR,lby including every state-action pair z from the set  Zl−1 := {zj,k}1≤j≤l−1,1≤k≤H = {z1,1, . . . , zl−1,H}in  DR,lwith probability  bR,l(z)(to be defined later). Then, we define  ˜ϕR,l(z) =�K1/2DR,l

we define  ˜ϕP,l(z, i) = �K1/2DP,l�†kDP,l(z, i), where the dictionary  DP,lis constructed by including every (z, i) from the set ¯Zl−1 := �(zj,k, i)�1≤j≤l−1,1≤k≤H,1≤i≤m =�(z1,1, 1), . . . , (zl−1,H, m)�with probability  bP,l(z, i)(to be defined later).

Then we construct confidence sets  CR,land  CP,lfor the mean reward and the mean transition functions, respectively, as follows. First, we compute ˜θR,l−1 = ˜V −1R,l−1 ˜ΦTR,l−1Rl−1, where  Rl−1 := [r1,1, . . . , rl−1,H]Twith  rj,kbeing the reward of the state-action pair  zj,k, j ∈ [l − 1]and  k ∈ [H], ˜ΦR,l−1 = [ ˜ϕR,l(z1,1), . . . , ˜ϕR,l(zl−1,H)]T, and  ˜VR,l−1 =˜ΦTR,l−1 ˜ΦR,l−1 + HIdR,l. Fix any  0 < δ, εR, εP < 1. Now, we define  CR,lto be the set of all functions  f : Z → Rsuch that

image

where ˜µR,l−1(z) = ˜ϕR,l(z)T ˜θR,l−1, ˜σ2R,l−1(z) =  kR(z, z)−˜ϕR,l(z)T˜ϕR,l(z) +  H ˜ϕR,l(z)T ˜V −1R,l−1˜ϕR,l(z) and  βR,l=

image

Similarly, we compute ˜θP,l−1 = ˜V −1P,l−1 ˜ΦTP,l−1Sl−1, where Sl−1 := [sT1,2, . . . , sTl−1,H+1]Twith  sj,k+1being the next state of the state-action pair  zj,k, j ∈ [l − 1]and  k ∈[H], ˜ΦP,l−1 = [ ˜ϕP,l(z1,1, 1), . . . , ˜ϕP,l(zl−1,H, m)]Tand ˜VP,l−1 = ˜ΦTP,l−1 ˜ΦP,l−1 + mHIdP,l. Now, we define  CP,lto be the set of all functions  f : Z → Rmsuch that

∥f(z) − ˜µP,l−1(z)∥2 ≤ βP,l ∥˜σP,l−1(z)∥2 , ∀z ∈ Z, (6) where ˜µP,l−1(z) = [˜µP,l−1(z, 1), . . . ,  ˜µP,l−1(z, m)]T, ˜µP,l−1(z, i) = ˜ϕP,l(z, i)T ˜θP,l−1, ˜σP,l−1(z) =[˜σP,l−1(z, 1), . . . ,  ˜σP,l−1(z, m)]T, ˜σ2P,l−1(z, i) = kP ((z, i), (z, i)) − ˜ϕP,l(z, i)T ˜ϕP,l(z, i) + mH  ˜ϕP,l(z, i)T ˜V −1P,l−1 ˜ϕP,l(z, i) and the confidence

image

Next, we build the set  Mlof all plausible MDPs M such that: (i) the mean reward functionRM ∈ CR,l, (ii)the mean transition functionP M ∈ CP,land (iii) the global Lipschitz constant (of future value functions)  LM ≤ L. Finally, we select an optimistic policy  πlfor the family of MDPs  Mlin the sense that  V Mlπl,1(sl,1) = maxπ maxM∈Ml V Mπ,1(sl,1), where  sl,1is the initial state and  Mlis the most optimistic realization from  Ml, and execute  πlfor the entire episode. The pseudo-code of kernel-UCRL is given in Algorithm 1.

A. Regret Bound of Kernel-UCRL

Let σ2R,l(z) := kR(z, z) − kR,l(z)T (KR,l +HIlH)−1kR,l(z), where kR,l(z) :=[kR(z1,1, z), . . . , kR(zl,H, z)]Tdenotes the vector of kernel evaluations between z and elements of  Zl, and  KR,ldenotes the kernel matrix computed at  Zl.

image

ηR = 6λR ln(12T/δ)ε2Rand  bR,l(z) = min{ηR˜σ2R,l−1(z), 1}. Then, with probability at least  1 − δ/3, uniformly over all z ∈ Zand  l ∈ [τ], the following holds:

image

��� ˜V −1R,l−1 ˜ΦTR,l−1εR,l−1��� ˜VR,l−1˜σR,l−1(z), where εR,l−1 := [εR,1,1, . . . , εR,l−1,H]Tdenotes the vector of reward noise variables. Now see that ��� ˜V −1R,l−1 ˜ΦTR,l−1εR,l−1��� ˜VR,l−1 = ���˜ΦTR,l−1εR,l−1��� ˜V −1R,l−1. Then by [30, Theorem 1], for any  δ ∈ (0, 1), with probability at least  1 − δ/6, uniformly over all  z ∈ Zand l ≥ 1,

image

Further, from [31, Theorem 1], with probability at least 1 − δ/6, uniformly over all  z ∈ Zand  l ∈ [τ], we have1λR σ2R,l−1(z) ≤ ˜σ2R,l−1(z) ≤ λRσ2R,l−1(z)and  dR,l ≤6ηRλR�1 + 1H�γ(l−1)H(R). Then, from [32, Equation 25], we have

image

(8) Now the result follows by combining 7 and 8, and applying an union bound.

image

mHImlH)−1kP,l(z, i)where kP,l(z, i) :=[kP ((z1,1, 1), (z, i)), . . . , kP ((zl,H, m), (z, i))]Tdenotes the vector of kernel evaluations between (z, i) and elements of ¯Zl, and  KP,ldenotes the kernel matrix computed at ¯Zl.

image

ηP = 6λP ln(12T/δ)ε2Pand  bP,l(z, i) = min{ηP ˜σ2P,l−1(z, i), 1}. Then, with probability at least  1 − δ/3, uniformly over all z ∈ Zand  l ∈ [τ], the following holds:

image

where L is a known upper bound over  LM⋆, D =maxs,s′∈S ∥s − s′∥2denotes the diameter of  S, CP,T =

image

By construction of the set of MDPs  Ml, it follows that when the events  ER,land  EP,lare true for all episodes  l ∈ [τ], the unknown MDP  M⋆lies in  Ml. Thus  V Mlπl,1(sl,1) ≥V M⋆π⋆,1(sl,1), since  Mlis the most optimistic MDP of  Ml. This implies

image

Now by the reproducing property of RKHS and CauchySchwartz inequality ��R⋆(z)�� = ��⟨R⋆, kR(z, ·)⟩kR�� ≤��R⋆��kR kR(z, z) ≤ BRfor all  z ∈ Z. Thus 9, [24, Lemma 7] and [24, Lemma 9] together imply that for any  0 < δ < 1, with probability at least  1 − δ/3,

image

(10) Now, by triangle inequality ��RMl(zl,h) −R⋆(zl,h)�� ≤��RMl(zl,h) − ˜µR,l−1(zl,h)�� + ��R⋆(zl,h) − ˜µR,l−1(zl,h)��. Therefore, when the event  ER,lis true,

image

since the mean reward functionRMllies in the confidence set  CR,l. Similarly when the event  EP,lis true,

image

where we have used the fact that both  βR,land  βP,lare non-decreasing with the number of episodes l and that LMl ≤ Lby construction of  Ml(and since  Ml ∈ Ml). Now the result follows from [24, Lemma 11] by noting that �τl=1�Hh=1 σR,l−1(zl,h) ≤ �2eHT γT(R)and �τl=1�Hh=1 ∥σP,l−1(zl,h)∥2 ≤�2emHT γmT(P).

B. Interpretation of the Bound

Theorem 1 implies that the cumulative regret of KernelUCRL after  τepisodes is ˜O���HγT (R) + γT (R)�√T +

L��mHγmT(P)+γmT (P)�√T +H√T�with high prob-

ability. ( ˜Ohides logarithmic factors.) Now we illustrate the growth of  γT (R)and  γmT (P)as functions of T with the following concrete examples.

Let  kR(z, z′) := k1(s, s′)+k2(a, a′), i.e.,  kRis an additive kernel of  k1 : S × S → Rand  k2 : A × A → R. Then, from [33, Theorem 3],  γT (R) ≤ γT (k1) + γT (k2) +2 ln T . Now if both  S ⊂ Rm, A ⊂ Rnare compact and convex sets, and both  k1, k2are Squared Exponential (SE) kernels, then from [3, Theorem 4],  γT (k1) = O�(ln T )m�and  γT (k2) = O�(ln T )n�. Hence, in this case  γT (R) =˜O�(ln T )max{m,n}�.

Further, let  kP ((z, i), (z′, j)) := k3(z, z′)k4(i, j), i.e.,  kPis a product kernel of  k3 : Z ×Z → Rand  k4 : [m]×[m] →R. Then, from [33, Theorem 2],  γmT (P) ≤ mγmT (k3) +m ln(mT ), since all kernel matrices over any subset of {1, . . . , m} have rank at most m. Now if  k3is a Squared Exponential kernel on Z, then  γmT (k3) = ˜O�(ln(mT ))m+n�. Hence, in this case  γmT (P) = O�m�ln(mT )�m+n�.

In essence,  γT (R)and  γmT (P)grow sublinearly with T for some popular kernels, e.g. Squared Exponential, polynomial and Mat´ern. Now, since the cumulative regret of KernelUCRL scales linearly with  γT (R)and  γmT (P), it, in turn, grows sublinearly with T for these kernels.

Optimizing for an optimistic policy is not computationally tractable in general, even though planning for the optimal policy is possible for a given MDP. A popular approach to overcome this difficulty is to sample a random MDP at every episode and solve for its optimal policy, called posterior sampling [34].

PSRL (Algorithm 2), in its most general form, starts with a prior distribution  Φover MDPs. At the beginning of episode l, using the history of observations  Hl−1, it updates the posterior  Φland samples an MDP  Mlfrom Φl. (Sampling can be done using MCMC methods even if  Φldoesn’t admit any closed form.) It then selects an optimal policy  πlof the sampled MDP  Ml, in the sense that  V Mlπl,h(s) = maxπ V Mlπ,h(s)for all  s ∈ Sand for all h = 1, 2, . . ., H, and executes  πlfor the entire episode.

image

[34] show that if we have a frequentist regret bound for UCRL in hand, then we can obtain a similar bound (upto a constant factor) on the Bayes regret, defined as the expected regret under the prior distribution  Φ, of PSRL. We use this idea to obtain a sublinear bound on the Bayes regret of PSRL under kernel approximation.

Theorem 2 (Bayes regret of PSRL under Nystr¨om approximation): Let the assumptions in Section IV hold and  Φbe a (known) prior distribution over MDPs  M⋆. Then, the Bayes regret of PSRL satisfies

image

Any MDP M whose mean reward function satisfies RM(z) = θTR ˜ϕR,l(z)for some  θR ∈ RdR,lsuch that ���θR − ˜θR,l−1��� ˜VR,l−1≤√HβR,l, and mean transition func-

tion satisfies P M(z, i) = θTP ˜ϕP,l(z, i), i = 1, . . . , m, for some  θP ∈ RdP,lsuch that ���θP − ˜θP,l−1��� ˜VP,l−1 ≤√mHβP,l, lies in the set  Ml. However, there might be other MDPs in  Mlwhich do not posses this linear structure. Therefore, optimal planning may be computationally intractable even for a single MDP. So it is common in the literature to assume access to an approximate MDP planner  Γ(M, ε)which returns an  ε-optimal policy for M. Given such a planner  Γ, if it is possible to obtain (through extended value iteration [7] or otherwise) an efficient planner ˜Γ(M, ε)which returns an  ε-optimal policy for the most optimistic MDP from a family M, then we modify PSRL and Kernel-UCRL to choose  πl = Γ(Ml,�H/l)and  πl =˜Γ(Ml,�H/l)respectively at every episode l. It follows that this adds only an  O(√T)factor in the respective regret bounds. The design of such approximate planners for continuous state and action spaces remains a subject of active research, whereas our focus in this work is on the statistical efficiency of the online learning problem.

In this section we will assume that  S ⊂ R, i.e., m = 1 and  A ⊂ Rnfor some n = O(1). Also we will assume that kRand  kPare Squared Exponential (SE) kernels defined over  [0, 1]n+1with length scale parameters  lRand  lP, respectively.

A. Kernel-UCRL under QFF Approximation

First, we choose an  mR ∈ Nsuch that  1/l2R ≤ mR ≤C1/l2Rand  log4/e(T 6) ≤ mR ≤ C2 log4/e(T 6)for some ap- propriate constants  C1and  C2. Then we set  dR = 2(mR)n+1and construct  dRdimensional feature map  ˜ϕR(z)using 4. Similarly, we choose an  mP ∈ Nsuch that  1/l2P ≤mP ≤ C3/l2Pand  log4/e(T 6) ≤ mP ≤ C4 log4/e(T 6)for some appropriate constants  C3and  C4and construct the feature map  ˜ϕP (z)of dimension  dP = 2(mP )n+1. Now we construct confidence sets  CR,land  CP,las follows. First, we compute ˜θR,l−1 = ˜V −1R,l−1 ˜ΦTR,l−1Rl−1, where  Rl−1 =[r1,1, . . . , rl−1,H]T, ˜ΦR,l−1 = [ ˜ϕR(z1,1), . . . , ˜ϕR(zl−1,H)]Tand ˜VR,l−1 = ˜ΦTR,l−1 ˜ΦR,l−1 + HIdR. Then we fix any 0 < δ < 1and define  CR,lto be the set of all functions f : Z → Rsuch that

image

Similarly, we compute ˜θP,l−1 = ˜V −1P,l−1 ˜ΦTP,l−1Sl−1, where Sl−1 = [s1,2, . . . , sl−1,H+1]T, ˜ΦP,l−1 =[ ˜ϕP (z1,1), . . . , ˜ϕP (zl−1,H)]Tand ˜VP,l−1 = ˜ΦTP,l−1 ˜ΦP,l−1 +HIdP. Now, we define  CP,lto be the set of all functions f : Z → Rmsuch that

image

Next, following the same approach as before, we build the set  Mlof all plausible MDPs, choose an optimistic policy πlfor  Mland execute  πlfor the entire episode.

B. Regret Bound under QFF Approximation

image

(0, 1). Then, with probability at least  1 − δ/3, uniformly over all  z ∈ Zand  1 ≤ l ≤ τ,

image

˜σ2R,l−1(z) = H ˜ϕR,l(z)T ˜V −1R,l−1 ˜ϕR,l(z), where  ˜ϕR,l = ˜ϕRand dR,l = dRfor every episode l. Now define αR,l(z) = kR,l(z)T (KR,l+HIlH)−1R⋆,l−1. Then from [32, Equation 7], we have��R⋆(z) − αR,l−1(z)�� ≤ BRσR,l−1(z). Let  εdR := supz,z′∈Z��kR(z, z′) − ˜ϕR(z)T ˜ϕR(z′)�� < 1. Then from [32, Lemma 15],

image

From [25, Theorem 1]  εdR ≤ (n+1)2n 1√2mmRR

O�� e4mRl2R�mR�, since n = O(1). If  mR ≥ 1/l2R, then εdR = O ((e/4)mR). Further if  mR ≥ log4/e(T 6), then εdR = O(1/T 6)and thus, in turn,  Hε1/2dR l2 = O(1/H2τ) =O(1/T ) for each  l ≤ τ. Now the result follows by combining 7 and 13 using the triangle inequality.

image

(0, 1). Then, with probability at least  1 − δ/3, uniformly over all  z ∈ Zand  1 ≤ l ≤ τ,

image

Theorem 3 (Regret bound for Kernel-UCRL): Let the assumptions in Section IV hold. Further, let m = 1, n = O(1) and  kR, kPare SE kernels on  [0, 1]n+1with length scale parameters  lR, lP, respectively. Then, for any  δ ∈(0, 1), Kernel-UCRL with QFF approximation enjoys, with probability at least  1 − δ, the regret bound

image

ER,l:=� ��R⋆(z) − ˜µR,l−1(z)�� ≤ βR,l˜σR,l−1(z) +  O(BR/T ), ∀z�,EP,l:=� ��P ⋆(z) − ˜µP,l−1(z)�� ≤ βP,l˜σP,l−1(z) +  O(BP /T ), ∀z�.

When the events  ER,land  EP,lare true for all episodes  l ∈[τ], then using a similar approach as in the proof of Theorem 1 we can show that for any  0 < δ < 1, with probability at least  1 − δ/3,

image

(14) Also for every episode l the following holds:

image

By our choice of  mRand  mP, Lemmas 3 and 4 together imply that the events  ER,land  EP,lare true for all episodes l ∈ [τ]. Further, since 12 ln det( ˜VR,l−1)det(HIdR ) = O(dR ln(lH))and

2 ln det( ˜VP,l−1)det(HIdP ) = O(dP ln(lH)), βR,land  βP,lare non-decreasing functions in l. Now, combining 14 and 15, and applying a union bound we have, with probability at least 1 − δ, that

image

From [24, Lemma 11],  ˜σR,l−1(zl,h) = O�√HT dR ln T�and  ˜σP,l−1(zl,h) = O�√HT dP ln T�, since  γt(k) =O(d ln t) for any linear kernel k defined over  Rd. Now the result follows by noting that  dR = (mR)n+1 = O�(ln T )n+1�and  dP = (mP )n+1 = O�(ln T )n+1�for n = O(1).

Sayak Ray Chowdhury is supported by a Google PhD Fellowship. Aditya Gopalan is grateful for support from the DST INSPIRE faculty grant IFA13- ENG-69.

[1] A. L. Strehl, L. Li, and M. L. Littman, “Reinforcement learning in finite MDPs: PAC analysis,” J. Mach. Learn. Res., vol. 10, pp. 2413– 2444, Dec. 2009.

[2] I. Osband and B. Van Roy, “Model-based reinforcement learning and the eluder dimension,” in Advances in Neural Information Processing Systems, 2014, pp. 1466–1474.

[3] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, “Gaussian process optimization in the bandit setting: No regret and experimental design,” arXiv preprint arXiv:0912.3995, 2009.

[4] S. R. Chowdhury and A. Gopalan, “On kernelized multi-armed bandits,” in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 844–853.

[5] A. Durand, O.-A. Maillard, and J. Pineau, “Streaming kernel regression with provably adaptive mean, variance, and regularization,” arXiv preprint arXiv:1708.00768, 2017.

[6] M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini, “Finite-time analysis of kernelised contextual bandits,” arXiv preprint arXiv:1309.6869, 2013.

[7] T. Jaksch, R. Ortner, and P. Auer, “Near-optimal regret bounds for reinforcement learning,” Journal of Machine Learning Research, vol. 11, no. Apr, pp. 1563–1600, 2010.

[8] I. Osband, D. Russo, and B. Van Roy, “(More) efficient reinforcement learning via posterior sampling,” in Advances in Neural Information Processing Systems, 2013, pp. 3003–3011.

[9] Y. Ouyang, M. Gagrani, A. Nayyar, and R. Jain, “Learning unknown markov decision processes: A thompson sampling approach,” in Advances in Neural Information Processing Systems, 2017, pp. 1333– 1342.

[10] Y. Abbasi-Yadkori and C. Szepesv´ari, “Regret bounds for the adaptive control of linear quadratic systems,” in Proceedings of the 24th Annual Conference on Learning Theory, 2011, pp. 1–26.

[11] ——, “Bayesian optimal control of smoothly parameterized systems.” in UAI. Citeseer, 2015, pp. 1–11.

[12] M. Ibrahimi, A. Javanmard, and B. V. Roy, “Efficient reinforcement learning for high dimensional linear quadratic systems,” in Advances in Neural Information Processing Systems, 2012, pp. 2636–2644.

[13] M. Abeille and A. Lazaric, “Thompson sampling for linear-quadratic control problems,” arXiv preprint arXiv:1703.08972, 2017.

[14] A. Gopalan and S. Mannor, “Thompson sampling for learning parameterized markov decision processes,” in Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, 2015, pp. 861–898.

[15] S. Agrawal and R. Jia, “Optimistic posterior sampling for reinforcement learning: worst-case regret bounds,” in Advances in Neural Information Processing Systems, 2017, pp. 1184–1194.

[16] R. Ortner and D. Ryabko, “Online regret bounds for undiscounted continuous reinforcement learning,” in Advances in Neural Information Processing Systems, 2012, pp. 1763–1771.

[17] K. Lakshmanan, R. Ortner, and D. Ryabko, “Improved regret bounds for undiscounted continuous reinforcement learning,” in International Conference on Machine Learning, 2015, pp. 524–532.

[18] M. Turchetta, F. Berkenkamp, and A. Krause, “Safe exploration in finite markov decision processes with gaussian processes,” in Advances in Neural Information Processing Systems, 2016, pp. 4312–4320.

[19] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Advances in Neural Information Processing Systems, 2017, pp. 908– 919.

[20] M. Deisenroth and C. E. Rasmussen, “Pilco: A model-based and data-efficient approach to policy search,” in Proceedings of the 28th International Conference on machine learning (ICML-11), 2011, pp. 465–472.

[21] T. Jung and P. Stone, “Gaussian processes for sample efficient reinforcement learning with rmax-like exploration,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2010, pp. 601–616.

[22] R. Grande, T. Walsh, and J. How, “Sample efficient reinforcement learning with gaussian processes,” in International Conference on Machine Learning, 2014, pp. 1332–1340.

[23] R. C. Grande, “Computationally efficient gaussian process changepoint detection and regression,” Ph.D. dissertation, Massachusetts Institute of Technology, 2014.

[24] S. R. Chowdhury and A. Gopalan, “Online learning in kernelized markov decision processes,” in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 3197–3205.

[25] M. Mutny and A. Krause, “Efficient high dimensional bayesian optimization with additivity and quadrature fourier features,” in Advances in Neural Information Processing Systems, 2018, pp. 9005–9016.

[26] T. Yang, Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou, “Nystr¨om method vs random fourier features: A theoretical and empirical comparison,” in Advances in neural information processing systems, 2012, pp. 476–484.

[27] S. Bochner, Lectures on Fourier integrals. Princeton University Press, 1959.

[28] F. B. Hildebrand, Introduction to numerical analysis. Courier Corporation, 1987.

[29] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in Advances in neural information processing systems, 2008, pp. 1177–1184.

[30] Y. Abbasi-Yadkori, D. P´al, and C. Szepesv´ari, “Improved algorithms for linear stochastic bandits,” in Advances in Neural Information Processing Systems, 2011, pp. 2312–2320.

[31] D. Calandriello, L. Carratino, A. Lazaric, M. Valko, and L. Rosasco, “Gaussian process optimization with adaptive sketching: Scalable and no regret,” In Conference on Learning Theory, 2019.

[32] S. R. Chowdhury and A. Gopalan, “Bayesian optimization under heavy-tailed payoffs,” arXiv preprint arXiv:1909.07040, 2019.

[33] A. Krause and C. S. Ong, “Contextual gaussian process bandit optimization,” in Advances in Neural Information Processing Systems, 2011, pp. 2447–2455.

[34] I. Osband and B. Van Roy, “Why is posterior sampling better than optimism for reinforcement learning?” arXiv preprint arXiv:1607.00215, 2016.


Designed for Accessibility and to further Open Science