The goal of reinforcement learning (RL) is to learn optimal behavior by repeated interaction with an unknown environment, usually modeled as a Markov Decision Process (MDP). Performance is typically measured by the amount of interaction, in terms of episodes or rounds, needed to arrive at an optimal (or near-optimal) policy; this is also known as the sample complexity of RL [1]. The sample complexity objective encourages efficient exploration across states and actions, but, at the same time, is indifferent to the reward earned during the learning phase.
A related, but different, goal in RL is the online one, i.e., to learn to gather high cumulative reward, or to equivalently keep the learner’s regret (the gap between its and the optimal policy’s net reward) as low as possible. This is preferable in settings where experimentation comes at a premium and the reward earned in each round is of direct value, e.g., recommender systems (in which rewards correspond to clickthrough events and ultimately translate to revenue), dynamic pricing – in general, control of unknown dynamical systems with instantaneous costs.
A primary challenge in RL is to learn efficiently across complex (very large or infinite) state and action spaces. In the most general tabula rasa MDP setting, the learner must explore each state-action transition before developing a reasonably clear understanding of the environment, which is prohibitive for large problems. Real-world domains, though, possess more structure: transition and reward behavior often varies smoothly over states and actions, making it possible to generalize via inductive inference – observing a state transition or reward is informative of other, similar transitions or rewards. Scaling RL to large, complex, real-world domains requires exploiting regularity structure in the environment, which has typically been carried out via the use of parametric MDP models in model-based approaches, e.g., [2].
This paper takes a step in developing theory and algorithms for online RL in environments with smooth transition and reward structure. We specifically consider the episodic online learning problem in the nonparametric, kernelizable MDP setting, i.e., of minimizing regret (relative to an optimal finite-horizon policy) in MDPs with continuous state and action spaces, whose transition and reward functions exhibit smoothness over states and actions compatible with the structure of a reproducing kernel. We develop variants of the well-known UCRL and posterior sampling algorithms for MDPs with continuous state and action spaces, and show that they enjoy sublinear, finite-time regret bounds when the mean transition and reward functions are assumed to belong to the associated Reproducing Kernel Hilbert Space (RKHS) of functions.
Our results bound the regret of the algorithms in terms of a novel generalization of the information gain of the state transition and reward function kernels, from the memoryless kernel bandit setting [3] to the state-based kernel MDP setting, and help shed light on how the choice of kernel model influences regret performance. We also leverage two different kernel approximation techniques, namely the Quadrature Fourier Features (QFF) approximation [25] and the Nystr¨om approximation [31], to prove the results in the paper. To the best of our knowledge, these are the first concrete regret bounds for RL under kernel approximation, explicitly showing the dependence of regret on kernel structure.
Our results represent a generalization of several streams of work. We generalize online learning in the kernelized bandit setting [4], [6] to kernelized MDPs, and tabula rasa online learning approaches for MDPs such as Upper Confidence Bound for Reinforcement Learning (UCRL) [7] and Posterior Sampling for Reinforcement Learning (PSRL) [8], [9] to MDPs with kernel structure. Our results can also generalize regret minimization for an episodic variant of the well-known parametric Linear Quadratic Regulator (LQR) problem [10]– [13] to its nonlinear, nonparametric, infinite-dimensional, kernelizable counterpart.
Regret minimization has been studied with parametric MDPs in [7], [8], [14], [15] etc. For online regret minimization in complex MDPs, apart from the work of [2], [16] and [17] consider continuous state spaces with Lipschitz transition dynamics but unstructured, finite action spaces. As regards using the Gaussian process (GP) framework to model nonlinear, smooth MDP structure to achieve generalization, [20] develop a policy search method for MDPs with GPbased dynamics, but without provable exploration guarantees, whereas approaches in [21], [22] has been shown to be sample efficient. [23] develops a PAC learning algorithm by considering separate GP priors over the mean reward function, transition kernel and optimal Q-function, with a PAC guarantee. The most closely related work of ours is [24] which consider GP based variants of UCRL and PSRL algorithms and show no-regret guarantees for kernelizable MDPs. Our results, in a way, show that we can achieve the same guarantees as in [24] even with kernel approximations.
We consider the problem of learning to optimize reward in an unknown finite-horizon MDP, , over repeated episodes of interaction. Here,
represents the state space,
the action space, H the episode length,
the reward distribution over R, and
the transition distribution over S. At each period h = 1, 2, . . ., H within an episode, an agent observes a state
, takes an action
, observes a reward
, and causes the MDP to transition to a next state
. We assume that the agent, while not possessing knowledge of the reward and transition distribution
of the unknown MDP
, knows S, A and H.
A policy is defined to be a mapping from a state
and a period
to an action
. For any MDP
and policy
, the finite horizon, undiscounted, value function for every state
and every period
is defined as
where the subscript indicates the application of the learning policy
, i.e.,
, and the subscript M explicitly references the MDP environment M, i.e.,
, for all j = h, . . . , H.
We useto denote the mean of the reward distribution
that corresponds to playing action a at state s in the MDP M. We can view a sample r from the reward distribution
as
, where
denotes a sample of zero-mean, real-valued additive noise. Similarly, the transition distribution
can also be decomposed as a mean value
in
plus a zero-mean additive noise
in
so that
lies in1
. A policy
is said to be optimal for the MDP M if
At the beginning of each episode l, an RL algorithm chooses a policy depending upon the observed state-action-reward sequences upto episode
, denoted by the history
, and executes it for the entire duration of the episode. In other words, at each period h of the l-th episode, the learning algorithm chooses action
, receives reward
and observes the next state
. The goal of an episodic online RL algorithm is to maximize its cumulative reward across episodes, or, equivalently, minimize its cumulative regret: the loss incurred in terms of the value function due to not knowing the optimal policy
of the unknown MDP
beforehand and instead using the policy
for each episode l, l = 1, 2, . . .. The cumulative (expected) regret of an RL algorithm
upto time horizon
is defined as
where the initial states are assumed to be fixed.
For the rest of the paper, unless otherwise specified, we define and
for all
and
.
A. Smoothness of Value Function
, we define the one step future value function as the expected value of the optimal policy
, with the next state distributed according to
, i.e.
. We assume the following regularity condition on the future value function of any MDP (also made by [2]). For any two single-step transition distributions
over S, and
,
where denotes the mean of the distribution
. In other words, the one-step future value functions for each period h are Lipschitz continuous with respect to the
-norm of the mean2, with global Lipschitz constant
. We also assume that there is a known constant L such that
.
B. Smoothness of Mean Reward and Transition Functions
Attaining sub-linear regret is impossible in general for arbitrary reward and transition distributions, and thus some regularity assumptions are needed. In this paper, we assume smoothness for the mean reward functionis induced by the structure of a kernel on Z. Specifically, we make the standard assumption of a p.s.d. kernel
such that
for all
, and
being an element of the reproducing kernel Hilbert space (RKHS)
of smooth real valued functions on Z. An RKHS of real-valued functions
, denoted by
, is completely specified by its kernel function
and viceversa, with an inner product
obeying the reproducing property
for all
. The induced RKHS norm
is a measure of smoothness of f with respect to the kernel function k. We assume that the RKHS norm of
is bounded, i.e.,
for some
. Boundedness of
along the diagonal holds for any stationary kernel, i.e., where
, e.g., the Squared Exponential kernel
and the Mat
rn kernel
:
where l > 0 and are hyperparameters of the kernels,
C. Sub-Gaussian Noise Variables
We assume that the random variables is conditionally zero-mean and
-sub-Gaussian, i.e., there exists a known
such that for any
,
where is the sigma algebra generated by the random variables
,
and
. Similarly, the random variables
is assumed to be conditionally component-wise independent, zero-mean and
-subGaussian, in the sense that there exists a known
such that for any
and
,
where is the sigma algebra generated by the random variables
,
and
and
denotes the identity matrix of rank m.
A. Kernel Approximation For kernelized MDPs [24] develop variants of the UCRL2 algorithm which, at every episode l, constructs confidence sets for the mean reward and the mean transition functions. The construction of each confidence set require one inversion of the kernel (gram) matrix, which takes time. This makes the algorithm quite prohibitive for large number of episodes. To reduce this computational cost without compromising on the accuracy of the confidence sets, we incorporate two efficient kernel approximation schemes, namely the Quadrature Fourier Features (QFF) approximation [25] and the Nystr¨om approximation [26].
1) Quadrature Fourier Features (QFF) approximation:
If k is a bounded, continuous, positive definite, stationary kernel defined over and satisfies k(x, x) = 1 for all
, then by Bochner’s theorem [27], k is the Fourier transform of a probability measure p, i.e., k(x, y) =
. For the Squared Exponential kernel defined over
, this measure has density
(abusing notation for measure and density). [25] show that for any stationary kernel k on
whose inverse Fourier transform decomposes product wise, i.e.,
, we can use Gauss-Hermite quadrature [28] to approximate it. If
, the SE kernel is approximated as follows. Choose
and
, and construct the 2d-dimensional feature map
Here the set
, where
isthe set of
(real) roots of the
-th Hermite polynomial
, and
for all
. 2) Nystr¨om approximation: Unlike the QFF approximation where the basis functions (cosine and sine) do not depend on the data, the basis functions used by the Nystr¨om method are data dependent. For a set of points
, the Nystr¨om method [26] approximates a kernel
as follows: First, randomly sample d points to construct a dictionary D =
, according to the following distribution. For each
, include
in D independently with some suitably chosen probability
. (
’s trade off between the quality and the size of the embedding.) Then, compute the (approximate) d-dimensional feature embedding
, where
,
and
denotes the pseudo inverse of any matrix A.
Now we will present our algorithm Kernel-UCRL using the Nystr¨om approximation. The description and performance of Kernel-UCRL using the quadrature Fourier features approximation is deferred to the Appendix.
B. Kernel-UCRL Algorithm under Nystr¨om Approximation
Kernel-UCRL (Algorithm 1) is an optimistic algorithm based on the Upper Confidence Bound principle, which adapts the confidence sets of UCRL2 [7] to exploit the kernel structure. At the start of episode l, first we find feature embeddings and
to efficiently approximate the kernels
and
, respectively. First, we construct a dictionary
by including every state-action pair z from the set
in
with probability
(to be defined later). Then, we define
we define , where the dictionary
is constructed by including every (z, i) from the set
with probability
(to be defined later).
Then we construct confidence sets and
for the mean reward and the mean transition functions, respectively, as follows. First, we compute
, where
with
being the reward of the state-action pair
and
,
, and
. Fix any
. Now, we define
to be the set of all functions
such that
where ˜) = ˜
, ˜
) =
˜
˜
) +
˜
) and
=
Similarly, we compute , where
with
being the next state of the state-action pair
and
[H],
and
. Now, we define
to be the set of all functions
such that
, (6) where
, . . . ,
z, m
,
z, i
z, i
, . . . ,
z, m
,
z, i) = k
z, i), (z, i
z, i
z, i) + mH
z, i
V
z, i) and the confidence
Next, we build the set of all plausible MDPs M such that: (i) the mean reward function
the mean transition function
and (iii) the global Lipschitz constant (of future value functions)
. Finally, we select an optimistic policy
for the family of MDPs
in the sense that
, where
is the initial state and
is the most optimistic realization from
, and execute
for the entire episode. The pseudo-code of kernel-UCRL is given in Algorithm 1.
A. Regret Bound of Kernel-UCRL
Let , where
denotes the vector of kernel evaluations between z and elements of
, and
denotes the kernel matrix computed at
.
and
. Then, with probability at least
, uniformly over all
and
, the following holds:
, where
denotes the vector of reward noise variables. Now see that
. Then by [30, Theorem 1], for any
, with probability at least
, uniformly over all
and
,
Further, from [31, Theorem 1], with probability at least , uniformly over all
and
, we have
and
. Then, from [32, Equation 25], we have
(8) Now the result follows by combining 7 and 8, and applying an union bound.
where
denotes the vector of kernel evaluations between (z, i) and elements of
, and
denotes the kernel matrix computed at
.
and
. Then, with probability at least
, uniformly over all
and
, the following holds:
where L is a known upper bound over denotes the diameter of
By construction of the set of MDPs , it follows that when the events
and
are true for all episodes
, the unknown MDP
lies in
. Thus
, since
is the most optimistic MDP of
. This implies
Now by the reproducing property of RKHS and CauchySchwartz inequality for all
. Thus 9, [24, Lemma 7] and [24, Lemma 9] together imply that for any
, with probability at least
,
(10) Now, by triangle inequality . Therefore, when the event
is true,
since the mean reward functionlies in the confidence set
. Similarly when the event
is true,
where we have used the fact that both and
are non-decreasing with the number of episodes l and that
by construction of
(and since
). Now the result follows from [24, Lemma 11] by noting that
and
.
B. Interpretation of the Bound
Theorem 1 implies that the cumulative regret of KernelUCRL after episodes is
with high prob-
ability. ( hides logarithmic factors.) Now we illustrate the growth of
and
as functions of T with the following concrete examples.
Let , i.e.,
is an additive kernel of
and
. Then, from [33, Theorem 3],
2 ln T . Now if both
are compact and convex sets, and both
are Squared Exponential (SE) kernels, then from [3, Theorem 4],
and
. Hence, in this case
.
Further, let , i.e.,
is a product kernel of
and
R. Then, from [33, Theorem 2],
m ln(mT ), since all kernel matrices over any subset of {1, . . . , m} have rank at most m. Now if
is a Squared Exponential kernel on Z, then
. Hence, in this case
.
In essence, and
grow sublinearly with T for some popular kernels, e.g. Squared Exponential, polynomial and Mat´ern. Now, since the cumulative regret of KernelUCRL scales linearly with
and
, it, in turn, grows sublinearly with T for these kernels.
Optimizing for an optimistic policy is not computationally tractable in general, even though planning for the optimal policy is possible for a given MDP. A popular approach to overcome this difficulty is to sample a random MDP at every episode and solve for its optimal policy, called posterior sampling [34].
PSRL (Algorithm 2), in its most general form, starts with a prior distribution over MDPs. At the beginning of episode l, using the history of observations
, it updates the posterior
and samples an MDP
from
. (Sampling can be done using MCMC methods even if
doesn’t admit any closed form.) It then selects an optimal policy
of the sampled MDP
, in the sense that
for all
and for all h = 1, 2, . . ., H, and executes
for the entire episode.
[34] show that if we have a frequentist regret bound for UCRL in hand, then we can obtain a similar bound (upto a constant factor) on the Bayes regret, defined as the expected regret under the prior distribution , of PSRL. We use this idea to obtain a sublinear bound on the Bayes regret of PSRL under kernel approximation.
Theorem 2 (Bayes regret of PSRL under Nystr¨om approximation): Let the assumptions in Section IV hold and be a (known) prior distribution over MDPs
. Then, the Bayes regret of PSRL satisfies
Any MDP M whose mean reward function satisfies for some
such that
, and mean transition func-
tion satisfies , for some
such that
, lies in the set
. However, there might be other MDPs in
which do not posses this linear structure. Therefore, optimal planning may be computationally intractable even for a single MDP. So it is common in the literature to assume access to an approximate MDP planner
which returns an
-optimal policy for M. Given such a planner
, if it is possible to obtain (through extended value iteration [7] or otherwise) an efficient planner
which returns an
-optimal policy for the most optimistic MDP from a family M, then we modify PSRL and Kernel-UCRL to choose
and
respectively at every episode l. It follows that this adds only an
factor in the respective regret bounds. The design of such approximate planners for continuous state and action spaces remains a subject of active research, whereas our focus in this work is on the statistical efficiency of the online learning problem.
In this section we will assume that , i.e., m = 1 and
for some n = O(1). Also we will assume that
and
are Squared Exponential (SE) kernels defined over
with length scale parameters
and
, respectively.
A. Kernel-UCRL under QFF Approximation
First, we choose an such that
and
for some ap- propriate constants
and
. Then we set
and construct
dimensional feature map
using 4. Similarly, we choose an
such that
and
for some appropriate constants
and
and construct the feature map
of dimension
. Now we construct confidence sets
and
as follows. First, we compute
, where
,
and
. Then we fix any
and define
to be the set of all functions
such that
Similarly, we compute , where
,
and
. Now, we define
to be the set of all functions
such that
Next, following the same approach as before, we build the set of all plausible MDPs, choose an optimistic policy
for
and execute
for the entire episode.
B. Regret Bound under QFF Approximation
(0, 1). Then, with probability at least , uniformly over all
and
,
, where
and
for every episode l. Now define
. Then from [32, Equation 7], we have
. Let
. Then from [32, Lemma 15],
From [25, Theorem 1]
, since n = O(1). If
, then
. Further if
, then
and thus, in turn,
O(1/T ) for each
. Now the result follows by combining 7 and 13 using the triangle inequality.
(0, 1). Then, with probability at least , uniformly over all
and
,
Theorem 3 (Regret bound for Kernel-UCRL): Let the assumptions in Section IV hold. Further, let m = 1, n = O(1) and are SE kernels on
with length scale parameters
, respectively. Then, for any
(0, 1), Kernel-UCRL with QFF approximation enjoys, with probability at least
, the regret bound
:=
) +
:=
) +
When the events and
are true for all episodes
, then using a similar approach as in the proof of Theorem 1 we can show that for any
, with probability at least
,
(14) Also for every episode l the following holds:
By our choice of and
, Lemmas 3 and 4 together imply that the events
and
are true for all episodes
. Further, since
and
and
are non-decreasing functions in l. Now, combining 14 and 15, and applying a union bound we have, with probability at least
, that
From [24, Lemma 11], and
, since
O(d ln t) for any linear kernel k defined over
. Now the result follows by noting that
and
for n = O(1).
Sayak Ray Chowdhury is supported by a Google PhD Fellowship. Aditya Gopalan is grateful for support from the DST INSPIRE faculty grant IFA13- ENG-69.
[1] A. L. Strehl, L. Li, and M. L. Littman, “Reinforcement learning in finite MDPs: PAC analysis,” J. Mach. Learn. Res., vol. 10, pp. 2413– 2444, Dec. 2009.
[2] I. Osband and B. Van Roy, “Model-based reinforcement learning and the eluder dimension,” in Advances in Neural Information Processing Systems, 2014, pp. 1466–1474.
[3] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, “Gaussian process optimization in the bandit setting: No regret and experimental design,” arXiv preprint arXiv:0912.3995, 2009.
[4] S. R. Chowdhury and A. Gopalan, “On kernelized multi-armed bandits,” in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 844–853.
[5] A. Durand, O.-A. Maillard, and J. Pineau, “Streaming kernel regression with provably adaptive mean, variance, and regularization,” arXiv preprint arXiv:1708.00768, 2017.
[6] M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini, “Finite-time analysis of kernelised contextual bandits,” arXiv preprint arXiv:1309.6869, 2013.
[7] T. Jaksch, R. Ortner, and P. Auer, “Near-optimal regret bounds for reinforcement learning,” Journal of Machine Learning Research, vol. 11, no. Apr, pp. 1563–1600, 2010.
[8] I. Osband, D. Russo, and B. Van Roy, “(More) efficient reinforcement learning via posterior sampling,” in Advances in Neural Information Processing Systems, 2013, pp. 3003–3011.
[9] Y. Ouyang, M. Gagrani, A. Nayyar, and R. Jain, “Learning unknown markov decision processes: A thompson sampling approach,” in Advances in Neural Information Processing Systems, 2017, pp. 1333– 1342.
[10] Y. Abbasi-Yadkori and C. Szepesv´ari, “Regret bounds for the adaptive control of linear quadratic systems,” in Proceedings of the 24th Annual Conference on Learning Theory, 2011, pp. 1–26.
[11] ——, “Bayesian optimal control of smoothly parameterized systems.” in UAI. Citeseer, 2015, pp. 1–11.
[12] M. Ibrahimi, A. Javanmard, and B. V. Roy, “Efficient reinforcement learning for high dimensional linear quadratic systems,” in Advances in Neural Information Processing Systems, 2012, pp. 2636–2644.
[13] M. Abeille and A. Lazaric, “Thompson sampling for linear-quadratic control problems,” arXiv preprint arXiv:1703.08972, 2017.
[14] A. Gopalan and S. Mannor, “Thompson sampling for learning parameterized markov decision processes,” in Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, 2015, pp. 861–898.
[15] S. Agrawal and R. Jia, “Optimistic posterior sampling for reinforcement learning: worst-case regret bounds,” in Advances in Neural Information Processing Systems, 2017, pp. 1184–1194.
[16] R. Ortner and D. Ryabko, “Online regret bounds for undiscounted continuous reinforcement learning,” in Advances in Neural Information Processing Systems, 2012, pp. 1763–1771.
[17] K. Lakshmanan, R. Ortner, and D. Ryabko, “Improved regret bounds for undiscounted continuous reinforcement learning,” in International Conference on Machine Learning, 2015, pp. 524–532.
[18] M. Turchetta, F. Berkenkamp, and A. Krause, “Safe exploration in finite markov decision processes with gaussian processes,” in Advances in Neural Information Processing Systems, 2016, pp. 4312–4320.
[19] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Advances in Neural Information Processing Systems, 2017, pp. 908– 919.
[20] M. Deisenroth and C. E. Rasmussen, “Pilco: A model-based and data-efficient approach to policy search,” in Proceedings of the 28th International Conference on machine learning (ICML-11), 2011, pp. 465–472.
[21] T. Jung and P. Stone, “Gaussian processes for sample efficient reinforcement learning with rmax-like exploration,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2010, pp. 601–616.
[22] R. Grande, T. Walsh, and J. How, “Sample efficient reinforcement learning with gaussian processes,” in International Conference on Machine Learning, 2014, pp. 1332–1340.
[23] R. C. Grande, “Computationally efficient gaussian process changepoint detection and regression,” Ph.D. dissertation, Massachusetts Institute of Technology, 2014.
[24] S. R. Chowdhury and A. Gopalan, “Online learning in kernelized markov decision processes,” in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 3197–3205.
[25] M. Mutny and A. Krause, “Efficient high dimensional bayesian optimization with additivity and quadrature fourier features,” in Advances in Neural Information Processing Systems, 2018, pp. 9005–9016.
[26] T. Yang, Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou, “Nystr¨om method vs random fourier features: A theoretical and empirical comparison,” in Advances in neural information processing systems, 2012, pp. 476–484.
[27] S. Bochner, Lectures on Fourier integrals. Princeton University Press, 1959.
[28] F. B. Hildebrand, Introduction to numerical analysis. Courier Corporation, 1987.
[29] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in Advances in neural information processing systems, 2008, pp. 1177–1184.
[30] Y. Abbasi-Yadkori, D. P´al, and C. Szepesv´ari, “Improved algorithms for linear stochastic bandits,” in Advances in Neural Information Processing Systems, 2011, pp. 2312–2320.
[31] D. Calandriello, L. Carratino, A. Lazaric, M. Valko, and L. Rosasco, “Gaussian process optimization with adaptive sketching: Scalable and no regret,” In Conference on Learning Theory, 2019.
[32] S. R. Chowdhury and A. Gopalan, “Bayesian optimization under heavy-tailed payoffs,” arXiv preprint arXiv:1909.07040, 2019.
[33] A. Krause and C. S. Ong, “Contextual gaussian process bandit optimization,” in Advances in Neural Information Processing Systems, 2011, pp. 2447–2455.
[34] I. Osband and B. Van Roy, “Why is posterior sampling better than optimism for reinforcement learning?” arXiv preprint arXiv:1607.00215, 2016.