In the Reinforcement Learning (RL) paradigm, an agent interacts with an unknown environment by taking actions and collecting rewards. Throughout this process, the agent strives to maximize its total expected reward, or value Q. The optimal value can be recovered with Bellman’s equation [1], whereby observed rewards are used to update estimates of Q through the unbiased, recursive relation
This applies as the agent undergoes a random transition from
Bellman’s equation motivates many methods for finding the optimal value. Among the most data-efficient are the class of Gaussian Process methods, which replace sample-intensive estimation schemes with a Bayesian non-parametric estimator, based on Gaussian Process regression [2, 3, 4, 5, 6, 7, 8] . These methods have shown to yield state-of-the-art empirical performance in their respective domains, such as model-based and model-free learning.
In this paper, we consider the class of Gaussian Process methods for model-free temporal difference learning [3, 7]. Specifically, we target the Sparse Pseudo-input Gaussian Process SARSA (SPGP- SARSA) method [7] to improve its online viability with a procedure to perform recursive updates. By extending SPGP-SARSA this way, its functionality then covers the full scope of benefits captured by its predecessor, GP-SARSA [3]; namely, previous results can be reused to compute matrix inverses.
TD algorithms recover the latent value function with data gathered in the standard RL fashion: at each step, the robot selects an action based on its current state
, after which it transitions to the next state
and collects a reward
. The repeated interaction is described as a Markov Decision Process,
, associated with the transition distribution
stationary policy
, and discount factor
. As the name suggests, TD algorithms update a running estimate of the value function to minimize its error difference from the Bellman estimate:
being the observed reward. Once the estimate converges, an agent can select actions from the greedy policy
, such that
The Gaussian Process Temporal Difference (GPTD) framework improves upon the data efficiency of frequentist TD estimation by departing from the contractive nature of Bellman’s equation, in favor of a convergence driven by non-parametric Bayesian regression. The data model is based on the random return , expressed as a sum of its mean, Q(x), and zero-mean
Figure 1: Visualizing GP- and SPGP-SARSA posteriors: The exact GP-SARSA posterior (left) is supported with all the training data (dots). The SPGP-SARSA posterior (center) uses randomlyinitialized pseudo inputs (diamonds). After adjusting the pseudo inputs to maximize likelihood of training data, the posterior (right) is nearly identical to the exact model.
residual, . Model inputs are state-action vectors
, and value differences are used to describe the observation process:
Moving forward, we assume that noise levels, , are i.i.d random variables with constant parameters,
. Under this assumption, transitions exhibit no serial correlation, and the SPGP-SARSA model is valid.
Given a time-indexed sequence of transitions , the GP-SARSA model stacks variables into vectors to obtain the complete data model:
and . Notice the commonality Equation 2 has with a standard GP likelihood model,
. Both models assume the outputs,
, are noisy observations of a latent function,
. What distingushes TD estimation is the presence of value correlations, imposed from Bellman’s equation and encoded as temporal difference coefficients in H. Used for exact GP regression, Equation 2 leads to the GP-SARSA algorithm: a non-parametric Bayesian method for recovering latent values [3].
As a Bayesian method, GP-SARSA computes a predictive posterior over the latent values by conditioning on observed rewards. The corresponding mean and variance are used for policy evaluation:
Here, is the covariance matrix with elements
, and
. Subscripts denote dimensionality, e.g.
The GP-SARSA method requires an expensive matrix inversion, costing
. To improve computational efficiency, SPGP-SARSA algorithm applies the Sparse Pseudo-input Approximation [9]. Sparsity is induced in the standard data model (Equation 2) by expanding the probability space with
additional pseudo values, u. The corresponding pseudo inputs,
, act as parameters on the support of the predictive posterior. These extra latent variables obey the same data model as q, but are predetermined, and thus, exhibit no noise. By conditioning q upon u and Z, the predictive probability space collapses such that all dense matrix inversions are of rank M. This algorithm is called Sparse Pseudo-input Gaussian Process SARSA (SPGP-SARSA) [7].
The SPGP-SARSA predictive posterior is Gaussian, , with parameter functions
The terms in Equation 4 that do not depend on the input effectively parameterize the posterior. We denote these parameters as
Here we adopt a new notation that allows us to index updates associated with each input variable, x and z. We drop the v and u designations on matrices in favor of t and k, which respectively denote the update index of x and z. A detailed breakdown of the notation is given below
• t: the unique index for the inputs x. • k: the unique index for the pseudo inputs z • . Here, the array covers the range
, and
is the common argument to all elements.
• covariance matrix of pseudo input evaluations
•
covariance matrix of input evaluations
This new notation suggests that SPGP-SARSA can be used on two timescales. There is the scale, t, associated with state transitions, and the scale k, associated with adding new pseudo inputs. Although we do not elaborate on when and how to apply multi-timescale updates, we believe this constitutes the subject of interesting future work.
There are four distinct modalities in which SPGP-SARSA can be updated:
Here we consider the third and fourth cases, when both the transition training set and the pseudo set can grow. We decompose the predictive moments into partitioned matrices and apply the partitioned matrix inversion lemma to derive a recursive algorithm for their updates.
4.1 Partitioned Matrix Inversion Lemma
Let symmetric positive definite matrix whose partition is
4.2 Partitioning the Fundamental Matrices
For deterministic transitions, the noise matrix is
The Bellman matrix partition is
Here,
Here, we define
Here we define
This partition is special, because it can be updated in three ways. We may include a new input pseudo input
, or both. Here, x includes the rows with index t, and the pseudo inputs, z, include the columns with index k. We define
, and
. When
is new, we must compute
and
. Similarly, when
is new, we must compute
and
. When both variables are new, the only element we may reuse is
This concludes our analysis of the fundamental matrices. Next we turn our attention to the compound matrices, which are products of those described above.
4.3 Partitioning Compound Matrices
Here we define . We also use
to denote the arguments distributed according to a binomial:
Three matrices must be inverted. They are:
• A
• •
5.1 Inverting Kkk
By the matrix inversion lemma, we have
Here, we have defined:
5.2 Computing Btt
To start, we compute the composite matrix
where we have defined the following elements:
Now we add all the terms and take the diagonal:
5.3 Computing Ckk
Here we define:
Adding the terms produces the partitioned
By the matrix inversion lemma, we have:
5.4 Recursive Parameters
We may now derive recursions for the parameters
On-line updates are possible by unrolling the recursion, starting from basic posterior parameters.
In this paper we derived formulas for updating the SPGP-SARSA algorithm recursively. This allows previous computations to be reused and promotes greater memory efficiency than computing matrix inverses from scratch at each iteration. Promising future work will explore the best practices for adding pseudo inputs and performing experiments to quantify the benefits of the recursive approach.
[1] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1 edition, 1957.
[2] M. Deisenroth and C. Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In In Proceedings of the International Conference on Machine Learning (ICML), 2011.
[3] Y. Engel, S. Mannor, and R. Meir. Bayes meets bellman: The gaussian process approach to temporal difference learning. In Proceedings of the 20th International Conference on Machine Learning (ICML), 2003.
[4] M. Ghavamzadeh, Y. Engel, and M. Valko. Bayesian policy gradient and actor-critic algorithms. Journal of Machine Learning Research, 2016.
[5] J. Martin and B. Englot. Extending model-based policy gradients for robots in heteroscedastic environments. In 1st Annual Conference on Robot Learning, 2017.
[6] Y. Engel, P. Szabo, and D. Volkinshtein. Learning to control an octopus arm with gaussian process temporal difference methods. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.
[7] J. Martin, J. Wang, and B. Englot. Sparse gaussian process temporal difference learning for marine robot navigation. In Conference on Robot Learning (CoRL), 2018.
[8] C. Rasmussen and M. Kuss. Gaussian processes in reinforcement learning. In Advances in Neural Information Processing Systems 16, 2004.
[9] E. Snelson and Z. Ghahramani. Sparse gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, 2006.