Properties of the Least Squares Temporal Difference learning algorithm

2013·arXiv

Abstract

1 Prior Work on LSTD

An exhaustive introduction to least-squares methods for Reinforcement Learning is provided in chapter 6 of Bertsekas’ monograph [6]. The LSTD algorithm was introduced in the paper by Bradtke and Barto [8]. Boyan later extended to the case with eligibility traces [7], wherean additional parameter controls how far back the updated are influenced by previous states. The connection between LSTD and LSPE, as well as a clean-cut proof that the on-line version of LSTD converges, was given by Nedić and Bertsekas [21]. The seminal paper [28] by Tsitsiklis and Van Roy provided an explicit connection between the fix-point of the iterative TD algorithm and the LSTD solution, while also formally proving that the TD algorithm for policy evaluation converges. The paper [3], described the Bellman Residual Minimization procedure as an alternative to TD. Antos’ paper [1] provided an extensive comparison on the similarities and differences between LSTD and Bellman Residual Minimization (BRM). Parr’s paper [16] introduced the LSPI algorithm as a principled way to combine LSTD with control. The paper by Munos [20] introduced bounds for policy iteration with linear function approximation, albeit under strong assumptions. Scherrer provided [24] the geometric interpretation of LSTD as an oblique projection, in the context analysing the differences between LSTD and BRM. The paper [14] represents an early approach to automatically constructing features for RL algorithms, including LSTD. Schoknecht gave [25] an interpretation of LSTD and other algorithms in terms of a projection with respect to a certain inner product. Choi and Van Roy [9] discuss the similarities between LSTD and a version of the Kalman filter. There exist various approaches in literature to how LSTD can be regularized, none of which can be conclusively claimed to outperform the others. These include the L1 approaches of [18] and [15] and the nested approach of [13]. These approaches differ not just in the what regularization term in used, but they solve different optimization problems (we will discuss this in section 5).

2 Deﬁnition of LSTD

2.1 Notation

The LSTD algorithm finds the value function of a finite-state Markov Reward Process (MRP). The MRP is fixed, i.e. we only consider the on-policy setting. We only have access to linear features of states and to the obtained rewards. More formally, denote as P the transition matrix of the MRP. For each state s we have a feature row-vector . The feature design matrix gives the features of all states of the MRP, row-wise, where we assume that has independent columns. We use the vector R, the i-th element of which contains mean reward obtained while leaving the state i. We use to denote a left eigenvector of P corresponding to eigenvalue one. Note that if the chain has a stationary distribution, it will correspond to such an eigenvector, but we do not require it. We will assume that the chain only has one recurrent class since the case where we have many classes complicates the notation without contributing to the main argument (in practice, we can typically assume there is one class if we do enough exploration). We also introduce the matrix . We now define expectations of functions of the Markov process in terms of weighted averages. For example the expectation of , is defined by , and similarly for other functions. By this we mean that if P is ergodic, it is legitimate to consider the above quantity an expectation corresponding to long-time average by the standard ergodic theorem for Markov chains. But in our application it is convenient to be more general and allow for periodicity, i.e. the diagonal of may not be a stationary distribution, but the expression still matches the long-time average. We use subscripts do denote two-step sampling, for example denotes the fact that we first sample a state, then the successor state and obtain the feature of that successor state. When we write an expectation w.r.t. such a variable, for example , the distribution we mean for r is . Part of our present derivations depends on treating some qualities as random variables; we use small letters to denote them, for instance s denotes state and denotes feature. Once we have obtained samples from our process, we store them in matrices and , whose i-th rows correspond to, respectively, the state feature vector and reward obtained at time i. Observe the difference between and – in the first one, each state is represented once, in the second one the number of rows corresponds to the trajectory taken in the MRP and repetitions are possible. The value function is discounted with the factor . Moreover, we introduce the square matrix D which has ones on the main diagonal and on the diagonal above it. It is the sample based equivalent to the operator .

2.2 The linear dynamical system approach

The derivation given in this section is based on [23]. We begin by constructing a MRP which lives in the space of features instead of our original state space. We limit ourselves to the class of linear dynamical systems. We need to define the matrix and the vector , so that a transition from to (row vectors) is modelled by , and the reward we expect at is modelled by . Now we look for the values for and that model our system dynamics. We have that should be approximately equal to and to r. We weigh states by , giving the following optimization problems.

These optimization problems correspond to ordinary least squares (generalized to matrices in case of ) and the solutions are obtained by weighted projection: and , where the projection matrix is defined as and the matrix cancels with the one in the projection, since it is full column rank. Now consider a feature vector . In the new approximate MRP, we can compute the value function exactly (i.e. all the approximation has already taken place when we constructed the matrix and vector ). The true value function associated with it is the expected discounted future reward, and is expressed as follows.

a stronger condition, namely that the series converges, which is the same as saying that is a contraction in some norm.

This follows from the following reasoning. Consider first the case when we have . We know that is a contraction in the norm weighted by i.e. (see for example [5], proposition 6.3.1). Therefore the spectral radius of is bounded by one. Define the matrix , so that we have and . Using the assumption that has independent columns, it is easy to see that if v is an eigenvector of then is an eigenvector of with the same eigenvalue. Hence all eigenvalues of are also eigenvalues of and . Now if we have some zero entries on the diagonal of , our results follows by a continuity argument. Thus we have the result for the general case.

Note that in the above proof we used the fact that has on the diagonal is a left eigenvector of P corresponding to eigenvalue one. If used for the projection were an arbitrary distribution, then the matrix would in general have spectrum beyond the unit circle. For example, consider the following.

which is more than one.

2.3 Interpretation in terms of expectations

We now give the interpretation of the matrices and in terms of expectations, which are useful when constructing sample-based versions of the algorithm.

We also can construct sample-based variants of the matrices and (call them and respectively) and still obtain essentially the same algorithm. Let us adopt the following definitions.

In the above, we denote by N the matrix which has ones above the main diagonal and zeros elsewhere.

We note that the required inverse exists by the assumption that is invertible (we also implicitly assume that we have enough samples). Furthermore, we note that the transition model and the reward model are independent, i.e. our solution is also applicable to the setting where we have a single transition matrix P, but instead of just one reward we have many tasks, each of which with a different reward [17]. We note that in this setting, while we still have to learn for each task separately, it is worthwhile to learn using training data from all tasks.

3 Other Ways to obtain LSTD

We begin with the Bellman equation, which defines the true value function: . It can be rewritten in matrix form as . We exploit a linear architecture: i.e. we seek to approximate the true value function with the function , which is linear in w. We will briefly discuss two possibilities for how to choose an appropriate within the linear class of functions. The obvious thing would be to define , where where we assume that the inverse exists. This formula guarantees that the distance from to V is minimal in the elliptic norm weighted by . The problem with this approach is that it is not known how to efficiently compute a useful estimate of the projected value function from samples.Therefore we need a different approximation. We call it . It comes through the equation , where we look for the fixpoint of the operator instead of the Bellman operator T , where . Our choice of will be motivated further later (in particular, see equation 7).

Now the question, of course, is about the relation between our approximation and the projection of the true value function , as we have defined it in the previous section. We now state without proof the relation between the two estimates developed in [30] (see their references for prior work).

This can be used to obtain the following bound, which does not require us to estimate the matrices or P (see [30] for proof and for sharper bounds): .

We see from this that one example where the approximation of equation 10 is appropriate is when we have substantial discounting – in that case, if the linear framework is good at all, i.e. if the projection is close to the true value function, then so will be our approximation. We emphasise here that our derivation is for the case where there is one recursive class in the MRP. If there are other classes, this bound tells us nothing about them (i.e. using this bound only, we have to accept the value function at the states belonging to them may be arbitrarily off the mark).

In the subsequent sections, we will describe various seemingly different approaches to computing from samples, which however all lead to the same formula for the solution we have already seen in section 2.2. In order to derive our algorithm, we make two assumptions. First, we assume the following. We call this the feature independence assumption.

This implies that the features are linearly independent (i.e. is of full column rank) but the statement is stronger in that it concerns both the features and the transition dynamics of the MRP, and means that the parts of the features corresponding to states visited with nonzero probability are independent. We note that this implies that the matrix is also full rank – we discuss why this implication holds in appendix A. We will also use this to claim the invertability of without further comment (i.e. we assume we have enough samples). Also, we assume that the mean of the reward process exists.

The second assumption is rarely a problem because in typical applications the reward is bounded by some constant.

To summarize the description, we restate the fundamental conditions for LSTD to yield good value estimates: (1) the linear architecture itself needs to match the problem and the set of features needs to be set right, that is V must be close to , (2) the approximation needs to be good, for example through discounting and finally (3) the sample based approximation to w must also be good (in the following sections we define a consistent estimator for w, i.e. a way to compute , so that the value function computed from a sample trajectory approaches for the recursive states in the class corresponding to that trajectory as the length of the trajectory goes to infinity).

3.1 Derivation by the Galerkin method

That LSTD corresponds to a special case of the Galerkin argument has been implicitly realized for some time, and formally stated in [4], on which this section is based. The general idea of the Galerkin method is to approximate the fixed point of . We have . We introduce the approximation by considering points from within the column space of , so that our approximate fixpoint satisfies Range, yielding , which is equivalent to the following, after substituting for and for x and using the semi-norm weighted by .

Now, for our semi-norm with the corresponding projection operator , this has an analytic solution: . Now, in our case, where we note that the inverse is well-defined by assumption 4 and the evaluation of the operator T at becomes with w assuming the role of . Now we solve the following.

This can be transformed in the following way.

In the above, we can cancel out the terms , because by assumption 4, has to be of full column rank. We then multiply both sides by , to obtain, which leads to the following.

This is the same as the expression we will obtain in the instrumental variable section. We also see that equation 8 is the same as the formula obtained from the linear dynamical system approach in equation 2 when we plug in the computed values of and . Thus we have obtained the same estimator.

3.2 Derivation by instrumental variables

Again, we begin with the Bellman equation, which defines the true value function: . We will first obtain a statistical model that expresses the properties of the approximation . By solving the Bellman equation directly in the linear approximation regime, we obtain the following equation.

We note that in the above, we use the convention that w is a column vector while the features are row vectors. This convention minimizes the number of transposes we have to write. Note that we had to introduce the TD error vector and the corresponding random variable (i.e. the error is a deterministic function of the current state, which is random), since the sum of the reward vector R and the expected feature vector may not be in the feature space (i.e. the column space of ). It can be verified using equation 9 that, the error terms satisfy , i.e. it is orthogonal to the feature space (indeed it can be seen after a brief manipulation that the condition is equivalent to the formula 9 – we will do this in section 3.4), and that consequently we have the following.

This is not a derivation from first principles, since we had to use an external argument to verify that (which is equivalent to assuming that the TD error vanishes in expectation). But given the model of equation 10 it is nonetheless instructive to look at the mechanics of how the derivation works because this is the first one to have been proposed for LSTD.

We now accept equation 10 as a given and give a statistical derivation as provided in the original LSTD paper [8], based on methods described in [29]. Now, because we do not observe the expectations and E[r |s] in equation 10, but merely samples of and we model the residue wrt. the expected value as noise, yielding the probabilistic model , where we use assumption given by equation 5, and . Note that by definition . Observe that this implies the following by the law of iterated expectation (LIE).

Analogously, we have the following.

Thus we can rewrite equation 10 to obtain the following.

Now, we cannot use traditional least-squares to solve this, since the expression may be, in general, correlatedwith , so will be the projection error term e and the two correlations will not cancel in general. Therefore the noise term may be correlated with with . Also, is not necessarily zero. But ordinary least squares (OLS) requires that noise be uncorrelated with input variables and that it have mean zero to yield consistent estimates. However, there is still a way to obtain a good estimate. More formally, we first need to establish the following properties. First, we have , where the first equality follows from LIE and the second from fact 12. By the same reasoning, we have from fact 13. With these two properties, we can now multiply both sides of equation 14 by , which we for this purpose call an instrumental variable, and then take expectation, so as to make the noise terms vanish. We also have by fact 11. This results in the following.

Now because we know by assumption 4 (see section A of the appendix for a detailed proof) that is invertible, the estimator w is given by the following.

This finishes the formal derivation. We will now give two different intuitive interpretations to the instrumental variable method. First, consider the sample equivalent of equation 14, which we now rewrite in matrix notation , where by we denote the vector containing the noise terms for each individual sample and by the sample values of the random variable . Now, as described above, we cannot solve it by OLS because of the correlation between the noise and . So we ‘fix’ by projecting it onto the feature space (i.e. the column space of ), since we know that noise is uncorrelated with features. We introduce the projection operator , where we note that the inverse exists by assumption given we have enough samples. Now our equation becomes the following.

In the above, we can cancel the terms because has, by assumption , independent columns if we have enough samples. This leads to the same estimator that we derived above. This interpretation is known in econometric literature as two-stage least squares (2SLS), because we solve two linear systems: first we project on the subspace of features and then we solve the resulting modified equation. In this context we stress that we would get the same solution if we only applied the projection on the right-hand side, e.g. – this can be seen by noticing that the choice of in this equation is unaffected by any component of orthogonal to the feature space. We also see the direct correspondence between this and the projection step in the derivation through Galerkin method – the equation 7 is essentially the limiting version of the sample-based equation 17.

3.3 The geometry of instrumental variables

There is one more way to interpret the instrumental variable approach. Observe that the equation , can be rewritten as . Thus we have that applying the projection amounts to solving under the constraint that the projection of the residual on the feature space is zero. Therefore LSTD yields the same solution as applying the oblique projection of the rewards on the difference between values of successive states (i.e. ), along the subspace orthogonal to the column space of (which is the left null-space of ). See also figure 1.

Recall the formula for the coefficients of the oblique projection on the columns space of X orthogonal to the column space of Y , which is . The corresponding generalized pseudoinverse of X is . It is easy to verify that putting and into recovers the LSTD solution. Notice that in this case, the projected vector, corresponds to obtaining the ‘smoothed rewards’ corresponding to the approximate value function (i.e. , or what the rewards would have been if there had been no approximation of the value function). Now there is also a different way of defining the projection, namely we can project not the reward vector but the true value function [24]. In this case, setting and again produces the LSTD solution w (note that now, we are projecting the true value function, not the rewards). Notice that in this case the projected vector corresponds to the approximate value function.

Notice that formally speaking, in both the interpretation as a projection of the reward vector and the value function, we also need another condition to call LSTD an oblique projection – in order for the formula to mean a projection on Range(X) orthogonal to Range(Y ), we need the condition that the orthogonal complement of Range(X) and Range(Y ) should be complementary subspaces. We will now claim that this is the case in either of the above ways of thinking about LSTD as a projection. To do this, we will prove the following statement. We denote by k the number of columns in (they are known to be linearly independent by assumption 4).

Figure 1: LSTD can be interpreted as an oblique projection (left) and as a fixpoint algorithm (right).

Proof. First, we note that the dimension of Rangeis k since is full column rank and the dimension of Rangeis exactly since A is invertible. The argument in the left-to-right direction is as follows: if , then there would be a vector, , which is both in Rangeand Range. Therefore these two subspaces cannot sum to the n-dimensional space if they share a common vector. This contradiction finishes the argument. The argument in the right-to-left direction is thus: there is no non-zero vector in both Rangeand Range, then because of their dimensions they have to sum to the whole space .

We now see that the condition is fulfilled in the case of LSTD because by assumption 4 the matrix , and hence also has to be invertible. In this expression, we can substitute A = I and or alternatively and to obtain either of the interpretations of LSTD as projection outlined above. We note that in either case, is full column rank by assumption 4 together with the fact in appendix A and A is invertible since P is a Markov matrix.

3.4 Connection with the iterative TD algorithm

We have seen in section 3.2 that the equality is crucial for the development of the algorithm and indeed equivalent to the obtained estimator for w (equation 9). We will now show another way of obtaining this equality – actually, it may be taken do be the definition of the algorithm, and used as a justification for the formula 9 that stands on its own. We now give the interpretation of this equation is in terms of the iterative TD algorithm [27]. We note that the equality corresponds to saying that the LSTD solution corresponds to the fixpoint of iterative TD, i.e. the point where the expected update is zero.

Consider now the definition of the iterative TD algorithm [27]. We assume for the moment that we have an oracle for the value function and are interested in iteratively solving the optimization problem using the approcimation architecture . The iterative update is given by . We now have the following formula for the iteration.

Now we have that the update at time t, is . Setting the expectation of this update to zero gives the desired formula. We also note that the relation between the TD iteration and the LSTD algorithm resembles the chicken-and-egg problem – one can either, as we did above, consider the iteration a priori knowledge and use that to justify the LSTD fixpoint, or one can start with the fixpoint and treat the iteration as a way of reaching it, motivated by stochastic optimization. LSTD can also be extended to compute the fixpoints of TD() or, more generally other similar algorithms with different traces. For details, see [10] in slightly different notation.

3.5 LSTD as minimization of a quadratic form

This section is based on [26]. It interprets LSTD as the minimization of a quadratic form in the error between the true value function and the approximated value function . We begin by reformulating the formula for the estimator obtained above.

This equality holds because and because the matrices and are invertible by assumption given by equation 4. Now, we introduce the matrix K, as below.

We note that , where the last equality follows by substituting the definition of and cancelling the inverted term. Therefore we have . But this is the solution to the well-known optimization problem: . Thus we gain an insight about approximation of equation 10 – instead of minimizing the norm , which would yield us , we minimize the different norm , thus gaining the ability of efficiently estimating the solution from samples. Note that we can also repeat the above reasoning, without the multiplication by , to obtain the matrix which also defines a valid minimization – this is the way the equivalence was originally introduced in [25].

3.6 LSTD is a subspace algorithm

In section 3.3, we have shown that the algorithm can be thought of as an oblique projection along the subspace orthogonal to the feature space. Here, we make explicit the property that LSTD only depends on the features through the subspace they span i.e. any full-rank transformation (i.e. basis change) C of features does not influence the value function. To see this, consider the sample estimate we derived in earlier sections, where we use the transformed features instead of .

As a corollary, we state that LSTD is independent of any scaling of features.

4 LSTD vs Bellman Residual Minimization

4.1 A Decompositions of the LSTD loss

We now present an interpretation of the minimization defined in equation 6, after [1]. We recall that the minimization in equation 6 can be rewritten in the following way . Therefore , or . Therefore LSTD can be seen to be equivalent to the following optimization problem.

We note that this expression has no recursion and that the minimization is guaranteed to reach the optimum value of zero. We can now rewrite the norm as follows , where the equality follows from the Pythagorean theorem and the fact that and are orthogonal vectors, with respect to the -weighted inner product, which corresponds to . We thus obtain the following formula for the LSTD solution.

We see that the LSTD algorithm minimizes a quantity which is the Bellman residual minus the reprojection error on the feature space. We discuss in section 4.2 the difference between simply minimizing the Bellman residual only and the LSTD algorithm.

Another way to interpret the LSTD loss is to see it as a nested optimization problem [11], which leads to the following two equivalent formulations. First, define the projection in the following way.

Then we plug this for the definition of in equations 18 and 19 respectively, giving the following equivalent equations.

4.2 Comparison with BRM loss

Instead of constructing the oblique projection as described in the previous sections, we can use a simpler algorithm, known as the Bellman Residual Minimization, which corresponds directly to projecting the rewards on the differences between successive states (see figure 2) – i.e. it is similar to LSTD except the projection is orthogonal, not oblique. BRM can be interpreted as the un-nested version of the optimization from the previous section.

The reason LSTD was originally introduced as an improvement over BRM [8] is that for BRM, we do not have a justification in terms of a statistical model similar to the one we had in section 3.2 – the noise terms are correlated, so we cannot use a similar reasoning to claim consistency of BRM. But of course the fact that one line of deriving an algorithm doesn’t work for BRM does not mean that the algorithm is wrong – there may be other justifications available. Interestingly, it can be shown that under our assumption 4 the two approaches are similar (the argument comes from chapter 4 of [2]). Indeed, we have from the previous section (compare equation 18) that LSTD is similar except for the presence of the projection . It is sometimes useful to have formulas that make the difference between the two algorithms explicit in different formulations of each algorithm. The algebraic relationships between the two algorithms are summarized in the table below.

There has been renewed interest in the analysis of the difference between the two algorithms. One argument [24] is that in an off-line setting (i.e. in the situation when the weighing coefficients are different from the stationary distribution of the MRP, a scenario we do not consider in this paper) a performance bound can be shown about BRM that is impossible to derive about LSTD [24]; on the other hand LSTD remains widely used in practice.

There is yet one more feature that means that LSTD is preferable to BRM is some practical cases – while with LSTD, as we have shown above, we only need one sequence of samples of features of states and a sequence of samples of reward to obtain an estimate of the value function; but with BRM we need to have two samples of the features of states.

4.3 Sample estimate of the BRM value function

We will now show a way to obtain a sample-based estimate of the BRM solution, based on section 3.1 of [19]. We want to minimize the expectation . We have the sampled features and the sampled rewards . We also have a second set of sampled features . The sampled features are produced using the following process: given the trajectory , the features in are while the features in correspond to ‘alternative’ states sampled from . In other words, the features in describe where the MRP might also have gone to given a particular previous state. Of course, such sampling is only possible if we have a model of the transition dynamics of the MRP. Now, we can write a sample-based approximation to the expectation given above as , where the notation means selecting row i of the matrix (i.e. the i-th feature in the trajectory). We can now introduce the notation and , where the colon notation denotes ranges of rows. With this notation, we have that . It can now be seen after a few rearrangements that

leaves the us with the system , where we denoted by the sample-based BRM solution.

5 Regularization

To overcome the problem of over-fitting, the standard procedure is to add a regularization term to the proposed algorithm. There are many ways of doing that. 9

Figure 2: BRM as projection of rewards (left) and minimizing the Bellman residual (right). Cmp. fig. 1

One way, proposed by [15] is to consider the optimization problem of the fixpoint equation 6. We can extend it as follows: . Here, is an external parameter of the algorithm, is the weighted norn and is the usual norm. This way of regularizing produces the well-known analytic solution . In the paper [15], a version is also given where the second norm is . In this case, because equation 6 is a fix-point equation, it is not possible to simply plug the problem into the standard LASSO algorithm, and a new algorithm is necessary (see [15] for details).

Before we continue, denote the standard -regularized solution of a system of equations Ax = b as . Denote the version with regularization as (this has no explicit analytic form as has to be computed using an algorithm, typically LASSO).

A second way of regularization, introduced in [11] is to add regularization to equation 21, giving the following optimization problem.

In the above, the latter norm may be either of or . A quick calculation shows that this is the same as regularizing the system of equations 8. This idea therefore corresponds to the solutions for each of the discussed norms.

Another way is adding regularization directly to the equation where we have already solved for w, that is, , where and . If we regularize with , this corresponds to the solutions . This (together with other versions, that do not map to LSTD), has been done in [2], where the author also derives finite-sample error bounds.

It is also possible to combine some of the above ways together, after the manner of [13], and to use other sparsifiers in place of . In [12], for instance, the Dantzig selector is employed, which leads to a considerable simplification of the optimization problem (the optimization reduces to a linear program).

A yet different approach [22] to regularization is to keep the algorithm itself unchanged and instead do feature selection beforehand. Even if the feature selection algorithm is very simple (greedy based on correlation with residual), simulations [22] suggest that doing feature selection leads to performance essentially the same as the approaches described above. Because greedy feature selection is so simple, this suggests that regularization of LSTD is not yet really a fully solved problem.

A property of all the above regularizers is that we lose the invariance of the algorithm w.r.t. the choice of basis for the feature space, which can be seen as a natural characteristic of LSTD. It is not clear whether the property would be worth preserving in a regularized version – sparsity by its very nature is not invariant to transformations of features, even linear ones and there is a general tendency that a more specialized algorithm will have less generic properties.

6 The Episodic version of LSTD

In the other sections of this paper, we have considered the case where the MRP never terminates and convergence is defined by taking the limit with respect to the length of a trajectory. We are now interested in extending our observations to the case where there is a termination state. The limit will now be the with respect to the number of episodes being accumulated. First, let us note that the formula is still valid in this case. We simply have to give new meaning to the expectation terms.

We will now start by giving a design-based variant for the algorithm. All transitions in a terminating MRP can be described using a rectangular matrix , where the last column is meant to denote termination. We assume in the following that the starting state of the MRP is the first state. We also assume that the matrix is such that the MRP will always eventually terminate. We first need to construct a state distribution . To do this, we append the row [1, 0 . . . 0] to the matrix , producing the square matrix , which assumes that the MRP restarts after reaching the termination state. Now, the diagonal entries of the matrix are the entries of the left eigenvector of which corresponds to eigenvalue one. Now we also construct another square matrix, P, which we obtain by appending the row [0 . . . 0, 1] to the matrix . This matrix assumes that the agent stays in the termination state forever. The intuition behind this is the following: the matrix P describes the true dynamics of the MRP, but in order to have a meaningful state distribution we need to take into account the fact that we have multiple episodes – hence the definition of the matrix , which models restart. Having defined the above matrices, we may use the standard formula in the following way.

Here, we assume that the last feature vector (i.e. the one corresponding to the state modelling termination) is zero. By definition, the final element of R is also zero. It can be seen that the sample-based variant is the same as in the case of one long trajectory, except for the additional summation over the episodes. We note we use here the fact that the termination state has the feature of zero (so that we can still use the matrix D – there is no subtraction in the last row, but it doesn’t matter since the last state is the terminal state). The formula looks as follows, where the sum goes over episodes.

7 Summary of Contributions

We have provided a detailed survey of the different ways in which LSTD can be obtained. Our derivation of LSTD using instrumental variables, is, to our knowledge, the first one which is correct. We also made explicit and formal an argument concerning the invertability of the matrix that appears in the LSTD solution (see appendix A). Moreover, we have derived geometric interpretations of the LSTD fixpoint (independently of the work of Scherrer [24], which we only became aware of afterwards). We also provided an exhaustive comparison with the BRM algorithm as well as surveyed the methods that can be used to regularize the LSTD solution. Finally, we formally described the episodic version of LSTD, which was already implicitly known before, but not formalized.

A Proof of a fact about equation 4 for LSTD.

Proof. We rewrite the statement in matrix form: implies . We will now develop the second expression. By the well-known eigenvalue argument, is invertible. Assume for the moment (we will deal with the case when this is not true later). Consider some non-zero vector x. We use the assumption to state that have that if and only if the vector , which in the column space of satisfies the condition that is orthogonal to the column space of . This implies that . This holds if and only if . Now because the matrix defining this quadratic form is symmetric, and thus diagonalizable and with real eigenvalues, we have that this can only be zero if some of the eigenvalues are nonpositive. We will show that this cannot be the case. Rewrite the matrix

as . Now because by definition where , we have that (where by we denote the vector of all ones); moreover, has positive entries. So it is a Markov matrix. Thus also is a Markov matrix. Thus, has eigenvalues in the positive real half-plane. We also know that the eigenvalues of are non-negative since it is a symmetric graph Laplacian. But we cannot have zero eigenvalues, because it would imply that also has zero eigenvalues, which we have shown is impossible. This finishes the proof for .

Now consider the case when we do not have this, i.e. some of the diagonal entries of are zero. Intuitively, the fact we prove is now obvious since transient states do not influence the values of the expectations. More formally, we can, without loss of generality assume that the states for which the probability given by the stationary distribution is zero have highest indexes (i.e. they occur at the back of matrices and ). We introduce the following notations for block minors of matrices and the vector y corresponding to the non-transient and transient states.

Note that in the above, is has to be the zero matrix – it corresponds to transitions from non-transient states to transient states. Therefore we have that implies and thus, by the reasoning for the case without transient states, has to be the zero vector. Therefore we have the fact that implies that we have the following.

We see that this implies that . But we know from our assumption that this is only possible for x = 0.

Bibliography

[1] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning.

[2] Bernardo Ávila Pires. Statistical analysis of l1-penalized linear estimation with applications. Master’s thesis, Univeristy of Alberta., 2011.

[3] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the twelfth international conference on machine learning, pages 30–37, 1995.

[4] D. Bertsekas. Temporal difference methods for general projected equations. Automatic Control, IEEE Transactions on, (99):1–1, 2011.

[5] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, chapter Approximate Dynamic Programming. 2011.

[6] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific Belmont, 2012.

[7] J.A. Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2):233–246, 2002.

[8] Steven J. Bradtke and Andrew G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996. 10.1007/BF00114723.

[9] David Choi and Benjamin Van Roy. A generalized kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems, 16(2):207–239, 2006.

[10] Kamil Ciosek. Generalizing LSTD(). Internal UCL note., 2012.

[11] Matthieu Geist and Bruno Scherrer. -penalized projected Bellman residual. In European Wrokshop on Reinforcement Learning (EWRL 11), Athens, Grèce, 2011.

[12] Matthieu Geist, Bruno Scherrer, Alessandro Lazaric, and Mohammad Ghavamzadeh. A Dantzig Selector Approach to Temporal Difference Learning. In International Conference on Machine Learning (ICML), 2012.

[13] Matthew Hoffman, Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Regularized least squares temporal difference learning with nested ; penalization. In Scott Sanner and Marcus Hutter, editors, Recent Advances in Reinforcement Learning, volume 7188 of Lecture Notes in Computer Science, pages 102–114. Springer Berlin / Heidelberg, 2012.

[14] Philipp W Keller, Shie Mannor, and Doina Precup. Automatic basis function construction for approximate dynamic program- ming and reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pages 449–456. ACM, 2006.

[15] J. Zico Kolter and Andrew Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 521–528, New York, NY, USA, 2009. ACM.

[16] Michail G Lagoudakis and Ronald Parr. Least-squares policy iteration. The Journal of Machine Learning Research, 4:1107– 1149, 2003.

[17] Guy Lever. Private communication.

[18] Manuel Loth, Manuel Davy, and Philippe Preux. Sparse temporal difference learning using lasso. In Approximate Dynamic Programming and Reinforcement Learning, 2007. ADPRL 2007. IEEE International Symposium on, pages 352–359. IEEE, 2007.

[19] O.A. Maillard, R. Munos, A. Lazaric, and M. Ghavamzadeh. Finite-sample analysis of bellman residual minimization. In Proceedings of 2nd Asian Conference on Machine Learning (ACML2010), Tokyo, Japan, November 2010.

[20] R. Munos. Error bounds for approximate policy iteration. volume 20, page 560, 2003.

[21] A Nedić and Dimitri P Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems, 13(1-2):79–110, 2003.

[22] C. Painter-Wakefield and R. Parr. Greedy algorithms for sparse reinforcement learning. Arxiv preprint arXiv:1206.6485, 2012.

[23] Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield, and Michael L. Littman. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th international conference on Machine learning, ICML ’08, pages 752–759, New York, NY, USA, 2008. ACM.

[24] Bruno Scherrer. Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view. In 27th International Conference on Machine Learning - ICML 2010, Haïfa, Israël, 2010.

[25] R. Schoknecht. Optimality of reinforcement learning algorithms with linear function approximation. In Proceedings of the 15th Neural Information Processing Systems conference, pages 1555–1562, 2002.

[26] Yi Sun, Faustino Gomez, Mark Ring, and Jürgen Schmidhuber. Incremental basis construction from temporal difference error. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML ’11, pages 481–488, New York, NY, USA, June 2011. ACM.

[27] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.

[28] John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. Automatic Control, IEEE Transactions on, 42(5):674–690, 1997.

[29] J.M. Wooldridge. Econometric Analysis of Cross Section and Panel Data. The MIT Press, 2008.

[30] Huizhen Yu and D.P. Bertsekas. New error bounds for approximations from projected linear equations. In Communication, Control, and Computing, 2008 46th Annual Allerton Conference on, pages 1116 –1123, Sept. 2008.

Designed for Accessibility and to further Open Science