My stuff
Properties of the Least Squares Temporal Difference learning algorithm

This paper focuses on policy evaluation using the well-known Least Squares Temporal Differences (LSTD) algorithm. We give several alternative ways of looking at the algorithm: the operator-theory approach via the Galerkin method, the statistical approach via instrumental variables as well as the limit of the TD iteration. Further, we give a geometric view of the algorithm as an oblique projection. Moreover, we compare the optimization problem solved by LSTD as compared to Bellman Residual Minimization (BRM). We also treat the modification of LSTD for the case of episodic Markov Reward Processes.

The main practical problem that the LSTD algorithm solves is such: we are given a feed of data from a stochastic system, consisting of a state description in terms of features and of rewards. The task is to construct an abstraction that maps from states to values of states, where the value is defined as the discounted sum of future rewards. We will show that for LSTD, this abstraction is a linear model. For example, the system may describe a chess game, the features of state may describe what pieces the players have while the reward signal corresponds to wither winning or losing the game. The value signal will then correspond to the value of having each particular piece. Note that this is not a general constant but may depend on the way the individual players play the game, for example the values may be different for humans than for computer players. It is well-known that the value function of a given policy can be expressed as  V = (I − γP)−1R. The LSTD algorithm can be thought as a way of computing the value of this function approximately. The motivation for why the approximation is often necessary is threefold. First, we may not have access to the states directly, just to functions  φof state. Second, the number of states n is often computationally intractable. Third, even if n is tractable, there is the problem of statistical tractability – the number of samples needed to accurately estimate transition matrices  n × nis often completely prohibitive.

Associated with our problem setting is the question whether the value function is interesting in its own right, or whether we only need it to adjust the future behaviour of some aspect of the environment we can control (i.e. in our chess example. make a move). We believe that there is large scope of systems (for instance expert systems) where the focus will be on gaining insight into the behaviour of the stochastic system, but the decisions about whether or how to act will still be made manually by human controllers, on the basis of the value-function information. These are the cases where algorithms like LSTD are the most directly applicable. On the other side of the spectrum, there will also of course be situations where the value function estimate is used as a tool to automatically generate the best action on the part of the agent – such systems may also use value-function estimation algorithms of the kind of LSTD to operate within the policy iteration framework.

An exhaustive introduction to least-squares methods for Reinforcement Learning is provided in chapter 6 of Bertsekas’ monograph [6]. The LSTD algorithm was introduced in the paper by Bradtke and Barto [8]. Boyan later extended to the case with eligibility traces [7], wherean additional parameter  λcontrols how far back the updated are influenced by previous states. The connection between LSTD and LSPE, as well as a clean-cut proof that the on-line version of LSTD converges, was given by Nedić and Bertsekas [21]. The seminal paper [28] by Tsitsiklis and Van Roy provided an explicit connection between the fix-point of the iterative TD algorithm and the LSTD solution, while also formally proving that the TD algorithm for policy evaluation converges. The paper [3], described the Bellman Residual Minimization procedure as an alternative to TD. Antos’ paper [1] provided an extensive comparison on the similarities and differences between LSTD and Bellman Residual Minimization (BRM). Parr’s paper [16] introduced the LSPI algorithm as a principled way to combine LSTD with control. The paper by Munos [20] introduced bounds for policy iteration with linear function approximation, albeit under strong assumptions. Scherrer provided [24] the geometric interpretation of LSTD as an oblique projection, in the context analysing the differences between LSTD and BRM. The paper [14] represents an early approach to automatically constructing features for RL algorithms, including LSTD. Schoknecht gave [25] an interpretation of LSTD and other algorithms in terms of a projection with respect to a certain inner product. Choi and Van Roy [9] discuss the similarities between LSTD and a version of the Kalman filter. There exist various approaches in literature to how LSTD can be regularized, none of which can be conclusively claimed to outperform the others. These include the L1 approaches of [18] and [15] and the nested approach of [13]. These approaches differ not just in the what regularization term in used, but they solve different optimization problems (we will discuss this in section 5).

2.1 Notation

The LSTD algorithm finds the value function of a finite-state Markov Reward Process (MRP). The MRP is fixed, i.e. we only consider the on-policy setting. We only have access to linear features of states and to the obtained rewards. More formally, denote as P the transition matrix of the MRP. For each state s we have a feature row-vector  φ. The feature design matrix  Φgives the features of all states of the MRP, row-wise, where we assume that  Φhas independent columns. We use the vector R, the i-th element of which contains mean reward obtained while leaving the state i. We use  ξto denote a left eigenvector of P corresponding to eigenvalue one. Note that if the chain has a stationary distribution, it will correspond to such an eigenvector, but we do not require it. We will assume that the chain only has one recurrent class since the case where we have many classes complicates the notation without contributing to the main argument (in practice, we can typically assume there is one class if we do enough exploration). We also introduce the matrix  Ξ = diag(ξ). We now define expectations of functions of the Markov process in terms of weighted averages. For example the expectation of  φ⊤φ, is defined by  E�φ⊤φ�= Φ⊤ΞΦ, and similarly for other functions. By this we mean that if P is ergodic, it is legitimate to consider the above quantity an expectation corresponding to long-time average by the standard ergodic theorem for Markov chains. But in our application it is convenient to be more general and allow for periodicity, i.e. the diagonal of Ξmay not be a stationary distribution, but the expression still matches the long-time average. We use subscripts do denote two-step sampling, for example  φ′sdenotes the fact that we first sample a state, then the successor state and obtain the feature of that successor state. When we write an expectation w.r.t. such a variable, for example  E�r2s�, the distribution we mean for r is �Ss=1 p(r|s)ξs. Part of our present derivations depends on treating some qualities as random variables; we use small letters to denote them, for instance s denotes state and  φdenotes feature. Once we have obtained samples from our process, we store them in matrices ˆΦand  ˆr, whose i-th rows correspond to, respectively, the state feature vector and reward obtained at time i. Observe the difference between  Φand ˆΦ– in the first one, each state is represented once, in the second one the number of rows corresponds to the trajectory taken in the MRP and repetitions are possible. The value function is discounted with the factor  0 < γ < 1. Moreover, we introduce the square matrix D which has ones on the main diagonal and  −γon the diagonal above it. It is the sample based equivalent to the operator  I − γP.


2.2 The linear dynamical system approach

The derivation given in this section is based on [23]. We begin by constructing a MRP which lives in the space of features instead of our original state space. We limit ourselves to the class of linear dynamical systems. We need to define the matrix ˜Pand the vector ˜R, so that a transition from  φto  φ′(row vectors) is modelled by  φ ˜P = φ′, and the reward we expect at  φis modelled by  φ ˜R = r. Now we look for the values for ˜Pand ˜Rthat model our system dynamics. We have that  Φ ˜Pshould be approximately equal to  PΦand  Φ ˜Rto r. We weigh states by  Ξ, giving the following optimization problems.


These optimization problems correspond to ordinary least squares (generalized to matrices in case of ˜P) and the solutions are obtained by weighted projection:  Φ ˜P = ΠPΦand  Φ ˜R = ΠR, where the projection matrix is defined as  Π = Φ(Φ⊤ΞΦ)−1Φ⊤Ξand the matrix  Φcancels with the one in the projection, since it is full column rank. Now consider a feature vector  φ. In the new approximate MRP, we can compute the value function exactly (i.e. all the approximation has already taken place when we constructed the matrix ˜Pand vector ˜R). The true value function associated with it is the expected discounted future reward, and is expressed as follows.


a stronger condition, namely that the series �∞i=0(γ ˜P)iconverges, which is the same as saying that  γ ˜Pis a contraction in some norm.

This follows from the following reasoning. Consider first the case when we have  Ξ > 0. We know that  ΠPis a contraction in the norm weighted by  Ξi.e.  ∥ΠP∥Ξ = ∥Ξ12 ΠPΞ− 12 ∥2 ≤ 1(see for example [5], proposition 6.3.1). Therefore the spectral radius of  ΠPis bounded by one. Define the matrix Π- = (Φ⊤ΞΦ)−1Φ⊤Ξ, so that we have  ΠP = Φ(Π-P)and ˜P = (Π-P)Φ. Using the assumption that  Φhas independent columns, it is easy to see that if v is an eigenvector of  Π-PΦthen  Φvis an eigenvector of Φ(Π-P)with the same eigenvalue. Hence all eigenvalues of ˜Pare also eigenvalues of  ΠPand  ρ( ˜P) ≤ 1. Now if we have some zero entries on the diagonal of  Ξ, our results follows by a continuity argument. Thus we have the result for the general case.

Note that in the above proof we used the fact that  Ξhas on the diagonal is a left eigenvector of P corresponding to eigenvalue one. If  Ξused for the projection were an arbitrary distribution, then the matrix ˜Pwould in general have spectrum beyond the unit circle. For example, consider the following.


which is more than one.


2.3 Interpretation in terms of expectations

We now give the interpretation of the matrices ˜Pand ˜Rin terms of expectations, which are useful when constructing sample-based versions of the algorithm.


We also can construct sample-based variants of the matrices ˜Pand ˜R(call them ˆPand ˆRrespectively) and still obtain essentially the same algorithm. Let us adopt the following definitions.


In the above, we denote by N the matrix which has ones above the main diagonal and zeros elsewhere.


We note that the required inverse exists by the assumption that  Φ⊤ΞΦis invertible (we also implicitly assume that we have enough samples). Furthermore, we note that the transition model and the reward model are independent, i.e. our solution is also applicable to the setting where we have a single transition matrix P, but instead of just one reward we have many tasks, each of which with a different reward [17]. We note that in this setting, while we still have to learn ˜Rfor each task separately, it is worthwhile to learn ˜Pusing training data from all tasks.

We begin with the Bellman equation, which defines the true value function:  V (s) = E[r + γV (s′) |s] =E[r |s] + γ E[V (s′) |s]. It can be rewritten in matrix form as  V = (I − γP)−1R. We exploit a linear architecture: i.e. we seek to approximate the true value function  V (·)with the function ˜V (s) = w⊤φ, which is linear in w. We will briefly discuss two possibilities for how to choose an appropriate ˜V (·)within the linear class of functions. The obvious thing would be to define ¯V = ΠV = Π(I − γP)−1R, where  Π = Φ(Φ⊤ΞΦ)−1Φ⊤Ξwhere we assume that the inverse exists. This formula guarantees that the distance from ¯Vto V is minimal in the elliptic norm weighted by  Ξ. The problem with this approach is that it is not known how to efficiently compute a useful estimate of the projected value function from samples.1Therefore we need a different approximation. We call it ˜V (·). It comes through the equation ˜V D = ΠT ˜V, where we look for the fixpoint of the operator  ΠTinstead of the Bellman operator T , where T x = R + γPΦx. Our choice of ˜Vwill be motivated further later (in particular, see equation 7).

Now the question, of course, is about the relation between our approximation ˜Vand the projection of the true value function ¯V, as we have defined it in the previous section. We now state without proof the relation between the two estimates developed in [30] (see their references for prior work).


This can be used to obtain the following bound, which does not require us to estimate the matrices  Πor P (see [30] for proof and for sharper bounds):  ∥V − ˜V ∥Ξ ≤ (1 − γ2)−1/2∥V − ¯V ∥Ξ.

We see from this that one example where the approximation of equation 10 is appropriate is when we have substantial discounting – in that case, if the linear framework is good at all, i.e. if the projection ¯V = ΠVis close to the true value function, then so will be our approximation. We emphasise here that our derivation is for the case where there is one recursive class in the MRP. If there are other classes, this bound tells us nothing about them (i.e. using this bound only, we have to accept the value function at the states belonging to them may be arbitrarily off the mark).

In the subsequent sections, we will describe various seemingly different approaches to computing ˜V (·)from samples, which however all lead to the same formula for the solution we have already seen in section 2.2. In order to derive our algorithm, we make two assumptions. First, we assume the following. We call this the feature independence assumption.


This implies that the features are linearly independent (i.e.  Φis of full column rank) but the statement is stronger in that it concerns both the features and the transition dynamics of the MRP, and means that the parts of the features corresponding to states visited with nonzero probability are independent. We note that this implies that the matrix  E�φ⊤(φ − γφ′)�is also full rank – we discuss why this implication holds in appendix A. We will also use this to claim the invertability of ˆΦ⊤DˆΦwithout further comment (i.e. we assume we have enough samples). Also, we assume that the mean of the reward process exists.


The second assumption is rarely a problem because in typical applications the reward is bounded by some constant.

To summarize the description, we restate the fundamental conditions for LSTD to yield good value estimates: (1) the linear architecture itself needs to match the problem and the set of features needs to be set right, that is V must be close to ¯V, (2) the approximation ˜Vneeds to be good, for example through discounting and finally (3) the sample based approximation  ˆwto w must also be good (in the following sections we define a consistent estimator for w, i.e. a way to compute  ˆw, so that the value function computed from a sample trajectory approaches ˜Vfor the recursive states in the class corresponding to that trajectory as the length of the trajectory goes to infinity).

3.1 Derivation by the Galerkin method

That LSTD corresponds to a special case of the Galerkin argument has been implicitly realized for some time, and formally stated in [4], on which this section is based. The general idea of the Galerkin method is to approximate the fixed point of  T , T x⋆ = x⋆. We have  x⋆ = argminx ∥T x⋆−x∥. We introduce the approximation by considering points from within the column space of  Φ, so that our approximate fixpoint satisfies  ˜x⋆ ∈Range(Φ), yielding  ˜x⋆ = argminx∈Range(Φ) ∥T ˜x⋆ − x∥, which is equivalent to the following, after substituting  Φy⋆for  ˜x⋆and  Φyfor x and using the semi-norm weighted by  Ξ.


Now, for our semi-norm with the corresponding projection operator  Π, this has an analytic solution: Φy⋆ = Π(T (Φy⋆)). Now, in our case,  Π = Φ(Φ⊤ΞΦ)−1Φ⊤Ξwhere we note that the inverse is well-defined by assumption 4 and the evaluation of the operator T at  Φwbecomes  R + γPΦwwith w assuming the role of  y⋆. Now we solve the following.


This can be transformed in the following way.


In the above, we can cancel out the terms  Φ, because by assumption 4,  Φhas to be of full column rank. We then multiply both sides by  (Φ⊤ΞΦ), to obtain�(Φ⊤ΞΦ) − γΦ⊤ΞPΦ�w = Φ⊤ΞR, which leads to the following.


This is the same as the expression we will obtain in the instrumental variable section. We also see that equation 8 is the same as the formula obtained from the linear dynamical system approach in equation 2 when we plug in the computed values of ˜Pand ˜R. Thus we have obtained the same estimator.

3.2 Derivation by instrumental variables

Again, we begin with the Bellman equation, which defines the true value function:  V (s) = E[r + γV (s′) |s] =E[r |s]+γ E[V (s′) |s]. We will first obtain a statistical model that expresses the properties of the approximation ˜V (·). By solving the Bellman equation directly in the linear approximation regime, we obtain the following equation.


We note that in the above, we use the convention that w is a column vector while the features are row vectors. This convention minimizes the number of transposes we have to write. Note that we had to introduce the TD error vector  e = [es1, · · · , esn]⊤ = T Φw − Φwand the corresponding random variable es(i.e. the error is a deterministic function of the current state, which is random), since the sum of the reward vector R and the expected feature vector  E[φ′ |s]wmay not be in the feature space (i.e. the column space of  Φ). It can be verified using equation 9 that, the error terms satisfy  eΞΦ = 0, i.e. it is orthogonal to the feature space (indeed it can be seen after a brief manipulation that the condition eΞΦ = 0is equivalent to the formula 9 – we will do this in section 3.4), and that consequently we have the following.


This is not a derivation from first principles, since we had to use an external argument to verify that eΞΦ = 0(which is equivalent to assuming that the TD error vanishes in expectation). But given the model of equation 10 it is nonetheless instructive to look at the mechanics of how the derivation works because this is the first one to have been proposed for LSTD.

We now accept equation 10 as a given and give a statistical derivation as provided in the original LSTD paper [8], based on methods described in [29]. Now, because we do not observe the expectations E[γV (s′) |s]and E[r |s] in equation 10, but merely samples of  φand  φ′we model the residue wrt. the expected value as noise, yielding the probabilistic model  rs = E[r |s]+ηs, where we use assumption given by equation 5, and  φ′s = E[φ′ |s] + εs. Note that by definition  E[ηs |s] = 0. Observe that this implies the following by the law of iterated expectation (LIE).


Analogously, we have the following.


Thus we can rewrite equation 10 to obtain the following.


Now, we cannot use traditional least-squares to solve this, since the expression  ζs = γεsw + ηsmay be, in general, correlated2with  φ − γφ′s, so will be the projection error term e and the two correlations will not cancel in general. Therefore the noise term  ζs − esmay be correlated with with  φ − γφ′s. Also, E[es |s]is not necessarily zero. But ordinary least squares (OLS) requires that noise be uncorrelated with input variables and that it have mean zero to yield consistent estimates. However, there is still a way to obtain a good estimate. More formally, we first need to establish the following properties. First, we have E�φ⊤ηs�= E�E�φ⊤ηs��φ��= E�φ⊤ E[ηs |φ]�= 0, where the first equality follows from LIE and the second from fact 12. By the same reasoning, we have  E�φ⊤εs�= 0from fact 13. With these two properties, we can now multiply both sides of equation 14 by  φ⊤, which we for this purpose call an instrumental variable, and then take expectation, so as to make the noise terms vanish. We also have  E�φ⊤es�= 0by fact 11. This results in the following.


Now because we know by assumption 4 (see section A of the appendix for a detailed proof) that E�φ⊤(φ − γφ′s)�is invertible, the estimator w is given by the following.


This finishes the formal derivation. We will now give two different intuitive interpretations to the instrumental variable method. First, consider the sample equivalent of equation 14, which we now rewrite in matrix notation  ˆr = DˆΦ ˆw + ˆζ − ˆe, where by ˆζwe denote the vector containing the noise terms for each individual sample and by  ˆethe sample values of the random variable  es. Now, as described above, we cannot solve it by OLS because of the correlation between the noise and  DˆΦ. So we ‘fix’  DˆΦby projecting it onto the feature space (i.e. the column space of ˆΦ), since we know that noise is uncorrelated with features. We introduce the projection operator ˆΠ = ˆΦ(ˆΦ⊤ ˆΦ)−1 ˆΦ⊤, where we note that the inverse exists by assumption given we have enough samples. Now our equation becomes the following.


In the above, we can cancel the terms because ˆΦhas, by assumption , independent columns if we have enough samples. This leads to the same estimator that we derived above. This interpretation is known in econometric literature as two-stage least squares (2SLS), because we solve two linear systems: first we project  DˆΦon the subspace of features and then we solve the resulting modified equation. In this context we stress that we would get the same solution if we only applied the projection on the right-hand side, e.g.  ˆr = ˆΠDˆΦ ˆw– this can be seen by noticing that the choice of  ˆwin this equation is unaffected by any component of  ˆrorthogonal to the feature space. We also see the direct correspondence between this and the projection step in the derivation through Galerkin method – the equation 7 is essentially the limiting version of the sample-based equation 17.

3.3 The geometry of instrumental variables

There is one more way to interpret the instrumental variable approach. Observe that the equation ˆΠˆr = ˆΠDˆΦ ˆw, can be rewritten as  ˆΠ(DˆΦ ˆw − ˆr) = 0. Thus we have that applying the projection amounts to solving  ˆr = DˆΦ ˆwunder the constraint that the projection of the residual on the feature space is zero. Therefore LSTD yields the same solution as applying the oblique projection of the rewards on the difference between values of successive states (i.e.  DˆΦ), along the subspace orthogonal to the column space of ˆΦ(which is the left null-space of ˆΦ). See also figure 1.

Recall the formula for the coefficients of the oblique projection on the columns space of X orthogonal to the column space of Y , which is  X(Y ⊤X)−1Y ⊤. The corresponding generalized pseudoinverse of X is (Y ⊤X)−1Y ⊤. It is easy to verify that putting  X = (I −γP)Φand  Y = ΞΦinto  (Y ⊤X)−1Y ⊤recovers the LSTD solution. Notice that in this case, the projected vector,  X(Y ⊤X)−1Y ⊤corresponds to obtaining the ‘smoothed rewards’ corresponding to the approximate value function (i.e.  (I − γP) ˜V D, or what the rewards would have been if there had been no approximation of the value function). Now there is also a different way of defining the projection, namely we can project not the reward vector but the true value function [24]. In this case, setting  X = Φand  Y = (I − γP)⊤ΞΦagain produces the LSTD solution w (note that now, we are projecting the true value function, not the rewards). Notice that in this case the projected vector corresponds to the approximate value function.

Notice that formally speaking, in both the interpretation as a projection of the reward vector and the value function, we also need another condition to call LSTD an oblique projection – in order for the formula  X(Y ⊤X)−1Y ⊤to mean a projection on Range(X) orthogonal to Range(Y ), we need the condition that the orthogonal complement of Range(X) and Range(Y ) should be complementary subspaces. We will now claim that this is the case in either of the above ways of thinking about LSTD as a projection. To do this, we will prove the following statement. We denote by k the number of columns in  Φ(they are known to be linearly independent by assumption 4).



Figure 1: LSTD can be interpreted as an oblique projection (left) and as a fixpoint algorithm (right).

Proof. First, we note that the dimension of Range(BΦ)is k since  BΦis full column rank and the dimension of Range(AΦ)⊥is exactly  n − ksince A is invertible. The argument in the left-to-right direction is as follows: if  ∃z.Φ⊤A⊤BΦz = 0, then there would be a vector,  BΦz, which is both in Range(BΦ)and Range(AΦ)⊥. Therefore these two subspaces cannot sum to the n-dimensional space if they share a common vector. This contradiction finishes the argument. The argument in the right-to-left direction is thus: there is no non-zero vector in both Range(BΦ)and Range(AΦ)⊥, then because of their dimensions they have to sum to the whole space  Rn.

We now see that the condition  ¬∃z.Φ⊤A⊤BΦz = 0is fulfilled in the case of LSTD because by assumption 4 the matrix  Φ⊤A⊤BΦ, and hence also  Φ⊤B⊤AΦhas to be invertible. In this expression, we can substitute A = I and  B = (I − γP)⊤Ξor alternatively  A = I − γPand  B = Ξto obtain either of the interpretations of LSTD as projection outlined above. We note that in either case,  BΦis full column rank by assumption 4 together with the fact in appendix A and A is invertible since P is a Markov matrix.

3.4 Connection with the iterative TD algorithm

We have seen in section 3.2 that the equality  E�φ⊤es�= −Φ⊤ΞR + Φ⊤Ξ(I − γP)Φw = 0is crucial for the development of the algorithm and indeed equivalent to the obtained estimator for w (equation 9). We will now show another way of obtaining this equality – actually, it may be taken do be the definition of the algorithm, and used as a justification for the formula 9 that stands on its own. We now give the interpretation of this equation is in terms of the iterative TD algorithm [27]. We note that the equality 0 = E�φ⊤es�corresponds to saying that the LSTD solution corresponds to the fixpoint of iterative TD, i.e. the point where the expected update is zero.

Consider now the definition of the iterative TD algorithm [27]. We assume for the moment that we have an oracle  Vofor the value function and are interested in iteratively solving the optimization problem minw(Vo(s) − ˜V (s))2using the approcimation architecture ˜V (s) = φsw. The iterative update is given by  ∇w(Vo(s) − ˜V (s))2 = 2∇w ˜V (s) (Vo(s) − ˜V (s)). We now have the following formula for the iteration.


Now we have that the update  ∆wat time t, is  φ(st)⊤est. Setting the expectation of this update to zero gives the desired formula. We also note that the relation between the TD iteration and the LSTD algorithm resembles the chicken-and-egg problem – one can either, as we did above, consider the iteration a priori knowledge and use that to justify the LSTD fixpoint, or one can start with the fixpoint and treat the iteration as a way of reaching it, motivated by stochastic optimization. LSTD can also be extended to compute the fixpoints of TD(λ) or, more generally other similar algorithms with different traces. For details, see [10] in slightly different notation.

3.5 LSTD as minimization of a quadratic form

This section is based on [26]. It interprets LSTD as the minimization of a quadratic form in the error between the true value function  V (·)and the approximated value function  Φw. We begin by reformulating the formula for the estimator obtained above.


This equality holds because  R = (I − γP)Vand because the matrices  Φ⊤(I − γP)⊤ΞΦand  Φ⊤ΞΦare invertible by assumption given by equation 4. Now, we introduce the matrix K, as below.


We note that  ΞΦ(Φ⊤ΞΦ)−1Φ⊤Ξ = ΞΠ = Π⊤Ξ = Π⊤ΞΠ, where the last equality follows by substituting the definition of  Πand cancelling the inverted term. Therefore we have  w =�Φ⊤KΦ�−1 Φ⊤KV. But this is the solution to the well-known optimization problem:  w = argminw′ ∥V − Φw′∥K = argminw′(V −Φw′)⊤K(V − Φw′). Thus we gain an insight about approximation ˜V (·)of equation 10 – instead of minimizing the norm  ∥ · ∥Ξ, which would yield us ¯V, we minimize the different norm  ∥ · ∥K, thus gaining the ability of efficiently estimating the solution from samples. Note that we can also repeat the above reasoning, without the multiplication by  (Φ⊤ΞΦ)−1, to obtain the matrix  K′ = (I−γP)⊤ΞΦΦ⊤Ξ(I−γP)which also defines a valid minimization – this is the way the equivalence was originally introduced in [25].

3.6 LSTD is a subspace algorithm

In section 3.3, we have shown that the algorithm can be thought of as an oblique projection along the subspace orthogonal to the feature space. Here, we make explicit the property that LSTD only depends on the features through the subspace they span i.e. any full-rank transformation (i.e. basis change) C of features does not influence the value function. To see this, consider the sample estimate we derived in earlier sections, where we use the transformed features ˆΦCinstead of ˆΦ.


As a corollary, we state that LSTD is independent of any scaling of features.

4.1 A Decompositions of the LSTD loss

We now present an interpretation of the minimization defined in equation 6, after [1]. We recall that the minimization in equation 6 can be rewritten in the following way  Φy⋆ = argminy ∥T Φy⋆ − Φy∥Ξ =Π(T (Φy⋆)). Therefore  Φy⋆ − Π(T (Φy⋆)) = 0, or  ∥Φy⋆ − Π(T (Φy⋆))∥Ξ = 0. Therefore LSTD can be seen to be equivalent to the following optimization problem.


We note that this expression has no recursion and that the minimization is guaranteed to reach the optimum value of zero. We can now rewrite the norm as follows  ∥Φy − Π(T (Φy))∥Ξ = ∥Φy − T (Φy)∥Ξ −∥Π(T (Φy)) − T (Φy)∥Ξ, where the equality follows from the Pythagorean theorem and the fact that Φy − Π(T (Φy))and  Π(T (Φy)) − T (Φy)are orthogonal vectors, with respect to the  Ξ-weighted inner product, which corresponds to  Π. We thus obtain the following formula for the LSTD solution.


We see that the LSTD algorithm minimizes a quantity which is the Bellman residual minus the reprojection error on the feature space. We discuss in section 4.2 the difference between simply minimizing the Bellman residual only and the LSTD algorithm.

Another way to interpret the LSTD loss is to see it as a nested optimization problem [11], which leads to the following two equivalent formulations. First, define the projection in the following way.


Then we plug this for the definition of  Π(T (Φy))in equations 18 and 19 respectively, giving the following equivalent equations.


4.2 Comparison with BRM loss

Instead of constructing the oblique projection as described in the previous sections, we can use a simpler algorithm, known as the Bellman Residual Minimization, which corresponds directly to projecting the rewards on the differences between successive states (see figure 2) – i.e. it is similar to LSTD except the projection is orthogonal, not oblique. BRM can be interpreted as the un-nested version of the optimization from the previous section.


The reason LSTD was originally introduced as an improvement over BRM [8] is that for BRM, we do not have a justification in terms of a statistical model similar to the one we had in section 3.2 – the noise terms are correlated, so we cannot use a similar reasoning to claim consistency of BRM. But of course the fact that one line of deriving an algorithm doesn’t work for BRM does not mean that the algorithm is wrong – there may be other justifications available. Interestingly, it can be shown that under our assumption 4 the two approaches are similar (the argument comes from chapter 4 of [2]). Indeed, we have from the previous section (compare equation 18) that LSTD is similar except for the presence of the projection  Π. It is sometimes useful to have formulas that make the difference between the two algorithms explicit in different formulations of each algorithm. The algebraic relationships between the two algorithms are summarized in the table below.


There has been renewed interest in the analysis of the difference between the two algorithms. One argument [24] is that in an off-line setting (i.e. in the situation when the weighing coefficients are different from the stationary distribution of the MRP, a scenario we do not consider in this paper) a performance bound can be shown about BRM that is impossible to derive about LSTD [24]; on the other hand LSTD remains widely used in practice.

There is yet one more feature that means that LSTD is preferable to BRM is some practical cases – while with LSTD, as we have shown above, we only need one sequence of samples of features of states and a sequence of samples of reward to obtain an estimate of the value function; but with BRM we need to have two samples of the features of states.

4.3 Sample estimate of the BRM value function

We will now show a way to obtain a sample-based estimate  ˆwBof the BRM solution, based on section 3.1 of [19]. We want to minimize the expectation  E�(φwB − φ′wB − r)2�. We have the sampled features ˆΦ1and the sampled rewards  ˆr. We also have a second set of sampled features ˆΦ2. The sampled features are produced using the following process: given the trajectory  s1, s2, . . ., the features in ˆΦ1are  φ(s1), φ(s2), . . .while the features in ˆΦ2correspond to ‘alternative’ states  s′2, s′3, . . .sampled from P(·|s1), P(·|s2), . . .. In other words, the features in ˆΦ2describe where the MRP might also have gone to given a particular previous state. Of course, such sampling is only possible if we have a model of the transition dynamics of the MRP. Now, we can write a sample-based approximation to the expectation given above as ˆE = 1N−1�N−1i=1 (ˆΦ1(i)wB − γ ˆΦ1(i + 1)wB − ˆr(i))(ˆΦ1(i)wB − γ ˆΦ2(i)wB − ˆr(i)), where the notation ˆΦ1(i)means selecting row i of the matrix ˆΦ1(i)(i.e. the i-th feature in the trajectory). We can now introduce the notation ˆΨ1 = ˆΦ1(1 : N − 1) − γ ˆΦ1(2 : N)and ˆΨ2 = ˆΦ1(1 :N − 1) − γ ˆΦ2(1 : N), where the colon notation denotes ranges of rows. With this notation, we have that  w⊤B(ˆΨ1)⊤ ˆΨ2wB = w⊤B(ˆΨ2)⊤ ˆΨ1wB = �N−1i=1 (ˆΦ1(i)wB − γ ˆΦ1(i + 1)wB)(ˆΦ1(i)wB − γ ˆΦ2(i)wB). It can now be seen after a few rearrangements that ˆE = 1N−1�w⊤B(ˆΨ1)⊤ ˆΨ2wB − ˆr⊤(ˆΨ1 + ˆΨ2)wB + ˆr⊤ˆr�=


leaves the us with the system  ((ˆΨ1)⊤ ˆΨ2 + (ˆΨ2)⊤ ˆΨ1) ˆwB = (ˆΨ1 + ˆΨ2)⊤ˆr, where we denoted by  ˆwBthe sample-based BRM solution.

To overcome the problem of over-fitting, the standard procedure is to add a regularization term to the proposed algorithm. There are many ways of doing that. 9


Figure 2: BRM as projection of rewards (left) and minimizing the Bellman residual (right). Cmp. fig. 1

One way, proposed by [15] is to consider the optimization problem of the fixpoint equation 6. We can extend it as follows:  Φw = argminw′ (∥R + γPΦw − Φw′∥Ξ + β∥w′∥). Here,  β ≥ 0is an external parameter of the algorithm,  ∥ · ∥Ξis the weighted norn and  ∥ · ∥is the usual  L2norm. This way of regularizing produces the well-known analytic solution  wR =�Φ⊤Ξ(I − γP)Φ + βI�−1 Φ⊤ΞR. In the paper [15], a version is also given where the second norm is  L1. In this case, because equation 6 is a fix-point equation, it is not possible to simply plug the problem into the standard LASSO algorithm, and a new algorithm is necessary (see [15] for details).

Before we continue, denote the standard  L2-regularized solution of a system of equations Ax = b as  solveL2(A, b, β) = argminx ∥Ax − b∥Ξ + β∥x∥2 = (A⊤ΞA + βI)−1A⊤Ξb. Denote the version with  L1regularization as  solveL1(A, b, β) = argminx ∥Ax − b∥Ξ + β∥x∥1(this has no explicit analytic form as has to be computed using an algorithm, typically LASSO).

A second way of regularization, introduced in [11] is to add regularization to equation 21, giving the following optimization problem.


In the above, the latter norm may be either of  L2or  L1. A quick calculation shows that this is the same as regularizing the system of equations 8. This idea therefore corresponds to the solutions  solve(Φ(I −γ(Φ⊤ΞΦ)−1Φ⊤ΞPΦ)), Φ(Φ⊤ΞΦ)−1Φ⊤ΞR, β)for each of the discussed norms.

Another way is adding regularization directly to the equation where we have already solved for w, that is,  w = A−1b, where  A =�Φ⊤Ξ(I − γP)Φ�and  b = Φ⊤ΞR. If we regularize with  L2, this corresponds to the solutions  solveL2(A, b, β). This (together with other versions, that do not map to LSTD), has been done in [2], where the author also derives finite-sample error bounds.

It is also possible to combine some of the above ways together, after the manner of [13], and to use other sparsifiers in place of  L1. In [12], for instance, the Dantzig selector is employed, which leads to a considerable simplification of the optimization problem (the optimization reduces to a linear program).

A yet different approach [22] to regularization is to keep the algorithm itself unchanged and instead do feature selection beforehand. Even if the feature selection algorithm is very simple (greedy based on correlation with residual), simulations [22] suggest that doing feature selection leads to performance essentially the same as the approaches described above. Because greedy feature selection is so simple, this suggests that regularization of LSTD is not yet really a fully solved problem.

A property of all the above regularizers is that we lose the invariance of the algorithm w.r.t. the choice of basis for the feature space, which can be seen as a natural characteristic of LSTD3. It is not clear whether the property would be worth preserving in a regularized version – sparsity by its very nature is not invariant to transformations of features, even linear ones and there is a general tendency that a more specialized algorithm will have less generic properties.

In the other sections of this paper, we have considered the case where the MRP never terminates and convergence is defined by taking the limit with respect to the length of a trajectory. We are now interested in extending our observations to the case where there is a termination state. The limit will now be the with respect to the number of episodes being accumulated. First, let us note that the formula w = E�φ⊤(φ − γφ′)�−1 E�φ⊤rs�is still valid in this case. We simply have to give new meaning to the expectation terms.

We will now start by giving a design-based variant for the algorithm. All transitions in a terminating MRP can be described using a rectangular matrix  Pt, where the last column is meant to denote termination. We assume in the following that the starting state of the MRP is the first state. We also assume that the matrix  Ptis such that the MRP will always eventually terminate. We first need to construct a state distribution  Ξ. To do this, we append the row [1, 0 . . . 0] to the matrix  Pt, producing the square matrix  Pa, which assumes that the MRP restarts after reaching the termination state. Now, the diagonal entries of the matrix  Ξare the entries of the left eigenvector of  Pawhich corresponds to eigenvalue one. Now we also construct another square matrix, P, which we obtain by appending the row [0 . . . 0, 1] to the matrix  Pt. This matrix assumes that the agent stays in the termination state forever. The intuition behind this is the following: the matrix P describes the true dynamics of the MRP, but in order to have a meaningful state distribution we need to take into account the fact that we have multiple episodes – hence the definition of the matrix  Pa, which models restart. Having defined the above matrices, we may use the standard formula in the following way.


Here, we assume that the last feature vector (i.e. the one corresponding to the state modelling termination) is zero. By definition, the final element of R is also zero. It can be seen that the sample-based variant is the same as in the case of one long trajectory, except for the additional summation over the episodes. We note we use here the fact that the termination state has the feature of zero (so that we can still use the matrix D – there is no subtraction in the last row, but it doesn’t matter since the last state is the terminal state). The formula looks as follows, where the sum goes over episodes.


We have provided a detailed survey of the different ways in which LSTD can be obtained. Our derivation of LSTD using instrumental variables, is, to our knowledge, the first one which is correct. We also made explicit and formal an argument concerning the invertability of the matrix that appears in the LSTD solution (see appendix A). Moreover, we have derived geometric interpretations of the LSTD fixpoint (independently of the work of Scherrer [24], which we only became aware of afterwards). We also provided an exhaustive comparison with the BRM algorithm as well as surveyed the methods that can be used to regularize the LSTD solution. Finally, we formally described the episodic version of LSTD, which was already implicitly known before, but not formalized.


Proof. We rewrite the statement in matrix form:  det(Φ⊤ΞΦ) ̸= 0implies  det(Φ⊤Ξ(I − γP)Φ) ̸= 0. We will now develop the second expression. By the well-known eigenvalue argument,  I − γPis invertible. Assume for the moment  Ξ > 0(we will deal with the case when this is not true later). Consider some non-zero vector x. We use the assumption to state that  Φx ̸= 0have that  Φ⊤Ξ(I − γP)Φx = 0if and only if the vector  y = Φx, which in the column space of  Φsatisfies the condition that  Ξ(I − γP)yis orthogonal to the column space of  Φ. This implies that  y⊤Ξ(I − γP)y = 0. This holds if and only if  y⊤ � 12(Ξ(I − γP)) + 12(Ξ(I − γP))⊤�y = 0. Now because the matrix defining this quadratic form is symmetric, and thus diagonalizable and with real eigenvalues, we have that this can only be zero if some of the eigenvalues are nonpositive. We will show that this cannot be the case. Rewrite the matrix

2(Ξ(I − γP)) + 12(Ξ(I − γP))⊤as  Ξ(I − γ 12(P + Ξ−1P ⊤Ξ)). Now because by definition  Ξ = diag(ξ)where  ξ⊤P = ξ⊤, we have that  Ξ−1P ⊤ΞV 1□ = V 1□(where by  1□we denote the vector of all ones); moreover,  Ξ−1P ⊤Ξhas positive entries. So it is a Markov matrix. Thus 12(P +Ξ−1P ⊤Ξ)also is a Markov matrix. Thus,  (I − γ 12(P + Ξ−1P ⊤Ξ))has eigenvalues in the positive real half-plane. We also know that the eigenvalues of  Ξ(I − γ 12(P + Ξ−1P ⊤Ξ))are non-negative since it is a symmetric graph Laplacian. But we cannot have zero eigenvalues, because it would imply that  (I − γ 12(P + Ξ−1P ⊤Ξ))also has zero eigenvalues, which we have shown is impossible. This finishes the proof for  Ξ > 0.

Now consider the case when we do not have this, i.e. some of the diagonal entries of  Ξare zero. Intuitively, the fact we prove is now obvious since transient states do not influence the values of the expectations. More formally, we can, without loss of generality assume that the states for which the probability given by the stationary distribution is zero have highest indexes (i.e. they occur at the back of matrices  Ξ, Pand  Φ). We introduce the following notations for block minors of matrices  Ξ, Pand the vector y corresponding to the non-transient and transient states.


Note that in the above,  Pntis has to be the zero matrix – it corresponds to transitions from non-transient states to transient states. Therefore we have that  Ξ(I − γP)y = 0implies  Ξf(I − γPf)yf = 0and thus, by the reasoning for the case without transient states,  yfhas to be the zero vector. Therefore we have the fact that  Ξ(I − γP)Φx = 0implies that we have the following.


We see that this implies that  Φ⊤ΞΦx = 0. But we know from our assumption  det(Φ⊤ΞΦ) ̸= 0that this is only possible for x = 0.

[1] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning.

[2] Bernardo Ávila Pires. Statistical analysis of l1-penalized linear estimation with applications. Master’s thesis, Univeristy of Alberta., 2011.

[3] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the twelfth international conference on machine learning, pages 30–37, 1995.

[4] D. Bertsekas. Temporal difference methods for general projected equations. Automatic Control, IEEE Transactions on, (99):1–1, 2011.

[5] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, chapter Approximate Dynamic Programming. 2011.

[6] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific Belmont, 2012.

[7] J.A. Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2):233–246, 2002.

[8] Steven J. Bradtke and Andrew G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996. 10.1007/BF00114723.

[9] David Choi and Benjamin Van Roy. A generalized kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems, 16(2):207–239, 2006.

[10] Kamil Ciosek. Generalizing LSTD(λ) to LSTD(λt). Internal UCL note., 2012.

[11] Matthieu Geist and Bruno Scherrer. l1-penalized projected Bellman residual. In European Wrokshop on Reinforcement Learning (EWRL 11), Athens, Grèce, 2011.

[12] Matthieu Geist, Bruno Scherrer, Alessandro Lazaric, and Mohammad Ghavamzadeh. A Dantzig Selector Approach to Temporal Difference Learning. In International Conference on Machine Learning (ICML), 2012.

[13] Matthew Hoffman, Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Regularized least squares temporal difference learning with nested  l2 and l1; penalization. In Scott Sanner and Marcus Hutter, editors, Recent Advances in Reinforcement Learning, volume 7188 of Lecture Notes in Computer Science, pages 102–114. Springer Berlin / Heidelberg, 2012.

[14] Philipp W Keller, Shie Mannor, and Doina Precup. Automatic basis function construction for approximate dynamic program- ming and reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pages 449–456. ACM, 2006.

[15] J. Zico Kolter and Andrew Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 521–528, New York, NY, USA, 2009. ACM.

[16] Michail G Lagoudakis and Ronald Parr. Least-squares policy iteration. The Journal of Machine Learning Research, 4:1107– 1149, 2003.

[17] Guy Lever. Private communication.

[18] Manuel Loth, Manuel Davy, and Philippe Preux. Sparse temporal difference learning using lasso. In Approximate Dynamic Programming and Reinforcement Learning, 2007. ADPRL 2007. IEEE International Symposium on, pages 352–359. IEEE, 2007.

[19] O.A. Maillard, R. Munos, A. Lazaric, and M. Ghavamzadeh. Finite-sample analysis of bellman residual minimization. In Proceedings of 2nd Asian Conference on Machine Learning (ACML2010), Tokyo, Japan, November 2010.

[20] R. Munos. Error bounds for approximate policy iteration. volume 20, page 560, 2003.

[21] A Nedić and Dimitri P Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems, 13(1-2):79–110, 2003.

[22] C. Painter-Wakefield and R. Parr. Greedy algorithms for sparse reinforcement learning. Arxiv preprint arXiv:1206.6485, 2012.

[23] Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield, and Michael L. Littman. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th international conference on Machine learning, ICML ’08, pages 752–759, New York, NY, USA, 2008. ACM.

[24] Bruno Scherrer. Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view. In 27th International Conference on Machine Learning - ICML 2010, Haïfa, Israël, 2010.

[25] R. Schoknecht. Optimality of reinforcement learning algorithms with linear function approximation. In Proceedings of the 15th Neural Information Processing Systems conference, pages 1555–1562, 2002.

[26] Yi Sun, Faustino Gomez, Mark Ring, and Jürgen Schmidhuber. Incremental basis construction from temporal difference error. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML ’11, pages 481–488, New York, NY, USA, June 2011. ACM.

[27] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.

[28] John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. Automatic Control, IEEE Transactions on, 42(5):674–690, 1997.

[29] J.M. Wooldridge. Econometric Analysis of Cross Section and Panel Data. The MIT Press, 2008.

[30] Huizhen Yu and D.P. Bertsekas. New error bounds for approximations from projected linear equations. In Communication, Control, and Computing, 2008 46th Annual Allerton Conference on, pages 1116 –1123, Sept. 2008.

Designed for Accessibility and to further Open Science