Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decisions are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decisions.
This principle of optimality has endured at the foundation of reinforcement learning research, and is central to what remains the classical definition of an optimal policy [2]. Classical reinforcement learning algorithms like Q-learning [3] embody this principle by striving to act optimally in every state that occurs, regardless of when the state occurs.
The start-state objective function, Rt
, prioritizes making decisions optimally in the initial state, not necessarily in the states resulting from the first decisions.1 These two goals (optimizing decisions in the initial state and optimizing decisions in subsequent states) can be conflicting when using function approximation, particularly when
is small and the initial state distribution has limited support. So, maximizing
does not preserve the principle of optimality.
The term in (1) discounts the importance of optimal behavior in states that occur at later times. Algorithms purported to update
following estimates of
typically drop this
term, since including it or setting
1 results in poor performance. As a result, these algorithms do not capture the essence of
, do not maximize
, and are not stochastic gradient algorithms [5].
We propose a different objective function for finite-horizon episodic Markov decision processes that better captures the principal of optimality, and provide an expression for its gradient. This new objective, which we call the classical objective function, has the form f, where d
is a distribution over S and v
Rt
St=s,
. This form harks back to the classical definition of an optimal policy, particularly if d
has full support on S and does not depend on
, in which case f preserves the partial ordering on policies used in the classical definition of an optimal policy.
In model-free reinforcement learning, the agent is not free to sample states from an arbitrary distribution, which makes estimating f or its gradient challenging with such a d. So, we trade-off similarity to the classical definition of an optimal policy with the practicality of estimating the objective function and its gradient, and define d
to be the on-policy distribution for episodic tasks [2, page 199], but with some
probability shifted to the terminal absorbing state: d1
h Pr(St
, where h is the horizon. This captures the spirit of classical algorithms like Q-learning using function approximation: updates to function approximators occur when states are encountered, and are not discounted.
In the supplementary material we show that
where w(i, t) = 1 if i t, w(i, t
if i = t and
1, and w(i, t) = t + 1 if i = t and
1.
The techniques that make estimation of effective, and which have been developed over 27 years [6], do not necessarily carry over to estimating
f. For example, it is not clear how baselines and control variates (and thus actor-critics) should be leveraged. Developing practical algorithms for (approximately) maximizing f is an open problem—we have only had success with simple REINFORCE-like algorithms.
Notice that f is not an ideal objective since, like , it does not preserve the partial ordering on policies used in the classical definition of an optimal policy, and examples exist wherein it prescribes unreasonable behavior. Still, f presents a new direction for policy gradient research, opening new questions like: 1) are policy gradient algorithms for
that drop the
term better viewed as algorithms for optimizing f? 2) How should baselines and control variates be leveraged when optimizing f? 3) Can practical (linear-time and generalized [7]) natural gradient algorithms be derived?2 4) Do alternate forms for
f facilitate gradient estimation, e.g., writing the t-summation over
ln
St, At,
and the inner i-summation over q
Si, Ai) so that the i-summation can be expressed as a new value function that measures the expected sum of state-values rather than the expected sum of rewards—a value function that might be approximated using a new TD-like algorithm, and which might allow for actor-critics for the classical objective? 5) What are the relationships between f,
, and the average reward objective? For example, notice that when
0, f is equivalent to
with
1.
[1] R. Bellman. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60(6):503–515, 1954.
[2] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd edition, 2018.
[3] C. Watkins. Learning From Delayed Rewards. PhD thesis, University of Cambridge, England, 1989.
[4] P. S. Thomas and B. Okal. A notation for Markov decision processes. arXiv preprint arXiv:1512.09075v2, 2016.
[5] C. Nota and P. S. Thomas. Is the policy gradient a gradient? Unpublished, 2019.
[6] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
[7] P. S. Thomas. GeNGA: A generalization of natural gradient ascent with positive and negative convergence results. In ICML, 2014.
We will derive expressions for the two terms in (1) independently and then sum them to obtain an expression for . We begin with term (a) in (1), and start by using a property derived by Sutton et al. (2000):
which implies that for all t ,
since Pr(AtSt
x,
Pr(At
St
x, St=s,
and by the law of total probability. Continuing, starting with the fact that Pr(St
Pr(At
St
x,
Pr(St
x, At
, we have that:
by substitution of the variable i = t + k. Since Si is the terminal absorbing state for i h, we have that q
Si, ai) = 0 for i
h, and thus the sum over i can stop at h
1 rather than h
t. Continuing, starting with this change and then reordering the summations over t and i, we have:
since is equal to w(i, i), which is 1
1 and i + 1 otherwise. Replacing the symbol i with the symbol t we have:
Notice that (13) closely resembles the policy gradient for the start-state setting with the term removed. Also notice that the left side of (13) captures how changes to the policy parameters change the value of states, but does not capture how changes to the policy parameters change the state distribution. This suggests that removing the
term from the policy gradient theorem for the start-state objective function results in an update that does not properly account for how changes to the policy change the state distribution.
We now simplify term (b) in (1). Let Tt be a trajectory (a random variable) that includes the states and actions (not the rewards) up until (and including) time t. That is Tt = (S0, A0, S1, A1, . . . , St, At). Let be the set of all possible values for Tt. Using this notation, we simplify term (b) in (1):
Continuing, with s0, a0, s1, a1, . . . , st
, at
, using p to denote the state-transition function, and writing p
, a
, s0) to denote Pr(S0=s0), we have:
i
p(si
, ai
, si
si, ai,
i
p(si
, ai
, si)
i
p(si
, ai
, si)
tTt
, St
st
st
t
i=0
ln
Si, Ai,
ln
Si, Ai,
ln
Si, Ai,
Summing (13) and (23) and using the fact that w(i, t) = 1 if i t, we obtain an expression for the sum of terms (a) and (b) in (1):
ht
q
St, At)w(t, t
ln
St, At,
1h
ht
q
St, At) t
i
w(i, t
ln
Si, Ai,
R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063, 2000.