b

DiscoverSearch
About
My stuff
Coagent Networks Revisited
2020·arXiv
Abstract
Abstract

Coagent networks formalize the concept of arbitrary networks of stochastic agents that collaborate to take actions in a reinforcement learning environment. Prominent examples of coagent networks in action include approaches to hierarchical reinforcement learning (HRL), such as those using options, which attempt to address the exploration exploitation trade-off by introducing abstract actions at different levels by sequencing multiple stochastic networks within the HRL agents. We first provide a unifying perspective on the many diverse examples that fall under coagent networks. We do so by formalizing the rules of execution in a coagent network, enabled by the novel and intuitive idea of execution paths in a coagent network. Motivated by parameter sharing in the hierarchical option-critic architecture, we revisit the coagent network theory and achieve a much shorter proof of the policy gradient theorem using our idea of execution paths, without any assumption on how parameters are shared among coagents. We then generalize our setting and proof to include the scenario where coagents act asynchronously. This new perspective and theorem also lead to more mathematically accurate and performant algorithms than those in the existing literature. Lastly, by running nonstationary RL experiments, we survey the performance and properties of different generalizations of option-critic models.

Background and related works. Use of stochastic neural networks is commonplace in reinforcement learning (RL) to execute stochastic actions in the environment. However, hierarchical approaches to RL often conceptualize a notion of an abstract action that is not applied in the environment directly, but rather happens within the mind of an agent in order to decompose the problem. Coagent networks (Thomas and Barto 2011) formalize this learning problem where stochastic actions are taken within an agent’s mind in the general case. Specifically, hierarchical RL (HRL) models such as Option-Critic (OC) (Bacon, Harb, and Precup 2017) and Hierarchical OC (HOC) (Riemer, Liu, and Tesauro 2018) attempt to leverage abstract actions to learn complicated tasks. This requires networking multiple stochastic policies that learn and act cooperatively to maximize the return from the environment. How to learn coagent networks was first studied in (Thomas 2011), wherein it was shown that the more biologically plausible approach to learning, i.e. REINFORCE (Williams 1992), is an unbiased estimator of the policy expected return when applied individually on each coagent. In other words, each coagent trained separately with REINFORCE on their own return is equivalent to REINFORCE on the entire network policy. (Kostas, Nota, and Thomas 2020) further generalized this to coagents acting (a)synchronously. However, all previous works assumed a disjoint set of parameters describing each coagent. This non-sharing assumption was lifted in the specific context of HOC (Riemer et al. 2020), with the the more advanced Actor-Critic algorithm implementation.

So far, we have discussed prior and related works on the theory of coagent networks. However, since the advent of coagent networks and its options variants, there have been continued efforts on experimenting with these models. The coagent policy gradient theorem (PGT) has been investigated by (Gupta et al. 2021) to train stochastic feedforward neural nets, where it was shown that while such networks struggle to achieve optimal results in supervised settings, they are more performant in non-iid settings compared to backprop. On the other hand, (Chung 2021) proposes an alternative to the REINFORCE called the MAP propagation algorithm on the same type of feedforward networks. This reduces the variance of updates, albeit decreasing computational efficiency and increasing bias in their experiments in OpenAI’s Gym. The main advantage of coagent networks framework is the flexibility to design the network and the freedom of choosing the rules by which the network coagents operate. This is a powerful tool that, in theory, should allow the user to exploit the relational biases in the task by reflecting them in the network. One such famous design is that of options, which tries to exploit a decomposition of the optimal policy to multiple skills or phases, and it has by far garnered the most attention by the community. They have been evaluated under famous RL tasks, but also multi-tasking, nonstationary, and non-iid datasets. The popular OC has succeeded when applied to Q-learning on Atari (Bacon, Harb, and Precup 2017), or to continuous action spaces (Klissarov et al. 2017) and asynchronous parallelization (Harb et al. 2018). (Riemer, Liu, and Tesauro 2018; Riemer et al. 2020) further improved OC’s results under multi-tasking on Atari games by introducing the HOC architecture, and showed superior performance in nonstationary tasks. In addition, properties and “qualities” of options has been an active area of research. One prominent issue is that of option length, where some become dominant while others become useless. (Harb et al. 2018) formulates an answer to what a good option is through the notion of deliberation cost, therefore adding a cost to options that deliberate for too long. In addition (Khetarpal et al. 2020; Chunduru and Precup 2022) develop models of interest and attention to keep the focus of options on certain (less overlapping) areas of state space, thereby encouraging skill decomposition by options.

Contributions and novelties. Now, we discuss a summary of main contributions and novelties of this work, on theory, algorithms, and experiments. We first revisit the notion of a Coagent Network with the goal of addressing theoretical gaps in its online learning, and clarifying its theoretical foundation and scope of application. In the main theoretical Section 3, we begin by presenting a

• novel definition of a coagent network that formalizes the rule  ∆under which the network coagents cooperate.

This rule shows that actions within coagents establish a path of execution, which itself is a new intuitive representation of the network’s overall action in the environment. This then facilitates the writing of the policy gradient as the sum of those of the coagents in Theorem 3.1. Using the concept of path of execution, we achieve

• a much simpler proof than previous works for the general policy gradient theorem (PGT) of coagent networks, in addition to having no assumption on the sharing of parameters.

Last but not least, we show how to extend our framework to the synchronous settings. This includes a generalization of the formalization of the rule  ∆, incorporating the possibility of simultaneous executions of coagents, and thus generalizing the concept of a simple path of execution to that of a directed feedforward graph of execution. Then, the asynchronous case is addressed using the so-called Markov trick, to transform the asynchronous non-Markov MDP to the synchronous Markov MDP by augmenting the state space. This trick was also used in (Kostas, Nota, and Thomas 2020). We use this Markov trick to derive our general PGT for asynchronous networks. As such, we

• generalize the PGT to the (a)synchronous coagent networks framework (no assumption on parameter sharing).

While our framework is general and allows, for example, for cycles and loops in the execution path, it is important to outline precisely where and when this theorem applies. This is exhibited in Section 4, where in addition to exactly clarifying the theoretical constraints, we cite examples, previously known and new ones, such as Feedforward Options Network, and also non-examples for this purpose.

Although our work emphasizes more on theory, we show examples of insights and improvements brought by our theory to the online policy learning algorithm and its experimental implementation. We follow the larger focus of prior works on experimenting with options among all coagent networks. We identify multiple areas of improvements in the previous algorithms derived from the PGT for options networks (Section 5). First and foremost, we realize that in all previous works, the update takes place on all options at each time step, instead of only when each is called back, which is what the PGT states. Correcting this immediately enables

• a faster runtime and learning time on the order of the number of options.

Then, we identify an incorrect update of the termination function that has been used in the literature and argue how it artificially worsens the issue of early option termination.

• We address the early option termination issue by our correction of the termination update and proper tuning of the temperature of termination functions,

thereby avoiding an additional hyperparameter (such as the deliberation cost) in our analysis. Furthermore,

• we argue theoretically how parents of options in the hierarchy are useful as target networks in the Q-learning for the AC algorithm,

(see more in Remark 6 and Appendix F), and we accordingly make stabilizing changes in the algorithm. We experiment extensively to validate our hypotheses on a nonstationary task using the OC/HOC and the new Feedforward Options Network. Overall, these allow us to show that our implementation outperforms the existing literature (Appendices H and I). Finally, we extensively investigate and show the improvements, brought by the changes we made, on the long term stability of training (Appendix J). Lastly, we attach our code as supplementary material for reproducing the experiments with our algorithmic improvements.

We will study policies acting in a Markov Decision Process (MDP): A tuple  (S, {As}s∈S, R, P, γ)where S is the set of states,  Asis the set of actions  a ∈ Asavailable at state s, R : (s, a, s′) → r ∈ Ris the reward function for a transition (s, a, s′), P(s′|s, a)is the transition function, and  γ ∈ [0, 1]is the discount factor.

In the problems we consider, each state contains not only information about the environment but also information about the state of the agent  π, describing something internal to the agent that is relevant to its execution. Thus, we will use the subscript  πin  Sπto denote this augmented set of states, and use  Senvto exclusively refer to states of the environment. In both cases, the exact definition will be dependent on the context. For this part, we will simply use S. Further, a subset  Sinit ⊂ Sis defined as the subset of possible initial states.

At each time-step t, the Markov agent/policy  πacts on state  st ∈ Sby action  at ∈ Astwith probability π(at|st). The state then changes to  st+1with probability P(st+1|st, at)and the reward  rt := R(st, at, st+1)is received by the agent. We define

• the expected return of  πas  Jπ := Eπ,Sinit[�∞t=0 γtrt],

• the value function of  πas  Vπ(s0) := Eπ[�∞t=0 γtrt|s0]

which is the expected return given initial state  s0 ∈ Sinit, • and the action-value function of  πas  Qπ(s, a) :=

image

given initial state-action pair  (s0, a0). On-policy algorithms are one of the main class of algorithms used in online RL to learn the policy with the highest expected return. We can train the parameters  θdescribing  πby ascending the policy gradient  ∇θJπwritten as follows:

image

Here,  d(s0)is the initial distribution for state  s0 ∈ Sinitand d(s|s0)is the discounted state probability of reaching state s from  s0, meaning

image

where  Pπ,S(si+1|si)is the transition probability of  sito si+1given policy  πand transition function P of S. While the equations and theorems in this paper are written for a finite set of states and actions, it is straightforward to generalize them to the infinite case.

In this section, we offer a new perspective on coagent networks by formalizing the rule of execution, benefits of which are shown in (1) deriving our main result, the policy gradient theorem, with an intuitive short proof, (2) generalizing it the (a)synchronous setting, and (3) (re)designing previously known and new coagent networks in the next section. We study an agent  Πcomposed of a network of Markov policies πo, also called coagents, where at each time-step, rules set by the user determine the sequence of policies  (πo1, . . . , πok)that act. Definition 1. Let  Πbe a Markov policy acting within an MDP M with an augmented state space  SΠand action space Axfor  x ∈ SΠ. Let  CΠ = {πo}o∈OΠ, where  πoare Markov policies, indexed with a set of nodes  OΠ, with state and action space  So, Axofor  xo ∈ So. Πis a Markov single reward coagent network with coagent set  CΠif • Every state  x ∈ SΠuniquely determines a coagent  πo1 ∈

CΠwith state  xo1 ∈ So1, which will be the first coagent

in a sequence that will define the action of  Π.

• This sequence is generated using a rule  ∆designed and fixed by the user and not influenced by the parameters, and determines the next acting coagent, along with its input state, using the state  xoand action  uoof the previous coagent:

image

∆can be stochastic if o is the last coagent that performs a primitive action within a stochastic environment. Primitive actions are those that change the state of the environment. In that case,  xo′may not be deterministically

determined. Moreover,  ∆can also be stochastic in the selection of the next coagent  o′. For simplicity, we shall assume that  ∆is only stochastic in the former, and there is no randomness in the selection of  o′.

• Eventually, after finitely many applications of  ∆, a coagent applies a primitive action a that leads to the next state  x′ ∈ SΠ. Note this does not mean that the last coagent’s action is exactly a, as it could contain additional information necessary for the next execution of  Π.

Execution paths.  Π’s actions can be viewed as primitive actions, leading to the policy gradient (1). Alternatively:

Definition 2. An action of  Πis an execution path P = ((xoi, uoi))ki=1where k coagents  πoiactions lead to a prim- itive action a, with the last action  uokcontaining that information.

As a result of this, we can compute  QΠ(x, a)as EP :uokleading to  a[QΠ(x, P)].

Computing reward for coagents. Assume  πoacts on x(0)o by uoand only acts again at  x(l)o, meaning after Πhas gone through l many execution paths. The transition tuple for  πois defined as  (x(0)o , uo, x(l)o ). Simi- lar to how the cumulative reward for  Πis computed using the rewards  R(xt, at, x′t)from its transitions, R can also be used to uniquely define the reward functions for all coagents transitions. Let  Π’s states and primitive actions in the time interval associated to  πo’s transition (x(0)o , uo, x(l)o )be  (x(i), a(i))l−1i=0. The reward  Rofor the tran- sition  (x(0)o , uo, x(l)o )is

image

where  ri = R(x(i), a(i), x(i+1)). Notice that if  πois never reactivated (i.e.  l = ∞), then the reward corresponding to (x(0)o , uo)is an infinite sum.

Remark 1. The reward source of all coagents is the same R, hence the name single reward in our definition. The motivation and necessity of this assumption is explained later in Remark 4.

Convention. Throughout this text, we will use the term Coagent Network (CN) to refer to a Markov single reward coagent network.

Coagent networks have a natural graph representation.

Definition 3. The graph of a CN  Πhas vertex set  OΠand directed edges which shows which coagent could follow the next in some execution path.

Remark 2. A loop in a graph simply means that the coagent can act multiple times in a row. Furthermore, one can start with the most refined decomposition of  Πinto stochastic policies and contract any edge to derive different graph representations. The policy gradient will obviously not change as  Πhas not changed; indeed, as shown in our main theorem, the policy gradient for the new composite coagent decomposes to the sum of the two original ones. For an example of a coagent network graph, see Fig. 1.

Policy Gradient Theorem Let  ∇θJΠbe the policy gradient for  Πwith parameters shared among coagents, written as:

image

where actions are viewed as execution paths P = ((xoi, uoi))ki=1with  uokimplying a primitive action. We may drop  θfrom  ∇θas it is implied from the context.

Theorem 3.1. The policy gradient is the sum of the policy gradient of the coagents:

image

Here,  d(xo, x|x0)is the discounted probability of reaching state x from  x0, i.e.  d(x|x0), multiplied by the probability of reaching  xofrom x within a single execution path.

Proof. Note  Π(P|x) = �o∈P πo(uo|xo)where  o ∈ Pmeans every coagent  oi, 1 ≤ i ≤ k. Rewriting (5):

image

Distributing the derivative over the product leads to �

image

where  P<o, P>oare the parts of P before and after o. The first product over  P<o, summed over all possible P<opaths leading to  xo, gets absorbed by  d(x|x0)and gives us  d(x, xo|x0)by definition. For the product over P>o, summed over all possible  P>opaths with state-action  (xo, uo)for o, gets absorbed by  QΠ(x, P)and gives Qπo(xo, uo). This is proved by iteratively using a Bellmantype equation for  πoin terms of the next coagent in the execution path:

image

where  o′is the next coagent according to  ∆. Applied iteratively, this gives  Qπo(xo, uo) =

image

where the summation is over all paths P that include (xo, uo), and  lP>ois the index of the last coagent in  P>o, which action immediately leads to a primitive action, thus QπlP>o (xlP>o , ulP>o ) = QΠ(x, P).

Remark 3. Our much shorter general proof and definition compared to those of previous works (which also had more assumptions on the model) show the benefit of formalizing the rule  ∆. In comparison, this avoids many of the calculations around the discounted probability distributions identities (Riemer, Liu, and Tesauro 2018; Riemer et al. 2020), as one can use the graphical representation of the network to reason through the algebra by simply following the relevant paths of execution as demonstrated in the proof above.

Remark 4. Note that our proof also uses the single reward source assumption, that R determines all rewards  Ro. Otherwise, if  Rowere to include any ‘pseudo’ or ‘intrinsic’ reward, as in the goal-based HAC model (Levy, Platt, and Saenko 2017), relating  QΠto  Qπoas in (9,10) would not have been possible (see Appendix B for further details).

Remark 5. The application of this theorem to any specific CN consists of (1) identifying the coagents and  SΠ, So, and (2) computing the discounted state action probability d(xo, x|x0)(see e.g. Appendix A for HOC).

(A)Synchronous Coagent Networks Our definition of coagent networks needs to be slightly generalized to include the (a)synchronous setting (Kostas, Nota, and Thomas 2020). In brief, the rule  ∆needs to accommodate simultaneous executions for synchronous CNs. Then, the asynchronous setting is recovered by augmenting the state space.

Synchronous Coagent Network. The rule  ∆of a CN dictates the path of execution. In particular, our definition of a path of execution is a sequence of coagents states and actions. In the synchronous framework, the policy  Πis instead composed of a fixed feedforward network with N layers of coagents where all coagents in the same layer execute simultaneously. The  i−th layer’s states  Xi = (xi,o)o(with o going over all coagents in layer i) is given by its previous layer outputs into o, called  U prei,oand the environment state s, i.e.  xi,o = (U prei,o, s). Hence, a path of execution would be a sequence of sets of coagents where coagents within the same set execute simultaneously, and their collective actions and states determine those of the next set. Therefore, the only change to our definition is in the rule  ∆, to become a function of the form  ∆ : {(xi,o, ui,o)}o → {(xi+1,o′)}o′.To adapt our theorem, we take the following approach. First, formally consider all coagents in the same layer as being one coagent. This means we have a coagent network which is simply one line of coagents  Π1, . . . , ΠNacting one after the other. Here, we are in our original framework and our theorem for this sequence of coagents gives

image

where capital letters are used for the states and actions of Πi, which is composed of those of its constituting coagents {πi,o}o, i.e.  Xi = (xi,o)o, Ui = (ui,o)o. Now our problem is to rewrite the policy gradient of coagent networks such as Πi, as the sum of its simultaneously acting coagents. This requires an analysis similar to Eq. (8), by distributing ddθover the coagents acting simultaneously

image

Similar to our main theorem, we follow this by taking appropriate marginalization over undifferentiated coagents, state-action occupancies, and Q-values (carried out in more details in Appendix D). They yield the desired decomposition to gradients of the coagent policies:

image

Asynchronous Coagent Network. Defined in (Kostas, Nota, and Thomas 2020), an asynchronous coagent network is one where the coagents may act asynchronously. For example, one node at environment time step t may have waited to act based on the actions taken by others in the step  t− 10. One can see the asynchronous setting as a non-Markov version of the synchronous coagent network. More generally, using the Markov trick, i.e. changing the non-Markov MDP to a Markov MDP by augmenting the state space by adding histories to the states, we have: Theorem 3.2. The policy gradient of  Πfor a single reward coagent network that is formed by non-Markov coagents can be written as the sum of the policy gradient of its coagents.

To prove the above for the particular example of asynchronous networks, we first apply the Markov trick, which in this case, is exactly the same state augmentation trick that (Kostas, Nota, and Thomas 2020) employed to translate this problem into a synchronous MDP problem. Then one can write the shared parameter version of the synchronous setting as previously proved. Next, one must show that the synchronous coagents of this new MDP have the same policy gradients as the original ones. Again, this equivalence is achieved by following the same reasoning in (Kostas, Nota, and Thomas 2020, Sec. 5.1, p. 7) which, although done for the non-shared parameter version, is irrelevant of the parameters, and only revolves around the definition of the state augmentation and the two MDPs equivalence (Appendix E).

We discuss herein how our definition easily incorporates previous models and inspires new ones.

1) Hierarchical Option Critic (HOC): For illustrating the expressivity of our definition, let us model an HOC (Riemer et al. 2020) as a CN. An HOC has N levels of hierarchy with  π1, sitting at the first level of hierarchy, as the most abstract option and selecting the first option  o1with probability  π1(o1|s)given a state s.

We view HOC as a tree of options, denoted by ⟨m1, . . . , mN⟩where parents at layer i have  mi+1many children. An option is equipped with a policy  πwhich selects a child and a termination function  β. Each option is uniquely determined by a tuple  o1:j = (o1, . . . , oj)where oi ∈ {1, . . . , mi}, and  0 ≤ j ≤ N − 1with the most abstract option (the root) labelled by  o1:0 := o0. The option  o1:j, sitting at the  j + 1−st level of hierarchy, has a policy  πo1:j(·|s), also denoted by  πj+1(·|s, o1:j)which selects a child  o1:j+1, and termination function  βo1:j(·), also denoted byβj(·, o1:j), which decides to terminate with probability  βj((s, o1:j), 1)and to not terminate with probability βj((s, o1:j), 0) := 1 − βj((s, o1:j), 1). We will abuse notation and denote  βj((s, o1:j), 1)by  βj(s, o1:j). We may also use ‘node’ while referring to an option in this tree.

Denote the HOC policy by  Π. The states  SΠare tuples of the form  (s, o1:i, d)or  (s, o1:i, u). We may simplify the notation by dropping “1 :” from the subscript index and use so,i,dor  so,i,u. so,i,d (so,i,u) means the state of the environment is s, and the algorithm is about to execute  πo1:i (βo1:i), as it is in the d = downward (u = upward) execution mode.

We describe the rule  ∆for HOC. Starting from the root option policy  π1at  so,0,d, each node takes and action, moving from  so,j,d → so,j+1,d. Eventually, a node  o1:N−1selects a primitive action  a = oNchanging the state of the environment to  s′. Then the upward mode is initiated, i.e. s′o,N−1,u. In this mode, termination functions get selected along the previous path and they decide whether to terminate or not. If a node terminates, the parent’s termination function is called. Therefore the state moves from  s′o,j+1,u → s′o,j,uwith probability  βj+1(so,j+1). At some level l + 1, node o1:ldoes not terminate, with probability  1 − βl(so,l)and the mode changes back to d, i.e.  s′o,l,u → s′o,l,d. Then the down- ward execution applies as before until a new primitive action is selected by some  o′1:N−1. Given this rule  ∆, we derive the associated graph representation as shown in Fig. 1.

image

Figure 1: A graph representing the HOC  ⟨1, 2, 2⟩. The policy and termination function of an option each form a coagent. Nodes are labelled by their unique address and type (policy or termination). The termination functions send an edge to their corresponding policies. The policies of the leaf nodes are also followed by their corresponding termination after applying a primitive action.

2) Feedforward Option Network (FON): Our CN framework allows us to consider other examples such as the simple feedforward network where options at level i are connected to all the next ones at level i + 1. This CN is used in our experiments as a toy model to better understand the option properties and how that is affected if parents share children. The rule  ∆is similar to that of HOC. We denote by  ⟨m1, m2, . . . , mN⟩an N layer FON where the i-th layer has  mimany options. In all cases, there is a single root, thus m1 = 1. In our experiments, we tested configurations such as  mi = 1, ∀i, and  ⟨1, 2, . . . , 2⟩.

3 and 4) Options of Interest (Khetarpal et al. 2020) and Stochastic Neural Nets (Gupta et al. 2021; Chung 2021): There are more examples of coagent networks which we discuss further in Appendix C.

Non-examples of CNs: Any model that violates the stochasticity of the coagents or the single reward source assumption could be considered as not directly fitting into our definition and would require examination. Two important models of this nature are the Attention Option-Critic (Chunduru and Precup 2022), violating the stochasticity of the coagents, and Hierarchical Actor-Critic (HAC) (Levy, Platt, and Saenko 2017), violating the single reward source assumption. However, both can in theory be incorporated in our framework if one adds stochasticity, or uses the Markov trick (see Appendix B).

Algorithm 1 shows how to leverage Theorem 3.1 and apply the Actor-Critic (AC) algorithm to learn the options in an HOC/FON network. Below are our improvements and comparisons derived from the theory.

1) Update only on arrival and its runtime advantage. Previous works (Riemer, Liu, and Tesauro 2018; Riemer et al. 2020; Khetarpal et al. 2020; Bacon, Harb, and Precup 2017; Chunduru and Precup 2022) update every parent of the current active option at each time-step. However, Theorem 3.1 only lists the gradients dπodθat states  xo, where πois in the execution path. Thus the more accurate way to update options is to do so only when they are called back. Otherwise, depending on the levels of hierarchy N and average duration of options, the runtime of each time step is increased by a factor of O(N). As demonstrated in the experiments (Appendix H), updating each time step does not lead to less episodes for learning either. Thus, the learning time is also increased by a factor of O(N).

2) Incorrect  βoupdate and early termination. We address a recurring mistake in the literature’s AC algorithm implementation for updating  β, which worsens one of the issues of OC and HOCs called early option termination. As Theorem 3.1 states, for the update of an option o, the term dβodθ (s, a)should be scaled by the  Qβo(s, a). If a = termination, then  Qβo(s,termination) = Vβparent(o)(s), i.e. the value of the higher option termination function at s. Otherwise,  Qβo(s,not termination) = Vπo(s). For the sake of illustration, let us consider the case of the Option-Critic model with root being the parent of o. Then  Vβparent(o)(s) =Vπroot(s), as the root never terminates. Throughout the literature, instead of multiplying the gradient by either  Vπroot(s)or  Vπo(s), the difference of these two terms denoted by

image

Aβo(s) = Vπo(s) − Vπroot(s)is considered. Hence, independent of the action, the update is always

image

where  θβoare  βo’s parameters. In theory, it is clear that Aβo(s)is always negative as  Vπo(s) ≤ Vπroot(s), therefore changing the parameters in the direction of the termination gradient, thus always encouraging early termination. This artificially created problem then becomes the reason behind adding another hyperparameter  η > 0to mitigate this issue (Bacon, Harb, and Precup 2017; Harb et al. 2018). Correcting this mistake, while eliminating the need for an additional hyperparameter  η, does not resolve the early termination issue entirely. Indeed, at the beginning of training, since the lower options get their values updated more frequently, as long as rewards are nonnegative and values are initialized similarly, one can expect the value functions of the lower options to be generally higher than those of their parents. This would push the options to terminate less. However, the rate at which the parents are updated is important and this rate may need to be quite small. As supported by experiments (Appendix I), this can be facilitated by lowering the temperature of termination functions.

3) Using the parent as target network. Compared to previous works, another important modification to the algorithm is the use of the parent of the child for the critics target computation, instead of the child itself. In low temperatures, one needs to fully leverage the parent option as a value target network in the critic update of the child. For an option policy  πo, the usual Q-learning target for  Qπo(s, a)has the following form:

δo ← Qπo(s, a)−�r+γt((1−βo(s(t))) maxa Qπo(s(t), a)+

image

where t is the number of time-steps until o is called back, and r the discounted cumulative rewards accumulated during that time. Action a can be a primitive action or a child of o. Notice that the multiplier of  γtis simply  Vβo(s(t))which is decomposes in two terms by considering the two terminating and not terminating cases. In theory,  maxa Qπo(s(t), a) =Vπo(s(t)) = Qπparent(o)(s(t), o)and so in practice, the parent can serve the role of a value target network. Therefore, one can make the following change in the update

image

Going further, we implement a similar change to “(higher parent terms)” where any  maxa Qπo′ (s(t), a)is replaced by  Qπparent(o′)(s(t), o′)for any  o′in the path to root for o (line 15 of Algorithm 2). As supported by our experiments, we hypothesize that this ensures that  Qπofor the lowest options o do not get too much larger than those of their parents, as the target includes the action value of the parent instead of the option itself. Hence termination probability drops in a less dramatic way, allowing the higher options to learn enough to be able to play their role as a value target and stabilize the learning. Eventually, this allows us to obtain models with longer lasting options (Appendices I and J). A similar improvement for  Qβois discussed in Appendix J. Remark 6. Viewing the parent as a target network is also motivated by studying more deeply the crucial difference between a simple FON model  ⟨1, 1⟩vs  ⟨1⟩. As theoretically justified in Appendix F and experimentally confirmed,  ⟨1, 1⟩turns out to be an AC model which is soft-updated using a value target network (the parent) with rollout at a learnable rate (determined by the termination function of the child). This also explains why in other hierarchical networks, such as HAC, it is unnecessary to use a target network. On the other hand,  ⟨1⟩is exactly a vanilla AC model.

We conduct a number of experiments on a nonstationary stochastic sparse variant of the Four Rooms (FR) task (Riemer, Liu, and Tesauro 2018; Riemer et al. 2020). We invite the reader to Appendix G and the ones after, which, as referred to, confirm many of the discussed insights and include

• A comparison of the performance of OC/HOC/FON to each other and previous works implementations (Fig. 2).

• Experiments on option length and how to main performance while tuning the temperature hyperparameter of βoto increase usage of all options.

• Long runs to check the stability of our algorithm, confirming the hypothesis of wider networks enjoying more stability in the long runs.

image

Figure 2: Our results for the OC model  ⟨1, 2⟩and HOC models  ⟨1, 2, 2⟩ , ⟨1, 2, 2, 2⟩for 5 random seeds. To be compared with the results from (Riemer, Liu, and Tesauro 2018, Fig. 3) for the OC model  ⟨1, 4⟩and HOC models  ⟨1, 2, 2⟩ , ⟨1, 2, 2, 2⟩, where our models outperform their counterparts. We also note how the simplest  ⟨1, 2⟩OC model outperforms all other models.

In this work, we introduce a new intuitive definition of coagent networks, allowing us to design new ones, and illustrate the idea of execution paths to prove policy gradient theorems even when coagents share parameters or act asynchronously. We survey the performance and option properties of well-known coagent networks while making the algorithms more performant, and provide theoretical analysis supporting the improvements. This work attempts to put coagent networks and their algorithms on solid mathematical footing, and future works should focus more on the network design of these models and the suitable tasks that they can be evaluated against. For example, one of the design goals of the options framework is for different options to represent different skills. If so, the training and tasks should accommodate such outcome; having options which focus, or have their attention on different parts of a screen or different parts or phases of the task (Khetarpal et al. 2020; Chunduru and Precup 2022), may not necessarily accommodate this. In addition, further algorithmic improvements should be done to mitigate the issues inherent in training such networks, chief among them being the variance (Gupta et al. 2021), in order to make them competitive against state-of-the-art online RL paradigms.

We would like to thank Vaneet Aggarwal for his suggestions and comments. The first named author would like to acknowledge the support of the Perimeter Institute for Theoretical Physics and Microsoft during his time of working on this research. Research at Perimeter Institute is supported by the Government of Canada through Innovation, Science and Economic Development Canada and by the Province of Ontario through the Ministry of Research, Innovation and Science. The experiments were conducted using Microsoft computational resources.

Bacon, P.-L.; Harb, J.; and Precup, D. 2017. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence.

Chunduru, R.; and Precup, D. 2022. Attention option-critic. arXiv preprint arXiv:2201.02628.

Chung, S. 2021. MAP Propagation Algorithm: Faster Learning with a Team of Reinforcement Learning Agents. Advances in Neural Information Processing Systems, 34: 14734–14744.

Fortunato, M.; Azar, M. G.; Piot, B.; Menick, J.; Osband, I.; Graves, A.; Mnih, V.; Munos, R.; Hassabis, D.; Pietquin, O.; et al. 2017. Noisy networks for exploration. arXiv preprint arXiv:1706.10295.

Gupta, D.; Mihucz, G.; Schlegel, M.; Kostas, J.; Thomas, P. S.; and White, M. 2021. Structural credit assignment in neural networks using reinforcement learning. Advances in Neural Information Processing Systems, 34: 30257–30270.

Harb, J.; Bacon, P.-L.; Klissarov, M.; and Precup, D. 2018. When waiting is not an option: Learning options with a deliberation cost. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Khetarpal, K.; Klissarov, M.; Chevalier-Boisvert, M.; Bacon, P.-L.; and Precup, D. 2020. Options of interest: Temporal abstraction with interest functions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 4444–4451.

Klissarov, M.; Bacon, P.-L.; Harb, J.; and Precup, D. 2017. Learnings options end-to-end for continuous action tasks. arXiv preprint arXiv:1712.00004.

Kostas, J.; Nota, C.; and Thomas, P. 2020. Asynchronous coagent networks. In International Conference on Machine Learning, 5426–5435. PMLR.

Levy, A.; Platt, R.; and Saenko, K. 2017. Hierarchical actor-critic. arXiv preprint arXiv:1712.00948.

Riemer, M.; Cases, I.; Rosenbaum, C.; Liu, M.; and Tesauro, G. 2020. On the role of weight sharing during deep option learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 5519–5526.

Riemer, M.; Liu, M.; and Tesauro, G. 2018. Learning abstract options. In Advances in Neural Information Processing Systems, 10424–10434.

Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2): 181–211.

Thomas, P. S. 2011. Policy gradient coagent networks. Advances in Neural Information Processing Systems, 24.

Thomas, P. S.; and Barto, A. G. 2011. Conjugate Markov decision processes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 137–144.

Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8: 229–256.

In this appendix, we review the graph representation of HOC, and by presenting some useful Bellman equations for the PGT, show how our Theorem 3.1 for HOC is the same PGT as in (Riemer et al. 2020, Theorem 1). Note the list of complicated equations in the next parts are simply derived as part of confirming that our result matches with the literature, and should not be seen as proving Theorem 3.1, which we accomplished already in the short proof in the main text. Alternative graph representation of HOC. It is possible to define the CN matching HOC differently by taking the policy and termination function corresponding to an option  o1:jas a single coagent  κo1:j(Fig. 3). This coagent will act as πj+1(·|so,j)when given a state with execution mode d, and act as the termination function  βj(so,j)given a state with execution mode u. In this case, the graph of the CN is a tree where each edge is doubled, with one edge going up and another going down, and each vertex has a loop. Notice this graph is a contraction of Fig. 1 along the edges connecting the two copies of the binary tree.

image

Figure 3: A graph representing the coagent network of HOC with N = 3 and  mi = 2. The policy and termination function of each option form a single coagent. Nodes are labelled by their unique address.

Bellman equations in HOC. To express the Bellman equations that will be necessary to understand the equivalence of our PGT for HOC and the one in the literature, we first need to clarify our notation for the edge cases:

• The state  so,N,dis always followed by some state  s′o,N,ufor some  s′ ∈ Senv. Notice it is always the case that a primitive action such as  oN, if it were to be informally looked at as an option, immediately terminates after taking place, in other words  βN() ≡ 1. So  s′o,N,uis imme- diately followed by  s′o,N−1,u. Therefore, it is meaningful to take  s′o,N,u = s′o,N−1,u.  so,0,dor  so,0,umeans the highest level of hierarchy  o0at state s, where the first option has not been chosen yet. The highest level policy never terminates. Therefore, as β0() ≡ 0, so,0,uis immediately followed by  so,0,dand as such  so,0,u = so,0,d. This state is followed by  so,1,dfor some option  o1assigned by  π1. • In our summations �yxor products  �yx, if the higher in-dex is smaller than the lower index (x > y), we consider the result to be 1. This convention is helpful as it makes the equations easier to express without specifically accounting for the edge cases. To describe  Q−values of HOC, recall that in standard RL, we tend to differentiate the action value and the state value by using a different notation (Q and V ). Our terminology allows us to unambiguously use the same notation Q() for the value of our augmented states at any level of hierarchy or any state of execution d or u. First, let  r(so,N)be the reward (or expected reward, if the environment reward process is stochastic) given by the environment to the agent which has taken primitive action oNat  s ∈ Senv. Let  γ ∈ [0, 1]be the discount factor. All Q−values will be computed in succession. Starting with the easiest case at the lowest level,  Q(so,N,d) =

image

where P is the transition probability of the environment, that given action  oNat s, the state changes to  s′. Note that Q(so,N,d) = Q(so,N−1,d, oN), i.e. the Q-value of the state-action pair  (so,N−1,d, oN)for the policy  πNof the option o1:N−1.

The equation above is obvious, as the next state is s′o,N,u = s′o,N−1,u, which is in the upward execution phase. Next, computing the value at this state,  Q(so,N−1,u) =

image

Note that the product  (1 − βi(so,i))� �N−1j=i+1 βj(so,j)�is the probability of upward trajectory up to level i + 1. At that level, the decision is to not terminate, which has probability  1 − βi(so,i). Notice  Q(so,N−1,u)can also be interpreted as the state-value for termination  βN−1of option  o1:N−1at state  so,N−1,u. Similar interpretations exist for the Q-values computed later.

Finally, computing  Q(so,i,d)completes the cycle needed to compute  Q(so,N,d), as a ‘Bellman’ equation is a recursive equation. To do so, observe the following:

image

Applied iteratively, one obtains an expression of  Q(so,i,d)in terms of  Q(so,N,d), which was already computed.

In similar fashion, one can compute the Q value at all other states. To do so, we shall derive the rewards for any node using Eq. (4):

image

Then, using the transition probability  Pπ(s′o,N,u|so,l,d) =

image

it follows

image

Notice in both equations above,  s′o,N,ucan be replaced by  s′o,N−1,u. Further, for all states with upward execution mode, one can write  Q(so,l,u) =

image

When the algorithm is at  so,l,u, it will begin executing upward and reach some state  so,i,u, for some  0 ≤ i ≤ l(with probability �lj=i+1 βj(so,j)), where it decides to exe- cute downward with probability  (1 − βi(so,i)), i.e. the state changes to  so,i,d. The above equation multiplies these probabilities by the value  Q(so,i,d)of the state  so,i,dit starts executing downward at.

Remark 7. In the literature (Sutton, Precup, and Singh 1999; Bacon, Harb, and Precup 2017; Riemer, Liu, and Tesauro 2018),  QΩ(s, o1:i)generally corresponds to our  Q(so,i,d)while  QU(s, o1:i)generally corresponds to  Q(so,i,u); in the edge case of i = N, this correspondence could become a bit imprecise, due to the fact that the literature notation accounts for those cases separately by introducing new notations. Finally,  VΩ(s)corresponds to  Q(so,0,d).

Remark 8. The equations (18) and (15) give the following in the case of l = 2, N = 2:

image

which can be recognized as the equations for  QΩin terms of  QU, Uas written in (Bacon, Harb, and Precup 2017, Eqs. 1-2).

Policy gradient theorem for HOC. Below, we write the PGT theorem for HOC to motivate the next notations and probability computations. In a series of steps, we will identify this PGT with (Riemer et al. 2020, Theorem 1). For an initial state  (s0)o0,N−1,dat the lowest level of the tree at node  o1:N−10, we have

image

There are a few notations that need to be defined:

 µ(s′, so,N−1,d|(s0)o0,N−1,d)is the discounted probability of starting at  (s0)o0,N−1,d, reaching the state so,N−1,d, which after  πNexecution is followed by s′o,N−1,u. It does not directly correspond to  d(xo, x|x0)in Theorem 3.1, but it will be shown to be related.

 Pβ,π(s′o′,j−1,d|s′o,j−1,u)is the transition probability from  s′o,j−1,uto  s′o′,j−1,d, which is dependent on the ter- mination functions  βand policies  π. It is the probability of the algorithm reaching  s′o′,j−1,dat level j, after finish- ing its upward phase of execution (which started at the same level at  s′o,j−1,u) and going down again to level j (see Eq. (24)).

 Qβl(so,l, 0), Qβl(so,l, 1)are the action value functions of

βlfor option  o1:lat  so,l. Action 1 denotes termination and

0 denotes the opposite (see Eq. (33)).

The notations that match with the reference have also the same meaning, for example  Pβ,π(·|·), µ(·|·)or the advantage A(). As will be shown in equations (37-39), we can interpret the above sum as the sum of the policy gradients of  πjand  βjof the options, as their similarity to the usual policy gradient �x d(x|x0) �adπdθ (a|x)Q(x, a)in (1) can be observed. Below we will make steps towards proving this interpretation.

Let us rewrite (19) so that the  Q−value on the right-hand side is evaluated at states with the same level as the left-hand side. To do so, apply (15) repeatedly on  Q(so,i,d)to get to Q(so,l,d), which leads to the following when  l ≤ N:

image

where  1o′1:i=o1:iis one if  o′1:i = o1:iand zero otherwise. The term multiplied by  Q(so′,l,d)has an independent meaning, which will allow to make this equation more compact. Let  Pβ,π(so′,l,d|so,l,u)be the transition probability from so,l,uto  so′,l,d, which is dependent on the termination functions  βand policies  π. Hence, this probability solely depends on the model and not the environment. It is the probability of the algorithm executing downward at  o′1:lat level l + 1, after finishing its upward phase of execution (which started at level l+1) and coming down again to level l+1. It is easy to see that this probability is the same multiplicative term in the equation above:

image

One can generalize this to compute other useful transition probabilities  ∀l ≤ m ≤ N:

image

Using the above, equation (23) becomes:

image

Next, let us define  µ(s′o′,l−1,d|so,l−1,d), which is the sum of discounted probabilities for getting from  so,l−1,dto s′o′,l−1,d. We will compute two transition probabilities. Us- ing (17) to write the transition probability for  so,l,d →s′o,N−1,u:

image

Also, for  so,l−1,d → s′o,l−1,u:

image

Next, define the discounted one-step and k-steps recursively

as follows  ∀l ≤ N:

image

The definition for  µ()can be written for different transitions. The one needed for HOCPGT is:

image

holding  ∀l ≤ N. Let us now define the advantage as follows:

image

The advantage answers this question: If one is able to choose, then how much advantageous it is to start executing downward from  so,l, than to change the higher level options and try a different set of options  so′,l? The difference above in the values determines the advantage of this choice. Indeed, just like the first term  Q(so,l,d)is the value of not terminating, the second term is the value of terminating. Thus, alternatively, one can write:

image

using the action value functions of  βl. It is clear that

image

where 1 denotes the termination action, and 0 the opposite action. These actions have probabilities:

image

We are now ready to state the rewritten HOCPGT as written by (Riemer et al. 2020). Theorem A.2. (Riemer et al. 2020, Theorem 1)

image

where  µ(s′, so,N−1,d|(s0)o0,N−1,d)is the discounted probability of starting at  (s0)o0,N−1,d, reaching the state so,N−1,d, which after  πNexecution is followed by  s′o,N−1,u.

Remark 9. The last term above might seem different from the one in the reference (Riemer et al. 2020, Theorem 1), but they are the same. Our convention, motivated from our point of view, is to sum over each node  o′1:jindividually, while in the reference, the last node  o′1:N−1at the bottom of the tree is taken and the sum is over its parents.

We now show how to identify our main theorem application to HOC with the above. Let us denote  x0 =(s0)o0,N−1,dand  x = so,N−1,d. Recall the HOC policy is denoted by  Π. Then the first term in the bracket above can be written as:

image

where  dΠ(x|x0)is the discounted transition probability for a policy  Π, for reaching x from  x0. Notice there is no  s′, as dπNdθ (oN|so,N−1)Q(so,N,d)does not depend on  s′, and so the summation over  s′averages out this outcome state. Also note that  Qo1:N−1(x, oN)is the Q-value of  πN(·|o1:N−1)policy of the node  o1:N−1, hence equal to  Q(so,N,d). Therefore, the first term is simply the policy gradient of the lowest level node.

For the second term, for a given node  o1:lfor  1 ≤l ≤ N − 1, the term  γ� �N−1k=l+1 βk(s′o,k)�can be ab-

sorbed into the discounted transition probability  µ()to create  dΠ(x′o,l, x|x0)where  x′o,l = s′o,l,u. This is the discounted transition probability from  x0 → xin any number of execution paths, and from  x → x′o,lwithin the same execution path. Remark 10. Notice that an execution path in this setting, is an execution from  πNfollowed by a path upward, then a path downward until before the next execution of  πNtakes place. Thus, as the path involves an environmental step due to the execution of  πNat the beginning of the path, the discounted transition probability takes a discount factor  γ.

The above remark applies when going from  dΠ(x|x0)to dΠ(x′o,l, x|x0), as is the case in  γ� �N−1k=l+1 βk(s′o,k)�. Furthermore, using (35),

image

From the above, using (33) and (34):

image

Similar to (37), one can see that (38) is the usual policy gradient theorem for the termination policies  βlat  o1:l. Finally, for the last term, for each  πjat  o′1:j−1:

image

Where dΠ(x′o′,j−1, x|x0)is the discounted transition probability from  x0 → xin any number of execution paths, and from  x → x′o′,j−1 = s′o′,j−1within the same execution path. The factor γ� �N−1k=j βk(s′o,k)�Pβ,π(s′o′,j−1,d|s′o,j−1,u)is what describes the discounted probability of this last transition within the same execution path.

Remark 11. Notice our proof does not involve any further state-augmentation. This is in contrast to the proof in (Riemer et al. 2020, App. C), where something called the termination vector T is defined and the graph structure is completely forgotten to be able to cast HOC into a synchronous coagent network (only two coagents). This shows that our framework can provide more intuitive ways for computing the policy gradient of coagent networks.

Attention Option-Critic (AOC). The AOC model (Chunduru and Precup 2022) is an OC augmented with attention mechanisms  hωfor each option  ω. The state  sω, input for  πωand its termination function  βω, is obtained as follows:

• Take the environment state s along with the address of the option and the mode of execution (like  so,2,d),

• change the environment state s to  sω = hω(s) ⊙ swhere hω(s)has the same shape as s with values in [0, 1], and ⊙is the Hadamard multiplication.

The motivation behind this is to focus each option on parts of the states that are relevant. This model leverages attention to show that it is possible to achieve useful options with reasonable lengths.

The AOC learning algorithm can not be recast as a policy gradient of a CN. The main obstacle is the attention mechanism, which is deterministic, and thus is not a coagent. It is learned using gradients of the option action value.

However, following Remark 12, if the attention mechanisms were stochastic, then we could apply the PGT as in Theorem 3.1. The coagents in this case are the policies, terminations and the attention mechanisms. The rule  ∆is similar to that of HOC, where we let the attention corresponding to an option choose an action when execution mode is d.

Hierarchical Actor-Critic (HAC). From the point of view of Definition 1, there are two ways that HAC (Levy, Platt, and Saenko 2017) fails to be a CN.

First, we recall how HAC functions. HAC is a hierarchical goal-based model of policies  π1, . . . , πNwhere each policy selects a goal  gito be achieved by  πi+1. Each policy has a fixed time horizon of K steps in which it has the chance to achieve the goal. It terminates once it reaches the goal or after K many executions, and the parent selects a new goal. As a result, the policy  πiacts again after at most  KN−imany time-steps. The reward given to a policy is based on how close the current state is to the goal, which is measured by a metric defined by the user. The edge case of  g0(the goal for  π1) is either set by the user or simply not chosen, in which case the reward for  π1is just the environmental reward. In practice, the reward for  πiis a linear combination of the goal reward and the environmental reward. Clearly, for the more abstract policies the environmental reward should have more weight. In any case, it is clear that the policies do not have a single reward source (Remark 4).

Another CN assumption that HAC violates is the Markov property. This difference can be resolved, at least in theory. If one considers, say, the mid-level policy of a  3−level HAC, its action depends on the subgoal  g1it has received from the highest policy for a fixed  K2executions paths (environmental time steps) or until it reaches the subgoal. During each K executions of the  K2executions, the mid-level policy is only called once by the lowest policy trying to reach the subgoal g2assigned by the mid-policy during that K time-steps. The mid-policy has to remember the subgoal  g1it was assigned to in the first execution path for the next  K2execution paths. Note that such non-Markovianness is also present in the case of asynchronous firing in (Kostas, Nota, and Thomas 2020) as remarked in the main text. This is not an insurmountable theoretical barrier in viewing these models as CNs, at least if one is willing to augment the state space by including all histories. In the case of goal-conditioned models, one can provide the information of all selected subgoals into the state. Thus, more generally, by the same Markov trick, the Markov condition in Definition 1 can be lifted.

The freedom in the graph structure and parameter sharing pattern, could allow us to exploit relational biases in a given environment/task to build more powerful RL agents. We mention a general but customizable coagent network, followed by two other examples in the literature.

1) Coagent networks with cycles or loops. Let us consider a cycle for the CN graph and design a corresponding rule  ∆. The nodes are labeled by  onfor  n ∈ {1, . . . , N}, where each node, depending on the input, makes a primitive action or sends some information via its action  unto the next node (with  oNsending to  o1). The description of the rule  ∆is as follows: one assigns each state s of the environment two integers  ns ∈ {1, . . . , N}and  ls ∈ N, which means the node  onsis the one that performs a primitive action at environment state s after s has passed through  lsmany cycles. More generally, one can imagine N decision functions fn,s : Sπon → {1, 0}for any  s ∈ Senvand  n ∈ {1, . . . , N}, where 1 means primitive and 0 non-primitive. Depending on the input of the node, after finitely many coagent executions, we have  fn,s(xon) = 1. This should be guaranteed by the programmer to happen to avoid infinite loops. This forces the node  onto apply a primitive action at  xon, which may contain s as the information regarding the environment.

2) Options of Interest. One of the structures that one can add to the HOC is an interest function  Iωfor each option ω(Khetarpal et al. 2020). One interpretation of this function is that it introduces a deterministic part in the process of the selection of  ωby its parent  Ω, which is unique to  ωand disjoint from the process of selection of the other children. This modularity has an effect similar to attention in Attention Option-Critic, discussed in Appendix B. We can still apply Theorem 3.1 by combining the parent with the interest functions of its children, i.e. the new coagent being πnewΩ (ω|s) = Iω(s)πoldΩ (ω|s). Then Theorem 3.1 applied to  {πnewo , βo}options owill give the same PGT in (Khetarpal et al. 2020).

3) Stochastic Neural Nets. Stochastic Neural Nets, in their full generality, have stochastic activation functions and/or weights. In all cases, each neuron forms a coagent with its parameters being the parameters of the activation function along with the weights connected to the neuron that compute its input. A more refined decomposition of the agent to coagents is possible if the weights are stochastic, in which case, every weight is also a coagent by itself. Notice that although our discussion has been mainly around RL applications, we can apply our results to the supervised learning setting by simply using a discount factor  γ = 0, although we naturally expect supervised learning algorithms to be superior in their own field of application. Remark 12. As a general idea, one could inject stochasticity into deterministic (parts of) RL models and apply the coagent network policy gradient theorem, and potentially obtain better results, as demonstrated in the noisy DQN model (Fortunato et al. 2017).

We finish the proof by explaining the marginalizations mentioned after Eq. (12). We first note that the marginalization of  QΠi(Xi, Ui)over  (ui,o′)o′̸=ogiven by the weighting �o′̸=o πi,o′(ui,o′|xi,o′), gives the Q- value  Q(Xi, ui,o)as the latter is naturally defined as �Ui�o′̸=o πi,o′(ui,o′|xi,o′)QΠi(Xi, Ui), where the sum is over all  Uisuch that o’s action is  ui,o. However, according to PGT, we should compute  Qπi,o(xi,o, ui,o), which can be done as follows

image

The probability  P(Xi|xi,o)is the probability that within the same time step of  xi,o, the state of  Πiis  Xi. This probability is dependent on the environment MDP and  Π. One can recover this probability from the discounted state action occupancy as follows:

image

The last two equalities require further explanation. We recall the definition of the discounted state action occupancy

image

Here  P (k)is the state action occupancy corresponding to the k-th step, its value depending on  k, Π, and the MDP M. We need to show  P (k)((xi,o′)o′̸=o|xi,o) = P((xi,o′)o′̸=o|xi,o), i.e. that it is independent of k. Notice that the policy  Π’s internal actions and outputs does not depend on k, but on s; and while the distribution of s itself indeed depends on (Π, M, k), we note that the state s is already given in xi,o = (U prei,o, s). Thus, the term  P (k)((xi,o′)o′̸=o|xi,o)is in- dependent of k.

Finally, after marginalizing  Q(Xi, ui,o)over  (xi,o′)o′̸=o, we reintroduce the one over  x0, x:

image

For thoroughness, we explicitly state the changes that one has to make to the argument in (Kostas, Nota, and Thomas 2020) in order to show that the PGT of the synchronous CN obtained after state augmentation simplifies to the the asynchronous coagents policy gradients. As mentioned in the main text, the argument around this has no relevance to parameter sharing. Therefore, as illustrated below, we note that we simply have replaced the nonshared parameters  θiby  θin the argument in (Kostas, Nota, and Thomas 2020, p. 7), and summed over i for the policy gradients  ∇J, ∇ `J:

“[...] Having shown that the expected return in the asynchronous setting is equal to the expected return in the synchronous setting, we turn to deriving the asynchronous local policy gradient,  ∆i. It follows from J(θ) = `J(θ)that  ∇J(θ) = ∇ `J(θ). Since  `πis a synchronous, acylic network, and `M is an MDP, we can apply the CPGT to find an expression for  ∇ `J(θ). For [shared parameter] synchronous network, this gives

image

Consider  ∂ ln�`πi(( `St, `U pret ), `Ut, θ)�/∂θ, which we

abbreviate as  ∂`πi/∂θ. When `U pret .ei = 0, we know that the action is `U it = `St.ui = `U it−1regardless of  θ. Therefore, in these local states,  ∂`πi/∂θis zero. When `U pret .ei = 1, we see from the definition of  `πthat ∂`πi/∂θ=∂πi/∂θ. Therefore, we see that in all cases, ∂`πi/∂θ=( `U pret .ei)∂π/∂θ. Substituting this into the above expression yields:

image

In the proof that  J(θ) = `J(θ)given in Section C of the supplementary material, we show that the distribution over all analogous random variables is equivalent in both settings (for example, for all  s ∈ S, Pr(St = s) = Pr( `St.s = s)). Substituting each of the random variables of M into the above expression yields precisely the asynchronous local policy gradient,  ∆i.”

We want to understand the algorithm for  ⟨1, 1⟩by building it piece by piece starting from an AC algorithm and changing it where appropriate. For the sake of illustration, we will assume that we have a fixed value of termination 1nfor some n > 0. This means the option terminates every n steps. This is also somewhat of a reasonable assumption for ⟨1, 1⟩, as observed in some of the FR experiments, where unless termination temperature is low, the termination function (more or less) stays constant throughout the training of ⟨1, 1⟩(Figs. 6 and 7).

First, note that the root has one action (selecting the child), and therefore, its action value is the value function of the children. Therefore, let us take a vanilla AC algorithm and add a value target network computing  Vtargetfor the algorithm, where:

(a) It is only updated every  n (= ( 1n)−1) steps using its own rollout update rule

image

(b) The target is used to soft-update the Q-value of the AC, where instead of choosing  Vtarget(s′)to build

the target of Q(s, a), we choose 1nVtarget(s′) + (1 −1n) maxa Q(s′, a), meaning

image

instead of

image

2. Unless mentioned otherwise, we take 50 runs with five to ten different random seeds for 50000 episodes, and show the 500 moving average of the quantity of interest, which is either the number of steps to the goal or the option lengths.

No target network used. As mentioned previously and theoretically justified in Appendix F, each parent is effectively acting as a value target network with rollout.

Hyperparameters. The hyperparameters in all experiments are the same, unless mentioned explicitly otherwise. We choose  γ = 0.99, termination and actor temperature 1 and 0.01, and termination, critic and actor learning rates αβ = 0.001, αQ = 0.01and  απ = 0.00001, respectively. More details can be found in the code.

Analysis and comparison of results. Our OC/HOC models generally outperform their previous versions in the literature as shown in Fig. 2. The experiments also demonstrate that increasing the number of lowest options is not always beneficial (Fig. 4). This finding is confirmed by (Riemer, Liu, and Tesauro 2018), but as further illustrated in Fig. 4 for HOC and FON models, more options simply means slower learning for this task, and not a reduction in the best possible performance. However, taking into account the efficiency of our algorithm as discussed in Section 5, the training time for HOC with 4 levels to reach score below 150 in 100000 episodes is less than that of (Riemer, Liu, and Tesauro 2018) (reached in 50000 episodes).

image

A main issue in option models has been the length of the options, with the model converging to one option dominating, or most options being of very small length (high termination probability). In the incorrect update model of  β(as discussed in Section 5), a proposal was to introduce a deliberation cost ηto the advantage function in the update rule.

Tuning the termination temperature. Here, we propose a simple method which does not require any additional hyperparameter or architectural modification. We examine the effect of lowering the temperature of the sigmoid function used in computing  β. The idea is that as action values are initialized with 0s, the termination probabilities initially drop. Then, due to the numerics of a sigmoid function with low temperature, it should become numerically harder for said probabilities to rise, thus contributing to lengthier options.

image

Figure 5: We compare the results of some simple FON models. The best model, in terms of architectural simplicity, convergence time and eventual performance is the simplest model  ⟨1, 1⟩. This also outperforms the previous HOC models in Fig. 2. Note that only two options, a root and a child are present in this model, and the primitive actions are executed by the child. During learning, the presence of another option (the root) is essential. This is further discussed in F and also supported by the result of the AC model in (Riemer, Liu, and Tesauro 2018, Fig. 3).

However, this reduces the update frequency of higher options. Over a long period of time, one wonders if the higher options can learn their value function as well, which requires experimentation.

⟨1, . . . , 1⟩and  ⟨1, 2, . . . , 2⟩FON models with low temperature. The models we experiment with have termination temperature 0.05, compared to the normal 1.0 in the previous section. Option lengths dramatically increase (Figs. 6 and 7), spanning a significant part of the episode length while leaving room for other options to be used/called back as well. So no option in the  ⟨1, 2, . . . , 2⟩FON models becomes dominant, as also shown by their almost equal length, meaning both lowest options are active participants in the task. Fig. 8 shows that lowering the temperature slightly degrades the performance in 50000 episodes but longer runs may close this gap.

Stability and option lengths. The instability of RL algorithms is well-known, especially in nonstationary settings. We run the experiments for ten times longer and with up to seven levels of hierarchy, and observe in Figs. 9 and 10 that increasing the width correlates with stability while preserving the lowest level options’ lengths. Also, instability in those models correlate with spikes in said lengths. We hypothesize that as the reward function changes dramatically from episode to episode (due to the goal changing), it can be easier for the model to be stable in the long term if there is more width to the network.

To test the stability, we take runs of 500000 episodes on a single random seed, for FON models of the form  ⟨1, . . . , 1⟩

image

Figure 6: Option lengths compared for the  ⟨1, 2, . . . , 2⟩models with low and normal temperature. Options are labelled by {0, . . . , O} where 0 denotes the root, 1, 2 the second and third options at the first level, ... , and  O − 1, Othe lowest two options. No option becomes dominant as their length is almost equal and around half of the episode length.

and  ⟨1, 2, . . . , 2⟩in Fig. 9. We observe how the stability of the algorithm in the long term is generally better for the ⟨1, 2, . . . , 2⟩models. As mentioned, we can see how the crashes in performance, which are spikes in the graphs, are correlated with those present in the option length of the lowest option(s) as shown in Fig. 10. Note that none of the lowest options are dominant and they are also lengthy in their uninterrupted execution, confirming Fig. 6 for deeper models.

Estimating  Qβousing the not-terminated ancestor. Similar to how parents were used as target networks in Section 5 for the policies  πo, the same approach applies for termination functions as  Qβo(s,termination) = Vβparent(o)(s). Considering that parent(o) may have also terminated in the execution path, one could the same value for grandparents of o, and so on, up to the ancestor that has not terminated. Based on our experiments, this last option is a better estimate of the true  Qβo(s,termination), as that is the beginning of the next phase of execution and thus, is a closer estimate to the next collected rewards (line 36 of Algorithm 2).

image

Figure 7: Option lengths compared for the  ⟨1, . . . , 1⟩models with low and normal temperature. Options are enumerated by {0, . . . , O} where 0 denotes the root and O the lowest option. Normal temperature leads to almost constant (very small) length throughout the entire training, while low temperature leads to a stable increase and eventual stabilization. Even though the option length for intermediary options has almost attained the episode length, they and the root are called back (used at least twice in an episode), as their child option length is smaller than the episode length (above 150 as shown in Fig. 8).

image

Figure 8: Lower temperature seems to slow the convergence to the best possible performance in both  ⟨1, . . . , 1⟩and  ⟨1, 2, . . . , 2⟩FON models.

image

Figure 9: Performance comparison of  ⟨1, . . . , 1⟩and  ⟨1, 2, . . . , 2⟩FON models with the same number of levels of hierarchy (two to seven). General stability increases as we switch from the former to the latter model. Note that almost all deeper models achieve the best performance of 125 steps as in Fig. 5.

image

Figure 10: Lowest option(s) length comparison of  ⟨1, . . . , 1⟩and  ⟨1, 2, . . . , 2⟩FON models with the same number of levels of hierarchy (two to seven). General stability increases as we switch from the former to the latter model. The lowest options of the latter model are enumerated by 1 and 2. We observe how both of the lowest options are almost equally used, with a substantial length, confirming the previous results in Fig. 6 for even deeper models. Note the correlation between option length spikes and performance crashes in Fig. 9.


Designed for Accessibility and to further Open Science