Coagent Networks Revisited

2020·Arxiv

Abstract

Abstract

Coagent networks formalize the concept of arbitrary networks of stochastic agents that collaborate to take actions in a reinforcement learning environment. Prominent examples of coagent networks in action include approaches to hierarchical reinforcement learning (HRL), such as those using options, which attempt to address the exploration exploitation trade-off by introducing abstract actions at different levels by sequencing multiple stochastic networks within the HRL agents. We first provide a unifying perspective on the many diverse examples that fall under coagent networks. We do so by formalizing the rules of execution in a coagent network, enabled by the novel and intuitive idea of execution paths in a coagent network. Motivated by parameter sharing in the hierarchical option-critic architecture, we revisit the coagent network theory and achieve a much shorter proof of the policy gradient theorem using our idea of execution paths, without any assumption on how parameters are shared among coagents. We then generalize our setting and proof to include the scenario where coagents act asynchronously. This new perspective and theorem also lead to more mathematically accurate and performant algorithms than those in the existing literature. Lastly, by running nonstationary RL experiments, we survey the performance and properties of different generalizations of option-critic models.

1 Introduction

Background and related works. Use of stochastic neural networks is commonplace in reinforcement learning (RL) to execute stochastic actions in the environment. However, hierarchical approaches to RL often conceptualize a notion of an abstract action that is not applied in the environment directly, but rather happens within the mind of an agent in order to decompose the problem. Coagent networks (Thomas and Barto 2011) formalize this learning problem where stochastic actions are taken within an agent’s mind in the general case. Specifically, hierarchical RL (HRL) models such as Option-Critic (OC) (Bacon, Harb, and Precup 2017) and Hierarchical OC (HOC) (Riemer, Liu, and Tesauro 2018) attempt to leverage abstract actions to learn complicated tasks. This requires networking multiple stochastic policies that learn and act cooperatively to maximize the return from the environment. How to learn coagent networks was first studied in (Thomas 2011), wherein it was shown that the more biologically plausible approach to learning, i.e. REINFORCE (Williams 1992), is an unbiased estimator of the policy expected return when applied individually on each coagent. In other words, each coagent trained separately with REINFORCE on their own return is equivalent to REINFORCE on the entire network policy. (Kostas, Nota, and Thomas 2020) further generalized this to coagents acting (a)synchronously. However, all previous works assumed a disjoint set of parameters describing each coagent. This non-sharing assumption was lifted in the specific context of HOC (Riemer et al. 2020), with the the more advanced Actor-Critic algorithm implementation.

So far, we have discussed prior and related works on the theory of coagent networks. However, since the advent of coagent networks and its options variants, there have been continued efforts on experimenting with these models. The coagent policy gradient theorem (PGT) has been investigated by (Gupta et al. 2021) to train stochastic feedforward neural nets, where it was shown that while such networks struggle to achieve optimal results in supervised settings, they are more performant in non-iid settings compared to backprop. On the other hand, (Chung 2021) proposes an alternative to the REINFORCE called the MAP propagation algorithm on the same type of feedforward networks. This reduces the variance of updates, albeit decreasing computational efficiency and increasing bias in their experiments in OpenAI’s Gym. The main advantage of coagent networks framework is the flexibility to design the network and the freedom of choosing the rules by which the network coagents operate. This is a powerful tool that, in theory, should allow the user to exploit the relational biases in the task by reflecting them in the network. One such famous design is that of options, which tries to exploit a decomposition of the optimal policy to multiple skills or phases, and it has by far garnered the most attention by the community. They have been evaluated under famous RL tasks, but also multi-tasking, nonstationary, and non-iid datasets. The popular OC has succeeded when applied to Q-learning on Atari (Bacon, Harb, and Precup 2017), or to continuous action spaces (Klissarov et al. 2017) and asynchronous parallelization (Harb et al. 2018). (Riemer, Liu, and Tesauro 2018; Riemer et al. 2020) further improved OC’s results under multi-tasking on Atari games by introducing the HOC architecture, and showed superior performance in nonstationary tasks. In addition, properties and “qualities” of options has been an active area of research. One prominent issue is that of option length, where some become dominant while others become useless. (Harb et al. 2018) formulates an answer to what a good option is through the notion of deliberation cost, therefore adding a cost to options that deliberate for too long. In addition (Khetarpal et al. 2020; Chunduru and Precup 2022) develop models of interest and attention to keep the focus of options on certain (less overlapping) areas of state space, thereby encouraging skill decomposition by options.

Contributions and novelties. Now, we discuss a summary of main contributions and novelties of this work, on theory, algorithms, and experiments. We first revisit the notion of a Coagent Network with the goal of addressing theoretical gaps in its online learning, and clarifying its theoretical foundation and scope of application. In the main theoretical Section 3, we begin by presenting a

• novel definition of a coagent network that formalizes the rule under which the network coagents cooperate.

This rule shows that actions within coagents establish a path of execution, which itself is a new intuitive representation of the network’s overall action in the environment. This then facilitates the writing of the policy gradient as the sum of those of the coagents in Theorem 3.1. Using the concept of path of execution, we achieve

• a much simpler proof than previous works for the general policy gradient theorem (PGT) of coagent networks, in addition to having no assumption on the sharing of parameters.

Last but not least, we show how to extend our framework to the synchronous settings. This includes a generalization of the formalization of the rule , incorporating the possibility of simultaneous executions of coagents, and thus generalizing the concept of a simple path of execution to that of a directed feedforward graph of execution. Then, the asynchronous case is addressed using the so-called Markov trick, to transform the asynchronous non-Markov MDP to the synchronous Markov MDP by augmenting the state space. This trick was also used in (Kostas, Nota, and Thomas 2020). We use this Markov trick to derive our general PGT for asynchronous networks. As such, we

• generalize the PGT to the (a)synchronous coagent networks framework (no assumption on parameter sharing).

While our framework is general and allows, for example, for cycles and loops in the execution path, it is important to outline precisely where and when this theorem applies. This is exhibited in Section 4, where in addition to exactly clarifying the theoretical constraints, we cite examples, previously known and new ones, such as Feedforward Options Network, and also non-examples for this purpose.

Although our work emphasizes more on theory, we show examples of insights and improvements brought by our theory to the online policy learning algorithm and its experimental implementation. We follow the larger focus of prior works on experimenting with options among all coagent networks. We identify multiple areas of improvements in the previous algorithms derived from the PGT for options networks (Section 5). First and foremost, we realize that in all previous works, the update takes place on all options at each time step, instead of only when each is called back, which is what the PGT states. Correcting this immediately enables

• a faster runtime and learning time on the order of the number of options.

Then, we identify an incorrect update of the termination function that has been used in the literature and argue how it artificially worsens the issue of early option termination.

• We address the early option termination issue by our correction of the termination update and proper tuning of the temperature of termination functions,

thereby avoiding an additional hyperparameter (such as the deliberation cost) in our analysis. Furthermore,

• we argue theoretically how parents of options in the hierarchy are useful as target networks in the Q-learning for the AC algorithm,

(see more in Remark 6 and Appendix F), and we accordingly make stabilizing changes in the algorithm. We experiment extensively to validate our hypotheses on a nonstationary task using the OC/HOC and the new Feedforward Options Network. Overall, these allow us to show that our implementation outperforms the existing literature (Appendices H and I). Finally, we extensively investigate and show the improvements, brought by the changes we made, on the long term stability of training (Appendix J). Lastly, we attach our code as supplementary material for reproducing the experiments with our algorithmic improvements.

2 Preliminaries

We will study policies acting in a Markov Decision Process (MDP): A tuple where S is the set of states, is the set of actions available at state s, is the reward function for a transition is the transition function, and is the discount factor.

In the problems we consider, each state contains not only information about the environment but also information about the state of the agent , describing something internal to the agent that is relevant to its execution. Thus, we will use the subscript in to denote this augmented set of states, and use to exclusively refer to states of the environment. In both cases, the exact definition will be dependent on the context. For this part, we will simply use S. Further, a subset is defined as the subset of possible initial states.

At each time-step t, the Markov agent/policy acts on state by action with probability . The state then changes to with probability and the reward is received by the agent. We define

• the expected return of as ,

• the value function of as

which is the expected return given initial state , • and the action-value function of as

given initial state-action pair . On-policy algorithms are one of the main class of algorithms used in online RL to learn the policy with the highest expected return. We can train the parameters describing by ascending the policy gradient written as follows:

Here, is the initial distribution for state and is the discounted state probability of reaching state s from , meaning

where is the transition probability of to given policy and transition function P of S. While the equations and theorems in this paper are written for a finite set of states and actions, it is straightforward to generalize them to the infinite case.

3 Coagent Network

In this section, we offer a new perspective on coagent networks by formalizing the rule of execution, benefits of which are shown in (1) deriving our main result, the policy gradient theorem, with an intuitive short proof, (2) generalizing it the (a)synchronous setting, and (3) (re)designing previously known and new coagent networks in the next section. We study an agent composed of a network of Markov policies , also called coagents, where at each time-step, rules set by the user determine the sequence of policies that act. Definition 1. Let be a Markov policy acting within an MDP M with an augmented state space and action space for . Let , where are Markov policies, indexed with a set of nodes , with state and action space for is a Markov single reward coagent network with coagent set if • Every state uniquely determines a coagent

with state , which will be the first coagent

in a sequence that will define the action of .

• This sequence is generated using a rule designed and fixed by the user and not influenced by the parameters, and determines the next acting coagent, along with its input state, using the state and action of the previous coagent:

can be stochastic if o is the last coagent that performs a primitive action within a stochastic environment. Primitive actions are those that change the state of the environment. In that case, may not be deterministically

determined. Moreover, can also be stochastic in the selection of the next coagent . For simplicity, we shall assume that is only stochastic in the former, and there is no randomness in the selection of .

• Eventually, after finitely many applications of , a coagent applies a primitive action a that leads to the next state . Note this does not mean that the last coagent’s action is exactly a, as it could contain additional information necessary for the next execution of .

Execution paths. ’s actions can be viewed as primitive actions, leading to the policy gradient (1). Alternatively:

Definition 2. An action of is an execution path P = where k coagents actions lead to a prim- itive action a, with the last action containing that information.

As a result of this, we can compute as leading to .

Computing reward for coagents. Assume acts on and only acts again at , meaning after has gone through l many execution paths. The transition tuple for is defined as . Simi- lar to how the cumulative reward for is computed using the rewards from its transitions, R can also be used to uniquely define the reward functions for all coagents transitions. Let ’s states and primitive actions in the time interval associated to ’s transition be . The reward for the tran- sition is

where . Notice that if is never reactivated (i.e. ), then the reward corresponding to is an infinite sum.

Remark 1. The reward source of all coagents is the same R, hence the name single reward in our definition. The motivation and necessity of this assumption is explained later in Remark 4.

Convention. Throughout this text, we will use the term Coagent Network (CN) to refer to a Markov single reward coagent network.

Coagent networks have a natural graph representation.

Definition 3. The graph of a CN has vertex set and directed edges which shows which coagent could follow the next in some execution path.

Remark 2. A loop in a graph simply means that the coagent can act multiple times in a row. Furthermore, one can start with the most refined decomposition of into stochastic policies and contract any edge to derive different graph representations. The policy gradient will obviously not change as has not changed; indeed, as shown in our main theorem, the policy gradient for the new composite coagent decomposes to the sum of the two original ones. For an example of a coagent network graph, see Fig. 1.

Policy Gradient Theorem Let be the policy gradient for with parameters shared among coagents, written as:

where actions are viewed as execution paths P = with implying a primitive action. We may drop from as it is implied from the context.

Theorem 3.1. The policy gradient is the sum of the policy gradient of the coagents:

Here, is the discounted probability of reaching state x from , i.e. , multiplied by the probability of reaching from x within a single execution path.

Proof. Note where means every coagent . Rewriting (5):

Distributing the derivative over the product leads to

where are the parts of P before and after o. The first product over , summed over all possible paths leading to , gets absorbed by and gives us by definition. For the product over , summed over all possible paths with state-action for o, gets absorbed by and gives . This is proved by iteratively using a Bellmantype equation for in terms of the next coagent in the execution path:

where is the next coagent according to . Applied iteratively, this gives

where the summation is over all paths P that include , and is the index of the last coagent in , which action immediately leads to a primitive action, thus .

Remark 3. Our much shorter general proof and definition compared to those of previous works (which also had more assumptions on the model) show the benefit of formalizing the rule . In comparison, this avoids many of the calculations around the discounted probability distributions identities (Riemer, Liu, and Tesauro 2018; Riemer et al. 2020), as one can use the graphical representation of the network to reason through the algebra by simply following the relevant paths of execution as demonstrated in the proof above.

Remark 4. Note that our proof also uses the single reward source assumption, that R determines all rewards . Otherwise, if were to include any ‘pseudo’ or ‘intrinsic’ reward, as in the goal-based HAC model (Levy, Platt, and Saenko 2017), relating to as in (9,10) would not have been possible (see Appendix B for further details).

Remark 5. The application of this theorem to any specific CN consists of (1) identifying the coagents and , and (2) computing the discounted state action probability (see e.g. Appendix A for HOC).

(A)Synchronous Coagent Networks Our definition of coagent networks needs to be slightly generalized to include the (a)synchronous setting (Kostas, Nota, and Thomas 2020). In brief, the rule needs to accommodate simultaneous executions for synchronous CNs. Then, the asynchronous setting is recovered by augmenting the state space.

Synchronous Coagent Network. The rule of a CN dictates the path of execution. In particular, our definition of a path of execution is a sequence of coagents states and actions. In the synchronous framework, the policy is instead composed of a fixed feedforward network with N layers of coagents where all coagents in the same layer execute simultaneously. The th layer’s states (with o going over all coagents in layer i) is given by its previous layer outputs into o, called and the environment state s, i.e. . Hence, a path of execution would be a sequence of sets of coagents where coagents within the same set execute simultaneously, and their collective actions and states determine those of the next set. Therefore, the only change to our definition is in the rule , to become a function of the form .To adapt our theorem, we take the following approach. First, formally consider all coagents in the same layer as being one coagent. This means we have a coagent network which is simply one line of coagents acting one after the other. Here, we are in our original framework and our theorem for this sequence of coagents gives

where capital letters are used for the states and actions of , which is composed of those of its constituting coagents , i.e. . Now our problem is to rewrite the policy gradient of coagent networks such as , as the sum of its simultaneously acting coagents. This requires an analysis similar to Eq. (8), by distributing over the coagents acting simultaneously

Similar to our main theorem, we follow this by taking appropriate marginalization over undifferentiated coagents, state-action occupancies, and Q-values (carried out in more details in Appendix D). They yield the desired decomposition to gradients of the coagent policies:

Asynchronous Coagent Network. Defined in (Kostas, Nota, and Thomas 2020), an asynchronous coagent network is one where the coagents may act asynchronously. For example, one node at environment time step t may have waited to act based on the actions taken by others in the step . One can see the asynchronous setting as a non-Markov version of the synchronous coagent network. More generally, using the Markov trick, i.e. changing the non-Markov MDP to a Markov MDP by augmenting the state space by adding histories to the states, we have: Theorem 3.2. The policy gradient of for a single reward coagent network that is formed by non-Markov coagents can be written as the sum of the policy gradient of its coagents.

To prove the above for the particular example of asynchronous networks, we first apply the Markov trick, which in this case, is exactly the same state augmentation trick that (Kostas, Nota, and Thomas 2020) employed to translate this problem into a synchronous MDP problem. Then one can write the shared parameter version of the synchronous setting as previously proved. Next, one must show that the synchronous coagents of this new MDP have the same policy gradients as the original ones. Again, this equivalence is achieved by following the same reasoning in (Kostas, Nota, and Thomas 2020, Sec. 5.1, p. 7) which, although done for the non-shared parameter version, is irrelevant of the parameters, and only revolves around the definition of the state augmentation and the two MDPs equivalence (Appendix E).

4 Examples and Applications

We discuss herein how our definition easily incorporates previous models and inspires new ones.

1) Hierarchical Option Critic (HOC): For illustrating the expressivity of our definition, let us model an HOC (Riemer et al. 2020) as a CN. An HOC has N levels of hierarchy with , sitting at the first level of hierarchy, as the most abstract option and selecting the first option with probability given a state s.

We view HOC as a tree of options, denoted by where parents at layer i have many children. An option is equipped with a policy which selects a child and a termination function . Each option is uniquely determined by a tuple where , and with the most abstract option (the root) labelled by . The option , sitting at the st level of hierarchy, has a policy , also denoted by which selects a child , and termination function , also denoted by, which decides to terminate with probability and to not terminate with probability . We will abuse notation and denote by . We may also use ‘node’ while referring to an option in this tree.

Denote the HOC policy by . The states are tuples of the form or . We may simplify the notation by dropping “1 :” from the subscript index and use or ) means the state of the environment is s, and the algorithm is about to execute ), as it is in the d = downward (u = upward) execution mode.

We describe the rule for HOC. Starting from the root option policy at , each node takes and action, moving from . Eventually, a node selects a primitive action changing the state of the environment to . Then the upward mode is initiated, i.e. . In this mode, termination functions get selected along the previous path and they decide whether to terminate or not. If a node terminates, the parent’s termination function is called. Therefore the state moves from with probability . At some level l + 1, node does not terminate, with probability and the mode changes back to d, i.e. . Then the down- ward execution applies as before until a new primitive action is selected by some . Given this rule , we derive the associated graph representation as shown in Fig. 1.

Figure 1: A graph representing the HOC . The policy and termination function of an option each form a coagent. Nodes are labelled by their unique address and type (policy or termination). The termination functions send an edge to their corresponding policies. The policies of the leaf nodes are also followed by their corresponding termination after applying a primitive action.

2) Feedforward Option Network (FON): Our CN framework allows us to consider other examples such as the simple feedforward network where options at level i are connected to all the next ones at level i + 1. This CN is used in our experiments as a toy model to better understand the option properties and how that is affected if parents share children. The rule is similar to that of HOC. We denote by an N layer FON where the i-th layer has many options. In all cases, there is a single root, thus . In our experiments, we tested configurations such as , and .

3 and 4) Options of Interest (Khetarpal et al. 2020) and Stochastic Neural Nets (Gupta et al. 2021; Chung 2021): There are more examples of coagent networks which we discuss further in Appendix C.

Non-examples of CNs: Any model that violates the stochasticity of the coagents or the single reward source assumption could be considered as not directly fitting into our definition and would require examination. Two important models of this nature are the Attention Option-Critic (Chunduru and Precup 2022), violating the stochasticity of the coagents, and Hierarchical Actor-Critic (HAC) (Levy, Platt, and Saenko 2017), violating the single reward source assumption. However, both can in theory be incorporated in our framework if one adds stochasticity, or uses the Markov trick (see Appendix B).

5 Practical Insights From Our Theory

Algorithm 1 shows how to leverage Theorem 3.1 and apply the Actor-Critic (AC) algorithm to learn the options in an HOC/FON network. Below are our improvements and comparisons derived from the theory.

1) Update only on arrival and its runtime advantage. Previous works (Riemer, Liu, and Tesauro 2018; Riemer et al. 2020; Khetarpal et al. 2020; Bacon, Harb, and Precup 2017; Chunduru and Precup 2022) update every parent of the current active option at each time-step. However, Theorem 3.1 only lists the gradients at states , where is in the execution path. Thus the more accurate way to update options is to do so only when they are called back. Otherwise, depending on the levels of hierarchy N and average duration of options, the runtime of each time step is increased by a factor of O(N). As demonstrated in the experiments (Appendix H), updating each time step does not lead to less episodes for learning either. Thus, the learning time is also increased by a factor of O(N).

2) Incorrect update and early termination. We address a recurring mistake in the literature’s AC algorithm implementation for updating , which worsens one of the issues of OC and HOCs called early option termination. As Theorem 3.1 states, for the update of an option o, the term should be scaled by the . If a = termination, then termination, i.e. the value of the higher option termination function at s. Otherwise, not termination. For the sake of illustration, let us consider the case of the Option-Critic model with root being the parent of o. Then , as the root never terminates. Throughout the literature, instead of multiplying the gradient by either or , the difference of these two terms denoted by

is considered. Hence, independent of the action, the update is always

where are ’s parameters. In theory, it is clear that is always negative as , therefore changing the parameters in the direction of the termination gradient, thus always encouraging early termination. This artificially created problem then becomes the reason behind adding another hyperparameter to mitigate this issue (Bacon, Harb, and Precup 2017; Harb et al. 2018). Correcting this mistake, while eliminating the need for an additional hyperparameter , does not resolve the early termination issue entirely. Indeed, at the beginning of training, since the lower options get their values updated more frequently, as long as rewards are nonnegative and values are initialized similarly, one can expect the value functions of the lower options to be generally higher than those of their parents. This would push the options to terminate less. However, the rate at which the parents are updated is important and this rate may need to be quite small. As supported by experiments (Appendix I), this can be facilitated by lowering the temperature of termination functions.

3) Using the parent as target network. Compared to previous works, another important modification to the algorithm is the use of the parent of the child for the critics target computation, instead of the child itself. In low temperatures, one needs to fully leverage the parent option as a value target network in the critic update of the child. For an option policy , the usual Q-learning target for has the following form:

δs, a, a)+

where t is the number of time-steps until o is called back, and r the discounted cumulative rewards accumulated during that time. Action a can be a primitive action or a child of o. Notice that the multiplier of is simply which is decomposes in two terms by considering the two terminating and not terminating cases. In theory, and so in practice, the parent can serve the role of a value target network. Therefore, one can make the following change in the update

Going further, we implement a similar change to “(higher parent terms)” where any is replaced by for any in the path to root for o (line 15 of Algorithm 2). As supported by our experiments, we hypothesize that this ensures that for the lowest options o do not get too much larger than those of their parents, as the target includes the action value of the parent instead of the option itself. Hence termination probability drops in a less dramatic way, allowing the higher options to learn enough to be able to play their role as a value target and stabilize the learning. Eventually, this allows us to obtain models with longer lasting options (Appendices I and J). A similar improvement for is discussed in Appendix J. Remark 6. Viewing the parent as a target network is also motivated by studying more deeply the crucial difference between a simple FON model vs . As theoretically justified in Appendix F and experimentally confirmed, turns out to be an AC model which is soft-updated using a value target network (the parent) with rollout at a learnable rate (determined by the termination function of the child). This also explains why in other hierarchical networks, such as HAC, it is unnecessary to use a target network. On the other hand, is exactly a vanilla AC model.

6 Experiments

We conduct a number of experiments on a nonstationary stochastic sparse variant of the Four Rooms (FR) task (Riemer, Liu, and Tesauro 2018; Riemer et al. 2020). We invite the reader to Appendix G and the ones after, which, as referred to, confirm many of the discussed insights and include

• A comparison of the performance of OC/HOC/FON to each other and previous works implementations (Fig. 2).

• Experiments on option length and how to main performance while tuning the temperature hyperparameter of to increase usage of all options.

• Long runs to check the stability of our algorithm, confirming the hypothesis of wider networks enjoying more stability in the long runs.

Figure 2: Our results for the OC model and HOC models for 5 random seeds. To be compared with the results from (Riemer, Liu, and Tesauro 2018, Fig. 3) for the OC model and HOC models , where our models outperform their counterparts. We also note how the simplest OC model outperforms all other models.

7 Conclusion and Future Works

In this work, we introduce a new intuitive definition of coagent networks, allowing us to design new ones, and illustrate the idea of execution paths to prove policy gradient theorems even when coagents share parameters or act asynchronously. We survey the performance and option properties of well-known coagent networks while making the algorithms more performant, and provide theoretical analysis supporting the improvements. This work attempts to put coagent networks and their algorithms on solid mathematical footing, and future works should focus more on the network design of these models and the suitable tasks that they can be evaluated against. For example, one of the design goals of the options framework is for different options to represent different skills. If so, the training and tasks should accommodate such outcome; having options which focus, or have their attention on different parts of a screen or different parts or phases of the task (Khetarpal et al. 2020; Chunduru and Precup 2022), may not necessarily accommodate this. In addition, further algorithmic improvements should be done to mitigate the issues inherent in training such networks, chief among them being the variance (Gupta et al. 2021), in order to make them competitive against state-of-the-art online RL paradigms.

Acknowledgements

We would like to thank Vaneet Aggarwal for his suggestions and comments. The first named author would like to acknowledge the support of the Perimeter Institute for Theoretical Physics and Microsoft during his time of working on this research. Research at Perimeter Institute is supported by the Government of Canada through Innovation, Science and Economic Development Canada and by the Province of Ontario through the Ministry of Research, Innovation and Science. The experiments were conducted using Microsoft computational resources.

References

Bacon, P.-L.; Harb, J.; and Precup, D. 2017. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence.

Chunduru, R.; and Precup, D. 2022. Attention option-critic. arXiv preprint arXiv:2201.02628.

Chung, S. 2021. MAP Propagation Algorithm: Faster Learning with a Team of Reinforcement Learning Agents. Advances in Neural Information Processing Systems, 34: 14734–14744.

Fortunato, M.; Azar, M. G.; Piot, B.; Menick, J.; Osband, I.; Graves, A.; Mnih, V.; Munos, R.; Hassabis, D.; Pietquin, O.; et al. 2017. Noisy networks for exploration. arXiv preprint arXiv:1706.10295.

Gupta, D.; Mihucz, G.; Schlegel, M.; Kostas, J.; Thomas, P. S.; and White, M. 2021. Structural credit assignment in neural networks using reinforcement learning. Advances in Neural Information Processing Systems, 34: 30257–30270.

Harb, J.; Bacon, P.-L.; Klissarov, M.; and Precup, D. 2018. When waiting is not an option: Learning options with a deliberation cost. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Khetarpal, K.; Klissarov, M.; Chevalier-Boisvert, M.; Bacon, P.-L.; and Precup, D. 2020. Options of interest: Temporal abstraction with interest functions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 4444–4451.

Klissarov, M.; Bacon, P.-L.; Harb, J.; and Precup, D. 2017. Learnings options end-to-end for continuous action tasks. arXiv preprint arXiv:1712.00004.

Kostas, J.; Nota, C.; and Thomas, P. 2020. Asynchronous coagent networks. In International Conference on Machine Learning, 5426–5435. PMLR.

Levy, A.; Platt, R.; and Saenko, K. 2017. Hierarchical actor-critic. arXiv preprint arXiv:1712.00948.

Riemer, M.; Cases, I.; Rosenbaum, C.; Liu, M.; and Tesauro, G. 2020. On the role of weight sharing during deep option learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 5519–5526.

Riemer, M.; Liu, M.; and Tesauro, G. 2018. Learning abstract options. In Advances in Neural Information Processing Systems, 10424–10434.

Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2): 181–211.

Thomas, P. S. 2011. Policy gradient coagent networks. Advances in Neural Information Processing Systems, 24.

Thomas, P. S.; and Barto, A. G. 2011. Conjugate Markov decision processes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 137–144.

Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8: 229–256.

A Hierarchical Option Critic

In this appendix, we review the graph representation of HOC, and by presenting some useful Bellman equations for the PGT, show how our Theorem 3.1 for HOC is the same PGT as in (Riemer et al. 2020, Theorem 1). Note the list of complicated equations in the next parts are simply derived as part of confirming that our result matches with the literature, and should not be seen as proving Theorem 3.1, which we accomplished already in the short proof in the main text. Alternative graph representation of HOC. It is possible to define the CN matching HOC differently by taking the policy and termination function corresponding to an option as a single coagent (Fig. 3). This coagent will act as when given a state with execution mode d, and act as the termination function given a state with execution mode u. In this case, the graph of the CN is a tree where each edge is doubled, with one edge going up and another going down, and each vertex has a loop. Notice this graph is a contraction of Fig. 1 along the edges connecting the two copies of the binary tree.

Figure 3: A graph representing the coagent network of HOC with N = 3 and . The policy and termination function of each option form a single coagent. Nodes are labelled by their unique address.

Bellman equations in HOC. To express the Bellman equations that will be necessary to understand the equivalence of our PGT for HOC and the one in the literature, we first need to clarify our notation for the edge cases:

• The state is always followed by some state for some . Notice it is always the case that a primitive action such as , if it were to be informally looked at as an option, immediately terminates after taking place, in other words . So is imme- diately followed by . Therefore, it is meaningful to take . • or means the highest level of hierarchy at state s, where the first option has not been chosen yet. The highest level policy never terminates. Therefore, as is immediately followed by and as such . This state is followed by for some option assigned by . • In our summations or products , if the higher in-dex is smaller than the lower index (x > y), we consider the result to be 1. This convention is helpful as it makes the equations easier to express without specifically accounting for the edge cases. To describe values of HOC, recall that in standard RL, we tend to differentiate the action value and the state value by using a different notation (Q and V ). Our terminology allows us to unambiguously use the same notation Q() for the value of our augmented states at any level of hierarchy or any state of execution d or u. First, let be the reward (or expected reward, if the environment reward process is stochastic) given by the environment to the agent which has taken primitive action at . Let be the discount factor. All values will be computed in succession. Starting with the easiest case at the lowest level,

where P is the transition probability of the environment, that given action at s, the state changes to . Note that , i.e. the Q-value of the state-action pair for the policy of the option .

The equation above is obvious, as the next state is , which is in the upward execution phase. Next, computing the value at this state,

Note that the product is the probability of upward trajectory up to level i + 1. At that level, the decision is to not terminate, which has probability . Notice can also be interpreted as the state-value for termination of option at state . Similar interpretations exist for the Q-values computed later.

Finally, computing completes the cycle needed to compute , as a ‘Bellman’ equation is a recursive equation. To do so, observe the following:

Applied iteratively, one obtains an expression of in terms of , which was already computed.

In similar fashion, one can compute the Q value at all other states. To do so, we shall derive the rewards for any node using Eq. (4):

Then, using the transition probability

it follows

Notice in both equations above, can be replaced by . Further, for all states with upward execution mode, one can write

When the algorithm is at , it will begin executing upward and reach some state , for some (with probability ), where it decides to exe- cute downward with probability ), i.e. the state changes to . The above equation multiplies these probabilities by the value of the state it starts executing downward at.

Remark 7. In the literature (Sutton, Precup, and Singh 1999; Bacon, Harb, and Precup 2017; Riemer, Liu, and Tesauro 2018), generally corresponds to our while generally corresponds to ; in the edge case of i = N, this correspondence could become a bit imprecise, due to the fact that the literature notation accounts for those cases separately by introducing new notations. Finally, corresponds to .

Remark 8. The equations (18) and (15) give the following in the case of l = 2, N = 2:

which can be recognized as the equations for in terms of as written in (Bacon, Harb, and Precup 2017, Eqs. 1-2).

Policy gradient theorem for HOC. Below, we write the PGT theorem for HOC to motivate the next notations and probability computations. In a series of steps, we will identify this PGT with (Riemer et al. 2020, Theorem 1). For an initial state at the lowest level of the tree at node , we have

There are a few notations that need to be defined:

• is the discounted probability of starting at , reaching the state , which after execution is followed by . It does not directly correspond to in Theorem 3.1, but it will be shown to be related.

• is the transition probability from to , which is dependent on the ter- mination functions and policies . It is the probability of the algorithm reaching at level j, after finish- ing its upward phase of execution (which started at the same level at ) and going down again to level j (see Eq. (24)).

• are the action value functions of

for option at . Action 1 denotes termination and

0 denotes the opposite (see Eq. (33)).

The notations that match with the reference have also the same meaning, for example or the advantage A(). As will be shown in equations (37-39), we can interpret the above sum as the sum of the policy gradients of and of the options, as their similarity to the usual policy gradient in (1) can be observed. Below we will make steps towards proving this interpretation.

Let us rewrite (19) so that the value on the right-hand side is evaluated at states with the same level as the left-hand side. To do so, apply (15) repeatedly on to get to , which leads to the following when :

where is one if and zero otherwise. The term multiplied by has an independent meaning, which will allow to make this equation more compact. Let be the transition probability from to , which is dependent on the termination functions and policies . Hence, this probability solely depends on the model and not the environment. It is the probability of the algorithm executing downward at at level l + 1, after finishing its upward phase of execution (which started at level l+1) and coming down again to level l+1. It is easy to see that this probability is the same multiplicative term in the equation above:

One can generalize this to compute other useful transition probabilities :

Using the above, equation (23) becomes:

Next, let us define , which is the sum of discounted probabilities for getting from to . We will compute two transition probabilities. Us- ing (17) to write the transition probability for :

Also, for :

Next, define the discounted one-step and k-steps recursively

as follows :

The definition for can be written for different transitions. The one needed for HOCPGT is:

holding . Let us now define the advantage as follows:

The advantage answers this question: If one is able to choose, then how much advantageous it is to start executing downward from , than to change the higher level options and try a different set of options ? The difference above in the values determines the advantage of this choice. Indeed, just like the first term is the value of not terminating, the second term is the value of terminating. Thus, alternatively, one can write:

using the action value functions of . It is clear that

where 1 denotes the termination action, and 0 the opposite action. These actions have probabilities:

We are now ready to state the rewritten HOCPGT as written by (Riemer et al. 2020). Theorem A.2. (Riemer et al. 2020, Theorem 1)

where is the discounted probability of starting at , reaching the state , which after execution is followed by .

Remark 9. The last term above might seem different from the one in the reference (Riemer et al. 2020, Theorem 1), but they are the same. Our convention, motivated from our point of view, is to sum over each node individually, while in the reference, the last node at the bottom of the tree is taken and the sum is over its parents.

We now show how to identify our main theorem application to HOC with the above. Let us denote and . Recall the HOC policy is denoted by . Then the first term in the bracket above can be written as:

where is the discounted transition probability for a policy , for reaching x from . Notice there is no , as does not depend on , and so the summation over averages out this outcome state. Also note that is the Q-value of policy of the node , hence equal to . Therefore, the first term is simply the policy gradient of the lowest level node.

For the second term, for a given node for , the term can be ab-

sorbed into the discounted transition probability to create where . This is the discounted transition probability from in any number of execution paths, and from within the same execution path. Remark 10. Notice that an execution path in this setting, is an execution from followed by a path upward, then a path downward until before the next execution of takes place. Thus, as the path involves an environmental step due to the execution of at the beginning of the path, the discounted transition probability takes a discount factor .

The above remark applies when going from to , as is the case in . Furthermore, using (35),

From the above, using (33) and (34):

Similar to (37), one can see that (38) is the usual policy gradient theorem for the termination policies at . Finally, for the last term, for each at :

Where is the discounted transition probability from in any number of execution paths, and from within the same execution path. The factor is what describes the discounted probability of this last transition within the same execution path.

Remark 11. Notice our proof does not involve any further state-augmentation. This is in contrast to the proof in (Riemer et al. 2020, App. C), where something called the termination vector T is defined and the graph structure is completely forgotten to be able to cast HOC into a synchronous coagent network (only two coagents). This shows that our framework can provide more intuitive ways for computing the policy gradient of coagent networks.

B Non-Examples of Coagent Network

Attention Option-Critic (AOC). The AOC model (Chunduru and Precup 2022) is an OC augmented with attention mechanisms for each option . The state , input for and its termination function , is obtained as follows:

• Take the environment state s along with the address of the option and the mode of execution (like ),

• change the environment state s to where has the same shape as s with values in [0, 1], and is the Hadamard multiplication.

The motivation behind this is to focus each option on parts of the states that are relevant. This model leverages attention to show that it is possible to achieve useful options with reasonable lengths.

The AOC learning algorithm can not be recast as a policy gradient of a CN. The main obstacle is the attention mechanism, which is deterministic, and thus is not a coagent. It is learned using gradients of the option action value.

However, following Remark 12, if the attention mechanisms were stochastic, then we could apply the PGT as in Theorem 3.1. The coagents in this case are the policies, terminations and the attention mechanisms. The rule is similar to that of HOC, where we let the attention corresponding to an option choose an action when execution mode is d.

Hierarchical Actor-Critic (HAC). From the point of view of Definition 1, there are two ways that HAC (Levy, Platt, and Saenko 2017) fails to be a CN.

First, we recall how HAC functions. HAC is a hierarchical goal-based model of policies where each policy selects a goal to be achieved by . Each policy has a fixed time horizon of K steps in which it has the chance to achieve the goal. It terminates once it reaches the goal or after K many executions, and the parent selects a new goal. As a result, the policy acts again after at most many time-steps. The reward given to a policy is based on how close the current state is to the goal, which is measured by a metric defined by the user. The edge case of (the goal for ) is either set by the user or simply not chosen, in which case the reward for is just the environmental reward. In practice, the reward for is a linear combination of the goal reward and the environmental reward. Clearly, for the more abstract policies the environmental reward should have more weight. In any case, it is clear that the policies do not have a single reward source (Remark 4).

Another CN assumption that HAC violates is the Markov property. This difference can be resolved, at least in theory. If one considers, say, the mid-level policy of a level HAC, its action depends on the subgoal it has received from the highest policy for a fixed executions paths (environmental time steps) or until it reaches the subgoal. During each K executions of the executions, the mid-level policy is only called once by the lowest policy trying to reach the subgoal assigned by the mid-policy during that K time-steps. The mid-policy has to remember the subgoal it was assigned to in the first execution path for the next execution paths. Note that such non-Markovianness is also present in the case of asynchronous firing in (Kostas, Nota, and Thomas 2020) as remarked in the main text. This is not an insurmountable theoretical barrier in viewing these models as CNs, at least if one is willing to augment the state space by including all histories. In the case of goal-conditioned models, one can provide the information of all selected subgoals into the state. Thus, more generally, by the same Markov trick, the Markov condition in Definition 1 can be lifted.

C More on Examples of Coagent Networks

The freedom in the graph structure and parameter sharing pattern, could allow us to exploit relational biases in a given environment/task to build more powerful RL agents. We mention a general but customizable coagent network, followed by two other examples in the literature.

1) Coagent networks with cycles or loops. Let us consider a cycle for the CN graph and design a corresponding rule . The nodes are labeled by for , where each node, depending on the input, makes a primitive action or sends some information via its action to the next node (with sending to ). The description of the rule is as follows: one assigns each state s of the environment two integers and , which means the node is the one that performs a primitive action at environment state s after s has passed through many cycles. More generally, one can imagine N decision functions for any and , where 1 means primitive and 0 non-primitive. Depending on the input of the node, after finitely many coagent executions, we have . This should be guaranteed by the programmer to happen to avoid infinite loops. This forces the node to apply a primitive action at , which may contain s as the information regarding the environment.

2) Options of Interest. One of the structures that one can add to the HOC is an interest function for each option (Khetarpal et al. 2020). One interpretation of this function is that it introduces a deterministic part in the process of the selection of by its parent , which is unique to and disjoint from the process of selection of the other children. This modularity has an effect similar to attention in Attention Option-Critic, discussed in Appendix B. We can still apply Theorem 3.1 by combining the parent with the interest functions of its children, i.e. the new coagent being . Then Theorem 3.1 applied to will give the same PGT in (Khetarpal et al. 2020).

3) Stochastic Neural Nets. Stochastic Neural Nets, in their full generality, have stochastic activation functions and/or weights. In all cases, each neuron forms a coagent with its parameters being the parameters of the activation function along with the weights connected to the neuron that compute its input. A more refined decomposition of the agent to coagents is possible if the weights are stochastic, in which case, every weight is also a coagent by itself. Notice that although our discussion has been mainly around RL applications, we can apply our results to the supervised learning setting by simply using a discount factor , although we naturally expect supervised learning algorithms to be superior in their own field of application. Remark 12. As a general idea, one could inject stochasticity into deterministic (parts of) RL models and apply the coagent network policy gradient theorem, and potentially obtain better results, as demonstrated in the noisy DQN model (Fortunato et al. 2017).

D Synchronous Coagent Network PGT

We finish the proof by explaining the marginalizations mentioned after Eq. (12). We first note that the marginalization of over given by the weighting , gives the Q- value as the latter is naturally defined as , where the sum is over all such that o’s action is . However, according to PGT, we should compute , which can be done as follows

The probability is the probability that within the same time step of , the state of is . This probability is dependent on the environment MDP and . One can recover this probability from the discounted state action occupancy as follows:

The last two equalities require further explanation. We recall the definition of the discounted state action occupancy

Here is the state action occupancy corresponding to the k-th step, its value depending on , and the MDP M. We need to show , i.e. that it is independent of k. Notice that the policy ’s internal actions and outputs does not depend on k, but on s; and while the distribution of s itself indeed depends on (), we note that the state s is already given in . Thus, the term is in- dependent of k.

Finally, after marginalizing over , we reintroduce the one over :

E Asynchronous Coagent Network PGT

For thoroughness, we explicitly state the changes that one has to make to the argument in (Kostas, Nota, and Thomas 2020) in order to show that the PGT of the synchronous CN obtained after state augmentation simplifies to the the asynchronous coagents policy gradients. As mentioned in the main text, the argument around this has no relevance to parameter sharing. Therefore, as illustrated below, we note that we simply have replaced the nonshared parameters by in the argument in (Kostas, Nota, and Thomas 2020, p. 7), and summed over i for the policy gradients :

“[...] Having shown that the expected return in the asynchronous setting is equal to the expected return in the synchronous setting, we turn to deriving the asynchronous local policy gradient, . It follows from that . Since is a synchronous, acylic network, and `M is an MDP, we can apply the CPGT to find an expression for . For [shared parameter] synchronous network, this gives

Consider , which we

abbreviate as . When , we know that the action is regardless of . Therefore, in these local states, is zero. When , we see from the definition of that . Therefore, we see that in all cases, . Substituting this into the above expression yields:

In the proof that given in Section C of the supplementary material, we show that the distribution over all analogous random variables is equivalent in both settings (for example, for all , ). Substituting each of the random variables of M into the above expression yields precisely the asynchronous local policy gradient, .”

F Inherent learning Stabilization in Hierarchical Options

We want to understand the algorithm for by building it piece by piece starting from an AC algorithm and changing it where appropriate. For the sake of illustration, we will assume that we have a fixed value of termination for some n > 0. This means the option terminates every n steps. This is also somewhat of a reasonable assumption for , as observed in some of the FR experiments, where unless termination temperature is low, the termination function (more or less) stays constant throughout the training of (Figs. 6 and 7).

First, note that the root has one action (selecting the child), and therefore, its action value is the value function of the children. Therefore, let us take a vanilla AC algorithm and add a value target network computing for the algorithm, where:

(a) It is only updated every ) steps using its own rollout update rule

(b) The target is used to soft-update the Q-value of the AC, where instead of choosing to build

the target of Q(s, a), we choose , meaning

instead of

. Unless mentioned otherwise, we take 50 runs with five to ten different random seeds for 50000 episodes, and show the 500 moving average of the quantity of interest, which is either the number of steps to the goal or the option lengths.

No target network used. As mentioned previously and theoretically justified in Appendix F, each parent is effectively acting as a value target network with rollout.

Hyperparameters. The hyperparameters in all experiments are the same, unless mentioned explicitly otherwise. We choose , termination and actor temperature 1 and 0.01, and termination, critic and actor learning rates and , respectively. More details can be found in the code.

H Experiments on Performance

Analysis and comparison of results. Our OC/HOC models generally outperform their previous versions in the literature as shown in Fig. 2. The experiments also demonstrate that increasing the number of lowest options is not always beneficial (Fig. 4). This finding is confirmed by (Riemer, Liu, and Tesauro 2018), but as further illustrated in Fig. 4 for HOC and FON models, more options simply means slower learning for this task, and not a reduction in the best possible performance. However, taking into account the efficiency of our algorithm as discussed in Section 5, the training time for HOC with 4 levels to reach score below 150 in 100000 episodes is less than that of (Riemer, Liu, and Tesauro 2018) (reached in 50000 episodes).

I Experiments on Option Properties

A main issue in option models has been the length of the options, with the model converging to one option dominating, or most options being of very small length (high termination probability). In the incorrect update model of (as discussed in Section 5), a proposal was to introduce a deliberation cost to the advantage function in the update rule.

Tuning the termination temperature. Here, we propose a simple method which does not require any additional hyperparameter or architectural modification. We examine the effect of lowering the temperature of the sigmoid function used in computing . The idea is that as action values are initialized with 0s, the termination probabilities initially drop. Then, due to the numerics of a sigmoid function with low temperature, it should become numerically harder for said probabilities to rise, thus contributing to lengthier options.

Figure 5: We compare the results of some simple FON models. The best model, in terms of architectural simplicity, convergence time and eventual performance is the simplest model . This also outperforms the previous HOC models in Fig. 2. Note that only two options, a root and a child are present in this model, and the primitive actions are executed by the child. During learning, the presence of another option (the root) is essential. This is further discussed in F and also supported by the result of the AC model in (Riemer, Liu, and Tesauro 2018, Fig. 3).

However, this reduces the update frequency of higher options. Over a long period of time, one wonders if the higher options can learn their value function as well, which requires experimentation.

and FON models with low temperature. The models we experiment with have termination temperature 0.05, compared to the normal 1.0 in the previous section. Option lengths dramatically increase (Figs. 6 and 7), spanning a significant part of the episode length while leaving room for other options to be used/called back as well. So no option in the FON models becomes dominant, as also shown by their almost equal length, meaning both lowest options are active participants in the task. Fig. 8 shows that lowering the temperature slightly degrades the performance in 50000 episodes but longer runs may close this gap.

J More on Experiments

Stability and option lengths. The instability of RL algorithms is well-known, especially in nonstationary settings. We run the experiments for ten times longer and with up to seven levels of hierarchy, and observe in Figs. 9 and 10 that increasing the width correlates with stability while preserving the lowest level options’ lengths. Also, instability in those models correlate with spikes in said lengths. We hypothesize that as the reward function changes dramatically from episode to episode (due to the goal changing), it can be easier for the model to be stable in the long term if there is more width to the network.

To test the stability, we take runs of 500000 episodes on a single random seed, for FON models of the form

Figure 6: Option lengths compared for the models with low and normal temperature. Options are labelled by {0, . . . , O} where 0 denotes the root, 1, 2 the second and third options at the first level, ... , and the lowest two options. No option becomes dominant as their length is almost equal and around half of the episode length.

and in Fig. 9. We observe how the stability of the algorithm in the long term is generally better for the models. As mentioned, we can see how the crashes in performance, which are spikes in the graphs, are correlated with those present in the option length of the lowest option(s) as shown in Fig. 10. Note that none of the lowest options are dominant and they are also lengthy in their uninterrupted execution, confirming Fig. 6 for deeper models.

Estimating using the not-terminated ancestor. Similar to how parents were used as target networks in Section 5 for the policies , the same approach applies for termination functions as termination. Considering that parent(o) may have also terminated in the execution path, one could the same value for grandparents of o, and so on, up to the ancestor that has not terminated. Based on our experiments, this last option is a better estimate of the true termination), as that is the beginning of the next phase of execution and thus, is a closer estimate to the next collected rewards (line 36 of Algorithm 2).

Figure 7: Option lengths compared for the models with low and normal temperature. Options are enumerated by {0, . . . , O} where 0 denotes the root and O the lowest option. Normal temperature leads to almost constant (very small) length throughout the entire training, while low temperature leads to a stable increase and eventual stabilization. Even though the option length for intermediary options has almost attained the episode length, they and the root are called back (used at least twice in an episode), as their child option length is smaller than the episode length (above 150 as shown in Fig. 8).

Figure 8: Lower temperature seems to slow the convergence to the best possible performance in both and FON models.

Figure 9: Performance comparison of and FON models with the same number of levels of hierarchy (two to seven). General stability increases as we switch from the former to the latter model. Note that almost all deeper models achieve the best performance of 125 steps as in Fig. 5.

Figure 10: Lowest option(s) length comparison of and FON models with the same number of levels of hierarchy (two to seven). General stability increases as we switch from the former to the latter model. The lowest options of the latter model are enumerated by 1 and 2. We observe how both of the lowest options are almost equally used, with a substantial length, confirming the previous results in Fig. 6 for even deeper models. Note the correlation between option length spikes and performance crashes in Fig. 9.