Adding Neural Network Controllers to Behavior Trees without Destroying Performance Guarantees

2018·Arxiv

Abstract

Abstract

In this paper, we show how Behavior Trees that have performance guarantees, in terms of safety and goal convergence, can be extended with components that were designed using machine learning, without destroying those performance guarantees.

Machine learning approaches such as reinforcement learning or learning from demonstration can be very appealing to AI designers that want efficient and realistic behaviors in their agents. However, those algorithms seldom provide guarantees for solving the given task in all different situations while keeping the agent safe. Instead, such guarantees are often easier to find for manually designed model-based approaches. In this paper we exploit the modularity of behavior trees to extend a given design with an efficient, but possibly unreliable, machine learning component in a way that preserves the guarantees. The approach is illustrated with an inverted pendulum example.

Index Terms— Autonomous systems, behavior trees, stability of hybrid systems, switched systems

I. INTRODUCTION

Over the last decade, Behavior Trees (BTs) have become a very important tool in the design of decision structures, and are widely appreciated for their modularity and reactivity, see [1]–[3], and the over 150 papers cited in [4]. At the same time, machine learning (ML) approaches have continued to show remarkable success, [5]–[7]. However, one problem with ML approaches is that they seldom provide guarantees for ending up at the desired state, or for avoiding some unsafe states that might harm the agent. In this paper, we will show how, and when, the BT design of Figure 1 can combine the safety guarantees of an safety controller with the performance guarantees of a model-based controller and the efficiency of a data-driven controller.

The basic idea is very straightforward and relies on the modularity, BTs where proven to be optimally modular in [8], provided by the BT structure. Note that the rest of the BT can be arbitrarily complex, but we focus on what is going on when the given subtree of interest is executed. Looking at Fig. 1, the first priority is safety, and whenever the safety constraint is at risk of being violated we invoke the Safety controller, this might e.g., correspond to moving away from the edge of a cliff. If safety is ok, the BT checks if the current running cost, e.g. execution time, is ok, i.e. if there is reason to believe that the data-driven subtree is not going to complete the task in time, or at all. If the execution time

This work was supported by SSF through the Swedish Maritime Robotics Centre (SMaRC) (IRC15-0046).

C. I. Sprague and P. ¨Ogren are with the Robotics, Perception and Learning Lab., School of Electrical Engineering and Computer Science, Royal Institute of Technology (KTH), SE-100 44 Stockholm, Sweden (email: sprague@kth.se).

Fig. 1: A Model-based controller (left) is replaced by a subtree including both the model-based controller and the data-driven controller (right).

is not ok, the previously designed model-based controller is invoked. Finally, if both of the conditions above are satisfied, we allow the data-driven subtree to be executed.

Note that switching between subtrees like this might induce undesired behaviors where one subtree counteracts another one, therefore, the rest of this paper is devoted to finding explicit formal conditions for when the approach outlined above will indeed provide the desired guarantees, building upon the theoretical tools proposed in [2], [3], and illustrating the approach with the commonly known example of an inverted pendulum swingup problem.

The main contribution of this paper is that we show how to add a data-driven controller to an existing BT design, without destroying performance guarantees.

The outline of the paper is as follows. In Section II and III we present related work and background material. Then, in Section IV we formulate the main result of the paper, showing when performance guarantees can be made. A detailed inverted pendulum example is presented in Section V and conclusions are drawn in Section VI.

II. RELATED WORK

It is well-known that learning algorithms might cause unsafe behavior, both during training and possibly even after training, as it can be hard to guarantee performance in all cases. Therefore, safety of learning approaches is a very active research area, [9]–[13].

In [11] Constrained Markov Decision Problems (CMDPs) were used, and the cumulative cost was replaced by a stepwise one, which was then transferred into the admissible control set leading back to a standard MDP formulation.

There is also a set of approaches using Lyapunov ideas, originating in control theory. In [12] a Lyapunov approach was used to guarantee stability of an RL agent. The agent was allowed to switch between a given set of controllers that were designed to be safe no matter what switching strategy was used. Then, in [13], Lyapunov concepts were used to iteratively estimate the region of attraction of the policy, i.e., the region that the state is guaranteed not to leave, when applying the controller at hand. At the same time, while being in this safe region, the estimate of the region, as well as performance, was improved. Finally, Chow et al. used the CMDPs to construct Lyapunov functions using linear programming, [10]. The approach is guaranteed to provide feasible, and under certain assumptions, optimal policies.

Our approach differs from [9]–[13] by not trying to build the performance guarantees into the learning controller, but leveraging the reactivity of the surrounding BT structure and the existing model-based controller to create a combination with the required guarantees.

BTs were invented in the gaming industry [1] and are currently spreading throughout the fields of robotics and AI [14]. Significant effort to combine BTs with learning from experience as well as demonstrations can be found in the literature [15]–[23].

In [16], a complete sub-tree is learned using a genetic algorithm applied to the Mario AI environment. Similar ideas were explored in [15]. Furthermore, grammatical evolution was used in [17], to optimize the structure of a BT playing a platform game, while constraining the design to an and-or tree structure deemed efficient for the problem at hand.

Classical reinforcement learning was applied to BTs in [18], where the idea of replacing a given action (leaf) node with an RL policy was proposed. Replacing non-leaf nodes with an RL policy deciding which child to execute was explored in [19], and [20].

In [21] the BT for performing pick and place operations were learned from human demonstration, using logic and decision trees. A related idea was used in [22], where a game designer first controlled game characters to create a database of trajectories that are then used to learn controllers. Finally, in [23], a framework for end user instruction of a robot assistant was proposed.

Our approach differs from [15]–[23] by not focusing on how to integrate learning into a BT, but instead providing safety and performance guarantees when such learning has been integrated. Thus, the proposed approach can be combined with any of the methods described in [15]–[23].

III. BACKGROUND

In this section, we provide background on the formulation of BTs as discontinuous dynamical systems, following [3].

A. Continuous-time behavior trees

We will now provide the continuous-time formulation of BTs, following [3]. Throughout this paper, let be the state space, x be a state, be the control space, and u be a control input.

Definition 1 (Behavior Tree). A function {R,S ,F}, defined as

where x is the state, i V is an index, ui : is a controller, and ri : is a metadata function, describing the progress of the controller in terms of the outputs: running (R), success (S ), and failure (F). Define the metadata regions for x as the running, success, and failure regions:

respectively, which are pairwise disjoint and cover .

The intuition of the metatdata regions (2) is as follows. If x Si, then has either succeeded at achieving its goal (e.g. opening a door) or the goal was already achieved (the door was already open). Either way, it might make sense to execute another BT in sequence to achieve another goal (perhaps a goal that is intended to be achieved after opening the door).

If x Fi, then has either failed (the door to be opened turned out to be locked), or has no chance at succeeding (the door is impossible to get to from the current position). Either way, it might make sense to execute another BT as a fallback (either to open the door in another way or to achieve a higher-level goal in a way that does not involve opening the door).

If x Ri, then it is too early to determine if will succeed or fail. In most cases, it makes sense to continue executing , but it could also be reasonable execute another BT if some other goal is more important (e.g. low battery levels indicate the need to recharge).

Above, the term “execution” means the use of a BT’s controller ui in some underlying system. Such a system can be seen as a dynamical system, thus we have the following definition of a BT execution.

Definition 2 (BT execution). A dynamical system given for a BT as

where f : is a system to be controlled and ui is given by (1).

As shown in [3], the BT execution (3) can be characterized by a discontinuous dynamical system (DDS) defined over a finite set of so-called operating regions [3, Theorem 2, p.5], with corresponding results regarding existence and uniqueness of its solutions [3, Theorem 3, p.6]. These operating regions (defined below) arise from the switching among BTs invoked by BT compositions, of which there exists two fundamental types: Sequences and Fallbacks, which we will now define.

A Sequence is a BT that composes together sub-BTs that are to be executed in sequence, where each one requires the success of the previous. It will succeed only if all sub-BTs succeed, whereas, if any one sub-BT runs or fails, it will run or fail. This behavior is formalized in terms of the sub-BT’s metadata regions in the following definition.

Definition 3 (Sequence). A function Seq that composes an arbitrarily finite sequence of M BTs into a new BT as

A Fallback, on the other hand, is a BT that composes together sub-BTs that are to be executed as a fallback to one another, where each one is executed only in case of failure of the previous. It will fail only if all sub-BTs fail, whereas, if any one sub-BT runs or succeeds, it will run or succeed. This behavior is formalized in terms of the sub-BT’s metadata regions in the following definition.

Definition 4 (Fallback). A function Fal that composes an arbitrarily finite sequence of M BTs into a new BT as

B. BTs as discontinuous dynamical systems

As mentioned above, the execution (3) can be characterized by a DDS defined over so-called operating regions. The state’s presence in these operating regions can be viewed as the sufficient conditions for a sub-BT to be executed by the root BT . We will now formalize this in the following theorem from [3].

Theorem 1 (Operating regions). Assuming is the root BT, there exists a maximum subset of the index set P V and an operating region for each index such that is a partition of and

Proof. See [3, Theorem 2, p.5]

The maximum subset P V in Theorem 1 corresponds to the set of leaf nodes because it must be a set of nodes such that none of them are a parent to one another [3] — the maximum set fulfilling this criteria is the leaf nodes. Further, P only includes such leaf nodes whose operating regions are not empty, because is a partition of . For more detail on the derivation of the operating regions , see [3].

The significance of Theorem 1 is that we can now interpret the execution (3) of the root BT as a DDS, where x implies ˙x = f(x,u0(x)) = f(x,ui(x)). We will use this formalism in the remainder of the paper.

IV. MAIN RESULT

The general problem that we will address in this paper is as follows.

Problem 1. Given a system f : to be controlled with a running cost L : such that last state is the accumulated running cost with ˙xn = L(x,u), and controllers uS,uMB,uDD : , where uS is a safety controller, uMB is a model-based controller with formal guarantees, and uDD is a data-driven controller without formal guarantees (but potentially lower running cost than uMB) — how can uS, uMB, and uML be composed together with a BT to create a controller u0 that exploits uMB, yet still formally guarantees that solutions to the execution (3) of avoids unsafe regions O and converges to a goal region G O?

A. BT structure

The BT we propose to solve Problem 1 is shown in Fig. 1, and is formalized as follows.

Assumption 1 (BT structure). is given by

where uS,rS) is a safety action, uL,rC) is a high-cost condition, uDD,rG) is a data-driven action, uMB,rG) is a model-based action, rS is a safety metadata function, rG is a goal metadata function, and uL is an arbitrary controller.

The BT structure in Assumption 1 is intended to work as follows. The data-driven action will be executed as long as it does not bring the system too close to the unsafe region or accumulate too much running cost. If the system gets too close to the unsafe region then should execute until the system is sufficiently far away. If too much running cost is accumulated then the model-based action should execute instead of the data-driven one. If the data-driven action brings the system close enough to the goal region, then the model-based action should takeover the execution, as it will be formally guaranteed to stay in the goal region with negligible cost. We formalize the above in the following assumption on the metadata regions formed by the metadata functions: r1 = rS, r3 = rC, and r4 = r5 = rG.

Assumption 2 (Metadata regions). The metadata regions are chosen such that the following holds,

That is, there exists an unsafe region O that the failure region of the safety action is equal to, the running region of the high-cost condition is empty (guaranteeing that the arbitrary uL never executes), there exists a high-cost region C Oc that the success region of the high-cost condition is equal to, there exists a goal region G Oc that the success regions of the data-driven action and model-based action are equal to, and the failure regions of the data-driven action and model-based action are empty (they can be executed from any state).

We will now identify the operating regions of the BT in Fig. 1, given by Assumptions 1 and 2, following [3].

Lemma 1. If Assumptions 1 and 2 hold, the operating regions of Theorem 1 are given for as

Proof. Following [3, Def. 5, p.4], the so-called influence regions are

Following [3, Def. 6, p.5], the so-called success and failure pathways are S = {0,5} and F = {0,1,2,4,5} respectively. Following [3, Def. 7, p.5], the operating regions are

Metadata regions are pairwise disjoint and cover , thus, by (8), we have that F1 = O implies R1 = (S1 , R3 = /0 and S3 = C imply F3 = Cc, and S4 = S5 = G and F4 = F5 = /0 imply R4 = R5 = Gc. Thus, with the full application of (8), including R3 = /0, on (11), the non-empty operating regions of the leaf nodes are given in (9).

The important thing to note in the operating regions (9) is that the high-cost region C and the goal region G are given. Thus the success region S1 of the safety action is the only region left to be specified. In the next section we will show how the specification of S1 affects the execution of .

B. Policy-level stability

As discussed in the previous section, the specification of a BT’s structure and its metadata regions determine its operating regions, which determine where a particular subBT will be executed. However, in order to understand where a sub-BT’s execution will lead to, we must look at the stability properties of its execution. To do this, we first have the following definitions, in a similar spirit to [2].

Definition 5 (Finite-time successful). A BT is FTS if there exists a region Bi Ri Si that is positively invariant under its execution (3), and a finite time such that for all of its execution’s solutions we have that xBi implies xSi Bi in finite time t .

Definition 6 (Safe). A BT is safe w.r.t. an unsafe region O if there exists a region Bi O that is positively invariant under its execution (3).

Definition 7 (Safeguarding). A BT is safeguarding w.r.t. an unsafe region O if it is FTS and safe with Si Bi.

“Finite-time success” means that, for the execution of a sub-BT, if the state starts in a certain region Bi then it will be in the success portion of this region Bi Si in finite time, without every venturing out of it. “Safe” means that, if the state starts in Bi then it will never venture into the unsafe region O. “Safeguarding” means that if the state exits the success region Si, then it will be redirected back into it in finite time without venturing into the unsafe region, because the region Bi surrounds it, Si Bi. One way to guarantee the above properties is through exponential stability, as we show in the following lemma, in a similar spirit to [2].

Lemma 2 (Exponential stability). A BT for which there exists a locally exponentially stable equilibrium xi Bi Si on a region Bi Ri Si for its execution (3) is FTS. Additionally, if there exists an unsafe region O such that Bi O then is safe w.r.t. O, and if Si Bi also then is safeguarding w.r.t. O.

Proof. If the execution (3) of is locally exponentially stable about xi Bi Si on Bi Ri Si then there exists such that, for its solutions, if xBi then xixifor all t . The stability of xi Bi Si implies that there must exist a maximal and minimal such that xixiand : xiBi Si. Thus, we must have that if xBi then xBi Si in finite time t , and that xBi Si for all t . If additionally Bi O then is safe (Def. 6), and if Si Bi also then is safeguarding (Def. 7).

The stability properties described above allow us to make formal guarantees on how a sub-BT’s execution will behave. But, what can be said about how a sub-BT’s behavior will lead to the execution of other sub-BTs?

To answer this, we must analyze how the aforementioned region Bi overlaps with other sub-BTs’ operating regions. Informally speaking, if a sub-BT is FTS and Bi does not intersect any other sub-BTs’ operating regions, then the state will stay in ’s operating region and converge to success. However, if Bi does intersect other sub-BTs’ operating regions, then there is a possibility that the state could venture into those operating regions. Thus, with a strategic choice of Bi for each sub-BT, an ordered set of possible sub-BT executions can be derived in a similar spirit to the sequential composition of [24], [25]. We formalize this in the following theorem.

Theorem 2 (Sequential composition). If a sub-BT is FTS, there exists a minimum subset Pi , where P is from Theorem 1, such that

and, for solutions to its execution (3), if xBi, then either xSi or xRj Sj) for some j Pi in finite-time t .

Proof. If Pi = /0, then we have that Bi /0, which implies that Bi . Since is FTS, it is implied that Bi Si /0. Thus, for the execution (3) of , if xBi then xBi Si in finite time t .

If Li /0, then Fjis a partition of Bibecause Li is minimum. Thus, for the execution (3) of , if xBi then either xBi Si or xRj Sj) in finite time t .

C. Convergence

We will now make an assumption on the stability properties of the BT proposed in Assumptions 1 and 2, and then use the concept of Theorem 2 to prove convergence.

The assumption is motivated by the following. If we are executing the data-driven sub-BT , we want to ensure that it does not execute forever if the state does not get into the goal region G or close to the unsafe region O. Thus, we assume that the accumulated running cost will reach a finite value in finite time. Further, we assume that the safety subBT is safeguarding on the region outside of the unsafe area, i.e. B1 = Oc, so that if the data-driven sub-BT misbehaves, the system will not go into the unsafe region O. Lastly, we assume that the model-based sub-BT is FTS. We formalize this assumption as follows.

Assumption 3 (Stability). There exists a finite cost J such that the high cost region is given by C and we have that C/0, and a finite time T such that if the state is not in the goal region before that time, xG for all t , then the state will stay in the high cost region after that time, xC for all t . The safety action is safeguarding w.r.t. O such that B1 = Oc, the the model-based action is FTS.

Now given Assumption 3 and Theorem 2, we will show how a strategic choice of the safety sub-BT’s success region S1 facilitates convergence from all starting states outside of the unsafe area. In the following Theorem, we show how the overall BT design is convergent if we set S1 equal to the goal region and the region for which the model-based sub-BT is FTS.

Theorem 3 (Constant metadata). If Assumptions 1, 2, and 3 hold, and we have that S1 B5, then is FTS with and safe w.r.t. O starting from xO.

Proof. We have that if xO then either xR1 or xS1 because F1 = O and metadata regions are pairwise disjoint.

If xR1 then xB1 because Sc1 = R1 O and B1 = Oc = R1 S1. We have that B1 S1 because S1 \O = S1 since metadata regions are pairwise disjoint. We have that F4 because F4 = /0 and F5 because F5 = /0. We then have that S1 GS1. Thus we have that B1 F5F5). Thus from Theorem 2, we have that xB1 implies either xR4 S4or xR5 S5, and hence xS1 in finite time t .

If xS1 then either xor xbecause S1 and S1.

If x, then either xor x(in which case xB1 because B1 S1 since is safeguarding) in finite time t , where T is a finite time for which either xG or xC by Assumption 3.

If xthen xG in finite time t for the following reasons. Since S1 B5, we have that B5 , and since C is positively invariant under the execution (3) of by Assumption 3, we have that B = B5/0 is also positively invariant under the execution. We have that /0 and BB5GcB5/0. Thus, by Lemma 2, we have that if xB5 then xB5 G in finite time t .

Thus, for the execution (3) of , we have that if xO then xG in finite time t , where . Thus, B0 = B1 is a region for which is FTS and safe.

One limitation of Theorem 3 is that S1 B5 might be hard to achieve and verify. One way to loosen this requirement is by switching the success region of the safety subBT, thereby affecting where the sub-BTs will be executed according to (9).

If, at first, we have S1 B5, where B5 is the region with which the model-based sub-BT is FTS, then the safety sub-BT will bring the state to that region in finite time. If we then switch the success region so that S1 B5, then either data-driven sub-BT or the model-based sub-BT will be executed. If is executed, then it will successfully bring the state to the goal region G because the region B5 for which it is FTS is contained in S1. However, if is executed and it misbehaves so that the state ventures out of S1 B5, then we can switch to S1 B5 again, bring the state back again with , switch back to S1 B5, and so on. This behavior is formalized in the following Theorem, in a similar spirit to Theorem 3.

Theorem 4 (Switching metadata). If Assumptions 1, 2, and 3 hold, and we have that

then is FTS with and safe w.r.t. O starting from xO.

Proof. If xO then either x, x, or xS1, because SB5 and F = O.

If xthen S1 by (13), in which case xS1 and hence xR1. In the same way as Theorem 3, if xR1 then xS1 in finite time t .

If xS1 and S1 then S1 and either xor xbecause S1 and S1.

If xand S1 then S1 and xB5 because SB5. We have that B = B5 C is positively invariant under the execution (3) of . We have that B \ /0 because now S1 B5. Thus, by Theorem 2, we have that if xB5 with S1 then xB5 G in finite time t .

Thus, for the execution (3) of , we have that if xO then xG in finite time t , where . Thus, B0 = B1 is a region for which is FTS and safe.

V. EXAMPLE: INVERTED PENDULUM

To illustrate the results above, we will now assume that the given dynamical system to be controlled is given by the inverted pendulum of [26], with the state augmented by the cost function given in [27].

Implementation 1 (Inverted pendulum). The given system f : of Problem 1 is given by

where x is the state, u um,um] is a control input; y and v are the translational position and velocity of the cart; is the clockwise angle of the pole from the upright orientation and is the angular velocity of the pole; and J is accumulated running cost, and L is given by

where is a chosen parameter.

The parameter for the running cost function (15) will be used for both training the data-driven sub-BT and for satisfying Assumption 3, as we will describe in the next section. If 0, then (15) is a quadratic running cost on the control input; if 1 then it is a running cost on time.

A. Controllers

In this section we will define the controllers for Problem 1 under Assumption 1.

The model-based controller umb we will use is the globally asymptotically stable one of Srinivasan et al. [26, Theorem 2, p.4]. With the inverted pendulum (14) under the influence of this controller, there exists a locally exponentially stable region about the stationary upright configuration x : cosand a globally defined finite time at which the state will enter this region. Based on the above, we make the following assumption on the model-based subBT.

Implementation 2 (Model-based BT). The model-based controller u5 = uMB is given by the controller of Srinivasan et al. [26, Theorem 2, p.4]. The goal region G = S5 : cosis the region in which the stationary upright configuration is locally exponentially stable for the execution (3) of with Implementation 1.

In reality, the track length of the pendulum (14) would not be infinitely long, but bounded. Bounds on cart position and velocity for the inverted pendulum under the control of uMB are given in [26] as ym um) and vm um when starting from the stationary-downright configuration x: cos. Thus, we would need the track length to be at least 2ym in width. Given this, what width within these boundaries would we need to set as safe, to ensure that the data-driven controller does not bring the system to a state where violating the track length is inevitable, i.e., going so fast that braking in time is impossible?

To answer this, suppose we start at the origin with y(t0) = 0 and v(t0) = 0, apply the maximum control input um until some position y(t1) and velocity v(t1), and thereafter apply the minimum control input um until the cart is brought to a halt at the upper bound of the track length y(t2) = ym with v(t2) = 0. Since the cart is a double integrator, basic kinematics tells us that v(t1t02um(y(t1t0)) and v(t2t12um(y(t2t1)), which gives us y(t1) = ym/2 and v(t1uym.

We want to ensure that, if the data-driven sub-BT is being used (when x S1 ), the safety sub-BT will be activated if the state ventures outside of these bounds. However, we also want to ensure that, if the model-based sub-BT is being used (when x S1 C)), the safety sub-BT will not activate when the state ventures outside of these bounds. Thus we have the following assumption on the safety sub-BT.

Implementation 3 (Safety BT). The safety metadata region is given by S1 in (13) with

The safety controller is given by

The obstacle region is given by

Finally, we implement the data-driven sub-BT (uDD,rG) with the same goal metadata function as in Implementation 2 and a mulitlayer-perceptron (MLP) controller.

Implementation 4 (Data-driven BT). The data-driven controller uDD is a MLP with 3 hidden layers, each having 50 nodes, where the softplus activation function is used. The output layer is a tanh activation scaled to um,um]. We train uDD for 354918 iterations at a learning rate of 0.001 with behavioral cloning on 354918 state-control pairs (see Fig. 2) coming from optimal trajectories w.r.t (15) with 0, which are obtained from Pontryagin’s maximum principle in the way of [27]. On 90% of the data, uDD achieved 2.1445emean-squared error (MSE) loss in training, and 2.2745eMSE loss in on the remaining data in testing.

Fig. 2: The database of optimal swingup trajectories used to train the data-driven controller uDD, where x and y are horizontal and vertical coordinates, respectively.

Since the data-driven controller uMB is trained from trajectories that are optimal w.r.t. a quadratic cost on the control, i.e. (15) with 0, it tries to conserve effort. We want to use this controller as long as it does not take too long, so we consider the high-cost region C to be where the cost (15) with 1 reaches a certain value.

Theorem 5. Theorem 4 is satisfied with Assumptions 1 and 2, using Implementations 1, 2, 3, and 4 with 1 and C , where T is a finite time.

Proof. We need to show that Assumption 3 is fulfilled. First, the accumulated cost J is monotonically increasing when 1 in (15). Thus, for solutions of (3) of , we have xC for all t from all xif xG for t . Second, the safety controller uS is a standard PD controller, thus there exists a positively invariant set B1 = Oc upon which x1 is a locally exponentially stable equilibrium, and we have S1 B1, thus by Lemma 2, is safeguarding and there exists a finite time for which it is FTS. Third, by [26], there exists a finite time for which the execution (3) of enters the goal region G, thus is FTS. Thus, Theorem 4 is satisfied.

VI. CONCLUSIONS

Previous works have separately explored different ways of adding learning components to a BT, and approaches for building safety guarantees into learning controllers.

In this paper we have shown how the reactivity and modularity of BTs enable a design where safety and convergence guarantees are provided on top of any learning controller. This was done using a natural combination of the learning controller with a safety controller and a convergent model based controller. Such a design might however introduce deadlocks where the different controllers work against each other. The paper presents a set of conditions that guarantee that the proposed design does not suffer from such problems, and instead achieves both safety and convergence to the desired goal states. An inverted pendulum example was used to illustrate the approach.

ACKNOWLEDGEMENT

This work was supported by Stiftelsen for StrategiskForskning (SSF) through the Swedish Maritime Robotics Centre (SMaRC) (IRC15-0046).

REFERENCES

[1] A. Champandard, M. Dawe, and D. Cerpa, “Behavior trees: Three ways of cultivating strong ai,” in Game Developers Conference, Audio Lecture, 2010.

[2] M. Colledanchise and P. ¨Ogren, “How Behavior Trees Modularize Hybrid Control Systems and Generalize Sequential Behavior Compositions, the Subsumption Architecture, and Decision Trees,” IEEE Transactions on Robotics, vol. 33, no. 2, pp. 372–389, 2017.

[3] C. I. Sprague and P. ¨Ogren, “Continuous-time behavior trees as discontinuous dynamical systems,” IEEE Control Systems Letters, vol. 6, pp. 1891–1896, 2021.

[4] M. Iovino, E. Scukins, J. Styrud, P. ¨Ogren, and C. Smith, “A survey of behavior trees in robotics and ai,” accepted for publication in Autonomous Robots (and on ArXiv), 2022.

[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015, Accessed on: Apr 20, 2017. [Online]. Available: http://dx.doi.org/10.1038/nature14236

[6] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., “Mastering the game of go with deep neural networks and tree search,” nature, vol. 529, no. 7587, p. 484, 2016.

[7] N. Justesen, P. Bontrager, J. Togelius, and S. Risi, “Deep learning for video game playing,” IEEE Transactions on Games, 2019.

[8] O. Biggar, M. Zamani, and I. Shames, “On modularity in reactive control architectures, with an application to formal verification,” to appear in ACM Transactions on Cyber-Physical Systems, 2022.

[9] W. Saunders, G. Sastry, A. Stuhlmueller, and O. Evans, “Trial without error: Towards safe reinforcement learning via human intervention,” in Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2018, pp. 2067–2069.

[10] Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh, “A lyapunov-based approach to safe reinforcement learning,” arXiv preprint arXiv:1805.07708, 2018.

[11] M. El Chamie, Y. Yu, and B. Ac¸ıkmese, “Convex synthesis of randomized policies for controlled markov chains with density safety upper bound constraints,” in American Control Conference, 2016, pp. 6290–6295.

[12] T. J. Perkins and A. G. Barto, “Lyapunov design for safe reinforcement learning,” Journal of Machine Learning Research, vol. 3, no. Dec, pp. 803–832, 2002.

[13] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Advances in Neural Information Processing Systems, 2017, pp. 908– 918.

[14] M. Colledanchise and P. ¨Ogren, Behavior Trees in Robotics and AI, an Introduction. Chapman and Hall/CRC, 2018.

[15] C. Lim, R. Baumgarten, and S. Colton, “Evolving Behaviour Trees for the Commercial Game DEFCON,” Applications of Evolutionary Computation, pp. 100–110, 2010.

[16] M. Colledanchise, R. Parasuraman, and P. ¨Ogren, “Learning of Behavior Trees for Autonomous Agents,” IEEE Transactions on Games, DOI 10.1109/TG.2018.2816806, 2018.

[17] M. Nicolau, D. Perez-Liebana, M. O’Neill, and A. Brabazon, “Evolutionary behavior tree approaches for navigating platform games,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 9, no. 3, pp. 227–238, 2016.

[18] R. d. P. Pereira and P. M. Engel, “A Framework for Constrained and Adaptive Behavior-based Agents,” arXiv Preprint arXiv:1506.02312, 2015.

[19] R. Dey and C. Child, “Ql-bt: Enhancing behaviour tree design and implementation with q-learning,” in 2013 IEEE Conference on Computational Intelligence in Games (CIG). IEEE, 2013, pp. 1–8.

[20] Y. Fu, L. Qin, and Q. Yin, “A reinforcement learning behavior tree framework for game ai,” in 2016 International Conference on Economics, Social Science, Arts, Education and Management Engineering. Atlantis Press, 2016.

[21] French, Wu, Pan, Zhou, and Jenkins, “Learning Behavior Trees From Demonstration,” in Robotics and Automation (ICRA), 2019 IEEE International Conference on. IEEE, 2019.

[22] I. Sagredo-Olivenza, P. P. G´omez-Mart´ın, M. A. G´omez-Mart´ın, and P. A. Gonz´alez-Calero, “Trained behavior trees: Programming by demonstration to support ai game designers,” IEEE Transactions on Games, 2017.

[23] C. Paxton, A. Hundt, F. Jonathan, K. Guerin, and G. D. Hager, “CoSTAR: Instructing Collaborative Robots with Behavior Trees and Vision,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 564–571.

[24] R. R. Burridge, A. A. Rizzi, and D. E. Koditschek, “Sequential Composition of Dynamically Dexterous Robot Behaviors,” The International Journal of Robotics Research, vol. 18, no. 6, pp. 534–555, 1999.

[25] D. Conner, A. Rizzi, and H. Choset, “Composition of local potential

functions for global robot control and navigation,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, (IROS), vol. 4. IEEE, pp. 3546–3551.

[26] B. Srinivasan, P. Huguenin, K. Guemghar, and D. Bonvin, “A global stabilization strategy for an inverted pendulum,” IFAC Proceedings Volumes, vol. 35, no. 1, pp. 133 – 138, 2002, 15th IFAC World Congress. [Online]. Available: http://www.sciencedirect.com/science/ article/pii/S1474667015386936

[27] C. I. Sprague, D. Izzo, and P. ¨Ogren, “Learning dynamic-objective policies from a class of optimal trajectories,” in 2020 59th IEEE Conference on Decision and Control (CDC). IEEE, 2020, pp. 597– 602.

designed for accessibility and to further open science