This work starts a scientific investigation on the concept of “restraining bolt”, as envisioned in Science Fiction. A restraining bolt is a “device that restricts a droid’s [agent’s] actions when connected to its systems. Droid owners install restraining bolts to limit actions to a set of desired behaviors.”1 The concept of restraining bolt introduces a new problem in AI. We have two distinct representations of the world, one by the agent and one by the authority imposing restraining specifications, i.e., the bolt. Such representations are apparently unrelated as developed by independent parties, but both model (aspects of) the same world. We want the agent to conform (as much as possible) to the restraining specifications, even if these are not expressed in terms of the agent’s world representation.
Studying this problem from a classical Knowledge Representation perspective (Reiter 2001) would require to establish some sort of “glue” between the representation by the agent and that by the restraining bolt. Instead, we bypass dealing with such a “glue” by studying this problem in the context of Reinforcement Learning (RL) (Puterman 1994; Sutton and Barto 1998), which is currently of great interest to develop components with forms of decision making,
possibly coupled with deep learning techniques (Mnih et al. 2015; Silver et al. 2017). Specifically, we consider an agent and a restraining bolt of different nature. The agent is a reinforcement learning agent whose “model” of the world is a hidden, factorized, Markov Decision Processes (MDP) over a certain set of world features. That is, the state is factorized in a set of features observable to the agent, while transition function and reward function are hidden. The restraining bolt consists in a logical specification of traces that are considered desirable. The world features that are used represent states in these traces are disjoint from those used by the agent. More concretely such specifications are expressed in full-fledged temporal logics over finite traces, LTLand its extension LDL
(De Giacomo and Vardi 2013; De Giacomo and Rubin 2018; Brafman, De Giacomo, and Patrizi 2018). Notice that the restraining bolt does not have an explicit model of the dynamics of the world, nor of the agent. Still it can assess if a given trace generated by the execution of the agent in the world is desiderabile, and give additional rewards when it does. The connection between the agent and the restraining bolt is loose: the bolt provides additional reward to the agent and only needs to know the order of magnitude of the original rewards of the agent to suitably fix a scaling factor2 for its own additional rewards. In addition, it provides to the agent additional features to allow the agent to know at what stage of the satisfaction of temporal formulas the world is so that the agent can choose its policy accordingly. Without them, the agent would not be able to act differently at different stages to get the rewards according to the temporal specifications. The main result of this paper is that, in spite of the loose connection between the two models, under general circumstances, the agent can learn to act so as to conform as much as possible to the LTL
/LDL
specifications. Observe that we deal with two separate representations (i.e., two distinct sets of features), one for the agent and one for the bolt, which are apparently unrelated, but in reality, correlated by the world itself, cf., (Brooks 1991). The crucial point is that, in order to perform RL effectively in presence of a restraining bolt such a correlation does not need to be formalized. For example, consider a service robot serving drinks and
snacks at a party. The robot knows the locations where it can grasp drink and snack items and the locations of people that can receive such items. The robot can execute actions to move in the environment, to grasp objects and to deliver them to people. With rewards associated to effective deliver, the robot can learn how to serve something to a specific person. Now, suppose we want to impose the following restraining bolt specification: serve exactly one drink and one snack to every person, and do not serve alcoholic drinks to minors. To express this specification (e.g., as an LTL/LDL
formula), a representation of the status of each person (i.e., identity, age and received items) is needed, even though these features are not available to the learning agent (but only to the restraining bolt).
Notice that the presence of LTL/LDL
specification makes the whole system formed by the agent and the restraining bolt non-Markovian. Recently, interest in nonMarkovian Reward Decision Processes NMRDPs (Bacchus, Boutilier, and Grove 1996; Whitehead and Lin 1995), and in particular on expressing rewards using linear-time temporal logic has been revived and motivated by the difficulty in rewarding complex behaviors directly on MDPs (Littman 2015; Littman et al. 2017). In this context, the use of linear time logics over finite traces such as LTL
or its extension LDL
has been recently advocated (Camacho et al. 2017; Brafman, De Giacomo, and Patrizi 2018; Icarte et al. 2018). Notably, both LTL
and LDL
formulas can be transformed into deterministic finite state automata tracking the stage of satisfaction of the formulas (De Giacomo and Vardi 2013). This, in turn, allows for transforming an NMRDP with nonMarkovian LTL
/LDL
rewards into an equivalent MDP over an extended state space, obtained as the crossproduct of the states of the NMRDP and the states of the automaton. This transformation is of particular interest in RL with temporally specified rewards expressed in LTL
/LDL
, since it allows to do RL on an equivalent MDP whose (optimal) policies are also (optimal) policies for the original problem, and viceversa (Brafman, De Giacomo, and Patrizi 2018). This provides the basis for our development here.
In this paper, we set the framework for the problem of restraining bolt in RL context and provide proofs and practical evidence, through various examples, that an RL agent can learn policies that optimize conformance to the LTL/LDL
restraining specifications, without including in the agent’s state space representation the features needed to evaluate the LTL
/LDL
formula. Our work can also be seen as a contribution to the research providing safety guarantees to AI techniques based on learning. We take up this point in a brief discussion at the end of the paper.
MDP’s. A Markov Decision Process (MDP) M = contains a set S of states, which in this paper we consider factored into world features, a set A of actions, a transition function
that returns for every state s and action a a distribution over the next state, and a reward function
that specifies the reward (a real value) received by the agent when transitioning from state s to state
by applying action a. A solution to an MDP is a function, called a policy, assigning an action to each state, possibly conditioned on past states and actions. The value of a policy
at state s, denoted
, is the expected sum of (possibly discounted by a factor
, with
) rewards when starting at state s and selecting actions based on
.
RL is the task of learning a possibly optimal policy, from an initial state , on an MDP where only S and A are known, while Tr and R are not. Typically, the MDP is assumed to start in an initial state
, so policy optimality is evaluated wrt
. Every MDP has an optimal policy
. In discounted cumulative settings, there exists an optimal policy that is Markovian
, i.e.,
depends only on the current state, and deterministic (Puterman 1994).
LTL/LDL
. The logic LTL
is the classical linear time logic LTL (Pnueli 1977) interpreted over finite traces, formed by a finite (instead of infinite, as in LTL) sequence of propositional interpretations(De Giacomo and Vardi 2013). Given a set P of boolean propositions, here called fluents (Reiter 2001), LTL
formulas
are defined as follows:
where is a propositional formula over
is the next operator and U is the until operator. We use the standard abbreviations:
; eventually as
; always as
; week next •
(note that on finite traces
); and
denoting the end of the trace. LTL
is as expressive as first-order logic (FO) over finite traces and star-free regular expressions (RE).
LDLis a proper extension of LTL
, which is as expressive as monadic second-order logic (MSO) over finite traces (De Giacomo and Vardi 2013). LDL
allows for expressing regular expressions over such sequences, hence mixing procedural and declarative specifications, as advocated in some work in Reasoning about Action and Planning (Levesque et al. 1997; Baier et al. 2008). Formally, LDL
formulas
are built as follows:
where: tt stands for logical true; is a propositional formula over
denotes path expressions, i.e., RE over propositional formulas
with the addition of the test construct
typical of Propositional Dynamic Logic (PDL). We use abbreviations
as in PDL. Intuitively,
states that, from the current step in the trace, there exists an execution satisfying the RE
such that its last step satisfies
, while
states that, from the current step, all executions satisfying the RE
are such that their last step satisfies
. Tests are used to insert into the execution path checks for satisfaction of additional LDL
formulas.
For an LTL/LDL
formula
, we can construct a deterministic finite state automaton (DFA) (Rabin and Scott 1959)
that tracks satisfaction of
accepts a finite trace
iff
satisfies
.3
NMRDP’s. A non-Markovian reward decision process (NMRDP) (Bacchus, Boutilier, and Grove 1996) is a tuple , where S, A and Tr are as in an MDP, but the reward
is a real-valued function over finite state-action sequences (referred to as traces), i.e.,
. Given a (possibly infinite) trace
, the value of
is:
where
is the discount factor and
denotes the pair
. In NMRDP’s, policies are also non-Markovian
. Since every policy induces a distribution over the set of possible infinite traces, we can define the value of a policy
, given an initial state s, as:
That is,
is the expected value of infinite traces, where the distribution over traces is defined by the initial state s, the transition function Tr, and the policy
.
Specifying a non-Markovian reward function explicitly is cumbersome and unintuitive, even if only a finite number of traces are to be rewarded. LTL/LDL
provides an intuitive and convenient language rewards (Alberto et al. 2017; Brafman, De Giacomo, and Patrizi 2018). Following, (Braf- man, De Giacomo, and Patrizi 2018) we can specify
implicitly, using a set of pairs
, with
an LTL
/LDL
formula selecting the traces to reward, and
the reward assigned to those traces, where the atomic propositions, i.e., the fluents, of
correspond to boolean features, or boolean propositions (e.g., relational value comparison) over the world features, forming the components of the state vector. Intuitively, if the current (partial) trace is
, the agent receives at
a reward r if
is satisfied by
. Formally,
if
and
, otherwise.
Before looking at the restraining bolt problem, we review RL for NMRDP’s with LTL/LDL
rewards.
• is defined as:
Observe that the state space of is the product of the state spaces of M and
, and that the reward
is Markovian. In other words, the (stateful) structure of the LTL
/LDL
formulas
used in the (non-Markovian) reward of M is compiled into the states of
.
Theorem 1 ((Brafman, De Giacomo, and Patrizi 2018)). The NMRDP is equivalent to the MDP
defined above.
Actually this theorem can be refined, into a stronger lemma that we will use in the following. A policy for an NMRDP M and a policy
for an equivalent MDP
are equivalent if they guarantee the same rewards. Assume
is constructed as above and let
be a policy for
. Consider a trace
of M and assume it leads to state
. Further, let
be the state of
on input
. We define the (non-Markovian) policy
equivalent to
as
. Similarly, given a policy
for M, by just tracking the state of the DFAs
, it is immediate to define the equivalent policy
for
. Hence we have:
Lemma 2 ((Brafman, De Giacomo, and Patrizi 2018)). Given an NMRDP M and an equivalent MDP , every policy
for
has an equivalent policy
for M and viceversa.4
We are interested in RL in the setting introduced above: learn a (possibly optimal) policy for an NMRDP M = , whose rewards
are offered on traces specified by LTL
/LDL
formulas
and where the LTL
/LDL
reward formulas
and the transi- tions function Tr is hidden to the learning agent.5 Formally, given M, with Tr and
hidden to the learn- ing agent but sampled during learning, and an initial state
, the RL problem over M consists in learning an optimal policy
. Notice that, since NMRDP rewards are based on traces, instead of state-action pairs, typical learning algorithms, such as Q-learning or SARSA (Sutton and Barto 1998), which are based on MDPs, are not applicable. However, by the above results an optimal policy for M can be obtained by learning, instead, an optimal policy for
. Being
an MDP, this can be done by typical algorithms such as Q-learning or SARSA. Of course, neither M nor
are (completely) known to the learning agent, and the transformation is never done explicitly. Rather, during the learning process, the agent assumes that the underlying model has the form of
instead of that of M.
Figure 1: Learning Agent and Restraining Bolt
Theorem 3. RL for LTL/LDL
rewards over an NMRDP
, with Tr and
hidden to the learning agent can be reduced to RL over the MDP
defined above, with
and
hidden to the learning agent.
Note that contains encoding of the stage of satisfaction of the formulas
. However since the transition function
is hidden, the agent cannot anticipate the effect of an action before the action is executed.
We now focus on the restraining bolt problem, i.e., how to do RL with restraining specifications expressed in LTL/LDL
. We are given:
• A learning agent modeled by the MDP , with transitions
and rewards
hidden, but sampled from the environment.
• A restraining bolt where: –
is the set of possible fluents’ configurations (analogously to S denoting the set of configurations of the features available to
). Fluents in F are not among the features that form the states S of the learning agent
.
The agent receives rewards based on and the pairs
. In fact, often we have to handle tasks of episodic nature. That is, the world can reach a configuration in which no action can change its configuration nor generate new rewards, e.g., a final configuration in a game. In this case we assume that the restraining bolt fluents F include a special fluent Done that denotes reaching the final configuration. This fluent can be used in LTL
/LDL
formulas to reward the agent only at the end of the episode. When the episode ends and a new episode is started, a new trace is generated on which LTL
/LDL
formulas are evaluated again.
Notice that while the agent can see the features that the reward depends on, it cannot see those that affect
. Both S and L are features’ configurations, in the sense of representing world properties. However, they capture different facets of the world. Let W be the set of real world states. A feature is a function
that maps world states to another domain
, such as reals, enumerations, booleans, etc. The feature vector of a world state
is the vector
of feature values corresponding to
. Given a world state
, the corresponding configuration
of the learning agent
consists in those components of
that produce the agent’s state, while the corresponding configuration of fluents
is formed by the components that assign truth values to the fluents. That is, a subset of the world features describes the agent states
and another subset (for simplicity, assumed disjoint from the previous one) is used to evaluate the fluents in
. Hence, a sequence
of world states defines both a sequence of learning agent states
and a sequence of fluent configurations
. While
depends on the former, each
and
depend on the latter. Consequently, by executing a policy and hence by repeatedly choosing actions in A, the agent visits a sequence of world states, collecting for each of them, the sum of the rewards
and
. The point to resolve is defining on the base of which observations the agent can choose its next actions. Obviously the agent can in principle accumulate all its observations
and on the other hand it cannot see the fluents configurations
, however we want to equip the agent to some means to establish the stage of satisfaction of the formulas
. Such a notion, as mentioned above, can be captured by considering the minimal DFA
corresponding to formula
. Notice that such a DFA is unique. Hence we equip the agent with additional observable features
corresponding to the states Let
. Such features are going to be provided by the restraining bolt.6 Note that this does not give away fluents configurations
which remain hidden to the agent, see Figure1
Hence, in general, we consider possibly non-Markovian policies of the form
and thus define the expected (discounted) cumulative reward of a possibly non-Markovian policy as the expected reward of infinite traces starting in the initial state
, induced by the policy itself (obtained as the expected sum of the collected rewards
and
).
Problem definition. (An instance of) the RL problem with LTL/LDL
restraining specifications is a pair
, where:
is a learning agent with
and
hidden, and RB =
is a restraining bolt formed by a set of LTL
/LDL
formulas
over L with associated rewards
. A solution to the problem is a (possibly non-Markovian) policy
that maximizes the expected cumulative reward.
To devise a solution technique, we assume that the agent actions in A induce a transition distribution over the features and fluents configuration, i.e.:7
Such a transition distribution, together with the initial values of the fluents and of the agent state
, allow us to describe a probabilistic transition system accounting for the dynamics of the fluents and agent states. Moreover, when
is projected on S only, i.e., the L components are marginalized, we get
of
. Obviously, both
and
are hidden to the learning agent. On the other hand, in response to an agent action
performed in the current state
(in the state
of the agent and the configuration
of the fluents), the world changes into
from which
and
are obtained. This is all we need to proceed.
Given M , RB
with M
S, A, Tr
, R
and RB
, r
, we define an NMRDP M
, A, Tr
, r
, R
s, a, s
, where:
• states are pairs formed by an agent configuration s and a fluents configuration
;
• are as before;
• ϕLast
;
• and
are hidden and sampled from the environment. Formulas
are as before, in particular they are con-
tinuously evaluated on the (partial) trace produced so far.
Though, they may use the special fluent Done to give the reward associated to the formula at the end of the episode (modulo reward shaping). Formulas , one per
, which require that both states s and action a are followed by
, are evaluated at the end of the current (partial) trace (recall that Last denotes the last element of the trace, c.f. Preliminaries). In this case, the reward
from
associated with
is given.8 Observe that policies for
have the form
which needs to be restricted to have the
form required by our problem . A policy
has the form
, . . . , q
, s
, . . . , q
, s
with q
. In other words, a policy
A has the form when the fluents L are not directly accessible but are used only the progress the DFAs
corresponding to formulas
. We can now state the following result.
Lemma 4. RL with LTL/LDL
restraining specifications
with
and
can be reduced to RL over the NMRDP
, by restricting the policy to learn to have the form
.
As a second step, we apply the construction of the previous section and obtain a new MDP learning agent. In such construction, because of their triviality, we do not need to keep track of the state of the automata associated with each , but just offer the reward
associated with
. Instead, we do need to keep track of state of each DFA
corresponding to
. Hence, from
, we obtain an MDP
where:
• is the set of states;
• is defined as follows:
Observe that, besides the rewards of the original learning agent, the environment now offers the rewards
associated with the formulas
, thus guiding the agent towards the satisfaction of the
(by progressing correctly the corresponding DFAs
).
This Lemma allows for restricting non-Markovian policies to Markovian policy
without loss of generality.
As a last step, we solve the original RL task on by performing RL on a new MDP
, where:
• The transition distribution is the marginalization of
wrt L, and is unknown;
• The reward is defined as:
• The states of the DFAs
are progressed correctly by
To see this, let , then the Bellman equation in our case is:
By using the equality between
and
we have:
On the other hand, observe that we can compute
from
and
, that is we do not need
. Hence:
So we can write:
At this point, we see that
does not depend from
, hence we can safely drop
as argument for
. Indeed, we get:
and by marginalizing the distribution
over
, we get:
This is Bellman’s equation for
, hence the thesis.
This theorem provides us with a technique to learn the optimal policy for RL with LTL/LDL
restraining specifica-tion by making minimal intervention to the learning agent: essentially we need to feed it with the rewards
at suitable times, and we need to allow the learning agent to keep track of the stage of satisfaction of the restraining bolt formulas by feeding it with new features for
.
Implementation of agents learning policies with restraining specifications is performed by assuming a learning phase in simulation and an execution phase on the real world. The learning phase is obtained by combining three software components: 1) a simulator of the dynamic system, 2) a restraining bolt (RB) process, 3) a reinforcement learning (RL) agent. All these components are modular (i.e., they can be properly connected each other or replaced by other similar components). The simulator is responsible for computing
Figure 2: Experimental scenarios: BREAKOUT, SAPIENTINO, COCKTAILPARTY
the evolution of the dynamic system under study: it receives decisions (actions to be executed) by the RL agent and communicates: i) the current state of the system to both the RL agent and the RB process, and ii) the current reward value to the RL agent. The RB process receives the current state from the simulator, evaluates the LTL/LDL
formulas denoting the restraining specifications and sends to the RL agent an encoding of the progress of the DFA representing the formulas and reward values associated to their evolution. Finally, the RL agent receives the simulator state, the RB state, and the rewards and decides the actions to be executed, while computing an optimal policy. By using such a simulator, the RL agent can learn a policy that maximizes the cumulative discounted reward taking into account both rewards from the environment and rewards from the RB. In general, when enough training is allowed, the computed policy, when executed on the real world, will satisfy the RB specifications.
As mentioned, the RL agent and the RB process have different sensors to perceive different aspects of the state of the world (or of the simulator). So we assume that they are implemented with real sensors (when attached to the real world) and corresponding virtual sensors (when attached to the simulator). We also assume that the simulator is able to model all the relevant evolutions of the world that are needed to learn the specific task with restraining specifications.
Next we show the implementation of such components in three examples (Figure 2). The first one uses a video-game simulator, while the other two consider robotic tasks and their corresponding models in a simulator. The core software for the RL agent and for the RB process are domainindependent, while the (virtual) sensors and the LTL/LDL
specifications are domain-dependent. Since all examples are of episodic nature, the learning phase is managed by an execution system that resets episodes when any of the following conditions is verified: 1) a state of the DFA where the formula is satisfied is reached, 2) a failure state (i.e., a state from which it is not possible to satisfy any formula) of the DFA is reached, 3) a maximum number of actions have been executed (to avoid infinite loops).
To speed up learning, the implementation of the bolt monitors the progress of the DFA corresponding to the restraining specifications and applies a kind of reward shaping by exploiting the DFA structure9. Through reward shaping we can anticipate part of the reward coming from temporal specifi-cations without waiting for the formulas to become true.
Each experiment (i.e., a sequence of episodes to learn a policy) terminates after a time limit that is different for each problem (see next sections) and chosen to guarantee that a policy consistent with the specifications is always found, although in general not optimal. All the problems described below have been solved with n-step Sarsa algorithm, config-ured with . The trend of the solutions is anyway not sensitive to these parameters10.
Algorithms have been implemented as single-thread nonoptimized Python procedures, in a modular and abstract way to operate on every problem. More details about the experimental configurations, source code of the implementation allowing for reproducing the results contained in this paper, and videos of the found policies are available in www.diag.uniroma1.it/restraining-bolt. Breakout. BREAKOUT has been widely used to demonstrate RL approaches. The goal of the agent is to control the paddle in order to drive a ball to hit all the bricks in the screen. In this example, we have considered two agents with different abilities: MOVE: the agent moves sideways to bounce the ball; MOVE +FIRE: the agent can both move and fire straight up to remove bricks. Agent’s state representation uses the following features: position of the paddle;
: position and direction of movement of the ball11. Reward is given to the agent when a brick is hit. With this specification a RL algorithm can find a policy to remove all the bricks and complete the game for both the agents. Restraining bolt. We want to provide the agents with the following specification: the bricks must be removed from left to right, i.e., all the bricks in column i must be removed before completing any other column j > i. This specification can be expressed with an LTL
/LDL
formula and to evaluate such a formula, the bolt needs a representation
of the status of each brick
(present or removed). The agents, after receiving in input from the bolt an encoding of the status of the LTL
/LDL
formula and associated rewards, can use the same RL algorithm to learn a new policy that will complete the task (i.e., remove all the bricks) following the restraining bolt specification (i.e., from left to right).
Notice that the same restraining bolt is applied to the two different agents and they will both learn the behavior specified by the LTL/LDL
formula, obviously with different policies. Rows 1 and 2 in Figure 3 show the results of two experiments in the Breakout scenario with the following configurations: Breakout 4x6 MOVE + FIRE (5 minutes), Breakout 4x5 MOVE (1 hour). Left plots show the average reward over the number of iterations, while right plots show the score (i.e., number of columns correctly broken) of the best policy computed so far (i.e., the results obtained in runs without exploration). The figures show how the agent is able to progressively learn how to progress over the states of the DFA corresponding to the LTL
/LDL
specification. Similar results are obtained in different configurations (e.g., different sizes of the bricks). reported in the columns is encoded with the first letter being either M for MOVE and F for FIRE actions available, and the second letter being either L for LOCAL and G for GLOBAL for the sensor modality. Sapientino. SAPIENTINO Doc is an educational game for 5-8 y.o. children where a small mobile robot has to be programmed to visit specific cells in a 5x7 grid. Some cells con-
Figure 3: Average reward and scores over number of iterations. Row 1: Breakout MOVE + FIRE 4x6 bricks (5 minutes); Row 2: Breakout MOVE only 4x5 bricks (1 hour); Row 3: Sapientino S2 OMNI (3 minutes); Row 4: Sapientino S3 DIFFERENTIAL (1 hour); Row 5: Cocktail Party (3 minutes).
tain concepts that must be matched by the children (e.g., a colored animal, a color, the first letter of the animal’s name), while other cells are empty. The robot executes sequences of actions given in input by children with a keyboard on the robot’s top side. During execution, the robot moves on the grid and executes an action (actually a bip) to announce that the current cell has been reached (this is called a visit of a cell). A pair of consecutive visits are correct when they refer to cells containing matching concepts. As in the real game, we consider a 5x7 grid with 7 triplets of colored cells, each triplet representing three matching concepts. State representation is defined by the following features: reporting the pose of the agent in the grid. In this scenario, we consider two different agents: OMNI: omni-directional movements (actions: up, down, left, right), DIFFERENTIAL: differential drive (actions: forward, backward, turn left, turn right). With this specification, the agent can just learn how to move in the grid, but it cannot match related concepts. Restraining bolt. Consider now the specifications S2: visit at least two cells of the same color for each color, in a given order among the colors (the order of the colors is predefined: first
, then
, and so on) and S3: visit all the triplets of each color, in a given order among the colors. The following additional features are needed to express and evaluate the corresponding formula:
reporting that a bip action has just been executed and
reporting the color of the current cell.
The restraining specifications for these games can be expressed with LTLformulas. A fragment of LTL
formula for the first game relative to the first color
is
For other colors , we use a similar formula, but requiring that
has already been satisfied.
Two agents and two restraining bolts can be combined to form 4 different learning situations. We show only two of them. Rows 3 and 4 of Figure 3 show the agents’ learning ability (score = 14 for S2 specs, score = 21 for the S3 specs). Similar results are obtained in different configurations. Cocktail party. For a service robot involved in a cocktail party we consider a representation of the state in terms of robot’s pose and objects’ (drinks and snacks) and people’s location. The agent can move in the environment, grasp and deliver items to people, and get a reward when a delivery task is completed. The robot has no sophisticated people perception capabilities, and no memory is available in the underlying MDP modeling the domain, so the robot cannot get information about individual people or remember who received what. The robot in this scenario will just learn how to bring one item to any person (choosing the shortest path). Restraining bolt. Consider the following specification: serve exactly one drink and one snack to every person, but do not serve alcoholic drinks to minors. As in the previous examples, the restraining bolt works on separate features, namely identity, age and received items12 and uses an LTL/LDL
formula to model this specification. operating scenario. We assume the map of the environment to be known, people sitting at tables in predefined known positions and locations of snack and drink items also known. From these information we can instantiate a simulator for the robot to navigate in this environment and reach the different locations13.
For learning this task, we considered a problem with two people and two different kinds of drinks and snacks (4 tasks to be executed in total) and we implemented an abstract simulator reproducing the scenario of RoboCup@Home competition. The results of learning the restrained task in the simulator are depicted in Row 5 of Figure 3 (score = 4 means that the 2 persons have received one drink and one snack each). As shown, after about 1 minute of simulation14, the RL agent converged to a policy satisfying the RB specifications.
Minecraft. As an example of our approach’s modularity, we used the same agent of SAPIENTINO in a MINECRAFT scenario. Here the agent has to accomplish 10 tasks (described with non-Markovian rewards via an LTL/LDL
formula). The two agents share the same state representation S but differ in the action set A, the fluent configurations L, and the component progressing the DFAs. Results (not shown here) confirm that a general-purpose agent can learn several tasks by only receiving information from its restraining bolt.
We have shown how to perform RL with LTL/LDL
restraining specifications by resorting to typical RL techniques based on MDPs. Notably, we have shown that the features needed to evaluate LTL
/LDL
formulas can be kept separated from those directly accessible to the learning agent.
Our work can be ascribed to that part of research generated by the urgency of providing safety guarantees to AI techniques based on learning (Amodei et al. 2016; Hadfield-Menell et al. 2017; Orseau and Armstrong 2016). In particular, it shares similarities with recent work on constraining the RL agent to satisfy certain safety conditions (Wen, Ehlers, and Topcu 2015; Achiam et al. 2017; Alshiekh et al. 2018). There are however important differences. First, in enforcing the restraining bolt we consider the learning agent essentially as a black box. That is, the restraining bolt does not need to know the internals S of the learning agent, and specifies the desired constraints using only its world features L. On the other hand, we do not guarantee the satisfaction of the restraining bolt constraints during training, as in (Achiam et al. 2017). In fact, differently from (Wen, Ehlers, and Topcu 2015; Alshiekh et al. 2018), we do not guarantee the hard satisfaction of constraints even after training. After all “You can’t teach pigs to fly”! and we may very well ask to do so in our restraining bolts, being these completely unrestricted in the selection of world features and in the kind of formulas they specify over such features. If we want to check formally that the optimal policy satisfies the restraining bolt specifica-tion, we first need to model how actions affect the restraining bolt’s features L, i.e., we need to link the learning agent’s features S to L, and then we can use, e.g., model checking. Notably, for doing RL we do not need to specify such a link, but we can simply allow the (possibly simulated) world to act as the link, in line with what advocated, e.g., in (Brooks 1991), and very differently from what typically considered in knowledge representation (Reiter 2001).
Apart from restraining bolts, the interest in having separate representations is manifold. The learning agent feature space can be designed separately from the features needed to express the goal, thus promoting separation of concerns which, in turn, facilitates the design, providing for modularity and reuse of representations (the same agent can learn from different bolts and the same bolt can be applied to different agents). Also, a reduced agent’s feature space allows for realizing simpler agents (think, e.g., of a mobile robot platform, where one can avoid specific sensors and perception routines), while preserving the possibility of acting according to complex declarative specifications which cannot be represented in the agent’s feature space. We plan to investigate this separation further in the future.
Acknowledgements Work partially funded by Sapienza University of Rome, under projects “Immersive Cognitive Environments” and “Data-awaRe Automatic Process Execution”, and by the European Union’s Horizon 2020 research and innovation program under grant agreement N. 825619.
[Achiam et al. 2017] Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained policy optimization. In ICML, 22–31.
[Alberto et al. 2017] Alberto; Chen, O.; Sanner, S.; and McIlraith, S. A. 2017. Non-markovian rewards expressed in LTL: Guiding search via reward shaping. In SOCS, 159–160.
[Alshiekh et al. 2018] Alshiekh, M.; Bloem, R.; Ehlers, R.; K¨onighofer, B.; Niekum, S.; and Topcu, U. 2018. Safe reinforcement learning via shielding. In AAAI, 2669–2678.
[Amodei et al. 2016] Amodei, D.; Olah, C.; Steinhardt, J.; Chris- tiano, P.; Schulman, J.; and Man´e, D. 2016. Concrete problems in AI safety. CoRR abs/1606.06565.
[Bacchus, Boutilier, and Grove 1996] Bacchus, F.; Boutilier, C.; and Grove, A. 1996. Rewarding behaviors. In AAAI, 1160–1167.
[Baier et al. 2008] Baier, J. A.; Fritz, C.; Bienvenu, M.; and McIl- raith, S. A. 2008. Beyond classical planning: Procedural control knowledge and preferences in state-of-the-art planners. In AAAI.
[Brafman, De Giacomo, and Patrizi 2018] Brafman, R. I.; De Gia- como, G.; and Patrizi, F. 2018. LTLnon-markovian rewards. AAAI.
[Brooks 1991] Brooks, R. A. 1991. Intelligence without represen- tation. Artif. Intell. 47(1-3):139–159.
[Camacho et al. 2017] Camacho, A.; Chen, O.; Sanner, S.; and McIlraith, S. A. 2017. Decision-making with non-markovian rewards: From LTL to automata-based reward shaping. In RLDM, 279–283.
[De Giacomo and Rubin 2018] De Giacomo, G., and Rubin, S. 2018. Automata-theoretic foundations of FOND planning for LTLf and LDL
[De Giacomo and Vardi 2013] De Giacomo, G., and Vardi, M. Y. 2013. Linear temporal logic and linear dynamic logic on finite traces. In IJCAI.
[Gretton 2007] Gretton, C. 2007. Gradient-based relational rein- forcement learning of temporally extended policies. In ICAPS, 168–175.
[Gretton 2014] Gretton, C. 2014. A more expressive behavioral logic for decision-theoretic planning. In PRICAI, 13–25.
[Hadfield-Menell et al. 2017] Hadfield-Menell, D.; Dragan, A. D.; Abbeel, P.; and Russell, S. J. 2017. The off-switch game. In IJCAI, 220–227.
[Icarte et al. 2018] Icarte, R. T.; Klassen, T. Q.; Valenzano, R. A.; and McIlraith, S. A. 2018. Teaching multiple tasks to an RL agent using LTL. In AAMAS, 452–461.
[Lacerda, Parker, and Hawes 2014] Lacerda, B.; Parker, D.; and Hawes, N. 2014. Optimal and dynamic planning for markov decision processes with co-safe LTL specifications. In IROS, 1511– 1516.
[Lacerda, Parker, and Hawes 2015] Lacerda, B.; Parker, D.; and Hawes, N. 2015. Optimal Policy Generation for Partially Satis-fiable Co-Safe LTL Specifications. In IJCAI.
[Levesque et al. 1997] Levesque, H. J.; Reiter, R.; Lesperance, Y.; Lin, F.; and Scherl, R. 1997. GOLOG: A logic programming language for dynamic domains. J. of Logic Programming 31.
[Littman et al. 2017] Littman, M. L.; Topcu, U.; Fu, J.; Jr., C. L. I.; Wen, M.; and MacGlashan, J. 2017. Environment-independent task specifications via GLTL. CoRR abs/1704.04341.
[Littman 2015] Littman, M. L. 2015. Programming agent via re- wards. In Invited talk at IJCAI.
[Mnih et al. 2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M. A.; Fidjeland, A.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
[Orseau and Armstrong 2016] Orseau, L., and Armstrong, S. 2016. Safely interruptible agents. In UAI.
[Pnueli 1977] Pnueli, A. 1977. The temporal logic of programs. In FOCS.
[Puterman 1994] Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
[Rabin and Scott 1959] Rabin, M. O., and Scott, D. 1959. Finite automata and their decision problems. IBM J. Res. Dev. 3:114– 125.
[Reiter 2001] Reiter, R. 2001. Knowledge in Action. MIT Press.
[Silver et al. 2017] Silver, D.; amd Karen Simonyan, J. S.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; Chen, Y.; Lillicrap, T.; Hui, F.; Sifre, L.; van den Driessche, G.; Graepel, T.; and Hassabis, D. 2017. Mastering the game of go without human knowledge. Nature 550:354–359.
[Simsek and Barto 2006] Simsek, ¨O., and Barto, A. G. 2006. An intrinsic reward mechanism for efficient exploration. In ICML, volume 148 of ACM International Conference Proceeding Series, 833–840. ACM.
[Slaney 2005] Slaney, J. K. 2005. Semipositive LTL with an unin- terpreted past operator. Logic Journal of the IGPL 13(2):211–229.
[Sutton and Barto 1998] Sutton, R. S., and Barto, A. G. 1998. Reinforcement learning - an introduction. MIT Press.
[Thi´ebaux et al. 2006] Thi´ebaux, S.; Gretton, C.; Slaney, J. K.; Price, D.; and Kabanza, F. 2006. Decision-theoretic planning with non-markovian rewards. J. Artif. Intell. Res. (JAIR) 25:17–74.
[Wen, Ehlers, and Topcu 2015] Wen, M.; Ehlers, R.; and Topcu, U. 2015. Correct-by-synthesis reinforcement learning with temporal logic constraints. In IROS, 4983–4990.
[Whitehead and Lin 1995] Whitehead, S. D., and Lin, L.-J. 1995. Reinforcement learning of non-markov decision processes. Arti-ficial Intelligence 73(1):271 – 306. Computational Research on Interaction and Agency, Part 2.