b

DiscoverSearch
About
My stuff
Reinforcement Learning of Control Policy for Linear Temporal Logic Specifications Using Limit-Deterministic Generalized Büchi Automata
2020·arXiv
Abstract
Abstract

This letter proposes a novel reinforcement learning method for the synthesis of a control policy satisfying a control specification described by a linear temporal logic formula. We assume that the controlled system is modeled by a Markov decision process (MDP). We convert the specification to a limit-deterministic generalized B¨uchi automaton (LDGBA) with several accepting sets that accepts all infinite sequences satisfying the formula. The LDGBA is augmented so that it explicitly records the previous visits to accepting sets. We take a product of the augmented LDGBA and the MDP, based on which we define a reward function. The agent gets rewards whenever state transitions are in an accepting set that has not been visited for a certain number of steps. Consequently, sparsity of rewards is relaxed and optimal circulations among the accepting sets are learned. We show that the proposed method can learn an optimal policy when the discount factor is sufficiently close to one.

Index Terms—Reinforcement Learning, Linear Temporal Logic, Limit-Deterministic B¨uchi Automata.

REINFORCEMENT learning (RL) [1] is a useful ap-proach to learning an optimal policy from sample behaviors of a controlled system with inherent stochasticity, e.g., a Markov decision process (MDP) [2], when the probabilities associated with the controlled system are unknown a priori. In RL, we use a reward function that assigns a reward to each transition in the behaviors and evaluate a control policy by the return, namely an expected (discounted) sum of the rewards along the behaviors. One of the recent trends is to apply RL to synthesis problems under linear temporal logic (LTL) constraints. LTL is a formal language with rich expressivity and thus suitable for describing complex missions assigned to a controlled system [3], [4]. It is known that any LTL formula can be converted into an ω-automaton with a B¨uchi or a Rabin acceptance condition [3]. In many studies on LTL synthesis problems using RL, reward functions are formed systematically from automata corresponding to the LTL specification. This direction was first investigated by Sadigh et al. [5], where they defined a reward function based on the acceptance condition of a deterministic Rabin automaton [3] that accepts all words satisfying the LTL

This work was partially supported by JST-ERATO HASUO Project Grant Number JPMJER1603, Japan, and JST-Mirai Program Grant Number JP-MJMI18B4, Japan. The work of A. Sakakibara was supported by the Grant-in-Aid for Japan Society for the Promotion of Science Research Fellow under Grant JP19J13487.

The authors are with the Graduate School of Engineering Science, Osaka University, Toyonaka 560-8531, Japan (e-mail: r-oura, sakakibara@hopf.sys.es.osaka-u.ac.jp; ushio@sys.es.osaka-u.ac.jp).

constraint. Reward functions defined on specification automata were also proposed for a deep reinforcement learning method [6] and for an extension of LTL in collaboration with a control barrier function [7].

Recently, a limit-deterministic B¨uchi automaton (LDBA) or a generalized one (LDGBA) is paid much attention to as an  ω-automaton corresponding to the LTL specification [8]. The RL-based approaches to the synthesis of a control policy using LDBAs or LDGBAs have been proposed in [9][12]. An (LD)GBA has multiple accepting sets and accepts behaviors visiting all accepting sets infinitely often. One possible approach to generalized B¨uchi acceptance conditions is to convert a GBA into a corresponding BA, which has a single accepting set. The conversion, however, fixes the order of visits to accepting sets of the GBA [3] and causes the sparsity of the reward, which is a critical issue in RL-based controller synthesis. Another approach to RL-based synthesis for generalized B¨uchi conditions is the accepting frontier function introduced in [9], [10], based on which the reward function is defined. However, the function is memoryless, that is, it does not provide information of accepting sets that have been visited, which is important to improve learning performance.

In this letter, we propose a novel method to augment an LDGBA converted from a given LTL formula. Then, we define a reward function based on the acceptance condition of the product MDP of the augmented LDGBA and the controlled system represented as the MDP. The rest of the letter is organized as follows. Section II reviews an MDP, LTL, and automata. Section III proposes a novel RL-based method for the synthesis of a control policy. Section IV presents a numerical example to show the effectiveness of our proposed method.

Notations:  N0is the set of nonnegative integers.  R≥0is the set of nonnegative real numbers. For sets A and B, AB denotes their concatenation, i.e.,  AB = {ab; a ∈ A, b ∈ B}. Aωdenotes the infinite concatenation of the set A and  A∗denotes the finite one, i.e.,  Aω = {a0a1 . . . ; an ∈ A, n ∈ N0}and  A∗ = {a0a1 . . . an; an ∈ A, n ∈ N0}, respectively.  ε ∈A∗is the empty string.

A. Markov Decision Process

Definition 1: A (labeled) Markov decision process (MDP) is a tuple  M = (S, A, P, sinit, AP, L), where S is a finite set of states, A is a finite set of actions,  P : S ×S ×A → [0, 1]is a transition probability function,  sinit ∈ Sis the initial state, AP is a finite set of atomic propositions, and  L : S × A ×S → 2APis a labeling function that assigns a set of atomic propositions to each transition. Let  A(s) = {a ∈ A; ∃s′ ∈S s.t.  P(s′|s, a) ̸= 0}. Note that �s′∈S P(s′|s, a) = 1for any state  s ∈ Sand action  a ∈ A(s).

In the MDP M, an infinite path starting from a state  s0 ∈ Sis defined as a sequence  ρ = s0a0s1 . . . ∈ S(AS)ωsuch that  P(si+1|si, ai) > 0for any  i ∈ N0. A finite path is a finite sequence in  S(AS)∗. In addition, we sometimes represent  ρas  ρinitto emphasize that  ρstarts from  s0 = sinit. For a path  ρ = s0a0s1 . . ., we define the corresponding labeled path  L(ρ) = L(s0, a0, s1)L(s1, a1, s2) . . . ∈ (2AP )ω. InfPathM (resp.,  FinPathM)is defined as the set of infi-nite (resp., finite) paths starting from  s0 = sinitin the MDP M. For each finite path  ρ, last(ρ)denotes its last state.

Definition 2: A policy on an MDP M is defined as a mapping  π : FinPathM × A(last(ρ)) → [0, 1]. A policy πis a positional policy if for any  ρ ∈ FinPathMand any a ∈ A(last(ρ)), it holds that  π(ρ, a) = π(last(ρ), a); and there exists one action  a′ ∈ A(last(ρ))such that  π(ρ, a) = 1if  a = a′, and  π(ρ, a) = 0for any  a ∈ A(last(ρ))with a ̸= a′.

Let  InfPathMπ(resp.,  FinPathMπ) be the set of infi- nite (resp., finite) paths starting from  s0 = sinitin the MDP M under a policy  π. The behavior of the MDP M under a policy  πis defined on a probability space (InfPathMπ , FInfP athMπ , PrMπ ).

A Markov chain induced by the MDP M with a positional policy  πis a tuple  MCπ = (Sπ, Pπ, sinit, AP, L), where Sπ = S, Pπ(s′|s) = P(s′|s, a)for  s, s′ ∈ Sand  a ∈ A(s)such that  π(s, a) = 1. The state set  Sπof  MCπcan be represented as a disjoint union of a set of transient states Tπand closed irreducible sets of recurrent states  Rjπwith j ∈ {1, . . . , h}, i.e.,  Sπ = Tπ ∪ R1π ∪ . . . ∪ Rhπ[13]. In the following, we say a “recurrent class” instead of a “closed irreducible set of recurrent states” for simplicity.

In the MDP M, we define a reward function  R : S × A ×S → R≥0. The function returns the immediate reward received after the agent performs an action a at a state s and reaches a next state  s′as a result.

Definition 3: For a policy  πon an MDP M, any state  s ∈ S, and a reward function R, we define the expected discounted reward as

image

where  Eπdenotes the expected value given that the agent follows the policy  πfrom the state s and  γ ∈ [0, 1)is a discount factor. The function  V π(s)is often referred to as a state-value function under the policy  π. For any stateaction pair  (s, a) ∈ S × A, we define an action-value function Qπ(s, a)under the policy  πas follows.

image

Definition 4: For any state  s ∈ S, a policy  π∗is optimal if

image

where  Πposis the set of positional policies over the state set S.

B. Linear Temporal Logic and Automata

In our proposed method, we use linear temporal logic (LTL) formulas to describe various constraints or properties and to systematically assign corresponding rewards. LTL formulas are constructed from a set of atomic propositions, Boolean operators, and temporal operators. We use the standard notations for the Boolean operators:  ⊤(true),  ¬(negation), and  ∧(conjunction). LTL formulas over a set of atomic propositions AP are defined as

image

where  ϕ, ϕ1, and  ϕ2are LTL formulas. Additional Boolean operators are defined as  ⊥:= ¬⊤, ϕ1 ∨ ϕ2 := ¬(¬ϕ1 ∧ ¬ϕ), and  ϕ1 ⇒ ϕ2 := ¬ϕ1 ∨ ϕ2. The operators X and U are called “next” and “until”, respectively. Using the operator U, we define two temporal operators: (1) eventually, Fϕ := ⊤Uϕand (2) always, Gϕ := ¬F¬ϕ.

Definition 5: For an LTL formula  ϕand an infinite path ρ = s0a0s1 . . .of an MDP M with  s0 ∈ S, the satisfaction relation  M, ρ |= ϕis recursively defined as follows.

image

where  ρ[i :]be the i-th suffix  ρ[i :] = siaisi+1 . . ..

The next operator X requires that  ϕis satisfied by the next state suffix of  ρ. The until operator U requires that  ϕ1holds true until  ϕ2becomes true over the path  ρ. In the following, we write  ρ |= ϕfor simplicity without referring to M.

For any policy  π, the probability of all paths starting from siniton the MDP M that satisfy an LTL formula  ϕunder the policy  π, or the satisfaction probability under  πis defined as

image

We say that an LTL formula  ϕis positively satisfied by a positional policy  πif

image

Any LTL formula  ϕcan be converted into various  ω-automata, namely finite state machines that recognize all infinite words satisfying  ϕ. We review a generalized B¨uchi automaton at the beginning, and then introduce a limit-deterministic generalized B¨uchi automaton [10].

Definition 6: A transition-based generalized B¨uchi automaton (tGBA) is a tuple  B = (X, xinit, Σ, δ, F), where X is a finite set of states,  xinit ∈ Xis the initial state,  Σis an input alphabet including  ε, δ ⊂ X × Σ × Xis a set of transitions, and  F = {F1, . . . , Fn}is an acceptance condition, where for each  j ∈ {1, . . . , n}, Fj ⊂ δis a set of accepting transitions and called an accepting set. We refer to a tGBA with one accepting set as a tBA.

An infinite sequence  r = x0σ0x1 . . . ∈ X(ΣX)ωis called an infinite run if  (xi, σi, xi+1) ∈ δfor any  i ∈ N0. An infinite word  w = σ0σ1 . . . ∈ Σωis accepted by  Bϕif and only if there exists an infinite run  r = x0σ0x1 . . .starting from x0 = xinitsuch that  inf(r)∩Fj ̸= ∅for each  Fj ∈ F, where inf(r) is the set of transitions that occur infinitely often in the run r.

Definition 7: A transition-based limit-deterministic generalized B¨uchi automaton (tLDGBA) is a tGBA B = (X, xinit, Σ, δ, F)such that X is partitioned into two disjoint sets  Xinitialand  Xfinalsuch that

Fj ⊂ Xfinal × Σ × Xfinal, ∀j ∈ {1, ..., n},

|{(x, σ, x′) ∈ δ; σ∈Σ, x′ ∈ Xfinal}|=1, ∀x∈Xfinal,

|{(x, σ, x′) ∈ δ; σ∈Σ, x′ ∈ Xinitial}|=0,  ∀x∈Xfinal,

 ∀(x, ε, x′) ∈δ, x  ∈ Xinitial ∧ x′ ∈ Xfinal.

An  ε-transition enables the tLDGBA to change its state with no input and reflects a single “guess” from  Xinitialto  Xfinal. Note that by the construction in [8], the transitions in each part are deterministic except for  ε-transitions from  Xintialto Xfinal. It is known that, for any LTL formula  ϕ, there exists a tLDGBA that accepts all words satisfying  ϕ[8]. We refer to a tLDGBA with one accepting set as a tLDBA. In particular, we represent a tLDGBA recognizing an LTL formula  ϕas  Bϕ, whose input alphabet is given by  Σ = 2AP ∪ {ε}.

We introduce an automaton augmented with binary-valued vectors. The automaton can ensure that transitions in each accepting set occur infinitely often.

Let  V = {(v1, . . . , vn)T ; vi ∈ {0, 1}, i ∈ {1, . . . , n}}be a set of binary-valued vectors, and let 1 and 0 be the n-dimentional vectors with all elements 1 and 0, respectively. In order to augment a tLDGBA  Bϕ, we introduce three functions visitf : δ → V , reset : V → V, and  Max : V × V → Vas follows. For any  e ∈ δ, visitf(e) = (v1, . . . , vn)T, where  vi = 1if  e ∈ Fiand  vi = 0otherwise. For any  v ∈ V , reset(v) = 0if v = 1 and reset(v) = v otherwise. For any  v, u ∈ V , Max(v, u) = (l1, . . . , ln)T, where  li = max{vi, ui}for any  i ∈ {1, . . . , n}.

Each vector v is called a memory vector and represents which accepting sets have been visited. The function visitf returns a binary vector whose i-th element is 1 if and only if a transition in the accepting set  Fioccurs. The function reset returns the zero vector 0 if at least one transition in each accepting set has occurred after the latest reset. Otherwise, it returns the input vector without change.

Definition 8: For a tLDGBA  Bϕ = (X, xinit, Σ, δ, F), its augmented automaton is a tLDGBA ¯Bϕ = ( ¯X, ¯xinit, ¯Σ, ¯δ, ¯F), where ¯X = X × V , ¯xinit = (xinit, 0), ¯Σ = Σ, ¯δis defined as ¯δ = {((x, v), ¯σ, (x′, v′)) ∈ ¯X × ¯Σ × ¯X; (x, ¯σ, x′) ∈ δ, v′ =

image

Clearly from the construction, we have  (¯xi, ¯σi, ¯xi+1) ∈ ¯δfor any  i ∈ N. Thus,  ¯ris a run of ¯Bstarting from  ¯x0 =(xinit, 0) = ¯xinit.

We now show that  inf(¯r) ∩ ¯Fj ̸= ∅for each ¯Fj ∈ ¯F. Since  inf(r) ∩ Fj ̸= ∅for each  Fj ∈ F, we have

image

for any  j ∈ {1, . . . , n}, where  [[(x, v)]]X = xfor each (x, v) ∈ ¯X. By the construction of  ¯r, therefore, there are infinitely many indices  l ∈ Nwith  vl = 0. Let  l1, l2 ∈ Nbe arbitrary nonnegative integers such that  l1 < l2, vl1 =vl2 = 0, and  vl′ ̸= 0for any  l′ ∈ {l1 + 1, . . . , l2 − 1}. Then,

image

where  (vk)jis the j-th element of  vk. Hence, we have inf(¯r) ∩ ¯Fj ̸= ∅for each ¯Fj ∈ ¯F, which implies  w ∈L( ¯B).

⊃: Consider any  ¯w ∈ ¯σ0¯σ1 . . . ∈ L( ¯B). Then, there exists a run  ¯r = ¯x0¯σ0¯x1¯σ1 ¯x2 . . . ∈ ¯X(¯Σ ¯X)ωof ¯Bsuch that ¯x0 = ¯xinitand  inf(¯r) ∩ ¯Fj ̸= ∅for each ¯Fj ∈ ¯F, i.e.,

image

For the run ¯r, we construct a sequence r = x0σ0x1σ1x2 . . . ∈ X(ΣX)ωsuch that  xi = [[¯xi]]Xand σi = ¯σifor any  i ∈ N. It is clear that r is a run of B starting from  x0 = xinit. It holds by Eq. (1) that inf(r) ∩ Fj ̸= ∅for each  Fj ∈ F, which implies ¯w ∈ L(B).

image

For example, shown in Figs. 1 and 2 are a tLDGBA and its augmented automaton, respectively, associated with the following LTL formula.

image

The acceptance condition F of the tLDGBA is given by F = {F1, F2}, where  F1 = {(x0, {a}, x0), (x0, {a, b}, x0)}and F2 = {(x0, {b}, x0), (x0, {a, b}, x0)}. Practically, states in a strongly connected component that contains no accepting transitions can be merged as shown in Fig. 2.

image

Fig. 1. The tLDGBA recognizing the LTL formula GFa∧GFb∧G¬c, whereX = {x0, x1} = Xfinal and xinit = x0. Red arcs are accepting transitions that are numbered in accordance with the accepting sets they belong to.

image

Fig. 2. The augmented automaton for the tLDGBA in Fig. 1 recognizing the LTL formula GFa ∧ GFb ∧ G¬c, where the initial state is  (x0, (0, 0)T ).Red arcs are accepting transitions that are numbered in accordance with the accepting sets they belong to. All states corresponding to  x1are merged into (x1, (∗, ∗)T ).

We modify the standard definition of a product MDP to deal with  ε-transitions in the augmented automaton.

Definition 9: Given an augmented tLDGBA ¯Bϕand an MDP M, a tuple  M ⊗ ¯Bϕ = M ⊗ = (S⊗, A⊗, s⊗init, P ⊗, δ⊗,F⊗)is a product MDP, where  S⊗ = S × ¯Xis the finite set of states;  A⊗is the finite set of actions such that  A⊗ =A ∪ {ε¯x′; ∃¯x′ ∈ Xs.t.  (¯x, ε, ¯x′) ∈ ¯δ}, where  ε¯x′is the action for the  ε-transition to the state  ¯x′ ∈ ¯X; s⊗init = (sinit, ¯xinit)is the initial state;  P ⊗ : S⊗ ×S⊗ ×A⊗ → [0, 1]is the transition probability function defined as

image

where s⊗ = (s, (x, v)) and s⊗′ = (s′, (x′, v′)); δ⊗ ={(s⊗, a, s⊗′) ∈ S⊗ × A⊗ × S⊗;P ⊗(s⊗′|s⊗, a) > 0} is

the set of transitions; and  F⊗ = { ¯F ⊗1 , . . . , ¯F ⊗n }is the acceptance condition, where ¯F ⊗i = {((s, ¯x), a, (s′, ¯x′)) ∈δ⊗; (¯x, L(s, a, s′), ¯x′) ∈ ¯Fi}for each  i ∈ {1, . . . , n}.

Definition 10: The reward function  R : S⊗ × A⊗ × S⊗ →R≥0is defined as

image

R(s⊗, a, s⊗′) =

image

where  rpis a positive value. Remark 1: When constructing a tBA from a tGBA, the order of visits to accepting sets of the tGBA is fixed. Consequently, the rewards based on the acceptance condition of the tBA tends to be sparse. The sparsity is critical in RL-based controller synthesis problems. The augmentation of the tGBA relaxes the sparsity since the augmented tGBA has all of the order of visits to all accepting sets of the original tGBA. For the acceptance condition F of the tGBA, the size of the state space of the augmented tGBA is about 2|F|−1

the tBA constructed from the tGBA. However, the number of accepting transitions to all transitions in the augmented tGBA is much greater than the tBA. Therefore, our proposed method is expected to be better than using the tLDBA in the sense of sample efficiency. In the following, we show that a positional policy positively satisfying  ϕon the product MDP  M ⊗is synthesized by using the reward function R in Definition 10, which is based on the acceptance condition of  M ⊗.

image

with a positional policy  π, let  S⊗π = T ⊗π ∪ R⊗1π ∪ . . . ∪ R⊗hπbe the set of states in  MC⊗π, where  T ⊗πis the set of transient states and  R⊗iπis the recurrent class for each  i ∈ {1, . . . , h}, and let  R(MC⊗π )be the set of all recurrent states in  MC⊗π. Let δ⊗iπbe the set of transitions in a recurrent class  R⊗iπ, namely δ⊗iπ = {(s⊗, a, s⊗′) ∈ δ⊗; s⊗ ∈ R⊗iπ , P ⊗(s⊗′|s⊗, a) > 0}, and let  P ⊗π:  S⊗π × S⊗π → [0, 1]be the transition probability function under  π.

image

the Markov chain  MC⊗π,  MC⊗πsatisfies one of the following conditions.

image

where  pk((s, a, s′), (s, a, s′))is the probability that the transition  (s, a, s′)reoccurs after it occurs in k time steps. Eq. (3) means that each transition in  R⊗iπoccurs infinitely often. However, the memory vector v is never reset in  R⊗iπby the assumption. This directly contradicts Eq. (3).

Lemma 1 implies that, for an LTL formula  ϕ, if a policy πdoes not satisfy  ϕ, then the agent obtains no reward in recurrent classes; otherwise there exists at least one recurrent class where the agent obtains rewards infinitely often.

Theorem 1: For a product MDP  M ⊗of an MDP M and an augmented tLDGBA ¯Bϕcorresponding to a given LTL formula  ϕand the reward function given by Definition 10, if there exists a positional policy positively satisfying  ϕon  M ⊗, then there exists a discount factor  γ∗such that any algorithm that maximizes the expected discounted reward with  γ > γ∗will find a positional policy positively satisfying  ϕon  M ⊗.

Proof: Suppose that  π∗is an optimal policy but does not satisfy the LTL formula  ϕ. Then, for any recurrent class  R⊗iπ∗in the Markov chain  MC⊗π∗and any accepting set  ¯F ⊗jof the product MDP  M ⊗, δ⊗iπ∗ ∩ ¯F ⊗j = ∅holds by Lemma 1. Thus, the agent under the policy  π∗can obtain rewards only in the set of transient states. We consider the best scenario in the assumption. Let  pk(s, s′)be the probability of going to a state s′in k time steps after leaving the state s, and let  Post(T ⊗π )be the set of states in recurrent classes that can be transitioned from states in  T ⊗πby one action. For the initial state  sinitin the set of transient states, it holds that

image

where the action a is selected by  π∗. By the property of the transient states, for any state  s⊗in  T ⊗π∗, there exists a bounded positive value m such that �∞k=0 γkpk(sinit, s) ≤�∞k=0 pk(sinit, s) < m[13]. Therefore, there exists a bounded positive value  ¯msuch that  V π∗(sinit) < ¯m. Let  ¯πbe a positional policy satisfying  ϕ. We consider the following two cases.

1) Assume that the initial state  sinitis in a recurrent class R⊗i¯πfor some  i ∈ {1, . . . , h}. For any accepting set  ¯F ⊗j, δ⊗i¯π ∩ ¯F ⊗j ̸= ∅holds by the definition of  ¯π. The expected discounted reward for  sinitis given by

image

where the action a is selected by  ¯π. Since  sinitis in R⊗i¯π, there exists a positive number  ¯k = min{k ; k ≥n, pk(sinit, sinit) > 0}[13]. We consider the worst

image

whereas all states in  R(MC⊗¯π )are positive recurrent because  |S⊗| < ∞[14]. Obviously,  pk¯k(sinit, sinit) ≥(p¯k(sinit, sinit))k > 0holds for any  k ∈ (0, ∞)by the Chapman-Kolmogorov equation [13]. Furthermore, we have  limk→∞ pk¯k(sinit, sinit) > 0by the property of irreducibility and positive recurrence [15]. Hence, there exists  ¯psuch that  0 < ¯p < pk¯k(sinit, sinit)for any k ∈ (0, ∞]and we have

image

Therefore, for any rp < ∞and any ¯m ∈(V π∗(sinit), ∞), there exists  γ∗ < 1such that  γ > γ∗implies  V ¯π(sinit) > rp¯p γ¯k1−γ¯k > ¯m.

2) Assume that the initial state  sinitis in the set of transient states  T ⊗¯π.  P M ⊗¯π (sinit |= ϕ) > 0holds by the definition of  ¯π. For a recurrent class  R⊗i¯πsuch that  δ⊗i¯π ∩ ¯F ⊗j ̸= ∅for each accepting set ¯F ⊗j, there exist a number  ¯l > 0, a state  ˆsin  Post(T ⊗¯π ) ∩ R⊗i¯π, and a subset of transient states  {s1, . . . , s¯l−1} ⊂ T ⊗¯πsuch that  p(sinit, s1) > 0, p(si, si+1) > 0for  i ∈ {1, . . . , ¯l−2}, and  p(s¯l−1, ˆs) > 0by the property of transient states. Hence, it holds that p¯l(sinit, ˆs) > 0for the state  ˆs. Thus, by ignoring rewards in  T ⊗¯π, we have

image

In this section, we apply the proposed method to a path planning problem of a robot in an environment consisting

image

Fig. 3. The environment consisting of eight rooms and one corridor. Red arcs are the transitions that we want to occur infinitely often, while blue arcs are the transitions that we never want to occur.  s7is the initial state.

of eight rooms and one corridor as shown in Fig. 3. The state  s7is the initial state and the action space is speci-fied with A(s) = {Right, Left, Up, Down} for any state s ̸= s4and  A(s4) = {to s0, to s1, to s2, to s3, to s5,to s6, to s7, to s8}, where  to simeans attempting to go to the state  sifor  i ∈ {0, 1, 2, 3, 5, 6, 7, 8}. The robot moves in the intended direction with probability 0.9 and it stays in the same state with probability 0.1 if it is in the state  s4. In the states other than  s4, it moves in the intended direction with probability 0.9 and it moves in the opposite direction to that it intended to go with probability 0.1. If the robot tries to go to outside the environment, it stays in the same state. The labeling function is as follows.

image

In the example, the robot tries to take two transitions that we want to occur infinitely often, represented by arcs labeled by {a} and {b}, while avoiding unsafe transitions represented by the arcs labeled by {c}. This is formally specified by the LTL formula given by Eq. (2). The LTL formula requires the robot to keep on entering the two rooms  s0and  s8from the corridor  s4regardless of the order of entries, while avoiding entering the four rooms  s2, s3, s5, and  s6. The tLDGBA  Bϕand its augmented automaton ¯Bϕare shown in Figs. 1 and 2, respectively.

Through the above scenario, we compare our approach with 1) a case where we first convert the tLDGBA into a tLDBA, for which the augmentation makes no change, and thus a reward function in Definition 10 is based on a single accepting set; and 2) the method using a reward function based on the accepting frontier function [9], [10]. For the three methods, we use Qlearning1 with an epsilon-greedy policy. The epsilon-greedy parameter is given by 0.95nt(s⊗), where  nt(s⊗)is the number of visits to state  s⊗within t time steps [16], so that it will gradually reduce to 0 to learn an optimal policy asymptotically. We set the positive reward  rp = 2and the discount factor γ = 0.95. The learning rate  αvaries in accordance with the Robbins-Monro condition. We train the agent in 10000 iterations and 1000 episodes for 100 learning sessions.

image

Fig. 4. The mean of average reward in each episode for 100 learning sessions obtained from our proposed method (left) and the method using tLDBA (right). They are plotted per 50 episodes and the green areas represent the range of standard deviations.

image

Fig. 5. The optimal policy obtained from our proposed method (left) and the method in [9], [10] (right).

Results

1) Fig. 4 shows the average rewards obtained by our proposed method and the case using a tLDBA  B′ϕconverted from  ϕ, respectively. Both methods eventually acquire an optimal policy satisfying  ϕ. As shown in Fig. 4, however, our proposed method converges faster. This is because the order of entrances to the rooms  s0and  s8is determined according to the tLDBA. Moreover, the number of transitions with a positive reward in ¯Bϕis larger than that in  B′ϕ.

2) We use the accepting frontier function [9], [10] for the tLDGBA  Acc : δ × 2δ → 2δ. Initializing a set of transitions F with the set of the all accepting transitions in  Bϕ, the function receives the transition  (x, σ, x′)that occurs and the set F. If  (x, σ, x′)is in F, then Acc removes the accepting sets containing  (x, σ, x′)from F. For the product MDP of the MDP M and the tLDGBA  Bϕ, the reward function is based on the removed sets of  Bϕ. Then, we synthesize a positional policy on the product MDP derived from the tLDGBA  Bϕ.

Fig. 5 shows the optimal policies obtained by our proposed method and the method in [9], [10], respectively. The policy obtained by the method with the accepting frontier function fails to satisfy the LTL specification2 because it is impossible that the transitions labeled with {a} and {b} occur from  s4infinitely often by any positional policy on the product MDP with  Bϕshown in Fig. 1. More specifically, as shown in Fig. 6, the agent cannot take each accepting transition colored with red by any deterministic stationary action selection at the product state  (s4, x0). In our proposed method, the augmented tLDGBA includes the information of the (path-dependent) domain of the accepting frontier function explicitly as memory vectors, which enables us to synthesize a positional policy satisfying  ϕon the product MDP.

image

Fig. 6. A part of the product MDP of the MDP shown in Fig. 3 and the tLDGBA shown in Fig. 1, where each transitions is labeled with either an action or the transition probability.

The letter proposed a novel RL-based method for the synthesis of a control policy for an LTL specification using an augmented tLDGBA. The proposed method improved the learning performance compared to existing methods. It is future work to investigate a method that maximizes the satisfaction probability.

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. MIT press, 2018.

[2] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 1994.

[3] C. Baier and J.-P. Katoen, Principles of Model Checking. MIT Press, 2008.

[4] C. Belta, B. Yordanov, and E. Aydin Gol, Formal Methods for DiscreteTime Dynamical Systems. Springer International Publishing, 2017.

[5] D. Sadigh, E. S. Kim, S. Coogan, S. S. Sastry, and S. A. Seshia, “A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications,” in 53rd IEEE Conference on Decision and Control, 2014, pp. 1091–1096.

[6] Q. Gao, D. Hajinezhad, Y. Zhang, Y. Kantaros, and M. M. Zavlanos, “Reduced variance deep reinforcement learning with temporal logic specifications,” in Proceedings of the 10th ACM/IEEE International Conference on Cyber-Physical Systems, 2019, pp. 237–248.

[7] X. Li, Z. Serlin, G. Yang, and C. Belta, “A formal methods approach to interpretable reinforcement learning for robotic planning,” Science Robotics, vol. 4, no. 37, pp. 1–16, 2019.

[8] S. Sickert, J. Esparza, S. Jaax, and J. Kˇret´ınsk`y, “Limit-deterministic B¨uchi automata for linear temporal logic,” in International Conference on Computer Aided Verification. Springer, 2016, pp. 312–332.

[9] M. Hasanbeig, A. Abate, and D. Kroening, “Logically-constrained reinforcement learning,” arXiv preprint arXiv:1801.08099, 2018.

[10] M. Hasanbeig, Y. Kantaros, A. Abate, D. Kroening, G. J. Pappas, and I. Lee, “Reinforcement learning for temporal logic control synthesis with probabilistic satisfaction guarantees,” arXiv preprint arXiv:1909.05304, 2019.

[11] E. M. Hahn, M. Perez, S. Schewe, F. Somenzi, A. Trivedi, and D. Wojtczak, “Omega-regular objectives in model-free reinforcement learning,” in International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 2019, pp. 395–412.

[12] A. K. Bozkurt, Y. Wang, M. M. Zavlanos, and M. Pajic, “Control synthesis from linear temporal logic specifications using model-free reinforcement learning,” arXiv preprint arXiv:1909.07299, 2019.

[13] R. Durrett, Essentials of Stochastic Processes, 2nd ed. Springer, 2012.

[14] L. Breuer, “Introduction to stochastic processes,” [Online]. Available: https://www.kent.ac.uk/smsas/personal/lb209/files/sp07.pdf.

[15] S. M. Ross, Stochastic Processes, 2nd ed. Wiley New York, 1996.

[16] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesv´ari, “Convergence results for single-step on-policy reinforcement-learning algorithms,” Machine learning, vol. 38, no. 3, pp. 287–308, 2000.


Designed for Accessibility and to further Open Science