Counterexample-Guided Strategy Improvement for POMDPs Using Recurrent Neural Networks

2019·Arxiv

Abstract

Abstract

We study strategy synthesis for partially observable Markov decision processes (POMDPs). The particular problem is to determine strategies that provably adhere to (probabilistic) temporal logic constraints. This problem is computationally intractable and theoretically hard. We propose a novel method that combines techniques from machine learning and formal verification. First, we train a recurrent neural network (RNN) to encode POMDP strategies. The RNN accounts for memory-based decisions without the need to expand the full belief space of a POMDP. Secondly, we restrict the RNN-based strategy to represent a finite-memory strategy and implement it on a specific POMDP. For the resulting finite Markov chain, efficient formal verification techniques provide provable guarantees against temporal logic specifications. If the specification is not satisfied, counterexamples supply diagnostic information. We use this information to improve the strategy by iteratively training the RNN. Numerical experiments show that the proposed method elevates the state of the art in POMDP solving by up to three orders of magnitude in terms of solving times and model sizes.

1 Introduction

Autonomous agents that make decisions under uncertainty and incomplete information can be mathematically represented as partially observable Markov decision processes (POMDPs). In this setting, while an agent makes decisions within an environment, it obtains observations and infers the likelihood of the system being in a certain state, known as the belief state. POMDPs are effective in modeling a number of real-world applications, see for instance [Kaelbling et al., 1998]. Traditional POMDP problems typically seek to compute a strategy that maximizes a cumulative reward over a finite horizon.

However, the agent’s behavior is often required to obey more complicated specifications. For example, reachability, liveness or, more generally, specifications expressed in temporal logic (e. g. LTL [Pnueli, 1977]) describe tasks that cannot be expressed using reward functions [Littman et al., 2017].

Strategy synthesis for POMDPs is a difficult problem, both from the theoretical and the practical perspective. For infinite- or indefinite-horizon problems, computing an optimal strategy is undecidable [Madani et al., 1999]. Optimal action choices depend on the whole history of observations and actions, thus requiring an infinite amount of memory. When restricting the specifications to maximize accumulated rewards over a finite horizon and limiting the available memory, computing an optimal strategy is PSPACE-complete [Papadimitriou and Tsitsiklis, 1987]. This problem is, practically, intractable even for small instances [Meuleau et al., 1999]. Moreover, even when strategies are restricted to be memoryless, finding an optimal strategy within this set is still NP-hard [Vlassis et al., 2012]. For more general specifications like LTL properties, synthesis of strategies with limited memory is even harder, namely EXPTIME-complete [Chatterjee et al., 2015]).

The intractable nature of finding exact solutions in these problems gave rise to approximate [Hauskrecht, 2000], point-based [Pineau et al., 2003], or Monte-Carlo-based [Silver and Veness, 2010] methods. However, none of these approaches provides guarantees for given temporal logic specifications. The tool PRISM-POMDP [Norman et al., 2017] does so by approximating the belief space into a fully observable belief MDP, but is restricted to small examples.

Although strategy synthesis for POMDPs is hard, an available candidate strategy resolves the nondeterminism and partial observability for a POMDP and yields a so-called induced discrete-time Markov chain (MC). For this simpler model, verification methods are capable to efficiently certify temporal logic constraints and reward specifications for billions of states [Baier and Katoen, 2008]. Tool support is available via probabilistic model checkers such as PRISM [Kwiatkowska et al., 2011] or Storm [Dehnert et al., 2017].

There remains a dichotomy between directly synthesizing an optimal strategy and the efficient verification of a candidate strategy. The key questions are (1) how to generate a “good” strategy in the first place and (2) how to improve a strategy if verification refutes the specification. Machine learning and formal verification techniques address these questions separately. In this paper, we combine methods from both fields in order to guarantee that a candidate strategy learned through machine learning provably satisfies temporal logic specifications.

At first, we learn a randomized strategy1 via recurrent neural networks (RNNs) [Hochreiter and Schmidhuber, 1997] and data stemming from knowledge of the underlying structure of POMDPs. We refer to the resulting trained RNN as the strategy network. RNNs are a good candidate for learning a strategy because they can successfully represent temporal dynamic behavior [Pascanu et al., 2013].

Secondly, we extract a concrete (memoryless randomized) candidate strategy from the RNN and use it directly on a given POMDP, resulting in the MC induced by the POMDP and the strategy. Formal verification reveals whether specifications are satisfied or not. In the latter case, we generate a so-called counterexample [Wimmer et al., 2014], which points to parts of the MC (and by extension of the POMDP), that are critical for the specification. For those critical parts, we use a linear programming (LP) approach that locally improves strategy choices (without any guarantees on the global behavior). From the improved strategy, we generate new data to retrain the RNN. We iterate that procedure until the strategy network yields satisfactory results.

While the strategies are memoryless, allowing for randomization over possible choices – relaxing determinism – is often sufficient to capture necessary variability in decisionmaking. The intuition is that deterministic choices at a certain state may need to vary depending on previous decisions, thereby trading off memory. However, randomization in combination with finite memory may supersede infi-nite memory even more for many cases [Amato et al., 2010; Junges et al., 2018]. We encode finite memory directly into a POMDP by extending its state space. We can then directly apply our method to create finite-state controllers (FSCs) [Meuleau et al., 1999].

As previously discussed, the investigated problem is undecidable for POMDPs [Madani et al., 1999] and therefore the approach is naturally incomplete. Soundness is provided, as verification yields hard guarantees on the quality of a strategy.

Related Work. We list relevant works in addition to the ones already mentioned. [Wierstra et al., 2007] is the first to employ a long short-term memory (LSTM) architecture to learn (finite-memory) strategies for POMDPs. [Mnih et al., 2015] develops a neural network (NN)-based Qlearning algorithm (called Deep Q-learning) to play video games straight from video frames, under partial observability. [Hausknecht and Stone, 2015] uses an LSTM cell to enhance the algorithm with memory. While these approaches yield good performance, they do not provide formal guarantees on strategies and cannot incorporate temporal logic spec-ifications. [Chatterjee et al., 2015] studies verification problems for POMDPs with temporal logic specifications on a

theoretical level without a connection to machine learning.

A different research area addresses the (direct) verification of NNs, as opposed to the model-based verification approach we pursue, see for instance [Katz et al., 2017].

2 Preliminaries

A probability distribution over a finite or countably infinite set X is a function with . The set of all distributions on X is Distr(X). The support of a distribution is 0}.

(PO)MDPs. A Markov decision process (MDP) M is a tuple M = (S, Act, P) with a finite (or countably infinite) set S of states, a finite set Act of actions, and a transition function . We use a reward function . A finite path of an MDP M is a sequence of states and actions; is the last state of . The set of finite paths of M is . A discrete-time Markov chain (MC) is an MDP with |Act(s)| = 1 for all .

A strategy for and MDP M is a function Distr(Act) with for all . A strategy is memoryless if implies for all .

Definition 1 (Induced Markov Chain) For an MDP M = (S, Act, P) and a strategy , the MC induced by M and is given by where:

P π, π, a, sif ππas, 0 otherwise.

Definition 2 (POMDP) A partially observable Markov decision process (POMDP) is a tuple M = (M, Z, O), with M = (S, Act, P) the underlying MDP of M, Z a finite set of observations and the observation function.

The set of all finite observation-action sequences for a POMDP M is denoted by .

Definition 3 (POMDP Strategy) An observation-based strategy for a POMDP M is a function Distr(Act) such that for all is the set of observation-based strategies for M.

A memoryless observation-based strategy is analogous to a memoryless MDP strategy, formally we simplify to , i. e., we decide based on the current observation only. Similarly, POMDP together with a strategy yields an induced MC as in Def. 1, resolving all nondeterminism and partial observability. A general POMDP strategy can be represented by infinite-state controllers. Strategies are often restricted to finite memory; this amounts to using finite- state controllers (FSCs) [Meuleau et al., 1999].

Definition 4 (FSC) A k-FSC for a POMDP is a tuple A = where N is a finite set of k memory nodes, N is the initial memory node, is the action mapping and is the memory update . Let denote the observation-based strategy represented by the FSC A.

The product of a POMDP and a k-FSC yields a (larger) “flat” POMDP where the memory update is directly encoded into the state space [Junges et al., 2018]. The action mapping is left out of the product. A memoryless strategy then determines the action mapping and can be projected to the finite-memory strategy .

Specifications. We consider linear-time temporal logic (LTL) properties [Pnueli, 1977]. For a set of atomic propositions AP, which are either satisfied or violated by a state, and , the set of LTL formulas is given by:

Intuitively, a path satisfies the proposition a if its first state does; is satisfied, if satisfies both and is true on if is not satisfied. The formula holds on if the subpath starting at the second state of satisfies . The path satisfies if all suffixes of satisfy . Finally, satisfies if there is a suffix of that satisfies and all longer suffixes satisfy abbreviates .

For POMDPs, one wants to synthesize a strategy such that the probability of satisfying an LTL-property respects a given bound, denoted for and [0, 1]. In addition, undiscounted expected reward properties require that the expected accumulated cost until reaching a state satisfying a respects .

If (either LTL or expected reward specification) is sat-isfied in a (PO)MDP M under , we write , that is, the specification is satisfied in the induced MC, see Def. 1. While determining an appropriate strategy is still efficient for MDPs, this problem is in general undecidable for POMDPs [Chatterjee et al., 2016]. In particular, for MDPs, to check the satisfaction of a general LTL specification one needs memory. Typically, tools like PRISM [Kwiatkowska et al., 2011] compute the product of the MDP and a deterministic Rabin automaton. In this product, reachability of so-called accepting end-components ensures the satisfaction of the LTL property. This reachability probability can be determined in polynomial time. PRISMPOMDP [Norman et al., 2017] handles the problem similarly for POMDPs, but note that a strategy needs memory not only for the LTL specification but also for observation dependencies.

Finally, given a (candidate) strategy , checking whether holds can be done both for MDPs and POMDPs in polynomial time. For more details we refer to [Baier and Katoen, 2008].

3 Synthesis Procedure

Formal Problem Statement. For a POMDP M and a specification , where either with an

Figure 1: Flowchart of the RNN-based refinement loop

LTL formula, or , the problem is to determine a (finite-memory) strategy such that .

If such a strategy does not exist, the problem is infeasible.

Outline. The workflow of the proposed approach is illustrated in Fig. 1: We train an RNN using observation-action sequences generated from an initial strategy as discussed in Sect. 3.1. The trained strategy network represents an observation-based strategy, taking as input an observation-action sequence and returning a distribution over actions, see Def 3. For a POMDP M, we use the output of the strategy network in order to resolve nondeterminism. The strategy network is thereby used to extract a memoryless strategy and as a result we obtain the induced MC . Model checking of this induced MC evaluates whether the specification is satisfied or not for the extracted strategy. In the former case, the synthesis procedure is finished. The extraction and evaluation is explained in Sect. 3.2.

If the specification is not satisfied, we obtain a counterexample highlighting critical states of the POMDP. We employ a linear programming (LP) approach that locally improves action choices of the current strategy at these critical states, see Sect. 3.3. Afterwards, we retrain the RNN by generating new observation-action sequences obtained from the new strategy. We iterate this procedure until the specification is satisfied or a fixed iteration threshold is reached. For cases where we need to further improve, we use domain knowledge to create a specific memory-update function of a k-FSC A, see Def. 4. Then, we compute the product . We iterate our method with as starting point and thereby determine a concrete k-FSC including the action mapping.

3.1 Learning Strategies with RNNs

Optimization methods to approximate strategies fall within the policy gradient class of algorithms, specific to reinforcement learning [Sutton et al., 2000]. In this setting, the strat- egy is parametrized and updated by performing gradient ascent on the error function (typically chosen to maximize the discounted reward). Policy gradient algorithms are generally used to map observations to actions, and are not well suited for POMDPs due to their inability to cope with arbitrary memory. To overcome this weakness, we design our method to make explicit use of memory, using RNNs, which are a family of neural networks designed to learn dependencies in sequential data. They leverage an internal state to process and store information between sequential steps, thus simulating memory.

Constructing the Strategy Network. We use the long short-term memory (LSTM) architecture [Hochreiter and Schmidhuber, 1997] in a similar fashion to policy gradient methods and model the output as a probability distribution on the action space (described formally by . Having stochastic output units, we avoid computing gradients on the internal belief states, as it is, for example, done in [Meuleau et al., 1999]. Using back propagation through time, we can update the strategy during training. Thus, for a given observation-action sequence from , the model learns a strategy . The output is a discrete probability distribution over the actions Act, represented using a final softmax layer. RNN Training. We train the RNN using a slightly modified version of sampling re-usable trajectories [Kearns et al., 2000]. In particular, for a POMDP M = (M, Z, O) and a specification , instead of randomly generating observation sequences, we first compute a strategy of the underlying MDP M that satisfies . Then we sample uniformly over all states of the MDP and generate finite paths (of a fixed maximal length) from of the induced MC , thereby creating multiple trajectory trees. For each finite path , we generate one possible observation-action sequence such that with , where denotes the i-th state of for all . We form the training set D from a (problem specific) number of m observation-action sequences with observations as input and actions as output labels. Both input and output sets were processed using one-hot-encoding. To fit the RNN model, we use the Adam optimizer [Kingma and Ba, 2014] with a cross-entropy error function.

Sampling Large Environments. In a POMDP M with a large state space (), computing the underlying MDP strategy affects the performance of the procedure. In such cases, we restrict the sampling to a smaller environment that shares the observation Z and action spaces Act with M. For example, consider a gridworld scenario with a moving obstacle that has the same underlying probabilistic movement for different problem sizes; such a framework can provide a similar dataset regardless of the size of the grid.

3.2 Strategy Extraction and Evaluation

We first describe how to extract a memoryless strategy from the strategy network for a specific POMDP, then we formalize the extension to FSCs to account for finite memory. Afterwards, we shortly explain how to evaluate the resulting strategies.

Given a POMDP M, we use the trained strategy network directly as observation-based strategy. Note that the RNN is inherently a predictor for the distribution over actions and will not always deliver the same output for one input. While we always use the first prediction we obtain, one may also sample several predictions and take the average of the output distributions.

Extension to FSCs. As mentioned before, LTL specifica-tions as well as observation-dependencies in POMDPs require memory. Consider therefore a general FSC A = as in Def. 4. We first predefine the memory update function in a problem-specific way, for instance, changes the memory node when an observation is repeated. Consider observation sequence with . Assume, the FSC is in memory node at position i of . We define , if , and there exists a j < i such that with . Similarly, we account for specific memory choices akin to the relevant LTL specification.

Once has been defined, we compute a product POMDP which creates a state space over . The training process is similar to the method outlined above but instead of generating observation-action sequences from , we generate observation-node-action sequences from . In this case, the RNN is learning the mapping of observation and memory node to the distribution over actions as an FSC strategy network:

In order to extract the memoryless FSC A from the FSC strategy network , we collect the predicted distributions across the product set of all possible observations and all possible memory nodes . From this prediction, the FSC A is constructed from the action mapping and the predefined memory update function .

Evaluation. We assume that for POMDP M = (M, Z, O) and specification , we have a finite-memory observation-based strategy as described above. We use the strategy to resolve all nondeterminism in M, resulting in the induced MC , see Def. 1. For this MC, we apply model checking, which in polynomial time reveals whether . For the fixed strategy we extracted from the strategy network, this provides hard guarantees about the quality of regarding . As mentioned before, this strategy is only a prediction obtained from the RNN – so the guarantees necessarily do not directly carry over to the strategy network.

3.3 Improving the Represented Strategy

We describe how we compute a local improvement for a strategy that does not satisfy the specification. In particular, we have POMDP M = (M, Z, O), specification , and the strategy with . We now create diagnostic information on why the specification is not satisfied.

First, without loss of generality, we assume . Let denote the probability of choosing action Act upon observation , under the strategy . Let denote the probability to satisfy within the induced MC . For some threshold , a state is critical iff . We define as a function with respect to the threshold from the original specification and the state s. We define the set of critical decision under the strategy .

Definition 5 (Critical Decision) A probability according to an observation-based strategy is a critical decision iff there exist states with , , and is critical.

Intuitively, a decision is critical if it may lead to a critical state. The set of critical decisions serves as counterexample, generated by the set of critical states and the strategy . Note that even if a specification is satisfied for , the sets of critical decisions and states may still be non-empty as they depend on the definition of the criticality-threshold .

For each observation with a critical decision, we construct an optimization problem that minimizes the number of different (critical) actions the strategy chooses per observation class. In particular, the probabilities of action choices under are redistributed such that the critical choices are minimized.

If the objective function is zero, then we have found an observation-based strategy, as there are no choices that are inconsistent with the observations anymore. Otherwise, we select a class for which at least two different actions are necessary and then we generate a new set of paths starting from the critical states. After converting these new paths into observation-action sequences, we retrain the RNN. By gathering more data from these apparently critical situations, we locally improve the quality of the strategies at those locations and gradually introduce observation-dependencies.

3.4 Correctness and Termination

Correctness of our approach is ensured by evaluating the extracted strategy on the POMDP using model checking. As the investigated problem is undecidable for POMDPs [Madani et al., 1999], our approach is naturally incomplete. In order to enforce termination after finite time, we abort the refinement loop after a specified number of iterations, or as soon as the progress from one iteration to the next (in terms of the model checking results) falls below a user-specified threshold.

4 Experimental Results

We evaluate our RNN-based synthesis procedure on benchmark examples that are subject to either LTL specifications or expected cost specifications. For the former, we compare to the tool PRISM-POMDP, and for the latter we compare to PRISM-POMDP and the point-based solver SolvePOMDP [Walraven and Spaan, 2017]. We selected the two solvers from different research communities because they provide the possibility for a straightforward adaption to our benchmark setting. In particular, the tools support undiscounted rewards and have a simple and similar input interface. Extended experiments with, for instance, Monte-Carlo-based methods [Silver and Veness, 2010] are interesting but beyond the scope of this paper.

For a fair comparison, instead of terminating our synthesis procedure once a specification is satisfied, we always iterate 10 times, where one iteration encompasses the (re-)training of the RNN, the strategy extraction, the evaluations, and the strategy improvement as detailed in Sect. 3. For instance, for a specification , we leave the open and seek to compute , that is, we compute the minimal probability of satisfying to obtain a strategy that satisfies . We cannot guarantee to reach that optimum, but we rather improve as far as possible within the predefined 10 iterations. The notions are similar for and as well as for expected cost measures ) and ).

We will now shortly describe our experimental setup and present detailed results for both types of examples.

Implementation and Setup. We employ the following Python toolchain to realize the full RNN-based synthe- sis procedure. First, we use the deep learning library Keras [Chollet, 2015] to train the strategy network. To evaluate strategies, we employ the probabilistic model checkers PRISM (LTL) and STORM (undiscounted expected rewards).We evaluated on a 2.3 GHz machine with a 12 GB memory limit and a specified maximum computation time of seconds.

4.1 Temporal Logic Examples

We examined three problem settings involving motion planning with LTL specifications. For each of the settings, we use a standard gridworld formulation of an agent with 4 action choices (cardinal directions of movement), see Fig. 2(a). Inside this environment there are a set of static and moving obstacles as well as possible target cells A and B. Each agent has a limited visibility region, indicated by the green area, and can infer its state from observations and knowledge of the environment. We define observations as Boolean functions that take as input the positions of the agent and moving obstacles. Intuitively, the functions describe the 8 possible relative positions of the obstacles with respect to the agent inside its viewing range.

1. Navigation with moving obstacles – an agent and a single stochastically moving obstacle. The agent task is to maximize the probability to navigate to a goal state A while not colliding with obstacles (both static and moving): with ,

2. Delivery without obstacles – an agent and static objects (landmarks). The task is to deliver an object from A to B in as few steps as possible: .

3. Slippery delivery with static obstacles – an agent where the probability of moving perpendicular to the desired direction is 0.1 in each orientation. The task is to maximize the probability to go back and forth from loca-

Figure 2: (a) Example environment and (b) Benchmark metrics

tions A and B without colliding with the static obstacles , with ,

Figure 3: Progression of the number of critical states and the probability of satisfying an LTL specification as a result of local improvement steps.

Evaluation. Fig. 3 compares the size of counterexample in relation to the probability of satisfying an LTL formula in each iteration of the synthesis procedure. In particular, we depict the size of the set of critical states regarding for the Navigation example with grid-size 6. Note that even if the probability to satisfy the LTL specification is nearly one (for the initial state of the POMDP), there may still be critical intermediate states. As can be seen in the figure, while the probability to satisfy the LTL formula increases, the size of the counterexample decreases. In particular, the local improvement (Eq. 1, Sect. 3.3) is demonstrated to be effective.

Table 1 contains the results for the above LTL examples. Naturally the strategies produced by the procedure will not have higher maximum probabilities (or lower minimum expected cost) than those generated by the PRISMPOMDP tool. However, they scale for significantly larger environments and settings. In the larger environments (Navigation(15) and upwards indicated by a star) we employ the sampling technique outlined at the end of Sect. 3.1 on a dataset with grid-size 10. The strategy still scales to these larger environments even when trained on data from a smaller state space.

Table 1: Synthesizing strategies for examples with LTL specs.

Also in Table 1, we compare the effect of increasing the value of k for several k-FSCs. In smaller instances with grid-sizes of 4 and 5, memory-based strategies significantly outperform memoryless ones in terms of quality (the resulting probability or expected cost) while not consuming sig-nificantly more time. The increase in performance is due to additional expressiveness of an FSC-based strategy in these environments with a higher density of obstacles.

Summarized, our method scales to significantly larger domains than PRISM-POMDP with competitive computation times. As mentioned before, there is an inherent level of randomness in extracting a strategy. While we always take the first shot result for our experiments, the quality of strategies may improved by sampling several RNN predictions.

4.2 Comparison to Existing POMDP Examples

For comparison to existing benchmarks, we extend two examples from PRISM-POMDP for an arbitrary-sized structure: Maze(c) with c + 2 rows and Grid(c) – a square grid with length c. We also compare to RockSample [Silver and Veness, 2010] (see Table 2(b) for problem metrics).

These problems are quite different to the LTL examples, in particular the significantly smaller observation spaces. As a result, a simple memoryless strategy is insufficient for a useful comparison. For each problem, the size of the k-FSC used is given by: Maze(c) has k = (c + 1); Grid(c) has k = and RockSample with b rocks has k = b.

Our method compares favorably with PRISM-POMDP and pomdpSolve for Maze and Grid (Table 2). However, the proposed method performs poorly in comparison to pomdpSolve for RockSample: An observation is received after taking an action to check a particular rock. This action is never sampled in the modified trajectory-tree based sampling method (Sect. 3.1). Note that our main aim is to enable the efficient

Table 2: Comparison for standard POMDP examples.

synthesis of strategies under linear temporal logic constraints.

5 Summary and Future Work

We introduced a new RNN-based strategy synthesis method for POMDPs and LTL specifications. While we cannot guarantee optimality, our approach shows results that are often close to the actual optimum with competitive computation times for large problem domains.

For the future, we are interested in extending our method to continuous state spaces together with abstraction techniques that would enable to employ our model-based method.

References

[Amato et al., 2010] Christopher Amato, Daniel S Bernstein, and Shlomo Zilberstein. Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs. AAMAS, 21(3):293–320, 2010.

[Baier and Katoen, 2008] Christel Baier and Joost-Pieter Ka- toen. Principles of Model Checking. MIT Press, 2008.

[Chatterjee et al., 2015] Krishnendu Chatterjee, Martin Chmel´ık, Raghav Gupta, and Ayush Kanodia. Qualitative analysis of POMDPs with temporal logic specifications for robotics applications. In ICRA, pages 325–330, 2015.

[Chatterjee et al., 2016] Krishnendu Chatterjee, Martin Chmel´ık, and Mathieu Tracol. What is decidable about partially observable Markov decision processes with -regular objectives. Journal of Computer and System Sciences, 82(5):878–911, 2016.

[Chollet, 2015] Franc¸ois Chollet. Keras, 2015.

[Dehnert et al., 2017] Christian Dehnert, Sebastian Junges, Joost-Pieter Katoen, and Matthias Volk. A storm is coming: A modern probabilistic model checker. In CAV (2), volume 10427 of LNCS, pages 592–600. Springer, 2017.

[Hausknecht and Stone, 2015] Matthew Hausknecht and Pe- ter Stone. Deep recurrent q-learning for partially observable MDPs. CoRR, abs/1507.06527, 7(1), 2015.

[Hauskrecht, 2000] Milos Hauskrecht. Value-function approximations for partially observable Markov decision processes. J. Artif. Intell. Res., 13:33–94, 2000.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and J¨urgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Junges et al., 2018] Sebastian Junges, Nils Jansen, Ralf Wimmer, Tim Quatmann, Leonore Winterer, Joost-Pieter Katoen, and Bernd Becker. Finite-state controllers of POMDPs via parameter synthesis. In UAI, 2018.

[Kaelbling et al., 1998] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artif. Intell., 101(1):99–134, 1998.

[Katz et al., 2017] Guy Katz, Clark Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In CAV, pages 97–117. Springer, 2017.

[Kearns et al., 2000] Michael J Kearns, Yishay Mansour, and Andrew Y Ng. Approximate planning in large POMDPs via reusable trajectories. In NIPS, pages 1001– 1007, 2000.

[Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint 1412.6980, 2014.

[Kwiatkowska et al., 2011] Marta Kwiatkowska, Gethin Norman, and David Parker. PRISM 4.0: Verification of probabilistic real-time systems. In CAV, volume 6806 of LNCS, pages 585–591. Springer, 2011.

[Littman et al., 2017] Michael L. Littman, Ufuk Topcu, Jie Fu, Charles Isbell, Min Wen, and James MacGlashan. Environment-independent task specifications via GLTL. arXiv preprint 1704.04341, 2017.

[Madani et al., 1999] Omid Madani, Steve Hanks, and Anne Condon. On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. In AAAI, pages 541–548. AAAI Press, 1999.

[Meuleau et al., 1999] Nicolas Meuleau, Leonid Peshkin, Kee-Eung Kim, and Leslie Pack Kaelbling. Learning finite-state controllers for partially observable environments. In UAI, pages 427–436. Morgan Kaufmann, 1999.

[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[Norman et al., 2017] Gethin Norman, David Parker, and Xueyi Zou. Verification and control of partially observable probabilistic systems. Real-Time Systems, 53(3):354–402, 2017.

[Papadimitriou and Tsitsiklis, 1987] Christos H. Papadim- itriou and John N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987.

[Pascanu et al., 2013] Razvan Pascanu, C¸ aglar G¨ulc¸ehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. CoRR, abs/1312.6026, 2013.

[Pineau et al., 2003] Joelle Pineau, Geoff Gordon, and Sebastian Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In IJCAI, pages 1025–1032. Morgan Kaufmann, 2003.

[Pnueli, 1977] Amir Pnueli. The temporal logic of programs. In FOCS, pages 46–57. IEEE Computer Society, 1977.

[Silver and Veness, 2010] David Silver and Joel Veness. Monte-carlo planning in large POMDPs. In NIPS, pages 2164–2172, 2010.

[Sutton et al., 2000] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pages 1057–1063, 2000.

[Vlassis et al., 2012] Nikos Vlassis, Michael L. Littman, and David Barber. On the computational complexity of stochastic controller optimization in POMDPs. ACM Trans. on Computation Theory, 4(4):12:1–12:8, 2012.

[Walraven and Spaan, 2017] Erwin Walraven and Matthijs Spaan. Accelerated vector pruning for optimal POMDP solvers. In AAAI, pages 3672–3678. AAAI Press, 2017.

[Wierstra et al., 2007] Daan Wierstra, Alexander F¨orster, Jan Peters, and J¨urgen Schmidhuber. Solving deep memory POMDPs with recurrent policy gradients. In ICANN, pages 697–706. Springer, 2007.

[Wimmer et al., 2014] Ralf Wimmer, Nils Jansen, Erika ´Abrah´am, Joost-Pieter Katoen, and Bernd Becker. Minimal counterexamples for linear-time probabilistic verifica-tion. Theor. Comput. Sci., 549:61–100, 2014.

designed for accessibility and to further open science