35:[["$","audio",null,{"id":"tts"}],["$","$L3a",null,{"paperID":"2004.00273","publisher":"arxiv","paperJSON":{"title":"Statistically Model Checking PCTL Specifications on Markov Decision Processes via Reinforcement Learning","paperID":"2004.00273","avgLineHeight":12,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"Probabilistic Computation Tree Logic (PCTL) is frequently used to formally specify control objectives such as probabilistic reachability and safety. In this work, we focus on model checking PCTL specifications statistically on Markov Decision Processes (MDPs) by sampling, e.g., checking whether there exists a feasible policy such that the probability of reaching certain goal states is greater than a threshold. We use reinforcement learning to search for such a feasible policy for PCTL specifications, and then develop a statistical model checking (SMC) method with provable guarantees on its error. Specifically, we first use upper-confidence-bound (UCB) based Q-learning to design an SMC algorithm for bounded-time PCTL specifications, and then extend this algorithm to unbounded-time specifications by identifying a proper truncation time by checking the PCTL specification and its negation at the same time. Finally, we evaluate the proposed method on case studies.","element":"span"}]]},{"heading":"I. INTRODUCTION","paragraphs":[[{"text":"Probabilistic Computation Tree Logic (PCTL) is frequently used to formally specify control objectives such as reachability and safety on probabilistic systems [1]. To check the correctness of PCTL specifications on these systems, model checking methods are required [2]. Although model checking PCTL by model-based analysis is theoretically possible [1], it is not preferable in practice when the system model is unknown or large. In these cases, model checking by sampling, i.e. 
statistical model checking (SMC), is needed [3], [4].","element":"span"}],[{"text":"The statistical model checking of PCTL specifications on Markov Decision Processes (MDPs) arises in many decision problems – e.g., for a robot in a grid world under probabilistic disturbance, checking whether there exists a feasible control policy such that the probability of reaching certain goal states is greater than a probability threshold [5]–[7]. In these problems, the main challenge is to search for such a feasible policy for the PCTL specification of interest.","element":"span"}],[{"text":"To search for feasible policies for temporal logic specifications, such as PCTL, on MDPs, one approach is model-based reinforcement learning [8]–[11] – i.e., first inferring the transition probabilities of the MDP by sampling over each state-action pair, and then searching for the feasible policy via model-based analysis. This approach is often inefficient, since not all transition probabilities are relevant to the PCTL","element":"span"}],[{"style":{"width":"96%"},"width":945,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/0-0.png","element":"img"}],[{"text":"Nima Roohi is with the University of California San Diego, USA ","element":"span"},{"text":"nroohi@ucsd.edu","element":"span"},{"text":". Matthew West, Mahesh Viswanathan, and Geir E. Dullerud are with the University of Illinois at Urbana-Champaign, USA","element":"span"}],[{"style":{"width":"84%"},"width":831,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/0-1.png","element":"img"}],[{"text":"specification of interest. Here instead, we adopt a model-free reinforcement learning approach [12].","element":"span"}],[{"text":"Common model-free reinforcement learning techniques cannot directly handle temporal logic specifications. 
One solution is to find a surrogate reward function such that the policy learned for this surrogate reward function is the one needed for checking the temporal logic specification of interest. For certain temporal logics interpreted under special semantics (usually involving a metric), the surrogate reward can be found based on that semantics [13]–[15].","element":"span"}],[{"text":"For temporal logics under the standard semantics [16], the surrogate reward functions can be derived by constructing the product MDP [7], [17], [18] of the initial MDP and the automaton realizing the temporal logic specification. However, the complexity of constructing the automaton from a general linear temporal logic (LTL) specification is doubly exponential [16], [19]. For a fragment of LTL, namely LTL/GU, the complexity is exponential [20], [21]. In addition, the product MDP is usually much larger than the initial MDP, although the product MDP may be constructed on the fly to reduce the extra computation cost, as done in [18].","element":"span"}],[{"text":"In this work, we propose a new statistical model checking method for PCTL specifications on MDPs. For clarity of presentation, we only consider non-nested PCTL specifications. PCTL formulas in general form with nested probabilistic operators can be handled in the standard manner using the approach proposed in [22], [23]. Our method uses upper-confidence-bound (UCB) based Q-learning to directly learn the feasible policy of PCTL specifications, without constructing the product MDP. 
The effectiveness of UCB-based Q-learning has been proven for the ","element":"span"},{"text":"K","element":"span"},{"text":"-armed bandit problem, and has been numerically demonstrated on many decision-learning problems on MDPs (see [24]).","element":"span"}],[{"text":"For bounded-time PCTL specifications, we treat the statistical model checking problem as a finite sequence of ","element":"span"},{"text":"K","element":"span"},{"text":"-armed bandit problems and use UCB-based Q-learning to learn the desirable decision at each time step. For unbounded-time PCTL specifications, we look for a truncation time to reduce the problem to a bounded-time one by checking the PCTL specification and its negation at the same time. Our statistical model checking algorithm is online; it terminates with probability 1, and only when the statistical error of the learning result is smaller than a user-specified value.","element":"span"}],[{"text":"The rest of the paper is organized as follows. The preliminaries on labeled MDPs and PCTL are given in Section ","element":"span"},{"text":"II. ","element":"span"},{"text":"In Section ","element":"span"},{"href":"#id-0","text":"III, ","element":"a"},{"text":"using the principle of optimism in the face of uncertainty, we design Q-learning algorithms to solve finite-time and infinite-time probabilistic satisfaction, and give finite-sample probabilistic guarantees for the correctness of the algorithms. We implement and evaluate the proposed algorithms on several case studies in Section ","element":"span"},{"href":"#id-1","text":"IV. ","element":"a"},{"text":"Finally, we conclude this work in Section ","element":"span"},{"href":"#id-2","text":"V.","element":"a"}]]},{"heading":"II. PRELIMINARIES AND PROBLEM FORMULATION","paragraphs":[[{"text":"The sets of natural numbers and real numbers are denoted by ","element":"span"},{"text":"N ","element":"span"},{"text":"and ","element":"span"},{"text":"R","element":"span"},{"text":", respectively. 
For ","element":"span"},{"style":{"height":11.6},"width":111.08,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-0.png","element":"img","alt":" n ∈ N","inline":true},{"text":", let ","element":"span"},{"text":"[","element":"span"},{"text":"n","element":"span"},{"text":"] = ","element":"span"},{"text":"{","element":"span"},{"text":"1","element":"span"},{"text":", . . ., n","element":"span"},{"text":"}","element":"span"},{"text":". The cardinality of a set is denoted by ","element":"span"},{"style":{"height":16},"width":43.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-1.png","element":"img","alt":" |·|","inline":true},{"text":". The set of finite-length sequences taken from a finite set ","element":"span"},{"text":"S ","element":"span"},{"text":"is denoted by ","element":"span"},{"style":{"height":10.96},"width":42.88,"height":27.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-2.png","element":"img","alt":" S∗","inline":true},{"text":".","element":"span"}],[{"text":"A. Markov Decision Process","element":"span"}],[{"text":"A Markov decision process (MDP) is a finite-state probabilistic system, where the transition probabilities between the states are determined by the control action taken from a given finite set. 
Each state of the MDP is labeled by a set of atomic propositions indicating the properties holding on it, e.g., whether the state is a safe/goal state.","element":"span"}],[{"text":"Definition 1: A labeled Markov decision process (MDP) is a tuple ","element":"span"},{"text":"M ","element":"span"},{"text":"= (","element":"span"},{"text":"S, A, ","element":"span"},{"text":"T","element":"span"},{"text":", ","element":"span"},{"text":"AP","element":"span"},{"text":", L","element":"span"},{"text":") ","element":"span"},{"text":"where","element":"span"}],[{"text":"• ","element":"span"},{"text":"S ","element":"span"},{"text":"is a finite set of states.","element":"span"}],[{"text":"• ","element":"span"},{"text":"A ","element":"span"},{"text":"is a finite set of actions. ","element":"span"},{"text":"• ","element":"span"},{"style":{"height":16},"width":368.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-3.png","element":"img","alt":" T : S ×A×S → [0, 1]","inline":true,"padRight":true},{"text":"is a partial transition probability function. 
For any state ","element":"span"},{"style":{"height":11.6},"width":93.2,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-4.png","element":"img","alt":" s ∈ S","inline":true,"padRight":true},{"text":"and any action ","element":"span"},{"style":{"height":12.4},"width":100.08,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-5.png","element":"img","alt":" a ∈ A","inline":true},{"text":",","element":"span"}],[{"style":{"width":"81%"},"width":800,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-6.png","element":"img"}],[{"text":"With a slight abuse of notation, let ","element":"span"},{"text":"A","element":"span"},{"text":"(","element":"span"},{"text":"s","element":"span"},{"text":") ","element":"span"},{"text":"be the set of allowed actions on the state ","element":"span"},{"text":"s","element":"span"},{"text":".","element":"span"}],[{"text":"• ","element":"span"},{"text":"AP ","element":"span"},{"text":"is a finite set of labels.","element":"span"}],[{"text":"• ","element":"span"},{"style":{"height":14.16},"width":213.96,"height":35.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-7.png","element":"img","alt":" L : S → 2AP","inline":true,"padRight":true},{"text":"is a labeling function.","element":"span"}],[{"text":"Definition 2: A policy ","element":"span"},{"style":{"height":12},"width":229.68,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-8.png","element":"img","alt":" Π : S∗ → A","inline":true,"padRight":true},{"text":"decides the action to take from the sequence of states visited so far. 
Given a policy ","element":"span"},{"style":{"height":10.8},"width":30,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-9.png","element":"img","alt":" Π","inline":true,"padRight":true},{"text":"and an initial state ","element":"span"},{"style":{"height":11.6},"width":99.92,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-10.png","element":"img","alt":" s ∈ S","inline":true},{"text":", the MDP ","element":"span"},{"text":"M ","element":"span"},{"text":"becomes purely probabilistic, denoted by ","element":"span"},{"style":{"height":15.5},"width":96.12,"height":38.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-11.png","element":"img","alt":" MΠ,s","inline":true},{"text":". The system ","element":"span"},{"style":{"height":15.5},"width":96.12,"height":38.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-12.png","element":"img","alt":" MΠ,s","inline":true,"padRight":true},{"text":"is not necessarily Markovian.","element":"span"}],[{"text":"B. Probabilistic Computation Tree Logic","element":"span"}],[{"text":"The probabilistic computation tree logic (PCTL) is defined inductively from atomic propositions, temporal operators and probability operators. It reasons about the probabilities of time-dependent properties.","element":"span"}],[{"text":"Definition 3 (Syntax): Let ","element":"span"},{"text":"AP ","element":"span"},{"text":"be a set of atomic propositions. 
A PCTL state formula is defined by","element":"span"}],[{"style":{"width":"78%"},"width":770,"height":176,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-13.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":679.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-14.png","element":"img","alt":" a ∈ AP, ⋊⋉∈ {<, >, ≤, ≥}, T ∈ N∪{∞}","inline":true,"padRight":true},{"text":"is a (possibly infinite) time horizon, and ","element":"span"},{"style":{"height":16},"width":148.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-15.png","element":"img","alt":" p ∈ [0, 1]","inline":true,"padRight":true},{"text":"is a threshold.","element":"span"},{"text":"1 ","element":"span"},{"text":"The operators ","element":"span"},{"style":{"height":19.79},"width":84.72,"height":49.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-16.png","element":"img","alt":" Pmin⋊⋉p","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.23},"width":90.44,"height":43.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-17.png","element":"img","alt":" Pmax⋊⋉p","inline":true,"padRight":true},{"text":"are called probability operators, and the “next”, “until” and “release” operators ","element":"span"},{"style":{"height":13.2},"width":200.12,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-18.png","element":"img","alt":" X, UT , RT","inline":true,"padRight":true},{"text":"are called temporal operators.","element":"span"}],[{"text":"More temporal operators can be derived by composition: for example, “or” is ","element":"span"},{"style":{"height":16},"width":457.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-19.png","element":"img","alt":" φ1 ∨ φ2 ::= ¬(¬φ1 ∧ ¬φ2)","inline":true},{"text":"; “true” is 
","element":"span"},{"style":{"height":16},"width":312.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-20.png","element":"img","alt":"True = a ∨ (¬a)","inline":true},{"text":"; “finally” is ","element":"span"},{"style":{"height":14},"width":341.28,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-21.png","element":"img","alt":" FT φ ::= TrueUT φ","inline":true},{"text":"; and “always” is ","element":"span"},{"style":{"height":14},"width":351.36,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-22.png","element":"img","alt":" GT φ ::= FalseRT φ","inline":true},{"text":". For simplicity, we write ","element":"span"},{"style":{"height":13.2},"width":245.6,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-23.png","element":"img","alt":"U∞, R∞, F∞","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.1},"width":68,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-24.png","element":"img","alt":" G∞","inline":true,"padRight":true},{"text":"as ","element":"span"},{"text":"U","element":"span"},{"text":", ","element":"span"},{"text":"R","element":"span"},{"text":", ","element":"span"},{"text":"F ","element":"span"},{"text":"and ","element":"span"},{"text":"G","element":"span"},{"text":", respectively.","element":"span"}],[{"id":"id-25","text":"Definition 4 (Semantics): ","element":"span"},{"text":"For an MDP ","element":"span"},{"text":"M ","element":"span"},{"text":"= (","element":"span"},{"text":"S, A, ","element":"span"},{"text":"T","element":"span"},{"text":", ","element":"span"},{"style":{"height":16},"width":204.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-25.png","element":"img","alt":"sinit, AP, L)","inline":true},{"text":", the satisfaction relation ","element":"span"},{"text":"|","element":"span"},{"text":"= 
","element":"span"},{"text":"is defined for a state ","element":"span"},{"text":"s ","element":"span"},{"text":"or path ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-26.png","element":"img","alt":" σ","inline":true,"padRight":true},{"text":"by","element":"span"}],[{"style":{"width":"99%"},"width":975,"height":787,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-27.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":293.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-28.png","element":"img","alt":" ⋊⋉∈ {<, >, ≤, ≥}","inline":true},{"text":". Here, ","element":"span"},{"style":{"height":15.5},"width":185.88,"height":38.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-29.png","element":"img","alt":" σ ∼ MΠ,s","inline":true,"padRight":true},{"text":"means the path ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-30.png","element":"img","alt":"σ","inline":true,"padRight":true},{"text":"is drawn from the MDP ","element":"span"},{"text":"M ","element":"span"},{"text":"under the policy ","element":"span"},{"style":{"height":10.8},"width":30,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-31.png","element":"img","alt":" Π","inline":true},{"text":", starting from the state ","element":"span"},{"text":"s","element":"span"},{"text":".","element":"span"}],[{"text":"The PCTL formulas ","element":"span"},{"style":{"height":18.43},"width":257.44,"height":46.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-32.png","element":"img","alt":" s |= Pmax⋊⋉p (Xφ)","inline":true,"padRight":true},{"text":"(or 
","element":"span"},{"style":{"height":19.79},"width":251.68,"height":49.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-33.png","element":"img","alt":" s |= Pmin⋊⋉p (Xφ)","inline":true},{"text":") ","element":"span"},{"text":"mean that the maximal (or minimal) satisfaction probability of “next” ","element":"span"},{"style":{"height":14},"width":24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-34.png","element":"img","alt":" φ","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":10},"width":64.72,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-35.png","element":"img","alt":" ⋊⋉ p","inline":true},{"text":". The PCTL formulas ","element":"span"},{"style":{"height":18.43},"width":343.84,"height":46.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-36.png","element":"img","alt":" s |= Pmax⋊⋉p (φ1UT φ2)","inline":true,"padRight":true},{"text":"(or ","element":"span"},{"style":{"height":19.79},"width":343.84,"height":49.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-37.png","element":"img","alt":" s |= Pmin⋊⋉p (φ1UT φ2)","inline":true},{"text":") mean that the maximal (or minimal) ","element":"span"},{"text":"satisfaction probability that ","element":"span"},{"style":{"height":14},"width":39.52,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-38.png","element":"img","alt":" φ1","inline":true,"padRight":true},{"text":"holds “until” ","element":"span"},{"style":{"height":14},"width":39.52,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-39.png","element":"img","alt":" φ2","inline":true,"padRight":true},{"text":"holds is ","element":"span"},{"style":{"height":10},"width":64.72,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-40.png","element":"img","alt":" ⋊⋉ 
p","inline":true},{"text":".","element":"span"}]]},{"heading":"III. NON-NESTED PCTL SPECIFICATIONS","paragraphs":[[{"id":"id-0","text":"In this section, we consider the statistical model checking ","element":"span"},{"text":"of non-nested PCTL specifications using upper-confidence-bound based Q-learning. For simplicity, we focus on ","element":"span"},{"style":{"height":18.24},"width":258.4,"height":45.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-41.png","element":"img","alt":" Pmax⋊⋉p (a1UT a2)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.24},"width":256.96,"height":45.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-42.png","element":"img","alt":" Pmax⋊⋉p (a1RT a2)","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":9.5},"width":35.2,"height":23.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-43.png","element":"img","alt":" a1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.5},"width":35.2,"height":23.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-44.png","element":"img","alt":" a2","inline":true,"padRight":true},{"text":"are ","element":"span"},{"text":"atomic propositions. Other cases can be handled in the same way. 
We discuss the case of ","element":"span"},{"text":"T ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"in Section ","element":"span"},{"href":"#id-3","text":"III-A, ","element":"a"},{"text":"the case of ","element":"span"},{"text":"T > ","element":"span"},{"text":"1 ","element":"span"},{"text":"in Section ","element":"span"},{"href":"#id-4","text":"III-B, ","element":"a"},{"text":"and the case of ","element":"span"},{"style":{"height":10.8},"width":136,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-45.png","element":"img","alt":" T = ∞","inline":true,"padRight":true},{"text":"in Section ","element":"span"},{"href":"#id-5","text":"III-C. ","element":"a"},{"text":"Similar to other works on statistical model checking [3], [4], we make the following assumption.","element":"span"}],[{"id":"id-6","text":"Assumption 1: ","element":"span"},{"text":"For ","element":"span"},{"style":{"height":18.24},"width":345.28,"height":45.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-46.png","element":"img","alt":" s |= Pmax⋊⋉p (a1UT a2)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.24},"width":177.32,"height":45.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-47.png","element":"img","alt":" s |= Pmax⋊⋉p","inline":true},{"style":{"height":16},"width":164.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-48.png","element":"img","alt":"(a1RT a2)","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":16},"width":279.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-49.png","element":"img","alt":" T ∈ N ∪ {∞}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":304.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-50.png","element":"img","alt":" ⋊⋉∈ {<, >, ≤, 
≥}","inline":true},{"text":", we assume that ","element":"span"},{"style":{"height":19.2},"width":626.32,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-51.png","element":"img","alt":" maxΠ Pσ∼MΠ,s(σ |= φ1UT φ2) ≠ p","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.2},"width":591.28,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/1-52.png","element":"img","alt":"maxΠ Pσ∼MΠ,s(σ |= φ1RT φ2) ≠ p","inline":true},{"text":", respectively.","element":"span"}],[{"text":"When this assumption holds, as the number of samples increases, the samples will be increasingly concentrated on one side of the threshold ","element":"span"},{"text":"p ","element":"span"},{"text":"by the Central Limit Theorem. Therefore, a statistical analysis based on the majority of the samples has increasing accuracy. When it is violated, the samples would be evenly distributed between the two sides of the boundary ","element":"span"},{"text":"p","element":"span"},{"text":", regardless of the sample size. Thus, no matter how much the sample size increases, the accuracy of any statistical test would not increase. Compared to statistical model checking algorithms based on sequential probability ratio tests (SPRT) [25], [26], no assumption on the indifference region is required here. Finally, by Assumption ","element":"span"},{"href":"#id-6","text":"1, ","element":"a"},{"text":"we have the additional semantic equivalence between the PCTL specifications: ","element":"span"},{"style":{"height":17.62},"width":290.96,"height":44.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-0.png","element":"img","alt":" Pmax
>p ψ ≡ Pmax≥p ψ","inline":true},{"text":"; thus, we ","element":"span"},{"text":"will not distinguish between them below.","element":"span"}],[{"text":"For further discussion, we first identify a few trivial cases. For ","element":"span"},{"style":{"height":18.24},"width":334.24,"height":45.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-2.png","element":"img","alt":" s |= Pmax>p (a1UT a2)","inline":true},{"text":", let","element":"span"}],[{"id":"id-7","style":{"width":"80%"},"width":790,"height":101,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-3.png","element":"img"}],[{"text":"Then for any policy ","element":"span"},{"style":{"height":19.2},"width":577.28,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-4.png","element":"img","alt":" Π, Pσ∼MΠ,s(σ |= φ1UT φ2) = 0","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"height":13.1},"width":115.36,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-5.png","element":"img","alt":"s ∈ S0","inline":true},{"text":"; and ","element":"span"},{"style":{"height":19.2},"width":499.52,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-6.png","element":"img","alt":" Pσ∼MΠ,s(σ |= φ1UT φ2) = 1","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"height":13.1},"width":115.84,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-7.png","element":"img","alt":" s ∈ S1","inline":true},{"text":". 
The same holds for ","element":"span"},{"style":{"height":18.24},"width":345.76,"height":45.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-8.png","element":"img","alt":" s |= Pmax>p (a1RT a2)","inline":true,"padRight":true},{"text":"by defining ","element":"span"},{"style":{"height":13.1},"width":40.48,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-9.png","element":"img","alt":" S1","inline":true,"padRight":true},{"text":"to be ","element":"span"},{"text":"the union of end components of the MDP ","element":"span"},{"text":"M ","element":"span"},{"text":"labeled by ","element":"span"},{"style":{"height":9.5},"width":35.2,"height":23.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-10.png","element":"img","alt":" a2","inline":true,"padRight":true},{"text":"(this only requires knowing the topology of ","element":"span"},{"text":"M","element":"span"},{"text":") [16]. In the rest of this section, we focus on handling the nontrivial case ","element":"span"},{"style":{"height":16},"width":274.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-11.png","element":"img","alt":"s ∈ S\\(S0 ∪ S1)","inline":true},{"text":".","element":"span"}],[{"id":"id-3","text":"A. 
Single Time Horizon","element":"span"}],[{"text":"When ","element":"span"},{"text":"T ","element":"span"},{"text":"= 1","element":"span"},{"text":", for any ","element":"span"},{"style":{"height":16},"width":314.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-12.png","element":"img","alt":" s ∈ S\\(S0 ∪ S1)","inline":true},{"text":", the PCTL specification ","element":"span"},{"style":{"height":13.1},"width":132.64,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-13.png","element":"img","alt":" a1UT a2","inline":true,"padRight":true},{"text":"(or ","element":"span"},{"style":{"height":13.1},"width":131.68,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-14.png","element":"img","alt":" a1RT a2","inline":true},{"text":") holds on a random path ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-15.png","element":"img","alt":"σ","inline":true,"padRight":true},{"text":"starting from the state ","element":"span"},{"text":"s ","element":"span"},{"text":"if and only if ","element":"span"},{"style":{"height":16},"width":164.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-16.png","element":"img","alt":" σ(1) ∈ S1","inline":true},{"text":", where ","element":"span"},{"style":{"height":13.1},"width":40.48,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-17.png","element":"img","alt":" S0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.1},"width":40.48,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-18.png","element":"img","alt":" S1","inline":true,"padRight":true},{"text":"are from ","element":"span"},{"href":"#id-7","text":"(1)","element":"a"},{"text":". 
Thus, it suffices to learn from samples whether","element":"span"}],[{"style":{"width":"66%"},"width":653,"height":65,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-19.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"66%"},"width":654,"height":81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-20.png","element":"img"}],[{"text":"and ","element":"span"},{"style":{"height":16},"width":316.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-21.png","element":"img","alt":" σ(1) ∼ T (s, a, ·)","inline":true,"padRight":true},{"text":"means ","element":"span"},{"style":{"height":16},"width":75.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-22.png","element":"img","alt":" σ(1)","inline":true,"padRight":true},{"text":"is drawn from the transition probability ","element":"span"},{"style":{"height":16},"width":147.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-23.png","element":"img","alt":" T (s, a, ·)","inline":true,"padRight":true},{"text":"for state ","element":"span"},{"text":"s ","element":"span"},{"text":"and action ","element":"span"},{"text":"a","element":"span"},{"text":". 
This is an ","element":"span"},{"text":"|","element":"span"},{"text":"A","element":"span"},{"text":"(","element":"span"},{"text":"s","element":"span"},{"text":")","element":"span"},{"text":"|","element":"span"},{"text":"-armed bandit problem; we solve this problem by upper-confidence-bound strategies [27], [28].","element":"span"}],[{"text":"Specifically, at iteration ","element":"span"},{"text":"k","element":"span"},{"text":", let ","element":"span"},{"style":{"height":18.16},"width":217.6,"height":45.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-24.png","element":"img","alt":" N (k)(s, a, s′)","inline":true,"padRight":true},{"text":"be the number of samples of the one-step path ","element":"span"},{"style":{"height":16},"width":136.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-25.png","element":"img","alt":" (s, a, s′)","inline":true},{"text":", and with a slight abuse of notation, let","element":"span"}],[{"style":{"width":"76%"},"width":750,"height":88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-26.png","element":"img"}],[{"text":"The unknown transition probability function ","element":"span"},{"style":{"height":16},"width":168.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-27.png","element":"img","alt":" T(s, a, s′)","inline":true,"padRight":true},{"text":"is estimated by the empirical transition probability function","element":"span"}],[{"id":"id-15","style":{"width":"94%"},"width":929,"height":130,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-28.png","element":"img"}],[{"text":"The estimate of ","element":"span"},{"style":{"height":16},"width":138.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-29.png","element":"img","alt":" Q1(s, a)","inline":true,"padRight":true},{"text":"from the existing ","element":"span"},{"text":"k ","element":"span"},{"text":"samples 
is","element":"span"}],[{"style":{"width":"76%"},"width":752,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-30.png","element":"img"}],[{"text":"Since the value of the Q-function ","element":"span"},{"style":{"height":16},"width":304.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-31.png","element":"img","alt":" Q1(s, a) ∈ [0, 1]","inline":true,"padRight":true},{"text":"is bounded, we can construct a confidence interval for the estimate ","element":"span"},{"style":{"height":20.88},"width":73.44,"height":52.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-32.png","element":"img","alt":"ˆQ(k)1","inline":true,"padRight":true},{"text":"with statistical error at most ","element":"span"},{"style":{"height":11.6},"width":19,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-33.png","element":"img","alt":" δ","inline":true,"padRight":true},{"text":"using Hoeffding’s inequality by","element":"span"}],[{"id":"id-16","style":{"width":"96%"},"width":942,"height":254,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-34.png","element":"img"}],[{"text":"where we set the value of the division to be ","element":"span"},{"style":{"height":7.2},"width":40,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-35.png","element":"img","alt":" ∞","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":18.16},"width":242.24,"height":45.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-36.png","element":"img","alt":"N (k)(s, a) = 0","inline":true},{"text":".","element":"span"}],[{"text":"Remark 1: We use Hoeffding’s bounds to yield hard guarantees on the statistical error of the model checking algorithms. 
Tighter bounds like Bernstein’s bounds [29] can also be used, but they only yield asymptotic guarantees on the statistical error.","element":"span"}],[{"text":"The sample efficiency of learning for the bandit problem ","element":"span"},{"href":"#id-8","text":"(2) ","element":"a"},{"text":"depends on the choice of sampling policy, decided from the existing samples. A provably best solution is to use the Q-learning from [27], [28]. Specifically, an upper confidence bound (UCB) is constructed for each state-action pair using the number of samples and the observed reward, and the action with the highest possible reward, namely the UCB, is chosen. The sampling policy is chosen by maximizing the possible reward greedily:","element":"span"}],[{"id":"id-9","style":{"width":"78%"},"width":773,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-37.png","element":"img"}],[{"text":"The action is chosen arbitrarily when there are multiple ","element":"span"},{"id":"id-8","text":"candidates. The choice of ","element":"span"},{"style":{"height":20.88},"width":65.76,"height":52.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-38.png","element":"img","alt":" π(k)1","inline":true,"padRight":true},{"text":"in ","element":"span"},{"href":"#id-9","text":"(7) ","element":"a"},{"text":"ensures that the policy giving the upper bound of the value function is sampled most frequently in the long run.","element":"span"}],[{"text":"To initialize the iteration, the Q-function is set to","element":"span"}],[{"style":{"width":"99%"},"width":979,"height":152,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-39.png","element":"img"}],[{"text":"to ensure that every state-action pair is sampled at least once. 
The","element":"span"}],[{"id":"id-10","style":{"width":"99%"},"width":978,"height":234,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-40.png","element":"img"}],[{"id":"id-14","text":"where ","element":"span"},{"text":"p ","element":"span"},{"text":"is the probability threshold in the non-nested PCTL formula. Remark 2: For ","element":"span"},{"style":{"height":18.24},"width":371.2,"height":45.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-41.png","element":"img","alt":" s |= Pmax

p","inline":true},{"text":". The same statements hold for general PCTL ","element":"span"},{"text":"specifications, as discussed in Sections ","element":"span"},{"href":"#id-4","text":"III-B ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-5","text":"III-C","element":"a"},{"text":". Now, we summarize the above discussion by Algorithm ","element":"span"},{"href":"#id-11","text":"1 ","element":"a"},{"id":"id-12","text":"and Theorem ","element":"span"},{"href":"#id-12","text":"1 ","element":"a"},{"text":"below. Theorem 1: The return value of Algorithm ","element":"span"},{"href":"#id-11","text":"1 ","element":"a"},{"text":"is correct with probability at least ","element":"span"},{"style":{"height":16},"width":139.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/2-46.png","element":"img","alt":" 1 − |A|δ","inline":true},{"text":". Proof: We provide the proof of a more general statement ","element":"span"},{"id":"id-29","text":"in Theorem ","element":"span"},{"href":"#id-13","text":"2.","element":"a"}],[{"id":"id-11","style":{"width":"100%"},"width":981,"height":105,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-0.png","element":"img"}],[{"text":"Require: MDP ","element":"span"},{"text":"M","element":"span"},{"text":", parameter ","element":"span"},{"style":{"height":11.6},"width":19,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-1.png","element":"img","alt":" δ","inline":true},{"text":".","element":"span"}],[{"style":{"width":"88%"},"width":863,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-2.png","element":"img"}],[{"text":"4: ","element":"span"},{"text":"Sample from ","element":"span"},{"text":"M","element":"span"},{"text":", and update the transition probability function by 
","element":"span"},{"href":"#id-14","text":"(3)","element":"a"},{"href":"#id-15","text":"(4)","element":"a"},{"text":".","element":"span"}],[{"text":"5: ","element":"span"},{"text":"Update the bounds on the Q-function and the policies by ","element":"span"},{"href":"#id-16","text":"(6)","element":"a"},{"href":"#id-9","text":"(7)","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"100%"},"width":981,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-3.png","element":"img"}],[{"text":"Remark 3: The Hoeffding bounds in ","element":"span"},{"href":"#id-16","text":"(6) ","element":"a"},{"text":"are conservative. Consequently, as shown in the simulations in Section ","element":"span"},{"href":"#id-1","text":"IV, ","element":"a"},{"text":"the actual statistical error of our algorithms can be smaller than the given value. However, as the MDP is unknown, finding tighter bounds is challenging. One possible solution is to use asymptotic bounds, such as Bernstein’s bounds [29]. In that case, the algorithm gives only asymptotic probabilistic guarantees.","element":"span"}],[{"id":"id-4","text":"B. 
Finite Time Horizon","element":"span"}],[{"text":"When ","element":"span"},{"style":{"height":11.6},"width":106.76,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-4.png","element":"img","alt":" T ∈ N","inline":true},{"text":", for any ","element":"span"},{"style":{"height":16},"width":274.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-5.png","element":"img","alt":" s ∈ S\\(S0 ∪ S1)","inline":true},{"text":", let","element":"span"}],[{"style":{"width":"95%"},"width":933,"height":185,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-6.png","element":"img"}],[{"text":"i.e., ","element":"span"},{"style":{"height":16},"width":93.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-7.png","element":"img","alt":" Vh(s)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":140.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-8.png","element":"img","alt":" Qh(s, a)","inline":true,"padRight":true},{"text":"are the maximal satisfaction probability of ","element":"span"},{"style":{"height":13.1},"width":128.32,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-9.png","element":"img","alt":" a1Uha2","inline":true,"padRight":true},{"text":"for a random path starting from ","element":"span"},{"text":"s ","element":"span"},{"text":"for any policy and any policy with first action being ","element":"span"},{"text":"a","element":"span"},{"text":", respectively. 
By definition, ","element":"span"},{"style":{"height":16},"width":93.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-10.png","element":"img","alt":" Vh(s)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":141.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-11.png","element":"img","alt":" Qh(s, a)","inline":true,"padRight":true},{"text":"satisfy the Bellman equation","element":"span"}],[{"id":"id-17","style":{"width":"96%"},"width":947,"height":325,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-12.png","element":"img"}],[{"text":"The second equality of the second equation is derived from","element":"span"}],[{"style":{"width":"42%"},"width":418,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-13.png","element":"img"}],[{"text":"by the semantics of PCTL.","element":"span"}],[{"text":"From ","element":"span"},{"href":"#id-17","text":"(11)","element":"a"},{"text":", we check ","element":"span"},{"style":{"height":18.43},"width":253.6,"height":46.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-14.png","element":"img","alt":" Pmax>p (a1Uha2)","inline":true,"padRight":true},{"text":"by induction on the ","element":"span"},{"text":"time horizon ","element":"span"},{"text":"T ","element":"span"},{"text":". 
For ","element":"span"},{"style":{"height":11.6},"width":116.36,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-15.png","element":"img","alt":" h ∈ T","inline":true,"padRight":true},{"text":", the lower and upper bounds for ","element":"span"},{"style":{"height":16},"width":140.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-16.png","element":"img","alt":" Qh(s, a)","inline":true,"padRight":true},{"text":"can be derived using the bounds on the value function for the previous step — for ","element":"span"},{"text":"h ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"from ","element":"span"},{"href":"#id-16","text":"(6) ","element":"a"},{"text":"and for","element":"span"}],[{"text":"h > ","element":"span"},{"text":"0 ","element":"span"},{"text":"by the following lemma.","element":"span"}],[{"id":"id-18","style":{"width":"97%"},"width":959,"height":574,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-17.png","element":"img"}],[{"text":"and","element":"span"}],[{"id":"id-22","style":{"width":"99%"},"width":977,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-18.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.9},"width":36.76,"height":34.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-19.png","element":"img","alt":" δh","inline":true,"padRight":true},{"text":"is a parameter such that ","element":"span"},{"style":{"height":22.42},"width":387.32,"height":56.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-20.png","element":"img","alt":" Qh(s, a) ∈ [Q(k)h (s, a),","inline":true},{"style":{"height":22.57},"width":175.16,"height":56.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-21.png","element":"img","alt":"Q(k)h (s, a)]","inline":true,"padRight":true},{"text":"with probability at least 
","element":"span"},{"style":{"height":13.91},"width":116.92,"height":34.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-22.png","element":"img","alt":" 1 − δh","inline":true},{"text":". The bounds ","element":"span"},{"text":"in ","element":"span"},{"href":"#id-18","text":"(12) ","element":"a"},{"text":"are derived from ","element":"span"},{"href":"#id-17","text":"(11) ","element":"a"},{"text":"by applying Hoeffding’s inequality, using the fact that ","element":"span"},{"style":{"height":18.83},"width":485.44,"height":47.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-23.png","element":"img","alt":" E[ˆT(k)(s, a, s′)] = T(s, a, s′)","inline":true,"padRight":true},{"text":"and the Q-functions are bounded within ","element":"span"},{"text":"[0","element":"span"},{"text":", ","element":"span"},{"text":"1]","element":"span"},{"text":".","element":"span"}],[{"text":"From the boundedness of ","element":"span"},{"style":{"height":16},"width":302.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-24.png","element":"img","alt":" Qh(s, a) ∈ [0, 1]","inline":true},{"text":", we note that this confidence interval encompasses the statistical error in both the estimated transition probability function ","element":"span"},{"style":{"height":18.83},"width":212.8,"height":47.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-25.png","element":"img","alt":"ˆT(k)(s, a, s′)","inline":true,"padRight":true},{"text":"and the bounds","element":"span"},{"style":{"height":22.58},"width":165.76,"height":56.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-26.png","element":"img","alt":"V(k)h (s, a)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":21.45},"width":165.76,"height":53.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-27.png","element":"img","alt":" V(k)h (s, 
a)","inline":true,"padRight":true},{"text":"of ","element":"span"},{"text":"the value function. Accordingly, the policy ","element":"span"},{"style":{"height":21.65},"width":65.76,"height":54.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-28.png","element":"img","alt":" π(k)h","inline":true,"padRight":true},{"text":"chosen by the OFU principle at the ","element":"span"},{"text":"h ","element":"span"},{"text":"step is","element":"span"}],[{"id":"id-20","style":{"width":"78%"},"width":773,"height":65,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-29.png","element":"img"}],[{"text":"with an optimal action chosen arbitrarily when there are multiple candidates, to ensure that the policy giving the upper bound of the value function is sampled the most in the long run. To initialize the iteration, the Q-function is set to","element":"span"}],[{"id":"id-19","style":{"width":"99%"},"width":979,"height":152,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-30.png","element":"img"}],[{"text":"for all ","element":"span"},{"style":{"height":16},"width":126.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-31.png","element":"img","alt":" h ∈ [T ]","inline":true},{"text":", to ensure that every state-action is sampled at least once.","element":"span"}],[{"text":"Sampling by the updated policy ","element":"span"},{"style":{"height":21.46},"width":118.72,"height":53.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-32.png","element":"img","alt":" π(k)h (s)","inline":true,"padRight":true},{"text":"can be performed ","element":"span"},{"text":"in either episodic or non-episodic ways [24]. 
The only requirement is that the state-action pair ","element":"span"},{"style":{"height":21.46},"width":185.92,"height":53.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-33.png","element":"img","alt":" (s, π(k)h (s))","inline":true,"padRight":true},{"text":"should ","element":"span"},{"text":"be performed frequently for each ","element":"span"},{"style":{"height":16},"width":122.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-34.png","element":"img","alt":" h ∈ [T ]","inline":true,"padRight":true},{"text":"and for each state ","element":"span"},{"text":"s ","element":"span"},{"text":"satisfying ","element":"span"},{"style":{"height":16},"width":297.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-35.png","element":"img","alt":" s ∈ S\\(S0 ∪ S1)","inline":true},{"text":". In addition, batch samples may be drawn, namely sampling over the state-action pairs multiple times before updating the policy. In this work, for simplicity, we use a non-episodic, non-batch sampling method, by drawing","element":"span"}],[{"id":"id-21","style":{"width":"67%"},"width":657,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-36.png","element":"img"}],[{"text":"for all ","element":"span"},{"style":{"height":16},"width":127.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-37.png","element":"img","alt":" h ∈ [T ]","inline":true,"padRight":true},{"text":"and state ","element":"span"},{"text":"s ","element":"span"},{"text":"such that ","element":"span"},{"style":{"height":16},"width":352.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/3-38.png","element":"img","alt":" a1 ∈ L(s), a2 /∈ L(s)","inline":true},{"text":". 
The Q-function and the value function are set and initialized","element":"span"}],[{"id":"id-23","style":{"width":"100%"},"width":981,"height":114,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-0.png","element":"img"}],[{"text":"Require: MDP ","element":"span"},{"text":"M","element":"span"},{"text":", parameters ","element":"span"},{"style":{"height":13.9},"width":36.76,"height":34.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-1.png","element":"img","alt":" δh","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":16},"width":122.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-2.png","element":"img","alt":" h ∈ [T ]","inline":true},{"text":".","element":"span"}],[{"text":"1: ","element":"span"},{"text":"Initialize the Q-function and the policy by ","element":"span"},{"href":"#id-19","text":"(15)","element":"a"},{"href":"#id-20","text":"(14)","element":"a"},{"text":". ","element":"span"},{"text":"2: ","element":"span"},{"text":"Obtain ","element":"span"},{"style":{"height":13.1},"width":40.48,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-3.png","element":"img","alt":" S0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.1},"width":40.48,"height":32.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-4.png","element":"img","alt":" S1","inline":true,"padRight":true},{"text":"by ","element":"span"},{"href":"#id-7","text":"(1)","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"27%"},"width":265,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-5.png","element":"img"}],[{"text":"4: ","element":"span"},{"text":"Sample by ","element":"span"},{"href":"#id-21","text":"(16)","element":"a"},{"text":", and update the transition probability function by 
","element":"span"},{"href":"#id-14","text":"(3)","element":"a"},{"href":"#id-15","text":"(4)","element":"a"},{"text":".","element":"span"}],[{"text":"5: ","element":"span"},{"text":"Update the bounds by ","element":"span"},{"href":"#id-18","text":"(12)","element":"a"},{"href":"#id-22","text":"(13) ","element":"a"},{"text":"and the policy by ","element":"span"},{"href":"#id-20","text":"(14)","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"100%"},"width":981,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-6.png","element":"img"}],[{"text":"by ","element":"span"},{"href":"#id-22","text":"(13) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-19","text":"(15)","element":"a"},{"text":". The termination condition is given by","element":"span"}],[{"id":"id-24","style":{"width":"83%"},"width":816,"height":175,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-7.png","element":"img"}],[{"text":"where ","element":"span"},{"text":"p ","element":"span"},{"text":"is the probability threshold in the non-nested PCTL formula. The above discussion is summarized by Algorithm ","element":"span"},{"href":"#id-23","text":"2 ","element":"a"},{"id":"id-13","text":"and Theorem ","element":"span"},{"href":"#id-13","text":"2. 
","element":"a"},{"text":"Theorem 2: Algorithm ","element":"span"},{"href":"#id-23","text":"2 ","element":"a"},{"text":"terminates with probability ","element":"span"},{"text":"1 ","element":"span"},{"text":"and its return value is correct with probability at least ","element":"span"},{"style":{"height":19.6},"width":332.44,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-8.png","element":"img","alt":"1 − N|A| ∑h∈[T ] δh","inline":true},{"text":", where ","element":"span"},{"style":{"height":16},"width":317.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-9.png","element":"img","alt":" N = |S\\(S0 ∪ S1)|","inline":true},{"text":". ","element":"span"},{"text":"Proof: By construction, as the number of iterations ","element":"span"},{"style":{"height":11.2},"width":73.12,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-10.png","element":"img","alt":" k →","inline":true},{"style":{"height":23.18},"width":327.2,"height":57.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-11.png","element":"img","alt":"∞,V(k)T −V(k)T → 0","inline":true},{"text":". Thus, by Assumption ","element":"span"},{"href":"#id-6","text":"1, ","element":"a"},{"text":"the termination condition ","element":"span"},{"href":"#id-24","text":"(17) ","element":"a"},{"text":"will be satisfied with probability ","element":"span"},{"text":"1","element":"span"},{"text":". 
Now, let ","element":"span"},{"text":"E ","element":"span"},{"text":"be the event that the return value of Algorithm ","element":"span"},{"href":"#id-23","text":"2 ","element":"a"},{"text":"is correct, and let ","element":"span"},{"style":{"height":11.5},"width":41,"height":28.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-12.png","element":"img","alt":" Fk","inline":true,"padRight":true},{"text":"be the event that Algorithm ","element":"span"},{"href":"#id-23","text":"2 ","element":"a"},{"text":"terminates at iteration ","element":"span"},{"text":"k","element":"span"},{"text":". Then we have ","element":"span"},{"style":{"height":17.6},"width":478.72,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-13.png","element":"img","alt":" P(E) = ∑k∈N P(E|Fk)P(Fk)","inline":true},{"text":". ","element":"span"},{"text":"For any ","element":"span"},{"text":"k","element":"span"},{"text":", the event ","element":"span"},{"text":"E ","element":"span"},{"text":"happens given that ","element":"span"},{"style":{"height":11.5},"width":40.52,"height":28.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-14.png","element":"img","alt":" Fk","inline":true,"padRight":true},{"text":"holds, if the Hoeffding confidence intervals given by ","element":"span"},{"href":"#id-18","text":"(12) ","element":"a"},{"text":"hold for all actions ","element":"span"},{"style":{"height":16},"width":255.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-15.png","element":"img","alt":" a ∈ A, h ∈ [T ]","inline":true},{"text":", and all states ","element":"span"},{"text":"s ","element":"span"},{"text":"with ","element":"span"},{"style":{"height":16},"width":280.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-16.png","element":"img","alt":" s ∈ S\\(S0 ∪ S1)","inline":true},{"text":". 
Thus, we have ","element":"span"},{"style":{"height":19.6},"width":516.28,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-17.png","element":"img","alt":" P(E|Fk) ≥ 1 − N|A| ∑h∈[T ] δh","inline":true},{"text":", where ","element":"span"},{"text":"N ","element":"span"},{"text":"= ","element":"span"},{"style":{"height":16},"width":221.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-18.png","element":"img","alt":"|S\\(S0 ∪S1)|","inline":true},{"text":", implying that the return value of Algorithm ","element":"span"},{"href":"#id-23","text":"2 ","element":"a"},{"text":"is correct with probability ","element":"span"},{"style":{"height":19.6},"width":464.92,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-19.png","element":"img","alt":" P(E) ≥ 1 − N|A| ∑h∈[T ] δh","inline":true},{"text":". ","element":"span"},{"text":"By Theorem ","element":"span"},{"href":"#id-13","text":"2, ","element":"a"},{"text":"the desired overall statistical error splits into the statistical errors for each state-action pair through the time horizon. For implementation, we can split it equally by setting ","element":"span"},{"style":{"height":13.9},"width":247.32,"height":34.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-20.png","element":"img","alt":"δ1 = · · · = δH","inline":true},{"text":". 
The specification ","element":"span"},{"style":{"height":19.6},"width":140.8,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-21.png","element":"img","alt":" Pmin⋊⋉p (φ)","inline":true,"padRight":true},{"text":"can be handled ","element":"span"},{"text":"by replacing argmax with argmin in ","element":"span"},{"href":"#id-20","text":"(14)","element":"a"},{"text":", and ","element":"span"},{"text":"max ","element":"span"},{"text":"with ","element":"span"},{"text":"min ","element":"span"},{"text":"in ","element":"span"},{"href":"#id-22","text":"(13)","element":"a"},{"text":". The termination condition is the same as ","element":"span"},{"href":"#id-24","text":"(17)","element":"a"},{"text":". Remark 4: Due to the semantics in Definition ","element":"span"},{"href":"#id-25","text":"4, ","element":"a"},{"text":"running Algorithm ","element":"span"},{"href":"#id-23","text":"2 ","element":"a"},{"text":"proving ","element":"span"},{"style":{"height":18.43},"width":147.04,"height":46.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-22.png","element":"img","alt":" Pmax>p (φ)","inline":true,"padRight":true},{"text":"or disproving ","element":"span"},{"style":{"height":18.43},"width":147.04,"height":46.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-23.png","element":"img","alt":" Pmax

p (φ)","inline":true,"padRight":true},{"text":"or proving ","element":"span"},{"style":{"height":18.24},"width":147.04,"height":45.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-25.png","element":"img","alt":" Pmax

p (φ)","inline":true,"padRight":true},{"text":"or ","element":"span"},{"text":"disproving ","element":"span"},{"style":{"height":18.24},"width":147.04,"height":45.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-27.png","element":"img","alt":" Pmax

p","inline":true},{"text":", while disproving it requires evaluating all possible policies with sufficient accuracy. This is illustrated by the simulation results presented in Section ","element":"span"},{"href":"#id-1","text":"IV.","element":"a"}],[{"id":"id-5","text":"C. Infinite Time Horizon","element":"span"}],[{"text":"Infinite-step satisfaction probability can be estimated from finite-step satisfaction probabilities, using the monotone convergence of the value function in the time step ","element":"span"},{"text":"H","element":"span"},{"text":",","element":"span"}],[{"id":"id-26","style":{"width":"100%"},"width":980,"height":353,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-30.png","element":"img"}],[{"text":"where ","element":"span"},{"text":"p ","element":"span"},{"text":"is the probability threshold in the non-nested PCTL formula.","element":"span"}],[{"text":"The general idea in using the monotonicity to check infinite horizon satisfaction probability in finite time is that if we check both ","element":"span"},{"style":{"height":18.43},"width":232.96,"height":46.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-31.png","element":"img","alt":" Pmax>p (a1Ua2)","inline":true,"padRight":true},{"text":"and its negation ","element":"span"},{"style":{"height":19.79},"width":308.32,"height":49.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-32.png","element":"img","alt":"Pmin>1−p(¬a1R¬a2)","inline":true,"padRight":true},{"text":"at the same time, one of them should ","element":"span"},{"text":"terminate in finite time. 
Here ","element":"span"},{"style":{"height":9.5},"width":61.6,"height":23.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-33.png","element":"img","alt":" ¬a1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.5},"width":61.6,"height":23.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-34.png","element":"img","alt":" ¬a2","inline":true,"padRight":true},{"text":"are treated as atomic propositions. We can use Algorithm ","element":"span"},{"href":"#id-23","text":"2 ","element":"a"},{"text":"to check their satisfaction probabilities for any time horizon ","element":"span"},{"text":"T ","element":"span"},{"text":"simultaneously. Termination in finite time is guaranteed if the time horizon for both computations increases with the iterations. The simplest choice is to increase ","element":"span"},{"text":"H ","element":"span"},{"text":"by ","element":"span"},{"text":"1 ","element":"span"},{"text":"every ","element":"span"},{"text":"K ","element":"span"},{"text":"iterations; however, this introduces the problem of tuning ","element":"span"},{"text":"K","element":"span"},{"text":". Here, we use the convergence of the best policy as the criterion for increasing ","element":"span"},{"text":"H ","element":"span"},{"text":"for each satisfaction computation. 
Specifically, for all the steps ","element":"span"},{"text":"h ","element":"span"},{"text":"in each iteration, in addition to finding the optimal policy ","element":"span"},{"style":{"height":21.46},"width":118.72,"height":53.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-35.png","element":"img","alt":" π(k)h (s)","inline":true,"padRight":true},{"text":"with respect to the upper ","element":"span"},{"text":"confidence bounds of the Q-functions ","element":"span"},{"style":{"height":22.58},"width":164.8,"height":56.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-36.png","element":"img","alt":"Q(k)h (s, a)","inline":true,"padRight":true},{"text":"by ","element":"span"},{"href":"#id-20","text":"(14)","element":"a"},{"text":", we ","element":"span"},{"text":"also consider the optimal policy with respect to the lower confidence bounds of the Q-functions ","element":"span"},{"style":{"height":22.42},"width":164.8,"height":56.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-37.png","element":"img","alt":" Q(k)h (s, a)","inline":true},{"text":". 
Obviously, ","element":"span"},{"text":"when ","element":"span"},{"style":{"height":21.65},"width":168.12,"height":54.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-38.png","element":"img","alt":" π(k)h (s) ∈","inline":true,"padRight":true},{"text":"argmax","element":"span"},{"style":{"height":22.42},"width":229.6,"height":56.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-39.png","element":"img","alt":"a∈AQ(k)h (s, a)","inline":true},{"text":", we know that the ","element":"span"},{"text":"policy ","element":"span"},{"style":{"height":21.46},"width":119.2,"height":53.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-40.png","element":"img","alt":" π(k)h (s)","inline":true,"padRight":true},{"text":"is optimal for all possible Q-functions within ","element":"span"},{"style":{"height":25.49},"width":191.96,"height":63.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-41.png","element":"img","alt":"[Q(k)h ,Q(k)h ]","inline":true},{"text":". 
This implies that these bounds are fine enough ","element":"span"},{"text":"for estimating ","element":"span"},{"style":{"height":14},"width":58.68,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-42.png","element":"img","alt":" QH","inline":true},{"text":"; thus, if the algorithm does not terminate by the condition ","element":"span"},{"href":"#id-26","text":"(19)","element":"a"},{"text":", we let","element":"span"}],[{"id":"id-27","style":{"width":"100%"},"width":980,"height":870,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/4-43.png","element":"img"}],[{"id":"id-28","style":{"width":"100%"},"width":981,"height":647,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/5-0.png","element":"img"}],[{"text":"stops and the largest time horizon is ","element":"span"},{"text":"H","element":"span"},{"text":", then the return value is correct with probability at least ","element":"span"},{"style":{"height":16},"width":368.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/5-1.png","element":"img","alt":" 1 − |A| max{N1, N2}","inline":true},{"style":{"height":19.51},"width":168.76,"height":48.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/5-2.png","element":"img","alt":"∑h∈[T ] δh","inline":true},{"text":". 
Thus, the theorem holds.","element":"span"}],[{"text":"Remark 5: By Theorem ","element":"span"},{"href":"#id-27","text":"3, ","element":"a"},{"text":"given the desired overall confidence level ","element":"span"},{"style":{"height":11.6},"width":19,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/5-3.png","element":"img","alt":" δ","inline":true},{"text":", we can split it geometrically by ","element":"span"},{"style":{"height":13.9},"width":97.72,"height":34.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/5-4.png","element":"img","alt":" δh =","inline":true},{"style":{"height":17.55},"width":226.36,"height":43.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/5-5.png","element":"img","alt":"(1 − λ)λh−1δ","inline":true},{"text":", where ","element":"span"},{"style":{"height":16},"width":161.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/5-6.png","element":"img","alt":" λ ∈ (0, 1)","inline":true},{"text":".","element":"span"}],[{"text":"Remark 6: Similar to Section ","element":"span"},{"href":"#id-4","text":"III-B, ","element":"a"},{"text":"checking ","element":"span"},{"style":{"height":19.6},"width":140.8,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/5-7.png","element":"img","alt":" Pmin∼p (φ)","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":16},"width":292.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/5-8.png","element":"img","alt":" ∼∈ {<, >, ≤, ≥}","inline":true,"padRight":true},{"text":"is derived by replacing argmax with argmin in ","element":"span"},{"href":"#id-20","text":"(14)","element":"a"},{"text":", and ","element":"span"},{"text":"max ","element":"span"},{"text":"with ","element":"span"},{"text":"min ","element":"span"},{"text":"in ","element":"span"},{"href":"#id-22","text":"(13)","element":"a"},{"text":". 
The termination condition is the same as ","element":"span"},{"href":"#id-26","text":"(19)","element":"a"},{"text":".","element":"span"}],[{"text":"Remark 7: Finally, we note that the exact savings in sample cost for Algorithms ","element":"span"},{"href":"#id-23","text":"2 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-28","text":"3 ","element":"a"},{"text":"depend on the structure of the MDP. Specifically, the proposed method is more efficient than the methods of [9], [10], [30] when the satisfaction probabilities differ significantly among actions, as it can quickly detect suboptimal actions without oversampling them. On the other hand, if all the state-action pairs yield the same Q-value, then an equal number of samples will be spent on each of them; in this case, the sample cost of Algorithms ","element":"span"},{"href":"#id-23","text":"2 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-28","text":"3 ","element":"a"},{"text":"is the same as that of [9], [10], [30].","element":"span"}]]},{"heading":"IV. SIMULATION","paragraphs":[[{"id":"id-1","text":"To evaluate the performance of the proposed algorithms, ","element":"span"},{"text":"we ran them on two different sets of examples. In all the simulations, the transition probabilities are unknown to the algorithm (unlike in [9]).","element":"span"}],[{"text":"The first set contains 10 randomly generated MDPs with different sizes. For these MDPs, we considered the formula ","element":"span"},{"style":{"height":18.24},"width":275.68,"height":45.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/5-9.png","element":"img","alt":"Pmax

p(α1UHα2)","inline":true,"padRight":true},{"text":"holds. However, to prove ","element":"span"},{"style":{"height":18.24},"width":275.68,"height":45.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2004.00273/images/5-27.png","element":"img","alt":" Pmax