36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"71071","publisher":"neurips","paperJSON":{"title":"Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space","paperID":"71071","avgLineHeight":10.95,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"Models of many real-life applications, such as queueing models of communication networks or computing systems, have a countably infinite state-space. Algorithmic and learning procedures that have been developed to produce optimal policies mainly focus on finite-state settings, and do not directly apply to these models. To overcome this lacuna, in this work we study the problem of optimal control of a family of discrete-time countable state-space MDPs governed by an unknown parameter ","element":"span"},{"style":{"height":12},"width":101.5,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/0-0.png","element":"img","alt":" θ ∈ Θ","inline":true},{"text":", and defined on a countably-infinite state-space ","element":"span"},{"style":{"height":18.8},"width":135,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/0-1.png","element":"img","alt":" X = Z^d_+","inline":true},{"text":", with finite action space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":", and an unbounded cost function. ","element":"span"},{"text":"We take a Bayesian perspective with the random unknown parameter ","element":"span"},{"style":{"height":15.94},"width":204.92,"height":39.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/0-2.png","element":"img","alt":" θ∗ generated","inline":true,"padRight":true},{"text":"via a given fixed prior distribution on ","element":"span"},{"style":{"height":11.4},"width":27.5,"height":28.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/0-3.png","element":"img","alt":" Θ","inline":true},{"text":". 
To optimally control the unknown MDP, we propose an algorithm based on Thompson sampling with dynamically-sized episodes: at the beginning of each episode, the posterior distribution formed via Bayes’ rule is used to produce a parameter estimate, which then decides the policy applied during the episode. To ensure the stability of the Markov chain obtained by following the policy chosen for each parameter, we impose ergodicity assumptions. From this condition and using the solution of the average cost Bellman equation, we establish an ","element":"span"},{"style":{"height":19.6},"width":247.92,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/0-4.png","element":"img","alt":"Õ(dh^d√(|A|T))","inline":true,"padRight":true},{"text":"upper bound on the Bayesian regret of our algorithm, where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is the time-horizon. Finally, to elucidate the applicability of our algorithm, we consider two different queueing models with unknown dynamics, and show that our algorithm can be applied to develop approximately optimal control algorithms.","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"Many real-life applications, such as communication networks, supply chains, and computing systems, are modeled using queueing models with countably infinite state-space. In the existing analysis of these systems, the models are assumed to be known, but despite this, developing optimal control schemes is hard, with only a few examples worked out [","element":"span"},{"href":"#id-0","referenceIndex":28,"text":"28","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":7,"text":"7","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":46,"text":"46","element":"a"},{"text":"]. 
However, knowing the model, algorithmic procedures exist to produce approximately optimal policies [","element":"span"},{"href":"#id-0","referenceIndex":28,"text":"28","element":"a"},{"text":"] (such as value iteration and linear programming). Given the success of data-driven optimal control design, in particular Reinforcement Learning (RL), we explore the use of such methods for the countable state-space controlled Markov processes. However, current RL methods that focus on finite-state settings do not apply to the mentioned queueing models. With the model unknown, our goal is to develop a meta-learning scheme that is RL-based but obtains good performance by utilizing algorithms developed when models are known. Specifically, we study the problem of optimal control of a family of discrete-time countable state-space MDPs governed by an unknown parameter ","element":"span"},{"style":{"height":11.4},"width":17,"height":28.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/0-5.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"from a general space ","element":"span"},{"style":{"height":11.8},"width":27,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/0-6.png","element":"img","alt":" Θ","inline":true,"padRight":true},{"text":"with each MDP evolving on the countable state-space ","element":"span"},{"style":{"height":18.93},"width":139.76,"height":47.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/0-7.png","element":"img","alt":" X = Zd+","inline":true,"padRight":true},{"text":"and finite action space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":". ","element":"span"},{"text":"The cost function is unbounded and polynomially dependent on the state, following the examples of minimizing waiting times in queueing systems. 
Taking a Bayesian view, we assume the model is governed by an unknown parameter ","element":"span"},{"style":{"height":13.14},"width":121.72,"height":32.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/0-8.png","element":"img","alt":" θ∗ ∈ Θ","inline":true,"padRight":true},{"text":"generated from a fixed and known prior distribution. We aim to learn a policy ","element":"span"},{"style":{"height":7.2},"width":22.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/0-9.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"that minimizes the optimal infinite-horizon average cost over a given class of policies ","element":"span"},{"style":{"height":10.8},"width":28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/0-10.png","element":"img","alt":" Π","inline":true,"padRight":true},{"text":"with low Bayesian regret with respect to the (parameter-dependent) optimal policy in ","element":"span"},{"style":{"height":11},"width":36.5,"height":27.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/0-11.png","element":"img","alt":" Π.","inline":true,"padRight":true},{"text":"To avoid many technical difficulties in countably infinite state-space settings, it is crucial to establish certain assumptions regarding the class of models from which the unknown system is drawn; some examples are: i) the number of deterministic stationary policies is not finite; and ii) in average cost optimal control problems, without stability/ergodicity assumptions, an optimal policy may not exist [","element":"span"},{"href":"#id-3","referenceIndex":33,"text":"33","element":"a"},{"text":"], and when it exists, it may not be stationary or deterministic [","element":"span"},{"href":"#id-4","referenceIndex":16,"text":"16","element":"a"},{"text":"]. 
With these in mind, we assume that for any state-action pair, the transition kernels in the model class are categorical and skip-free to the right, i.e., with finite support with a bound depending on the state only in an additive manner; both are common features of queueing models where an increase in state is due to arrivals. A second set of assumptions ensures stability by requiring that the Markov chains obtained by using different policies in ","element":"span"},{"style":{"height":10.8},"width":30,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/1-0.png","element":"img","alt":" Π","inline":true,"padRight":true},{"text":"are geometrically ergodic with uniformity across ","element":"span"},{"style":{"height":11.8},"width":27,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/1-1.png","element":"img","alt":" Θ","inline":true},{"text":". From these assumptions, moments of hitting times are derived in terms of Lyapunov functions for polynomial ergodicity. These assumptions also yield a solution to the average cost optimality equation (ACOE) ","element":"span"},{"href":"#id-1","referenceIndex":7,"text":"[7]","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Contributions: ","element":"span"},{"text":"To optimally control the unknown MDP, we propose an algorithm based on Thompson sampling with dynamically-sized episodes; posterior sampling is chosen for its broad applicability and computational efficiency [","element":"span"},{"href":"#id-5","referenceIndex":38,"text":"38","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":39,"text":"39","element":"a"},{"text":"]. At the beginning of each episode, a posterior distribution is formed using Bayes’ rule, and an estimate is sampled from this distribution, which then determines the policy used throughout the episode. 
To evaluate the performance of our proposed algorithm, we use the metric of Bayesian regret, which compares the expected total cost achieved by a learning policy ","element":"span"},{"style":{"height":9.4},"width":42,"height":23.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/1-2.png","element":"img","alt":"π_L","inline":true,"padRight":true},{"text":"until time horizon ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"with that of the policy achieving the optimal infinite-horizon average cost in the policy class ","element":"span"},{"style":{"height":11},"width":28,"height":27.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/1-3.png","element":"img","alt":" Π","inline":true},{"text":". We consider regret guarantees in three different settings as follows: 1. In Theorem ","element":"span"},{"href":"#id-7","text":"1, ","element":"a"},{"text":"for ","element":"span"},{"style":{"height":11},"width":28,"height":27.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/1-4.png","element":"img","alt":" Π","inline":true,"padRight":true},{"text":"being the set of all policies and assuming that we have oracle access to the optimal policy for each parameter, we establish an ","element":"span"},{"style":{"height":19.8},"width":242.5,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/1-5.png","element":"img","alt":"Õ(dh^d√(|A|T))","inline":true,"padRight":true},{"text":"upper bound on the Bayesian regret of this algorithm compared to the optimal policy. 2. 
In Corollary ","element":"span"},{"href":"#id-8","text":"1, ","element":"a"},{"text":"where the class ","element":"span"},{"style":{"height":10.8},"width":30,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/1-6.png","element":"img","alt":" Π","inline":true,"padRight":true},{"text":"is a subset of all stationary policies and where we know the best policy within this subset for each parameter via an oracle, we prove an ","element":"span"},{"style":{"height":19.6},"width":247.92,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/1-7.png","element":"img","alt":"Õ(dh^d√(|A|T))","inline":true,"padRight":true},{"text":"upper bound on the Bayesian regret of our proposed algorithm, relative to the best-in-class policy. 3. In Theorem ","element":"span"},{"href":"#id-9","text":"2, ","element":"a"},{"text":"we explore a scenario where we have access to an approximately optimal policy, rather than the optimal policy in set ","element":"span"},{"style":{"height":11},"width":28,"height":27.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/1-8.png","element":"img","alt":" Π","inline":true,"padRight":true},{"text":"(whose members are all assumed to be stationary policies). 
When the approximately optimal policies satisfy Assumptions ","element":"span"},{"href":"#id-10","text":"3-","element":"a"},{"href":"#id-11","text":"4, ","element":"a"},{"text":"we prove an ","element":"span"},{"style":{"height":19.62},"width":247.92,"height":49.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/1-9.png","element":"img","alt":"Õ(dh^d√(|A|T))","inline":true,"padRight":true},{"text":"regret bound, relative to the optimal policy in set ","element":"span"},{"style":{"height":11.2},"width":39.88,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/1-10.png","element":"img","alt":" Π.","inline":true}],[{"text":"Finally, to provide examples of our framework, we consider two different queueing models that meet our technical conditions, showing the applicability of our algorithm in developing approximately optimal control algorithms for stochastic systems with unknown dynamics. The first example is a continuous-time queueing system with two heterogeneous servers with unknown service rates and a common infinite buffer, with the decision being whether to use the slower server. Here, the optimal policy that minimizes the average waiting time is a threshold policy [","element":"span"},{"href":"#id-12","referenceIndex":31,"text":"31","element":"a"},{"text":"] which yields a queue-length after which the slower server is always used. The second model is a two-server queueing system in which each server has its own infinite buffer and a dispatcher routes each incoming arrival to one of the two queues. Here, the optimal policy to minimize the waiting time is unknown for general parameter values, so we aim to find the best policy within a commonly used set of policies that assign the arrival to the queue with minimum weighted queue length (Max-Weight policies [","element":"span"},{"href":"#id-13","referenceIndex":49,"text":"49","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":50,"text":"50","element":"a"},{"text":"]). 
For both models, we verify our assumptions for the class of optimal/best-in-class policies corresponding to different service rates and conclude that our proposed algorithm can be used to learn the optimal/best-in-class policy.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Related Work: ","element":"span"},{"text":"Thompson sampling [","element":"span"},{"href":"#id-15","referenceIndex":53,"text":"53","element":"a"},{"text":"], or posterior sampling, has been applied to RL in many contexts of unknown MDPs [","element":"span"},{"href":"#id-16","referenceIndex":47,"text":"47","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","referenceIndex":37,"text":"37","element":"a"},{"text":"] and partially observed MDPs [","element":"span"},{"href":"#id-18","referenceIndex":23,"text":"23","element":"a"},{"text":"]; see tutorials [","element":"span"},{"href":"#id-19","referenceIndex":18,"text":"18","element":"a"},{"text":", ","element":"span"},{"href":"#id-20","referenceIndex":42,"text":"42","element":"a"},{"text":"] for a comprehensive survey. 
It has been used in the parametric learning context [","element":"span"},{"href":"#id-21","referenceIndex":5,"text":"5","element":"a"},{"text":"] to minimize either Bayesian [","element":"span"},{"href":"#id-5","referenceIndex":38,"text":"38","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":39,"text":"39","element":"a"},{"text":", ","element":"span"},{"href":"#id-22","referenceIndex":41,"text":"41","element":"a"},{"text":", ","element":"span"},{"href":"#id-23","referenceIndex":1,"text":"1","element":"a"},{"text":", ","element":"span"},{"href":"#id-24","referenceIndex":51,"text":"51","element":"a"},{"text":", ","element":"span"},{"href":"#id-25","referenceIndex":52,"text":"52","element":"a"},{"text":"] or frequentist [","element":"span"},{"href":"#id-26","referenceIndex":4,"text":"4","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":19,"text":"19","element":"a"},{"text":"] regret. The bulk of the literature, including [","element":"span"},{"href":"#id-26","referenceIndex":4,"text":"4","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":19,"text":"19","element":"a"},{"text":", ","element":"span"},{"href":"#id-22","referenceIndex":41,"text":"41","element":"a"},{"text":"], analyzes finite-state and finite-action models but with different parameterizations such that a general dependence of the models on the parameters is allowed. The work in [","element":"span"},{"href":"#id-25","referenceIndex":52,"text":"52","element":"a"},{"text":"] studies general state-space MDPs but with a scalar parameterization with a Lipschitz dependence of the underlying models. Our problem formulation specifically considers countable state-space models with the models related via ergodicity, which we believe is a natural choice. 
Our focus on parametric learning is also connected to older work in adaptive control [","element":"span"},{"href":"#id-28","referenceIndex":3,"text":"3","element":"a"},{"text":", ","element":"span"},{"href":"#id-29","referenceIndex":20,"text":"20","element":"a"},{"text":"] which studies asymptotically optimal learning for general parameter settings but with either a finite or countably infinite number of policies. Learning-based asymptotically optimal control in queues has a long history [","element":"span"},{"href":"#id-30","referenceIndex":29,"text":"29","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":28,"text":"28","element":"a"},{"text":"] but recently there has been increasing work that also characterizes finite-time regret performance with respect to a well-known good policy or the optimal policy; see [","element":"span"},{"href":"#id-31","referenceIndex":54,"text":"54","element":"a"},{"text":"] for a survey. A series of works has studied ","element":"span"},{"text":"learning with Max-Weight policies to get stability and linear regret [","element":"span"},{"href":"#id-32","referenceIndex":36,"text":"36","element":"a"},{"text":", ","element":"span"},{"href":"#id-33","referenceIndex":25,"text":"25","element":"a"},{"text":"] or just stability [","element":"span"},{"href":"#id-34","referenceIndex":56,"text":"56","element":"a"},{"text":"]. A recent related work [","element":"span"},{"href":"#id-35","referenceIndex":14,"text":"14","element":"a"},{"text":"] considers learning optimal parameterized policies in queueing networks when the MDP is known. 
In a finite or countable state-space setting of specific queueing models where the parameters can be estimated, several works [","element":"span"},{"href":"#id-36","referenceIndex":2,"text":"2","element":"a"},{"text":", ","element":"span"},{"href":"#id-37","referenceIndex":13,"text":"13","element":"a"},{"text":", ","element":"span"},{"href":"#id-38","referenceIndex":45,"text":"45","element":"a"},{"text":", ","element":"span"},{"href":"#id-39","referenceIndex":27,"text":"27","element":"a"},{"text":", ","element":"span"},{"href":"#id-40","referenceIndex":26,"text":"26","element":"a"},{"text":", ","element":"span"},{"href":"#id-41","referenceIndex":11,"text":"11","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":17,"text":"17","element":"a"},{"text":"] have used forced exploration type schemes to obtain either regret that is constant or scaling logarithmically in the time-horizon.","element":"span"}],[{"text":"Another line of work studies the problem of learning the optimal policy in an undiscounted finite-horizon MDP with a bounded reward function. Reference [","element":"span"},{"href":"#id-43","referenceIndex":57,"text":"57","element":"a"},{"text":"] uses a Thompson sampling-based learning algorithm with linear value function approximation to study an MDP with a bounded reward function in a finite-horizon setting. Reference [","element":"span"},{"href":"#id-44","referenceIndex":12,"text":"12","element":"a"},{"text":"] considers an episodic finite-horizon MDP with known bounded rewards but unknown transition kernels modeled using linearly parameterized exponential families. A maximum likelihood (ML) based algorithm coupled with exploration done by constructing high probability confidence sets around the ML estimate is used to learn the unknown parameters. 
In another work, [","element":"span"},{"href":"#id-45","referenceIndex":40,"text":"40","element":"a"},{"text":"] extends the problem setting of [","element":"span"},{"href":"#id-44","referenceIndex":12,"text":"12","element":"a"},{"text":"] to an episodic finite-horizon MDP with unknown rewards and transitions modeled using parametric bilinear exponential families. To learn the unknown parameters, they use a ML based algorithm with exploration done with explicit perturbation. We note that all mentioned works consider a finite-horizon problem. In contrast, our work considers an average cost problem, an infinite-horizon setting, and provides finite-time performance guarantees. In addition, these works focus on an MDP with a bounded reward function. Our focus, however, is learning in MDPs with unbounded rewards with the goal of covering practical queueing examples. We note that the parameterization of transitions used in [","element":"span"},{"href":"#id-45","referenceIndex":40,"text":"40","element":"a"},{"text":", ","element":"span"},{"href":"#id-44","referenceIndex":12,"text":"12","element":"a"},{"text":"] can be used within our framework. However, similar to our work, additional stability assumptions are necessary to guarantee asymptotic learning and sub-linear regret. Another issue with exponential transition families is that they do not allow for ","element":"span"},{"text":"0 ","element":"span"},{"text":"entries, which limits their applicability in queueing models.","element":"span"}],[{"text":"In another work, [","element":"span"},{"href":"#id-46","referenceIndex":43,"text":"43","element":"a"},{"text":"] studies discounted MDPs with unknown dynamics, and unbounded state-space, but with bounded rewards, and learns an online policy that satisfies a specific notion of stability. It is also assumed that a Lyapunov function ensuring stability for the optimal policy exists. 
We note that [","element":"span"},{"href":"#id-46","referenceIndex":43,"text":"43","element":"a"},{"text":"] ignores optimality and focuses on finding a stable policy, which contrasts with our work that evaluates performance relative to the optimal policy. Secondly, [","element":"span"},{"href":"#id-46","referenceIndex":43,"text":"43","element":"a"},{"text":"] considers a discounted reward problem, essentially a finite-time horizon problem. Average cost problems, such as ours, are infinite-time horizon problems, so connections to discounted problems can only be made in the limit of the discount parameter going to ","element":"span"},{"text":"1","element":"span"},{"text":". Moreover, [","element":"span"},{"href":"#id-46","referenceIndex":43,"text":"43","element":"a"},{"text":"] considers a bounded reward function, simplifying their analysis but not practical for many queueing examples. Further, the assumption of a stable optimal policy with a Lyapunov function (as in [","element":"span"},{"href":"#id-46","referenceIndex":43,"text":"43","element":"a"},{"text":"]) is highly restrictive for bounded reward settings with discounting. Additionally, average cost problems with bounded costs need strong state-independent recurrence conditions for the existence of (stationary) optimal solutions, which many queueing examples don’t satisfy; see [","element":"span"},{"href":"#id-47","referenceIndex":9,"text":"9","element":"a"},{"text":"]. 
Further complications can also arise with bounded costs: e.g., ","element":"span"},{"href":"#id-4","referenceIndex":16,"text":"[16] ","element":"a"},{"text":"shows that a stationary average cost optimal policy may not exist.","element":"span"}]]},{"heading":"2 Problem formulation","paragraphs":[[{"text":"We consider a family of discrete-time Markov Decision Processes (MDPs) governed by parameter ","element":"span"},{"style":{"height":12.2},"width":109.5,"height":30.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-0.png","element":"img","alt":"θ ∈ Θ","inline":true,"padRight":true},{"text":"with the MDP for parameter ","element":"span"},{"style":{"height":11.4},"width":17,"height":28.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-1.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"described by ","element":"span"},{"style":{"height":16},"width":211.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-2.png","element":"img","alt":" (X, A, c, Pθ)","inline":true},{"text":". For exposition purposes, we assume that all the MDPs are on (a common) countably infinite state-space ","element":"span"},{"style":{"height":18.93},"width":143.4,"height":47.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-3.png","element":"img","alt":" X = Zd+","inline":true},{"text":". 
We denote ","element":"span"},{"text":"the finite action space by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":", the transition kernel by ","element":"span"},{"style":{"height":16},"width":348.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-4.png","element":"img","alt":" Pθ : X × A → ∆(X)","inline":true},{"text":", and the cost function by ","element":"span"},{"style":{"height":15.6},"width":273.12,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-5.png","element":"img","alt":"c : X ×A → R+","inline":true},{"text":". As mentioned earlier, we will take a Bayesian view of the problem and assume that the model is generated using an unknown parameter ","element":"span"},{"style":{"height":13.12},"width":121.68,"height":32.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-6.png","element":"img","alt":" θ∗ ∈ Θ","inline":true},{"text":", which is generated from a given fixed prior distribution ","element":"span"},{"style":{"height":16},"width":64.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-7.png","element":"img","alt":" ν(·)","inline":true,"padRight":true},{"text":"on ","element":"span"},{"style":{"height":11.8},"width":27,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-8.png","element":"img","alt":" Θ","inline":true},{"text":". 
Our goal is to find a policy ","element":"span"},{"style":{"height":12.4},"width":187.44,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-9.png","element":"img","alt":" π : X → A","inline":true,"padRight":true},{"text":"that tries to achieve Bayesian optimal performance in policy class ","element":"span"},{"style":{"height":10.8},"width":30,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-10.png","element":"img","alt":" Π","inline":true},{"text":", i.e., minimizes the expected regret with ","element":"span"},{"style":{"height":12.6},"width":36,"height":31.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-11.png","element":"img","alt":" θ∗","inline":true,"padRight":true},{"text":"chosen from the prior distribution ","element":"span"},{"style":{"height":16},"width":64.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-12.png","element":"img","alt":" ν(·)","inline":true},{"text":". For each value ","element":"span"},{"style":{"height":11.6},"width":99.52,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-13.png","element":"img","alt":" θ ∈ Θ","inline":true},{"text":", the minimum infinite-horizon average cost is defined as","element":"span"}],[{"style":{"width":"74%"},"width":1174,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-14.png","element":"img"}],[{"text":"where we optimize over a given class of policies ","element":"span"},{"style":{"height":10.8},"width":28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-15.png","element":"img","alt":" Π","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":563.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-16.png","element":"img","alt":" X(t) = (X1(t), . . . 
, Xd(t)) ∈ X","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":156,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-17.png","element":"img","alt":"A(t) ∈ A","inline":true,"padRight":true},{"text":"are the state and action at ","element":"span"},{"style":{"height":11.6},"width":92.12,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-18.png","element":"img","alt":" t ∈ N","inline":true},{"text":". Typically, we set this class to be all (causal) policies, but it is also possible to consider ","element":"span"},{"style":{"height":11},"width":28,"height":27.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-19.png","element":"img","alt":" Π","inline":true,"padRight":true},{"text":"to be a proper subset of all policies as we will explore in our results. For a learning policy ","element":"span"},{"style":{"height":9.2},"width":42.5,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/2-20.png","element":"img","alt":" πL","inline":true,"padRight":true},{"text":"that aims to select the optimal control without model knowledge but with","element":"span"}],[{"text":"knowledge of ","element":"span"},{"style":{"height":11.4},"width":27,"height":28.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-0.png","element":"img","alt":" Θ","inline":true,"padRight":true},{"text":"and the prior ","element":"span"},{"style":{"height":7.2},"width":19,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-1.png","element":"img","alt":" ν","inline":true},{"text":", the Bayesian regret until time horizon ","element":"span"},{"style":{"height":13},"width":99.5,"height":32.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-2.png","element":"img","alt":" T ≥ 2","inline":true,"padRight":true},{"text":"is defined 
as","element":"span"}],[{"id":"id-84","style":{"width":"74%"},"width":1180,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-3.png","element":"img"}],[{"text":"where the expectation is taken over ","element":"span"},{"style":{"height":12.4},"width":114.5,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-4.png","element":"img","alt":" θ∗ ∼ ν","inline":true,"padRight":true},{"text":"and the dynamics induced by ","element":"span"},{"style":{"height":9.4},"width":42,"height":23.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-5.png","element":"img","alt":" πL","inline":true},{"text":". Owing to underlying challenges in countable state-space MDPs, we require the below assumptions on the cost function.","element":"span"}],[{"id":"id-54","style":{"fontWeight":"bold"},"text":"Assumption 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The cost function ","element":"span"},{"style":{"height":15.4},"width":268,"height":38.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-6.png","element":"img","alt":" c : X ×A → R+","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is assumed to satisfy the following two conditions:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. 
For every number ","element":"span"},{"style":{"height":13.2},"width":93.44,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-7.png","element":"img","alt":" z ≥ 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and action ","element":"span"},{"style":{"height":16},"width":306.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-8.png","element":"img","alt":" a ∈ A, c(x, a) ≥ z","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"outside a finite subset of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"2. The cost function is upper-bounded by a multivariate polynomial ","element":"span"},{"style":{"height":18.93},"width":273.68,"height":47.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-9.png","element":"img","alt":" f_c : Z^d_+ → R_+","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"which is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"increasing in every component on ","element":"span"},{"style":{"height":18.93},"width":126.52,"height":47.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-10.png","element":"img","alt":" x ∈ Z^d_+ ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and has maximum degree of ","element":"span"},{"style":{"height":16},"width":123.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-11.png","element":"img","alt":" r (≥ 1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in any dimension. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"We can assume that ","element":"span"},{"style":{"height":20.4},"width":1054.64,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-12.png","element":"img","alt":" f_c(x) = K Σ_{i=1}^d (x_i)^r for some K > 0, where x = (x_1, . . . , x_d).","inline":true}],[{"text":"Thus, the cost function increases without bound (in the state) at a polynomial rate. This assumption is common in practice: holding costs in queueing models are polynomial in the state components. To avoid technical issues, the infinite state-space setting also necessitates some assumptions on the class from which the unknown model is drawn. For instance, irreducibility of Markov chains on such state-spaces does not ensure positive recurrence (and ergodicity). Moreover, for average cost optimal control problems, without stability even the existence of an optimal policy is not guaranteed, and we need more conditions. The following assumption ensures a skip-free behaviour for transitions, which holds in many queueing models, where an increase in state corresponds to (new) arrivals.","element":"span"}],[{"id":"id-77","style":{"fontWeight":"bold"},"text":"Assumption 2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"From any state-action pair ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", a","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", the transition is to a finite number of states. 
We also assume that all transition kernels are skip-free to the right: for some ","element":"span"},{"style":{"height":13.2},"width":96.12,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-13.png","element":"img","alt":" h ≥ 1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"which is independent of ","element":"span"},{"style":{"height":18.91},"width":1551.68,"height":47.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-14.png","element":"img","alt":" θ ∈ Θ and (x, a) ∈ X × A, we have Pθ(x′; x, a) = 0 for all x′ ∈ {˜x ∈ Zd+ : ∥˜x∥1 > ∥x∥1 +h}.","inline":true}],[{"text":"Learning requires some commonality within the class of models, so that using a policy well-suited to one model also provides information about the other models. For us, this takes the form of constraints on the transition kernels of the models and stability assumptions. As simple union-bound arguments do not work in the countably infinite state-space setting, we use the stability assumptions instead. In our setting, we consider a class of models together with a set of policies, each policy being well-suited to at least one model in the class, and search within this set of policies. Using a reduced set of policies is necessary as the number of deterministic stationary policies is infinite. Learning correctly while restricting attention to this policy subclass requires some regularity assumptions on what happens when a policy well-suited to one model is applied to a different model. Our ergodicity assumptions are one convenient choice; see Appendix ","element":"span"},{"href":"#id-48","text":"A.1 ","element":"a"},{"text":"for details. 
These assumptions let us characterize the distributions of the first passage times of the Markov processes via stability conditions; see Lemmas ","element":"span"},{"href":"#id-49","text":"10 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-50","text":"11.","element":"a"}],[{"id":"id-10","style":{"fontWeight":"bold"},"text":"Assumption 3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any MDP ","element":"span"},{"style":{"height":16},"width":211.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-15.png","element":"img","alt":" (X, A, c, Pθ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"with parameter ","element":"span"},{"style":{"height":12},"width":100,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-16.png","element":"img","alt":" θ ∈ Θ","inline":true},{"style":{"fontStyle":"italic"},"text":", there exists a unique optimal policy ","element":"span"},{"style":{"height":15.6},"width":37.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-17.png","element":"img","alt":" π∗θ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"that minimizes the infinite-horizon average cost within the class of policies ","element":"span"},{"style":{"height":11},"width":28,"height":27.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-18.png","element":"img","alt":" Π","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Further","element":"span"},{"style":{"fontStyle":"italic"},"text":"more, for any ","element":"span"},{"style":{"height":14.6},"width":169.5,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-19.png","element":"img","alt":" θ1, θ2 ∈ Θ","inline":true},{"style":{"fontStyle":"italic"},"text":", the Markov process with transition kernel ","element":"span"},{"style":{"height":26.27},"width":77.24,"height":65.68,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-20.png","element":"img","alt":" Pπ∗θ2θ1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"obtained from the MDP ","element":"span"},{"style":{"height":16},"width":226.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-21.png","element":"img","alt":"(X, A, c, Pθ1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"by following policy ","element":"span"},{"style":{"height":17.09},"width":51.88,"height":42.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-22.png","element":"img","alt":" π∗θ2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is irreducible, aperiodic, and geometrically ergodic with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"geometric ergodicity coefficient ","element":"span"},{"style":{"height":19.66},"width":231.32,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-23.png","element":"img","alt":" γgθ1,θ2 ∈ (0, 1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and stationary distribution ","element":"span"},{"style":{"height":11.6},"width":93.36,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-24.png","element":"img","alt":" µθ1,θ2","inline":true},{"style":{"fontStyle":"italic"},"text":". 
This is equivalent ","element":"span"},{"style":{"fontStyle":"italic"},"text":"to the existence of finite set ","element":"span"},{"style":{"height":19.66},"width":97.84,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-25.png","element":"img","alt":" Cgθ1,θ2 ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and Lyapunov function ","element":"span"},{"style":{"height":19.66},"width":361.2,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-26.png","element":"img","alt":" V gθ1,θ2 : X → [1, +∞)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfying","element":"span"}],[{"id":"id-98","style":{"width":"100%"},"width":1596,"height":308,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-27.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Then, we have the following assumptions relating all the models in ","element":"span"},{"style":{"height":11.8},"width":40,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-28.png","element":"img","alt":" Θ:","inline":true}],[{"style":{"fontStyle":"italic"},"text":"1. The geometric ergodicity coefficient is uniformly bounded below ","element":"span"},{"style":{"height":19.66},"width":513.2,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-29.png","element":"img","alt":" 1: γg∗ := supθ1,θ2∈Θ γgθ1,θ2 < 1.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. 
We assume that ","element":"span"},{"style":{"height":20.59},"width":385.68,"height":51.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-30.png","element":"img","alt":" {0d} ⊆ ∩θ1,θ2∈ΘCgθ1,θ2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":19.66},"width":357.2,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-31.png","element":"img","alt":" Cg∗ = ∪θ1,θ2∈ΘCgθ1,θ2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a finite set. We further ","element":"span"},{"style":{"fontStyle":"italic"},"text":"assume that ","element":"span"},{"style":{"height":19.66},"width":463.4,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/3-32.png","element":"img","alt":" bg∗ := supθ1,θ2 bgθ1,θ2 < +∞.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Remark 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The uniqueness of the optimal policy is not essential for the validity of our results, provided that all optimal policies satisfy our assumptions. When this condition is not met, we need to select an optimal policy that is geometrically ergodic for all ","element":"span"},{"style":{"height":12.2},"width":98.5,"height":30.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-0.png","element":"img","alt":" θ ∈ Θ","inline":true},{"style":{"fontStyle":"italic"},"text":". 
This issue can be avoided by using a smaller subset of policies for which ergodicity can be shown, such as Max-Weight policies.","element":"span"}],[{"text":"Geometric ergodicity implies that all moments of the hitting time of state ","element":"span"},{"style":{"height":13.39},"width":36.92,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-1.png","element":"img","alt":" 0d","inline":true},{"text":", say ","element":"span"},{"style":{"height":10.4},"width":46.5,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-2.png","element":"img","alt":" τ0d","inline":true},{"text":", from any initial state ","element":"span"},{"style":{"height":17.2},"width":114.5,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-3.png","element":"img","alt":" x ̸= 0d ","inline":true,"padRight":true},{"text":"are finite as ","element":"span"},{"style":{"height":16},"width":339.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-4.png","element":"img","alt":" Ex[κτ0d ] ≤ c1V g(x)","inline":true,"padRight":true},{"text":"(for specific ","element":"span"},{"style":{"height":18.88},"width":528.88,"height":47.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-5.png","element":"img","alt":" κ > 1 and c1), and so, Ex[τ k0d] ≤","inline":true},{"style":{"height":18.72},"width":451.84,"height":46.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-6.png","element":"img","alt":"c1V g(x)k!/ logk(κ) < +∞","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":11.6},"width":97,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-7.png","element":"img","alt":" k ∈ N","inline":true},{"text":"; see Appendix ","element":"span"},{"href":"#id-51","text":"A.2. 
","element":"a"},{"text":"Function ","element":"span"},{"style":{"height":10.8},"width":48.08,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-8.png","element":"img","alt":" V g","inline":true,"padRight":true},{"text":"is typically exponential in some norm of the state and yields an exponential bound for moments of hitting times, and a poor regret bound. To improve the regret bound, we need a different drift equation with function ","element":"span"},{"style":{"height":11.6},"width":131.04,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-9.png","element":"img","alt":" V p with","inline":true,"padRight":true},{"text":"polynomial dependence on a norm of the state that bounds certain polynomial moments of ","element":"span"},{"style":{"height":10.4},"width":58.5,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-10.png","element":"img","alt":" τ0d.","inline":true}],[{"id":"id-11","style":{"fontWeight":"bold"},"text":"Assumption 4. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Given ","element":"span"},{"style":{"height":14},"width":170.64,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-11.png","element":"img","alt":" θ1, θ2 ∈ Θ","inline":true},{"style":{"fontStyle":"italic"},"text":", Markov process obtained from MDP ","element":"span"},{"style":{"height":16},"width":226.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-12.png","element":"img","alt":" (X, A, c, Pθ1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"by following policy ","element":"span"},{"style":{"height":17.2},"width":49,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-13.png","element":"img","alt":" π∗θ2 ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is polynomially ergodic through the Foster-Lyapunov criteria: there exists a finite set ","element":"span"},{"style":{"height":19.6},"width":105.5,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-14.png","element":"img","alt":" Cpθ1,θ2,","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"constants ","element":"span"},{"style":{"height":19.66},"width":575.8,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-15.png","element":"img","alt":" βpθ1,θ2, bpθ1,θ2 > 0, αpθ1,θ2 ∈ [ rr+1, 1)","inline":true},{"style":{"fontStyle":"italic"},"text":", and function ","element":"span"},{"style":{"height":19.66},"width":361.2,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-16.png","element":"img","alt":" V pθ1,θ2 : X → [1, 
+∞)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfying","element":"span"}],[{"id":"id-99","style":{"width":"86%"},"width":1364,"height":76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-17.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Then, we have the following assumptions relating all the models in ","element":"span"},{"style":{"height":11.8},"width":40,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-18.png","element":"img","alt":" Θ:","inline":true}],[{"style":{"fontStyle":"italic"},"text":"1. ","element":"span"},{"style":{"height":19.8},"width":89,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-19.png","element":"img","alt":" V pθ1,θ2 ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a polynomial with positive coefficients, maximum degree (in any dimension) ","element":"span"},{"style":{"height":19.8},"width":171.5,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-20.png","element":"img","alt":" rpθ1,θ2, and","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"sum of coefficients ","element":"span"},{"style":{"height":19.68},"width":88.04,"height":49.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-21.png","element":"img","alt":" spθ1,θ2","inline":true},{"style":{"fontStyle":"italic"},"text":". We assume ","element":"span"},{"style":{"height":19.68},"width":918.16,"height":49.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-22.png","element":"img","alt":" rp∗ = supθ1,θ2 rpθ1,θ2 < ∞ and sp∗ = supθ1,θ2 spθ1,θ2 < ∞.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"2. 
We assume that ","element":"span"},{"style":{"height":20.59},"width":385.68,"height":51.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-23.png","element":"img","alt":" {0d} ⊆ ∩θ1,θ2∈ΘCpθ1,θ2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":19.6},"width":353.5,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-24.png","element":"img","alt":" Cp∗ = ∪θ1,θ2∈ΘCpθ1,θ2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a finite set. We further ","element":"span"},{"style":{"fontStyle":"italic"},"text":"assume that ","element":"span"},{"style":{"height":19.66},"width":915.8,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-25.png","element":"img","alt":" βp∗ := infθ1,θ2 βpθ1,θ2 > 0 and bp∗ := supθ1,θ2 bpθ1,θ2 < ∞.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"3. Let ","element":"span"},{"style":{"height":26.29},"width":712.36,"height":65.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-26.png","element":"img","alt":" Kθ1,θ2(x) := �∞n=0 2−n−2�Pπ∗θ2θ1 �n(x, 0d)","inline":true},{"style":{"fontStyle":"italic"},"text":", which is positive for any pair ","element":"span"},{"style":{"height":14.6},"width":178,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-27.png","element":"img","alt":" θ1, θ2 ∈ Θ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"by irreducibility. 
We assume that it is strictly positive in ","element":"span"},{"style":{"height":17.38},"width":706.48,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-28.png","element":"img","alt":" Θ: K∗ := infθ1,θ2 minx∈Cp∗ Kθ1,θ2(x) > 0.","inline":true}],[{"text":"Assumptions ","element":"span"},{"href":"#id-10","text":"3-","element":"a"},{"href":"#id-11","text":"4 ","element":"a"},{"text":"hold in many models of interest; see Appendix ","element":"span"},{"text":"E. ","element":"span"},{"text":"As average cost optimality is our design criterion, we need to ensure the existence of solutions to ACOE when ","element":"span"},{"style":{"height":11},"width":28,"height":27.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-29.png","element":"img","alt":" Π","inline":true,"padRight":true},{"text":"is the set of all policies, or Poisson equation when ","element":"span"},{"style":{"height":10.8},"width":30,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-30.png","element":"img","alt":" Π","inline":true,"padRight":true},{"text":"is a subset of all policies. We discuss these two cases separately. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Case 1: ","element":"span"},{"style":{"height":10.8},"width":30,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-31.png","element":"img","alt":" Π","inline":true,"padRight":true},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"is the set of all policies. 
","element":"span"},{"text":"For any parameter ","element":"span"},{"style":{"height":16},"width":481.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-32.png","element":"img","alt":" θ ∈ Θ, the MDP (X, A, c, Pθ)","inline":true,"padRight":true},{"text":"is said to satisfy the ACOE if there exists a constant ","element":"span"},{"style":{"height":16},"width":77.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-33.png","element":"img","alt":" J(θ)","inline":true,"padRight":true},{"text":"and a unique function ","element":"span"},{"style":{"height":16},"width":412.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-34.png","element":"img","alt":" v(·; θ) : X → R such that","inline":true}],[{"style":{"width":"79%"},"width":1258,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-35.png","element":"img"}],[{"text":"From [","element":"span"},{"href":"#id-52","referenceIndex":10,"text":"10","element":"a"},{"text":"] if the following conditions hold, ACOE has a solution, ","element":"span"},{"style":{"height":13.4},"width":34,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-36.png","element":"img","alt":" Jθ","inline":true,"padRight":true},{"text":"is the optimal infinite-horizon average cost, and there is an optimal stationary policy with ACOE becoming ","element":"span"},{"href":"#id-53","text":"(5)","element":"a"},{"text":": (i) for every ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", a","element":"span"},{"text":") ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":13.2},"width":99.36,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-37.png","element":"img","alt":" z ≥ 0","inline":true},{"text":", cost function 
","element":"span"},{"style":{"height":16},"width":191.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-38.png","element":"img","alt":" c(x, a) ≥ z","inline":true,"padRight":true},{"text":"outside a finite subset of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":"; (ii) there is a stationary policy with an irreducible and aperiodic Markov process with finite average cost; and (iii) from every ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", a","element":"span"},{"text":") ","element":"span"},{"text":"transition to a finite number of states is possible. From Assumptions ","element":"span"},{"href":"#id-54","text":"1-","element":"a"},{"href":"#id-10","text":"3, ","element":"a"},{"text":"the above conditions hold. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Case 2: ","element":"span"},{"style":{"height":11},"width":28,"height":27.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-39.png","element":"img","alt":" Π","inline":true,"padRight":true},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"is a proper subset of all policies. 
","element":"span"},{"text":"Here, we posit that for every ","element":"span"},{"style":{"height":11.8},"width":96,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-40.png","element":"img","alt":" θ ∈ Θ","inline":true,"padRight":true},{"text":"and its best in-class policy ","element":"span"},{"style":{"height":15.8},"width":37,"height":39.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-41.png","element":"img","alt":" π∗θ","inline":true},{"text":", there exists a constant ","element":"span"},{"style":{"height":16},"width":77.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-42.png","element":"img","alt":" J(θ)","inline":true},{"text":", the average cost of ","element":"span"},{"style":{"height":15.8},"width":37.5,"height":39.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-43.png","element":"img","alt":" π∗θ","inline":true},{"text":", and a function ","element":"span"},{"style":{"height":16},"width":339.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-44.png","element":"img","alt":" v(·; θ) : X → R with","inline":true}],[{"id":"id-53","style":{"width":"81%"},"width":1292,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-45.png","element":"img"}],[{"text":"This holds by the solution of the Poisson equation with the appropriate forcing function. 
For a Markov process ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"X ","element":"span"},{"text":"on the space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"with transition kernel ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"and cost function ","element":"span"},{"style":{"height":16},"width":59.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-46.png","element":"img","alt":" ¯c(·)","inline":true},{"text":", a solution to the Poisson equation [","element":"span"},{"href":"#id-55","referenceIndex":34,"text":"34","element":"a"},{"text":"] is a scalar ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J ","element":"span"},{"text":"and function ","element":"span"},{"style":{"height":16},"width":963.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-47.png","element":"img","alt":" v(·) : X �→ R such that J + v = ¯c + Pv, where v(z) = 0 for","inline":true,"padRight":true},{"text":"some ","element":"span"},{"style":{"height":11.8},"width":103.5,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-48.png","element":"img","alt":" z ∈ X","inline":true},{"text":". 
In our setting using [","element":"span"},{"href":"#id-55","referenceIndex":34,"text":"34","element":"a"},{"text":", Sections 9.6-9.8], for a model governed by ","element":"span"},{"style":{"height":12.2},"width":96.5,"height":30.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-49.png","element":"img","alt":" θ ∈ Θ","inline":true,"padRight":true},{"text":"following policy ","element":"span"},{"style":{"height":15.6},"width":37,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-50.png","element":"img","alt":" π∗θ","inline":true},{"text":", we show a solution to the Poisson equation exists and is given by ","element":"span"},{"style":{"height":18.59},"width":269.16,"height":46.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-51.png","element":"img","alt":" vπ∗θ (0d) = 0 and","inline":true}],[{"style":{"width":"92%"},"width":1462,"height":66,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-52.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":22.98},"width":744.72,"height":57.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-53.png","element":"img","alt":"¯Cπ∗θ (x) = Eπ∗θx� �τ0d−1i=0 c(X(i), π∗θ(X(i)))�,","inline":true,"padRight":true},{"text":"and expectation is over trajectories of Markov chain ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"X ","element":"span"},{"text":"with transition kernel ","element":"span"},{"style":{"height":22.3},"width":65.76,"height":55.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-54.png","element":"img","alt":" P π∗θθ","inline":true,"padRight":true},{"text":"starting in state ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"text":". 
In Appendix ","element":"span"},{"href":"#id-56","text":"A.3, ","element":"a"},{"text":"we present related definitions and show that from Assumptions ","element":"span"},{"href":"#id-10","text":"3-","element":"a"},{"href":"#id-11","text":"4, ","element":"a"},{"text":"the requirements for the existence and finiteness of the solutions to the Poisson equation are satisfied. Finally, we assume ","element":"span"},{"style":{"height":16.8},"width":203,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-55.png","element":"img","alt":" supθ∈Θ J(θ)","inline":true,"padRight":true},{"text":"is finite, which typically holds as a result of the boundedness assumptions stated in Assumptions ","element":"span"},{"href":"#id-10","text":"3 ","element":"a"},{"text":"or ","element":"span"},{"href":"#id-11","text":"4, ","element":"a"},{"text":"along with Assumption ","element":"span"},{"href":"#id-54","text":"1.","element":"a"}],[{"id":"id-65","style":{"fontWeight":"bold"},"text":"Assumption 5. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"We assume that ","element":"span"},{"style":{"height":16.69},"width":449.8,"height":41.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/4-56.png","element":"img","alt":" J∗ := supθ∈Θ J(θ) < +∞.","inline":true}],[{"id":"id-57","style":{"width":"100%"},"width":1592,"height":928,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-0.png","element":"img"}]]},{"heading":"3 Thompson sampling based learning algorithm","paragraphs":[[{"text":"We will use the Thompson-sampling based algorithm from [","element":"span"},{"href":"#id-22","referenceIndex":41,"text":"41","element":"a"},{"text":"] to learn the unknown parameter ","element":"span"},{"style":{"height":13.12},"width":121.68,"height":32.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-1.png","element":"img","alt":"θ∗ ∈ Θ","inline":true,"padRight":true},{"text":"and the corresponding policy, ","element":"span"},{"style":{"height":15.6},"width":52.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-2.png","element":"img","alt":" π∗θ∗","inline":true},{"text":", but suitably modify it for our countable state-space setting. ","element":"span"},{"text":"Consider the prior distribution ","element":"span"},{"style":{"height":9.8},"width":110,"height":24.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-3.png","element":"img","alt":" ν0 = ν","inline":true,"padRight":true},{"text":"defined on ","element":"span"},{"style":{"height":11.4},"width":27,"height":28.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-4.png","element":"img","alt":" Θ","inline":true,"padRight":true},{"text":"from which ","element":"span"},{"style":{"height":12.6},"width":36,"height":31.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-5.png","element":"img","alt":" θ∗ ","inline":true,"padRight":true},{"text":"is sampled. 
At each time ","element":"span"},{"style":{"height":13.2},"width":101.88,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-6.png","element":"img","alt":" t ∈ N,","inline":true,"padRight":true},{"text":"the posterior distribution ","element":"span"},{"style":{"height":9.8},"width":28.5,"height":24.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-7.png","element":"img","alt":" νt","inline":true,"padRight":true},{"text":"is updated according to Bayes’ rule as","element":"span"}],[{"style":{"width":"79%"},"width":1254,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-8.png","element":"img"}],[{"text":"and the posterior estimate ","element":"span"},{"style":{"height":14.78},"width":71.24,"height":36.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-9.png","element":"img","alt":" θt+1","inline":true},{"text":", if generated, is drawn from the posterior distribution ","element":"span"},{"style":{"height":11},"width":68,"height":27.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-10.png","element":"img","alt":" νt+1","inline":true},{"text":". The Thompson sampling with dynamically-sized episodes (TSDE) algorithm is presented in Algorithm ","element":"span"},{"href":"#id-57","text":"1. 
","element":"a"},{"text":"The TSDE algorithm operates in episodes: at the beginning of each episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", parameter ","element":"span"},{"style":{"height":13.8},"width":33.5,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-11.png","element":"img","alt":" θk","inline":true,"padRight":true},{"text":"is sampled from the posterior distribution ","element":"span"},{"style":{"height":10.9},"width":46.76,"height":27.24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-12.png","element":"img","alt":" νtk","inline":true,"padRight":true},{"text":"and during episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", actions are generated from the stationary policy according to ","element":"span"},{"style":{"height":17.18},"width":295.04,"height":42.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-13.png","element":"img","alt":" θk, i.e., π∗θk. Let tk","inline":true,"padRight":true},{"text":"be the time the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-th episode begins. 
Define ","element":"span"},{"style":{"height":17.73},"width":72.48,"height":44.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-14.png","element":"img","alt":" ˜tk+1","inline":true,"padRight":true},{"text":"as the first time after ","element":"span"},{"style":{"height":12.6},"width":30,"height":31.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-15.png","element":"img","alt":"tk","inline":true,"padRight":true},{"text":"that the conditions of Line ","element":"span"},{"href":"#id-57","text":"6 ","element":"a"},{"text":"of Algorithm ","element":"span"},{"href":"#id-57","text":"1 ","element":"a"},{"text":"are triggered and ","element":"span"},{"style":{"height":13.98},"width":72.48,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-16.png","element":"img","alt":" tk+1","inline":true,"padRight":true},{"text":"as the first time at or after ","element":"span"},{"style":{"height":17.73},"width":72.48,"height":44.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-17.png","element":"img","alt":"˜tk+1","inline":true,"padRight":true},{"text":"where state ","element":"span"},{"style":{"height":14},"width":35,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-18.png","element":"img","alt":" 0d ","inline":true,"padRight":true},{"text":"is visited; for the last episode started before or at ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":", we ensure that ","element":"span"},{"style":{"height":16.13},"width":274.12,"height":40.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-19.png","element":"img","alt":" tk and ˜tk are less","inline":true,"padRight":true},{"text":"than or equal to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"+ 1","element":"span"},{"text":". 
Explicitly, ","element":"span"},{"style":{"height":13.18},"width":108.08,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-20.png","element":"img","alt":" t1 = 1","inline":true,"padRight":true},{"text":"and for ","element":"span"},{"style":{"height":17.74},"width":811.84,"height":44.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-21.png","element":"img","alt":" k > 1, tk = min{t ≥ ˜tk : X (t) = 0d or t > T}.","inline":true,"padRight":true},{"text":"Let ","element":"span"},{"style":{"height":14.78},"width":253.76,"height":36.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-22.png","element":"img","alt":" Tk = tk+1 − tk","inline":true,"padRight":true},{"text":"be the length of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-th episode and set ","element":"span"},{"style":{"height":18.82},"width":253.72,"height":47.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-23.png","element":"img","alt":"˜Tk = ˜tk+1 − tk","inline":true,"padRight":true},{"text":"with the convention ","element":"span"},{"style":{"height":17.22},"width":114.32,"height":43.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-24.png","element":"img","alt":"˜T0 = 1","inline":true},{"text":". 
For any state-action pair ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", a","element":"span"},{"text":")","element":"span"},{"text":", we define ","element":"span"},{"style":{"height":16},"width":450.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-25.png","element":"img","alt":" N1(x, a) = 0 and for t > 1,","inline":true}],[{"style":{"width":"76%"},"width":1218,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-26.png","element":"img"}],[{"text":"Notice that for all state-action pairs ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", a","element":"span"},{"text":") ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":17.74},"width":268.04,"height":44.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-27.png","element":"img","alt":"˜tk+1 ≤ t ≤ tk+1","inline":true},{"text":", we have ","element":"span"},{"style":{"height":19.36},"width":389.56,"height":48.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-28.png","element":"img","alt":" Nt(x, a) = N˜tk+1(x, a)","inline":true},{"text":". We denote ","element":"span"},{"style":{"height":13.2},"width":56.84,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-29.png","element":"img","alt":" KT","inline":true,"padRight":true},{"text":"as the number of episodes started by or at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":", or ","element":"span"},{"style":{"height":16},"width":397.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-30.png","element":"img","alt":" KT = max{k : tk ≤ T}","inline":true},{"text":". 
The length of episode ","element":"span"},{"style":{"height":13.2},"width":140.68,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-31.png","element":"img","alt":" k < KT","inline":true,"padRight":true},{"text":"is not fixed and is determined according to two stopping criteria: (1) ","element":"span"},{"style":{"height":18.91},"width":690.96,"height":47.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-32.png","element":"img","alt":"t > tk + ˜Tk−1, (2) Nt(x, a) > 2Ntk(x, a)","inline":true,"padRight":true},{"text":"for some state-action pair ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", a","element":"span"},{"text":")","element":"span"},{"text":". After either criterion is met, the system will still follow policy ","element":"span"},{"style":{"height":17.18},"width":52.88,"height":42.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-33.png","element":"img","alt":" π∗θk ","inline":true,"padRight":true},{"text":"until the first time at which state ","element":"span"},{"style":{"height":13.8},"width":35,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-34.png","element":"img","alt":" 0d","inline":true,"padRight":true},{"text":"is visited; see Line ","element":"span"},{"href":"#id-57","text":"14 ","element":"a"},{"text":"and Figure ","element":"span"},{"href":"#id-58","text":"1. ","element":"a"},{"text":"We use this settling period to ","element":"span"},{"style":{"height":13.8},"width":35,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-35.png","element":"img","alt":" 0d","inline":true,"padRight":true},{"text":"because the system state can be arbitrary when the first stopping criterion is met. 
As the countable state-space setting precludes a simple union-bound argument to overcome this uncertainty (as in the literature for finite state settings), we let the system reach the special state ","element":"span"},{"style":{"height":13.38},"width":36.92,"height":33.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-36.png","element":"img","alt":" 0d","inline":true},{"text":". Another (essentially equivalent) option is to wait until the state hits the finite set ","element":"span"},{"style":{"height":14.72},"width":47.32,"height":36.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-37.png","element":"img","alt":" Cg∗","inline":true,"padRight":true},{"text":"or ","element":"span"},{"style":{"height":14.72},"width":48.36,"height":36.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-38.png","element":"img","alt":" Cp∗","inline":true,"padRight":true},{"text":"and then use a union bound argument for all states in either set. For analytical ","element":"span"},{"text":"convenience, we only use the state samples observed before arrival ","element":"span"},{"style":{"height":17.74},"width":72.44,"height":44.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-39.png","element":"img","alt":"˜tk+1","inline":true,"padRight":true},{"text":"to update the posterior distribution. 
The posterior update is halted during the settling period to ","element":"span"},{"style":{"height":14},"width":35,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/5-40.png","element":"img","alt":" 0d ","inline":true,"padRight":true},{"text":"as we have no control on the states visited during it, despite it being finite in duration (by our assumptions).","element":"span"}],[{"id":"id-58","style":{"width":"74%"},"width":1176,"height":222,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-0.png","element":"img"}],[{"text":"Figure 1: MDP evolution in episode ","element":"figcaption","subtype":"caption"},{"style":{"height":13.6},"width":139.5,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-1.png","element":"img","alt":" k < KT .","inline":true}]]},{"heading":"4 Regret analysis of Algorithm 1","paragraphs":[[{"text":"The performance of any learning policy ","element":"span"},{"style":{"height":9.2},"width":42,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-2.png","element":"img","alt":" πL","inline":true,"padRight":true},{"text":"is evaluated using the metric of expected regret compared to the optimal expected average cost of true parameter ","element":"span"},{"style":{"height":16.34},"width":292,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-3.png","element":"img","alt":" θ∗, namely, J(θ∗)","inline":true},{"text":". In this section, we evaluate the performance of Algorithm ","element":"span"},{"href":"#id-57","text":"1 ","element":"a"},{"text":"and derive an upper bound for ","element":"span"},{"style":{"height":16},"width":227.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-4.png","element":"img","alt":" R(T, πT SDE)","inline":true},{"text":", its expected regret up to time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":". 
In Section ","element":"span"},{"text":"2, ","element":"span"},{"text":"we argued that at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"in episode ","element":"span"},{"style":{"height":14.8},"width":272.4,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-5.png","element":"img","alt":" k (tk ≤ t < tk+1","inline":true},{"text":"), there exists a constant ","element":"span"},{"style":{"height":16},"width":95.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-6.png","element":"img","alt":"J(θk)","inline":true,"padRight":true},{"text":"and a unique function ","element":"span"},{"style":{"height":19.2},"width":738.84,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-7.png","element":"img","alt":" v(·; θk) : X → R such that v�0d; θk�= 0 and","inline":true}],[{"id":"id-59","style":{"width":"94%"},"width":1494,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-8.png","element":"img"}],[{"text":"in which ","element":"span"},{"style":{"height":17.2},"width":52.92,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-9.png","element":"img","alt":" π∗θk ","inline":true,"padRight":true},{"text":"is the optimal or best-in-class policy (depending on the context) according to parameter ","element":"span"},{"style":{"height":14},"width":33.5,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-10.png","element":"img","alt":"θk","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":95.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-11.png","element":"img","alt":" J(θk)","inline":true,"padRight":true},{"text":"is the average cost for the Markov process obtained from MDP 
","element":"span"},{"style":{"height":16.08},"width":228.12,"height":40.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-12.png","element":"img","alt":" (X, A, c, Pθk)","inline":true,"padRight":true},{"text":"by following ","element":"span"},{"style":{"height":17.18},"width":52.88,"height":42.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-13.png","element":"img","alt":" π∗θk","inline":true},{"text":". We derive a bound for the expected regret ","element":"span"},{"style":{"height":16},"width":227.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-14.png","element":"img","alt":" R(T, πT SDE)","inline":true,"padRight":true},{"text":"following the proof steps of ","element":"span"},{"text":"[","element":"span"},{"href":"#id-22","referenceIndex":41,"text":"41","element":"a"},{"text":"] while extending it to the countable state-space setting of our problem. Using ","element":"span"},{"href":"#id-59","text":"(8)","element":"a"},{"text":", the regret is decomposed into three terms and each term is bounded separately:","element":"span"}],[{"style":{"width":"98%"},"width":1556,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-15.png","element":"img"}],[{"style":{"width":"94%"},"width":1500,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-16.png","element":"img"}],[{"id":"id-68","style":{"width":"89%"},"width":1420,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-17.png","element":"img"}],[{"id":"id-72","style":{"width":"89%"},"width":1420,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-18.png","element":"img"}],[{"text":"Before bounding the above regret terms, we address the complexities arising from the countable state-space setting. 
Firstly, we need to study the maximum state (with respect to the ","element":"span"},{"style":{"height":14},"width":46,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-19.png","element":"img","alt":" ℓ∞","inline":true},{"text":"-norm) visited up to time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"in the MDP ","element":"span"},{"style":{"height":16},"width":230.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-20.png","element":"img","alt":" (X, A, c, Pθ∗)","inline":true,"padRight":true},{"text":"following Algorithm ","element":"span"},{"href":"#id-57","text":"1; ","element":"a"},{"text":"we denote this maximum state by ","element":"span"},{"style":{"height":17.9},"width":72.2,"height":44.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-21.png","element":"img","alt":"M Tθ∗","inline":true},{"text":". In Appendix ","element":"span"},{"text":"C, ","element":"span"},{"text":"we derive upper bounds on the moments of hitting times of state ","element":"span"},{"style":{"height":13.39},"width":36.92,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-22.png","element":"img","alt":" 0d","inline":true,"padRight":true},{"text":"and utilize ","element":"span"},{"text":"this to bound the moments of random variable ","element":"span"},{"style":{"height":17.9},"width":72.16,"height":44.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-23.png","element":"img","alt":" M Tθ∗","inline":true},{"text":", which then lets us study the number of episodes ","element":"span"},{"style":{"height":14.6},"width":225,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-24.png","element":"img","alt":"KT by time T","inline":true},{"text":". 
Another challenge in analyzing the regret is that the relative value function ","element":"span"},{"style":{"height":16},"width":152.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-25.png","element":"img","alt":" v(x; θ) is","inline":true,"padRight":true},{"text":"unlikely to be bounded in the countable state-space setting. Hence, in ","element":"span"},{"href":"#id-60","text":"(13) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-61","text":"(14)","element":"a"},{"text":", we find bounds for the relative value function in terms of hitting time ","element":"span"},{"style":{"height":10.4},"width":46.5,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-26.png","element":"img","alt":" τ0d","inline":true,"padRight":true},{"text":"from the initial state ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"text":". Based on these results, we provide an upper bound for the regret of Algorithm ","element":"span"},{"href":"#id-57","text":"1 ","element":"a"},{"text":"in Theorem ","element":"span"},{"href":"#id-7","text":"1.","element":"a"}],[{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Maximum state norm under polynomial and geometric ergodicity. ","element":"span"},{"text":"Here we state the results that characterize the maximum ","element":"span"},{"style":{"height":13.8},"width":40.5,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-27.png","element":"img","alt":" l∞","inline":true},{"text":"-norm of the state vector achieved up until and including time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":", and the resulting bounds on the number of episodes executed until time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":". 
Owing to space constraints, the details (including formal statements) are presented in Appendix ","element":"span"},{"text":"B. ","element":"span"},{"text":"The results are listed below: (i) In Lemma ","element":"span"},{"href":"#id-62","text":"6, ","element":"a"},{"text":"we bound the moments of the maximum length of recurrence times of state ","element":"span"},{"style":{"height":16.99},"width":143.24,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-28.png","element":"img","alt":" 0d, using","inline":true,"padRight":true},{"text":"the ergodicity assumptions ","element":"span"},{"href":"#id-10","text":"3 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-11","text":"4. ","element":"a"},{"text":"This, along with the skip-free property, allows us to prove that the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"-th moment of ","element":"span"},{"style":{"height":22.46},"width":392.68,"height":56.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-29.png","element":"img","alt":" max1≤i≤T τ (i)0d and M Tθ∗","inline":true,"padRight":true},{"text":"are both of order ","element":"span"},{"style":{"height":16},"width":177.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-30.png","element":"img","alt":" O(logp T).","inline":true,"padRight":true},{"text":"(ii) In Lemma ","element":"span"},{"href":"#id-63","text":"7, ","element":"a"},{"text":"we find an upper bound for the number of episodes in which the second stopping criterion is met or there exists a state-action pair for which ","element":"span"},{"style":{"height":16},"width":142.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-31.png","element":"img","alt":" Nt(x, a)","inline":true,"padRight":true},{"text":"has more than doubled. 
(iii) In Lemma ","element":"span"},{"href":"#id-64","text":"8, ","element":"a"},{"text":"we bound the total number of episodes ","element":"span"},{"style":{"height":14.6},"width":227,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-32.png","element":"img","alt":" KT by time T","inline":true,"padRight":true},{"text":"by bounding the number of episodes triggered by the first stopping criterion, using the fact that in such episodes, ","element":"span"},{"style":{"height":17.22},"width":257.8,"height":43.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/6-33.png","element":"img","alt":"˜Tk = ˜Tk−1 + 1.","inline":true}],[{"text":"Moreover, to account for the settling time of each episode, we use geometric ergodicity and Lemma ","element":"span"},{"href":"#id-62","text":"6. ","element":"a"},{"text":"It follows that the expected value of the number of episodes ","element":"span"},{"style":{"height":13.2},"width":55,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-0.png","element":"img","alt":" KT","inline":true,"padRight":true},{"text":"is of the order ","element":"span"},{"style":{"height":19.6},"width":236.64,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-1.png","element":"img","alt":"˜O(hd�|A|T).","inline":true}],[{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Regret analysis. 
","element":"span"},{"text":"Next, we bound regret terms ","element":"span"},{"style":{"height":13.6},"width":114.5,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-2.png","element":"img","alt":" R0, R1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.4},"width":43.5,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-3.png","element":"img","alt":" R2","inline":true,"padRight":true},{"text":"using the approach of [","element":"span"},{"href":"#id-22","referenceIndex":41,"text":"41","element":"a"},{"text":"] along with additional arguments to extend their result to a countably infinite state-space. We consider the relative value function ","element":"span"},{"style":{"height":16},"width":116.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-4.png","element":"img","alt":" v(x; θ)","inline":true,"padRight":true},{"text":"of policy ","element":"span"},{"style":{"height":15.8},"width":37,"height":39.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-5.png","element":"img","alt":" π∗θ","inline":true,"padRight":true},{"text":"introduced for the optimal policy in ACOE or for the ","element":"span"},{"text":"best-in-class policy in the Poisson equation. 
In either of these cases, policy ","element":"span"},{"style":{"height":15.8},"width":37,"height":39.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-6.png","element":"img","alt":" π∗θ","inline":true,"padRight":true},{"text":"satisfies ","element":"span"},{"href":"#id-53","text":"(5)","element":"a"},{"text":", which ","element":"span"},{"text":"is the corresponding Poisson equation with forcing function ","element":"span"},{"style":{"height":16.51},"width":192.44,"height":41.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-7.png","element":"img","alt":" c(x, π∗θ(x))","inline":true,"padRight":true},{"text":"in a Markov chain with ","element":"span"},{"text":"transition matrix ","element":"span"},{"href":"#id-53","style":{"height":22.3},"width":181.04,"height":55.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-8.png","element":"img","alt":" P π∗θθ . In (6)","inline":true},{"text":", we presented the solution ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"J, v","element":"span"},{"text":") ","element":"span"},{"text":"to the Poisson equation, which yields ","element":"span"},{"text":"the following upper bound for the relative value function, as argued in Appendix ","element":"span"},{"href":"#id-56","text":"A.3:","element":"a"}],[{"id":"id-60","style":{"width":"77%"},"width":1226,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-9.png","element":"img"}],[{"text":"We can similarly lower bound the relative value function using Assumption ","element":"span"},{"href":"#id-65","text":"5 ","element":"a"},{"text":"as","element":"span"}],[{"id":"id-61","style":{"width":"72%"},"width":1146,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-10.png","element":"img"}],[{"text":"From Assumption ","element":"span"},{"href":"#id-10","text":"3, ","element":"a"},{"text":"all moments of 
","element":"span"},{"style":{"height":10.03},"width":48.28,"height":25.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-11.png","element":"img","alt":" τ0d","inline":true,"padRight":true},{"text":"and, thus, the derived bounds are finite. Also, in Lemma ","element":"span"},{"href":"#id-49","text":"10 ","element":"a"},{"text":"we bound the moments of ","element":"span"},{"style":{"height":10.2},"width":46.5,"height":25.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-12.png","element":"img","alt":" τ0d","inline":true,"padRight":true},{"text":"of order ","element":"span"},{"style":{"height":13.2},"width":159.08,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-13.png","element":"img","alt":" i ≤ r + 1","inline":true,"padRight":true},{"text":"using the polynomial Lyapunov function ","element":"span"},{"style":{"height":19.66},"width":92.64,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-14.png","element":"img","alt":" V pθ1,θ2","inline":true},{"text":", ","element":"span"},{"text":"which is then used to bound the expected regret. We next bound the first regret term ","element":"span"},{"style":{"height":13.6},"width":43.5,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-15.png","element":"img","alt":" R0","inline":true,"padRight":true},{"text":"from the first stopping criterion in terms of the number of episodes ","element":"span"},{"style":{"height":13.4},"width":54.5,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-16.png","element":"img","alt":" KT","inline":true,"padRight":true},{"text":"and the settling time of each episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":".","element":"span"}],[{"id":"id-66","style":{"fontWeight":"bold"},"text":"Lemma 1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"The first regret term ","element":"span"},{"style":{"height":22.46},"width":807.72,"height":56.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-17.png","element":"img","alt":" R0 satisfies R0 ≤ J∗ E[KT (max1≤i≤T τ (i)0d + 1)].","inline":true}],[{"text":"Proof of Lemma ","element":"span"},{"href":"#id-66","text":"1 ","element":"a"},{"text":"is given in Appendix ","element":"span"},{"href":"#id-67","text":"B.4. ","element":"a"},{"text":"From Lemma ","element":"span"},{"href":"#id-62","text":"6, ","element":"a"},{"text":"all moments of ","element":"span"},{"style":{"height":22.46},"width":240.56,"height":56.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-18.png","element":"img","alt":" max1≤i≤T τ (i)0d","inline":true,"padRight":true},{"text":"are ","element":"span"},{"text":"bounded by a polylogarithmic function. Furthermore, as a result of Lemma ","element":"span"},{"href":"#id-64","text":"8, ","element":"a"},{"text":"the expected value of the number of episodes ","element":"span"},{"style":{"height":13.4},"width":55,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-19.png","element":"img","alt":" KT","inline":true,"padRight":true},{"text":"is of the order ","element":"span"},{"style":{"height":19.6},"width":227.2,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-20.png","element":"img","alt":"˜O(hd�|A|T)","inline":true},{"text":", which leads to a ","element":"span"},{"style":{"height":19.6},"width":227.16,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-21.png","element":"img","alt":"˜O(hd�|A|T)","inline":true,"padRight":true},{"text":"regret term ","element":"span"},{"style":{"height":13.6},"width":54,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-22.png","element":"img","alt":" R0.","inline":true,"padRight":true},{"text":"Next, an upper 
bound on ","element":"span"},{"style":{"height":13.2},"width":42.5,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-23.png","element":"img","alt":" R1","inline":true,"padRight":true},{"text":"defined in ","element":"span"},{"href":"#id-68","text":"(11) ","element":"a"},{"text":"is derived. In the proof of Lemma ","element":"span"},{"href":"#id-69","text":"2 ","element":"a"},{"text":"we argue that as the relative value function is equal to ","element":"span"},{"text":"0 ","element":"span"},{"text":"at all time instances ","element":"span"},{"style":{"height":13.2},"width":230.44,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-24.png","element":"img","alt":" tk for k ≤ KT","inline":true,"padRight":true},{"text":", the only term that contributes to the regret is the value function at the end of time horizon ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":". We use the lower bound derived in ","element":"span"},{"href":"#id-61","text":"(14) ","element":"a"},{"text":"to show that the second regret term ","element":"span"},{"style":{"height":18.83},"width":177.56,"height":47.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-25.png","element":"img","alt":" R1 is ˜O(1)","inline":true},{"text":"; the proof is given in Appendix ","element":"span"},{"href":"#id-70","text":"B.5.","element":"a"}],[{"id":"id-69","style":{"fontWeight":"bold"},"text":"Lemma 2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"The second regret term ","element":"span"},{"style":{"height":18.72},"width":1044.52,"height":46.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-26.png","element":"img","alt":" R1 satisfies R1 ≤ c2 E[(M Tθ∗)rp∗]+c3, where c2 = J∗2rp∗sp∗(βp∗)−1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":21.92},"width":658.08,"height":54.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-27.png","element":"img","alt":" c3 = J∗(βp∗)−1�sp∗ (2h)rp∗ + bp∗(K∗)−1�.","inline":true}],[{"text":"From Lemma ","element":"span"},{"href":"#id-62","text":"6, ","element":"a"},{"style":{"height":18.72},"width":189.88,"height":46.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-28.png","element":"img","alt":" E[(M Tθ∗)rp∗]","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":19.54},"width":184.32,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-29.png","element":"img","alt":" O(logrp∗ T)","inline":true},{"text":"; hence, ","element":"span"},{"style":{"height":13.4},"width":42,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-30.png","element":"img","alt":" R1","inline":true,"padRight":true},{"text":"is upper bounded by a polylogarithmic function of the order ","element":"span"},{"style":{"height":14.6},"width":34,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-31.png","element":"img","alt":" rp∗","inline":true},{"text":". 
Finally, in Lemma ","element":"span"},{"href":"#id-71","text":"3, ","element":"a"},{"text":"we derive an upper bound for the third regret term ","element":"span"},{"style":{"height":13.4},"width":43,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-32.png","element":"img","alt":"R2","inline":true,"padRight":true},{"text":"defined in ","element":"span"},{"href":"#id-72","text":"(12) ","element":"a"},{"text":"using the bound derived for the relative value function in ","element":"span"},{"href":"#id-60","text":"(13)","element":"a"},{"text":". To bound ","element":"span"},{"style":{"height":13.4},"width":43,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-33.png","element":"img","alt":" R2","inline":true},{"text":", we characterize it in terms of the difference between the empirical and true unknown transition kernel and following the concentration method used in [","element":"span"},{"href":"#id-73","referenceIndex":55,"text":"55","element":"a"},{"text":", ","element":"span"},{"href":"#id-74","referenceIndex":8,"text":"8","element":"a"},{"text":", ","element":"span"},{"href":"#id-22","referenceIndex":41,"text":"41","element":"a"},{"text":", ","element":"span"},{"href":"#id-75","referenceIndex":6,"text":"6","element":"a"},{"text":"], we argue that with high probability the total variation distance between the two distributions is small; for proof, see Appendix ","element":"span"},{"href":"#id-76","text":"B.6.","element":"a"}],[{"id":"id-71","style":{"fontWeight":"bold"},"text":"Lemma 3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For a problem-dependent constant ","element":"span"},{"style":{"height":11.8},"width":44.5,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-34.png","element":"img","alt":" cp3","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and polynomial ","element":"span"},{"style":{"height":18.98},"width":425.8,"height":47.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-35.png","element":"img","alt":" Q(T) = cp3(Th)r+rp∗/48","inline":true},{"style":{"fontStyle":"italic"},"text":", the third regret term ","element":"span"},{"style":{"height":14.4},"width":182,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-36.png","element":"img","alt":" R2 satisfies","inline":true}],[{"style":{"width":"99%"},"width":1570,"height":80,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-37.png","element":"img"}],[{"text":"The above lemma yields a ","element":"span"},{"style":{"height":19.6},"width":460.8,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-38.png","element":"img","alt":"˜O(KrdJ∗hd+2r+rp∗√(|A|T))","inline":true,"padRight":true},{"text":"regret term when combined with Lemma ","element":"span"},{"href":"#id-62","text":"6, ","element":"a"},{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"is the skip-free parameter defined in Assumption ","element":"span"},{"href":"#id-77","text":"2, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"is the dimension of the state-space, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"are the cost function parameters defined in Assumption ","element":"span"},{"href":"#id-54","text":"1, 
","element":"a"},{"style":{"height":11.4},"width":37,"height":28.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-39.png","element":"img","alt":" J∗","inline":true,"padRight":true},{"text":"is the supremum of the optimal cost, ","element":"span"},{"style":{"height":14.6},"width":33.5,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-40.png","element":"img","alt":" rp∗","inline":true,"padRight":true},{"text":"is defined in Assumption ","element":"span"},{"href":"#id-11","text":"4, ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":15.2},"width":28,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-41.png","element":"img","alt":" ˜O","inline":true,"padRight":true},{"text":"hides logarithmic factors in problem ","element":"span"},{"text":"parameters, one of which is ","element":"span"},{"style":{"height":19.54},"width":267.68,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-42.png","element":"img","alt":" logd+r+rp∗+2(T)","inline":true},{"text":". For simplicity, we have not included the parameters related to the Lyapunov functions in the regret bound. 
Finally, from Lemmas ","element":"span"},{"href":"#id-66","text":"1, ","element":"a"},{"href":"#id-69","text":"2, ","element":"a"},{"href":"#id-71","text":"3, ","element":"a"},{"text":"along with the Cauchy-Schwarz inequality, we conclude that the regret of Algorithm ","element":"span"},{"href":"#id-57","text":"1 ","element":"a"},{"style":{"height":16},"width":578.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-43.png","element":"img","alt":" R(T, πT SDE)(= R0 + R1 + R2) is","inline":true},{"style":{"height":19.6},"width":460.8,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-44.png","element":"img","alt":"˜O(KrdJ∗hd+2r+rp∗√(|A|T))","inline":true},{"text":"; for brevity, we state the regret as being of the order ","element":"span"},{"style":{"height":19.8},"width":253.5,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-45.png","element":"img","alt":"˜O(dhd√(|A|T)).","inline":true}],[{"id":"id-7","style":{"fontWeight":"bold"},"text":"Theorem 1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-54","style":{"fontStyle":"italic"},"text":"1-","element":"a"},{"href":"#id-65","style":{"fontStyle":"italic"},"text":"5, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"the regret of Algorithm ","element":"span"},{"href":"#id-57","style":{"fontStyle":"italic"},"text":"1, ","element":"a"},{"style":{"height":19.6},"width":540.52,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-46.png","element":"img","alt":" R(T, πT SDE), is ˜O(dhd�|A|T).","inline":true}],[{"text":"Theorem ","element":"span"},{"href":"#id-7","text":"1 ","element":"a"},{"text":"can be extended to the problem of finding the best policy within a sub-class of policies in set ","element":"span"},{"style":{"height":10.8},"width":28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-47.png","element":"img","alt":" Π","inline":true},{"text":", which may or may not contain the optimal policy. In Section ","element":"span"},{"text":"2, ","element":"span"},{"text":"we stated that Assumptions ","element":"span"},{"href":"#id-10","text":"3 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-11","text":"4 ","element":"a"},{"text":"hold for policies in ","element":"span"},{"style":{"height":10.8},"width":28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-48.png","element":"img","alt":" Π","inline":true,"padRight":true},{"text":"and we used this to argue that the Poisson equation has a solution given in ","element":"span"},{"href":"#id-53","text":"(6)","element":"a"},{"text":". 
As a result, repeating the same arguments as in Theorem ","element":"span"},{"href":"#id-7","text":"1 ","element":"a"},{"text":"with the modification that ","element":"span"},{"style":{"height":15.49},"width":135.84,"height":38.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-49.png","element":"img","alt":" π∗θ is the","inline":true,"padRight":true},{"id":"id-8","text":"best in-class policy of the MDP governed by parameter ","element":"span"},{"style":{"height":11.6},"width":17,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/7-50.png","element":"img","alt":" θ","inline":true},{"text":", yields the following corollary.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Corollary 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-54","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"through ","element":"span"},{"href":"#id-65","style":{"fontStyle":"italic"},"text":"5, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"the regret of Algorithm ","element":"span"},{"href":"#id-57","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"when using the best in-class policy is ","element":"span"},{"style":{"height":19.6},"width":257.44,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-0.png","element":"img","alt":"˜O(dhd�|A|T).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Requirement of an optimal policy oracle. 
","element":"span"},{"text":"To implement our algorithm, we need to find the optimal policy for each model sampled by the algorithm—optimal policy for Theorem ","element":"span"},{"href":"#id-7","text":"1 ","element":"a"},{"text":"and optimal policy within policy class ","element":"span"},{"style":{"height":11},"width":28,"height":27.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-1.png","element":"img","alt":" Π","inline":true,"padRight":true},{"text":"for Corollary ","element":"span"},{"href":"#id-8","text":"1. ","element":"a"},{"text":"In the finite state-space setting, [","element":"span"},{"href":"#id-22","referenceIndex":41,"text":"41","element":"a"},{"text":"] provides a schedule of ","element":"span"},{"style":{"height":7.2},"width":13.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-2.png","element":"img","alt":" ϵ","inline":true,"padRight":true},{"text":"values and selects ","element":"span"},{"style":{"height":7.2},"width":13.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-3.png","element":"img","alt":" ϵ","inline":true},{"text":"-optimal policies to obtain ","element":"span"},{"style":{"height":18.83},"width":125.04,"height":47.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-4.png","element":"img","alt":"˜O(√T)","inline":true,"padRight":true},{"text":"regret guarantees. The issue with extending the analysis of [","element":"span"},{"href":"#id-22","referenceIndex":41,"text":"41","element":"a"},{"text":"] to the countable state-space setting is that we need to ensure (uniform) ergodicity for the chosen ","element":"span"},{"style":{"height":7.2},"width":13.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-5.png","element":"img","alt":" ϵ","inline":true},{"text":"-optimal policies. 
In other words, we must verify ergodicity assumptions for a potentially large set of close-to-optimal policies whose structure is undetermined. Another issue is that, to the best of our knowledge, there is no general structural characterization of all ","element":"span"},{"style":{"height":7.2},"width":13.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-6.png","element":"img","alt":" ϵ","inline":true},{"text":"-optimal stationary policies for countable state-space MDPs, or even a characterization of the policy within this set that is selected by any computational procedure in the literature; current results only characterize the stationary optimal policy. In the absence of such results, stability assumptions with the same uniformity across models as in this paper will be needed, which are likely too strong to be useful. However, if we can verify the stability requirements of Assumptions ","element":"span"},{"href":"#id-10","text":"3 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-11","text":"4 ","element":"a"},{"text":"for a subset of policies, the optimal policy oracle is not needed; instead, by choosing approximately optimal policies within this subset, we can follow the same proof steps as [","element":"span"},{"href":"#id-22","referenceIndex":41,"text":"41","element":"a"},{"text":"] to guarantee regret performance similar to Corollary ","element":"span"},{"href":"#id-8","text":"1 ","element":"a"},{"text":"(without knowledge of model parameters). 
Thus, in Theorem ","element":"span"},{"href":"#id-9","text":"2 ","element":"a"},{"text":"we extend the previous regret guarantees to the algorithm employing an ","element":"span"},{"style":{"height":7.2},"width":13.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-7.png","element":"img","alt":" ϵ","inline":true},{"text":"-optimal policy; the proof is given in Appendix ","element":"span"},{"href":"#id-78","text":"B.8.","element":"a"}],[{"id":"id-9","style":{"fontWeight":"bold"},"text":"Theorem 2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Consider a non-negative sequence ","element":"span"},{"style":{"height":16.51},"width":133.72,"height":41.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-8.png","element":"img","alt":" {ϵk}∞k=1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for every ","element":"span"},{"style":{"height":13.6},"width":150,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-9.png","element":"img","alt":" k ∈ N, ϵk","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is bounded ","element":"span"},{"style":{"fontStyle":"italic"},"text":"above by ","element":"span"},{"style":{"height":20.98},"width":224.64,"height":52.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-10.png","element":"img","alt":"1/(k+1) and an ϵk","inline":true},{"style":{"fontStyle":"italic"},"text":"-optimal policy satisfying Assumptions ","element":"span"},{"href":"#id-10","style":{"fontStyle":"italic"},"text":"3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-11","style":{"fontStyle":"italic"},"text":"4 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"is given. 
The regret incurred ","element":"span"},{"style":{"fontStyle":"italic"},"text":"by Algorithm ","element":"span"},{"href":"#id-57","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"while using the ","element":"span"},{"style":{"height":9.6},"width":30.5,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-11.png","element":"img","alt":" ϵk","inline":true},{"style":{"fontStyle":"italic"},"text":"-optimal policy during any episode ","element":"span"},{"style":{"height":19.6},"width":325.92,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-12.png","element":"img","alt":" k is ˜O(dhd√(|A|T)).","inline":true}]]},{"heading":"5 Evaluation: Application of Algorithm 1 to queueing models","paragraphs":[[{"text":"Next, we present an evaluation of our algorithm. We study the two queueing models shown in Figure ","element":"span"},{"href":"#id-79","text":"2, ","element":"a"},{"text":"each with Poisson arrivals at rate ","element":"span"},{"style":{"height":11.4},"width":20,"height":28.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-13.png","element":"img","alt":" λ","inline":true},{"text":", and two heterogeneous servers with exponentially distributed service times and an unknown service rate vector ","element":"span"},{"style":{"height":16.34},"width":226.72,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-14.png","element":"img","alt":" θ∗ = (θ∗1, θ∗2)","inline":true},{"text":". 
Vector ","element":"span"},{"style":{"height":12.6},"width":36,"height":31.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-15.png","element":"img","alt":" θ∗","inline":true,"padRight":true},{"text":"is sampled ","element":"span"},{"text":"from the prior distribution ","element":"span"},{"style":{"height":7.2},"width":19,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-16.png","element":"img","alt":" ν","inline":true,"padRight":true},{"text":"defined on the space ","element":"span"},{"style":{"height":15},"width":167.5,"height":37.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-17.png","element":"img","alt":" Θ given as","inline":true}],[{"style":{"width":"57%"},"width":914,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-18.png","element":"img"}],[{"text":"for fixed ","element":"span"},{"style":{"height":13.2},"width":102,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-19.png","element":"img","alt":" R ≥ 1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":190.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-20.png","element":"img","alt":" δ ∈ (0, 0.5)","inline":true},{"text":". The first condition ensures the stability of the queueing models, while the second guarantees the compactness of the parameter space of the parameterized policies. In both systems, the goal of the dispatcher is to minimize the expected sojourn time of jobs, which by Little’s law [","element":"span"},{"href":"#id-80","referenceIndex":44,"text":"44","element":"a"},{"text":"] is equivalent to minimizing the average number of jobs in the system. 
After verifying Assumptions ","element":"span"},{"href":"#id-54","text":"1-","element":"a"},{"href":"#id-65","text":"5 ","element":"a"},{"text":"in Appendix ","element":"span"},{"text":"E ","element":"span"},{"text":"for the cost function ","element":"span"},{"style":{"height":16},"width":212.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-21.png","element":"img","alt":" c(x) = ∥x∥1","inline":true},{"text":", Theorem ","element":"span"},{"href":"#id-7","text":"1 ","element":"a"},{"text":"yields a Bayesian regret of order ","element":"span"},{"style":{"height":19.8},"width":180,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-22.png","element":"img","alt":"˜O(�|A|T)","inline":true,"padRight":true},{"text":"for Algorithm ","element":"span"},{"href":"#id-57","text":"1.","element":"a"}],[{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Model 1. Two-server queueing system with a common buffer. ","element":"span"},{"text":"We consider the continuous-time queueing system of Figure ","element":"span"},{"href":"#id-79","text":"2a, ","element":"a"},{"text":"where the countable state-space is ","element":"span"},{"style":{"height":16},"width":464.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-23.png","element":"img","alt":" X = {x = (x0, x1, x2) ∈","inline":true},{"style":{"height":17.39},"width":402.88,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-24.png","element":"img","alt":"Z+ × {0, 1}2}, where x0","inline":true,"padRight":true},{"text":"is the queue length, and ","element":"span"},{"style":{"height":14.8},"width":471.68,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-25.png","element":"img","alt":" xi, i = 1, 2 equal 1 if server i","inline":true,"padRight":true},{"text":"is busy. 
The action space is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"style":{"fontStyle":"italic"},"text":"h, b, ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"means no action, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"sends a job to both servers, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2 ","element":"span"},{"text":"assigns a job to server ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":". In [","element":"span"},{"href":"#id-12","referenceIndex":31,"text":"31","element":"a"},{"text":"], it is shown that by uniformization [","element":"span"},{"href":"#id-81","referenceIndex":32,"text":"32","element":"a"},{"text":"] and sampling the continuous-time Markov process at rate ","element":"span"},{"style":{"height":14.93},"width":194.44,"height":37.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-26.png","element":"img","alt":" λ + θ∗1 + θ∗2","inline":true},{"text":", a discrete-time Markov chain is obtained, which converts the ","element":"span"},{"text":"original continuous-time problem to an equivalent discrete-time problem where we need to minimize ","element":"span"},{"style":{"height":20.4},"width":552.2,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-27.png","element":"img","alt":"lim supT →∞ T −1 �T −1t=0 ∥X(t)∥1","inline":true},{"text":". 
Further, [","element":"span"},{"href":"#id-12","referenceIndex":31,"text":"31","element":"a"},{"text":"] shows that the optimal policy is a threshold policy ","element":"span"},{"style":{"height":12.48},"width":94.4,"height":31.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-28.png","element":"img","alt":"πt(θ∗)","inline":true,"padRight":true},{"text":"with optimal finite threshold ","element":"span"},{"style":{"height":16.34},"width":165.16,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-29.png","element":"img","alt":" t(θ∗) ∈ N","inline":true},{"text":": always assign a job to the faster (first) server when free, and to the second server if it is free and ","element":"span"},{"style":{"height":16.34},"width":239.08,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-30.png","element":"img","alt":" ∥x∥1 > t(θ∗)","inline":true},{"text":", and take no action otherwise. In Appendix ","element":"span"},{"href":"#id-82","text":"E.1, ","element":"a"},{"text":"we argue that the discrete-time Markov process governed by ","element":"span"},{"style":{"height":11.8},"width":99,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-31.png","element":"img","alt":" θ ∈ Θ","inline":true,"padRight":true},{"text":"and following threshold policy ","element":"span"},{"style":{"height":9.6},"width":33,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-32.png","element":"img","alt":" πt","inline":true,"padRight":true},{"text":"for any threshold ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"belonging to a compact set satisfies Assumptions ","element":"span"},{"href":"#id-54","text":"1-","element":"a"},{"href":"#id-65","text":"5.","element":"a"}],[{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Model 2. Two heterogeneous parallel queues. 
","element":"span"},{"text":"We consider the continuous-time queueing system of Figure ","element":"span"},{"href":"#id-79","text":"2b ","element":"a"},{"text":"with countable state-space ","element":"span"},{"style":{"height":18.93},"width":465.32,"height":47.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-33.png","element":"img","alt":" X = {x = (x1, x2) ∈ Z2+}","inline":true},{"text":", where ","element":"span"},{"style":{"height":9.6},"width":32,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-34.png","element":"img","alt":" xi","inline":true,"padRight":true},{"text":"is the number of ","element":"span"},{"text":"jobs in the server-queue pair ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":". The action space is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":", where action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"sends the arrival to queue ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":". 
We obtain the discrete-time MDP by sampling the queueing system at the arrivals, and then aim to find the average cost minimizing policy within the class ","element":"span"},{"style":{"height":17.39},"width":534.72,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/8-35.png","element":"img","alt":" Π = {πω; ω ∈ [(cRR)−1, cRR]}","inline":true},{"text":",","element":"span"}],[{"id":"id-79","style":{"width":"91%"},"width":1444,"height":276,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-0.png","element":"img"}],[{"text":"Figure 2: Two-server queueing systems with heterogeneous service rates.","element":"figcaption","subtype":"caption"}],[{"id":"id-85","style":{"width":"89%"},"width":1422,"height":522,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-1.png","element":"img"}],[{"text":"Figure 3: Regret performance for ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":262.5,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-2.png","element":"img","alt":" λ = 0.3, 0.5, 0.7","inline":true},{"text":". Shaded region shows the ","element":"figcaption","subtype":"caption"},{"style":{"height":10.8},"width":52,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-3.png","element":"img","alt":" ±σ","inline":true,"padRight":true},{"text":"area of mean regret.","element":"figcaption","subtype":"caption"}],[{"style":{"height":13.4},"width":127.5,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-4.png","element":"img","alt":"cR ≥ 1","inline":true},{"text":". 
Policy ","element":"span"},{"style":{"height":14.2},"width":237,"height":35.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-5.png","element":"img","alt":" πω : X → A","inline":true,"padRight":true},{"text":"routes arrivals based on the weighted queue lengths: ","element":"span"},{"style":{"height":16},"width":152.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-6.png","element":"img","alt":" πω(x) =","inline":true},{"style":{"height":16},"width":467.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-7.png","element":"img","alt":"arg min (1 + x1, ω (1 + x2))","inline":true,"padRight":true},{"text":"with ties broken for ","element":"span"},{"text":"1","element":"span"},{"text":". Even with the transition kernel fully specified (by the values of arrival and service rates), the optimal policy in ","element":"span"},{"style":{"height":10.8},"width":28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-8.png","element":"img","alt":" Π","inline":true,"padRight":true},{"text":"is not known except when ","element":"span"},{"style":{"height":13.8},"width":121.5,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-9.png","element":"img","alt":" θ1 = θ2","inline":true,"padRight":true},{"text":"where the optimal value is ","element":"span"},{"style":{"height":10.8},"width":99.4,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-10.png","element":"img","alt":" ω = 1","inline":true},{"text":", and so, to learn it, we will use Proximal Policy Optimization for countable state-space MDPs [","element":"span"},{"href":"#id-35","referenceIndex":14,"text":"14","element":"a"},{"text":"]. 
Note that [","element":"span"},{"href":"#id-35","referenceIndex":14,"text":"14","element":"a"},{"text":"] requires full model knowledge, which holds in our scheme as we use parameters sampled from the posterior for choosing the policy at the beginning of each episode. In Appendix ","element":"span"},{"href":"#id-83","text":"E.2, ","element":"a"},{"text":"we argue that the discrete-time Markov process governed by parameter ","element":"span"},{"style":{"height":11.8},"width":96.5,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-11.png","element":"img","alt":" θ ∈ Θ","inline":true,"padRight":true},{"text":"and following policy ","element":"span"},{"style":{"height":17.38},"width":448.56,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-12.png","element":"img","alt":" πω for ω ∈ [(cRR)−1, cRR]","inline":true,"padRight":true},{"text":"satisfies Assumptions ","element":"span"},{"href":"#id-54","text":"1-","element":"a"},{"href":"#id-65","text":"5.","element":"a"}],[{"text":"Next, we report the numerical results of Algorithm ","element":"span"},{"href":"#id-57","text":"1 ","element":"a"},{"text":"in the two queueing models of Figure ","element":"span"},{"href":"#id-79","text":"2 ","element":"a"},{"text":"and calculate regret using ","element":"span"},{"href":"#id-84","text":"(2)","element":"a"},{"text":". The regret is averaged over 2000 simulation runs and plotted against the number of transitions in the sampled discrete-time Markov process. Figure ","element":"span"},{"href":"#id-85","text":"3 ","element":"a"},{"text":"shows the behavior of the regret of the two queueing models for three different arrival rates and service rates distributed according to a Dirichlet prior over ","element":"span"},{"style":{"height":17.2},"width":152,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-13.png","element":"img","alt":" [0.5, 1.9]2","inline":true},{"text":". 
We observe that the regret is sub-linear in time and grows as the arrival rate increases. For the queueing model of Figure ","element":"span"},{"href":"#id-79","text":"2a, ","element":"a"},{"text":"the minimum average cost ","element":"span"},{"style":{"height":16},"width":77.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-14.png","element":"img","alt":"J(θ)","inline":true,"padRight":true},{"text":"and optimal policy ","element":"span"},{"style":{"height":15.8},"width":37.5,"height":39.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-15.png","element":"img","alt":" π∗θ ","inline":true,"padRight":true},{"text":"are known explicitly [","element":"span"},{"href":"#id-12","referenceIndex":31,"text":"31","element":"a"},{"text":"] for every ","element":"span"},{"style":{"height":12.2},"width":96,"height":30.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-16.png","element":"img","alt":" θ ∈ Θ","inline":true},{"text":", which are used in Algorithm ","element":"span"},{"href":"#id-57","text":"1 ","element":"a"},{"text":"and for regret calculation. Conversely, for the second queueing model, ","element":"span"},{"style":{"height":16},"width":77.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-17.png","element":"img","alt":" J(θ)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.6},"width":37.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-18.png","element":"img","alt":" π∗θ","inline":true,"padRight":true},{"text":"are not known. ","element":"span"},{"text":"The PPO algorithm [","element":"span"},{"href":"#id-35","referenceIndex":14,"text":"14","element":"a"},{"text":"] is used to empirically find both the optimal weight and the policy’s average cost. As expected from our theoretical guarantees, we observe that the regret is sub-linear in time. 
Furthermore, it grows as the arrival rate increases and the normalized load on the system approaches ","element":"span"},{"text":"1","element":"span"},{"text":", which is expected since the system gets closer to the stability boundary. As discussed in Section ","element":"span"},{"text":"4, ","element":"span"},{"text":"our bound on the expected regret depends linearly on ","element":"span"},{"style":{"height":11.4},"width":37,"height":28.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/9-19.png","element":"img","alt":" J∗ ","inline":true,"padRight":true},{"text":"and, thus, will increase with the arrival rate. Additional details of the simulations and more plots are presented in Appendix ","element":"span"},{"text":"G.","element":"span"}]]},{"heading":"6 Conclusions and future work","paragraphs":[[{"text":"We studied the problem of learning optimal policies in countable state-space MDPs governed by unknown parameters. We proposed a learning policy based on Thompson sampling and established finite-time performance guarantees on the Bayesian regret. We highlighted the practicality of our proposed algorithm by considering two different queueing models and showing that our algorithm can be applied to develop optimal control policies. For future work, we plan to explore two directions: generalizing our algorithm to handle policies that might not all be stabilizing, and simplifying the algorithm using ideas from ","element":"span"},{"href":"#id-25","referenceIndex":52,"text":"[52]","element":"a"},{"text":".","element":"span"}]]},{"heading":"Disclosure of Funding","paragraphs":[[{"text":"SA’s research was supported by NSF via grants ECCS2038416, CCF2008130, and CNS1955777, and a grant from General Dynamics via MIDAS at the University of Michigan, Ann Arbor. 
VS’s research is supported by NSF via grants ECCS2038416, CCF2008130, CNS1955777, and CMMI2240981.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-23","text":"[1] ","element":"span"},{"text":"Yasin Abbasi-Yadkori and Csaba Szepesvari. Bayesian optimal control of smoothly parameterized systems: The lazy posterior sampling algorithm. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1406.3926","element":"span"},{"text":", 2014.","element":"span"}],[{"id":"id-36","text":"[2] ","element":"span"},{"text":"Saghar Adler, Mehrdad Moharrami, and Vijay Subramanian. ","element":"span"},{"text":"Learning a discrete set of optimal allocation rules in queueing systems with unknown service rates. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2202.02419","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-28","text":"[3] ","element":"span"},{"text":"R. Agrawal, D. Teneketzis, and V. Anantharam. Asymptotically efficient adaptive allocation schemes for controlled Markov chains: Finite parameter space. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Automatic Control","element":"span"},{"text":", 34(12):1249–1259, 1989.","element":"span"}],[{"id":"id-26","text":"[4] ","element":"span"},{"text":"Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: Worst-case regret bounds. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 30, 2017.","element":"span"}],[{"id":"id-21","text":"[5] ","element":"span"},{"text":"Shipra Agrawal and Randy Jia. Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2019 ACM Conference on Economics and Computation","element":"span"},{"text":", pages 743–744, 2019.","element":"span"}],[{"id":"id-75","text":"[6] ","element":"span"},{"text":"Nima Akbarzadeh and Aditya Mahajan. On learning Whittle index policy for restless bandits with scalable regret. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2202.03463","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-1","text":"[7] ","element":"span"},{"text":"Aristotle Arapostathis, Vivek S Borkar, Emmanuel Fernández-Gaucherand, Mrinal K Ghosh, and Steven I Marcus. Discrete-time controlled Markov processes with average cost criterion: A survey. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Control and Optimization","element":"span"},{"text":", 31(2):282–344, 1993.","element":"span"}],[{"id":"id-74","text":"[8] ","element":"span"},{"text":"Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 21, 2008.","element":"span"}],[{"id":"id-47","text":"[9] ","element":"span"},{"text":"Rolando Cavazos-Cadena. Necessary conditions for the optimality equation in average-reward Markov decision processes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Applied Mathematics and Optimization","element":"span"},{"text":", 19(1):97–112, 1989.","element":"span"}],[{"id":"id-52","text":"[10] ","element":"span"},{"text":"Rolando Cavazos-Cadena. Weak conditions for the existence of optimal stationary policies in average Markov decision chains with unbounded costs. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Kybernetika","element":"span"},{"text":", 25(3):145–156, 1989.","element":"span"}],[{"id":"id-41","text":"[11] ","element":"span"},{"text":"Tuhinangshu Choudhury, Gauri Joshi, Weina Wang, and Sanjay Shakkottai. Job dispatching policies for queueing systems with unknown service rates. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Twenty-Second International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing","element":"span"},{"text":", MobiHoc ’21, page 181–190, New York, NY, USA, 2021. Association for Computing Machinery.","element":"span"}],[{"id":"id-44","text":"[12] ","element":"span"},{"text":"Sayak Ray Chowdhury, Aditya Gopalan, and Odalric-Ambrym Maillard. Reinforcement learning in parametric MDPs with exponential families. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pages 1855–1863. PMLR, 2021.","element":"span"}],[{"id":"id-37","text":"[13] ","element":"span"},{"text":"Asaf Cohen, Vijay G. Subramanian, and Yili Zhang. Learning-based optimal admission control in a single server queuing system, 2022.","element":"span"}],[{"id":"id-35","text":"[14] ","element":"span"},{"text":"Jim G Dai and Mark Gluzman. Queueing network controls via deep reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Stochastic Systems","element":"span"},{"text":", 12(1):30–67, 2022.","element":"span"}],[{"text":"[15] ","element":"span"},{"text":"Anthony Ephremides, Pravin Varaiya, and Jean Walrand. A simple dynamic routing problem. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Automatic Control","element":"span"},{"text":", 25(4):690–693, 1980.","element":"span"}],[{"id":"id-4","text":"[16] ","element":"span"},{"text":"Lloyd Fisher and Sheldon M Ross. 
An example in denumerable decision processes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of Mathematical Statistics","element":"span"},{"text":", 39(2):674–675, 1968.","element":"span"}],[{"id":"id-42","text":"[17] ","element":"span"},{"text":"Daniel Freund, Thodoris Lykouris, and Wentao Weng. Efficient decentralized multi-agent learning in asymmetric queuing systems. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Learning Theory","element":"span"},{"text":", pages 4080–4084. PMLR, 2022.","element":"span"}],[{"id":"id-19","text":"[18] ","element":"span"},{"text":"Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, and Aviv Tamar. Bayesian reinforcement learning: A survey. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Foundations and Trends® in Machine Learning","element":"span"},{"text":", 8(5-6):359–483, 2015.","element":"span"}],[{"id":"id-27","text":"[19] ","element":"span"},{"text":"Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Learning Theory","element":"span"},{"text":", pages 861–898. PMLR, 2015.","element":"span"}],[{"id":"id-29","text":"[20] ","element":"span"},{"text":"Todd L Graves and Tze Leung Lai. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Control and Optimization","element":"span"},{"text":", 35(3):715–743, 1997.","element":"span"}],[{"id":"id-96","text":"[21] ","element":"span"},{"text":"Bruce Hajek. Hitting-time and occupation-time bounds implied by drift analysis with applications. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Applied Probability","element":"span"},{"text":", 14(3):502–525, 1982.","element":"span"}],[{"id":"id-97","text":"[22] ","element":"span"},{"text":"Arie Hordijk and Flora Spieksma. 
On ergodicity and recurrence properties of a Markov chain by an application to an open Jackson network. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in applied probability","element":"span"},{"text":", 24(2):343–376, 1992.","element":"span"}],[{"id":"id-18","text":"[23] ","element":"span"},{"text":"Mehdi Jafarnia Jahromi, Rahul Jain, and Ashutosh Nayyar. Online learning for unknown partially observable MDPs. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pages 1712–1732. PMLR, 2022.","element":"span"}],[{"id":"id-90","text":"[24] ","element":"span"},{"text":"Soren F Jarner and Gareth O Roberts. Polynomial convergence rates of Markov chains. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of Applied Probability","element":"span"},{"text":", 12(1):224–247, 2002.","element":"span"}],[{"id":"id-33","text":"[25] ","element":"span"},{"text":"Subhashini Krishnasamy, PT Akhil, Ari Arapostathis, Rajesh Sundaresan, and Sanjay Shakkottai. Augmenting max-weight with explicit learning for wireless scheduling with switching costs. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE/ACM Transactions on Networking","element":"span"},{"text":", 26(6):2501–2514, 2018.","element":"span"}],[{"id":"id-40","text":"[26] ","element":"span"},{"text":"Subhashini Krishnasamy, Ari Arapostathis, Ramesh Johari, and Sanjay Shakkottai. On learning the c","element":"span"},{"style":{"height":10.6},"width":22.5,"height":26.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/12-0.png","element":"img","alt":"µ","inline":true,"padRight":true},{"text":"rule in single and parallel server networks, 2018.","element":"span"}],[{"id":"id-39","text":"[27] ","element":"span"},{"text":"Subhashini Krishnasamy, Rajat Sen, Ramesh Johari, and Sanjay Shakkottai. 
Learning unknown service rates in queues: A multiarmed bandit approach. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Oper. Res.","element":"span"},{"text":", 69(1):315–330, 2021.","element":"span"}],[{"id":"id-0","text":"[28] ","element":"span"},{"text":"P.R. Kumar and Pravin Varaiya. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Stochastic systems: Estimation, identification, and adaptive control","element":"span"},{"text":". SIAM, 2015.","element":"span"}],[{"id":"id-30","text":"[29] ","element":"span"},{"text":"Tze-Leung Lai and Sidney Yakowitz. Machine learning and nonparametric bandit theory. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Automatic Control","element":"span"},{"text":", 40(7):1199–1209, 1995.","element":"span"}],[{"id":"id-147","text":"[30] ","element":"span"},{"text":"Ronald Larsen. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Control of multiple exponential servers with application to computer systems","element":"span"},{"text":". PhD thesis, University of Maryland, 1981.","element":"span"}],[{"id":"id-12","text":"[31] ","element":"span"},{"text":"Woei Lin and P. R. Kumar. Optimal control of a queueing system with two heterogeneous servers. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Automatic control","element":"span"},{"text":", 29(8):696–703, 1984.","element":"span"}],[{"id":"id-81","text":"[32] ","element":"span"},{"text":"Steven A Lippman. Applying a new device in the optimization of exponential queuing systems. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Operations research","element":"span"},{"text":", 23(4):687–710, 1975.","element":"span"}],[{"id":"id-3","text":"[33] Ashok P. Maitra. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dynamic programming for countable state systems","element":"span"},{"text":". 
PhD thesis, 1963.","element":"span"}],[{"id":"id-55","text":"[34] ","element":"span"},{"text":"Armand M Makowski and Adam Shwartz. The Poisson equation for countable Markov chains: Probabilistic methods and interpretations. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Handbook of Markov Decision Processes: Methods and Applications","element":"span"},{"text":", pages 269–303, 2002.","element":"span"}],[{"id":"id-87","text":"[35] ","element":"span"},{"text":"Sean P. Meyn and Richard L. Tweedie. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Markov chains and stochastic stability","element":"span"},{"text":". Springer Science & Business Media, 2012.","element":"span"}],[{"id":"id-32","text":"[36] ","element":"span"},{"text":"Michael J Neely, Scott T Rager, and Thomas F La Porta. Max-Weight learning algorithms for scheduling in unknown environments. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Automatic Control","element":"span"},{"text":", 57(5):1179– 1191, 2012.","element":"span"}],[{"id":"id-17","text":"[37] ","element":"span"},{"text":"Pedro A Ortega and Daniel A Braun. A minimum relative entropy principle for learning and acting. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Artificial Intelligence Research","element":"span"},{"text":", 38:475–511, 2010.","element":"span"}],[{"id":"id-5","text":"[38] ","element":"span"},{"text":"Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 26, 2013.","element":"span"}],[{"id":"id-6","text":"[39] ","element":"span"},{"text":"Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International conference on machine learning","element":"span"},{"text":", pages 2701–2710. PMLR, 2017.","element":"span"}],[{"id":"id-45","text":"[40] ","element":"span"},{"text":"Reda Ouhamma, Debabrota Basu, and Odalric Maillard. Bilinear exponential family of MDPs: Frequentist regret bound with tractable exploration & planning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the AAAI Conference on Artificial Intelligence","element":"span"},{"text":", volume 37, pages 9336–9344, 2023.","element":"span"}],[{"id":"id-22","text":"[41] ","element":"span"},{"text":"Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown Markov decision processes: A Thompson sampling approach. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 30, 2017.","element":"span"}],[{"id":"id-20","text":"[42] ","element":"span"},{"text":"Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on Thompson sampling. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Foundations and Trends® in Machine Learning","element":"span"},{"text":", 11(1):1–96, 2018.","element":"span"}],[{"id":"id-46","text":"[43] ","element":"span"},{"text":"Devavrat Shah, Qiaomin Xie, and Zhi Xu. Stable reinforcement learning with unbounded state space. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2006.04353","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-80","text":"[44] ","element":"span"},{"text":"R. Srikant and Lei Ying. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Communication networks: An optimization, control, and stochastic networks perspective","element":"span"},{"text":". 
Cambridge University Press, 2013.","element":"span"}],[{"id":"id-38","text":"[45] ","element":"span"},{"text":"Thomas Stahlbuhk, Brooke Shrader, and Eytan Modiano. Learning algorithms for minimizing queue length regret. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Information Theory","element":"span"},{"text":", 67(3):1759–1781, 2021.","element":"span"}],[{"id":"id-2","text":"[46] ","element":"span"},{"text":"Shaler Stidham and Richard Weber. A survey of Markov decision models for control of networks of queues. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Queueing systems","element":"span"},{"text":", 13:291–314, 1993.","element":"span"}],[{"id":"id-16","text":"[47] ","element":"span"},{"text":"Malcolm Strens. A Bayesian framework for reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICML","element":"span"},{"text":", volume 2000, pages 943–950, 2000.","element":"span"}],[{"id":"id-106","text":"[48] ","element":"span"},{"text":"Wojciech Szpankowski and Vernon Rego. Yet another application of a binomial recurrence order statistics. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Computing","element":"span"},{"text":", 43(4):401–410, 1990.","element":"span"}],[{"id":"id-13","text":"[49] ","element":"span"},{"text":"Leandros Tassiulas and Anthony Ephremides. Jointly optimal routing and scheduling in packet radio networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Information Theory","element":"span"},{"text":", 38(1):165–168, 1992.","element":"span"}],[{"id":"id-14","text":"[50] ","element":"span"},{"text":"Leandros Tassiulas and Anthony Ephremides. Dynamic server allocation to parallel queues with randomly varying connectivity. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Information Theory","element":"span"},{"text":", 39(2):466–478, 1993.","element":"span"}],[{"id":"id-24","text":"[51] ","element":"span"},{"text":"Georgios Theocharous, Zheng Wen, Yasin Abbasi-Yadkori, and Nikos Vlassis. Posterior sampling for large scale reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1711.07979","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-25","text":"[52] ","element":"span"},{"text":"Georgios Theocharous, Zheng Wen, Yasin Abbasi Yadkori, and Nikos Vlassis. Scalar posterior sampling with applications. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 31, 2018.","element":"span"}],[{"id":"id-15","text":"[53] ","element":"span"},{"text":"William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Biometrika","element":"span"},{"text":", 25(3-4):285–294, 1933.","element":"span"}],[{"id":"id-31","text":"[54] ","element":"span"},{"text":"Neil Walton and Kuang Xu. Learning and information in stochastic networks and queues. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Tutorials in Operations Research: Emerging Optimization Methods and Modeling Techniques with Applications","element":"span"},{"text":", pages 161–198. INFORMS, 2021.","element":"span"}],[{"id":"id-73","text":"[55] ","element":"span"},{"text":"Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the ","element":"span"},{"style":{"height":13.4},"width":39.5,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/13-0.png","element":"img","alt":" L1","inline":true,"padRight":true},{"text":"deviation of the empirical distribution. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Hewlett-Packard Labs, Tech. Rep","element":"span"},{"text":", 2003.","element":"span"}],[{"id":"id-34","text":"[56] ","element":"span"},{"text":"Zixian Yang, R Srikant, and Lei Ying. Learning while scheduling in multi-server systems with unknown statistics: Maxweight with discounted UCB. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pages 4275–4312. PMLR, 2023.","element":"span"}],[{"id":"id-43","text":"[57] ","element":"span"},{"text":"Andrea Zanette, David Brandfonbrener, Emma Brunskill, Matteo Pirotta, and Alessandro Lazaric. Frequentist regret bounds for randomized least-squares value iteration. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pages 1954–1964. PMLR, 2020.","element":"span"}]]},{"heading":"A Proofs related to problem formulation","paragraphs":[[{"id":"id-48","style":{"fontWeight":"bold"},"text":"A.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Ergodicity definitions","element":"span"}],[{"text":"Suppose that Markov process ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"X ","element":"span"},{"text":"on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"with transition kernel ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"is irreducible, aperiodic and positive recurrent with stationary distribution ","element":"span"},{"style":{"height":10.8},"width":22.5,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-0.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"and let 
","element":"span"},{"style":{"height":16},"width":257.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-1.png","element":"img","alt":" f : X → [1, ∞)","inline":true,"padRight":true},{"text":"be a measurable function such that ","element":"span"},{"style":{"height":16.8},"width":623.52,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-2.png","element":"img","alt":"µ(f) := E_µ[f(Y)] < +∞ with Y ∼ µ","inline":true},{"text":". We are interested in conditions under which for a sequence of positive numbers ","element":"span"},{"style":{"height":16.8},"width":263.76,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-3.png","element":"img","alt":" ρ := (ρ(n))_{n≥0},","inline":true}],[{"id":"id-86","style":{"width":"75%"},"width":1196,"height":60,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-4.png","element":"img"}],[{"text":"where for a signed measure ","element":"span"},{"style":{"height":14.4},"width":22,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-5.png","element":"img","alt":" µ̃","inline":true,"padRight":true},{"text":"on ","element":"span"},{"style":{"height":18.69},"width":448.28,"height":46.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-6.png","element":"img","alt":" X, ∥µ̃∥_f := sup_{|g|≤f} |µ̃(g)|","inline":true},{"text":". 
The sequence ","element":"span"},{"style":{"height":10.8},"width":19.5,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-7.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"text":"is interpreted as the rate function, and three different notions of ergodicity are distinguished based on the following rate functions: ","element":"span"},{"style":{"height":16},"width":359.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-8.png","element":"img","alt":" ρ(n) ≡ 1, ρ(n) = ζ^n","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":14},"width":102.6,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-9.png","element":"img","alt":" ζ > 1","inline":true},{"text":", and ","element":"span"},{"style":{"height":17.39},"width":218.92,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-10.png","element":"img","alt":" ρ(n) = n^(ζ−1)","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":14},"width":102.6,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-11.png","element":"img","alt":" ζ ≥ 1","inline":true},{"text":". Further, for each rate function ","element":"span"},{"style":{"height":10.8},"width":19.5,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-12.png","element":"img","alt":" ρ","inline":true},{"text":", we state the Foster-Lyapunov characterization of ergodicity of the Markov process ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"X","element":"span"},{"text":", which provides sufficient conditions for ","element":"span"},{"href":"#id-86","text":"(15) ","element":"a"},{"text":"to hold.","element":"span"}],[{"text":"1. 
If ","element":"span"},{"style":{"height":16},"width":152.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-13.png","element":"img","alt":" ρ(n) ≡ 1","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":13.2},"width":101.04,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-14.png","element":"img","alt":" n ≥ 0","inline":true},{"text":", the Markov process ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"X ","element":"span"},{"text":"satisfying ","element":"span"},{"href":"#id-86","text":"(15) ","element":"a"},{"text":"is said to be ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"-","element":"span"},{"style":{"fontWeight":"bold"},"text":"ergodic","element":"span"},{"text":". From [","element":"span"},{"href":"#id-87","referenceIndex":35,"text":"35","element":"a"},{"text":"], for an irreducible and aperiodic chain, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"-ergodicity is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"equivalent ","element":"span"},{"text":"to the existence of a function ","element":"span"},{"style":{"height":16},"width":266.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-15.png","element":"img","alt":" V : X → [0, ∞)","inline":true},{"text":", a finite set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C","element":"span"},{"text":", and a positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"such that","element":"span"}],[{"id":"id-88","style":{"width":"55%"},"width":874,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-16.png","element":"img"}],[{"text":"where 
","element":"span"},{"style":{"height":12},"width":295,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-17.png","element":"img","alt":" ∆V := PV − V","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":17.58},"width":576.92,"height":43.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-18.png","element":"img","alt":" PV(x) := ∑_{x′∈X} P(x, x′)V(x′)","inline":true},{"text":". The drift condition ","element":"span"},{"href":"#id-88","text":"(16) ","element":"a"},{"text":"implies positive recurrence of the Markov process, existence of a unique stationary distribution ","element":"span"},{"style":{"height":16},"width":384.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-19.png","element":"img","alt":" µ, and µ(f) ≤ b < +∞","inline":true,"padRight":true},{"text":"(","element":"span"},{"href":"#id-87","referenceIndex":35,"text":"[35]","element":"a"},{"text":", Theorem 14.3.7).","element":"span"}],[{"text":"2. If ","element":"span"},{"style":{"height":16},"width":185.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-20.png","element":"img","alt":" ρ(n) = ζ^n","inline":true,"padRight":true},{"text":"for some ","element":"span"},{"style":{"height":14},"width":109.64,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-21.png","element":"img","alt":" ζ > 1","inline":true},{"text":", the Markov process ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"X ","element":"span"},{"text":"satisfying ","element":"span"},{"href":"#id-86","text":"(15) ","element":"a"},{"text":"is said to be ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"- ","element":"span"},{"style":{"fontWeight":"bold"},"text":"geometrically ergodic","element":"span"},{"text":". 
From [","element":"span"},{"href":"#id-87","referenceIndex":35,"text":"35","element":"a"},{"text":"], for an irreducible and aperiodic chain, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"-geometric ergodicity is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"equivalent ","element":"span"},{"text":"to the existence of a function ","element":"span"},{"style":{"height":16},"width":279.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-22.png","element":"img","alt":" V : X → [1, ∞)","inline":true},{"text":", a finite set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C","element":"span"},{"text":", a constant ","element":"span"},{"style":{"height":16},"width":160.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-23.png","element":"img","alt":" γ ∈ (0, 1)","inline":true,"padRight":true},{"text":"and a positive constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"such that","element":"span"}],[{"id":"id-89","style":{"width":"59%"},"width":938,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-24.png","element":"img"}],[{"text":"The drift condition ","element":"span"},{"href":"#id-89","text":"(17) ","element":"a"},{"text":"implies positive recurrence of the Markov process, existence of a unique stationary distribution ","element":"span"},{"style":{"height":10.8},"width":22.5,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-25.png","element":"img","alt":" µ","inline":true},{"text":", and ","element":"span"},{"style":{"height":21.78},"width":374.2,"height":54.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-26.png","element":"img","alt":" µ(V) ≤ b/(1−γ) < +∞","inline":true,"padRight":true},{"text":"([","element":"span"},{"href":"#id-87","referenceIndex":35,"text":"35","element":"a"},{"text":"], Theorem 14.3.7). 
","element":"span"},{"text":"Moreover, if ","element":"span"},{"href":"#id-86","style":{"height":16},"width":255.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-27.png","element":"img","alt":" f(·) ≡ 1 in (15)","inline":true},{"text":", then the Markov process ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"X ","element":"span"},{"text":"is called ","element":"span"},{"style":{"fontWeight":"bold"},"text":"geometrically ergodic","element":"span"},{"text":".","element":"span"}],[{"text":"3. If ","element":"span"},{"style":{"height":17.38},"width":219.2,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-28.png","element":"img","alt":" ρ(n) = nζ−1","inline":true,"padRight":true},{"text":"for some ","element":"span"},{"style":{"height":14.4},"width":98.5,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-29.png","element":"img","alt":" ζ ≥ 1","inline":true},{"text":", the Markov process ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"X ","element":"span"},{"text":"satisfying ","element":"span"},{"href":"#id-86","text":"(15) ","element":"a"},{"text":"is said to be ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"- ","element":"span"},{"style":{"fontWeight":"bold"},"text":"polynomially ergodic","element":"span"},{"text":". 
From [","element":"span"},{"href":"#id-87","referenceIndex":35,"text":"35","element":"a"},{"text":", ","element":"span"},{"href":"#id-90","referenceIndex":24,"text":"24","element":"a"},{"text":"], for an irreducible and aperiodic chain, the existence of a function ","element":"span"},{"style":{"height":16},"width":266.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-30.png","element":"img","alt":" V : X → [1, ∞)","inline":true},{"text":", a finite set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C","element":"span"},{"text":", a constant ","element":"span"},{"style":{"height":16},"width":159,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-31.png","element":"img","alt":" α ∈ [0, 1)","inline":true},{"text":", and positive constants ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"such that","element":"span"}],[{"id":"id-91","style":{"width":"56%"},"width":892,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-32.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"implies ","element":"span"},{"style":{"height":15.8},"width":36,"height":39.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-33.png","element":"img","alt":" V_ζ","inline":true},{"text":"-polynomial ergodicity of ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"X ","element":"span"},{"text":"at rate ","element":"span"},{"style":{"height":17.38},"width":221.64,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-34.png","element":"img","alt":" ρ(n) = n^(ζ−1)","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":16},"width":310.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-35.png","element":"img","alt":" 
ζ ∈ [1, 1/(1 − α)]","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":18.98},"width":269.92,"height":47.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-36.png","element":"img","alt":" Vζ = V 1−ζ(1−α)","inline":true},{"text":". The drift condition ","element":"span"},{"href":"#id-91","text":"(18) ","element":"a"},{"text":"implies positive recurrence of the Markov process, the existence of a unique stationary distribution ","element":"span"},{"style":{"height":19.38},"width":432.16,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-37.png","element":"img","alt":" µ, and µ(V α) ≤ b/c < +∞.","inline":true}],[{"id":"id-51","style":{"fontWeight":"bold"},"text":"A.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Lemma ","element":"span"},{"href":"#id-92","style":{"fontWeight":"bold"},"text":"4","element":"a"}],[{"id":"id-92","style":{"fontWeight":"bold"},"text":"Lemma 4. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any state ","element":"span"},{"style":{"height":16.99},"width":116.32,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-38.png","element":"img","alt":" x ≠ 0d","inline":true},{"style":{"fontStyle":"italic"},"text":", there exist constants ","element":"span"},{"style":{"height":11.2},"width":91,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-39.png","element":"img","alt":" κ > 1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":9.6},"width":29,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-40.png","element":"img","alt":" c1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that the following holds for the hitting time of state 
","element":"span"},{"style":{"height":16.62},"width":121.48,"height":41.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-41.png","element":"img","alt":" 0d, τ0d,","inline":true}],[{"style":{"width":"22%"},"width":356,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-42.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We define ","element":"span"},{"style":{"height":21.94},"width":678.96,"height":54.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-43.png","element":"img","alt":"˜V := �∞n=0 0dP nV g where 0dP n is the n","inline":true},{"text":"-step taboo probability ","element":"span"},{"href":"#id-87","referenceIndex":35,"text":"[35] ","element":"a"},{"text":"defined as","element":"span"}],[{"style":{"width":"34%"},"width":540,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-44.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"height":14.6},"width":295,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-45.png","element":"img","alt":" A, B ⊆ X, and τA","inline":true,"padRight":true},{"text":"is the first hitting time of set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":". We also let ","element":"span"},{"style":{"height":17.8},"width":409.5,"height":44.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-46.png","element":"img","alt":" AP 0xB = IB(x). 
We have","inline":true}],[{"style":{"width":"60%"},"width":957,"height":261,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/14-47.png","element":"img"}],[{"text":"In Appendix ","element":"span"},{"href":"#id-93","text":"D.3, ","element":"a"},{"text":"we argue that there exists ","element":"span"},{"style":{"height":19},"width":890,"height":47.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-0.png","element":"img","alt":"˜bg > 1 such that ˜V (y) ≤ ˜bgV g(y) for all y ∈ X, which","inline":true,"padRight":true},{"text":"leads to","element":"span"}],[{"id":"id-94","style":{"width":"74%"},"width":1184,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-1.png","element":"img"}],[{"text":"Define Lyapunov function","element":"span"}],[{"style":{"width":"42%"},"width":666,"height":140,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-2.png","element":"img"}],[{"text":"From the above equation and ","element":"span"},{"href":"#id-94","text":"(19)","element":"a"},{"text":", we get","element":"span"}],[{"style":{"width":"60%"},"width":956,"height":572,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-3.png","element":"img"}],[{"text":"Thus,","element":"span"}],[{"style":{"width":"81%"},"width":1300,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-4.png","element":"img"}],[{"text":"To find an upper bound for ","element":"span"},{"style":{"height":16},"width":138.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-5.png","element":"img","alt":" Ex[κτ0d ]","inline":true},{"text":", we apply [","element":"span"},{"href":"#id-87","referenceIndex":35,"text":"35","element":"a"},{"text":", Theorem 15.2.5], which is a generalization of Lemma ","element":"span"},{"href":"#id-95","text":"12. 
","element":"a"},{"text":"For any ","element":"span"},{"style":{"height":24.4},"width":238.5,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-6.png","element":"img","alt":" 1 ≤ κ ≤ 2˜bg2˜bg−1","inline":true},{"text":", there exists ","element":"span"},{"style":{"height":11.8},"width":240.5,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-7.png","element":"img","alt":" ϵ > 0 such that","inline":true}],[{"style":{"width":"41%"},"width":660,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-8.png","element":"img"}],[{"text":"As ","element":"span"},{"style":{"height":18.83},"width":554.36,"height":47.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-9.png","element":"img","alt":"˜V g(y) ≥ 1 for all y ∈ X, we have","inline":true}],[{"style":{"width":"62%"},"width":988,"height":218,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-10.png","element":"img"}],[{"text":"and the claim holds for any ","element":"span"},{"style":{"height":28.8},"width":677.16,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-11.png","element":"img","alt":" κ ∈ [1, 2˜bg2˜bg−1] and c1 = ˜bgϵ−1 �1 + 2˜bg�.","inline":true}],[{"id":"id-56","style":{"fontWeight":"bold"},"text":"A.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Poisson equation","element":"span"}],[{"text":"For an irreducible Markov process ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"X ","element":"span"},{"text":"on the countably-infinite space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"with time-homogeneous transition kernel ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"and cost function 
","element":"span"},{"style":{"height":16},"width":54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-12.png","element":"img","alt":" ¯c(·)","inline":true},{"text":", a solution pair to the Poisson equation [","element":"span"},{"href":"#id-55","referenceIndex":34,"text":"34","element":"a"},{"text":"] is a scalar ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J ","element":"span"},{"text":"and function ","element":"span"},{"style":{"height":16},"width":1190.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-13.png","element":"img","alt":" v(·) : X �→ R such that J + v = ¯c + Pv, where v(z) = 0 for some z ∈ X","inline":true},{"text":". If Markov process ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"X ","element":"span"},{"text":"is positive recurrent and ","element":"span"},{"style":{"height":28.8},"width":424.92,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-14.png","element":"img","alt":" Ex��τy−1i=0 |¯c(Xt)|�< ∞","inline":true},{"text":", where ","element":"span"},{"style":{"height":11.8},"width":32.5,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-15.png","element":"img","alt":" τy","inline":true,"padRight":true},{"text":"is the first hitting time of some state ","element":"span"},{"style":{"height":14.4},"width":101.5,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-16.png","element":"img","alt":" y ∈ X","inline":true},{"text":", then solution pair ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"J, v","element":"span"},{"text":") ","element":"span"},{"text":"given as","element":"span"}],[{"style":{"width":"81%"},"width":1290,"height":140,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-17.png","element":"img"}],[{"text":"is a solution to the Poisson equation 
","element":"span"},{"style":{"height":16},"width":498.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/15-18.png","element":"img","alt":" J + v = ¯c + Pv with v(z) = 0","inline":true,"padRight":true},{"href":"#id-55","referenceIndex":34,"text":"[34, ","element":"a"},{"text":"Theorem 9.5].","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Consider Markov Decision Processes ","element":"span"},{"style":{"height":16},"width":211.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-0.png","element":"img","alt":" (X, A, c, Pθ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"governed by parameter ","element":"span"},{"style":{"height":12},"width":113.5,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-1.png","element":"img","alt":" θ ∈ Θ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"following the best-in-class policy ","element":"span"},{"style":{"height":15.8},"width":37,"height":39.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-2.png","element":"img","alt":" π∗θ","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then the pair","element":"span"},{"style":{"height":20.08},"width":345.44,"height":50.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-3.png","element":"img","alt":"�J (θ) , vπ∗θ �given as","inline":true}],[{"style":{"width":"71%"},"width":1134,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"is a solution to the Poisson equation ","element":"span"},{"style":{"height":22.29},"width":332.04,"height":55.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-5.png","element":"img","alt":" v + J = c + P π∗θθ v","inline":true},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":18.59},"width":217.88,"height":46.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-6.png","element":"img","alt":" vπ∗θ (0d) = 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":18.59},"width":177.04,"height":46.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-7.png","element":"img","alt":" ¯Cπ∗θ (x) =","inline":true},{"style":{"height":28.8},"width":568.16,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-8.png","element":"img","alt":"Eπ∗θx� �τ0d−1i=0 c(X(i), π∗θ(X(i)))�.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"From [","element":"span"},{"href":"#id-55","referenceIndex":34,"text":"34","element":"a"},{"text":", Theorem 9.5], a solution pair to the Poisson equation exists if ","element":"span"},{"style":{"height":21.2},"width":336.5,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-9.png","element":"img","alt":" Eπ∗θx [τ0d] and ¯Cπ∗θ (x)","inline":true,"padRight":true},{"text":"are finite for all ","element":"span"},{"style":{"height":11.8},"width":106.5,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-10.png","element":"img","alt":" x ∈ X","inline":true},{"text":". The former follows from positive recurrence assumed in Assumption ","element":"span"},{"href":"#id-10","text":"3 ","element":"a"},{"text":"and for the latter, from Assumptions ","element":"span"},{"href":"#id-54","text":"1 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-77","text":"2,","element":"a"}],[{"style":{"width":"78%"},"width":1250,"height":278,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-11.png","element":"img"}],[{"text":"which is finite from geometric ergodicity (Assumption ","element":"span"},{"href":"#id-10","text":"3) ","element":"a"},{"text":"and the discussion following that.","element":"span"}]]},{"heading":"B Proofs of regret analysis","paragraphs":[[{"text":"In this section, we state the proofs related to regret analysis of Section ","element":"span"},{"text":"4. 
","element":"span"},{"text":"We first note a key property of Thompson sampling from [","element":"span"},{"href":"#id-22","referenceIndex":41,"text":"41","element":"a"},{"text":"], which states that for any episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", measurable function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":", and ","element":"span"},{"style":{"height":15.4},"width":92.5,"height":38.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-12.png","element":"img","alt":"Htk−","inline":true},{"text":"measurable random variable ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":", we have","element":"span"}],[{"id":"id-107","style":{"width":"65%"},"width":1046,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-13.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":832.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-14.png","element":"img","alt":" Ht := σ (X (1) , . . . , X (t) , A (1) , . . . , A (t − 1))","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":11.6},"width":106.2,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-15.png","element":"img","alt":" t ∈ N","inline":true},{"text":". We start with deriving upper bounds on the hitting times of state ","element":"span"},{"style":{"height":14},"width":35,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-16.png","element":"img","alt":" 0d","inline":true,"padRight":true},{"text":"using the ergodicity conditions of Assumptions ","element":"span"},{"href":"#id-10","text":"3 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-11","text":"4. 
","element":"a"},{"text":"Previous works [","element":"span"},{"href":"#id-96","referenceIndex":21,"text":"21","element":"a"},{"text":", ","element":"span"},{"href":"#id-97","referenceIndex":22,"text":"22","element":"a"},{"text":", ","element":"span"},{"href":"#id-90","referenceIndex":24,"text":"24","element":"a"},{"text":"] have already established bounds on hitting times in geometrically and polynomially ergodic chains in terms of their corresponding Lyapunov function. However, our objective is to provide a precise characterization of all constants included in these bounds in terms of the constants of the drift equations ","element":"span"},{"href":"#id-98","text":"3 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-99","text":"4. ","element":"a"},{"text":"This characterization allows us to derive uniform bounds across the model class. In Appendix ","element":"span"},{"href":"#id-100","text":"C.1, ","element":"a"},{"text":"using the polynomial Lyapunov function provided in Assumption ","element":"span"},{"href":"#id-11","text":"4, ","element":"a"},{"text":"we establish upper bounds on the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th moment of hitting time of state ","element":"span"},{"style":{"height":16.99},"width":192.6,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-17.png","element":"img","alt":" 0d from any","inline":true,"padRight":true},{"text":"state ","element":"span"},{"style":{"height":11.8},"width":108.5,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-18.png","element":"img","alt":" x ∈ X","inline":true,"padRight":true},{"text":"and for ","element":"span"},{"style":{"height":13.2},"width":233.44,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-19.png","element":"img","alt":" 1 ≤ i ≤ r + 1","inline":true},{"text":". 
Importantly, the derived bound is polynomial in terms of any component of the state ","element":"span"},{"style":{"height":9.8},"width":32,"height":24.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-20.png","element":"img","alt":" xi","inline":true},{"text":". Additionally, in Appendix ","element":"span"},{"href":"#id-101","text":"C.2, ","element":"a"},{"text":"we characterize the tail probabilities of the return time to state ","element":"span"},{"style":{"height":13.8},"width":35,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-21.png","element":"img","alt":" 0d ","inline":true,"padRight":true},{"text":"starting from ","element":"span"},{"style":{"height":13.8},"width":35,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-22.png","element":"img","alt":" 0d ","inline":true,"padRight":true},{"text":"in terms of the geometric Lyapunov function of Assumption ","element":"span"},{"href":"#id-10","text":"3. ","element":"a"},{"text":"The derived tail bounds will be used in Lemma ","element":"span"},{"href":"#id-62","text":"6 ","element":"a"},{"text":"to derive upper bounds for all moments of hitting times in the model class. 
These bounds, along with the skip-free behavior of the model, allow us to study the maximum state (with respect to ","element":"span"},{"style":{"height":13.8},"width":46.5,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-23.png","element":"img","alt":" ℓ∞","inline":true},{"text":"-norm) achieved up to time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"in MDP ","element":"span"},{"style":{"height":16},"width":230.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-24.png","element":"img","alt":" (X, A, c, Pθ∗)","inline":true,"padRight":true},{"text":"following Algorithm ","element":"span"},{"href":"#id-57","text":"1 ","element":"a"},{"text":"as follows.","element":"span"}],[{"id":"id-62","style":{"fontWeight":"bold"},"text":"Lemma 6. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For ","element":"span"},{"style":{"height":14.2},"width":197.5,"height":35.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-25.png","element":"img","alt":" p ∈ N, the p","inline":true},{"style":{"fontStyle":"italic"},"text":"-th moment of ","element":"span"},{"style":{"height":22.46},"width":395.04,"height":56.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-26.png","element":"img","alt":" max1≤i≤T τ (i)0d and M Tθ∗","inline":true},{"style":{"fontStyle":"italic"},"text":", that is the maximum ","element":"span"},{"style":{"height":14},"width":147.5,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-27.png","element":"img","alt":" ℓ∞-norm","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"of the state vector achieved up until and including time ","element":"span"},{"style":{"height":16},"width":253.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-28.png","element":"img","alt":" T is O(logp T).","inline":true}],[{"text":"In the proof of 
Lemma ","element":"span"},{"href":"#id-62","text":"6 ","element":"a"},{"text":"given in Appendix ","element":"span"},{"href":"#id-102","text":"B.1, ","element":"a"},{"text":"we use the geometric ergodicity of the chain and the fact that hitting times have geometric tails to find an upper bound for moments of ","element":"span"},{"style":{"height":18.2},"width":188.5,"height":45.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-29.png","element":"img","alt":" M Tθ∗. Using","inline":true,"padRight":true},{"text":"this, we aim to bound the number of episodes started before or at ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":", denoted by ","element":"span"},{"style":{"height":13.2},"width":56.88,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-30.png","element":"img","alt":" KT","inline":true,"padRight":true},{"text":". We first find an upper bound for the number of episodes in which the second stopping criterion is met or there exists a state-action pair for which ","element":"span"},{"style":{"height":16},"width":142.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-31.png","element":"img","alt":" Nt(x, a)","inline":true,"padRight":true},{"text":"has more than doubled. In the following lemma, we bound the number of such episodes, which we denote by ","element":"span"},{"style":{"height":13.18},"width":65.84,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-32.png","element":"img","alt":" KM","inline":true},{"text":", in terms of the random variable ","element":"span"},{"style":{"height":18.2},"width":141.5,"height":45.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-33.png","element":"img","alt":" M Tθ∗ and","inline":true,"padRight":true},{"text":"other problem-dependent constants. 
The proof of Lemma ","element":"span"},{"href":"#id-63","text":"7 ","element":"a"},{"text":"is given in Appendix ","element":"span"},{"href":"#id-103","text":"B.2.","element":"a"}],[{"id":"id-63","style":{"fontWeight":"bold"},"text":"Lemma 7. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The number of episodes triggered by the second stopping criterion and started before or at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"style":{"fontStyle":"italic"},"text":", denoted by ","element":"span"},{"style":{"height":13.4},"width":65,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-34.png","element":"img","alt":" KM","inline":true},{"style":{"fontStyle":"italic"},"text":", satisfies ","element":"span"},{"style":{"height":17.9},"width":567.08,"height":44.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/16-35.png","element":"img","alt":" KM ≤ 2|A|(M Tθ∗ + 1)d log2 T a.s.","inline":true}],[{"text":"We next bound the total number of episodes ","element":"span"},{"style":{"height":13.4},"width":54.5,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-0.png","element":"img","alt":" KT","inline":true,"padRight":true},{"text":"by bounding the number of episodes triggered by the first stopping criterion, using the fact that in such episodes, ","element":"span"},{"style":{"height":17.4},"width":243,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-1.png","element":"img","alt":"˜Tk = ˜Tk−1 + 1","inline":true},{"text":". 
Moreover, to address the settling time of each episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", denoted by ","element":"span"},{"style":{"height":17.4},"width":232,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-2.png","element":"img","alt":" Ek = Tk − ˜Tk","inline":true},{"text":", we use the geometric ergodicity property and Lemma ","element":"span"},{"href":"#id-62","text":"6. ","element":"a"},{"text":"Finally, the proof of Lemma ","element":"span"},{"href":"#id-64","text":"8 ","element":"a"},{"text":"is given in Appendix ","element":"span"},{"href":"#id-104","text":"B.3.","element":"a"}],[{"id":"id-64","style":{"fontWeight":"bold"},"text":"Lemma 8. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The number of episodes started by time ","element":"span"},{"style":{"height":29},"width":795.5,"height":72.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-3.png","element":"img","alt":" T satisfies KT ≤ 2√(|A|(M Tθ∗ + 1)dT log2 T) a.s.","inline":true}],[{"text":"From Lemma ","element":"span"},{"href":"#id-64","text":"8, ","element":"a"},{"text":"the upper bound given in Lemma ","element":"span"},{"href":"#id-62","text":"6 ","element":"a"},{"text":"for moments of ","element":"span"},{"style":{"height":18},"width":68,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-4.png","element":"img","alt":" M Tθ∗","inline":true},{"text":", and the Cauchy–Schwarz ","element":"span"},{"text":"inequality, it follows that the expected value of the number of episodes ","element":"span"},{"style":{"height":13.4},"width":55,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-5.png","element":"img","alt":" KT","inline":true,"padRight":true},{"text":"is of the order 
","element":"span"},{"style":{"height":19.8},"width":190.5,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-6.png","element":"img","alt":"˜O(√(|A|T)).","inline":true,"padRight":true},{"text":"This term plays a crucial role in determining the overall order of the total regret up to time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":". In the rest of this section, we present detailed proofs of the lemmas and other results used to prove Theorem ","element":"span"},{"href":"#id-7","text":"1. ","element":"a"},{"style":{"fontWeight":"bold"},"text":"Remark 2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The skip-free-to-the-right property in Assumption ","element":"span"},{"href":"#id-77","style":{"fontStyle":"italic"},"text":"2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"yields a polynomially-sized subset of the underlying state-space that can be explored as a function of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
This polynomially-sized subset can be viewed as the effective finite size of the system in the worst case; directly applying finite-state problem bounds [","element":"span"},{"href":"#id-22","referenceIndex":41,"style":{"fontStyle":"italic"},"text":"41","element":"a"},{"style":{"fontStyle":"italic"},"text":"] would then result in a regret of order ","element":"span"},{"style":{"height":18.82},"width":176.12,"height":47.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-7.png","element":"img","alt":"˜O(T d+0.5)","inline":true},{"style":{"fontStyle":"italic"},"text":"; since ","element":"span"},{"style":{"height":13.2},"width":96.12,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-8.png","element":"img","alt":" d ≥ 1","inline":true},{"style":{"fontStyle":"italic"},"text":", such a coarse bound is not helpful even for asserting asymptotic optimality. To achieve a regret of ","element":"span"},{"style":{"height":18.83},"width":125.04,"height":47.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-9.png","element":"img","alt":"˜O(√T)","inline":true},{"style":{"fontStyle":"italic"},"text":", it is essential to carefully understand and characterize the distribution of ","element":"span"},{"style":{"height":18},"width":68,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-10.png","element":"img","alt":" M Tθ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and then its ","element":"span"},{"style":{"fontStyle":"italic"},"text":"moments, as demonstrated in Lemma ","element":"span"},{"href":"#id-62","style":{"fontStyle":"italic"},"text":"6.","element":"a"}],[{"id":"id-143","style":{"fontWeight":"bold"},"text":"Remark 3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The derived regret bound can be extended to a larger class of MDPs that contain transient states in addition to the single irreducible class. 
Specifically, for any ","element":"span"},{"style":{"height":14.6},"width":167,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-11.png","element":"img","alt":" θ1, θ2 ∈ Θ","inline":true},{"style":{"fontStyle":"italic"},"text":", the Markov process with transition kernel ","element":"span"},{"style":{"fontStyle":"italic"},"text":"obtained from the MDP ","element":"span"},{"style":{"height":16},"width":226.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-12.png","element":"img","alt":" (X, A, c, Pθ1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"by following policy ","element":"span"},{"style":{"height":17},"width":49,"height":42.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-13.png","element":"img","alt":" π∗θ2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"has a single irreducible class ","element":"span"},{"style":{"height":41.8},"width":84,"height":104.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-14.png","element":"img","alt":" Iθ1,θ2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and a set of transient states ","element":"span"},{"style":{"height":15.6},"width":92.68,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-15.png","element":"img","alt":" Tθ1,θ2","inline":true},{"style":{"fontStyle":"italic"},"text":". Furthermore, Assumptions ","element":"span"},{"href":"#id-10","style":{"fontStyle":"italic"},"text":"3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-11","style":{"fontStyle":"italic"},"text":"4 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold for the single irreducible class. 
The proof carries over to this case by the following argument: each episode ","element":"span"},{"style":{"height":13.78},"width":207.4,"height":34.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-16.png","element":"img","alt":" k starts at 0d ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"which is in the irreducible set for the chosen policy ","element":"span"},{"style":{"height":17.4},"width":50.5,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-17.png","element":"img","alt":" π∗θk","inline":true},{"style":{"fontStyle":"italic"},"text":"; hence, throughout the episode, the algorithm remains in the irreducible set, which is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"positive recurrent, and never visits any transient state. In other words, since each episode starts and ends at ","element":"span"},{"style":{"height":13.8},"width":35,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-18.png","element":"img","alt":" 0d ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"under a fixed, episode-dependent policy, only the set reachable from ","element":"span"},{"style":{"height":13.39},"width":36.92,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-19.png","element":"img","alt":" 0d ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"can be explored, and this set is positive recurrent by our assumptions. As a result, we can restrict our proof derivations to the subset that is reachable from ","element":"span"},{"style":{"height":14},"width":35,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-20.png","element":"img","alt":" 0d","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in each episode and follow the same analysis. 
The Lyapunov function based bounds apply to the positive recurrent states, and hence, restricting attention to states reachable from ","element":"span"},{"style":{"height":13.8},"width":35,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-21.png","element":"img","alt":" 0d","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"within each episode, we can use these bounds for our assessment of regret using norms of the state. Thereafter, the coarse bounds on the norms of the state can be applied as carried out in our proof.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark 4. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"By problem-dependent parameters, we refer to the parameters that characterize the complexity or size of the model class ","element":"span"},{"style":{"height":11.8},"width":27,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-22.png","element":"img","alt":" Θ","inline":true},{"style":{"fontStyle":"italic"},"text":". These parameters are not just a function of the size of the state-space and diameter of the MDP (as mentioned in the literature on finite-size problems[","element":"span"},{"href":"#id-26","referenceIndex":4,"style":{"fontStyle":"italic"},"text":"4","element":"a"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"href":"#id-27","referenceIndex":19,"style":{"fontStyle":"italic"},"text":"19","element":"a"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"href":"#id-22","referenceIndex":41,"style":{"fontStyle":"italic"},"text":"41","element":"a"},{"style":{"fontStyle":"italic"},"text":"]), as stability needs to be accounted for in the countable state-space setting. 
The dependence is, thus, more complex and requires the inclusion of stability parameters, such as Lyapunov functions, petite sets, and ergodicity coefficients that are discussed in Assumptions ","element":"span"},{"href":"#id-54","style":{"fontStyle":"italic"},"text":"1-","element":"a"},{"href":"#id-11","style":{"fontStyle":"italic"},"text":"4.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Remark 5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"In the subsequent sections, several equalities and inequalities in the proofs are between random variables and hold almost surely (a.s.). Throughout the remainder, we will omit the explicit mention of a.s., but any such statement should be interpreted in this context.","element":"span"}],[{"id":"id-102","style":{"fontWeight":"bold"},"text":"B.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-62","style":{"fontWeight":"bold"},"text":"6","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":16.6},"width":126.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-23.png","element":"img","alt":" {αi}i≥0","inline":true,"padRight":true},{"text":"be the sequence of hitting times of state ","element":"span"},{"style":{"height":14},"width":35,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-24.png","element":"img","alt":" 0d ","inline":true,"padRight":true},{"text":"starting from ","element":"span"},{"style":{"height":16.58},"width":369.8,"height":41.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-25.png","element":"img","alt":" 0d (set α0 = 0). 
Define","inline":true},{"style":{"height":22.6},"width":54,"height":56.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-26.png","element":"img","alt":"τ (i)0d","inline":true,"padRight":true},{"text":"as the length of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th recurrence time of state ","element":"span"},{"style":{"height":13.8},"width":35,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-27.png","element":"img","alt":" 0d","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":11.6},"width":104.72,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-28.png","element":"img","alt":" i ∈ N","inline":true},{"text":", i.e., ","element":"span"},{"style":{"height":22.45},"width":309.88,"height":56.12,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-29.png","element":"img","alt":" τ (i)0d = αi − αi−1.","inline":true,"padRight":true},{"text":"For ","element":"span"},{"text":"simplicity, we take ","element":"span"},{"style":{"height":22.6},"width":175.5,"height":56.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-30.png","element":"img","alt":" τ0d = τ (1)0d","inline":true,"padRight":true},{"text":". Each such recurrence time is generated using policy ","element":"span"},{"style":{"height":17.2},"width":46,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-31.png","element":"img","alt":" π∗θi","inline":true,"padRight":true},{"text":"that is ","element":"span"},{"text":"determined using the algorithm in operation in an MDP governed by parameter ","element":"span"},{"style":{"height":12.6},"width":36,"height":31.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-32.png","element":"img","alt":" θ∗","inline":true},{"text":". 
Furthermore, ","element":"span"},{"style":{"height":22.6},"width":150,"height":56.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-33.png","element":"img","alt":"{τ (i)0d }i∈N","inline":true,"padRight":true},{"text":"are independent with length at least ","element":"span"},{"text":"1","element":"span"},{"text":", but they need not be identically distributed. The ","element":"span"},{"text":"time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"can be in the middle of one of these recurrence times, so the index of the current recurrence interval is ","element":"span"},{"style":{"height":22.46},"width":546.76,"height":56.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-34.png","element":"img","alt":" N(T) = inf{n : ∑ni=1 τ (i)0d ≥ T}","inline":true},{"text":". Note that the lower bound of ","element":"span"},{"text":"1 ","element":"span"},{"text":"on every ","element":"span"},{"style":{"height":22.6},"width":54,"height":56.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-35.png","element":"img","alt":" τ (i)0d","inline":true,"padRight":true},{"text":"implies that ","element":"span"},{"style":{"height":16},"width":178.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-36.png","element":"img","alt":"N(T) ≤ T","inline":true,"padRight":true},{"text":"a.s. Further, from the skip-free to the right property, the most any component of the state can increase during recurrence time ","element":"span"},{"style":{"height":22.8},"width":182.5,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-37.png","element":"img","alt":" τ (i)0d is hτ (i)0d","inline":true,"padRight":true},{"text":". 
Hence, the most any component of the state (and also ","element":"span"},{"text":"the ","element":"span"},{"style":{"height":16},"width":93.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-38.png","element":"img","alt":" ∥ · ∥∞","inline":true,"padRight":true},{"text":"norm of the state) can increase is given by ","element":"span"},{"style":{"height":22.45},"width":292.12,"height":56.12,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/17-39.png","element":"img","alt":" h maxi=1,...,T τ (i)0d","inline":true,"padRight":true},{"text":"where the random variables","element":"span"}],[{"text":"are independent with geometrically decaying tails with a worst case rate of","element":"span"}],[{"style":{"width":"41%"},"width":664,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/18-0.png","element":"img"}],[{"text":"see Lemma ","element":"span"},{"href":"#id-50","text":"11. ","element":"a"},{"text":"From Lemma ","element":"span"},{"href":"#id-49","text":"10, ","element":"a"},{"text":"we have","element":"span"}],[{"id":"id-111","style":{"width":"100%"},"width":1610,"height":652,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/18-1.png","element":"img"}],[{"text":"and we define ","element":"span"},{"style":{"height":18.99},"width":280.6,"height":47.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/18-2.png","element":"img","alt":" ˜γg∗ := 1 − (˜bg∗)−1","inline":true},{"text":". 
From the definition of ","element":"span"},{"style":{"height":19.8},"width":83,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/18-3.png","element":"img","alt":" bgθ1,θ2 ","inline":true,"padRight":true},{"text":"in Assumption ","element":"span"},{"href":"#id-10","text":"3, ","element":"a"},{"style":{"height":19.8},"width":83,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/18-4.png","element":"img","alt":" bgθ1,θ2 ","inline":true,"padRight":true},{"text":"is greater than ","element":"span"},{"text":"or equal to ","element":"span"},{"style":{"height":22.21},"width":305.04,"height":55.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/18-5.png","element":"img","alt":" 2. Thus, ˜bgθ1,θ2 ≥ 7","inline":true,"padRight":true},{"text":"and we have","element":"span"}],[{"style":{"width":"61%"},"width":974,"height":152,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/18-6.png","element":"img"}],[{"text":"and as a result of Lemma ","element":"span"},{"href":"#id-50","text":"11,","element":"a"}],[{"id":"id-105","style":{"width":"72%"},"width":1150,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/18-7.png","element":"img"}],[{"text":"We upper bound ","element":"span"},{"style":{"height":19.4},"width":134.5,"height":48.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/18-8.png","element":"img","alt":" E[M Tθ∗]","inline":true},{"text":"using the independence of ","element":"span"},{"style":{"height":22.6},"width":150,"height":56.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/18-9.png","element":"img","alt":" {τ (i)0d }i∈N","inline":true,"padRight":true},{"text":"and the above equation,","element":"span"}],[{"style":{"width":"80%"},"width":1270,"height":528,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/18-10.png","element":"img"}],[{"text":"where 
","element":"span"},{"style":{"height":9.8},"width":37.5,"height":24.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/18-11.png","element":"img","alt":" n0","inline":true,"padRight":true},{"text":"is the smallest ","element":"span"},{"style":{"height":17.4},"width":465.5,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/18-12.png","element":"img","alt":" n ≥ 0 such that cg∗ (γg∗)n < 1","inline":true},{"text":". By a Riemann sum approximation, we get","element":"span"}],[{"style":{"width":"54%"},"width":870,"height":448,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/18-13.png","element":"img"}],[{"text":"where the last inequality follows from ","element":"span"},{"style":{"height":21},"width":976,"height":52.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-0.png","element":"img","alt":"∑Tn=1 n−1 ≤ log T + 1 and thus E[M Tθ∗] is O(h log T). We","inline":true,"padRight":true},{"text":"now extend the result to moments of order greater than one. 
From ","element":"span"},{"href":"#id-105","text":"(22)","element":"a"},{"text":", for ","element":"span"},{"style":{"height":13.2},"width":173.5,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-1.png","element":"img","alt":" 1 ≤ i ≤ T,","inline":true}],[{"style":{"width":"63%"},"width":1012,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-2.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":22.6},"width":1003,"height":56.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-3.png","element":"img","alt":" n ≥ n0, let t = n − n0 ≥ 0 and Yi = max(τ (i)0d − n0, 0) to get","inline":true}],[{"style":{"width":"45%"},"width":716,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-4.png","element":"img"}],[{"text":"which means random variables ","element":"span"},{"style":{"height":17.8},"width":122.5,"height":44.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-5.png","element":"img","alt":" {Yi}Ti=1 ","inline":true,"padRight":true},{"text":"are stochastically dominated by independent and identically ","element":"span"},{"text":"distributed geometric random variables with parameter ","element":"span"},{"style":{"height":16},"width":101,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-6.png","element":"img","alt":" 1 − γg∗","inline":true},{"text":". 
Furthermore, [","element":"span"},{"href":"#id-106","referenceIndex":48,"text":"48","element":"a"},{"text":"] argues that the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"-th ","element":"span"},{"text":"moment of the maximum of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"independent and identically distributed geometric random variables is ","element":"span"},{"style":{"height":16},"width":168.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-7.png","element":"img","alt":"O(logp T)","inline":true},{"text":". Thus, the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"-th moment of ","element":"span"},{"style":{"height":16.78},"width":501.24,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-8.png","element":"img","alt":" max1≤i≤T Yi is O(logp T) and","inline":true}],[{"style":{"width":"80%"},"width":1280,"height":156,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-9.png","element":"img"}],[{"text":"which gives","element":"span"}],[{"style":{"width":"70%"},"width":1124,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-10.png","element":"img"}],[{"text":"Since the right-hand side of the above equation is ","element":"span"},{"style":{"height":16.2},"width":210.5,"height":40.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-11.png","element":"img","alt":" O(hp logp T)","inline":true},{"text":", the claim is proved.","element":"span"}],[{"id":"id-103","style":{"fontWeight":"bold"},"text":"B.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-63","style":{"fontWeight":"bold"},"text":"7","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":16},"width":160.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-12.png","element":"img","alt":" KM(x, a)","inline":true,"padRight":true},{"text":"be the number of episodes ","element":"span"},{"style":{"height":13.6},"width":387.5,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-13.png","element":"img","alt":" k such that 1 ≤ k ≤ KT","inline":true,"padRight":true},{"text":"and in which the number of visits to the state-action pair ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", a","element":"span"},{"text":") ","element":"span"},{"text":"is increased more than twice at episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", or","element":"span"}],[{"style":{"width":"56%"},"width":888,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-14.png","element":"img"}],[{"text":"As for every episode in the above set the number of visits to ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":") ","element":"span"},{"text":"doubles,","element":"span"}],[{"style":{"width":"37%"},"width":596,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-15.png","element":"img"}],[{"text":"and we can upper bound ","element":"span"},{"style":{"height":13.4},"width":64.5,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-16.png","element":"img","alt":" KM","inline":true,"padRight":true},{"text":"as follows","element":"span"}],[{"style":{"width":"77%"},"width":1222,"height":284,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-17.png","element":"img"}],[{"text":"This completes the 
proof.","element":"span"}],[{"id":"id-104","style":{"fontWeight":"bold"},"text":"B.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-64","style":{"fontWeight":"bold"},"text":"8","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We define macro episodes with start times ","element":"span"},{"style":{"height":15.4},"width":456,"height":38.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-18.png","element":"img","alt":" tnk, k = 1, 2, . . . , KM + 1","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":14},"width":147,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-19.png","element":"img","alt":" tn1 = t1","inline":true},{"text":", ","element":"span"},{"style":{"height":17.4},"width":272,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-20.png","element":"img","alt":"tnKM +1 = T + 1","inline":true,"padRight":true},{"text":"(which is equivalent to ","element":"span"},{"style":{"height":15.2},"width":746.72,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-21.png","element":"img","alt":" nKM+1 = KT + 1), and for 1 < k < KM + 1","inline":true}],[{"style":{"width":"72%"},"width":1152,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-22.png","element":"img"}],[{"text":"which are episodes wherein the second stopping criterion is triggered. Any episode (except for the last episode) in a macro episode must be triggered by the first stopping criterion; equivalently, ","element":"span"},{"style":{"height":22.96},"width":1584.52,"height":57.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-23.png","element":"img","alt":"˜Tj = ˜Tj−1 + 1 for all j = nk, nk + 1, . . . , nk+1 − 2. 
For 1 ≤ k ≤ KM, let T Mk = ∑nk+1−1j=nk Tj be","inline":true,"padRight":true},{"text":"the length of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-th macro episode. We have","element":"span"}],[{"style":{"width":"96%"},"width":1534,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/19-24.png","element":"img"}],[{"text":"Consequently, ","element":"span"},{"style":{"height":29},"width":682.5,"height":72.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/20-0.png","element":"img","alt":" nk+1 − nk ≤ √(2T Mk) for all 1 ≤ k ≤ KM","inline":true},{"text":". From this, we obtain","element":"span"}],[{"style":{"width":"54%"},"width":866,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/20-1.png","element":"img"}],[{"text":"Using the above equation and the fact that ","element":"span"},{"style":{"height":20.6},"width":368.5,"height":51.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/20-2.png","element":"img","alt":"∑KMk=1 T Mk = T we get","inline":true}],[{"style":{"width":"52%"},"width":834,"height":152,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/20-3.png","element":"img"}],[{"text":"Finally, from Lemma ","element":"span"},{"href":"#id-63","text":"7 ","element":"a"},{"text":"we get","element":"span"}],[{"style":{"width":"51%"},"width":818,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/20-4.png","element":"img"}],[{"text":"This completes the proof.","element":"span"}],[{"id":"id-67","style":{"fontWeight":"bold"},"text":"B.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-66","style":{"fontWeight":"bold"},"text":"1","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":17.4},"width":330.5,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/20-5.png","element":"img","alt":" Ek = Tk − ˜Tk ≥ 0","inline":true,"padRight":true},{"text":"be the settling time needed to return to state ","element":"span"},{"style":{"height":13.8},"width":35,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/20-6.png","element":"img","alt":" 0d","inline":true,"padRight":true},{"text":"after a stopping criterion is triggered in episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":". We have","element":"span"}],[{"id":"id-108","style":{"width":"81%"},"width":1290,"height":266,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/20-7.png","element":"img"}],[{"text":"We first simplify the first term in the above summation. From the monotone convergence theorem,","element":"span"}],[{"style":{"width":"47%"},"width":750,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/20-8.png","element":"img"}],[{"text":"Note that the first stopping criterion of Algorithm ","element":"span"},{"href":"#id-57","text":"1 ","element":"a"},{"text":"ensures that ","element":"span"},{"style":{"height":17.4},"width":240,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/20-9.png","element":"img","alt":"˜Tk ≤ ˜Tk−1 + 1","inline":true,"padRight":true},{"text":"at all episodes ","element":"span"},{"style":{"height":13},"width":100.5,"height":32.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/20-10.png","element":"img","alt":" k ≥ 1.","inline":true,"padRight":true},{"text":"Hence","element":"span"}],[{"style":{"width":"55%"},"width":876,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/20-11.png","element":"img"}],[{"text":"Since 
","element":"span"},{"style":{"height":20.4},"width":307,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/20-12.png","element":"img","alt":" I{tk≤T }( ˜Tk−1 + 1)","inline":true,"padRight":true},{"text":"is measurable with respect to ","element":"span"},{"style":{"height":15},"width":58.5,"height":37.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/20-13.png","element":"img","alt":" Htk","inline":true},{"text":", by ","element":"span"},{"href":"#id-107","text":"(20) ","element":"a"},{"text":"we get","element":"span"}],[{"style":{"width":"64%"},"width":1020,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/20-14.png","element":"img"}],[{"text":"Therefore,","element":"span"}],[{"style":{"width":"86%"},"width":1364,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/20-15.png","element":"img"}],[{"text":"Thus,","element":"span"}],[{"id":"id-109","style":{"width":"92%"},"width":1462,"height":358,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/20-16.png","element":"img"}],[{"text":"For the second term in ","element":"span"},{"href":"#id-108","text":"(23)","element":"a"},{"text":", from Assumption ","element":"span"},{"href":"#id-65","text":"5","element":"a"}],[{"id":"id-110","style":{"width":"81%"},"width":1290,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/21-0.png","element":"img"}],[{"text":"Substituting ","element":"span"},{"href":"#id-109","text":"(24) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-110","text":"(25) ","element":"a"},{"text":"in ","element":"span"},{"href":"#id-108","text":"(23)","element":"a"},{"text":", we get","element":"span"}],[{"style":{"width":"43%"},"width":696,"height":268,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/21-1.png","element":"img"}],[{"id":"id-70","style":{"fontWeight":"bold"},"text":"B.5 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-69","style":{"fontWeight":"bold"},"text":"2","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We note that the state of the MDP is equal to ","element":"span"},{"style":{"height":13.8},"width":35,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/21-2.png","element":"img","alt":" 0d","inline":true,"padRight":true},{"text":"at the beginning of all episodes and the relative value function ","element":"span"},{"style":{"height":16},"width":111,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/21-3.png","element":"img","alt":" v(x; θ)","inline":true,"padRight":true},{"text":"is equal to 0 at ","element":"span"},{"style":{"height":16},"width":363,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/21-4.png","element":"img","alt":" x = 0d for all θ. Thus,","inline":true}],[{"style":{"width":"81%"},"width":1292,"height":472,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/21-5.png","element":"img"}],[{"text":"From the lower bound derived for the relative value function in ","element":"span"},{"href":"#id-61","text":"(14)","element":"a"},{"text":",","element":"span"}],[{"style":{"width":"50%"},"width":798,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/21-6.png","element":"img"}],[{"text":"where the second inequality follows from ","element":"span"},{"href":"#id-111","text":"(21) ","element":"a"},{"text":"in the proof of Lemma ","element":"span"},{"href":"#id-62","text":"6. 
","element":"a"},{"text":"We also note that ","element":"span"},{"style":{"height":16},"width":138.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/21-7.png","element":"img","alt":" ∥X(T +","inline":true},{"style":{"height":17.9},"width":401.76,"height":44.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/21-8.png","element":"img","alt":"1)∥∞ ≤ M Tθ∗ + h. Thus,","inline":true}],[{"style":{"width":"69%"},"width":1102,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/21-9.png","element":"img"}],[{"text":"From the inequality ","element":"span"},{"style":{"height":15.8},"width":525.5,"height":39.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/21-10.png","element":"img","alt":" (a + b)r ≤ 2r(ar + br), we have","inline":true}],[{"style":{"width":"59%"},"width":936,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/21-11.png","element":"img"}],[{"id":"id-76","style":{"fontWeight":"bold"},"text":"B.6 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-71","style":{"fontWeight":"bold"},"text":"3","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":19.92},"width":480.04,"height":49.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/21-12.png","element":"img","alt":" Z (t) =�X (t) , π∗θk (X (t))�","inline":true},{"text":"be the state-action pair at ","element":"span"},{"style":{"height":14.78},"width":297.4,"height":36.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/21-13.png","element":"img","alt":" tk ≤ t < tk+1. 
R2","inline":true,"padRight":true},{"text":"can be upper bounded as","element":"span"}],[{"id":"id-112","style":{"width":"95%"},"width":1510,"height":442,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/21-14.png","element":"img"}],[{"text":"We have","element":"span"}],[{"style":{"width":"100%"},"width":1592,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-0.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":18.88},"width":213.36,"height":47.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-1.png","element":"img","alt":" Pˆθk(y|Z (t))","inline":true,"padRight":true},{"text":"is the empirical transition probability defined as","element":"span"}],[{"style":{"width":"39%"},"width":626,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-2.png","element":"img"}],[{"text":"and for any tuple ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", a, ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"y","element":"span"},{"text":")","element":"span"},{"text":", we define ","element":"span"},{"style":{"height":16},"width":493.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-3.png","element":"img","alt":" N1(x, a, y) = 0 and for t > 1,","inline":true}],[{"style":{"width":"95%"},"width":1512,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-4.png","element":"img"}],[{"text":"Thus, from ","element":"span"},{"href":"#id-112","text":"(26) ","element":"a"},{"text":"and defining random variable ","element":"span"},{"style":{"height":11.4},"width":330,"height":28.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-5.png","element":"img","alt":" vM = max 
1≤k≤KT","inline":true},{"style":{"height":13.82},"width":168.28,"height":34.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-6.png","element":"img","alt":"∥x∥∞≤M Tθ∗","inline":true},{"style":{"height":16},"width":166.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-7.png","element":"img","alt":"|v(x; θk)|,","inline":true}],[{"id":"id-113","style":{"width":"99%"},"width":1576,"height":168,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-8.png","element":"img"}],[{"text":"We define set ","element":"span"},{"style":{"height":13.4},"width":45,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-9.png","element":"img","alt":" Bk","inline":true,"padRight":true},{"text":"as the set of parameters ","element":"span"},{"style":{"height":11.6},"width":17,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-10.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"for which the transition kernel ","element":"span"},{"style":{"height":16},"width":121.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-11.png","element":"img","alt":" Pθ(·|z)","inline":true,"padRight":true},{"text":"is close to the empirical transition kernel ","element":"span"},{"style":{"height":18.88},"width":137.64,"height":47.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-12.png","element":"img","alt":" Pˆθk(·|z)","inline":true,"padRight":true},{"text":"at episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"for every state-action pair ","element":"span"},{"style":{"height":16},"width":385.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-13.png","element":"img","alt":" z = (x, a) ∈ X × A, 
or","inline":true}],[{"style":{"width":"84%"},"width":1342,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-14.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":139.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-15.png","element":"img","alt":" βk(z) =","inline":true},{"style":{"height":38.4},"width":40,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-16.png","element":"img","alt":"�","inline":true,"padRight":true},{"text":"14 ","element":"span"},{"style":{"height":13.62},"width":175,"height":34.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-17.png","element":"img","alt":"�di=1(xi+h)","inline":true,"padRight":true},{"text":"max(1","element":"span"},{"style":{"height":28.8},"width":305,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-18.png","element":"img","alt":",Ntk (z)) log�2|A|T˜δ","inline":true},{"style":{"height":28.8},"width":701.84,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-19.png","element":"img","alt":"�for x = (x1, . . . , xd) and some 0 < ˜δ < 1","inline":true},{"text":", which will be determined later. 
We simplify the ","element":"span"},{"style":{"height":13.8},"width":30,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-20.png","element":"img","alt":" ℓ1","inline":true},{"text":"-difference of the real and empirical transition kernels as follows","element":"span"}],[{"style":{"width":"88%"},"width":1404,"height":178,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-21.png","element":"img"}],[{"text":"Similarly, we have","element":"span"}],[{"style":{"width":"60%"},"width":956,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-22.png","element":"img"}],[{"text":"Substituting in ","element":"span"},{"href":"#id-113","text":"(27)","element":"a"},{"text":", we get","element":"span"}],[{"id":"id-116","style":{"width":"95%"},"width":1520,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-23.png","element":"img"}],[{"text":"We first find an upper bound for ","element":"span"},{"style":{"height":11.46},"width":332.56,"height":28.64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-24.png","element":"img","alt":" vM = max 1≤k≤KT","inline":true},{"style":{"height":13.8},"width":161,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-25.png","element":"img","alt":"∥x∥∞≤M Tθ∗","inline":true},{"style":{"height":16},"width":156.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-26.png","element":"img","alt":"|v(x; θk)|","inline":true,"padRight":true},{"text":"using the bounds derived in ","element":"span"},{"href":"#id-60","text":"(13) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-61","text":"(14)","element":"a"},{"text":". 
From ","element":"span"},{"href":"#id-60","text":"(13)","element":"a"},{"text":",","element":"span"}],[{"id":"id-115","style":{"width":"89%"},"width":1414,"height":478,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-27.png","element":"img"}],[{"text":"where the second line follows from the inequality ","element":"span"},{"style":{"height":16},"width":409.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-28.png","element":"img","alt":" (a + b)r ≤ 2r(ar + br)","inline":true},{"text":", the fifth line from Lemma ","element":"span"},{"href":"#id-49","text":"10, ","element":"a"},{"text":"and the last line from Assumption ","element":"span"},{"href":"#id-11","text":"4 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-111","text":"(21)","element":"a"},{"text":". We further have","element":"span"}],[{"style":{"width":"66%"},"width":1062,"height":270,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/22-29.png","element":"img"}],[{"text":"where using the definition of ","element":"span"},{"href":"#id-114","style":{"height":20.6},"width":214,"height":51.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-0.png","element":"img","alt":" bηjθ1,θ2 in (38),","inline":true}],[{"style":{"width":"92%"},"width":1466,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-1.png","element":"img"}],[{"text":"We also define","element":"span"}],[{"style":{"width":"72%"},"width":1144,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-2.png","element":"img"}],[{"text":"We next find a lower bound for ","element":"span"},{"style":{"height":16},"width":129.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-3.png","element":"img","alt":" v(x; θk)","inline":true,"padRight":true},{"text":"using ","element":"span"},{"href":"#id-61","text":"(14) ","element":"a"},{"text":"as 
follows:","element":"span"}],[{"style":{"width":"54%"},"width":872,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-4.png","element":"img"}],[{"text":"Combining ","element":"span"},{"href":"#id-115","text":"(29) ","element":"a"},{"text":"and the above equation, we get a uniform upper bound for ","element":"span"},{"style":{"height":16},"width":392.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-5.png","element":"img","alt":" |v(x; θk)| over Θ, which","inline":true,"padRight":true},{"text":"we use to upper bound ","element":"span"},{"style":{"height":11.46},"width":330.96,"height":28.64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-6.png","element":"img","alt":" vM = max 1≤k≤KT","inline":true},{"style":{"height":13.8},"width":161.5,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-7.png","element":"img","alt":"∥x∥∞≤M Tθ∗","inline":true},{"style":{"height":16},"width":305.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-8.png","element":"img","alt":"|v(x; θk)| as below","inline":true}],[{"id":"id-118","style":{"width":"92%"},"width":1470,"height":244,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-9.png","element":"img"}],[{"text":"where the constant terms are defined as","element":"span"}],[{"style":{"width":"90%"},"width":1440,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-10.png","element":"img"}],[{"text":"A deterministic upper bound on ","element":"span"},{"style":{"height":9.6},"width":50.5,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-11.png","element":"img","alt":" vM","inline":true,"padRight":true},{"text":"can also be found from the above equation. 
Noting that from Assumption ","element":"span"},{"href":"#id-77","text":"2, ","element":"a"},{"text":"until time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"only states with each component less than or equal to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"hT ","element":"span"},{"text":"are visited, we have","element":"span"}],[{"style":{"width":"49%"},"width":786,"height":66,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-12.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":") ","element":"span"},{"text":"is a polynomial defined as above. Using the bounds derived for ","element":"span"},{"style":{"height":9.6},"width":50.5,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-13.png","element":"img","alt":" vM","inline":true},{"text":", we bound ","element":"span"},{"style":{"height":13.4},"width":43.5,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-14.png","element":"img","alt":" R2","inline":true,"padRight":true},{"text":"starting with the first term on the right-hand side of ","element":"span"},{"href":"#id-116","text":"(28)","element":"a"},{"text":". 
We have","element":"span"}],[{"id":"id-117","style":{"width":"97%"},"width":1542,"height":550,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-15.png","element":"img"}],[{"text":"where the last inequality follows from ","element":"span"},{"href":"#id-107","text":"(20) ","element":"a"},{"text":"and the fact that set ","element":"span"},{"style":{"height":13.6},"width":45,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-16.png","element":"img","alt":" Bk","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":15.2},"width":92.5,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-17.png","element":"img","alt":" Htk−","inline":true},{"text":"measurable. To further simplify the first term in ","element":"span"},{"href":"#id-116","text":"(28)","element":"a"},{"text":", we find an upper bound for ","element":"span"},{"style":{"height":16.32},"width":211.44,"height":40.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-18.png","element":"img","alt":" P {θ∗ /∈ Bk}","inline":true,"padRight":true},{"text":"using [","element":"span"},{"href":"#id-73","referenceIndex":55,"text":"55","element":"a"},{"text":"]. 
For a fixed ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"z ","element":"span"},{"text":"= (","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", a","element":"span"},{"text":") ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"independent samples of the distribution ","element":"span"},{"style":{"height":16},"width":140.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-19.png","element":"img","alt":" Pθ∗(.|z)","inline":true},{"text":", the ","element":"span"},{"style":{"height":13.2},"width":39.5,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-20.png","element":"img","alt":" L1","inline":true},{"text":"-deviation of the true distribution ","element":"span"},{"style":{"height":16},"width":140.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-21.png","element":"img","alt":" Pθ∗(.|z)","inline":true,"padRight":true},{"text":"and empirical distribution at the end of episode ","element":"span"},{"style":{"height":18.9},"width":178.84,"height":47.24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-22.png","element":"img","alt":" k, Pˆθk(.|z)","inline":true},{"text":", is bounded in [","element":"span"},{"href":"#id-74","referenceIndex":8,"text":"8","element":"a"},{"text":"] 
as","element":"span"}],[{"style":{"width":"98%"},"width":1556,"height":148,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-23.png","element":"img"}],[{"text":"Therefore,","element":"span"}],[{"style":{"width":"81%"},"width":1296,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/23-24.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"77%"},"width":1230,"height":344,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/24-0.png","element":"img"}],[{"text":"The probability that at episode ","element":"span"},{"style":{"height":13.4},"width":102,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/24-1.png","element":"img","alt":" k ≤ T","inline":true},{"text":", the true parameter ","element":"span"},{"style":{"height":12.6},"width":36,"height":31.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/24-2.png","element":"img","alt":" θ∗","inline":true,"padRight":true},{"text":"does not belong to the confidence set ","element":"span"},{"style":{"height":13.4},"width":45,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/24-3.png","element":"img","alt":"Bk","inline":true,"padRight":true},{"text":"can be bounded using the above and union bound as","element":"span"}],[{"style":{"width":"73%"},"width":1162,"height":610,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/24-4.png","element":"img"}],[{"text":"In the summation in the above equation, we have simplified the expression by summing over ","element":"span"},{"style":{"height":13.6},"width":139.5,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/24-5.png","element":"img","alt":" xi ≤ hT","inline":true,"padRight":true},{"text":"instead of considering the more detailed summation over 
","element":"span"},{"style":{"height":18},"width":159.5,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/24-6.png","element":"img","alt":" xi ≤ M Tθ∗","inline":true},{"text":". However, this simplification ","element":"span"},{"text":"does not affect the final evaluation of regret, as this term is not dominant and only contributes to a logarithmic term in the regret bound. Substituting in ","element":"span"},{"href":"#id-117","text":"(31)","element":"a"},{"text":",","element":"span"}],[{"id":"id-121","style":{"width":"95%"},"width":1506,"height":332,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/24-7.png","element":"img"}],[{"text":"We now upper bound the second term in ","element":"span"},{"href":"#id-116","text":"(28)","element":"a"},{"text":". From ","element":"span"},{"href":"#id-118","text":"(30)","element":"a"},{"text":",","element":"span"}],[{"id":"id-120","style":{"width":"90%"},"width":1430,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/24-8.png","element":"img"}],[{"text":"To bound the regret term resulting from the summation of ","element":"span"},{"style":{"height":16},"width":165.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/24-9.png","element":"img","alt":" βk (Z (t))","inline":true},{"text":", we note that from the second stopping criterion, ","element":"span"},{"style":{"height":16.08},"width":841.84,"height":40.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/24-10.png","element":"img","alt":" Nt (Z (t)) ≤ 2Ntk (Z (t)) for all tk ≤ t < tk+1 and","inline":true}],[{"id":"id-119","style":{"width":"98%"},"width":1554,"height":492,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/24-11.png","element":"img"}],[{"text":"The first summation can be simplified 
as","element":"span"}],[{"style":{"width":"95%"},"width":1518,"height":812,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/25-0.png","element":"img"}],[{"text":"For the second term in ","element":"span"},{"href":"#id-119","text":"(34)","element":"a"},{"text":", we get","element":"span"}],[{"style":{"width":"77%"},"width":1234,"height":380,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/25-1.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.4},"width":252,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/25-2.png","element":"img","alt":" Ek = Tk − ˜Tk","inline":true},{"text":", and ","element":"span"},{"style":{"height":13.4},"width":55,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/25-3.png","element":"img","alt":" KT","inline":true,"padRight":true},{"text":"is bounded from Lemma ","element":"span"},{"href":"#id-64","text":"8. ","element":"a"},{"text":"Thus ","element":"span"},{"style":{"height":22.2},"width":423.5,"height":55.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/25-4.png","element":"img","alt":"�KTk=1�tk+1−1t=tk βk (Z (t))","inline":true,"padRight":true},{"text":"is bounded as","element":"span"}],[{"style":{"width":"85%"},"width":1354,"height":130,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/25-5.png","element":"img"}],[{"text":"Substituting the above bound in ","element":"span"},{"href":"#id-120","text":"(33)","element":"a"},{"text":",","element":"span"}],[{"style":{"width":"83%"},"width":1320,"height":406,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/25-6.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":15.4},"width":200,"height":38.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/25-7.png","element":"img","alt":" cp3 := 48cp2","inline":true},{"text":". 
Finally, from the above equation, ","element":"span"},{"href":"#id-121","text":"(32)","element":"a"},{"text":", and ","element":"span"},{"href":"#id-116","text":"(28)","element":"a"},{"text":",","element":"span"}],[{"style":{"width":"77%"},"width":1224,"height":196,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/25-8.png","element":"img"}],[{"text":"By choosing ","element":"span"},{"style":{"height":23.2},"width":302,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/25-9.png","element":"img","alt":"δ̃ = 1/(T Q(T)), we get","inline":true}],[{"style":{"width":"100%"},"width":1648,"height":238,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/25-10.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"style":{"height":18.8},"width":325.5,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-0.png","element":"img","alt":"(T) = cp2(Th)r+rp∗.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"B.7 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-7","style":{"fontWeight":"bold"},"text":"1","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Lemmas ","element":"span"},{"href":"#id-66","text":"1, ","element":"a"},{"href":"#id-69","text":"2, ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-71","text":"3 ","element":"a"},{"text":"along with the Cauchy-Schwarz inequality show that the regret terms ","element":"span"},{"style":{"height":13.6},"width":43.5,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-1.png","element":"img","alt":"R0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.4},"width":43.5,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-2.png","element":"img","alt":" R2","inline":true,"padRight":true},{"text":"are of the order ","element":"span"},{"style":{"height":19.6},"width":460.84,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-3.png","element":"img","alt":"Õ(KrdJ∗h^(d+2r+rp∗)√(|A|T))","inline":true,"padRight":true},{"text":"and the term ","element":"span"},{"style":{"height":13.4},"width":42.5,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-4.png","element":"img","alt":" R1","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":18.83},"width":195.6,"height":47.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-5.png","element":"img","alt":"Õ(J∗(h)rp∗)","inline":true},{"text":". 
Therefore, from ","element":"span"},{"style":{"height":16},"width":586.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-6.png","element":"img","alt":" R(T, πT SDE) = R0 + R1 + R2","inline":true},{"text":", the regret of Algorithm ","element":"span"},{"href":"#id-57","text":"1, ","element":"a"},{"style":{"height":16},"width":227.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-7.png","element":"img","alt":" R(T, πT SDE)","inline":true},{"text":", is ","element":"span"},{"style":{"height":19.6},"width":470.28,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-8.png","element":"img","alt":"Õ(KrdJ∗h^(d+2r+rp∗)√(|A|T)).","inline":true}],[{"id":"id-78","style":{"fontWeight":"bold"},"text":"B.8 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Requirement of an optimal policy oracle.","element":"span"}],[{"text":"To implement our algorithm, we need to find the optimal policy for each model sampled by the algorithm: the optimal policy for Theorem ","element":"span"},{"href":"#id-7","text":"1 ","element":"a"},{"text":"and the optimal policy within policy class ","element":"span"},{"style":{"height":10.8},"width":28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-9.png","element":"img","alt":" Π","inline":true,"padRight":true},{"text":"for Corollary ","element":"span"},{"href":"#id-8","text":"1; ","element":"a"},{"text":"such an oracle has also been used in past work [","element":"span"},{"href":"#id-27","referenceIndex":19,"text":"19","element":"a"},{"text":", ","element":"span"},{"href":"#id-29","referenceIndex":20,"text":"20","element":"a"},{"text":", ","element":"span"},{"href":"#id-30","referenceIndex":29,"text":"29","element":"a"},{"text":"]. 
In the finite state-space setting, [","element":"span"},{"href":"#id-22","referenceIndex":41,"text":"41","element":"a"},{"text":"] provides a schedule of ","element":"span"},{"style":{"height":7.2},"width":14,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-10.png","element":"img","alt":" ϵ","inline":true,"padRight":true},{"text":"values and selects ","element":"span"},{"style":{"height":7.2},"width":14,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-11.png","element":"img","alt":" ϵ","inline":true},{"text":"-optimal policies to obtain ","element":"span"},{"style":{"height":18.83},"width":125.04,"height":47.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-12.png","element":"img","alt":"˜O(√T)","inline":true,"padRight":true},{"text":"regret guarantees. The issue with extending the analysis of [","element":"span"},{"href":"#id-22","referenceIndex":41,"text":"41","element":"a"},{"text":"] to the countable state-space setting is that we need to ensure (uniform) ergodicity for the chosen ","element":"span"},{"style":{"height":7.2},"width":14,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-13.png","element":"img","alt":" ϵ","inline":true},{"text":"-optimal policies; the ","element":"span"},{"text":"lim sup ","element":"span"},{"text":"or ","element":"span"},{"text":"lim inf ","element":"span"},{"text":"of the time-average expected reward (used to define the average cost problem) being finite doesn’t imply ergodicity. In other words, we must formulate (and verify) ergodicity assumptions for a potentially large set of close-to-optimal algorithms whose structure is undetermined. 
Another issue is that, to the best of our knowledge, there isn’t a general structural characterization of all ","element":"span"},{"style":{"height":7.2},"width":13.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-14.png","element":"img","alt":" ϵ","inline":true},{"text":"-optimal stationary policies for countable state-space MDPs or even a characterization of the policy within this set that is selected by any computational procedure in the literature; current results only discuss existence and characterization of the stationary optimal policy. In the absence of such results, stability assumptions with the same uniformity across models as in this work will be needed, which are likely too strong to be useful.","element":"span"}],[{"text":"If we can verify the stability requirements of Assumptions ","element":"span"},{"href":"#id-10","text":"3 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-11","text":"4 ","element":"a"},{"text":"for a subset of policies, the optimal policy oracle is not needed, and instead, by choosing approximately optimal policies within this subset, we can follow the same proof steps as [","element":"span"},{"href":"#id-22","referenceIndex":41,"text":"41","element":"a"},{"text":"] to guarantee regret performance similar to Corollary ","element":"span"},{"href":"#id-8","text":"1 ","element":"a"},{"text":"(without knowledge of model parameters). To theoretically analyze the performance of the algorithm that follows an approximately optimal policy rather than the optimal one, we assume that for a specific sequence of ","element":"span"},{"style":{"height":16.6},"width":232.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-15.png","element":"img","alt":" {ϵk}∞k=1, an ϵk","inline":true},{"text":"-optimal policy is given, which is defined below.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Definition 1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Policy ","element":"span"},{"style":{"height":11.6},"width":101,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-16.png","element":"img","alt":" π ∈ Π","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is called an ","element":"span"},{"style":{"height":7.2},"width":14,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-17.png","element":"img","alt":" ϵ","inline":true},{"style":{"fontStyle":"italic"},"text":"-optimal policy if for every ","element":"span"},{"style":{"height":13.6},"width":103.5,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-18.png","element":"img","alt":" θ ∈ Θ,","inline":true}],[{"style":{"width":"92%"},"width":1466,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-19.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":15.6},"width":37.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-20.png","element":"img","alt":" π∗θ ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the optimal policy in the policy class ","element":"span"},{"style":{"height":11},"width":28,"height":27.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-21.png","element":"img","alt":" Π","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"corresponding to parameter ","element":"span"},{"style":{"height":16},"width":293,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-22.png","element":"img","alt":" θ and v(.; θ) is the","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"solution to Poisson equation ","element":"span"},{"href":"#id-53","text":"(5)","element":"a"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"Given 
","element":"span"},{"style":{"height":7.2},"width":13.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-23.png","element":"img","alt":" ϵ","inline":true},{"text":"-optimal policies that satisfy Assumptions ","element":"span"},{"href":"#id-10","text":"3 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-11","text":"4, ","element":"a"},{"text":"in Theorem ","element":"span"},{"href":"#id-9","text":"2 ","element":"a"},{"text":"we extend the regret guarantees of Corollary ","element":"span"},{"href":"#id-8","text":"1 ","element":"a"},{"text":"to the algorithm employing ","element":"span"},{"style":{"height":7.2},"width":13.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-24.png","element":"img","alt":" ϵ","inline":true},{"text":"-optimal policy, instead of the best-in-class policy, and show that the same regret upper bounds continue to apply.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Consider a non-negative sequence ","element":"span"},{"style":{"height":16.51},"width":133.72,"height":41.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-25.png","element":"img","alt":" {ϵk}∞k=1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for every ","element":"span"},{"style":{"height":13.6},"width":150,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-26.png","element":"img","alt":" k ∈ N, ϵk","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is bounded ","element":"span"},{"style":{"fontStyle":"italic"},"text":"above by ","element":"span"},{"style":{"height":20.6},"width":224,"height":51.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-27.png","element":"img","alt":"1k+1 and an ϵk","inline":true},{"style":{"fontStyle":"italic"},"text":"-optimal policy satisfying Assumptions ","element":"span"},{"href":"#id-10","style":{"fontStyle":"italic"},"text":"3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-11","style":{"fontStyle":"italic"},"text":"4 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"is given. 
The regret incurred ","element":"span"},{"style":{"fontStyle":"italic"},"text":"by Algorithm ","element":"span"},{"href":"#id-57","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"while using the ","element":"span"},{"style":{"height":9.6},"width":30.5,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-28.png","element":"img","alt":" ϵk","inline":true},{"style":{"fontStyle":"italic"},"text":"-optimal policy during any episode ","element":"span"},{"style":{"height":19.6},"width":325.92,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-29.png","element":"img","alt":" k is Õ(dh^d √(|A|T)).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"For the ","element":"span"},{"style":{"height":9.6},"width":30.5,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-30.png","element":"img","alt":" ϵk","inline":true},{"text":"-optimal policy used in episode ","element":"span"},{"style":{"height":14.8},"width":412.08,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-31.png","element":"img","alt":" k, shown by πϵk, we have","inline":true}],[{"style":{"width":"100%"},"width":1638,"height":158,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/26-32.png","element":"img"}],[{"text":"Thus,","element":"span"}],[{"style":{"width":"100%"},"width":1668,"height":564,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-0.png","element":"img"}],[{"text":"We assumed that the given ","element":"span"},{"style":{"height":7.2},"width":14,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-1.png","element":"img","alt":" ϵ","inline":true},{"text":"-optimal policies satisfy Assumptions ","element":"span"},{"href":"#id-10","text":"3 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-11","text":"4. 
","element":"a"},{"text":"As a result, we can utilize the proof of Theorem ","element":"span"},{"href":"#id-7","text":"1 ","element":"a"},{"text":"to deduce that the term ","element":"span"},{"style":{"height":13.6},"width":233,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-2.png","element":"img","alt":" R0 + R1 + R2","inline":true,"padRight":true},{"text":"is of the order ","element":"span"},{"style":{"height":19.8},"width":242,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-3.png","element":"img","alt":"Õ(dh^d √(|A|T))","inline":true},{"text":". Moreover, we can simplify the term ","element":"span"},{"style":{"height":28.8},"width":425.36,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-4.png","element":"img","alt":" E[ Σ_{k=1}^{K_T} T_k ϵ_k ] as below:","inline":true}],[{"id":"id-122","style":{"width":"75%"},"width":1190,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-5.png","element":"img"}],[{"text":"From the second stopping condition of Algorithm ","element":"span"},{"href":"#id-57","text":"1, ","element":"a"},{"text":"we have ","element":"span"},{"style":{"height":17.2},"width":558.5,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-6.png","element":"img","alt":"˜Tk ≤ ˜Tk−1 + 1 ≤ . . . ≤ k + 1 and","inline":true}],[{"style":{"width":"24%"},"width":396,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-7.png","element":"img"}],[{"text":"where we have used the assumption that ","element":"span"},{"style":{"height":20.6},"width":150.5,"height":51.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-8.png","element":"img","alt":" ϵk ≤ 1/(k+1)","inline":true},{"text":". 
For the second term of ","element":"span"},{"href":"#id-122","text":"(35)","element":"a"},{"text":", from ","element":"span"},{"href":"#id-110","text":"(25)","element":"a"}],[{"style":{"width":"98%"},"width":1558,"height":172,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-9.png","element":"img"}],[{"text":"where in the last inequality we have used ","element":"span"},{"style":{"height":19.4},"width":354,"height":48.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-10.png","element":"img","alt":"Σ_{i=1}^{n} 1/i ≤ 1 + log(n)","inline":true},{"text":". Finally, as a result of Lemma ","element":"span"},{"href":"#id-62","text":"6 ","element":"a"},{"text":"and ","element":"span"},{"text":"Lemma ","element":"span"},{"href":"#id-64","text":"8, ","element":"a"},{"text":"the result follows.","element":"span"}]]},{"heading":"C Bounds on hitting times under polynomial and geometric ergodicity","paragraphs":[[{"id":"id-100","style":{"fontWeight":"bold"},"text":"C.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Polynomial upper bounds for the moments of hitting time of state ","element":"span"},{"style":{"height":14},"width":35.5,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-11.png","element":"img","alt":" 0d","inline":true}],[{"text":"For any ","element":"span"},{"style":{"height":14.4},"width":177,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-12.png","element":"img","alt":" θ1, θ2 ∈ Θ","inline":true},{"text":", consider the Markov process with transition kernel ","element":"span"},{"style":{"height":26.2},"width":74,"height":65.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-13.png","element":"img","alt":" Pπ∗θ2θ1","inline":true,"padRight":true},{"text":"obtained from the MDP 
","element":"span"},{"style":{"height":16},"width":226.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-14.png","element":"img","alt":" (X, A, c, Pθ1)","inline":true,"padRight":true},{"text":"by following policy ","element":"span"},{"style":{"height":17},"width":49,"height":42.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-15.png","element":"img","alt":" π∗θ2","inline":true},{"text":". [","element":"span"},{"href":"#id-90","referenceIndex":24,"text":"24","element":"a"},{"text":", Lemma 3.5] establishes that if the process is ","element":"span"},{"text":"polynomially ergodic, or equivalently satisfies ","element":"span"},{"href":"#id-99","text":"(4)","element":"a"},{"text":", then for every ","element":"span"},{"style":{"height":14.4},"width":180.52,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-16.png","element":"img","alt":" 0 < η ≤ 1","inline":true},{"text":", there exist constants ","element":"span"},{"style":{"height":19.66},"width":278.56,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-17.png","element":"img","alt":"βηθ1,θ2, bηθ1,θ2 > 0","inline":true,"padRight":true},{"text":"such that the following holds:","element":"span"}],[{"id":"id-127","style":{"width":"91%"},"width":1452,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-18.png","element":"img"}],[{"text":"where for ","element":"span"},{"style":{"height":22.21},"width":636.16,"height":55.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-19.png","element":"img","alt":" η ∈ (0, 1), ˜βpθ1,θ2 := min(βpθ1,θ2, 1) and","inline":true}],[{"id":"id-114","style":{"width":"98%"},"width":1568,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-20.png","element":"img"}],[{"text":"and for 
","element":"span"},{"style":{"height":19.66},"width":354.84,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-21.png","element":"img","alt":" η = 1, βηθ1,θ2 = βpθ1,θ2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.6},"width":226,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/27-22.png","element":"img","alt":" bηθ1,θ2 = bpθ1,θ2","inline":true},{"text":". Consequently, the following result is immediate ","element":"span"},{"id":"id-126","text":"from the proof of ","element":"span"},{"href":"#id-90","referenceIndex":24,"text":"[24, ","element":"a"},{"text":"Theorem 3.6]; for completeness, we provide the proof in Appendix ","element":"span"},{"href":"#id-123","text":"D.1.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 9. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose a finite set ","element":"span"},{"style":{"height":19.8},"width":94.5,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-0.png","element":"img","alt":" Cpθ1,θ2","inline":true},{"style":{"fontStyle":"italic"},"text":", constants ","element":"span"},{"style":{"height":19.68},"width":678.76,"height":49.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-1.png","element":"img","alt":" βpθ1,θ2, bpθ1,θ2 > 0, r/(r + 1) ≤ αpθ1,θ2 < 1","inline":true},{"style":{"fontStyle":"italic"},"text":", and a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"function ","element":"span"},{"style":{"height":19.68},"width":361.16,"height":49.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-2.png","element":"img","alt":" V pθ1,θ2 : X → [1, +∞)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"exist such that ","element":"span"},{"href":"#id-99","text":"(4) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds. 
Then, there exists a sequence of non-negative ","element":"span"},{"style":{"fontStyle":"italic"},"text":"functions ","element":"span"},{"style":{"height":19.9},"width":388.64,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-3.png","element":"img","alt":" V iθ1,θ2 : X → [1, +∞)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , r ","element":"span"},{"text":"+ 1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"that satisfy the following system of drift ","element":"span"},{"style":{"fontStyle":"italic"},"text":"equations for finite sets ","element":"span"},{"style":{"height":19.9},"width":97.88,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-4.png","element":"img","alt":" Ciθ1,θ2","inline":true},{"style":{"fontStyle":"italic"},"text":", constants ","element":"span"},{"style":{"height":19.9},"width":424.32,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-5.png","element":"img","alt":" biθ1,θ2 ≥ 0 and βiθ1,θ2 > 0:","inline":true}],[{"id":"id-124","style":{"width":"92%"},"width":1466,"height":66,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-6.png","element":"img"}],[{"text":"Notice that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"is the maximum degree of the cost function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":"defined in Assumption ","element":"span"},{"href":"#id-54","text":"1. 
","element":"a"},{"text":"Following the proof and approach of [","element":"span"},{"href":"#id-90","referenceIndex":24,"text":"24","element":"a"},{"text":"] and using the set of equations ","element":"span"},{"href":"#id-124","text":"(39)","element":"a"},{"text":", we can find an upper-bound for ","element":"span"},{"style":{"height":18.46},"width":123.24,"height":46.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-7.png","element":"img","alt":" Ex[τ i0d]","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , r ","element":"span"},{"text":"+ 1 ","element":"span"},{"text":"in Lemma ","element":"span"},{"href":"#id-49","text":"10. ","element":"a"},{"text":"In order to establish upper bounds for the first ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"+ 1 ","element":"span"},{"text":"moments of ","element":"span"},{"style":{"height":10.4},"width":46.5,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-8.png","element":"img","alt":" τ0d","inline":true},{"text":", it is crucial to choose the value of ","element":"span"},{"style":{"height":19.8},"width":92,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-9.png","element":"img","alt":" αpθ1,θ2","inline":true,"padRight":true},{"text":"greater than or equal to ","element":"span"},{"style":{"height":18},"width":56.5,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-10.png","element":"img","alt":"rr+1","inline":true},{"text":", as ","element":"span"},{"text":"demonstrated in the proof of Lemma ","element":"span"},{"href":"#id-49","text":"10 ","element":"a"},{"text":"in Appendix ","element":"span"},{"href":"#id-125","text":"D.2","element":"a"}],[{"id":"id-49","style":{"fontWeight":"bold"},"text":"Lemma 10. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , r ","element":"span"},{"text":"+ 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", and for all ","element":"span"},{"style":{"height":11.8},"width":106,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-11.png","element":"img","alt":" x ∈ X","inline":true}],[{"style":{"width":"56%"},"width":892,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-12.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":29.82},"width":409.72,"height":74.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-13.png","element":"img","alt":" ϕpθ1,θ2(i) := �ij=1 1βηjθ1,θ2","inline":true},{"style":{"height":28.8},"width":481,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-14.png","element":"img","alt":"�2j−1 + (j − 1) αCpθ1,θ2 bηjθ1,θ2","inline":true},{"style":{"height":28.8},"width":546.64,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-15.png","element":"img","alt":"�, ηi = 1 − (i − 1)(1 − αpθ1,θ2)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"href":"#id-114","style":{"height":32.3},"width":1241.56,"height":80.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-16.png","element":"img","alt":"bηiθ1,θ2 and βηiθ1,θ2 defined in (38), and αCpθ1,θ2 =�miny∈Cpθ1,θ2 Kθ1,θ2(y)�−1.","inline":true}],[{"text":"Based on Lemma ","element":"span"},{"href":"#id-49","text":"10, ","element":"a"},{"text":"we impose the conditions of Assumption ","element":"span"},{"href":"#id-11","text":"4 ","element":"a"},{"text":"to obtain uniform (over model class) and polynomial (in norm of the state) upper-bounds on 
the moments of hitting times to ","element":"span"},{"style":{"height":14},"width":35,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-17.png","element":"img","alt":" 0d","inline":true},{"text":". Moreover, these conditions lead to a uniform characterization of parameters of Lemma ","element":"span"},{"href":"#id-49","text":"10 ","element":"a"},{"text":"over all models in our class.","element":"span"}],[{"id":"id-101","style":{"fontWeight":"bold"},"text":"C.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Distribution of return times to state ","element":"span"},{"style":{"height":13.8},"width":35,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-18.png","element":"img","alt":" 0d","inline":true}],[{"text":"For any ","element":"span"},{"style":{"height":14.4},"width":167.5,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-19.png","element":"img","alt":" θ1, θ2 ∈ Θ","inline":true},{"text":", consider the Markov process with transition kernel ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"style":{"height":14.2},"width":43.5,"height":35.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-20.png","element":"img","alt":"π∗θ2","inline":true},{"style":{"height":9.4},"width":25,"height":23.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-21.png","element":"img","alt":"θ1","inline":true,"padRight":true},{"text":"obtained from the MDP ","element":"span"},{"style":{"height":16},"width":226.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-22.png","element":"img","alt":"(X, A, c, Pθ1)","inline":true,"padRight":true},{"text":"by following policy ","element":"span"},{"style":{"height":17},"width":49,"height":42.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-23.png","element":"img","alt":" π∗θ2","inline":true},{"text":". 
In the following lemma, we show that the tail probabilities of ","element":"span"},{"text":"the return times to the common state ","element":"span"},{"style":{"height":16.98},"width":201,"height":42.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-24.png","element":"img","alt":" 0d, again τ0d","inline":true},{"text":", converge geometrically fast to ","element":"span"},{"text":"0","element":"span"},{"text":", and characterize the convergence parameters in terms of the constants given in Assumption ","element":"span"},{"href":"#id-10","text":"3. ","element":"a"},{"text":"Explicitly, we show","element":"span"}],[{"style":{"width":"35%"},"width":556,"height":80,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-25.png","element":"img"}],[{"text":"for problem- and policy-dependent constants ","element":"span"},{"style":{"height":19.6},"width":253.5,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-26.png","element":"img","alt":" cgθ1,θ2 and ˜γgθ1,θ2","inline":true},{"text":". We will follow the method outlined in ","element":"span"},{"text":"[","element":"span"},{"href":"#id-97","referenceIndex":22,"text":"22","element":"a"},{"text":"] with the goal of identifying problem-dependent parameters that will be relevant to our results. The proof of the following lemma is given in Appendix ","element":"span"},{"href":"#id-93","text":"D.3 ","element":"a"},{"text":"and follows the methodology of ","element":"span"},{"href":"#id-97","referenceIndex":22,"text":"[22]","element":"a"},{"text":".","element":"span"}],[{"id":"id-50","style":{"fontWeight":"bold"},"text":"Lemma 11. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For every ","element":"span"},{"style":{"height":14.6},"width":167,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-27.png","element":"img","alt":" θ1, θ2 ∈ Θ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in the Markov process obtained from the Markov decision process ","element":"span"},{"style":{"height":16},"width":226.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-28.png","element":"img","alt":"(X, A, c, Pθ1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"following policy ","element":"span"},{"style":{"height":17.2},"width":49,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-29.png","element":"img","alt":" π∗θ2","inline":true},{"style":{"fontStyle":"italic"},"text":", the return time to state ","element":"span"},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"starting from state ","element":"span"},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"satisfies the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"following:","element":"span"}],[{"style":{"width":"35%"},"width":556,"height":80,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-30.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where","element":"span"}],[{"style":{"width":"56%"},"width":894,"height":152,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-31.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"with","element":"span"}],[{"style":{"width":"71%"},"width":1138,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-32.png","element":"img"}],[{"text":"Based on Lemma ","element":"span"},{"href":"#id-50","text":"11, ","element":"a"},{"text":"it is necessary to impose the conditions in Assumption ","element":"span"},{"href":"#id-10","text":"3 
","element":"a"},{"text":"to obtain uniform tail probability bounds on ","element":"span"},{"style":{"height":10.6},"width":46,"height":26.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-33.png","element":"img","alt":" τ0d","inline":true,"padRight":true},{"text":"for all model parameters and policy choices in ","element":"span"},{"style":{"height":11.8},"width":27,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-34.png","element":"img","alt":" Θ","inline":true},{"text":". Moreover, these conditions lead to a uniform characterization of ","element":"span"},{"style":{"height":19.66},"width":380.8,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-35.png","element":"img","alt":" cgθ1,θ2 and ˜γgθ1,θ2 over Θ","inline":true},{"text":". Furthermore, as a result of ","element":"span"},{"text":"Lemma ","element":"span"},{"href":"#id-49","text":"10 ","element":"a"},{"text":"and uniformity conditions of Assumption ","element":"span"},{"href":"#id-11","text":"4, ","element":"a"},{"style":{"height":23.87},"width":152.36,"height":59.68,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-36.png","element":"img","alt":" Eπ∗θ2u [τ0d]","inline":true,"padRight":true},{"text":"has a uniform bound over ","element":"span"},{"style":{"height":11.8},"width":27,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-37.png","element":"img","alt":" Θ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.59},"width":217.48,"height":51.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/28-38.png","element":"img","alt":"Cgθ1,θ2 \\ {0d}","inline":true},{"text":", which can be characterized in terms of the polynomial Lyapunov function.","element":"span"}]]},{"heading":"D Proofs of hitting time bounds","paragraphs":[[{"id":"id-123","style":{"fontWeight":"bold"},"text":"D.1 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-126","style":{"fontWeight":"bold"},"text":"9","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"In the proof, to avoid cumbersome notation we will drop the indices ","element":"span"},{"style":{"height":14.4},"width":86,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-0.png","element":"img","alt":" θ1, θ2","inline":true},{"text":". Following the proof of Theorem 3.6 in [","element":"span"},{"href":"#id-90","referenceIndex":24,"text":"24","element":"a"},{"text":"], we choose ","element":"span"},{"style":{"height":16},"width":729.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-1.png","element":"img","alt":" ηi = 1 − (i − 1)(1 − αp) for i = 1, . . . , r + 1","inline":true,"padRight":true},{"text":"and note that as ","element":"span"},{"style":{"height":20.96},"width":589.72,"height":52.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-2.png","element":"img","alt":" αp ∈ [ rr+1, 1), we have ηi ∈ [ 1r+1, 1]","inline":true},{"text":". 
As a result, we can apply ","element":"span"},{"href":"#id-127","text":"(37) ","element":"a"},{"text":"to each ","element":"span"},{"style":{"height":12.8},"width":132,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-3.png","element":"img","alt":" ηi to get","inline":true}],[{"style":{"width":"74%"},"width":1176,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-4.png","element":"img"}],[{"text":"Thus, the system of drift equations ","element":"span"},{"href":"#id-124","text":"(39) ","element":"a"},{"text":"holds for","element":"span"}],[{"style":{"width":"59%"},"width":946,"height":222,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14.6},"width":174.5,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-6.png","element":"img","alt":" βηi and bηi ","inline":true,"padRight":true},{"text":"are defined in ","element":"span"},{"href":"#id-114","text":"(38)","element":"a"},{"text":".","element":"span"}],[{"id":"id-125","style":{"fontWeight":"bold"},"text":"D.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-49","style":{"fontWeight":"bold"},"text":"10","element":"a"}],[{"text":"The proof of Lemma ","element":"span"},{"href":"#id-49","text":"10 ","element":"a"},{"text":"uses the following lemma.","element":"span"}],[{"id":"id-95","style":{"fontWeight":"bold"},"text":"Lemma 12 ","element":"span"},{"text":"(Proposition 11.3.2, [","element":"span"},{"href":"#id-87","referenceIndex":35,"text":"35","element":"a"},{"text":"])","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose for nonnegative functions ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"style":{"fontStyle":"italic"},"text":"on the state space ","element":"span"},{"style":{"height":15},"width":328,"height":37.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-7.png","element":"img","alt":" X and every k ∈ Z+","inline":true},{"style":{"fontStyle":"italic"},"text":", the following holds:","element":"span"}],[{"style":{"width":"47%"},"width":756,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-8.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Then, for any initial condition ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and stopping time ","element":"span"},{"style":{"height":7},"width":20,"height":17.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-9.png","element":"img","alt":" τ","inline":true}],[{"style":{"width":"47%"},"width":752,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-10.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-49","style":{"fontStyle":"italic"},"text":"10. ","element":"a"},{"text":"Following [","element":"span"},{"href":"#id-90","referenceIndex":24,"text":"24","element":"a"},{"text":"], the proof uses an induction argument. We will use the notation of Lemma ","element":"span"},{"href":"#id-126","text":"9 ","element":"a"},{"text":"for simplicity. 
Similarly, in this proof we will also denote ","element":"span"},{"style":{"height":19.66},"width":473.24,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-11.png","element":"img","alt":" ϕpθ1,θ2(i) as ϕ(i), Kθ1,θ2(·) as","inline":true},{"style":{"height":19.89},"width":876.36,"height":49.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-12.png","element":"img","alt":"K(·), and V iθ1,θ2, biθ1,θ2, βiθ1,θ2, Ciθ1,θ2 as Vi, bi, βi, Ci.","inline":true}],[{"text":"From irreducibility, for all ","element":"span"},{"style":{"height":16},"width":233.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-13.png","element":"img","alt":" x ∈ X, K(x)","inline":true,"padRight":true},{"text":"is positive and finite. Considering the system of drift equations found in Lemma ","element":"span"},{"href":"#id-126","text":"9, ","element":"a"},{"style":{"height":13.8},"width":140.5,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-14.png","element":"img","alt":" Ci = Cp","inline":true,"padRight":true},{"text":"is a finite set for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , r ","element":"span"},{"text":"+ 1","element":"span"},{"text":". Thus, ","element":"span"},{"style":{"height":16.8},"width":240,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-15.png","element":"img","alt":" miny∈Ci K(y)","inline":true,"padRight":true},{"text":"is strictly positive. For all ","element":"span"},{"style":{"height":14.4},"width":600.24,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-16.png","element":"img","alt":" x ∈ X and i = 1, . . . 
, r + 1, we have","inline":true}],[{"id":"id-128","style":{"width":"67%"},"width":1076,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-17.png","element":"img"}],[{"text":"We set ","element":"span"},{"style":{"height":20.42},"width":827.36,"height":51.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-18.png","element":"img","alt":" αCp := (miny∈Ci K(y))−1 = (miny∈Cp K(y))−1","inline":true},{"text":". From Lemma ","element":"span"},{"href":"#id-126","text":"9, ","element":"a"},{"text":"for ","element":"span"},{"style":{"height":14.4},"width":277.5,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-19.png","element":"img","alt":" j = 1 and x ∈ X","inline":true}],[{"style":{"width":"35%"},"width":566,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-20.png","element":"img"}],[{"text":"By applying Lemma ","element":"span"},{"href":"#id-95","text":"12, ","element":"a"},{"text":"for all ","element":"span"},{"style":{"height":14.6},"width":223.5,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-21.png","element":"img","alt":" x ∈ X we get","inline":true}],[{"id":"id-129","style":{"width":"81%"},"width":1296,"height":150,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-22.png","element":"img"}],[{"text":"Using ","element":"span"},{"href":"#id-128","text":"(40) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-129","text":"(41)","element":"a"},{"text":", followed by noting that","element":"span"}],[{"style":{"width":"59%"},"width":950,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/29-23.png","element":"img"}],[{"text":"we get","element":"span"}],[{"style":{"width":"80%"},"width":1278,"height":708,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/30-0.png","element":"img"}],[{"text":"As 
","element":"span"},{"style":{"height":16},"width":167,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/30-1.png","element":"img","alt":" V1(x) ≥ 1","inline":true},{"text":", this gives us a bound on ","element":"span"},{"style":{"height":16},"width":118.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/30-2.png","element":"img","alt":" Ex[τ0d]","inline":true,"padRight":true},{"text":"as follows:","element":"span"}],[{"style":{"width":"31%"},"width":504,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/30-3.png","element":"img"}],[{"text":"Assume for ","element":"span"},{"style":{"height":13},"width":83,"height":32.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/30-4.png","element":"img","alt":" i ≥ 1","inline":true},{"text":", by the induction assumption we have","element":"span"}],[{"id":"id-130","style":{"width":"79%"},"width":1268,"height":150,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/30-5.png","element":"img"}],[{"text":"Set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"+ 1 ","element":"span"},{"text":"in ","element":"span"},{"href":"#id-124","text":"(39)","element":"a"},{"text":", which yields","element":"span"}],[{"style":{"width":"42%"},"width":670,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/30-6.png","element":"img"}],[{"text":"Define ","element":"span"},{"style":{"height":16.99},"width":263.68,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/30-7.png","element":"img","alt":" Zk = kiVi(Xk)","inline":true},{"text":". 
From the above equation, we have","element":"span"}],[{"style":{"width":"97%"},"width":1548,"height":114,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/30-8.png","element":"img"}],[{"text":"By applying Lemma ","element":"span"},{"href":"#id-95","text":"12 ","element":"a"},{"text":"to the above equation, we get","element":"span"}],[{"id":"id-131","style":{"width":"88%"},"width":1400,"height":370,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/30-9.png","element":"img"}],[{"text":"where the second inequality follows from ","element":"span"},{"href":"#id-128","text":"(40) ","element":"a"},{"text":"and the induction hypothesis ","element":"span"},{"href":"#id-130","text":"(42)","element":"a"},{"text":". Thereafter, from ","element":"span"},{"href":"#id-130","text":"(42) ","element":"a"},{"text":"(by using an integral lower bound after using ","element":"span"},{"style":{"height":13.8},"width":270.5,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/30-10.png","element":"img","alt":" Vi ≥ 1), we have","inline":true}],[{"style":{"width":"74%"},"width":1184,"height":150,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/30-11.png","element":"img"}],[{"text":"Substituting in ","element":"span"},{"href":"#id-131","text":"(43)","element":"a"},{"text":", we get","element":"span"}],[{"style":{"width":"100%"},"width":1590,"height":270,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/30-12.png","element":"img"}],[{"text":"This completes the proof.","element":"span"}],[{"id":"id-93","style":{"fontWeight":"bold"},"text":"D.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-50","style":{"fontWeight":"bold"},"text":"11","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"In the proof, to avoid cumbersome notation we will drop the indices ","element":"span"},{"style":{"height":14.6},"width":86,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-0.png","element":"img","alt":" θ1, θ2","inline":true},{"text":". Based on Assumption ","element":"span"},{"href":"#id-10","text":"3, ","element":"a"},{"text":"there exists a finite set ","element":"span"},{"style":{"height":11.8},"width":45,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-1.png","element":"img","alt":" Cg","inline":true},{"text":", constants ","element":"span"},{"style":{"height":16},"width":244,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-2.png","element":"img","alt":" bg, γg ∈ (0, 1)","inline":true},{"text":", and a function ","element":"span"},{"style":{"height":11.4},"width":181.5,"height":28.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-3.png","element":"img","alt":" V g : X →","inline":true},{"style":{"height":16},"width":127,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-4.png","element":"img","alt":"[1, +∞)","inline":true,"padRight":true},{"text":"satisfying","element":"span"}],[{"id":"id-135","style":{"width":"77%"},"width":1232,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-5.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":13},"width":93.5,"height":32.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-6.png","element":"img","alt":" n ≥ 1","inline":true},{"text":", define the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"-step taboo probabilities ","element":"span"},{"href":"#id-87","referenceIndex":35,"text":"[35] 
","element":"a"},{"text":"as","element":"span"}],[{"style":{"width":"34%"},"width":540,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-7.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14.8},"width":194,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-8.png","element":"img","alt":" A, B ⊆ X,","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.2},"width":39,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-9.png","element":"img","alt":" τA","inline":true,"padRight":true},{"text":"is the first hitting time of set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":". We also let ","element":"span"},{"style":{"height":17.8},"width":267.5,"height":44.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-10.png","element":"img","alt":" AP 0xB = IB(x)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":21.94},"width":350.04,"height":54.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-11.png","element":"img","alt":"˜V g = �∞n=0 0dP nV g","inline":true},{"text":". 
Applying the last exit decomposition on ","element":"span"},{"style":{"height":17.39},"width":386.16,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-12.png","element":"img","alt":" Cg \\ {0d} for all x ∈ X","inline":true},{"text":", we obtain","element":"span"}],[{"id":"id-132","style":{"width":"97%"},"width":1540,"height":822,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-13.png","element":"img"}],[{"text":"where we break up the trajectories starting at state ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x ","element":"span"},{"text":"and reaching state ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"y ","element":"span"},{"text":"while avoiding state ","element":"span"},{"style":{"height":14},"width":35,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-14.png","element":"img","alt":" 0d","inline":true,"padRight":true},{"text":"into two groups: those that never visit the set ","element":"span"},{"style":{"height":11.8},"width":45,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-15.png","element":"img","alt":" Cg","inline":true},{"text":", and those that visit ","element":"span"},{"style":{"height":17.38},"width":165.92,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-16.png","element":"img","alt":" Cg \\ {0d}","inline":true,"padRight":true},{"text":"up until time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"but not afterwards and exit ","element":"span"},{"style":{"height":17.38},"width":340.4,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-17.png","element":"img","alt":" Cg \\ {0d} at time m.","inline":true}],[{"text":"We first bound Term 1 in ","element":"span"},{"href":"#id-132","text":"(46) ","element":"a"},{"text":"by finding an upper bound for the probability 
term ","element":"span"},{"style":{"height":20.32},"width":315.56,"height":50.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-18.png","element":"img","alt":"∑∞m=1 0dP mxz using","inline":true,"padRight":true},{"text":"the first entrance decomposition on ","element":"span"},{"style":{"height":17.39},"width":165.96,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-19.png","element":"img","alt":" Cg \\ {0d}","inline":true,"padRight":true},{"text":"while noting that ","element":"span"},{"style":{"height":17.39},"width":249.36,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-20.png","element":"img","alt":" z ∈ Cg \\ {0d}:","inline":true}],[{"id":"id-136","style":{"width":"83%"},"width":1322,"height":676,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-21.png","element":"img"}],[{"text":"where the third line follows from the fact that ","element":"span"},{"style":{"height":20.96},"width":411.72,"height":52.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-22.png","element":"img","alt":"∑∞l=0 ∑v ∉Cg CgP lxvPvu ","inline":true,"padRight":true},{"text":"is the probability of entrance ","element":"span"},{"text":"to ","element":"span"},{"style":{"height":11.8},"width":44.5,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-23.png","element":"img","alt":" Cg","inline":true,"padRight":true},{"text":"through ","element":"span"},{"style":{"height":16},"width":228.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-24.png","element":"img","alt":" u ∈ Cg \\ {0}","inline":true},{"text":", so it is less than ","element":"span"},{"text":"1","element":"span"},{"text":". 
Irreducibility and positive recurrence combined with ","element":"span"},{"style":{"height":16},"width":165.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-25.png","element":"img","alt":" |Cg| < ∞","inline":true,"padRight":true},{"text":"imply that ","element":"span"},{"style":{"height":18.03},"width":467.16,"height":45.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-26.png","element":"img","alt":" maxu∈Cg\\{0d} Eu[τ0d] < ∞","inline":true},{"text":", which shows that ","element":"span"},{"style":{"height":20.32},"width":218.08,"height":50.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/31-27.png","element":"img","alt":"∑∞m=0 0dP mxz ","inline":true,"padRight":true},{"text":"is finite. Next, ","element":"span"},{"text":"by induction we prove that for ","element":"span"},{"style":{"height":17.38},"width":553.36,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/32-0.png","element":"img","alt":" n ≥ 1 and z ∈ Cg \\ {0d} we have","inline":true}],[{"id":"id-133","style":{"width":"67%"},"width":1066,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/32-1.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 1","element":"span"},{"text":", Assumption ","element":"span"},{"href":"#id-10","text":"3 ","element":"a"},{"text":"gives","element":"span"}],[{"style":{"width":"43%"},"width":696,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/32-2.png","element":"img"}],[{"text":"Assuming that ","element":"span"},{"href":"#id-133","text":"(48) ","element":"a"},{"text":"holds for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":", for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"+ 1 ","element":"span"},{"text":"we 
have","element":"span"}],[{"style":{"width":"94%"},"width":1502,"height":236,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/32-3.png","element":"img"}],[{"text":"so ","element":"span"},{"href":"#id-133","text":"(48) ","element":"a"},{"text":"is shown. We collect these bounds later on for our result on Term ","element":"span"},{"text":"2","element":"span"},{"text":".","element":"span"}],[{"text":"We now simplify the summation in ","element":"span"},{"href":"#id-132","text":"(45)","element":"a"},{"text":". Similar to previous arguments, we will use induction for ","element":"span"},{"style":{"height":13.2},"width":97.04,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/32-4.png","element":"img","alt":"n ≥ 1","inline":true,"padRight":true},{"text":"and show for all ","element":"span"},{"style":{"height":11.6},"width":108,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/32-5.png","element":"img","alt":" x ∈ X","inline":true}],[{"id":"id-134","style":{"width":"74%"},"width":1184,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/32-6.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 1","element":"span"},{"text":", we have","element":"span"}],[{"style":{"width":"56%"},"width":898,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/32-7.png","element":"img"}],[{"text":"Assuming that ","element":"span"},{"href":"#id-134","text":"(49) ","element":"a"},{"text":"holds for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":", for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"+ 1 ","element":"span"},{"text":"we 
have","element":"span"}],[{"style":{"width":"77%"},"width":1232,"height":208,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/32-8.png","element":"img"}],[{"text":"where the first and second inequalities follow from the definition of taboo probabilities and ","element":"span"},{"href":"#id-135","text":"(44)","element":"a"},{"text":". Thus, ","element":"span"},{"href":"#id-134","text":"(49) ","element":"a"},{"text":"is proved. Lastly, for Term 2 in ","element":"span"},{"href":"#id-132","text":"(46)","element":"a"},{"text":", we note","element":"span"}],[{"style":{"width":"89%"},"width":1412,"height":216,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/32-9.png","element":"img"}],[{"text":"From the above equation, ","element":"span"},{"href":"#id-136","text":"(47)","element":"a"},{"text":", ","element":"span"},{"href":"#id-133","text":"(48)","element":"a"},{"text":", and ","element":"span"},{"href":"#id-134","text":"(49)","element":"a"},{"text":", we bound ","element":"span"},{"style":{"height":18.8},"width":102.5,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/32-10.png","element":"img","alt":"˜V g(x)","inline":true,"padRight":true},{"text":"as follows:","element":"span"}],[{"style":{"width":"94%"},"width":1496,"height":420,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/32-11.png","element":"img"}],[{"text":"where the last line is due to ","element":"span"},{"style":{"height":16},"width":313.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/32-12.png","element":"img","alt":" V g(x) ≥ 1. 
Taking","inline":true}],[{"style":{"width":"61%"},"width":968,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/32-13.png","element":"img"}],[{"text":"we have shown that","element":"span"}],[{"id":"id-137","style":{"width":"65%"},"width":1040,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-0.png","element":"img"}],[{"text":"We now upper-bound ","element":"span"},{"style":{"height":16},"width":430.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-1.png","element":"img","alt":" P0d(τ0d > n) for all n ≥ 1","inline":true,"padRight":true},{"text":"in an inductive manner, starting with ","element":"span"},{"style":{"height":16},"width":225.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-2.png","element":"img","alt":" P0d(τ0d > 1).","inline":true,"padRight":true},{"text":"As a part of showing this, for every ","element":"span"},{"style":{"height":16.99},"width":116.32,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-3.png","element":"img","alt":" x ̸= 0d ","inline":true,"padRight":true},{"text":"we argue that for all ","element":"span"},{"style":{"height":13.2},"width":97.04,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-4.png","element":"img","alt":" n ≥ 1","inline":true}],[{"id":"id-138","style":{"width":"68%"},"width":1094,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-5.png","element":"img"}],[{"text":"First note that","element":"span"}],[{"style":{"width":"61%"},"width":974,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-6.png","element":"img"}],[{"text":"Thus,","element":"span"}],[{"id":"id-139","style":{"width":"88%"},"width":1400,"height":236,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-7.png","element":"img"}],[{"text":"We now apply the bound in 
","element":"span"},{"href":"#id-137","text":"(50) ","element":"a"},{"text":"to get","element":"span"}],[{"id":"id-140","style":{"width":"90%"},"width":1440,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-8.png","element":"img"}],[{"text":"With the base of induction established, we assume the statement in ","element":"span"},{"href":"#id-138","text":"(51) ","element":"a"},{"text":"is true for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":", and show that it continues to hold for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"+ 1 ","element":"span"},{"text":"as follows:","element":"span"}],[{"style":{"width":"49%"},"width":792,"height":358,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-9.png","element":"img"}],[{"text":"where the final inequality uses the same arguments as in ","element":"span"},{"href":"#id-139","text":"(53) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-140","text":"(54)","element":"a"},{"text":".","element":"span"}],[{"text":"Finally, using the tail probabilities of hitting time of state ","element":"span"},{"style":{"height":14},"width":35,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-10.png","element":"img","alt":" 0d","inline":true,"padRight":true},{"text":"from any state ","element":"span"},{"style":{"height":17.2},"width":115,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-11.png","element":"img","alt":" x ̸= 0d","inline":true},{"text":", we bound the tail probability of the return time to state ","element":"span"},{"style":{"height":14},"width":35,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-12.png","element":"img","alt":" 0d ","inline":true,"padRight":true},{"text":"(starting from 
","element":"span"},{"style":{"height":14},"width":35,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-13.png","element":"img","alt":" 0d","inline":true},{"text":") as follows","element":"span"}],[{"style":{"width":"76%"},"width":1210,"height":258,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-14.png","element":"img"}],[{"text":"where the final inequality follows from the definition of ","element":"span"},{"style":{"height":11.4},"width":30.5,"height":28.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-15.png","element":"img","alt":" bg","inline":true},{"text":", and we have","element":"span"}],[{"style":{"width":"35%"},"width":560,"height":136,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-16.png","element":"img"}],[{"text":"and the proof is complete.","element":"span"}]]},{"heading":"E Queueing model examples","paragraphs":[[{"id":"id-82","style":{"fontWeight":"bold"},"text":"E.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Model 1: Two-server queueing system with a common buffer","element":"span"}],[{"text":"We consider a continuous-time queueing system with two heterogeneous servers with unknown service rate vector ","element":"span"},{"style":{"height":16.34},"width":220.68,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-17.png","element":"img","alt":" θ∗ = (θ∗1, θ∗2)","inline":true,"padRight":true},{"text":"and a common infinite buffer, shown in Figure ","element":"span"},{"href":"#id-79","text":"2a. 
","element":"a"},{"text":"Arrivals to the ","element":"span"},{"text":"system are according to a Poisson process with rate ","element":"span"},{"style":{"height":11.4},"width":20,"height":28.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/33-18.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"and service times are exponentially distributed","element":"span"}],[{"text":"with parameter ","element":"span"},{"style":{"height":15.6},"width":32.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-0.png","element":"img","alt":" θ∗i ","inline":true,"padRight":true},{"text":", depending on the assigned server. The service rate vector ","element":"span"},{"style":{"height":12.6},"width":36.5,"height":31.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-1.png","element":"img","alt":" θ∗","inline":true,"padRight":true},{"text":"is sampled from the ","element":"span"},{"text":"prior distribution ","element":"span"},{"style":{"height":9.8},"width":32.5,"height":24.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-2.png","element":"img","alt":" ν0","inline":true,"padRight":true},{"text":"defined on the space ","element":"span"},{"style":{"height":14.4},"width":171.6,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-3.png","element":"img","alt":" Θ given as","inline":true}],[{"id":"id-154","style":{"width":"79%"},"width":1260,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-4.png","element":"img"}],[{"text":"for fixed ","element":"span"},{"style":{"height":16},"width":222.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-5.png","element":"img","alt":" δ ∈ (0, 0.5)","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":13.2},"width":134,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-6.png","element":"img","alt":" R ≥ 1","inline":true},{"text":". ","element":"span"},{"text":"Note that for any ","element":"span"},{"style":{"height":16},"width":235.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-7.png","element":"img","alt":" (θ1, θ2) ∈ Θ","inline":true},{"text":", we have ","element":"span"},{"style":{"height":13.8},"width":156,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-8.png","element":"img","alt":" θ1 ≥ θ2","inline":true,"padRight":true},{"text":"and the stability requirement ","element":"span"},{"style":{"height":13.8},"width":223.5,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-9.png","element":"img","alt":" λ < θ1 + θ2","inline":true,"padRight":true},{"text":"holds. The countable state space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"is defined as ","element":"span"},{"style":{"height":16},"width":928.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-10.png","element":"img","alt":"X = {x = (x0, x1, x2) : x0 ∈ N ∪ {0} , x1, x2 ∈ {0, 1}}","inline":true},{"text":", in which ","element":"span"},{"style":{"height":9.8},"width":36.5,"height":24.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-11.png","element":"img","alt":" x0","inline":true,"padRight":true},{"text":"is the length of the queue, and ","element":"span"},{"style":{"height":14},"width":184.24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-12.png","element":"img","alt":" xi, i = 1, 2","inline":true,"padRight":true},{"text":"is equal to 1 if server ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"is busy serving a job. 
At each time instance ","element":"span"},{"style":{"height":14.78},"width":127.56,"height":36.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-13.png","element":"img","alt":" r ∈ R+","inline":true},{"text":", the dispatcher can assign jobs from the (non-empty) buffer to an available server. Thus, the action space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"is equal to","element":"span"}],[{"style":{"width":"17%"},"width":274,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-14.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"indicates no action, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"sends a job to both of the servers, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2 ","element":"span"},{"text":"assigns a job to server ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":". 
The goal of the dispatcher is to minimize the expected sojourn time of customers, which by Little’s law ","element":"span"},{"href":"#id-80","referenceIndex":44,"text":"[44] ","element":"a"},{"text":"is equivalent to minimizing the average number of customers in the system, or","element":"span"}],[{"id":"id-141","style":{"width":"66%"},"width":1060,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-15.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"X","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":") ","element":"span"},{"text":"is the state of the system at time ","element":"span"},{"style":{"height":14.8},"width":120,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-16.png","element":"img","alt":" r ∈ R+","inline":true},{"text":", immediately after the arrival/departure and just before the action is taken. 
In [","element":"span"},{"href":"#id-12","referenceIndex":31,"text":"31","element":"a"},{"text":"], it is argued that, by applying uniformization [","element":"span"},{"href":"#id-81","referenceIndex":32,"text":"32","element":"a"},{"text":"] and sampling the continuous-time Markov process at a rate of ","element":"span"},{"style":{"height":15.2},"width":190.5,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-17.png","element":"img","alt":" λ + θ∗1 + θ∗2","inline":true},{"text":", a discrete-time Markov chain is obtained, ","element":"span"},{"text":"which converts the original continuous-time problem shown in ","element":"span"},{"href":"#id-141","text":"(56) ","element":"a"},{"text":"to the equivalent discrete-time problem below:","element":"span"}],[{"id":"id-142","style":{"width":"83%"},"width":1316,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-18.png","element":"img"}],[{"text":"To obtain a uniform sampling rate of ","element":"span"},{"style":{"height":14.94},"width":194.68,"height":37.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-19.png","element":"img","alt":" λ + θ∗1 + θ∗2","inline":true},{"text":", the continuous-time system is sampled at arrivals and at ","element":"span"},{"text":"real and dummy customer departures. 
In [","element":"span"},{"href":"#id-12","referenceIndex":31,"text":"31","element":"a"},{"text":"], it is further shown that the optimal policy that achieves the infimum in ","element":"span"},{"href":"#id-142","text":"(57) ","element":"a"},{"text":"is a threshold policy ","element":"span"},{"style":{"height":9.6},"width":32.5,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-20.png","element":"img","alt":" πt","inline":true,"padRight":true},{"text":"with the optimal finite threshold ","element":"span"},{"style":{"height":16},"width":142.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-21.png","element":"img","alt":" t(θ) ∈ N","inline":true},{"text":", where the policy is defined as follows:","element":"span"}],[{"style":{"width":"68%"},"width":1088,"height":152,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-22.png","element":"img"}],[{"text":"Note that action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"is not used. Policy ","element":"span"},{"style":{"height":9.6},"width":32.5,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-23.png","element":"img","alt":" πt","inline":true,"padRight":true},{"text":"assigns a job to the faster (first) server whenever there is a job waiting in the queue and the first server is available. In contrast, ","element":"span"},{"style":{"height":9.6},"width":32.5,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-24.png","element":"img","alt":" πt","inline":true,"padRight":true},{"text":"dispatches a job to the second server only if the number of jobs in the system is greater than the threshold ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"and the second server is available. 
If neither of these conditions holds, the no-action option ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"is taken. Consequently, we can restrict the set of all policies ","element":"span"},{"href":"#id-142","style":{"height":13.8},"width":144.5,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-25.png","element":"img","alt":" Π in (57)","inline":true,"padRight":true},{"text":"to the set ","element":"span"},{"style":{"height":13.6},"width":39.5,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-26.png","element":"img","alt":" Πt","inline":true},{"text":", which is the set of all possible threshold policies corresponding to some ","element":"span"},{"style":{"height":11.8},"width":99,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-27.png","element":"img","alt":" t ∈ N.","inline":true}],[{"text":"In the rest of this subsection, our aim is to show that Assumptions 1-5 are satisfied for the discrete-time Markov process obtained by uniformization of the described queueing system and, hence, to conclude that Algorithm ","element":"span"},{"href":"#id-57","text":"1 ","element":"a"},{"text":"can be used to learn the unknown service rate vector ","element":"span"},{"style":{"height":12.6},"width":36,"height":31.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-28.png","element":"img","alt":" θ∗ ","inline":true,"padRight":true},{"text":"with expected regret of order ","element":"span"},{"style":{"height":18.8},"width":130.5,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-29.png","element":"img","alt":"˜O(√T).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Assumption 1. 
","element":"span"},{"text":"The cost function is given as ","element":"span"},{"style":{"height":16},"width":266.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-30.png","element":"img","alt":" c(x, a) = ∥x∥1","inline":true},{"text":", which satisfies Assumption ","element":"span"},{"href":"#id-54","text":"1 ","element":"a"},{"text":"with ","element":"span"},{"style":{"height":16},"width":634.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-31.png","element":"img","alt":"fc(x) = x0 + x1 + x2 and K = r = 1.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Assumption 2. ","element":"span"},{"text":"For any state-action pair ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", a","element":"span"},{"text":") ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":12.2},"width":105.5,"height":30.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-32.png","element":"img","alt":" θ ∈ Θ","inline":true},{"text":", we have ","element":"span"},{"style":{"height":16},"width":325,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-33.png","element":"img","alt":" Pθ(A(x); x, a) = 0","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":16.78},"width":632,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-34.png","element":"img","alt":"A(x) = {y ∈ X : |∥y∥1 − ∥x∥1| > 1}","inline":true},{"text":"; thus, Assumption ","element":"span"},{"href":"#id-77","text":"2 ","element":"a"},{"text":"holds with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"= 1","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Assumption 3. 
","element":"span"},{"text":"Consider a queueing system with parameter ","element":"span"},{"style":{"height":11.6},"width":17,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-35.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"following threshold policy ","element":"span"},{"style":{"height":13.4},"width":183.5,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-36.png","element":"img","alt":" πt for some","inline":true},{"style":{"height":11.6},"width":91,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-37.png","element":"img","alt":"t ∈ N","inline":true},{"text":". The uniformized discrete-time Markov chain is irreducible and aperiodic on a subset of the state space given as ","element":"span"},{"href":"#id-12","referenceIndex":31,"style":{"height":16},"width":930,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-38.png","element":"img","alt":" Xt = X \\ ({(i, 0, 0) : i ≥ min(t, 2)} ∪ {(0, 1, 1)}). In [31","inline":true},{"text":"], it is proved that for every ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", the chain consists of a single positive recurrent class and the corresponding average number of customers, denoted by ","element":"span"},{"style":{"height":16.8},"width":84.5,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-39.png","element":"img","alt":" Jt(θ)","inline":true},{"text":", is calculated. 
Moreover, it is shown that for every ","element":"span"},{"style":{"height":12},"width":96,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-40.png","element":"img","alt":" θ ∈ Θ","inline":true,"padRight":true},{"text":"the optimal threshold ","element":"span"},{"style":{"height":16},"width":65.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-41.png","element":"img","alt":" t(θ)","inline":true,"padRight":true},{"text":"can be numerically found as the smallest ","element":"span"},{"style":{"height":11.6},"width":92.24,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-42.png","element":"img","alt":" i ∈ N","inline":true,"padRight":true},{"text":"for which ","element":"span"},{"style":{"height":17.38},"width":274.84,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-43.png","element":"img","alt":" Ji(θ) < Ji+1(θ)","inline":true},{"text":". Define the set ","element":"span"},{"style":{"height":11},"width":42.5,"height":27.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-44.png","element":"img","alt":" T ∗ ","inline":true,"padRight":true},{"text":"as the set of all optimal thresholds corresponding to at least one ","element":"span"},{"style":{"height":13.8},"width":152,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-45.png","element":"img","alt":" θ ∈ Θ, or","inline":true}],[{"style":{"width":"31%"},"width":500,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/34-46.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Remark 6. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"There is a discrepancy between the class of MDPs defined in this section and in Section ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"as in the former the MDPs are not irreducible in the whole state space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"style":{"fontStyle":"italic"},"text":". Specifically, for every Markov process generated by a queueing system with parameter ","element":"span"},{"style":{"height":11.6},"width":17,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-0.png","element":"img","alt":" θ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"following threshold policy ","element":"span"},{"style":{"height":9.6},"width":32.5,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-1.png","element":"img","alt":" πt","inline":true},{"style":{"fontStyle":"italic"},"text":", irreducibility holds on ","element":"span"},{"style":{"height":13.2},"width":126.5,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-2.png","element":"img","alt":" Xt ⊂ X","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Nevertheless, the results of Section ","element":"span"},{"style":{"fontStyle":"italic"},"text":"4 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"are valid as starting from state ","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0)","element":"span"},{"style":{"fontStyle":"italic"},"text":", the visited states are positive recurrent; see Remark ","element":"span"},{"href":"#id-143","style":{"fontStyle":"italic"},"text":"3.","element":"a"}],[{"text":"In the following proposition, we verify the geometric ergodicity of the discrete-time chain governed by any parameter ","element":"span"},{"style":{"height":11.8},"width":96,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-3.png","element":"img","alt":" θ ∈ Θ","inline":true,"padRight":true},{"text":"and obtained by following any threshold policy ","element":"span"},{"style":{"height":13.4},"width":207,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-4.png","element":"img","alt":" πt for t ∈ T ∗","inline":true},{"text":"; proof is given in Appendix ","element":"span"},{"href":"#id-144","text":"F.1.","element":"a"}],[{"id":"id-169","style":{"fontWeight":"bold"},"text":"Proposition 1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"The discrete-time Markov process obtained from the queueing system governed by parameter ","element":"span"},{"style":{"height":16},"width":286.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-5.png","element":"img","alt":" θ = (θ1, θ2) ∈ Θ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and following threshold policy ","element":"span"},{"style":{"height":9.6},"width":32.5,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-6.png","element":"img","alt":" πt","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for some ","element":"span"},{"style":{"height":11.8},"width":111,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-7.png","element":"img","alt":" t ∈ T ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is geometrically ergodic. Equivalently, the following holds","element":"span"}],[{"style":{"width":"61%"},"width":976,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-8.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for","element":"span"}],[{"id":"id-146","style":{"width":"82%"},"width":1314,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-9.png","element":"img"}],[{"style":{"width":"82%"},"width":1314,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-10.png","element":"img"}],[{"id":"id-145","style":{"width":"83%"},"width":1316,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-11.png","element":"img"}],[{"text":"Having described all the terms explicitly, we verify the rest of the conditions of Assumption ","element":"span"},{"href":"#id-10","text":"3, ","element":"a"},{"text":"which lead to uniform (over model class) upper-bounds on the moments of hitting time to 
","element":"span"},{"style":{"height":14},"width":35,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-12.png","element":"img","alt":" 0d ","inline":true,"padRight":true},{"text":"as follows:","element":"span"}],[{"text":"1. From ","element":"span"},{"href":"#id-145","text":"(60)","element":"a"},{"text":", ","element":"span"},{"style":{"height":19.66},"width":469.48,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-13.png","element":"img","alt":" supθ∈Θ,t∈T ∗ γgθ,t ≤ 1/2 < 1.","inline":true}],[{"text":"2. From ","element":"span"},{"href":"#id-146","text":"(58)","element":"a"},{"text":", we can see that state ","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0) ","element":"span"},{"text":"belongs to ","element":"span"},{"style":{"height":19.66},"width":66,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-14.png","element":"img","alt":" Cgθ,t","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":12.2},"width":102.5,"height":30.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-15.png","element":"img","alt":" θ ∈ Θ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":11.8},"width":112,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-16.png","element":"img","alt":" t ∈ T ∗","inline":true},{"text":". 
In order for ","element":"span"},{"style":{"height":19.66},"width":342.28,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-17.png","element":"img","alt":"Cg∗ = ∪θ∈Θ,t∈T ∗Cgθ,t ","inline":true,"padRight":true},{"text":"to be a finite set, the supremum of the optimal threshold ","element":"span"},{"style":{"height":16},"width":298.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-18.png","element":"img","alt":" t(θ) over Θ should","inline":true,"padRight":true},{"text":"be finite. In [","element":"span"},{"href":"#id-147","referenceIndex":30,"text":"30","element":"a"},{"text":"] with service rate vector ","element":"span"},{"style":{"height":16},"width":122.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-19.png","element":"img","alt":" (θ1, θ2)","inline":true},{"text":", it is shown that the optimal threshold is bounded above by","element":"span"},{"style":{"height":17.98},"width":144.36,"height":44.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-20.png","element":"img","alt":"√2θ1/θ2","inline":true},{"text":", which further gives","element":"span"}],[{"id":"id-153","style":{"width":"60%"},"width":958,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-21.png","element":"img"}],[{"text":"Thus, ","element":"span"},{"style":{"height":18.8},"width":331.5,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-22.png","element":"img","alt":" supθ∈Θ t(θ) ≤√2R","inline":true},{"text":", which is finite. 
To confirm a uniform upper bound for ","element":"span"},{"style":{"height":19.4},"width":52,"height":48.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-23.png","element":"img","alt":" bgθ,t","inline":true},{"text":", we note ","element":"span"},{"text":"that from ","element":"span"},{"href":"#id-146","text":"(59)","element":"a"},{"text":",","element":"span"}],[{"style":{"width":"53%"},"width":852,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-24.png","element":"img"}],[{"text":"which is finite as ","element":"span"},{"style":{"height":16.46},"width":175,"height":41.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-25.png","element":"img","alt":" |Cg∗| < ∞.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Assumption 4. ","element":"span"},{"text":"To find an upper bound on the second moment of hitting times, we verify Assumption ","element":"span"},{"href":"#id-11","text":"4 ","element":"a"},{"text":"and show that there exists a finite set ","element":"span"},{"style":{"height":19.66},"width":65.96,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-26.png","element":"img","alt":" Cpθ,t","inline":true},{"text":", constants ","element":"span"},{"style":{"height":19.66},"width":604.32,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-27.png","element":"img","alt":" βpθ,t, bpθ,t > 0, r/(r + 1) ≤ αpθ,t < 1","inline":true},{"text":", and a ","element":"span"},{"text":"function ","element":"span"},{"style":{"height":19.66},"width":341.84,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-28.png","element":"img","alt":" V pθ,t : X t → [1, 
+∞)","inline":true,"padRight":true},{"text":"satisfying","element":"span"}],[{"id":"id-148","style":{"width":"80%"},"width":1278,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-29.png","element":"img"}],[{"id":"id-149","style":{"fontWeight":"bold"},"text":"Proposition 2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The discrete-time Markov process obtained from the queueing system governed by parameter ","element":"span"},{"style":{"height":16},"width":290.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-30.png","element":"img","alt":" θ = (θ1, θ2) ∈ Θ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and following threshold policy ","element":"span"},{"style":{"height":9.4},"width":32.5,"height":23.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-31.png","element":"img","alt":" πt","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for some ","element":"span"},{"style":{"height":11.79},"width":115.92,"height":29.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/35-32.png","element":"img","alt":" t ∈ T ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is polynomially","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"ergodic. This is true because ","element":"span"},{"href":"#id-148","text":"(62) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds for","element":"span"}],[{"id":"id-152","style":{"width":"92%"},"width":1466,"height":470,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-0.png","element":"img"}],[{"text":"Proof of Proposition ","element":"span"},{"href":"#id-149","text":"2 ","element":"a"},{"text":"is given in Appendix ","element":"span"},{"href":"#id-150","text":"F.2. 
","element":"a"},{"text":"We define the normalized rates as ","element":"span"},{"style":{"height":22.2},"width":211,"height":55.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-1.png","element":"img","alt":"˜λ = λλ+θ1+θ2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":22.2},"width":220.5,"height":55.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-2.png","element":"img","alt":"˜θi = θiλ+θ1+θ2","inline":true,"padRight":true},{"text":", for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"text":". From the choice of parameter space ","element":"span"},{"style":{"height":11.8},"width":27,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-3.png","element":"img","alt":" Θ","inline":true},{"text":", we have ","element":"span"},{"style":{"height":17.41},"width":251.76,"height":43.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-4.png","element":"img","alt":" ˜λ ≤ 0.5 − 0.5δ","inline":true},{"text":", ","element":"span"},{"style":{"height":17.41},"width":730.72,"height":43.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-5.png","element":"img","alt":"˜θ1 + ˜θ2 ≥ 0.5 + 0.5δ, and ˜θ1 ≥ 0.25 + 0.25δ","inline":true},{"text":". We verify the remaining conditions of Assumption ","element":"span"},{"href":"#id-11","text":"4 ","element":"a"},{"text":"as follows:","element":"span"}],[{"text":"1. From ","element":"span"},{"href":"#id-151","text":"(63)","element":"a"},{"text":", the first condition holds with ","element":"span"},{"style":{"height":14.6},"width":305,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-6.png","element":"img","alt":" rp∗ = 2 and sp∗ = 2.","inline":true}],[{"text":"2. 
From ","element":"span"},{"href":"#id-152","text":"(64)","element":"a"},{"text":", we can see that state ","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0) ","element":"span"},{"text":"belongs to ","element":"span"},{"style":{"height":19.6},"width":465.5,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-7.png","element":"img","alt":" Cpθ,t for all θ ∈ Θ and t ∈ T ∗","inline":true},{"text":". Furthermore,","element":"span"}],[{"id":"id-151","style":{"width":"32%"},"width":520,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-8.png","element":"img"}],[{"text":"which follows from the stability condition ","element":"span"},{"style":{"height":17.2},"width":256,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-9.png","element":"img","alt":"˜λ ≤ 0.5 − 0.5δ","inline":true},{"text":". Thus, from the definition of ","element":"span"},{"style":{"height":19.6},"width":63.5,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-10.png","element":"img","alt":" Cpθ,t","inline":true,"padRight":true},{"text":"in ","element":"span"},{"href":"#id-152","text":"(64)","element":"a"},{"text":", and the fact that ","element":"span"},{"style":{"height":18.4},"width":340.5,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-11.png","element":"img","alt":" supθ∈Θ t(θ) ≤√2R","inline":true,"padRight":true},{"text":"as argued in ","element":"span"},{"href":"#id-153","text":"(61)","element":"a"},{"text":", ","element":"span"},{"style":{"height":19.6},"width":348,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-12.png","element":"img","alt":" Cp∗ = ∪θ∈Θ,t∈T ∗Cpθ,t","inline":true,"padRight":true},{"text":"is a ","element":"span"},{"text":"finite set. 
We also note that ","element":"span"},{"style":{"height":19.66},"width":267.96,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-13.png","element":"img","alt":" supθ∈Θ,t∈T ∗ bpθ,t","inline":true,"padRight":true},{"text":"is finite as ","element":"span"},{"style":{"height":16.6},"width":170,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-14.png","element":"img","alt":" |Cp∗| < ∞","inline":true},{"text":". It remains to show that ","element":"span"},{"style":{"height":19.8},"width":260,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-15.png","element":"img","alt":"infθ∈Θ,t∈T ∗ βpθ,t","inline":true,"padRight":true},{"text":"is positive, which is equivalent to verifying that ","element":"span"},{"style":{"height":21.2},"width":350,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-16.png","element":"img","alt":" supθ∈Θ,t∈T ∗ ˜λ < 1/2","inline":true},{"text":", which ","element":"span"},{"text":"follows from the stability condition ","element":"span"},{"style":{"height":17.2},"width":252,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-17.png","element":"img","alt":"˜λ ≤ 0.5 − 0.5δ.","inline":true}],[{"text":"3. 
We need to show that ","element":"span"},{"style":{"height":18.6},"width":632.5,"height":46.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-18.png","element":"img","alt":" Kθ,t(x) := ∑∞n=0 2−n−2 (P tθ)n (x, 0d)","inline":true,"padRight":true},{"text":"is strictly bounded away from zero. ","element":"span"},{"text":"We notice that from any non-zero state ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"text":", the queueing system hits ","element":"span"},{"style":{"height":17.6},"width":167.5,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-19.png","element":"img","alt":" 0d in ∥x∥1","inline":true,"padRight":true},{"text":"transitions only if all transitions are real departures. Hence,","element":"span"}],[{"style":{"width":"47%"},"width":748,"height":384,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-20.png","element":"img"}],[{"text":"where the third and fourth inequalities follow from the definition of ","element":"span"},{"href":"#id-154","style":{"height":14.2},"width":144.5,"height":35.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-21.png","element":"img","alt":" Θ in (55)","inline":true},{"text":". Thus, the infimum of ","element":"span"},{"style":{"height":16.6},"width":126,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-22.png","element":"img","alt":" Kθ,t(x)","inline":true,"padRight":true},{"text":"over the finite set ","element":"span"},{"style":{"height":14.6},"width":346.5,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-23.png","element":"img","alt":" Cp∗ and sets Θ and T ∗","inline":true,"padRight":true},{"text":"is strictly greater than zero.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Assumption 5. 
","element":"span"},{"text":"We finally verify Assumption ","element":"span"},{"href":"#id-65","text":"5, ","element":"a"},{"text":"which asserts that ","element":"span"},{"style":{"height":16.6},"width":203,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-24.png","element":"img","alt":" supθ∈Θ J(θ)","inline":true,"padRight":true},{"text":"is finite. We have","element":"span"}],[{"style":{"width":"81%"},"width":1298,"height":80,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-25.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":12.8},"width":98.5,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-26.png","element":"img","alt":" µθ,t(θ)","inline":true,"padRight":true},{"text":"is the stationary distribution of the discrete-time process governed by parameter ","element":"span"},{"style":{"height":11.6},"width":85.5,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-27.png","element":"img","alt":" θ and","inline":true,"padRight":true},{"text":"following the optimal policy according to ","element":"span"},{"style":{"height":11.6},"width":17,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-28.png","element":"img","alt":" θ","inline":true},{"text":". From ","element":"span"},{"href":"#id-148","text":"(62) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-87","referenceIndex":35,"text":"[35, ","element":"a"},{"text":"Theorem 14.3.7],","element":"span"}],[{"style":{"width":"31%"},"width":502,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-29.png","element":"img"}],[{"text":"which is finite from the previously verified assumption. 
Consequently,","element":"span"}],[{"style":{"width":"22%"},"width":358,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/36-30.png","element":"img"}],[{"id":"id-83","style":{"fontWeight":"bold"},"text":"E.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Model 2: Two heterogeneous parallel queues","element":"span"}],[{"text":"We consider two parallel queues with infinite buffers, each with its own single server, and unknown service rate vector ","element":"span"},{"style":{"height":16.4},"width":215,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-0.png","element":"img","alt":" θ∗ = (θ∗1, θ∗2)","inline":true},{"text":", shown in Figure ","element":"span"},{"href":"#id-79","text":"2b. ","element":"a"},{"text":"The service rate vector ","element":"span"},{"style":{"height":12.6},"width":36.5,"height":31.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-1.png","element":"img","alt":" θ∗","inline":true,"padRight":true},{"text":"is sampled from the ","element":"span"},{"text":"prior distribution ","element":"span"},{"style":{"height":9.8},"width":32.5,"height":24.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-2.png","element":"img","alt":" ν0","inline":true,"padRight":true},{"text":"defined on the space ","element":"span"},{"style":{"height":14.4},"width":168,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-3.png","element":"img","alt":" Θ given as","inline":true}],[{"id":"id-155","style":{"width":"79%"},"width":1260,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-4.png","element":"img"}],[{"text":"for fixed ","element":"span"},{"style":{"height":16},"width":188,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-5.png","element":"img","alt":" δ ∈ (0, 0.5)","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":13.2},"width":103.72,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-6.png","element":"img","alt":" R ≥ 1","inline":true},{"text":", which ensures the stability of the queueing system. Consider the discrete-time MDP ","element":"span"},{"style":{"height":16},"width":230.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-7.png","element":"img","alt":" (X, A, Pθ∗, c)","inline":true,"padRight":true},{"text":"obtained by sampling the queueing system at the Poisson arrival sequence. The countably infinite state space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"is defined as below","element":"span"}],[{"style":{"width":"38%"},"width":610,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-8.png","element":"img"}],[{"text":"where the state of the system is the number of jobs in the server-queue pair ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"just before an arrival. Furthermore, the action space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"is equal to","element":"span"}],[{"style":{"width":"12%"},"width":198,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-9.png","element":"img"}],[{"text":"where action ","element":"span"},{"style":{"height":12.4},"width":102.6,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-10.png","element":"img","alt":" i ∈ A","inline":true,"padRight":true},{"text":"indicates the arrival dispatched to queue ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":". 
The unbounded cost function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":": ","element":"span"},{"style":{"height":16},"width":297.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-11.png","element":"img","alt":"X ×A → N∪{0}","inline":true,"padRight":true},{"text":"is defined as the total number of jobs in the queueing system, i.e., ","element":"span"},{"style":{"height":16},"width":260.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-12.png","element":"img","alt":" c(x, a) = ∥x∥1.","inline":true,"padRight":true},{"text":"For every ","element":"span"},{"style":{"height":14.78},"width":128.72,"height":36.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-13.png","element":"img","alt":" ω ∈ R+","inline":true},{"text":", we define policy ","element":"span"},{"style":{"height":14.2},"width":206.5,"height":35.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-14.png","element":"img","alt":" πω : X → A","inline":true},{"text":", which routes the arrival according to the weighted queue lengths, as","element":"span"}],[{"style":{"width":"40%"},"width":646,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-15.png","element":"img"}],[{"text":"where the tie is broken in favor of the first server. 
We also define policy class ","element":"span"},{"style":{"height":14.8},"width":28,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-16.png","element":"img","alt":"˜Π","inline":true,"padRight":true},{"text":"as the set of policies ","element":"span"},{"style":{"height":13.6},"width":232,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-17.png","element":"img","alt":"πω such that ω","inline":true,"padRight":true},{"text":"belongs to a compact interval; in other words,","element":"span"}],[{"style":{"width":"33%"},"width":526,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-18.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"is defined in ","element":"span"},{"href":"#id-155","text":"(68) ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":13.4},"width":112.5,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-19.png","element":"img","alt":" cR ≥ 1","inline":true},{"text":". We aim to minimize the infinite-horizon average cost in the policy class ","element":"span"},{"style":{"height":17.23},"width":156.16,"height":43.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-20.png","element":"img","alt":"˜Π, that is,","inline":true}],[{"id":"id-156","style":{"width":"75%"},"width":1194,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-21.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":381.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-22.png","element":"img","alt":" X(t) = (X1(t), X2(t))","inline":true,"padRight":true},{"text":"is the occupancy vector of the queueing system just before arrival ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
Even with the controlled Markov process transition kernel fully specified (by the values of the arrival rate and the two service rates), the optimal policy","element":"span"},{"text":"1 ","element":"span"},{"text":"that satisfies ","element":"span"},{"href":"#id-156","text":"(69) ","element":"a"},{"text":"in policy class ","element":"span"},{"style":{"height":14.8},"width":28,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-23.png","element":"img","alt":"˜Π","inline":true,"padRight":true},{"text":"is not known except when ","element":"span"},{"style":{"height":13.8},"width":126.5,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-24.png","element":"img","alt":" θ1 = θ2","inline":true,"padRight":true},{"text":"where the optimal value is ","element":"span"},{"style":{"height":11},"width":101.5,"height":27.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-25.png","element":"img","alt":" ω = 1","inline":true},{"text":", and so, to learn it, we will use Proximal Policy Optimization for countable state-space controlled Markov processes as developed in [","element":"span"},{"href":"#id-35","referenceIndex":14,"text":"14","element":"a"},{"text":"]. Note that [","element":"span"},{"href":"#id-35","referenceIndex":14,"text":"14","element":"a"},{"text":"] requires full knowledge of the controlled Markov process, which holds in our learning scheme since we use the parameters sampled from the posterior to determine the policy at the beginning of each episode. 
Furthermore, for each policy in the set of applicable policies ","element":"span"},{"href":"#id-35","referenceIndex":14,"style":{"height":17.23},"width":189.44,"height":43.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-26.png","element":"img","alt":"˜Π, [14] also","inline":true,"padRight":true},{"text":"requires that the resulting Markov process be geometrically ergodic, which we will establish below.","element":"span"}],[{"id":"id-157","style":{"fontWeight":"bold"},"text":"Proposition 3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The discrete-time Markov process obtained from the queueing system governed by parameter ","element":"span"},{"style":{"height":16},"width":274.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-27.png","element":"img","alt":" θ = (θ1, θ2) ∈ Θ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and following policy ","element":"span"},{"style":{"height":17.4},"width":122.5,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-28.png","element":"img","alt":" πω ∈ ˜Π","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is geometrically ergodic. 
Equivalently, the following holds","element":"span"}],[{"id":"id-173","style":{"width":"82%"},"width":1302,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/37-29.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for","element":"span"}],[{"id":"id-159","style":{"width":"100%"},"width":1608,"height":602,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"and problem-dependent constants ","element":"span"},{"style":{"height":20.64},"width":495.8,"height":51.6,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-1.png","element":"img","alt":" xgji,θ,ω and ζi,θ,ω for i, j = 1, 2.","inline":true}],[{"text":"Proof of Proposition ","element":"span"},{"href":"#id-157","text":"3 ","element":"a"},{"text":"is given in Appendix ","element":"span"},{"href":"#id-158","text":"F.3. ","element":"a"},{"text":"In the rest of this subsection, our aim is to show that Assumptions 1-5 are satisfied for the discrete-time MDP and conclude that Algorithm ","element":"span"},{"href":"#id-57","text":"1 ","element":"a"},{"text":"can be used to learn the unknown service rate vector ","element":"span"},{"style":{"height":12.4},"width":36,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-2.png","element":"img","alt":" θ∗ ","inline":true,"padRight":true},{"text":"with expected regret of order ","element":"span"},{"style":{"height":18.8},"width":130.5,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-3.png","element":"img","alt":"˜O(√T).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Assumption 1. 
","element":"span"},{"text":"Cost function is given as ","element":"span"},{"style":{"height":16},"width":266.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-4.png","element":"img","alt":" c(x, a) = ∥x∥1","inline":true},{"text":", which satisfies Assumption ","element":"span"},{"href":"#id-54","text":"1 ","element":"a"},{"text":"with ","element":"span"},{"style":{"height":16},"width":634.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-5.png","element":"img","alt":"fc(x) = x0 + x1 + x2 and K = r = 1.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Assumption 2. ","element":"span"},{"text":"For any state-action pair ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", a","element":"span"},{"text":") ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":11.6},"width":109.2,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-6.png","element":"img","alt":" θ ∈ Θ","inline":true},{"text":", we have ","element":"span"},{"style":{"height":16},"width":327.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-7.png","element":"img","alt":" Pθ(A(x); x, a) = 0","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":16.78},"width":608.24,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-8.png","element":"img","alt":"A(x) = {y ∈ X : ∥y∥1 − ∥x∥1 > 1}","inline":true},{"text":"; thus, the MDP is skip-free to the right with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"= 1","element":"span"},{"text":". 
Moreover, from any ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", a","element":"span"},{"text":")","element":"span"},{"text":", only the finite set ","element":"span"},{"style":{"height":16.78},"width":497.8,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-9.png","element":"img","alt":" {y ∈ X : ∥y∥1 ≤ ∥x∥1 + 1}","inline":true,"padRight":true},{"text":"is accessible in one step; thus, Assumption ","element":"span"},{"href":"#id-77","text":"2 ","element":"a"},{"text":"holds.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Assumption 3. ","element":"span"},{"text":"In Proposition ","element":"span"},{"href":"#id-157","text":"3, ","element":"a"},{"text":"we verified the geometric ergodicity of the discrete-time chain governed by parameter ","element":"span"},{"style":{"height":16},"width":285.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-10.png","element":"img","alt":" θ = (θ1, θ2) ∈ Θ","inline":true,"padRight":true},{"text":"and following policy ","element":"span"},{"style":{"height":17.4},"width":128,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-11.png","element":"img","alt":" πω ∈ ˜Π","inline":true,"padRight":true},{"text":"and thus, it only remains to verify the uniform model conditions. 
We define the normalized rates as ","element":"span"},{"style":{"height":22},"width":227.5,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-12.png","element":"img","alt":"˜λ = λλ+θ1+θ2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":22.2},"width":231.5,"height":55.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-13.png","element":"img","alt":"˜θi = θiλ+θ1+θ2","inline":true,"padRight":true},{"text":", for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"text":". From the choice of parameter space ","element":"span"},{"style":{"height":11.8},"width":27,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-14.png","element":"img","alt":" Θ","inline":true},{"text":", we have ","element":"span"},{"style":{"height":17.41},"width":265.12,"height":43.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-15.png","element":"img","alt":" ˜λ ≤ 0.5 − 0.5δ","inline":true},{"text":", ","element":"span"},{"style":{"height":17.4},"width":737.5,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-16.png","element":"img","alt":"˜θ1 + ˜θ2 ≥ 0.5 + 0.5δ, and ˜θ1 ≥ 0.25 + 0.25δ.","inline":true}],[{"text":"1. 
We first argue that ","element":"span"},{"style":{"height":15.8},"width":86,"height":39.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-17.png","element":"img","alt":" ζ1,θ,ω","inline":true,"padRight":true},{"text":"is bounded away from ","element":"span"},{"text":"1 ","element":"span"},{"text":"as follows","element":"span"}],[{"style":{"width":"86%"},"width":1378,"height":438,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-18.png","element":"img"}],[{"text":"where the first line follows from the definition of ","element":"span"},{"style":{"height":15.58},"width":88.28,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-19.png","element":"img","alt":" ζ1,θ,ω","inline":true,"padRight":true},{"text":"in Appendix ","element":"span"},{"href":"#id-158","text":"F.3, ","element":"a"},{"text":"the second line from ","element":"span"},{"href":"#id-159","text":"(71) ","element":"a"},{"text":"and the definition of policy class ","element":"span"},{"style":{"height":22.03},"width":169.2,"height":55.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-20.png","element":"img","alt":"˜Π. As agθ,ω ","inline":true,"padRight":true},{"text":"does not depend on ","element":"span"},{"style":{"height":21.8},"width":545,"height":54.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-21.png","element":"img","alt":" θ, supθ∈Θ,ω∈[ 1cRR ,cRR] ζ1,θ,ω < 1.","inline":true,"padRight":true},{"text":"Furthermore, by similar arguments it can be shown that ","element":"span"},{"style":{"height":15.4},"width":86,"height":38.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-22.png","element":"img","alt":" ζ2,θ,ω","inline":true,"padRight":true},{"text":"is bounded away from ","element":"span"},{"text":"1","element":"span"},{"text":". 
We next argue that ","element":"span"},{"style":{"height":29.65},"width":432.32,"height":74.12,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/38-23.png","element":"img","alt":"ζ1,θ,ωω1+ω exp� agθ,ωω �+ ζ2,θ,ω1+ω","inline":true,"padRight":true},{"text":"is bounded away from 1 using an upper bound found in","element":"span"}],[{"text":"Appendix ","element":"span"},{"href":"#id-158","text":"F.3 ","element":"a"},{"text":"as below,","element":"span"}],[{"id":"id-160","style":{"width":"82%"},"width":1306,"height":700,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-0.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":19.8},"width":494,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-1.png","element":"img","alt":" ζ3 = (1 + δ)−1, ζ4 = 1−0.5δ1−δ","inline":true,"padRight":true},{"text":", and we have used the arguments of Appendix ","element":"span"},{"href":"#id-158","text":"F.3 ","element":"a"},{"text":"and ","element":"span"},{"text":"the definition of ","element":"span"},{"style":{"height":11.8},"width":27.5,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-2.png","element":"img","alt":" Θ","inline":true},{"text":". Using a similar argument, we can show that ","element":"span"},{"style":{"height":28.8},"width":438.52,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-3.png","element":"img","alt":"ζ1,θ,ωω1+ω + ζ2,θ,ω1+ω exp�agθ,ω�","inline":true},{"text":"is bounded away from one, and finally, we conclude that ","element":"span"},{"style":{"height":22.93},"width":488.92,"height":57.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-4.png","element":"img","alt":" supθ∈Θ,ω∈[ 1cRR ,cRR] γgθ,ω < 1.","inline":true}],[{"text":"2. 
From ","element":"span"},{"href":"#id-159","text":"(72)","element":"a"},{"text":", we can see that state ","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0) ","element":"span"},{"text":"belongs to ","element":"span"},{"style":{"height":19.8},"width":71.5,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-5.png","element":"img","alt":" Cgθ,ω","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":12.2},"width":102.5,"height":30.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-6.png","element":"img","alt":" θ ∈ Θ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":21},"width":261,"height":52.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-7.png","element":"img","alt":" ω ∈ [ 1cRR, cRR]","inline":true},{"text":". In ","element":"span"},{"text":"order for ","element":"span"},{"style":{"height":14.6},"width":45,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-8.png","element":"img","alt":" Cg∗","inline":true,"padRight":true},{"text":"to be a finite set, the supremum of ","element":"span"},{"style":{"height":20.8},"width":87.5,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-9.png","element":"img","alt":" xgji,θ,ω","inline":true,"padRight":true},{"text":"over ","element":"span"},{"style":{"height":11.8},"width":27,"height":29.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-10.png","element":"img","alt":" Θ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.8},"width":28,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-11.png","element":"img","alt":" ˜Π","inline":true,"padRight":true},{"text":"should be finite. 
From the ","element":"span"},{"text":"definition of ","element":"span"},{"style":{"height":19.8},"width":92.5,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-12.png","element":"img","alt":" xg11,θ,ω ","inline":true,"padRight":true},{"text":"in Appendix ","element":"span"},{"href":"#id-158","text":"F.3,","element":"a"}],[{"style":{"width":"70%"},"width":1114,"height":288,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-13.png","element":"img"}],[{"text":"and we can derive a lower bound for the denominator from ","element":"span"},{"href":"#id-160","text":"(75)","element":"a"},{"text":". Similarly, we can show that ","element":"span"},{"style":{"height":22.93},"width":430.32,"height":57.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-14.png","element":"img","alt":"supθ∈Θ,ω∈[ 1cRR ,cRR] xg22,θ,ω","inline":true,"padRight":true},{"text":"is finite. We next find a uniform upper bound for ","element":"span"},{"style":{"height":19.8},"width":92,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-15.png","element":"img","alt":" xg12,θ,ω","inline":true,"padRight":true},{"text":"from Appendix ","element":"span"},{"href":"#id-158","text":"F.3, ","element":"a"},{"style":{"height":19.6},"width":92,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-16.png","element":"img","alt":"xg12,θ,ω","inline":true}],[{"style":{"width":"93%"},"width":1480,"height":312,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-17.png","element":"img"}],[{"text":"which is uniformly bounded as ","element":"span"},{"style":{"height":19.6},"width":65,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-18.png","element":"img","alt":" γgθ,ω ","inline":true,"padRight":true},{"text":"is uniformly bounded away from 1, and the second line follows ","element":"span"},{"text":"from 
","element":"span"},{"href":"#id-159","text":"(74) ","element":"a"},{"text":"and the fact that ","element":"span"},{"style":{"height":19.66},"width":400.84,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-19.png","element":"img","alt":" γgθ,ω − ζ2,θ,ω ≥ 1 − γgθ,ω","inline":true},{"text":". Arguments verifying the finiteness of the ","element":"span"},{"text":"supremum of ","element":"span"},{"style":{"height":19.6},"width":92,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-20.png","element":"img","alt":" xg21,θ,ω","inline":true,"padRight":true},{"text":"follow similarly, and we conclude that ","element":"span"},{"style":{"height":16.46},"width":172.28,"height":41.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-21.png","element":"img","alt":" |Cg∗| < ∞","inline":true},{"text":". To confirm a uniform ","element":"span"},{"text":"upper bound for ","element":"span"},{"style":{"height":19.66},"width":62.6,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-22.png","element":"img","alt":" bgθ,ω","inline":true},{"text":", we note that from ","element":"span"},{"href":"#id-159","text":"(73)","element":"a"},{"text":",","element":"span"}],[{"style":{"width":"92%"},"width":1462,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-23.png","element":"img"}],[{"text":"which is finite as ","element":"span"},{"style":{"height":19.6},"width":106,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-24.png","element":"img","alt":" agθ,cRR ","inline":true,"padRight":true},{"text":"is independent of the choice of ","element":"span"},{"style":{"height":16.6},"width":268.5,"height":41.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-25.png","element":"img","alt":" θ and |Cg∗| < ∞.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Assumption 4. 
","element":"span"},{"text":"We next verify Assumption ","element":"span"},{"href":"#id-11","text":"4 ","element":"a"},{"text":"and show that there exists a finite set ","element":"span"},{"style":{"height":19.6},"width":71.5,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-26.png","element":"img","alt":" Cpθ,ω","inline":true},{"text":", constants ","element":"span"},{"id":"id-162","style":{"height":19.66},"width":607.76,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-27.png","element":"img","alt":"βpθ,ω, bpθ,ω > 0, r/(r + 1) ≤ αpθ,ω < 1","inline":true},{"text":", and a function ","element":"span"},{"style":{"height":19.66},"width":336.8,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-28.png","element":"img","alt":" V pθ,ω : X → [1, +∞)","inline":true,"padRight":true},{"text":"satisfying","element":"span"}],[{"id":"id-161","style":{"width":"81%"},"width":1294,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/39-29.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Proposition 4. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The discrete-time Markov process obtained from the queueing system governed by parameter ","element":"span"},{"style":{"height":16},"width":292.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-0.png","element":"img","alt":" θ = (θ1, θ2) ∈ Θ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and following policy ","element":"span"},{"style":{"height":17.4},"width":132,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-1.png","element":"img","alt":" πω ∈ ˜Π","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is polynomially ergodic. 
This follows because ","element":"span"},{"href":"#id-161","text":"(76) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds for","element":"span"}],[{"id":"id-164","style":{"width":"96%"},"width":1522,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-2.png","element":"img"}],[{"id":"id-165","style":{"width":"96%"},"width":1522,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-3.png","element":"img"}],[{"id":"id-166","style":{"width":"96%"},"width":1522,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-4.png","element":"img"}],[{"id":"id-168","style":{"width":"96%"},"width":1522,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-5.png","element":"img"}],[{"style":{"width":"96%"},"width":1522,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-6.png","element":"img"}],[{"text":"Proof of Proposition ","element":"span"},{"href":"#id-162","text":"4 ","element":"a"},{"text":"is given in Appendix ","element":"span"},{"href":"#id-163","text":"F.4. ","element":"a"},{"text":"Next, we verify the remaining conditions of Assumption ","element":"span"},{"href":"#id-11","text":"4.","element":"a"}],[{"text":"1. 
From ","element":"span"},{"href":"#id-164","text":"(77) ","element":"a"},{"text":"and the fact that ","element":"span"},{"style":{"height":20.98},"width":269.56,"height":52.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-7.png","element":"img","alt":" ω ∈ [ 1cRR, cRR]","inline":true},{"text":", the first condition holds with ","element":"span"},{"style":{"height":14.6},"width":118,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-8.png","element":"img","alt":" rp∗ = 2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.6},"width":80.5,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-9.png","element":"img","alt":" sp∗ =","inline":true},{"style":{"height":21.28},"width":609.64,"height":53.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-10.png","element":"img","alt":"supθ∈Θ,ω∈[ 1cRR ,cRR] sθ,ω = cRR + 1.","inline":true}],[{"text":"2. From ","element":"span"},{"href":"#id-165","text":"(78)","element":"a"},{"text":", state ","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0) ","element":"span"},{"text":"belongs to ","element":"span"},{"style":{"height":19.6},"width":71.5,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-11.png","element":"img","alt":" Cpθ,ω","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":12.2},"width":106.5,"height":30.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-12.png","element":"img","alt":" θ ∈ Θ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":21},"width":264.5,"height":52.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-13.png","element":"img","alt":" ω ∈ [ 1cRR, cRR]","inline":true},{"text":". 
Furthermore, for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"text":",","element":"span"}],[{"id":"id-167","style":{"width":"76%"},"width":1218,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-14.png","element":"img"}],[{"text":"which follows from the fact that ","element":"span"},{"style":{"height":13.8},"width":158.5,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-15.png","element":"img","alt":" θ1 ≤ Rθ2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.4},"width":306.5,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-16.png","element":"img","alt":"˜θ1 ≥ 0.25 + 0.25δ","inline":true},{"text":". Thus, from the definition of ","element":"span"},{"style":{"height":19.66},"width":74,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-17.png","element":"img","alt":" Cpθ,ω","inline":true,"padRight":true},{"text":"in ","element":"span"},{"href":"#id-165","text":"(78)","element":"a"},{"text":", ","element":"span"},{"style":{"height":22.93},"width":473.72,"height":57.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-18.png","element":"img","alt":" Cp∗ = ∪θ∈Θ,ω∈[ 1cRR ,cRR]Cpθ,ω","inline":true,"padRight":true},{"text":"is a finite set. We next verify that the infimum of ","element":"span"},{"href":"#id-166","style":{"height":19.66},"width":297,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-19.png","element":"img","alt":"βpθ,ω, found in (79)","inline":true},{"text":", is positive. 
In ","element":"span"},{"href":"#id-167","text":"(82)","element":"a"},{"text":", we showed that the infimum of ","element":"span"},{"style":{"height":21.23},"width":194.08,"height":53.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-20.png","element":"img","alt":" λ+θiθi over Θ","inline":true,"padRight":true},{"text":"is lower bounded by ","element":"span"},{"style":{"height":19.8},"width":56,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-21.png","element":"img","alt":"1+δ4","inline":true,"padRight":true},{"text":". From this, the fact that ","element":"span"},{"style":{"height":6.8},"width":25,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-22.png","element":"img","alt":" ω","inline":true,"padRight":true},{"text":"belongs to a compact set, and ","element":"span"},{"style":{"height":13.8},"width":264,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-23.png","element":"img","alt":" θ1 + θ2 + λ ≥ δ","inline":true},{"text":", it follows that ","element":"span"},{"style":{"height":23.2},"width":482,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-24.png","element":"img","alt":"infθ∈Θ,ω∈[ 1cRR ,cRR] βpθ,ω > 0","inline":true},{"text":". Furthermore, it is easy to see that ","element":"span"},{"style":{"height":20.2},"width":237,"height":50.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-25.png","element":"img","alt":" βpθ,ω ≤ √cRR","inline":true},{"text":". 
Hence, from ","element":"span"},{"href":"#id-168","text":"(80)","element":"a"},{"text":",","element":"span"}],[{"style":{"width":"87%"},"width":1386,"height":214,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-26.png","element":"img"}],[{"text":"which is finite as ","element":"span"},{"style":{"height":16.8},"width":168,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-27.png","element":"img","alt":" |Cp∗| < ∞.","inline":true}],[{"text":"3. We need to show that ","element":"span"},{"style":{"height":18.4},"width":672,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-28.png","element":"img","alt":" Kθ,ω(x) := ∑∞n=0 2−n−2 (P πωθ )n (x, 0d)","inline":true,"padRight":true},{"text":"is strictly bounded away from ","element":"span"},{"text":"zero. We show this using the fact that from any state ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"text":", the queueing system hits ","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0) ","element":"span"},{"text":"in one step with positive probability. Take ","element":"span"},{"style":{"height":16.88},"width":722,"height":42.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-29.png","element":"img","alt":" xi,θ,ω = maxx∈Cθ,ω xi for i = 1, 2. 
We have","inline":true}],[{"style":{"width":"76%"},"width":1212,"height":192,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-30.png","element":"img"}],[{"text":"The infimum in the right-hand side of the above equation is attained for the minimum normalized service rates possible for each server, or ","element":"span"},{"style":{"height":20.6},"width":156,"height":51.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-31.png","element":"img","alt":"˜θ1 = 1+δ4","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.8},"width":156,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-32.png","element":"img","alt":"˜θ2 = 1+δ4R","inline":true,"padRight":true},{"text":". Therefore, the infimum of ","element":"span"},{"style":{"height":16.78},"width":140.2,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-33.png","element":"img","alt":"Kθ,ω(x)","inline":true,"padRight":true},{"text":"over the finite set ","element":"span"},{"style":{"height":14.8},"width":97,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-34.png","element":"img","alt":" Cp∗, Θ","inline":true},{"text":", and interval ","element":"span"},{"style":{"height":20.6},"width":175.5,"height":51.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-35.png","element":"img","alt":" [ 1cRR, cRR]","inline":true,"padRight":true},{"text":"is strictly greater than zero.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Assumption 5. ","element":"span"},{"text":"We finally verify that ","element":"span"},{"style":{"height":16.8},"width":203,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-36.png","element":"img","alt":" supθ∈Θ J(θ)","inline":true,"padRight":true},{"text":"is finite. 
We first note that for ","element":"span"},{"style":{"height":16},"width":216.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-37.png","element":"img","alt":" x = (x1, x2),","inline":true}],[{"style":{"width":"81%"},"width":1286,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/40-38.png","element":"img"}],[{"text":"From the above equation,","element":"span"}],[{"style":{"width":"59%"},"width":948,"height":204,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-0.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13},"width":124.5,"height":32.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-1.png","element":"img","alt":" µθ,ω∗(θ)","inline":true,"padRight":true},{"text":"is the stationary distribution of the discrete-time process governed by parameter ","element":"span"},{"style":{"height":11.6},"width":84.5,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-2.png","element":"img","alt":" θ and","inline":true,"padRight":true},{"text":"following the best in-class policy according to ","element":"span"},{"style":{"height":16.88},"width":303.52,"height":42.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-3.png","element":"img","alt":" θ, shown by πω∗(θ)","inline":true},{"text":". From ","element":"span"},{"href":"#id-87","referenceIndex":35,"text":"[35]","element":"a"},{"text":", Theorem 14.3.7,","element":"span"}],[{"style":{"width":"57%"},"width":912,"height":114,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-4.png","element":"img"}],[{"text":"which is finite by the previously verified assumption. 
Thus,","element":"span"}],[{"style":{"width":"56%"},"width":900,"height":130,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-5.png","element":"img"}]]},{"heading":"F Proofs related to the queueing model examples","paragraphs":[[{"id":"id-144","style":{"fontWeight":"bold"},"text":"F.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Proposition ","element":"span"},{"href":"#id-169","style":{"fontWeight":"bold"},"text":"1","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We define the normalized rates as","element":"span"}],[{"id":"id-172","style":{"width":"70%"},"width":1118,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-6.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"text":". From the choice of parameter space ","element":"span"},{"style":{"height":17.41},"width":810.72,"height":43.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-7.png","element":"img","alt":" Θ, we have ˜λ ≤ 0.5 − 0.5δ, θ1 + θ2 ≥ 0.5 + 0.5δ,","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.8},"width":324,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-8.png","element":"img","alt":" θ1 ≥ 0.25 + 0.25δ","inline":true},{"text":". 
To prove geometric ergodicity, from the discussions of Section ","element":"span"},{"text":"2, ","element":"span"},{"text":"it suffices to show that there exists a finite set ","element":"span"},{"style":{"height":19.66},"width":65.96,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-9.png","element":"img","alt":" Cgθ,t","inline":true},{"text":", constants ","element":"span"},{"style":{"height":19.66},"width":357.04,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-10.png","element":"img","alt":" bgθ,t > 0, γgθ,t ∈ (0, 1)","inline":true},{"text":", and a function ","element":"span"},{"style":{"height":19.68},"width":341.8,"height":49.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-11.png","element":"img","alt":"V gθ,t : X t → [1, +∞)","inline":true,"padRight":true},{"text":"satisfying","element":"span"}],[{"id":"id-171","style":{"width":"81%"},"width":1286,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-12.png","element":"img"}],[{"text":"Take ","element":"span"},{"style":{"height":19.66},"width":1164.44,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-13.png","element":"img","alt":" V gθ,t(x) = exp(agθ,t∥x∥1) for some agθ,t > 0. For i ≥ 1 and x = (i, 1, 1),","inline":true}],[{"style":{"width":"69%"},"width":1106,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-14.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.4},"width":41,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-15.png","element":"img","alt":" P tθ ","inline":true,"padRight":true},{"text":"is the corresponding transition kernel. 
Thus,","element":"span"}],[{"style":{"width":"87%"},"width":1394,"height":234,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-16.png","element":"img"}],[{"text":"Take ","element":"span"},{"style":{"height":19.66},"width":266.72,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-17.png","element":"img","alt":" ˜aθ,t = exp(agθ,t)","inline":true},{"text":". We need to find ","element":"span"},{"style":{"height":19.66},"width":571.08,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-18.png","element":"img","alt":" ˜aθ,t > 1 and 0 < γgθ,t < 1 such that","inline":true}],[{"id":"id-170","style":{"width":"69%"},"width":1102,"height":62,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-19.png","element":"img"}],[{"text":"Take ","element":"span"},{"text":"˜","element":"span"},{"style":{"height":20.4},"width":416,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-20.png","element":"img","alt":"aθ,t = (1 − δ)−1 > 1 and","inline":true}],[{"style":{"width":"60%"},"width":964,"height":88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-21.png","element":"img"}],[{"text":"We need to have ","element":"span"},{"style":{"height":15.58},"width":133.28,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-22.png","element":"img","alt":" ˜γθ,t < 1","inline":true,"padRight":true},{"text":"which follows from the stability condition ","element":"span"},{"style":{"height":17.2},"width":402,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-23.png","element":"img","alt":"˜λ ≤ 0.5 − 0.5δ as below:","inline":true}],[{"style":{"width":"84%"},"width":1334,"height":352,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/41-24.png","element":"img"}],[{"text":"We now verify 
","element":"span"},{"href":"#id-170","text":"(85)","element":"a"},{"text":":","element":"span"}],[{"style":{"width":"90%"},"width":1430,"height":602,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/42-0.png","element":"img"}],[{"text":"where the last line follows from ","element":"span"},{"style":{"height":19},"width":569,"height":47.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/42-1.png","element":"img","alt":"˜λ ≤ 0.5 − 0.5δ < (1 − δ)/ (2 − δ).","inline":true}],[{"text":"For ","element":"span"},{"style":{"height":16},"width":514.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/42-2.png","element":"img","alt":" x = (i, 0, 1) and i ≥ 1, we have","inline":true}],[{"style":{"width":"74%"},"width":1176,"height":62,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/42-3.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"81%"},"width":1286,"height":232,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/42-4.png","element":"img"}],[{"text":"which results in the same conditions as previously discussed. 
When ","element":"span"},{"style":{"height":16},"width":518.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/42-5.png","element":"img","alt":" x = (i, 1, 0) and i ≥ t also same","inline":true,"padRight":true},{"text":"argument holds.","element":"span"}],[{"text":"Finally, ","element":"span"},{"href":"#id-171","text":"(84) ","element":"a"},{"text":"holds for","element":"span"}],[{"style":{"width":"52%"},"width":838,"height":382,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/42-6.png","element":"img"}],[{"text":"where the last line holds because ","element":"span"},{"style":{"height":19.68},"width":875.64,"height":49.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/42-7.png","element":"img","alt":" PV gθ,t(x) ≤ V gθ,t(y) for y such that ∥y∥1 = ∥x∥1 + 1.","inline":true}],[{"id":"id-150","style":{"fontWeight":"bold"},"text":"F.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Proposition ","element":"span"},{"href":"#id-149","style":{"fontWeight":"bold"},"text":"2","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"To show polynomial ergodicity, we verify ","element":"span"},{"href":"#id-148","text":"(62)","element":"a"},{"text":". 
We define ","element":"span"},{"style":{"height":20.59},"width":323.84,"height":51.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/42-8.png","element":"img","alt":" V pθ,t(x) = ∥x∥21 and","inline":true},{"style":{"height":19.6},"width":176.5,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/42-9.png","element":"img","alt":"αpθ,t = 1/2","inline":true},{"text":", which is equal to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r/","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"+ 1) ","element":"span"},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"= 1","element":"span"},{"text":"; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"is defined in Assumption ","element":"span"},{"href":"#id-54","text":"1. ","element":"a"},{"text":"For ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x ","element":"span"},{"text":"= (","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1) ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":13},"width":94,"height":32.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/42-10.png","element":"img","alt":" i ≥ 1,","inline":true}],[{"style":{"width":"74%"},"width":1176,"height":62,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/42-11.png","element":"img"}],[{"text":"in which ","element":"span"},{"style":{"height":17.4},"width":198.5,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/42-12.png","element":"img","alt":"˜λ, ˜θ1, and ˜θ2","inline":true,"padRight":true},{"text":"are the normalized rates defined in 
","element":"span"},{"href":"#id-172","text":"(83)","element":"a"},{"text":". Thus,","element":"span"}],[{"style":{"width":"53%"},"width":848,"height":220,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/42-13.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":22.22},"width":253.92,"height":55.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-0.png","element":"img","alt":" βpθ,t = 1 − 2˜λ","inline":true},{"text":", the right-hand side of above equation is non-positive for ","element":"span"},{"style":{"height":24.4},"width":166,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-1.png","element":"img","alt":" i ≥ 2˜λ1−2˜λ","inline":true},{"text":". For ","element":"span"},{"style":{"height":16},"width":368.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-2.png","element":"img","alt":"x = (i, 1, 0) and i ≥ t,","inline":true}],[{"style":{"width":"74%"},"width":1176,"height":62,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-3.png","element":"img"}],[{"text":"Thus,","element":"span"}],[{"style":{"width":"53%"},"width":848,"height":220,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-4.png","element":"img"}],[{"text":"which is also non-positive under the same conditions as the previous case. For ","element":"span"},{"style":{"height":16},"width":367,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-5.png","element":"img","alt":" i ≥ 1 and x = (i, 1, 1),","inline":true}],[{"style":{"width":"69%"},"width":1106,"height":62,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-6.png","element":"img"}],[{"text":"Thus,","element":"span"}],[{"style":{"width":"59%"},"width":948,"height":220,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-7.png","element":"img"}],[{"text":"which is non-positive under the same conditions as the first case. 
Finally, ","element":"span"},{"href":"#id-148","text":"(62) ","element":"a"},{"text":"holds for","element":"span"}],[{"style":{"width":"82%"},"width":1310,"height":456,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-8.png","element":"img"}],[{"text":"where the last line holds because ","element":"span"},{"style":{"height":19.66},"width":875.64,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-9.png","element":"img","alt":" PV pθ,t(x) ≤ V pθ,t(y) for y such that ∥y∥1 = ∥x∥1 + 1.","inline":true}],[{"id":"id-158","style":{"fontWeight":"bold"},"text":"F.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Proposition ","element":"span"},{"href":"#id-157","style":{"fontWeight":"bold"},"text":"3","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"To show geometric ergodicity of the chain that follows ","element":"span"},{"style":{"height":9.6},"width":41.5,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-10.png","element":"img","alt":" πω","inline":true},{"text":", we verify ","element":"span"},{"href":"#id-173","text":"(70)","element":"a"},{"text":". Take ","element":"span"},{"style":{"height":19.6},"width":207.5,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-11.png","element":"img","alt":" agθ,ω > 0 and","inline":true}],[{"id":"id-175","style":{"width":"85%"},"width":1356,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-12.png","element":"img"}],[{"text":"First, we find ","element":"span"},{"style":{"height":19.6},"width":155.5,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-13.png","element":"img","alt":" PV gθ,ω(x)","inline":true,"padRight":true},{"text":"for the function defined above. 
We have","element":"span"}],[{"id":"id-174","style":{"width":"97%"},"width":1550,"height":142,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-14.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":397.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-15.png","element":"img","alt":" X(2) = (X1(2), X2(2))","inline":true,"padRight":true},{"text":"is the state of the system at the second arrival, starting from state ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"text":". To find the above expectations, we first find the corresponding transition probabilities. If the number of departures from server ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"during a fixed interval with length ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"is less than the total number of jobs in the queue of that server, the number of departures follows a Poisson distribution with parameter ","element":"span"},{"style":{"height":14},"width":44,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-16.png","element":"img","alt":"θit","inline":true},{"text":". 
Let ","element":"span"},{"style":{"height":16},"width":378.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-17.png","element":"img","alt":" P ((x1, x2) → (x′1, X))","inline":true,"padRight":true},{"text":"be the probability of transitioning from a system with ","element":"span"},{"style":{"height":9.8},"width":32,"height":24.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-18.png","element":"img","alt":" xi","inline":true,"padRight":true},{"text":"jobs in ","element":"span"},{"text":"server-queue pair ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"(just after the assignment of the arrival) to a queueing system with ","element":"span"},{"style":{"height":16.2},"width":35.5,"height":40.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-19.png","element":"img","alt":" x′1 ","inline":true,"padRight":true},{"text":"jobs in the ","element":"span"},{"text":"first server-queue pair (just before the upcoming arrival). For ","element":"span"},{"style":{"height":15.4},"width":354,"height":38.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-20.png","element":"img","alt":" 1 ≤ x′1≤ x1, we have","inline":true}],[{"id":"id-186","style":{"width":"100%"},"width":1590,"height":154,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/43-21.png","element":"img"}],[{"text":"and","element":"span"}],[{"id":"id-187","style":{"width":"88%"},"width":1400,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/44-0.png","element":"img"}],[{"text":"Assume ","element":"span"},{"style":{"height":16},"width":322,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/44-1.png","element":"img","alt":" 1 + x1 ≤ ω(1 + x2)","inline":true},{"text":", which results in the new arrival being assigned to the first server. 
For the first term in ","element":"span"},{"href":"#id-174","text":"(87)","element":"a"},{"text":", we have","element":"span"}],[{"id":"id-176","style":{"width":"94%"},"width":1502,"height":712,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/44-2.png","element":"img"}],[{"text":"Similarly, for the second term in ","element":"span"},{"href":"#id-174","text":"(87)","element":"a"},{"text":", we have","element":"span"}],[{"id":"id-177","style":{"width":"98%"},"width":1564,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/44-3.png","element":"img"}],[{"text":"To satisfy ","element":"span"},{"href":"#id-173","text":"(70)","element":"a"},{"text":", for some ","element":"span"},{"style":{"height":19.66},"width":215.4,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/44-4.png","element":"img","alt":" 0 < γgθ,ω < 1","inline":true,"padRight":true},{"text":"and all but finitely many ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"text":", the following should hold,","element":"span"}],[{"style":{"width":"26%"},"width":426,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/44-5.png","element":"img"}],[{"text":"or from ","element":"span"},{"href":"#id-175","text":"(86) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-174","text":"(87)","element":"a"},{"text":",","element":"span"}],[{"style":{"width":"68%"},"width":1078,"height":214,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/44-6.png","element":"img"}],[{"text":"Notice that","element":"span"}],[{"style":{"width":"47%"},"width":756,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/44-7.png","element":"img"}],[{"text":"From ","element":"span"},{"href":"#id-176","text":"(90) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-177","text":"(91)","element":"a"},{"text":", it suffices to 
have","element":"span"}],[{"id":"id-178","style":{"width":"92%"},"width":1466,"height":274,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/44-8.png","element":"img"}],[{"text":"Define","element":"span"}],[{"style":{"width":"73%"},"width":1164,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/44-9.png","element":"img"}],[{"text":"Simplifying ","element":"span"},{"href":"#id-178","text":"(92)","element":"a"},{"text":", we need the following to hold","element":"span"}],[{"id":"id-179","style":{"width":"91%"},"width":1450,"height":214,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/44-10.png","element":"img"}],[{"text":"As ","element":"span"},{"style":{"height":15.6},"width":159.88,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-0.png","element":"img","alt":" ζi,θ,ω < 1","inline":true},{"text":", there exists ","element":"span"},{"style":{"height":19.68},"width":222.88,"height":49.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-1.png","element":"img","alt":" γgθ,ω such that","inline":true}],[{"style":{"width":"19%"},"width":302,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-2.png","element":"img"}],[{"text":"From the assumption ","element":"span"},{"style":{"height":16},"width":329.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-3.png","element":"img","alt":" 1 + x1 ≤ ω(1 + x2)","inline":true,"padRight":true},{"text":"and the above equation, ","element":"span"},{"href":"#id-179","text":"(93) ","element":"a"},{"text":"can be further simplified as","element":"span"}],[{"id":"id-185","style":{"width":"100%"},"width":1654,"height":166,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-4.png","element":"img"}],[{"text":"For the above to hold outside a finite set, we need to 
have","element":"span"}],[{"id":"id-180","style":{"width":"70%"},"width":1112,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-5.png","element":"img"}],[{"text":"Define","element":"span"}],[{"style":{"width":"65%"},"width":1044,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-6.png","element":"img"}],[{"text":"Note that ","element":"span"},{"style":{"height":14},"width":289.48,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-7.png","element":"img","alt":" ζ3 < 1 and ζ4 > 1","inline":true},{"text":". Defining function ","element":"span"},{"style":{"height":16},"width":399.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-8.png","element":"img","alt":" f(y) := 1+ζ4y−exp(y)","inline":true},{"text":", we note that for ","element":"span"},{"style":{"height":14},"width":177.52,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-9.png","element":"img","alt":" y ≤ log ζ4,","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"0","element":"span"},{"text":", where ","element":"span"},{"style":{"height":14.2},"width":90,"height":35.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-10.png","element":"img","alt":" log ζ4","inline":true,"padRight":true},{"text":"is the maximizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":")","element":"span"},{"text":". 
Similarly, taking ","element":"span"},{"style":{"height":16},"width":466.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-11.png","element":"img","alt":" g(y) := 1 − ζ3y − exp(−y)","inline":true},{"text":", for ","element":"span"},{"style":{"height":16},"width":382.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-12.png","element":"img","alt":" y ≤ − log ζ3, g(y) > 0","inline":true},{"text":", where ","element":"span"},{"style":{"height":14},"width":129.2,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-13.png","element":"img","alt":" − log ζ3","inline":true,"padRight":true},{"text":"is the maximizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":")","element":"span"},{"text":". Thus, we conclude that for ","element":"span"},{"style":{"height":19.66},"width":680.28,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-14.png","element":"img","alt":"agθ,ω ≤ min (−ω log ζ3, −log ζ3, ω log ζ4),","inline":true}],[{"id":"id-181","style":{"width":"80%"},"width":1278,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-15.png","element":"img"}],[{"id":"id-182","style":{"width":"78%"},"width":1248,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-16.png","element":"img"}],[{"text":"To guarantee the existence of ","element":"span"},{"style":{"height":19.66},"width":215.4,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-17.png","element":"img","alt":" 0 < γgθ,ω < 1","inline":true,"padRight":true},{"text":"that satisfies ","element":"span"},{"href":"#id-180","text":"(95)","element":"a"},{"text":", we need to ensure the left-hand side of ","element":"span"},{"href":"#id-180","text":"(95) ","element":"a"},{"text":"is 
strictly less than 1. Using the bounds found in ","element":"span"},{"href":"#id-181","text":"(97) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-182","text":"(98) ","element":"a"},{"text":"and the definition of ","element":"span"},{"style":{"height":15.6},"width":88.24,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-18.png","element":"img","alt":" ζ1,θ,ω","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.6},"width":88.28,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-19.png","element":"img","alt":"ζ2,θ,ω","inline":true},{"text":", we simplify ","element":"span"},{"href":"#id-180","text":"(95) ","element":"a"},{"text":"to get","element":"span"}],[{"style":{"width":"42%"},"width":672,"height":152,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-20.png","element":"img"}],[{"text":"which is equivalent to","element":"span"}],[{"id":"id-183","style":{"width":"79%"},"width":1264,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-21.png","element":"img"}],[{"text":"To make sure there exists ","element":"span"},{"style":{"height":19.66},"width":142.76,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-22.png","element":"img","alt":" agθ,ω > 0","inline":true,"padRight":true},{"text":"that satisfies ","element":"span"},{"href":"#id-183","text":"(99)","element":"a"},{"text":", the right-hand side of ","element":"span"},{"href":"#id-183","text":"(99) ","element":"a"},{"text":"needs to be positive, ","element":"span"},{"text":"which follows as below:","element":"span"}],[{"id":"id-184","style":{"width":"91%"},"width":1456,"height":558,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-23.png","element":"img"}],[{"text":"where 
","element":"span"},{"style":{"height":17.4},"width":198,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-24.png","element":"img","alt":"˜λ, ˜θ1, and ˜θ2","inline":true,"padRight":true},{"text":"are the normalized rates defined in ","element":"span"},{"href":"#id-172","text":"(83) ","element":"a"},{"text":"and we have used the stability condition ","element":"span"},{"style":{"height":17.41},"width":245.92,"height":43.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-25.png","element":"img","alt":"˜λ ≤ 0.5 − 0.5δ","inline":true},{"text":". We further simplify the left-hand side of ","element":"span"},{"href":"#id-183","text":"(99) ","element":"a"},{"text":"as","element":"span"}],[{"style":{"width":"69%"},"width":1100,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/45-26.png","element":"img"}],[{"text":"From the above equation and ","element":"span"},{"href":"#id-184","text":"(100)","element":"a"},{"text":", ","element":"span"},{"style":{"height":19.68},"width":66.56,"height":49.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/46-0.png","element":"img","alt":" agθ,ω ","inline":true,"padRight":true},{"text":"needs to satisfy","element":"span"}],[{"style":{"width":"21%"},"width":336,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/46-1.png","element":"img"}],[{"text":"Finally, we take ","element":"span"},{"style":{"height":19.66},"width":113.32,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/46-2.png","element":"img","alt":" agθ,ω as","inline":true}],[{"style":{"width":"59%"},"width":936,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/46-3.png","element":"img"}],[{"text":"After finding an appropriate ","element":"span"},{"style":{"height":19.68},"width":66.56,"height":49.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/46-4.png","element":"img","alt":" 
agθ,ω","inline":true},{"text":", we can choose ","element":"span"},{"style":{"height":19.68},"width":215.4,"height":49.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/46-5.png","element":"img","alt":" 0 < γgθ,ω < 1","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"href":"#id-180","text":"(95) ","element":"a"},{"text":"holds, or","element":"span"}],[{"style":{"width":"50%"},"width":808,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/46-6.png","element":"img"}],[{"text":"Moreover, from ","element":"span"},{"href":"#id-185","text":"(94) ","element":"a"},{"text":"a lower bound ","element":"span"},{"style":{"height":19.66},"width":197.76,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/46-7.png","element":"img","alt":" xg11,θ,ω for x1","inline":true,"padRight":true},{"text":"is derived; in other words, ","element":"span"},{"href":"#id-185","text":"(94) ","element":"a"},{"text":"holds for ","element":"span"},{"style":{"height":19.66},"width":200.48,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/46-8.png","element":"img","alt":" x1 > xg11,θ,ω.","inline":true,"padRight":true},{"text":"From ","element":"span"},{"href":"#id-179","text":"(93)","element":"a"},{"text":", we can find the corresponding ","element":"span"},{"style":{"height":19.66},"width":93.6,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/46-9.png","element":"img","alt":" xg12,θ,ω","inline":true,"padRight":true},{"text":"and take ","element":"span"},{"style":{"height":19.66},"width":370.56,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/46-10.png","element":"img","alt":" xg1θ,ω = (xg11,θ,ω, xg12,θ,ω)","inline":true},{"text":". 
By repeating the ","element":"span"},{"text":"same arguments when ","element":"span"},{"style":{"height":16},"width":329.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/46-11.png","element":"img","alt":" 1 + x1 < ω(1 + x2)","inline":true},{"text":", we finally conclude that","element":"span"}],[{"style":{"width":"63%"},"width":1010,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/46-12.png","element":"img"}],[{"text":"for","element":"span"}],[{"style":{"width":"100%"},"width":1618,"height":1566,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/46-13.png","element":"img"}],[{"id":"id-163","style":{"fontWeight":"bold"},"text":"F.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Proposition ","element":"span"},{"href":"#id-162","style":{"fontWeight":"bold"},"text":"4","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Define ","element":"span"},{"style":{"height":25.74},"width":323.2,"height":64.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-0.png","element":"img","alt":" V pθ,ω(x) = x21ω + x22","inline":true},{"text":", and ","element":"span"},{"style":{"height":19.66},"width":194.44,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-1.png","element":"img","alt":" αpθ,ω = 1/2","inline":true},{"text":". Assume that ","element":"span"},{"style":{"height":13.2},"width":119,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-2.png","element":"img","alt":" x1 = 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":274.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-3.png","element":"img","alt":" x2 > (1 − ω)/ω","inline":true},{"text":", ","element":"span"},{"text":"which means the new job is assigned to the first server. 
The transition probabilities of the discrete-time chain sampled at Poisson arrivals are given in ","element":"span"},{"href":"#id-186","text":"(88) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-187","text":"(89)","element":"a"},{"text":", and we calculate ","element":"span"},{"style":{"height":19.66},"width":160.68,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-4.png","element":"img","alt":" PV pθ,ω(x)","inline":true,"padRight":true},{"text":"as","element":"span"}],[{"id":"id-188","style":{"width":"99%"},"width":1584,"height":164,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-5.png","element":"img"}],[{"text":"We define ","element":"span"},{"style":{"height":16},"width":543.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-6.png","element":"img","alt":" di := θi/(θi + λ) for i = 1, 2 and","inline":true}],[{"id":"id-189","style":{"width":"92%"},"width":1470,"height":452,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-7.png","element":"img"}],[{"text":"From ","element":"span"},{"href":"#id-188","text":"(101)","element":"a"},{"text":",","element":"span"}],[{"style":{"width":"91%"},"width":1446,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-8.png","element":"img"}],[{"text":"Outside a finite set, we need the above expression to be non-positive, which is equivalent to","element":"span"}],[{"style":{"width":"67%"},"width":1072,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-9.png","element":"img"}],[{"text":"As ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"style":{"height":12.8},"width":97.5,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-10.png","element":"img","alt":"2 < 
1,","inline":true}],[{"id":"id-191","style":{"width":"76%"},"width":1210,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-11.png","element":"img"}],[{"text":"Thus,","element":"span"}],[{"style":{"width":"62%"},"width":988,"height":212,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-12.png","element":"img"}],[{"text":"By taking ","element":"span"},{"style":{"height":19.6},"width":200,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-13.png","element":"img","alt":" βpθ,ω ≤ d2/2","inline":true},{"text":", it suffices for the following to be non-positive,","element":"span"}],[{"style":{"width":"32%"},"width":512,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-14.png","element":"img"}],[{"text":"which holds for ","element":"span"},{"style":{"height":16},"width":244.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-15.png","element":"img","alt":" x2 ≥ 2cRR/d2","inline":true},{"text":". Thus, for ","element":"span"},{"style":{"height":16},"width":914.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-16.png","element":"img","alt":" x1 = 0 and x2 ≥ max (2cRR(λ + θ2)/θ2, (1 − ω)/ω) =","inline":true},{"href":"#id-161","style":{"height":16},"width":373.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-17.png","element":"img","alt":"2cRR(λ + θ2)/θ2, (76)","inline":true,"padRight":true},{"text":"holds. 
The case of ","element":"span"},{"style":{"height":13.18},"width":113.8,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-18.png","element":"img","alt":" x2 = 0","inline":true,"padRight":true},{"text":"and non-zero ","element":"span"},{"style":{"height":9.6},"width":35,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-19.png","element":"img","alt":" x1","inline":true,"padRight":true},{"text":"follows the same arguments, and ","element":"span"},{"href":"#id-161","text":"(76) ","element":"a"},{"text":"holds for ","element":"span"},{"style":{"height":19.66},"width":1441.4,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-20.png","element":"img","alt":" βpθ,ω ≤ d1/2√ω, x2 = 0, and x1 ≥ max (2cRR(λ + θ1)/θ1, ω − 1) = 2cRR(λ + θ1)/θ1.","inline":true,"padRight":true},{"text":"We now consider the case of ","element":"span"},{"style":{"height":16},"width":579,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-21.png","element":"img","alt":" x1, x2 > 0 and x1 + 1 ≤ ω(x2 + 1)","inline":true},{"text":", and note that","element":"span"}],[{"style":{"width":"75%"},"width":1202,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-22.png","element":"img"}],[{"text":"Hence, it suffices to find a finite set ","element":"span"},{"style":{"height":19.6},"width":72,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-23.png","element":"img","alt":" Cpθ,ω","inline":true},{"text":", constants ","element":"span"},{"style":{"height":19.6},"width":60.5,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-24.png","element":"img","alt":" bpθ,ω","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.6},"width":142,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-25.png","element":"img","alt":" βpθ,ω > 
0","inline":true},{"text":", such that the following holds ","element":"span"},{"text":"for ","element":"span"},{"style":{"height":25.74},"width":324.72,"height":64.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-26.png","element":"img","alt":" V pθ,ω(x) = x21ω + x22,","inline":true}],[{"style":{"width":"53%"},"width":856,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/47-27.png","element":"img"}],[{"text":"As ","element":"span"},{"style":{"height":16},"width":347.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/48-0.png","element":"img","alt":" x1 + 1 ≤ ω(x2 + 1)","inline":true},{"text":", the new arrival is assigned to the first queue and we find ","element":"span"},{"style":{"height":19.8},"width":201.5,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/48-1.png","element":"img","alt":" ∆V pθ,ω(x) +","inline":true},{"style":{"height":20.53},"width":339.96,"height":51.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/48-2.png","element":"img","alt":"√ω + 1βpθ,ω(x2 + 1)","inline":true,"padRight":true},{"text":"using the same calculations as ","element":"span"},{"href":"#id-189","text":"(102)","element":"a"},{"text":".","element":"span"}],[{"id":"id-190","style":{"width":"88%"},"width":1404,"height":548,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/48-3.png","element":"img"}],[{"text":"We next consider two different cases based on the value of ","element":"span"},{"style":{"height":13.6},"width":33,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/48-4.png","element":"img","alt":" d1","inline":true,"padRight":true},{"text":"and analyze them separately.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"One. 
","element":"span"},{"style":{"height":13.6},"width":233.5,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/48-5.png","element":"img","alt":" 0.8 ≤ d1 < 1 :","inline":true,"padRight":true},{"text":"We first notice that the coefficient of ","element":"span"},{"style":{"height":9.6},"width":35.5,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/48-6.png","element":"img","alt":" x1","inline":true,"padRight":true},{"text":"in ","element":"span"},{"href":"#id-190","text":"(104) ","element":"a"},{"text":"is negative, as ","element":"span"},{"style":{"height":16},"width":148.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/48-7.png","element":"img","alt":" d1 > 1/2","inline":true},{"text":". For ","element":"span"},{"style":{"height":13.2},"width":110,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/48-8.png","element":"img","alt":"x1 ≥ 1","inline":true},{"text":", ","element":"span"},{"href":"#id-190","text":"(104) ","element":"a"},{"text":"is equal to","element":"span"}],[{"style":{"width":"85%"},"width":1362,"height":634,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/48-9.png","element":"img"}],[{"text":"where the third line follows from ","element":"span"},{"href":"#id-191","text":"(103)","element":"a"},{"text":", and the last line from the fact that when ","element":"span"},{"style":{"height":13.6},"width":306.24,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/48-10.png","element":"img","alt":" 0.8 ≤ d1 < 1, both","inline":true,"padRight":true},{"text":"terms ","element":"span"},{"style":{"height":17.34},"width":675.32,"height":43.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/48-11.png","element":"img","alt":" −d31 − 2d21 − 2d1 + 2 and −d21 − 3d1 + 3","inline":true,"padRight":true},{"text":"are negative. 
Next, we notice that ","element":"span"},{"href":"#id-190","text":"(105) ","element":"a"},{"text":"is equal to","element":"span"}],[{"style":{"width":"71%"},"width":1134,"height":422,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/48-12.png","element":"img"}],[{"text":"where the second line follows from ","element":"span"},{"href":"#id-191","text":"(103)","element":"a"},{"text":". Taking ","element":"span"},{"style":{"height":20.4},"width":455.5,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/48-13.png","element":"img","alt":" βpθ,ω ≤ d2/2√ω + 1, we get","inline":true}],[{"style":{"width":"89%"},"width":1418,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/48-14.png","element":"img"}],[{"text":"which is non-positive for ","element":"span"},{"style":{"height":13.2},"width":110,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/48-15.png","element":"img","alt":" x2 ≥ 1","inline":true},{"text":". Finally, when ","element":"span"},{"style":{"height":16},"width":832.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/48-16.png","element":"img","alt":" 0.8 ≤ d1 < 1, x1, x2 > 0, and x1 + 1 ≤ ω(x2 + 1),","inline":true,"padRight":true},{"href":"#id-161","text":"(76) ","element":"a"},{"text":"holds for ","element":"span"},{"style":{"height":20.53},"width":340.8,"height":51.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/48-17.png","element":"img","alt":" βpθ,ω ≤ d2/2√ω + 1.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Two. 
","element":"span"},{"style":{"height":13.4},"width":160.5,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-0.png","element":"img","alt":" d1 < 0.8 :","inline":true,"padRight":true},{"text":"Taking ","element":"span"},{"style":{"height":23.57},"width":314.04,"height":58.92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-1.png","element":"img","alt":" βpθ,ω ≤ d2√ω+1(1−d2)","inline":true},{"text":", we note that the coefficient of ","element":"span"},{"style":{"height":9.6},"width":36.5,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-2.png","element":"img","alt":" x2","inline":true,"padRight":true},{"text":"in ","element":"span"},{"href":"#id-190","text":"(105) ","element":"a"},{"text":"is negative. Thus, from ","element":"span"},{"style":{"height":16},"width":329.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-3.png","element":"img","alt":" x1 + 1 ≤ ω(x2 + 1)","inline":true},{"text":", ","element":"span"},{"href":"#id-190","text":"(104) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-190","text":"(105)","element":"a"},{"text":",","element":"span"}],[{"id":"id-192","style":{"width":"96%"},"width":1532,"height":450,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-4.png","element":"img"}],[{"text":"As ","element":"span"},{"style":{"height":19},"width":269,"height":47.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-5.png","element":"img","alt":" di = ˜θi/(˜θi + ˜λ)","inline":true,"padRight":true},{"text":"in terms of the normalized rates, we get","element":"span"}],[{"style":{"width":"63%"},"width":1006,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-6.png","element":"img"}],[{"text":"which is negative from the stability condition. 
For ","element":"span"},{"style":{"height":26.6},"width":258,"height":66.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-7.png","element":"img","alt":" βpθ,ω ≤˜θ1+˜θ2−˜λ˜λ√ω+1 ","inline":true,"padRight":true},{"text":", from ","element":"span"},{"href":"#id-192","text":"(106) ","element":"a"},{"text":"we get","element":"span"}],[{"style":{"width":"69%"},"width":1098,"height":292,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-8.png","element":"img"}],[{"text":"which is non-positive for","element":"span"}],[{"style":{"width":"42%"},"width":674,"height":114,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-9.png","element":"img"}],[{"text":"As ","element":"span"},{"style":{"height":13.6},"width":140,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-10.png","element":"img","alt":" d1 < 0.8","inline":true},{"text":", we can see that ","element":"span"},{"style":{"height":19},"width":246.5,"height":47.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-11.png","element":"img","alt":"˜λ > ˜θ1/4; thus,","inline":true}],[{"style":{"width":"97%"},"width":1544,"height":114,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-12.png","element":"img"}],[{"text":"where we have used the fact that ","element":"span"},{"style":{"height":17.6},"width":305,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-13.png","element":"img","alt":"˜θ1 ≥ ˜θ2, ω ≤ cRR","inline":true},{"text":", ","element":"span"},{"style":{"height":17.4},"width":269,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-14.png","element":"img","alt":"˜θ1 + ˜θ2 − ˜λ ≥ δ","inline":true},{"text":", and ","element":"span"},{"style":{"height":17.4},"width":247.5,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-15.png","element":"img","alt":"˜λ ≤ 0.5 − 
0.5δ","inline":true,"padRight":true},{"text":"and it suffices for ","element":"span"},{"style":{"height":9.6},"width":35,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-16.png","element":"img","alt":" x1","inline":true,"padRight":true},{"text":"to be greater than or equal to ","element":"span"},{"href":"#id-190","style":{"height":13.8},"width":472.5,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-17.png","element":"img","alt":" 4cRR. For x1 < 4cRR, (104)","inline":true,"padRight":true},{"text":"can be upper bounded as","element":"span"}],[{"style":{"width":"76%"},"width":1220,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-18.png","element":"img"}],[{"text":"where in the last inequality we have used ","element":"span"},{"style":{"height":13.4},"width":140,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-19.png","element":"img","alt":" d1 < 0.8","inline":true},{"text":". 
From ","element":"span"},{"href":"#id-190","text":"(105) ","element":"a"},{"text":"and taking ","element":"span"},{"style":{"height":20.4},"width":338,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-20.png","element":"img","alt":" βpθ,ω ≤ d2/2√ω + 1,","inline":true}],[{"style":{"width":"81%"},"width":1294,"height":388,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-21.png","element":"img"}],[{"text":"which is negative for","element":"span"}],[{"style":{"width":"25%"},"width":408,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/49-22.png","element":"img"}],[{"text":"Finally, ","element":"span"},{"text":"when ","element":"span"},{"style":{"height":16},"width":439.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/50-0.png","element":"img","alt":" x1 + 1 ≤ ω(x2 + 1)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":250.8,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/50-1.png","element":"img","alt":" x1, x2 > 0","inline":true},{"text":", ","element":"span"},{"href":"#id-161","text":"(76) ","element":"a"},{"text":"holds for ","element":"span"},{"style":{"height":19.8},"width":148.5,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/50-2.png","element":"img","alt":" βpθ,ω ≤","inline":true},{"style":{"height":28.8},"width":829.2,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/50-3.png","element":"img","alt":"1√ω+1 min� ˜θ22(˜θ2+˜λ), ˜θ1 + ˜θ2 − ˜λ�, x1 ≥ 4cRR","inline":true},{"text":", and ","element":"span"},{"style":{"height":21.23},"width":400.52,"height":53.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/50-4.png","element":"img","alt":" x2 ≥ 1 + 16cRR+100ωd2","inline":true,"padRight":true},{"text":". 
","element":"span"},{"text":"Repeating the same arguments when ","element":"span"},{"style":{"height":14},"width":209.16,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/50-5.png","element":"img","alt":" x1, x2 > 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":381.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/50-6.png","element":"img","alt":" x1 + 1 > ω(x2 + 1)","inline":true},{"text":", ","element":"span"},{"href":"#id-161","text":"(76) ","element":"a"},{"text":"holds for ","element":"span"},{"style":{"height":19.6},"width":127.5,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/50-7.png","element":"img","alt":" βpθ,ω ≤","inline":true},{"style":{"height":28.8},"width":1004.48,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/50-8.png","element":"img","alt":"1√ω+1 min� ˜θ12(˜θ1+˜λ), ˜θ1 + ˜θ2 − ˜λ�, x1 ≥ 1 + ω(16cRR2+100)d1","inline":true,"padRight":true},{"text":", and ","element":"span"},{"style":{"height":15.8},"width":221.5,"height":39.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/50-9.png","element":"img","alt":" x2 ≥ 4cRR2","inline":true},{"text":". 
Finally, ","element":"span"},{"href":"#id-161","text":"(76) ","element":"a"},{"text":"holds with","element":"span"}],[{"style":{"width":"77%"},"width":1234,"height":558,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/50-10.png","element":"img"}],[{"text":"where the fourth line holds since ","element":"span"},{"style":{"height":19.6},"width":339.5,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/50-11.png","element":"img","alt":" PV pθ,ω(x) ≤ V pθ,ω(y)","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":16},"width":200.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/50-12.png","element":"img","alt":" y = (y1, y2)","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":14},"width":191.5,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/50-13.png","element":"img","alt":" yi = xi + 1","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"text":".","element":"span"}]]},{"heading":"G Numerical results","paragraphs":[[{"text":"We first note that due to the countably infinite state-space setting of our problem, we are unable to directly compare our algorithm to other learning algorithms proposed in the literature. One potential candidate algorithm uses reward-biased maximum likelihood estimation (RBMLE)","element":"span"},{"text":"2","element":"span"},{"text":", which estimates the unknown model parameter with the likelihood perturbed by a vanishing bias towards parameters with a larger long-term average reward (i.e., optimal value). This scheme also uses the principle of “optimism in the face of uncertainty” in how it perturbs the maximum likelihood estimate. 
The naive version of the RBMLE algorithm does not apply to our examples due to the following key assumption: over all parameters (and the control policies used for them), the transition probabilities are assumed to be mutually absolutely continuous; this is critical for the proofs and also allows the use of log-likelihood functions for computations. Similarly, naive use of the algorithms in [","element":"span"},{"href":"#id-30","referenceIndex":29,"text":"29","element":"a"},{"text":"] and ","element":"span"},{"href":"#id-29","referenceIndex":20,"text":"[20] ","element":"a"},{"text":"is not possible, again due to a similar absolute continuity assumption which is critical for the proofs. Our posterior computations avoid such issues as the true parameter always has non-zero mass during the execution of the algorithm: episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"always starts in state ","element":"span"},{"style":{"height":14},"width":35,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/50-14.png","element":"img","alt":" 0d ","inline":true,"padRight":true},{"text":"which is positive recurrent for the Markov chain with true parameter ","element":"span"},{"style":{"height":12.6},"width":36.5,"height":31.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/50-15.png","element":"img","alt":" θ∗","inline":true,"padRight":true},{"text":"and policy used ","element":"span"},{"style":{"height":17.18},"width":52.88,"height":42.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/50-16.png","element":"img","alt":" π∗θk","inline":true},{"text":". The RBMLE algorithm has yet ","element":"span"},{"text":"another issue in that it requires knowledge of the optimal value function, and hence, for our examples, it may only apply to Model 1 for which the value function is known analytically. 
Finally, whereas we do get to observe inter-arrival times for both models, we never directly observe completed service times owing to the sampling employed, and this precludes the direct use of Upper-Confidence-Bound based parameter estimation followed by certainty equivalent control algorithms. Owing to these issues, at this time, we are unable to perform empirical comparisons of Algorithm ","element":"span"},{"href":"#id-57","text":"1 ","element":"a"},{"text":"to other candidate algorithms.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"G.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Model 1: Two-server queueing system with a common buffer.","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-85","text":"3a ","element":"a"},{"text":"illustrates the behavior of the regret of Model 1 for three different arrival rate values, averaged over 2000 simulation runs. In these simulations, the parameter space is selected as","element":"span"}],[{"style":{"width":"62%"},"width":998,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/50-17.png","element":"img"}],[{"id":"id-193","style":{"width":"94%"},"width":1492,"height":536,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-0.png","element":"img"}],[{"text":"Figure 4: Total variation distance between the posterior and real distribution for ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":265.5,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-1.png","element":"img","alt":" λ = 0.3, 0.5, 0.7","inline":true},{"text":". 
","element":"figcaption","subtype":"caption"},{"id":"id-194","text":"The y-axis is plotted on a logarithmic scale to display the differences clearly.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"92%"},"width":1474,"height":504,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-2.png","element":"img"}],[{"text":"Figure 5: Optimal policy parameters for different service rate vectors in the two exemplary queueing systems in Model 1 and Model 2 with ","element":"figcaption","subtype":"caption"},{"style":{"height":11.6},"width":133,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-3.png","element":"img","alt":" λ = 0.5.","inline":true}],[{"text":"which results in a prior size of ","element":"span"},{"text":"105","element":"span"},{"text":". As depicted in Figure ","element":"span"},{"href":"#id-85","text":"3a, ","element":"a"},{"text":"the regret has a sub-linear behavior and increases with the arrival rate. The total variation distance between the posterior and real distribution, a point-mass on the random ","element":"span"},{"style":{"height":12.6},"width":36.5,"height":31.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-4.png","element":"img","alt":" θ∗","inline":true},{"text":", is plotted in Figure ","element":"span"},{"href":"#id-193","text":"4a. ","element":"a"},{"text":"As expected, the distance diminishes towards 0, indicating the learning of the true parameter. 
As mentioned in Appendix ","element":"span"},{"href":"#id-82","text":"E.1, ","element":"a"},{"text":"the optimal policy minimizing the average number of jobs in a system with parameter ","element":"span"},{"style":{"height":11.6},"width":17,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-5.png","element":"img","alt":" θ","inline":true},{"text":", is a threshold policy ","element":"span"},{"style":{"height":16.88},"width":157.8,"height":42.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-6.png","element":"img","alt":" πt(θ) with","inline":true,"padRight":true},{"text":"optimal finite threshold ","element":"span"},{"style":{"height":16},"width":145.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-7.png","element":"img","alt":" t(θ) ∈ N","inline":true},{"text":", which can be numerically determined as the smallest ","element":"span"},{"style":{"height":11.6},"width":94.44,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-8.png","element":"img","alt":" i ∈ N","inline":true,"padRight":true},{"text":"for which ","element":"span"},{"style":{"height":17.38},"width":274,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-9.png","element":"img","alt":" Ji(θ) < Ji+1(θ)","inline":true},{"text":", calculated in [","element":"span"},{"href":"#id-12","referenceIndex":31,"text":"31","element":"a"},{"text":"]. We compute the optimal threshold ","element":"span"},{"style":{"height":16},"width":327.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-10.png","element":"img","alt":" t(θ) for every θ ∈ Θ","inline":true,"padRight":true},{"text":"and present the results in Figure ","element":"span"},{"href":"#id-194","text":"5a. ","element":"a"},{"text":"We can see that the threshold increases as the ratio of the service rates grows. 
Specifically, this is why in Appendix ","element":"span"},{"text":"G, ","element":"span"},{"text":"we imposed conditions on ","element":"span"},{"style":{"height":11.4},"width":27.5,"height":28.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-11.png","element":"img","alt":" Θ","inline":true,"padRight":true},{"text":"to ensure that the ratio between the service rates is both upper and lower bounded.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"G.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Model 2: Two heterogeneous parallel queues","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-85","text":"3b ","element":"a"},{"text":"illustrates the behavior of the regret of Model 2 for three different arrival rate values and averaged over 2000 simulation runs. We note that the regret is sub-linear and increases with higher arrival rates. In these simulations, the parameter space is selected as","element":"span"}],[{"style":{"width":"62%"},"width":998,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-12.png","element":"img"}],[{"text":"which results in a prior size of ","element":"span"},{"text":"28","element":"span"},{"text":". 
As discussed earlier, our goal is to find the average cost minimizing policy within the class of policies ","element":"span"},{"style":{"height":17.38},"width":712.52,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-13.png","element":"img","alt":" Π = {πω; ω ∈ [(cRR)−1, cRR]}, cR ≥ 1","inline":true},{"text":", where ","element":"span"},{"style":{"height":16},"width":153.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-14.png","element":"img","alt":" πω(x) =","inline":true},{"style":{"height":16},"width":467.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-15.png","element":"img","alt":"arg min (1 + x1, ω (1 + x2))","inline":true,"padRight":true},{"text":"with ties broken in favor of queue ","element":"span"},{"text":"1","element":"span"},{"text":". As discussed before, even with the transition kernel fully specified (by the values of arrival and service rates), the optimal policy in ","element":"span"},{"style":{"height":10.8},"width":28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-16.png","element":"img","alt":" Π","inline":true,"padRight":true},{"text":"is not known except when ","element":"span"},{"style":{"height":13.8},"width":126.5,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-17.png","element":"img","alt":" θ1 = θ2","inline":true,"padRight":true},{"text":"where the optimal value is ","element":"span"},{"style":{"height":11},"width":101.5,"height":27.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/51-18.png","element":"img","alt":" ω = 1","inline":true},{"text":", and so, to learn it, we use Proximal Policy Optimization with the approximating martingale-process (AMP) method for countable 
state-space","element":"span"}],[{"id":"id-195","style":{"width":"43%"},"width":696,"height":538,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/52-0.png","element":"img"}],[{"text":"Figure 6: Estimated average cost of Model 2 for three different service rate vectors.","element":"figcaption","subtype":"caption"}],[{"text":"MDPs [","element":"span"},{"href":"#id-35","referenceIndex":14,"text":"14","element":"a"},{"text":"]. We run the algorithm for ","element":"span"},{"text":"200 ","element":"span"},{"text":"policy iterations, using ","element":"span"},{"text":"20 ","element":"span"},{"text":"actors for each iteration. We take the state ","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0) ","element":"span"},{"text":"as a regeneration state and simulate ","element":"span"},{"text":"1500 ","element":"span"},{"text":"independent regenerative cycles per actor in each algorithm iteration. To approximate the value function, we employ a fully connected feed-forward neural network with one hidden layer consisting of ","element":"span"},{"style":{"height":11.2},"width":124,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/52-1.png","element":"img","alt":" 10 × 10","inline":true,"padRight":true},{"text":"units and ReLU activation functions. The AMP method is also employed for variance reduction in value function estimation. 
The optimal ","element":"span"},{"style":{"height":7.2},"width":24.5,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/52-2.png","element":"img","alt":" ω","inline":true,"padRight":true},{"text":"for every ","element":"span"},{"style":{"height":12},"width":96,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/52-3.png","element":"img","alt":" θ ∈ Θ","inline":true,"padRight":true},{"text":"is shown in Figure ","element":"span"},{"href":"#id-194","text":"5b, ","element":"a"},{"text":"indicating that ","element":"span"},{"style":{"height":7.4},"width":24.5,"height":18.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/52-4.png","element":"img","alt":" ω","inline":true,"padRight":true},{"text":"increases as the ratio of the service rates grows. Therefore, it is necessary to ensure that the ratio between the service rates is bounded from above and below. Furthermore, to evaluate the regret numerically, the value of ","element":"span"},{"style":{"height":16},"width":77.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/52-5.png","element":"img","alt":" J(θ)","inline":true,"padRight":true},{"text":"is required for every ","element":"span"},{"style":{"height":12.2},"width":98,"height":30.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/52-6.png","element":"img","alt":" θ ∈ Θ","inline":true},{"text":", which is not known. Thus, after finding the optimal ","element":"span"},{"style":{"height":7.4},"width":24.5,"height":18.5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/52-7.png","element":"img","alt":" ω","inline":true,"padRight":true},{"text":"using the PPO algorithm, we perform a separate simulation to approximate the optimal average cost. 
In Figure ","element":"span"},{"href":"#id-195","text":"6, ","element":"a"},{"text":"we plot the estimated average cost for three different service rate vectors, demonstrating that the optimal average cost decreases as the service rates increase. In Figure ","element":"span"},{"href":"#id-193","text":"4b ","element":"a"},{"text":"we also depict the total variation distance between the posterior and real distribution, which is a point-mass on the random ","element":"span"},{"style":{"height":12.4},"width":36.5,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/71071/images/52-8.png","element":"img","alt":" θ∗","inline":true},{"text":", and observe that the distance is converging to zero.","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]