36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"96414","publisher":"neurips","paperJSON":{"title":"Randomized Exploration for Reinforcement Learning with Multinomial Logistic Function Approximation","paperID":"96414","avgLineHeight":10.91,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"We study reinforcement learning with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"multinomial logistic ","element":"span"},{"text":"(MNL) function approximation where the underlying transition probability kernel of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Markov decision processes ","element":"span"},{"text":"(MDPs) is parametrized by an unknown transition core with features of state and action. For the finite horizon episodic setting with inhomogeneous state transitions, we propose provably efficient algorithms with randomized exploration having frequentist regret guarantees. For our first algorithm, ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL","element":"span"},{"text":", we adapt optimistic sampling to ensure the optimism of the estimated value function with sufficient frequency. We establish that ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL ","element":"span"},{"text":"achieves a","element":"span"},{"style":{"height":19.6},"width":362,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/0-0.png","element":"img","alt":"�O(κ−1d32 H32 √T) fre-","inline":true,"padRight":true},{"text":"quentist regret bound with constant-time computational cost per episode. 
Here, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"is the dimension of the transition core, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"is the horizon length, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is the total number of steps, and ","element":"span"},{"style":{"height":7.41},"width":20.52,"height":18.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/0-1.png","element":"img","alt":" κ","inline":true,"padRight":true},{"text":"is a problem-dependent constant. Despite the simplicity and practicality of ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL","element":"span"},{"text":", its regret bound scales with ","element":"span"},{"style":{"height":13.41},"width":59.48,"height":33.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/0-2.png","element":"img","alt":" κ−1","inline":true},{"text":", which is potentially large in the worst case. To improve the dependence on ","element":"span"},{"style":{"height":13.41},"width":59.48,"height":33.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/0-3.png","element":"img","alt":" κ−1","inline":true},{"text":", we propose ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL","element":"span"},{"text":", which estimates the value function using the local gradient information of the MNL transition model. We show that its frequentist regret bound is","element":"span"},{"style":{"height":19.6},"width":434.48,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/0-4.png","element":"img","alt":"�O(d32 H32 √T +κ−1d2H2).","inline":true,"padRight":true},{"text":"To the best of our knowledge, these are the first randomized RL algorithms for the MNL transition model that achieve statistical guarantees with constant-time computational cost per episode. 
Numerical experiments demonstrate the superior performance of the proposed algorithms.","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"style":{"fontStyle":"italic"},"text":"Reinforcement learning ","element":"span"},{"text":"(RL) is a sequential decision-making problem in which an agent tries to maximize its expected cumulative reward by interacting with an unknown environment over time. Despite significant empirical progress in RL algorithms for various applications [","element":"span"},{"href":"#id-0","referenceIndex":47,"text":"47","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":52,"text":"52","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":65,"text":"65","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":66,"text":"66","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":25,"text":"25","element":"a"},{"text":"], the theoretical understanding of RL algorithms had long been limited to tabular methods [","element":"span"},{"href":"#id-5","referenceIndex":40,"text":"40","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":56,"text":"56","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":10,"text":"10","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":77,"text":"77","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":79,"text":"79","element":"a"},{"text":"], which explicitly enumerate the entire state and action spaces and learn the value (or the policy) for each state and action. Recently, there has been an increasing body of research in RL with function approximation to extend beyond the tabular problem setting. 
In particular, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"linear function approximation ","element":"span"},{"text":"has served as a foundational model [","element":"span"},{"href":"#id-10","referenceIndex":43,"text":"43","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":73,"text":"73","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":22,"text":"22","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":9,"text":"9","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"37","element":"a"},{"text":"]. On the other hand, the linear transition model assumption poses significant constraints: 1) the output of the function must be within ","element":"span"},{"text":"[0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1]","element":"span"},{"text":", and 2) the sum of the probabilities for all possible next states must be exactly 1. These constraints make it challenging to apply RL with linear function approximation to real-world applications [","element":"span"},{"href":"#id-15","referenceIndex":35,"text":"35","element":"a"},{"text":"]. To overcome such challenges, there has been literature on RL with general function approximation [","element":"span"},{"href":"#id-16","referenceIndex":21,"text":"21","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","referenceIndex":28,"text":"28","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"37","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","referenceIndex":44,"text":"44","element":"a"},{"text":", ","element":"span"},{"href":"#id-19","referenceIndex":4,"text":"4","element":"a"},{"text":", ","element":"span"},{"href":"#id-20","referenceIndex":18,"text":"18","element":"a"},{"text":"]. 
Despite the guarantee of sample efficiency achieved by their algorithms, this accomplishment might be impeded by computational intractability or the necessity to rely on stronger assumptions. As a result, the resulting methods may not be as general or practical.","element":"span"}],[{"text":"On the other hand, Hwang and Oh ","element":"span"},{"href":"#id-15","referenceIndex":35,"text":"[35] ","element":"a"},{"text":"introduce specific non-linear parametric MDPs called MNLMDPs (Assumption ","element":"span"},{"href":"#id-21","text":"1) ","element":"a"},{"text":"where the transition probability of MDPs is given by an MNL model. They consider an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"upper confidence bound ","element":"span"},{"text":"(UCB) approach to balance exploration and exploitation. Since it is costly or even intractable to compute UCB explicitly, randomized exploration methods such as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Thompson Sampling ","element":"span"},{"text":"(TS) are widely studied in RL with linear function approximation as well as tabular MDPs. This is because, in various decision-making problems ranging from multi-armed bandits to RL, randomized exploration algorithms have been shown to perform better than UCB methods in empirical evaluations [","element":"span"},{"href":"#id-22","referenceIndex":16,"text":"16","element":"a"},{"text":", ","element":"span"},{"href":"#id-23","referenceIndex":57,"text":"57","element":"a"},{"text":", ","element":"span"},{"href":"#id-24","referenceIndex":64,"text":"64","element":"a"},{"text":", ","element":"span"},{"href":"#id-25","referenceIndex":49,"text":"49","element":"a"},{"text":"]. Furthermore, randomized exploration can be easily integrated with linear function approximation. This is because the value function in linear MDPs can be linearly parameterized, allowing perturbations of the estimator to directly control the perturbations of the value function. 
However, although there has been some literature aiming to propose randomized algorithms for general function classes [","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"37","element":"a"},{"text":", ","element":"span"},{"href":"#id-19","referenceIndex":4,"text":"4","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":5,"text":"5","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":75,"text":"75","element":"a"},{"text":"], these methods do not discuss how to define the posterior distribution supported by the given function class and how to draw the optimistic sample from the posterior [","element":"span"},{"href":"#id-19","referenceIndex":4,"text":"4","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":5,"text":"5","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":75,"text":"75","element":"a"},{"text":"], or they require stronger assumptions on stochastic optimism [","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"37","element":"a"},{"text":"], which is one of the most challenging elements in frequentist regret analysis. Thus, the design of a tractable randomized exploration RL algorithm and the feasibility of frequentist regret analysis for randomized exploration remain open challenges. 
Hence, the following question arises:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Can we design a provably efficient and tractable randomized algorithm for RL with MNL function approximation?","element":"span"}],[{"text":"We answer the above question by proposing the first randomized algorithm, ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL","element":"span"},{"text":", achieving ","element":"span"},{"style":{"height":19.01},"width":293.52,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/1-0.png","element":"img","alt":"O(κ−1d32 H32 √T)","inline":true,"padRight":true},{"text":"frequentist regret with constant-time computational cost per episode. ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL ","element":"span"},{"text":"is not only the first algorithm with randomized exploration for MNL-MDPs, but also, to the best of our knowledge, it provides the first frequentist regret analysis for a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"non-linear model-based ","element":"span"},{"text":"algorithm with randomized exploration without assuming stochastic optimism ","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"[37]","element":"a"},{"text":".","element":"span"}],[{"text":"While ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL ","element":"span"},{"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"statistically ","element":"span"},{"text":"efficient, the current method used to analyze the regret of MNL function approximation introduces a problem-dependent constant ","element":"span"},{"style":{"height":7.39},"width":20.48,"height":18.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/1-1.png","element":"img","alt":" κ","inline":true,"padRight":true},{"text":"(Assumption ","element":"span"},{"href":"#id-28","text":"4)","element":"a"},{"text":", which reflects the level of non-linearity of 
the MNL transition model. This constant ","element":"span"},{"style":{"height":7.41},"width":20.48,"height":18.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/1-2.png","element":"img","alt":" κ","inline":true,"padRight":true},{"text":"originates from the use of generalized linear models (GLMs) for contextual bandit settings [","element":"span"},{"href":"#id-29","referenceIndex":26,"text":"26","element":"a"},{"text":", ","element":"span"},{"href":"#id-30","referenceIndex":51,"text":"51","element":"a"},{"text":", ","element":"span"},{"href":"#id-31","referenceIndex":45,"text":"45","element":"a"},{"text":"] and MNL bandit settings [","element":"span"},{"href":"#id-32","referenceIndex":54,"text":"54","element":"a"},{"text":", ","element":"span"},{"href":"#id-33","referenceIndex":17,"text":"17","element":"a"},{"text":", ","element":"span"},{"href":"#id-34","referenceIndex":55,"text":"55","element":"a"},{"text":"]. The magnitude of the constant ","element":"span"},{"style":{"height":7.39},"width":20,"height":18.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/1-3.png","element":"img","alt":" κ","inline":true,"padRight":true},{"text":"can be exponentially small with respect to the size of the decision set, hence the regret bound scaling with ","element":"span"},{"style":{"height":13.39},"width":59.48,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/1-4.png","element":"img","alt":" κ−1","inline":true},{"text":"could be prohibitively large in the worst case [","element":"span"},{"href":"#id-35","referenceIndex":23,"text":"23","element":"a"},{"text":"]. Worse yet, the situation is even more challenging in RL, as in the worst case, ","element":"span"},{"style":{"height":13.41},"width":59.52,"height":33.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/1-5.png","element":"img","alt":"κ−1","inline":true},{"text":"can be much larger than in the case of bandits. 
To overcome the prohibitive dependence on ","element":"span"},{"style":{"height":7.41},"width":20,"height":18.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/1-6.png","element":"img","alt":"κ","inline":true},{"text":", algorithms based on new Bernstein-like inequalities and the self-concordant-like property of the log-loss have been proposed for logistic bandits [","element":"span"},{"href":"#id-35","referenceIndex":23,"text":"23","element":"a"},{"text":", ","element":"span"},{"href":"#id-36","referenceIndex":3,"text":"3","element":"a"},{"text":", ","element":"span"},{"href":"#id-37","referenceIndex":24,"text":"24","element":"a"},{"text":"] and for MNL bandits [","element":"span"},{"href":"#id-38","referenceIndex":61,"text":"61","element":"a"},{"text":", ","element":"span"},{"href":"#id-39","referenceIndex":6,"text":"6","element":"a"},{"text":", ","element":"span"},{"href":"#id-40","referenceIndex":50,"text":"50","element":"a"},{"text":"]. As an extension of these works, the following fundamental question remains open:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Is it possible for RL algorithms with MNL function approximation to have a sharper dependence on the problem-dependent constant ","element":"span"},{"style":{"height":10.8},"width":40.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/1-7.png","element":"img","alt":" κ?","inline":true}],[{"text":"For the above question, we propose the second randomized algorithm referred to as ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL","element":"span"},{"text":", which establishes a regret bound of","element":"span"},{"style":{"height":19.6},"width":435,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/1-8.png","element":"img","alt":"�O(d32 H32 √T + κ−1d2H2)","inline":true,"padRight":true},{"text":"with constant-time computational cost per episode. 
We summarize our main contributions as follows:","element":"span"}],[{"text":"• We propose computationally tractable randomized algorithms for RL with MNL function approximation: ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL","element":"span"},{"text":". To the best of our knowledge, these are the first randomized model-based RL algorithms with MNL function approximation that achieve frequentist regret bounds with constant-time computational cost per episode.","element":"span"}],[{"text":"• We establish that ","element":"span"},{"style":{"height":19.6},"width":560.48,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/1-9.png","element":"img","alt":" RRL-MNL enjoys an Õ(κ^{-1} d^{3/2} H^{3/2} √T)","inline":true,"padRight":true},{"text":"frequentist regret bound with constant-time computational cost per episode, where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"is the dimension of the transition core, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"is the horizon length, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is the total number of rounds, and ","element":"span"},{"style":{"height":7.39},"width":20.48,"height":18.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/1-10.png","element":"img","alt":" κ","inline":true,"padRight":true},{"text":"is a problem-dependent constant. 
We derive the stochastic optimism of ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL","element":"span"},{"text":", and to our knowledge, this is the first frequentist regret analysis for a non-linear model-based algorithm with randomized exploration without assuming stochastic optimism.","element":"span"}],[{"text":"• To achieve a regret bound with improved dependence on ","element":"span"},{"style":{"height":7.39},"width":20.52,"height":18.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-0.png","element":"img","alt":" κ","inline":true},{"text":", we introduce ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL","element":"span"},{"text":", which constructs the optimistic randomized value functions by taking into account the effects of the local gradient information for the MNL transition model at each reachable state. We prove that ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL ","element":"span"},{"text":"enjoys an","element":"span"},{"style":{"height":19.6},"width":436,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-1.png","element":"img","alt":"Õ(d^{3/2} H^{3/2} √T + κ^{-1} d^2 H^2)","inline":true,"padRight":true},{"text":"regret with constant-time computational cost per episode, significantly improving the regret of ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL ","element":"span"},{"text":"without requiring prior knowledge of ","element":"span"},{"style":{"height":7.41},"width":28.48,"height":18.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-2.png","element":"img","alt":" κ.","inline":true}],[{"text":"• We evaluate our algorithms on tabular MDPs and demonstrate the superior performance of our proposed algorithms compared to the existing state-of-the-art MNL-MDP algorithm [","element":"span"},{"href":"#id-15","referenceIndex":35,"text":"35","element":"a"},{"text":"]. 
The experiments provide evidence that our proposed algorithms are both computationally and statistically efficient.","element":"span"}],[{"text":"Related works on RL with function approximation and MNL contextual bandits are provided in Appendix ","element":"span"},{"text":"A.","element":"span"}]]},{"heading":"2 Problem Setting","paragraphs":[[{"text":"We consider the episodic ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Markov decision processes ","element":"span"},{"text":"(MDPs) denoted by ","element":"span"},{"style":{"height":18},"width":390,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-3.png","element":"img","alt":" M(S, A, H, {P}Hh=1, r)","inline":true},{"text":", ","element":"span"},{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"is the state space, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"is the action space, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"is the horizon length of each episode, ","element":"span"},{"style":{"height":18.19},"width":125,"height":45.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-4.png","element":"img","alt":" {P}Hh=1","inline":true,"padRight":true},{"text":"is the collection of probability distributions, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"is the reward function. 
Every episode starts from the initial state ","element":"span"},{"style":{"height":9.6},"width":30.52,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-5.png","element":"img","alt":" s_1","inline":true,"padRight":true},{"text":"and, at every step ","element":"span"},{"style":{"height":15.81},"width":356.52,"height":39.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-6.png","element":"img","alt":" h ∈ [H] := {1, ..., H}","inline":true,"padRight":true},{"text":"in an episode, the learning agent interacts with the environment represented as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"text":". The agent observes the state ","element":"span"},{"style":{"height":13.81},"width":120,"height":34.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-7.png","element":"img","alt":" s_h ∈ S","inline":true},{"text":", chooses an action ","element":"span"},{"style":{"height":14.19},"width":136,"height":35.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-8.png","element":"img","alt":" a_h ∈ A","inline":true},{"text":", receives a reward ","element":"span"},{"style":{"height":16},"width":286.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-9.png","element":"img","alt":" r(s_h, a_h) ∈ [0, 1]","inline":true,"padRight":true},{"text":"and the next state ","element":"span"},{"style":{"height":10.99},"width":73.48,"height":27.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-10.png","element":"img","alt":" s_{h+1}","inline":true,"padRight":true},{"text":"is drawn from the transition probability distribution ","element":"span"},{"style":{"height":16},"width":192.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-11.png","element":"img","alt":" P_h(·|s_h, a_h)","inline":true},{"text":". This process repeats until the episode ends. 
A policy ","element":"span"},{"style":{"height":16},"width":307,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-12.png","element":"img","alt":" π : S × [H] → A","inline":true,"padRight":true},{"text":"is a function that determines the action of the agent at state ","element":"span"},{"style":{"height":9.81},"width":34.52,"height":24.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-13.png","element":"img","alt":" s_h","inline":true},{"text":", i.e., ","element":"span"},{"style":{"height":16},"width":414,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-14.png","element":"img","alt":"a_h = π(s_h, h) := π_h(s_h).","inline":true}],[{"text":"We define the value function of the policy ","element":"span"},{"style":{"height":7.2},"width":22,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-15.png","element":"img","alt":" π","inline":true},{"text":", denoted by ","element":"span"},{"style":{"height":16.59},"width":97.48,"height":41.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-16.png","element":"img","alt":" V^π_h(s)","inline":true},{"text":", as the expected sum of rewards ","element":"span"},{"text":"under the policy ","element":"span"},{"style":{"height":7.2},"width":22.48,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-17.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"until the end of the episode starting from ","element":"span"},{"style":{"height":9.18},"width":138.52,"height":22.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-18.png","element":"img","alt":" s_h = s","inline":true},{"text":", i.e., ","element":"span"},{"style":{"height":16.51},"width":159,"height":41.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-19.png","element":"img","alt":" V^π_h(s) =","inline":true},{"style":{"height":48},"width":566.48,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-20.png","element":"img","alt":"E_π[ Σ_{h′=h}^{H} r(s_{h′}, π_{h′}(s_{h′})) | s_h = s ]","inline":true,"padRight":true},{"text":". Similarly, we define the action-value function ","element":"span"},{"style":{"height":16.61},"width":188,"height":41.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-21.png","element":"img","alt":" Q^π_h(s, a) =","inline":true}],[{"style":{"height":3.2},"width":21.52,"height":8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-22.png","element":"img","alt":"h′=h","inline":true},{"style":{"height":30.4},"width":542,"height":76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-23.png","element":"img","alt":"r(s, a) + E_{s′∼P_h(·|s,a)}[V^π_{h+1}(s′)]","inline":true},{"text":". We define an optimal policy ","element":"span"},{"style":{"height":11.2},"width":37.52,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-24.png","element":"img","alt":" π^*","inline":true},{"text":"to be a policy that achieves the highest possible value at every ","element":"span"},{"style":{"height":16},"width":282,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-25.png","element":"img","alt":" (s, h) ∈ S × [H]","inline":true},{"text":". We denote the optimal value function by ","element":"span"},{"style":{"height":18.99},"width":269,"height":47.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-26.png","element":"img","alt":"V^*_h(s) = V^{π^*}_h(s)","inline":true,"padRight":true},{"text":"and the optimal action-value function by ","element":"span"},{"style":{"height":18.99},"width":347.48,"height":47.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-27.png","element":"img","alt":" Q^*_h(s, a) = Q^{π^*}_h(s, a)","inline":true},{"text":". 
For brevity, we ","element":"span"},{"text":"introduce the notation ","element":"span"},{"style":{"height":17.79},"width":638.48,"height":44.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-28.png","element":"img","alt":" P_h V_{h+1}(s, a) = E_{s′∼P_h(·|s,a)}[V_{h+1}(s′)]","inline":true},{"text":". Recall that the Bellman equations are","element":"span"}],[{"style":{"width":"85%"},"width":1358,"height":60,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-29.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"style":{"height":17.81},"width":1127,"height":44.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-30.png","element":"img","alt":"^π_{H+1}(s) = V^*_{H+1}(s) = 0 and V^*_h(s) = max_{a∈A} Q^*_h(s, a) for all s ∈ S.","inline":true}],[{"text":"The goal of the agent is to maximize the sum of rewards over K episodes. Equivalently, the goal is to minimize the cumulative regret of the policy ","element":"span"},{"style":{"height":7.2},"width":22,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-31.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"over K episodes where ","element":"span"},{"style":{"height":18.19},"width":215.52,"height":45.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-32.png","element":"img","alt":" π = {πk}Kk=1","inline":true},{"text":"is a collection ","element":"span"},{"text":"of policies ","element":"span"},{"style":{"height":13.81},"width":39.52,"height":34.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-33.png","element":"img","alt":" πk ","inline":true,"padRight":true},{"text":"at the k-th episode. 
The regret is defined as","element":"span"}],[{"style":{"width":"37%"},"width":600,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-34.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.6},"width":33.52,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/2-35.png","element":"img","alt":" s^k_1 ","inline":true,"padRight":true},{"text":"is the initial state at the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-th episode.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Multinomial Logistic Markov Decision Processes (MNL-MDPs)","element":"span"}],[{"text":"Although many RL algorithms with provable guarantees have been proposed for linear MDPs, the linear transition model assumption has a simple but fundamental problem. Specifically, the output of a linear function approximating the transition model must lie in ","element":"span"},{"text":"[0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1] ","element":"span"},{"text":"and the probabilities of all possible next states must sum to ","element":"span"},{"text":"1 ","element":"span"},{"text":"exactly. Such restrictive assumptions can degrade the regret performance of algorithms designed under the linearity assumption. To resolve these challenges, Hwang and Oh ","element":"span"},{"href":"#id-15","referenceIndex":35,"text":"[35] ","element":"a"},{"text":"propose the setting of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"multinomial logistic Markov decision processes ","element":"span"},{"text":"(MNL-MDPs), where the state transition model is given by a multinomial logistic model. 
We introduce the formal definition for MNL-MDP as follows:","element":"span"}],[{"id":"id-21","style":{"fontWeight":"bold"},"text":"Assumption 1 ","element":"span"},{"text":"(MNL-MDPs [","element":"span"},{"href":"#id-15","referenceIndex":35,"text":"35","element":"a"},{"text":"])","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"An MDP ","element":"span"},{"style":{"height":18.21},"width":405,"height":45.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-0.png","element":"img","alt":" M(S, A, H, {Ph}Hh=1, r)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is an ","element":"span"},{"text":"MNL-MDP ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"feature map ","element":"span"},{"style":{"height":16.8},"width":369.48,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-1.png","element":"img","alt":" φ : S × A × S → Rd","inline":true},{"style":{"fontStyle":"italic"},"text":", if for each ","element":"span"},{"style":{"height":16},"width":130,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-2.png","element":"img","alt":" h ∈ [H]","inline":true},{"style":{"fontStyle":"italic"},"text":", there exists ","element":"span"},{"style":{"height":17.79},"width":143,"height":44.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-3.png","element":"img","alt":" θ∗h ∈ Rd","inline":true},{"style":{"fontStyle":"italic"},"text":", such that for any ","element":"span"},{"style":{"height":16},"width":243,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-4.png","element":"img","alt":"(s, a) ∈ S × A","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and 
","element":"span"},{"style":{"height":16.8},"width":641.52,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-5.png","element":"img","alt":" s′ ∈ Ss,a := {s′ ∈ S : P(s′ | s, a) ̸= 0}","inline":true},{"style":{"fontStyle":"italic"},"text":", the state transition kernel of ","element":"span"},{"style":{"height":12.4},"width":26,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-6.png","element":"img","alt":" s′","inline":true},{"style":{"fontStyle":"italic"},"text":"when an action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is taken at a state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is given by,","element":"span"}],[{"style":{"width":"73%"},"width":1170,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"We call each unknown vector ","element":"span"},{"style":{"height":16.61},"width":39.52,"height":41.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-8.png","element":"img","alt":" θ∗h ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"transition core. Furthermore, we denote the maximum cardinality of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"the set of reachable states as ","element":"span"},{"style":{"height":16.61},"width":430.52,"height":41.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-9.png","element":"img","alt":" U, i.e., U := maxs,a |Ss,a|.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Remark 1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"While Hwang and Oh ","element":"span"},{"href":"#id-15","referenceIndex":35,"style":{"fontStyle":"italic"},"text":"[35] ","element":"a"},{"style":{"fontStyle":"italic"},"text":"assume a homogeneous transition kernel, we assume an inhomogeneous transition kernel, in which the probability varies depending on the current time step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"style":{"fontStyle":"italic"},"text":"even for the same state transition, which is a more general setting. Also, for notational simplicity, we denote the true transition kernel ","element":"span"},{"style":{"height":17.2},"width":159.52,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-10.png","element":"img","alt":" Ph as Pθ∗h","inline":true},{"style":{"fontStyle":"italic"},"text":", and the estimated transition kernel by ","element":"span"},{"style":{"height":13.39},"width":129.48,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-11.png","element":"img","alt":" θ as Pθ.","inline":true}],[{"id":"id-194","style":{"fontWeight":"bold"},"text":"2.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Assumptions","element":"span"}],[{"text":"We introduce some standard regularity assumptions.","element":"span"}],[{"id":"id-41","style":{"fontWeight":"bold"},"text":"Assumption 2 ","element":"span"},{"text":"(Boundedness)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"We assume ","element":"span"},{"style":{"height":17.01},"width":324,"height":42.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-12.png","element":"img","alt":" ∥φ(s, a, s′)∥2 ≤ Lφ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"height":16.8},"width":416,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-13.png","element":"img","alt":" (s, a, s′) ∈ S × A × Ss,a","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16.59},"width":452.52,"height":41.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-14.png","element":"img","alt":" ∥θ∗h∥2 ≤ Lθ for all h ∈ [H].","inline":true}],[{"id":"id-43","style":{"fontWeight":"bold"},"text":"Assumption 3 ","element":"span"},{"text":"(Known reward)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"We assume that the reward function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is known to the agent.","element":"span"}],[{"id":"id-28","style":{"fontWeight":"bold"},"text":"Assumption 4 ","element":"span"},{"text":"(Problem-dependent constant)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":19.41},"width":558.48,"height":48.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-15.png","element":"img","alt":" Bd(Lθ) := {θ ∈ Rd : ∥θ∥2 ≤ Lθ}","inline":true},{"style":{"fontStyle":"italic"},"text":". 
There exists ","element":"span"},{"style":{"height":11.6},"width":93,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-16.png","element":"img","alt":"κ > 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for any ","element":"span"},{"style":{"height":16.8},"width":699.52,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-17.png","element":"img","alt":" (s, a) ∈ S × A and s′, �s ∈ Ss,a with s′ ̸= �s,","inline":true}],[{"style":{"width":"39%"},"width":630,"height":74,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-18.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Discussion of assumptions ","element":"span"},{"text":"Assumption ","element":"span"},{"href":"#id-41","text":"2 ","element":"a"},{"text":"is common in the literature on RL with function approximation [","element":"span"},{"href":"#id-10","referenceIndex":43,"text":"43","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":72,"text":"72","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":73,"text":"73","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"37","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","referenceIndex":35,"text":"35","element":"a"},{"text":"] to make the regret bounds scale-free. Assumption ","element":"span"},{"href":"#id-43","text":"3 ","element":"a"},{"text":"is used to focus on the main challenge of model-based RL that learning about ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"of the environment is more difficult than learning ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":". 
In the model-based RL literature [","element":"span"},{"href":"#id-44","referenceIndex":71,"text":"71","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":9,"text":"9","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":72,"text":"72","element":"a"},{"text":", ","element":"span"},{"href":"#id-45","referenceIndex":81,"text":"81","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","referenceIndex":35,"text":"35","element":"a"},{"text":"], the known reward ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"assumption is widely used. Assumption ","element":"span"},{"href":"#id-28","text":"4 ","element":"a"},{"text":"is typical in generalized linear contextual bandit [","element":"span"},{"href":"#id-29","referenceIndex":26,"text":"26","element":"a"},{"text":", ","element":"span"},{"href":"#id-30","referenceIndex":51,"text":"51","element":"a"},{"text":", ","element":"span"},{"href":"#id-35","referenceIndex":23,"text":"23","element":"a"},{"text":", ","element":"span"},{"href":"#id-36","referenceIndex":3,"text":"3","element":"a"},{"text":", ","element":"span"},{"href":"#id-37","referenceIndex":24,"text":"24","element":"a"},{"text":"] and MNL contextual bandit literature [","element":"span"},{"href":"#id-32","referenceIndex":54,"text":"54","element":"a"},{"text":", ","element":"span"},{"href":"#id-46","referenceIndex":8,"text":"8","element":"a"},{"text":", ","element":"span"},{"href":"#id-34","referenceIndex":55,"text":"55","element":"a"},{"text":", ","element":"span"},{"href":"#id-38","referenceIndex":61,"text":"61","element":"a"},{"text":", ","element":"span"},{"href":"#id-39","referenceIndex":6,"text":"6","element":"a"},{"text":", ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"76","element":"a"},{"text":", ","element":"span"},{"href":"#id-40","referenceIndex":50,"text":"50","element":"a"},{"text":"] to guarantee non-singular Fisher information 
matrix.","element":"span"}]]},{"heading":"3 Randomized Algorithm for MNL-MDPs having constant-time computational cost","paragraphs":[[{"text":"Previous work on MNL-MDPs [","element":"span"},{"href":"#id-15","referenceIndex":35,"text":"35","element":"a"},{"text":"] proposed a UCB-based exploration algorithm. Constructing a UCB-based optimistic value function is not only computationally intractable but also tends to overestimate the true optimal value function. Additionally, their algorithm incurs increasing computation costs as episodes progress, as it requires all samples from previous episodes to estimate the transition core. In this section, we present a novel model-based RL algorithm that incorporates ","element":"span"},{"style":{"fontStyle":"italic"},"text":"randomized exploration ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"online parameter estimation ","element":"span"},{"text":"for MNL-MDPs.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Algorithm: ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Online transition core estimation ","element":"span"},{"text":"While Hwang and Oh ","element":"span"},{"href":"#id-15","referenceIndex":35,"text":"[35] ","element":"a"},{"text":"estimate the transition core using maximum likelihood estimation over all samples from previous episodes, we employ an efficient online parameter estimation method by exploiting the particular structure of the MNL transition model. The key insight is that the negative log-likelihood function for the MNL model in each episode is strongly convex over a bounded domain. 
This property allows us to utilize a variation of the online Newton step [","element":"span"},{"href":"#id-48","referenceIndex":30,"text":"30","element":"a"},{"text":", ","element":"span"},{"href":"#id-49","referenceIndex":31,"text":"31","element":"a"},{"text":"], which inspired online algorithms for logistic bandits [","element":"span"},{"href":"#id-50","referenceIndex":74,"text":"74","element":"a"},{"text":"] and MNL contextual bandits [","element":"span"},{"href":"#id-34","referenceIndex":55,"text":"55","element":"a"},{"text":"]. Specifically, for ","element":"span"},{"style":{"height":16},"width":300.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-19.png","element":"img","alt":" (k, h) ∈ [K] × [H]","inline":true},{"text":", we define the response variable ","element":"span"},{"style":{"height":18.4},"width":79.48,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/3-20.png","element":"img","alt":" ykh =","inline":true}],[{"id":"id-59","style":{"width":"100%"},"width":1596,"height":664,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/4-0.png","element":"img"}],[{"style":{"height":23.38},"width":1321.52,"height":58.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/4-1.png","element":"img","alt":"�ykh(s′)�s′∈Sk,h such that ykh(s′) = 1(skh+1 = s′) for s′ ∈ Sk,h := Sskh,akh. 
Then, ykh ","inline":true,"padRight":true},{"text":"is sampled from ","element":"span"},{"text":"the following multinomial distribution: ","element":"span"},{"style":{"height":24.61},"width":775.48,"height":61.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/4-2.png","element":"img","alt":" ykh ∼ multinomial(1,�Pθ∗h(s′ | skh, akh)�s′∈Sk,h)","inline":true},{"text":", where ","element":"span"},{"text":"1 ","element":"span"},{"text":"represents that ","element":"span"},{"style":{"height":18.21},"width":36.52,"height":45.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/4-3.png","element":"img","alt":" ykh ","inline":true,"padRight":true},{"text":"is a single-trial sample. We define the per-episode loss ","element":"span"},{"style":{"height":16.61},"width":115.48,"height":41.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/4-4.png","element":"img","alt":" ℓk,h(θ)","inline":true,"padRight":true},{"text":"as follows:","element":"span"}],[{"style":{"width":"48%"},"width":764,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/4-5.png","element":"img"}],[{"text":"Then, the estimated transition core for ","element":"span"},{"style":{"height":16.4},"width":39.48,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/4-6.png","element":"img","alt":" θ∗h ","inline":true,"padRight":true},{"text":"is given by","element":"span"}],[{"id":"id-65","style":{"width":"84%"},"width":1338,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/4-7.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":18.61},"width":39.48,"height":46.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/4-8.png","element":"img","alt":" θ1h ","inline":true,"padRight":true},{"text":"can be initialized as any point in 
","element":"span"},{"style":{"height":16.4},"width":279.52,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/4-9.png","element":"img","alt":" Bd(Lθ) and Ak,h ","inline":true,"padRight":true},{"text":"is the Gram matrix defined by","element":"span"}],[{"style":{"width":"79%"},"width":1264,"height":130,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/4-10.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Stochastically optimistic value function ","element":"span"},{"text":"First of all, we introduce the key challenges of regret analysis for randomized algorithms, explain how previous works have overcome these challenges, and then describe why the techniques from previous works cannot be applied to MNL-MDPs. Ensuring that the estimated value function is optimistic with sufficient frequency is a crucial challenge in analyzing the frequentist regret of randomized algorithms. A common way to promote sufficient exploration in randomized algorithms is by perturbing the estimated value function or by performing posterior sampling in the transition model class. 
Frequentist regret analysis of randomized exploration in an RL setting has been conducted for tabular MDPs [","element":"span"},{"href":"#id-51","referenceIndex":59,"text":"59","element":"a"},{"text":", ","element":"span"},{"href":"#id-52","referenceIndex":7,"text":"7","element":"a"},{"text":", ","element":"span"},{"href":"#id-53","referenceIndex":62,"text":"62","element":"a"},{"text":", ","element":"span"},{"href":"#id-54","referenceIndex":60,"text":"60","element":"a"},{"text":", ","element":"span"},{"href":"#id-55","referenceIndex":67,"text":"67","element":"a"},{"text":"], linear MDPs [","element":"span"},{"href":"#id-11","referenceIndex":73,"text":"73","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"37","element":"a"},{"text":"], and general function classes [","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"37","element":"a"},{"text":", ","element":"span"},{"href":"#id-19","referenceIndex":4,"text":"4","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":5,"text":"5","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":75,"text":"75","element":"a"},{"text":"]. In the case of linear MDPs [","element":"span"},{"href":"#id-11","referenceIndex":73,"text":"73","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"37","element":"a"},{"text":"], the property that the action-value function is linear in the feature map allows one to perturb the estimated parameter directly to control the perturbation of the estimated value function. Also, even though Ishfaq et al. ","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"[37] ","element":"a"},{"text":"presented a randomized algorithm for the general function class using the eluder dimension, they assume stochastic optimism (anti-concentration), which is in fact one of the most challenging aspects of frequentist analysis. 
Other posterior sampling algorithms in RL for the general function class such as [","element":"span"},{"href":"#id-19","referenceIndex":4,"text":"4","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":5,"text":"5","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":75,"text":"75","element":"a"},{"text":"], except for very limited examples, do not discuss how to define the posterior distribution supported by the given function class or how to draw an optimistic sample from the posterior. This is why, even when a so-called ","element":"span"},{"style":{"fontStyle":"italic"},"text":"general function class","element":"span"},{"text":"-based result exists, results for specific parametric models are often still needed.","element":"span"}],[{"text":"Note that in episodic RL, the perturbed estimated value functions are propagated backward through the steps of the horizon, so the perturbation scheme must be adjusted carefully to keep the probability of optimism from decaying too quickly with the horizon. For example, if the probability of the estimated value function being optimistic at step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"is denoted as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":", then the probability that the estimated value function at the initial state is optimistic is on the order of ","element":"span"},{"style":{"height":16.61},"width":48.48,"height":41.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/4-11.png","element":"img","alt":" p^H","inline":true},{"text":", implying that the regret can increase exponentially with the length of the horizon ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":". 
","element":"span"},{"text":"Additionally, the non-linearity and substitution effect of the next state transition in MNL-MDPs make it infeasible to apply the existing TS techniques to guarantee optimism with sufficient frequency. Instead, we design the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"stochastically optimistic value function ","element":"span"},{"text":"by exploiting the structure of the MNL transition model. Specifically, the prediction error of the MNL transition model (Definition ","element":"span"},{"href":"#id-56","text":"1) ","element":"a"},{"text":"can be bounded by the weighted norm of the dominant feature ","element":"span"},{"style":{"height":14.8},"width":27,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-0.png","element":"img","alt":" ˆφ","inline":true,"padRight":true},{"text":"(Lemma ","element":"span"},{"href":"#id-57","text":"4)","element":"a"},{"text":". Based on this dominant feature, we perturb the estimated value function by injecting Gaussian noise whose variance is proportional to the inverse of the Gram matrix to encourage the perturbation with higher variance in less explored directions. To guarantee optimism with fixed probability, we adapt the optimistic sampling technique [","element":"span"},{"href":"#id-52","referenceIndex":7,"text":"7","element":"a"},{"text":", ","element":"span"},{"href":"#id-32","referenceIndex":54,"text":"54","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"37","element":"a"},{"text":", ","element":"span"},{"href":"#id-58","referenceIndex":36,"text":"36","element":"a"},{"text":"]. 
For each ","element":"span"},{"style":{"height":16},"width":356.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-1.png","element":"img","alt":" m ∈ [M], sample i.i.d.","inline":true,"padRight":true},{"text":"Gaussian noise vector ","element":"span"},{"style":{"height":23.6},"width":381.52,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-2.png","element":"img","alt":" ξ(m)k,h ∼ N(0d, σ2kA−1k,h)","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":9.6},"width":38,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-3.png","element":"img","alt":" σk","inline":true,"padRight":true},{"text":"is an exploration parameter, and add the most optimistic ","element":"span"},{"text":"inner product value ","element":"span"},{"style":{"height":23.6},"width":447.52,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-4.png","element":"img","alt":" maxm∈[M] ˆφk,h(s, a)⊤ξ(m)k,h","inline":true},{"text":"to the estimated value function. 
To summarize, for ","element":"span"},{"text":"any ","element":"span"},{"style":{"height":19.41},"width":851,"height":48.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-5.png","element":"img","alt":" (s, a) ∈ S × A, set QkH+1(s, a) = 0 and for h ∈ [H],","inline":true}],[{"style":{"width":"97%"},"width":1548,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-6.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.9},"width":539.12,"height":44.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-7.png","element":"img","alt":"V kh (s) = maxa′ Qkh(s, a′)","inline":true},{"text":" and ","element":"span"},{"style":{"height":18.21},"width":501,"height":45.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-8.png","element":"img","alt":"ˆφk,h(s, a) := φ(s, a, ˆs)","inline":true},{"text":" for ","element":"span"},{"style":{"height":10.99},"width":118,"height":27.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-9.png","element":"img","alt":"ˆs =","inline":true},{"style":{"height":22.61},"width":528.48,"height":56.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-10.png","element":"img","alt":"argmaxs′∈Ss,a ∥φ(s, a, s′)∥A−1k,h","inline":true},{"text":". Based on this stochastically optimistic value function, the agent plays the greedy action ","element":"span"},{"style":{"height":18.19},"width":428.52,"height":45.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-11.png","element":"img","alt":" akh = argmaxa′ Qkh(skh, a′)","inline":true},{"text":". We lay out the procedure in Algorithm ","element":"span"},{"href":"#id-59","text":"1.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Remark 2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Note that ","element":"span"},{"style":{"fontStyle":"italic","fontFamily":"monospace"},"text":"RRL-MNL ","element":"span"},{"style":{"fontStyle":"italic"},"text":"requires only constant computational and storage cost per episode, as it does not require storing all samples from previous episodes, and the Gram matrix ","element":"span"},{"style":{"height":15.79},"width":78,"height":39.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-12.png","element":"img","alt":" Ak,h","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"can be updated incrementally.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Regret bound of ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL","element":"span"}],[{"text":"We present the regret upper bound of ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL","element":"span"},{"text":". The complete proof is deferred to Appendix ","element":"span"},{"text":"C.","element":"span"}],[{"id":"id-60","style":{"fontWeight":"bold"},"text":"Theorem 1 ","element":"span"},{"text":"(Regret Bound of ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL","element":"span"},{"text":")","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose that Assumptions ","element":"span"},{"href":"#id-21","style":{"fontStyle":"italic"},"text":"1-","element":"a"},{"href":"#id-28","style":{"fontStyle":"italic"},"text":"4 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold. 
For any ","element":"span"},{"style":{"height":21.6},"width":248,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-13.png","element":"img","alt":" 0 < δ < Φ(−1)/2,","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"if we set the input parameters in Algorithm ","element":"span"},{"href":"#id-59","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"as ","element":"span"},{"style":{"height":22},"width":392,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-14.png","element":"img","alt":" λ = L_φ^2, σ_k = Õ(H√d)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":23.2},"width":313,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-15.png","element":"img","alt":" M = ⌈1 − log H / log Φ(1)⌉","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":10.8},"width":25,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-16.png","element":"img","alt":" Φ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the standard normal CDF, then with probability at least ","element":"span"},{"style":{"height":11.6},"width":78,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-17.png","element":"img","alt":" 1 − δ","inline":true},{"style":{"fontStyle":"italic"},"text":", the cumulative regret of the ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL ","element":"span"},{"style":{"fontStyle":"italic"},"text":"policy ","element":"span"},{"style":{"height":7.2},"width":22.48,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-18.png","element":"img","alt":" π","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is upper-bounded as 
follows:","element":"span"}],[{"style":{"width":"37%"},"width":598,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/5-19.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"KH ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is the total number of steps.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Discussion of Theorem ","element":"span"},{"href":"#id-60","style":{"fontWeight":"bold"},"text":"1 ","element":"a"},{"text":"To the best of our knowledge, this is the first frequentist regret bound for a randomized algorithm in MNL-MDPs. Among the previous RL algorithms using function approximation, the most comparable techniques to our method are ","element":"span"},{"style":{"fontStyle":"italic"},"text":"model-free ","element":"span"},{"text":"algorithms with randomized exploration [","element":"span"},{"href":"#id-11","referenceIndex":73,"text":"73","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"37","element":"a"},{"text":"]. To guarantee stochastic optimism, Zanette et al. ","element":"span"},{"href":"#id-11","referenceIndex":73,"text":"[73] ","element":"a"},{"text":"established a lower bound on the difference between the estimated value and the optimal value in terms of a sum of linear terms with respect to the average feature (Lemma F.1 in [","element":"span"},{"href":"#id-11","referenceIndex":73,"text":"73","element":"a"},{"text":"]). This is possible because the value function admits a linear expression in linear MDPs. 
Instead, we establish a lower bound on the difference between value functions in terms of the sum of the Bellman errors (Definition ","element":"span"},{"href":"#id-56","text":"1) ","element":"a"},{"text":"along the sample path obtained through the optimal policy (Lemma ","element":"span"},{"href":"#id-61","text":"7)","element":"a"},{"text":". Hence, our analysis differs significantly from that of Zanette et al. ","element":"span"},{"href":"#id-11","referenceIndex":73,"text":"[73] ","element":"a"},{"text":"since the value function in MNL-MDPs is no longer linearly parametrized, and there is no closed-form expression for it.","element":"span"}],[{"text":"The algorithm of [","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"37","element":"a"},{"text":"] also uses an optimistic sampling technique; however, our theoretical sample size ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":") ","element":"span"},{"text":"is much smaller than that of [","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"37","element":"a"},{"text":"], namely ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":") ","element":"span"},{"text":"for the linear function class and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(log(","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"style":{"fontStyle":"italic"},"text":"|S||A|","element":"span"},{"text":")) ","element":"span"},{"text":"for the general function class. While Ishfaq et al. 
","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"[37] ","element":"a"},{"text":"extend the results for the linear function class to the general function class under the assumption of stochastic optimism (Assumption C in [","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"37","element":"a"},{"text":"]), we provide a frequentist regret analysis for a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"non-linear model-based ","element":"span"},{"text":"algorithm with randomized exploration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"without assuming stochastic optimism","element":"span"},{"text":".","element":"span"}],[{"text":"Compared to the optimistic exploration algorithm for MNL-MDPs [","element":"span"},{"href":"#id-15","referenceIndex":35,"text":"35","element":"a"},{"text":"], our randomized exploration requires a more involved proof technique to ensure that the perturbation of the estimated value function has enough variance to maintain optimism with sufficient frequency (Lemma ","element":"span"},{"href":"#id-62","text":"6)","element":"a"},{"text":". As a result, ","element":"span"},{"text":"the established regret bound of ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL ","element":"span"},{"text":"is larger by a factor of ","element":"span"},{"style":{"height":15.81},"width":51.48,"height":39.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-0.png","element":"img","alt":"√d","inline":true},{"text":", which aligns with the difference in the existing bounds for linear bandits between a TS-based algorithm [","element":"span"},{"href":"#id-63","referenceIndex":2,"text":"2","element":"a"},{"text":"] and a UCB-based algorithm [","element":"span"},{"href":"#id-64","referenceIndex":1,"text":"1","element":"a"},{"text":"]. 
Additionally, we achieve statistical efficiency for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"inhomogeneous transition model","element":"span"},{"text":", which is a more general setting than that of Hwang and Oh ","element":"span"},{"href":"#id-15","referenceIndex":35,"text":"[35]","element":"a"},{"text":". Our computation cost per episode is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1) ","element":"span"},{"text":"while the computation cost per episode of Hwang and Oh ","element":"span"},{"href":"#id-15","referenceIndex":35,"text":"[35] ","element":"a"},{"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"K","element":"span"},{"text":")","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proof Sketch of Theorem ","element":"span"},{"href":"#id-60","style":{"fontWeight":"bold"},"text":"1 ","element":"a"},{"text":"We provide the proof sketch of Theorem ","element":"span"},{"href":"#id-60","text":"1. ","element":"a"},{"text":"By decomposing the regret into the estimation part and the pessimism part, we have","element":"span"}],[{"style":{"width":"60%"},"width":962,"height":138,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-1.png","element":"img"}],[{"text":"We bound these two parts separately. 
For the estimation part, for each ","element":"span"},{"style":{"height":16},"width":279,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-2.png","element":"img","alt":" k ∈ [K], h ∈ [H]","inline":true},{"text":", we first show that the online estimated transition core ","element":"span"},{"href":"#id-65","style":{"height":18.99},"width":98.52,"height":47.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-3.png","element":"img","alt":" θkh (2)","inline":true,"padRight":true},{"text":"concentrates around the unknown transition ","element":"span"},{"text":"core parameter ","element":"span"},{"style":{"height":16.59},"width":40,"height":41.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-4.png","element":"img","alt":" θ∗h ","inline":true,"padRight":true},{"text":"with high probability (Lemma ","element":"span"},{"href":"#id-66","text":"1)","element":"a"},{"text":". Then, we show that the prediction error induced ","element":"span"},{"text":"by the estimated transition core can be bounded by the weighted norm of the dominant feature ","element":"span"},{"style":{"height":14.61},"width":26.96,"height":36.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-5.png","element":"img","alt":"ˆφ","inline":true},{"text":", multiplied by the confidence radius of the estimated transition core (Lemma ","element":"span"},{"href":"#id-57","text":"4)","element":"a"},{"text":". The bounded prediction error, together with the concentration of Gaussian noise, implies the desired bound on the estimation part (Lemma ","element":"span"},{"href":"#id-67","text":"10)","element":"a"},{"text":". 
For the pessimism part, we first show that the stochastically optimistic value function ","element":"span"},{"style":{"height":17.6},"width":46.52,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-6.png","element":"img","alt":" V k1","inline":true},{"text":"is optimistic relative to the true optimal value function ","element":"span"},{"style":{"height":15.01},"width":44.48,"height":37.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-7.png","element":"img","alt":" V ∗1","inline":true},{"text":"with sufficient frequency ","element":"span"},{"text":"(Lemma ","element":"span"},{"href":"#id-62","text":"6)","element":"a"},{"text":". In the next step, we show that the pessimism part is upper bounded by the bound on the estimation part multiplied by the inverse probability of being optimistic (Lemma ","element":"span"},{"href":"#id-68","text":"11)","element":"a"},{"text":". Combining these results concludes the proof. Refer to Appendix ","element":"span"},{"text":"C ","element":"span"},{"text":"for detailed proofs.","element":"span"}]]},{"heading":"4 Statistically Improved Algorithm for MNL-MDPs","paragraphs":[[{"text":"Although ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL ","element":"span"},{"text":"is provably efficient and achieves constant-time computational cost per episode, the current analysis makes its regret bound scale with ","element":"span"},{"style":{"height":13.39},"width":59,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-8.png","element":"img","alt":" κ−1","inline":true},{"text":". 
Recall that the problem-dependent constant ","element":"span"},{"style":{"height":7.39},"width":20,"height":18.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-9.png","element":"img","alt":" κ","inline":true,"padRight":true},{"text":"introduced in Assumption ","element":"span"},{"href":"#id-28","text":"4 ","element":"a"},{"text":"indicates the curvature of the MNL function, i.e., how difficult it is to learn the true transition core parameter. It is required to ensure that the Fisher information matrix is non-singular, and is therefore typically used in GLM or MNL bandit algorithms that use the maximum likelihood estimator. As noted by Faury et al. ","element":"span"},{"href":"#id-35","referenceIndex":23,"text":"[23]","element":"a"},{"text":", ","element":"span"},{"style":{"height":13.41},"width":59.48,"height":33.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-10.png","element":"img","alt":" κ−1 ","inline":true,"padRight":true},{"text":"can be exponentially large in the worst case. The appearance of ","element":"span"},{"style":{"height":7.39},"width":20.52,"height":18.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-11.png","element":"img","alt":" κ","inline":true,"padRight":true},{"text":"in existing bounds originates in the connection between the difference of estimators and the difference of gradients of the negative log-likelihood, usually denoted as ","element":"span"},{"style":{"fontWeight":"bold"},"text":"G ","element":"span"},{"text":"in Filippi et al. ","element":"span"},{"href":"#id-29","referenceIndex":26,"text":"[26]","element":"a"},{"text":". 
Using a loose lower bound for ","element":"span"},{"style":{"fontWeight":"bold"},"text":"G ","element":"span"},{"text":"that ignores local information incurs a factor of ","element":"span"},{"style":{"height":13.39},"width":59.52,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-12.png","element":"img","alt":"κ−1","inline":true},{"text":" in the regret bound (see Section 4.1 in Agrawal et al. ","element":"span"},{"href":"#id-39","referenceIndex":6,"text":"[6]","element":"a"},{"text":"). Recently, improved dependence on ","element":"span"},{"style":{"height":7.39},"width":20,"height":18.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-13.png","element":"img","alt":" κ","inline":true,"padRight":true},{"text":"has been achieved in the bandit literature [","element":"span"},{"href":"#id-35","referenceIndex":23,"text":"23","element":"a"},{"text":", ","element":"span"},{"href":"#id-36","referenceIndex":3,"text":"3","element":"a"},{"text":", ","element":"span"},{"href":"#id-38","referenceIndex":61,"text":"61","element":"a"},{"text":", ","element":"span"},{"href":"#id-39","referenceIndex":6,"text":"6","element":"a"},{"text":", ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"76","element":"a"},{"text":", ","element":"span"},{"href":"#id-40","referenceIndex":50,"text":"50","element":"a"},{"text":"] through the use of a generalization of the Bernstein-like tail inequality [","element":"span"},{"href":"#id-35","referenceIndex":23,"text":"23","element":"a"},{"text":"] and the self-concordant-like property of the log loss [","element":"span"},{"href":"#id-69","referenceIndex":11,"text":"11","element":"a"},{"text":"]. However, a direct adaptation of the MNL bandit technique would result in sub-optimal dependence on the assortment size in MNL bandits, which corresponds to the size of the set of reachable states, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":". 
In this section, we introduce a new randomized algorithm for MNL-MDPs, equipped with a tight online parameter estimation and feature centralization technique that achieves a regret bound with improved dependence on ","element":"span"},{"style":{"height":11.6},"width":135,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-14.png","element":"img","alt":" κ and U.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"4.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Algorithms: ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Tight online transition core estimation ","element":"span"},{"text":"Zhang and Sugiyama ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"[76] ","element":"a"},{"text":"presented a jointly efficient UCB-based MNL contextual bandit algorithm using online mirror descent algorithm. Adapting the update rule from ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"[76]","element":"a"},{"text":", the estimated transition core run by the online mirror descent is given by","element":"span"}],[{"id":"id-71","style":{"width":"77%"},"width":1226,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-15.png","element":"img"}],[{"text":"where","element":"span"},{"style":{"height":20},"width":39.48,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-16.png","element":"img","alt":"�θ1h ","inline":true,"padRight":true},{"text":"can be initialized as any point in ","element":"span"},{"style":{"height":16},"width":163,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-17.png","element":"img","alt":" Bd(Lθ), η","inline":true,"padRight":true},{"text":"is a step size, and 
","element":"span"},{"style":{"height":20.21},"width":76.48,"height":50.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-18.png","element":"img","alt":" �Bk,h ","inline":true,"padRight":true},{"text":"is defined as","element":"span"}],[{"style":{"width":"85%"},"width":1350,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/6-19.png","element":"img"}],[{"id":"id-73","style":{"width":"100%"},"width":1594,"height":754,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/7-0.png","element":"img"}],[{"text":"Note that the MNL model in Zhang and Sugiyama ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"[76] ","element":"a"},{"text":"operates in a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"multiple-parameter ","element":"span"},{"text":"setting, where there are multiple unknown choice parameters and one given context feature. In contrast, our MNL model operates in a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"single-parameter ","element":"span"},{"text":"setting, where there is one unknown transition core and features for up to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U ","element":"span"},{"text":"reachable states. This difference results in variations in applying the self-concordant-like property of the log-loss for the MNL model. For instance, Zhang and Sugiyama ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"[76] ","element":"a"},{"text":"utilized the fact that the log-loss for the multiple parameter MNL model is","element":"span"},{"style":{"height":16.19},"width":51,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/7-1.png","element":"img","alt":"√6","inline":true},{"text":"-self-concordant-like (Lemma 2 in Zhang and Sugiyama ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"[76]","element":"a"},{"text":"). 
On the other hand, Lee and Oh ","element":"span"},{"href":"#id-40","referenceIndex":50,"text":"[50] ","element":"a"},{"text":"revisit the self-concordant-like property and demonstrate that the log-loss of the single-parameter MNL model is ","element":"span"},{"style":{"height":16.19},"width":71.48,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/7-2.png","element":"img","alt":" 3√2","inline":true},{"text":"-self-concordant-like (Proposition B.1 in Lee and Oh ","element":"span"},{"href":"#id-40","referenceIndex":50,"text":"[50]","element":"a"},{"text":"). This results in a concentration bound that is independent of ","element":"span"},{"style":{"height":11.6},"width":126.52,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/7-3.png","element":"img","alt":" κ and U","inline":true},{"text":", stated in Lemma ","element":"span"},{"href":"#id-70","text":"12.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Remark 3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Note that the online estimated parameters ","element":"span"},{"href":"#id-65","style":{"height":18.99},"width":99,"height":47.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/7-4.png","element":"img","alt":" θkh (2)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-71","style":{"height":20},"width":99.48,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/7-5.png","element":"img","alt":" θ̃kh (5)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"do not aim to minimize ","element":"span"},{"style":{"fontStyle":"italic"},"text":"the sum of negative log-likelihoods,","element":"span"},{"style":{"height":20.61},"width":243.48,"height":51.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/7-6.png","element":"img","alt":"∑_{k′=1}^{k} ℓ_{k′,h}(θ)","inline":true},{"style":{"fontStyle":"italic"},"text":". Instead, we show that each online estimated ","element":"span"},{"style":{"fontStyle":"italic"},"text":"parameter concentrates around the unknown transition core ","element":"span"},{"style":{"height":16.4},"width":40,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/7-7.png","element":"img","alt":" θ∗h","inline":true},{"style":{"fontStyle":"italic"},"text":"with high probability (Lemmas ","element":"span"},{"href":"#id-66","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-70","style":{"fontStyle":"italic"},"text":"12)","element":"a"},{"style":{"fontStyle":"italic"},"text":". This online update approach allows us to estimate the transition core with constant-time computational cost per episode, as the agent does not need to store all samples from previous episodes.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Optimistic randomized value function ","element":"span"},{"text":"To achieve improved dependence on ","element":"span"},{"style":{"height":7.41},"width":20,"height":18.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/7-8.png","element":"img","alt":" κ","inline":true},{"text":", a crucial point is to utilize the local gradient information of MNL transition probabilities for each reachable state when constructing the Gram matrix. 
In MNL bandit problems [","element":"span"},{"href":"#id-38","referenceIndex":61,"text":"61","element":"a"},{"text":", ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"76","element":"a"},{"text":"], this can be accomplished by substituting the Hessian of the negative log-likelihood with the Gram matrix using global gradient information ","element":"span"},{"style":{"height":7.39},"width":20.52,"height":18.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/7-9.png","element":"img","alt":" κ","inline":true},{"text":". However, there are fundamental differences between the settings in Perivier and Goyal ","element":"span"},{"href":"#id-38","referenceIndex":61,"text":"[61]","element":"a"},{"text":", Zhang and Sugiyama ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"[76] ","element":"a"},{"text":"and ours. Perivier and Goyal ","element":"span"},{"href":"#id-38","referenceIndex":61,"text":"[61] ","element":"a"},{"text":"address the case where the reward for each product is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"uniform ","element":"span"},{"text":"(i.e., all products have a reward of 1), and the reward for not selecting a product from the given assortment (also known as the outside option) is 0. On the other hand, Zhang and Sugiyama ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"[76] ","element":"a"},{"text":"deal with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"non-uniform ","element":"span"},{"text":"rewards where the reward for each product may vary; however, the rewards for individual products are known a priori to the agent. In contrast, in MNL-MDPs, the value for each reachable state may vary (non-uniform) and is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"not known ","element":"span"},{"text":"beforehand. 
Due to these differences, the analysis techniques in MNL bandits [","element":"span"},{"href":"#id-38","referenceIndex":61,"text":"61","element":"a"},{"text":", ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"76","element":"a"},{"text":"] cannot be directly applied to our setting. Instead, we adapt the feature centralization technique [","element":"span"},{"href":"#id-40","referenceIndex":50,"text":"50","element":"a"},{"text":"]. Then, the Hessian of the per-round loss ","element":"span"},{"style":{"height":16.59},"width":115,"height":41.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/7-10.png","element":"img","alt":" ℓk,h(θ)","inline":true,"padRight":true},{"text":"is expressed in terms of the centralized feature as follows:","element":"span"}],[{"style":{"width":"70%"},"width":1124,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/7-11.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":20},"width":841,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/7-12.png","element":"img","alt":" ¯φ(s, a, s′; θ) := φ(s, a, s′)−E�s∼Pθ(·|s,a)[φ(s, a, �s)]","inline":true,"padRight":true},{"text":"is the centralized feature by ","element":"span"},{"style":{"height":11.2},"width":186,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/7-13.png","element":"img","alt":" θ. For more","inline":true,"padRight":true},{"text":"details, please refer to Appendix ","element":"span"},{"href":"#id-72","text":"D.2.","element":"a"}],[{"text":"Now we introduce the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"optimistic randomized value function","element":"span"},{"style":{"height":20.19},"width":353,"height":50.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/7-14.png","element":"img","alt":"�Qkh(·, ·) for ORRL-MNL","inline":true},{"text":". 
The key point is ","element":"span"},{"text":"that when perturbing the estimated value function, we use the centralized feature by the estimated","element":"span"}],[{"text":"transition parameter","element":"span"},{"style":{"height":21.6},"width":778,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-0.png","element":"img","alt":"�θkh. For any (s, a) ∈ S × A, set �QkH+1(s, a) = 0","inline":true,"padRight":true},{"text":"and for each ","element":"span"},{"style":{"height":16},"width":136.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-1.png","element":"img","alt":" h ∈ [H],","inline":true}],[{"style":{"width":"90%"},"width":1436,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-2.png","element":"img"}],[{"text":"where","element":"span"},{"style":{"height":22.4},"width":799.52,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-3.png","element":"img","alt":"�V kh (s) := maxa∈A �Qkh(s, a) and νrandk,h (s, a) is the","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"randomized bonus term ","element":"span"},{"text":"defined by","element":"span"}],[{"style":{"width":"91%"},"width":1458,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-4.png","element":"img"}],[{"text":"Here we sample ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i.i.d. 
","element":"span"},{"text":"Gaussian noise ","element":"span"},{"style":{"height":23.81},"width":382.48,"height":59.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-5.png","element":"img","alt":" ξ(m)k,h ∼ N(0d, σ2kB−1k,h)","inline":true,"padRight":true},{"text":"for each ","element":"span"},{"style":{"height":16},"width":149,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-6.png","element":"img","alt":" m ∈ [M]","inline":true,"padRight":true},{"text":"and set ","element":"span"},{"style":{"height":22.8},"width":122.48,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-7.png","element":"img","alt":" ξs′k,h :=","inline":true},{"style":{"height":24.59},"width":97.04,"height":61.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-8.png","element":"img","alt":"ξm(s′)k,h","inline":true},{"text":"where ","element":"span"},{"style":{"height":22.59},"width":726.48,"height":56.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-9.png","element":"img","alt":" m(s′) := argmaxm∈[M] ¯φ(s, a, s′; �θkh)⊤ξmk,h","inline":true},{"text":"is the most optimistic sampling index ","element":"span"},{"text":"for a reachable state ","element":"span"},{"style":{"height":12.4},"width":26,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-10.png","element":"img","alt":" s′","inline":true},{"text":". Based on these optimistic randomized value function, at each episode the agent plays a greedy action with respect to","element":"span"},{"style":{"height":20.4},"width":47.52,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-11.png","element":"img","alt":"�Qkh ","inline":true,"padRight":true},{"text":"as summarized in Algorithm ","element":"span"},{"href":"#id-73","text":"2.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Remark 4. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Note that the second term in the randomized bonus always has a positive value, but it rapidly decreases as episode proceeds. While due to the randomness of ","element":"span"},{"style":{"height":14.61},"width":19.52,"height":36.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-12.png","element":"img","alt":" ξ","inline":true},{"style":{"fontStyle":"italic"},"text":", the randomized bonus ","element":"span"},{"style":{"height":20.19},"width":82.48,"height":50.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-13.png","element":"img","alt":"νrandk,h","inline":true},{"style":{"fontStyle":"italic"},"text":"itself cannot be guaranteed to always have a positive value. Consequently, the constructed","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"value function ","element":"span"},{"style":{"height":20.21},"width":118,"height":50.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-14.png","element":"img","alt":"�Qkh(·, ·)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"can be optimistic or pessimistic. However, as shown in Lemma ","element":"span"},{"href":"#id-74","style":{"fontStyle":"italic"},"text":"18, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"optimistic ","element":"span"},{"style":{"fontStyle":"italic"},"text":"sampling technique ensures that the optimistic randomized value function","element":"span"},{"style":{"height":20.4},"width":47.48,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-15.png","element":"img","alt":"�Qkh ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"has at least a constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"probability of being optimistic than the true optimal value function.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark 5. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"As with ","element":"span"},{"style":{"fontStyle":"italic","fontFamily":"monospace"},"text":"RRL-MNL","element":"span"},{"style":{"fontStyle":"italic"},"text":", since the transition core is estimated in an online manner and the Gram matrices with local gradient information ","element":"span"},{"style":{"height":20.21},"width":234,"height":50.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-16.png","element":"img","alt":" Bk,h and �Bk,h","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are updated incrementally, ","element":"span"},{"style":{"fontStyle":"italic","fontFamily":"monospace"},"text":"ORRL-MNL ","element":"span"},{"style":{"fontStyle":"italic"},"text":"also requires constant-time computational cost and storage cost per-episode. Although ","element":"span"},{"style":{"fontStyle":"italic","fontFamily":"monospace"},"text":"ORRL-MNL ","element":"span"},{"style":{"fontStyle":"italic"},"text":"requires an additional ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"computation cost for feature centralization, the computation complexity order is the same as that of ","element":"span"},{"style":{"fontStyle":"italic","fontFamily":"monospace"},"text":"RRL-MNL ","element":"span"},{"style":{"fontStyle":"italic"},"text":"because it also needs to go over reachable states to calculate the dominant feature ","element":"span"},{"style":{"height":14.99},"width":26.48,"height":37.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-17.png","element":"img","alt":" ˆφ","inline":true},{"style":{"fontStyle":"italic"},"text":", which also incurs a 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"computation cost. On the other hand, ","element":"span"},{"style":{"fontStyle":"italic","fontFamily":"monospace"},"text":"ORRL-MNL ","element":"span"},{"style":{"fontStyle":"italic"},"text":"does not require prior knowledge of ","element":"span"},{"style":{"height":7.39},"width":20.52,"height":18.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-18.png","element":"img","alt":" κ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and achieves a regret with a better dependence on ","element":"span"},{"style":{"height":7.39},"width":27,"height":18.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-19.png","element":"img","alt":" κ.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"4.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Regret Bound of ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL","element":"span"}],[{"text":"We present the regret upper bound of ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL","element":"span"},{"text":". The complete proof is deferred to Appendix ","element":"span"},{"text":"D.","element":"span"}],[{"id":"id-75","style":{"fontWeight":"bold"},"text":"Theorem 2 ","element":"span"},{"text":"(Regret Bound of ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL","element":"span"},{"text":")","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose that Assumption ","element":"span"},{"href":"#id-21","style":{"fontStyle":"italic"},"text":"1- ","element":"a"},{"href":"#id-28","style":{"fontStyle":"italic"},"text":"4 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold. 
For any ","element":"span"},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"< ","element":"span"},{"style":{"height":21.6},"width":200,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-20.png","element":"img","alt":"δ < Φ(−1)2","inline":true},{"style":{"fontStyle":"italic"},"text":", if we set the input parameters in Algorithm ","element":"span"},{"href":"#id-73","style":{"fontStyle":"italic"},"text":"2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"as ","element":"span"},{"style":{"height":19.39},"width":460,"height":48.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-21.png","element":"img","alt":" λ = O(L2φd log U), βk =","inline":true},{"style":{"height":24.4},"width":895,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-22.png","element":"img","alt":"O(√d log U log(kH)), σk = Hβk, M = ⌈1 − log(HU)log Φ(1) ⌉","inline":true},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"height":16},"width":222.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-23.png","element":"img","alt":" η = O(log U)","inline":true},{"style":{"fontStyle":"italic"},"text":", then with probability ","element":"span"},{"style":{"fontStyle":"italic"},"text":"at least ","element":"span"},{"style":{"height":11.6},"width":87.64,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-24.png","element":"img","alt":" 1 − δ","inline":true},{"style":{"fontStyle":"italic"},"text":", the cumulative regret of the ","element":"span"},{"style":{"height":14.4},"width":305,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-25.png","element":"img","alt":" ORRL-MNL policy π","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is upper-bounded as 
follows:","element":"span"}],[{"style":{"width":"50%"},"width":796,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-26.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"KH ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is the total number of time steps.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Discussion of Theorem ","element":"span"},{"href":"#id-75","style":{"fontWeight":"bold"},"text":"2 ","element":"a"},{"text":"Theorem ","element":"span"},{"href":"#id-75","text":"2 ","element":"a"},{"text":"establishes that the leading term in the regret bound does not suffer from the problem-dependent constant ","element":"span"},{"style":{"height":13.39},"width":59.52,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-27.png","element":"img","alt":" κ−1","inline":true},{"text":", and that the second term of the regret bound is independent of the size of the set of reachable states. To the best of our knowledge, this is the first algorithm that provides a frequentist regret guarantee with improved dependence on ","element":"span"},{"style":{"height":13.41},"width":59.52,"height":33.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-28.png","element":"img","alt":" κ−1","inline":true},{"text":" in MNL-MDPs. Compared to ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL","element":"span"},{"text":", the technical challenge lies in ensuring the stochastic optimism of the estimated value for ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL","element":"span"},{"text":". 
Note that the prediction error (Definition ","element":"span"},{"href":"#id-56","text":"1) ","element":"a"},{"text":"for ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL ","element":"span"},{"text":"is characterized by two components: one related to the gradient information of the MNL transition model at each reachable state, and the other related to the dominant feature with respect to the Gram matrix ","element":"span"},{"style":{"height":15.6},"width":76,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/8-29.png","element":"img","alt":" B_{k,h}","inline":true,"padRight":true},{"text":"(Lemma ","element":"span"},{"href":"#id-76","text":"16)","element":"a"},{"text":". Hence, the probability that the Bellman error at each horizon is negative when following the optimal policy can depend on the size of the set of reachable states. This implies that the probability of stochastic optimism can be exponentially small, not only in the horizon ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"but also in the size of the set of reachable states ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":". 
However, as shown in Lemma ","element":"span"},{"href":"#id-74","text":"18, ","element":"a"},{"text":"this challenge is overcome by using a sample size ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"logarithmically ","element":"span"},{"text":"increases with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":".","element":"span"}],[{"id":"id-79","style":{"width":"93%"},"width":1490,"height":380,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/9-0.png","element":"img"}],[{"text":"Figure 1: RiverSwim experiment results","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"Proof Sketch of Theorem ","element":"span"},{"href":"#id-75","style":{"fontWeight":"bold"},"text":"2 ","element":"a"},{"text":"The overall proof pipeline for Theorem ","element":"span"},{"href":"#id-75","text":"2 ","element":"a"},{"text":"is similar to that of Theorem ","element":"span"},{"href":"#id-60","text":"1. ","element":"a"},{"text":"The main differences lie in the concentration of the estimated transition core (Lemma ","element":"span"},{"href":"#id-72","text":"D.2)","element":"a"},{"text":", the bound on the prediction error (Lemma ","element":"span"},{"href":"#id-72","text":"D.2)","element":"a"},{"text":", and the stochastic optimism (Lemma ","element":"span"},{"href":"#id-74","text":"18)","element":"a"},{"text":". Please refer to Appendix ","element":"span"},{"text":"D ","element":"span"},{"text":"for detailed proofs.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Optimistic exploration extension ","element":"span"},{"text":"Because TS-based randomized exploration requires a more involved proof technique than UCB-based algorithms, our technical ingredients carry over to optimistic exploration in a straightforward manner. 
We introduce ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"UCRL-MNL+ ","element":"span"},{"text":"(Algorithm ","element":"span"},{"href":"#id-77","text":"3) ","element":"a"},{"text":"in Appendix ","element":"span"},{"text":"E, ","element":"span"},{"text":"an optimism-based algorithm for MNL-MDPs. It is both ","element":"span"},{"style":{"fontStyle":"italic"},"text":"computationally ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"statistically ","element":"span"},{"text":"more efficient than ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"UCRL-MNL ","element":"span"},{"text":"[","element":"span"},{"href":"#id-15","referenceIndex":35,"text":"35","element":"a"},{"text":"], achieving ","element":"span"},{"style":{"fontStyle":"italic"},"text":"the tightest regret bound ","element":"span"},{"text":"for MNL-MDPs.","element":"span"}],[{"id":"id-188","style":{"fontWeight":"bold"},"text":"Corollary 1. ","element":"span"},{"style":{"fontStyle":"italic","fontFamily":"monospace"},"text":"UCRL-MNL+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"(Algorithm ","element":"span"},{"href":"#id-77","style":{"fontStyle":"italic"},"text":"3) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"has","element":"span"},{"style":{"height":19.6},"width":433.48,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/9-1.png","element":"img","alt":"Õ(d H^{3/2} √T + κ^{−1} d^2 H^2)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"regret with high probability.","element":"span"}]]},{"heading":"5 Numerical Experiments","paragraphs":[[{"text":"We perform a numerical evaluation on a variant of RiverSwim [","element":"span"},{"href":"#id-78","referenceIndex":58,"text":"58","element":"a"},{"text":"] to demonstrate the practicality of our proposed algorithms. 
We compare our algorithms (","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"UCRL-MNL+","element":"span"},{"text":") with the state-of-the-art ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"UCRL-MNL ","element":"span"},{"text":"[","element":"span"},{"href":"#id-15","referenceIndex":35,"text":"35","element":"a"},{"text":"] for MNL-MDPs. For each configuration, we report the averaged results over 10 independent runs. Figure ","element":"span"},{"href":"#id-79","text":"1a ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-79","text":"1b ","element":"a"},{"text":"show the episodic return of each algorithm, which is the sum of all the rewards obtained in one episode. First, our proposed algorithms (","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL","element":"span"},{"text":", ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL","element":"span"},{"text":", ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"UCRL-MNL+","element":"span"},{"text":") outperform ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"UCRL-MNL ","element":"span"},{"text":"[","element":"span"},{"href":"#id-15","referenceIndex":35,"text":"35","element":"a"},{"text":"] for both cases of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|S| ","element":"span"},{"text":"= 4","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"8","element":"span"},{"text":". 
Second, ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"UCRL-MNL+ ","element":"span"},{"text":"reach the optimal values more quickly than the other algorithms, demonstrating improved statistical efficiency. Figure ","element":"span"},{"href":"#id-79","text":"1c ","element":"a"},{"text":"compares the running time of the algorithms for the first 1,000 episodes. Our proposed algorithms are at least 50 times faster than ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"UCRL-MNL","element":"span"},{"text":". These differences become more pronounced as the episodes progress because our algorithms have a constant computation cost, whereas the computation cost of ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"UCRL-MNL ","element":"span"},{"text":"increases over time.","element":"span"}]]},{"heading":"6 Conclusions","paragraphs":[[{"text":"We propose randomized algorithms with provable efficiency and constant-time computational cost for MNL-MDPs. For the first algorithm, ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL","element":"span"},{"text":", we use an optimistic sampling technique to ensure the stochastic optimism of the estimated value functions and provide a frequentist regret analysis. This is the first frequentist regret analysis for a non-linear model-based algorithm with randomized exploration without assuming stochastic optimism. To achieve a statistically improved regret bound, we propose ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL ","element":"span"},{"text":"by constructing the optimistic randomized value function using the effects of the local gradient of the MNL transition model equipped with the centralized feature. 
As a result, we achieve a frequentist regret guarantee with improved dependence on ","element":"span"},{"style":{"height":7.41},"width":20,"height":18.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/9-2.png","element":"img","alt":" κ","inline":true,"padRight":true},{"text":"in RL with the MNL transition model, which is a significant contribution. The effectiveness and practicality of our methods are supported by numerical experiments.","element":"span"}]]},{"heading":"Acknowledgements","paragraphs":[[{"text":"We sincerely thank the anonymous reviewers for their constructive feedback. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1C1C1006859, 2022R1A4A1030579, and RS-2023-00222663) and by the AI-Bio Research Grant through Seoul National University.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-64","text":"[1] ","element":"span"},{"text":"Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 24:2312–2320, 2011.","element":"span"}],[{"id":"id-63","text":"[2] ","element":"span"},{"text":"Marc Abeille and Alessandro Lazaric. Linear Thompson Sampling Revisited. In Aarti Singh and Jerry Zhu, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 20th International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", volume 54 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pages 176–184. PMLR, 20–22 Apr 2017.","element":"span"}],[{"id":"id-36","text":"[3] ","element":"span"},{"text":"Marc Abeille, Louis Faury, and Clément Calauzènes. Instance-wise minimax-optimal algorithms for logistic bandits. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pages 3691–3699. PMLR, 2021.","element":"span"}],[{"id":"id-19","text":"[4] ","element":"span"},{"text":"Alekh Agarwal and Tong Zhang. Model-based rl with optimistic posterior sampling: Structural conditions and sample complexity. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 35: 35284–35297, 2022.","element":"span"}],[{"id":"id-26","text":"[5] ","element":"span"},{"text":"Alekh Agarwal and Tong Zhang. Non-linear reinforcement learning in large action spaces: Structural conditions and sample-efficiency of posterior sampling. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Learning Theory","element":"span"},{"text":", pages 2776–2814. PMLR, 2022.","element":"span"}],[{"id":"id-39","text":"[6] ","element":"span"},{"text":"Priyank Agrawal, Theja Tulabandhula, and Vashist Avadhanula. A tractable online learning algorithm for the multinomial logit contextual bandit. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"European Journal of Operational Research","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-52","text":"[7] ","element":"span"},{"text":"Shipra Agrawal and Randy Jia. Posterior sampling for reinforcement learning: worst-case regret bounds. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 1184–1194, 2017.","element":"span"}],[{"id":"id-46","text":"[8] ","element":"span"},{"text":"Sanae Amani and Christos Thrampoulidis. Ucb-based algorithms for multinomial logistic regression bandits. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 34:2913–2924, 2021.","element":"span"}],[{"id":"id-13","text":"[9] ","element":"span"},{"text":"Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 463–474. PMLR, 2020.","element":"span"}],[{"id":"id-7","text":"[10] ","element":"span"},{"text":"Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 263–272. PMLR, 2017.","element":"span"}],[{"id":"id-69","text":"[11] ","element":"span"},{"text":"Francis Bach. Self-concordant analysis for logistic regression. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Electronic Journal of Statistics","element":"span"},{"text":", 4(2):384 – 414, 2010.","element":"span"}],[{"id":"id-111","text":"[12] ","element":"span"},{"text":"Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local rademacher complexities. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of Statistics","element":"span"},{"text":", 2005.","element":"span"}],[{"id":"id-88","text":"[13] ","element":"span"},{"text":"Steven J Bradtke and Andrew G Barto. Linear least-squares algorithms for temporal difference learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine learning","element":"span"},{"text":", 22(1):33–57, 1996.","element":"span"}],[{"id":"id-82","text":"[14] ","element":"span"},{"text":"Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 1283–1294. PMLR, 2020.","element":"span"}],[{"id":"id-193","text":"[15] ","element":"span"},{"text":"Nicolo Campolongo and Francesco Orabona. Temporal variability in implicit online learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 33:12377–12387, 2020.","element":"span"}],[{"id":"id-22","text":"[16] ","element":"span"},{"text":"Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 24, 2011.","element":"span"}],[{"id":"id-33","text":"[17] ","element":"span"},{"text":"Xi Chen, Yining Wang, and Yuan Zhou. Dynamic assortment optimization with changing contextual information. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of machine learning research","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-20","text":"[18] ","element":"span"},{"text":"Zixiang Chen, Chris Junchi Li, Huizhuo Yuan, Quanquan Gu, and Michael Jordan. A general framework for sample-efficient function approximation in reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Eleventh International Conference on Learning Representations","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-96","text":"[19] ","element":"span"},{"text":"Christoph Dann, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. On oracle-efficient pac rl with rich observations. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 31, 2018.","element":"span"}],[{"id":"id-98","text":"[20] ","element":"span"},{"text":"Simon Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudik, and John Langford. Provably efficient rl with rich observations via latent state decoding. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 1665–1674. PMLR, 2019.","element":"span"}],[{"id":"id-16","text":"[21] ","element":"span"},{"text":"Simon Du, Sham Kakade, Jason Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in rl. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 2826–2836. PMLR, 2021.","element":"span"}],[{"id":"id-12","text":"[22] ","element":"span"},{"text":"Simon S. Du, Sham M. Kakade, Ruosong Wang, and Lin F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-35","text":"[23] ","element":"span"},{"text":"Louis Faury, Marc Abeille, Clément Calauzènes, and Olivier Fercoq. Improved optimistic algorithms for logistic bandits. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 3052–3060. PMLR, 2020.","element":"span"}],[{"id":"id-37","text":"[24] ","element":"span"},{"text":"Louis Faury, Marc Abeille, Kwang-Sung Jun, and Clément Calauzènes. Jointly efficient and optimal algorithms for logistic bandits. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pages 546–580. PMLR, 2022.","element":"span"}],[{"id":"id-4","text":"[25] ","element":"span"},{"text":"Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J R Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al. Discovering faster matrix multiplication algorithms with reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Nature","element":"span"},{"text":", 610(7930):47–53, 2022.","element":"span"}],[{"id":"id-29","text":"[26] ","element":"span"},{"text":"Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1","element":"span"},{"text":", NIPS’10, page 586–594, Red Hook, NY, USA, 2010. Curran Associates Inc.","element":"span"}],[{"id":"id-148","text":"[27] ","element":"span"},{"text":"Dylan J Foster, Satyen Kale, Haipeng Luo, Mehryar Mohri, and Karthik Sridharan. Logistic regression: The importance of being improper. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference On Learning Theory","element":"span"},{"text":", pages 167–208. PMLR, 2018.","element":"span"}],[{"id":"id-17","text":"[28] ","element":"span"},{"text":"Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2112.13487","element":"span"},{"text":", 2021.","element":"span"}],[{"id":"id-192","text":"[29] ","element":"span"},{"text":"David A Freedman. On tail probabilities for martingales. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"the Annals of Probability","element":"span"},{"text":", pages 100–118, 1975.","element":"span"}],[{"id":"id-48","text":"[30] ","element":"span"},{"text":"Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning","element":"span"},{"text":", 69(2):169–192, 2007.","element":"span"}],[{"id":"id-49","text":"[31] ","element":"span"},{"text":"Elad Hazan, Tomer Koren, and Kfir Y Levy. Logistic regression: Tight bounds for stochastic and online optimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Learning Theory","element":"span"},{"text":", pages 197–209. PMLR, 2014.","element":"span"}],[{"id":"id-156","text":"[32] ","element":"span"},{"text":"Elad Hazan et al. Introduction to online convex optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Foundations and Trends® in Optimization","element":"span"},{"text":", 2(3-4):157–325, 2016.","element":"span"}],[{"id":"id-85","text":"[33] ","element":"span"},{"text":"Jiafan He, Dongruo Zhou, and Quanquan Gu. Logarithmic regret for reinforcement learning with linear function approximation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 4171–4180. PMLR, 2021.","element":"span"}],[{"id":"id-89","text":"[34] ","element":"span"},{"text":"Jiafan He, Heyang Zhao, Dongruo Zhou, and Quanquan Gu. Nearly minimax optimal reinforcement learning for linear markov decision processes. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 12790–12822. PMLR, 2023.","element":"span"}],[{"id":"id-15","text":"[35] ","element":"span"},{"text":"Taehyun Hwang and Min-hwan Oh. 
Model-based reinforcement learning with multinomial logistic function approximation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the AAAI conference on artificial intelligence","element":"span"},{"text":", pages 7971–7979, 2023.","element":"span"}],[{"id":"id-58","text":"[36] ","element":"span"},{"text":"Taehyun Hwang, Kyuwook Chai, and Min-Hwan Oh. Combinatorial neural bandits. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 40th International Conference on Machine Learning","element":"span"},{"text":". PMLR, 2023.","element":"span"}],[{"id":"id-14","text":"[37] ","element":"span"},{"text":"Haque Ishfaq, Qiwen Cui, Viet Nguyen, Alex Ayoub, Zhuoran Yang, Zhaoran Wang, Doina Precup, and Lin Yang. Randomized exploration in reinforcement learning with general value function approximation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", volume 139, pages 4607–4616. PMLR, 2021.","element":"span"}],[{"id":"id-87","text":"[38] ","element":"span"},{"text":"Haque Ishfaq, Qingfeng Lan, Pan Xu, A. Rupam Mahmood, Doina Precup, Anima Anandkumar, and Kamyar Azizzadenesheli. Provable and practical: Efficient exploration in reinforcement learning via langevin monte carlo. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Twelfth International Conference on Learning Representations","element":"span"},{"text":", 2024. URL ","element":"span"},{"href":"https://openreview.net/forum?id=nfIAEJFiBZ","style":{"fontFamily":"monospace"},"text":"https://openreview.net/forum?id=nfIAEJFiBZ","element":"a"},{"text":".","element":"span"}],[{"id":"id-97","text":"[39] ","element":"span"},{"text":"Haque Ishfaq, Yixin Tan, Yu Yang, Qingfeng Lan, Jianfeng Lu, A Rupam Mahmood, Doina Precup, and Pan Xu. More efficient randomized exploration for reinforcement learning via approximate sampling. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Reinforcement Learning Journal","element":"span"},{"text":", 3(1), 2024.","element":"span"}],[{"id":"id-5","text":"[40] ","element":"span"},{"text":"Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research","element":"span"},{"text":", 11(4), 2010.","element":"span"}],[{"id":"id-90","text":"[41] ","element":"span"},{"text":"Zeyu Jia, Lin Yang, Csaba Szepesvari, and Mengdi Wang. Model-based reinforcement learning with value-targeted regression. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Learning for Dynamics and Control","element":"span"},{"text":", pages 666–686. PMLR, 2020.","element":"span"}],[{"id":"id-80","text":"[42] ","element":"span"},{"text":"Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low bellman rank are pac-learnable. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 1704–1713. PMLR, 2017.","element":"span"}],[{"id":"id-10","text":"[43] ","element":"span"},{"text":"Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Learning Theory","element":"span"},{"text":", pages 2137– 2143. PMLR, 2020.","element":"span"}],[{"id":"id-18","text":"[44] ","element":"span"},{"text":"Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of rl problems, and sample-efficient algorithms. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 34:13406–13418, 2021.","element":"span"}],[{"id":"id-31","text":"[45] ","element":"span"},{"text":"Kwang-Sung Jun, Aniruddha Bhargava, Robert Nowak, and Rebecca Willett. Scalable generalized linear bandits: Online computation and hashing. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 30, 2017.","element":"span"}],[{"id":"id-92","text":"[46] ","element":"span"},{"text":"Yeoneung Kim, Insoon Yang, and Kwang-Sung Jun. Improved regret analysis for varianceadaptive linear bandits and horizon-free linear mixture mdps. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 35:1060–1072, 2022.","element":"span"}],[{"id":"id-0","text":"[47] ","element":"span"},{"text":"Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The International Journal of Robotics Research","element":"span"},{"text":", 32(11):1238–1274, 2013.","element":"span"}],[{"id":"id-95","text":"[48] ","element":"span"},{"text":"Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Pac reinforcement learning with rich observations. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 29:1840–1848, 2016.","element":"span"}],[{"id":"id-25","text":"[49] ","element":"span"},{"text":"Branislav Kveton, Csaba Szepesvári, Mohammad Ghavamzadeh, and Craig Boutilier. Perturbedhistory exploration in stochastic linear bandits. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Uncertainty in Artificial Intelligence","element":"span"},{"text":", pages 530–540. 
PMLR, 2020.","element":"span"}],[{"id":"id-40","text":"[50] ","element":"span"},{"text":"Joongkyu Lee and Min-hwan Oh. Nearly minimax optimal regret for multinomial logistic bandit. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2405.09831","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-30","text":"[51] ","element":"span"},{"text":"Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 2071–2080. PMLR, 2017.","element":"span"}],[{"id":"id-1","text":"[52] ","element":"span"},{"text":"Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"nature","element":"span"},{"text":", 518(7540):529–533, 2015.","element":"span"}],[{"id":"id-81","text":"[53] ","element":"span"},{"text":"Aditya Modi, Nan Jiang, Ambuj Tewari, and Satinder Singh. Sample complexity of reinforcement learning using linearly combined model ensembles. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pages 2010–2020. PMLR, 2020.","element":"span"}],[{"id":"id-32","text":"[54] ","element":"span"},{"text":"Min-hwan Oh and Garud Iyengar. Thompson sampling for multinomial logit contextual bandits. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 32:3151–3161, 2019.","element":"span"}],[{"id":"id-34","text":"[55] ","element":"span"},{"text":"Min-hwan Oh and Garud Iyengar. 
Multinomial logit contextual bandits: Provable optimality and practicality. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the AAAI Conference on Artificial Intelligence","element":"span"},{"text":", pages 9205–9213, 2021.","element":"span"}],[{"id":"id-6","text":"[56] ","element":"span"},{"text":"Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 1466–1474, 2014.","element":"span"}],[{"id":"id-23","text":"[57] ","element":"span"},{"text":"Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International conference on machine learning","element":"span"},{"text":", pages 2701–2710. PMLR, 2017.","element":"span"}],[{"id":"id-78","text":"[58] ","element":"span"},{"text":"Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 26, 2013.","element":"span"}],[{"id":"id-51","text":"[59] ","element":"span"},{"text":"Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 2377–2386. PMLR, 2016.","element":"span"}],[{"id":"id-54","text":"[60] ","element":"span"},{"text":"Aldo Pacchiano, Philip Ball, Jack Parker-Holder, Krzysztof Choromanski, and Stephen Roberts. Towards tractable optimism in model-based reinforcement learning. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Uncertainty in Artificial Intelligence","element":"span"},{"text":", pages 1413–1423. PMLR, 2021.","element":"span"}],[{"id":"id-38","text":"[61] ","element":"span"},{"text":"Noemie Perivier and Vineet Goyal. Dynamic pricing and assortment under a contextual mnl demand. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 35:3461–3474, 2022.","element":"span"}],[{"id":"id-53","text":"[62] ","element":"span"},{"text":"Daniel Russo. ","element":"span"},{"text":"Worst-case regret bounds for exploration via randomized value functions. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 32, 2019.","element":"span"}],[{"id":"id-94","text":"[63] ","element":"span"},{"text":"Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 2256–2264, 2013.","element":"span"}],[{"id":"id-24","text":"[64] ","element":"span"},{"text":"Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on thompson sampling. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Foundations and Trends® in Machine Learning","element":"span"},{"text":", 11(1):1–96, 2018.","element":"span"}],[{"id":"id-2","text":"[65] ","element":"span"},{"text":"David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"nature","element":"span"},{"text":", 550(7676):354–359, 2017.","element":"span"}],[{"id":"id-3","text":"[66] ","element":"span"},{"text":"David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Science","element":"span"},{"text":", 362(6419):1140–1144, 2018.","element":"span"}],[{"id":"id-55","text":"[67] ","element":"span"},{"text":"Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Mark Rowland, Michal Valko, and Pierre Ménard. Optimistic posterior sampling for reinforcement learning with few samples and tight guarantees. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 35:10737–10751, 2022.","element":"span"}],[{"id":"id-83","text":"[68] ","element":"span"},{"text":"Ruosong Wang, Russ R Salakhutdinov, and Lin Yang. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 33, 2020.","element":"span"}],[{"id":"id-99","text":"[69] ","element":"span"},{"text":"Yining Wang, Ruosong Wang, Simon Shaolei Du, and Akshay Krishnamurthy. Optimism in reinforcement learning with generalized linear function approximation. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021","element":"span"},{"text":", 2021.","element":"span"}],[{"id":"id-84","text":"[70] ","element":"span"},{"text":"Gellért Weisz, Philip Amortila, and Csaba Szepesvári. Exponential lower bounds for planning in mdps with linearly-realizable optimal action-value functions. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Algorithmic Learning Theory","element":"span"},{"text":", pages 1237–1264. PMLR, 2021.","element":"span"}],[{"id":"id-44","text":"[71] ","element":"span"},{"text":"Lin Yang and Mengdi Wang. Sample-optimal parametric q-learning using linearly additive features. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 6995–7004. PMLR, 2019.","element":"span"}],[{"id":"id-42","text":"[72] ","element":"span"},{"text":"Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 10746–10756. PMLR, 2020.","element":"span"}],[{"id":"id-11","text":"[73] ","element":"span"},{"text":"Andrea Zanette, David Brandfonbrener, Emma Brunskill, Matteo Pirotta, and Alessandro Lazaric. Frequentist regret bounds for randomized least-squares value iteration. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pages 1954–1964. PMLR, 2020.","element":"span"}],[{"id":"id-50","text":"[74] ","element":"span"},{"text":"Lijun Zhang, Tianbao Yang, Rong Jin, Yichi Xiao, and Zhi-Hua Zhou. Online stochastic linear optimization under one-bit feedback. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 392–401. PMLR, 2016.","element":"span"}],[{"id":"id-27","text":"[75] ","element":"span"},{"text":"Tong Zhang. Feel-good thompson sampling for contextual bandits and reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Mathematics of Data Science","element":"span"},{"text":", 4(2):834–857, 2022.","element":"span"}],[{"id":"id-47","text":"[76] ","element":"span"},{"text":"Yu-Jie Zhang and Masashi Sugiyama. Online (multinomial) logistic bandit: Improved regret and constant computation cost. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Thirty-seventh Conference on Neural Information Processing Systems","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-8","text":"[77] ","element":"span"},{"text":"Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost optimal model-free reinforcement learning via reference-advantage decomposition. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 33, pages 15198–15207, 2020.","element":"span"}],[{"id":"id-91","text":"[78] ","element":"span"},{"text":"Zihan Zhang, Jiaqi Yang, Xiangyang Ji, and Simon S Du. Improved variance-aware confidence sets for linear bandits and linear mixture mdp. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 34:4342–4355, 2021.","element":"span"}],[{"id":"id-9","text":"[79] ","element":"span"},{"text":"Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Model-free reinforcement learning: from clipped pseudo-regret to sample complexity. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 12653–12662. 
PMLR, 2021.","element":"span"}],[{"id":"id-93","text":"[80] ","element":"span"},{"text":"Dongruo Zhou and Quanquan Gu. Computationally efficient horizon-free reinforcement learning for linear mixture mdps. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 35:36337–36349, 2022.","element":"span"}],[{"id":"id-45","text":"[81] ","element":"span"},{"text":"Dongruo Zhou, Quanquan Gu, and Csaba Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Learning Theory","element":"span"},{"text":", pages 4532–4576. PMLR, 2021.","element":"span"}],[{"id":"id-86","text":"[82] ","element":"span"},{"text":"Dongruo Zhou, Jiafan He, and Quanquan Gu. Provably efficient reinforcement learning for discounted mdps with feature mapping. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 12793–12802. 
PMLR, 2021.","element":"span"}]]},{"heading":"Contents of Appendix","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"A Related Work","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"B ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Notations & Definitions","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"C Detailed Regret Analysis for ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(Theorem ","element":"span"},{"href":"#id-60","style":{"fontWeight":"bold"},"text":"1)","element":"a"}],[{"style":{"width":"96%"},"width":1522,"height":441,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/15-0.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"D Detailed Regret Analysis for ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(Theorem ","element":"span"},{"href":"#id-75","style":{"fontWeight":"bold"},"text":"2)","element":"a"}],[{"style":{"width":"96%"},"width":1523,"height":442,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/15-1.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"E ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Optimistic Exploration Extension","element":"span"}],[{"style":{"width":"96%"},"width":1523,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/15-2.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"F ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experiment Details","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"G Auxiliary Lemmas","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"H Limitations","element":"span"}]]},{"heading":"A Related Work","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"RL with linear function approximation ","element":"span"},{"text":"There has been growing interest in extending beyond tabular MDPs to function approximation methods with provable guarantees [","element":"span"},{"href":"#id-80","referenceIndex":42,"text":"42","element":"a"},{"text":", ","element":"span"},{"href":"#id-44","referenceIndex":71,"text":"71","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":43,"text":"43","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":73,"text":"73","element":"a"},{"text":", ","element":"span"},{"href":"#id-81","referenceIndex":53,"text":"53","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":22,"text":"22","element":"a"},{"text":", ","element":"span"},{"href":"#id-82","referenceIndex":14,"text":"14","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":9,"text":"9","element":"a"},{"text":", ","element":"span"},{"href":"#id-83","referenceIndex":68,"text":"68","element":"a"},{"text":", ","element":"span"},{"href":"#id-84","referenceIndex":70,"text":"70","element":"a"},{"text":", ","element":"span"},{"href":"#id-85","referenceIndex":33,"text":"33","element":"a"},{"text":", ","element":"span"},{"href":"#id-45","referenceIndex":81,"text":"81","element":"a"},{"text":", ","element":"span"},{"href":"#id-86","referenceIndex":82,"text":"82","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"37","element":"a"},{"text":", 
","element":"span"},{"href":"#id-15","referenceIndex":35,"text":"35","element":"a"},{"text":", ","element":"span"},{"href":"#id-87","referenceIndex":38,"text":"38","element":"a"},{"text":"]. In particular, for minimizing regret in linear MDPs, Jin et al. ","element":"span"},{"href":"#id-10","referenceIndex":43,"text":"[43] ","element":"a"},{"text":"propose an optimistic variant of the Least-Squares Value Iteration (LSVI) algorithm [","element":"span"},{"href":"#id-88","referenceIndex":13,"text":"13","element":"a"},{"text":", ","element":"span"},{"href":"#id-51","referenceIndex":59,"text":"59","element":"a"},{"text":"] under the assumption that the transition model and reward function of the MDPs are linear functions of a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":"-dimensional feature mapping, and they guarantee ","element":"span"},{"style":{"height":19.66},"width":233.68,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/15-3.png","element":"img","alt":"Õ(d^{3/2} H^{3/2} √T)","inline":true,"padRight":true},{"text":"regret. Zanette et al. ","element":"span"},{"href":"#id-11","referenceIndex":73,"text":"[73] ","element":"a"},{"text":"propose a randomized LSVI algorithm that incorporates exploration by perturbing the least-squares approximation of the action-value function, and this algorithm guarantees","element":"span"}],[{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/15-4.png","element":"img","alt":"~","inline":true},{"style":{"height":18.3},"width":219.24,"height":45.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/15-5.png","element":"img","alt":"O(d^2 H^2 √T)","inline":true,"padRight":true},{"text":"regret. Ishfaq et al. 
","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"[37] ","element":"a"},{"text":"propose a variant of the randomized LSVI algorithm that combines optimism and TS by perturbing the training data with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i.i.d. ","element":"span"},{"text":"scalar noise, achieving a regret bound of ","element":"span"},{"style":{"height":19.66},"width":233.72,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/15-6.png","element":"img","alt":"Õ(d^{3/2} H^{3/2} √T)","inline":true},{"text":". Similarly, Ishfaq et al. ","element":"span"},{"href":"#id-87","referenceIndex":38,"text":"[38] ","element":"a"},{"text":"introduce a randomized RL algorithm that employs Langevin Monte Carlo (LMC) to approximate the posterior distribution of the action-value","element":"span"}],[{"text":"Table 1: Comparison of the problem settings, online updates, and performance of this paper with those of other methods in provable RL with function approximation. For computation cost, we only keep the dependence on the number of episodes ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"K","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":1585,"height":437,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/16-0.png","element":"img"}],[{"text":"function, also ensuring a regret bound of ","element":"span"},{"style":{"height":19.66},"width":233.68,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/16-1.png","element":"img","alt":"Õ(d^{3/2} H^{3/2} √T)","inline":true},{"text":". 
Also, there have been studies on model-based methods with function approximation in linear MDPs, such as Yang and Wang ","element":"span"},{"href":"#id-42","referenceIndex":72,"text":"[72]","element":"a"},{"text":", who assume that the transition probability kernel is a bilinear model parametrized by a matrix and propose a UCB-based algorithm with a regret upper bound of ","element":"span"},{"style":{"height":19.68},"width":226.48,"height":49.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/16-2.png","element":"img","alt":"Õ(d^{3/2} H^2 √T)","inline":true,"padRight":true},{"text":". He et al. ","element":"span"},{"href":"#id-89","referenceIndex":34,"text":"[34] ","element":"a"},{"text":"propose an algorithm achieving nearly minimax optimal regret ","element":"span"},{"style":{"height":18.3},"width":183.48,"height":45.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/16-3.png","element":"img","alt":"Õ(dH√T)","inline":true},{"text":". Jia et al. ","element":"span"},{"href":"#id-90","referenceIndex":41,"text":"[41] ","element":"a"},{"text":"consider a specific class of MDPs called linear mixture MDPs, in which the transition probability kernel is a linear combination of different basis kernels. This model encompasses various types of MDPs studied previously by Modi et al. ","element":"span"},{"href":"#id-81","referenceIndex":53,"text":"[53]","element":"a"},{"text":" and Yang and Wang ","element":"span"},{"href":"#id-42","referenceIndex":72,"text":"[72]","element":"a"},{"text":". For this model, Jia et al. 
","element":"span"},{"href":"#id-90","referenceIndex":41,"text":"[41] ","element":"a"},{"text":"propose a UCB-based RL algorithm with value-targeted model parameter estimation that guarantees an upper bound of ","element":"span"},{"style":{"height":19.66},"width":265.2,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/16-4.png","element":"img","alt":"Õ(dH^{3/2} √T) for","inline":true,"padRight":true},{"text":"regret. The same linear mixture MDP model has been used in other studies such as Ayoub et al. ","element":"span"},{"href":"#id-13","referenceIndex":9,"text":"[9]","element":"a"},{"text":", Zhou et al. ","element":"span"},{"href":"#id-45","referenceIndex":81,"text":"[81","element":"a"},{"text":", ","element":"span"},{"href":"#id-86","referenceIndex":82,"text":"82]","element":"a"},{"text":". Specifically, in Zhou et al. ","element":"span"},{"href":"#id-45","referenceIndex":81,"text":"[81]","element":"a"},{"text":", a variant of the method proposed by Jia et al. ","element":"span"},{"href":"#id-90","referenceIndex":41,"text":"[41] ","element":"a"},{"text":"is proposed and shown to guarantee ","element":"span"},{"style":{"height":18.3},"width":183.48,"height":45.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/16-5.png","element":"img","alt":"Õ(dH√T)","inline":true,"padRight":true},{"text":"regret with a matching lower bound of ","element":"span"},{"style":{"height":18.29},"width":179.44,"height":45.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/16-6.png","element":"img","alt":" Ω(dH√T)","inline":true,"padRight":true},{"text":"for linear mixture MDPs. 
More recent works also achieve horizon-free regret bounds for linear mixture MDPs ","element":"span"},{"href":"#id-91","referenceIndex":78,"text":"[78, ","element":"a"},{"href":"#id-92","referenceIndex":46,"text":"46, ","element":"a"},{"href":"#id-93","referenceIndex":80,"text":"80]","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"RL with non-linear function approximation ","element":"span"},{"text":"Several studies extend function approximation beyond linear models. Ayoub et al. ","element":"span"},{"href":"#id-13","referenceIndex":9,"text":"[9]","element":"a"},{"text":", Wang et al. ","element":"span"},{"href":"#id-83","referenceIndex":68,"text":"[68]","element":"a"},{"text":", and Ishfaq et al. ","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"[37] ","element":"a"},{"text":"provide regret upper bounds based on the eluder dimension [","element":"span"},{"href":"#id-94","referenceIndex":63,"text":"63","element":"a"},{"text":"]. 
Also, there have been efforts to develop sample-efficient methods with more “general” function approximation [","element":"span"},{"href":"#id-95","referenceIndex":48,"text":"48","element":"a"},{"text":", ","element":"span"},{"href":"#id-80","referenceIndex":42,"text":"42","element":"a"},{"text":", ","element":"span"},{"href":"#id-96","referenceIndex":19,"text":"19","element":"a"},{"text":"–","element":"span"},{"href":"#id-16","referenceIndex":21,"text":"21","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","referenceIndex":28,"text":"28","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"37","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","referenceIndex":44,"text":"44","element":"a"},{"text":", ","element":"span"},{"href":"#id-19","referenceIndex":4,"text":"4","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":5,"text":"5","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":75,"text":"75","element":"a"},{"text":", ","element":"span"},{"href":"#id-20","referenceIndex":18,"text":"18","element":"a"},{"text":", ","element":"span"},{"href":"#id-97","referenceIndex":39,"text":"39","element":"a"},{"text":"]. However, these attempts may have been hindered by the difficulty of solving computationally intractable problems [","element":"span"},{"href":"#id-95","referenceIndex":48,"text":"48","element":"a"},{"text":", ","element":"span"},{"href":"#id-80","referenceIndex":42,"text":"42","element":"a"},{"text":", ","element":"span"},{"href":"#id-96","referenceIndex":19,"text":"19","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","referenceIndex":21,"text":"21","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","referenceIndex":28,"text":"28","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","referenceIndex":44,"text":"44","element":"a"},{"text":", 
","element":"span"},{"href":"#id-20","referenceIndex":18,"text":"18","element":"a"},{"text":"], the necessity of relying on stronger assumptions [","element":"span"},{"href":"#id-98","referenceIndex":20,"text":"20","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"37","element":"a"},{"text":"], or the lack of discussion on how to define the posterior distribution supported by a given function class and how to draw the optimistic sample from the posterior [","element":"span"},{"href":"#id-19","referenceIndex":4,"text":"4","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":5,"text":"5","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":75,"text":"75","element":"a"},{"text":"]. This is why, even when results for a so-called “general function class” exist, results for specific parametric models are often still needed. Despite the large number of studies on RL with linear function approximation, there is limited research on extending beyond linear models to other parametric models. Wang et al. ","element":"span"},{"href":"#id-99","referenceIndex":69,"text":"[69] ","element":"a"},{"text":"use generalized linear function approximation, where the Bellman backup of any value function is assumed to be a generalized linear function of the feature mapping. Hwang and Oh ","element":"span"},{"href":"#id-15","referenceIndex":35,"text":"[35] ","element":"a"},{"text":"discuss the limitations of linear function approximation and propose a UCB-based algorithm for the MNL transition model in feature space, achieving a regret bound of ","element":"span"},{"style":{"height":19.66},"width":218.08,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/16-7.png","element":"img","alt":"Õ(dH^{3/2} √T).","inline":true,"padRight":true},{"text":"Ishfaq et al. 
","element":"span"},{"href":"#id-97","referenceIndex":39,"text":"[39] ","element":"a"},{"text":"present TS-based RL algorithms that utilize approximate samplers, such as LMC or Underdamped LMC, to improve the ease of implementation and computational tractability of TS for RL with general function classes.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Contextual bandits ","element":"span"},{"text":"Faury et al. ","element":"span"},{"href":"#id-35","referenceIndex":23,"text":"[23] ","element":"a"},{"text":"first provide a UCB-based algorithm with ","element":"span"},{"style":{"height":7.2},"width":23,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/16-8.png","element":"img","alt":" κ","inline":true},{"text":"-independent regret for binary logistic bandits, and Abeille et al. ","element":"span"},{"href":"#id-36","referenceIndex":3,"text":"[3] ","element":"a"},{"text":"present UCB- and TS-based algorithms achieving nearly minimax optimal regret for the same setting. Faury et al. ","element":"span"},{"href":"#id-37","referenceIndex":24,"text":"[24] ","element":"a"},{"text":"propose a jointly efficient UCB-based algorithm that achieves a ","element":"span"},{"style":{"height":7.2},"width":23,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/16-9.png","element":"img","alt":" κ","inline":true},{"text":"-independent regret bound with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"computation cost. 
In the context of the MNL model, Oh and Iyengar ","element":"span"},{"href":"#id-32","referenceIndex":54,"text":"[54] ","element":"a"},{"text":"employ a TS approach, while Oh and Iyengar ","element":"span"},{"href":"#id-34","referenceIndex":55,"text":"[55] ","element":"a"},{"text":"combine UCB exploration with online parameter updates for MNL bandits. Both methods have ","element":"span"},{"style":{"height":18.29},"width":192.12,"height":45.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/16-10.png","element":"img","alt":" O(κ−1√T)","inline":true,"padRight":true},{"text":"regret. Amani and Thrampoulidis ","element":"span"},{"href":"#id-46","referenceIndex":8,"text":"[8] ","element":"a"},{"text":"propose an optimistic algorithm with better dependence on ","element":"span"},{"style":{"height":7.2},"width":23,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/16-11.png","element":"img","alt":" κ","inline":true},{"text":". ","element":"span"},{"text":"Agrawal et al. 
","element":"span"},{"href":"#id-39","referenceIndex":6,"text":"[6] ","element":"a"},{"text":"design a UCB-based algorithm achieving ","element":"span"},{"style":{"height":18.29},"width":126.36,"height":45.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/17-0.png","element":"img","alt":"O(√T)","inline":true,"padRight":true},{"text":"regret without ","element":"span"},{"style":{"height":7.2},"width":23,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/17-1.png","element":"img","alt":" κ","inline":true,"padRight":true},{"text":"in its leading term, and Perivier and Goyal ","element":"span"},{"href":"#id-38","referenceIndex":61,"text":"[61] ","element":"a"},{"text":"establish ","element":"span"},{"style":{"height":19.2},"width":193.12,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/17-2.png","element":"img","alt":" O(�T/κ∗)","inline":true,"padRight":true},{"text":"regret for the uniform reward setting. Zhang and Sugiyama ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"[76] ","element":"a"},{"text":"develop a jointly efficient UCB-based algorithm for the non-uniform MNL bandit problem. 
Lee and Oh ","element":"span"},{"href":"#id-40","referenceIndex":50,"text":"[50] ","element":"a"},{"text":"propose a nearly minimax optimal MNL bandit algorithm for both uniform and non-uniform reward structures.","element":"span"}]]},{"heading":"B Notations & Definitions","paragraphs":[[{"text":"In this section, we formally summarize some definitions and notations used to analyze the proposed algorithms.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Inhomogeneous MNL transition model","element":"span"}],[{"text":"For ","element":"span"},{"style":{"height":16},"width":130.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/17-3.png","element":"img","alt":" h ∈ [H]","inline":true},{"text":", the probability of state transition to ","element":"span"},{"style":{"height":15.6},"width":144.36,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/17-4.png","element":"img","alt":" s′ ∈ Ss,a","inline":true,"padRight":true},{"text":"when an action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"is taken at a state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"is given by","element":"span"}],[{"style":{"width":"64%"},"width":1019,"height":109,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/17-5.png","element":"img"}],[{"text":"The estimated transition probability parameterized by ","element":"span"},{"style":{"height":10.8},"width":22,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/17-6.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"is denoted as","element":"span"}],[{"style":{"width":"45%"},"width":729,"height":109,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/17-7.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Feature vector","element":"span"}],[{"text":"We abbreviate the feature vector as 
follows:","element":"span"}],[{"style":{"width":"74%"},"width":1181,"height":343,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/17-8.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Response variable & per-episode loss","element":"span"}],[{"text":"The response variable ","element":"span"},{"style":{"height":17.89},"width":38.56,"height":44.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/17-9.png","element":"img","alt":" ykh ","inline":true,"padRight":true},{"text":"is given by","element":"span"}],[{"style":{"width":"69%"},"width":1094,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/17-10.png","element":"img"}],[{"text":"The per-episode loss ","element":"span"},{"style":{"height":16.78},"width":119.52,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/17-11.png","element":"img","alt":" ℓk,h(θ)","inline":true,"padRight":true},{"text":"is given by","element":"span"}],[{"style":{"width":"100%"},"width":1602,"height":393,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/17-12.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Regularity constants","element":"span"}],[{"style":{"width":"82%"},"width":1306,"height":436,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/18-0.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Estimated transition core","element":"span"}],[{"text":"The estimated transition core for ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL ","element":"span"},{"text":"is given by","element":"span"}],[{"style":{"width":"67%"},"width":1075,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/18-1.png","element":"img"}],[{"text":"and the estimated transition core for ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL ","element":"span"},{"text":"is given 
by","element":"span"}],[{"style":{"width":"54%"},"width":860,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/18-2.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Gram matrices","element":"span"}],[{"text":"The Gram matrix with global gradient information ","element":"span"},{"style":{"height":7.2},"width":23,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/18-3.png","element":"img","alt":" κ","inline":true,"padRight":true},{"text":"is given by","element":"span"}],[{"style":{"width":"58%"},"width":925,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/18-4.png","element":"img"}],[{"text":"The Gram matrices with local gradient information are given by","element":"span"}],[{"style":{"width":"71%"},"width":1134,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/18-5.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Confidence radius","element":"span"}],[{"style":{"width":"99%"},"width":1575,"height":235,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/18-6.png","element":"img"}],[{"style":{"height":18.99},"width":292.32,"height":47.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/18-7.png","element":"img","alt":"= �O(κ−1/2d1/2) ,","inline":true}],[{"style":{"width":"99%"},"width":1583,"height":176,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/18-8.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"γ","element":"span"},{"style":{"height":19.98},"width":587.24,"height":49.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/18-9.png","element":"img","alt":"k := γk(δ) = Cξσk�d log(Md/δ) .","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Filtration","element":"span"}],[{"text":"For an arbitrary set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":", we denote the 
","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/18-10.png","element":"img","alt":" Σ","inline":true},{"text":"-algebra generated by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"as ","element":"span"},{"style":{"height":16},"width":96.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/18-11.png","element":"img","alt":" Σ(X)","inline":true},{"text":". Then we define the following filtrations","element":"span"}],[{"style":{"width":"88%"},"width":1404,"height":161,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/18-12.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Pseudo-noise","element":"span"}],[{"text":"For ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL","element":"span"},{"text":", the pseudo-noise is sampled as","element":"span"}],[{"style":{"width":"25%"},"width":398,"height":60,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/19-0.png","element":"img"}],[{"text":"and for ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL","element":"span"},{"text":", the pseudo-noise is sampled as","element":"span"}],[{"style":{"width":"24%"},"width":396,"height":60,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/19-1.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"times independently.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Estimated value functions","element":"span"}],[{"text":"The stochastically optimistic value function for ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL ","element":"span"},{"text":"is defined as 
follows:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"style":{"height":19.73},"width":249.72,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/19-2.png","element":"img","alt":"kH+1(s, a) = 0 ,","inline":true}],[{"style":{"width":"100%"},"width":1649,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/19-3.png","element":"img"}],[{"text":"The optimistic randomized value function for ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL ","element":"span"},{"text":"is defined as follows:","element":"span"}],[{"style":{"width":"92%"},"width":1467,"height":182,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/19-4.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"91%"},"width":1454,"height":204,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/19-5.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Prediction error & Bellman error","element":"span"}],[{"id":"id-56","style":{"fontWeight":"bold"},"text":"Definition 1 ","element":"span"},{"text":"(Prediction error & Bellman error)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":16},"width":246.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/19-6.png","element":"img","alt":" (s, a) ∈ S × A","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16},"width":310.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/19-7.png","element":"img","alt":" (k, h) ∈ [K] × [H]","inline":true},{"style":{"fontStyle":"italic"},"text":", we define the prediction error about ","element":"span"},{"style":{"height":18.67},"width":90.2,"height":46.68,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/19-8.png","element":"img","alt":" θkh as","inline":true}],[{"style":{"width":"62%"},"width":992,"height":103,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/19-9.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Also we define the Bellman error as follows:","element":"span"}],[{"style":{"width":"48%"},"width":765,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/19-10.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Good events","element":"span"}],[{"text":"For any ","element":"span"},{"style":{"height":16},"width":156.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/19-11.png","element":"img","alt":" δ ∈ (0, 1)","inline":true},{"text":", we define the following good events:","element":"span"}],[{"style":{"width":"77%"},"width":1235,"height":558,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/19-12.png","element":"img"}],[{"style":{"width":"86%"},"width":1366,"height":710,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/20-0.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Derivative of MNL transition 
model","element":"span"}],[{"id":"id-100","style":{"fontWeight":"bold"},"text":"Proposition 1 ","element":"span"},{"text":"(Derivative of MNL transition model)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The gradient and Hessian of ","element":"span"},{"style":{"height":16},"width":274.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/20-1.png","element":"img","alt":" Pθ(· | ·, ·) can be","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"calculated as follows:","element":"span"}],[{"style":{"width":"99%"},"width":1585,"height":674,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/20-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Proposition ","element":"span"},{"href":"#id-100","style":{"fontStyle":"italic"},"text":"1. ","element":"a"},{"text":"Let ","element":"span"},{"style":{"height":16},"width":270.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/20-3.png","element":"img","alt":" θ = (θ1, . . . , θd)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.3},"width":142,"height":45.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/20-4.png","element":"img","alt":" [φs,a,s′]i","inline":true,"padRight":true},{"text":"be the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th component of ","element":"span"},{"style":{"height":13.5},"width":106.08,"height":33.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/20-5.png","element":"img","alt":" φs,a,s′","inline":true},{"text":". 
Then, we have","element":"span"}],[{"style":{"width":"85%"},"width":1354,"height":415,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/20-6.png","element":"img"}],[{"text":"Then, the gradient of ","element":"span"},{"style":{"height":16},"width":198.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/20-7.png","element":"img","alt":" Pθ(s′ | s, a)","inline":true,"padRight":true},{"text":"is given by","element":"span"}],[{"style":{"width":"80%"},"width":1284,"height":320,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/20-8.png","element":"img"}],[{"text":"On the other hand, the second derivative ","element":"span"},{"style":{"height":22.96},"width":298.24,"height":57.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/21-0.png","element":"img","alt":"∂∂θi∂θj Pθ(s′ | s, a)","inline":true,"padRight":true},{"text":"can be obtained as follows:","element":"span"}],[{"style":{"width":"99%"},"width":1583,"height":2229,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/21-1.png","element":"img"}]]},{"heading":"C Detailed Regret Analysis for RRL-MNL (Theorem 1)","paragraphs":[[{"text":"In this section, we provide the complete proof of Theorem ","element":"span"},{"href":"#id-60","text":"1. ","element":"a"},{"text":"First, we introduce all the technical lemmas needed to prove Theorem ","element":"span"},{"href":"#id-60","text":"1 ","element":"a"},{"text":"along with their proofs. 
At the end of this section, we present the proof of Theorem ","element":"span"},{"href":"#id-60","text":"1.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"C.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Concentration of Estimated Transition Core ","element":"span"},{"style":{"height":18.67},"width":42.68,"height":46.68,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/22-0.png","element":"img","alt":" θkh","inline":true}],[{"text":"In this section, we provide the concentration inequality for the transition core estimated by the approximate online Newton step. The proof is similar to that given by Oh and Iyengar ","element":"span"},{"href":"#id-34","referenceIndex":55,"text":"[55]","element":"a"},{"text":". For completeness, we provide the detailed proof.","element":"span"}],[{"id":"id-66","style":{"fontWeight":"bold"},"text":"Lemma 1 ","element":"span"},{"text":"(Concentration of online estimated transition core)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For each ","element":"span"},{"style":{"height":19.73},"width":449.76,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/22-1.png","element":"img","alt":" h ∈ [H], if λ ≥ L2φ, then we","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"have","element":"span"}],[{"style":{"width":"100%"},"width":1600,"height":307,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/22-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-66","style":{"fontStyle":"italic"},"text":"1. 
","element":"a"},{"text":"Recall that the per-round loss ","element":"span"},{"style":{"height":16.8},"width":119.48,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/22-3.png","element":"img","alt":" ℓk,h(θ)","inline":true,"padRight":true},{"text":"and its gradient ","element":"span"},{"style":{"height":16.8},"width":138.92,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/22-4.png","element":"img","alt":" Gk,h(θ)","inline":true,"padRight":true},{"text":"are defined as follows:","element":"span"}],[{"style":{"width":"75%"},"width":1197,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/22-5.png","element":"img"}],[{"text":"For the analysis, we define the conditional expectations of ","element":"span"},{"style":{"height":16.8},"width":308.84,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/22-6.png","element":"img","alt":" ℓk,h(θ) & Gk,h(θ)","inline":true,"padRight":true},{"text":"as follows: ","element":"span"},{"style":{"height":21.57},"width":1128.16,"height":53.92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/22-7.png","element":"img","alt":"¯ℓk,h(θ) := Eykh [ℓk,h(θ) | Fk,h] , ¯Gk,h(θ) := Eykh[Gk,h(θ) | Fk,h] .","inline":true}],[{"text":"By Taylor expansion with ","element":"span"},{"style":{"height":18.74},"width":828.68,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/22-8.png","element":"img","alt":"¯θ = νθkh + (1 − ν)θ∗h for some ν ∈ (0, 1), we have","inline":true}],[{"id":"id-101","style":{"width":"92%"},"width":1460,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/22-9.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.8},"width":138.76,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/22-10.png","element":"img","alt":" Hk,h(θ)","inline":true,"padRight":true},{"text":"is the Hessian of the per-round loss 
evaluated at ","element":"span"},{"style":{"height":13.2},"width":102.28,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/22-11.png","element":"img","alt":" θ, i.e.,","inline":true}],[{"style":{"width":"99%"},"width":1585,"height":1334,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/22-12.png","element":"img"}],[{"text":"where the inequality uses the fact that ","element":"span"},{"style":{"height":16.98},"width":746.72,"height":42.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/23-0.png","element":"img","alt":" xx⊤ + yy⊤ ⪰ xy⊤ + yx⊤ for any x, y ∈ Rd","inline":true},{"text":". Therefore, we have","element":"span"}],[{"style":{"width":"78%"},"width":1248,"height":959,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/23-1.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":15.6},"width":64.76,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/23-2.png","element":"img","alt":" ˙sk,h","inline":true,"padRight":true},{"text":"is the state satisfying ","element":"span"},{"style":{"height":18.18},"width":355.04,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/23-3.png","element":"img","alt":" φ(skh, akh, ˙sk,h) = 0d","inline":true,"padRight":true},{"text":"and the last inequality follows from ","element":"span"},{"text":"Assumption ","element":"span"},{"href":"#id-28","text":"4.","element":"a"}],[{"text":"Using the lower bound on the Hessian of the per-round loss evaluated at ","element":"span"},{"style":{"height":13.81},"width":22,"height":34.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/23-4.png","element":"img","alt":"¯θ","inline":true},{"text":", from ","element":"span"},{"href":"#id-101","text":"(10) ","element":"a"},{"text":"we 
have","element":"span"}],[{"id":"id-102","style":{"width":"100%"},"width":1586,"height":1052,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/23-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":208.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/23-6.png","element":"img","alt":" DKL(P ∥ Q)","inline":true,"padRight":true},{"text":"is the Kullback-Leibler divergence of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":". From ","element":"span"},{"href":"#id-102","text":"(12) ","element":"a"},{"text":"we have","element":"span"}],[{"id":"id-103","style":{"width":"100%"},"width":1586,"height":224,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/24-0.png","element":"img"}],[{"text":"Since the objective function in ","element":"span"},{"href":"#id-103","text":"(14) ","element":"a"},{"text":"is convex, by the first-order optimality condition for any ","element":"span"},{"style":{"height":11.6},"width":64.04,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/24-1.png","element":"img","alt":" θ ∈","inline":true}],[{"style":{"width":"77%"},"width":1234,"height":186,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/24-2.png","element":"img"}],[{"text":"which gives","element":"span"}],[{"style":{"width":"94%"},"width":1496,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/24-3.png","element":"img"}],[{"text":"Then, we have","element":"span"}],[{"id":"id-104","style":{"width":"98%"},"width":1567,"height":546,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/24-4.png","element":"img"}],[{"text":"where the last inequality follows from the fact 
that","element":"span"}],[{"style":{"width":"93%"},"width":1479,"height":169,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/24-5.png","element":"img"}],[{"text":"Therefore, from ","element":"span"},{"href":"#id-104","text":"(16) ","element":"a"},{"text":"we have","element":"span"}],[{"id":"id-105","style":{"width":"98%"},"width":1563,"height":84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/24-6.png","element":"img"}],[{"text":"By substituting ","element":"span"},{"href":"#id-105","text":"(17) ","element":"a"},{"text":"into ","element":"span"},{"text":"(13)","element":"span"},{"text":", we have","element":"span"}],[{"id":"id-106","style":{"width":"86%"},"width":1377,"height":182,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/24-7.png","element":"img"}],[{"text":"Note that we have","element":"span"}],[{"id":"id-107","style":{"width":"94%"},"width":1505,"height":1209,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/25-0.png","element":"img"}],[{"text":"where the first inequality uses the inequality ","element":"span"},{"style":{"height":14},"width":583.8,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/25-1.png","element":"img","alt":" x⊤Ay + y⊤Ax ≤ x⊤Ax + y⊤Ay","inline":true,"padRight":true},{"text":"for any positive semidefinite matrix ","element":"span"},{"style":{"fontWeight":"bold"},"text":"A","element":"span"},{"text":", and the last inequality holds since ","element":"span"},{"style":{"height":21.14},"width":686.36,"height":52.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/25-2.png","element":"img","alt":" 0 ≤ Pθkh(s′ | skh, akh) ≤ 1 and Σs′ Pθkh(s′ |","inline":true},{"style":{"height":17.9},"width":197.32,"height":44.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/25-3.png","element":"img","alt":"skh, akh) = 1.","inline":true}],[{"text":"Combining the results of 
","element":"span"},{"href":"#id-106","text":"(18) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-107","text":"(19)","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"95%"},"width":1510,"height":600,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/25-4.png","element":"img"}],[{"text":"where for the first equality we use ","element":"span"},{"style":{"height":16.98},"width":428.6,"height":42.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/25-5.png","element":"img","alt":" Ak+1,h = Ak,h + (κ/2) Wk,h","inline":true},{"text":". By rearranging the terms, we have","element":"span"}],[{"style":{"width":"88%"},"width":1406,"height":181,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/25-6.png","element":"img"}],[{"text":"Then summing over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"gives","element":"span"}],[{"style":{"width":"97%"},"width":1543,"height":538,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-0.png","element":"img"}],[{"text":"For the final step, note that ","element":"span"},{"style":{"height":19.81},"width":592.68,"height":49.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-1.png","element":"img","alt":"(¯Gi,h(θih) − Gi,h(θih))⊤ (θih − θ∗h)","inline":true,"padRight":true},{"text":"is a martingale difference sequence. ","element":"span"},{"text":"To bound this term, we invoke the following lemmas:","element":"span"}],[{"id":"id-108","style":{"fontWeight":"bold"},"text":"Lemma 2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For ","element":"span"},{"style":{"height":16},"width":544.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-2.png","element":"img","alt":" δ ∈ (0, 1) and (k, h) ∈ [K] × [H]","inline":true},{"style":{"fontStyle":"italic"},"text":", with probability at least ","element":"span"},{"style":{"height":12},"width":227.56,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-3.png","element":"img","alt":" 1 − δ, we have","inline":true}],[{"style":{"width":"86%"},"width":1367,"height":257,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-4.png","element":"img"}],[{"id":"id-109","style":{"fontWeight":"bold"},"text":"Lemma 3 ","element":"span"},{"text":"(Generalized elliptical potential)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":18.18},"width":470,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-5.png","element":"img","alt":" St := {xt,1, . . . , xt,K} ⊂ Rd","inline":true},{"style":{"fontStyle":"italic"},"text":". For any ","element":"span"},{"style":{"height":13.2},"width":169.6,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-6.png","element":"img","alt":" 1 ≤ t ≤ T","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16},"width":129.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-7.png","element":"img","alt":" i ∈ [K]","inline":true},{"style":{"fontStyle":"italic"},"text":", suppose ","element":"span"},{"style":{"height":16.8},"width":205.36,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-8.png","element":"img","alt":" ∥xt,i∥2 ≤ L","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Let ","element":"span"},{"style":{"height":22.18},"width":586.56,"height":55.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-9.png","element":"img","alt":" Vt := λId + Σ_{τ=1}^{t−1} Σ_{i∈Sτ} xτ,i x⊤τ,i","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for some ","element":"span"},{"style":{"height":11.6},"width":104.88,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-10.png","element":"img","alt":" λ > 0","inline":true},{"style":{"fontStyle":"italic"},"text":". If ","element":"span"},{"style":{"height":15.78},"width":119.52,"height":39.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-11.png","element":"img","alt":"λ ≥ L²","inline":true},{"style":{"fontStyle":"italic"},"text":", then we have","element":"span"}],[{"style":{"width":"45%"},"width":723,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-12.png","element":"img"}],[{"text":"By Lemma ","element":"span"},{"href":"#id-108","text":"2, ","element":"a"},{"text":"with probability at least ","element":"span"},{"style":{"height":14},"width":238.36,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-13.png","element":"img","alt":" 1 − δ, we have","inline":true}],[{"style":{"width":"97%"},"width":1553,"height":445,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-14.png","element":"img"}],[{"text":"where the second inequality comes from Lemma ","element":"span"},{"href":"#id-109","text":"3. 
","element":"a"},{"text":"Note that the Gram matrix ","element":"span"},{"style":{"height":15.58},"width":80.68,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-15.png","element":"img","alt":" Ak,h","inline":true,"padRight":true},{"text":"in Algorithm ","element":"span"},{"href":"#id-59","text":"1 ","element":"a"},{"text":"and the Gram matrix ","element":"span"},{"style":{"fontWeight":"bold"},"text":"V ","element":"span"},{"text":"in Lemma ","element":"span"},{"href":"#id-109","text":"3 ","element":"a"},{"text":"differ by a factor of ","element":"span"},{"style":{"height":16.98},"width":19,"height":42.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-16.png","element":"img","alt":"κ/2","inline":true,"padRight":true},{"text":", which results in an additional ","element":"span"},{"style":{"height":19.38},"width":19,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-17.png","element":"img","alt":" 2/κ","inline":true,"padRight":true},{"text":"factor in the bound on ","element":"span"},{"style":{"height":28.32},"width":547.84,"height":70.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/26-18.png","element":"img","alt":"Σ_{i=1}^{k} max_{s′∈S_{i,h}} ∥φ_{i,h,s′}∥²_{A_{i+1,h}^{−1}}.","inline":true}],[{"text":"In the following, we provide the proofs of all the lemmas used to prove Lemma ","element":"span"},{"href":"#id-66","text":"1.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"C.1.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-108","style":{"fontWeight":"bold"},"text":"2","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-108","style":{"fontStyle":"italic"},"text":"2. 
","element":"a"},{"text":"Note that ","element":"span"},{"style":{"height":19.82},"width":600.28,"height":49.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/27-0.png","element":"img","alt":"(¯Gi,h(θih) − Gi,h(θih))⊤ (θih − θ∗h)","inline":true,"padRight":true},{"text":"is a martingale difference ","element":"span"},{"text":"sequence, i.e.,","element":"span"}],[{"style":{"width":"51%"},"width":811,"height":195,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/27-1.png","element":"img"}],[{"text":"On the other hand, for any ","element":"span"},{"style":{"height":14.19},"width":118.12,"height":35.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/27-2.png","element":"img","alt":" θ ∈ Rd","inline":true},{"text":", we have","element":"span"}],[{"style":{"width":"61%"},"width":968,"height":483,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/27-3.png","element":"img"}],[{"text":"It follows that","element":"span"}],[{"id":"id-112","style":{"width":"78%"},"width":1250,"height":346,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/27-4.png","element":"img"}],[{"text":"where the last inequality follows from ","element":"span"},{"style":{"height":18.32},"width":631.72,"height":45.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/27-5.png","element":"img","alt":" ∥θih − θ∗h∥2 ≤ ∥θih∥2 + ∥θ∗h∥2 ≤ 2Lθ","inline":true},{"text":". Hence, if we denote ","element":"span"},{"style":{"height":21.1},"width":845.84,"height":52.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/27-6.png","element":"img","alt":"Mk,h := Σ_{i=1}^{k} (¯Gi,h(θih) − Gi,h(θih))⊤ (θih − θ∗h)","inline":true},{"text":", then ","element":"span"},{"style":{"height":15.58},"width":84.72,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/27-7.png","element":"img","alt":" Mk,h","inline":true,"padRight":true},{"text":"is a martingale. 
Note that we also","element":"span"}],[{"text":"have","element":"span"}],[{"id":"id-110","style":{"width":"96%"},"width":1535,"height":1236,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/28-0.png","element":"img"}],[{"text":"where ","element":"span"},{"href":"#id-110","text":"(21) ","element":"a"},{"text":"holds by the Cauchy–Schwarz inequality, ","element":"span"},{"href":"#id-110","text":"(22) ","element":"a"},{"text":"holds because","element":"span"}],[{"style":{"width":"70%"},"width":1123,"height":297,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/28-1.png","element":"img"}],[{"text":"However, if we denote ","element":"span"},{"style":{"height":23.33},"width":524.68,"height":58.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/28-2.png","element":"img","alt":" Bk,h := 2 �ki=1 ∥θih − θ∗h∥2Wi,h","inline":true},{"text":", since ","element":"span"},{"style":{"height":15.6},"width":76.28,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/28-3.png","element":"img","alt":" Bk,h","inline":true,"padRight":true},{"text":"is itself a random variable, to ","element":"span"},{"text":"apply Freedman’s inequality to ","element":"span"},{"style":{"height":15.58},"width":84.72,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/28-4.png","element":"img","alt":" Mk,h","inline":true},{"text":", we consider two cases depending on the values of ","element":"span"},{"style":{"height":15.58},"width":87.96,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/28-5.png","element":"img","alt":" Bk,h.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Case 1 : ","element":"span"},{"style":{"height":19.76},"width":175.48,"height":49.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/28-6.png","element":"img","alt":" Bk,h ≤ 
4kU","inline":true}],[{"style":{"width":"82%"},"width":1308,"height":1036,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/29-0.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Case 2 : ","element":"span"},{"style":{"height":19.78},"width":175.48,"height":49.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/29-1.png","element":"img","alt":" Bk,h > 4kU","inline":true}],[{"text":"Suppose that ","element":"span"},{"style":{"height":23.33},"width":611.64,"height":58.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/29-2.png","element":"img","alt":" Bk,h = 2 �ki=1 ∥θih − θ∗h∥2Wi,h > 4kU","inline":true,"padRight":true},{"text":". Then, we have both a lower and upper bound ","element":"span"},{"text":"for ","element":"span"},{"style":{"height":15.58},"width":76.28,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/29-3.png","element":"img","alt":" Bk,h","inline":true,"padRight":true},{"text":"as follows:","element":"span"}],[{"style":{"width":"65%"},"width":1041,"height":125,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/29-4.png","element":"img"}],[{"text":"Then by the peeling process from Bartlett et al. ","element":"span"},{"href":"#id-111","referenceIndex":12,"text":"[12]","element":"a"},{"text":", for any ","element":"span"},{"style":{"height":14.8},"width":262.96,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/29-5.png","element":"img","alt":" ηk > 0, we have","inline":true}],[{"id":"id-115","style":{"width":"94%"},"width":1502,"height":653,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/29-6.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.78},"width":563.4,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/29-7.png","element":"img","alt":" m = 1 + ⌈2 log2 kULφLθ⌉. 
For Ij","inline":true},{"text":", note that from ","element":"span"},{"href":"#id-112","text":"(20) ","element":"a"},{"text":"we have","element":"span"}],[{"style":{"width":"50%"},"width":802,"height":74,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/29-8.png","element":"img"}],[{"text":"By Freedman’s inequality (Lemma ","element":"span"},{"href":"#id-113","text":"29)","element":"a"},{"text":", we have","element":"span"}],[{"id":"id-114","style":{"width":"78%"},"width":1246,"height":876,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/30-0.png","element":"img"}],[{"text":"By substituting Eq. ","element":"span"},{"href":"#id-114","text":"(24) ","element":"a"},{"text":"into Eq. ","element":"span"},{"href":"#id-115","text":"(23)","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"57%"},"width":916,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/30-1.png","element":"img"}],[{"text":"Combining the results of Cases 1 and 2, letting ","element":"span"},{"style":{"height":26.8},"width":680.72,"height":67,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/30-2.png","element":"img","alt":" ηk = log(m/(δ/k²)) = log((1+⌈2 log2 kULφLθ⌉)k²/δ)","inline":true}],[{"text":"and taking a union bound over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", with probability at least ","element":"span"},{"style":{"height":14},"width":238.36,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/30-3.png","element":"img","alt":" 1 − δ, we have","inline":true}],[{"id":"id-116","style":{"width":"79%"},"width":1263,"height":145,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/30-4.png","element":"img"}],[{"text":"By applying ","element":"span"},{"style":{"height":16.78},"width":231.2,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/30-5.png","element":"img","alt":" 2√ab ≤ a + 
b","inline":true,"padRight":true},{"text":"to the first term on the right-hand side, we have","element":"span"}],[{"id":"id-117","style":{"width":"81%"},"width":1288,"height":145,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/30-6.png","element":"img"}],[{"text":"Combining the results of Eq. ","element":"span"},{"href":"#id-116","text":"(25) ","element":"a"},{"text":"and Eq. ","element":"span"},{"href":"#id-117","text":"(26)","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"95%"},"width":1521,"height":364,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/30-7.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"C.1.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-109","style":{"fontWeight":"bold"},"text":"3","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-109","style":{"fontStyle":"italic"},"text":"3. ","element":"a"},{"text":"By the definition of ","element":"span"},{"style":{"height":13.6},"width":199.16,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/31-0.png","element":"img","alt":" Vt, we have","inline":true}],[{"id":"id-118","style":{"width":"79%"},"width":1263,"height":686,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/31-1.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":15.79},"width":271.88,"height":39.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/31-2.png","element":"img","alt":" λ ≥ L2, we have","inline":true}],[{"style":{"width":"28%"},"width":446,"height":80,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/31-3.png","element":"img"}],[{"text":"For any ","element":"span"},{"style":{"height":16},"width":148.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/31-4.png","element":"img","alt":" z ∈ [0, 
1]","inline":true},{"text":", we have ","element":"span"},{"style":{"height":16},"width":271.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/31-5.png","element":"img","alt":" z ≤ 2 log(1 + z)","inline":true},{"text":". Hence, we have","element":"span"}],[{"style":{"width":"55%"},"width":876,"height":482,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/31-6.png","element":"img"}],[{"text":"where the second inequality comes from Eq. ","element":"span"},{"href":"#id-118","text":"(27) ","element":"a"},{"text":"and the last inequality follows from the determinant-trace inequality (Lemma ","element":"span"},{"href":"#id-119","text":"28)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"C.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Bound on Prediction Error","element":"span"}],[{"text":"In this section, we provide a bound on the prediction error induced by the estimated transition core ","element":"span"},{"style":{"height":18.67},"width":54.32,"height":46.68,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/31-7.png","element":"img","alt":" θkh.","inline":true}],[{"id":"id-57","style":{"fontWeight":"bold"},"text":"Lemma 4 ","element":"span"},{"text":"(Bound on Prediction Error)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":16},"width":157,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/31-8.png","element":"img","alt":" δ ∈ (0, 1)","inline":true},{"style":{"fontStyle":"italic"},"text":", suppose that Lemma ","element":"span"},{"href":"#id-66","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds. 
Then for any ","element":"span"},{"style":{"height":16},"width":394.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/31-9.png","element":"img","alt":" (s, a) ∈ S × A, we have","inline":true}],[{"style":{"width":"40%"},"width":638,"height":62,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/31-10.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-57","style":{"fontStyle":"italic"},"text":"4. ","element":"a"},{"text":"Recall that","element":"span"}],[{"style":{"width":"80%"},"width":1284,"height":247,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/31-11.png","element":"img"}],[{"text":"Then by the mean value theorem, there exists ","element":"span"},{"style":{"height":18.72},"width":354.76,"height":46.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-0.png","element":"img","alt":"¯θ = ρθkh + (1 − ρ)θ∗h","inline":true,"padRight":true},{"text":"for some ","element":"span"},{"style":{"height":16},"width":149.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-1.png","element":"img","alt":" ρ ∈ [0, 1]","inline":true,"padRight":true},{"text":"such ","element":"span"},{"text":"that","element":"span"}],[{"style":{"width":"100%"},"width":1598,"height":1226,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-2.png","element":"img"}],[{"text":"where the second inequality comes from the fact that ","element":"span"},{"style":{"height":16},"width":336.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-3.png","element":"img","alt":" P¯θ(s′ | s, a) ≤ 1","inline":true,"padRight":true},{"text":"is a multinomial probability, the third inequality holds due to the Cauchy-Schwarz inequality, and the last inequality follows from Lemma ","element":"span"},{"href":"#id-66","text":"1 ","element":"a"},{"text":"and the definition of 
","element":"span"},{"style":{"height":17.31},"width":75.84,"height":43.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-4.png","element":"img","alt":" ˆφk,h","inline":true},{"text":", i.e., ","element":"span"},{"style":{"height":18.29},"width":410.88,"height":45.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-5.png","element":"img","alt":" ˆφk,h(s, a) := φ(s, a, ˆs)","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":22.46},"width":604.56,"height":56.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-6.png","element":"img","alt":"ˆs = argmaxs′∈Ss,a ∥φ(s, a, s′)∥A−1k,h.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"C.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Good Events with High Probability","element":"span"}],[{"id":"id-120","style":{"fontWeight":"bold"},"text":"Lemma 5 ","element":"span"},{"text":"(Good event probability)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":11.62},"width":114.4,"height":29.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-7.png","element":"img","alt":" K ∈ N","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16},"width":157,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-8.png","element":"img","alt":" δ ∈ (0, 1)","inline":true},{"style":{"fontStyle":"italic"},"text":", the good event ","element":"span"},{"style":{"height":16},"width":140.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-9.png","element":"img","alt":" G(K, δ′)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"holds with probability at least ","element":"span"},{"style":{"height":16},"width":460.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-10.png","element":"img","alt":" 1 − δ where δ′ = δ/(2KH).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-120","style":{"fontStyle":"italic"},"text":"5. 
","element":"a"},{"text":"For any ","element":"span"},{"style":{"height":16},"width":318.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-11.png","element":"img","alt":" δ′ ∈ (0, 1), we have","inline":true}],[{"style":{"width":"67%"},"width":1075,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-12.png","element":"img"}],[{"text":"On the other hand, for any ","element":"span"},{"style":{"height":16},"width":304.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-13.png","element":"img","alt":" (k, h) ∈ [K] × [H]","inline":true},{"text":", by Lemma ","element":"span"},{"href":"#id-121","text":"30, ","element":"a"},{"style":{"height":22.88},"width":133.32,"height":57.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-14.png","element":"img","alt":" Gξk,h(δ′)","inline":true,"padRight":true},{"text":"holds with probability at least ","element":"span"},{"style":{"height":11.6},"width":101.84,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-15.png","element":"img","alt":"1 − δ′","inline":true},{"text":". Then, for ","element":"span"},{"style":{"height":16},"width":247.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-16.png","element":"img","alt":" δ′ = δ/(2KH)","inline":true,"padRight":true},{"text":"by taking a union bound, we obtain the desired result:","element":"span"}],[{"style":{"width":"76%"},"width":1218,"height":125,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-17.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"C.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Stochastic Optimism","element":"span"}],[{"id":"id-62","style":{"fontWeight":"bold"},"text":"Lemma 6 ","element":"span"},{"text":"(Stochastic optimism)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":18.4},"width":916.16,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-18.png","element":"img","alt":" δ with 0 < δ < Φ(−1)/2, let σk = Hαk(δ) = Õ(H√d).","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"If we take multiple sample size ","element":"span"},{"style":{"height":23.01},"width":320.24,"height":57.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-19.png","element":"img","alt":" M = ⌈1 − log H / log Φ(1)⌉","inline":true},{"style":{"fontStyle":"italic"},"text":", then for any ","element":"span"},{"style":{"height":16},"width":279.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-20.png","element":"img","alt":" k ∈ [K], we have","inline":true}],[{"style":{"width":"47%"},"width":747,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/32-21.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-62","style":{"fontStyle":"italic"},"text":"6. ","element":"a"},{"text":"Before presenting the proof, we introduce the following lemmas.","element":"span"}],[{"id":"id-61","style":{"fontWeight":"bold"},"text":"Lemma 7. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":16},"width":268.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-0.png","element":"img","alt":" k ∈ [K], it holds","inline":true}],[{"style":{"width":"57%"},"width":911,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ι","element":"span"},{"style":{"height":19.5},"width":747.76,"height":48.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-2.png","element":"img","alt":"kh(s, a) := r(s, a) + PhV kh+1(s, a) − Qkh(s, a).","inline":true}],[{"id":"id-123","style":{"fontWeight":"bold"},"text":"Lemma 8. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":156.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-3.png","element":"img","alt":" δ ∈ (0, 1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be given. For any ","element":"span"},{"style":{"height":16},"width":308.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-4.png","element":"img","alt":" (k, h) ∈ [K] × [H]","inline":true},{"style":{"fontStyle":"italic"},"text":", let ","element":"span"},{"style":{"height":16},"width":227.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-5.png","element":"img","alt":" σk = Hαk(δ)","inline":true},{"style":{"fontStyle":"italic"},"text":". 
If we define the event ","element":"span"},{"style":{"height":20.69},"width":167.56,"height":51.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-6.png","element":"img","alt":" G∆k,h(δ) as","inline":true}],[{"style":{"width":"54%"},"width":865,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"then conditioned on ","element":"span"},{"style":{"height":20.7},"width":660.08,"height":51.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-8.png","element":"img","alt":" G∆k,h(δ), for any (s, a) ∈ S × A, we have","inline":true}],[{"style":{"width":"44%"},"width":701,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-9.png","element":"img"}],[{"id":"id-122","style":{"fontWeight":"bold"},"text":"Lemma 9. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":167.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-10.png","element":"img","alt":" δ ∈ (0, 1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be given. For any ","element":"span"},{"style":{"height":16},"width":322.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-11.png","element":"img","alt":" (h, k) ∈ [H] × [K]","inline":true},{"style":{"fontStyle":"italic"},"text":", let ","element":"span"},{"style":{"height":16},"width":237.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-12.png","element":"img","alt":" σk = Hαk(δ)","inline":true},{"style":{"fontStyle":"italic"},"text":". 
If we take multiple sample size ","element":"span"},{"style":{"height":23.01},"width":320.28,"height":57.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-13.png","element":"img","alt":" M = ⌈1 − log H / log Φ(1)⌉","inline":true},{"style":{"fontStyle":"italic"},"text":", then conditioned on the event ","element":"span"},{"style":{"height":21.38},"width":419.08,"height":53.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-14.png","element":"img","alt":" G∆k (δ) := ⋂h∈[H] G∆k,h(δ)","inline":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"we have","element":"span"}],[{"style":{"width":"51%"},"width":822,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-15.png","element":"img"}],[{"text":"Now, we define the event of the estimated value function being optimistic at the start of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-th episode as","element":"span"}],[{"style":{"width":"31%"},"width":503,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-16.png","element":"img"}],[{"text":"Then for the event ","element":"span"},{"style":{"height":16},"width":351.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-17.png","element":"img","alt":" Gk(δ) =: Gk, we have","inline":true}],[{"style":{"width":"41%"},"width":655,"height":207,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-18.png","element":"img"}],[{"text":"where the last inequality comes from Lemma ","element":"span"},{"href":"#id-120","text":"5.","element":"a"}],[{"text":"On the other hand, by Lemma ","element":"span"},{"href":"#id-61","text":"7, ","element":"a"},{"text":"we have","element":"span"}],[{"style":{"width":"56%"},"width":897,"height":259,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-19.png","element":"img"}],[{"text":"If we 
define an event","element":"span"}],[{"style":{"width":"50%"},"width":793,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-20.png","element":"img"}],[{"text":"then, by Lemma ","element":"span"},{"href":"#id-122","text":"9, ","element":"a"},{"text":"we have","element":"span"}],[{"style":{"width":"88%"},"width":1409,"height":438,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-21.png","element":"img"}],[{"text":"where the last inequality comes from the choice of ","element":"span"},{"style":{"height":12},"width":29.24,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/33-22.png","element":"img","alt":" δ.","inline":true}],[{"text":"In the following, we prove the lemmas used in the proof of Lemma ","element":"span"},{"href":"#id-62","text":"6.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"C.4.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-61","style":{"fontWeight":"bold"},"text":"7","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-61","style":{"fontStyle":"italic"},"text":"7. 
","element":"a"},{"text":"In this proof, we use ","element":"span"},{"style":{"height":17.9},"width":41.76,"height":44.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/34-0.png","element":"img","alt":" xkh ","inline":true,"padRight":true},{"text":"to denote the states sampled under ","element":"span"},{"style":{"height":10.98},"width":40.16,"height":27.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/34-1.png","element":"img","alt":" π∗","inline":true,"padRight":true},{"text":"to distinguish them from ","element":"span"},{"style":{"height":17.9},"width":49.32,"height":44.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/34-2.png","element":"img","alt":" skh.","inline":true,"padRight":true},{"text":"Since we have","element":"span"}],[{"style":{"width":"96%"},"width":1532,"height":715,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/34-3.png","element":"img"}],[{"text":"then by applying this argument recursively, we finally have","element":"span"}],[{"style":{"width":"78%"},"width":1246,"height":183,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/34-4.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"C.4.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-123","style":{"fontWeight":"bold"},"text":"8","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-123","style":{"fontStyle":"italic"},"text":"8. 
","element":"a"},{"text":"Since we have","element":"span"}],[{"style":{"height":19.73},"width":808.08,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/34-5.png","element":"img","alt":"−ιkh(s, a) = Qkh(s, a) − (r(s, a) + PhV kh+1(s, a))","inline":true}],[{"style":{"width":"99%"},"width":1579,"height":568,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/34-6.png","element":"img"}],[{"text":"with at least constant probability.","element":"span"}],[{"style":{"width":"86%"},"width":1374,"height":269,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/34-7.png","element":"img"}],[{"text":"Now, for ","element":"span"},{"style":{"height":23.89},"width":814.36,"height":59.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/34-8.png","element":"img","alt":" ∀m ∈ [M], since ξ(m)k,h ∼ N(0d, σ2kA−1k,h), we have","inline":true}],[{"style":{"width":"47%"},"width":761,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/34-9.png","element":"img"}],[{"text":"which means,","element":"span"}],[{"style":{"width":"62%"},"width":988,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/35-0.png","element":"img"}],[{"text":"by setting ","element":"span"},{"style":{"height":16},"width":227.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/35-1.png","element":"img","alt":" σk = Hαk(δ)","inline":true},{"text":". 
Finally, we obtain the desired result:","element":"span"}],[{"style":{"width":"90%"},"width":1427,"height":428,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/35-2.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"C.4.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-122","style":{"fontWeight":"bold"},"text":"9","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-122","style":{"fontStyle":"italic"},"text":"9. ","element":"a"},{"text":"For each ","element":"span"},{"style":{"height":16},"width":130.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/35-3.png","element":"img","alt":" h ∈ [H]","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":129.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/35-4.png","element":"img","alt":" k ∈ [K]","inline":true},{"text":", define an event ","element":"span"},{"style":{"height":17.9},"width":416.88,"height":44.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/35-5.png","element":"img","alt":" Ekh := {−ιkh(sh, ah) ≥ 0}","inline":true,"padRight":true},{"text":"Then ","element":"span"},{"text":"it holds that","element":"span"}],[{"style":{"width":"71%"},"width":1138,"height":521,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/35-6.png","element":"img"}],[{"text":"where the first inequality uses the union bound, the second inequality comes from Lemma ","element":"span"},{"href":"#id-123","text":"8 ","element":"a"},{"text":"and the last inequality holds due to the choice of ","element":"span"},{"style":{"height":23.01},"width":329.96,"height":57.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/35-7.png","element":"img","alt":" M = ⌈1 − log H / log Φ(1)⌉.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"C.5 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Bound on Estimation Part","element":"span"}],[{"text":"We decompose the regret into the estimation part and the pessimism part as follows:","element":"span"}],[{"style":{"width":"60%"},"width":957,"height":131,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/35-8.png","element":"img"}],[{"text":"and we bound these two parts in the following sections. ","element":"span"},{"id":"id-67","style":{"fontWeight":"bold"},"text":"Lemma 10 ","element":"span"},{"text":"(Bound on estimation part)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":19.73},"width":335.64,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/35-9.png","element":"img","alt":" δ ∈ (0, 1), if λ ≥ L2φ","inline":true},{"style":{"fontStyle":"italic"},"text":", then with probability at least","element":"span"}],[{"style":{"width":"72%"},"width":1153,"height":155,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/35-10.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-67","style":{"fontStyle":"italic"},"text":"10. 
","element":"a"},{"text":"For any given ","element":"span"},{"style":{"height":16},"width":139.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/35-11.png","element":"img","alt":" k ∈ [K],","inline":true}],[{"style":{"width":"98%"},"width":1559,"height":334,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/35-12.png","element":"img"}],[{"text":"where the second equality holds due to the variant of ","element":"span"},{"style":{"height":17.89},"width":165,"height":44.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/36-0.png","element":"img","alt":" ιkh(skh, akh)","inline":true,"padRight":true},{"text":"as follows:","element":"span"}],[{"style":{"width":"88%"},"width":1398,"height":281,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/36-1.png","element":"img"}],[{"text":"Then, by applying this argument recursively over the whole horizon, we have","element":"span"}],[{"id":"id-124","style":{"width":"100%"},"width":1590,"height":1228,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/36-2.png","element":"img"}],[{"text":"where ","element":"span"},{"href":"#id-124","text":"(30) ","element":"a"},{"text":"comes from the Cauchy-Schwarz inequality and ","element":"span"},{"href":"#id-124","text":"(31) ","element":"a"},{"text":"holds due to Lemmas ","element":"span"},{"href":"#id-57","text":"4 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-121","text":"30. 
","element":"a"},{"text":"Then, with probability at least ","element":"span"},{"style":{"height":16},"width":278.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/36-3.png","element":"img","alt":" 1 − δ/4, we have","inline":true}],[{"id":"id-125","style":{"width":"82%"},"width":1315,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/36-4.png","element":"img"}],[{"text":"On the other hand, for ","element":"span"},{"style":{"height":19.52},"width":37.36,"height":48.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/36-5.png","element":"img","alt":"˙ζkh","inline":true},{"text":", we have ","element":"span"},{"style":{"height":19.52},"width":172.48,"height":48.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/36-6.png","element":"img","alt":" | ˙ζkh| ≤ 2H","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.79},"width":274.04,"height":49.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/36-7.png","element":"img","alt":" E[ ˙ζkh | Fk,h] = 0","inline":true},{"text":", which means ","element":"span"},{"style":{"height":19.79},"width":236.76,"height":49.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/36-8.png","element":"img","alt":" { ˙ζkh | Fk,h}k,h","inline":true,"padRight":true},{"text":"is a martingale difference sequence for any ","element":"span"},{"style":{"height":16},"width":130.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/36-9.png","element":"img","alt":" k ∈ [K]","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":131,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/36-10.png","element":"img","alt":" h ∈ [H]","inline":true},{"text":". 
Hence, by the Azuma-Hoeffding inequality, with probability at least ","element":"span"},{"style":{"height":16},"width":278.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/36-11.png","element":"img","alt":" 1 − δ/4, we have","inline":true}],[{"id":"id-126","style":{"width":"68%"},"width":1080,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/36-12.png","element":"img"}],[{"text":"Combining the results of ","element":"span"},{"href":"#id-125","text":"(32) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-126","text":"(33)","element":"a"},{"text":", with probability at least ","element":"span"},{"style":{"height":16},"width":278.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-0.png","element":"img","alt":" 1 − δ/2, we have","inline":true}],[{"id":"id-127","style":{"width":"91%"},"width":1448,"height":868,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-1.png","element":"img"}],[{"text":"where ","element":"span"},{"href":"#id-127","text":"(34) ","element":"a"},{"text":"follows from the fact that both ","element":"span"},{"style":{"height":16},"width":95.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-2.png","element":"img","alt":" αk(δ)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":90.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-3.png","element":"img","alt":" γk(δ)","inline":true,"padRight":true},{"text":"are increasing in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", ","element":"span"},{"href":"#id-127","text":"(35) ","element":"a"},{"text":"comes from the Cauchy-Schwarz inequality and ","element":"span"},{"href":"#id-127","text":"(36) ","element":"a"},{"text":"holds by the generalized elliptical potential lemma (Lemma 
","element":"span"},{"href":"#id-109","text":"3)","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"1%"},"width":28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-4.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"C.6 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Bound on Pessimism Part","element":"span"}],[{"id":"id-68","style":{"fontWeight":"bold"},"text":"Lemma 11 ","element":"span"},{"text":"(Bound on pessimism)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":19.73},"width":881.48,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-5.png","element":"img","alt":" δ with 0 < δ < Φ(−1)/2, let σk = Hαk(δ). If λ ≥ L2φ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and we take multiple sample size ","element":"span"},{"style":{"height":23.01},"width":316.56,"height":57.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-6.png","element":"img","alt":" M = ⌈1 − log H / log Φ(1)⌉","inline":true},{"style":{"fontStyle":"italic"},"text":", then with probability at least ","element":"span"},{"style":{"height":16},"width":269.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-7.png","element":"img","alt":" 1 − δ/2, we have","inline":true}],[{"style":{"width":"44%"},"width":705,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-8.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-68","style":{"fontStyle":"italic"},"text":"11. 
","element":"a"},{"text":"Similar to the techniques used in [","element":"span"},{"href":"#id-11","referenceIndex":73,"text":"73","element":"a"},{"text":"], we show that the difference between the optimal value function ","element":"span"},{"style":{"height":14.93},"width":48.12,"height":37.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-9.png","element":"img","alt":" V ∗1 ","inline":true,"padRight":true},{"text":"and the estimated value function ","element":"span"},{"style":{"height":17.33},"width":49.12,"height":43.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-10.png","element":"img","alt":" V k1 ","inline":true,"padRight":true},{"text":"can be controlled by constructing an ","element":"span"},{"text":"upper bound on ","element":"span"},{"style":{"height":14.94},"width":48.12,"height":37.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-11.png","element":"img","alt":" V ∗1 ","inline":true,"padRight":true},{"text":"and a lower bound on ","element":"span"},{"style":{"height":17.34},"width":49.08,"height":43.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-12.png","element":"img","alt":" V k1 ","inline":true,"padRight":true},{"text":". In this proof, we consider three kinds of pseudo-noises, ","element":"span"},{"style":{"height":17.01},"width":59.92,"height":42.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-13.png","element":"img","alt":"ξ, ¯ξ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":20,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-14.png","element":"img","alt":" ξ","inline":true,"padRight":true},{"text":"that we define later in the proof. 
Also, for ","element":"span"},{"style":{"height":16},"width":163.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-15.png","element":"img","alt":" δ′ = δ/10","inline":true},{"text":", we denote ","element":"span"},{"style":{"height":17.63},"width":297.52,"height":44.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-16.png","element":"img","alt":" G(K, δ′), ¯G(K, δ′)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":140.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-17.png","element":"img","alt":"G(K, δ′)","inline":true,"padRight":true},{"text":"as the good events induced by ","element":"span"},{"style":{"height":17.01},"width":59.92,"height":42.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-18.png","element":"img","alt":" ξ, ¯ξ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":20,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-19.png","element":"img","alt":" ξ","inline":true,"padRight":true},{"text":"respectively. From now on, we denote ","element":"span"},{"style":{"height":16},"width":145.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-20.png","element":"img","alt":" G(K, δ′)","inline":true,"padRight":true},{"text":"by the event ","element":"span"},{"style":{"height":17.63},"width":501.48,"height":44.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-21.png","element":"img","alt":" G(K, δ′) ∩ ¯G(K, δ′) ∩ G(K, δ′)","inline":true},{"text":". 
Then, by Lemma ","element":"span"},{"href":"#id-120","text":"5, ","element":"a"},{"text":"the event ","element":"span"},{"style":{"height":16},"width":145.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-22.png","element":"img","alt":" G(K, δ′)","inline":true,"padRight":true},{"text":"holds with probability at least ","element":"span"},{"style":{"height":16},"width":177.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-23.png","element":"img","alt":" 1 − 3δ/10.","inline":true}],[{"text":"First, we construct a lower bound on ","element":"span"},{"style":{"height":17.34},"width":49.08,"height":43.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-24.png","element":"img","alt":" V k1","inline":true,"padRight":true},{"text":". For any given ","element":"span"},{"style":{"height":16},"width":129.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-25.png","element":"img","alt":" k ∈ [K]","inline":true},{"text":", let ","element":"span"},{"href":"#id-59","style":{"height":26.51},"width":404.64,"height":66.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-26.png","element":"img","alt":" �ξ := {�ξ(m)k,h }m∈[M] ⊂ Rd","inline":true,"padRight":true},{"text":"be","element":"span"}],[{"style":{"width":"99%"},"width":1583,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-27.png","element":"img"}],[{"text":"non-random 
","element":"span"},{"style":{"height":14},"width":20.36,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-28.png","element":"img","alt":"�ξ","inline":true},{"style":{"height":11.2},"width":52.76,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-29.png","element":"img","alt":"(m)","inline":true},{"style":{"height":10},"width":46.08,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-30.png","element":"img","alt":"k,h ","inline":true,"padRight":true},{"text":"in place of ","element":"span"},{"style":{"height":23.89},"width":74.2,"height":59.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-31.png","element":"img","alt":" ξ(m)k,h ","inline":true,"padRight":true},{"text":". Then consider the following minimization problem:","element":"span"}],[{"style":{"width":"63%"},"width":1006,"height":214,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-32.png","element":"img"}],[{"text":"And we denote ","element":"span"},{"style":{"height":25.02},"width":419.48,"height":62.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-33.png","element":"img","alt":" ξ := {ξ(m)k,h }h∈[H],m∈[M]","inline":true,"padRight":true},{"text":"by a minimizer and ","element":"span"},{"style":{"height":18.56},"width":121.48,"height":46.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-34.png","element":"img","alt":" V k1(sk1)","inline":true,"padRight":true},{"text":"by the minimum of the ","element":"span"},{"text":"above minimization problem, i.e., ","element":"span"},{"style":{"height":19.07},"width":321.16,"height":47.68,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-35.png","element":"img","alt":" V kh(·) := V kh (· ; ξ)","inline":true},{"text":". 
Then, under the event ","element":"span"},{"style":{"height":16},"width":140.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-36.png","element":"img","alt":" G(K, δ′)","inline":true},{"text":", since ","element":"span"},{"style":{"height":23.89},"width":313.52,"height":59.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-37.png","element":"img","alt":"{ξ(m)k,h }h∈[H],m∈[M]","inline":true,"padRight":true},{"text":"is also a feasible solution of the above optimization problem, and since ","element":"span"},{"style":{"height":17.9},"width":94.32,"height":44.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-38.png","element":"img","alt":" V kh =","inline":true},{"style":{"height":17.9},"width":129,"height":44.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-39.png","element":"img","alt":"V kh ( ; ξ)","inline":true},{"text":", we have","element":"span"}],[{"id":"id-128","style":{"width":"59%"},"width":950,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/37-40.png","element":"img"}],[{"text":"Second, to find an upper bound for ","element":"span"},{"style":{"height":10.98},"width":48.08,"height":27.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-0.png","element":"img","alt":" V ∗","inline":true},{"text":", consider i.i.d. copies ","element":"span"},{"style":{"height":24.06},"width":313.52,"height":60.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-1.png","element":"img","alt":" {¯ξ(m)k,h }h∈[H],m∈[M]","inline":true,"padRight":true},{"text":"of ","element":"span"},{"style":{"height":23.89},"width":313.52,"height":59.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-2.png","element":"img","alt":"{ξ(m)k,h }h∈[H],m∈[M]","inline":true,"padRight":true},{"text":"and run Algorithm ","element":"span"},{"href":"#id-59","text":"1 ","element":"a"},{"text":"to get a corresponding 
value function ","element":"span"},{"style":{"height":18.14},"width":49.08,"height":45.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-3.png","element":"img","alt":" ¯V kh","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.14},"width":50.52,"height":45.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-4.png","element":"img","alt":" ¯Qkh","inline":true,"padRight":true},{"text":"for ","element":"span"},{"text":"all ","element":"span"},{"style":{"height":16},"width":130.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-5.png","element":"img","alt":" h ∈ [H]","inline":true},{"text":". Define the event that ","element":"span"},{"style":{"height":17.63},"width":121.48,"height":44.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-6.png","element":"img","alt":"¯V k1 (sk1)","inline":true,"padRight":true},{"text":"is optimistic in the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-th episode as","element":"span"}],[{"style":{"width":"30%"},"width":479,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-7.png","element":"img"}],[{"text":"Then by Lemma ","element":"span"},{"href":"#id-62","text":"6, ","element":"a"},{"text":"for given ","element":"span"},{"style":{"height":14},"width":169.72,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-8.png","element":"img","alt":" δ, we have","inline":true}],[{"style":{"width":"28%"},"width":459,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-9.png","element":"img"}],[{"text":"Then by the definition of optimism, under the event ","element":"span"},{"style":{"height":16},"width":290.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-10.png","element":"img","alt":" G(K, δ′), we 
have","inline":true}],[{"id":"id-130","style":{"width":"72%"},"width":1150,"height":142,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-11.png","element":"img"}],[{"text":"where the expectations are over the ","element":"span"},{"style":{"height":17.01},"width":20.76,"height":42.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-12.png","element":"img","alt":"¯ξ","inline":true},{"text":"’s conditioned on the event ","element":"span"},{"style":{"height":16.02},"width":45.44,"height":40.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-13.png","element":"img","alt":"¯Xk","inline":true,"padRight":true},{"text":"and the second inequality comes from ","element":"span"},{"href":"#id-128","text":"(37)","element":"a"},{"text":". On the other hand, under the event ","element":"span"},{"style":{"height":17.63},"width":140.16,"height":44.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-14.png","element":"img","alt":"¯G(K, δ′)","inline":true,"padRight":true},{"text":"by the law of total expectation, we have","element":"span"}],[{"id":"id-129","style":{"width":"95%"},"width":1517,"height":161,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-15.png","element":"img"}],[{"text":"where ","element":"span"},{"href":"#id-129","text":"(39) ","element":"a"},{"text":"comes from the fact that ","element":"span"},{"style":{"height":24.06},"width":313.52,"height":60.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-16.png","element":"img","alt":" {¯ξ(m)k,h }h∈[H],m∈[M]","inline":true,"padRight":true},{"text":"is also a feasible solution of the above ","element":"span"},{"text":"optimization problem under the event ","element":"span"},{"style":{"height":17.63},"width":140.16,"height":44.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-17.png","element":"img","alt":"¯G(K, δ′)","inline":true},{"text":", i.e., 
","element":"span"},{"style":{"height":18.54},"width":301.92,"height":46.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-18.png","element":"img","alt":"¯V k1 (sk1) ≥ V k1(sk1)","inline":true},{"text":". Then, by combining the ","element":"span"},{"text":"results of ","element":"span"},{"href":"#id-129","text":"(39) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-130","text":"(38)","element":"a"},{"text":", under the event ","element":"span"},{"style":{"height":16},"width":295.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-19.png","element":"img","alt":" G(K, δ′), we have","inline":true}],[{"id":"id-131","style":{"width":"100%"},"width":1586,"height":523,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-20.png","element":"img"}],[{"text":"Note that since ","element":"span"},{"style":{"height":17.01},"width":20.76,"height":42.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-21.png","element":"img","alt":"¯ξ","inline":true,"padRight":true},{"text":"is an i.i.d. copy of ","element":"span"},{"style":{"height":14},"width":20,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-22.png","element":"img","alt":" ξ","inline":true},{"text":", ","element":"span"},{"style":{"height":18.43},"width":66.32,"height":46.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-23.png","element":"img","alt":"¯Vk,1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.6},"width":66.32,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-24.png","element":"img","alt":" Vk,1","inline":true,"padRight":true},{"text":"are independent, which means ","element":"span"},{"style":{"height":19.52},"width":267.8,"height":48.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-25.png","element":"img","alt":"{¨ζk | 
Fk−1}Kk=1","inline":true,"padRight":true},{"text":"is a martingale difference sequence with ","element":"span"},{"style":{"height":23.31},"width":215.72,"height":58.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-26.png","element":"img","alt":" |¨ζk| ≤ 2HΦ(−1)","inline":true},{"text":". Therefore, by applying the ","element":"span"},{"text":"Azuma-Hoeffding inequality under the event ","element":"span"},{"style":{"height":16},"width":145.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-27.png","element":"img","alt":" G(K, δ′)","inline":true},{"text":", with probability at least ","element":"span"},{"style":{"height":14},"width":249.52,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-28.png","element":"img","alt":" 1 − δ′, we have","inline":true}],[{"id":"id-138","style":{"width":"67%"},"width":1066,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-29.png","element":"img"}],[{"text":"On the other hand, by dividing the first term in ","element":"span"},{"href":"#id-131","text":"(40) ","element":"a"},{"text":"into two terms we have","element":"span"}],[{"style":{"width":"57%"},"width":912,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-30.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":13.18},"width":33.52,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-31.png","element":"img","alt":" I1","inline":true},{"text":", note that since it is related to the estimation error, under the event ","element":"span"},{"style":{"height":16},"width":145.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-32.png","element":"img","alt":" G(K, δ′)","inline":true,"padRight":true},{"text":"we can bound the sum of 
","element":"span"},{"style":{"height":13.2},"width":33.52,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-33.png","element":"img","alt":" I1","inline":true,"padRight":true},{"text":"for the total episode number using Lemma ","element":"span"},{"href":"#id-67","text":"10 ","element":"a"},{"text":"as follows:","element":"span"}],[{"id":"id-136","style":{"width":"89%"},"width":1425,"height":194,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/38-34.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":13.2},"width":33.52,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/39-0.png","element":"img","alt":" I2","inline":true},{"text":", since we have","element":"span"}],[{"id":"id-132","style":{"width":"91%"},"width":1443,"height":416,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/39-1.png","element":"img"}],[{"text":"where ","element":"span"},{"href":"#id-132","text":"(43) ","element":"a"},{"text":"comes from ","element":"span"},{"style":{"height":17.38},"width":422.84,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/39-2.png","element":"img","alt":" ak1 = argmaxa Qk1(sk1, a)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"href":"#id-132","text":"(44) ","element":"a"},{"text":"holds by the following definition of ","element":"span"},{"style":{"height":17.9},"width":175.52,"height":44.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/39-3.png","element":"img","alt":"ιkh(skh, akh):","inline":true}],[{"style":{"width":"88%"},"width":1410,"height":207,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/39-4.png","element":"img"}],[{"text":"Then by applying the same argument recursively for the whole horizon, we 
have","element":"span"}],[{"style":{"width":"31%"},"width":505,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/39-5.png","element":"img"}],[{"text":"where we denote","element":"span"}],[{"style":{"width":"70%"},"width":1110,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/39-6.png","element":"img"}],[{"text":"Note that","element":"span"}],[{"text":"event ","element":"span"},{"style":{"height":16},"width":145.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/39-7.png","element":"img","alt":" G(K, δ′)","inline":true,"padRight":true},{"text":"by applying the Azuma-Hoeffding inequality with probability at least ","element":"span"},{"style":{"height":14},"width":243.68,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/39-8.png","element":"img","alt":" 1 − δ′, we have","inline":true}],[{"id":"id-135","style":{"width":"67%"},"width":1073,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/39-9.png","element":"img"}],[{"text":"To bound ","element":"span"},{"style":{"height":20.4},"width":274.72,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/39-10.png","element":"img","alt":"�Hh=1 ιkh(skh, akh)","inline":true},{"text":", we divide the whole horizon index set into two groups as follows: ","element":"span"},{"style":{"height":13.78},"width":61.36,"height":34.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/39-11.png","element":"img","alt":"H+","inline":true}],[{"id":"id-134","style":{"width":"100%"},"width":1586,"height":380,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/39-12.png","element":"img"}],[{"text":"On the other hand, for ","element":"span"},{"style":{"height":14},"width":128.76,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/39-13.png","element":"img","alt":" j ∈ H−","inline":true},{"text":", under the event 
","element":"span"},{"style":{"height":16},"width":285.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/39-14.png","element":"img","alt":" G(K, δ′) we have","inline":true}],[{"id":"id-133","style":{"width":"99%"},"width":1571,"height":530,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/39-15.png","element":"img"}],[{"text":"where ","element":"span"},{"href":"#id-133","text":"(47) ","element":"a"},{"text":"holds by Lemma ","element":"span"},{"href":"#id-57","text":"4.","element":"a"}],[{"text":"By combining the result of ","element":"span"},{"href":"#id-134","text":"(46) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-133","text":"(48)","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"62%"},"width":992,"height":266,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/40-0.png","element":"img"}],[{"text":"Then summing ","element":"span"},{"style":{"height":13.2},"width":33.52,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/40-1.png","element":"img","alt":" I2","inline":true,"padRight":true},{"text":"over the total number of episodes, under the event ","element":"span"},{"style":{"height":16},"width":295.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/40-2.png","element":"img","alt":" G(K, δ′), we have","inline":true}],[{"id":"id-137","style":{"width":"94%"},"width":1502,"height":466,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/40-3.png","element":"img"}],[{"text":"where the last inequality holds due to Lemma ","element":"span"},{"href":"#id-109","text":"3 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-135","text":"(45)","element":"a"},{"text":".","element":"span"}],[{"text":"Finally, by summing ","element":"span"},{"href":"#id-131","text":"(40) ","element":"a"},{"text":"over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k 
","element":"span"},{"text":"and plugging in the results of ","element":"span"},{"href":"#id-136","text":"(42)","element":"a"},{"text":", ","element":"span"},{"href":"#id-137","text":"(50) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-138","text":"(41) ","element":"a"},{"text":"we have","element":"span"}],[{"style":{"width":"91%"},"width":1445,"height":460,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/40-4.png","element":"img"}],[{"text":"To conclude the proof, we set ","element":"span"},{"style":{"height":16},"width":162.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/40-5.png","element":"img","alt":" δ′ = δ/10","inline":true,"padRight":true},{"text":"and take a union bound over the two applications of the Azuma-Hoeffding inequality (","element":"span"},{"style":{"height":21.22},"width":110.28,"height":53.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/40-6.png","element":"img","alt":"¨ζk,...ζkh","inline":true},{"text":") and the event ","element":"span"},{"style":{"height":16},"width":145.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/40-7.png","element":"img","alt":" G(K, δ′)","inline":true},{"text":", which gives the desired result with probability at least ","element":"span"},{"style":{"height":16},"width":137.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/40-8.png","element":"img","alt":"1 − δ/2.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"C.7 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Regret Bound of ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"RRL-MNL","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-60","style":{"fontStyle":"italic"},"text":"1. 
","element":"a"},{"text":"We decompose the regret into an estimation part and a pessimism part as follows:","element":"span"}],[{"style":{"width":"60%"},"width":958,"height":259,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/40-9.png","element":"img"}],[{"text":"Both Lemma ","element":"span"},{"href":"#id-67","text":"10 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-68","text":"11 ","element":"a"},{"text":"hold with probability at least ","element":"span"},{"style":{"height":16},"width":127.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/40-10.png","element":"img","alt":" 1 − δ/2","inline":true,"padRight":true},{"text":"each, so by a union bound the following holds with probability at least ","element":"span"},{"style":{"height":12},"width":98.88,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/40-11.png","element":"img","alt":" 1 − δ:","inline":true}],[{"style":{"width":"93%"},"width":1489,"height":211,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/40-12.png","element":"img"}]]},{"heading":"D Detailed Regret Analysis for ORRL-MNL (Theorem 2)","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"D.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Concentration of Estimated Transition Core ","element":"span"},{"style":{"height":18.18},"width":42.68,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-0.png","element":"img","alt":"�θkh","inline":true}],[{"text":"In this section, we provide the detailed proof of Lemma ","element":"span"},{"href":"#id-70","text":"12, ","element":"a"},{"text":"which demonstrates the concentration result for ","element":"span"},{"style":{"height":18.3},"width":42.64,"height":45.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-1.png","element":"img","alt":"�θkh","inline":true,"padRight":true},{"text":"independently of 
","element":"span"},{"style":{"height":7.2},"width":23,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-2.png","element":"img","alt":" κ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":". Note that we adapt the proof provided by Zhang and ","element":"span"},{"text":"Sugiyama ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"[76] ","element":"a"},{"text":"in the MNL contextual bandit setting to MNL-MDPs and improve the result, making it independent of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":". We provide the lemmas for the concentration of the online transition core for completeness, noting that there are slight differences compared to their work, which stem from the different problem setting.","element":"span"}],[{"id":"id-70","style":{"fontWeight":"bold"},"text":"Lemma 12 ","element":"span"},{"text":"(Concentration of online estimated transition core)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":249.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-3.png","element":"img","alt":" η = O(log U)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":10.8},"width":77.28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-4.png","element":"img","alt":" λ =","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Then, for any ","element":"span"},{"style":{"height":16},"width":152,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-5.png","element":"img","alt":" δ ∈ (0, 1]","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and for any ","element":"span"},{"style":{"height":16},"width":279.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-6.png","element":"img","alt":" h ∈ [H], we have","inline":true}],[{"style":{"width":"74%"},"width":1180,"height":177,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-70","style":{"fontStyle":"italic"},"text":"12. ","element":"a"},{"text":"Recall that the transition core updated by online mirror descent is given by","element":"span"}],[{"style":{"width":"46%"},"width":741,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-8.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":34.14},"width":1172.6,"height":85.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-9.png","element":"img","alt":"�ℓk,h(θ) = ℓk,h(�θkh) + (θ − �θkh)⊤∇ℓk,h(�θkh) + 12��θ − �θkh��∇2ℓk,h(�θkh) .","inline":true,"padRight":true},{"text":"We introduce the ","element":"span"},{"text":"following lemma, which shows that the estimation error of the online estimator ","element":"span"},{"style":{"height":18.29},"width":42.68,"height":45.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-10.png","element":"img","alt":"�θkh","inline":true,"padRight":true},{"text":"can be bounded by ","element":"span"},{"text":"the regret.","element":"span"}],[{"id":"id-141","style":{"fontWeight":"bold"},"text":"Lemma 13 ","element":"span"},{"text":"(Lemma 12 in 
[","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"76","element":"a"},{"text":"])","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16.78},"width":435.48,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-11.png","element":"img","alt":" α = log U + 2(1 + LθLφ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":11.6},"width":96.76,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-12.png","element":"img","alt":" λ > 0","inline":true},{"style":{"fontStyle":"italic"},"text":". If we set the step size ","element":"span"},{"style":{"height":16},"width":139.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-13.png","element":"img","alt":"η = α/2","inline":true},{"style":{"fontStyle":"italic"},"text":", then we have","element":"span"}],[{"id":"id-139","style":{"width":"88%"},"width":1401,"height":258,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-14.png","element":"img"}],[{"text":"Now, we bound the first term of ","element":"span"},{"href":"#id-139","text":"(51)","element":"a"},{"text":". 
To simplify the presentation, for all ","element":"span"},{"style":{"height":16},"width":310.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-15.png","element":"img","alt":" (k, h) ∈ [K] × [H]","inline":true},{"text":", we define the softmax function ","element":"span"},{"style":{"height":18.98},"width":444.84,"height":47.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-16.png","element":"img","alt":" σk,h : R|Sk,h| → [0, 1]|Sk,h| ","inline":true,"padRight":true},{"text":"as follows:","element":"span"}],[{"style":{"width":"36%"},"width":586,"height":105,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-17.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":58.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-18.png","element":"img","alt":" [·]s′","inline":true,"padRight":true},{"text":"denotes the element corresponding to ","element":"span"},{"style":{"height":11.6},"width":117.6,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-19.png","element":"img","alt":" s′ ∈ S","inline":true,"padRight":true},{"text":"of the input vector. 
We also define the pseudo-inverse of the softmax function ","element":"span"},{"style":{"height":11.98},"width":74.88,"height":29.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-20.png","element":"img","alt":" σk,h","inline":true,"padRight":true},{"text":"via ","element":"span"},{"style":{"height":21.22},"width":399.68,"height":53.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-21.png","element":"img","alt":" [σ+k,h(p)]s′ = log([p]s′)","inline":true,"padRight":true},{"text":"which has the property ","element":"span"},{"text":"that for all ","element":"span"},{"style":{"height":28.8},"width":1267.88,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-22.png","element":"img","alt":" p ∈ ∆|Sk,h|, we have σk,h(σ+k,h(p)) = p and �s∈Sk,h exp�[σ+k,h(p)]s�= 1.","inline":true}],[{"text":"We denote ","element":"span"},{"style":{"height":20.46},"width":640.76,"height":51.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-23.png","element":"img","alt":" Φk,h = [φk,h,s′]s′∈Sk,h ∈ Rd×|Sk,h|","inline":true,"padRight":true},{"text":"for simplicity. ","element":"span"},{"text":"Then, the transition model can also be written as ","element":"span"},{"style":{"height":19.73},"width":656.28,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-24.png","element":"img","alt":" Pθ(s′ | skh, akh) = [σk,h(Φ⊤k,hθ∗h)]s′","inline":true},{"text":". ","element":"span"},{"text":"We further define ","element":"span"},{"style":{"height":12},"width":125.8,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-25.png","element":"img","alt":" �zi,h =","inline":true},{"style":{"height":29.68},"width":603.88,"height":74.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-26.png","element":"img","alt":"σ+i,h�Eθ∼N(�θih,cB−1i,h)[σi,h(Φ⊤i,hθ)]�","inline":true},{"text":"for our analysis. 
Then, we have","element":"span"}],[{"id":"id-140","style":{"width":"98%"},"width":1559,"height":162,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/41-27.png","element":"img"}],[{"id":"id-142","text":"We can bound the first term of ","element":"span"},{"href":"#id-140","text":"(52) ","element":"a"},{"text":"by the following lemma.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 14. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":152,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/42-0.png","element":"img","alt":" δ ∈ (0, 1]","inline":true},{"style":{"fontStyle":"italic"},"text":". Then, for all ","element":"span"},{"style":{"height":16},"width":306.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/42-1.png","element":"img","alt":" (k, h) ∈ [K] × [H]","inline":true},{"style":{"fontStyle":"italic"},"text":", with probability at least ","element":"span"},{"style":{"height":14},"width":231.96,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/42-2.png","element":"img","alt":" 1 − δ, we have","inline":true}],[{"style":{"width":"87%"},"width":1382,"height":235,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/42-3.png","element":"img"}],[{"id":"id-143","text":"Furthermore, we can bound the second term of ","element":"span"},{"href":"#id-140","text":"(52) ","element":"a"},{"text":"by the following lemma.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 15. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":19.73},"width":207.08,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/42-4.png","element":"img","alt":" λ ≥ 72L2φcd","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then, for any ","element":"span"},{"style":{"height":16},"width":680.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/42-5.png","element":"img","alt":" c > 0 and all (k, h) ∈ [K] × [H], we have","inline":true}],[{"style":{"width":"84%"},"width":1343,"height":238,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/42-6.png","element":"img"}],[{"text":"Combining Lemma ","element":"span"},{"href":"#id-141","text":"13, ","element":"a"},{"text":"Lemma ","element":"span"},{"href":"#id-142","text":"14, ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-143","text":"15, ","element":"a"},{"text":"and by setting ","element":"span"},{"style":{"height":16},"width":357.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/42-7.png","element":"img","alt":" η = α/2, c = 2α/3","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.34},"width":518.24,"height":50.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/42-8.png","element":"img","alt":"λ ≥ max{12√2L3φα, 48L2φdα}","inline":true},{"text":", we derive that","element":"span"}],[{"style":{"width":"97%"},"width":1552,"height":472,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/42-9.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C > ","element":"span"},{"text":"0 ","element":"span"},{"text":"is an absolute constant. In the above, we choose ","element":"span"},{"style":{"height":16},"width":588.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/42-10.png","element":"img","alt":" λ = O(d log U), α = O(log U). 
The","inline":true,"padRight":true},{"text":"second inequality of ","element":"span"},{"href":"#id-143","text":"(53) ","element":"a"},{"text":"is derived from the fact that","element":"span"}],[{"style":{"width":"63%"},"width":1010,"height":453,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/42-11.png","element":"img"}],[{"text":"The first inequality holds from ","element":"span"},{"style":{"height":15.58},"width":184.72,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/42-12.png","element":"img","alt":" Bi,h ⪰ λId","inline":true},{"text":", and the second inequality is obvious from our setting of ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/42-13.png","element":"img","alt":"λ","inline":true},{"text":". Therefore, we can conclude that","element":"span"}],[{"style":{"width":"75%"},"width":1198,"height":153,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/42-14.png","element":"img"}],[{"text":"In the following section, we provide the proofs of the lemmas used in Lemma ","element":"span"},{"href":"#id-70","text":"12.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"D.1.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-141","style":{"fontWeight":"bold"},"text":"13","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-141","style":{"fontStyle":"italic"},"text":"13. 
","element":"a"},{"text":"Let ","element":"span"},{"style":{"height":37.01},"width":1185.72,"height":92.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/43-0.png","element":"img","alt":"�ℓi,h(θ) = ℓi,h(�θih) + ∇ℓi,h(�θih)⊤ �θ − �θih�+ 12��θ − �θih��2∇2ℓi,h(�θih) be a","inline":true}],[{"text":"second-order Taylor expansion of ","element":"span"},{"style":{"height":18.75},"width":204,"height":46.88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/43-1.png","element":"img","alt":" ℓi,h(θ) at �θih","inline":true},{"text":". Since we have","element":"span"}],[{"style":{"width":"97%"},"width":1553,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/43-2.png","element":"img"}],[{"text":"by Lemma ","element":"span"},{"href":"#id-144","text":"31, ","element":"a"},{"text":"if we define ","element":"span"},{"style":{"height":21.58},"width":455.96,"height":53.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/43-3.png","element":"img","alt":" ψ(θ) = 12∥θ∥2Bi,h we obtain","inline":true}],[{"id":"id-146","style":{"width":"97%"},"width":1552,"height":138,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/43-4.png","element":"img"}],[{"text":"By applying Lemma ","element":"span"},{"href":"#id-145","text":"33, ","element":"a"},{"text":"we have","element":"span"}],[{"style":{"width":"100%"},"width":1586,"height":157,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/43-5.png","element":"img"}],[{"text":"By setting ","element":"span"},{"style":{"height":16.78},"width":181.16,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/43-6.png","element":"img","alt":" η = αi,h/2","inline":true,"padRight":true},{"text":"and merging equations ","element":"span"},{"href":"#id-146","text":"(54) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-147","text":"(55)","element":"a"},{"text":", we arrive 
at","element":"span"}],[{"style":{"width":"97%"},"width":1541,"height":227,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/43-7.png","element":"img"}],[{"text":"Meanwhile, we obtain","element":"span"}],[{"id":"id-147","style":{"width":"73%"},"width":1168,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/43-8.png","element":"img"}],[{"text":"by taking the gradient over both sides of the Taylor approximation of ","element":"span"},{"href":"#id-147","style":{"height":16.78},"width":303.44,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/43-9.png","element":"img","alt":" ℓi,h(θ). Using (57)","inline":true},{"text":", we proceed to bound the first term of ","element":"span"},{"href":"#id-147","text":"(56) ","element":"a"},{"text":"as follows:","element":"span"}],[{"style":{"width":"72%"},"width":1153,"height":559,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/43-10.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":20.86},"width":75.4,"height":52.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/43-11.png","element":"img","alt":"¯θi+1h","inline":true,"padRight":true},{"text":"is a convex combination of ","element":"span"},{"style":{"height":17.9},"width":42.64,"height":44.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/43-12.png","element":"img","alt":"�θih","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.22},"width":75.4,"height":48.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/43-13.png","element":"img","alt":" �θi+1h","inline":true,"padRight":true},{"text":". 
The second equality arises from the Taylor expansion, the first inequality is due to the self-concordant property, and the final inequality is justified by the following:","element":"span"}],[{"style":{"width":"71%"},"width":1130,"height":474,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/43-14.png","element":"img"}],[{"text":"By summing over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"and reorganizing the terms, we arrive at the final result as follows: ","element":"span"},{"style":{"height":36.43},"width":322,"height":91.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-0.png","element":"img","alt":"��θk+1h − θ∗h��2Bk+1,h","inline":true}],[{"style":{"width":"98%"},"width":1563,"height":397,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-1.png","element":"img"}],[{"text":"where the first inequality holds by Assumption ","element":"span"},{"href":"#id-41","text":"2 ","element":"a"},{"text":"and the last inequality holds since ","element":"span"},{"style":{"height":14},"width":212.52,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-2.png","element":"img","alt":" α = log U +","inline":true},{"style":{"height":16.78},"width":572.6,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-3.png","element":"img","alt":"2(1 + LφLθ) ≥ αi,h for all i ∈ [k].","inline":true}],[{"id":"id-151","style":{"fontWeight":"bold"},"text":"D.1.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-142","style":{"fontWeight":"bold"},"text":"14","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-142","style":{"fontStyle":"italic"},"text":"14. 
","element":"a"},{"text":"The norm of ","element":"span"},{"style":{"height":29.68},"width":743,"height":74.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-4.png","element":"img","alt":" �zi,h = σ+i,h�Eθ∼N(�θih,cB−1i,h)[σi,h(Φ⊤i,hθ)]�","inline":true},{"text":"is generally unbounded ","element":"span"},{"href":"#id-148","referenceIndex":27,"text":"[27]","element":"a"},{"text":". In this proof, we utilize the smoothed version of ","element":"span"},{"style":{"height":11.98},"width":60.96,"height":29.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-5.png","element":"img","alt":" �zi,h","inline":true},{"text":", defined as follows:","element":"span"}],[{"style":{"width":"56%"},"width":889,"height":75,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-6.png","element":"img"}],[{"text":"where the smooth function ","element":"span"},{"style":{"height":20.51},"width":1146.56,"height":51.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-7.png","element":"img","alt":" smoothui,h(p) = (1 − u)p + (u/U)1 with u ∈ [0, 1/2], and 1 ∈ R|Si,h|","inline":true,"padRight":true},{"text":"is an all-one vector.","element":"span"}],[{"text":"Exploiting the property of ","element":"span"},{"style":{"height":21.22},"width":844.48,"height":53.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-8.png","element":"img","alt":" σ+i,h such that σi,h(σ+i,h(p)) = p for any p ∈ ∆|Si,h|","inline":true},{"text":", it is straightforward ","element":"span"},{"text":"to show that ","element":"span"},{"style":{"height":21.23},"width":580.32,"height":53.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-9.png","element":"img","alt":" �zui,h = σ+i,h(smoothui,h(σi,h(�zi,h)))","inline":true},{"text":". 
Then, by Lemma ","element":"span"},{"href":"#id-149","text":"34, ","element":"a"},{"text":"we have","element":"span"}],[{"style":{"width":"85%"},"width":1360,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-10.png","element":"img"}],[{"text":"Given the definition of ","element":"span"},{"style":{"height":10},"width":56.32,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-11.png","element":"img","alt":" ℓi,h","inline":true},{"text":", we know that ","element":"span"},{"style":{"height":19.9},"width":361.36,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-12.png","element":"img","alt":" ℓ(z∗i,h, yih) = ℓi,h(θ∗h)","inline":true},{"text":", where ","element":"span"},{"style":{"height":19.25},"width":239.24,"height":48.12,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-13.png","element":"img","alt":" z∗i,h = Φ⊤i,hθ∗h","inline":true},{"text":". We can ","element":"span"},{"text":"bound the gap between the losses of ","element":"span"},{"style":{"height":19.25},"width":181.88,"height":48.12,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-14.png","element":"img","alt":" θ∗h and �zui,h ","inline":true,"padRight":true},{"text":"as follows:","element":"span"}],[{"style":{"width":"85%"},"width":1358,"height":538,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-15.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.78},"width":533,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-16.png","element":"img","alt":" Mi,h = log(|Si,h|) + 2 log(U/u)","inline":true},{"text":", and the second equality holds by a direct calculation of the gradient and Hessian of the logistic loss.","element":"span"}],[{"text":"We first bound the first term on the right-hand side. 
Let ","element":"span"},{"style":{"height":18.91},"width":573.84,"height":47.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-17.png","element":"img","alt":" di,h = (z∗i,h − �zui,h)/(M + LφLθ)","inline":true},{"text":", ","element":"span"},{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"= log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U ","element":"span"},{"text":"+ 2 log(","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"style":{"fontStyle":"italic"},"text":"/u","element":"span"},{"text":")","element":"span"},{"text":". Then, one can check that ","element":"span"},{"style":{"height":16.78},"width":235.88,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-18.png","element":"img","alt":" ∥di,h∥∞ ≤ 1","inline":true,"padRight":true},{"text":"since ","element":"span"},{"style":{"height":18.91},"width":188.56,"height":47.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-19.png","element":"img","alt":" ∥z∗i,h∥∞ ≤","inline":true},{"style":{"height":18.62},"width":618.24,"height":46.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-20.png","element":"img","alt":"maxs′∈Si,h ∥φi,h,s′∥2∥θ∗h∥2 ≤ LφLθ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.93},"width":356.32,"height":47.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-21.png","element":"img","alt":" ∥�zui,h∥∞ ≤ log(U/u)","inline":true},{"text":". 
Moreover, since ","element":"span"},{"style":{"height":17.9},"width":60.08,"height":44.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-22.png","element":"img","alt":" z∗i,h","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.5},"width":60.96,"height":43.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-23.png","element":"img","alt":" �zui,h","inline":true,"padRight":true},{"text":"are independent of ","element":"span"},{"style":{"height":17.78},"width":129.96,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-24.png","element":"img","alt":" yih, di,h","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":15.98},"width":68.36,"height":39.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-25.png","element":"img","alt":" Fi,h","inline":true},{"text":"-measurable. Since ","element":"span"},{"style":{"height":19.9},"width":669.52,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-26.png","element":"img","alt":" E[(σi,h(z∗i,h) − yih)(σi,h(z∗i,h) − yih)⊤ |","inline":true},{"style":{"height":19.9},"width":781.08,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-27.png","element":"img","alt":"Fi,h] = ∇σi,h(z∗i,h) and ∥σi,h(z∗i,h)−yih∥1 ≤ 2","inline":true},{"text":", we can apply Lemma ","element":"span"},{"href":"#id-150","text":"32. 
","element":"a"},{"text":"For any ","element":"span"},{"style":{"height":16},"width":258.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/44-28.png","element":"img","alt":" k and δ ∈ (0, 1],","inline":true}],[{"text":"with probability at least ","element":"span"},{"style":{"height":16},"width":294.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/45-0.png","element":"img","alt":" 1 − δ/H, we have","inline":true}],[{"style":{"width":"88%"},"width":1399,"height":766,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/45-1.png","element":"img"}],[{"text":"where the second inequality holds since ","element":"span"},{"style":{"height":23.97},"width":739.24,"height":59.92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/45-2.png","element":"img","alt":" ∥di,h∥2∇σi,h(z∗i,h) = d⊤i,h∇σi,h(z∗i,h)di,h ≤ 2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.2},"width":100.28,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/45-3.png","element":"img","alt":" λ ≥ 1","inline":true},{"text":". 
","element":"span"},{"text":"Plugging ","element":"span"},{"href":"#id-151","text":"(60) ","element":"a"},{"text":"into ","element":"span"},{"href":"#id-151","text":"(59) ","element":"a"},{"text":"and rearranging the term, we get","element":"span"}],[{"style":{"width":"98%"},"width":1558,"height":1119,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/45-4.png","element":"img"}],[{"text":"Finally, combining ","element":"span"},{"href":"#id-151","text":"(58) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-151","text":"(61)","element":"a"},{"text":", by setting ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/k","element":"span"},{"text":", we derive that","element":"span"}],[{"style":{"width":"77%"},"width":1226,"height":354,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/45-5.png","element":"img"}],[{"text":"where the last inequality holds by the definition of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"= log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U ","element":"span"},{"text":"+2 log(","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"style":{"fontStyle":"italic"},"text":"/u","element":"span"},{"text":")","element":"span"},{"text":". 
Taking a union bound over ","element":"span"},{"style":{"height":16},"width":130.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/45-6.png","element":"img","alt":" h ∈ [H]","inline":true},{"text":", we conclude the proof.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.1.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-143","style":{"fontWeight":"bold"},"text":"15","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-143","style":{"fontStyle":"italic"},"text":"15. ","element":"a"},{"text":"We start from Proposition 2 of Foster et al. ","element":"span"},{"href":"#id-148","referenceIndex":27,"text":"[27]","element":"a"},{"text":", which states that ","element":"span"},{"style":{"height":11.98},"width":61,"height":29.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/46-0.png","element":"img","alt":" �zi,h","inline":true,"padRight":true},{"text":"is the mixed prediction and satisfies the following property:","element":"span"}],[{"id":"id-153","style":{"width":"100%"},"width":1586,"height":271,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/46-1.png","element":"img"}],[{"text":"Consider the quadratic approximation","element":"span"}],[{"style":{"width":"75%"},"width":1203,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/46-2.png","element":"img"}],[{"text":"Using the property that ","element":"span"},{"style":{"height":10},"width":56.32,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/46-3.png","element":"img","alt":" ℓi,h","inline":true,"padRight":true},{"text":"is a ","element":"span"},{"style":{"height":18.77},"width":124.16,"height":46.92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/46-4.png","element":"img","alt":" 
3√2Lφ","inline":true},{"text":"-self-concordant-like function as asserted by Proposition B.1 in ","element":"span"},{"href":"#id-40","referenceIndex":50,"text":"[50]","element":"a"},{"text":", and applying Lemma ","element":"span"},{"href":"#id-152","text":"35, ","element":"a"},{"text":"we obtain","element":"span"}],[{"style":{"width":"74%"},"width":1177,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/46-5.png","element":"img"}],[{"text":"Also, we have","element":"span"}],[{"style":{"width":"93%"},"width":1490,"height":384,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/46-6.png","element":"img"}],[{"text":"where we define the function ","element":"span"},{"style":{"height":16.78},"width":366.04,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/46-7.png","element":"img","alt":"�fi,h : B(0d, 1) → R as","inline":true}],[{"style":{"width":"99%"},"width":1571,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/46-8.png","element":"img"}],[{"text":"We denote ","element":"span"},{"style":{"height":18.51},"width":561,"height":46.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/46-9.png","element":"img","alt":"�Zi+1,h =�Rd �fi+1,h(θ) dθ ≤ +∞","inline":true,"padRight":true},{"text":"and define ","element":"span"},{"style":{"height":15.58},"width":111.08,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/46-10.png","element":"img","alt":" �Θi+1,h","inline":true,"padRight":true},{"text":"as the distribution whose density ","element":"span"},{"text":"function is ","element":"span"},{"style":{"height":16.78},"width":283.12,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/46-11.png","element":"img","alt":"�fi+1,h(θ)/ �Zi+1,h","inline":true},{"text":". 
Then, we can rewrite ","element":"span"},{"href":"#id-153","text":"(63) ","element":"a"},{"text":"as follows:","element":"span"}],[{"style":{"width":"90%"},"width":1435,"height":555,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/46-12.png","element":"img"}],[{"text":"where the second inequality is by Jensen’s inequality and the last inequality holds because ","element":"span"},{"style":{"height":15.58},"width":149.72,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/46-13.png","element":"img","alt":"�Θi+1,h is","inline":true,"padRight":true},{"text":"symmetric around ","element":"span"},{"style":{"height":23.1},"width":396.4,"height":57.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/46-14.png","element":"img","alt":"�θi+1h and thus Eθ∼�Θi+1,h","inline":true,"padRight":true},{"text":"��","element":"span"},{"style":{"height":28.8},"width":503.6,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/46-15.png","element":"img","alt":"∇Li,h(�θi+1h ), θ − �θi+1h ��= 0.","inline":true}],[{"text":"Combining ","element":"span"},{"href":"#id-153","text":"(62) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-153","text":"(64)","element":"a"},{"text":", we get","element":"span"}],[{"id":"id-154","style":{"width":"73%"},"width":1164,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/46-16.png","element":"img"}],[{"text":"Moreover, we have","element":"span"}],[{"style":{"width":"100%"},"width":1586,"height":742,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/47-0.png","element":"img"}],[{"text":"inequality holds because ","element":"span"},{"style":{"height":16},"width":252.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/47-1.png","element":"img","alt":"�Zi+1,h and Zi,h","inline":true,"padRight":true},{"text":"are identical normalizing factors. 
Combining ","element":"span"},{"href":"#id-154","text":"(65) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-154","text":"(66) ","element":"a"},{"text":"and summing over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":" yields","element":"span"}],[{"style":{"width":"92%"},"width":1468,"height":258,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/47-2.png","element":"img"}],[{"text":"We can further bound the second term on the right-hand side of ","element":"span"},{"href":"#id-154","text":"(66)","element":"a"},{"text":". By the Cauchy-Schwarz inequality, we get","element":"span"}],[{"style":{"width":"87%"},"width":1394,"height":293,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/47-3.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":28.8},"width":613.96,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/47-4.png","element":"img","alt":"�Θi+1,h = N��θi+1h , cB−1i,h�, θ − �θi+1h","inline":true,"padRight":true},{"text":"follows the same distribution as","element":"span"}],[{"id":"id-155","style":{"width":"80%"},"width":1278,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/47-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":28.8},"width":169.84,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/47-6.png","element":"img","alt":" λj�B−1i,h�","inline":true},{"text":"denotes the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-th largest eigenvalue of ","element":"span"},{"style":{"height":21.62},"width":73.52,"height":54.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/47-7.png","element":"img","alt":" B−1i,h","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":16},"width":206.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/47-8.png","element":"img","alt":" {e1, . . . , ed}","inline":true,"padRight":true},{"text":"are orthogonal basis","element":"span"}],[{"text":"of ","element":"span"},{"style":{"height":13.38},"width":45.8,"height":33.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/47-9.png","element":"img","alt":" Rd","inline":true},{"text":". Furthermore, since we know that ","element":"span"},{"style":{"height":21.63},"width":228.92,"height":54.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/47-10.png","element":"img","alt":" B−1i,h ≤ λ−1Id","inline":true},{"text":", we can bound the term ","element":"span"},{"text":"(I) ","element":"span"},{"text":"by","element":"span"}],[{"style":{"width":"79%"},"width":1263,"height":249,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/47-11.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.59},"width":40.96,"height":41.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/47-12.png","element":"img","alt":" χ2","inline":true,"padRight":true},{"text":"is the chi-square distribution and the last inequality holds due to Jensen’s inequality. 
By choosing ","element":"span"},{"style":{"height":19.73},"width":207.08,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/47-13.png","element":"img","alt":" λ ≥ 72L2φcd","inline":true},{"text":", we obtain","element":"span"}],[{"id":"id-157","style":{"width":"67%"},"width":1073,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/47-14.png","element":"img"}],[{"text":"where the last inequality holds because the moment-generating function of the ","element":"span"},{"style":{"height":16.58},"width":40.92,"height":41.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-0.png","element":"img","alt":" χ2","inline":true},{"text":"-distribution is bounded by ","element":"span"},{"style":{"height":18.54},"width":767.48,"height":46.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-1.png","element":"img","alt":" EW ∼χ2[exp(tW)] ≤ 1/√1 − 2t for all t ≤ 1/2","inline":true},{"text":". Next, we bound the term ","element":"span"},{"text":"(II)","element":"span"},{"text":".","element":"span"}],[{"style":{"width":"87%"},"width":1390,"height":210,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":33.09},"width":897.76,"height":82.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-3.png","element":"img","alt":"¯Bi,h =�∇2ℓi,h(�θi+1h )�−1/2Bi,h�∇2ℓi,h(�θi+1h )�−1/2","inline":true},{"text":". Let ","element":"span"},{"style":{"height":28.8},"width":296.72,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-4.png","element":"img","alt":"¯λj := λj�c ¯B−1i,h�","inline":true},{"text":"be the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-th largest eigenvalue of the matrix. 
Then, an analysis similar to ","element":"span"},{"href":"#id-155","text":"(67) ","element":"a"},{"text":"gives","element":"span"}],[{"style":{"width":"89%"},"width":1425,"height":354,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-5.png","element":"img"}],[{"text":"where the last inequality holds due to ","element":"span"},{"style":{"height":20.8},"width":462.04,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-6.png","element":"img","alt":" EXj,Xj′∼N (0,1)[X2j X2j′] ≤ 3","inline":true,"padRight":true},{"text":"with the bound attained when ","element":"span"},{"style":{"height":13.6},"width":104.52,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-7.png","element":"img","alt":"j = j′ ","inline":true,"padRight":true},{"text":"and the last equality follows from the fact that","element":"span"},{"style":{"height":32.3},"width":442.68,"height":80.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-8.png","element":"img","alt":"��dj=1 ¯λj�2= tr�c ¯B−1i,h�","inline":true},{"text":". Here,","element":"span"}],[{"text":"tr(","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":") ","element":"span"},{"text":"denotes the trace of the matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":".","element":"span"}],[{"text":"We define the matrix ","element":"span"},{"style":{"height":20},"width":702.6,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-9.png","element":"img","alt":" Ri+1,h := λId/2 + �iτ=1 ∇2ℓτ,h(θτ+1,h)","inline":true},{"text":". 
Under the condition ","element":"span"},{"style":{"height":19.73},"width":153.16,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-10.png","element":"img","alt":" λ ≥ 2L2φ","inline":true},{"text":", we ","element":"span"},{"text":"have ","element":"span"},{"style":{"height":20.22},"width":503.16,"height":50.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-11.png","element":"img","alt":" ∇2ℓi,h(θi+1,h) ⪯ L2φId ≤ λ2 Id","inline":true},{"text":". Then, we have ","element":"span"},{"style":{"height":15.6},"width":241.56,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-12.png","element":"img","alt":" Bi,h ⪰ Ri+1,h","inline":true},{"text":". Therefore, we can bound the ","element":"span"},{"text":"trace by","element":"span"}],[{"style":{"width":"70%"},"width":1114,"height":185,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-13.png","element":"img"}],[{"text":"where the last inequality holds due to Lemma 4.7 of Hazan et al. ","element":"span"},{"href":"#id-156","referenceIndex":32,"text":"[32]","element":"a"},{"text":". 
Therefore we can bound the term ","element":"span"},{"text":"(II) ","element":"span"},{"text":"as","element":"span"}],[{"id":"id-158","style":{"width":"64%"},"width":1026,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-14.png","element":"img"}],[{"text":"Combining ","element":"span"},{"href":"#id-157","text":"(68) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-158","text":"(69)","element":"a"},{"text":", we get","element":"span"}],[{"id":"id-159","style":{"width":"95%"},"width":1517,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-15.png","element":"img"}],[{"text":"Plugging ","element":"span"},{"href":"#id-154","text":"(66) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-159","text":"(70) ","element":"a"},{"text":"into ","element":"span"},{"href":"#id-154","text":"(65)","element":"a"},{"text":", and taking summation over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", we derive that","element":"span"}],[{"style":{"width":"88%"},"width":1410,"height":400,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-16.png","element":"img"}],[{"text":"where the last inequality holds because ","element":"span"},{"style":{"height":25.36},"width":938.68,"height":63.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-17.png","element":"img","alt":"�ki=1 log det(Ri+1,h)det(Ri,h) = log(det(Rk+1,h)/ det(λ/2Id)) ≤","inline":true}],[{"style":{"height":30.4},"width":288.72,"height":76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/48-18.png","element":"img","alt":"d log�1 +2kL2φdλ �","inline":true},{"text":". 
By rearranging the terms, we conclude the proof.","element":"span"}],[{"id":"id-72","style":{"fontWeight":"bold"},"text":"D.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Bound on Prediction Error","element":"span"}],[{"text":"In this section, we present the bound on the prediction error of parameters updated by ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL","element":"span"},{"text":". First, we compare the problem setting of MNL contextual bandits with ours and introduce the challenges of applying their analysis to our setting.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"MNL dynamic assortment optimization (single-parameter & uniform reward) [","element":"span"},{"href":"#id-38","referenceIndex":61,"style":{"fontWeight":"bold"},"text":"61","element":"a"},{"style":{"fontWeight":"bold"},"text":"] ","element":"span"},{"text":"Perivier and Goyal ","element":"span"},{"href":"#id-38","referenceIndex":61,"text":"[61] ","element":"a"},{"text":"consider an assortment selection problem where the user choice is given by an MNL choice model with a single parameter. At each time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", the agent observes context features ","element":"span"},{"style":{"height":18.18},"width":260.84,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-0.png","element":"img","alt":"{xt,i}Mi=1 ⊂ Rd","inline":true},{"text":". Then the agent decides on the set ","element":"span"},{"style":{"height":16},"width":166.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-1.png","element":"img","alt":" St ⊂ [M]","inline":true,"padRight":true},{"text":"to offer to a user, with ","element":"span"},{"style":{"height":16},"width":159.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-2.png","element":"img","alt":" |St| ≤ N","inline":true},{"text":". 
","element":"span"},{"text":"Without loss of generality, we may assume ","element":"span"},{"style":{"height":16},"width":151,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-3.png","element":"img","alt":" |St| = N","inline":true},{"text":". Then the user purchases a single product ","element":"span"},{"style":{"height":16},"width":209.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-4.png","element":"img","alt":"j ∈ St ∪ {0}","inline":true,"padRight":true},{"text":"and the probability that product ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"is purchased by the user follows the MNL model parametrized by an unknown fixed parameter ","element":"span"},{"style":{"height":15.79},"width":148.04,"height":39.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-5.png","element":"img","alt":" θ∗ ∈ Rd,","inline":true}],[{"style":{"width":"51%"},"width":821,"height":149,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-6.png","element":"img"}],[{"text":"Then the difference between the revenue induced by ","element":"span"},{"style":{"height":12.34},"width":39.68,"height":30.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-7.png","element":"img","alt":" θ∗","inline":true,"padRight":true},{"text":"and that by an estimator ","element":"span"},{"style":{"height":10.8},"width":22,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-8.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"in Perivier and Goyal ","element":"span"},{"href":"#id-38","referenceIndex":61,"text":"[61] ","element":"a"},{"text":"is expressed as follows:","element":"span"}],[{"id":"id-160","style":{"width":"68%"},"width":1078,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-9.png","element":"img"}],[{"text":"If we define 
","element":"span"},{"style":{"height":16.59},"width":214.72,"height":41.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-10.png","element":"img","alt":" Q : RN → R","inline":true},{"text":", such that for all ","element":"span"},{"style":{"height":28.18},"width":911.56,"height":70.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-11.png","element":"img","alt":" u = (u1, . . . , uN) ∈ RN, Q(u) := ∑Ni=1 exp(ui)1+∑Nj=1 exp(uj)","inline":true,"padRight":true},{"text":"and let ","element":"span"},{"style":{"height":18.88},"width":460.88,"height":47.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-12.png","element":"img","alt":" v∗ = (x⊤t,i1θ∗, . . . , x⊤t,iN θ∗)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.54},"width":405.92,"height":46.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-13.png","element":"img","alt":" v = (x⊤t,i1θ, . . . , x⊤t,iN θ)","inline":true},{"text":", then Eq. ","element":"span"},{"href":"#id-160","text":"(71) ","element":"a"},{"text":"can be expressed ","element":"span"},{"text":"as follows:","element":"span"}],[{"id":"id-161","style":{"width":"98%"},"width":1562,"height":192,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-14.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":9.82},"width":24,"height":24.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-15.png","element":"img","alt":" ¯v","inline":true,"padRight":true},{"text":"is a convex combination of ","element":"span"},{"style":{"height":11.6},"width":144.6,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-16.png","element":"img","alt":" v∗ and v","inline":true},{"text":". For the first term in Eq. 
","element":"span"},{"href":"#id-161","text":"(72)","element":"a"},{"text":", we have","element":"span"}],[{"id":"id-162","style":{"width":"93%"},"width":1478,"height":690,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-17.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":105.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-18.png","element":"img","alt":" Ht(θ)","inline":true,"padRight":true},{"text":"is the Gram matrix used in ","element":"span"},{"href":"#id-38","referenceIndex":61,"text":"[61] ","element":"a"},{"text":"defined by","element":"span"}],[{"style":{"width":"88%"},"width":1397,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-19.png","element":"img"}],[{"text":"Note that the term ","element":"span"},{"style":{"height":18.02},"width":270.52,"height":45.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/49-20.png","element":"img","alt":" ∥θ∗ − θ∥Ht(θ∗)","inline":true,"padRight":true},{"text":"can be bounded by the concentration result of the estimated parameter. ","element":"span"},{"text":"On the other hand, to apply the elliptical potential lemma to the term","element":"span"}],[{"id":"id-163","style":{"width":"99%"},"width":1583,"height":657,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-0.png","element":"img"}],[{"text":"Now, since the coefficient ","element":"span"},{"style":{"height":17.12},"width":372.04,"height":42.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-1.png","element":"img","alt":" qt,j(St, θ∗)qt,0(St, θ∗)","inline":true,"padRight":true},{"text":"of ","element":"span"},{"style":{"height":20.22},"width":188.76,"height":50.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-2.png","element":"img","alt":" ∥x∥H−1t (θ∗)","inline":true,"padRight":true},{"text":"in Eq. 
","element":"span"},{"href":"#id-162","text":"(73) ","element":"a"},{"text":"aligns with the ","element":"span"},{"text":"coefficients of the lower bound of ","element":"span"},{"href":"#id-163","style":{"height":16.34},"width":303.92,"height":40.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-3.png","element":"img","alt":" Ht(θ∗) in Eq. (74)","inline":true},{"text":", the elliptical potential lemma can be applied. Note that such a lower bound in Eq. ","element":"span"},{"href":"#id-163","text":"(74) ","element":"a"},{"text":"holds since Perivier and Goyal ","element":"span"},{"href":"#id-38","referenceIndex":61,"text":"[61] ","element":"a"},{"text":"deal with the uniform reward, i.e., ","element":"span"},{"style":{"height":18.7},"width":616.04,"height":46.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-4.png","element":"img","alt":" 1 − ∑i∈St qt,i(St, θ∗) = qt,0(St, θ∗).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Multinomial logistic bandit problem [","element":"span"},{"href":"#id-47","referenceIndex":76,"style":{"fontWeight":"bold"},"text":"76","element":"a"},{"style":{"fontWeight":"bold"},"text":"] ","element":"span"},{"text":"Zhang and Sugiyama ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"[76] ","element":"a"},{"text":"address the multiple-parameter MNL contextual bandit problem where at each time step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"the agent selects an action ","element":"span"},{"style":{"height":15.78},"width":143.84,"height":39.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-5.png","element":"img","alt":"xt ∈ Rd","inline":true,"padRight":true},{"text":"and receives response feedback ","element":"span"},{"style":{"height":16},"width":260.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-6.png","element":"img","alt":" yt ∈ {0} ∪ 
[N]","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"+ 1 ","element":"span"},{"text":"possible outcomes. Each outcome ","element":"span"},{"style":{"height":16},"width":124.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-7.png","element":"img","alt":" i ∈ [N]","inline":true,"padRight":true},{"text":"is associated with a ground-truth parameter ","element":"span"},{"style":{"height":17.33},"width":140.52,"height":43.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-8.png","element":"img","alt":" θ∗i ∈ Rd","inline":true},{"text":", and the probability of the ","element":"span"},{"text":"outcome ","element":"span"},{"style":{"height":16},"width":227.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-9.png","element":"img","alt":" P(yt = i | xt)","inline":true,"padRight":true},{"text":"follows the MNL model,","element":"span"}],[{"style":{"width":"87%"},"width":1380,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-10.png","element":"img"}],[{"text":"In this model, there are ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"unknown choice parameters ","element":"span"},{"style":{"height":17.39},"width":472.04,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-11.png","element":"img","alt":" Θ∗ := [θ∗1, . . . , θ∗N] ∈ Rd×N","inline":true,"padRight":true},{"text":"and the agent ","element":"span"},{"text":"chooses one context feature ","element":"span"},{"style":{"height":9.58},"width":36.16,"height":23.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-12.png","element":"img","alt":" xt","inline":true},{"text":", which is why we call it the multiple-parameter MNL model. 
Then, the expected revenue of an action ","element":"span"},{"style":{"height":9.58},"width":36.16,"height":23.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-13.png","element":"img","alt":" xt","inline":true,"padRight":true},{"text":"in ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"[76] ","element":"a"},{"text":"is given by","element":"span"}],[{"style":{"width":"45%"},"width":714,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-14.png","element":"img"}],[{"text":"where we define the softmax function ","element":"span"},{"style":{"height":17.39},"width":342.8,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-15.png","element":"img","alt":" σ : RN → [0, 1]N by","inline":true}],[{"style":{"width":"99%"},"width":1570,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-16.png","element":"img"}],[{"text":"and ","element":"span"},{"style":{"height":19.98},"width":434.8,"height":49.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-17.png","element":"img","alt":" ρ := [ρ1, . . . , ρN] ∈ RN+1+","inline":true,"padRight":true},{"text":"represents the reward for each outcome ","element":"span"},{"style":{"height":16},"width":441.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-18.png","element":"img","alt":" j ∈ [N] with ρ0 = 0. 
Then,","inline":true,"padRight":true},{"text":"the difference between the revenue induced by ","element":"span"},{"style":{"height":12.21},"width":51.64,"height":30.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-19.png","element":"img","alt":" Θ∗ ","inline":true,"padRight":true},{"text":"and that by an estimator ","element":"span"},{"href":"#id-47","referenceIndex":76,"style":{"height":17.28},"width":138.64,"height":43.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-20.png","element":"img","alt":"ˆΘ in [76","inline":true},{"text":"] is expressed as","element":"span"}],[{"id":"id-164","style":{"width":"85%"},"width":1354,"height":351,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/50-21.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":22.18},"width":827.16,"height":55.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-0.png","element":"img","alt":" Ξk =� 10 (1 − ν)∇2[σ( ˆΘxt + ν(Θ∗ − ˆΘ)xt)]kdν","inline":true},{"text":". Then for the first term in Eq. 
","element":"span"},{"href":"#id-164","text":"(75)","element":"a"},{"text":", we ","element":"span"},{"text":"have","element":"span"}],[{"id":"id-165","style":{"width":"79%"},"width":1268,"height":370,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-1.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.18},"width":47.88,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-2.png","element":"img","alt":" Ht","inline":true,"padRight":true},{"text":"is the Gram matrix used in ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"[76] ","element":"a"},{"text":"defined by","element":"span"}],[{"style":{"width":"42%"},"width":680,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-3.png","element":"img"}],[{"text":"Note that the term ","element":"span"},{"style":{"height":18.88},"width":389.8,"height":47.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-4.png","element":"img","alt":" ∥vec(Θ∗) − vec( ˆΘ)∥Ht","inline":true,"padRight":true},{"text":"in Eq. 
","element":"span"},{"href":"#id-165","text":"(76) ","element":"a"},{"text":"can be bounded by the concentration result of the estimated parameter, and the term ","element":"span"},{"style":{"height":23.22},"width":513.96,"height":58.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-5.png","element":"img","alt":" ∥H− 12t (IN ⊗ x⊤t )∇σ( ˆΘxt)ρ∥2","inline":true,"padRight":true},{"text":"can also be bounded as ","element":"span"},{"text":"follows:","element":"span"}],[{"style":{"width":"72%"},"width":1148,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-6.png","element":"img"}],[{"text":"Here Zhang and Sugiyama ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"[76] ","element":"a"},{"text":"bound the term ","element":"span"},{"style":{"height":23.22},"width":481.52,"height":58.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-7.png","element":"img","alt":" ∥H− 12t (IN ⊗ x⊤t )∇σ( ˆΘxt)∥2 ","inline":true,"padRight":true},{"text":"using a matrix version ","element":"span"},{"text":"of the elliptical potential lemma. 
However, they assume ","element":"span"},{"style":{"height":16},"width":166.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-8.png","element":"img","alt":" ∥ρ∥2 ≤ R","inline":true,"padRight":true},{"text":"(Assumption 2 in ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"[76]","element":"a"},{"text":").","element":"span"}],[{"text":"Now, regarding the prediction error in our setting, the estimated values (","element":"span"},{"style":{"height":19.49},"width":126.8,"height":48.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-9.png","element":"img","alt":"�V kh+1(·)","inline":true},{"text":") for each reachable ","element":"span"},{"text":"state are typically distinct, and we do not assume a constant upper bound on the ","element":"span"},{"style":{"height":7.6},"width":32.64,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-10.png","element":"img","alt":" ℓ2","inline":true},{"text":"-norm of the estimated value vector for all reachable states. Instead, we can bound the ","element":"span"},{"style":{"height":7.6},"width":32.6,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-11.png","element":"img","alt":" ℓ2","inline":true},{"text":"-norm of the estimated value vector for all reachable states as follows:","element":"span"}],[{"style":{"width":"99%"},"width":1585,"height":182,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-12.png","element":"img"}],[{"style":{"height":10.8},"width":104.36,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-13.png","element":"img","alt":"h+1s, a h+1s′∈Ss,a ","inline":true,"padRight":true},{"text":"by a factor of","element":"span"},{"style":{"height":16},"width":62.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-14.png","element":"img","alt":"√U","inline":true},{"text":". 
To address this, we adapt the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"feature centralization technique ","element":"span"},{"text":"[","element":"span"},{"href":"#id-40","referenceIndex":50,"text":"50","element":"a"},{"text":"] to bound the prediction error independently of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":", without making any additional assumptions. The key point is that the Hessian of the per-round loss ","element":"span"},{"style":{"height":16.8},"width":119.52,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-15.png","element":"img","alt":" ℓk,h(θ)","inline":true,"padRight":true},{"text":"is expressed in terms of the centralized feature as follows:","element":"span"}],[{"style":{"width":"70%"},"width":1116,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-16.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.68},"width":849.68,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-17.png","element":"img","alt":" ¯φ(s, a, s′; θ) := φ(s, a, s′) − E�s∼Pθ(·|s,a)[φ(s, a, �s)]","inline":true,"padRight":true},{"text":"is the centralized feature by ","element":"span"},{"style":{"height":13.2},"width":182.24,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-18.png","element":"img","alt":" θ. Now, we","inline":true,"padRight":true},{"text":"provide the bound on the prediction error of the estimated parameter updated by ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL","element":"span"},{"text":".","element":"span"}],[{"id":"id-76","style":{"fontWeight":"bold"},"text":"Lemma 16 ","element":"span"},{"text":"(Bound on the prediction error)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":16},"width":157,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-19.png","element":"img","alt":" δ ∈ (0, 1)","inline":true},{"style":{"fontStyle":"italic"},"text":", suppose that Lemma ","element":"span"},{"href":"#id-70","style":{"fontStyle":"italic"},"text":"12 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds. Let us denote the prediction error about ","element":"span"},{"style":{"height":18.4},"width":93.2,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-20.png","element":"img","alt":"�θkh by","inline":true}],[{"style":{"width":"64%"},"width":1015,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-21.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Then, for any ","element":"span"},{"style":{"height":16},"width":394.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-22.png","element":"img","alt":" (s, a) ∈ S × A, we have","inline":true}],[{"style":{"width":"98%"},"width":1554,"height":103,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-23.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-76","style":{"fontStyle":"italic"},"text":"16. ","element":"a"},{"text":"Let us define ","element":"span"},{"style":{"height":21.36},"width":650.72,"height":53.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-24.png","element":"img","alt":" F(θ) := �s′∈Ss,a Pθ(s′ | s, a)�V kh+1(s′)","inline":true},{"text":". 
Then, by Taylor ","element":"span"},{"text":"expansion, we have","element":"span"}],[{"style":{"width":"78%"},"width":1243,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/51-25.png","element":"img"}],[{"style":{"width":"89%"},"width":1418,"height":508,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/52-0.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"81%"},"width":1286,"height":737,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/52-1.png","element":"img"}],[{"text":"Then, the prediction error can be bounded as follows:","element":"span"}],[{"style":{"width":"88%"},"width":1396,"height":149,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/52-2.png","element":"img"}],[{"text":"For the first term in Eq. ","element":"span"},{"href":"#id-76","text":"(77)","element":"a"},{"text":",","element":"span"}],[{"style":{"width":"92%"},"width":1464,"height":383,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/52-3.png","element":"img"}],[{"text":"where in the first inequality we use ","element":"span"},{"style":{"height":19.49},"width":234.28,"height":48.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/52-4.png","element":"img","alt":"�V kh+1(s′) ≤ H","inline":true,"padRight":true},{"text":"and the Cauchy-Schwarz inequality, and the second ","element":"span"},{"text":"inequality follows from the concentration result of Lemma ","element":"span"},{"href":"#id-70","text":"12.","element":"a"}],[{"text":"For the second term in Eq. 
","element":"span"},{"href":"#id-76","text":"(77)","element":"a"},{"text":", since ","element":"span"},{"style":{"height":19.49},"width":317.64,"height":48.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/53-0.png","element":"img","alt":" 0 ≤ �V kh+1(s′) ≤ H,","inline":true}],[{"style":{"width":"92%"},"width":1465,"height":1373,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/53-1.png","element":"img"}],[{"text":"where for the second inequality we use the Cauchy-Schwarz inequality, ","element":"span"},{"style":{"height":14},"width":509.72,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/53-2.png","element":"img","alt":" xx⊤ + yy⊤ ⪰ xy⊤ + yx⊤ for","inline":true,"padRight":true},{"text":"any ","element":"span"},{"style":{"height":16.59},"width":161.2,"height":41.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/53-3.png","element":"img","alt":" x, y ∈ Rd","inline":true},{"text":", and the triangle inequality. Note that","element":"span"}],[{"style":{"width":"92%"},"width":1460,"height":604,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/53-4.png","element":"img"}],[{"style":{"width":"99%"},"width":1585,"height":617,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-0.png","element":"img"}],[{"text":"where the second inequality follows from Lemma ","element":"span"},{"href":"#id-70","text":"12 ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":19.98},"width":428.4,"height":49.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-1.png","element":"img","alt":"∑s′∈Ss,a P¯θ(s′ | s, a) = 1","inline":true},{"text":". Combining ","element":"span"},{"text":"the results of Eq. ","element":"span"},{"href":"#id-76","text":"(78) ","element":"a"},{"text":"and Eq. 
","element":"span"},{"href":"#id-76","text":"(81)","element":"a"},{"text":", we conclude the proof.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Good Events with High Probability","element":"span"}],[{"text":"In this section, we introduce the good events used to prove Theorem ","element":"span"},{"href":"#id-75","text":"2 ","element":"a"},{"text":"and show that the good events occur with high probability.","element":"span"}],[{"id":"id-166","style":{"fontWeight":"bold"},"text":"Lemma 17 ","element":"span"},{"text":"(Good event probability)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":16},"width":350.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-2.png","element":"img","alt":" K ∈ N and δ ∈ (0, 1)","inline":true},{"style":{"fontStyle":"italic"},"text":", the good event ","element":"span"},{"style":{"height":16},"width":240.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-3.png","element":"img","alt":" G(K, δ′) holds","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"with probability at least ","element":"span"},{"style":{"height":16},"width":460.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-4.png","element":"img","alt":" 1 − δ where δ′ = δ/(2KH).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-166","style":{"fontStyle":"italic"},"text":"17. 
","element":"a"},{"text":"For any ","element":"span"},{"style":{"height":16},"width":318.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-5.png","element":"img","alt":" δ′ ∈ (0, 1), we have","inline":true}],[{"style":{"width":"69%"},"width":1102,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-6.png","element":"img"}],[{"text":"On the other hand, for any ","element":"span"},{"style":{"height":16},"width":305.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-7.png","element":"img","alt":" (k, h) ∈ [K] × [H]","inline":true},{"text":", by Lemma ","element":"span"},{"href":"#id-121","text":"30 ","element":"a"},{"style":{"height":22.86},"width":140.92,"height":57.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-8.png","element":"img","alt":" Gξk,h(δ′)","inline":true,"padRight":true},{"text":"holds with probability at least ","element":"span"},{"style":{"height":11.6},"width":101.84,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-9.png","element":"img","alt":"1 − δ′","inline":true},{"text":". Then, for ","element":"span"},{"style":{"height":16},"width":247.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-10.png","element":"img","alt":" δ′ = δ/(2KH)","inline":true,"padRight":true},{"text":"by taking union bound, we have the desired result as follows:","element":"span"}],[{"style":{"width":"77%"},"width":1221,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-11.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"D.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Stochastic Optimism","element":"span"}],[{"id":"id-74","style":{"fontWeight":"bold"},"text":"Lemma 18 ","element":"span"},{"text":"(Stochastic optimism)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":11.6},"width":19,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-12.png","element":"img","alt":" δ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":16},"width":296.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-13.png","element":"img","alt":" 0 < δ < Φ(−1)/2","inline":true},{"style":{"fontStyle":"italic"},"text":", let ","element":"span"},{"style":{"height":16},"width":224.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-14.png","element":"img","alt":" σk = Hβk(δ)","inline":true},{"style":{"fontStyle":"italic"},"text":". If we take multiple sample size ","element":"span"},{"style":{"height":24.43},"width":327.28,"height":61.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-15.png","element":"img","alt":" M = ⌈1 − log(HU)log Φ(1) ⌉","inline":true},{"style":{"fontStyle":"italic"},"text":", then for any ","element":"span"},{"style":{"height":16},"width":279.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-16.png","element":"img","alt":" k ∈ [K], we have","inline":true}],[{"style":{"width":"47%"},"width":759,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-17.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-74","style":{"fontStyle":"italic"},"text":"18. ","element":"a"},{"text":"First, we introduce the following lemmas.","element":"span"}],[{"id":"id-168","style":{"fontWeight":"bold"},"text":"Lemma 19. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":157,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-18.png","element":"img","alt":" δ ∈ (0, 1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be given. For any ","element":"span"},{"style":{"height":16},"width":602.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-19.png","element":"img","alt":" (k, h) ∈ [K] × [H], let σk = Hβk(δ)","inline":true},{"style":{"fontStyle":"italic"},"text":". If we define the event ","element":"span"},{"style":{"height":20.7},"width":175.12,"height":51.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-20.png","element":"img","alt":" G∆k,h(δ) as","inline":true}],[{"style":{"width":"76%"},"width":1205,"height":230,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-21.png","element":"img"}],[{"id":"id-167","style":{"fontStyle":"italic"},"text":"then conditioned on ","element":"span"},{"style":{"height":20.7},"width":667.68,"height":51.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-22.png","element":"img","alt":" G∆k,h(δ), for any (s, a) ∈ S × A, we have","inline":true}],[{"style":{"width":"44%"},"width":709,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/54-23.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 20. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":165.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/55-0.png","element":"img","alt":" δ ∈ (0, 1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be given. 
For any ","element":"span"},{"style":{"height":16},"width":319.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/55-1.png","element":"img","alt":" (h, k) ∈ [H] × [K]","inline":true},{"style":{"fontStyle":"italic"},"text":", let ","element":"span"},{"style":{"height":16},"width":232.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/55-2.png","element":"img","alt":" σk = Hβk(δ)","inline":true},{"style":{"fontStyle":"italic"},"text":". If we take multiple sample size ","element":"span"},{"style":{"height":24.43},"width":327.08,"height":61.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/55-3.png","element":"img","alt":" M = ⌈1 − log(HU)log Φ(1) ⌉","inline":true},{"style":{"fontStyle":"italic"},"text":", then conditioned on the event ","element":"span"},{"style":{"height":20.7},"width":418.6,"height":51.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/55-4.png","element":"img","alt":" G∆k (δ) := ∩h∈[H]G∆k,h(δ)","inline":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"we have","element":"span"}],[{"style":{"width":"52%"},"width":827,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/55-5.png","element":"img"}],[{"text":"Based on the result of Lemma ","element":"span"},{"href":"#id-167","text":"20, ","element":"a"},{"text":"using the same argument as in Lemma ","element":"span"},{"href":"#id-62","text":"6 ","element":"a"},{"text":"we obtain the desired result.","element":"span"}],[{"text":"In the following section, we provide the proofs of the lemmas used in Lemma ","element":"span"},{"href":"#id-74","text":"18.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"D.4.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-168","style":{"fontWeight":"bold"},"text":"19","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof 
of Lemma ","element":"span"},{"href":"#id-168","style":{"fontStyle":"italic"},"text":"19. ","element":"a"},{"text":"Recalling the definition of Bellman error (Definition ","element":"span"},{"href":"#id-56","text":"1)","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"94%"},"width":1497,"height":400,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/55-6.png","element":"img"}],[{"text":"Then, it is enough to show that","element":"span"}],[{"style":{"width":"65%"},"width":1039,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/55-7.png","element":"img"}],[{"text":"with at least constant probability. On the other hand, under the event ","element":"span"},{"style":{"height":20.7},"width":129.76,"height":51.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/55-8.png","element":"img","alt":" G∆k,h(δ)","inline":true},{"text":", by Lemma ","element":"span"},{"href":"#id-76","text":"16 ","element":"a"},{"text":"we have","element":"span"}],[{"style":{"width":"97%"},"width":1544,"height":757,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/55-9.png","element":"img"}],[{"text":"Therefore, by setting ","element":"span"},{"style":{"height":16},"width":224.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/55-10.png","element":"img","alt":" σk = Hβk(δ)","inline":true},{"text":", we have for ","element":"span"},{"style":{"height":16.8},"width":382.76,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/55-11.png","element":"img","alt":" m ∈ [M] and s′ ∈ Ss,a,","inline":true}],[{"style":{"width":"65%"},"width":1043,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/55-12.png","element":"img"}],[{"id":"id-170","style":{"width":"99%"},"width":1585,"height":537,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/56-0.png","element":"img"}],[{"text":"Consequently, we arrive at the conclusion 
as follows:","element":"span"}],[{"id":"id-169","style":{"width":"93%"},"width":1484,"height":938,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/56-1.png","element":"img"}],[{"text":"where ","element":"span"},{"href":"#id-169","text":"(84) ","element":"a"},{"text":"comes from the fact that ","element":"span"},{"style":{"height":16.8},"width":311.36,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/56-2.png","element":"img","alt":" maxs,a |Ss,a| = U","inline":true,"padRight":true},{"text":"and the union bound, and ","element":"span"},{"href":"#id-169","text":"(85) ","element":"a"},{"text":"follows by ","element":"span"},{"href":"#id-170","text":"(82)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.4.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-167","style":{"fontWeight":"bold"},"text":"20","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-167","style":{"fontStyle":"italic"},"text":"20. 
","element":"a"},{"text":"It holds","element":"span"}],[{"style":{"width":"75%"},"width":1201,"height":233,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/56-3.png","element":"img"}],[{"text":"where the first inequality uses the Bernoulli’s inequality, the second inequality follows by Lemma ","element":"span"},{"href":"#id-168","text":"19, ","element":"a"},{"text":"and the last inequality holds due to the choice of ","element":"span"},{"style":{"height":24.42},"width":336.96,"height":61.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/56-4.png","element":"img","alt":" M = ⌈1 − log(UH)log Φ(1) ⌉.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"D.5 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Bound on Estimation Part","element":"span"}],[{"text":"In this section, we provide the upper bound on the estimation part of the regret: ","element":"span"},{"style":{"height":20.4},"width":355.12,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/56-5.png","element":"img","alt":"�Kk=1(�V k1 −V ∗1 )(sk1).","inline":true,"padRight":true},{"id":"id-171","style":{"fontWeight":"bold"},"text":"Lemma 21 ","element":"span"},{"text":"(Bound on estimation)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":19.73},"width":516.32,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/56-6.png","element":"img","alt":" δ ∈ (0, 1), if λ = O(L2φd log U)","inline":true},{"style":{"fontStyle":"italic"},"text":", then with probability ","element":"span"},{"style":{"fontStyle":"italic"},"text":"at least ","element":"span"},{"style":{"height":16},"width":277.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/56-7.png","element":"img","alt":" 1 − δ/2, we have","inline":true}],[{"style":{"width":"57%"},"width":917,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/56-8.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-171","style":{"fontStyle":"italic"},"text":"21. ","element":"a"},{"text":"With the same argument in Lemma ","element":"span"},{"href":"#id-67","text":"10, ","element":"a"},{"text":"we have","element":"span"}],[{"id":"id-173","style":{"width":"100%"},"width":1586,"height":767,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/57-0.png","element":"img"}],[{"id":"id-172","text":"where the last inequality follows by Lemma ","element":"span"},{"href":"#id-76","text":"16. ","element":"a"},{"text":"Now we introduce the following lemma.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 22. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":16},"width":771.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/57-1.png","element":"img","alt":" (k, h) ∈ [K] × [H] and (s, a) ∈ S × A, it holds","inline":true}],[{"style":{"width":"89%"},"width":1412,"height":233,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/57-2.png","element":"img"}],[{"text":"By plugging the result of Lemma ","element":"span"},{"href":"#id-172","text":"22 ","element":"a"},{"text":"into Eq. ","element":"span"},{"href":"#id-173","text":"(87)","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"95%"},"width":1514,"height":633,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/57-3.png","element":"img"}],[{"text":"By letting us denote","element":"span"}],[{"id":"id-174","style":{"width":"95%"},"width":1507,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/57-4.png","element":"img"}],[{"text":"and summing over all episodes, we have","element":"span"}],[{"style":{"width":"99%"},"width":1585,"height":874,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/58-0.png","element":"img"}],[{"text":"For term ","element":"span"},{"text":"(i)","element":"span"},{"text":", we have","element":"span"}],[{"style":{"width":"89%"},"width":1412,"height":546,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/58-1.png","element":"img"}],[{"id":"id-176","text":"where the last inequality follows by the following lemma:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 23. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For each ","element":"span"},{"style":{"height":19.73},"width":309.72,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/58-2.png","element":"img","alt":" h ∈ [H], if λ ≥ L2φ","inline":true},{"style":{"fontStyle":"italic"},"text":", then we have","element":"span"}],[{"style":{"width":"79%"},"width":1267,"height":127,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/58-3.png","element":"img"}],[{"text":"Then, term ","element":"span"},{"text":"(i) ","element":"span"},{"text":"can be bounded as follows:","element":"span"}],[{"style":{"width":"84%"},"width":1333,"height":332,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/58-4.png","element":"img"}],[{"id":"id-175","text":"For term ","element":"span"},{"text":"(ii)","element":"span"},{"text":", we introduce the following lemma:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 24. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":157,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/59-0.png","element":"img","alt":" δ ∈ (0, 1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be given. 
For any ","element":"span"},{"style":{"height":16},"width":610.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/59-1.png","element":"img","alt":" (k, h) ∈ [K]×[H] and (s, a) ∈ S ×A","inline":true},{"style":{"fontStyle":"italic"},"text":", with probability at least ","element":"span"},{"style":{"height":14},"width":226.76,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/59-2.png","element":"img","alt":" 1 − δ, it holds","inline":true}],[{"style":{"width":"58%"},"width":931,"height":350,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/59-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":19.22},"width":497.36,"height":48.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/59-4.png","element":"img","alt":" γk(δ) := Cξσk�d log(Md/δ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for an absolute constant ","element":"span"},{"style":{"height":15.58},"width":130.32,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/59-5.png","element":"img","alt":" Cξ > 0.","inline":true}],[{"style":{"width":"99%"},"width":1585,"height":628,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/59-6.png","element":"img"}],[{"text":"where the last inequality follows by Eq. ","element":"span"},{"href":"#id-174","text":"(90)","element":"a"},{"text":". 
Note that","element":"span"}],[{"style":{"width":"92%"},"width":1467,"height":1059,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/59-7.png","element":"img"}],[{"text":"where the first inequality holds since ","element":"span"},{"style":{"height":21.62},"width":214.16,"height":54.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/59-8.png","element":"img","alt":" B−1k,h ⪯ A−1k,h","inline":true},{"text":", the second inequality follows from ","element":"span"},{"style":{"height":17.39},"width":183.32,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/59-9.png","element":"img","alt":" (x + y)2 ≤","inline":true},{"style":{"height":16.59},"width":173.6,"height":41.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/59-10.png","element":"img","alt":"2x2 + 2y2","inline":true},{"text":", and the third inequality uses the triangle inequality, and the fourth inequality uses ","element":"span"},{"style":{"height":24.67},"width":540.96,"height":61.68,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/59-11.png","element":"img","alt":"��s∈Sk,h P�θk+1h (�s | skh, akh) = 1","inline":true},{"text":", and the last inequality follows by Lemma ","element":"span"},{"href":"#id-109","text":"3. ","element":"a"},{"text":"By substituting","element":"span"}],[{"text":"Eq. ","element":"span"},{"href":"#id-175","text":"(93) ","element":"a"},{"text":"into Eq. 
","element":"span"},{"href":"#id-175","text":"(92)","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"98%"},"width":1556,"height":162,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/60-0.png","element":"img"}],[{"text":"For term ","element":"span"},{"text":"(iii)","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"96%"},"width":1534,"height":575,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/60-1.png","element":"img"}],[{"text":"where for the second inequality we use the same argument used to derive Eq. ","element":"span"},{"href":"#id-175","text":"(93) ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-109","text":"3.","element":"a"}],[{"text":"For term ","element":"span"},{"text":"(iv)","element":"span"},{"text":", since we have ","element":"span"},{"style":{"height":19.52},"width":181.32,"height":48.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/60-2.png","element":"img","alt":" | ˙ζkh| ≤ 2H","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.79},"width":291.72,"height":49.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/60-3.png","element":"img","alt":" E[ ˙ζkh | Fk,h] = 0","inline":true},{"text":", which means ","element":"span"},{"style":{"height":19.79},"width":245.6,"height":49.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/60-4.png","element":"img","alt":" { ˙ζkh | Fk,h}k,h","inline":true,"padRight":true},{"text":"is ","element":"span"},{"text":"a martingale difference sequence for any ","element":"span"},{"style":{"height":16},"width":136.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/60-5.png","element":"img","alt":" k ∈ [K]","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":16},"width":137.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/60-6.png","element":"img","alt":" h ∈ [H]","inline":true},{"text":". Hence, by applying the AzumaHoeffding inequality with probability at least ","element":"span"},{"style":{"height":16},"width":278.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/60-7.png","element":"img","alt":" 1 − δ/4, we have","inline":true}],[{"id":"id-177","style":{"width":"68%"},"width":1080,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/60-8.png","element":"img"}],[{"text":"Combining all results of Eq. ","element":"span"},{"href":"#id-176","text":"(91)","element":"a"},{"text":", ","element":"span"},{"href":"#id-175","text":"(94)","element":"a"},{"text":", ","element":"span"},{"href":"#id-175","text":"(95)","element":"a"},{"text":", and ","element":"span"},{"href":"#id-177","text":"(96)","element":"a"},{"text":", we have the desired result.","element":"span"}],[{"style":{"width":"96%"},"width":1527,"height":319,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/60-9.png","element":"img"}],[{"text":"In the following, we provide the proof of the lemmas used in Lemma ","element":"span"},{"href":"#id-171","text":"21.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"D.5.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-172","style":{"fontWeight":"bold"},"text":"22","element":"a"}],[{"style":{"width":"85%"},"width":1359,"height":910,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/61-0.png","element":"img"}],[{"text":"where the first inequality holds by triangle inequality.","element":"span"}],[{"text":"For ","element":"span"},{"text":"(i)","element":"span"},{"text":", we 
have","element":"span"}],[{"id":"id-179","style":{"width":"88%"},"width":1397,"height":224,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/61-1.png","element":"img"}],[{"text":"where in the equality we apply the mean value theorem with ","element":"span"},{"style":{"height":19.54},"width":425.4,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/61-2.png","element":"img","alt":" ϑkh = v �θkh + (1 − v)�θk+1h","inline":true,"padRight":true},{"text":"for some ","element":"span"},{"style":{"height":16},"width":149.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/61-3.png","element":"img","alt":"v ∈ [0, 1]","inline":true},{"text":", and the inequality follows by Cauchy-Schwarz inequality. Meanwhile, since we have","element":"span"}],[{"id":"id-178","style":{"width":"86%"},"width":1372,"height":1018,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/61-4.png","element":"img"}],[{"text":"by substituting ","element":"span"},{"href":"#id-178","text":"(98) ","element":"a"},{"text":"into ","element":"span"},{"href":"#id-179","text":"(97) ","element":"a"},{"text":"we have","element":"span"}],[{"id":"id-181","style":{"width":"87%"},"width":1386,"height":689,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/62-0.png","element":"img"}],[{"text":"Note that by Jensen’s inequality, we have","element":"span"}],[{"id":"id-180","style":{"width":"97%"},"width":1547,"height":413,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/62-1.png","element":"img"}],[{"id":"id-182","text":"Also, we introduce the following lemma:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 25. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":16},"width":339.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/62-2.png","element":"img","alt":" k ∈ [K] and h ∈ [H]","inline":true},{"style":{"fontStyle":"italic"},"text":", the following holds:","element":"span"}],[{"style":{"width":"28%"},"width":458,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/62-3.png","element":"img"}],[{"text":"Then, substituting ","element":"span"},{"href":"#id-180","text":"(100) ","element":"a"},{"text":"into ","element":"span"},{"href":"#id-181","text":"(99)","element":"a"},{"text":", we have","element":"span"}],[{"id":"id-183","style":{"width":"84%"},"width":1339,"height":356,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/62-4.png","element":"img"}],[{"text":"where the second inequality comes from Lemma ","element":"span"},{"href":"#id-182","text":"25, ","element":"a"},{"text":"and the last inequality holds due to ","element":"span"},{"style":{"height":19.97},"width":460.16,"height":49.92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/62-5.png","element":"img","alt":"�s′∈Ss,a Pϑkh(s′ | s, a) = 1.","inline":true}],[{"text":"For ","element":"span"},{"text":"(ii)","element":"span"},{"text":", we have","element":"span"}],[{"id":"id-184","style":{"width":"97%"},"width":1538,"height":877,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/63-0.png","element":"img"}],[{"text":"where the last inequality is obtained through the same argument as used to bound ","element":"span"},{"text":"(i)","element":"span"},{"text":". Combining the results of Eq. ","element":"span"},{"href":"#id-183","text":"(101) ","element":"a"},{"text":"and Eq. 
","element":"span"},{"href":"#id-184","text":"(102)","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"93%"},"width":1481,"height":287,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/63-1.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"D.5.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-176","style":{"fontWeight":"bold"},"text":"23","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-176","style":{"fontStyle":"italic"},"text":"23. ","element":"a"},{"text":"Note that","element":"span"}],[{"style":{"width":"94%"},"width":1492,"height":801,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/63-2.png","element":"img"}],[{"text":"Taking the logarithm on both sides yields","element":"span"}],[{"style":{"width":"67%"},"width":1068,"height":145,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/63-3.png","element":"img"}],[{"style":{"width":"89%"},"width":1416,"height":506,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/64-0.png","element":"img"}],[{"text":"where the last inequality uses ","element":"span"},{"style":{"height":24.69},"width":532.76,"height":61.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/64-1.png","element":"img","alt":"�s′∈Sk,h P�θk+1h (s′ | skh, akh) = 1","inline":true},{"text":". 
From the fact that ","element":"span"},{"style":{"height":16},"width":271.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/64-2.png","element":"img","alt":" z ≤ 2 log(1 + z)","inline":true}],[{"text":"for any ","element":"span"},{"style":{"height":16},"width":148.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/64-3.png","element":"img","alt":" z ∈ [0, 1]","inline":true},{"text":", it follows that","element":"span"}],[{"style":{"width":"85%"},"width":1353,"height":145,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/64-4.png","element":"img"}],[{"text":"Finally, we obtain","element":"span"}],[{"style":{"width":"82%"},"width":1310,"height":390,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/64-5.png","element":"img"}],[{"text":"where the last inequality follows from the determinant-trace inequality (Lemma ","element":"span"},{"href":"#id-119","text":"28)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.5.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-175","style":{"fontWeight":"bold"},"text":"24","element":"a"}],[{"style":{"width":"99%"},"width":1583,"height":1145,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/64-6.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"D.5.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-182","style":{"fontWeight":"bold"},"text":"25","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-182","style":{"fontStyle":"italic"},"text":"25. ","element":"a"},{"text":"We provide a proof for Lemma ","element":"span"},{"href":"#id-182","text":"25 ","element":"a"},{"text":"since it is a slight modification of Lemma 20 of ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"[76]","element":"a"},{"text":". 
From the definition, we know that","element":"span"}],[{"id":"id-185","style":{"width":"100%"},"width":1586,"height":749,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/65-0.png","element":"img"}],[{"text":"For the last inequality of ","element":"span"},{"href":"#id-185","text":"(103)","element":"a"},{"text":", we provide the upper bound of ","element":"span"},{"style":{"height":16.78},"width":443.68,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/65-1.png","element":"img","alt":" l2-norm of ∇ℓk,h(θ). Since","inline":true}],[{"style":{"width":"47%"},"width":747,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/65-2.png","element":"img"}],[{"text":"the gradient of the loss function is given by","element":"span"}],[{"style":{"width":"81%"},"width":1290,"height":494,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/65-3.png","element":"img"}],[{"text":"Therefore, we have","element":"span"}],[{"style":{"width":"99%"},"width":1581,"height":378,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/65-4.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"D.6 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Bound on Pessimism Part","element":"span"}],[{"text":"In this section, we provide the upper bound on the pessimism part of the regret: ","element":"span"},{"style":{"height":20.38},"width":354.36,"height":50.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/65-5.png","element":"img","alt":"�Kk=1(V ∗1 − �V k1 )(sk1).","inline":true,"padRight":true},{"id":"id-186","style":{"fontWeight":"bold"},"text":"Lemma 26 ","element":"span"},{"text":"(Bound on pessimism)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":11.6},"width":19,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/65-6.png","element":"img","alt":" δ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":16},"width":318.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/65-7.png","element":"img","alt":" 0 < δ < Φ(−1)/2","inline":true},{"style":{"fontStyle":"italic"},"text":", let ","element":"span"},{"style":{"height":14.4},"width":182.64,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/65-8.png","element":"img","alt":" σk = Hβk","inline":true},{"style":{"fontStyle":"italic"},"text":". If ","element":"span"},{"style":{"height":10.8},"width":70.92,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/65-9.png","element":"img","alt":" λ =","inline":true},{"style":{"height":19.73},"width":231.32,"height":49.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/65-10.png","element":"img","alt":"O(L2φd log U)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and we take multiple sample size ","element":"span"},{"style":{"height":24.42},"width":327.24,"height":61.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/65-11.png","element":"img","alt":" M = ⌈1 − log(HU)log Φ(1) ⌉","inline":true},{"style":{"fontStyle":"italic"},"text":", then with probability at least","element":"span"}],[{"style":{"width":"78%"},"width":1238,"height":161,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/65-12.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-186","style":{"fontStyle":"italic"},"text":"26. 
","element":"a"},{"text":"As seen in Lemma ","element":"span"},{"href":"#id-74","text":"18, ","element":"a"},{"text":"by using multiple sampling technique we show that the optimistic randomized value function ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/66-0.png","element":"img","alt":"�V","inline":true,"padRight":true},{"text":"of ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL ","element":"span"},{"text":"is optimistic than the true optimal value with constant probability Hence, with the same argument used in Lemma ","element":"span"},{"href":"#id-68","text":"11, ","element":"a"},{"text":"we can show that the pessimism term of ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL ","element":"span"},{"text":"is upper bounded by a bound of the estimation term times the inverse probability of being optimistic, i.e.,","element":"span"}],[{"style":{"width":"81%"},"width":1286,"height":187,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/66-1.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"D.7 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Regret Bound of ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-75","style":{"fontStyle":"italic"},"text":"2. 
","element":"a"},{"text":"Since both Lemma ","element":"span"},{"href":"#id-171","text":"21 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-186","text":"26 ","element":"a"},{"text":"holds with probability at least ","element":"span"},{"style":{"height":16},"width":128.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/66-2.png","element":"img","alt":" 1 − δ/2","inline":true,"padRight":true},{"text":"respectively, by taking the union bound we conclude the proof.","element":"span"}]]},{"heading":"E Optimistic Exploration Extension","paragraphs":[[{"text":"In this section, we introduce ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"UCRL-MNL+ ","element":"span"},{"text":"(Algorithm ","element":"span"},{"href":"#id-77","text":"3)","element":"a"},{"text":", which is both ","element":"span"},{"style":{"fontStyle":"italic"},"text":"computationally ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"statistically ","element":"span"},{"text":"efficient for MNL-MDPs with UCB-based exploration. The main difference compared to ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"ORRL-MNL ","element":"span"},{"text":"is that ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"UCRL-MNL+ ","element":"span"},{"text":"constructs an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"optimistic value function ","element":"span"},{"text":"that is greater than the optimal value function with high probability. 
At each episode ","element":"span"},{"style":{"height":16},"width":133.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/66-3.png","element":"img","alt":" k ∈ [K]","inline":true},{"text":", with the estimated transition core parameter ","element":"span"},{"href":"#id-71","style":{"height":20.82},"width":1069.8,"height":52.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/66-4.png","element":"img","alt":"�θkh (5), for (s, a) ∈ S × A, set ˆQkH+1(s, a) = 0. For each h ∈ [H],","inline":true}],[{"style":{"width":"100%"},"width":1586,"height":324,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/66-5.png","element":"img"}],[{"text":"Based on this ","element":"span"},{"style":{"fontStyle":"italic"},"text":"optimistic value function ","element":"span"},{"style":{"height":19.36},"width":50.48,"height":48.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/66-6.png","element":"img","alt":"ˆQkh","inline":true},{"text":", at each episode the agent plays a greedy action with ","element":"span"},{"text":"respect to ","element":"span"},{"style":{"height":19.34},"width":50.52,"height":48.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/66-7.png","element":"img","alt":"ˆQkh ","inline":true,"padRight":true},{"text":"as summarized in Algorithm ","element":"span"},{"href":"#id-77","text":"3.","element":"a"}],[{"id":"id-77","style":{"width":"100%"},"width":1589,"height":642,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/66-8.png","element":"img"}],[{"text":"The main difference in the regret analysis lies in ensuring the optimism of the estimated value function ","element":"span"},{"style":{"height":19.34},"width":50.48,"height":48.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/66-9.png","element":"img","alt":"ˆQkh","inline":true,"padRight":true},{"text":"(Lemma ","element":"span"},{"href":"#id-187","text":"27)","element":"a"},{"text":". 
In the following theorem (the formal statement of Corollary ","element":"span"},{"href":"#id-188","text":"1)","element":"a"},{"text":", we provide a regret ","element":"span"},{"text":"guarantee for ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"UCRL-MNL+","element":"span"},{"text":", which enjoys the tightest regret bound for MNL-MDPs.","element":"span"}],[{"id":"id-189","style":{"fontWeight":"bold"},"text":"Theorem 3 ","element":"span"},{"text":"(Regret Bound of ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"UCRL-MNL+","element":"span"},{"text":")","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose that Assumptions ","element":"span"},{"href":"#id-21","style":{"fontStyle":"italic"},"text":"1-","element":"a"},{"href":"#id-28","style":{"fontStyle":"italic"},"text":"4 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold. For any ","element":"span"},{"style":{"height":16},"width":167.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/66-10.png","element":"img","alt":" δ ∈ (0, 1),","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"if we set the input parameters in Algorithm ","element":"span"},{"href":"#id-77","style":{"fontStyle":"italic"},"text":"3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"as ","element":"span"},{"style":{"height":20.72},"width":790.96,"height":51.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/66-11.png","element":"img","alt":" λ = O(L2φd log U), βk = O(√d log U log(kH))","inline":true,"padRight":true},{"style":{"height":16},"width":225.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-0.png","element":"img","alt":"η = O(log U)","inline":true},{"style":{"fontStyle":"italic"},"text":", then with probability at least 
","element":"span"},{"style":{"height":11.6},"width":87.56,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-1.png","element":"img","alt":" 1 − δ","inline":true},{"style":{"fontStyle":"italic"},"text":", the cumulative regret of the ","element":"span"},{"style":{"height":14.4},"width":327.24,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-2.png","element":"img","alt":" UCRL-MNL+ policy π","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is upper-bounded by","element":"span"}],[{"style":{"width":"46%"},"width":738,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"KH ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is the total number of time steps.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-189","style":{"fontStyle":"italic"},"text":"3. ","element":"a"},{"text":"By Lemma ","element":"span"},{"href":"#id-166","text":"17, ","element":"a"},{"text":"suppose that the good event ","element":"span"},{"style":{"height":16},"width":145.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-4.png","element":"img","alt":" G(K, δ′)","inline":true,"padRight":true},{"text":"holds with probability at least ","element":"span"},{"style":{"height":11.6},"width":87.68,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-5.png","element":"img","alt":" 1 − δ","inline":true},{"text":". 
Then, we show that the optimistic value function ","element":"span"},{"style":{"height":19.34},"width":50.48,"height":48.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-6.png","element":"img","alt":"ˆQkh ","inline":true,"padRight":true},{"text":"is deterministically greater than the ","element":"span"},{"text":"true optimal value function as follows:","element":"span"}],[{"id":"id-187","style":{"fontWeight":"bold"},"text":"Lemma 27 ","element":"span"},{"text":"(Optimism)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose that the event ","element":"span"},{"style":{"height":20.7},"width":129.76,"height":51.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-7.png","element":"img","alt":" G∆k,h(δ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"holds for all ","element":"span"},{"style":{"height":16},"width":130.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-8.png","element":"img","alt":" k ∈ [K]","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16},"width":130.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-9.png","element":"img","alt":" h ∈ [H]","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then ","element":"span"},{"style":{"fontStyle":"italic"},"text":"for any ","element":"span"},{"style":{"height":16},"width":394.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-10.png","element":"img","alt":" (s, a) ∈ S × A, we have","inline":true}],[{"style":{"width":"21%"},"width":348,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-11.png","element":"img"}],[{"text":"Conditioned on ","element":"span"},{"style":{"height":16},"width":145.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-12.png","element":"img","alt":" G(K, δ′)","inline":true},{"text":", by Lemma ","element":"span"},{"href":"#id-187","text":"27 ","element":"a"},{"text":"we have","element":"span"}],[{"style":{"width":"90%"},"width":1434,"height":203,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-13.png","element":"img"}],[{"text":"Note that","element":"span"}],[{"style":{"width":"80%"},"width":1281,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-14.png","element":"img"}],[{"text":"where we denote ","element":"span"},{"style":{"height":28.8},"width":975.52,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-15.png","element":"img","alt":" ζkh := ( ˆV kh+1 − V πkh+1)(skh+1) − E�s|skh,akh�( ˆV kh+1 − V πkh+1)(�s)�","inline":true},{"text":". 
Then, with the same argument, we have","element":"span"}],[{"style":{"width":"47%"},"width":755,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-16.png","element":"img"}],[{"text":"By summing over all episodes, we have","element":"span"}],[{"id":"id-190","style":{"width":"75%"},"width":1204,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-17.png","element":"img"}],[{"text":"On the other hand, note that","element":"span"}],[{"style":{"width":"97%"},"width":1553,"height":972,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/67-18.png","element":"img"}],[{"text":"where the last inequality follows by Lemma ","element":"span"},{"href":"#id-172","text":"22.","element":"a"}],[{"text":"Term ","element":"span"},{"text":"(i) ","element":"span"},{"text":"can be bounded as in Eq. ","element":"span"},{"href":"#id-176","text":"(91)","element":"a"},{"text":":","element":"span"}],[{"style":{"width":"93%"},"width":1480,"height":125,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/68-0.png","element":"img"}],[{"text":"For term ","element":"span"},{"text":"(ii)","element":"span"},{"text":", recall that as in Eq. ","element":"span"},{"href":"#id-175","text":"(93) ","element":"a"},{"text":"we have","element":"span"}],[{"style":{"width":"72%"},"width":1149,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/68-1.png","element":"img"}],[{"text":"Then, we have","element":"span"}],[{"style":{"width":"83%"},"width":1331,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/68-2.png","element":"img"}],[{"text":"For term ","element":"span"},{"text":"(iii)","element":"span"},{"text":", since we have","element":"span"}],[{"style":{"width":"89%"},"width":1420,"height":252,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/68-3.png","element":"img"}],[{"text":"Combining the results of Eqs. 
","element":"span"},{"href":"#id-190","text":"(106)","element":"a"},{"text":", ","element":"span"},{"href":"#id-190","text":"(107)","element":"a"},{"text":", and ","element":"span"},{"href":"#id-190","text":"(108)","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"52%"},"width":837,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/68-4.png","element":"img"}],[{"text":"Finally, by the Azuma-Hoeffding inequality as in Eq. ","element":"span"},{"href":"#id-177","text":"(96) ","element":"a"},{"text":"we have","element":"span"}],[{"style":{"width":"25%"},"width":399,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/68-5.png","element":"img"}],[{"text":"This concludes the proof.","element":"span"}],[{"text":"In the following, we provide the proof of Lemma ","element":"span"},{"href":"#id-187","text":"27.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"E.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Optimism","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-187","style":{"fontStyle":"italic"},"text":"27. ","element":"a"},{"text":"We prove this by backward induction on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":". For the base case ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":", since","element":"span"}],[{"style":{"width":"66%"},"width":1053,"height":155,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/68-6.png","element":"img"}],[{"text":"Suppose that the statement holds for ","element":"span"},{"style":{"height":16},"width":383.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/69-0.png","element":"img","alt":" h+1 where h ∈ [H −1]","inline":true},{"text":". 
Then, for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"and for any ","element":"span"},{"style":{"height":16},"width":245.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/69-1.png","element":"img","alt":" (s, a) ∈ S ×A,","inline":true}],[{"style":{"width":"68%"},"width":1078,"height":699,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/69-2.png","element":"img"}],[{"text":"where the first inequality follows from the induction hypothesis and the second inequality holds by Lemma ","element":"span"},{"href":"#id-76","text":"16.","element":"a"}]]},{"heading":"F Experiment Details","paragraphs":[[{"id":"id-191","style":{"width":"96%"},"width":1524,"height":292,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/69-3.png","element":"img"}],[{"text":"Figure 2: The “RiverSwim” environment with ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n ","element":"figcaption","subtype":"caption"},{"text":"states ","element":"figcaption","subtype":"caption"},{"href":"#id-78","referenceIndex":58,"text":"[58]","element":"a","subtype":"caption"}],[{"text":"The RiverSwim environment (Figure ","element":"span"},{"href":"#id-191","text":"2) ","element":"a"},{"text":"consists of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"states that are arranged in a chain. The agent starts in the leftmost state with a relatively small reward of ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"005 ","element":"span"},{"text":"and aims to reach the rightmost state, which has a relatively large reward of ","element":"span"},{"text":"1","element":"span"},{"text":". 
Swimming left moves the agent deterministically to the left, while swimming right moves the agent to the right only with some probability; there is also a high chance of remaining in the current state or even moving left because of the strong current of the river. Efficient exploration is therefore crucial for learning the optimal policy in this environment.","element":"span"}],[{"text":"We fine-tuned the hyperparameters for each algorithm within specific ranges. Figures ","element":"span"},{"href":"#id-79","text":"1a ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-79","text":"1b ","element":"a"},{"text":"show the episodic returns in the RiverSwim environment over 10 independent runs with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|S| ","element":"span"},{"text":"= 4","element":"span"},{"style":{"fontStyle":"italic"},"text":", H ","element":"span"},{"text":"= 12","element":"span"},{"text":", and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"= 10","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"000 ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|S| ","element":"span"},{"text":"= 8","element":"span"},{"style":{"fontStyle":"italic"},"text":", H ","element":"span"},{"text":"= 24","element":"span"},{"text":", and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"= 10","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"000","element":"span"},{"text":", respectively. The shaded areas represent the standard deviations (1-sigma error). Figure ","element":"span"},{"href":"#id-79","text":"1c ","element":"a"},{"text":"compares the running time of the algorithms over the first 1,000 episodes. 
All experiments were conducted on a Xeon(R) Gold 6226R CPU @ 2.90GHz (16 cores).","element":"span"}]]},{"heading":"G Auxiliary Lemmas","paragraphs":[[{"id":"id-119","style":{"fontWeight":"bold"},"text":"Lemma 28 ","element":"span"},{"text":"(Determinant-trace inequality [","element":"span"},{"href":"#id-64","referenceIndex":1,"text":"1","element":"a"},{"text":"])","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose ","element":"span"},{"style":{"height":16.59},"width":267.76,"height":41.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/69-4.png","element":"img","alt":" x1, . . . , xt ∈ Rd","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and for any ","element":"span"},{"style":{"height":13.2},"width":170.96,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/69-5.png","element":"img","alt":" 1 ≤ τ ≤ t","inline":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":20},"width":1046.76,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/69-6.png","element":"img","alt":"∥xτ∥2 ≤ L. Let Vt = λId + �tτ=1 xτx⊤τ for some λ > 0. Then,","inline":true}],[{"style":{"width":"26%"},"width":424,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/69-7.png","element":"img"}],[{"id":"id-113","style":{"fontWeight":"bold"},"text":"Lemma 29 ","element":"span"},{"text":"(Freedman’s inequality [","element":"span"},{"href":"#id-192","referenceIndex":29,"text":"29","element":"a"},{"text":"])","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Consider a real-valued martingale ","element":"span"},{"style":{"height":16},"width":350.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/69-8.png","element":"img","alt":" {Yk : k = 0, 1, 2, . . 
.}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"with difference sequence ","element":"span"},{"style":{"height":16},"width":398.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/69-9.png","element":"img","alt":" {Xk : k = 0, 1, 2, 3, . . .}","inline":true},{"style":{"fontStyle":"italic"},"text":". Assume that the difference sequence is uniformly ","element":"span"},{"style":{"fontStyle":"italic"},"text":"bounded, ","element":"span"},{"style":{"height":13.2},"width":146.4,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-0.png","element":"img","alt":" Xk ≤ R","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"almost surely for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"3","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . .","element":"span"},{"style":{"fontStyle":"italic"},"text":". Define the predictable quadratic variation process of the martingale:","element":"span"}],[{"style":{"width":"81%"},"width":1286,"height":301,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-1.png","element":"img"}],[{"id":"id-121","style":{"fontWeight":"bold"},"text":"Lemma 30 ","element":"span"},{"text":"(Gaussian noise concentration (Lemma D.2 in [","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"37","element":"a"},{"text":"]))","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":18.74},"width":314.08,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-2.png","element":"img","alt":" ξ(1), ξ(2), . . . 
, ξ(M)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"style":{"fontStyle":"italic"},"text":"independent ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"style":{"fontStyle":"italic"},"text":"-dimensional multivariate normally distributed vectors with mean ","element":"span"},{"style":{"height":12.78},"width":39.92,"height":31.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-3.png","element":"img","alt":" 0d","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and covariance ","element":"span"},{"style":{"height":16.59},"width":366.84,"height":41.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-4.png","element":"img","alt":"σ2A−1 for some σ > 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and a positive definite matrix ","element":"span"},{"style":{"height":19.54},"width":759.52,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-5.png","element":"img","alt":" A−1, i.e., ξ(m) ∼ N(0d, σ2A−1) for m ∈ [M].","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"Then for any ","element":"span"},{"style":{"height":16},"width":157,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-6.png","element":"img","alt":" δ ∈ (0, 1)","inline":true},{"style":{"fontStyle":"italic"},"text":", with probability at least ","element":"span"},{"style":{"height":14},"width":237.56,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-7.png","element":"img","alt":" 1 − δ, we have","inline":true}],[{"style":{"width":"49%"},"width":782,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-8.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where 
","element":"span"},{"style":{"height":15.58},"width":44.48,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-9.png","element":"img","alt":" Cξ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is an absolute constant. ","element":"span"},{"id":"id-144","style":{"fontWeight":"bold"},"text":"Lemma 31 ","element":"span"},{"text":"(Proposition 4.1 of ","element":"span"},{"href":"#id-193","referenceIndex":15,"text":"15)","element":"a"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let the ","element":"span"},{"style":{"height":10.8},"width":81.04,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-10.png","element":"img","alt":" wt+1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be the solution of the update rule","element":"span"}],[{"style":{"width":"39%"},"width":620,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-11.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":15.79},"width":222.36,"height":39.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-12.png","element":"img","alt":" V ⊆ W ⊆ Rd ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a non-empty convex set and ","element":"span"},{"style":{"height":16.78},"width":782.52,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-13.png","element":"img","alt":" Dψ(w1, w2) = ψ(w1)−ψ(w2)−⟨∇ψ(w2), w1−","inline":true},{"style":{"height":16},"width":62.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-14.png","element":"img","alt":"w2⟩","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the Bregman Divergence w.r.t. 
a strictly convex and continuously differentiable function ","element":"span"},{"style":{"height":14},"width":203,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-15.png","element":"img","alt":"ψ : W → R","inline":true},{"style":{"fontStyle":"italic"},"text":". Further supposing ","element":"span"},{"style":{"height":16},"width":88.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-16.png","element":"img","alt":" ψ(w)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"-strongly convex w.r.t. a certain norm ","element":"span"},{"style":{"height":16},"width":70.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-17.png","element":"img","alt":" ∥ · ∥","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"style":{"fontStyle":"italic"},"text":", then there exists a ","element":"span"},{"style":{"height":16},"width":404.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-18.png","element":"img","alt":" gt ∈ ∂ℓt(wt+1) such that","inline":true}],[{"id":"id-150","style":{"width":"100%"},"width":1589,"height":417,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-19.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Then, for any ","element":"span"},{"style":{"height":16},"width":301.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-20.png","element":"img","alt":" δ ∈ (0, 1], we have","inline":true}],[{"style":{"width":"77%"},"width":1224,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-21.png","element":"img"}],[{"id":"id-145","style":{"fontWeight":"bold"},"text":"Lemma 33 ","element":"span"},{"text":"(Lemma 1 of 
","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"76","element":"a"},{"text":")","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":28.8},"width":679.36,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-22.png","element":"img","alt":" ℓ(z, y) = �Kk=0 1{y = k} · log� 1[σ(z)]k","inline":true}],[{"style":{"height":17.38},"width":666.24,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-23.png","element":"img","alt":"y ∈ {0} ∪ [K] and b ∈ RK where C > 0","inline":true},{"style":{"fontStyle":"italic"},"text":". Then, we have","element":"span"}],[{"style":{"width":"95%"},"width":1521,"height":89,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-24.png","element":"img"}],[{"id":"id-149","style":{"fontWeight":"bold"},"text":"Lemma 34 ","element":"span"},{"text":"(Lemma 17 of ","element":"span"},{"href":"#id-47","referenceIndex":76,"text":"76","element":"a"},{"text":")","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":28.8},"width":667.28,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-25.png","element":"img","alt":" ℓ(z, y) = �Kk=0 1{y = k} · log� 1[σ(z)]k","inline":true}],[{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K","element":"span"},{"style":{"fontStyle":"italic"},"text":"-dimensional vector. 
Define ","element":"span"},{"style":{"height":17.78},"width":450.08,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-26.png","element":"img","alt":" zµ ≜ σ+ (smoothµ(σ(z)))","inline":true},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":16.8},"width":464.92,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-27.png","element":"img","alt":" smoothµ(p) = (1 − µ)p +","inline":true},{"style":{"height":16},"width":203.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-28.png","element":"img","alt":"µ1/(K + 1)","inline":true},{"style":{"fontStyle":"italic"},"text":". Then, for ","element":"span"},{"style":{"height":16},"width":341.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-29.png","element":"img","alt":" µ ∈ [0, 1/2], we have","inline":true}],[{"id":"id-152","style":{"width":"100%"},"width":1589,"height":425,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/70-30.png","element":"img"}]]},{"heading":"H Limitations","paragraphs":[[{"text":"We make an assumption about the transition model of MDPs by using the MNL model, which is a specific parametric model. This assumption implies that we assume the realizability of the MNL model. 
It’s worth noting that the realizability assumption is also common in the literature on provable reinforcement learning with function approximation, including works such as [","element":"span"},{"href":"#id-42","referenceIndex":72,"text":"72","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":43,"text":"43","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":73,"text":"73","element":"a"},{"text":", ","element":"span"},{"href":"#id-81","referenceIndex":53,"text":"53","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":22,"text":"22","element":"a"},{"text":", ","element":"span"},{"href":"#id-82","referenceIndex":14,"text":"14","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":9,"text":"9","element":"a"},{"text":", ","element":"span"},{"href":"#id-83","referenceIndex":68,"text":"68","element":"a"},{"text":", ","element":"span"},{"href":"#id-84","referenceIndex":70,"text":"70","element":"a"},{"text":", ","element":"span"},{"href":"#id-85","referenceIndex":33,"text":"33","element":"a"},{"text":", ","element":"span"},{"href":"#id-45","referenceIndex":81,"text":"81","element":"a"},{"text":", ","element":"span"},{"href":"#id-86","referenceIndex":82,"text":"82","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":37,"text":"37","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","referenceIndex":35,"text":"35","element":"a"},{"text":"]. However, we hope that this condition can be relaxed in future work.","element":"span"}]]},{"heading":"NeurIPS Paper Checklist","paragraphs":[[{"text":"1. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Claims","element":"span"}],[{"text":"Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: The main claim made in the abstract is that we propose provably efficient randomized algorithms for MNL-MDPs. In Section ","element":"span"},{"text":"1 ","element":"span"},{"text":"(Introduction), we provide the motivation and main contributions of this paper.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the abstract and introduction do not include the claims made in the paper.","element":"span"}],[{"text":"• The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.","element":"span"}],[{"text":"• The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.","element":"span"}],[{"text":"• It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.","element":"span"}],[{"text":"2. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Limitations","element":"span"}],[{"text":"Question: Does the paper discuss the limitations of the work performed by the authors? Answer: ","element":"span"},{"text":"[Yes] ","element":"span"},{"text":"Justification: We discuss the limitations of this work in Appendix ","element":"span"},{"text":"H. 
","element":"span"},{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.","element":"span"}],[{"text":"• The authors are encouraged to create a separate \"Limitations\" section in their paper.","element":"span"}],[{"text":"• The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.","element":"span"}],[{"text":"• The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.","element":"span"}],[{"text":"• The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.","element":"span"}],[{"text":"• The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.","element":"span"}],[{"text":"• If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.","element":"span"}],[{"text":"• While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. 
The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.","element":"span"}],[{"text":"3. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Theory Assumptions and Proofs","element":"span"}],[{"text":"Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes] ","element":"span"},{"text":"Justification: We provide the full set of assumptions in Section ","element":"span"},{"href":"#id-194","text":"2.2 ","element":"a"},{"text":"and complete proofs of the main results in Appendices ","element":"span"},{"text":"C ","element":"span"},{"text":"and ","element":"span"},{"text":"D. ","element":"span"},{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include theoretical results.","element":"span"}],[{"text":"• All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.","element":"span"}],[{"text":"• All assumptions should be clearly stated or referenced in the statement of any theorems.","element":"span"}],[{"text":"• The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.","element":"span"}],[{"text":"• Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.","element":"span"}],[{"text":"• Theorems and Lemmas that the proof relies upon should be properly referenced.","element":"span"}],[{"text":"4. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experimental Result Reproducibility","element":"span"}],[{"text":"Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: ","element":"span"},{"text":"[Yes] ","element":"span"},{"text":"Justification: We provide numerical experiments that support our main claims in Section ","element":"span"},{"text":"5 ","element":"span"},{"text":"and detailed information about the experiments in Appendix ","element":"span"},{"text":"F. ","element":"span"},{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include experiments.","element":"span"}],[{"text":"• If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.","element":"span"}],[{"text":"• If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.","element":"span"}],[{"text":"• Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, 
releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.","element":"span"}],[{"text":"• While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.","element":"span"}],[{"text":"(b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.","element":"span"}],[{"text":"(c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).","element":"span"}],[{"text":"(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.","element":"span"}],[{"text":"5. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Open access to data and code","element":"span"}],[{"text":"Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? 
Answer: ","element":"span"},{"text":"[Yes] ","element":"span"},{"text":"Justification: We have attached the data and code with sufficient instructions to reproduce the main experimental results in the supplementary material. Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include experiments requiring code.","element":"span"}],[{"text":"• Please see the NeurIPS code and data submission guidelines (","element":"span"},{"href":"https://nips.cc/public/guides/CodeSubmissionPolicy","style":{"fontFamily":"monospace"},"text":"https://nips.cc/public/guides/CodeSubmissionPolicy","element":"a"},{"text":") for more details.","element":"span"}],[{"text":"• While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).","element":"span"}],[{"text":"• The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (","element":"span"},{"href":"https://nips.cc/public/guides/CodeSubmissionPolicy","style":{"fontFamily":"monospace"},"text":"https://nips.cc/public/guides/CodeSubmissionPolicy","element":"a"},{"text":") for more details.","element":"span"}],[{"text":"• The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.","element":"span"}],[{"text":"• The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. 
If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.","element":"span"}],[{"text":"• At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).","element":"span"}],[{"text":"• Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.","element":"span"}],[{"text":"6. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experimental Setting/Details","element":"span"}],[{"text":"Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: We provide the detailed explanation for the experimental setting in Appendix ","element":"span"},{"text":"F. ","element":"span"},{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include experiments. • The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.","element":"span"}],[{"text":"• The full details can be provided either with the code, in appendix, or as supplemental material.","element":"span"}],[{"text":"7. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experiment Statistical Significance","element":"span"}],[{"text":"Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: ","element":"span"},{"text":"[Yes] ","element":"span"},{"text":"Justification: We report error bars (standard deviation) in our numerical experiment results shown in Section ","element":"span"},{"text":"5. 
","element":"span"},{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include experiments.","element":"span"}],[{"text":"• The authors should answer \"Yes\" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.","element":"span"}],[{"text":"• The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).","element":"span"}],[{"text":"• The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)","element":"span"}],[{"text":"• The assumptions made should be given (e.g., Normally distributed errors). • It should be clear whether the error bar is the standard deviation or the standard error of the mean.","element":"span"}],[{"text":"• It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.","element":"span"}],[{"text":"• For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).","element":"span"}],[{"text":"• If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.","element":"span"}],[{"text":"8. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experiments Compute Resources","element":"span"}],[{"text":"Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: We provide the detailed information on the computer resources used to conduct numerical experiments in Appendix ","element":"span"},{"text":"F. ","element":"span"},{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include experiments. • The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.","element":"span"}],[{"text":"• The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.","element":"span"}],[{"text":"• The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).","element":"span"}],[{"text":"9. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Code Of Ethics","element":"span"}],[{"text":"Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics ","element":"span"},{"href":"https://neurips.cc/public/EthicsGuidelines","style":{"fontFamily":"monospace"},"text":"https://neurips.cc/public/EthicsGuidelines","element":"a"},{"text":"? Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: The research conducted in this paper adheres to the NeurIPS Code of Ethics in all aspects. 
Guidelines:","element":"span"}],[{"text":"• The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.","element":"span"}],[{"text":"• If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.","element":"span"}],[{"text":"• The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).","element":"span"}],[{"text":"10. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Broader Impacts","element":"span"}],[{"text":"Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: There are no negative societal impacts of the work performed because this research focuses on theoretical aspects. Guidelines:","element":"span"}],[{"text":"• The answer NA means that there is no societal impact of the work performed.","element":"span"}],[{"text":"• If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.","element":"span"}],[{"text":"• Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.","element":"span"}],[{"text":"• The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. 
On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.","element":"span"}],[{"text":"• The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.","element":"span"}],[{"text":"• If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).","element":"span"}],[{"text":"11. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Safeguards","element":"span"}],[{"text":"Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: The research conducted in this paper does not pose any such risks. Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper poses no such risks.","element":"span"}],[{"text":"• Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.","element":"span"}],[{"text":"• Datasets that have been scraped from the Internet could pose safety risks. 
The authors should describe how they avoided releasing unsafe images.","element":"span"}],[{"text":"• We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.","element":"span"}],[{"text":"12. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Licenses for existing assets","element":"span"}],[{"text":"Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: This paper does not use any external assets such as code, data, or models. Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not use existing assets. • The authors should cite the original paper that produced the code package or dataset. • The authors should state which version of the asset is used and, if possible, include a URL.","element":"span"}],[{"text":"• The name of the license (e.g., CC-BY 4.0) should be included for each asset. • For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.","element":"span"}],[{"text":"• If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"paperswithcode.com/datasets ","element":"span"},{"text":"has curated licenses for some datasets. 
Their licensing guide can help determine the license of a dataset.","element":"span"}],[{"text":"• For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.","element":"span"}],[{"text":"• If this information is not available online, the authors are encouraged to reach out to the asset’s creators.","element":"span"}],[{"text":"13. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"New Assets","element":"span"}],[{"text":"Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: This paper does not release new assets. Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not release new assets.","element":"span"}],[{"text":"• Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.","element":"span"}],[{"text":"• The paper should discuss whether and how consent was obtained from people whose asset is used.","element":"span"}],[{"text":"• At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.","element":"span"}],[{"text":"14. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Crowdsourcing and Research with Human Subjects","element":"span"}],[{"text":"Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?","element":"span"}],[{"style":{"width":"14%"},"width":227,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/96414/images/77-0.png","element":"img"}],[{"text":"Justification: This paper does not involve crowdsourcing nor research with human subjects. Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.","element":"span"}],[{"text":"• Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.","element":"span"}],[{"text":"• According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.","element":"span"}],[{"text":"15. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects","element":"span"}],[{"text":"Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[NA]","element":"span"}],[{"text":"Justification: This paper does not involve crowdsourcing nor research with human subjects. 
Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.","element":"span"}],[{"text":"• Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.","element":"span"}],[{"text":"• We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.","element":"span"}],[{"text":"• For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]